# Natural Language Processing Application

This notebook analyzes the similarities and differences among the 2020 special reports published by NCSES. The analysis may help avoid duplicated work and identify opportunities for collaboration. The notebook walks through a necessary step of any data science project: data cleaning. Data cleaning is time-consuming and tedious, yet it is essential. Keep in mind "garbage in, garbage out": feeding improperly processed data into a model produces meaningless results.

## 1) Data Pre-processing

1.1. Getting the data - Scraping data from NCSES website 

1.2. Cleaning the data - Apply text preprocessing techniques

1.3. Organizing the data - Organize the cleaned data into a way that is easy to input into other algorithms

The output of this stage will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format
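
To make the two formats concrete, here is a minimal sketch using only the standard library. The two documents in `toy_corpus` are invented for illustration and are not NCSES text; the names `toy_corpus`, `vocabulary`, and `dtm` are assumptions of this example, not objects used elsewhere in the notebook.

```python
from collections import Counter

# Hypothetical two-document corpus (illustrative text, not NCSES data)
toy_corpus = {
    "doc_a": "science and engineering indicators",
    "doc_b": "engineering workforce and science education",
}

# Vocabulary: the sorted set of unique terms across all documents
vocabulary = sorted({word for text in toy_corpus.values() for word in text.split()})

# Document-term matrix: one row per document, one column per vocabulary term
dtm = {}
for name, text in toy_corpus.items():
    counts = Counter(text.split())
    dtm[name] = [counts[term] for term in vocabulary]

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row has the same length as the vocabulary, so documents can be compared column by column regardless of their original length.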

In [1]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from https://www.nsf.gov/statistics/publication-index.cfm
def url_to_transcript(url):
    '''Returns the text of every <p> element on the page at url (written for pages from https://www.nsf.gov/statistics/publication-index.cfm).'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    # Collect the text of each paragraph tag in document order
    text = [p.text for p in soup.find_all('p')]
    print(url)
    return text

# URLs of executive summaries of special reports
urls = ['https://ncses.nsf.gov/pubs/nsb20201/executive-summary/',
        'https://ncses.nsf.gov/pubs/nsb20202/executive-summary/',
        'https://ncses.nsf.gov/pubs/nsb20203/executive-summary/',
        'https://ncses.nsf.gov/pubs/nsb20204/executive-summary/',
        'https://ncses.nsf.gov/pubs/nsb20205/executive-summary/',
        'https://ncses.nsf.gov/pubs/nsb20206/executive-summary/',
        'https://ncses.nsf.gov/pubs/nsb20207/executive-summary/']

# Special report names
SR = ['NSB-2020-1', 'NSB-2020-2', 'NSB-2020-3', 'NSB-2020-4', 'NSB-2020-5', 'NSB-2020-6', 'NSB-2020-7']



In [2]:
SR

['NSB-2020-1',
 'NSB-2020-2',
 'NSB-2020-3',
 'NSB-2020-4',
 'NSB-2020-5',
 'NSB-2020-6',
 'NSB-2020-7']
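
To show what the paragraph extraction in `url_to_transcript` amounts to, the sketch below pulls the text of each `<p>` element from an inline HTML string. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so it runs without network access or third-party packages; the class `ParagraphExtractor` and the sample HTML are hypothetical and are not part of the notebook.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text content of every <p> element (illustrative stand-in)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            self.paragraphs.append("".join(self._buffer))

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> is kept as well
        if self._in_paragraph:
            self._buffer.append(data)


# Hypothetical page content
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div><p>Second paragraph.</p></div>"
    "</body></html>"
)

extractor = ParagraphExtractor()
extractor.feed(sample_html)
extractor.close()
print(extractor.paragraphs)
```

As in the notebook's function, headings and other non-paragraph elements are ignored, and the result is a list with one string per paragraph.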

In [3]:
# URLs of two NSF award abstract pages (these replace the executive-summary URLs defined earlier)
urls = ['https://www.nsf.gov/awardsearch/showAward?AWD_ID=1849735&HistoricalAwards=false',
        'https://www.nsf.gov/awardsearch/showAward?AWD_ID=2050833&HistoricalAwards=false']

# Labels used as dictionary keys and file names for the two pages
SR = ['NSB-2020-1', 'NSB-2020-2']

urls

['https://www.nsf.gov/awardsearch/showAward?AWD_ID=1849735&HistoricalAwards=false',
 'https://www.nsf.gov/awardsearch/showAward?AWD_ID=2050833&HistoricalAwards=false']

In [4]:
# Request the page text for each URL (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]


https://www.nsf.gov/awardsearch/showAward?AWD_ID=1849735&HistoricalAwards=false
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2050833&HistoricalAwards=false


In [5]:
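# Pickle each page's paragraphs for later use

# Make a new directory to hold the pickled files
# (the files use a .txt extension but contain binary pickle data, not plain text)
!mkdir transcripts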

for i, c in enumerate(SR):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

A subdirectory or file transcripts already exists.


In [6]:
# Load the pickled files so the scraping step does not have to be rerun
data = {}
for i, c in enumerate(SR):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
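
The save-and-load pattern above can be sketched as a self-contained round trip. This version assumes `os.makedirs(..., exist_ok=True)` in place of `!mkdir`, so that rerunning the cell does not report that the directory already exists, and it writes into a temporary directory so nothing is left on disk. The `sample` dictionary and the `.pkl` extension are choices made for this example only.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the scraped paragraphs, keyed by report label
sample = {"REPORT-A": ["first paragraph", "second paragraph"]}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "transcripts")
    os.makedirs(out_dir, exist_ok=True)  # creates the directory
    os.makedirs(out_dir, exist_ok=True)  # a second call is harmless

    # Save one pickle file per label
    for name, paragraphs in sample.items():
        with open(os.path.join(out_dir, name + ".pkl"), "wb") as fh:
            pickle.dump(paragraphs, fh)

    # Load the files back into a dictionary keyed by label
    restored = {}
    for name in sample:
        with open(os.path.join(out_dir, name + ".pkl"), "rb") as fh:
            restored[name] = pickle.load(fh)

print(restored == sample)
```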

In [7]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['NSB-2020-1', 'NSB-2020-2'])
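
Each value in `data` is a list of paragraph strings, so the cleaning step (1.2) has to join them and normalize the text. The notebook does not show its cleaning code, so the following is only a minimal sketch of common preprocessing under assumed choices: lowercasing, removing punctuation, dropping tokens that contain digits, and collapsing whitespace. The helper names `combine_paragraphs` and `clean_text` and the sample input are hypothetical.

```python
import re
import string


def combine_paragraphs(paragraphs):
    """Join a list of paragraph strings into one string, skipping empty ones."""
    return " ".join(p.strip() for p in paragraphs if p.strip())


def clean_text(text):
    """Lowercase, strip punctuation, drop tokens containing digits, collapse whitespace."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Hypothetical raw paragraphs shaped like the scraper's output
raw = ["\n", "\nABSTRACT\n\nThis project will advance STEM efforts in 2020.", "It serves 40 students!"]

cleaned = clean_text(combine_paragraphs(raw))
print(cleaned)
```

Which steps are appropriate depends on the analysis; for example, removing digits would be the wrong choice if years or counts matter for comparing reports.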

In [8]:
# More checks
data['NSB-2020-1'][:2]

['\n',
 "\nABSTRACT\n\nThis project will advance efforts of the Innovative Technology Experiences for Students and Teachers (ITEST) program by preparing high-achieving high school students for advanced science and math courses and eventually courses in engineering. ITEST seeks to better understand and promote practices that increase students' motivations and capacities to pursue careers in fields of science, technology, engineering, and mathematics (STEM). To be globally competitive, the US must optimize the available workforce to include women in STEM. There are two fundamental issues that need to be addressed in engaging women in engineering: motivation and preparedness to pursue STEM careers. The Academy of Natural Sciences will partner with Drexel University: College of Engineering and School of Education, and local engineering professionals, through the Engineering WINS (EngWINS) program. EngWINS will develop the capabilities of working engineers and faculty to serve as mentors in

In [None]:
# Importing libraries for analysis
import pandas as pd
import pickle
import os
from gensim import matutils, models
import scipy.sparse

In [None]:
# Reading the document-term matrix
data = pd.read_pickle('../../Users/muluken/WorkingFiles/GitHUB_Notebooks/NLTK/nlp-in-python-tutorial/dtm_stop.pkl')
data

In [None]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

In [None]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
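
`Sparse2Corpus` by default treats each column of the sparse matrix as one document, which is why the document-term matrix is transposed first. Conceptually, each document becomes a list of `(term_id, count)` pairs with zero counts left out. The sketch below mimics that format in plain Python without calling gensim; the matrix values and the helper `columns_to_bow` are hypothetical.

```python
# Hypothetical term-document counts: rows are terms, columns are documents
terms = ["education", "engineering", "science"]
term_document = [
    [0, 2],  # education
    [1, 1],  # engineering
    [3, 0],  # science
]


def columns_to_bow(matrix):
    """Convert each column into a list of (term_id, count) pairs, skipping zeros."""
    n_docs = len(matrix[0])
    return [
        [(term_id, row[doc]) for term_id, row in enumerate(matrix) if row[doc] != 0]
        for doc in range(n_docs)
    ]


bow_corpus = columns_to_bow(term_document)
for doc_id, bow in enumerate(bow_corpus):
    print(doc_id, [(terms[term_id], count) for term_id, count in bow])
```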

In [None]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
with open("../../Users/muluken/WorkingFiles/GitHUB_Notebooks/NLTK/nlp-in-python-tutorial/cv_stop.pkl", "rb") as f:
    cv = pickle.load(f)
#id2word = dict((v, k) for k, v in cv.vocabulary_.items())
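
The commented-out line inverts the vectorizer's vocabulary, which maps each term to its column index, into the index-to-term mapping that gensim models accept as `id2word`. A minimal sketch with a hypothetical vocabulary, assuming `cv` is a fitted scikit-learn `CountVectorizer` whose `vocabulary_` has this shape:

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_ (term -> column index)
vocabulary = {"science": 2, "engineering": 1, "education": 0}

# Invert it so each column index maps back to its term
id2word = {index: term for term, index in vocabulary.items()}

print(id2word)
```

The inversion is lossless because every term has a unique column index.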