### Introduction
This notebook is a text analysis of 2019 award data from the National Science Foundation (NSF), which is available at https://www.nsf.gov/awardsearch/download.jsp. NSF provides the data as XML files; these have been transformed into the CSV file used in this notebook.

The goal is to analyze the award titles and award abstracts to understand the types of awards granted in 2019.

This notebook is meant to present the final result of the analysis at a high level.  It does not show the intermediate, debugging, and QA-ing steps.


In [1]:
import string
import nltk 
nltk.download('punkt')
nltk.download('stopwords')
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# To make cell outputs easier to read 
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is rejected by recent pandas versions
get_ipython().ast_node_interactivity = 'all'

[nltk_data] Downloading package punkt to /Users/heather/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/heather/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df = pd.read_csv("data/2015-2019-awards.csv")
df_2019 = df[df.AwardYear == 2019]


#### Thought process
Do a few checks to make sure the data seems reasonable. 

In [3]:
df_2019.shape
df_2019.AwardTitle.dtypes
df_2019.AwardAmt.dtypes
df_2019.Abstract.dtypes
df_2019.dtypes
df_2019[df_2019.columns[:-1]].head(1)
df_2019.Abstract.head(1)


(11754, 17)

dtype('O')

dtype('int64')

dtype('O')

Filename               object 
AwardYear              int64  
AwardTitle             object 
InstName               object 
City                   object 
Zip                    object 
Phone                  float64
Street                 object 
Country                object 
StateName              object 
StateCode              object 
AwardAmt               int64  
InitialAwardAmt        float64
AwardInstrumentType    object 
Directorate            object 
Division               object 
Abstract               object 
dtype: object

| | Filename | AwardYear | AwardTitle | InstName | City | Zip | Phone | Street | Country | StateName | StateCode | AwardAmt | InitialAwardAmt | AwardInstrumentType | Directorate | Division |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25353 | 1900008.xml | 2019 | Collaborative research: Weighted Estimates with Matrix Weights and Non-Homogeneous Harmonic Analysis | Kent State University | KENT | 442420001 | 3306722000.0 | OFFICE OF THE COMPTROLLER | United States | Ohio | OH | 119539 | 194964.0 | Continuing Grant | Direct For Mathematical & Physical Scien | Division Of Mathematical Sciences |


25353    Calderon-Zygmund operators are objects that are largely responsible for our understanding of a number of physical phenomena, from heat transfer to turbulence. Recently, these operators have found application in big data analysis. The classical theory was built by Alberto Calderon and Antoni Zygmund in the early 1950s, and was intrinsically designed to work on smooth objects. However, nature often puzzles us with very irregular medium. Thus, the need arose for a very low regularity Calderon-Zygmund theory, which the three PIs have, in fact, constructed. One possible application of such low regularity theory is that by the action of Calderon-Zygmund operators on a set in a space of a very high dimension, we can conclude that the set itself is nicely structured and can be analyzed. This is a typical big data problem. Our other recent observation is that well-studied problems for such an operator can be dualized to provide new information for analysis on the hypercube - another wi

#### Thought process 
* Use functions to make the code easier to understand 
* Steps for cleaning the Award Title and Abstract
  1. Fill nulls with empty string 
  1. Change all words to lower case 
  1. Remove common words that do not add meaning 
  1. Remove punctuation and non-alpha characters  
  1. Set AwardTitle and Abstract back to the cleaned text
  1. Proceed with finding nGrams and concepts/topics
* Decided not to expand contractions or apply stemming or lemmatization   
* Probably room to make the cleaning code more efficient but it is running quickly for this dataset.  
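
The cleaning steps above can be sketched on a single string without pandas or NLTK. This is only an illustration: `clean_text`, `STOP_WORDS`, and `NOISE` are hypothetical names with deliberately tiny lists, and whitespace splitting stands in for `word_tokenize`. As in the notebook's own functions, the noise characters are replaced before the stop words are filtered out.

```python
# Hypothetical, much-reduced stop-word and noise lists, for illustration only.
STOP_WORDS = {"the", "of", "for", "and", "a", "in", "research", "project"}
NOISE = ["(", ")", ":", ",", ".", "-", "2019"]


def clean_text(text):
    """Apply the cleaning steps to one string (None is treated as empty)."""
    text = (text or "").lower()           # fill nulls, then lower-case
    for token in NOISE:                   # literal replacement, no regex
        text = text.replace(token, " ")
    # Whitespace split is a simplification of NLTK's word_tokenize.
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)


print(clean_text("CAREER: The Design of Data-Driven Systems (2019)"))
```

Replacing noise with a space rather than an empty string keeps hyphenated terms such as "data-driven" as two separate words instead of fusing them.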

In [4]:
# Being very intentional about what is removed.  Decided against regex because it seems more likely to
# have unintended consequences.  Decided to specify exactly what to remove.

stopset=set(stopwords.words("english"))
stopset.update(['also', 'research', "'s", 'project', 'use', 'using', 'used'])
noise = ['(e.g.', '<br/>', '-', '>', '<', '(', ')', '2019', '?', '&', ':', ';', ',', '.']
print(sorted(stopset))

["'s", 'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'also', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'project', 're', 'research', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 

In [5]:
#code to handle cleaning the text

def RemoveNoise(txt_col):
    txt_col = txt_col.fillna("").str.lower()
    for n in noise:
        txt_col = txt_col.str.replace(n, " ", regex=False)
    return txt_col

def GetMeaningfulText(txt):
    tokens = word_tokenize(txt)
    meaningful_words = [w for w in tokens if w not in stopset]
    return ' '.join(meaningful_words)
    
#does the remove noise need to happen cell by cell? 
def UpdateColsRemoveNoise(df):
    df.Abstract = RemoveNoise(df.Abstract)
    df.AwardTitle = RemoveNoise(df.AwardTitle)
    df.AwardTitle = df.AwardTitle.map(GetMeaningfulText)
    df.Abstract = df.Abstract.map(GetMeaningfulText)
    return df

In [6]:
# Code to handle nGram logic

def GetAllWordsInCorpus(ngrams_col):
    word_list = []
    for row in ngrams_col:
        for words in row:
            word_list.append(words)
    return word_list    

# Turning the NLTK fdist into a dataframe, to help get the top results and create 
# the phrases with spaces.  
def GetDataFrameFromFDistTopNgrams(fdist):
    return pd.DataFrame.from_records(data=fdist.most_common(10000),
                                      columns=['NGram', 'Count'])

#should this param be called _col? 
def GetTopPhrases(ngrams_col):
    if ngrams_col.size > 0:
        words = GetAllWordsInCorpus(ngrams_col)
        ngram_dist = nltk.FreqDist(words)
        ngram_df = GetDataFrameFromFDistTopNgrams(ngram_dist)
        # Join each n-gram tuple into a space-separated phrase and keep the top 25
        return ngram_df['NGram'].agg(' '.join).head(25).to_numpy()
    else:
        return None  # `null` is not a Python name and would raise a NameError

def CreateNGramSummaryDf(df_directorate):
    df_summary = pd.DataFrame()
    
    df_directorate['words'] = df_directorate.AwardTitle.apply(lambda row: list(nltk.ngrams (row.split(), 1)))
    df_summary['TitleWord'] = GetTopPhrases(df_directorate.words)
    
    df_directorate['bigrams'] = df_directorate.AwardTitle.apply(lambda row: list(nltk.ngrams (row.split(), 2)))
    df_summary['TitleBigram'] = GetTopPhrases(df_directorate.bigrams)

    df_directorate['trigrams'] = df_directorate.AwardTitle.apply(lambda row: list(nltk.ngrams (row.split(), 3)))
    df_summary['TitleTrigram'] = GetTopPhrases(df_directorate.trigrams)
    
    df_directorate['abstractWords'] = df_directorate.Abstract.apply(lambda row: list(nltk.ngrams (row.split(), 1)))
    df_summary['AbstractWord'] = GetTopPhrases(df_directorate.abstractWords)
    
    df_directorate['abstractBigrams'] = df_directorate.Abstract.apply(lambda row: list(nltk.ngrams (row.split(), 2)))
    df_summary['AbstractBigram'] = GetTopPhrases(df_directorate.abstractBigrams)
    
    df_directorate['abstractTrigrams'] = df_directorate.Abstract.apply(lambda row: list(nltk.ngrams (row.split(), 3)))
    df_summary['AbstractTrigram'] = GetTopPhrases(df_directorate.abstractTrigrams)
    
    return df_summary
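
To see what the n-gram functions above compute, here is a minimal standard-library sketch that counts bigrams over three made-up titles. The helper names `ngrams` and `top_phrases` and the sample titles are assumptions for illustration; `collections.Counter` stands in for `nltk.FreqDist`, which offers a similar `most_common` method.

```python
from collections import Counter


def ngrams(words, n):
    """Return the list of n-word tuples found in a list of words."""
    return list(zip(*(words[i:] for i in range(n))))


def top_phrases(texts, n, k=3):
    """Count n-grams over all texts and return the k most common as strings."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.split(), n))
    return [(" ".join(gram), count) for gram, count in counts.most_common(k)]


# Made-up cleaned titles, not taken from the dataset.
titles = [
    "cns core small machine learning",
    "satc core small secure systems",
    "cns core small edge computing",
]
print(top_phrases(titles, 2, k=2))
```

Because n-grams are built per title, a phrase never spans the boundary between two awards.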


#### Explanation of choices for TfidfVectorizer parameters

* **use_idf** controls how the frequency per word is calculated, using either the:
   1. Raw counts OR 
   2. Counts when taking into consideration the importance of the word (so that common words like 'the' will be less important) 

Decision: Since this analysis removes stop words before vectorizing, setting use_idf to True or False yields the same concepts. 
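
For intuition about what use_idf changes, the sketch below computes a smoothed inverse document frequency by hand, using the formula ln((1 + n) / (1 + df)) + 1, which is assumed here to match scikit-learn's default smoothing. The `idf` helper and the three sample documents are hypothetical.

```python
import math


def idf(term, docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + n) / (1 + df)) + 1


# Made-up cleaned abstracts, not taken from the dataset.
docs = [
    "data learning systems",
    "data science students",
    "data materials software",
]
for term in ["data", "learning"]:
    print(term, round(idf(term, docs), 3))
```

A term that appears in every document gets the minimum weight of 1, while rarer terms are scaled up. With use_idf=False this scaling is skipped and only the normalized term counts remain.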


* **ngram_range** controls the number of words in the topics. 

Decision: Since the nGram exploration in this analysis considered words, bigrams, and trigrams, ngram_range was set to match with (1,3)
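
As a rough illustration of what ngram_range=(1,3) means, the following sketch lists every feature that a three-word text would contribute. The function `ngram_features` is a hypothetical stand-in and does not reproduce the vectorizer's own tokenization rules.

```python
def ngram_features(text, ngram_range=(1, 3)):
    """List every word n-gram in text for n from min_n to max_n inclusive."""
    words = text.split()
    min_n, max_n = ngram_range
    features = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))
    return features


print(ngram_features("harnessing data revolution"))
```

This is why a single concept can list "data", "data revolution", and "harnessing data revolution" side by side: each is a separate feature.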

#### Explanation of choices for TruncatedSVD parameters

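* **n_iter**=100 is set well above the default of 5 so the randomized solver has ample iterations. (The documentation's recommendation of 100 for LSA refers to **n_components**, not n_iter.)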
* **n_components**=5 seems like a reasonable number of concepts to investigate
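
To make the idea of a "concept" more concrete, the sketch below approximates the first component of a tiny term-count matrix and ranks the terms by weight. It uses plain power iteration, not the randomized algorithm inside TruncatedSVD, and the matrix, the term list, and the `top_component` helper are all made up for illustration.

```python
import math


def top_component(matrix, n_iter=100):
    """Approximate the first right singular vector by power iteration."""
    n_terms = len(matrix[0])
    v = [1.0] * n_terms
    for _ in range(n_iter):
        # u = X v: one score per document
        u = [sum(x * w for x, w in zip(row, v)) for row in matrix]
        # v = X^T u: one weight per term
        v = [sum(row[j] * u_i for row, u_i in zip(matrix, u))
             for j in range(n_terms)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


terms = ["data", "learning", "machine", "students", "travel"]
# Hypothetical term counts: one row per abstract, one column per term.
counts = [
    [2, 1, 1, 0, 0],
    [1, 2, 2, 0, 0],
    [1, 0, 0, 2, 1],
]
weights = top_component(counts)
ranked = sorted(zip(terms, weights), key=lambda t: t[1], reverse=True)
print([term for term, _ in ranked[:3]])
```

Each further component captures the strongest remaining pattern after the earlier ones are accounted for, which is why n_components controls how many concepts are reported.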

In [7]:
def GetConceptsWithTopics(textAsList):
    # Recent scikit-learn versions require stop_words to be a list rather than a set
    vectorizer = TfidfVectorizer(stop_words=list(stopset), use_idf=False, ngram_range=(1,3))
    X = vectorizer.fit_transform(textAsList)
    lsa = TruncatedSVD(n_components=5, n_iter=100)
    lsa.fit(X)
    terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    
    all_concepts=""
    for i, comp in enumerate(lsa.components_):
        termsInComp = zip(terms, comp)
        sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)[:10]
        conceptWithTopics = ("Concept %d: " % i)
        conceptWithTopics += ', '.join(str(t[0]) for t in sortedTerms)
        all_concepts += conceptWithTopics
        all_concepts += "\n\r"
    return all_concepts
    

In [8]:
# Decide which Directorates to investigate.  Can investigate many or one.  
# But exclude the ones with incomplete data

#directorates = df_2019.Directorate.unique()
directorates = ['Direct For Computer & Info Scie & Enginr']

to_exclude = ['nan', 'Office Of Information & Resource Mgmt', \
              'Office of Budget, Finance, & Award Management', \
             'National Coordination Office', \
             'Natl Nanotechnology Coordinating Office']

In [10]:
for d in [d for d in directorates if str(d) not in to_exclude]:
    df_dir = df_2019[df_2019.Directorate==d].copy(deep=True)
    dir_mean = df_dir.AwardAmt.mean()
    df_dir['IsAboveMeanAwardAmt'] = (df_dir.AwardAmt > dir_mean)  
    
    print (d)
    print(" ")
    
    df_dir = UpdateColsRemoveNoise(df_dir)
    
    df_above = df_dir[df_dir.IsAboveMeanAwardAmt == True].copy(deep=True)
    df_overview_above = CreateNGramSummaryDf(df_above) 
    df_overview_above
    concepts_above = GetConceptsWithTopics(df_above.Abstract.values.tolist())
    print ("Directorate Award Amount mean: ", '${:,.2f}'.format(dir_mean))
    print("Number of Above Average awards: " + str(len(df_above)))
    print(concepts_above) 
    
    df_below = df_dir[df_dir.IsAboveMeanAwardAmt == False].copy(deep=True)
    df_overview_below = CreateNGramSummaryDf(df_below) 
    df_overview_below
    concepts_below = GetConceptsWithTopics(df_below.Abstract.values.tolist())
    print("Number of At or Below Average awards: " + str(len(df_below)))
    print(concepts_below) 


Direct For Computer & Info Scie & Enginr
 


| | TitleWord | TitleBigram | TitleTrigram | AbstractWord | AbstractBigram | AbstractTrigram |
|---|---|---|---|---|---|---|
| 0 | small | core small | cns core small | data | intellectual merit | award reflects nsf |
| 1 | collaborative | medium collaborative | satc core medium | learning | broader impacts | reflects nsf statutory |
| 2 | data | cns core | satc core small | new | support evaluation | nsf statutory mission |
| 3 | learning | satc core | core medium collaborative | support | award reflects | statutory mission deemed |
| 4 | core | shf small | cps medium collaborative | science | reflects nsf | mission deemed worthy |
| 5 | medium | machine learning | cyber physical systems | systems | nsf statutory | deemed worthy support |
| 6 | systems | core medium | cns core medium | nsf | statutory mission | worthy support evaluation |
| 7 | shf | ri small | shf medium collaborative | foundation | mission deemed | support evaluation foundation |
| 8 | science | cif small | oac core small | broader | deemed worthy | evaluation foundation intellectual |
| 9 | satc | chs small | nri int collab | impacts | worthy support | foundation intellectual merit |


Directorate Award Amount mean:  $383,801.41
Number of Above Average awards: 778
Concept 0: data, learning, new, support, science, systems, nsf, foundation, broader, impacts
Concept 1: data, science, data science, big, data revolution, national, revolution, harnessing data, harnessing data revolution, big data
Concept 2: learning, machine, machine learning, models, algorithms, deep, neural, deep learning, networks, methods
Concept 3: science, students, learning, data science, materials, university, computer, computer science, engineering, community
Concept 4: systems, materials, new, models, methods, physical, software, science, engineering, manufacturing



| | TitleWord | TitleBigram | TitleTrigram | AbstractWord | AbstractBigram | AbstractTrigram |
|---|---|---|---|---|---|---|
| 0 | collaborative | small collaborative | nsf student travel | data | broader impacts | intellectual merit broader |
| 1 | small | medium collaborative | student travel grant | support | intellectual merit | award reflects nsf |
| 2 | data | student travel | core small collaborative | new | merit broader | reflects nsf statutory |
| 3 | career | travel grant | cns core small | students | award reflects | nsf statutory mission |
| 4 | learning | nsf student | shf small collaborative | learning | reflects nsf | statutory mission deemed |
| 5 | medium | cns core | core medium collaborative | systems | nsf statutory | mission deemed worthy |
| 6 | student | core small | iii small collaborative | science | statutory mission | deemed worthy support |
| 7 | systems | machine learning | af small collaborative | nsf | mission deemed | worthy support evaluation |
| 8 | eager | shf small | cns core medium | award | deemed worthy | support evaluation foundation |
| 9 | computing | iii small | ri small collaborative | broader | worthy support | evaluation foundation intellectual |


Number of At or Below Average awards: 1330
Concept 0: data, support, students, new, learning, systems, science, award, nsf, broader
Concept 1: data, science, learning, data science, machine, machine learning, models, big, algorithms, data revolution
Concept 2: data, students, conference, science, data science, researchers, student, travel, support, workshop
Concept 3: learning, machine, machine learning, students, science, models, conference, algorithms, student, deep
Concept 4: science, materials, computing, software, new, computer, computer science, computational, development, scientific

