# NLP
Find your favorite news source and grab the article text. 

1. Show the most common words in the article.
2. Show the most common words under a part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4}).
3. Find a subject/object relationship through the dependency parser in any sentence.
4. Show the most common Entities and their types. 
5. Find Entities and their dependency (hint: entity.root.head).
6. Find the most similar words in the article.

Note: Yes, the notebook from the video is not provided; I leave it to you to make your own :) It's your final assignment for the semester. Enjoy!

In [None]:
# !pip3 install spacy
# !python3 -m spacy download en_core_web_sm
# !python3 -m spacy download en_core_web_md

In [1]:
import spacy
import pandas as pd
import numpy as np
from collections import Counter

In [2]:
nlp = spacy.load('en_core_web_md')

#### <font color='purple'> Reading in data source</font>
#### <font color='purple'> I am narrowing this data down so that only a list containing the abstract text for each proposal is left.</font>

In [3]:
award_data = pd.read_excel("NSF_AI_sbirAwards.xlsx")
abstracts = award_data["Abstract"]

#### <font color='purple'> Iterating through each abstract in the list and running the abstract through the NLP</font>

<font color='purple'>Since there were 78 awards in my data, this will output a list of 78 doc objects (78 separately processed text blocks in a list). </font>

In [4]:
# Run every abstract through the spaCy pipeline; each result is a Doc object
abstracts_processed = [nlp(abstracts[i]) for i in abstracts.index]

### <font color='red'>1. Finding the most common words</font>

<font color='purple'> I will first iterate through each abstract and extract the 10 most common words in each abstract. I will keep a running list of each abstract's most common words. </font>

In [5]:
most_common_words = []

for doc in abstracts_processed:
    # Lemmatize and drop stop words and punctuation before counting
    words = [token.lemma_ for token in doc
             if not token.is_stop and not token.is_punct]

    # Keep the 10 most frequent lemmas as (word, count) pairs
    most_common_words.append(Counter(words).most_common(10))

<font color='purple'>Since I want the most common words overall (across all abstracts), I need to iterate through the 10 most common words in each of the 78 abstracts. I will extract only the words themselves (not their counts) and add them to a running list. Having the most common words from all abstracts in a single list lets me see which words are common across all abstracts, rather than in one abstract at a time.</font>

In [6]:
most_common_words_long = []

# Collect only the word from every (word, count) pair of every abstract
for common10 in most_common_words:
    for word, _count in common10:
        most_common_words_long.append(word)
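
<font color='purple'>The flattening step can also be written with `itertools.chain`. This is only a minimal sketch: `per_doc_top` is a hypothetical stand-in for `most_common_words`, and the toy words and counts are made up for illustration.</font>

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for most_common_words: one list of (word, count)
# pairs per abstract, in the shape returned by Counter.most_common().
per_doc_top = [
    [("model", 5), ("data", 3)],
    [("model", 4), ("system", 2)],
]

# Keep only the word from every pair, across all abstracts.
flat_words = [word for word, _count in chain.from_iterable(per_doc_top)]

# Each word is counted once per abstract in which it made the top list.
doc_frequency = Counter(flat_words)
print(doc_frequency.most_common(1))
```

<font color='purple'>Counting this way measures in how many abstracts a word ranks among the top words, not how often it occurs in total.</font>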
            
                    

In [7]:
most_common_words_long

['project',
 'weight',
 'american',
 '\n',
 'adult',
 'accord',
 'loss',
 'food',
 'diet',
 'personalized',
 'project',
 'language',
 'learning',
 'dialog',
 'conversational',
 'AI',
 'system',
 '\n',
 'impact',
 'offer',
 'mosquito',
 'surveillance',
 'insect',
 'project',
 'control',
 'reduce',
 '\n',
 'support',
 'vector',
 'improve',
 'deformation',
 'insar',
 'impact',
 'surface',
 'processing',
 '\n',
 'time',
 'broad',
 'project',
 'change',
 'wind',
 'model',
 'energy',
 'pressure',
 'sensor',
 'farm',
 'ML',
 'project',
 'increase',
 'technology',
 'language',
 'sign',
 'project',
 '\n',
 'domain',
 'improve',
 'individual',
 'hh',
 'technology',
 'method',
 'career',
 'STEM',
 'student',
 'field',
 'school',
 'help',
 '\n',
 'impact',
 'project',
 'provide',
 'cobot',
 'task',
 'project',
 'use',
 'human',
 '\n',
 'guide',
 'vision',
 'processing',
 'learn',
 'project',
 'student',
 'platform',
 '\n',
 'impact',
 'workforce',
 'create',
 'environment',
 'technology',
 'collab

<font color='purple'> Finally, to find the most common words across all 78 abstracts, I will find the most common words in the full list (most_common_words_long). </font>

<font color='purple'>In the code, I am also removing certain words from this list, since they belong to the boilerplate language (template_txt) used in all of the abstracts.</font>

In [10]:
template_txt = ['project', 'broader', 'impact', 'Small', 'Business', 'Innovation', 
                '\n', 'AI', 'SBIR','STTR', 'Phase', 'Research', 'propose', 'broad']

for i in range(len(template_txt)):
    remove_word = template_txt[i]
    while remove_word in most_common_words_long:
        most_common_words_long.remove(remove_word)
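
<font color='purple'>The same filtering can be done in one pass with a set and a list comprehension, which avoids rescanning the list for every template word. A minimal sketch, assuming the toy `words` list stands in for `most_common_words_long` and `template` for `template_txt`:</font>

```python
# Toy stand-ins for most_common_words_long and template_txt.
words = ["project", "model", "\n", "system", "project", "model"]
template = {"project", "\n"}

# Keep every word that is not in the template set (matching is case-sensitive).
filtered = [word for word in words if word not in template]
print(filtered)
```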



In [11]:
Counter(most_common_words_long).most_common(20)

[('model', 10),
 ('learning', 9),
 ('system', 9),
 ('datum', 9),
 ('result', 8),
 ('technology', 7),
 ('time', 6),
 ('provide', 6),
 ('platform', 6),
 ('reduce', 5),
 ('learn', 5),
 ('health', 5),
 ('language', 4),
 ('control', 4),
 ('increase', 4),
 ('student', 4),
 ('video', 4),
 ('commercial', 4),
 ('enable', 4),
 ('information', 4)]

### <font color='red'>2. Finding the most common nouns, adjectives, and verbs.</font>

<font color='purple'> For this iteration, I don't care as much about granularity. I am going to iterate through every abstract and pick out all verbs, nouns, and adjectives. I will then narrow each list down to the most common of each. </font>


In [12]:
all_nouns = []
all_adj = []
all_verbs = []


for doc in abstracts_processed:
    # Collect lemmas by coarse part-of-speech tag for this abstract
    nouns = [token.lemma_ for token in doc if token.pos_ == "NOUN"]
    adj = [token.lemma_ for token in doc if token.pos_ == "ADJ"]
    verbs = [token.lemma_ for token in doc if token.pos_ == "VERB"]

    # One list per abstract; the next cell flattens them
    all_nouns.append(nouns)
    all_adj.append(adj)
    all_verbs.append(verbs)
    
    
    

In [13]:
all_nouns = list(np.concatenate(all_nouns))
all_adj = list(np.concatenate(all_adj))
all_verbs = list(np.concatenate(all_verbs))
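
<font color='purple'>To get the `NOUN: {...}` shape from the assignment text for any tag, the counts can be grouped in a dictionary of counters. A minimal sketch, assuming the hypothetical `tagged` pairs stand in for `(token.lemma_, token.pos_)` taken from the processed abstracts:</font>

```python
from collections import Counter, defaultdict

# Hypothetical (lemma, part-of-speech) pairs; real ones would come from
# (token.lemma_, token.pos_) for each token in each processed abstract.
tagged = [
    ("model", "NOUN"),
    ("use", "VERB"),
    ("model", "NOUN"),
    ("new", "ADJ"),
    ("datum", "NOUN"),
]

# One Counter per part-of-speech tag, created on first use.
counts_by_pos = defaultdict(Counter)
for lemma, pos in tagged:
    counts_by_pos[pos][lemma] += 1

for pos in ("NOUN", "VERB", "ADJ"):
    print(pos, dict(counts_by_pos[pos].most_common(2)))
```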

In [14]:
template_txt = ['project', 'broader', 'impact', 'Small', 'Business', 'Innovation', 
                '\n', 'AI', 'SBIR','STTR', 'Phase', 'Research', 'merit', 'award', 'criterion',
                'broad', 'intellectual', '-']

# Drop every template word from each list in a single pass
template_set = set(template_txt)
all_nouns = [word for word in all_nouns if word not in template_set]
all_adj = [word for word in all_adj if word not in template_set]
all_verbs = [word for word in all_verbs if word not in template_set]


In [15]:
Counter(all_nouns).most_common(30)

[('datum', 101),
 ('model', 89),
 ('system', 81),
 ('learning', 81),
 ('support', 63),
 ('evaluation', 63),
 ('technology', 63),
 ('time', 59),
 ('mission', 59),
 ('review', 57),
 ('machine', 54),
 ('cost', 47),
 ('platform', 47),
 ('algorithm', 46),
 ('intelligence', 42),
 ('language', 42),
 ('student', 36),
 ('health', 35),
 ('research', 35),
 ('method', 34),
 ('application', 32),
 ('development', 32),
 ('information', 30),
 ('user', 29),
 ('training', 29),
 ('solution', 28),
 ('control', 26),
 ('video', 26),
 ('potential', 25),
 ('level', 25)]

In [16]:
Counter(all_verbs).most_common(30)

[('use', 141),
 ('develop', 64),
 ('reflect', 58),
 ('improve', 58),
 ('propose', 57),
 ('deem', 57),
 ('reduce', 55),
 ('provide', 54),
 ('learn', 49),
 ('base', 49),
 ('enable', 43),
 ('create', 40),
 ('increase', 40),
 ('have', 32),
 ('make', 31),
 ('include', 29),
 ('generate', 28),
 ('help', 28),
 ('require', 26),
 ('allow', 25),
 ('identify', 24),
 ('build', 23),
 ('result', 21),
 ('aim', 20),
 ('automate', 20),
 ('advance', 18),
 ('lead', 18),
 ('address', 18),
 ('train', 17),
 ('drive', 17)]

In [17]:
Counter(all_adj).most_common(30)

[('statutory', 57),
 ('worthy', 57),
 ('artificial', 41),
 ('new', 41),
 ('high', 36),
 ('commercial', 33),
 ('such', 29),
 ('real', 29),
 ('other', 27),
 ('current', 23),
 ('accurate', 23),
 ('human', 22),
 ('technical', 22),
 ('deep', 21),
 ('large', 18),
 ('novel', 17),
 ('advanced', 17),
 ('neural', 15),
 ('effective', 14),
 ('small', 14),
 ('low', 13),
 ('multiple', 13),
 ('many', 13),
 ('medical', 12),
 ('different', 12),
 ('social', 12),
 ('robust', 12),
 ('available', 12),
 ('reliable', 12),
 ('critical', 12)]

### <font color='red'>3. Find a subject/object relationship through the dependency parser in any sentence.</font>


In [18]:
## just picking a random abstract to pick a sentence from
abstracts_processed[39]

The broader impact of this Small Business Innovation Research (SBIR) Phase I project will result from creating a unique identification system using artificial intelligence (AI)-based facial recognition for horses and other animals that require vaccinations for birth control and disease inoculation. Wild horses and other wildlife that require remote vaccinations need to be identified so that populations are not over/under vaccinated. The means to vaccinate either manually or using remote technology exists, but most current methods are expensive, inhumane, or inefficient and identification is limited to photographs, sketches, memory, or RFID microchips. Federal agencies currently spend well over one hundred million dollars to deal with the problem. The commercial opportunity for a facial recognition system along with remote vaccination in this country and abroad is substantial. Wildlife managers will be able to relieve unhealthy overcrowding and allow livestock and domestic animals to co

In [19]:
sentence = "This project will develop an artificial intelligence (AI) identification system for horses using facial recognition technology and couple this with remote automated vaccination at feeding stations to ensure wild horses are correctly vaccinated for birth control and inoculated against disease."

In [20]:
processed_sent = nlp(sentence)

In [21]:
for token in processed_sent:
    if token.dep_ in ("nsubj", "dobj", "pobj"):
        print(
        f"""
TOKEN: {token.text}
=====
{token.tag_ = }
{token.head.text = }
{token.dep_ = }"""
     )
    


TOKEN: project
=====
token.tag_ = 'NN'
token.head.text = 'develop'
token.dep_ = 'nsubj'

TOKEN: system
=====
token.tag_ = 'NN'
token.head.text = 'develop'
token.dep_ = 'dobj'

TOKEN: horses
=====
token.tag_ = 'NNS'
token.head.text = 'for'
token.dep_ = 'pobj'

TOKEN: technology
=====
token.tag_ = 'NN'
token.head.text = 'using'
token.dep_ = 'dobj'

TOKEN: this
=====
token.tag_ = 'DT'
token.head.text = 'couple'
token.dep_ = 'dobj'

TOKEN: vaccination
=====
token.tag_ = 'NN'
token.head.text = 'with'
token.dep_ = 'pobj'

TOKEN: stations
=====
token.tag_ = 'NNS'
token.head.text = 'feeding'
token.dep_ = 'dobj'

TOKEN: horses
=====
token.tag_ = 'NNS'
token.head.text = 'ensure'
token.dep_ = 'dobj'

TOKEN: control
=====
token.tag_ = 'NN'
token.head.text = 'for'
token.dep_ = 'pobj'

TOKEN: disease
=====
token.tag_ = 'NN'
token.head.text = 'against'
token.dep_ = 'pobj'
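
<font color='purple'>A subject and an object are related when they share the same head verb. A minimal sketch of that pairing, assuming the `rows` below (copied from the first rows of the output above) stand in for `(token.text, token.dep_, token.head.text)`:</font>

```python
from collections import defaultdict

# (token text, dependency label, head text) triples copied from the output above.
rows = [
    ("project", "nsubj", "develop"),
    ("system", "dobj", "develop"),
    ("horses", "pobj", "for"),
    ("technology", "dobj", "using"),
    ("this", "dobj", "couple"),
]

# Group dependents by their head. Grouping on the head's text is a
# simplification: with real spaCy tokens, group on token.head itself,
# because the same word can appear more than once in a sentence.
by_head = defaultdict(dict)
for text, dep, head in rows:
    by_head[head].setdefault(dep, text)

# A head with both a subject and a direct object gives one relationship.
triples = [
    (deps["nsubj"], head, deps["dobj"])
    for head, deps in by_head.items()
    if "nsubj" in deps and "dobj" in deps
]
print(triples)
```

<font color='purple'>Only `develop` has both a subject and a direct object in these rows, so the sketch yields the single relationship project / develop / system.</font>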


### <font color='red'>4. Show the most common Entities and their types.</font>


In [22]:
all_ents = []
all_labels = []
for i in range(len(abstracts_processed)):
    text = abstracts_processed[i]
    for entity in text.ents:
        all_ents.append(entity)  
        all_labels.append(entity.label_)
    
        
        

In [23]:
all_ents_labs = tuple(zip(all_ents, all_labels))

In [24]:
Counter(all_ents_labs).most_common(20)

[((this Small Business Innovation Research, 'ORG'), 1),
 ((American, 'NORP'), 1),
 ((American, 'NORP'), 1),
 ((12%, 'PERCENT'), 1),
 ((1990, 'DATE'), 1),
 ((over 40%, 'PERCENT'), 1),
 ((today, 'DATE'), 1),
 (($260 billion, 'MONEY'), 1),
 ((2016, 'DATE'), 1),
 ((the Center for Disease Control, 'ORG'), 1),
 ((CDC, 'ORG'), 1),
 ((the National Institute for Health, 'ORG'), 1),
 ((NIH, 'ORG'), 1),
 ((70%, 'PERCENT'), 1),
 ((American, 'NORP'), 1),
 ((2014, 'DATE'), 1),
 ((2013, 'DATE'), 1),
 ((American, 'NORP'), 1),
 (($60 billion, 'MONEY'), 1),
 ((annually, 'DATE'), 1)]
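
<font color='purple'>Every count above is 1 even though `American` appears four times, most likely because a spaCy span is compared by its position in a document rather than by its text. A minimal sketch of counting by text and label instead, assuming the hypothetical `pairs` stand in for `(entity.text, entity.label_)`:</font>

```python
from collections import Counter

# Stand-ins for (entity.text, entity.label_); plain strings compare by value,
# so repeated entities collapse into a single key.
pairs = [
    ("American", "NORP"),
    ("12%", "PERCENT"),
    ("American", "NORP"),
    ("CDC", "ORG"),
    ("NIH", "ORG"),
    ("American", "NORP"),
]

entity_counts = Counter(pairs)
print(entity_counts.most_common(1))
```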

### <font color='red'>5. Find Entities and their dependency</font>


In [25]:
all_deps = []

for entity in (all_ents):
    all_deps.append(entity.root.head)

In [26]:
all_ents_deps = tuple(zip(all_ents, all_deps))

In [27]:
set_ents_deps = list(set(all_ents_deps))

<font color='purple'> In each tuple, the first value is the entity and the second is the head of the entity's root token, i.e. the word the entity attaches to in the dependency parse. </font>

In [28]:
set_ents_deps

[(3, reduce),
 (5, producing),
 (Telegram, LINE),
 (This Small Business Innovation Research, Research),
 ($260 billion, of),
 (SQUID, on),
 (U.S, Africa),
 (PhD, student),
 (this Small Business Innovation Research, Phase),
 (millions, through),
 (American, adults),
 (Foundation, merit),
 (This Small Business Innovation Research, Phase),
 ($87 billion, possess),
 (this Small Business Innovation Research, Phase),
 (this Small Business Innovation Research, project),
 (this Small Business Innovation Research, project),
 (Foundation, merit),
 (first, intervention),
 (NSF, mission),
 (first, is),
 (12%, from),
 (this Small Business Innovation Research, Phase),
 (AI/ML, systems),
 (Foundation, merit),
 (380 terawatt hours, produced),
 (24 hours, window),
 (This Small Business Technology Transfer, project),
 (over 40%, to),
 (This Small Business Innovation Research, Phase),
 (two, components),
 (second, is),
 (x000D, x000D),
 (This Small Business Innovation Research, Research),
 (1.2 billion, 
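
<font color='purple'>The set above still contains repeated rows such as `(Foundation, merit)`, presumably because spans and tokens from different positions are not treated as equal. A minimal sketch of de-duplicating on text instead, assuming the hypothetical `pairs` stand in for `(entity.text, entity.root.head.text)`:</font>

```python
from collections import defaultdict

# Stand-ins for (entity.text, entity.root.head.text), loosely based on the
# rows shown above.
pairs = [
    ("Foundation", "merit"),
    ("NSF", "mission"),
    ("Foundation", "merit"),
    ("first", "intervention"),
    ("first", "is"),
]

# Strings compare by value, so the set removes the repeated row.
unique_pairs = sorted(set(pairs))

# Group the distinct heads under each entity text.
heads_by_entity = defaultdict(set)
for entity_text, head_text in unique_pairs:
    heads_by_entity[entity_text].add(head_text)

print(unique_pairs)
```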

### <font color='red'>6. Find the most similar noun chunks in the article</font>


In [29]:
for noun_chunk in abstracts_processed[0].noun_chunks:
    print(noun_chunk)

The broader impact
(SBIR
the health
welfare
the American public
Obesity
American adults
12%
over 40%
an estimated medical cost
the Center
Disease Control
(CDC
the National Institute
Health
NIH
70%
American adults
American adults
weight loss
US News
World Report
A 2008 American Journal
Preventive Medicine study
those
who
daily food journals
twice as much weight
those
who
existing diet tracking methods
long-term weight loss
A personalized artificial intelligence (AI) chatbot
food
fun
millions
Americans
who
weight
knowledge
spoken dialogue systems._x000D
x000D
(SBIR
knowledge
the field
spoken dialogue systems
several ways
the project
a new research area
AI and spoken dialogue systems
nutrition
conversational agents
factual question answering
tasks
flight booking
an opportunity
big data
relationships
diet
health
this project
a neural generative chatbot model
memory
the benefit
personalized conversational interactions
intelligent agents
that
the history
conversations
personal details
the us

In [76]:
all_nounchunks = []
all_simscores = []
ai = nlp("Artificial Intelligence")
for i in range(len(abstracts_processed)):
    for noun_chunk in abstracts_processed[i].noun_chunks:
        all_nounchunks.append(noun_chunk)
        for simscore in abstracts_processed[i]:
            all_simscores.append(noun_chunk.similarity(abstracts_processed[i]))
            




In [77]:
noun_chunk_simscores = tuple(zip(all_nounchunks, all_simscores))
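
<font color='purple'>`zip` pairs items by position and stops at the shorter input, so this only lines up if the two lists hold exactly one score per noun chunk. The loop above appends a score for every token in the abstract, which would explain why the first rows of the output below share a single value. A minimal sketch that builds each (chunk, score) pair in one step and compares every chunk against one reference: the toy 3-dimensional vectors and the `cosine` helper are illustrative stand-ins for spaCy's word vectors and `similarity()`.</font>

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up stand-in for the vector of the reference phrase "Artificial Intelligence".
reference = [1.0, 0.0, 0.0]

# Made-up stand-ins for noun chunks and their vectors.
chunk_vectors = {
    "machine learning": [0.9, 0.1, 0.0],
    "weight loss": [0.0, 1.0, 0.0],
    "neural network": [0.8, 0.0, 0.6],
}

# One score per chunk, stored together with the chunk so they cannot drift apart.
scored = [(chunk, cosine(vec, reference)) for chunk, vec in chunk_vectors.items()]

# Most similar first.
ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
for chunk, score in ranked:
    print(f"{chunk}: {score:.3f}")
```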

In [78]:
noun_chunk_simscores

((The broader impact, 0.6894289147260094),
 ((SBIR, 0.6894289147260094),
 (the health, 0.6894289147260094),
 (welfare, 0.6894289147260094),
 (the American public, 0.6894289147260094),
 (Obesity, 0.6894289147260094),
 (American adults, 0.6894289147260094),
 (12%, 0.6894289147260094),
 (over 40%, 0.6894289147260094),
 (an estimated medical cost, 0.6894289147260094),
 (the Center, 0.6894289147260094),
 (Disease Control, 0.6894289147260094),
 ((CDC, 0.6894289147260094),
 (the National Institute, 0.6894289147260094),
 (Health, 0.6894289147260094),
 (NIH, 0.6894289147260094),
 (70%, 0.6894289147260094),
 (American adults, 0.6894289147260094),
 (American adults, 0.6894289147260094),
 (weight loss, 0.6894289147260094),
 (US News, 0.6894289147260094),
 (World Report, 0.6894289147260094),
 (A 2008 American Journal, 0.6894289147260094),
 (Preventive Medicine study, 0.6894289147260094),
 (those, 0.6894289147260094),
 (who, 0.6894289147260094),
 (daily food journals, 0.6894289147260094),
 (twice as

In [80]:
set_nchunk_simscores = tuple(set(noun_chunk_simscores))

In [82]:
for chunk, score in set_nchunk_simscores:
    if score >= 0.77:
        print((chunk, score))

(management, 0.7883743384038132)
(that, 0.7883743384038132)
(the language, 0.7883743384038132)
(labeling accuracy, 0.7883743384038132)
(the modelling, 0.7797034193325124)
(opportunities, 0.7883743384038132)
(clean margins, 0.7883743384038132)
(training data, 0.7797034193325124)
(support, 0.7883743384038132)
(it, 0.7883743384038132)
(The broader impact/commercial potential, 0.7797034193325124)
(radiation therapy planning, 0.7883743384038132)
(their unique interests, 0.7797034193325124)
((AI) driven recommender system, 0.7797034193325124)
(agricultural stakeholders, 0.7797034193325124)
(The capability, 0.7797034193325124)
(studies, 0.7797034193325124)
(the field, 0.7883743384038132)
(a major challenge, 0.7797034193325124)
(the text chats, 0.7883743384038132)
(adjustable weights, 0.7883743384038132)
(things, 0.7883743384038132)
(more accurate ways, 0.7797034193325124)
(review criteria, 0.7797034193325124)
(an additional subsystem, 0.7883743384038132)
(textbook abstractions, 0.788374338403