# NLP
Find your favorite news source and grab the article text. 

1. Show the most common words in the article.
2. Show the most common words under a part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4}).
3. Find a subject/object relationship through the dependency parser in any sentence.
4. Show the most common Entities and their types. 
5. Find Entities and their dependency (hint: entity.root.head)
6. Find the most similar words in the article

Note: Yes, the notebook from the video is not provided; I leave it to you to make your own :) It's your final assignment for the semester. Enjoy!

In [1]:
# !pip3 install spacy
# !python3 -m spacy download en_core_web_sm
# !python3 -m spacy download en_core_web_md

In [2]:
import spacy
import pandas as pd
import numpy as np
from collections import Counter

In [3]:
nlp = spacy.load('en_core_web_md')

#### <font color='purple'> Reading in data source</font>
<font color='purple'> Data pulled from Small Business Admin. site: https://www.sbir.gov/sbirsearch/award/all/?topic=AI&f%5B0%5D=im_field_agencies%3A105738</font>
<font color='purple'>All 78 of these awards were made to small businesses via the National Science Foundation's SBIR/STTR award program. All of these awards were tagged as AI-related.</font>
<font color='purple'> I would like to use spaCy to explore the abstract text for all 78 awards. The immediate goal is to generate some keywords specific to AI awards. Eventually, I want to generate keywords that can be used to search abstract text for AI-related awards. I believe that AI-related tools are likely used in many awarded projects that are tagged under other topic areas, and a keyword list would be one way to pull those awards.</font>

#### <font color='purple'> I am narrowing this data down so that only a list containing the abstract text for each proposal is left.</font>

In [4]:
award_data = pd.read_excel("NSF_AI_sbirAwards.xlsx")
abstracts = award_data["Abstract"]
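Later outputs in this notebook contain `_x000D_` fragments, which are most likely escaped carriage returns left over from the Excel export. A minimal sketch of a cleaning step that could run before the text is parsed; `clean_abstract` is a hypothetical helper, not part of the original notebook, and the sample string is made up.

```python
import re

# Hypothetical helper: replaces the "_x000D_" escape that Excel exports can
# leave behind, then collapses runs of whitespace into single spaces.
def clean_abstract(text):
    text = re.sub(r"_x000D_", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample that mimics the artifact seen in the noun-chunk output.
sample = "spoken dialogue systems._x000D_\n_x000D_\nThis project will advance knowledge."
cleaned = clean_abstract(sample)
print(cleaned)
```

With pandas, something like `abstracts.dropna().map(clean_abstract)` would apply it to every abstract, assuming missing abstracts should simply be skipped.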

#### <font color='purple'> Iterating through each abstract in the list and running the abstract through the NLP</font>

<font color='purple'>Since there were 78 awards in my data, this will output a list of 78 doc objects (78 separately processed text blocks in a list). </font>

<font color='purple'> I did not want to concatenate all of the abstract text into one string because I wanted to run each abstract through the NLP as its own entity. I figured that may be a better way to preserve the integrity of any vector relationships that only exist in a small number of abstracts.</font>


In [5]:
# nlp.pipe processes the abstracts as a stream and yields one Doc per abstract
abstracts_processed = list(nlp.pipe(abstracts))

### <font color='red'>1. Finding the most common words</font>

<font color='purple'> I will first iterate through each abstract and extract the 10 most common words in each abstract. I will keep a running list of each abstract's most common words. </font>

In [6]:
most_common_words = []

for doc in abstracts_processed:
    # Lemmas of all tokens that are neither stop words nor punctuation
    words = [token.lemma_ for token in doc
             if not token.is_stop and not token.is_punct]

    # Keep the 10 most frequent lemmas (with their counts) for this abstract
    most_common_words.append(Counter(words).most_common(10))
    

<font color='purple'>Since I want the most common words overall (across all abstracts), I need to iterate through the 10 most common words in each of the 78 abstracts. I will extract only the words themselves (not the counts) and add them to a running list. Having every abstract's most common words in a single list lets me see which words are common across all abstracts, rather than in one abstract at a time. </font>

In [7]:
# Flatten the per-abstract (word, count) pairs into one list of words
most_common_words_long = [word
                          for top10 in most_common_words
                          for word, _count in top10]
            
                    

<font color='purple'> Finally, I will count the words in the full list (most_common_words_long). Because each abstract contributes a given word at most once, each count is the number of abstracts in which that word ranked among the 10 most common. I use this as my measure of the most common words across all 78 abstracts. </font>

<font color='purple'>In the code, I am also removing certain words from this list, since they belong to the boilerplate language (template_txt) used in all of the abstracts.</font>

In [8]:
template_txt = ['project', 'broader', 'impact', 'Small', 'Business', 'Innovation', 
                '\n', 'AI', 'SBIR','STTR', 'Phase', 'Research', 'propose', 'broad']

for i in range(len(template_txt)):
    remove_word = template_txt[i]
    while remove_word in most_common_words_long:
        most_common_words_long.remove(remove_word)



In [9]:
Counter(most_common_words_long).most_common(20)

[('model', 10),
 ('learning', 9),
 ('system', 9),
 ('datum', 9),
 ('result', 8),
 ('technology', 7),
 ('time', 6),
 ('provide', 6),
 ('platform', 6),
 ('reduce', 5),
 ('learn', 5),
 ('health', 5),
 ('language', 4),
 ('control', 4),
 ('increase', 4),
 ('student', 4),
 ('video', 4),
 ('commercial', 4),
 ('enable', 4),
 ('information', 4)]
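The counts above record how many abstracts had a word in their top 10, which is not the same as how often the word occurs in the whole corpus. A minimal sketch of the difference, using made-up lemma lists in place of the processed abstracts and a top 2 instead of a top 10 for brevity:

```python
from collections import Counter

# Toy lemma lists standing in for three processed abstracts (made-up data).
docs = [
    ["model", "model", "model", "datum", "health"],
    ["model", "datum", "datum", "system"],
    ["system", "datum", "video"],
]

# Approach used above: take each abstract's top words, then count how many
# abstracts each word shows up in.
top_per_doc = [word for doc in docs for word, _ in Counter(doc).most_common(2)]
abstracts_with_word = Counter(top_per_doc)

# Alternative: count every occurrence across the whole corpus.
corpus_counts = Counter()
for doc in docs:
    corpus_counts.update(doc)

print(abstracts_with_word.most_common(3))
print(corpus_counts.most_common(3))
```

The first measure rewards words that are prominent in many abstracts; the second can be dominated by a few abstracts that repeat a word heavily. Which one is preferable depends on the goal.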

### <font color='red'>2. Finding the most common nouns, adjectives, and verbs.</font>

<font color='purple'> For this iteration, I don't care as much about granularity, so I will not treat each abstract as its own entity. I am going to iterate through every abstract and pick out all verbs, nouns, and adjectives. I will then narrow each list down to the most common of each. </font>


In [10]:
all_nouns = []
all_adj = []
all_verbs = []


for i in range(len(abstracts_processed)):
    nouns = [token.lemma_ for token in abstracts_processed[i] 
             if token.pos_ == "NOUN"]
    adj = [token.lemma_ for token in abstracts_processed[i]
              if token.pos_ == "ADJ"]
    verbs = [token.lemma_ for token in abstracts_processed[i]
                 if token.pos_ == "VERB"]
    
    all_nouns.append(nouns)
    all_adj.append(adj)
    all_verbs.append(verbs)
    
    
    

In [11]:
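# Flatten the per-abstract lists into one list per part of speech
# (plain Python keeps the lemmas as ordinary str objects)
all_nouns = [lemma for doc_nouns in all_nouns for lemma in doc_nouns]
all_adj = [lemma for doc_adj in all_adj for lemma in doc_adj]
all_verbs = [lemma for doc_verbs in all_verbs for lemma in doc_verbs]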
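The assignment suggests a `{POS: {word: count}}` shape for this step. One way to get it in a single pass is a dictionary of counters; the sketch below uses made-up `(lemma, pos)` pairs, which with spaCy would come from `token.lemma_` and `token.pos_`.

```python
from collections import Counter, defaultdict

# Made-up (lemma, part-of-speech) pairs for illustration only.
tagged = [
    ("model", "NOUN"), ("develop", "VERB"), ("model", "NOUN"),
    ("neural", "ADJ"), ("datum", "NOUN"), ("develop", "VERB"),
]

# One Counter per part of speech, created on first use.
counts_by_pos = defaultdict(Counter)
for lemma, pos in tagged:
    counts_by_pos[pos][lemma] += 1

for pos, counter in counts_by_pos.items():
    print(pos, dict(counter.most_common(2)))
```

This avoids keeping three parallel lists and extends to other tags (for example `PROPN`) without extra code.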

<font color='purple'> Removing the boilerplate text</font>

In [12]:
template_txt = ['project', 'broader', 'impact', 'Small', 'Business', 'Innovation', 
                '\n', 'AI', 'SBIR','STTR', 'Phase', 'Research', 'merit', 'award', 'criterion',
                'broad', 'intellectual', '-']

for i in range(len(template_txt)):
    remove_word = template_txt[i]
    while remove_word in all_nouns:
        all_nouns.remove(remove_word)
    while remove_word in all_adj:
        all_adj.remove(remove_word)
    while remove_word in all_verbs:
        all_verbs.remove(remove_word)
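The repeated `while ... remove` loops rescan the list for every removal. A possible alternative, sketched here with a shortened boilerplate list and made-up lemmas, filters each list in one pass against a set:

```python
# Shortened boilerplate list and made-up lemmas, for illustration only.
boilerplate = {"project", "broader", "impact", "SBIR", "Phase"}
lemmas = ["project", "model", "SBIR", "datum", "model", "impact"]

# A set gives fast membership checks, and a single pass preserves the order
# of the words that are kept.
filtered = [lemma for lemma in lemmas if lemma not in boilerplate]
print(filtered)
```

The same comprehension could be applied to the noun, adjective, and verb lists in turn.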


In [13]:
noun_list = Counter(all_nouns).most_common(30)
noun_list

[('datum', 101),
 ('model', 89),
 ('system', 81),
 ('learning', 81),
 ('support', 63),
 ('evaluation', 63),
 ('technology', 63),
 ('time', 59),
 ('mission', 59),
 ('review', 57),
 ('machine', 54),
 ('cost', 47),
 ('platform', 47),
 ('algorithm', 46),
 ('intelligence', 42),
 ('language', 42),
 ('student', 36),
 ('health', 35),
 ('research', 35),
 ('method', 34),
 ('application', 32),
 ('development', 32),
 ('information', 30),
 ('user', 29),
 ('training', 29),
 ('solution', 28),
 ('control', 26),
 ('video', 26),
 ('potential', 25),
 ('level', 25)]

In [14]:
verb_list = Counter(all_verbs).most_common(30)
verb_list

[('use', 141),
 ('develop', 64),
 ('reflect', 58),
 ('improve', 58),
 ('propose', 57),
 ('deem', 57),
 ('reduce', 55),
 ('provide', 54),
 ('learn', 49),
 ('base', 49),
 ('enable', 43),
 ('create', 40),
 ('increase', 40),
 ('have', 32),
 ('make', 31),
 ('include', 29),
 ('generate', 28),
 ('help', 28),
 ('require', 26),
 ('allow', 25),
 ('identify', 24),
 ('build', 23),
 ('result', 21),
 ('aim', 20),
 ('automate', 20),
 ('advance', 18),
 ('lead', 18),
 ('address', 18),
 ('train', 17),
 ('drive', 17)]

In [15]:
adj_list = Counter(all_adj).most_common(30)
adj_list

[('statutory', 57),
 ('worthy', 57),
 ('artificial', 41),
 ('new', 41),
 ('high', 36),
 ('commercial', 33),
 ('such', 29),
 ('real', 29),
 ('other', 27),
 ('current', 23),
 ('accurate', 23),
 ('human', 22),
 ('technical', 22),
 ('deep', 21),
 ('large', 18),
 ('novel', 17),
 ('advanced', 17),
 ('neural', 15),
 ('effective', 14),
 ('small', 14),
 ('low', 13),
 ('multiple', 13),
 ('many', 13),
 ('medical', 12),
 ('different', 12),
 ('social', 12),
 ('robust', 12),
 ('available', 12),
 ('reliable', 12),
 ('critical', 12)]

### <font color='red'>3. Find a subject/object relationship through the dependency parser in any sentence.</font>


In [16]:
## just picking a random abstract to pick a sentence from
abstracts_processed[39]

The broader impact of this Small Business Innovation Research (SBIR) Phase I project will result from creating a unique identification system using artificial intelligence (AI)-based facial recognition for horses and other animals that require vaccinations for birth control and disease inoculation. Wild horses and other wildlife that require remote vaccinations need to be identified so that populations are not over/under vaccinated. The means to vaccinate either manually or using remote technology exists, but most current methods are expensive, inhumane, or inefficient and identification is limited to photographs, sketches, memory, or RFID microchips. Federal agencies currently spend well over one hundred million dollars to deal with the problem. The commercial opportunity for a facial recognition system along with remote vaccination in this country and abroad is substantial. Wildlife managers will be able to relieve unhealthy overcrowding and allow livestock and domestic animals to co

In [17]:
sentence = "This project will develop an artificial intelligence (AI) identification system for horses using facial recognition technology and couple this with remote automated vaccination at feeding stations to ensure wild horses are correctly vaccinated for birth control and inoculated against disease."

In [18]:
processed_sent = nlp(sentence)

In [19]:
for token in processed_sent:
    if token.dep_ in ("nsubj", "dobj", "pobj"):
        print(
        f"""
TOKEN: {token.text}
=====
{token.tag_ = }
{token.head.text = }
{token.dep_ = }"""
     )
    


TOKEN: project
=====
token.tag_ = 'NN'
token.head.text = 'develop'
token.dep_ = 'nsubj'

TOKEN: system
=====
token.tag_ = 'NN'
token.head.text = 'develop'
token.dep_ = 'dobj'

TOKEN: horses
=====
token.tag_ = 'NNS'
token.head.text = 'for'
token.dep_ = 'pobj'

TOKEN: technology
=====
token.tag_ = 'NN'
token.head.text = 'using'
token.dep_ = 'dobj'

TOKEN: this
=====
token.tag_ = 'DT'
token.head.text = 'couple'
token.dep_ = 'dobj'

TOKEN: vaccination
=====
token.tag_ = 'NN'
token.head.text = 'with'
token.dep_ = 'pobj'

TOKEN: stations
=====
token.tag_ = 'NNS'
token.head.text = 'feeding'
token.dep_ = 'dobj'

TOKEN: horses
=====
token.tag_ = 'NNS'
token.head.text = 'ensure'
token.dep_ = 'dobj'

TOKEN: control
=====
token.tag_ = 'NN'
token.head.text = 'for'
token.dep_ = 'pobj'

TOKEN: disease
=====
token.tag_ = 'NN'
token.head.text = 'against'
token.dep_ = 'pobj'
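To turn the listing above into an explicit subject/object relationship, subjects and direct objects that share the same head can be paired. A minimal sketch using rows copied from the output above; grouping by head text is a simplification, and with spaCy tokens it would be safer to group by the head's index (`token.head.i`).

```python
from collections import defaultdict

# (token text, dependency label, head text) rows copied from the output above.
rows = [
    ("project", "nsubj", "develop"),
    ("system", "dobj", "develop"),
    ("technology", "dobj", "using"),
    ("this", "dobj", "couple"),
    ("horses", "dobj", "ensure"),
]

subjects = defaultdict(list)
objects = defaultdict(list)
for text, dep, head in rows:
    if dep == "nsubj":
        subjects[head].append(text)
    elif dep == "dobj":
        objects[head].append(text)

# Keep only heads that have both a subject and a direct object.
triples = [
    (subj, head, obj)
    for head in subjects
    for subj in subjects[head]
    for obj in objects.get(head, [])
]
print(triples)
```

For this sentence the only verb with both a subject and a direct object is "develop", which links "project" to "system".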


### <font color='red'>4. Show the most common Entities and their types.</font>


In [20]:
all_ents = []
all_labels = []
for i in range(len(abstracts_processed)):
    text = abstracts_processed[i]
    for entity in text.ents:
        all_ents.append(entity)  
        all_labels.append(entity.label_)
    
        
        

In [21]:
all_ents_labs = tuple(zip(all_ents, all_labels))

In [22]:
Counter(all_ents_labs).most_common(20)

[((this Small Business Innovation Research, 'ORG'), 1),
 ((American, 'NORP'), 1),
 ((American, 'NORP'), 1),
 ((12%, 'PERCENT'), 1),
 ((1990, 'DATE'), 1),
 ((over 40%, 'PERCENT'), 1),
 ((today, 'DATE'), 1),
 (($260 billion, 'MONEY'), 1),
 ((2016, 'DATE'), 1),
 ((the Center for Disease Control, 'ORG'), 1),
 ((CDC, 'ORG'), 1),
 ((the National Institute for Health, 'ORG'), 1),
 ((NIH, 'ORG'), 1),
 ((70%, 'PERCENT'), 1),
 ((American, 'NORP'), 1),
 ((2014, 'DATE'), 1),
 ((2013, 'DATE'), 1),
 ((American, 'NORP'), 1),
 (($60 billion, 'MONEY'), 1),
 ((annually, 'DATE'), 1)]
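Every count above is 1 because spaCy `Span` objects are compared by their position in a particular document rather than by their text, so two separate mentions of "American" are never treated as equal. Counting on the entity text instead aggregates repeated mentions. A minimal sketch with made-up `(text, label)` pairs, which in the notebook would come from `entity.text` and `entity.label_`:

```python
from collections import Counter

# Made-up (entity text, label) pairs for illustration only.
ents = [
    ("NSF", "ORG"), ("American", "NORP"), ("NSF", "ORG"),
    ("2016", "DATE"), ("American", "NORP"), ("NSF", "ORG"),
]

# Strings compare by value, so repeated mentions are counted together.
ent_counts = Counter(ents)

# Counting the labels alone shows which entity types dominate.
label_counts = Counter(label for _, label in ents)

print(ent_counts.most_common(2))
print(label_counts.most_common())
```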

### <font color='red'>5. Find Entities and their dependency</font>


In [23]:
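# For each entity, take the head of its root token,
# i.e. the token that the entity attaches to in the dependency parse
all_deps = [entity.root.head for entity in all_ents]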

In [24]:
all_ents_deps = tuple(zip(all_ents, all_deps))

In [25]:
set_ents_deps = list(set(all_ents_deps))

<font color='purple'> In the list of tuples, the first value of each tuple is the entity and the second value is the head of the entity's root token, i.e. the word the entity depends on in the parse. </font>

In [26]:
set_ents_deps[:20]

[(This Small Business Innovation Research, Phase),
 (This Small Business Technology Transfer, Phase),
 (NSF, mission),
 (Foundation, merit),
 (Foundation, merit),
 (annually, B),
 (millions, through),
 (NSF, mission),
 (hours, within),
 (1, Phase),
 (this Small Business Innovation Research, Phase),
 (Foundation, merit),
 (NSF, mission),
 (this Small Business Innovation Research, Phase),
 (This Small Business Innovation Research, Research),
 (this Small Business Innovation Research, Phase),
 (Science, Technology, Engineering, in),
 (x000D, x000D),
 (Riemannian, metrics),
 (this Small Business Innovation Research, centers)]

### <font color='red'>6. Instead of finding the most similar noun chunks on this one, I am going to compute the similarity scores of the keywords generated in Question 2 against the term "Artificial Intelligence". I will also find the most common noun chunks and do the same.</font>


<font color='purple'>Finding most common noun chunks. </font>



In [27]:
for noun_chunk in abstracts_processed[0].noun_chunks:
    print(noun_chunk)

The broader impact
(SBIR
the health
welfare
the American public
Obesity
American adults
12%
over 40%
an estimated medical cost
the Center
Disease Control
(CDC
the National Institute
Health
NIH
70%
American adults
American adults
weight loss
US News
World Report
A 2008 American Journal
Preventive Medicine study
those
who
daily food journals
twice as much weight
those
who
existing diet tracking methods
long-term weight loss
A personalized artificial intelligence (AI) chatbot
food
fun
millions
Americans
who
weight
knowledge
spoken dialogue systems._x000D
x000D
(SBIR
knowledge
the field
spoken dialogue systems
several ways
the project
a new research area
AI and spoken dialogue systems
nutrition
conversational agents
factual question answering
tasks
flight booking
an opportunity
big data
relationships
diet
health
this project
a neural generative chatbot model
memory
the benefit
personalized conversational interactions
intelligent agents
that
the history
conversations
personal details
the us

In [28]:
all_nounchunks = []

for i in range(len(abstracts_processed)):
    for noun_chunk in abstracts_processed[i].noun_chunks:
        all_nounchunks.append(noun_chunk)
        


In [29]:
nounchunk_list = Counter(all_nounchunks).most_common(50)
nounchunk_list

[(The broader impact, 1),
 ((SBIR, 1),
 (the health, 1),
 (welfare, 1),
 (the American public, 1),
 (Obesity, 1),
 (American adults, 1),
 (12%, 1),
 (over 40%, 1),
 (an estimated medical cost, 1),
 (the Center, 1),
 (Disease Control, 1),
 ((CDC, 1),
 (the National Institute, 1),
 (Health, 1),
 (NIH, 1),
 (70%, 1),
 (American adults, 1),
 (American adults, 1),
 (weight loss, 1),
 (US News, 1),
 (World Report, 1),
 (A 2008 American Journal, 1),
 (Preventive Medicine study, 1),
 (those, 1),
 (who, 1),
 (daily food journals, 1),
 (twice as much weight, 1),
 (those, 1),
 (who, 1),
 (existing diet tracking methods, 1),
 (long-term weight loss, 1),
 (A personalized artificial intelligence (AI) chatbot, 1),
 (food, 1),
 (fun, 1),
 (millions, 1),
 (Americans, 1),
 (who, 1),
 (weight, 1),
 (knowledge, 1),
 (spoken dialogue systems._x000D, 1),
 (x000D, 1),
 ((SBIR, 1),
 (knowledge, 1),
 (the field, 1),
 (spoken dialogue systems, 1),
 (several ways, 1),
 (the project, 1),
 (a new research area, 1)
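As with the entities, each noun chunk above is counted once because the counter receives `Span` objects rather than strings. Counting a normalized form of `noun_chunk.text` would merge repeats such as "American adults". A minimal sketch; `normalize` is a hypothetical helper, and the chunk strings are modeled on the output above with casing variants added for illustration.

```python
from collections import Counter

# Chunk strings modeled on the output above (casing variants are made up).
chunks = ["American adults", "the project", "American adults", "The project",
          "knowledge", "American adults", "knowledge"]

ARTICLES = ("the ", "a ", "an ")

def normalize(chunk_text):
    """Lower-case a chunk and drop one leading article (illustrative helper)."""
    text = chunk_text.lower().strip()
    for article in ARTICLES:
        if text.startswith(article):
            return text[len(article):]
    return text

chunk_counts = Counter(normalize(c) for c in chunks)
print(chunk_counts.most_common(3))
```

Dropping the article is a design choice: it groups "the project" with "The project", at the cost of losing the original surface form.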

<font color='purple'>Pulling just the words, not the counts, from most common word lists in Q2 and noun chunk list </font>



In [30]:
# Keep only the first element (the word or chunk) of each (item, count) pair
new_noun_list = [noun for noun, _count in noun_list]
new_adj_list = [adj for adj, _count in adj_list]
new_verb_list = [verb for verb, _count in verb_list]
new_chunk_list = [chunk for chunk, _count in nounchunk_list]

<font color='purple'>Running the word lists through the nlp</font>



In [31]:
adj_processed = []

for i in range(len(new_adj_list)):
    adj_processed.append(nlp(str(new_adj_list[i])))


In [32]:
verb_processed = []

for i in range(len(new_verb_list)):
    verb_processed.append(nlp(str(new_verb_list[i])))

In [33]:
noun_processed = []

for i in range(len(new_noun_list)):
    noun_processed.append(nlp(str(new_noun_list[i])))

In [34]:
chunk_processed = []

for i in range(len(new_chunk_list)):
    chunk_processed.append(nlp(str(new_chunk_list[i])))

### <font color='purple'>Now let's see how similar each of these tokens is to "Artificial Intelligence"</font>
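For context, spaCy's `Doc.similarity` by default reports the cosine similarity between the two documents' vectors, and a `Doc` vector defaults to the average of its token vectors. A pure-Python sketch of the cosine part, using made-up 3-dimensional vectors rather than real model vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no direction to compare, e.g. an all-zero vector
    return dot / (norm_u * norm_v)

# Made-up vectors; real en_core_web_md vectors have 300 dimensions.
target = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]
orthogonal = [0.0, 0.0, 5.0]

print(cosine_similarity(target, same_direction))
print(cosine_similarity(target, orthogonal))
```

Because the measure depends only on direction, a multi-word phrase averaged into one vector can score high against "Artificial Intelligence" without being about AI, which is worth keeping in mind when reading the scores below.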



In [35]:
# Parse the reference phrase once and reuse it in the cells below
ai_doc = nlp("Artificial Intelligence")

noun_sim_scores = [(doc, doc.similarity(ai_doc)) for doc in noun_processed]

In [36]:
adj_sim_scores = [(doc, doc.similarity(ai_doc)) for doc in adj_processed]

In [37]:
verb_sim_scores = [(doc, doc.similarity(ai_doc)) for doc in verb_processed]

In [44]:
chunk_sim_scores = [(doc, doc.similarity(ai_doc)) for doc in chunk_processed]



In [38]:
noun_sim_scores

[(datum, 0.16229846542910573),
 (model, 0.3584834386589548),
 (system, 0.4184719545872337),
 (learning, 0.46545453761360495),
 (support, 0.45484264365717275),
 (evaluation, 0.611053014799221),
 (technology, 0.5920905624578098),
 (time, 0.1551375607866306),
 (mission, 0.49339629497424203),
 (review, 0.368899479603463),
 (machine, 0.29071500562829494),
 (cost, 0.13468692324543607),
 (platform, 0.44343556096788234),
 (algorithm, 0.37940958735720265),
 (intelligence, 0.7534147843071304),
 (language, 0.3600757617059399),
 (student, 0.29192001859595706),
 (health, 0.30730281271486815),
 (research, 0.5477582517676998),
 (method, 0.3518276609476514),
 (application, 0.6059586760265948),
 (development, 0.5687852237378012),
 (information, 0.5713244979079949),
 (user, 0.24889266108900693),
 (training, 0.4700695505090661),
 (solution, 0.516126339600525),
 (control, 0.4422738424290547),
 (video, 0.2075852411612967),
 (potential, 0.4586656792996185),
 (level, 0.340708301894093)]

In [39]:
verb_sim_scores

[(use, 0.2708556199461809),
 (develop, 0.4382611501030176),
 (reflect, 0.23552155365401065),
 (improve, 0.3367729849605888),
 (propose, 0.36343975973940656),
 (deem, 0.14918708598538288),
 (reduce, 0.24169892192303288),
 (provide, 0.37870110973974336),
 (learn, 0.15861211260498925),
 (base, 0.28112849797286527),
 (enable, 0.4342192174281438),
 (create, 0.32369977944425465),
 (increase, 0.3470589400107229),
 (have, 0.16567279131147744),
 (make, 0.07299417311196903),
 (include, 0.3941786506152489),
 (generate, 0.3789617673713044),
 (help, 0.19288026401161093),
 (require, 0.3038872149764361),
 (allow, 0.23903739008843508),
 (identify, 0.419122005553052),
 (build, 0.21127415143505704),
 (result, 0.3396724185598072),
 (aim, 0.19195868838080615),
 (automate, 0.4925967332225406),
 (advance, 0.4530784082293404),
 (lead, 0.20234254075420816),
 (address, 0.3282341942434057),
 (train, 0.11416316395336243),
 (drive, 0.07277728018849687)]

In [40]:
adj_sim_scores

[(statutory, 0.34161351467311185),
 (worthy, 0.02409471807301838),
 (artificial, 0.7181813251221405),
 (new, 0.2594265977444017),
 (high, 0.21745876260938274),
 (commercial, 0.44930071010520195),
 (such, 0.33926811868514983),
 (real, 0.1976167858195671),
 (other, 0.37166003257766117),
 (current, 0.4403509108482855),
 (accurate, 0.3580312914939001),
 (human, 0.38616404703687013),
 (technical, 0.5654320535617686),
 (deep, 0.14236245576745077),
 (large, 0.27610697255206146),
 (novel, 0.14623246434457848),
 (advanced, 0.5232243321366429),
 (neural, 0.41318993503194257),
 (effective, 0.5181157915169209),
 (small, 0.19171740224475403),
 (low, 0.10188361339250633),
 (multiple, 0.37848785364242427),
 (many, 0.21052204542830283),
 (medical, 0.3901301222397854),
 (different, 0.37415293934846),
 (social, 0.3637190322616282),
 (robust, 0.29570670759182394),
 (available, 0.40672720986020416),
 (reliable, 0.42674694652927575),
 (critical, 0.5284022278859833)]

In [45]:
chunk_sim_scores

[(The broader impact, 0.49764284122726915),
 ((SBIR, 0.2450353185621908),
 (the health, 0.4669891702642012),
 (welfare, 0.21816397763488432),
 (the American public, 0.518324983808169),
 (Obesity, 0.2323724093992703),
 (American adults, 0.38860581864549093),
 (12%, 0.04107140033942633),
 (over 40%, 0.06492811747024285),
 (an estimated medical cost, 0.3351983507705724),
 (the Center, 0.5314765390451474),
 (Disease Control, 0.5481757777291704),
 ((CDC, 0.22819281941103012),
 (the National Institute, 0.6091128569315473),
 (Health, 0.4106357266719943),
 (NIH, 0.18224507944486326),
 (70%, 0.04232412947204514),
 (American adults, 0.38860581864549093),
 (American adults, 0.38860581864549093),
 (weight loss, 0.1809382545871975),
 (US News, 0.36099220663633097),
 (World Report, 0.5187706762296186),
 (A 2008 American Journal, 0.3932253594670052),
 (Preventive Medicine study, 0.5865217631512722),
 (those, 0.20376603473029403),
 (who, 0.078623235718873),
 (daily food journals, 0.2600020837104913),


### <font color='purple'>Here are the highest similarity scores </font>



In [41]:
for i in range(len(adj_sim_scores)):
    if adj_sim_scores[i][1] > 0.40:
        print(adj_sim_scores[i])

(artificial, 0.7181813251221405)
(commercial, 0.44930071010520195)
(current, 0.4403509108482855)
(technical, 0.5654320535617686)
(advanced, 0.5232243321366429)
(neural, 0.41318993503194257)
(effective, 0.5181157915169209)
(available, 0.40672720986020416)
(reliable, 0.42674694652927575)
(critical, 0.5284022278859833)


In [42]:
for i in range(len(verb_sim_scores)):
    if verb_sim_scores[i][1] > 0.4:
        print(verb_sim_scores[i])

(develop, 0.4382611501030176)
(enable, 0.4342192174281438)
(identify, 0.419122005553052)
(automate, 0.4925967332225406)
(advance, 0.4530784082293404)


In [43]:
for i in range(len(noun_sim_scores)):
    if noun_sim_scores[i][1] > 0.50:
        print(noun_sim_scores[i])

(evaluation, 0.611053014799221)
(technology, 0.5920905624578098)
(intelligence, 0.7534147843071304)
(research, 0.5477582517676998)
(application, 0.6059586760265948)
(development, 0.5687852237378012)
(information, 0.5713244979079949)
(solution, 0.516126339600525)


In [46]:
for i in range(len(chunk_sim_scores)):
    if chunk_sim_scores[i][1] > 0.4:
        print(chunk_sim_scores[i])

(The broader impact, 0.49764284122726915)
(the health, 0.4669891702642012)
(the American public, 0.518324983808169)
(the Center, 0.5314765390451474)
(Disease Control, 0.5481757777291704)
(the National Institute, 0.6091128569315473)
(Health, 0.4106357266719943)
(World Report, 0.5187706762296186)
(Preventive Medicine study, 0.5865217631512722)
(existing diet tracking methods, 0.4821760890865779)
(A personalized artificial intelligence (AI) chatbot, 0.5730297946412343)
(knowledge, 0.4911135228538597)
(knowledge, 0.4911135228538597)
(the field, 0.4820226949924722)
(spoken dialogue systems, 0.4724625370548312)
(the project, 0.5159254115803398)
(AI and spoken dialogue systems, 0.6579521391424739)
