# M1 Text Processing Using spaCy Library

## Objective

Preprocess the dataset using the spaCy library.

## Load all relevant Python libraries and a spaCy language model.

In [470]:
import json
import spacy

In [86]:
# !python -m spacy download en_core_web_sm

In [471]:
sp = spacy.load("en_core_web_sm")
stopwords = sp.Defaults.stop_words

## Open the provided JSON file.

It contains a list of dictionaries with summaries from Wikipedia articles, where each dictionary has three key-value pairs. The keys title, text and url correspond to:

- Title of the Wikipedia article the text is taken from.


- Wikipedia article text. (In this dataset we included only the summary.)


- Link to the Wikipedia article.

In [325]:
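# The file is opened for reading; utf-8 keeps the Greek characters in the text intact
with open('data/data.json', 'r', encoding='utf-8') as infile:
    summaries = json.load(infile)
print(summaries[0].keys())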

dict_keys(['title', 'text', 'url'])


In [326]:
summaries[0]['text']

'A pandemic (from Greek πᾶν, pan, "all" and δῆμος, demos, "people") is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people. A widespread endemic disease with a stable number of infected people is not a pandemic. Widespread endemic diseases with a stable number of infected people such as recurrences of seasonal influenza are generally excluded as they occur simultaneously in large regions of the globe rather than being spread worldwide.\nThroughout human history, there have been a number of pandemics of diseases such as smallpox and tuberculosis. The most fatal pandemic in recorded history was the Black Death (also known as The Plague), which killed an estimated 75–200 million people in the 14th century. The term was not used yet but was for later pandemics including the 1918 influenza pandemic (Spanish flu). Current pandemics include COVID-19 (SARS-CoV-2) and HIV/AIDS.'

In [327]:
len(summaries)

26

## Create a Python function that takes in a text string, performs all operations described in the previous step, and outputs a list of tokens (lemmas).

- Lowercases the text string.


- Creates a spaCy document with the text lemmas and their attributes using a spaCy model of your choice.


- Removes stop words, punctuation, and other unclassified lemmas.


- Returns a list of tokens (lemmas) found in the text.

In [213]:
# Lowercase the text and run it through the spaCy pipeline.
# Explore the attributes of each token that spaCy returns.
text = summaries[0]['text']
text_tokenized = sp(text.lower())
for token in text_tokenized[:5]:
    print(type(token), token.text, token.pos_, token.dep_, token.lemma_)

<class 'spacy.tokens.token.Token'> a DET det a
<class 'spacy.tokens.token.Token'> pandemic ADJ nsubj pandemic
<class 'spacy.tokens.token.Token'> ( PUNCT punct (
<class 'spacy.tokens.token.Token'> from ADP prep from
<class 'spacy.tokens.token.Token'> greek ADJ amod greek


In [214]:
text_tokenized

a pandemic (from greek πᾶν, pan, "all" and δῆμος, demos, "people") is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people. a widespread endemic disease with a stable number of infected people is not a pandemic. widespread endemic diseases with a stable number of infected people such as recurrences of seasonal influenza are generally excluded as they occur simultaneously in large regions of the globe rather than being spread worldwide.
throughout human history, there have been a number of pandemics of diseases such as smallpox and tuberculosis. the most fatal pandemic in recorded history was the black death (also known as the plague), which killed an estimated 75–200 million people in the 14th century. the term was not used yet but was for later pandemics including the 1918 influenza pandemic (spanish flu). current pandemics include covid-19 (sars-cov-2) and hiv/aids.

In [215]:
def lower(text):
    return sp(text.lower())

In [216]:
lower(summaries[0]['text'])

a pandemic (from greek πᾶν, pan, "all" and δῆμος, demos, "people") is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people. a widespread endemic disease with a stable number of infected people is not a pandemic. widespread endemic diseases with a stable number of infected people such as recurrences of seasonal influenza are generally excluded as they occur simultaneously in large regions of the globe rather than being spread worldwide.
throughout human history, there have been a number of pandemics of diseases such as smallpox and tuberculosis. the most fatal pandemic in recorded history was the black death (also known as the plague), which killed an estimated 75–200 million people in the 14th century. the term was not used yet but was for later pandemics including the 1918 influenza pandemic (spanish flu). current pandemics include covid-19 (sars-cov-2) and hiv/aids.

In [217]:
# Remove stop words and punctuation
def remove_redundant_tokens(text_tokenized):
    return ' '.join([token.text for token in text_tokenized if not token.is_stop and not token.is_punct])

In [218]:
# Lemmatize (tokenize) the texts
def lemmatize(text):
    return [token.lemma_ for token in sp(text)]

In [219]:
# Build a tokenizer function
def tokenizer(document):
    """
    Accept a text string and return a list of lemmas:
    1. Lowercase the text and parse it with spaCy
    2. Remove stop words and punctuation
    3. Lemmatize the remaining tokens
    """
    text_tokenized = lower(document)
    clean_text = remove_redundant_tokens(text_tokenized)
    token_lemmatized = lemmatize(clean_text)
    return token_lemmatized
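
The helpers above run the spaCy pipeline twice: once in `lower()` and again in `lemmatize()` on the re-joined text. A minimal single-pass sketch is shown below. It assumes any callable `nlp` that returns tokens exposing `lemma_`, `is_stop`, `is_punct`, and `is_space`, as spaCy tokens do. The names `tokenize_single_pass` and `fake_nlp` are hypothetical. The stand-in pipeline exists only so the sketch runs without a model download, and a real model may produce slightly different lemmas than `tokenizer` does because they are taken from the first parse.

```python
from types import SimpleNamespace


def tokenize_single_pass(text, nlp):
    """Lowercase, parse once, and keep the lemmas of meaningful tokens."""
    doc = nlp(text.lower())
    return [
        token.lemma_
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


def fake_nlp(text):
    """Hypothetical stand-in for a spaCy pipeline: whitespace split, toy flags."""
    stop_words = {"a", "an", "is", "the", "of"}
    tokens = []
    for word in text.split():
        tokens.append(SimpleNamespace(
            lemma_=word,
            is_stop=word in stop_words,
            is_punct=word in {".", ","},
            is_space=False,
        ))
    return tokens


print(tokenize_single_pass("A pandemic is an epidemic of disease .", fake_nlp))
# With a real model, pass the loaded pipeline instead: tokenize_single_pass(text, sp)
```
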

## Use this function to preprocess all text documents in the dataset (text field only), and add the resulting lists to the dictionaries from step 1. 

You should end up with a list of dictionaries, each of which now has four key-value pairs:

- title: Title of the Wikipedia article the text is taken from.


- text: Wikipedia article text. (In this dataset we included only the summary.)


- tokenized_text: Tokenized Wikipedia article text.


- url: Link to the Wikipedia article.

In [220]:
# Preprocess all the documents using the tokenizer function
for doc in summaries:
    doc['tokenized_text'] = tokenizer(doc['text'])

In [221]:
summaries[0]['tokenized_text'][:10]

['pandemic',
 'greek',
 'πᾶν',
 'pan',
 'δῆμος',
 'demos',
 'people',
 'epidemic',
 'infectious',
 'disease']

In [222]:
len(summaries[0]['tokenized_text'])

88

## Save the new list of dictionaries in JSON format.

In [223]:
# Save the tokenized texts to file (the with block closes the file on exit):
with open('data/summaries.json', 'w') as outfile:
    json.dump(summaries, outfile)

<_io.TextIOWrapper name='data/summaries.json' mode='w' encoding='UTF-8'>

# M2 TF-IDF Search Using Cosine Similarity

## Objective

Implement a basic Tf-Idf search.

- In your search for an optimal document retrieval method in the CDC’s huge knowledge base, you decide to try the term frequency search first because of its simplicity. It is a well-developed technique and is a great place to start!


- In Milestone 1, you prepared the documents for Tf-Idf-based search. You also computed the Tf-Idf vectors for every document in the CDC’s knowledge base. The standard approach to finding the most relevant documents to your query is to compute similarities between the Tf-Idf vectors of the documents and the query. It works, but you realize that it can be very inefficient for very large document sets since you need to compute the similarities between the query and every one of the documents in your database. What would be a better solution? Let us move on to the last milestone of the project to find out!

## Load all relevant Python libraries and a spaCy language model.

In [206]:
import json
import itertools
from collections import Counter

import spacy
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

## Access the tokenized text in your new dataset from the previous milestone. 

Each document dictionary should now include a new key-value pair with the lemmatized text of the articles.

In [528]:
# The with block closes the file on exit
with open('data/summaries.json', 'r') as f:
    data = json.load(f)

## Create a corpus vocabulary. It should simply be a list of unique tokens in the provided set of documents. 

Count how many times each unique token appears in the corpus; you will need these counts for the next step.

In [395]:
# collect the tokenized texts into a list of lists (one list per document)
tokenized_texts = [i["tokenized_text"] for i in data]
# the lists have different lengths, so build an object array to read the shape
print(np.array(tokenized_texts, dtype=object).shape)

# flatten the list of lists (use itertools.chain)
flattened_tokenized_texts = list(itertools.chain(*tokenized_texts))
print(np.array(flattened_tokenized_texts).shape)

# remove duplicates
vocab = list(set(flattened_tokenized_texts))
len(vocab)

(26,)
(3617,)


1506
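
The step description also asks for corpus-wide token counts. A minimal sketch with `collections.Counter` follows, using a toy `toy_texts` list as a hypothetical stand-in for the `tokenized_text` lists; the variable names are illustrative only.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for [doc["tokenized_text"] for doc in data]
toy_texts = [
    ["pandemic", "disease", "disease"],
    ["virus", "disease"],
    ["pandemic", "plague"],
]

# Count every occurrence of every token across the whole corpus
corpus_counts = Counter(chain.from_iterable(toy_texts))

# The vocabulary is the set of counted tokens; sorting makes the order reproducible
toy_vocab = sorted(corpus_counts)

print(corpus_counts.most_common(2))
print(toy_vocab)
```

Sorting the vocabulary is a design choice worth considering: the order of `list(set(...))` can change between Python sessions, and every Tf-Idf vector position depends on the vocabulary order.
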

In [396]:
# Save the vocabulary as a json file
# The with block closes the file on exit
with open('data/vocab.json', 'w') as outfile:
    json.dump(vocab, outfile)

In [499]:
# count how many times each token occurs in a document
# you will need it for TfIdf calculations
docs_token_counter = []
for doc in data:
    # For each document, count how many of each token they have
    # Counter function from collections is very handy
    docs_token_counter.append(Counter(doc['tokenized_text']))

In [500]:
len(docs_token_counter)

26

In [494]:
number_docs_with_token = {}
for token in vocab:
    # For each token in the corpus vocabulary, count how many documents contain it
    doc_count = 0
    for document in data:
        if token in document['tokenized_text']:
            doc_count = doc_count + 1
    number_docs_with_token[token] = doc_count
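
The nested loop above scans every document once per vocabulary token. For larger corpora, the same document frequencies can be collected in a single pass over the documents. This is a sketch on toy data; `toy_docs` and `doc_freq` are hypothetical names.

```python
from collections import Counter

# Toy stand-in for the list of document dictionaries
toy_docs = [
    {"tokenized_text": ["ebola", "virus", "virus"]},
    {"tokenized_text": ["virus", "plague"]},
    {"tokenized_text": ["ebola", "fever"]},
]

doc_freq = Counter()
for doc in toy_docs:
    # set() ensures a token is counted once per document, however often it repeats
    doc_freq.update(set(doc["tokenized_text"]))

print(doc_freq["virus"], doc_freq["ebola"], doc_freq["plague"])
```
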

In [501]:
[v for v in number_docs_with_token.values() if v > 1][:10]

[4, 2, 16, 2, 3, 4, 3, 2, 2, 2]

In [502]:
number_docs_with_token['ebola']

2

## Calculate Tf-Idf vectors for every article in the dataset and add these vectors to the article dictionaries. 

You should end up with the same list of dictionaries as before, but with a new key-value pair containing Tf-Idf vectors:

- title: Title of the Wikipedia article the text is taken from.


- text: Wikipedia article text. (In this dataset we included only the summary.)


- tokenized_text: Tokenized Wikipedia article text.


- url: Link to the Wikipedia article.


- tf_idfs: Tf-Idf vector (stored under the key tfidf_vec in the code below).

$tf(token) = \frac{count(token\:in\:document)}{count(all\:tokens\:in\:document)}$


$idf(token) = \log\left(\frac{number\:of\:documents}{number\:of\:documents\:containing\:the\:token}\right)$

In [509]:
token = 'disease'

In [510]:
docs_token_counter[0][token]

4

In [511]:
len(data[0]["tokenized_text"])

88

In [515]:
tf = docs_token_counter[0][token]/len(data[0]["tokenized_text"])
tf

0.045454545454545456

In [513]:
number_docs_with_token[token]

16

In [517]:
idf = np.log(len(data)/number_docs_with_token['disease'])
idf

0.4855078157817008
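
Multiplying the two values above gives the Tf-Idf weight of 'disease' in the first document. The worked line below uses the natural logarithm, as `np.log` does, and the first document's length of 88 tokens:

$tfidf = tf \cdot idf = \frac{4}{88} \cdot \ln\left(\frac{26}{16}\right) \approx 0.04545 \cdot 0.48551 \approx 0.0221$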

In [503]:
for i, token_counter in enumerate(docs_token_counter):
    tfidf_vec = []
    for token in vocab:
        # compute the term frequency (tf) of the token in this document
        tf = token_counter[token] / len(data[i]["tokenized_text"])

        # compute the log of the inverse document frequency (idf) of the token
        idf = np.log(len(data) / number_docs_with_token[token])

        # compute tf-idf for the token and append it to this document's vector
        tfidf_vec.append(tf * idf)

    # add the tf-idf vector to the corresponding data dictionary
    data[i]['tfidf_vec'] = tfidf_vec
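
Running the same computation on a two-document toy corpus makes the effect of the idf term visible: a token that occurs in every document gets weight 0. This is a standard-library sketch, not the notebook's implementation; `tfidf_vector`, `toy_docs`, `toy_vocab`, and `doc_freq` are hypothetical names.

```python
import math
from collections import Counter

toy_docs = [
    ["pandemic", "disease", "disease", "plague"],
    ["virus", "disease"],
]
toy_vocab = sorted({token for doc in toy_docs for token in doc})

# Document frequency: number of documents containing each token
doc_freq = Counter(token for doc in toy_docs for token in set(doc))


def tfidf_vector(tokens, vocab, doc_freq, n_docs):
    """Return one tf * log(N / df) weight per vocabulary entry."""
    counts = Counter(tokens)
    return [
        (counts[token] / len(tokens)) * math.log(n_docs / doc_freq[token])
        for token in vocab
    ]


vectors = [tfidf_vector(doc, toy_vocab, doc_freq, len(toy_docs)) for doc in toy_docs]
print(toy_vocab)
print([round(weight, 4) for weight in vectors[0]])
```

Here 'disease' appears in both documents, so its idf is log(2/2) = 0 and it contributes nothing to either vector, while 'pandemic' and 'plague' each get a positive weight in the first document.
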

In [463]:
# Save the updated summaries with the computed Tf-Idf vectors
with open('data/summaries.json', 'w') as json_file:
    json.dump(data, json_file)

In [404]:
query = "highest pandemic casualties"

In [445]:
# Reuse the workflow for article Tf-Idf calculation
# to build a vectorizer function for search queries.
# Note: tokenizer() is the function built in Milestone 1.

def vectorize(query, vocab=vocab):
    query_vec = []
    # Tokenize the query and count its tokens once
    tokenized_query = tokenizer(query)
    query_length = len(tokenized_query)
    query_counter = Counter(tokenized_query)
    for token in vocab:
        # Build a TfIdf vector of the same shape as the document TfIdfs
        tf = query_counter[token] / query_length
        idf = np.log(len(data) / number_docs_with_token[token])
        query_vec.append(tf * idf)

    return query_vec

## Now we can search our list of dictionaries by this Tf-Idf field using existing similarity tools.

We suggest you use the scikit-learn library and its cosine_similarity function.
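
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The standard-library sketch below illustrates the quantity itself. The function name `cosine` is hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0, 1.0], [2.0, 0.0, 2.0]))  # same direction, different length
print(cosine([1.0, 0.0], [0.0, 1.0]))            # no shared non-zero terms
```

Because only the direction of the vectors matters, a short query and a long document can still score close to 1 when they weight the same tokens in the same proportions.
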

In [454]:
vec_1 = np.array(vectorize(query)).reshape(1, -1)

In [467]:
vec_2 = np.array(data[0]['tfidf_vec']).reshape(1, -1)

In [468]:
cosine_similarity(vec_1, vec_2)[0][0]

0.0160686743357183

In [532]:
# Build a search function
def search_tfidf(query, docs):
    rankings = []
    # vectorize the query; if this fails, print the query and return no results
    try:
        vectorized_query = np.array(vectorize(query)).reshape(1, -1)
    except Exception:
        print(query)
        return rankings
    # Build a list of results using sklearn cosine_similarity function
    for doc in docs:
        # compute the cosine similarity between the query and the document
        rank = cosine_similarity(vectorized_query, np.array(doc['tfidf_vec']).reshape(1, -1))[0][0]
        if rank > 0:
            # add this document to results along with its similarity score
            rankings.append({'title': doc['title'], 'rank': rank})

    # The results should be a list of dictionaries with at least the 'title' and 'rank' fields
    return sorted(rankings, key=lambda item: item.get("rank"), reverse=True)

In [533]:
# Let's test how well this function works
search_tfidf("ebola", data)

[{'title': 'Plague of Cyprian', 'rank': 0.11778345241757451},
 {'title': 'Science diplomacy and pandemics', 'rank': 0.07118947137494436}]

In [440]:
for s in data:
    if s["title"] == 'Plague of Cyprian':
        print(s["text"])

The Plague of Cyprian was a pandemic that afflicted the Roman Empire about from AD 249 to 262. The plague is thought to have caused widespread manpower shortages for food production and the Roman army, severely weakening the empire during the Crisis of the Third Century. Its modern name commemorates St. Cyprian, bishop of Carthage, an early Christian writer who witnessed and described the plague. The agent of the plague is highly speculative because of sparse sourcing, but suspects have included smallpox, pandemic influenza and viral hemorrhagic fever (filoviruses) like the Ebola virus.


# M3 Implement an Inverted Index and Search

## Objective

Implement an inverted index and search.

- After testing out a simple Tf-Idf search, you realize that you can successfully search your repository of documents. However, the CDC’s library is enormous, and looping over every document to compute cosine similarities for every one of them does not seem like the best way to search. There must be a better way to do it!


- An inverted index is the most commonly used data structure in document retrieval systems because it enables very fast full-text search. Instead of looking up query tokens in every document in the database, we can quickly retrieve the documents that are already known to contain the tokens by their keys. Including Tf-Idf scores lets us also account for how often the token occurs in the document, that is, its relevance.


- The downside of using an inverted index is the increased processing cost. We have to tokenize every document and compute all the Tf-Idf values to be able to search them.
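
The idea in the points above can be sketched on toy data: each token maps to the documents that contain it, so a lookup touches one entry instead of scanning every document. The names `toy_docs` and `toy_index` are hypothetical, and the token lists are made up for illustration.

```python
from collections import defaultdict

toy_docs = {
    "Cholera": ["cholera", "infection", "bacterium"],
    "Spanish flu": ["influenza", "pandemic", "infection"],
    "Pandemic": ["pandemic", "epidemic", "disease"],
}

# Map each token to the set of document titles that contain it
toy_index = defaultdict(set)
for title, tokens in toy_docs.items():
    for token in tokens:
        toy_index[token].add(title)

# A lookup reads only the entry for the token, not every document
print(sorted(toy_index["pandemic"]))
print(sorted(toy_index["infection"]))
```
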

## Create a new Jupyter Notebook and load dependencies required to complete this milestone.

In [621]:
# import dependencies
import json
from collections import defaultdict
import spacy
import operator
import numpy as np  # used below when building the inverted index

## Load the two JSON files you created and saved in Milestone 2:

- The vocabulary file, containing all tokens in our corpus.


- The file with the documents. Each document dictionary now should contain the following fields: title, text, url, tokenized_text, tfidf_vec.

In [472]:
# The with block closes the file on exit
with open('data/summaries.json', 'r') as f:
    summaries = json.load(f)

In [473]:
# The with block closes the file on exit
with open('data/vocab.json', 'r') as f:
    vocab = json.load(f)

## Build an inverted index. 

- Use the previously calculated Tf_Idf values to do it.

In [601]:
inverted_index = {}

# Create a lookup dictionary for each word in the vocabulary
for i, word in enumerate(vocab):
    inverted_index[word] = []
    # For each word in the corpus vocabulary, list all articles it occurs in
    # together with a score for that article. Note that the score stored here
    # is the mean of the article's whole Tf-Idf vector, so it is the same for
    # every word in the article; doc['tfidf_vec'][i] would be this word's own
    # Tf-Idf score, provided vocab is in the order used to build the vectors.
    for doc in summaries:
        if word in doc['tokenized_text']:
            inverted_index[word].append([doc['title'], np.array(doc['tfidf_vec']).mean(axis=0)])
# Now you have a lookup table of all articles that have a particular keyword
# let's request a list of articles with the word "coronavirus" in them
inverted_index["coronavirus"]

[['COVID-19 pandemic', 0.0015344736613540093]]
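
The cell above stores `np.array(doc['tfidf_vec']).mean(axis=0)` for each article, which is the mean of the article's whole Tf-Idf vector and therefore the same number for every word in that article. A hedged sketch of a per-word alternative is shown below on toy data. It assumes the vocabulary is in the same order that was used to build the vectors. `toy_vocab`, `toy_summaries`, `per_token_index`, and the weights are made up for illustration.

```python
toy_vocab = ["coronavirus", "disease", "plague"]
toy_summaries = [
    {"title": "COVID-19 pandemic", "tokenized_text": ["coronavirus", "disease"],
     "tfidf_vec": [0.35, 0.0, 0.0]},
    {"title": "Plague of Cyprian", "tokenized_text": ["plague", "disease"],
     "tfidf_vec": [0.0, 0.0, 0.35]},
]

per_token_index = {}
for i, word in enumerate(toy_vocab):
    # doc["tfidf_vec"][i] is the weight of `word` itself, assuming the vector
    # was built by iterating over the vocabulary in this same order
    per_token_index[word] = [
        [doc["title"], doc["tfidf_vec"][i]]
        for doc in toy_summaries
        if word in doc["tokenized_text"]
    ]

print(per_token_index["coronavirus"])
print(per_token_index["disease"])
```

With per-word scores, a word shared by every article (here 'disease') contributes 0, while a distinctive word ranks its article higher.
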

In [525]:
# Check if "coronavirus" is indeed in the article 
for s in summaries:
    if s["title"] == 'COVID-19 pandemic':
        print(s["text"])

The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in December 2019 in Wuhan, China. The outbreak was declared a Public Health Emergency of International Concern in January 2020, and a pandemic in March 2020. As of 17 October 2020, more than 39.5 million cases have been confirmed, with more than 1.1 million deaths attributed to COVID-19.

Common symptoms include fever, cough, fatigue, breathing difficulties, and loss of smell. Complications may include pneumonia and acute respiratory distress syndrome. The incubation period is typically around five days but may range from one to 14 days. There are several vaccine candidates in development, although none have proven their safety and efficacy. There is no known specific antiviral medication, so primary treatment is currently symptomatic.
Recommended preventive m

## Now that we have the index, we need a search function. 

Build a search function that accepts query texts, searches the inverted index, and returns sorted search results.

{"query": "black death", "relevant_article_titles": [["Pandemic", 0.047336756359133265], ["Cholera", 0.014518164813892178], ["Antonine Plague", 0.013233865618817103], ["Epidemiology of HIV/AIDS", 0.011947239794765438], ["Bills of mortality", 0.008868054280650635], ["Spanish flu", 0.008602012652231115], ["1929\u20131930 psittacosis pandemic", 0.008231591054766618], ["Pandemic Severity Assessment Framework", 0.008039264160963658], ["HIV/AIDS", 0.006826994168437393], ["COVID-19 pandemic", 0.005060007442488892], ["Swine influenza", 0.004675006876212563]]}, {"query": "zoonotic diseases", "relevant_article_titles": [["Swine influenza", 0.035414092804581326], ["Disease X", 0.029604135108640295], ["Pandemic", 0.022322198426744867], ["Pandemic prevention", 0.013871651879477165], ["HIV/AIDS", 0.013486328216158356], ["Targeted immunization strategies", 0.0105545177343848], ["Science diplomacy and pandemics", 0.01032995352727023], ["HIV/AIDS in Yunnan", 0.008593058686401785], ["Cholera", 0.008194224738931659], ["Antonine Plague", 0.007469351012026167], ["Superspreader", 0.0066507919970096], ["COVID-19 pandemic", 0.005711856656255304], ["Basic reproduction number", 0.0056020132590196255], ["1929\u20131930 psittacosis pandemic", 0.004646007806523453], ["Pandemic Severity Assessment Framework", 0.004537456222258886], ["Epidemiology of HIV/AIDS", 0.0016857910270197945], ["Virus", 0.0015866268489598066]]}

In [645]:
# Create a search function to search the inverted index
# Note: tokenizer() is the function built in Milestone 1.

def search(query, index=inverted_index):

    query_tokens = tokenizer(query)

    # Look up all query tokens in the inverted index
    # and build a list of articles that contain them.
    # Each entry is a [title, score] pair; tokens missing from the index are skipped.
    newlist = []
    for token in query_tokens:
        newlist.extend(index.get(token, []))

    # create a dictionary with compound TfIdf scores
    # to take into account that an article can include multiple keywords
    # from your query
    #     output = defaultdict(int)
    #     for k, v in newlist:
    #         output[k] += v

    # sort search results by their scores
    results = {'query': query, "relevant_article_titles": sorted(newlist, key=lambda column: column[1], reverse=True)}
    return results
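
The compound-score step that is commented out in the function above can be sketched as follows: each article's scores are summed over all query tokens, so an article appears once in the results. This is an illustration on a toy index, not the notebook's implementation. `search_compound` and `toy_index` are hypothetical names, and the function takes already-tokenized input so that it runs without spaCy.

```python
from collections import defaultdict

# Toy index in the same shape as inverted_index: token -> [[title, score], ...]
toy_index = {
    "swine": [["Swine influenza", 0.5]],
    "flu": [["Swine influenza", 0.25], ["Spanish flu", 0.125]],
}


def search_compound(query_tokens, index):
    """Sum the scores of each article over all query tokens, then sort."""
    totals = defaultdict(float)
    for token in query_tokens:
        # Tokens that are not in the index contribute nothing
        for title, score in index.get(token, []):
            totals[title] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


print(search_compound(["swine", "flu", "unknown"], toy_index))
```
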

## Test your search engine with a couple of different search queries:

- Try to search "symptoms of swine flu". Great news: the article titled Swine influenza should be the first title on the list.


- Let us submit another, slightly more ambiguous query. Say you want to find out which other organizations, besides the CDC, are working on pandemic prevention programs. Try to search “pandemic prevention organizations” and check what comes up. Disappointingly, the titles with the highest Tf-Idf ranks are not going to answer your question. The main disadvantage of keyword search is that it does not understand the context and the meaning of your request. It only knows how often the keywords appear in a given document compared to other documents in the database.


- Compare your results for these example queries and a few other suggested queries in the example_queries.json file with our top three results for each search request. The results are provided in the example_query_results.json file.
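
One way to make the comparison concrete is to count how many of the expected top three titles also appear in your own top three. The sketch below assumes both results use the `query` / `relevant_article_titles` structure shown earlier in this milestone. `top_k_titles` and `top_k_overlap` are hypothetical helper names, and the scores are toy values.

```python
def top_k_titles(result, k=3):
    """Return the first k distinct titles from a search result dictionary."""
    titles = []
    for title, _score in result["relevant_article_titles"]:
        if title not in titles:
            titles.append(title)
        if len(titles) == k:
            break
    return titles


def top_k_overlap(mine, expected, k=3):
    """Count how many of the expected top-k titles appear in my top-k."""
    return len(set(top_k_titles(mine, k)) & set(top_k_titles(expected, k)))


# Toy results in the format returned by the search function
mine = {"query": "black death", "relevant_article_titles": [
    ["Pandemic", 0.04], ["Pandemic", 0.04], ["Cholera", 0.01], ["Spanish flu", 0.008]]}
expected = {"query": "black death", "relevant_article_titles": [
    ["Pandemic", 0.047], ["Cholera", 0.0145], ["Antonine Plague", 0.0132]]}

print(top_k_titles(mine))
print(top_k_overlap(mine, expected))
```
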

In [646]:
# Check how well this search performs for multi-word queries:
results = search(query = "world health organization")
results

{'query': 'world health organization',
 'relevant_article_titles': [['Basic reproduction number',
   0.0017412627311117635],
  ['1929–1930 psittacosis pandemic', 0.0016544015885326235],
  ['Event 201', 0.0016429198403741917],
  ['Event 201', 0.0016429198403741917],
  ['Event 201', 0.0016429198403741917],
  ['Science diplomacy and pandemics', 0.0016395346300336333],
  ['Science diplomacy and pandemics', 0.0016395346300336333],
  ['Science diplomacy and pandemics', 0.0016395346300336333],
  ['Disease X', 0.0015825540903668617],
  ['Disease X', 0.0015825540903668617],
  ['Disease X', 0.0015825540903668617],
  ['Cholera', 0.0015811471307073243],
  ['Pandemic severity index', 0.0015604349960286247],
  ['Epidemiology of HIV/AIDS', 0.001544404664042881],
  ['COVID-19 pandemic', 0.0015344736613540093],
  ['Spanish flu', 0.0015079630243867441],
  ['Crimson Contagion', 0.0014762184447524571],
  ['Swine influenza', 0.0013703364096327216],
  ['Swine influenza', 0.0013703364096327216],
  ['Swine in

In [642]:
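# Assumption: `title` is not defined in the cells shown here,
# so it is set to the top-ranked title from the previous search
title = results['relevant_article_titles'][0][0]
for s in summaries:
    if s["title"] == title:
        print(s["text"])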

In [641]:
# Let's try another multi-word query
search(query = "Ebola virus")

{'query': 'Ebola virus',
 'relevant_article_titles': [['Plague of Cyprian', 0.0016430040145475272],
  ['Plague of Cyprian', 0.0016430040145475272],
  ['Science diplomacy and pandemics', 0.0016395346300336333],
  ['Virus', 0.0016226086401492994],
  ['Disease X', 0.0015825540903668617],
  ['Epidemiology of HIV/AIDS', 0.001544404664042881],
  ['Viral load', 0.0015372392135629888],
  ['COVID-19 pandemic', 0.0015344736613540093],
  ['Spanish flu', 0.0015079630243867441],
  ['Crimson Contagion', 0.0014762184447524571],
  ['HIV/AIDS in Yunnan', 0.001446525799411811],
  ['HIV/AIDS', 0.0014317072468028223],
  ['Swine influenza', 0.0013703364096327216]]}

In [None]:
for s in summaries:
    if s["title"] == 'Virus':
        print(s["text"])