# M1 Text Processing Using spaCy Library

## Objective

Preprocess the dataset using the spaCy library.

## Load all relevant Python libraries and a spaCy language model.

In [1]:
import json
import spacy

In [5]:
# !python -m spacy download en_core_web_sm

In [26]:
sp = spacy.load("en_core_web_sm")
stopwords = sp.Defaults.stop_words

## Open the provided JSON file.

It contains a list of dictionaries with summaries from Wikipedia articles, where each dictionary has three key-value pairs:

- title: Title of the Wikipedia article the text is taken from.


- text: Wikipedia article text. (In this dataset we included only the summary.)


- url: Link to the Wikipedia article.

In [10]:
with open('data/data.json', 'r', encoding='utf-8') as infile:
    summaries = json.load(infile)
print(summaries[0].keys())

dict_keys(['title', 'text', 'url'])
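
Printing the keys of the first record only confirms the structure of that one record. A minimal sketch of a check over the whole list is shown below; the helper name `check_records` and the sample records are illustrative assumptions and are not part of the provided dataset.

```python
# Hypothetical helper: verify that each record has exactly the expected keys.
EXPECTED_KEYS = {"title", "text", "url"}

def check_records(records):
    """Return the indices of records whose keys differ from EXPECTED_KEYS."""
    return [
        i for i, record in enumerate(records)
        if set(record.keys()) != EXPECTED_KEYS
    ]

# Made-up sample records standing in for the contents of data/data.json
sample = [
    {"title": "A", "text": "Some text.", "url": "https://example.org/a"},
    {"title": "B", "text": "No link here."},  # missing 'url'
]
bad_indices = check_records(sample)
print(bad_indices)
```

In the notebook, calling `check_records(summaries)` should return an empty list if every record follows the described layout.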


In [12]:
summaries[0]['text']

'A pandemic (from Greek πᾶν, pan, "all" and δῆμος, demos, "people") is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people. A widespread endemic disease with a stable number of infected people is not a pandemic. Widespread endemic diseases with a stable number of infected people such as recurrences of seasonal influenza are generally excluded as they occur simultaneously in large regions of the globe rather than being spread worldwide.\nThroughout human history, there have been a number of pandemics of diseases such as smallpox and tuberculosis. The most fatal pandemic in recorded history was the Black Death (also known as The Plague), which killed an estimated 75–200 million people in the 14th century. The term was not used yet but was for later pandemics including the 1918 influenza pandemic (Spanish flu). Current pandemics include COVID-19 (SARS-CoV-2) and HIV/AIDS.'

## Create a Python function that takes in a text string, performs the operations listed below, and outputs a list of tokens (lemmas).

- Lowercases the text string.


- Creates a spaCy document with the text lemmas and their attributes using a spaCy model of your choice.


- Removes stop words, punctuation, and other unclassified lemmas.


- Returns a list of tokens (lemmas) found in the text.

In [45]:
# Lowercase the text
# Explore the attributes of each token returned by spaCy
text = summaries[0]['text']
text_tokenized = sp(text.lower())
for token in text_tokenized[:5]:
    print(type(token), token.text, token.pos_, token.dep_, token.lemma_)

<class 'spacy.tokens.token.Token'> a DET det a
<class 'spacy.tokens.token.Token'> pandemic ADJ nsubj pandemic
<class 'spacy.tokens.token.Token'> ( PUNCT punct (
<class 'spacy.tokens.token.Token'> from ADP prep from
<class 'spacy.tokens.token.Token'> greek ADJ amod greek


In [46]:
text_tokenized

a pandemic (from greek πᾶν, pan, "all" and δῆμος, demos, "people") is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people. a widespread endemic disease with a stable number of infected people is not a pandemic. widespread endemic diseases with a stable number of infected people such as recurrences of seasonal influenza are generally excluded as they occur simultaneously in large regions of the globe rather than being spread worldwide.
throughout human history, there have been a number of pandemics of diseases such as smallpox and tuberculosis. the most fatal pandemic in recorded history was the black death (also known as the plague), which killed an estimated 75–200 million people in the 14th century. the term was not used yet but was for later pandemics including the 1918 influenza pandemic (spanish flu). current pandemics include covid-19 (sars-cov-2) and hiv/aids.

In [20]:
# Check whether unknown (made-up) words lack a dependency label (token.dep_),
# which would mark them as tokens to remove; the output shows they still get one
text = 'jarvel in the gaabe'
text_tokenized = sp(text.lower())
for token in text_tokenized:
    print(token.text, token.dep_)

jarvel ROOT
in prep
the det
gaabe pobj


In [49]:
def lower(text):
    return sp(text.lower())

In [47]:
# Remove stop words and punctuation
def remove_redundant_tokens(text_tokenized):
    return ' '.join([token.text for token in text_tokenized if not token.is_stop and not token.is_punct])

In [50]:
remove_redundant_tokens(lower(summaries[0]['text']))

'pandemic greek πᾶν pan δῆμος demos people epidemic infectious disease spread large region instance multiple continents worldwide affecting substantial number people widespread endemic disease stable number infected people pandemic widespread endemic diseases stable number infected people recurrences seasonal influenza generally excluded occur simultaneously large regions globe spread worldwide \n human history number pandemics diseases smallpox tuberculosis fatal pandemic recorded history black death known plague killed estimated 75–200 million people 14th century term later pandemics including 1918 influenza pandemic spanish flu current pandemics include covid-19 sars cov-2 hiv aids'

In [43]:
# Lemmatize the cleaned text
def lemmatize(text):
    # Parse the string passed in instead of relying on the global text_tokenized
    return ' '.join([token.lemma_ for token in sp(text)])

In [52]:
lemmatize(remove_redundant_tokens(lower(summaries[0]['text'])))

'pandemic greek πᾶν pan δῆμος demos people epidemic infectious disease spread large region instance multiple continent worldwide affect substantial number people widespread endemic disease stable number infect people pandemic widespread endemic disease stable number infect people recurrence seasonal influenza generally exclude occur simultaneously large region globe spread worldwide \n  human history number pandemic disease smallpox tuberculosis fatal pandemic record history black death know plague kill estimate 75–200 million people 14th century term later pandemic include 1918 influenza pandemic spanish flu current pandemic include covid-19 sar cov-2 hiv aid'

In [53]:
# Build a tokenizer function
def tokenizer(document):
    """
    Accept a text string and:
    1. Lowercase it
    2. Remove redundant tokens (stop words and punctuation)
    3. Lemmatize the remaining tokens
    Returns the lemmas as a single space-separated string.
    """
    text_tokenized = lower(document)
    clean_text = remove_redundant_tokens(text_tokenized)
    token_lemmatized = lemmatize(clean_text)
    return token_lemmatized
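
The exercise asks for a list of lemmas, whereas `tokenizer` returns a space-joined string and parses the text twice. The sketch below is one possible single-pass variant, not the notebook's method: it filters and lemmatizes in one loop and also drops whitespace tokens such as `\n`. The names `lemmas_from_doc` and `tokenize_to_list` are assumptions, and `nlp` stands for any loaded spaCy pipeline, for example the `sp` object created earlier.

```python
def lemmas_from_doc(doc):
    """Return lowercase lemmas for tokens that are not stop words, punctuation, or whitespace."""
    return [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]

def tokenize_to_list(text, nlp):
    """Lowercase `text`, parse it once with the pipeline `nlp`, and return a list of lemmas."""
    return lemmas_from_doc(nlp(text.lower()))

# Usage with the model loaded earlier in this notebook (requires en_core_web_sm):
# tokenize_to_list(summaries[0]['text'], sp)
```

Because the lemmas here are computed while the stop words are still in the sentence, the result may differ slightly from the output of `tokenizer`, which lemmatizes the text after stop words have been removed.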

## Use this function to preprocess all text documents in the dataset (text field only), and add the resulting lists to the dictionaries from step 1. 

You should end up with a list of dictionaries, each of which now has four key-value pairs:

- title: Title of the Wikipedia article the text is taken from.


- text: Wikipedia article text. (In this dataset we included only the summary.)


- tokenized_text: Tokenized Wikipedia article text.


- url: Link to the Wikipedia article.

In [54]:
# Preprocess all the documents using the tokenizer function
for doc in summaries:
    doc['tokenized_text'] = tokenizer(doc['text'])
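
The loop above updates `summaries` in place and appends the new key after `url`. If the original records should stay untouched, or the key order title, text, tokenized_text, url should be preserved in the saved file, a sketch along these lines may help. The function name `with_tokenized_text` and the stand-in tokenizer in the example are assumptions made for illustration.

```python
def with_tokenized_text(records, tokenize):
    """Return new dictionaries with 'tokenized_text' placed between 'text' and 'url'."""
    return [
        {
            "title": record["title"],
            "text": record["text"],
            "tokenized_text": tokenize(record["text"]),
            "url": record["url"],
        }
        for record in records
    ]

# Stand-in tokenizer (str.lower) so the example runs without a spaCy model;
# in the notebook, pass the `tokenizer` function defined above instead.
records = [{"title": "T", "text": "Black Death", "url": "https://example.org"}]
processed = with_tokenized_text(records, str.lower)
print(list(processed[0].keys()))
```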

## Save the new list of dictionaries in JSON format.

In [60]:
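# Save the tokenized text
# json.dump returns None, so its result is not assigned back to summaries;
# the with block closes the file automatically
with open('data/tokenized_text.json', 'w') as f:
    json.dump(summaries, f)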
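
The summaries contain non-ASCII characters such as the Greek words in the first text. By default `json.dump` writes these as `\uXXXX` escapes, which round-trips correctly but is harder to read in the file. The following sketch shows one way to keep them readable and to confirm that the saved file reloads to the same data. The record and the temporary path are made up for the example.

```python
import json
import os
import tempfile

# Made-up record; the Greek word mirrors the non-ASCII text in the first summary
records = [{"title": "Pandemic", "tokenized_text": "pandemic greek πᾶν"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenized_text.json")
    # ensure_ascii=False writes 'πᾶν' as-is instead of \uXXXX escapes
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    # Read the file back to check that nothing was lost
    with open(path, "r", encoding="utf-8") as f:
        raw = f.read()
    reloaded = json.loads(raw)

print(reloaded == records)
```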

In [None]:
# Build and save a corpus vocabulary
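
The cell above is left empty in the notebook. As a hedged sketch of what building a vocabulary could look like, the example below counts tokens across space-joined strings in the same format as `tokenized_text`. The function name `build_vocabulary`, the sample texts, and the output path in the trailing comment are all assumptions.

```python
from collections import Counter

def build_vocabulary(tokenized_texts):
    """Count how often each whitespace-separated token occurs across all texts."""
    counts = Counter()
    for text in tokenized_texts:
        counts.update(text.split())
    return counts

# Made-up tokenized strings in the same space-joined format as 'tokenized_text'
example_texts = ["pandemic disease spread", "disease kill people"]
vocabulary = build_vocabulary(example_texts)
print(vocabulary.most_common(1))

# Possible way to save it with the json module imported earlier (path is an assumption):
# with open('data/vocabulary.json', 'w', encoding='utf-8') as f:
#     json.dump(dict(vocabulary), f, ensure_ascii=False)
```

In the notebook, the input would be `[doc['tokenized_text'] for doc in summaries]`.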
