# Python for Natural Language Processing (NLP)

A hands-on tutorial from [GW Libraries & Academic Innovation](https://library.gwu.edu)

Created & presented by Dolsy Smith, dsmith@gwu.edu.


## How to use this tutorial

This tutorial is designed for those with some prior exposure to the Python programming language and the Jupyter environment. If you're looking for an introduction to Python, check out GW LAI's [workshops on programming](https://library.gwu.edu/events?f%5B0%5D=series%3A252), which include curricula for beginners, such as [Python Camp](https://library.gwu.edu/events/python-camp-1).

1. Download this notebook and open it with a Jupyter notebook running Python 3.
   - To download from GitHub, click the `Raw` button on the menu bar directly above the title of this notebook.
   - You should see a screen with a lot of indented text and curly braces. Right click on the screen and select `Save As...` or `Save Page As...` from your browser's pop-up menu.
   - Save the `.ipynb` file in a folder where you can access it from your local Jupyter environment.
2. **OR** open this Google Colab [notebook]() and copy it to your Google Drive.
3. Execute the code cells, following the instructions above each cell where provided.
4. If a cell is marked as a **video**, run the cell to load the embedded video content, then watch the video for an exposition of key concepts.

## Introduction & Setup
#### Video 1: Welcome

Run the cell below and watch the embedded video for an introduction to this workshop.

In [13]:
from IPython.display import IFrame
IFrame('https://player.vimeo.com/video/672086906?h=cdae634450', width="640", height="360")

#### Video 2: What is a _natural_ language?

Run the cell below and watch the video.

In [12]:
IFrame('https://player.vimeo.com/video/672086622?h=34079fb3f7', width="640", height="360")

### Installing spaCy

[spaCy](https://spacy.io/) is a Python library for NLP, combining a user-friendly interface with high performance. 

Below we `pip install` the library in our local Python environment.

In [None]:
!pip install -U spacy

spaCy processes text with the aid of **models**, which are files containing numerical weights derived from neural networks. These networks are trained on linguistic and textual features like parts-of-speech and named entities (which we'll discuss below). 

In this tutorial, we'll use one of the pre-trained models for parsing English. Because the models involve large files, they require downloading separately. 

In [None]:
!python -m spacy download en_core_web_md

A number of models are available for languages other than English. See the [spaCy documentation](https://spacy.io/usage/models) for more information. You can also train your own models, if you are working with special kinds of texts. 

Now we can import `spacy` and load our model. This may take a few moments.

In [None]:
import spacy
nlp = spacy.load('en_core_web_md')

If you get an error on loading the model, try running this code instead:

```
import spacy
import en_core_web_md
nlp = en_core_web_md.load()
```

We'll also import some other Python libraries that will be helpful for our analysis. 

In [None]:
import pandas as pd

The **pandas** library may already be installed if you're using a Colab notebook or an Anaconda distribution of Python. Otherwise, install it as follows before running the `import` command.

```
!pip install pandas
```

In [None]:
# These modules are part of the standard library and should require no installation
from collections import defaultdict, Counter
import json

### Getting the data

#### Video 3: NLP for Twitter data

Run the cell below and watch the video.

In [5]:
IFrame("https://player.vimeo.com/video/672082587?h=a3815c9039" , width="640", height="360")

You can see the code I used to prepare the dataset [here](https://github.com/gwu-libraries/gwlibraries-workshops/blob/master/text-analysis-python/workshop_data_prep.ipynb).

The dataset for this tutorial is shared with the GW community via Box.com. Here's how to access and use it:
1. Follow [this link](https://gwu.box.com/shared/static/ngzd15ylkzrfi7i9ihs86jyprc72wshi.json) and log into Box with your GW NetID and password.
2. Complete the two-factor authentication step (if necessary) to verify your identity.
3. The download should begin automatically, or else a pop-up will ask you what you want to do with the file. If prompted, choose the option to save it to your computer.
4. The file is called `senate-tweetset-sample-2020.json`. Locate this file in your Downloads folder (or wherever you have saved it on your computer).
5. If you are running this notebook in your own Jupyter environment (not in the cloud), move the file to the same folder where this notebook is saved.
6. If you are running this notebook in Google Colab, do the following:
   
   a. From the menu on the left-hand side of your notebook, click the folder icon to open the Files panel.
   
   b. Click the file upload button and use the files browser to select the JSON file from step 4 from the folder where you have saved it on your computer. 
   
   c. After uploading, `senate-tweetset-sample-2020.json` should appear in the Files panel of your Colab notebook. 

Now we're ready to do some NLP!

## Processing text with spaCy

### Loading the JSON dataset

The first step is to load the JSON file you made available to your Jupyter/Colab environment. We'll use the `json` library to do this.

In [None]:
with open('senate-tweetset-sample-2020.json', 'r') as f:
    tweets = json.load(f)

If you get a `FileNotFoundError` when running the cell above, make sure that the file you downloaded is in the same folder as your Jupyter notebook (or that you have uploaded it to your Colab notebook environment).
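
If the error persists, it can help to check which folder Python is actually looking in. The following is a minimal sketch using the standard library's `pathlib` module; the helper name `find_data_file` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def find_data_file(filename):
    '''Return the full path to filename if it exists in the current working directory, else None.'''
    path = Path.cwd() / filename
    return path.resolve() if path.is_file() else None

data_path = find_data_file('senate-tweetset-sample-2020.json')
if data_path is None:
    # Path.cwd() is the folder that a relative filename is resolved against
    print('File not found in', Path.cwd())
else:
    print('Found', data_path)
```

If the file is reported as missing, move it into the folder that is printed, or pass its full path to `open`.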

Our `tweets` variable should now hold a list of Python dictionaries. Let's inspect the first one.

In [None]:
tweets[0]

Note that the text of each tweet is accessible via the `full_text` key. 

In [None]:
tweets[0]['full_text']
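
To make the structure concrete, here is a small sketch that parses a made-up JSON sample of the same general shape: a list of dictionaries, one per Tweet. The field names `name`, `full_text`, and `retweet_count` are the ones used in this tutorial; the values are invented, and the real records contain additional fields.

```python
import json

# A made-up two-record sample; the values are invented for illustration
sample_json = '''
[
  {"name": "Example Senator A", "full_text": "Join us Monday for a town hall.", "retweet_count": 12},
  {"name": "Example Senator B", "full_text": "Thank you to our first responders!", "retweet_count": 40}
]
'''

# json.loads parses a string; json.load (used above) reads from an open file
sample_tweets = json.loads(sample_json)

print(type(sample_tweets))               # the outer container is a list
print(type(sample_tweets[0]))            # each element is a dictionary
print(sorted(sample_tweets[0].keys()))   # the keys available on one record
print(sample_tweets[0]['full_text'])
```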

### Processing the text

Processing a Python string with spaCy yields an object called a `Doc`. Since our dataset consists of several thousand documents, we can use the `.pipe` method, which processes a collection of strings in batches and is more efficient than calling `nlp` on each string one at a time.

First we create a list of just the strings in the `full_text` field of each Tweet. Then we pass that list to `nlp.pipe`. 

In [None]:
texts = [tweet['full_text'] for tweet in tweets]
docs = list(nlp.pipe(texts))

We wrap the result of the `nlp.pipe` method call in a call to the Python `list` function, which will allow us to access each parsed document by its position in the list.
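
The sketch below illustrates the same pattern on two made-up strings. To keep it small, it assumes a blank English pipeline created with `spacy.blank('en')`, which only tokenizes and needs no model download; the tutorial itself uses `en_core_web_md`. The `batch_size` value is illustrative.

```python
import spacy

# A blank English pipeline: tokenizer only (an assumption made for this sketch)
blank_nlp = spacy.blank('en')

sample_texts = ['Hello world!', 'Tokenization splits text into tokens.']

# nlp.pipe returns a generator that yields Doc objects as it works through the texts
doc_stream = blank_nlp.pipe(sample_texts, batch_size=50)

# list() consumes the generator so that we can index into the results
sample_docs = list(doc_stream)

print(len(sample_docs))
print([token.text for token in sample_docs[0]])
```

Because a generator can be consumed only once, converting it to a list is what allows us to return to any document later by its position.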

**Note** 

We did not overwrite our `tweets` variable because the original dataset contains useful metadata such as the account that authored each Tweet. Now we have two lists, `tweets` and `docs`. As long as we keep the two lists intact and in the same order, we can reference back from the second to the first when we need any of those metadata elements.
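
As a rough illustration of how the two lists stay linked, the sketch below pairs hypothetical records with stand-ins for their parsed documents. Plain uppercase strings take the place of spaCy `Doc` objects here so that the example runs on its own.

```python
# Stand-ins for the real data: two hypothetical Tweet records
sample_tweets = [
    {'name': 'Example Senator A', 'full_text': 'First example text.'},
    {'name': 'Example Senator B', 'full_text': 'Second example text.'},
]
# In the notebook these would be spaCy Doc objects; uppercase strings stand in for them
sample_docs = [tweet['full_text'].upper() for tweet in sample_tweets]

# The same position in both lists refers to the same Tweet
for i, (tweet, doc) in enumerate(zip(sample_tweets, sample_docs)):
    print(i, tweet['name'], '->', doc)

# Look up the author of the document at a given index
author_of_second = sample_tweets[1]['name']
print(author_of_second)
```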

### NLP fundamentals

#### Video 4: Tokenization

Watch the video below for an explanation of _tokens_ in NLP.

In [6]:
IFrame("https://player.vimeo.com/video/672083751?h=3a5f963601" , width="640", height="360")

#### Video 5: Stopwords

Watch the video below for an explanation of _stopwords_ in NLP.

In [7]:
IFrame("https://player.vimeo.com/video/672085745?h=31a7425a86" , width="640", height="360")

#### Filtering stopwords and other kinds of tokens

With spaCy, we can use a token's properties to filter out those that may not be relevant to our analysis. In the cell below, we create a function that accepts an instance of the spaCy `Doc` class as its argument and returns only those tokens in the document that are **not** one of the following:
 - punctuation
 - white space
 - stopwords
 - URLs
 
**Note**
 
We filter on white space because even though spaCy removes the white space between words when tokenizing, it treats any extra white spaces, like tabs and line breaks, as separate tokens.

In [None]:
def remove_stops(doc):
    '''
    Returns a list of spaCy tokens, excluding stopwords and other less semantically relevant content
    :param doc: a spaCy Document object
    '''
    tokens = []
    for token in doc:
        if not token.is_stop and not token.is_space and not token.is_punct and not token.like_url:
            tokens.append(token)
    return tokens

Now let's test our function on a Tweet.

In [None]:
remove_stops(docs[0])

In the original text, `Monday, May 4` is given as a date. Our function kept the `4` but discarded `May`: it doesn't have a way to distinguish `May` the month from `may` the auxiliary verb, which is a stopword.

You can [augment or even replace](https://spacy.io/usage/linguistic-features#language-subclass) spaCy's built-in list of stopwords with your own. For instance, if your text has a lot of dates in it, you may want to remove `may` from the list.

We could also weed out tokens like `4` by checking the `Token.is_digit` and/or `Token.like_num` flags.
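
Both adjustments can be sketched together. This example assumes a blank English pipeline (`spacy.blank('en')`), which provides the tokenizer and lexical flags without a model download; the same attributes are available on the `nlp` object loaded earlier. The function name `remove_stops_and_numbers` is hypothetical, and clearing the flag on the vocabulary entry is one possible approach rather than the only one.

```python
import spacy

blank_nlp = spacy.blank('en')   # tokenizer and lexical flags only (assumption for this sketch)

# Clear the stopword flag on the vocabulary entry for the capitalized form "May".
# The flag is stored per exact string, so lowercase "may" is left as a stopword.
blank_nlp.vocab['May'].is_stop = False

def remove_stops_and_numbers(doc):
    '''Hypothetical variant of remove_stops that also drops number-like tokens.'''
    return [
        token for token in doc
        if not (token.is_stop or token.is_space or token.is_punct
                or token.like_url or token.like_num)
    ]

sample_doc = blank_nlp('Town hall on Monday, May 4')
print([token.text for token in remove_stops_and_numbers(sample_doc)])
```

Note that `like_num` is broader than `is_digit`: it also matches number words such as `ten`, so which flag to use depends on what counts as noise in your texts.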

#### Lemmas and parts-of-speech

spaCy is more than just a tokenizer. When we pass a string to the `nlp` function, it analyzes the text using a series of models. One of these models tags every token with its grammatical part of speech (POS). The models are probabilistic, so depending on the nature of the text, the results may be more or less accurate.

We can view the POS tags using the `.pos_` attribute of any given token. (Note the underscore at the end of the attribute!) Definitions of the tags are available on the website of the [Universal Dependencies project](https://universaldependencies.org/docs/u/pos/).

In [None]:
{token.text: token.pos_ for token in docs[0]}

Another useful piece of token metadata is the _lemma_ of each word, which is a normalized form intended to facilitate comparisons between different grammatical inflections (plurals, verb tense, etc.). We can access it via the `Token.lemma_` attribute.

In [None]:
{token.text: token.lemma_ for token in docs[0]}

We can see that lemmatization in spaCy leaves proper nouns (like the `Groups` in `Toyota Working Groups`) alone. But `practiced` becomes `practice`, `authorities` becomes `authority`, and `best` becomes `good`. 

### Named entities

#### Video 6: Named entities 

In [8]:
IFrame("https://player.vimeo.com/video/672077809?h=4008747699" , width="640", height="360")

A spaCy `Doc` has a property called `ents` that returns only the named entities recognized for that document. Each entity has a `label_` attribute that identifies [the classification](https://spacy.io/usage/linguistic-features#accessing-ner) assigned to it.

Below we use a dictionary comprehension to create a dictionary of all the entities in the documents in our dataset.

In [None]:
entities = {ent.text: ent.label_ for doc in docs for ent in doc.ents}

In [None]:
entities

Using the built-in visualizer, we can view a document with its named entities highlighted, each accompanied by the category (the `label_`) it has been assigned.

In [None]:
from spacy import displacy

In [None]:
displacy.render(docs[0], style='ent', jupyter=True)

#### Exercise

Use `displacy` to examine a few different tweets and see how the named-entity recognition has performed. What do you notice? Is it particularly bad at recognizing or classifying certain kinds of entities?

#### Analyzing named entities

Using spaCy's tagging, we can begin to look at our dataset as **a collection** of texts, rather than text by text. 

Below we write a function that calculates the number of times each unique entity appears in our dataset, classified by entity label. Our function should accept a list of spaCy `Doc` objects and return a dictionary such that `ent_dict['PERSON']` (for instance) shows all the unique `PERSON` entities and their frequency in the collection.

To do this, we'll use a couple of special Python types from the `collections` module. (We imported them earlier.)
- `defaultdict` creates a Python dictionary that supplies a default **value** for any key it hasn't seen yet, by calling a function or type you provide (such as `list`, `dict`, or `Counter`). It provides a convenient way to make a nested data structure.
- `Counter` is a Python dictionary that treats any missing key as having a count of 0. It's useful for counting a collection of objects.
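
Before applying these types to spaCy entities, here is a minimal standalone sketch of how they behave. The `(label, text)` pairs are made up and stand in for the entities that spaCy would return.

```python
from collections import Counter, defaultdict

# A Counter returns 0 for a key it hasn't seen, so we can increment without initializing
letter_counts = Counter()
for letter in 'banana':
    letter_counts[letter] += 1
print(letter_counts.most_common(2))   # [('a', 3), ('n', 2)]

# defaultdict(Counter) creates a fresh Counter the first time a new key is accessed
nested = defaultdict(Counter)

# Made-up (label, text) pairs standing in for spaCy entities
sample_ents = [('PERSON', 'Ada Lovelace'), ('ORG', 'Senate'), ('PERSON', 'Ada Lovelace')]
for label, text in sample_ents:
    nested[label][text] += 1

print(nested['PERSON'])   # Counter({'Ada Lovelace': 2})
```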

In [None]:
def count_ents(docs):
    '''
    Creates a nested dictionary: the top-level contains types of named entities. 
    Each named-entity entry contains a count of how many times each particular entity appears in the dataset.
    :param docs: A list of spaCy document objects
    '''
    ent_dict = defaultdict(Counter) # Initialize our nested dictionary with a Counter
    for doc in docs:
        for ent in doc.ents: # Loop over doc.ents to get the entities, not over doc, which returns the tokens
            label = ent.label_ # The descriptive label/category
            text = ent.text # The string representation of the entity; note that entities aren't lemmatized
            ent_dict[label][text] += 1 # Increment the Counter associated with that label
    return ent_dict

In [None]:
ent_dict = count_ents(docs)
ent_dict['PERSON'].most_common(20)

If we run the above to find the top 20 persons, we can observe a few things:
- `#COVID19` is mistakenly labeled as a `PERSON`.
- Named entities aren't lemmatized, so names like `Bolton` and `John Bolton` show up as separate entities.

You can update individual entities to correct their classification, but this has to be done at the document level, so it would require writing code to update each document where the entity appears. You can also train your own entity model, but that requires a pre-tagged dataset. See the [spaCy docs](https://spacy.io/usage/linguistic-features#named-entities) for more info.

### Word and document similarity

#### Video 7: Word vectors

In [9]:
IFrame("https://player.vimeo.com/video/672086104?h=7586ab4062"  , width="640", height="360")

#### Analyzing document similarity with word vectors

spaCy's token- and document-level vectors are derived from a technique called [word2vec](https://en.wikipedia.org/wiki/Word2vec). These vectors are represented in spaCy as NumPy arrays (objects of type `numpy.ndarray`).

We could use these vectors to create a similarity score between every document in our collection and every other document, but that's a fairly intensive computation, since our dataset contains about 8,000 documents.
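
To get a sense of the scale, we can count the unordered pairs of distinct documents. This quick calculation takes the figure of 8,000 from the paragraph above as an approximation.

```python
import math

n_docs = 8000                    # approximate size of the dataset
n_pairs = math.comb(n_docs, 2)   # number of unordered pairs of distinct documents

print(f'{n_pairs:,}')            # 31,996,000
```

That is roughly 32 million similarity computations, compared with about 8,000 when we compare a single document against all the others.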

For illustration, let's pick one Tweet and find other Tweets similar to it. We can use our original dataset to include Tweet metadata in our analysis.

By converting our dataset into a pandas `DataFrame` object, we can leverage the latter's fast indexing and sorting methods to isolate particular documents based on their metadata.

In [None]:
tweet_df = pd.DataFrame.from_records(tweets)

Let's find Elizabeth Warren's most popular Tweet.
1. Filter out all Tweets where the name on the account does not contain the string `Warren`.
2. Sort on the `retweet_count` column to find her most popular Tweet.


In [None]:
warren_df = tweet_df.loc[tweet_df['name'].str.contains('Warren')]
warren_df.sort_values(by='retweet_count', ascending=False)

Because slicing and sorting a `DataFrame` don't change its index, we can use the index label shown in the first row of the sorted output (here, `1358`) to find the corresponding document in our `docs` list for analysis.

In [None]:
docs[1358]

In [None]:
warren_tweet_index = 1358
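
Rather than reading the index off the output and typing it in by hand, we could retrieve it in code. The sketch below demonstrates the idea on a small made-up `DataFrame`; only the column names mirror the tutorial's dataset.

```python
import pandas as pd

# Hypothetical records; the values are invented for illustration
sample_df = pd.DataFrame.from_records([
    {'name': 'Example Senator A', 'retweet_count': 5},
    {'name': 'Elizabeth Warren', 'retweet_count': 10},
    {'name': 'Example Senator B', 'retweet_count': 7},
    {'name': 'Elizabeth Warren', 'retweet_count': 99},
])

subset = sample_df.loc[sample_df['name'].str.contains('Warren')]
ranked = subset.sort_values(by='retweet_count', ascending=False)

# The index labels still refer to positions in the original list of records
top_index = ranked.index[0]
print(top_index)   # 3
```

This works because a `DataFrame` built from a list receives a default index of 0, 1, 2, and so on, matching the positions in the original list.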

Now we write a function that will take 1) a collection of documents, and 2) an index to a document in that collection that serves as the comparator (the _target_). We'll calculate the similarity score between that comparator document and every other document, returning the scores in a `Counter` object.

In [None]:
def compute_scores(docs, idx_of_target): 
    '''
    Creates a similarity score between each document in the collection and the document with the provided index.
    :param docs: a list of spaCy documents
    :param idx_of_target: an integer corresponding to the index in the list of the comparator document
    '''
    sim_scores = Counter()
    # Assumes that our target is in the collection
    target = docs[idx_of_target]  
    # Enumerate lets us keep track of the index of each document in the collection
    for i, doc in enumerate(docs): 
        # Compute the cosine similarity
        score = doc.similarity(target)  
        sim_scores[i] = score
    return sim_scores

In [None]:
scores = compute_scores(docs, warren_tweet_index)

The `scores` object by itself doesn't tell us much, since it references each document only by its index. But it's a `Counter` object, so we can easily find the top N scores. 

And then we can use those entries to slice our `DataFrame` of Tweets to see the Tweet text and associated metadata.

We have to write `score[0]` in the `DataFrame.loc[]` expression because `Counter.most_common` returns a Python tuple, the first element of which is the key -- in this case, the document index.

In [None]:
tweet_df.loc[[score[0] for score in scores.most_common(10)]]
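
To see what `most_common` returns when the values are scores rather than whole-number counts, consider this small sketch with made-up numbers.

```python
from collections import Counter

# Made-up similarity scores keyed by document index
sample_scores = Counter({0: 0.42, 1: 1.0, 2: 0.87, 3: 0.15})

# most_common returns (key, value) tuples, sorted from highest value to lowest
top_two = sample_scores.most_common(2)
print(top_two)        # [(1, 1.0), (2, 0.87)]

# Keep only the keys (the document indexes), dropping the scores
top_indexes = [index for index, score in top_two]
print(top_indexes)    # [1, 2]
```

Keep in mind that the target document is compared with itself as well, so it will normally appear at the top of the results with a score of about 1.0.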

It will be easier to see the full text if we just inspect the `.values` attribute of that column.

We can see that our similarity scoring did a pretty good job of identifying as most similar Tweets about big corporations, billionaires, and unemployment.

In [None]:
tweet_df.loc[[score[0] for score in scores.most_common(10)]]['full_text'].values

Here we've taken our first steps toward building a document classifier! The initial results look pretty good, though as with any use of NLP, it's worth doing more exploration in order to evaluate the accuracy and robustness of the method and possibly to fine-tune it. 

#### Video 8: Word vectors without stopwords

The following video explains the code below for calculating a document similarity score without stopwords, punctuation, white space, or URLs.

In [10]:
IFrame("https://player.vimeo.com/video/672087275?h=81d1d33bf2"  , width="640", height="360")

In [11]:
import numpy as np

In [None]:
def vectorize_without_stops(doc):
    '''
    Creates a document vector based on the vectors of all tokens filtered by the remove_stops function
    :param doc: a spaCy Document object
    '''
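    vectors = np.array([token.vector for token in remove_stops(doc)])
    # If no tokens survive filtering (or every vector is all zeros), return the array unchanged;
    # cosine_sim checks for this case with .any() and scores it as 0
    if not vectors.any():
        return vectors
    return np.mean(vectors, axis=0)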
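
The averaging step can be illustrated with a few made-up "token vectors." Real spaCy vectors are much longer, but the arithmetic is the same.

```python
import numpy as np

# Three made-up token vectors of length 4, one per row
token_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
])

# axis=0 averages down the rows, producing one value per column
doc_vector = np.mean(token_vectors, axis=0)
print(doc_vector)      # [2. 1. 2. 2.]

# An empty array has no nonzero elements, so .any() is False
empty = np.array([])
print(empty.any())     # False
```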

In [None]:
doc_vecs = [vectorize_without_stops(doc) for doc in docs]

In [None]:
def cosine_sim(vec1, vec2):
    '''
    Computes the cosine similarity between two numpy vectors.
    :param vec1, vec2: numpy arrays
    '''
    if not vec1.any() or not vec2.any():
        return 0
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
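
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures whether the vectors point in the same direction regardless of their magnitude. Here is a small worked example with made-up two-dimensional vectors; the function name `cosine` is used only for this sketch.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])    # same direction as a, twice as long
c = np.array([-2.0, 1.0])   # perpendicular to a

def cosine(vec1, vec2):
    # Dot product divided by the product of the vector lengths
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

print(round(float(cosine(a, b)), 6))   # 1.0
print(round(float(cosine(a, c)), 6))   # 0.0
```

A score near 1 indicates vectors pointing the same way, and a score near 0 indicates unrelated directions.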

In [None]:
def compute_scores_2(vectors, idx_of_target): 
    '''
    Computes the scores between a collection of document vectors and the vector of one document in that collection.
    :param vectors: a list of document vectors (numpy arrays)
    :param idx_of_target: an integer representing the index of the document in the collection to be compared against
    '''
    sim_scores = Counter()         
    target_vec = vectors[idx_of_target]   
    for i, vec in enumerate(vectors): 
        score = cosine_sim(target_vec, vec)   # Compute the cosine similarity
        sim_scores[i] = score
    return sim_scores

In [None]:
scores2 = compute_scores_2(doc_vecs, warren_tweet_index)

In [None]:
tweet_df.loc[[score[0] for score in scores2.most_common(10)]]['full_text'].values
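
One simple way to gauge how much the filtering changed the results is to check how many documents appear in both top-N lists. The sketch below uses made-up scores, and the helper name `top_indexes` is hypothetical; with the real data you might pass in `scores` and `scores2` instead.

```python
from collections import Counter

# Made-up scores from two methods, keyed by document index
scores_a = Counter({0: 1.0, 1: 0.9, 2: 0.8, 3: 0.2})
scores_b = Counter({0: 1.0, 3: 0.95, 1: 0.7, 2: 0.1})

def top_indexes(scores, n):
    '''Return the set of keys with the n highest scores.'''
    return {index for index, _ in scores.most_common(n)}

# Set intersection keeps only the indexes found in both top-3 lists
shared = top_indexes(scores_a, 3) & top_indexes(scores_b, 3)
print(sorted(shared))   # [0, 1]
```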

### Resources for further exploration

If you're interested in document classification, there are other approaches that might be of interest. 
- The [gensim](https://radimrehurek.com/gensim_3.8.3/) Python library includes support for topic modeling, as well as tools for creating custom word embeddings.
- spaCy includes a [TextCategorizer](https://spacy.io/api/textcategorizer) component that can be used to train a neural network for text classification.

A wealth of resources exist on natural language processing and NLP with Python, many of them freely available on the web.

The [spaCy documentation](https://spacy.io/) is a great place to start.

For members of the GW community, the [O'Reilly Online Library](https://www.safaribooksonline.com/library/view/temporary-access) provides access to a wide variety of ebooks on this topic, including titles like [Natural Language Processing with Python and spaCy](https://learning.oreilly.com/library/view/natural-language-processing/9781098122652/) (No Starch Press, 2020) and [Practical Natural Language Processing with Python](https://learning.oreilly.com/library/view/practical-natural-language/9781484262467/?ar=) (Apress, 2020).

### Coding assistance for GW affiliates

Current GW faculty, students, and staff can [get assistance](https://library.gwu.edu/program-and-code) with research projects involving Python by scheduling a [coding consultation](https://calendly.com/gwul-coding).