## Python for Natural Language Processing (NLP)

**GW Libraries and Academic Innovation**

Friday, February 12, 2021

### Workshop goals

This workshop introduces some fundamental concepts of natural-language processing through a hands-on approach with Python.

By the conclusion of this workshop, you will have worked through the following:
- Processing a collection of texts with the `spacy` Python library
- Exploring core features of `spacy` for NLP, including tokenization, part-of-speech tagging, and named-entity recognition
- Computing basic metrics about the dataset using some of these features
- Exploring vector-based word representations as a measure of document similarity

### Tips for using this Google Colab notebook

When working in a Google Colaboratory notebook, `Shift-Return` (`Shift-Enter`) runs the cell you're on. You can also run the cell using the `Play` button at the left edge of the cell.

There are many other keyboard shortcuts. You can access the list via the menu bar, at `Tools`-->`Command palette`. In fact, you can even customize your keyboard shortcuts using `Tools`-->`Keyboard shortcuts`.

(If you're working in an Anaconda/Jupyter notebook: 
 - `Control-Enter` (`Command-Return`) runs the cell you're on. You can also run the cell using the `Run` button in the toolbar.
 - `Esc`, then `A` inserts a cell above where you are.
 - `Esc`, then `B` inserts a cell below where you are.
 - More shortcuts under `Help` --> `Keyboard Shortcuts`)

You will probably get some errors in working through this notebook. That's okay; you can just go back, change the cell, and re-run it.

The notebook auto-saves as you work, just like Gmail and most Google apps.

### Introduction

#### Defining NLP

For the purposes of this workshop, natural-language processing (NLP for short) refers to the use of computational methods to analyze samples of written or spoken language produced by human beings. The term _natural_ refers to the fact that the languages in question are those that have emerged from the crucible of human speech and social interaction. An example of a language **not** considered natural would be a programming language, such as Python. Computers, of course, regularly process these kinds of languages; in a sense, that is what they are designed to do. 

Natural languages, for many reasons, some of which we will explore below, prove far less amenable to computational processing.

#### Why NLP?

- To make computers better at talking to us / thinking like us (semantic search, artificial intelligence)
- To improve digital interfaces (voice recognition)
- To automate human communicative tasks (chatbots, autocomplete, text generation)
- To understand human discourse computationally (textual analysis)
- To enhance surveillance, monitoring, and control (content filtering, predictive analysis)

### Setup

[spaCy](https://spacy.io/) is a Python library for NLP. Other libraries exist that provide similar functionality, most notably [NLTK](https://www.nltk.org/). But spaCy combines a user-friendly interface with high performance (the _Cy_ in _spaCy_ alludes to the fact that under the hood, a lot of the library is written in Cython, which is a hybrid of Python and the programming language C).

If you haven't used it before, you'll probably need to install `spaCy`. (The latest version is 2.3.)

In [None]:
!pip install -U spacy

`spacy` processes text with the aid of **models**, which are files containing numerical weights derived from neural networks. These networks are trained on linguistic and textual features like parts-of-speech and named entities (which we'll discuss below). 

A number of models are available for languages other than English. See the [spaCy documentation](https://spacy.io/usage/models) for more information. You can also train your own models, if you are working with special kinds of texts. 

But in this workshop, we'll use one of the pre-trained models for parsing English. Because the models are rather large, they must be downloaded separately.

We'll use the "medium" model for English, because the "small" model lacks some features it will be useful to explore. For your own projects, you might want to try the "large" model, which is the most fully featured and most accurate.

In [None]:
!python -m spacy download en_core_web_md

Now we can import `spacy` and load our model. This may take a few moments.

If you get an error on loading the model, try this instead:

```
import spacy
import en_core_web_md
nlp = en_core_web_md.load()
```

In [1]:
import spacy
nlp = spacy.load('en_core_web_md')

We'll also import some other Python libraries that will be helpful for our analysis. 

`pandas`, `numpy`, `matplotlib`, and `requests` may already be installed, if you're using a Colab notebook or an Anaconda distribution of Python. Otherwise, you can install them first as follows:

```
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install requests
```

The `collections` library should be part of the standard Python installation (no need to install separately).

In [2]:
import pandas as pd
import numpy as np

In [3]:
import requests
from collections import defaultdict, Counter

### Getting the data

Now we need some textual data to work with. NLP generally works best on clean texts without typos, special characters, etc. But those often aren't the kinds of text we want to process. To simulate something more interesting, we're going to use a Twitter dataset in this workshop.

The tweets in this dataset are from the official accounts of members of the United States Senate, collected between January 1, 2020 and May 7, 2020 by GW LAI's Social Feed Manager project and available on [TweetSets](https://tweetsets.library.gwu.edu/). 

I have downsampled this dataset to a size more manageable for live coding. The initial dataset contained 40,000 tweets. I have also removed most of the metadata from the original tweets, keeping just a few fields (in addition to the full text). 

If you wish to replicate my extraction and sampling process, see [this notebook](). **ADD LINK**

We'll use the `requests` library to fetch the tweet data as a JSON file and convert it to Python objects.

In [4]:
resp = requests.get('https://raw.githubusercontent.com/gwu-libraries/gwlibraries-workshops/master/text-analysis-python/senate-tweetset-sample-2020.json')

In [5]:
tweets = resp.json()

Note that the data is a list of dictionaries, and that the text of each tweet is accessible via the `full_text` key. 
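To make that structure concrete, here is a minimal sketch that uses two made-up records in place of the downloaded data. The field names mirror columns that appear later in this notebook (`full_text`, `retweet_count`, `name`, `screen_name`); the values and the name `sample_tweets` are invented for illustration.

```python
# Hypothetical stand-in for the `tweets` list returned by resp.json().
# The values are invented; only the shape (a list of dicts) matters.
sample_tweets = [
    {"full_text": "An example tweet.", "retweet_count": 3,
     "name": "Senator A", "screen_name": "SenatorA"},
    {"full_text": "Another example tweet.", "retweet_count": 7,
     "name": "Senator B", "screen_name": "SenatorB"},
]

# Each element is a dictionary, so individual fields are looked up by key.
first_text = sample_tweets[0]["full_text"]
print(first_text)

# A list comprehension collects one field from every record.
all_texts = [tweet["full_text"] for tweet in sample_tweets]
print(all_texts)
```

The same indexing pattern (`tweets[0]['full_text']`) works on the real data once it has been fetched.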

### Processing the text

Processing a text with spaCy creates an object called a `Doc`. This process is computationally expensive, though spaCy is highly optimized, and it includes the special `.pipe` method for processing many texts efficiently as a batched stream.

**Note** When I say "text," I mean a Python **string**. spaCy cannot process other Python data types. 

We can pass a list of texts to `.pipe` instead of iterating over them with a `for` loop, and we will get back a collection of `Doc` objects.

Here I'm using a list comprehension to create a list of just the strings in the `full_text` field of each tweet. Then I pass that list to `nlp.pipe`. I wrap the latter in `list` because technically, `.pipe` returns an iterable, which we can loop over, but converting it to a list will allow us to get specific elements by index.

In [6]:
texts = [tweet['full_text'] for tweet in tweets]
docs = list(nlp.pipe(texts))

That should have taken more than a few seconds, unless you have a very fast machine. But now our texts are processed, and we can start to explore their structure.

**Important** I did _not_ overwrite my `tweets` variable with the processed versions of the tweets, because the original dataset contains important metadata (like each tweet's author, number of times it was retweeted, etc.). When processing texts, you'll generally need to keep the metadata separate. So now we have two lists, `tweets` and `docs`. As long as we keep the two lists intact and in the same order, we can easily reference back from the second to the first (e.g., to associate a given tweet's contents with its creator).

### NLP fundamentals

There are a few key components of a `Doc` that we will review. 

#### Tokens

A token is a unique set of characters obtained from a larger string by discarding some elements of the string. That definition is vague on purpose, because **tokenization** can be achieved in different ways. Let's see if we can determine how spaCy tokenizes. 

Note that if I just inspect a spaCy `doc`, it doesn't look much different from a string. But to inspect the tokens in it, I can simply wrap it in Python's `list` command.

In [7]:
docs[0]

Kudos to @Toyota as its workforce prepares to return to work at TMMTX (San Antonio) on Monday, May 4.  Their "Safe at Work Playbook" is based on guidelines from the CDC, WHO, and OSHA, best practices developed by Toyota Working Groups, and local orders and other authorities.

In [8]:
list(docs[0])

[Kudos,
 to,
 @Toyota,
 as,
 its,
 workforce,
 prepares,
 to,
 return,
 to,
 work,
 at,
 TMMTX,
 (,
 San,
 Antonio,
 ),
 on,
 Monday,
 ,,
 May,
 4,
 .,
  ,
 Their,
 ",
 Safe,
 at,
 Work,
 Playbook,
 ",
 is,
 based,
 on,
 guidelines,
 from,
 the,
 CDC,
 ,,
 WHO,
 ,,
 and,
 OSHA,
 ,,
 best,
 practices,
 developed,
 by,
 Toyota,
 Working,
 Groups,
 ,,
 and,
 local,
 orders,
 and,
 other,
 authorities,
 .]

Compare that with what we get by simply splitting the **string** version of this text on whitespace with Python's built-in `.split` method. What do you notice?

In [9]:
tweets[0]['full_text'].split()

['Kudos',
 'to',
 '@Toyota',
 'as',
 'its',
 'workforce',
 'prepares',
 'to',
 'return',
 'to',
 'work',
 'at',
 'TMMTX',
 '(San',
 'Antonio)',
 'on',
 'Monday,',
 'May',
 '4.',
 'Their',
 '"Safe',
 'at',
 'Work',
 'Playbook"',
 'is',
 'based',
 'on',
 'guidelines',
 'from',
 'the',
 'CDC,',
 'WHO,',
 'and',
 'OSHA,',
 'best',
 'practices',
 'developed',
 'by',
 'Toyota',
 'Working',
 'Groups,',
 'and',
 'local',
 'orders',
 'and',
 'other',
 'authorities.']

Python's built-in `split` command separates a string based on a single character or character-combination at a time. By default, it splits on whitespace. For English text, this leaves punctuation marks attached to the words they precede or follow.

Using regular expressions, it's possible to create more complex string separations with the `re` library in Python. But that's still not an easy task, given the occurrence of tokens that actually contain punctuation: for example, _U.S._ or _U.K._. spaCy includes recipes to account for such tokens in its tokenizing routine.
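A short sketch of why the regular-expression route gets awkward. The sentence and both patterns below are made up for illustration; they are not the rules spaCy uses.

```python
import re

text = "The U.S. and the U.K. agreed (finally), on Monday."

# Whitespace splitting leaves punctuation glued to neighbouring words.
whitespace_tokens = text.split()

# A naive pattern: runs of word characters, or any single non-space symbol.
# It separates punctuation, but it also breaks "U.S." into four pieces.
naive_tokens = re.findall(r"\w+|[^\w\s]", text)

# A slightly smarter pattern tries dotted abbreviations first.
# This is an illustrative rule only.
abbrev_tokens = re.findall(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]", text)

print(whitespace_tokens)
print(naive_tokens)
print(abbrev_tokens)
```

Even the third pattern has blind spots: when an abbreviation ends a sentence, it swallows the sentence-final period. Each such case needs another special rule, which is the kind of bookkeeping a dedicated tokenizer takes off your hands.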

**Note** spaCy's tokenization performs quite well as a general rule, at least on what we might call standard English text. But if you're working with text that is non-standard or just messy, it may not be as accurate. As with most of its functionality, it's possible to customize spaCy's tokenizer to handle special cases, but that's beyond the scope of this workshop. See the [documentation](https://spacy.io/usage/linguistic-features#tokenization) for more details.

#### Stopwords, punctuation, and URLs

If you compare the two lists above, you'll see that in the list derived from our spaCy `doc`, the elements are not surrounded by single quotes. This is a signal that a spaCy `Token` is not a Python `str`. But like Python strings, which have special methods like `split` built-in, tokens have their own methods and properties. 

For example, spaCy attempts to mark URLs and email addresses as such.

In [10]:
# The last token in the second doc is a URL.
docs[1]

Many @SocialSecurity beneficiaries were surprised by a recent @IRSnews rule that required a tax return to receive direct #COVID19 checks. My colleagues and I urged @USTreasury to waive this burdensome requirement.
 
Tonight, this rule was reversed. https://t.co/Zx4jujdris

In [11]:
docs[1][-1].like_url

True

We can use some of the token's properties to identify what we might call "content" words in our text, preparatory to performing some form of semantic analysis. We'll most likely want to filter out punctuation. We'll also probably want to filter out **stopwords**, which in English include articles (_a_, _an_, and _the_), conjunctions (_and_, _or_), prepositions (_of_, _in_), and the like. 

Let's write a function that accepts a spaCy `Doc` object as its argument and returns only those tokens that are not punctuation, whitespace, stopwords, or URLs.

In [12]:
def remove_stops(doc):
    tokens = []
    for token in doc:
        # We include is_space because even though the default tokenization ignores the space between words, extra spaces
        # like line breaks can register as distinct tokens
        if not token.is_stop and not token.is_space and not token.is_punct and not token.like_url:
            tokens.append(token)
    return tokens

In [13]:
remove_stops(docs[0])

[Kudos,
 @Toyota,
 workforce,
 prepares,
 return,
 work,
 TMMTX,
 San,
 Antonio,
 Monday,
 4,
 Safe,
 Work,
 Playbook,
 based,
 guidelines,
 CDC,
 OSHA,
 best,
 practices,
 developed,
 Toyota,
 Working,
 Groups,
 local,
 orders,
 authorities]

As you can see, it's not perfect. In the original text, "Monday, May 4" is given as a date. Our function kept the "4" but discarded "May": it doesn't have a way to distinguish _May_ the month from _may_ the auxiliary verb, which is a stopword. You can augment or even replace spaCy's built-in list of stopwords with your own. For instance, if your text has a lot of dates in it, you may want to remove _may_ from the list.

We could also weed out tokens like "4" by checking the `Token.is_digit` and/or `Token.like_num` flags.
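The _May_/_may_ collision comes from matching stopwords without regard to case. The sketch below imitates that behaviour with plain strings and a tiny hand-picked stopword set. `STOPWORDS` and `filter_stops` are illustrative stand-ins, not spaCy's actual list or API.

```python
# A tiny, hand-picked stopword set for illustration only.
STOPWORDS = {"to", "as", "its", "at", "on", "may", "the", "and"}

def filter_stops(words, stopwords):
    """Keep words whose lowercased form is not in the stopword set."""
    return [w for w in words if w.lower() not in stopwords]

words = ["return", "to", "work", "on", "Monday", "May", "4"]

# "May" is dropped because it lowercases to the stopword "may".
default_result = filter_stops(words, STOPWORDS)
print(default_result)

# Removing "may" from the set keeps the month name.
date_friendly = STOPWORDS - {"may"}
date_result = filter_stops(words, date_friendly)
print(date_result)

# Digits can be dropped with a separate check, similar in spirit to is_digit.
no_digits = [w for w in date_result if not w.isdigit()]
print(no_digits)
```

The trade-off runs both ways: with _may_ removed from the list, every auxiliary-verb use of the word is kept as well.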

#### Lemmas and parts-of-speech

We'll use our `remove_stops` function a bit later. Now let's look at some properties of tokens that we might want to analyze across our collection of documents.

spaCy is more than just a tokenizer. When we pass a string to the `nlp` function, it analyzes the text using a series of models. One of these models tags every token with its grammatical part of speech (POS). The models are probabilistic, so depending on the nature of the text, the results may be more or less accurate.

We can view the POS tags using the `.pos_` attribute of any given token. (Note the underscore at the end of the attribute!) Definitions of the tags are available on the website of the [Universal Dependencies project](https://universaldependencies.org/docs/u/pos/).

In [14]:
# Using a dictionary comprehension to view the .pos_ attribute of the tokens in a spaCy doc, along with the token's string representation
{token.text: token.pos_ for token in docs[0]}

{'Kudos': 'NOUN',
 'to': 'ADP',
 '@Toyota': 'PUNCT',
 'as': 'SCONJ',
 'its': 'DET',
 'workforce': 'NOUN',
 'prepares': 'VERB',
 'return': 'VERB',
 'work': 'NOUN',
 'at': 'ADP',
 'TMMTX': 'PROPN',
 '(': 'PUNCT',
 'San': 'PROPN',
 'Antonio': 'PROPN',
 ')': 'PUNCT',
 'on': 'ADP',
 'Monday': 'PROPN',
 ',': 'PUNCT',
 'May': 'PROPN',
 '4': 'NUM',
 '.': 'PUNCT',
 ' ': 'SPACE',
 'Their': 'DET',
 '"': 'PUNCT',
 'Safe': 'ADJ',
 'Work': 'PROPN',
 'Playbook': 'PROPN',
 'is': 'AUX',
 'based': 'VERB',
 'guidelines': 'NOUN',
 'from': 'ADP',
 'the': 'DET',
 'CDC': 'PROPN',
 'WHO': 'PROPN',
 'and': 'CCONJ',
 'OSHA': 'PROPN',
 'best': 'ADJ',
 'practices': 'NOUN',
 'developed': 'VERB',
 'by': 'ADP',
 'Toyota': 'PROPN',
 'Working': 'PROPN',
 'Groups': 'PROPN',
 'local': 'ADJ',
 'orders': 'NOUN',
 'other': 'ADJ',
 'authorities': 'NOUN'}

Along with the part of speech, another useful piece of token metadata is the _lemma_ of each word, which is a sort of normalized form of it, intended to make it easy to compare different grammatical inflections (plurals, verb tenses, etc.) of the same word. We can access it via the `Token.lemma_` attribute.

In [15]:
{token.text: token.lemma_ for token in docs[0]}

{'Kudos': 'kudo',
 'to': 'to',
 '@Toyota': '@Toyota',
 'as': 'as',
 'its': '-PRON-',
 'workforce': 'workforce',
 'prepares': 'prepare',
 'return': 'return',
 'work': 'work',
 'at': 'at',
 'TMMTX': 'TMMTX',
 '(': '(',
 'San': 'San',
 'Antonio': 'Antonio',
 ')': ')',
 'on': 'on',
 'Monday': 'Monday',
 ',': ',',
 'May': 'May',
 '4': '4',
 '.': '.',
 ' ': ' ',
 'Their': '-PRON-',
 '"': '"',
 'Safe': 'safe',
 'Work': 'Work',
 'Playbook': 'Playbook',
 'is': 'be',
 'based': 'base',
 'guidelines': 'guideline',
 'from': 'from',
 'the': 'the',
 'CDC': 'CDC',
 'WHO': 'WHO',
 'and': 'and',
 'OSHA': 'OSHA',
 'best': 'good',
 'practices': 'practice',
 'developed': 'develop',
 'by': 'by',
 'Toyota': 'Toyota',
 'Working': 'Working',
 'Groups': 'Groups',
 'local': 'local',
 'orders': 'order',
 'other': 'other',
 'authorities': 'authority'}

We can see that lemmatization in spaCy leaves proper nouns (like the _Groups_ in _Toyota Working Groups_) alone. But _practices_ becomes _practice_, _authorities_ becomes _authority_, and _best_ becomes _good_.

#### Named entities

So far we've looked at **syntactic** features of our dataset. spaCy also includes some tools that allow us to look at the **semantic** information in a quantitative way. It's worth pointing out (again) that semantic computational analysis is quite challenging and remains a very active area of research. spaCy isn't necessarily intended to be used on its own for this work, but rather as a pre-processing tool to feed into other kinds of tools and models capable of more sophisticated analysis.

One kind of semantic analysis identifies the named entities in a collection of documents. These are words or phrases that refer to people, places, organizations, etc. Because such names can be either single words or phrases, it's not sufficient to identify the proper nouns in a text. spaCy uses a special probabilistic model to extract and classify named entities. Let's see how accurate it is.

A `Doc` has a property called `ents` that returns only the named entities recognized for that document. Each entity has a `label_` attribute that identifies [the classification](https://spacy.io/api/annotation#named-entities) assigned to it.

Using the built-in visualizer, we can see each document with the named entities highlighted, along with the category (the `label_`) each has been assigned.

In [21]:
from spacy import displacy

In [25]:
displacy.render(docs[10], style='ent')

#### Exercise

Use `displacy` to examine a few different tweets and see how the named entity recognition has performed. What do you notice? Is it particularly bad at recognizing or classifying certain kinds of entities?

Now let's look at these entities a bit more programmatically. 

To start with, we can use a dictionary comprehension to create a dictionary of all the entities in the documents in our dataset.


In [26]:
entities = {ent.text: ent.label_ for doc in docs for ent in doc.ents}

In [27]:
entities

{'TMMTX': 'ORG',
 'San Antonio': 'GPE',
 'Monday': 'DATE',
 'May 4': 'DATE',
 'CDC': 'ORG',
 'OSHA': 'ORG',
 'Toyota Working Groups': 'ORG',
 'Tonight': 'TIME',
 'a year': 'DATE',
 'House': 'ORG',
 'Mitch McConnell': 'PERSON',
 'GOP': 'ORG',
 'Senate': 'ORG',
 'Kansas': 'GPE',
 '@kscosmosphere': 'ORG',
 'PPP': 'ORG',
 '&amp': 'ORG',
 'https://t.co/0tPRHeIXVI': 'ORG',
 'Trump Administration': 'ORG',
 'EPA': 'ORG',
 'AZ': 'LOC',
 'Jack': 'PERSON',
 'Scottsdale': 'GPE',
 'Rudy &amp': 'ORG',
 'First': 'ORDINAL',
 'the United States': 'GPE',
 'More than $67 billion': 'MONEY',
 'OpportunityZones': 'MONEY',
 'Bernie Sanders': 'PERSON',
 'Donald Trump': 'PERSON',
 'Texas': 'GPE',
 'last night': 'TIME',
 'Russia': 'GPE',
 '2,600 mile': 'QUANTITY',
 'China': 'GPE',
 'El Al': 'LOC',
 'Israel &amp': 'ORG',
 'Air France': 'ORG',
 'China &amp': 'ORG',
 'France': 'GPE',
 'The United States': 'GPE',
 'U.S.': 'GPE',
 'the Senate HELP Committee': 'ORG',
 'the Coronavirus Aid': 'ORG',
 'Economic Security

At first glance, the accuracy isn't bad, but it's far from perfect. A `GPE` is a geopolitical entity, like a city or a state, and those seem largely to be classified correctly. The results for `ORG` and `PERSON` are spottier. It flags certain hashtags as `PERSON`s and "PPP" as an `ORG`, which in this case it's probably not. Also, the entity model picks out some phrases in Spanish which do not appear to refer to named entities at all.

Using spaCy's tagging, we can begin to look at our dataset as **a collection** of texts, rather than text by text.

Let's write a function that calculates the number of times each unique entity appears in our dataset, classified by entity label. Our function should accept a list of spaCy `Doc`s and return a dictionary such that `ent_dict['PERSON']` (for instance) shows all the unique `PERSON` entities and their frequency in the collection.

To do this, we'll use a couple of special Python types from the `collections` module. (We imported them earlier.)
- `defaultdict` creates a Python dictionary whose **values** are initialized automatically, the first time a key is used, as another Python type such as a list, a dictionary, or a `Counter`. `defaultdict` provides a convenient way to make a nested data structure.
- `Counter` is a Python dictionary that treats every missing key as having a count of 0. It's useful for counting a collection of objects (in this case, our entities).
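Here is a quick sketch of how each type behaves on its own and in combination. The label/text pairs are invented for illustration.

```python
from collections import Counter, defaultdict

# A Counter reports 0 for keys it has not seen, so += 1 always works.
counts = Counter()
counts["Senate"] += 1
counts["Senate"] += 1
print(counts["Senate"])
print(counts["House"])  # a missing key does not raise KeyError

# defaultdict(Counter) creates a fresh Counter the first time a key is used.
nested = defaultdict(Counter)
pairs = [("ORG", "Senate"), ("GPE", "Texas"), ("ORG", "Senate"), ("ORG", "EPA")]
for label, text in pairs:
    nested[label][text] += 1

# Each label now has its own Counter of entity strings.
org_counts = nested["ORG"].most_common()
print(org_counts)
```

Without `defaultdict`, the loop would need an explicit check such as `if label not in nested:` before every increment.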

In [18]:
def count_ents(docs):
    ent_dict = defaultdict(Counter) # Initialize our nested dictionary with a Counter
    for doc in docs:
        for ent in doc.ents: # Loop over doc.ents to get the entities, not over doc, which returns the tokens
            label = ent.label_ # The descriptive label/category
            text = ent.text # The string representation of the entity; note that entities aren't lemmatized
            ent_dict[label][text] += 1 # Increment the Counter associated with that label
    return ent_dict

In [19]:
ent_dict = count_ents(docs)
ent_dict['PERSON'].most_common(20)

[('Trump', 255),
 ('#COVID19', 88),
 ('Donald Trump', 69),
 ('Rubio', 61),
 ('McConnell', 60),
 ('#DemDebate', 48),
 ('@realDonaldTrump', 46),
 ('Mitch McConnell', 36),
 ('SOTU', 24),
 ('Hawley', 23),
 ('John Bolton', 23),
 ('#SOTU', 23),
 ('Bernie Sanders', 21),
 ('Obama', 20),
 ('Pelosi', 19),
 ('#ICYMI', 16),
 ('Bolton', 16),
 ('Marco Rubio', 16),
 ('Nevadans', 16),
 ('Joe Biden', 15)]

If we run the above to find the top 20 persons, we can observe a few things:
- "#COVID19" is mistakenly labeled as a PERSON.
- Named entities aren't lemmatized, so names like "Bolton" and "John Bolton" show up as separate entities.

You can update individual entities to correct their classification, but this has to be done at the document level, so it would require writing code to update each document where the entity appears. You can also train your own entity model, but that requires having a pre-tagged dataset to train the model on. See the [spaCy docs](https://spacy.io/usage/linguistic-features#named-entities) for more info.
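A lighter-weight workaround is to clean up the counts after the fact instead of correcting the model. The sketch below uses a handful of the `PERSON` counts shown above and drops any key that begins with `#` or `@`. The names `person_counts` and `cleaned` are introduced here for illustration.

```python
from collections import Counter

# A few of the PERSON counts from the output above.
person_counts = Counter({
    "Trump": 255, "#COVID19": 88, "Donald Trump": 69,
    "Rubio": 61, "#DemDebate": 48, "@realDonaldTrump": 46,
})

# Keep only entries that do not look like hashtags or handles.
cleaned = Counter({
    name: n for name, n in person_counts.items()
    if not name.startswith(("#", "@"))
})

top_people = cleaned.most_common()
print(top_people)
```

This is a blunt filter: it also discards handles such as `@realDonaldTrump` that do refer to a person, so whether it helps depends on the question you are asking.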

#### Word and document similarity

A more ambitious form of semantic analysis seeks quantitatively to represent the _meanings_ of words based on their relative similarity to other words. As you might imagine, this task is fraught with difficulty, since meaning in natural languages is so highly contextual. The words we speak or write represent just the tip of the iceberg of what we mean -- a fact attested to by how frequently a community of human speakers can disagree about the meaning of the utterances their members produce. Certain words stand out as flashpoints of controversy: think about the range of meanings people attribute to a word like _racism_ or _safety_ or _science_. But in general, this indeterminacy affects all human language. Beneath the written text or spoken utterance lies a huge weight of personal experience and collective history.

NLP algorithms generally represent "meaning" in terms of a much narrower version of context: namely, the collocation of linguistic features across a corpus of texts. These algorithms tend either to apply sophisticated statistical techniques or -- more recently -- to make use of neural networks. But the basic premise is that "similar" or "related" words tend to occur more frequently together than dissimilar or unrelated words.



spaCy's models provide both `Token` and `Doc` **vectors**, which are word embeddings of the kind produced by techniques such as [word2vec](https://en.wikipedia.org/wiki/Word2vec). These vectors are represented in spaCy as objects of type `numpy.ndarray`. The latter is basically a more performant version of a Python list, optimized for numeric operations.

The vectors by themselves are not terribly informative:

In [28]:
docs[0][0].vector

array([-3.4808e-01, -2.8244e-02, -1.5513e-03, -4.4549e-01, -6.6233e-01,
        3.6924e-01,  2.1900e-01, -4.7495e-01,  5.7171e-02,  1.1867e+00,
       -9.2176e-02, -8.6183e-02,  7.4074e-02, -2.0607e-01, -1.9715e-01,
        5.9195e-02,  1.1783e-01, -1.0074e-01,  1.5625e-01,  4.0891e-01,
        2.4840e-04, -5.1546e-02,  3.1825e-01, -1.9916e-01,  1.4442e-02,
        1.2510e-02,  7.0528e-02, -2.3162e-01,  2.3331e-01, -4.0152e-01,
        3.3923e-01,  9.7194e-02, -1.5903e-01,  4.0869e-01,  1.0443e-01,
        1.0303e-01, -2.5341e-01,  2.0457e-02,  3.0006e-01,  1.5161e-01,
       -1.7320e-01,  1.0995e-01, -3.2460e-01, -1.6000e-01,  1.1650e-01,
        2.9631e-01,  2.5647e-02,  7.6841e-01, -1.3053e-01, -3.8559e-01,
        9.0185e-02,  3.1918e-01,  4.1669e-01, -2.8903e-01, -2.6488e-01,
       -4.8447e-02,  2.4093e-01,  2.9484e-01, -5.0827e-01, -3.1536e-01,
       -4.2685e-01, -3.3898e-01, -2.1983e-01, -2.6768e-01, -1.0850e-01,
       -1.7611e-01,  3.3296e-01, -3.0399e-01,  2.9700e-01, -5.39

But the vectors allow us to compare two tokens, using each token's built-in `.similarity` method. Tokens with a higher score are supposedly more similar. Let's take a few tokens in isolation to illustrate.
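spaCy's documentation describes the default `.similarity` score as the cosine similarity between two vectors. The sketch below computes that measure by hand on three tiny made-up vectors. It illustrates the formula and is not spaCy's own code; the function name `cosine_similarity` is introduced here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]  # points in the same direction as v1
v3 = [0.0, 0.0, 3.0]  # at right angles to v1

same_direction = cosine_similarity(v1, v2)  # approximately 1
orthogonal = cosine_similarity(v1, v3)      # 0
print(same_direction, orthogonal)
```

Because the measure depends only on direction, not length, two vectors can score close to 1 even when their magnitudes differ a great deal.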

Note that we first process each word with the `nlp` function, which creates a spaCy `Doc` consisting of a single token. We can't compare the similarity of unprocessed Python strings.

In [29]:
# Running the model on individual words
banana = nlp('banana')
orange = nlp('orange')
apple = nlp('apple')
dog = nlp('dog')
cat = nlp('cat')

In [30]:
banana.similarity(orange)

0.5629939782223348

In [31]:
orange.similarity(cat)

0.3288468980287254

In [32]:
cat.similarity(dog)

0.8016854705531046

It seems fairly reasonable to say that an orange is more similar to a banana than to a cat. I'm not sure why _cat_ and _dog_ appear so much more similar than _orange_ and _banana_, however.

Also, the dubious logic of collocation appears in this example:

In [33]:
dog.similarity(nlp('wolf'))

0.5206573188004238

It stands to reason that the accuracy of the similarity model depends on the size and nature of the corpus used to train it. You can train your own word2vec models using other Python libraries (like [Gensim](https://radimrehurek.com/gensim/)) and import the results into spaCy; this approach might be particularly useful if you're working with a corpus of fairly specialized texts.

How can we use word-vector similarity to analyze our dataset? 

One approach might be to look for the documents that are most similar to a given document. spaCy automatically assigns each `Doc` a vector that represents the average of its word vectors, and this is used as the basis for the `Doc.similarity` method.

In [34]:
docs[0].similarity(docs[1])

0.9066829319507925
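To see what "the average of its word vectors" means, here is a minimal sketch with invented two-dimensional vectors standing in for real word vectors (which have many more dimensions). The function `average_vector` is a hypothetical helper, not part of spaCy.

```python
def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Three made-up "word vectors" for a three-word document.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

doc_vector = average_vector(word_vectors)
print(doc_vector)
```

Averaging smooths out the contribution of any single word, which is one plausible reason that two tweets on quite different topics can still receive a high similarity score.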

In [35]:
docs[0]

Kudos to @Toyota as its workforce prepares to return to work at TMMTX (San Antonio) on Monday, May 4.  Their "Safe at Work Playbook" is based on guidelines from the CDC, WHO, and OSHA, best practices developed by Toyota Working Groups, and local orders and other authorities.

In [36]:
docs[1]

Many @SocialSecurity beneficiaries were surprised by a recent @IRSnews rule that required a tax return to receive direct #COVID19 checks. My colleagues and I urged @USTreasury to waive this burdensome requirement.
 
Tonight, this rule was reversed. https://t.co/Zx4jujdris

We could use this code to create a similarity score for every document in our collection with every other document, but that's a fairly intensive computation, since our dataset contains about 8,000 documents.
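A quick back-of-the-envelope calculation shows the difference in scale. It assumes a round figure of 8,000 documents, as stated above.

```python
n_docs = 8000  # approximate size of the collection

# Every unordered pair of distinct documents: n * (n - 1) / 2
all_pairs = n_docs * (n_docs - 1) // 2

# Comparing a single target document against all the others
one_vs_rest = n_docs - 1

print(f"{all_pairs:,} comparisons for every pair")
print(f"{one_vs_rest:,} comparisons for one target document")
```

Comparing everything with everything takes roughly 32 million similarity computations, versus about 8,000 for a single target.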

For illustration, let's pick one tweet and find other tweets similar to it. We can use our original dataset to include tweet metadata in our analysis.

By converting our dataset of tweets into a pandas `DataFrame`, we can leverage its fast indexing and sorting methods to isolate particular documents based on their metadata.

In [37]:
tweet_df = pd.DataFrame.from_records(tweets)

Let's find Elizabeth Warren's most popular tweet.

First we filter out all tweets where the name on the account does not contain "Warren."

Then we sort on the `retweet_count` column to find her most popular tweet.


In [38]:
warren_df = tweet_df.loc[tweet_df['name'].str.contains('Warren')]
warren_df.sort_values(by='retweet_count', ascending=False)

| | full_text | retweet_count | created_at | name | screen_name |
|---|---|---|---|---|---|
| 1358 | Millions may now lose their jobs. And Trump wa... | 26512 | Sun Mar 22 16:35:24 +0000 2020 | Elizabeth Warren | SenWarren |
| 5596 | You are threatening to commit war crimes. We a... | 23778 | Sun Jan 05 03:35:11 +0000 2020 | Elizabeth Warren | ewarren |
| 6897 | Trump told states they were on their own to pu... | 23542 | Tue Mar 31 15:04:44 +0000 2020 | Elizabeth Warren | SenWarren |
| 3877 | My oldest brother, Don Reed, died from coronav... | 18842 | Thu Apr 23 14:39:30 +0000 2020 | Elizabeth Warren | ewarren |
| 2373 | Russia is interfering in our election again to... | 16764 | Thu Feb 20 23:51:41 +0000 2020 | Elizabeth Warren | ewarren |
| ... | ... | ... | ... | ... | ... |
| 5204 | I'll be joining @Lawrence on @MSNBC shortly to... | 172 | Thu Feb 06 03:26:25 +0000 2020 | Elizabeth Warren | ewarren |
| 4210 | Together, we imagined a country where everyone... | 169 | Tue Jan 07 00:26:26 +0000 2020 | Elizabeth Warren | ewarren |
| 2972 | It was so great to meet Vivian and Riley, two ... | 139 | Sat Feb 08 00:05:43 +0000 2020 | Elizabeth Warren | ewarren |
| 2427 | With my #WealthTax, we can achieve #UniversalC... | 115 | Wed Jan 08 17:12:52 +0000 2020 | Elizabeth Warren | ewarren |


Because slicing and sorting a `DataFrame` don't change its index, we can use that to find the corresponding document in our collection for analysis.
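The following sketch demonstrates that behaviour on a tiny invented `DataFrame`; the names and retweet counts are made up. The row labels 0 to 3 play the role of positions in the `docs` list.

```python
import pandas as pd

# Invented records; the default index 0..3 matches list positions.
df = pd.DataFrame({
    "name": ["A. Senator", "B. Warren", "C. Senator", "D. Warren"],
    "retweet_count": [5, 10, 1, 40],
})

# Filter, then sort: the surviving rows keep their original labels.
subset = df.loc[df["name"].str.contains("Warren")]
ranked = subset.sort_values(by="retweet_count", ascending=False)
print(list(ranked.index))

# The first label identifies the top row in the original frame
# (and would identify the matching item in a parallel list).
top_label = ranked.index[0]
print(df.loc[top_label, "name"])
```

Note that this only holds as long as the index is left alone; calling `reset_index()` on the filtered frame would renumber the rows and break the link back to the list.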

In [39]:
docs[1358]

Millions may now lose their jobs. And Trump wants our response to be a half-trillion dollar slush fund to boost favored companies and corporate executives – while they continue to pull down huge paychecks and fire their workers. Here’s what I know and how we stop it:

In [40]:
warren_tweet_index = 1358

#### Exercise

Can you write a function that accepts a list of documents and an index to a particular document, and then returns a dictionary mapping each document to its similarity score (measured against the document specified by index)? 

**Answer**
```
def compute_scores(docs, idx_of_target): 
    # Use a Counter to keep track of the scores, so that we can find the top scores
    sim_scores = Counter()
    # Assumes that our target is in the collection
    target = docs[idx_of_target]  
    # Enumerate lets us keep track of the index of each vector in the collection
    for i, doc in enumerate(docs): 
         # Compute the cosine similarity
        score = doc.similarity(target)  
        sim_scores[i] = score
    return sim_scores
```

In [42]:
def compute_scores(docs, idx_of_target): 
    # Use a Counter to keep track of the scores, so that we can find the top scores
    sim_scores = Counter()
    # Assumes that our target is in the collection
    target = docs[idx_of_target]  
    # Enumerate lets us keep track of the index of each vector in the collection
    for i, doc in enumerate(docs): 
         # Compute the cosine similarity
        score = doc.similarity(target)  
        sim_scores[i] = score
    return sim_scores

In [43]:
scores = compute_scores(docs, warren_tweet_index)

In [44]:
scores

Counter({0: 0.9023731383696801,
         1: 0.9113016769504528,
         2: 0.9044361164571186,
         3: 0.9229561342013487,
         4: 0.9071659039975094,
         5: 0.9149042462180738,
         6: 0.9113866126475357,
         7: 0.9453963733314896,
         8: 0.7964267834218549,
         9: 0.8562253413182859,
         10: 0.9128990524784202,
         11: 0.9371595737605315,
         12: 0.8078544759653438,
         13: 0.9328621593059064,
         14: 0.8829108094183139,
         15: 0.9463916080672117,
         16: 0.9171521922515129,
         17: 0.8725937555790683,
         18: 0.8975047950136265,
         19: 0.8548121543407899,
         20: 0.843086803064335,
         21: 0.9555311901597404,
         22: 0.9389441957155181,
         23: 0.9177341584641999,
         24: 0.8488799208644495,
         25: 0.6596111122299475,
         26: 0.9321730780227734,
         27: 0.9227242952055731,
         28: 0.9222298301004517,
         29: 0.8685814855996463,
         30: 0.905513

The `scores` object by itself doesn't tell us much, since it references each document only by its index. But it's a `Counter` object, so we can easily find the top N scores. 

And then we can use those entries to slice our `DataFrame` of tweets to see the tweet text and associated metadata.

We have to write `score[0]` in the `DataFrame.loc[]` expression because `Counter.most_common` returns a list of Python tuples; the first element of each tuple is the key (in this case, the document index) and the second is the score.

In [45]:
tweet_df.loc[[score[0] for score in scores.most_common(10)]]

| Unnamed: 0 | full_text | retweet_count | created_at | name | screen_name |
|---|---|---|---|---|---|
| 1358 | Millions may now lose their jobs. And Trump wa... | 26512 | Sun Mar 22 16:35:24 +0000 2020 | Elizabeth Warren | SenWarren |
| 5079 | So when giant private equity firms don’t like ... | 11 | Wed Jan 08 20:48:23 +0000 2020 | Senator Tina Smith | SenTinaSmith |
| 4868 | Now is the time to support our neighbors, espe... | 6 | Wed Mar 25 23:43:30 +0000 2020 | Ed Markey | EdMarkey |
| 2487 | Too many have lost their jobs and livelihoods ... | 14 | Fri Apr 10 20:02:20 +0000 2020 | Senator Hawley Press Office | SenHawleyPress |
| 4188 | Millions of workers have been able to keep the... | 63 | Thu Apr 16 16:38:16 +0000 2020 | Cory Gardner | SenCoryGardner |
| 2540 | This crisis calls for a massive federal respon... | 804 | Wed Mar 18 14:34:11 +0000 2020 | Cory Booker | CoryBooker |
| 7899 | We should SIGNIFICANTLY boost the Unemployment... | 194 | Wed Mar 25 21:39:57 +0000 2020 | Rick Scott | SenRickScott |
| 4742 | The $2 trillion relief package will go a long ... | 20 | Thu Mar 26 18:16:38 +0000 2020 | Sen. Cory Booker | SenBooker |
| 5992 | Too many "American" companies have only one re... | 455 | Fri Jan 03 23:39:22 +0000 2020 | Elizabeth Warren | ewarren |
| 1320 | The 3rd stimulus plan should focus more on get... | 9 | Fri Mar 20 20:51:26 +0000 2020 | Senator Ben Cardin | SenatorCardin |
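
The pattern used above, taking the keys from `Counter.most_common` and passing them to `DataFrame.loc[]`, can be tried on a small toy example. This is only a sketch: `toy_scores` and `toy_df` are invented for illustration and do not come from the tweet dataset.

```python
from collections import Counter

import pandas as pd

# A Counter can hold float scores, not just integer counts
toy_scores = Counter({10: 0.91, 11: 0.75, 12: 0.98, 13: 0.60})

# most_common(n) returns a list of (key, value) tuples, highest value first
top_two = toy_scores.most_common(2)

# Keep only the keys, which here play the role of document indices
top_labels = [key for key, _ in top_two]

# A toy DataFrame whose index labels match the Counter's keys
toy_df = pd.DataFrame({"full_text": ["a", "b", "c", "d"]}, index=[10, 11, 12, 13])

# .loc[] with a list of labels returns the rows in the order of that list
top_rows = toy_df.loc[top_labels]
```

Because `.loc[]` respects the order of the labels it is given, the resulting rows stay ranked from most to least similar.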


It will be easier to see the full text if we just inspect the `.values` attribute of that column.

We can see that our similarity scoring did a pretty good job: the tweets it ranked as most similar are about big corporations, billionaires, and unemployment.

In [46]:
tweet_df.loc[[score[0] for score in scores.most_common(10)]]['full_text'].values

array(['Millions may now lose their jobs. And Trump wants our response to be a half-trillion dollar slush fund to boost favored companies and corporate executives – while they continue to pull down huge paychecks and fire their workers. Here’s what I know and how we stop it:',
       'So when giant private equity firms don’t like legislation we’re working on to end surprise medical billing, they fund a dark money group and blitz the airwaves with MILLIONS in misleading ads. Got it.\n\nAt least have the guts to say who you are.  \nhttps://t.co/uGVKwa9dWs',
       'Now is the time to support our neighbors, especially those who have lost their jobs due to the coronavirus. We need to dramatically expand unemployment insurance and deliver direct cash assistance to the American people. It’s the right thing to do. \n\nhttps://t.co/jd5U7zrsK3',
       'Too many have lost their jobs and livelihoods in this coronavirus crisis.\n\nThe best way to get America ready to go back to work is to REHIRE 

Here we've taken our first steps toward building a document classifier! The initial results look pretty good, though as with any use of NLP, it's worth doing more exploration in order to evaluate the accuracy and robustness of the method and possibly to fine-tune it.

If you're interested in document classification, there are other approaches that don't use word vectors/embeddings. **Topic modeling** is among the most widely used. 

In recent years, **deep neural networks** have garnered a lot of attention in NLP, including for text classification. spaCy includes a `TextCategorizer` component that can be used to train a network for text classification. Unlike topic modeling and other statistical approaches, such networks usually require a labeled training dataset. 

See the [spaCy docs](https://spacy.io/usage/training#textcat) for more info.

Because natural-language processing is such a diverse and active field, there is a wealth of resources available for further study. One place to begin is with the [O'Reilly Online Library](https://www.safaribooksonline.com/library/view/temporary-access), which has many books devoted to Python and NLP. Access to O'Reilly Online is free for GW faculty, students, and staff. 

#### Bonus Material: Document Similarity without Stopwords

Since every token in a document has a vector, including punctuation, stopwords, and URLs, this score might be skewed by a lot of information we don't really care about. Can we make it cleaner by comparing only the vectors of "content" words?

We can, but it requires a little bit of reverse-engineering. In what follows, we'll implement our own version of spaCy's `.similarity` method to measure the similarity between only the "content" words in our documents.

To work with the word vectors, we'll use `numpy` functions (those prefixed by `np`, the alias we used when importing `numpy`).

In [47]:
# The vector of a document = average of the token vectors
# We can use this to get the vector of the tokens minus stopwords, etc.
def vectorize_without_stops(doc):
    # Remember that our function remove_stops returns the list of content words in a given spaCy Document
    # We're using numpy.array to create an array of the word vectors in that list
    # Each word vector is already a numpy array, so we're creating a 2-dimensional array
    vectors = np.array([token.vector for token in remove_stops(doc)])
    # Some document vectors might have only null values, which will cause numpy to raise an error
    # We just return a placeholder case, so that we can filter it out later
    if not vectors.any():
        return vectors
    # Then we use the numpy mean method to average all those word vectors into a single vector 
    # (which is how Document.vector is created in spaCy)
    return np.mean(vectors, axis=0)
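
The two `numpy` ideas used in this function, stacking vectors into a 2-dimensional array and averaging along `axis=0`, are easier to see on toy data. The sketch below uses made-up 3-dimensional vectors rather than real spaCy word vectors.

```python
import numpy as np

# Two toy "word vectors" of length 3 (real spaCy vectors are much longer)
token_vectors = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0])]

# Stacking them gives a 2-dimensional array: one row per token
stacked = np.array(token_vectors)

# Averaging along axis 0 collapses the rows into a single vector of length 3
doc_vector = np.mean(stacked, axis=0)

# .any() is False for an empty array and for an array of all zeros,
# which is why the function can use it to detect "null" cases
empty_is_null = not np.array([]).any()
zeros_is_null = not np.zeros((2, 3)).any()
```

Each element of `doc_vector` is the mean of the corresponding column of `stacked`, so the result has the same length as a single token vector.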

In [48]:
# Now we can create the vectorized version of our dataset
# Each element in this list will be a single vector, representing the mean of the word vectors in a single document
# but ONLY for the "content" words in that document
doc_vecs = [vectorize_without_stops(doc) for doc in docs]

In [49]:
# This function does some math to find the cosine similarity 
# Based on the second answer provided here: https://stackoverflow.com/questions/18424228/cosine-similarity-between-2-number-lists
def cosine_sim(vec1, vec2):
    # Each argument should be a document vector (mean of token vectors)
    # This just returns 0 for the similarity score if one or the other of the vectors is null
    if not vec1.any() or not vec2.any():
        return 0
    # This is the cosine similarity formula: the inner product of two vectors divided by the product of their norms
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
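
To build some intuition for the formula, here is a small sketch that applies the same expression to toy 2-dimensional vectors. The helper name `cosine_demo` is made up for this illustration. It omits the null-vector check, so it should only be given nonzero vectors.

```python
import numpy as np

def cosine_demo(vec1, vec2):
    # Same formula as above: inner product divided by the product of the norms
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Vectors pointing in the same direction: similarity close to 1,
# regardless of their lengths
same_direction = cosine_demo(np.array([1.0, 2.0]), np.array([2.0, 4.0]))

# Vectors at right angles: similarity close to 0
orthogonal = cosine_demo(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Vectors pointing in opposite directions: similarity close to -1
opposite = cosine_demo(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))
```

Cosine similarity depends only on the angle between two vectors, not on their magnitudes, which is why a long and a short document can still score as highly similar.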

In [50]:
def compute_scores_2(vectors, idx_of_target):
    # We'll again use a Counter to keep track of the scores, so that we can easily find the top scores
    sim_scores = Counter()
    # Assumes that our target is in the collection
    target_vec = vectors[idx_of_target]
    # Enumerate lets us keep track of the index of each vector in the collection
    for i, vec in enumerate(vectors):
        # Compute the cosine similarity
        score = cosine_sim(target_vec, vec)
        sim_scores[i] = score
    return sim_scores
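
For larger collections, the loop above could be replaced by a single matrix operation. The following is a sketch of that alternative, not the approach used in this workshop. The function name `cosine_scores_vectorized` is hypothetical. It also assumes every document vector has the same length and has been stacked into one 2-dimensional array, which is not true of our `doc_vecs` list, since it may contain placeholder arrays.

```python
import numpy as np

def cosine_scores_vectorized(matrix, idx_of_target):
    # Hypothetical helper: cosine similarity of every row against one target row
    matrix = np.asarray(matrix, dtype=float)
    target = matrix[idx_of_target]
    # Norm of each row, multiplied by the norm of the target: the denominator
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target)
    # Inner product of every row with the target: the numerator
    dots = matrix @ target
    # Rows with a zero denominator (null vectors) keep a score of 0
    scores = np.zeros(len(matrix))
    np.divide(dots, denom, out=scores, where=denom > 0)
    return scores

# Toy data: rows 0 and 1 point the same way, row 2 is orthogonal, row 3 is null
toy_matrix = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 0.0]])
toy_scores = cosine_scores_vectorized(toy_matrix, 0)
```

With the scores in an array, `np.argsort(toy_scores)[::-1][:10]` would give the positions of the ten highest scores, playing the role that `Counter.most_common(10)` plays in the loop-based version.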

In [52]:
scores2 = compute_scores_2(doc_vecs, warren_tweet_index)

In [53]:
tweet_df.loc[[score[0] for score in scores2.most_common(10)]]['full_text'].values

array(['Millions may now lose their jobs. And Trump wants our response to be a half-trillion dollar slush fund to boost favored companies and corporate executives – while they continue to pull down huge paychecks and fire their workers. Here’s what I know and how we stop it:',
       'In the midst of an unprecedented national crisis, Republicans can’t seriously expect us to tell people who are suffering that we shortchanged hospitals, students, workers, &amp; small biz but gave big corporations hundreds of billions of dollars in a secretive slush fund.',
       'Big corporations are spending billions on stock buybacks to reward wealthy shareholders, while workers are getting pink slips.\n\nWe need my #RewardWork Act to rein in corporate stock buybacks &amp; give workers a voice in how their company’s profits are spent https://t.co/MECn84WBSq',
       'If taxpayers are being asked to give corporations a multi-billion-dollar lifeline, there need to be some strings attached.\n \nWe’ve got