# spaCy and POS tagging

For more info: [spaCy](https://spacy.io/)

### Installing spaCy and language model

If you are running this notebook locally, you only need to run the next two cells once.

In [None]:
!pip install spacy

In [None]:
!python -m spacy download en_core_web_sm

## Aside: two ways of installing spaCy

### 1. Jupyter notebook
You can install spaCy by running the two install cells above in your Jupyter notebook. Remember that you only have to do this once. After you have run them on your local machine, you can comment those two lines out.

### 2. Command prompt
* In Windows, if you have Anaconda, you can open an Anaconda PowerShell prompt. If you don't have Anaconda, just open a Windows PowerShell in admin mode.
* On a Mac, open a terminal window (search for "terminal" in Spotlight). Or look for "terminal" in your apps folder.

Now that you have a command window open, go to the spaCy website, click on the options that match your setup (operating system, etc.), and copy and paste the commands it gives you, one at a time: https://spacy.io/usage

### Loading spaCy and language model
Installation (if local) only needs to be done once. However, you need to import the spaCy module and load the language model every time you want to use it. 

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

### Defining some sentences

Remember that you can call your variables what you want. Play with the names of the variables and the contents of the sentences below. You don't need to print the sentence each time; I've only done it for the first so that you see it's stored in the right variable. Note the difference between `print(sent1)` and `sent1`: `print()` shows the text itself, while the bare variable name shows the string in quotation marks.

In [None]:
sent1 = "The motivational speaker exhorted us to change the way we live today, rather than looking always toward some vague distant futurity."

In [None]:
print(sent1)

In [None]:
sent1

In [None]:
# from a CBC story, https://www.cbc.ca/news/canada/nova-scotia/loblaws-will-no-longer-offer-50-discount-on-expiring-food-products-1.7084299
sent2 = "According to an email from Loblaw Companies Ltd. reviewed by CBC News, it will no longer discount perishable foods like meat, fruit, and vegetables, by 50 per cent off as they near their expiration date."

### Converting string to doc with spaCy
spaCy has a special type of object, a `Doc`. A `Doc` is a container for a processed text: it holds the tokens and everything spaCy has worked out about them. The processing itself is done by the pipeline, which is the `nlp` object we loaded above. When you call `nlp()` on a text, e.g., `sent1`, it applies all the NLP steps to it (tokenization, tagging, parsing, named entity recognition) and returns a `Doc`. Once you have converted a string (a sentence or a whole text) to a `Doc`, you can access everything that spaCy has done with it, i.e., the entire structure of language information that it has applied to it, with labels. spaCy refers to that language information and those labels as 'linguistic annotations'.

![spaCy pipeline](https://spacy.io/images/pipeline.svg)

Image from https://spacy.io/usage/processing-pipelines

In [None]:
doc1 = nlp(sent1)

`doc1` now contains a complex Python data type. Note that the output of just calling the variable (`doc1`) is slightly different from what we get when we call a string variable, above (`sent1`).

In [None]:
type(doc1)

In [None]:
print(doc1)

In [None]:
doc1

### Accessing the information in the Doc object

`doc1` contains lots of [useful information](https://spacy.io/api/doc):

* tokens (words)
* lemmas
* morphology
* POS tags
* syntactic structure (a parse tree)
* named entities


In [None]:
# print word tokens

for token in doc1:
    print(token)

In [None]:
# lemmas

for token in doc1:
    print(token.lemma_)

In [None]:
# morphology

for token in doc1:
    print(token.text, token.morph)

In [None]:
# POS tags (more on this below)

for token in doc1:
    print(token.text, token.pos_)

In [None]:
# syntactic structure, as dependencies

for token in doc1:
    print(token.i, token, token.dep_, token.head.i, token.head)

In [None]:
# named entities

for ent in doc1.ents:
    print(ent.text, ent.label_)

### Some of the same things, but for `sent2`

In [None]:
doc2 = nlp(sent2)

In [None]:
for token in doc2:
    print(token.text, token.morph)

In [None]:
for token in doc2:
    print(token.text, token.pos_)

In [None]:
for ent in doc2.ents:
    print(ent.text, ent.label_)

### Prettier visualizations

`displacy` is another module that allows us to see dependencies and entities in a user-friendly way.

In [None]:
from spacy import displacy

In [None]:
# syntactic structure, dependency
displacy.render(doc1, style='dep', jupyter=True)

In [None]:
# you can change the look of the dependency tree
options_parse = {"compact": True, "bg": "#f3f4f5",
           "color": "black", "font": "Source Sans Pro"}

displacy.render(doc1, style="dep", options=options_parse, jupyter=True)

In [None]:
# the same options, this time applied to doc2
options_parse = {"compact": True, "bg": "#f3f4f5",
           "color": "black", "font": "Source Sans Pro"}

displacy.render(doc2, style="dep", options=options_parse, jupyter=True)

In [None]:
displacy.render(doc1, style="ent", jupyter=True)

In [None]:
displacy.render(doc2, style="ent", jupyter=True)

### Practice!

Play with different sentences. You can also do more than one sentence in one `sent` string. And you can try all of the above with a longer text. You can also test how well it detects sentences.

### POS tags and how to examine them

Let's look a bit more at part of speech tags. The POS tagger in spaCy produces [a lot of information](https://spacy.io/api/token) and you can query it in different ways. 

The first bit of code gives you most of the information attached to each word. Note the difference between `token.pos_` and `token.tag_`. The first gives you a coarse-grained tag (DET, NOUN, VERB) from the [Universal POS tagset](https://universaldependencies.org/u/pos/). The second gives you a fine-grained tag, which for the English models follows the Penn Treebank conventions (NN = singular or mass noun, VBD = verb in the past tense, VBG = verb in the gerund or present participle form).

If you don't know what a tag means, you can ask spaCy to explain: `spacy.explain()` (next few code blocks).

You can also query spaCy for all its tags (last code block below). 

In [None]:
for token in doc1:
    print(token.text, token.lemma_, token.pos_, token.tag_)

In [None]:
# explain a specific tag

spacy.explain("NN")

In [None]:
# explain a specific tag

spacy.explain("VBD")

In [None]:
# explain a specific tag

spacy.explain("VBG")

In [None]:
# explain all the fine-grained tags (token.tag_) that the tagger can produce

for label in nlp.get_pipe("tagger").labels:
    print(label, " -- ", spacy.explain(label))

### A few more things about POS tags

You can list all the words that have a specific POS in the sentence/text. 

In [None]:
print("Nouns:", [token.lemma_ for token in doc2 if token.pos_ == "NOUN"])
print("Verbs:", [token.lemma_ for token in doc2 if token.pos_ == "VERB"])

You can list all the noun phrases in a sentence (which is really part of the dependency structure, not the POS tags).

In [None]:
print("Noun phrases:", [chunk.text for chunk in doc1.noun_chunks])

You can list the dependency structure of the sentence, by asking what the dependency relation is for each word, and what its head is. 

In [None]:
for token in doc1:
    print(token.text, token.tag_, token.dep_, token.head)

### Test ambiguities

The lecture notes had examples of ambiguous sentences or words that could have multiple tags. Try some of those and see how well spaCy tags them.

In [None]:
sents_amb = "That is the back door. He was lying on his back. We want to win the voters back. She promised to back the bill. "

In [None]:
doc_amb = nlp(sents_amb)

In [None]:
for token in doc_amb:
    print(token.text, token.tag_, token.dep_, token.head)

### Optional: making it all easier to read

This part is optional, and you don't need to know pandas yet. It is here in case you want to make the output a bit easier to read.

In [None]:
import pandas as pd

In [None]:
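data1 = []

for token in doc1:
    # store the head's text rather than the Token object, so the table holds plain strings
    data1.append([token.text, token.tag_, token.dep_, token.head.text])

# build the table and name the columns in one step
df = pd.DataFrame(data1, columns=['Text', 'Tag', 'Dependency', 'Head'])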

df