## Part of Speech Tagging is Weird

NLTK's part-of-speech tagging gave us some weird results, and we might decide that it's just a bit unreliable here. In this notebook, we're going to compare NLTK's part-of-speech tagging with that of spaCy, a newer library that's widely used in industry for natural language processing (NLP). We'll also practice preparing some simple files for output.

### Installs
To try out spaCy, you need to **open your terminal and install it first**:
`pip install spacy` 

### Language Models
Like NLTK's collections that we downloaded, [spaCy has trained language models](https://spacy.io/usage/models) to download. You download these in your Python script after you import spacy, and after you download once, you don't need to do it again (so you can comment out the download line). We're going to try the medium and large models in English. (It's good to know that both spaCy and NLTK have resources for NLP on other languages, too!)  

In [None]:
import spacy


### Downloading language models
To work with spaCy's pre-trained language models, you need to download them to your virtual environment. There are:
* en_core_web_sm (smallest--not as much info as the others)
* **en_core_web_md** (Pretty good data here, size: 34 MB)
* **en_core_web_lg** (Lots of data here, size: 400 MB)

Note that the largest one will have the most data and will probably be the most reliable.

In [None]:
# You can skip this cell if you have already downloaded the model.
# After you download a model into your virtual environment for the first time, you can comment out the download line.
# spaCy's medium and large models will give us the best results for NLP tagging.
# spacy.cli.download() only fetches and installs the model; it returns nothing, so we don't assign its result to a variable.
# spacy.cli.download("en_core_web_sm")
# spacy.cli.download("en_core_web_md")
spacy.cli.download("en_core_web_lg")

### Load the model
Now we define the nlp variable by LOADING the model you downloaded.

In [None]:
nlp = spacy.load("en_core_web_lg")

In [None]:
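filepath = 'hughes-txt/sixteen.txt'
# Read the whole file into one string; the with block closes the file for us.
with open(filepath, 'r', encoding='utf8') as source:
    f = source.read()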

## spaCy Part of Speech Tagging
spaCy needs our text as a string (str). From that string it builds a Doc object: a sequence of tokens, each carrying information such as its part of speech and its lemma.
So we go back to the text we read from our file!

In [None]:
# We used nlp as our variable for the spaCy operations. 
# f is our variable for the source file. spaCy doesn't tell you how it tokenizes or what it's doing (sigh). 
spacyRead = nlp(f)
for token in spacyRead:
    print(token.text, "---->", token.pos_, ":::::", token.lemma_)
    

In [None]:
spacy.explain("DET")

### For spaCy: define a function to collect the words you want

In [None]:
def wordCollector(words):
    wordList = []
    count = 0
    for token in words:
        if token.pos_ == "VERB":
            count += 1
            # Uncomment to inspect each match as it is found:
            # print(count, ": ", token.text, " lemma: ", token.lemma_, " pos: ", token.pos_)
            # don't forget the underscore after token.lemma_ , token.pos_, etc.!
            # (without the underscore you get an integer ID instead of readable text)
            wordList.append(token.lemma_)
    # print(count, ": ", wordList)
    return wordList

myWords = wordCollector(spacyRead)
print(myWords)
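
The same pattern works for any part of speech, not just verbs. Here is a minimal sketch of a more general collector. The names `collect_lemmas` and `FakeToken` are made up for this illustration, and `FakeToken` is only a stand-in with the same `pos_` and `lemma_` attributes as a spaCy token, so the cell runs even without a model loaded.

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy tokens (not part of spaCy itself).
FakeToken = namedtuple("FakeToken", ["text", "pos_", "lemma_"])

def collect_lemmas(tokens, pos="VERB"):
    """Return the lemma of every token whose coarse POS tag equals `pos`."""
    return [token.lemma_ for token in tokens if token.pos_ == pos]

# A tiny made-up sample, just to show the function at work.
sample = [
    FakeToken("dreams", "NOUN", "dream"),
    FakeToken("fly", "VERB", "fly"),
    FakeToken("flying", "VERB", "fly"),
    FakeToken("broken", "ADJ", "broken"),
]

sample_verbs = collect_lemmas(sample)
sample_nouns = collect_lemmas(sample, pos="NOUN")
print(sample_verbs)
print(sample_nouns)
```

With the real text, something like `collect_lemmas(spacyRead, pos="ADJ")` should gather the adjectives instead of the verbs.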

### Frequency of words
Here is something we can do: because we didn't reduce our results to a set of unique words, our list still contains every verb lemma, repeats included.
The Counter class in Python's collections module offers a speedy way to count the number of times each item appears in a list.

In [None]:
from collections import Counter

word_freq = Counter(myWords)
# most_common() returns a list of (word, count) tuples, sorted from most to least frequent
ranked = word_freq.most_common()
topTen = word_freq.most_common(10)
print(topTen)
# Slicing backwards from the end of the ranked list gives the ten least frequent words
lastTen = word_freq.most_common()[:-11:-1]
print(lastTen)
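
To see what Counter and that backwards slice are doing, it can help to try them on a list small enough to count by hand. This is just a sketch: `toy_lemmas` is a made-up list, not output from our text.

```python
from collections import Counter

# A made-up list of lemmas with some repeats.
toy_lemmas = ["go", "be", "go", "see", "be", "go", "dream"]

toy_freq = Counter(toy_lemmas)

# The two most frequent lemmas, as (lemma, count) tuples.
top_two = toy_freq.most_common(2)

# [:-3:-1] walks backwards from the end of the ranked list and takes two items,
# so we get the two least frequent lemmas, rarest end first.
bottom_two = toy_freq.most_common()[:-3:-1]

print(top_two)
print(bottom_two)
```

Notice that the slice stops one step before the index you give it, which is why the notebook uses `-11` to get ten items.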

## Write something to an output file (just text)

In [None]:
with open("verbFreq.txt", "w", encoding="utf8") as o:
    for r in ranked:
        o.write(str(r) + "\n")
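
Writing `str(r)` puts each tuple in the file with its parentheses and quotation marks. If you would rather open the output in a spreadsheet, one option is a tab-separated file written with Python's built-in csv module. This is a sketch under some assumptions: `toy_ranked`, the `.tsv` file name, and the temporary folder are stand-ins chosen for the example, so it doesn't overwrite anything in your project.

```python
import csv
import os
import tempfile
from collections import Counter

# Made-up stand-in for the ranked list of (lemma, count) tuples.
toy_ranked = Counter(["go", "be", "go"]).most_common()

# Write into a temporary folder so the example leaves your own files alone.
out_path = os.path.join(tempfile.mkdtemp(), "verbFreq.tsv")

# newline="" lets the csv module control the line endings itself.
with open(out_path, "w", encoding="utf8", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["lemma", "count"])  # header row
    writer.writerows(toy_ranked)         # one row per (lemma, count) tuple

# Read the file back to check what was written.
with open(out_path, encoding="utf8") as check:
    written = check.read()
print(written)
```

In the notebook itself you could pass `ranked` to `writer.writerows()` and use a file name in your project folder.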