## Part of Speech Tagging is Weird

NLTK shows us some weird part of speech tagging. We might decide that it's just a bit unreliable here. In this notebook, we're going to compare the part of speech tagging from NLTK with a newer library called spaCy that's used widely in industry for natural language processing (NLP). We'll also practice preparing some simple files for output. 

### Installs
To try out spaCy, you need to **open your terminal and install it first**:
`pip install spacy` 

### Language Models
Like NLTK's collections that we downloaded, [spaCy has trained language models](https://spacy.io/usage/models) to download. You download these in your Python script after you import spacy, and after you download once, you don't need to do it again (so you can comment out the download line). We're going to try the medium and large models in English. (It's good to know that both spaCy and NLTK have resources for NLP on other languages, too!)  

In [None]:
import spacy


### Downloading language models
To work with spaCy's pre-trained language models, you need to download them to your virtual environment. There are:
* en_core_web_sm (smallest--not as much info as the others)
* **en_core_web_md** (Pretty good data here, size: 34 MB)
* **en_core_web_lg** (Lots of data here, size: 400 MB)

Note that the largest one has the most data and will probably be the most reliable.

In [None]:
# CAN YOU SKIP THIS???
# After you download a model into your virtual environment for the first time, you can comment out the download line.
# spaCy's medium and large models will give us the best results for NLP tagging.
# spacy.cli.download() only installs the model and returns nothing, so we don't assign its result to a variable.
# spacy.cli.download("en_core_web_sm")
# spacy.cli.download("en_core_web_md")
spacy.cli.download("en_core_web_lg")

### Load the model 
Now we define the nlp variable by LOADING the model you downloaded.

In [None]:
nlp = spacy.load("en_core_web_lg")

In [None]:
filepath = 'hughes-txt/sixteen.txt'
# The with block closes the file for us once it has been read.
with open(filepath, 'r', encoding='utf8') as source:
    f = source.read()

## spaCy Part of Speech Tagging
spaCy needs our text as a string (`str`). When we pass that string to `nlp()`, spaCy returns a `Doc` object made up of tokens, each carrying information such as its part of speech and lemma.
So we go back to our opened file! 

In [None]:
# We used nlp as our variable for the spaCy operations. 
# f is our variable for the source file. spaCy doesn't tell you how it tokenizes or what it's doing (sigh). 
spacyRead = nlp(f)
for token in spacyRead:
    print(token.text, "---->", token.pos_, ":::::", token.lemma_)
  

In [None]:
spacy.explain("VERB")

### For spaCy: define a function to collect the words you want

In [None]:
def wordCollector(words):
    """Collect the lemma of every token that spaCy tagged as a VERB."""
    wordList = []
    count = 0
    for token in words:
        if token.pos_ == "VERB":
            count += 1
            # Uncomment to inspect each match as it is found:
            # print(count, ": ", token.text, " lemma: ", token.lemma_, " pos: ", token.pos_)
            # don't forget the underscore after token.lemma_ , token.pos_, etc.!
            wordList.append(token.lemma_)
    # print(count, "verbs collected")
    return wordList
myWords = wordCollector(spacyRead)
print(myWords)
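
The same pattern works for any part of speech if the tag becomes a parameter. The sketch below is one possible variation, not part of the original recipe: `collect_lemmas` and `FakeToken` are hypothetical names, and `FakeToken` is only a stand-in so the example runs without loading a model. Real spaCy tokens expose the same `.pos_` and `.lemma_` attributes, so you could pass `spacyRead` instead of `sample`.

```python
from collections import namedtuple

# Stand-in for a spaCy token (hypothetical, for illustration only).
FakeToken = namedtuple("FakeToken", ["text", "pos_", "lemma_"])

def collect_lemmas(tokens, pos="VERB"):
    """Return the lemma of every token whose coarse POS tag matches pos."""
    return [token.lemma_ for token in tokens if token.pos_ == pos]

sample = [
    FakeToken("Dreams", "NOUN", "dream"),
    FakeToken("fly", "VERB", "fly"),
    FakeToken("broken", "ADJ", "broken"),
    FakeToken("flew", "VERB", "fly"),
]

print(collect_lemmas(sample))              # ['fly', 'fly']
print(collect_lemmas(sample, pos="NOUN"))  # ['dream']
```

Note that both "fly" and "flew" come back as the lemma "fly", which is why counting lemmas groups the different forms of a verb together.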

### Frequency of words
Here is something we can do: because we didn't reduce our collection to a set of unique words, the list still holds one lemma for every time a verb appears in the text, repeats included.
The `Counter` class in `collections` offers a speedy way to count the number of times something appears in a list.

In [None]:
from collections import Counter

word_freq = Counter(myWords)
# most_common() returns a list of (word, count) tuples, sorted from most to least frequent
ranked = word_freq.most_common()
topTen = word_freq.most_common(10)
print(topTen)
# Slicing backwards from the end gives the ten least frequent words, rarest first
lastTen = word_freq.most_common()[:-11:-1]
print(lastTen)
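
To see what `Counter` is doing on a scale small enough to check by hand, here is a minimal sketch. The list `sample_lemmas` is made up for illustration and stands in for `myWords`.

```python
from collections import Counter

# Hypothetical lemma list standing in for myWords
sample_lemmas = ["go", "see", "go", "dream", "see", "go", "hold"]
sample_freq = Counter(sample_lemmas)

print(sample_freq.most_common(2))         # [('go', 3), ('see', 2)]
print(sample_freq.most_common()[:-3:-1])  # [('hold', 1), ('dream', 1)]

# A Counter also works like a dictionary: look up one word at a time.
print(sample_freq["go"])    # 3
print(sample_freq["sing"])  # 0 (a missing word counts as zero, with no error)
```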

## Write something to an output file (just text)

In [None]:
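# The with block closes the file when we're done, so everything is saved to disk.
with open("verbFreq.txt", "w", encoding="utf8") as o:
    for r in ranked:
        o.write(str(r) + "\n")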
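
If you would rather open the output in a spreadsheet, one option is to write tab-separated columns instead of the raw tuples. This is only a sketch under assumptions: `sample_ranked` is made-up data in the same shape as `ranked`, and the file goes to a temporary folder so the example doesn't touch your project files.

```python
import os
import tempfile

# Hypothetical ranked data in the same shape as word_freq.most_common()
sample_ranked = [("go", 3), ("see", 2), ("dream", 1)]

# Write to a temporary folder (an assumption made just for this sketch)
out_path = os.path.join(tempfile.mkdtemp(), "verbFreq.tsv")

with open(out_path, "w", encoding="utf8") as out:
    out.write("lemma\tcount\n")          # header row
    for word, count in sample_ranked:    # unpack each (word, count) tuple
        out.write(f"{word}\t{count}\n")

# Read the file back to confirm what was written
with open(out_path, "r", encoding="utf8") as check:
    lines = check.read().splitlines()
print(lines)
```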

## Plotting a simple chart with a Python Library

We have data we can use to plot. There are SVG plotting libraries designed for Python, and one of the best for us working on websites is PyGal, since it's easy to customize and outputs a nice file for use on a website. 

Read the [PyGal documentation](https://www.pygal.org/en/stable/documentation/index.html) (or any plotting library you find) to see how you can adapt it as you wish. 

We will need to open a terminal and **pip install pygal** or **python3.12 -m pip install pygal**.

In [None]:
import pygal
from pygal.style import DarkSolarizedStyle

# There are lots of chart types in pygal. Bar is a nice simple one:
bar_chart = pygal.Bar(style=DarkSolarizedStyle)
bar_chart.title = 'Top 10 Most Frequent Verbs'

# Add data to the chart
for word, freq in topTen:
    bar_chart.add(word, freq)

# Render to file (SVG format)
bar_chart.render_to_file('top10_verbs.svg')

### Send my list of words OR word lemmas to WordNet and get me synset info

In [None]:
import nltk
from nltk.corpus import wordnet as wn


In [None]:
# Lowercase first, then make the set, so "Go" and "go" don't both survive as separate entries.
lowerCased = {w.lower() for w in myWords}

sortedWords = sorted(lowerCased)
print(sortedWords)

In [None]:
# Loop over the unique, sorted words so each word is looked up and reported only once.
for w in sortedWords:
    synsetCount = len(wn.synsets(w))
    print(f"The word {w} belongs to {synsetCount} synsets in WordNet.")

## PyGal Fun with Ambiguity Data
Okay, what if I want to customize the colors of my plot using the synset ambiguity information? 
This is a little tricky, but we can scale each synset count to a color value between 0 and 255 and use it as the blue channel of a hex color. Here's a little recipe I cooked up, and I got a little help from ChatGPT for the color function. 

In [None]:
from collections import Counter
import pygal
from pygal.style import Style
from nltk.corpus import wordnet as wn

word_freq = Counter(myWords)
topTen = word_freq.most_common(10)
print(topTen)

# Step 1: Get synset count for each of the top 10 verbs
synset_counts = {}
for word, freq in topTen:
    synset_counts[word] = len(wn.synsets(word, pos='v'))

# Step 2: Normalize synset counts to 50–255 for coloring (the floor of 50 keeps the lowest bar from going fully black)
max_syn = max(synset_counts.values())
min_syn = min(synset_counts.values())

def synset_to_color(syn_count):
    """Maps synset count to a shade of blue (brighter = more ambiguous)"""
    if max_syn == min_syn:
        blue = 150
    else:
        blue = int(50 + 205 * (syn_count - min_syn) / (max_syn - min_syn))
    return f'#{0:02x}{0:02x}{blue:02x}'

# Step 3: Create custom Pygal style
custom_style = Style(
    background='white',
    plot_background='white',
    foreground='black',
    foreground_strong='black',
    foreground_subtle='gray',
    opacity='.8',
    opacity_hover='.9',
    transition='400ms ease-in',
    colors=('#cccccc',)  # Give PyGal a dummy color to start
)

# Step 4: Create the bar chart
bar_chart = pygal.Bar(style=custom_style)
bar_chart.title = 'Top 10 Verb Frequencies Colored by Ambiguity (Synsets)'

for word, freq in topTen:
    syns = synset_counts[word]
    color = synset_to_color(syns)
    bar_chart.add(word, [{'value': freq, 'label': f'{word} has {syns} synsets', 'style': f'fill:{color}'}])
    # I can add whatever new info I want (beyond the 'value') to PyGal's tooltip using 'label'.

# Step 5: Render the SVG chart
bar_chart.render_to_file('top_verbs_colored_by_synsets.svg')
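
To check the color arithmetic by hand, here is the same scaling as a stand-alone sketch. `scale_to_blue` is a hypothetical variant of the color function that takes the minimum and maximum as arguments instead of reading them from outside, and the synset counts in `sample_counts` are made up for illustration.

```python
def scale_to_blue(syn_count, min_syn, max_syn):
    """Map a synset count to a hex color between #000032 and #0000ff."""
    if max_syn == min_syn:
        blue = 150  # every word is equally ambiguous: use one middle shade
    else:
        blue = int(50 + 205 * (syn_count - min_syn) / (max_syn - min_syn))
    return f"#0000{blue:02x}"

# Hypothetical synset counts, not real WordNet results
sample_counts = {"be": 13, "go": 30, "dream": 2}
low, high = min(sample_counts.values()), max(sample_counts.values())

for word, n in sample_counts.items():
    print(word, n, scale_to_blue(n, low, high))
# be 13 #000082
# go 30 #0000ff
# dream 2 #000032
```

The lowest count lands on 50 (hex `32`) and the highest on 255 (hex `ff`), so higher synset counts produce brighter blues.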

## Your Turn! 
Adapt this notebook to explore one or more of your project files. Be sure that you are able to:
* **Open a file with Python first before you tokenize it**. (This time, try out spaCy.)
* Collect a selection of words, with POS tagging or without it. (Can you output all the tokens that just have the same beginning or ending?)
* Options:
     * Send your words or word lemmas to WordNet and return synset info, and/or
     * Find out about how frequently each of your selected tokens is used in the text using the Counter.
* This time, **open a file to write some of the information you retrieve to output**. 
* Add, commit, and push to your GitHub repo.
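
For the question about tokens that share a beginning or ending, Python's string methods `startswith()` and `endswith()` are one way in. A minimal sketch, assuming a made-up list `sample_tokens` in place of something like `[token.text for token in spacyRead]`:

```python
# Hypothetical token texts standing in for the tokens from your own file
sample_tokens = ["Dreaming", "holding", "dreams", "sing", "unbroken", "undone"]

# Lowercase each token before testing so capitalized words still match
ing_words = [t for t in sample_tokens if t.lower().endswith("ing")]
un_words = [t for t in sample_tokens if t.lower().startswith("un")]

print(ing_words)  # ['Dreaming', 'holding', 'sing']
print(un_words)   # ['unbroken', 'undone']
```

Notice that "sing" is caught along with the real -ing forms. Matching on spelling alone can't tell them apart, which is one reason to combine this kind of filter with POS tagging.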