# Assignment 08 Worked

### Step 0: Working around VS Code

I use VS Code to present materials in class because it offers the simplest, cleanest interface. Having said that, it is very frustrating that Code does not wrap text in the output window. The following code is a workaround, and one you need not include in your notebooks:

In [23]:
import textwrap

# You will see lines that look like this:
# wrapped_text = textwrap.fill(text, width=80) 
# print(wrapped_text)
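
As a quick illustration of what those two lines do, here is a minimal sketch. The sample sentence is made up for the demonstration and is not part of the assignment:

```python
import textwrap

# A made-up long string standing in for real output
text = "the crew of the starship intercepts a transmission " * 5

# fill() returns one string with newlines inserted so that
# no line is longer than `width` characters
wrapped_text = textwrap.fill(text, width=80)
print(wrapped_text)
```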

### Step 1: Loading All the Things

I like to load all the things I need at the beginning of my notebook. This way, I can see all the packages I am using and it makes it easier to find them later.

After my imports, I define a custom function which takes a string and returns a list of tuples with the first element being the word (or token) and the second element being that word's part of speech tag.

After that, I load my test file. 

I am using separate cells for each task so I can troubleshoot more readily.

In [24]:
# IMPORTS
import nltk
from pathlib import Path # only needed for complete corpus

In [25]:
# CUSTOM FUNCTION
# Notice how there's documentation built into the function:
def tagPOS (a_string):
    """
    Takes a string and returns a list of tuples with the word 
    and its part of speech
    """
    tokens = nltk.word_tokenize(a_string)
    tagged = nltk.pos_tag(tokens)
    return tagged

Not only am I loading data with this cell, but I am also using the ability to "slice" a string by character index to begin exploring the text. I repeatedly enter various starting numbers, with a 50-character limit above that, until I get to some place where the text seems to start. Here I discovered there was a synopsis that starts at about the 300 mark.

NOTE: I came back later and decided to add `lower()` to the line where the file gets read. I realized that some adverbs (especially) might occur at the start of sentences, and I wanted to make sure that I was not missing any of those. 

In [26]:

# DATA
with open("../queue/scifi/alien.txt", mode="r", encoding="utf-8") as f:
    the_string = f.read().lower()

print(the_string[300:500])

                  synopsis

     en route back to earth from a far part of the galaxy, the crew of the

     starship snark intercepts a transmission in an alien language,

     originating from a
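
Instead of guessing start positions, `str.find()` can locate a landmark word directly. A small sketch, assuming the word "synopsis" appears in the text; the sample string below is a stand-in, not the actual file:

```python
# Stand-in text; in the notebook this would be `the_string`
sample = "title page ... credits ... synopsis en route back to earth from a far part of the galaxy"

# find() returns the index of the first match, or -1 if there is none
start = sample.find("synopsis")

if start != -1:
    # Show the 50 characters beginning at the landmark
    print(sample[start:start + 50])
```

This only helps once you already know a word to look for, so the trial-and-error slicing is still a reasonable first step.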


### Step 2: Getting the Parts-of-Speech

I could probably place the custom function here, and normally I would, keeping custom functions close to where I will be using them in a notebook. (Composing a notebook is different from composing a script: notebooks are about documenting your efforts; scripts are for well-developed code that is going to be used in something like production.)

In [27]:
# Break them into a list of tokens
tagged_text = tagPOS(the_string)

for i in tagged_text[30:40]:
    print(i)

('en', 'IN')
('route', 'NN')
('back', 'RB')
('to', 'TO')
('earth', 'NN')
('from', 'IN')
('a', 'DT')
('far', 'RB')
('part', 'NN')
('of', 'IN')


In [30]:
len(tagged_text)

27277

I am tempted to write another custom function to get a given part of speech out of the list, but for now I am going to work my way through this exercise with simple for loops.

In [29]:
# A for loop to work through our list of POS tuples
# and retrieve only the words that match the POS we want:

adverbs = []
for i in tagged_text:
    if i[1] == "RB":
        adverbs.append(i[0])

print(adverbs[0:50])
print(len(adverbs))


['formerly', 'back', 'far', 'not', 'then', 'horribly', 'not', 'ultimately', 'however', 'now', 'then', 'finally', 'only', 'alone', 'only', 'broussard', 'along', 'eerily', 'sterile', 'softly', 'abruptly', 'close', 'gradually', 'slowly', 'groggily', 'now', 'triumphantly', 'not', 'fully', 'around', 'well', 'not', 'just', 'too', 'right', 'blackness', 'silently', 'back', 'only', "n't", 'broussard', 'just', "n't", 'even', 'yet', 'not', 'home', 'only', 'halfway', 'here']
1282
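
For comparison, the same filtering can be written as a list comprehension. This is only a sketch on a tiny hand-made sample in the shape that `tagPOS` returns; the names `sample_tagged` and `sample_adverbs` are made up for the illustration:

```python
# A tiny hand-made sample in the same shape that tagPOS returns
sample_tagged = [("en", "IN"), ("route", "NN"), ("back", "RB"),
                 ("to", "TO"), ("earth", "NN"), ("far", "RB")]

# Unpack each tuple into word and tag, keep the word when the tag is RB
sample_adverbs = [word for word, tag in sample_tagged if tag == "RB"]

print(sample_adverbs)
```

Note that the Penn Treebank tag set also has `RBR` (comparative) and `RBS` (superlative) adverb tags. Testing `tag.startswith("RB")` instead of `tag == "RB"` would include those, if that is what you want.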


With all the adverbs in a list, I can now go through the list and count them. I am going to use a dictionary to keep track of the counts.

In [31]:
# Create a blank dictionary:
adverb_counts = {}

# Loop through the list of adverbs and count them:
for i in adverbs:
    if i in adverb_counts:
        adverb_counts[i] += 1
    else:
        adverb_counts[i] = 1
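
The standard library has a shortcut for this counting pattern, `collections.Counter`. A minimal sketch on a made-up list (the for loop above is still worth writing out at least once to see what is happening):

```python
from collections import Counter

# A short made-up list standing in for `adverbs`
sample_adverbs = ["then", "not", "then", "back", "not", "then"]

# Counter builds the word -> count mapping in one step
sample_counts = Counter(sample_adverbs)

print(sample_counts["then"])

# most_common(n) returns the n highest counts as (word, count) tuples
print(sample_counts.most_common(2))
```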

I am going to create a new object here rather than overwrite the dictionary. I don't know why. It's a habit as well as a strange phobia I have about messing with established objects. Nothing should stop you from writing over the old dictionary.

Python has always been able to sort a dictionary's items; what is fairly recent is that dictionaries remember insertion order (an implementation detail of CPython 3.6, guaranteed from 3.7). Note that `sorted()` returns a list of `(word, count)` tuples, not a dictionary. All the code says is: get the items and sort them by the second item in each tuple (the count), then reverse the order so the highest count is first. (The default sort is ascending, so the lowest number would be first.)

In [32]:
# Let's get that dictionary sorted:
most_freq_adverbs = sorted(adverb_counts.items(), key=lambda x: x[1], reverse=True)

print(most_freq_adverbs[0:10])

[("n't", 122), ('then', 86), ('here', 57), ('back', 54), ('now', 51), ('just', 48), ('not', 43), ('there', 37), ('well', 29), ('right', 26)]
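
To see what `key` and `reverse` each contribute, here is a small sketch on a three-item dictionary. The counts are borrowed from the output above, but the variable names are made up for the illustration:

```python
sample_counts = {"then": 86, "n't": 122, "here": 57}

# items() yields (word, count) tuples; key picks the count (index 1)
ascending = sorted(sample_counts.items(), key=lambda x: x[1])
descending = sorted(sample_counts.items(), key=lambda x: x[1], reverse=True)

print(ascending)
print(descending)

# The result is a list of tuples; dict() turns it back into a
# dictionary that keeps this order
as_dict = dict(descending)
print(list(as_dict))
```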


Well, wow, those are some underwhelming adverbs. I am not sure what I was expecting, but I was hoping for something a little more exciting. Maybe I need to use a stop word list to get rid of the common adverbs. 

### Dropping Common Adverbs

What I am hoping to do here is use a list of common adverbs that I found [online](https://engdic.org/100-most-common-adverbs-list/), which I tweaked by removing adverbs that I thought might be interesting with regard to a collection of science fiction texts, and then I will use that revised list to filter out the common adverbs. The result, I hope, will be a list of adverbs that are more interesting.

In [33]:
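with open("../data/adverbs.txt", mode="r", encoding="utf-8") as f:
    # NOTE: this reuses the name `adverbs`. From here on it holds the
    # list of common adverbs to drop, not the adverbs from the screenplay.
    adverbs = f.read().split(', ')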

wrapped_text = textwrap.fill(' '.join(adverbs), width=72) 
print(wrapped_text)

about above absolutely actually after almost always anywhere around
backward basically before below brightly broadly carefully certainly
clearly closely completely daily definitely directly early enough
especially even everywhere exactly firmly forward frequently generally
here indirectly just kindly late likely loudly mainly meanwhile merely
naturally nearly never notably nowhere obviously often partially
particularly possibly pretty probably quickly quietly quite rarely
rather really remarkably sadly sharply significantly simply slightly
slowly so solely sometimes somewhat somewhere specifically suddenly
there too totally universally upward usually utterly very warmly wholly
widely "n't" then back now not well right again only still away maybe


Now the question becomes: when in the series of steps above is it best to filter out the common adverbs? As I am splitting the string into a list? Filter the original list? Filter the dictionary?

For reference, filtering the list would look like this:

```python
# Create a new list to hold the filtered words 
filtered_words = [] 

# Iterate over the list of words 
for word in words: 
  # If the word is not in the stop word list, add it to the filtered list 
  if word not in stop_words: 
    filtered_words.append(word) 
```

I think I will try filtering the dictionary first. That seems to me the least computationally expensive, since the dictionary holds each adverb once while the list holds every occurrence. (But I don't really know.)

In [34]:
# This is a dictionary comprehension
# It could be written as a for loop, but this is more compact
# "if not in" filters out the adverbs we don't want
filtered = {key: adverb_counts[key] for key in adverb_counts if key not in adverbs}
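
For reference, here is the same filter written out as a for loop, on made-up stand-ins for the dictionary and the stop list. The one change is an assumption of mine rather than something the cell above does: the stop list is converted to a `set`, which makes each membership check faster than scanning a list.

```python
# Made-up stand-ins for `adverb_counts` and the stop list
sample_counts = {"then": 86, "down": 26, "finally": 12, "just": 48}
sample_stop_list = ["then", "just", "here"]

# Converting the list to a set makes each "in" check fast
stop_set = set(sample_stop_list)

sample_filtered = {}
for key, value in sample_counts.items():
    # Keep only the adverbs that are not in the stop list
    if key not in stop_set:
        sample_filtered[key] = value

print(sample_filtered)
```

With a stop list of roughly a hundred words the difference is negligible; it starts to matter when filtering the full list of occurrences across a whole corpus.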

In [35]:
# Now we will need to sort the filtered dictionary:
freq_adverbs = sorted(filtered.items(), key=lambda x: x[1], reverse=True)

print(len(freq_adverbs))
for i in freq_adverbs[0:50]:
    print(i)

161
("n't", 122)
('down', 26)
('broussard', 18)
('up', 14)
('as', 13)
('finally', 12)
('instantly', 11)
('yet', 10)
('far', 8)
('alone', 8)
('close', 8)
('faust', 8)
('immediately', 8)
('along', 6)
('inside', 6)
('abruptly', 5)
('yes', 5)
('out', 5)
('ahead', 5)
('nervously', 5)
('frantically', 5)
('long', 5)
('silently', 4)
('extremely', 4)
('together', 4)
('first', 4)
('standard', 4)
('fast', 4)
('tightly', 4)
('later', 4)
('gradually', 3)
('halfway', 3)
('heavily', 3)
('violently', 3)
('i', 3)
('else', 3)
('barely', 3)
('strangely', 3)
('no', 3)
('much', 3)
('also', 3)
('somehow', 3)
('anyway', 3)
('already', 3)
('softly', 2)
('atmosphere', 2)
('hardly', 2)
('soon', 2)
('longer', 2)
('okay', 2)


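I appear to be having difficulty knocking *n't* out of my results. The token itself is an artifact of the way NLTK's word tokenization works: it splits contractions, so *don't* becomes *do* and *n't*. The reason it survives the filter, though, is probably visible in the stop list printed above, where the entry appears as `"n't"` with literal quotation marks and so never matches the bare token. If I used my backup tokenizer that uses regex, I would not have this problem. Since this is one small problem among many others, I am choosing to ignore it.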
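
One possible fix, sketched under the assumption that the stop-list file stores the entry with literal quotation marks, as the printed list suggests. The raw string below is made up and is not the contents of the actual file:

```python
# A made-up raw string in the same comma-separated format as the file
raw = 'widely, "n\'t", then, back, maybe\n'

# Strip surrounding whitespace, then any double quotes, from each entry
cleaned = [entry.strip().strip('"') for entry in raw.split(",")]

print(cleaned)
```

Stripping whitespace also guards against a trailing newline clinging to the last entry in the file.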

To be honest, the adverbs for one screenplay are not terribly interesting. What if we did the entire corpus?

And, honestly, I have no way to account for "broussard" being an adverb.

### Step 3: Adverbs of the Entire Corpus

We know how to load a bunch of files. For our purposes, we can merge all the screenplays into one big string, and then we can run the same code as above to get the adverbs.

In [36]:
# DATA
screenplays = []
for p in Path('../queue/scifi/').glob('*.txt'):
    with open(p, encoding="utf8", errors='ignore') as f:
        contents = f.read()
        screenplays.append(contents)

all_screenplays = ' '.join(screenplays)

The function below combines the custom function with the for loop we created in the second section to get all the adverbs out of a string.

In [37]:
def getAdverbs(a_string):
    """
    Takes a string and returns a list of the words
    tagged as adverbs (RB), duplicates included
    """
    tokens = nltk.word_tokenize(a_string)
    tagged = nltk.pos_tag(tokens)
    adverbs = []
    for i in tagged:
        if i[1] == "RB":
            adverbs.append(i[0])
    return adverbs

With the custom function loaded, let's turn it loose on all our screenplays.

I admit I wrote it as a custom function because I thought I would use it in a for loop to iterate over all the screenplays, but I realized that I could just combine all the screenplays into one giant text if all I care about is *all* the adverbs.

That noted, it would be interesting to run this as a for loop, keeping screenplays separate, and then running TF-IDF to see which adverbs were common to all screenplays and which ones distinguished one text, or group of texts, from the rest of the collection.
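
To make the TF-IDF idea concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the three "screenplays" and their adverbs are made up, and the weighting used (relative frequency times the natural log of documents over document frequency) is just one common variant of TF-IDF, not the only one.

```python
import math
from collections import Counter

# Made-up adverb lists for three hypothetical screenplays
docs = {
    "film_a": ["suddenly", "then", "then", "slowly"],
    "film_b": ["then", "eerily", "eerily", "slowly"],
    "film_c": ["then", "quietly"],
}

# Document frequency: in how many screenplays does each adverb occur?
doc_freq = Counter()
for words in docs.values():
    doc_freq.update(set(words))

n_docs = len(docs)
tfidf = {}
for name, words in docs.items():
    word_counts = Counter(words)
    total = len(words)
    tfidf[name] = {
        # term frequency times inverse document frequency
        word: (count / total) * math.log(n_docs / doc_freq[word])
        for word, count in word_counts.items()
    }

# Print the highest-scoring adverb for each screenplay
for name, scores in tfidf.items():
    top = max(scores, key=scores.get)
    print(name, top, round(scores[top], 3))
```

An adverb that appears in every screenplay, like "then" here, gets a score of zero, which is exactly the "common to all" signal described above.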

In [38]:
all_adverbs = getAdverbs(all_screenplays)
print(len(all_adverbs))

176159


In [None]:
# Let's see the first adverbs
for adverb in all_adverbs[0:20]:
    print(adverb)

not
not
enough
Then
not
here
Suddenly
together
finally
apart
no
so
back
literally
literally
then
then
simply
violently
forward


I was a little surprised to see duplicates until I realized that this is a complete list, and not a list of counts, and so duplicates would be present. I also see that I did not lowercase the texts. *Sigh.* I will have to do this again.
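
A cheaper alternative to reloading and re-tagging everything is to lowercase the adverbs after they have been extracted. A sketch on a few words taken from the output above; the variable names are made up for the illustration:

```python
# A few adverbs from the output above, mixed case included
sample_adverbs = ["not", "Then", "here", "Suddenly", "then"]

# lower() on each word merges "Then" and "then" into one form
lowered = [word.lower() for word in sample_adverbs]
print(lowered)
```

This is not guaranteed to give the same result as lowercasing before tagging: the tagger sees capitalization, so the two approaches may tag some words differently.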

In [40]:
filtered = [word for word in all_adverbs if word not in adverbs]
print(len(filtered))

85563


In [41]:
# Create a blank dictionary:
counts = {}

# Loop through the list of adverbs and count them:
for i in filtered:
    if i in counts:
        counts[i] += 1
    else:
        counts[i] = 1

print(len(counts))

3489


In [42]:

# Now we will need to sort the filtered dictionary:
counts_sorted = sorted(counts.items(), key=lambda x: x[1], reverse=True)

# Let's see the top 20 adverbs:
for i in counts_sorted[0:20]:
    print(i)

("n't", 19558)
('down', 3590)
('Then', 2959)
('up', 2212)
('Now', 1658)
('Not', 1342)
('Suddenly', 1227)
('as', 1179)
('ever', 1123)
('far', 1080)
('together', 998)
('ahead', 949)
('already', 895)
('close', 847)
('once', 776)
('long', 763)
('finally', 743)
('alone', 739)
('else', 724)
('inside', 686)


Job done. You can stop looking at this notebook now. The next section is simply me exploring what kinds of views pandas might be able to offer us.

### Step 3b: Lowercasing & Using Regex

I will be curious to see if there's a speed difference between using NLTK's `word_tokenize` and using regex.

In [None]:
def getAdverbs(a_string):
    """
    Takes a string and returns a list of the words
    tagged as adverbs (RB), duplicates included
    """
    tokens = nltk.word_tokenize(a_string)
    tagged = nltk.pos_tag(tokens)
    adverbs = []
    for i in tagged:
        if i[1] == "RB":
            adverbs.append(i[0])
    return adverbs
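
Here is one way the lowercasing, regex, and timing pieces might look. This is a sketch, not the backup tokenizer mentioned earlier: the pattern, the function names `regex_tokenize` and `time_tokenizer`, and the sample sentence are all assumptions made for the illustration.

```python
import re
import time

def regex_tokenize(a_string):
    """
    Lowercases a string and returns a list of word tokens,
    keeping internal apostrophes (so "don't" stays one token)
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", a_string.lower())

def time_tokenizer(tokenizer, a_string):
    """
    Runs a tokenizer once and returns (tokens, seconds elapsed)
    """
    start = time.perf_counter()
    tokens = tokenizer(a_string)
    elapsed = time.perf_counter() - start
    return tokens, elapsed

sample = "Suddenly the door opens. It doesn't close again."
tokens, seconds = time_tokenizer(regex_tokenize, sample)
print(tokens)
print(f"{seconds:.6f} seconds")
```

Because `time_tokenizer` accepts any function, passing `nltk.word_tokenize` and the full corpus string would give the other half of the comparison. One caveat: the tagger may treat an unsplit token such as *doesn't* differently from the *does* / *n't* pair that NLTK's tokenizer produces, so the adverb counts from the two approaches may not line up exactly.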

### Step 4: The Pandas Option

In [None]:
import pandas as pd

df = pd.DataFrame(counts_sorted, columns=['adverb', 'count'])
df.head()

In [None]:
df = pd.DataFrame.from_dict(counts,
                            orient='index', 
                            columns=['count'])
df.head()