# Create a custom list of tokens

This notebook creates a custom list of tokens: specifically, every word in the Seinfeld script. The output is a text file containing each distinct word exactly once, in alphabetical order, one word per line.

<br>

## The task

Go from this (CSV containing Seinfeld dialogue and metadata):

```
...
56,GEORGE,Im not gonna watch you do laundry.,1,S01E01,1
57,JERRY,"Oh, come on, be a come-with guy.",1,S01E01,1
58,GEORGE,"Come on, Im tired.",1,S01E01,1
59,CLAIRE,"(to Jerry) Dont worry, I gave him a little caffeine. Hell perk up.",1,S01E01,1
60,GEORGE,"(panicking) Right, I knew I felt something!",1,S01E01,1
61,GEORGE,Jerry? I have to tell you something. This is the dullest moment Ive ever experienced.,1,S01E01,1
62,JERRY,"Well, look at this guy. Look, hes got everything, hes got detergents, sprays, fabric softeners.  This is not his first load.",1,S01E01,1
...
```

To this (a list of each word spoken in Seinfeld):

```
[...'agenda', 'agent', 'agents', 'ages', 'aggravate',...]
```

## Store all of the words in the dialogue as a list

In [None]:
from pandas import read_csv
import re

# Use pandas read_csv to read the csv data and
# store it in a dataframe object
df = read_csv('seinfeld_raw.csv')

# Join every row of the dataframe's Dialogue column
# into one string called all_text; dropna() skips rows
# with missing dialogue, which astype(str) would
# otherwise turn into the literal token 'nan'
all_text = ' '.join(df['Dialogue'].dropna().astype(str))

# Lowercase the text, then use a regular expression to
# replace every character that is not a letter a-z
# (punctuation, digits, whitespace) with a space
all_text = re.sub(r'[^a-z]', ' ', all_text.lower())

# Split the string into a list of tokens on whitespace;
# split() with no argument treats runs of spaces as one
# separator, so no empty tokens are produced
all_tokens = all_text.split()

# Sort the list
all_tokens.sort()
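
To see what each step does without loading the full CSV, here is a minimal sketch of the same pipeline on two dialogue lines. The `sample_lines` list is a hypothetical stand-in for the `Dialogue` column, not part of the original notebook.

```python
import re

# Hypothetical sample lines standing in for the Dialogue column
sample_lines = [
    "Oh, come on, be a come-with guy.",
    "Come on, Im tired.",
]

# Join the lines into one lowercase string
text = ' '.join(sample_lines).lower()

# Replace every character that is not a-z with a space
text = re.sub(r'[^a-z]', ' ', text)

# Split on runs of whitespace, then sort alphabetically
tokens = sorted(text.split())
print(tokens)
```

Note that the hyphen in "come-with" is replaced by a space, so it becomes the two tokens `come` and `with`. Apostrophes and digits would be treated the same way.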

## Create a deduplicated version of the list

In [None]:
# Create a new list that holds exactly one of each token.
# A set drops the duplicates, and sorted() puts the
# result back in alphabetical order.
unique_tokens = sorted(set(all_tokens))
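
There is more than one way to deduplicate a list in Python. The sketch below compares three of them on a small, hypothetical token list. Checking `token not in some_list` scans the whole list each time, so the loop version slows down noticeably on a long list, while the set and dict versions use hash lookups.

```python
# Hypothetical, already-sorted token list
tokens = ['a', 'be', 'come', 'come', 'guy', 'on', 'on']

# Loop-based deduplication: keeps the first occurrence of each token
unique_loop = []
for token in tokens:
    if token not in unique_loop:
        unique_loop.append(token)

# Set-based deduplication: a set has no order, so sort afterwards
unique_set = sorted(set(tokens))

# dict.fromkeys keeps first-seen order and uses hash lookups
unique_dict = list(dict.fromkeys(tokens))

print(unique_loop == unique_set == unique_dict)
```

The three results only agree because the input is already sorted. On unsorted input, the loop and `dict.fromkeys` versions keep first-seen order, while `sorted(set(...))` returns alphabetical order.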

## Write the deduplicated items to a txt file

In [None]:
# Write each token on its own line in the output text file
with open('seinfeld_tokens.txt', 'w') as f:
    for token in unique_tokens:
        f.write(token + '\n')
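
A quick way to check the output is to read the file back and compare it with the list that was written. The sketch below does this in a temporary directory so that no real file is overwritten. The token list and the filename `tokens_check.txt` are hypothetical.

```python
import os
import tempfile

# Hypothetical stand-in for the deduplicated token list
unique_tokens = ['agenda', 'agent', 'agents']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tokens_check.txt')  # hypothetical filename

    # Write one token per line
    with open(path, 'w') as f:
        for token in unique_tokens:
            f.write(token + '\n')

    # Read the file back; splitlines() drops the trailing newlines
    with open(path) as f:
        tokens_read = f.read().splitlines()

print(tokens_read == unique_tokens)
```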

<br><br>

## More sophisticated processing...

These NLTK tools are not used in today's class example, but they are good to know about.

In [None]:
# Tokenization: split text into tokens with rules that
# handle punctuation and contractions
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

print(tokenizer.tokenize('Hello.'))
print(tokenizer.tokenize('www.newschool.edu/'))
print(tokenizer.tokenize('So...what\'s up?!'))

In [None]:
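# Lemmatization: reduce a word to its dictionary form.
# Needs the WordNet data; if it is missing, run
# nltk.download('wordnet') once. The optional second
# argument is the part of speech ('v' = verb,
# 'a' = adjective; the default is noun).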
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('dog'))
print(lemmatizer.lemmatize('dogs'))

print(lemmatizer.lemmatize('goose'))
print(lemmatizer.lemmatize('geese'))

print(lemmatizer.lemmatize('went', 'v'))
print(lemmatizer.lemmatize('going', 'v'))

print(lemmatizer.lemmatize('smaller', 'a'))
print(lemmatizer.lemmatize('smallest', 'a'))

In [None]:
# Stemming: strip suffixes with rule-based heuristics;
# the resulting stem is not always a real word
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

print(stemmer.stem('going'))
print(stemmer.stem('bicycling'))

print(stemmer.stem('bananas'))
print(stemmer.stem('apples'))
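
To show why this kind of processing matters for a token list, here is a toy sketch that collapses simple plurals in the sample tokens from the top of this notebook. The `naive_stem` function is a hypothetical one-rule illustration written for this example. It is not the Porter algorithm and not part of NLTK.

```python
def naive_stem(token):
    """Toy rule: strip one trailing 's' from tokens longer than 3 letters."""
    if len(token) > 3 and token.endswith('s') and not token.endswith('ss'):
        return token[:-1]
    return token

# Sample tokens from the target output shown earlier
tokens = ['agenda', 'agent', 'agents', 'ages', 'aggravate']

# Stem each token, then deduplicate and sort
stems = sorted(set(naive_stem(t) for t in tokens))
print(stems)
```

Here `agent` and `agents` collapse into one entry, so the list shrinks from five tokens to four. A real stemmer or lemmatizer applies many more rules than this single one.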