# Working with a Corpus of Screenplays

This notebook establishes an alternative route for those working with a smaller number of larger documents. 37 plays is a very small number, but it's a toy corpus to which we all have access. The method should scale to several hundred texts quite easily.

## Getting the Modules and Data

In [1]:
# IMPORTS
from pathlib import Path

In [2]:
# DATA
# Change the variable/object below to reflect your genre
mysteries = []

# Now we populate the object with strings
for p in Path("../queue/Mystery").glob('*.txt'):
    with open(p, mode="r") as f:
        contents = f.read()
        mysteries.append(contents)

# If we have the same number as we can see in the folder, we got them all
len(mysteries)

107

## Tokenizing

There are two paths forward here. The first time around we are going to keep it simple: lowercase and tokenize our words as we have in previous sessions. The second path drops common words and lemmatizes the remaining ones so that there are fewer word types.

### Standard Tokenization

The scikit-learn module comes with a tokenizer built in. It makes a lot of assumptions, but it's not a bad place to start any exploration. Run the code below to see what your vocabulary is, and then try adding the following parameters:

- `min_df=`: Start with 2 and iterate up to 5. (Bonus points if you write this as a loop.)
- `max_df=`: Conventionally this is given as a proportion of documents. You can start with 0.99 and work down to 0.8, using 0.05 steps to see the effect.

Make sure you understand the **df** that you are minning and maxing: it is the document frequency, the number of documents in which a term appears.
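To make document frequency concrete, here is a minimal sketch that computes it by hand on a three-sentence toy corpus. The names `toy_docs`, `doc_freq`, and `kept_terms` are invented for illustration, and the filtering only approximates what `CountVectorizer` does with an integer `min_df` and a fractional `max_df`.

```python
from collections import Counter

# Hypothetical toy corpus (not drawn from the screenplay data)
toy_docs = [
    "the detective enters the room",
    "the detective leaves",
    "rain falls on the street",
]

# Document frequency: in how many documents does each term appear?
# set() counts a term once per document, however often it repeats there.
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(doc.split()))

n_docs = len(toy_docs)


def kept_terms(min_df=1, max_df=1.0):
    """Keep terms found in at least min_df documents (a count)
    and in at most max_df of all documents (a proportion)."""
    return sorted(
        term
        for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df
    )


print(doc_freq["the"], doc_freq["detective"], doc_freq["rain"])
print(kept_terms(min_df=2))    # rare terms drop out
print(kept_terms(max_df=0.8))  # the near-universal term drops out
```

Note that "the" occurs twice in the first sentence but still counts only once toward its document frequency.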

After you have tried both and arrived at a possible optimum, try combining them. The convention for function calls that run long is to put each parameter on its own line. Jupyter will normally handle the indent for you.
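One possible shape for the combined call, with one parameter per line and a loop over `min_df`, is sketched below. It runs on a hypothetical four-sentence corpus (`toy_docs`) rather than the screenplays, and the `toy_` names are placeholders chosen so that nothing in the notebook gets overwritten. The values 0.9 and `(1, 2)` are arbitrary choices for the toy data, not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the screenplays
toy_docs = [
    "the detective enters the room",
    "the detective leaves the room",
    "the rain falls on the street",
    "the street is empty",
]

toy_vocab_sizes = {}
for min_df in (1, 2):
    toy_vectorizer = CountVectorizer(
        min_df=min_df,  # keep terms found in at least this many documents
        max_df=0.9,     # drop terms found in more than 90% of documents
    )
    toy_matrix = toy_vectorizer.fit_transform(toy_docs)
    # shape is (documents, vocabulary terms)
    toy_vocab_sizes[min_df] = toy_matrix.shape[1]
    print(min_df, toy_matrix.shape)
```

On a corpus this small, a `min_df` of 3 would leave no terms at all and `CountVectorizer` would raise an error, which is why the loop stops at 2.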

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Basic vectorizer: no tweaks to parameters
vectorizer = CountVectorizer()

# fit the model to the data 
matrix = vectorizer.fit_transform(mysteries)

# We'll need these later
vocabulary = vectorizer.get_feature_names_out()

# This will repeat our screenplay count
# but also report our overall vocabulary FOR ALL SCREENPLAYS
matrix.shape

(107, 45272)

Rows = screenplays (107); columns = vocabulary terms.

Vocabulary size by `min_df`:

- `min_df = 1`: 45272
- `min_df = 2`: 24344
- `min_df = 3`: 18553

### Custom Tokenization

Next we're going to build a custom text "pre-processor" that we can use with a variety of scikit-learn (and perhaps other) functions. By building what is, in effect, our own tokenizer, we gain more control over the input.

In [None]:
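# Requires NLTK data: run nltk.download() once for 'punkt'
# ('punkt_tab' in newer NLTK releases), 'stopwords', and 'wordnet'
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()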

def processit(a_string):
    """ Requires nltk """
    # first we lower-case everything
    lowered = a_string.lower()
    # then tokenize
    tokens = word_tokenize(lowered)
    # remove stopwords
    words = [token for token in tokens if token not in stop_words]
    # lemmatize
    lemmas = [lemmatizer.lemmatize(word) for word in words]
    # rejoin the list of lemmas into a string and return
    return " ".join(lemmas)

In [None]:
# Basic vectorizer with our preprocessor added
# (separate names so the earlier vectorizer and matrix stay intact)
vectorizer_pp = CountVectorizer(min_df=2, preprocessor=processit)

# fit the model to the data
matrix_pp = vectorizer_pp.fit_transform(mysteries)

# We'll need these later
vocabulary_pp = vectorizer_pp.get_feature_names_out()

# without our preprocessor
# there are 11491 features
matrix_pp.shape
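To see what a `preprocessor` changes, it may help to compare two vectorizers on a tiny example. In the sketch below, `toy_preprocess` and `TOY_STOP_WORDS` are hypothetical stand-ins for the NLTK-based function above, so the cell runs without any NLTK data. A custom preprocessor replaces scikit-learn's own lowercasing step, while the built-in tokenizing still runs afterwards on whatever string the function returns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-word stop list standing in for NLTK's English list
TOY_STOP_WORDS = {"the", "a", "is"}


def toy_preprocess(a_string):
    """Lowercase, drop stop words, and return a single string."""
    words = a_string.lower().split()
    return " ".join(word for word in words if word not in TOY_STOP_WORDS)


toy_docs = ["The Detective is here", "A detective waits"]

# Default settings versus our own preprocessor
plain = CountVectorizer().fit(toy_docs)
custom = CountVectorizer(preprocessor=toy_preprocess).fit(toy_docs)

print(sorted(plain.vocabulary_))
print(sorted(custom.vocabulary_))
```

The function has to return a string, not a list of tokens, because the vectorizer still does its own splitting afterwards. That is why `processit` ends by joining the lemmas back together.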

## Creating and Saving a Dataframe

Now that we have our corpus transformed into a document-term matrix (DTM), it would be nice to make it not only easier to compute with but also storable as a file. That way, we need not recreate the DTM every time we want to work with it: we can simply load it from a file.

In [8]:
import pandas as pd

# Load up a dataframe with our DTM
df = pd.DataFrame(matrix.toarray(), 
                  columns = vectorizer.get_feature_names_out())

# Check our work (print the shape: a cell only displays its last expression)
print(df.shape)
df.head()

Unnamed: 0,00,000,000x,001,003,008,00pm,01,0134,014,...,zut,zweimal,zx,zy,zying,zzi,zzzapppp,zzzzp,zzzzt,zzzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


It would be nice to have each of those rows labeled. This is less complicated than it looks:

In [None]:
# First we need to get the titles from the file names
# (same folder and pattern as when we loaded the texts)
titles = []

for p in Path('../queue/Mystery/').glob('*.txt'):
    # No need to open the file: the stem is the name without ".txt"
    titles.append(p.stem)

# Check
print(titles[50:55])

['losthighway', 's.darko', 'curiouscaseofbenjaminbuttonthe', 'orphan', 'cherryfalls']
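A title such as `s.darko` contains a dot of its own, so it is worth checking how the extension gets removed. The short sketch below uses a hypothetical path and `PurePosixPath`, which needs no real file on disk, to compare slicing four characters off the name with pathlib's `stem` attribute.

```python
from pathlib import PurePosixPath

# Hypothetical path; pure paths never touch the file system
p = PurePosixPath("../queue/Mystery/s.darko.txt")

sliced = p.name[:-4]  # drop the last four characters (".txt")
stem = p.stem         # drop only the final suffix

print(p.name)
print(sliced)
print(stem)
```

Both approaches agree for `.txt` files. `stem` has the advantage of not depending on the extension being exactly four characters long.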


In [None]:
# Create a column in our data frame
# and then populate it with our list of titles
df["title"] = titles

# Set the title to be the first column (the index) of the dataframe
df.set_index('title', inplace = True)

# See it:
df.head()

Having done all this work, let's save it to a file. We are saving to a CSV file, but you can also save to an Excel file, if that's more useful to you. (If you do want to try saving to Excel, please also save to a CSV file: it will be easier to load into a dataframe in our next session.)

In [None]:
df.to_csv('../data/mysteries.csv')
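Loading the file back in is a single call. The following sketch does a round trip with a hypothetical two-row dataframe (`toy_df`) and a temporary directory, so it does not touch `../data/mysteries.csv`. The detail to notice is `index_col="title"`, which tells pandas to treat the saved title column as the index again.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature DTM with titles as the index
toy_df = pd.DataFrame(
    {"detective": [2, 0], "rain": [0, 1]},
    index=pd.Index(["losthighway", "orphan"], name="title"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "toy_dtm.csv"
    # Save, then read back with the title column as the index
    toy_df.to_csv(csv_path)
    reloaded = pd.read_csv(csv_path, index_col="title")

print(reloaded)
```

Without `index_col`, the titles would come back as an ordinary column and the rows would be numbered from 0.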

In [None]:
import os

# Size of the CSV
csv_size = os.path.getsize('../data/mysteries.csv')

# Size of the directory
texts_size = sum(d.stat().st_size for d in os.scandir('../queue/Mystery') if d.is_file())

# Let's compare the sizes in megabytes (millions of bytes)
print(csv_size/10**6, texts_size/10**6)