Cell 1: Import Libraries and Set Up NLTK
In this cell, we import the required libraries. We use:

os for interacting with the file system (loading screenplays).

re for regular expressions, which help clean and preprocess the text.

nltk to process natural language, including downloading tokenizers and stopwords.

pandas for organizing the data into a DataFrame.

CountVectorizer from sklearn to convert text into a Document-Term Matrix (DTM) for analysis.

In [1]:
# Import necessary libraries
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Setup: download the tokenizer models and stopword list (skipped if already present)
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/beauxcreel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/beauxcreel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Cell 2: Load Screenplay Files
This section defines the folder path containing screenplay files, limits how many to process, and loads the contents into a dictionary.

In [12]:
folder_path = "/Users/beauxcreel/code/ENGL370-2025/Creel/Family"

screenplay_files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]
MAX_FILES = 5
screenplay_files = screenplay_files[:MAX_FILES]

screenplays = {}
for file in screenplay_files:
    with open(os.path.join(folder_path, file), "r", encoding="utf-8") as f:
        screenplays[file] = f.read()

print(f"Loaded {len(screenplays)} screenplays.")
print("Loaded screenplays:", list(screenplays.keys()))


Loaded 5 screenplays.
Loaded screenplays: ['aladdin.txt', 'princessbridethe.txt', 'findingnemo.txt', 'kungfupanda.txt', 'e.t..txt']
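
One detail worth noting about the loading step: `os.listdir` returns entries in arbitrary order, so which five files end up in the sample can differ from one machine to another. The sketch below is one possible way to make the selection repeatable by sorting the filenames first. The helper name `load_screenplays` and its parameters are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def load_screenplays(folder, max_files=5):
    """Read up to max_files .txt files from folder, in alphabetical order."""
    # Sorting makes the first max_files entries the same on every run
    paths = sorted(Path(folder).glob("*.txt"))[:max_files]
    return {p.name: p.read_text(encoding="utf-8") for p in paths}
```

Calling `load_screenplays(folder_path, MAX_FILES)` would return a dictionary with the same shape as `screenplays`, although the five files chosen may differ from the ones listed above.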


Cell 3: Preview one screenplay

This gives a preview of the first 500 characters from the first screenplay loaded so you can check its content before processing.

In [21]:
sample_script = list(screenplays.keys())[0]
print(f"\nPreview of {sample_script}:\n")
print(screenplays[sample_script][:500])


Preview of aladdin.txt:

ALADDIN:  THE COMPLETE SCRIPT

COMPILED BY BEN SCRIPPS 

(Portions Copyright (c) 1992 The Walt Disney Company

PEDDLER:    Oh I come from a land

    From a faraway place

    Where the caravan camels roam

    Where they cut off your ear /Where it's flat and immense

    If they don't like your face /And the heat is intense

    It's barbaric, but hey--it's home!

    When the wind's at your back

    And the sun's from the west

    And the sand in the glass is right

    Come on down,

    St


Cell 4: Define preprocessing function
This function cleans each script by lowercasing, removing punctuation, tokenizing, and removing stopwords to prepare for analysis.

In [16]:
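# Build the stopword set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()                    # normalize case
    text = re.sub(r'[^\w\s]', '', text)    # strip punctuation
    tokens = word_tokenize(text)           # split into word tokens
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)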
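
To see what these steps do to a line of dialogue, here is a minimal, self-contained sketch. It is not the notebook's function: it uses a tiny hand-written stopword set and a whitespace split in place of NLTK's stopword list and `word_tokenize`, so that it runs without any downloads. The names `DEMO_STOPWORDS` and `demo_preprocess` are hypothetical.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's full English list
DEMO_STOPWORDS = {"if", "they", "don't", "like", "your", "it's", "but", "the"}

def demo_preprocess(text, stop_words=DEMO_STOPWORDS):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # same pattern as preprocess_text
    tokens = text.split()                 # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in stop_words)

print(demo_preprocess("If they don't like your face"))
# dont face
print(demo_preprocess("It's barbaric, but hey--it's home!"))
# its barbaric heyits home
```

Because punctuation is removed before the stopword filter runs, a contraction such as "don't" becomes "dont" and no longer matches its stopword entry, and "hey--it's" collapses into the single token "heyits". If that matters for the analysis, one option is to replace punctuation with a space instead of an empty string, or to filter stopwords before stripping punctuation.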


Cell 5: Apply preprocessing to all screenplays
This dictionary comprehension applies the preprocessing function to every screenplay, keeping each filename as the key for its cleaned text.

In [17]:
cleaned_screenplays = {title: preprocess_text(text) for title, text in screenplays.items()}


Cell 6: Preview cleaned text
Here we print the first 500 characters of the cleaned version of one screenplay to verify the cleaning process.

In [None]:
sample_script = list(cleaned_screenplays.keys())[0]
print(f"\nCleaned Preview of {sample_script}:\n")
print(cleaned_screenplays[sample_script][:500])

Cell 7: Create Document-Term Matrix (DTM)
This cell uses `CountVectorizer` to convert the cleaned text into a matrix with one row per screenplay and one column per word, where each entry counts how often that word appears in that screenplay.

In [18]:
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(cleaned_screenplays.values())
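
To make the structure of a document-term matrix concrete, the following sketch builds one by hand for two made-up documents using only the standard library. It is an illustration of the idea, not scikit-learn's implementation; the toy documents and variable names are assumptions. It mirrors two documented `CountVectorizer` defaults: tokens must have at least two word characters, and the vocabulary is ordered alphabetically.

```python
import re
from collections import Counter

# Made-up documents for illustration only
docs = {
    "a.txt": "genie lamp wish wish",
    "b.txt": "sword wish revenge",
}

# Tokens of two or more word characters, like CountVectorizer's default pattern
TOKEN = re.compile(r"(?u)\b\w\w+\b")

# One Counter of term frequencies per document
counts = {name: Counter(TOKEN.findall(text.lower())) for name, text in docs.items()}

# Columns: every distinct term, sorted; rows: one list of counts per document
vocab = sorted(set().union(*counts.values()))
matrix = [[counts[name][term] for term in vocab] for name in docs]

print(vocab)   # ['genie', 'lamp', 'revenge', 'sword', 'wish']
print(matrix)  # [[1, 1, 0, 0, 2], [0, 0, 1, 1, 1]]
```

Each row corresponds to one document and each column to one vocabulary term, which is the same layout the DataFrame in the next cell displays.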

Cell 8: Convert to DataFrame
This transforms the DTM into a Pandas DataFrame so it's easier to view and manipulate.

In [19]:
dtm_df = pd.DataFrame(
    dtm.toarray(),
    index=cleaned_screenplays.keys(),                # one row per screenplay
    columns=vectorizer.get_feature_names_out(),      # one column per word
)
print(dtm_df.head())

                      10  100  101  102  103  104  105  106  106a  107  ...  \
aladdin.txt            0    0    0    0    0    0    0    0     0    0  ...   
princessbridethe.txt   1    1    1    1    1    1    1    1     0    1  ...   
findingnemo.txt        1    1    0    0    0    0    0    1     1    1  ...   
kungfupanda.txt        1    0    0    0    0    0    0    0     0    0  ...   
e.t..txt               0    0    0    0    0    0    0    0     0    0  ...   

                      zero  zeroed  zings  zip  zips  zombie  zones  zoo  \
aladdin.txt              0       0      1    0     1       1      0    1   
princessbridethe.txt     0       1      0    0     0       0      0    0   
findingnemo.txt          0       0      0    0     0       0      4    1   
kungfupanda.txt          4       0      0    0     0       0      0    0   
e.t..txt                 0       0      0    1     2       0      0    0   

                      zoom  zooms  
aladdin.txt              4      
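
The first columns in this output (`10`, `100`, `106a`) appear to be scene or page numbers from the scripts rather than vocabulary. If those are not wanted, one possible approach is to keep only purely alphabetic tokens during preprocessing. The helper below is a hypothetical sketch of that filter, not part of the original notebook.

```python
def keep_alphabetic(tokens):
    """Drop tokens that contain digits or any other non-letter characters."""
    return [t for t in tokens if t.isalpha()]

sample = ["10", "100", "106a", "zero", "zoo", "zoom"]
print(keep_alphabetic(sample))  # ['zero', 'zoo', 'zoom']
```

A similar effect could probably be achieved by passing a custom `token_pattern` to `CountVectorizer`, though the exact pattern would depend on which tokens should survive.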

Cell 9: Save DTM to CSV
Finally, the structured document-term data is saved to a CSV file so it can be used in later analysis.

In [20]:
dtm_df.to_csv("screenplays_dtm.csv")
print("Document-Term Matrix saved as 'screenplays_dtm.csv'.")

Document-Term Matrix saved as 'screenplays_dtm.csv'.
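
As a quick check that the saved file is usable, the sketch below reads a DTM-shaped CSV back and lists the highest-count terms for each document. It uses only the standard library and a small in-memory stand-in with made-up counts; the function name `top_terms` is hypothetical. It assumes the layout that `to_csv` writes by default for an unnamed index: a header row whose first cell is empty, followed by one row per document.

```python
import csv
import io

def top_terms(csv_file, n=3):
    """Return {document: [(term, count), ...]} with the n highest-count terms per row."""
    reader = csv.reader(csv_file)
    header = next(reader)      # first cell is the empty index label
    terms = header[1:]
    result = {}
    for row in reader:
        doc, values = row[0], [int(v) for v in row[1:]]
        ranked = sorted(zip(terms, values), key=lambda pair: pair[1], reverse=True)
        # Keep at most n terms and drop any that never occur in this document
        result[doc] = [(t, c) for t, c in ranked[:n] if c > 0]
    return result

# Tiny stand-in for screenplays_dtm.csv (made-up counts)
demo = io.StringIO(",genie,lamp,wish\na.txt,3,1,5\nb.txt,0,2,0\n")
print(top_terms(demo, n=2))
# {'a.txt': [('wish', 5), ('genie', 3)], 'b.txt': [('lamp', 2)]}
```

For the real file, the same function could be called inside `with open("screenplays_dtm.csv", newline="", encoding="utf-8") as f:`. In pandas, `pd.read_csv("screenplays_dtm.csv", index_col=0)` restores the DataFrame with filenames as the index.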


FINAL OUTCOME:
After processing the screenplays and generating the Document-Term Matrix (DTM), we have a structured dataset that can support several text analysis tasks:

Word frequency analysis to study the most common and unique words in screenplays.

Topic modeling to identify themes or genres.

Sentiment analysis to examine the emotional tone of the scripts.

Machine learning to build models for classifying screenplays automatically.

The saved DTM provides a foundation for any of the analyses above.

