In [8]:
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

"""
This script processes text files containing screenplays and converts them into a structured Document-Term Matrix (DTM) for analysis.

Steps:
1. Load necessary NLTK resources for text processing.
2. Retrieve and read screenplay files from a specified folder.
3. Preprocess the text by:
   - Converting to lowercase
   - Removing punctuation
   - Tokenizing words
   - Removing common stopwords
4. Transform the cleaned text into a Document-Term Matrix (DTM) using CountVectorizer.
5. Save the DTM as a CSV file for further analysis.
"""

# Ensure necessary NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')

# Define the path to the "Family." folder containing screenplay text files
folder_path = "/Users/beauxcreel/code/ENGL370-2025/Creel/Family."

# Retrieve a list of all text files in the specified folder
screenplay_files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]

# Limit the number of files to process to avoid excessive memory usage
MAX_FILES = 5  # Adjust as needed for scalability
screenplay_files = screenplay_files[:MAX_FILES]  # Keep at most MAX_FILES files

# Load the screenplays into a dictionary
screenplays = {}
for file in screenplay_files:
    with open(os.path.join(folder_path, file), "r", encoding="utf-8") as f:
        screenplays[file] = f.read()

# Display the number of loaded screenplays
print(f"Loaded {len(screenplays)} screenplays.")
print("Loaded screenplays:", list(screenplays.keys()))

# Preview the content of the first screenplay (first 500 characters)
sample_script = list(screenplays.keys())[0]
print(f"\nPreview of {sample_script}:\n")
print(screenplays[sample_script][:500])

# Function to clean and preprocess text
def preprocess_text(text):
    """
    Cleans and tokenizes text by:
    - Converting to lowercase to standardize text
    - Removing punctuation to reduce noise
    - Tokenizing words using nltk's word_tokenize function
    - Removing stopwords to focus on meaningful words
    
    This ensures that only relevant words are retained for analysis, improving the accuracy of text-based models.
    
    Returns:
        str: The cleaned and processed text as a single string
    """
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation using regex
    tokens = word_tokenize(text)  # Tokenize text into words
    stop_words = set(stopwords.words('english'))  # Build the stopword set once, not once per token
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)  # Join tokens back into a cleaned string

# Apply text preprocessing to all screenplays
cleaned_screenplays = {title: preprocess_text(text) for title, text in screenplays.items()}

# Display a preview of cleaned text
sample_script = list(cleaned_screenplays.keys())[0]
print(f"\nCleaned Preview of {sample_script}:\n")
print(cleaned_screenplays[sample_script][:500])

# Convert cleaned text into a Document-Term Matrix (DTM)
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(cleaned_screenplays.values())

# Create a DataFrame for easier visualization
dtm_df = pd.DataFrame(dtm.toarray(), index=cleaned_screenplays.keys(), columns=vectorizer.get_feature_names_out())

# Display the first few rows of the DTM to examine word frequency across scripts
print(dtm_df.head())

# Save the Document-Term Matrix as a CSV file for future analysis
dtm_df.to_csv("screenplays_dtm.csv")
print("Document-Term Matrix saved as 'screenplays_dtm.csv'.")

"""
Final Outcome:
--------------
This script provides a foundation for further text analysis, such as:
- Word frequency studies to identify common and unique words in screenplays.
- Topic modeling to categorize scripts based on prevalent themes.
- Sentiment analysis to examine emotional tones in screenplays.
- Machine learning applications for automated text classification.
"""

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/beauxcreel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/beauxcreel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Loaded 5 screenplays.
Loaded screenplays: ['aladdin.txt', 'princessbridethe.txt', 'findingnemo.txt', 'kungfupanda.txt', 'e.t..txt']

Preview of aladdin.txt:

ALADDIN:  THE COMPLETE SCRIPT

COMPILED BY BEN SCRIPPS 

(Portions Copyright (c) 1992 The Walt Disney Company

PEDDLER:    Oh I come from a land

    From a faraway place

    Where the caravan camels roam

    Where they cut off your ear /Where it's flat and immense

    If they don't like your face /And the heat is intense

    It's barbaric, but hey--it's home!

    When the wind's at your back

    And the sun's from the west

    And the sand in the glass is right

    Come on down,

    St

Cleaned Preview of aladdin.txt:

aladdin complete script compiled ben scripps portions copyright c 1992 walt disney company peddler oh come land faraway place caravan camels roam cut ear flat immense dont like face heat intense barbaric heyits home winds back suns west sand glass right come stop hop carpet fly another arabian night arabia

Cell 1: Import Libraries and Set Up NLTK
In this cell, we import the required libraries. We use:

os for interacting with the file system (loading screenplays).

re for regular expressions, which help clean and preprocess the text.

nltk to process natural language, including downloading tokenizers and stopwords.

pandas for organizing the data into a DataFrame.

CountVectorizer from sklearn to convert text into a Document-Term Matrix (DTM) for analysis.

In [1]:
# Import necessary libraries
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Setup: download the tokenizer models and the stopword list (skipped if already up to date)
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/beauxcreel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/beauxcreel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Cell 2: Load Screenplay Files
This cell loads the screenplay text files from the specified folder. We cap the number of files at MAX_FILES to avoid running out of memory when working with large text files. After loading, we print how many files were loaded and preview the first 500 characters of the first screenplay to check its content.

In [2]:
# Define the path to the "Family." folder containing screenplay text files
folder_path = "/Users/beauxcreel/code/ENGL370-2025/Creel/Family."

# Retrieve a list of all text files in the specified folder
screenplay_files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]

# Limit the number of files to process to avoid excessive memory usage
MAX_FILES = 5  # Adjust as needed for scalability
screenplay_files = screenplay_files[:MAX_FILES]  # Keep at most MAX_FILES files

# Load the screenplays into a dictionary
screenplays = {}
for file in screenplay_files:
    with open(os.path.join(folder_path, file), "r", encoding="utf-8") as f:
        screenplays[file] = f.read()

# Display the number of loaded screenplays
print(f"Loaded {len(screenplays)} screenplays.")
print("Loaded screenplays:", list(screenplays.keys()))

# Preview the content of the first screenplay (first 500 characters)
sample_script = list(screenplays.keys())[0]
print(f"\nPreview of {sample_script}:\n")
print(screenplays[sample_script][:500])


FileNotFoundError: [Errno 2] No such file or directory: '/Users/beauxcreel/code/ENGL370-2025/Creel/Family.'
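
The FileNotFoundError above means the folder did not exist at the given path when the cell ran. A minimal sketch of a more defensive loader follows, assuming a hypothetical helper named `load_screenplays` that is not part of the original notebook. It checks the folder before reading and sorts the file names so that the same files are selected on every run, since `os.listdir` returns entries in arbitrary order.

```python
from pathlib import Path


def load_screenplays(folder_path, max_files=5):
    """Hypothetical helper: read up to max_files .txt files from folder_path."""
    folder = Path(folder_path)
    if not folder.is_dir():
        # Fail early with a message that names the missing folder
        raise FileNotFoundError(f"Screenplay folder not found: {folder}")

    # Sort so the selection is reproducible across runs and machines
    txt_files = sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix == ".txt"
    )

    screenplays = {}
    for path in txt_files[:max_files]:
        screenplays[path.name] = path.read_text(encoding="utf-8")
    return screenplays
```

Calling `load_screenplays(folder_path, MAX_FILES)` would return the same kind of dictionary the cell builds, keyed by file name.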

Cell 3: Preprocess the Text
This cell defines the preprocess_text function, which performs several text cleaning tasks:

Converts the text to lowercase to standardize it.

Removes punctuation using a regular expression.

Tokenizes the text into individual words using NLTK's word_tokenize.

Removes stopwords (common words such as "the" and "and") that carry little meaning for analysis.

The function then returns the cleaned tokens joined into a single string. After defining the function, we apply it to every screenplay to prepare the text for analysis, and we preview the cleaned text of the first screenplay to verify the result.

In [3]:
# Function to clean and preprocess text
def preprocess_text(text):
    """
    Cleans and tokenizes text by:
    - Converting to lowercase to standardize text
    - Removing punctuation to reduce noise
    - Tokenizing words using nltk's word_tokenize function
    - Removing stopwords to focus on meaningful words
    
    This ensures that only relevant words are retained for analysis, improving the accuracy of text-based models.
    
    Returns:
        str: The cleaned and processed text as a single string
    """
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation using regex
    tokens = word_tokenize(text)  # Tokenize text into words
    stop_words = set(stopwords.words('english'))  # Build the stopword set once, not once per token
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)  # Join tokens back into a cleaned string

# Apply text preprocessing to all screenplays
cleaned_screenplays = {title: preprocess_text(text) for title, text in screenplays.items()}

# Display a preview of cleaned text
sample_script = list(cleaned_screenplays.keys())[0]
print(f"\nCleaned Preview of {sample_script}:\n")
print(cleaned_screenplays[sample_script][:500])


NameError: name 'screenplays' is not defined
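
In the cleaned preview earlier in the notebook, contractions and dash-joined words appear as tokens such as `dont` and `heyits`. Deleting punctuation outright fuses the pieces together, and the fused forms then slip past the stopword filter. A minimal sketch of one alternative is to replace punctuation with a space instead of deleting it. Here `SMALL_STOPWORDS` is a hypothetical stand-in for the NLTK list and `str.split` stands in for `word_tokenize`, so the block runs without NLTK data.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is longer
SMALL_STOPWORDS = {"it", "s", "but", "don", "t", "they", "if", "your"}


def clean_delete(text):
    """Approach used in the notebook: delete punctuation, then drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [w for w in text.split() if w not in SMALL_STOPWORDS]


def clean_space(text):
    """Alternative: replace punctuation with a space, then drop stopwords."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [w for w in text.split() if w not in SMALL_STOPWORDS]


line = "If they don't like your face, it's barbaric, but hey--it's home!"
print(clean_delete(line))  # ['dont', 'like', 'face', 'its', 'barbaric', 'heyits', 'home']
print(clean_space(line))   # ['like', 'face', 'barbaric', 'hey', 'home']
```

Whether the second behavior is preferable depends on the analysis; it changes the vocabulary, so the resulting matrix would differ from the one produced by the original function.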

Cell 4: Create the Document-Term Matrix (DTM)
This cell transforms the cleaned text into a Document-Term Matrix (DTM) using CountVectorizer. The DTM records how many times each word occurs in each document (screenplay). Each row corresponds to a screenplay, and each column corresponds to a unique word found anywhere in the corpus.

We then create a pandas DataFrame to make the matrix more readable, with words as columns and screenplays as rows. Finally, we display the first few rows of the DTM to verify its structure.

In [4]:
# Convert cleaned text into a Document-Term Matrix (DTM)
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(cleaned_screenplays.values())

# Create a DataFrame for easier visualization
dtm_df = pd.DataFrame(dtm.toarray(), index=cleaned_screenplays.keys(), columns=vectorizer.get_feature_names_out())

# Display the first few rows of the DTM to examine word frequency across scripts
print(dtm_df.head())


NameError: name 'cleaned_screenplays' is not defined
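
To make the row and column layout concrete, here is a dependency-free sketch that builds the same kind of count matrix by hand for two toy documents. It illustrates the structure only and is not how CountVectorizer is implemented; the function name `build_dtm` and the toy texts are made up for illustration. One difference worth knowing: CountVectorizer's default token pattern keeps only tokens of two or more characters, so a single-letter token such as the `c` in the cleaned preview would not get a column.

```python
from collections import Counter


def build_dtm(documents):
    """Return (vocabulary, rows); rows[title] holds one count per vocabulary term."""
    # Count whitespace-separated tokens in each document
    counts = {title: Counter(text.split()) for title, text in documents.items()}
    # The columns are the sorted set of all terms seen in any document
    vocabulary = sorted(set().union(*counts.values()))
    # A Counter returns 0 for terms a document does not contain
    rows = {title: [c[term] for term in vocabulary] for title, c in counts.items()}
    return vocabulary, rows


toy_docs = {
    "a.txt": "genie lamp genie wish",
    "b.txt": "fish reef fish fish",
}
vocabulary, rows = build_dtm(toy_docs)
print(vocabulary)  # ['fish', 'genie', 'lamp', 'reef', 'wish']
for title, row in rows.items():
    print(title, row)
```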

Cell 5: Save the Document-Term Matrix as CSV
This cell saves the Document-Term Matrix (DTM) to a CSV file so that it can be exported and analyzed later. The file contains the count of each word in each screenplay.

In [5]:
# Save the Document-Term Matrix as a CSV file for future analysis
dtm_df.to_csv("screenplays_dtm.csv")
print("Document-Term Matrix saved as 'screenplays_dtm.csv'.")


NameError: name 'dtm_df' is not defined
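
Because `to_csv` writes the screenplay titles as the first column, reading the file back needs `index_col=0` to restore them as row labels. A minimal sketch of the round trip, using a small made-up DataFrame and an in-memory buffer in place of `screenplays_dtm.csv`:

```python
import io

import pandas as pd

# Made-up counts standing in for the real DTM
toy_dtm = pd.DataFrame(
    {"fish": [0, 3], "genie": [2, 0]},
    index=["a.txt", "b.txt"],
)

buffer = io.StringIO()
toy_dtm.to_csv(buffer)  # same call as in the notebook, but writing to memory
buffer.seek(0)

# index_col=0 turns the first CSV column back into the row labels
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded)
```

With the real file, the equivalent call would be `pd.read_csv("screenplays_dtm.csv", index_col=0)`.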

Final Outcome:
This cell summarizes the outcome of the script. Processing the screenplays into a Document-Term Matrix (DTM) gives us a structured dataset that can be used for various text analysis tasks:

Word frequency analysis to study the most common and unique words in screenplays.

Topic modeling to identify themes or genres.

Sentiment analysis to examine the emotional tone of the scripts.

Machine learning to build models for classifying screenplays automatically.

This provides a foundation for any of the analyses above.
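
As a small illustration of the first item, word frequency analysis, the sketch below ranks words across a made-up DTM and finds the most frequent word in each document. The DataFrame `toy_dtm` and its counts are invented for the example; with the real data, `dtm_df` would take its place.

```python
import pandas as pd

# Made-up counts standing in for the real DTM (rows: documents, columns: words)
toy_dtm = pd.DataFrame(
    {"fish": [0, 3], "genie": [2, 0], "wish": [1, 0], "reef": [0, 1]},
    index=["a.txt", "b.txt"],
)

# Total count of each word across all documents, most frequent first
corpus_totals = toy_dtm.sum(axis=0).sort_values(ascending=False)
print(corpus_totals.head(2))

# Most frequent word within each document
top_word = toy_dtm.idxmax(axis=1)
print(top_word)
```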

