# Screenplay Genre Classifier

## Preprocessing

In this section, I split the data into training and test sets. I then preprocess the text of both sets with the WordNet lemmatizer, using my custom class TextPreprocessor from preprocessing.py. After lemmatizing both sets, I take the 500 most common words for each genre in the training set and use the intersection of those word lists as a custom stop word list. Because the list is built from the training set only, applying it to both the training and test sets introduces no data leakage.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from preprocessing import TextPreprocessor #custom classes
from topicmodels import Modeling #custom classes
from nltk import FreqDist

## First Step: splitting data into train and test sets to avoid data leakage

In [2]:
data = pd.read_csv('data/cleaned_data.csv', index_col=[0])

In [3]:
#generating cols for target sets
genre_list = list(data.columns[-18:])
genre_list

['Crime',
 'Romance',
 'Animation',
 'SciFi',
 'Fantasy',
 'History',
 'Action',
 'Drama',
 'War',
 'Thriller',
 'Mystery',
 'Documentary',
 'Horror',
 'Family',
 'Adventure',
 'Music',
 'Comedy',
 'Western']

In [7]:
#splitting data
X = data.loc[:, ['title', 'text']].copy()
y = data.loc[:, genre_list].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
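
Because no seed is passed, the split above produces a different partition on every run, so the saved train and test files change each time the notebook is rerun. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable here; the toy lists and the seed value 42 are illustrative and not taken from this notebook:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the screenplay texts and genre labels (hypothetical data).
docs = [f"screenplay {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

def split(seed):
    # random_state fixes the shuffle, so the same rows land in each set on every run
    return train_test_split(docs, labels, test_size=0.3, random_state=seed)

X_tr, X_te, y_tr, y_te = split(42)
X_tr_again, X_te_again, _, _ = split(42)
```

With 10 documents and test_size=0.3, 7 go to the training set and 3 to the test set, and both calls return the same partition.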

## Second Step: Lemmatizing and removing basic stop words from the training and test sets

In [12]:
tp = TextPreprocessor(activator_type='wnl', lem_or_stem='lem') #custom preprocessing class

X_train_lem = X_train.text.apply(tp.lem_process_doc) #this function lemmatizes and removes stop words from the nltk package
X_test_lem = X_test.text.apply(tp.lem_process_doc)


## Third Step: Finding the intersection of the most common words by genre in X_train and removing those words from both X_train and X_test

Note: building the stop word list from X_train alone avoids data leakage, because it uses no information about the words in X_test.

In [22]:
df_train = pd.concat([X_train_lem, y_train], axis=1) #concatenating text and labels to find stop words by genre

genre_dfs = {} #creating a dict of dataframes by genre

for i in genre_list:
    genre_dfs[i] = df_train[df_train[i] == 1]

In [27]:
freq_dct = {} #creating a dict to store word frequency by genre

for i in genre_list:

    freq_dct[i] = FreqDist((" ".join(genre_dfs[i].text)).split())

In [61]:
#function to get the most frequent words in a FreqDist
def getting_stops(freq_dist, num):

    """
    This function takes in a frequency distribution and a number.
    It returns a list of the `num` most frequent words, most frequent first.
    This function helps in creating an optimal stop word list.

    """
    return [word for word, _ in freq_dist.most_common(num)]

In [69]:
lst = [] #appending sets of common words by genre to list

for i in genre_list:
    
    lst.append(set(getting_stops(freq_dct[i], 500))) #500 words chosen based on eda in previous notebook

stops = list(set.intersection(*lst)) #finding the intersection
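
To make the intersection step concrete, here is a minimal sketch on toy data. It uses collections.Counter in place of nltk's FreqDist (FreqDist is a Counter subclass, so most_common works the same way). The genre documents, the helper name top_words, and top_n = 2 are made up for the example:

```python
from collections import Counter

# Hypothetical, tiny lemmatized training documents grouped by genre.
train_docs_by_genre = {
    "Horror": ["door night scream", "door night blood", "door"],
    "Comedy": ["door laugh joke", "door laugh", "door night"],
    "Western": ["door horse gun", "door horse", "door night"],
}

def top_words(docs, num):
    """Return the set of the `num` most frequent tokens across `docs`."""
    counts = Counter(" ".join(docs).split())
    return {word for word, _ in counts.most_common(num)}

top_n = 2  # the notebook uses 500; 2 keeps the toy example readable
per_genre = [top_words(docs, top_n) for docs in train_docs_by_genre.values()]

# Only words that rank in the top `top_n` of every genre survive.
shared = set.intersection(*per_genre)
```

Here "door" is frequent in every genre and ends up in shared, while "night", "laugh", and "horse" each rank highly in only one genre and are kept as potentially informative words.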

In [73]:
with open('data/names.txt') as f: #this file contains common names of people
    line = f.readlines()

for i in line:
    stops.append(i.strip('\n').lower()) #appending names to the stop list

In [74]:
def stopping(row, stops):

    """
    This function takes in text and a list of stop words. 
    It returns updated text without the stopwords specified in the argument

    """

    row_split = row.split()
    updated_row = [x for x in row_split if x not in stops]

    return " ".join(updated_row)
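
stopping checks every token against a list, which is scanned from the start on each lookup; once the names file has been appended, that list can be long. A possible variant, sketched under the assumption that the order of the stop words does not matter, converts the list to a set once and reuses it. The name remove_stops and the sample words are hypothetical:

```python
def remove_stops(text, stop_set):
    """Drop every whitespace-separated token that appears in `stop_set`."""
    return " ".join(tok for tok in text.split() if tok not in stop_set)

# Build the set once, outside the per-document call, so each lookup is O(1) on average.
stop_set = frozenset(["door", "john"])

cleaned = remove_stops("john open door slowly", stop_set)
untouched = remove_stops("horse gun", stop_set)
```

The output is the same as with a list of stop words; only the lookup cost changes.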

In [75]:
X_train_no_stops = X_train_lem.apply(lambda x: stopping(x, stops)) #applying new stop words to training data
X_test_no_stops = X_test_lem.apply(lambda x: stopping(x, stops)) #applying new stop words to testing data

In [79]:
X_train_no_stops_title = pd.concat([X_train_no_stops, X_train.title], axis=1) #to keep track of title of each screenplay
X_test_no_stops_title = pd.concat([X_test_no_stops, X_test.title], axis=1)

In [80]:
#saving data for model building and testing
X_train_no_stops_title.to_csv('data/X_train.csv')
X_test_no_stops_title.to_csv('data/X_test.csv')
y_train.to_csv('data/y_train.csv')
y_test.to_csv('data/y_test.csv')