## Movie Cleaning and Pre-processing

+ The main purpose of this notebook is to clean and pre-process the data obtained from the websites imsdb.com and imdb.com.

##### This notebook accomplishes four primary tasks:

+ Using the `re` (regex) library, expand the contracted words in the dataset, such as
  + can't → can not
  + 've → have
  + I'm → I am
+ Eliminate punctuation and stop words, as well as names that recur at the beginning of dialogue lines.
+ Replace and group the age-rating and genre values.
+ Save the cleaned data for analysis and modelling.
---

+ Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
import string
nltk.download('wordnet')
# On a fresh install, stopwords.words('english') also needs nltk.download('stopwords').

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ojoho\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

The function below expands the contracted words in the data into their full forms. This is done before punctuation is removed; otherwise the apostrophes would be stripped and the contractions could no longer be matched.

In [3]:
def initial_clean(text):
    """Expand contractions (e.g. can't -> can not, 's -> is, 'll -> will) before punctuation is stripped."""
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"'s+", " is", text)  # also rewrites possessives ("John's" -> "John is")
    text = re.sub(r"'ll+", " will", text)
    text = re.sub(r"'re+", " are", text)
    text = re.sub(r"n't+", " not", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    text = re.sub(r"n'", " not", text)
    # Stray "ll", "re" and "ve" left behind by non-ASCII apostrophes (e.g. "I’ll").
    # The lookahead keeps the character that follows, so the next word is not glued on.
    text = re.sub(r"[^a-zA-Z]ll(?![a-zA-Z])", " will", text)
    text = re.sub(r"[^a-zA-Z]re(?![a-zA-Z])", " are", text)
    text = re.sub(r"[^a-zA-Z]ve(?![a-zA-Z])", " have", text)
    return text
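
As a quick illustration of the expansion step, here is a minimal, self-contained sketch. It uses a small lookup table instead of the chain of substitutions above, so `CONTRACTIONS` and `expand_contractions` are hypothetical names, the table is deliberately incomplete, and matching is case-sensitive.

```python
import re

# Hypothetical, deliberately small lookup table (not the notebook's full rule set).
CONTRACTIONS = {
    "can't": "can not",
    "n't": " not",
    "'ve": " have",
    "'ll": " will",
    "'re": " are",
    "'m": " am",
}

# Longer keys come first so that "can't" wins over the generic "n't" rule.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0)], text)

print(expand_contractions("I'm sure they've said we can't go, but we'll try."))
# I am sure they have said we can not go, but we will try.
```

A table like this keeps every rule in one place, at the cost of not covering cases such as `'s`, which is ambiguous between "is" and a possessive.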

## Speaker Cues and Transitions
---
+ Speaker cues and transitions in the scripts are written entirely in uppercase, so several functions were created to:
  + collect the uppercase words and remove every word that appears more than 10 times.
    This reduces the number of character names left in the dataset.
  + remove punctuation and stop words from the text.
  + tokenize the script.
  + lemmatize the data.
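
The counting idea can be sketched on a toy string with `collections.Counter`. This is only an illustration under stated assumptions: `frequent_uppercase_words` and its `threshold` parameter are hypothetical names, and the toy script is made up.

```python
import re
from collections import Counter

def frequent_uppercase_words(script, threshold=10):
    """Return uppercase words (2+ letters) that occur more than `threshold` times."""
    counts = Counter(re.findall(r"[A-Z]{2,}", script))
    return [word for word, n in counts.items() if n > threshold]

# Toy script: the cue "JOHN" appears 11 times, "MARY" only twice.
script = "JOHN Hello there. " * 11 + "MARY Hi. MARY Bye."
print(frequent_uppercase_words(script))
# ['JOHN']
```

Capitalised words such as "Hello" are ignored because only runs of two or more uppercase letters are counted.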

In [27]:
def elimate_multiple(new_d):
    '''Find every run of two or more consecutive uppercase letters and
    return a dict mapping each uppercase word to its number of occurrences.'''
    capital_letters = re.findall(r'[A-Z][A-Z]+', new_d)
    counts = dict()
    for word in capital_letters:
        counts[word] = counts.get(word, 0) + 1
    return counts

def remove_values(counts):
    # Return the uppercase words that occur more than 10 times (mostly speaker cues).
    repeated_words = []
    for key, value in counts.items():
        if value > 10:
            repeated_words.append(key)
    return repeated_words

def clean_text(text):
    # Expand contractions, then remove text in square brackets, punctuation,
    # words containing digits, curly quotes and newlines.
    # The text is not lowercased here, so uppercase speaker cues can still be detected.
    text = initial_clean(text)
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub(r'\n', '', text)
    # Remove the frequently repeated uppercase words (in any letter case).
    stop_words = remove_values(elimate_multiple(text))
    for i in stop_words:
        text = re.sub(f'[^A-Za-z]{i}[^A-Za-z]', ' ', text, flags=re.IGNORECASE)
    return text

def clean_stopword(x):
    # Lowercase every token, then drop English stop words and one-character tokens.
    stop_words = set(stopwords.words('english'))
    return " ".join([w.lower() for w in x.split() if w.lower() not in stop_words and len(w) > 1])

def lemmatize(x):
    # Remove stop words, split the text into tokens and lemmatize each token.
    lemmatizer = WordNetLemmatizer()
    clean_word = clean_stopword(x)
    tokenized_word = [lemmatizer.lemmatize(word) for word in clean_word.split()]
    return " ".join(tokenized_word)
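
The filtering rule used in `clean_stopword` can be seen in isolation in the sketch below. To stay self-contained it uses a tiny, hypothetical stop-word set (`STOP_WORDS`) in place of NLTK's English list, and `drop_stop_words` is likewise an illustrative name.

```python
# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "is", "to", "and"}

def drop_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase each token, then drop stop words and one-character tokens."""
    tokens = (w.lower() for w in text.split())
    return " ".join(w for w in tokens if w not in stop_words and len(w) > 1)

print(drop_stop_words("The Dragon is flying to a cave x"))
# dragon flying cave
```

Lowercasing before the membership test is what lets "The" match the lowercase entry "the".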

Using pandas, we read in the uncleaned dataset.

In [5]:
df = pd.read_csv('movie_dataBase.csv')

Eliminating missing data: some URLs on the websites were inconsistent, which left some films without a script or an age rating.

In [None]:
df.info()

In [6]:
df['Movie_Script'] = df['Movie_Script'].astype('string') #convert from datatype object to string
df.dropna(subset=['Movie_Script'], inplace=True) #empty rows are dropped

In [8]:
df['final_text'] = df['Movie_Script'].apply(lambda x : clean_text(x)) #applying clean_text function
df['final_text2'] = df['final_text'].apply(lambda x : lemmatize(x)) # lemmatize the data

Age rating: values were substituted and grouped into four categories (PG, 12, 15 and 18).

In [10]:
age_rating = {'12A': '12',
              'U':'PG',
              'AA': '15',
              'X': '18',
              'TV-MA': '18',
              'Adult':'18',
              'Passed': '12',
              'PG-13': '12',
              'TV-14': '15',
              'A':'18',
              'R': '18',
              'Not Rated': np.nan,
              ' ' : np.nan
             }

In [11]:
df['age_rating'] = df['age_rating'].replace(age_rating)
df.dropna(subset=['age_rating'], inplace=True)
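
To make the grouping rule explicit, the sketch below applies a shortened copy of the mapping to a plain Python list instead of a DataFrame. `AGE_RATING`, `group_rating` and the sample values are hypothetical. Ratings absent from the mapping (for example `'15'`) pass through unchanged, which mirrors how `replace` leaves unmatched values alone.

```python
# Shortened copy of the mapping above; None stands in for np.nan.
AGE_RATING = {"12A": "12", "U": "PG", "R": "18", "PG-13": "12", "Not Rated": None}

def group_rating(rating):
    """Map a raw rating to its group; unmapped ratings are returned as-is."""
    return AGE_RATING.get(rating, rating)

raw = ["R", "15", "U", "Not Rated", "PG-13"]
grouped = [group_rating(r) for r in raw]
kept = [r for r in grouped if r is not None]  # analogous to dropna
print(kept)
# ['18', '15', 'PG', '12']
```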

## Movie Genre
---
+ The genre types Music and Musical were merged because of their similarity.
+ The column was then cleaned further.

In [12]:
df['Movie_Genre'] = df['Movie_Genre'].str[1:-1].astype('string')

In [22]:
def clean_genre(text):
    text = re.sub('Musical', 'Music', text)
    text = re.sub("'", '', text)
    text = re.sub(r"\.", ",", text)
    text = re.sub(' ', '', text)
    return text

In [23]:
df['Movie_Genre'] = df['Movie_Genre'].apply(lambda x : clean_genre(x)) #function is applied

In [24]:
# Collect the unique genres by splitting every row on commas. A set is used because it cannot hold duplicates.
unique_genres = list(set([y for x in df['Movie_Genre'] for y in x.split(',')]))
print(unique_genres)

['Mystery', 'History', 'Biography', 'Crime', 'Sci-Fi', 'Romance', 'Horror', 'Action', 'Film-Noir', 'Family', 'War', 'Music', 'Animation', 'Sport', 'Fantasy', 'Thriller', 'Western', 'Comedy', 'Drama', 'Adventure']
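
The genre steps above can be traced on two made-up rows. This is a sketch only: it assumes the raw column holds stringified Python lists, and `normalise_genres` is a hypothetical helper that combines the bracket stripping and `clean_genre` into one function.

```python
import re

def normalise_genres(raw):
    """Apply the same steps as the cells above to one raw genre string."""
    text = raw[1:-1]  # drop the surrounding brackets
    text = re.sub("Musical", "Music", text)
    text = text.replace("'", "").replace(".", ",").replace(" ", "")
    return text

# Assumed raw format: a stringified Python list.
rows = ["['Comedy', 'Musical']", "['Drama', 'Music']"]
cleaned = [normalise_genres(r) for r in rows]
unique = sorted({g for row in cleaned for g in row.split(",")})
print(cleaned)
# ['Comedy,Music', 'Drama,Music']
print(unique)
# ['Comedy', 'Drama', 'Music']
```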


In [25]:
df

Unnamed: 0,title,age_rating,Movie_Genre,Movie_Script,final_text,final_text2
0,Reservoir Dogs,18,"Action,Crime,Thriller",Quentin Tara...,Quentin Tara...,quentin tarantino october movie dedicated foll...
1,How to Train Your Dragon,PG,"Animation,Adventure,Comedy",HOW TO TRAIN YO...,HOW TRAIN YOUR ...,train dragon written dean deblois chris sander...
2,Scream,18,"Horror,Mystery,Thriller",...,...,scream scary movie kevin williamson rewrite ju...
3,Groundhog Day,PG,"Comedy,Fantasy,Romance",GROUNDH...,GROUNDH...,groundhog written danny rubin second revision ...
4,Black Panther,12,"Action,Adventure,Sci-Fi",BLACK PANTHER ...,BLACK PANTHER ...,black panther written ryan coogler joe robert ...
...,...,...,...,...,...,...
1207,You Can Count On Me,15,Drama,"""YOU CAN COU...",YOU CAN COUN...,count screenplay kenneth lonergan shooting dra...
1208,You've Got Mail,PG,"Comedy,Romance",You've Got Mail You've Got Mail by...,You have Got Mail \t\t\tYou have Got Mail...,got mail got mail nora ephron delia ephron bas...
1209,Youth in Revolt,15,"Comedy,Drama,Romance",...,...,youth revolt written gustin nash july black co...
1210,Zero Dark Thirty,15,"Drama,Thriller",ZERO DARK...,ZERO DARK...,zero dark thirty written mark boal october voi...


### Saving the cleaned DataFrame to a CSV for analysis

In [26]:
df.to_csv("cleaned_dataBase.csv", index = False)