# Data Cleaning

Specifically, we'll be walking through:

1. **Getting the data** - in this case, we'll be scraping data from a website
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data into a format that is easy to feed into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format
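
To make the two formats concrete, here is a minimal sketch that uses only the standard library. The two sample documents and the names `corpus`, `vocabulary` and `dtm` are made up for illustration; the notebook itself builds the real matrix later with scikit-learn.

```python
from collections import Counter

# A corpus: one text per document (toy data, not the scraped reviews)
corpus = {
    "doc_a": "great film great cast",
    "doc_b": "slow film",
}

# Count the words of each document
counts = {name: Counter(text.split()) for name, text in corpus.items()}

# The vocabulary is the sorted set of all words; it defines the columns
vocabulary = sorted(set(word for c in counts.values() for word in c))

# Document-term matrix: one row per document, one column per word
dtm = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row has the same length as the vocabulary, and a zero means the word never occurs in that document.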

In [1]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

In [2]:
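def writeResults(msg):
    '''Append a message to the log file.'''
    # The context manager closes the file even if the write fails
    with open("logtestResult.log", "a", encoding="utf-8") as outFile:
        outFile.write(msg)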

def getReviews(soup):
    reviews = [p.text for p in soup.find_all('div', class_='text show-more__control')]
    #writeResults(str(reviews))
    #print(reviews)
    return reviews

# Scrapes review data from IMDb links
def url_to_review(url):
    '''Returns the list of review texts found on an IMDb reviews page.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'lxml')
    reviews = getReviews(soup)
    print(url)
    return reviews

In [3]:
# We have the 5 most popular titles for each actor (where they appear as stars)
number_movies = 5

# URLs of the reviews in scope:
# 25 reviews for each title, with and without spoilers, all ratings, sorted by helpful votes.
# Please note that "helpful" doesn't necessarily imply the review was "positive" or "favorable".
# A review can be helpful even if it rips the film to shreds, because it can help readers
# decide not to waste their time or money watching it.
urls = ['https://www.imdb.com/title/tt8946378/reviews?ref_=tt_urv', # Knives Out
        'https://www.imdb.com/title/tt7979142/reviews?ref_=tt_urv', # The Night Clerk 
        'https://www.imdb.com/title/tt1856101/reviews?ref_=tt_urv', # Blade Runner 2049
        'https://www.imdb.com/title/tt8079248/reviews?ref_=tt_urv', # Yesterday
        'https://www.imdb.com/title/tt1833116/reviews?ref_=tt_urv',
        
        'https://www.imdb.com/title/tt9446688/reviews?ref_=tt_urv', # I Am Not Okay with This
        'https://www.imdb.com/title/tt9086228/reviews?ref_=tt_urv', # Gretel & Hansel
        'https://www.imdb.com/title/tt1396484/reviews?ref_=tt_urv', # IT
        'https://www.imdb.com/title/tt7349950/reviews?ref_=tt_urv', # It Chapter Two
        'https://www.imdb.com/title/tt2649356/reviews?ref_=tt_urv', # Sharp Objects 
        
        'https://www.imdb.com/title/tt2661044/reviews?ref_=tt_urv', #The 100
        'https://www.imdb.com/title/tt0460649/reviews?ref_=tt_urv', #How I Met Your Mother
        'https://www.imdb.com/title/tt0056758/reviews?ref_=tt_urv', #General Hospital
        'https://www.imdb.com/title/tt1587678/reviews?ref_=tt_urv', #Happy Endings 
        'https://www.imdb.com/title/tt2477230/reviews?ref_=tt_urv', #The Night Shift 
        
        'https://www.imdb.com/title/tt8806524/reviews?ref_=tt_urv', #Star Trek: Picard 
        'https://www.imdb.com/title/tt0203259/reviews?ref_=tt_urv', #Law and Order
        'https://www.imdb.com/title/tt0364845/reviews?ref_=tt_urv', #NCIS
        'https://www.imdb.com/title/tt2193021/reviews?ref_=tt_urv', #Arrow
        'https://www.imdb.com/title/tt0112178/reviews?ref_=tt_urv', #Star Trek: Voyager 
        
        'https://www.imdb.com/title/tt2119532/reviews?ref_=tt_urv', #Hacksaw Ridge
        'https://www.imdb.com/title/tt2177461/reviews?ref_=tt_urv', #A Discovery of Witches 
        'https://www.imdb.com/title/tt1464540/reviews?ref_=tt_urv', #I Am Number Four
        'https://www.imdb.com/title/tt4786282/reviews?ref_=tt_urv', #Lights Out
        'https://www.imdb.com/title/tt2058673/reviews?ref_=tt_urv' #Point Break
        ]

actors = ['AnaA', 
          'SophiaL',
          'LindseyM',
          'JeriR',
          'TeresaP'
         ]

In [4]:
movies_reviews = [url_to_review(url) for url in urls]

https://www.imdb.com/title/tt8946378/reviews?ref_=tt_urv
https://www.imdb.com/title/tt7979142/reviews?ref_=tt_urv
https://www.imdb.com/title/tt1856101/reviews?ref_=tt_urv
https://www.imdb.com/title/tt8079248/reviews?ref_=tt_urv
https://www.imdb.com/title/tt1833116/reviews?ref_=tt_urv
https://www.imdb.com/title/tt9446688/reviews?ref_=tt_urv
https://www.imdb.com/title/tt9086228/reviews?ref_=tt_urv
https://www.imdb.com/title/tt1396484/reviews?ref_=tt_urv
https://www.imdb.com/title/tt7349950/reviews?ref_=tt_urv
https://www.imdb.com/title/tt2649356/reviews?ref_=tt_urv
https://www.imdb.com/title/tt2661044/reviews?ref_=tt_urv
https://www.imdb.com/title/tt0460649/reviews?ref_=tt_urv
https://www.imdb.com/title/tt0056758/reviews?ref_=tt_urv
https://www.imdb.com/title/tt1587678/reviews?ref_=tt_urv
https://www.imdb.com/title/tt2477230/reviews?ref_=tt_urv
https://www.imdb.com/title/tt8806524/reviews?ref_=tt_urv
https://www.imdb.com/title/tt0203259/reviews?ref_=tt_urv
https://www.imdb.com/title/tt03

In [8]:
movies_reviews

[['What an excellent film by Rian Johnson; definitely feels like the film he was destined to make. Writing that is slick as hell, sublime performances (most notably Daniel Craig who brings his A-game in a wonderfully charismatic turn), superb editing and wonderfully atmospheric music - all tied together by masterful direction. Will probably be among the most fun you have at a theatre this year and fans of Agatha Christie and old murder mystery stories will have plenty to love here - a nostalgically entertaining time!',
  'Nothing was typical about this. Everything was beautifully done in this movie, the story, the flow, the scenario, everything.\nI highly recommend it for mystery lovers, for anyone who wants to watch a good movie!',
  'This is a movie one would not regret spending money on. After a long time I am rating a movie perfect 10 and this movie totally deserves it. I really like the subtle comedy sprinkled in the movie. It easies out the tense atmosphere. In a good detective m

In [5]:
# Pickle files for later use

# Make a new directory to hold the files (uncomment on the first run)
#!mkdir moviesReviews

for ind_actor, actorName in enumerate(actors):
    with open("moviesReviews/" + actorName + ".txt", "wb") as file:
        ii = number_movies * ind_actor
        join_reviews = []
        for x in range(ii, ii + number_movies):
            join_reviews += (movies_reviews[x])
        pickle.dump(join_reviews, file)
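
The index arithmetic above relies on `urls` listing exactly `number_movies` entries per actor, in the same order as `actors`. A minimal sketch of the same grouping on toy data makes that assumption explicit; the helper name `group_reviews` and the sample lists are hypothetical and not part of the notebook.

```python
def group_reviews(names, reviews_per_title, titles_per_name):
    '''Concatenate the review lists of consecutive titles for each name.'''
    expected = len(names) * titles_per_name
    if len(reviews_per_title) != expected:
        raise ValueError("expected %d review lists, got %d"
                         % (expected, len(reviews_per_title)))
    grouped = {}
    for i, name in enumerate(names):
        start = i * titles_per_name
        joined = []
        # Slice out this name's block of titles and flatten it
        for reviews in reviews_per_title[start:start + titles_per_name]:
            joined.extend(reviews)
        grouped[name] = joined
    return grouped

# Toy data: two names, two titles each
toy = [["r1", "r2"], ["r3"], ["r4"], ["r5", "r6"]]
grouped = group_reviews(["A", "B"], toy, 2)
print(grouped)
```

The length check fails early if a URL is added or removed without updating the other list, instead of silently assigning reviews to the wrong actor.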

In [6]:
# Load pickled files
data = {}
for c in actors:
    with open("moviesReviews/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)


# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['AnaA', 'SophiaL', 'LindseyM', 'JeriR', 'TeresaP'])
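
The save and load pattern used in the two cells above can be checked in isolation. The sketch below is an illustration only: the file name `example.pkl` and the sample list are made up, and a temporary directory is used so nothing is left on disk. Note that the notebook stores its pickles with a `.txt` extension even though the content is binary, and that pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

reviews = ["first review", "second review"]

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    # Pickle data is binary, hence the "wb" / "rb" modes
    with open(path, "wb") as file:
        pickle.dump(reviews, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == reviews)
```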

In [None]:
data.values()

In [9]:
print(len(data['LindseyM']))
print("=========")
print(data['LindseyM'][50])





125
In 1980, Gloria Monty along with other General Hospital writers decided to make a couple out of two actors that had amazing chemistry.  Genie Francis and Tony Geary who have become the phenomenal Luke and Laura.  Luke and Laura made General Hospital the unforgettable soap opera it is, and still today, Luke and Laura's though painful relationship, keeps viewers going through their fascinating characterization and never ending passion between them.Alan and Monica, played by Leslie Charleson and Stuart Damon, Frisco and Felicia, played by Jack and Kristina Wagner, Robert and Holly, played by Tristan Rogers and Emma Samms, Sonny and Brenda, played by Maurice Bernard and Vanessa Marcil, and Luke and Laura have showed every other soap opera the definition of a "love story".  Though never simple, always intense, always unforgettable.General Hospital has also shown amazing technique in storylines that are not love stories.  Such as Monica Quartermaine's heartwrenching breast cancer storyli

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove all punctuation
* Remove numerical values
* Remove common nonsensical text (such as the newline character `\n`)

**More data cleaning steps after tokenization:**
* Tokenize text
* Remove stop words
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
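
As a rough illustration of the first two post-tokenization steps, the sketch below tokenizes a made-up sentence and removes stop words using only the standard library. The stop-word list `STOP_WORDS` and the helper `tokenize` are hypothetical stand-ins; dedicated NLP libraries ship far more complete versions.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer
STOP_WORDS = {"a", "the", "is", "and", "of", "it"}

def tokenize(text):
    '''Split lowercased text into word tokens (letters and apostrophes).'''
    return re.findall(r"[a-z']+", text.lower())

text = "The story is slow, and the ending of the story is a mess."
tokens = tokenize(text)
content_tokens = [t for t in tokens if t not in STOP_WORDS]

# Compare the most frequent token before and after stop-word removal
print(Counter(tokens).most_common(1))
print(Counter(content_tokens).most_common(1))
```

Before filtering, the most frequent token is a stop word; afterwards it is a word that actually says something about the text.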

In [12]:
# We are going to change this to key: actor, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
data_combined

 'SophiaL': ['Am I the only one who felt that way?Don\'t get me wrong. This series is really entertaining and witty. But it just felt like the intro to the story itself.Binge watching is done easily and quickly with seven 22 minutes episodes on average which just felt like 3 at the most or a 153 minutes movie.At the end i just asked myself, why they didn\'t go on so that the story gets told?! Couldn\'t they finish in time?!That\'s the only reason, i just gave it 7 stars. Now we have to wait up to a year for the continuation.Everything else was really good. Great show for someone who has grown up with Stephen King stories, Carrie and Breakfast Club. Building up the characters takes time, and I see that some other viewers think it is too slow - but I like the story been build up from layer to layer. Nice start for an interesting story... And great work from young stars! Bingeworthy perfect length 7 episodes ..strong performances...with a cliffhanger which makes me want season 2 right now

# Organizing the Data
I mentioned earlier that the output of this notebook will be clean, organized data in two standard text formats:
1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

## 1) Create the Corpus

In [13]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',250)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
#data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
AnaA,"What an excellent film by Rian Johnson; definitely feels like the film he was destined to make. Writing that is slick as hell, sublime performances (most notably Daniel Craig who brings his A-game in a wonderfully charismatic turn), superb editin..."
SophiaL,Am I the only one who felt that way?Don't get me wrong. This series is really entertaining and witty. But it just felt like the intro to the story itself.Binge watching is done easily and quickly with seven 22 minutes episodes on average which ju...
LindseyM,"First off, I enjoyed Season 1 of The 100. Season 3 was inferior compared to other seasons.The 100 is an American post-apocalyptic drama television series. I love The 100 because it has a unique story line. 97 years after a nuclear apocalypse, 100..."
JeriR,"Sorry, I've gotta go with the negative nerds on this one. I love Star Trek. I'm an old man, and I watched TOS episodes as a kid. I shushed my kids during new TNG episodes because I couldn't wait for the video tape to watch it. I watched DS9, Voya..."
TeresaP,"We knew already that Mel Gibson is a filmmaker with a powerful vision and the craftsmanship to go with it. Extraordinary battle scenes. Violence, Gibson style, which means Peckinpah plus, because here there is such a personal intention that makes..."


In [15]:
# Let's take a look at the combined reviews for the first actor, 'AnaA'
data_df.transcript.loc['AnaA']



In [16]:
contractions = {
"ain't": "am not / are not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is",
"i'd": "I had / I would",
"i'd've": "I would have",
"i'll": "I shall / I will",
"i'll've": "I shall have / I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
""
}

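def change_abrevations(text):
    '''Expand the contractions found in text using the contractions dictionary.'''
    for word in text.split():
        # Only whitespace-delimited tokens are looked up in the dictionary
        if word.lower() in contractions:
            text = text.replace(word, contractions[word.lower()])
    return text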
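
`change_abrevations` replaces raw substrings, so a short contraction can also rewrite part of a longer word (for example `he's` inside `she's`), and a token with attached punctuation such as `don't,` is only expanded when the bare form occurs elsewhere in the text. One possible alternative, sketched here under assumptions, compiles a single pattern with boundary guards. `SAMPLE_CONTRACTIONS` is a small hypothetical subset of the dictionary with one expansion per key, and `expand_contractions` is a hypothetical helper, not part of the notebook.

```python
import re

# Hypothetical subset of the contractions dictionary, for illustration only
SAMPLE_CONTRACTIONS = {
    "don't": "do not",
    "he's": "he is",
    "she's": "she is",
    "i'm": "i am",
}

# Longest keys first so that longer contractions win over shorter ones
_alternatives = "|".join(
    re.escape(key) for key in sorted(SAMPLE_CONTRACTIONS, key=len, reverse=True)
)
# The lookarounds reject matches glued to a letter, digit or apostrophe
_pattern = re.compile(r"(?<![\w'])(" + _alternatives + r")(?![\w'])", re.IGNORECASE)

def expand_contractions(text):
    '''Replace whole-word contractions; leave everything else untouched.'''
    return _pattern.sub(lambda m: SAMPLE_CONTRACTIONS[m.group(1).lower()], text)

print(expand_contractions("He's late, she's early. Don't worry, I'm here."))
```

Because the guards only look at neighbouring characters, a contraction that directly follows punctuation (as in `way?don't`) is still expanded.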

In [28]:
# Apply a first round of text cleaning techniques
import re
import string

# Make text all lower case
# Expand contractions
# Get rid of everything in square brackets
# Get rid of words containing digits
# Replace newlines with spaces so that neighbouring words are not glued together
def clean_text(text):
    '''Make text lowercase, expand contractions, remove text in square brackets, remove words containing numbers and replace newlines.'''
    text = text.lower()
    text = change_abrevations(text)
    text = re.sub(r'\[.*?\]', ' ', text)
    #text = re.sub('[‘’“”…]', ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)
    text = re.sub(r'\n', ' ', text)
    
    return text

cleaning_process = clean_text
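
One side effect of the digit rule is worth seeing on a small input: `\w*\d\w*` removes every token that contains a digit, so numeric titles such as "The 100" lose their number. The sample sentence below is made up for illustration.

```python
import re

sample = "I enjoyed Season 1 of The 100 [spoiler]. The 2nd half, 22 minutes in, drags."

# Same patterns as the cleaning function, applied one step at a time
no_brackets = re.sub(r'\[.*?\]', ' ', sample.lower())
no_digits = re.sub(r'\w*\d\w*', ' ', no_brackets)

print(no_brackets)
print(no_digits)
```

If numbers carry meaning for the analysis, this rule is a natural candidate to revisit in a later iteration.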

In [29]:
# Let's take a look at the updated text
data_with_punctuation = pd.DataFrame(data_df.transcript.apply(cleaning_process))
data_with_punctuation

Unnamed: 0,transcript
AnaA,"what an excellent film by rian johnson; definitely feels like the film he was destined to make. writing that is slick as hell, sublime performances (most notably daniel craig who brings his a-game in a wonderfully charismatic turn), superb editin..."
SophiaL,am i the only one who felt that way?do not get me wrong. this series is really entertaining and witty. but it just felt like the intro to the story itself.binge watching is done easily and quickly with seven minutes episodes on average which ju...
LindseyM,"first off, i enjoyed season of the . season was inferior compared to other seasons.the is an american post-apocalyptic drama television series. i love the because it has a unique story line. years after a nuclear apocalypse, prisoner..."
JeriR,"sorry, I have gotta go with the negative nerds on this one. i love star trek. I am an old man, and i watched tos episodes as a kid. i shushed my kids during new tng episodes because i could not wait for the video tape to watch it. i watched , vo..."
TeresaP,"we knew already that mel gibson is a filmmaker with a powerful vision and the craftsmanship to go with it. extraordinary battle scenes. violence, gibson style, which means peckinpah plus, because here there is such a personal intention that makes..."


In [30]:
# Let's pickle it for later use
#!mkdir pickles
# pickle the cleaned data (before we put it in document-term matrix format) 
data_with_punctuation.to_pickle('pickles/data_with_punctuation.pkl')

In [31]:
# Get rid of punctuation
def remove_punct(text):
    '''Replace every ASCII punctuation character with a space.'''
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    return text

remove_punctuation = remove_punct
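
Replacing punctuation with spaces leaves runs of whitespace behind. A small variant that also collapses them is sketched below; `strip_punctuation` is a hypothetical helper and the sample strings are made up. Keep in mind that `string.punctuation` only covers ASCII characters, so curly quotes such as `’` pass through unchanged.

```python
import re
import string

# Character class matching any single ASCII punctuation character
punct_pattern = re.compile('[%s]' % re.escape(string.punctuation))

def strip_punctuation(text):
    '''Replace punctuation with spaces, then collapse runs of whitespace.'''
    spaced = punct_pattern.sub(' ', text)
    return ' '.join(spaced.split())

print(strip_punctuation("post-apocalyptic drama; a-game (superb editing)..."))
print(strip_punctuation("it’s"))
```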

In [32]:

data_clean = pd.DataFrame(data_with_punctuation.transcript.apply(remove_punctuation))
data_clean

Unnamed: 0,transcript
AnaA,what an excellent film by rian johnson definitely feels like the film he was destined to make writing that is slick as hell sublime performances most notably daniel craig who brings his a game in a wonderfully charismatic turn superb editin...
SophiaL,am i the only one who felt that way do not get me wrong this series is really entertaining and witty but it just felt like the intro to the story itself binge watching is done easily and quickly with seven minutes episodes on average which ju...
LindseyM,first off i enjoyed season of the season was inferior compared to other seasons the is an american post apocalyptic drama television series i love the because it has a unique story line years after a nuclear apocalypse prisoner...
JeriR,sorry I have gotta go with the negative nerds on this one i love star trek I am an old man and i watched tos episodes as a kid i shushed my kids during new tng episodes because i could not wait for the video tape to watch it i watched vo...
TeresaP,we knew already that mel gibson is a filmmaker with a powerful vision and the craftsmanship to go with it extraordinary battle scenes violence gibson style which means peckinpah plus because here there is such a personal intention that makes...


In [33]:
# Let's pickle it for later use
#!mkdir pickles
# pickle the cleaned data (before we put it in document-term matrix format) 
data_clean.to_pickle('pickles/data_clean.pkl')

NOTE: This data cleaning aka text pre-processing step could go on for a while, but we are going to stop for now. After going through some analysis techniques, if you see that the results don't make sense or could be improved, you can come back and make more edits such as:

* Mark 'cheering' and 'cheer' as the same word (stemming / lemmatization)
* Combine 'thank you' into one term (bi-grams)
* And a lot more...
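
The two follow-up edits listed above could look roughly like this. Both helpers are hypothetical toys written for this illustration: `toy_stem` only strips a few suffixes and is a crude stand-in for a real stemmer or lemmatizer, and `merge_phrases` joins only the phrases it is explicitly given.

```python
def merge_phrases(tokens, phrases):
    '''Join adjacent token pairs listed in `phrases` into one token.'''
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            merged.append('_'.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_stem(word):
    '''Strip a few common suffixes, keeping at least three characters.'''
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = "thank you for cheering we cheer too".split()
tokens = merge_phrases(tokens, {('thank', 'you')})
stems = [toy_stem(t) for t in tokens]
print(stems)
```

After these two steps, 'cheering' and 'cheer' map to the same term and 'thank you' is counted as a single unit.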

In [25]:
# Let's add the actors' full names as well
full_names = ['Ana de Armas', 'Sophia Lillis','Lindsey Morgan','Jeri Ryan','Teresa Palmer']

data_clean['full_name'] = full_names
data_clean

Unnamed: 0,transcript,full_name
AnaA,what an excellent film by rian johnson definitely feels like the film he was destined to make writing that is slick as hell sublime performances most notably daniel craig who brings his a game in a wonderfully charismatic turn superb editin...,Ana de Armas
SophiaL,am i the only one who felt that way do not get me wrong this series is really entertaining and witty but it just felt like the intro to the story itself binge watching is done easily and quickly with seven minutes episodes on average which ju...,Sophia Lillis
LindseyM,first off i enjoyed season of the season was inferior compared to other seasons the is an american post apocalyptic drama television series i love the because it has a unique story line years after a nuclear apocalypse prisoner...,Lindsey Morgan
JeriR,sorry I have gotta go with the negative nerds on this one i love star trek I am an old man and i watched tos episodes as a kid i shushed my kids during new tng episodes because i could not wait for the video tape to watch it i watched vo...,Jeri Ryan
TeresaP,we knew already that mel gibson is a filmmaker with a powerful vision and the craftsmanship to go with it extraordinary battle scenes violence gibson style which means peckinpah plus because here there is such a personal intention that makes...,Teresa Palmer


In [27]:
# Let's pickle it for later use
data_clean.to_pickle("pickles/corpus.pkl")

## 2) Create a Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.
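
When bi-grams are requested together with stop-word removal, CountVectorizer filters the stop words first and only then pairs up the remaining tokens, which is why a column such as "abandoned creepy" can appear even if the original text read "abandoned and creepy". The sketch below is a pure-Python approximation of that behaviour, not scikit-learn's implementation; the stop-word list, the helper `unigrams_and_bigrams` and the two documents are made up for illustration.

```python
# Hypothetical mini stop-word list; scikit-learn's English list is far longer
STOP_WORDS = {"the", "was", "and", "a"}

def unigrams_and_bigrams(text):
    '''Return unigrams plus bigrams built from the stop-word-filtered tokens.'''
    # Single-character tokens are dropped, as the default token pattern does
    tokens = [t for t in text.split() if len(t) > 1 and t not in STOP_WORDS]
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

docs = {
    "doc_a": "the house was abandoned and creepy",
    "doc_b": "a creepy story",
}
features = {name: unigrams_and_bigrams(text) for name, text in docs.items()}

# Columns are the sorted set of all features; rows hold the counts
columns = sorted(set(f for feats in features.values() for f in feats))
matrix = {name: [feats.count(c) for c in columns] for name, feats in features.items()}

print(columns)
for name, row in matrix.items():
    print(name, row)
```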

In [22]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

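cv = CountVectorizer(stop_words='english', ngram_range=(1, 2))  # include bi-grams as well as single words
data_cv = cv.fit_transform(data_clean.transcript)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())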
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,ab,ab school,aback,aback things,abandoned,abandoned creepy,abandoned father,abandoned incoherent,abandoned later,abandoned says,...,zone effect,zone melodrama,zoom,zoom really,zzzz,zzzz make,åeople,åeople working,édgar,édgar ramírez
AnaA,1,1,1,1,1,1,0,0,0,0,...,1,0,0,0,1,1,1,1,0,0
SophiaL,0,0,0,0,2,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
LindseyM,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
JeriR,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TeresaP,0,0,0,0,1,0,0,0,0,1,...,0,1,1,1,0,0,0,0,3,3


In [23]:
# Let's pickle it for later use
data_dtm.to_pickle("pickles/docTermMatrix.pkl")

In [24]:
# Let's also pickle the CountVectorizer object
with open("pickles/CountVectorizer.pkl", "wb") as file:
    pickle.dump(cv, file)
