# Data Cleaning

To clean the scraped data, we join each comedian's transcripts into a single string, convert it to lowercase, and remove punctuation, numerical values, newline characters (`\n`), and stop words. Finally, we tokenize the data.
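As a rough orientation, the whole pipeline can be sketched on a toy string using only the standard library. The function name `preprocess`, the tiny `STOP_WORDS` set, and the sample sentence are hypothetical and chosen for illustration; whitespace splitting stands in for a real tokenizer.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list (a real one would be much longer)
STOP_WORDS = {'the', 'a', 'is', 'and', 'you'}

def preprocess(text):
    text = text.lower()                                               # lowercase
    text = re.sub(r'\[.*?\]', '', text)                               # drop bracketed stage directions
    text = re.sub(r'\w*\d\w*', '', text)                              # drop tokens containing digits
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # drop ASCII punctuation
    tokens = text.split()                                             # whitespace tokenization (also handles \n)
    return [token for token in tokens if token not in STOP_WORDS]     # drop stop words

tokens = preprocess("Thank you!\nThe show is at 8pm [applause] and it is GREAT.")
print(tokens)  # ['thank', 'show', 'at', 'it', 'great']
```

Splitting on whitespace makes a separate newline-removal step unnecessary in this sketch, because `str.split()` treats `\n` like any other whitespace.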

## Load Scraped Data

In [165]:
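# Imports
import pickle

# Note: despite the .json extension, the file is read with pickle, so it must contain pickled data
with open('saves/1.transcripts_data.json', 'rb') as file:
    transcripts = pickle.load(file)    # expected: mapping of comedian name -> list of transcript strings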

## Combine texts and convert to lowercase

In [166]:
for comedian in transcripts:
    transcripts[comedian] = [' '.join(transcripts[comedian]).lower()]
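The effect of this step can be checked on a toy dictionary. The comedian names and lines below are made up, and the sketch assumes each value is a list of strings, which is what the `' '.join(...)` call requires.

```python
# Hypothetical stand-in for the loaded transcripts
toy_transcripts = {
    'Comedian A': ['Hello.', 'Thank YOU!'],
    'Comedian B': ['Good evening.'],
}

# Reassigning values for existing keys while iterating is safe,
# because the set of keys does not change.
for comedian in toy_transcripts:
    toy_transcripts[comedian] = [' '.join(toy_transcripts[comedian]).lower()]

print(toy_transcripts)
# {'Comedian A': ['hello. thank you!'], 'Comedian B': ['good evening.']}
```

Each value ends up as a one-element list, so every comedian maps to exactly one combined string.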

## Convert into dataframe for ease of access

In [167]:
# Imports
import pandas as pd

data = pd.DataFrame.from_dict(transcripts).transpose()
data.columns = ['Transcript']
data.head()

| | Transcript |
|---|---|
| Lousic C.K. | intro\nfade the music out. let’s roll. hold th... |
| Dave Chappelle | this is dave. he tells dirty jokes for a livin... |
| Ricky Gervais | hello. hello! how you doing? great. thank you.... |
| Bo Burham | bo what? old macdonald had a farm e i e i o an... |
| Bill Burr | [cheers and applause] all right, thank you! th... |


## Remove text in brackets, numbers and punctuation

In [168]:
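# Imports
import re
import string

def clean_data(text):
    '''Phase 1: strip bracketed text, digits, punctuation and newlines.'''

    # Raw strings avoid invalid-escape warnings for \[, \w and \d
    text = re.sub(r'\[.*?\]', '', text)                                # Text in square brackets, e.g. [applause]
    text = re.sub(r'\w*\d\w*', '', text)                               # Any word containing a digit
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)    # ASCII punctuation
    text = re.sub(r'\n', ' ', text)                                    # Newlines become spaces
    text = re.sub('[“”–]', '', text)                                   # Typographic quotes and en dashes
    text = text.strip()                                                # Leading/trailing whitespace

    return text

cleaned_data = pd.DataFrame(data['Transcript'].apply(clean_data))
cleaned_data['Transcript'].iloc[0]    # .iloc: positional access on a name-indexed Series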

'intro fade the music out let’s roll hold there lights do the lights thank you thank you very much i appreciate that i don’t necessarily agree with you but i appreciate very much well this is a nice place this is easily the nicest place for many miles in every direction that’s how you compliment a building and shit on a town with one sentence it is odd around here as i was driving here there doesn’t seem to be any difference between the sidewalk and the street for pedestrians here people just kind of walk in the middle of the road i love traveling and seeing all the different parts of the country i live in new york i live in a there’s no value to your doing that at all the old lady and the dog i live i live in new york i always like there’s this old lady in my neighborhood and she’s always walking her dog she’s always just she’s very old she just stands there just being old and the dog just fights gravity every day just the two of them it’s really the dog’s got a cloudy eye and she’s g
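To see what each substitution contributes, the same patterns can be applied one at a time to a short made-up string while recording the intermediate results. The names `steps` and `trace` and the sample text are illustrative only.

```python
import re
import string

sample = "[laughs] it’s my 2nd show, in 1999!\nthank you."

# (label, pattern, replacement) in the same order as clean_data
steps = [
    ('brackets', r'\[.*?\]', ''),
    ('digits', r'\w*\d\w*', ''),
    ('punctuation', '[%s]' % re.escape(string.punctuation), ''),
    ('newlines', r'\n', ' '),
]

text = sample
trace = {}
for label, pattern, replacement in steps:
    text = re.sub(pattern, replacement, text)
    trace[label] = text    # keep the intermediate state for inspection

result = text.strip()
print(repr(result))  # 'it’s my  show in  thank you'
```

Two details are worth noting. The digit pattern removes whole tokens such as `2nd`, not just the digits, and leaves a double space behind. The typographic apostrophe `’` (U+2019) is not part of `string.punctuation`, which is why it is still present in the cleaned output and has to be handled in a later step.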

## Lemmatize the data

In [169]:
import spacy

nlp = spacy.load('en_core_web_sm')

def lemmatize(text):
    # Replace the typographic apostrophe ’ with a plain ' so that spaCy
    # splits contractions such as "don't" into "do" + "n't"
    text = text.replace('’', "'")

    return ' '.join([token.lemma_ for token in nlp(text)])

lemmatized_data = pd.DataFrame(cleaned_data['Transcript'].apply(lemmatize))
lemmatized_data['Transcript'].iloc[0]

"intro fade the music out let 's roll hold there light do the light thank you thank you very much I appreciate that I do n't necessarily agree with you but I appreciate very much well this be a nice place this be easily the nice place for many mile in every direction that be how you compliment a building and shit on a town with one sentence it be odd around here as I be drive here there do n't seem to be any difference between the sidewalk and the street for pedestrian here people just kind of walk in the middle of the road I love travel and see all the different part of the country I live in new york I live in a there be no value to your do that at all the old lady and the dog I live I live in new york I always like there be this old lady in my neighborhood and she be always walk her dog she be always just she be very old she just stand there just be old and the dog just fight gravity every day just the two of they it be really the dog 's get a cloudy eye and she be get a cloudy eye a

In [170]:
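def fill_data(text):
    '''Phase 2: expand the contraction fragments left after lemmatization.'''

    # Typographic-apostrophe patterns: these only match if a ’ survived
    # lemmatize(), which already converts ’ to '
    text = re.sub('’em', ' them ', text)
    text = re.sub(' ’ ', ' is ', text)
    text = re.sub('’re', ' are ', text)
    text = re.sub('n’t', 'not', text)

    # Plain-apostrophe fragments produced by spaCy's tokenizer
    text = re.sub("n't", 'not', text)
    text = re.sub("'s", '', text)        # Dropped entirely; leaves a double space behind
    text = re.sub("'d", 'would', text)
    text = re.sub("'ll", 'will', text)

    text = re.sub("'", '', text)         # Any remaining apostrophes

    return text

filled_data = pd.DataFrame(lemmatized_data['Transcript'].apply(fill_data))
filled_data['Transcript'].iloc[0]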

'intro fade the music out let  roll hold there light do the light thank you thank you very much I appreciate that I do not necessarily agree with you but I appreciate very much well this be a nice place this be easily the nice place for many mile in every direction that be how you compliment a building and shit on a town with one sentence it be odd around here as I be drive here there do not seem to be any difference between the sidewalk and the street for pedestrian here people just kind of walk in the middle of the road I love travel and see all the different part of the country I live in new york I live in a there be no value to your do that at all the old lady and the dog I live I live in new york I always like there be this old lady in my neighborhood and she be always walk her dog she be always just she be very old she just stand there just be old and the dog just fight gravity every day just the two of they it be really the dog  get a cloudy eye and she be get a cloudy eye and t
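The double spaces visible in this output (for example `let  roll`) come from deleting `'s` without touching the spaces around it. A minimal sketch of the plain-apostrophe rules on a made-up sentence shows the effect, followed by a whitespace normalization step. That last step is an addition for illustration and is not part of the notebook; `RULES` and `expand` are hypothetical names.

```python
import re

# Ordered (pattern, replacement) pairs; the catch-all apostrophe rule must come last
RULES = [
    (r"n't", "not"),
    (r"'s", ""),
    (r"'d", "would"),
    (r"'ll", "will"),
    (r"'", ""),
]

def expand(text):
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

sample = "I do n't think she 'll mind let 's go"
expanded = expand(sample)
print(repr(expanded))    # 'I do not think she will mind let  go'

# Optional extra step: collapse runs of whitespace into single spaces
normalized = re.sub(r"\s+", " ", expanded).strip()
print(repr(normalized))  # 'I do not think she will mind let go'
```

Whether the extra spaces matter depends on the downstream tokenizer; most whitespace-based tokenizers ignore them.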

## Save cleaned data (Pandas)

In [171]:
filled_data.to_csv('saves/2.cleaned_transcripts_df.csv')
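When the file is loaded again, the comedian names are stored in an unnamed first column. The sketch below uses an in-memory buffer and made-up rows instead of the real file to show the difference between a plain `pd.read_csv` call and one with `index_col=0`.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for filled_data
df = pd.DataFrame(
    {'Transcript': ['hello there', 'thank you']},
    index=['Comedian A', 'Comedian B'],
)

buffer = io.StringIO()
df.to_csv(buffer)                              # index is written as an unnamed first column

buffer.seek(0)
plain = pd.read_csv(buffer)                    # index comes back as a column named "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)    # first column is used as the index again

print(list(plain.columns))     # ['Unnamed: 0', 'Transcript']
print(list(restored.index))    # ['Comedian A', 'Comedian B']
```

Passing `index_col=0` keeps the comedian names as row labels, so later notebooks can keep selecting rows by name.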