<h1>Sentiment Analysis for Funny Classification</h1>

The main idea is to extract the words within brackets in the corpus and test whether their polarity score can serve as a metric for how funny a sentence in a transcript is. Hypothesis: the polarity score for 'cheering' > 'laughing' > 'chuckling'.

We use the original `transcripts.csv` data for the sentiment analysis of the bracketed words, with a rule-based approach as a preliminary data exploration step.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt 
import pickle
import seaborn as sns
#from textblob import TextBlob
#from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 

In [2]:
#Load corpus
df = pd.read_csv('transcripts.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Comedian,Date,Title,Subtitle,Transcript
0,0,Chris Rock,"March 8, 2023",Selective Outrage (2023) | Transcript,,[slow instrumental music playing] [funk drums ...
1,1,Marc Maron,"March 3, 2023",Thinky Pain (2013) | Transcript,Marc Maron returns to his old stomping grounds...,[siren wailing] I don’t know what you were thi...
2,2,Chelsea Handler,"March 3, 2023",Evolution (2020) | Transcript,Chelsea Handler is back and better than ever -...,Join me in welcoming the author of six number ...
3,3,Tom Papa,"March 3, 2023",What A Day! (2022) | Transcript,"Follows Papa as he shares about parenting, his...","Premiered on December 13, 2022 Ladies and gent..."
4,4,Jim Jefferies,"February 22, 2023",High n’ Dry (2023) | Transcript,Jim Jefferies is back and no topic is off limi...,"Please welcome to the stage, Jim Jefferies! He..."


<h2>Data Exploration</h2>

<h3>Identify Inconsistencies in Context Annotations</h3>

Since we analyze the context annotations inside square brackets, we first check that the brackets are used consistently.

In [3]:
def count_square_brackets(transcript):
    '''Naive way of checking for consistency - check if number of open brackets 
    and number of close brackets are the same.''' 
    
    open_brackets = transcript.count("[")
    close_brackets = transcript.count("]")
    
    if open_brackets == close_brackets:
        return open_brackets
    
    return -1

df_brackets = df.copy()
df_brackets['num_sqbrackets'] = df_brackets["Transcript"].map(count_square_brackets)

# Return the transcripts with inconsistent square brackets
print("Number of transcripts with inconsistent square brackets: ", len(df_brackets[df_brackets.num_sqbrackets == -1]))

# Return the transcripts with no square brackets
print("Number of transcripts with no square brackets: ", len(df_brackets[df_brackets.num_sqbrackets == 0]))

Number of transcripts with inconsistent square brackets:  9
Number of transcripts with no square brackets:  166


We see that as many as 166 transcripts contain no square brackets, though some of them use parentheses instead. We check for these in the next steps.

In [4]:
def count_round_brackets(transcript):
    '''Naive way of checking for consistency - check if number of open brackets 
    and number of close brackets are the same.''' 
    
    open_brackets = transcript.count("(")
    close_brackets = transcript.count(")")
    
    if open_brackets == close_brackets:
        return open_brackets
    
    return -1

df_brackets['num_rbrackets'] = df_brackets["Transcript"].map(count_round_brackets)

# Return the number of transcripts with inconsistent round brackets
print("Number of transcripts with inconsistent round brackets: ", len(df_brackets[df_brackets.num_rbrackets == -1]))

# Return the transcripts with no round brackets
print("Number of transcripts with no round brackets: ", len(df_brackets[df_brackets.num_rbrackets == 0]))

Number of transcripts with inconsistent round brackets:  6
Number of transcripts with no round brackets:  318
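
Both checks above compare counts only, so a transcript such as `] text [` would pass even though its brackets are out of order. A minimal sketch of a stricter check, assuming annotations are never nested, tracks whether a bracket is currently open. The name `brackets_well_formed` is hypothetical and not part of the notebook.

```python
def brackets_well_formed(transcript, open_ch="[", close_ch="]"):
    """Return True if every open bracket is closed before the next one opens.

    Assumes annotations are not nested, so "[audience [laughs]]" is rejected.
    """
    is_open = False
    for ch in transcript:
        if ch == open_ch:
            if is_open:          # second opener before a closer
                return False
            is_open = True
        elif ch == close_ch:
            if not is_open:      # closer without a matching opener
                return False
            is_open = False
    return not is_open           # an opener left unclosed is also malformed


print(brackets_well_formed("[audience laughs] Hello"))   # True
print(brackets_well_formed("] Hello ["))                 # False, although the counts match
```

Passing `"("` and `")"` as the last two arguments applies the same check to round brackets.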


In [5]:
# Return the transcripts with no round AND square brackets

print("Number of transcripts with no context annotations: ", len(df_brackets[(df_brackets.num_sqbrackets == 0) & (df_brackets.num_rbrackets == 0)]))
df_brackets[(df_brackets.num_sqbrackets == 0) & (df_brackets.num_rbrackets == 0)]

Number of transcripts with no context annotations:  114


Unnamed: 0.1,Unnamed: 0,Comedian,Date,Title,Subtitle,Transcript,num_sqbrackets,num_rbrackets
2,2,Chelsea Handler,"March 3, 2023",Evolution (2020) | Transcript,Chelsea Handler is back and better than ever -...,Join me in welcoming the author of six number ...,0,0
3,3,Tom Papa,"March 3, 2023",What A Day! (2022) | Transcript,"Follows Papa as he shares about parenting, his...","Premiered on December 13, 2022 Ladies and gent...",0,0
4,4,Jim Jefferies,"February 22, 2023",High n’ Dry (2023) | Transcript,Jim Jefferies is back and no topic is off limi...,"Please welcome to the stage, Jim Jefferies! He...",0,0
14,14,Kate Berlant,"September 24, 2022",Cinnamon in the Wind (2022) | Transcript,"Kate Berlant performs an intimate, absurdist s...","Whoa! Okay, yeah. Good. Okay, don’t embarrass ...",0,0
27,27,Jerrod Carmichael,"May 24, 2022",Rothaniel (2022) | Transcript,Features Jerrod Carmichael in a standup comedy...,Man… We were waiting for you. I’m happy you’re...,0,0
...,...,...,...,...,...,...,...,...
387,387,LOUIS C.K.,"May 4, 2017",LIVE AT THE COMEDY STORE (2015) – Transcript,The comic puts his trademark hilarious/thought...,"Thank you. Oh. Oh, my God, you guys. Oh, my Go...",0,0
394,394,RICHARD PRYOR,"April 26, 2017",LIVE ON THE SUNSET STRIP (1982) – Full Transcript,Comedy legend Richard Pryor touches on all sub...,Recorded at the Circle Star Theater in San Car...,0,0
401,401,GEORGE CARLIN,"April 12, 2017",JAMMING IN NEW YORK (1992) – Testo italiano co...,,"Ciao, grazie. Grazie. Grazie. Grazie molte. Gr...",0,0
405,405,Eddie Murphy,"April 11, 2017",Delirious (1983) – Transcript,"Transcript of Eddie Murphy's 'Delirious', stan...","Filmed on August 17, 1983 at DAR Constitution ...",0,0


In [6]:
# Union the unique transcripts with inconsistent square and round brackets

print("Number of transcripts with inconsistent square brackets OR inconsistent round brackets OR both: ", len(df_brackets[(df_brackets.num_sqbrackets == -1) | (df_brackets.num_rbrackets == -1)]))
df_brackets[(df_brackets.num_sqbrackets == -1) | (df_brackets.num_rbrackets == -1)]

Number of transcripts with inconsistent square brackets OR inconsistent round brackets OR both:  13


Unnamed: 0.1,Unnamed: 0,Comedian,Date,Title,Subtitle,Transcript,num_sqbrackets,num_rbrackets
87,87,Hannah Gadsby,"May 26, 2020",Douglas (2020) – Transcript,"In her second Netflix special ""Douglas"", Hanna...",The following is the transcript of Hannah Gadb...,-1,0
93,93,LEE MACK,"May 8, 2020",GOING OUT LIVE (2010) – FULL TRANSCRIPT,"Lee Mack, star of BBC comedy shows 'Not Going ...",This programme contains strong language [APPLA...,-1,-1
125,125,Tiffany Haddish,"December 5, 2019",Black Mitzvah (2019) – Transcript,The comedy special for the comedian/actress wa...,[roaring cheers] [Haddish] People think they k...,-1,0
144,144,,"July 14, 2019",Tom Segura Overdoses – This Is Not Happening [...,Tom Segura returns home from college for the f...,Episode aired 30 July 2013 This woman goes “He...,-1,0
185,185,Trevor Noah,"November 21, 2018",Son of Patricia (2018) – Transcript,"Trevor Noah gets out from behind the ""Daily Sh...",A NETFLIX ORIGINAL COMEDY SPECIAL [distant tra...,-1,0
236,236,GREG DAVIES,"April 27, 2018",YOU MAGNIFICENT BEAST (2018) – Full Transcript,British comedian Greg Davies revisits terrifyi...,"♪ Now I’m not trying to be rude ♪ But, hey, pr...",51,-1
250,250,FRED ARMISEN,"February 13, 2018",STANDUP FOR DRUMMERS (2018) – Full Transcript,"For an audience of drummers, comedian Fred Arm...",[man] Drummers only tonight. Drummers only. No...,-1,1
260,260,D.L. Hughley,"January 19, 2018",Unapologetic (2007) – Transcript,"D.C., Hughley focuses on such topics as the da...",[Audience cheering] (DL Hughley enters from st...,114,-1
270,270,BILL HICKS,"January 12, 2018","LIVE AT LAFF STOP, AUSTIN, TX, AND COBBS, SAN ...","Well, folks, this is kind of a sentimental eve...","Recorded Live at Laff Stop, Austin, TX, and Co...",3,-1
310,310,Gabriel Iglesias,"November 7, 2017",Hot And Fluffy (2007) – Transcript,"Gabriel Iglesias blends storytelling, sound ef...",[Latino-style music] [audience cheering] (male...,-1,-1


In [7]:
# Return transcripts with at least one and at most X annotations for both round and square brackets

X = 20

# Boolean masks, built once and reused for the count and the display
few_sq = (df_brackets.num_sqbrackets > 0) & (df_brackets.num_sqbrackets <= X)
few_r = (df_brackets.num_rbrackets > 0) & (df_brackets.num_rbrackets <= X)

print(f"Number of transcripts with annotations but <= {X}: ", len(df_brackets[few_sq & few_r]))
df_brackets[few_sq & few_r]

Number of transcripts with annotations but <= 20:  22


Unnamed: 0.1,Unnamed: 0,Comedian,Date,Title,Subtitle,Transcript,num_sqbrackets,num_rbrackets
31,31,,"May 2, 2022",Trevor Noah at the White House Correspondents’...,Trevor Noah headlined the annual White House C...,Great. I got a promise I will not be going to ...,12,2
47,47,Bo Burnham,"June 1, 2021",Inside (2021) – Transcript,"'Inside' carries satire, social commentary and...","Exploring mental health decline over 2020, the...",5,1
86,86,BILLY CONNOLLY,"May 27, 2020",HIGH HORSE TOUR LIVE (2016) – FULL TRANSCRIPT,Billy may had some tougher times in recent yea...,"Ladies and gentlemen, would you please welcome...",1,1
102,102,DAVE ALLEN,"April 30, 2020",FIRST DAY AT CATHOLIC SCHOOL [TRANSCRIPT],Dave Allen on his first day at Catholic school...,Dave Allen on his first day at Catholic school...,3,1
105,105,George Carlin,"April 13, 2020",The Indian Drill Sergeant – Transcript,"In 1965 “The Indian Sergeant,” was emerging as...","In 1965 “The Indian Sergeant,” was emerging as...",1,2
120,120,Kevin Bridges,"December 22, 2019",A Whole Different Story (2015) – Full Transcript,,"Ladies and gentlemen, please welcome Kevin Bri...",6,1
138,138,Dave Chappelle,"August 26, 2019",Sticks & Stones (2019) – Transcript,Legendary comedian Dave Chappelle is back with...,Sticks & Stones is Dave Chappelle’s fifth Netf...,1,1
153,153,,"May 18, 2019",Doug Stanhope on babies and abortion,There's a specific group of over four million ...,From “Dead Beat Hero” (2004) Immigration. Ther...,5,1
184,184,,"November 27, 2018",Volker Pispers about USA (2004) – Transcript,"Last part of Volker Pispers' program ""Bis neul...",Last part of Volker Pispers’ program “Bis neul...,7,4
221,221,Bill Burr,"July 6, 2018",The Philadelphia Incident (2006) – Transcript,"Transcript of the ""Philadelphia incident"" wher...",NEW! The full transcript of Bill’s monologue ...,9,1


From the above, we have three problems with our data:

1. Lack of context annotations i.e. transcripts with no annotations (114 transcripts) 
2. Inconsistent square/round brackets (13 transcripts)
3. Transcripts with very few annotations, i.e. between 1 and 20 of each bracket type (22 transcripts)

The three filters are mutually exclusive by construction (the first requires both counts to be 0, the second requires at least one count of -1, and the third requires both counts to be positive), so we would drop 114 + 13 + 22 = 149 transcripts, leaving us with roughly 2/3 of the data. (Will this affect our analysis greatly?)
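
The count above can be checked directly: the sum of the category sizes equals the size of their union only when the categories do not overlap. A toy sketch with made-up index sets (they are not the notebook's actual indices):

```python
# Hypothetical index sets standing in for the three problem categories
no_annotations = {2, 3, 4}
inconsistent = {87, 93}
few_annotations = {31, 47, 93}   # 93 overlaps with `inconsistent` in this toy example

# Sum of sizes is an upper bound; the union gives the true number of rows to drop
upper_bound = len(no_annotations) + len(inconsistent) + len(few_annotations)
actual = len(no_annotations | inconsistent | few_annotations)

print(upper_bound)  # 8
print(actual)       # 7
```

On the real data, the same comparison could be made between the three filtered index lengths and the length of the combined list of removed indices.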

We remove the transcripts with no annotations from our corpus. Keeping them would skew the classification results, because potentially funny content would be labelled as unfunny simply for lack of annotations.

For the remaining problematic transcripts, it may still be possible to recover the segments whose brackets are consistent and add them to the corpus. For our baseline model, however, we remove these transcripts first.
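
One possible way to salvage such transcripts later, sketched here under the assumption that annotations are never nested, is to keep only the bracket pairs that are well formed and ignore stray brackets. The helper name `well_formed_annotations` is hypothetical.

```python
import re

def well_formed_annotations(transcript):
    """Return the text of every [ ... ] pair that contains no other bracket."""
    return re.findall(r"\[([^\[\]]*)\]", transcript)

# Made-up example with one stray "]" and one unclosed "["
sample = "[audience cheering] Hello ] there [ applause [audience laughs]"
print(well_formed_annotations(sample))
# ['audience cheering', 'audience laughs']
```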

In [8]:
# Filter problematic transcripts and remove them from our dataset

df_no_annotations = df_brackets[(df_brackets.num_sqbrackets == 0) & (df_brackets.num_rbrackets == 0)]
df_inconsistent = df_brackets[(df_brackets.num_sqbrackets == -1) | (df_brackets.num_rbrackets == -1)]
df_few_annotations = df_brackets[((df_brackets.num_sqbrackets > 0) & (df_brackets.num_sqbrackets <= X)) & 
                        ((df_brackets.num_rbrackets > 0) & (df_brackets.num_rbrackets <= X))]

def populate_removed_idx():
    no_annotations_idx = df_no_annotations.index
    inconsistent_idx = df_inconsistent.index
    few_annotations_idx = df_few_annotations.index
    
    newList = list(set().union(no_annotations_idx, inconsistent_idx, few_annotations_idx))
    return newList
                   
removed_idx = populate_removed_idx() # Store removed indices

df_clean = df.copy().drop(index = removed_idx) # Drop problematic indices
df_clean.head()
# df_clean.shape #266 transcripts
    

Unnamed: 0.1,Unnamed: 0,Comedian,Date,Title,Subtitle,Transcript
0,0,Chris Rock,"March 8, 2023",Selective Outrage (2023) | Transcript,,[slow instrumental music playing] [funk drums ...
1,1,Marc Maron,"March 3, 2023",Thinky Pain (2013) | Transcript,Marc Maron returns to his old stomping grounds...,[siren wailing] I don’t know what you were thi...
5,5,,"January 22, 2023",Dave Chappelle Stand-Up Monologue – SNL (2022)...,"Dave Chappelle talks about Kanye West, the 202...","Original air date: November 12, 2022 * * * Lad..."
6,6,Dave Chappelle,"January 22, 2023",What’s in a Name (2022) | Transcript,Dave Chappelle delivers a speech at his presti...,What’s in a Name? is a 40-minute talk Chappell...
7,7,Iliza Shlesinger,"December 20, 2022",Hot Forever (2022) | Transcript,With topics ranging from tight rompers to ugly...,[upbeat music playing] [crowd cheering] Clevel...


In [9]:
# Filter transcripts with no comedian

no_comedian_idx = df_clean[df_clean["Comedian"].isna()].index
#len(no_comedian_idx) #20 transcripts without comedian
clean_transcripts = df_clean.drop(index = no_comedian_idx)
clean_transcripts.head()
#clean_transcripts.shape # 246 transcripts

Unnamed: 0.1,Unnamed: 0,Comedian,Date,Title,Subtitle,Transcript
0,0,Chris Rock,"March 8, 2023",Selective Outrage (2023) | Transcript,,[slow instrumental music playing] [funk drums ...
1,1,Marc Maron,"March 3, 2023",Thinky Pain (2013) | Transcript,Marc Maron returns to his old stomping grounds...,[siren wailing] I don’t know what you were thi...
6,6,Dave Chappelle,"January 22, 2023",What’s in a Name (2022) | Transcript,Dave Chappelle delivers a speech at his presti...,What’s in a Name? is a 40-minute talk Chappell...
7,7,Iliza Shlesinger,"December 20, 2022",Hot Forever (2022) | Transcript,With topics ranging from tight rompers to ugly...,[upbeat music playing] [crowd cheering] Clevel...
8,8,Gabriel Iglesias,"November 28, 2022",Stadium Fluffy (2022) | Transcript,"Features Gabriel ""Fluffy"" Iglesias as he talks...",[man] Can you please state your name? Martin M...


<h3>Split Sentences and Annotations</h3>

In [10]:
# Select a sample transcript from the cleaned transcripts

sample_transcript = clean_transcripts.loc[0, "Transcript"]
sample_title = clean_transcripts.loc[0, "Title"]

print("Transcript:\n", sample_transcript)
print("Title:\n", sample_title)

Transcript:
 [slow instrumental music playing] [funk drums playing] [indistinct chatter] [man] Let’s go! [hip-hop music playing] [audience cheering] [Chris Rock] She said, “$300, I’ll do anything you want.” I said, “Bitch, paint my house.” We don’t need the death penalty! We got the tossed salad man! ‘Cause if a bullet costs $5,000, there’ll be no more innocent bystanders. I ain’t scared of Al-Qaeda. I’m scared of Al-Cracker. You cannot lend money to people you’re fucking. ‘Cause they think that sex is a payback. We just got a few bad apples that like to crash into mountains. [audience laughing] [audience cheering] [hip-hop music playing] [female announcer] Ladies and gentlemen. Ladies and gentlemen. Chris Rock! [audience cheering] [audience continue cheering] [Chris Rock] What’s up, Baltimore? [audience cheers loudly] Yes! Yes, yes. Thank you! Thank you so much! Thank you so much to coming to my Netflix special. Thank you. [audience cheering] That’s right. That’s right! Okay. I’mma tr

In [11]:
# Combine annotations if they are next to each other
def combine_annotations(transcript):
    word_split = transcript.split(" ")
    
    words_w_annot = []
    curr_idx = 0
    annot_prev = False #True if previous index is annotation
    while curr_idx < len(word_split):
        word = word_split[curr_idx]
        
        if word.startswith("["):
            #Append with other words within the brackets
            annot_split = [word]
            while not word_split[curr_idx].endswith("]"):
                curr_idx += 1
                annot_split.append(word_split[curr_idx])
                
            result = " ".join(annot_split) #e.g. [audience laughs]
            
            #If the previous word is an annotation
            if annot_prev:
                prev = words_w_annot.pop() #Get the previous annotation
                new_annot = prev + result # e.g. [audience laughs][audience continues laughing]
                result = new_annot.replace("][", "; ")
            
            #Append result to stored words and annotations list
            words_w_annot.append(result)
            annot_prev = True            
        else:
            words_w_annot.append(word_split[curr_idx])
            annot_prev = False
            
        curr_idx += 1
    
    return " ".join(words_w_annot)
        
            
#Testing function
print(combine_annotations("[audience laughs] [audience cries] Hello everybody!")) # expected: [audience laughs; audience cries] Hello everybody!

[audience laughs; audience cries] Hello everybody!
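
As a cross-check, the same merge can be sketched with a single regular expression that replaces a closing bracket, optional whitespace and an opening bracket with `"; "`. This is an alternative sketch rather than the notebook's method, and the name `merge_adjacent_annotations` is hypothetical.

```python
import re

def merge_adjacent_annotations(transcript):
    """Merge annotations separated only by whitespace into one bracket pair."""
    return re.sub(r"\]\s*\[", "; ", transcript)

print(merge_adjacent_annotations("[audience laughs] [audience cries] Hello everybody!"))
# [audience laughs; audience cries] Hello everybody!
```

For the test input it gives the same result as `combine_annotations`.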


In [12]:
# Split the transcript into segments, each ending with the annotation that follows it
def reorganise_transcript(transcript):
    combined_annot = combine_annotations(transcript)
    return [ sentence.strip() + "]" for sentence in combined_annot.split("]") if sentence != ""]  

def extract_features_labels(transcript):
    reorged = reorganise_transcript(transcript)
    
    # Skip trailing text after the last annotation: it has no "[" and would raise an IndexError below
    annot_script = [sentence.split("[") for sentence in reorged if "[" in sentence]
    script = [pair[0].strip() for pair in annot_script]
    annots = [pair[1][:-1] for pair in annot_script]
    
    return pd.DataFrame(zip(script, annots), columns = ["Script", "Annotation"])

sample_data = extract_features_labels(sample_transcript)

#Remove sentences with no script or annotation
sample_data_cleaned = sample_data[(sample_data["Script"] != "") & (sample_data["Annotation"] != "")]
sample_data_cleaned

Unnamed: 0,Script,Annotation
1,Let’s go!,hip-hop music playing; audience cheering; Chri...
2,"She said, “$300, I’ll do anything you want.” I...",audience laughing; audience cheering; hip-hop ...
3,Ladies and gentlemen. Ladies and gentlemen. Ch...,audience cheering; audience continue cheering;...
4,"What’s up, Baltimore?",audience cheers loudly
5,"Yes! Yes, yes. Thank you! Thank you so much! T...",audience cheering
...,...,...
86,No. It’s never gonna happen. No. Fuck that shi...,audience cheering
87,"I took it like motherfucking Pacquiao, okay? S...",audience cheering
88,"Pookie, motherfucker. I played a piece of corn...",audience cheering
89,I didn’t. I did not have any entanglement. For...,audience cheering
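
The pairing of each script segment with the annotation that follows it can also be sketched with a regular expression. This is a simplified alternative, assuming adjacent annotations have already been merged and brackets are not nested; `split_script_annotations` is a hypothetical name.

```python
import re

def split_script_annotations(transcript):
    """Return (script, annotation) pairs; each script is the text preceding its annotation."""
    pattern = re.compile(r"([^\[\]]*)\[([^\[\]]*)\]")
    return [(script.strip(), annot.strip()) for script, annot in pattern.findall(transcript)]

# Made-up example; text after the last annotation has no label and is dropped
pairs = split_script_annotations(
    "[music] Let's go! [audience cheering] Thank you. [audience laughing] Bye"
)
print(pairs)
```

The first pair has an empty script because the transcript opens with an annotation, which is the kind of row the notebook filters out afterwards.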


<h3>Sentiment Analysis of Annotations (Label)</h3>

In [13]:
pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [14]:
import re
import string 
from textblob import TextBlob

In [15]:
def clean_text(text):
    r'''Make text lowercase, remove text in square brackets, remove punctuation,
    remove quotation marks, remove words containing numbers, remove \n'''
    text = text.lower()
    # Raw strings avoid invalid-escape warnings for \[ \w \d in regex patterns
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub(r'\n', '', text)

    return text

# Alias kept so later cells can call .apply(cleaning)
cleaning = clean_text

In [16]:
# apply data cleaning to Annotation column 
label_data = sample_data_cleaned.copy()
label_data['Annotation'] = label_data['Annotation'].apply(cleaning)

label_data.head()

Unnamed: 0,Script,Annotation
1,Let’s go!,hiphop music playing audience cheering chris rock
2,"She said, “$300, I’ll do anything you want.” I...",audience laughing audience cheering hiphop mus...
3,Ladies and gentlemen. Ladies and gentlemen. Ch...,audience cheering audience continue cheering c...
4,"What’s up, Baltimore?",audience cheers loudly
5,"Yes! Yes, yes. Thank you! Thank you so much! T...",audience cheering


<h4>Rule-based Approach</h4>

In [17]:
# lambda function for TextBlob to find the polarity of each annotation
pol = lambda x: TextBlob(x).sentiment.polarity

In [18]:
df_textblob = label_data.copy()

In [19]:
# get the polarity score of each annotation
df_textblob['Polarity_Score'] = df_textblob['Annotation'].apply(pol)

In [20]:
df_textblob

Unnamed: 0,Script,Annotation,Polarity_Score
1,Let’s go!,hiphop music playing audience cheering chris rock,0.0
2,"She said, “$300, I’ll do anything you want.” I...",audience laughing audience cheering hiphop mus...,0.0
3,Ladies and gentlemen. Ladies and gentlemen. Ch...,audience cheering audience continue cheering c...,0.0
4,"What’s up, Baltimore?",audience cheers loudly,0.1
5,"Yes! Yes, yes. Thank you! Thank you so much! T...",audience cheering,0.0
...,...,...,...
86,No. It’s never gonna happen. No. Fuck that shi...,audience cheering,0.0
87,"I took it like motherfucking Pacquiao, okay? S...",audience cheering,0.0
88,"Pookie, motherfucker. I played a piece of corn...",audience cheering,0.0
89,I didn’t. I did not have any entanglement. For...,audience cheering,0.0


In [21]:
sentence = ["audience cheering"]

df_test = pd.DataFrame(sentence)
df_test[0].apply(pol)

0    0.0
Name: 0, dtype: float64

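TextBlob's default polarity score is lexicon-based: it is driven by the words in the text that carry a sentiment score in its lexicon. "audience cheering" is a neutral, factual narration, so no emotion is detected and its polarity is 0.0. In the sample above, only "audience cheers loudly" receives a non-zero score (0.1), while annotations containing "laughing" and "cheering" both score 0.0, so the polarity score does not separate them as hypothesised.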
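
A toy illustration of why a lexicon-based scorer returns 0.0 for neutral narration: words that are absent from the lexicon contribute nothing. The lexicon and its scores below are made up for illustration and are not TextBlob's actual lexicon or exact algorithm.

```python
# Toy lexicon with made-up scores; TextBlob's real lexicon is much larger and differs in detail
TOY_LEXICON = {"great": 0.8, "terrible": -1.0, "loudly": 0.1}

def toy_polarity(text):
    """Average the scores of the words found in the lexicon; 0.0 if none are found."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("audience cheering"))        # 0.0
print(toy_polarity("audience cheers loudly"))   # 0.1
```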

<h3>Assign Binary Label</h3>

In [22]:
df_labelled = sample_data_cleaned.copy()

# Label laughter if the word "laugh" appears in the annotation (case-insensitive)
df_labelled["Label"] = df_labelled["Annotation"].str.lower().str.contains("laugh", regex=False)
df_labelled.head()

Unnamed: 0,Script,Annotation,Label
1,Let’s go!,hip-hop music playing; audience cheering; Chri...,False
2,"She said, “$300, I’ll do anything you want.” I...",audience laughing; audience cheering; hip-hop ...,True
3,Ladies and gentlemen. Ladies and gentlemen. Ch...,audience cheering; audience continue cheering;...,False
4,"What’s up, Baltimore?",audience cheers loudly,False
5,"Yes! Yes, yes. Thank you! Thank you so much! T...",audience cheering,False
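
The labelling rule is a plain substring test. A minimal sketch of the same rule as a reusable function with a configurable cue list follows; the function name `has_laughter` and the extra cue `"chuckl"` are assumptions for illustration, since the notebook itself labels on "laugh" only.

```python
def has_laughter(annotation, cues=("laugh",)):
    """Return True if any cue appears as a substring of the lower-cased annotation."""
    text = annotation.lower()
    return any(cue in text for cue in cues)

print(has_laughter("audience laughing; audience cheering"))          # True
print(has_laughter("audience cheers loudly"))                        # False
print(has_laughter("Audience chuckles", cues=("laugh", "chuckl")))   # True
```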


<h3>Text Pre-processing</h3>

In [23]:
import re
import string 
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer 
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [25]:
# apply data cleaning to Script column

df_processed = df_labelled.copy()
df_processed['Script'] = df_processed['Script'].apply(cleaning)

In [26]:
df_processed.head()

Unnamed: 0,Script,Annotation,Label
1,lets go,hip-hop music playing; audience cheering; Chri...,False
2,she said ill do anything you want i said bitc...,audience laughing; audience cheering; hip-hop ...,True
3,ladies and gentlemen ladies and gentlemen chri...,audience cheering; audience continue cheering;...,False
4,whats up baltimore,audience cheers loudly,False
5,yes yes yes thank you thank you so much thank ...,audience cheering,False


In [27]:
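# Helper functions to create the tokenizer for the TF-IDF Vectorizer
# (the NLTK tokenizer, stopwords, POS tagger and WordNet data must be downloaded beforehand)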

def get_wordnet_pos(treebank_tag) : 
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # As default pos in lemmatization is Noun
        return wordnet.NOUN
    
lemmatizer = WordNetLemmatizer()

def pos_then_lemmatize(pos_tagged_words) :
    res = []
    for pos in pos_tagged_words : 
        word = pos[0]
        pos_tag = pos[1]

        lem = lemmatizer.lemmatize(word, get_wordnet_pos(pos_tag))
        res.append(lem)
    return res

def custom_tokenizer(text) : 
    words = word_tokenize(text.lower())
    
    stop_words = set(stopwords.words('english')) 
    filtered_words = [w for w in words if w not in stop_words] 
    pos_tagged_words = nltk.pos_tag(filtered_words)
    tokens = pos_then_lemmatize(pos_tagged_words)
    
    return tokens

In [2]:
# TF-IDF Vectorizer
# Requires custom_tokenizer from the helper cell above to be defined in the current kernel session
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range = (1, 1),
                    tokenizer = custom_tokenizer)
tf_vectors = tf.fit_transform(df_labelled['Script'])
tf_feature_names = tf.get_feature_names_out()
tfidf_matrix = pd.DataFrame(tf_vectors.toarray(), columns=tf_feature_names)
tfidf_matrix

NameError: name 'custom_tokenizer' is not defined
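
For reference, the weighting that `TfidfVectorizer` applies under its documented default settings (`smooth_idf=True`, `norm='l2'`, raw term counts) can be reproduced by hand. The sketch below uses only the standard library and pre-tokenized toy documents; the function name `tfidf` and the example tokens are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalisation for a list of token lists."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * idf[t] for t, c in tf.items()}
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = [["thank", "you", "baltimore"], ["thank", "you", "thank", "you"]]
rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that occur in fewer documents, such as "baltimore" here, receive a larger weight than terms shared by every document.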
