# **Engineering Features from Text Script**

In this notebook, we use scraped script text from the TV series "Friends" to extract features such as the words spoken by each character, the main locations, and emotional archetypes.

## Using Regular Expressions to identify all active characters in an episode

In [None]:
!pip install pyspark


In [None]:
import spacy
import en_core_web_sm
import re
nlp = en_core_web_sm.load()

In [None]:
import pandas as pd
#reading the scraped data
df = pd.read_csv('ready_for_mining_script.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,text script,Title,Episode,Season
0,0,"[Scene: The Subway, Phoebe is singing for\ncha...",The Pilot-The Uncut Version,1.0,1.0
1,1,"[Scene Central Perk, everyone's there.]\nMonic...",The One With The Sonogram At the End,2.0,1.0
2,2,"[Scene: Chandler and Joey's, Chandler is helpi...",The One With The Thumb,3.0,1.0
3,3,"[Scene: Central Perk, Ross and Monica are watc...",The One With George Stephanopoulos,4.0,1.0
4,4,"[Scene: Central Perk, all six are there.]\nMon...",The One With The East German Laundry Detergant,5.0,1.0


In [None]:
#using regular expression and inherent script structure to get character names with spoken lines
def extract_character_names(text):
    doc = nlp(text)
    pattern = r'(\w+\s*):'
    matches = re.findall(pattern, text)
    filtered_matches = [match for match in matches if any(ent.text == match for ent in doc.ents)]
    unique_character_names = list(set(filtered_matches))
    return unique_character_names

In [None]:
#applying above function to dataset and tracking progress
from tqdm import tqdm
tqdm.pandas()
df['character_names'] = df['text script'].progress_apply(extract_character_names)

100%|██████████| 229/229 [04:14<00:00,  1.11s/it]


In [None]:
df.tail() #results of character identification

Unnamed: 0.1,Unnamed: 0,text script,Title,Episode,Season,character_names
224,224,"[Scene: Central Perk. Phoebe, Monica and Chand...",The One With Princess Consuela,14.0,10.0,"[Mark, Joey, Rachel, Clerk, Chandler, Campbell..."
225,225,[Scene: Central Perk. Phoebe's reading a newsp...,The One Where E...,15.0,10.0,"[Joey, Rachel, Realtor, Chandler, Ross, Monica..."
226,226,[Scene: Joey's place. Rachel and Joey \n ...,The One With Ra...,16.0,10.0,"[Rachel, Joey, Chandler, Erica, Ross, Monica, ..."
227,227,[Scene: Monica and Chandler's apartment. It's ...,The Last One,17.0,10.0,"[Rachel, Joey, 2, Chandler, Erica, Ross, Gunth..."
228,228,[Scene: In a TV commercial that the gang is wa...,The One After the Superbowl (2),13.0,2.0,"[JANITOR, ROB]"
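
The results above contain entries such as `2`, which are not character names. A likely reason (an assumption, not verified against the data) is that the pattern `(\w+\s*):` matches any word followed by a colon, and the spaCy filter accepts entities of any type, not only people. The sketch below uses only the standard library and a made-up sample script to show what the loose pattern picks up, next to a hypothetical line-anchored variant. The names `LOOSE_PATTERN`, `LINE_START_PATTERN` and `sample` are illustrative and do not appear in the notebook.

```python
import re

# Pattern used in the notebook: any word (plus optional whitespace) before a colon
LOOSE_PATTERN = re.compile(r'(\w+\s*):')

# Hypothetical stricter variant: the word must start the line
LINE_START_PATTERN = re.compile(r'^(\w+)\s*:', re.MULTILINE)

# Made-up sample in the same layout as the scraped scripts
sample = (
    "[Scene: Central Perk.]\n"
    "Monica: There is nothing to tell.\n"
    "Joey: Come on, tell us.\n"
    "(Time lapse: later.)\n"
    "Monica: Okay."
)

# The loose pattern also captures words inside stage directions
candidates = LOOSE_PATTERN.findall(sample)
print(candidates)

# The anchored pattern only sees words that open a line
speakers = sorted(set(LINE_START_PATTERN.findall(sample)))
print(speakers)
```

The anchored variant would miss multi-word speakers such as "Mr. Geller" and would still pick up any dialogue line that happens to begin with a single word and a colon, so it is a starting point rather than a replacement for the entity filter.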


## Identifying the total number of spoken words for each identified character

In [None]:
#each episode now has a non-empty character list. I am still not doing any text cleaning because episodes
#represent character lines in different ways; instead, I first get the words per character per episode.
def count_words(text):
    words_spoken = {}
    last_character = None
    lines = text.split('\n')
    for line in lines:
        # Remove text within square brackets and parentheses
        line = re.sub(r'\[.*?\]|\(.*?\)', '', line)
        if ':' in line:
            character, spoken_text = line.split(':', 1)
            character = character.strip()
            spoken_text = spoken_text.strip()
            if character not in words_spoken:
                words_spoken[character] = 0
            words = spoken_text.split()
            words_spoken[character] += len(words)
            last_character = character
        elif last_character is not None:
            # Append text to the last character's count
            spoken_text = line.strip()
            words = spoken_text.split()
            words_spoken[last_character] += len(words)
    return words_spoken

tqdm.pandas()
df['Spoken Word List'] = df['text script'].progress_apply(count_words)

100%|██████████| 229/229 [00:00<00:00, 452.45it/s]


In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,text script,Title,Episode,Season,character_names,Spoken Word List
0,0,"[Scene: The Subway, Phoebe is singing for\ncha...",The Pilot-The Uncut Version,1.0,1.0,"[Joey, Paul, Rachel, Frannie, Chandler, Ross, ...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo..."
1,1,"[Scene Central Perk, everyone's there.]\nMonic...",The One With The Sonogram At the End,2.0,1.0,"[Rachel, Joey, Barry, Scene, Geller, Chandler,...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra..."
2,2,"[Scene: Chandler and Joey's, Chandler is helpi...",The One With The Thumb,3.0,1.0,"[Joey, Rachel, Scene, Chandler, Lizzie, Ross, ...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ..."
3,3,"[Scene: Central Perk, Ross and Monica are watc...",The One With George Stephanopoulos,4.0,1.0,"[Joey, Rachel, Receptionist, Chandler, Ross, J...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P..."
4,4,"[Scene: Central Perk, all six are there.]\nMon...",The One With The East German Laundry Detergant,5.0,1.0,"[Angela, Joey, Rachel, Bob, Chandler, Ross, Mo...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C..."
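
The `'[Scene'` key in the output above appears because `count_words` removes bracketed text one line at a time, so a stage direction that wraps onto a second line is never matched and its first line is split on the colon like a spoken line. A minimal sketch of one way to avoid this, assuming it is acceptable to strip bracketed text before splitting into lines; the function name `count_words_by_speaker` and the sample text are hypothetical:

```python
import re
from collections import Counter

def count_words_by_speaker(text):
    """Count spoken words per speaker, ignoring stage directions."""
    # re.DOTALL lets '.' match newlines, so directions that wrap
    # across lines are removed as a whole
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text, flags=re.DOTALL)
    counts = Counter()
    speaker = None
    for line in text.split('\n'):
        line = line.strip()
        if not line:
            continue
        if ':' in line:
            name, spoken = line.split(':', 1)
            speaker = name.strip()
        else:
            # Continuation line: credit it to the last speaker
            spoken = line
        if speaker is not None:
            counts[speaker] += len(spoken.split())
    return dict(counts)

# Made-up sample with a stage direction that wraps onto a second line
sample = (
    "[Scene: The Subway, Phoebe is singing for\nchange.]\n"
    "Phoebe: Thank you very much.\n"
    "(She bows\ndeeply.)\n"
    "Ross: Hi.\n"
    "It is me."
)
word_counts = count_words_by_speaker(sample)
print(word_counts)
```

One caveat: with `re.DOTALL`, an unclosed bracket would swallow text up to the next closing bracket, so this variant assumes the brackets in the scripts are balanced.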


## Calculating the total number of spoken words in the script

In [None]:
def calculate_total_words(words_spoken):
    total = 0
    for character, word_count in words_spoken.items():
        total += word_count
    return total

tqdm.pandas()
df['Total Spoken Words'] = df['Spoken Word List'].progress_apply(calculate_total_words)

100%|██████████| 229/229 [00:00<00:00, 74439.71it/s]


In [None]:
df.head() #results

Unnamed: 0.1,Unnamed: 0,text script,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words
0,0,"[Scene: The Subway, Phoebe is singing for\ncha...",The Pilot-The Uncut Version,1.0,1.0,"[Joey, Paul, Rachel, Frannie, Chandler, Ross, ...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290
1,1,"[Scene Central Perk, everyone's there.]\nMonic...",The One With The Sonogram At the End,2.0,1.0,"[Rachel, Joey, Barry, Scene, Geller, Chandler,...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520
2,2,"[Scene: Chandler and Joey's, Chandler is helpi...",The One With The Thumb,3.0,1.0,"[Joey, Rachel, Scene, Chandler, Lizzie, Ross, ...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336
3,3,"[Scene: Central Perk, Ross and Monica are watc...",The One With George Stephanopoulos,4.0,1.0,"[Joey, Rachel, Receptionist, Chandler, Ross, J...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721
4,4,"[Scene: Central Perk, all six are there.]\nMon...",The One With The East German Laundry Detergant,5.0,1.0,"[Angela, Joey, Rachel, Bob, Chandler, Ross, Mo...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765


## Named Entity Recognition for location identification

Here we use a pretrained model from Stanford's NER library to identify the main locations in the scene-change text of the scripts.

In [78]:
import nltk
nltk.download('punkt')
from nltk.tag.stanford import StanfordNERTagger
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [79]:
def extract_locations_organizations(text):
    PATH_TO_JAR = 'stanford-ner.jar'
    PATH_TO_MODEL = 'english.all.3class.distsim.crf.ser.gz'
    tagger = StanfordNERTagger(model_filename=PATH_TO_MODEL, path_to_jar=PATH_TO_JAR, encoding='utf-8')
    bracket_contents = re.findall(r'\[(.*?)\]', text)
    locations = []
    organizations = []

    for bracket_text in bracket_contents:
        words = word_tokenize(bracket_text)
        ner_tags = tagger.tag(words)

        current_location = []
        current_organization = []

        for word, tag in ner_tags:
            if tag == 'LOCATION':
                current_location.append(word)
            elif tag == 'ORGANIZATION':
                current_organization.append(word)

        if current_location:
            locations.append(" ".join(current_location))
        if current_organization:
            organizations.append(" ".join(current_organization))

    return locations + organizations

In [None]:
tqdm.pandas()
df['List of Locations'] = df['text script'].progress_apply(extract_locations_organizations)

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,text script,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words,List of Locations
0,0,"[Scene: The Subway, Phoebe is singing for\ncha...",The Pilot-The Uncut Version,1.0,1.0,"[Joey, Paul, Rachel, Frannie, Chandler, Ross, ...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290,[Iridium]
1,1,"[Scene Central Perk, everyone's there.]\nMonic...",The One With The Sonogram At the End,2.0,1.0,"[Rachel, Joey, Barry, Scene, Geller, Chandler,...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520,[Central Park]
2,2,"[Scene: Chandler and Joey's, Chandler is helpi...",The One With The Thumb,3.0,1.0,"[Joey, Rachel, Scene, Chandler, Lizzie, Ross, ...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336,"[Central Perk, Iridium, Iridium, Cental Perk]"
3,3,"[Scene: Central Perk, Ross and Monica are watc...",The One With George Stephanopoulos,4.0,1.0,"[Joey, Rachel, Receptionist, Chandler, Ross, J...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721,[]
4,4,"[Scene: Central Perk, all six are there.]\nMon...",The One With The East German Laundry Detergant,5.0,1.0,"[Angela, Joey, Rachel, Bob, Chandler, Ross, Mo...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765,[]


In [None]:
df.to_csv('the_one_with_the_NER.csv') #saving results to resume work

As the NER results show, many rows were left empty because locations such as rooms, buildings, and restaurants could not be identified using NER. Further research revealed that, since the show was filmed live, only a handful of set locations were ever actually used. Regular expressions and fuzzy string matching were therefore used to iterate over the scene-change text in the scripts. The procedure is illustrated in the following section.

## Fuzzy string matching for location recognition

In [6]:
import pandas as pd

In [3]:
df = pd.read_csv("the_one_with_the_NER.csv")
df = df.drop(columns = ['Unnamed: 0.1','Unnamed: 0'])
df.head()

Unnamed: 0,text script,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words,List of Locations
0,"[Scene: The Subway, Phoebe is singing for\ncha...",The Pilot-The Uncut Version,1.0,1.0,"['Joey', 'Paul', 'Rachel', 'Frannie', 'Chandle...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290,['Iridium']
1,"[Scene Central Perk, everyone's there.]\nMonic...",The One With The Sonogram At the End,2.0,1.0,"['Rachel', 'Joey', 'Barry', 'Scene', 'Geller',...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520,['Central Park']
2,"[Scene: Chandler and Joey's, Chandler is helpi...",The One With The Thumb,3.0,1.0,"['Joey', 'Rachel', 'Scene', 'Chandler', 'Lizzi...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336,"['Central Perk', 'Iridium', 'Iridium', 'Cental..."
3,"[Scene: Central Perk, Ross and Monica are watc...",The One With George Stephanopoulos,4.0,1.0,"['Joey', 'Rachel', 'Receptionist', 'Chandler',...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721,[]
4,"[Scene: Central Perk, all six are there.]\nMon...",The One With The East German Laundry Detergant,5.0,1.0,"['Angela', 'Joey', 'Rachel', 'Bob', 'Chandler'...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765,[]


In [9]:
import ast
# Convert string representations of lists to actual lists using ast.literal_eval
df['List of Locations'] = df['List of Locations'].apply(ast.literal_eval)

# Count rows where the value of 'List of Locations' is an empty list
count_empty_lists = df[df['List of Locations'].apply(lambda x: len(x) == 0)].shape[0]

print(f"Number of rows with empty lists in 'List of Locations': {count_empty_lists}")

Number of rows with empty lists in 'List of Locations': 110


In [10]:
(count_empty_lists/df[['List of Locations']].shape[0])*100 #percentage of null values of NER locations

48.03493449781659

Calculating the total number of scene changes/ narrative shifts

In [11]:
import re
# Function to extract the text of every "[Scene ...]" stage direction
def extract_scene_changes(text):
    return re.findall(r'\[Scene[\s:,.]?([^\]]+)\]', text)


# Apply the function to create the 'scene changes' column
df['scene changes'] = df['text script'].apply(extract_scene_changes)

# Create the 'number of scene changes' column
df['number of scene changes'] = df['scene changes'].apply(len)
df.head()

Unnamed: 0,text script,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words,List of Locations,scene changes,number of scene changes
0,"[Scene: The Subway, Phoebe is singing for\ncha...",The Pilot-The Uncut Version,1.0,1.0,"['Joey', 'Paul', 'Rachel', 'Frannie', 'Chandle...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290,[Iridium],"[ The Subway, Phoebe is singing for\nchange., ...",13
1,"[Scene Central Perk, everyone's there.]\nMonic...",The One With The Sonogram At the End,2.0,1.0,"['Rachel', 'Joey', 'Barry', 'Scene', 'Geller',...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520,[Central Park],"[Central Perk, everyone's there., Museum of P...",11
2,"[Scene: Chandler and Joey's, Chandler is helpi...",The One With The Thumb,3.0,1.0,"['Joey', 'Rachel', 'Scene', 'Chandler', 'Lizzi...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336,"[Central Perk, Iridium, Iridium, Cental Perk]","[ Chandler and Joey's, Chandler is helping Joe...",14
3,"[Scene: Central Perk, Ross and Monica are watc...",The One With George Stephanopoulos,4.0,1.0,"['Joey', 'Rachel', 'Receptionist', 'Chandler',...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721,[],"[ Central Perk, Ross and Monica are watching P...",15
4,"[Scene: Central Perk, all six are there.]\nMon...",The One With The East German Laundry Detergant,5.0,1.0,"['Angela', 'Joey', 'Rachel', 'Bob', 'Chandler'...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765,[],"[ Central Perk, all six are there., Central P...",16
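
To make the behaviour of the scene pattern concrete, the sketch below runs the same regular expression on a made-up two-scene sample. It shows why some extracted strings in the table above start with a space: the optional class `[\s:,.]?` consumes only one character after `Scene`, so when a colon is present the following space stays in the captured group. The names `SCENE_PATTERN` and `sample` are illustrative.

```python
import re

# Same pattern as extract_scene_changes in the notebook
SCENE_PATTERN = re.compile(r'\[Scene[\s:,.]?([^\]]+)\]')

# Made-up sample: one scene header with a colon, one without
sample = (
    "[Scene: Central Perk, all six are there.]\n"
    "Monica: Hi.\n"
    "[Scene Monica's, later.]\n"
    "(Ross enters.)"
)

scenes = SCENE_PATTERN.findall(sample)
print(scenes)

# Stripping whitespace gives uniform strings for later matching
stripped_scenes = [scene.strip() for scene in scenes]
print(stripped_scenes)
```

Because `[^\]]+` also matches newlines, scene descriptions that wrap across lines are captured whole, and the parenthesised direction is ignored since the pattern requires the literal `[Scene` prefix.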


In [12]:
df['number of scene changes'].isnull().sum() #fidelity check

0

In [13]:
for index, row in df.tail().iterrows():
    print(f"Row {index + 1}: {row['scene changes']}") #fidelity check

Row 225: [' Central Perk. Phoebe, Monica and Chandler on \n                          their couch.', ' A restaurant. Rachel enters.', " A counter at a government building. Phoebe's \n                          waiting in line.", ' Central Perk. Chandler and Monica are there \n                          when Phoebe enters.', " Chandler and Monica's future house. They enter \n                          the living room with the realtor and Joey.", ' Phoebe is at Central Perk. Mike enters.', ' Joey is in Monica and Chandler\'s future house, \n                          sitting in a child\'s bedroom, looking at a quiz card \n                          which has "5+10=" printed on one side.', ' Outside Ralph Lauren building. Rachel just \n                          walked out carrying a box of her stuff, and a strange \n                          man approaches her.', " Chandler and Monica's new house. Sitting near \n                          the window, they look at the neighborhood.", ' Central Pe

In [14]:
# Function to extract the first location from each scene
def extract_first_location(scene_changes):
    extracted_parts = []
    for scenes in scene_changes:
        current_part = ''
        for char in scenes:
            # The location is the text before the first '.' or ',';
            # scenes containing neither delimiter are skipped
            if char in ('.', ','):
                cleaned_string = re.sub(r'[\n,]', '', current_part.strip())
                extracted_parts.append(cleaned_string)
                break
            else:
                current_part += char
    return extracted_parts

# Apply the function to create the 'locations' column
df['locations'] = df['scene changes'].apply(extract_first_location)

# Display the updated DataFrame
df.head()

Unnamed: 0,text script,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words,List of Locations,scene changes,number of scene changes,locations
0,"[Scene: The Subway, Phoebe is singing for\ncha...",The Pilot-The Uncut Version,1.0,1.0,"['Joey', 'Paul', 'Rachel', 'Frannie', 'Chandle...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290,[Iridium],"[ The Subway, Phoebe is singing for\nchange., ...",13,"[The Subway, Ross's Apartment, A Restaurant, M..."
1,"[Scene Central Perk, everyone's there.]\nMonic...",The One With The Sonogram At the End,2.0,1.0,"['Rachel', 'Joey', 'Barry', 'Scene', 'Geller',...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520,[Central Park],"[Central Perk, everyone's there., Museum of P...",11,"[Central Perk, Museum of Prehistoric History, ..."
2,"[Scene: Chandler and Joey's, Chandler is helpi...",The One With The Thumb,3.0,1.0,"['Joey', 'Rachel', 'Scene', 'Chandler', 'Lizzi...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336,"[Central Perk, Iridium, Iridium, Cental Perk]","[ Chandler and Joey's, Chandler is helping Joe...",14,"[Chandler and Joey's, Central Perk, Iridium, M..."
3,"[Scene: Central Perk, Ross and Monica are watc...",The One With George Stephanopoulos,4.0,1.0,"['Joey', 'Rachel', 'Receptionist', 'Chandler',...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721,[],"[ Central Perk, Ross and Monica are watching P...",15,"[Central Perk, A Street, Central Perk, Monica ..."
4,"[Scene: Central Perk, all six are there.]\nMon...",The One With The East German Laundry Detergant,5.0,1.0,"['Angela', 'Joey', 'Rachel', 'Bob', 'Chandler'...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765,[],"[ Central Perk, all six are there., Central P...",16,"[Central Perk, Central Perk, Monica and Rachel..."
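
The character-by-character loop can also be expressed with `re.split`. The following is a sketch rather than a drop-in replacement: unlike `extract_first_location`, it keeps scene descriptions that contain neither a full stop nor a comma, and it collapses internal whitespace. The helper name `first_location` is hypothetical.

```python
import re

def first_location(scene):
    """Return the text before the first '.' or ',' with whitespace collapsed."""
    head = re.split(r'[.,]', scene, maxsplit=1)[0]
    # split()/join() removes newlines and runs of spaces left by the scraper
    return ' '.join(head.split())

# Made-up inputs in the style of the 'scene changes' column
examples = [
    " Central Perk, all six are there.",
    " Chandler and Monica's future house. They enter \n      the living room.",
    "The hallway",
]
first_locations = [first_location(scene) for scene in examples]
print(first_locations)
```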


In [15]:
!pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [16]:
from fuzzywuzzy import fuzz

# Recurring set locations to match against, in the order they are joined below
KNOWN_LOCATIONS = [
    "Central Perk",
    "Monica's Apartment",
    "Ross's Apartment",
    "Chandler's Apartment",
    "Ralph Lauren",
    "Bloomingdales",
    "Phoebe's Apartment",
]

def is_approximate_match(item, target, threshold=90):
    similarity_score = fuzz.token_set_ratio(item, target)
    return similarity_score >= threshold

def extract_nouns(scene_changes):
    final_ans = []
    for scenes in scene_changes:
        # Collect every known location that the scene text approximately matches
        nouns = [loc for loc in KNOWN_LOCATIONS if is_approximate_match(scenes, loc)]
        if nouns:
            final_ans.append(" ".join(nouns))
    return final_ans

# Map each entry of the 'locations' column to the known set locations
df['locs'] = df['locations'].apply(extract_nouns)
df.head()



Unnamed: 0,text script,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words,List of Locations,scene changes,number of scene changes,locations,locs
0,"[Scene: The Subway, Phoebe is singing for\ncha...",The Pilot-The Uncut Version,1.0,1.0,"['Joey', 'Paul', 'Rachel', 'Frannie', 'Chandle...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290,[Iridium],"[ The Subway, Phoebe is singing for\nchange., ...",13,"[The Subway, Ross's Apartment, A Restaurant, M...","[Ross's Apartment, Monica's Apartment, Ross's ..."
1,"[Scene Central Perk, everyone's there.]\nMonic...",The One With The Sonogram At the End,2.0,1.0,"['Rachel', 'Joey', 'Barry', 'Scene', 'Geller',...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520,[Central Park],"[Central Perk, everyone's there., Museum of P...",11,"[Central Perk, Museum of Prehistoric History, ...","[Central Perk, Central Perk, Monica's Apartment]"
2,"[Scene: Chandler and Joey's, Chandler is helpi...",The One With The Thumb,3.0,1.0,"['Joey', 'Rachel', 'Scene', 'Chandler', 'Lizzi...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336,"[Central Perk, Iridium, Iridium, Cental Perk]","[ Chandler and Joey's, Chandler is helping Joe...",14,"[Chandler and Joey's, Central Perk, Iridium, M...","[Central Perk, Central Perk, Central Perk, Cen..."
3,"[Scene: Central Perk, Ross and Monica are watc...",The One With George Stephanopoulos,4.0,1.0,"['Joey', 'Rachel', 'Receptionist', 'Chandler',...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721,[],"[ Central Perk, Ross and Monica are watching P...",15,"[Central Perk, A Street, Central Perk, Monica ...","[Central Perk, Central Perk]"
4,"[Scene: Central Perk, all six are there.]\nMon...",The One With The East German Laundry Detergant,5.0,1.0,"['Angela', 'Joey', 'Rachel', 'Bob', 'Chandler'...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765,[],"[ Central Perk, all six are there., Central P...",16,"[Central Perk, Central Perk, Monica and Rachel...","[Central Perk, Central Perk, Monica's Apartmen..."
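
The idea of thresholded approximate matching can be tried without installing anything. The sketch below swaps in `difflib.SequenceMatcher` from the standard library in place of `fuzz.token_set_ratio`; the two scores are not equivalent (the standard-library score is purely character-based and does not compare token sets), so the numbers will differ from the notebook's. `KNOWN_SETS`, `similarity` and `match_known_set` are hypothetical names, and the list is a subset chosen for illustration.

```python
from difflib import SequenceMatcher

# Illustrative subset of the recurring set locations
KNOWN_SETS = ["Central Perk", "Monica's Apartment", "Ralph Lauren"]

def similarity(a, b):
    """Character-level similarity on a 0-100 scale (stand-in for a fuzzy score)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_known_set(location, threshold=90):
    """Return the best-scoring known set at or above the threshold, else None."""
    best = max(KNOWN_SETS, key=lambda known: similarity(location, known))
    return best if similarity(location, best) >= threshold else None

# A misspelling seen in the scripts still resolves to the canonical name
print(match_known_set("Cental Perk"))
# A location outside the known sets is left unmatched
print(match_known_set("The Subway"))
```

Returning only the best match, instead of every match above the threshold as `extract_nouns` does, is a design choice of this sketch; it avoids joined labels when one string is close to two known sets.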


In [17]:
# Count rows where the value of 'locs' is an empty list
count_empty_lists = df[df['locs'].apply(lambda x: len(x) == 0)].shape[0]

print(f"Number of rows with empty lists in 'locs': {count_empty_lists}")

Number of rows with empty lists in 'locs': 25


The rows with no matched location may indicate episodes with an out-of-the-ordinary format, which might in turn affect the ratings, e.g. Ross and Emily's wedding in London.

In [18]:
#saving relevant information as a separate dataframe
df_final = df[['Title','Episode','Season','character_names','Spoken Word List','Total Spoken Words','number of scene changes','locs' ]]
df_final.head()

Unnamed: 0,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words,number of scene changes,locs
0,The Pilot-The Uncut Version,1.0,1.0,"['Joey', 'Paul', 'Rachel', 'Frannie', 'Chandle...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290,13,"[Ross's Apartment, Monica's Apartment, Ross's ..."
1,The One With The Sonogram At the End,2.0,1.0,"['Rachel', 'Joey', 'Barry', 'Scene', 'Geller',...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520,11,"[Central Perk, Central Perk, Monica's Apartment]"
2,The One With The Thumb,3.0,1.0,"['Joey', 'Rachel', 'Scene', 'Chandler', 'Lizzi...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336,14,"[Central Perk, Central Perk, Central Perk, Cen..."
3,The One With George Stephanopoulos,4.0,1.0,"['Joey', 'Rachel', 'Receptionist', 'Chandler',...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721,15,"[Central Perk, Central Perk]"
4,The One With The East German Laundry Detergant,5.0,1.0,"['Angela', 'Joey', 'Rachel', 'Bob', 'Chandler'...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765,16,"[Central Perk, Central Perk, Monica's Apartmen..."


In [19]:
df_final.to_csv('Interim_chkpoint.csv') #saving and checkpointing to return later

## Detecting sarcasm using pretrained huggingface model

Testing methodology with a random row number first

In [46]:
import re
import string
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def cleaned_text(text):
    lines = text.split('\n')
    cleaned_spoken_texts = []
    for line in lines:
        # Remove text within square brackets and parentheses
        line = re.sub(r'\[.*?\]|\(.*?\)', '', line)
        if ':' in line:
            character, spoken_text = line.split(':', 1)
            spoken_text = preprocess_data(spoken_text.strip())
            cleaned_spoken_texts.append(spoken_text)
    return ' '.join(cleaned_spoken_texts)

def preprocess_data(text: str) -> str:
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def create_word_windows(text, window_size):
    words = text.split()
    windows = [words[i:i+window_size] for i in range(0, len(words), window_size)]
    return [' '.join(window) for window in windows]

MODEL_PATH = "helinivan/english-sarcasm-detector"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)

cleaned_spoken_texts = cleaned_text(df['text script'].loc[50])
# Window size is a tenth of the episode's total spoken words
win_size = int(df['Total Spoken Words'].loc[50]) // 10
cleaned_spoken_texts = create_word_windows(cleaned_spoken_texts, win_size)
# Classify each word window separately and accumulate the totals
total_sarcastic_count = 0
total_sarcastic_confidence = 0.0

for spoken_text in cleaned_spoken_texts:
    tokenized_text = tokenizer([spoken_text], padding=True, truncation=True, max_length=256, return_tensors="pt")
    output = model(**tokenized_text)
    probs = output.logits.softmax(dim=-1).tolist()[0]
    confidence = max(probs)
    prediction = probs.index(confidence)
    results = {"is_sarcastic": prediction, "confidence": confidence}

    # Update totals if the instance is sarcastic
    if results["is_sarcastic"] == 1:
        total_sarcastic_count += 1
        total_sarcastic_confidence += confidence

    # Print the per-window result
    print(results)
if total_sarcastic_count > 0:
    average_sarcastic_confidence = total_sarcastic_confidence / total_sarcastic_count
    # Normalise by the number of windows so episodes of different lengths are comparable
    normalized_sarcastic_count = total_sarcastic_count / len(cleaned_spoken_texts)
    print(f"Total Sarcastic Count: {normalized_sarcastic_count}")
    print(f"Average Sarcastic Confidence: {average_sarcastic_confidence}")
else:
    print("No sarcastic instances found.")


{'is_sarcastic': 0, 'confidence': 0.8568001389503479}
{'is_sarcastic': 1, 'confidence': 0.73240065574646}
{'is_sarcastic': 0, 'confidence': 0.7222325801849365}
{'is_sarcastic': 1, 'confidence': 0.7502384185791016}
{'is_sarcastic': 1, 'confidence': 0.8202803134918213}
{'is_sarcastic': 1, 'confidence': 0.541737973690033}
{'is_sarcastic': 1, 'confidence': 0.7291350364685059}
Total Sarcastic Count: 0.7142857142857143
Average Sarcastic Confidence: 0.7147584795951843
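
As a worked check of the numbers above: five of the seven windows are labelled sarcastic, so the reported "Total Sarcastic Count" of 0.714 is the share 5/7, and the average confidence is taken over those five windows only. The sketch below recomputes both values from the recorded per-window results; `summarise_windows` is a hypothetical helper, not a function from the notebook.

```python
# Per-window predictions copied from the recorded output (1 = sarcastic)
window_results = [
    {"is_sarcastic": 0, "confidence": 0.8568001389503479},
    {"is_sarcastic": 1, "confidence": 0.73240065574646},
    {"is_sarcastic": 0, "confidence": 0.7222325801849365},
    {"is_sarcastic": 1, "confidence": 0.7502384185791016},
    {"is_sarcastic": 1, "confidence": 0.8202803134918213},
    {"is_sarcastic": 1, "confidence": 0.541737973690033},
    {"is_sarcastic": 1, "confidence": 0.7291350364685059},
]

def summarise_windows(results):
    """Return (share of sarcastic windows, mean confidence of sarcastic windows)."""
    sarcastic = [r["confidence"] for r in results if r["is_sarcastic"] == 1]
    if not results or not sarcastic:
        return 0.0, 0.0
    return len(sarcastic) / len(results), sum(sarcastic) / len(sarcastic)

sarcasm_index, mean_confidence = summarise_windows(window_results)
print(sarcasm_index)    # share of windows labelled sarcastic (5 of 7)
print(mean_confidence)  # mean confidence over the sarcastic windows only
```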


In [47]:
!pip install tqdm #to track progress



Running the sarcasm index detection on the entire dataset

In [55]:
import re
import string
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm  # Import tqdm

def cleaned_text(text):
    lines = text.split('\n')
    cleaned_spoken_texts = []
    for line in lines:
        # Remove text within square brackets and parentheses
        line = re.sub(r'\[.*?\]|\(.*?\)', '', line)
        if ':' in line:
            character, spoken_text = line.split(':', 1)
            spoken_text = preprocess_data(spoken_text.strip())
            cleaned_spoken_texts.append(spoken_text)
    return ' '.join(cleaned_spoken_texts)

def preprocess_data(text: str) -> str:
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def create_word_windows(text, window_size):
    words = text.split()
    windows = [words[i:i+window_size] for i in range(0, len(words), window_size)]
    return [' '.join(window) for window in windows]

def calculate_sarcasm_index(spoken_texts, tokenizer, model, progress_bar=True):
    total_sarcastic_count = 0
    total_sarcastic_confidence = 0.0

    # Use tqdm to display a progress bar
    iterator = tqdm(spoken_texts, desc="Processing", disable=not progress_bar, position=0)

    for spoken_text in iterator:
        tokenized_text = tokenizer([spoken_text], padding=True, truncation=True, max_length=256, return_tensors="pt")
        output = model(**tokenized_text)
        probs = output.logits.softmax(dim=-1).tolist()[0]
        confidence = max(probs)
        prediction = probs.index(confidence)
        results = {"is_sarcastic": prediction, "confidence": confidence}

        # Update totals if the instance is sarcastic
        if results["is_sarcastic"] == 1:
            total_sarcastic_count += 1
            total_sarcastic_confidence += confidence

    # Calculate the normalized sarcasm index
    if total_sarcastic_count > 0:
        normalized_total_sarcastic_count = total_sarcastic_count / len(spoken_texts)
        return normalized_total_sarcastic_count
    else:
        return 0.0

# Assuming df is your DataFrame
df['sarcasm_index'] = df.apply(lambda row: (cleaned_text(row['text script']), int(row['Total Spoken Words'] / 10)), axis=1)
df['sarcasm_index'] = df['sarcasm_index'].apply(lambda x: create_word_windows(x[0], x[1])).apply(lambda x: calculate_sarcasm_index(x, tokenizer, model))

df.head()



Unnamed: 0,text script,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words,List of Locations,scene changes,number of scene changes,locations,locs,sarcasm_index
0,"[Scene: The Subway, Phoebe is singing for\ncha...",The Pilot-The Uncut Version,1.0,1.0,"['Joey', 'Paul', 'Rachel', 'Frannie', 'Chandle...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290,[Iridium],"[ The Subway, Phoebe is singing for\nchange., ...",13,"[The Subway, Ross's Apartment, A Restaurant, M...","[Ross's Apartment, Monica's Apartment, Ross's ...",0.666667
1,"[Scene Central Perk, everyone's there.]\nMonic...",The One With The Sonogram At the End,2.0,1.0,"['Rachel', 'Joey', 'Barry', 'Scene', 'Geller',...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520,[Central Park],"[Central Perk, everyone's there., Museum of P...",11,"[Central Perk, Museum of Prehistoric History, ...","[Central Perk, Central Perk, Monica's Apartment]",0.714286
2,"[Scene: Chandler and Joey's, Chandler is helpi...",The One With The Thumb,3.0,1.0,"['Joey', 'Rachel', 'Scene', 'Chandler', 'Lizzi...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336,"[Central Perk, Iridium, Iridium, Cental Perk]","[ Chandler and Joey's, Chandler is helping Joe...",14,"[Chandler and Joey's, Central Perk, Iridium, M...","[Central Perk, Central Perk, Central Perk, Cen...",0.625
3,"[Scene: Central Perk, Ross and Monica are watc...",The One With George Stephanopoulos,4.0,1.0,"['Joey', 'Rachel', 'Receptionist', 'Chandler',...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721,[],"[ Central Perk, Ross and Monica are watching P...",15,"[Central Perk, A Street, Central Perk, Monica ...","[Central Perk, Central Perk]",0.333333
4,"[Scene: Central Perk, all six are there.]\nMon...",The One With The East German Laundry Detergant,5.0,1.0,"['Angela', 'Joey', 'Rachel', 'Bob', 'Chandler'...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765,[],"[ Central Perk, all six are there., Central P...",16,"[Central Perk, Central Perk, Monica and Rachel...","[Central Perk, Central Perk, Monica's Apartmen...",0.625


In [56]:
#saving results
df_final['Sarcasm Index'] = df['sarcasm_index']
df_final.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_final['Sarcasm Index'] = df['sarcasm_index']


Unnamed: 0,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words,number of scene changes,locs,Sarcasm Index
0,The Pilot-The Uncut Version,1.0,1.0,"['Joey', 'Paul', 'Rachel', 'Frannie', 'Chandle...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290,13,"[Ross's Apartment, Monica's Apartment, Ross's ...",0.666667
1,The One With The Sonogram At the End,2.0,1.0,"['Rachel', 'Joey', 'Barry', 'Scene', 'Geller',...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520,11,"[Central Perk, Central Perk, Monica's Apartment]",0.714286
2,The One With The Thumb,3.0,1.0,"['Joey', 'Rachel', 'Scene', 'Chandler', 'Lizzi...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336,14,"[Central Perk, Central Perk, Central Perk, Cen...",0.625
3,The One With George Stephanopoulos,4.0,1.0,"['Joey', 'Rachel', 'Receptionist', 'Chandler',...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721,15,"[Central Perk, Central Perk]",0.333333
4,The One With The East German Laundry Detergant,5.0,1.0,"['Angela', 'Joey', 'Rachel', 'Bob', 'Chandler'...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765,16,"[Central Perk, Central Perk, Monica's Apartmen...",0.625


In [57]:
df_final.to_csv('CP_2.csv') #saving and checkpoint

## Detecting Emotional Archetypes

Taking inspiration from the original rejected thesis by acclaimed American writer Kurt Vonnegut and the recent work of a group of researchers from the University of Vermont, we worked under the assumption that a story follows one of six archetypes, and added a seventh case, "Uncharted", for scripts in which no such pattern could be recognised. The six archetypes are "Rags to Riches" (rise), "Riches to Rags" (fall), "Man in a Hole" (fall then rise), "Icarus" (rise then fall), "Cinderella" (rise then fall then rise) and "Oedipus" (fall then rise then fall). We first split the spoken text of each script into consecutive, non-overlapping windows of 200 words, classify the emotion of each window, and then apply a rule-based mapping to the resulting sequence of emotions to obtain the archetype. Most of our scripts (169 of 229) were "Uncharted".
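The windowing step can be sketched as follows. This is a minimal illustration, not the notebook's own helper: `word_windows` and its `step` parameter are hypothetical names. With `step` equal to the window size the windows do not overlap; a smaller `step` gives overlapping (sliding) windows.

```python
def word_windows(text, window_size, step=None):
    """Split text into windows of `window_size` words, advancing `step` words each time."""
    if step is None:
        step = window_size  # assumption: default to non-overlapping chunks
    words = text.split()
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # this window already reaches the end of the text
    return windows


sample = "a b c d e f g"
chunks = word_windows(sample, 3)      # non-overlapping windows
sliding = word_windows(sample, 3, 2)  # overlapping windows, stride of 2 words
print(chunks)
print(sliding)
```

Overlapping windows give more, smoother data points per script at the cost of more classifier calls, so the choice of `step` is a trade-off rather than a correctness question.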

In [12]:
#function to map scripts to archetype given a list of detected sentiments using the sliding window approach
def categorize_story(sentiment_list):
    positive_sentiments = {'joy', 'surprise', 'neutral'}
    negative_sentiments = {'anger', 'fear', 'disgust', 'sadness'}

    positive_count = sum(1 for sentiment in sentiment_list if sentiment in positive_sentiments)
    negative_count = sum(1 for sentiment in sentiment_list if sentiment in negative_sentiments)

    total_sentiments = positive_count + negative_count

    # Rule 1: All Positive Sentiments
    # (an empty sentiment_list also satisfies this check, because 0 == 0)
    if positive_count == total_sentiments:
        return "Rags to Riches"

    # Rule 2: All Negative Sentiments
    elif negative_count == total_sentiments:
        return "Riches to Rags"

    midpoint = len(sentiment_list) // 2
    part1 = sentiment_list[:midpoint]
    part2 = sentiment_list[midpoint:]

    part1_pc = sum(1 for sentiment in part1 if sentiment in positive_sentiments)
    part1_nc = sum(1 for sentiment in part1 if sentiment in negative_sentiments)
    part2_pc = sum(1 for sentiment in part2 if sentiment in positive_sentiments)
    part2_nc = sum(1 for sentiment in part2 if sentiment in negative_sentiments)

    # Rule 3: First Half Mostly Negative, Second Half Mostly Positive (Icarus)
    if part1_nc > part1_pc and part2_pc > part2_nc:
        return "Icarus"

    # Rule 4: First Half Mostly Positive, Second Half Mostly Negative (Man in a Hole)
    elif part1_pc > part1_nc and part2_nc > part2_pc:
        return "Man in a Hole"

    midpoint1 = len(sentiment_list) // 3
    midpoint2 = 2 * midpoint1
    part1 = sentiment_list[:midpoint1]
    part2 = sentiment_list[midpoint1:midpoint2]
    part3 = sentiment_list[midpoint2:]

    part1_pc = sum(1 for sentiment in part1 if sentiment in positive_sentiments)
    part1_nc = sum(1 for sentiment in part1 if sentiment in negative_sentiments)
    part2_pc = sum(1 for sentiment in part2 if sentiment in positive_sentiments)
    part2_nc = sum(1 for sentiment in part2 if sentiment in negative_sentiments)
    part3_pc = sum(1 for sentiment in part3 if sentiment in positive_sentiments)
    part3_nc = sum(1 for sentiment in part3 if sentiment in negative_sentiments)

    # Rule 5: Three-Part Pattern (Cinderella)
    if part1_nc > part1_pc and part2_pc > part2_nc and part3_pc > part3_nc:
        return "Cinderella"

    # Rule 6: Three-Part Pattern (Oedipus)
    elif part1_pc > part1_nc and part2_nc > part2_pc and part3_nc > part3_pc:
        return "Oedipus"

    # Default: Uncategorized
    else:
        return "Uncharted"
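The rise/fall shapes listed in the text can also be encoded directly as sequences of polarity levels. The sketch below is one hypothetical way to do that and is not the function used to produce the results in this notebook: `arc_shape`, `SHAPES`, the choice of four segments, and the per-segment majority vote are all assumptions made for illustration. It reuses the same positive/negative grouping of emotion labels as `categorize_story`.

```python
POSITIVE = {"joy", "surprise", "neutral"}
NEGATIVE = {"anger", "fear", "disgust", "sadness"}

# Sequence of polarity levels (+1 mostly positive, -1 mostly negative) -> archetype.
# A rise is a move from -1 to +1, a fall is a move from +1 to -1.
SHAPES = {
    (-1, 1): "Rags to Riches",      # rise
    (1, -1): "Riches to Rags",      # fall
    (1, -1, 1): "Man in a Hole",    # fall then rise
    (-1, 1, -1): "Icarus",          # rise then fall
    (-1, 1, -1, 1): "Cinderella",   # rise, fall, rise
    (1, -1, 1, -1): "Oedipus",      # fall, rise, fall
}


def segment_polarity(labels):
    """Return +1, -1 or 0 (tie) by majority vote over one segment."""
    pos = sum(label in POSITIVE for label in labels)
    neg = sum(label in NEGATIVE for label in labels)
    return (pos > neg) - (neg > pos)


def arc_shape(labels, n_segments=4):
    """Map a list of emotion labels to an archetype name (hypothetical helper)."""
    n = len(labels)
    if n < n_segments:
        return "Uncharted"  # too few windows to form every segment
    levels = []
    for i in range(n_segments):
        segment = labels[i * n // n_segments:(i + 1) * n // n_segments]
        polarity = segment_polarity(segment)
        if polarity == 0:
            return "Uncharted"  # tied segment: no clear level
        if not levels or levels[-1] != polarity:
            levels.append(polarity)  # collapse runs of equal polarity
    # A flat sequence such as (1,) has no rise or fall and stays "Uncharted"
    return SHAPES.get(tuple(levels), "Uncharted")


print(arc_shape(["sadness"] * 4 + ["joy"] * 4))
print(arc_shape(["joy", "joy", "fear", "fear", "fear", "fear", "joy", "joy"]))
```

Four segments is the smallest number that can represent the three turning points of "Cinderella" and "Oedipus"; a script with fewer than four windows is left "Uncharted" under this scheme.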

In [13]:
import re
import string
import pandas as pd
from tqdm import tqdm
from transformers import pipeline

#pretrained emotion classifier from the Hugging Face hub
classifier = pipeline("sentiment-analysis", model="michellejieli/emotion_text_classifier")

#function to clean the text: keeps only the spoken part of each "Character: line"
def cleaned_text(text):
    lines = text.split('\n')
    cleaned_spoken_texts = []
    for line in lines:
        # Remove text within square brackets and parentheses
        line = re.sub(r'\[.*?\]|\(.*?\)', '', line)
        if ':' in line:
            # Drop the speaker name before the first colon
            _, spoken_text = line.split(':', 1)
            spoken_text = preprocess_data(spoken_text.strip())
            cleaned_spoken_texts.append(spoken_text)
    return ' '.join(cleaned_spoken_texts)

#lowercase the text and remove all punctuation
def preprocess_data(text: str) -> str:
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

#split the text into consecutive, non-overlapping windows of window_size words
def create_word_windows(text, window_size):
    words = text.split()
    windows = [words[i:i+window_size] for i in range(0, len(words), window_size)]
    return [' '.join(window) for window in windows]

#Calculating sentiment from pretrained huggingface model
def calculate_senti_list(spoken_texts, classifier, progress_bar=True):
    # Use tqdm to display a progress bar
    iterator = tqdm(spoken_texts, desc="Processing", disable=not progress_bar, position=0)

    # Predicted emotion label for each window, in script order
    emotions = []
    for spoken_text in iterator:
        emotion = str(classifier(spoken_text)[0]['label'])
        emotions.append(emotion)

    # Map the sequence of emotions to a single archetype label
    archetype = categorize_story(emotions)
    print(archetype)
    return archetype

# Clean each script, split it into 200-word windows, and map the windows to an archetype
WINDOW_SIZE = 200
df['emoticon'] = df['text script'].apply(
    lambda script: calculate_senti_list(
        create_word_windows(cleaned_text(script), WINDOW_SIZE), classifier
    )
)

df.head()


Processing: 100%|██████████| 6/6 [00:01<00:00,  3.96it/s]
Uncharted
Processing: 100%|██████████| 9/9 [00:02<00:00,  3.86it/s]
Icarus
Processing: 100%|██████████| 9/9 [00:02<00:00,  3.83it/s]
Man in a Hole
Processing: 100%|██████████| 8/8 [00:02<00:00,  3.64it/s]
Rags to Riches
Processing: 100%|██████████| 10/10 [00:02<00:00,  3.49it/s]
Man in a Hole
[Progress bars and printed labels for the remaining episodes omitted; one label is printed per episode.]

Unnamed: 0,text script,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words,List of Locations,emoticon
0,"[Scene: The Subway, Phoebe is singing for\ncha...",The Pilot-The Uncut Version,1.0,1.0,"['Joey', 'Paul', 'Rachel', 'Frannie', 'Chandle...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290,['Iridium'],Uncharted
1,"[Scene Central Perk, everyone's there.]\nMonic...",The One With The Sonogram At the End,2.0,1.0,"['Rachel', 'Joey', 'Barry', 'Scene', 'Geller',...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520,['Central Park'],Icarus
2,"[Scene: Chandler and Joey's, Chandler is helpi...",The One With The Thumb,3.0,1.0,"['Joey', 'Rachel', 'Scene', 'Chandler', 'Lizzi...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336,"['Central Perk', 'Iridium', 'Iridium', 'Cental...",Man in a Hole
3,"[Scene: Central Perk, Ross and Monica are watc...",The One With George Stephanopoulos,4.0,1.0,"['Joey', 'Rachel', 'Receptionist', 'Chandler',...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721,[],Rags to Riches
4,"[Scene: Central Perk, all six are there.]\nMon...",The One With The East German Laundry Detergant,5.0,1.0,"['Angela', 'Joey', 'Rachel', 'Bob', 'Chandler'...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765,[],Man in a Hole


In [16]:
#saving results
df_final['Emotional Archetype'] = df['emoticon']
df_final.head()

Unnamed: 0.1,Unnamed: 0,Title,Episode,Season,character_names,Spoken Word List,Total Spoken Words,number of scene changes,locs,Sarcasm Index,Emotional Archetype
0,0,The Pilot-The Uncut Version,1.0,1.0,"['Joey', 'Paul', 'Rachel', 'Frannie', 'Chandle...","{'[Scene': 81, 'Phoebe': 166, 'Ross': 357, 'Jo...",2290,13,"[""Ross's Apartment"", ""Monica's Apartment"", ""Ro...",0.666667,Uncharted
1,1,The One With The Sonogram At the End,2.0,1.0,"['Rachel', 'Joey', 'Barry', 'Scene', 'Geller',...","{'Monica': 253, 'Joey': 73, 'Phoebe': 115, 'Ra...",2520,11,"['Central Perk', 'Central Perk', ""Monica's Apa...",0.714286,Icarus
2,2,The One With The Thumb,3.0,1.0,"['Joey', 'Rachel', 'Scene', 'Chandler', 'Lizzi...","{'Chandler': 449, 'Joey': 185, 'Monica': 480, ...",2336,14,"['Central Perk', 'Central Perk', 'Central Perk...",0.625,Man in a Hole
3,3,The One With George Stephanopoulos,4.0,1.0,"['Joey', 'Rachel', 'Receptionist', 'Chandler',...","{'[Scene': 158, 'Monica': 418, 'Ross': 440, 'P...",2721,15,"['Central Perk', 'Central Perk']",0.333333,Rags to Riches
4,4,The One With The East German Laundry Detergant,5.0,1.0,"['Angela', 'Joey', 'Rachel', 'Bob', 'Chandler'...","{'Monica': 323, 'Ross': 560, 'Rachel': 371, 'C...",2765,16,"['Central Perk', 'Central Perk', ""Monica's Apa...",0.625,Man in a Hole


In [21]:
df_final['Emotional Archetype'].value_counts() #observing values obtained

Uncharted         169
Man in a Hole      26
Rags to Riches     20
Cinderella          6
Icarus              5
Oedipus             3
Name: Emotional Archetype, dtype: int64
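To read these counts as shares of all episodes, here is a small sketch using the figures printed above. The `counts` dictionary is re-typed by hand for illustration and is not a variable in the notebook.

```python
# Counts copied from the value_counts() output above
counts = {
    "Uncharted": 169,
    "Man in a Hole": 26,
    "Rags to Riches": 20,
    "Cinderella": 6,
    "Icarus": 5,
    "Oedipus": 3,
}

total = sum(counts.values())
# Fraction of episodes assigned to each archetype, rounded to three decimals
shares = {name: round(n / total, 3) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<15}{share:.1%}")
```

On the DataFrame itself, `df_final['Emotional Archetype'].value_counts(normalize=True)` returns the same proportions without re-typing the counts.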

In [19]:
df_final.to_csv('Final_Script_Feature_Engineered.csv') #saving and checkpoints