# DSCI 511: DATA ACQUSITION AND PRE-PROCESSING
# PROJECT TITLE: SENTIMENT ANALYSIS OF 'THE OFFICE, US'

This project has been completed by Emily Wang, Rohit Lakshminarayan and Manoj Venkatachalaiah for DSCI 511, Winter 2021.


### --------------------------------------------------------------------------------------------------------------------------------------------------------------


##  Description, Scope and Access Rights

We have assembled a dataset for the tv show 'The Office, US' by scraping transcripts of all episodes across all seasons from https://www.officequotes.net/ and using NLP to convert the text data into numeric data. We have achieved this using two domains of data acqusition and pre-processing: Web Scraping and Sentiment analysis. The final dataset is an episode-wise distribution of sentiment scores for each of the main characters across all seasons. The final dataset, its attributes and how it was constructed is described in detail later in the notebook. 


We hope that a dataset like ours will be benefitial to researches in the entertainment domain. It can also be looked at from a business point of view, anlyzing tv show data will provide insight into what made the tv show a hit/flop and which set of writers were the best, which directors and producers could benefit from. It will be useful in examining how each character and or writer contributed to each episode. 
    

We hope to make this data availabe for all researchers by publishing it on forums such as kaggle, github etc.


### --------------------------------------------------------------------------------------------------------------------------------------------------------------


## The code is divided into three main categories:
#### 1) Web Scraping 
Scraping the episode transcripts and details from https://www.officequotes.net/ and episode details such as rating and review counts from https://www.imdb.com.
#### 2) Pre-processing
Cleaning the raw text data.
#### 3) Sentiment Analysis
Converting the pre-processed text data into numerical scores (sentiment scores).



##### All sections of code have sufficient text explainations and comments so that anyone willing use our project materials or follow up on it will have a better understanding of everything.


### --------------------------------------------------------------------------------------------------------------------------------------------------------------


## Modules needed 
Below cell contains all the modules that need to be installed for our project code. Please run both the cells below if you are going to execute some or all of the code in this notebook.

In [None]:
!pip install requests
!pip install bs4
!pip install nltk
!pip install vaderSentiment

In [352]:
import requests
from bs4 import BeautifulSoup
import re
from collections import defaultdict
import json
from nltk.corpus import stopwords  
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd
from collections import Counter
import urllib.request
import numpy as np




### --------------------------------------------------------------------------------------------------------------------------------------------------------------


## 1) Web Scraping

### 1.1 Inspecting the html response returned for https://www.officequotes.net/

In [353]:
URL = 'https://www.officequotes.net/'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

In [354]:
soup

<!DOCTYPE html>

<html class="no-js" lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<link href="https://gmpg.org/xfn/11" rel="profile"/>
<link href="https://www.officequotes.net/xmlrpc.php" rel="pingback"/>
<!--wordpress head-->
<title>Welcome to OfficeQuotes.net - OfficeQuotes.net</title>
<!-- This site is optimized with the Yoast SEO plugin v13.1 - https://yoast.com/wordpress/plugins/seo/ -->
<meta content="max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="robots">
<link href="https://www.officequotes.net/" rel="canonical"/>
<meta content="en_US" property="og:locale">
<meta content="website" property="og:type">
<meta content="Welcome to OfficeQuotes.net - OfficeQuotes.net" property="og:title"/>
<meta content="Welcome to OfficeQuotes.net, the comprehensive source for every line ever said on NBC’s The Office. From the most popul

#### Observation:
The home page contains details of all episodes across all seasons. We can build a metadata dictionary using this information which will come handy later on when we try to retrieve the html response of each episode's transcript.

### 1.2 Building metadata dictionary for episodes and seasons

In [355]:
ep=soup.find('p',text='Season I') #retriving the tag where the metadata starts

d=[]
for i in ep.find_next_siblings(): #the siblings of Season 1 tag will give us the details of succeding seasons
    if i.text=='Other':
        break
    if i.name=='p':
        
        d.append(i.text)
    
temp=[]
for i in d:
    for j in i.split('\n'):
        temp.append(j.replace('\xa0'," ").replace('Â'," "))

In [356]:

episodes=defaultdict(lambda : {}) #using a default dictionary to store episode details
season=1

for i in temp:
    if i[0:6]=='Season': #if the retrieved text is a season, the counter will set the variable Season by +1, else the text 
                                                                                                         #is episode title
        season+=1
    if i[0:6]!='Season':
        episodes[season][i.split('.')[0]]=i.split('.')[1].strip()

episodes

defaultdict(<function __main__.<lambda>()>,
            {1: {'01': 'Pilot',
              '02': 'Diversity Day',
              '03': 'Health Care',
              '04': 'The Alliance',
              '05': 'Basketball',
              '06': 'Hot Girl'},
             2: {'01': 'The Dundies',
              '02': 'Sexual Harassment',
              '03': 'Office Olympics',
              '04': 'The Fire',
              '05': 'Halloween',
              '06': 'The Fight',
              '07': 'The Client',
              '08': 'Performance Review',
              '09': 'E-mail Surveillance',
              '10': 'Christmas Party',
              '11': 'Booze Cruise',
              '12': 'The Injury',
              '13': 'The Secret',
              '14': 'The Carpet',
              '15': 'Boys and Girls',
              '16': 'Valentine’s Day',
              '17': 'Dwight’s Speech',
              '18': 'Take Your Daughter to Work Day',
              '19': 'Michael’s Birthday',
              '20': 'Drug

#### Observation:
There appears to be 9 seasons and a total of 186 episodes.

### 1.3 Storing metadata 
The metadata is written to a json file so that it can called and made us of whenever required.

In [6]:
metadata={}
for m in list(episodes): # converting the default dictionary to an ordinary one to write it to a json file
    metadata[m]=episodes[m]
    
file='raw_data/metadata.json'
with open(file, 'w', encoding='utf-8') as f:
    json.dump(metadata, f, ensure_ascii=False, indent=4)


### 1.4 Building urls for episodes 
The metadata is used to build urls for web scraping purposes. The urls for all 186 episoded are built. 

In [7]:
urls=[]
for i in episodes:
    for j in episodes[i]:
        URL = 'https://www.officequotes.net/no'+str(i)+'-'+str(j)+'.php' #building url for each episode
        urls.append(URL)

In [8]:
urls

['https://www.officequotes.net/no1-01.php',
 'https://www.officequotes.net/no1-02.php',
 'https://www.officequotes.net/no1-03.php',
 'https://www.officequotes.net/no1-04.php',
 'https://www.officequotes.net/no1-05.php',
 'https://www.officequotes.net/no1-06.php',
 'https://www.officequotes.net/no2-01.php',
 'https://www.officequotes.net/no2-02.php',
 'https://www.officequotes.net/no2-03.php',
 'https://www.officequotes.net/no2-04.php',
 'https://www.officequotes.net/no2-05.php',
 'https://www.officequotes.net/no2-06.php',
 'https://www.officequotes.net/no2-07.php',
 'https://www.officequotes.net/no2-08.php',
 'https://www.officequotes.net/no2-09.php',
 'https://www.officequotes.net/no2-10.php',
 'https://www.officequotes.net/no2-11.php',
 'https://www.officequotes.net/no2-12.php',
 'https://www.officequotes.net/no2-13.php',
 'https://www.officequotes.net/no2-14.php',
 'https://www.officequotes.net/no2-15.php',
 'https://www.officequotes.net/no2-16.php',
 'https://www.officequotes.net/n

### 1.5 Inspecting the response for an episode to identify target data



In [357]:
URL = urls[0]
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

soup

<!DOCTYPE html>

<html class="no-js" lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<link href="https://gmpg.org/xfn/11" rel="profile"/>
<link href="https://www.officequotes.net/xmlrpc.php" rel="pingback"/>
<!--wordpress head-->
<title>Season 1 - Episode 01 - OfficeQuotes.net</title>
<!-- This site is optimized with the Yoast SEO plugin v13.1 - https://yoast.com/wordpress/plugins/seo/ -->
<meta content="max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="robots">
<link href="https://www.officequotes.net/no1-01.php" rel="canonical"/>
<meta content="en_US" property="og:locale">
<meta content="article" property="og:type">
<meta content="Season 1 - Episode 01 - OfficeQuotes.net" property="og:title"/>
<meta content="&amp;nbsp Season 1 – Episode 01 “Pilot” Written by Greg Daniels, Ricky Gervais, and Stephen Merchant Directed by Ken Kwap

#### Observation
All the div tags with class='quote' is where the target data is located, these individual tags are scenes in the episode. Each scene has several dialogues for several characters. But, it seems that deleted scenes are also included as part of the transcript, which is unwanted data as far as the scope of our project is concerned. Therefore, in the web scraping script below we will only be retreiving the actual dialogues said in the episode by characters and not the deleted dialogues.

### 1.6 Scraping Episode transcripts

In [42]:
for i in urls:
    print(i) #each url is printed to detect faulty repsonses
    page = requests.get(i)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    dialogues=[]
    for j in soup.find_all('div',class_="quote"): #div tag with class='quote' contains all scenes
        if j.text[:8]==' Deleted' or j.text[:9]==' Delete': #omitting deleted scenes
            break
        dialogues.append(j)
    
    transcript=defaultdict(lambda:'') # this default dictionary records all combined dialogues for each character in an episode
    character=''
    for k in dialogues:
        
        for l in list(k.children):
            
            if l!=' ' and l!='\n':
                if l.name=='b':#the b tags contain the character name, the next tag contains the dilaogue
                    character=l.text.split(':')[0]
                elif l.name!='br' and l.name!='i' and l.name!='a' and l.name!='div' and l.name!='p' and l.name!='tr':
                    transcript[character]+=l.replace('\t\t\t\t','').strip()
         
              
    _dict={}
    for m in list(transcript): #converting default dict to ordinary one to write them to json files
        _dict[m]=transcript[m]
    
    file='raw_data/'+str(i[-8:-4])+'.json'                 #all transcripts are written out as json files
    with open(file, 'w', encoding='utf-8') as f:
        json.dump(_dict, f, ensure_ascii=False, indent=4)

    

https://www.officequotes.net/no1-01.php
https://www.officequotes.net/no1-02.php
https://www.officequotes.net/no1-03.php
https://www.officequotes.net/no1-04.php
https://www.officequotes.net/no1-05.php
https://www.officequotes.net/no1-06.php
https://www.officequotes.net/no2-01.php
https://www.officequotes.net/no2-02.php
https://www.officequotes.net/no2-03.php
https://www.officequotes.net/no2-04.php
https://www.officequotes.net/no2-05.php
https://www.officequotes.net/no2-06.php
https://www.officequotes.net/no2-07.php
https://www.officequotes.net/no2-08.php
https://www.officequotes.net/no2-09.php
https://www.officequotes.net/no2-10.php
https://www.officequotes.net/no2-11.php
https://www.officequotes.net/no2-12.php
https://www.officequotes.net/no2-13.php
https://www.officequotes.net/no2-14.php
https://www.officequotes.net/no2-15.php
https://www.officequotes.net/no2-16.php
https://www.officequotes.net/no2-17.php
https://www.officequotes.net/no2-18.php
https://www.officequotes.net/no2-19.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-02.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-03.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-04.php
https://www.officequotes.net/no8-05.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-06.php
https://www.officequotes.net/no8-07.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-08.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-09.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-10.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-11.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-12.php
https://www.officequotes.net/no8-13.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-14.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-15.php
https://www.officequotes.net/no8-16.php
https://www.officequotes.net/no8-17.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-18.php
https://www.officequotes.net/no8-19.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-20.php
https://www.officequotes.net/no8-21.php
https://www.officequotes.net/no8-22.php
https://www.officequotes.net/no8-23.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-24.php
https://www.officequotes.net/no9-01.php
https://www.officequotes.net/no9-02.php
https://www.officequotes.net/no9-03.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-04.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-05.php
https://www.officequotes.net/no9-06.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-07.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-08.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-09.php
https://www.officequotes.net/no9-10.php
https://www.officequotes.net/no9-11.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-12.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-13.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-14.php
https://www.officequotes.net/no9-15.php
https://www.officequotes.net/no9-16.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-17.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-18.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-19.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-20.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-21.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-22.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-23.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


### 1.7 Example transcript

In [358]:
transcript #transcript of the last episode, season 9, episode 23

defaultdict(<function __main__.<lambda>()>,
            {'Dwight': 'The documentary series finished airing ages ago. Why is PBS sending another crew?pff, Nobody buys DVDs anymore.PBS. The propaganda wing of Bill and Melinda Gates and viewers like you.In the past year, I have consolidated the entire Scranton paper market. We regained the white pages, the school district, Lackawanna county. We supply them all. I’m getting married tomorrow afternoon, and in the morning, there’s a mini-reunion. A kind of a “where are they now” panel at a local theatre. It’ll be nice to see everyone again. [laughs] I haven’t seen Kevin since we let him go.[mimicking trumpet] Today marks several important milestones. Stanley, as you know, is retiring.No! And our next and most thickly frosted cake is…for…Kevin.Go ahead and just read the frosting.Uh-huh.It’s a colloquial way of saying “you’re fired,” Kevin, which you are.The cake has spoken Pam. Sorry.Well if anyone here can make a case for Kevin staying.Based

In [359]:
list(transcript) #all characters in the above episode

['Dwight',
 'CameraMan',
 'Kevin',
 'Stanley',
 'Meredith',
 'Pam',
 'All',
 'Oscar',
 'Jim',
 'Toby',
 'Crowd',
 'Angela',
 'Phyllis',
 'Malcolm',
 'Zeke',
 'Clark',
 'Andy',
 'Casey Dean',
 'Bill Hader',
 'Seth Mayers',
 'Dakota',
 'Nellie',
 'Darryl',
 'Guy',
 'Rachel',
 'Stripper',
 'Erin',
 'Jakey',
 'Mose',
 'Man',
 'Man 1',
 'David Wallace',
 'Pete',
 'Woman 1',
 'Woman 2',
 'Man 2',
 'Moderator',
 'Man 3',
 'Woman 3',
 'Woman 4',
 'Joan',
 'Ed',
 'Usher',
 'Creed',
 'Kelly',
 'Ravi',
 'Ryan',
 'Michael',
 'Minister',
 'Val',
 'Carol Stills',
 'Carol',
 'Buyer',
 'Woman',
 'Photographer']

In [361]:
transcript['Jim'] #all dialogues of jim combined into one string for sentiment analysis

'Umm….I bike to work now. Saves on gas, cheaper than a vasectomy and, uh, oh, yeah, it’s good for the environment too.Pam and I are great. She just recently finished her mural for the Irish cultural center.[to Cici] Can you clap! Can you clap for mom?And Dwight is imitating Japanese business practices for reasons he explained to us in Japanese.Okay, the limo’s gonna be here at five.  I need everybody to be ready ‘cause I want to pack in a lot.Uh, no. No whorehouse. This is Dwight’s night, okay?Dwight has made me his bestisch mensch. Which is Schrute for best man. He’s putting himself entirely in my hands tonight. And I know for over 12 years I’ve done nothing but trick and prank him but tonight…only good surprises. “Guten Pranken”. [chuckles]Okay, hold on. Are you sure Mose isn’t going to show up?Mose has been weird? That’s so unlike him.Hey man, good to see you.Hey!Kismet? Yeah, right. Pam and I came up with excuses for every other weekend. You remember my two lap band surgeries, righ

#### Observation
Above is the transcript for the last episode of season 9. The keys are all the different characters involved in the episode. The values for the keys are the combined dialogues of each character in the show.

### 1.8 Scraped raw data
All the scraped data of each episode is stored in the folder 'raw_data'. We are making the raw data available for users so that they can build on it with their own preprocessing steps if required. We have have our own set of pre-processing steps which are show further below.

#### Note: The episode transcripts for Episode 8 of Season 4 and Episode 18 of Season 5 wasn't available on the website. Therefore, we will be omitting said episodes from out project.

### --------------------------------------------------------------------------------------------------------------------------------------------------------------


## 2) Preprocessing

### 2.1 Removing scene descriptions from dialogues

### Before:


In [374]:
transcript['Angela']

'[whispering] Yes. My heart is so open, I am so at peace. [scoffs] Look at Meredith. She’s disgusting. Those feet. They’re like the paws of an orangutan.[laughing] To remind you that our wedding’s gonna be wonderful.D, it’s gonna be perfect. The only people that need to be there are you and me.I don’t…I don’t know why.This is my big sister Rachel.[laughs] We’re very close. We even have our own special language.People love it.What? [knock at the door] Okay.Wait, what is this?Okay…Rachel, are you all right?Oh geeze. [Jakey starts dancing on Angela]. Oh, my God!Okay, if anything, this is rougher. Stop it Meredith.[Jakey resumes dancing] Uh, no. It’s o…thank you. You know what? You don’t have to…oh no, no, no. No, no, no. It’s okay.That was interesting. [creaking sound] What was that?Will you lock the door?Alright, see, you don’t have to leave the door wide open. We get it. It’s the wind. Just come and shut…[Mose grabs Angela and takes her away] OH! My God!What the [bleep] is your problem 

#### Observation
The texts '[whispering]' and '[scoffs]' in the first line and many similar ones in other lines are descriptions of scenes that aren't required for our analysis. Therefore we will be getting rid of text in brackets using regular expressions.

### After:

In [375]:
transcript['Angela']=re.sub("[\(\[].*?[\)\]]", "", transcript['Angela'])
transcript['Angela']

' Yes. My heart is so open, I am so at peace.  Look at Meredith. She’s disgusting. Those feet. They’re like the paws of an orangutan. To remind you that our wedding’s gonna be wonderful.D, it’s gonna be perfect. The only people that need to be there are you and me.I don’t…I don’t know why.This is my big sister Rachel. We’re very close. We even have our own special language.People love it.What?  Okay.Wait, what is this?Okay…Rachel, are you all right?Oh geeze. . Oh, my God!Okay, if anything, this is rougher. Stop it Meredith. Uh, no. It’s o…thank you. You know what? You don’t have to…oh no, no, no. No, no, no. It’s okay.That was interesting.  What was that?Will you lock the door?Alright, see, you don’t have to leave the door wide open. We get it. It’s the wind. Just come and shut… OH! My God!What the  is your problem you   ?!Thanks.  Oh. Ouch.No, my heels aren’t too high. It’s because I spent three hours in a car trunk. Thanks for not locking the door when I asked you to, Phyllis.  Sorry

### 2.2 Removing stop words
Stop words are words that don't contribute much to the sentiment of a sentence such as 'is', 'of', 'a' etc. We will be getting rid of them using the 'english stop words that the 'nltk' module provides.

### Before

In [376]:
transcript['Angela']

' Yes. My heart is so open, I am so at peace.  Look at Meredith. She’s disgusting. Those feet. They’re like the paws of an orangutan. To remind you that our wedding’s gonna be wonderful.D, it’s gonna be perfect. The only people that need to be there are you and me.I don’t…I don’t know why.This is my big sister Rachel. We’re very close. We even have our own special language.People love it.What?  Okay.Wait, what is this?Okay…Rachel, are you all right?Oh geeze. . Oh, my God!Okay, if anything, this is rougher. Stop it Meredith. Uh, no. It’s o…thank you. You know what? You don’t have to…oh no, no, no. No, no, no. It’s okay.That was interesting.  What was that?Will you lock the door?Alright, see, you don’t have to leave the door wide open. We get it. It’s the wind. Just come and shut… OH! My God!What the  is your problem you   ?!Thanks.  Oh. Ouch.No, my heels aren’t too high. It’s because I spent three hours in a car trunk. Thanks for not locking the door when I asked you to, Phyllis.  Sorry

### After

In [377]:
stop_words = set(stopwords.words('english'))   #using the stop word provided by the module
word_tokens = word_tokenize(transcript['Angela'])  
filtered_sentence = [w for w in word_tokens if not w in stop_words] #filtering the dialogue and removing stop words
transcript['Angela']=' '.join(filtered_sentence)
transcript['Angela']

'Yes . My heart open , I peace . Look Meredith . She ’ disgusting . Those feet . They ’ like paws orangutan . To remind wedding ’ gon na wonderful.D , ’ gon na perfect . The people need me.I ’ t…I ’ know why.This big sister Rachel . We ’ close . We even special language.People love it.What ? Okay.Wait , ? Okay…Rachel , right ? Oh geeze . . Oh , God ! Okay , anything , rougher . Stop Meredith . Uh , . It ’ o…thank . You know ? You ’ to…oh , , . No , , . It ’ okay.That interesting . What ? Will lock door ? Alright , see , ’ leave door wide open . We get . It ’ wind . Just come shut… OH ! My God ! What problem ? ! Thanks . Oh . Ouch.No , heels ’ high . It ’ I spent three hours car trunk . Thanks locking door I asked , Phyllis . Sorry Phyllis . You ’ know . As long I get altar.Hi.Oh , honeymoon wait till tomorrow . We wanted hang guys . I mean , going together ? Do even mattress ?'

### 2.3 Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meaning to one word.

Examples of lemmatization:

-> rocks : rock

-> corpora : corpus

-> better : good

### Before

In [387]:
a="The striped bats are hanging on their feet for better"
a

'The striped bats are hanging on their feet for better'

### After

In [388]:
lemmatizer = WordNetLemmatizer()
new=[lemmatizer.lemmatize(x) for x in a.split()]
' '.join(new)

'The striped bat are hanging on their foot for better'

#### Observation
We can notice the change in some words, many of them are changed from plural to singular form.

### 2.4 Pre-processing all episode raw data

The above three pre-processing steps were just a demonstration on a single episode's character's dialogues. The entire process is automated in the cells below so that the raw data for each episode is read from the 'raw_data' folder and is preprocessed.

In [64]:
with open('raw_data/metadata.json') as fp: #using the metadata to read files from raw_data folder
    data=json.load(fp)

In [65]:

for i in data:
    for j in data[i]:
        file='raw_data/'+str(i)+'-'+str(j)+'.json'
        if file!='raw_data/4-08.json' and file!='raw_data/5-18.json': #skipping over unavailable data
            with open(file, encoding='utf-8') as fp:
                data1=json.load(fp)
                for k in data1:
                    newk=re.sub("[\(\[].*?[\)\]]", "", data1[k])  #preprocessing step 1, refer section 2.1
                    stop_words = set(stopwords.words('english'))  #preprocessing step 2, refer section 2.2
                    word_tokens = word_tokenize(newk)  
                    filtered_sentence = [w for w in word_tokens if not w in stop_words]  

                    lemmatizer = WordNetLemmatizer() #preprocessing step 3, refer section 2.3

                    newk=lemmatizer.lemmatize(" ".join(filtered_sentence))
                    
                    data1[k]=newk
                
                file1='preprocessed_data/'+str(i)+'-'+str(j)+'.json'   
                with open(file1, 'w', encoding='utf-8') as f:
                    json.dump(data1, f, ensure_ascii=False, indent=4)

### 2.5 Storing pre-processed data
The preprocessed text data of each episode is present in the folder 'preprocessed_data'. We have made said data available to users so they can follow it up with their own idea and analysis.

### --------------------------------------------------------------------------------------------------------------------------------------------------------------


## 3) Sentiment Analysis

In [74]:
with open('raw_data/metadata.json') as fp:
    data=json.load(fp)

### 3.1) Example

In [390]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

analyser.polarity_scores(transcript['Angela'])

{'neg': 0.054, 'neu': 0.771, 'pos': 0.175, 'compound': 0.9838}

The vader sentiment analyser will return four scores for each string passed through it. Above is an example where we have passed all the dialogues for 'Angela' from the last episode in seaoson 9.

A 0.054 negative score means that 5.4% of the text passed was negative. The negative, neutral and positive scores will add up to form the compound score.

##### Interpretation:

positive sentiment-->  compound score > 0.05

neutral sentiment-->  (compound score < 0.05) and (compound score > -0.05)

negative sentiment-->  compound score < -0.05


### 3.2 Why Vader sentiment analyser?

#### a) Captures capitalization

In [391]:
analyser.polarity_scores('The food here is great!')

{'neg': 0.0, 'neu': 0.477, 'pos': 0.523, 'compound': 0.6588}

In [392]:
analyser.polarity_scores('The food here is GREAT!')

{'neg': 0.0, 'neu': 0.438, 'pos': 0.562, 'compound': 0.729}

#### b) Captures degree modifiers

In [393]:
analyser.polarity_scores('The food here is good!')

{'neg': 0.0, 'neu': 0.556, 'pos': 0.444, 'compound': 0.4926}

In [395]:
analyser.polarity_scores('The food here is marginally good!')

{'neg': 0.0, 'neu': 0.633, 'pos': 0.367, 'compound': 0.4402}

#### c) Captures conjunctions

In [396]:
analyser.polarity_scores('The food here is great, but the service is terrible.')

{'neg': 0.282, 'neu': 0.544, 'pos': 0.173, 'compound': -0.3818}

### 3.3 Converting the entire pre-processed data into sentiment scores

The text data is converted to sentiment scores and is stored in a dictionary.

In [75]:
final=defaultdict(lambda:{})

In [76]:
episode=0
for i in data:
    for j in data[i]:
        file='preprocessed_data/'+str(i)+'-'+str(j)+'.json'
        if file!='preprocessed_data/4-08.json' and file!='preprocessed_data/5-18.json': 
            episode+=1
            with open(file, encoding='utf-8') as fp:
                data1=json.load(fp)
                for k in data1:
                    ep='Episode '+str(episode)
                    final[ep][k]=analyser.polarity_scores(data1[k])
                

In [397]:
final

defaultdict(<function __main__.<lambda>()>,
            {'Episode 1': {'Michael': {'neg': 0.084,
               'neu': 0.728,
               'pos': 0.188,
               'compound': 0.9994},
              'Jim': {'neg': 0.058,
               'neu': 0.787,
               'pos': 0.155,
               'compound': 0.9822},
              'Pam': {'neg': 0.058,
               'neu': 0.775,
               'pos': 0.167,
               'compound': 0.9694},
              'Dwight': {'neg': 0.038,
               'neu': 0.825,
               'pos': 0.137,
               'compound': 0.9767},
              'Jan': {'neg': 0.071,
               'neu': 0.829,
               'pos': 0.1,
               'compound': 0.2716},
              'Michel': {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
              'Todd Packer': {'neg': 0.0,
               'neu': 1.0,
               'pos': 0.0,
               'compound': 0.0},
              'Phyllis': {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},


#### Observation
The default dictionary 'final' is a nested dictionary where the outer keys are all 184 episodes. The inner keys to these episodes are each character for that particular episode and the values to those keys are each of the 4 sentiment scores returned by the sentiment analyser.

### 3.4 Limiting the number of characters to the main characters

In [87]:

char_count=Counter()
for i in final:
    for j in final[i]:
        char_count[j]+=1

In [399]:
char_count.most_common() #characters and the number of episodes they have appeared in

[('Dwight', 184),
 ('Jim', 183),
 ('Pam', 180),
 ('Kevin', 178),
 ('Angela', 172),
 ('Stanley', 168),
 ('Phyllis', 166),
 ('Oscar', 162),
 ('Andy', 142),
 ('Ryan', 140),
 ('Kelly', 140),
 ('Meredith', 136),
 ('Michael', 135),
 ('Creed', 133),
 ('Toby', 112),
 ('Darryl', 104),
 ('Erin', 98),
 ('Gabe', 47),
 ('Jan', 39),
 ('Nellie', 33),
 ('All', 32),
 ('Everyone', 31),
 ('Roy', 29),
 ('David', 26),
 ('Karen', 26),
 ('Pete', 22),
 ('Robert', 21),
 ('Clark', 20),
 ('Holly', 17),
 ('Man', 15),
 ('David Wallace', 15),
 ('Hank', 13),
 ('Guy', 12),
 ('Val', 11),
 ('Todd Packer', 10),
 ('Group', 10),
 ('Nate', 10),
 ('Bob Vance', 9),
 ('Woman', 9),
 ('Michael and Dwight', 8),
 ('Waiter', 8),
 ('Both', 8),
 ('Jo', 8),
 ('Everybody', 7),
 ('Crowd', 7),
 ('Josh', 7),
 ('Bob', 7),
 ('Receptionist', 7),
 ('Mose', 7),
 ('Cathy', 7),
 ('Carol', 6),
 ('Micheal', 6),
 ('', 6),
 ('Charles', 6),
 ('Daryl', 6),
 ('Michel', 5),
 ('Packer', 5),
 ('Waitress', 5),
 ('Kenny', 5),
 ('Bartender', 5),
 ('Helene',

In [97]:
char=[i[0] for i in char_count.most_common(17)] #the characters until 'Erin' are considered, ie almost a minimum of 100 episodes

In [98]:
char #these are the characters that will be part of the final dataset

['Dwight',
 'Jim',
 'Pam',
 'Kevin',
 'Angela',
 'Stanley',
 'Phyllis',
 'Oscar',
 'Andy',
 'Ryan',
 'Kelly',
 'Meredith',
 'Michael',
 'Creed',
 'Toby',
 'Darryl',
 'Erin']

### 3.5 Building a dataframe from the dictionary

In [285]:
updated=defaultdict(lambda:{})
ep_list=list(final)

In [286]:
for i in char:
    pos=i+'_positive_score'
    neu=i+'_neutral_score'
    neg=i+'_negative_score'
    comp=i+'_compound_score'
    for j in ep_list:
        if i in list(final[j]):
            updated[pos][j]=final[j][i]['pos']
            updated[neu][j]=final[j][i]['neu']
            updated[neg][j]=final[j][i]['neg']
            updated[comp][j]=final[j][i]['compound']

        else:
            updated[pos][j]=0
            updated[neu][j]=0
            updated[neg][j]=0
            updated[comp][j]=0


In [426]:
final_df=pd.DataFrame(updated)

In [418]:
final_df

Unnamed: 0,Dwight_positive_score,Dwight_neutral_score,Dwight_negative_score,Dwight_compound_score,Jim_positive_score,Jim_neutral_score,Jim_negative_score,Jim_compound_score,Pam_positive_score,Pam_neutral_score,...,Toby_negative_score,Toby_compound_score,Darryl_positive_score,Darryl_neutral_score,Darryl_negative_score,Darryl_compound_score,Erin_positive_score,Erin_neutral_score,Erin_negative_score,Erin_compound_score
Episode 1,0.137,0.825,0.038,0.9767,0.155,0.787,0.058,0.9822,0.167,0.775,...,0.000,0.0000,0.000,0.000,0.000,0.0000,0.000,0.000,0.000,0.0000
Episode 2,0.218,0.694,0.088,0.9636,0.160,0.784,0.056,0.9896,0.255,0.745,...,0.098,-0.0772,0.000,0.000,0.000,0.0000,0.000,0.000,0.000,0.0000
Episode 3,0.182,0.721,0.097,0.9953,0.219,0.692,0.089,0.9917,0.074,0.798,...,0.000,0.0000,0.000,0.000,0.000,0.0000,0.000,0.000,0.000,0.0000
Episode 4,0.188,0.749,0.064,0.9959,0.154,0.793,0.054,0.9962,0.225,0.758,...,0.000,0.8504,0.000,0.000,0.000,0.0000,0.000,0.000,0.000,0.0000
Episode 5,0.166,0.815,0.018,0.9485,0.189,0.781,0.030,0.9802,0.171,0.789,...,0.000,0.0000,0.103,0.779,0.118,-0.3867,0.000,0.000,0.000,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Episode 180,0.126,0.839,0.036,0.9912,0.101,0.888,0.011,0.8970,0.216,0.750,...,0.144,-0.8879,0.000,0.000,0.000,0.0000,0.091,0.909,0.000,0.8727
Episode 181,0.204,0.782,0.015,0.9905,0.257,0.688,0.055,0.9895,0.353,0.610,...,0.035,0.5386,0.148,0.789,0.063,0.5191,0.151,0.660,0.189,-0.8082
Episode 182,0.117,0.816,0.067,0.9816,0.242,0.707,0.051,0.9987,0.180,0.807,...,0.000,0.2263,0.070,0.878,0.052,0.4682,0.196,0.727,0.077,0.9664
Episode 183,0.147,0.796,0.057,0.9981,0.161,0.761,0.078,0.9974,0.311,0.637,...,0.000,0.0000,0.146,0.798,0.056,0.9743,0.190,0.749,0.060,0.9690


In [288]:
sorted(df.columns)

['Andy_compound_score',
 'Andy_negative_score',
 'Andy_neutral_score',
 'Andy_positive_score',
 'Angela_compound_score',
 'Angela_negative_score',
 'Angela_neutral_score',
 'Angela_positive_score',
 'Creed_compound_score',
 'Creed_negative_score',
 'Creed_neutral_score',
 'Creed_positive_score',
 'Darryl_compound_score',
 'Darryl_negative_score',
 'Darryl_neutral_score',
 'Darryl_positive_score',
 'Dwight_compound_score',
 'Dwight_negative_score',
 'Dwight_neutral_score',
 'Dwight_positive_score',
 'Erin_compound_score',
 'Erin_negative_score',
 'Erin_neutral_score',
 'Erin_positive_score',
 'Jim_compound_score',
 'Jim_negative_score',
 'Jim_neutral_score',
 'Jim_positive_score',
 'Kelly_compound_score',
 'Kelly_negative_score',
 'Kelly_neutral_score',
 'Kelly_positive_score',
 'Kevin_compound_score',
 'Kevin_negative_score',
 'Kevin_neutral_score',
 'Kevin_positive_score',
 'Meredith_compound_score',
 'Meredith_negative_score',
 'Meredith_neutral_score',
 'Meredith_positive_score',
 '

#### Observation
Each of the characters have 4 different columns based on the 4 different sentiment scores.

### --------------------------------------------------------------------------------------------------------------------------------------------------------------


## 4) Expanding the Dataset

We have expanded the dataset to include details of episodes such as Title, Writers, Director and imdb rating.

In [232]:
urls.pop(82) #getting rid of missing data
urls.pop(58)

'https://www.officequotes.net/no4-08.php'

### 4.1 Scraping the writer and director information for each episode

In [250]:
directors=[]
writers=[]
for i in urls:
    print(i)
    page = requests.get(i)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    dialogues=[]
    a=soup.find('div',class_='entry-content').find('p').text
    if a!=' ':
        b=re.sub('[\t*]',"",a)
        w=b.split('Written by')[1].split('Directed by')
        writers.append(w[0].strip())
        d=w[1].split('Transcribed by')[0]
        directors.append(d.split('Original')[0])
    else:
        writers.append(None)
        directors.append(None)

https://www.officequotes.net/no1-01.php
https://www.officequotes.net/no1-02.php
https://www.officequotes.net/no1-03.php
https://www.officequotes.net/no1-04.php
https://www.officequotes.net/no1-05.php
https://www.officequotes.net/no1-06.php
https://www.officequotes.net/no2-01.php
https://www.officequotes.net/no2-02.php
https://www.officequotes.net/no2-03.php
https://www.officequotes.net/no2-04.php
https://www.officequotes.net/no2-05.php
https://www.officequotes.net/no2-06.php
https://www.officequotes.net/no2-07.php
https://www.officequotes.net/no2-08.php
https://www.officequotes.net/no2-09.php
https://www.officequotes.net/no2-10.php
https://www.officequotes.net/no2-11.php
https://www.officequotes.net/no2-12.php
https://www.officequotes.net/no2-13.php
https://www.officequotes.net/no2-14.php
https://www.officequotes.net/no2-15.php
https://www.officequotes.net/no2-16.php
https://www.officequotes.net/no2-17.php
https://www.officequotes.net/no2-18.php
https://www.officequotes.net/no2-19.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-02.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-03.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-04.php
https://www.officequotes.net/no8-05.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-06.php
https://www.officequotes.net/no8-07.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-08.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-09.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-10.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-11.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-12.php
https://www.officequotes.net/no8-13.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-14.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-15.php
https://www.officequotes.net/no8-16.php
https://www.officequotes.net/no8-17.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-18.php
https://www.officequotes.net/no8-19.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-20.php
https://www.officequotes.net/no8-21.php
https://www.officequotes.net/no8-22.php
https://www.officequotes.net/no8-23.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no8-24.php
https://www.officequotes.net/no9-01.php
https://www.officequotes.net/no9-02.php
https://www.officequotes.net/no9-03.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-04.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-05.php
https://www.officequotes.net/no9-06.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-07.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-08.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-09.php
https://www.officequotes.net/no9-10.php
https://www.officequotes.net/no9-11.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-12.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-13.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-14.php
https://www.officequotes.net/no9-15.php
https://www.officequotes.net/no9-16.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-17.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-18.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-19.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-20.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-21.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-22.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://www.officequotes.net/no9-23.php


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


In [252]:
len(directors)

184

### 4.2 Scraping the imdb data

In [430]:

data_set = []
data_set_season = {}
for s in range(1,10):
    with urllib.request.urlopen('https://www.imdb.com/title/tt0386676/episodes?season='+str(s)) as response:
        html = response.read()
        season_set = []
        soup = BeautifulSoup(html)
        episodes = soup.findAll("div", {"class": "list_item"})
        for episode in episodes:
            title = episode.findAll("a", {"itemprop": "name"})
            airdate = episode.findAll("div", {"class": "airdate"})
            rating = episode.findAll("span", {"class": "ipl-rating-star__rating"})
            num_votes = episode.findAll("span", {"class": "ipl-rating-star__total-votes"})
            description = episode.findAll("div", {"class": "item_description"})
            row_data = [title[0].text,airdate[0].text,rating[0].text,num_votes[0].text.replace('(','').replace(')','').replace(',',''),description[0].text]
            row_data = [r.replace('\n','').strip() for r in row_data]
            data_set.append(row_data)
            season_set.append(row_data)
    data_set_season['Season'+str(s)]= season_set




df = pd.DataFrame(data_set,columns=['Title','AirDate','Rating','Num_Votes','Description'])

### 4.3 Getting rid of missing and extra episodes and columns and adding writer and director data

In [431]:
df = df[df.Title != 'The Delivery: Part 2']
df = df[df.Title != 'Niagara: Part 2']
df = df[df.Title != 'The Deposition']
df = df[df.Title != 'New Boss']


In [432]:
del df['AirDate']
del df['Description']

In [433]:
df['Writers']=writers
df['Directors']=directors


In [434]:
df

Unnamed: 0,Title,Rating,Num_Votes,Writers,Directors
0,Pilot,7.5,6018,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis
1,Diversity Day,8.3,5890,B.J. Novak,Ken Kwapis
2,Health Care,7.8,4939,Paul Lieberstein,Ken Whittingham
3,The Alliance,8.0,4800,Michael Schur,Bryan Gordon
4,Basketball,8.4,5296,Greg Daniels,Greg Daniels
...,...,...,...,...,...
183,Stairmageddon,8.0,2506,Dan Sterling,Matt Sohn
184,Paper Airplane,8.0,2572,Halsted Sullivan & Warren Lieberstein,Jesse Peretz
185,Livin' the Dream,9.1,3694,Niki Schwartz-Wright,Jeffrey Blitz
186,A.A.R.M.,9.5,4993,Brent Forrester,David Rogers


### 4.4 Merging both the dataframes

In [427]:
final_df = final_df.rename_axis('Episodes').reset_index()

In [428]:
final_df['Episodes']

0        Episode 1
1        Episode 2
2        Episode 3
3        Episode 4
4        Episode 5
          ...     
179    Episode 180
180    Episode 181
181    Episode 182
182    Episode 183
183    Episode 184
Name: Episodes, Length: 184, dtype: object

In [439]:
df['Episodes']=list(final_df['Episodes'])
del final_df['Episodes']
df=df[['Episodes','Title','Writers','Directors','Rating','Num_Votes']]

In [440]:
df

Unnamed: 0,Episodes,Title,Writers,Directors,Rating,Num_Votes
0,Episode 1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis,7.5,6018
1,Episode 2,Diversity Day,B.J. Novak,Ken Kwapis,8.3,5890
2,Episode 3,Health Care,Paul Lieberstein,Ken Whittingham,7.8,4939
3,Episode 4,The Alliance,Michael Schur,Bryan Gordon,8.0,4800
4,Episode 5,Basketball,Greg Daniels,Greg Daniels,8.4,5296
...,...,...,...,...,...,...
183,Episode 180,Stairmageddon,Dan Sterling,Matt Sohn,8.0,2506
184,Episode 181,Paper Airplane,Halsted Sullivan & Warren Lieberstein,Jesse Peretz,8.0,2572
185,Episode 182,Livin' the Dream,Niki Schwartz-Wright,Jeffrey Blitz,9.1,3694
186,Episode 183,A.A.R.M.,Brent Forrester,David Rogers,9.5,4993


In [441]:
final_dataset=pd.DataFrame(
    np.column_stack([df,final_df]),
    columns=df.columns.append(final_df.columns)
)

In [1]:
# final_dataset

### 4.5 Writing the final dataset to csv file

In [443]:
final_dataset.to_csv('final_dataset.csv',index=False)

### --------------------------------------------------------------------------------------------------------------------------------------------------------------


# FINAL DATASET

In [2]:
final_dataset

Unnamed: 0,Episodes,Title,Writers,Directors,Rating,Num_Votes,Dwight_positive_score,Dwight_neutral_score,Dwight_negative_score,Dwight_compound_score,...,Toby_negative_score,Toby_compound_score,Darryl_positive_score,Darryl_neutral_score,Darryl_negative_score,Darryl_compound_score,Erin_positive_score,Erin_neutral_score,Erin_negative_score,Erin_compound_score
0,Episode 1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis,7.5,6018,0.137,0.825,0.038,0.9767,...,0.000,0.0000,0.000,0.000,0.000,0.0000,0.000,0.000,0.000,0.0000
1,Episode 2,Diversity Day,B.J. Novak,Ken Kwapis,8.3,5890,0.218,0.694,0.088,0.9636,...,0.098,-0.0772,0.000,0.000,0.000,0.0000,0.000,0.000,0.000,0.0000
2,Episode 3,Health Care,Paul Lieberstein,Ken Whittingham,7.8,4939,0.182,0.721,0.097,0.9953,...,0.000,0.0000,0.000,0.000,0.000,0.0000,0.000,0.000,0.000,0.0000
3,Episode 4,The Alliance,Michael Schur,Bryan Gordon,8.0,4800,0.188,0.749,0.064,0.9959,...,0.000,0.8504,0.000,0.000,0.000,0.0000,0.000,0.000,0.000,0.0000
4,Episode 5,Basketball,Greg Daniels,Greg Daniels,8.4,5296,0.166,0.815,0.018,0.9485,...,0.000,0.0000,0.103,0.779,0.118,-0.3867,0.000,0.000,0.000,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179,Episode 180,Stairmageddon,Dan Sterling,Matt Sohn,8.0,2506,0.126,0.839,0.036,0.9912,...,0.144,-0.8879,0.000,0.000,0.000,0.0000,0.091,0.909,0.000,0.8727
180,Episode 181,Paper Airplane,Halsted Sullivan & Warren Lieberstein,Jesse Peretz,8.0,2572,0.204,0.782,0.015,0.9905,...,0.035,0.5386,0.148,0.789,0.063,0.5191,0.151,0.660,0.189,-0.8082
181,Episode 182,Livin' the Dream,Niki Schwartz-Wright,Jeffrey Blitz,9.1,3694,0.117,0.816,0.067,0.9816,...,0.000,0.2263,0.070,0.878,0.052,0.4682,0.196,0.727,0.077,0.9664
182,Episode 183,A.A.R.M.,Brent Forrester,David Rogers,9.5,4993,0.147,0.796,0.057,0.9981,...,0.000,0.0000,0.146,0.798,0.056,0.9743,0.190,0.749,0.060,0.9690


### Column description

    Episodes        Episode number across all seasons, ranges from 
                    0 to 184

    Title		   Title of the episode

    Writers		 Writers of the episode

    Director		Director of the episode

    Rating		  The imdb rating of the episode 

    Num_Votes	   The number of votes casted

    Dwight_positive_score   The positive score for the character 'Dwight
                            in the episode

    Dwight_neutral_score	The neutral score for the character 'Dwight
                            in the episode

    Dwight_negative_score  The negative score for the character 'Dwight
                            in the episode

    Dwight_compound_score	The compound score for the character 'Dwight
                            in the episode

    The above 4 sentiment scores are present for all the characters in the list:
     'Jim',
     'Pam',
     'Kevin',
     'Angela',
     'Stanley',
     'Phyllis',
     'Oscar',
     'Andy',
     'Ryan',
     'Kelly',
     'Meredith',
     'Michael',
     'Creed',
     'Toby',
     'Darryl',
     'Erin'

    The next 4 columns would be 'Jim_positive_score','Jim_neutral_score', 
    'Jim_negative_score', 'Jim_compound_score' and so on.