# What Office Character Do You Talk Like?



We are going to download the scripts for every episode of The Office and load them into a pandas DataFrame. The plan:

* Find a source for the script data
* Pull it into Python
* Transform the data into easy-to-work-with DataFrames
* Run sentiment analysis on each character's lines
* Create a function that asks for the end user's input


<hr size="5"/>

### Table of Contents
* [1. Finding Script Data](#Finding-Script-Data)
* [2. Data Manipulation](#Fixing-Inconsistancies)
* [3. Sentiment](#Sentiment)

# Finding the Text to Every Episode's Script

A Google Sheets spreadsheet [here](https://docs.google.com/spreadsheets/d/18wS5AAwOh8QO95RwHLS95POmSNKA2jjzdt0phrxeAE0/edit#gid=747974534) has exactly what we are looking for. Download it as an Excel file and save it to your desktop; we can then load it into Python with `pandas`.

In [229]:
import pandas as pd 

the_office_raw_script = pd.read_excel(r"C:\Users\JF\Desktop\the-office-lines.xlsx")

Some general first checks when looking at a new DataFrame:
* How many rows does it have?
* What are the column names?
* How many rows fall into each group when you group by a column and count?


In [4]:
# How many rows?
len(the_office_raw_script.index)

59909

The dataset contains 59,909 lines of dialogue from The Office!

In [5]:
# Column names?
list(the_office_raw_script)

['id',
 'season',
 'episode',
 'scene',
 'line_text',
 'speaker',
 'deleted',
 'Stage1']

We have eight columns. `speaker` looks like the one we will be most interested in.

In [63]:
# Which speakers have more than 100 lines?
# (counting non-null `scene` values per speaker gives the number of lines, not scenes)
s = the_office_raw_script.groupby('speaker').scene.count()
# s[s > 100].index.tolist()
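
A quick way to list the speakers above a line-count threshold is `value_counts`. The snippet below is a minimal sketch on a made-up toy frame; the rows and the threshold of 1 are illustrative and are not taken from the script data.

```python
import pandas as pd

# Toy stand-in for the script data; these rows are made up for illustration.
toy = pd.DataFrame({
    "speaker": ["Michael", "Jim", "Michael", "Pam", "Jim", "Michael"],
    "line_text": ["a", "b", "c", "d", "e", "f"],
})

# Number of lines per speaker, sorted from most to fewest.
line_counts = toy["speaker"].value_counts()

# Keep only speakers above a threshold (1 here; the notebook asks about 100).
frequent_speakers = line_counts[line_counts > 1].index.tolist()
print(frequent_speakers)
```

On the real data, the same two lines applied to `the_office_raw_script['speaker']` with a threshold of 100 would answer the question in the cell above.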

# Fixing Inconsistancies

* Fix capitalization
* Fix spelling



In [38]:
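# Capitalization: normalize to title case so the names match the filters used later (e.g. 'Michael')
the_office_raw_script['speaker'] = the_office_raw_script['speaker'].str.strip().str.title()

In [62]:
# Fix spelling (the variants are already title-cased by the previous step)
the_office_raw_script['speaker'] = the_office_raw_script['speaker'].replace(['Micheal', 'Michel', 'Michal', 'Michael [On Phone]'], 'Michael')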
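
The list of misspellings above has to come from somewhere. One way to find candidates is to compare every distinct speaker name against the intended spelling with the standard library's `difflib`. This is only a sketch: the toy `speakers` series and the `cutoff` of 0.8 are assumptions for illustration, and on real data the matches should be reviewed by hand before replacing anything.

```python
import difflib
import pandas as pd

# Toy speaker column with made-up misspellings, for illustration only.
speakers = pd.Series(["Michael", "Micheal", "Michel", "Jim", "Pam", "Dwight", "Michal"])

# Every distinct spelling that is close to the canonical name.
candidates = difflib.get_close_matches(
    "Michael", speakers.unique().tolist(), n=10, cutoff=0.8
)

# Drop the canonical spelling itself; what remains are likely typos.
variants = sorted(c for c in candidates if c != "Michael")
print(variants)
```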

# Sentiment
* Import `TextBlob` and VADER's `SentimentIntensityAnalyzer`

In [174]:
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
analyser = SentimentIntensityAnalyzer()

In [159]:
def sentiment_analyzer_scores(sentence):
    # Return VADER's dict of neg/neu/pos/compound scores for one sentence
    score = analyser.polarity_scores(sentence)
    return score

In [230]:
the_office_raw_script['subjectivity'] = the_office_raw_script['line_text'].apply(lambda x: TextBlob(x).sentiment.subjectivity)

the_office_raw_script['polarity'] = the_office_raw_script['line_text'].apply(lambda x: TextBlob(x).sentiment.polarity)

the_office_raw_script['scores'] = the_office_raw_script['line_text'].apply(lambda x: analyser.polarity_scores(x))



In [239]:
# Use the approach from the following link to spread [scores] into four columns
# https://stackoverflow.com/questions/50512188/unpack-dictionary-from-pandas-column
scores = the_office_raw_script['scores']
scores_spread = pd.DataFrame.from_records(scores.tolist())
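
When unpacking a column of dictionaries, it helps to reuse the original index so the new columns line up with the right rows even if the index is not 0..n-1. A minimal sketch on a made-up two-row frame (the values are illustrative; the keys are the ones VADER's `polarity_scores` returns):

```python
import pandas as pd

# Toy frame with a made-up column of score dictionaries.
toy = pd.DataFrame(
    {
        "line_text": ["first line", "second line"],
        "scores": [
            {"neg": 0.0, "neu": 0.8, "pos": 0.2, "compound": 0.49},
            {"neg": 0.1, "neu": 0.9, "pos": 0.0, "compound": -0.3},
        ],
    },
    index=[10, 20],  # deliberately not 0..n-1
)

# Build one column per key, reusing the original index so rows stay aligned.
spread = pd.DataFrame(toy["scores"].tolist(), index=toy.index)

# join aligns on the index, so each line keeps its own scores.
combined = toy.drop(columns="scores").join(spread)
print(combined)
```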

In [244]:
df_c = pd.concat([scores_spread, the_office_raw_script], axis=1)

In [245]:
df_c

| | compound | neg | neu | pos | id | season | episode | scene | line_text | speaker | deleted | Stage1 | subjectivity | polarity | scores |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.4927 | 0.000 | 0.803 | 0.197 | 1 | 1 | 1 | 1 | All right Jim. Your quarterlies look very good... | Michael | False | Michael | 0.657857 | 0.597857 | {'neg': 0.0, 'neu': 0.803, 'pos': 0.197, 'comp... |
| 1 | 0.0000 | 0.000 | 1.000 | 0.000 | 2 | 1 | 1 | 1 | Oh, I told you. I couldn't close it. So... | Jim | False | not Michael | 0.000000 | 0.000000 | {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... |
| 2 | 0.0000 | 0.000 | 1.000 | 0.000 | 3 | 1 | 1 | 1 | So you've come to the master for guidance? Is ... | Michael | False | Michael | 0.000000 | 0.000000 | {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... |
| 3 | 0.4215 | 0.000 | 0.714 | 0.286 | 4 | 1 | 1 | 1 | Actually, you called me in here, but yeah. | Jim | False | not Michael | 0.100000 | 0.000000 | {'neg': 0.0, 'neu': 0.714, 'pos': 0.286, 'comp... |
| 4 | 0.2732 | 0.000 | 0.811 | 0.189 | 5 | 1 | 1 | 1 | All right. Well, let me show you how it's done. | Michael | False | Michael | 0.535714 | 0.285714 | {'neg': 0.0, 'neu': 0.811, 'pos': 0.189, 'comp... |
| 5 | 0.8146 | 0.096 | 0.732 | 0.172 | 6 | 1 | 1 | 2 | on the phone Yes, I'd like to speak to your of... | Michael | False | Michael | 0.597959 | 0.054150 | {'neg': 0.096, 'neu': 0.732, 'pos': 0.172, 'co... |
| 6 | 0.2225 | 0.000 | 0.969 | 0.031 | 7 | 1 | 1 | 3 | I've, uh, I've been at Dunder Mifflin for onet... | Michael | False | Michael | 0.556845 | 0.110491 | {'neg': 0.0, 'neu': 0.969, 'pos': 0.031, 'comp... |
| 7 | 0.2732 | 0.000 | 0.488 | 0.512 | 8 | 1 | 1 | 3 | Well. I don't know. | Pam | False | not Michael | 0.000000 | 0.000000 | {'neg': 0.0, 'neu': 0.488, 'pos': 0.512, 'comp... |
| 8 | 0.4588 | 0.000 | 0.833 | 0.167 | 9 | 1 | 1 | 3 | If you think she's cute now, you should have s... | Michael | False | Michael | 1.000000 | 0.500000 | {'neg': 0.0, 'neu': 0.833, 'pos': 0.167, 'comp... |
| 9 | 0.0000 | 0.000 | 1.000 | 0.000 | 10 | 1 | 1 | 3 | What? | Pam | False | not Michael | 0.000000 | 0.000000 | {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... |


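We now have all of the sentiment scores above. For a better comparison, we keep only the lines spoken by five main characters: Jim, Michael, Pam, Andy, and Dwight.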

In [258]:
# Keep only the five characters we want to classify. The .copy() gives an independent
# DataFrame, so adding a column below does not trigger SettingWithCopyWarning.
main_speakers = ['Jim', 'Michael', 'Pam', 'Andy', 'Dwight']
the_office_class = the_office_raw_script[the_office_raw_script['speaker'].isin(main_speakers)].copy()

In [277]:
# Encode each speaker as an integer label
the_office_class['num_speaker'] = the_office_class.speaker.map({'Jim':1, 'Michael':2, 'Pam':3, 'Andy':4, 'Dwight':5})


In [279]:
speaker = the_office_class['num_speaker']
line_text = the_office_class['line_text']
from collections import Counter
print(Counter(speaker))

Counter({2: 12137, 5: 7529, 1: 6814, 3: 5375, 4: 3968})


In [291]:

from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [296]:
# Upsample each minority speaker (Pam, Jim, Dwight, Andy) to Michael's 12,137 lines
the_office_class_pam = the_office_class[the_office_class['speaker'] == "Pam"]
the_office_class_pam_resample = resample(the_office_class_pam, 
                                         replace=True,     # sample with replacement
                                         n_samples=12137,    # to match majority class
                                         random_state=123) # reproducible results
 
    
the_office_class_jim = the_office_class[the_office_class['speaker'] == "Jim"]
the_office_class_jim_resample = resample(the_office_class_jim, 
                                         replace=True,     # sample with replacement
                                         n_samples=12137,    # to match majority class
                                         random_state=123) # reproducible results


the_office_class_dwight = the_office_class[the_office_class['speaker'] == "Dwight"]
the_office_class_dwight_resample = resample(the_office_class_dwight, 
                                         replace=True,     # sample with replacement
                                         n_samples=12137,    # to match majority class
                                         random_state=123) # reproducible results

the_office_class_andy = the_office_class[the_office_class['speaker'] == "Andy"]
the_office_class_andy_resample = resample(the_office_class_andy, 
                                         replace=True,     # sample with replacement
                                         n_samples=12137,    # to match majority class
                                         random_state=123) # reproducible results 


In [288]:
the_office_class_michael = the_office_class[the_office_class['speaker'] == "Michael"]

In [297]:
df_upsampled = pd.concat([the_office_class_michael, 
                          the_office_class_andy_resample,
                          the_office_class_dwight_resample,
                          the_office_class_jim_resample,
                          the_office_class_pam_resample ])
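
The four upsampling blocks above follow the same pattern, so they could also be written as a loop that computes the majority class size instead of hard-coding 12137. The following is a sketch of that idea on made-up toy data; the frame `toy` and its rows are assumptions for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Toy, made-up data with imbalanced classes.
toy = pd.DataFrame({
    "speaker": ["Michael"] * 5 + ["Jim"] * 3 + ["Pam"] * 2,
    "line_text": [f"line {i}" for i in range(10)],
})

# Size of the largest class, computed instead of hard-coded.
majority_size = toy["speaker"].value_counts().max()

parts = []
for name, group in toy.groupby("speaker"):
    if len(group) < majority_size:
        # Sample with replacement until the class matches the majority size.
        group = resample(group, replace=True,
                         n_samples=majority_size, random_state=123)
    parts.append(group)

balanced = pd.concat(parts)
print(balanced["speaker"].value_counts())
```

The majority class is left untouched, and every other class ends up with the same number of rows.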

In [298]:
df_upsampled.num_speaker.value_counts()

5    12137
4    12137
3    12137
2    12137
1    12137
Name: num_speaker, dtype: int64

In [299]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB


X_train, X_test, y_train, y_test = train_test_split(df_upsampled['line_text'], df_upsampled['speaker'], random_state = 0)

count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)

tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, y_train)
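
The split above sets aside `X_test` and `y_test`, which can be used to estimate how the classifier does on lines it was not trained on. A minimal sketch with made-up lines and labels follows; the names prefixed with `toy_` are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up lines and labels, for illustration only.
toy_lines = [
    "bears beets battlestar galactica",
    "assistant regional manager",
    "identity theft is not a joke",
    "I grew up on a beet farm",
    "that's what she said",
    "world's best boss",
    "I declare bankruptcy",
    "I love inside jokes",
]
toy_labels = ["Dwight"] * 4 + ["Michael"] * 4

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_lines, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# Fit the vectorizer and TF-IDF weights on the training lines only.
toy_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
toy_train_features = toy_tfidf.fit_transform(toy_vect.fit_transform(toy_X_train))
toy_clf = MultinomialNB().fit(toy_train_features, toy_y_train)

# Reuse the fitted transforms (transform, not fit_transform) on the test lines.
toy_test_features = toy_tfidf.transform(toy_vect.transform(toy_X_test))
predicted = toy_clf.predict(toy_test_features)
accuracy = accuracy_score(toy_y_test, predicted)
print(f"accuracy on held-out lines: {accuracy:.2f}")
```

One caveat when applying this to `df_upsampled`: because upsampling duplicates lines before the split, copies of the same line can land in both the training and test sets, so the measured accuracy may look better than it would on new text.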

In [331]:
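# Apply the same count and TF-IDF transforms used in training before predicting
print(clf.predict(tfidf_transformer.transform(count_vect.transform(["assistant to the reginal manager"]))))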

['Dwight']
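
The last step in the plan, a function that takes the end user's input, could be wrapped around the classifier roughly as follows. This is a sketch under assumptions: it trains on four made-up lines rather than the real script data, uses scikit-learn's `make_pipeline` with `TfidfVectorizer` (which combines the count and TF-IDF steps) so the same transforms are always applied at prediction time, and `predict_speaker` is a hypothetical name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training lines, for illustration only.
lines = [
    "bears beets battlestar galactica",
    "I am assistant regional manager",
    "that's what she said",
    "I am the world's best boss",
]
speakers = ["Dwight", "Dwight", "Michael", "Michael"]

# The pipeline vectorizes and classifies in one object.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(lines, speakers)

def predict_speaker(text):
    """Return the predicted speaker for one line of text."""
    return model.predict([text])[0]

print(predict_speaker("beets and bears"))
```

In an interactive session, `predict_speaker(input("Say something: "))` would pass a typed line to the model. With the real data, the pipeline would be fit on `df_upsampled['line_text']` and `df_upsampled['speaker']` instead of the toy lists.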
