## Overview

This notebook will examine the following questions:

1. How many characters are there? What are their names?
1. For each character, find out who has the most lines across all episodes
1. What is the average of words per line for each character?
1. What is the most common word per character?
1. Number of episodes where the character does not have a line, for each character
1. Number of times "That's what she said" joke comes up
    * Include five examples of the joke
1. The average percent of lines each character contributed each episode per season.


In order to accomplish this, I will use the data cleaned in the notebook available at `code/the_office_data.ipynb` and stored in `data/cleaned_office_data.csv`.

### Task #1. How many characters are there? What are their names?

This task was essentially done as a part of the data cleaning effort. There are a number of ways in which we *could* answer it. However, I think that the way which makes the most sense is to appeal to the known authority on all things "The Office" -- in this case that authority will be the wiki.


I will further limit the characters to those which have lines in our script. Otherwise, it might be confusing for those who examine this data after me. After cleaning the data, this may be accomplished with relative ease:

In [8]:
import pandas as pd
df = pd.read_csv('../data/cleaned_office_data.csv')
characters = df.sort_values('speaker').speaker.unique()

To find the number of characters, we examine the length and find 115:

In [10]:
len(characters)

115

To print their names, a simple loop suffices:

In [11]:
for character in characters:
    print(character)

aj
alex
andy
angela
astrid
ben
billy
brandon
brenda
brian
bruce
cameron
carol
cathy
charles
clark
craig
creed
dan
danny
darryl
david
deangelo
devon
donna
dwight
ed
elizabeth
ellen
erin
esther
fannie
frank
fred
gabe
gabor
gideon
gil
glenn
hank
hannah
helene
henry
holly
hunter
ira
irene
isabel
jada
jake
jan
jeb
jerry
jessica
jim
jo
jordan
josh
justin
justine
karen
kathy
katy
kelly
kendall
kenny
kevin
leo
lonny
lynn
madge
martin
matt
meemaw
megan
melissa
melvina
meredith
merv
michael
mose
nate
nellie
nick
oscar
pam
penny
pete
philip
phillip
phyllis
rachel
ravi
robert
rolando
rolf
rory
roy
ryan
sasha
shirley
stanley
stephanie
teddy
toby
todd
tom
tony
trevor
troy
val
vikram
walter
wolf
zeke


I have omitted last names here because the script does not contain them. However, if desired, they may be obtained by cross referencing the wiki.
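As a small aside on the counting step above, here is a minimal sketch (on hypothetical toy data, not the real script) of how `len(...unique())` and `nunique()` can disagree when a speaker value is missing. The cleaned file may contain no missing speakers at all, in which case both give the same number.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads data/cleaned_office_data.csv.
toy = pd.DataFrame({"speaker": ["pam", "jim", None, "pam", "dwight"]})

# unique() keeps the missing value as its own entry, so len() counts it.
n_with_missing = len(toy.speaker.unique())

# nunique() ignores missing values by default.
n_characters = toy.speaker.nunique()

# Sorted list of names with missing values dropped first.
names = sorted(toy.speaker.dropna().unique())

print(n_with_missing, n_characters, names)
```

If the two counts differ on the real data, that would be a hint that some lines have no speaker attached.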

### Task #2. For each character, find out who has the most lines across all episodes

This task description is slightly ambiguous. I could interpret it as either:

1. Which character has the most lines across all episodes
1. Which character has the most lines in each episode

Since it is not possible to ask for clarification in this assignment, I will perform both queries.

#### 2a. Which character has the most lines across all episodes

The answer to this (unsurprisingly) is Michael. To find this, I grouped the data by speaker, and then counted the elements each speaker had and sorted in a descending fashion:

In [74]:
df.groupby('speaker').size().reset_index(name='line_counts').sort_values( 'line_counts', ascending=False).iloc[0].speaker

'michael'
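The same query can be sketched more compactly with `value_counts()`, which counts rows per speaker and sorts the counts in descending order. The toy frame below is hypothetical and only stands in for the cleaned data.

```python
import pandas as pd

# Hypothetical toy data: one row per spoken line.
toy = pd.DataFrame(
    {"speaker": ["michael", "jim", "michael", "pam", "michael", "jim"]}
)

# Count lines per speaker; the result is sorted from most to fewest lines.
line_counts = toy.speaker.value_counts()

# idxmax() returns the index label (the speaker) with the largest count.
top_speaker = line_counts.idxmax()

print(top_speaker, line_counts[top_speaker])
```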

#### 2b. Which character has the most lines in each episode

This question is a bit more involved than the last one. The high-level approach is to group by season, episode, AND speaker. The tricky part is that the grouped counts contain one row per speaker per episode, while we only want the top speaker for each (season, episode) pair.

To get there, I sorted the rows by (descending) `line_counts` and used the fact that the `.drop_duplicates()` method retains only the first matching row.

In [15]:
per_episode = df.groupby(['season', 'episode','speaker']).size().reset_index(name='line_counts').sort_values('line_counts', ascending=False).reset_index(drop=True)
per_episode.head()

Unnamed: 0,season,episode,speaker,line_counts
0,4,14,michael,167
1,4,4,michael,162
2,4,1,michael,160
3,4,2,michael,158
4,4,3,michael,156


Then, I dropped duplicates of season and episode and finally pulled the rows out of the original dataframe and sorted the results by season / episode to make them more readable.

Hopefully reading this chart is fairly intuitive. The character with the most lines for a given episode may be found in the `speaker` column. For example, in season 1 episode 1, Michael had the most lines:

In [19]:
pd.set_option('display.max_rows', 186)
per_episode = per_episode.iloc[per_episode[['season', 'episode']].drop_duplicates().index].sort_values(['season', 'episode']).reset_index(drop=True)
per_episode

Unnamed: 0,season,episode,speaker,line_counts
0,1,1,michael,98
1,1,2,michael,103
2,1,3,dwight,88
3,1,4,michael,111
4,1,5,michael,135
5,1,6,michael,128
6,2,1,michael,138
7,2,2,michael,112
8,2,3,michael,120
9,2,4,dwight,83


### Task #3. What is the average of words per line for each character?

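In this task, I will determine how many words an average line contains for each given character. I do this by splitting each line into words with a fairly naive regex pattern that matches runs of letters a-z (either case) and underscores. Punctuation and digits act as separators, so a contraction such as "That's" is counted as two words ("That" and "s").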

In [21]:
import re
df['word_count'] = df.line_text.apply(lambda x: len(re.findall("[a-zA-Z_]+", x)))

Then, I group the rows by speaker and take the average, which I round to 2 decimals for readability:

In [27]:
average_word_count_df = df[["speaker", "word_count"]].groupby('speaker').mean().sort_values('speaker')
average_word_count_df.word_count = average_word_count_df.word_count.apply(lambda x: round(x, 2))
average_word_count_df.reset_index()

Unnamed: 0,speaker,word_count
0,aj,5.65
1,alex,24.2
2,andy,13.44
3,angela,9.85
4,astrid,2.67
5,ben,7.75
6,billy,8.75
7,brandon,8.72
8,brenda,5.71
9,brian,10.7
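To make the behaviour of the naive pattern concrete, here is a small sketch that runs it on one line from the script. The second pattern is my own assumption of how contractions could be kept together; it is not what the notebook uses, and it would change the averages above.

```python
import re

# The pattern used in this notebook: runs of letters and underscores.
NAIVE = re.compile(r"[a-zA-Z_]+")

# Assumed alternative (not used above): allow apostrophes inside a word.
CONTRACTION_AWARE = re.compile(r"[a-zA-Z_]+(?:'[a-zA-Z_]+)*")

line = "That's what she said."

naive_words = NAIVE.findall(line)
aware_words = CONTRACTION_AWARE.findall(line)

# Digits are skipped entirely by the naive pattern.
digit_words = NAIVE.findall("I have 2 dogs")

print(naive_words, len(naive_words))
print(aware_words, len(aware_words))
print(digit_words)
```

The naive count of 5 for this line matches the `word_count` the notebook reports for "That's what she said." in Task #6.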


### Task #4. What is the most common word per character?

Here I will begin by reusing the same naive regex definition of a word as before, and rejoin the matches with single spaces so that all words are delimited by spaces.

In [28]:
from stopwords import stopwords
stopword_set = set(stopwords)
most_common_word_df = df.copy()  # explicit copy so df.line_text keeps its original punctuation
most_common_word_df['line_text'] = df.line_text.apply(lambda x: ' '.join(re.findall("[a-zA-Z_]+", x)))

Next, I will combine all the lines each character says into a single long string (per character) and drop the duplicates since we only need one row per character now.

In [31]:
joined_line_series = most_common_word_df.groupby('speaker').line_text.transform(lambda x: ' '.join(x))
most_common_word_df['joined_lines'] = joined_line_series
most_common_word_df = most_common_word_df.drop_duplicates('speaker')

Then I will remove stopwords (otherwise the results of this query will be especially boring). And finally, I will use the Counter python class to extract the most common word.

In [36]:
from collections import Counter
most_common_word_df['most_common_word'] = most_common_word_df.joined_lines.apply(lambda x: Counter([word for word in x.split() if word.lower() not in stopword_set]).most_common(1))
most_common_word_df.most_common_word = most_common_word_df[most_common_word_df.most_common_word.apply(lambda x: x != [])].most_common_word.apply(lambda x: x[0][0])
most_common_word_df[["speaker", "most_common_word"]].sort_values('speaker').reset_index(drop=True).dropna()

Unnamed: 0,speaker,most_common_word
1,alex,Uh
2,andy,Hey
3,angela,gonna
4,astrid,Mommy
5,ben,Michael
6,billy,fine
7,brandon,Val
8,brenda,Yeah
9,brian,Hey
11,cameron,Dwight


* Examining this shows that we might want to consider expanding our collection of stopwords. For example, it would be reasonable to include words like "Hey" and "Uh" in the stopword list
* There is also a bit of an open question as to whether we might want to exclude character names
* However, this result is decently interesting, so for the scope of this project I will leave it as-is
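One more detail worth illustrating: the query above lowercases words only for the stopword check, so "Hey" and "hey" are counted as different words. The sketch below, with a hypothetical helper `most_common_word` and a made-up miniature stopword set, shows how normalising case before counting can change the winner.

```python
import re
from collections import Counter

# Hypothetical miniature stopword set; the notebook imports a fuller list.
STOPWORDS = {"the", "a", "i", "you", "to", "is"}

def most_common_word(text, stopwords=STOPWORDS, lowercase=True):
    """Return the most frequent non-stopword in text, or None if none remain."""
    words = re.findall(r"[a-zA-Z_]+", text)
    if lowercase:
        # Fold case so "Paper", "paper" and "PAPER" share one counter entry.
        words = [w.lower() for w in words]
    kept = [w for w in words if w.lower() not in stopwords]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]

text = "Paper paper PAPER sales sales"

case_sensitive = most_common_word(text, lowercase=False)
case_folded = most_common_word(text, lowercase=True)

print(case_sensitive, case_folded)
```

Whether folding case is desirable depends on the goal; it would, for example, merge a character name with the same word used as a common noun.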

### Task #5. Number of episodes where the character does not have a line, for each character

In order to accomplish this, I took a reverse-approach. Specifically, I started by getting the total count of episodes in our project


In [37]:
total_episode_count = len(df.groupby(['season', 'episode']).size())
total_episode_count

186

Next, I grouped the rows of the data by speaker, season, and episode to get the count of episodes in which each character *does* have a line

In [40]:
speechless_episode_df = df.groupby(['speaker', 'season', 'episode']).count().groupby('speaker').size().reset_index(name="episode_count").sort_values('episode_count', ascending=False)
speechless_episode_df.head()

Unnamed: 0,speaker,episode_count
25,dwight,186
54,jim,185
85,pam,182
66,kevin,180
3,angela,175


Finally, I subtracted the number of episodes in which a character speaks from the total episode count to wind up with the speechless episode count

In [41]:
speechless_episode_df['speechless_episode_count'] = speechless_episode_df.episode_count.apply(lambda x: total_episode_count - x)
speechless_episode_df[['speaker', 'speechless_episode_count']].reset_index(drop=True)

Unnamed: 0,speaker,speechless_episode_count
0,dwight,0
1,jim,1
2,pam,4
3,kevin,6
4,angela,11
5,stanley,15
6,phyllis,16
7,oscar,20
8,kelly,41
9,andy,42
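The same subtraction can be sketched without the intermediate `count()` step by dropping duplicate (speaker, season, episode) rows first. The toy data below is hypothetical. As with the approach above, a character who never speaks does not appear in the data at all and so will not show up in the result.

```python
import pandas as pd

# Hypothetical toy data: one row per spoken line.
toy = pd.DataFrame({
    "season":  [1, 1, 1, 1, 2],
    "episode": [1, 1, 2, 2, 1],
    "speaker": ["michael", "jim", "michael", "michael", "michael"],
})

# Total number of distinct episodes in the data.
total_episodes = len(toy[["season", "episode"]].drop_duplicates())

# Episodes in which each speaker has at least one line.
episodes_spoken = (
    toy.drop_duplicates(["speaker", "season", "episode"])
       .groupby("speaker")
       .size()
)

# Episodes without a line = total episodes - episodes with a line.
speechless = (total_episodes - episodes_spoken).rename("speechless_episode_count")

print(speechless.sort_values())
```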


### Task #6. Number of times "That's what she said" joke comes up

This task was a bit more interesting. Not every reference to "that's what she said" is unique, nor does every reference to "that's what she said" contain the full quote.

Filtering for lines that contain the substring "what she s" (case-insensitive) gives results that seem largely satisfactory:

In [44]:
twss_df = df[df.line_text.apply(lambda x: "what she s" in x.lower())].reset_index(drop=True)
twss_df

Unnamed: 0,id,season,episode,scene,line_text,speaker,deleted,word_count
0,2545,2,2,24,That's what she said. Pam?,michael,False,6
1,2547,2,2,24,"That's what she sai [clears throat] Nope, but...",michael,False,13
2,2591,2,2,34,Does that include 'That's What She Said'?,jim,False,8
3,2594,2,2,34,THAT'S WHAT SHE SAID!,michael,False,5
4,5325,2,10,2,"A, that's what she said, and B, I wanted it to...",michael,False,27
5,6322,2,12,33,That's what she said.,dwight,False,5
6,6353,2,12,38,"Uh, that's what she said. See, haven't lost m...",michael,True,23
7,7644,2,17,6,That's what she said!,michael,False,5
8,8872,2,21,22,That's what she said. [Jim mouths these words ...,michael,False,64
9,9624,3,1,49,I am glad that today spurred social change. T...,michael,False,42


However, several of these matches come from the same scene, so we should deduplicate by season, episode, and scene:

In [67]:
# Build a "season-episode-scene" key so that each scene is counted only once
twss_df['key'] = twss_df.season.astype(str) + '-' + twss_df.episode.astype(str) + '-' + twss_df.scene.astype(str)
twss_df = twss_df.drop_duplicates('key').reset_index(drop=True)
twss_df

Unnamed: 0,id,season,episode,scene,line_text,speaker,deleted,word_count,key
0,2545,2,2,24,That's what she said. Pam?,michael,False,6,2-2-24
1,2591,2,2,34,Does that include 'That's What She Said'?,jim,False,8,2-2-34
2,5325,2,10,2,"A, that's what she said, and B, I wanted it to...",michael,False,27,2-10-2
3,6322,2,12,33,That's what she said.,dwight,False,5,2-12-33
4,6353,2,12,38,"Uh, that's what she said. See, haven't lost m...",michael,True,23,2-12-38
5,7644,2,17,6,That's what she said!,michael,False,5,2-17-6
6,8872,2,21,22,That's what she said. [Jim mouths these words ...,michael,False,64,2-21-22
7,9624,3,1,49,I am glad that today spurred social change. T...,michael,False,42,3-1-49
8,10904,3,5,59,That's what she said. [Stanley and Michael bot...,stanley,False,10,3-5-59
9,12594,3,10,49,Oh. [She whispers in his ear. Michael starts t...,michael,False,15,3-10-49


This looks reasonable, so a decent estimate of the number of "That's what she said" jokes would be 32:

In [68]:
len(twss_df)

32
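The match-then-deduplicate logic can be sketched with the standard library alone. The first four rows below are copied from the table above; the last one is a hypothetical non-matching line added for contrast.

```python
import re

# Rows are (season, episode, scene, line_text).
rows = [
    (2, 2, 24, "That's what she said. Pam?"),
    (2, 2, 24, "That's what she sai [clears throat] Nope, but..."),
    (2, 2, 34, "Does that include 'That's What She Said'?"),
    (2, 2, 34, "THAT'S WHAT SHE SAID!"),
    (2, 2, 35, "Good morning."),  # hypothetical line, should not match
]

# Same substring as the notebook, matched case-insensitively.
TWSS = re.compile(r"what she s", re.IGNORECASE)

# Number of individual lines that mention the joke.
matching_lines = sum(1 for *_, text in rows if TWSS.search(text))

# Number of distinct scenes that mention the joke.
scenes = {
    (season, episode, scene)
    for season, episode, scene, text in rows
    if TWSS.search(text)
}

print(matching_lines, len(scenes))
```

Counting scenes rather than lines treats a setup and its punchline in the same scene as one joke, which is why the result is best read as an estimate.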

### Task #6b. Five examples of the joke

The easiest way to do this is to manually pull the text from the scenes. After getting the season, episode, and scene from the readout of the preceding DataFrame, it's pretty straightforward to go into the script file and copy the relevant lines. 

I preferred a manual method here because there is not a simple way to script how much context should be included.

#### Example 1. from season 2, episode 2, scene 34

Michael: Attention, everyone! Hello! Ah, yes! I just want you to know that, uh, this is not my decision, but from here on out... we can no longer be friends. And when we talk about things here we must only discuss work-associated things. And, uh, you can consider this my retirement from comedy. And in the future, if I want to say something funny or witty or do an impression, I will no longer, ever, do any of those things.


Jim: Does that include 'That's What She Said'?

Jim: Wow! That is really hard. You really think you can go all day long? Well, you always left me satisfied and smiling

Michael: THAT'S WHAT SHE SAID!

#### Example 2. from season 2, episode 12, scene 38

Oscar: [Jim popping Michael's bubble wrap cast] You should put butter on it.

Michael: Uh, that's what she said.  See, haven't lost my sense of humor.  No, no need, it was a non-stick grill.

#### Example 3. from season 2, episode 2, scene 24

Pam: Uh... my mother's coming.

Michael: That's what she sai [clears throat]  Nope, but... Okay. Well, suit yourself.

#### Example 4. from season 2, episode 10, scene 2

Kevin: [holds up the piece of tree he just cut off with a paper cutter] Well, sort of. Why did you get it so big?

Michael: A, that's what she said, and B, I wanted it to be impressive. The biggest day of the year deserves the biggest tree of the year.

#### Example 5. from season 2, episode 12, scene 33

Doctor: Does the skin look red and swollen?

Dwight: That's what she said.

### Task #7. The average percent of lines each character contributed each episode per season

I will interpret this as the "percent of lines per character per season" rather than the "percent of lines per character per episode per season" as the latter would be a bit redundant.

In order to do this, I begin by grouping rows by speaker and season and counting them

In [70]:
percent_per_season_df = df.groupby(['speaker', 'season']).size().reset_index(name="speaker_season_line_count")
percent_per_season_df.head()

Unnamed: 0,speaker,season,speaker_season_line_count
0,aj,5,16
1,aj,7,15
2,alex,5,10
3,andy,3,398
4,andy,4,229


Next, I will find the total number of lines per season by grouping season and summing

In [71]:
total_lines_per_season = percent_per_season_df.groupby('season')['speaker_season_line_count'].sum().reset_index()
total_lines_per_season.head()

Unnamed: 0,season,speaker_season_line_count
0,1,1912
1,2,7118
2,3,7217
3,4,5338
4,5,7852


And finally, I will divide the speaker's line count by the total line count, multiply by 100, and round the answer to 2 decimal places to get the answer:

In [73]:
pd.set_option('display.max_rows', 329)
percent_per_season_df['percent_per_season'] =  percent_per_season_df.apply(lambda x: round(100 * x['speaker_season_line_count'] / total_lines_per_season[total_lines_per_season['season'] == x['season']].speaker_season_line_count.iloc[0], 2), axis=1)
percent_per_season_df[['speaker', 'season', 'percent_per_season']]

Unnamed: 0,speaker,season,percent_per_season
0,aj,5,0.2
1,aj,7,0.21
2,alex,5,0.13
3,andy,3,5.51
4,andy,4,4.29
5,andy,5,6.36
6,andy,6,7.62
7,andy,7,7.98
8,andy,8,17.0
9,andy,9,9.76


### Task #8. 

I will leave these undone since I didn't have time to complete them within the 4-hour window. However, here are a few questions which I would like to answer given more time:

1. Plot questions 1-7 to provide a more easily digestible report
1. Which season has the most dialogue (measured by number of words spoken)?
1. What is the average number of words spoken per scene across all seasons? Said another way, is there a given scene number which tends to have more dialogue than the other scenes?