# 'Frasier' Characters' Most Distinguishing Words

I recently stumbled on [this post](https://www.reddit.com/r/dataisbeautiful/comments/8a4gbr/the_office_characters_most_distinguishing_words_oc/) on /r/dataisbeautiful, which found the most distinguishing words of '_The Office_' characters. I am a fan of '_The Office_', but an even bigger fan of the sitcom '[_Frasier_](https://en.wikipedia.org/wiki/Frasier)', so I set out to do a similar analysis for '_Frasier_' characters.

## Getting the dialogue

Luckily for me, there's a great site [here](http://www.kacl780.net/frasier/transcripts/) which has complete transcripts of every '_Frasier_' episode. First, let's get the links to all the episodes from the [home page](http://www.kacl780.net/frasier/transcripts/). I use BeautifulSoup to get the HTML as a nested data structure, and pull all the links that have "episode" in them.

In [50]:
import urllib.request
from bs4 import BeautifulSoup

def url_to_soup(url):
    fp = urllib.request.urlopen(url)
    html_str = fp.read()
    fp.close()
    return BeautifulSoup(html_str, 'html.parser') 

main_page_soup = url_to_soup("http://www.kacl780.net/frasier/transcripts/")
urls = [a.get('href') for a in main_page_soup.find_all('a')]
urls = ['http://www.kacl780.net' + u for u in urls if 'episode' in u]

print('ep 1 transcript URL: ' + urls[0])
print('number of episodes: ' + str(len(urls)))

ep 1 transcript URL: http://www.kacl780.net/frasier/transcripts/season_1/episode_1/the_good_son.html
number of episodes: 264


## Parsing the dialogue

Now, let's parse out the dialogue from each of the transcripts. 

First, I'll define two helper functions. 

(1) `clean_string` will clean / normalize the words in the dialogue

(2) `get_dialogue_following_bold_tag`. This requires a bit of explanation. In the transcript, each character's lines are prefaced by their name in __bold__. Given one such 'b' element, this function walks its following siblings until it reaches the next 'b' element, then cleans and stores the text in between.

Some bolded text on the website isn't the start of a character's line, but we will filter this out during processing.

In [51]:
from string import punctuation
import re

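# given a string S:
# remove all punctuation, strip leading and trailing whitespace, and lowercase ALL the letters
def clean_string(S):
    if S is None:
        return None
    # re.escape keeps characters like ']' and '\' literal inside the character class
    return re.sub('[' + re.escape(punctuation) + ']', '', S).strip().lower()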


# given a bold tag bt:
# return dialogue line as list of cleaned words    
def get_dialogue_following_bold_tag(bt):
    if bt is None or bt.string is None:
        return []
    words_in_line = []
    ns = bt.next_sibling
    while ns is not None and ns.name != 'b':
        # we specifically want plain text nodes, which have no tag "name"
        # cues (like stage directions and intonation) are italicized with tag 'i'
        if ns.name is None:
            words_in_line += clean_string(ns.string).split()
        # move onto the next sibling
        ns = ns.next_sibling
    return words_in_line
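
As a quick sanity check of the normalization step, here is a minimal standalone sketch. The function name `normalize` and the sample sentence are made up for illustration; the logic mirrors `clean_string` but runs without the rest of the notebook.

```python
import re
from string import punctuation

# Standalone copy of the normalization step, for illustration only.
def normalize(text):
    # re.escape keeps characters such as ']' and '\' literal inside the class
    return re.sub('[' + re.escape(punctuation) + ']', '', text).strip().lower()

# A made-up line of dialogue, not taken from any transcript
line = "  Oh, I'm listening -- it's Dr. Crane!  "
tokens = normalize(line).split()
print(tokens)
```

Because apostrophes are removed along with the rest of the punctuation, contractions collapse into single tokens such as "im" and "its", which is why tokens like "im", "ill", and "hes" show up in the results later on.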

Now, let's get to the bulk of the processing. Given a URL that corresponds to an episode's transcript, `url_to_lines_dict` will return a dict of dicts. The outer dict's keys are the characters' names. The values are inner dicts: the keys are the words in the character's dialogue, and the values are the count of how many times the character says the word. We only process the dialogue of the main cast, which consists of Frasier, Niles, Roz, Daphne, and Martin. 

In [52]:
from collections import defaultdict
import pandas as pd

main_characters = set(['frasier', 'niles', 'roz', 'daphne', 'martin'])

# given a URL, convert it to the soup
# then, find all the bold tags in the soup
# for each bold tag that names a main character,
# count the words in the dialogue line that follows it
def url_to_lines_dict(url):
    soup = url_to_soup(url)
    bold_tags = soup.find_all('b')
    # this is a two-level defaultdict, see https://stackoverflow.com/a/27809959
    d_char_wc = defaultdict(lambda: defaultdict(int))
    for bt in bold_tags:
        char = clean_string(bt.string)
        # only process the line if it's uttered by a main character
        if char in main_characters:
            for word in get_dialogue_following_bold_tag(bt):
                d_char_wc[char][word] += 1
    return d_char_wc
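
To see the shape of the dict of dicts without fetching a page, here is a small sketch of the same counting pattern. The `lines` list, the reduced `main_cast` set, and the dialogue itself are invented for illustration.

```python
from collections import defaultdict

# Made-up (speaker, line) pairs standing in for a parsed transcript
lines = [
    ("frasier", "im listening"),
    ("niles", "oh frasier"),
    ("frasier", "oh niles im busy"),
    ("bulldog", "this stinks"),  # not in main_cast, so this line is skipped
]
main_cast = {"frasier", "niles"}

# Outer key: speaker. Inner key: word. Inner value: how often it was said.
counts = defaultdict(lambda: defaultdict(int))
for speaker, text in lines:
    if speaker in main_cast:
        for word in text.split():
            counts[speaker][word] += 1

# Convert to plain dicts for readable printing
as_plain = {speaker: dict(words) for speaker, words in counts.items()}
print(as_plain)
```

The nested `defaultdict` means neither a new speaker nor a new word needs to be initialized before incrementing its count.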

Let's process the first episode to get a sense of what our data looks like.

In [53]:
d_char_wc = url_to_lines_dict(urls[0])
df_char_wc = pd.DataFrame(d_char_wc)
# display the top 5 most frequent words, sorted by the 'frasier' column
df_char_wc.sort_values(by='frasier', ascending=False).head()

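| word | frasier | roz | niles | martin | daphne |
|---|---|---|---|---|---|
| you | 63.0 | 8.0 | 19.0 | 17.0 | 15.0 |
| i | 58.0 | 1.0 | 19.0 | 20.0 | 16.0 |
| the | 50.0 | 16.0 | 11.0 | 13.0 | 9.0 |
| to | 43.0 | 6.0 | 14.0 | 6.0 | 2.0 |
| a | 42.0 | 4.0 | 10.0 | 8.0 | 10.0 |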


Great! The above looks more or less reasonable. Let's go ahead and process the rest of the episodes, combining the resulting dictionaries as we go. NOTE: the below takes a few minutes :-)

In [54]:
from collections import Counter

for url in urls[1:]: # we did the first one above
    d_ep = url_to_lines_dict(url)
    for mc in main_characters:
        d_char_wc[mc] = Counter(d_char_wc[mc]) + Counter(d_ep[mc])

df_char_wc = pd.DataFrame(d_char_wc)
# display the top 5 most frequent words across all episodes, sorted by the 'frasier' column
df_char_wc.sort_values(by='frasier', ascending=False).head()

| word | frasier | roz | niles | martin | daphne |
|---|---|---|---|---|---|
| you | 10742.0 | 2169.0 | 4113.0 | 3844.0 | 2280.0 |
| i | 9394.0 | 2057.0 | 4257.0 | 3023.0 | 2250.0 |
| the | 7868.0 | 1321.0 | 3885.0 | 2678.0 | 1689.0 |
| to | 7482.0 | 1336.0 | 3530.0 | 2308.0 | 1555.0 |
| a | 6987.0 | 1282.0 | 2927.0 | 2273.0 | 1599.0 |
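
The per-episode merge relies on `Counter` addition, which sums counts key by key. Here is a minimal sketch with two invented "episodes" (`ep1`, `ep2`) to show what happens when a word or a speaker appears in only one of them.

```python
from collections import Counter

# Made-up per-episode word counts, keyed by speaker
ep1 = {"frasier": {"well": 3, "niles": 1}, "roz": {"hey": 2}}
ep2 = {"frasier": {"well": 2, "dad": 4}}  # roz has no lines in this one

totals = {}
for speaker in ("frasier", "roz"):
    # A missing speaker contributes an empty Counter
    totals[speaker] = Counter(ep1.get(speaker, {})) + Counter(ep2.get(speaker, {}))

print(totals["frasier"])
print(totals["roz"])
```

One detail worth knowing: `Counter` addition drops any entry whose resulting count is not positive. That never matters here, since word counts are always at least 1.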


Before we start computing the "score" of each word based on their occurrences, let's first take out the [stop words](https://en.wikipedia.org/wiki/Stop_words). 

In [132]:
from nltk.corpus import stopwords

stop_words = set(clean_string(sw) for sw in stopwords.words('english'))
df_char_wc = df_char_wc.drop(index=stop_words, errors='ignore')
df_char_wc.sort_values(by='frasier', ascending=False).head(10)

| word | frasier | roz | niles | martin | daphne |
|---|---|---|---|---|---|
| well | 3748.0 | 455.0 | 1393.0 | 1120.0 | 581.0 |
| oh | 3563.0 | 627.0 | 1461.0 | 1226.0 | 964.0 |
| im | 2487.0 | 464.0 | 1072.0 | 709.0 | 560.0 |
| know | 2295.0 | 406.0 | 654.0 | 797.0 | 379.0 |
| niles | 2167.0 | 121.0 | 166.0 | 384.0 | 184.0 |
| yes | 2093.0 | 83.0 | 652.0 | 65.0 | 238.0 |
| dad | 1731.0 | 29.0 | 621.0 | 15.0 | 24.0 |
| right | 1669.0 | 210.0 | 505.0 | 507.0 | 290.0 |
| roz | 1312.0 | 44.0 | 146.0 | 85.0 | 86.0 |
| see | 1004.0 | 124.0 | 329.0 | 257.0 | 167.0 |


## Scoring the dialogue

We're finally ready to compute the most distinguishing words for each character. Let's create another dataframe that will store our computed score for each word. First, we square each entry and divide by its row sum. That's equivalent to $\frac{(\text{person says word})^2}{\text{anyone says word}}$.

What's the motivation for the above? Well, $\frac{\text{person says word}}{\text{anyone says word}}$ is the share of a word's occurrences that belong to one person. However, using this ratio alone would inflate the scores of infrequent words: a word that one character says exactly once, and that nobody else says, gets the maximum ratio of 1. Multiplying again by $(\text{person says word})$ weights the ratio by how often the person actually says the word. See [this illuminating reddit comment](https://www.reddit.com/r/dataisbeautiful/comments/8a4gbr/the_office_characters_most_distinguishing_words_oc/dwwk070?utm_source=share&utm_medium=web2x) for a more detailed explanation.
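
To make the difference between the plain ratio and the squared score concrete, here is a small sketch in plain Python. The counts in `word_counts` and the helper names `ratio` and `score` are invented for illustration; the notebook itself does this with pandas.

```python
# Made-up counts: word -> {person: times said}
word_counts = {
    "well": {"frasier": 8, "niles": 2},
    "sherry": {"niles": 1},  # said once, by one person only
}

def ratio(word, person):
    # (person says word) / (anyone says word)
    by_person = word_counts[word]
    return by_person.get(person, 0) / sum(by_person.values())

def score(word, person):
    # (person says word)^2 / (anyone says word)
    by_person = word_counts[word]
    return by_person.get(person, 0) ** 2 / sum(by_person.values())

print(ratio("sherry", "niles"), score("sherry", "niles"))
print(ratio("well", "frasier"), score("well", "frasier"))
```

With the ratio alone, the one-off word (1.0) outranks the frequent one (0.8). With the squared score the order flips, roughly 6.4 against 1.0.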

In [139]:
import numpy as np
df_score = df_char_wc.apply(np.square).div(df_char_wc.sum(axis=1), axis=0)
df_score.head(10)

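| word | frasier | roz | niles | martin | daphne |
|---|---|---|---|---|---|
| 1 | 4.9 | 0.4 | 0.1 | NaN | NaN |
| 10 | NaN | 0.25 | 1.0 | NaN | 0.25 |
| 100 | 2.0 | 0.125 | 1.125 | NaN | NaN |
| 1000 | 0.5 | 0.5 | NaN | NaN | NaN |
| 10000 | 1.333333 | NaN | NaN | 0.333333 | NaN |
| 1000piece | NaN | NaN | 1.0 | NaN | NaN |
| 100th | NaN | 1.0 | NaN | NaN | NaN |
| 101 | NaN | NaN | 0.5 | 0.5 | NaN |
| 1030 | NaN | NaN | NaN | 1.0 | NaN |
| 105 | NaN | NaN | 1.0 | NaN | NaN |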


We'll now scale by the following factor: $\frac{\text{anyone speaks}}{\text{person speaks}}$, that is, the total number of words spoken by the main cast divided by the total number of words spoken by that character. This accounts for the fact that some characters simply speak more than others: Frasier naturally has more lines than any other character, as he is not only the main character, but also very verbose by nature.
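
Here is a minimal sketch of what this scaling does, again in plain Python with invented counts. The `counts` dict and the `scaled_score` helper are hypothetical stand-ins for the dataframe operations in the next cell.

```python
# Made-up word counts per character (imagine stop words already removed)
counts = {
    "frasier": {"well": 6, "niles": 3, "dad": 3},  # 12 words in total
    "roz": {"well": 2, "hey": 2},                  # 4 words in total
}
total_words = sum(sum(words.values()) for words in counts.values())  # 16

def scaled_score(word, person):
    said_by_person = counts[person].get(word, 0)
    said_by_anyone = sum(words.get(word, 0) for words in counts.values())
    raw = said_by_person ** 2 / said_by_anyone
    # Multiply by (anyone speaks) / (person speaks)
    return raw * total_words / sum(counts[person].values())

print(scaled_score("well", "frasier"))
print(scaled_score("well", "roz"))
```

Before scaling, Frasier's score for "well" is 4.5 against Roz's 0.5, a 9:1 gap. After scaling the scores are about 6.0 and 2.0, a 3:1 gap, because Roz speaks a third as many words in this toy data.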

In [140]:
total_word_count = sum(df_char_wc.sum())
for mc in main_characters:
    df_score[mc] = df_score[mc] * total_word_count / df_char_wc[mc].sum() 
df_score.sort_values(by='frasier', ascending=False).head(10).index

Index(['well', 'oh', 'niles', 'yes', 'dad', 'im', 'know', 'roz', 'right',
       'see'],
      dtype='object')

Finally, let's print out the top ten words for the main characters:

In [143]:
for mc in main_characters:
    top_10 = df_score.sort_values(by=mc, ascending=False).head(10).index
    print(mc, top_10)

niles Index(['frasier', 'oh', 'well', 'im', 'maris', 'daphne', 'dad', 'yes', 'going',
       'know'],
      dtype='object')
roz Index(['frasier', 'oh', 'im', 'hey', 'alice', 'know', 'yeah', 'really', 'well',
       'get'],
      dtype='object')
martin Index(['yeah', 'hey', 'oh', 'well', 'know', 'fras', 'eddie', 'got', 'get',
       'guys'],
      dtype='object')
daphne Index(['crane', 'dr', 'oh', 'mum', 'im', 'well', 'like', 'ill', 'mr', 'hes'], dtype='object')
frasier Index(['well', 'oh', 'niles', 'yes', 'dad', 'im', 'know', 'roz', 'right',
       'see'],
      dtype='object')


Cool! Some of these words are expected: for example, Daphne's "dr", "mr", and "crane" make sense, since she is the only character to refer to the Crane boys as such for the majority of episodes. Furthermore, she is also the only British character, so the appearance of her dreaded "mum" is expected. There are also some surprises: I would have thought some form of ["Oh Dear God"](https://www.youtube.com/watch?v=jaUy_dlKC0I) would appear for Frasier :-) Then again, we only score single words, so a catchphrase can only show up through its individual words.

This was a relatively simple word analysis. In fact, the question of distinguishing dialogue is only a subset of the larger question of determining important words in any document. For in-depth reading, see [this Wikipedia article](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
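
For comparison, here is a rough sketch of one common tf-idf variant, treating each character's combined dialogue as a single "document". The counts and the `tf_idf` helper are invented for illustration, and this is not the scoring used in the analysis above.

```python
import math

# Made-up word counts; each character is treated as one document
counts = {
    "frasier": {"well": 6, "niles": 3, "sherry": 1},
    "niles": {"well": 3, "maris": 4, "sherry": 2},
    "daphne": {"well": 2, "mum": 3},
}

def tf_idf(word, person):
    words = counts[person]
    # Term frequency: share of this character's words that are `word`
    tf = words.get(word, 0) / sum(words.values())
    # Inverse document frequency: rarer across characters means larger
    docs_with_word = sum(1 for w in counts.values() if word in w)
    idf = math.log(len(counts) / docs_with_word)
    return tf * idf

print(tf_idf("well", "frasier"))
print(tf_idf("mum", "daphne"))
```

In this variant, a word that every character uses gets a score of zero no matter how often it is said. With only five "documents", that would wipe out many of the shared words that appear in the top-ten lists above.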