# Simpsons Haiku Demo 
## [@SimpsonsHaiku](https://twitter.com/SimpsonsHaiku)

Anyone who knows me will attest that The Simpsons has had an undue impact on my life, formative through my early years and enduring to the present. Its contributions include a sense of humour deeply anchored in an appreciation for the surreal, a continued love of animation, and exposure to the depths of obscure Americana (thank you, John Swartzwelder).

This notebook demonstrates the implementation of an idea I had long ago, inspired by [@nythaikus](https://twitter.com/nythaikus). We start by loading the core haiku object, of class `SimpsonsHaiku`. 

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import syllables
import syllapy
import compuglobal
from haiku import *  # provides SimpsonsHaiku

In [2]:
simpsons_haiku = SimpsonsHaiku()

100%|███████████████████████████████████████████████████████████████████████| 400995/400995 [00:07<00:00, 55386.07it/s]


In [3]:
script = simpsons_haiku.script

In [4]:
script.spoken_words.iloc[0]

'Ooo, careful, Homer.'

In [5]:
script[script.normalized_text.progress_apply(lambda x: 'diddly' in x)].iloc[0].values

100%|██████████████████████████████████████████████████████████████████████| 224863/224863 [00:00<00:00, 917681.78it/s]


array([25369, 86, 69,
       'Devil: Ahem! I hold here a contract between myself and one Homer Simpson, pledging me his soul for a donut. Which I delivered! And it was scrum-diddly-umptious!',
       378000, True, 346.0, 25.0, 'Devil', 'Simpson Living Room',
       'Ahem! I hold here a contract between myself and one Homer Simpson, pledging me his soul for a donut. Which I delivered! And it was scrum-diddly-umptious!',
       'ahem i hold here a contract between myself and one homer simpson pledging me his soul for a donut which i delivered and it was scrum-diddly-umptious',
       26.0, 86, 'Treehouse of Horror IV', 5, 86, 'Ahem', 1, 2],
      dtype=object)

In [6]:
corpus = script['spoken_words'].str.cat(sep=' ')
# Strip punctuation before tokenising
for char in [",", ".", "?", "!", ":", "\\", "\""]:
    corpus = corpus.replace(char, '')
corpus_list = corpus.lower().replace('-', ' ').replace('/', ' ').split(' ')

corpus_df = pd.DataFrame({'word' : corpus_list})

simpsons_count = corpus_df.value_counts().reset_index(name='counts')
simpsons_count

| | word | counts |
|---|---|---|
| 0 | the | 95633 |
| 1 | i | 81109 |
| 2 | you | 79846 |
| 3 | a | 72057 |
| 4 | | 71689 |
| ... | ... | ... |
| 42120 | hippity | 1 |
| 42121 | hippo's | 1 |
| 42122 | street's | 1 |
| 42123 | hir | 1 |
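
Row 4 of the table above is the empty string. A minimal sketch of one likely cause, using only the standard library and a made-up sample (the sample text is an assumption, not taken from the dataset): splitting on a single space yields an empty token wherever two spaces end up adjacent, whereas `str.split()` with no argument collapses runs of whitespace.

```python
from collections import Counter

# Hypothetical sample standing in for the concatenated script.
sample = "Ooo, careful, Homer.  D'oh! - careful"

# Same cleaning steps as the cell above.
for char in [",", ".", "?", "!", ":", "\\", "\""]:
    sample = sample.replace(char, "")
cleaned = sample.lower().replace("-", " ").replace("/", " ")

tokens_single_space = cleaned.split(" ")  # keeps empty strings between adjacent spaces
tokens_whitespace = cleaned.split()       # collapses runs of whitespace

counts_single_space = Counter(tokens_single_space)
counts_whitespace = Counter(tokens_whitespace)

print("empty tokens with split(' '):", counts_single_space[""])
print("empty tokens with split():   ", counts_whitespace[""])
```

Whether to keep or drop the empty token only matters for the frequency table; it contributes no syllables either way.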


In [7]:
# Load the labelled syllable counts (word -> n_syllable)
df = (
    pd.read_json('simpson_lect.json', orient='index')
    .reset_index()
    .rename({'index': 'word', 0: 'n_syllable'}, axis=1)
)
df['syllables_estimate'] = df.word.apply(syllables.estimate)
df['syllapy_estimate'] = df.word.apply(syllapy.count)

df['syllables_error'] = (df['syllables_estimate'] - df['n_syllable']).abs()
df['syllapy_error'] = (df['syllapy_estimate'] - df['n_syllable']).abs()

In [8]:
# Comparing syllables and syllapy performance on labelled syllable set
df.describe().iloc[1:3, :]

| | n_syllable | syllables_estimate | syllapy_estimate | syllables_error | syllapy_error |
|---|---|---|---|---|---|
| mean | 1.908297 | 1.802038 | 1.679767 | 0.25182 | 0.283843 |
| std | 0.844484 | 0.826913 | 0.865224 | 0.560398 | 0.669679 |


In [9]:
# Generate haiku_df from scratch here (slow); it is loaded from disk instead if a path is passed when instantiating SimpsonsHaiku.
# haiku_df = simpsons_haiku.generate_haiku_df(save=True)
# haiku_df.sample().values
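
The search behind `generate_haiku_df` is not shown in this notebook. As a rough sketch of the idea only, assuming a haiku candidate is any run of consecutive dialogue lines whose syllable counts sum to exactly 17, a two-pointer scan over per-line counts could look like the following. The function name `find_runs` and the sample counts are hypothetical, and the real implementation may differ.

```python
from typing import List, Tuple


def find_runs(line_syllables: List[int], target: int = 17) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of consecutive runs summing to target.

    Assumes every count is a positive integer, so the window sum only
    grows when the window is extended to the right.
    """
    runs = []
    start = 0
    total = 0
    for end, count in enumerate(line_syllables):
        total += count
        # Shrink the window from the left while it overshoots the target.
        while total > target and start <= end:
            total -= line_syllables[start]
            start += 1
        if total == target:
            runs.append((start, end + 1))
    return runs


# Hypothetical per-line syllable counts for a stretch of dialogue.
sample_counts = [5, 7, 5, 3, 9, 17, 4]
sample_runs = find_runs(sample_counts)
print(sample_runs)
```

Each pair indexes a slice of the input, so `sample_counts[start:end]` recovers the per-line pattern of a candidate.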

In [None]:
haiku, _ = simpsons_haiku.generate_haiku()
print(haiku)

 75%|██████████████████████████████████████████████████████▋                  | 168642/224863 [09:39<02:57, 317.61it/s]

In [None]:
simpsons_haiku.generate_haiku()

In [None]:
# Max number of lines of dialogue in a 17-syllable sequence? It's 16 (Three Men and a Comic Book)
haiku_df['n_lines'] = haiku_df.number.apply(len)
haiku_df[haiku_df.n_lines == haiku_df.n_lines.max()]

In [None]:
# Max number of unique characters in a 17-syllable sequence? It's Homer 3D, with 9 characters. Did anyone see the movie Tron?
haiku_df['n_characters'] = haiku_df.character_id.apply(lambda x: len(set(x)))
haiku_df[haiku_df.n_characters == haiku_df.n_characters.max()]

In [None]:
# How about locations?
haiku_df['n_locations'] = haiku_df.location_id.apply(lambda x: len(set(x)))
haiku_df[haiku_df.n_locations == haiku_df.n_locations.max()].spoken_words_split.values

In [None]:
# Distribution by Season
# haiku_df.season.value_counts().plot(kind='bar')
haiku_df.groupby('season').count()['id_x'].plot()
plt.title('Number of haikus per season')

In [None]:
# Which episode(s) have the most haikus?
haiku_df.reset_index().groupby('episode_id').count()

In [None]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
base_script = pd.read_csv('dataset/simpsons_script_lines.csv', on_bad_lines='skip').dropna(subset=['word_count'])
episode_data = pd.read_csv('dataset/simpsons_episodes.csv')[['id', 'title', 'season', 'number_in_series']]
base_script = pd.merge(base_script, episode_data, how='left', left_on='episode_id', right_on='id')

base_script['n_syllables'] = base_script.spoken_words.progress_apply(simpsons_haiku.count_syllables_line)
base_script.groupby('season')['n_syllables'].mean().plot(label='Mean syllables per line')
base_script.groupby('season')['word_count'].mean().plot(label='Mean words per line')

plt.title('Mean syllables and words per line by season (original script)')
plt.legend()

In [None]:
# This is the ratio of the two curves above
plt.title('Mean number of syllables per word by season')
(base_script.groupby('season')['n_syllables'].sum() / base_script.groupby('season')['word_count'].sum()).plot(color='g')

In [None]:
# Taking a look at the longest lines
long_script = script[script.n_syllables > 17]
# long_script.sort_values('n_syllables', ascending=False).head()

In [None]:
# Distribution by syllable count
# Lists are unhashable, so convert each pattern to a tuple before counting
haiku_df.n_syllables.apply(tuple).value_counts()

In [None]:
# Higher quality haikus: per-line syllable patterns [17], [5, 7, 5], [5, 12], [12, 5]
good_patterns = [[5, 7, 5], [17], [5, 12], [12, 5]]
haiku_array = haiku_df[haiku_df.n_syllables.apply(lambda x: x in good_patterns)].sample().spoken_words_split.values
haiku_array
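
The four patterns selected above share one property: every break between dialogue lines falls on a haiku line boundary, that is, after syllable 5 or 12. A small sketch of that check, assuming `n_syllables` holds a list of per-line syllable counts; the helper name `breaks_on_haiku_boundaries` is hypothetical and not part of the `haiku` module.

```python
from itertools import accumulate

# Cumulative syllable positions at which a 5-7-5 haiku may break.
HAIKU_BOUNDARIES = {5, 12, 17}


def breaks_on_haiku_boundaries(n_syllables):
    """True if the counts total 17 and every dialogue-line break lands on 5 or 12."""
    running = list(accumulate(n_syllables))
    return bool(running) and running[-1] == 17 and set(running) <= HAIKU_BOUNDARIES


for pattern in ([5, 7, 5], [17], [5, 12], [12, 5], [7, 5, 5]):
    print(pattern, breaks_on_haiku_boundaries(pattern))
```

With this framing, a pattern such as `[7, 5, 5]` fails because its first dialogue line ends at syllable 7, in the middle of the second haiku line.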

In [None]:
# Everything that does not match one of the higher quality patterns
haiku_df[haiku_df.n_syllables.apply(lambda x: x not in [[5, 7, 5], [17], [5, 12], [12, 5]])]

In [None]:
# Medium quality?
haiku_df[haiku_df.n_syllables.apply(lambda x: x == [7, 5, 5])].sample().spoken_words_split.values

In [None]:
# Lower quality haikus: dialogue-line breaks fall outside the 5/7/5 boundaries
haiku_df[haiku_df.n_syllables.apply(lambda x: x not in [[5, 7, 5], [17], [5, 12], [12, 5]])].sample().spoken_words_split.values

In [None]:
# Look up a matching screencap on Frinkiac (via compuglobal)

simpsons = compuglobal.Frinkiac()
haiku_array = ["It doesn't take a whiz to see that you're looking out for number one"]
# Search
screencap = simpsons.search_for_screencap(haiku_array[0])

# Images/Gifs
image = screencap.get_meme_url()
gif = screencap.get_gif_url()

In [None]:
image

In [None]:
for word in "Perhaps I may be of help Where did you come from I'm your cellmate".split():
    print(word, simpsons_haiku.num_syllables(word))

In [None]:
# base_script[base_script.spoken_words.apply(lambda x: 'TV' in x)]