Resources:
- https://github.com/jsvine/markovify
- https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons

In [2]:
import numpy as np
import pandas as pd
import json
import zipfile
import markovify

In [3]:
# Read the CSV straight out of the Kaggle zip archive.
with zipfile.ZipFile('./dialogue-lines-of-the-simpsons.zip') as zf:
    data = pd.read_csv(zf.open('simpsons_dataset.csv'))

In [4]:
data.columns

Index(['raw_character_text', 'spoken_words'], dtype='object')

In [5]:
# value_counts() skips missing speaker names by default, so no explicit null mask is needed.
char_counts = data['raw_character_text'].value_counts()
char_counts.head(10)

Homer Simpson          29782
Marge Simpson          14141
Bart Simpson           13759
Lisa Simpson           11489
C. Montgomery Burns     3162
Moe Szyslak             2862
Seymour Skinner         2438
Ned Flanders            2144
Grampa Simpson          1880
Milhouse Van Houten     1862
Name: raw_character_text, dtype: int64

In [6]:
character = 'Homer Simpson'
bot_name = 'homer_bot'

In [7]:
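data_character = data[data['raw_character_text'] == character]
# Rows with a missing 'spoken_words' value are still present at this point: masking the
# whole frame with ~pd.isnull(...) does not drop rows. The end-of-line check in the
# following cells screens them out.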
data_character.head(3)

Unnamed: 0,raw_character_text,spoken_words
57,Homer Simpson,Never thrown a party? What about that big bash...
58,Homer Simpson,"Bart didn't get one vote?! Oh, this is the wor..."
62,Homer Simpson,Oh.


In [8]:
# Last character of each line; '0' is a placeholder for missing (non-string) entries.
last_chars = [c[-1] if isinstance(c, str) else '0' for c in data_character['spoken_words'].values]
pd.Series(last_chars).value_counts()

.    16471
!     6110
?     4586
0     1932
"      343
-      230
/       15
:       11
Y       10
E        8
O        6
t        5
'        5
,        5
N        4
D        4
G        3
S        3
n        3
L        3
         3
r        2
H        2
7        2
h        2
y        2
w        1
K        1
m        1
X        1
1        1
o        1
I        1
R        1
T        1
A        1
W        1
e        1
dtype: int64

In [9]:
valid_ends = ['!', '.', '?']
valids = [c in valid_ends for c in last_chars]

In [10]:
valid_character = data_character[valids]['spoken_words'].values
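
The filter in the cells above keeps only lines whose last character is `!`, `.` or `?`, and drops missing values along the way. A minimal pure-Python sketch of the same rule follows. The helper name `ends_cleanly` and the sample lines are made up for illustration and are not part of the notebook.

```python
VALID_ENDS = ('!', '.', '?')

def ends_cleanly(line, valid_ends=VALID_ENDS):
    """Return True only for strings whose last character is one of valid_ends."""
    # Non-strings (e.g. NaN for a missing line) and empty strings are rejected.
    return isinstance(line, str) and line.endswith(valid_ends)

# Toy inputs, not taken from the dataset.
sample_lines = ["Oh.", "Why you little", float('nan'), "", "Woo hoo!"]
kept = [line for line in sample_lines if ends_cleanly(line)]
print(kept)  # ['Oh.', 'Woo hoo!']
```

Like the notebook's check, this rejects a line that ends in trailing whitespace after its punctuation; stripping the line first would be one way to keep such lines, if that is wanted.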

In [11]:
text_character = ' '.join(list(valid_character))

In [12]:
model = markovify.Text(text_character)
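
To give a rough idea of what a Markov chain text model does, here is a minimal pure-Python sketch with a state of two words, which matches markovify's default `state_size`. It is not markovify's implementation: the sentinel tokens, function names, and toy corpus are assumptions made for this example, and the library's sentence splitting and output checks are left out.

```python
import random
from collections import defaultdict

# Sentinel tokens chosen for this sketch only.
BEGIN, END = "<BEGIN>", "<END>"

def build_chain(sentences, state_size=2):
    """Map each state (a tuple of words) to the list of words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, rng=random):
    """Walk the chain from the begin state until the end sentinel is drawn."""
    state = (BEGIN,) * state_size
    out = []
    while True:
        nxt = rng.choice(chain[state])
        if nxt == END:
            return " ".join(out)
        out.append(nxt)
        state = state[1:] + (nxt,)

# Toy corpus, not taken from the dataset.
toy_corpus = ["I like donuts and beer.", "You like donuts and cake."]
toy_chain = build_chain(toy_corpus)
print(generate(toy_chain, rng=random.Random(0)))
```

Because both toy sentences share the state `("donuts", "and")`, the walk can recombine them into a sentence such as "I like donuts and cake." that never appears in the corpus. The same mechanism produces the mixed-up Homer lines in this notebook.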

In [13]:
for i in range(20):
    print(model.make_short_sentence(140))

And I can finally meet Jim Jarmusch and ask Mr. Seckofsky and Barney Gumbel.
No, it's something to get here.
Hey, could you just listen once in your beautiful voice with other people have a crayon up our dental plan!
But Mr. Burns, you're coming home.
I've got to squeal to every one of you youngsters is Abe Simpson?
You must think we're good!
If a mosquito bites you, don't make me choose.
That is one of those placemats with the dead.
I'm sorry I missed you guys are dragging me up when the air live?
I can't take his place.
Hey boy, we're not going anywhere.
The blue being my usual effervescent self...
You know, Moe, you're a good thing.
The men will clear the launch area.
I won't say, but a breeze if we were just talking about the romance between you and Jack Valenti thinking you can do is be surprised.
So basically, my job so that makes computers, or a bus filled with murderous rage.
Yeah, it's clear to moving back in time!
Now do you want to star in a carpet and throw it out, ladies.


In [15]:
json_model = model.to_json()
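# model.to_json() already returns a JSON string, so json.dump stores it as one
# JSON-encoded string; the loading cell reverses this with json.load.
with open('./{}.json'.format(bot_name), 'w') as f:
    json.dump(json_model, f)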

In [16]:
# json.load returns the JSON string that Text.from_json expects.
with open('./{}.json'.format(bot_name), 'r') as f:
    test_model = markovify.Text.from_json(json.load(f))
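
The save and load cells rely on a double encoding: `to_json()` returns a string, `json.dump` wraps that string in another layer of JSON, and `json.load` unwraps it again. The sketch below shows the round trip with the standard library only. The small dictionary is a stand-in for a real model, and `io.StringIO` replaces the file on disk.

```python
import io
import json

# Stand-in for model.to_json(): any JSON document serialized to a string.
model_json = json.dumps({"state_size": 2})

# Saving: json.dump encodes the string itself, quotes and escapes included.
buffer = io.StringIO()
json.dump(model_json, buffer)
stored = buffer.getvalue()

# Loading: json.load undoes one layer and gives the original string back.
restored = json.load(io.StringIO(stored))
print(restored == model_json)  # True
```

Writing the string with `f.write(...)` and reading it back with `f.read()` would avoid the extra layer, at the cost of a file format that differs from the one this notebook produces.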

In [17]:
for i in range(10):
    print(test_model.make_short_sentence(280))

I'm thinking about that mini-van I rented that plane.
C'mon, help me figure this out!
You're not even for one of these guys lost the Civil Rights Act.
Oh Lord, protect this rocket house and got rid of that one right now in the blue.
I don't know what Schadenfreude is.
Oh Barney, that's great.
The Olympics have preempted my favorite song now.
Well, you should let me give you your nanny, and to prove anything that's even remotely true.
It'll follow me to the square roots of any more!
Um, Marge, I have to make it any less true.
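
`make_short_sentence` returns `None` when it cannot build a sentence that meets its constraints, so a plain loop may print `None` or yield fewer usable lines than requested. A small helper that retries until enough sentences are collected might look like the following. The name `collect_sentences` and the `max_attempts` cap are hypothetical, and the generator is passed in as a callable so the sketch runs without a trained model.

```python
def collect_sentences(make_sentence, count, max_attempts=1000):
    """Call make_sentence() until `count` non-None results are collected."""
    sentences = []
    for _ in range(max_attempts):
        if len(sentences) >= count:
            break
        candidate = make_sentence()
        if candidate is not None:
            sentences.append(candidate)
    return sentences

# Stand-in generator that sometimes fails, mimicking a None result.
outputs = iter(["D'oh!", None, "Mmm, donuts.", None, "Woo hoo!"])
result = collect_sentences(lambda: next(outputs), count=3)
print(result)  # ["D'oh!", 'Mmm, donuts.', 'Woo hoo!']
```

With the model from this notebook, the call would be along the lines of `collect_sentences(lambda: test_model.make_short_sentence(280), 10)`.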
