Resources:
- https://github.com/jsvine/markovify
- https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons

In [1]:
import pandas as pd
import json
import zipfile
import markovify

In [2]:
# Read the CSV straight out of the Kaggle zip archive without extracting it
with zipfile.ZipFile('./dialogue-lines-of-the-simpsons.zip') as zf:
    data = pd.read_csv(zf.open('simpsons_dataset.csv'))

In [3]:
data.columns

Index(['raw_character_text', 'spoken_words'], dtype='object')

In [4]:
# Count lines per character; value_counts() skips missing names by default
char_counts = data['raw_character_text'].value_counts()
char_counts[:10]

Homer Simpson          29782
Marge Simpson          14141
Bart Simpson           13759
Lisa Simpson           11489
C. Montgomery Burns     3162
Moe Szyslak             2862
Seymour Skinner         2438
Ned Flanders            2144
Grampa Simpson          1880
Milhouse Van Houten     1862
Name: raw_character_text, dtype: int64

In [5]:
character = 'Homer Simpson'
bot_name = 'homer_bot'

In [6]:
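# Keep only the chosen character's rows. Rows with a missing spoken_words value
# stay in for now; the sentence-ending check in the next cells filters them out.
# (The original second line, data_character[~pd.isnull(data_character)], masked
# the frame with itself and had no effect, so it is dropped.)
data_character = data[data['raw_character_text'] == character]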
data_character.head(3)

    raw_character_text                                       spoken_words
57       Homer Simpson  Never thrown a party? What about that big bash...
58       Homer Simpson  Bart didn't get one vote?! Oh, this is the wor...
62       Homer Simpson                                                Oh.


In [7]:
# Last character of each line; '0' is a placeholder for missing (non-string) values
last_chars = [c[-1] if isinstance(c, str) else '0' for c in data_character['spoken_words'].values]
pd.Series(last_chars).value_counts()

.    16471
!     6110
?     4586
0     1932
"      343
-      230
/       15
:       11
Y       10
E        8
O        6
,        5
t        5
'        5
D        4
N        4
S        3
n        3
         3
L        3
G        3
y        2
7        2
r        2
h        2
H        2
o        1
1        1
A        1
e        1
T        1
m        1
w        1
X        1
K        1
I        1
R        1
W        1
dtype: int64

In [8]:
# Keep only lines that end in sentence-final punctuation
valid_ends = ['!', '.', '?']
valids = [c in valid_ends for c in last_chars]

In [9]:
valid_character = data_character[valids]['spoken_words'].values
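
The filter built in cells 7 to 9 keeps a line only if its last character is `.`, `!` or `?`. As a rough, pandas-free sketch of the same idea, the helper below does the check in one pass. The name `keep_complete_lines` is made up for illustration, and unlike the cells above it also tolerates empty strings.

```python
VALID_ENDS = ('.', '!', '?')

def keep_complete_lines(lines, valid_ends=VALID_ENDS):
    """Return only the string entries that end in sentence-final punctuation."""
    kept = []
    for line in lines:
        # Missing values read by pandas show up as float NaN, not str
        if not isinstance(line, str):
            continue
        # str.endswith accepts a tuple of suffixes; an empty string never matches
        if line.endswith(tuple(valid_ends)):
            kept.append(line)
    return kept

sample = ["Oh.", "Mmm, donuts", float("nan"), "Why you little!", "", "D'oh?"]
print(keep_complete_lines(sample))
```

Lines that trail off with `-`, `"` or a bare letter are dropped, which matches the intent of `valid_ends` above.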

In [10]:
text_character = ' '.join(list(valid_character))

In [11]:
model = markovify.Text(text_character)
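
For intuition about what such a model stores, here is a minimal word-level Markov chain written from scratch. It is a simplified sketch and not markovify's implementation: it conditions on a single previous word, and the names `build_chain`, `walk`, `BEGIN` and `END` are hypothetical.

```python
import random
from collections import defaultdict

BEGIN, END = "<BEGIN>", "<END>"  # sentinel tokens, chosen for this sketch

def build_chain(sentences):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] + sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def walk(chain, rng, max_words=30):
    """Random-walk the chain from BEGIN until END or max_words is reached."""
    word, out = BEGIN, []
    while len(out) < max_words:
        # Repeated followers appear several times, so choice() is frequency-weighted
        word = rng.choice(chain[word])
        if word == END:
            break
        out.append(word)
    return " ".join(out)

corpus = ["Marge, I'm sorry.", "Marge, why not?", "I'm sorry about the car."]
chain = build_chain(corpus)
print(walk(chain, random.Random(0)))
```

Because each step looks only at the previous word, the output is locally plausible but can drift in meaning, which is the effect visible in the generated lines in this notebook.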

In [12]:
# Generate 20 sentences of at most 140 characters each
for _ in range(20):
    print(model.make_short_sentence(140))

Read your town on a has-been planet orbited by a flying saucer!
The burgers are getting sharper.
Marge, I'm sorry -- it's my burger.
This is for your patience!
Well, that depends on what you think, sweetie?
Two can play in the world like major league baseball park.
On a beautiful thing.
But I'm going now and receive a free car because I'm gonna tell me?
Why scrimp now on the wrong foot.
Agreed, but to win back a hat.
Marge, why don't you take that back!
...but I am that I don't know -- I'll use my inventive mind to ketchup water.
You can't make you happy and you can watch cartoons and Lisa gonna get my lips started to move your ass.
The wind may have a field day with this.
It shows you've been smarter than you.
What the heck is this, the Twilight Zone?
You know Marge, joining the professional arm-wrestling circuit!
To me, she's beautiful.
Mmmm, / Mmmm, invisible cola.
If you can find your car seat to hold down the aisle at the bar stool.
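
markovify's sentence methods return `None` when no acceptable sentence is found within their retry budget, so a loop like the one above can occasionally print `None`. A small wrapper can skip those results. This is only a sketch: `collect_sentences` is a hypothetical helper, and it takes any zero-argument callable so it can be tried without a trained model.

```python
def collect_sentences(make_sentence, n, max_attempts=100):
    """Call make_sentence until n non-None results are collected.

    Gives up after max_attempts calls and returns what it has so far.
    """
    results = []
    for _ in range(max_attempts):
        if len(results) >= n:
            break
        sentence = make_sentence()
        if sentence is not None:
            results.append(sentence)
    return results

# Stand-in generator that fails every other call, for demonstration only
_calls = iter([None, "D'oh!", None, "Mmm, donuts."])
print(collect_sentences(lambda: next(_calls), n=2))
```

With the model from this notebook, the call would look like `collect_sentences(lambda: model.make_short_sentence(140), 20)`.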


In [15]:
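# to_json() already returns a JSON string, so write it out as-is
json_model = model.to_json()
with open('./{}.json'.format(bot_name), 'w') as f:
    f.write(json_model)

In [16]:
# Reload the model from disk to check the round trip
with open('./{}.json'.format(bot_name), 'r') as f:
    test_model = markovify.Text.from_json(f.read())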

In [17]:
# Generate 10 sentences from the reloaded model, at most 280 characters each
for _ in range(10):
    print(test_model.make_short_sentence(280))

I'm thinking about that mini-van I rented that plane.
C'mon, help me figure this out!
You're not even for one of these guys lost the Civil Rights Act.
Oh Lord, protect this rocket house and got rid of that one right now in the blue.
I don't know what Schadenfreude is.
Oh Barney, that's great.
The Olympics have preempted my favorite song now.
Well, you should let me give you your nanny, and to prove anything that's even remotely true.
It'll follow me to the square roots of any more!
Um, Marge, I have to make it any less true.
