# Introduction
I have built a simple turn-based information retrieval chatbot, `IRChatbot`.

My chatbot is initialized from a set of conversations, where each conversation consists of a sequence of messages, or "turns".
When the bot receives a new message, it attempts to find the most similar message among its known conversations and responds with the message that followed it in that conversation.
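
The `chatbot` module itself is not shown here. As a rough illustration of the retrieval idea only, here is a minimal sketch that assumes plain bag-of-words counts and cosine similarity. `ToyIRChatbot`, `bag_of_words` and `cosine` are hypothetical names, and the real `IRChatbot` may vectorize and match messages differently. Echoing the closest message when it has no known reply is likewise an assumption of this sketch.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIRChatbot:
    """Hypothetical stand-in for IRChatbot, for illustration only."""

    def __init__(self, messages, edges):
        self.messages = messages
        # Map each message id to the id of its reply
        # (if a message has several replies, the last edge wins).
        self.replies = dict(edges)
        self.vectors = {mid: bag_of_words(text) for mid, text in messages.items()}

    def respond(self, text):
        query = bag_of_words(text)
        # Find the known message most similar to the incoming one.
        closest = max(self.vectors, key=lambda mid: cosine(query, self.vectors[mid]))
        # Assumption: echo the closest message itself when it has no reply.
        return self.messages[self.replies.get(closest, closest)]


toy_messages = {
    0: "Hello, I'm Gustav.",
    1: "Nice to meet you Gustav, I'm Carl.",
    2: "What are you doing?",
    3: "I'm reading a book.",
}
toy_bot = ToyIRChatbot(toy_messages, [(0, 1), (2, 3)])
print(toy_bot.respond("Hello, I'm Simon"))
```

Because the similarity is computed against every stored message, the cost of one `respond` call in this sketch grows linearly with the number of messages.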

In [224]:
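import importlib  # `imp` is deprecated and was removed in Python 3.12

import chatbot
importlib.reload(chatbot)  # pick up edits to chatbot.py without restarting the kernel
from chatbot import IRChatbot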

import numpy as np

# Toy Example
Let's start with a toy data set of just four sentences.

In [225]:
messages = [
    "Hello, I'm Gustav.",
    "Nice to meet you Gustav, I'm Carl.",
    "What are you doing?",
    "I'm reading a book."
]
message_dict = dict(enumerate(messages))

edges = [(0, 1), (2, 3)]

In [226]:
bot = IRChatbot(message_dict, edges)

In [227]:
bot.respond("Hello, I'm Simon")

"Nice to meet you Gustav, I'm Carl."

In [228]:
bot.respond("What is happening?")

"I'm reading a book."

In [229]:
bot.respond("How are you doing?")

"I'm reading a book."

# Movie Script Data
Now let's move on to a larger data set, the [Cornell Movie Dialog Corpus](http://www.cs.cornell.edu/~cristian//Cornell_Movie-Dialogs_Corpus.html).

In [230]:
import pandas as pd

In [231]:
# The separator is a regular expression, which only the Python parsing engine supports
movie_lines = pd.read_csv('data/cornell_movie_dialogs_corpus/cornell movie-dialogs corpus/movie_lines.txt',
                          sep=r'\+\+\+\$\+\+\+', header=None, engine='python',
                          names=['id', 'character_id', 'movie_id', 'character_name', 'line'])

  


In [232]:
for col in movie_lines.columns:
    movie_lines.loc[:, col] = movie_lines.loc[:, col].str.strip()

In [233]:
movie_lines = movie_lines.fillna('')

In [234]:
movie_lines.head()

|   | id | character_id | movie_id | character_name | line |
|---|---|---|---|---|---|
| 0 | L1045 | u0 | m0 | BIANCA | They do not! |
| 1 | L1044 | u2 | m0 | CAMERON | They do to! |
| 2 | L985 | u0 | m0 | BIANCA | I hope so. |
| 3 | L984 | u2 | m0 | CAMERON | She okay? |
| 4 | L925 | u0 | m0 | BIANCA | Let's go. |


In [235]:
# Same regex separator as above, so the Python parsing engine is needed here too
conversations = pd.read_csv('data/cornell_movie_dialogs_corpus/cornell movie-dialogs corpus/movie_conversations.txt',
                            sep=r'\+\+\+\$\+\+\+', header=None, engine='python',
                            names=['character_id_1', 'character_id_2', 'movie_id', 'lines'])

  


In [236]:
for col in conversations.columns:
    conversations.loc[:, col] = conversations.loc[:, col].str.strip()

In [237]:
conversations.head()

|   | character_id_1 | character_id_2 | movie_id | lines |
|---|---|---|---|---|
| 0 | u0 | u2 | m0 | ['L194', 'L195', 'L196', 'L197'] |
| 1 | u0 | u2 | m0 | ['L198', 'L199'] |
| 2 | u0 | u2 | m0 | ['L200', 'L201', 'L202', 'L203'] |
| 3 | u0 | u2 | m0 | ['L204', 'L205', 'L206'] |
| 4 | u0 | u2 | m0 | ['L207', 'L208'] |


In [238]:
print("%d conversations with %d messages" % (len(conversations), len(movie_lines)))

83097 conversations with 304713 messages


The data contains a lot of meta information that my chatbot does not need, so I will throw that away.

In [239]:
import re

In [240]:
# Convert to lists
conversations['lines'] = conversations['lines'].apply(lambda x: list(re.sub(r"[' \[\]]", '', x).split(',')))
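
The regex above strips quotes, spaces and brackets before splitting on commas. As an alternative sketch, assuming every `lines` string is a valid Python list literal like the rows shown by `head()`, `ast.literal_eval` can parse it directly; the variable names here are made up for the example.

```python
import ast

# One raw value from the `lines` column, as it appears in the file
raw_lines = "['L194', 'L195', 'L196', 'L197']"

# literal_eval safely evaluates Python literals (no arbitrary code execution)
line_ids = ast.literal_eval(raw_lines)
print(line_ids)  # ['L194', 'L195', 'L196', 'L197']
```

Applied to the whole column, this would read `conversations['lines'].apply(ast.literal_eval)`, under the same assumption about the string format.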

In [241]:
messages = dict(zip(movie_lines.loc[:,'id'].values, movie_lines.loc[:,'line'].values))

In [242]:
conversations['edges'] = conversations['lines'].apply(lambda x: [(x[i], x[i+1]) for i in range(len(x)-1)])
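
Each edge pairs a message with the message that follows it in the same conversation. The same pairing can be sketched with `zip`, which avoids the index arithmetic; `consecutive_pairs` is a hypothetical helper name used only for this example.

```python
def consecutive_pairs(line_ids):
    """Pair every line id with the id of the line that follows it."""
    # zip stops at the shorter input, so the last id never starts a pair
    return list(zip(line_ids, line_ids[1:]))


pairs = consecutive_pairs(['L194', 'L195', 'L196', 'L197'])
print(pairs)  # [('L194', 'L195'), ('L195', 'L196'), ('L196', 'L197')]
```

A conversation with a single line yields an empty list, so it contributes no edges.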

In [243]:
conversations.head()

|   | character_id_1 | character_id_2 | movie_id | lines | edges |
|---|---|---|---|---|---|
| 0 | u0 | u2 | m0 | [L194, L195, L196, L197] | [(L194, L195), (L195, L196), (L196, L197)] |
| 1 | u0 | u2 | m0 | [L198, L199] | [(L198, L199)] |
| 2 | u0 | u2 | m0 | [L200, L201, L202, L203] | [(L200, L201), (L201, L202), (L202, L203)] |
| 3 | u0 | u2 | m0 | [L204, L205, L206] | [(L204, L205), (L205, L206)] |
| 4 | u0 | u2 | m0 | [L207, L208] | [(L207, L208)] |


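Let's define a helper that keeps only the first `n` conversations, so that my implementation can be tested on a subset (1000 conversations, say) before using the whole corpus. Passing `n = 83097` keeps everything.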

In [244]:
def trim_data(conversations, messages, n):
    fewer_conversations = conversations.iloc[:n]
    edges = []
    step = np.ceil(len(fewer_conversations) / 10)
    for i in range(10):
        edges += fewer_conversations.loc[i*step:min((i+1)*step, len(fewer_conversations)),'edges'].sum()
    fewer_messages = {}
    for edge in edges:
        for message_id in edge:
            fewer_messages[message_id] = messages[message_id]
    
    return fewer_messages, edges
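
One thing to keep in mind with `trim_data`: label-based `.loc` slices include both endpoints, so the row at each chunk boundary is visited by two consecutive chunks. A minimal alternative sketch, assuming the per-conversation edge lists are passed as a plain list (for example `conversations['edges'].tolist()`), flattens the first `n` of them in one pass with `itertools.chain`. The name `trim_data_chain` and the toy data are made up for illustration.

```python
from itertools import chain


def trim_data_chain(edge_lists, messages, n):
    """Keep the edges of the first n conversations and the messages they use."""
    # Positional slicing excludes the end, so no conversation is counted twice.
    # chain.from_iterable flattens in linear time, so no chunking is needed.
    edges = list(chain.from_iterable(edge_lists[:n]))
    fewer_messages = {mid: messages[mid] for edge in edges for mid in edge}
    return fewer_messages, edges


toy_edge_lists = [[('L1', 'L2'), ('L2', 'L3')], [('L4', 'L5')], [('L6', 'L7')]]
toy_lines = {'L%d' % i: 'line %d' % i for i in range(1, 8)}

toy_fewer_messages, toy_fewer_edges = trim_data_chain(toy_edge_lists, toy_lines, 2)
print(toy_fewer_edges)  # [('L1', 'L2'), ('L2', 'L3'), ('L4', 'L5')]
```

The chunked `.sum()` in the original is presumably there to keep list concatenation fast; flattening with `chain` sidesteps that concern.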

In [257]:
fewer_messages, fewer_edges = trim_data(conversations, messages, 83097)

In [258]:
print("There are now %d edges" % len(fewer_edges))

There are now 221661 edges


In [259]:
print("There are now %d lines" % len(fewer_messages))

There are now 304713 lines


In [260]:
%time bot = IRChatbot(fewer_messages, edges)

Wall time: 2min 14s


In [261]:
%time bot.respond("What is happening?")

Wall time: 2.3 s


'What is happening?'

In [262]:
%time bot.respond("Who are you?")

Wall time: 2.63 s


'Who are you?'

In [263]:
%time bot.respond("What is your name?")

Wall time: 2.6 s


'What is your name?'

In [264]:
%time bot.respond("Tell me something")

Wall time: 2.79 s


'Tell me something, anything.'

In [265]:
%time bot.respond("What do you hate about peas?")

Wall time: 2.41 s


'I hate peas.'

In [266]:
%time bot.respond("Where are we?")

Wall time: 2.55 s


'Where are we?'

In [267]:
%time bot.respond("Let's go!")

Wall time: 3.4 s


"Let's go!  Let's go!"

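Most of the time the bot is pretty much just mimicking my input, which is the behaviour I have set for when the closest message lacks a response. Note that the bot above was built with `edges` rather than `fewer_edges`; if `edges` still holds the toy pairs `(0, 1)` and `(2, 3)`, none of the movie line IDs has a registered response, which would explain why every answer echoes the closest message. Let's see what happens if I say things that are less likely to occur in the movies.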

In [278]:
%time bot.respond("Would you like to go to Stockholm?")

Wall time: 2.68 s


'Would you like to?'

In [279]:
%time bot.respond("I am learning about chatbots")

Wall time: 2.33 s


"You're learning."

It takes the bot between 2.3 and 3.4 seconds to answer a question, which I think is fine considering that a human would probably need more time to type an answer.

It would be interesting to see if we can get better responses and reduce the time to answer by using doc2vec! There are also other techniques to reduce the number of features, such as LDA or PCA, which could also be neat to try out.

In [269]:
print("Each document is represented as a vector with %d elements" % bot.message_vectors['L194'].shape[1])

Each document is represented as a vector with 59807 elements
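
With vectors this long, projecting onto fewer dimensions is one way to shrink the representation. Here is a minimal PCA sketch with NumPy on a made-up count matrix. `pca_reduce` is a hypothetical helper, not part of `IRChatbot`, and a dense SVD like this would likely be impractical for the full corpus, where a sparse or truncated solver would be needed.

```python
import numpy as np


def pca_reduce(matrix, k):
    """Project the rows of `matrix` onto its top-k principal components."""
    # Center each feature (column) so the SVD captures variance, not the mean
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are orthonormal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    return centered @ components.T, components


# Toy document-term counts: 4 documents, 5 vocabulary words
toy_counts = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

toy_reduced, toy_components = pca_reduce(toy_counts, 2)
print(toy_reduced.shape)  # (4, 2)
```

A new message would have to be centered with the same column means and projected with the same `components` before it can be compared with the stored vectors.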


# Doc2Vec 
Let's condense the document representations by using doc2vec!

In [201]:
fewer_messages, fewer_edges = trim_data(conversations, messages, 83097)

In [270]:
%time bot_doc2vec = IRChatbot(fewer_messages, edges, representation='doc2vec', representation_kwargs={'epochs':50, 'vector_size':100})

Wall time: 29min 52s


In [271]:
%time bot_doc2vec.respond("What is happening?")

Wall time: 766 ms


'What is happening?'

In [272]:
%time bot_doc2vec.respond("Who are you?")

Wall time: 807 ms


'Who are you?'

In [273]:
%time bot_doc2vec.respond("What is your name?")

Wall time: 778 ms


'What is your name?'

In [274]:
%time bot_doc2vec.respond("Tell me something")

Wall time: 885 ms


'Tell me something true.'

In [275]:
%time bot_doc2vec.respond("What do you hate about peas?")

Wall time: 771 ms


'Whatta you so happy about?'

In [276]:
%time bot_doc2vec.respond("Where are we?")

Wall time: 786 ms


'Where are we ?'

In [277]:
%time bot_doc2vec.respond("Let's go!")

Wall time: 972 ms


"Let's go!"

In [280]:
%time bot_doc2vec.respond("Would you like to go to Stockholm?")

Wall time: 783 ms


'Something like that. Will you do it?'

In [281]:
%time bot_doc2vec.respond("I am learning about chatbots")

Wall time: 750 ms


'I am accountable.'

In [286]:
def chat(model):
    print("Bot: Hey!")
    msg = ""
    while msg.lower() != "bye":
        msg = input()
        print("Bot: " + model.respond(msg))

In [287]:
chat(bot_doc2vec)

Bot: Hey!
Hi bot!
Bot: Hi!  Hi!
You seem happy!
Bot: Yippy!
:D
Bot: 5:11.
Okay? What does that mean?
Bot: Oh yeah? What kind is that?
What are you talking about?
Bot: What are you talking about?
I'm asking the questions here
Bot: It's a method for quitting smoking.
Ahh I see, thanks for explaining.
Bot: Alex, please.
I'm not Alex
Bot: Alex.
No
Bot: No.
Are you just mimicking what I say?
Bot: Are you jackin' with me?
No I'm not
Bot: No I'm not.
Ahhhh
Bot: Ahhh.
Please stop
Bot: Pops
Great, what about pops?
Bot: Yes, but what happened to George?
He livs in Brooklyn now.
Bot: Sounds like Beverly Hills.
No, it's in new york
Bot: Naw, it's okay.
You make no sense.
Bot: You mean the snake.
Bye
Bot: Bye.


In [288]:
chat(bot)

Bot: Hey!
Hi!
Bot: Hi!
You sure take your time to answer.
Bot: Take your time.
Hello my friend. You are now talking to a new person.
Bot: You are my friend.
That is good to hear albeit a bit worrying how easy you make friends.
Bot: I like worrying about you.
It is not you who should be worried. 
Bot: Should I be worried?
No, I explained to you that you should not. 
Bot: No, no, it's not a threat, I've explained all of that.
Would you like to know why it is that people have two ears yet only a single mouth?
Bot: Single?
one
Bot: One.
It is because a person, including you, should listen more than it talks.
Bot: Including you?
Yes.
Bot: Yes.
If I took that off, would you die? 
Bot: Took off. Yes. They took off.
You're a big guy. 
Bot: You're a big help.
bye
Bot: Bye, Bye.


In [289]:
msg = "Hey!"
for i in range(10):
    print("Bot1: "+ msg)
    msg = bot.respond(msg)
    print("Bot2: " + msg)
    msg = bot.respond(msg)
    

Bot1: Hey!
Bot2: Hey!
Bot1: Hey!
Bot2: Hey!
Bot1: Hey!
Bot2: Hey!
Bot1: Hey!
Bot2: Hey!
Bot1: Hey!
Bot2: Hey!
Bot1: Hey!
Bot2: Hey!
Bot1: Hey!
Bot2: Hey!
Bot1: Hey!
Bot2: Hey!
Bot1: Hey!
Bot2: Hey!
Bot1: Hey!
Bot2: Hey!
