
# Chatbot Data Wrangling and Exploratory Data Analysis (EDA)


The goal of this project is to create a chatbot that responds to my incoming Hangouts messages.  To do this, the bot needs some initial data to train on.  I will use my past Hangouts conversations along with a more general question-and-answer dataset from Kaggle.

### Table of Contents

1. Data Wrangling 
- import packages
- load, view kaggle data, and add hangouts data

2. Exploratory Data Analysis (EDA)
- decision tree classifier
- chatbot functionality

3. Summary
- findings
- save dataset
- notebook details

### Data Wrangling
#### Import Packages

In [1]:
import numpy as np
import pandas as pd

import os
import sys

import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split

# use os to get path
PROJ_ROOT = os.path.join(os.pardir)
print(os.path.abspath(PROJ_ROOT))

C:\Users\Hailey\Documents\GitHub\SBwork\Capstone3-Chatbot


#### Load, View Kaggle Data, and Add Hangouts Data
The data for this project is my Google Hangouts history (accessed using the instructions below), supplemented with a [Kaggle](https://www.kaggle.com/grafstor/simple-dialogs-for-chatbot) dataset.  To access my chat history, I enabled the Chats label in Gmail by going to Settings > Labels and clicking "show" next to Chats.  I could then navigate to the Chats label, where the chat history is listed.

In [2]:
# load and view kaggle data
data_path = os.path.join(PROJ_ROOT,'Data', 'kaggle-dialogs-original.txt')

# the file is tab-separated with no header row, so name the two columns at read time
df = pd.read_csv(data_path, sep='\t', header=None, names=['SenderText', 'BotText'])
df.head(3)

|   | SenderText | BotText |
|---|---|---|
| 0 | hi, how are you doing? | i'm fine. how about yourself? |
| 1 | i'm fine. how about yourself? | i'm pretty good. thanks for asking. |
| 2 | i'm pretty good. thanks for asking. | no problem. so how have you been? |
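
To make the expected file layout concrete, here is a small sketch that parses two made-up lines in the same "prompt, tab, response" shape from an in-memory string.  The sample text and the `quoting` setting are my own assumptions for illustration, not something taken from the Kaggle file itself.

```python
import csv
import io

import pandas as pd

# Two made-up lines in the assumed "prompt<TAB>response" layout.
sample = (
    "hi, how are you doing?\ti'm fine. how about yourself?\n"
    "what's up?\tnot much.\n"
)

# names= labels the columns while reading; QUOTE_NONE treats quote
# characters in the text as ordinary characters instead of field wrappers.
sample_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["SenderText", "BotText"],
    quoting=csv.QUOTE_NONE,
)
print(sample_df)
```

Commas inside a line are harmless here because the tab is the only separator.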


In [29]:
# create dict of my convos

st = ["hi",
      "Can the dogs come?",
      "meow",
      "mrow",
      "Bring the pups too",
      "Bring all the doggos",
      
      "Bring Echo!",
      "We would love to host you whenever you come down here!",
      "Heard you guys are buying a condo!  Congrats!",
      "hold please",
      #"https://www.youtube.com",
      "they better send mah money",
      
      "https://www.zillow.com",
      "Have you heard from her?",
      "I drank almost all of my tea",
      "just out of curiosity, what's going on with that?",
      "hiya",
      "What do you want for dinner?",
      
      "What's for dinner?",
      "Hope you are having a wonderful day!",
      "frustrated...",
      "You're hard to get ahold of!",
      "Good morning",
      "Sleep tight",
      
      "mlem",
      "mlem",
      "does this sound okay?",
      "does this sound good enough?",
      "Did the mail come?",
      "I just got a notice from",
      
      "I just did that",
      "my DL is being sent to that address",
      "if USPS isn't doing mail forwarding, I wont get it",
      "How are you doing today?",
      "How are you doing?",
      "will edit and translate",
      
      "Can you translate this?",
      "the dog kinda looks over that.. haha",
      "Foxy caught a rat",
      "How are the doggos?",
      "send pics",
      "What did they think?",
      
      "I miss the doggos",
      "wow!",
      "Those ears though!",
      "so cute",
      "how's my wifey doing?",
      "how is my wifey?",
      
      "How are you?",
      "I'm feeling unmotivated",
      "aww, yay",
      "not feeling motivated",
      "I'm not feeling motivated.",
      "sitting here staring off into nothing",
      
      "it is through the allsate guy",
      "I spoke to the insurance",
      "the insurance is all set up",
      "is that done yet?"
     ]

bt = ["hiya", 
      "Of course!  If they wont be trouble for you, I will absolutely bring them!",
      "mrow",
      "meow",
      "They will love the trip",
      "They would love to come",
      
      "She misses you!",
      "So looking forward to it!",
      "thank you!",
      "will do",
      #"can't watch the video rn"
      "lolol",
      
      "what's your favorite part of that place?",
      "Sent a text.",
      "do you need some more?",
      "well, let's discuss later in a call.",
      "hi",
      "How about Thai?",
      
      "Something yummy",
      "Awh, thanks!  I hope your day is going well too!",
      "what's up? can I help?",
      "so sorry... super busy day :/",
      "hiya! how are you today?",
      "I'll only let the Foxy lox bite ;) haha",
      
      "that kinda day, huh?",
      "thanks for the mlems",
      "yeh, sounds great",
      "yeh, you're a great writer, be you!",
      "Lemme go check...",
      "ooo, what about?",
      
      "you're a rockstar",
      "ohhh, hmmm",
      "yeh, I mean, it is all set up, but iunno",
      "iunno... ok I guess, you?",
      "I'm good, you?",
      "Thanks, you are brilliant.",
      
      "I can try.  Will you proof it for me?",
      "mwahahahaha",
      "oh man, lil hunter!",
      "Cute lil buggers",
      "will do!",
      "I think they liked it",
      
      "They miss you too!",
      "indeed",
      "cuteness",
      ":D",
      "she's happy that she is your wifey!",
      "She loves you <3",
      
      "procrastinating",
      "I feel ya.  Can I help?",
      ":D",
      "I feel that",
      "try to knock something simple off your list!",
      "You got this!  One step at a time.  Happy to help if I can.",
      
      "tytyty",
      "thank you so much for handling that!",
      "thanks for soing that.  I really appreciate it.",
      "it will be today."
     ]

if len(st) == len(bt):
    hangouts_responses = {}
    hangouts_responses["SenderText"] = st
    hangouts_responses["BotText"] = bt
    hangouts_responses = pd.DataFrame(data = hangouts_responses, dtype=str, columns = ['SenderText', 'BotText'])    
else: 
    print("The length of st is " + str(len(st)) + " but the length of bt is " + str(len(bt)))

#hangouts_responses
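
Keeping two parallel lists in sync by hand is easy to get wrong, which is why the cell above checks the lengths.  As a possible alternative, the sketch below stores each message next to its reply as a tuple, so a mismatch cannot happen.  The `pairs` list is a shortened, hypothetical stand-in for the full lists above.

```python
import pandas as pd

# Shortened stand-in for the full conversation lists: (incoming message, reply).
pairs = [
    ("hi", "hiya"),
    ("meow", "mrow"),
    ("What's for dinner?", "Something yummy"),
]

# Each tuple becomes one row, so prompts and replies stay aligned by construction.
pairs_df = pd.DataFrame(pairs, columns=["SenderText", "BotText"])
print(pairs_df)
```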

In [36]:
# add my convos to the previous data

df_all_resp = pd.concat([df, hangouts_responses], axis =0).reset_index()
df_all_resp.tail() #ensure the data was added

|   | index | SenderText | BotText |
|---|---|---|---|
| 3777 | 52 | sitting here staring off into nothing | You got this! One step at a time. Happy to h... |
| 3778 | 53 | it is through the allsate guy | tytyty |
| 3779 | 54 | I spoke to the insurance | thank you so much for handling that! |
| 3780 | 55 | the insurance is all set up | thanks for soing that. I really appreciate it. |
| 3781 | 56 | is that done yet? | it will be today. |


In [38]:
# make sure ALL the data was added to the df.  
len(df.index) + len(hangouts_responses) == len(df_all_resp)

True
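
The extra `index` column in the combined frame comes from `reset_index()`, which keeps the old row labels as a new column.  The sketch below, using two toy frames of my own, contrasts that with `ignore_index=True`, which simply renumbers the rows.

```python
import pandas as pd

# Toy stand-ins for the Kaggle frame and the Hangouts frame.
a = pd.DataFrame({"SenderText": ["hi"], "BotText": ["hiya"]})
b = pd.DataFrame({"SenderText": ["meow", "send pics"],
                  "BotText": ["mrow", "will do!"]})

# reset_index() keeps the old row labels in a new "index" column ...
with_old_labels = pd.concat([a, b], axis=0).reset_index()

# ... while ignore_index=True discards them and renumbers from 0.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)

print(list(with_old_labels.columns))  # ['index', 'SenderText', 'BotText']
print(list(renumbered.columns))       # ['SenderText', 'BotText']
```

Either form is fine for the length check above; the second just avoids carrying a column that is not part of the conversation data.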

### Exploratory Data Analysis (EDA)

#### Decision Tree Classifier

In [39]:
# define a specialized function for the CountVectorizer analyzer
def text_cleaner(x):
    # drop punctuation characters, then lowercase and split on whitespace
    no_punct = ''.join(ch for ch in x if ch not in string.punctuation)
    return no_punct.lower().split()
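
To see what the analyzer hands to `CountVectorizer`, here is a standalone copy of the same logic run on two of the Kaggle lines shown earlier.

```python
import string


def text_cleaner(x):
    # drop punctuation characters, then lowercase and split on whitespace
    no_punct = "".join(ch for ch in x if ch not in string.punctuation)
    return no_punct.lower().split()


print(text_cleaner("Hi, how are you doing?"))  # ['hi', 'how', 'are', 'you', 'doing']
print(text_cleaner("i'm fine."))               # ['im', 'fine']
```

Note that apostrophes count as punctuation, so "i'm" becomes the single token "im".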

In [40]:
# make a pipeline for classification
pipe = Pipeline([
    ('bow',CountVectorizer(analyzer=text_cleaner)),
    ('tfidf',TfidfTransformer()),
    ('classifier',DecisionTreeClassifier())
])

In [41]:
# fit on the Kaggle pairs in df; the Hangouts rows added to df_all_resp are not used here
pipe.fit(df.SenderText, df.BotText)

Pipeline(steps=[('bow',
                 CountVectorizer(analyzer=<function text_cleaner at 0x00000056618E8F78>)),
                ('tfidf', TfidfTransformer()),
                ('classifier', DecisionTreeClassifier())])
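
It may help to spell out what this pipeline is doing: every distinct response string is treated as a class label, and the tree learns to map word features of the incoming message to one of those labels.  The toy sketch below uses three pairs from my Hangouts list and the default `CountVectorizer` tokenizer rather than `text_cleaner`; the `random_state` value is an arbitrary choice of mine.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

prompts = ["hi", "meow", "send pics"]
replies = ["hiya", "mrow", "will do!"]

toy_pipe = Pipeline([
    ("bow", CountVectorizer()),        # word counts per message
    ("tfidf", TfidfTransformer()),     # reweight counts by how rare each word is
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
toy_pipe.fit(prompts, replies)

# One class per distinct reply string.
print(toy_pipe.named_steps["classifier"].classes_)
print(toy_pipe.predict(["meow"])[0])  # mrow
```

Because the bot can only ever return one of the replies it was trained on, it behaves like a lookup with some tolerance for wording changes rather than a text generator.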

#### Chatbot Functionality

Let's take a look at the responses the bot gives after training on the examples.

In [50]:
# testing with lines the bot knows

print(pipe.predict(['Hi, how are you doing'])[0])
print(pipe.predict(["i'm pretty good."])[0])
print(pipe.predict(['What did they say?'])[0])

i'm fine. how about yourself?
what's on tv?
they said i need a new hard drive.


In [53]:
# testing with lines similar to what the bot knows
print(pipe.predict(['Hi, how are you today'])[0]) #single word change
print(pipe.predict(["i'm good."])[0]) #single word excluded
print(pipe.predict(['hat did they say?'])[0])  #testing a typo

i don't know. i think i'm average.
what's on tv?
what happened?


In [56]:
# testing with lines unknown to the bot
print(pipe.predict(["where is the lizard?"])[0])
print(pipe.predict(["what should we have for lunch?"])[0])
print(pipe.predict(["what's your favorite food?"])[0])

yes, old people don't smell like fruit.
save your money for school.
i like to watch people.


Above, we can see that the bot does well on lines it was trained on, so-so on lines similar to its training data, and poorly on lines unlike anything it has seen, although those responses are pretty funny!
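
One rough way to put a number on "similar" versus "unknown" is the fraction of words in a message that appear in the training vocabulary.  `CountVectorizer` ignores words it did not see during fitting, so a message made entirely of new words reaches the tree as an all-zero vector and the tree still has to pick some response.  The sketch below is only an illustration: `known_fraction` is a hypothetical helper and the two training lines stand in for the real data.

```python
import string


def tokenize(text):
    # same cleaning idea as text_cleaner: strip punctuation, lowercase, split
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return no_punct.lower().split()


def known_fraction(text, vocabulary):
    """Share of tokens in `text` that appear in `vocabulary` (0.0 if no tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(tok in vocabulary for tok in tokens) / len(tokens)


# Stand-in for the training prompts.
training_lines = ["hi, how are you doing?", "what did they say?"]
vocabulary = {tok for line in training_lines for tok in tokenize(line)}

print(known_fraction("Hi, how are you today", vocabulary))  # 0.8
print(known_fraction("where is the lizard?", vocabulary))   # 0.0
```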

### Summary

In this notebook we prepared the dataset and fit a simple classifier-based bot to explore the data.  The bot was only okay on lines similar to those it knew and did poorly on unknown lines, so the pipeline will need to be improved in future work to give better predictions.
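
As one possible direction (not something tested in this notebook), the bot could look up the most similar known message and decline to answer when nothing is close enough.  The sketch below assumes TF-IDF vectors with cosine similarity; the `reply_or_none` helper and the 0.3 threshold are hypothetical choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

prompts = ["hi", "What's for dinner?", "send pics"]
replies = ["hiya", "Something yummy", "will do!"]

vectorizer = TfidfVectorizer()
prompt_matrix = vectorizer.fit_transform(prompts)


def reply_or_none(text, threshold=0.3):
    """Return the reply paired with the closest known prompt, or None if too far."""
    scores = cosine_similarity(vectorizer.transform([text]), prompt_matrix)[0]
    best = scores.argmax()
    return replies[best] if scores[best] >= threshold else None


print(reply_or_none("what is for dinner"))
print(reply_or_none("where is the lizard?"))
```

Returning `None` gives the calling code a chance to fall back to a default message instead of a random-looking response.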

#### Save Dataset

In [54]:
data_path_save = os.path.join(PROJ_ROOT,'Data', 'kaggle-dialogs-and-hangouts-dialogs.txt')
# save the combined frame (Kaggle + Hangouts), not the Kaggle-only df
df_all_resp.to_csv(data_path_save)
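
A quick way to confirm that a saved file reads back the same is a round trip on a small frame.  The sketch below writes a toy frame to a temporary directory; the tab separator and `index=False` are my own assumptions, chosen to match the layout of the original Kaggle file, and differ from the default comma-separated output used above.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"SenderText": ["hi, there"], "BotText": ["hiya"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dialogs.txt")
    # tab-separated, without the row index, so commas in the text need no quoting
    toy.to_csv(path, sep="\t", index=False)
    reloaded = pd.read_csv(path, sep="\t")

print(reloaded)
```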

#### Notebook details

In [55]:
# use watermark in a notebook with the following call
%load_ext watermark

# %watermark? #<-- watermark documentation

%watermark -a "H.GRYK" -d -t -v -p pandas
%watermark -p numpy
%watermark -p os
%watermark -p sys
%watermark -p nltk
%watermark -p sklearn
%watermark -p tqdm

H.GRYK 2020-10-23 20:15:21 

CPython 3.7.7
IPython 7.18.1

pandas 1.0.5
numpy 1.19.1
os unknown
sys 3.7.7 (default, May  6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
nltk 3.5
sklearn 0.23.2
tqdm 4.48.2
