
# Chatbot Data Wrangling and Exploratory Data Analysis (EDA)


The intention of this project is to create a chatbot that will respond to my incoming hangouts messages.  In order to accomplish this, the bot will need some initial data to train on.  I will utilize past hangouts conversations as well as a more general question and answer dataset from kaggle. 

### Table of Contents

1. Data Wrangling 
- import packages
- load, view kaggle data, and add hangouts data

2. Exploratory Data Analysis (EDA)
- decision tree classifier
- chatbot functionality

3. Summary
- findings
- save dataset
- notebook details

### Data Wrangling
#### Import Packages

In [156]:
import numpy as np
import pandas as pd

import os
import sys

import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split

# use os to get path
PROJ_ROOT = os.path.join(os.pardir)
print(os.path.abspath(PROJ_ROOT))

C:\Users\Hailey\Documents\GitHub\SBwork\Capstone3-Chatbot


#### Load,  View Kaggle Data, and Add Hangouts Data
The data for this project is my google hangouts history (accessed using the instructions below) with the addition of extra data from a [kaggle](https://www.kaggle.com/grafstor/simple-dialogs-for-chatbot) dataset.  To access my chat history, I enabled the chats label (within gmail) by going to settings > labels > click show next to chats.  Then I was able to navigate to the chats label and a list of chat history was available.  

In [189]:
# load and view kaggle data
data_path = os.path.join(PROJ_ROOT,'Data', 'kaggle-dialogs-original.txt')

data = pd.read_csv(data_path, sep='\t', header=None)
data.columns = ['SenderText', 'BotText']
df = pd.DataFrame(data)
df.head(3)

Unnamed: 0,SenderText,BotText
0,"hi, how are you doing?",i'm fine. how about yourself?
1,i'm fine. how about yourself?,i'm pretty good. thanks for asking.
2,i'm pretty good. thanks for asking.,no problem. so how have you been?


In [204]:
# create dict of my convos
new_responses = { 'SenderText':,
                 'SenderText': 'hiya', 'BotText':'hi!',
                 'SenderText': "what's for dinner?", 'BotText': 'something yummy',
                 'SenderText':'Hope you are having a wonderful day!', 'BotText':'Thanks! Hope your day is going well too.',
                 'SenderText': "frustrated", 'BotText': "what's up, can I help?",
                 'SenderText': "you're hard to get ahold of!", 'BotText': "so sorry... super busy day!",
                 'SenderText': "Good morning", 'BotText': "hiya! how are you today?",
                 'SenderText': "mlem", 'BotText': "that kinda day, huh?",
                 'SenderText': "mlem", 'BotText': "thanks for the mlems",
                 'SenderText': "does this sound okay?", 'BotText': "yeh, that's awesome",
                 'SenderText': "Does this sound good enough?", 'BotText': "yeh, you're a great writer, be you!",
                 'SenderText': "Did the mail come?", 'BotText': "Lemme go check...",
                 'SenderText': "I just got a notice from", 'BotText': "ooo, about what?",
                 'SenderText': "I just did that", 'BotText': "you're a rockstar",
                 'SenderText': "my DL is being sent to that address", 'BotText': "ooohhh, hmmm...",
                 'SenderText': "if USPS isn't doing mail forwarding, I wont get it", 'BotText': "yeh, I mean, it is all set up, but iunno.",
                 'SenderText': "How are you doing today?", 'BotText': "iunno",
                 'SenderText': "How are you doing?", 'BotText': "I'm good, you?",
                 'SenderText': "will edit and translate", 'BotText': "thanks, you are brilliant",
                 'SenderText': "can you translate this?", 'BotText': "I can try, you'll have to proof it for me ;)",
                 'SenderText': "the dog kinda looks done with this", 'BotText': "mwahahahaha",
                 'SenderText': "Foxy caught a rat", 'BotText': "oh man, lil hunter",
                 'SenderText': "How are the doggos?", 'BotText': "Cute lil buggers",
                 'SenderText': "What did they think?", 'BotText': "They liked it",
                 'SenderText': "I miss the doggos", 'BotText': "They miss you too!",
                 'SenderText': "wow!", 'BotText': "indeed",
                 'SenderText': "Those ears though!", 'BotText': "cuteness",
                 'SenderText': "so cute", 'BotText': ":D",
                 'SenderText': "how is my Wifey doing?", 'BotText': "she's happy to be your wife!",
                 'SenderText': "how is my wifey?", 'BotText': "she loves you!",
                 'SenderText': "how are you?", 'BotText': "procrastinating",
                 'SenderText': "I'm feeling unmotivated", 'BotText': "I feel ya.  Can I help? ",
                 'SenderText': "aww, yay", 'BotText': ":D",
                 'SenderText': "not feeling motivated", 'BotText': "I feel that.",
                 'SenderText': "I'm not feeling motivated", 'BotText': "try to knock on thing off your list",
                 'SenderText': "sitting here staring at my talk", 'BotText': "You got this! One step at a time. Happy to help if I can.",
                 'SenderText': "it is represented by the allstate guy", 'BotText': "gotcha. ty.",
                 'SenderText': "i spoke to the insurance guy", 'BotText': "thank you so much for handling that!",
                 'SenderText': "the insurance is all set up", 'BotText': "thanks for doing that. i really appreciate it.",
                 'SenderText': "is that done yet?", 'BotText': "it will be today",
                }



hangouts_responses = {'SenderText':['hi', 
                                    "Can the dogs come?",
                                    "meow",
                                    "mrow",
                                    "Bring the pups too",
                                    "Bring all the doggos",
                                    "Bring Echo!",
                                    "We would love to host you whenever you come down here!",
                                    "Heard you guys are buying a condo!  Congrats!",
                                    "hold please",
                                    "https://www.youtube.com",
                                    "https://www.youtube.com",
                                    "they better send mah money",
                                    "https://www.zillow.com",
                                    "Have you heard from her?", 
                                    "I drank almost all of my tea",
                                    "just out of curiosity, what's going on with that"
                                   ], 
                     'BotText':['hiya', 
                               "Of course!  If they wont be trouble for you, I will absolutely bring them!",
                                "mrow",
                                "meow",
                                "They will love the trip",
                                "They would love to come",
                                "She misses you!",
                                "So looking forward to it!",
                                "thank you!",
                                "will do",
                                "cool, thanks",
                                "can't watch the video rn"
                                "lolol",
                                "what's your favorite part of that place?",
                                "Sent a text.",
                                "do you need some more?",
                                "well, let's discuss later in a call."
                               ]}

hangouts_responses = pd.DataFrame(data = hangouts_responses)
hangouts_responses

Unnamed: 0,SenderText,BotText
0,tesst,kjf
1,hi,r3r
2,coo,tip


In [194]:
# add my convos to the data
#df_all_resp = df.append(new_responses, ignore_index=True)
#df_all_resp.tail(5)

df_hangouts  = pd.DataFrame([new_responses])
df_all_resp = pd.concat([df, df_hangouts], axis =0) .reset_index()
df_hangouts.tail()
#df_all_resp.tail()

Unnamed: 0,SenderText,BotText
0,mrow,meow


In [187]:
len(df.index) + len(sender) == len(df_all_resp)


False

### Exploratory Data Analysis (EDA)

#### Decision Tree Classifier

In [145]:
# define a specialized function for the CountVectorizer analyzer
def text_cleaner(x):
    return [a for a in (''.join([a for a in x if a not in string.punctuation])).lower().split()]

In [154]:
# make a pipeline for classification
pipe = Pipeline([
    ('bow',CountVectorizer(analyzer=text_cleaner)),
    ('tfidf',TfidfTransformer()),
    ('classifier',DecisionTreeClassifier())
])

In [155]:
pipe.fit(df.SenderText, df.BotText)

Pipeline(steps=[('bow',
                 CountVectorizer(analyzer=<function text_cleaner at 0x0000004EA01C4E58>)),
                ('tfidf', TfidfTransformer()),
                ('classifier', DecisionTreeClassifier())])

#### Chatbot Functionality

Let's take a look at the responses the bot gives us after training on the examples

In [157]:
# testing with lines the bot knows
print(pipe.predict(['Hello'])[0])
print(pipe.predict(['Hi, how are you doing'])[0])
print(pipe.predict(["i'm pretty good."])[0])

thank you.
i'm fine. how about yourself?
worried about what?


In [158]:
# testing with lines similar to what the bot knows
print(pipe.predict(['helo'])[0])  #testing a typo
print(pipe.predict(['Hi, how are you today'])[0]) #single word change
print(pipe.predict(["i'm good."])[0]) #single word excluded

thank you.
i attended school today. did you?
worried about what?


In [159]:
# testing with lines unknown to the bot
print(pipe.predict(["where is the lizard?"])[0])
print(pipe.predict(["what do you want for dinner?"])[0])
print(pipe.predict(["what's your favorite food?"])[0])

i'd have to say babe ruth.
i'm not voting for the mayor.
my favorite movie is superbad.


Above, we can see the bot performs very well on lines it was trained on, so-so on lines similar to what it was trained on, and pretty poorly on lines not like anything it has seen, although these responses are pretty funny!

### Summary

#### Save Dataset

In [110]:
data_path_save = os.path.join(PROJ_ROOT,'Data', 'kaggle-dialogs-and-hangouts-dialogs.txt')
df.to_csv(data_path_save)

#### Notebook details

In [14]:
# use watermark in a notebook with the following call
%load_ext watermark

# %watermark? #<-- watermark documentation

%watermark -a "H.GRYK" -d -t -v -p pandas
%watermark -p numpy
%watermark -p os
%watermark -p sys
%watermark -p nltk
%watermark -p sklearn
%watermark -p tqdm

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
H.GRYK 2020-10-23 13:20:31 

CPython 3.7.7
IPython 7.18.1

pandas 1.0.5
numpy 1.19.1
os unknown
sys 3.7.7 (default, May  6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
nltk 3.5
sklearn 0.23.2
tqdm 4.48.2
