# Nastya Chatbot Training Data

This notebook takes our large CSV of message data and converts it into a dataframe with `text` and `response` columns.
The texts come solely from Pratik and the responses come solely from Anastasiia.

References:

- https://towardsdatascience.com/how-to-build-an-easy-quick-and-essentially-useless-chatbot-using-your-own-text-messages-f2cb8b84c11d

In [64]:
import pandas as pd
from string import punctuation
import re

### 1. Getting the data into the format we want

Reading in the CSV file

In [25]:
df = pd.read_csv('data/message_data.csv', index_col=[0])
df.head()

Unnamed: 0,content,sender_name,timestamp_ms
0,Okay babes,Anastasiia Morozova,1582067120307
1,!!!,Pratik Karki,1582067116590
2,I just got done,Pratik Karki,1582067113534
3,I thought you were done!,Anastasiia Morozova,1582067104121
4,The person was a annoying ones,Pratik Karki,1582067101785


Sorting the dataframe by date

In [26]:
df.sort_values(by='timestamp_ms', inplace=True)
df.reset_index(inplace=True)
df.head()

Unnamed: 0,index,content,sender_name,timestamp_ms
0,182484,Cool,Anastasiia Morozova,1508118080364
1,182483,That was a quick reply,Pratik Karki,1508386653307
2,182482,Took me a while to realize how cool this fact is.,Anastasiia Morozova,1508417569122
3,182481,Lol,Pratik Karki,1508421997827
4,182480,That's kinda mean,Pratik Karki,1508422003040
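
To check that the sort really is chronological, the integer timestamps can be turned into readable datetimes. This is a minimal sketch, assuming `timestamp_ms` holds milliseconds since the Unix epoch (which the column name suggests but the notebook does not state). `sample` and `sent_at` are hypothetical names used only for this illustration; the values are copied from the output above.

```python
import pandas as pd

# Toy frame with the same column names as the notebook, deliberately out of order.
sample = pd.DataFrame({
    "content": ["Lol", "Cool", "That was a quick reply"],
    "timestamp_ms": [1508421997827, 1508118080364, 1508386653307],
})

# Sort chronologically and rebuild a clean 0..n-1 index in one step.
sample = sample.sort_values(by="timestamp_ms").reset_index(drop=True)

# Assumption: the integers are milliseconds since the Unix epoch.
sample["sent_at"] = pd.to_datetime(sample["timestamp_ms"], unit="ms")
print(sample[["content", "sent_at"]])
```

Passing `drop=True` to `reset_index` avoids creating the extra `index` column that the cell above has to drop later.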


Adding an `is_from_me` column and replacing the old `index` column with the new chronological position

In [27]:
df['is_from_me'] = ['1' if x == 'Pratik Karki' else '0' for x in df['sender_name']]
df.drop('index', axis=1, inplace=True)
df['index'] = df.index
df.head()

Unnamed: 0,content,sender_name,timestamp_ms,is_from_me,index
0,Cool,Anastasiia Morozova,1508118080364,0,0
1,That was a quick reply,Pratik Karki,1508386653307,1,1
2,Took me a while to realize how cool this fact is.,Anastasiia Morozova,1508417569122,0,2
3,Lol,Pratik Karki,1508421997827,1,3
4,That's kinda mean,Pratik Karki,1508422003040,1,4
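
The same flag can be built without a Python-level loop. This is a minimal sketch using `numpy.where`, assuming NumPy is importable (it is installed as a pandas dependency). `sample` is a hypothetical toy frame. The flags are kept as the strings `'1'` and `'0'` to match the column built above.

```python
import numpy as np
import pandas as pd

# Toy frame with the same sender names as the notebook.
sample = pd.DataFrame({
    "sender_name": ["Anastasiia Morozova", "Pratik Karki", "Pratik Karki"],
})

# Vectorized comparison: '1' where the sender is Pratik, '0' otherwise.
sample["is_from_me"] = np.where(sample["sender_name"] == "Pratik Karki", "1", "0")
print(sample)
```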


### 2. Now we can make training data

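For this, we want to create a dataframe with `text` and `response` columns. Each row pairs a run of consecutive messages from Pratik (`text`) with the run of replies from Anastasiia that follows it (`response`).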

In [115]:
#remove the nulls
df.dropna(subset = ['content'], inplace=True)

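#helper functions
def make_sentences(series):
    # join consecutive messages from one sender into a single string
    sentence = '. '.join(series)
    # strip URLs
    sentence = re.sub(r'http\S+', '', sentence)
    # drop non-ASCII characters (emoji, curly quotes, mis-encoded bytes)
    sentence = ''.join(filter(lambda x: ord(x)<128,sentence))
    return sentence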

#initialize empty df
train_data = pd.DataFrame(columns = ['text','response'])
train_data.at[0, 'text'] = ''
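
To see what `make_sentences` does to a run of messages, here is a small worked example. The function body is copied from the cell above so the snippet runs on its own; the `messages` list and the `example.com` URL are made up for illustration.

```python
import re

def make_sentences(series):
    sentence = '. '.join(series)
    sentence = re.sub(r'http\S+', '', sentence)
    sentence = ''.join(filter(lambda x: ord(x) < 128, sentence))
    return sentence

# Hypothetical run of three messages: plain text, a bare URL, and a curly apostrophe.
messages = ["Check this out", "https://example.com/video", "We\u2019re good bros"]

joined = make_sentences(messages)
print(repr(joined))  # 'Check this out.  Were good bros'
```

Two side effects are visible here. The URL pattern also swallows the period that follows the link, leaving a double space, and the curly apostrophe is dropped entirely, which is why "We're" appears as "Were" in the training data.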

First we test on a 500-row slice of the larger dataframe

In [120]:
# Create test dataframe to test our function
test_df = pd.DataFrame(df).set_index('index')[5000:5500].copy(deep=True)
test_df

index,content,sender_name,timestamp_ms,is_from_me
5000,"See, we go way back",Anastasiia Morozova,1514506389243,0
5001,Weâre good bros,Anastasiia Morozova,1514506397433,0
5002,Of course,Pratik Karki,1514506411094,1
5003,If he's your bro then that must everyone you'v...,Pratik Karki,1514506440771,1
5004,Wow thanks for taking that away from me,Anastasiia Morozova,1514506716924,0
...,...,...,...,...
5495,See what matters isnât the intensity or the ...,Anastasiia Morozova,1514662718791,0
5496,That's what she said haha,Pratik Karki,1514664728652,1
5497,I'm starting bm at the last episode,Pratik Karki,1514664736566,1
5498,Noooo,Anastasiia Morozova,1514664981092,0


In [125]:
def create_train_data(train_data, df):
    # store current text and response sentences
    text_sentence = []
    response_sentence = []

    train_data_row = 0

    #iterate through each convo
    for index, row in df.iterrows():
        # retrieve current text and remove end punctuation
        curr_string = row['content'].rstrip(punctuation)

        if row['is_from_me'] == '1':
            # going from response to text
            if len(response_sentence) > 0:
                sentence = make_sentences(response_sentence)
                train_data.at[train_data_row,'response'] = sentence
                response_sentence.clear()
                train_data_row+=1 # only increment when response is over
            text_sentence.append(curr_string)
        else:
            if len(text_sentence) > 0:
                sentence = make_sentences(text_sentence)
                train_data.at[train_data_row,'text'] = sentence
                text_sentence.clear()
            response_sentence.append(curr_string)

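    # note: the final run of messages is never flushed by the loop, so the last
    # row can hold a text without a response; fill it with '' only in that case
    # so that a complete last row is not overwritten
    if pd.isna(train_data['response'].iloc[-1]):
        train_data.iloc[-1, train_data.columns.get_loc('response')] = ''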

In [126]:
create_train_data(train_data, test_df)
train_data

Unnamed: 0,text,response
0,,"See, we go way back. Were good bros"
1,Of course. If he's your bro then that must eve...,Wow thanks for taking that away from me. Is it...
2,No I don't want you to take him away from me,Thats what I meant. And had in mind *evil laug...
3,Haha. Good luck,What
4,I said good luck with that,Thanks Ill need it
...,...,...
136,I think you should be able to express a full r...,"Me? No. I feel it a lot, but I know I should h..."
137,Yeah you're stronger than me in that aspect,How do you mean
138,You don't feel as sad. So you have better ment...,You dont know that. You dont know how or what ...
139,You just said that you don't let it weigh you ...,See what matters isnt the intensity or the fre...
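
The core of `create_train_data` is grouping consecutive messages from the same sender into one sentence. That grouping step alone can be sketched with `shift` and `cumsum`. This is an illustration of the idea on a hypothetical toy frame, not a replacement for the function: it does not strip punctuation or URLs, and it does not pair texts with responses.

```python
import pandas as pd

# Toy conversation using the notebook's column names and string flags.
toy = pd.DataFrame({
    "content": ["Hi", "You there?", "Yes", "What's up", "Not much"],
    "is_from_me": ["1", "1", "0", "0", "1"],
})

# A new run starts whenever the sender flag differs from the previous row.
run_id = (toy["is_from_me"] != toy["is_from_me"].shift()).cumsum().rename("run_id")

# Collapse each run into one row: who sent it and the joined sentence.
grouped = toy.groupby(run_id)
runs = pd.DataFrame({
    "is_from_me": grouped["is_from_me"].first(),
    "sentence": grouped["content"].agg(". ".join),
}).reset_index(drop=True)
print(runs)
```

Here the five messages collapse into three runs: two from Pratik with one from Anastasiia in between.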


Then we work with the actual dataset

In [127]:
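# fresh frame so no rows from the test run linger
train_data = pd.DataFrame({'text': ['']}, columns=['text', 'response'])
create_train_data(train_data, df)
train_data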

Unnamed: 0,text,response
0,,Cool
1,That was a quick reply,Took me a while to realize how cool this fact is
2,Lol. That's kinda mean,"Didn't intend an insult, but hey, being mean i..."
3,It's true. I'm mean sometimes. It keeps things...,"Are you lawful, neutral or chaotic mean"
4,Definitely chaotic. I make people question the...,Lawful. I'm systematically and elegantly mean
...,...,...
53705,[ ] 1 cups refined flour\n[ ] 4 medium potato...,Anastasiia sent an attachment
53706,The video chat ended. No hay nada en la seccin...,You missed a call from Anastasiia.
53707,You called Anastasiia. Anastasiia missed you...,Why did you call
53708,To find your phone. Can you ask. If I can use ...,Anastasiia called you. You missed a call from ...


Save as CSV

In [129]:
train_data.to_csv('train_data.csv',index=False)
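
One thing to watch when loading this file back: empty strings, such as the empty `text` in row 0, are written as empty fields and come back as `NaN` under the default `read_csv` settings. The sketch below shows the round trip on a hypothetical two-row frame, using an in-memory buffer instead of a file so it has no side effects.

```python
import io
import pandas as pd

# Toy frame whose first text is empty, like row 0 of the training data.
pairs = pd.DataFrame({"text": ["", "Lol"], "response": ["Cool", "What"]})

# Write to an in-memory buffer rather than disk.
buffer = io.StringIO()
pairs.to_csv(buffer, index=False)

# Default parsing turns the empty field into NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string.
buffer.seek(0)
kept_empty = pd.read_csv(buffer, keep_default_na=False)

print(default_read)
print(kept_empty)
```

Note that `keep_default_na=False` also stops strings such as `NA` or `null` from being read as missing, which is usually what you want for free-form chat text.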