# Nastya Chatbot Training Data

Takes our large CSV of all the message data and converts it into a dataframe with `text` and `response` columns.
The responses are solely from Anastasiia. The texts are meant to be from Pratik, although the code below treats every sender other than Anastasiia as the text side.

References:

- https://towardsdatascience.com/how-to-build-an-easy-quick-and-essentially-useless-chatbot-using-your-own-text-messages-f2cb8b84c11d

In [223]:
import pandas as pd
from string import punctuation
import re

### 1. Getting the data into the format we want

Reading in the CSV file

In [224]:
df1 = pd.read_csv('data/message_data.csv', index_col=[0], encoding='utf-8')
df2 = pd.read_csv('data/merged_data.csv', index_col=[0], encoding='utf-8')
df1.head()

Unnamed: 0,content,sender_name,timestamp_ms
0,Okay babes,Anastasiia Morozova,1582067120307
1,!!!,Pratik Karki,1582067116590
2,I just got done,Pratik Karki,1582067113534
3,I thought you were done!,Anastasiia Morozova,1582067104121
4,The person was a annoying ones,Pratik Karki,1582067101785


In [225]:
df2.head()

Unnamed: 0,content,sender_name,timestamp_ms
0,facebook me if you are coming to the dinner? p...,Deka Abdirahman,1410973204870
1,everyone is welcomed to join us that would be ...,Deka Abdirahman,1410917500280
2,send the birthday notes and gifts to Helena an...,Deka Abdirahman,1410917468397
3,Yeah will do,Karnika Arora,1410917423351
4,Hahahah okay I'm so confused,Karnika Arora,1410917403573


In [226]:
# merging the two dataframes
df = pd.concat([df1, df2])
df.tail()

Unnamed: 0,content,sender_name,timestamp_ms
72392,Do u wanna go to solera at some point tonight?,Marysia Ciupka,1506213673344
72393,"Say hi to your new Facebook friend, Edoardo.",Anastasiia Morozova,1476096498581
72394,John sent an attachment.,John Patrick,1469074859637
72395,https://www.youtube.com/watch?v=j556MWGVVqI,John Patrick,1468900364603
72396,congrats on joining HBX :),Nay Than,1434591830971


Sorting the dataframe by date

In [227]:
df.sort_values(by='timestamp_ms', inplace=True)
df.reset_index(inplace=True)
df.tail()

Unnamed: 0,index,content,sender_name,timestamp_ms
293775,15480,$42. Venmo is lindsay-helmrich,Lindsay Helmrich,1587410873969
293776,15479,Just sent!,Anastasiia Morozova,1587413189573
293777,15478,"Thank you, enjoy!",Lindsay Helmrich,1587413204954
293778,202065,https://www.indianhealthyrecipes.com/chicken-b...,Pratik Karki,1587442142067
293779,202064,https://www.reddit.com/r/AnimalsBeingDerps/com...,Pratik Karki,1587442934960
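
The `timestamp_ms` values are Unix epoch milliseconds, so sorting them as integers already gives chronological order. A minimal sketch of making that readable, using a small hypothetical `messages` frame with two rows copied from the output above (the `sent_at` column is an assumption, not part of the original notebook):

```python
import pandas as pd

# two rows borrowed from the output above, deliberately out of order
messages = pd.DataFrame({
    'content': ['Just sent!', '$42. Venmo is lindsay-helmrich'],
    'timestamp_ms': [1587413189573, 1587410873969],
})

# interpret the integers as milliseconds since the Unix epoch
messages['sent_at'] = pd.to_datetime(messages['timestamp_ms'], unit='ms')

# drop=True discards the old index instead of keeping it as an extra column
messages = messages.sort_values('timestamp_ms').reset_index(drop=True)
```

Passing `drop=True` to `reset_index` would also avoid the leftover `index` column that the next cell has to remove.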


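Adding the `is_from_me` column (`'0'` for Anastasiia, `'1'` for every other sender) and replacing the old `index` column with the new sorted position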

In [228]:
df['is_from_me'] = ['0' if x == 'Anastasiia Morozova' else '1' for x in df['sender_name']]
df.drop('index', axis=1, inplace=True)
df['index'] = df.index
df.head()

Unnamed: 0,content,sender_name,timestamp_ms,is_from_me,index
0,ÐÐ°ÑÑÑ) Ð¿ÑÐ¸Ð²ÐµÑÐ¸Ðº) Ð½Ðµ Ð·Ð½Ð°ÐµÑÑ...,Masha Zhuzlyakova,1290611508000,1,0
1,"Ð½Ð°Ð¶Ð¸Ð¼Ð°ÐµÑÑ Ð½Ð° ÐºÐ½Ð¾Ð¿ÐºÑ"" Ð°ÐºÐºÐ°...",Anastasiia Morozova,1290612218000,0,1
2,Ð¡Ð¿Ð°ÑÐ¸Ð±ÐºÐ¸ Ð±Ð¾Ð»ÑÑÐ¾Ðµ!!!!! Ð±Ð»Ð¸Ð½...,Masha Zhuzlyakova,1290612529000,1,2
3,"Ð²ÑÐµ Ð¾ÑÐ»Ð¸ÑÐ½Ð¾, ÑÑÐµÐ±Ð° ÑÐ¾Ð¶Ðµ Ð½Ð...",Anastasiia Morozova,1290663463000,0,3
4,ÐÐ¹..Ñ Ð¼ÐµÐ½Ñ Ñ ÑÑÐµÐ±Ð¾Ð¹ Ð² Ð½Ð°Ð¿ÑÑ...,Masha Zhuzlyakova,1290716748000,1,4


### 2. Now we can make training data

For this, we want to create a dataframe with `text` and `response` columns.

But first, we remove the messages that Facebook generates itself, such as 'X called you' or 'X is waving at you'.

In [229]:
facebook_generated_strings = ['sent an attachment', 'called you.', 'sent a live location', 
                              'hi to your new Facebook friend',
                              'is waving at you', 'ou missed a call from'
                             ]

In [230]:
# Creating test data for our purposes
test = {'content': ['Hello', 'normal text', 'You sent an attachment.', 'Say hi to your new Facebook friend, Andrew.',
                    'You sent an attachment.', 'Ella called you.', 'Ella sent a live location.',
                    'John is waving at you!', 'You missed a call from Mateusz.', 'fine!',
                    'ÐÐ¹..Ñ Ð¼ÐµÐ½Ñ Ñ ÑÑÐµÐ±Ð¾Ð¹',
                    'ÐÐ¹..Ñ Ð¼ÐµÐ½Ñ Ñ ÑÑÐµÐ±Ð¾Ð¹ Ð² Ð½Ð°Ð¿ÑÑ...'
                   ],
        'sender_name': ['Anastasiia Morozova', 'Anastasiia Morozova', 'Anastasiia Morozova', 
                        'Ella Nicolson', 'Ella Nicolson', 'Ella Nicolson',
                        'Anastasiia Morozova', 'Anastasiia Morozova', 'Anastasiia Morozova', 'Ella Nicolson',
                        'someone', 's'
                       ], 
        'timestamp': ['123', '123', '123','123','123','123', '123', '123', '123', '123', '123','123']}

test_df = pd.DataFrame.from_dict(test)

test_df

Unnamed: 0,content,sender_name,timestamp
0,Hello,Anastasiia Morozova,123
1,normal text,Anastasiia Morozova,123
2,You sent an attachment.,Anastasiia Morozova,123
3,"Say hi to your new Facebook friend, Andrew.",Ella Nicolson,123
4,You sent an attachment.,Ella Nicolson,123
5,Ella called you.,Ella Nicolson,123
6,Ella sent a live location.,Anastasiia Morozova,123
7,John is waving at you!,Anastasiia Morozova,123
8,You missed a call from Mateusz.,Anastasiia Morozova,123
9,fine!,Ella Nicolson,123


In [231]:
# Remove all non-ASCII characters (this drops the garbled Cyrillic text)
test_df['content'] = test_df['content'].replace(r'[^\x00-\x7F]+', '', regex=True)

test_df

Unnamed: 0,content,sender_name,timestamp
0,Hello,Anastasiia Morozova,123
1,normal text,Anastasiia Morozova,123
2,You sent an attachment.,Anastasiia Morozova,123
3,"Say hi to your new Facebook friend, Andrew.",Ella Nicolson,123
4,You sent an attachment.,Ella Nicolson,123
5,Ella called you.,Ella Nicolson,123
6,Ella sent a live location.,Anastasiia Morozova,123
7,John is waving at you!,Anastasiia Morozova,123
8,You missed a call from Mateusz.,Anastasiia Morozova,123
9,fine!,Ella Nicolson,123
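
Stripping non-ASCII characters discards the Russian messages entirely. If the garbled text is UTF-8 that was decoded as Latin-1 (an assumption; the notebook does not confirm how the export was encoded), it may be recoverable instead. A hedged sketch, where `fix_mojibake` is a hypothetical helper:

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the input unchanged if it does not apply."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # not this kind of mojibake (already clean text, or a truncated byte sequence)
        return text

# simulate the damage on a known string, then repair it
garbled = 'Привет'.encode('utf-8').decode('latin-1')
repaired = fix_mojibake(garbled)
```

Plain ASCII passes through untouched, so the helper could be applied to a whole column with `Series.map` before deciding whether to strip anything.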


In [232]:
test_df = test_df[~test_df["content"].str.contains('|'.join(facebook_generated_strings))]

test_df

Unnamed: 0,content,sender_name,timestamp
0,Hello,Anastasiia Morozova,123
1,normal text,Anastasiia Morozova,123
9,fine!,Ella Nicolson,123
10,..,someone,123
11,.. ...,s,123
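
One detail of the filter above: `str.contains` treats the joined string as a regular expression, so the `.` in `'called you.'` matches any character. A minimal sketch of a stricter variant, assuming literal matching is what is wanted (the `sample` frame is made up for illustration):

```python
import re
import pandas as pd

facebook_generated_strings = ['sent an attachment', 'called you.', 'is waving at you']

# escape each marker so '.' only matches a literal full stop
pattern = '|'.join(re.escape(s) for s in facebook_generated_strings)

sample = pd.DataFrame({'content': ['Hello', 'Ella called you.', 'She called your mom', None]})

# na=False treats missing content as "not generated" instead of propagating NaN
is_generated = sample['content'].str.contains(pattern, regex=True, na=False)
kept = sample[~is_generated]

# for comparison: the unescaped pattern also flags the genuine third message
unescaped = sample['content'].str.contains('|'.join(facebook_generated_strings), na=False)
```

With `na=False` the filter also works before the nulls are dropped, so the order of the two cleaning steps stops mattering.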


In [233]:
# remove the nulls
df.dropna(subset=['content'], inplace=True)

# remove Facebook-generated text
df = df[~df["content"].str.contains('|'.join(facebook_generated_strings))]

# helper functions
def make_sentences(series):
    # join a run of consecutive messages into one sentence-like string
    sentence = '. '.join(series)
    # strip URLs, newlines, and any remaining non-ASCII characters
    sentence = re.sub(r'http\S+', '', sentence)
    sentence = re.sub(r'\n', '', sentence)
    sentence = ''.join(filter(lambda x: ord(x)<128,sentence))
    return sentence

# initialize empty df
train_data = pd.DataFrame(columns = ['text','response'])
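
To see what `make_sentences` does to a run of messages, here is a small self-contained example. The function body is copied from the cell above; the input messages and the `example.com` URL are made up:

```python
import re

def make_sentences(series):
    sentence = '. '.join(series)
    sentence = re.sub(r'http\S+', '', sentence)
    sentence = re.sub(r'\n', '', sentence)
    sentence = ''.join(filter(lambda x: ord(x) < 128, sentence))
    return sentence

result = make_sentences(['Hello', 'check this http://example.com/a', 'ok\nbye'])
print(repr(result))  # 'Hello. check this  okbye'
```

Two side effects are visible here: the URL pattern also swallows the joining full stop that follows it, and deleting newlines glues `ok` and `bye` together. Replacing `\n` with a space would avoid the second one, if that matters for training.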

First we test with a subset of the larger dataframe

In [234]:
# Create test dataframe to test our function
test_df = pd.DataFrame(df).set_index('index')[:1000].copy(deep=True)
test_df

Unnamed: 0_level_0,content,sender_name,timestamp_ms,is_from_me
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,ÐÐ°ÑÑÑ) Ð¿ÑÐ¸Ð²ÐµÑÐ¸Ðº) Ð½Ðµ Ð·Ð½Ð°ÐµÑÑ...,Masha Zhuzlyakova,1290611508000,1
1,"Ð½Ð°Ð¶Ð¸Ð¼Ð°ÐµÑÑ Ð½Ð° ÐºÐ½Ð¾Ð¿ÐºÑ"" Ð°ÐºÐºÐ°...",Anastasiia Morozova,1290612218000,0
2,Ð¡Ð¿Ð°ÑÐ¸Ð±ÐºÐ¸ Ð±Ð¾Ð»ÑÑÐ¾Ðµ!!!!! Ð±Ð»Ð¸Ð½...,Masha Zhuzlyakova,1290612529000,1
3,"Ð²ÑÐµ Ð¾ÑÐ»Ð¸ÑÐ½Ð¾, ÑÑÐµÐ±Ð° ÑÐ¾Ð¶Ðµ Ð½Ð...",Anastasiia Morozova,1290663463000,0
4,ÐÐ¹..Ñ Ð¼ÐµÐ½Ñ Ñ ÑÑÐµÐ±Ð¾Ð¹ Ð² Ð½Ð°Ð¿ÑÑ...,Masha Zhuzlyakova,1290716748000,1
...,...,...,...,...
995,haha,Anastasiia Morozova,1317514409790,0
996,haha,Mike Becker,1317514411656,1
997,whos your coordinator,Mike Becker,1317514416197,1
998,Marie Lackore,Anastasiia Morozova,1317514453017,0


In [235]:
def create_train_data(train_data, df):
    # buffers for the current run of text messages and of response messages
    text_sentence = []
    response_sentence = []

    train_data_row = 0

    # iterate through each message in chronological order
    for index, row in df.iterrows():
        # retrieve current message and remove trailing punctuation
        curr_string = row['content'].rstrip(punctuation)

        if row['is_from_me'] == '1':
            # switching from response to text: flush the buffered response
            if len(response_sentence) > 0:
                sentence = make_sentences(response_sentence)
                train_data.at[train_data_row,'response'] = sentence
                response_sentence.clear()
                train_data_row+=1 # only increment when response is over
            text_sentence.append(curr_string)
        else:
            # switching from text to response: flush the buffered text
            if len(text_sentence) > 0:
                sentence = make_sentences(text_sentence)
                train_data.at[train_data_row,'text'] = sentence
                text_sentence.clear()
            response_sentence.append(curr_string)

    # if the last row has no response (NaN), store an empty string instead
    if len(train_data) > 0 and pd.isna(train_data['response'].iloc[-1]):
        train_data.iloc[-1, train_data.columns.get_loc('response')] = ''
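
Growing a dataframe cell by cell inside `iterrows` is slow on a few hundred thousand messages. As an alternative, the sketch below groups consecutive messages from the same side and pairs each text run with the response run that follows it. It is a simplified variant rather than a drop-in replacement: `pair_turns` is a hypothetical name, unpaired runs are dropped, and the URL and punctuation cleanup is left out.

```python
import pandas as pd

def pair_turns(df):
    """Pair each run of text messages with the run of responses right after it."""
    # a new run starts whenever is_from_me differs from the previous row
    run_id = (df['is_from_me'] != df['is_from_me'].shift()).cumsum().rename('run')
    runs = (
        df.groupby(run_id)
          .agg(side=('is_from_me', 'first'),
               sentence=('content', lambda s: '. '.join(s)))
          .reset_index(drop=True)
    )
    rows = []
    for i in range(len(runs) - 1):
        # '1' marks the text side, '0' marks Anastasiia's responses
        if runs.at[i, 'side'] == '1' and runs.at[i + 1, 'side'] == '0':
            rows.append({'text': runs.at[i, 'sentence'],
                         'response': runs.at[i + 1, 'sentence']})
    return pd.DataFrame(rows, columns=['text', 'response'])

demo = pd.DataFrame({
    'content': ['hi', 'you there', 'yes', 'ok', 'bye'],
    'is_from_me': ['1', '1', '0', '1', '0'],
})
pairs = pair_turns(demo)
```

The loop now runs once per conversational turn rather than once per message, and the result is built in a single `DataFrame` call.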
        

### Then we work with the actual dataset

In [236]:
from datetime import datetime 

start_time = datetime.now() 

create_train_data(train_data, df)
train_data

Unnamed: 0,text,response
0,) ) ?)))) -,""" "" , "" "" ,"
1,"!!!!! , )) ?? ?)))",", )))) ) ? ?"
2,.. .. .... . ... ))) ....,", - ) , ....... , !!!! , , -:) ,..."
3,... ... ... )) .... ??,", ))) , . Have you recieved the official..."
4,",",
...,...,...
68855,Lol. You sent me already. Package will go out...,Canceled or postponed
68856,Well they havent decided yet. I guess technica...,Yeah that would be unfortunate for you. So Im ...
68857,Yeah yahoo thought itd be the flagship for the...,Well just saying that season 6 aint bad
68858,"Good, Im excited to watch it. I got my stimulu...",


In [237]:
print('Time elapsed (hh:mm:ss.ms) {}'.format(datetime.now() - start_time))

Time elapsed (hh:mm:ss.ms) 0:13:41.450188


Save as CSV

In [238]:
train_data.to_csv('data/train_data_2.csv',index=False)
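
One thing to watch when loading this file again: `to_csv` writes an empty response as an empty field, and `read_csv` turns empty fields back into `NaN` by default. A minimal sketch of the round trip, using an in-memory buffer in place of `data/train_data_2.csv` and a single made-up row:

```python
import io
import pandas as pd

train_data = pd.DataFrame({'text': ['Good, Im excited to watch it'], 'response': ['']})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
train_data.to_csv(buffer, index=False)

# default behaviour: the empty response comes back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings
buffer.seek(0)
kept_empty = pd.read_csv(buffer, keep_default_na=False)
```

Whichever option is used, the training code downstream needs to agree on whether a missing response is `''` or `NaN`.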