# WhatsApp chat bot

## Summary

WhatsApp allows you to download the message history of any conversation. In this project, I have built a model that can take any WhatsApp message history and turn it into a chat bot. Because the chat bot is built from the messages of the friend the user has been messaging, its personality is based on that friend's.

The user uploads a WhatsApp message history, and the chat bot learns from this dataset. When the user then inputs a message, the model finds the message sent by the user in the dataset that is most similar to the input message. The chat bot then replies with the response that followed this most similar message in the dataset.

## Methodology

First, the model loads and processes the conversation history, which has been downloaded by the user directly from WhatsApp as a .txt file. Conversations tend to have groups of messages from each person before the other person responds. As a simplification, the model extracts the last message of a group sent by the user, and the first message received in response to this. This gives a pair of messages: a message sent by the user and a response from the user's friend. So at this stage the model has an array of messages and an array of corresponding responses.
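
The pairing step can be illustrated on its own. The following is a minimal sketch rather than the notebook's exact code: the function name `extract_pairs` and the sample lines are hypothetical, and it assumes each exported line has the form `[timestamp] Name: message`.

```python
def extract_pairs(lines, my_name):
    """Return (my_messages, friend_responses) from exported chat lines.

    Keeps the last message in each run of messages from `my_name` and the
    first reply that follows it.
    """
    my_messages, friend_responses = [], []
    previous_sender, previous_text = None, None
    for line in lines:
        # Skip blank lines and any line that does not start with a timestamp
        if not line.startswith('['):
            continue
        # Drop the timestamp, then split the sender from the message text
        _, _, rest = line.partition('] ')
        sender, separator, text = rest.partition(': ')
        if not separator:
            continue  # not a "Name: message" line
        # A reply directly after one of my messages completes a pair
        if previous_sender == my_name and sender != my_name:
            my_messages.append(previous_text)
            friend_responses.append(text)
        previous_sender, previous_text = sender, text
    return my_messages, friend_responses


# Hypothetical sample data in the assumed export format
sample = [
    '[01/02/2020, 09:00:00] Guy: Morning!',
    '[01/02/2020, 09:00:05] Guy: How are you?',
    '[01/02/2020, 09:01:00] Alison: Good thanks :)',
    '[01/02/2020, 09:01:10] Alison: You?',
    '[01/02/2020, 09:02:00] Guy: Not bad',
]
my_messages, friend_responses = extract_pairs(sample, my_name='Guy')
print(my_messages)       # the last message of my group
print(friend_responses)  # the first reply to it
```

In this sample only one pair is produced: 'Morning!' is dropped because it is not the last message of the group, and 'You?' is dropped because it is not the first reply.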

In processing the data, the model has an option to remove stop words. Stop words are typically removed from data used for Natural Language Processing because they are very common and have little specialised meaning. However, I have found that removing stop words in the chat bot worsens the results. This is likely to be because small talk contains a lot of stop words. For example, 'how', 'are', and 'you' are all stop words. By default, the chat bot therefore does not remove stop words.
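
To make the small-talk problem concrete, here is a minimal sketch of stop word filtering. The stop word set is a tiny hand-picked subset chosen for illustration (an assumption, not NLTK's full English list), and whitespace splitting stands in for `word_tokenize`; the helper name `remove_stop_words` is also hypothetical.

```python
# Tiny illustrative stop word set; an assumption standing in for
# nltk.corpus.stopwords.words('english')
STOP_WORDS = {'how', 'are', 'you', 'the', 'is', 'a', 'to'}


def remove_stop_words(message):
    """Lower-case the message, split on whitespace, and drop stop words."""
    return [word for word in message.lower().split() if word not in STOP_WORDS]


print(remove_stop_words('How are you'))
print(remove_stop_words('Meet at the station'))
```

The first message loses every token, so the model would have nothing left to embed, which is consistent with stop word removal hurting results on small talk.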

The chat bot also makes all of the chat data lower case, and by doing so assumes there is no important distinction between upper and lower case letters in the WhatsApp conversation.

Once the chat history has been converted into a list of messages from the user and a list of responses to the user, the messages from the user are tagged and used to train a doc2vec model. The model uses PV-DM (the Distributed Memory version of Paragraph Vector) to convert each message into a numerical vector, where each vector corresponds to a position in 'meaning space'.

After the chat bot has been trained on the input message history, it can have a conversation with the user. The user inputs a message, and the chat bot converts that message into a vector. The chat bot then finds the vector in the meaning space which is closest to the new message vector. In other words, it finds the message that has the closest meaning to the new message. The chat bot then simply returns the response from the user's friend that corresponds to the closest message.
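
The lookup step can be sketched without gensim. The example below uses made-up two-dimensional vectors and made-up responses, and ranks candidates by cosine similarity; the helper names `cosine_similarity` and `closest_index` are hypothetical, and real doc2vec vectors would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_index(query, vectors):
    """Index of the stored vector most similar to the query vector."""
    return max(range(len(vectors)),
               key=lambda i: cosine_similarity(query, vectors[i]))


# Made-up vectors for three stored messages, with their paired responses
message_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
responses = ['Good night!', 'In London', 'Miss you too']

# Made-up vector for a new input message
query_vector = [0.9, 0.1]
best = closest_index(query_vector, message_vectors)
print(responses[best])
```

The query vector points almost in the same direction as the first stored vector, so the first response is returned.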

## Import dependencies

In [1]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

## Chat bot class

In [2]:
class ChatBot:
    
    def __init__(self, my_name, friend_name, chat_file_path, remove_stop_words=False):
        '''
        Initialise the chat bot, and load and process the data.
        '''
        
        # The name of the person who has downloaded the chat data
        self.my_name = my_name
        # The name of the other person
        self.friend_name = friend_name
        # Option to remove stop words from the data
        self.remove_stop_words = remove_stop_words
        
        # Load the raw chat data, closing the file once it has been read
        with open(chat_file_path, encoding='utf-8') as chat:
            chat_text = chat.read()
        raw_chat = chat_text.splitlines()
        
        # List of messages from me
        self.me_chat = []
        # Corresponding list of responses to my messages
        self.friend_chat = []
        
        # The first element denotes the name of the sender and the second denotes the message sent
        previous_row = [None, None]

        # Iterate through all messages
        for row in raw_chat:
            # Check that the row is valid, i.e. excludes picture messages, etc.
            # Valid rows start with a '[' (this also skips empty rows)
            if not row.startswith('['):
                continue
            # Remove the time stamp, splitting only at its closing bracket
            row = row.split('] ', 1)[-1]
            # Split the remaining string into the name of the sender and the message,
            # keeping any later ': ' as part of the message text
            row = row.split(': ', 1)
            # Skip rows that are not of the form 'sender: message'
            if len(row) != 2:
                continue
            # Only keep the last message in a string of messages from me, and my friend's first response to this
            if previous_row[0] == my_name and row[0] != my_name:
                self.me_chat.append(previous_row[1])
                self.friend_chat.append(row[1])
            previous_row = row
        
        # Option to remove stop words
        if self.remove_stop_words:
            stop_words = set(stopwords.words('english'))

        # The tagged message data
        self.tagged_data = []

        # Create tagged documents for each message from me
        for i, text in enumerate(self.me_chat):
            message = word_tokenize(text.lower())
            if self.remove_stop_words:
                tagged_message = [word for word in message if word not in stop_words]
            else:
                tagged_message = message
            self.tagged_data.append(TaggedDocument(words=tagged_message, tags=[i]))

    def train(self, max_epochs=100, vec_size=100, alpha=0.025):
        '''
        Train a Doc2Vec model on the tagged messages and save it to disk.
        '''

        # Instantiate the Doc2Vec model
        self.model = Doc2Vec(vector_size=vec_size,
                             alpha=alpha, 
                             min_alpha=0.00025,
                             min_count=1,
                             dm=1)
        
        # Build the vocab using the tagged data
        self.model.build_vocab(self.tagged_data)
        
        # Iterate through each training epoch
        for epoch in range(max_epochs):
            # Output a message to indicate progress
            if epoch % 10 == 0:
                print('Iteration: ' +  str(epoch) + ' / ' + str(max_epochs))
            # Train the model
            self.model.train(self.tagged_data,
                             total_examples=self.model.corpus_count,
                             epochs=self.model.epochs)
            # Decrease the learning rate for the next pass
            self.model.alpha -= 0.0002
            # Hold the learning rate constant within a pass (no internal decay)
            self.model.min_alpha = self.model.alpha
        
        model_file_name = self.friend_name + "_doc2vec.model"
        self.model.save(model_file_name)
        self.model = Doc2Vec.load(model_file_name)
        self.model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
        print('Iteration: ' +  str(max_epochs) + ' / ' + str(max_epochs))
        print("Model Saved")
        
    def message(self, message):
        '''
        Print the friend's response to the stored message most similar to the input.
        '''

        # Make the message lower case and tokenize it
        message = word_tokenize(message.lower())
        
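        # Option to remove stop words
        if self.remove_stop_words:
            # Load the list here: the one built in __init__ is local to that method
            stop_words = set(stopwords.words('english'))
            message = [word for word in message if word not in stop_words]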
            
        # Infer the message's vector
        message_vector = self.model.infer_vector(message, epochs=1000)
        
        # Find the most similar message to the message given
        similar_message = self.model.docvecs.most_similar([message_vector])[0][0]
        print(self.friend_name + ': ' + self.friend_chat[similar_message])

## Testing

In [3]:
alison_chat_bot = ChatBot(my_name='Guy', friend_name='Alison',
                   chat_file_path='/Users/guybrett-robertson/Documents/data/whatsapp_chats/alison_chat.txt')

In [4]:
alison_chat_bot.train(max_epochs=100, vec_size=20)

Iteration: 0 / 100
Iteration: 10 / 100
Iteration: 20 / 100
Iteration: 30 / 100
Iteration: 40 / 100
Iteration: 50 / 100
Iteration: 60 / 100
Iteration: 70 / 100
Iteration: 80 / 100
Iteration: 90 / 100
Iteration: 100 / 100
Model Saved


### Input messages that are the same as messages present in the chat data set

We would expect the chat bot to give the same response as the original response that was given in the chat data.

In [5]:
print('Me: ' + alison_chat_bot.me_chat[213])
print('Alison: ' + alison_chat_bot.friend_chat[213])

Me: Gnight Alison x
Alison: Have a good night ☺️ x


In [6]:
alison_chat_bot.message('Gnight Alison x')

Alison: Have a good night ☺️ x


In [7]:
print('Me: ' + alison_chat_bot.me_chat[294])
print('Alison: ' + alison_chat_bot.friend_chat[294])

Me: I just totally forgot — where is your place a Londres?
Alison: Kilburn :)


In [8]:
alison_chat_bot.message('I just totally forgot — where is your place a Londres?')

Alison: Kilburn :)


### Messages that are similar but not identical to messages present in the chat data set

We would expect the chat bot to give some reasonable responses to these.

In [9]:
alison_chat_bot.message('Ca va?')

Alison: Ça va merci :) quoi de neuf aujourd'hui ?


In [10]:
alison_chat_bot.message('Yayyy')

Alison: Amazing 😄🥰


In [11]:
alison_chat_bot.message("How's it going?")

Alison: Bonjour toi! :)


In [12]:
alison_chat_bot.message("I miss you")

Alison: I miss you too x


### Completely new messages

We wouldn't expect the chat bot to give particularly reasonable responses to these.

In [13]:
alison_chat_bot.message('What is the meaning of life?')

Alison: Ah right! Hope it goes well :) did you know her already?


In [14]:
alison_chat_bot.message('The quick brown fox jumps over the lazy dog.')

Alison: Yes 😄


In [15]:
alison_chat_bot.message('What are you going to do yesterday?')

Alison: Basically I had left her gift (the frame) in my car so we would take it when she leaves to get the pizzas. Except she locked us inside, the door and the gate were closed 😆 so I had to climb out the window and stuff and then climb back in 🤣
