# Customer Support Chatbot

We will use deep learning to train a customer support chatbot on the Customer Support on Twitter dataset, which can be found here: https://www.kaggle.com/thoughtvector/customer-support-on-twitter

## Import libraries

In [1]:
import re
import random
import time

import keras
import pandas as pd
import sklearn
import nltk
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.tokenize import TweetTokenizer

Using TensorFlow backend.


## Import Data

In [2]:
data = pd.read_csv('twcs.csv')

In [3]:
data.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


In [4]:
data.shape

(2811774, 7)

## Data Preprocessing

We need to create a new dataset where each row contains a customer's tweet and a company's response.

First, the customer tweets that start a conversation need to be identified: tweets that are inbound and are not a response to any other tweet.

In [5]:
customer_tweets = data[pd.isnull(data.in_response_to_tweet_id) & data.inbound]

In [6]:
customer_tweets.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
6,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,
12,18,115713,True,Tue Oct 31 19:56:01 +0000 2017,@115714 y’all lie about your “great” connectio...,17,
14,20,115715,True,Tue Oct 31 22:03:34 +0000 2017,"@115714 whenever I contact customer support, t...",19,
23,29,115716,True,Tue Oct 31 22:01:35 +0000 2017,actually that's a broken link you sent me and ...,28,
25,31,115717,True,Tue Oct 31 22:06:54 +0000 2017,"Yo @Ask_Spectrum, your customer service reps a...",30,


Now the desired dataset will be created by joining each of these customer tweets with the tweets that respond to it.

In [7]:
customer_tweets_and_responses = pd.merge(customer_tweets, data, left_on='tweet_id', 
                                  right_on='in_response_to_tweet_id')

In [8]:
customer_tweets_and_responses.head()

Unnamed: 0,tweet_id_x,author_id_x,inbound_x,created_at_x,text_x,response_tweet_id_x,in_response_to_tweet_id_x,tweet_id_y,author_id_y,inbound_y,created_at_y,text_y,response_tweet_id_y,in_response_to_tweet_id_y
0,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,6,sprintcare,False,Tue Oct 31 21:46:24 +0000 2017,@115712 Can you please send us a private messa...,57.0,8.0
1,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,9,sprintcare,False,Tue Oct 31 21:46:14 +0000 2017,@115712 I would love the chance to review the ...,,8.0
2,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,10,sprintcare,False,Tue Oct 31 21:45:59 +0000 2017,@115712 Hello! We never like our customers to ...,,8.0
3,18,115713,True,Tue Oct 31 19:56:01 +0000 2017,@115714 y’all lie about your “great” connectio...,17,,17,sprintcare,False,Tue Oct 31 19:59:13 +0000 2017,@115713 H there! We'd definitely like to work ...,16.0,18.0
4,20,115715,True,Tue Oct 31 22:03:34 +0000 2017,"@115714 whenever I contact customer support, t...",19,,19,sprintcare,False,Tue Oct 31 22:10:10 +0000 2017,@115715 Please send me a private message so th...,,20.0
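
To make the self-join concrete, here is a minimal sketch on a hand-made four-row frame. The rows and the names `toy`, `openers` and `pairs` are invented for illustration; only the column names come from the dataset. It repeats the two steps above: select inbound tweets that reply to nothing, then attach every tweet that replies to them.

```python
import pandas as pd

# Toy data: tweet 1 opens a conversation; tweets 2 and 3 both reply to it.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'author_id': ['1001', 'sprintcare', '1002', 'sprintcare'],
    'inbound': [True, False, True, False],
    'text': ['my phone is broken', 'please DM us', 'same here', 'thanks!'],
    'in_response_to_tweet_id': [float('nan'), 1.0, 1.0, 3.0],
})

# Step 1: inbound tweets that are not a response to anything.
openers = toy[toy.in_response_to_tweet_id.isnull() & toy.inbound]

# Step 2: pair each opener (suffix _x) with every tweet replying to it (suffix _y).
pairs = pd.merge(openers, toy, left_on='tweet_id',
                 right_on='in_response_to_tweet_id')

print(pairs[['text_x', 'author_id_y', 'inbound_y', 'text_y']])
```

In this toy frame one of the two paired replies has `inbound_y` set to True, which shows that the join alone does not guarantee the reply comes from a company account.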


The merged dataset may contain replies that were written by other customers rather than by a company (rows where inbound_y is True).
These rows need to be removed.

In [9]:
customer_tweets_and_responses = customer_tweets_and_responses[customer_tweets_and_responses.inbound_y == False]

In [10]:
customer_tweets_and_responses.head()

Unnamed: 0,tweet_id_x,author_id_x,inbound_x,created_at_x,text_x,response_tweet_id_x,in_response_to_tweet_id_x,tweet_id_y,author_id_y,inbound_y,created_at_y,text_y,response_tweet_id_y,in_response_to_tweet_id_y
0,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,6,sprintcare,False,Tue Oct 31 21:46:24 +0000 2017,@115712 Can you please send us a private messa...,57.0,8.0
1,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,9,sprintcare,False,Tue Oct 31 21:46:14 +0000 2017,@115712 I would love the chance to review the ...,,8.0
2,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,10,sprintcare,False,Tue Oct 31 21:45:59 +0000 2017,@115712 Hello! We never like our customers to ...,,8.0
3,18,115713,True,Tue Oct 31 19:56:01 +0000 2017,@115714 y’all lie about your “great” connectio...,17,,17,sprintcare,False,Tue Oct 31 19:59:13 +0000 2017,@115713 H there! We'd definitely like to work ...,16.0,18.0
4,20,115715,True,Tue Oct 31 22:03:34 +0000 2017,"@115714 whenever I contact customer support, t...",19,,19,sprintcare,False,Tue Oct 31 22:10:10 +0000 2017,@115715 Please send me a private message so th...,,20.0


In [11]:
customer_tweets_and_responses.shape

(794299, 14)

We decided to train our models on a 1% sample of the dataset, since our machine did not have enough RAM to train the models on the entire dataset.

In [12]:
customer_tweets_and_responses = customer_tweets_and_responses.sample(frac=0.01, random_state=42)

In [13]:
customer_tweets_and_responses.shape

(7943, 14)

Next, the anonymized (numeric) customer screen names will be replaced with a generic placeholder (@_sn_). Company handles such as @sprintcare are kept.

In [14]:
# Replace anonymized screen names with common token @_sn_

sn_re = re.compile(r'@([^\s:]+)')

def sn_replace(txt):
    # Substitute only the matched handle, so the same digits appearing
    # elsewhere in the tweet (e.g. an order number) are left untouched.
    return sn_re.sub(
        lambda m: '@_sn_' if m.group(1).isnumeric() else m.group(0),
        txt,
    )

x_text = customer_tweets_and_responses.text_x.apply(sn_replace)
y_text = customer_tweets_and_responses.text_y.apply(sn_replace)

In [15]:
x_text.head()

835199    @GWRHelp hi, Paddington to Swindon is delayed ...
251227    Nothing like boarding your #HorizonAir @Alaska...
507635    @AppleSupport iOS 11 is so bugged I need to re...
833816    Sooo the global bank system for @_sn_ is down ...
438132    Most Postal people would probably bawk at me f...
Name: text_x, dtype: object
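
As a quick check of the intended behavior, the sketch below runs a handle-anonymizing helper on three invented tweets. The helper name `anonymize_handles` and the sample strings are assumptions made for this example. A regex substitution anchored on the `@` only touches the handle itself, so the same digits elsewhere in the text are left alone.

```python
import re

# Hypothetical helper: '@' followed by any run of non-space, non-colon characters.
handle_re = re.compile(r'@([^\s:]+)')

def anonymize_handles(text):
    """Replace all-numeric handles with @_sn_ and keep named handles."""
    return handle_re.sub(
        lambda m: '@_sn_' if m.group(1).isnumeric() else m.group(0),
        text,
    )

examples = [
    '@115712 I understand.',
    '@sprintcare and how do you propose we do that',
    'Order 115712 for @115712: still missing',
]
cleaned = [anonymize_handles(t) for t in examples]
for before, after in zip(examples, cleaned):
    print(f'{before!r} -> {after!r}')
```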

In [16]:
# 8192 - large enough for demonstration, larger values make network training slower
MAX_VOCAB_SIZE = 2**13

count_vec = CountVectorizer(tokenizer=TweetTokenizer().tokenize, max_features=MAX_VOCAB_SIZE - 3)
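# Fit on tweets and responses as separate documents; x_text + y_text would
# concatenate each pair row by row without a separating space.
count_vec.fit(pd.concat([x_text, y_text]))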
analyzer = count_vec.build_analyzer()

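The vocabulary will have three special tokens in addition to the words learned from the tweets. '__unk__' (UNK) stands for an unknown word, '__pad__' (PAD) is used to pad short messages, and '__start__' (START) indicates the beginning of a tweet. This is why the vectorizer above keeps only MAX_VOCAB_SIZE - 3 words.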

In [17]:
UNK = 0
PAD = 1
START = 2 
vocab = {k: v + 3 for k, v in count_vec.vocabulary_.items()}
vocab['__unk__'] = UNK
vocab['__pad__'] = PAD
vocab['__start__'] = START
# Used to turn seq2seq predictions into human readable strings
reverse_vocab = {v: k for k, v in vocab.items()}

In [18]:
len(vocab)

8192
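
The index layout can be illustrated on a tiny corpus. This is a minimal sketch in which `collections.Counter` stands in for `CountVectorizer`; the corpus, the vocabulary size of 8 and the names `toy_vocab` and `toy_reverse` are assumptions made for the example.

```python
from collections import Counter

UNK, PAD, START = 0, 1, 2
NUM_SPECIAL = 3
MAX_VOCAB = 8  # tiny on purpose; the notebook uses 2**13

corpus = ['my phone is down', 'my internet is down again', 'is my order late']
counts = Counter(tok for line in corpus for tok in line.split())

# Keep the most frequent words, leaving room for the three special tokens.
kept = [word for word, _ in counts.most_common(MAX_VOCAB - NUM_SPECIAL)]

# Shift every word index by 3 so that 0, 1 and 2 stay reserved.
toy_vocab = {word: idx + NUM_SPECIAL for idx, word in enumerate(sorted(kept))}
toy_vocab.update({'__unk__': UNK, '__pad__': PAD, '__start__': START})

# Inverse mapping, used to turn predicted indices back into words.
toy_reverse = {idx: word for word, idx in toy_vocab.items()}

print(len(toy_vocab), sorted(toy_vocab.items(), key=lambda kv: kv[1]))
```

Because the word indices are shifted by exactly the number of special tokens, the final vocabulary size equals the chosen maximum.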

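'word_idx' converts a sentence into a fixed-length sequence of MAX_MESSAGE_LEN word indices: longer messages are truncated, shorter ones are padded with PAD, and words outside the vocabulary become UNK. 'word_idx_r' converts such a sequence back into a sentence, skipping the padding.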

In [19]:
MAX_MESSAGE_LEN = 30

def word_idx(sentence):
    full_length = [vocab.get(tok, UNK) for tok in analyzer(sentence)] + [PAD] * MAX_MESSAGE_LEN
    return full_length[:MAX_MESSAGE_LEN]

def word_idx_r(word_idxs):
    return ' '.join(reverse_vocab[idx] for idx in word_idxs if idx != PAD).strip()

In [20]:
x = np.vstack(x_text.apply(word_idx).values)
y = np.vstack(y_text.apply(word_idx).values)
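
The pad-then-truncate trick can be seen in isolation with a toy vocabulary. In this sketch the vocabulary, the length of 6 and the function name `encode` are assumptions; whitespace splitting replaces the tweet tokenizer.

```python
UNK, PAD = 0, 1
MAX_LEN = 6  # the notebook uses 30

# Hypothetical toy vocabulary; indices 0-2 are reserved for special tokens.
toy_vocab = {'my': 3, 'phone': 4, 'is': 5, 'down': 6}

def encode(sentence):
    """Map tokens to indices, then pad with PAD and cut to MAX_LEN."""
    ids = [toy_vocab.get(tok, UNK) for tok in sentence.lower().split()]
    # Appending MAX_LEN pads first guarantees the slice is always full length.
    return (ids + [PAD] * MAX_LEN)[:MAX_LEN]

short_msg = encode('My phone is down')
long_msg = encode('my phone is down and my phone is very slow')
print(short_msg)
print(long_msg)
```

Both results have the same length, which is what allows the rows to be stacked into a single array.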

In [21]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [22]:
# Embedding size for individual words
EMBEDDING_SIZE = 100
# Embedding size for an entire message (size of the encoder's context vector)
CONTEXT_SIZE = 100
BATCH_SIZE = 4
DROPOUT = 0.2
# Learning rate
lr=0.005

## Models

A few seq2seq models will be trained and compared. 

In [23]:
from keras.models import Model
from keras.optimizers import Adam
from keras.layers import Dense, Input, LSTM, Dropout, Embedding, RepeatVector, concatenate, TimeDistributed
from keras.utils import np_utils

In [24]:
def nn_model():
    shared_embedding = Embedding(
        output_dim=EMBEDDING_SIZE,
        input_dim=MAX_VOCAB_SIZE,
        input_length=MAX_MESSAGE_LEN,
        name='embedding',
    )
    
    # ENCODER
    
    encoder_input = Input(
        shape=(MAX_MESSAGE_LEN,),
        dtype='int32',
        name='encoder_input',
    )
    
    embedded_input = shared_embedding(encoder_input)
    
    
    encoder_rnn = LSTM(
        CONTEXT_SIZE,
        name='encoder',
        dropout=DROPOUT,
    )
    
    context = RepeatVector(MAX_MESSAGE_LEN)(encoder_rnn(embedded_input))
    
    # DECODER
    
    last_word_input = Input(
        shape=(MAX_MESSAGE_LEN, ),
        dtype='int32',
        name='last_word_input',
    )
    
    embedded_last_word = shared_embedding(last_word_input)
    
    decoder_input = concatenate([embedded_last_word, context], axis=2)
    
    decoder_rnn = LSTM(
        CONTEXT_SIZE,
        name='decoder',
        return_sequences=True,
        dropout=DROPOUT
    )
    
    decoder_output = decoder_rnn(decoder_input)
    
    next_word_dense = TimeDistributed(
        Dense(MAX_VOCAB_SIZE, activation='softmax'),
        name='next_word_dense',
    )(decoder_output)
    
    return Model(inputs=[encoder_input, last_word_input], outputs=[next_word_dense])

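s2s_model = nn_model()
optimizer = Adam(lr=lr, clipvalue=5.0)
# Note: the string 'adam' selects Adam with its default settings, so the
# optimizer configured above (lr, clipvalue) is not applied to this model.
s2s_model.compile(optimizer='adam', loss='categorical_crossentropy')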

In [25]:
s2s_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
last_word_input (InputLayer)    (None, 30)           0                                            
__________________________________________________________________________________________________
encoder_input (InputLayer)      (None, 30)           0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 30, 100)      819200      encoder_input[0][0]              
                                                                 last_word_input[0][0]            
__________________________________________________________________________________________________
encoder (LSTM)                  (None, 100)          80400       embedding[0][0]                  
__________
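
The parameter counts shown in the summary can be checked by hand. The sketch below assumes the usual LSTM layout of four gates, each with input weights, recurrent weights and a bias; the helper names are made up for this example. The decoder figure is derived from the code (its input is the word embedding concatenated with the context vector) and is not shown in the truncated summary.

```python
MAX_VOCAB_SIZE = 2 ** 13
EMBEDDING_SIZE = 100
CONTEXT_SIZE = 100

def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias) per unit.
    return 4 * (input_dim + units + 1) * units

emb = embedding_params(MAX_VOCAB_SIZE, EMBEDDING_SIZE)
enc = lstm_params(EMBEDDING_SIZE, CONTEXT_SIZE)
# The decoder input is the embedding concatenated with the repeated context.
dec = lstm_params(EMBEDDING_SIZE + CONTEXT_SIZE, CONTEXT_SIZE)

print(emb, enc, dec)
```

The first two values agree with the 819200 and 80400 reported for the embedding and encoder layers.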

In [27]:
def add_start_token(y_array):
    """ Add start token to vectors. """
    return np.hstack([
        START * np.ones((len(y_array), 1)),
        y_array[:, :-1],
    ])

def binarize_labels(labels):
    """ Turns integer word indexes into one-hot encoded matrices for 
        the expected model output.
    """
    return np.array([np_utils.to_categorical(row, num_classes=MAX_VOCAB_SIZE)
                     for row in labels])
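
These two helpers prepare the data for teacher forcing: at each position the decoder is fed the previous target word and is asked to predict the current one. A minimal NumPy sketch on a two-message toy batch follows; the six-word vocabulary and the array values are invented, and `np.eye` indexing is used here in place of `to_categorical`.

```python
import numpy as np

PAD, START = 1, 2
VOCAB = 6  # tiny vocabulary for illustration

targets = np.array([[3, 4, 5, PAD],
                    [5, 3, PAD, PAD]])

# Decoder input: START followed by the targets shifted right by one position.
start_column = np.full((len(targets), 1), START)
decoder_input = np.hstack([start_column, targets[:, :-1]])

# Expected output: one one-hot vector per position, shape (batch, length, VOCAB).
one_hot = np.eye(VOCAB)[targets]

print(decoder_input)
print(one_hot.shape)
```

Row by row, position t of `decoder_input` holds the word that precedes position t of `targets`, which is the alignment the decoder is trained on.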

In [28]:
def respond_to(model, text):
    """ Generates a response to a customer's tweet """
    input_y = add_start_token(PAD * np.ones((1, MAX_MESSAGE_LEN)))
    idxs = np.array(word_idx(text)).reshape((1, MAX_MESSAGE_LEN))
    for position in range(MAX_MESSAGE_LEN - 1):
        prediction = model.predict([idxs, input_y]).argmax(axis=2)[0]
        input_y[:,position + 1] = prediction[position]
    return word_idx_r(model.predict([idxs, input_y]).argmax(axis=2)[0])
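
The loop in `respond_to` is greedy decoding: the most probable word at each position is written back into the decoder input before the next position is predicted. The sketch below reproduces that loop with a stand-in for the trained network. `fake_predict` is hypothetical and simply favors the token that follows the previous one, so the result is easy to trace by hand.

```python
import numpy as np

VOCAB, MAX_LEN, PAD, START = 6, 5, 1, 2

def fake_predict(decoder_input):
    """Stand-in for model.predict: scores of shape (1, MAX_LEN, VOCAB)
    that always favor the token (previous token + 1) modulo VOCAB."""
    scores = np.zeros((1, MAX_LEN, VOCAB))
    for t in range(MAX_LEN):
        prev = int(decoder_input[0, t])
        scores[0, t, (prev + 1) % VOCAB] = 1.0
    return scores

def greedy_decode():
    # Start with START followed by padding, as respond_to does.
    decoder_input = np.full((1, MAX_LEN), PAD)
    decoder_input[0, 0] = START
    for position in range(MAX_LEN - 1):
        prediction = fake_predict(decoder_input).argmax(axis=2)[0]
        # Feed the chosen word back in as the next decoder input.
        decoder_input[0, position + 1] = prediction[position]
    return fake_predict(decoder_input).argmax(axis=2)[0].tolist()

decoded = greedy_decode()
print(decoded)
```

Note that the whole sequence is re-predicted at every step, so generating a response costs one forward pass per output position.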

The following function will be used to train the models.

In [29]:
def train_model(model, start_idx, end_idx):
    
    b_train_y = binarize_labels(y_train[start_idx:end_idx])
    input_train_y = add_start_token(y_train[start_idx:end_idx])
    
    model.fit(
        [X_train[start_idx:end_idx], input_train_y], 
        b_train_y,
        epochs=1,
        batch_size=BATCH_SIZE,
        verbose=1,
    )
    
    rand_idx = random.sample(list(range(len(X_test))), SUB_BATCH_SIZE)
    
    print('Test results:', model.evaluate(
        [X_test[rand_idx], add_start_token(y_test[rand_idx])],
        binarize_labels(y_test[rand_idx])
    ))
    
    input_strings = [
        "@AmazonHelp I hadnt expected that such a big brand like amazon would have such a poor customer service.",
    ]
    
    for input_string in input_strings:
        output_string = respond_to(model, input_string)
        print(f'> "{input_string}"\n< "{output_string}"')

The following block of code will train the model for 50 epochs.

In [0]:
SUB_BATCH_SIZE = 64
for epoch in range(50):
    print(f'Training in epoch {epoch}...')
    for start_idx in range(0, len(X_train), SUB_BATCH_SIZE):
        train_model(s2s_model, start_idx, start_idx + SUB_BATCH_SIZE)
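
One plausible reason for feeding the data in slices of 64 messages is the size of the one-hot labels, which are built per slice inside `train_model`. The arithmetic below is a rough sketch under two assumptions that are not stated in the notebook: the labels are stored as 4-byte floats, and the training split holds about 6,354 messages (roughly 80% of the 7,943 sampled pairs).

```python
MAX_VOCAB_SIZE = 2 ** 13
MAX_MESSAGE_LEN = 30
SUB_BATCH_SIZE = 64
BYTES_PER_VALUE = 4   # assumption: float32 labels
TRAIN_ROWS = 6354     # assumption: about 80% of 7,943 pairs

def one_hot_bytes(num_messages):
    # One value per (message, position, vocabulary entry).
    return num_messages * MAX_MESSAGE_LEN * MAX_VOCAB_SIZE * BYTES_PER_VALUE

sub_batch_mib = one_hot_bytes(SUB_BATCH_SIZE) / 2 ** 20
full_set_gib = one_hot_bytes(TRAIN_ROWS) / 2 ** 30

print(f'one sub-batch: {sub_batch_mib:.1f} MiB')
print(f'whole training split: {full_set_gib:.2f} GiB')
```

Under these assumptions a single slice needs tens of mebibytes, while one-hot encoding the whole training split at once would need several gibibytes.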

The model architecture will be saved to a JSON file. The weights of the model will be saved in an HDF5 file.

In [0]:
# serialize model to JSON
model_json = s2s_model.to_json()
with open("model_50e.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
s2s_model.save_weights("model_50e.h5")
print("Saved model to disk")

Saved model to disk


Now a model with two additional LSTM layers will be trained: one in front of the encoder LSTM and one after the decoder LSTM. A ReLU dense layer is also added before the softmax output.

In [26]:
def nn_model():
    shared_embedding = Embedding(
        output_dim=EMBEDDING_SIZE,
        input_dim=MAX_VOCAB_SIZE,
        input_length=MAX_MESSAGE_LEN,
        name='embedding',
    )
    
    # ENCODER
    
    encoder_input = Input(
        shape=(MAX_MESSAGE_LEN,),
        dtype='int32',
        name='encoder_input',
    )
    
    embedded_input = shared_embedding(encoder_input)
    
    
    encoder_rnn = LSTM(
        CONTEXT_SIZE,
        name='encoder',
        dropout=DROPOUT,
    )
    
    first_lstm = LSTM(
        CONTEXT_SIZE,
        name='first_lstm',
        return_sequences=True,
        dropout=DROPOUT
    )
    
    
    context = RepeatVector(MAX_MESSAGE_LEN)(encoder_rnn(first_lstm(embedded_input)))
    
    # DECODER
    
    last_word_input = Input(
        shape=(MAX_MESSAGE_LEN, ),
        dtype='int32',
        name='last_word_input',
    )
    
    embedded_last_word = shared_embedding(last_word_input)
    
    decoder_input = concatenate([embedded_last_word, context], axis=2)
    
    decoder_rnn = LSTM(
        CONTEXT_SIZE,
        name='decoder',
        return_sequences=True,
        dropout=DROPOUT
    )
    
    last_lstm = LSTM(
        CONTEXT_SIZE,
        name='last_lstm',
        return_sequences=True,
        dropout=DROPOUT
    )
    
    decoder_output = last_lstm(decoder_rnn(decoder_input))
    
    next_word_dense = TimeDistributed(
        Dense(int(MAX_VOCAB_SIZE/2), activation='relu'),
        name='next_word_dense',
    )(decoder_output)
    
    next_word = TimeDistributed(
        Dense(MAX_VOCAB_SIZE, activation='softmax'),
        name='next_word_softmax'
    )(next_word_dense)
    
    return Model(inputs=[encoder_input, last_word_input], outputs=[next_word])

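s2s_model = nn_model()
optimizer = Adam(lr=lr, clipvalue=5.0)
# Note: the string 'adam' selects Adam with its default settings, so the
# optimizer configured above (lr, clipvalue) is not applied to this model.
s2s_model.compile(optimizer='adam', loss='categorical_crossentropy')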

In [27]:
s2s_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
last_word_input (InputLayer)    (None, 30)           0                                            
__________________________________________________________________________________________________
encoder_input (InputLayer)      (None, 30)           0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 30, 100)      819200      encoder_input[0][0]              
                                                                 last_word_input[0][0]            
__________________________________________________________________________________________________
first_lstm (LSTM)               (None, 30, 100)      80400       embedding[0][0]                  
__________

In [None]:
SUB_BATCH_SIZE = 64
for epoch in range(50):
    print(f'Training in epoch {epoch}...')
    for start_idx in range(0, len(X_train), SUB_BATCH_SIZE):
        train_model(s2s_model, start_idx, start_idx + SUB_BATCH_SIZE)

In [None]:
# serialize model to JSON
model_json = s2s_model.to_json()
with open("model_2l_50e.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
s2s_model.save_weights("model_2l_50e.h5")
print("Saved model to disk")

Now a model with four additional LSTM layers will be trained: two in front of the encoder LSTM and two after the decoder LSTM.

In [28]:
def nn_model():
    shared_embedding = Embedding(
        output_dim=EMBEDDING_SIZE,
        input_dim=MAX_VOCAB_SIZE,
        input_length=MAX_MESSAGE_LEN,
        name='embedding',
    )
    
    # ENCODER
    
    encoder_input = Input(
        shape=(MAX_MESSAGE_LEN,),
        dtype='int32',
        name='encoder_input',
    )
    
    embedded_input = shared_embedding(encoder_input)
    
    
    encoder_rnn = LSTM(
        CONTEXT_SIZE,
        name='encoder',
        dropout=DROPOUT,
    )
    
    first_lstm = LSTM(
        CONTEXT_SIZE,
        name='first_lstm',
        return_sequences=True,
        dropout=DROPOUT
    )
    
    second_lstm = LSTM(
        CONTEXT_SIZE,
        name='second_lstm',
        return_sequences=True,
        dropout=DROPOUT
    )
    
    context = RepeatVector(MAX_MESSAGE_LEN)(encoder_rnn(second_lstm(first_lstm(embedded_input))))
    
    # DECODER
    
    last_word_input = Input(
        shape=(MAX_MESSAGE_LEN, ),
        dtype='int32',
        name='last_word_input',
    )
    
    embedded_last_word = shared_embedding(last_word_input)
    
    decoder_input = concatenate([embedded_last_word, context], axis=2)
    
    decoder_rnn = LSTM(
        CONTEXT_SIZE,
        name='decoder',
        return_sequences=True,
        dropout=DROPOUT
    )
    
    second_last_lstm = LSTM(
        CONTEXT_SIZE,
        name='second_last_lstm',
        return_sequences=True,
        dropout=DROPOUT
    )
    
    last_lstm = LSTM(
        CONTEXT_SIZE,
        name='last_lstm',
        return_sequences=True,
        dropout=DROPOUT
    )
    
    decoder_output = last_lstm(second_last_lstm(decoder_rnn(decoder_input)))
    
    next_word_dense = TimeDistributed(
        Dense(int(MAX_VOCAB_SIZE/2), activation='relu'),
        name='next_word_dense',
    )(decoder_output)
    
    next_word = TimeDistributed(
        Dense(MAX_VOCAB_SIZE, activation='softmax'),
        name='next_word_softmax'
    )(next_word_dense)
    
    return Model(inputs=[encoder_input, last_word_input], outputs=[next_word])

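s2s_model = nn_model()
optimizer = Adam(lr=lr, clipvalue=5.0)
# Note: the string 'adam' selects Adam with its default settings, so the
# optimizer configured above (lr, clipvalue) is not applied to this model.
s2s_model.compile(optimizer='adam', loss='categorical_crossentropy')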

In [29]:
s2s_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
last_word_input (InputLayer)    (None, 30)           0                                            
__________________________________________________________________________________________________
encoder_input (InputLayer)      (None, 30)           0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 30, 100)      819200      encoder_input[0][0]              
                                                                 last_word_input[0][0]            
__________________________________________________________________________________________________
first_lstm (LSTM)               (None, 30, 100)      80400       embedding[0][0]                  
__________

In [None]:
SUB_BATCH_SIZE = 64
for epoch in range(50):
    print(f'Training in epoch {epoch}...')
    for start_idx in range(0, len(X_train), SUB_BATCH_SIZE):
        train_model(s2s_model, start_idx, start_idx + SUB_BATCH_SIZE)

In [None]:
# serialize model to JSON
model_json = s2s_model.to_json()
with open("model_4l_50e.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
s2s_model.save_weights("model_4l_50e.h5")
print("Saved model to disk")

Now the models will be compared.

### Model 1

In [0]:
from keras.models import model_from_json
# load json and create model
with open('model_50e.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model_50e.h5")
print("Loaded model from disk")

Loaded model from disk


In [0]:
loaded_model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
last_word_input (InputLayer)    (None, 30)           0                                            
__________________________________________________________________________________________________
encoder_input (InputLayer)      (None, 30)           0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 30, 100)      819200      encoder_input[0][0]              
                                                                 last_word_input[0][0]            
__________________________________________________________________________________________________
encoder (LSTM)                  (None, 100)          80400       embedding[0][0]            

In [0]:
loaded_model.compile(optimizer='adam', loss='categorical_crossentropy')

print('Test results:', loaded_model.evaluate(
        [X_test, add_start_token(y_test)],
        binarize_labels(y_test)
    ))

Test results: 11.969458524801208


### Model 2

In [0]:
# load json and create model
with open('model_2l_50e.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model_2l_50e.h5")
print("Loaded model from disk")

Loaded model from disk


In [0]:
loaded_model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
last_word_input (InputLayer)    (None, 30)           0                                            
__________________________________________________________________________________________________
encoder_input (InputLayer)      (None, 30)           0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 30, 100)      819200      encoder_input[0][0]              
                                                                 last_word_input[0][0]            
__________________________________________________________________________________________________
first_lstm (LSTM)               (None, 30, 100)      80400       embedding[0][0]            

In [0]:
loaded_model.compile(optimizer='adam', loss='categorical_crossentropy')

print('Test results:', loaded_model.evaluate(
        [X_test, add_start_token(y_test)],
        binarize_labels(y_test)
    ))

Test results: 4.170023515436018


### Model 3

In [32]:
# load json and create model
with open('model_4l_50e.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model_4l_50e.h5")
print("Loaded model from disk")

Loaded model from disk


In [33]:
loaded_model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
last_word_input (InputLayer)    (None, 30)           0                                            
__________________________________________________________________________________________________
encoder_input (InputLayer)      (None, 30)           0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 30, 100)      819200      encoder_input[0][0]              
                                                                 last_word_input[0][0]            
__________________________________________________________________________________________________
first_lstm (LSTM)               (None, 30, 100)      80400       embedding[0][0]            

In [34]:
loaded_model.compile(optimizer='adam', loss='categorical_crossentropy')

print('Test results:', loaded_model.evaluate(
        [X_test, add_start_token(y_test)],
        binarize_labels(y_test)
    ))

Test results: 4.302756843512879


## Conclusion

The second model had the lowest test loss (4.17, compared with 11.97 for the first model and 4.30 for the third), which indicates that it is the best of the three. The fact that the third model had a slightly higher loss than the second suggests that adding further layers to the second model does not help and might lead to overfitting.
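
To put the three losses on a more intuitive scale, they can be converted to perplexity and compared with a uniform guess over the vocabulary. This is a rough sketch that assumes the reported value is the mean per-token cross-entropy in natural-log units; padding positions are included in that mean, so the figures are not comparable across different message lengths.

```python
import math

MAX_VOCAB_SIZE = 2 ** 13

# Test losses as reported in the evaluation cells.
test_loss = {
    'model 1': 11.969458524801208,
    'model 2': 4.170023515436018,
    'model 3': 4.302756843512879,
}

# Loss obtained by assigning equal probability to every vocabulary entry.
uniform_loss = math.log(MAX_VOCAB_SIZE)

perplexity = {name: math.exp(loss) for name, loss in test_loss.items()}
beats_uniform = {name: loss < uniform_loss for name, loss in test_loss.items()}

for name in test_loss:
    print(f'{name}: loss={test_loss[name]:.2f}, '
          f'perplexity={perplexity[name]:.1f}, '
          f'better than uniform guess: {beats_uniform[name]}')
```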

Reference:
https://www.kaggle.com/soaxelbrooke/twitter-basic-seq2seq