<a href="https://colab.research.google.com/github/mhuckvale/pals0039/blob/master/Exercise_9_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 9.1

In this exercise we develop a simple chatbot application using a sequence-to-sequence model and a database of movie dialogues. The exercise was adapted from [https://github.com/sekharvth/simple-chatbot-keras](https://github.com/sekharvth/simple-chatbot-keras).


(a) Import the library modules we will need.

In [0]:
import numpy as np
import pandas as pd
%tensorflow_version 2.x
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Input, Dense, LSTM, TimeDistributed
from tensorflow.keras.models import Model, load_model


---
(b) Import the movie dialogues data set. This comes from the Cornell Movie-Dialogs Corpus: [https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)

In [0]:
# 
df=pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/movie_lines.csv",keep_default_na=False)
df.head()

---
(c) Tokenize the dialogues. Run the code and add comments.

In [0]:
# 
max_words=5000

# 
contexts=df.CONTEXT.tolist()
# 
targets=[ "BOS "+l+" EOS" for l in df.TARGET.tolist()]
print("Contexts:",contexts[:5])
print("Targets:",targets[:5])

# 
tokenizer = Tokenizer(num_words=max_words,oov_token="UNK")
tokenizer.fit_on_texts(contexts+targets)
word_index=tokenizer.word_index
print("Found",len(word_index),"different words.")

print(list(word_index.items())[:10])
print(list(word_index.items())[-10:])

# 
ctxt=tokenizer.texts_to_sequences(contexts)
targ=tokenizer.texts_to_sequences(targets)
print("Context",ctxt[:5])
print("Target",targ[:5])

# 
index_to_word={ v:k for k,v in tokenizer.word_index.items()}
index_to_word[0]='.'
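
To make the tokenization step concrete, here is a minimal pure-Python sketch of the idea behind it: words are numbered by frequency, code 1 is reserved for unknown words, and any word whose code is not below `num_words` is also mapped to code 1. The helper names `build_word_index` and `texts_to_sequences` and the toy sentences are assumptions made for illustration. The real Keras `Tokenizer` also strips punctuation, which this sketch does not.

```python
from collections import Counter

def build_word_index(texts, oov_token="unk"):
    """Number words by frequency (most frequent first); code 1 is the OOV code."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for code, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = code
    return word_index

def texts_to_sequences(texts, word_index, num_words, oov_token="unk"):
    """Encode texts as lists of codes; unknown or rare words become the OOV code."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        codes = [word_index.get(word, oov) for word in text.lower().split()]
        sequences.append([code if code < num_words else oov for code in codes])
    return sequences

# Toy data, invented for this example
toy_texts = ["hello there", "hello you", "BOS hello EOS"]
toy_index = build_word_index(toy_texts)
toy_seqs = texts_to_sequences(["hello stranger", "you there"], toy_index, num_words=4)
print(toy_index)
print(toy_seqs)
```

In this toy run "stranger" never appeared in the fitted texts, so it is encoded as 1, which is the same code the filter in part (d) looks for when it discards turns containing unknown words.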

---
(d) Reduce the complete data set to only simple dialogue turns. Run the code and add comments.

In [0]:
# 
min_seq=2
max_seq=12

# 
print("Unfiltered count",len(ctxt),len(targ))
ctxt_filt=[]
targ_filt=[]
for c,t in zip(ctxt,targ):
  clen=len(c)
  tlen=len(t)-2   # -2 for BOS/EOS
  if min_seq<=clen<=max_seq and min_seq<=tlen<=max_seq:
    if 1 not in c and 1 not in t:       # 1 is code for UNK
      ctxt_filt.append(c)
      targ_filt.append(t)
print("Filtered count",len(ctxt_filt),len(targ_filt))


---
(e) Prepare data for training. Run the code and add comments.

In [0]:
#
seq_len=max_seq
ctxt_pad=pad_sequences(ctxt_filt, maxlen=seq_len, padding='pre')
targ_pad=pad_sequences(targ_filt, maxlen=seq_len+2, padding='post')
outs_pad=np.roll(targ_pad,-1,axis=1)
outs_pad[:,-1]=0

# 
ntrain=12800
perm=np.random.permutation(len(ctxt_filt))
ctxt_pad=ctxt_pad[perm[:ntrain]]
targ_pad=targ_pad[perm[:ntrain]]
outs_pad=outs_pad[perm[:ntrain]]

# 
print("Context",ctxt_pad[:5])
print("Target",targ_pad[:5])
print("Outputs",outs_pad[:5])
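
The relationship between the three arrays is easier to see on a single toy turn. The sketch below uses NumPy only. The `pad` helper is a simplified stand-in for `pad_sequences`, written for this example, and the token codes (2 for BOS, 3 for EOS) are invented. It shows that the context is padded on the left, the decoder input on the right, and that the decoder output is the decoder input shifted one step to the left.

```python
import numpy as np

def pad(seqs, maxlen, padding):
    """Zero-pad integer sequences to maxlen, before ('pre') or after ('post')."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for row, seq in enumerate(seqs):
        seq = seq[-maxlen:]                      # keep the last maxlen items
        if padding == "pre":
            out[row, maxlen - len(seq):] = seq
        else:
            out[row, :len(seq)] = seq
    return out

# One toy dialogue turn with invented codes: 2 = BOS, 3 = EOS
toy_ctxt = [[7, 8, 9]]
toy_targ = [[2, 5, 6, 3]]

toy_ctxt_pad = pad(toy_ctxt, maxlen=5, padding="pre")
toy_targ_pad = pad(toy_targ, maxlen=6, padding="post")

# Decoder output = decoder input shifted left by one position
toy_outs_pad = np.roll(toy_targ_pad, -1, axis=1)
toy_outs_pad[:, -1] = 0                          # discard the wrapped-around BOS

print("Context", toy_ctxt_pad)
print("Target ", toy_targ_pad)
print("Outputs", toy_outs_pad)
```

At every position the network sees the target word at that step and is asked to predict the word at the next step, which is why the outputs start with the first real word rather than with BOS.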


---
(f) Load GloVe embeddings to use as input to the network. Run the code and add comments.

In [0]:
# 
df=pd.read_csv('https://www.phon.ucl.ac.uk/courses/pals0039/data/glove.6B.100d.zip',header=None)
df.rename(columns={0:"word"},inplace=True)
print("Read %d word embeddings of length %d" % (len(df),len(df.columns)-1))
df.head()

In [0]:
# 
glove_index={}
for i,word in enumerate(df.word):
  glove_index[word]=i

# 
embed_dim=100
word_embed=np.zeros((max_words,embed_dim))
oov_count=0
for i in range(max_words):
  w=index_to_word[i]
  if w in glove_index:
    # 
    idx=glove_index[w]
  else:
    # 
    idx=glove_index["."]
    oov_count+=1
  word_embed[i,:]=np.array(df.iloc[idx,1:])

# 
print("OOV rate = %.1f%%" % (100*oov_count/max_words))
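
The same table-building idea can be sketched on a toy vocabulary. The three-dimensional vectors, the `toy_index_to_word` mapping and the `build_embedding_matrix` helper below are all invented for illustration. The point is that row *i* of the matrix holds the vector for the word with code *i*, and that every word missing from the pretrained set shares the fallback vector.

```python
import numpy as np

# Hypothetical 3-dimensional "pretrained" vectors, for illustration only
toy_vectors = {
    ".":     np.array([0.0, 0.0, 1.0]),
    "hello": np.array([1.0, 0.0, 0.0]),
    "you":   np.array([0.0, 1.0, 0.0]),
}
# Codes must run from 0 to n-1 so that each code can be used as a row number
toy_index_to_word = {0: ".", 1: "unk", 2: "hello", 3: "you"}

def build_embedding_matrix(index_to_word, vectors, fallback="."):
    """Return (matrix, missing): one row per word code, fallback row for unknown words."""
    dim = len(vectors[fallback])
    matrix = np.zeros((len(index_to_word), dim))
    missing = []
    for code, word in index_to_word.items():
        if word not in vectors:
            missing.append(word)
        matrix[code] = vectors.get(word, vectors[fallback])
    return matrix, missing

toy_matrix, toy_missing = build_embedding_matrix(toy_index_to_word, toy_vectors)
print(toy_matrix)
print("Missing words:", toy_missing)
```

Because all missing words receive an identical vector, a frozen embedding layer cannot tell them apart, which is one reason a high OOV rate limits what the encoder can learn.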


---
(g) Build the sequence-to-sequence model. Run the code and add comments.

In [0]:
# 
latent_dim=200
num_encoder_tokens=max_words
num_decoder_tokens=max_words

# 
input_context = Input(shape = (seq_len, ), dtype = 'int32', name = 'input_context')
# 
input_target = Input(shape = (seq_len+2, ), dtype = 'int32', name = 'input_target')

# 
embed_layer = Embedding(input_dim = max_words, output_dim = embed_dim, trainable = False )
embed_layer.build((None,))
embed_layer.set_weights([word_embed])

# 
input_ctx_embed = embed_layer(input_context)
embed_layer2 = Embedding(input_dim = max_words, output_dim = embed_dim, trainable = True )
input_tar_embed = embed_layer2(input_target)

# 
LSTM_encoder = LSTM(latent_dim, return_state = True)
encoder_outputs, state_h, state_c = LSTM_encoder(input_ctx_embed)

# 
encoder_states = [state_h, state_c]

# 
# 
# 
LSTM_decoder = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = LSTM_decoder(input_tar_embed,initial_state=encoder_states)
decoder_dense = TimeDistributed(Dense(num_decoder_tokens, activation='softmax'))
decoder_outputs = decoder_dense(decoder_outputs)

# 
# 
train_model = Model([input_context, input_target], decoder_outputs)

train_model.compile(optimizer = 'rmsprop', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
train_model.summary()


---
(h) Train the model and build the encoder and decoder models we'll need for inference. Run the code and add comments. Training will take a few minutes.

In [0]:
%%time
#
train_model.fit([ctxt_pad, targ_pad], outs_pad, epochs = 100, batch_size = 128)

# 
encoder_model = Model(input_context, encoder_states)

# 
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = LSTM_decoder(input_tar_embed, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([input_target,decoder_state_input_h,decoder_state_input_c],[decoder_outputs,state_h,state_c])

# 
encoder_model.save('ex9_1_encoder.h5')
decoder_model.save('ex9_1_decoder.h5')

---
(i) Define the inference operation for dialogue turns. Run the code and add comments.

In [0]:
# 
encoder_model=load_model('ex9_1_encoder.h5', compile=False)
decoder_model=load_model('ex9_1_decoder.h5', compile=False)

# 
def decode_sequence(input_seq):
  # 
  states_value = encoder_model.predict(input_seq)

  # 
  target_seq = np.zeros((1, seq_len+2))
  # 
  target_seq[0, 0] = tokenizer.word_index['bos']

  # 
  decoded_sentence = []
  pos=0
  while True:
    # 
    output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

    # 
    sampled_token_index = np.argmax(output_tokens[0, pos, :])

    #
    if (sampled_token_index == tokenizer.word_index['eos'] or pos >= seq_len):
      break

    # 
    decoded_sentence.append(index_to_word[sampled_token_index])

    # 
    pos += 1
    target_seq[0, pos] = sampled_token_index

  return " ".join(decoded_sentence)

# 
for i in range(10):
  # the padded arrays were shuffled in (e), so recover the words from the same rows we decode
  inp=[index_to_word[w] for w in ctxt_pad[i] if w!=0]
  print("Prompt",i,' '.join(inp))
  tar=[index_to_word[w] for w in targ_pad[i] if w!=0]
  print("Target",i,' '.join(tar))
  print("Output",i,decode_sequence(ctxt_pad[i:i+1,:]),'\n')
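
The decoding loop can be studied without a trained network by replacing the decoder with a deterministic stand-in. In the sketch below `toy_decoder` is an invented function that plays the role of `decoder_model.predict`, and the codes for BOS and EOS are assumptions. The loop itself follows the same greedy strategy: feed the sequence so far, take the most probable token at the current position, stop at EOS or at the length limit, otherwise write the token into the next input position.

```python
import numpy as np

BOS, EOS = 2, 3        # invented codes for this example
MAX_LEN = 6
VOCAB = 10

def toy_decoder(target_seq):
    """Stand-in for the trained decoder: scores of shape (1, length, VOCAB).

    At each position it predicts 'input token + 2', or EOS once that exceeds 8.
    """
    scores = np.zeros((1, target_seq.shape[1], VOCAB))
    for pos, tok in enumerate(target_seq[0].astype(int)):
        nxt = tok + 2 if tok + 2 <= 8 else EOS
        scores[0, pos, nxt] = 1.0
    return scores

def greedy_decode(predict, max_len=MAX_LEN):
    """Generate token codes one at a time, always taking the highest-scoring token."""
    target_seq = np.zeros((1, max_len + 2))
    target_seq[0, 0] = BOS
    decoded = []
    pos = 0
    while True:
        scores = predict(target_seq)
        token = int(np.argmax(scores[0, pos, :]))
        if token == EOS or pos >= max_len:
            break
        decoded.append(token)    # the argmax position is itself the token code
        pos += 1
        target_seq[0, pos] = token
    return decoded

toy_output = greedy_decode(toy_decoder)
print(toy_output)
```

Greedy selection is only one choice. Sampling from the output distribution or keeping several candidate sequences (beam search) are common alternatives that tend to give more varied replies.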


---
(j) Interactive chat with the chatbot. Run the code and add comments.

In [0]:
# get a question
print("Type 'stop' to stop.")
question=input("Q: ")
while question != "stop":
  # 
  ques_list=tokenizer.texts_to_sequences([question])
  # 
  ques_pad=pad_sequences(ques_list, maxlen=seq_len, padding='pre')
  # 
  print("A:",decode_sequence(ques_pad))
  question=input("Q: ")


---
(k) This example could be improved in many ways: increase the number of training samples, change the size of the encoder/decoder or add additional layers, change the amount of training, or filter out dialogue turns containing words that are missing from the GloVe data. There are also better ways to measure performance: ignoring padding positions would be a start, or you could evaluate on separate test data. Do try out some variants yourself.
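
As a starting point for the suggestion about padding, here is a minimal NumPy sketch of an accuracy measure that skips padded positions. The function name `masked_accuracy` and the toy arrays are assumptions for illustration. The sketch assumes, as in this exercise, that code 0 marks padding in the reference outputs.

```python
import numpy as np

def masked_accuracy(y_true, y_pred, pad_code=0):
    """Fraction of correctly predicted tokens, counting non-padding positions only."""
    mask = y_true != pad_code
    if not mask.any():
        return 0.0
    return float((y_pred[mask] == y_true[mask]).mean())

# Toy reference and prediction: one wrong word out of three real tokens
toy_true = np.array([[5, 6, 3, 0, 0, 0]])
toy_pred = np.array([[5, 7, 3, 0, 0, 0]])

plain = float((toy_true == toy_pred).mean())
masked = masked_accuracy(toy_true, toy_pred)
print("Plain accuracy  %.3f" % plain)
print("Masked accuracy %.3f" % masked)
```

The plain figure is inflated because the three padding positions are trivially correct. To apply this to the trained model you could take the argmax of its predictions over the last axis and compare the result with `outs_pad`.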