<a href="https://colab.research.google.com/github/mhuckvale/pals0039/blob/master/Answers_7_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 7.2

In this exercise we build a word language model using a recurrent network. We test it on a Cloze task in which we have to choose which word fits best within a given sentence.

For training the language model we use a collection of copyright-free stories from the Gutenberg archive.

The Cloze test comes from the [Children's Book Test](https://arxiv.org/pdf/1511.02301.pdf). In this test you are presented with an extract of a story and then a sentence with one word missing. Your task is to pick the right word to fill in the blank from the alternatives provided.

Here is an example:
<ul><br>CONTEXT: ... When the youth had overtaken them , he saw that there was a clear spring in the middle of the space . He sat down at the foot of the tree upon which the birds were perched , and listened attentively to what they were saying to each other . The sun is not down yet ,  said the first bird ;  we must wait yet awhile till the moon rises and the maiden comes to the spring .
<br>QUERY: Do you think she will see that young ____ sitting under the tree ?
<br>ALTERNATIVES: birds|food|maiden|man|middle|place|sight|south|wash|wings
<br>CORRECT: man
</ul>

Our approach is to build a language model from a set of children's stories, using sequences of length 100. We then use the end of the CONTEXT and the QUERY text up to the blank as the starting point for predicting the blank, which gives a probability distribution over the vocabulary (we don't use the part of the QUERY to the right of the blank at all). We then choose the most probable word from the alternatives. This is certainly not an optimal approach and there are many better ways of solving this problem. The best approaches can achieve over 75% on this CBTest/CN task.
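
The selection step described above can be sketched on its own, before any model is trained. This is only a toy illustration: the probabilities are invented for the example, and `pick_alternative` is a hypothetical helper name, not part of the exercise code.

```python
# Toy illustration: the probabilities below are invented, not model output.
def pick_alternative(word_probs, alternatives):
    """Return the alternative with the highest predicted probability."""
    # words missing from the distribution are treated as having probability 0
    return max(alternatives, key=lambda w: word_probs.get(w, 0.0))

# hypothetical next-word probabilities after "... will see that young"
word_probs = {"man": 0.04, "maiden": 0.01, "birds": 0.002, "the": 0.2}
alternatives = "birds|food|maiden|man|middle|place|sight|south|wash|wings".split("|")

best = pick_alternative(word_probs, alternatives)
print(best)
```

Note that the most probable word overall ("the" in this made-up distribution) is ignored because it is not one of the alternatives.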


(a) Set up libraries.

In [None]:
import requests
import numpy as np

%tensorflow_version 2.x
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Dense, Embedding, Flatten, SimpleRNN, LSTM, GRU, Bidirectional, Dropout, TimeDistributed
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import get_file


---
(b) Load the text we will use for training. This is a collection of stories from the Gutenberg archive. 

In [None]:
# download the training text
url = "https://www.phon.ucl.ac.uk/courses/pals0039/data/cloze-corpus.txt"
response = requests.get(url)
# convert the text to a single lower case string
raw_text = response.text.lower().replace('\n',' ')
print("Corpus has",len(raw_text),"characters")
print(raw_text[:250])

---
(c) Tokenize the text. Then build a vocabulary of 10,000 words and set the rare words to unknown.

In [None]:
# set the vocabulary limit to be 10,000 words
max_words=10000

# use the Keras Tokenizer
tokenizer = Tokenizer(num_words=max_words,oov_token="UNK")
tokenizer.fit_on_texts([raw_text])
word_index=tokenizer.word_index
print("Found",len(word_index),"different words.")

In [None]:
# print the 10 most common and 10 least common words
print(list(word_index.items())[:10])
print(list(word_index.items())[-10:])

In [None]:
# convert the text to a list of word indexes, replacing rare words with UNK
raw_seq=tokenizer.texts_to_sequences([raw_text])[0]
# print the start of the sample
print(raw_seq[:50])
# report largest code
print("# words",len(raw_seq),"Max index",max(raw_seq))
# calculate the out-of-vocabulary rate (the UNK token has index 1)
num_oov=sum(1 for w in raw_seq if w==1)
print("%OOV",100*num_oov/len(raw_seq))
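
To see what the vocabulary limit does, here is a simplified pure-Python imitation of the indexing scheme. It assumes that the Keras Tokenizer leaves index 0 unused, gives the unknown token index 1, and keeps only indexes below `num_words`. The helper names `build_index` and `encode` are made up for this sketch.

```python
from collections import Counter

def build_index(text, oov_token="UNK"):
    """Rank words by frequency; index 1 is reserved for the unknown token."""
    counts = Counter(text.split())
    ranked = [w for w, _ in counts.most_common()]
    # index 0 is left unused (Keras reserves it for padding)
    return {w: i for i, w in enumerate([oov_token] + ranked, start=1)}

def encode(text, index, num_words):
    """Map words to indexes, sending any index >= num_words to UNK (1)."""
    out = []
    for w in text.split():
        i = index.get(w, 1)
        out.append(i if i < num_words else 1)
    return out

toy_text = "the cat sat on the mat the cat ran"
toy_index = build_index(toy_text)
toy_seq = encode(toy_text, toy_index, num_words=4)
toy_oov = 100 * sum(1 for i in toy_seq if i == 1) / len(toy_seq)
print(toy_seq, "%OOV", toy_oov)
```

Under these assumptions a limit of `num_words` leaves room for `num_words-2` real words, which is why the largest index reported above is one less than `max_words`.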

---
(d) Prepare the text sample for training by dividing into sequences of 100 words. Then save some of the data for validation of the language model.

In [None]:
# divide the text into sequences of fixed length
seq_len=100
nseq=len(raw_seq)//seq_len
# chunk the text into sequences
seq=np.reshape(raw_seq[:nseq*seq_len],(nseq,seq_len))
# shift the text back one word
raw_seq_shift=np.roll(raw_seq,-1) 
# and chunk into sequences to act as targets
seq_shift=np.reshape(raw_seq_shift[:nseq*seq_len],(nseq,seq_len))
# randomise the order of the sequences
p = np.random.permutation(nseq)
seq=seq[p]
seq_shift=seq_shift[p]

# divide into train and validation sets
nval=nseq//20
Xval=seq[:nval,:]
yval=seq_shift[:nval,:]
Xtrain=seq[nval:,:]
ytrain=seq_shift[nval:,:]

print("Train size",Xtrain.shape,ytrain.shape)
print("Validation size",Xval.shape,yval.shape)

print(Xtrain[0,:10],ytrain[0,:10])
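
The shift-and-chunk step is easier to follow on a tiny example. The sketch below uses a made-up sequence of 12 indexes and a chunk length of 4 in place of `raw_seq` and `seq_len`.

```python
import numpy as np

toy = np.arange(1, 13)          # stand-in for raw_seq: 12 word indexes
toy_len = 4                     # stand-in for seq_len
n = len(toy) // toy_len

# inputs: consecutive chunks of the text
X = np.reshape(toy[:n * toy_len], (n, toy_len))
# np.roll(..., -1) moves every element one place to the left (wrapping round)
y = np.reshape(np.roll(toy, -1)[:n * toy_len], (n, toy_len))

print(X)
print(y)
```

Each target row is the input row moved on by one word, so position t of `y` holds the word that follows position t of `X`. Because `np.roll` wraps round, the very last target is the first word of the text; with a long corpus this affects a single position at most.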


---
(e) Build a recurrent language model with two LSTM layers.

In [None]:
import tensorflow as tf
# define a function for Keras to report perplexity during training
def perplexity(y_true, y_pred):
    cross_entropy = tf.losses.sparse_categorical_crossentropy(y_true, y_pred)
    perplexity = tf.exp(tf.reduce_mean(cross_entropy))
    return perplexity

# set up basic sizes for network
isize=max_words
embed_size=64
osize=max_words

# build the model with an embedding layer and two layers of LSTMs
model = Sequential()
model.add(Embedding(input_dim=isize, output_dim=embed_size,input_length=seq_len))
model.add(LSTM(32,return_sequences=True,activation='tanh'))
model.add(LSTM(32,return_sequences=True,activation='tanh'))
model.add(TimeDistributed(Dense(osize, activation='softmax')))

# compile the network
model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=[perplexity])
model.summary()
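
As a cross-check on the model summary, the parameter counts can be worked out by hand. This sketch assumes the standard LSTM formulation with four gates, each having input weights, recurrent weights and a bias vector; `lstm_params` is a hypothetical helper written for this illustration.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

vocab, embed, hidden = 10000, 64, 32
counts = {
    "embedding": vocab * embed,            # one vector per vocabulary entry
    "lstm_1": lstm_params(embed, hidden),  # fed by the embeddings
    "lstm_2": lstm_params(hidden, hidden), # fed by the first LSTM
    "dense": hidden * vocab + vocab,       # weights plus biases
}
total = sum(counts.values())
for name, value in counts.items():
    print(name, value)
print("total", total)
```

Almost all of the parameters sit in the embedding and output layers; the two LSTM layers are small by comparison.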

---
(f) Train the model. There is also optional code here to save and restore trained models, since training can take some time (about 30 minutes with a GPU).

In [None]:
# train the model
history=model.fit(Xtrain,ytrain, batch_size=64, validation_data=(Xval,yval), epochs=25)
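
The perplexity metric defined in (e) is the exponential of the mean cross-entropy, which can be read as the effective number of equally likely words the model is choosing between. A minimal pure-Python sketch of the same calculation, using made-up probabilities rather than model output:

```python
import math

def perplexity_from_probs(probs_of_true_words):
    """exp of the mean negative log probability assigned to the true words."""
    nll = [-math.log(p) for p in probs_of_true_words]
    return math.exp(sum(nll) / len(nll))

# a model that spreads probability uniformly over 10,000 words
uniform = perplexity_from_probs([1 / 10000] * 5)
# a model that gives each true word probability 0.1
better = perplexity_from_probs([0.1] * 5)
print(round(uniform), round(better))
```

A model no better than uniform guessing over the vocabulary would therefore report a perplexity close to `max_words`, so lower values during training indicate that the model is learning something about word order.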


In [None]:
# (optional) save the model to your Google drive account
from google.colab import drive
drive.mount('/content/gdrive')
model_save_name = 'ex72.h5'
path = "/content/gdrive/My Drive/"+model_save_name
model.save(path)

In [None]:
# (optional) load a trained model from your Google drive account
from google.colab import drive
drive.mount('/content/gdrive')
model_save_name = 'ex72.h5'
path = "/content/gdrive/My Drive/"+model_save_name
model=load_model(path, custom_objects={'perplexity': perplexity})

In [None]:
# (optional) load a trained model from the course website
model_save_name = 'ex72.h5'
url = "https://www.phon.ucl.ac.uk/courses/pals0039/data/"+model_save_name
file = get_file(model_save_name,url,cache_subdir="models")
model=load_model(file, custom_objects={'perplexity': perplexity})


---
(g) Load the data for the cloze task to use for testing. The cloze test data is stored in a spreadsheet with named columns for the relevant parts of each test question.

In [None]:
import pandas as pd

# read in the Cloze data set from the course web site.
df=pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/cloze-test.csv",keep_default_na=False)

# the Cloze task uses the columns df.CONTEXT, df.QUERY and df.ALTERNATIVES, with the correct answer in df.ANSWER
df.head()

---
(h) Encode the cloze test data using the tokenizer and assemble into sequences.

In [None]:
# concatenate the context and the query and divide up the alternatives
cloze_context=[]
cloze_answer=[]
cloze_alter=[]
for i in range(len(df)):
  # concatenate the CONTEXT and the QUERY to get sufficient text for the LM
  # (CONTEXT is included twice here; only the last seq_len tokens are used later)
  text=df.CONTEXT.iat[i]+" "+df.CONTEXT.iat[i]+" "+df.QUERY.iat[i]
  cloze_context.append(text)
  cloze_answer.append(df.ANSWER.iat[i])
  cloze_alter.append((df.ALTERNATIVES.iat[i]).split('|'))

# convert the strings to integer sequences
cloze_context_seq=tokenizer.texts_to_sequences(cloze_context)
cloze_answer_seq=tokenizer.texts_to_sequences(cloze_answer)
cloze_alter_seq=tokenizer.texts_to_sequences(cloze_alter)

In [None]:
# print some samples of what we have
print(cloze_context[0])
print(cloze_context_seq[0])
print(cloze_answer[:10])
print(cloze_answer_seq[:10])
print(cloze_alter[:10])
print(cloze_alter_seq[:10])


---
(i) Run the model over all the test sequences and obtain a probability distribution over the word that completes the query. We first chop down the context to the last 100 words to fit the sequence length used to train the language model.

In [None]:
seq_len=100
# chop all context sequences down to seq_len values by taking the last 100 words
cloze_context_lim=np.stack(np.array([ x[-seq_len:] for x in cloze_context_seq]))
print(cloze_context_lim.shape)
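
Taking the last `seq_len` items with a negative slice trims long sequences but does not pad short ones. The toy sketch below, with invented sequences and a hypothetical helper `last_tokens`, shows why every context needs at least `seq_len` tokens before the results can be stacked into one rectangular array.

```python
def last_tokens(seqs, n):
    """Keep the final n tokens of each sequence (shorter ones are unchanged)."""
    return [s[-n:] for s in seqs]

toy_seqs = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10], [11, 12]]
trimmed = last_tokens(toy_seqs, 4)
lengths = [len(s) for s in trimmed]
print(trimmed)
print(lengths)
```

If the lengths differ, as in the last toy sequence, the sequences cannot be combined into a single array of shape (number of questions, `seq_len`).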

In [None]:
# run the model in batches of 50 so as not to overload memory in Colab
block_size=50
nblock=cloze_context_lim.shape[0]//block_size
# array to hold the word probabilities
ypred=np.zeros((nblock*block_size,max_words))
for i in range(nblock):
  # get the predicted word probabilities
  testdata=cloze_context_lim[i*block_size:(i+1)*block_size,:]
  pred=model.predict(testdata,batch_size=50)
  # save the probabilities for the last predicted word
  ypred[i*block_size:(i+1)*block_size,:]=pred[:,-1,:]

print(ypred.shape)
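
The loop above only processes complete blocks, so any questions left over at the end are not scored and do not count towards the total in (j). A small sketch of the block arithmetic, where the item count is made up and `block_ranges` is a hypothetical helper:

```python
def block_ranges(n_items, block_size):
    """(start, stop) pairs for the full blocks, plus the count left over."""
    n_blocks = n_items // block_size
    ranges = [(i * block_size, (i + 1) * block_size) for i in range(n_blocks)]
    return ranges, n_items - n_blocks * block_size

ranges, leftover = block_ranges(2507, 50)   # 2507 is a made-up item count
print(len(ranges), ranges[-1], leftover)
```

When the number of questions is an exact multiple of the block size, nothing is left over and every question is scored.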


---
(j) For each cloze sentence, find the probabilities of each of the alternatives and choose the most probable. Finally compare our answer with the ANSWER field and keep a record of how well we did.

In [None]:
# for each cloze sentence in turn
ntest=ypred.shape[0]
ncorrect=0
for i in range(ntest):
  # get the number of alternatives
  nprob=len(cloze_alter_seq[i])
  # get the predicted probabilities
  prob=np.zeros(nprob)
  for j in range(nprob):
    prob[j]=ypred[i,cloze_alter_seq[i][j]]
  # choose the most probable among the alternatives
  top_word=cloze_alter_seq[i][np.argmax(prob)]
  # get the actually correct word
  correct_word=cloze_answer_seq[i][0]
  # record whether we got it right
  if (top_word==correct_word):
    ncorrect += 1

# print how well we did
print("Correct: %d/%d (%.1f%%)" % (ncorrect,ntest,100*ncorrect/ntest))


(k) Experiment with the network architecture and training protocol to see if you can improve performance on the Cloze task. Advanced: find a way to use the right-hand part of the QUERY string. One way could be to evaluate each ALTERNATIVE in turn embedded in the QUERY sentence and choose the one which is given the greatest probability by the language model.
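
One possible shape for the advanced idea is sketched below. It is not a solution to the exercise: a tiny invented bigram table stands in for the trained language model, and the names `TOY_BIGRAMS`, `sentence_log_prob` and `best_fill` are hypothetical. A real version would take the word probabilities from `model.predict()` at every position of the filled-in sentence.

```python
import math

# Hypothetical stand-in for the language model: P(next word | previous word).
# All of these probabilities are invented for the illustration.
TOY_BIGRAMS = {
    ("young", "man"): 0.30, ("young", "maiden"): 0.30, ("young", "birds"): 0.05,
    ("man", "sitting"): 0.20, ("maiden", "sitting"): 0.02, ("birds", "sitting"): 0.10,
}

def sentence_log_prob(words, floor=1e-6):
    """Sum of log P(word | previous word); unseen pairs get a small floor."""
    return sum(math.log(TOY_BIGRAMS.get(pair, floor))
               for pair in zip(words, words[1:]))

def best_fill(query, alternatives, blank="____"):
    """Substitute each alternative for the blank and keep the best-scoring one."""
    scores = {}
    for alt in alternatives:
        words = [alt if w == blank else w for w in query.split()]
        scores[alt] = sentence_log_prob(words)
    return max(scores, key=scores.get), scores

choice, scores = best_fill("that young ____ sitting under", ["birds", "maiden", "man"])
print(choice)
```

In this toy table "man" and "maiden" are equally likely after "young", so the left context alone cannot separate them; it is the word after the blank that tips the decision.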