<a href="https://colab.research.google.com/github/mhuckvale/pals0039/blob/master/Exercise_7_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 7.1

In this exercise we train a character sequence language model using a recurrent neural network and then generate some random text.

We will use the text of Alice in Wonderland as the source material. Our language model will treat each character as a symbol. We divide the text into 95% for training and 5% for testing, then split each part into sequences of 100 characters for training and testing the model.

After training we will calculate the perplexity of the model on the test data.
Then we will use the model to generate some "Alice-like" text at the character level.
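
As a rough illustration of what perplexity measures, the sketch below computes it from the probabilities a model assigned to the symbols that actually occurred. The helper name `perplexity_from_probs` and the toy probabilities are assumptions made for this example and are not part of the exercise code.

```python
import numpy as np

def perplexity_from_probs(probs):
    """Perplexity = exp of the mean negative log probability of the true symbols."""
    probs = np.asarray(probs, dtype=float)
    return float(np.exp(np.mean(-np.log(probs))))

# a model that spreads its probability evenly over 4 symbols has perplexity close to 4
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])

# a model that always gives probability 1 to the correct symbol has perplexity 1
certain = perplexity_from_probs([1.0, 1.0, 1.0])

print("uniform:", uniform, "certain:", certain)
```

One way to read the number: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing evenly among 4 characters at each step.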


(a) Import the standard libraries.

In [None]:
import requests
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

%tensorflow_version 2.x
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Embedding, Flatten, SimpleRNN, LSTM, GRU, Bidirectional, Dropout, TimeDistributed
from tensorflow.keras.preprocessing.sequence import pad_sequences

---
(b) Read in the Alice text and convert characters to a list of integers.
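
Before running this on the whole book, it may help to see the character-to-integer mapping on a tiny string. This is a minimal sketch; the names `demo_text`, `demo_chars`, `demo_c2i` and `demo_i2c` are hypothetical and chosen so that they do not clash with the variables used in the cells below.

```python
# toy text standing in for the full Alice text
demo_text = "hello"

# sorted list of the distinct characters: ['e', 'h', 'l', 'o']
demo_chars = sorted(set(demo_text))

# forward mapping (character -> integer) and its inverse (integer -> character)
demo_c2i = {c: i for i, c in enumerate(demo_chars)}
demo_i2c = {i: c for c, i in demo_c2i.items()}

# encode the text as integers, then decode it again
encoded = [demo_c2i[c] for c in demo_text]
decoded = ''.join(demo_i2c[i] for i in encoded)

print(encoded, decoded)
```

The inverse mapping is what lets generated integer ids be turned back into readable text later on.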

In [None]:
# download the text file of alice in wonderland
url = "https://www.phon.ucl.ac.uk/courses/pals0039/data/alice.txt"
response = requests.get(url)
# convert everything to lower case and replace newlines with spaces.
raw_text = response.text.lower().replace('\n',' ')
print(raw_text[:250])

In [None]:
# create mapping of unique chars to integers
# get a set of all characters
chars = sorted(list(set(raw_text)))
# build a dictionary that maps characters to integers
char_to_int = dict((c, i) for i, c in enumerate(chars))
print(char_to_int)
# get number of different characters
NCHAR=len(chars)
print("number of symbols",NCHAR)
# convert Alice text to list of integers
raw_seq=[char_to_int[x] for x in raw_text]
print(raw_seq[:100])

---
(c) Divide the enumerated text into train and test partitions.

In [None]:
# use 95% of data for training, rest for testing
ntrain=int(0.95*len(raw_seq))
train_seq=raw_seq[:ntrain]
test_seq=raw_seq[ntrain:]
print("Train length",len(train_seq),"Test length",len(test_seq))

---
(d) Split the training and test text into sequences of fixed length. We also compute the output labels for each sequence, which are the input characters advanced by one position, so that the label at each step is the next character in the text.
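
To make the relationship between inputs and labels concrete, here is a small sketch that applies the same chunking idea to a toy sequence of ten integers with a sequence length of 3. The function `demo_windows` and the toy data are assumptions for illustration only.

```python
import numpy as np

def demo_windows(seq, seqlen):
    # one element is held back so that every input position has a next symbol
    nseq = (len(seq) - 1) // seqlen
    # inputs are consecutive, non-overlapping chunks of the sequence
    feats = np.array([seq[i*seqlen : i*seqlen+seqlen] for i in range(nseq)])
    # labels are the same chunks advanced by one position
    labels = np.array([seq[i*seqlen+1 : i*seqlen+seqlen+1] for i in range(nseq)])
    return feats, labels

demo_seq = list(range(10))          # stand-in for an encoded text: 0, 1, ..., 9
demo_X, demo_y = demo_windows(demo_seq, 3)

print(demo_X)   # rows of inputs
print(demo_y)   # rows of labels, each entry one step ahead of its input
```

Each label row lines up with its input row, so position `t` of the label is the symbol that follows position `t` of the input.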

In [None]:
# function to convert one long sequence into batches of seqlen
def prepare_sequences(text,seqlen):
  # number of sequences we can generate
  nseq=(len(text)-1)//seqlen
  # inputs = features
  feats=np.zeros((nseq,seqlen))
  # outputs = labels
  labels=np.zeros((nseq,seqlen))
  for i in range(nseq):
    # inputs are chunks of the sequence
    feats[i,:]=text[i*seqlen:i*seqlen+seqlen]
    # labels are the same chunks but advanced by one unit 
    labels[i,:]=text[i*seqlen+1:i*seqlen+seqlen+1]
  return feats,labels

# prepare sequences of length 100
seqlen=100
Xtrain,ytrain = prepare_sequences(train_seq,seqlen)
Xtest,ytest = prepare_sequences(test_seq,seqlen)

print("Train",Xtrain.shape,ytrain.shape)
print("Test",Xtest.shape,ytest.shape)


---
(e) Build the recurrent model and add perplexity metric.

In [None]:
import tensorflow as tf
# define a function for Keras to calculate perplexity during training
def perplexity(y_true, y_pred):
    cross_entropy = tf.losses.sparse_categorical_crossentropy(y_true, y_pred)
    perplexity = tf.exp(tf.reduce_mean(cross_entropy))
    return perplexity

# sizes of the problem and the embedding to use for each character
isize=NCHAR
embed_size=32
osize=NCHAR

# build the model
model = Sequential()
model.add(Embedding(input_dim=isize, output_dim=embed_size,input_length=seqlen))
model.add(LSTM(128,return_sequences=True,activation='tanh'))
model.add(LSTM(128,return_sequences=True,activation='tanh'))
model.add(TimeDistributed(Dense(osize, activation='softmax')));
#
# compile the network
model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=[perplexity])
print(model.summary())

---
(f) Train the character language model and plot training progress.

In [None]:
# train the model
history=model.fit(Xtrain,ytrain, batch_size=64, validation_split=0.05, epochs=100)


In [None]:
# get the history dictionary 
hist=history.history
epochs=range(1,len(hist['loss'])+1)

# plot loss curves
plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
plt.plot(epochs,hist['loss'],'ro',label="Training loss")
plt.plot(epochs,hist['val_loss'],'b-',label="Validation loss")
plt.ylim(bottom=0);
plt.title("Training and Validation loss",fontsize=16)
plt.xlabel("Epochs",fontsize=14)
plt.ylabel("Loss",fontsize=14)
plt.grid()
plt.legend(fontsize=12)

# plot perplexity curves
plt.subplot(1,2,2)
plt.plot(epochs,hist['perplexity'],'ro',label="Training perplexity")
plt.plot(epochs,hist['val_perplexity'],'b-',label="Validation perplexity")
plt.ylim(bottom=0);
plt.title("Training and Validation perplexity",fontsize=16)
plt.xlabel("Epochs",fontsize=14)
plt.ylabel("Perplexity",fontsize=14)
plt.grid()
plt.legend(fontsize=12)
plt.show()

---
(g) Calculate the perplexity of the test set given the trained model.
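
The calculation picks out, at every time step, the probability the model gave to the character that actually came next, then averages the negative log of those values. The sketch below shows the same idea on toy numbers using NumPy integer indexing instead of a loop; the arrays `toy_pred` and `toy_true` are made up for this example.

```python
import numpy as np

# toy predictions: 3 time steps, 4 possible symbols (each row sums to 1)
toy_pred = np.array([[0.70, 0.10, 0.10, 0.10],
                     [0.25, 0.25, 0.25, 0.25],
                     [0.10, 0.10, 0.10, 0.70]])

# the symbol that actually occurred at each time step
toy_true = np.array([0, 2, 3])

# pick the probability assigned to the true symbol in each row
toy_probs = toy_pred[np.arange(len(toy_true)), toy_true]

# entropy is the mean negative log probability; perplexity is its exponential
toy_entropy = np.mean(-np.log(toy_probs))
toy_perplexity = np.exp(toy_entropy)

print("probs", toy_probs, "entropy", toy_entropy, "perplexity", toy_perplexity)
```

Perplexity computed this way is the inverse of the geometric mean of the selected probabilities, which is why a few very low probabilities can raise it sharply.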

In [None]:
import math

# make predictions from the test data
ypred=model.predict(Xtest)

# reshape back into a single long sequence
nseq=ypred.shape[0]
seqlen=ypred.shape[1]
ypred=np.reshape(ypred,(nseq*seqlen,ypred.shape[2]));

# reshape the original test labels back to a single sequence of integer ids
ytest_seq=np.reshape(ytest,(nseq*seqlen,)).astype(int)

# collect all the probabilities assigned to the actual characters
probs=[]
for i in range(ytest_seq.shape[0]):
  probs.append(ypred[i,ytest_seq[i]])

# get the mean probability, entropy and perplexity
meanprob=np.mean(probs)
entropy=np.mean(-np.log(probs))
print("mean probability",meanprob,"entropy",entropy,"perplexity",math.exp(entropy))

---
(h) Generate some new text character by character using the language model.
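
Generation relies on drawing a character index at random according to the model's output probabilities. As a point of comparison with a hand-written sampler, the sketch below does the same job with NumPy's `Generator.choice`. The function name `sample_index`, the fixed seed and the toy distributions are assumptions for illustration.

```python
import numpy as np

# a seeded generator so that the example is repeatable
rng = np.random.default_rng(0)

def sample_index(dist, rng):
    """Draw one index with probability proportional to the entries of dist."""
    dist = np.asarray(dist, dtype=float)
    dist = dist / dist.sum()            # ensure the probabilities sum to 1
    return int(rng.choice(len(dist), p=dist))

# all of the probability mass is on index 1, so index 1 is always chosen
peaked = sample_index([0.0, 1.0, 0.0], rng)

# index 2 has zero probability, so it should never be drawn
draws = [sample_index([0.5, 0.5, 0.0], rng) for _ in range(200)]

print(peaked, sorted(set(draws)))
```

Sampling, rather than always taking the most probable character, is what gives the generated text its variety from one run to the next.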

In [None]:
# sample from a probability distribution over characters
def sampledist(dist):
  dist=dist/np.sum(dist)      # ensure normalised
  thresh=np.random.random()   # choose a random variate
  total=0                     # running cumulative probability
  for i in range(len(dist)):
    total += dist[i]
    if total > thresh:        # choose one value from the distribution
      return i
  return 0

# function to generate text from the model
def generate_text(model, start_string, num_generate=1000):
  # Evaluation step (generating text using the learned model)

  # Convert our start string to numbers (vectorizing)
  # we put the start string at the end of the sequence
  input_eval = [char_to_int[s] for s in start_string]
  input_eval = pad_sequences([input_eval],maxlen=100,padding='pre',value=0)
  input_eval = np.reshape(input_eval,(1,100))

  # Empty string to store our results
  text_generated = []

  # reset the state of the model
  model.reset_states()
  for i in range(num_generate):
    # generate the predictions of the model
    predictions = model.predict(input_eval,batch_size=1)

    # get the character probabilities for the last element
    cprobabilities=predictions[0][-1,:]

    # choose one of the characters by random sampling
    predicted_id = sampledist(cprobabilities)
  
    # Slide the input window one place to the left and append the predicted
    # symbol, so the model sees the most recent 100 characters at the next step
    input_eval[0,0:99] = input_eval[0,1:100]
    input_eval[0,99] = predicted_id

    # convert the id to a character and save
    text_generated.append(chars[predicted_id])

  return (start_string + ''.join(text_generated))

# generate 1000 characters of Alice in Wonderland style text
import textwrap
text=generate_text(model, start_string="once upon a time ",num_generate=1000)
print(textwrap.fill(text,80))

---
(i) Experiment with the recurrent network structure and training to see if you can reduce the perplexity of the test set given the model to a value below 4. Does a better model lead to better generated text?