<a href="https://colab.research.google.com/github/mhuckvale/pals0039/blob/master/Answers_7_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 7.1 Answers

In this exercise we train a character sequence language model using a recurrent neural network and then generate some random text.


(a) Import the standard libraries.

In [0]:
import requests
import numpy as np

%tensorflow_version 2.x
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Embedding, Flatten, SimpleRNN, LSTM, GRU, Bidirectional, Dropout, TimeDistributed
from tensorflow.keras.preprocessing.sequence import pad_sequences

---
(b) Read in the text file and convert its characters to a list of integers. Run the code and add comments.

In [0]:
# download the text file of Alice in Wonderland
url = "https://www.phon.ucl.ac.uk/courses/pals0039/data/alice.txt"
response = requests.get(url)
# convert everything to lower case and replace newlines with spaces.
raw_text = response.text.lower().replace('\n',' ')
print(raw_text[:250])

In [0]:
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
print(char_to_int)
# number of different characters
NCHAR=len(chars)
print(NCHAR)
# convert text to list of integers
raw_seq=[char_to_int[x] for x in raw_text]
print(raw_seq[:100])
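
As a quick check on the mapping, the sketch below encodes a toy string and decodes it again. The names `sample_text`, `sample_chars` and `sample_char_to_int` are made up for illustration and stand in for `raw_text`, `chars` and `char_to_int`; because the character list is sorted, the list itself serves as the integer-to-character lookup.

```python
# Toy text standing in for raw_text (an assumption, for illustration only)
sample_text = "alice was beginning"

# Same construction as in the cell above
sample_chars = sorted(set(sample_text))
sample_char_to_int = {c: i for i, c in enumerate(sample_chars)}

# Encode to integers, then decode by indexing the sorted character list
encoded = [sample_char_to_int[c] for c in sample_text]
decoded = "".join(sample_chars[i] for i in encoded)

print(encoded[:5])
print(decoded == sample_text)
```

The same indexing, `chars[predicted_id]`, is what turns model output back into text in part (h).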

---
(c) Divide into train and test. Run the code and add comments.

In [0]:
# use 90% of data for training, rest for testing
ntrain=int(0.9*len(raw_seq))
train_seq=raw_seq[:ntrain]
test_seq=raw_seq[ntrain:]
print(len(train_seq),len(test_seq))

---
(d) Build training sequences of fixed length. Run the code and add comments.

In [0]:
# split one long sequence into consecutive chunks of length seqlen
def prepare_sequences(text,seqlen):
  # number of sequences we can generate
  nseq=(len(text)-1)//seqlen
  # inputs = features
  feats=np.zeros((nseq,seqlen))
  # outputs = labels
  labels=np.zeros((nseq,seqlen))
  for i in range(nseq):
    # inputs are chunks of the sequence
    feats[i,:]=text[i*seqlen:i*seqlen+seqlen]
    # labels are the same chunks but advanced by one unit 
    labels[i,:]=text[i*seqlen+1:i*seqlen+seqlen+1]
  return feats,labels

# prepare sequences of length 100
seqlen=100
Xtrain,ytrain = prepare_sequences(train_seq,seqlen)
Xtest,ytest = prepare_sequences(test_seq,seqlen)

print(Xtrain.shape,ytrain.shape)
print(Xtest.shape,ytest.shape)
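
To see what the chunking does, here is a minimal sketch on a toy sequence of ten integers. The function `chunk_with_next_char` is a hypothetical copy of `prepare_sequences`; the only change is that it uses integer arrays so the printed output is easier to read.

```python
import numpy as np

def chunk_with_next_char(seq, seqlen):
    # hypothetical re-statement of prepare_sequences, with integer arrays
    nseq = (len(seq) - 1) // seqlen
    feats = np.zeros((nseq, seqlen), dtype=int)
    labels = np.zeros((nseq, seqlen), dtype=int)
    for i in range(nseq):
        # input chunk
        feats[i, :] = seq[i * seqlen : i * seqlen + seqlen]
        # the same chunk shifted forward by one position
        labels[i, :] = seq[i * seqlen + 1 : i * seqlen + seqlen + 1]
    return feats, labels

toy_seq = list(range(10))
toy_X, toy_y = chunk_with_next_char(toy_seq, 3)
print(toy_X)  # rows [0 1 2], [3 4 5], [6 7 8]
print(toy_y)  # rows [1 2 3], [4 5 6], [7 8 9]
```

Each label is the item that follows the corresponding input, so the network learns to predict the next character at every position. The `-1` in the sequence count keeps one item in reserve so that the last label always exists.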


---
(e) Build the recurrent model. Run the code and add comments.

In [0]:
import tensorflow as tf
# this will be used by Keras to report perplexity during training
def perplexity(y_true, y_pred):
    cross_entropy = tf.losses.sparse_categorical_crossentropy(y_true, y_pred)
    perplexity = tf.exp(tf.reduce_mean(cross_entropy))
    return perplexity

# sizes of the problem and the embedding
isize=NCHAR
embed_size=64
osize=NCHAR

# build the model
model = Sequential()
model.add(Embedding(input_dim=isize, output_dim=embed_size,input_length=seqlen))
model.add(LSTM(128,return_sequences=True,activation='tanh'))
model.add(LSTM(128,return_sequences=True,activation='tanh'))
model.add(TimeDistributed(Dense(osize, activation='softmax')))
#
# compile the network
model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=[perplexity])
print(model.summary())
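
Perplexity is the exponential of the mean cross-entropy, so it can be read as the effective number of equally likely choices the model faces at each step. The NumPy sketch below (the helper `perplexity_from_probs` is an illustrative name, not part of the exercise) checks two reference points: a model that spreads its probability evenly over 4 symbols has a perplexity of about 4, and a model that always assigns probability 1 to the correct symbol has a perplexity of 1.

```python
import numpy as np

def perplexity_from_probs(p_correct):
    # p_correct: probability the model gave to the correct symbol at each step
    cross_entropy = -np.log(np.asarray(p_correct, dtype=float))
    return float(np.exp(np.mean(cross_entropy)))

# uniform guess over 4 symbols: every correct symbol gets probability 0.25
uniform_ppl = perplexity_from_probs([0.25] * 5)

# perfect model: every correct symbol gets probability 1
perfect_ppl = perplexity_from_probs([1.0] * 5)

print(uniform_ppl, perfect_ppl)
```

By the same reasoning, an untrained model over `NCHAR` characters should start with a perplexity near `NCHAR`. The value Keras reports during training is probably averaged over batches, so it may differ slightly from a perplexity computed over the whole data set in one go.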

---
(f) Train the model.

In [0]:
# train the model
history=model.fit(Xtrain,ytrain, batch_size=64, validation_data=(Xtest,ytest), epochs=100)


---
(g) Calculate perplexity on test set. Run the code and add comments.

In [0]:
import math

# make predictions from the test data
ypred=model.predict(Xtest)

# reshape back into a single long sequence
nseq=ypred.shape[0]
seqlen=ypred.shape[1]
ypred=np.reshape(ypred,(nseq*seqlen,ypred.shape[2]))

# reshape the original test data back to a single sequence
ytest_seq=np.reshape(ytest,(nseq*seqlen,1))

# collect all the probabilities assigned to the correct characters
probs=[]
for i in range(ytest_seq.shape[0]):
  # ytest_seq has shape (N,1), so take the scalar label explicitly
  probs.append(ypred[i,int(ytest_seq[i,0])])

# get the mean probability, entropy and perplexity
meanprob=np.mean(probs)
entropy=np.mean(-np.log(probs))
print(meanprob,entropy,math.exp(entropy))
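
The loop that picks out the probability of each correct character can also be written as a single NumPy indexing operation. The sketch below uses made-up data (`toy_pred`, `toy_true`) to show that the two forms agree; with the notebook's arrays the labels are stored as floats, so they would first need converting with `astype(int)`.

```python
import numpy as np

# made-up predictions: 6 time steps, 5 symbols, each row sums to 1
rng = np.random.default_rng(0)
toy_pred = rng.random((6, 5))
toy_pred /= toy_pred.sum(axis=1, keepdims=True)

# made-up correct symbol for each time step
toy_true = np.array([0, 3, 1, 4, 2, 2])

# loop form, as in the cell above
loop_probs = np.array([toy_pred[i, toy_true[i]] for i in range(len(toy_true))])

# vectorised form: pair each row index with its correct column index
vec_probs = toy_pred[np.arange(len(toy_true)), toy_true]

print(np.allclose(loop_probs, vec_probs))
toy_entropy = np.mean(-np.log(vec_probs))
print(toy_entropy, np.exp(toy_entropy))
```

On a test set of this size the loop is fast enough, so the vectorised form is mainly a matter of taste.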

---
(h) Generate some new text using the language model. Run the code and add comments.

In [0]:
# sample from a probability distribution
def sampledist(dist):
  thresh=np.random.random()
  # running total of probability (named 'total' to avoid shadowing the built-in sum)
  total=0
  for i in range(len(dist)):
    total += dist[i]
    if total > thresh:
      return(i)
  return(0)

def generate_text(model, start_string, num_generate=1000):
  # Evaluation step (generating text using the learned model)

  # Converting our start string to numbers (vectorizing)
  input_eval = [char_to_int[s] for s in start_string]
  input_eval = pad_sequences([input_eval],maxlen=100,padding='pre',value=0)
  input_eval = np.reshape(input_eval,(1,100))

  # Empty string to store our results
  text_generated = []

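  # reset any stored recurrent state (the LSTM layers above are not stateful,
  # so each predict call starts from a fresh state anyway)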
  model.reset_states()
  for i in range(num_generate):
    # generate the predictions of the model
    predictions = model.predict(input_eval,batch_size=1)

    # get the character probabilities for the last element
    cprobabilities=predictions[0][-1,:]

    # choose one of the characters by random sampling
    predicted_id = sampledist(cprobabilities)
  
    # Slide the input window left by one position and append the predicted
    # character, so the next prediction sees the most recent 100 characters
    input_eval[0,0:99] = input_eval[0,1:100]
    input_eval[0,99] = predicted_id

    # convert the id to a character and save
    text_generated.append(chars[predicted_id])

  return (start_string + ''.join(text_generated))

# generate 1000 characters of Alice in Wonderland style text
import textwrap
text=generate_text(model, start_string="once upon a time ",num_generate=1000)
print(textwrap.fill(text,80))
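
The `sampledist` function draws an index by walking along the cumulative probabilities until they pass a uniform random threshold. A minimal sketch of the same idea with NumPy is given below; `sample_index` is a hypothetical name, and unlike `sampledist` it falls back to the last index rather than index 0 if rounding leaves the cumulative total just below the threshold. Drawing many samples from a toy distribution shows that the observed frequencies approach the requested probabilities.

```python
import numpy as np

def sample_index(dist, rng):
    # cumulative probabilities, e.g. [0.1, 0.7, 1.0]
    cumulative = np.cumsum(dist)
    # uniform random threshold in [0, 1)
    thresh = rng.random()
    # first index whose cumulative probability exceeds the threshold
    idx = int(np.searchsorted(cumulative, thresh, side="right"))
    # guard against rounding error in the cumulative total
    return min(idx, len(dist) - 1)

rng = np.random.default_rng(42)
toy_dist = np.array([0.1, 0.6, 0.3])
draws = [sample_index(toy_dist, rng) for _ in range(10000)]
freqs = np.bincount(draws, minlength=len(toy_dist)) / len(draws)
print(freqs)
```

Sampling in this way, rather than always taking the most probable character, is what lets the generated text vary from run to run.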