<a href="https://colab.research.google.com/github/mhuckvale/pals0039/blob/master/Exercise_7_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 7.1

In this exercise we train a character sequence language model using a recurrent neural network and then generate some random text.


(a) Import the standard libraries.

In [0]:
import requests
import numpy as np

%tensorflow_version 2.x
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Embedding, Flatten, SimpleRNN, LSTM, GRU, Bidirectional, Dropout, TimeDistributed
from tensorflow.keras.preprocessing.sequence import pad_sequences

---
(b) Read in the text file and convert the characters to a list of integers. Run the code and add comments.

In [0]:
# 
url = "https://www.phon.ucl.ac.uk/courses/pals0039/data/alice.txt"
response = requests.get(url)
# 
raw_text = response.text.lower().replace('\n',' ')
print(raw_text[:250])

In [0]:
# 
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
print(char_to_int)
# 
NCHAR=len(chars)
print(NCHAR)
# 
raw_seq=[char_to_int[x] for x in raw_text]
print(raw_seq[:100])
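
As a small illustration of the character coding used above, the following sketch applies the same steps to a short made-up string and also builds the inverse mapping so the integers can be decoded again. The string and all `toy_*` names are assumptions made for this example and are not part of the exercise.

```python
# Toy illustration only: the string and all toy_* names are made up for this sketch.
toy_text = "hello world"

# Sorted list of the distinct characters in the text.
toy_chars = sorted(set(toy_text))
# Map each character to its position in the sorted list, and back again.
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for c, i in toy_char_to_int.items()}

# Encode the text as integers, then decode to check the round trip.
toy_codes = [toy_char_to_int[c] for c in toy_text]
toy_decoded = ''.join(toy_int_to_char[i] for i in toy_codes)

print(toy_chars)
print(toy_codes)
print(toy_decoded)
```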

---
(c) Divide into train and test. Run the code and add comments.

In [0]:
#
ntrain=int(0.9*len(raw_seq))
train_seq=raw_seq[:ntrain]
test_seq=raw_seq[ntrain:]
print(len(train_seq),len(test_seq))

---
(d) Build training sequences of fixed length. Run the code and add comments.

In [0]:
# 
def prepare_sequences(text,seqlen):
  # 
  nseq=(len(text)-1)//seqlen
  # 
  feats=np.zeros((nseq,seqlen))
  # 
  labels=np.zeros((nseq,seqlen))
  for i in range(nseq):
    # 
    feats[i,:]=text[i*seqlen:i*seqlen+seqlen]
    # 
    labels[i,:]=text[i*seqlen+1:i*seqlen+seqlen+1]
  return feats,labels

# 
seqlen=100
Xtrain,ytrain = prepare_sequences(train_seq,seqlen)
Xtest,ytest = prepare_sequences(test_seq,seqlen)

print(Xtrain.shape,ytrain.shape)
print(Xtest.shape,ytest.shape)
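
To see how the inputs and targets line up, the following sketch repeats the same slicing on a short made-up integer sequence. Each target row is the input row shifted one position along, so the model is asked to predict the next item at every step. The names `toy_ids`, `toy_len`, `toy_feats` and `toy_labels` are assumptions made for this example.

```python
import numpy as np

# Toy illustration only: toy_ids and toy_len are made-up values for this sketch.
toy_ids = list(range(10))   # stands in for an integer-coded text
toy_len = 3                 # length of each sequence

# Number of complete (input, target) pairs; one item is held back so that
# the last target position still has a "next" item available.
n = (len(toy_ids) - 1) // toy_len

toy_feats = np.zeros((n, toy_len))
toy_labels = np.zeros((n, toy_len))
for i in range(n):
    start = i * toy_len
    # Inputs are items start .. start+toy_len-1.
    toy_feats[i, :] = toy_ids[start:start + toy_len]
    # Targets are the same window shifted one item to the right.
    toy_labels[i, :] = toy_ids[start + 1:start + toy_len + 1]

print(toy_feats)
print(toy_labels)
```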


---
(e) Build the recurrent model. Run the code and add comments.

In [0]:
import tensorflow as tf
# 
def perplexity(y_true, y_pred):
    cross_entropy = tf.losses.sparse_categorical_crossentropy(y_true, y_pred)
    perplexity = tf.exp(tf.reduce_mean(cross_entropy))
    return perplexity

# 
isize=NCHAR
embed_size=64
osize=NCHAR

# 
model = Sequential()
model.add(Embedding(input_dim=isize, output_dim=embed_size,input_length=seqlen))
model.add(LSTM(128,return_sequences=True,activation='tanh'))
model.add(LSTM(128,return_sequences=True,activation='tanh'))
model.add(TimeDistributed(Dense(osize, activation='softmax')))
#
# 
model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=[perplexity])
print(model.summary())
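
The relationship between cross-entropy and perplexity used in the metric above can be checked by hand with NumPy. The sketch below assumes a made-up model that always predicts a uniform distribution over `toy_n` items; in that case the perplexity should come out equal to `toy_n`. All `toy_*` names and values are assumptions made for this example.

```python
import numpy as np

# Toy illustration only: toy_n and the probabilities below are made-up values.
toy_n = 4
# Probability assigned to the correct item at each of 5 positions,
# for a model that always predicts a uniform distribution over toy_n items.
toy_p_correct = np.full(5, 1.0 / toy_n)

# Cross-entropy is the mean negative log probability of the correct item.
toy_cross_entropy = np.mean(-np.log(toy_p_correct))
# Perplexity is the exponential of the cross-entropy.
toy_perplexity = np.exp(toy_cross_entropy)

print(toy_cross_entropy, toy_perplexity)
```

A perplexity equal to the number of items means the model is doing no better than guessing uniformly; a trained model should give a lower value.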

---
(f) Train the model. Run the code and add comments.

In [0]:
# 
history=model.fit(Xtrain,ytrain, batch_size=64, validation_data=(Xtest,ytest), epochs=100)


---
(g) Calculate perplexity on test set. Run the code and add comments.

In [0]:
import math

# 
ypred=model.predict(Xtest)

# 
nseq=ypred.shape[0]
seqlen=ypred.shape[1]
ypred=np.reshape(ypred,(nseq*seqlen,ypred.shape[2]))

# 
ytest_seq=np.reshape(ytest,(nseq*seqlen,1))

# 
probs=[]
for i in range(ytest_seq.shape[0]):
  probs.append(ypred[i,int(ytest_seq[i,0])])

# 
meanprob=np.mean(probs)
entropy=np.mean(-np.log(probs))
print(meanprob,entropy,math.exp(entropy))
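
The loop that collects the probability of each correct character can also be written with NumPy integer-array indexing. The following is a minimal sketch on made-up data, not a replacement for the cell above; `toy_pred` and `toy_true` are hypothetical stand-ins for the flattened predictions and labels.

```python
import numpy as np

# Toy illustration only: toy_pred and toy_true are made-up values.
# Three positions, four possible characters; each row sums to 1.
toy_pred = np.array([[0.70, 0.10, 0.10, 0.10],
                     [0.20, 0.50, 0.20, 0.10],
                     [0.25, 0.25, 0.25, 0.25]])
# Index of the correct character at each position.
toy_true = np.array([0, 1, 3])

# For each row, pick out the probability in the column given by toy_true.
toy_probs = toy_pred[np.arange(len(toy_true)), toy_true]

# Mean negative log probability, and its exponential (the perplexity).
toy_entropy = np.mean(-np.log(toy_probs))
print(toy_probs, toy_entropy, np.exp(toy_entropy))
```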

---
(h) Generate some new text using the language model. Run the code and add comments.

In [0]:
# 
def sampledist(dist):
  thresh=np.random.random()
  total=0
  for i in range(len(dist)):
    total += dist[i]
    if total > thresh:
      return i
  return len(dist)-1

def generate_text(model, start_string, num_generate=1000):
  # 

  # 
  input_eval = [char_to_int[s] for s in start_string]
  input_eval = pad_sequences([input_eval],maxlen=100,padding='pre',value=0)
  input_eval = np.reshape(input_eval,(1,100))

  # 
  text_generated = []

  # 
  model.reset_states()
  for i in range(num_generate):
    # 
    predictions = model.predict(input_eval,batch_size=1)

    # 
    cprobabilities=predictions[0][-1,:]

    # 
    predicted_id = sampledist(cprobabilities)
  
    # 
    # 
    input_eval[0,:-1] = input_eval[0,1:]
    input_eval[0,-1] = predicted_id

    # 
    text_generated.append(chars[predicted_id])

  return (start_string + ''.join(text_generated))

# 
import textwrap
text=generate_text(model, start_string="once upon a time ",num_generate=1000)
print(textwrap.fill(text,80))
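
Sampling from a probability distribution by accumulating probabilities until a random threshold is passed can also be expressed with cumulative sums. The sketch below draws many samples from a made-up three-item distribution and checks that the observed frequencies are close to the probabilities. The distribution, the fixed random seed and all variable names are assumptions made for this example.

```python
import numpy as np

# Toy illustration only: toy_dist and the fixed seed are made-up values.
rng = np.random.default_rng(0)
toy_dist = np.array([0.1, 0.6, 0.3])

# Cumulative sums split the interval [0, 1) into one segment per item,
# with each segment's width equal to that item's probability.
toy_cum = np.cumsum(toy_dist)

# Draw many uniform random numbers and find the segment each one falls in:
# the first index whose cumulative sum exceeds the random number.
toy_draws = rng.random(10000)
toy_samples = np.searchsorted(toy_cum, toy_draws, side='right')
# Guard against rounding error leaving the final cumulative sum just below 1.
toy_samples = np.minimum(toy_samples, len(toy_dist) - 1)

# The observed frequencies should be close to toy_dist.
toy_freqs = np.bincount(toy_samples, minlength=len(toy_dist)) / len(toy_samples)
print(toy_freqs)
```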