<a href="https://colab.research.google.com/github/mhuckvale/pals0039/blob/master/Answers_6_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 6.1 Answers

In this exercise we revisit the IMDB sentiment analysis task, this time using pre-trained GloVe word embeddings and a recurrent network for classification.


(a) Load the IMDB review data set. Run the code and add comments.

In [0]:
# import standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# import Keras toolkit
%tensorflow_version 2.x
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Embedding, Flatten, SimpleRNN, LSTM, GRU
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# download the IMDB dataset, with a 10,000 word vocabulary
(Xtrain_seq,ytrain_seq),(Xtest_seq,ytest_seq)=imdb.load_data(num_words=10000)

# print some samples
print(Xtrain_seq[:5])
print(ytrain_seq[:5])



---
(b) Print some basic parameters of the data set and an example review. Run the code and add comments.

In [0]:
# get a list of all the sequence lengths and report longest
listlengths=[]
for sequence in Xtrain_seq:
  listlengths.append(len(sequence))
print("Longest list",max(listlengths))

# get a list of the largest code used in each sequence and report largest overall
wordcodes=[]
for sequence in Xtrain_seq:
  wordcodes.append(max(sequence))
print("Highest code",max(wordcodes))

# get the IMDB dictionary and build a reverse dictionary
dictionary=imdb.get_word_index()
print("the",dictionary['the'])
reverse_dictionary={0:"padding",1:"BOS",2:"UNK"}
for (key,value) in dictionary.items():
  reverse_dictionary[value+3]=key

# print the start of the reverse dictionary
print(list(map(reverse_dictionary.get, range(10))))

# print the first review decoded
print("First review")
review=[]
for i in range(len(Xtrain_seq[0])):
  review.append(reverse_dictionary[Xtrain_seq[0][i]])
print(" ".join(review))
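
The `+3` used when building the reverse dictionary reflects the reserved low codes in the encoded reviews: 0 for padding, 1 for the start of a review and 2 for unknown words, while code 3 is not assigned to any word. The sketch below illustrates the same offset on a toy index; `toy_index`, `toy_reverse` and `decode` are hypothetical names, and the three-word vocabulary is made up for illustration.

```python
# Toy illustration of the index offset used when decoding IMDB reviews.
# toy_index is a made-up stand-in for imdb.get_word_index().
toy_index = {"the": 1, "film": 2, "great": 3}

OFFSET = 3  # codes 0, 1 and 2 are reserved for padding, start-of-review and unknown words
toy_reverse = {0: "padding", 1: "BOS", 2: "UNK"}
for word, rank in toy_index.items():
    toy_reverse[rank + OFFSET] = word

def decode(codes):
    """Map a list of integer codes back to words, marking unmapped codes with '?'."""
    return " ".join(toy_reverse.get(code, "?") for code in codes)

encoded = [1, 4, 5, 2, 6]
decoded = decode(encoded)
print(decoded)
```

Using `dict.get` with a default means an unmapped code (such as 3) produces a visible marker instead of raising a `KeyError`.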
  

---
(c) Load the GloVe embeddings. Run the code and add comments.

In [0]:
# read in the GloVe embeddings of size 100
df=pd.read_csv('https://www.phon.ucl.ac.uk/courses/pals0039/data/glove.6B.100d.zip',header=None)
df.rename(columns={0:"word"},inplace=True)
print("Read %d word embeddings of length %d" % (len(df),len(df.columns)-1))
embedsize=len(df.columns)-1

# print the first few rows
df.head()
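
The course file above is read with `pd.read_csv`. The standard GloVe distribution files are instead plain text, with each line holding a word followed by its vector components separated by spaces. As a rough sketch of how such a file could be parsed without pandas, assuming that standard layout: the function name `parse_glove` and the two sample lines (with 3-dimensional vectors) are invented for illustration.

```python
import io

# Made-up sample in the layout of the standard GloVe text files:
# one word per line followed by its vector components, separated by spaces.
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
)

def parse_glove(handle):
    """Return a dict mapping each word to its embedding as a list of floats."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

toy_glove = parse_glove(sample)
print(len(toy_glove), "words of dimension", len(toy_glove["the"]))
```

For a real file, the same function could be passed an open file handle in place of the `StringIO` object.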

---
(d) Build a lookup index for GloVe. Run the code and add comments.

In [0]:
# build a dictionary mapping each GloVe word to its row number
glove_index={}
for i,word in enumerate(df.word):
  glove_index[word]=i

# test that the dictionary works
print("#words",len(glove_index))
print(glove_index['the'],glove_index['white'],glove_index['cat'])
print(df.word[glove_index['the']],df.word[glove_index['white']],df.word[glove_index['cat']])

# build a numpy array of embeddings
glove_embed=np.array(df.iloc[:,1:])
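
One way to sanity-check an embedding table beyond index lookups is to compare vectors by cosine similarity, which is 1 for vectors pointing the same way and 0 for orthogonal ones. The sketch below uses made-up 3-dimensional vectors rather than real GloVe rows; `cosine_similarity` and `toy_embed` are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors standing in for rows of glove_embed.
toy_embed = {
    "cat": [1.0, 2.0, 0.0],
    "dog": [2.0, 4.0, 0.0],   # same direction as "cat"
    "car": [0.0, 0.0, 5.0],   # orthogonal to "cat"
}
sim_cat_dog = cosine_similarity(toy_embed["cat"], toy_embed["dog"])
sim_cat_car = cosine_similarity(toy_embed["cat"], toy_embed["car"])
print(sim_cat_dog, sim_cat_car)
```

With the notebook's variables, the same function could be applied to `glove_embed[glove_index['cat']]` and the row for another word, since it only iterates over the two vectors.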

---
(e) Map the IMDB vocabulary to the GloVe dictionary. Run the code and add comments.

In [0]:
# build an array to hold the embeddings for every IMDB word
word_embed=np.zeros((10000,embedsize))
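# look up each IMDB word in the GloVe dictionary, copying embedding if available
for i in range(10000):
  if i <= 3:
    # reserved codes (padding, BOS, UNK and the unused code 3) share the '.' embedding
    word='.'
  elif i in reverse_dictionary:
    word=reverse_dictionary[i]
  else:
    # report the gap and fall back to '.' rather than reusing the previous word
    print("Word index %d not found" % (i))
    word='.'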
  if word.lower() in glove_index:
    eidx=glove_index[word.lower()]
  else:
    eidx=glove_index['.']
  word_embed[i,:]=glove_embed[eidx,:]

# print some simple tests to check all is well
idx_the=glove_index['the']
emb_the=glove_embed[idx_the]
widx_the=dictionary['the']+3
wemb_the=word_embed[widx_the]
print(idx_the,emb_the)
print(widx_the,wemb_the)
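
Because every IMDB word without a GloVe entry silently receives the embedding for `'.'`, it can be useful to count how often that fallback is used. The following is a minimal sketch of such a coverage check on made-up miniature vocabularies; `toy_imdb_words` and `toy_glove_index` are hypothetical stand-ins for `reverse_dictionary` and `glove_index`.

```python
# Made-up miniature vocabularies; in the notebook these roles are played by
# reverse_dictionary (IMDB codes to words) and glove_index (words to GloVe rows).
toy_imdb_words = {4: "the", 5: "film", 6: "zzzunseen", 7: "Great"}
toy_glove_index = {".": 0, "the": 1, "film": 2, "great": 3}

found, fallback = [], []
for code, word in sorted(toy_imdb_words.items()):
    if word.lower() in toy_glove_index:
        found.append(word)
    else:
        fallback.append(word)  # these would receive the embedding for '.'

coverage = len(found) / len(toy_imdb_words)
print("coverage %.2f, fallback words: %s" % (coverage, fallback))
```

A low coverage figure would suggest that many reviews are being represented largely by the fallback vector.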

---
(e) Pad or truncate the IMDB sequences to a fixed length, padding at the start. Run the code and add comments.

In [0]:
# pad/truncate all IMDB reviews to 500 words
maxseq=500

# pad the sequences with zeros at the start
Xtrain_pad=pad_sequences(Xtrain_seq,maxlen=maxseq,padding='pre',value=0)
Xtest_pad=pad_sequences(Xtest_seq,maxlen=maxseq,padding='pre',value=0)

# print the size of the training data and some sample values
print(Xtrain_pad.shape)
print(Xtrain_pad[:10,:25])
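
With `padding='pre'` and the default `truncating='pre'`, `pad_sequences` adds zeros at the front of short reviews and drops words from the front of long ones, so the end of each review is always kept. The pure-Python sketch below mimics that behaviour for a single sequence; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(sequence) >= maxlen:
        return list(sequence[len(sequence) - maxlen:])
    return [value] * (maxlen - len(sequence)) + list(sequence)

short = pad_pre([1, 14, 22], maxlen=5)
long = pad_pre([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short)
print(long)
```

Note that truncating at the front removes the start-of-review code 1 from any review longer than `maxlen`.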

---
(f) Build a model with embeddings in first layer. Run the code and add comments.

In [0]:
# create a sequential model
model = Sequential()
# first layer holds our pre-calculated embedding (which is fixed)
model.add(Embedding(10000, embedsize, weights=[word_embed], input_length=maxseq, trainable=False))
model.add(LSTM(32,return_sequences=False,dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
model.summary()
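
The parameter counts reported in the model summary can be checked by hand. The arithmetic below is a sketch assuming the standard Keras LSTM parameterisation (four weight blocks, for the input, forget and output gates and the candidate cell state, each with input weights, recurrent weights and a bias); the variable names are chosen for illustration.

```python
vocab_size = 10000   # rows in the embedding matrix
embed_size = 100     # GloVe vector length (embedsize in the notebook)
lstm_units = 32

# Embedding: one vector per vocabulary entry (frozen, so not trainable here).
embedding_params = vocab_size * embed_size

# LSTM: four weight blocks, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embed_size + lstm_units) + lstm_units)

# Dense output: one weight per LSTM unit plus a bias.
dense_params = lstm_units * 1 + 1

trainable_params = lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, trainable_params)
```

Since the embedding layer is frozen, only the LSTM and Dense parameters are updated during training.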


---
(g) Train the model. Run the code and add comments.

In [0]:
# we use the first 1000 test reviews for validation (for demonstration)
Xval = Xtest_pad[:1000]
yval = ytest_seq[:1000]
# fit the model
history=model.fit(Xtrain_pad, ytrain_seq, epochs=15, verbose=1,batch_size=64,validation_data=(Xval,yval))
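
Validating on test reviews is done here only for demonstration; for a cleaner estimate of test performance, validation data would normally be held out from the training set instead. A minimal sketch of such a split, using a seeded shuffle of indices: `holdout_split` is a hypothetical helper, and the sizes are toy values.

```python
import random

def holdout_split(n_items, n_val, seed=0):
    """Return (train_indices, val_indices) after a seeded shuffle."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = holdout_split(n_items=20, n_val=5)
print(len(train_idx), len(val_idx))
```

In the notebook, the two index lists could be used to select rows of `Xtrain_pad` and `ytrain_seq`, leaving the whole test set untouched until the final evaluation.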


---
(h) Plot training graphs. Run the code and add comments.

In [0]:
# get the history dictionary 
hist=history.history
epochs=range(1,len(hist['loss'])+1)

# plot loss curves
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.plot(epochs,hist['loss'],'bo',label="Training loss")
plt.plot(epochs,hist['val_loss'],'b-',label="Validation loss")
plt.title("Training and Validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()

# plot accuracy curves
plt.subplot(1,2,2)
plt.plot(epochs,hist['accuracy'],'bo',label="Training accuracy")
plt.plot(epochs,hist['val_accuracy'],'b-',label="Validation accuracy")
plt.title("Training and Validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
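
Besides reading the plots, the epoch with the lowest validation loss can be picked out directly from the history dictionary, which is one simple way to judge where overfitting begins. The sketch below uses made-up loss values in the same shape as `history.history`; they are not results from this model.

```python
# Made-up numbers in the shape of history.history; not results from this model.
toy_hist = {
    "loss":     [0.60, 0.45, 0.38, 0.33, 0.29],
    "val_loss": [0.55, 0.44, 0.41, 0.43, 0.47],
}

val_loss = toy_hist["val_loss"]
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = best_index + 1  # epochs are numbered from 1 in the plots
print("lowest validation loss %.2f at epoch %d" % (val_loss[best_index], best_epoch))
```

In this toy example the training loss keeps falling after the third epoch while the validation loss rises, the usual sign of overfitting.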

---
(i) Get test performance. Run the code and add comments.

In [0]:
# evaluate the model on the test data
score, acc = model.evaluate(Xtest_pad, ytest_seq,verbose=0)
print('Test loss:', score)
print('Test accuracy:', acc)

# predict the sentiment of the test data and report some output values
ypred=model.predict(Xtest_pad)
ypred=ypred.flatten()
print(list(zip(ytest_seq[:10],ypred[:10])))
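
The model outputs a sigmoid value between 0 and 1 for each review, and the accuracy metric counts a prediction as positive when that value exceeds 0.5. The sketch below applies the same threshold by hand and tallies the four outcome types; the labels and probabilities are made up, standing in for `ytest_seq` and `ypred`.

```python
# Made-up labels and sigmoid outputs standing in for ytest_seq and ypred.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_probs = [0.91, 0.12, 0.48, 0.77, 0.63, 0.05]

# Threshold the probabilities at 0.5 to get hard decisions.
toy_decisions = [1 if p > 0.5 else 0 for p in toy_probs]

pairs = list(zip(toy_labels, toy_decisions))
tp = sum(1 for y, d in pairs if y == 1 and d == 1)  # true positives
tn = sum(1 for y, d in pairs if y == 0 and d == 0)  # true negatives
fp = sum(1 for y, d in pairs if y == 0 and d == 1)  # false positives
fn = sum(1 for y, d in pairs if y == 1 and d == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
print(tp, tn, fp, fn, accuracy)
```

Separating false positives from false negatives shows whether the errors lean towards one sentiment class, which accuracy alone does not reveal.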


---
(j) Experiment with the network architecture and training protocol for this problem. What is the best result you can get on the test set?