<a href="https://colab.research.google.com/github/mhuckvale/pals0039/blob/master/Answers_4_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 4.1 Answers

In this exercise we implement a DNN system for sentiment analysis of movie reviews.

We use a set of film reviews taken from the [Internet Movie Database](https://www.imdb.com/) which have been labelled as positive or negative. Words in the reviews have already been tokenised and encoded as numbers using a dictionary. We load the numeric sequences as variable-length lists, then build a bag-of-words model for each review. This gives a fixed-length vector for each review, which we can input into a DNN classifier.
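
To make the bag-of-words idea concrete, here is a minimal sketch on made-up data. The toy sequences and the vocabulary size of 6 are assumptions for illustration and are not taken from the IMDB set. Each variable-length list of word codes becomes a fixed-length vector of counts, so word order is discarded.

```python
import numpy as np

# hypothetical toy data: two "reviews" of different lengths, word codes in 0..5
toy_sequences = [[1, 4, 4, 2], [3, 1]]
toy_vocab_size = 6

# one row per review, one column per word code
toy_counts = np.zeros((len(toy_sequences), toy_vocab_size))
for row, seq in enumerate(toy_sequences):
    for code in seq:
        # count how often each word code occurs in this review
        toy_counts[row, code] += 1

print(toy_counts)
```

Both rows have the same width even though the reviews differ in length, which is what lets them be fed to a network with a fixed number of inputs.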

---
(a) Run the following code and then add comments to explain what is performed in each step.

In [None]:
# import standard libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# import functions from keras toolkit
%tensorflow_version 2.x
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import imdb

# load the training data and test data from the IMDB dataset,
# keeping only the 10000 most frequent words (rarer words get the "unknown" code)
(Xtrain_seq,ytrain_seq),(Xtest_seq,ytest_seq)=imdb.load_data(num_words=10000)

# print out a sample of the data
print(Xtrain_seq[:5])
print(ytrain_seq[:5])



---
(b) Here we inspect the data, fetch the word dictionary and build a reverse dictionary. Run the code then add comments.

In [None]:
# find out the lengths for each of the reviews
listlengths=[]
for sequence in Xtrain_seq:
  listlengths.append(len(sequence))
print("Longest list",max(listlengths))

# find out the largest word code used
wordcodes=[]
for sequence in Xtrain_seq:
  wordcodes.append(max(sequence))
print("Highest code",max(wordcodes))

# get the dictionary from the database
dictionary=imdb.get_word_index()

# build a reverse dictionary; codes 0-2 are reserved, so word ranks are offset by 3
reverse_dictionary={0:"padding",1:"BOS",2:"UNK"}
for (key,value) in dictionary.items():
  reverse_dictionary[value+3]=key

# print out the first 10 items in the reverse dictionary
print(list(map(reverse_dictionary.get, range(10))))

# convert a review back from numerical codes to text using the reverse dictionary
def imdb_review(seq):
  review=[]
  for widx in seq:
    review.append(reverse_dictionary[widx])
  return " ".join(review)

# print out the first review
print("First review:",imdb_review(Xtrain_seq[0]))
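
The offset of 3 in the reverse dictionary can be seen on a small example. This is a sketch only: `toy_word_index` and `toy_seq` are hypothetical stand-ins for the real word index and a real review, and `.get()` with a `"?"` placeholder is an added safeguard for codes that have no entry.

```python
# hypothetical miniature word index in the style of imdb.get_word_index():
# words are ranked from 1 (most frequent) upwards
toy_word_index = {"the": 1, "and": 2, "film": 3}

# codes 0, 1 and 2 are reserved, so every word rank is shifted up by 3
toy_reverse = {0: "padding", 1: "BOS", 2: "UNK"}
for word, rank in toy_word_index.items():
    toy_reverse[rank + 3] = word

# decode a toy sequence; .get() gives a visible placeholder for unmapped codes
toy_seq = [1, 4, 6, 2, 5]
decoded = " ".join(toy_reverse.get(code, "?") for code in toy_seq)
print(decoded)
```

Note that code 3 has no entry after the shift, which is why the real reverse dictionary also has a gap at that position.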
  

---
(c) We now vectorise the review documents. Run the code and add comments.

In [None]:
# function to convert lists of word indices into count vectors
def vectorise(sequences,numwords=10000):
  # output has one row per sequence and one column per word in the dictionary
  table=np.zeros((len(sequences),numwords))
  # loop over every sequence
  for i,seq in enumerate(sequences):
    # loop over every word in sequence
    for idx in seq:
      # add 1 to relevant cell
      table[i,idx] += 1
  return table

# convert training and test data to vectors
Xtrain=vectorise(Xtrain_seq)
Xtest=vectorise(Xtest_seq)

# convert labels to numpy array
ytrain=np.asarray(ytrain_seq,dtype='float32')
ytest=np.asarray(ytest_seq,dtype='float32')

# print out some of the data to see what it looks like
print(Xtrain.shape,ytrain.shape,Xtest.shape,ytest.shape)
print(Xtrain[:10,:25])
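
The inner loop over words can also be replaced by a single NumPy call per review. The following is a sketch of one alternative using `np.bincount`; the name `vectorise_fast` and the toy input are assumptions for illustration, and it relies on every word code being smaller than `numwords`.

```python
import numpy as np

def vectorise_fast(sequences, numwords=10000):
    # same layout as the loop version: one row per sequence,
    # one column per word code, each cell holding a count
    table = np.zeros((len(sequences), numwords))
    for i, seq in enumerate(sequences):
        # bincount counts occurrences of each code; minlength pads to full width
        codes = np.asarray(seq, dtype=np.int64)
        table[i] = np.bincount(codes, minlength=numwords)
    return table

# hypothetical toy input, including an empty review
toy_table = vectorise_fast([[1, 4, 4, 2], [3, 1], []], numwords=6)
print(toy_table)
```

This should produce the same table as the explicit double loop while avoiding the per-word Python loop.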

---
(d) Build the DNN model. Run the code and add comments.

In [None]:
# use the Keras Sequential model
model = Sequential()
# add a dense layer of 16 units with 10000 inputs
model.add(Dense(16,activation='sigmoid',input_shape=(Xtrain.shape[1],)))
# add a second dense layer
model.add(Dense(16,activation='sigmoid'))
# add a final output layer to give sentiment probability
model.add(Dense(1,activation='sigmoid'))
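# use binary cross-entropy as the loss, which suits a single probability output,
# train with the RMSprop optimizer and report accuracy during training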
model.compile(loss='binary_crossentropy',optimizer="rmsprop",metrics=['accuracy'])
model.summary()
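
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below assumes the layer sizes used above (10000 inputs, then 16, 16 and 1 units) and uses the fact that a dense layer has one weight per input-unit pair plus one bias per unit.

```python
# layer sizes taken from the model above: 10000 inputs, then 16, 16 and 1 units
layer_sizes = [10000, 16, 16, 1]

# a dense layer has (inputs x units) weights plus one bias per unit
params_per_layer = [n_in * n_out + n_out
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)
# → [160016, 272, 17] 160305
```

Almost all of the parameters sit in the first layer, because its input is as wide as the dictionary.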

---
(e) Train the model. Run the code and add comments.

In [None]:
# train the model, saving history
# we use 5% of the training data for validation
history=model.fit(Xtrain,ytrain,epochs=20,batch_size=64,validation_split=0.05)


---
(f) Plot training graphs. Run the code and add comments.

In [None]:
# get the history dictionary
hist=history.history

# get x axis as number of epochs
epochs=range(1,len(hist['loss'])+1)

# plot how loss on the training data and validation data changed with epoch
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.plot(epochs,hist['loss'],'b-',label="Training loss")
plt.plot(epochs,hist['val_loss'],'r-',label="Validation loss")
plt.title("Training and Validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()

# plot how accuracy on the training data and validation data changed with epoch
plt.subplot(1,2,2)
plt.plot(epochs,hist['accuracy'],'b-',label="Training accuracy")
plt.plot(epochs,hist['val_accuracy'],'r-',label="Validation accuracy")
plt.title("Training and Validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
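
When the validation loss starts to rise while the training loss keeps falling, the model is overfitting, and the epoch with the lowest validation loss is a reasonable place to stop. A minimal sketch of reading that epoch off the history follows; the values in `toy_hist` are invented for illustration and stand in for `history.history`.

```python
import numpy as np

# hypothetical history values for illustration only; in the notebook this
# would be the dictionary stored in history.history
toy_hist = {"val_loss": [0.60, 0.40, 0.32, 0.35, 0.41]}

# epochs are numbered from 1, so add 1 to the zero-based index
best_epoch = int(np.argmin(toy_hist["val_loss"])) + 1
print("Lowest validation loss at epoch", best_epoch)
```

The same two lines applied to `hist['val_loss']` give a starting point for choosing the number of training epochs.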

---
(g) Get test performance. Run the code and add comments.

In [None]:
# evaluate the model on the test data
score, acc = model.evaluate(Xtest, ytest,verbose=0)
print('Test loss:', score)
print('Test accuracy:', acc)

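# get the predicted probabilities for the test data; the output has one column
ypred=model.predict(Xtest)
print(ypred.shape)
# flatten to a 1-D array so that it matches the shape of the labels
ypred=ypred.flatten()
print(ypred.shape)
# compare the first 10 true labels with the first 10 predicted probabilities
print(ytest[:10])
print(ypred[:10])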
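
The predictions are probabilities, so a threshold is needed to turn them into positive or negative decisions. The sketch below assumes a threshold of 0.5 and uses invented values in `toy_ytest` and `toy_ypred` in place of the real labels and predictions.

```python
import numpy as np

# hypothetical labels and predicted probabilities, for illustration only
toy_ytest = np.array([0, 1, 1, 0, 1], dtype="float32")
toy_ypred = np.array([0.10, 0.80, 0.40, 0.65, 0.95])

# threshold the probabilities at 0.5 to get hard class decisions
toy_labels = (toy_ypred >= 0.5).astype("float32")

# accuracy is the proportion of decisions that match the true labels
toy_accuracy = float(np.mean(toy_labels == toy_ytest))
print(toy_labels, toy_accuracy)
```

Here three of the five decisions are correct, giving an accuracy of 0.6.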


---
(h) Experiment with the problem to try to improve performance. Here are some ideas:
<ol>
<li>Change the structure of the network: size, number of layers, node type, optimizer, number of epochs of training.</li>
<li>Change term counts to term frequencies.</li>
<li>Weight term counts using the TF-IDF method.</li>
</ol>
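
The last two ideas can be sketched on a small count matrix. This is an illustration under stated assumptions: `toy_counts` is invented, and the smoothed IDF formula used here is one common variant among several definitions. On the real data, the document frequencies would normally be computed from the training set and reused for the test set.

```python
import numpy as np

# hypothetical count matrix: 3 documents (rows) by 4 terms (columns)
toy_counts = np.array([[2., 0., 1., 1.],
                       [0., 3., 0., 1.],
                       [1., 1., 0., 2.]])

# term frequency: divide each row by the document length
# (np.maximum guards against division by zero for empty documents)
doc_lengths = toy_counts.sum(axis=1, keepdims=True)
toy_tf = toy_counts / np.maximum(doc_lengths, 1)

# document frequency: number of documents in which each term occurs
doc_freq = (toy_counts > 0).sum(axis=0)

# one common smoothed IDF variant (an assumption; other definitions exist)
n_docs = toy_counts.shape[0]
toy_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF: weight each term frequency by how rare the term is across documents
toy_tfidf = toy_tf * toy_idf
print(toy_tfidf.round(3))
```

Terms that occur in every document receive the smallest weight, while terms confined to a few documents are boosted.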
