## Deep Learning Course (980)
## Assignment Three 

__Assignment Goals:__

- Implementing RNN-based language models.
- Implementing and applying a Recurrent Neural Network to a text classification problem using TensorFlow.
- Implementing __many to one__ and __many to many__ RNN sequence processing.

In this assignment, you will implement RNN-based language models and compare the word representations extracted from the different models. You will also compare two training methods for sequential data: Truncated Backpropagation Through Time __(TBTT)__ and Backpropagation Through Time __(BTT)__.
You will also apply a vanilla RNN to learn word representations and solve a text classification problem.


__DataSets__: You will use two datasets: an English Literature dataset for the language modeling task (parts 1 to 4) and 20Newsgroups for text classification (part 5).


1. (30 points) Implement the RNN-based language model described by Mikolov et al. [1], also called the __Elman network__, and train a language model on the English Literature dataset. This network contains an input layer, a hidden layer, and an output layer, and is trained by standard backpropagation (TBTT with τ = 1) using the cross-entropy loss.
   - The input represents the current word while using 1-of-N coding (thus its size is equal to the size of the vocabulary) and vector s(t − 1) that represents output values in the hidden layer from the previous time step. 
   - The hidden layer is a fully connected sigmoid layer with size 500. 
   - Softmax Output Layer to capture a valid probability distribution.
   - The model is trained with truncated backpropagation through time (TBTT) with τ = 1: the weights of the network are updated based on the error vector computed only for the current time step.
   
   Download the English Literature dataset and train the language model as described, then report the model's cross-entropy loss on the training set. Use nltk.word_tokenize to tokenize the documents.
   For initialization, s(0) can be set to a vector of small values. Note that we are not interested in the *dynamic model* mentioned in the original paper.
   To simplify the implementation, you can use Keras to define the neural network layers, including the Keras Embedding layer (keras.layers.Embedding). (The Embedding layer adds a mapping layer that the original Elman architecture does not have.)

2. (20 points) TBTT has a lower computational cost and lower memory requirements than the *backpropagation through time algorithm (BTT)*. These benefits come at the cost of losing long-term dependencies [2]. Now let's investigate the computational cost and performance of training our language model with BTT. For training the Elman-type RNN with BTT, one option is to perform mini-batch gradient descent with exactly one sentence per mini-batch. (The input size will be [1, Sentence Length].)

    1. Split the document into sentences (you can use nltk.tokenize.sent_tokenize).
    2. For each sentence, perform one pass that computes the mean/sum loss for this sentence; then perform a gradient update for the whole sentence. (So the mini-batch size varies for the sentences with different lengths). You can truncate long sentences to fit the data in memory. 
    3. Report the model cross-entropy loss.

3. (15 points) Simple recurrent neural networks do not seem able to truly exploit context information with long dependencies, because of the vanishing and exploding gradient problem. To address this problem, gating mechanisms for recurrent neural networks were introduced. Train your last model (Elman + BTT) with the SimpleRNN unit replaced by a Gated Recurrent Unit (GRU). Report the model's cross-entropy loss. Compare your results in terms of cross-entropy loss with the two other approaches (parts 1 and 2). Use each model to generate 10 synthetic sentences of 15 words each. Discuss the quality of the generated sentences: do they look like proper English? Do they match the training set?
    Text generation from a given language model can be done using the following iterative process:
   1. Set sequence = \[first_word\], chosen randomly.
   2. Select a new word based on the sequence so far, add this word to the sequence, and repeat. At each iteration, select the word with maximum probability given the sequence so far. The trained language model outputs this probability. 

4. (15 points) The text describes how to extract a word representation from a trained RNN (Chapter 4). How can we evaluate the word representations extracted from your trained RNNs? Compare the word representations extracted from each of the approaches using one of the existing methods.

5. (20 points) We aim to learn an RNN model that predicts a document's category from its content (text classification). For this task, we will use the 20Newsgroups dataset, which contains messages from twenty newsgroups. We selected four major categories (comp, politics, rec, and religion) comprising around 13k documents altogether. Your model should learn word representations to support the classification task. To solve this problem, modify the __Elman network__ architecture so that the last layer is a softmax layer with just 4 output neurons (one for each category).

    1. Download the 20Newsgroups dataset, and use the implemented code from the notebook to read in the dataset.
    2. Split the data into a training set (90 percent) and validation set (10 percent). Train the model on  20Newsgroups.
    3. Report your accuracy results on the validation set.
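
The Elman recurrence from part 1 and the four-way softmax head from part 5 can be sketched in plain NumPy. This is a minimal illustration rather than the Keras implementation the assignment asks for: the weight names `U`, `W`, `V`, the toy sizes, and the random initialization are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def elman_classify(word_ids, U, W, V, s0):
    """Run an Elman RNN over a word-index sequence and classify from the last state.

    U: (hidden, vocab) input weights; selecting column w equals U @ one_hot(w).
    W: (hidden, hidden) recurrent weights; V: (classes, hidden) output weights.
    """
    s = s0
    for w in word_ids:
        s = sigmoid(U[:, w] + W @ s)   # s(t) = f(U w(t) + W s(t-1))
    return softmax(V @ s)              # many-to-one: one distribution per document

# Toy sizes (assumed for illustration; the assignment uses hidden size 500).
rng = np.random.default_rng(0)
vocab, hidden, classes = 20, 8, 4
U = rng.normal(scale=0.1, size=(hidden, vocab))
W = rng.normal(scale=0.1, size=(hidden, hidden))
V = rng.normal(scale=0.1, size=(classes, hidden))
s0 = np.full(hidden, 0.1)              # s(0): a vector of small values

probs = elman_classify([3, 7, 1, 12], U, W, V, s0)
print(probs.shape, probs.sum())
```

For the language model in part 1, the same recurrence applies with `V` of shape (vocab, hidden) and the softmax read out at every time step instead of only the last one.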

__NOTE__: Please use Jupyter Notebook. The notebook should include the final code, results and your answers. You should submit your Notebook in (.pdf or .html) and .ipynb format. (penalty 10 points) 

To reduce the parameters, you can merge all words that occur less often than a threshold into a special rare token (\__unk__).
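
One way to merge rare words into a single token is sketched below. The function names, the `min_count` threshold, and the choice to reserve index 0 for the rare token are assumptions made for this example, not requirements of the assignment.

```python
from collections import Counter

def build_vocab(tokens, min_count=2, unk_token="__unk__"):
    """Map each sufficiently frequent token to an index; index 0 is reserved for unk_token."""
    counts = Counter(tokens)
    index2word = [unk_token]
    for word, count in counts.items():
        if count >= min_count:
            index2word.append(word)
    word2index = {word: i for i, word in enumerate(index2word)}
    return word2index, index2word

def encode(tokens, word2index, unk_token="__unk__"):
    """Replace every out-of-vocabulary token with the index of unk_token."""
    unk = word2index[unk_token]
    return [word2index.get(tok, unk) for tok in tokens]

tokens = "the cat sat on the mat and the cat slept".split()
word2index, index2word = build_vocab(tokens, min_count=2)
print(index2word)
print(encode(tokens, word2index))
```

Reserving a dedicated index for the rare token keeps unknown words from being silently mapped onto a real vocabulary word.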

__Instructions__:

The university policy on academic dishonesty and plagiarism (cheating) will be taken very seriously in this course. Everything submitted should be your own writing or coding. You must not let other students copy your work. Spelling and grammar count.

Your assignments will be marked based on correctness, originality (the implementations and ideas are your own), clarity, and test performance.


[1] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, Sanjeev Khudanpur: Recurrent neural network based language model. In: Proc. INTERSPEECH 2010

[2] Tallec, Corentin, and Yann Ollivier. "Unbiasing truncated backpropagation through time." arXiv preprint arXiv:1705.08209 (2017).


In [1]:

"""This code is used to read all news and their labels"""
import os
import glob

def to_categories(name, cat=("politics", "rec", "comp", "religion")):
    """Map a newsgroup folder name to the index of the major category it contains."""
    for i, category in enumerate(cat):
        if category in name:
            return i
    # Fail loudly instead of mixing a string label in with the integer labels.
    raise ValueError("Unexpected folder: " + name)

def data_loader(data_path):
    # Use the argument (not a global) and sort for a reproducible category order.
    categories = sorted(os.listdir(data_path))
    news = []    # news content
    groups = []  # newsgroup folder each article belongs to

    for cat in categories:
        print("Category:" + cat)
        for news_path in glob.glob(os.path.join(data_path, cat, '*')):
            # Close each file after reading it.
            with open(news_path, encoding="ISO-8859-1", mode='r') as f:
                news.append(f.read().lower())
            groups.append(cat)

    return news, list(map(to_categories, groups))


data_path = "datasets/20news_subsampled"
news, groups = data_loader(data_path)

Category:comp.graphics
Category:comp.os.ms-windows.misc
Category:comp.sys.ibm.pc.hardware
Category:comp.sys.mac.hardware
Category:comp.windows.x
Category:rec.autos
Category:rec.motorcycles
Category:rec.sport.baseball
Category:rec.sport.hockey
Category:soc.religion.christian
Category:talk.politics.guns
Category:talk.politics.mideast
Category:talk.politics.misc
Category:talk.religion.misc


In [2]:
'''Implementing RNN based language model Elman network. (part 1)
'''
from nltk import word_tokenize, download
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.python.client import device_lib
from tensorflow.keras.callbacks import ModelCheckpoint
import numpy as np

# load the English Literature dataset
english_literature_path = './datasets/English Literature.txt'
with open(english_literature_path) as f:
    english_literature_text = f.read()
print(len(english_literature_text))

# tokenize the English Literature dataset
download('punkt')
english_literature_tokens = word_tokenize(english_literature_text)
print(len(english_literature_tokens))



1115394


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\songyih\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


254533


In [23]:
# build vocabulary
from collections import Counter

word2index = {}
index2word = []
english_literature_counter = Counter(english_literature_tokens)
for word, count in english_literature_counter.items():
    index2word.append(word)
    word2index[word] = len(word2index)

vocabulary_size = len(word2index)
print(vocabulary_size)

14309


In [24]:
# preprocess the dataset to get training data

max_input_len = 1
step = 1
x = []
y = []
for i in range(0, len(english_literature_tokens) - max_input_len, step):
    if i % 100 == 0:
        print("Progress: {0}%".format(round(i / len(english_literature_tokens) * 100, 2)), end="\r")
    curr_words = english_literature_tokens[i:i + max_input_len]
    x.append([word2index.get(curr_word, 0) for curr_word in curr_words])
    next_word = english_literature_tokens[i + max_input_len]
    y.append(word2index.get(next_word, 0))
X = np.array(x)
Y = to_categorical(y, vocabulary_size)
print("")
print(X.shape, Y.shape)

Progress: 0.0%Progress: 0.04%Progress: 0.08%Progress: 0.12%Progress: 0.16%Progress: 0.2%Progress: 0.24%Progress: 0.28%Progress: 0.31%Progress: 0.35%Progress: 0.39%Progress: 0.43%Progress: 0.47%Progress: 0.51%Progress: 0.55%Progress: 0.59%Progress: 0.63%Progress: 0.67%Progress: 0.71%Progress: 0.75%Progress: 0.79%Progress: 0.83%Progress: 0.86%Progress: 0.9%Progress: 0.94%Progress: 0.98%Progress: 1.02%Progress: 1.06%Progress: 1.1%Progress: 1.14%Progress: 1.18%Progress: 1.22%Progress: 1.26%Progress: 1.3%Progress: 1.34%Progress: 1.38%Progress: 1.41%Progress: 1.45%Progress: 1.49%Progress: 1.53%Progress: 1.57%Progress: 1.61%Progress: 1.65%Progress: 1.69%Progress: 1.73%Progress: 1.77%Progress: 1.81%Progress: 1.85%Progress: 1.89%Progress: 1.93%Progress: 1.96%Progress: 2.0%Progress: 2.04%Progress: 2.08%Progress: 2.12%Progress: 2.16%Progress: 2.2%Progress: 2.24%Progress: 2.28%Progress: 2.32%Progress: 2.36%Progress: 2.4%Progress: 2.44%

Progress: 27.19%Progress: 27.23%Progress: 27.27%Progress: 27.3%Progress: 27.34%Progress: 27.38%Progress: 27.42%Progress: 27.46%Progress: 27.5%Progress: 27.54%Progress: 27.58%Progress: 27.62%Progress: 27.66%Progress: 27.7%Progress: 27.74%Progress: 27.78%Progress: 27.82%Progress: 27.85%Progress: 27.89%Progress: 27.93%Progress: 27.97%Progress: 28.01%Progress: 28.05%Progress: 28.09%Progress: 28.13%Progress: 28.17%Progress: 28.21%Progress: 28.25%Progress: 28.29%Progress: 28.33%Progress: 28.37%Progress: 28.4%Progress: 28.44%Progress: 28.48%Progress: 28.52%Progress: 28.56%Progress: 28.6%Progress: 28.64%Progress: 28.68%Progress: 28.72%Progress: 28.76%Progress: 28.8%Progress: 28.84%Progress: 28.88%Progress: 28.92%Progress: 28.95%Progress: 28.99%Progress: 29.03%Progress: 29.07%Progress: 29.11%Progress: 29.15%Progress: 29.19%Progress: 29.23%Progress: 29.27%Progress: 29.31%Progress: 29.35%Progress: 29.39%Progress: 29.43%Progress: 29.47%Pr

Progress: 66.99%Progress: 67.02%Progress: 67.06%Progress: 67.1%Progress: 67.14%Progress: 67.18%Progress: 67.22%Progress: 67.26%Progress: 67.3%Progress: 67.34%Progress: 67.38%Progress: 67.42%Progress: 67.46%Progress: 67.5%Progress: 67.54%Progress: 67.57%Progress: 67.61%Progress: 67.65%Progress: 67.69%Progress: 67.73%Progress: 67.77%Progress: 67.81%Progress: 67.85%Progress: 67.89%Progress: 67.93%Progress: 67.97%Progress: 68.01%Progress: 68.05%Progress: 68.09%Progress: 68.12%Progress: 68.16%Progress: 68.2%Progress: 68.24%Progress: 68.28%Progress: 68.32%Progress: 68.36%Progress: 68.4%Progress: 68.44%Progress: 68.48%Progress: 68.52%Progress: 68.56%Progress: 68.6%Progress: 68.64%Progress: 68.67%Progress: 68.71%Progress: 68.75%Progress: 68.79%Progress: 68.83%Progress: 68.87%Progress: 68.91%Progress: 68.95%Progress: 68.99%Progress: 69.03%Progress: 69.07%Progress: 69.11%Progress: 69.15%Progress: 69.19%Progress: 69.22%Progress: 69.26%Pro


(254532, 1) (254532, 14309)
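
The shape of `Y` printed above implies a very large dense target matrix. The arithmetic below is a rough estimate, assuming `to_categorical` returns 4-byte float32 values (its default dtype) and comparing against one 4-byte integer index per sample.

```python
import numpy as np

num_samples = 254532      # rows of Y reported above
vocabulary_size = 14309   # columns of Y reported above

# Assumption: one-hot targets are stored as float32 (4 bytes per entry).
dense_bytes = num_samples * vocabulary_size * np.dtype(np.float32).itemsize
# Integer targets need a single 4-byte index per sample instead.
sparse_bytes = num_samples * np.dtype(np.int32).itemsize

print(f"one-hot targets: {dense_bytes / 1e9:.2f} GB")
print(f"integer targets: {sparse_bytes / 1e6:.2f} MB")
```

With integer targets, the Keras `sparse_categorical_crossentropy` loss computes the same cross-entropy without materializing the one-hot matrix. Swapping it in would be a change to this notebook, not something the original code does.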


In [24]:
model_elman = Sequential()
model_elman.add(Embedding(vocabulary_size, 500, input_length=max_input_len))
model_elman.add(SimpleRNN(units=500, activation='sigmoid'))
model_elman.add(Dense(vocabulary_size, activation='softmax'))

model_elman.summary()
model_elman.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 1, 500)            7154500   
_________________________________________________________________
simple_rnn_6 (SimpleRNN)     (None, 500)               500500    
_________________________________________________________________
dense_9 (Dense)              (None, 14309)             7168809   
Total params: 14,823,809
Trainable params: 14,823,809
Non-trainable params: 0
_________________________________________________________________
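
The parameter counts in the summary above can be checked by hand. The helper functions below are illustrative (they are not part of Keras) and use the standard formulas: weights plus one bias per unit, with the SimpleRNN adding a square recurrent matrix.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry, no bias.
    return vocab * dim

def simple_rnn_params(input_dim, units):
    # Input weights + recurrent weights + bias.
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    # Weights + bias.
    return input_dim * units + units

vocab, dim, units = 14309, 500, 500
counts = {
    "embedding": embedding_params(vocab, dim),
    "simple_rnn": simple_rnn_params(dim, units),
    "dense": dense_params(units, vocab),
}
for name, n in counts.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {sum(counts.values()):,}")
```

Nearly all of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size; this is why merging rare words shrinks the model so effectively.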


In [19]:
print(device_lib.list_local_devices())
filepath = "./model_elman.pth"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True, save_weights_only=False)
elman_training_history = model_elman.fit(
    X,
    Y,
    batch_size=128,
    epochs=20,
    verbose=1,
    shuffle=False,
    callbacks=[checkpoint]
)


[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 10287761693967540431
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 6700198133
locality {
  bus_id: 1
  links {
  }
}
incarnation: 12434136452038456957
physical_device_desc: "device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### Report the model cross-entropy loss.
The cross-entropy loss of the Elman + TBTT model (part 1) on the training set is 4.1998.
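
To put this number in context, a cross-entropy loss can be converted to perplexity and compared with the loss of guessing uniformly over the vocabulary. The sketch below assumes the loss is measured in nats, which is how the Keras `categorical_crossentropy` loss reports it.

```python
import math

loss_nats = 4.1998          # training cross-entropy reported above
vocabulary_size = 14309     # vocabulary size reported earlier

# Perplexity is the exponential of the average cross-entropy.
perplexity = math.exp(loss_nats)
# Loss of a model that assigns equal probability to every word.
uniform_loss = math.log(vocabulary_size)

print(f"perplexity: {perplexity:.1f}")
print(f"uniform-guess loss: {uniform_loss:.2f} nats")
```

A perplexity of roughly 67 means the model is, on average, about as uncertain as if it were choosing among 67 equally likely words, compared with 14309 for a uniform guess.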

In [25]:
''' Elman network + backpropagation through time algorithm (part 2)
'''
from nltk.tokenize import sent_tokenize

# prepare sentence sequences of the dataset
english_literature_sentences = sent_tokenize(english_literature_text)
english_literature_sentences_seq = []
english_literature_sentences_length = []
max_length = 40
for sentence in english_literature_sentences:
    tmp_tokens = word_tokenize(sentence)
    if len(tmp_tokens) > max_length:
        tmp_tokens = tmp_tokens[:max_length]
    for i in range(1, len(tmp_tokens)):
        # encode each prefix of the sentence as vocabulary indices (one training sequence per prefix)
        tmp_seq = tmp_tokens[:i+1]
        tmp_seq_encoded = []
        for token in tmp_seq:
            tmp_seq_encoded.append(word2index[token])
        english_literature_sentences_seq.append(tmp_seq_encoded)
        english_literature_sentences_length.append(len(tmp_seq_encoded))

In [26]:
print('number of sequences', len(english_literature_sentences_seq))
print('mean sentence length', sum(english_literature_sentences_length) / len(english_literature_sentences_length))
max_sentence_length = max(english_literature_sentences_length)
print('max sentence length', max_sentence_length)

number of sequences 210783
mean sentence length 14.081591020148684
max sentence length 40


In [27]:
# prepare input and target data for training the model
from tensorflow.keras.preprocessing.sequence import pad_sequences

english_literature_sentences_seq = pad_sequences(english_literature_sentences_seq, maxlen=max_sentence_length, padding='pre')
english_literature_sentences_seq = np.array(english_literature_sentences_seq)
x_BTT = english_literature_sentences_seq[:, :-1]

y_BTT = to_categorical(english_literature_sentences_seq[:, -1], vocabulary_size)
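
The preprocessing above turns every sentence into one training example per prefix, left-padded to a fixed length, with the last word as the target. The toy function below mirrors that scheme without Keras; the name `prefix_examples` and the tiny context length are assumptions made for illustration.

```python
import numpy as np

def prefix_examples(encoded_sentence, context_len):
    """Turn one encoded sentence into (left-padded context, next word) pairs."""
    contexts, targets = [], []
    for i in range(1, len(encoded_sentence)):
        context = encoded_sentence[:i][-context_len:]          # keep the most recent words
        padded = [0] * (context_len - len(context)) + context  # 'pre' padding with 0
        contexts.append(padded)
        targets.append(encoded_sentence[i])
    return np.array(contexts), np.array(targets)

x, y = prefix_examples([5, 9, 2, 7], context_len=3)
print(x)
print(y)
```

Because `word2index` in this notebook starts at 0, the padding value coincides with the index of the first vocabulary word. Shifting all word indices by one, or masking the padding, would keep the two apart.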

In [28]:
# build the network with BTT
model_BTT = Sequential()
model_BTT.add(Embedding(vocabulary_size, 500, input_length=max_sentence_length-1))
model_BTT.add(SimpleRNN(units=500, activation='sigmoid'))
model_BTT.add(Dense(vocabulary_size, activation='softmax'))

model_BTT.summary()
model_BTT.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 39, 500)           7154500   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 500)               500500    
_________________________________________________________________
dense (Dense)                (None, 14309)             7168809   
Total params: 14,823,809
Trainable params: 14,823,809
Non-trainable params: 0
_________________________________________________________________


In [29]:
print(device_lib.list_local_devices())
filepath = "./model_BTT.pth"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True, save_weights_only=False)
BTT_training_history = model_BTT.fit(
    x_BTT,
    y_BTT,
    batch_size=128,
    epochs=60,
    verbose=1,
    shuffle=False,
    callbacks=[checkpoint]
)

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 4406640222819138814
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 6700198133
locality {
  bus_id: 1
  links {
  }
}
incarnation: 13219286407285742176
physical_device_desc: "device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
Epoch 1/60
Epoch 2/60
Epoch 3/60
Epoch 4/60
Epoch 5/60
Epoch 6/60
Epoch 7/60
Epoch 8/60
Epoch 9/60
Epoch 10/60
Epoch 11/60
Epoch 12/60
Epoch 13/60
Epoch 14/60
Epoch 15/60
Epoch 16/60
Epoch 17/60
Epoch 18/60
Epoch 19/60
Epoch 20/60
Epoch 21/60
Epoch 22/60
Epoch 23/60
Epoch 24/60
Epoch 25/60
Epoch 26/60
Epoch 27/60
Epoch 28/60
Epoch 29/60
Epoch 30/60
Epoch 31/60
Epoch 32/60
Epoch 33/60
Epoch 34/60
Epoch 35/60
Epoch 36/60
Epoch 37/60
Epoch 38/60
Epoch 39/60
Epoch 40/60
Epoch 41/60
Epoch 42/60
Epoch 43/60
Epoch 44/60
Epoch 45/60
Epoch 46/60
Epoch 47/60
Epoch 48/60
Epoch 49/60
Epoch 50/60
Epoch 51/60
Epoch 52/60
Epoch 53/60
Epoch 54/

### Report the model cross-entropy loss.
The cross-entropy loss of the Elman + BTT model (part 2) on the training set is 0.5191.

In [8]:
''' Elman + BTT model with the SimpleRNN unit replaced with a Gated Recurrent Unit (part 3)
'''
from tensorflow.keras.layers import GRU

model_BTT_GRU = Sequential()
model_BTT_GRU.add(Embedding(vocabulary_size, 500, input_length=max_sentence_length-1))
model_BTT_GRU.add(GRU(units=500, activation='sigmoid'))
model_BTT_GRU.add(Dense(vocabulary_size, activation='softmax'))

model_BTT_GRU.summary()
model_BTT_GRU.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 39, 500)           7154500   
_________________________________________________________________
gru (GRU)                    (None, 500)               1501500   
_________________________________________________________________
dense (Dense)                (None, 14309)             7168809   
Total params: 15,824,809
Trainable params: 15,824,809
Non-trainable params: 0
_________________________________________________________________
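
The GRU row in the summary above has three times the parameters of the SimpleRNN layer it replaces, because a GRU learns three transforms (update gate, reset gate, candidate state) instead of one. The helper below is an illustrative formula, not a Keras function; the `separate_recurrent_bias` flag is a name invented here to cover the GRU variant that keeps a second bias vector for the recurrent weights.

```python
def gru_params(input_dim, units, separate_recurrent_bias=False):
    """Parameter count of a GRU layer: three gated transforms of the same shape."""
    biases = 2 * units if separate_recurrent_bias else units
    return 3 * (input_dim * units + units * units + biases)

single_bias = gru_params(500, 500)
double_bias = gru_params(500, 500, separate_recurrent_bias=True)
print(f"{single_bias:,}")
print(f"{double_bias:,}")
```

The single-bias count matches the 1,501,500 shown in the summary. Newer Keras releases default to the two-bias variant, so the same layer may report a slightly larger count there.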


In [9]:
print(device_lib.list_local_devices())
filepath = "./model_BTT_GRU.pth"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True, save_weights_only=False)
BTT_GRU_training_history = model_BTT_GRU.fit(
    x_BTT,
    y_BTT,
    batch_size=128,
    epochs=100,
    verbose=1,
    shuffle=False,
    callbacks=[checkpoint]
)

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 6192848861112352318
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 6700198133
locality {
  bus_id: 1
  links {
  }
}
incarnation: 16910545245568130575
physical_device_desc: "device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100

### Report the model cross-entropy loss.
The cross-entropy loss of the Elman + BTT model with GRU (part 3) on the training set is 0.2892.

### Compare your results in terms of cross-entropy loss with two other approaches (parts 1 and 2)

- Elman + TBTT (part 1): cross-entropy loss 4.1998, acc 0.1959
- Elman + BTT (part 2): best cross-entropy loss 0.5191, acc 0.8812
- Elman + BTT with the SimpleRNN unit replaced with GRU (part 3): best cross-entropy loss 0.2892, acc 0.9268

The Elman + BTT network with the SimpleRNN unit replaced by a GRU has the lowest cross-entropy loss of the three models.

In [31]:
'''Use each model to generate 10 synthetic sentences of 15 words each
'''
word_num = 15
sentence_num = 10

# basic elman with TBTT model (part 1)
elman_TBTT = load_model('./model_elman.pth')
for i in range(sentence_num):
    # randomly choose init word
    init_encoded = np.random.randint(vocabulary_size)
    encoded_sequence = [init_encoded]
    # generate the predicted sequence
    for _ in range(word_num - 1):
        latest_word_encoded = [encoded_sequence[-1]]
        latest_word_encoded = np.array(latest_word_encoded)
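        # argmax over predict() is equivalent to predict_classes, which newer Keras versions no longer provide
        predicted_encoded = np.argmax(elman_TBTT.predict(latest_word_encoded, verbose=0), axis=-1)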
        encoded_sequence.append(predicted_encoded[0])
    # decode the sequence
    decoded_sequence = []
    for encoded_word in encoded_sequence:
        decoded_sequence.append(index2word[encoded_word])
    print(' '.join(decoded_sequence))

studded all the world , I have no more than the world , I have
gaze this island . PROSPERO : I have no more than the world , I
chid'st me , I have no more than the world , I have no more
measuring his name , I have no more than the world , I have no
planched gate And , I have no more than the world , I have no
Senators : I have no more than the world , I have no more than
drift ; And , I have no more than the world , I have no
grazing , I have no more than the world , I have no more than
cap of the world , I have no more than the world , I have
corrupted foul play the world , I have no more than the world , I


In [32]:
# elman with BTT model (part 2)
elman_BTT = load_model('./model_BTT.pth')
for i in range(sentence_num):
    # randomly choose init word
    init_encoded = np.random.randint(vocabulary_size)
    encoded_sequence = [init_encoded]
    # generate the predicted sequence
    for _ in range(word_num - 1):
        input_sequence = pad_sequences([encoded_sequence], maxlen=max_sentence_length-1, padding='pre')
        input_sequence = np.array(input_sequence)
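        # argmax over predict() is equivalent to predict_classes, which newer Keras versions no longer provide
        predicted_encoded = np.argmax(elman_BTT.predict(input_sequence, verbose=0), axis=-1)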
        encoded_sequence.append(predicted_encoded[0])
    # decode the sequence
    decoded_sequence = []
    for encoded_word in encoded_sequence:
        decoded_sequence.append(index2word[encoded_word])
    print(' '.join(decoded_sequence))

dip'dst in view ; but first was struck with me than never will be ruled
bud of better ; he does offend my brother ? ' Lord , how have
noisemaker ! ' I hate thee by your side ; and see this night or
punto reverso ! ' I not ? -- No ; I will resist such entertainment
strokedst me and madest much of him ! ' I ' the plain way is
births : On whom God will never yet a word , we hear the minstrels
It is a hint That wrings mine eyes to't . ' n ' the air
awaked him , we 'll be put to woo . ' I ' the part
footing of the city ? ' song , the great subject well , she 'll
Fill me for that gird , good Tranio , for that thou likest it not


In [33]:
# elman with BTT model with the SimpleRnn unit replaced with GRU (part 3)
elman_BTT_GRU = load_model('./model_BTT_GRU.pth')
for i in range(sentence_num):
    # randomly choose init word
    init_encoded = np.random.randint(vocabulary_size)
    encoded_sequence = [init_encoded]
    # generate the predicted sequence
    for _ in range(word_num - 1):
        input_sequence = pad_sequences([encoded_sequence], maxlen=max_sentence_length-1, padding='pre')
        input_sequence = np.array(input_sequence)
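        # argmax over predict() is equivalent to predict_classes, which newer Keras versions no longer provide
        predicted_encoded = np.argmax(elman_BTT_GRU.predict(input_sequence, verbose=0), axis=-1)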
        encoded_sequence.append(predicted_encoded[0])
    # decode the sequence
    decoded_sequence = []
    for encoded_word in encoded_sequence:
        decoded_sequence.append(index2word[encoded_word])
    print(' '.join(decoded_sequence))

Devised are thee ? ' to the Capitol ! -- I am : and here
pluck in the topsail . ' the air doth burn . ' the air doth
fan , this Claudio is mine only son . ' the other way In that
shield me not first ? ' to the gaol . ' the house , how
abusing Baptista is to a cause to sigh , Then he shall command his mind
ye are not so mad -- That thou hast cause to pry into this morning
commonly and the air And each more villain : if these thing it is ,
pieces : He hath not possible nor prayers ; and he be too noble for
couples : You seem to hear of this : you have such vantage in this
fixes renown 'd two in this field We fall in broil . ' the last


### Discuss the quality of the sentences generated

The sentences generated by the basic Elman + TBTT model (part 1) do not read as English: after the random first word, every sample falls into the same repeated phrase ("I have no more than the world ,"). The sentences generated by the other two models look much better. Although the Elman + BTT model does not form complete, well-formed sentences, it produces some natural phrases. The Elman + BTT model with SimpleRNN replaced by GRU performs best of the three. It is still far from perfect, but it generates sentences that look like English and match the style of the training set.
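
The repetition in the part 1 samples has a mechanical explanation. That model receives a single word as input, and (assuming the layer's default non-stateful behavior) no hidden state is carried between `predict` calls, so the most likely next word is a fixed function of the current word. Following such a function over a finite vocabulary must eventually enter a cycle. The table below is hypothetical, hand-written to resemble the samples above rather than extracted from the trained model.

```python
def greedy_rollout(next_word, first_word, length):
    """Follow a fixed word -> most-likely-next-word table for `length` words."""
    sequence = [first_word]
    while len(sequence) < length:
        sequence.append(next_word[sequence[-1]])
    return sequence

# Hypothetical argmax table, loosely modeled on the part 1 samples above.
next_word = {
    "grazing": ",", "Senators": ":", ":": "I",
    ",": "I", "I": "have", "have": "no", "no": "more",
    "more": "than", "than": "the", "the": "world", "world": ",",
}

for start in ["grazing", "Senators"]:
    print(" ".join(greedy_rollout(next_word, start, 15)))
```

The BTT models condition on the whole generated prefix, so the same word can be followed by different continuations, which is one reason their samples vary more.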

In [5]:
''' Compare the word representations extracted from each of the approaches using one of the existing methods. (part 4)
Here I use an intrinsic evaluation method:
I compare the word-pair similarities computed from each of my three models against a gold-standard similarity dataset.
'''

# load wordsim_similarity_goldstandard as benchmark
goldstandard_word0 = []
goldstandard_word1 = []
goldstandard_similarity = []
vocabulary_keys = word2index.keys()
found_pairs = 0
total_pairs = 0
with open('./datasets/wordsim_similarity_goldstandard.txt') as f:
    lines = f.readlines()
    for line in lines:
        total_pairs += 1
        temp = line.strip().split('\t')
        # only save the pair of words that can be found in our vocabulary
        if temp[0] in vocabulary_keys and temp[1] in vocabulary_keys:
            found_pairs += 1
            goldstandard_word0.append(temp[0])
            goldstandard_word1.append(temp[1])
            goldstandard_similarity.append(float(temp[2]))

In [6]:
print('Found {0} word pairs in the local vocabulary out of {1} word pairs in wordsim_similarity_goldstandard'.format(found_pairs, total_pairs))

Found 66 word pairs in the local vocabulary out of 203 word pairs in wordsim_similarity_goldstandard


In [7]:
# calculate similarity for elman + TBTT model (part 1) on the found pair of words

# load model
elman_TBTT = load_model('./model_elman.pth')
elman_TBTT_embedding = elman_TBTT.layers[0].get_weights()[0]
elman_TBTT_embedding = np.array(elman_TBTT_embedding)



In [37]:
from sklearn.metrics.pairwise import cosine_similarity

# do the calculation
elman_TBTT_similarity = []
for i in range(len(goldstandard_similarity)):
    index_word0 = word2index[goldstandard_word0[i]]
    index_word1 = word2index[goldstandard_word1[i]]
    elman_TBTT_similarity.append(cosine_similarity(elman_TBTT_embedding[index_word0].reshape(1, -1), elman_TBTT_embedding[index_word1].reshape(1, -1))[0][0])

In [38]:
# calculate the Spearman rank correlation on our similarity and goldstandard similarity

from scipy import stats
elman_TBTT_correlation = stats.spearmanr(np.array(elman_TBTT_similarity), np.array(goldstandard_similarity))
print('Spearman rank correlation between elman + TBTT (part 1) and gold standard similarity is ', elman_TBTT_correlation.correlation.round(4))

Spearman rank correlation between elman + TBTT (part 1) and gold standard similarity is  0.1581


In [43]:
# calculate similarity for elman + BTT model (part 2) on the found pair of words

elman_BTT = load_model('./model_BTT_max40.pth')
elman_BTT_embedding = elman_BTT.layers[0].get_weights()[0]
elman_BTT_embedding = np.array(elman_BTT_embedding)

In [44]:
from sklearn.metrics.pairwise import cosine_similarity

# do the calculation
elman_BTT_similarity = []
for i in range(len(goldstandard_similarity)):
    index_word0 = word2index[goldstandard_word0[i]]
    index_word1 = word2index[goldstandard_word1[i]]
    elman_BTT_similarity.append(cosine_similarity(elman_BTT_embedding[index_word0].reshape(1, -1), elman_BTT_embedding[index_word1].reshape(1, -1))[0][0])

In [45]:
# calculate the Spearman rank correlation on our similarity and goldstandard similarity

from scipy import stats
elman_BTT_correlation = stats.spearmanr(np.array(elman_BTT_similarity), np.array(goldstandard_similarity))
print('Spearman rank correlation between elman + BTT (part 2) and gold standard similarity is ', elman_BTT_correlation.correlation.round(4))

Spearman rank correlation between elman + BTT (part 2) and gold standard similarity is  0.2478


In [53]:
# calculate similarity for elman + BTT model with GRU (part 3) on the found pair of words

# load model
elman_BTT_GRU = load_model('./model_BTT_GRU.pth')
elman_BTT_GRU_embedding = elman_BTT_GRU.layers[0].get_weights()[0]
elman_BTT_GRU_embedding = np.array(elman_BTT_GRU_embedding)

In [54]:
from sklearn.metrics.pairwise import cosine_similarity

# do the calculation
elman_BTT_GRU_similarity = []
for i in range(len(goldstandard_similarity)):
    index_word0 = word2index[goldstandard_word0[i]]
    index_word1 = word2index[goldstandard_word1[i]]
    elman_BTT_GRU_similarity.append(cosine_similarity(elman_BTT_GRU_embedding[index_word0].reshape(1, -1), elman_BTT_GRU_embedding[index_word1].reshape(1, -1))[0][0])

In [55]:
# calculate the Spearman rank correlation between our similarities and the gold standard similarities

from scipy import stats
elman_BTT_GRU_correlation = stats.spearmanr(np.array(elman_BTT_GRU_similarity), np.array(goldstandard_similarity))
print('Spearman rank correlation between elman + BTT with GRU (part 3) and gold standard similarity is ', elman_BTT_GRU_correlation.correlation.round(4))

Spearman rank correlation between elman + BTT with GRU (part 3) and gold standard similarity is  0.2482


### Compare the word representations extracted from each of the approaches using one of the existing methods.

I used an intrinsic evaluation method to evaluate the extracted word representations.

I chose the gold standard similarity dataset as the benchmark, then kept only the word pairs in which both words appear in my local vocabulary.

For each of these pairs I computed the cosine similarity between the two word embeddings, separately for each of the three models.
Finally, I calculated the Spearman rank correlation between each model's similarities and the benchmark similarities.
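
The procedure above can be sketched in a few dependency-free lines. This is only an illustration of what `cosine_similarity` and `stats.spearmanr` compute in the cells above, not the code that produced the reported numbers; the helper names (`cosine`, `average_ranks`, `spearman`, `evaluate_embedding`) and the toy vocabulary, embedding and gold scores are all made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def evaluate_embedding(embedding, word2index, gold_pairs):
    """gold_pairs holds (word0, word1, human_score); out-of-vocabulary pairs are skipped."""
    model_scores, gold_scores = [], []
    for word0, word1, score in gold_pairs:
        if word0 in word2index and word1 in word2index:
            model_scores.append(cosine(embedding[word2index[word0]], embedding[word2index[word1]]))
            gold_scores.append(score)
    return spearman(model_scores, gold_scores)

# toy data, purely illustrative
toy_word2index = {"cat": 0, "dog": 1, "car": 2, "bus": 3}
toy_embedding = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
toy_gold = [("cat", "dog", 9.0), ("car", "bus", 8.0), ("cat", "car", 1.0),
            ("dog", "bus", 2.0), ("cat", "tree", 5.0)]  # "tree" is out of vocabulary

print(evaluate_embedding(toy_embedding, toy_word2index, toy_gold))
```

Because the correlation is computed on ranks, only the ordering of the similarities matters, so the cosine values do not need to be on the same scale as the human scores.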

Here are the correlation results:

- elman + TBTT (part 1): 0.1581
- elman + BTT (part 2): 0.2478
- elman + BTT with GRU (part 3): 0.2482

These results are consistent with the text generation results above: elman + TBTT performs worst (0.1581), while the two BTT models score noticeably higher and almost identically (0.2478 and 0.2482).

In [3]:
# Part 5: learn an RNN model that predicts a document's category from its content

# tokenize every news document
news_tokens = [word_tokenize(news_item) for news_item in news]

In [4]:
from collections import Counter

# build the vocabulary: one index per distinct token, in order of first appearance
flattened_news_tokens = [token for tokens in news_tokens for token in tokens]
news_counter = Counter(flattened_news_tokens)

index2word_news = list(news_counter)
word2index_news = {word: index for index, word in enumerate(index2word_news)}

news_vocabulary_size = len(word2index_news)
print(news_vocabulary_size)

207442


In [5]:
# encode the dataset, keeping at most the first max_news_length tokens of each document
max_news_length = 400
news_encoded = []
news_encoded_length = []
for news_tokens_item in news_tokens:
    tmp_encoded = [word2index_news[news_token] for news_token in news_tokens_item[:max_news_length]]
    news_encoded.append(tmp_encoded)
    news_encoded_length.append(len(tmp_encoded))

In [6]:
print('number of news', len(news_encoded))
print('mean news length', sum(news_encoded_length) / len(news_encoded_length))
max_news_length = max(news_encoded_length)
print('max news length', max_news_length)

number of news 13108
mean news length 238.1869087580104
max news length 400


In [7]:
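from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# prepare the input and target for the network
# note: pad_sequences pads with 0 by default, which is also the index of the first vocabulary word
news_encoded_padded = pad_sequences(news_encoded, maxlen=max_news_length, padding='pre')
x_news = np.array(news_encoded_padded)
y_news = to_categorical(groups, 4)  # one-hot targets for the 4 categories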
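
In the cells above the first vocabulary word receives index 0, which is also the value `pad_sequences` uses for padding, so the network cannot tell that word apart from padding. One possible way to avoid the clash, sketched here in plain Python, is to start word indices at 1 and reserve 0 for padding. `PAD_INDEX`, `build_vocabulary`, `encode_and_pad` and the toy documents are hypothetical names and data introduced only for this example; this is not what the notebook ran.

```python
PAD_INDEX = 0  # assumption: index 0 is reserved for padding only

def build_vocabulary(tokenized_docs):
    """Map each distinct token to an index starting at 1, leaving 0 free for padding."""
    word2index = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in word2index:
                word2index[token] = len(word2index) + 1
    return word2index

def encode_and_pad(doc, word2index, max_length):
    """Keep the first max_length tokens and left-pad with PAD_INDEX (like padding='pre')."""
    ids = [word2index[token] for token in doc[:max_length]]
    return [PAD_INDEX] * (max_length - len(ids)) + ids

# toy documents, purely illustrative
toy_docs = [["the", "game", "was", "great"], ["the", "gpu"]]
toy_vocab = build_vocabulary(toy_docs)
toy_batch = [encode_and_pad(doc, toy_vocab, max_length=3) for doc in toy_docs]
print(toy_batch)  # [[1, 2, 3], [0, 1, 5]]
```

With this scheme the embedding layer would need `len(word2index) + 1` rows, since the indices run from 0 to the vocabulary size.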

In [8]:
# split into train and test sets (90% / 10%), stratified by category
from sklearn.model_selection import train_test_split
x_news_train, x_news_test, y_news_train, y_news_test = train_test_split(x_news, y_news, test_size=0.1, stratify=y_news)

In [13]:
# define the new network structure
model_news = Sequential()
model_news.add(Embedding(news_vocabulary_size, 400, input_length=max_news_length))
model_news.add(SimpleRNN(units=500, activation='sigmoid'))
model_news.add(Dropout(0.5))
model_news.add(Dense(256, activation='sigmoid'))
model_news.add(Dropout(0.5))
model_news.add(Dense(4, activation='softmax'))

model_news.summary()
model_news.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 400, 400)          82976800  
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, 500)               450500    
_________________________________________________________________
dropout_5 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 256)               128256    
_________________________________________________________________
dropout_6 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 4)                 1028      
Total params: 83,556,584
Trainable params: 83,556,584
Non-trainable params: 0
________________________________________________________________
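
The parameter counts in the summary can be checked by hand. The short sketch below recomputes them from the layer sizes used above, assuming the usual formulas for these layers (embedding: one row per word; simple RNN: input kernel, recurrent kernel and bias; dense: kernel and bias). The helper function names are made up for the example.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    # input kernel (input_dim x units) + recurrent kernel (units x units) + bias (units)
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # kernel (input_dim x units) + bias (units)
    return units * (input_dim + 1)

counts = {
    "embedding": embedding_params(207442, 400),
    "simple_rnn": simple_rnn_params(400, 500),
    "dense_256": dense_params(500, 256),
    "dense_4": dense_params(256, 4),
}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name:>10}: {count:>10,}")
print(f"{'total':>10}: {total:>10,}")
```

The totals match the summary (83,556,584), and the embedding layer alone accounts for more than 99% of the parameters because of the 207,442-word vocabulary.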

In [14]:
print(device_lib.list_local_devices())
filepath = "./model_news.pth"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=0, save_best_only=True, save_weights_only=False)
news_training_history = model_news.fit(
    x_news_train,
    y_news_train,
    batch_size=128,
    epochs=20,
    verbose=1,
    shuffle=True,
    validation_data=(x_news_test, y_news_test),
    callbacks=[checkpoint]
)

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7053203860371525633
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 6700198133
locality {
  bus_id: 1
  links {
  }
}
incarnation: 12412288893670240266
physical_device_desc: "device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
Train on 11797 samples, validate on 1311 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### Report your accuracy results on the validation set.
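The best validation loss is 0.9626 (epoch 5), and the best validation accuracy is 0.7079 (epoch 14). Because the checkpoint callback monitors `val_loss` with `save_best_only=True`, the model saved to `./model_news.pth` is the one from epoch 5.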
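
Best epochs like these can be read off the per-epoch metric lists that Keras stores in the `history` attribute of the object returned by `fit`. The sketch below shows one way to do it. `best_epoch` is a hypothetical helper, the key names `val_loss` and `val_acc` are assumptions (the accuracy key differs between Keras versions), and the numbers are toy values, not the metrics of the run above.

```python
def best_epoch(values, mode):
    """Return (1-based epoch, value) for the minimum or maximum of a per-epoch metric list."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    pick = min if mode == "min" else max
    index = pick(range(len(values)), key=lambda i: values[i])
    return index + 1, values[index]

# toy stand-in for news_training_history.history, purely illustrative
toy_history = {
    "val_loss": [1.30, 1.10, 0.98, 1.02, 1.15],
    "val_acc": [0.45, 0.55, 0.62, 0.66, 0.64],
}

print(best_epoch(toy_history["val_loss"], "min"))  # (3, 0.98)
print(best_epoch(toy_history["val_acc"], "max"))   # (4, 0.66)
```

As in the run above, the epoch with the lowest validation loss need not be the epoch with the highest validation accuracy, so the monitored metric decides which model the checkpoint keeps.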