CMPT 980 - Assignment 3

Puria Azadi Moghadam - Student No: 301406080

## Deep Learning Course (980)
## Assignment Three 

__Assignment Goals:__

- Implementing RNN based language models.
- Implementing and applying a Recurrent Neural Network on text classification problem using TensorFlow.
- Implementing __many to one__ and __many to many__ RNN sequence processing.

In this assignment, you will implement RNN-based language models and compare extracted word representation from different models. You will also compare two different training methods for sequential data: Truncated Backpropagation Through Time __(TBTT)__ and Backpropagation Through Time __(BTT)__. 
Also, you will be asked to apply Vanilla RNN to capture word representations and solve a text classification problem. 


__DataSets__: You will use two datasets, an English Literature for language model task (part 1 to 4) and 20Newsgroups for text classification (part 5). 


1. (30 points) Implement the RNN based language model described by Mikolov et al.[1], also called __Elman network__ and train a language model on the English Literature dataset. This network contains input, hidden and output layer and is trained by standard backpropagation (TBTT with τ = 1) using the cross-entropy loss. 
   - The input represents the current word while using 1-of-N coding (thus its size is equal to the size of the vocabulary) and vector s(t − 1) that represents output values in the hidden layer from the previous time step. 
   - The hidden layer is a fully connected sigmoid layer with size 500. 
   - Softmax Output Layer to capture a valid probability distribution.
   - The model is trained with truncated backpropagation through time (TBTT) with τ = 1: the weights of the network are updated based on the error vector computed only for the current time step.
   
   Download the English Literature dataset and train the language model as described, report the model cross-entropy loss on the train set. Use nltk.word_tokenize to tokenize the documents. 
For initialization, s(0) can be set to a vector of small values. Note that we are not interested in the *dynamic model* mentioned in the original paper. 
To make the implementation simpler you can use Keras to define neural net layers, including Keras.Embedding. (Keras.Embedding will create an additional mapping layer compared to the Elman architecture.) 

2. (20 points) TBTT has less computational cost and memory needs in comparison with *backpropagation through time algorithm (BTT)*. These benefits come at the cost of losing long term dependencies [2]. Now let's try to investigate computational costs and performance of learning our language model with BTT. For training the Elman-type RNN with BTT, one option is to perform mini-batch gradient descent with exactly one sentence per mini-batch. (The input  size will be [1, Sentence Length]). 

    1. Split the document into sentences (you can use nltk.tokenize.sent_tokenize).
    2. For each sentence, perform one pass that computes the mean/sum loss for this sentence; then perform a gradient update for the whole sentence. (So the mini-batch size varies for the sentences with different lengths). You can truncate long sentences to fit the data in memory. 
    3. Report the model cross-entropy loss.

3. (15 points) It does not seem that simple recurrent neural networks can capture truly exploit context information with long dependencies, because of the problem that gradients vanish and exploding. To solve this problem, gating mechanisms for recurrent neural networks were introduced. Try to learn your last model (Elman + BTT) with the SimpleRnn unit replaced with a Gated Recurrent Unit (GRU). Report the model cross-entropy loss. Compare your results in terms of cross-entropy loss with two other approach(part 1 and 2). Use each model to generate 10 synthetic sentences of 15 words each. Discuss the quality of the sentences generated - do they look like proper English? Do they match the training set?
    Text generation from a given language model can be done using the following iterative process:
   1. Set sequence = \[first_word\], chosen randomly.
   2. Select a new word based on the sequence so far, add this word to the sequence, and repeat. At each iteration, select the word with maximum probability given the sequence so far. The trained language model outputs this probability. 

4. (15 points) The text describes how to extract a word representation from a trained RNN (Chapter 4). How we can evaluate the extracted word representation for your trained RNN? Compare the words representation extracted from each of the approaches using one of the existing methods.

5. (20 points) We are aiming to learn an RNN model that predicts document categories given its content (text classification). For this task, we will use the 20Newsgroupst dataset. The 20Newsgroupst contains messages from twenty newsgroups.  We selected four major categories (comp, politics, rec, and religion) comprising around 13k documents altogether. Your model should learn word representations to support the classification task. For solving this problem modify the __Elman network__ architecture such that the last layer is a softmax layer with just 4 output neurons (one for each category). 

    1. Download the 20Newsgroups dataset, and use the implemented code from the notebook to read in the dataset.
    2. Split the data into a training set (90 percent) and validation set (10 percent). Train the model on  20Newsgroups.
    3. Report your accuracy results on the validation set.

__NOTE__: Please use Jupyter Notebook. The notebook should include the final code, results and your answers. You should submit your Notebook in (.pdf or .html) and .ipynb format. (penalty 10 points) 

To reduce the parameters, you can merge all words that occur less often than a threshold into a special rare token (\__unk__).

__Instructions__:

The university policy on academic dishonesty and plagiarism (cheating) will be taken very seriously in this course. Everything submitted should be your own writing or coding. You must not let other students copy your work. Spelling and grammar count.

Your assignments will be marked based on correctness, originality (the implementations and ideas are from yourself), clarification and test performance.


[1] Tom´ as Mikolov, Martin Kara ˇ fiat, Luk´ ´ as Burget, Jan ˇ Cernock´ ˇ y,Sanjeev Khudanpur: Recurrent neural network based language model, In: Proc. INTERSPEECH 2010

[2] Tallec, Corentin, and Yann Ollivier. "Unbiasing truncated backpropagation through time." arXiv preprint arXiv:1705.08209 (2017).


In [0]:
from google.colab import files
uploaded = files.upload

In [1]:
import numpy as np
import nltk 
nltk.download('punkt')
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, Masking, SimpleRNN, TimeDistributed, LSTM, GRU
from keras.optimizers import Adam

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Using TensorFlow backend.


In [2]:
with open("./EnglishLiterature.txt") as f:
    #content = f.readlines()
    content = [content.rstrip() for content in f]
    #print(content)

#print(content)
content2=[s for s in content if s != '']
# print(content2)

In [3]:
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(content2)
#print(content2)
all_integer_sentences=[]
for line in content2:
  for word in line.split():
    all_integer_sentences.append(word)
all_integer_sentences4= [tokenizer.texts_to_sequences([line])[0] for line in content2]
#all_integer_sentences = tokenizer.texts_to_sequences([content2])
all_integer_sentences2 = tokenizer.texts_to_sequences(all_integer_sentences)
# print(all_integer_sentences2)

## **PART ONE**

In [4]:
inputs=[]
outputs=[]
for integer_sentence in all_integer_sentences:
    for i in range(len(integer_sentence)-1):
        inputs.append(integer_sentence[i:i+1])
        outputs.append(integer_sentence[i+1:i+2])

all_integer_sentences2 =  [integer[0] for integer in all_integer_sentences2 if integer != []]
inputs = all_integer_sentences2[0:-1]
outputs = all_integer_sentences2[1:]
print(np.shape(inputs))
# print((outputs))
word_idx = tokenizer.word_index
idx_word = tokenizer.index_word
num_words = len(word_idx) + 1
inputs=np.array(inputs)
outputs=np.array(outputs)
print(num_words)
categorical_outputs = to_categorical(outputs, num_classes=num_words)
print(np.shape(inputs))
print(np.shape(categorical_outputs))

(202648,)
12633
(202648,)
(202648, 12633)


In [10]:
model = Sequential()
model.add(Embedding(num_words, 50, input_length=None))
model.add(SimpleRNN(500, return_sequences=False, activation='sigmoid'))
# model.add(SimpleRNN(500, activation='sigmoid'))
#model.add(Dense(50, activation='relu'))
model.add(Dense(num_words, activation='softmax'))
print(model.summary())



Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 50)          631650    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 500)               275500    
_________________________________________________________________
dense_1 (Dense)              (None, 12633)             6329133   
Total params: 7,236,283
Trainable params: 7,236,283
Non-trainable params: 0
_________________________________________________________________
None


In [11]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(inputs, categorical_outputs,  epochs=20)



Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Epoch 1/20





Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f601c0b2f28>

**It seems that the elman network has learned data dataset since the loss went down from 6.99 to 4.79. Worth mentioning, the accuracy is not a good measure for assessing the network.**

## **PART TWO:**

In [0]:
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(content2)
word_idx = tokenizer.word_index
idx_word = tokenizer.index_word
num_words = len(word_idx) + 1
All_text = ' '.join(content2)
All_text = nltk.sent_tokenize(All_text)
inputs2=[]
outputs2=[]
# print(All_text)
for i,sentence in enumerate(All_text):
    # tokenized_text = nltk.word_tokenize(sentence)
    # print(sentence.split())
    tagged = tokenizer.texts_to_sequences(sentence.split())
    # print(tagged)
    inputs2.append(tagged[0:-1])
    outputs2.append(tagged[1:])


**Since we wanted to use BTT, we build our input and outputs again as:**
**For training the Elman-type RNN with BTT, one option is to perform mini-batch gradient descent with exactly one sentence per mini-batch. (The input size will be [1, Sentence Length])**

In [14]:
model2 = Sequential()
model2.add(Embedding(num_words, 50, input_length=None))
model2.add(SimpleRNN(500, return_sequences=False, activation='sigmoid'))
# model.add(SimpleRNN(500, activation='sigmoid'))
#model.add(Dense(50, activation='relu'))
model2.add(Dense(num_words, activation='softmax'))
print(model2.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 50)          631650    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 500)               275500    
_________________________________________________________________
dense_3 (Dense)              (None, 12633)             6329133   
Total params: 7,236,283
Trainable params: 7,236,283
Non-trainable params: 0
_________________________________________________________________
None


In [0]:
import keras
adam=keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model2.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
for epoch in range(20):
  print("Epoch:" , epoch)
  Loss=0
  ACC = 0
  counter = 1;
  for i in range(len(inputs2)):
    # if(i % 1000 == 0):
    #   print("Epoch :",i)
    inputs222=np.array(inputs2[i])
    if(len(outputs2[i])>0):
      flag=0
      OUT=[]
      IN=[]
      # print(outputs2[i])
      for index in range(len(outputs2[i])):
        if(len(outputs2[i][index])>0):
          OUT.append(outputs2[i][index][0])
      for index in range(len(inputs222)):
        if(len(inputs222[index])>0):
          IN.append(inputs222[index][0])
      
      IN=np.array(IN)
      categorical_outputs2 = to_categorical(OUT, num_classes=num_words)
      history=model2.fit(IN, categorical_outputs2,verbose=0, batch_size=len(inputs222),  epochs=1)
      Loss += history.history['loss'][0]
      ACC += history.history['acc'][0]
      counter +=1
  print("Loss:", {Loss / counter})
  print("ACC:", {ACC / counter})
  # print(model2.evaluate(   x=inputs, y=categorical_outputs))

Epoch: 0
Loss: {7.223208463796496}
ACC: {0.032992622232174175}
Epoch: 1
Loss: {6.64865230188892}
ACC: {0.05482211536646615}
Epoch: 2
Loss: {6.317404680087648}
ACC: {0.07789450329764318}
Epoch: 3
Loss: {6.096285639437894}
ACC: {0.09171071200474765}
Epoch: 4
Loss: {5.9284656611417414}
ACC: {0.10260169942551585}
Epoch: 5
Loss: {5.797905805638064}
ACC: {0.10830917702742208}
Epoch: 6
Loss: {5.680729468705446}
ACC: {0.11406511248231055}
Epoch: 7
Loss: {5.576052816402598}
ACC: {0.11919770755115557}
Epoch: 8
Loss: {5.477349007622949}
ACC: {0.12450415301486155}
Epoch: 9
Loss: {5.381228362469354}
ACC: {0.12906859757586497}
Epoch: 10
Loss: {5.277098637940192}
ACC: {0.13394183680264027}
Epoch: 11
Loss: {5.169635143567776}
ACC: {0.13735678642013432}
Epoch: 12
Loss: {5.06812858824314}
ACC: {0.14062095655698867}
Epoch: 13
Loss: {4.985577315561791}
ACC: {0.14388559958204233}
Epoch: 14
Loss: {4.9139477769988185}
ACC: {0.1475363640853678}
Epoch: 15
Loss: {4.854683705769011}
ACC: {0.1494565598664401}
Epo

## **PART THREE:**

In [13]:
model3 = Sequential()
model3.add(Embedding(num_words, 50, input_length=None))
model3.add(GRU(500, return_sequences=False, activation='sigmoid'))
# model.add(SimpleRNN(500, activation='sigmoid'))
#model.add(Dense(50, activation='relu'))
model3.add(Dense(num_words, activation='softmax'))
print(model3.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 50)          631650    
_________________________________________________________________
gru_1 (GRU)                  (None, 500)               826500    
_________________________________________________________________
dense_2 (Dense)              (None, 12633)             6329133   
Total params: 7,787,283
Trainable params: 7,787,283
Non-trainable params: 0
_________________________________________________________________
None


In [0]:
import keras
adam=keras.optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model3.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
for epoch in range(20):
  print("Epoch:" , epoch)
  Loss=0
  ACC = 0
  counter = 1;
  for i in range(len(inputs2)):
    inputs222=np.array(inputs2[i])
    if(len(outputs2[i])>0):
      flag=0
      OUT=[]
      IN=[]
      for index in range(len(outputs2[i])):
        if(len(outputs2[i][index])>0):
          OUT.append(outputs2[i][index][0])
      for index in range(len(inputs222)):
        if(len(inputs222[index])>0):
          IN.append(inputs222[index][0])
      
      IN=np.array(IN)
      categorical_outputs2 = to_categorical(OUT, num_classes=num_words)
      history = model3.fit(IN, categorical_outputs2,verbose=0, batch_size=len(inputs222),  epochs=1)
      Loss += history.history['loss'][0]
      ACC += history.history['acc'][0]
      counter +=1
  print("Loss:", {Loss / counter})
  print("ACC:", {ACC / counter})
  # print(model3.evaluate(   x=inputs, y=categorical_outputs))

Epoch: 0
Loss: {7.00776178035001}
ACC: {0.0362762985194884}
Epoch: 1
Loss: {6.384619500965908}
ACC: {0.06767510118491903}
Epoch: 2
Loss: {6.035981140131883}
ACC: {0.08587676668913684}
Epoch: 3
Loss: {5.75442597817214}
ACC: {0.0982025175030073}
Epoch: 4
Loss: {5.525105390311738}
ACC: {0.10821017189008046}
Epoch: 5
Loss: {5.332612569813312}
ACC: {0.11740226108142558}
Epoch: 6
Loss: {5.161411083943945}
ACC: {0.12506627384422095}
Epoch: 7
Loss: {5.018351114221809}
ACC: {0.13157263303889646}
Epoch: 8
Loss: {4.888788093555167}
ACC: {0.13888320258737336}
Epoch: 9
Loss: {4.764755136610951}
ACC: {0.14470814459131648}
Epoch: 10
Loss: {4.664035070948265}
ACC: {0.15146646758122087}
Epoch: 11
Loss: {4.57514666861896}
ACC: {0.1568143345663441}
Epoch: 12
Loss: {4.5085599445123625}
ACC: {0.1607572796154941}
Epoch: 13
Loss: {4.462790779138161}
ACC: {0.1627379795225832}
Epoch: 14
Loss: {4.425066891263849}
ACC: {0.1641898429330295}
Epoch: 15
Loss: {4.399911514001747}
ACC: {0.16525285871302742}
Epoch: 16


**As we have seen the final loss in these three RNN networks, the third network which uses GRU has smaller value for cross-entropy loss comparing to the first and second models. In other words, RNN with GRU has learned the structure of language, and it can capture truly exploit context information with long dependencies. Therefore, the third model has long term memory in addition to the short term memory, and the GRU prevent the gradients vanish and exploding.**


In [23]:
import random

model.load_weights("model.h5")
model2.load_weights("model2.h5")
model3.load_weights("model3.h5")

print("Texts generated for First model:")
FirstWord = []
for sent in range(10):
  FirstWord.append(random.randrange(num_words))
for sent in range(10):
  # FirstWord = random.randrange(num_words)
  asghar =[]
  asghar.append(FirstWord[sent])
  for i in range(15):
    next_words = model.predict(np.array(asghar).reshape(1, -1))[0]
    # next_words = np.log(next_words) / 1
    # exp_next_words = np.exp(next_words)
    # next_words = exp_next_words / exp_next_words.sum()
    # print(next_words.sum())
    next_word = np.argmax(np.random.multinomial(1, next_words/(sum( next_words)+0.000001), 1)[0])
    asghar.append(next_word+1)
  n=[]
  for i in asghar:
    n.append(idx_word.get(i, '< --- >'))
  print(' '.join(n))
print("--------------------------------------")
print("--------------------------------------")
print("--------------------------------------")
print("Texts generated for Second model:")
for sent in range(10):
  # FirstWord = random.randrange(num_words)
  asghar =[]
  asghar.append(FirstWord[sent])
  for i in range(15):
    next_words = model2.predict(np.array(asghar).reshape(1, -1))[0]
    # next_words = np.log(next_words) / 1
    # exp_next_words = np.exp(next_words)
    # next_words = exp_next_words / exp_next_words.sum()
    # print(next_words.sum())
    next_word = np.argmax(np.random.multinomial(1, next_words/(sum( next_words)+0.000001), 1)[0])
    asghar.append(next_word+1)
  n=[]
  for i in asghar:
    n.append(idx_word.get(i, '< --- >'))
  print(' '.join(n))
print("--------------------------------------")
print("--------------------------------------")
print("--------------------------------------")
print("Texts generated for Third model:")
for sent in range(10):
  # FirstWord = random.randrange(num_words)
  asghar =[]
  asghar.append(FirstWord[sent])
  for i in range(15):
    next_words = model3.predict(np.array(asghar).reshape(1, -1))[0]
    # next_words = np.log(next_words) / 1
    # exp_next_words = np.exp(next_words)
    # next_words = exp_next_words / exp_next_words.sum()
    # print(next_words.sum())
    next_word = np.argmax(np.random.multinomial(1, next_words/(sum( next_words)+0.000001), 1)[0])
    asghar.append(next_word+1)
  n=[]
  for i in asghar:
    n.append(idx_word.get(i, '< --- >'))
  print(' '.join(n))



Texts generated for First model:
aggravate and no order say unto that my thousand winds there may many words be goodly
infuse but him see i ross i hath master you thee march brutus for other fortress
devil's puissant take't york death tortures have ratcliff with iii and i'll late you of and
stabs and i own to pieces good son thou honesty not his swoon thou prince thy
tack woeful town you on and ranging hold friendship is with that how a bridal letters
soothing one of own in choose a heavy ' now your often thou edward's thank my
wall sir perdita go sworn my art for father after heel me he not and no
sufferance lie more murdering degree me to some spare and my foolish tears richard your kings
usurps and myself waiting as how a double deserving to it still are his brakenbury thy
roof cannot swear be that shall as elizabeth to these elbow to idle thy long did
--------------------------------------
--------------------------------------
--------------------------------------
Texts generated

**As it can be seen in above produced sentences, the sentences produced by all models are not very well and has lots of differences comparing to real english. This is because the fact that our model are very simple.**
**But between these models, the sentences generated by the third model has better qualities and they are more similar to proper english. It can be seen that the third model has not generated repetitive words in sentences, but the first two models have this problem and they mostly choose much simpler words. Because model-3 has learned the the relations between word in the sentences(not just individual words). Moreover, this model has long term memory resulted from using GRU.**


## **PART FOUR**

**-------------------------------------------------------------**

**-------------------------------------------------------------**

**There multiple solutions for evaluating the extracted word representation for trained RNNs.**
1.    For instance, we can check the similarity of the generated vectors for words with close meanings(for example: "Cat" and "Dog" have close meanings and should have similar vetors)
2.   or we can transfer generated vectors to 2 dimensional space and visualize these vector.
3.  Also, we can extract the weights of embedding layer for each of our models and replace other networks embedding layer weights with the extracted weights. Then, we evaluate each of these new models with the same data. The best embedding layer is which has low loss value for all networks(not just one of them).
**I have implemented the latter solution:**




In [15]:
model.load_weights("model.h5")
layer_emb_model = model.layers[0]
model2.load_weights("model2.h5")
layer_emb_model2 = model2.layers[0]
model3.load_weights("model3.h5")
layer_emb_model3 = model3.layers[0]

layer = model3.layers[0]
layer.set_weights(layer_emb_model.get_weights())
model3.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history=model3.evaluate(   x=inputs, y=categorical_outputs)
Loss1_3 = history[0]
ACC1_3 = history[1]

layer = model2.layers[0]
layer.set_weights(layer_emb_model.get_weights())
model2.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history=model2.evaluate(   x=inputs, y=categorical_outputs)
Loss1_2 = history[0]
ACC1_2 = history[1]

layer = model.layers[0]
layer.set_weights(layer_emb_model.get_weights())
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history=model.evaluate(   x=inputs, y=categorical_outputs)
Loss1_1 = history[0]
ACC1_1 = history[1]
print("Loss while using weights of model_1 on first, second, and third model are:{}, {}, {}".format(Loss1_1,Loss1_2,Loss1_3))
print("Average loss while using weights of first model is:", (Loss1_1+Loss1_2+Loss1_3)/3)
print("--------------------------------------------")

model.load_weights("model.h5")
layer_emb_model = model.layers[0]
model2.load_weights("model2.h5")
layer_emb_model2 = model2.layers[0]
model3.load_weights("model3.h5")
layer_emb_model3 = model3.layers[0]

layer = model3.layers[0]
layer.set_weights(layer_emb_model2.get_weights())
model3.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history=model3.evaluate(   x=inputs, y=categorical_outputs)
Loss2_3 = history[0]
ACC2_3 = history[1]

layer = model2.layers[0]
layer.set_weights(layer_emb_model2.get_weights())
model2.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history=model2.evaluate(   x=inputs, y=categorical_outputs)
Loss2_2 = history[0]
ACC2_2 = history[1]

layer = model.layers[0]
layer.set_weights(layer_emb_model2.get_weights())
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history=model.evaluate(   x=inputs, y=categorical_outputs)
Loss2_1 = history[0]
ACC2_1 = history[1]
print("Loss while using weights of model_2 on first, second, and third model are:{}, {}, {}".format(Loss2_1,Loss2_2,Loss2_3))
print("Average loss while using weights of second model is:", (Loss2_1+Loss2_2+Loss2_3)/3)
print("--------------------------------------------")

model.load_weights("model.h5")
layer_emb_model = model.layers[0]
model2.load_weights("model2.h5")
layer_emb_model2 = model2.layers[0]
model3.load_weights("model3.h5")
layer_emb_model3 = model3.layers[0]

layer = model3.layers[0]
layer.set_weights(layer_emb_model3.get_weights())
model3.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history=model3.evaluate(   x=inputs, y=categorical_outputs)
Loss3_3 = history[0]
ACC3_3 = history[1]

layer = model2.layers[0]
layer.set_weights(layer_emb_model3.get_weights())
model2.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history=model2.evaluate(   x=inputs, y=categorical_outputs)
Loss3_2 = history[0]
ACC3_2 = history[1]

layer = model.layers[0]
layer.set_weights(layer_emb_model3.get_weights())
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history=model.evaluate(   x=inputs, y=categorical_outputs)
Loss3_1 = history[0]
ACC3_1 = history[1]
print("Loss while using weights of model_3 on first, second, and third model are:{}, {}, {}".format(Loss3_1,Loss3_2,Loss3_3))
print("Average loss while using weights of third model is:", (Loss3_1+Loss3_2+Loss3_3)/3)






Loss while using weights of model_1 on first, second, and third model are:4.619924710226193, 12.160405842994647, 12.693790654152354
Average loss while using weights of first model is: 9.824707069124399
--------------------------------------------
Loss while using weights of model_2 on first, second, and third model are:9.849282254054945, 5.66019674527942, 11.367405138523626
Average loss while using weights of second model is: 8.958961379285997
--------------------------------------------
Loss while using weights of model_3 on first, second, and third model are:8.362277764789512, 8.404916897146105, 5.6606748846618125
Average loss while using weights of third model is: 7.475956515532477


**Average loss while using weights of first model is: 9.824707069124399**

**Average loss while using weights of second model is: 8.958961379285997**

**Average loss while using weights of third model is: 7.475956515532477**


**As you can see, when we use the third model's embedding layer weights, the "Average Loss" has the lowest value showing that model-3 has the best embedding layer. It is clear that using GRU and giving all the sentence as input to the model(as mini batch) had a very significant role on this better functionality.**

**The second model has better embedding layer comparing to first model since we gave all the sentence as input(mini-batch) to second model but it has worse embedding layer than third model since model-2 has not long term memory.**

In [0]:
# model.save("model.h5")
# model2.save("model2.h5")
# model3.save("model3.h5")


## **PART FIVE**

In [0]:
from google.colab import files
files.download('model2.h5')

In [0]:
import tarfile
tf = tarfile.open("20Newsgroups_subsampled.tar")
tf.extractall()

In [0]:

"""This code is used to read all news and their labels"""
import os
import glob

def to_categories(name, cat=["politics","rec","comp","religion"]):
    for i in range(len(cat)):
        if str.find(name,cat[i])>-1:
            return(i)
    print("Unexpected folder: " + name) # print the folder name which does not include expected categories
    return("wth")

def data_loader(images_dir):
    categories = os.listdir(data_path)
    news = [] # news content
    groups = [] # category which it belong to
    
    for cat in categories:
        print("Category:"+cat)
        for the_new_path in glob.glob(data_path + '/' + cat + '/*'):
            news.append(open(the_new_path,encoding = 'ISO-8859-1', mode ='r').read())
            groups.append(cat)

    return news, list(map(to_categories, groups))



data_path = "/content/20news_subsampled"
news, groups = data_loader(data_path)

Category:rec.autos
Category:rec.sport.hockey
Category:comp.windows.x
Category:comp.graphics
Category:talk.politics.guns
Category:comp.sys.mac.hardware
Category:rec.sport.baseball
Category:comp.sys.ibm.pc.hardware
Category:comp.os.ms-windows.misc
Category:talk.politics.mideast
Category:rec.motorcycles
Category:talk.religion.misc
Category:talk.politics.misc
Category:soc.religion.christian


In [0]:
import numpy as np
import nltk 
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
nltk.download('punkt')
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, Masking, SimpleRNN, TimeDistributed, LSTM, GRU, SpatialDropout1D
from keras.optimizers import Adam

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [0]:
print(news[0])

From: RZAA80@email.sps.mot.com (Jim Chott)
Subject: Re: Re: Toyota Land Cruiser worth it?

In article <2820016@iftccu.ca.boeing.com>, hovnania@iftccu.ca.boeing.com
(Paul Hovnanian) wrote:
> 
> Based on my experience with a '79 FJ40 ( the hard-top jeep-style model ) I 
> would definitely give a new model consideration if I were in the market. The
> older models are VERY well built. Unless Toyota lost its mind, I would
> assume, until  proven otherwise, that the newer models have inherited some
> if not all of the qualities of their ancestors.
> 
> Two major differences in the running gear (that I'm aware of) need study.
> My '79 has a solid front axle housing whereas the newer models have
> independant front suspension. The solid axle is theoretically stronger and


The new Cruisers DO NOT have independent suspension in the front.  They
still
run a straight axle, but with coils.  The 4Runner is the one with
independent
front.  The Cruisers have incredible wheel travel with this system. 

In [0]:
import re
import nltk
nltk.download('stopwords')
SIGNS = re.compile('[/(){}\[\]\|@,;]')
Allowed = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(nltk.corpus.stopwords.words('english'))
NEWS=[]
lenthes=[]
for text in news:
    text = text.lower() # lowercase text
    text = SIGNS.sub(' ', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) 
    NEWS.append(text)
    lenthes.append(len(text))
print(NEWS[0])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
from: rzaa80 email.sps.mot.com jim chott subject: re: re: toyota land cruiser worth it? article <2820016 iftccu.ca.boeing.com> hovnania iftccu.ca.boeing.com paul hovnanian wrote: > > based experience '79 fj40 hard-top jeep-style model > would definitely give new model consideration market. > older models well built. unless toyota lost mind would > assume proven otherwise newer models inherited > qualities ancestors. > > two major differences running gear i'm aware need study. > '79 solid front axle housing whereas newer models > independant front suspension. solid axle theoretically stronger new cruisers independent suspension front. still run straight axle coils. 4runner one independent front. cruisers incredible wheel travel system. > reliable newer model experience tell. > independant front suspension doubt compromise made satisfy > typical user never need real utility vehi

In [0]:
Maxofwords = np.max(lenthes)
tokenizer = Tokenizer( filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(NEWS)
word_index = tokenizer.word_index
X=[]
t=[]
X = tokenizer.texts_to_sequences(NEWS)
print(len(X[2]))
for i in range(13108):
    t.append(len(X[i]))
X = pad_sequences(X,maxlen=7000)
print('Shape of data tensor:', X.shape)
# print(X[0])

87
Shape of data tensor: (13108, 7000)


In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(X,groups, test_size = 0.00001, shuffle= True)
X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = np.array(Y_train)
Y_test = np.array(Y_test)
Y_train = to_categorical(Y_train,num_classes= 4)
Y_test = to_categorical(Y_test,num_classes= 4)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(13107, 7000) (13107, 4)
(1, 7000) (1, 4)


**Implementing the Elman Network for text classification with SimpleRNN module and dropout to prevent overfiting:**

In [0]:
model4 = Sequential()
model4.add(Embedding(len(word_index)+1, 100, input_length=X.shape[1]))
model4.add(SpatialDropout1D(0.2))
model4.add(SimpleRNN(500, dropout=0.2, recurrent_dropout=0.2, activation='sigmoid'))
model4.add(Dense(4, activation='softmax'))
model4.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model4.summary()


Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 7000, 100)         14815700  
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 7000, 100)         0         
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, 500)               300500    
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 2004      
Total params: 15,118,204
Trainable params: 15,118,204
Non-trainable params: 0
_________________________________________________________________


In [0]:
import keras
from keras.callbacks import EarlyStopping
# DataGenerator = keras.preprocessing.image.ImageDataGenerator()
# train_generator = DataGenerator.flow(X_train, Y_train, batch_size=64)
# valid_generator = DataGenerator.flow(X_test, Y_test, batch_size=64)
# model.fit_generator(train_generator, steps_per_epoch=X_train.shape[0]//64, epochs=100, validation_data=valid_generator, validation_steps=X_test.shape[0]//64, shuffle=True)
model4.fit(X_train, Y_train, epochs=5, batch_size=64,validation_split=0.1)

Train on 11796 samples, validate on 1311 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fd50c6ec470>

**Epoch 3/5
11796/11796 [==============================] - 1084s 92ms/step - loss: 0.4910 - acc: 0.8264 - val_loss: 1.0241 - val_acc: 0.6339**

**Implementing the Elman Network for text classification with GRU module and dropout to prevent overfiting:**

In [0]:
model5 = Sequential()
model5.add(Embedding(len(word_index)+1, 100, input_length=X.shape[1]))
model5.add(SpatialDropout1D(0.2))
model5.add(GRU(500, dropout=0.2, recurrent_dropout=0.2, activation='sigmoid'))
model5.add(Dense(4, activation='softmax'))
model5.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model5.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 7000, 100)         14815700  
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 7000, 100)         0         
_________________________________________________________________
gru_2 (GRU)                  (None, 500)               901500    
_________________________________________________________________
dense_5 (Dense)              (None, 4)                 2004      
Total params: 15,719,204
Trainable params: 15,719,204
Non-trainable params: 0
_________________________________________________________________


In [0]:
model5.fit(X_train, Y_train, epochs=5, batch_size=64,validation_split=0.1)

Train on 11796 samples, validate on 1311 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fd81de39588>

**Epoch 5/5
11796/11796 [==============================] - 3060s 259ms/step - loss: 0.0504 - acc: 0.9828 - val_loss: 0.2944 - val_acc: 0.9130
<keras.callbacks.History at 0x7fd81de39588>**

In [0]:
model4.save("model4.h5")
model5.save("model5.h5")

**As you can see, using GRU model has better validation accuracy**