## Deep Learning Course (980)
## Assignment Three 

__Assignment Goals:__

- Implementing RNN based language models.
- Implementing and applying a Recurrent Neural Network to a text classification problem using TensorFlow.
- Implementing __many to one__ and __many to many__ RNN sequence processing.

In this assignment, you will implement RNN-based language models and compare the word representations extracted from the different models. You will also compare two training methods for sequential data: Truncated Backpropagation Through Time __(TBTT)__ and Backpropagation Through Time __(BTT)__. 
You will also apply a vanilla RNN to learn word representations and solve a text classification problem. 


__Datasets__: You will use two datasets: the English Literature dataset for the language modelling task (parts 1 to 4) and 20Newsgroups for text classification (part 5). 


1. (30 points) Implement the RNN-based language model described by Mikolov et al. [1], also called an __Elman network__, and train a language model on the English Literature dataset. This network contains an input, a hidden, and an output layer, and is trained by standard backpropagation (TBTT with τ = 1) using the cross-entropy loss. 
   - The input consists of the current word in 1-of-N coding (so its size equals the size of the vocabulary) together with the vector s(t − 1), which holds the hidden-layer outputs from the previous time step. 
   - The hidden layer is a fully connected sigmoid layer with size 500. 
   - The output layer is a softmax layer, so that the outputs form a valid probability distribution.
   - The model is trained with truncated backpropagation through time (TBTT) with τ = 1: the weights of the network are updated based on the error vector computed only for the current time step.
   
   Download the English Literature dataset and train the language model as described. Report the model's cross-entropy loss on the training set. Use nltk.word_tokenize to tokenize the documents. 
   For initialization, s(0) can be set to a vector of small values. To improve performance, you can merge all words that occur less often than a threshold (here 3) into a special rare token (\__unk__). Note that we are not interested in the *dynamic model* mentioned in the original paper. 
   To simplify the implementation, you can use Keras to define the neural network layers, including Keras.Embedding. (Keras.Embedding creates an additional mapping layer compared to the Elman architecture.) 

2. (20 points) TBTT has lower computational cost and memory requirements than *backpropagation through time (BTT)*. These benefits come at the cost of losing long-term dependencies [2]. Now let's investigate the computational cost and performance of training our language model with BTT. To train the Elman-type RNN with BTT, one option is to perform mini-batch gradient descent with exactly one sentence per mini-batch. (The input size will be [1, Sentence Length].) 

    1. Split the document into sentences (you can use nltk.tokenize.sent_tokenize).
    2. For each sentence, perform one pass that computes the mean/sum loss for this sentence; then perform a gradient update for the whole sentence. (So the mini-batch size varies for the sentences with different lengths). You can truncate long sentences to fit the data in memory. 
    3. Report the model cross-entropy loss.

3. (15 points) Simple recurrent neural networks do not seem able to truly exploit context information with long dependencies, because of the vanishing and exploding gradient problem. To address this problem, gating mechanisms for recurrent neural networks were introduced. Train your last model (Elman + BTT) with the SimpleRNN unit replaced by a Gated Recurrent Unit (GRU). Report the model's cross-entropy loss. Compare your results in terms of cross-entropy loss with the two other approaches (parts 1 and 2). Use each model to generate 10 synthetic sentences of 15 words each. Discuss the quality of the generated sentences: do they look like proper English? Do they match the training set?
    Text generation from a given language model can be done using the following iterative process:
   1. Set sequence = \[first_word\], chosen randomly.
   2. Select a new word based on the sequence so far, add this word to the sequence, and repeat. At each iteration, select the word with maximum probability given the sequence so far. The trained language model outputs this probability. 

4. (15 points) The text describes how to extract a word representation from a trained RNN (Chapter 4). How can we evaluate the word representations extracted from your trained RNNs? Compare the word representations extracted by each of the approaches using one of the existing evaluation methods.

5. (20 points) We aim to learn an RNN model that predicts a document's category given its content (text classification). For this task, we will use the 20Newsgroups dataset, which contains messages from twenty newsgroups. We selected four major categories (comp, politics, rec, and religion) comprising around 13k documents altogether. Your model should learn word representations to support the classification task. To solve this problem, modify the __Elman network__ architecture so that the last layer is a softmax layer with just 4 output neurons (one for each category). 

    1. Download the 20Newsgroups dataset, and use the implemented code from the notebook to read in the dataset.
    2. Split the data into a training set (90 percent) and a validation set (10 percent). Train the model on 20Newsgroups.
    3. Report your accuracy results on the validation set.
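
The Elman recurrence from part 1 can be written as s(t) = sigmoid(U w(t) + W s(t − 1)) and y(t) = softmax(V s(t)). The following is a minimal pure-Python sketch of one forward step, not a reference solution. The matrix names `U`, `W`, `V`, the helper names, and the toy sizes are assumptions chosen for illustration; the assignment itself uses a hidden size of 500.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(zs):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def elman_step(word_id, s_prev, U, W, V):
    """One time step: returns the new hidden state s(t) and output distribution y(t)."""
    # With 1-of-N coding, U w(t) is simply column `word_id` of U
    u_col = [row[word_id] for row in U]
    rec = matvec(W, s_prev)
    s_t = [sigmoid(a + b) for a, b in zip(u_col, rec)]
    y_t = softmax(matvec(V, s_t))
    return s_t, y_t

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
vocab, hidden = 6, 4           # toy sizes (hypothetical)
U = rand_matrix(hidden, vocab)
W = rand_matrix(hidden, hidden)
V = rand_matrix(vocab, hidden)

s = [0.1] * hidden             # s(0): a vector of small values
for w in [2, 5, 1]:            # a toy sequence of word indices
    s, y = elman_step(w, s, U, W, V)
print(round(sum(y), 6))        # the softmax output sums to 1
```

With TBTT and τ = 1, the error at step t would be propagated only through this single step, treating s(t − 1) as a constant.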

__NOTE__: Please use Jupyter Notebook. The notebook should include the final code, results, and your answers. You should submit your notebook in both (.pdf or .html) and .ipynb formats. (penalty 10 points) 

__Instructions__:

The university policy on academic dishonesty and plagiarism (cheating) will be taken very seriously in this course. Everything submitted should be your own writing or coding. You must not let other students copy your work. Spelling and grammar count.

Your assignments will be marked based on correctness, originality (the implementations and ideas are your own), clarity, and test performance.


[1] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, Sanjeev Khudanpur. "Recurrent neural network based language model." In: Proc. INTERSPEECH 2010.

[2] Tallec, Corentin, and Yann Ollivier. "Unbiasing truncated backpropagation through time." arXiv preprint arXiv:1705.08209 (2017).


In [88]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available:  1


In [89]:
import numpy as np
import string
import random
import matplotlib.pyplot as plt
from collections import Counter
import nltk
nltk.download('punkt')
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, GRU, Dense, Dropout
from keras.callbacks import CSVLogger, ModelCheckpoint, TensorBoard
from keras import optimizers
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

[nltk_data] Downloading package punkt to /home/lpan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [90]:
# Part1
# Import file
def load_doc(filename):
    # Read the whole corpus into one string; the context manager closes the file
    with open(filename, 'r') as file:
        return file.read()

raw_text = load_doc('English Literature.txt')
tokens = nltk.word_tokenize(raw_text)

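# Data cleaning: keep purely alphabetic tokens and lowercase them
# (isalpha() already excludes punctuation, so the translate step is only a safeguard)
translator = str.maketrans('', '', string.punctuation)
tokens_cleaned = [s.translate(translator).lower() for s in tokens if s.isalpha()]
# print('before cleaning: ', tokens[:5])
tokens = tokens_cleaned
# print('after cleaning: ', tokens[:5])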

# Merge all words that occur less often than a threshold into a special rare token (_unk_), and create word dictionary
threshold = 3
word_dict = {'pad': 0, '_unk_': 1}  # 'pad' for use in part 2
idx2word = ['pad','_unk_']
counter = Counter(tokens)
for word, count in counter.items():
    if count >= threshold:
        idx2word.append(word)
        word_dict[word] = len(word_dict)
vocab_size = len(word_dict)
print('total vocabulary size: ', vocab_size)

# Map to training data X and y
X = []
y = []
max_length = 1
for i in range(len(tokens) - max_length):
    sentence = tokens[i:i+max_length]
    gt_word = tokens[i+max_length]
    X.append([word_dict.get(word,1) for word in sentence])
    y.append(word_dict.get(gt_word,1))
X = np.array(X)
print('y shape: ', len(y))
# One hot encoding
Y = to_categorical(y, vocab_size)
print('X shape: ', X.shape)
print('Y shape: ', Y.shape)

total vocabulary size:  4650
y shape:  200600
X shape:  (200600, 1)
Y shape:  (200600, 4650)
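
The rare-word merging used in the cell above can be seen on a toy corpus. This is a small sketch under assumptions: the function names `build_vocab` and `encode` and the toy sentence are made up for illustration, and the threshold is lowered to 2 so the effect is visible on a few tokens.

```python
from collections import Counter

def build_vocab(tokens, threshold=3, unk="_unk_", pad="pad"):
    """Map every word seen at least `threshold` times to an index; 0 = pad, 1 = unk."""
    word_to_id = {pad: 0, unk: 1}
    for word, count in Counter(tokens).items():
        if count >= threshold:
            word_to_id[word] = len(word_to_id)
    return word_to_id

def encode(tokens, word_to_id, unk="_unk_"):
    """Replace each token by its index, falling back to the unk index."""
    unk_id = word_to_id[unk]
    return [word_to_id.get(t, unk_id) for t in tokens]

toy_tokens = "the cat saw the dog and the cat saw the bird".split()
toy_vocab = build_vocab(toy_tokens, threshold=2)
toy_ids = encode(toy_tokens, toy_vocab)
print(toy_vocab)  # {'pad': 0, '_unk_': 1, 'the': 2, 'cat': 3, 'saw': 4}
print(toy_ids)    # [2, 3, 4, 2, 1, 1, 2, 3, 4, 2, 1]
```

Words below the threshold ("dog", "and", "bird") all collapse to index 1, which shrinks the softmax layer at the price of losing those words as distinct predictions.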


In [91]:
# Build model
max_length = 1
model1 = Sequential()
model1.add(Embedding(vocab_size, 50, input_length=max_length))
model1.add(SimpleRNN(500, activation='sigmoid'))
model1.add(Dense(vocab_size, activation='softmax'))
model1.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 1, 50)             232500    
_________________________________________________________________
simple_rnn_7 (SimpleRNN)     (None, 500)               275500    
_________________________________________________________________
dense_10 (Dense)             (None, 4650)              2329650   
Total params: 2,837,650
Trainable params: 2,837,650
Non-trainable params: 0
_________________________________________________________________
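
The parameter counts in the summary above can be reproduced by hand. The sketch below is plain arithmetic with hypothetical helper names; it assumes the standard layouts of an embedding table, a simple recurrent layer (input kernel, recurrent kernel, bias), and a dense layer with bias.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized row per vocabulary entry
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    # Input kernel (input_dim x units) + recurrent kernel (units x units) + bias (units)
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embed_dim, hidden = 4650, 50, 500
counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "simple_rnn": simple_rnn_params(embed_dim, hidden),
    "dense": dense_params(hidden, vocab_size),
}
total = sum(counts.values())
print(counts)  # {'embedding': 232500, 'simple_rnn': 275500, 'dense': 2329650}
print(total)   # 2837650
```

Most of the parameters sit in the softmax layer, which is one reason merging rare words into `_unk_` helps.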


In [93]:
# Training
part1_root_folder = 'checkpoints_part1'
checkpointer1 = ModelCheckpoint(
        filepath=part1_root_folder+'/epoch-{epoch:02d}-accu-{accuracy:.4f}-loss-{loss:.4f}.hdf5',
        monitor='loss',
        verbose=1,
        save_best_only=True, mode='auto', period=10)
logfile1 = CSVLogger(part1_root_folder+'/train.log', append=False, separator=',')
adam = optimizers.Adam(lr=0.005)
model1.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
history1 = model1.fit(X,Y,
        batch_size=128,
        epochs=30,
        verbose=1,
        callbacks=[checkpointer1, logfile1],
        workers=4)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 00010: loss improved from inf to 4.79031, saving model to checkpoints_part1/epoch-10-accu-0.1060-loss-4.7903.hdf5
Epoch 00020: loss improved from 4.79031 to 4.57190, saving model to checkpoints_part1/epoch-20-accu-0.1048-loss-4.5719.hdf5
Epoch 00030: loss improved from 4.57190 to 4.51024, saving model to checkpoints_part1/epoch-30-accu-0.1055-loss-4.5102.hdf5
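
The categorical cross-entropy reported above is a mean negative log-likelihood in nats, so it can be converted to a perplexity by exponentiating. The following sketch uses hypothetical helper names and a toy list of probabilities; only the value 4.5102 is taken from the log above.

```python
import math

def mean_cross_entropy(true_word_probs):
    """Average negative log-likelihood (in nats) of the correct next words."""
    return -sum(math.log(p) for p in true_word_probs) / len(true_word_probs)

def perplexity(cross_entropy_nats):
    return math.exp(cross_entropy_nats)

# Toy example: probabilities the model assigned to three correct next words
toy_loss = mean_cross_entropy([0.5, 0.25, 0.125])
toy_ppl = perplexity(toy_loss)      # geometric mean of 2, 4 and 8

# Final part 1 training loss from the checkpoint log
part1_ppl = perplexity(4.5102)
print(round(toy_ppl, 3), round(part1_ppl, 1))
```

A perplexity of roughly 91 means the part 1 model is, on average, about as uncertain as a uniform choice among 91 words out of the 4650 in the vocabulary.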


In [31]:
# Part2
raw_text = raw_text.replace('\n', ' ').replace('\r', '') # replace new lines with spaces
sentences = nltk.sent_tokenize(raw_text)
max_length = 30 # truncate long sentence
sequences = list()
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    translator = str.maketrans('', '', string.punctuation)
    tokens_cleaned = [s.translate(translator).lower() for s in tokens if s.isalpha()]
    tokens = tokens_cleaned
    encoded = [word_dict.get(word, 1) for word in tokens]
    for i in range(1, min(len(encoded), max_length)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
sequences =  np.array(pad_sequences(sequences, maxlen=max_length, padding='pre'))
# print(sequences.shape)
X = sequences[:,:-1] # length - 1
y = sequences[:,-1]
Y = to_categorical(y, num_classes=vocab_size)
# print(X.shape)
# print(Y.shape)

# def seq2sentence(seq):
#     sentence = ''
#     for i in range(len(seq)):
#         sentence += idx2word[seq[i]] + ' '
#     print(sentence.strip())
# seq2sentence([3645, 4272, 490, 4263, 3150, 1927, 3520, 2444, 5209, 2970, 4781, 3168, 1797]) 
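
The prefix expansion in the cell above turns one sentence into several (context, next word) training pairs. Here is a small pure-Python sketch of the same idea, with a hypothetical function name and made-up word indices; index 0 is the padding value, as in the notebook.

```python
def prefix_examples(encoded, max_length):
    """Return (context, target) pairs for one encoded sentence.

    Each context is left-padded with 0 to exactly max_length - 1 tokens,
    mirroring pad_sequences(..., padding='pre') followed by the X / y split.
    """
    pairs = []
    for i in range(1, min(len(encoded), max_length)):
        prefix = encoded[:i + 1]
        context, target = prefix[:-1], prefix[-1]
        padded = [0] * (max_length - 1 - len(context)) + context
        pairs.append((padded, target))
    return pairs

pairs = prefix_examples([7, 3, 9, 4], max_length=5)
for context, target in pairs:
    print(context, "->", target)
# [0, 0, 0, 7] -> 3
# [0, 0, 7, 3] -> 9
# [0, 7, 3, 9] -> 4
```

Because every prefix becomes its own training example, gradients for each target flow back through the whole preceding context, at the cost of many more rows than there are sentences.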

In [32]:
# Build model
model2 = Sequential()
model2.add(Embedding(vocab_size, 50, input_length=max_length-1, mask_zero=True))
model2.add(SimpleRNN(500, activation='sigmoid'))
model2.add(Dense(vocab_size, activation='softmax'))
model2.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 29, 50)            232500    
_________________________________________________________________
simple_rnn_6 (SimpleRNN)     (None, 500)               275500    
_________________________________________________________________
dense_8 (Dense)              (None, 4650)              2329650   
Total params: 2,837,650
Trainable params: 2,837,650
Non-trainable params: 0
_________________________________________________________________


In [9]:
# Training
part2_root_folder = 'checkpoints_part2'
checkpointer2 = ModelCheckpoint(
        filepath=part2_root_folder+'/epoch-{epoch:02d}-accu-{accuracy:.4f}-loss-{loss:.4f}.hdf5',
        monitor='loss',
        verbose=1,
        save_best_only=True, mode='auto', period=10)
logfile2 = CSVLogger(part2_root_folder+'/train.log', append=False, separator=',')
adam = optimizers.Adam(lr=0.005)
model2.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
history2 = model2.fit(X,Y,
        batch_size=128,
        epochs=60,
        verbose=1,
        callbacks=[checkpointer2, logfile2],
        workers=4)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 00010: loss improved from inf to 3.64862, saving model to checkpoints_part2/epoch-10-accu-0.2245-loss-3.6486.hdf5
Epoch 00020: loss improved from 3.64862 to 2.32693, saving model to checkpoints_part2/epoch-20-accu-0.4436-loss-2.3269.hdf5
Epoch 00030: loss improved from 2.32693 to 1.83989, saving model to checkpoints_part2/epoch-30-accu-0.5383-loss-1.8399.hdf5
Epoch 00040: loss improved from 1.83989 to 1.64163, saving model to checkpoints_part2/epoch-40-accu-0.5767-loss-1.6416.hdf5
Epoch 41/60
Epoch 42/60
Epoch 43/

In [33]:
# Part3
# Build model
model3 = Sequential()
model3.add(Embedding(vocab_size, 50, input_length=max_length-1, mask_zero=True))
model3.add(GRU(500, activation='sigmoid'))
model3.add(Dense(vocab_size, activation='softmax'))
model3.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 29, 50)            232500    
_________________________________________________________________
gru_3 (GRU)                  (None, 500)               826500    
_________________________________________________________________
dense_9 (Dense)              (None, 4650)              2329650   
Total params: 3,388,650
Trainable params: 3,388,650
Non-trainable params: 0
_________________________________________________________________


In [18]:
# Training
part3_root_folder = 'checkpoints_part3'
checkpointer3 = ModelCheckpoint(
        filepath=part3_root_folder+'/epoch-{epoch:02d}-accu-{accuracy:.4f}-loss-{loss:.4f}.hdf5',
        monitor='loss',
        verbose=1,
        save_best_only=True, mode='auto', period=10)
logfile3 = CSVLogger(part3_root_folder+'/train.log', append=False, separator=',')
adam = optimizers.Adam(lr=0.005)
model3.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
history3 = model3.fit(X,Y,
        batch_size=128,
        epochs=60,
        verbose=1,
        callbacks=[checkpointer3, logfile3],
        workers=4)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 00010: loss improved from inf to 2.04903, saving model to checkpoints_part3/epoch-10-accu-0.5016-loss-2.0490.hdf5
Epoch 00020: loss improved from 2.04903 to 1.20246, saving model to checkpoints_part3/epoch-20-accu-0.6880-loss-1.2025.hdf5
Epoch 00030: loss improved from 1.20246 to 1.04127, saving model to checkpoints_part3/epoch-30-accu-0.7216-loss-1.0413.hdf5
Epoch 00040: loss improved from 1.04127 to 1.01060, saving model to checkpoints_part3/epoch-40-accu-0.7249-loss-1.0106.hdf5
Epoch 41/60
Epoch 42/60
Epoch 43/

In [42]:
# Predict
def generate_seq_first(model_chosen, input_text, num_output):
    # Part 1 model: predicts the next word from the previous word only
    model1.load_weights(model_chosen)
    pred_sentence = input_text
    for _ in range(num_output):
        encoded = word_dict.get(input_text, 1)  # fall back to _unk_ for unseen words
        pred_y = model1.predict(np.array([[encoded]]), verbose=0)
        pred_word = idx2word[np.argmax(pred_y)]
        pred_sentence += ' ' + pred_word
        # set prediction to the input for the next iteration
        input_text = pred_word
    print('SimpleRNN + TBTT model predicted output:', pred_sentence.strip())
    
def generate_seq_second(model_chosen, input_text, num_output):
    model2.load_weights(model_chosen)
    for _ in range(num_output):
        words = nltk.word_tokenize(input_text)
        encoded = [word_dict.get(word, 1) for word in words]
        input_x = pad_sequences([encoded], maxlen=max_length-1, padding='pre')
        pred_y = model2.predict(input_x, verbose=0)
        pred_word = idx2word[np.argmax(pred_y)] 
        input_text += ' ' + pred_word
    print('SimpleRNN + BTT model predicted output: ', input_text.strip())

def generate_seq_third(model_chosen, input_text, num_output):
    model3.load_weights(model_chosen)
    for _ in range(num_output):
        words = nltk.word_tokenize(input_text)
        encoded = [word_dict.get(word, 1) for word in words]
        input_x = pad_sequences([encoded], maxlen=max_length-1, padding='pre')
        pred_y = model3.predict(input_x, verbose=0)
        pred_word = idx2word[np.argmax(pred_y)] 
        input_text += ' ' + pred_word
    print('GRU + BTT model predicted output:       ', input_text.strip())

num_output = 15
model1_chosen = 'checkpoints_part1/best.hdf5'
model2_chosen = 'checkpoints_part2/best.hdf5'
model3_chosen = 'checkpoints_part3/best.hdf5'

# Fixed seed words, one per generated sentence (random sampling is left commented out below)
input_texts = ['we', 'rather', 'first', 'what', 'set', 'these', 'welcome', 'each', 'keep', 'whiles']
for i in range(len(input_texts)):
#     rand = random.randint(0,len(tokens)-1)
#     input_text = tokens[rand]
    input_text = input_texts[i]
    print('\n selected seed:', input_text + '\n')
    # SimpleRNN + TBTT
    generate_seq_first(model1_chosen, input_text, num_output)
    # SimpleRNN + BTT 
    generate_seq_second(model2_chosen, input_text, num_output)
    # GRU + BTT
    generate_seq_third(model3_chosen, input_text, num_output)


 selected seed: we

SimpleRNN + TBTT model predicted output: we aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud
SimpleRNN + BTT model predicted output:  we have not yet been seen in any house nor can we lie distinguish by our
GRU + BTT model predicted output:        we are undone lady we are undone to hear speak sir and since he play but

 selected seed: rather

SimpleRNN + TBTT model predicted output: rather actor aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_
SimpleRNN + BTT model predicted output:  rather say i play the _unk_ do do not be forgot right noble lord the good
GRU + BTT model predicted output:        rather no no more shall fight with such gentle lambs and throw them hither in the

 selected seed: first

SimpleRNN + TBTT model predicted output: first future _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud _unk_ aloud
SimpleRNN + BTT model predicted output:  first senator
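
The greedy generation loop used by the three functions above can be isolated from Keras. The sketch below is an illustration under assumptions: `generate_greedy`, `toy_model`, and the bigram table are hypothetical stand-ins for a trained model that returns next-word probabilities.

```python
def generate_greedy(next_word_probs, first_word, num_words):
    """Repeatedly append the most probable next word given the sequence so far."""
    sequence = [first_word]
    for _ in range(num_words):
        probs = next_word_probs(sequence)       # dict: word -> probability
        sequence.append(max(probs, key=probs.get))
    return sequence

# Hypothetical bigram table standing in for a trained language model
TOY_BIGRAMS = {
    "we": {"are": 0.6, "have": 0.4},
    "are": {"undone": 0.7, "we": 0.3},
    "undone": {"we": 0.9, "are": 0.1},
    "have": {"we": 1.0},
}

def toy_model(sequence):
    # Like the part 1 model, this one only looks at the last word
    return TOY_BIGRAMS[sequence[-1]]

generated = generate_greedy(toy_model, "we", 5)
print(" ".join(generated))  # we are undone we are undone
```

A model that conditions only on the last word makes greedy decoding a fixed function of that word, so the output must eventually cycle. This is the same effect as the repeating `aloud _unk_` pattern in the SimpleRNN + TBTT output; sampling from the distribution instead of taking the maximum is one common way to break such loops.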

In [56]:
# Part4
import gluonnlp as nlp
import numpy as np
from scipy import stats

wordsim353 = nlp.data.WordSim353()
sim_word1 = []
sim_word2 = []
scores = []
for i in range(len(wordsim353)):
    w1 = wordsim353[i][0].lower()
    w2 = wordsim353[i][1].lower()
    score = wordsim353[i][2]
    if w1 in word_dict and w2 in word_dict:  # keep only pairs fully covered by the vocabulary
        sim_word1.append(w1)
        sim_word2.append(w2)
        scores.append(score)
print(sim_word1[0:5])
print(sim_word2[0:5])
print(scores[0:5])

['mars', 'wednesday', 'attempt', 'baby', 'bank']
['water', 'news', 'peace', 'mother', 'money']
[2.94, 2.22, 4.25, 7.85, 8.12]


In [70]:
embeddings1 = model1.layers[0].get_weights()[0]
embeddings2 = model2.layers[0].get_weights()[0]
embeddings3 = model3.layers[0].get_weights()[0]

# Compare
from sklearn.metrics.pairwise import cosine_similarity
def get_similarity_score(embeddings): 
    sim_values = []
    for i in range(len(sim_word1)):
        vec1 = embeddings[word_dict[sim_word1[i]]]
        vec2 = embeddings[word_dict[sim_word2[i]]]
        sim_val = cosine_similarity(vec1.reshape(1, -1), vec2.reshape(1, -1))[0][0]
        sim_values.append(sim_val)
    return sim_values

model1_score = get_similarity_score(embeddings1)
model2_score = get_similarity_score(embeddings2)
model3_score = get_similarity_score(embeddings3)

sr1 = stats.spearmanr(np.array(model1_score), np.array(scores))
sr2 = stats.spearmanr(np.array(model2_score), np.array(scores))
sr3 = stats.spearmanr(np.array(model3_score), np.array(scores))

# Printed in order: model1 (SimpleRNN + TBTT), model2 (SimpleRNN + BTT), model3 (GRU + BTT)
print('Spearman rank correlation on wordsim353 {}'.format(sr1.correlation.round(3)))
print('Spearman rank correlation on wordsim353 {}'.format(sr2.correlation.round(3)))
print('Spearman rank correlation on wordsim353 {}'.format(sr3.correlation.round(3)))

Spearman rank correlation on wordsim353 -0.136
Spearman rank correlation on wordsim353 0.312
Spearman rank correlation on wordsim353 0.349
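
The evaluation above ranks word pairs by embedding cosine similarity and compares that ranking with human similarity judgements. Below is a dependency-free sketch of both steps. The toy vectors and the human scores are made up for illustration and are not WordSim353 values; the notebook itself uses `sklearn` and `scipy.stats.spearmanr`.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks
    return pearson(ranks(x), ranks(y))

# Hypothetical 2-d embeddings and made-up human similarity scores
toy_vectors = {"king": [1.0, 0.0], "queen": [0.9, 0.1], "apple": [0.0, 1.0], "pear": [0.3, 0.8]}
toy_pairs = [("king", "queen", 8.5), ("apple", "pear", 7.9), ("king", "apple", 1.5)]

model_sims = [cosine(toy_vectors[a], toy_vectors[b]) for a, b, _ in toy_pairs]
human_scores = [score for _, _, score in toy_pairs]
rho = spearman(model_sims, human_scores)
print(round(rho, 3))  # 1.0
```

Only the ordering matters for this metric, so the embeddings do not need to reproduce the human scores themselves, just rank the pairs the same way.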


In [86]:
"""This code is used to read all news and their labels"""
import os
import glob

def to_categories(name, cat=["politics","rec","comp","religion"]):
    # Map a newsgroup folder name to the index of the major category it contains
    for i in range(len(cat)):
        if cat[i] in name:
            return i
    raise ValueError("Unexpected folder: " + name)  # folder name matches none of the expected categories

def data_loader(data_path):
    categories = os.listdir(data_path)
    news = [] # news content
    groups = [] # folder (newsgroup) each document belongs to

    for cat in categories:
        for the_new_path in glob.glob(data_path + '/' + cat + '/*'):
            with open(the_new_path, encoding="ISO-8859-1", mode='r') as f:
                news.append(f.read())
            groups.append(cat)
    return news, list(map(to_categories, groups))

data_path = "20news_subsampled"
news, groups = data_loader(data_path)
# Cumulative end index of each contiguous block of documents sharing a label
lengths = [i for i in range(1, len(groups)) if groups[i] != groups[i-1]]
lengths.append(len(groups))
print(lengths)
[2983, 3980, 4755, 5740, 6680, 8642, 9552, 10546, 12480, 13108]


In [85]:
# split each group into 90% train and 10% validation
news_train = []
news_valid = []
groups_train = []
groups_valid = []
start = 0  # first index of the current group
for end in lengths:
    train_size = int(0.9 * (end - start))
    news_train += news[start : start + train_size]
    news_valid += news[start + train_size : end]
    groups_train += groups[start : start + train_size]
    groups_valid += groups[start + train_size : end]
    start = end
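
Splitting each group separately keeps the class proportions the same in the training and validation sets. The sketch below shows a closely related variant that groups by label rather than by contiguous block. The function name and the toy data are assumptions, and shuffling within each class is left out to keep it short.

```python
from collections import defaultdict

def stratified_split(items, labels, train_fraction=0.9):
    """Split (item, label) pairs so each label keeps about the same train/valid ratio."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    train, valid = [], []
    for label, group in by_label.items():
        cut = int(train_fraction * len(group))
        train += [(item, label) for item in group[:cut]]
        valid += [(item, label) for item in group[cut:]]
    return train, valid

# Toy data: 10 documents of class 0 and 20 of class 1
docs = [f"doc{i}" for i in range(30)]
labels = [0] * 10 + [1] * 20
train, valid = stratified_split(docs, labels)
print(len(train), len(valid))                   # 27 3
print(sorted(label for _, label in valid))      # [0, 1, 1]
```

A plain split over the concatenated list would instead put only the last class into the validation set, because the documents are loaded folder by folder.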

In [69]:
tokenizer = Tokenizer(20000)
tokenizer.fit_on_texts(news)
vocab_size = len(tokenizer.word_index) + 1 # add 0 as padding
# print('vocab_size:', vocab_size)
max_length = 300  # matches the (N, 300) tensor shapes printed below
# avg = sum( map(len, sequences) ) / len(sequences) # average length is 292
# train set
sequences_train = tokenizer.texts_to_sequences(news_train)
sequences_train = pad_sequences(sequences_train, maxlen=max_length, padding='pre')
X_train = np.array(sequences_train)
Y_train = to_categorical(np.asarray(groups_train))
print('Shape of train data tensor:', X_train.shape)
print('Shape of train label tensor:', Y_train.shape)
# valid set
sequences_valid = tokenizer.texts_to_sequences(news_valid)
sequences_valid = pad_sequences(sequences_valid, maxlen=max_length, padding='pre')
X_valid = np.array(sequences_valid)
Y_valid = to_categorical(np.asarray(groups_valid))
print('Shape of validation data tensor:', X_valid.shape)
print('Shape of validation label tensor:', Y_valid.shape)

Shape of train data tensor: (11793, 300)
Shape of train label tensor: (11793, 4)
Shape of validation data tensor: (1315, 300)
Shape of validation label tensor: (1315, 4)


In [76]:
model5 = Sequential()
model5.add(Embedding(vocab_size, 50, input_length=max_length, mask_zero=True))
model5.add(SimpleRNN(500, activation='sigmoid'))
model5.add(Dense(128, activation = 'relu'))
model5.add(Dropout(0.5))
model5.add(Dense(4, activation='softmax'))
model5.summary()

# Training
part5_root_folder = 'checkpoints_part5'
checkpointer5 = ModelCheckpoint(
        filepath=part5_root_folder+'/epoch-{epoch:02d}-accu-{val_accuracy:.4f}.hdf5',
        monitor='val_accuracy',
        verbose=1,
        save_best_only=True, mode='auto', period=5)
logfile5 = CSVLogger(part5_root_folder+'/train.log', append=False, separator=',')
model5.compile(loss='categorical_crossentropy', optimizer= 'adam', metrics=['accuracy'])
history5 = model5.fit(X_train,Y_train,
        batch_size=64,
        epochs=30,
        verbose=1,
        callbacks=[checkpointer5, logfile5],
        validation_data=(X_valid, Y_valid), 
        shuffle=True,
        workers=4)

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 300, 50)           7408350   
_________________________________________________________________
simple_rnn_7 (SimpleRNN)     (None, 500)               275500    
_________________________________________________________________
dense_12 (Dense)             (None, 128)               64128     
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_13 (Dense)             (None, 4)                 516       
Total params: 7,748,494
Trainable params: 7,748,494
Non-trainable params: 0
_________________________________________________________________


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 11793 samples, validate on 1315 samples
Epoch 00005: val_accuracy improved from -inf to 0.67148, saving model to checkpoints_part5/epoch-05-accu-0.6715.hdf5
Epoch 00010: val_accuracy improved from 0.67148 to 0.75285, saving model to checkpoints_part5/epoch-10-accu-0.7529.hdf5
Epoch 00015: val_accuracy improved from 0.75285 to 0.77719, saving model to checkpoints_part5/epoch-15-accu-0.7772.hdf5
Epoch 00020: val_accuracy improved from 0.77719 to 0.78935, saving model to checkpoints_part5/epoch-20-accu-0.7894.hdf5
Epoch 00025: val_accuracy did not improve from 0.78935
Epoch 00030: val_accuracy did not improve from 0.78935


In [None]:
My best accuracy on the validation set is 0.78935.

In [99]:
def predict(model_chosen, start, end):
    # Load the checkpoint once, then classify validation documents start..end-1
    model5.load_weights(model_chosen)
    for k in range(start, end):
        input_x = X_valid[k].reshape(1, -1)
        pred = model5.predict(input_x)
        pred_class = cat[np.argmax(pred)]  # index of the most probable category
        print('Predicted class: {0}, Ground truth class:{1}'.format(pred_class, cat[groups_valid[k]]))

cat=["politics","rec","comp","religion"]
model5_chosen = 'checkpoints_part5/epoch-15-accu-0.7772.hdf5'
predict(model5_chosen, 1, 10)

Predicted class: rec, Groud truth class:rec
Predicted class: rec, Groud truth class:rec
Predicted class: rec, Groud truth class:rec
Predicted class: rec, Groud truth class:rec
Predicted class: rec, Groud truth class:rec
Predicted class: rec, Groud truth class:rec
Predicted class: rec, Groud truth class:rec
Predicted class: rec, Groud truth class:rec
Predicted class: rec, Groud truth class:rec


In [None]:
# Backups
# raw_text = raw_text.replace('\n', ' ').replace('\r', '') # replace new lines with spaces
# sentences = nltk.sent_tokenize(raw_text)
# max_length = 20 # truncate long sentence
# sequences = list()
# lengths = list()
# for sentence in sentences:
#     tokens = nltk.word_tokenize(sentence);
#     encoded = [word_dict.get(word, 1) for word in tokens]
#     lengths.append(min(len(encoded), max_length)-1)
#     for i in range(1,min(len(encoded), max_length)):
#         sequence = encoded[:i+1]
#         sequences.append(sequence)
# # max_length = max([len(seq) for seq in sequences])
# sequences =  np.array(pad_sequences(sequences, maxlen=max_length, padding='pre'))
# print(sequences.shape)
# X = sequences[:,:-1] # length - 1
# y = sequences[:,-1]
# # print(X[0:5])
# # print(y[0:5])
# Y = to_categorical(y, num_classes=vocab_size)
# print(X.shape)
# print(Y.shape)

# # def seq2sentence(seq):
# #     sentence = ''
# #     for i in range(len(seq)):
# #         sentence += idx2word[seq[i]] + ' '
# #     print(sentence.strip())
# # seq2sentence([4308,   26, 2546, 4147, 3632]) 

# # Build model
# model2 = Sequential()
# model2.add(Embedding(vocab_size, 100, input_length=max_length-1, mask_zero=True))
# model2.add(SimpleRNN(500, activation='sigmoid'))
# model2.add(Dense(vocab_size, activation='softmax'))
# model2.summary()

# # Training
# adam = optimizers.Adam(lr=0.005)
# model2.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

# part2_root_folder = 'checkpoints_part2_batch'
# def write_log(epoch, loss, accuracy):
#     f = open(part2_root_folder+'/train.log', "a")
#     f.write('epoch:{0:02d} loss: {1:.4f} accuracy: {2:.4f} \n'.format(epoch + 1, loss, accuracy))
    
# epochs = 20
# f = open(part2_root_folder+'/train.log', "a")
# for epoch in range(epochs):
#     start = 0
#     for i in range(len(lengths)):
#         length = lengths[i] # length of current sentence
#         [loss, accuracy] = model2.train_on_batch(X[start: start+length], Y[start: start+length])
#         start += length
#     write_log(epoch, loss, accuracy)
#     if((epoch + 1) % 5 == 0):
#         model2.save(part2_root_folder+'/epoch-{0:02d}-accu-{1:.4f}-loss-{2:.4f}.hdf5'.format(epoch + 1, loss, accuracy))
# f.close()