 # <div style="background-color:lightblue; text-align:center; vertical-align: middle; padding:40px 0;"><p style="font-family: Arial; font-size:1em; color:BLACK; text-align: center;"> Text Classification using Keras Deep Learning Python Library </p></div>

# <p style="font-family: Arial; font-size:1.1em;color:#3498DB;"> Overview</p>

* Text classification, or text categorization, is the task of labelling natural-language texts with relevant predefined categories. The idea is to organize text automatically into different classes, which can drastically simplify and speed up searching through documents.

* This project focuses on algorithmic methods, specifically deep learning algorithms, which are widely used in information science and computer science. Many public text datasets are available online for classification; here I apply classification algorithms to the 20_newsgroup dataset, a collection of roughly 20,000 messages gathered from 20 different netnews newsgroups. The messages are classified according to their content.



# <p style="font-family: Arial; font-size:1em;color:#3498DB;"> Problem Statement </p> 
### <div style="background-color:#f2f2f2; text-align:center; vertical-align: middle; padding:40px 0;"><p>The data is categorized into 20 categories and our job will be to predict the categories</p></div>



* Classifying the 20_newsgroup dataset is a supervised classification problem: there are 20 categories of news, each piece of news belongs to exactly one category, and the goal is to extract suitable features and build an effective model that assigns each piece of news to the correct category.
* I will first explore the training part of the dataset, then extract useful keywords and build feature vectors from the news texts. Based on those vectors I will apply several classification methods, compare their performance on the testing data, and choose one as the final model. Finally, I will validate the performance of the classifiers on different datasets and parameters.



![Screen%20Shot%202019-03-20%20at%203.22.46%20PM.png](attachment:Screen%20Shot%202019-03-20%20at%203.22.46%20PM.png)


# <font style=color:#3498DB;>Data </font>

### About this Dataset
#### Context
* This dataset is a collection of newsgroup documents. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

#### Content
   * There is a file (list.csv) that maps each document_id number to the newsgroup it is associated with. There are also 20 files that contain all of the documents, one file per newsgroup.

   * In this dataset, duplicate messages have been removed and the original messages only contain "From" and "Subject" headers (18828 messages total).

* Each new message in the bundled file begins with these two headers:

    From: Cat

    Subject: Meow Meow Meow

    The Newsgroup and Document_id can be referenced against list.csv

* Organization - Each newsgroup file in the bundle represents a single newsgroup - Each message in a file is the text of some newsgroup document that was posted to that newsgroup.




![Screen%20Shot%202019-03-28%20at%2011.40.03%20AM.png](attachment:Screen%20Shot%202019-03-28%20at%2011.40.03%20AM.png)




#### This is a list of the 20 newsgroups:

* <p>alt.atheism</p>
* <p>comp.os.ms-windows.misc</p>
* <p>comp.graphics</p>
* <p>comp.sys.ibm.pc.hardware</p>
* <p>comp.sys.mac.hardware</p>
* <p>comp.windows.x </p>
* <p>rec.autos</p>
* <p>rec.motorcycles</p>
* <p>rec.sport.baseball</p>
* <p>rec.sport.hockey </p>
* <p>sci.crypt</p>
* <p>sci.electronics</p>
* <p>sci.med</p>
* <p>sci.space</p>
* <p>misc.forsale </p>
* <p>talk.politics.misc</p>
* <p>talk.politics.guns</p>
* <p>talk.politics.mideast </p>
* <p>talk.religion.misc</p>
* <p>soc.religion.christian</p>


### Inspiration
* This dataset can be used to train and evaluate classifiers for text documents.



* The data is categorized into 20 categories and our job will be to predict the categories. A few of the categories are very closely related, as shown below:

![Screen%20Shot%202019-03-25%20at%2011.04.45%20AM.png](attachment:Screen%20Shot%202019-03-25%20at%2011.04.45%20AM.png)

# <font color='#3498DB'> Implemented Models </font>



 ___
 <a name="(1)-1D-convnet"></a>[<p style="font-size:1.5em;"><font color='darkblue'>(1)1D convnet</font></p>](#(1)-1D-convnet)

<a name="(2)-1D-convnet-using-Dropout"></a>[<p style="font-size:1.5em;"><font color='darkblue'>(2) 1D convnet using Dropout</font></p>](#(2)-1D-convnet-using-Dropout)

<a name="(3)-CNN+GRU"></a>[<p style="font-size:1.5em;"><font color='darkblue'>(3) CNN+GRU</font></p>](#(3)-CNN+GRU)

<a name="(4)-Bidirectional-GRU"></a>[<p style="font-size:1.5em;"><font color='darkblue'>(4) Bidirectional GRU</font></p>](#(4)-Bidirectional-GRU)

<a name="(5)-Glove-word-embedding"></a>[<p style="font-size:1.5em;"><font color='darkblue'>(5) Glove word embedding</font></p>](#(5)-Glove-word-embedding)

<a name="(6)-LSTM-Dropout"></a>[<p style="font-size:1.5em;"><font color='darkblue'>(6) LSTM Dropout</font></p>](#(6)-LSTM-Dropout)

<a name="(7)-LSTM"></a>[<p style="font-size:1.5em;"><font color='darkblue'> (7) LSTM</font></p>](#(7)-LSTM)

 <a name="(8)-Simple-RNN"></a>[<p style="font-size:1.5em;"><font color='darkblue'>(8) Simple RNN</font></p>](#(8)-Simple-RNN)
___


# <font color='#3498DB'> Metrics </font>


* accuracy: the proportion of samples for which the model predicts the correct label. It is reported below for both the training data and the held-out validation data. The ideal accuracy is 100%.
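As a minimal sketch of what the metric computes, the function below counts matching labels in plain Python. The name `accuracy` and the toy label lists are made up for illustration; they are not part of the notebook or of Keras.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Toy class indices, invented for this example (not taken from the dataset).
y_true = [0, 3, 3, 7, 19]
y_pred = [0, 3, 5, 7, 1]
toy_accuracy = accuracy(y_true, y_pred)  # 3 of the 5 predictions are correct
print(toy_accuracy)
```

Keras reports the same quantity per epoch when `metrics=['accuracy']` (or `'acc'`) is passed to `compile`.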


# Introducing Keras
* Keras is a deep learning and neural networks API by François Chollet that can run on top of TensorFlow (Google), Theano, or CNTK (Microsoft).

#### Here's how we will solve the classification problem:

* convert all text samples in the dataset into sequences of word indices. A "word index" is simply an integer ID for the word. We will only consider the top 20,000 most commonly occurring words in the dataset, and we will truncate the sequences to a maximum length of 1000 words.
* prepare an "embedding matrix" which will contain at index i the embedding vector for the word of index i in our word index.
* load this embedding matrix into a Keras Embedding layer, set to be frozen (its weights, the embedding vectors, will not be updated during training).
* build on top of it a 1D convolutional neural network, ending in a softmax output over our 20 categories.


# Import Data and basic data manipulation

## Import Needed libraries

In [30]:
import sys
import os
import numpy as np  # numerical arrays
import pandas as pd
from keras.models import Model, Sequential
from keras.preprocessing.text import Tokenizer  # tokenization
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras import utils as np_utils
from keras.layers import Input, Dense, Activation, Dropout, Embedding
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D
from keras.layers import SimpleRNN, GRU, LSTM  # recurrent layers for the later models
from sklearn.preprocessing import LabelBinarizer
import sklearn.datasets as skds
from pathlib import Path

### Get the working directory

In [8]:
import os
os.getcwd()

'/Users/priyavivekbhandarkar/Desktop/INSOFE/ALL MODULES INSOFE/HOT'

### Change the working directory

In [9]:
os.chdir(r'/Users/priyavivekbhandarkar/Desktop/INSOFE/ALL MODULES INSOFE/HOT/20news_18828')


In [31]:
Text_data = "/Users/priyavivekbhandarkar/Desktop/INSOFE/ALL MODULES INSOFE/HOT/20news_18828"

In [11]:
os.listdir("/Users/priyavivekbhandarkar/Desktop/INSOFE/ALL MODULES INSOFE/HOT/20news_18828")


['.DS_Store',
 'alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

* Apart from '.DS_Store' (a macOS metadata file), these 20 folder names are the labels.

# Preparing the text data

* First, we will simply iterate over the folders in which our text samples are stored, and format them into a list of samples. We will also prepare at the same time a list of class indices matching the samples:



In [32]:
texts = []
labels_index = {}
labels = []

for name in sorted(os.listdir(Text_data)):
    path = os.path.join(Text_data,name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path,fname)
                # Python 2's open() has no `encoding` argument
                open_kwargs = {} if sys.version_info < (3,) else {'encoding': 'latin-1'}
                with open(fpath, **open_kwargs) as f:
                    t = f.read()
                i = t.find('\n\n')  # first blank line marks the end of the header
                if 0 < i:
                    t = t[i:]  # keep only the message body
                texts.append(t)
                labels.append(label_id)
        print(labels_index)
        
        
print('Found %s texts.' % len(texts))


{'alt.atheism': 0}
{'alt.atheism': 0, 'comp.graphics': 1}
{'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2}
{'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3}
{'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4}
{'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x': 5}
{'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x': 5, 'misc.forsale': 6}
{'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x': 5, 'misc.forsale': 6, 'rec.autos': 7}
{'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x'

<p style="font-size:1.1em"> There are 18 828 newsgroup documents, distributed almost evenly across 20 different newsgroups. Our goal is to create a classifier that will classify each document based on its content. </p>

---

![Screen%20Shot%202019-03-25%20at%2011.20.20%20AM.png](attachment:Screen%20Shot%202019-03-25%20at%2011.20.20%20AM.png)

---
 <div style="background-color:#f2f2f2; text-align:justify; vertical-align: middle; padding:40px 0;">
<p> Then we can format our text samples and labels into tensors that can be fed into a neural network. To do this, we will rely on Keras utilities keras.preprocessing.text.Tokenizer and keras.preprocessing.sequence.pad_sequences.</p></div>

---


In [33]:
MAX_SEQUENCE_LENGTH = 1000
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
nb_epochs = 5



* Now we need to tokenize the data into a format that can be used by the word embeddings. Keras offers a couple of convenience methods for text preprocessing and sequence preprocessing which you can employ to prepare your text.


In [34]:
tokenizer = Tokenizer(num_words = MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

In [35]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 172668 unique tokens.


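* The corpus contains 172668 unique tokens. Only the most frequent ones, as limited by MAX_NUM_WORDS, are kept when the texts are converted to sequences.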
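To make the tokenization step concrete, here is a rough pure-Python sketch of the idea behind a frequency-ranked word index. It is not the Keras implementation: it only lowercases and splits on whitespace (the Keras Tokenizer also strips punctuation), and the alphabetical tie-breaking and the helper names are assumptions made for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer ID: 1 is the most frequent word, 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency; ties are broken alphabetically to keep the result deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their IDs and drop every word whose ID is num_words or larger."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

toy_texts = ["the cat sat", "the cat ate the fish"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=3)
print(toy_index)
print(toy_sequences)
```

With `num_words=3` only the two most frequent words survive, which mirrors how a vocabulary limit discards rare tokens while the full word index still lists every token.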

In [36]:
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

data

array([[  49,   81, 2534, ...,    4,  904, 3286],
       [ 263,  224,   31, ..., 5227,  578,  349],
       [   0,    0,    0, ...,    3,  322, 5676],
       ...,
       [   0,    0,    0, ...,   71,  200,  508],
       [   0,    0,    0, ..., 2045, 1860, 9514],
       [   0,    0,    0, ...,    3,    1, 2597]], dtype=int32)
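The leading zeros in the array above come from padding at the front of short documents. A minimal sketch of that behaviour, assuming the default "pre" padding and "pre" truncation of `pad_sequences`, could look like this; `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        tail = list(seq)[-maxlen:]                 # truncate from the front
        padding = [value] * (maxlen - len(tail))   # pad at the front
        padded.append(padding + tail)
    return padded

toy_padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)
```

Every row ends up with the same length, which is what allows the whole corpus to be stored as one 2D integer array.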

In [37]:
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (18828, 1000)
Shape of label tensor: (18828, 20)


* There are 18 828 newsgroup documents, and each label is a one-hot vector over the 20 newsgroups.
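The label tensor has 20 columns because each class index is expanded into a one-hot row. The sketch below shows the idea in plain Python; `one_hot` is an illustrative helper, not the Keras `to_categorical` function itself.

```python
def one_hot(class_ids, num_classes):
    """Turn integer class IDs into rows with a single 1.0 at the class position."""
    rows = []
    for class_id in class_ids:
        if not 0 <= class_id < num_classes:
            raise ValueError("class id out of range: %r" % class_id)
        row = [0.0] * num_classes
        row[class_id] = 1.0
        rows.append(row)
    return rows

toy_labels = one_hot([0, 2, 1], num_classes=3)
print(toy_labels)
```

This encoding is what the `categorical_crossentropy` loss used later in the notebook expects as targets.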

In [22]:
data.shape


(18828, 1000)

In [23]:
labels.shape

(18828, 20)

In [38]:
# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
print("Shape of x_train:", x_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of x_val:", x_val.shape)
print("Shape of y_val:", y_val.shape)

Shape of x_train: (15063, 1000)
Shape of y_train: (15063, 20)
Shape of x_val: (3765, 1000)
Shape of y_val: (3765, 20)


* The dataset has been split into a training part and a validation part, each containing messages from the different categories. There are 15063 samples in the training set and 3765 in the validation set, and each sample is labelled with the newsgroup it was posted to. The 20 categories cover computer science, baseball, medicine, religion, politics, sports, autos, space, for-sale ads, motorcycles, electronics, etc.
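The split above shuffles the sample indices once and then cuts off the last 20% as validation data. A small sketch of the same idea on toy data, using only the standard library, follows; the helper name, the fixed seed, and the toy documents are assumptions for illustration.

```python
import random

def shuffled_split(samples, labels, validation_fraction, seed=0):
    """Shuffle samples and labels together, then hold out the last part for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)           # seeded so the split is reproducible
    n_val = int(validation_fraction * len(samples))
    cut = len(indices) - n_val
    train_idx, val_idx = indices[:cut], indices[cut:]

    def pick(items, idx):
        return [items[i] for i in idx]

    return (pick(samples, train_idx), pick(labels, train_idx),
            pick(samples, val_idx), pick(labels, val_idx))

toy_samples = ["doc%d" % i for i in range(10)]
toy_ids = list(range(10))
x_tr, y_tr, x_va, y_va = shuffled_split(toy_samples, toy_ids, validation_fraction=0.2)
print(len(x_tr), len(x_va))
```

Applied to 18828 documents, `int(0.2 * 18828)` gives the 3765 validation samples reported above, leaving 15063 for training.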
 

## Preparing the Embedding layer


* Next, we compute an index mapping words to known embeddings, by parsing the data dump of pre-trained embeddings:



In [74]:
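# Assumption: GLOVE_FILE points to a GloVe text dump whose vectors have
# EMBEDDING_DIM (100) components. The file name below is a hypothetical
# location; adjust it to wherever the embeddings are stored.
GLOVE_FILE = 'glove.6B.100d.txt'

embeddings_index = {}
with open(GLOVE_FILE, encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]                                 # first field is the word
        coefs = np.asarray(values[1:], dtype='float32')  # remaining fields are the vector
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

* In the original run this cell opened the dataset directory (Text_data) instead of an embeddings file, which raised:

IsADirectoryError: [Errno 21] Is a directory: '/Users/priyavivekbhandarkar/Desktop/INSOFE/ALL MODULES INSOFE/HOT/20news_18828'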

* At this point we can use our embeddings_index dictionary and our word_index to compute the embedding matrix:

In [75]:
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
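The two cells above can be tried without downloading anything by feeding them a few lines in the same "word followed by floats" text format. The sketch below does this in plain Python with nested lists instead of NumPy arrays; the toy lines, the toy word index, and the helper names are invented for the example.

```python
def parse_embeddings(lines):
    """Parse lines of the form 'word v1 v2 ...' into a dict of word -> vector."""
    index = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        index[parts[0]] = [float(v) for v in parts[1:]]
    return index

def build_embedding_matrix(word_index, embeddings_index, dim):
    """Row i holds the vector of the word with ID i; unknown words and row 0 stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix

toy_lines = ["cat 0.1 0.2 0.3", "dog 0.4 0.5 0.6"]
toy_word_index = {"cat": 1, "fish": 2, "dog": 3}
toy_matrix = build_embedding_matrix(toy_word_index, parse_embeddings(toy_lines), dim=3)
print(toy_matrix)
```

The word "fish" has no pre-trained vector in the toy data, so its row stays at zero, exactly like out-of-vocabulary words in the real matrix.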

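* To use the pre-trained vectors, this matrix is passed to the Embedding layer through weights=[embedding_matrix], with trainable=False to prevent the weights from being updated during training. The layer defined later in this notebook omits both arguments, so its embeddings start from random values and are learned together with the rest of the model.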

### Keras Embedding Layer
* Keras offers an Embedding layer that can be used for neural networks on text data.
* It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.

* The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

* It is a flexible layer that can be used in a variety of ways, such as:

    * It can be used alone to learn a word embedding that can be saved and used in another model later.
    * It can be used as part of a deep learning model where the embedding is learned along with the model itself.
    * It can be used to load a pre-trained word embedding model, a type of transfer learning.


* The Embedding layer is defined as the first hidden layer of a network.

#### It must specify 3 arguments:

###### input_dim: 
This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
###### output_dim: 
This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
###### input_length:
This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.

In [97]:
from keras.layers import Embedding

embedding_layer = Embedding(len(word_index) + 1,
                           EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

* An Embedding layer should be fed sequences of integers, i.e. a 2D input of shape (samples, indices). These input sequences should be padded so that they all have the same length in a batch of input data (although an Embedding layer is capable of processing sequence of heterogenous length, if you don't pass an explicit input_length argument to the layer).

* All that the Embedding layer does is to map the integer inputs to the vectors found at the corresponding index in the embedding matrix, i.e. the sequence [1, 2] would be converted to [embeddings[1], embeddings[2]]. This means that the output of the Embedding layer will be a 3D tensor of shape (samples, sequence_length, embedding_dim).
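The lookup described above can be sketched in a few lines of plain Python. This only illustrates the indexing and the resulting 3D shape; the toy matrix and the `embed` helper are assumptions, and a real Embedding layer also holds the matrix as trainable weights.

```python
def embed(batch, embeddings):
    """Replace every integer ID in a batch of sequences by its embedding vector."""
    return [[embeddings[i] for i in sequence] for sequence in batch]

# Toy embedding matrix: 3 words (IDs 0..2), embedding_dim = 2.
toy_embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
toy_batch = [[1, 2], [0, 1]]  # 2 samples, sequence_length = 2

toy_output = embed(toy_batch, toy_embeddings)
toy_shape = (len(toy_output), len(toy_output[0]), len(toy_output[0][0]))
print(toy_output)
print(toy_shape)  # (samples, sequence_length, embedding_dim)
```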



# Training a 1D convnet


* The figure shows how such a convolution works. It starts by taking a patch of input features with the size of the filter kernel, then takes the dot product of that patch with the filter weights. Because the same filter slides over every position, a one-dimensional convnet can recognize a pattern wherever it occurs in the sequence (translation invariance). This can be helpful for certain patterns in the text:


![Screen%20Shot%202019-03-28%20at%2012.28.09%20PM.png](attachment:Screen%20Shot%202019-03-28%20at%2012.28.09%20PM.png)
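As a hedged illustration of the sliding dot product and of why position stops mattering after pooling, the sketch below applies one hand-written filter to two toy sequences that contain the same pattern at different positions. It assumes "valid" padding, stride 1, and a ReLU activation; all names and values are made up for the example.

```python
def conv1d_single_filter(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of vectors ('valid' padding, stride 1, ReLU)."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        patch = sequence[start:start + k]
        # Dot product between the patch and the filter weights.
        total = bias + sum(x * w
                           for vec, wvec in zip(patch, kernel)
                           for x, w in zip(vec, wvec))
        outputs.append(max(0.0, total))  # ReLU
    return outputs

def global_max_pool(values):
    """Keep only the strongest response, wherever it occurred."""
    return max(values)

kernel = [[1.0, 0.0], [0.0, 1.0]]    # kernel size 2, embedding dim 2
pattern = [[1.0, 0.0], [0.0, 1.0]]   # the two-step pattern this filter responds to
blank = [0.0, 0.0]
early = pattern + [blank, blank, blank]
late = [blank, blank, blank] + pattern

early_response = conv1d_single_filter(early, kernel)
late_response = conv1d_single_filter(late, kernel)
print(early_response, late_response)
print(global_max_pool(early_response), global_max_pool(late_response))
```

The convolution output moves with the pattern, while the pooled value is the same for both inputs; it is the combination of the two that gives the position-independent detection described above.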

* Now let’s look at how to use this network in Keras. Keras offers several convolutional layers; the one needed here is the Conv1D layer. Its most relevant parameters for now are the number of filters, the kernel size, and the activation function. The Conv1D layers sit between the Embedding layer and the pooling layers:

## (1) 1D convnet
* Build a small 1D convnet to solve our classification problem

---

In [98]:
from keras.layers import Flatten
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
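The pool size of 35 in the last pooling layer is not arbitrary: it equals the sequence length that remains after the preceding layers, so that layer pools over everything. The sketch below traces the length through the stack, assuming "valid" convolutions with stride 1 and non-overlapping pooling windows (the Keras defaults as far as I know); the helper names are illustrative.

```python
def conv_valid_length(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1

def pool_length(length, pool_size):
    """Output length of non-overlapping max pooling."""
    return length // pool_size

length = 1000  # MAX_SEQUENCE_LENGTH
trace = [length]
for kind, size in [("conv", 5), ("pool", 5), ("conv", 5), ("pool", 5), ("conv", 5)]:
    if kind == "conv":
        length = conv_valid_length(length, size)
    else:
        length = pool_length(length, size)
    trace.append(length)

print(trace)
```

The last entry of the trace is the number of time steps entering `MaxPooling1D(35)`, which therefore reduces the sequence to a single step before `Flatten`.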



In [99]:

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=10, batch_size=256)

Train on 15063 samples, validate on 3765 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1a4d2ca048>

<div class="alert alert-block alert-success">
<b>Best Accuracy :</b> This gives the best Accuracy i.e. 71.40%
</div>

---

* This model reaches 58% classification accuracy on the validation set after only 10 epochs. We could probably get to an even higher accuracy by training longer with some regularization mechanism (such as dropout).



## (2) 1D convnet using Dropout

---

* A simple and powerful regularization technique for neural networks and deep learning models is dropout.
* Dropout is a technique where randomly selected neurons are ignored during training; they are “dropped out” at random. Their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to them on the backward pass.
* Dropout can be applied to input neurons, called the visible layer.
* Below, a Dropout layer is added between the input (here, the embedded sequence) and the first hidden layer. The dropout rate is set to 20%, meaning that on average one in 5 inputs is randomly excluded from each update cycle.

* Additionally, as recommended in the original paper on Dropout, a constraint can be imposed on the weights of each hidden layer so that the maximum norm of the weights does not exceed a value of 3. In Keras this is done by setting the kernel_constraint argument on the Dense class when constructing the layers; the model below does not set it.

* Dropout can also be applied to hidden neurons in the body of the network.
    * In the model below, a second Dropout layer sits between the last hidden layer and the output layer, again with a dropout rate of 20%.

    * A dropout rate of 20%-50% is typical, with 20% providing a good starting point. A rate that is too low has minimal effect, and one that is too high results in under-learning by the network.
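To see what a 20% dropout rate does numerically, here is a small sketch of "inverted" dropout in plain Python: dropped units become zero during training and the survivors are scaled up so the expected sum stays the same, while inference leaves the values untouched. The function name, the seed, and the toy activations are assumptions for illustration, not Keras internals.

```python
import random

def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` during training; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training:
        return list(values)  # dropout is a no-op at inference time
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.2, training=True, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(round(zero_fraction, 3))
```

Roughly one value in five is zeroed, which is the "one in 5 inputs" reading of a 20% rate.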

In [101]:
from keras.optimizers import Adam
adam = Adam(lr=0.001)
model2= Sequential()
model2.add(Embedding(MAX_NUM_WORDS,100,input_length=MAX_SEQUENCE_LENGTH))
model2.add(Dropout(0.2))

model2.add(Conv1D(64,kernel_size=3,padding='same',activation='relu',strides=1))
model2.add(GlobalMaxPooling1D())

model2.add(Dense(128,activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(20,activation='softmax'))


model2.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

model2.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 1000, 100)         2000000   
_________________________________________________________________
dropout_14 (Dropout)         (None, 1000, 100)         0         
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 1000, 64)          19264     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 64)                0         
_________________________________________________________________
dense_11 (Dense)             (None, 128)               8320      
_________________________________________________________________
dropout_15 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_12 (Dense)             (None, 20)                2580      
Total para
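The per-layer parameter counts in the summary above can be checked by hand. The sketch below uses the usual formulas for embedding, 1D convolution, and dense layers (weights plus one bias per output unit); the helper names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim

def conv1d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size * in_channels weights plus one bias."""
    return kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    """Each unit has one weight per input plus one bias."""
    return inputs * units + units

layer_params = {
    "embedding": embedding_params(20000, 100),
    "conv1d": conv1d_params(3, 100, 64),
    "dense_hidden": dense_params(64, 128),
    "dense_output": dense_params(128, 20),
}
print(layer_params)
```

Almost all parameters sit in the embedding layer, which is why the vocabulary limit has such a large effect on model size.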

In [102]:
history2=model2.fit(x_train, y_train, validation_data=(x_val, y_val),epochs=1, batch_size=256, verbose=1)

Train on 15063 samples, validate on 3765 samples
Epoch 1/1


In [104]:
y_pred2=model2.predict_classes(x_val, verbose=1)




---

## (3) CNN+GRU

---

In [108]:
from keras.layers import Flatten, GRU
model3= Sequential()
model3.add(Embedding(MAX_NUM_WORDS,100,input_length=MAX_SEQUENCE_LENGTH))
model3.add(Conv1D(64,kernel_size=3,padding='same',activation='relu'))
model3.add(MaxPooling1D(pool_size=2))
model3.add(Dropout(0.25))
model3.add(GRU(128,return_sequences=True))
model3.add(Dropout(0.3))
model3.add(Flatten())
model3.add(Dense(128,activation='relu'))
model3.add(Dropout(0.5))
model3.add(Dense(20,activation='softmax'))
model3.compile(loss='categorical_crossentropy',optimizer=Adam(lr=0.001),metrics=['accuracy'])
model3.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 1000, 100)         2000000   
_________________________________________________________________
conv1d_10 (Conv1D)           (None, 1000, 64)          19264     
_________________________________________________________________
max_pooling1d_9 (MaxPooling1 (None, 500, 64)           0         
_________________________________________________________________
dropout_16 (Dropout)         (None, 500, 64)           0         
_________________________________________________________________
gru_6 (GRU)                  (None, 500, 128)          74112     
_________________________________________________________________
dropout_17 (Dropout)         (None, 500, 128)          0         
_________________________________________________________________
flatten_5 (Flatten)          (None, 64000)             0         
__________

In [109]:
history3=model3.fit(x_train, y_train, validation_data=(x_val, y_val),epochs=1, batch_size=256, verbose=1)

Train on 15063 samples, validate on 3765 samples
Epoch 1/1


In [112]:
y_pred3=model3.predict_classes(x_val, verbose=1)




In [116]:
# evaluate the model
scores1 = model3.evaluate(x_train, y_train)





In [117]:
print("\n%s: %.2f%%" % (model3.metrics_names[1], scores1[1]*100))


acc: 29.04%


In [113]:
# evaluate the model
scores = model3.evaluate(x_val, y_val)





In [114]:
print("\n%s: %.2f%%" % (model3.metrics_names[1], scores[1]*100))


acc: 25.15%


---

## (4) Bidirectional GRU

---

* A bidirectional GRU is a bidirectional recurrent neural network built from GRU cells, which use two gates (an update gate and a reset gate) rather than the three gates of an LSTM. Processing the sequence in both directions lets the model use information from both earlier and later time steps to make predictions about the current state.


In [119]:
from keras.layers import SpatialDropout1D,Bidirectional
model4 = Sequential()

model4.add(Embedding(MAX_NUM_WORDS,100,input_length=MAX_SEQUENCE_LENGTH))
model4.add(SpatialDropout1D(0.25))
model4.add(Bidirectional(GRU(128)))
model4.add(Dropout(0.5))

model4.add(Dense(20, activation='softmax'))# 20 no.of labels
model4.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model4.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 1000, 100)         2000000   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 1000, 100)         0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               175872    
_________________________________________________________________
dropout_19 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense_15 (Dense)             (None, 20)                5140      
Total params: 2,181,012
Trainable params: 2,181,012
Non-trainable params: 0
_________________________________________________________________
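The 175,872 parameters reported for the bidirectional layer above, and the 74,112 reported for the GRU in the CNN+GRU model, can be reproduced with a short calculation. The sketch assumes the GRU variant with a single bias vector per gate, which is the one consistent with these summaries; other GRU variants carry an extra bias and give slightly larger counts. The function names are illustrative.

```python
def gru_params(input_dim, units):
    """Three weight sets (update gate, reset gate, candidate state), each with
    input weights, recurrent weights, and one bias vector."""
    return 3 * (units * (input_dim + units) + units)

def bidirectional_params(params_one_direction):
    """A bidirectional wrapper trains one independent copy per direction."""
    return 2 * params_one_direction

cnn_gru_layer = gru_params(input_dim=64, units=128)
bidirectional_layer = bidirectional_params(gru_params(input_dim=100, units=128))
print(cnn_gru_layer, bidirectional_layer)
```

The two directions also explain the output shape of (None, 256): the forward and backward states of 128 units each are concatenated.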


In [None]:
history4=model4.fit(x_train, y_train, validation_data=(x_val, y_val),epochs=1, batch_size=256, verbose=1)

Train on 15063 samples, validate on 3765 samples
Epoch 1/1
  256/15063 [..............................] - ETA: 46:13 - loss: 2.9967 - acc: 0.0391

In [None]:
y_pred4=model4.predict_classes(x_val, verbose=1)  # x_val is the held-out split defined earlier

---

## (5) Glove word embedding

---

* GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.



In [None]:
model5 = Sequential()
# embedding_matrix has len(word_index) + 1 rows, so input_dim must match it
model5.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH,
                     weights=[embedding_matrix], trainable=True))
model5.add(SpatialDropout1D(0.25))
model5.add(Bidirectional(GRU(128,return_sequences=True)))
model5.add(Bidirectional(GRU(64,return_sequences=False)))
model5.add(Dropout(0.5))
model5.add(Dense(len(labels_index), activation='softmax'))  # one output unit per newsgroup
model5.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model5.summary()

In [None]:
# batch_size=256 is assumed here to match the other models in this notebook
history5=model5.fit(x_train, y_train, validation_data=(x_val, y_val),epochs=1, batch_size=256, verbose=1)

In [None]:
y_pred5=model5.predict_classes(x_val, verbose=1)


In [None]:
sub_all=pd.DataFrame({'model1':y_pred1,'model2':y_pred2,'model3':y_pred3,'model4':y_pred4,'model5':y_pred5})
pred_mode=sub_all.agg('mode',axis=1)[0].values
sub_all.head()
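The row-wise mode above is a majority vote across the models. A dependency-free sketch of the same idea is shown below; the tie-breaking rule (smallest label wins) is chosen to mirror taking the first column of a sorted mode result, and the toy predictions are invented for the example.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """For each sample, return the label predicted by most models (smallest label on ties)."""
    voted = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        voted.append(min(label for label, n in counts.items() if n == top))
    return voted

# Toy predictions from three hypothetical models on four samples.
model_a = [1, 2, 3, 7]
model_b = [1, 2, 0, 2]
model_c = [4, 5, 3, 5]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

On the last sample all three models disagree, so the vote falls back to the tie-breaking rule rather than to any notion of model confidence.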

---

## (6) LSTM Dropout

---

In [49]:
from keras.layers import Dense, SimpleRNN, GRU, LSTM, Embedding # Import layers from Keras
from keras.optimizers import Adam
adam = Adam(lr=0.001)
model1=Sequential()
model1.add(Embedding(MAX_NUM_WORDS, 100,mask_zero=True))
model1.add(LSTM(64,dropout=0.4, recurrent_dropout=0.4,return_sequences=True))
model1.add(LSTM(32,dropout=0.5, recurrent_dropout=0.5,return_sequences=False))
model1.add(Dense(20,activation='softmax'))
model1.compile(loss='categorical_crossentropy',optimizer=Adam(lr=0.001),metrics=['accuracy'])
model1.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 100)         2000000   
_________________________________________________________________
lstm_5 (LSTM)                (None, None, 64)          42240     
_________________________________________________________________
lstm_6 (LSTM)                (None, 32)                12416     
_________________________________________________________________
dense_2 (Dense)              (None, 20)                660       
Total params: 2,055,316
Trainable params: 2,055,316
Non-trainable params: 0
_________________________________________________________________
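The total of 2,055,316 parameters in the summary above can be rebuilt layer by layer. The sketch uses the standard LSTM count of four weight sets (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias; the helper names are made up for this example.

```python
def lstm_params(input_dim, units):
    """Four weight sets, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, units):
    """Each unit has one weight per input plus one bias."""
    return inputs * units + units

counts = [
    20000 * 100,            # embedding: MAX_NUM_WORDS x 100
    lstm_params(100, 64),   # first LSTM reads the 100-dimensional embeddings
    lstm_params(64, 32),    # second LSTM reads the 64-dimensional LSTM outputs
    dense_params(32, 20),   # softmax over the 20 newsgroups
]
total = sum(counts)
print(counts, total)
```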


In [None]:
history1=model1.fit(x_train, y_train, validation_data=(x_val, y_val),epochs=2, batch_size=256, verbose=1)


In [None]:
y_pred1=model1.predict_classes(x_val,verbose=1)  # x_val is the held-out split defined earlier


---

## (7) LSTM

---

In [89]:
from keras.layers import Dense, SimpleRNN, GRU, LSTM, Embedding # Import layers from Keras

model = Sequential() # Call Sequential to initialize a network
model.add(Embedding(input_dim = MAX_NUM_WORDS, 
                    input_length = MAX_SEQUENCE_LENGTH, 
                    output_dim = EMBEDDING_DIM)) # Add an embedding layer which represents each unique token as a vector
model.add(LSTM(20, return_sequences=True)) # Add an LSTM layer
model.add(LSTM(12, return_sequences=False))
model.add(Dense(20, activation='softmax'))

In [81]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 1000, 100)         2000000   
_________________________________________________________________
lstm_13 (LSTM)               (None, 1000, 10)          4440      
_________________________________________________________________
lstm_14 (LSTM)               (None, 5)                 320       
_________________________________________________________________
dense_11 (Dense)             (None, 3)                 18        
Total params: 2,004,778
Trainable params: 2,004,778
Non-trainable params: 0
_________________________________________________________________

* Note: this summary does not match the layers defined in the cell above (LSTM(20), LSTM(12), Dense(20)). Its output shapes correspond to layer sizes of 10, 5 and 3, so it appears to come from an earlier run of the notebook.


In [82]:
from keras.optimizers import Adam
adam = Adam(lr=0.001)


In [91]:
# Mention the optimizer, Loss function and metrics to be computed
model.compile(optimizer=adam,                  # 'Adam' is a variant of gradient descent technique
              loss='categorical_crossentropy', # categorical_crossentropy for multi-class classification
              metrics=['accuracy'])            # These metrics are computed for evaluating and stored in history

model.fit(x_train, y_train,batch_size=256, epochs=2, validation_split=VALIDATION_SPLIT)

Train on 12050 samples, validate on 3013 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1a5de09198>

In [101]:
test_prob = model.predict(x_val)
test_prob.shape

(3765, 20)

In [102]:
test_prob[:7]

array([[1.29856572e-01, 1.86811071e-02, 3.08300648e-03, 1.98816061e-02,
        2.30636060e-01, 1.20614633e-01, 1.58552383e-03, 2.67479233e-02,
        1.20398840e-02, 2.33693095e-03, 5.79522252e-02, 3.61840380e-03,
        1.60058849e-02, 5.43684233e-03, 1.86368257e-01, 4.86588862e-04,
        1.34592667e-01, 5.53079555e-03, 7.87566602e-03, 1.66693609e-02],
       [1.75807421e-04, 6.03228982e-04, 3.87173386e-05, 8.80331936e-05,
        1.20886223e-04, 2.79254187e-03, 2.00114271e-04, 1.16952381e-03,
        9.48259830e-01, 1.39150536e-02, 1.31274888e-03, 8.21494905e-04,
        1.28773320e-02, 6.06025569e-05, 2.05680481e-04, 4.23892488e-04,
        2.05981312e-03, 2.01690826e-03, 1.27161928e-02, 1.41660508e-04],
       [7.11657479e-03, 7.00024329e-03, 5.51445894e-02, 1.17050642e-02,
        7.15339149e-04, 2.62160506e-03, 1.31754335e-02, 1.58295110e-01,
        3.74867418e-03, 5.86529002e-02, 6.66653691e-03, 2.05280297e-02,
        3.93627398e-03, 1.72371775e-01, 2.90198214e-02, 1.7586

In [103]:
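# predict_classes exists only in older Keras releases; for a softmax output it is equivalent to the argmax in the next cell
test_classes = model.predict_classes(x_val)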
test_classes.shape

(3765,)

In [104]:
test_classes = np.argmax(test_prob, axis=1)
test_classes.shape

(3765,)

In [105]:
test_classes[:11]

array([ 4,  8, 19, 15,  8,  3, 15,  7, 14, 14, 17])
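
Taking the argmax over the class axis turns each row of probabilities into a predicted class index, which can then be compared with the true labels. Below is a minimal NumPy sketch that uses a small made-up probability matrix and made-up one-hot labels in place of `test_prob` and `y_val`. It assumes the labels are one-hot encoded, as `categorical_crossentropy` expects.

```python
import numpy as np

# Hypothetical stand-ins for test_prob and y_val: 4 samples, 3 classes
toy_prob = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])
toy_true_onehot = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
])

# Index of the most probable class for each sample
pred_classes = np.argmax(toy_prob, axis=1)
# Recover integer labels from the one-hot rows
true_classes = np.argmax(toy_true_onehot, axis=1)
# Fraction of samples where prediction and label agree
accuracy = np.mean(pred_classes == true_classes)

print(pred_classes, accuracy)
```

The same two `argmax` calls should apply unchanged to `test_prob` and a one-hot `y_val` of matching length.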

---

## (8) Simple RNN

---

In [93]:
model = Sequential() # Call Sequential to initialize a network
model.add(Embedding(input_dim = MAX_NUM_WORDS, 
                    input_length = MAX_SEQUENCE_LENGTH, 
                    output_dim = EMBEDDING_DIM)) # Add an embedding layer which represents each unique token as a vector
model.add(SimpleRNN(40, return_sequences=True)) 
model.add(SimpleRNN(25, return_sequences=False))
model.add(Dense(20, activation='softmax')) # Add an output layer: 20 nodes, one per class.

In [99]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 1000, 100)         2000000   
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 1000, 40)          5640      
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 25)                1650      
_________________________________________________________________
dense_15 (Dense)             (None, 20)                520       
Total params: 2,007,810
Trainable params: 2,007,810
Non-trainable params: 0
_________________________________________________________________
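
The counts in this summary can be reproduced by hand. A SimpleRNN layer has a single set of input weights, recurrent weights and biases, so it holds `units * (input_dim + units + 1)` parameters. The helper names in this sketch are made up for illustration, and the vocabulary size (20000) and embedding size (100) are read off the embedding row above.

```python
def simple_rnn_params(input_dim, units):
    # Input kernel + recurrent kernel + bias
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

layer_counts = {
    "embedding": 20000 * 100,
    "simple_rnn_1": simple_rnn_params(100, 40),
    "simple_rnn_2": simple_rnn_params(40, 25),
    "dense": dense_params(25, 20),
}
total_params = sum(layer_counts.values())

print(layer_counts, total_params)
```

Almost all of the parameters sit in the embedding layer, so the recurrent layers themselves are small by comparison.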


In [94]:
# Specify the optimizer, loss function and metrics to be computed
model.compile(optimizer=adam,                  # Adam is an adaptive variant of stochastic gradient descent
              loss='categorical_crossentropy', # categorical_crossentropy for multi-class classification with one-hot labels
              metrics=['accuracy'])            # Metrics are computed during training and evaluation and stored in the History object

model.fit(x_train, y_train, epochs=5, batch_size=256, validation_split=0.25)

Train on 11297 samples, validate on 3766 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1a7ba05710>

<div class="alert alert-block alert-success">
<b>Best Accuracy:</b> Of all the models, Simple RNN gives the best accuracy, i.e. 97.72%
</div>

# Conclusion

1) The 1D convnet gives better accuracy on both the training and validation data.

2) Simple RNN reaches higher accuracy on the training data than the other models, but its accuracy on the validation data is low, which suggests overfitting.


# Future Scope

1) Increase the number of epochs for all models to check whether accuracy improves.
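
When training for more epochs, it helps to look at the per-epoch validation scores rather than only the final number. The sketch below assumes a dictionary shaped like the `history` attribute of the object returned by `model.fit` (lists of per-epoch values). The numbers are made up, the helper `best_epoch` is hypothetical, and the metric keys are `acc`/`val_acc` in older Keras releases and `accuracy`/`val_accuracy` in newer ones.

```python
def best_epoch(history_dict, train_key="accuracy", val_key="val_accuracy"):
    """Return (1-based epoch, validation score, train-minus-validation gap)
    for the epoch with the highest validation score."""
    val_scores = history_dict[val_key]
    best_idx = max(range(len(val_scores)), key=lambda i: val_scores[i])
    gap = history_dict[train_key][best_idx] - val_scores[best_idx]
    return best_idx + 1, val_scores[best_idx], gap

# Made-up values standing in for history.history
toy_history = {
    "accuracy": [0.40, 0.65, 0.80, 0.92, 0.97],
    "val_accuracy": [0.38, 0.55, 0.61, 0.60, 0.58],
}

epoch, val_acc, gap = best_epoch(toy_history)
print(epoch, val_acc, round(gap, 2))
```

In this toy run the training accuracy keeps rising while the validation accuracy peaks early, which is the pattern described for Simple RNN in the conclusion.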

---

<p style="font-size:1em; color:darkblue; ">///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////</p>


In [222]:
y_pred = model.predict(X_test)

In [225]:
y_pred = np.argmax(y_pred, axis=1)

In [89]:
num_words  = min(MAX_NUM_WORDS , len(word_index))

num_words

20000

In [90]:
word_index

{'the': 1,
 'to': 2,
 'of': 3,
 'a': 4,
 'and': 5,
 'in': 6,
 'i': 7,
 'is': 8,
 'that': 9,
 "'ax": 10,
 'it': 11,
 'for': 12,
 'you': 13,
 'this': 14,
 'on': 15,
 'be': 16,
 'not': 17,
 'have': 18,
 'are': 19,
 'with': 20,
 'as': 21,
 'or': 22,
 'if': 23,
 'but': 24,
 'was': 25,
 'edu': 26,
 'they': 27,
 '1': 28,
 'from': 29,
 'by': 30,
 'at': 31,
 'an': 32,
 '2': 33,
 'my': 34,
 'can': 35,
 'what': 36,
 'all': 37,
 'would': 38,
 'there': 39,
 'will': 40,
 'one': 41,
 '0': 42,
 'do': 43,
 'about': 44,
 '3': 45,
 'writes': 46,
 'we': 47,
 'so': 48,
 'he': 49,
 'com': 50,
 'has': 51,
 'no': 52,
 'your': 53,
 'm': 54,
 'article': 55,
 'any': 56,
 'x': 57,
 'me': 58,
 'some': 59,
 'who': 60,
 "'": 61,
 'which': 62,
 'out': 63,
 'like': 64,
 "don't": 65,
 'more': 66,
 'when': 67,
 'just': 68,
 'people': 69,
 'were': 70,
 'their': 71,
 '4': 72,
 'up': 73,
 'know': 74,
 'other': 75,
 '5': 76,
 'get': 77,
 'only': 78,
 'them': 79,
 'how': 80,
 'had': 81,
 'than': 82,
 'been': 83,
 'his': 84,


In [92]:
embeddings_matrix = np.zeros((num_words , EMBEDDING_DIM))

embeddings_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [93]:
num_words = min(MAX_NUM_WORDS, len(word_index))

embeddings_matrix = np.zeros((num_words , EMBEDDING_DIM))
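
A zero matrix like this is usually filled with pretrained word vectors before it is passed to the `Embedding` layer. The sketch below assumes a dictionary mapping words to vectors (for example, one loaded from a pretrained embeddings file), which is not shown in this section. All `toy_` names and values are made up. Because the tokenizer's indices start at 1, this version allocates `len(word_index) + 1` rows when the vocabulary is smaller than the cap and skips any index that falls outside the matrix, which differs slightly from the cell above.

```python
import numpy as np

TOY_MAX_NUM_WORDS = 5   # toy vocabulary cap
TOY_EMBEDDING_DIM = 3   # toy vector size

# Hypothetical tokenizer output: word -> index, starting at 1
toy_word_index = {"the": 1, "to": 2, "of": 3, "edu": 4, "writes": 5, "article": 6}

# Hypothetical pretrained vectors: word -> vector
toy_embeddings_index = {
    "the": np.array([0.1, 0.2, 0.3]),
    "of": np.array([0.4, 0.5, 0.6]),
    "article": np.array([0.7, 0.8, 0.9]),
}

# Row 0 is reserved (index 0 is never assigned to a word), hence the + 1
toy_num_words = min(TOY_MAX_NUM_WORDS, len(toy_word_index) + 1)
toy_matrix = np.zeros((toy_num_words, TOY_EMBEDDING_DIM))

for word, i in toy_word_index.items():
    if i >= toy_num_words:
        continue  # index falls outside the capped vocabulary
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector  # words without a pretrained vector stay all-zero

print(toy_matrix)
```

Rows for words missing from the pretrained vectors stay at zero, and the word with index 6 is dropped because it lies beyond the cap.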

<p style="font-size:1em; color:darkblue; ">///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////</p>