## Building a Neural Network for Text Classification

Now that you know how a Neural Network works and have a basic idea of how to work with keras, this notebook will show you how to implement a NN for Text Classification on a real dataset.

Let's get started without further ado!

### Table of Contents

1. About the Dataset
2. Data Preprocessing for Neural Network
     - Label Encoding 
     - Converting text into sequence of tokens
     - Padding 
3. Building a Neural Network model
4. Evaluate the model 
5. Conclusion

<img src="https://farm2.staticflickr.com/1834/42271822770_6d2a1d533f_b.jpg" width=700>

### 1. About the Dataset

The dataset that you are going to use is a collection of news articles from BBC across 5 major categories, namely:
 
 - Business
 - Entertainment
 - Politics
 - Sport
 - Tech

There are a total of 2225 articles in the dataset, which is a mix of all of the above categories. Let's load the dataset using pandas and have a quick look at some of the articles. 

**Note:** You can get the dataset [here](https://trainings.analyticsvidhya.com/asset-v1:AnalyticsVidhya+LP_DL_2019+2019_T1+type@asset+block@bbc_news_mixed.csv)


In [1]:
# importing libraries
import pandas as pd
import numpy as np

# Load the dataset
bbc_news = pd.read_csv('../datasets/bbc_news_mixed.csv')
bbc_news.head()

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | Cairn shares slump on oil setback\n\nShares in... | business |
| 1 | Egypt to sell off state-owned bank\n\nThe Egyp... | business |
| 2 | Cairn shares up on new oil find\n\nShares in C... | business |
| 3 | Low-cost airlines hit Eurotunnel\n\nChannel Tu... | business |
| 4 | Parmalat to return to stockmarket\n\nParmalat,... | business |


### 2. Data Preprocessing for Neural Network

Before you can use text data in a Neural Network, you need to preprocess the data and convert it into a format that works well with the NN. Here are the major preprocessing steps that you will be doing:

a. Label encoding the target variable "label" and converting it into categorical

b. Converting the input text to sequence of tokens

c. Padding the sequences to make uniform length

Let's start with the first one.

#### a. Label encoding the target variable "label" and converting it into categorical

In [2]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# initialize label encoder
lencod = LabelEncoder()
# encode the text labels into numbers
bbc_news.label = lencod.fit_transform(bbc_news.label)
# convert labels to categorical form
y = to_categorical(bbc_news.label)

print(y)

Using TensorFlow backend.


[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1.]]
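
To make the two steps above concrete, here is a minimal pure-Python sketch of what label encoding followed by one-hot conversion does. It assumes that classes are numbered in sorted order, which is how `LabelEncoder` behaves; the `toy_labels` list is made up for illustration, and this is a sketch of the idea rather than the libraries' implementation.

```python
# Toy labels standing in for the "label" column (illustrative values only).
toy_labels = ["sport", "business", "tech", "business", "politics", "entertainment"]

# Label encoding: number the sorted unique classes 0, 1, 2, ...
toy_classes = sorted(set(toy_labels))
class_to_index = {name: i for i, name in enumerate(toy_classes)}
toy_encoded = [class_to_index[name] for name in toy_labels]

# One-hot ("categorical") form: a row of zeros with a single 1 at the class index.
toy_one_hot = [
    [1.0 if i == code else 0.0 for i in range(len(toy_classes))]
    for code in toy_encoded
]

# Decoding goes the other way: the position of the 1 gives back the class name.
toy_decoded = [toy_classes[row.index(max(row))] for row in toy_one_hot]

print(toy_classes)     # ['business', 'entertainment', 'politics', 'sport', 'tech']
print(toy_encoded)     # [3, 0, 4, 0, 2, 1]
print(toy_one_hot[0])  # [0.0, 0.0, 0.0, 1.0, 0.0]
```

With sorted class order, "business" maps to index 0 and "tech" to index 4, which is consistent with the first and last rows of the printed `y` above.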


#### b. Converting the input text to sequence of tokens

 - Because of the way NNs work, you need to convert each text into a sequence of tokens.
 - But before that, let's split the dataset into training and test sets.
 - Then we will tokenize the text, map each token to an integer index and represent each text as a sequence of those indices.


In [3]:
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(bbc_news['text'], y, test_size=0.2, random_state=42)
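# combine both splits (Series.append was removed in pandas 2.0; pd.concat works across versions)
total_X = pd.concat([X_train, X_test])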

# tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(total_X)

# calculate maximum length of sequence and vocab size
max_len = total_X.str.split().str.len().max()
vocab_size = len(tokenizer.word_index)+1

# convert text as sequence of tokens
X_train_tokens = tokenizer.texts_to_sequences(X_train)
X_test_tokens = tokenizer.texts_to_sequences(X_test)
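
If you are curious what the tokenizer is doing, the following pure-Python sketch mimics the idea on two toy headlines. It is a simplification under stated assumptions, not the keras implementation: the helper names (`simple_tokenize`, `fit_word_index`, `to_sequences`) are hypothetical, all punctuation is replaced by spaces, more frequent words get smaller indices, and index 0 is left free for padding (which is why the vocabulary size above is `len(tokenizer.word_index)+1`).

```python
from collections import Counter
import string

# Map every punctuation character to a space (a simplification of the real filters).
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()

def fit_word_index(texts):
    """Assign indices 1, 2, 3, ... to words, most frequent first."""
    counts = Counter(token for text in texts for token in simple_tokenize(text))
    # sorted() is stable, so words with equal counts keep their first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[t] for t in simple_tokenize(text) if t in word_index]
            for text in texts]

toy_texts = ["Shares slump on oil setback", "Shares up on new oil find"]
toy_word_index = fit_word_index(toy_texts)
toy_vocab_size = len(toy_word_index) + 1  # +1 because index 0 is reserved for padding

toy_sequences = to_sequences(toy_texts, toy_word_index)
unseen_sequence = to_sequences(["Oil prices slump"], toy_word_index)

print(toy_sequences)    # [[1, 4, 2, 3, 5], [1, 6, 2, 7, 3, 8]]
print(unseen_sequence)  # [[3, 4]]  ("prices" was never seen, so it is dropped)
print(toy_vocab_size)   # 9
```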

#### c. Padding the sequences to make uniform length

In [4]:
from keras.preprocessing.sequence import pad_sequences

# pad train and test's text sequences to make them all of uniform length
X_train_pad = pad_sequences(X_train_tokens, maxlen=max_len, padding='post')
X_test_pad = pad_sequences(X_test_tokens, maxlen=max_len, padding='post')

print(X_train_pad.shape)

(1780, 4432)
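
The effect of `padding='post'` can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the keras function: `pad_post` is a hypothetical helper, and it assumes the default truncation behaviour of `pad_sequences` (sequences longer than `maxlen` lose tokens from the beginning).

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` at the end so all have length `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # too long: keep only the last `maxlen` tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy_token_sequences = [[1, 4, 2], [1, 6, 2, 7, 3, 8]]
toy_padded = pad_post(toy_token_sequences, maxlen=5)

print(toy_padded)  # [[1, 4, 2, 0, 0], [6, 2, 7, 3, 8]]
```

In the notebook no article is truncated, because `max_len` is computed from the longest article. The shape `(1780, 4432)` means 1780 training articles (80% of 2225), each padded to 4432 tokens.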


### 3. Building a Neural Network model

 - Now that your data is properly preprocessed, it's time to build and train a Neural Network.
 - You will be using the [Embedding layer](https://keras.io/layers/embeddings/) of keras to learn a 100-dimensional embedding for each token in a sequence.
 - Let's learn by doing!

In [5]:
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten

# embedding size
EMBEDDING_SIZE = 100
# size of the second hidden layer: vocabulary size divided by 100 (integer division)
vocab_100 = vocab_size // 100

# initialize a sequential model
model = Sequential()
model.add(Embedding(vocab_size, EMBEDDING_SIZE, input_length=max_len))
model.add(Dense(500, activation='relu'))
model.add(Dense(vocab_100, activation='relu'))
model.add(Flatten())
# add the final layer that will classify into 5 classes
model.add(Dense(5, activation='softmax'))

# compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Your model is ready, here are some things to note:

 - You need to compile the model before you can use it. This is done in the last line above, where you also specify the loss function, the optimizer and the metric used to measure the performance of the model.
 - You can read more about [Sequential model](https://keras.io/getting-started/sequential-model-guide/), [Dense](https://keras.io/layers/core/) and [Flatten](https://keras.io/layers/core/) layers.
 
Now that the model is compiled, let's have a look at its summary.

In [6]:
# check model summary
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 4432, 100)         3236000   
_________________________________________________________________
dense_1 (Dense)              (None, 4432, 500)         50500     
_________________________________________________________________
dense_2 (Dense)              (None, 4432, 323)         161823    
_________________________________________________________________
flatten_1 (Flatten)          (None, 1431536)           0         
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 7157685   
Total params: 10,606,008
Trainable params: 10,606,008
Non-trainable params: 0
_________________________________________________________________
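
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic using only numbers read off the summary; the vocabulary size is not printed anywhere in the notebook, so it is inferred here as 3,236,000 / 100, and the variable names are chosen for this example only.

```python
# Values read off the summary above.
seq_len = 4432                            # padded sequence length (max_len)
embedding_dim = 100                       # EMBEDDING_SIZE
vocab_inferred = 3236000 // embedding_dim # inferred vocabulary size
hidden_1 = 500                            # units in the first Dense layer
hidden_2 = vocab_inferred // 100          # units in the second Dense layer (vocab_100)
n_classes = 5

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_inferred * embedding_dim
# Dense layers: inputs * units weights, plus one bias per unit.
dense_1_params = embedding_dim * hidden_1 + hidden_1
dense_2_params = hidden_1 * hidden_2 + hidden_2
# Flatten has no parameters; it only reshapes (seq_len, hidden_2) into one vector.
flatten_units = seq_len * hidden_2
dense_3_params = flatten_units * n_classes + n_classes

total_params = embedding_params + dense_1_params + dense_2_params + dense_3_params

print(vocab_inferred, hidden_2)  # 32360 323
print(flatten_units)             # 1431536
print(total_params)              # 10606008
```

Note that the first two Dense layers are applied to each of the 4432 token positions separately (they act on the last axis), so their parameter counts do not depend on the sequence length. Most of the parameters sit in the final Dense layer, because Flatten hands it more than 1.4 million inputs.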


### 4. Evaluate the model

 - Now that the model is compiled and ready, you can start training the model.
 - You also need a way to evaluate the model; this will be done using the test set that we created earlier.
 - In keras, you can do all of this in a single line.

In [7]:
# train and evaluate the model
model.fit(X_train_pad, y_train, epochs=11, validation_data=(X_test_pad, y_test))

Train on 1780 samples, validate on 445 samples
Epoch 1/11
Epoch 2/11
Epoch 3/11
Epoch 4/11
Epoch 5/11
Epoch 6/11
Epoch 7/11
Epoch 8/11
Epoch 9/11
Epoch 10/11
Epoch 11/11


<keras.callbacks.History at 0x7fcd7b4b16d8>
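
To see what the accuracy metric measures for one-hot labels, here is a small pure-Python sketch: a prediction counts as correct when the class with the highest predicted probability matches the position of the 1 in the true label. The function name `categorical_accuracy_sketch` and all the numbers below are made up for illustration; they are not outputs of the model trained above.

```python
def categorical_accuracy_sketch(y_true_rows, y_prob_rows):
    """Fraction of rows where the argmax of the prediction equals the true class."""
    correct = 0
    for true_row, prob_row in zip(y_true_rows, y_prob_rows):
        true_class = true_row.index(max(true_row))
        predicted_class = prob_row.index(max(prob_row))
        if true_class == predicted_class:
            correct += 1
    return correct / len(y_true_rows)

# Toy one-hot labels for 4 articles over the 5 categories.
toy_y_true = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]
# Toy softmax outputs (each row sums to 1); the last row is a wrong prediction.
toy_y_prob = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.20, 0.50, 0.10],
    [0.30, 0.10, 0.10, 0.10, 0.40],
    [0.20, 0.30, 0.40, 0.05, 0.05],
]

toy_accuracy = categorical_accuracy_sketch(toy_y_true, toy_y_prob)
print(toy_accuracy)  # 0.75
```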

### 5. Conclusion

 - The "val_acc" value reported during training is the accuracy of the model on the validation data (here, the test set), not on the training set. This value gives us an indication of how well our model generalizes to unseen data.
 - Even with a simple network, the accuracy is around 93-94%, which is a really good result.
 - Notice how easy keras makes it to train a Neural Network by simply stacking layers on top of each other (Sequential).
 - You can explore this further in [the keras sequential model guide](https://keras.io/getting-started/sequential-model-guide/).