# 11 CNN for text classification on Keras
In this notebook, we implement a convolutional neural network (CNN) with Keras to classify the sentiment of IMDB movie reviews.

## Agenda

1. Data Preprocessing

2. CNN for text classification



## 1. Data Preprocessing

- Here, we load the IMDB review corpus.
- We preprocess the corpus and use two of its columns: sentiment (the label) and review (the input text).

In [1]:
import numpy as np
import pandas as pd

In [20]:
## Import packages
from keras.models import Model
from keras.layers import Input, Dense, Embedding, LSTM, Activation, Flatten
from keras.layers import Conv1D, GlobalMaxPooling1D, Dropout, Concatenate
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

In [3]:
import pandas as pd
# quoting=3 is csv.QUOTE_NONE, so double quotes in the reviews are kept as plain text
train = pd.read_csv("../BT5153_data/labeledTrainData.tsv", header=0,
                    delimiter="\t", quoting=3)

In [4]:
train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |


In [5]:
from bs4 import BeautifulSoup  

In [6]:
def review_to_words(raw_review):
    # Convert a raw review to a string of words.
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review).
    #
    # 1. Remove HTML. The parser is named explicitly so that BeautifulSoup
    #    does not warn about guessing one.
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()

    # Optional step, disabled here: keep letters only (needs `import re`)
    # letters_only = re.sub("[^a-zA-Z]", " ", review_text)

    # 2. Convert to lower case and split into individual words
    words = review_text.lower().split()

    # 3. Join the words back into one space-separated string
    return " ".join(words)

In [7]:
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

# Clean every review with our function and collect the results in a list
clean_train_reviews = [review_to_words(review) for review in train["review"]]

In [8]:
# check the review sentence
clean_train_reviews[0]

'"with all this stuff going down at the moment with mj i\'ve started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mj\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.the actual feature film bit when it finally starts is only on for 2

## 2. CNN

Here, a CNN is used for sentiment analysis: given a review, the model predicts whether its label is positive or negative.

In [9]:
## For demo purposes, the first 1000 samples are used as training data
## In addition, the next 100 samples are used as the testing/validation set
train_reviews = clean_train_reviews[:1000]
test_reviews = clean_train_reviews[1000:1100]
all_labels = train["sentiment"].tolist()

In [10]:
train_labels = all_labels[:1000]
test_labels = all_labels[1000:1100]

In [11]:
vocab_size =  8000
tk = Tokenizer(num_words=vocab_size)  ## Maximum number of words to keep: only the 7999 (vocab_size - 1) most common words are kept
tk.fit_on_texts(train_reviews)

In [12]:
# Convert string to index
train_sequences = tk.texts_to_sequences(train_reviews)
test_texts = tk.texts_to_sequences(test_reviews)
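
To make the effect of `num_words` concrete, here is a simplified pure-Python sketch of what the tokenizer does conceptually. The helpers `fit_word_index` and `texts_to_sequences` are hypothetical names written for this illustration, not the Keras implementation; they ignore Keras's punctuation filtering and simply split on whitespace. Words are ranked by frequency starting at index 1 (index 0 is left free for padding), and any word that is unknown or whose index is `num_words` or larger is dropped.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split()
               if w in word_index and word_index[w] < num_words]
        sequences.append(seq)
    return sequences

# Toy corpus (made up for this sketch)
toy_texts = ["good movie good plot", "bad movie"]
toy_index = fit_word_index(toy_texts)

# With num_words=3 only indices 1 and 2 survive, i.e. the 2 most common words
toy_sequences = texts_to_sequences(["good bad unseen plot"], toy_index, num_words=3)
print(toy_index)
print(toy_sequences)
```

In this toy run only "good" survives in the converted review: "bad" and "plot" are too rare for the chosen `num_words`, and "unseen" never appeared in the fitted corpus.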

- Since reviews have different lengths, we pad them so that all reviews have the same length.
- The unified length is set to the maximum length of all documents in the training corpus.

In [13]:
sequence_length = max([len(ele) for ele in train_sequences]) 
# Padding
train_data = pad_sequences(train_sequences, maxlen=sequence_length, padding='post')
test_data = pad_sequences(test_texts, maxlen=sequence_length, padding='post')
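
As a rough illustration of the padding step, the sketch below mimics `padding='post'` in plain Python. The helper `pad_post` is a hypothetical name for this example only. It also shows one consequence of the cell above: a test review longer than the longest training review gets truncated, and because `truncating` is left at its Keras default of `'pre'`, tokens are dropped from the start of the review.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end ('post'); truncate from the front when too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Toy index sequences of length 2, 4 and 3, unified to length 3
toy_padded = pad_post([[5, 2], [7, 1, 4, 9], [3, 8, 6]], maxlen=3)
print(toy_padded)  # [[5, 2, 0], [1, 4, 9], [3, 8, 6]]
```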

In [14]:
# Convert to numpy array
train_data = np.array(train_data, dtype='float32')
test_data = np.array(test_data, dtype='float32')
train_classes = np.array(train_labels, dtype='int')
test_classes = np.array(test_labels, dtype='int')

#####   Model API
Keras provides a Model class for building a model from layers you have already created and connected. You only need to specify the input and output layers.

https://machinelearningmastery.com/keras-functional-api-deep-learning/

#####   CNN Framework

1. This is a CNN for sentence classification.

2. The filter sizes are 2, 3, and 4.

3. In the implementation below, each filter size has 30 filters, and the embedding size is 20.

<img src="cnn.png" alt="cnn"
	title="cnn pic" width="600" height="200" />
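
The architecture described above can be traced for a single review with a small NumPy sketch. This is only an illustration under stated assumptions: the weights are random rather than trained, and `seq_len = 10` is a toy value instead of the real `sequence_length`. It shows that a "valid" convolution with window size `sz` yields `seq_len - sz + 1` positions, and that global max pooling reduces each filter to one number, giving a 3 x 30 = 90 dimensional sentence vector.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embedding_dim, num_filters = 10, 20, 30       # seq_len is a toy value
embedded = rng.normal(size=(seq_len, embedding_dim))   # one embedded review

features = []
conv_lengths = []
for sz in (2, 3, 4):
    # One weight tensor per filter size: (window, embedding_dim, num_filters)
    kernel = rng.normal(size=(sz, embedding_dim, num_filters))
    bias = np.zeros(num_filters)
    # "valid" convolution with stride 1: seq_len - sz + 1 window positions
    windows = np.stack([embedded[i:i + sz] for i in range(seq_len - sz + 1)])
    conv = np.tensordot(windows, kernel, axes=([1, 2], [0, 1])) + bias
    conv = np.maximum(conv, 0.0)           # ReLU activation
    conv_lengths.append(conv.shape[0])
    features.append(conv.max(axis=0))      # global max pooling over positions

sentence_vector = np.concatenate(features)
print(conv_lengths)            # [9, 8, 7]
print(sentence_vector.shape)   # (90,)
```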

In [24]:
embedding_dim = 20    # The size of embeddings is 20
input_shape = (sequence_length,)
model_input = Input(shape=input_shape)
# Embedding Layer
z = Embedding(vocab_size, embedding_dim, input_length=sequence_length, name="embedding")(model_input)
# Convolutional Layer 
conv_blocks = []
filter_sizes = [2,3,4]
num_filters = 30
for sz in filter_sizes:
    # sz is the window size
    conv = Conv1D(filters=num_filters,
                  kernel_size=sz,
                  padding="valid",
                  activation="relu",
                  strides=1)(z)
    # Pooling Layer
    conv = GlobalMaxPooling1D()(conv)
    # If you use MaxPooling1D() instead, you also need a Flatten layer to get a 2D (batch, features) tensor
    conv_blocks.append(conv)
# Fully-connected Layer
z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]
# This is a binary classification problem, so we can use a single sigmoid unit.
# For a multi-class classification problem, we can use a softmax layer instead.
model_output = Dense(1, activation="sigmoid")(z)
model = Model(model_input, model_output)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
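
As a sanity check on the model size, the layer settings in the cell above imply the following parameter counts. This is back-of-the-envelope arithmetic, assuming each Conv1D and Dense layer keeps its default bias term; `model.summary()` can be used to compare against the actual numbers.

```python
vocab_size, embedding_dim = 8000, 20
filter_sizes, num_filters = [2, 3, 4], 30

# Embedding: one vector of size embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim

# Conv1D: each filter has sz * embedding_dim weights plus one bias
conv_params = [(sz * embedding_dim + 1) * num_filters for sz in filter_sizes]

# Dense(1): one weight per concatenated feature plus one bias
dense_params = len(filter_sizes) * num_filters + 1

total_params = embedding_params + sum(conv_params) + dense_params
print(embedding_params, conv_params, dense_params, total_params)
# 160000 [1230, 1830, 2430] 91 165581
```

Almost all of the parameters sit in the embedding layer, which is worth keeping in mind when training on only 1000 reviews.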

- For the difference between MaxPooling1D() and GlobalMaxPooling1D() in Keras, see:

https://stackoverflow.com/questions/43728235/what-is-the-difference-between-keras-maxpooling1d-and-globalmaxpooling1d-functi
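
The difference can also be seen on a toy feature map with NumPy. This sketch uses made-up values and imitates the two layers rather than calling Keras: windowed pooling with `pool_size=2` (the MaxPooling1D default, with stride equal to the pool size) keeps an axis over positions, while global pooling takes the maximum over all positions and leaves one value per filter.

```python
import numpy as np

# Toy feature map for one review: 6 positions, 2 filters
feature_map = np.array([[1, 0],
                        [3, 2],
                        [0, 5],
                        [4, 1],
                        [2, 2],
                        [0, 7]], dtype=float)

# Windowed max pooling (pool_size=2, stride 2): output still has a position axis
pool_size = 2
windowed = feature_map.reshape(-1, pool_size, feature_map.shape[1]).max(axis=1)
print(windowed.shape)    # (3, 2)
print(windowed)          # rows: [3, 2], [4, 5], [2, 7]

# Global max pooling: maximum over all positions, one value per filter
global_max = feature_map.max(axis=0)
print(global_max.shape)  # (2,)
print(global_max)        # [4. 7.]
```

Because the windowed result is still two-dimensional per review, it has to be flattened before a Dense layer, whereas the global result can be fed to it directly.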

In [25]:
# Training
model.fit(train_data, train_classes,
          validation_data=(test_data, test_classes),
          batch_size=64,
          epochs=5,
          verbose=2)

Train on 1000 samples, validate on 100 samples
Epoch 1/5
 - 3s - loss: 0.6931 - acc: 0.4990 - val_loss: 0.6889 - val_acc: 0.5400
Epoch 2/5
 - 2s - loss: 0.6757 - acc: 0.5910 - val_loss: 0.6862 - val_acc: 0.5400
Epoch 3/5
 - 2s - loss: 0.6625 - acc: 0.6060 - val_loss: 0.6817 - val_acc: 0.5400
Epoch 4/5
 - 2s - loss: 0.6475 - acc: 0.8640 - val_loss: 0.6765 - val_acc: 0.6700
Epoch 5/5
 - 2s - loss: 0.6282 - acc: 0.9740 - val_loss: 0.6650 - val_acc: 0.6700


<keras.callbacks.History at 0x1a112a897b8>
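
One way to read the log above: the epoch-1 training loss of 0.6931 is the value binary cross-entropy takes when every prediction is 0.5, which is consistent with a model that has not yet learned anything (the epoch-1 accuracy of 0.4990 points the same way). The small sketch below works through that arithmetic in plain Python; the function names are defined here for illustration and are not Keras calls.

```python
import math

def sigmoid(x):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_crossentropy(y_true, y_pred):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the samples."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

# A score of 0 maps to probability 0.5 for every review
uninformed = [sigmoid(0.0)] * 4
chance_loss = binary_crossentropy([1, 0, 1, 0], uninformed)
print(round(chance_loss, 4))  # 0.6931, i.e. ln 2
```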