
# **Movie Reviews Text Classification With CNN | Hyperparameter Tuning**

---
Dataset: Sentiment Analysis on Movie Reviews | [download](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data)


In [0]:
# To import all required libraries

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from keras import layers
from keras.preprocessing.text import Tokenizer
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# To upload the movie reviews dataset files to Google Colab
import io
from google.colab import files

%matplotlib inline

In [29]:
# To upload the training file from the local directory to Google Colab
reviewsTrainFile = files.upload()


Saving train.tsv to train.tsv


In [0]:
# To load the training file content into a pandas DataFrame
trainReviews = pd.read_csv(io.StringIO(reviewsTrainFile['train.tsv'].decode('utf-8')), sep='\t')

In [31]:
# To upload the test file from the local directory to Google Colab
reviewsTestFile = files.upload()

Saving test.tsv to test.tsv


In [0]:
# To load the test file content into a pandas DataFrame
testReviews = pd.read_csv(io.StringIO(reviewsTestFile['test.tsv'].decode('utf-8')), sep='\t')

In [33]:
# To show the first 5 rows of the training dataset
trainReviews.head(5)

| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |


# **The sentiment labels are:**


*   0 - negative
*   1 - somewhat negative
*   2 - neutral
*   3 - somewhat positive
*   4 - positive



In [34]:
trainReviews.Sentiment.value_counts()

2    79582
3    32927
1    27273
4     9206
0     7072
Name: Sentiment, dtype: int64

In [0]:
# To merge "negative" (0) into "somewhat negative" (1) and "positive" (4) into "somewhat positive" (3), leaving three classes:
trainReviews['Sentiment'] = trainReviews['Sentiment'].replace({0: 1, 4: 3})

In [37]:
# To count the rows for each label in our dataset
trainReviews.Sentiment.value_counts()

2    79582
3    42133
1    34345
Name: Sentiment, dtype: int64

# **The new sentiment labels are:**


*   1 - negative
*   2 - neutral
*   3 - positive
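
The merge from five labels to three can be checked against the two `value_counts()` outputs printed above. The following is a minimal plain-Python sketch of that arithmetic; the names `original_counts` and `merge_map` are illustrative and do not appear in the notebook.

```python
# Counts per original label, copied from the first value_counts() output.
original_counts = {2: 79582, 3: 32927, 1: 27273, 4: 9206, 0: 7072}

# Illustrative mapping: label 0 folds into 1 (negative), label 4 into 3 (positive).
merge_map = {0: 1, 4: 3}

merged_counts = {}
for label, count in original_counts.items():
    new_label = merge_map.get(label, label)  # unmapped labels stay unchanged
    merged_counts[new_label] = merged_counts.get(new_label, 0) + count

print(merged_counts)
```

The merged totals agree with the second `value_counts()` output, and their sum equals the 156060 rows of the training file.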



In [38]:
# To show the index range, number of columns, column types, and column names in the training dataset
trainReviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
PhraseId      156060 non-null int64
SentenceId    156060 non-null int64
Phrase        156060 non-null object
Sentiment     156060 non-null int64
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


**To Handle Null Values**

First, we check whether there are any null values in our training dataset.

In [39]:
# To show the number of null values in each column of the training dataset:
trainNullValues = pd.DataFrame(trainReviews.isnull().sum().sort_values(ascending=False))
trainNullValues.columns = ['NO# of null values']
trainNullValues.index.name = 'Feature'
print(trainNullValues)

            NO# of null values
Feature                       
Sentiment                    0
Phrase                       0
SentenceId                   0
PhraseId                     0


**There are no null values**

The result shows that the training dataset has no null values. The next step is to assign the phrases and the sentiment labels to the X and Y variables:

In [0]:
# To assign the phrases and the sentiment labels to the X and Y variables:
phrase = trainReviews['Phrase'].values
Y = trainReviews['Sentiment'].values

In [0]:
# To set our model parameters:
CNN_vocabulary_Size = 5000
CNN_features_Size = 300
CNN_batch_size = 250
CNN_embedding_dims = 300
CNN_filters = 250
CNN_kernel_size = 3
CNN_hidden_dims = 250
CNN_epochs = 10

In [0]:
# To tokenize the phrases, build the vocabulary, and pad every sequence to the same length


CNN_tokenizer = Tokenizer(num_words=CNN_vocabulary_Size, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
CNN_tokenizer.fit_on_texts(phrase)
X = CNN_tokenizer.texts_to_sequences(phrase)
X = sequence.pad_sequences(X, maxlen=CNN_features_Size)
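
To make the tokenize-and-pad step concrete, here is a simplified plain-Python imitation of it. It is a sketch, not the Keras implementation: it assumes words are ranked by frequency starting at index 1, that only indices below `num_words` are kept, and that sequences are padded and truncated at the front. The helper names `fit_vocabulary` and `to_padded_sequence` are hypothetical.

```python
from collections import Counter

def fit_vocabulary(texts, num_words):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    # Index 0 is reserved for padding, so only indices below num_words are kept.
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_padded_sequence(text, vocabulary, maxlen):
    """Map words to indices, drop unknown words, then left-pad with zeros."""
    ids = [vocabulary[w] for w in text.lower().split() if w in vocabulary]
    ids = ids[-maxlen:]                      # keep the last maxlen tokens
    return [0] * (maxlen - len(ids)) + ids   # pad at the front

texts = ["A series of escapades", "A series", "A"]
vocab = fit_vocabulary(texts, num_words=4)
padded = [to_padded_sequence(t, vocab, maxlen=5) for t in texts]
print(vocab)
print(padded)
```

With `num_words=4` the rarest word, "escapades", falls outside the vocabulary and is dropped, which mirrors how `CNN_vocabulary_Size` limits the notebook's model to the most frequent words.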

In [0]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical
# To encode the labels 1, 2, 3 as the integers 0, 1, 2 and then as one-hot vectors
CNN_labelencoder = LabelEncoder()
CNN_integer_encoded = CNN_labelencoder.fit_transform(Y)
Y = to_categorical(CNN_integer_encoded)
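
The two encoding steps can be illustrated without any library. The sketch below uses a made-up list of five labels and reproduces the idea only: sorted unique labels become consecutive integers, and each integer becomes a one-hot row.

```python
labels = [1, 2, 3, 2, 1]   # made-up sample of merged sentiment labels

# Step 1 (the role of LabelEncoder): map sorted unique labels to 0..n-1.
classes = sorted(set(labels))
index_of = {label: i for i, label in enumerate(classes)}
encoded = [index_of[label] for label in labels]

# Step 2 (the role of to_categorical): one row per sample with a single 1.
one_hot = [
    [1.0 if i == idx else 0.0 for i in range(len(classes))]
    for idx in encoded
]

print(encoded)
print(one_hot)
```

One-hot rows are what the `categorical_crossentropy` loss and the 3-unit softmax output layer expect as targets.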

In [0]:
# To split our dataset into 75% training data and 25% held-out data

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1000)
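
As a rough check of the resulting split sizes, the arithmetic can be written out directly. This sketch assumes the held-out share is rounded up and the rest goes to training; with 156060 rows the product is a whole number, so rounding does not matter here.

```python
import math

n_samples = 156060      # rows reported by trainReviews.info()
test_size = 0.25

n_test = math.ceil(n_samples * test_size)   # held-out rows
n_train = n_samples - n_test                # rows used for fitting

print(n_train, n_test)
```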

In [0]:
# To create our model, we define the CNN_reviewsModel function, which is called when the model is fitted. The model is built with the Keras Sequential API.


def CNN_reviewsModel():
    CNN_model = Sequential()
    # We add an embedding layer as the first layer in our model; its input dimension is the vocabulary size
    CNN_model.add(Embedding(CNN_vocabulary_Size, CNN_embedding_dims, input_length=CNN_features_Size))
    # We add a dropout layer
    CNN_model.add(Dropout(0.2))
    # We add a 1D convolutional layer with its filters to process the text data
    CNN_model.add(Conv1D(CNN_filters, CNN_kernel_size, padding='valid', activation='relu', strides=1))
    # We keep the maximum response of each filter over the whole sequence
    CNN_model.add(GlobalMaxPooling1D())
    # We add a dense layer, followed by dropout and a ReLU activation
    CNN_model.add(Dense(CNN_hidden_dims))

    CNN_model.add(Dropout(0.2))

    CNN_model.add(Activation('relu'))
    # Our output layer consists of 3 output neurons, one per sentiment class
    CNN_model.add(Dense(3, activation='softmax'))
    # To compile our model
    CNN_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    CNN_model.summary()
    return CNN_model
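
To show what the convolution and pooling layers compute, here is a small plain-Python sketch for a single filter. It is an illustration under simplifying assumptions (one filter, no padding, stride 1, toy numbers), not the Keras code; `conv1d_valid_relu` and `global_max_pool` are hypothetical helper names.

```python
def conv1d_valid_relu(sequence, kernel, bias=0.0):
    """Slide one filter over the sequence with no padding and stride 1."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias + sum(
            w * x
            for w_row, x_row in zip(kernel, window)
            for w, x in zip(w_row, x_row)
        )
        outputs.append(max(0.0, total))   # ReLU
    return outputs

def global_max_pool(values):
    """Keep only the strongest response of the filter."""
    return max(values)

# Toy input: 5 time steps, 2 embedding dimensions (made-up numbers).
toy_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
# One filter spanning 3 time steps, matching CNN_kernel_size = 3.
toy_kernel = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

feature_map = conv1d_valid_relu(toy_sequence, toy_kernel)
pooled = global_max_pool(feature_map)
print(feature_map, pooled)
```

Five time steps and a kernel of size 3 give three window positions, and pooling reduces them to a single number. In the notebook's model the same thing happens for each of the 250 filters, which is why the pooled output has 250 values per review.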

In [46]:
# To set up TensorBoard to visualize our model's loss, accuracy, and graph
from keras.callbacks import TensorBoard
from keras.wrappers.scikit_learn import KerasClassifier
CNN_tbCallBack= TensorBoard(log_dir='./lab2Q5', histogram_freq=0,write_graph=True, write_images=True) 

# To wrap the model creation function in a scikit-learn compatible classifier
model = KerasClassifier(build_fn= CNN_reviewsModel,verbose=2)



# To start our model training
history = model.fit(X_train, Y_train, batch_size=CNN_batch_size, epochs=CNN_epochs, validation_data=(X_test, Y_test), callbacks=[CNN_tbCallBack])

Model: "sequential_39"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_39 (Embedding)     (None, 300, 300)          1500000   
_________________________________________________________________
dropout_77 (Dropout)         (None, 300, 300)          0         
_________________________________________________________________
conv1d_39 (Conv1D)           (None, 298, 250)          225250    
_________________________________________________________________
global_max_pooling1d_39 (Glo (None, 250)               0         
_________________________________________________________________
dense_77 (Dense)             (None, 250)               62750     
_________________________________________________________________
dropout_78 (Dropout)         (None, 250)               0         
_________________________________________________________________
activation_39 (Activation)   (None, 250)             
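
The parameter counts and output shapes in the summary above can be reproduced from the model settings. The sketch below repeats the values set earlier under illustrative lowercase names. The last figure is for the final `Dense(3)` layer, whose row is cut off in the printed summary, so it is a derived value rather than one read from the output.

```python
vocabulary_size = 5000   # CNN_vocabulary_Size
embedding_dims = 300     # CNN_embedding_dims
sequence_length = 300    # CNN_features_Size
filters = 250            # CNN_filters
kernel_size = 3          # CNN_kernel_size
hidden_dims = 250        # CNN_hidden_dims
n_classes = 3

# One embedding vector per vocabulary entry.
embedding_params = vocabulary_size * embedding_dims
# Each filter has kernel_size * embedding_dims weights plus one bias.
conv_params = kernel_size * embedding_dims * filters + filters
# 'valid' padding with stride 1 shortens the sequence by kernel_size - 1.
conv_output_length = sequence_length - kernel_size + 1
# Dense layers: inputs * units weights plus one bias per unit.
dense_params = filters * hidden_dims + hidden_dims
output_params = hidden_dims * n_classes + n_classes

print(embedding_params, conv_params, conv_output_length, dense_params, output_params)
```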

**Hyperparameter Tuning**

Now, to improve the accuracy, we use grid search, a hyperparameter optimization technique that trains and scores the model for every combination of the candidate values:

In [47]:
# To set up the GridSearchCV parameters
from sklearn.model_selection import GridSearchCV


grid_batch_size= [64,128,255]
grid_epochs = [2, 4, 6, 8]

CNN_param_grid= {'batch_size':grid_batch_size, 'epochs':grid_epochs}

CNN_grid = GridSearchCV(estimator=model, param_grid=CNN_param_grid, n_jobs=1)
CNN_grid_result= CNN_grid.fit(X_train, Y_train)



Model: "sequential_40"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_40 (Embedding)     (None, 300, 300)          1500000   
_________________________________________________________________
dropout_79 (Dropout)         (None, 300, 300)          0         
_________________________________________________________________
conv1d_40 (Conv1D)           (None, 298, 250)          225250    
_________________________________________________________________
global_max_pooling1d_40 (Glo (None, 250)               0         
_________________________________________________________________
dense_79 (Dense)             (None, 250)               62750     
_________________________________________________________________
dropout_80 (Dropout)         (None, 250)               0         
_________________________________________________________________
activation_40 (Activation)   (None, 250)             

In [48]:
# To show the gridsearch results
print("Best: %f using %s" % (CNN_grid_result.best_score_, CNN_grid_result.best_params_))

Best: 0.729489 using {'batch_size': 128, 'epochs': 4}
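
To get a sense of how much work the grid search does, the candidate combinations can be enumerated by hand. The sketch below uses only the standard library. The fold count is an assumption: the notebook does not set `cv`, so the real number depends on the installed scikit-learn version.

```python
from itertools import product

grid_batch_size = [64, 128, 255]
grid_epochs = [2, 4, 6, 8]

# Every pairing of a batch size with an epoch count is one candidate.
combinations = [
    {"batch_size": b, "epochs": e}
    for b, e in product(grid_batch_size, grid_epochs)
]

# Assumption: 3 cross-validation folds (not stated in the notebook).
assumed_folds = 3
total_fits = len(combinations) * assumed_folds

print(len(combinations), total_fits)
```

Under that assumption the search trains the network 36 times before refitting the best candidate, which explains why the cell is slow. The reported best score is the mean cross-validated score of the winning combination, not the accuracy on `X_test`.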
