
# CNN applied to sentiment analysis



## Exercise

- 1) What happens if you add more convolutional layers to your model? Do the results improve?
- 2) What happens if the network is initialized with pre-trained word embeddings instead of random initialization? Do the results improve?
- 3) Plot the learning curves.
 

In [1]:
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews


X=[] #list to save the texts
y=[] #list to save the labels


for file_id in movie_reviews.fileids(): # traverse all movie reviews
    category=movie_reviews.categories(file_id)
    label=category[0]
    tokens=list(movie_reviews.words(file_id))
    text=' '.join(str(word) for word in tokens)
    X.append(text)
    y.append(label)

X=list(X)
print("instances:",len(X),len(y))


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
instances: 2000 2000


In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


We could translate each class ("pos" and "neg") to 1 or 0 inside the previous for loop.
However, sklearn already provides a class, LabelEncoder, that translates string classes to integer labels.

In [3]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
num_classes=2 # positive -> 1, negative -> 0
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

print('Number of positive and negative reviews in the training:', sum(y_train), len(y_train)-sum(y_train))
print('Number of positive and negative reviews in the test:', sum(y_test), len(y_test)-sum(y_test))


Number of positive and negative reviews in the training: 757 743
Number of positive and negative reviews in the test: 243 257
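
To make the encoding step concrete, here is a minimal pure-Python sketch of the mapping that LabelEncoder produces: distinct labels are sorted and numbered from 0, so "neg" becomes 0 and "pos" becomes 1. The helper `fit_label_mapping` and the toy labels are illustrative assumptions, not part of sklearn.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, following sorted label order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


# Toy labels (hypothetical), standing in for y_train
toy_labels = ["pos", "neg", "neg", "pos"]
mapping = fit_label_mapping(toy_labels)
encoded = [mapping[label] for label in toy_labels]

print(mapping)  # {'neg': 0, 'pos': 1}
print(encoded)  # [1, 0, 0, 1]
```

Because positive reviews are encoded as 1, `sum(y_train)` in the cell above counts the positive reviews.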


The baseline system, a tf-idf model with an SVM classifier, achieved an accuracy of 0.82 and an F1 of 0.82 on the positive class (1).

In [4]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

tokenizer = Tokenizer(oov_token='<UNK>')
# fit the tokenizer on the documents
tokenizer.fit_on_texts(X_train)

#we add a word index for the pad tokens
tokenizer.word_index['<PAD>'] = 0
print("Vocabulary size of the training={}".format(len(tokenizer.word_index)))
print("Number of Documents in the training={}".format(tokenizer.document_count))

#We now transform the words to indexes
train_sequences = tokenizer.texts_to_sequences(X_train)
test_sequences = tokenizer.texts_to_sequences(X_test)
#print(X_train[0].split())
#print(train_sequences[0])

MAX_SEQUENCE_LENGTH = 1000 #most texts have fewer than 1000 tokens

# pad dataset to a maximum review length in words

train_seq_pad = sequence.pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
test_seq_pad = sequence.pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
train_seq_pad.shape, test_seq_pad.shape

Vocabulary size of the training=35192
Number of Documents in the training=1500


((1500, 1000), (500, 1000))
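
With its default arguments, `pad_sequences` pads and truncates at the start of each sequence, so a review longer than `MAX_SEQUENCE_LENGTH` loses its first tokens rather than its last ones. The following pure-Python sketch mimics that default behaviour on toy index lists; `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad with `value` up to `maxlen`."""
    truncated = sequence[-maxlen:]  # drop items from the start if too long
    return [value] * (maxlen - len(truncated)) + truncated


short_review = [5, 6, 7]             # toy word indexes
long_review = [1, 2, 3, 4, 5, 6, 7]  # toy word indexes

print(pad_pre(short_review, maxlen=5))  # [0, 0, 5, 6, 7]
print(pad_pre(long_review, maxlen=5))   # [3, 4, 5, 6, 7]
```

The padding value 0 is the index reserved for `<PAD>` in the tokenizer above.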

## Prepare the Model

Since textual data is a sequence of words, we use `1D` convolutions to scan through the sentences.
The model first maps each word into a lower-dimensional embedding space, then applies 1D convolutions, and finally passes the result through dense layers before the final classification layer.
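
As a rough illustration of what one convolution block does, the sketch below slides a kernel over a single-channel toy signal with 'same' zero padding, applies ReLU, and then max-pools with a window of 2. It is a simplification: the real model convolves over multi-channel embeddings and learns its kernels, and the names `conv1d_same`, `relu`, and `max_pool1d` are hypothetical helpers, not Keras functions.

```python
def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over the signal (no kernel flip),
    zero-padding both ends so the output keeps the input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, pool_size=2):
    """Take the maximum of each non-overlapping window; a trailing
    incomplete window is dropped."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]  # toy input, one channel
kernel = [1.0, 0.0, -1.0]                 # toy kernel of size 3

features = relu(conv1d_same(signal, kernel))
pooled = max_pool1d(features)

print(features)  # [0.0, 1.0, 3.0, 0.0, 0.0, 3.0]
print(pooled)    # [1.0, 3.0, 3.0]
```

The convolution keeps the sequence length at 6, while pooling halves it to 3; the same pattern repeats in every block of the model.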

In [5]:
import tensorflow
print(tensorflow.__version__)


2.8.2


In [6]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Embedding

import numpy as np

# fix random seed for reproducibility
import tensorflow as tf
tf.random.set_seed(1)

We now set the hyperparameters and, if pre-trained embeddings are enabled, create the matrix of word embeddings.

In [7]:
VOCAB_SIZE = len(tokenizer.word_index)

EPOCHS=20
BATCH_SIZE=16
#Set True if you want to use pre-trained word embeddings to initialize the inputs,
#False otherwise
USEWE=False
EMBED_SIZE = 300


In [8]:
if USEWE:
    import gensim.downloader as api
    modelWE = api.load("glove-wiki-gigaword-50")
    EMBED_SIZE = 50

    #modelWE = api.load("word2vec-google-news-300")
    #EMBED_SIZE = 300

    # create a weight matrix for words in training docs
    embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_SIZE))
    for word, i in tokenizer.word_index.items():
        try:
            embedding_vector = modelWE[word]
            embedding_matrix[i] = embedding_vector
        except KeyError:
            # if the word is not in the pre-trained vocabulary, its row stays all zeros
            pass

    print('embedding matrix created')
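
Before comparing results with and without pre-trained embeddings, it may help to check how much of the training vocabulary the pre-trained model actually covers, since every uncovered word keeps an all-zero row. The sketch below assumes a plain dict as a stand-in for `modelWE`; `embedding_coverage` and the toy data are hypothetical.

```python
def embedding_coverage(word_index, pretrained):
    """Return the fraction of vocabulary words found in `pretrained`
    and the list of words that are missing."""
    missing = [word for word in word_index if word not in pretrained]
    covered = len(word_index) - len(missing)
    ratio = covered / len(word_index) if word_index else 0.0
    return ratio, missing


# Toy vocabulary and toy vectors (hypothetical values)
toy_word_index = {"<PAD>": 0, "<UNK>": 1, "movie": 2, "great": 3, "zzzfilm": 4}
toy_pretrained = {"movie": [0.1, 0.2], "great": [0.3, 0.4]}

ratio, missing = embedding_coverage(toy_word_index, toy_pretrained)
print(ratio)    # 0.4
print(missing)  # ['<PAD>', '<UNK>', 'zzzfilm']
```

A low coverage ratio would suggest that many words start from zero vectors, which matters when the embedding layer is frozen.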


In [9]:
# create the model
model = Sequential()
if USEWE:
    # trainable=False belongs to the Embedding layer (it freezes the pre-trained vectors)
    model.add(Embedding(VOCAB_SIZE, EMBED_SIZE, input_length=MAX_SEQUENCE_LENGTH,
                        weights=[embedding_matrix], trainable=False))
else:
    model.add(Embedding(VOCAB_SIZE, EMBED_SIZE, input_length=MAX_SEQUENCE_LENGTH))

# ReLU computes max(0, a), where a = WX + b; its gradient is constant (1) for a > 0
# https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
# Two major benefits of ReLUs are speed and sparsity

model.add(Conv1D(filters=128, kernel_size=4, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=128, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=128, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=64, kernel_size=4, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=64, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=64, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))

model.add(Conv1D(filters=32, kernel_size=4, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1000, 300)         10557600  
                                                                 
 conv1d (Conv1D)             (None, 1000, 128)         153728    
                                                                 
 max_pooling1d (MaxPooling1D  (None, 500, 128)         0         
 )                                                               
                                                                 
 conv1d_1 (Conv1D)           (None, 500, 128)          49280     
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 250, 128)         0         
 1D)                                                             
                                                                 
 conv1d_2 (Conv1D)           (None, 250, 128)          3
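
The figures in the summary can be checked by hand. An embedding layer has `vocab_size * embed_size` weights, a Conv1D layer has `kernel_size * in_channels * filters` weights plus one bias per filter, 'same' padding keeps the sequence length, and each MaxPooling1D with `pool_size=2` halves it (rounding down). The sketch below reproduces that arithmetic for the nine blocks defined above; `conv1d_params` is a hypothetical helper.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a Conv1D layer."""
    return kernel_size * in_channels * filters + filters


vocab_size, embed_size, seq_len = 35192, 300, 1000
embedding_params = vocab_size * embed_size
print(embedding_params)  # 10557600, as in the summary

# (filters, kernel_size) for each Conv1D + MaxPooling1D block in the model
blocks = [(128, 4), (128, 3), (128, 2),
          (64, 4), (64, 3), (64, 2),
          (32, 4), (32, 3), (32, 2)]

length, channels = seq_len, embed_size
rows = []
for filters, kernel_size in blocks:
    params = conv1d_params(kernel_size, channels, filters)
    length //= 2          # effect of MaxPooling1D(pool_size=2)
    channels = filters
    rows.append((params, length, channels))
    print(params, (length, channels))

flatten_size = length * channels
print(flatten_size)  # 32: the sequence has shrunk to a single position
```

The first two rows give 153728 and 49280 parameters, matching the summary. Nine pooling steps reduce the 1000 positions to 1, so the dense layers see only 32 features.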

## Model Training

In [10]:
from keras.callbacks import EarlyStopping
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')


# Fit the model
history=model.fit(train_seq_pad, y_train, 
          validation_split=0.1,
          epochs=EPOCHS, 
          batch_size=BATCH_SIZE, 
          verbose=1, callbacks=[earlyStopping]
          )

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
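
Training stopped after 13 of the 20 epochs because of the EarlyStopping callback. The sketch below simulates the patience rule (stop once the monitored loss has failed to improve for `patience` consecutive epochs) on invented validation losses; the values are not taken from this run, and `early_stopping_epoch` is a hypothetical helper, not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (epoch at which training stops, epoch with the best loss).
    Epochs are numbered from 1."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0  # improvement: reset counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Invented validation losses, for illustration only
toy_val_losses = [0.69, 0.66, 0.60, 0.58, 0.61, 0.59, 0.63, 0.70]
stopped_at, best_epoch = early_stopping_epoch(toy_val_losses)
print(stopped_at, best_epoch)  # 7 4
```

Under this rule, a run that stops at epoch 13 with `patience=3` would be consistent with the lowest validation loss occurring at epoch 10. Note that the model keeps the weights of the last epoch unless `restore_best_weights=True` is passed to the callback.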


## Model Evaluation

In [11]:
# Final evaluation of the model
scores = model.evaluate(test_seq_pad, y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 77.60%


In [12]:
predictions=model.predict(test_seq_pad) 
predictions = predictions.reshape(-1)  # reshape returns a new array, so keep the result
predictions = [1 if item >= 0.5 else 0 for item in predictions]
predictions[:10]

[1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

In [13]:
from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(y_test, predictions))


              precision    recall  f1-score   support

           0       0.82      0.72      0.77       257
           1       0.74      0.84      0.78       243

    accuracy                           0.78       500
   macro avg       0.78      0.78      0.78       500
weighted avg       0.78      0.78      0.78       500
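
For reference, the per-class figures in the report follow from simple counts: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The sketch below computes them in pure Python on toy labels; the data is invented and `binary_report` is a hypothetical helper, not an sklearn function.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels and predictions (hypothetical)
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 1, 0]

precision, recall, f1 = binary_report(toy_true, toy_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Calling the helper with `positive=0` gives the figures for the negative class, which is how the two rows of the report differ.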

