# Assignment 5

Build a CNN model for sentiment analysis (binary classification) of IMDB reviews (https://www.kaggle.com/utathya/imdb-review-dataset).
You can use the data with label="unsup" to pretrain embeddings. You must not use the test dataset to pretrain embeddings.  
Your quality metric is the accuracy score on the test dataset. Use the "type" column for the train/test split.  
You can use pretrained embeddings from external sources.  
You have to provide the results of trials with different hyperparameter values.  

You have to beat the following baselines:  
[3 points] acc = 0.75  
[5 points] acc = 0.8  
[8 points] acc = 0.9  

[2 points] for using unsupervised data  
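
The requirement to report trials with different hyperparameter values can be met by logging each configuration next to its accuracy. Below is a minimal sketch, assuming a hypothetical `train_and_evaluate` function (not part of this notebook) that would train one model per configuration and return its test accuracy. The grid names and values are illustrative placeholders, and the stub returns `None` so that no score is invented.

```python
import itertools

# Hypothetical search space; names and values are illustrative assumptions.
grid = {
    "max_sequence_length": [50, 200],
    "num_filters": [100, 200],
    "dropout": [0.1, 0.5],
}

def train_and_evaluate(config):
    """Placeholder: train a model with `config` and return its test accuracy.

    Returns None here so that no fabricated score is recorded.
    """
    return None

def run_trials(grid, evaluate):
    """Evaluate every combination in `grid` and collect one record per trial."""
    keys = sorted(grid)
    trials = []
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        trials.append({**config, "accuracy": evaluate(config)})
    return trials

trials = run_trials(grid, train_and_evaluate)
```

The resulting list of dicts can be passed to `pd.DataFrame(trials)` to get a results table.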

In [16]:
import pandas as pd
df = pd.read_csv('imdb_master.csv', sep=',', engine='python')
df.head()

Unnamed: 0.1,Unnamed: 0,type,review,label,file
0,0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt
1,1,test,This is an example of why the majority of acti...,neg,10000_4.txt
2,2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt
3,3,test,Not even the Beatles could write songs everyon...,neg,10002_3.txt
4,4,test,Brass pictures (movies is not a fitting word f...,neg,10003_3.txt


In [17]:
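import nltk
import re
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords
# The 'stopwords' and 'wordnet' corpora must already be installed
# (nltk.download('stopwords'), nltk.download('wordnet')); only 'punkt' is fetched here.
nltk.download('punkt')
stoplist = set(stopwords.words('english'))  # a set makes membership tests fast
lemmatizer = WordNetLemmatizer() 

toks = []
reviews = []
for r in df['review']:
    tokens = nltk.word_tokenize(r)
    l = []
    for t in tokens:
        # Lowercase, then strip everything except word characters and whitespace
        tnew = t.lower()
        tnew = re.sub(r'[^\w\s]','',tnew)
        # Drop stopwords; pure-punctuation tokens become '' and are kept as empty strings
        if tnew not in stoplist:
            l.append(lemmatizer.lemmatize(tnew))
    toks.append(l)
    reviews.append(' '.join(l))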

df['review'] = reviews
df['tokens'] = toks


df = df.drop(columns=['Unnamed: 0', 'file'])


# One indicator column (0/1) per label value
df['neg'] = (df['label'] == 'neg').astype(int)
df['pos'] = (df['label'] == 'pos').astype(int)
df['unsup'] = (df['label'] == 'unsup').astype(int)

df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mariaignasina/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,type,review,label,tokens,neg,pos,unsup
0,test,mr costner dragged movie far longer necessary ...,neg,"[mr, costner, dragged, movie, far, longer, nec...",1,0,0
1,test,example majority action film generic boring ...,neg,"[example, majority, action, film, , generic, b...",1,0,0
2,test,first hate moronic rapper couldnt act gun pre...,neg,"[first, hate, moronic, rapper, , couldnt, act,...",1,0,0
3,test,even beatles could write song everyone liked ...,neg,"[even, beatles, could, write, song, everyone, ...",1,0,0
4,test,brass picture movie fitting word really some...,neg,"[brass, picture, , movie, fitting, word, , rea...",1,0,0


In [18]:
df_train = df[df.type == 'train']
df_test =  df[df.type == 'test']
df_train = df_train[['review', 'pos', 'tokens', 'neg', 'unsup']]
df_test = df_test[['review', 'pos', 'tokens', 'neg', 'unsup']]

In [19]:
all_training_words = []
training_sentence_lengths = []
for s in df_train['tokens']:
    training_sentence_lengths.append(len(s))
    for t in s:
        all_training_words.append(t)
TRAINING_VOCAB = sorted(list(set(all_training_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_training_words), len(TRAINING_VOCAB)))
print("Max sentence length is %s words" % max(training_sentence_lengths))

12598050 words total, with a vocabulary size of 166195
Max sentence length is 1923 words


In [20]:
all_test_words = []
test_sentence_lengths = []
for s in df_test['tokens']:
    test_sentence_lengths.append(len(s))
    for t in s:
        all_test_words.append(t)
TEST_VOCAB = sorted(list(set(all_test_words)))
print('%s words total, with a vocabulary size of %s' % (len(all_test_words), len(TEST_VOCAB)))
print('Max sentence length is %s words' % max(test_sentence_lengths))

4096662 words total, with a vocabulary size of 86080
Max sentence length is 1713 words
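
The two vocabulary sizes above do not show how much of the test vocabulary never occurs in training. Below is a small sketch of that out-of-vocabulary check on toy token lists. The lists and the helper name `oov_stats` are illustrative assumptions; in this notebook the inputs would be `df_train['tokens']` and `df_test['tokens']`.

```python
def oov_stats(train_token_lists, test_token_lists):
    """Return (number of unseen test word types, share of test tokens that are unseen)."""
    train_vocab = {t for tokens in train_token_lists for t in tokens}
    test_tokens = [t for tokens in test_token_lists for t in tokens]
    unseen_types = set(test_tokens) - train_vocab
    unseen_token_count = sum(1 for t in test_tokens if t not in train_vocab)
    share = unseen_token_count / len(test_tokens) if test_tokens else 0.0
    return len(unseen_types), share

# Toy data standing in for the tokenized reviews.
train = [["good", "movie"], ["bad", "plot"]]
test = [["good", "plot", "twist"], ["awful", "movie", "twist"]]
n_unseen_types, unseen_share = oov_stats(train, test)
```

This matters because a Keras `Tokenizer` fitted on the training reviews without an `oov_token` skips unknown words in `texts_to_sequences`.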


In [22]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2020-03-21 00:20:59--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.21.37
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.21.37|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-03-21 00:45:28 (1.07 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [23]:
from gensim import models
word2vec_path = 'GoogleNews-vectors-negative300.bin.gz'
word2vec = models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

In [24]:
MAX_SEQUENCE_LENGTH = 50
EMBEDDING_DIM = 300

In [37]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=len(TRAINING_VOCAB), lower=True, char_level=False)
tokenizer.fit_on_texts(df_train['review'].tolist())
training_sequences = tokenizer.texts_to_sequences(df_train['review'].tolist())
train_word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(train_word_index))
train_cnn_data = pad_sequences(training_sequences, 
                               maxlen=MAX_SEQUENCE_LENGTH)
test_sequences = tokenizer.texts_to_sequences(df_test['review'].tolist())
test_cnn_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)

Found 165885 unique tokens.
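
With `MAX_SEQUENCE_LENGTH = 50` and reviews of up to 1923 tokens, long reviews are truncated. `pad_sequences` pads and truncates at the front by default (`padding='pre'`, `truncating='pre'`), so only the last 50 token ids of a long review reach the model. The pure-Python function below is an illustrative re-implementation of that default behaviour, not the Keras code itself.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the pad_sequences defaults: keep the last `maxlen` ids, left-pad with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short = pad_pre([7, 8, 9], maxlen=5)             # shorter than maxlen: zeros are added in front
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)  # longer than maxlen: the first ids are dropped
```

Passing `truncating='post'` to `pad_sequences` would keep the beginning of each review instead.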


In [38]:
import numpy as np
train_embedding_weights = np.zeros((len(train_word_index)+1, EMBEDDING_DIM))
for word,index in train_word_index.items():
    train_embedding_weights[index,:] = word2vec[word] if word in word2vec else np.random.rand(EMBEDDING_DIM)
print(train_embedding_weights.shape)

(165886, 300)
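
The embedding matrix above falls back to random vectors for words missing from the pretrained model, so it helps to know how much of the vocabulary is actually covered. Below is a sketch with a plain dict standing in for the `word2vec` object. Both the toy dict and the helper name `embedding_coverage` are assumptions for illustration.

```python
def embedding_coverage(word_index, pretrained):
    """Return the fraction of vocabulary words found in `pretrained` and the missing words."""
    missing = sorted(w for w in word_index if w not in pretrained)
    covered = len(word_index) - len(missing)
    fraction = covered / len(word_index) if word_index else 0.0
    return fraction, missing

# Toy stand-ins: word -> index, and word -> vector.
toy_word_index = {"movie": 1, "great": 2, "costner": 3, "couldnt": 4}
toy_vectors = {"movie": [0.1, 0.2], "great": [0.3, 0.4], "costner": [0.5, 0.6]}
coverage, missing_words = embedding_coverage(toy_word_index, toy_vectors)
```

The same membership test (`word in word2vec`) is what the cell above uses, so the helper could be called with `train_word_index` and `word2vec` directly.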


In [39]:
def get_average_word2vec(tokens_list, vector, generate_missing=False, k=300):
    if len(tokens_list)<1:
        return np.zeros(k)
    if generate_missing:
        vectorized = [vector[word] if word in vector else np.random.rand(k) for word in tokens_list]
    else:
        vectorized = [vector[word] if word in vector else np.zeros(k) for word in tokens_list]
    length = len(vectorized)
    summed = np.sum(vectorized, axis=0)
    averaged = np.divide(summed, length)
    return averaged

def get_word2vec_embeddings(vectors, clean_comments, generate_missing=False):
    embeddings = clean_comments['tokens'].apply(lambda x: get_average_word2vec(x, vectors, 
                                                                                generate_missing=generate_missing))
    return list(embeddings)
train_embeddings = get_word2vec_embeddings(word2vec, df_train, generate_missing=True)

In [40]:
def ConvNet(embeddings, max_sequence_length, num_words, embedding_dim, labels_index):
    
    embedding_layer = Embedding(num_words,
                            embedding_dim,
                            weights=[embeddings],
                            input_length=max_sequence_length,
                            trainable=False)
    
    sequence_input = Input(shape=(max_sequence_length,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)

    convs = []
    filter_sizes = [2,3,4,5,6]

    for filter_size in filter_sizes:
        l_conv = Conv1D(filters=200, kernel_size=filter_size, activation='relu')(embedded_sequences)
        l_pool = GlobalMaxPooling1D()(l_conv)
        convs.append(l_pool)


    l_merge = concatenate(convs, axis=1)

    x = Dropout(0.1)(l_merge)  
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.2)(x)
    preds = Dense(labels_index, activation='sigmoid')(x)

    model = Model(sequence_input, preds)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])
    model.summary()
    return model

In [41]:
label_names = ['pos', 'neg', 'unsup']

In [42]:
y_train = df_train[label_names].values

In [43]:
x_train = train_cnn_data
y_tr = y_train

In [45]:
from tensorflow.keras.layers import Embedding, Input, Conv1D, GlobalMaxPooling1D, concatenate, Dropout, Dense
from tensorflow.keras.models import Model
model = ConvNet(train_embedding_weights, MAX_SEQUENCE_LENGTH, len(train_word_index)+1, EMBEDDING_DIM, 
                len(list(label_names)))

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 50)           0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 50, 300)      49765800    input_1[0][0]                    
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 49, 200)      120200      embedding[0][0]                  
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 48, 200)      180200      embedding[0][0]                  
__________________________________________________________________________________________________
conv1d_2 (
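
The shapes and parameter counts in the summary above follow directly from the layer settings. With the default `padding='valid'` and stride 1, a `Conv1D` over a sequence of length L with kernel size k yields L - k + 1 positions, and it has k * embedding_dim * filters weights plus one bias per filter. Below is a short check of that arithmetic (the helper names are illustrative).

```python
def conv1d_output_length(seq_len, kernel_size):
    """Output length of a 'valid' convolution with stride 1."""
    return seq_len - kernel_size + 1

def conv1d_param_count(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def embedding_param_count(num_words, embedding_dim):
    """One vector of size embedding_dim per vocabulary index."""
    return num_words * embedding_dim

seq_len, emb_dim, filters = 50, 300, 200
kernel_sizes = [2, 3, 4, 5, 6]
shapes = {k: conv1d_output_length(seq_len, k) for k in kernel_sizes}
params = {k: conv1d_param_count(k, emb_dim, filters) for k in kernel_sizes}
embedding_params = embedding_param_count(165886, emb_dim)
```

For kernel sizes 2 and 3 this reproduces the 49 and 48 time steps and the 120200 and 180200 parameters listed in the summary, as well as the 49765800 embedding parameters.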

In [46]:
num_epochs = 50
batch_size = 10

In [None]:
hist = model.fit(x_train, y_tr, epochs=num_epochs, validation_split=0.1, shuffle=True, batch_size=batch_size)

Train on 67500 samples, validate on 7500 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
 5710/67500 [=>............................] - ETA: 2:45 - loss: 0.1232 - acc: 0.9521

In [None]:
predictions = model.predict(test_cnn_data, batch_size=100, verbose=1)

In [None]:
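# Map each argmax index (order of label_names: pos, neg, unsup) to the matching value of the `pos` column
labels = [1, 0, 0]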

In [None]:
prediction_labels=[]
for p in predictions:
    prediction_labels.append(labels[np.argmax(p)])

In [None]:
sum(df_test.pos==prediction_labels)/len(prediction_labels)

In [None]:
sum(df_test.pos==prediction_labels)

In [None]:
len(prediction_labels)

In [None]:
df_test.pos.value_counts()
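
The model outputs one score per entry of `label_names` (`pos`, `neg`, `unsup`), while the test split is expected to contain only `pos` and `neg` reviews. One way to score it, sketched here on made-up prediction rows, is to compare only the `pos` and `neg` scores and check the result against the `pos` column. The helper name and the toy numbers are assumptions for illustration, not outputs of this notebook.

```python
def binary_accuracy(predictions, true_pos, pos_index=0, neg_index=1):
    """Predict positive when the 'pos' score exceeds the 'neg' score; return accuracy."""
    predicted = [1 if row[pos_index] > row[neg_index] else 0 for row in predictions]
    correct = sum(int(p == t) for p, t in zip(predicted, true_pos))
    return correct / len(true_pos)

# Made-up scores in the column order pos, neg, unsup.
toy_predictions = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
]
toy_true_pos = [1, 0, 1, 1]
accuracy = binary_accuracy(toy_predictions, toy_true_pos)
```

In the notebook the same function could be called as `binary_accuracy(predictions, df_test['pos'].tolist())`.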