# Assignment 5

Build a CNN model for sentiment analysis (binary classification) of IMDB Reviews (https://www.kaggle.com/utathya/imdb-review-dataset). You can use data with label="unsup" for pretraining of embeddings. You are forbidden to use the test dataset for pretraining of embeddings.
Your quality metric is the accuracy score on the test dataset. Look at the "type" column for the train/test split.
You can use pretrained embeddings from external sources.
You have to provide data for trials with different hyperparameter values.

You have to beat the following baselines:
[3 points] acc = 0.75
[5 points] acc = 0.8
[8 points] acc = 0.9

[2 points] for using unsupervised data

In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

idd = '1smuY3sJJ6wcL28i0QcBSlnEsimB5holu'
downloaded_ = drive.CreateFile({'id':idd}) 
downloaded_.GetContentFile('imdb_master.csv')

In [36]:
import pandas as pd 
import numpy as np
import spacy
from spacy.symbols import ORTH
import re
from tqdm import tqdm
from sklearn.metrics import accuracy_score

import torch
from torchtext.data import Field, LabelField, BucketIterator, TabularDataset, Iterator, Dataset
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

SEED = 42
np.random.seed(SEED)

import nltk
nltk.download('stopwords')

spacy_en = spacy.load('en')
spacy_en.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "not"}])
spacy_en.tokenizer.add_special_case("didn't", [{ORTH: "did"}, {ORTH: "not"}]) #adding special case so that tokenizer("""don't""") != 'do'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 0. Preprocessing

In [8]:
df = pd.read_csv('imdb_master.csv', sep=',', encoding= 'latin-1',  index_col=0)
df = df.drop(columns=['file'])
df.head()

Unnamed: 0,type,review,label
0,test,Once again Mr. Costner has dragged out a movie...,neg
1,test,This is an example of why the majority of acti...,neg
2,test,"First of all I hate those moronic rappers, who...",neg
3,test,Not even the Beatles could write songs everyon...,neg
4,test,Brass pictures (movies is not a fitting word f...,neg


In [9]:
#Let's separate 'unsup' elements for now, but we will use them later
mask = df['label'] == 'unsup'
df_unsup = df[mask]
df = df[~mask]
len(df_unsup), len(df)

(50000, 50000)

In [11]:
#making sure that we don't have 'unsup' labels in test
mask = df_unsup['type'] == 'test'
len(df_unsup[mask])

0

In [12]:
#now we split our labeled data into train and test
mask = df['type'] == 'train'
df_train = df[mask]
df_test = df[~mask]
len(df_train), len(df_test)

(25000, 25000)

In [0]:
df_train.to_csv("dataset_train.csv", index=False)
df_test.to_csv("dataset_test.csv", index=False)
df_unsup.to_csv("dataset_unsup.csv", index=False)

In [0]:
def tokenizer(text):
    return [tok.lemma_ for tok in spacy_en.tokenizer(text) if tok.text.isalpha()]

In [15]:
classes={'neg': 0, 'pos': 1}

REVIEW = Field(sequential=True, include_lengths=False, batch_first=True, tokenize=tokenizer, pad_first=True, lower=True, eos_token='<eos>',
                    stop_words=nltk.corpus.stopwords.words('english')) 
LABEL = LabelField(dtype=torch.int64, use_vocab=True, preprocessing=lambda x: classes[x])

train = TabularDataset('dataset_train.csv', 
                                format='csv', fields=[(None,None),('review', REVIEW),('label', LABEL)], 
                                skip_header=True)

test = TabularDataset('dataset_test.csv', 
                                format='csv', fields=[(None,None),('review', REVIEW),('label', LABEL)], 
                                skip_header=True)

dataset_unsup = TabularDataset('dataset_unsup.csv', 
                                format='csv', fields=[(None,None),('review', REVIEW), (None, None)], 
                                skip_header=True)

REVIEW.build_vocab(train, dataset_unsup, min_freq=5, vectors="glove.6B.100d") #we use 'unsup' data to build vocab/emb, but not test data
LABEL.build_vocab(train, dataset_unsup)
vocab = REVIEW.vocab

.vector_cache/glove.6B.zip: 862MB [06:30, 2.21MB/s]                           
100%|█████████▉| 399792/400000 [00:13<00:00, 29203.49it/s]

In [0]:
# All ready-to-use embeddings in torchtext
#['charngram.100d', 'fasttext.en.300d', 'fasttext.simple.300d', 'glove.42B.300d', 'glove.840B.300d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.6B.50d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d']

In [16]:
print('Vocab size:', len(REVIEW.vocab.itos))
REVIEW.vocab.itos[:10]

Vocab size: 37105


['<unk>',
 '<pad>',
 '<eos>',
 'movie',
 'film',
 '-pron-',
 'much',
 'one',
 'good',
 'see']

In [0]:
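import random

random.seed(SEED)
# torchtext expects random_state to be a value returned by random.getstate();
# np.random.seed(SEED) returns None, so it cannot be passed here
train, valid = train.split(0.8, stratified=True, random_state=random.getstate())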

In [0]:
print(train[0].review)
print(train[0].label)

['look', 'nothing', 'spectacularly', 'offensive', 'film', '-pron-', 'bore', '-pron-', 'typical', 'rom', 'com', 'end', 'see', 'come', '-pron-', 'see', 'much', 'trailer', 'key', 'difference', 'classic', 'rom', 'coms', 'tackle', 'story', 'wit', 'lack', 'pretension', 'movie', 'pretension', 'really', 'sense', 'movement', 'feel', 'though', 'get', 'walk', 'away', 'moment', 'production', 'movie', 'also', 'feel', 'debut', 'movie', 'make', 'fifteen', 'year', 'ago', '-pron-', 'recommend', 'watch', 'classic', 'movie', 'like', 'harry', 'met', 'sally', 'instead', 'shallow', 'imitation', 'oh', 'one', 'big', 'problem', 'chemistry', '-pron-', 'use', 'see', 'michael', 'look', 'cute', 'vaughn', 'alias', '-pron-', 'go', 'seriously', 'disappoint', 'way', '-pron-', 'make', 'look']
0


# 1. Model

In [0]:
class MyModel(nn.Module):
    
    def __init__(self, vocab_size, embed_size, hidden_size, kernels, dropout):
        super(MyModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.embedding.weight.data.copy_(vocab.vectors)        
        self.convs = nn.ModuleList([nn.Conv1d(embed_size, hidden_size, k, padding=5) for k in kernels])
        self.dropout = nn.Dropout(dropout)        
        self.fc = nn.Linear(hidden_size * len(kernels), 2)
        self.batch_norm = nn.BatchNorm1d(hidden_size)

        
    def forward(self, x):
        x = self.embedding(x)
        x = x.transpose(1,2)
        
        concatenated = []
        for conv in self.convs:
            z = conv(x)
            z = self.batch_norm(z)
            z = F.relu(z)
            z = F.max_pool1d(z, kernel_size=z.size(2)).squeeze(2)
            concatenated.append(z)
            
        x = torch.cat(concatenated, 1)
        x = self.dropout(x)
        x = self.fc(x)
        return x
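
A minimal sketch of how `padding=5` changes the sequence length in each convolution branch, using the output-length formula from the PyTorch `Conv1d` documentation. The review length of 80 tokens is an arbitrary illustrative value, and `conv1d_out_len` is a helper name introduced here, not part of the notebook.

```python
def conv1d_out_len(length, kernel_size, padding=5, stride=1, dilation=1):
    """Output length of a 1D convolution (formula from the PyTorch Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


seq_len = 80  # hypothetical padded review length in one batch
for k in (2, 3):
    print("kernel", k, "-> output length", conv1d_out_len(seq_len, k))

# A one-token review still produces a non-empty output thanks to the padding
print("shortest case:", conv1d_out_len(1, 3))
```

With stride 1 the two branches produce 89 and 88 positions, and even a one-token review yields a non-empty output. Since `forward` max-pools over the whole axis, every branch ends up with `hidden_size` features, which matches the `hidden_size * len(kernels)` input of `fc`.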

# 2. Hyperparams

Let's try to inspect and tune the hyperparameters by hand. To do this, we create several models with different sets of hyperparameters and pick the one with the lowest validation loss after two epochs.
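
The selection rule can be sketched as a small helper over training histories. It assumes each history is a list of per-epoch dicts with a `val_loss` key, as `train_cnn` in this notebook returns; `pick_best` and the dictionary keys are names made up for the illustration, and the numbers are the validation losses logged in the kernel experiments of this notebook.

```python
def final_val_loss(history):
    """Validation loss of the last recorded epoch."""
    return history[-1]["val_loss"]


def pick_best(histories):
    """Return the name of the configuration with the lowest final validation loss."""
    return min(histories, key=lambda name: final_val_loss(histories[name]))


histories = {
    "kernels=[2,3,4,5]": [{"epoch": 0, "val_loss": 0.367}, {"epoch": 1, "val_loss": 0.296}],
    "kernels=[2,3]": [{"epoch": 0, "val_loss": 0.351}, {"epoch": 1, "val_loss": 0.295}],
}

best = pick_best(histories)
print(best)
```

Differences this small may well be within run-to-run noise, so a secondary criterion such as training time can reasonably decide between close candidates.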

In [0]:
def create_model(batch_size, hidden_size, kernels, dropout):
    torch.cuda.empty_cache()    

    model = MyModel(len(REVIEW.vocab.itos),
                    embed_size=100,
                    hidden_size=hidden_size,
                    kernels=kernels,
                    dropout = dropout
                )

    train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
        (train, valid, test),
        batch_sizes=(batch_size, batch_size, batch_size),
        shuffle=True,
        sort_key=lambda x: len(x.review),
    )

    optimizer = optim.Adam(model.parameters())
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, verbose=True, cooldown=5)
    criterion = nn.CrossEntropyLoss()
    return model, train_iterator, valid_iterator, test_iterator, optimizer, scheduler, criterion

In [0]:
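def train_cnn(model, train_iterator, valid_iterator, criterion, device, scheduler, n_epochs=20):
    # Note: `optimizer` is read from the notebook's global scope (it is set by create_model);
    # `device` is accepted but not used, so tensors are never moved to it.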
    
    history = []

    for epoch in range(n_epochs):
        train_loss = []
        train_acc = []
        model.train()

        for item in tqdm(train_iterator):
            x = item.review
            y = item.label
            optimizer.zero_grad()
            preds = model(x)
            loss = criterion(preds, y)
            loss.backward()
            optimizer.step()
            train_loss.append(loss.item())
            train_acc.append(accuracy_score(y, torch.argmax(preds.detach(), dim=1)))

        train_loss = np.mean(train_loss)
        train_acc = np.mean(train_acc)

        model.eval()
        val_loss = []
        val_acc = []
        with torch.no_grad():
            for item in valid_iterator:
                x = item.review
                y = item.label
                preds = model(x)
                loss = criterion(preds, y)
                val_loss.append(loss.item())
                val_acc.append(accuracy_score(y, torch.argmax(preds, dim=1)))
        val_loss = np.mean(val_loss)
        val_acc = np.mean(val_acc)   

        scheduler.step(val_loss)    

        print()
        print('Epoch: {}; Train loss: {:.3f}; Train accuracy: {:.3f}; Val loss: {:.3f}; Val accuracy: {:.3f}'.format(
            epoch, train_loss, train_acc, val_loss, val_acc
        ))        
        
        history.append({
            'epoch': epoch,
            'train_loss': train_loss,
            'train_acc': train_acc,
            'val_loss': val_loss,
            'val_acc' : val_acc
        })

    return history

In [0]:
def clean_tqdm():
    for instance in list(tqdm._instances): 
        tqdm._decr_instances(instance)

We'll take a greedy approach: first choose kernels, then batch_size, then hidden_size.
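
A minimal sketch of such a greedy search, tuning one hyperparameter at a time while keeping the earlier choices fixed. The `evaluate` callback stands in for "train for two epochs and return the validation loss"; `toy_loss` below is a made-up function used only to make the example runnable and does not reflect the real experiments.

```python
def greedy_search(base, grid, evaluate):
    """Tune hyperparameters one by one; each winner is fixed before moving on."""
    best = dict(base)
    best_loss = evaluate(best)
    for name, values in grid.items():
        for value in values:
            if value == best[name]:
                continue  # this configuration has already been evaluated
            candidate = {**best, name: value}
            loss = evaluate(candidate)
            if loss < best_loss:
                best, best_loss = candidate, loss
    return best, best_loss


def toy_loss(cfg):
    # Hypothetical stand-in for a short training run
    return (abs(cfg["batch_size"] - 32) / 100
            + abs(cfg["hidden_size"] - 64) / 1000
            + len(cfg["kernels"]) / 100)


base = {"kernels": (2, 3, 4, 5), "batch_size": 32, "hidden_size": 128}
grid = {
    "kernels": [(2, 3, 4, 5), (2, 3)],
    "batch_size": [32, 64],
    "hidden_size": [128, 64],
}

best, best_loss = greedy_search(base, grid, toy_loss)
print(best, round(best_loss, 3))
```

For this grid the greedy pass evaluates 4 configurations instead of the 2 * 2 * 2 = 8 of a full grid search, at the cost of possibly missing interactions between hyperparameters.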

In [0]:
#kernels = [2,3,4,5]
clean_tqdm()
model, train_iterator, valid_iterator, test_iterator, optimizer, scheduler, criterion = create_model(32, 128, [2,3,4,5], 0.1)
history = train_cnn(model, train_iterator, valid_iterator,
          criterion, scheduler=scheduler, device='cpu', n_epochs=2)


100%|██████████| 625/625 [05:21<00:00,  1.84it/s]
  0%|          | 0/625 [00:00<?, ?it/s]


Epoch: 0; Train loss: 0.499; Train accuracy: 0.789; Val loss: 0.367; Val accuracy: 0.838


100%|██████████| 625/625 [05:17<00:00,  2.02it/s]



Epoch: 1; Train loss: 0.262; Train accuracy: 0.893; Val loss: 0.296; Val accuracy: 0.876


In [0]:
#kernels = [2,3]
clean_tqdm()
model, train_iterator, valid_iterator, test_iterator, optimizer, scheduler, criterion = create_model(32, 128, [2,3], 0.1)
history = train_cnn(model, train_iterator, valid_iterator,
          criterion, scheduler=scheduler,  device='cpu', n_epochs=2)

100%|██████████| 625/625 [02:38<00:00,  3.79it/s]
  0%|          | 0/625 [00:00<?, ?it/s]


Epoch: 0; Train loss: 0.485; Train accuracy: 0.780; Val loss: 0.351; Val accuracy: 0.846


100%|██████████| 625/625 [02:36<00:00,  4.06it/s]



Epoch: 1; Train loss: 0.285; Train accuracy: 0.882; Val loss: 0.295; Val accuracy: 0.879


The loss is practically the same, but the second model trains twice as fast, so we will use it.

In [0]:
# now we check batch_size = 64
clean_tqdm()
model, train_iterator, valid_iterator, test_iterator, optimizer, scheduler, criterion = create_model(64, 128, [2,3], 0.1)
history = train_cnn(model, train_iterator, valid_iterator,
          criterion, scheduler=scheduler, device='cpu', n_epochs=2)

100%|██████████| 313/313 [02:44<00:00,  2.46it/s]
  0%|          | 0/313 [00:00<?, ?it/s]


Epoch: 0; Train loss: 0.541; Train accuracy: 0.755; Val loss: 0.331; Val accuracy: 0.859


100%|██████████| 313/313 [02:39<00:00,  1.95it/s]



Epoch: 1; Train loss: 0.297; Train accuracy: 0.873; Val loss: 0.309; Val accuracy: 0.873


Here the model with batch_size 32 performed better, so we will use it.

In [0]:
#hidden_size = 64
clean_tqdm()
model, train_iterator, valid_iterator, test_iterator, optimizer, scheduler, criterion = create_model(32, 64, [2,3], 0.1)
history = train_cnn(model, train_iterator, valid_iterator,
          criterion, scheduler=scheduler, device='cpu', n_epochs=2)

100%|██████████| 625/625 [01:43<00:00,  5.83it/s]
  0%|          | 1/625 [00:00<01:46,  5.84it/s]


Epoch: 0; Train loss: 0.477; Train accuracy: 0.775; Val loss: 0.380; Val accuracy: 0.828


100%|██████████| 625/625 [01:41<00:00,  6.23it/s]



Epoch: 1; Train loss: 0.300; Train accuracy: 0.876; Val loss: 0.302; Val accuracy: 0.870


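We take hidden_size = 64. The validation loss is almost the same as with 128 (0.302 vs. 0.295), while an epoch takes about 1:40 instead of about 2:40.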

So the best model is as follows (I ran all the cells above a couple of times; after that I skipped this step to save time and used these parameters directly):

*   batch_size: 32
*   hidden_size: 64
*   kernels: [2,3]





In [0]:
#Increased dropout because the model was overfitting badly.
model , train_iterator, valid_iterator, test_iterator, optimizer, scheduler, criterion = create_model(32, 64, [2,3], 0.4)

In [23]:
model

MyModel(
  (embedding): Embedding(37105, 100)
  (convs): ModuleList(
    (0): Conv1d(100, 64, kernel_size=(2,), stride=(1,), padding=(5,))
    (1): Conv1d(100, 64, kernel_size=(3,), stride=(1,), padding=(5,))
  )
  (dropout): Dropout(p=0.4, inplace=False)
  (fc): Linear(in_features=128, out_features=2, bias=True)
  (batch_norm): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)

# 3. Training and evaluating our model

In [0]:
clean_tqdm()
history = train_cnn(model, train_iterator, valid_iterator,
          criterion, scheduler=scheduler, device='cpu', n_epochs=5)

100%|██████████| 625/625 [01:41<00:00,  5.98it/s]
  0%|          | 1/625 [00:00<01:48,  5.73it/s]


Epoch: 0; Train loss: 0.613; Train accuracy: 0.715; Val loss: 0.387; Val accuracy: 0.826


100%|██████████| 625/625 [01:40<00:00,  6.23it/s]
  0%|          | 0/625 [00:00<?, ?it/s]


Epoch: 1; Train loss: 0.373; Train accuracy: 0.837; Val loss: 0.328; Val accuracy: 0.860


100%|██████████| 625/625 [01:40<00:00,  6.88it/s]
  0%|          | 1/625 [00:00<01:50,  5.65it/s]


Epoch: 2; Train loss: 0.290; Train accuracy: 0.879; Val loss: 0.312; Val accuracy: 0.866


100%|██████████| 625/625 [01:40<00:00,  6.51it/s]
  0%|          | 1/625 [00:00<01:49,  5.72it/s]


Epoch: 3; Train loss: 0.218; Train accuracy: 0.914; Val loss: 0.313; Val accuracy: 0.866


100%|██████████| 625/625 [01:39<00:00,  6.04it/s]



Epoch: 4; Train loss: 0.152; Train accuracy: 0.945; Val loss: 0.332; Val accuracy: 0.868


In [0]:
def test_model(model, test_iterator):
    model.eval()
    test_acc = []

    with torch.no_grad():
        for item in test_iterator:
            x = item.review
            y = item.label
            preds = model(x)
            hard_label_pred = torch.argmax(preds, dim=1)
            test_acc.append(accuracy_score(y, hard_label_pred))
    test_acc = np.mean(test_acc) 
    return test_acc

In [0]:
test_accuracy = test_model(model, test_iterator)
test_accuracy

0.853300831202046
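
One caveat about this number: `test_model` averages per-batch accuracies, so a smaller final batch weighs as much as a full one (25,000 test reviews with a batch size of 32 leave a last batch of 8). The sketch below contrasts the two ways of averaging; the `(correct, size)` pairs are invented for illustration, and both function names are introduced here.

```python
def mean_of_batch_accuracies(batches):
    """Average of per-batch accuracies; every batch has equal weight."""
    return sum(correct / size for correct, size in batches) / len(batches)


def example_weighted_accuracy(batches):
    """Total correct predictions divided by total examples."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size


# Hypothetical (correct, batch_size) pairs; the last batch is smaller
batches = [(28, 32), (27, 32), (2, 8)]

print(mean_of_batch_accuracies(batches))
print(example_weighted_accuracy(batches))
```

In the actual run only one batch out of 782 is short, so the effect on the reported accuracy is likely small, but the example-weighted version is the one that matches the usual definition of accuracy.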

# 4. "UNSUP" data

The idea is simple: take two models (TextBlob and SentimentIntensityAnalyzer) and see what they predict for the unsup data. If the predictions agree, we keep the example; if not, we discard it. However, the second model turned out to classify most texts as neutral, so I used only TextBlob.
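
The agreement filter described above could look roughly like this. The two labelers are passed in as plain callables returning `'pos'` or `'neg'`; the keyword-based labelers in the example are toy stand-ins, not TextBlob or SentimentIntensityAnalyzer, and `agree_filter` is a name introduced for this sketch.

```python
def agree_filter(texts, labeler_a, labeler_b):
    """Keep only texts on which both labelers agree, paired with the agreed label."""
    kept = []
    for text in texts:
        label_a = labeler_a(text)
        label_b = labeler_b(text)
        if label_a == label_b:
            kept.append((text, label_a))
    return kept


# Toy stand-ins for the two sentiment models
def labeler_a(text):
    return "pos" if "good" in text else "neg"


def labeler_b(text):
    return "pos" if "great" in text else "neg"


texts = ["good and great film", "good but boring", "boring mess"]
pseudo_labeled = agree_filter(texts, labeler_a, labeler_b)
print(pseudo_labeled)
```

Examples on which the labelers disagree are dropped, which trades dataset size for (hopefully) cleaner pseudo-labels.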

In [1]:
!pip install vaderSentiment

Collecting vaderSentiment

In [25]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [0]:
def SIA_fill(sentence):
    sentence = ' '.join(tokenizer(sentence))
    analyzer = SentimentIntensityAnalyzer()
    vs = analyzer.polarity_scores(sentence)
    neg = vs['neg']
    pos = vs['pos']
    # assumption: label by whichever of the pos/neg proportions is larger
    score = 'pos' if pos > neg else 'neg'
    return score

In [0]:
def TextBlob_fill(sentence):
    blob = TextBlob(sentence)
    # only the first sentence of the review is scored; a polarity of exactly 0 counts as 'neg'
    score = blob.sentences[0].sentiment.polarity
    score = 'pos' if score > 0 else 'neg'
    return score

In [0]:
df_unsup['label'] = df_unsup['review'].apply(TextBlob_fill) #TextBlob
#df_unsup['label2'] = df_unsup['review'].apply(SIA_fill) #SentimentIntensityAnalyzer

In [29]:
mask = df_unsup['label'] == 'pos'
print(len(df_unsup[mask]))
print(len(df_unsup[~mask]))

26905
23095


In [33]:
df_unsup.head()

Unnamed: 0,type,review,label
50000,train,"I admit, the great majority of films released ...",pos
50001,train,"Take a low budget, inexperienced actors doubli...",neg
50002,train,"Everybody has seen 'Back To The Future,' right...",pos
50003,train,Doris Day was an icon of beauty in singing and...,pos
50004,train,"After a series of silly, fun-loving movies, 19...",neg


In [0]:
df_unsup.to_csv("unsup_labels.csv", index=False)

In [0]:
dataset_unsup = TabularDataset('unsup_labels.csv', 
                                format='csv', fields=[(None, None), ('review', REVIEW), ('label', LABEL)], 
                                skip_header=True)

In [0]:
ds_concat  = train + dataset_unsup
list_of_ex = [x for x in ds_concat]
new_ds = Dataset(list_of_ex, [('review', REVIEW), ('label', LABEL)])

In [0]:
unsup_iterator = BucketIterator(
        new_ds,
        batch_size=32,
        shuffle=False,
        sort_key=lambda x: len(x.review),
    )

In [39]:
#My Colab session crashed, so here I just load the weights (luckily I had saved them)
model.load_state_dict(torch.load('model'))

<All keys matched successfully>

In [40]:
clean_tqdm()
history = train_cnn(model, unsup_iterator, valid_iterator,
          criterion, scheduler=scheduler, device='cpu', n_epochs=15)


100%|██████████| 2188/2188 [06:17<00:00,  5.80it/s]
  0%|          | 1/2188 [00:00<04:43,  7.71it/s]


Epoch: 0; Train loss: 0.511; Train accuracy: 0.733; Val loss: 0.532; Val accuracy: 0.762


100%|██████████| 2188/2188 [06:24<00:00,  5.69it/s]
  0%|          | 1/2188 [00:00<04:50,  7.52it/s]


Epoch: 1; Train loss: 0.477; Train accuracy: 0.751; Val loss: 0.537; Val accuracy: 0.726


100%|██████████| 2188/2188 [06:28<00:00,  5.64it/s]
  0%|          | 1/2188 [00:00<04:55,  7.40it/s]


Epoch: 2; Train loss: 0.435; Train accuracy: 0.779; Val loss: 0.561; Val accuracy: 0.711


100%|██████████| 2188/2188 [06:28<00:00,  5.63it/s]
  0%|          | 1/2188 [00:00<04:39,  7.83it/s]


Epoch: 3; Train loss: 0.395; Train accuracy: 0.807; Val loss: 0.553; Val accuracy: 0.726


100%|██████████| 2188/2188 [06:26<00:00,  5.66it/s]
  0%|          | 1/2188 [00:00<04:49,  7.56it/s]

Epoch     4: reducing learning rate of group 0 to 1.0000e-04.

Epoch: 4; Train loss: 0.349; Train accuracy: 0.837; Val loss: 0.619; Val accuracy: 0.697


100%|██████████| 2188/2188 [06:29<00:00,  5.62it/s]
  0%|          | 1/2188 [00:00<04:32,  8.03it/s]


Epoch: 5; Train loss: 0.363; Train accuracy: 0.843; Val loss: 0.580; Val accuracy: 0.737


100%|██████████| 2188/2188 [06:28<00:00,  5.63it/s]
  0%|          | 1/2188 [00:00<04:47,  7.61it/s]


Epoch: 6; Train loss: 0.338; Train accuracy: 0.853; Val loss: 0.584; Val accuracy: 0.737


 18%|█▊        | 399/2188 [01:12<05:45,  5.18it/s]

KeyboardInterrupt: ignored

In [43]:
test_accuracy = test_model(model, test_iterator)
test_accuracy

0.6929747442455243

That is not good at all. But the best result on the test set is 0.85!