<h3><b>Физтех-Школа Прикладной математики и информатики (ФПМИ) МФТИ</b></h3>

## Классификация текстов

В этом задании вам предстоит попробовать несколько методов, используемых в задаче классификации, а также понять насколько хорошо модель понимает смысл слов и какие слова в примере влияют на результат.

Будем рассматривать датасет отзывов о фильмах с сайта IMDB

In [1]:
!pip install torchtext==0.8.1



In [2]:
import pandas as pd
import numpy as np
import torch

from torchtext import datasets

from torchtext.data import Field, LabelField
from torchtext.data import BucketIterator

from torchtext.vocab import Vectors, GloVe

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
from tqdm.autonotebook import tqdm

В этом задании мы будем использовать библиотеку torchtext. Она довольна проста в использовании и поможет нам сконцентрироваться на задаче, а не на написании Dataloader-а.

In [3]:
TEXT = Field(sequential=True, lower=True, include_lengths=True)  # Поле текста
LABEL = LabelField(dtype=torch.float)  # Поле метки



In [4]:
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [69]:
train, test = datasets.IMDB.splits(TEXT, LABEL)  # загрузим датасет
train, valid = train.split(random_state=random.seed(SEED))  # разобьем на части



In [70]:
TEXT.build_vocab(train)
LABEL.build_vocab(train)

In [71]:
device = "cuda" if torch.cuda.is_available() else "cpu"

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test), 
    batch_size = 64,
    sort_within_batch = True,
    device = device)



In [72]:
print(TEXT.vocab.itos[:10])
print(LABEL.vocab.itos)

['<unk>', '<pad>', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 'i']
['neg', 'pos']


## RNN

Для начала попробуем использовать рекурентные нейронные сети (**LSTM**). 

In [15]:
class RNNBaseline(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional = False, dropout = 0, pad_idx = 1):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim,
                           num_layers = n_layers,bidirectional=bidirectional,dropout=dropout)
        
        self.fc = nn.Linear(hidden_dim * (1+int(bidirectional)), output_dim)

        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, text, text_lengths):
        
        #text = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(text))
        
        #embedded = [sent len, batch size, emb dim]
        
        #pack sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))
        
        # cell arg for LSTM
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)  

        #output = [sent len, batch size, hid dim * num directions]
        #output over padding tokens are zero tensors
        
        #hidden = [num layers * num directions, batch size, hid dim]
        #cell = [num layers * num directions, batch size, hid dim]
        
        #concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        #and apply dropout
        if bidirectional:
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
            
        hidden = self.dropout(hidden) 

        #hidden = [batch size, hid dim * num directions] or [batch_size, hid dim * num directions]
            
        return self.fc(hidden)

In [74]:
vocab_size = len(TEXT.vocab)
emb_dim = 100
hidden_dim = 256
output_dim = 1
n_layers = 2
bidirectional = True
dropout = 0.4
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
patience=5

In [75]:
model = RNNBaseline(
    vocab_size=vocab_size,
    embedding_dim=emb_dim,
    hidden_dim=hidden_dim,
    output_dim=output_dim,
    n_layers=n_layers,
    bidirectional=bidirectional,
    dropout=dropout,
    pad_idx=PAD_IDX
)

In [76]:
model = model.to(device)

In [77]:
opt = torch.optim.Adam(model.parameters())
loss_func = nn.BCEWithLogitsLoss()

max_epochs = 20

Обучаем сеть

In [78]:
import numpy as np

min_loss = np.inf

cur_patience = 0

for epoch in range(1, max_epochs + 1):
    train_loss = 0.0
    model.train()
    pbar = tqdm(enumerate(train_iter), total=len(train_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar: 
        opt.zero_grad()
        text, length = batch.text
        target = batch.label
        pred = model(text,length).squeeze(1)
        loss =  loss_func(pred,target)
        train_loss+=loss
        loss.backward()
        opt.step()
    train_loss /= len(train_iter)
    val_loss = 0.0
    model.eval()
    pbar = tqdm(enumerate(valid_iter), total=len(valid_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar:
        text, length = batch.text
        target = batch.label
        pred = model(text,length).squeeze(1)
        loss =  loss_func(pred,target)
        val_loss +=loss
    val_loss /= len(valid_iter)
    if val_loss < min_loss:
        min_loss = val_loss
        best_model = model.state_dict()
    else:
        cur_patience += 1
        if cur_patience == patience:
            cur_patience = 0
            break
    
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, train_loss, val_loss))
model.load_state_dict(best_model)

  0%|          | 0/274 [00:00<?, ?it/s]



  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 1, Training Loss: 0.667199969291687, Validation Loss: 0.608317494392395


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 2, Training Loss: 0.6536913514137268, Validation Loss: 0.6610695719718933


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 3, Training Loss: 0.6171404719352722, Validation Loss: 0.5253218412399292


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 4, Training Loss: 0.49541422724723816, Validation Loss: 0.46212080121040344


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 5, Training Loss: 0.4275961220264435, Validation Loss: 0.4123981297016144


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 6, Training Loss: 0.3273160457611084, Validation Loss: 0.4631347060203552


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 7, Training Loss: 0.2831081748008728, Validation Loss: 0.36191853880882263


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 8, Training Loss: 0.22645193338394165, Validation Loss: 0.3651513159275055


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 9, Training Loss: 0.18926623463630676, Validation Loss: 0.4048888683319092


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

<All keys matched successfully>

Посчитаем f1-score классификатора на тестовом датасете.


In [79]:
from sklearn.metrics import f1_score

f1 = 0
model.eval()

pbar = tqdm(enumerate(test_iter), total=len(test_iter), leave=False)
pbar.set_description(f"Epoch {epoch}")

for it, batch in pbar: 
    pred = model(*batch.text).squeeze(1)
    pred = torch.round(torch.sigmoid(pred.detach()))
    f1 += f1_score(batch.label.cpu(),pred.cpu())
f1/=len(test_iter)

print(f'F1 score : {f1}')

  0%|          | 0/391 [00:00<?, ?it/s]



F1 score : 0.8158482577466912


## CNN

![](https://www.researchgate.net/publication/333752473/figure/fig1/AS:769346934673412@1560438011375/Standard-CNN-on-text-classification.png)

Для классификации текстов также часто используют сверточные нейронные сети. Идея в том, что как правило сентимент содержат словосочетания из двух-трех слов, например "очень хороший фильм" или "невероятная скука". Проходясь сверткой по этим словам мы получим какой-то большой скор и выхватим его с помощью MaxPool. Далее идет обычная полносвязная сетка. Важный момент: свертки применяются не последовательно, а параллельно.

In [20]:
TEXT = Field(sequential=True, lower=True, batch_first=True)  # batch_first тк мы используем conv  
LABEL = LabelField(batch_first=True, dtype=torch.float)

train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split(random_state=random.seed(SEED))

TEXT.build_vocab(trn)
LABEL.build_vocab(trn)

device = "cuda" if torch.cuda.is_available() else "cpu"



In [21]:
BATCH_SIZE = 128

train_iter, valid_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
    batch_size = BATCH_SIZE, 
    device = device)




In [28]:
class CNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        emb_dim,
        out_channels,
        kernel_sizes,
        dropout=0.5,
    ):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv_0 = nn.Conv2d(in_channels=1,out_channels=out_channels,kernel_size=(kernel_sizes[0], emb_dim))
        
        self.conv_1 = nn.Conv2d(in_channels=1,out_channels=out_channels,kernel_size=(kernel_sizes[1], emb_dim))
        
        self.conv_2 = nn.Conv2d(in_channels=1,out_channels=out_channels,kernel_size=(kernel_sizes[2], emb_dim))
        
        self.fc = nn.Linear(len(kernel_sizes) * out_channels, 1)
        
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, text):
        
        embedded = self.embedding(text)
 
        embedded = embedded.unsqueeze(1) #один канал
        
        conved_0 = F.relu(self.conv_0(embedded).squeeze(3))  # (N,Cout​,Hout​,Wout​) -> Wout = 1
        conved_1 = F.relu(self.conv_1(embedded).squeeze(3))  
        conved_2 = F.relu(self.conv_2(embedded).squeeze(3))  
        
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
        pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
        pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)
        
        cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim=1))
            
        return self.fc(cat)

In [29]:
kernel_sizes = [3, 4, 5]
vocab_size = len(TEXT.vocab)
out_channels=64
dropout = 0.5
dim = 300

model = CNN(vocab_size=vocab_size, emb_dim=dim, out_channels=out_channels,
            kernel_sizes=kernel_sizes, dropout=dropout)

In [30]:
model.to(device)

CNN(
  (embedding): Embedding(202264, 300)
  (conv_0): Conv2d(1, 64, kernel_size=(3, 300), stride=(1, 1))
  (conv_1): Conv2d(1, 64, kernel_size=(4, 300), stride=(1, 1))
  (conv_2): Conv2d(1, 64, kernel_size=(5, 300), stride=(1, 1))
  (fc): Linear(in_features=192, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

In [31]:
opt = torch.optim.Adam(model.parameters())
loss_func = nn.BCEWithLogitsLoss()

In [32]:
max_epochs = 30

Обучаем

In [34]:
import numpy as np

min_loss = np.inf
patience = 5
cur_patience = 0

for epoch in range(1, max_epochs + 1):
    train_loss = 0.0
    model.train()
    pbar = tqdm(enumerate(train_iter), total=len(train_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar: 
        opt.zero_grad()
        pred = model(batch.text).squeeze(1)
        loss = loss_func(pred,batch.label)
        loss.backward()
        opt.step()
        train_loss+=loss 
    train_loss /= len(train_iter)
    val_loss = 0.0
    model.eval()
    pbar = tqdm(enumerate(valid_iter), total=len(valid_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar:
        pred = model(batch.text).squeeze(1)
        loss = loss_func(pred,batch.label)
        val_loss+=loss
    val_loss /= len(valid_iter)
    if val_loss < min_loss:
        min_loss = val_loss
        best_model = model.state_dict()
    else:
        cur_patience += 1
        if cur_patience == patience:
            cur_patience = 0
            break
    
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, train_loss, val_loss))
model.load_state_dict(best_model)

  0%|          | 0/137 [00:00<?, ?it/s]



  0%|          | 0/59 [00:00<?, ?it/s]

Epoch: 1, Training Loss: 0.09872795641422272, Validation Loss: 0.3553328514099121


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/59 [00:00<?, ?it/s]

Epoch: 2, Training Loss: 0.06532081961631775, Validation Loss: 0.39188867807388306


  0%|          | 0/137 [00:00<?, ?it/s]

KeyboardInterrupt: ignored

In [36]:
model.load_state_dict(best_model)

<All keys matched successfully>

Посчитаем f1-score CNN классификатора.



In [37]:
from sklearn.metrics import f1_score

f1 = 0
model.eval()

pbar = tqdm(enumerate(test_iter), total=len(test_iter), leave=False)
pbar.set_description(f"Epoch {epoch}")

for it, batch in pbar: 
    pred = model(batch.text).squeeze()
    pred = torch.round(torch.sigmoid(pred.detach()))
    f1 += f1_score(batch.label.cpu(),pred.cpu())
f1/=len(test_iter)

print(f'F1 score : {f1}')

  0%|          | 0/196 [00:00<?, ?it/s]



F1 score : 0.825784565363231


## Интерпретируемость

Посмотрим, куда смотрит наша модель. Достаточно запустить код ниже.

In [38]:
!pip install -q captum

[?25l[K     |▎                               | 10 kB 35.2 MB/s eta 0:00:01[K     |▌                               | 20 kB 36.1 MB/s eta 0:00:01[K     |▊                               | 30 kB 41.5 MB/s eta 0:00:01[K     |█                               | 40 kB 27.6 MB/s eta 0:00:01[K     |█▏                              | 51 kB 16.6 MB/s eta 0:00:01[K     |█▍                              | 61 kB 14.2 MB/s eta 0:00:01[K     |█▋                              | 71 kB 13.6 MB/s eta 0:00:01[K     |██                              | 81 kB 14.8 MB/s eta 0:00:01[K     |██▏                             | 92 kB 13.7 MB/s eta 0:00:01[K     |██▍                             | 102 kB 12.6 MB/s eta 0:00:01[K     |██▋                             | 112 kB 12.6 MB/s eta 0:00:01[K     |██▉                             | 122 kB 12.6 MB/s eta 0:00:01[K     |███                             | 133 kB 12.6 MB/s eta 0:00:01[K     |███▎                            | 143 kB 12.6 MB/s eta 0:

In [49]:
from captum.attr import LayerIntegratedGradients, TokenReferenceBase, visualization

PAD_IND = TEXT.vocab.stoi['pad']

token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)
lig = LayerIntegratedGradients(model, model.embedding)

In [50]:
def forward_with_softmax(inp):
    logits = model(inp)
    return torch.softmax(logits, 0)[0][1]

def forward_with_sigmoid(input):
    return torch.sigmoid(model(input))


# accumalate couple samples in this array for visualization purposes
vis_data_records_ig = []

def interpret_sentence(model, sentence, min_len = 7, label = 0):
    model.eval()
    text = [tok for tok in TEXT.tokenize(sentence)]
    if len(text) < min_len:
        text += ['pad'] * (min_len - len(text))
    indexed = [TEXT.vocab.stoi[t] for t in text]

    model.zero_grad()

    input_indices = torch.tensor(indexed, device=device)
    input_indices = input_indices.unsqueeze(0)
    
    # input_indices dim: [sequence_length]
    seq_length = min_len

    # predict
    pred = forward_with_sigmoid(input_indices).item()
    pred_ind = round(pred)

    # generate reference indices for each sample
    reference_indices = token_reference.generate_reference(seq_length, device=device).unsqueeze(0)

    # compute attributions and approximation delta using layer integrated gradients
    attributions_ig, delta = lig.attribute(input_indices, reference_indices, \
                                           n_steps=5000, return_convergence_delta=True)

    print('pred: ', LABEL.vocab.itos[pred_ind], '(', '%.2f'%pred, ')', ', delta: ', abs(delta))

    add_attributions_to_visualizer(attributions_ig, text, pred, pred_ind, label, delta, vis_data_records_ig)
    
def add_attributions_to_visualizer(attributions, text, pred, pred_ind, label, delta, vis_data_records):
    attributions = attributions.sum(dim=2).squeeze(0)
    attributions = attributions / torch.norm(attributions)
    attributions = attributions.cpu().detach().numpy()

    # storing couple samples in an array for visualization purposes
    vis_data_records.append(visualization.VisualizationDataRecord(
                            attributions,
                            pred,
                            LABEL.vocab.itos[pred_ind],
                            LABEL.vocab.itos[label],
                            LABEL.vocab.itos[1],
                            attributions.sum(),       
                            text,
                            delta))

In [51]:
interpret_sentence(model, 'It was a fantastic performance !', label=1)
interpret_sentence(model, 'Best film ever', label=1)
interpret_sentence(model, 'Such a great show!', label=1)
interpret_sentence(model, 'It was a horrible movie', label=0)
interpret_sentence(model, 'I\'ve never watched something as bad', label=0)
interpret_sentence(model, 'It is a disgusting movie!', label=0)
interpret_sentence(model, 'This films is bullshit', label=0)
interpret_sentence(model, 'It is fantastic bullshit', label=0)

pred:  pos ( 0.89 ) , delta:  tensor([0.0001], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.01 ) , delta:  tensor([1.2154e-05], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.52 ) , delta:  tensor([0.0001], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.01 ) , delta:  tensor([4.8771e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.06 ) , delta:  tensor([0.0002], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.03 ) , delta:  tensor([7.7810e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.04 ) , delta:  tensor([1.1684e-06], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.28 ) , delta:  tensor([6.0081e-05], device='cuda:0', dtype=torch.float64)


Попробуйте добавить свои примеры!

In [52]:
print('Visualize attributions based on Integrated Gradients')
visualization.visualize_text(vis_data_records_ig)

Visualize attributions based on Integrated Gradients


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
pos,pos (0.89),pos,1.6,It was a fantastic performance ! pad
,,,,
pos,neg (0.01),pos,1.05,Best film ever pad pad pad pad
,,,,
pos,pos (0.52),pos,1.17,Such a great show! pad pad pad
,,,,
neg,neg (0.01),pos,0.29,It was a horrible movie pad pad
,,,,
neg,neg (0.06),pos,0.96,I've never watched something as bad pad
,,,,


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
pos,pos (0.89),pos,1.6,It was a fantastic performance ! pad
,,,,
pos,neg (0.01),pos,1.05,Best film ever pad pad pad pad
,,,,
pos,pos (0.52),pos,1.17,Such a great show! pad pad pad
,,,,
neg,neg (0.01),pos,0.29,It was a horrible movie pad pad
,,,,
neg,neg (0.06),pos,0.96,I've never watched something as bad pad
,,,,


## Эмбеддинги слов

Вы ведь не забыли, как мы можем применить знания о word2vec и GloVe. Давайте попробуем!

In [55]:
TEXT.build_vocab(trn, vectors=GloVe())

LABEL.build_vocab(trn)

word_embeddings = TEXT.vocab.vectors

kernel_sizes = [3, 4, 5]
vocab_size = len(TEXT.vocab)
dropout = 0.5
dim = 300

.vector_cache/glove.840B.300d.zip: 2.18GB [06:53, 5.26MB/s]                            
100%|█████████▉| 2196016/2196017 [03:53<00:00, 9390.29it/s]


In [None]:
train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split(random_state=random.seed(SEED))

device = "cuda" if torch.cuda.is_available() else "cpu"
        return self.fc(hidden)
train_iter, valid_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_sizes=128,
        sort=False,
        sort_key= lambda x: len(x.src),
        sort_within_batch=False,
        device=device,
        repeat=False,
)

In [58]:
model = CNN(vocab_size=vocab_size, emb_dim=dim, out_channels=64,
            kernel_sizes=kernel_sizes, dropout=dropout)

word_embeddings = TEXT.vocab.vectors

prev_shape = model.embedding.weight.shape

model.embedding.weight.data.copy_(word_embeddings)

assert prev_shape == model.embedding.weight.shape
model.to(device)

opt = torch.optim.Adam(model.parameters())

Обучаем.

In [59]:
import numpy as np

min_loss = np.inf

cur_patience = 0

for epoch in range(1, max_epochs + 1):
    train_loss = 0.0
    model.train()
    pbar = tqdm(enumerate(train_iter), total=len(train_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar: 
        opt.zero_grad()
        pred = model(batch.text).squeeze(1)
        loss = loss_func(pred,batch.label)
        loss.backward()
        opt.step()
        train_loss+=loss 

    train_loss /= len(train_iter)
    val_loss = 0.0
    model.eval()
    pbar = tqdm(enumerate(valid_iter), total=len(valid_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar:
        pred = model(batch.text).squeeze(1)
        loss = loss_func(pred,batch.label)
        val_loss+=loss 
    val_loss /= len(valid_iter)
    if val_loss < min_loss:
        min_loss = val_loss
        best_model = model.state_dict()
    else:
        cur_patience += 1
        if cur_patience == patience:
            cur_patience = 0
            break
    
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, train_loss, val_loss))
model.load_state_dict(best_model)

  0%|          | 0/137 [00:00<?, ?it/s]



  0%|          | 0/59 [00:00<?, ?it/s]

Epoch: 1, Training Loss: 0.492378294467926, Validation Loss: 0.3421889543533325


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/59 [00:00<?, ?it/s]

Epoch: 2, Training Loss: 0.2996968626976013, Validation Loss: 0.29726365208625793


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/59 [00:00<?, ?it/s]

Epoch: 3, Training Loss: 0.18376262485980988, Validation Loss: 0.2857840955257416


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/59 [00:00<?, ?it/s]

Epoch: 4, Training Loss: 0.08877665549516678, Validation Loss: 0.30427122116088867


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/59 [00:00<?, ?it/s]

Epoch: 5, Training Loss: 0.03441455587744713, Validation Loss: 0.3363267183303833


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/59 [00:00<?, ?it/s]

Epoch: 6, Training Loss: 0.014137471094727516, Validation Loss: 0.36563611030578613


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/59 [00:00<?, ?it/s]

Epoch: 7, Training Loss: 0.007308458909392357, Validation Loss: 0.3880952298641205


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/59 [00:00<?, ?it/s]

<All keys matched successfully>

Посчитаем f1-score  классификатора.


In [60]:
from sklearn.metrics import f1_score

f1 = 0
model.eval()

pbar = tqdm(enumerate(test_iter), total=len(test_iter), leave=False)
pbar.set_description(f"Epoch {epoch}")

for it, batch in pbar: 
    pred = model(batch.text).squeeze()
    pred = torch.round(torch.sigmoid(pred.detach()))
    f1 += f1_score(batch.label.cpu(),pred.cpu())
f1/=len(test_iter)

print(f'F1 score : {f1}')

  0%|          | 0/196 [00:00<?, ?it/s]



F1 score : 0.8623392789262905


Проверим насколько все хорошо!

In [63]:
PAD_IND = TEXT.vocab.stoi['pad']

token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)
lig = LayerIntegratedGradients(model, model.embedding)
vis_data_records_ig = []

interpret_sentence(model, 'It was a fantastic performance !', label=1)
interpret_sentence(model, 'Best film ever', label=1)
interpret_sentence(model, 'Such a great show!', label=1)
interpret_sentence(model, 'It was a horrible movie', label=0)
interpret_sentence(model, 'I\'ve never watched something as bad', label=0)
interpret_sentence(model, 'It is a disgusting movie!', label=0)
interpret_sentence(model, 'This films is bullshit', label=0)
interpret_sentence(model, 'It is fantastic bullshit', label=0)

pred:  pos ( 0.99 ) , delta:  tensor([0.0002], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.35 ) , delta:  tensor([1.2456e-06], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.80 ) , delta:  tensor([7.3927e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.01 ) , delta:  tensor([0.0001], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.44 ) , delta:  tensor([3.7723e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([6.8266e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([9.7082e-06], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.01 ) , delta:  tensor([7.8175e-05], device='cuda:0', dtype=torch.float64)


In [64]:
print('Visualize attributions based on Integrated Gradients')
visualization.visualize_text(vis_data_records_ig)

Visualize attributions based on Integrated Gradients


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
pos,pos (0.99),pos,1.42,It was a fantastic performance ! pad
,,,,
pos,neg (0.35),pos,0.62,Best film ever pad pad pad pad
,,,,
pos,pos (0.80),pos,1.47,Such a great show! pad pad pad
,,,,
neg,neg (0.01),pos,-0.7,It was a horrible movie pad pad
,,,,
neg,neg (0.44),pos,0.72,I've never watched something as bad pad
,,,,


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
pos,pos (0.99),pos,1.42,It was a fantastic performance ! pad
,,,,
pos,neg (0.35),pos,0.62,Best film ever pad pad pad pad
,,,,
pos,pos (0.80),pos,1.47,Such a great show! pad pad pad
,,,,
neg,neg (0.01),pos,-0.7,It was a horrible movie pad pad
,,,,
neg,neg (0.44),pos,0.72,I've never watched something as bad pad
,,,,


##  Предобученные эмбеддинги для RNN



In [36]:
TEXT.build_vocab(train, vectors="glove.6B.100d")
LABEL.build_vocab(train)
word_embeddings = TEXT.vocab.vectors

100%|█████████▉| 399999/400000 [00:16<00:00, 23785.42it/s]


In [37]:
train, test = datasets.IMDB.splits(TEXT, LABEL)  # загрузим датасет
train, valid = train.split(random_state=random.seed(SEED))  # разобьем на части



In [38]:
vocab_size = len(TEXT.vocab)
emb_dim = 100
hidden_dim = 256
output_dim = 1
n_layers = 2
bidirectional = True
dropout = 0.4
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
patience=5

In [39]:
device = "cuda" if torch.cuda.is_available() else "cpu"

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test), 
    batch_size = 64,
    sort_within_batch = True,
    device = device)



In [49]:
model = RNNBaseline(
    vocab_size=vocab_size,
    embedding_dim=emb_dim,
    hidden_dim=hidden_dim,
    output_dim=output_dim,
    n_layers=n_layers,
    bidirectional=bidirectional,
    dropout=dropout,
    pad_idx=PAD_IDX
)

In [50]:
word_embeddings.shape

torch.Size([202264, 100])

In [57]:
model = model.to(device)

word_embeddings = TEXT.vocab.vectors

prev_shape = model.embedding.weight.shape

model.embedding.weight.data.copy_(word_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],
       device='cuda:0')

In [58]:
opt = torch.optim.Adam(model.parameters())
loss_func = nn.BCEWithLogitsLoss()

max_epochs = 20
num_freeze_iter = 250

In [59]:
def freeze_embeddings(model, req_grad=False):
    embeddings = model.embedding
    for c_p in embeddings.parameters():
        c_p.requires_grad = req_grad

Обучаем

In [60]:
import numpy as np

min_loss = np.inf

cur_patience = 0

freeze_embeddings(model, req_grad=False)

for epoch in range(1, max_epochs + 1):
    train_loss = 0.0
    model.train()
    pbar = tqdm(enumerate(train_iter), total=len(train_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar: 
        if it > num_freeze_iter and epoch < 1:
                freeze_embeddings(model, True)

        opt.zero_grad()
        text, length = batch.text
        target = batch.label
        pred = model(text,length).squeeze(1)
        loss =  loss_func(pred,target)
        train_loss+=loss
        loss.backward()
        opt.step()
    train_loss /= len(train_iter)
    val_loss = 0.0
    model.eval()
    pbar = tqdm(enumerate(valid_iter), total=len(valid_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar:
        text, length = batch.text
        target = batch.label
        pred = model(text,length).squeeze(1)
        loss =  loss_func(pred,target)
        val_loss +=loss
    val_loss /= len(valid_iter)
    if val_loss < min_loss:
        min_loss = val_loss
        best_model = model.state_dict()
    else:
        cur_patience += 1
        if cur_patience == patience:
            cur_patience = 0
            break
    
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, train_loss, val_loss))
model.load_state_dict(best_model)

  0%|          | 0/274 [00:00<?, ?it/s]



  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 1, Training Loss: 0.6262624859809875, Validation Loss: 0.6076393723487854


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 2, Training Loss: 0.6589497923851013, Validation Loss: 0.6429558992385864


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 3, Training Loss: 0.5897068977355957, Validation Loss: 0.535618782043457


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 4, Training Loss: 0.49889516830444336, Validation Loss: 0.45260748267173767


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 5, Training Loss: 0.42533987760543823, Validation Loss: 0.4282582700252533


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 6, Training Loss: 0.4011823832988739, Validation Loss: 0.35448482632637024


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 7, Training Loss: 0.37507784366607666, Validation Loss: 0.45309382677078247


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 8, Training Loss: 0.3626154959201813, Validation Loss: 0.33058393001556396


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 9, Training Loss: 0.34352949261665344, Validation Loss: 0.36345547437667847


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 10, Training Loss: 0.3280170261859894, Validation Loss: 0.31421732902526855


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 11, Training Loss: 0.3256390392780304, Validation Loss: 0.31421059370040894


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 12, Training Loss: 0.30632832646369934, Validation Loss: 0.2995014786720276


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 13, Training Loss: 0.2917752265930176, Validation Loss: 0.2963368594646454


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 14, Training Loss: 0.2874944508075714, Validation Loss: 0.2939200699329376


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 15, Training Loss: 0.2718098759651184, Validation Loss: 0.34172213077545166


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

<All keys matched successfully>

Посчитаем f1-score



In [61]:
from sklearn.metrics im port f1_score

f1 = 0
model.eval()

pbar = tqdm(enumerate(test_iter), total=len(test_iter), leave=False)
pbar.set_description(f"Epoch {epoch}")

for it, batch in pbar: 
    pred = model(*batch.text).squeeze(1)
    pred = torch.round(torch.sigmoid(pred.detach()))
    f1 += f1_score(batch.label.cpu(),pred.cpu())
f1/=len(test_iter)

print(f'F1 score : {f1}')

  0%|          | 0/391 [00:00<?, ?it/s]



F1 score : 0.8345359177058754


# Вывод



F1-мера полученных моделей

||RNN|CNN|
|---|--|--|
|Обычные эмбеддинги|0.8158 |0.8258|
|Предобученные эмбеддинги|0.8345 | 0.8623

RNN и CNN показывают схожее качество класификации.

Использование предобученных эмбеддингов дает прирост качества для обеих архитектур. 


Это видно и на примерах: CNN с GloVe значет что слово bullshit окрашено негативно, а без предобученных эмбеддингов - не знает.

Глядя на то, как менялся лосс при обучении RNN с эмбеддингами от GloVe, я подозреваю, что что то сделал не так) Ну или надо было на дольше их замораживать. И тем не менее, качество выросло.

Кроме того, в RNN используются эмбеддинги меньшей размерности, чем CNN. Это может объяснять отствание по качеству.

