# Домашнее задание 2

### Описание

В вашем распоряжении датасет с русскоязычными отзывами о мобильных телефонах с выставленным рейтингом от 1 до 5.
Ключевая задача – обучить любую модель регрессии (или классификации, если решите таким путём пойти) из пакетов scikit, XGBoost, LightGBM, CatBoost.


Необходимая метрика:

1. Со звёздочкой (дополнительный балл) – MAE <= 0.5
2. Минимальное допустимое значение – МАЕ <= 1.0

### Что необходимо сделать

1. Откройте датасет
2. Разделите на обучение и тест
3. Осуществите лемматизацию с помощью любого из озвученных на занятии инструментов 
4. Обучение одну или несколько моделей машинного обучения на разных представлениях данных
5. Валидируйте модель. Если модель соответствует условиям метрик, то работа завершена. В ином случае, экспериментируйте, начиная с пункта 7. 
6. По всем попыткам обучить качественную модель пишите свои выводы и замечания, почему так получилось.


## 0. Импорт библиотк, определение констант

In [3]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
stopwords = set(stopwords.words('english'))

import torch

from torch import nn
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader

import torch.nn.functional as F
import torch.optim as optim

from sklearn.metrics import classification_report, confusion_matrix

import os
from tqdm import tqdm
tqdm.pandas()
from collections import Counter

[nltk_data] Downloading package stopwords to /home/tiv/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/tiv/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/tiv/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Загрузка и обработка данных

In [4]:
if os.path.exists("data/data_lemma_cleared.csv"):
    df = pd.read_csv("data/data_lemma_cleared.csv", engine='python')

df.head()   

Unnamed: 0,Review,Rating,lemma
0,3d touch просто восхитительная вещь заряд дер...,5.0,3d touch просто восхитительный вещь заряд держ...
1,отключается при температуре близкой к нулю не...,4.0,отключаться температура близкий нуль непонятно...
2,в apple окончательно решили не заморачиваться ...,3.0,apple окончательно решить не заморачиваться де...
3,постарался наиболее ёмко и коротко описать все...,4.0,постараться наиболее ёмко коротко описать всё ...
4,достойный телефон пользоваться одно удовольст...,5.0,достойный телефон пользоваться удовольствие


In [5]:
df = df.drop('Review', axis=1)
df = df.dropna()
df.columns = ['label', 'review']
df.shape

(319791, 2)

### Оставлю только небольшой кусок данных на время разработки модели

In [6]:
df = df[:100000]

In [7]:
reviews = df.review.values
words = ' '.join(reviews)
words = words.split()

print(len(words))
words[:10]

4356550


['3d',
 'touch',
 'просто',
 'восхитительный',
 'вещь',
 'заряд',
 'держать',
 'целый',
 'день',
 'розовый']

In [8]:
counter = Counter(words)
vocab = sorted(counter, key=counter.get, reverse=True)
int2word = dict(enumerate(vocab, 1))
int2word[0] = '<PAD>'
word2int = {word: id for id, word in int2word.items()}

In [9]:
len(word2int)

118745

In [10]:
reviews_enc = []

for review in tqdm(reviews):
    reviews_enc += [[]]
    
    for word in review.split():
        reviews_enc[-1].append(word2int[word])

for i in range(5):
    print(reviews_enc[i][:5])

100%|██████████| 100000/100000 [00:00<00:00, 104252.93it/s]

[1382, 1594, 22, 5659, 421]
[1050, 2137, 803, 2558, 1325]
[434, 1979, 133, 1, 2228]
[1686, 2872, 47509, 4322, 798]
[204, 2, 20, 629]





In [11]:
def pad_features(reviews, pad_id, seq_length=128):
    features = np.full((len(reviews), 
                        seq_length), 
                       pad_id, 
                       dtype=int)

    for i, row in enumerate(reviews):
        features[i, :len(row)] = np.array(row)[:seq_length]

    return features

seq_length = 256
features = pad_features(reviews_enc, 
                        pad_id=word2int['<PAD>'], 
                        seq_length=seq_length)

assert len(features) == len(reviews_enc)
assert len(features[0]) == seq_length

features.shape

(100000, 256)

In [12]:
labels = df.label.to_numpy()
labels

array([5., 4., 3., ..., 4., 1., 5.])

In [13]:
train_size = .7
val_size = .5

split_id = int(len(features) * train_size)
train_x, remain_x = features[:split_id], features[split_id:]
train_y, remain_y = labels[:split_id], labels[split_id:]

split_val_id = int(len(remain_x) * val_size)
val_x, test_x = remain_x[:split_val_id], remain_x[split_val_id:]
val_y, test_y = remain_y[:split_val_id], remain_y[split_val_id:]

print('Feature Shapes:')
print('===============')
print('Train set: {}'.format(train_x.shape))
print('Validation set: {}'.format(val_x.shape))
print('Test set: {}'.format(test_x.shape))

Feature Shapes:
Train set: (70000, 256)
Validation set: (15000, 256)
Test set: (15000, 256)


In [14]:
batch_size  = 64

trainset = TensorDataset(torch.from_numpy(train_x), 
                         torch.from_numpy(train_y))


validset = TensorDataset(torch.from_numpy(val_x), 
                         torch.from_numpy(val_y))

testset = TensorDataset(torch.from_numpy(test_x), 
                        torch.from_numpy(test_y))

train_iterator = DataLoader(trainset, 
                            shuffle=True, 
                            batch_size=batch_size)

valid_iterator = DataLoader(validset, 
                            shuffle=True, 
                            batch_size=batch_size)

test_iterator = DataLoader(testset, 
                           shuffle=True, 
                           batch_size=batch_size)


## CNN

### Функция подсчета accuracy

In [15]:
def binary_accuracy(preds, y):
    rounded_preds = torch.round(F.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc

In [16]:
def mae(preds, y):
    rounded_preds = torch.round(preds)
    error = torch.mean(torch.abs(rounded_preds - y).float())
    return error

### Функция обучения сети

In [17]:

def train_func(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.train()
    model.cuda()
    
    for batch in iterator:
        optimizer.zero_grad()
        
        predictions = model(batch[0].T.cuda()).squeeze(1)
        
        loss = criterion(predictions.float(), 
                          batch[1].float().cuda())
        
        acc = mae(predictions.float(), 
                              batch[1].float().cuda())
        
        loss.backward()
        optimizer.step()

        epoch_loss += loss
        epoch_acc += acc

    return epoch_loss / len(iterator), epoch_acc / len(iterator)


In [18]:
def evaluate_func(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in iterator:
            predictions = model(batch[0].T.cuda()).squeeze(1)
            
            loss = criterion(predictions.float(), 
                              batch[1].float().cuda())
            
            acc = mae(predictions.float(), 
                                  batch[1].float().cuda())
            
            epoch_loss += loss
            epoch_acc += acc

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Архитектура сети

In [19]:
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, 
                 vocab_size,
                 embedding_dim, 
                 n_filters, 
                 filter_sizes, 
                 output_dim, 
                 dropout):
        
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, 
                                      embedding_dim)
        
        self.conv_0 = nn.Conv2d(in_channels=1, 
                                out_channels=n_filters, 
                                kernel_size=(filter_sizes[0], 
                                             embedding_dim))
        
        self.conv_1 = nn.Conv2d(in_channels=1, 
                                out_channels=n_filters, 
                                kernel_size=(filter_sizes[1], 
                                             embedding_dim))
        
        self.conv_2 = nn.Conv2d(in_channels=1, 
                                out_channels=n_filters, 
                                kernel_size=(filter_sizes[2], 
                                             embedding_dim))
        
        self.fc = nn.Linear(len(filter_sizes) * n_filters, 
                            output_dim)
        
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        #x = [sent len, batch size]
        x = x.permute(1, 0)

        #x = [batch size, sent len]
        embedded = self.embedding(x)

        #embedded = [batch size, sent len, emb dim]
        embedded = embedded.unsqueeze(1)

        #embedded = [batch size, 1, sent len, emb dim]
        conved_0 = F.relu(self.conv_0(embedded).squeeze(3))
        conved_1 = F.relu(self.conv_1(embedded).squeeze(3))
        conved_2 = F.relu(self.conv_2(embedded).squeeze(3))

        #conv_n = [batch size, n_filters, sent len - filter_sizes[n]]
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
        pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
        pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)

        #pooled_n = [batch size, n_filters]
        cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim=1))

        #cat = [batch size, n_filters * len(filter_sizes)]
        return self.fc(cat)
    

In [20]:

INPUT_DIM = len(word2int)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [3,4,5]
OUTPUT_DIM = 1
DROPOUT = 0.5

model = CNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            N_FILTERS, 
            FILTER_SIZES, 
            OUTPUT_DIM, 
            DROPOUT)

In [21]:

optimizer = optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()

model = model.cuda()
model

CNN(
  (embedding): Embedding(118745, 100)
  (conv_0): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
  (conv_1): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
  (conv_2): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
  (fc): Linear(in_features=300, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

In [22]:
N_EPOCHS = 0

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_func(model, 
                                       train_iterator, 
                                       optimizer, 
                                       criterion)
    
    valid_loss, valid_acc = evaluate_func(model, 
                                          valid_iterator, 
                                          criterion)
    
    train_msg = f'Epoch: {epoch+1:02}, '
    train_msg += f'Train Loss: {train_loss:.3f}, '
    train_msg += f'Train Acc: {train_acc:.2f}, '
    train_msg += f'Val. Loss: {valid_loss:.3f}, '
    train_msg += f'Val. Acc: {valid_acc:.2f}'
    
    print(train_msg)

In [23]:

test_loss , test_acc = evaluate_func(model, 
                                     test_iterator, 
                                     criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc:.2f}')


Test Loss: 26.029, Test Acc: 4.89


## RNN

In [24]:
dataiter = iter(train_iterator)
x, y = next(dataiter)

print('Sample batch size: ', x.size()) 
print('Sample batch input: \n', x)
print()
print('Sample label size: ', y.size())
print('Sample label input: \n', y)

Sample batch size:  torch.Size([64, 256])
Sample batch input: 
 tensor([[   5,   92,    2,  ...,    0,    0,    0],
        [1032,  192,    4,  ...,    0,    0,    0],
        [2871,   86,  121,  ...,    0,    0,    0],
        ...,
        [ 257,  254,    9,  ...,    0,    0,    0],
        [   2,   87,  164,  ...,    0,    0,    0],
        [  25,  146,  212,  ...,    0,    0,    0]])

Sample label size:  torch.Size([64])
Sample label input: 
 tensor([5., 5., 5., 5., 5., 5., 2., 5., 5., 5., 5., 4., 5., 5., 5., 4., 3., 3.,
        1., 1., 4., 5., 5., 5., 3., 5., 5., 5., 4., 5., 4., 5., 4., 5., 5., 5.,
        1., 3., 5., 5., 5., 5., 1., 4., 5., 4., 5., 5., 4., 4., 5., 2., 5., 1.,
        5., 2., 4., 5., 3., 5., 1., 4., 1., 3.], dtype=torch.float64)


### Моделирование

In [25]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#device = 'cpu'
print(device)

cuda


In [26]:
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super(Attention, self).__init__()
        self.hidden_size = hidden_size
        self.attn_weights = nn.Parameter(torch.Tensor(hidden_size, 1))
        nn.init.uniform_(self.attn_weights, -0.1, 0.1)

    def forward(self, encoder_outputs):
        """
        Args:
            encoder_outputs (torch.Tensor): Тензор размерности (batch_size, max_length, hidden_size).
        Returns:
            torch.Tensor: Взвешенная сумма encoder_outputs с учетом весов Attention.
        """
        # Рассчитываем веса Attention
        attn_energies = torch.bmm(encoder_outputs, self.attn_weights.unsqueeze(0).expand(encoder_outputs.size(0), *self.attn_weights.size()))
        attn_weights = F.softmax(attn_energies, dim=1)

        # Выполняем взвешенную сумму
        context = torch.bmm(attn_weights.transpose(1, 2), encoder_outputs)

        return context.squeeze(1)

In [34]:
class SentimentModel(nn.Module):
    def __init__(self, vocab_size, output_size, hidden_size=128, 
                 embedding_size=400, n_layers=2, dropout=0.2):
        
        super(SentimentModel, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_size)

        self.lstm = nn.LSTM(embedding_size, hidden_size, n_layers, 
                            dropout=dropout, batch_first=True)

        self.attention = Attention(hidden_size)
        
        self.dropout = nn.Dropout(dropout)

        self.fc = nn.Linear(hidden_size, output_size)
        
        self.fc2 = nn.Linear(hidden_size, hidden_size)

        self.bn = nn.BatchNorm1d(hidden_size)
        
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        
        # convert feature to long
        x = x.long()

        # map input to vector
        x = self.embedding(x)
        
        # pass forward to lstm
        x, _ =  self.lstm(x)
        x = x[:, -1, :]
        
        # x = self.attention(x) не взлетело
        x = self.bn(x)
        
        x = self.fc2(x)
        
        x = self.dropout(x)     
        
        x = self.fc2(x)
        
        x = self.dropout(x)     
 
        x = self.fc2(x)
        
        x = self.dropout(x)    
        
        x = self.fc2(x)
        
        x = self.dropout(x)    
        
        x = self.fc2(x)
        
        x = self.dropout(x)    
        
        x = self.fc2(x)
        
        x = self.dropout(x)    
                   
        x = self.fc(x)
        
        return x

In [35]:
history = {
    'train_loss': [],
    'train_acc': [],
    'train_mae': [],
    'val_loss': [],
    'val_acc': [],
    'val_mae': []
}

lr = 0.003
es_limit = 5
vocab_size = len(word2int)
output_size = 1
embedding_size = 256
hidden_size = 512
n_layers = 3
dropout=0.25
epochs = 10
grad_clip = 1
print_every = 1

model = SentimentModel(vocab_size, 
                       output_size, 
                       hidden_size, 
                       embedding_size, 
                       n_layers, 
                       dropout)

criterion = torch.nn.MSELoss()
optim = Adam(model.parameters(), lr=lr)

model = model.to(device)

epochloop = tqdm(range(epochs), position=0, desc='Training', leave=True)

# early stop trigger
es_trigger = 0
val_loss_min = 1000 #torch.inf

for e in epochloop:

    # Обучение
    
    model.train()

    train_loss = 0
    train_acc = 0
    train_mae = 0
    
    for id_, (feature, target) in enumerate(train_iterator):
        
        epochloop.set_postfix_str(f'Training batch {id_}/{len(train_iterator)}')

        feature, target = feature.to(device), target.to(device)

        optim.zero_grad()

        out = model(feature)
        #print(out[:5])
        #predicted = torch.tensor([1 if i == True else 0 for i in out > 0.5], device=device)
        #predicted = torch.tensor(torch.round(out), device=device)
        predicted = torch.round(out.squeeze().clone().detach())
        #predicted = out.clone().detach()
        #predicted = torch.tensor(out, device=device)
        #print('-------- OUT')
        #print(out)        
        #print('--------- OUT squeeze')
        #print(out.squeeze(),)       
        #print('PREDICTED')
        #print(predicted)
        #print('TARGET')
        #print(target)
        
        equals = predicted == target
        acc = torch.mean(equals.type(torch.FloatTensor))
        train_acc += acc.item()
        
        mae = torch.mean(torch.abs(predicted - target).float())
        train_mae += mae.item()
        
        loss = criterion(out.squeeze(), target.float())
        train_loss += loss.item()
        loss.backward()

        nn.utils.clip_grad_norm_(model.parameters(), grad_clip)

        optim.step()

        del feature, target, predicted

    history['train_loss'].append(train_loss / len(train_iterator))
    history['train_acc'].append(train_acc / len(train_iterator))
    history['train_mae'].append(train_mae / len(train_iterator))

    # Валидация
    model.eval()

    val_loss = 0
    val_acc = 0
    val_mae = 0
    
    with torch.no_grad():
        for id_, (feature, target) in enumerate(valid_iterator):
            epochloop.set_postfix_str(f'Validation batch {id_}/{len(valid_iterator)}')
            
            feature, target = feature.to(device), target.to(device)

            out = model(feature)

             
            predicted = torch.round(out.clone().detach())
            
            equals = predicted == target
            acc = torch.mean(equals.type(torch.FloatTensor))
            val_acc += acc.item()
            
            mae = torch.mean(torch.abs(predicted - target).float())
            val_mae += mae.item()
            
            loss = criterion(out.squeeze(), target.float())
            val_loss += loss.item()

            del feature, target, predicted

        history['val_loss'].append(val_loss / len(valid_iterator))
        history['val_acc'].append(val_acc / len(valid_iterator))
        history['val_mae'].append(val_mae / len(valid_iterator))
    
    # Возвращаем модель в режим обучения
    # Возвращаем модель в режим обучения
    model.train()

    info_str = f'Val Loss: {val_loss / len(valid_iterator):.3f} '
    info_str += f'| Val mae: {val_mae / len(valid_iterator):.3f}'
    epochloop.set_postfix_str(info_str)

    if (e+1) % print_every == 0:
        info_str = f'Epoch {e+1}/{epochs} | TRAIN Loss: {train_loss / len(train_iterator):.3f} '
        info_str += f' mae: {train_mae / len(train_iterator):.3f} '
        info_str += f' acc: {train_acc / len(train_iterator):.3f} '
        info_str += f'| VAL Loss: {val_loss / len(valid_iterator):.3f} '
        info_str += f' mae: {val_mae / len(valid_iterator):.3f}'
        info_str += f' acc: {val_acc / len(valid_iterator):.3f}'
        
        epochloop.write(info_str)
        epochloop.update()

    if val_loss / len(valid_iterator) <= val_loss_min:
        #torch.save(model.state_dict(), './sentiment_lstm.pt')
        val_loss_min = val_loss / len(valid_iterator)
        es_trigger = 0
    else:
        info_str = '[WARNING] Validation loss did not improved ('
        info_str += f'{val_loss_min:.3f} --> {val_loss / len(valid_iterator):.3f})'
        
        epochloop.write(info_str)
        es_trigger += 1

    if es_trigger >= es_limit:
        epochloop.write(f'Early stopped at Epoch-{e+1}')
        history['epochs'] = e+1
        break

Training:  20%|██        | 2/10 [01:32<12:18, 92.34s/it, Training batch 3/1094]           

Epoch 1/10 | TRAIN Loss: 2.597  mae: 1.259  acc: 0.231 | VAL Loss: 2.114  mae: 1.065 acc: 0.374


Training:  40%|████      | 4/10 [03:06<05:51, 58.66s/it, Training batch 2/1094]           

Epoch 2/10 | TRAIN Loss: 1.964  mae: 1.108  acc: 0.255 | VAL Loss: 1.946  mae: 1.310 acc: 0.155


Training:  50%|█████     | 5/10 [04:40<04:23, 52.69s/it, Training batch 3/1094]           

Epoch 3/10 | TRAIN Loss: 1.796  mae: 1.049  acc: 0.276 | VAL Loss: 1.878  mae: 1.087 acc: 0.377


Training:  70%|███████   | 7/10 [06:12<03:10, 63.41s/it, Training batch 2/1094]           

Epoch 4/10 | TRAIN Loss: 1.460  mae: 0.927  acc: 0.315 | VAL Loss: 1.221  mae: 1.181 acc: 0.184


Training:  80%|████████  | 8/10 [07:45<01:52, 56.01s/it, Training batch 3/1094]           

Epoch 5/10 | TRAIN Loss: 1.045  mae: 0.767  acc: 0.385 | VAL Loss: 0.951  mae: 1.231 acc: 0.264


Training:  90%|█████████ | 9/10 [09:18<01:04, 64.67s/it, Training batch 3/1094]           

Epoch 6/10 | TRAIN Loss: 0.879  mae: 0.695  acc: 0.429 | VAL Loss: 1.369  mae: 1.591 acc: 0.251


Training: 11it [10:51, 71.82s/it, Training batch 3/1094]                                   

Epoch 7/10 | TRAIN Loss: 0.826  mae: 0.657  acc: 0.459 | VAL Loss: 0.940  mae: 1.321 acc: 0.288


Training: 12it [12:23, 61.12s/it, Training batch 3/1094]           

Epoch 8/10 | TRAIN Loss: 0.797  mae: 0.641  acc: 0.470 | VAL Loss: 1.003  mae: 1.278 acc: 0.339


Training: 13it [13:56, 68.37s/it, Training batch 3/1094]           

Epoch 9/10 | TRAIN Loss: 0.818  mae: 0.647  acc: 0.469 | VAL Loss: 1.017  mae: 1.295 acc: 0.310


Training: 100%|██████████| 10/10 [15:29<00:00, 92.92s/it, Val Loss: 1.153 | Val mae: 1.412]

Epoch 10/10 | TRAIN Loss: 0.912  mae: 0.679  acc: 0.454 | VAL Loss: 1.153  mae: 1.412 acc: 0.200





In [37]:
test_mae = 0

with torch.no_grad():
        for id_, (feature, target) in enumerate(test_iterator):
            
            feature, target = feature.to(device), target.to(device)

            out = model(feature)
            
            predicted = torch.round(out.clone().detach())
            
            equals = predicted == target
            acc = torch.mean(equals.type(torch.FloatTensor))
            val_acc += acc.item()
            
            mae = torch.mean(torch.abs(predicted - target).float())
            test_mae += mae.item()
            
            
print(f' mae: {test_mae / len(test_iterator):.3f}')                        

 mae: 1.380
