<img src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" width=500, height=450>
<h3 style="text-align: center;"><b>Физтех-Школа Прикладной математики и информатики (ФПМИ) МФТИ</b></h3>

---

# Embeddings

Привет! В этом домашнем задании мы с помощью эмбеддингов решим задачу семантической классификации твитов.

Для этого мы воспользуемся предобученными эмбеддингами word2vec.

Student: Oleg Navolotsky / Наволоцкий Олег  
Stepik: https://stepik.org/users/2403189  
Telegram: [@mehwhatever0](https://t.me/mehwhatever0)  

 **Note**: reproducibility depends on [different things](https://pytorch.org/docs/stable/notes/randomness.html):
 >Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.

 Some used software versions:
 - PyTorch 1.7.0
 - NumPy 1.18.5
 - Gensim 3.8.3
 - NLTK 3.5
 - seaborn 0.10.1
 - pandas 1.0.5
 - scikit-learn 0.23.1
 - Python 3.8.3 (default, Jul  2 2020, 17:30:36) \[MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
 - NVIDIA Driver 460.79
 - NVIDIA CUDA 11.2
 - Windows 10 Pro 1909, build 18363.535


 Hardware:
 - i5 2500 8 gb
 - GTX 1060 6 gb

In [1]:
import os
import random

import numpy as np
import torch


SEED = 42


def enable_reproducibility(
        seed=SEED, raise_if_no_deterministic=True,
        cudnn_deterministic=True, disable_cudnn_benchmarking=True):
    # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
    if raise_if_no_deterministic:
        torch.set_deterministic(True)

    # https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ":4096:8"
    
    if disable_cudnn_benchmarking:
        torch.backends.cudnn.benchmark = False
    if cudnn_deterministic:
        torch.backends.cudnn.deterministic = True

    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

Для начала скачаем датасет для семантической классификации твитов:

In [40]:
# !gdown https://drive.google.com/uc?id=1eE1FiUkXkcbw0McId4i7qY-L8hH-_Qph&export=download
# !unzip archive.zip

Импортируем нужные библиотеки:

In [2]:
import math
import random
import string

import numpy as np
import pandas as pd
import seaborn as sns

import torch
import nltk
import gensim
import gensim.downloader as api

scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()


In [3]:
enable_reproducibility()

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE

device(type='cuda')

In [4]:
data = pd.read_csv("training.1600000.processed.noemoticon.csv", encoding="latin", header=None, names=["emotion", "id", "date", "flag", "user", "text"])

Посмотрим на данные:

In [5]:
data.head()

Unnamed: 0,emotion,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


Выведем несколько примеров твитов, чтобы понимать, с чем мы имеем дело:

In [6]:
examples = data["text"].sample(10)
print("\n".join(examples))

@chrishasboobs AHHH I HOPE YOUR OK!!! 
@misstoriblack cool , i have no tweet apps  for my razr 2
@TiannaChaos i know  just family drama. its lame.hey next time u hang out with kim n u guys like have a sleepover or whatever, ill call u
School email won't open  and I have geography stuff on there to revise! *Stupid School* :'(
upper airways problem 
Going to miss Pastor's sermon on Faith... 
on lunch....dj should come eat with me 
@piginthepoke oh why are you feeling like that? 
gahh noo!peyton needs to live!this is horrible 
@mrstessyman thank you glad you like it! There is a product review bit on the site  Enjoy knitting it!


Как вилим, тексты твитов очень "грязные". Нужно предобработать датасет, прежде чем строить для него модель классификации.

Чтобы сравнивать различные методы обработки текста/модели/прочее, разделим датасет на dev(для обучения модели) и test(для получения качества модели).

In [7]:
indexes = np.arange(data.shape[0])
np.random.shuffle(indexes)
dev_size = math.ceil(data.shape[0] * 0.8)

dev_indexes = indexes[:dev_size]
test_indexes = indexes[dev_size:]

dev_data = data.iloc[dev_indexes]
test_data = data.iloc[test_indexes]

dev_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

## Обработка текста

Токенизируем текст, избавимся от знаков пунктуации и выкинем все слова, состоящие менее чем из 4 букв:

In [8]:
tokenizer = nltk.WordPunctTokenizer()
line = tokenizer.tokenize(dev_data["text"][0].lower())
print(" ".join(line))

@ claire_nelson i ' m on the north devon coast the next few weeks will be down in devon again in may sometime i hope though !


In [9]:
filtered_line = [w for w in line if all(c not in string.punctuation for c in w) and len(w) > 3]
print(" ".join(filtered_line))

north devon coast next weeks will down devon again sometime hope though


Загрузим предобученную модель эмбеддингов. 

Если хотите, можно попробовать другую. Полный список можно найти здесь: https://github.com/RaRe-Technologies/gensim-data.

Данная модель выдает эмбеддинги для **слов**. Строить по эмбеддингам слов эмбеддинги предложений мы будем ниже.

In [10]:
word2vec = api.load("word2vec-google-news-300")

  return _load_word2vec_format(


In [11]:
emb_line = [word2vec.get_vector(w) for w in filtered_line if w in word2vec]
print(sum(emb_line).shape)

(300,)


Нормализуем эмбеддинги, прежде чем обучать на них сеть. 
(наверное, вы помните, что нейронные сети гораздо лучше обучаются на нормализованных данных)

In [12]:
w2v_mean = np.mean(word2vec.vectors, 0)
w2v_std = np.std(word2vec.vectors, 0)
norm_emb_line = [(word2vec.get_vector(w) - w2v_mean) / w2v_std for w in filtered_line if w in word2vec and len(w) > 3]
print(sum(norm_emb_line).shape)
print([all(norm_emb_line[i] == emb_line[i]) for i in range(len(emb_line))])

(300,)
[False, False, False, False, False, False, False, False, False, False, False, False]


Сделаем датасет, который будет по запросу возвращать подготовленные данные.

In [13]:
from torch.utils.data import Dataset


class TwitterDataset(Dataset):
    def __init__(self, data: pd.DataFrame, feature_column: str, target_column: str, word2vec: gensim.models.Word2Vec):
        self.tokenizer = nltk.WordPunctTokenizer()
        
        self.data = data

        self.feature_column = feature_column
        self.target_column = target_column

        self.word2vec = word2vec

        self.label2num = lambda label: 0 if label == 0 else 1
        self.mean = np.mean(word2vec.vectors, axis=0)
        self.std = np.std(word2vec.vectors, axis=0)

    def __getitem__(self, item):
        text = self.data[self.feature_column][item]
        label = self.label2num(self.data[self.target_column][item])

        tokens = self.get_tokens_(text)
        embeddings = self.get_embeddings_(tokens)

        return {"feature": embeddings, "target": label}

    def get_tokens_(self, text):
        # Получи все токены из текста и профильтруй их
        return  [w for w in self.tokenizer.tokenize(text.lower()) if all(c not in string.punctuation for c in w) and len(w) > 3]

    def get_embeddings_(self, tokens):
        # Получи эмбеддинги слов и усредни их
        embeddings = [(self.word2vec[tok] - self.mean) / self.std for tok in tokens if tok in self.word2vec] 

        if len(embeddings) == 0:
            embeddings = np.zeros((1, self.word2vec.vector_size))
        else:
            embeddings = np.array(embeddings)
            if len(embeddings.shape) == 1:
                embeddings = embeddings.reshape(-1, 1)

        return embeddings

    def __len__(self):
        return self.data.shape[0]

In [14]:
%%time
dev = TwitterDataset(dev_data, "text", "emotion", word2vec)

Wall time: 2min 10s


In [15]:
import re

import gensim
import nltk
import numpy as np
from torch.utils.data import Dataset
from tqdm.notebook import tqdm

nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

# "\w+" pattern is a full equivalent of
# nltk.tokenize.wordpunct_tokenize() + filtering punctuation
def text_to_tokens(text):
    return [tok for tok in re.findall(r"\w+", text.lower()) if tok not in stop_words]


class SlightlyMoreComputationallyEfficientDataset(Dataset):
    def __init__(
        self, texts, labels,
        word2vec: gensim.models.Word2Vec, w2v_mean=None, w2v_std=None,
        tokenizer=text_to_tokens, **_init_features_w2v_embeddings_kwargs
    ):
        super().__init__()
        self._no_missing_embeddings = False
        self._embedding_dim = word2vec.vector_size
        self._embedding_dtype = word2vec.vectors.dtype
        self._tokenizer = tokenizer
        if w2v_mean is None:
            self._w2v_mean = np.mean(word2vec.vectors, axis=0)
        else:
            self._w2v_mean = w2v_mean
        if w2v_std is None:
            self._w2v_std = np.std(word2vec.vectors, axis=0)
        else:
            self._w2v_std = w2v_std
        self._init_features_w2v_embeddings(
            texts, word2vec, **_init_features_w2v_embeddings_kwargs)
        self._init_targets(labels)

    def _get_tokens(self, text):
        return self._tokenizer(text)

    def _init_targets(self, labels):
        targets = []
        label2target = {}
        target2label = {}
        for target, label in enumerate(sorted(set(labels))):
            target2label[target] = label
            label2target[label] = target
        targets = [label2target[label] for label in labels]
        self._targets = targets
        self.label2target = label2target
        self.target2label = target2label

    def _init_features_w2v_embeddings(
            self, features, word2vec):

        # used as single stub token for text in which no proper tokens found
        shared_zero_embedding = np.zeros(
            self._embedding_dim, dtype=self._embedding_dtype)
        shared_zero_embedding_index = 0

        # Stores token embeddings:
        embeddings = [shared_zero_embedding]
        # Stores lists of token embedding indexes, one list for each text:
        embeddings_indexes = []

        token2index = {}

        for text in tqdm(features, desc="processing texts", postfix="gathering embeddings for tokens"):
            tokens = self._get_tokens(text)

            if not tokens:  # text are empty for us, no way to do something else
                embeddings_indexes.append([shared_zero_embedding_index])
                continue

            indexes = []
            for tok in tokens:
                if tok in token2index:
                    indexes.append(token2index[tok])
                elif tok in word2vec:
                    new_embedding = (
                        word2vec[tok] - self._w2v_mean) / self._w2v_std
                    assert new_embedding.shape == (
                        self._embedding_dim,), new_embedding.shape
                    embeddings.append(new_embedding)
                    new_index = len(embeddings) - 1
                    token2index[tok] = new_index
                    indexes.append(new_index)
                else:
                    indexes.append(None)
            embeddings_indexes.append(indexes)

        embeddings = np.array(embeddings)
        assert len(
            embeddings.shape) == 2 and embeddings.shape[-1] == self._embedding_dim, embeddings.shape
        self._features_w2v_embeddings_indexes = embeddings_indexes
        self._features_w2v_embeddings = embeddings
        self._shared_zero_embedding_index = shared_zero_embedding_index

    def _get_features_w2v_embeddings(self, indexes):
        if self._no_missing_embeddings:
            embeddings = self._features_w2v_embeddings[indexes]
            assert embeddings.shape[0] != 0, (indexes, embeddings.shape)
        else:
            embeddings = [self._features_w2v_embeddings[idx]
                          for idx in indexes if idx is not None]
            if not embeddings:
                embeddings = [
                    self._features_w2v_embeddings[self._shared_zero_embedding_index]]
            embeddings = np.array(embeddings)
        assert len(
            embeddings.shape) == 2 and embeddings.shape[-1] == self._embedding_dim, (indexes, embeddings.shape)
        return embeddings

    def __getitem__(self, item):
        if isinstance(item, int):
            indexes = self._features_w2v_embeddings_indexes[item]
            embeddings = self._get_features_w2v_embeddings(indexes)
        else:
            embeddings = []
            for indexes in self._features_w2v_embeddings_indexes[item]:
                embeddings.append(self._get_features_w2v_embeddings(indexes))
        target = self._targets[item]
        return {"feature": embeddings, "target": target}

    def __len__(self):
        return len(self._features_w2v_embeddings_indexes)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user0\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
import math
import time

import pandas as pd
import numpy as np


def prepare_and_split_data(
        data="training.1600000.processed.noemoticon.csv",
        feature_col="text", target_col="emotion",
        valid_part=0.2, test_part=0.2, reproducibility=True):
    
    valid_part = 0 if valid_part is None else valid_part
    test_part = 0 if test_part is None else test_part

    if valid_part < 0 or test_part < 0 or (valid_part + test_part) > 1:
        raise ValueError

    if isinstance(data, str):
        print(f"Loading data from {data}...", flush=True)
        start = time.time()
        data = pd.read_csv(data, encoding="latin", header=None, names=[
                           "emotion", "id", "date", "flag", "user", "text"])
        print(f"Data loaded. Time spent: {time.time() - start}.", flush=True)

    if reproducibility:
        enable_reproducibility()

    data = data[[feature_col, target_col]]
    data_size = len(data)
    indexes = np.arange(data_size)
    np.random.shuffle(indexes)

    res = dict.fromkeys(("train_features", "train_targets",
                         "valid_features", "valid_targets",
                         "test_features", "test_targets"))

    if test_part > 0:
        test_size = math.ceil(data_size * test_part)
        test = data.iloc[indexes[:test_size]]
        indexes = indexes[test_size:]
        res["test_features"], res["test_targets"] = test[feature_col], test[target_col]
    if valid_part > 0:
        valid_size = math.ceil(data_size * valid_part)
        valid = data.iloc[indexes[:valid_size]]
        indexes = indexes[valid_size:]
        res["valid_features"], res["valid_targets"] = valid[feature_col], valid[target_col]

    train = data.iloc[indexes]
    res["train_features"], res["train_targets"] = train[feature_col], train[target_col]
    
    return res

In [17]:
def create_datasets(
        dataset_cls,
        train_features=None, train_targets=None,
        valid_features=None, valid_targets=None,
        test_features=None, test_targets=None,
        reproducibility=True,
        datasets_kwargs=None):

    res = dict.fromkeys(("train_dataset", "valid_dataset", "test_dataset"))

    locals_ = locals()
    for name in ("train", "valid", "test"):
        features = locals_[f"{name}_features"]
        if features is None:
            continue
        if datasets_kwargs is not None:
            kwargs = datasets_kwargs.get(name, {})
        targets = locals_[f"{name}_targets"]
        print(f"Creating {name} dataset...", end='\n\n', flush=True)
        start = time.time()
        if reproducibility:
            enable_reproducibility()
        res[f"{name}_dataset"] = dataset_cls(features, targets, **kwargs)
        print(f"{name.title()} dataset created. "
              f"Time spent: {time.time() - start}.", flush=True)

    return res

In [18]:
def create_dataloaders(
        collate_fn,
        train_dataset=None, valid_dataset=None, test_dataset=None,
        batch_size=1024, workers_num=0, pin_memory=False,
        dataloaders_params=None, update_default_dataloaders_params=True,
        reproducibility=True):
    
    res = dict.fromkeys(("train_dataloader", "valid_dataloader", "test_dataloader"))
    
    default_dataloaders_params = {
        "train": dict(
            shuffle=True, drop_last=True,
            batch_size=batch_size, num_workers=workers_num,
            collate_fn=collate_fn, pin_memory=pin_memory),
        "valid": dict(
            shuffle=False, drop_last=False,
            batch_size=batch_size, num_workers=workers_num,
            collate_fn=collate_fn, pin_memory=pin_memory),
        "test": dict(
            shuffle=False, drop_last=False,
            batch_size=batch_size, num_workers=workers_num,
            collate_fn=collate_fn, pin_memory=pin_memory)
    }
    
    locals_ = locals()
    for name in ("train", "valid", "test"):
        dataset = locals_[f"{name}_dataset"]
        if dataset is None:
            continue
        params = default_dataloaders_params[name]
        if dataloaders_params is None:
            passed_params = None
        else:
            passed_params = dataloaders_params.get(name)
        if passed_params is not None:
            if update_default_dataloaders_params:
                params.update(passed_params)
            else:
                params = passed_params
        if reproducibility:
            enable_reproducibility()
        res[f"{name}_dataloader"] = DataLoader(dataset, **params)

    return res

Отлично, мы готовы с помощью эмбеддингов слов превращать твиты в векторы и обучать нейронную сеть.

Превращать твиты в векторы, используя эмбеддинги слов, можно несколькими способами. А именно такими:

## Average embedding (2 балла)
---
Это самый простой вариант, как получить вектор предложения, используя векторные представления слов в предложении. А именно: вектор предложения есть средний вектор всех слов в предлоежнии (которые остались после токенизации и удаления коротких слов, конечно). 

In [19]:
indexes = np.arange(len(dev))
np.random.shuffle(indexes)
example_indexes = indexes[::1000]

examples = {"features": [np.sum(dev[i]["feature"], axis=0) for i in example_indexes], 
            "targets": [dev[i]["target"] for i in example_indexes]}
print(len(examples["features"]))

1280


Давайте сделаем визуализацию полученных векторов твитов тренировочного (dev) датасета. Так мы увидим, насколько хорошо твиты с разными target значениями отделяются друг от друга, т.е. насколько хорошо усреднение эмбеддингов слов предложения передает информацию о предложении.

Для визуализации векторов надо получить их проекцию на плоскость. Сделаем это с помощью `PCA`. Если хотите, можете вместо PCA использовать TSNE: так у вас получится более точная проекция на плоскость (а значит, более информативная, т.е. отражающая реальное положение векторов твитов в пространстве). Но TSNE будет работать намного дольше.

In [20]:
from sklearn.decomposition import PCA


pca = PCA(n_components=2)
# Обучи PCA на эмбеддингах слов
examples["transformed_features"] = pca.fit_transform(examples["features"])

In [21]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxilirary info on hover """
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [22]:
draw_vectors(
    examples["transformed_features"][:, 0], 
    examples["transformed_features"][:, 1], 
    color=[["red", "blue"][t] for t in examples["targets"]]
    )

Скорее всего, на визуализации нет четкого разделения твитов между классами. Это значит, что по полученным нами векторам твитов не так-то просто определить, к какому классу твит пренадлежит. Значит, обычный линейный классификатор не очень хорошо справится с задачей. Надо будет делать глубокую (хотя бы два слоя) нейронную сеть.

Подготовим загрузчики данных.
Усреднее векторов будем делать в "батчевалке"(`collate_fn`). Она используется для того, чтобы собирать из данных `torch.Tensor` батчи, которые можно отправлять в модель.


In [23]:
import numpy as np
import torch

def average_emb(batch):
    features = [np.mean(b["feature"], axis=0) for b in batch]
    targets = [b["target"] for b in batch]
    return {"features": torch.FloatTensor(features), "targets": torch.LongTensor(targets)}

In [None]:
# no need of these anymore
del dev_data, test_data, dev, examples, pca

Определим функции для тренировки и теста модели:

In [24]:
from tqdm.notebook import tqdm


def training(model, optimizer, criterion, train_dataloader, epoch, device="cpu"):
    model.train()
    loss_sum = 0
    with tqdm(train_dataloader, desc=f"[training, epoch {epoch}]") as pbar:
        for batch in pbar:
            features = batch["features"].to(device)
            targets = batch["targets"].view(-1, 1).to(device)

            optimizer.zero_grad()
            # Получи предсказания модели
            output = model(features)
            # Посчитай лосс
            loss = criterion(output, targets.type(output.dtype))
            # Обнови параметры модели
            loss.backward()
            optimizer.step()

            loss_sum += loss.item()
            pbar.set_postfix_str(f"Batch Loss: {loss.item():.4}")
    
        mean_loss = loss_sum / len(train_dataloader)
        pbar.set_postfix_str(f"Mean Loss: {mean_loss:.4}")
    return {"mean loss": mean_loss}


@torch.no_grad()
def testing(model, criterion, test_dataloader, predict, device="cpu", desc="testing"):
    model.eval()
    loss_sum = 0
    true_preds_num_sum = 0
    total_samples = 0
    with tqdm(test_dataloader, desc=desc) as pbar:
        for batch in pbar:
            features = batch["features"].to(device)
            targets = batch["targets"].view(-1, 1).to(device)
            
            # Получи предсказания модели
            output = model(features)
            # Посчитай лосс
            loss = criterion(output, targets.type(output.dtype)).item()
            # Посчитай точность модели
            preds = predict(output)
            true_preds_num = (preds == targets.type(preds.dtype)).sum().item()

            loss_sum += loss
            true_preds_num_sum += true_preds_num
            cur_batch_size = len(targets)
            total_samples += cur_batch_size
            
            pbar.set_postfix_str(f"Batch Loss: {loss:.4}, Batch Accuracy: {true_preds_num / cur_batch_size:.4}")
    
        mean_loss = loss_sum / len(test_dataloader)
        accuracy = true_preds_num_sum / total_samples
        pbar.set_postfix_str(f"Mean Loss: {mean_loss:.4}, Accuracy: {accuracy:.4}")
    return {"mean loss": mean_loss, "accuracy": accuracy}

In [29]:
from copy import deepcopy


def train_valid_test_model(
        epochs_num, model, optimizer, criterion, predict,
        train_dataloader, valid_dataloader, test_dataloader,
        save_best_by='accuracy', device='cpu', reproducibility=True):
    
    if reproducibility:
        enable_reproducibility()
    model = model.to(device)
    info = {
        'validation metric': save_best_by,
        'total epochs': epochs_num,
        # 'best epoch': {},
        # 'last epoch': {}
    }

    if save_best_by == 'accuracy':
        best_metric_val = 0
        metric_key = 'accuracy'
        def is_the_best_result(log):
            nonlocal best_metric_val
            curr_metric_val = log[metric_key]
            if curr_metric_val > best_metric_val:
                best_metric_val = curr_metric_val
                return True
            return False
    
    elif save_best_by == 'loss':
        best_metric_val = np.inf
        metric_key = 'mean loss'
        def is_the_best_result(log):
            nonlocal best_metric_val
            curr_metric_val = log[metric_key]
            if curr_metric_val < best_metric_val:
                best_metric_val = curr_metric_val
                return True
            return False
    
    else:
        raise ValueError("unknown value for save_best_by")
    for e in range(1, epochs_num + 1):
        train_log = training(model, optimizer, criterion, train_dataloader, e, device)
        valid_log = testing(model, criterion, valid_dataloader, predict, device, desc=f"[validation, epoch {e}]")
        if is_the_best_result(valid_log):
            info['best epoch'] = {
                'epoch': e, 'model state dict': model.state_dict(),
                **{'train ' + key: value for key, value in train_log.items()},
                **{'valid ' + key: value for key, value in valid_log.items()}
            }
    if info['best epoch']['epoch'] == epochs_num:
        last_epoch_test_log = testing(model, criterion, test_dataloader, predict, device, desc=f"[testing, last & best epoch]")
        info['best epoch'].update({'test ' + key: value for key, value in last_epoch_test_log.items()})
        info['last epoch'] = info['best epoch']
    else:
        last_epoch_test_log = testing(model, criterion, test_dataloader, predict, device, desc=f"[testing, last]")
        info['last epoch'] = {
            'epoch': e, 'model state dict': model.state_dict(),
            **{'train ' + key: value for key, value in train_log.items()},
            **{'valid ' + key: value for key, value in valid_log.items()},
            **{'test ' + key: value for key, value in last_epoch_test_log.items()},
        }
        best_epoch_model = deepcopy(model)
        best_epoch_model.load_state_dict(info['best epoch']['model state dict'])#, map_location=device)
        best_epoch_test_log = testing(
            best_epoch_model, criterion, test_dataloader, predict, device,
            desc=f"[testing, best epoch ({info['best epoch']['epoch']})]")
        info['best epoch'].update({'test ' + key: value for key, value in best_epoch_test_log.items()})
    return info

Создадим модель, оптимизатор и целевую функцию. Вы можете сами выбрать количество слоев в нейронной сети, ваш любимый оптимизатор и целевую функцию.


In [25]:
import torch.nn as nn


class Classifier(nn.Module):
    def __init__(
            self, features_num, classes_num=1,
            layer_neurons_num=256, layers_num=2, bias=True,
            batchnorm=False, batchnorm_before_activation=False):
        if layers_num < 1:
            raise ValueError
        super().__init__()
        layers = [nn.Linear(features_num, layer_neurons_num if layers_num > 1 else classes_num, bias=bias)]
        if layers_num - 1 > 0:
            for _ in range(layers_num - 2):
                a = nn.ReLU(inplace=True)
                if batchnorm:
                    b = nn.BatchNorm1d(layer_neurons_num)
                    layers.extend((b, a) if batchnorm_before_activation else (a, b))
                else:
                    layers.append(a)
                layers.append(nn.Linear(layer_neurons_num, layer_neurons_num, bias=bias))
            # We do not need batch normalization before the output layer
            # (there is opinion it may make worse classification results).
            layers.extend([nn.ReLU(inplace=True), nn.Linear(layer_neurons_num, classes_num, bias=bias)])
        self.net = nn.Sequential(*layers)

    def forward(self, input):
        return self.net(input)


def binary_predict(input, output_type=torch.long):
    return (torch.sigmoid(input) >= 0.5).type(output_type)

In [26]:
from torch.optim import AdamW

features_num = word2vec.vector_size
classes_num = 1
criterion = nn.BCEWithLogitsLoss()
predict = binary_predict
collate_fn = average_emb

default_model_init_kwargs = dict(
    features_num=word2vec.vector_size, classes_num=1,
    layer_neurons_num=256, layers_num=2, bias=True,
    batchnorm=False, batchnorm_before_activation=False)
default_epochs_num = 15

In [27]:
import shelve

EXPR_METRICS_STORAGE = shelve.open('experiments_storage', writeback=True)

def add_experiment_metrics(expr_desc, model_init_kwargs, train_params, train_val_test_results):
    EXPR_METRICS_STORAGE[expr_desc] = dict(expr_desc=expr_desc, model_init_kwargs=model_init_kwargs, train_params=train_params, train_val_test_results=train_val_test_results)
    EXPR_METRICS_STORAGE.sync()

In [28]:
import pandas as pd


def show_results(train_val_test_results):
    best = train_val_test_results['best epoch']
    last = train_val_test_results['last epoch']
    df = pd.DataFrame(
        [
            [best['epoch'], best['test accuracy'], best['valid accuracy'],
                best['test mean loss'], best['valid mean loss'], best['train mean loss']],
            [last['epoch'], last['test accuracy'], last['valid accuracy'],
                last['test mean loss'], last['valid mean loss'], last['train mean loss']],
        ], index=['best', 'last'],
        columns=[
            'epoch', 'test accuracy', 'valid accuracy',
            'test mean loss', 'valid mean loss', 'train mean loss']
    )
    return df

In [30]:
all_data_splitted = prepare_and_split_data(data)

In [135]:
import gensim
import numpy as np

w2v_model_name = "word2vec-google-news-300"
locals_ = locals()
if locals_.get('word2vec') is None:
    word2vec = gensim.downloader.load(w2v_model_name)
if locals_.get('w2v_mean') is None:
    w2v_mean = np.mean(word2vec.vectors, axis=0)
if locals_.get('w2v_std') is None:
    w2v_std = np.std(w2v_std.vectors, axis=0)
dataset_w2v_kwargs = dict(word2vec=word2vec, w2v_mean=w2v_mean, w2v_std=w2v_std)

Наконец, обучим модель и протестируем её.

После каждой эпохи будем проверять качество модели на валидационной части датасета. Если метрика стала лучше, будем сохранять модель. **Подумайте, какая метрика (точность или лосс) будет лучше работать в этой задаче?** 

** Ответ **

По лоссу мы напрямую оптизируемся во время обучения, поэтому использовать его в качестве метрики на валидации не совсем правильно. Плюс лосс показывает усреднённую разницу (не разность) между вероятностями классов, а не непосредственным предсказаниями. Accuracy хорошо отражает качество классификации, если классы в данных сбалансированы.

In [193]:
data["emotion"].value_counts()

4    800000
0    800000
Name: emotion, dtype: int64

Классы сбалансированы идеально, но хороший ли у нас рандом при разбиении данных?

In [350]:
print(*[key + '\n' + str(targets.value_counts()) for key, targets in all_data_splitted.items() if key.endswith("_targets")], sep='\n\n')

train_targets
0    480271
4    479729
Name: emotion, dtype: int64

valid_targets
0    160235
4    159765
Name: emotion, dtype: int64

test_targets
4    160506
0    159494
Name: emotion, dtype: int64


Рандом тоже отличный, поэтому можно смело использовать accuracy. Мы, конечно, косвенно оптизимируемся и по ней, выбирая лучшую на валидации модель, поэтому проводим финальную оценку качества на отложенных тест-данных.

## Experiment #1
default classifier, accuracy, word2vec-google-news-300

In [515]:
splitted_data = all_data_splitted

In [412]:
datasets_kwargs = dict(train=dataset_w2v_kwargs, valid=dataset_w2v_kwargs, test=dataset_w2v_kwargs)
datasets = create_datasets(SlightlyMoreComputationallyEfficientDataset, **splitted_data, datasets_kwargs=datasets_kwargs)
dataloaders = create_dataloaders(collate_fn, **datasets)

Creating train dataset...



HBox(children=(FloatProgress(value=0.0, description='processing texts', max=960000.0, style=ProgressStyle(desc…


Train dataset created. Time spent: 67.75587940216064.
Creating valid dataset...



HBox(children=(FloatProgress(value=0.0, description='processing texts', max=320000.0, style=ProgressStyle(desc…


Valid dataset created. Time spent: 77.35805130004883.
Creating test dataset...



HBox(children=(FloatProgress(value=0.0, description='processing texts', max=320000.0, style=ProgressStyle(desc…


Test dataset created. Time spent: 20.24873638153076.


In [413]:
enable_reproducibility()
model_init_kwargs = default_model_init_kwargs
epochs_num = default_epochs_num
optim_cls = AdamW
lr=3e-4
save_best_by='accuracy'
expr_desc = "expr #1"
model = Classifier(**model_init_kwargs)
optimizer = optim_cls(model.parameters(), lr=lr)
train_params = dict(
    optimizer_cls=optimizer.__class__.__name__,
    criterion_cls=criterion.__class__.__name__,
    predict_func=predict.__name__, collate_fn_func=collate_fn.__name__,
    lr=lr, epochs_num=epochs_num, save_best_by=save_best_by, w2v_model_name=w2v_model_name)
train_valid_test_args = dict(
    epochs_num=epochs_num,
    model=model, optimizer=optimizer, criterion=criterion,
    **dataloaders,
    predict=predict, device=DEVICE,
    save_best_by=save_best_by
)
model

Classifier(
  (net): Sequential(
    (0): Linear(in_features=300, out_features=256, bias=True)
    (1): ReLU(inplace=True)
    (2): Linear(in_features=256, out_features=1, bias=True)
  )
)

In [414]:
%%time
results = train_valid_test_model(**train_valid_test_args)
add_experiment_metrics(expr_desc, model_init_kwargs, train_params, results)
show_results(results)

HBox(children=(FloatProgress(value=0.0, description='[training, epoch 1]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 1]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 2]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 2]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 3]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 3]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 4]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 4]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 5]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 5]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 6]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 6]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 7]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 7]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 8]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 8]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 9]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 9]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 10]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 10]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 11]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 11]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 12]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 12]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 13]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 13]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 14]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 14]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 15]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 15]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[testing, last & best epoch]', max=313.0, style=ProgressS…


Wall time: 27min 48s


Unnamed: 0,epoch,test accuracy,valid accuracy,test mean loss,valid mean loss,train mean loss
best,15,0.765953,0.765613,0.484946,0.483516,0.461675
last,15,0.765953,0.765613,0.484946,0.483516,0.461675


## Experiment #2
default classifier, loss, word2vec-google-news-300

In [415]:
enable_reproducibility()
model_init_kwargs = default_model_init_kwargs
epochs_num = default_epochs_num
optim_cls = AdamW
lr=3e-4
save_best_by='loss'
expr_desc = "expr #2"
model = Classifier(**model_init_kwargs)
optimizer = optim_cls(model.parameters(), lr=lr)
train_params = dict(
    optimizer_cls=optimizer.__class__.__name__,
    criterion_cls=criterion.__class__.__name__,
    predict_func=predict.__name__, collate_fn_func=collate_fn.__name__,
    lr=lr, epochs_num=epochs_num, save_best_by=save_best_by, w2v_model_name=w2v_model_name)
train_valid_test_args = dict(
    epochs_num=epochs_num,
    model=model, optimizer=optimizer, criterion=criterion,
    **dataloaders,
    predict=predict, device=DEVICE,
    save_best_by=save_best_by
)

In [416]:
%%time
results = train_valid_test_model(**train_valid_test_args)
add_experiment_metrics(expr_desc, model_init_kwargs, train_params, results)
show_results(results)

HBox(children=(FloatProgress(value=0.0, description='[training, epoch 1]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 1]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 2]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 2]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 3]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 3]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 4]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 4]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 5]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 5]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 6]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 6]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 7]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 7]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 8]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 8]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 9]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 9]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 10]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 10]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 11]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 11]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 12]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 12]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 13]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 13]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 14]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 14]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 15]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 15]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[testing, last]', max=313.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description='[testing, best epoch (13)]', max=313.0, style=ProgressSty…


Wall time: 25min 52s


Unnamed: 0,epoch,test accuracy,valid accuracy,test mean loss,valid mean loss,train mean loss
best,13,0.765953,0.765556,0.484946,0.482811,0.464732
last,15,0.765953,0.765613,0.484946,0.483516,0.461675


**Вывод по #1 и #2**: Нужно учить дольше и более глубокую модель, чтобы различия проявились. Здесь же лосс и точность улучшаются вместе.

## Embeddings for unknown words (8 баллов)

Пока что использовалась не вся информация из текста. Часть информации фильтровалось – если слова не было в словаре эмбеддингов, то мы просто превращали слово в нулевой вектор. Хочется использовать информацию по-максимуму. Поэтому рассмотрим другие способы обработки слов, которых нет в словаре. А именно:

- Для каждого незнакомого слова будем запоминать его контекст(слова слева и справа от этого слова). Эмбеддингом нашего незнакомого слова будет сумма эмбеддингов всех слов из его контекста. (4 балла)
- Для каждого слова текста получим его эмбеддинг из Tfidf с помощью ```TfidfVectorizer``` из [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer). Итоговым эмбеддингом для каждого слова будет сумма двух эмбеддингов: предобученного и Tfidf-ного. Для слов, которых нет в словаре предобученных эмбеддингов, результирующий эмбеддинг будет просто полученный из Tfidf. (4 балла)

Реализуйте оба варианта **ниже**. Напишите, какой способ сработал лучше и ваши мысли, почему так получилось.

## Первый вариант

In [453]:
from collections import defaultdict

import gensim
import numpy as np
from tqdm.notebook import tqdm


class DatasetWithInferringUnknownTokens(SlightlyMoreComputationallyEfficientDataset):
    def __init__(
        self, texts, labels,
        word2vec: gensim.models.Word2Vec, w2v_mean=None, w2v_std=None,
        tokenizer=text_to_tokens,
        window_to_infer_unknown_tokens=None,
        only_unique_context_tokens_to_infer=True,
        context_tokens_reduction='mean',
        zero_embedding_for_which_inferring_failed=True,
        test_mode=False
    ):
        if window_to_infer_unknown_tokens is not None and window_to_infer_unknown_tokens <= 0:
            raise ValueError(
                f"invalid value {window_to_infer_unknown_tokens} for window_to_infer_unknown_tokens, "
                "None or int > 0 are only acceptable")
        if context_tokens_reduction not in ('mean', 'sum'):
            raise ValueError(
                f"invalid value {context_tokens_reduction} for context_tokens_reduction, "
                "'mean' or 'sum' are only acceptable")
        super().__init__(
            texts, labels, word2vec, w2v_mean, w2v_std, tokenizer,
            window_to_infer_unknown_tokens=window_to_infer_unknown_tokens,
            only_unique_context_tokens_to_infer=only_unique_context_tokens_to_infer,
            context_tokens_reduction=context_tokens_reduction,
            zero_embedding_for_which_inferring_failed=zero_embedding_for_which_inferring_failed,
            test_mode=test_mode)

    def _init_features_w2v_embeddings(
            self, features, word2vec,
            window_to_infer_unknown_tokens,
            only_unique_context_tokens_to_infer,
            context_tokens_reduction,
            zero_embedding_for_which_inferring_failed,
            test_mode):
        no_missing_embeddings = (
            window_to_infer_unknown_tokens is not None and
            zero_embedding_for_which_inferring_failed)

        # used for token for which it's failed to infer embedding
        # or as single stub token for text in which no proper tokens found
        shared_zero_embedding = np.zeros(
            self._embedding_dim, dtype=self._embedding_dtype)
        shared_zero_embedding_index = 0

        # Stores token embeddings:
        embeddings = [shared_zero_embedding]
        # Stores lists of token embedding indexes, one list for each text:
        embeddings_indexes = []

        token2index = {}

        if window_to_infer_unknown_tokens is not None:
            unknown_tokens_positions_in_embeddings_indexes = defaultdict(list)

            if only_unique_context_tokens_to_infer:
                indexes_for_inferring_unknown_tokens = defaultdict(set)
                _container_method_name = "update"
            else:
                indexes_for_inferring_unknown_tokens = defaultdict(list)
                _container_method_name = "extend"
            

            # In test_mode only neighbors in the current token position are consider
            # as its context (we store contexts for different positions separetly).
            # In default mode token context is considered as its neighbors in all its positions
            # (i.e. in all samples of dataset).
            # Case of several occurences in one sample is not supported yet.
            def add_to_gathered_neighbors(tok, context, pos=None):
                if test_mode and pos is None:
                    raise ValueError("if test_mode=True, pos must be given")
                key = (tok, pos) if test_mode else tok
                method = getattr(
                    indexes_for_inferring_unknown_tokens[key], _container_method_name)
                method(context)

            def get_gathered_neighbors(tok, pos=None):
                if test_mode and pos is None:
                    raise ValueError("if test_mode=True, pos must be given")
                key = (tok, pos) if test_mode else tok
                return indexes_for_inferring_unknown_tokens[key]

            if context_tokens_reduction == 'mean':
                def infer(context):
                    return np.mean(context, axis=0)
            elif context_tokens_reduction == 'sum':
                def infer(context):
                    return np.sum(context, axis=0)
            else:
                raise ValueError(
                    f"context_tokens_reduction must be "
                    f"'mean' or 'sum', not {context_tokens_reduction}")

        for text in tqdm(features, desc="processing texts", postfix="gathering embeddings for tokens"):
            tokens = self._get_tokens(text)

            if not tokens:  # text are empty for us, no way to do something else
                embeddings_indexes.append([shared_zero_embedding_index])
                continue

            indexes = []
            for tok in tokens:
                if tok in token2index:
                    indexes.append(token2index[tok])
                elif tok in word2vec:
                    new_embedding = (
                        word2vec[tok] - self._w2v_mean) / self._w2v_std
                    assert new_embedding.shape == (
                        self._embedding_dim,), new_embedding.shape
                    embeddings.append(new_embedding)
                    new_index = len(embeddings) - 1
                    token2index[tok] = new_index
                    indexes.append(new_index)
                else:
                    indexes.append(None)
            embeddings_indexes.append(indexes)

            # Inferring unknown tokens, stage #1:
            # gather neighbors indexes in the current position of unknown token,
            # add them to gathered ones in the already processed positions of the same unknown token.
            if window_to_infer_unknown_tokens is None:
                continue
            for pos, (idx, tok) in enumerate(zip(indexes, tokens)):
                if idx is not None:
                    continue
                left = max(0, pos - window_to_infer_unknown_tokens)
                right = pos + window_to_infer_unknown_tokens + 1
                context = indexes[left:pos] + indexes[pos + 1:right]
                context = [
                    tok_idx for tok_idx in context if tok_idx is not None]
                unknown_tok_pos = (
                    len(embeddings_indexes) - 1,
                    pos
                )
                unknown_tokens_positions_in_embeddings_indexes[tok].append(
                    unknown_tok_pos)
                add_to_gathered_neighbors(tok, context, unknown_tok_pos)

        # Inferring unknown tokens, stage #2:
        # get neighbours embeddings by gathered indexes
        # and use them to infer embedding for unknown token,
        # add new embedding to embeddings,
        # update embedding index from None to index of created embedding
        # in all the positions of previously unknown token.
        def get_new_embedding_index(neighbors_emb_indexes):
            context = [embeddings[idx] for idx in neighbors_emb_indexes]
            if context:
                new_embedding = infer(context)
                assert new_embedding.shape == (
                    self._embedding_dim,), (new_embedding.shape, context[0].shape)
                embeddings.append(new_embedding)
                new_index = len(embeddings) - 1
            elif zero_embedding_for_which_inferring_failed:
                new_index = shared_zero_embedding_index
            else:
                new_index = None
            return new_index
        
        if window_to_infer_unknown_tokens is not None:
            for tok, positions in tqdm(
                    unknown_tokens_positions_in_embeddings_indexes.items(),
                    desc="filling embeddings gaps",
                    postfix="with inferred embeddings for unknown tokens"):
                if test_mode:
                    for pos in positions:
                        neighbors = get_gathered_neighbors(tok, pos)
                        new_index = get_new_embedding_index(neighbors)
                        if new_index is None:
                            continue
                        i, j = pos
                        assert embeddings_indexes[i][j] is None
                        embeddings_indexes[i][j] = new_index
                else:
                    neighbors = get_gathered_neighbors(tok)
                    new_index = get_new_embedding_index(neighbors)
                    if new_index is None:
                        continue
                    for i, j in positions:
                        assert embeddings_indexes[i][j] is None
                        embeddings_indexes[i][j] = new_index

        embeddings = np.array(embeddings)
        assert len(
            embeddings.shape) == 2 and embeddings.shape[-1] == self._embedding_dim, embeddings.shape
        self._features_w2v_embeddings_indexes = embeddings_indexes
        self._features_w2v_embeddings = embeddings
        self._shared_zero_embedding_index = shared_zero_embedding_index
        self._no_missing_embeddings = no_missing_embeddings

### Experiment #3
default classifier, accuracy, word2vec-google-news-300 + inferring unknown tokens

In [418]:
train_dataset_kwargs = dict(dataset_w2v_kwargs,
    window_to_infer_unknown_tokens=5, only_unique_context_tokens_to_infer=True, context_tokens_reduction='mean')
valid_test_datasets_kwargs = dict(train_dataset_kwargs, test_mode=True)
datasets_kwargs = dict(train=train_dataset_kwargs, valid=valid_test_datasets_kwargs, test=valid_test_datasets_kwargs)
datasets = create_datasets(DatasetWithInferringUnknownTokens, **splitted_data, datasets_kwargs=datasets_kwargs)
dataloaders = create_dataloaders(collate_fn, **datasets)

Creating train dataset...



HBox(children=(FloatProgress(value=0.0, description='processing texts', max=960000.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='filling embeddings gaps', max=421383.0, style=ProgressSty…


Train dataset created. Time spent: 156.95744729042053.
Creating valid dataset...



HBox(children=(FloatProgress(value=0.0, description='processing texts', max=320000.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='filling embeddings gaps', max=179164.0, style=ProgressSty…


Valid dataset created. Time spent: 43.80369448661804.
Creating test dataset...



HBox(children=(FloatProgress(value=0.0, description='processing texts', max=320000.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='filling embeddings gaps', max=179305.0, style=ProgressSty…


Test dataset created. Time spent: 36.86923050880432.


In [419]:
enable_reproducibility()
model_init_kwargs = default_model_init_kwargs
epochs_num = default_epochs_num
optim_cls = AdamW
lr=3e-4
save_best_by='accuracy'
expr_desc = "expr #3"
model = Classifier(**model_init_kwargs)
optimizer = optim_cls(model.parameters(), lr=lr)
train_params = dict(
    optimizer_cls=optimizer.__class__.__name__,
    criterion_cls=criterion.__class__.__name__,
    predict_func=predict.__name__, collate_fn_func=collate_fn.__name__,
    lr=lr, epochs_num=epochs_num, save_best_by=save_best_by, w2v_model_name=w2v_model_name + "inferring unknown tokens")
train_valid_test_args = dict(
    epochs_num=epochs_num,
    model=model, optimizer=optimizer, criterion=criterion,
    **dataloaders,
    predict=predict, device=DEVICE,
    save_best_by=save_best_by
)

In [420]:
%%time
results = train_valid_test_model(**train_valid_test_args)
add_experiment_metrics(expr_desc, model_init_kwargs, train_params, results)
show_results(results)

HBox(children=(FloatProgress(value=0.0, description='[training, epoch 1]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 1]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 2]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 2]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 3]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 3]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 4]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 4]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 5]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 5]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 6]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 6]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 7]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 7]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 8]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 8]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 9]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 9]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 10]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 10]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 11]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 11]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 12]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 12]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 13]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 13]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 14]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 14]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 15]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 15]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[testing, last]', max=313.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description='[testing, best epoch (13)]', max=313.0, style=ProgressSty…


Wall time: 26min 7s


Unnamed: 0,epoch,test accuracy,valid accuracy,test mean loss,valid mean loss,train mean loss
best,13,0.763731,0.763528,0.489463,0.486165,0.463681
last,15,0.763731,0.763519,0.489463,0.487663,0.460519


## Второй вариант

In [421]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD 


class TfidfWordVectors:
    def __init__(self, vectors, vocabulary):
        if not isinstance(vectors, np.ndarray):
            raise TypeError
        if not isinstance(vocabulary, dict):
            raise TypeError
        if len(vectors.shape) != 2:
            raise ValueError
        if len(vectors) != len(vocabulary):
            raise ValueError
        self._vocabulary = vocabulary
        self.vectors = vectors
        self.vector_size = vectors.shape[-1]

    def __getitem__(self, key):
        if not isinstance(key, str):
            raise TypeError
        return self.vectors[self._vocabulary[key]]

    def __contains__(self, key):
        return key in self._vocabulary


class TooFewTokensError(Exception):
    pass


def train_tfidf_word_vectors(documents, output_token_dim=300, tokenizer=text_to_tokens, min_df=3, max_df=1.0, dtype=np.float32, random_state=SEED):
    if len(documents) < output_token_dim:
        raise ValueError("number of documents must be strictly >= output_token_dim to produce tfidf word vectors")
    # no need to have lowercase=True because analyzer does all work
    vectorizer = TfidfVectorizer(analyzer=tokenizer, lowercase=False, min_df=min_df, max_df=max_df, dtype=dtype)
    doc_embeddings_matrix = vectorizer.fit_transform(documents)
    vocabulary = vectorizer.vocabulary_
    tok_embeddings_matrix = doc_embeddings_matrix.T
    if tok_embeddings_matrix.shape[0] < output_token_dim:
        raise TooFewTokensError(
            f"cannot build embeddings matrix with output_token_dim = {output_token_dim}, "
            f"because extracted tokens number = {tok_embeddings_matrix.shape[0]} that < {output_token_dim}")
    if len(documents) > output_token_dim:
        t_svd = TruncatedSVD(n_components=output_token_dim, random_state=random_state)
        tok_embeddings_matrix = t_svd.fit_transform(tok_embeddings_matrix)
    else:
        tok_embeddings_matrix = tok_embeddings_matrix.todense()
    assert tok_embeddings_matrix.shape[-1] == output_token_dim, (tok_embeddings_matrix.shape, len(documents), output_token_dim)
    return TfidfWordVectors(tok_embeddings_matrix, vocabulary)

In [438]:
%%time
tfidf_based_w2v = train_tfidf_word_vectors(all_data_splitted['train_features'], output_token_dim=features_num)
tfidf_based_w2v_mean = np.mean(tfidf_based_w2v.vectors, axis=0)
tfidf_based_w2v_std = np.std(tfidf_based_w2v.vectors, axis=0)

Wall time: 4min 10s


In [517]:
from collections import defaultdict

import gensim
import numpy as np
from tqdm.notebook import tqdm


class DatasetUsingTfidfWordVectorsInAddition(SlightlyMoreComputationallyEfficientDataset):
    def __init__(
        self, texts, labels,
        word2vec: gensim.models.Word2Vec, tfidf_based_w2v, w2v_mean=None, w2v_std=None,
        tfidf_based_w2v_mean=None, tfidf_based_w2v_std=None,
        tokenizer=text_to_tokens,
    ):
        if tfidf_based_w2v_mean is None:
            self._tfidf_based_w2v_mean = np.mean(tfidf_based_w2v.vectors, axis=0)
        else:
            self._tfidf_based_w2v_mean = tfidf_based_w2v_mean
        if tfidf_based_w2v_std is None:
            self._tfidf_based_w2v_std = np.std(tfidf_based_w2v.vectors, axis=0)
        else:
            self._tfidf_based_w2v_std = tfidf_based_w2v_std
        super().__init__(
            texts, labels, word2vec, w2v_mean, w2v_std, tokenizer,
            tfidf_based_w2v=tfidf_based_w2v)

    def _init_features_w2v_embeddings(
            self, features, word2vec, tfidf_based_w2v):
    
        # used as single stub token for text in which no proper tokens found
        shared_zero_embedding = np.zeros(
            self._embedding_dim, dtype=self._embedding_dtype)
        shared_zero_embedding_index = 0

        # Stores token embeddings:
        embeddings = [shared_zero_embedding]
        # Stores lists of token embedding indexes, one list for each text:
        embeddings_indexes = []

        token2index = {}

        
        for text in tqdm(features, desc="processing texts", postfix="gathering embeddings for tokens"):
            tokens = self._get_tokens(text)

            if not tokens:  # text are empty for us, no way to do something else
                embeddings_indexes.append([shared_zero_embedding_index])
                continue

            indexes = []
            for tok in tokens:
                if tok in token2index:
                    indexes.append(token2index[tok])
                    continue
                
                new_embedding_word2vec = None
                if tok in word2vec:
                    new_embedding_word2vec = (
                        word2vec[tok] - self._w2v_mean) / self._w2v_std
                    assert new_embedding_word2vec.shape == (
                        self._embedding_dim,), new_embedding_word2vec.shape
                
                new_embedding_tfidf_based_w2v = None
                if tok in tfidf_based_w2v:
                    new_embedding_tfidf_based_w2v  = (
                        tfidf_based_w2v[tok] - self._tfidf_based_w2v_mean) / self._tfidf_based_w2v_std
                    assert new_embedding_tfidf_based_w2v.shape == (
                        self._embedding_dim,), new_embedding_tfidf_based_w2v.shape
                
                if (new_embedding_word2vec is not None and new_embedding_tfidf_based_w2v is not None):
                    new_embedding = new_embedding_word2vec + new_embedding_tfidf_based_w2v
                elif new_embedding_word2vec is not None:
                    new_embedding = new_embedding_word2vec
                elif new_embedding_tfidf_based_w2v is not None:
                    new_embedding = new_embedding_tfidf_based_w2v
                else:
                    indexes.append(None)
                    continue

                embeddings.append(new_embedding)
                new_index = len(embeddings) - 1
                token2index[tok] = new_index
                indexes.append(new_index)
            
            embeddings_indexes.append(indexes)

        embeddings = np.array(embeddings)
        assert len(
            embeddings.shape) == 2 and embeddings.shape[-1] == self._embedding_dim, embeddings.shape
        self._features_w2v_embeddings_indexes = embeddings_indexes
        self._features_w2v_embeddings = embeddings
        self._shared_zero_embedding_index = shared_zero_embedding_index

### Experiment #4
default classifier, accuracy, word2vec-google-news-300 + tdfidf word vectors trained on train part of data

In [520]:
dataset_w2v_and_tfidf_based_w2v_kwargs = dict(
    dataset_w2v_kwargs, tfidf_based_w2v=tfidf_based_w2v, tfidf_based_w2v_mean=tfidf_based_w2v_mean, tfidf_based_w2v_std=tfidf_based_w2v_std)
datasets_kwargs.update(train=dataset_w2v_and_tfidf_based_w2v_kwargs, valid=dataset_w2v_and_tfidf_based_w2v_kwargs, test=dataset_w2v_and_tfidf_based_w2v_kwargs)
datasets = create_datasets(DatasetUsingTfidfWordVectorsInAddition, **splitted_data, datasets_kwargs=datasets_kwargs)
dataloaders = create_dataloaders(collate_fn, **datasets)

Creating train dataset...



HBox(children=(FloatProgress(value=0.0, description='processing texts', max=960000.0, style=ProgressStyle(desc…


Train dataset created. Time spent: 116.45020794868469.
Creating valid dataset...



HBox(children=(FloatProgress(value=0.0, description='processing texts', max=320000.0, style=ProgressStyle(desc…


Valid dataset created. Time spent: 22.92027449607849.
Creating test dataset...



HBox(children=(FloatProgress(value=0.0, description='processing texts', max=320000.0, style=ProgressStyle(desc…


Test dataset created. Time spent: 20.41074776649475.


In [521]:
enable_reproducibility()
model_init_kwargs = default_model_init_kwargs
epochs_num = default_epochs_num
optim_cls = AdamW
lr=3e-4
save_best_by='accuracy'
expr_desc = "expr #4"
model = Classifier(**model_init_kwargs)
optimizer = optim_cls(model.parameters(), lr=lr)
train_params = dict(
    optimizer_cls=optimizer.__class__.__name__,
    criterion_cls=criterion.__class__.__name__,
    predict_func=predict.__name__, collate_fn_func=collate_fn.__name__,
    lr=lr, epochs_num=epochs_num, save_best_by=save_best_by, w2v_model_name=w2v_model_name + "tdidf word vectors")
train_valid_test_args = dict(
    epochs_num=epochs_num,
    model=model, optimizer=optimizer, criterion=criterion,
    **dataloaders,
    predict=predict, device=DEVICE,
    save_best_by=save_best_by
)

In [522]:
%%time
results = train_valid_test_model(**train_valid_test_args)
add_experiment_metrics(expr_desc, model_init_kwargs, train_params, results)
show_results(results)

HBox(children=(FloatProgress(value=0.0, description='[training, epoch 1]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 1]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 2]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 2]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 3]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 3]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 4]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 4]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 5]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 5]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 6]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 6]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 7]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 7]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 8]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 8]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 9]', max=937.0, style=ProgressStyle(desc…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 9]', max=313.0, style=ProgressStyle(de…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 10]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 10]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 11]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 11]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 12]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 12]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 13]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 13]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 14]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 14]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[training, epoch 15]', max=937.0, style=ProgressStyle(des…




HBox(children=(FloatProgress(value=0.0, description='[validation, epoch 15]', max=313.0, style=ProgressStyle(d…




HBox(children=(FloatProgress(value=0.0, description='[testing, last]', max=313.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description='[testing, best epoch (14)]', max=313.0, style=ProgressSty…


Wall time: 27min 35s


Unnamed: 0,epoch,test accuracy,valid accuracy,test mean loss,valid mean loss,train mean loss
best,14,0.750797,0.751169,0.508857,0.506487,0.471824
last,15,0.750797,0.750678,0.508857,0.507814,0.470152


### Вывод
В первом варианте качество ухудшилось по сравнению с простым word2vec, хоть и незначительно. Нужно учить дольше и перебирать разные параметры. Всю неделю кодил эту возможность перебирать параметры, в итоге непосредственно поэкспериментировать не успеваю.

Во втором варианте accuracy росла медленее и итоговый результат уже ощутимо хуже всех предыдущих вариантов. Наверное, это связано с тем, что методы построения эмбеддингов сильно различаются и простое складывание векторов из полученных векторных пространств только смазывает информацию. Возможно, если бы вместо TF-IDF + SVD использовать PMI + SVD, то результат был бы лучше: в случае PMI используется информация о совстречаемости слов, что близко к принципам word2vec. В случае TF-IDF используется информация о частотности отдельного слова, и обычно с помощью TF-IDF получают эмбеддинги документов (строки), а не слов (столбцы).

## Experiment #5
default classifier, accuracy, glove-twitter-200, TweetTokenizer

In [31]:
import nltk
from nltk.tokenize.casual import TweetTokenizer

nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

twitter_tknzr = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)


def tweet_text_to_tokens(text):
    return [tok for tok in twitter_tknzr.tokenize(text) if tok not in stop_words]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user0\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
import gensim
import numpy as np

w2v_model_name = "glove-twitter-200"
word2vec = gensim.downloader.load(w2v_model_name)
w2v_mean = np.mean(word2vec.vectors, axis=0)
w2v_std = np.std(w2v_std.vectors, axis=0)
dataset_w2v_kwargs = dict(word2vec=word2vec, w2v_mean=w2v_mean, w2v_std=w2v_std)

In [None]:
datasets_kwargs = dict(train=dataset_w2v_kwargs, valid=dataset_w2v_kwargs, test=dataset_w2v_kwargs)
datasets = create_datasets(SlightlyMoreComputationallyEfficientDataset, **splitted_data, datasets_kwargs=datasets_kwargs)
dataloaders = create_dataloaders(collate_fn, **datasets)

In [None]:
enable_reproducibility()
model_init_kwargs = default_model_init_kwargs
epochs_num = 5
optim_cls = AdamW
lr=3e-4
save_best_by='accuracy'
expr_desc = "expr #5"
model = Classifier(**model_init_kwargs)
optimizer = optim_cls(model.parameters(), lr=lr)
train_params = dict(
    optimizer_cls=optimizer.__class__.__name__,
    criterion_cls=criterion.__class__.__name__,
    predict_func=predict.__name__, collate_fn_func=collate_fn.__name__,
    lr=lr, epochs_num=epochs_num, save_best_by=save_best_by, w2v_model_name=w2v_model_name)
train_valid_test_args = dict(
    epochs_num=epochs_num,
    model=model, optimizer=optimizer, criterion=criterion,
    **dataloaders,
    predict=predict, device=DEVICE,
    save_best_by=save_best_by
)

In [None]:
%%time
results = train_valid_test_model(**train_valid_test_args)
add_experiment_metrics(expr_desc, model_init_kwargs, train_params, results)
show_results(results)