Потренируемся самостоятельно писать многослойный перцептрон для работы с текстами.

Возьмем для этого датасет про юридические тексты. В этом датасете есть описания дел, а в качестве цп - то, что с делами произошло.

In [49]:
!wget https://raw.githubusercontent.com/rsuh-python/mag2022/main/CL/term02/06-Embeddings/legal_text_classification.csv

--2024-12-30 12:10:35--  https://raw.githubusercontent.com/rsuh-python/mag2022/main/CL/term02/06-Embeddings/legal_text_classification.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68202412 (65M) [text/plain]
Saving to: ‘legal_text_classification.csv.1’


2024-12-30 12:10:40 (130 MB/s) - ‘legal_text_classification.csv.1’ saved [68202412/68202412]



Для начала напишем бейзлайн - логистическую регрессию. Возьмем в качестве признаков только текст - описание самого дела (case_text). Целевую переменную, очевидно, нужно превратить в чиселки (OHE).

- проверьте данные на пропуски
- проверьте баланс классов - это очень важно!
- используйте TF-IDF
- не забудьте использовать LabelEncoder
- логистической регрессии может понадобиться выставить solver='liblinear'
- если не помните, как работать с несбалансированными датасетами, просмотрите наши конспекты - точно где-то было (на худой конец документация к логрегу)

In [50]:
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

In [51]:
data = pd.read_csv('legal_text_classification.csv')
data.head(20)

Unnamed: 0,case_id,case_outcome,case_title,case_text
0,Case1,cited,Alpine Hardwood (Aust) Pty Ltd v Hardys Pty Lt...,Ordinarily that discretion will be exercised s...
1,Case2,cited,Black v Lipovac [1998] FCA 699 ; (1998) 217 AL...,The general principles governing the exercise ...
2,Case3,cited,Colgate Palmolive Co v Cussons Pty Ltd (1993) ...,Ordinarily that discretion will be exercised s...
3,Case4,cited,Dais Studio Pty Ltd v Bullett Creative Pty Ltd...,The general principles governing the exercise ...
4,Case5,cited,Dr Martens Australia Pty Ltd v Figgins Holding...,The preceding general principles inform the ex...
5,Case6,cited,GEC Marconi Systems Pty Ltd v BHP Information ...,I accept that the making of a rolled up offer ...
6,Case7,cited,John S Hayes &amp; Associates Pty Ltd v Kimber...,The preceding general principles inform the ex...
7,Case8,cited,Seven Network Limited v News Limited (2007) 24...,On the question of the level of unreasonablene...
8,Case9,applied,Australian Broadcasting Corporation v O'Neill ...,recent decision of the High Court in Australia...
9,Case10,followed,Hexal Australia Pty Ltd v Roche Therapeutics I...,Hexal Australia Pty Ltd v Roche Therapeutics I...


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24985 entries, 0 to 24984
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   case_id       24985 non-null  object
 1   case_outcome  24985 non-null  object
 2   case_title    24985 non-null  object
 3   case_text     24809 non-null  object
dtypes: object(4)
memory usage: 780.9+ KB


In [52]:
data = data.dropna()

In [None]:
data.case_outcome.unique()

array(['cited', 'applied', 'followed', 'referred to', 'related',
       'considered', 'discussed', 'distinguished', 'affirmed', 'approved'],
      dtype=object)

In [None]:
data.head()

Unnamed: 0,case_id,case_outcome,case_title,case_text
0,Case1,cited,Alpine Hardwood (Aust) Pty Ltd v Hardys Pty Lt...,Ordinarily that discretion will be exercised s...
1,Case2,cited,Black v Lipovac [1998] FCA 699 ; (1998) 217 AL...,The general principles governing the exercise ...
2,Case3,cited,Colgate Palmolive Co v Cussons Pty Ltd (1993) ...,Ordinarily that discretion will be exercised s...
3,Case4,cited,Dais Studio Pty Ltd v Bullett Creative Pty Ltd...,The general principles governing the exercise ...
4,Case5,cited,Dr Martens Australia Pty Ltd v Figgins Holding...,The preceding general principles inform the ex...


In [None]:
data['case_outcome'].value_counts()

Unnamed: 0_level_0,count
case_outcome,Unnamed: 1_level_1
cited,12110
referred to,4363
applied,2438
followed,2252
considered,1699
discussed,1018
distinguished,603
related,112
approved,108
affirmed,106


In [53]:
X = data['case_text']
y = data['case_outcome']

In [54]:
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)

In [None]:
y.value_counts()

AttributeError: 'numpy.ndarray' object has no attribute 'value_counts'

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
vec = TfidfVectorizer(ngram_range=(1, 1))
bow = vec.fit_transform(X_train)
clf = LogisticRegression(solver='liblinear', random_state=42, class_weight='balanced')
clf.fit(bow, y_train)
pred = clf.predict(vec.transform(X_test))
print(classification_report(pred, y_test))



              precision    recall  f1-score   support

           0       0.76      0.29      0.42        75
           1       0.25      0.30      0.27       503
           2       0.37      0.20      0.26        50
           3       0.67      0.66      0.67      3012
           4       0.27      0.26      0.27       467
           5       0.41      0.21      0.28       476
           6       0.43      0.23      0.30       298
           7       0.29      0.37      0.32       452
           8       0.37      0.52      0.43       792
           9       0.56      0.23      0.33        78

    accuracy                           0.49      6203
   macro avg       0.44      0.33      0.36      6203
weighted avg       0.51      0.49      0.49      6203



Если все сделали как я, должна получиться средняя f-score в районе 0.5.

Теперь давайте попробуем написать нейронную сетку по аналогии с тетрадкой про твиттер из прошлого семинара.

In [56]:
import numpy as np
from string import punctuation
from collections import Counter
from sklearn.utils import shuffle, class_weight
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from torch.nn.utils.rnn import pad_sequence
import torch.optim as optim
from sklearn.metrics import f1_score

In [57]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
device

device(type='cuda')

class_weight - очень полезная для нас штука. Можно вычислить веса классов автоматически с ее помощью:

In [58]:
# первый аргумент - какие веса высчитывать, второй - какие у нас классы, третий - какие их частоты
yweights = class_weight.compute_class_weight('balanced', classes=np.unique(data.case_outcome), y=data.case_outcome)
yweights = torch.tensor(yweights, dtype=torch.float).to(device)

Заметьте, что возвращает оно np.array.

Нужно написать:

- функцию для предобработки текста, которая получает сырой текст и возвращает список токенов
- создать словарь word2id
- и обратный ему id2word

In [None]:
def preprocess(text):
    tokens = text.lower().split()
    tokens = [token.strip(punctuation) for token in tokens]
    return tokens

In [59]:
vocab = Counter()

for text in X:
    vocab.update(preprocess(text))
print('всего уникальных токенов:', len(vocab))

всего уникальных токенов: 63378


In [60]:
filtered_vocab = set()

for word in vocab:
    if vocab[word] > 2:
        filtered_vocab.add(word)
print('уникальных токенов, втретившихся больше 2 раз:', len(filtered_vocab))

уникальных токенов, втретившихся больше 2 раз: 38244


In [61]:
word2id = {'PAD': 0}

for word in filtered_vocab:
    word2id[word] = len(word2id)

In [62]:
id2word = {i: word for word, i in word2id.items()}

In [63]:
MAX_LEN = 0

for text in X:
    tokens = preprocess(text)
    MAX_LEN = max(len(tokens), MAX_LEN)
MAX_LEN

22466

In [None]:
X = torch.LongTensor(size=(X_train.shape[0], MAX_LEN))

for i, text in enumerate(X_train):
    tokens = preprocess(text)
    ids = [word2id[token] for token in tokens if token in word2id][:MAX_LEN]

    ids = F.pad(torch.LongTensor(ids), (0, MAX_LEN - len(ids)))
    X[i] = ids

In [None]:
print(X[4].shape)
print(X[4])
print([id2word[int(id_)] for id_ in  X[4]])

torch.Size([22466])
tensor([23952, 24459,  1440,  ...,     0,     0,     0])
['at', 'the', 'very', 'least', 'the', 'examples', 'in', 'heading', '3808', 'and', 'the', 'references', 'to', 'camphor', 'and', 'mosquito', 'spirals', 'and', 'coils', 'in', 'sub-heading', '3808.10.10', 'suggest', 'that', 'there', 'is', 'some', 'ambiguity', 'about', 'the', 'way', 'in', 'which', 'the', 'term', 'insecticide', 'is', 'used', 'in', 'the', 'heading', 'to', 'adapt', 'the', 'language', 'used', 'by', 'burchett', 'j', 'in', 'baxter', 'healthcare', '', 'a', 'legal', 'choice', 'must', 'be', 'made', 'about', 'possible', 'concepts', 'conveyed', 'by', 'the', 'statutory', 'expression', 'and', 'in', 'making', 'that', 'choice', 'it', 'is', 'appropriate', 'to', 'take', 'account', 'of', 'the', 'intent', 'and', 'purpose', 'of', 'the', 'customs', 'tariff', 'act', '', 'in', 'my', 'opinion', 'it', 'would', 'have', 'been', 'permissible', 'for', 'the', 'tribunal', 'to', 'have', 'regard', 'to', 'the', 'harmonised', 'syste

Лучше это все, конечно, запускать в колабе... не забудьте там выбрать T4 GPU в рантайме

In [64]:
def set_random_seed(seed):
    torch.backends.cudnn.deterministic = True
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)

set_random_seed(53)


Нужно написать класс для нашего датасета (можно беспощадно копипастить из тетрадки про твиттер)

In [65]:

class LegalTextsDataset(Dataset):
    def __init__(self, text, labels, word2id, device):
        self.text = list(text)
        self.word2id = word2id
        self.length = len(labels)
        self.target = list(labels)
        self.device = device

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        tokens = self.preprocess(self.text[index])
        ids = torch.LongTensor([self.word2id.get(token, 0) for token in tokens])
        y = torch.tensor([self.target[index]], dtype=torch.float32)
        return ids, y

    def preprocess(self, text):
        from string import punctuation
        tokens = text.lower().split()
        tokens = [token.strip(punctuation) for token in tokens]
        tokens = [token for token in tokens if token]
        return tokens

    def collate_fn(self, batch):
        from torch.nn.utils.rnn import pad_sequence
        ids, y = list(zip(*batch))
        padded_ids = pad_sequence(ids, batch_first=True).to(self.device)
        y = torch.stack(y).to(self.device)
        return padded_ids, y


In [66]:
train_dataset = LegalTextsDataset(X_train, y_train, word2id, device)

train_sampler = RandomSampler(train_dataset)

train_iterator = DataLoader(
    train_dataset,
    collate_fn=train_dataset.collate_fn,
    sampler=train_sampler,
    batch_size=256
)

In [67]:
val_dataset = LegalTextsDataset(X_test, y_test, word2id, device)

val_sampler = RandomSampler(val_dataset)

val_iterator = DataLoader(
    val_dataset,
    collate_fn=val_dataset.collate_fn,
    sampler=val_sampler,
    batch_size=256
)

Ну и наконец напишем архитектуру. Модель при инициализации должна принимать размер словаря и эмбеддинга. У нас в датасете 10 классов, поэтому, в отличие от тетрадки про твиттер, нужно использовать Softmax и возвращать вероятности классов. В качестве лосса подойдет кросс-энтропия (я ее уже за вас вписала вместе с весами классов).

In [None]:
class MLP(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.emb2h = nn.Linear(embedding_dim, 20)
        self.act1 = nn.ReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.h2out = nn.Linear(20, num_classes)


    def forward(self, text):
        embedded = self.embedding(text)
        mean_emb = torch.mean(embedded, dim=1)
        hidden = self.emb2h(mean_emb)
        hidden = self.act1(hidden)
        hidden = self.dropout(hidden)
        out = self.h2out(hidden)
        return out

Для начала взяла архитектуру из семинара про твиттер.

In [None]:
batch, y = next(iter(train_iterator))
batch = batch.to(device)
print(y)

tensor([[3.],
        [9.],
        [1.],
        [3.],
        [7.],
        [7.],
        [3.],
        [1.],
        [3.],
        [8.],
        [3.],
        [8.],
        [3.],
        [3.],
        [8.],
        [8.],
        [4.],
        [4.],
        [7.],
        [3.],
        [5.],
        [3.],
        [1.],
        [3.],
        [3.],
        [1.],
        [1.],
        [3.],
        [7.],
        [6.],
        [8.],
        [7.],
        [3.],
        [4.],
        [3.],
        [3.],
        [8.],
        [3.],
        [4.],
        [1.],
        [3.],
        [3.],
        [3.],
        [3.],
        [6.],
        [8.],
        [1.],
        [3.],
        [3.],
        [3.],
        [3.],
        [3.],
        [7.],
        [8.],
        [3.],
        [1.],
        [3.],
        [3.],
        [4.],
        [3.],
        [7.],
        [3.],
        [3.],
        [4.],
        [1.],
        [7.],
        [8.],
        [3.],
        [5.],
        [7.],
        [8.],
      

In [None]:
#пропустим через модель наш первый батч, чтобы проверить, что все работает
model = MLP(len(id2word), 5).to(device)
output = torch.argmax(model(batch), dim=1)
output

tensor([1, 8, 8, 1, 8, 8, 0, 8, 8, 1, 0, 1, 8, 8, 1, 1, 8, 1, 8, 8, 1, 1, 1, 8,
        8, 1, 0, 1, 1, 1, 8, 1, 0, 8, 8, 1, 8, 1, 8, 1, 0, 1, 1, 8, 0, 1, 1, 0,
        1, 8, 8, 8, 0, 1, 8, 1, 1, 8, 8, 1, 1, 1, 8, 8, 1, 1, 1, 1, 8, 8, 1, 8,
        0, 8, 0, 8, 1, 1, 8, 1, 1, 8, 1, 8, 1, 1, 8, 1, 1, 8, 8, 1, 1, 1, 1, 1,
        1, 8, 0, 8, 1, 1, 1, 8, 1, 8, 8, 1, 1, 1, 1, 1, 1, 0, 1, 8, 8, 1, 1, 1,
        1, 8, 1, 8, 1, 1, 0, 8, 1, 0, 0, 1, 0, 1, 0, 8, 8, 8, 0, 8, 1, 8, 8, 8,
        0, 1, 8, 1, 1, 1, 1, 1, 8, 1, 8, 8, 0, 8, 1, 1, 8, 8, 1, 0, 1, 1, 1, 1,
        8, 1, 8, 8, 1, 8, 0, 1, 8, 1, 1, 1, 1, 1, 1, 1, 8, 8, 0, 1, 8, 8, 1, 8,
        8, 8, 8, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 8, 1, 8, 8, 8, 8, 0, 1, 1, 1, 8,
        8, 0, 1, 1, 1, 1, 8, 0, 1, 1, 8, 1, 8, 8, 8, 8, 1, 1, 1, 1, 8, 8, 0, 1,
        0, 1, 8, 8, 1, 8, 1, 1, 1, 1, 8, 1, 8, 8, 1, 8], device='cuda:0')

Теперь нужно написать трейнлуп (лучше скопипастить откуда-нибудь), инициализировать нашу модель и запустить)

In [68]:
def train_loop(model, iterator, optimizer, criterion):
    #print('Training...')
    epoch_loss = 0
    all_preds = []
    all_labels = []

    model.train()
    for i, (texts, ys) in enumerate(iterator):
        optimizer.zero_grad()
        preds_proba = model(texts)
        preds = torch.argmax(preds_proba, dim=1)


        ys = ys.squeeze(dim=1).to(torch.long)  #пришлось убрать лишнюю размерность


        loss = criterion(preds_proba, ys)


        loss.backward()
        optimizer.step()


        epoch_loss += loss.item()
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(ys.cpu().numpy())




    f1 = f1_score(all_labels, all_preds, average='weighted') #добавила f1, чтобы не ориентироваться по одному лоссу


    return epoch_loss / len(iterator), f1

In [69]:
def evaluate(model, iterator, criterion, device):
    print("\nValidating...")
    epoch_loss = 0
    all_preds = []
    all_labels = []

    model.eval()
    with torch.no_grad():
        for i, (texts, ys) in enumerate(iterator):
            texts = texts.to(device)
            ys = ys.to(device)


            ys = ys.squeeze(dim=1).to(torch.long)


            predictions = model(texts)
            preds = torch.argmax(predictions, dim=1)

            loss = criterion(predictions, ys)

            epoch_loss += loss.item()
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(ys.cpu().numpy())

            if not (i + 1) % 5:
                print(f'Val loss: {epoch_loss / (i + 1)}')

    f1 = f1_score(all_labels, all_preds, average='weighted')
    print(f'Val F1-Score: {f1:.4f}')

    return epoch_loss / len(iterator), f1


In [None]:

set_random_seed(53)
model = MLP(len(word2id), 5)
optimizer = optim.AdamW(model.parameters(), lr=0.001) #решила посмотреть, какие есть варианты Адама
criterion = nn.CrossEntropyLoss(weight=yweights)

model = model.to(device)
criterion = criterion.to(device)


In [None]:
num_epochs = 100

for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.3090, F1: 0.0955
Epoch 10/100, Loss: 2.3004, F1: 0.3017
Epoch 20/100, Loss: 2.2975, F1: 0.2703
Epoch 30/100, Loss: 2.2946, F1: 0.2094
Epoch 40/100, Loss: 2.2880, F1: 0.2436
Epoch 50/100, Loss: 2.2886, F1: 0.2446
Epoch 60/100, Loss: 2.2862, F1: 0.2816
Epoch 70/100, Loss: 2.2799, F1: 0.2055
Epoch 80/100, Loss: 2.2760, F1: 0.2129
Epoch 90/100, Loss: 2.2767, F1: 0.2627
Epoch 100/100, Loss: 2.2682, F1: 0.1251


Скорее всего, вам понадобится учиться очень много эпох, чтобы предсказывать что-нибудь стоящее (эпох 100...), и, вероятнее всего, придется играться с архитектурой, чтобы получить приличное качество. На семинаре на эксперименты времени нет, поэтому добаловаться можно дома - и заодно попробовать подключить эмбеддинги w2v, например.

In [None]:
evaluate(model, val_iterator, criterion, device)



Validating...
Val loss: 2.268031883239746
Val loss: 2.271714115142822
Val loss: 2.272287050882975
Val loss: 2.2650322198867796
Val loss: 2.257837133407593
Val F1-Score: 0.2079


(2.257837133407593, 0.20794588411630807)

Лосс во время обучения очень медленно снижается, f1 хуже, чем у логистической регрессии.

In [None]:
class MLP2(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.emb2h = nn.Linear(embedding_dim, 20)
        self.act1 = nn.LeakyReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.h2out = nn.Linear(20, num_classes)


    def forward(self, text):
        embedded = self.embedding(text)
        mean_emb = torch.mean(embedded, dim=1)
        hidden = self.emb2h(mean_emb)
        hidden = self.act1(hidden)
        hidden = self.dropout(hidden)
        out = self.h2out(hidden)
        return out


Пробовала менять размеры скрытых слоев (100, 50/50, 20), процент дропаута (0.2, 0.5, 0.6), менять ReLu и LeakyRelu, результат либо ухудшался, либо не менялся, останавливала обучение на 40-50 эпохе, поэтому вывод не сохранила.


In [None]:
set_random_seed(53)
model2 = MLP2(len(word2id), 5)
optimizer = optim.AdamW(model2.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss(weight=yweights)

model2 = model2.to(device)
criterion = criterion.to(device)


In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model2, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.3063, F1: 0.2166
Epoch 10/100, Loss: 2.2954, F1: 0.2187
Epoch 20/100, Loss: 2.2987, F1: 0.2541
Epoch 30/100, Loss: 2.2448, F1: 0.0761
Epoch 40/100, Loss: 2.0471, F1: 0.0994
Epoch 50/100, Loss: 1.8923, F1: 0.0769
Epoch 60/100, Loss: 1.7863, F1: 0.1391
Epoch 70/100, Loss: 1.8150, F1: 0.1191
Epoch 80/100, Loss: 1.5949, F1: 0.1750
Epoch 90/100, Loss: 1.5763, F1: 0.2173
Epoch 100/100, Loss: 1.6034, F1: 0.1736


In [None]:
evaluate(model2, val_iterator, criterion, device)


Validating...
Val loss: 1.8369765996932983
Val loss: 2.3603370428085326
Val loss: 2.3162912050882976
Val loss: 2.3241359412670137
Val loss: 2.369232964515686
Val F1-Score: 0.2244


(2.369232964515686, 0.22439468822282577)

Каждый раз, когда у меня получалось добиться заметного снижения лосса на трейне, на тесте либо ничего не менялось, либо лосс вообще рос. Это лучшее, чего удалось добиться.

Попробовала добавить w2v:

In [70]:
import gensim
texts = X.apply(preprocess).tolist()
w2v = gensim.models.Word2Vec(texts, vector_size=100, window=5, min_count=2)

In [71]:
def create_embedding_matrix(word2id, w2v_model, embedding_dim=100):
    weights = np.zeros((len(word2id), embedding_dim))
    oov_count = 0
    for word, i in word2id.items():
        if word == 'PAD':
            continue
        try:
            weights[i] = w2v_model.wv[word]
        except KeyError:
            oov_count += 1
            weights[i] = np.random.normal(0, 0.1, embedding_dim)
    print(f"Количество OOV слов: {oov_count}")
    return weights

weights = create_embedding_matrix(word2id, w2v)
embedding_matrix = torch.tensor(weights, dtype=torch.float32)

Количество OOV слов: 0


Не совсем поняла, что за слова подсчитываются и на что это влияет.

In [None]:
class MLP_w2v(nn.Module):
    def __init__(self, vocab_size, weights):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, weights.shape[1])
        self.embedding.from_pretrained(weights, freeze=True)
        self.emb2h = nn.Linear(weights.shape[1], 20)
        self.act1 = nn.ReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.h2out = nn.Linear(20, 10)

    def forward(self, text):
        embedded = self.embedding(text)
        mean_emb = torch.mean(embedded, dim=1)
        hidden = self.emb2h(mean_emb)
        hidden = self.dropout(hidden)
        hidden = self.act1(hidden)
        hidden = self.dropout(hidden)
        out = self.h2out(hidden)
        return out

Первая архитектура для w2v из того же семинара.

In [None]:
class MLP_w2v2(nn.Module): #попробовала увеличь количество слоев и их размер. Дропаут оставила 0.5
    def __init__(self, vocab_size, weights):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, weights.shape[1])
        self.embedding.from_pretrained(weights, freeze=True)
        self.emb2h = nn.Linear(weights.shape[1], 100)
        self.act1 = nn.LeakyReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.h2 = nn.Linear(100, 50)
        self.act2 = nn.LeakyReLU()
        self.h3out = nn.Linear(50, 10)


    def forward(self, text):
        embedded = self.embedding(text)
        mean_emb = torch.mean(embedded, dim=1)
        hidden = self.emb2h(mean_emb)
        hidden = self.dropout(hidden)
        hidden = self.act1(hidden)
        hidden = self.dropout(hidden)
        hidden2 = self.h2(hidden)
        hidden2 = self.dropout(hidden2)
        hidden2 = self.act2(hidden2)
        hidden2 = self.dropout(hidden2)
        out = self.h3out(hidden2)
        return out

In [None]:
class MLP_w2v3(nn.Module): #архитектура, которая дала лучший результат без w2v
    def __init__(self, vocab_size, weights):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, weights.shape[1])
        self.embedding.from_pretrained(weights, freeze=True)
        self.emb2h = nn.Linear(weights.shape[1], 20)
        self.act1 = nn.LeakyReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.h2out = nn.Linear(20, 10)


    def forward(self, text):
        embedded = self.embedding(text)
        mean_emb = torch.mean(embedded, dim=1)
        hidden = self.emb2h(mean_emb)
        hidden = self.act1(hidden)
        hidden = self.dropout(hidden)
        out = self.h2out(hidden)
        return out

In [None]:
set_random_seed(53)
model3 = MLP_w2v(len(word2id), embedding_matrix)
optimizer = optim.AdamW(model3.parameters(), lr=0.01)
#optimizer = torch.optim.SGD(model3.parameters(), lr=0.01, momentum=0.9) думала, что этот оптимайзер поможет от переобучения, но стало хуже
criterion = nn.CrossEntropyLoss(weight=yweights)

model3 = model3.to(device)
criterion = criterion.to(device)


In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model3, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.3101, F1: 0.2536
Epoch 10/100, Loss: 2.2999, F1: 0.1977
Epoch 20/100, Loss: 2.2966, F1: 0.0615
Epoch 30/100, Loss: 2.2999, F1: 0.0153
Epoch 40/100, Loss: 2.2977, F1: 0.1094
Epoch 50/100, Loss: 2.3027, F1: 0.0434
Epoch 60/100, Loss: 2.3027, F1: 0.1763
Epoch 70/100, Loss: 2.2956, F1: 0.2155
Epoch 80/100, Loss: 2.2970, F1: 0.0971
Epoch 90/100, Loss: 2.2952, F1: 0.0241
Epoch 100/100, Loss: 2.2967, F1: 0.0704


In [None]:
evaluate(model3, val_iterator, criterion, device)



Validating...
Val loss: 2.297984409332275
Val loss: 2.297394347190857
Val loss: 2.3002013842264812
Val loss: 2.3017213344573975
Val loss: 2.3009741497039795
Val F1-Score: 0.0144


(2.3009741497039795, 0.014420646349566418)

In [None]:
set_random_seed(53)
model3 = MLP_w2v(len(word2id), embedding_matrix)
optimizer = optim.AdamW(model3.parameters(), lr=0.001) #то же самое, но меньше lr
#optimizer = torch.optim.SGD(model3.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss(weight=yweights)

model3 = model3.to(device)
criterion = criterion.to(device)


In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model3, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.3181, F1: 0.2978
Epoch 10/100, Loss: 2.3011, F1: 0.0162
Epoch 20/100, Loss: 2.2949, F1: 0.0163
Epoch 30/100, Loss: 2.2986, F1: 0.0162
Epoch 40/100, Loss: 2.2996, F1: 0.0161
Epoch 50/100, Loss: 2.2994, F1: 0.0164
Epoch 60/100, Loss: 2.2951, F1: 0.0177
Epoch 70/100, Loss: 2.2964, F1: 0.0167
Epoch 80/100, Loss: 2.2955, F1: 0.0167
Epoch 90/100, Loss: 2.2932, F1: 0.0164
Epoch 100/100, Loss: 2.2937, F1: 0.0172


In [None]:
evaluate(model3, val_iterator, criterion, device)



Validating...
Val loss: 2.2986007213592528
Val loss: 2.2981488704681396
Val loss: 2.300425656636556
Val loss: 2.3018643736839293
Val loss: 2.3010852336883545
Val F1-Score: 0.0140


(2.3010852336883545, 0.013985956367500718)

От уменьшения lr ничего не поменялось.


In [None]:
set_random_seed(53)
model3b = MLP_w2v2(len(word2id), embedding_matrix)
optimizer = optim.AdamW(model3b.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss(weight=yweights)

model3b = model3b.to(device)
criterion = criterion.to(device)


In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model3b, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.3570, F1: 0.1186
Epoch 10/100, Loss: 2.3095, F1: 0.0393
Epoch 20/100, Loss: 2.3036, F1: 0.0272
Epoch 30/100, Loss: 2.3026, F1: 0.0767
Epoch 40/100, Loss: 2.3042, F1: 0.0217
Epoch 50/100, Loss: 2.3038, F1: 0.0239
Epoch 60/100, Loss: 2.3487, F1: 0.1020
Epoch 70/100, Loss: 2.3031, F1: 0.1091
Epoch 80/100, Loss: 2.3038, F1: 0.1745
Epoch 90/100, Loss: 2.3030, F1: 0.0513
Epoch 100/100, Loss: 2.3032, F1: 0.0437


In [None]:
evaluate(model3b, val_iterator, criterion, device)



Validating...
Val loss: 2.2980159759521483
Val loss: 2.3006460428237916
Val loss: 2.3027770042419435
Val loss: 2.3025057911872864
Val loss: 2.3044393920898436
Val F1-Score: 0.0190


(2.3044393920898436, 0.019023387777106215)

К этому моменту 2.3 стало моим нелюбимым числом.

In [None]:
set_random_seed(53)
model3с = MLP_w2v3(len(word2id), embedding_matrix)
optimizer = optim.AdamW(model3с.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss(weight=yweights)

model3с = model3с.to(device)
criterion = criterion.to(device)


In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model3с, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.3137, F1: 0.1778
Epoch 10/100, Loss: 2.2978, F1: 0.1189
Epoch 20/100, Loss: 2.2982, F1: 0.0847
Epoch 30/100, Loss: 2.2936, F1: 0.0510
Epoch 40/100, Loss: 2.2690, F1: 0.0417
Epoch 50/100, Loss: 1.9343, F1: 0.0869
Epoch 60/100, Loss: 1.6639, F1: 0.1340
Epoch 70/100, Loss: 1.5827, F1: 0.1603
Epoch 80/100, Loss: 1.3734, F1: 0.1826
Epoch 90/100, Loss: 1.3159, F1: 0.2338
Epoch 100/100, Loss: 1.3336, F1: 0.2125


In [None]:
evaluate(model3с, val_iterator, criterion, device)


Validating...
Val loss: 2.1933315992355347
Val loss: 3.075034499168396
Val loss: 3.3973006884257
Val loss: 3.178031998872757
Val loss: 3.328015570640564
Val F1-Score: 0.1160


(3.328015570640564, 0.11604893250003304)

Здесь я пошла гуглить, что еще можно сделать с эмбеддингами, и решила попробовать вместо усреднения брать максимальные значения, в теории это должно было сохранить основную информацию.

In [None]:
class MLP_w2v_max(nn.Module):
    def __init__(self, vocab_size, weights):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, weights.shape[1])
        self.embedding.from_pretrained(weights, freeze=True)
        self.emb2h = nn.Linear(weights.shape[1], 20)
        self.act1 = nn.LeakyReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.h2out = nn.Linear(20, 10)

    def forward(self, text):
        embedded = self.embedding(text)
        max_emb, _ = torch.max(embedded, dim=1)
        hidden = self.emb2h(max_emb)
        hidden = self.act1(hidden)
        hidden = self.dropout(hidden)
        out = self.h2out(hidden)
        return out


In [None]:
set_random_seed(53)
model4 = MLP_w2v_max(len(word2id), embedding_matrix)
optimizer = optim.AdamW(model4.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss(weight=yweights)

model4 = model4.to(device)
criterion = criterion.to(device)


In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model4, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.3256, F1: 0.1346
Epoch 10/100, Loss: 2.2779, F1: 0.1194
Epoch 20/100, Loss: 1.8912, F1: 0.1529
Epoch 30/100, Loss: 1.6633, F1: 0.2285
Epoch 40/100, Loss: 1.1269, F1: 0.3827
Epoch 50/100, Loss: 0.8686, F1: 0.5296
Epoch 60/100, Loss: 0.7493, F1: 0.6187
Epoch 70/100, Loss: 0.6159, F1: 0.6693
Epoch 80/100, Loss: 0.4892, F1: 0.7309
Epoch 90/100, Loss: 0.5695, F1: 0.7117
Epoch 100/100, Loss: 0.4512, F1: 0.7798


In [None]:
evaluate(model4, val_iterator, criterion, device)


Validating...
Val loss: 20.48136348724365
Val loss: 22.4211483001709
Val loss: 21.21394589742025
Val loss: 20.99771738052368
Val loss: 21.913810081481934
Val F1-Score: 0.4733


(21.913810081481934, 0.473292257905774)

Мягко говоря, переобучились...

In [None]:
class MLP_w2v_mean(nn.Module): #попробовала избежать переобучения: вернулась к усреднению эмбеддингов, снизила процент dropout (на первых архитектурах более низкий dropout давал более низкие метрики)
    def __init__(self, vocab_size, weights):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, weights.shape[1])
        self.embedding.from_pretrained(weights, freeze=True)
        self.emb2h = nn.Linear(weights.shape[1], 20)
        self.act1 = nn.LeakyReLU()
        self.dropout = nn.Dropout(p=0.3)
        self.h2out = nn.Linear(20, 10)

    def forward(self, text):
        embedded = self.embedding(text)
        mean_emb = torch.mean(embedded, dim=1)
        hidden = self.emb2h(mean_emb)
        hidden = self.act1(hidden)
        hidden = self.dropout(hidden)
        out = self.h2out(hidden)
        return out


In [None]:
set_random_seed(53)
model4b = MLP_w2v_mean(len(word2id), embedding_matrix)
optimizer = optim.AdamW(model4b.parameters(), lr=0.001, weight_decay=1e-4) #регуляризация, чтобы снизить переобучение
criterion = nn.CrossEntropyLoss(weight=yweights)

model4b = model4b.to(device)
criterion = criterion.to(device)


In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model4b, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.3141, F1: 0.2502
Epoch 10/100, Loss: 2.3004, F1: 0.2759
Epoch 20/100, Loss: 2.2975, F1: 0.2028
Epoch 30/100, Loss: 2.2892, F1: 0.2488
Epoch 40/100, Loss: 2.2849, F1: 0.2083
Epoch 50/100, Loss: 2.2754, F1: 0.1333
Epoch 60/100, Loss: 2.2513, F1: 0.1573
Epoch 70/100, Loss: 2.2091, F1: 0.1667
Epoch 80/100, Loss: 2.1502, F1: 0.1740
Epoch 90/100, Loss: 2.0905, F1: 0.1446
Epoch 100/100, Loss: 2.0560, F1: 0.1711


In [None]:
evaluate(model4b, val_iterator, criterion, device)


Validating...
Val loss: 2.0536735534667967
Val loss: 1.9823457241058349
Val loss: 2.0314860343933105
Val loss: 2.011348509788513
Val loss: 2.0035588455200197
Val F1-Score: 0.1066


(2.0035588455200197, 0.10663156572145423)

Лучшее, что получалось.

In [None]:
class MLP_w2v_max2(nn.Module): #архитектура, на которой не было переобучения (два дропаута) + max pooling
    def __init__(self, vocab_size, weights):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, weights.shape[1])
        self.embedding.from_pretrained(weights, freeze=True)
        self.emb2h = nn.Linear(weights.shape[1], 20)
        self.act1 = nn.ReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.h2out = nn.Linear(20, 10)

    def forward(self, text):
        embedded = self.embedding(text)
        max_emb, _ = torch.max(embedded, dim=1)
        hidden = self.emb2h(max_emb)
        hidden = self.dropout(hidden)
        hidden = self.act1(hidden)
        hidden = self.dropout(hidden)
        out = self.h2out(hidden)
        return out

In [None]:
set_random_seed(53)
model4c = MLP_w2v_max2(len(word2id), embedding_matrix)
optimizer = optim.AdamW(model4c.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(weight=yweights)

model4c = model4c.to(device)
criterion = criterion.to(device)


In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model4c, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.4998, F1: 0.1983
Epoch 10/100, Loss: 2.5364, F1: 0.1944
Epoch 20/100, Loss: 2.5294, F1: 0.1963
Epoch 30/100, Loss: 2.5365, F1: 0.2025
Epoch 40/100, Loss: 2.5187, F1: 0.2048
Epoch 50/100, Loss: 2.5068, F1: 0.2010
Epoch 60/100, Loss: 2.5308, F1: 0.1987
Epoch 70/100, Loss: 2.5204, F1: 0.1992
Epoch 80/100, Loss: 2.5357, F1: 0.1962
Epoch 90/100, Loss: 2.5148, F1: 0.2022
Epoch 100/100, Loss: 2.4846, F1: 0.2001


In [None]:
evaluate(model4c, val_iterator, criterion, device)


Validating...
Val loss: 2.330614471435547
Val loss: 2.3445924043655397
Val loss: 2.36965594291687
Val loss: 2.358574962615967
Val loss: 2.368682622909546
Val F1-Score: 0.0023


(2.368682622909546, 0.002287988069806836)

Max Pooling все-таки не помог.

In [None]:
На всякий случай попробовала вариант, где усредненный и максимальный эмбеддинги конкатенируются (не помогло).

In [72]:

class MLPw2v_mm(nn.Module):
    def __init__(self, vocab_size, weights, hidden_dim=40):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, weights.shape[1])
        self.embedding.from_pretrained(weights, freeze=True)
        self.emb2h1 = nn.Linear(weights.shape[1] * 2, hidden_dim) #умножение на 2, чтобы объединить усредненные эмбеддинги и max pooling
        self.act1 = nn.LeakyReLU()
        self.dropout1 = nn.Dropout(p=0.4)
        self.bn1 = nn.BatchNorm1d(hidden_dim) #40
        self.h1h2 = nn.Linear(hidden_dim, hidden_dim // 2) #20
        self.act2 = nn.LeakyReLU()
        self.dropout2 = nn.Dropout(p=0.4)
        self.bn2 = nn.BatchNorm1d(hidden_dim // 2)
        self.h2out = nn.Linear(hidden_dim // 2, 10)

    def forward(self, text):
        embedded = self.embedding(text)
        max_emb, _ = torch.max(embedded, dim=1)
        mean_emb = torch.mean(embedded, dim=1)
        combined_emb = torch.cat((max_emb, mean_emb), dim=1)

        hidden1 = self.act1(self.emb2h1(combined_emb))
        hidden1 = self.dropout1(hidden1)
        hidden1 = self.bn1(hidden1)

        hidden2 = self.act2(self.h1h2(hidden1))
        hidden2 = self.dropout2(hidden2)
        hidden2 = self.bn2(hidden2)

        out = self.h2out(hidden2)
        return out


In [73]:
set_random_seed(53)
model4d = MLPw2v_mm(len(word2id), embedding_matrix)
optimizer = optim.AdamW(model4d.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(weight=yweights)

model4d = model4d.to(device)
criterion = criterion.to(device)


In [74]:
num_epochs = 100
for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model4d, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.3993, F1: 0.1263
Epoch 10/100, Loss: 2.2762, F1: 0.1012
Epoch 20/100, Loss: 2.1556, F1: 0.0655
Epoch 30/100, Loss: 1.9912, F1: 0.1251
Epoch 40/100, Loss: 1.9413, F1: 0.1066
Epoch 50/100, Loss: 1.7878, F1: 0.1458
Epoch 60/100, Loss: 1.7279, F1: 0.1197
Epoch 70/100, Loss: 1.6990, F1: 0.1224
Epoch 80/100, Loss: 1.6207, F1: 0.1426
Epoch 90/100, Loss: 1.5654, F1: 0.1332
Epoch 100/100, Loss: 1.5116, F1: 0.1342


In [75]:
evaluate(model4d, val_iterator, criterion, device)


Validating...
Val loss: 2.510327386856079
Val loss: 2.4704058170318604
Val loss: 2.4083879232406615
Val loss: 2.4121340572834016
Val loss: 2.3212630939483643
Val F1-Score: 0.0994


(2.3212630939483643, 0.0994071450662023)

Нашла, как использовать готовые эмбеддинги, решила попробоавать их, здесь FastText

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz


--2024-12-30 11:33:54--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.227.254.55, 13.227.254.30, 13.227.254.68, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.227.254.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1325960915 (1.2G) [binary/octet-stream]
Saving to: ‘cc.en.300.vec.gz’


2024-12-30 11:34:50 (23.0 MB/s) - ‘cc.en.300.vec.gz’ saved [1325960915/1325960915]



In [None]:
from gensim.models import KeyedVectors

#вместо w2v попробовала использовать предобученные эмбеддинги
fasttext_path = "/content/cc.en.300.vec.gz"
pretrained_embeddings = KeyedVectors.load_word2vec_format(fasttext_path, binary=False)

In [None]:
def create_embedding_matrix(word2id, w2v_model, embedding_dim=100):
    weights = np.zeros((len(word2id), embedding_dim))
    oov_count = 0
    for word, i in word2id.items():
        if word == 'PAD':
            continue
        try:
            weights[i] = w2v_model[word]
        except KeyError:
            oov_count += 1
            weights[i] = np.random.normal(0, 0.1, embedding_dim)
    print(f"Количество OOV слов: {oov_count}")
    return weights


weights = create_embedding_matrix(word2id, pretrained_embeddings, embedding_dim=300)
embedding_matrix = torch.tensor(weights, dtype=torch.float32)


Количество OOV слов: 12134


In [None]:
class MLP_ft(nn.Module): #архитектура, которая дала лучший результат с w2v
    def __init__(self, vocab_size, weights):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, weights.shape[1])
        self.embedding.weight = nn.Parameter(weights, requires_grad=False)
        self.emb2h = nn.Linear(weights.shape[1], 20)
        self.act1 = nn.LeakyReLU()
        self.dropout = nn.Dropout(p=0.3)
        self.h2out = nn.Linear(20, 10)

    def forward(self, text):
        embedded = self.embedding(text)
        mean_emb = torch.mean(embedded, dim=1)
        hidden = self.emb2h(mean_emb)
        hidden = self.act1(hidden)
        hidden = self.dropout(hidden)
        out = self.h2out(hidden)
        return out


In [None]:
set_random_seed(53)
model5 = MLP_ft(len(word2id), embedding_matrix)
optimizer = optim.AdamW(model5.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(weight=yweights)

model5 = model5.to(device)
criterion = criterion.to(device)


In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    train_loss, train_f1 = train_loop(model5, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

Epoch 1/100, Loss: 2.3099, F1: 0.0003
Epoch 10/100, Loss: 2.2915, F1: 0.3005
Epoch 20/100, Loss: 2.2825, F1: 0.2926
Epoch 30/100, Loss: 2.2751, F1: 0.3090
Epoch 40/100, Loss: 2.2559, F1: 0.2798
Epoch 50/100, Loss: 2.2315, F1: 0.2669
Epoch 60/100, Loss: 2.2079, F1: 0.2797
Epoch 70/100, Loss: 2.1729, F1: 0.2714
Epoch 80/100, Loss: 2.1658, F1: 0.2761
Epoch 90/100, Loss: 2.1529, F1: 0.2877
Epoch 100/100, Loss: 2.1470, F1: 0.2748


In [None]:
evaluate(model5, val_iterator, criterion, device)



Validating...
Val loss: 2.141446590423584
Val loss: 2.1724565267562865
Val loss: 2.1535562674204507
Val loss: 2.1453964471817017
Val loss: 2.145128688812256
Val F1-Score: 0.2983


(2.145128688812256, 0.29834651032756027)

Неплохо (относительно w2v, в целом печально), loss почти такой же, f1 повыше.

In [76]:

class MLP_ft_2(nn.Module):
    def __init__(self, vocab_size, weights, hidden_dim=40):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, weights.shape[1])
        self.embedding.from_pretrained(weights, freeze=True)
        self.emb2h1 = nn.Linear(weights.shape[1] * 2, hidden_dim) #умножение на 2, чтобы объединить усредненные эмбеддинги и max pooling
        self.act1 = nn.LeakyReLU()
        self.dropout1 = nn.Dropout(p=0.4)
        self.bn1 = nn.BatchNorm1d(hidden_dim)
        self.h1h2 = nn.Linear(hidden_dim, hidden_dim // 2)
        self.act2 = nn.LeakyReLU()
        self.dropout2 = nn.Dropout(p=0.4)
        self.bn2 = nn.BatchNorm1d(hidden_dim // 2)
        self.h2out = nn.Linear(hidden_dim // 2, 10)

    def forward(self, text):
        embedded = self.embedding(text)
        max_emb, _ = torch.max(embedded, dim=1)
        mean_emb = torch.mean(embedded, dim=1)
        combined_emb = torch.cat((max_emb, mean_emb), dim=1)

        hidden1 = self.act1(self.emb2h1(combined_emb))
        hidden1 = self.dropout1(hidden1)
        hidden1 = self.bn1(hidden1)

        hidden2 = self.act2(self.h1h2(hidden1))
        hidden2 = self.dropout2(hidden2)
        hidden2 = self.bn2(hidden2)

        out = self.h2out(hidden2)
        return out


In [None]:
set_random_seed(53)
model6 = MLP_ft_2(len(word2id), embedding_matrix)
optimizer = optim.AdamW(model6.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(weight=yweights)

model6 = model6.to(device)
criterion = criterion.to(device)


С предобученными эмбеддингами поэксперементировать особо не получилось, быстро заканчивалась память:

In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    torch.cuda.empty_cache()
    train_loss, train_f1 = train_loop(model6, train_iterator, optimizer, criterion)
    if (epoch + 1) % 10 == 0 or epoch == 0:
      print(f'Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, F1: {train_f1:.4f}')

OutOfMemoryError: CUDA out of memory. Tried to allocate 6.12 GiB. GPU 0 has a total capacity of 14.75 GiB of which 1.94 GiB is free. Process 3226 has 12.81 GiB memory in use. Of the allocated memory 6.62 GiB is allocated by PyTorch, and 6.06 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
evaluate(model6, val_iterator, criterion, device)


Наилучший результат дали w2v (2.0035588455200197, 0.10663156572145423) и предобученные эмбеддинги (2.145128688812256, 0.29834651032756027).