### **Extra assignment 3**

In this assignment you will train two models to predict the rating of hotel reviews.

1. Download the review dataset: [Trip Advisor Hotel Reviews](https://www.kaggle.com/datasets/andrewmvd/trip-advisor-hotel-reviews)
2. Prepare the data for training - **1 point**
    - Split the dataset into training, validation, and test sets with stratification, in a 60/20/20 ratio.
    - Create a `Dataset` and a `DataLoader` for the training, validation, and test sets. Choose the `batch_size` you consider optimal.
3. Implement and train a convolutional network (one that uses Conv1d layers) for the task - **1 point**
4. Implement and train a recurrent network (one that uses an LSTM layer) for the task - **1 point**
5. Compare the metrics and training dynamics of the two models - **1 point**

**General**

The solution is reproducible: `random_state` values are fixed, and the notebook runs from start to finish without errors - **1 point**

In [109]:
# Core libraries
import random
import numpy as np
import pandas as pd

# PyTorch and related libraries
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Metrics
from torchmetrics.classification import Accuracy, F1Score

# PyTorch Lightning
import pytorch_lightning as pl
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

# HuggingFace tokenizer
from transformers import AutoTokenizer

# Sklearn
from sklearn.model_selection import train_test_split

In [88]:
def set_random_state(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    
set_random_state(42)
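
The global seed above fixes the random state once, at the top of the notebook. As a complement, the shuffling order of a `DataLoader` can be pinned with its own seeded `torch.Generator`, so the order does not depend on how many random numbers were drawn earlier in the session. The sketch below is an illustration on a toy dataset; `make_loader` and `epoch_order` are hypothetical helper names, not part of the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(seed):
    """Build a shuffling DataLoader whose order depends only on `seed`."""
    dataset = TensorDataset(torch.arange(10))       # toy data: the integers 0..9
    generator = torch.Generator().manual_seed(seed)
    return DataLoader(dataset, batch_size=4, shuffle=True, generator=generator)


def epoch_order(loader):
    """Return the sample values in the order one epoch yields them."""
    return [int(value) for (batch,) in loader for value in batch]


order_a = epoch_order(make_loader(42))
order_b = epoch_order(make_loader(42))
print(order_a == order_b)  # same seed, same shuffling order
```

Passing the generator explicitly is one possible design choice; re-running `set_random_state(42)` immediately before training is another.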

### Loading the data

In [None]:
# Load the dataset from CSV
file_path = "../data/tripadvisor_hotel_reviews.csv"
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


In [90]:
# Split the data into X (texts) and y (ratings)
X = df['Review']
# Shift ratings from 1..5 to the range [0, 4]
y = df['Rating'] - 1

# Split into training, validation, and test sets (60/20/20)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

print(f"Train size: {len(X_train)}, Validation size: {len(X_val)}, Test size: {len(X_test)}")

Train size: 12294, Validation size: 4098, Test size: 4099
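
The two-step split above first holds out 40% of the data and then halves that part, with `stratify` applied at both steps. The sketch below repeats the same calls on synthetic labels to show how one might check that class shares stay close across the three parts. The label distribution is hypothetical and is not the real rating distribution of the dataset; `class_shares` is an illustrative helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels for 1000 samples (NOT the real rating distribution).
rng = np.random.default_rng(42)
labels = rng.choice(5, size=1000, p=[0.05, 0.10, 0.15, 0.30, 0.40])
indices = np.arange(len(labels))

# Same two-step scheme as in the notebook: 60% train, then 20% / 20%.
train_idx, temp_idx = train_test_split(
    indices, test_size=0.4, stratify=labels, random_state=42
)
val_idx, test_idx = train_test_split(
    temp_idx, test_size=0.5, stratify=labels[temp_idx], random_state=42
)


def class_shares(part):
    """Fraction of each class (0..4) inside the given index subset."""
    return np.bincount(labels[part], minlength=5) / len(part)


for name, part in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(part), np.round(class_shares(part), 3))
```

With stratification, the printed shares should differ only by rounding effects between the three subsets.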


### Preparing the data for training

In [91]:
# Choose the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

class HotelReviewDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts.iloc[idx]
        label = self.labels.iloc[idx]
        tokens = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )
        return tokens['input_ids'].squeeze(0), label

# Create the datasets
train_dataset = HotelReviewDataset(X_train, y_train, tokenizer)
val_dataset = HotelReviewDataset(X_val, y_val, tokenizer)
test_dataset = HotelReviewDataset(X_test, y_test, tokenizer)

# Collate function: batches the samples (they are already padded to max_length, so pad_sequence adds no extra padding)
def collate_fn(batch):
    texts, labels = zip(*batch)
    texts = pad_sequence(texts, batch_first=True)
    labels = torch.tensor(labels)
    return texts, labels

# DataLoaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
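
Because the `Dataset` above already pads every review to `max_length`, all tensors reaching `collate_fn` have the same length. An alternative, sketched here under assumptions, is to tokenize without padding and pad each batch only to its longest sequence, returning the true lengths as well. The function name `collate_with_lengths` is hypothetical, the token ids are arbitrary apart from 101 and 102 (`[CLS]` and `[SEP]` in `bert-base-uncased`), and `pad_id=0` assumes the tokenizer's padding id is 0.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_with_lengths(batch, pad_id=0):
    """Pad variable-length token tensors to the longest one in the batch."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=pad_id)
    return padded, lengths, torch.tensor(labels)


# Two toy samples of different lengths, as (token_ids, label) pairs.
batch = [
    (torch.tensor([101, 2307, 102]), 4),
    (torch.tensor([101, 2204, 2282, 3295, 102]), 2),
]
padded, lengths, labels = collate_with_lengths(batch)
print(padded.shape, lengths.tolist(), labels.tolist())
```

This variant returns three items per batch instead of two, so the `training_step`, `validation_step`, and `test_step` methods would need to unpack the extra `lengths` tensor.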

### CNN

In [92]:
class ConvNet(pl.LightningModule):
    def __init__(self, vocab_size, embedding_dim, num_classes, lr=1e-4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv = nn.Conv1d(embedding_dim, 128, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.linear = nn.Linear(128, num_classes)
        self.lr = lr

        # Metrics
        self.train_accuracy = Accuracy(task="multiclass", num_classes=num_classes)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=num_classes)
        self.test_accuracy = Accuracy(task="multiclass", num_classes=num_classes)

        self.train_f1 = F1Score(task="multiclass", num_classes=num_classes, average="weighted")
        self.val_f1 = F1Score(task="multiclass", num_classes=num_classes, average="weighted")
        self.test_f1 = F1Score(task="multiclass", num_classes=num_classes, average="weighted")

    def forward(self, x):
        x = self.embedding(x)               # (batch_size, seq_len) -> (batch_size, seq_len, embedding_dim)
        x = x.permute(0, 2, 1)              # For Conv1d: (batch_size, embedding_dim, seq_len)
        x = self.conv(x)                    # Convolution: (batch_size, 128, seq_len)
        x = self.relu(x)                    # Activation
        x = self.pool(x).squeeze(-1)        # Global max pooling: (batch_size, 128)
        x = self.linear(x)                  # Linear layer: (batch_size, num_classes)
        return x

    def training_step(self, batch, batch_idx):
        texts, labels = batch
        preds = self(texts)
        loss = nn.CrossEntropyLoss()(preds, labels)

        # Log the loss and metrics
        self.log("train_loss", loss, on_epoch=True, prog_bar=True)
        self.log("train_accuracy", self.train_accuracy(preds, labels), on_epoch=True, prog_bar=True)
        self.log("train_f1", self.train_f1(preds, labels), on_epoch=True, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        texts, labels = batch
        preds = self(texts)
        loss = nn.CrossEntropyLoss()(preds, labels)

        # Log the loss and metrics
        self.log("val_loss", loss, on_epoch=True, prog_bar=True)
        self.log("val_accuracy", self.val_accuracy(preds, labels), on_epoch=True, prog_bar=True)
        self.log("val_f1", self.val_f1(preds, labels), on_epoch=True, prog_bar=True)

    def test_step(self, batch, batch_idx):
        texts, labels = batch
        preds = self(texts)
        loss = nn.CrossEntropyLoss()(preds, labels)

        # Log the loss and metrics
        self.log("test_loss", loss, on_epoch=True, prog_bar=True)
        self.log("test_accuracy", self.test_accuracy(preds, labels), on_epoch=True)
        self.log("test_f1", self.test_f1(preds, labels), on_epoch=True)


    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr, weight_decay=1e-5)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode="min", factor=0.5, patience=2, verbose=True
        )
        return {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "val_loss"}
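
To make the tensor shapes in `ConvNet.forward` concrete, the sketch below traces a toy batch through the same sequence of layers. The sizes (`batch_size=2`, `seq_len=10`, `vocab_size=100`, `embedding_dim=16`) are arbitrary illustrative values, not the notebook's hyperparameters.

```python
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size, embedding_dim, num_classes = 2, 10, 100, 16, 5

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))     # (2, 10)
embedded = nn.Embedding(vocab_size, embedding_dim)(tokens)       # (2, 10, 16)
channels_first = embedded.permute(0, 2, 1)                       # (2, 16, 10)

# padding=1 with kernel_size=3 keeps the sequence length unchanged.
conv_out = nn.Conv1d(embedding_dim, 128, kernel_size=3, padding=1)(channels_first)  # (2, 128, 10)

# Global max pooling collapses the sequence axis to one value per filter.
pooled = nn.AdaptiveMaxPool1d(1)(torch.relu(conv_out)).squeeze(-1)  # (2, 128)
logits = nn.Linear(128, num_classes)(pooled)                        # (2, 5)

for name, tensor in [("embedded", embedded), ("conv_out", conv_out),
                     ("pooled", pooled), ("logits", logits)]:
    print(name, tuple(tensor.shape))
```

The global max pooling step is what makes the classifier independent of the sequence length: each of the 128 filters contributes only its strongest response.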

In [None]:
# Hyperparameters
vocab_size = tokenizer.vocab_size
embedding_dim = 256
num_classes = 5

# Model
model = ConvNet(vocab_size, embedding_dim, num_classes)

early_stopping = EarlyStopping(
    monitor="val_loss",
    patience=5,
    mode="min"
)

trainer = pl.Trainer(
    max_epochs=25,
    accelerator="auto",
    devices="auto",
    default_root_dir="../logs/extra_3/lightning_logs_cnn",
    callbacks=[early_stopping]
)

# Training
trainer.fit(model, train_loader, val_loader)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

   | Name           | Type               | Params | Mode 
---------------------------------------------------------------
0  | embedding      | Embedding          | 7.8 M  | train
1  | conv           | Conv1d             | 98.4 K | train
2  | relu           | ReLU               | 0      | train
3  | pool           | AdaptiveMaxPool1d  | 0      | train
4  | linear         | Linear             | 645    | train
5  | train_accuracy | MulticlassAccuracy | 0      | train
6  | val_accuracy   | MulticlassAccuracy | 0      | train
7  | test_accuracy  | MulticlassAccuracy | 0      | train
8  | train_f1       | MulticlassF1Score  | 0      | train
9  | val_f1         | MulticlassF1Score  | 0      | train
10 | test_f1        | MulticlassF1Score  | 0      | train
---------------------------------------------------------------
7.9 M     Trainable params
0         Non-trainable params
7.

`Trainer.fit` stopped: `max_epochs=25` reached.
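
The run above reached `max_epochs=25`, which means the `EarlyStopping` callback never triggered. The sketch below is a simplified illustration of a patience rule of this kind, not Lightning's actual implementation: training stops once the monitored value has failed to improve for `patience` consecutive checks. The function `early_stop_epoch` and the loss values are hypothetical.

```python
def early_stop_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    checks_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best value: reset the counter.
            best = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is at epoch 1, followed by five worse checks.
plateau = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 0.5]
# Hypothetical losses that improve on every check.
improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]

print(early_stop_epoch(plateau), early_stop_epoch(improving))
```

Under this rule, a run that completes all epochs is consistent with `val_loss` setting a new minimum at least once in every window of five validation checks.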


In [94]:
# Testing
trainer.test(model, test_loader)

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      test_accuracy         0.6225908994674683
         test_f1            0.6095143556594849
        test_loss           0.8991456627845764
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 0.8991456627845764,
  'test_accuracy': 0.6225908994674683,
  'test_f1': 0.6095143556594849}]

In [97]:
# Start the TensorBoard server
from tensorboard import program
from IPython.display import IFrame

# Create the TensorBoard object
tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "../logs/extra_3/lightning_logs_cnn"])
# Launch the TensorBoard server
url = tb.launch()

# Display TensorBoard in an IFrame
IFrame(src=url, width="100%", height="800px")

### RNN

In [111]:
class LSTMNet(pl.LightningModule):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes, lr=1e-4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(hidden_dim * 2, num_classes)  # *2 because the LSTM is bidirectional
        self.lr = lr
        self.dropout = nn.Dropout(0.3)

        # Metrics
        self.train_accuracy = Accuracy(task="multiclass", num_classes=num_classes)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=num_classes)
        self.test_accuracy = Accuracy(task="multiclass", num_classes=num_classes)

        self.train_f1 = F1Score(task="multiclass", num_classes=num_classes, average="weighted")
        self.val_f1 = F1Score(task="multiclass", num_classes=num_classes, average="weighted")
        self.test_f1 = F1Score(task="multiclass", num_classes=num_classes, average="weighted")

    def forward(self, x):
        x = self.embedding(x)               # (batch_size, seq_len) -> (batch_size, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(x)          # LSTM output: (batch_size, seq_len, hidden_dim*2)
        x = lstm_out[:, -1, :]              # Take the output at the last time step only
        x = self.dropout(x)                 # Apply dropout to that output
        x = self.linear(x)                  # Linear layer for classification
        return x

    def training_step(self, batch, batch_idx):
        texts, labels = batch
        preds = self(texts)
        loss = nn.CrossEntropyLoss()(preds, labels)

        # Log the loss and metrics
        self.log("train_loss", loss, on_epoch=True, prog_bar=True)
        self.log("train_accuracy", self.train_accuracy(preds, labels), on_epoch=True, prog_bar=True)
        self.log("train_f1", self.train_f1(preds, labels), on_epoch=True, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        texts, labels = batch
        preds = self(texts)
        loss = nn.CrossEntropyLoss()(preds, labels)

        # Log the loss and metrics
        self.log("val_loss", loss, on_epoch=True, prog_bar=True)
        self.log("val_accuracy", self.val_accuracy(preds, labels), on_epoch=True, prog_bar=True)
        self.log("val_f1", self.val_f1(preds, labels), on_epoch=True, prog_bar=True)

    def test_step(self, batch, batch_idx):
        texts, labels = batch
        preds = self(texts)
        loss = nn.CrossEntropyLoss()(preds, labels)

        # Log the loss and metrics
        self.log("test_loss", loss, on_epoch=True, prog_bar=True)
        self.log("test_accuracy", self.test_accuracy(preds, labels), on_epoch=True)
        self.log("test_f1", self.test_f1(preds, labels), on_epoch=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr, weight_decay=1e-5)
        scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2, verbose=True)
        return {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "val_loss"}
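
In `LSTMNet.forward`, `lstm_out[:, -1, :]` reads the output at the last time step. Since the `Dataset` pads every review to `max_length`, that position holds a padding token for shorter reviews, and the backward direction has processed only one token at that point. One possible alternative, sketched here as an assumption rather than the notebook's method, is to pack the sequences and classify from the final hidden states `h_n`. The class name `LastStateLSTM` is hypothetical, and the sketch assumes right-side padding with `pad_id=0`.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class LastStateLSTM(nn.Module):
    """Bidirectional LSTM classifier that ignores trailing padding (illustrative sketch)."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_id)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        # Number of non-padding tokens per sample (assumes padding only at the end).
        lengths = (x != self.pad_id).sum(dim=1).clamp(min=1).cpu()
        packed = pack_padded_sequence(
            self.embedding(x), lengths, batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # h_n has shape (2, batch, hidden_dim): final forward and final backward states.
        features = torch.cat([h_n[0], h_n[1]], dim=1)
        return self.linear(features)


torch.manual_seed(0)
model = LastStateLSTM(vocab_size=50, embedding_dim=8, hidden_dim=6, num_classes=5).eval()
padded_input = torch.tensor([[5, 7, 9, 0, 0, 0]])
unpadded_input = torch.tensor([[5, 7, 9]])
with torch.no_grad():
    same = torch.allclose(model(padded_input), model(unpadded_input), atol=1e-5)
print(same)  # padding no longer changes the prediction
```

Whether this change would improve the results reported in this notebook has not been measured here.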

In [112]:
# Hyperparameters
embedding_dim = 256
hidden_dim = 256
num_classes = 5

# Create the model
lstm_model = LSTMNet(vocab_size, embedding_dim, hidden_dim, num_classes)

# Early-stopping callback
early_stopping = EarlyStopping(
    monitor="val_loss",
    patience=5,
    mode="min"
)

# Trainer
trainer_lstm = pl.Trainer(
    max_epochs=25,
    accelerator="auto",
    devices="auto",
    callbacks=[early_stopping],
    default_root_dir="../logs/extra_3/lightning_logs_lstm"
)

# Training
trainer_lstm.fit(lstm_model, train_loader, val_loader)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name           | Type               | Params | Mode 
--------------------------------------------------------------
0 | embedding      | Embedding          | 7.8 M  | train
1 | lstm           | LSTM               | 1.1 M  | train
2 | linear         | Linear             | 2.6 K  | train
3 | dropout        | Dropout            | 0      | train
4 | train_accuracy | MulticlassAccuracy | 0      | train
5 | val_accuracy   | MulticlassAccuracy | 0      | train
6 | test_accuracy  | MulticlassAccuracy | 0      | train
7 | train_f1       | MulticlassF1Score  | 0      | train
8 | val_f1         | MulticlassF1Score  | 0      | train
9 | test_f1        | MulticlassF1Score  | 0      | train
--------------------------------------------------------------
8.9 M     Trainable params
0         Non-trainable params
8.9 M     Total params
35.475    Total estimated model params size (MB)
1
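
The parameter counts in the two model summaries can be reproduced by hand from the layer sizes. The sketch below does this arithmetic; the helper functions are illustrative, and `vocab_size = 30522` is an assumption (the vocabulary size of `bert-base-uncased`, which the notebook does not print).

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def conv1d_params(in_channels, out_channels, kernel_size):
    # Weights plus one bias per output channel.
    return in_channels * out_channels * kernel_size + out_channels


def linear_params(in_features, out_features):
    return in_features * out_features + out_features


def lstm_params(input_dim, hidden_dim, bidirectional=True):
    # Four gates, each with input weights, recurrent weights, and two bias vectors.
    per_direction = 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    return per_direction * (2 if bidirectional else 1)


vocab_size = 30522  # assumption: bert-base-uncased vocabulary size
embedding_dim, hidden_dim, num_classes = 256, 256, 5

cnn_total = (embedding_params(vocab_size, embedding_dim)
             + conv1d_params(embedding_dim, 128, 3)
             + linear_params(128, num_classes))
lstm_total = (embedding_params(vocab_size, embedding_dim)
              + lstm_params(embedding_dim, hidden_dim)
              + linear_params(hidden_dim * 2, num_classes))

print(f"CNN:  {cnn_total:,} parameters")
print(f"LSTM: {lstm_total:,} parameters, {lstm_total * 4 / 1e6:.3f} MB at 4 bytes each")
```

In both models the embedding table accounts for most of the parameters, so the two networks differ mainly in the comparatively small layers on top of it.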

In [113]:
# Testing
trainer_lstm.test(lstm_model, test_loader)

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      test_accuracy         0.4681629538536072
         test_f1              0.3846455514431
        test_loss           1.2264670133590698
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 1.2264670133590698,
  'test_accuracy': 0.4681629538536072,
  'test_f1': 0.3846455514431}]

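### Comparing CNN and RNN metrics
The CNN performed better than the LSTM on the test set: accuracy 0.623 vs. 0.468, weighted F1 0.610 vs. 0.385, and loss 0.899 vs. 1.226. The CNN also came out ahead across the other configurations that were tried.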
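
For a side-by-side view, the test metrics reported by the two `trainer.test` calls above can be tabulated as in the sketch below. The numbers are copied from those outputs and rounded to four decimals; the variable names are illustrative.

```python
# Test metrics copied from the two trainer.test outputs in this notebook (rounded).
cnn_metrics = {"test_loss": 0.8991, "test_accuracy": 0.6226, "test_f1": 0.6095}
lstm_metrics = {"test_loss": 1.2265, "test_accuracy": 0.4682, "test_f1": 0.3846}

# Difference CNN minus LSTM for each metric.
gaps = {name: cnn_metrics[name] - lstm_metrics[name] for name in cnn_metrics}

for name in cnn_metrics:
    print(f"{name:<14} CNN={cnn_metrics[name]:.4f}  "
          f"LSTM={lstm_metrics[name]:.4f}  diff={gaps[name]:+.4f}")
```

A positive difference is better for accuracy and F1, while a negative difference is better for the loss.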

In [114]:
# Create the TensorBoard object
tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "../logs/extra_3"])
url = tb.launch()

# Display TensorBoard in an IFrame
IFrame(src=url, width="100%", height="800px")