# Text classification
In this notebook, your task is to work through the classification of Russian-language tweets into positive and negative ones.

First, let's get the dataset ready for reading:

In [34]:
import os
import shutil
import gdown

# Define the source and destination paths
source_files = ['train.csv', 'val.csv']
destination_dir = os.path.join('.', 'data')

# Create the destination directory if it doesn't exist
os.makedirs(destination_dir, exist_ok=True)

# Check if files exist in the destination directory
for file in source_files:
    dest_file_path = os.path.join(destination_dir, file)
    if not os.path.exists(dest_file_path):
        # Download the file if it does not exist
        if file == 'train.csv':
            gdown.download(id="1GujrcFzRdo3E7UtUkcrljzDS9czBBy3s", output=file, quiet=False)
        elif file == 'val.csv':
            gdown.download(id="1vvm-PrV0r2wuGbYYovZSuReYOXpu0JRK", output=file, quiet=False)
        
        # Move the file to the destination directory
        if os.path.exists(file):
            shutil.move(file, destination_dir)
        else:
            print(f"File {file} does not exist.")
    else:
        print(f"File {file} already exists in the destination directory.")

File train.csv already exists in the destination directory.
File val.csv already exists in the destination directory.


In [35]:
%pip install torch==2.2.2
%pip install torchtext==0.17.2

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [36]:
from csv import reader

def dataset_iter(part):
    # Yield (label, text) pairs from data/<part>.csv.
    with open("data/" + part + ".csv", "rt", newline="") as f_in:
        r = reader(f_in)
        next(r)  # skip the header row
        for _, text, cls in r:
            yield cls, text

In [37]:
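def dataset_rows_num(part):
    # Count CSV records rather than physical lines, so that a quoted text
    # containing a line break is counted once; subtract the header row.
    with open("data/" + part + ".csv", "rt", newline="") as f_in:
        rows_num = sum(1 for _ in reader(f_in)) - 1
    return rows_num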
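
The two helpers above read a CSV file with a header row and return `(label, text)` pairs. The sketch below runs the same kind of parsing on a small in-memory CSV. The rows and the `(id, text, label)` column names are made up for illustration, and `iter_rows` is a hypothetical stand-in for `dataset_iter`. It also shows that the number of CSV records and the number of physical lines can differ when a quoted text contains a line break.

```python
import csv
import io

# Made-up rows in the assumed (id, text, label) column order.
sample_csv = (
    "id,text,label\n"
    '0,"great day",1\n'
    '1,"so tired\nand sad",0\n'
)

def iter_rows(f_in):
    """Yield (label, text) pairs, skipping the header row."""
    r = csv.reader(f_in)
    next(r)  # header
    for _, text, cls in r:
        yield cls, text

rows = list(iter_rows(io.StringIO(sample_csv, newline="")))
records = len(rows)
physical_lines = len(sample_csv.splitlines()) - 1  # minus the header

print(rows)
print(records, physical_lines)
```

Here the second tweet spans two physical lines but is a single record, so counting lines would overestimate the dataset size.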

At this point, the following error appeared:
```
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
```
So we downgrade NumPy:

In [38]:
%pip install "numpy<2.0"

Note: you may need to restart the kernel to use updated packages.


In [39]:
from torch.utils.data import IterableDataset

class RawTextIterableDataset(IterableDataset):
    """Simple iterator over a text dataset."""

    def __init__(self, full_num_lines, current_pos, iterator):
        """Constructor."""
        super().__init__()
        self.full_num_lines = full_num_lines
        self._iterator = iterator
        self.num_lines = full_num_lines
        self.current_pos = current_pos

    def __iter__(self):
        return self

    def __next__(self):
        if self.current_pos == self.num_lines - 1:
            raise StopIteration
        item = next(self._iterator)
        if self.current_pos is None:
            self.current_pos = 0
        else:
            self.current_pos += 1
        return item

    def __len__(self):
        return self.num_lines

    def pos(self):
        """Return the current position in the dataset."""
        return self.current_pos



In [40]:
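def RU_TW(part):
    # Start with current_pos=None so that every row is yielded;
    # starting at 0 makes __next__ stop one row early.
    return RawTextIterableDataset(dataset_rows_num(part), None, dataset_iter(part))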

Now let's build the vocabulary:

In [41]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [42]:
from torchtext.data.utils import get_tokenizer
from collections import Counter, OrderedDict
from torchtext.vocab import vocab as _vocab

tokenizer = get_tokenizer('toktok', 'ru')
train_iter = RU_TW("train")
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)

unk_token = '<unk>'
vocab = _vocab(ordered_dict, min_freq=1000, specials=[unk_token])
vocab.set_default_index(vocab[unk_token])


Let's define the dataset preprocessing functions:

In [43]:
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
label_pipeline = lambda x: int(x)
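
To make the vocabulary and `text_pipeline` steps more concrete, here is a minimal sketch in plain Python, without torchtext. It assumes whitespace splitting instead of the `toktok` tokenizer, a toy corpus, and a hypothetical threshold `MIN_FREQ = 2` (the notebook uses 1000). The names `itos`, `stoi` and `encode` are illustrative only.

```python
from collections import Counter

# Toy corpus; whitespace splitting stands in for the real tokenizer.
corpus = ["good good day", "bad day", "good mood"]
MIN_FREQ = 2  # hypothetical threshold, far smaller than the notebook's 1000
UNK = "<unk>"

counter = Counter(token for line in corpus for token in line.split())

# Index 0 is reserved for the unknown token; tokens that are frequent
# enough follow in order of descending count.
itos = [UNK] + [tok for tok, freq in counter.most_common() if freq >= MIN_FREQ]
stoi = {tok: idx for idx, tok in enumerate(itos)}

def encode(text):
    """Map each token to its index, falling back to the <unk> index."""
    return [stoi.get(tok, stoi[UNK]) for tok in text.split()]

print(itos)
print(encode("good bad day"))
```

Every token below the frequency threshold collapses into the same `<unk>` index, so a high `min_freq` leaves the model with very few distinguishable words.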

Let's build the dataset loader (known in slang as the "batcher"):

In [44]:
import torch
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        # Store each text's length; the cumulative sum below turns lengths into start offsets.
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = RU_TW("train")
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)
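
The `offsets` logic in `collate_batch` can be sketched in plain Python. The token ids below are made up; the point is only to show how texts of different lengths are concatenated into one flat sequence and how the offsets mark where each text begins.

```python
from itertools import accumulate

# Three already-encoded texts of different lengths (made-up token ids).
batch = [[1, 2, 4], [0, 3], [5, 1, 1, 2]]

flat = [tok for text in batch for tok in text]    # counterpart of torch.cat(text_list)
lengths = [len(text) for text in batch]
offsets = [0] + list(accumulate(lengths))[:-1]    # start index of each text in flat

# The original texts can be recovered from flat and offsets.
bounds = offsets + [len(flat)]
restored = [flat[bounds[i]:bounds[i + 1]] for i in range(len(batch))]

print(offsets)
```

This layout avoids padding: the batch is one 1-D sequence plus one start index per text.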

It's time to build the classification model. Here is a diagram of it:

<img src="https://pytorch.org/tutorials/_images/text_sentiment_ngrams_model.png" width="800" height="400">

And here is the code:

In [45]:
from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
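
As a rough illustration of what the `nn.EmbeddingBag` layer in this model computes, the sketch below checks that, in its default `"mean"` mode, each output row is the average of the embedding vectors of one text. The sizes, token ids and the name `bag` are made up for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy layer: 5 vocabulary entries, 3-dimensional embeddings, mean pooling.
bag = nn.EmbeddingBag(num_embeddings=5, embedding_dim=3, mode="mean")

# Two texts stored flat: tokens [1, 2, 4] and [0, 3].
text = torch.tensor([1, 2, 4, 0, 3], dtype=torch.int64)
offsets = torch.tensor([0, 3], dtype=torch.int64)

pooled = bag(text, offsets)

# Manual equivalent: look up the rows of each text and average them.
manual = torch.stack([
    bag.weight[text[0:3]].mean(dim=0),
    bag.weight[text[3:5]].mean(dim=0),
])

print(pooled.shape)
```

Because the vectors are averaged, word order within a tweet plays no role in this model.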

Let's create the model object:

In [46]:
train_iter = RU_TW("train")
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emsize = 4
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

Let's define the functions for training and evaluating the model:

In [47]:
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()
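
The training step above clips the gradient norm to 0.1 before each optimizer step. The plain-Python sketch below shows the general idea of clipping by global L2 norm. It is a simplified illustration of my understanding of the technique, not PyTorch's implementation; the helper `clip_by_global_norm` and the small `eps` term are assumptions of this sketch.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # joint norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.1)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

# Gradients that are already small enough are left unchanged.
small, _ = clip_by_global_norm([[0.03, 0.04]], max_norm=0.1)

print(norm_before, norm_after)
```

With a learning rate of 5 and a clipping threshold of 0.1, the length of each parameter update is bounded by roughly 0.5.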

In [48]:
def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count

Finally, training:

In [49]:
from torch.utils.data.dataset import random_split
# Hyperparameters
EPOCHS = 1 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
total_accu = None
train_iter = RU_TW("train")
test_iter = RU_TW("val")
train_dataset = list(train_iter)
test_dataset = list(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   500/ 2694 batches | accuracy    0.548
| epoch   1 |  1000/ 2694 batches | accuracy    0.577
| epoch   1 |  1500/ 2694 batches | accuracy    0.586
| epoch   1 |  2000/ 2694 batches | accuracy    0.587
| epoch   1 |  2500/ 2694 batches | accuracy    0.596
-----------------------------------------------------------
| end of epoch   1 | time:  7.58s | valid accuracy    0.614 
-----------------------------------------------------------
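
The training loop lowers the learning rate only when validation accuracy drops below the best value seen so far. With `EPOCHS = 1` this never happens, so the sketch below simulates the same rule over several epochs. The validation accuracies are hypothetical and `simulate_lr` is an illustrative helper, not part of the notebook.

```python
def simulate_lr(val_accuracies, lr=5.0, gamma=0.1):
    """Return the learning rate in effect during each epoch."""
    best_so_far = None  # plays the role of total_accu
    history = []
    for acc in val_accuracies:
        history.append(lr)  # rate used while training this epoch
        if best_so_far is not None and best_so_far > acc:
            lr *= gamma  # effect of scheduler.step() with step_size=1
        else:
            best_so_far = acc
    return history

# Hypothetical validation accuracies for four epochs.
history = simulate_lr([0.61, 0.63, 0.62, 0.64])
print(history)
```

The rate stays at 5.0 for the first three epochs and drops by a factor of ten after the third epoch, where accuracy fell from 0.63 to 0.62.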


And the evaluation on the test set:

In [50]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    0.617


And finally, a so-called manual check. Here you can enter any text you want to test:

In [51]:
sentiment_label = {0: "Negative",
                   1: "Positive"}

def predict(text, text_pipeline):
    with torch.no_grad():
        # An explicit dtype keeps the tensor integer-typed even if no tokens are produced.
        text = torch.tensor(text_pipeline(text), dtype=torch.int64)
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item()

ex_text_str = "привет"

model = model.to("cpu")

print("This is a %s tweet" % sentiment_label[predict(ex_text_str, text_pipeline)])

This is a Positive tweet


### Your task is to improve the quality of the model on the provided data. After all, roughly 62% accuracy is only slightly better than blind guessing.
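
To judge how far the model is from guessing, it helps to compare it with a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline on made-up labels; `majority_baseline_accuracy` is an illustrative helper, and the class balance of the real dataset is not assumed here.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Made-up labels; in the notebook they would come from test_dataset.
labels = ["1", "0", "1", "1", "0", "1", "0", "1"]
baseline = majority_baseline_accuracy(labels)
print(f"majority-class baseline: {baseline:.3f}")
```

In the notebook, the same function could be called as `majority_baseline_accuracy([label for label, _ in test_dataset])`, and any improved model should be compared against that number.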