# HW2 Weights Clustering

ДЗ: Применить изученные подходы к своим моделям и замерить производительность.

В качестве модели взяли **BERT base model (uncased)**

Будем решать задачу классификации отзывов(токсичный/не токсичный) для "виртуального интернет магазина". То есть бинарная классификация.


**Описание данных**

Данные находятся в файле `toxic_comments.csv`. Столбец *text* в нём содержит текст комментария, а *toxic* — целевой признак.


**ПРИМЕЧАНИЕ:** так как работа носит учебный характер, то все манипуляции с данными будут проводится в усеченном варианте, то есть будет браться часть датасета для обучения и тестирования (2 000 записей).

# Подготовка

Установим и загрузим необходимые библиотеки для работы

In [1]:
!pip -q install transformers

In [2]:
import pandas as pd
import numpy as np
import re
import os
import time
import copy

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords
from tqdm import notebook

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.cluster import KMeans

import torch
import transformers
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup, DistilBertForSequenceClassification, DistilBertTokenizer

In [3]:
!gdown 1ltdf-wuq52y6JLPzPaEywLYgRz5gfMsB #for colab

Downloading...
From: https://drive.google.com/uc?id=1ltdf-wuq52y6JLPzPaEywLYgRz5gfMsB
To: /content/toxic_comments.csv
100% 64.1M/64.1M [00:00<00:00, 242MB/s]


In [4]:
df_orig = pd.read_csv('/content/toxic_comments.csv') #for colab
# df_orig = pd.read_csv('toxic_comments.csv') #for local

Посмотрим на данные

In [5]:
df_orig.head(10)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


Сначала сделаем предпреобразование наших данных

In [6]:
def lemmatize(text):
    lem = WordNetLemmatizer()
    clear_text = ' '.join(re.sub(r'[^a-zA-Z\']', ' ', text).split())
    lemm_list = lem.lemmatize(clear_text)
    ready_text = "".join(lemm_list)

    return ready_text

In [7]:
nltk.download('wordnet')
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Возьмем часть данных, как писалось выше. И дополнительно сбалансируем классы, для корректной работы алгоритмов.

Если выборку не сбалансировать, то в нее может попасть очень малое количество положительных таргетов. И некоторые модели не сумеют найти закономерности, показав 0 метрику.

In [8]:
df_sample = df_orig.sample(n=2000, weights=1./df_orig.groupby('toxic')['toxic'].transform('count'), random_state=101).reset_index(drop=True)
df_sample['text'] = df_sample['text'].apply(lambda x: lemmatize(x))

df = df_sample.copy()

Проверим баланс классов

In [9]:
df_sample['toxic'].value_counts(normalize=True)

0    0.5125
1    0.4875
Name: toxic, dtype: float64

Создадим таблицу для занесения результатов тестирования

In [10]:
result_df = pd.DataFrame(columns=['Name', 'F1_test', 'Size(mb)', 'Time for 1 predict(s)'])

# Дистилляция

## Исходный вариант

В данном варианте обучим модель и проведем замер интересующих характеристик. Это необходимо для сравнения результатов дистилляции.

In [11]:
# создается класс для загрузки данных и их подготовки
class CustomDataset(Dataset):

  def __init__(self, texts, targets, tokenizer, max_len=512):
    self.texts = texts
    self.targets = targets
    self.tokenizer = tokenizer
    self.max_len = max_len

  def __len__(self):
    return len(self.texts)

  def __getitem__(self, idx):
    text = str(self.texts[idx])
    target = self.targets[idx]

    encoding = self.tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=self.max_len,
        return_token_type_ids=False,
        padding='max_length',
        return_attention_mask=True,
        return_tensors='pt',
        truncation=True
    )

    return {
      'text': text,
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'targets': torch.tensor(target, dtype=torch.long)
    }

Разобьем выборку на две части: тренировочную (60%), валидационную (20%) и тестовую (20%)

In [12]:
df_train, df_temp = train_test_split(df, test_size=0.4, random_state=101)
df_valid, df_test = train_test_split(df_temp, test_size=0.5, random_state=101)

Проверка

In [13]:
print(df_train.shape, df_valid.shape, df_test.shape)

(1200, 2) (400, 2) (400, 2)


Данные для обучения, валидации и тестирования

In [14]:
features_train = df_train.drop(['toxic'], axis=1)
target_train = df_train['toxic']

features_valid = df_valid.drop(['toxic'], axis=1)
target_valid = df_valid['toxic']

features_test = df_test.drop(['toxic'], axis=1)
target_test = df_test['toxic']

Задаем параметры и модель

In [15]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_save_path='/content/bert_distil.pt'
n_classes = 2
max_len = 512
batch_size = 2
epochs = 10

Инициализируем сеть-учитель и изменяем количество фичей на выходе

In [22]:
# Load pre-trained BERT model and tokenizer
teacher_model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
teacher_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer = teacher_tokenizer

out_features_teacher = teacher_model.bert.encoder.layer[1].output.dense.out_features
# out_features = teacher_model.bert.encoder.layer[1].output.dense.out_features
teacher_model.classifier = torch.nn.Linear(out_features_teacher, n_classes)
teacher_model.to(device)

optimizer = torch.optim.AdamW(teacher_model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=len(features_train) * epochs
    )
loss_fn = torch.nn.CrossEntropyLoss().to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# создание датасетов
train_set = CustomDataset(list(features_train['text']), list(target_train), teacher_tokenizer)
valid_set = CustomDataset(list(features_valid['text']), list(target_valid), teacher_tokenizer)

# создание дата лоудеров
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size, shuffle=True)

In [18]:
def train(model, train_loader, valid_loader, features_train, features_valid, loss_fn, optimizer):

    for epoch in range(epochs):

        print(f'---------------Epoch:{epoch+1}/{epochs}----------------')
        train_losses = []
        val_losses = []
        train_correct_predictions = 0
        val_correct_predictions = 0
        best_accuracy = 0

        ###Train###
        model.train()

        for data in train_loader:
            input_ids = data["input_ids"].to(device)
            attention_mask = data["attention_mask"].to(device)
            targets = data["targets"].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
                )

            preds = torch.argmax(outputs.logits, dim=1)
            loss = loss_fn(outputs.logits, targets)

            train_correct_predictions += torch.sum(preds == targets)

            train_losses.append(loss.item())

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        train_acc = train_correct_predictions.double() / len(features_train)
        train_loss = np.mean(train_losses)

        ###Eval###
        model.eval()

        with torch.no_grad():
            for data in valid_loader:
                input_ids = data["input_ids"].to(device)
                attention_mask = data["attention_mask"].to(device)
                targets = data["targets"].to(device)

                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask
                    )

                preds = torch.argmax(outputs.logits, dim=1)
                loss = loss_fn(outputs.logits, targets)
                val_correct_predictions += torch.sum(preds == targets)
                val_losses.append(loss.item())

        val_acc = val_correct_predictions.double() / len(features_valid)
        val_loss = np.mean(val_losses)

        print(f'Train loss=   {train_loss:.4f},   accuracy= {train_acc:.4f}')
        print(f'Val   loss=   {val_loss:.4f},   accuracy= {val_acc:.4f}')

        if val_acc > best_accuracy:
            torch.save(model, model_save_path)
            best_accuracy = val_acc

        model = torch.load(model_save_path)

def predict(model, text):

    model.eval()
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=max_len,
        return_token_type_ids=False,
        truncation=True,
        padding='max_length',
        return_attention_mask=True,
        return_tensors='pt',
    )

    out = {
          'text': text,
          'input_ids': encoding['input_ids'].flatten(),
          'attention_mask': encoding['attention_mask'].flatten()
      }

    input_ids = out["input_ids"].to(device)
    attention_mask = out["attention_mask"].to(device)

    outputs = model(
        input_ids=input_ids.unsqueeze(0),
        attention_mask=attention_mask.unsqueeze(0)
    )

    prediction = torch.argmax(outputs.logits, dim=1).cpu().numpy()[0]

    return prediction

Тренируем последний слой учителя, для выдачи правильных ответов

In [23]:
# на cpu считается больше часа для одной эпохи, не дождался окончания
# на gpu 2 минуты на одну эпоху, но тогда квантизация не работает
train(teacher_model, train_loader, valid_loader, features_train, features_valid, loss_fn, optimizer)

---------------Epoch:1/10----------------
Train loss=   0.5300,   accuracy= 0.8533
Val   loss=   0.2729,   accuracy= 0.9275
---------------Epoch:2/10----------------
Train loss=   0.1909,   accuracy= 0.9458
Val   loss=   0.2729,   accuracy= 0.9275
---------------Epoch:3/10----------------
Train loss=   0.1904,   accuracy= 0.9450
Val   loss=   0.2729,   accuracy= 0.9275
---------------Epoch:4/10----------------
Train loss=   0.1781,   accuracy= 0.9467
Val   loss=   0.2729,   accuracy= 0.9275
---------------Epoch:5/10----------------
Train loss=   0.1891,   accuracy= 0.9417
Val   loss=   0.2729,   accuracy= 0.9275
---------------Epoch:6/10----------------
Train loss=   0.1897,   accuracy= 0.9483
Val   loss=   0.2729,   accuracy= 0.9275
---------------Epoch:7/10----------------
Train loss=   0.1971,   accuracy= 0.9467
Val   loss=   0.2729,   accuracy= 0.9275
---------------Epoch:8/10----------------
Train loss=   0.1867,   accuracy= 0.9467
Val   loss=   0.2729,   accuracy= 0.9275
--------

Делаем предсказания и считаем метрики

In [24]:
def predict_and_metrics(model, name, number,
                        features_test=features_test,
                        target_test=target_test,
                        df_test=df_test,
                        result_df=result_df):
    # делаем предсказания и замеряем скорость
    texts = list(features_test['text'])
    start_time = time.time()
    target_pred = [predict(model, t) for t in texts]
    total_time = round((time.time() - start_time)/len(df_test), 4)

    # высчитываем размер модели
    torch.save(model.state_dict(), "temp.p")
    size = round(os.path.getsize("temp.p")/1e6, 3)
    os.remove('temp.p')

    # считаем метрику F-1
    bert_report = classification_report(target_test, target_pred, output_dict=True)
    result_df.loc[number]=[name, round(bert_report['1']['f1-score'], 3), size, total_time]

    return result_df

In [25]:
result_df = predict_and_metrics(teacher_model, 'BERT_orig', 0)
result_df

Unnamed: 0,Name,F1_test,Size(mb),Time for 1 predict(s)
0,BERT_orig,0.93,438.003,0.0343


## Обучение дистиллированного BERT

В качестве "уменьшенной модели", возьмем distilbert которому так-же как и модели-учителю, заменим выходной слой

In [26]:
import torch.nn.functional as F
from torch.nn import KLDivLoss

student_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
student_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
tokenizer = student_tokenizer

out_features_student = student_model.pre_classifier.out_features
student_model.classifier = torch.nn.Linear(out_features_student, n_classes)
student_model.to(device)

optimizer = torch.optim.AdamW(student_model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=len(features_train) * epochs
    )
loss_fn = torch.nn.CrossEntropyLoss().to(device)
distillation_loss_fn = KLDivLoss(reduction='batchmean') # Добавили новую функцию loss для дистилляции


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Для дистилляции введем 2 новых переменные, alpha - для контрля компромисса между двумя потерями, и temperature - для контроля "мягкости" распределений, создаваемых логитами. Оба гиперпараметра подбираются для конкретной задачи.

In [27]:
alpha = 0.1
temperature = 2

In [28]:
# создание датасетов для студента (другой токенайзер)
train_set = CustomDataset(list(features_train['text']), list(target_train), student_tokenizer)
valid_set = CustomDataset(list(features_valid['text']), list(target_valid), student_tokenizer)

# создание дата лоудеров для студента
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size, shuffle=True)

In [29]:
def train_student(teacher_model, student_model, train_loader, valid_loader, features_train, features_valid, loss_fn, optimizer):

    for epoch in range(epochs):

        print(f'---------------Epoch:{epoch+1}/{epochs}----------------')
        train_losses = []
        val_losses = []
        train_correct_predictions = 0
        val_correct_predictions = 0
        best_accuracy = 0

        ###Train###
        student_model.train()
        for data in train_loader:
          input_ids = data["input_ids"].to(device)
          attention_mask = data["attention_mask"].to(device)
          targets = data["targets"].to(device)

          with torch.no_grad():
            teacher_outputs  = teacher_model(
                input_ids=input_ids,
                attention_mask=attention_mask
                )

          student_outputs  = student_model(
                input_ids=input_ids,
                attention_mask=attention_mask
                )

          preds = torch.argmax(student_outputs.logits, dim=1)
          classification_loss = loss_fn(student_outputs.logits, targets)

          distillation_loss = (
          distillation_loss_fn(
              F.log_softmax(student_outputs.logits / temperature, dim=-1),
              F.softmax(teacher_outputs.logits / temperature, dim=-1),
            )
          )

          loss = (1 - alpha) * classification_loss + alpha * distillation_loss

          train_correct_predictions += torch.sum(preds == targets)

          train_losses.append(loss.item())

          loss.backward()
          torch.nn.utils.clip_grad_norm_(student_model.parameters(), max_norm=1.0)
          optimizer.step()
          scheduler.step()
          optimizer.zero_grad()

        train_acc = train_correct_predictions.double() / len(features_train)
        train_loss = np.mean(train_losses)

        ###Eval###
        student_model.eval()

        with torch.no_grad():
            for data in valid_loader:
                input_ids = data["input_ids"].to(device)
                attention_mask = data["attention_mask"].to(device)
                targets = data["targets"].to(device)

                with torch.no_grad():
                  teacher_outputs  = teacher_model(
                      input_ids=input_ids,
                      attention_mask=attention_mask
                      )

                student_outputs  = student_model(
                      input_ids=input_ids,
                      attention_mask=attention_mask
                      )

                preds = torch.argmax(student_outputs.logits, dim=1)
                classification_loss = loss_fn(student_outputs.logits, targets)

                distillation_loss = (
                distillation_loss_fn(
                    F.log_softmax(student_outputs.logits / temperature, dim=-1),
                    F.softmax(teacher_outputs.logits / temperature, dim=-1),
                  )
                )

                loss = (1 - alpha) * classification_loss + alpha * distillation_loss

                preds = torch.argmax(student_outputs.logits, dim=1)
                val_correct_predictions += torch.sum(preds == targets)
                val_losses.append(loss.item())

        val_acc = val_correct_predictions.double() / len(features_valid)
        val_loss = np.mean(val_losses)

        print(f'Train loss=   {train_loss:.4f},   accuracy= {train_acc:.4f}')
        print(f'Val   loss=   {val_loss:.4f},   accuracy= {val_acc:.4f}')

        if val_acc > best_accuracy:
            torch.save(student_model, model_save_path)
            best_accuracy = val_acc

        student_outputs = torch.load(model_save_path)


Тренируем сеть студента, используя 2 функции loss

In [30]:
train_student(teacher_model, student_model, train_loader, valid_loader, features_train, features_valid, loss_fn, optimizer)

---------------Epoch:1/10----------------
Train loss=   0.4627,   accuracy= 0.8433
Val   loss=   0.2201,   accuracy= 0.9300
---------------Epoch:2/10----------------
Train loss=   0.1837,   accuracy= 0.9542
Val   loss=   0.4911,   accuracy= 0.8925
---------------Epoch:3/10----------------
Train loss=   0.1053,   accuracy= 0.9750
Val   loss=   0.2546,   accuracy= 0.9350
---------------Epoch:4/10----------------
Train loss=   0.0851,   accuracy= 0.9767
Val   loss=   0.3071,   accuracy= 0.9275
---------------Epoch:5/10----------------
Train loss=   0.0680,   accuracy= 0.9817
Val   loss=   0.2604,   accuracy= 0.9375
---------------Epoch:6/10----------------
Train loss=   0.0278,   accuracy= 0.9925
Val   loss=   0.2633,   accuracy= 0.9450
---------------Epoch:7/10----------------
Train loss=   0.0274,   accuracy= 0.9925
Val   loss=   0.2420,   accuracy= 0.9375
---------------Epoch:8/10----------------
Train loss=   0.0225,   accuracy= 0.9950
Val   loss=   0.2603,   accuracy= 0.9350
--------

Делаем предсказания и считаем метрики

In [31]:
result_df = predict_and_metrics(student_model, 'BERT_distil', 1)
result_df

Unnamed: 0,Name,F1_test,Size(mb),Time for 1 predict(s)
0,BERT_orig,0.93,438.003,0.0343
1,BERT_distil,0.906,267.854,0.0182


# Выводы

In [32]:
result_df

Unnamed: 0,Name,F1_test,Size(mb),Time for 1 predict(s)
0,BERT_orig,0.93,438.003,0.0343
1,BERT_distil,0.906,267.854,0.0182


Значение метрики **F-1** снизилось, но не сильно А показатели, такие как вес модели и скорость инференса улучшились ~2 раза, чего мы и ожидали.