<center><i>Łukasz Staniszewski</i></center>
<h1><center>Mini project 6 - SSNE - Hate speech classifier</center></h1>

## Task
The task is to build a model that classifies hate speech, using labeled comments written in Polish.

## Notebook setup

In [161]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_metric

import numpy as np
import pandas as pd
import re

from sklearn.utils import compute_class_weight
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import imblearn

In [162]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
device

device(type='cuda')

In [163]:
if torch.cuda.is_available():
    torch.cuda.manual_seed(2115)
    torch.cuda.manual_seed_all(2115)

torch.manual_seed(2115)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## Data

In [164]:
train_df = pd.read_csv('./data/hate_train.csv')

In [165]:
train_df.head()

Unnamed: 0,sentence,label
0,Dla mnie faworytem do tytułu będzie Cracovia. ...,0
1,@anonymized_account @anonymized_account Brawo ...,0
2,"@anonymized_account @anonymized_account Super,...",0
3,@anonymized_account @anonymized_account Musi. ...,0
4,"Odrzut natychmiastowy, kwaśna mina, mam problem",0


In [166]:
test_df = pd.read_table("./data/hate_test_data.txt", header=None)
test_df.rename(columns={0:'sentence'}, inplace=True)

In [167]:
test_df.head()

Unnamed: 0,sentence
0,"@anonymized_account Spoko, jak im Duda z Moraw..."
1,@anonymized_account @anonymized_account Ale on...
2,@anonymized_account No czy Prezes nie miał rac...
3,@anonymized_account @anonymized_account Przeci...
4,@anonymized_account @anonymized_account Owszem...


Removing URLs

In [168]:
def url_deletion(in_text):
    pattern = re.compile(r'https?[a-zA-Z:/.0-9]*')
    return pattern.sub(r'', in_text)

Removing user handles

In [169]:
def nick_deletion(text_in):
    pattern = r'@anonymized_account'
    return re.sub(pattern, '', text_in)

Removing special characters

In [170]:
def signs_deletion(text_in):
    pattern = r'[,„•’\"-]'
    return re.sub(pattern, ' ', text_in).strip()

In [171]:
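def whitespace_deletion(text_in):
    # Collapse every run of whitespace into a single space and trim both ends.
    pattern = r'\s+'
    # r'\n' is the literal two-character sequence backslash + n, not a newline.
    return re.sub(pattern, ' ', text_in).strip().replace(r'\n', ' ')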

In [172]:
def slash_deletion(text_in):
    pattern = r"\\"
    return re.sub(pattern, '', text_in)

Removing emoji

In [173]:
def emote_deletion(text_in):
    pattern = re.compile("["
       u"\U0001F600-\U0001F64F"  # emoticons
       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
       u"\U0001F680-\U0001F6FF"  # transport & map symbols
       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
       "]+", flags=re.UNICODE)
    return pattern.sub(r'', text_in)

Applying the preprocessing

In [174]:
indexes_delete = []
for index,row in train_df.iterrows():
    if "RT" in row['sentence']:
        indexes_delete.append(index)

train_df = train_df.drop(indexes_delete)
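The `"RT" in row['sentence']` test above is a plain substring check, so it also drops any comment that merely contains those two uppercase letters inside a longer word. A minimal sketch of a stricter alternative, assuming retweets are marked by a standalone `RT` token; the helper name `is_retweet` is hypothetical and not part of the notebook:

```python
import re

# Hypothetical helper: match "RT" only as a standalone token.
RT_TOKEN = re.compile(r'\bRT\b')

def is_retweet(sentence):
    """Return True when the sentence contains the standalone token RT."""
    return RT_TOKEN.search(sentence) is not None

samples = ["RT @anonymized_account Brawo", "SPORT to zdrowie", "PARTIA rządzi"]
substring_hits = ["RT" in s for s in samples]   # the notebook's check
token_hits = [is_retweet(s) for s in samples]   # word-boundary check
print(substring_hits)
print(token_hits)
```

Only the first sample is flagged by the word-boundary version, while the substring check flags all three.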

In [175]:
train_df = train_df.reset_index()

In [176]:
train_df = train_df.drop(columns=['index'])

In [178]:
train_df.head()

Unnamed: 0,sentence,label
0,Dla mnie faworytem do tytułu będzie Cracovia. ...,0
1,@anonymized_account @anonymized_account Brawo ...,0
2,"@anonymized_account @anonymized_account Super,...",0
3,@anonymized_account @anonymized_account Musi. ...,0
4,"Odrzut natychmiastowy, kwaśna mina, mam problem",0


In [179]:
train_df['sentence'] = train_df['sentence'].apply(url_deletion).apply(nick_deletion).apply(whitespace_deletion).apply(signs_deletion).apply(emote_deletion).apply(slash_deletion)

In [180]:
test_df['sentence'] = test_df['sentence'].apply(url_deletion).apply(nick_deletion).apply(whitespace_deletion).apply(signs_deletion).apply(emote_deletion).apply(slash_deletion)
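To see what the chained `.apply(...)` calls do to a single comment, here is a minimal sketch that folds the same steps, in the same order, into one function. The name `preprocess` and the sample text are made up for illustration; the regular expressions are copied from the cells above.

```python
import re

EMOJI = re.compile("["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+")

def preprocess(text):
    """Apply the cleaning steps in the same order as the notebook's chain."""
    text = re.sub(r'https?[a-zA-Z:/.0-9]*', '', text)   # URLs
    text = re.sub(r'@anonymized_account', '', text)     # anonymized handles
    text = re.sub(r'^\s*|\s\s*', ' ', text).strip().replace(r'\n', ' ')  # whitespace
    text = re.sub(r'[,„•’\"-]', ' ', text).strip()      # punctuation -> space
    text = EMOJI.sub('', text)                          # emoji
    return re.sub(r'\\', '', text)                      # backslashes

# Made-up comment containing a handle, a comma, a URL and an emoji.
raw = "@anonymized_account Super, zobacz https://t.co/abc123 \U0001F600 teraz"
clean = preprocess(raw)
print(repr(clean))
```

Because whitespace is collapsed before punctuation and emoji are removed, the result can still contain double spaces; running the whitespace step last would remove them.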

In [181]:
train_df.head()

Unnamed: 0,sentence,label
0,Dla mnie faworytem do tytułu będzie Cracovia. ...,0
1,Brawo ty Daria kibic ma być na dobre i złe,0
2,Super polski premier składa kwiaty na grobach...,0
3,Musi. Innej drogi nie mamy.,0
4,Odrzut natychmiastowy kwaśna mina mam problem,0


In [182]:
test_df.head()

Unnamed: 0,sentence
0,Spoko jak im Duda z Morawieckim zamówią po pi...
1,Ale on tu nie miał szans jej zagrania a ta 'p...
2,No czy Prezes nie miał racji mówiąc ze to są ...
3,Przecież to nawet nie jest przewrotka
4,Owszem podatki tak. Ale nie w takich okoliczno...


In [122]:
train_df.to_csv("train_processed.csv")
test_df.to_csv("test_processed.csv")

## Tokenizer

In [123]:
model_name = "dkleczek/Polish-Hate-Speech-Detection-Herbert-Large"

In [124]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer

PreTrainedTokenizerFast(name_or_path='dkleczek/Polish-Hate-Speech-Detection-Herbert-Large', vocab_size=50000, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'bos_token': '<s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'})

In [125]:
sentence = tokenizer(train_df['sentence'][27])
tokenizer.convert_ids_to_tokens(sentence["input_ids"])

['<s>',
 'ale</w>',
 'może</w>',
 'w</w>',
 'końcu</w>',
 'dojdzie</w>',
 'do</w>',
 'wniosku</w>',
 'że</w>',
 'skoro</w>',
 'go</w>',
 'klub</w>',
 'już</w>',
 'nie</w>',
 'j',
 'ara</w>',
 'to</w>',
 'lepiej</w>',
 'go</w>',
 'sprzedać</w>',
 'i</w>',
 'mieć</w>',
 'po</w>',
 'kłopo',
 'cie</w>',
 '</s>']
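In the output above, pieces ending in `</w>` close a word, while pieces without it (such as `j` or `kłopo`) continue into the next piece. A minimal sketch of how the pieces map back to words; `merge_subwords` is a hypothetical helper written for this illustration, not a tokenizer method:

```python
def merge_subwords(tokens, special=("<s>", "</s>", "<pad>")):
    """Rebuild whitespace-separated words from BPE pieces ending in '</w>'."""
    words, current = [], ""
    for token in tokens:
        if token in special:
            continue  # skip sentence-boundary and padding tokens
        if token.endswith("</w>"):
            words.append(current + token[:-len("</w>")])
            current = ""
        else:
            current += token  # word continues in the next piece
    if current:  # trailing piece without an end-of-word marker
        words.append(current)
    return " ".join(words)

# A few tokens taken from the tokenizer output above.
tokens = ['<s>', 'już</w>', 'nie</w>', 'j', 'ara</w>', 'po</w>', 'kłopo', 'cie</w>', '</s>']
print(merge_subwords(tokens))
```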

## Datasets

In [126]:
input_data, input_targets = train_df['sentence'].values, train_df['label'].values

In [134]:
# undersampler = imblearn.under_sampling.RandomUnderSampler(sampling_strategy=0.18, random_state=2115)
# input_data_under, input_targets_under = undersampler.fit_resample(input_data.reshape(-1,1), input_targets)
# train_data, val_data, train_targets, val_targets = train_test_split(input_data_under, input_targets_under, test_size=0.15, random_state=2115)

In [127]:
oversampler = imblearn.over_sampling.RandomOverSampler(sampling_strategy=0.1, random_state=2115)
input_data_over, input_targets_over = oversampler.fit_resample(input_data.reshape(-1,1), input_targets)
train_data, val_data, train_targets, val_targets = train_test_split(input_data_over, input_targets_over, test_size=0.15, random_state=2115)
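Note that the cell above oversamples before splitting, so copies of the same hate comment can land in both the training and the validation set, which may make validation scores look better than they are. A minimal sketch of the alternative order (split first, then oversample the training part only), using just the standard library. The function name, the synthetic data and the 4% hate rate are assumptions for illustration; `ratio` plays the role of `sampling_strategy`, the minority-to-majority ratio after resampling.

```python
import random

def split_then_oversample(texts, labels, val_fraction=0.15, ratio=0.1, seed=2115):
    """Hold out validation rows first, then duplicate minority rows in train only."""
    rng = random.Random(seed)
    order = list(range(len(texts)))
    rng.shuffle(order)
    n_val = round(len(order) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]

    minority = [i for i in train_idx if labels[i] == 1]
    majority = [i for i in train_idx if labels[i] == 0]
    # Minority count needed so that minority / majority reaches `ratio`.
    target = int(ratio * len(majority))
    missing = max(0, target - len(minority)) if minority else 0
    train_idx = train_idx + [rng.choice(minority) for _ in range(missing)]
    rng.shuffle(train_idx)

    train = [(texts[i], labels[i]) for i in train_idx]
    val = [(texts[i], labels[i]) for i in val_idx]
    return train, val

# Synthetic stand-in data: 1000 unique comments, 4% labelled as hate.
texts = [f"comment {i}" for i in range(1000)]
labels = [1 if i % 25 == 0 else 0 for i in range(1000)]
train, val = split_then_oversample(texts, labels)

overlap = {t for t, _ in train} & {t for t, _ in val}
print(len(train), len(val), len(overlap))
```

With this order, every validation comment is unseen during training, so the validation metrics describe generalization rather than memorized duplicates.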

In [136]:
# train_data, val_data, train_targets, val_targets = train_test_split(input_data, input_targets, test_size=0.15, random_state=2115)

In [128]:
def tokenize_function(in_data):
    return tokenizer(in_data, padding=True, truncation=True)

In [129]:
train_tokenized = tokenize_function(list(train_data.squeeze()))
val_tokenized = tokenize_function(list(val_data.squeeze()))
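With `padding=True`, every sequence in a single tokenizer call is padded to the longest sequence of that call, and the attention mask marks which positions hold real tokens. Since each split is tokenized in one call here, all training examples share one length. A minimal sketch of the mechanics with made-up token ids; `pad_batch` is a hypothetical helper, and the pad id of 1 is an assumption based on `padding_idx=1` in the model's embedding layer:

```python
PAD_ID = 1  # assumed pad id, matching padding_idx=1 in the embedding layer

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token id lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; 0 and 2 stand in for the <s> and </s> special tokens.
batch = pad_batch([[0, 11, 12, 13, 2], [0, 21, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```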

In [130]:
class HateDataset(Dataset):
    def __init__(self, tokenized_data, targets):
        self.tokenized_data = tokenized_data
        self.targets = targets

    def __getitem__(self, idx):
        temp = {key: torch.tensor(val[idx]) for key, val in self.tokenized_data.items()}
        temp['labels'] = torch.tensor(self.targets[idx])
        return temp

    def __len__(self):
        return len(self.targets)

In [131]:
train_dataset = HateDataset(tokenized_data=train_tokenized, targets=train_targets)
val_dataset = HateDataset(tokenized_data=val_tokenized, targets=val_targets)

In [132]:
BATCH_SIZE = 36

In [133]:
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

## Model

In [134]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(50000, 1024, padding_idx=1)
      (position_embeddings): Embedding(514, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1

We train only the classifier head!

In [135]:
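# Freeze every parameter of the pretrained model...
for param in model.parameters():
    param.requires_grad = False
# ...then unfreeze only the classification head.
for param in model.classifier.parameters():
    param.requires_grad = True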

Hyperparameters

In [136]:
N_EPOCH = 14
LR = 4e-4

In [137]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
device

device(type='cuda')

In [138]:
model.to(device)


## Training the model

In [139]:
y_np = np.array(train_dataset[:]['labels']) # classes
class_weights=compute_class_weight('balanced',classes=np.unique(y_np),y=y_np)
class_weights=torch.tensor(class_weights,dtype=torch.float)

In [140]:
class_weights

tensor([0.5502, 5.4755])
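The `'balanced'` mode weights each class by `n_samples / (n_classes * count_c)`, so the rarer class gets the larger weight. A minimal pure-Python sketch of that formula on toy labels (the function name is hypothetical); the weights printed above, 0.5502 and 5.4755, are consistent with roughly one hate comment per ten non-hate comments in the training split:

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Toy labels with one hate sample per ten non-hate samples.
weights = balanced_class_weights([0] * 10 + [1])
print(weights)
```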

In [141]:
def count_accuracy(model, loader):
    acc = load_metric("accuracy")
    y_pred = []
    y_true = []
    model.eval()
    with torch.no_grad():
        for batch in loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)
            acc.add_batch(predictions=predictions, references=labels)
            y_pred.extend([p.item() for p in predictions])
            y_true.extend([l.item() for l in labels])
    score = acc.compute()
    model.train()
    return score['accuracy'], y_true, y_pred

In [142]:
optim = torch.optim.Adam(model.parameters(), lr=LR)
loss_fun = nn.CrossEntropyLoss(weight=class_weights.to(device))
# loss_fun = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optim, gamma=0.96)

In [143]:
model.train()

for epoch in range(N_EPOCH):
    epoch_loss = []
    for batch in train_loader:
        optim.zero_grad()
        inputs = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        targets = batch['labels'].to(device)
        outputs = model(inputs, attention_mask=attention_mask, labels=targets)
        loss = loss_fun(outputs.logits, targets)
        loss.backward()
        optim.step()
        epoch_loss.append(loss.item())

    loss_mean = np.array(epoch_loss).mean()
    val_acc,_,_ = count_accuracy(model, val_loader)
    print(f"EPOCH {epoch+1}/{N_EPOCH} | loss: {loss_mean:.3f} | val_acc: {val_acc:.3f}")
    scheduler.step()

EPOCH 1/14 | loss: 0.431 | val_acc: 0.862
EPOCH 2/14 | loss: 0.418 | val_acc: 0.816
EPOCH 3/14 | loss: 0.410 | val_acc: 0.821
EPOCH 4/14 | loss: 0.404 | val_acc: 0.859
EPOCH 5/14 | loss: 0.402 | val_acc: 0.864
EPOCH 6/14 | loss: 0.395 | val_acc: 0.812
EPOCH 7/14 | loss: 0.391 | val_acc: 0.850
EPOCH 8/14 | loss: 0.389 | val_acc: 0.777
EPOCH 9/14 | loss: 0.395 | val_acc: 0.804
EPOCH 10/14 | loss: 0.388 | val_acc: 0.843
EPOCH 11/14 | loss: 0.390 | val_acc: 0.855
EPOCH 12/14 | loss: 0.384 | val_acc: 0.814
EPOCH 13/14 | loss: 0.373 | val_acc: 0.856
EPOCH 14/14 | loss: 0.397 | val_acc: 0.874
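Because `scheduler.step()` runs once at the end of each epoch, the learning rate is multiplied by `gamma` after every epoch. A minimal sketch of the resulting schedule for the hyperparameters used here (the helper name is hypothetical); it starts at 4e-4 and ends at about 2.35e-4 in the last epoch:

```python
def exponential_lr(base_lr, gamma, n_epochs):
    """Learning rate used in each epoch when it is multiplied by gamma per epoch."""
    return [base_lr * gamma ** epoch for epoch in range(n_epochs)]

lrs = exponential_lr(base_lr=4e-4, gamma=0.96, n_epochs=14)
for epoch, lr in enumerate(lrs, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```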


Saving the model

In [148]:
torch.save(model.state_dict(), "model/model.pt")

Metrics

In [144]:
acc, y_true, y_pred = count_accuracy(model, val_loader)

In [145]:
acc

0.8742094167252283

In [158]:
target_names = ['no_hate', 'hate']
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     no_hate       0.97      0.89      0.93      1297
        hate       0.39      0.74      0.51       126

    accuracy                           0.87      1423
   macro avg       0.68      0.81      0.72      1423
weighted avg       0.92      0.87      0.89      1423



Precision for the hate class is low (0.39), but recall is reasonable (0.74), given how few hate examples there are (126 of the 1423 validation samples).
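To make the numbers in the report concrete: a recall of 0.74 on 126 hate comments means roughly 93 of them are found, and a precision of 0.39 means those hits come with roughly 146 false alarms. A minimal sketch of how precision, recall and F1 are computed for one class, on toy labels rather than the notebook's data (the function name is hypothetical):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from raw predictions."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 3 hate comments, 2 of them found, plus 2 false alarms.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

# F1 implied by the rounded hate-class values in the report (0.39 and 0.74).
report_f1 = 2 * 0.39 * 0.74 / (0.39 + 0.74)
print(round(precision, 2), round(recall, 2), round(f1, 2), round(report_f1, 2))
```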

## Evaluation

In [149]:
test_data = test_df['sentence']
test_tokenized = tokenize_function(list(test_data))

In [150]:
len(test_tokenized['input_ids'])

1000

In [151]:
class HateEvalDataset(Dataset):
    def __init__(self, tokenized_data):
        self.tokenized_data = tokenized_data

    def __getitem__(self, idx):
        temp = {key: torch.tensor(val[idx]) for key, val in self.tokenized_data.items()}
        return temp

    def __len__(self):
        return len(self.tokenized_data['input_ids'])

In [152]:
eval_dataset = HateEvalDataset(tokenized_data=test_tokenized)

In [153]:
BATCH_SIZE=35

In [154]:
eval_loader = DataLoader(eval_dataset, batch_size=BATCH_SIZE, shuffle=False)

In [155]:
model.eval()
all_preds = []
with torch.no_grad():
    for batch in eval_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        all_preds.extend(p.item() for p in predictions)
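The loop above turns each row of logits into a label by taking the index of the largest value. For two classes this is the same as predicting hate whenever its softmax probability exceeds 0.5. A minimal pure-Python sketch with made-up logits; `softmax` and `predict` are hypothetical helpers for illustration:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shifted = [x - max(row) for x in row]
    exps = [math.exp(x) for x in shifted]
    return [e / sum(exps) for e in exps]

def predict(logits):
    """Pick the index of the largest logit in each row (argmax over the last axis)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Made-up logits for three comments; columns are [no_hate, hate].
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]]
print(predict(logits))
print([round(softmax(row)[1], 3) for row in logits])
```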

In [159]:
df_out = pd.DataFrame(all_preds)
df_out.head()

Unnamed: 0,0
0,1
1,0
2,0
3,0
4,0


In [160]:
csv = df_out.to_csv(index=False, header=False)
with open('results.csv', 'w', newline="") as f:
    f.write(csv)