We import all the required libraries:

In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "1"

In [2]:
!pip install transformers datasets scikit-learn numpy torch torchvision



### Creating the *dataset* for model training

In [3]:
from datasets import load_dataset, load_metric



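We load the two CSV files prepared at the end of preprocessing: one for the *training dataset* and one for the *test dataset*. We strip the first line, which holds the column names, from each file and save the result as a new file. The same is done for the small parliamentary sample `dataParlamentH.csv`, which is used as the validation split.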

In [4]:
with open("hateSpeechTest.csv",'r') as f:
    with open("hateSpeechTestHeadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)

In [5]:
i = 0
with open("hateSpeechTrain.csv",'r') as f:
    with open("hateSpeechTrainHeadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            i+=1
            if i<36480: # keep only the first 36479 data rows
                f1.write(line)

In [6]:
with open("dataParlamentH.csv",'r') as f:
    with open("dataParlamentHheadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)
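
The three cells above share one pattern: copy a CSV file while skipping its header line, optionally capping the number of data rows. A minimal sketch of a helper that captures this is shown below. The name `strip_header` and its `max_rows` parameter are hypothetical and not part of the original notebook, and like the cells above it assumes that no quoted field contains a line break.

```python
import os
import tempfile


def strip_header(src_path, dst_path, max_rows=None):
    """Copy src_path to dst_path without its first line.

    If max_rows is given, at most that many data lines are written.
    Returns the number of lines written.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        next(src, None)  # skip the header line (no error on an empty file)
        for line in src:
            if max_rows is not None and written >= max_rows:
                break
            dst.write(line)
            written += 1
    return written


# Demonstration on a throwaway file with made-up rows.
tmp_dir = tempfile.mkdtemp()
demo_src = os.path.join(tmp_dir, "demo.csv")
demo_dst = os.path.join(tmp_dir, "demoHeadless.csv")
with open(demo_src, "w", encoding="utf-8") as f:
    f.write("sentence,label\nprimer ena,0 ni sporni govor\nprimer dva,2 žalitev\n")

rows_written = strip_header(demo_src, demo_dst, max_rows=1)
with open(demo_dst, encoding="utf-8") as f:
    demo_output = f.read()
```

With such a helper, the training cell would correspond to `strip_header("hateSpeechTrain.csv", "hateSpeechTrainHeadless.csv", max_rows=36479)`, which matches the `i<36480` cutoff.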

We create a *dataset* instance and assign the files to the *train*, *validation*, and *test* splits.

In [11]:
dataset = load_dataset(
    'csv',
    data_files={
        'train': 'hateSpeechTrainHeadless.csv',
        'validation': 'dataParlamentHheadless.csv',
        'test': 'hateSpeechTestHeadless.csv'
    },
    column_names = ['sentence', 'label']
)

Using custom data configuration default-2998bfbbb5dbb1cf


Downloading and preparing dataset csv/default to /home/ncirar/.cache/huggingface/datasets/csv/default-2998bfbbb5dbb1cf/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...



Dataset csv downloaded and prepared to /home/ncirar/.cache/huggingface/datasets/csv/default-2998bfbbb5dbb1cf/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.



In [12]:
len(dataset['train'])

36479

In [13]:
len(dataset['test'])

15634

In [14]:
len(dataset['validation'])

54

We load the predefined GLUE metric (`sst2` configuration) and a pretrained tokenizer. The code below uses the `EMBEDDIA/sloberta` checkpoint rather than [CroSloEngual BERT](https://huggingface.co/EMBEDDIA/crosloengual-bert).

In [15]:
metric = load_metric('glue', 'sst2')

### Tokenizing the data in the dataset

In [16]:
from transformers import AutoTokenizer

In [17]:
tokenizer = AutoTokenizer.from_pretrained(
    'EMBEDDIA/sloberta',
    use_fast=True
)

We convert the textual labels into integers: 3 for *3 nasilje* (violence), 2 for *2 žalitev* (insult), 1 for *1 nespodobni govor* (inappropriate speech), and 0 for *0 ni sporni govor* (not objectionable speech). We also cap the length of each tweet at 512 tokens; longer inputs are truncated.

In [18]:
label2id = {'3 nasilje': 3, '2 žalitev': 2, '1 nespodobni govor': 1, '0 ni sporni govor': 0}
id2label = ['0 ni sporni govor', '1 nespodobni govor', '2 žalitev', '3 nasilje']

In [19]:
def preprocess(examples):
  result = tokenizer(examples['sentence'], truncation=True, max_length=512)
  result['label'] = [label2id[l] for l in examples['label']]
  return result
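
Because `label2id` is a dict and `id2label` is a list, the two have to be kept in sync by hand. A small sanity check, sketched below with the mappings copied from the cell above, confirms that they are inverses of each other and applies the list comprehension from `preprocess` to a made-up batch of labels.

```python
label2id = {'3 nasilje': 3, '2 žalitev': 2, '1 nespodobni govor': 1, '0 ni sporni govor': 0}
id2label = ['0 ni sporni govor', '1 nespodobni govor', '2 žalitev', '3 nasilje']

# Every label must map to the index at which id2label stores it.
mappings_consistent = all(id2label[i] == name for name, i in label2id.items())

# The same comprehension used in preprocess, applied to a hypothetical batch.
batch_labels = ['2 žalitev', '0 ni sporni govor', '3 nasilje']
encoded = [label2id[l] for l in batch_labels]
```

A label string that is missing from `label2id` would raise a `KeyError` in `preprocess`, so the CSV files must contain exactly these four spellings.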

In [20]:
encoded_dataset = dataset.map(preprocess, batched=True, load_from_cache_file=False)


### Training the model
We load a sequence classification model from the same checkpoint as the tokenizer (`EMBEDDIA/sloberta`), define the training arguments, and create a new *Trainer* instance.

In [21]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np

2022-08-27 17:55:09.987589: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


In [22]:
model = AutoModelForSequenceClassification.from_pretrained(
    'EMBEDDIA/sloberta',
    num_labels=4
)

Some weights of the model checkpoint at EMBEDDIA/sloberta were not used when initializing CamembertForSequenceClassification: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at EMBEDDIA/sloberta and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight

In [23]:
args = TrainingArguments(
    "hatespeech-sloberta",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    )

In [24]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
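
For the `sst2` configuration the GLUE metric reports accuracy, which is the value shown in the `Accuracy` column of the training log. The sketch below reproduces that computation without any library, on made-up logits, to make explicit what `compute_metrics` returns. The function name `accuracy_from_logits` is hypothetical.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four examples over the four classes.
toy_logits = [
    [2.0, 0.1, -1.0, -2.0],   # highest score at index 0
    [0.3, 1.5, 0.2, -0.5],    # highest score at index 1
    [1.2, 0.0, 0.9, -0.1],    # highest score at index 0
    [-0.4, 0.2, 0.1, 1.1],    # highest score at index 3
]
toy_labels = [0, 1, 2, 3]
toy_result = accuracy_from_logits(toy_logits, toy_labels)
```

Three of the four toy predictions match their labels, so the result is an accuracy of 0.75.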

In [25]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=None,
    )

In [26]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `CamembertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `CamembertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 36479
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 6840


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.8208 | 0.638339 | 0.851852 |
| 2 | 0.8262 | 0.625516 | 0.851852 |
| 3 | 0.838 | 0.645604 | 0.851852 |


The following columns in the evaluation set don't have a corresponding argument in `CamembertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `CamembertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 54
  Batch size = 16
Saving model checkpoint to hatespeech-sloberta/checkpoint-2280
Configuration saved in hatespeech-sloberta/checkpoint-2280/config.json
Model weights saved in hatespeech-sloberta/checkpoint-2280/pytorch_model.bin
tokenizer config file saved in hatespeech-sloberta/checkpoint-2280/tokenizer_config.json
Special tokens file saved in hatespeech-sloberta/checkpoint-2280/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `CamembertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `CamembertForSequenceClassification.forward`,  you can safely ignore this

TrainOutput(global_step=6840, training_loss=0.825769002814042, metrics={'train_runtime': 661.5814, 'train_samples_per_second': 165.417, 'train_steps_per_second': 10.339, 'total_flos': 2145042147068496.0, 'train_loss': 0.825769002814042, 'epoch': 3.0})
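
The step counts in the log follow from the dataset size and the batch size. The arithmetic below assumes a single device and no gradient accumulation, which is what the log reports.

```python
import math

num_examples = 36479      # "Num examples" in the training log
batch_size = 16           # per_device_train_batch_size on a single device
num_epochs = 3            # num_train_epochs

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
```

This gives 2280 steps per epoch, which matches the `checkpoint-2280` directory saved after the first epoch, and 6840 steps in total, which matches `Total optimization steps`.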

In [27]:
eval_results = trainer.evaluate()
print(eval_results)

The following columns in the evaluation set don't have a corresponding argument in `CamembertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `CamembertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 54
  Batch size = 16


{'eval_loss': 0.6383387446403503, 'eval_accuracy': 0.8518518518518519, 'eval_runtime': 0.8783, 'eval_samples_per_second': 61.479, 'eval_steps_per_second': 4.554, 'epoch': 3.0}
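
With only 54 validation examples, the reported accuracy corresponds to a whole number of correct predictions. The short calculation below recovers that count from the numbers printed above.

```python
eval_accuracy = 0.8518518518518519   # 'eval_accuracy' from eval_results
num_eval_examples = 54               # "Num examples" in the evaluation log

correct = round(eval_accuracy * num_eval_examples)
misclassified = num_eval_examples - correct
```

The result is 46 correct and 8 misclassified examples. Since the accuracy is identical after every epoch, the number of correctly classified validation examples did not change during training.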


In [28]:
trainer.save_model(output_dir='tweet-hatespeech-model-sloberta')

Saving model checkpoint to tweet-hatespeech-model-sloberta
Configuration saved in tweet-hatespeech-model-sloberta/config.json
Model weights saved in tweet-hatespeech-model-sloberta/pytorch_model.bin
tokenizer config file saved in tweet-hatespeech-model-sloberta/tokenizer_config.json
Special tokens file saved in tweet-hatespeech-model-sloberta/special_tokens_map.json


In [29]:
model = AutoModelForSequenceClassification.from_pretrained('tweet-hatespeech-model-sloberta')

loading configuration file tweet-hatespeech-model-sloberta/config.json
Model config CamembertConfig {
  "_name_or_path": "tweet-hatespeech-model-sloberta",
  "architectures": [
    "CamembertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "camembert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_vers

In [30]:
examples=['vidi se da ste pravi idiot in nimate pojma o čem govorite',
          'mojca vi ste zmagali to rundo',
          'to je totalni nesmisel']

inputs = tokenizer(examples, padding='longest', return_tensors="pt")
outputs = model(**inputs)
probs = outputs[0].detach().numpy()
for i in range(len(examples)):
    print(examples[i],'\t', id2label[np.argmax(probs[i])])

vidi se da ste pravi idiot in nimate pojma o čem govorite 	 0 ni sporni govor
mojca vi ste zmagali to rundo 	 0 ni sporni govor
to je totalni nesmisel 	 0 ni sporni govor
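
Note that `outputs[0]` contains raw logits rather than probabilities, even though the variable is named `probs`. Taking the argmax gives the same predicted class either way, but a softmax is needed if the scores are to be read as probabilities. A dependency-free sketch on made-up logits follows; in the notebook itself, `torch.softmax(outputs[0], dim=-1)` would do the same job.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = ['0 ni sporni govor', '1 nespodobni govor', '2 žalitev', '3 nasilje']

example_logits = [1.8, 0.4, 0.9, -1.2]       # hypothetical values, not model output
example_probs = softmax(example_logits)
predicted = id2label[example_probs.index(max(example_probs))]
```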


### Hate speech analysis on the parliamentary debates dataset
We load the dataset containing transcripts of parliamentary debates and run the hate speech analysis with the model we have just trained.

In [32]:
from google.colab import files  # upload helper, available only in Google Colab

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving dataframe.csv to dataframe.csv
User uploaded file "dataframe.csv" with length 92383142 bytes


In [33]:
import csv

corpus = []
text = []

with open('dataframe.csv', 'r') as f:
  lineReader = csv.reader(f, delimiter=',', quotechar="\"")
  for row in lineReader:
    if row:
      #print(row)
      text.append(row[0])
      corpus.append({'text': row[0], 'datum': row[1]})

In [None]:
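import torch

res = []
batch_size = 32  # classify in small batches to keep memory use bounded

model.eval()
for start in range(0, len(text), batch_size):
    batch = text[start:start + batch_size]
    # truncate to 512 tokens, as in training
    inputs = tokenizer(batch, padding='longest', truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    res.extend(id2label[i] for i in logits.argmax(dim=-1).tolist())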
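
Once `res` is filled, each prediction lines up by index with the matching entry in `corpus`. A sketch of one way to join the two and tally the labels is given below. The sample records, including the date format, are made up, and the names `labelled` and `label_counts` are hypothetical.

```python
from collections import Counter

# Made-up stand-ins for the notebook's `corpus` and `res` lists.
corpus = [
    {'text': 'prvi govor', 'datum': '2020-01-15'},
    {'text': 'drugi govor', 'datum': '2020-01-15'},
    {'text': 'tretji govor', 'datum': '2020-02-03'},
]
res = ['0 ni sporni govor', '2 žalitev', '0 ni sporni govor']

# Attach each predicted label to its source record.
labelled = [dict(entry, label=label) for entry, label in zip(corpus, res)]

# Count how often each label was predicted.
label_counts = Counter(res)
```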