Uvozimo vse potrebne knjižnice:

In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "1"

In [2]:
!pip install transformers datasets sklearn numpy torch torchvision



### Ustvarjanje *dataset*-a za učenje modela

In [3]:
from datasets import load_dataset, load_metric



Naložimo csv datoteke, ki smo jih pripravili na koncu preprocesiranja:  *training dataset*, *evaluation dataset* in *test dataset*. Iz vseh odstranimo prvo vrstico, ki vsebuje imeni stolpcev, ter na novo ustvarjeni datoteki shranimo.

In [4]:
with open("hateSpeechTest.csv",'r') as f:
    with open("hateSpeechTestHeadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)

In [5]:
with open("hateSpeechTrain.csv",'r') as f:
    with open("hateSpeechTrainHeadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)

In [6]:
with open("hateSpeechEvaluation.csv",'r') as f:
    with open("hateSpeechEvaluationHeadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)

Ustvarimo instanco *dataset*, pri čemer ustrezno definiramo *train* in *test* zbirki.

In [7]:
dataset = load_dataset(
    'csv',
    data_files={
        'train': 'hateSpeechTrainHeadless.csv',
        'validation': 'hateSpeechEvaluationHeadless.csv',
        'test': 'hateSpeechTestHeadless.csv'
    },
    column_names = ['sentence', 'label']
)

Using custom data configuration default-e321d39746622800


Downloading and preparing dataset csv/default to /home/ncirar/.cache/huggingface/datasets/csv/default-e321d39746622800/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /home/ncirar/.cache/huggingface/datasets/csv/default-e321d39746622800/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['sentence', 'label'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 1000
    })
})

Uvozimo že predefinirane parametre GLUE ter tokenizer tipa [CroCloEngual BERT](https://huggingface.co/EMBEDDIA/crosloengual-bert).

In [9]:
metric = load_metric('glue', 'sst2')

### Tokeniziranje podatkov v zbirki

In [10]:
from transformers import AutoTokenizer

In [11]:
tokenizer = AutoTokenizer.from_pretrained(
    'EMBEDDIA/crosloengual-bert',
    use_fast=True
)

Tekstovni zapis oznake sentimentov spremenimo v števke: 1 za *sovrazni* in 0 za *nesovrazni*. Določimo tudi maksimalno dolžino twittov, tj. 512 znakov.

In [12]:
label2id = {'sovrazni': 1, 'nesovrazni': 0}
id2label = ['nesovrazni', 'sovrazni']

In [13]:
def preprocess(examples):
  result = tokenizer(examples['sentence'], truncation=True, max_length=512)
  result['label'] = [label2id[l] for l in examples['label']]
  return result

In [14]:
encoded_dataset = dataset.map(preprocess, batched=True, load_from_cache_file=False)

  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [15]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [16]:
sentence = dataset['train'][0]
sentence

{'sentence': 'lepo in koliko ljudi je bilo oguljfanihmajnka še titov in leninov spomenik',
 'label': 'nesovrazni'}

In [17]:
inputs = tokenizer(sentence['sentence'])
inputs

{'input_ids': [103, 3871, 1003, 1578, 1216, 1001, 1091, 25364, 13257, 9989, 1041, 18583, 4170, 1063, 33515, 1111, 1003, 11190, 18481, 8258, 104], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [18]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'lepo',
 'in',
 'koliko',
 'ljudi',
 'je',
 'bilo',
 'og',
 '##ulj',
 '##fan',
 '##ih',
 '##maj',
 '##nka',
 'še',
 'tit',
 '##ov',
 'in',
 'len',
 '##inov',
 'spomenik',
 '[SEP]']

### Učenje modela
Uvozimo model strojnega učenja, ki je istega tipa kot tokenizer (CroSloEngual BERT), definiramo argumente učenja modela ter ustvarimo novo instanco *trainerja*.

In [19]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np

2022-09-27 10:30:51.563051: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


In [20]:
model = AutoModelForSequenceClassification.from_pretrained(
    'EMBEDDIA/crosloengual-bert',
    num_labels=2
)

Some weights of the model checkpoint at EMBEDDIA/crosloengual-bert were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model chec

In [21]:
args = TrainingArguments(
    "hatespeech",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3.0,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    )

In [22]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

In [23]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=None,
    )

In [24]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 8000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1500


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6823,0.714014,0.544
2,0.6799,0.703309,0.544
3,0.6778,0.697595,0.544


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16
Saving model checkpoint to hatespeech/checkpoint-500
Configuration saved in hatespeech/checkpoint-500/config.json
Model weights saved in hatespeech/checkpoint-500/pytorch_model.bin
tokenizer config file saved in hatespeech/checkpoint-500/tokenizer_config.json
Special tokens file saved in hatespeech/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Bat

TrainOutput(global_step=1500, training_loss=0.6799911499023438, metrics={'train_runtime': 189.3518, 'train_samples_per_second': 126.748, 'train_steps_per_second': 7.922, 'total_flos': 474224788404480.0, 'train_loss': 0.6799911499023438, 'epoch': 3.0})

In [25]:
eval_results = trainer.evaluate()
print(eval_results)

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16


{'eval_loss': 0.7140135765075684, 'eval_accuracy': 0.544, 'eval_runtime': 1.2825, 'eval_samples_per_second': 779.721, 'eval_steps_per_second': 49.122, 'epoch': 3.0}


In [26]:
trainer.save_model(output_dir='hatespeech-model')

Saving model checkpoint to hatespeech-model
Configuration saved in hatespeech-model/config.json
Model weights saved in hatespeech-model/pytorch_model.bin
tokenizer config file saved in hatespeech-model/tokenizer_config.json
Special tokens file saved in hatespeech-model/special_tokens_map.json


In [27]:
model = AutoModelForSequenceClassification.from_pretrained('hatespeech-model')

loading configuration file hatespeech-model/config.json
Model config BertConfig {
  "_name_or_path": "hatespeech-model",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.21.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 49601
}

loading weights file hatespeech-model/pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the model ch

In [28]:
predictions = trainer.predict(encoded_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1000
  Batch size = 16


In [29]:
preds = np.argmax(predictions.predictions, axis=-1)

In [30]:
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.523}

### Analiza sovražnega govora na podatkovni zbirki parlamentarnih debat
Naložimo podatkovno zbirko z zapisi parlamentarnih debat ter poženemo analizo sovražnega govora s pomočjo ravnokar ustvarjenega modela.

In [25]:
import csv

corpus = []
transkript = []
dvajset = []
devetnajst = []
osemnajst = []
sedemnajst = []
sestnajst = []

i=0
with open('dataframe.csv', 'r') as f:
  lineReader = csv.reader(f, delimiter=',', quotechar="\"")
  next(f)
  for row in lineReader:
    transkript.append(row[0])
    t = row[1].split('-')
    leto = t[0]
    mesec = t[1]
    rojstvo = row[4]
    stranka = row[5]
    spol = row[6]
    corpus.append({'text': row[0], 'leto': leto, 'mesec': mesec, 'rojstvo': rojstvo, 'stranka': stranka, 'spol': spol})
    if leto == '2020':
        dvajset.append({'text': row[0], 'leto': leto, 'mesec': mesec, 'rojstvo': rojstvo, 'stranka': stranka, 'spol': spol})
    elif leto == '2019':
        devetnajst.append({'text': row[0], 'leto': leto, 'mesec': mesec, 'rojstvo': rojstvo, 'stranka': stranka, 'spol': spol})
    elif leto == '2018':
        osemnajst.append({'text': row[0], 'leto': leto, 'mesec': mesec, 'rojstvo': rojstvo, 'stranka': stranka, 'spol': spol})
    elif leto == '2017':
        sedemnajst.append({'text': row[0], 'leto': leto, 'mesec': mesec, 'rojstvo': rojstvo, 'stranka': stranka, 'spol': spol})
    elif leto == '2016':
        sestnajst.append({'text': row[0], 'leto': leto, 'mesec': mesec, 'rojstvo': rojstvo, 'stranka': stranka, 'spol': spol})
    else:
        continue

In [26]:
import pandas as pd
df = pd.DataFrame(corpus, index=None, columns=['text', 'leto', 'mesec', 'rojstvo', 'stranka', 'spol'])

In [27]:
df.head()

Unnamed: 0,text,leto,mesec,rojstvo,stranka,spol
0,"spoštovani, prosim, da zasedete svoja mesta. v...",2014,8,1944,DeSUS,F
1,"hvala za besedo, predsedujoča. spoštovani pred...",2014,8,1960,SD,M
2,zahvaljujem se spoštovanemu gospodu janku vebr...,2014,8,1944,DeSUS,F
3,"spoštovana gospa predsedujoča, spoštovane posl...",2014,8,1963,,M
4,"predsedniku republike, spoštovanemu gospodu bo...",2014,8,1944,DeSUS,F


In [28]:
dva = pd.DataFrame(dvajset, index=None, columns=['text', 'leto', 'mesec', 'rojstvo', 'stranka', 'spol'])
devet = pd.DataFrame(devetnajst, index=None, columns=['text', 'leto', 'mesec', 'rojstvo', 'stranka', 'spol'])
osem = pd.DataFrame(osemnajst, index=None, columns=['text', 'leto', 'mesec', 'rojstvo', 'stranka', 'spol'])
sedem = pd.DataFrame(sedemnajst, index=None, columns=['text', 'leto', 'mesec', 'rojstvo', 'stranka', 'spol'])
sest = pd.DataFrame(sestnajst, index=None, columns=['text', 'leto', 'mesec', 'rojstvo', 'stranka', 'spol'])

In [31]:
januarDvajset = dva[dva['mesec']=='01']
marecDvajset = dva[dva['mesec']=='03']
aprilDvajset = dva[dva['mesec']=='04']
majDvajset = dva[dva['mesec']=='05']
junijDvajset = dva[dva['mesec']=='06']

In [32]:
majOsem = osem[osem['mesec']=='05']
junijOsem = osem[osem['mesec']=='06']
julijOsem = osem[osem['mesec']=='07']

In [29]:
def getHatespeech(leto):
    #examples = list(leto['text'])
    results = []
    
    for i in range(0, len(leto), 300):
        examples = list(leto['text'])
        if i+300<len(leto):
            print(i, i+300)
            examples = examples[i:i+300]
        else:
            print(i)
            examples = examples[i:]
            
        inputs = tokenizer(examples, padding='longest', return_tensors="pt", max_length=100, truncation=True)
        outputs = model(**inputs)
        probs = outputs[0].detach().numpy()

        for i in range(len(examples)):
            results.append(id2label[np.argmax(probs[i])])
            
    leto['hs'] = results

In [30]:
getHatespeech(sest)
sest['hs'].value_counts()

0 300
300 600
600 900
900 1200
1200 1500
1500 1800
1800 2100
2100 2400
2400 2700
2700 3000
3000 3300
3300 3600
3600 3900
3900 4200
4200 4500
4500 4800
4800 5100
5100 5400
5400 5700
5700 6000
6000 6300
6300 6600
6600 6900
6900 7200
7200 7500
7500 7800
7800 8100
8100 8400
8400 8700
8700 9000
9000 9300
9300 9600
9600 9900
9900 10200
10200 10500
10500 10800
10800 11100
11100 11400
11400 11700
11700 12000
12000 12300
12300 12600
12600 12900
12900 13200
13200 13500
13500 13800
13800 14100
14100 14400
14400 14700
14700


nesovrazni    13435
sovrazni       1289
Name: hs, dtype: int64

In [31]:
getHatespeech(sedem)
sedem['hs'].value_counts()

0 300
300 600
600 900
900 1200
1200 1500
1500 1800
1800 2100
2100 2400
2400 2700
2700 3000
3000 3300
3300 3600
3600 3900
3900 4200
4200 4500
4500 4800
4800 5100
5100 5400
5400 5700
5700 6000
6000 6300
6300 6600
6600 6900
6900 7200
7200 7500
7500 7800
7800 8100
8100 8400
8400 8700
8700 9000
9000 9300
9300 9600
9600 9900
9900 10200
10200 10500
10500 10800
10800 11100
11100 11400
11400 11700
11700 12000
12000 12300
12300 12600
12600 12900
12900 13200
13200 13500
13500


nesovrazni    12327
sovrazni       1244
Name: hs, dtype: int64

In [32]:
getHatespeech(osem)
osem['hs'].value_counts()

0 300
300 600
600 900
900 1200
1200 1500
1500 1800
1800 2100
2100 2400
2400 2700
2700 3000
3000 3300
3300 3600
3600 3900
3900 4200
4200 4500
4500 4800
4800 5100
5100 5400
5400 5700
5700 6000
6000 6300
6300 6600
6600 6900
6900 7200
7200 7500
7500 7800
7800 8100
8100 8400
8400


nesovrazni    7817
sovrazni       698
Name: hs, dtype: int64

In [33]:
getHatespeech(devet)
devet['hs'].value_counts()

0 300
300 600
600 900
900 1200
1200 1500
1500 1800
1800 2100
2100 2400
2400 2700
2700 3000
3000 3300
3300 3600
3600 3900
3900 4200
4200 4500
4500 4800
4800 5100
5100 5400
5400 5700
5700 6000
6000 6300
6300 6600
6600 6900
6900 7200
7200 7500
7500 7800
7800 8100
8100 8400
8400 8700
8700 9000
9000 9300
9300 9600
9600 9900
9900 10200
10200 10500
10500 10800
10800 11100
11100 11400
11400 11700
11700


nesovrazni    11312
sovrazni        565
Name: hs, dtype: int64

In [34]:
getHatespeech(dva)
dva['hs'].value_counts()

0 300
300 600
600 900
900 1200
1200 1500
1500 1800
1800 2100
2100 2400
2400 2700
2700 3000
3000 3300
3300 3600
3600 3900
3900 4200
4200 4500
4500 4800
4800 5100
5100 5400
5400 5700
5700 6000
6000 6300
6300 6600
6600


nesovrazni    6247
sovrazni       354
Name: hs, dtype: int64

In [35]:
sest.to_csv('dvaSestnajst-crosloengualBERT-hatespeech.csv', encoding = 'utf-8-sig', index=False)

In [36]:
sedem.to_csv('dvaSedemnajst-crosloengualBERT-hatespeech.csv', encoding = 'utf-8-sig', index=False)

In [37]:
osem.to_csv('dvaOsemnajst-crosloengualBERT-hatespeech.csv', encoding = 'utf-8-sig', index=False)

In [38]:
devet.to_csv('dvaDevetnajst-crosloengualBERT-hatespeech.csv', encoding = 'utf-8-sig', index=False)

In [39]:
dva.to_csv('dvaDvajset-crosloengualBERT-hatespeech.csv', encoding = 'utf-8-sig', index=False)

In [31]:
januarDvajset.shape

(758, 6)

Testiranje izvedemo še na naključno izbranih parlamentarnih govorih, katerim je bila oznaka sovrazni/nesovrazni govor bila dodana ročno.

In [31]:
dataset2 = load_dataset(
    'csv',
    data_files={
        'test': 'dataParlamentHheadless.csv'
    },
    column_names = ['sentence', 'label']
)

Using custom data configuration default-3501b4611139347a


Downloading and preparing dataset csv/default to /home/ncirar/.cache/huggingface/datasets/csv/default-3501b4611139347a/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /home/ncirar/.cache/huggingface/datasets/csv/default-3501b4611139347a/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [32]:
encoded_dataset2 = dataset.map(preprocess, batched=True, load_from_cache_file=False)

  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [33]:
predictions = trainer.predict(encoded_dataset2["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1000
  Batch size = 16


In [34]:
preds = np.argmax(predictions.predictions, axis=-1)

In [35]:
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.523}