We install all the required libraries:

In [1]:
!pip install transformers datasets scikit-learn numpy torch torchvision

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.0-py3-none-any.whl (4.7 MB)
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting xxhash
  Download

### Creating the *dataset* for model training

In [2]:
from datasets import load_dataset, load_metric

We load the two CSV files prepared at the end of preprocessing: one for the *training dataset* and one for the *test dataset*. From each we remove the first line, which contains the column names, and save the result as a new file.
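The header-stripping step described above is applied to both files, so it can be sketched as one small reusable helper. The function name `strip_header` and the UTF-8 default are assumptions made for this illustration; the sketch also assumes the header occupies exactly one physical line.

```python
def strip_header(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path without its first line.

    Returns the number of data lines written.
    """
    written = 0
    with open(src_path, "r", encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        # Skip the header line; the default value avoids StopIteration on an empty file.
        next(src, None)
        for line in src:
            dst.write(line)
            written += 1
    return written
```

With such a helper, a call like `strip_header("hateSpeechTrain.csv", "hateSpeechTrainHeadless.csv")` would produce the headerless training file.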

In [3]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving hateSpeechTest.csv to hateSpeechTest.csv
Saving hateSpeechTrain.csv to hateSpeechTrain.csv
User uploaded file "hateSpeechTest.csv" with length 1627080 bytes
User uploaded file "hateSpeechTrain.csv" with length 6454275 bytes


In [4]:
with open("hateSpeechTest.csv",'r') as f:
    with open("hateSpeechTestHeadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)

In [5]:
files.download('hateSpeechTestHeadless.csv')

In [6]:
with open("hateSpeechTrain.csv",'r') as f:
    with open("hateSpeechTrainHeadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)

In [7]:
files.download('hateSpeechTrainHeadless.csv')

We create a *dataset* instance, defining the *train* and *test* splits accordingly.

In [8]:
dataset = load_dataset(
    'csv',
    data_files={
        'train': 'hateSpeechTrainHeadless.csv',
        'test': 'hateSpeechTestHeadless.csv'
    },
    column_names = ['sentence', 'label']
)

Using custom data configuration default-128266728e169b86


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-128266728e169b86/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-128266728e169b86/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [9]:
len(dataset['train'])

62536

In [10]:
len(dataset['test'])

15634

We load the predefined GLUE metric (the `sst2` configuration) and the tokenizer of [CroSloEngual BERT](https://huggingface.co/EMBEDDIA/crosloengual-bert).

In [11]:
metric = load_metric('glue', 'sst2')

Downloading builder script:   0%|          | 0.00/1.84k [00:00<?, ?B/s]

### Tokenizing the data in the dataset

In [12]:
from transformers import AutoTokenizer

In [13]:
tokenizer = AutoTokenizer.from_pretrained(
    'EMBEDDIA/crosloengual-bert',
    use_fast=True
)

Downloading tokenizer_config.json:   0%|          | 0.00/46.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/433 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/321k [00:00<?, ?B/s]

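We convert the textual class labels to integers: 3 for *3 nasilje* (violence), 2 for *2 žalitev* (insult), 1 for *1 nespodobni govor* (inappropriate speech) and 0 for *0 ni sporni govor* (not problematic speech). We also set the maximum length of a tweet to 512 tokens, not characters, since `max_length` counts tokenizer tokens.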

In [30]:
label2id = {'3 nasilje': 3, '2 žalitev': 2, '1 nespodobni govor': 1, '0 ni sporni govor': 0}
id2label = ['0 ni sporni govor', '1 nespodobni govor', '2 žalitev', '3 nasilje']
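Because `label2id` and `id2label` are written out by hand, they can drift apart if a class is added or renamed. A minimal sketch, assuming the same four labels, of deriving the list from the dictionary so that the two always agree:

```python
label2id = {
    '3 nasilje': 3,
    '2 žalitev': 2,
    '1 nespodobni govor': 1,
    '0 ni sporni govor': 0,
}

# Order the label names by their numeric id so that id2label[i] is the name of class i.
id2label = [name for name, _ in sorted(label2id.items(), key=lambda item: item[1])]

# Sanity check: the two mappings must be inverses of each other.
assert all(label2id[name] == i for i, name in enumerate(id2label))
```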

In [15]:
def preprocess(examples):
  result = tokenizer(examples['sentence'], truncation=True, max_length=512)
  result['label'] = [label2id[l] for l in examples['label']]
  return result

In [16]:
encoded_dataset = dataset.map(preprocess, batched=True, load_from_cache_file=False)

  0%|          | 0/63 [00:00<?, ?ba/s]

  0%|          | 0/16 [00:00<?, ?ba/s]
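With `batched=True`, `map` hands the function a dictionary of columns (lists) rather than a single row, which is why `preprocess` builds the labels with a list comprehension. The sketch below mimics only the label step on a hypothetical two-row batch (the sentences are made up) and reproduces the 63 and 16 batches shown in the progress bars, assuming the default batch size of 1000 rows.

```python
import math

label2id = {'3 nasilje': 3, '2 žalitev': 2, '1 nespodobni govor': 1, '0 ni sporni govor': 0}


def encode_labels(examples):
    """Mimic the label step of preprocess on a columnar batch (dict of lists)."""
    return {'label': [label2id[l] for l in examples['label']]}


# Hypothetical batch in the shape that map(..., batched=True) passes to the function.
batch = {
    'sentence': ['primer ena', 'primer dva'],
    'label': ['0 ni sporni govor', '3 nasilje'],
}
encoded = encode_labels(batch)

# Number of batches per split, assuming the default of 1000 rows per batch.
train_batches = math.ceil(62536 / 1000)
test_batches = math.ceil(15634 / 1000)
```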

### Training the model
We load a model of the same type as the tokenizer (CroSloEngual BERT), define the training arguments and create a new *trainer* instance.

In [17]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np

In [18]:
model = AutoModelForSequenceClassification.from_pretrained(
    'EMBEDDIA/crosloengual-bert',
    num_labels=4
)

Downloading pytorch_model.bin:   0%|          | 0.00/476M [00:00<?, ?B/s]

Some weights of the model checkpoint at EMBEDDIA/crosloengual-bert were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model chec

In [19]:
args = TrainingArguments(
    "tweet-hatespeech",
    evaluation_strategy = "steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=0.1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    )

In [20]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
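The GLUE `sst2` metric reports accuracy, so `compute_metrics` boils down to taking the arg-max of each row of logits and counting matches with the reference labels. A library-free sketch of that calculation follows; the helper names and the toy logits are made up for illustration and are not part of the notebook.

```python
def argmax(row):
    """Index of the largest value in a list of numbers."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose arg-max class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical logits for four examples and four classes.
toy_logits = [
    [2.0, 0.1, -1.0, 0.3],  # arg-max is class 0
    [0.2, 0.1, 1.5, 0.0],   # arg-max is class 2
    [0.0, 0.9, 0.1, 0.2],   # arg-max is class 1
    [0.5, 0.4, 0.3, 1.1],   # arg-max is class 3
]
toy_labels = [0, 2, 3, 3]
toy_accuracy = accuracy(toy_logits, toy_labels)  # three of four predictions match
```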

In [21]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=None,
    )

In [22]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 62536
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 391


Step,Training Loss,Validation Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=391, training_loss=0.8586124507972347, metrics={'train_runtime': 61.6397, 'train_samples_per_second': 101.454, 'train_steps_per_second': 6.343, 'total_flos': 122291281167744.0, 'train_loss': 0.8586124507972347, 'epoch': 0.1})
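The 391 optimization steps reported in the log follow from the dataset size, the batch size and the fractional epoch count. A rough sketch of the arithmetic, assuming a single device and no gradient accumulation (as the log indicates):

```python
import math

num_examples = 62536      # size of the training split
batch_size = 16           # per_device_train_batch_size
num_train_epochs = 0.1    # only a tenth of one epoch

# Batches needed to see the whole training split once.
steps_per_epoch = math.ceil(num_examples / batch_size)

# Steps actually run for the fractional epoch.
total_steps = math.ceil(num_train_epochs * steps_per_epoch)

# Approximate number of training examples the model has seen.
examples_seen = total_steps * batch_size
```

So the model sees only about 6,000 of the 62,536 training examples. The empty Step / Training Loss / Validation Loss table is most likely explained by the same numbers: to our knowledge the default logging and evaluation interval in this version of `transformers` is 500 steps, which is more than the 391 steps of the run.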

In [23]:
eval_results = trainer.evaluate()
print(eval_results)

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 15634
  Batch size = 16


{'eval_loss': 0.82618248462677, 'eval_accuracy': 0.6366892669822183, 'eval_runtime': 36.8921, 'eval_samples_per_second': 423.777, 'eval_steps_per_second': 26.51, 'epoch': 0.1}


In [24]:
# Raw string, so the backslashes in the Windows path are not parsed as escape sequences
trainer.save_model(output_dir=r'C:\Users\gogi1\Desktop\diploma\model\HateSpeech')

Saving model checkpoint to tweet-sentiment-model
Configuration saved in tweet-sentiment-model/config.json
Model weights saved in tweet-sentiment-model/pytorch_model.bin
tokenizer config file saved in tweet-sentiment-model/tokenizer_config.json
Special tokens file saved in tweet-sentiment-model/special_tokens_map.json


In [25]:
# Raw string, so the backslashes in the Windows path are not parsed as escape sequences
model = AutoModelForSequenceClassification.from_pretrained(r'C:\Users\gogi1\Desktop\diploma\model\HateSpeech')

loading configuration file tweet-sentiment-model/config.json
Model config BertConfig {
  "_name_or_path": "tweet-sentiment-model",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.21.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 49601
}

loading weights file tweet

### Hate speech analysis on the parliamentary debates dataset
We load the dataset containing transcripts of parliamentary debates and run the hate speech analysis with the model we have just created.

In [32]:
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving dataframe.csv to dataframe.csv
User uploaded file "dataframe.csv" with length 92383142 bytes


In [33]:
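import csv

corpus = []  # one dict per row: the speech text and its date
text = []    # speech texts only, used as model input

# newline='' is what the csv module recommends; UTF-8 encoding is assumed here
with open('dataframe.csv', 'r', newline='', encoding='utf-8') as f:
  lineReader = csv.reader(f, delimiter=',', quotechar="\"")
  for row in lineReader:
    if row:  # skip empty lines
      #print(row)
      text.append(row[0])
      corpus.append({'text': row[0], 'datum': row[1]})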

In [None]:
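import torch

res = []
batch_size = 32  # assumed value; batching keeps memory use bounded on a large corpus

model.eval()  # disable dropout for inference
for start in range(0, len(text), batch_size):
    batch = text[start:start + batch_size]
    # Truncate to the model's 512-token limit and pad to the longest text in the batch
    inputs = tokenizer(batch, padding='longest', truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():  # no gradients are needed for prediction
        logits = model(**inputs).logits.numpy()
    # The outputs are logits; the arg-max gives the predicted class
    res.extend(id2label[i] for i in np.argmax(logits, axis=1))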
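The values returned by the model are raw logits rather than probabilities. The arg-max is the same either way, but if per-class probabilities are wanted (for example to filter out low-confidence predictions), a softmax can be applied to each row. A library-free sketch, with hypothetical logits for a single text:

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [1.2, -0.3, 0.4, -2.0]  # hypothetical model output for one text
example_probs = softmax(example_logits)

# The most probable class is the same as the arg-max of the logits.
predicted = max(range(len(example_probs)), key=example_probs.__getitem__)
```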