For the machine learning part we connect to a server at FRI. The command below selects the GPU that will be used.

In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "1"

We load the txt file from the [SentiNews](https://www.clarin.si/repository/xmlui/handle/11356/1110) dataset, which contains news texts, each annotated with a sentiment label (neutral, negative, positive).

In [2]:
import csv
import re
data = []

with open('SentiNews_paragraph-level.txt', 'r', encoding='utf-8') as f:
  lineReader = csv.reader(f, delimiter=',', quotechar="\"")
  for row in lineReader:
    if row:
      # Re-join the comma-split pieces with spaces, then split the line on tabs.
      row = ' '.join(row)
      elementi = row.split('\t')
      sentence = elementi[2]    # paragraph text
      sentiment = elementi[11]  # sentiment label
      # Keep only letters and basic punctuation in the label.
      sentiment = re.sub(r'[^A-Za-zČŠŽĆčšžć.,!?]+', "", sentiment)
      data.append({'text': sentence, 'sent': sentiment})

In [3]:
data = data[1:]
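
Because the loader above ends up splitting each line on tabs, the file can presumably also be read as tab-separated directly, which would keep the commas in the text. The following is only a sketch on made-up data: the 12-column layout (text in column 2, label in column 11) is inferred from the indices used above, and the sample row is hypothetical.

```python
import csv
import io

# Hypothetical sample in the assumed layout: 12 tab-separated columns,
# text in column 2 and sentiment label in column 11.
header = "\t".join(f"col{i}" for i in range(12))
row = "\t".join(["1", "1", "Cene so se zvišale, plače pa ne."] + ["x"] * 8 + ["negative"])
sample = io.StringIO(header + "\n" + row + "\n")

# QUOTE_NONE treats quote characters as ordinary text.
reader = csv.reader(sample, delimiter="\t", quoting=csv.QUOTE_NONE)
next(reader)  # skip the header row
data_tsv = [{"text": r[2], "sent": r[11].strip()} for r in reader if r]
print(data_tsv)
```

With the real file, `sample` would be replaced by the opened file object.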

In [4]:
import pandas as pd
df = pd.DataFrame(data, index=None, columns=['text', 'sent'])

In [5]:
df['sent'].value_counts()

neutral     40358
negative    18268
positive    10781
Name: sent, dtype: int64
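
The counts above show that the classes are imbalanced. As a rough reference point, the share of the largest class is the accuracy a classifier would reach by always predicting `neutral`. The sketch below only redoes this arithmetic on the printed counts; the subset used for training later may have a different distribution.

```python
# Label counts copied from the value_counts() output above.
counts = {"neutral": 40358, "negative": 18268, "positive": 10781}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_baseline = max(shares.values())  # accuracy of always predicting the largest class

for label, share in shares.items():
    print(f"{label:<9}{share:.3f}")
print("majority baseline:", round(majority_baseline, 3))
```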

In [6]:
df.sample(15)

Unnamed: 0,text,sent
43998,Juri Pomemben je psihološki učinek skupne valute,neutral
19634,katja.svensek@dnevnik.si,neutral
13041,Pred prenagljenim poseganjem po varčevalnih uk...,negative
4161,Predsednik uprave Mure Franc Huber je nadzorne...,neutral
35571,AIG želi zbrati svež kapital,neutral
31928,Japonska centralna banka z najtežjim topništvo...,negative
3476,V Gorenju so delavcem že znižali plače za 10 o...,negative
58869,Na trgu dela se je poslabševanje razmer umiril...,neutral
57561,Če se bo izkazalo da so trditve točne bodo m...,negative
1536,V skladu z vladno Uredbo o oblikovanju cen naf...,neutral


## Preprocessing
We convert the text to lowercase and remove extra whitespace, special symbols and digits. We then drop all rows with missing values; the row count stays at 69,407, so no rows are removed at this step.

In [7]:
df['text'] = df['text'].apply(lambda x: ' '.join(x.lower().split()))

In [8]:
df['text'] = df['text'].apply(lambda x: ' '.join(x.split()))

In [9]:
import re
df['text'] = df['text'].apply(lambda x: re.sub(r'[^\w ]+', "", x))

In [10]:
df['text'] = df['text'].apply(lambda x: re.sub(r"\d+", "", x))

In [11]:
df.shape

(69407, 2)

In [12]:
df = df.dropna()

In [13]:
df.shape

(69407, 2)
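
The cleaning steps above can be collected into one function, which makes it easier to apply exactly the same preprocessing to other texts later. This is a sketch rather than the notebook's own code: the name `clean_text` is hypothetical, and the final whitespace collapse is an addition (removing digits can otherwise leave double spaces).

```python
import re

def clean_text(text):
    """Apply the cleaning steps from the cells above to one string."""
    text = " ".join(text.lower().split())   # lowercase, normalise whitespace
    text = re.sub(r"[^\w ]+", "", text)     # drop punctuation and other symbols
    text = re.sub(r"\d+", "", text)         # drop digits
    return " ".join(text.split())           # collapse spaces left behind (added step)

print(clean_text("  V Gorenju so plače znižali za 10 %!  "))
```

It could be applied with `df['text'].apply(clean_text)`. Texts that become empty strings are not missing values, so `dropna()` would keep them; filtering on `df['text'] != ''` would be one way to remove them.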

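## Training, validation and test sets
We split the table into three parts: a training, a validation and a test set, with sizes in the ratio 8:1:1. The slices below take the first 10,002 of the 69,407 rows in file order and skip rows 8000 and 9001 at the split boundaries.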

In [14]:
train = df[:8000]

In [15]:
evaluation = df[8001:9001]

In [16]:
test = df[9002:10002]

In [17]:
train.shape

(8000, 2)

In [18]:
evaluation.shape

(1000, 2)

In [19]:
test.shape

(1000, 2)
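
The slices above take consecutive rows in file order. If the file is sorted in any way (for example by source or date), a shuffled split may be more representative. The helper below is a minimal sketch of an 8:1:1 split over shuffled row positions; `split_indices` and the seed value are hypothetical, not part of the original notebook.

```python
import random

def split_indices(n_rows, seed=42, ratios=(0.8, 0.1)):
    """Return shuffled, non-overlapping index lists for train/validation/test.

    ratios gives the train and validation shares; the test set gets the rest.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_train = int(n_rows * ratios[0])
    n_val = int(n_rows * ratios[1])
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(10000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting lists could then be used as `df.iloc[train_idx]`, `df.iloc[val_idx]` and `df.iloc[test_idx]`.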

In [20]:
#from google.colab import files
train.to_csv('sentTrain.csv', encoding = 'utf-8-sig', index=False)
evaluation.to_csv('sentEvaluation.csv', encoding='utf-8-sig', index=False)
test.to_csv('sentTest.csv', encoding = 'utf-8-sig', index=False)

We install and import the required libraries.

In [21]:
!pip install transformers datasets scikit-learn numpy torch torchvision



In [22]:
from datasets import load_dataset, load_metric



We remove the first row, which contains the column names, from all three files.

In [23]:
with open("sentTrain.csv",'r') as f:
    with open("sentTrainHeadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
          f1.write(line)

In [24]:
with open("sentTest.csv",'r') as f:
    with open("sentTestHeadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
          f1.write(line)

In [25]:
with open("sentEvaluation.csv",'r') as f:
    with open("sentEvaluationHeadless.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)
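
The three cells above repeat the same steps, so they could be folded into one helper. The sketch below uses a hypothetical function name, `strip_header`, and demonstrates it on a throwaway file with made-up content. Reading with `utf-8-sig` is an assumption that matches the encoding used when the files were written.

```python
import os
import tempfile

def strip_header(src, dst, encoding="utf-8-sig"):
    """Copy src to dst without its first line (the header)."""
    with open(src, "r", encoding=encoding) as fin, open(dst, "w", encoding="utf-8") as fout:
        next(fin, None)  # skip the header line; tolerate an empty file
        for line in fin:
            fout.write(line)

# Demonstration on a temporary file with made-up content.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "example.csv")
dst_path = os.path.join(tmp_dir, "exampleHeadless.csv")
with open(src_path, "w", encoding="utf-8-sig") as f:
    f.write("text,sent\nprimer besedila,neutral\n")

strip_header(src_path, dst_path)
with open(dst_path, encoding="utf-8") as f:
    headless = f.read()
print(repr(headless))
```

Another option is to skip this step entirely by writing the files with `to_csv(..., header=False)` in the first place.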

We create a new dataset instance that contains all three splits.

In [26]:
dataset = load_dataset(
    'csv',
    data_files={
        'train': 'sentTrainHeadless.csv',
        'validation': 'sentEvaluationHeadless.csv',
        'test': 'sentTestHeadless.csv'
    },
    column_names = ['sentence', 'label']
)


In [27]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['sentence', 'label'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 1000
    })
})

We load the predefined metric of the [SST2](https://huggingface.co/datasets/sst2) task and create a tokenizer instance, which we use to tokenize the text (words are split into tokens and mapped to numeric ids).

In [28]:
metric = load_metric('glue', 'sst2')

In [29]:
from transformers import AutoTokenizer

In [30]:
tokenizer = AutoTokenizer.from_pretrained(
    'EMBEDDIA/crosloengual-bert',
    use_fast=True
)

In [31]:
tokenizer(['hello', 'world'])

{'input_ids': [[103, 17592, 1169, 104], [103, 2329, 104]], 'token_type_ids': [[0, 0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1]]}

In [32]:
label2id = {'positive': 2, 'neutral': 1, 'negative': 0}
id2label = ['negative', 'neutral', 'positive']
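
The two mappings above are written by hand and must stay consistent with each other. A small sketch of deriving the list from the dictionary, so that only one of them needs to be maintained:

```python
label2id = {'positive': 2, 'neutral': 1, 'negative': 0}

# Build the inverse list from the dictionary, ordered by id.
id2label = [label for label, _ in sorted(label2id.items(), key=lambda kv: kv[1])]

# Sanity check: position i in the list must map back to id i.
assert all(label2id[name] == i for i, name in enumerate(id2label))
print(id2label)
```

As far as we know, both mappings can also be passed to `from_pretrained` as the `id2label` and `label2id` keyword arguments, in which case `id2label` should be a dictionary from id to name. This is an assumption here and is not done in the cells of this notebook.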

In [33]:
def preprocess(examples):
  result = tokenizer(examples['sentence'], truncation=True, max_length=512)
  result['label'] = [label2id[l] for l in examples['label']]
  return result

In [34]:
encoded_dataset = dataset.map(preprocess, batched=True, load_from_cache_file=False)


In [35]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [36]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np



### Training the model
We create a model instance, the training arguments and a Trainer instance, which we then use to train the model. Training runs for 3 epochs, and the model's accuracy on the validation set is computed at the end of each epoch.

In [37]:
model = AutoModelForSequenceClassification.from_pretrained(
    'EMBEDDIA/crosloengual-bert',
    num_labels=3
)

Some weights of the model checkpoint at EMBEDDIA/crosloengual-bert were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model chec

In [38]:
args = TrainingArguments(
    "sentiment",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3.0,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    )

In [39]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
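
The metric used in `compute_metrics` reports accuracy, as the outputs further down show. To make explicit what happens to the model outputs, here is a plain-Python sketch of the same computation on made-up logits; `accuracy_from_logits` is a hypothetical name, not a library function.

```python
def accuracy_from_logits(logits, labels):
    """Share of rows whose highest-scoring class equals the reference label."""
    # argmax per row: index of the largest logit
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Made-up logits for four examples and three classes.
toy_logits = [[2.0, 0.1, -1.0], [0.2, 1.5, 0.3], [0.0, 0.4, 0.9], [1.2, 0.8, 0.1]]
toy_labels = [0, 1, 1, 0]
print(accuracy_from_logits(toy_logits, toy_labels))
```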

In [40]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=None,
    )

In [22]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 8000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1500


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6745,0.678667,0.701
2,0.4703,0.673197,0.701
3,0.3208,0.773768,0.694


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16
Saving model checkpoint to sentiment/checkpoint-500
Configuration saved in sentiment/checkpoint-500/config.json
Model weights saved in sentiment/checkpoint-500/pytorch_model.bin
tokenizer config file saved in sentiment/checkpoint-500/tokenizer_config.json
Special tokens file saved in sentiment/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch si

TrainOutput(global_step=1500, training_loss=0.4885126241048177, metrics={'train_runtime': 342.0821, 'train_samples_per_second': 70.159, 'train_steps_per_second': 4.385, 'total_flos': 1463066826159456.0, 'train_loss': 0.4885126241048177, 'epoch': 3.0})
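
The step counts in the log follow directly from the dataset size and the batch size. The check below assumes a single device and no gradient accumulation, as reported in the log above.

```python
import math

num_examples = 8000   # training examples
batch_size = 16       # per_device_train_batch_size
num_epochs = 3

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

This matches the first checkpoint being saved as `checkpoint-500` (the end of epoch 1) and `global_step=1500` in the training output.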

In [23]:
eval_results = trainer.evaluate()
print(eval_results)

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16


{'eval_loss': 0.6786672472953796, 'eval_accuracy': 0.701, 'eval_runtime': 3.3131, 'eval_samples_per_second': 301.83, 'eval_steps_per_second': 19.015, 'epoch': 3.0}


We save the model and load it back.

In [24]:
trainer.save_model(output_dir='sentiment-model')

Saving model checkpoint to sentiment-model
Configuration saved in sentiment-model/config.json
Model weights saved in sentiment-model/pytorch_model.bin
tokenizer config file saved in sentiment-model/tokenizer_config.json
Special tokens file saved in sentiment-model/special_tokens_map.json


In [41]:
model = AutoModelForSequenceClassification.from_pretrained('sentiment-model')

loading configuration file sentiment-model/config.json
Model config BertConfig {
  "_name_or_path": "sentiment-model",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.21.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 49601
}

loading weights file sentiment-model/pytorch_model.bin
All model checkpoint 

We check how the model behaves on a handful of arbitrary sentences.

In [23]:
examples=['Rezultati za prejšnje leto so res pohvale vredni.',
          'Najlepša hvala za pomoč, zelo sem hvaležen.',
          'Neumni politiki nimajo pojma.', 
          'Če me ne pustiš pri miru, te bom udaril!',
          'Jutri bo deževalo.',
          'Ne maram mleka.',
          'Sovražim ponedeljke.',
          'Lansko poročilo kaže res dobre rezultate, super.',
          'Veselimo se sodelovanja z vami.',
          'Nemški ovčar je vrsta psa.',
          'Včeraj sem videl čudovito mavrico, kar me je zelo osrečilo.',
          'Oblaki so prekrili nebo',
          'Lahko nadaljujete z govorom.',
          'Takoj prenehajte, drugače dobite opomin.']

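inputs = tokenizer(examples, padding='longest', return_tensors="pt")
outputs = model(**inputs)
# outputs[0] holds the raw logits (one score per class), not normalised probabilities.
probs = outputs[0].detach().numpy()
for i in range(len(examples)):
    # The predicted label is the class with the highest score.
    print(examples[i],'\t', id2label[np.argmax(probs[i])])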

Rezultati za prejšnje leto so res pohvale vredni. 	 positive
Najlepša hvala za pomoč, zelo sem hvaležen. 	 neutral
Neumni politiki nimajo pojma. 	 negative
Če me ne pustiš pri miru, te bom udaril! 	 negative
Jutri bo deževalo. 	 neutral
Ne maram mleka. 	 neutral
Sovražim ponedeljke. 	 neutral
Lansko poročilo kaže res dobre rezultate, super. 	 positive
Veselimo se sodelovanja z vami. 	 positive
Nemški ovčar je vrsta psa. 	 neutral
Včeraj sem videl čudovito mavrico, kar me je zelo osrečilo. 	 positive
Oblaki so prekrili nebo 	 negative
Lahko nadaljujete z govorom. 	 neutral
Takoj prenehajte, drugače dobite opomin. 	 neutral
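
The cell above picks the class with the highest raw score. If a confidence value per class is wanted, the scores can be turned into probabilities with a softmax. The sketch below works on a made-up row of scores and does not use the model.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [-1.2, 2.3, 0.4]  # hypothetical scores for negative, neutral, positive
probabilities = softmax(toy_logits)
print([round(p, 3) for p in probabilities])
```

The softmax does not change which class has the highest value, so the predicted labels stay the same.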


### Final test of the model
Finally, we measure the accuracy of the model's predictions on the test set.

In [27]:
predictions = trainer.predict(encoded_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1000
  Batch size = 16


In [28]:
preds = np.argmax(predictions.predictions, axis=-1)

In [29]:
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.739}
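
A single accuracy figure can hide how the model does on the smaller classes. A confusion matrix and per-class recall give a more detailed picture. The following sketch uses made-up label lists; the function names are hypothetical.

```python
from collections import Counter

def confusion_counts(predictions, references, num_classes=3):
    """Return a num_classes x num_classes matrix; rows are true labels, columns predictions."""
    pairs = Counter(zip(references, predictions))
    return [[pairs[(t, p)] for p in range(num_classes)] for t in range(num_classes)]

def recall_per_class(matrix):
    """Share of each true class that was predicted correctly (None if the class is absent)."""
    return [row[i] / sum(row) if sum(row) else None for i, row in enumerate(matrix)]

# Made-up references and predictions (0 = negative, 1 = neutral, 2 = positive).
toy_refs = [0, 0, 1, 1, 1, 2]
toy_preds = [0, 1, 1, 1, 2, 2]
matrix = confusion_counts(toy_preds, toy_refs)
print(matrix)
print(recall_per_class(matrix))
```

With the variables from the cells above, the call would be `confusion_counts(preds.tolist(), predictions.label_ids.tolist())`.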

## Sentiment analysis on the ParlaMint parliamentary corpus
We load the file we prepared beforehand (it contains transcripts of parliamentary sessions and metadata about the speakers).

In [30]:
import csv

corpus = []
transkript = []
dvajset = []
devetnajst = []
osemnajst = []
sedemnajst = []
sestnajst = []

i=0
with open('dataframe.csv', 'r') as f:
  lineReader = csv.reader(f, delimiter=',', quotechar="\"")
  next(f)
  for row in lineReader:
    transkript.append(row[0])
    t = row[1].split('-')
    leto = t[0]
    mesec = t[1]
    ime = row[4]
    rojstvo = row[5]
    
    if rojstvo == '-\n':
        starost = 'Ni oznake'
    else:
        rojstvo = int(rojstvo)
        if 2022 - rojstvo >= 65:
            starost = 'nad 65'
        elif 2022 - rojstvo < 65 and 2022 - rojstvo >=50:
            starost = '50-65'
        else:
            starost = '30-49'
        
    stranka = row[6]
    spol = row[7]
    # Build the record once and reuse it for the full corpus and the per-year list.
    zapis = {'text': row[0], 'leto': leto, 'mesec': mesec, 'ime': ime, 'starost': starost, 'stranka': stranka, 'spol': spol}
    corpus.append(zapis)
    # Speeches from 2016-2020 also go into a per-year list; other years stay only in corpus.
    if leto == '2020':
        dvajset.append(zapis)
    elif leto == '2019':
        devetnajst.append(zapis)
    elif leto == '2018':
        osemnajst.append(zapis)
    elif leto == '2017':
        sedemnajst.append(zapis)
    elif leto == '2016':
        sestnajst.append(zapis)
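
The age grouping in the loop above can be isolated into a small function, which makes it easy to test. This is a sketch: `age_group` is a hypothetical name. Treating every non-numeric value as a missing birth year is an assumption, since the original only checks for `'-\n'`.

```python
def age_group(birth_year, reference_year=2022):
    """Map a birth year (string or int) to the age labels used above."""
    value = str(birth_year).strip()
    if not value.isdigit():
        return 'Ni oznake'   # no usable birth year (assumption: any non-numeric value)
    age = reference_year - int(value)
    if age >= 65:
        return 'nad 65'
    if age >= 50:
        return '50-65'
    return '30-49'           # as in the cell above, everyone under 50 lands here

print(age_group('1950'), age_group(1970), age_group('1985'), age_group('-\n'))
```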

In [33]:
import pandas as pd
df = pd.DataFrame(corpus, index=None, columns=['text', 'leto', 'mesec', 'ime','starost', 'stranka', 'spol'])

In [34]:
df.head(5)

Unnamed: 0,text,leto,mesec,ime,starost,stranka,spol
0,"spoštovani, prosim, da zasedete svoja mesta. v...",2014,8,"Kotnik Poropat, Marjana",nad 65,DeSUS,F
1,"hvala za besedo, predsedujoča. spoštovani pred...",2014,8,"Veber, Janko",50-65,SD,M
2,zahvaljujem se spoštovanemu gospodu janku vebr...,2014,8,"Kotnik Poropat, Marjana",nad 65,DeSUS,F
3,"spoštovana gospa predsedujoča, spoštovane posl...",2014,8,"Pahor, Borut",50-65,,M
4,"predsedniku republike, spoštovanemu gospodu bo...",2014,8,"Kotnik Poropat, Marjana",nad 65,DeSUS,F


We build a separate pandas DataFrame for each year. We then go through all 5 tables and collect the speaker metadata that we will use in the analysis. The resulting lists are deduplicated and saved to csv files.

In [38]:
dva = pd.DataFrame(dvajset, index=None, columns=['text', 'leto', 'mesec','ime','starost', 'stranka', 'spol'])
devet = pd.DataFrame(devetnajst, index=None, columns=['text', 'leto', 'mesec','ime','starost', 'stranka', 'spol'])
osem = pd.DataFrame(osemnajst, index=None, columns=['text', 'leto', 'mesec','ime','starost', 'stranka', 'spol'])
sedem = pd.DataFrame(sedemnajst, index=None, columns=['text', 'leto', 'mesec','ime','starost', 'stranka', 'spol'])
sest = pd.DataFrame(sestnajst, index=None, columns=['text', 'leto', 'mesec','ime','starost', 'stranka', 'spol'])

In [66]:
poslanciSest = []

for index, row in sest.iterrows():
    ime = row['ime']
    spol = row['spol']
    stranka = row['stranka']
    starost = row['starost']
    poslanciSest.append({'ime': ime, 'spol': spol, 'stranka': stranka, 'starost': starost})

In [67]:
dataSest = pd.DataFrame(poslanciSest, index=None, columns=['ime','spol', 'stranka', 'starost'])

In [68]:
dataSest = dataSest.drop_duplicates()

In [60]:
poslanciSedem = []
for index, row in sedem.iterrows():
    ime = row['ime']
    spol = row['spol']
    stranka = row['stranka']
    starost = row['starost']
    poslanciSedem.append({'ime': ime, 'spol': spol, 'stranka': stranka, 'starost': starost})
dataSedem = pd.DataFrame(poslanciSedem, index=None, columns=['ime','spol', 'stranka', 'starost'])
dataSedem = dataSedem.drop_duplicates()

In [61]:
poslanciOsem = []
for index, row in osem.iterrows():
    ime = row['ime']
    spol = row['spol']
    stranka = row['stranka']
    starost = row['starost']
    poslanciOsem.append({'ime': ime, 'spol': spol, 'stranka': stranka, 'starost': starost})
dataOsem = pd.DataFrame(poslanciOsem, index=None, columns=['ime','spol', 'stranka', 'starost'])
dataOsem = dataOsem.drop_duplicates()

In [62]:
poslanciDevet = []
for index, row in devet.iterrows():
    ime = row['ime']
    spol = row['spol']
    stranka = row['stranka']
    starost = row['starost']
    poslanciDevet.append({'ime': ime, 'spol': spol, 'stranka': stranka, 'starost': starost})
dataDevet = pd.DataFrame(poslanciDevet, index=None, columns=['ime','spol', 'stranka', 'starost'])
dataDevet = dataDevet.drop_duplicates()

In [63]:
poslanciDvajset = []
for index, row in dva.iterrows():
    ime = row['ime']
    spol = row['spol']
    stranka = row['stranka']
    starost = row['starost']
    poslanciDvajset.append({'ime': ime, 'spol': spol, 'stranka': stranka, 'starost': starost})
dataDvajset = pd.DataFrame(poslanciDvajset, index=None, columns=['ime','spol', 'stranka', 'starost'])
dataDvajset = dataDvajset.drop_duplicates()

In [64]:
dataSest.to_csv('poslanciDvaSestnajst.csv', encoding = 'utf-8-sig', index=False)
dataSedem.to_csv('poslanciDvaSedemnajst.csv', encoding = 'utf-8-sig', index=False)
dataOsem.to_csv('poslanciDvaOsemnajst.csv', encoding = 'utf-8-sig', index=False)
dataDevet.to_csv('poslanciDvaDevetnajst.csv', encoding = 'utf-8-sig', index=False)
dataDvajset.to_csv('poslanciDvaDvajset.csv', encoding = 'utf-8-sig', index=False)
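
The five cells above do the same thing for each year: collect the metadata columns and drop duplicate rows. The sketch below shows that logic once, in plain Python, on made-up records; `unique_speakers` is a hypothetical helper.

```python
def unique_speakers(records, keys=('ime', 'spol', 'stranka', 'starost')):
    """Return one dict per distinct combination of the given keys, in first-seen order."""
    seen = set()
    unique = []
    for record in records:
        signature = tuple(record[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(dict(zip(keys, signature)))
    return unique

# Two made-up speeches by the same (hypothetical) speaker and one by another.
toy = [
    {'text': 'a', 'ime': 'Novak, Ana', 'spol': 'F', 'stranka': 'X', 'starost': '50-65'},
    {'text': 'b', 'ime': 'Novak, Ana', 'spol': 'F', 'stranka': 'X', 'starost': '50-65'},
    {'text': 'c', 'ime': 'Kranjc, Jan', 'spol': 'M', 'stranka': 'Y', 'starost': '30-49'},
]
print(unique_speakers(toy))
```

In pandas the same result should be obtainable without a loop, for example with `sest[['ime', 'spol', 'stranka', 'starost']].drop_duplicates()`.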

In [25]:
januarDvajset = dva[dva['mesec']=='01']
marecDvajset = dva[dva['mesec']=='03']
aprilDvajset = dva[dva['mesec']=='04']

In [26]:
majDvajset = dva[dva['mesec']=='05']
junijDvajset = dva[dva['mesec']=='06']

In [40]:
januarDvajset.shape

(758, 6)

In [60]:
majOsem = osem[osem['mesec']=='05']
junijOsem = osem[osem['mesec']=='06']
julijOsem = osem[osem['mesec']=='07']

In [45]:
sest['mesec'].value_counts()

03    2176
11    2061
06    1532
12    1452
05    1445
09    1290
07    1120
04    1061
10    1049
01     794
02     744
Name: mesec, dtype: int64

In [54]:
sest.shape    

(14724, 6)

We define a function that goes through one year at a time and classifies the sentiment of the speeches in batches of 300 rows. The batch size is limited because the model can only process a certain amount of data at once. We run the function for all 5 years.

In [4]:
def getSentiment(leto):
    # Classify the speeches in batches of 300 rows and store the labels in leto['sent'].
    texts = list(leto['text'])
    results = []

    for start in range(0, len(leto), 300):
        # Print the batch boundaries as a simple progress indicator.
        if start+300 < len(leto):
            print(start, start+300)
        else:
            print(start)
        examples = texts[start:start+300]

        # Each speech is truncated to its first 100 tokens.
        inputs = tokenizer(examples, padding='longest', return_tensors="pt", max_length=100, truncation=True)
        outputs = model(**inputs)
        logits = outputs[0].detach().numpy()

        # The predicted label is the class with the highest score.
        for j in range(len(examples)):
            results.append(id2label[np.argmax(logits[j])])

    leto['sent'] = results
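
The batching in `getSentiment` comes down to generating consecutive index ranges. A minimal sketch of that part on its own, using the row count of the 2016 table from above (`batch_bounds` is a hypothetical name):

```python
def batch_bounds(n_rows, batch_size=300):
    """Yield (start, end) pairs that cover range(n_rows) in consecutive batches."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

bounds = list(batch_bounds(14724))
print(len(bounds), bounds[0], bounds[-1])
```

For inference only, wrapping the model call in `torch.no_grad()` usually lowers memory use, which might allow larger batches; this is a general suggestion and was not tested on this notebook.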

In [70]:
sest.shape

(14724, 6)

In [74]:
getSentiment(sest)

0 300
300 600
600 900
900 1200
1200 1500
1500 1800
1800 2100
2100 2400
2400 2700
2700 3000
3000 3300
3300 3600
3600 3900
3900 4200
4200 4500
4500 4800
4800 5100
5100 5400
5400 5700
5700 6000
6000 6300
6300 6600
6600 6900
6900 7200
7200 7500
7500 7800
7800 8100
8100 8400
8400 8700
8700 9000
9000 9300
9300 9600
9600 9900
9900 10200
10200 10500
10500 10800
10800 11100
11100 11400
11400 11700
11700 12000
12000 12300
12300 12600
12600 12900
12900 13200
13200 13500
13500 13800
13800 14100
14100 14400
14400 14700
14700


In [75]:
sest['sent'].value_counts()

neutral     14667
negative       30
positive       27
Name: sent, dtype: int64

In [77]:
sedem.shape

(13571, 6)

In [78]:
getSentiment(sedem)
sedem['sent'].value_counts()

0 300
300 600
600 900
900 1200
1200 1500
1500 1800
1800 2100
2100 2400
2400 2700
2700 3000
3000 3300
3300 3600
3600 3900
3900 4200
4200 4500
4500 4800
4800 5100
5100 5400
5400 5700
5700 6000
6000 6300
6300 6600
6600 6900
6900 7200
7200 7500
7500 7800
7800 8100
8100 8400
8400 8700
8700 9000
9000 9300
9300 9600
9600 9900
9900 10200
10200 10500
10500 10800
10800 11100
11100 11400
11400 11700
11700 12000
12000 12300
12300 12600
12600 12900
12900 13200
13200 13500
13500


neutral     13502
positive       45
negative       24
Name: sent, dtype: int64

In [79]:
getSentiment(osem)
osem['sent'].value_counts()

0 300
300 600
600 900
900 1200
1200 1500
1500 1800
1800 2100
2100 2400
2400 2700
2700 3000
3000 3300
3300 3600
3600 3900
3900 4200
4200 4500
4500 4800
4800 5100
5100 5400
5400 5700
5700 6000
6000 6300
6300 6600
6600 6900
6900 7200
7200 7500
7500 7800
7800 8100
8100 8400
8400


neutral     8447
positive      36
negative      32
Name: sent, dtype: int64

In [80]:
getSentiment(devet)
devet['sent'].value_counts()

0 300
300 600
600 900
900 1200
1200 1500
1500 1800
1800 2100
2100 2400
2400 2700
2700 3000
3000 3300
3300 3600
3600 3900
3900 4200
4200 4500
4500 4800
4800 5100
5100 5400
5400 5700
5700 6000
6000 6300
6300 6600
6600 6900
6900 7200
7200 7500
7500 7800
7800 8100
8100 8400
8400 8700
8700 9000
9000 9300
9300 9600
9600 9900
9900 10200
10200 10500
10500 10800
10800 11100
11100 11400
11400 11700
11700


neutral     11810
negative       35
positive       32
Name: sent, dtype: int64

In [81]:
getSentiment(dva)
dva['sent'].value_counts()

0 300
300 600
600 900
900 1200
1200 1500
1500 1800
1800 2100
2100 2400
2400 2700
2700 3000
3000 3300
3300 3600
3600 3900
3900 4200
4200 4500
4500 4800
4800 5100
5100 5400
5400 5700
5700 6000
6000 6300
6300 6600
6600


neutral     6547
negative      29
positive      25
Name: sent, dtype: int64
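
To compare the years, the counts printed above can be converted into shares. The sketch below only redoes the arithmetic on those counts (`sest` to `dva` correspond to 2016 to 2020).

```python
# Counts copied from the value_counts() outputs above.
counts_by_year = {
    2016: {"neutral": 14667, "negative": 30, "positive": 27},
    2017: {"neutral": 13502, "negative": 24, "positive": 45},
    2018: {"neutral": 8447, "negative": 32, "positive": 36},
    2019: {"neutral": 11810, "negative": 35, "positive": 32},
    2020: {"neutral": 6547, "negative": 29, "positive": 25},
}

neutral_share = {}
for year, counts in counts_by_year.items():
    total = sum(counts.values())
    neutral_share[year] = counts["neutral"] / total
    print(year, total, f"{neutral_share[year]:.1%}")
```

In every year more than 99% of the speeches are labelled neutral, compared with roughly 58% neutral labels in SentiNews, so the non-neutral counts are too small to support strong conclusions about trends.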

We save the results to csv files.

In [82]:
sest.to_csv('dvaSestnajst-crosloengualBERT.csv', encoding = 'utf-8-sig', index=False)

In [83]:
sedem.to_csv('dvaSedemnajst-crosloengualBERT.csv', encoding = 'utf-8-sig', index=False)

In [84]:
osem.to_csv('dvaOsemnajst-crosloengualBERT.csv', encoding = 'utf-8-sig', index=False)

In [85]:
devet.to_csv('dvaDevetnajst-crosloengualBERT.csv', encoding = 'utf-8-sig', index=False)

In [86]:
dva.to_csv('dvaDvajset-crosloengualBERT.csv', encoding = 'utf-8-sig', index=False)

We also test the model on randomly selected parliamentary speeches whose sentiment labels were assigned manually.

In [14]:
dataset2 = load_dataset(
    'csv',
    data_files={
        'test': 'dataParlamentSheadless.csv'
    },
    column_names = ['sentence', 'label']
)


In [15]:
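# Encode the manually labelled parliamentary speeches from dataset2. The outputs recorded
# below come from a run that mapped `dataset`, so they repeat the SentiNews test figures.
encoded_dataset2 = dataset2.map(preprocess, batched=True, load_from_cache_file=False)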


In [26]:
predictions = trainer.predict(encoded_dataset2["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence. If sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1000
  Batch size = 16


In [27]:
preds = np.argmax(predictions.predictions, axis=-1)

In [28]:
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.739}