# Environment setup

In [1]:
!ls

sample_data


## Installing the required libraries

### Simple Transformers

In [2]:
! pip install simpletransformers



## Downloading the model and datasets

### Roberta-PL model

In [3]:
! wget https://github.com/sdadas/polish-roberta/releases/download/models-transformers-v2.9.0/roberta_base_transformers.zip

--2020-10-18 09:29:44--  https://github.com/sdadas/polish-roberta/releases/download/models-transformers-v2.9.0/roberta_base_transformers.zip
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/247501435/a3767200-95fb-11ea-9f18-7d025e942860?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20201018%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201018T092945Z&X-Amz-Expires=300&X-Amz-Signature=a44c630570472adb291ecdd315be12459660b4310609203d951f321b71d2a01a&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=247501435&response-content-disposition=attachment%3B%20filename%3Droberta_base_transformers.zip&response-content-type=application%2Foctet-stream [following]
--2020-10-18 09:29:45--  https://github-production-release-asset-2e65be.s3.amazonaws.com/247501435/a3767200-95fb-

In [6]:
! unzip roberta_base_transformers.zip

Archive:  roberta_base_transformers.zip
replace config.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


### Downloading the classification datasets

We try to distinguish comedies from thrillers based on a short description.

In [7]:
import gdown

urls = ['https://drive.google.com/uc?id=1a0FiWf_LoQhjjRORKoj9MZi4ghTnZHK0', 'http://2019.poleval.pl/task6/task_6-1.zip']
outputs = ['selected_films.csv', 'poleval.zip']
for url, output in zip(urls,outputs):
  gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1a0FiWf_LoQhjjRORKoj9MZi4ghTnZHK0
To: /content/selected_films.csv
100%|██████████| 442k/442k [00:00<00:00, 70.0MB/s]
Downloading...
From: http://2019.poleval.pl/task6/task_6-1.zip
To: /content/poleval.zip
100%|██████████| 340k/340k [00:00<00:00, 511kB/s]


In [8]:
! unzip poleval.zip

Archive:  poleval.zip
  inflating: training_set_clean_only_text.txt  
  inflating: training_set_clean_only_tags.txt  


A quick look at the data format

In [9]:
! head selected_films.csv

title,year,label,description
Pracownik miesiąca,1997,komedia,"Zack, leniwy pracownik supermarketu, zakochuje się w koleżance z pracy, Amy. Chcąc zdobyć jej uznanie, staje do walki o tytuł ""Pracownika miesiąca""."
Zero Dark Thirty,2019,thriller,"Film opowiada o polowaniu na najsłynniejszego terrorystę w historii, Osamę bin Ladena, z perspektywy młodej agentki CIA."
Prima aprilis,1986,thriller,Podczas podróży jeden ze studentów ulega nieszczęśliwemu wypadkowi. Niedługo po tym zdarzeniu zaczynają ginąć kolejni.
Wasabi - Hubert zawodowiec,2001,komedia,"Paryski policjant, Hubert Fiorentini, przylatuje do Tokio, by wziąć udział w pogrzebie dawnej narzeczonej. Na miejscu dowiaduje się, że ma nastoletnią córkę, którą ściga japońska mafia."
Child 44,1987,thriller,"Związek Radziecki, rządy Stalina. Okryty niesławą oficer służb bezpieczeństwa rozpoczyna śledztwo w sprawie serii tajemniczych morderstw dzieci."
"Jak za dawnych, dobrych czasów",1980,komedia,"Nicholas, zostając zmuszony do napadu na

In [10]:
! head training_set_clean_only_text.txt  

Dla mnie faworytem do tytułu będzie Cracovia. Zobaczymy, czy typ się sprawdzi.
@anonymized_account @anonymized_account Brawo ty Daria kibic ma być na dobre i złe
@anonymized_account @anonymized_account Super, polski premier składa kwiaty na grobach kolaborantów. Ale doczekaliśmy czasów.
@anonymized_account @anonymized_account Musi. Innej drogi nie mamy.
Odrzut natychmiastowy, kwaśna mina, mam problem
Jaki on był fajny xdd pamiętam, że spóźniłam się na jego pierwsze zajęcia i to sporo i za karę kazał mi usiąść w pierwszej ławce XD
@anonymized_account No nie ma u nas szczęścia 😉
@anonymized_account Dawno kogoś tak wrednego nie widziałam xd
@anonymized_account @anonymized_account Zaległości były, ale ważne czy były wezwania do zapłaty z których się klub nie wywiązał.
@anonymized_account @anonymized_account @anonymized_account Gdzie jest @anonymized_account . Brudziński jesteś kłamcą i marnym kutasem @anonymized_account


In [2]:
! head training_set_clean_only_tags.txt  

0
0
0
0
0
0
0
0
0
1


# Using a pre-trained language model

## Verifying the Roberta-PL model


In [3]:
import torch
from tokenizers import SentencePieceBPETokenizer
from tokenizers.processors import RobertaProcessing
from transformers import RobertaModel, AutoModel

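model_dir = "."  # the archive was unzipped into the current directory
tokenizer = SentencePieceBPETokenizer(f"{model_dir}/vocab.json", f"{model_dir}/merges.txt")
# Frame every encoded sequence as <s> ... </s> (ids 0 and 2).
getattr(tokenizer, "_tokenizer").post_processor = RobertaProcessing(sep=("</s>", 2), cls=("<s>", 0))
model: RobertaModel = AutoModel.from_pretrained(model_dir)

text = tokenizer.encode("Zażółcić gęślą jaźń.")
output = model(torch.tensor([text.ids]))[0]  # hidden states for every token of the single input
print(output[0][1])  # vector of the first token after <s>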



tensor([-2.6855e-01,  3.5748e-01, -2.8293e-02, -2.2627e-01,  4.0249e-02,
         1.2084e-01, -3.3320e-02,  1.8996e-01,  3.7394e-01, -8.1368e-02,
        -4.1473e-01,  4.1180e-01,  5.7224e-02,  1.6965e-01, -4.1380e-01,
        -2.1688e-01,  2.8083e-01, -7.0458e-02, -8.7586e-03, -1.3442e-01,
        -6.1923e-02,  1.9540e-01, -2.0082e-01, -1.4715e-01,  2.0480e-01,
        -2.4495e-01, -5.6091e-02,  8.1546e-02,  7.1805e-03, -1.6336e-01,
        -7.7267e-02,  1.7735e-02,  4.2532e-01, -5.5318e-01, -1.3614e-01,
         1.4777e-01,  4.0715e-01, -1.3531e-01, -2.2392e-01, -1.8556e-01,
         2.2421e-01, -3.1906e-01,  1.7641e-01, -4.0308e-01,  3.2301e-01,
        -2.8429e-01, -3.4698e-01, -1.4387e-01, -3.3632e-02, -1.0512e-01,
         1.1643e-01,  2.6407e-01,  1.1686e-03,  3.3845e-02,  1.1994e-01,
        -2.2901e-01, -1.6247e-01,  3.6817e-02, -5.2713e-02,  4.1187e-02,
         1.4172e-01, -1.0289e+00, -3.1740e-01,  6.1496e-02, -6.9546e-02,
         4.5484e-01, -2.5655e-01, -1.6461e-01, -8.1
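
The post-processor configured above frames every encoded sequence with the special tokens `<s>` (id 0) and `</s>` (id 2), so `output[0][1]` is the vector of the first subword of the sentence rather than of `<s>`. The sketch below mimics only that framing step in plain Python. `wrap_with_special_tokens` and the subword ids are hypothetical and are not part of the `tokenizers` API.

```python
from typing import List

CLS_ID = 0  # id of "<s>", taken from the RobertaProcessing call above
SEP_ID = 2  # id of "</s>", taken from the RobertaProcessing call above


def wrap_with_special_tokens(token_ids: List[int]) -> List[int]:
    """Return the ids framed as <s> ... </s> (hypothetical helper)."""
    return [CLS_ID, *token_ids, SEP_ID]


# Made-up subword ids standing in for a tokenized sentence.
subword_ids = [1012, 345, 87]
wrapped = wrap_with_special_tokens(subword_ids)

print(wrapped)     # position 0 holds <s>, the last position holds </s>
print(wrapped[1])  # the first real subword, i.e. the position inspected above
```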

## Training the classification model

### Preparing the data

#### Films


In [4]:
from simpletransformers.classification import ClassificationModel
import pandas as pd
import logging
import sklearn


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

We load the data from the CSV file. Because the Simple Transformers library expects predefined column names (`labels` for the class values and `text` for the text to classify), we rename the CSV columns and map the labels to integers: 0 (komedia, i.e. comedy) and 1 (thriller). The film title is appended to the description, so the classifier sees both.

In [5]:
from sklearn.model_selection import train_test_split

all_data = pd.read_csv("selected_films.csv")
all_data = all_data.rename(columns={'label': 'labels', 'description': 'text'})
all_data['text'] += ' ' + all_data['title']
all_data['labels'] = all_data['labels'].map({'thriller': 1, 'komedia': 0})
print(all_data.columns)
print(all_data['labels'].value_counts())

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


Index(['title', 'year', 'labels', 'text'], dtype='object')
0    1283
1    1273
Name: labels, dtype: int64


We split the dataset into a training part and a test part and check the label distribution in each.

In [6]:
train_df, test_df = train_test_split(all_data, train_size=0.9)
print(train_df.columns)
print(train_df['labels'].value_counts())
print(test_df['labels'].value_counts())

Index(['title', 'year', 'labels', 'text'], dtype='object')
0    1167
1    1133
Name: labels, dtype: int64
1    140
0    116
Name: labels, dtype: int64
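
The call above is made without a fixed seed, so the class counts in the test part change from one execution to the next. A minimal sketch of a seeded, stratified split follows. It is written in plain Python only to show the idea; `stratified_split` is a hypothetical helper and the seed 42 is arbitrary. With scikit-learn the same effect should be available through the `stratify` and `random_state` arguments of `train_test_split`.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], train_size: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Split row indices so each class keeps roughly the same share in both parts."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_size)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Toy labels with the same class counts as the film data (1283 vs 1273).
labels = [0] * 1283 + [1] * 1273
train_idx, test_idx = stratified_split(labels, train_size=0.9, seed=42)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Splitting each class separately keeps the two parts balanced, and reusing the seed returns the same rows every time, which makes later evaluation numbers comparable across runs.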


#### Cyberbullying


The data has to be processed and split in the same way as the film data.

In [7]:
# Read tweets and labels; the two files are aligned line by line.
with open("training_set_clean_only_text.txt", "r", encoding="utf-8") as pol_eval:
    tweets = [tweet.rstrip() for tweet in pol_eval]

with open("training_set_clean_only_tags.txt", "r", encoding="utf-8") as pol_eval_tags:
    labels = [int(label) for label in pol_eval_tags]

# Use the column names expected by Simple Transformers.
pol_eval_df = pd.DataFrame({'labels': labels, 'text': tweets})
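
The two PolEval files are paired purely by line position, so a missing or extra line would silently shift every label. A small sanity check can catch that. The sketch below runs on in-memory stand-ins for the two files, using two tweets from the output shown earlier; `pair_lines` is a hypothetical helper.

```python
import io
from typing import List, TextIO, Tuple


def pair_lines(text_file: TextIO, tag_file: TextIO) -> List[Tuple[str, int]]:
    """Pair each tweet with its label and fail loudly if the files differ in length."""
    tweets = [line.rstrip("\n") for line in text_file]
    labels = [int(line) for line in tag_file]
    if len(tweets) != len(labels):
        raise ValueError(f"{len(tweets)} tweets but {len(labels)} labels")
    return list(zip(tweets, labels))


# In-memory stand-ins for the text file and the tag file.
texts = io.StringIO(
    "Dla mnie faworytem do tytułu będzie Cracovia. Zobaczymy, czy typ się sprawdzi.\n"
    "Odrzut natychmiastowy, kwaśna mina, mam problem\n"
)
tags = io.StringIO("0\n0\n")

pairs = pair_lines(texts, tags)
print(pairs)
```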

In [8]:
train_pol_eval, test_pol_eval = train_test_split(pol_eval_df, train_size=0.9)
print(train_pol_eval.columns)
print(train_pol_eval['labels'].value_counts())
print(test_pol_eval['labels'].value_counts())

Index(['labels', 'text'], dtype='object')
0    8276
1     760
Name: labels, dtype: int64
0    914
1     91
Name: labels, dtype: int64
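
The PolEval split is heavily imbalanced: 8276 training tweets carry label 0 and only 760 carry label 1, roughly 11 to 1. One common response is to weight the loss by inverse class frequency. The sketch below computes such weights with the `n_samples / (n_classes * count)` heuristic, the formula scikit-learn uses for `class_weight='balanced'`. Passing the result to the model, for example through the `weight` argument that Simple Transformers documents for `ClassificationModel`, is an assumption that was not tested in this notebook. `balanced_class_weights` is a hypothetical helper.

```python
from typing import Dict, List


def balanced_class_weights(class_counts: Dict[int, int]) -> List[float]:
    """Inverse-frequency weights: n_samples / (n_classes * count), ordered by label."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return [n_samples / (n_classes * class_counts[label]) for label in sorted(class_counts)]


# Training-set counts printed above.
train_counts = {0: 8276, 1: 760}
weights = balanced_class_weights(train_counts)

print([round(w, 3) for w in weights])               # weight for label 0, then label 1
print(round(train_counts[0] / train_counts[1], 1))  # imbalance ratio
```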


### Running the training

In [None]:
!rm -rf outputs/

In [None]:
ClassificationModel.tokenizer = tokenizer
cls_model_2 = ClassificationModel('roberta', './')
cls_model_2.train_model(train_df, args={"num_train_epochs": 5})

Some weights of the model checkpoint at ./ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier

INFO:simpletransformers.classification.classification_model:   Starting fine-tuning.
INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(1440, 0.24403662240042145)
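
The pair returned by `train_model` appears to hold the total number of optimizer steps and the average training loss. The step count can be checked by hand: 2300 training rows in batches of 8 give 288 steps per epoch, and 5 epochs give 1440. The batch size of 8 is an assumption inferred from these numbers, since it is not set anywhere in this cell. `expected_steps` is a hypothetical helper.

```python
import math


def expected_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps, assuming one step per batch and no gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


# Film run above: 2300 rows, assumed batch size 8, 5 epochs.
print(expected_steps(2300, batch_size=8, epochs=5))  # 1440
```

The same arithmetic is a quick way to confirm that a batch-size or epoch setting actually took effect: if the returned step count does not change, the option was probably ignored.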

### Evaluating the classification results

In [16]:
import sklearn.metrics  # import the submodule explicitly; accuracy_score is used below

In [None]:
result, model_outputs, wrong_predictions = cls_model_2.eval_model(test_df, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


INFO:simpletransformers.classification.classification_model:{'mcc': 0.8517964850293742, 'tp': 120, 'tn': 117, 'fp': 11, 'fn': 8, 'acc': 0.92578125, 'eval_loss': 0.3432560822329833}





In [None]:
print(result)

{'mcc': 0.8517964850293742, 'tp': 120, 'tn': 117, 'fp': 11, 'fn': 8, 'acc': 0.92578125, 'eval_loss': 0.3432560822329833}
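
The reported numbers can be reproduced from the four confusion-matrix counts alone. The sketch below recomputes accuracy and the Matthews correlation coefficient (MCC) with the standard formulas. `accuracy` and `mcc` are hypothetical helpers, not Simple Transformers functions.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of correctly classified examples."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


# Counts taken from the evaluation result above.
counts = {"tp": 120, "tn": 117, "fp": 11, "fn": 8}

print(accuracy(**counts))
print(mcc(**counts))
```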


In [None]:
# Print each misclassified description followed by its true label.
for example in wrong_predictions:
  print(example.text_a)
  print(['komedia', 'thriller'][example.label == 1])

Marge i Dick Nelson zostają uprowadzeni przez cesarza Toda na odległą planetę Spengo, którą zamieszkują sami idioci. Okrutny władca zamierza poślubić kobietę oraz zniszczyć Ziemię. Mama i tata ocalają świat
komedia
Nałogowy hazardzista zdobywa informację o pewniaku na jedną z gonitw. W nadchodzący weekend postanawia wykorzystać nadarzającą się szansę. Niech się dzieje co chce
komedia
Topper wyrusza w niebezpieczną podróż, aby ocalić pułkownika Dentona. Hot Shots 2
komedia
Kat po kolejnym upokorzeniu przez męża postanawia się na nim zemścić. Kaliber 45
thriller
Trzech najlepszych przyjaciół dopada refleksja na temat spraw sercowych. Ten niezręczny moment
komedia
Życie popularnego pisarza zamienia się w koszmar, gdy nawiązuje znajomość ze swoim trzynastoletnim fanem. Nocny słuchacz
thriller
Rok 1944. Rozwiedziona Nita Longley przeprowadza się z dwoma synami do małego miasta w Teksasie, by dzień i noc pracować jako telefonistka. Przybłęda
thriller
Podczas wyprawy do Los Angeles matka z sy

Training the network on the PolEval dataset

In [None]:
!rm -rf outputs/

In [None]:
ClassificationModel.tokenizer = tokenizer
cls_model_3 = ClassificationModel('roberta', './')
cls_model_3.train_model(train_pol_eval, args={"num_train_epochs": 5})

Some weights of the model checkpoint at ./ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier

INFO:simpletransformers.classification.classification_model:   Starting fine-tuning.
INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(5650, 0.29246110232167805)

In [None]:
result, model_outputs, wrong_predictions = cls_model_3.eval_model(test_pol_eval, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
INFO:simpletransformers.classification.classification_model:{'mcc': 0.0, 'tp': 0, 'tn': 904, 'fp': 0, 'fn': 101, 'acc': 0.8995024875621891, 'eval_loss': 0.3370856500807263}





In [None]:
print(result)

{'mcc': 0.0, 'tp': 0, 'tn': 904, 'fp': 0, 'fn': 101, 'acc': 0.8995024875621891, 'eval_loss': 0.3370856500807263}
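
Here the model labels every test tweet as class 0: `tp` and `fp` are both zero. Accuracy still looks high because 904 of the 1005 test tweets belong to class 0, so it equals the share of the majority class, while recall on the cyberbullying class is zero. The MCC denominator is zero as well, which is presumably what triggers the division warning above and the reported value of 0.0. A sketch with hypothetical helper names:

```python
import math


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were found; 0.0 when there are no positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def safe_mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC that returns 0.0 when a row or column of the confusion matrix is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when one class is never predicted
    return (tp * tn - fp * fn) / denominator


# Counts taken from the evaluation result above.
counts = {"tp": 0, "tn": 904, "fp": 0, "fn": 101}
total = sum(counts.values())

acc = (counts["tp"] + counts["tn"]) / total
majority_share = (counts["tn"] + counts["fp"]) / total  # share of class-0 tweets

print(acc, majority_share)                 # identical: the model is a majority-class baseline
print(recall(counts["tp"], counts["fn"]))  # no cyberbullying tweet was detected
print(safe_mcc(**counts))
```

For an imbalanced task like this one, recall or F1 on the minority class and MCC say more about the model than accuracy does.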


In [None]:
# Print each misclassified tweet followed by its true label.
for example in wrong_predictions:
  print(example.text_a)
  print(['no_cyberbullying', 'cyberbullying'][example.label == 1])

@anonymized_account @anonymized_account Dopóki tego nie udowodnisz pajacu jesteś oszczerca i manipulantem.
cyberbullying
@anonymized_account Trochę mózgu, troszeczkę, minimalna ilość szarych komórek
cyberbullying
Codziennie ludzie umieraja, dzieci na raka, w wypadkach, juz po chuj tej żaloby
cyberbullying
@anonymized_account Wygląda jakby karmiła dziecko niepełnosprawne
cyberbullying
@anonymized_account @anonymized_account ssa wszystkimi otworami co tylko da sie wessać, dekompozycja obozu rządzącego, drużyna jest w rozsypce
cyberbullying
Premier @anonymized_account w @anonymized_account PSL i PO wyprzedawali polską ziemię.
cyberbullying
@anonymized_account A ktoś cię jeszcze ogląda pajacu?
cyberbullying
@anonymized_account Tomek ty i analizy? Ty i wnioski? Chłopie nie wiem co ty bierzesz ale zmnieksz dawkę o połowę...
cyberbullying
@anonymized_account @anonymized_account Same k****a przypadki.Jakie to żałosne.
cyberbullying
@anonymized_account Pierdzenie z rana Pani Kamilo, myślałem ze kobiety 

# Classification with simpler methods

In [9]:
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from gensim.models import KeyedVectors
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report

INFO:summarizer.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English


In [10]:
def grid_search(train_x, train_y, test_x, test_y, genres, parameters, pipeline):
    grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=3, verbose=10)
    grid_search_tune.fit(train_x, train_y)

    print()
    print("Best parameters set:")
    print(grid_search_tune.best_estimator_.steps)
    print()

    # measuring performance on test set
    print("Applying best classifier on test data:")
    best_clf = grid_search_tune.best_estimator_
    predictions = best_clf.predict(test_x)

    print(classification_report(test_y, predictions, target_names=genres))

In [11]:
pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=[])),
                ('clf', OneVsRestClassifier(MultinomialNB(
                    fit_prior=True, class_prior=None))),
            ])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3)
}

train_x = [x.strip() for x in train_df['text'].tolist()]
test_x = [x.strip() for x in test_df['text'].tolist()]
train_y = [str(x) for x in train_df['labels'].tolist()]
test_y = [str(x) for x in test_df['labels'].tolist()]
print(len(train_x), len(test_x), len(train_y), len(test_y))
grid_search(train_x, train_y, test_x, test_y, ['0', '1'], parameters, pipeline)

2300 256 2300 256
Fitting 2 folds for each of 18 candidates, totalling 36 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:    1.7s
[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed:    2.3s
[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:    2.9s
[Parallel(n_jobs=3)]: Done  19 tasks      | elapsed:    3.7s
[Parallel(n_jobs=3)]: Done  26 tasks      | elapsed:    4.5s



Best parameters set:
[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.5, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=[], strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)), ('clf', OneVsRestClassifier(estimator=MultinomialNB(alpha=0.01, class_prior=None,
                                            fit_prior=True),
                    n_jobs=None))]

Applying best classifier on test data:
              precision    recall  f1-score   support

           0       0.85      0.90      0.87       116
           1       0.91      0.86      0.89       140

    accuracy                           0.88       256
   macro avg       0.88      0.88   

[Parallel(n_jobs=3)]: Done  36 out of  36 | elapsed:    5.6s finished
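
The log line "Fitting 2 folds for each of 18 candidates, totalling 36 fits" follows directly from the parameter grid: 3 values of `max_df`, 3 n-gram ranges and 2 values of `alpha`, each evaluated with `cv=2`. The enumeration below reproduces that count in plain Python. It mirrors what `GridSearchCV` does conceptually, not its implementation.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    "tfidf__max_df": (0.25, 0.5, 0.75),
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__estimator__alpha": (1e-2, 1e-3),
}
cv_folds = 2

# Every combination of one value per parameter is one candidate.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(candidates))             # number of parameter combinations
print(len(candidates) * cv_folds)  # number of model fits during the search
print(candidates[0])
```

Because the count is a product, adding one more value to any parameter multiplies the number of fits, which is worth keeping in mind before widening the grid.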


Training the naive Bayes classifier on the PolEval dataset

In [12]:
train_pol_eval_x = [x.strip() for x in train_pol_eval['text'].tolist()]
test_pol_eval_x = [x.strip() for x in test_pol_eval['text'].tolist()]
train_pol_eval_y = [str(x) for x in train_pol_eval['labels'].tolist()]
test_pol_eval_y = [str(x) for x in test_pol_eval['labels'].tolist()]
print(len(train_pol_eval_x), len(test_pol_eval_x), len(train_pol_eval_y), len(test_pol_eval_y))
grid_search(train_pol_eval_x, train_pol_eval_y, test_pol_eval_x, test_pol_eval_y, ['0', '1'], parameters, pipeline)

9036 1005 9036 1005
Fitting 2 folds for each of 18 candidates, totalling 36 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:    0.6s
[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed:    2.6s
[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:    4.6s
[Parallel(n_jobs=3)]: Done  19 tasks      | elapsed:    7.0s
[Parallel(n_jobs=3)]: Done  26 tasks      | elapsed:    9.6s



Best parameters set:
[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.5, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=[], strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)), ('clf', OneVsRestClassifier(estimator=MultinomialNB(alpha=0.01, class_prior=None,
                                            fit_prior=True),
                    n_jobs=None))]

Applying best classifier on test data:
              precision    recall  f1-score   support

           0       0.94      0.98      0.96       914
           1       0.65      0.33      0.44        91

    accuracy                           0.92      1005
   macro avg       0.79      0.66   

[Parallel(n_jobs=3)]: Done  36 out of  36 | elapsed:   13.2s finished
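
The `f1-score` column is the harmonic mean of precision and recall. Recomputing it from the rounded values in the report shows how the low recall on class 1 (0.33) pulls its F1 down to 0.44 even though overall accuracy is 0.92. `f1` is a hypothetical helper, and because the inputs are already rounded the results agree with the report only to two decimals.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the report above.
report = {"0": (0.94, 0.98), "1": (0.65, 0.33)}
scores = {label: f1(p, r) for label, (p, r) in report.items()}

print({label: round(score, 2) for label, score in scores.items()})
```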


Batch size: 16

In [21]:
!rm -rf outputs/

In [None]:
ClassificationModel.tokenizer = tokenizer
cls_model_4 = ClassificationModel('roberta', './')
cls_model_4.train_model(train_pol_eval, args={"num_train_epochs": 5, "train_batch_size": 16})

Some weights of the model checkpoint at ./ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier

INFO:simpletransformers.classification.classification_model:   Starting fine-tuning.
INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(2825, 0.29565434765894855)

In [None]:
result, model_outputs, wrong_predictions = cls_model_4.eval_model(test_pol_eval, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
INFO:simpletransformers.classification.classification_model:{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.26408732765250736}





In [None]:
print(result)

{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.26408732765250736}


Batch size: 4

In [None]:
!rm -rf outputs/

In [None]:
ClassificationModel.tokenizer = tokenizer
cls_model_5 = ClassificationModel('roberta', './')
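# The Simple Transformers option is "train_batch_size". The run recorded below used the key
# "batch_size", which apparently had no effect: it reports 5650 steps, the same as the default run.
cls_model_5.train_model(train_pol_eval, args={"num_train_epochs": 5, "train_batch_size": 4})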

Some weights of the model checkpoint at ./ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier

INFO:simpletransformers.classification.classification_model:   Starting fine-tuning.
INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(5650, 0.2981829092256)

In [None]:
result, model_outputs, wrong_predictions = cls_model_5.eval_model(test_pol_eval, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
INFO:simpletransformers.classification.classification_model:{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.26939207099614637}





In [None]:
print(result)

{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.26939207099614637}


Number of epochs: 10

In [None]:
!rm -rf outputs/

In [None]:
ClassificationModel.tokenizer = tokenizer
cls_model_6 = ClassificationModel('roberta', './')
cls_model_6.train_model(train_pol_eval, args={"num_train_epochs": 10})

Some weights of the model checkpoint at ./ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier

INFO:simpletransformers.classification.classification_model:   Starting fine-tuning.
INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(11300, 0.29764161938208)

In [None]:
result, model_outputs, wrong_predictions = cls_model_6.eval_model(test_pol_eval, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
INFO:simpletransformers.classification.classification_model:{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.2743388610108504}





In [None]:
print(result)

{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.2743388610108504}
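
The result above is what a classifier that always predicts the negative class would produce: accuracy equals the share of negatives in the test set, while MCC is 0. The sketch below recomputes both values from the confusion counts in the log. The helper names `accuracy` and `mcc` are illustrative, and returning 0.0 when the denominator is empty is an assumed convention that happens to match the logged value.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of correctly classified examples."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient computed from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0.0 when a whole row or column
        # of the confusion matrix is empty (e.g. no positive predictions).
        return 0.0
    return (tp * tn - fp * fn) / denom


# Counts copied from the evaluation log above.
counts = {"tp": 0, "tn": 927, "fp": 0, "fn": 78}

print("accuracy:", accuracy(**counts))
print("mcc:", mcc(**counts))
# Accuracy of a rule that labels every example as negative.
print("majority baseline:", counts["tn"] / (counts["tn"] + counts["fn"]))
```

Because accuracy coincides with the majority baseline here, it says little about whether the model learned anything; MCC or the `tp` count is the more informative signal for this dataset.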


Number of epochs: 2

In [None]:
!rm -rf outputs/

In [None]:
ClassificationModel.tokenizer = tokenizer
cls_model_7 = ClassificationModel('roberta', './')
cls_model_7.train_model(train_pol_eval, args={"num_train_epochs": 2})

Some weights of the model checkpoint at ./ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=9036.0), HTML(value='')))




HBox(children=(HTML(value='Epoch'), FloatProgress(value=0.0, max=2.0), HTML(value='')))

INFO:simpletransformers.classification.classification_model:   Starting fine-tuning.



INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(2260, 0.3000014986274187)

In [None]:
result, model_outputs, wrong_predictions = cls_model_7.eval_model(test_pol_eval, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=1005.0), HTML(value='')))




HBox(children=(HTML(value='Running Evaluation'), FloatProgress(value=0.0, max=126.0), HTML(value='')))

  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
INFO:simpletransformers.classification.classification_model:{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.27382335669937585}





In [None]:
print(result)

{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.27382335669937585}
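
With `tp` and `fp` both at 0, it is worth confirming directly that every prediction falls into class 0. A minimal sketch, assuming `model_outputs` holds one row of per-class scores for each test example; the scores below are hypothetical stand-ins, and `predicted_label_counts` is an illustrative helper, not part of simpletransformers.

```python
from collections import Counter


def predicted_label_counts(outputs):
    """Count argmax predictions over rows of per-class scores."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in outputs]
    return Counter(preds)


# Hypothetical scores standing in for `model_outputs`.
example_outputs = [[2.1, -1.9], [1.7, -1.2], [0.9, -0.4]]

# Only class 0 appears in the counts when the model has collapsed
# to the majority class.
print(predicted_label_counts(example_outputs))
```

If `model_outputs` is a NumPy array, passing `model_outputs.tolist()` to the helper should work the same way.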


Class weights: 0.1 and 0.9

In [None]:
!rm -rf outputs/

In [None]:
ClassificationModel.tokenizer = tokenizer
# weight lists the per-class loss weights in label order: 0.1 for class 0, 0.9 for class 1
cls_model_8 = ClassificationModel('roberta', './', num_labels=2, weight=[0.1, 0.9])
cls_model_8.train_model(train_pol_eval, args={"num_train_epochs": 5})

Some weights of the model checkpoint at ./ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=9036.0), HTML(value='')))




HBox(children=(HTML(value='Epoch'), FloatProgress(value=0.0, max=5.0), HTML(value='')))

INFO:simpletransformers.classification.classification_model:   Starting fine-tuning.



INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(5650, 0.7830568385928606)

In [None]:
result, model_outputs, wrong_predictions = cls_model_8.eval_model(test_pol_eval, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=1005.0), HTML(value='')))




HBox(children=(HTML(value='Running Evaluation'), FloatProgress(value=0.0, max=126.0), HTML(value='')))

  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
INFO:simpletransformers.classification.classification_model:{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.7421786063128993}





In [None]:
print(result)

{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.7421786063128993}


Class weights: 0.9 and 0.1

In [None]:
!rm -rf outputs/

In [None]:
ClassificationModel.tokenizer = tokenizer
# Reversed class weights: 0.9 for class 0, 0.1 for class 1
cls_model_9 = ClassificationModel('roberta', './', num_labels=2, weight=[0.9, 0.1])
cls_model_9.train_model(train_pol_eval, args={"num_train_epochs": 5})

Some weights of the model checkpoint at ./ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=9036.0), HTML(value='')))


HBox(children=(HTML(value='Epoch'), FloatProgress(value=0.0, max=5.0), HTML(value='')))

INFO:simpletransformers.classification.classification_model:   Starting fine-tuning.



INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(5650, 0.06694079809822143)

In [None]:
result, model_outputs, wrong_predictions = cls_model_9.eval_model(test_pol_eval, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=1005.0), HTML(value='')))




HBox(children=(HTML(value='Running Evaluation'), FloatProgress(value=0.0, max=126.0), HTML(value='')))

  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
INFO:simpletransformers.classification.classification_model:{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.058149016543572386}





In [None]:
print(result)

{'mcc': 0.0, 'tp': 0, 'tn': 927, 'fp': 0, 'fn': 78, 'acc': 0.9223880597014925, 'eval_loss': 0.058149016543572386}
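
Both weighted runs still predict only the negative class, yet their `eval_loss` values differ by more than an order of magnitude (about 0.742 versus 0.058). One plausible explanation, assuming the `weight` list is applied as per-class weights in a weighted-mean cross-entropy (the behaviour of PyTorch's `CrossEntropyLoss`), is that the same mistakes on positive examples are simply counted more or less heavily. The sketch below uses the test-set class counts from the log (927 negatives, 78 positives) together with hypothetical logits for a model that is always confident in class 0; the function is a plain-Python illustration, not the library's code.

```python
import math


def weighted_cross_entropy(logits, labels, weights):
    """Weighted-mean cross-entropy: sum(w_y * nll) / sum(w_y)."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        # Numerically stable log-sum-exp of the row.
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]
        total += weights[y] * nll
        norm += weights[y]
    return total / norm


# Class counts taken from the evaluation log; the logits are hypothetical.
labels = [0] * 927 + [1] * 78
logits = [[2.0, -2.0] for _ in labels]  # always confident in class 0

loss_minority_up = weighted_cross_entropy(logits, labels, [0.1, 0.9])
loss_majority_up = weighted_cross_entropy(logits, labels, [0.9, 0.1])

print("weights [0.1, 0.9]:", round(loss_minority_up, 3))
print("weights [0.9, 0.1]:", round(loss_majority_up, 3))
```

Under these assumptions the loss is much higher when the minority class carries the larger weight, even though the predictions are identical, so `eval_loss` values from runs with different weights are not directly comparable.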


A combination of different parameter values.

Class weights: 0.5 and 0.5

Number of epochs: 6

Batch size: 128
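
Changing the batch size changes how many optimizer steps make up one epoch. A rough check, assuming 9036 training examples (the size shown by the feature-conversion progress bar), no gradient accumulation, and a batch size of 8 for the earlier runs (an assumption that fits the step totals printed after training):

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


NUM_TRAIN_EXAMPLES = 9036  # taken from the training logs in this notebook

# (batch size, epochs) pairs; 8 is an assumed value for the earlier runs.
for batch_size, epochs in [(8, 10), (128, 6), (64, 10)]:
    per_epoch = steps_per_epoch(NUM_TRAIN_EXAMPLES, batch_size)
    print(f"batch={batch_size:>3} epochs={epochs:>2} "
          f"steps/epoch={per_epoch} total={per_epoch * epochs}")
```

The totals can be compared with the first element of the tuple printed after each training run, which appears to be the global step count. Fewer, larger steps also mean fewer weight updates per epoch, which is one reason results can shift when only the batch size changes.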

In [None]:
!rm -rf outputs/

In [30]:
ClassificationModel.tokenizer = tokenizer
# Equal class weights; the batch size is raised to 128 and training runs for 6 epochs
cls_model_10 = ClassificationModel('roberta', './', num_labels=2, weight=[0.5, 0.5])
cls_model_10.train_model(train_pol_eval, args={"num_train_epochs": 6, "train_batch_size": 128})

Some weights of the model checkpoint at ./ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=9036.0), HTML(value='')))




HBox(children=(HTML(value='Epoch'), FloatProgress(value=0.0, max=6.0), HTML(value='')))

INFO:simpletransformers.classification.classification_model:   Starting fine-tuning.



INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(426, 0.2131048981974802)

In [31]:
result, model_outputs, wrong_predictions = cls_model_10.eval_model(test_pol_eval, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=1005.0), HTML(value='')))


HBox(children=(HTML(value='Running Evaluation'), FloatProgress(value=0.0, max=126.0), HTML(value='')))

INFO:simpletransformers.classification.classification_model:{'mcc': 0.49572522983446343, 'tp': 39, 'tn': 890, 'fp': 15, 'fn': 61, 'acc': 0.9243781094527364, 'eval_loss': 0.2523548046296965}





In [32]:
print(result)

{'mcc': 0.49572522983446343, 'tp': 39, 'tn': 890, 'fp': 15, 'fn': 61, 'acc': 0.9243781094527364, 'eval_loss': 0.2523548046296965}


Class weights: 1 and 2

Number of epochs: 10

Batch size: 64

In [13]:
!rm -rf outputs/

In [14]:
ClassificationModel.tokenizer = tokenizer
# Class 1 counts twice as much as class 0 in the loss
cls_model_11 = ClassificationModel('roberta', './', num_labels=2, weight=[1, 2])
cls_model_11.train_model(train_pol_eval, args={"num_train_epochs": 10, "train_batch_size": 64})

Some weights of the model checkpoint at ./ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=9036.0), HTML(value='')))




HBox(children=(HTML(value='Epoch'), FloatProgress(value=0.0, max=10.0), HTML(value='')))

INFO:simpletransformers.classification.classification_model:   Starting fine-tuning.



INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(1420, 0.18758027542884212)

In [16]:
result, model_outputs, wrong_predictions = cls_model_11.eval_model(test_pol_eval, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_model: Converting to features started. Cache is not used.


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=1005.0), HTML(value='')))




HBox(children=(HTML(value='Running Evaluation'), FloatProgress(value=0.0, max=126.0), HTML(value='')))

INFO:simpletransformers.classification.classification_model:{'mcc': 0.54509398624971, 'tp': 46, 'tn': 891, 'fp': 23, 'fn': 45, 'acc': 0.9323383084577115, 'eval_loss': 0.47893669660247506}





In [17]:
print(result)

{'mcc': 0.54509398624971, 'tp': 46, 'tn': 891, 'fp': 23, 'fn': 45, 'acc': 0.9323383084577115, 'eval_loss': 0.47893669660247506}
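
Accuracy barely moves between the runs (roughly 0.92 to 0.93), so the positive-class metrics are a better way to compare them. The sketch below derives precision, recall, and F1 from the logged confusion counts of `cls_model_10` and `cls_model_11`; the helper name `precision_recall_f1` is illustrative, and returning 0.0 for undefined ratios is an assumed convention.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Positive-class precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the evaluation logs in this notebook.
runs = {
    "cls_model_10": {"tp": 39, "fp": 15, "fn": 61},
    "cls_model_11": {"tp": 46, "fp": 23, "fn": 45},
}

for name, c in runs.items():
    p, r, f1 = precision_recall_f1(**c)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

On these counts `cls_model_11` trades some precision for higher recall and ends up with the higher F1, which agrees with its higher MCC.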
