In [10]:
import dacite

from pipeline import Pipeline
from pipeline_config import PipelineConfig
from model_training.config import TrainConfig

In [6]:
pipeline_config = {
    "text_cleaners": ["UsernameRemover"],
    # List of text cleaning classes - see pipeline_steps/text_cleaning.py
    "max_seq_len": 64,
}
pipeline_config = dacite.from_dict(PipelineConfig, pipeline_config)
# There is not much to configure in the pipeline architecture:
# the BERT configuration for Polbert is already pre-defined.

In [11]:
train_config = {
    "batch_size": 32,
    "num_epochs": 10,
    "output_model_name": "bert_for_hatespeech",
    # The model is saved in repo_root/trained_models/output_model_name
    "main_metric": "f1-score",
    # Metric used to decide when to save the best model
    "freeze_embeddings": True,
    # Whether to freeze the embedding module during training - the dataset is small
    # and training with this module unfrozen is very prone to overfitting
    "class_balanced_sampling": True,
    # The cyberbullying and hate-speech classes have few samples in the data;
    # with this sampler, training converges faster
    "optimization_schedule": {
        "init_lr": 1e-04,
        # Initial value of the learning rate
        "weight_decay": 1e-03,
        # Value of the L2 regularization coefficient
        "num_warmup_steps": 100,
        # Number of warmup steps (during which the learning rate increases from 0 to init_lr)
        "optimizer_name": "adamw",
        # Name of the optimizer to use (adam, adamw and sgd are supported)
    },
}
train_config = dacite.from_dict(TrainConfig, train_config)
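The warmup described in the comments above can be sketched in a few lines. This is only an illustration of a linear ramp from 0 to `init_lr`; the function name `warmup_lr` is hypothetical, and keeping the rate constant after warmup is an assumption, since the notebook does not say what the repository's schedule does once warmup ends.

```python
def warmup_lr(step: int, init_lr: float = 1e-4, num_warmup_steps: int = 100) -> float:
    """Learning rate at a given 0-based optimizer step (illustrative sketch)."""
    if num_warmup_steps <= 0 or step >= num_warmup_steps:
        # Assumption: constant rate after warmup; the real schedule may decay.
        return init_lr
    # Linear ramp from 0 at step 0 up to init_lr at step num_warmup_steps.
    return init_lr * step / num_warmup_steps


for step in (0, 25, 50, 100, 500):
    print(step, warmup_lr(step))
```

With 251 batches per epoch, 100 warmup steps would cover a bit less than half of the first epoch.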

In [12]:
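# `!export` runs in a temporary subshell and does not affect the kernel;
# %env sets the variable for the notebook process itself.
%env CUDA_VISIBLE_DEVICES=0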

The training data from http://2019.poleval.pl/index.php/tasks/task6 has been split in a 0.8 / 0.2 proportion into training and validation sets.

We assume that there are no leaks in the data, so the split can be done in a purely random way.

Even if there were some leaks, no metadata was provided for the utterances, so it would be hard to detect them and split around them.
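A purely random 0.8 / 0.2 split of paired utterances and labels might look like the sketch below. The function name `random_split`, the fixed seed, and the toy data are illustrative assumptions, not the code that produced the splits in the repository.

```python
import random


def random_split(texts, tags, valid_fraction=0.2, seed=0):
    """Shuffle indices once and cut them into train and validation parts."""
    if len(texts) != len(tags):
        raise ValueError("texts and tags must have the same length")
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_valid = round(len(indices) * valid_fraction)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    train = ([texts[i] for i in train_idx], [tags[i] for i in train_idx])
    valid = ([texts[i] for i in valid_idx], [tags[i] for i in valid_idx])
    return train, valid


# Toy data: ten utterances with labels 0, 1, 2.
texts = [f"utterance {i}" for i in range(10)]
tags = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
(train_texts, train_tags), (valid_texts, valid_tags) = random_split(texts, tags)
print(len(train_texts), len(valid_texts))
```

Because the split ignores the labels, the rare classes can end up unevenly distributed between the two parts; a stratified split would be an alternative, but it is not what is described here.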

The test data is the official test data for the task.

The data (all splits) is kept within the repository in the datafiles directory. Each split has two files: {split_name}_texts.txt, which contains the utterances, and {split_name}_tags.txt, which contains the corresponding labels.
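Reading one split from this layout could be sketched as follows. The helper names `read_lines` and `load_split` are hypothetical, and the sketch assumes one utterance per line in the texts file and one integer label per line in the tags file, which the notebook does not state explicitly.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_split(data_dir, split_name):
    """Load the utterances and integer labels of one split."""
    texts = read_lines(os.path.join(data_dir, f"{split_name}_texts.txt"))
    # Assumption: each tags line holds a single integer class id.
    tags = [int(line) for line in read_lines(os.path.join(data_dir, f"{split_name}_tags.txt"))]
    if len(texts) != len(tags):
        raise ValueError(f"{split_name}: {len(texts)} texts but {len(tags)} tags")
    return texts, tags


# Usage on a throwaway directory with a tiny made-up split.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "demo_texts.txt"), "w", encoding="utf-8") as f:
        f.write("first utterance\nsecond utterance\n")
    with open(os.path.join(tmp_dir, "demo_tags.txt"), "w", encoding="utf-8") as f:
        f.write("0\n2\n")
    demo_texts, demo_tags = load_split(tmp_dir, "demo")
    print(list(zip(demo_texts, demo_tags)))
```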

After each epoch, validation is run on the valid dataset. When the macro-averaged F1 score is higher than the previous best (initialized to 0), the model is saved and also evaluated on the test and training sets.
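The selection rule can be sketched independently of the training code. The function `select_checkpoints` is a hypothetical stand-in for the real loop; the scores fed to it are the rounded validation macro-averaged F1 values printed in the log of this notebook.

```python
def select_checkpoints(valid_scores):
    """Return the epochs at which a checkpoint would be saved, and the best score."""
    best_score = 0.0  # previous best starts at 0
    saved_epochs = []
    for epoch, score in enumerate(valid_scores):
        if score > best_score:  # strict improvement only
            best_score = score
            # The real loop would save the model here and then
            # evaluate it on the test and training sets.
            saved_epochs.append(epoch)
    return saved_epochs, best_score


# Validation macro avg f1 for epochs 0..9, as printed in the training log.
valid_macro_f1 = [0.51, 0.56, 0.58, 0.59, 0.54, 0.58, 0.57, 0.58, 0.56, 0.56]
saved_epochs, best_score = select_checkpoints(valid_macro_f1)
print(saved_epochs, best_score)
```

Replaying these scores gives checkpoints at epochs 0 to 3 only, which is consistent with the log: test and train evaluations appear only for those epochs.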

For each split, per-class precision, recall and F1 are calculated, together with micro- and macro-averaged F1 (as in the official evaluation: http://2019.poleval.pl/index.php/results/).
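To make the two averages concrete, here is a minimal pure-Python sketch of per-class, macro-averaged and micro-averaged F1. The function name `f1_scores` and the toy labels are illustrative; this is not the repository's evaluation code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per-class F1 dict, macro-averaged F1, micro-averaged F1)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[true] += 1  # true class gains a false negative

    def f1(n_tp, n_fp, n_fn):
        denom = 2 * n_tp + n_fp + n_fn
        return 2 * n_tp / denom if denom else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    per_class = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(per_class.values()) / len(per_class)
    # Micro: F1 over pooled counts, dominated by the frequent class.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro


per_class, macro, micro = f1_scores([0, 0, 0, 1, 2, 2], [0, 0, 1, 1, 2, 0])
print(per_class, macro, micro)
```

For single-label classification, micro-averaged F1 equals accuracy, which is why the `micro f1` lines in the log below match the `accuracy` rows. Macro averaging is the stricter view here because classes 1 and 2 have few samples.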

In [13]:
pipeline = Pipeline(pipeline_config=pipeline_config)
pipeline.train(train_config=train_config)

Some weights of the model checkpoint at dkleczek/bert-base-polish-cased-v1 were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


sampler class weights = {0: 1.0, 2: 1.4443161466911736, 1: 1.686544469631841}


0 epoch training in progress: 251it [01:21,  3.07it/s]
Model evaluation on valid: 100%|██████████| 64/64 [00:06<00:00,  9.43it/s]


validation results fpr valid


              precision    recall  f1-score   support

           0       0.94      0.98      0.96      1846
           1       0.24      0.16      0.19        58
           2       0.65      0.27      0.39       124

    accuracy                           0.92      2028
   macro avg       0.61      0.47      0.51      2028
weighted avg       0.90      0.92      0.90      2028

 micro f1: 0.9176528599605522


Model evaluation on test: 100%|██████████| 32/32 [00:03<00:00,  9.32it/s]


validation results fpr test


              precision    recall  f1-score   support

           0       0.89      0.99      0.94       866
           1       0.22      0.16      0.19        25
           2       0.74      0.13      0.22       109

    accuracy                           0.88      1000
   macro avg       0.62      0.43      0.45      1000
weighted avg       0.86      0.88      0.84      1000

 micro f1: 0.878


Model evaluation on train: 100%|██████████| 251/251 [00:26<00:00,  9.44it/s]

validation results fpr train


              precision    recall  f1-score   support

           0       0.94      0.99      0.97      7043
           1       0.59      0.49      0.53       317
           2       0.83      0.39      0.53       653

    accuracy                           0.92      8013
   macro avg       0.79      0.62      0.68      8013
weighted avg       0.92      0.92      0.91      8013

 micro f1: 0.9239985024335455



1 epoch training in progress: 251it [01:23,  3.00it/s]
Model evaluation on valid: 100%|██████████| 64/64 [00:06<00:00,  9.25it/s]


validation results fpr valid


              precision    recall  f1-score   support

           0       0.95      0.97      0.96      1846
           1       0.40      0.17      0.24        58
           2       0.51      0.44      0.48       124

    accuracy                           0.92      2028
   macro avg       0.62      0.53      0.56      2028
weighted avg       0.91      0.92      0.91      2028

 micro f1: 0.9181459566074951


Model evaluation on test: 100%|██████████| 32/32 [00:03<00:00,  9.15it/s]


validation results fpr test


              precision    recall  f1-score   support

           0       0.91      0.98      0.94       866
           1       0.19      0.12      0.15        25
           2       0.55      0.26      0.35       109

    accuracy                           0.88      1000
   macro avg       0.55      0.45      0.48      1000
weighted avg       0.85      0.88      0.86      1000

 micro f1: 0.88


Model evaluation on train: 100%|██████████| 251/251 [00:27<00:00,  9.29it/s]

validation results fpr train


              precision    recall  f1-score   support

           0       0.98      0.99      0.99      7014
           1       0.88      0.81      0.84       327
           2       0.88      0.82      0.85       672

    accuracy                           0.97      8013
   macro avg       0.91      0.88      0.89      8013
weighted avg       0.97      0.97      0.97      8013

 micro f1: 0.9696742792961438



2 epoch training in progress: 251it [01:25,  2.94it/s]
Model evaluation on valid: 100%|██████████| 64/64 [00:07<00:00,  9.12it/s]


validation results fpr valid


              precision    recall  f1-score   support

           0       0.95      0.97      0.96      1846
           1       0.24      0.34      0.28        58
           2       0.76      0.36      0.49       124

    accuracy                           0.92      2028
   macro avg       0.65      0.56      0.58      2028
weighted avg       0.92      0.92      0.91      2028

 micro f1: 0.9166666666666666


Model evaluation on test: 100%|██████████| 32/32 [00:03<00:00,  9.07it/s]


validation results fpr test


              precision    recall  f1-score   support

           0       0.90      0.98      0.94       866
           1       0.18      0.28      0.22        25
           2       0.67      0.09      0.16       109

    accuracy                           0.87      1000
   macro avg       0.58      0.45      0.44      1000
weighted avg       0.86      0.87      0.84      1000

 micro f1: 0.87


Model evaluation on train: 100%|██████████| 251/251 [00:27<00:00,  9.20it/s]

validation results fpr train


              precision    recall  f1-score   support

           0       0.99      0.99      0.99      7097
           1       0.67      0.94      0.78       297
           2       0.98      0.74      0.84       619

    accuracy                           0.97      8013
   macro avg       0.88      0.89      0.87      8013
weighted avg       0.98      0.97      0.97      8013

 micro f1: 0.9714214401597404



3 epoch training in progress: 251it [01:25,  2.92it/s]
Model evaluation on valid: 100%|██████████| 64/64 [00:07<00:00,  9.13it/s]


validation results fpr valid


              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1846
           1       0.35      0.26      0.30        58
           2       0.64      0.43      0.51       124

    accuracy                           0.93      2028
   macro avg       0.65      0.56      0.59      2028
weighted avg       0.91      0.93      0.92      2028

 micro f1: 0.9250493096646942


Model evaluation on test: 100%|██████████| 32/32 [00:03<00:00,  9.08it/s]


validation results fpr test


              precision    recall  f1-score   support

           0       0.90      0.99      0.94       866
           1       0.17      0.16      0.17        25
           2       0.63      0.16      0.25       109

    accuracy                           0.88      1000
   macro avg       0.57      0.44      0.45      1000
weighted avg       0.86      0.88      0.85      1000

 micro f1: 0.879


Model evaluation on train: 100%|██████████| 251/251 [00:27<00:00,  9.20it/s]

validation results fpr train


              precision    recall  f1-score   support

           0       0.99      1.00      0.99      7025
           1       0.77      0.96      0.85       310
           2       0.98      0.85      0.91       678

    accuracy                           0.98      8013
   macro avg       0.91      0.93      0.92      8013
weighted avg       0.98      0.98      0.98      8013

 micro f1: 0.9812804193186072



4 epoch training in progress: 251it [01:25,  2.92it/s]
Model evaluation on valid: 100%|██████████| 64/64 [00:07<00:00,  9.12it/s]

validation results fpr valid


              precision    recall  f1-score   support

           0       0.94      0.99      0.96      1846
           1       0.39      0.12      0.18        58
           2       0.64      0.39      0.48       124

    accuracy                           0.93      2028
   macro avg       0.66      0.50      0.54      2028
weighted avg       0.91      0.93      0.91      2028

 micro f1: 0.925542406311637



5 epoch training in progress: 251it [01:25,  2.92it/s]
Model evaluation on valid: 100%|██████████| 64/64 [00:07<00:00,  9.11it/s]

validation results fpr valid


              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1846
           1       0.45      0.16      0.23        58
           2       0.68      0.44      0.54       124

    accuracy                           0.93      2028
   macro avg       0.69      0.53      0.58      2028
weighted avg       0.92      0.93      0.92      2028

 micro f1: 0.9309664694280079



6 epoch training in progress: 251it [01:25,  2.92it/s]
Model evaluation on valid: 100%|██████████| 64/64 [00:07<00:00,  9.08it/s]

validation results fpr valid


              precision    recall  f1-score   support

           0       0.95      0.96      0.95      1846
           1       0.47      0.16      0.23        58
           2       0.45      0.58      0.51       124

    accuracy                           0.91      2028
   macro avg       0.63      0.56      0.57      2028
weighted avg       0.91      0.91      0.91      2028

 micro f1: 0.9097633136094675



7 epoch training in progress: 251it [01:25,  2.92it/s]
Model evaluation on valid: 100%|██████████| 64/64 [00:07<00:00,  9.06it/s]

validation results fpr valid


              precision    recall  f1-score   support

           0       0.94      0.99      0.97      1846
           1       0.67      0.14      0.23        58
           2       0.66      0.45      0.54       124

    accuracy                           0.93      2028
   macro avg       0.76      0.53      0.58      2028
weighted avg       0.92      0.93      0.92      2028

 micro f1: 0.9304733727810651



8 epoch training in progress: 251it [01:25,  2.92it/s]
Model evaluation on valid: 100%|██████████| 64/64 [00:07<00:00,  9.10it/s]

validation results fpr valid


              precision    recall  f1-score   support

           0       0.94      0.98      0.96      1846
           1       0.38      0.14      0.20        58
           2       0.65      0.44      0.53       124

    accuracy                           0.93      2028
   macro avg       0.66      0.52      0.56      2028
weighted avg       0.91      0.93      0.92      2028

 micro f1: 0.9265285996055227



9 epoch training in progress: 251it [01:25,  2.92it/s]
Model evaluation on valid: 100%|██████████| 64/64 [00:07<00:00,  9.09it/s]

validation results fpr valid


              precision    recall  f1-score   support

           0       0.94      0.99      0.96      1846
           1       0.35      0.12      0.18        58
           2       0.70      0.44      0.54       124

    accuracy                           0.93      2028
   macro avg       0.66      0.51      0.56      2028
weighted avg       0.91      0.93      0.92      2028

 micro f1: 0.9285009861932939





As you can see, on the test set we achieve a micro-averaged F1 close to 0.90 (about 0.88) and a macro-averaged F1 of roughly 0.5 (0.44 to 0.48 at the saved checkpoints), which is close to the best models trained during the PolEval competition. For the needs of this homework we find it sufficient; however, no grid search over training hyperparameters (such as dropout or weight decay) has been done.

Pre-training the BERT language model on a large set of social media data could probably improve the results significantly. Also, the provided data is just a flat list, while a Twitter discussion probably has a tree structure, so giving the model some context of the discussion could bring another significant improvement.

We see that after a few epochs the model overfits heavily to the train set, which is not surprising: by epoch 3 the macro-averaged F1 on train reaches 0.92, while on test it stays around 0.45.

NOTE: after training, upload the weights to the hatespeechml S3 bucket. For access, contact kdziedzic66@gmail.com.