<a href="https://colab.research.google.com/github/patrycjalazna/transformers/blob/main/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports 💅🏻💅🏻💅🏻

In [28]:
# Double quotes are used because Windows cmd passes single quotes through literally,
# which pip rejects as an invalid requirement.
!pip install "transformers==4.12.5" "tokenizers==0.10.3" "sentencepiece==0.1.96" "datasets==1.16.1" "accelerate==0.5.1" "sacremoses==0.0.46" "sacrebleu==2.0.0" "torch"


In [29]:
import torch
from torch import nn
from torch.nn import MSELoss, CrossEntropyLoss, BCEWithLogitsLoss
from transformers import RobertaForSequenceClassification, RobertaModel
from transformers.modeling_outputs import SequenceClassifierOutput
import json
from pathlib import Path
from typing import Dict, List
from datasets import load_dataset
import os
import random

## 🤗 Dataset

The *emotion* dataset is a collection of English Twitter messages labeled with six basic emotions: anger, fear, joy, love, sadness, and surprise.

Dataset link: [Hugging Face](https://huggingface.co/datasets/emotion)

Example:

```
{
    "label": 0,
    "text": "im feeling quite sad and sorry for myself but ill snap out of it soon"
}
```



In [30]:
dataset = load_dataset('emotion')

Using custom data configuration default
Reusing dataset emotion (C:\Users\masob\.cache\huggingface\datasets\emotion\default\0.0.0\348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705)
100%|██████████| 3/3 [00:00<00:00, 333.43it/s]


The data comes pre-split into a train set, a validation set, and a test set in an 8:1:1 ratio.

In [31]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})
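
The 8:1:1 ratio can be checked directly from the row counts in the printout above. A minimal sketch; the helper name `split_ratio` is an illustrative assumption, not part of the notebook:

```python
from functools import reduce
from math import gcd

# Row counts copied from the DatasetDict printout above.
split_sizes = {"train": 16000, "validation": 2000, "test": 2000}

def split_ratio(sizes):
    """Reduce the split sizes to their smallest whole-number ratio."""
    divisor = reduce(gcd, sizes.values())
    return {name: count // divisor for name, count in sizes.items()}

ratio = split_ratio(split_sizes)
print(ratio)  # {'train': 8, 'validation': 1, 'test': 1}
```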


Next, we create a folder in which the data will be saved.

In [32]:
# exist_ok=True makes the call a no-op when the folder already exists.
os.makedirs("./data", exist_ok=True)

In [33]:
train_path = Path('data/train.json')
valid_path = Path('data/valid.json')
test_path = Path('data/test.json')

train_path_binary = Path('data/train_binary.json')
valid_path_binary = Path('data/valid_binary.json')
test_path_binary = Path('data/test_binary.json')

In [34]:
data_train_list, data_valid_list, data_test_list = [], [], []

for data_line, data_list in [
  (dataset['train'], data_train_list),
  (dataset['test'], data_test_list),
  (dataset['validation'], data_valid_list)
]:
  for i, data in enumerate(data_line):
    line = {
      'label': int(data['label']),
      'text': data['text'],
    }
    data_list.append(line)

print(f'Train: {len(data_train_list)}')
print(f'Test: {len(data_test_list)}')
print(f'Validation: {len(data_valid_list)}')

Train: 16000
Test: 2000
Validation: 2000


In [41]:
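# Depends on whether we map only to positive/negative or to the 6 labels in the dataset
def get_map_label_translation(num_classes = 6):
    '''
    Return the mapping from dataset label ids to class names.
    Possible values of num_classes: 2 or 6.
    '''
    if(num_classes == 2):
        return {
            0: 'negative',
            1: 'positive',
            2: 'positive',
            3: 'negative',
            4: 'negative',
            5: 'positive',
        }
    elif(num_classes == 6):
        return {
            0: 'sadness',
            1: 'joy',
            2: 'love',
            3: 'anger',
            4: 'fear',
            5: 'suprise',
        }
    # Fail loudly instead of silently returning None.
    raise ValueError(f'Unsupported num_classes: {num_classes}')

def get_value_from_label(label):
    # Labels 1, 2 and 5 are the positive classes (1); all others are negative (0).
    if(label in [1, 2, 5]):
        return 1
    else:
        return 0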

MAP_LABEL_TRANSLATION_2 = get_map_label_translation(2)
MAP_LABEL_TRANSLATION_6 = get_map_label_translation(6)
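
The binary mapping is encoded twice above: once as strings in the 2-class dictionary and once as integers in `get_value_from_label`. The following sketch checks that the two agree. The names `LABELS_6`, `LABELS_2`, `BINARY_NAMES` and `to_binary` are illustrative stand-ins so that the snippet runs on its own:

```python
# Copies of the two mappings defined above, repeated so the snippet is self-contained.
LABELS_6 = {0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'suprise'}
LABELS_2 = {0: 'negative', 1: 'positive', 2: 'positive',
            3: 'negative', 4: 'negative', 5: 'positive'}

def to_binary(label):
    """Same rule as get_value_from_label: labels 1, 2 and 5 map to 1."""
    return 1 if label in (1, 2, 5) else 0

# Assumed naming of the binary ids: 0 is negative, 1 is positive.
BINARY_NAMES = {0: 'negative', 1: 'positive'}

# Collect every label id where the integer rule and the string mapping disagree.
mismatches = [label for label in LABELS_6
              if BINARY_NAMES[to_binary(label)] != LABELS_2[label]]
print(mismatches)  # []
```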

In [36]:
data_class_test = {}
data_class_train = {}
data_class_validation = {}

data_class_test_binary = {}
data_class_train_binary = {}
data_class_validation_binary = {}

for label in MAP_LABEL_TRANSLATION_6:
  if(MAP_LABEL_TRANSLATION_6[label] not in data_class_test):
    data_class_test[MAP_LABEL_TRANSLATION_6[label]] = []
    data_class_validation[MAP_LABEL_TRANSLATION_6[label]] = []
    data_class_train[MAP_LABEL_TRANSLATION_6[label]] = []

for label in MAP_LABEL_TRANSLATION_2:
  # Check the binary dict itself; checking data_class_test would reset the lists for every label.
  if(MAP_LABEL_TRANSLATION_2[label] not in data_class_test_binary):
    data_class_test_binary[MAP_LABEL_TRANSLATION_2[label]] = []
    data_class_validation_binary[MAP_LABEL_TRANSLATION_2[label]] = []
    data_class_train_binary[MAP_LABEL_TRANSLATION_2[label]] = []

for data in data_valid_list:
  data_class_validation[MAP_LABEL_TRANSLATION_6[int(data['label'])]].append(data)
for data in data_train_list:
  data_class_train[MAP_LABEL_TRANSLATION_6[int(data['label'])]].append(data)
for data in data_test_list:
  data_class_test[MAP_LABEL_TRANSLATION_6[int(data['label'])]].append(data)

for data in data_valid_list:
  data_class_validation_binary[MAP_LABEL_TRANSLATION_2[int(data['label'])]].append(data)
for data in data_train_list:
  data_class_train_binary[MAP_LABEL_TRANSLATION_2[int(data['label'])]].append(data)
for data in data_test_list:
  data_class_test_binary[MAP_LABEL_TRANSLATION_2[int(data['label'])]].append(data)

print('-- Stats for train set on 6 labels --')
for label in data_class_train:
  print(f'Label {label}: {len(data_class_train[label]):6d}')
print('-- Stats for test set on 6 labels --')
for label in data_class_test:
  print(f'Label {label}: {len(data_class_test[label]):6d}')
print('-- Stats for validation set on 6 labels--')
for label in data_class_validation:
  print(f'Label {label}: {len(data_class_validation[label]):6d}')
  
print('-- Stats for train set on 2 labels --')
for label in data_class_train_binary:
  print(f'Label {label}: {len(data_class_train_binary[label]):6d}')
print('-- Stats for test set on 2 labels --')
for label in data_class_test_binary:
  print(f'Label {label}: {len(data_class_test_binary[label]):6d}')
print('-- Stats for validation set on 2 labels--')
for label in data_class_validation_binary:
  print(f'Label {label}: {len(data_class_validation_binary[label]):6d}')


-- Stats for train set on 6 labels --
Label sadness:   4666
Label joy:   5362
Label love:   1304
Label anger:   2159
Label fear:   1937
Label suprise:    572
-- Stats for test set on 6 labels --
Label sadness:    581
Label joy:    695
Label love:    159
Label anger:    275
Label fear:    224
Label suprise:     66
-- Stats for validation set on 6 labels--
Label sadness:    550
Label joy:    704
Label love:    178
Label anger:    275
Label fear:    212
Label suprise:     81
-- Stats for train set on 2 labels --
Label negative:   8762
Label positive:   7238
-- Stats for test set on 2 labels --
Label negative:   1080
Label positive:    920
-- Stats for validation set on 2 labels--
Label negative:   1037
Label positive:    963
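
As a sanity check, the 2-label counts should be the sums of the corresponding 6-label counts. A minimal sketch using the numbers printed above; `POSITIVE` and `to_binary_counts` are illustrative names, not part of the notebook:

```python
# Per-class counts copied from the printed statistics above.
counts_6 = {
    'train': {'sadness': 4666, 'joy': 5362, 'love': 1304,
              'anger': 2159, 'fear': 1937, 'suprise': 572},
    'test': {'sadness': 581, 'joy': 695, 'love': 159,
             'anger': 275, 'fear': 224, 'suprise': 66},
    'validation': {'sadness': 550, 'joy': 704, 'love': 178,
                   'anger': 275, 'fear': 212, 'suprise': 81},
}

# Classes treated as positive in the 2-class mapping.
POSITIVE = {'joy', 'love', 'suprise'}

def to_binary_counts(class_counts):
    """Aggregate 6-class counts into negative/positive totals."""
    binary = {'negative': 0, 'positive': 0}
    for name, count in class_counts.items():
        binary['positive' if name in POSITIVE else 'negative'] += count
    return binary

binary_counts = {split: to_binary_counts(c) for split, c in counts_6.items()}
for split, totals in binary_counts.items():
    print(split, totals)
```

The totals match the binary statistics reported above for all three splits.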


In [37]:
   
def remove_if_exists(f):
    # Wrap in Path so that plain string paths work too.
    path = Path(f)
    if(path.exists()):
        path.unlink()

def save_unchanged(f, data, binary = True):
    remove_if_exists(f)
    print(f'Saving into: {f}')
    with open(f, 'wt') as f_write:
        for data_line in data:
            # Work on a copy so the shared dataset records are never modified.
            record = dict(data_line)
            if(binary):
                record['label'] = get_value_from_label(record['label'])
            data_line_str = json.dumps(record)
            f_write.write(f'{data_line_str}\n')

def save_as_translations(f, data_classes, num_entries):
    file_name = 'translations-' + f.name
    file_path = f.parent / file_name
    stats = {}
    remove_if_exists(file_path)
    print(f'Saving into: {file_path}')
    
    with open(file_path, 'wt') as f_write:
        for class_list in data_classes:
            if(num_entries > len(data_classes[class_list])):
                samples = data_classes[class_list]
            else:
                samples = random.sample(data_classes[class_list], num_entries)

            stats[f'{class_list} entries'] = len(samples)

            for data_line in samples:
                # Copy before relabeling: the same dicts are reused by later saves,
                # and a string label would make get_value_from_label return 0.
                record = dict(data_line, label=class_list)
                data_line_str = json.dumps(record)
                f_write.write(f'{data_line_str}\n')
        print(stats)
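
The functions above write one JSON object per line. Reading such a file back could look like the sketch below; `load_jsonl` is a hypothetical helper and the two records are made up for the round trip:

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path):
    """Read a JSON-lines file: one JSON object per non-empty line."""
    with open(path, 'rt') as f_read:
        return [json.loads(line) for line in f_read if line.strip()]

# Round trip through a temporary file with two made-up records.
records = [
    {'label': 0, 'text': 'first example'},
    {'label': 1, 'text': 'second example'},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir) / 'sample.json'
    with open(tmp_path, 'wt') as f_write:
        for record in records:
            f_write.write(json.dumps(record) + '\n')
    loaded = load_jsonl(tmp_path)

print(loaded == records)  # True
```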

In [42]:
# Set sizes: the value is the number of lines per class; if a class has fewer lines than that, all of its lines are used.
def get_num_of_samples(set_name):
    if(set_name == 'train'):
        return 1000
    else:
        return 100

for file_path, data_to_save, data_classes, num_entries in [
  (train_path, data_train_list, data_class_train, get_num_of_samples('train')),
  (valid_path, data_valid_list, data_class_validation, get_num_of_samples('valid')),
  (test_path, data_test_list, data_class_test, get_num_of_samples('test')),
]:
  save_unchanged(file_path, data_to_save, False)
  save_as_translations(file_path, data_classes, num_entries)

for file_path, data_to_save, data_classes, num_entries in [
  (train_path_binary, data_train_list, data_class_train_binary, get_num_of_samples('train')),
  (valid_path_binary, data_valid_list, data_class_validation_binary, get_num_of_samples('valid')),
  (test_path_binary, data_test_list, data_class_test_binary, get_num_of_samples('test')),
]:
  save_unchanged(file_path, data_to_save)
  save_as_translations(file_path, data_classes, num_entries)

Saving into: data\train.json
Saving into: data\translations-train.json
{'sadness entries': 1000, 'joy entries': 1000, 'love entries': 1000, 'anger entries': 1000, 'fear entries': 1000, 'suprise entries': 572}
Saving into: data\valid.json
Saving into: data\translations-valid.json
{'sadness entries': 100, 'joy entries': 100, 'love entries': 100, 'anger entries': 100, 'fear entries': 100, 'suprise entries': 81}
Saving into: data\test.json
Saving into: data\translations-test.json
{'sadness entries': 100, 'joy entries': 100, 'love entries': 100, 'anger entries': 100, 'fear entries': 100, 'suprise entries': 66}
Saving into: data\train_binary.json
Saving into: data\translations-train_binary.json
{'negative entries': 1000, 'positive entries': 1000}
Saving into: data\valid_binary.json
Saving into: data\translations-valid_binary.json
{'negative entries': 100, 'positive entries': 100}
Saving into: data\test_binary.json
Saving into: data\translations-test_binary.json
{'negative entries': 100, 'pos
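
The statistics above show that a class smaller than the requested size is written whole, while larger classes are sampled down. The rule can be isolated as follows; `sample_capped` is an illustrative name, and integer ranges stand in for the real records:

```python
import random

def sample_capped(items, num_entries, rng=random):
    """Return num_entries random items, or every item when there are fewer."""
    if num_entries > len(items):
        return list(items)
    return rng.sample(items, num_entries)

# Class sizes of the train split, taken from the statistics printed earlier.
train_sizes = {'sadness': 4666, 'joy': 5362, 'love': 1304,
               'anger': 2159, 'fear': 1937, 'suprise': 572}

# Sample at most 1000 placeholder items per class and record how many were kept.
sampled = {name: len(sample_capped(list(range(size)), 1000))
           for name, size in train_sizes.items()}
print(sampled)
print(sum(sampled.values()))  # 5572
```

With a cap of 1000 the six-label training file therefore holds 5572 lines, not 6000, because the smallest class has only 572 examples.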

## 🤗 Train

Download the script from the transformers library that is needed to run the model.

In [39]:
!wget 'https://raw.githubusercontent.com/huggingface/transformers/v4.12.5/examples/pytorch/text-classification/run_glue_no_trainer.py' -O 'original_run_glue_no_trainer.py'
!wget 'https://raw.githubusercontent.com/patrycjalazna/transformers/main/gpt2.py' -O 'gpt2.py'
!wget 'https://raw.githubusercontent.com/patrycjalazna/transformers/main/roberta.py' -O 'roberta.py'
!wget 'https://raw.githubusercontent.com/patrycjalazna/transformers/main/run_glue_no_trainer.py' -O 'run_glue_no_trainer.py'

'wget' is not recognized as an internal or external command,
operable program or batch file.
'wget' is not recognized as an internal or external command,
operable program or batch file.
'wget' is not recognized as an internal or external command,
operable program or batch file.
'wget' is not recognized as an internal or external command,
operable program or batch file.
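
Since `wget` is not available in this Windows shell, the same files could be fetched with Python's standard library instead. A minimal sketch: `DOWNLOADS` and `download_all` are illustrative names, and the actual call is left commented out because it needs network access:

```python
from urllib.request import urlretrieve

# Same URL -> file name pairs as the wget commands above.
DOWNLOADS = {
    'https://raw.githubusercontent.com/huggingface/transformers/v4.12.5/examples/pytorch/text-classification/run_glue_no_trainer.py':
        'original_run_glue_no_trainer.py',
    'https://raw.githubusercontent.com/patrycjalazna/transformers/main/gpt2.py':
        'gpt2.py',
    'https://raw.githubusercontent.com/patrycjalazna/transformers/main/roberta.py':
        'roberta.py',
    'https://raw.githubusercontent.com/patrycjalazna/transformers/main/run_glue_no_trainer.py':
        'run_glue_no_trainer.py',
}

def download_all(downloads, fetch=urlretrieve):
    """Fetch every URL into its target file; fetch can be swapped out for testing."""
    saved = []
    for url, file_name in downloads.items():
        fetch(url, file_name)
        saved.append(file_name)
    return saved

# Uncomment to perform the actual downloads (requires network access):
# download_all(DOWNLOADS)
```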


## GPT2

Base GPT2 model. This run increased the number of epochs, which raised accuracy from 0.83 to 0.938.
- Epoch 0: accuracy: 0.9095
- Epoch 1: accuracy: 0.9315
- Epoch 2: accuracy: 0.9385
- Epoch 3: accuracy: 0.938
- Evaluation: accuracy: 0.9275
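
For reference, and assuming the reported metric is plain classification accuracy, the figures above are the fraction of examples whose predicted label equals the reference label. A minimal sketch with made-up predictions:

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the corresponding reference label."""
    if len(predictions) != len(references):
        raise ValueError('predictions and references must have the same length')
    if not references:
        raise ValueError('cannot compute accuracy on an empty set')
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical label ids for eight examples; two of them are misclassified.
predictions = [0, 1, 1, 3, 4, 5, 0, 1]
references = [0, 1, 2, 3, 4, 5, 0, 3]
print(accuracy(predictions, references))  # 0.75
```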

In [40]:
!python run_glue_no_trainer.py \
  --model_name_or_path gpt2 \
  --train_file data/train.json  \
  --validation_file data/valid.json \
  --test_file data/test.json \
  --per_device_train_batch_size 24 \
  --per_device_eval_batch_size 24 \
  --max_length 128 \
  --learning_rate 2e-5 \
  --num_train_epochs 4 \
  --output_dir out/gpt2_version1

^C


## Version 2
### GPT2ForSequenceClassificationCustom
Model from the gpt2.py file, run with the freeze_model flag for 4 epochs:
- Epoch 0 accuracy: 0.462
- Epoch 1 accuracy: 0.4645
- Epoch 2 accuracy: 0.4615
- Epoch 3 accuracy: 0.4745
- Evaluation accuracy: 0.4795

In [None]:
!python run_glue_no_trainer.py \
  --model_name_or_path gpt2 \
  --train_file data/train.json  \
  --validation_file data/valid.json \
  --test_file data/test.json \
  --per_device_train_batch_size 24 \
  --per_device_eval_batch_size 24 \
  --max_length 128 \
  --freeze_model \
  --custom_model \
  --learning_rate 2e-5 \
  --num_train_epochs 4 \
  --output_dir out/gpt2_version_2

## Version 3
### GPT2ForSequenceClassificationCustomVersion2
A new layer was added, and the model was run with the freeze_model flag for 2 epochs. The max_length parameter was changed from 128 to 256, and train_batch_size from 24 to 32:
- Epoch 0: accuracy: 0.3765
- Epoch 1: accuracy: 0.4210
- Evaluation accuracy: 0.4339

In [None]:
!python run_glue_no_trainer.py \
  --model_name_or_path gpt2 \
  --train_file data/train.json  \
  --validation_file data/valid.json \
  --test_file data/test.json \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  --max_length 254 \
  --freeze_model \
  --custom_model \
  --return_hidden_states \
  --learning_rate 2e-5 \
  --num_train_epochs 2 \
  --output_dir out/imdb/gpt2_version_3

## Version 4
### GPT2ForSequenceClassificationCustomVersion2
A new layer was added, and the model was run with the freeze_model flag for 8 epochs. The train_batch_size parameter was changed from 24 to 16:
- Epoch 0: accuracy: 0.3765
- Epoch 1: accuracy: 0.4210
- Evaluation accuracy: 0.4339

In [None]:
!python run_glue_no_trainer.py \
  --model_name_or_path gpt2 \
  --train_file data/train.json  \
  --validation_file data/valid.json \
  --test_file data/test.json \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --max_length 128 \
  --freeze_model \
  --custom_model \
  --return_hidden_states \
  --learning_rate 2e-5 \
  --num_train_epochs 8 \
  --output_dir out/gpt2_version_4

## Version 5
### GPT2ForSequenceClassificationCustomVersion2
A new layer was added, and the model was run for 4 epochs. The difference in this version is the switch from 6-label to 2-label classification. We thought it would be interesting to compare the results, so for these experiments we convert our emotion dataset into a split between positive and negative only:
- sadness = negative
- joy = positive
- love = positive
- anger = negative
- fear = negative
- suprise = positive

The results are as follows:
- Epoch 0: accuracy: 0.3765
- Epoch 1: accuracy: 0.4210
- Evaluation accuracy: 0.4339
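
When comparing 6-label and 2-label accuracies, a majority-class baseline gives a useful point of reference. The sketch below computes it from the validation counts printed in the Dataset section. The baseline is not part of the original experiments, and `majority_baseline` is an illustrative name:

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

# Validation counts copied from the statistics in the Dataset section.
validation_6 = {'sadness': 550, 'joy': 704, 'love': 178,
                'anger': 275, 'fear': 212, 'suprise': 81}
validation_2 = {'negative': 1037, 'positive': 963}

print(majority_baseline(validation_6))  # 0.352
print(majority_baseline(validation_2))  # 0.5185
```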

In [None]:
!python run_glue_no_trainer.py \
  --model_name_or_path gpt2 \
  --train_file data/train_binary.json  \
  --validation_file data/valid_binary.json \
  --test_file data/test_binary.json \
  --per_device_train_batch_size 24 \
  --per_device_eval_batch_size 24 \
  --freeze_model \
  --custom_model \
  --return_hidden_states \
  --max_length 128 \
  --learning_rate 2e-5 \
  --num_train_epochs 4 \
  --output_dir out/gpt2