## Introduction

The notebook is based on the article [*Multi-label Classification of Commit Messages Using Transfer Learning*](https://www.researchgate.net/profile/Mohamed-Wiem-Mkaouer-2/publication/348228961_Multi-label_Classification_of_Commit_Messages_using_Transfer_Learning/links/61eacfc2c5e3103375ae596d/Multi-label-Classification-of-Commit-Messages-using-Transfer-Learning.pdf).

The task is to classify commit messages into three categories (labels): Corrective, Adaptive, and Perfective. A single commit message can carry more than one label at once.

## Preparation

In [1]:
!pip install transformers
!pip install lets_plot

Collecting lets_plot
  Downloading lets_plot-2.5.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Collecting pypng
  Downloading pypng-0.20220715.0-py3-none-any.whl (58 kB)
Installing collected packages: pypng, lets_plot
Successfully installed lets_plot-2.5.0 pypng-0.20220715.0

In [2]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
from sklearn import metrics
from lets_plot import *
from torch.utils.data import TensorDataset, random_split, DataLoader, RandomSampler
import matplotlib.pyplot as plt 
import pandas as pd
import numpy as np
import random
import os
import torch
import warnings

warnings.filterwarnings('ignore')

In [3]:
if torch.cuda.is_available():    
    device = torch.device("cuda")
    print(f'Using GPU : {torch.cuda.get_device_name(0)}')
else:
    device = torch.device("cpu")
    print(f'Using CPU')

Using GPU : Tesla T4


In [4]:
LetsPlot.setup_html()

## Exploration

In [5]:
path_to_data = '../input/commit-data/train_2K.csv'

In [6]:
data = pd.read_csv(path_to_data)

In [7]:
data = data.rename(columns={'Unnamed: 0': 'id'})
data.head()

,id,text,Corrective,Adaptive,Perfective
0,0,netfilter: xt_log: fix mark logging for ip...,1,0,0
1,1,[patch] inode-diet: eliminate i_blksize fr...,1,0,0
2,2,tensor roll op implementation (#14953) ...,1,1,0
3,3,improve video updates from sheet,0,0,1
4,4,[spark-9372] [sql] filter nulls in join ke...,0,0,1


Let's look at the distribution of labels in our dataset. To begin with, we count how often each label occurs individually, regardless of which other labels accompany it.

In [8]:
uni_label_info = pd.DataFrame({
    'Corrective' : data.Corrective,
    'Adaptive' :  data.Adaptive,
    'Perfective' : data.Perfective
})
stat_info = pd.DataFrame({
    'Label' : uni_label_info.keys(),
    'Frequency' : [sum(uni_label_info.Corrective), sum(uni_label_info.Adaptive), 
                   sum(uni_label_info.Perfective)]
})

In [9]:
ggplot(stat_info, aes(x=stat_info.Label, weight=stat_info.Frequency, fill=stat_info.Label)) + \
    geom_bar() + labs(x='Label', y='Times Occurred')

We also need to look at the actual multi-label distribution in our dataset, i.e. how often each combination of labels occurs.

In [10]:
combinations = {'Corrective' : 0, 'Corrective Adaptive' : 0, 'Corrective Adaptive Perfective' : 0, 'Adaptive' : 0, 'Perfective' : 0, 'Adaptive Perfective' : 0, 'Corrective Perfective' : 0}
multi_label = pd.Series(["" for i in range(len(uni_label_info.index))])
idx_to_drop = []
for i in range(len(uni_label_info.index)):
    if not uni_label_info.iloc[i].any():
        idx_to_drop.append(i)
        continue
    if uni_label_info.iloc[i].Corrective:
        multi_label[i] += 'Corrective '
    if uni_label_info.iloc[i].Adaptive:
        multi_label[i] += 'Adaptive '
    if uni_label_info.iloc[i].Perfective:
        multi_label[i] += 'Perfective '
    combinations[multi_label[i].strip()] += 1
total_number = sum(combinations.values())
combinations = {k: v / total_number for k,v in combinations.items()}
multi_label_info = pd.DataFrame({
    'Label' : combinations.keys(),
    'Frequency' : combinations.values()
})
print(f'Label combinations in total: {len(combinations.keys())}\nAmount of labeled commit messages: {total_number} \
      \nDataset rows size: {data.shape[0]} \nThere are {data.shape[0] - total_number} commit messages \
without labels in dataset')

Label combinations in total: 7
Amount of labeled commit messages: 2035       
Dataset rows size: 2037 
There are 2 commit messages without labels in dataset
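
The same tally can be written more compactly. The sketch below is an illustrative alternative, not the notebook's own code: it assumes each row is a `(Corrective, Adaptive, Perfective)` tuple of 0/1 flags, and the helper name `count_combinations` and the sample rows are made up for the example.

```python
from collections import Counter

LABEL_NAMES = ('Corrective', 'Adaptive', 'Perfective')

def count_combinations(rows):
    """Count how often each non-empty label combination occurs.

    `rows` is an iterable of (corrective, adaptive, perfective) 0/1 flags.
    Rows with no label set are skipped, as in the loop above.
    """
    counts = Counter()
    for flags in rows:
        names = [name for name, flag in zip(LABEL_NAMES, flags) if flag]
        if names:
            counts[' '.join(names)] += 1
    return counts

# Hypothetical rows, for illustration only
sample_rows = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 0, 0), (1, 0, 0)]
counts = count_combinations(sample_rows)
total = sum(counts.values())                      # number of labeled rows
shares = {combo: n / total for combo, n in counts.items()}
```

Unlike the pre-filled `combinations` dictionary, a `Counter` only contains the combinations that actually occur, so combinations with zero frequency would have to be added back before plotting if they should appear on the chart.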


In [11]:
ggplot(multi_label_info, aes(x=multi_label_info.Label, weight=multi_label_info.Frequency, fill=multi_label_info.Label)) + \
    geom_bar() + labs(x='Label', y='Density') + coord_cartesian(ylim=(0.0, 0.5))

As the distribution shows, the majority of commits carry a single label. These are the ideal case, since each refers to one "atomic" change in one category. In practice, however, many other commits have more than one label based on their messages.

Let's drop rows whose label vector is `[0, 0, 0]` (the article ignores them as well).

In [12]:
data = data.drop(labels=idx_to_drop, axis=0)
data.shape

(2035, 5)

For convenience, let's add a column of label vectors to our dataset.

In [13]:
labels_all = [[0, 0, 0] for i in range(len(data.index))]
for i in range(len(data.index)):
    if data.iloc[i].Corrective:
        labels_all[i][0] = 1
    if data.iloc[i].Adaptive:
        labels_all[i][1] = 1
    if data.iloc[i].Perfective:
        labels_all[i][2] = 1
data['labels'] = labels_all
data.shape

(2035, 6)

In [14]:
data.head()

,id,text,Corrective,Adaptive,Perfective,labels
0,0,netfilter: xt_log: fix mark logging for ip...,1,0,0,"[1, 0, 0]"
1,1,[patch] inode-diet: eliminate i_blksize fr...,1,0,0,"[1, 0, 0]"
2,2,tensor roll op implementation (#14953) ...,1,1,0,"[1, 1, 0]"
3,3,improve video updates from sheet,0,0,1,"[0, 0, 1]"
4,4,[spark-9372] [sql] filter nulls in join ke...,0,0,1,"[0, 0, 1]"


## Preprocessing and Tokenizing

Let's extract X and y from our data and use the pretrained `DistilBert` tokenizer to preprocess the commit message texts.

In [15]:
text_data = data.text.values
labels_data = list(data.labels)

In [16]:
model_name = 'distilbert-base-uncased'

In [17]:
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

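Before tokenizing and converting the data to tensors, we check the length of the longest tokenized message. We pad or truncate every message to `max_length=256`, which is enough in our case because no message exceeds it.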

In [18]:
max_len = np.zeros(len(text_data))
for i in range(len(text_data)):
    input_ids = tokenizer.encode(text_data[i], add_special_tokens=True)
    max_len[i] = len(input_ids)
print('Max length: ', max_len.max())

Max length:  161.0


Next we encode every message: we add the special tokens, pad to `max_length`, and build the attention masks.

In [19]:
input_ids = []
attention_masks = []

for text in text_data:
    encoded_dict = tokenizer.encode_plus(
        text,
        add_special_tokens=True,      # add [CLS] and [SEP]
        max_length=256,
        padding='max_length',         # replaces the deprecated pad_to_max_length=True
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt')
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels_tensor = torch.tensor(labels_data)
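
To make the roles of padding, truncation, and the attention mask concrete, here is a dependency-free sketch of what happens to a single message. It is an illustration under assumptions, not the tokenizer's implementation: the helper `pad_or_truncate` and the sample ids are hypothetical, and the special-token ids (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0) are the ones assumed for the `distilbert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed ids for distilbert-base-uncased

def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), both of length `max_length`.

    `token_ids` are word-piece ids WITHOUT special tokens.
    """
    body = list(token_ids)[:max_length - 2]      # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                        # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 marks padding

# Hypothetical word-piece ids, for illustration only
ids, mask = pad_or_truncate([2000, 2001, 2002], max_length=8)
long_ids, long_mask = pad_or_truncate(range(1000, 1020), max_length=8)
```

The attention mask tells the model which positions to ignore, so padding a short commit message to 256 tokens does not change what the model attends to.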

## Setting up Dataset and Model

Let's split our dataset into three parts: a train set (80%), a validation set (10%), and a test set (10%), and create a `DataLoader` for each.

In [20]:
dataset = TensorDataset(input_ids, attention_masks, labels_tensor)
train_size = int(0.8 * len(dataset))
val_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - val_size

train_dataset, test_dataset = random_split(dataset, [train_size + val_size, test_size])
train_dataset, val_dataset = random_split(train_dataset, [train_size, val_size])

print(f'Total amount of samples : {train_size + val_size + test_size}')
print(f'Train set : {train_size}')
print(f'Validation set : {val_size}')
print(f'Test set : {test_size}')

Total amount of samples : 2035
Train set : 1628
Validation set : 203
Test set : 204
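
The sizes above follow from simple integer arithmetic: the train and validation sizes are rounded down, and the test set receives whatever remains. A small sketch that reproduces them (the function name `split_sizes` is ours, introduced only for illustration):

```python
def split_sizes(n_samples, train_frac=0.8, val_frac=0.1):
    """Round train and validation sizes down; the remainder goes to the test set."""
    train = int(train_frac * n_samples)
    val = int(val_frac * n_samples)
    test = n_samples - train - val
    return train, val, test

sizes = split_sizes(2035)
```

Computing the test size as a remainder guarantees that the three parts always add up to the full dataset, which `random_split` requires when given absolute lengths.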


Set `batch_size` to 16 (8 and 32 are other common choices) and sample batches randomly.

In [21]:
batch_size = 16

In [22]:
train_dataloader = DataLoader(
            train_dataset,  
            sampler = RandomSampler(train_dataset), 
            batch_size = batch_size)

validation_dataloader = DataLoader(
            val_dataset, 
            sampler = RandomSampler(val_dataset), 
            batch_size = batch_size)

test_dataloader = DataLoader(
            test_dataset, 
            sampler = RandomSampler(test_dataset), 
            batch_size = batch_size)

We will use `DistilBertForSequenceClassification`, a version of the pretrained `DistilBert` model with a classification head on top. In our case we set `num_labels` to 3 and `problem_type` to multi-label classification.

In [23]:
model = DistilBertForSequenceClassification.from_pretrained(
    model_name,
    problem_type="multi_label_classification",
    num_labels = 3, 
    output_attentions = False,
    output_hidden_states = False, 
)

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [24]:
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

Let's specify the *learning rate* for the *AdamW* optimizer and the *number of epochs*. Because we are only fine-tuning a pretrained model, we don't need many epochs; with too many, the model overfits and the validation loss starts to increase noticeably. Several hyperparameter settings were tried; $4$ to $6$ epochs and a learning rate $\in (10^{-5}, 5 \cdot 10^{-5})$ are recommended.

In [25]:
learning_rate = 2e-5

In [26]:
optimizer = AdamW(model.parameters(),
                  lr = learning_rate)

In [27]:
epochs = 5

Let's also use a linear scheduler to decrease the learning rate during training.

In [28]:
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)
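
To see what this schedule does to the learning rate, the sketch below recomputes the multiplier by hand. It is our own illustration of a linear warmup and decay rule, written to mirror the behaviour of `get_linear_schedule_with_warmup` as we understand it; the function name `linear_lr_factor` is hypothetical. The step counts assume the 1628 training samples, batch size 16, and 5 epochs used in this notebook.

```python
import math

def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

steps_per_epoch = math.ceil(1628 / 16)   # the last batch of an epoch is smaller
total_steps = steps_per_epoch * 5
base_lr = 2e-5

lr_at_start = base_lr * linear_lr_factor(0, 0, total_steps)
lr_halfway = base_lr * linear_lr_factor(total_steps // 2, 0, total_steps)
lr_at_end = base_lr * linear_lr_factor(total_steps, 0, total_steps)
```

With `num_warmup_steps = 0` there is no ramp-up phase: training starts at the full learning rate and decays linearly to zero at the final step.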

We interpret the outputs as a vector of probabilities by applying the sigmoid function to the logits, and then assign every label whose probability is at least `threshold=0.5`.

In [29]:
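def get_probs(logits, threshold=0.5):
    # Sigmoid turns each logit into an independent per-label probability
    sigm = 1 / (1 + np.exp(-logits))
    # Use the threshold argument instead of a hard-coded 0.5
    return sigm >= threshold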
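
A small worked example of this decision rule, written without NumPy so it can be run anywhere. The logits are made up for illustration, and `predict_labels` is a hypothetical stand-in for `get_probs` operating on one sample.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_labels(logits, threshold=0.5):
    """One independent yes/no decision per label."""
    return [sigmoid(z) >= threshold for z in logits]

example_logits = [2.0, -1.0, 0.0]   # hypothetical logits for (Corrective, Adaptive, Perfective)
probabilities = [round(sigmoid(z), 3) for z in example_logits]
default_prediction = predict_labels(example_logits)
strict_prediction = predict_labels(example_logits, threshold=0.9)
```

Because each label is thresholded independently, a sample can receive zero, one, or several labels. Since the sigmoid is monotonic, a threshold of 0.5 on the probability is equivalent to checking whether the logit is non-negative.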

In [30]:
def flat_accuracy(preds, labels):
    res = np.zeros(labels.shape[0])
    for i in range(labels.shape[0]):
        res[i] = np.all(preds[i] == labels[i]) 
    return np.sum(res) / labels.shape[0]

In [31]:
def compute_f1_macro(preds, labels):
    # sklearn expects the ground truth first, then the predictions
    return metrics.f1_score(labels, preds, average='macro')

In [32]:
def compute_f1_micro(preds, labels):
    return metrics.f1_score(labels, preds, average='micro')
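
The three metrics answer different questions, which a small hand-computed example makes visible. The sketch below is a plain-Python illustration, not a replacement for `sklearn.metrics.f1_score`: the helper names and the toy labels are hypothetical, and a label with no true or predicted positives is assigned an F1 of 0 here by assumption.

```python
def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def label_counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[label] and p[label])
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t[label] and p[label])
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[label] and not p[label])
    return tp, fp, fn

def multilabel_scores(y_true, y_pred):
    n_labels = len(y_true[0])
    counts = [label_counts(y_true, y_pred, j) for j in range(n_labels)]
    # Macro: average the per-label F1 scores, so every label weighs the same
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute a single F1
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))
    # Exact match: a sample counts only if all of its labels are right
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return exact, micro, macro

# Toy data, for illustration only
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
exact, micro, macro = multilabel_scores(y_true, y_pred)
```

In this toy case the exact-match accuracy is 0.5, micro-F1 is 0.8, and macro-F1 is about 0.67: two samples each have one wrong label, which halves the exact-match accuracy, while the poorly predicted second label pulls macro-F1 down more than micro-F1.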

## Training

Now we are ready to train our model. After each training epoch we also evaluate it on the validation data, computing the loss (binary cross-entropy) and metrics such as accuracy and F1-score.
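
For reference, the loss reported below can be reproduced by hand. With `problem_type="multi_label_classification"`, the loss is, to our understanding, binary cross-entropy applied to the logits and averaged over every sample and label (PyTorch's `BCEWithLogitsLoss`). The following sketch is a plain-Python version under that assumption; the function name `bce_with_logits` and the example values are ours.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all samples and labels.

    Uses the numerically stable form
    max(z, 0) - z * y + log(1 + exp(-|z|)),
    which equals -[y * log(s) + (1 - y) * log(1 - s)] with s = sigmoid(z).
    """
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, y in zip(row_logits, row_targets):
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

# Hypothetical logits for one sample labeled Corrective + Perfective
uninformed = bce_with_logits([[0.0, 0.0, 0.0]], [[1, 0, 1]])
confident = bce_with_logits([[4.0, -4.0, 4.0]], [[1, 0, 1]])
```

A model that outputs a logit of 0 for every label has a loss of $\ln 2 \approx 0.693$, which is a useful baseline when reading the training log.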

In [33]:
print('Training started...')

np.random.seed(42)
random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

training_stats = []
    
for epoch_i in range(epochs):
    print()
    print('#-----------------------#')
    print(f'     Epoch : {epoch_i + 1} / {epochs}')
    print('#-----------------------#')

    model.train()
    total_train_loss = 0
        
    for step, batch in enumerate(train_dataloader):
        batch_input_ids = batch[0].to(device)
        batch_input_mask = batch[1].to(device)
        batch_labels = batch[2].float().to(device)

        model.zero_grad()        
            
        result = model(batch_input_ids, 
                        attention_mask=batch_input_mask, 
                        labels=batch_labels,
                        return_dict=True)
        loss = result.loss
        logits = result.logits

        total_train_loss += loss.item()

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

    avg_train_loss = total_train_loss / len(train_dataloader)            

    print(f'Average train loss : {avg_train_loss:.3f}')
    print()
    print('Validation started...')
    print()

    model.eval()

    total_eval_accuracy = 0
    total_eval_loss = 0
    total_eval_f1_micro = 0
    total_eval_f1_macro = 0
    nb_eval_steps = 0

    for batch in validation_dataloader:
        batch_input_ids = batch[0].to(device)
        batch_input_mask = batch[1].to(device)
        batch_labels = batch[2].float().to(device)

        with torch.no_grad():        
            result = model(batch_input_ids, 
                            attention_mask=batch_input_mask,
                            labels=batch_labels,
                            return_dict=True)

        loss = result.loss
        logits = result.logits

        total_eval_loss += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = batch_labels.to('cpu').numpy()

        total_eval_f1_micro += compute_f1_micro(get_probs(logits), label_ids)
        total_eval_f1_macro += compute_f1_macro(get_probs(logits), label_ids)
        total_eval_accuracy += flat_accuracy(get_probs(logits), label_ids)

    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    avg_val_f1_micro = total_eval_f1_micro / len(validation_dataloader)
    avg_val_f1_macro = total_eval_f1_macro / len(validation_dataloader)
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    print(f'Average validation loss : {avg_val_loss:.3f}')
    print('Average validation metrics:')
    print('----------------')
    print(f'Accuracy : {avg_val_accuracy:.3f}')
    print(f'f1-score micro : {avg_val_f1_micro:.3f}')
    print(f'f1-score macro : {avg_val_f1_macro:.3f}')

    training_stats.append({
        'epoch': epoch_i + 1,
        'train_loss': avg_train_loss,
        'valid_loss': avg_val_loss,
        'val_accuracy': avg_val_accuracy,
        'val_f1_micro': avg_val_f1_micro,
        'val_f1_macro': avg_val_f1_macro,
    })
    
print()
print('Training finished...')

Training started...

#-----------------------#
     Epoch : 1 / 5
#-----------------------#
Average train loss : 0.561

Validation started...

Average validation loss : 0.363
Average validation metrics:
----------------
Accuracy : 0.708
f1-score micro : 0.835
f1-score macro : 0.821

#-----------------------#
     Epoch : 2 / 5
#-----------------------#
Average train loss : 0.332

Validation started...

Average validation loss : 0.281
Average validation metrics:
----------------
Accuracy : 0.748
f1-score micro : 0.860
f1-score macro : 0.829

#-----------------------#
     Epoch : 3 / 5
#-----------------------#
Average train loss : 0.268

Validation started...

Average validation loss : 0.263
Average validation metrics:
----------------
Accuracy : 0.741
f1-score micro : 0.866
f1-score macro : 0.852

#-----------------------#
     Epoch : 4 / 5
#-----------------------#
Average train loss : 0.225

Validation started...

Average validation loss : 0.256
Average validation metrics:
--------

Visualizing stats after training

In [34]:
train_loss = [i['train_loss'] for i in training_stats]
val_loss = [i['valid_loss'] for i in training_stats]
# Use a separate name so the integer `epochs` from the training setup is not overwritten
epoch_numbers = [i['epoch'] for i in training_stats]
val_acc = [i['val_accuracy'] for i in training_stats]
val_f1_micro = [i['val_f1_micro'] for i in training_stats]
val_f1_macro = [i['val_f1_macro'] for i in training_stats]
loss_stats = pd.DataFrame({
    'epoch' : epoch_numbers,
    'train_loss' : train_loss,
    'val_loss' : val_loss,
    'val_accuracy' : val_acc,
    'val_f1_micro' : val_f1_micro,
    'val_f1_macro' : val_f1_macro
})

In [35]:
bunch = GGBunch()
plot = ggplot(loss_stats) + geom_path(aes('epoch', 'train_loss'), size=1.3, color='blue') + ggsize(500, 400) + ggtitle('Average Train Loss')
bunch.add_plot(plot, 100, 0)
plot = ggplot(loss_stats) + geom_path(aes('epoch', 'val_loss'), size=1.3, color='red') + ggsize(500, 400) + ggtitle('Average Validation Loss')
bunch.add_plot(plot, 700, 0)
bunch.show()

The training loss decreases steadily, and the validation loss also decreases and stays close to it (0.256 versus 0.225 at epoch 4), which suggests that the model generalizes reasonably well.

In [36]:
ggplot(loss_stats) + \
    geom_line(aes(x='epoch', y='val_accuracy'), size=1.3, color='green') + \
    ggtitle("Average Validation Accuracy") + ggsize(500, 400)

Following the article's authors, we evaluate our model with the F1-score. The plot below shows micro-averaged F1 (dashed) and macro-averaged F1; on the validation set they reach about 0.87 and 0.85 respectively by epoch 3.

In [37]:
ggplot(loss_stats) + \
    geom_line(aes(x='epoch', y='val_f1_micro'), size=1.3, color='orange', linetype = "dashed") + \
    geom_line(aes(x='epoch', y='val_f1_macro'), size=1.3, color='pink') + \
    ggtitle("Average Validation F1-Score") + ggsize(500, 400)

## Testing

Now let's evaluate our model on the hold-out test data from the original dataset.

In [38]:
model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [39]:
print('Testing started...')
print()

total_test_accuracy = 0
total_test_f1_micro = 0
total_test_f1_macro = 0

for batch in test_dataloader:
    batch_input_ids = batch[0].to(device)
    batch_input_mask = batch[1].to(device)
    batch_labels = batch[2].float().to(device)

    with torch.no_grad():        
        outputs = model(batch_input_ids, 
                        attention_mask=batch_input_mask,
                        labels=batch_labels)

    logits = outputs.logits
    logits = logits.detach().cpu().numpy()
    label_ids = batch_labels.to('cpu').numpy()
    
    total_test_f1_micro += compute_f1_micro(get_probs(logits), label_ids)
    total_test_f1_macro += compute_f1_macro(get_probs(logits), label_ids)
    total_test_accuracy += flat_accuracy(get_probs(logits), label_ids)
    
avg_test_accuracy = total_test_accuracy / len(test_dataloader)
avg_test_f1_micro = total_test_f1_micro / len(test_dataloader)
avg_test_f1_macro = total_test_f1_macro / len(test_dataloader)

print('Test metrics:')
print('----------------------')
print(f'Accuracy : {avg_test_accuracy:.4f}')
print(f'f1-score micro : {avg_test_f1_micro:.4f}')
print(f'f1-score macro : {avg_test_f1_macro:.4f}')

print()
print("Testing finished...")

Testing started...

Test metrics:
----------------------
Accuracy : 0.7564
f1-score micro : 0.8473
f1-score macro : 0.8301

Testing finished...


## Saving

If necessary, we can save the model and tokenizer.

In [40]:
is_saving = False

In [41]:
def save_model(path_to_save, is_saving, model):
    if not is_saving:
        return
    if not os.path.exists(path_to_save):
        os.makedirs(path_to_save)
    model_to_save = model.module if hasattr(model, 'module') else model
    model_to_save.save_pretrained(path_to_save)
    tokenizer.save_pretrained(path_to_save)
    print(f'Saved model and tokenizer to {path_to_save}')

In [42]:
save_model('./result/', is_saving, model)

## Summary

To further check the generalization ability of our model, let's use some samples from [NNGen](https://github.com/Tbabm/nngen/tree/master/data).

In [43]:
nn_gen_path = '../input/nngen-test/cleaned.test.msg'

In [44]:
nn_gen_data = []
with open(nn_gen_path) as nn_gen_file:
    for line in nn_gen_file:
        line = line.strip()
        nn_gen_data.append(line)

In [45]:
np.random.seed(27)
n_samples = 30
random_nn_gen_data = np.random.choice(nn_gen_data, n_samples, replace=False)

In [46]:
def reverse_to_label(labels):
    res = ''
    if labels[0]:
        res += ' Corrective'
    if labels[1]:
        res += ' Adaptive'
    if labels[2]:
        res += ' Perfective'
    res = res.strip()
    return res

In [47]:
model.eval()
model.to('cpu')  # tokenizer outputs live on the CPU, so move the model once
for i in range(n_samples):
    print(f'Commit message: \n{random_nn_gen_data[i]}')
    inputs = tokenizer(random_nn_gen_data[i], return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Convert to NumPy before thresholding, as in validation and testing
    probs = get_probs(logits.numpy()).astype(int)[0]
    print('Prediction:')
    print(reverse_to_label(probs))
    print('--------------------')

Commit message: 
Remove automatic left factoring declaration from Java . g4 since the processing is currently very slow
Prediction:
Perfective
--------------------
Commit message: 
Added tools . jar to classpath for buildSrc
Prediction:
Adaptive
--------------------
Commit message: 
remove logall since it doesn ' t trigger anymore
Prediction:
Perfective
--------------------
Commit message: 
move test file into temp folder
Prediction:
Perfective
--------------------
Commit message: 
Updates SRTP transform engine when setting new SDES keys .
Prediction:
Adaptive
--------------------
Commit message: 
bump up druid version to 0 . 4 . 0
Prediction:
Perfective
--------------------
Commit message: 
fixed previous check - in
Prediction:
Corrective
--------------------
Commit message: 
DomModelTreeView disposing fixed
Prediction:
Corrective
--------------------
Commit message: 
Reverted change to logging file ( was not the right logging file ) .
Prediction:
Corrective Perfective
---------------

As the predictions on random samples from this previously unseen dataset show, our model's output is reasonable and close to the meaning of the NNGen commit messages. The predictions fit the categorization based on Swanson's maintenance activities (from the article). For example, the message `Upgraded parent version .` is considered adaptive, because it modifies a previous version. `fixed previous check - in` is labeled corrective, which is close to its intention. `Removed junit from classpath ( all tests moved to zaproxy - test )` is predicted to be perfective, because it is a refactoring. Even the prediction for `ToString usage example had a typo in it ( used ' excludes ' instead of the correct ' exclude ' ) .` seems fine: it is both a fix and a change to a variable name, so it can be considered corrective and perfective. However, there are a few samples whose predictions are hard to explain.

In conclusion, we got decent results. There are still several ways to improve the model: we can tune the hyperparameters (epochs, lr, batch_size); once the hyperparameters are fixed, we can fold the validation set into the training data; we can add weight decay to the optimizer; and we can try different metrics and other approaches.