<a href="https://colab.research.google.com/github/justdenz/mco2-technical-report/blob/main/Commit_Classifier_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Libraries

**The original script, which trains a multi-label toxic-comment classifier, is the work of Ronak Patel (2020).**

This notebook uses a dataset of GitHub commit messages to show the feasibility of building a message classifier with a BERT model.

Adapted from:
https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1

In [None]:
!pip install transformers
import pandas as pd
import numpy as np
import tensorflow as tf
import torch
from torch.nn import BCEWithLogitsLoss, BCELoss
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report, confusion_matrix, multilabel_confusion_matrix, f1_score, accuracy_score
import pickle
from transformers import BertForSequenceClassification, BertTokenizer, AdamW
from tqdm import tqdm, trange
from ast import literal_eval


We initially tried training on **1M** entries of the original data, but Colab's RAM and runtime limits kept us from completing even one epoch. We instead trained on **100K** entries, split 90-10 into a training file (`train_90k.csv`) and a held-out test file (`demo_10k.csv`).

In [None]:
# !gdown https://drive.google.com/uc?id=1-d5DgXjR_xhU-2RClDcNZ_pBi3AN0F7J # train dataset 1m
# !gdown https://drive.google.com/uc?id=1-gT-vVaOWegECpXNpzs8LzOEzDwXBgM3 #demo dataset 20k
!gdown https://drive.google.com/uc?id=1c4xKGqFbRFnrSqXsm1FQdAwbvqgiS2ZE #train 90k
!gdown https://drive.google.com/uc?id=1F8dK1frQbWJWx1d07uJEa1X_xXBtwTPP #demo 10k

Downloading...
From: https://drive.google.com/uc?id=1c4xKGqFbRFnrSqXsm1FQdAwbvqgiS2ZE
To: /content/train_90k.csv
48.2MB [00:00, 85.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1F8dK1frQbWJWx1d07uJEa1X_xXBtwTPP
To: /content/demo_10k.csv
5.37MB [00:00, 47.0MB/s]


In [None]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'Tesla T4'

## Load and Preprocess Training Data

The dataset is tokenized and then split into training and validation sets. The validation set is used to monitor training. A separate test set is loaded later for the final analysis.

In [None]:
df = pd.read_csv('train_90k.csv') # commit-message training data
df.head()

Unnamed: 0,id,message,other,bug,fix,refactor,security
0,7f45e0f01478de7e4ed3d135c7841e0dab7f6c21,Reset timeout when we are back from interrupt.,1,0,0,0,0
1,bc0d1102225f709770d3027309e15e35bdc8f405,Add an example to the chflags(1) man page.\n ...,1,0,0,0,0
2,cf16bf7779a68bc572c57ab0cb40054e1754223a,arc: vdk: Add support of UIO\n \n ARC VD...,1,0,0,0,0
3,e788759f44b29e5b1bc27a265dece7dcfa4234af,netfilter: ebtables: split do_replace into two...,0,0,0,1,0
4,fcf48de7bd3ac95f3f6a5e5e6352100e2e76525c,read-rtl.c (struct macro_traverse_data): Add u...,1,0,0,0,0


In [None]:
print('Unique comments: ', df.message.nunique() == df.shape[0])
print('Null values: ', df.isnull().values.any())
# df[df.isna().any(axis=1)]

Unique comments:  False
Null values:  False


In [None]:
print('average sentence length: ', df.message.str.split().str.len().mean())
print('stdev sentence length: ', df.message.str.split().str.len().std())

average sentence length:  44.49653518123667
stdev sentence length:  78.43776185185993


In [None]:
cols = df.columns
label_cols = list(cols[2:])
num_labels = len(label_cols)
print('Label columns: ', label_cols)

Label columns:  ['other', 'bug', 'fix', 'refactor', 'security']


In [None]:
print('Count of 1 per label: \n', df[label_cols].sum(), '\n') 
print('Count of 0 per label: \n', df[label_cols].eq(0).sum())

Count of 1 per label: 
 other       38659
bug          8836
fix         36109
refactor    11544
security     2404
dtype: int64 

Count of 0 per label: 
 other       58893
bug         88716
fix         61443
refactor    86008
security    95148
dtype: int64


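The label counts are clearly imbalanced: `security` has 2,404 positive examples against 38,659 for `other`. Keep this in mind during model evaluation, since overall scores are dominated by the frequent labels.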
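
To make the imbalance concrete, the printed counts can be turned into per-label shares and a max-to-min ratio. This is only an illustrative sketch: the dictionary re-types the counts from the output above, and the variable names are our own.

```python
# Positive counts per label, copied from the printed output above.
label_counts = {
    "other": 38659,
    "bug": 8836,
    "fix": 36109,
    "refactor": 11544,
    "security": 2404,
}

total = sum(label_counts.values())

# Share of all positive labels carried by each class.
shares = {label: count / total for label, count in label_counts.items()}

# Ratio between the most and the least frequent label.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

for label, share in shares.items():
    print(f"{label:<9} {share:6.1%}")
print(f"max/min ratio: {imbalance_ratio:.1f}")
```

A ratio this large is one reason to read the per-label rows of a classification report rather than a single aggregate score.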

In [None]:
df = df.sample(frac=1).reset_index(drop=True) #shuffle rows

In [None]:
df['one_hot_labels'] = list(df[label_cols].values)
df.head()

Unnamed: 0,id,message,other,bug,fix,refactor,security,one_hot_labels
0,62cd73d989167ef7e812f367961fa9f2bfc6b333,[GlobalISel][AArch64] Select ADDXri.\n \n ...,1,0,0,0,0,"[1, 0, 0, 0, 0]"
1,53c1facc3b1caceada2e967afcdec9f92e582087,[Grappler] Avoid copying tensors in arithmetic...,0,0,1,0,0,"[0, 0, 1, 0, 0]"
2,e92ad9275d5598d6139f69019348423e3ac66449,renaming the class and its file,0,0,0,1,0,"[0, 0, 0, 1, 0]"
3,fc39a9ca0ef4f7b07c485e0d3c61ec0776f7a38c,[CodeGen] Matching promoted type for 16-bit in...,0,0,1,0,0,"[0, 0, 1, 0, 0]"
4,972826a6fa638ee91871e36309aa29fb6c8eaffc,Re-enable disabled size asserts in ppapi_tests...,0,1,0,0,0,"[0, 1, 0, 0, 0]"


In [None]:
df = df.dropna()

In [None]:
labels = list(df.one_hot_labels.values)
comments = list(df.message.values)

Load the pretrained tokenizer that corresponds to your choice of model, e.g.:

```
BERT:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True) 

XLNet:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased', do_lower_case=False) 

RoBERTa:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', do_lower_case=False)
```


To avoid memory issues on Google Colab, I enforce a `max_length` of 100 tokens. Longer messages are truncated, so some of them may lose the text that signals their label.
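
What fixing the length does to a single message can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are arbitrary, and the real tokenizer additionally preserves its special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a token-id list to max_length, or right-pad it with pad_id.

    Returns the fixed-length ids and an attention mask
    (1 = real token, 0 = padding).
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    ids = kept + [pad_id] * n_pad
    mask = [1] * len(kept) + [0] * n_pad
    return ids, mask

# A short message is padded; a long one loses its tail.
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions.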

In [None]:
max_length = 100
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True) # tokenizer
encodings = tokenizer.batch_encode_plus(comments, max_length=max_length, padding='max_length', truncation=True) # pad or truncate every message to max_length
print('tokenizer outputs: ', encodings.keys())

In [None]:
input_ids = encodings['input_ids'] # tokenized and encoded sentences
token_type_ids = encodings['token_type_ids'] # token type ids
attention_masks = encodings['attention_mask'] # attention masks

In [None]:
# Identifying indices of 'one_hot_labels' entries that only occur once - this will allow us to stratify split our training data later
label_counts = df.one_hot_labels.astype(str).value_counts()
one_freq = label_counts[label_counts==1].keys()
one_freq_idxs = sorted(list(df[df.one_hot_labels.astype(str).isin(one_freq)].index), reverse=True)
print('df label indices with only one instance: ', one_freq_idxs)

df label indices with only one instance:  []


Our scope is single-label (multiclass) classification: each message keeps exactly one label. The single-instance handling used for stratification is still executed, but it has no effect here because no label combination occurs only once (the list above is empty). It only becomes useful when training for multi-label classification.
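
A stratified split needs at least two examples of every label combination, which is why combinations that occur once are set aside and added to the training set afterwards. The idea can be sketched with the standard library; the label rows below are hypothetical toy data, not rows from the dataset.

```python
from collections import Counter

# Hypothetical one-hot label rows; the last combination occurs only once.
labels = [
    (1, 0, 0, 0, 0),
    (0, 0, 1, 0, 0),
    (1, 0, 0, 0, 0),
    (0, 0, 1, 0, 0),
    (0, 1, 0, 0, 1),
]

combo_counts = Counter(labels)

# Indices of rows whose label combination appears exactly once.
singleton_idxs = [i for i, row in enumerate(labels) if combo_counts[row] == 1]

# Rows that can safely be stratified, and rows to append to the training set later.
stratifiable = [row for i, row in enumerate(labels) if i not in singleton_idxs]
forced_into_train = [labels[i] for i in singleton_idxs]

print(singleton_idxs, forced_into_train)
```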

In [None]:
# Gathering single instance inputs to force into the training set after stratified split
one_freq_input_ids = [input_ids.pop(i) for i in one_freq_idxs]
one_freq_token_types = [token_type_ids.pop(i) for i in one_freq_idxs]
one_freq_attention_masks = [attention_masks.pop(i) for i in one_freq_idxs]
one_freq_labels = [labels.pop(i) for i in one_freq_idxs]

In [None]:
# Use train_test_split to split our data into train and validation sets

train_inputs, validation_inputs, train_labels, validation_labels, train_token_types, validation_token_types, train_masks, validation_masks = train_test_split(input_ids, labels, token_type_ids,attention_masks,
                                                            random_state=2020, test_size=0.10, stratify = labels)

# Add one frequency data to train data
train_inputs.extend(one_freq_input_ids)
train_labels.extend(one_freq_labels)
train_masks.extend(one_freq_attention_masks)
train_token_types.extend(one_freq_token_types)

# Convert all of our data into torch tensors, the required datatype for our model
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)
train_token_types = torch.tensor(train_token_types)

validation_inputs = torch.tensor(validation_inputs)
validation_labels = torch.tensor(validation_labels)
validation_masks = torch.tensor(validation_masks)
validation_token_types = torch.tensor(validation_token_types)

In [None]:
# Select a batch size for training. For fine-tuning BERT, the authors recommend a batch size of 16 or 32. We will use 32 here to avoid memory issues.
batch_size = 32

# Wrap the tensors in a torch DataLoader, which serves them in batches (shuffled for training).
# Note that TensorDataset itself keeps all tensors in memory.

train_data = TensorDataset(train_inputs, train_masks, train_labels, train_token_types)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels, validation_token_types)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
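
As a rough check on what the loaders will yield, the split and batch counts can be worked out by hand. The sketch below assumes the training CSV has 97,552 rows (the ones plus zeros of any label in the counts printed earlier), that `train_test_split` rounds the held-out share up, and that the DataLoader keeps its default `drop_last=False`.

```python
import math

n_rows = 97552       # assumed row count, taken from the label counts above
test_size = 0.10
batch_size = 32

# The held-out share is rounded up; the remainder goes to training.
n_validation = math.ceil(test_size * n_rows)
n_train = n_rows - n_validation

# With drop_last=False the final, smaller batch is still yielded.
train_batches = math.ceil(n_train / batch_size)
validation_batches = math.ceil(n_validation / batch_size)

print(n_train, n_validation, train_batches, validation_batches)
```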

In [None]:
torch.save(validation_dataloader,'validation_data_loader')
torch.save(train_dataloader,'train_data_loader')

## Load Model & Set Params

Load the appropriate model below. Each model already includes a single dense layer on top for classification.



```
BERT:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)

XLNet:
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=num_labels)

RoBERTa:
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=num_labels)
```



*bert-base-uncased* is our model of choice because we have worked with it before.

In [None]:
# Load model, the pretrained model will include a single linear classification layer on top for classification. 
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)
model.cuda()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

Setting custom optimization parameters for the AdamW optimizer https://huggingface.co/transformers/main_classes/optimizer_schedules.html

In [None]:
# setting custom optimization parameters. You may implement a scheduler here as well.
param_optimizer = list(model.named_parameters())
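# AdamW reads the 'weight_decay' key of each group; unrecognized keys are silently ignored.
# Hugging Face BERT names its LayerNorm parameters 'LayerNorm.weight' and 'LayerNorm.bias'.
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}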
]

In [None]:
optimizer = AdamW(optimizer_grouped_parameters,lr=2e-5,correct_bias=True)
# optimizer = AdamW(model.parameters(),lr=2e-5)  # Default optimization
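
The name-based grouping can be illustrated without loading the model. The sketch below is an assumption-laden example: the parameter names are a small hand-written subset in the style of Hugging Face BERT (where LayerNorm parameters are called `LayerNorm.weight` and `LayerNorm.bias`), and `split_by_decay` is a hypothetical helper.

```python
# A few parameter names in the style of the model printout (illustrative subset).
param_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.embeddings.LayerNorm.bias",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "classifier.weight",
    "classifier.bias",
]

def split_by_decay(names, no_decay):
    """Partition names: any name containing a no_decay substring is exempt from weight decay."""
    exempt = [n for n in names if any(nd in n for nd in no_decay)]
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    return decayed, exempt

decayed, exempt = split_by_decay(param_names, no_decay=["bias", "LayerNorm.weight"])

print("decayed:", decayed)
print("exempt: ", exempt)
```

Because the match is a plain substring test, the `no_decay` entries have to agree with the names the model actually uses; printing `model.named_parameters()` is a quick way to confirm them.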

## Train Model

We left this part of the script unchanged; it is entirely the work of the original author.

In [None]:
# Store our loss and accuracy for plotting
train_loss_set = []

# Number of training epochs (authors recommend between 2 and 4)
epochs = 3

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):

  # Training
  
  # Set our model to training mode (as opposed to evaluation mode)
  model.train()

  # Tracking variables
  tr_loss = 0 #running loss
  nb_tr_examples, nb_tr_steps = 0, 0
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels, b_token_types = batch
    # Clear out the gradients (by default they accumulate)
    optimizer.zero_grad()

    # Forward pass for multilabel classification
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    logits = outputs[0]
    loss_func = BCEWithLogitsLoss() 
    loss = loss_func(logits.view(-1,num_labels),b_labels.type_as(logits).view(-1,num_labels)) #convert labels to float for calculation
    # loss_func = BCELoss() 
    # loss = loss_func(torch.sigmoid(logits.view(-1,num_labels)),b_labels.type_as(logits).view(-1,num_labels)) #convert labels to float for calculation
    train_loss_set.append(loss.item())    

    # Backward pass
    loss.backward()
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    # scheduler.step()
    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1

  print("Train loss: {}".format(tr_loss/nb_tr_steps))

###############################################################################

  # Validation

  # Put model in evaluation mode to evaluate loss on the validation set
  model.eval()

  # Variables to gather full output
  logit_preds,true_labels,pred_labels,tokenized_texts = [],[],[],[]

  # Predict
  for i, batch in enumerate(validation_dataloader):
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels, b_token_types = batch
    with torch.no_grad():
      # Forward pass
      outs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
      b_logit_pred = outs[0]
      pred_label = torch.sigmoid(b_logit_pred)

      b_logit_pred = b_logit_pred.detach().cpu().numpy()
      pred_label = pred_label.to('cpu').numpy()
      b_labels = b_labels.to('cpu').numpy()

    tokenized_texts.append(b_input_ids)
    logit_preds.append(b_logit_pred)
    true_labels.append(b_labels)
    pred_labels.append(pred_label)

  # Flatten outputs
  pred_labels = [item for sublist in pred_labels for item in sublist]
  true_labels = [item for sublist in true_labels for item in sublist]

  # Calculate Accuracy
  threshold = 0.50
  pred_bools = [pl>threshold for pl in pred_labels]
  true_bools = [tl==1 for tl in true_labels]
  val_f1_accuracy = f1_score(true_bools,pred_bools,average='micro')*100
  val_flat_accuracy = accuracy_score(true_bools, pred_bools)*100

  print('F1 Validation Accuracy: ', val_f1_accuracy)
  print('Flat Validation Accuracy: ', val_flat_accuracy)

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Train loss: 0.16703288635556687


Epoch:  33%|███▎      | 1/3 [26:28<52:57, 1588.85s/it]

F1 Validation Accuracy:  89.21064922933209
Flat Validation Accuracy:  87.53587535875359
Train loss: 0.09566634473100348


Epoch:  67%|██████▋   | 2/3 [52:57<26:28, 1588.91s/it]

F1 Validation Accuracy:  90.83735659035457
Flat Validation Accuracy:  89.50389503895039
Train loss: 0.07359925955744412


Epoch: 100%|██████████| 3/3 [1:19:28<00:00, 1589.42s/it]

F1 Validation Accuracy:  91.49432436635048
Flat Validation Accuracy:  90.25215252152522





5/28/2021: Training completed successfully after 3 epochs. The model ended epoch 3 with a training loss of 0.0736, an F1 validation score of 91.49%, and a flat validation accuracy of 90.25%.
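
For intuition about the loss values reported above, the quantity that `BCEWithLogitsLoss` averages can be written out in plain Python. This is a minimal sketch with hypothetical logits for a single message; the PyTorch implementation computes the same mean in a numerically more stable way.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all label positions, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns the logit into a probability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

# Hypothetical logits for one message over [other, bug, fix, refactor, security].
logits = [-3.0, -2.5, 4.0, -1.0, -4.0]
targets = [0, 0, 1, 0, 0]  # true label: fix

loss = bce_with_logits(logits, targets)
print(f"loss: {loss:.4f}")
```

Each label contributes its own independent binary term, which is what makes this loss a multi-label objective even when every message has a single label.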


In [None]:
torch.save(model.state_dict(), 'commit_classifier_model.zip')

The cell above saves the model's weights (its `state_dict`) for future use.

## Load and Preprocess Test Data

In [None]:
test_df = pd.read_csv('demo_10k.csv')
test_label_cols = list(test_df.columns[2:])
print('Null values: ', test_df.isnull().values.any()) #should not be any null sentences or labels
print('Same columns between train and test: ', label_cols == test_label_cols) #columns should be the same
test_df.head()

Null values:  False
Same columns between train and test:  True


Unnamed: 0,id,message,other,bug,fix,refactor,security
0,1b2bf66ea834f76521822543082b0406a34d0bf3,Implement activateMain() in app shell,1,0,0,0,0
1,7bb1fafc2f163ad03a2007295bb2f57cfdbfb630,"IB/mlx5, ib_post_send(), IB_WR_REG_SIG_MR: Do ...",1,0,0,0,0
2,99c4719c3ca35c8d40c83e988817c47cffa6ed32,Do not compile scrollback support if option SM...,1,0,0,0,0
3,ab2f68d5adf83dfb2484ad2c0b7aff7f2badc23d,"[PowerPC] Regenerate reciprocal tests, as disc...",1,0,0,0,0
4,008a97ef4e9c2bc8a9b105e6e5bd580109373818,AST: Convert IsAsyncHandlerRequest to use sepa...,1,0,0,0,0


In [None]:
test_df = test_df[~test_df[test_label_cols].eq(-1).any(axis=1)].copy() #drop rows with -1 labels, if any; copy so the new column is added to an independent frame
test_df['one_hot_labels'] = list(test_df[test_label_cols].values)
test_df.head()

Unnamed: 0,id,message,other,bug,fix,refactor,security,one_hot_labels
0,1b2bf66ea834f76521822543082b0406a34d0bf3,Implement activateMain() in app shell,1,0,0,0,0,"[1, 0, 0, 0, 0]"
1,7bb1fafc2f163ad03a2007295bb2f57cfdbfb630,"IB/mlx5, ib_post_send(), IB_WR_REG_SIG_MR: Do ...",1,0,0,0,0,"[1, 0, 0, 0, 0]"
2,99c4719c3ca35c8d40c83e988817c47cffa6ed32,Do not compile scrollback support if option SM...,1,0,0,0,0,"[1, 0, 0, 0, 0]"
3,ab2f68d5adf83dfb2484ad2c0b7aff7f2badc23d,"[PowerPC] Regenerate reciprocal tests, as disc...",1,0,0,0,0,"[1, 0, 0, 0, 0]"
4,008a97ef4e9c2bc8a9b105e6e5bd580109373818,AST: Convert IsAsyncHandlerRequest to use sepa...,1,0,0,0,0,"[1, 0, 0, 0, 0]"


In [None]:
# Gathering input data
test_labels = list(test_df.one_hot_labels.values)
test_comments = list(test_df.message.values)

In [None]:
# Encoding input data
test_encodings = tokenizer.batch_encode_plus(test_comments, max_length=max_length, padding='max_length', truncation=True) # same settings as for training
test_input_ids = test_encodings['input_ids']
test_token_type_ids = test_encodings['token_type_ids']
test_attention_masks = test_encodings['attention_mask']


In [None]:
batch_size = 32
# Make tensors out of data
test_inputs = torch.tensor(test_input_ids)
test_labels = torch.tensor(test_labels)
test_masks = torch.tensor(test_attention_masks)
test_token_types = torch.tensor(test_token_type_ids)
# Create test dataloader
test_data = TensorDataset(test_inputs, test_masks, test_labels, test_token_types)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)
# Save test dataloader
torch.save(test_dataloader,'test_data_loader')



## Prediction and Metrics

In [None]:
# Test

# Put model in evaluation mode to run predictions on the test set
model.eval()

#track variables
logit_preds,true_labels,pred_labels,tokenized_texts = [],[],[],[]

# Predict
for i, batch in enumerate(test_dataloader):
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels, b_token_types = batch
  with torch.no_grad():
    # Forward pass
    outs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    b_logit_pred = outs[0]
    pred_label = torch.sigmoid(b_logit_pred)

    b_logit_pred = b_logit_pred.detach().cpu().numpy()
    pred_label = pred_label.to('cpu').numpy()
    b_labels = b_labels.to('cpu').numpy()

  tokenized_texts.append(b_input_ids)
  logit_preds.append(b_logit_pred)
  true_labels.append(b_labels)
  pred_labels.append(pred_label)

# Flatten outputs
tokenized_texts = [item for sublist in tokenized_texts for item in sublist]
pred_labels = [item for sublist in pred_labels for item in sublist]
true_labels = [item for sublist in true_labels for item in sublist]
# Converting flattened binary values to boolean values
true_bools = [tl==1 for tl in true_labels]

The sigmoid outputs lie in [0, 1], so they must be thresholded to obtain label decisions. Below I use a threshold of 0.50: a label is predicted when its score exceeds 0.50 and rejected otherwise. Thresholding is better suited to multi-label classification, where several labels can be accepted at once. Our training data has one label per message, but thresholding does not enforce that, so a message can still receive no label or more than one.
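
The difference between thresholding and forcing a single label can be seen on a few made-up score vectors. The sketch below is illustrative only: the scores are hypothetical, and the `argmax_label` alternative is not what this notebook uses.

```python
def threshold_labels(probs, threshold=0.5):
    """Indices of all labels whose score exceeds the threshold (may be empty)."""
    return [i for i, p in enumerate(probs) if p > threshold]

def argmax_label(probs):
    """Index of the single highest-scoring label."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical sigmoid outputs over [other, bug, fix, refactor, security].
confident = [0.02, 0.05, 0.97, 0.10, 0.01]
unsure = [0.40, 0.10, 0.45, 0.30, 0.02]
ambiguous = [0.05, 0.60, 0.70, 0.10, 0.01]

for name, probs in [("confident", confident), ("unsure", unsure), ("ambiguous", ambiguous)]:
    print(name, threshold_labels(probs), argmax_label(probs))
```

With thresholding, the `unsure` message gets no label and the `ambiguous` one gets two, while taking the argmax always returns exactly one.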

In [None]:
pred_bools = [pl>0.50 for pl in pred_labels] #boolean output after thresholding

# Print and save classification report
print('Test F1 Accuracy: ', f1_score(true_bools, pred_bools,average='micro'))
print('Test Flat Accuracy: ', accuracy_score(true_bools, pred_bools),'\n')
clf_report = classification_report(true_bools,pred_bools,target_names=test_label_cols)
with open('classification_report.txt', 'w') as f: #save report as plain text
  f.write(clf_report)
print(clf_report)

Test F1 Accuracy:  0.9146272926991265
Test Flat Accuracy:  0.9028929799716772 

              precision    recall  f1-score   support

       other       0.96      0.93      0.94      3916
         bug       0.89      0.84      0.86       883
         fix       0.95      0.92      0.93      3680
    refactor       0.80      0.87      0.83      1161
    security       0.85      0.74      0.79       246

   micro avg       0.92      0.91      0.91      9886
   macro avg       0.89      0.86      0.87      9886
weighted avg       0.93      0.91      0.91      9886
 samples avg       0.90      0.91      0.90      9886
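
The gap between the micro and macro rows, and between F1 and flat accuracy, follows from how each score aggregates. The sketch below recomputes them by hand on hypothetical toy data with three labels; `multilabel_scores` is our own helper, written to mirror the usual definitions rather than scikit-learn's code.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def multilabel_scores(y_true, y_pred):
    """Return (micro F1, macro F1, exact-match accuracy) for 0/1 label rows."""
    n_labels = len(y_true[0])
    per_label = []
    tp_all = fp_all = fn_all = 0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] and p[j])
        fp = sum(1 for t, p in zip(y_true, y_pred) if not t[j] and p[j])
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] and not p[j])
        per_label.append(f1(tp, fp, fn))
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro = f1(tp_all, fp_all, fn_all)   # pools counts, so frequent labels dominate
    macro = sum(per_label) / n_labels    # every label weighs the same
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return micro, macro, exact

# Hypothetical toy data: the rare third label is missed entirely.
y_true = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]]

micro, macro, exact = multilabel_scores(y_true, y_pred)
print(f"micro F1: {micro:.3f}  macro F1: {macro:.3f}  exact match: {exact:.3f}")
```

Missing the rare label barely moves the micro score but pulls the macro score down sharply, which matches the pattern in the report above where `security` has the lowest per-label F1.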





## Output Dataframe

This part of the script is purely for visualizing the results.

In [None]:
idx2label = dict(enumerate(label_cols)) # map column index to label name
print(idx2label)

{0: 'other', 1: 'bug', 2: 'fix', 3: 'refactor', 4: 'security'}


In [None]:
# Getting indices of where boolean one hot vector true_bools is True so we can use idx2label to gather label names
true_label_idxs, pred_label_idxs=[],[]
for vals in true_bools:
  true_label_idxs.append(np.where(vals)[0].flatten().tolist())
for vals in pred_bools:
  pred_label_idxs.append(np.where(vals)[0].flatten().tolist())

In [None]:
# Gathering vectors of label names using idx2label
true_label_texts, pred_label_texts = [], []
for vals in true_label_idxs:
  if vals:
    true_label_texts.append([idx2label[val] for val in vals])
  else:
    true_label_texts.append(vals)

for vals in pred_label_idxs:
  if vals:
    pred_label_texts.append([idx2label[val] for val in vals])
  else:
    pred_label_texts.append(vals)

In [None]:
# Decoding input ids to comment text
comment_texts = [tokenizer.decode(text,skip_special_tokens=True,clean_up_tokenization_spaces=False) for text in tokenized_texts]

In [None]:
# Converting lists to df
comparisons_df = pd.DataFrame({'comment_text': comment_texts, 'true_labels': true_label_texts, 'pred_labels':pred_label_texts})
comparisons_df.to_csv('comparisons.csv')
comparisons_df.head()

Unnamed: 0,comment_text,true_labels,pred_labels
0,add salt files to make scheduler run,[other],[other]
1,document that rtfree ( 9 ) accepts null . from...,[other],[other]
2,merge pull request # 1512 from mdboom / coding...,[refactor],[other]
3,net : emac : fix reset timeout with ar8035 phy...,[fix],[fix]
4,"rename lchunkbase to lchunk , lchunk to lplatf...",[fix],[fix]


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# import shutil
# shutil.copy("/content/commit_classifier_model.zip", "/content/drive/MyDrive/") 

'/content/drive/MyDrive/commit_classifier_model.zip'