# **Issue Report Classification using Fine-tuned Sentence Transformer**

- This notebook fine-tunes a Sentence Transformer to classify GitHub issue reports. The objective is to train a multi-class classification model that predicts the label of an issue.

## Implementation details

* **Sentence transformer used:** BAAI/bge-base-en-v1.5
    - Link to model: https://huggingface.co/BAAI/bge-base-en-v1.5

* **Fine-tuning method used:** Few-shot Fine-tuning using SetFit
    - Link to SetFit repo: https://github.com/huggingface/setfit

### Training details:
* **Loss Class:** Cosine Similarity Loss
    - The loss function to use for contrastive learning with the Sentence Transformer body
* **Number of epochs:** 1
* **Number of iterations:** 20
    - The number of text pairs to generate for contrastive learning
* **Batch size:** 4
* **Metric:** Accuracy
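
As a rough check on these settings, the number of contrastive pairs and optimization steps can be worked out by hand. The sketch below assumes that one positive and one negative pair are generated per training sample in each iteration, and that each repository contributes 300 training samples (100 per label, as in the grouped counts shown later in this notebook). The variable names are illustrative only.

```python
# Back-of-the-envelope estimate of the contrastive training workload.
# Assumption: each iteration yields one positive and one negative pair per sample.
num_samples = 300      # training issues per repository (100 per label)
num_iterations = 20
batch_size = 4
num_epochs = 1

num_pairs = 2 * num_iterations * num_samples
steps_per_epoch = num_pairs // batch_size
total_steps = steps_per_epoch * num_epochs

print(f"pairs: {num_pairs}, optimization steps: {total_steps}")
```

Under these assumptions the estimate comes to 12000 pairs and 3000 optimization steps, which agrees with the training log printed during fine-tuning.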


### Installing the required libraries

In [1]:
%pip install pandas
%pip install sentence-transformers
%pip install setfit
%pip install scikit-learn
%pip install datasets



### Importing the necessary libraries

In [2]:
import pandas as pd
import json
import os
from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss
from datasets import Dataset
from sklearn.metrics import classification_report
from collections import defaultdict

- Using the bge-base-en-v1.5 model, which is one of the top 20 sentence transformer models according to the Hugging Face sentence transformer rankings.
- Setting a seed value of 42.

#### Declaring constant values to use throughout the notebook

In [3]:
BASE_MODEL = "BAAI/bge-base-en-v1.5"
RANDOM_SEED = 42
OUTPUT_PATH = 'output'

### Loading the datasets

In [4]:
train_file_path = r"https://github.com/lhamu/issue-report-classification/raw/main/preprocessed_data/preprocessed_issues_train.csv"
test_file_path = r"https://github.com/lhamu/issue-report-classification/raw/main/preprocessed_data/preprocessed_issues_test.csv"

In [5]:
train_set = pd.read_csv(train_file_path)
test_set = pd.read_csv(test_file_path)

SetFit expects the input to have "text" and "label" columns, so I rename the "issue_text" column to "text" here.

In [6]:
train_set = train_set.rename(columns={"issue_text": "text"})
test_set = test_set.rename(columns={"issue_text": "text"})
train_set.columns


Index(['repo', 'text', 'label'], dtype='object')

### Preparing the data to use for fine-tuning

In [7]:
repos = list(set(train_set["repo"].unique()))
print(repos)

['opencv/opencv', 'bitcoin/bitcoin', 'microsoft/vscode', 'facebook/react', 'tensorflow/tensorflow']


#### Grouping the data by repo

In [8]:
train_set.groupby(["repo", "label"]).size().unstack(fill_value=0)

| repo | label 0 | label 1 | label 2 |
|---|---|---|---|
| bitcoin/bitcoin | 100 | 100 | 100 |
| facebook/react | 100 | 100 | 100 |
| microsoft/vscode | 100 | 100 | 100 |
| opencv/opencv | 100 | 100 | 100 |
| tensorflow/tensorflow | 100 | 100 | 100 |


In [9]:
def group_by_repo(dataset):
    """Split a DataFrame into one Hugging Face Dataset per repository."""
    return {
        repo: Dataset.from_pandas(dataset[dataset["repo"] == repo]).class_encode_column("label")
        for repo in dataset["repo"].unique()
    }

train_sets = group_by_repo(train_set)
test_sets = group_by_repo(test_set)

Stringifying the column:   0%|          | 0/300 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/300 [00:00<?, ? examples/s]


In [10]:
datasets = {
    repo: {'train': train_sets[repo], 'test': test_sets[repo]} for repo in train_sets.keys()
}

### Fine-tuning the model

A separate model is trained for each repository on that repository's training set, and predictions are then made on that repository's test set.

Each SetFit model combines the pretrained Sentence Transformer (BAAI/bge-base-en-v1.5) as its body with a logistic regression classification head. The body is fine-tuned on contrastive text pairs, and the head is then fitted on the training samples.
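
To make the contrastive step more concrete, here is a minimal sketch of how text pairs with similarity targets could be built from labeled samples. It is not SetFit's actual implementation: `generate_pairs` is a hypothetical helper, the toy texts and label ids are invented for illustration, and the targets of 1.0 and 0.0 assume the Cosine Similarity Loss setup described above.

```python
import random


def generate_pairs(texts, labels, num_iterations, seed=42):
    """Build (text_a, text_b, target) triples for contrastive fine-tuning.

    Hypothetical helper: for every sample and iteration it draws one partner
    with the same label (target 1.0) and one with a different label (target 0.0).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            same = [j for j, other in enumerate(labels) if other == label and j != i]
            diff = [j for j, other in enumerate(labels) if other != label]
            if same:
                pairs.append((text, texts[rng.choice(same)], 1.0))
            if diff:
                pairs.append((text, texts[rng.choice(diff)], 0.0))
    return pairs


# Invented toy data: six short texts spread over three label ids.
texts = [
    "app crashes on start", "segfault when loading image",
    "add dark mode", "support new file format",
    "how do I build this?", "which version should I use?",
]
labels = [0, 0, 1, 1, 2, 2]

pairs = generate_pairs(texts, labels, num_iterations=2)
print(len(pairs))
for text_a, text_b, target in pairs[:4]:
    print(f"{target:.1f} | {text_a!r} <-> {text_b!r}")
```

With two pairs per sample per iteration, the pair count grows linearly with the number of iterations, which is why a small labeled set can still yield thousands of training pairs.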

In [11]:
results = defaultdict(dict) # dictionary to store the complete results in

# for each repo, the model will be trained on the training data specific to the repo
# and the fine-tuned model will be tested by performing predictions on the test dataset
# and comparing the values with the original labels
for repo in datasets.keys():
    train_set, test_set = datasets[repo]['train'], datasets[repo]['test']
    model = SetFitModel.from_pretrained(BASE_MODEL)

    trainer = SetFitTrainer(
        model=model,
        train_dataset=train_set,
        loss_class=CosineSimilarityLoss,
        metric="accuracy",
        batch_size=4,
        num_epochs=1,
        num_iterations=20,
        seed=RANDOM_SEED,  # use the seed declared at the top of the notebook
    )
    trainer.train()
    y_pred = trainer.model.predict(test_set['text'])
    results[repo]['metrics'] = classification_report(test_set['label'], y_pred, digits=4, output_dict=True)
    results[repo]['predictions'] = y_pred.tolist()
    results['label_mapping'] = {train_set.features["label"].int2str(x): x for x in range(train_set.features["label"].num_classes)}
    repo_name = repo.split("/")[1]
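    save_directory = f"./models/{repo_name}"
    # makedirs also creates the parent "models" directory if it is missing
    os.makedirs(save_directory, exist_ok=True)
    # use the public save method rather than the private _save_pretrained
    trainer.model.save_pretrained(save_directory)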

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


Map:   0%|          | 0/300 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 12000
  Batch size = 4
  Num epochs = 1
  Total optimization steps = 3000


#### Displaying the results

In [12]:
for repo in repos:
    print(repo)
    print(json.dumps(results[repo]['metrics'], indent=4))

opencv/opencv
{
    "0": {
        "precision": 0.71,
        "recall": 0.71,
        "f1-score": 0.7100000000000001,
        "support": 100
    },
    "1": {
        "precision": 0.81,
        "recall": 0.81,
        "f1-score": 0.81,
        "support": 100
    },
    "2": {
        "precision": 0.76,
        "recall": 0.76,
        "f1-score": 0.76,
        "support": 100
    },
    "accuracy": 0.76,
    "macro avg": {
        "precision": 0.7600000000000001,
        "recall": 0.7600000000000001,
        "f1-score": 0.7600000000000001,
        "support": 300
    },
    "weighted avg": {
        "precision": 0.76,
        "recall": 0.76,
        "f1-score": 0.76,
        "support": 300
    }
}
bitcoin/bitcoin
{
    "0": {
        "precision": 0.7472527472527473,
        "recall": 0.68,
        "f1-score": 0.7120418848167539,
        "support": 100
    },
    "1": {
        "precision": 0.8415841584158416,
        "recall": 0.85,
        "f1-score": 0.845771144278607,
        "support"

### Adding average and overall metrics

In [13]:
class_metrics_sum = defaultdict(defaultdict)
labels = [key for key in results[repos[0]]['metrics'].keys() if key.isnumeric()]

for repo in repos:
    for label in labels:
        for metric in results[repo]['metrics'][label]:
            class_metrics_sum[label][metric] = class_metrics_sum[label].get(metric, 0) + results[repo]['metrics'][label][metric]

class_metrics_avg = {
    label: {
        metric: class_metrics_sum[label][metric] / len(repos)
        for metric in class_metrics_sum[label]
    }
    for label in labels
}

# add the average of the metric over all classes
class_metrics_avg['average'] = {
    metric: sum(class_metrics_avg[label][metric] for label in labels)
    / len(labels)
    for metric in class_metrics_avg[labels[0]]
}

# add to the results
results['overall'] = {
    'metrics': class_metrics_avg
}
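
The averaging in this cell happens in two steps: each class metric is first averaged across repositories, and the per-class averages are then averaged across classes. A minimal sketch of the same idea, using made-up F1 values for two hypothetical repositories (the `toy_` names are not part of the notebook's data):

```python
# Toy per-repo, per-class F1 scores (made-up values, two hypothetical repos).
toy_results = {
    "org/repo-a": {"0": {"f1-score": 0.70}, "1": {"f1-score": 0.80}},
    "org/repo-b": {"0": {"f1-score": 0.90}, "1": {"f1-score": 0.60}},
}
toy_repos = list(toy_results)
toy_labels = ["0", "1"]

# Step 1: average each class over the repositories.
per_class = {
    label: sum(toy_results[repo][label]["f1-score"] for repo in toy_repos) / len(toy_repos)
    for label in toy_labels
}

# Step 2: average the per-class values over the classes.
overall = sum(per_class.values()) / len(toy_labels)

print({label: round(value, 4) for label, value in per_class.items()})
print(round(overall, 4))
```

Because every class has the same number of test samples in every repository, this unweighted average treats all repositories and classes equally.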


In [17]:
results

defaultdict(dict,
            {'facebook/react': {'metrics': {'0': {'precision': 0.8878504672897196,
                'recall': 0.95,
                'f1-score': 0.9178743961352657,
                'support': 100},
               '1': {'precision': 0.7857142857142857,
                'recall': 0.88,
                'f1-score': 0.830188679245283,
                'support': 100},
               '2': {'precision': 0.8518518518518519,
                'recall': 0.69,
                'f1-score': 0.7624309392265193,
                'support': 100},
               'accuracy': 0.84,
               'macro avg': {'precision': 0.8418055349519523,
                'recall': 0.84,
                'f1-score': 0.836831338202356,
                'support': 300},
               'weighted avg': {'precision': 0.8418055349519523,
                'recall': 0.84,
                'f1-score': 0.8368313382023559,
                'support': 300}},
              'predictions': [0,
               0,
               0

### Saving the results

In [15]:
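output_file_name = 'sentence_transformer_results.json'
# make sure the output directory exists before writing to it
os.makedirs(OUTPUT_PATH, exist_ok=True)
with open(os.path.join(OUTPUT_PATH, output_file_name), 'w') as fp:
    json.dump(results, fp)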

In [16]:
import urllib.request

your_url = 'https://github.com/nlbse2024/issue-report-classification/raw/main/output/results.json'
with urllib.request.urlopen(your_url) as url:
    sota_data = json.loads(url.read().decode())


In [22]:
# copy each metrics dict so that adding "process" does not modify results or sota_data
comparison_data = [
    {**results["overall"]["metrics"]["average"], "process": "SentenceTransformer with bge-base-en-v1.5"},
    {**sota_data["overall"]["metrics"]["average"], "process": "SOTA"},
]

comparison_df = pd.DataFrame(comparison_data)

In [24]:
comparison_df

| | precision | recall | f1-score | support | process |
|---|---|---|---|---|---|
| 0 | 0.799517 | 0.797333 | 0.797 | 100.0 | SentenceTransformer with bge-base-en-v1.5 |
| 1 | 0.830455 | 0.826667 | 0.827046 | 100.0 | SOTA |
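
To put the comparison in numbers, the gap to the SOTA row can be computed directly from the table. The dictionaries below simply restate the rounded figures shown above; the variable names are illustrative.

```python
# Values copied from the comparison table (rounded as displayed there).
ours = {"precision": 0.799517, "recall": 0.797333, "f1-score": 0.797}
sota = {"precision": 0.830455, "recall": 0.826667, "f1-score": 0.827046}

for metric in ours:
    gap = sota[metric] - ours[metric]
    print(f"{metric}: {gap:+.4f} absolute ({gap / sota[metric]:.1%} relative)")
```

On these figures the fine-tuned bge-base-en-v1.5 model trails the SOTA row by roughly 0.03 on precision, recall, and F1.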
