# **Issue Report Classification**

- Using Fine-tuned Sentence Transformer for GitHub Issue Report Classification
- Multi-class classification

## Project Description
- Dataset used: Issue Report Classification competition 2024 dataset
    - Link to dataset: https://github.com/nlbse2024/issue-report-classification
- Classes in dataset:
    - bug
    - feature
    - question

- The dataset was collected from:
    - bitcoin/bitcoin
    - facebook/react
    - microsoft/vscode
    - opencv/opencv
    - tensorflow/tensorflow


Checking if the current notebook runtime is a high-RAM runtime.

In [1]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 54.8 gigabytes of available RAM

You are using a high-RAM runtime!


Installing the required libraries

In [2]:
%pip install pandas
%pip install sentence-transformers
%pip install setfit
%pip install scikit-learn
%pip install datasets

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting setfit
  Downloading setfit-1.0.1-py3-none-any.whl (74 kB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting evaluate>=0.3.0 (from setfit)
  Downloading evaluate-0.

Importing the necessary libraries

In [3]:
import pandas as pd
import json
import os
from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss
from datasets import Dataset
from sklearn.metrics import classification_report
from collections import defaultdict

- Using the `BAAI/bge-base-en-v1.5` model, which ranked among the top 20 sentence transformer models in the Hugging Face sentence transformer rankings.
- Setting a seed value of 42.

In [4]:
# BASE_MODEL = "BAAI/bge-small-en-v1.5"
BASE_MODEL = "BAAI/bge-base-en-v1.5"
RANDOM_SEED = 42
OUTPUT_PATH = 'output'
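
The cell above only defines the seed value. As a minimal sketch of how such a value could be applied, the helper below seeds the random number generators that happen to be importable. The name `seed_everything` is hypothetical and not part of the notebook or of any library used here.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the random number generators available in this environment."""
    random.seed(seed)
    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch.
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch is optional for this sketch.


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)  # True
```

Re-seeding before each run makes the draws from Python's `random` module repeatable; GPU kernels can still introduce nondeterminism that seeding alone does not remove.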


Reading the preprocessed training and test data sets

In [5]:
train_set = pd.read_csv("preprocessed_issues_train.csv")
test_set = pd.read_csv("preprocessed_issues_test.csv")

In [6]:
train_set = train_set.rename(columns={"issue_text": "text"})
test_set = test_set.rename(columns={"issue_text": "text"})
train_set.columns


Index(['repo', 'text', 'label'], dtype='object')

In [7]:
repos = list(set(train_set["repo"].unique()))
print(repos)

['microsoft/vscode', 'opencv/opencv', 'bitcoin/bitcoin', 'facebook/react', 'tensorflow/tensorflow']



Grouping the train set values by repo and label

In [8]:
train_set.groupby(["repo", "label"]).size().unstack(fill_value=0)

| repo \ label | 0 | 1 | 2 |
|---|---|---|---|
| bitcoin/bitcoin | 100 | 100 | 100 |
| facebook/react | 100 | 100 | 100 |
| microsoft/vscode | 100 | 100 | 100 |
| opencv/opencv | 100 | 100 | 100 |
| tensorflow/tensorflow | 100 | 100 | 100 |
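
To make the shape of this table concrete, here is a rough standard-library equivalent of `groupby(...).size().unstack(fill_value=0)`. The function name `count_by_repo_and_label` and the toy rows are hypothetical; the notebook itself uses pandas.

```python
from collections import Counter


def count_by_repo_and_label(rows):
    """Return {repo: {label: count}}, filling missing combinations with 0."""
    counts = Counter(rows)
    repos = sorted({repo for repo, _ in rows})
    labels = sorted({label for _, label in rows})
    return {
        repo: {label: counts[(repo, label)] for label in labels}
        for repo in repos
    }


# Hypothetical toy rows of (repo, label); the real training set has 100 per pair.
toy_rows = [
    ("facebook/react", 0),
    ("facebook/react", 0),
    ("facebook/react", 1),
    ("opencv/opencv", 2),
]
table = count_by_repo_and_label(toy_rows)
print(table["facebook/react"])  # {0: 2, 1: 1, 2: 0}
print(table["opencv/opencv"])   # {0: 0, 1: 0, 2: 1}
```

A table where every cell holds the same count, as in the output above, indicates that the training data is balanced across both repositories and classes.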


Grouping the train and test set data items by repo

In [9]:
def group_by_repo(dataset):
    # Build one Hugging Face Dataset per repository, with "label" cast to a ClassLabel feature.
    return {
        repo: Dataset.from_pandas(dataset[dataset["repo"] == repo]).class_encode_column("label")
        for repo in dataset["repo"].unique()
    }

train_sets = group_by_repo(train_set)
test_sets = group_by_repo(test_set)


In [10]:
datasets = {
    repo: {'train': train_sets[repo], 'test': test_sets[repo]} for repo in train_sets.keys()
}

Training models for each repo using the training dataset for that repo and getting the predictions on the test dataset for that repo.
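
As a rough illustration of what the contrastive fine-tuning stage trains on, the sketch below builds similar and dissimilar text pairs from labeled examples. It is a simplified stand-in written for this explanation, not SetFit's actual sampler; the function name `generate_pairs` and the toy texts are hypothetical.

```python
import random


def generate_pairs(texts, labels, num_iterations, seed=42):
    """Build (text_a, text_b, target) triples for a cosine-similarity loss.

    Each pass adds, for every example, one pair with the same label
    (target 1.0) and one pair with a different label (target 0.0).
    Assumes every class has at least two examples and there are at least two classes.
    """
    rng = random.Random(seed)
    indices = range(len(texts))
    pairs = []
    for _ in range(num_iterations):
        for i in indices:
            same = [j for j in indices if j != i and labels[j] == labels[i]]
            other = [j for j in indices if labels[j] != labels[i]]
            pairs.append((texts[i], texts[rng.choice(same)], 1.0))
            pairs.append((texts[i], texts[rng.choice(other)], 0.0))
    return pairs


# Hypothetical toy data: two examples for each of three classes.
toy_texts = ["crash on start", "null pointer", "add dark mode",
             "support plugins", "how to install?", "where are the docs?"]
toy_labels = [0, 0, 1, 1, 2, 2]
pairs = generate_pairs(toy_texts, toy_labels, num_iterations=20)
print(len(pairs))  # 240
```

Under this simplified scheme, 300 training examples and 20 iterations would yield 300 × 20 × 2 = 12,000 pairs, or 3,000 batches at a batch size of 4. That is consistent with the step count in the training log, although this reading is an inference rather than something the notebook states.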

In [11]:
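results = defaultdict(dict)
for repo in datasets.keys():
    # Use distinct names so the DataFrames loaded earlier are not shadowed.
    repo_train, repo_test = datasets[repo]['train'], datasets[repo]['test']
    # Start from a fresh copy of the base model for every repository.
    model = SetFitModel.from_pretrained(BASE_MODEL)

    trainer = SetFitTrainer(
        model=model,
        train_dataset=repo_train,
        loss_class=CosineSimilarityLoss,
        metric="accuracy",
        batch_size=4,
        num_epochs=1,
        num_iterations=20,  # controls how many text pairs are generated for contrastive learning
        seed=RANDOM_SEED,  # apply the seed defined above (42 is also the trainer's default)
    )
    trainer.train()
    y_pred = trainer.model.predict(repo_test['text'])
    results[repo]['metrics'] = classification_report(repo_test['label'], y_pred, digits=4, output_dict=True)
    results[repo]['predictions'] = y_pred.tolist()
    # Every repo has the same classes, so overwriting the mapping on each pass is harmless.
    results['label_mapping'] = {repo_train.features["label"].int2str(x): x for x in range(repo_train.features["label"].num_classes)}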


model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.

***** Running training *****
  Num examples = 3000
  Num epochs = 1
  Total optimization steps = 3000
  Total train batch size = 4


Printing the results

In [12]:
print(results['label_mapping'])
for repo in repos:
    print(repo)
    print(json.dumps(results[repo]['metrics'], indent=4))

{'0': 0, '1': 1, '2': 2}
microsoft/vscode
{
    "0": {
        "precision": 0.8297872340425532,
        "recall": 0.78,
        "f1-score": 0.8041237113402062,
        "support": 100
    },
    "1": {
        "precision": 0.7610619469026548,
        "recall": 0.86,
        "f1-score": 0.8075117370892019,
        "support": 100
    },
    "2": {
        "precision": 0.8494623655913979,
        "recall": 0.79,
        "f1-score": 0.8186528497409327,
        "support": 100
    },
    "accuracy": 0.81,
    "macro avg": {
        "precision": 0.8134371821788687,
        "recall": 0.81,
        "f1-score": 0.8100960993901136,
        "support": 300
    },
    "weighted avg": {
        "precision": 0.8134371821788685,
        "recall": 0.81,
        "f1-score": 0.8100960993901136,
        "support": 300
    }
}
opencv/opencv
{
    "0": {
        "precision": 0.71,
        "recall": 0.71,
        "f1-score": 0.7100000000000001,
        "support": 100
    },
    "1": {
        "precision": 0.81
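
To show how the numbers in such a report relate to one another, the sketch below recomputes precision, recall, and F1 for class `0` of `microsoft/vscode`. The counts (78 correct out of 94 predicted and 100 actual) are inferred from the reported precision and recall, not taken from the raw predictions, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(true_positives, predicted_positives, actual_positives):
    """Compute precision, recall, and their harmonic mean (F1) from counts."""
    precision = true_positives / predicted_positives
    recall = true_positives / actual_positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts inferred from the report: precision 78/94, recall 78/100.
p, r, f1 = precision_recall_f1(78, 94, 100)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8298 0.78 0.8041
```

Because every class has a support of 100 in each test set, the macro and weighted averages in the report coincide.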


In [13]:
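# Accumulate per-class metric sums across repositories.
class_metrics_sum = defaultdict(dict)
# Per-class entries have numeric keys ('0', '1', '2'); this skips 'accuracy' and the averaged rows.
labels = [key for key in results[repos[0]]['metrics'].keys() if key.isnumeric()]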

for repo in repos:
    for label in labels:
        for metric in results[repo]['metrics'][label]:
            class_metrics_sum[label][metric] = class_metrics_sum[label].get(metric, 0) + results[repo]['metrics'][label][metric]

class_metrics_avg = {
    label: {
        metric: class_metrics_sum[label][metric] / len(repos)
        for metric in class_metrics_sum[label]
    }
    for label in labels
}

# add the average of the metric over all classes
class_metrics_avg['average'] = {
    metric: sum(class_metrics_avg[label][metric] for label in labels)
    / len(labels)
    for metric in class_metrics_avg[labels[0]]
}

# add to the results
results['overall'] = {
    'metrics': class_metrics_avg
}
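
The averaging in the cell above gives every repository equal weight. A compact sketch of the same idea follows; the helper name `average_reports` is hypothetical, and the toy reports contain only the class `0` recall values reported for `microsoft/vscode` and `facebook/react`.

```python
def average_reports(reports, labels):
    """Average every per-class metric across reports, weighting each report equally."""
    return {
        label: {
            metric: sum(report[label][metric] for report in reports) / len(reports)
            for metric in reports[0][label]
        }
        for label in labels
    }


# Class-0 recall for two repositories; all other metrics are omitted here.
toy_reports = [
    {"0": {"recall": 0.78}},
    {"0": {"recall": 0.95}},
]
averaged = average_reports(toy_reports, ["0"])
print(round(averaged["0"]["recall"], 3))  # 0.865
```

Note that this style of averaging also averages the `support` field, which stays at 100 here because every repository contributes the same number of test examples per class.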


In [14]:
results

defaultdict(dict,
            {'facebook/react': {'metrics': {'0': {'precision': 0.8878504672897196,
                'recall': 0.95,
                'f1-score': 0.9178743961352657,
                'support': 100},
               '1': {'precision': 0.7857142857142857,
                'recall': 0.88,
                'f1-score': 0.830188679245283,
                'support': 100},
               '2': {'precision': 0.8518518518518519,
                'recall': 0.69,
                'f1-score': 0.7624309392265193,
                'support': 100},
               'accuracy': 0.84,
               'macro avg': {'precision': 0.8418055349519523,
                'recall': 0.84,
                'f1-score': 0.836831338202356,
                'support': 300},
               'weighted avg': {'precision': 0.8418055349519523,
                'recall': 0.84,
                'f1-score': 0.8368313382023559,
                'support': 300}},
              'predictions': [0,
               0,
               0

Saving the results in a JSON file.

In [15]:
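output_file_name = 'results.json'
# Create the output directory first; open() fails if it does not exist.
os.makedirs(OUTPUT_PATH, exist_ok=True)
with open(os.path.join(OUTPUT_PATH, output_file_name), 'w') as fp:
    json.dump(results, fp)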
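
As a quick check of what ends up on disk, the sketch below writes a small `defaultdict` shaped like `results` to a temporary file and reads it back. The toy contents and the `some/repo` key are hypothetical.

```python
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical stand-in for the real results structure.
toy_results = defaultdict(dict)
toy_results["some/repo"]["predictions"] = [0, 2, 1]
toy_results["label_mapping"] = {"0": 0, "1": 1, "2": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.json")
    with open(path, "w") as fp:
        json.dump(toy_results, fp)
    with open(path) as fp:
        loaded = json.load(fp)

print(type(loaded).__name__)  # dict
print(loaded == toy_results)  # True
```

The file round-trips as a plain `dict`, so code that reloads it should not rely on `defaultdict` behavior such as automatic creation of missing keys.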