### Commit Classification and Prediction — Problem Description

Software repositories hosted on platforms such as GitHub, Bitbucket, and GitLab use commits to record changes to the source code, allowing teams of developers to collaborate effectively. Each commit includes a commit message, a brief summary describing the change. These messages generally indicate whether the change involves fixing a bug, adding a feature, updating documentation, refactoring code, or making other non-functional improvements.

Automatically analyzing these commit messages is valuable because software projects frequently suffer from information overload: developers spend significant time working out the nature of past changes. Automated classification reduces that effort and helps in identifying bug-prone commits and prioritizing maintenance tasks. Prior research suggests that commit categorization can accelerate issue resolution, enhance code review efficiency, and support project management workflows.

This project focuses on commit message classification, where the objective is to predict the macro-type of a software commit using only its natural-language commit message. We treat this as a supervised machine learning problem, leveraging a labeled dataset compiled from well-established and trusted sources. The dataset categorizes commits into five broad classes:

- Corrective — commits related to bug fixes or fault correction

- Feature — commits introducing new functionality or enhancements

- Non-Functional — documentation updates, formatting, or other structural changes that do not alter behavior

- Perfective — improvements to code quality, refactoring, or maintainability enhancements

- Unknown — commits that are auto-generated or whose purpose cannot be clearly determined

These categories provide a meaningful high-level taxonomy that supports automated analysis of software evolution and enables more efficient maintainability workflows.

To address this prediction task, we evaluate both traditional machine learning techniques (TF-IDF combined with Logistic Regression, Naive Bayes, and Random Forest) and a modern deep learning model (BERT). By applying natural language processing methods, our aim is to identify the most effective model for commit classification and to highlight how automated commit analysis can significantly aid software maintenance, quality assurance, and project analytics.

## Install the packages

In [None]:
!pip install -q transformers accelerate datasets sentencepiece

## Import the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import torch
import os


from datasets import load_dataset, Dataset
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    f1_score,
    ConfusionMatrixDisplay
)

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)

In [None]:
# Disable Weights & Biases logging
os.environ["WANDB_DISABLED"] = "true"

## Download English Stopwords
We download English stopwords because they are common words such as "the", "is", "and", and "of" that usually carry little information for text classification.

By removing stopwords during preprocessing, we:
- Reduce noise in the text,
- Focus the model on meaningful words, and
- Improve model performance and efficiency.

In [None]:
# Download the NLTK English stopwords list (common words to remove during text preprocessing)
nltk.download("stopwords")

# Load the English stopwords into a set for faster lookup during text cleaning
stop_words = set(stopwords.words("english"))

In [None]:
# Check device GPU or CPU
print("Torch device:", "cuda" if torch.cuda.is_available() else "cpu")

## Load the dataset

In [None]:
dataset = load_dataset("0x404/ccs_dataset")

In [None]:
# Convert each split to Pandas DataFrame
train_df = dataset["train"].to_pandas()
train_df

In [None]:
train_df.columns

In [None]:
val_df   = dataset["eval"].to_pandas()
val_df

In [None]:
test_df  = dataset["test"].to_pandas()
test_df

In [None]:
print("\nDATASET SIZES:")
for split in dataset.keys():
    print(f"{split}: {len(dataset[split])} samples")

In [None]:
# Convert to Pandas DataFrames and rename columns
train_df_raw = dataset["train"].to_pandas()[["masked_commit_message", "annotated_type"]]
val_df_raw   = dataset["eval"].to_pandas()[["masked_commit_message", "annotated_type"]]
test_df_raw  = dataset["test"].to_pandas()[["masked_commit_message", "annotated_type"]]

train_df_raw = train_df_raw.rename(columns={"masked_commit_message": "Message", "annotated_type": "Ground truth"})
val_df_raw   = val_df_raw.rename(columns={"masked_commit_message": "Message", "annotated_type": "Ground truth"})
test_df_raw  = test_df_raw.rename(columns={"masked_commit_message": "Message", "annotated_type": "Ground truth"})

In [None]:
print("\n Raw training data (First 5 rows)")
train_df_raw.head()

## Clean and Preprocess Commit Messages


This function cleans commit messages by converting to lowercase, removing URLs,
keeping only alphabetic characters, removing extra spaces, and filtering out stopwords.


In [None]:
def preprocess(df):
    """
    Clean commit messages:
    - Convert to lowercase
    - Remove URLs
    - Keep only alphabetic characters
    - Remove stopwords
    """
    df = df.copy()
    df["Message"] = df["Message"].astype(str)

    def clean(t):
        t = t.lower()
        t = re.sub(r"http\S+", "", t)          # Remove URLs
        t = re.sub(r"[^a-z\s]", " ", t)       # Remove non-alphabetic characters
        t = re.sub(r"\s+", " ", t).strip()    # Remove extra spaces
        return " ".join([w for w in t.split() if w not in stop_words])

    df["clean_message"] = df["Message"].apply(clean)
    return df
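To make the cleaning steps concrete, the sketch below applies the same regular expressions to a single made-up commit message. The small `demo_stop_words` set and the `clean_demo` helper are assumptions for illustration only; they stand in for the NLTK list so the example runs without a download.

```python
import re

# Hypothetical stand-in for the NLTK English stopword list (much smaller)
demo_stop_words = {"the", "a", "in", "to", "is", "of", "and", "for"}

def clean_demo(text):
    """Mirror the cleaning steps of preprocess() on a single string."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)        # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return " ".join(w for w in text.split() if w not in demo_stop_words)

sample = "Fix NPE in the UserService (#42), see https://example.com/issue/42"
cleaned = clean_demo(sample)
print(cleaned)  # fix npe userservice see
```

Note that digits and punctuation disappear entirely, so issue numbers, version strings, and identifiers such as `v2.1` do not survive this cleaning.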

In [None]:
train_df = preprocess(train_df_raw)
val_df   = preprocess(val_df_raw)
test_df  = preprocess(test_df_raw)

In [None]:
print("\n Cleaned Training Data (First 5 rows)")

train_df.head()

## Exploratory Data Analysis

### Label Distribution Plot
Shows how frequently each ground-truth label appears in the training set, helping identify class imbalance.

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(y=train_df["Ground truth"], order=train_df["Ground truth"].value_counts().index)
plt.title("Label Distribution")
plt.show()

In [None]:
train_df["msg_len"] = train_df["clean_message"].apply(lambda x: len(x.split()))
train_df

### Message Length Distribution
Displays how long commit messages typically are, revealing patterns such as very short or very long messages.

In [None]:
plt.figure(figsize=(8,5))
sns.histplot(train_df["msg_len"], bins=40)
plt.title("Message Length Distribution")
plt.show()

## TF-IDF Feature Extraction
Converts text into numerical vectors by measuring how important each word is within a message relative to the entire dataset. This helps machine-learning models understand and compare commit messages based on their content.



In [None]:
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df["clean_message"])
X_val   = vectorizer.transform(val_df["clean_message"])
X_test  = vectorizer.transform(test_df["clean_message"])

y_train = train_df["Ground truth"]
y_val   = val_df["Ground truth"]
y_test  = test_df["Ground truth"]
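The weighting can be worked through by hand on a toy corpus. The following is a minimal sketch, assuming the default `TfidfVectorizer` settings (smoothed IDF and L2 normalization) and using plain whitespace splitting instead of the vectorizer's own tokenizer; the three messages and the `tfidf` helper are invented for illustration.

```python
import math
from collections import Counter

# Made-up commit messages, already cleaned
docs = ["fix login bug", "add login page", "fix typo"]

def tfidf(docs):
    """Compute smoothed-IDF, L2-normalized TF-IDF vectors as dicts."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many messages does each term occur?
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in tokenized:
        term_freq = Counter(doc)
        raw = {t: term_freq[t] * idf[t] for t in term_freq}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: v / norm for t, v in raw.items()})  # unit length
    return idf, vectors

idf, vectors = tfidf(docs)
print({t: round(w, 3) for t, w in vectors[0].items()})
```

In the first message, "bug" occurs in only one document and therefore receives a larger weight than "fix" or "login", which each occur in two.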

## Train Classical Machine Learning Models

## Logistic Regression Model
A linear classification model that learns weighted features to predict labels; class_weight="balanced" helps handle class imbalance.

In [None]:
# Logistic Regression
log = LogisticRegression(max_iter=2000, class_weight="balanced")
log.fit(X_train, y_train)
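For intuition about `class_weight="balanced"`: scikit-learn documents the weight of each class as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula on an invented, imbalanced label list; `balanced_class_weights` and `y_demo` are hypothetical names, not part of the notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}

# Made-up imbalanced labels: 6 fix, 3 feat, 1 docs
y_demo = ["fix"] * 6 + ["feat"] * 3 + ["docs"] * 1
weights = balanced_class_weights(y_demo)
print(weights)
```

Rare classes receive proportionally larger weights, so errors on them cost more during training.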

## Naive Bayes Model
A probabilistic classifier based on word-frequency statistics, assuming features are conditionally independent—fast and effective for text.

In [None]:
# Naive bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
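The idea behind multinomial Naive Bayes can be sketched in a few lines. This is a simplified illustration on raw word counts with Laplace smoothing (`alpha=1.0`), not the scikit-learn implementation, and the training messages are invented; the notebook itself feeds TF-IDF weights rather than counts.

```python
import math
from collections import Counter, defaultdict

# Made-up (message, label) pairs
train = [
    ("fix crash on login", "fix"),
    ("fix null pointer", "fix"),
    ("add login page", "feat"),
    ("add export feature", "feat"),
]

def fit_nb(samples, alpha=1.0):
    """Collect per-class document counts and per-class word counts."""
    class_docs, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in samples:
        class_docs[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab, alpha

def predict_nb(model, text):
    """Return the class with the highest log prior + log likelihood."""
    class_docs, word_counts, vocab, alpha = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # log prior
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                num = word_counts[label][word] + alpha
                den = total_words + alpha * len(vocab)
                score += math.log(num / den)  # smoothed log likelihood
        scores[label] = score
    return max(scores, key=scores.get)

model_demo = fit_nb(train)
print(predict_nb(model_demo, "fix login crash"))  # fix
```

Each word contributes independently to the score, which is the conditional-independence assumption in action.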

## Random Forest
An ensemble of many decision trees that vote on the final prediction, improving accuracy and reducing overfitting.

In [None]:
# Random Forest
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)
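One detail about the "voting": according to the scikit-learn documentation, `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting one hard vote per tree. The sketch below contrasts the two schemes on invented per-tree probabilities; `hard_vote` and `soft_vote` are hypothetical helpers.

```python
from collections import Counter

# Made-up class probabilities from three trees for one commit message
tree_probas = [
    {"fix": 0.55, "feat": 0.45},
    {"fix": 0.60, "feat": 0.40},
    {"fix": 0.05, "feat": 0.95},
]

def hard_vote(probas):
    """Each tree votes for its most likely class; the majority wins."""
    votes = Counter(max(p, key=p.get) for p in probas)
    return votes.most_common(1)[0][0]

def soft_vote(probas):
    """Average probabilities across trees, then pick the largest."""
    classes = probas[0].keys()
    avg = {c: sum(p[c] for p in probas) / len(probas) for c in classes}
    return max(avg, key=avg.get), avg

print(hard_vote(tree_probas))     # fix
print(soft_vote(tree_probas)[0])  # feat
```

Two weakly confident trees outvote one confident tree under hard voting, while averaging lets the confident tree dominate.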

In [None]:
# Get feature importances from Random Forest
importances = rf.feature_importances_

# Get feature names from TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get indices of the top 10 most important features (descending order)
indices = np.argsort(importances)[-10:][::-1]

# Plot
plt.figure(figsize=(8,6))
sns.barplot(x=importances[indices], y=np.array(feature_names)[indices], palette="magma")
plt.title("Top 10 Random Forest TF-IDF Features")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()

## Evaluation of TF-IDF (Term Frequency–Inverse Document Frequency) Models
Each trained model is evaluated using accuracy and macro-F1 scores on the training, validation, and test sets. This helps compare how well the models learn from the data, generalize to unseen data, and handle class imbalance. The resulting table summarizes the performance of Logistic Regression, Naive Bayes, and Random Forest side-by-side.

In [None]:
def evaluate_model(name, model, X_train, X_val, X_test, y_train, y_val, y_test):
    return {
        "Model": name,
        "Train Acc": accuracy_score(y_train, model.predict(X_train)),
        "Val Acc": accuracy_score(y_val, model.predict(X_val)),
        "Test Acc": accuracy_score(y_test, model.predict(X_test)),
        "Train F1": f1_score(y_train, model.predict(X_train), average="macro"),
        "Val F1": f1_score(y_val, model.predict(X_val), average="macro"),
        "Test F1": f1_score(y_test, model.predict(X_test), average="macro"),
    }
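To see why macro-F1 is reported next to accuracy, the sketch below computes it by hand on an invented imbalanced example in which a model always predicts the majority class. `macro_f1` is a hypothetical helper that follows the usual definition (unweighted mean of per-class F1).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes that appear."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Made-up labels: 8 "fix", 2 "docs"; the model predicts "fix" every time
y_true_demo = ["fix"] * 8 + ["docs"] * 2
y_pred_demo = ["fix"] * 10

accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
macro_demo = macro_f1(y_true_demo, y_pred_demo)
print(accuracy_demo, round(macro_demo, 3))  # 0.8 0.444
```

Accuracy looks respectable at 0.8, but macro-F1 drops to about 0.44 because the "docs" class is never predicted.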

In [None]:
results_detailed = [
    evaluate_model("Logistic Regression", log, X_train, X_val, X_test, y_train, y_val, y_test),
    evaluate_model("Naive Bayes", nb, X_train, X_val, X_test, y_train, y_val, y_test),
    evaluate_model("Random Forest", rf, X_train, X_val, X_test, y_train, y_val, y_test),
]

df_compare = pd.DataFrame(results_detailed)
print("\n Training / Validation / Test data performance results (TF-IDF Models) ")
df_compare

In [None]:
labels = sorted(train_df["Ground truth"].unique())

### Confusion Matrix
A confusion matrix is used to visualize how well a classification model performs by showing the counts of correct and incorrect predictions for each class.
It helps identify which classes the model predicts accurately and where it makes mistakes.

In [None]:
def plot_confusion_matrix(y_true, y_pred, title, cmap):
    plt.figure(figsize=(10,7))
    sns.heatmap(
        confusion_matrix(y_true, y_pred, labels=labels),
        annot=True,
        cmap=cmap,
        fmt="d",
        xticklabels=labels,
        yticklabels=labels
    )
    plt.title(title)
    plt.show()
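The counts behind such a heatmap can be built by hand. The sketch below uses the same orientation as `confusion_matrix` (rows are true labels, columns are predicted labels); the labels, predictions, and the `confusion_counts` helper are invented for illustration.

```python
def confusion_counts(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts true label i predicted as j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

demo_labels = ["feat", "fix", "refactor"]
y_true_demo = ["feat", "feat", "fix", "fix", "refactor", "refactor"]
y_pred_demo = ["feat", "refactor", "fix", "fix", "feat", "refactor"]

cm_demo = confusion_counts(y_true_demo, y_pred_demo, demo_labels)
for label, row in zip(demo_labels, cm_demo):
    print(f"{label:>8}: {row}")
```

Off-diagonal entries show which pairs of classes are confused; here "feat" and "refactor" are mistaken for each other once in each direction.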

In [None]:
plot_confusion_matrix(y_test, log.predict(X_test), "Logistic Regression — Confusion Matrix", "Blues")

In [None]:
plot_confusion_matrix(y_test, nb.predict(X_test),  "Naive Bayes — Confusion Matrix", "Oranges")

The Naive Bayes model performs well overall, with strong correct predictions for classes such as build, ci, docs, perf, style, and test, but it notably confuses semantically similar categories such as feat, fix, and refactor.

In [None]:
plot_confusion_matrix(y_test, rf.predict(X_test),  "Random Forest — Confusion Matrix", "Greens")

## Prepare data for BERT (Bidirectional Encoder Representations from Transformers)
BERT is a pre-trained language model developed by Google that represents each word using context from both directions (left and right), which makes it effective for NLP tasks such as classification, sentiment analysis, and question answering. For commit messages, this contextual representation can capture technical terms, action verbs, and relationships between words that bag-of-words models (TF-IDF + classical ML) may miss, which may help separate commit types such as bug fix, feature, and refactor.

Implementation Explanation: This code prepares commit messages for BERT-based classification. It converts textual labels to numeric IDs, formats the data as HuggingFace Dataset objects, tokenizes the messages into BERT-compatible input IDs and attention masks, removes the original text, and sets the datasets to PyTorch format for model training. This ensures the data is ready for fine-tuning a BERT model.

In [None]:
label_list = labels
label_list

In [None]:
# Create label-to-ID and ID-to-label mappings

label2id = {l:i for i,l in enumerate(label_list)}
id2label = {i:l for l,i in label2id.items()}
id2label

In [None]:
# Convert original dataframe into a format suitable for BERT

def make_bert_df(df):
    x = df[["clean_message", "Ground truth"]].copy()
    x = x.rename(columns={"clean_message": "text", "Ground truth": "label"})
    x["label"] = x["label"].map(label2id)
    return x

In [None]:
# Convert DataFrames to HuggingFace Dataset objects

train_bert = Dataset.from_pandas(make_bert_df(train_df))
val_bert   = Dataset.from_pandas(make_bert_df(val_df))
test_bert  = Dataset.from_pandas(make_bert_df(test_df))

In [None]:
# Load BERT tokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [None]:
# Tokenization function to convert text → BERT input IDs & attention masks

def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,            # Cut text longer than max_length
        padding="max_length",       # Pad all sequences to a fixed length
        max_length=128              # Maximum token length
        )
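What `truncation=True` and `padding="max_length"` do to a sequence can be sketched without the tokenizer. In the illustration below the word-piece IDs are made up and `pad_or_truncate` is a hypothetical helper; only the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) match the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special tokens in bert-base-uncased

def pad_or_truncate(token_ids, max_length):
    """Add [CLS]/[SEP], cut to max_length, then pad and build the mask."""
    body = token_ids[: max_length - 2]           # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)        # 1 = real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 = padding

# Made-up word-piece IDs for a short and a long message
short_ids, short_mask = pad_or_truncate([2001, 2002, 2003], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(3000, 3010)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=128`, a typical short commit message consists mostly of padding, and the attention mask tells the model to ignore those positions.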

In [None]:
# Tokenize datasets and remove raw text column (BERT uses tokenized inputs instead)

train_tok = train_bert.map(tokenize, batched=True).remove_columns(["text"])
val_tok   = val_bert.map(tokenize, batched=True).remove_columns(["text"])
test_tok  = test_bert.map(tokenize, batched=True).remove_columns(["text"])

In [None]:
# Convert tokenized datasets into PyTorch format for training

train_tok.set_format("torch")
val_tok.set_format("torch")
test_tok.set_format("torch")

## BERT Model Training

In [None]:
# Load pre-trained BERT model for sequence classification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

In [None]:
# Define training hyperparameters and settings
args = TrainingArguments(
    output_dir="bert_ccs",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to="none"
)

In [None]:
# Define evaluation metrics function

def metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "accuracy": accuracy_score(p.label_ids, preds),
        "f1_macro": f1_score(p.label_ids, preds, average="macro")
    }
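The `np.argmax` step turns the model's raw scores (logits) into class IDs by picking the highest-scoring column in each row. A minimal pure-Python sketch with invented logits for three messages and three classes:

```python
# Made-up logits: one row per message, one column per class ID
logits_demo = [
    [2.1, 0.3, -1.0],
    [0.2, 0.1, 1.7],
    [-0.5, 1.2, 0.9],
]
label_ids_demo = [0, 2, 2]  # made-up true class IDs

# Index of the largest score in each row, equivalent to argmax along axis 1
preds_demo = [max(range(len(row)), key=row.__getitem__) for row in logits_demo]
accuracy_demo = sum(p == t for p, t in zip(preds_demo, label_ids_demo)) / len(label_ids_demo)

print(preds_demo, round(accuracy_demo, 3))  # [0, 2, 1] 0.667
```

No softmax is needed for this step, because softmax preserves the ordering of the scores.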

In [None]:
# Initialize the Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    tokenizer=tokenizer,
    compute_metrics=metrics,
)

In [None]:
# Train the model
trainer.train()

The BERT model achieved a test accuracy of 57.5% and an F1-score of 56.5%, demonstrating its capability to capture semantic patterns in commit messages. Its main difficulty was distinguishing commit types with overlapping meanings, such as "feature addition" vs. "code refactoring".

In [None]:
metrics_history = trainer.state.log_history
metrics_history

## Validation Metrics Over Epochs

In [None]:
# Only keep entries with evaluation metrics
eval_metrics = [m for m in metrics_history if "eval_accuracy" in m]

epochs = [m["epoch"] for m in eval_metrics]
val_acc = [m["eval_accuracy"] for m in eval_metrics]
val_f1 = [m["eval_f1_macro"] for m in eval_metrics]

# Plot
plt.figure(figsize=(8,5))
plt.plot(epochs, val_acc, label="Validation Accuracy", marker='o')
plt.plot(epochs, val_f1, label="Validation F1 Macro", marker='x')
plt.title("Validation Metrics Over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Score")
plt.legend()
plt.show()
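The same log entries can be used to find the best epoch programmatically. The sketch below runs on an invented log with the same keys as `trainer.state.log_history`; all numbers are hypothetical. With `load_best_model_at_end=True` and no `metric_for_best_model`, the `TrainingArguments` documentation describes evaluation loss as the default criterion, which need not agree with the epoch that has the best macro-F1.

```python
# Made-up log entries shaped like trainer.state.log_history
log_history_demo = [
    {"loss": 1.90, "epoch": 1.0},  # training-loss entry, no eval metrics
    {"eval_loss": 1.50, "eval_accuracy": 0.50, "eval_f1_macro": 0.45, "epoch": 1.0},
    {"eval_loss": 1.30, "eval_accuracy": 0.56, "eval_f1_macro": 0.54, "epoch": 2.0},
    {"eval_loss": 1.40, "eval_accuracy": 0.57, "eval_f1_macro": 0.55, "epoch": 3.0},
]

# Keep only the entries that carry evaluation metrics
eval_rows = [m for m in log_history_demo if "eval_f1_macro" in m]

best_by_f1 = max(eval_rows, key=lambda m: m["eval_f1_macro"])
best_by_loss = min(eval_rows, key=lambda m: m["eval_loss"])

print("Best epoch by macro-F1:", best_by_f1["epoch"])
print("Best epoch by eval loss:", best_by_loss["epoch"])
```

If macro-F1 is the metric of interest, it is worth checking which criterion was used to select the restored checkpoint.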

## Label Distribution Visualization

In [None]:
sns.countplot(y=train_df["Ground truth"], order=train_df["Ground truth"].value_counts().index)
plt.title("Label Distribution in Training Set")
plt.show()

## Prediction Distribution

In [None]:
bert_preds = trainer.predict(test_tok)
bert_pred = np.argmax(bert_preds.predictions, axis=1)
bert_true = bert_preds.label_ids

print("\nBERT Classification Report:\n",
      classification_report(bert_true, bert_pred, target_names=label_list))

In [None]:
sns.countplot(y=pd.Series(bert_pred).map(lambda x: id2label[x]))
plt.title("Prediction Distribution on Test Set")
plt.show()

## BERT Testing Evaluation

## Sample Misclassifications
To understand the errors, we show a few commit messages that were misclassified:

In [None]:
misclassified = test_df.copy()
misclassified['pred'] = bert_pred
misclassified = misclassified[misclassified['pred'] != misclassified['Ground truth'].map(label2id)]
misclassified[['Message', 'Ground truth', 'pred']].head(10)

In [None]:
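# Confusion matrix (raw counts)
cm_bert = confusion_matrix(bert_true, bert_pred, labels=list(range(len(label_list))))
plt.figure(figsize=(10,7))
sns.heatmap(cm_bert, annot=True, fmt="d", cmap="Purples",   # fmt="d" prints counts as integers
            xticklabels=label_list, yticklabels=label_list)
plt.title("BERT Confusion Matrix")
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.show()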

In [None]:
# Normalized Confusion Matrix
def plot_normalized_confusion_matrix(y_true, y_pred, title):
    labels = np.arange(len(label_list))  # numeric labels 0 to 9
    cm = confusion_matrix(y_true, y_pred, labels=labels, normalize='true')

    plt.figure(figsize=(10,7))
    sns.heatmap(cm, annot=True, fmt=".2f", cmap="Purples",
                xticklabels=label_list, yticklabels=label_list)
    plt.title(title)
    plt.ylabel("True Label")
    plt.xlabel("Predicted Label")
    plt.show()
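Row normalization (`normalize='true'`) divides each row by the number of true samples in that class, so the diagonal becomes per-class recall. A minimal sketch on an invented 2x2 count matrix, with `normalize_rows` as a hypothetical helper:

```python
def normalize_rows(matrix):
    """Divide each row by its sum; rows with no samples become all zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

# Made-up counts: rows are true classes, columns are predicted classes
counts_demo = [
    [8, 2],   # 10 true samples of class 0, 8 predicted correctly
    [5, 5],   # 10 true samples of class 1, 5 predicted correctly
]
norm_demo = normalize_rows(counts_demo)
print(norm_demo)  # [[0.8, 0.2], [0.5, 0.5]]
```

This makes classes of different sizes directly comparable, which raw counts do not.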

In [None]:
plot_normalized_confusion_matrix(bert_true, bert_pred, "BERT Normalized Confusion Matrix")

## BERT Training / Validation / Test performance


In [None]:
train_out = trainer.predict(train_tok)
bert_train_pred = np.argmax(train_out.predictions, axis=1)
bert_train_true = train_out.label_ids

val_out = trainer.predict(val_tok)
bert_val_pred = np.argmax(val_out.predictions, axis=1)
bert_val_true = val_out.label_ids

bert_results = {
    "Model": "BERT",
    "Train Acc": accuracy_score(bert_train_true, bert_train_pred),
    "Val Acc": accuracy_score(bert_val_true, bert_val_pred),
    "Test Acc": accuracy_score(bert_true, bert_pred),
    "Train F1": f1_score(bert_train_true, bert_train_pred, average="macro"),
    "Val F1": f1_score(bert_val_true, bert_val_pred, average="macro"),
    "Test F1": f1_score(bert_true, bert_pred, average="macro"),
}

print("\n=== BERT TRAIN / VAL / TEST PERFORMANCE ===")
print(pd.DataFrame([bert_results]))

## Observations
Overall, TF-IDF + Logistic Regression gives the best balance of performance, simplicity, and interpretability on this commit-message task. The analysis would benefit most from making label mappings explicit, emphasizing macro-F1 and per-class metrics, and adding a concise side-by-side comparison and a brief error analysis (especially versus BERT) to show where each model helps or struggles.