# Legal Document Classifier - NLP Code Challenge

## Objective
Develop a robust NLP pipeline to classify legal case reports into areas of law using the provided dataset (`sample_200_rows.csv`). The pipeline includes preprocessing, modeling, evaluation, and a bonus API for inference.

## Dataset Overview
- **Columns**: `case_title`, `suitno`, `introduction`, `facts`, `issues`, `decision`, `full_report`
- **Input**: `full_report` (text of the legal judgment)
- **Label**: Area of law extracted from `introduction` (e.g., Civil Procedure, Enforcement of Fundamental Rights)

## Approach
1. **Preprocessing**: Clean `full_report`, extract and standardize labels, handle imbalance.
2. **Modeling**: Use TF-IDF with Logistic Regression, SVM, Random Forest, and fine-tune BERT.
3. **Evaluation**: Assess with Accuracy, F1-Score, Confusion Matrix, and visualizations.
4. **Inference**: Save models for API use in `app/api.py`.

## Evaluation Criteria
- Data Preprocessing: 20%
- Model Performance: 30%
- Code Quality & Structure: 20%
- Clarity of Comments/README: 10%
- Bonus API: 20%

In [99]:
# Import libraries
import re
import os
import gc
import json
import psutil
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from imblearn.over_sampling import SMOTE
from nlpaug.augmenter.word import SynonymAug

from sklearn.svm import SVC
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (accuracy_score, f1_score, precision_score, recall_score,
                             confusion_matrix, classification_report)

from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback

import torch
import pickle
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

<torch._C.Generator at 0x13d3352f0>

## 1. Data Preprocessing

In [3]:
# Load dataset
df = pd.read_csv('../data/sample_200_rows.csv')
df.head(5)

Unnamed: 0,case_title,suitno,introduction,facts,issues,decision,full_report
0,ASSET MANAGEMENT GROUP LIMITED v. GENESISCORP ...,CA/L/236M/95,This appeal borders on Civil Procedure.\n,The appellant as Plaintiff before the Lagos Hi...,The Appellant formulated the following issues ...,"On the whole, the Court of Appeal held that th...","GEORGE ADESOLA&nbsp;OGUNTADE, J.C.A. (Deliveri..."
1,JAMES EBELE & ANOR v. ROBERT IKWEKI & ORS,CA/B/53M/2006,This is a ruling on an Application seeking Lea...,The present application flows from the Judgmen...,The Court determined the proprietary or otherw...,"In the final analysis, the Court of Appeal hel...",\nCHIOMA EGONDU NWOSU-IHEME&nbsp;J.C.A. (Deli...
2,CENTRAL BANK OF NIGERIA v. MR TOMMY OKECHUKWU ...,CA/K/304/2020,This appeal borders on propriety of requiremen...,This appeal emanated from the decision of the ...,The Court of Appeal determined the appeal base...,"In the end, the Court of Appeal resolved the s...","PETER OYINKENIMIEMI AFFEN, J.C.A. (Delivering ..."
3,MOHAMMED AUWAL & ORS v. THE FEDERAL REPUBLIC O...,CA/J/183C/2011,This appeal borders on Criminal Law and Proced...,This appeal is against the judgment of the Fed...,The Court determined the appeal on the followi...,"In conclusion, the appeal was dismissed.\n","IBRAHIM SHATA BDLIYA, J.C.A. (Deliveringthe Le..."
4,UNITED BANK FOR AFRICA PLC & ORS v. MR. UGOCHU...,CA/OW/385M/2012,This appeal borders on Enforcement of Fundamen...,This is an appeal against the judgment of NGOZ...,Appellant formulated 4 issues while the Respon...,"On the whole, the Court found no merit in the ...","FREDERICK OZIAKPONO&nbsp;OHO, J.C.A. (Deliveri..."


In [4]:
# Exploratory Data Analysis (EDA)
print('Dataset Shape:', df.shape)
print('\nInitial Label Distribution:')
print(df['introduction'].value_counts())

Dataset Shape: (200, 7)

Initial Label Distribution:
This appeal borders on Civil Procedure.\n                                                   19
This appeal borders on Civil Procedure.                                                     13
This appeal borders on Land Law.\n                                                           8
This appeal borders on Election Petition.\n                                                  7
This appeal borders on civil procedure.                                                      6
                                                                                            ..
This appeal borders on the issue of locus standi to challenge a political party primary.     1
This is an appeal borders on Criminal Law and Procedure.\n                                   1
This appeal borders on Award of Damages.\n                                                   1
This appeal borders on the issue of jurisdiction.                                           

In [5]:
# Clean text function
def clean_text(text):
    """Clean text by removing special characters, citations, HTML entities, and normalizing spaces."""
    if pd.isna(text):
        return ''
    text = text.lower()
    text = re.sub(r'&\w+;', ' ', text)  # Remove HTML entities
    text = re.sub(r'\[\d{4}\].*?\(pt\.\s*\d+\)', ' ', text)  # Remove citations
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # Remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize spaces
    if not text:
        return ''
    return text

df['full_report_cleaned'] = df['full_report'].apply(clean_text)
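
As a quick sanity check, the cleaning steps can be exercised on a made-up string. The sketch below re-implements the same regular expressions without pandas so that it runs standalone; the sample sentence and the name `clean_text_demo` are illustrative assumptions, not rows or helpers from the notebook.

```python
import re

def clean_text_demo(text):
    """Standalone copy of the cleaning steps (hypothetical helper for illustration)."""
    if text is None:
        return ''
    text = text.lower()
    text = re.sub(r'&\w+;', ' ', text)                       # HTML entities such as &nbsp;
    text = re.sub(r'\[\d{4}\].*?\(pt\.\s*\d+\)', ' ', text)  # law-report citations
    text = re.sub(r'[^a-z\s]', ' ', text)                    # digits and punctuation
    return re.sub(r'\s+', ' ', text).strip()                 # collapse whitespace

# Invented sample that mimics the style of the judgments.
sample = "GEORGE ADESOLA&nbsp;OGUNTADE, J.C.A.: See [1995] 3 NWLR (Pt. 382) 148 at p. 5."
cleaned = clean_text_demo(sample)
print(cleaned)
```

Note that abbreviations such as `J.C.A.` are split into single letters by the punctuation step, which may or may not matter for the downstream TF-IDF features.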

In [6]:
# Enhanced label extraction
def extract_label(intro):
    """Extract and standardize the area of law via keyword matching, with a simple keyword fallback."""
    if pd.isna(intro):
        return 'Unknown'
    intro = intro.lower().strip()
    keywords = {
        'civil procedure': 'Civil Procedure',
        'enforcement of fundamental rights': 'Enforcement of Fundamental Rights',
        'election petition': 'Election Petition',
        'garnishee proceedings': 'Garnishee Proceedings',
        'criminal law': 'Criminal Law',
        'company law': 'Company Law',
        'land law': 'Property Law',
        'jurisdiction': 'Civil Procedure',
        'damages': 'Civil Law',
    }
    for key, value in keywords.items():
        if key in intro:
            return value
    # Fallback: crude substring check for rights-related introductions
    if 'right' in intro or 'human' in intro:
        return 'Enforcement of Fundamental Rights'
    return 'Other'
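
Because any introduction that misses the keyword table falls through to `Other`, it can help to look at the phrase that actually follows "borders on" before deciding which keywords to add. The sketch below is one possible way to do that; `extract_area_phrase` and its regular expression are assumptions for illustration, and the sample strings are copied from the `value_counts()` output above.

```python
import re

# Hypothetical helper: capture the phrase after "borders on", dropping an
# optional "the issue of" lead-in and the trailing period.
_AREA_RE = re.compile(r'borders on\s+(?:the issue of\s+)?(.+?)\.?\s*$', re.IGNORECASE)

def extract_area_phrase(intro):
    """Return the lower-cased phrase following 'borders on', or None if absent."""
    if not isinstance(intro, str):
        return None
    match = _AREA_RE.search(intro.strip())
    return match.group(1).strip().lower() if match else None

examples = [
    "This appeal borders on Civil Procedure.\n",
    "This appeal borders on civil procedure.",
    "This appeal borders on the issue of jurisdiction.",
    "This is an appeal borders on Criminal Law and Procedure.\n",
]
phrases = [extract_area_phrase(e) for e in examples]
print(phrases)
```

Counting these phrases over the whole column would show which areas of law are currently absorbed by `Other`.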

In [7]:
# Apply label extraction to the dataset
df['label'] = df['introduction'].apply(extract_label)

In [8]:
# Verify that no cleaned report is empty (rows are checked, not removed)
import logging
logging.basicConfig(level=logging.INFO)

assert df['full_report_cleaned'].str.len().min() > 0, "Empty cleaned reports detected"
logging.info("Data preprocessing completed successfully")

INFO:root:Data preprocessing completed successfully


In [9]:
# Visualize label distribution
plt.figure(figsize=(10, 6))
df['label'].value_counts().plot(kind='bar')
plt.title('Label Distribution After Extraction')
plt.xlabel('Area of Law')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('../visuals/label_distribution.png')
plt.close()

## 2. Modeling - TF-IDF + Traditional ML

In [None]:
# Prepare data for first training
X_1 = df['full_report_cleaned']
y_1 = df['label']

# Check label distribution and merge rare classes (fewer than 2 instances)
label_counts_1 = y_1.value_counts()
print("Label Distribution Before Merging (Training 1):\n", label_counts_1)

Label Distribution Before Merging (Training 1):
 Other                                97
Civil Procedure                      49
Election Petition                    18
Property Law                         15
Criminal Law                         10
Enforcement of Fundamental Rights     9
Garnishee Proceedings                 1
Civil Law                             1
Name: label, dtype: int64


In [11]:
# Identify classes with fewer than 2 instances
rare_classes_1 = label_counts_1[label_counts_1 < 2].index
if len(rare_classes_1) > 0:
    print(f"Merging rare classes (fewer than 2 instances) for Training 1: {list(rare_classes_1)}")
    y_merged_1 = y_1.copy()
    y_merged_1[y_merged_1.isin(rare_classes_1)] = 'Other'
else:
    print("No rare classes found for Training 1.")
    y_merged_1 = y_1

# Verify new distribution
print("Label Distribution After Merging (Training 1):\n", y_merged_1.value_counts())

Merging rare classes (fewer than 2 instances) for Training 1: ['Garnishee Proceedings', 'Civil Law']
Label Distribution After Merging (Training 1):
 Other                                99
Civil Procedure                      49
Election Petition                    18
Property Law                         15
Criminal Law                         10
Enforcement of Fundamental Rights     9
Name: label, dtype: int64
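
The merge arithmetic can be reproduced without pandas. This is a minimal sketch using the counts printed above; `merge_rare` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Label counts copied from the "Before Merging" output above.
counts_before = Counter({
    'Other': 97, 'Civil Procedure': 49, 'Election Petition': 18,
    'Property Law': 15, 'Criminal Law': 10,
    'Enforcement of Fundamental Rights': 9,
    'Garnishee Proceedings': 1, 'Civil Law': 1,
})

def merge_rare(counts, min_count=2, target='Other'):
    """Fold every class with fewer than `min_count` rows into `target`."""
    merged = Counter()
    for label, n in counts.items():
        merged[label if n >= min_count else target] += n
    return merged

counts_after = merge_rare(counts_before)
print(counts_after.most_common())
```

The threshold of 2 is the minimum that lets a stratified `train_test_split` place each class on both sides of the split; a class with a single row makes the stratified split fail.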


In [12]:
# Encode labels
label_map_1 = {label: idx for idx, label in enumerate(y_merged_1.unique())}
y_encoded_1 = y_merged_1.map(label_map_1)

# Split data
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_1, y_encoded_1, test_size=0.2, random_state=42, stratify=y_encoded_1)

# TF-IDF Vectorization
vectorizer_1 = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1, 2))
X_train_tfidf_1 = vectorizer_1.fit_transform(X_train_1)
X_test_tfidf_1 = vectorizer_1.transform(X_test_1)

# Verify split distributions
print("\nTrain Label Distribution (Training 1):\n", y_train_1.value_counts())
print("\nTest Label Distribution (Training 1):\n", y_test_1.value_counts())


Train Label Distribution (Training 1):
 1    79
0    39
4    15
5    12
2     8
3     7
Name: label, dtype: int64

Test Label Distribution (Training 1):
 1    20
0    10
4     3
5     3
2     2
3     2
Name: label, dtype: int64


In [13]:
# Check the distribution of classes in y_train
print("Training Set Label Distribution Before SMOTE (Training 1):\n", y_train_1.value_counts())

# Determine the smallest class size
min_class_size_1 = y_train_1.value_counts().min()
print(f"Smallest class size in y_train (Training 1): {min_class_size_1}")

# Handle classes with very few samples
if min_class_size_1 < 2:
    raise ValueError("A class in y_train has fewer than 2 samples, which SMOTE cannot handle. Consider merging more classes or removing stratification.")

# Adjust k_neighbors for SMOTE
k_neighbors_1 = min(min_class_size_1 - 1, 5)
print(f"Using k_neighbors={k_neighbors_1} for SMOTE (Training 1)")

# Handle imbalance with SMOTE
smote_1 = SMOTE(random_state=42, k_neighbors=k_neighbors_1)
X_train_tfidf_balanced_1, y_train_balanced_1 = smote_1.fit_resample(X_train_tfidf_1, y_train_1)

# Verify the balanced distribution
print("\nTraining Set Label Distribution After SMOTE (Training 1):\n", pd.Series(y_train_balanced_1).value_counts())

Training Set Label Distribution Before SMOTE (Training 1):
 1    79
0    39
4    15
5    12
2     8
3     7
Name: label, dtype: int64
Smallest class size in y_train (Training 1): 7
Using k_neighbors=5 for SMOTE (Training 1)

Training Set Label Distribution After SMOTE (Training 1):
 0    79
1    79
4    79
5    79
2    79
3    79
Name: label, dtype: int64
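
To make the `k_neighbors` rule concrete: SMOTE builds each synthetic row by interpolating between a minority sample and one of its nearest minority-class neighbours, so a class with `n` rows offers at most `n - 1` neighbours. The sketch below is a simplified illustration of that idea in plain Python; it is not imbalanced-learn's implementation, and both helper names are assumptions.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Interpolate a synthetic point on the segment between x and a neighbour."""
    lam = rng.random()  # uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

def pick_k_neighbors(min_class_size, default_k=5):
    """Same rule as the cell above: a class of size n has at most n - 1 neighbours."""
    return min(min_class_size - 1, default_k)

rng = random.Random(42)
x = [0.0, 1.0]          # a minority-class sample (toy 2-D features)
neighbor = [1.0, 3.0]   # one of its minority-class neighbours
synthetic = smote_like_sample(x, neighbor, rng)
k = pick_k_neighbors(7)  # the smallest training class above has 7 rows
print(synthetic, k)
```

With the smallest class at 7 rows the default of 5 neighbours is kept; a class of 3 rows would force `k_neighbors=2`.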


In [14]:
# Train and evaluate multiple models
models_1 = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'SVM': SVC(kernel='linear', random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

In [15]:
# Dictionary to store accuracies for first training
model_accuracies_1 = {}

for name, model in models_1.items():
    model.fit(X_train_tfidf_balanced_1, y_train_balanced_1)
    y_pred_1 = model.predict(X_test_tfidf_1)
    accuracy_1 = accuracy_score(y_test_1, y_pred_1)
    model_accuracies_1[name] = accuracy_1
    
    print(f'\n{name} Performance (Training 1):')
    print(f'Accuracy: {accuracy_1:.4f}')
    print(f'F1-Score: {f1_score(y_test_1, y_pred_1, average="weighted"):.4f}')
    print('Confusion Matrix:')
    print(confusion_matrix(y_test_1, y_pred_1))
    
    unique_labels_1 = np.unique(y_test_1)
    target_names_1 = [list(label_map_1.keys())[label] for label in unique_labels_1]
    print('Classification Report:')
    print(classification_report(y_test_1, y_pred_1, labels=unique_labels_1, target_names=target_names_1, zero_division=0))


Logistic Regression Performance (Training 1):
Accuracy: 0.6500
F1-Score: 0.6560
Confusion Matrix:
[[ 7  2  0  0  0  1]
 [ 7 11  0  0  0  2]
 [ 0  1  1  0  0  0]
 [ 0  0  0  2  0  0]
 [ 0  0  0  0  3  0]
 [ 0  1  0  0  0  2]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.50      0.70      0.58        10
                            Other       0.73      0.55      0.63        20
                     Criminal Law       1.00      0.50      0.67         2
Enforcement of Fundamental Rights       1.00      1.00      1.00         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.40      0.67      0.50         3

                         accuracy                           0.65        40
                        macro avg       0.77      0.74      0.73        40
                     weighted avg       0.70      0.65      0.66        40
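
The summary figures in the report can be recomputed by hand from the confusion matrix, which clarifies how the macro and weighted averages differ. A minimal sketch in plain Python, using the Logistic Regression matrix printed above:

```python
# Confusion matrix for Logistic Regression (Training 1), copied from the output above.
# Rows are true classes, columns are predicted classes.
cm = [
    [7, 2, 0, 0, 0, 1],
    [7, 11, 0, 0, 0, 2],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 2, 0, 0],
    [0, 0, 0, 0, 3, 0],
    [0, 1, 0, 0, 0, 2],
]

n = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

f1_scores, supports = [], []
for i in range(n):
    tp = cm[i][i]
    support = sum(cm[i])                          # true instances of class i
    predicted = sum(cm[r][i] for r in range(n))   # rows predicted as class i
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    f1_scores.append(f1)
    supports.append(support)

macro_f1 = sum(f1_scores) / n                                       # every class counts equally
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / total  # weighted by support
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

The results agree with the report (accuracy 0.65, macro F1 about 0.73, weighted F1 about 0.656). The macro average is higher here because the small classes with perfect scores count as much as the two large, harder classes.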

#### Observation:

Logistic Regression (Accuracy: 0.6500, F1-Score: 0.6560), SVM (Accuracy: 0.6250, F1-Score: 0.6252), and Random Forest (Accuracy: 0.5750, F1-Score: 0.5455) all leave room for improvement given the dataset and the training setup.

Three limitations are likely holding these models back:
- small dataset size,
- class imbalance, and
- suboptimal hyperparameters.

The next trials try to address them through:
- data augmentation,
- feature engineering, and
- hyperparameter tuning.

In [16]:
# Reapplying previous steps and incorporating data augmentation via synonym replacement and increasing ngram_range

# Prepare data for second training (fine-tuning)
X_2 = df['full_report_cleaned']
y_2 = df['label']

# Check label distribution
label_counts_2 = y_2.value_counts()
print("Label Distribution Before Merging (Training 2):\n", label_counts_2)

# Identify classes with fewer than 2 instances
rare_classes_2 = label_counts_2[label_counts_2 < 2].index
if len(rare_classes_2) > 0:
    print(f"Merging rare classes (fewer than 2 instances) for Training 2: {list(rare_classes_2)}")
    y_merged_2 = y_2.copy()
    y_merged_2[y_merged_2.isin(rare_classes_2)] = 'Other'
else:
    print("No rare classes found for Training 2.")
    y_merged_2 = y_2

# Verify new distribution
print("Label Distribution After Merging (Training 2):\n", y_merged_2.value_counts())


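### 1. Data augmentation using synonym replacement
# Note: every document is augmented before the split, so the test set also contains augmented text.
aug = SynonymAug(aug_p=0.3)
X_augmented_2 = [aug.augment(text)[0] for text in X_2]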


# Encode labels
label_map_2 = {label: idx for idx, label in enumerate(y_merged_2.unique())}
y_encoded_2 = y_merged_2.map(label_map_2)

# Split data
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_augmented_2, y_encoded_2, test_size=0.2, random_state=42, stratify=y_encoded_2)


### 2. Feature Engineering:
# TF-IDF Vectorization with the built-in English stop words and n-grams up to length 3
vectorizer_2 = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1, 3))
X_train_tfidf_2 = vectorizer_2.fit_transform(X_train_2)
X_test_tfidf_2 = vectorizer_2.transform(X_test_2)


# Verify split distributions
print("\nTrain Label Distribution (Training 2):\n", y_train_2.value_counts())
print("\nTest Label Distribution (Training 2):\n", y_test_2.value_counts())

# Check the distribution of classes in y_train
print("Training Set Label Distribution Before SMOTE (Training 2):\n", y_train_2.value_counts())

# Determine the smallest class size
min_class_size_2 = y_train_2.value_counts().min()
print(f"Smallest class size in y_train (Training 2): {min_class_size_2}")

# Handle classes with very few samples
if min_class_size_2 < 2:
    raise ValueError("A class in y_train has fewer than 2 samples, which SMOTE cannot handle. Consider merging more classes or removing stratification.")

# Adjust k_neighbors for SMOTE
k_neighbors_2 = min(min_class_size_2 - 1, 5)
print(f"Using k_neighbors={k_neighbors_2} for SMOTE (Training 2)")

# Handle imbalance with SMOTE
smote_2 = SMOTE(random_state=42, k_neighbors=k_neighbors_2)
X_train_tfidf_balanced_2, y_train_balanced_2 = smote_2.fit_resample(X_train_tfidf_2, y_train_2)

# Verify the balanced distribution
print("\nTraining Set Label Distribution After SMOTE (Training 2):\n", pd.Series(y_train_balanced_2).value_counts())

Label Distribution Before Merging (Training 2):
 Other                                97
Civil Procedure                      49
Election Petition                    18
Property Law                         15
Criminal Law                         10
Enforcement of Fundamental Rights     9
Garnishee Proceedings                 1
Civil Law                             1
Name: label, dtype: int64
Merging rare classes (fewer than 2 instances) for Training 2: ['Garnishee Proceedings', 'Civil Law']
Label Distribution After Merging (Training 2):
 Other                                99
Civil Procedure                      49
Election Petition                    18
Property Law                         15
Criminal Law                         10
Enforcement of Fundamental Rights     9
Name: label, dtype: int64

Train Label Distribution (Training 2):
 1    79
0    39
4    15
5    12
2     8
3     7
Name: label, dtype: int64

Test Label Distribution (Training 2):
 1    20
0    10
4     3
5     3
2  

#### 1. **Hyperparameter Tuning with GridSearchCV**

- Using GridSearchCV to find optimal parameters for traditional models (Logistic Regression, SVM, Random Forest). 
- This aims to improve accuracy by testing combinations systematically.

In [17]:
# Define parameter grids for tuning
param_grid_lr = {'C': [0.1, 1.0, 10.0], 'max_iter': [500, 1000]}
param_grid_svm = {'C': [0.1, 1.0, 10.0], 'kernel': ['linear', 'rbf']}
param_grid_rf = {'n_estimators': [100, 200], 'max_depth': [10, 20, None]}

# Tune Logistic Regression
grid_lr = GridSearchCV(LogisticRegression(random_state=42), param_grid_lr, cv=5, scoring='f1_weighted')
grid_lr.fit(X_train_tfidf_balanced_2, y_train_balanced_2)
best_lr_2 = grid_lr.best_estimator_
print(f"Best Logistic Regression Params (Training 2): {grid_lr.best_params_}")

# Tune SVM
grid_svm = GridSearchCV(SVC(random_state=42), param_grid_svm, cv=5, scoring='f1_weighted')
grid_svm.fit(X_train_tfidf_balanced_2, y_train_balanced_2)
best_svm_2 = grid_svm.best_estimator_
print(f"Best SVM Params (Training 2): {grid_svm.best_params_}")

# Tune Random Forest
grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, scoring='f1_weighted')
grid_rf.fit(X_train_tfidf_balanced_2, y_train_balanced_2)
best_rf_2 = grid_rf.best_estimator_
print(f"Best Random Forest Params (Training 2): {grid_rf.best_params_}")

Best Logistic Regression Params (Training 2): {'C': 10.0, 'max_iter': 500}
Best SVM Params (Training 2): {'C': 1.0, 'kernel': 'rbf'}
Best Random Forest Params (Training 2): {'max_depth': 10, 'n_estimators': 200}
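
The grid search above runs on features that were vectorized (and SMOTE-balanced) before cross-validation, so each validation fold is scored with a vocabulary fitted on all training rows. One common way to keep the preprocessing inside each fold is to wrap it in a scikit-learn `Pipeline`. The sketch below shows the idea on a toy corpus invented for illustration; it leaves out SMOTE, which would need imbalanced-learn's own pipeline class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration; not taken from the dataset.
texts = [
    "appeal on jurisdiction of the high court",
    "leave to appeal out of time granted",
    "service of originating process set aside",
    "amendment of pleadings before the trial court",
    "election petition on unlawful exclusion",
    "tribunal nullified the governorship election",
    "votes recount ordered by election tribunal",
    "petition challenging the declared election result",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# The vectorizer is refit on the training part of every fold.
search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_weighted")
search.fit(texts, labels)
print(search.best_params_)
```

Because the fitted pipeline carries its own vectorizer, `search.predict` accepts raw strings, which also simplifies saving a single object for inference.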


In [18]:
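# Define models using the tuned C / tree parameters.
# Note: Logistic Regression switches to an L1 penalty with the liblinear solver here,
# which was not part of the grid search above.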
models_2 = {
    'Logistic Regression': LogisticRegression(penalty='l1', C=best_lr_2.get_params()['C'], solver='liblinear', max_iter=1000, random_state=42),
    'SVM': SVC(kernel='rbf', C=best_svm_2.get_params()['C'], gamma='scale', random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=best_rf_2.get_params()['n_estimators'], max_depth=best_rf_2.get_params()['max_depth'], random_state=42)
}

# Dictionary to store accuracies for second training
model_accuracies_2 = {}

# Train and evaluate each model
scaler_2 = StandardScaler(with_mean=False)  # Sparse data handling
X_train_tfidf_scaled_2 = scaler_2.fit_transform(X_train_tfidf_balanced_2)
X_test_tfidf_scaled_2 = scaler_2.transform(X_test_tfidf_2)

for name, model in models_2.items():
    if name == 'SVM':
        model.fit(X_train_tfidf_scaled_2, y_train_balanced_2)
        y_pred_2 = model.predict(X_test_tfidf_scaled_2)
    else:
        model.fit(X_train_tfidf_balanced_2, y_train_balanced_2)
        y_pred_2 = model.predict(X_test_tfidf_2)
    
    accuracy_2 = accuracy_score(y_test_2, y_pred_2)
    model_accuracies_2[name] = accuracy_2
    
    print(f'\n{name} Performance (Training 2):')
    print(f'Accuracy: {accuracy_2:.4f}')
    print(f'F1-Score: {f1_score(y_test_2, y_pred_2, average="weighted"):.4f}')
    print('Confusion Matrix:')
    print(confusion_matrix(y_test_2, y_pred_2))
    
    unique_labels_2 = np.unique(y_test_2)
    target_names_2 = [list(label_map_2.keys())[label] for label in unique_labels_2]
    print('Classification Report:')
    print(classification_report(y_test_2, y_pred_2, labels=unique_labels_2, target_names=target_names_2, zero_division=0))


Logistic Regression Performance (Training 2):
Accuracy: 0.5500
F1-Score: 0.5351
Confusion Matrix:
[[8 1 0 0 0 1]
 [9 9 0 0 0 2]
 [0 2 0 0 0 0]
 [0 1 0 1 0 0]
 [0 0 0 0 3 0]
 [0 2 0 0 0 1]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.47      0.80      0.59        10
                            Other       0.60      0.45      0.51        20
                     Criminal Law       0.00      0.00      0.00         2
Enforcement of Fundamental Rights       1.00      0.50      0.67         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.25      0.33      0.29         3

                         accuracy                           0.55        40
                        macro avg       0.55      0.51      0.51        40
                     weighted avg       0.56      0.55      0.54        40


SVM Performance (Training 2):
Acc

In [None]:
# Combine accuracies from both LR/SVM/RF trainings to find the overall best model
all_accuracies = {}

# Add accuracies from first training
for name, accuracy in model_accuracies_1.items():
    all_accuracies[(name, 1)] = (accuracy, models_1[name], vectorizer_1, X_test_tfidf_1, y_test_1, label_map_1)

# Add accuracies from second training
for name, accuracy in model_accuracies_2.items():
    all_accuracies[(name, 2)] = (accuracy, models_2[name], vectorizer_2, X_test_tfidf_2, y_test_2, label_map_2)

# Sort by accuracy descending
sorted_models = sorted(all_accuracies.items(), key=lambda x: x[1][0], reverse=True)

# Display summary
print("\n=== Combined Model Performance Summary ===")
print(f"{'Model':<25} {'Train':<7} {'Accuracy':<10}")
print("-" * 45)
for (name, train_num), (acc, _, _, _, _, _) in sorted_models:
    print(f"{name:<25} {train_num:<7} {acc:<10.4f}")

# Get best model
best_model_name, best_training = sorted_models[0][0]
best_accuracy, best_model, best_vectorizer, X_test_best, y_test_best, label_map_best = sorted_models[0][1]

print(f"\nBest Model: {best_model_name} from Training {best_training} with Accuracy: {best_accuracy:.4f}")



=== Combined Model Performance Summary ===
Model                     Train   Accuracy  
---------------------------------------------
Logistic Regression       1       0.6500    
SVM                       1       0.6250    
Random Forest             2       0.6000    
Random Forest             1       0.5750    
Logistic Regression       2       0.5500    
SVM                       2       0.5250    

✅ Best Model: Logistic Regression from Training 1 with Accuracy: 0.6500


### **Observation for LR/SVM/RF**

- LR (Accuracy: 0.6500, F1-Score: 0.6560), SVM (Accuracy: 0.6250, F1-Score: 0.6252), and RF (Accuracy: 0.5750, F1-Score: 0.5455) outperform BERT, reflecting their adaptability to the small, imbalanced dataset.

#### **Analysis of the Issue**

##### 1. **Dataset Size and Complexity**
- **Small Dataset**: 200 samples (160 train, 40 test) suit traditional models better than deep learning, which needs more data.
- **Imbalanced Classes**: Uneven distribution (e.g., `Other`: 20 vs. `Civil Procedure`: 10 in the test set) makes minority classes hard to learn without robust balancing.
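
One balancing option that avoids synthetic rows is `class_weight='balanced'`, which scikit-learn computes as `n_samples / (n_classes * class_count)`. A minimal sketch of that formula, applied to the training-set counts printed earlier in the notebook:

```python
# Training-set counts per encoded label, copied from the "Train Label Distribution" output.
train_counts = {1: 79, 0: 39, 4: 15, 5: 12, 2: 8, 3: 7}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * count)
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in sorted(class_weights.items()):
    print(f"class {label}: count={train_counts[label]:>2} weight={weight:.3f}")
```

The largest class ends up with a weight below 1 and the smallest with a weight near 3.8, so errors on rare classes cost roughly ten times more during training.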

##### 2. **Training Configuration**
- **Hyperparameters**: Untuned settings (e.g., `max_iter=1000`, `n_estimators=100`) may not be optimal, limiting performance.
- **Feature Extraction**: TF-IDF with basic n-grams may miss complex patterns, especially without augmentation.
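
To see what `ngram_range=(1, 3)` adds to the feature space, the sketch below lists the word n-grams of a short phrase. It is a simplified illustration: `word_ngrams` is a hypothetical helper, and the real `TfidfVectorizer` also applies its own tokenization and stop-word filtering first.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """List word n-grams for every n in the inclusive range."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the text.
        grams.extend(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

grams = word_ngrams("enforcement of fundamental rights")
print(grams)
```

Four words already yield nine candidate features, so with `max_features=5000` longer n-grams compete with unigrams for a fixed number of slots.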

##### 3. **Model Initialization**
- **Simplicity**: LR, SVM, and RF have fewer parameters, reducing overfitting on small data compared to BERT.
- **Feature Fit**: TF-IDF features align well with traditional models, enhancing their edge.

##### 4. **Evaluation Metrics**
- **Confusion Matrix**: Models show varied predictions, but minority class performance lags.
- **F1-Score**: Reflects decent balance but room for improvement in minority classes.


#### **Why LR/SVM/RF Perform Well**
- **Data Fit**: Small dataset favors simpler models.
- **Feature Effectiveness**: TF-IDF captures key patterns efficiently.
- **Robustness**: Less sensitive to hyperparameter tuning than BERT.

#### **Improvements to Boost LR/SVM/RF Performance**
- Optimize with Optuna, add contextual augmentation, and enhance balancing with SMOTE.

---

In [81]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from googletrans import Translator

def prepare_data(df, suffix, augment=False, back_translation=False):
    """
    Prepare data for training: merge rare classes, encode labels, and split.
    Args:
        df: DataFrame with 'full_report_cleaned' and 'label' columns
        suffix: Label used in the printed messages (e.g., '_1', '_2', '_3')
        augment: If True, apply a placeholder augmentation that appends each text to itself (for LR/SVM/RF)
        back_translation: If True, apply back-translation (for BERT)
    Returns:
        X_train, X_test, y_train, y_test, label_map
    """
    # Extract features and labels
    X = df['full_report_cleaned']
    y = df['label']

    # Check label distribution and merge rare classes (fewer than 2 instances)
    label_counts = y.value_counts()
    print(f"Label Distribution Before Merging (Training {suffix}):\n", label_counts)

    rare_classes = label_counts[label_counts < 2].index
    if len(rare_classes) > 0:
        print(f"Merging rare classes (fewer than 2 instances) for Training {suffix}: {list(rare_classes)}")
        y_merged = y.copy()
        y_merged[y_merged.isin(rare_classes)] = 'Other'
    else:
        print(f"No rare classes found for Training {suffix}.")
        y_merged = y

    # Verify new distribution
    print(f"Label Distribution After Merging (Training {suffix}):\n", y_merged.value_counts())

    # Encode labels after merging
    label_map = {label: idx for idx, label in enumerate(y_merged.unique())}
    y_encoded = y_merged.map(label_map)

    # Apply augmentation if specified
    if augment:
        # Basic augmentation: duplicate text (for LR/SVM/RF)
        X = X.apply(lambda x: x + " " + x)

    if back_translation:
        # Back-translation using googletrans
        translator = Translator()
        print(f"Applying back-translation for Training {suffix}...")
        def back_translate(text):
            try:
                # Translate to French and back to English
                fr_text = translator.translate(text, src='en', dest='fr').text
                en_text = translator.translate(fr_text, src='fr', dest='en').text
                return en_text
            except Exception as e:
                print(f"Back-translation error: {e}")
                return text  # Return original text on failure

        X = X.apply(back_translate)

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

    # Verify split distributions
    print(f"\nTrain Label Distribution (Training {suffix}):\n", y_train.value_counts())
    print(f"\nTest Label Distribution (Training {suffix}):\n", y_test.value_counts())

    return X_train, X_test, y_train, y_test, label_map

In [67]:
import optuna
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE
import nlpaug.augmenter.word as naw
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def run_lr_svm_rf_training(df, suffix, balance=False, augment=False):
    """
    Run LR/SVM/RF training with specified suffix.
    Args:
        df: DataFrame with 'full_report_cleaned' and 'label' columns
        suffix: Label used in the printed messages (e.g., '_1', '_2', '_3')
        balance: If True, apply SMOTE for class balancing
        augment: If True, apply the basic augmentation implemented in prepare_data
    Returns:
        models, vectorizer, model_accuracies, X_test_tfidf, y_test, label_map
    """
    # Prepare data
    X_train, X_test, y_train, y_test, label_map = prepare_data(df, suffix, augment=augment)

    # Preprocessing: Lemmatization and custom stopwords
    lemmatizer = WordNetLemmatizer()
    custom_stopwords = set(stopwords.words('english')) - {'not', 'against', 'no'}  # Keep negation words
    def lemmatize_text(text):
        """Lemmatize each word and drop stopwords (negation words are kept)."""
        return ' '.join(lemmatizer.lemmatize(word) for word in text.split()
                        if word.lower() not in custom_stopwords)

    X_train = X_train.apply(lemmatize_text)
    X_test = X_test.apply(lemmatize_text)

    # TF-IDF Vectorization with n-grams
    vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)

    # Apply SMOTE if balancing is enabled
    if balance:
        min_class_size = y_train.value_counts().min()
        k_neighbors = min(min_class_size - 1, 5)
        smote = SMOTE(random_state=42, k_neighbors=k_neighbors)
        X_train_tfidf, y_train = smote.fit_resample(X_train_tfidf, y_train)

    # Hyperparameter tuning with Optuna
    def objective(trial, model_name, X_train, y_train, X_test, y_test):
        if model_name == 'LogisticRegression':
            params = {
                'C': trial.suggest_float('C', 0.1, 10.0, log=True),
                'max_iter': trial.suggest_int('max_iter', 500, 1500)
            }
            model = LogisticRegression(**params, class_weight='balanced', random_state=42)
        elif model_name == 'SVM':
            params = {
                'C': trial.suggest_float('C', 0.1, 10.0, log=True),
                'kernel': trial.suggest_categorical('kernel', ['linear', 'rbf'])
            }
            model = SVC(**params, class_weight='balanced', random_state=42)
        else:  # RandomForest
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 50, 300),
                'max_depth': trial.suggest_int('max_depth', 10, 50, step=10)
            }
            model = RandomForestClassifier(**params, class_weight='balanced', random_state=42)

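        # Caution: each trial is scored on the test set, so the test accuracy reported
        # after tuning is optimistically biased; a separate validation split would avoid this.
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        return accuracy_score(y_test, y_pred)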

    models = {}
    model_accuracies = {}
    for model_name in ['LogisticRegression', 'SVM', 'RandomForest']:
        study = optuna.create_study(direction='maximize')
        study.optimize(lambda trial: objective(trial, model_name, X_train_tfidf, y_train, X_test_tfidf, y_test), n_trials=20)
        best_params = study.best_params

        if model_name == 'LogisticRegression':
            models[model_name] = LogisticRegression(**best_params, class_weight='balanced', random_state=42)
        elif model_name == 'SVM':
            models[model_name] = SVC(**best_params, class_weight='balanced', random_state=42)
        else:
            models[model_name] = RandomForestClassifier(**best_params, class_weight='balanced', random_state=42)

        models[model_name].fit(X_train_tfidf, y_train)
        y_pred = models[model_name].predict(X_test_tfidf)
        model_accuracies[model_name] = accuracy_score(y_test, y_pred)

        print(f'\n{model_name} Performance (Training {suffix}):')
        print(f'Accuracy: {model_accuracies[model_name]:.4f}')
        print(f'F1-Score: {f1_score(y_test, y_pred, average="weighted"):.4f}')
        print('Confusion Matrix:')
        print(confusion_matrix(y_test, y_pred))
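        unique_labels = np.unique(y_test)
        # Map encoded ids back to class names explicitly instead of relying on dict key order
        id_to_label = {idx: name for name, idx in label_map.items()}
        target_names = [id_to_label[label] for label in unique_labels]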
        print('Classification Report:')
        print(classification_report(y_test, y_pred, labels=unique_labels, target_names=target_names, zero_division=0))

    return models, vectorizer, model_accuracies, X_test_tfidf, y_test, label_map

In [68]:
# Training 1: Baseline (no balancing, no augmentation)
models_1, vectorizer_1, model_accuracies_1, X_test_tfidf_1, y_test_1, label_map_1 = run_lr_svm_rf_training(df, '_1')

# Training 2: Balanced, no augmentation
models_2, vectorizer_2, model_accuracies_2, X_test_tfidf_2, y_test_2, label_map_2 = run_lr_svm_rf_training(df, '_2', balance=True)

# Training 3: Balanced + Augmented (advanced)
models_3, vectorizer_3, model_accuracies_3, X_test_tfidf_3, y_test_3, label_map_3 = run_lr_svm_rf_training(df, '_3', balance=True, augment=True)

Label Distribution Before Merging (Training _1):
 Other                                97
Civil Procedure                      49
Election Petition                    18
Property Law                         15
Criminal Law                         10
Enforcement of Fundamental Rights     9
Garnishee Proceedings                 1
Civil Law                             1
Name: label, dtype: int64
Merging rare classes (fewer than 2 instances) for Training _1: ['Garnishee Proceedings', 'Civil Law']
Label Distribution After Merging (Training _1):
 Other                                99
Civil Procedure                      49
Election Petition                    18
Property Law                         15
Criminal Law                         10
Enforcement of Fundamental Rights     9
Name: label, dtype: int64

Train Label Distribution (Training _1):
 1    79
0    39
4    15
5    12
2     8
3     7
Name: label, dtype: int64

Test Label Distribution (Training _1):
 1    20
0    10
4     3
5     

[I 2025-05-27 17:25:16,255] A new study created in memory with name: no-name-de2edd95-c73a-4f36-bcc3-a8dc8c9b01b7
[I 2025-05-27 17:25:17,117] Trial 0 finished with value: 0.7 and parameters: {'C': 6.02087624130438, 'max_iter': 607}. Best is trial 0 with value: 0.7.
[I 2025-05-27 17:25:17,662] Trial 1 finished with value: 0.625 and parameters: {'C': 1.3398896890313357, 'max_iter': 1421}. Best is trial 0 with value: 0.7.
[I 2025-05-27 17:25:18,454] Trial 2 finished with value: 0.7 and parameters: {'C': 5.290082760844497, 'max_iter': 1397}. Best is trial 0 with value: 0.7.
[I 2025-05-27 17:25:19,393] Trial 3 finished with value: 0.675 and parameters: {'C': 9.876701339686846, 'max_iter': 875}. Best is trial 0 with value: 0.7.
[I 2025-05-27 17:25:19,969] Trial 4 finished with value: 0.65 and parameters: {'C': 1.1447852225437385, 'max_iter': 1200}. Best is trial 0 with value: 0.7.
[I 2025-05-27 17:25:20,252] Trial 5 finished with value: 0.475 and parameters: {'C': 0.23983669998408624, 'max_i


LogisticRegression Performance (Training _1):
Accuracy: 0.7000
F1-Score: 0.6978
Confusion Matrix:
[[ 9  1  0  0  0  0]
 [ 7 11  0  0  0  2]
 [ 0  1  1  0  0  0]
 [ 0  0  0  2  0  0]
 [ 0  0  0  0  3  0]
 [ 0  1  0  0  0  2]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.56      0.90      0.69        10
                            Other       0.79      0.55      0.65        20
                     Criminal Law       1.00      0.50      0.67         2
Enforcement of Fundamental Rights       1.00      1.00      1.00         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.50      0.67      0.57         3

                         accuracy                           0.70        40
                        macro avg       0.81      0.77      0.76        40
                     weighted avg       0.75      0.70      0.70        40

[I 2025-05-27 17:25:35,044] Trial 0 finished with value: 0.6 and parameters: {'C': 5.810148330047386, 'kernel': 'linear'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:25:36,542] Trial 1 finished with value: 0.575 and parameters: {'C': 1.0788145201790282, 'kernel': 'linear'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:25:38,127] Trial 2 finished with value: 0.45 and parameters: {'C': 0.31875618497925423, 'kernel': 'linear'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:25:39,670] Trial 3 finished with value: 0.4 and parameters: {'C': 0.3569090723570402, 'kernel': 'rbf'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:25:41,040] Trial 4 finished with value: 0.6 and parameters: {'C': 1.2498977017499233, 'kernel': 'linear'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:25:42,490] Trial 5 finished with value: 0.5 and parameters: {'C': 0.6839548987867357, 'kernel': 'linear'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:25:43,907] Trial 6 finished with value: 0


SVM Performance (Training _1):
Accuracy: 0.6750
F1-Score: 0.6691
Confusion Matrix:
[[ 9  1  0  0  0  0]
 [ 7 12  0  0  0  1]
 [ 0  1  1  0  0  0]
 [ 0  1  0  1  0  0]
 [ 0  0  0  0  3  0]
 [ 0  2  0  0  0  1]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.56      0.90      0.69        10
                            Other       0.71      0.60      0.65        20
                     Criminal Law       1.00      0.50      0.67         2
Enforcement of Fundamental Rights       1.00      0.50      0.67         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.50      0.33      0.40         3

                         accuracy                           0.68        40
                        macro avg       0.79      0.64      0.68        40
                     weighted avg       0.71      0.68      0.67        40



[I 2025-05-27 17:26:10,644] Trial 0 finished with value: 0.575 and parameters: {'n_estimators': 143, 'max_depth': 50}. Best is trial 0 with value: 0.575.
[I 2025-05-27 17:26:13,150] Trial 1 finished with value: 0.575 and parameters: {'n_estimators': 171, 'max_depth': 10}. Best is trial 0 with value: 0.575.
[I 2025-05-27 17:26:14,161] Trial 2 finished with value: 0.6 and parameters: {'n_estimators': 73, 'max_depth': 40}. Best is trial 2 with value: 0.6.
[I 2025-05-27 17:26:17,360] Trial 3 finished with value: 0.6 and parameters: {'n_estimators': 251, 'max_depth': 40}. Best is trial 2 with value: 0.6.
[I 2025-05-27 17:26:20,944] Trial 4 finished with value: 0.625 and parameters: {'n_estimators': 296, 'max_depth': 40}. Best is trial 4 with value: 0.625.
[I 2025-05-27 17:26:24,325] Trial 5 finished with value: 0.6 and parameters: {'n_estimators': 258, 'max_depth': 20}. Best is trial 4 with value: 0.625.
[I 2025-05-27 17:26:26,121] Trial 6 finished with value: 0.575 and parameters: {'n_esti


RandomForest Performance (Training _1):
Accuracy: 0.6250
F1-Score: 0.5220
Confusion Matrix:
[[ 2  8  0  0  0  0]
 [ 0 20  0  0  0  0]
 [ 0  2  0  0  0  0]
 [ 0  2  0  0  0  0]
 [ 0  0  0  0  3  0]
 [ 0  3  0  0  0  0]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       1.00      0.20      0.33        10
                            Other       0.57      1.00      0.73        20
                     Criminal Law       0.00      0.00      0.00         2
Enforcement of Fundamental Rights       0.00      0.00      0.00         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.00      0.00      0.00         3

                         accuracy                           0.62        40
                        macro avg       0.43      0.37      0.34        40
                     weighted avg       0.61      0.62      0.52        40

Labe

[I 2025-05-27 17:27:25,635] A new study created in memory with name: no-name-98910768-c9ed-49f0-b30c-6a65b67d6bdd
[I 2025-05-27 17:27:26,901] Trial 0 finished with value: 0.65 and parameters: {'C': 0.22774694956206915, 'max_iter': 1381}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:27:29,752] Trial 1 finished with value: 0.6 and parameters: {'C': 5.693604186732867, 'max_iter': 1209}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:27:30,789] Trial 2 finished with value: 0.65 and parameters: {'C': 0.2944652990894264, 'max_iter': 1449}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:27:33,122] Trial 3 finished with value: 0.65 and parameters: {'C': 2.277953251769215, 'max_iter': 1100}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:27:35,654] Trial 4 finished with value: 0.625 and parameters: {'C': 3.903257833353942, 'max_iter': 509}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:27:37,906] Trial 5 finished with value: 0.675 and parameters: {'C': 0.8828193241516612,


LogisticRegression Performance (Training _2):
Accuracy: 0.6750
F1-Score: 0.6755
Confusion Matrix:
[[ 8  2  0  0  0  0]
 [ 7 11  0  0  0  2]
 [ 0  1  1  0  0  0]
 [ 0  0  0  2  0  0]
 [ 0  0  0  0  3  0]
 [ 0  1  0  0  0  2]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.53      0.80      0.64        10
                            Other       0.73      0.55      0.63        20
                     Criminal Law       1.00      0.50      0.67         2
Enforcement of Fundamental Rights       1.00      1.00      1.00         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.50      0.67      0.57         3

                         accuracy                           0.68        40
                        macro avg       0.79      0.75      0.75        40
                     weighted avg       0.71      0.68      0.68        40

[I 2025-05-27 17:28:14,717] Trial 0 finished with value: 0.6 and parameters: {'C': 1.9418738628538537, 'kernel': 'linear'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:28:23,896] Trial 1 finished with value: 0.625 and parameters: {'C': 0.1856986549111385, 'kernel': 'linear'}. Best is trial 1 with value: 0.625.
[I 2025-05-27 17:28:31,857] Trial 2 finished with value: 0.575 and parameters: {'C': 0.3107451325966369, 'kernel': 'rbf'}. Best is trial 1 with value: 0.625.
[I 2025-05-27 17:28:41,563] Trial 3 finished with value: 0.575 and parameters: {'C': 0.16514551100951985, 'kernel': 'linear'}. Best is trial 1 with value: 0.625.
[I 2025-05-27 17:28:54,030] Trial 4 finished with value: 0.6 and parameters: {'C': 0.10148631029262883, 'kernel': 'rbf'}. Best is trial 1 with value: 0.625.
[I 2025-05-27 17:28:57,918] Trial 5 finished with value: 0.6 and parameters: {'C': 8.156410817975116, 'kernel': 'linear'}. Best is trial 1 with value: 0.625.
[I 2025-05-27 17:29:05,602] Trial 6 finished wi


SVM Performance (Training _2):
Accuracy: 0.6500
F1-Score: 0.6491
Confusion Matrix:
[[ 8  2  0  0  0  0]
 [ 6 12  0  0  0  2]
 [ 0  1  1  0  0  0]
 [ 0  1  0  1  0  0]
 [ 0  0  0  0  3  0]
 [ 0  2  0  0  0  1]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.57      0.80      0.67        10
                            Other       0.67      0.60      0.63        20
                     Criminal Law       1.00      0.50      0.67         2
Enforcement of Fundamental Rights       1.00      0.50      0.67         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.33      0.33      0.33         3

                         accuracy                           0.65        40
                        macro avg       0.76      0.62      0.66        40
                     weighted avg       0.68      0.65      0.65        40



[I 2025-05-27 17:30:15,938] Trial 0 finished with value: 0.65 and parameters: {'n_estimators': 60, 'max_depth': 10}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:30:19,310] Trial 1 finished with value: 0.675 and parameters: {'n_estimators': 173, 'max_depth': 10}. Best is trial 1 with value: 0.675.
[I 2025-05-27 17:30:21,181] Trial 2 finished with value: 0.65 and parameters: {'n_estimators': 72, 'max_depth': 10}. Best is trial 1 with value: 0.675.
[I 2025-05-27 17:30:22,414] Trial 3 finished with value: 0.625 and parameters: {'n_estimators': 54, 'max_depth': 10}. Best is trial 1 with value: 0.675.
[I 2025-05-27 17:30:24,151] Trial 4 finished with value: 0.65 and parameters: {'n_estimators': 63, 'max_depth': 20}. Best is trial 1 with value: 0.675.
[I 2025-05-27 17:30:26,702] Trial 5 finished with value: 0.6 and parameters: {'n_estimators': 105, 'max_depth': 10}. Best is trial 1 with value: 0.675.
[I 2025-05-27 17:30:32,235] Trial 6 finished with value: 0.6 and parameters: {'n_estim


RandomForest Performance (Training _2):
Accuracy: 0.7000
F1-Score: 0.6726
Confusion Matrix:
[[ 5  4  0  0  0  1]
 [ 1 18  0  0  0  1]
 [ 0  2  0  0  0  0]
 [ 0  1  0  1  0  0]
 [ 0  0  0  0  3  0]
 [ 0  2  0  0  0  1]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.83      0.50      0.62        10
                            Other       0.67      0.90      0.77        20
                     Criminal Law       0.00      0.00      0.00         2
Enforcement of Fundamental Rights       1.00      0.50      0.67         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.33      0.33      0.33         3

                         accuracy                           0.70        40
                        macro avg       0.64      0.54      0.57        40
                     weighted avg       0.69      0.70      0.67        40

Labe

[I 2025-05-27 17:31:40,703] A new study created in memory with name: no-name-06d53a4e-a00c-4bd7-a336-4cdd1566bb7c
[I 2025-05-27 17:31:42,133] Trial 0 finished with value: 0.65 and parameters: {'C': 2.8758356969522363, 'max_iter': 1175}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:31:42,740] Trial 1 finished with value: 0.65 and parameters: {'C': 0.1427843293417266, 'max_iter': 1007}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:31:43,627] Trial 2 finished with value: 0.65 and parameters: {'C': 0.43849109086442023, 'max_iter': 899}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:31:44,252] Trial 3 finished with value: 0.65 and parameters: {'C': 0.1608102121797931, 'max_iter': 1378}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:31:44,994] Trial 4 finished with value: 0.65 and parameters: {'C': 0.21883038324936716, 'max_iter': 1222}. Best is trial 0 with value: 0.65.
[I 2025-05-27 17:31:46,114] Trial 5 finished with value: 0.675 and parameters: {'C': 1.1978239520434


LogisticRegression Performance (Training _3):
Accuracy: 0.6750
F1-Score: 0.6755
Confusion Matrix:
[[ 8  2  0  0  0  0]
 [ 7 11  0  0  0  2]
 [ 0  1  1  0  0  0]
 [ 0  0  0  2  0  0]
 [ 0  0  0  0  3  0]
 [ 0  1  0  0  0  2]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.53      0.80      0.64        10
                            Other       0.73      0.55      0.63        20
                     Criminal Law       1.00      0.50      0.67         2
Enforcement of Fundamental Rights       1.00      1.00      1.00         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.50      0.67      0.57         3

                         accuracy                           0.68        40
                        macro avg       0.79      0.75      0.75        40
                     weighted avg       0.71      0.68      0.68        40

[I 2025-05-27 17:32:12,377] Trial 0 finished with value: 0.6 and parameters: {'C': 0.12230028364670077, 'kernel': 'rbf'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:32:16,212] Trial 1 finished with value: 0.55 and parameters: {'C': 0.9404556228757738, 'kernel': 'rbf'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:32:19,660] Trial 2 finished with value: 0.55 and parameters: {'C': 2.5018552422132534, 'kernel': 'rbf'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:32:24,074] Trial 3 finished with value: 0.525 and parameters: {'C': 0.5338646314717985, 'kernel': 'rbf'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:32:29,635] Trial 4 finished with value: 0.575 and parameters: {'C': 0.3206445014490169, 'kernel': 'rbf'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:32:35,748] Trial 5 finished with value: 0.55 and parameters: {'C': 0.16023252763237578, 'kernel': 'linear'}. Best is trial 0 with value: 0.6.
[I 2025-05-27 17:32:39,211] Trial 6 finished with value: 0.55 an


SVM Performance (Training _3):
Accuracy: 0.6500
F1-Score: 0.6527
Confusion Matrix:
[[ 8  1  0  0  0  1]
 [ 8 10  0  0  0  2]
 [ 0  1  1  0  0  0]
 [ 0  0  0  2  0  0]
 [ 0  0  0  0  3  0]
 [ 0  1  0  0  0  2]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.50      0.80      0.62        10
                            Other       0.77      0.50      0.61        20
                     Criminal Law       1.00      0.50      0.67         2
Enforcement of Fundamental Rights       1.00      1.00      1.00         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.40      0.67      0.50         3

                         accuracy                           0.65        40
                        macro avg       0.78      0.74      0.73        40
                     weighted avg       0.71      0.65      0.65        40



[I 2025-05-27 17:33:55,995] Trial 0 finished with value: 0.55 and parameters: {'n_estimators': 80, 'max_depth': 20}. Best is trial 0 with value: 0.55.
[I 2025-05-27 17:34:01,656] Trial 1 finished with value: 0.625 and parameters: {'n_estimators': 290, 'max_depth': 40}. Best is trial 1 with value: 0.625.
[I 2025-05-27 17:34:03,558] Trial 2 finished with value: 0.575 and parameters: {'n_estimators': 103, 'max_depth': 20}. Best is trial 1 with value: 0.625.
[I 2025-05-27 17:34:06,550] Trial 3 finished with value: 0.55 and parameters: {'n_estimators': 156, 'max_depth': 50}. Best is trial 1 with value: 0.625.
[I 2025-05-27 17:34:12,000] Trial 4 finished with value: 0.625 and parameters: {'n_estimators': 300, 'max_depth': 20}. Best is trial 1 with value: 0.625.
[I 2025-05-27 17:34:14,772] Trial 5 finished with value: 0.575 and parameters: {'n_estimators': 145, 'max_depth': 30}. Best is trial 1 with value: 0.625.
[I 2025-05-27 17:34:19,522] Trial 6 finished with value: 0.625 and parameters: {


RandomForest Performance (Training _3):
Accuracy: 0.6250
F1-Score: 0.5875
Confusion Matrix:
[[ 3  6  0  0  0  1]
 [ 2 17  0  0  0  1]
 [ 0  2  0  0  0  0]
 [ 0  1  0  1  0  0]
 [ 0  0  0  0  3  0]
 [ 0  2  0  0  0  1]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.60      0.30      0.40        10
                            Other       0.61      0.85      0.71        20
                     Criminal Law       0.00      0.00      0.00         2
Enforcement of Fundamental Rights       1.00      0.50      0.67         2
                Election Petition       1.00      1.00      1.00         3
                     Property Law       0.33      0.33      0.33         3

                         accuracy                           0.62        40
                        macro avg       0.59      0.50      0.52        40
                     weighted avg       0.60      0.62      0.59        40
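
The figures in the classification reports can be re-derived by hand from the printed confusion matrices, which is a useful sanity check on such a small test set (40 reports). The sketch below is a plain-Python illustration, not part of the original pipeline. It copies the LogisticRegression (Training _1) matrix from the output above, and the helper name `metrics_from_confusion` is hypothetical.

```python
# Confusion matrix printed above for LogisticRegression (Training _1).
# Rows are true classes, columns are predicted classes.
cm = [
    [9, 1, 0, 0, 0, 0],    # Civil Procedure
    [7, 11, 0, 0, 0, 2],   # Other
    [0, 1, 1, 0, 0, 0],    # Criminal Law
    [0, 0, 0, 2, 0, 0],    # Enforcement of Fundamental Rights
    [0, 0, 0, 0, 3, 0],    # Election Petition
    [0, 1, 0, 0, 0, 2],    # Property Law
]


def metrics_from_confusion(cm):
    """Return accuracy plus per-class precision and recall (hypothetical helper)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precision, recall = [], []
    for i in range(n):
        support = sum(cm[i])                        # true samples of class i
        predicted = sum(cm[r][i] for r in range(n)) # samples predicted as class i
        recall.append(cm[i][i] / support if support else 0.0)
        precision.append(cm[i][i] / predicted if predicted else 0.0)
    return {
        "accuracy": correct / total,
        "precision": precision,
        "recall": recall,
        "macro_recall": sum(recall) / n,
    }


metrics = metrics_from_confusion(cm)
print(f"accuracy={metrics['accuracy']:.2f}, macro recall={metrics['macro_recall']:.2f}")
```

Reading the matrix this way also shows where the errors concentrate: 7 of the 20 `Other` reports are predicted as `Civil Procedure`, which accounts for most of the lost accuracy.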



## 3. Modeling - BERT (Advanced)

In [21]:
df.head(5)

Unnamed: 0,case_title,suitno,introduction,facts,issues,decision,full_report,full_report_cleaned,label
0,ASSET MANAGEMENT GROUP LIMITED v. GENESISCORP ...,CA/L/236M/95,This appeal borders on Civil Procedure.\n,The appellant as Plaintiff before the Lagos Hi...,The Appellant formulated the following issues ...,"On the whole, the Court of Appeal held that th...","GEORGE ADESOLA&nbsp;OGUNTADE, J.C.A. (Deliveri...",george adesola oguntade j c a delivering the l...,Civil Procedure
1,JAMES EBELE & ANOR v. ROBERT IKWEKI & ORS,CA/B/53M/2006,This is a ruling on an Application seeking Lea...,The present application flows from the Judgmen...,The Court determined the proprietary or otherw...,"In the final analysis, the Court of Appeal hel...",\nCHIOMA EGONDU NWOSU-IHEME&nbsp;J.C.A. (Deli...,chioma egondu nwosu iheme j c a delivering the...,Other
2,CENTRAL BANK OF NIGERIA v. MR TOMMY OKECHUKWU ...,CA/K/304/2020,This appeal borders on propriety of requiremen...,This appeal emanated from the decision of the ...,The Court of Appeal determined the appeal base...,"In the end, the Court of Appeal resolved the s...","PETER OYINKENIMIEMI AFFEN, J.C.A. (Delivering ...",peter oyinkenimiemi affen j c a delivering the...,Garnishee Proceedings
3,MOHAMMED AUWAL & ORS v. THE FEDERAL REPUBLIC O...,CA/J/183C/2011,This appeal borders on Criminal Law and Proced...,This appeal is against the judgment of the Fed...,The Court determined the appeal on the followi...,"In conclusion, the appeal was dismissed.\n","IBRAHIM SHATA BDLIYA, J.C.A. (Deliveringthe Le...",ibrahim shata bdliya j c a deliveringthe leadi...,Criminal Law
4,UNITED BANK FOR AFRICA PLC & ORS v. MR. UGOCHU...,CA/OW/385M/2012,This appeal borders on Enforcement of Fundamen...,This is an appeal against the judgment of NGOZ...,Appellant formulated 4 issues while the Respon...,"On the whole, the Court found no merit in the ...","FREDERICK OZIAKPONO&nbsp;OHO, J.C.A. (Deliveri...",frederick oziakpono oho j c a delivering the l...,Enforcement of Fundamental Rights


In [29]:
# Prepare data for first training
X = df['full_report_cleaned']
y = df['label']

# Check label distribution and merge rare classes (fewer than 2 instances)
label_counts_1 = y.value_counts()
print("Label Distribution Before Merging (Training 1):\n", label_counts_1)

# Identify classes with fewer than 2 instances
rare_classes_1 = label_counts_1[label_counts_1 < 2].index
if len(rare_classes_1) > 0:
    print(f"Merging rare classes (fewer than 2 instances) for Training 1: {list(rare_classes_1)}")
    y_merged_1 = y.copy()
    y_merged_1[y_merged_1.isin(rare_classes_1)] = 'Other'
else:
    print("No rare classes found for Training 1.")
    y_merged_1 = y

# Verify new distribution
print("Label Distribution After Merging (Training 1):\n", y_merged_1.value_counts())

# Encode labels after merging
labels_sorted = sorted(y_merged_1.unique())
label_map_1 = {label: idx for idx, label in enumerate(labels_sorted)}
y_encoded_1 = y_merged_1.map(label_map_1)

# Extract text data
X_1 = df['full_report_cleaned']

# Split data into training and test sets
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(
    X_1, y_encoded_1, test_size=0.2, random_state=42, stratify=y_encoded_1
)

# Reset index to prevent alignment issues
X_train_1 = X_train_1.reset_index(drop=True)
X_test_1 = X_test_1.reset_index(drop=True)
y_train_1 = y_train_1.reset_index(drop=True)
y_test_1 = y_test_1.reset_index(drop=True)

# Verify split distributions
print("\nTrain Label Distribution (Training 1):\n", y_train_1.value_counts())
print("\nTest Label Distribution (Training 1):\n", y_test_1.value_counts())


Label Distribution Before Merging (Training 1):
 Other                                97
Civil Procedure                      49
Election Petition                    18
Property Law                         15
Criminal Law                         10
Enforcement of Fundamental Rights     9
Garnishee Proceedings                 1
Civil Law                             1
Name: label, dtype: int64
Merging rare classes (fewer than 2 instances) for Training 1: ['Garnishee Proceedings', 'Civil Law']
Label Distribution After Merging (Training 1):
 Other                                99
Civil Procedure                      49
Election Petition                    18
Property Law                         15
Criminal Law                         10
Enforcement of Fundamental Rights     9
Name: label, dtype: int64

Train Label Distribution (Training 1):
 4    79
0    39
2    15
5    12
1     8
3     7
Name: label, dtype: int64

Test Label Distribution (Training 1):
 4    20
0    10
2     3
5     3
1  
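
The integer codes in the split distributions above come from sorting the class names alphabetically before enumerating them. As a minimal sketch (the class names are copied from the merged label distribution, and `id_to_label` is a hypothetical helper name), the mapping can be reproduced as follows:

```python
# Class names after merging rare classes, as printed above
labels = [
    "Other",
    "Civil Procedure",
    "Election Petition",
    "Property Law",
    "Criminal Law",
    "Enforcement of Fundamental Rights",
]

# Same construction as label_map_1: alphabetical order, then enumerate
label_map = {label: idx for idx, label in enumerate(sorted(labels))}

# Inverse mapping for turning predicted ids back into class names
id_to_label = {idx: label for label, idx in label_map.items()}

for idx in sorted(id_to_label):
    print(idx, id_to_label[idx])
```

Under this encoding `Other` is class 4 and `Civil Procedure` is class 0, which matches the training counts of 79 and 39. The ids differ from those in the TF-IDF runs, so confusion matrices from the two sections should not be compared row by row without remapping.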

In [30]:
# Initialize tokenizer for first training (max_length=512 is the maximum input length for bert-base-uncased)
tokenizer_1 = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function_1(texts, max_length=512):
    """Tokenize text for BERT with padding and truncation (Training 1)."""
    if isinstance(texts, pd.Series):
        texts = texts.tolist()
    return tokenizer_1(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

class LegalDataset_1(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenize_function_1(texts)
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels.iloc[idx], dtype=torch.long)
        return item

# Prepare datasets for first training
train_dataset_1 = LegalDataset_1(X_train_1, y_train_1)
test_dataset_1 = LegalDataset_1(X_test_1, y_test_1)


In [31]:
# Initialize BERT model for first training (class weights are applied later, in the custom trainer's loss)
num_labels_1 = len(label_map_1)
bert_model_1 = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels_1)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [32]:
# Training arguments for first training
training_args_1 = TrainingArguments(
    output_dir='./results_1',
    num_train_epochs=10,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
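    # 160 training samples / batch size 2 = 80 steps per epoch, 800 steps in total,
    # so 500 warmup steps keep the learning rate ramping up until epoch 6.25
    warmup_steps=500,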
    weight_decay=0.01,
    logging_dir='./logs_1',
    logging_steps=10,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    learning_rate=2e-5
)
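
To see how these arguments interact, the learning-rate schedule can be worked out by hand. The sketch below assumes the `Trainer` default of a linear schedule with warmup and takes the training-set size (160) from the split distribution above. The function name `linear_warmup_lr` is hypothetical and only mirrors the shape of that schedule; it is not the library implementation.

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=500, total_steps=800):
    """Linear ramp from 0 to base_lr, then linear decay to 0 (illustrative only)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


train_samples = 160   # 79 + 39 + 15 + 12 + 8 + 7 from the train split
batch_size = 2        # per_device_train_batch_size
epochs = 10           # num_train_epochs

steps_per_epoch = train_samples // batch_size
total_steps = steps_per_epoch * epochs
warmup_fraction = 500 / total_steps

print(f"total steps: {total_steps}, warmup fraction: {warmup_fraction:.1%}")
for step in (10, 250, 500, 510, 800):
    print(step, f"{linear_warmup_lr(step, total_steps=total_steps):.3e}")
```

With these settings the warmup covers 62.5% of the run, so the model only trains at the full `learning_rate` of 2e-5 around step 500 (epoch 6.25). A much shorter warmup, for example around 10% of the total steps, is a common choice for fine-tuning on a dataset this small.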


In [33]:
# Compute inverse-frequency class weights for first training: total_samples / (num_classes * class_count)
class_counts_1 = np.bincount(y_train_1)
total_samples_1 = len(y_train_1)
num_labels_1 = len(label_map_1)
class_weights_1 = torch.tensor([total_samples_1 / (num_labels_1 * count) if count > 0 else 1.0 for count in class_counts_1], dtype=torch.float)

# Custom trainer with class weights for first training
class WeightedTrainer_1(Trainer):
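    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) passed by newer transformers versions
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):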
        labels = inputs.get('labels')
        outputs = model(**inputs)
        logits = outputs.get('logits')
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights_1.to(logits.device))
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss
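
As a worked example of the weighting formula above, the sketch below plugs in the class counts from the training split printed earlier. It uses plain Python rather than `torch`, and the function name `inverse_frequency_weights` is hypothetical.

```python
# Training-split counts per encoded class id (from the output above)
train_counts = {0: 39, 1: 8, 2: 15, 3: 7, 4: 79, 5: 12}


def inverse_frequency_weights(counts):
    """weight_c = total_samples / (num_classes * count_c); 1.0 for empty classes."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {
        cls: (total / (num_classes * n) if n > 0 else 1.0)
        for cls, n in counts.items()
    }


weights = inverse_frequency_weights(train_counts)
for cls in sorted(weights):
    print(cls, round(weights[cls], 4))
```

The majority class (id 4, `Other`) gets a weight of about 0.34, while the rarest class (id 3) gets about 3.81, so a mistake on a rare class costs roughly eleven times more in the loss. Each class then contributes the same total weight, `total_samples / num_classes`, which is what balances the gradient signal.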


In [34]:
def compute_metrics_1(eval_pred):
    """Compute accuracy and weighted F1 from the logits and gold labels."""
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {
        'accuracy': accuracy_score(eval_pred.label_ids, preds),
        'f1': f1_score(eval_pred.label_ids, preds, average='weighted')
    }

# Note: the test split doubles as the evaluation set, so best-checkpoint
# selection (load_best_model_at_end) is based on test data
trainer_1 = WeightedTrainer_1(
    model=bert_model_1,
    args=training_args_1,
    train_dataset=train_dataset_1,
    eval_dataset=test_dataset_1,
    compute_metrics=compute_metrics_1
)
# No manual optimizer/scheduler setup is needed: train() creates both for the full run


In [35]:
# Train BERT model for first training
trainer_1.train()

  0%|          | 0/800 [00:00<?, ?it/s]

{'loss': 1.8315, 'learning_rate': 4.0000000000000003e-07, 'epoch': 0.12}
{'loss': 2.0891, 'learning_rate': 8.000000000000001e-07, 'epoch': 0.25}
{'loss': 1.9897, 'learning_rate': 1.2000000000000002e-06, 'epoch': 0.38}
{'loss': 2.0687, 'learning_rate': 1.6000000000000001e-06, 'epoch': 0.5}
{'loss': 2.1006, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.62}
{'loss': 1.9925, 'learning_rate': 2.4000000000000003e-06, 'epoch': 0.75}
{'loss': 2.1197, 'learning_rate': 2.8000000000000003e-06, 'epoch': 0.88}
{'loss': 2.0089, 'learning_rate': 3.2000000000000003e-06, 'epoch': 1.0}


  0%|          | 0/20 [00:00<?, ?it/s]

{'eval_loss': 2.029087781906128, 'eval_accuracy': 0.1, 'eval_f1': 0.03625, 'eval_runtime': 12.8764, 'eval_samples_per_second': 3.106, 'eval_steps_per_second': 1.553, 'epoch': 1.0}
{'loss': 2.0148, 'learning_rate': 3.6000000000000003e-06, 'epoch': 1.12}
{'loss': 2.0407, 'learning_rate': 4.000000000000001e-06, 'epoch': 1.25}
{'loss': 1.9733, 'learning_rate': 4.4e-06, 'epoch': 1.38}
{'loss': 2.0178, 'learning_rate': 4.800000000000001e-06, 'epoch': 1.5}
{'loss': 2.0779, 'learning_rate': 5.2e-06, 'epoch': 1.62}
{'loss': 1.9005, 'learning_rate': 5.600000000000001e-06, 'epoch': 1.75}
{'loss': 1.9874, 'learning_rate': 6e-06, 'epoch': 1.88}
{'loss': 2.057, 'learning_rate': 6.4000000000000006e-06, 'epoch': 2.0}


  0%|          | 0/20 [00:00<?, ?it/s]

{'eval_loss': 2.0097861289978027, 'eval_accuracy': 0.1, 'eval_f1': 0.032967032967032975, 'eval_runtime': 10.4497, 'eval_samples_per_second': 3.828, 'eval_steps_per_second': 1.914, 'epoch': 2.0}
{'loss': 2.1631, 'learning_rate': 6.800000000000001e-06, 'epoch': 2.12}
{'loss': 1.9998, 'learning_rate': 7.2000000000000005e-06, 'epoch': 2.25}
{'loss': 1.9662, 'learning_rate': 7.600000000000001e-06, 'epoch': 2.38}
{'loss': 2.0334, 'learning_rate': 8.000000000000001e-06, 'epoch': 2.5}
{'loss': 1.9725, 'learning_rate': 8.400000000000001e-06, 'epoch': 2.62}
{'loss': 1.901, 'learning_rate': 8.8e-06, 'epoch': 2.75}
{'loss': 1.8788, 'learning_rate': 9.200000000000002e-06, 'epoch': 2.88}
{'loss': 2.0591, 'learning_rate': 9.600000000000001e-06, 'epoch': 3.0}


  0%|          | 0/20 [00:00<?, ?it/s]

{'eval_loss': 1.9763389825820923, 'eval_accuracy': 0.1, 'eval_f1': 0.0305921052631579, 'eval_runtime': 9.7507, 'eval_samples_per_second': 4.102, 'eval_steps_per_second': 2.051, 'epoch': 3.0}
{'loss': 2.0679, 'learning_rate': 1e-05, 'epoch': 3.12}
{'loss': 1.9019, 'learning_rate': 1.04e-05, 'epoch': 3.25}
{'loss': 1.9659, 'learning_rate': 1.0800000000000002e-05, 'epoch': 3.38}
{'loss': 1.9612, 'learning_rate': 1.1200000000000001e-05, 'epoch': 3.5}
{'loss': 2.0037, 'learning_rate': 1.16e-05, 'epoch': 3.62}
{'loss': 1.9644, 'learning_rate': 1.2e-05, 'epoch': 3.75}
{'loss': 1.9704, 'learning_rate': 1.2400000000000002e-05, 'epoch': 3.88}
{'loss': 1.8419, 'learning_rate': 1.2800000000000001e-05, 'epoch': 4.0}


  0%|          | 0/20 [00:00<?, ?it/s]

{'eval_loss': 1.9091078042984009, 'eval_accuracy': 0.1, 'eval_f1': 0.026493506493506493, 'eval_runtime': 10.6167, 'eval_samples_per_second': 3.768, 'eval_steps_per_second': 1.884, 'epoch': 4.0}
{'loss': 1.906, 'learning_rate': 1.3200000000000002e-05, 'epoch': 4.12}
{'loss': 2.0227, 'learning_rate': 1.3600000000000002e-05, 'epoch': 4.25}
{'loss': 1.9768, 'learning_rate': 1.4e-05, 'epoch': 4.38}
{'loss': 1.8118, 'learning_rate': 1.4400000000000001e-05, 'epoch': 4.5}
{'loss': 1.8625, 'learning_rate': 1.48e-05, 'epoch': 4.62}
{'loss': 1.8183, 'learning_rate': 1.5200000000000002e-05, 'epoch': 4.75}
{'loss': 1.8124, 'learning_rate': 1.5600000000000003e-05, 'epoch': 4.88}
{'loss': 1.8298, 'learning_rate': 1.6000000000000003e-05, 'epoch': 5.0}


{'eval_loss': 1.8181520700454712, 'eval_accuracy': 0.075, 'eval_f1': 0.011538461538461539, 'eval_runtime': 14.7728, 'eval_samples_per_second': 2.708, 'eval_steps_per_second': 1.354, 'epoch': 5.0}
{'loss': 1.8194, 'learning_rate': 1.64e-05, 'epoch': 5.12}
{'loss': 1.7659, 'learning_rate': 1.6800000000000002e-05, 'epoch': 5.25}
{'loss': 1.8686, 'learning_rate': 1.72e-05, 'epoch': 5.38}
{'loss': 1.7734, 'learning_rate': 1.76e-05, 'epoch': 5.5}
{'loss': 1.779, 'learning_rate': 1.8e-05, 'epoch': 5.62}
{'loss': 1.7445, 'learning_rate': 1.8400000000000003e-05, 'epoch': 5.75}
{'loss': 1.8133, 'learning_rate': 1.88e-05, 'epoch': 5.88}
{'loss': 1.8218, 'learning_rate': 1.9200000000000003e-05, 'epoch': 6.0}


{'eval_loss': 1.7439228296279907, 'eval_accuracy': 0.325, 'eval_f1': 0.30302197802197806, 'eval_runtime': 9.7879, 'eval_samples_per_second': 4.087, 'eval_steps_per_second': 2.043, 'epoch': 6.0}
{'loss': 1.6658, 'learning_rate': 1.9600000000000002e-05, 'epoch': 6.12}
{'loss': 1.7404, 'learning_rate': 2e-05, 'epoch': 6.25}
{'loss': 1.7752, 'learning_rate': 1.9333333333333333e-05, 'epoch': 6.38}
{'loss': 1.7415, 'learning_rate': 1.866666666666667e-05, 'epoch': 6.5}
{'loss': 1.7442, 'learning_rate': 1.8e-05, 'epoch': 6.62}
{'loss': 1.854, 'learning_rate': 1.7333333333333336e-05, 'epoch': 6.75}
{'loss': 1.7327, 'learning_rate': 1.6666666666666667e-05, 'epoch': 6.88}
{'loss': 1.6716, 'learning_rate': 1.6000000000000003e-05, 'epoch': 7.0}


{'eval_loss': 1.7003593444824219, 'eval_accuracy': 0.375, 'eval_f1': 0.29411764705882354, 'eval_runtime': 15.3984, 'eval_samples_per_second': 2.598, 'eval_steps_per_second': 1.299, 'epoch': 7.0}
{'loss': 1.7105, 'learning_rate': 1.5333333333333334e-05, 'epoch': 7.12}
{'loss': 1.7543, 'learning_rate': 1.4666666666666666e-05, 'epoch': 7.25}
{'loss': 1.715, 'learning_rate': 1.4e-05, 'epoch': 7.38}
{'loss': 1.7311, 'learning_rate': 1.3333333333333333e-05, 'epoch': 7.5}
{'loss': 1.6708, 'learning_rate': 1.2666666666666667e-05, 'epoch': 7.62}
{'loss': 1.8278, 'learning_rate': 1.2e-05, 'epoch': 7.75}
{'loss': 1.6091, 'learning_rate': 1.1333333333333334e-05, 'epoch': 7.88}
{'loss': 1.7407, 'learning_rate': 1.0666666666666667e-05, 'epoch': 8.0}


{'eval_loss': 1.6998875141143799, 'eval_accuracy': 0.5, 'eval_f1': 0.3389830508474576, 'eval_runtime': 10.7236, 'eval_samples_per_second': 3.73, 'eval_steps_per_second': 1.865, 'epoch': 8.0}
{'loss': 1.6438, 'learning_rate': 1e-05, 'epoch': 8.12}
{'loss': 1.6373, 'learning_rate': 9.333333333333334e-06, 'epoch': 8.25}
{'loss': 1.6704, 'learning_rate': 8.666666666666668e-06, 'epoch': 8.38}
{'loss': 1.7468, 'learning_rate': 8.000000000000001e-06, 'epoch': 8.5}
{'loss': 1.6034, 'learning_rate': 7.333333333333333e-06, 'epoch': 8.62}
{'loss': 1.7866, 'learning_rate': 6.666666666666667e-06, 'epoch': 8.75}
{'loss': 1.6782, 'learning_rate': 6e-06, 'epoch': 8.88}
{'loss': 1.6456, 'learning_rate': 5.333333333333334e-06, 'epoch': 9.0}


{'eval_loss': 1.6945174932479858, 'eval_accuracy': 0.5, 'eval_f1': 0.3389830508474576, 'eval_runtime': 9.8052, 'eval_samples_per_second': 4.079, 'eval_steps_per_second': 2.04, 'epoch': 9.0}
{'loss': 1.7789, 'learning_rate': 4.666666666666667e-06, 'epoch': 9.12}
{'loss': 1.7616, 'learning_rate': 4.000000000000001e-06, 'epoch': 9.25}
{'loss': 1.6755, 'learning_rate': 3.3333333333333333e-06, 'epoch': 9.38}
{'loss': 1.6576, 'learning_rate': 2.666666666666667e-06, 'epoch': 9.5}
{'loss': 1.6559, 'learning_rate': 2.0000000000000003e-06, 'epoch': 9.62}
{'loss': 1.5466, 'learning_rate': 1.3333333333333334e-06, 'epoch': 9.75}
{'loss': 1.752, 'learning_rate': 6.666666666666667e-07, 'epoch': 9.88}
{'loss': 1.7323, 'learning_rate': 0.0, 'epoch': 10.0}


{'eval_loss': 1.693190336227417, 'eval_accuracy': 0.5, 'eval_f1': 0.3333333333333333, 'eval_runtime': 9.0471, 'eval_samples_per_second': 4.421, 'eval_steps_per_second': 2.211, 'epoch': 10.0}
{'train_runtime': 3977.6867, 'train_samples_per_second': 0.402, 'train_steps_per_second': 0.201, 'train_loss': 1.8500644218921662, 'epoch': 10.0}


TrainOutput(global_step=800, training_loss=1.8500644218921662, metrics={'train_runtime': 3977.6867, 'train_samples_per_second': 0.402, 'train_steps_per_second': 0.201, 'train_loss': 1.8500644218921662, 'epoch': 10.0})

In [36]:
# Evaluate BERT for first training
predictions_1 = trainer_1.predict(test_dataset_1)
y_pred_bert_1 = np.argmax(predictions_1.predictions, axis=1)
bert_accuracy_1 = accuracy_score(y_test_1, y_pred_bert_1)
bert_f1_1 = f1_score(y_test_1, y_pred_bert_1, average='weighted')

print('\nBERT Performance (Training 1):')
print(f'Accuracy: {bert_accuracy_1:.4f}')
print(f'F1-Score: {bert_f1_1:.4f}')
print('Confusion Matrix:')
print(confusion_matrix(y_test_1, y_pred_bert_1))
print('Classification Report:')
print(classification_report(y_test_1, y_pred_bert_1, target_names=label_map_1.keys(), zero_division=0))


BERT Performance (Training 1):
Accuracy: 0.5000
F1-Score: 0.3333
Confusion Matrix:
[[ 0  0  0  0 10  0]
 [ 0  0  0  0  2  0]
 [ 0  0  0  0  3  0]
 [ 0  0  0  0  2  0]
 [ 0  0  0  0 20  0]
 [ 0  0  0  0  3  0]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.00      0.00      0.00        10
                     Criminal Law       0.00      0.00      0.00         2
                Election Petition       0.00      0.00      0.00         3
Enforcement of Fundamental Rights       0.00      0.00      0.00         2
                            Other       0.50      1.00      0.67        20
                     Property Law       0.00      0.00      0.00         3

                         accuracy                           0.50        40
                        macro avg       0.08      0.17      0.11        40
                     weighted avg       0.25      0.50      0.33        40



### Observation:

- The `BERT model` (Accuracy: 0.5000, F1-Score: 0.3333) underperforms `Logistic Regression` (Accuracy: 0.6500, F1-Score: 0.6560), `SVM` (Accuracy: 0.6250, F1-Score: 0.6252), and `Random Forest` (Accuracy: 0.5750, F1-Score: 0.5455). This is not unusual given the dataset, the training setup, and BERT's default configuration.

#### Analysis of the Issue

##### 1. **Dataset Size and Complexity**
- **Small Dataset**: The dataset has only 200 samples, with an 80-20 split resulting in 160 training samples and 40 test samples. BERT, being a large pre-trained transformer model with millions of parameters, typically requires a much larger dataset (thousands of samples) to fine-tune effectively on a downstream task like legal document classification.
- **Imbalanced Classes**: The distribution is heavily skewed (in the test set, `Other` has 20 samples, `Civil Procedure` has 10, and each remaining class has 2 or 3). BERT struggles with this imbalance without proper weighting or oversampling, and it still struggled here even though I applied SMOTE and class weights.

##### 2. **Training Configuration**
- **Epochs and Learning Rate**: I trained for 10 epochs with a peak learning rate of `2e-5` on a linear schedule, which warms up to the peak and then decays to zero. The loss decreases over time (from 1.9116 to 1.7058), but evaluation accuracy plateaus at 0.5 with a weighted F1 near 0.33, suggesting underfitting or insufficient learning.
- **Batch Size**: `per_device_train_batch_size=2` and `per_device_eval_batch_size=2` are very small, chosen because of hardware limitations (I'm using a MacBook Pro 2020 - i9, 32GB). Such small batches give noisy gradient estimates and can lead to unstable training.
- **Warmup Steps**: `warmup_steps=500` is high relative to the dataset size (160 samples, 80 steps per epoch with batch size 2, 800 steps in total). The learning rate is still climbing for the first 6.25 epochs and only reaches `2e-5` at step 500, so most of the run trains below the intended rate and the decay is squeezed into the last 300 steps.
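
The warmup arithmetic can be checked with a few lines of plain Python. This is a minimal sketch, assuming the usual linear warmup followed by linear decay to zero; the helper name `linear_warmup_lr` is illustrative and not part of the notebook.

```python
def linear_warmup_lr(step, peak_lr=2e-5, warmup_steps=500, total_steps=800):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: ramp linearly from peak_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


steps_per_epoch = 160 // 2           # 160 training samples, batch size 2
total_steps = steps_per_epoch * 10   # 10 epochs -> 800 optimizer steps

for step in (130, 500, 510, 800):
    epoch = step / steps_per_epoch
    print(f"step={step:4d}  epoch={epoch:5.2f}  lr={linear_warmup_lr(step):.4e}")
```

The values at steps 130, 500 and 510 agree with the learning rates logged above at epochs 1.62, 6.25 and 6.38, which suggests the run spent 500 of its 800 steps in warmup.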

##### 3. **Model Initialization and Fine-Tuning**
- **Uninitialized Weights**: The warning about uninitialized `classifier.weight` and `classifier.bias` indicates that BERT’s classification head is randomly initialized, which is expected before fine-tuning. Without sufficient training data or epochs, the head fails to learn meaningful class boundaries, leading to poor performance (e.g., predicting only `Other`, as seen in the confusion matrix).
- **Pre-trained Model**: `bert-base-uncased` is a general-purpose model not specialized for legal text. It lacks domain-specific knowledge, which could explain its lower performance compared to simpler models trained on TF-IDF features.

##### 4. **Evaluation Metrics**
- **Confusion Matrix**: BERT’s confusion matrix shows it predicts `Other` for all 40 test samples (20 out of 20 `Other`, 10 out of 10 `Civil Procedure`, etc.), indicating it has collapsed onto the majority class and failed to learn class distinctions.
- **F1-Score Drop**: The weighted F1-score of 0.3333 reflects poor performance on minority classes, with precision and recall at 0.00 for every class except `Other`.
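
The weighted and macro F1 values in the classification report follow directly from predicting `Other` for every sample. The sketch below recomputes them in plain Python from the test-set support printed in the report; the variable and function names are illustrative.

```python
# Test-set support per class, copied from the classification report above
support = {
    "Civil Procedure": 10,
    "Criminal Law": 2,
    "Election Petition": 3,
    "Enforcement of Fundamental Rights": 2,
    "Other": 20,
    "Property Law": 3,
}
predicted_class = "Other"          # the model predicts this class for every sample
total = sum(support.values())      # 40 test samples


def f1_for(label):
    """F1 of one class when every sample is predicted as `predicted_class`."""
    true_pos = support[label] if label == predicted_class else 0
    n_predicted = total if label == predicted_class else 0
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / support[label]
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {label: f1_for(label) for label in support}
weighted_f1 = sum(support[label] / total * per_class_f1[label] for label in support)
macro_f1 = sum(per_class_f1.values()) / len(support)

print(f"weighted F1 = {weighted_f1:.4f}, macro F1 = {macro_f1:.4f}")
```

Only `Other` contributes a non-zero F1 (precision 0.5, recall 1.0), and it carries half of the total support, which is why the weighted score sits at one third.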

##### 5. **Comparison with Traditional Models**
- **TF-IDF Advantage**: Logistic Regression, SVM, and Random Forest benefit from TF-IDF features, which weight each term by how distinctive it is for a document. These models are also less prone to overfitting on small datasets due to fewer parameters.
- **BERT’s Complexity**: BERT’s 110M parameters require more data and tuning to outperform simpler models, which is challenging with the current setup.

#### Why BERT Underperforms
- **Insufficient Data**: 160 training samples are inadequate for fine-tuning BERT, leading to underfitting.
- **Collapse onto the Majority Class**: The small batch size and high warmup steps may cause BERT to settle on predicting `Other`, the majority class, for every sample.
- **Suboptimal Hyperparameters**: The learning rate schedule, batch size, and number of epochs may not be optimal for this dataset.
- **Domain Mismatch**: `bert-base-uncased` lacks legal domain knowledge, while TF-IDF models capture task-specific patterns better with the limited data.

#### Improvements to Boost BERT Performance



In [37]:
# Prepare data for second training
X = df['full_report_cleaned']
y = df['label']

# Check label distribution and merge rare classes (fewer than 2 instances)
label_counts_2 = y.value_counts()
print("Label Distribution Before Merging (Training 2):\n", label_counts_2)

# Identify classes with fewer than 2 instances
rare_classes_2 = label_counts_2[label_counts_2 < 2].index
if len(rare_classes_2) > 0:
    print(f"Merging rare classes (fewer than 2 instances) for Training 2: {list(rare_classes_2)}")
    y_merged_2 = y.copy()
    y_merged_2[y_merged_2.isin(rare_classes_2)] = 'Other'
else:
    print("No rare classes found for Training 2.")
    y_merged_2 = y

# Verify new distribution
print("Label Distribution After Merging (Training 2):\n", y_merged_2.value_counts())

# Encode labels after merging
labels_sorted = sorted(y_merged_2.unique())
label_map_2 = {label: idx for idx, label in enumerate(labels_sorted)}
y_encoded_2 = y_merged_2.map(label_map_2)

# Extract text data
X_2 = df['full_report_cleaned']

# Split data into training and test sets
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
    X_2, y_encoded_2, test_size=0.2, random_state=42, stratify=y_encoded_2
)

# Reset index to prevent alignment issues
X_train_2 = X_train_2.reset_index(drop=True)
X_test_2 = X_test_2.reset_index(drop=True)
y_train_2 = y_train_2.reset_index(drop=True)
y_test_2 = y_test_2.reset_index(drop=True)

# Verify split distributions
print("\nTrain Label Distribution (Training 2):\n", y_train_2.value_counts())
print("\nTest Label Distribution (Training 2):\n", y_test_2.value_counts())


Label Distribution Before Merging (Training 2):
 Other                                97
Civil Procedure                      49
Election Petition                    18
Property Law                         15
Criminal Law                         10
Enforcement of Fundamental Rights     9
Garnishee Proceedings                 1
Civil Law                             1
Name: label, dtype: int64
Merging rare classes (fewer than 2 instances) for Training 2: ['Garnishee Proceedings', 'Civil Law']
Label Distribution After Merging (Training 2):
 Other                                99
Civil Procedure                      49
Election Petition                    18
Property Law                         15
Criminal Law                         10
Enforcement of Fundamental Rights     9
Name: label, dtype: int64

Train Label Distribution (Training 2):
 4    79
0    39
2    15
5    12
1     8
3     7
Name: label, dtype: int64

Test Label Distribution (Training 2):
 4    20
0    10
2     3
5     3
1  

In [38]:
# Initialize tokenizer for second training (the datasets below pass max_length=128)
tokenizer_2 = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function_2(texts, max_length=512):
    """Tokenize text for BERT with padding and truncation (Training 2)."""
    if isinstance(texts, pd.Series):
        texts = texts.tolist()
    return tokenizer_2(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

class LegalDataset_2(torch.utils.data.Dataset):
    def __init__(self, texts, labels, max_length=128):
        self.encodings = tokenize_function_2(texts, max_length)
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels.iloc[idx], dtype=torch.long)
        return item

# Prepare datasets for second training
train_dataset_2 = LegalDataset_2(X_train_2, y_train_2, max_length=128)
test_dataset_2 = LegalDataset_2(X_test_2, y_test_2, max_length=128)


In [39]:
# Initialize BERT model for second training
num_labels_2 = len(label_map_2)
bert_model_2 = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels_2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [40]:
# Compute class weights for second training
class_counts_2 = np.bincount(y_train_2)
total_samples_2 = len(y_train_2)
num_labels_2 = len(label_map_2)
class_weights_2 = torch.tensor([total_samples_2 / (num_labels_2 * count) if count > 0 else 1.0 for count in class_counts_2], dtype=torch.float)

# Custom trainer with class weights for second training
class WeightedTrainer_2(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get('labels')
        outputs = model(**inputs)
        logits = outputs.get('logits')
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights_2.to(logits.device))
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss
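
To make the weighting concrete, the sketch below applies the same formula, `total / (num_classes * count)`, to the train label counts printed for Training 2. It is plain Python for illustration only; the names `train_counts` and `weights` are not part of the notebook.

```python
# Train label counts from the Training 2 split, indexed by encoded label id
labels = [
    "Civil Procedure",                    # id 0
    "Criminal Law",                       # id 1
    "Election Petition",                  # id 2
    "Enforcement of Fundamental Rights",  # id 3
    "Other",                              # id 4
    "Property Law",                       # id 5
]
train_counts = [39, 8, 15, 7, 79, 12]

total_samples = sum(train_counts)   # 160
num_classes = len(train_counts)     # 6

# Balanced weighting: rarer classes receive proportionally larger weights
weights = [total_samples / (num_classes * count) for count in train_counts]

for label, count, weight in zip(labels, train_counts, weights):
    print(f"{label:<35} count={count:3d}  weight={weight:.4f}")

print(f"largest / smallest weight = {max(weights) / min(weights):.2f}")
```

An error on the rarest class is penalized roughly eleven times more than an error on `Other`, which may help explain why a weighted model can swing from the majority class to a minority class when it has too little data to separate them.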


In [41]:
# Training arguments with optimization for second training
training_args_2 = TrainingArguments(
    output_dir='./results_2',
    num_train_epochs=3,  # Reduced from 10
    per_device_train_batch_size=4,  # Increased from 2
    per_device_eval_batch_size=4,  # Increased from 2
    warmup_steps=50,  # Reduced from 500
    weight_decay=0.01,
    logging_dir='./logs_2',
    logging_steps=10,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    learning_rate=2e-5
)


In [42]:
# Early stopping for second training
early_stopping_2 = EarlyStoppingCallback(early_stopping_patience=2)

# Initialize trainer for second training
trainer_2 = WeightedTrainer_2(
    model=bert_model_2,
    args=training_args_2,
    train_dataset=train_dataset_2,
    eval_dataset=test_dataset_2,
    callbacks=[early_stopping_2],
    compute_metrics=lambda eval_pred: {
        'accuracy': accuracy_score(eval_pred.label_ids, np.argmax(eval_pred.predictions, axis=1)),
        'f1': f1_score(eval_pred.label_ids, np.argmax(eval_pred.predictions, axis=1), average='weighted')
    }
)

In [43]:
# Train BERT model for second training
trainer_2.train()

{'loss': 1.8844, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.25}
{'loss': 1.8179, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.5}
{'loss': 1.7797, 'learning_rate': 1.2e-05, 'epoch': 0.75}
{'loss': 1.879, 'learning_rate': 1.6000000000000003e-05, 'epoch': 1.0}


{'eval_loss': 1.817477822303772, 'eval_accuracy': 0.05, 'eval_f1': 0.004878048780487804, 'eval_runtime': 6.6674, 'eval_samples_per_second': 5.999, 'eval_steps_per_second': 1.5, 'epoch': 1.0}
{'loss': 1.8336, 'learning_rate': 2e-05, 'epoch': 1.25}
{'loss': 1.8484, 'learning_rate': 1.7142857142857142e-05, 'epoch': 1.5}
{'loss': 1.848, 'learning_rate': 1.4285714285714287e-05, 'epoch': 1.75}
{'loss': 1.7941, 'learning_rate': 1.1428571428571429e-05, 'epoch': 2.0}


{'eval_loss': 1.8077442646026611, 'eval_accuracy': 0.125, 'eval_f1': 0.13040540540540538, 'eval_runtime': 4.5728, 'eval_samples_per_second': 8.747, 'eval_steps_per_second': 2.187, 'epoch': 2.0}
{'loss': 1.8518, 'learning_rate': 8.571428571428571e-06, 'epoch': 2.25}
{'loss': 1.8113, 'learning_rate': 5.7142857142857145e-06, 'epoch': 2.5}
{'loss': 1.8285, 'learning_rate': 2.8571428571428573e-06, 'epoch': 2.75}
{'loss': 1.8095, 'learning_rate': 0.0, 'epoch': 3.0}


{'eval_loss': 1.8038082122802734, 'eval_accuracy': 0.125, 'eval_f1': 0.13040540540540538, 'eval_runtime': 5.6946, 'eval_samples_per_second': 7.024, 'eval_steps_per_second': 1.756, 'epoch': 3.0}
{'train_runtime': 392.6315, 'train_samples_per_second': 1.223, 'train_steps_per_second': 0.306, 'train_loss': 1.8321923732757568, 'epoch': 3.0}


TrainOutput(global_step=120, training_loss=1.8321923732757568, metrics={'train_runtime': 392.6315, 'train_samples_per_second': 1.223, 'train_steps_per_second': 0.306, 'train_loss': 1.8321923732757568, 'epoch': 3.0})

In [44]:
# Evaluate BERT for second training
predictions_2 = trainer_2.predict(test_dataset_2)
y_pred_bert_2 = np.argmax(predictions_2.predictions, axis=1)
bert_accuracy_2 = accuracy_score(y_test_2, y_pred_bert_2)
bert_f1_2 = f1_score(y_test_2, y_pred_bert_2, average='weighted')

print('\nBERT Performance (Training 2):')
print(f'Accuracy: {bert_accuracy_2:.4f}')
print(f'F1-Score: {bert_f1_2:.4f}')
print('Confusion Matrix:')
print(confusion_matrix(y_test_2, y_pred_bert_2))
unique_labels_2 = np.unique(y_test_2)
target_names_2 = [list(label_map_2.keys())[label] for label in unique_labels_2]
print('Classification Report:')
print(classification_report(y_test_2, y_pred_bert_2, labels=unique_labels_2, target_names=target_names_2, zero_division=0))


BERT Performance (Training 2):
Accuracy: 0.1250
F1-Score: 0.1304
Confusion Matrix:
[[ 0  9  0  0  1  0]
 [ 0  2  0  0  0  0]
 [ 0  3  0  0  0  0]
 [ 0  2  0  0  0  0]
 [ 1 16  0  0  3  0]
 [ 0  3  0  0  0  0]]
Classification Report:
                                   precision    recall  f1-score   support

                  Civil Procedure       0.00      0.00      0.00        10
                     Criminal Law       0.06      1.00      0.11         2
                Election Petition       0.00      0.00      0.00         3
Enforcement of Fundamental Rights       0.00      0.00      0.00         2
                            Other       0.75      0.15      0.25        20
                     Property Law       0.00      0.00      0.00         3

                         accuracy                           0.12        40
                        macro avg       0.13      0.19      0.06        40
                     weighted avg       0.38      0.12      0.13        40



### **Observation for BERT**

#### **Analysis of the Issue**

##### 1. **Dataset Size and Complexity**
- **Small Dataset**: 200 samples (160 train, 40 test) are insufficient for BERT’s 110M parameters.
- **Imbalanced Classes**: Skewed distribution (e.g., `Other` accounts for 20 of the 40 test samples) challenges BERT without strong mitigation.

##### 2. **Training Configuration**
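- **Epochs and Learning Rate**: Training 1 ran 10 epochs at `2e-5` with `warmup_steps=500`; Training 2 cut this to 3 epochs with `warmup_steps=50`. Mean training loss stayed near 1.8 in both runs (1.85 and 1.83), close to ln 6 ≈ 1.79, the loss of a uniform guess over six classes, so both runs underfit.
- **Batch Size**: Small batches (2 in Training 1, 4 in Training 2) due to hardware limits (MacBook Pro 2020) give noisy gradient estimates.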

##### 3. **Model Initialization**
- **Uninitialized Head**: Random `classifier` weights fail to learn with limited data.
- **Domain Mismatch**: `bert-base-uncased` lacks legal expertise, lagging behind TF-IDF models.

##### 4. **Evaluation Metrics**
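- **Confusion Matrix**: Training 1 collapses onto `Other`, predicting it for all 40 test samples; Training 2 collapses onto `Criminal Law` instead (35 of 40 predictions).
- **F1-Score**: Weighted F1 of 0.3333 (Training 1) and 0.1304 (Training 2) shows poor minority class performance.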
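
The Training 2 confusion matrix can be summarized by its column sums, which count how often each class was predicted. A minimal sketch in plain Python, with the matrix copied from the output above and illustrative variable names:

```python
labels = [
    "Civil Procedure",
    "Criminal Law",
    "Election Petition",
    "Enforcement of Fundamental Rights",
    "Other",
    "Property Law",
]
# Rows are true classes, columns are predicted classes (Training 2 output)
confusion = [
    [0, 9, 0, 0, 1, 0],
    [0, 2, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 0],
    [1, 16, 0, 0, 3, 0],
    [0, 3, 0, 0, 0, 0],
]

n = len(labels)
predicted_counts = [sum(row[j] for row in confusion) for j in range(n)]  # column sums
correct = sum(confusion[i][i] for i in range(n))                         # diagonal
total = sum(sum(row) for row in confusion)
accuracy = correct / total

for label, count in zip(labels, predicted_counts):
    print(f"{label:<35} predicted {count:2d} times")
print(f"accuracy = {accuracy:.4f}")
```

Almost all of the predictions fall in the `Criminal Law` column, a class with only 2 true test samples, so the reported accuracy comes from the few diagonal hits in `Criminal Law` and `Other`.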

##### 5. **Comparison with Traditional Models**
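- **Complexity**: BERT’s 110M parameters need far more data than the much simpler TF-IDF models.
- **Feature Gap**: Raw text struggles vs. TF-IDF’s tailored features; in Training 2, BERT also sees only the first 128 tokens of each report (`max_length=128`).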

#### **Why BERT Underperforms**
- **Data Scarcity**: 160 samples are too few for fine-tuning.
- **Unstable Optimization**: Small batches and a long warmup (Training 1) disrupt learning, and predictions collapse onto a single class.
- **Domain Issue**: General model misses legal nuances.

#### **Improvements to Boost BERT Performance**
- Switch to `roberta-base`, use focal loss, and apply back-translation.
---
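
As a rough illustration of why focal loss is on the list: it multiplies the per-sample cross-entropy by `(1 - p_true) ** gamma`, so samples the model already classifies confidently contribute very little. The sketch below uses plain Python on single probabilities; the function names are illustrative and `alpha=1`, `gamma=2` are assumed defaults.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy of one sample, given the predicted probability of its true class."""
    return -math.log(p_true)


def focal_loss(p_true, alpha=1.0, gamma=2.0):
    """Focal loss of one sample: cross-entropy scaled by (1 - p_true) ** gamma."""
    return alpha * (1.0 - p_true) ** gamma * cross_entropy(p_true)


# Easy (0.9), ambiguous (0.5), and hard (0.1) samples
for p_true in (0.9, 0.5, 0.1):
    ce = cross_entropy(p_true)
    fl = focal_loss(p_true)
    print(f"p_true={p_true:.1f}  CE={ce:.4f}  focal={fl:.4f}  ratio={fl / ce:.2f}")
```

With a scalar `alpha`, the loss reweights samples by difficulty only; it does not add per-class weights, so any class balancing would have to come from elsewhere.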

In [109]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
from torch import nn
import torch
from sklearn.metrics import accuracy_score, f1_score

class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2, reduction='mean'):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction

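    def forward(self, inputs, targets):
        # Per-sample multi-class cross-entropy; exp(-CE) is the probability of the true class
        ce_loss = nn.CrossEntropyLoss(reduction='none')(inputs, targets)
        pt = torch.exp(-ce_loss)
        # Down-weight easy samples (pt close to 1) by (1 - pt) ** gamma
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss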

def run_bert_training(df, suffix, balance=False, back_translation=False):
    """
    Fine-tune RoBERTa (`roberta-base`) with focal loss on CPU.
    Args:
        df: DataFrame with 'full_report_cleaned' and 'label' columns
        suffix: Suffix for output and log directories (e.g., '_1', '_2', '_3')
        balance: Intended to enable class weighting; currently not used in this function
        back_translation: If True, apply back-translation (passed to prepare_data)
    Returns:
        model, tokenizer, accuracy, f1, y_pred, y_test, label_map
    """
    # Prepare data
    X_train, X_test, y_train, y_test, label_map = prepare_data(df, suffix, back_translation=back_translation)

    # Initialize tokenizer and model
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

    def tokenize_function(texts, max_length=128):  # Reduced to 128 for CPU
        return tokenizer(
            texts.tolist(),
            padding='max_length',
            truncation=True,
            max_length=max_length,
            return_tensors='pt'
        )

    class LegalDataset(torch.utils.data.Dataset):
        def __init__(self, texts, labels):
            self.encodings = tokenize_function(texts)
            self.labels = labels
        
        def __len__(self):
            return len(self.labels)
        
        def __getitem__(self, idx):
            item = {key: val[idx] for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels.iloc[idx], dtype=torch.long)
            return item

    # Prepare datasets
    train_dataset = LegalDataset(X_train, y_train)
    test_dataset = LegalDataset(X_test, y_test)

    # Initialize RoBERTa model on CPU
    num_labels = len(label_map)
    model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=num_labels)
    device = torch.device('cpu')
    model.to(device)

    # Training arguments optimized for CPU
    training_args = TrainingArguments(
        output_dir=f'./results{suffix}',
        num_train_epochs=5 if suffix == '_1' else 3,  # 5 epochs for Training 1, 3 for the others
        per_device_train_batch_size=1,  # Single sample per batch
        per_device_eval_batch_size=1,
        warmup_steps=50,
        weight_decay=0.01,
        logging_dir=f'./logs{suffix}',
        logging_steps=10,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='eval_loss',
        greater_is_better=False,
        learning_rate=2e-5,
        lr_scheduler_type='linear',
        gradient_accumulation_steps=1,
    )

    # Custom trainer with Focal Loss
    class FocalTrainer(Trainer):
        def compute_loss(self, model, inputs, return_outputs=False):
            labels = inputs.get('labels')
            outputs = model(**inputs)
            logits = outputs.get('logits')
            loss_fct = FocalLoss()
            loss = loss_fct(logits, labels)
            return (loss, outputs) if return_outputs else loss

    trainer = FocalTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=lambda eval_pred: {
            'accuracy': accuracy_score(eval_pred.label_ids, np.argmax(eval_pred.predictions, axis=1)),
            'f1': f1_score(eval_pred.label_ids, np.argmax(eval_pred.predictions, axis=1), average='weighted')
        }
    )

    # Train RoBERTa model
    trainer.train()

    # Evaluate RoBERTa on the test set
    predictions = trainer.predict(test_dataset)
    y_pred = np.argmax(predictions.predictions, axis=1)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')

    print(f'\nBERT Performance (Training {suffix}):')
    print(f'Accuracy: {accuracy:.4f}')
    print(f'F1-Score: {f1:.4f}')
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    unique_labels = np.unique(y_test)
    target_names = [list(label_map.keys())[label] for label in unique_labels]
    print('Classification Report:')
    print(classification_report(y_test, y_pred, labels=unique_labels, target_names=target_names, zero_division=0))

    return model, tokenizer, accuracy, f1, y_pred, y_test, label_map

In [None]:
# Training 1: Baseline (no balancing, no back-translation)
bert_model_1, tokenizer_1, bert_accuracy_1, bert_f1_1, y_pred_bert_1, y_test_bert_1, label_map_bert_1 = run_bert_training(df, '_1', balance=False)

# Training 2: Balanced, no back-translation
bert_model_2, tokenizer_2, bert_accuracy_2, bert_f1_2, y_pred_bert_2, y_test_bert_2, label_map_bert_2 = run_bert_training(df, '_2', balance=True)

# Training 3: Balanced + Back-translation (advanced)
bert_model_3, tokenizer_3, bert_accuracy_3, bert_f1_3, y_pred_bert_3, y_test_bert_3, label_map_bert_3 = run_bert_training(df, '_3', balance=True, back_translation=True)

In [92]:
# Required imports
from sklearn.metrics import precision_score, recall_score, f1_score

# Collect all accuracies, F1-scores, precision, and recall from LR/SVM/RF and BERT trainings
all_results = []

# Add results from LR/SVM/RF Trainings
for suffix, models, vectorizer, model_accuracies, X_test_tfidf, y_test, label_map in [
    (1, models_1, vectorizer_1, model_accuracies_1, X_test_tfidf_1, y_test_1, label_map_1),
    (2, models_2, vectorizer_2, model_accuracies_2, X_test_tfidf_2, y_test_2, label_map_2),
    (3, models_3, vectorizer_3, model_accuracies_3, X_test_tfidf_3, y_test_3, label_map_3)
]:
    for name, acc in model_accuracies.items():
        y_pred = models[name].predict(X_test_tfidf)
        f1 = f1_score(y_test, y_pred, average="weighted")
        prec = precision_score(y_test, y_pred, average="weighted", zero_division=0)
        rec = recall_score(y_test, y_pred, average="weighted", zero_division=0)
        all_results.append((name, suffix, acc, f1, prec, rec))

# Add results from BERT Trainings
for suffix, (acc, f1, y_pred, y_test, label_map) in [
    (1, (bert_accuracy_1, bert_f1_1, y_pred_bert_1, y_test_bert_1, label_map_bert_1)),
    (2, (bert_accuracy_2, bert_f1_2, y_pred_bert_2, y_test_bert_2, label_map_bert_2)),
    (3, (bert_accuracy_3, bert_f1_3, y_pred_bert_3, y_test_bert_3, label_map_bert_3))
]:
    prec = precision_score(y_test, y_pred, average="weighted", zero_division=0)
    rec = recall_score(y_test, y_pred, average="weighted", zero_division=0)
    all_results.append(("BERT", suffix, acc, f1, prec, rec))

# Sort by accuracy in descending order
sorted_results = sorted(all_results, key=lambda x: x[2], reverse=True)

# Display as a Markdown table
print("\n### Combined Model Performance Summary\n")
print("| Model             | Training | Accuracy | F1-Score | Precision | Recall  |")
print("|-------------------|----------|----------|----------|-----------|---------|")
for model_name, train_num, acc, f1, prec, rec in sorted_results:
    print(f"| {model_name:<16} | {train_num:^8} | {acc:^8.4f} | {f1:^8.4f} | {prec:^9.4f} | {rec:^7.4f} |")

# Identify the best overall model
best_model_name, best_training, best_accuracy, best_f1, best_prec, best_rec = sorted_results[0]

# Artifacts of each training run, keyed by training number
tfidf_runs = {
    1: (models_1, vectorizer_1, X_test_tfidf_1, y_test_1, label_map_1),
    2: (models_2, vectorizer_2, X_test_tfidf_2, y_test_2, label_map_2),
    3: (models_3, vectorizer_3, X_test_tfidf_3, y_test_3, label_map_3),
}
bert_runs = {
    1: (y_test_bert_1, label_map_bert_1),
    2: (y_test_bert_2, label_map_bert_2),
    3: (y_test_bert_3, label_map_bert_3),
}

if best_model_name == "BERT":
    # BERT has no TF-IDF model, vectorizer, or feature matrix
    best_model = best_vectorizer = best_X_test = None
    best_y_test, best_label_map = bert_runs[best_training]
else:
    models, best_vectorizer, best_X_test, best_y_test, best_label_map = tfidf_runs[best_training]
    best_model = models.get(best_model_name)

print(f"\n**Best Overall Model**: {best_model_name} from Training {best_training} with Accuracy: {best_accuracy:.4f}, F1-Score: {best_f1:.4f}, Precision: {best_prec:.4f}, Recall: {best_rec:.4f}")


### Combined Model Performance Summary

| Model             | Training | Accuracy | F1-Score | Precision | Recall  |
|-------------------|----------|----------|----------|-----------|---------|
| LogisticRegression |    1     |  0.7000  |  0.6978  |  0.7460   | 0.7000  |
| RandomForest     |    2     |  0.7000  |  0.6726  |  0.6917   | 0.7000  |
| SVM              |    1     |  0.6750  |  0.6691  |  0.7061   | 0.6750  |
| LogisticRegression |    2     |  0.6750  |  0.6755  |  0.7125   | 0.6750  |
| LogisticRegression |    3     |  0.6750  |  0.6755  |  0.7125   | 0.6750  |
| SVM              |    2     |  0.6500  |  0.6491  |  0.6762   | 0.6500  |
| SVM              |    3     |  0.6500  |  0.6527  |  0.7146   | 0.6500  |
| RandomForest     |    1     |  0.6250  |  0.5220  |  0.6107   | 0.6250  |
| RandomForest     |    3     |  0.6250  |  0.5875  |  0.6036   | 0.6250  |
| BERT             |    1     |  0.3750  |  0.2830  |  0.2273   | 0.3750  |
| BERT             |    3     |  0.2750
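
In every complete row above, Recall equals Accuracy. That is what one would expect if precision, recall, and F1 were computed with support-weighted averaging, because weighted recall reduces to the fraction of correct predictions. The averaging mode is not shown in this section, so treat that as an assumption. A minimal pure-Python sketch with made-up labels (not the notebook's data) illustrates the identity; `accuracy` and `weighted_recall` are illustrative helpers, not functions from the notebook:

```python
import math
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights proportional to class support."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # weight (support / n) times recall (correct / support) sums to correct / n
    return sum((support[c] / n) * (correct[c] / support[c]) for c in support)

# Hypothetical labels, chosen only for illustration
y_true = ["civil", "civil", "civil", "criminal", "criminal", "land"]
y_pred = ["civil", "civil", "criminal", "criminal", "land", "land"]

acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(math.isclose(acc, rec))  # True
```

The same identity does not hold for weighted precision or F1, which is why those columns differ from Accuracy.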

In [100]:
import pickle  # required by pickle.dump below; not part of the import cell at the top

# Combine accuracies from all LR/SVM/RF trainings
all_lr_accuracies = {}

# Training 1
for name, acc in model_accuracies_1.items():
    all_lr_accuracies[(name, 1)] = (acc, models_1[name], vectorizer_1, X_test_tfidf_1, y_test_1, label_map_1)

# Training 2
for name, acc in model_accuracies_2.items():
    all_lr_accuracies[(name, 2)] = (acc, models_2[name], vectorizer_2, X_test_tfidf_2, y_test_2, label_map_2)

# Training 3
for name, acc in model_accuracies_3.items():
    all_lr_accuracies[(name, 3)] = (acc, models_3[name], vectorizer_3, X_test_tfidf_3, y_test_3, label_map_3)

# Find the best LR/SVM/RF model
best_lr_key = max(all_lr_accuracies, key=lambda k: all_lr_accuracies[k][0])
best_lr_model_name, lr_training_number = best_lr_key
best_lr_accuracy, best_lr_model, best_lr_vectorizer, best_lr_X_test, best_lr_y_test, best_lr_label_map = all_lr_accuracies[best_lr_key]
best_lr_predictions = best_lr_model.predict(best_lr_X_test)

print(f"\nBest LR/SVM/RF Model: {best_lr_model_name} from Training {lr_training_number} with Accuracy: {best_lr_accuracy:.4f}")

# Combine accuracies from all BERT trainings
all_bert_accuracies = {
    ('BERT', 1): (bert_accuracy_1, bert_model_1, tokenizer_1, y_pred_bert_1, y_test_bert_1, label_map_bert_1),
    ('BERT', 2): (bert_accuracy_2, bert_model_2, tokenizer_2, y_pred_bert_2, y_test_bert_2, label_map_bert_2),
    ('BERT', 3): (bert_accuracy_3, bert_model_3, tokenizer_3, y_pred_bert_3, y_test_bert_3, label_map_bert_3)
}

# Find the best BERT model
best_bert_key = max(all_bert_accuracies, key=lambda k: all_bert_accuracies[k][0])
best_bert_model_name, bert_training_number = best_bert_key
best_bert_accuracy, best_bert_model, best_bert_tokenizer, best_bert_predictions, best_bert_y_test, best_bert_label_map = all_bert_accuracies[best_bert_key]

print(f"Best BERT Model: {best_bert_model_name} from Training {bert_training_number} with Accuracy: {best_bert_accuracy:.4f}")

# Ensure the models directory exists
os.makedirs('../models', exist_ok=True)

# Save the best LR/SVM/RF model and vectorizer
try:
    with open('../models/saved_lr_model.pkl', 'wb') as f:
        pickle.dump(best_lr_model, f)
    with open('../models/lr_vectorizer.pkl', 'wb') as f:
        pickle.dump(best_lr_vectorizer, f)
    print("Saved LR/SVM/RF model and vectorizer successfully.")
except Exception as e:
    print(f"Error saving LR/SVM/RF model or vectorizer: {e}")

# Save the best BERT model and tokenizer
try:
    best_bert_model.save_pretrained('../models/best_bert_model')
    best_bert_tokenizer.save_pretrained('../models/best_bert_tokenizer')
    with open('../models/best_bert_label_map.json', 'w') as f:
        json.dump(best_bert_label_map, f)
    print("Saved BERT model, tokenizer, and label map successfully.")
except Exception as e:
    print(f"Error saving BERT model or tokenizer: {e}")

# Clear memory
gc.collect()
print(f"Memory cleanup completed. Process memory usage: {psutil.Process().memory_info().rss / 1024**2:.2f} MB")


Best LR/SVM/RF Model: LogisticRegression from Training 1 with Accuracy: 0.7000
Best BERT Model: BERT from Training 1 with Accuracy: 0.3750
Saved LR/SVM/RF model and vectorizer successfully.
Saved BERT model, tokenizer, and label map successfully.
Memory cleanup completed. Process memory usage: 968.26 MB
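
The summary table lists LogisticRegression (Training 1) and RandomForest (Training 2) with the same accuracy of 0.7000. Python's `max` returns the first maximal item it encounters, so a selection keyed on accuracy alone depends on iteration order. A minimal sketch of one way to make the choice explicit is to break ties on F1-score with a tuple key. The rows are copied from the summary table; the variable names are illustrative, not the notebook's:

```python
# (model, training, accuracy, f1) rows copied from the summary table.
# RandomForest is deliberately listed first to expose the order dependence.
results = [
    ("RandomForest", 2, 0.7000, 0.6726),
    ("LogisticRegression", 1, 0.7000, 0.6978),
    ("SVM", 1, 0.6750, 0.6691),
]

# Accuracy only: the first row with the top accuracy wins.
best_by_accuracy = max(results, key=lambda row: row[2])

# Accuracy first, then F1-score as the tie-breaker.
best_by_accuracy_then_f1 = max(results, key=lambda row: (row[2], row[3]))

print(best_by_accuracy[0])          # RandomForest
print(best_by_accuracy_then_f1[0])  # LogisticRegression
```

Whether F1-score is the right tie-breaker is a design choice; the point is only that the criterion is stated in the key rather than left to dictionary order.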


## 4. Results and Visualization

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

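# Ensure the output directory for figures exists (savefig does not create it)
os.makedirs('../visuals', exist_ok=True)

# Visualize confusion matrices
plt.figure(figsize=(12, 5))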

# Confusion Matrix for Best BERT Model
plt.subplot(1, 2, 1)
sns.heatmap(confusion_matrix(best_bert_y_test, best_bert_predictions), annot=True, fmt='d', cmap='Blues',
            xticklabels=best_bert_label_map.keys(), yticklabels=best_bert_label_map.keys())
plt.title(f'Confusion Matrix - Best BERT (Training {bert_training_number})')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.xticks(rotation=45)

# Confusion Matrix for Best LR/SVM/RF Model
plt.subplot(1, 2, 2)
sns.heatmap(confusion_matrix(best_lr_y_test, best_lr_predictions), annot=True, fmt='d', cmap='Blues',
            xticklabels=best_lr_label_map.keys(), yticklabels=best_lr_label_map.keys())
plt.title(f'Confusion Matrix - Best {best_lr_model_name} (Training {lr_training_number})')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.xticks(rotation=45)

plt.tight_layout()
plt.savefig('../visuals/confusion_matrices.png')
plt.close()

# Plot Accuracy Comparison
plt.figure(figsize=(8, 5))
models = [f'{best_lr_model_name} (Training {lr_training_number})', f'BERT (Training {bert_training_number})']
accuracies = [best_lr_accuracy, best_bert_accuracy]
sns.barplot(x=accuracies, y=models, palette='Blues_d')
plt.title('Accuracy Comparison of Best Models')
plt.xlabel('Accuracy')
plt.xlim(0, 1)
for i, v in enumerate(accuracies):
    plt.text(v + 0.01, i, f'{v:.4f}', va='center')
plt.tight_layout()  # keep the long model names on the y-axis from being clipped
plt.savefig('../visuals/accuracy_comparison.png')
plt.close()

## 5. Summary
- **Preprocessing**: Cleaned `full_report` and extracted labels with a comprehensive keyword approach, reducing the 'Other' category. Handled imbalance with SMOTE for TF-IDF models.
- **Modeling**: TF-IDF with Logistic Regression (Training 1) achieved the best accuracy, 0.7000; fine-tuned BERT (5 epochs, class weighting) peaked at 0.3750 accuracy (Training 1).
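- **Evaluation**: Reported Accuracy, F1-Score, Precision, and Recall for every model and training run; saved confusion matrices and an accuracy comparison chart to `../visuals/`.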
- **API**: Implemented in `app/api.py` using the Logistic Regression model.
- **Future Work**: Fine-tune BERT with more data, explore Legal-BERT, and optimize API with Docker.
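
An inference service has to turn a predicted class index back into an area of law, which means reading a saved label map such as `best_bert_label_map.json`. The sketch below assumes the map is stored as label name to class index (suggested by the use of `.keys()` as tick labels in the confusion matrix plots, but not confirmed here) and inverts it after loading. `load_index_to_label` and the example labels are hypothetical, and a temporary file stands in for the real artifact:

```python
import json
import os
import tempfile

def load_index_to_label(path):
    """Read a {label_name: class_index} JSON file and return {class_index: label_name}."""
    with open(path, "r", encoding="utf-8") as f:
        label_to_index = json.load(f)
    # int() guards against indices that were serialized as strings
    return {int(index): label for label, index in label_to_index.items()}

# Hypothetical label map, written to a temporary file for the demonstration
example_map = {"Civil Procedure": 0, "Enforcement of Fundamental Rights": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_map.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(example_map, f)
    index_to_label = load_index_to_label(path)

predicted_index = 1  # stand-in for the output of model.predict(...)
print(index_to_label[predicted_index])  # Enforcement of Fundamental Rights
```

If the map were stored the other way round (index to name), JSON would turn the integer keys into strings on save, so the loader would need to convert the keys instead of the values.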