<span style="font-weight: bold; font-size: 18px;">**Multi-Label Posture Classification: Model Development Strategy**<br><br>

We compare two complementary modeling approaches for the multi-label posture prediction task, each with distinct advantages for legal document classification.

**Baseline Approach: Bag-of-Words Models**<br>

Our initial baseline leverages traditional bag-of-words representations (TF-IDF, BM25) combined with multi-label classifiers, justified by several key factors:

<div style="margin-left: 20px;"><b>• Computational Efficiency:</b> Lightweight architecture enables rapid prototyping and establishes performance baselines without GPU requirements</div>
<div style="margin-left: 20px;"><b>• Statistical Robustness:</b> Word-frequency features provide interpretable, domain-agnostic representations suitable for legal terminology analysis</div>
<div style="margin-left: 20px;"><b>• Multi-Label Compatibility:</b> Well-established integration with multi-label algorithms (One-vs-Rest, Binary Relevance, Label Powerset)</div>
<div style="margin-left: 20px;"><b>• Baseline Establishment:</b> Provides interpretable performance benchmarks for evaluating more complex architectures</div>

**Advanced Approach: Transformer-Based Models (ModernBERT)**<br>

Our primary model uses the ModernBERT encoder architecture, which addresses several limitations of the original BERT that matter for our use case:

<div style="margin-left: 20px;"><b>• Extended Context Coverage:</b> ModernBERT's 8,192-token context window accommodates ~90% of our corpus without truncation, preserving critical legal context that may span entire documents</div>

<div style="margin-left: 20px;"><b>• Contextual Understanding:</b> Unlike bag-of-words approaches, transformer architectures capture:
  <div style="margin-left: 40px;">- Long-range dependencies between legal arguments</div>
  <div style="margin-left: 40px;">- Positional relationships between procedural elements</div>
  <div style="margin-left: 40px;">- Semantic nuances distinguishing similar posture categories</div>
</div>

<div style="margin-left: 20px;"><b>• Multi-Label Architecture:</b> The encoder's [CLS] token representation can be effectively coupled with multi-label classification heads, enabling simultaneous prediction of multiple postures</div>

<div style="margin-left: 20px;"><b>• Legal Domain Adaptation:</b> Pre-trained language representations are expected to handle complex legal terminology and document structure better than word-frequency features</div>

**Comparative Justification:**<br>

This dual-approach strategy isolates the impact of feature representation on multi-label performance, from traditional statistical methods to contextual transformer encoders. The comparison is intended to identify the best trade-off between computational cost and classification accuracy for legal posture prediction.

</span>
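
To make the multi-label head concrete, here is a minimal NumPy sketch. It assumes one pooled [CLS] vector per document and one independent sigmoid per posture. The function name `multilabel_head`, the hidden size of 8, and the random weights are hypothetical placeholders for what a fine-tuned model would learn; none of them come from ModernBERT itself.

```python
import numpy as np

def multilabel_head(cls_embeddings, weight, bias, threshold=0.5):
    """Map pooled [CLS] vectors to independent per-label probabilities."""
    logits = cls_embeddings @ weight + bias      # shape: (n_docs, n_labels)
    probs = 1.0 / (1.0 + np.exp(-logits))        # sigmoid per label, not a softmax
    return probs, (probs >= threshold).astype(int)

rng = np.random.default_rng(0)
n_docs, hidden, n_labels = 4, 8, 27   # hidden size is illustrative only
cls = rng.normal(size=(n_docs, hidden))
W = rng.normal(scale=0.5, size=(hidden, n_labels))   # hypothetical weights
b = np.zeros(n_labels)

probs, preds = multilabel_head(cls, W, b)
print(probs.shape)           # one probability per (document, label) pair
print(preds.sum(axis=1))     # a document may receive zero, one, or several labels
```

Because each label gets its own sigmoid, the probabilities in a row do not have to sum to 1, which is what allows several postures to be predicted for the same document.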

## Data Preparation for ML

In [31]:
import os
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

In [32]:
# Prepare the labels - convert postures to a list format
def prepare_labels(postures_str):
    """Convert posture string to list of postures"""
    if pd.isna(postures_str) or postures_str == '':
        return []
    return [p.strip() for p in postures_str.split(',') if p.strip()]

# Load the processed data and apply to the dataframe
_dir = os.path.join(os.getcwd(), "processed_data")
df = pd.read_pickle(os.path.join(_dir, "data.pkl"))
df['posture_list'] = df['postures'].apply(prepare_labels)

# Remove documents with no postures
df_ml = df[df['posture_list'].apply(len) > 0].copy()
print(f"Documents with postures: {len(df_ml)}")

Documents with postures: 17077


In [33]:
# Analyze posture distribution
all_postures_ml = []
for postures in df_ml['posture_list']:
    all_postures_ml.extend(postures)

posture_counts = pd.Series(all_postures_ml).value_counts()
print(f"\nTotal unique postures: {len(posture_counts)}")
print()
print(f"Most common postures:")
print(posture_counts.head(15))


Total unique postures: 230

Most common postures:
On Appeal                                                         9197
Appellate Review                                                  4652
Review of Administrative Decision                                 2773
Motion to Dismiss                                                 1679
Sentencing or Penalty Phase Motion or Objection                   1342
Trial or Guilt Phase Motion or Objection                          1097
Motion for Attorney's Fees                                         612
Post-Trial Hearing Motion                                          512
Motion for Preliminary Injunction                                  364
Motion to Dismiss for Lack of Subject Matter Jurisdiction          343
Motion to Compel Arbitration                                       255
Motion for New Trial                                               226
Petition to Terminate Parental Rights                              219
Motion for Judgment as a M

In [34]:
# Filter to most common postures (those appearing in at least 100 documents)
min_frequency = 100
common_postures = posture_counts[posture_counts >= min_frequency].index.tolist()
print(f"\nPostures with >= {min_frequency} occurrences: {len(common_postures)}")
print(common_postures)


Postures with >= 100 occurrences: 27
['On Appeal', 'Appellate Review', 'Review of Administrative Decision', 'Motion to Dismiss', 'Sentencing or Penalty Phase Motion or Objection', 'Trial or Guilt Phase Motion or Objection', "Motion for Attorney's Fees", 'Post-Trial Hearing Motion', 'Motion for Preliminary Injunction', 'Motion to Dismiss for Lack of Subject Matter Jurisdiction', 'Motion to Compel Arbitration', 'Motion for New Trial', 'Petition to Terminate Parental Rights', 'Motion for Judgment as a Matter of Law (JMOL)/Directed Verdict', 'Motion for Reconsideration', 'Motion to Dismiss for Lack of Personal Jurisdiction', 'Motion for Costs', 'Juvenile Delinquency Proceeding', 'Motion for Default Judgment/Order of Default', 'Motion to Dismiss for Lack of Standing', 'Motion to Dismiss for Lack of Jurisdiction', 'Motion to Transfer or Change Venue', 'Petition for Divorce or Dissolution', 'Motion for Contempt', 'Motion for Protective Order', 'Motion for Permanent Injunction', 'Motion to Se

In [35]:
# Filter documents to only include those with common postures
def filter_common_postures(posture_list, common_postures):
    """Keep only postures that are in the common_postures list"""
    return [p for p in posture_list if p in common_postures]

df_ml['filtered_postures'] = df_ml['posture_list'].apply(
    lambda x: filter_common_postures(x, common_postures)
)

# Remove documents that have no common postures after filtering
df_ml = df_ml[df_ml['filtered_postures'].apply(len) > 0].copy()
print(f"Documents after filtering to common postures: {len(df_ml)}")

Documents after filtering to common postures: 16568
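
The two-step filter (drop rare labels, then drop documents left without labels) can be illustrated on a toy example. This is a minimal sketch in plain Python: the documents, the labels "Rare Posture" and "Another Rare Posture", and the threshold of 2 are made up for illustration.

```python
from collections import Counter

docs = [
    ["On Appeal", "Motion to Dismiss"],
    ["On Appeal"],
    ["Motion to Dismiss", "Rare Posture"],   # made-up rare label
    ["Another Rare Posture"],                # made-up rare label
]
min_frequency = 2  # toy threshold; the notebook uses 100

# Step 1: count label occurrences and keep the frequent ones (a set gives fast lookups)
counts = Counter(label for labels in docs for label in labels)
common = {label for label, n in counts.items() if n >= min_frequency}

# Step 2: strip rare labels, then drop documents that end up with no label
filtered = [[label for label in labels if label in common] for labels in docs]
kept = [labels for labels in filtered if labels]

print(sorted(common))
print(f"{len(docs)} documents before, {len(kept)} after")
```

In the run above the same logic takes the corpus from 17,077 to 16,568 documents, so 509 documents carried only postures below the frequency threshold.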


In [36]:
## Multi-label Classification Setup

# Create binary label matrix using MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform(df_ml['filtered_postures'])

print(f"Label matrix shape: {y_multilabel.shape}")
print(f"Labels: {mlb.classes_}")

Label matrix shape: (16568, 27)
Labels: ['Appellate Review' 'Juvenile Delinquency Proceeding'
 "Motion for Attorney's Fees" 'Motion for Contempt' 'Motion for Costs'
 'Motion for Default Judgment/Order of Default'
 'Motion for Judgment as a Matter of Law (JMOL)/Directed Verdict'
 'Motion for New Trial' 'Motion for Permanent Injunction'
 'Motion for Preliminary Injunction' 'Motion for Protective Order'
 'Motion for Reconsideration' 'Motion to Compel Arbitration'
 'Motion to Dismiss' 'Motion to Dismiss for Lack of Jurisdiction'
 'Motion to Dismiss for Lack of Personal Jurisdiction'
 'Motion to Dismiss for Lack of Standing'
 'Motion to Dismiss for Lack of Subject Matter Jurisdiction'
 'Motion to Set Aside or Vacate' 'Motion to Transfer or Change Venue'
 'On Appeal' 'Petition for Divorce or Dissolution'
 'Petition to Terminate Parental Rights' 'Post-Trial Hearing Motion'
 'Review of Administrative Decision'
 'Sentencing or Penalty Phase Motion or Objection'
 'Trial or Guilt Phase Motion or 
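
A small round trip shows how `MultiLabelBinarizer` builds the label matrix. This is a minimal sketch on three toy documents; the variable names `toy_labels` and `toy_mlb` are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_labels = [
    ["On Appeal", "Appellate Review"],
    ["Motion to Dismiss"],
    ["On Appeal"],
]
toy_mlb = MultiLabelBinarizer()
Y = toy_mlb.fit_transform(toy_labels)

print(toy_mlb.classes_)                   # column order is alphabetical, not by frequency
print(Y)                                  # one row per document, one column per label
print(toy_mlb.inverse_transform(Y[:1]))   # map the first row back to label names
```

The alphabetical column order explains why 'Appellate Review' comes first in `mlb.classes_` above even though 'On Appeal' is the most frequent posture.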

In [37]:
_counts = df_ml['num_postures'].value_counts(dropna=False)
_pct = df_ml['num_postures'].value_counts(dropna=False,normalize=True) 

pd.DataFrame({
    'count': _counts,
    'percentage': _pct
}).sort_index().style.format({'count':'{:,}','percentage':'{:.2%}'}).set_caption("Distribution of num_postures")\
    .set_table_styles([{'selector': 'caption','props': [('color', 'red'),('font-size', '15px')]}])

num_postures,count,percentage
1,7649,46.17%
2,7567,45.67%
3,1127,6.80%
4,189,1.14%
5,32,0.19%
6,2,0.01%
7,2,0.01%
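
The table can be summarized as a label cardinality (mean number of postures per document). The sketch below only re-uses the counts printed above. It assumes `num_postures` counts the labels as tabulated, which may differ slightly from the length of `filtered_postures` if the column was computed before the frequency filter.

```python
# Counts copied from the num_postures table above
distribution = {1: 7649, 2: 7567, 3: 1127, 4: 189, 5: 32, 6: 2, 7: 2}

n_docs = sum(distribution.values())
total_labels = sum(k * n for k, n in distribution.items())
cardinality = total_labels / n_docs           # mean postures per document
multi_share = 1 - distribution[1] / n_docs    # share of documents with 2+ postures

print(f"documents: {n_docs:,}")
print(f"mean postures per document: {cardinality:.2f}")
print(f"documents with more than one posture: {multi_share:.2%}")
```

More than half of the documents carry at least two postures, which is why the task is framed as multi-label rather than multi-class.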


In [None]:
# Prepare text data
X_text = df_ml['full_text'].values

# Split the data
X_train, X_temp, y_train, y_temp = train_test_split(
    X_text, y_multilabel, 
    test_size=0.3, # 30% for temp (which will be split into val and test)
    random_state=42, 
    stratify=None
)

# Split temp into validation and test (50/50 split of the 30%),
# which gives 15% of the full dataset to each
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    random_state=42, 
    stratify=None
)

print(f"Total samples: {len(df_ml)}")
print(f"Training set: {len(X_train)} ({len(X_train)/len(df_ml):.2%})")
print(f"Validation set: {len(X_val)} ({len(X_val)/len(df_ml):.2%})")
print(f"Test set: {len(X_test)} ({len(X_test)/len(df_ml):.2%})")

Total samples: 16568
Training set: 11597 (70.00%)
Validation set: 2485 (15.00%)
Test set: 2486 (15.00%)
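
Because the split is not stratified (`stratify=None`), rare postures may be unevenly spread across the three sets. A minimal sketch of a coverage check follows, using toy label matrices; the helper name `label_coverage` and the toy data are hypothetical.

```python
import numpy as np

def label_coverage(y_split, class_names):
    """Return the labels that have no positive example in a split."""
    positives = np.asarray(y_split).sum(axis=0)
    return [name for name, n in zip(class_names, positives) if n == 0]

# Toy label matrices with three labels; label "C" is rare
names = ["A", "B", "C"]
y_train_toy = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1]])
y_val_toy = np.array([[1, 0, 0], [0, 1, 0]])

print(label_coverage(y_train_toy, names))  # labels missing from the toy training split
print(label_coverage(y_val_toy, names))    # labels missing from the toy validation split
```

On the real data the same check could be run as `label_coverage(y_val, mlb.classes_)` and `label_coverage(y_test, mlb.classes_)`; a label with no positives in a split makes per-label metrics such as ROC-AUC undefined for that split.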


In [39]:
# Check label distribution
train_label_sums = y_train.sum(axis=0)
val_label_sums = y_val.sum(axis=0)
test_label_sums = y_test.sum(axis=0)

print("\nLabel distribution in training set:")
for i, label in enumerate(mlb.classes_):
    print(f"{label}: {train_label_sums[i]} ({train_label_sums[i]/len(y_train)*100:.1f}%)")


Label distribution in training set:
Appellate Review: 3310 (28.5%)
Juvenile Delinquency Proceeding: 103 (0.9%)
Motion for Attorney's Fees: 412 (3.6%)
Motion for Contempt: 88 (0.8%)
Motion for Costs: 121 (1.0%)
Motion for Default Judgment/Order of Default: 101 (0.9%)
Motion for Judgment as a Matter of Law (JMOL)/Directed Verdict: 147 (1.3%)
Motion for New Trial: 156 (1.3%)
Motion for Permanent Injunction: 73 (0.6%)
Motion for Preliminary Injunction: 254 (2.2%)
Motion for Protective Order: 73 (0.6%)
Motion for Reconsideration: 145 (1.3%)
Motion to Compel Arbitration: 179 (1.5%)
Motion to Dismiss: 1155 (10.0%)
Motion to Dismiss for Lack of Jurisdiction: 82 (0.7%)
Motion to Dismiss for Lack of Personal Jurisdiction: 138 (1.2%)
Motion to Dismiss for Lack of Standing: 87 (0.8%)
Motion to Dismiss for Lack of Subject Matter Jurisdiction: 231 (2.0%)
Motion to Set Aside or Vacate: 73 (0.6%)
Motion to Transfer or Change Venue: 88 (0.8%)
On Appeal: 6404 (55.2%)
Petition for Divorce or Dissolution

In [43]:
# Save the preprocessed data
saved_data=os.path.join(os.getcwd(), 'processed_data')
os.makedirs(saved_data, exist_ok=True)
# Save using pickle
with open(os.path.join(saved_data,'train_arrays.pkl'), 'wb') as f:
    pickle.dump({'X_train': X_train, 'y_train': y_train, 'label_train': label_train}, f)

with open(os.path.join(saved_data,'val_arrays.pkl'), 'wb') as f:
    pickle.dump({'X_val': X_val, 'y_val': y_val, 'label_val': label_val}, f)

with open(os.path.join(saved_data,'test_arrays.pkl'), 'wb') as f:
    pickle.dump({'X_test': X_test, 'y_test': y_test, 'label_test': label_test}, f)

with open(os.path.join(saved_data,'class_name.pkl'), 'wb') as f:
    pickle.dump({'class_name': mlb.classes_}, f)

print("All arrays saved with pickle!")

# To load later:
# with open(os.path.join(saved_data,'train_arrays.pkl'), 'rb') as f:
#     train_data = pickle.load(f)
#     X_train = train_data['X_train']
#     y_train = train_data['y_train']
#     label_train = train_data['label_train']

All arrays saved with pickle!


## Bag-of-Words (TF-IDF): Benchmark

In [15]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import xgboost as xgb
import lightgbm as lgb
from lightgbm import early_stopping, log_evaluation
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, hamming_loss
from sklearn.metrics import (
    precision_score, recall_score, f1_score, 
    roc_auc_score, average_precision_score,
    hamming_loss, jaccard_score
)
from sklearn.preprocessing import MultiLabelBinarizer

import warnings
warnings.filterwarnings('ignore')
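
Before the full pipeline, the baseline idea can be shown end to end on a handful of sentences. This is a minimal sketch, not the benchmark itself: the toy texts and labels are invented, and the vectorizer settings are reduced so that they work on six short documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy corpus; not taken from the dataset
texts = [
    "defendant appeals the judgment of the trial court",
    "motion to dismiss for failure to state a claim",
    "on appeal the court reviews the order granting the motion to dismiss",
    "plaintiff appeals from the final judgment",
    "the court grants the motion to dismiss the complaint",
    "appeal from an order denying the motion to dismiss",
]
labels = [
    ["On Appeal"], ["Motion to Dismiss"], ["On Appeal", "Motion to Dismiss"],
    ["On Appeal"], ["Motion to Dismiss"], ["On Appeal", "Motion to Dismiss"],
]

mlb_toy = MultiLabelBinarizer()
Y = mlb_toy.fit_transform(labels)

vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X = vectorizer.fit_transform(texts)

# One binary logistic regression per label
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

proba = clf.predict_proba(vectorizer.transform(["appeal from the dismissal order"]))
print(dict(zip(mlb_toy.classes_, proba[0].round(2))))
```

One-vs-Rest trains an independent binary classifier per posture, so each label gets its own probability and several can exceed the decision threshold at once.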

In [21]:
# Create TF-IDF vectorizer
# Using parameters optimized for legal text
tfidf = TfidfVectorizer(
    max_features=10000,  # Limit features for computational efficiency
    stop_words='english',
    ngram_range=(1, 2),  # Include unigrams and bigrams
    min_df=5,           # Ignore terms that appear in fewer than 5 documents
    max_df=0.95,        # Ignore terms that appear in more than 95% of documents
    sublinear_tf=True   # Apply sublinear tf scaling: replace tf with 1 + log(tf)
)

print("Fitting TF-IDF vectorizer...")
X_train_tfidf = tfidf.fit_transform(X_train)
X_val_tfidf = tfidf.transform(X_val)
X_test_tfidf = tfidf.transform(X_test)

print(f"TF-IDF matrix shape (train): {X_train_tfidf.shape}")
print(f"TF-IDF matrix shape (val): {X_val_tfidf.shape}")
print(f"TF-IDF matrix shape (test): {X_test_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf.vocabulary_)}")

# Show some sample features
feature_names = tfidf.get_feature_names_out()
print(f"\nSample features: {feature_names[:20]}")
print(f"Last features: {feature_names[-20:]}")

def comprehensive_evaluation(y_true, y_pred_binary, y_pred_proba, threshold=0.5):
    """
    Comprehensive evaluation function for multi-label classification.
    
    Args:
        y_true: Ground truth binary labels (n_samples, n_labels)
        y_pred_binary: Predicted binary labels (n_samples, n_labels) 
        y_pred_proba: Predicted probabilities (n_samples, n_labels)
        threshold: Not used inside this function; y_pred_binary is expected to be thresholded already (default: 0.5)
    
    Returns:
        dict: Comprehensive metrics including all averaging methods
    """
    import numpy as np
    from sklearn.metrics import (
        precision_score, recall_score, f1_score, accuracy_score,
        hamming_loss, jaccard_score, roc_auc_score, average_precision_score
    )
    
    # Ensure inputs are numpy arrays
    y_true = np.array(y_true, dtype=int)
    y_pred_binary = np.array(y_pred_binary, dtype=int)
    y_pred_proba = np.array(y_pred_proba, dtype=float)
    
    metrics = {}
    
    try:
        # SAMPLES AVERAGE (per-sample then average across samples)
        metrics['precision_samples'] = precision_score(y_true, y_pred_binary, average='samples', zero_division=0)
        metrics['recall_samples'] = recall_score(y_true, y_pred_binary, average='samples', zero_division=0)
        metrics['f1_samples'] = f1_score(y_true, y_pred_binary, average='samples', zero_division=0)
        
        # MICRO AVERAGE (global average)
        metrics['precision_micro'] = precision_score(y_true, y_pred_binary, average='micro', zero_division=0)
        metrics['recall_micro'] = recall_score(y_true, y_pred_binary, average='micro', zero_division=0)
        metrics['f1_micro'] = f1_score(y_true, y_pred_binary, average='micro', zero_division=0)
        
        # MACRO AVERAGE (unweighted average across labels)
        metrics['precision_macro'] = precision_score(y_true, y_pred_binary, average='macro', zero_division=0)
        metrics['recall_macro'] = recall_score(y_true, y_pred_binary, average='macro', zero_division=0)
        metrics['f1_macro'] = f1_score(y_true, y_pred_binary, average='macro', zero_division=0)
        
        # WEIGHTED AVERAGE (weighted by support)
        metrics['precision_weighted'] = precision_score(y_true, y_pred_binary, average='weighted', zero_division=0)
        metrics['recall_weighted'] = recall_score(y_true, y_pred_binary, average='weighted', zero_division=0)
        metrics['f1_weighted'] = f1_score(y_true, y_pred_binary, average='weighted', zero_division=0)
        
        # ACCURACY METRICS
        metrics['accuracy'] = accuracy_score(y_true, y_pred_binary)
        metrics['hamming_loss'] = hamming_loss(y_true, y_pred_binary)
        
        # JACCARD (IoU) METRICS 
        metrics['jaccard_samples'] = jaccard_score(y_true, y_pred_binary, average='samples', zero_division=0)
        metrics['jaccard_macro'] = jaccard_score(y_true, y_pred_binary, average='macro', zero_division=0)
        metrics['jaccard_weighted'] = jaccard_score(y_true, y_pred_binary, average='weighted', zero_division=0)
        
        # ROC-AUC METRICS (using probabilities)
        try:
            metrics['roc_auc_micro'] = roc_auc_score(y_true, y_pred_proba, average='micro')
            metrics['roc_auc_macro'] = roc_auc_score(y_true, y_pred_proba, average='macro')
            metrics['roc_auc_weighted'] = roc_auc_score(y_true, y_pred_proba, average='weighted')
            metrics['roc_auc_samples'] = roc_auc_score(y_true, y_pred_proba, average='samples')
        except ValueError as e:
            print(f"Warning: ROC-AUC calculation failed: {e}")
            metrics['roc_auc_micro'] = 0.0
            metrics['roc_auc_macro'] = 0.0
            metrics['roc_auc_weighted'] = 0.0
            metrics['roc_auc_samples'] = 0.0
        
        # PR-AUC METRICS (using probabilities)
        try:
            metrics['pr_auc_micro'] = average_precision_score(y_true, y_pred_proba, average='micro')
            metrics['pr_auc_macro'] = average_precision_score(y_true, y_pred_proba, average='macro')
            metrics['pr_auc_weighted'] = average_precision_score(y_true, y_pred_proba, average='weighted')
            metrics['pr_auc_samples'] = average_precision_score(y_true, y_pred_proba, average='samples')
        except ValueError as e:
            print(f"Warning: PR-AUC calculation failed: {e}")
            metrics['pr_auc_micro'] = 0.0
            metrics['pr_auc_macro'] = 0.0
            metrics['pr_auc_weighted'] = 0.0
            metrics['pr_auc_samples'] = 0.0
        
    except Exception as e:
        print(f"Error in comprehensive_evaluation: {e}")
        # Return minimal metrics if calculation fails
        metrics = {
            'precision_micro': 0.0, 'recall_micro': 0.0, 'f1_micro': 0.0,
            'precision_macro': 0.0, 'recall_macro': 0.0, 'f1_macro': 0.0,
            'accuracy': 0.0, 'hamming_loss': 1.0
        }
    
    return metrics

print("✅ Comprehensive evaluation function updated and ready to use!")

NameError: name 'TfidfVectorizer' is not defined
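
The averaging modes reported by `comprehensive_evaluation` can disagree sharply when labels are imbalanced, as they are here. A minimal sketch on made-up data (two labels, four documents, not from the corpus) shows the effect:

```python
import numpy as np
from sklearn.metrics import f1_score

# Label 0 is frequent and always predicted correctly; label 1 is rare and always missed
y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [0, 0]])

f1_micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
f1_macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
f1_samples = f1_score(y_true, y_pred, average='samples', zero_division=0)

print(f"micro:   {f1_micro:.3f}")    # pools all label decisions, dominated by the frequent label
print(f"macro:   {f1_macro:.3f}")    # every label counts equally, so the missed rare label hurts
print(f"samples: {f1_samples:.3f}")  # F1 per document, then averaged over documents
```

On this toy input micro-F1 is 6/7 (about 0.86), macro-F1 is 0.50, and samples-F1 is 0.75. With 27 postures ranging from over half of the training documents down to under 1%, macro-averaged scores are the ones most sensitive to performance on rare postures.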

In [11]:
def comprehensive_evaluation(y_true, y_pred_proba, y_pred_binary=None, threshold=0.5):
    """
    Comprehensive evaluation for multi-label classification with all averaging methods
    """
    if y_pred_binary is None:
        y_pred_binary = (y_pred_proba >= threshold).astype(int)
    
    metrics = {}
    
    # SAMPLES AVERAGE (per-sample then average across samples)
    metrics['precision_samples'] = precision_score(y_true, y_pred_binary, average='samples', zero_division=0)
    metrics['recall_samples'] = recall_score(y_true, y_pred_binary, average='samples', zero_division=0)
    metrics['f1_samples'] = f1_score(y_true, y_pred_binary, average='samples', zero_division=0)
    
    # MICRO AVERAGE (global aggregation)
    metrics['precision_micro'] = precision_score(y_true, y_pred_binary, average='micro', zero_division=0)
    metrics['recall_micro'] = recall_score(y_true, y_pred_binary, average='micro', zero_division=0)
    metrics['f1_micro'] = f1_score(y_true, y_pred_binary, average='micro', zero_division=0)
    
    # MACRO AVERAGE (unweighted average across labels)
    metrics['precision_macro'] = precision_score(y_true, y_pred_binary, average='macro', zero_division=0)
    metrics['recall_macro'] = recall_score(y_true, y_pred_binary, average='macro', zero_division=0)
    metrics['f1_macro'] = f1_score(y_true, y_pred_binary, average='macro', zero_division=0)
    
    # WEIGHTED AVERAGE (weighted by support/frequency)
    metrics['precision_weighted'] = precision_score(y_true, y_pred_binary, average='weighted', zero_division=0)
    metrics['recall_weighted'] = recall_score(y_true, y_pred_binary, average='weighted', zero_division=0)
    metrics['f1_weighted'] = f1_score(y_true, y_pred_binary, average='weighted', zero_division=0)
    
    # ROC-AUC (multiple averaging methods)
    try:
        metrics['roc_auc_macro'] = roc_auc_score(y_true, y_pred_proba, average='macro')
        metrics['roc_auc_weighted'] = roc_auc_score(y_true, y_pred_proba, average='weighted')
        metrics['roc_auc_samples'] = roc_auc_score(y_true, y_pred_proba, average='samples')
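    except ValueError as e:
        # ROC-AUC is undefined here (e.g. a label with a single class in y_true);
        # record NaN rather than 0.0, which would read as a perfectly wrong ranking
        print(f"ROC-AUC calculation failed: {e}")
        metrics['roc_auc_macro'] = float('nan')
        metrics['roc_auc_weighted'] = float('nan')
        metrics['roc_auc_samples'] = float('nan')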
    
    # Precision-Recall AUC (multiple averaging methods)
    try:
        metrics['pr_auc_macro'] = average_precision_score(y_true, y_pred_proba, average='macro')
        metrics['pr_auc_weighted'] = average_precision_score(y_true, y_pred_proba, average='weighted')
        metrics['pr_auc_samples'] = average_precision_score(y_true, y_pred_proba, average='samples')
    except ValueError as e:
        # PR-AUC could not be computed; record NaN rather than a misleading 0.0
        print(f"PR-AUC calculation failed: {e}")
        metrics['pr_auc_macro'] = float('nan')
        metrics['pr_auc_weighted'] = float('nan')
        metrics['pr_auc_samples'] = float('nan')
    
    # Hamming Loss (inherently micro-averaged)
    metrics['hamming_loss'] = hamming_loss(y_true, y_pred_binary)
    
    # Jaccard Score (multiple averaging methods)
    metrics['jaccard_samples'] = jaccard_score(y_true, y_pred_binary, average='samples', zero_division=0)
    metrics['jaccard_macro'] = jaccard_score(y_true, y_pred_binary, average='macro', zero_division=0)
    metrics['jaccard_weighted'] = jaccard_score(y_true, y_pred_binary, average='weighted', zero_division=0)
    
    # Note: jaccard_score also accepts average='micro' for multi-label input;
    # it is not reported here but can be added if needed
    
    return metrics
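
The `try/except` around `roc_auc_score` guards against labels for which the metric is undefined, typically a posture with no positive (or no negative) examples in the evaluated split. How scikit-learn reports that case may depend on the installed version, so the minimal sketch below checks for it explicitly and computes the AUC label by label. The helper name `per_label_roc_auc` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_label_roc_auc(y_true, y_score):
    """ROC-AUC per label; NaN where a label has only one class in y_true."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = np.full(y_true.shape[1], np.nan)
    for j in range(y_true.shape[1]):
        column = y_true[:, j]
        if column.min() != column.max():   # both 0 and 1 are present
            aucs[j] = roc_auc_score(column, y_score[:, j])
    return aucs

y_true_toy = np.array([[0, 0], [0, 0], [1, 0], [1, 0]])   # label 1 has no positives
y_score_toy = np.array([[0.1, 0.2], [0.4, 0.1], [0.35, 0.3], [0.8, 0.6]])

aucs = per_label_roc_auc(y_true_toy, y_score_toy)
print(aucs)               # second entry is NaN: AUC is undefined for that label
print(np.nanmean(aucs))   # macro average over the labels where AUC is defined
```

Reporting the per-label values also makes it visible which postures drive a low macro average, which a single aggregated number hides.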

In [None]:
# # Define models to test with optimized hyperparameters and validation-aware training
# models = {
#     'Logistic Regression': OneVsRestClassifier(
#         LogisticRegression(
#             random_state=42, 
#             max_iter=1000,
#             C=1.0,
#             solver='liblinear'
#         )
#     ),
#     'Random Forest': OneVsRestClassifier(
#         RandomForestClassifier(
#             n_estimators=100, 
#             random_state=42, 
#             n_jobs=-1,
#             max_depth=10,
#             min_samples_split=5,
#             min_samples_leaf=2,
#             # Additional overfitting control
#             min_impurity_decrease=0.0001,
#             max_features='sqrt'
#         )
#     ),
#     'XGBoost': OneVsRestClassifier(
#         xgb.XGBClassifier(
#             random_state=42,
#             n_estimators=100,
#             max_depth=6,
#             learning_rate=0.1,
#             subsample=0.8,
#             colsample_bytree=0.8,
#             eval_metric='logloss',
#             verbosity=0,
#             # Early stopping will be handled in training loop
#             early_stopping_rounds=10
#         )
#     ),
#     'LightGBM': OneVsRestClassifier(
#         lgb.LGBMClassifier(
#             random_state=42,
#             n_estimators=100,
#             max_depth=6,
#             learning_rate=0.1,
#             subsample=0.8,
#             colsample_bytree=0.8,
#             verbosity=-1,
#             # Early stopping will be handled in training loop
#             early_stopping_rounds=10
#         )
#     )
# }

# # Enhanced training function with validation monitoring
# def train_with_validation_control(model, X_train, y_train, X_val, y_val, model_name):
#     """
#     Train model with validation monitoring to control overfitting
#     """
#     print(f"\nTraining {model_name} with validation control...")
    
#     if model_name in ['XGBoost', 'LightGBM']:
#         # For tree-based models, we can use early stopping
#         if model_name == 'XGBoost':
#             # XGBoost with early stopping
#             for i, estimator in enumerate(model.estimators_):
#                 print(f"  Training label {i+1}/{len(model.estimators_)}")
                
#                 # Get single label
#                 y_train_single = y_train[:, i]
#                 y_val_single = y_val[:, i]
                
#                 # Only train if there are positive samples
#                 if y_train_single.sum() > 0:
#                     estimator.fit(
#                         X_train, y_train_single,
#                         eval_set=[(X_val, y_val_single)],
#                         verbose=False
#                     )
#                 else:
#                     # For labels with no positive samples, create a dummy classifier
#                     estimator.fit(X_train[:10], y_train_single[:10])
        
#         elif model_name == 'LightGBM':
#             # LightGBM with early stopping
#             for i, estimator in enumerate(model.estimators_):
#                 print(f"  Training label {i+1}/{len(model.estimators_)}")
                
#                 # Get single label
#                 y_train_single = y_train[:, i]
#                 y_val_single = y_val[:, i]
                
#                 # Only train if there are positive samples
#                 if y_train_single.sum() > 0:
#                     estimator.fit(
#                         X_train, y_train_single,
#                         eval_set=[(X_val, y_val_single)],
#                         callbacks=[
#                             early_stopping(10, verbose=False),
#                             log_evaluation(0)  # No logging
#                         ]
#                     )
#                 else:
#                     # For labels with no positive samples, create a dummy classifier
#                     estimator.fit(X_train[:10], y_train_single[:10])
#     else:
#         # For other models, use regular training
#         model.fit(X_train, y_train)
    
#     return model

# # Store results with validation tracking
# results = {}
# validation_scores = {}

# print("Training and evaluating models with validation control...")
# print("="*60)
# print("Models to evaluate:")
# for name in models.keys():
#     print(f"  • {name}")
# print()

# for name, model in models.items():
#     # Train with validation control
#     if name in ['XGBoost', 'LightGBM']:
#         # For tree-based models, we need to handle OneVsRestClassifier manually
#         # to implement early stopping properly
#         trained_model = OneVsRestClassifier(
#             model.estimator,
#             n_jobs=1  # Sequential to handle early stopping
#         )
#         trained_model.fit(X_train_tfidf, y_train)
#     else:
#         trained_model = model
#         trained_model.fit(X_train_tfidf, y_train)
    
#     # Make predictions on all sets
#     y_pred_train = trained_model.predict(X_train_tfidf)
#     y_pred_val = trained_model.predict(X_val_tfidf)
#     y_pred_test = trained_model.predict(X_test_tfidf)
    
#     # Calculate metrics for all sets
#     train_accuracy = accuracy_score(y_train, y_pred_train)
#     val_accuracy = accuracy_score(y_val, y_pred_val)
#     test_accuracy = accuracy_score(y_test, y_pred_test)
    
#     train_hamming = hamming_loss(y_train, y_pred_train)
#     val_hamming = hamming_loss(y_val, y_pred_val)
#     test_hamming = hamming_loss(y_test, y_pred_test)
    
#     # Calculate F1 scores
#     train_f1_micro = f1_score(y_train, y_pred_train, average='micro')
#     val_f1_micro = f1_score(y_val, y_pred_val, average='micro')
#     test_f1_micro = f1_score(y_test, y_pred_test, average='micro')
    
#     # Store results
#     results[name] = {
#         'model': trained_model,
#         'train_accuracy': train_accuracy,
#         'val_accuracy': val_accuracy,
#         'test_accuracy': test_accuracy,
#         'train_hamming_loss': train_hamming,
#         'val_hamming_loss': val_hamming,
#         'test_hamming_loss': test_hamming,
#         'train_f1_micro': train_f1_micro,
#         'val_f1_micro': val_f1_micro,
#         'test_f1_micro': test_f1_micro,
#         'y_pred_test': y_pred_test,
#         'y_pred_val': y_pred_val
#     }
    
#     # Check for overfitting
#     accuracy_gap = train_accuracy - val_accuracy
#     f1_gap = train_f1_micro - val_f1_micro
    
#     overfitting_status = "✅ Good" if accuracy_gap < 0.05 else "⚠️ Moderate" if accuracy_gap < 0.1 else "🚨 High"
    
#     print(f"\n{name} Results:")
#     print(f"  Train Accuracy: {train_accuracy:.4f}")
#     print(f"  Val Accuracy:   {val_accuracy:.4f}")
#     print(f"  Test Accuracy:  {test_accuracy:.4f}")
#     print(f"  Train-Val Gap:  {accuracy_gap:.4f} ({overfitting_status})")
#     print(f"  Train F1:       {train_f1_micro:.4f}")
#     print(f"  Val F1:         {val_f1_micro:.4f}")
#     print(f"  Test F1:        {test_f1_micro:.4f}")
#     print(f"  F1 Gap:         {f1_gap:.4f}")

# print("\n" + "="*80)
# print("Model Comparison with Overfitting Analysis:")
# print(f"{'Model':<15} | {'Test Acc':<8} | {'Val Acc':<8} | {'Gap':<6} | {'Status':<12} | {'Performance':<12}")
# print("-" * 85)

# # Sort results by validation accuracy (better indicator than test accuracy)
# sorted_results = sorted(results.items(), key=lambda x: x[1]['val_accuracy'], reverse=True)

# for name, result in sorted_results:
#     gap = result['train_accuracy'] - result['val_accuracy']
#     status = "Good" if gap < 0.05 else "Moderate" if gap < 0.1 else "High"
#     performance = "🥇 Best" if name == sorted_results[0][0] else "🥈 Good" if result['val_accuracy'] > 0.55 else "⚠️ Poor"
#     print(f"{name:<15} | {result['test_accuracy']:<8.4f} | {result['val_accuracy']:<8.4f} | {gap:<6.4f} | {status:<12} | {performance}")

# # Identify best model based on validation performance
# best_model_name = sorted_results[0][0]
# best_model = sorted_results[0][1]['model']
# print(f"\n🏆 Best performing model (based on validation): {best_model_name}")
# print(f"   Validation Accuracy: {sorted_results[0][1]['val_accuracy']:.4f}")
# print(f"   Test Accuracy: {sorted_results[0][1]['test_accuracy']:.4f}")
# print(f"   Overfitting Gap: {sorted_results[0][1]['train_accuracy'] - sorted_results[0][1]['val_accuracy']:.4f}")

In [16]:
import lightgbm as lgb
import xgboost as xgb
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import (
    precision_score, recall_score, f1_score, 
    roc_auc_score, average_precision_score,
    hamming_loss, jaccard_score, accuracy_score
)
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from tqdm import tqdm

class Train_XGBoost(BaseEstimator, ClassifierMixin):
    """XGBoost classifier with validation-based early stopping for multi-label"""
    
    def __init__(self, **xgb_params):
        self.xgb_params = xgb_params
        self.models_ = []
        self.n_classes_ = None
        
    def fit(self, X, y, X_val=None, y_val=None):
        if len(y.shape) == 1:
            y = y.reshape(-1, 1)
        if y_val is not None and len(y_val.shape) == 1:
            y_val = y_val.reshape(-1, 1)
            
        self.n_classes_ = y.shape[1]
        self.models_ = []
        
        for i in tqdm(range(self.n_classes_), total=self.n_classes_, leave=True, position=0):
            
            y_single = y[:, i]
            
            # Skip if no positive samples
            if y_single.sum() == 0:
                self.models_.append(None)
                continue
            
            model = xgb.XGBClassifier(**self.xgb_params)
            
            if X_val is not None and y_val is not None:
                y_val_single = y_val[:, i]
                model.fit(
                    X, y_single,
                    eval_set=[(X_val, y_val_single)],
                    verbose=False
                )
            else:
                model.fit(X, y_single)
            
            self.models_.append(model)
        
        return self
    
    def predict(self, X):
        predictions = np.zeros((X.shape[0], self.n_classes_))
        
        for i, model in enumerate(self.models_):
            if model is not None:
                predictions[:, i] = model.predict(X)
        
        return predictions
    
    def predict_proba(self, X):
        probabilities = np.zeros((X.shape[0], self.n_classes_))
        
        for i, model in enumerate(self.models_):
            if model is not None:
                proba = model.predict_proba(X)
                # Handle case where only one class is present
                if proba.shape[1] == 1:
                    probabilities[:, i] = 0  # All negative class
                else:
                    probabilities[:, i] = proba[:, 1]  # Positive class probability
        
        return probabilities

class Train_LGBM(BaseEstimator, ClassifierMixin):
    """LightGBM classifier with validation-based early stopping for multi-label"""
    
    def __init__(self, **lgb_params):
        self.lgb_params = lgb_params
        self.models_ = []
        self.n_classes_ = None
        
    def fit(self, X, y, X_val=None, y_val=None):
        if len(y.shape) == 1:
            y = y.reshape(-1, 1)
        if X_val is not None and y_val is not None and len(y_val.shape) == 1:
            y_val = y_val.reshape(-1, 1)
            
        self.n_classes_ = y.shape[1]
        self.models_ = []
        
        for i in tqdm(range(self.n_classes_), total=self.n_classes_, leave=True, position=0):
            
            y_single = y[:, i]
            
            # Skip if no positive samples
            if y_single.sum() == 0:
                self.models_.append(None)
                continue
            
            model = lgb.LGBMClassifier(**self.lgb_params)
            
            if X_val is not None and y_val is not None:
                y_val_single = y_val[:, i]
                model.fit(
                    X, y_single,
                    eval_set=[(X_val, y_val_single)],
                    callbacks=[
                        lgb.early_stopping(10, verbose=False),
                        lgb.log_evaluation(0)
                    ]
                )
            else:
                model.fit(X, y_single)
            
            self.models_.append(model)
        
        return self
    
    def predict(self, X):
        predictions = np.zeros((X.shape[0], self.n_classes_))
        
        for i, model in enumerate(self.models_):
            if model is not None:
                predictions[:, i] = model.predict(X)
        
        return predictions
    
    def predict_proba(self, X):
        probabilities = np.zeros((X.shape[0], self.n_classes_))
        
        for i, model in enumerate(self.models_):
            if model is not None:
                proba = model.predict_proba(X)
                # Handle case where only one class is present
                if proba.shape[1] == 1:
                    probabilities[:, i] = 0  # All negative class
                else:
                    probabilities[:, i] = proba[:, 1]  # Positive class probability
        
        return probabilities

class Train_logistic(BaseEstimator, ClassifierMixin):
    """Logistic Regression classifier with validation monitoring for multi-label"""
    
    def __init__(self, **lr_params):
        self.lr_params = lr_params
        self.models_ = []
        self.n_classes_ = None
        self.validation_scores_ = []
        
    def fit(self, X, y, X_val=None, y_val=None):
        if len(y.shape) == 1:
            y = y.reshape(-1, 1)
        if X_val is not None and y_val is not None and len(y_val.shape) == 1:
            y_val = y_val.reshape(-1, 1)
            
        self.n_classes_ = y.shape[1]
        self.models_ = []
        self.validation_scores_ = []
        
        for i in tqdm(range(self.n_classes_), total=self.n_classes_, leave=True, position=0):
            
            y_single = y[:, i]
            
            # Skip if no positive samples
            if y_single.sum() == 0:
                self.models_.append(None)
                self.validation_scores_.append(0.0)
                continue
            
            model = LogisticRegression(**self.lr_params)
            model.fit(X, y_single)
            
            # Calculate validation score if validation data provided
            if X_val is not None and y_val is not None:
                y_val_single = y_val[:, i]
                val_score = model.score(X_val, y_val_single)
                self.validation_scores_.append(val_score)
            else:
                self.validation_scores_.append(None)
            
            self.models_.append(model)
        
        return self
    
    def predict(self, X):
        predictions = np.zeros((X.shape[0], self.n_classes_))
        
        for i, model in enumerate(self.models_):
            if model is not None:
                predictions[:, i] = model.predict(X)
        
        return predictions
    
    def predict_proba(self, X):
        probabilities = np.zeros((X.shape[0], self.n_classes_))
        
        for i, model in enumerate(self.models_):
            if model is not None:
                proba = model.predict_proba(X)
                # Handle case where only one class is present
                if proba.shape[1] == 1:
                    probabilities[:, i] = 0  # All negative class
                else:
                    probabilities[:, i] = proba[:, 1]  # Positive class probability
        
        return probabilities
    
    def get_validation_scores(self):
        """Return validation scores for each label"""
        return self.validation_scores_

class Train_RandomForest(BaseEstimator, ClassifierMixin):
    """Random Forest classifier with validation monitoring for multi-label"""
    
    def __init__(self, **rf_params):
        self.rf_params = rf_params
        self.models_ = []
        self.n_classes_ = None
        self.validation_scores_ = []
        self.feature_importances_ = []
        
    def fit(self, X, y, X_val=None, y_val=None):
        if len(y.shape) == 1:
            y = y.reshape(-1, 1)
        if X_val is not None and y_val is not None and len(y_val.shape) == 1:
            y_val = y_val.reshape(-1, 1)
            
        self.n_classes_ = y.shape[1]
        self.models_ = []
        self.validation_scores_ = []
        self.feature_importances_ = []
        
        for i in tqdm(range(self.n_classes_), total=self.n_classes_, leave=True, position=0):
            
            y_single = y[:, i]
            
            # Skip if no positive samples
            if y_single.sum() == 0:
                self.models_.append(None)
                self.validation_scores_.append(0.0)
                self.feature_importances_.append(None)
                continue
            
            model = RandomForestClassifier(**self.rf_params)
            model.fit(X, y_single)
            
            # Store feature importances
            self.feature_importances_.append(model.feature_importances_)
            
            # Calculate validation score if validation data provided
            if X_val is not None and y_val is not None:
                y_val_single = y_val[:, i]
                val_score = model.score(X_val, y_val_single)
                self.validation_scores_.append(val_score)
            else:
                self.validation_scores_.append(None)
            
            self.models_.append(model)
        
        return self
    
    def predict(self, X):
        predictions = np.zeros((X.shape[0], self.n_classes_))
        
        for i, model in enumerate(self.models_):
            if model is not None:
                predictions[:, i] = model.predict(X)
        
        return predictions
    
    def predict_proba(self, X):
        probabilities = np.zeros((X.shape[0], self.n_classes_))
        
        for i, model in enumerate(self.models_):
            if model is not None:
                proba = model.predict_proba(X)
                # Handle case where only one class is present
                if proba.shape[1] == 1:
                    probabilities[:, i] = 0  # All negative class
                else:
                    probabilities[:, i] = proba[:, 1]  # Positive class probability
        
        return probabilities
    
    def get_validation_scores(self):
        """Return validation scores for each label"""
        return self.validation_scores_
    
    def get_feature_importances(self):
        """Return feature importances for each label"""
        return self.feature_importances_

def training_function_with_validation(X_train, y_train, X_val, y_val, model_type='lightgbm'):
    """
    Enhanced training function with proper validation control for multi-label classification
    """
    
    print(f"Training {model_type} with validation control...")
    print(f"X_train shape: {X_train.shape}")
    print(f"y_train shape: {y_train.shape}")
    print(f"X_val shape: {X_val.shape}")
    print(f"y_val shape: {y_val.shape}")
    
    if model_type == 'lightgbm':
        model = Train_LGBM(
            random_state=42,
            n_estimators=200,  # More estimators for early stopping
            max_depth=6,
            learning_rate=0.1,
            subsample=0.8,
            colsample_bytree=0.8,
            verbosity=-1,
            early_stopping_rounds=10
        )
    elif model_type == 'xgboost':
        model = Train_XGBoost(
            random_state=42,
            n_estimators=200,  # More estimators for early stopping
            max_depth=6,
            learning_rate=0.1,
            subsample=0.8,
            colsample_bytree=0.8,
            eval_metric='logloss',
            verbosity=0,
            early_stopping_rounds=10
        )
    elif model_type == 'logistic':
        model = Train_logistic(
            random_state=42,
            max_iter=1000,
            C=1.0,
            solver='liblinear',
            class_weight='balanced'  # Handle class imbalance
        )
    elif model_type == 'randomforest':
        model = Train_RandomForest(
            random_state=42,
            n_estimators=100,
            max_depth=10,
            min_samples_split=5,
            min_samples_leaf=2,
            max_features='sqrt',
            class_weight='balanced',  # Handle class imbalance
            n_jobs=-1
        )
    else:
        raise ValueError("Supported model types: 'lightgbm', 'xgboost', 'logistic', 'randomforest'")
    
    # Fit with validation data
    model.fit(X_train, y_train, X_val, y_val)
    
    # Make predictions
    y_pred_train = model.predict(X_train)
    y_pred_val = model.predict(X_val)
    
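    # Calculate metrics.
    # Note: for multi-label targets, accuracy_score is exact-match (subset) accuracy:
    # a sample counts as correct only if every one of its labels is predicted correctly.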
    train_acc = accuracy_score(y_train, y_pred_train)
    val_acc = accuracy_score(y_val, y_pred_val)
    train_f1 = f1_score(y_train, y_pred_train, average='micro')
    val_f1 = f1_score(y_val, y_pred_val, average='micro')
    
    # Calculate hamming loss (lower is better)
    train_hamming = hamming_loss(y_train, y_pred_train)
    val_hamming = hamming_loss(y_val, y_pred_val)
    
    # Calculate overfitting gaps for different metrics
    accuracy_gap = train_acc - val_acc
    f1_gap = train_f1 - val_f1
    hamming_gap = val_hamming - train_hamming  # Note: val - train because lower hamming is better
    
    print(f"Training completed!")
    print(f"Train Accuracy: {train_acc:.4f}")
    print(f"Val Accuracy: {val_acc:.4f}")
    print(f"Train F1: {train_f1:.4f}")
    print(f"Val F1: {val_f1:.4f}")
    print(f"Train Hamming Loss: {train_hamming:.4f}")
    print(f"Val Hamming Loss: {val_hamming:.4f}")
    print(f"Overfitting Gap (Accuracy): {accuracy_gap:.4f}")
    print(f"Overfitting Gap (F1): {f1_gap:.4f}")
    print(f"Overfitting Gap (Hamming): {hamming_gap:.4f}")
    
    return model, {
        'train_accuracy': train_acc,
        'val_accuracy': val_acc,
        'train_f1': train_f1,
        'val_f1': val_f1,
        'train_hamming_loss': train_hamming,
        'val_hamming_loss': val_hamming,
        'accuracy_gap': accuracy_gap,
        'f1_gap': f1_gap,
        'hamming_gap': hamming_gap,
        'overfitting_gap': hamming_gap  # Use hamming gap as primary overfitting indicator
    }
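
The three metrics computed in `training_function_with_validation` answer different questions, which is why exact-match accuracy can look low while Hamming loss looks very small. The following is a minimal pure-Python sketch on made-up toy labels (the data and the helper names `subset_accuracy`, `hamming`, and `micro_f1` are illustrative assumptions, not part of the notebook). The helpers are intended to mirror what scikit-learn's `accuracy_score`, `hamming_loss`, and `f1_score(average='micro')` compute on multi-label input.

```python
# Toy multi-label example: 4 documents, 3 labels (illustrative data only).
y_true = [
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]
y_pred = [
    [1, 0, 0],  # exact match
    [0, 1, 0],  # misses one label
    [1, 1, 0],  # exact match
    [0, 1, 1],  # predicts one extra label
]


def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose full label vector is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def hamming(y_true, y_pred):
    """Fraction of individual (row, label) cells that are wrong."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    total = len(y_true) * len(y_true[0])
    return wrong / total


def micro_f1(y_true, y_pred):
    """F1 computed from TP/FP/FN counts pooled over all labels."""
    pairs = [(ti, pi) for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p)]
    tp = sum(1 for ti, pi in pairs if ti == 1 and pi == 1)
    fp = sum(1 for ti, pi in pairs if ti == 0 and pi == 1)
    fn = sum(1 for ti, pi in pairs if ti == 1 and pi == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


print(f"Subset accuracy: {subset_accuracy(y_true, y_pred):.4f}")
print(f"Hamming loss:    {hamming(y_true, y_pred):.4f}")
print(f"Micro F1:        {micro_f1(y_true, y_pred):.4f}")
```

Two single-cell mistakes are enough to halve the subset accuracy here, while only 2 of 12 cells are wrong. With 27 labels per document the same effect is stronger, so the accuracy gap and the Hamming gap should not be read on the same scale.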



In [17]:
# Comprehensive Model Comparison with Validation Control

def compare_all_models(X_train, y_train, X_val, y_val, X_test, y_test):
    """
    Train and compare all models with validation control
    """
    
    print("🚀 COMPREHENSIVE MODEL COMPARISON WITH VALIDATION CONTROL")
    print("="*80)
    
    models_to_test = ['logistic', 'randomforest', 'lightgbm', 'xgboost']
    results = {}
    
    for model_type in models_to_test:
        print(f"\n{'='*60}")
        print(f"🔧 Training {model_type.upper()} Model")
        print(f"{'='*60}")
        
        try:
            # Train model with validation
            model, metrics = training_function_with_validation(
                X_train, y_train, X_val, y_val, model_type=model_type
            )
            
            # Test on unseen data
            y_pred_test = model.predict(X_test)
            test_acc = accuracy_score(y_test, y_pred_test)
            test_f1 = f1_score(y_test, y_pred_test, average='micro')
            test_hamming = hamming_loss(y_test, y_pred_test)
            
            # Store all results
            results[model_type] = {
                'model': model,
                'train_accuracy': metrics['train_accuracy'],
                'val_accuracy': metrics['val_accuracy'],
                'test_accuracy': test_acc,
                'train_f1': metrics['train_f1'],
                'val_f1': metrics['val_f1'],
                'test_f1': test_f1,
                'train_hamming_loss': metrics['train_hamming_loss'],
                'val_hamming_loss': metrics['val_hamming_loss'],
                'test_hamming_loss': test_hamming,
                'accuracy_gap': metrics['accuracy_gap'],
                'f1_gap': metrics['f1_gap'],
                'hamming_gap': metrics['hamming_gap'],
                'overfitting_gap': metrics['overfitting_gap']  # Based on hamming loss
            }
            
            print(f"✅ {model_type.upper()} completed successfully!")
            print(f"   Test Accuracy: {test_acc:.4f}")
            print(f"   Test F1: {test_f1:.4f}")
            print(f"   Test Hamming Loss: {test_hamming:.4f}")
            print(f"   Overfitting Gap (Hamming): {metrics['overfitting_gap']:.4f}")
            
        except Exception as e:
            print(f"❌ Error training {model_type}: {str(e)}")
            results[model_type] = None
    
    return results

def analyze_model_results(results):
    """
    Analyze and display comprehensive results
    """
    
    print(f"\n{'='*100}")
    print("📊 COMPREHENSIVE MODEL ANALYSIS")
    print(f"{'='*100}")
    
    # Filter successful results
    successful_results = {k: v for k, v in results.items() if v is not None}
    
    if not successful_results:
        print("❌ No models trained successfully!")
        return
    
    # Display detailed comparison table
    print(f"\n{'Model':<15} | {'Train Acc':<9} | {'Val Acc':<9} | {'Test Acc':<9} | {'Train Ham':<9} | {'Val Ham':<8} | {'Test Ham':<8} | {'Ham Gap':<8} | {'Status'}")
    print("-" * 105)
    
    # Sort by validation accuracy (best practice)
    sorted_results = sorted(successful_results.items(), 
                          key=lambda x: x[1]['val_accuracy'], reverse=True)
    
    for rank, (model_name, result) in enumerate(sorted_results, 1):
        hamming_gap = result['hamming_gap']
        
        # Determine overfitting status based on hamming gap
        # For hamming loss, positive gap means validation is worse (overfitting)
        if hamming_gap < 0.01:
            status = "✅ Excellent"
        elif hamming_gap < 0.02:
            status = "🟢 Good"
        elif hamming_gap < 0.04:
            status = "🟡 Moderate"
        else:
            status = "🔴 High"
        
        print(f"{model_name.upper():<15} | {result['train_accuracy']:<9.4f} | {result['val_accuracy']:<9.4f} | "
              f"{result['test_accuracy']:<9.4f} | {result['train_hamming_loss']:<9.4f} | {result['val_hamming_loss']:<8.4f} | "
              f"{result['test_hamming_loss']:<8.4f} | {hamming_gap:<8.4f} | {status}")
    
    # Identify best models
    best_model = sorted_results[0]
    print(f"\n🏆 BEST MODEL (Based on Validation Performance): {best_model[0].upper()}")
    print(f"   📈 Validation Accuracy: {best_model[1]['val_accuracy']:.4f}")
    print(f"   🎯 Test Accuracy: {best_model[1]['test_accuracy']:.4f}")
    print(f"   📊 Test F1 Score: {best_model[1]['test_f1']:.4f}")
    print(f"   🔻 Test Hamming Loss: {best_model[1]['test_hamming_loss']:.4f}")
    print(f"   ⚖️ Overfitting Gap (Hamming): {best_model[1]['overfitting_gap']:.4f}")
    print(f"   📏 Accuracy Gap: {best_model[1]['accuracy_gap']:.4f}")
    print(f"   📈 F1 Gap: {best_model[1]['f1_gap']:.4f}")
    
    # Best test performance (might be different from best validation)
    best_test = max(successful_results.items(), key=lambda x: x[1]['test_accuracy'])
    if best_test[0] != best_model[0]:
        print(f"\n🎯 BEST TEST PERFORMANCE: {best_test[0].upper()}")
        print(f"   Test Accuracy: {best_test[1]['test_accuracy']:.4f}")
        print(f"   (Note: Choose model based on validation, not test performance)")
    
    # Best hamming loss performance
    best_hamming = min(successful_results.items(), key=lambda x: x[1]['test_hamming_loss'])
    if best_hamming[0] != best_model[0]:
        print(f"\n🔻 BEST HAMMING LOSS PERFORMANCE: {best_hamming[0].upper()}")
        print(f"   Test Hamming Loss: {best_hamming[1]['test_hamming_loss']:.4f}")
        print(f"   (Lower hamming loss = better multi-label performance)")
    
    # Model-specific insights
    print(f"\n{'='*80}")
    print("🔍 MODEL-SPECIFIC INSIGHTS:")
    print(f"{'='*80}")
    
    for model_name, result in successful_results.items():
        if hasattr(result['model'], 'get_validation_scores'):
            val_scores = result['model'].get_validation_scores()
            if val_scores and any(score for score in val_scores if score is not None):
                valid_scores = [s for s in val_scores if s is not None and s > 0]
                if valid_scores:
                    avg_label_score = np.mean(valid_scores)
                    print(f"{model_name.upper()}:")
                    print(f"   Average per-label validation score: {avg_label_score:.4f}")
                    print(f"   Labels with good performance (>0.8): {sum(1 for s in valid_scores if s > 0.8)}/{len(valid_scores)}")
    
    # Recommendations
    print(f"\n{'='*80}")
    print("💡 RECOMMENDATIONS:")
    print(f"{'='*80}")
    
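    # Use the same cut-offs as the status column in the comparison table
    # (< 0.01 excellent, < 0.02 good) and the 0.02 overfitting threshold below.
    if best_model[1]['overfitting_gap'] < 0.01:
        print("✅ Your best model shows excellent generalization based on Hamming loss!")
    elif best_model[1]['overfitting_gap'] < 0.02:
        print("🟢 Your best model shows good generalization based on Hamming loss!")
    else: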
        print("⚠️ Consider additional regularization for your best model:")
        print("   - Increase regularization parameters")
        print("   - Use more training data")
        print("   - Apply feature selection")
        print("   - Consider ensemble methods")
    
    hamming_gap_threshold = 0.02
    models_with_overfitting = [name for name, result in successful_results.items() 
                              if result['hamming_gap'] > hamming_gap_threshold]
    
    if models_with_overfitting:
        print(f"\n⚠️ Models showing overfitting based on Hamming loss (gap > {hamming_gap_threshold}):")
        for model in models_with_overfitting:
            result = successful_results[model]
            print(f"   - {model.upper()}:")
            print(f"     • Hamming Gap: {result['hamming_gap']:.4f}")
            print(f"     • Accuracy Gap: {result['accuracy_gap']:.4f}")
            print(f"     • F1 Gap: {result['f1_gap']:.4f}")
    
    print(f"\n🎯 Model Selection Priority (Updated with Hamming Loss):")
    print("   1. Choose model with best VALIDATION performance")
    print("   2. Prefer models with smaller Hamming loss gap (primary indicator)")
    print("   3. Consider accuracy and F1 gaps as secondary indicators")
    print("   4. Evaluate computational efficiency for deployment")
    print("   5. Lower Hamming loss = better multi-label classification performance")
    
    print(f"\n📊 Understanding Hamming Loss:")
    print("   • Hamming Loss measures label-wise classification errors")
    print("   • Perfect score = 0.0, higher values = more errors")
    print("   • Particularly important for multi-label problems")
    print("   • Gap = Val_Hamming - Train_Hamming (positive = overfitting)")
    
    return successful_results

# Example usage
print("Starting comprehensive model comparison...")
all_results = compare_all_models(X_train_tfidf, y_train, X_val_tfidf, y_val, X_test_tfidf, y_test)
final_analysis = analyze_model_results(all_results)

Starting comprehensive model comparison...
🚀 COMPREHENSIVE MODEL COMPARISON WITH VALIDATION CONTROL

🔧 Training LOGISTIC Model
Training logistic with validation control...
X_train shape: (11597, 10000)
y_train shape: (11597, 27)
X_val shape: (2485, 10000)
y_val shape: (2485, 27)


100%|██████████| 27/27 [00:23<00:00,  1.14it/s]


Training completed!
Train Accuracy: 0.5919
Val Accuracy: 0.4978
Train F1: 0.8476
Val F1: 0.7936
Train Hamming Loss: 0.0199
Val Hamming Loss: 0.0269
Overfitting Gap (Accuracy): 0.0941
Overfitting Gap (F1): 0.0540
Overfitting Gap (Hamming): 0.0070
✅ LOGISTIC completed successfully!
   Test Accuracy: 0.4702
   Test F1: 0.7859
   Test Hamming Loss: 0.0281
   Overfitting Gap (Hamming): 0.0070

🔧 Training RANDOMFOREST Model
Training randomforest with validation control...
X_train shape: (11597, 10000)
y_train shape: (11597, 27)
X_val shape: (2485, 10000)
y_val shape: (2485, 27)


100%|██████████| 27/27 [00:15<00:00,  1.80it/s]


Training completed!
Train Accuracy: 0.7369
Val Accuracy: 0.5127
Train F1: 0.9059
Val F1: 0.7894
Train Hamming Loss: 0.0114
Val Hamming Loss: 0.0245
Overfitting Gap (Accuracy): 0.2242
Overfitting Gap (F1): 0.1164
Overfitting Gap (Hamming): 0.0130
✅ RANDOMFOREST completed successfully!
   Test Accuracy: 0.5270
   Test F1: 0.7901
   Test Hamming Loss: 0.0242
   Overfitting Gap (Hamming): 0.0130

🔧 Training LIGHTGBM Model
Training lightgbm with validation control...
X_train shape: (11597, 10000)
y_train shape: (11597, 27)
X_val shape: (2485, 10000)
y_val shape: (2485, 27)


100%|██████████| 27/27 [02:37<00:00,  5.84s/it]


Training completed!
Train Accuracy: 0.9586
Val Accuracy: 0.6149
Train F1: 0.9861
Val F1: 0.8340
Train Hamming Loss: 0.0016
Val Hamming Loss: 0.0181
Overfitting Gap (Accuracy): 0.3437
Overfitting Gap (F1): 0.1521
Overfitting Gap (Hamming): 0.0165
✅ LIGHTGBM completed successfully!
   Test Accuracy: 0.6070
   Test F1: 0.8248
   Test Hamming Loss: 0.0190
   Overfitting Gap (Hamming): 0.0165

🔧 Training XGBOOST Model
Training xgboost with validation control...
X_train shape: (11597, 10000)
y_train shape: (11597, 27)
X_val shape: (2485, 10000)
y_val shape: (2485, 27)


100%|██████████| 27/27 [08:03<00:00, 17.91s/it]


Training completed!
Train Accuracy: 0.9308
Val Accuracy: 0.6205
Train F1: 0.9758
Val F1: 0.8368
Train Hamming Loss: 0.0027
Val Hamming Loss: 0.0176
Overfitting Gap (Accuracy): 0.3103
Overfitting Gap (F1): 0.1390
Overfitting Gap (Hamming): 0.0149
✅ XGBOOST completed successfully!
   Test Accuracy: 0.6219
   Test F1: 0.8331
   Test Hamming Loss: 0.0179
   Overfitting Gap (Hamming): 0.0149

📊 COMPREHENSIVE MODEL ANALYSIS

Model           | Train Acc | Val Acc   | Test Acc  | Train Ham | Val Ham  | Test Ham | Ham Gap  | Status
---------------------------------------------------------------------------------------------------------
XGBOOST         | 0.9308    | 0.6205    | 0.6219    | 0.0027    | 0.0176   | 0.0179   | 0.0149   | 🟢 Good
LIGHTGBM        | 0.9586    | 0.6149    | 0.6070    | 0.0016    | 0.0181   | 0.0190   | 0.0165   | 🟢 Good
RANDOMFOREST    | 0.7369    | 0.5127    | 0.5270    | 0.0114    | 0.0245   | 0.0242   | 0.0130   | 🟢 Good
LOGISTIC        | 0.5919    | 0.4978    | 0.470
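
The status column in the table is derived from the Hamming gap (validation minus training Hamming loss). As a minimal sketch, the cut-offs from `analyze_model_results` can be kept in one helper so that every report uses the same mapping. The function name `hamming_gap_status` is a hypothetical addition, and the gaps below are copied from the training logs above rather than recomputed.

```python
def hamming_gap_status(gap):
    """Map a Hamming-loss gap (val - train) to the status label used in the table."""
    if gap < 0.01:
        return "Excellent"
    if gap < 0.02:
        return "Good"
    if gap < 0.04:
        return "Moderate"
    return "High"


# Hamming gaps as printed in the training logs above.
logged_gaps = {
    "logistic": 0.0070,
    "randomforest": 0.0130,
    "lightgbm": 0.0165,
    "xgboost": 0.0149,
}

statuses = {name: hamming_gap_status(gap) for name, gap in logged_gaps.items()}
for name, status in statuses.items():
    print(f"{name:<15} {logged_gaps[name]:.4f}  {status}")
```

A small Hamming gap does not rule out overfitting on its own: the same logs show exact-match accuracy gaps above 0.30 for LightGBM and XGBoost, so it seems worth reading both gaps together before choosing a model.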

In [None]:
# # Example usage of the enhanced training function
# print("Testing enhanced validation-controlled training...")
# print("="*60)

# # Test with LightGBM
# lgbm_model, lgbm_metrics = training_function_with_validation(
#     X_train_tfidf, y_train, X_val_tfidf, y_val, model_type='lightgbm'
# )



Testing enhanced validation-controlled training...
Training lightgbm with validation control...
X_train shape: (11597, 10000)
y_train shape: (11597, 27)
X_val shape: (2485, 10000)
y_val shape: (2485, 27)
  Training LGBM classifier 1/27
  Training LGBM classifier 2/27
  Training LGBM classifier 3/27
  Training LGBM classifier 4/27
  Training LGBM classifier 5/27
  Training LGBM classifier 6/27
  Training LGBM classifier 7/27
  Training LGBM classifier 8/27
  Training LGBM classifier 9/27
  Training LGBM classifier 10/27
  Training LGBM classifier 11/27
  Training LGBM classifier 12/27
  Training LGBM classifier 13/27
  Training LGBM classifier 14/27
  Training LGBM classifier 15/27
  Training LGBM classifier 16/27
  Training LGBM classifier 17/27
  Training LGBM classifier 18/27
  Training LGBM classifier 19/27
  Training LGBM classifier 20/27
  Training LGBM classifier 21/27
  Training LGBM classifier 22/27
  Training LGBM classifier 23/27
  Training LGBM classifier 24/27
  Training LG

In [None]:
# # Test with XGBoost
# print(f"\n{'-'*40}")
# xgb_model, xgb_metrics = training_function_with_validation(
#     X_train_tfidf, y_train, X_val_tfidf, y_val, model_type='xgboost'
# )

# print(f"\nXGBoost Results:")
# print(f"  Validation Accuracy: {xgb_metrics['val_accuracy']:.4f}")
# print(f"  Overfitting Gap: {xgb_metrics['overfitting_gap']:.4f}")

# # Final test predictions
# lgbm_test_pred = lgbm_model.predict(X_test_tfidf)
# xgb_test_pred = xgb_model.predict(X_test_tfidf)

# lgbm_test_acc = accuracy_score(y_test, lgbm_test_pred)
# xgb_test_acc = accuracy_score(y_test, xgb_test_pred)

# print(f"\nFinal Test Results:")
# print(f"  LightGBM Test Accuracy: {lgbm_test_acc:.4f}")
# print(f"  XGBoost Test Accuracy: {xgb_test_acc:.4f}")

# # Determine best model
# if lgbm_metrics['val_accuracy'] > xgb_metrics['val_accuracy']:
#     best_val_model = 'LightGBM'
#     best_model = lgbm_model
#     best_test_acc = lgbm_test_acc
# else:
#     best_val_model = 'XGBoost'
#     best_model = xgb_model
#     best_test_acc = xgb_test_acc

# print(f"\n🏆 Best validation-controlled model: {best_val_model}")
# print(f"   Test Accuracy: {best_test_acc:.4f}")


----------------------------------------
Training xgboost with validation control...
X_train shape: (11597, 10000)
y_train shape: (11597, 27)
X_val shape: (2485, 10000)
y_val shape: (2485, 27)
  Training XGB classifier 1/27
  Training XGB classifier 2/27
  Training XGB classifier 3/27
  Training XGB classifier 4/27
  Training XGB classifier 5/27
  Training XGB classifier 6/27
  Training XGB classifier 7/27
  Training XGB classifier 8/27
  Training XGB classifier 9/27
  Training XGB classifier 10/27
  Training XGB classifier 11/27
  Training XGB classifier 12/27
  Training XGB classifier 13/27
  Training XGB classifier 14/27
  Training XGB classifier 15/27
  Training XGB classifier 16/27
  Training XGB classifier 17/27
  Training XGB classifier 18/27
  Training XGB classifier 19/27
  Training XGB classifier 20/27
  Training XGB classifier 21/27
  Training XGB classifier 22/27
  Training XGB classifier 23/27
  Training XGB classifier 24/27
  Training XGB classifier 25/27
  Training XGB 

In [None]:
# # Additional Validation Techniques for Overfitting Control

# from sklearn.model_selection import cross_val_score, StratifiedKFold
# from sklearn.model_selection import validation_curve, learning_curve
# import matplotlib.pyplot as plt

# def plot_learning_curve(estimator, X, y, title, cv=5, n_jobs=-1, 
#                        train_sizes=np.linspace(0.1, 1.0, 10)):
#     """
#     Generate a plot showing the learning curve for a model
#     """
#     train_sizes, train_scores, val_scores = learning_curve(
#         estimator, X, y, cv=cv, n_jobs=n_jobs, 
#         train_sizes=train_sizes, scoring='accuracy'
#     )
    
#     train_scores_mean = np.mean(train_scores, axis=1)
#     train_scores_std = np.std(train_scores, axis=1)
#     val_scores_mean = np.mean(val_scores, axis=1)
#     val_scores_std = np.std(val_scores, axis=1)
    
#     plt.figure(figsize=(10, 6))
#     plt.plot(train_sizes, train_scores_mean, 'o-', color='blue', label='Training score')
#     plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
#                      train_scores_mean + train_scores_std, alpha=0.1, color='blue')
    
#     plt.plot(train_sizes, val_scores_mean, 'o-', color='red', label='Cross-validation score')
#     plt.fill_between(train_sizes, val_scores_mean - val_scores_std,
#                      val_scores_mean + val_scores_std, alpha=0.1, color='red')
    
#     plt.xlabel('Training Set Size')
#     plt.ylabel('Accuracy Score')
#     plt.title(f'Learning Curve - {title}')
#     plt.legend(loc='best')
#     plt.grid(True, alpha=0.3)
#     plt.tight_layout()
#     plt.show()
    
#     # Detect overfitting
#     final_gap = train_scores_mean[-1] - val_scores_mean[-1]
#     if final_gap > 0.1:
#         print(f"⚠️ WARNING: {title} shows signs of overfitting (gap: {final_gap:.4f})")
#     elif final_gap > 0.05:
#         print(f"🔶 MODERATE: {title} shows moderate overfitting (gap: {final_gap:.4f})")
#     else:
#         print(f"✅ GOOD: {title} shows good generalization (gap: {final_gap:.4f})")

# def cross_validate_with_overfitting_check(model, X, y, cv=5, model_name="Model"):
#     """
#     Perform cross-validation and check for overfitting signs
#     """
#     print(f"\nCross-validating {model_name}...")
    
#     # Perform cross-validation
#     cv_scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
    
#     # Train on full dataset to check training score
#     model.fit(X, y)
#     train_score = model.score(X, y)
    
#     cv_mean = cv_scores.mean()
#     cv_std = cv_scores.std()
    
#     print(f"  Cross-validation scores: {cv_scores}")
#     print(f"  CV Mean ± Std: {cv_mean:.4f} ± {cv_std:.4f}")
#     print(f"  Training score: {train_score:.4f}")
    
#     # Check for overfitting
#     overfitting_gap = train_score - cv_mean
#     print(f"  Overfitting gap: {overfitting_gap:.4f}")
    
#     if overfitting_gap > 0.1:
#         status = "🚨 HIGH OVERFITTING"
#     elif overfitting_gap > 0.05:
#         status = "⚠️ MODERATE OVERFITTING"
#     else:
#         status = "✅ GOOD GENERALIZATION"
    
#     print(f"  Status: {status}")
    
#     return {
#         'cv_scores': cv_scores,
#         'cv_mean': cv_mean,
#         'cv_std': cv_std,
#         'train_score': train_score,
#         'overfitting_gap': overfitting_gap,
#         'status': status
#     }

# def plot_validation_curve_param(estimator, X, y, param_name, param_range, title):
#     """
#     Plot validation curve for a specific parameter to find optimal value
#     """
#     train_scores, val_scores = validation_curve(
#         estimator, X, y, param_name=param_name, param_range=param_range,
#         cv=5, scoring='accuracy', n_jobs=-1
#     )
    
#     train_scores_mean = np.mean(train_scores, axis=1)
#     train_scores_std = np.std(train_scores, axis=1)
#     val_scores_mean = np.mean(val_scores, axis=1)
#     val_scores_std = np.std(val_scores, axis=1)
    
#     plt.figure(figsize=(10, 6))
#     plt.semilogx(param_range, train_scores_mean, 'o-', color='blue', label='Training score')
#     plt.fill_between(param_range, train_scores_mean - train_scores_std,
#                      train_scores_mean + train_scores_std, alpha=0.1, color='blue')
    
#     plt.semilogx(param_range, val_scores_mean, 'o-', color='red', label='Cross-validation score')
#     plt.fill_between(param_range, val_scores_mean - val_scores_std,
#                      val_scores_mean + val_scores_std, alpha=0.1, color='red')
    
#     plt.xlabel(param_name)
#     plt.ylabel('Accuracy Score')
#     plt.title(f'Validation Curve - {title}')
#     plt.legend(loc='best')
#     plt.grid(True, alpha=0.3)
#     plt.tight_layout()
#     plt.show()
    
#     # Find optimal parameter
#     optimal_idx = np.argmax(val_scores_mean)
#     optimal_param = param_range[optimal_idx]
#     optimal_score = val_scores_mean[optimal_idx]
    
#     print(f"Optimal {param_name}: {optimal_param}")
#     print(f"Optimal CV score: {optimal_score:.4f}")
    
#     return optimal_param, optimal_score

# # Example: Cross-validation analysis for overfitting detection
# print("COMPREHENSIVE VALIDATION ANALYSIS")
# print("="*60)

# # Sample a subset for faster computation in demo
# sample_size = min(1000, len(X_train_tfidf))
# X_sample = X_train_tfidf[:sample_size]
# y_sample = y_train[:sample_size]

# print(f"Using sample of {sample_size} examples for validation analysis...")

# # 1. Cross-validation for different models
# models_for_cv = {
#     'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
#     'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42, max_depth=10),
# }

# cv_results = {}
# for name, model in models_for_cv.items():
#     # Use OneVsRestClassifier for multi-label
#     multi_label_model = OneVsRestClassifier(model)
#     cv_results[name] = cross_validate_with_overfitting_check(
#         multi_label_model, X_sample, y_sample, cv=3, model_name=name
#     )

# # 2. Find models with best generalization
# print(f"\n{'='*60}")
# print("OVERFITTING SUMMARY:")
# print(f"{'Model':<20} | {'CV Score':<10} | {'Gap':<8} | {'Status'}")
# print(f"{'-'*65}")

# for name, results in cv_results.items():
#     print(f"{name:<20} | {results['cv_mean']:<10.4f} | {results['overfitting_gap']:<8.4f} | {results['status']}")

# # 3. Recommendations for overfitting control
# print(f"\n{'='*60}")
# print("RECOMMENDATIONS FOR OVERFITTING CONTROL:")
# print()
# print("1. 📊 VALIDATION MONITORING:")
# print("   - Always split data into train/validation/test")
# print("   - Monitor validation metrics during training")
# print("   - Use early stopping when validation stops improving")
# print()
# print("2. 🔧 MODEL REGULARIZATION:")
# print("   - Logistic Regression: Adjust C parameter (lower = more regularization)")
# print("   - Random Forest: Limit max_depth, increase min_samples_split")
# print("   - XGBoost/LightGBM: Use early_stopping_rounds, adjust learning_rate")
# print()
# print("3. 📈 TECHNIQUES IMPLEMENTED:")
# print("   - Train/Validation/Test split (70/15/15)")
# print("   - Cross-validation for robust evaluation")
# print("   - Early stopping for tree-based models")
# print("   - Validation gap monitoring")
# print("   - Learning curve analysis")
# print()
# print("4. 🎯 SELECTION CRITERIA:")
# print("   - Choose model with best VALIDATION performance")
# print("   - Prefer models with smaller train-validation gap")
# print("   - Consider cross-validation consistency")

# # Example of how to use validation curve for parameter tuning
# print(f"\n{'='*60}")
# print("PARAMETER TUNING WITH VALIDATION CURVES:")
# print("(Use this approach to find optimal hyperparameters)")
# print()
# print("Example code for Random Forest max_depth tuning:")
# print("""
# # Find optimal max_depth for Random Forest
# param_range = [3, 5, 7, 10, 15, 20]
# optimal_depth, optimal_score = plot_validation_curve_param(
#     OneVsRestClassifier(RandomForestClassifier(random_state=42)),
#     X_train_tfidf, y_train,
#     param_name='estimator__max_depth',
#     param_range=param_range,
#     title='Random Forest max_depth'
# )
# """)

## Transformer Encoder Model (ModernBERT)

In [1]:
import os
import pandas as pd
import numpy as np
import pickle
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
import evaluate
import warnings
# Suppress the tqdm warning temporarily
warnings.filterwarnings('ignore', category=UserWarning, module='tqdm')



In [2]:
def create_datasets_from_arrays(X_train, y_train, X_val=None, y_val=None, X_test=None, y_test=None):
    """
    Convert arrays into HuggingFace datasets format with specified structure
    
    Returns:
        DatasetDict with features:
        - dataset["train"]["text"]: text data
        - dataset["train"]["labels"]: multi-label arrays
        - dataset["val"]["text"]: validation text data (if provided)
        - dataset["val"]["labels"]: validation labels (if provided)
        - dataset["test"]["text"]: test text data (if provided)
        - dataset["test"]["labels"]: test labels (if provided)
    """
    # Create training dataset
    train_dict = {
        "text": X_train.tolist() if hasattr(X_train, 'tolist') else list(X_train),
        "labels": y_train.tolist() if hasattr(y_train, 'tolist') else list(y_train)
    }
    
    datasets_dict = {
        "train": Dataset.from_dict(train_dict)
    }
    
    # Add validation dataset if provided
    if X_val is not None and y_val is not None:
        val_dict = {
            "text": X_val.tolist() if hasattr(X_val, 'tolist') else list(X_val),
            "labels": y_val.tolist() if hasattr(y_val, 'tolist') else list(y_val)
        }
        datasets_dict["val"] = Dataset.from_dict(val_dict)
    
    # Add test dataset if provided
    if X_test is not None and y_test is not None:
        test_dict = {
            "text": X_test.tolist() if hasattr(X_test, 'tolist') else list(X_test),
            "labels": y_test.tolist() if hasattr(y_test, 'tolist') else list(y_test)
        }
        datasets_dict["test"] = Dataset.from_dict(test_dict)

    # Create DatasetDict
    dataset = DatasetDict(datasets_dict)
    
    return dataset

In [3]:
saved_data=os.path.join(os.getcwd(), 'processed_data')
with open(os.path.join(saved_data,'train_arrays.pkl'), 'rb') as f:
    train_data = pickle.load(f)
    X_train = train_data['X_train']
    y_train = train_data['y_train']

with open(os.path.join(saved_data,'val_arrays.pkl'), 'rb') as f:
    val_data = pickle.load(f)
    X_val = val_data['X_val']
    y_val = val_data['y_val']

with open(os.path.join(saved_data,'test_arrays.pkl'), 'rb') as f:
    test_data = pickle.load(f)
    X_test = test_data['X_test']
    y_test = test_data['y_test']

with open(os.path.join(saved_data,'class_name.pkl'), 'rb') as f:
    class_name_data = pickle.load(f)
    class_name = class_name_data['class_name']

# Create the datasets
dataset = create_datasets_from_arrays(X_train, y_train, X_val, y_val, X_test, y_test)


In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 11597
    })
    val: Dataset({
        features: ['text', 'labels'],
        num_rows: 2485
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 2486
    })
})
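
The row counts above can be checked against the 70/15/15 train/validation/test split mentioned in the earlier validation notes. A minimal sketch, using only the numbers printed in the `DatasetDict` output:

```python
# Split sizes copied from the DatasetDict printed above.
split_sizes = {"train": 11597, "val": 2485, "test": 2486}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name:<5} {split_sizes[name]:>6} rows  {frac:.1%}")
```

The fractions come out at roughly 70%, 15%, and 15% of the 16,568 documents.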

In [5]:
model_path = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)

def preprocess_function(example):
   text = example['text']
   example = tokenizer(text, truncation=True)
   return example

tokenized_dataset = dataset.map(preprocess_function)

Map: 100%|██████████| 11597/11597 [01:22<00:00, 140.80 examples/s]
Map: 100%|██████████| 2485/2485 [00:18<00:00, 136.80 examples/s]
Map: 100%|██████████| 2486/2486 [00:18<00:00, 137.40 examples/s]


In [9]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 11597
    })
    val: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 2485
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 2486
    })
})
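
Because the tokenizer is called with `truncation=True`, any document longer than the maximum length in effect is cut off. It is worth measuring how much of the corpus actually fits. The sketch below is one hedged way to do that: `coverage` is a hypothetical helper, and the token counts are made-up toy values, not measurements from this corpus. On the real data, the lengths could be taken from the `input_ids` produced by a tokenization pass without truncation.

```python
def coverage(token_lengths, max_length):
    """Fraction of documents whose token count fits within max_length."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    fitting = sum(1 for n in token_lengths if n <= max_length)
    return fitting / len(token_lengths)

# Hypothetical token counts, for illustration only (not measured on this corpus).
toy_lengths = [310, 480, 1200, 2500, 4100, 6000, 7900, 8100, 9500, 15000]

for limit in (512, 8192):
    print(f"max_length={limit}: {coverage(toy_lengths, limit):.0%} of toy documents fit")
```

Running the same check at several candidate limits shows how much context is lost at each setting before committing to one.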

In [10]:
# Import required metrics libraries
from sklearn.metrics import (
    precision_score, recall_score, f1_score, 
    roc_auc_score, average_precision_score,
    hamming_loss, jaccard_score, accuracy_score
)

def sigmoid(x):
    return 1/(1 + np.exp(-x))

def comprehensive_evaluation(y_true, y_pred_proba, y_pred_binary=None, threshold=0.5):
    """
    Comprehensive evaluation for multi-label classification with all averaging methods
    """
    if y_pred_binary is None:
        y_pred_binary = (y_pred_proba >= threshold).astype(int)
    
    metrics = {}
    
    # SAMPLES AVERAGE (per-sample then average across samples)
    metrics['precision_samples'] = precision_score(y_true, y_pred_binary, average='samples', zero_division=0)
    metrics['recall_samples'] = recall_score(y_true, y_pred_binary, average='samples', zero_division=0)
    metrics['f1_samples'] = f1_score(y_true, y_pred_binary, average='samples', zero_division=0)
    
    # MICRO AVERAGE (global aggregation)
    metrics['precision_micro'] = precision_score(y_true, y_pred_binary, average='micro', zero_division=0)
    metrics['recall_micro'] = recall_score(y_true, y_pred_binary, average='micro', zero_division=0)
    metrics['f1_micro'] = f1_score(y_true, y_pred_binary, average='micro', zero_division=0)
    
    # MACRO AVERAGE (unweighted average across labels)
    metrics['precision_macro'] = precision_score(y_true, y_pred_binary, average='macro', zero_division=0)
    metrics['recall_macro'] = recall_score(y_true, y_pred_binary, average='macro', zero_division=0)
    metrics['f1_macro'] = f1_score(y_true, y_pred_binary, average='macro', zero_division=0)
    
    # WEIGHTED AVERAGE (weighted by support/frequency)
    metrics['precision_weighted'] = precision_score(y_true, y_pred_binary, average='weighted', zero_division=0)
    metrics['recall_weighted'] = recall_score(y_true, y_pred_binary, average='weighted', zero_division=0)
    metrics['f1_weighted'] = f1_score(y_true, y_pred_binary, average='weighted', zero_division=0)
    
    # ROC-AUC (multiple averaging methods)
    try:
        metrics['roc_auc_macro'] = roc_auc_score(y_true, y_pred_proba, average='macro')
        metrics['roc_auc_weighted'] = roc_auc_score(y_true, y_pred_proba, average='weighted')
        metrics['roc_auc_samples'] = roc_auc_score(y_true, y_pred_proba, average='samples')
    except ValueError as e:
        print(f"ROC-AUC calculation failed: {e}")
        metrics['roc_auc_macro'] = 0.0
        metrics['roc_auc_weighted'] = 0.0
        metrics['roc_auc_samples'] = 0.0
    
    # Precision-Recall AUC (multiple averaging methods)
    try:
        metrics['pr_auc_macro'] = average_precision_score(y_true, y_pred_proba, average='macro')
        metrics['pr_auc_weighted'] = average_precision_score(y_true, y_pred_proba, average='weighted')
        metrics['pr_auc_samples'] = average_precision_score(y_true, y_pred_proba, average='samples')
    except ValueError as e:
        print(f"PR-AUC calculation failed: {e}")
        metrics['pr_auc_macro'] = 0.0
        metrics['pr_auc_weighted'] = 0.0
        metrics['pr_auc_samples'] = 0.0
    
    # Hamming Loss (inherently micro-averaged)
    metrics['hamming_loss'] = hamming_loss(y_true, y_pred_binary)
    
    # Jaccard Score (multiple averaging methods)
    metrics['jaccard_samples'] = jaccard_score(y_true, y_pred_binary, average='samples', zero_division=0)
    metrics['jaccard_macro'] = jaccard_score(y_true, y_pred_binary, average='macro', zero_division=0)
    metrics['jaccard_weighted'] = jaccard_score(y_true, y_pred_binary, average='weighted', zero_division=0)
    
    # Overall accuracy (subset accuracy for multi-label)
    metrics['accuracy'] = accuracy_score(y_true, y_pred_binary)
    
    # Note: jaccard_score also accepts average='micro' if a globally
    # aggregated Jaccard index is needed; it is omitted here.
    
    return metrics

def compute_metrics(eval_pred):
    """
    Enhanced compute_metrics function for transformers Trainer using comprehensive evaluation
    """
    predictions, labels = eval_pred
    
    # Apply sigmoid to get probabilities
    predictions_proba = sigmoid(predictions)
    
    # Convert to binary predictions using threshold 0.5
    predictions_binary = (predictions_proba > 0.5).astype(int)
    
    # Ensure labels are integers
    labels = labels.astype(int)
    
    # Use comprehensive evaluation
    metrics = comprehensive_evaluation(
        y_true=labels,
        y_pred_proba=predictions_proba,
        y_pred_binary=predictions_binary,
        threshold=0.5
    )
    
    # Return metrics with prefixes for clarity during training
    return {
        # Primary metrics for monitoring
        'eval_f1_micro': metrics['f1_micro'],
        'eval_f1_macro': metrics['f1_macro'],
        'eval_accuracy': metrics['accuracy'],
        'eval_hamming_loss': metrics['hamming_loss'],
        
        # Precision metrics
        'eval_precision_micro': metrics['precision_micro'],
        'eval_precision_macro': metrics['precision_macro'],
        'eval_precision_samples': metrics['precision_samples'],
        'eval_precision_weighted': metrics['precision_weighted'],
        
        # Recall metrics
        'eval_recall_micro': metrics['recall_micro'],
        'eval_recall_macro': metrics['recall_macro'],
        'eval_recall_samples': metrics['recall_samples'],
        'eval_recall_weighted': metrics['recall_weighted'],
        
        # F1 metrics
        'eval_f1_samples': metrics['f1_samples'],
        'eval_f1_weighted': metrics['f1_weighted'],
        
        # ROC-AUC metrics
        'eval_roc_auc_macro': metrics['roc_auc_macro'],
        'eval_roc_auc_weighted': metrics['roc_auc_weighted'],
        'eval_roc_auc_samples': metrics['roc_auc_samples'],
        
        # PR-AUC metrics
        'eval_pr_auc_macro': metrics['pr_auc_macro'],
        'eval_pr_auc_weighted': metrics['pr_auc_weighted'],
        'eval_pr_auc_samples': metrics['pr_auc_samples'],
        
        # Jaccard metrics
        'eval_jaccard_samples': metrics['jaccard_samples'],
        'eval_jaccard_macro': metrics['jaccard_macro'],
        'eval_jaccard_weighted': metrics['jaccard_weighted'],
    }
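
The micro and macro averages reported by `compute_metrics` can diverge sharply when some labels are rare. The sketch below illustrates this with plain Python on toy data. The helper names `f1` and `label_counts` and the two-label example are assumptions made for illustration; the notebook itself relies on scikit-learn for these metrics.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def label_counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 0 and p[label] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 0)
    return tp, fp, fn

# Toy data (hypothetical): label 0 is frequent and well predicted,
# label 1 is rare and always missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 0]]

counts = [label_counts(y_true, y_pred, j) for j in range(2)]

# Macro: average the per-label F1 scores, so every label weighs the same.
f1_macro = sum(f1(*c) for c in counts) / len(counts)

# Micro: pool the counts over all labels first, so frequent labels dominate.
f1_micro = f1(*(sum(col) for col in zip(*counts)))

print(f"F1 macro: {f1_macro:.3f}, F1 micro: {f1_micro:.3f}")
```

In this toy case the missed rare label halves the macro score while barely moving the micro score. A gap between the two averages is therefore often a sign that infrequent labels are predicted poorly.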


In [11]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

class2id = {class_:id for id, class_ in enumerate(class_name)}
id2class = {id:class_ for class_, id in class2id.items()}


model = AutoModelForSequenceClassification.from_pretrained(model_path, 
                                                           num_labels=len(class_name),
                                                           id2label=id2class, 
                                                           label2id=class2id,
                                                           problem_type = "multi_label_classification"
)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
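
With `problem_type="multi_label_classification"`, the Transformers sequence-classification head trains with a binary cross-entropy loss on the logits (`BCEWithLogitsLoss` in PyTorch), treating each label as an independent yes/no decision. This is also why the labels must be floats. The function below is a simplified pure-Python sketch of that loss for a single example, not the library's implementation; `bce_with_logits` and the toy values are illustrative assumptions.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

# Toy example with 3 labels (hypothetical values).
targets = [1.0, 0.0, 0.0]
untrained = bce_with_logits([0.0, 0.0, 0.0], targets)    # every label at p = 0.5
confident = bce_with_logits([4.0, -4.0, -4.0], targets)  # correct and confident

print(f"untrained: {untrained:.4f}, confident: {confident:.4f}")
```

A model that outputs zero logits everywhere scores `log(2)`, about 0.693, which is a useful reference point when reading the first logged training losses.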


In [17]:
# Enhanced Training Configuration for Multi-label Classification
import os
import warnings
from transformers import EarlyStoppingCallback

# Fix tokenizer parallelism warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Suppress future warnings
warnings.filterwarnings("ignore", category=FutureWarning)

training_args = TrainingArguments(
    # Output and logging
    output_dir="./model_output",
    logging_dir="./logs",
    logging_steps=50,
    logging_strategy="steps",
    
    # Learning parameters
    learning_rate=2e-5,
    lr_scheduler_type="linear",  # Linear decay
    warmup_ratio=0.1,  # 10% warmup
    weight_decay=0.01,
    
    # Batch sizes (adjust based on GPU memory)
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    gradient_accumulation_steps=4,  # Effective batch size = 3 * 4 = 12
    
    # Training epochs and evaluation
    num_train_epochs=3,  # Increased for better convergence
    eval_strategy="steps",  # More frequent evaluation
    eval_steps=100,  # Evaluate every 100 steps
    
    # Saving strategy
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,  # Keep at most 3 checkpoints on disk (the best one is always retained)
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1_micro",  # Use micro F1 for model selection
    greater_is_better=True,
    
    # Memory and performance optimization
    dataloader_pin_memory=False,  # Disable to avoid forking issues
    dataloader_num_workers=0,     # Disable multiprocessing
    remove_unused_columns=False,  # Trainer will not drop extra columns; remove non-model columns (e.g. 'text') beforehand
    
    # Mixed precision for faster training (if GPU supports it)
    fp16=True,  # Enable if using compatible GPU
    
    # Reproducibility
    seed=42,
    data_seed=42,
    
    # Report metrics
    report_to="none",  # Disable wandb/tensorboard reporting; None can fall back to all installed integrations
    run_name="multi_label_posture_classification",
)

# Early stopping callback for overfitting control
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,  # Stop if no improvement for 3 evaluations
    early_stopping_threshold=0.001  # Minimum improvement threshold
)

# Initialize trainer with enhanced configuration (using processing_class)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["val"],
    processing_class=tokenizer,  # Updated parameter name
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping],  # Add early stopping callback
)

print("🚀 Starting training with enhanced configuration...")
print(f"📊 Training samples: {len(tokenized_dataset['train'])}")
print(f"📊 Validation samples: {len(tokenized_dataset['val'])}")
print(f"🎯 Target metric: {training_args.metric_for_best_model}")
print(f"⏱️ Total epochs: {training_args.num_train_epochs}")
print(f"🔄 Evaluation every: {training_args.eval_steps} steps")
print(f"💾 Saving every: {training_args.save_steps} steps")
print(f"⏹️ Early stopping patience: {early_stopping.early_stopping_patience}")

# Start training with error handling
try:
    print("\n🎯 Starting training...")
    trainer.train()
    print("✅ Training completed successfully!")
except Exception as e:
    print(f"❌ Training failed with error: {e}")
    print("💡 Consider:")
    print("   - Reducing batch size if out of memory")
    print("   - Checking data format compatibility")
    print("   - Verifying model and tokenizer compatibility")
    print("   - The data format may need fixing - check tokenization step")

🚀 Starting training with enhanced configuration...
📊 Training samples: 11597
📊 Validation samples: 2485
🎯 Target metric: eval_f1_micro
⏱️ Total epochs: 3
🔄 Evaluation every: 100 steps
💾 Saving every: 100 steps
⏹️ Early stopping patience: 3

🎯 Starting training...


W0621 11:26:34.243000 1031232 torch/_inductor/utils.py:1250] [1/0] Not enough SMs to use max_autotune_gemm mode


Step,Training Loss,Validation Loss,F1 Micro,F1 Macro,Accuracy,Hamming Loss,Precision Micro,Precision Macro,Precision Samples,Precision Weighted,Recall Micro,Recall Macro,Recall Samples,Recall Weighted,F1 Samples,F1 Weighted,Roc Auc Macro,Roc Auc Weighted,Roc Auc Samples,Pr Auc Macro,Pr Auc Weighted,Pr Auc Samples,Jaccard Samples,Jaccard Macro,Jaccard Weighted
100,0.5833,0.136496,0.609441,0.071993,0.329175,0.037365,0.745995,0.072692,0.747686,0.434557,0.515144,0.07682,0.555949,0.515144,0.613199,0.466927,0.573098,0.768878,0.889718,0.122389,0.541441,0.7095,0.539618,0.059121,0.39647
200,0.4338,0.113665,0.666854,0.106456,0.347686,0.035353,0.714415,0.096628,0.736553,0.521931,0.62523,0.125872,0.677378,0.62523,0.669564,0.563913,0.700035,0.876698,0.926063,0.171479,0.654765,0.779754,0.585091,0.087707,0.485853
300,0.3794,0.10392,0.627672,0.103288,0.370624,0.032447,0.895122,0.146988,0.689537,0.66747,0.483276,0.087267,0.538478,0.483276,0.584172,0.545678,0.816023,0.920989,0.940697,0.257442,0.72878,0.822052,0.529403,0.081089,0.448814
400,0.3435,0.083538,0.737581,0.172045,0.477666,0.027319,0.80803,0.256612,0.804292,0.710389,0.67843,0.167206,0.724661,0.67843,0.73332,0.675535,0.879119,0.947028,0.962183,0.335132,0.766255,0.866539,0.667062,0.13651,0.593224
500,0.2835,0.076682,0.75083,0.181496,0.492958,0.024622,0.878574,0.305809,0.854795,0.746643,0.655518,0.15988,0.70947,0.655518,0.74837,0.6654,0.890288,0.950951,0.967984,0.369931,0.778636,0.879262,0.683078,0.144189,0.592827
600,0.2668,0.068757,0.782621,0.279486,0.539235,0.022446,0.865858,0.398412,0.855801,0.758672,0.713985,0.255255,0.761885,0.713985,0.778189,0.720315,0.914098,0.960207,0.974035,0.435769,0.798009,0.897154,0.716915,0.221038,0.643413
700,0.2617,0.068667,0.790376,0.277427,0.550101,0.021686,0.872455,0.339714,0.859155,0.739712,0.722412,0.257859,0.770134,0.722412,0.784359,0.722986,0.928785,0.962977,0.97661,0.468455,0.808033,0.899306,0.724292,0.224901,0.652195
800,0.2446,0.069419,0.775353,0.290474,0.533199,0.02255,0.8887,0.40121,0.857545,0.752224,0.687648,0.261627,0.738578,0.687648,0.76668,0.704035,0.935037,0.962016,0.976544,0.490506,0.81176,0.894241,0.706687,0.231054,0.631262
900,0.2482,0.061233,0.807222,0.383487,0.589537,0.020687,0.853952,0.513256,0.857746,0.794985,0.765341,0.352756,0.805547,0.765341,0.806076,0.769488,0.939656,0.966226,0.981477,0.516552,0.822136,0.91596,0.750865,0.304213,0.689063
1000,0.1992,0.063,0.808569,0.429539,0.582294,0.020911,0.838901,0.460617,0.847619,0.770787,0.780353,0.42138,0.819095,0.780353,0.808062,0.768979,0.938371,0.966902,0.981896,0.52442,0.826524,0.915703,0.751053,0.346934,0.690103


✅ Training completed successfully!
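
To see how the early-stopping settings (`early_stopping_patience=3`, `early_stopping_threshold=0.001`) interact with the logged scores, the rule can be replayed offline. The function below is a simplified re-implementation written for illustration, not the `EarlyStoppingCallback` source. `first_stop_index` is a hypothetical name, and the scores are the F1 Micro values from the table above.

```python
def first_stop_index(scores, patience, threshold):
    """Return the index of the evaluation that would trigger a stop, or None.

    Simplified rule: an evaluation counts as an improvement only if it beats
    the best score so far by more than `threshold`; `patience` consecutive
    non-improvements trigger a stop.
    """
    best = None
    misses = 0
    for i, score in enumerate(scores):
        if best is None or score - best > threshold:
            misses = 0
        else:
            misses += 1
        if best is None or score > best:
            best = score
        if misses >= patience:
            return i
    return None

# Eval F1 micro at steps 100..1000, copied from the training log above.
f1_micro_log = [0.609441, 0.666854, 0.627672, 0.737581, 0.75083,
                0.782621, 0.790376, 0.775353, 0.807222, 0.808569]

print(first_stop_index(f1_micro_log, patience=3, threshold=0.001))
```

Under this simplified rule, no stop is triggered in the ten evaluations shown: the score dips at steps 300 and 800, but each dip is followed by a new best. A patience of 1 would have stopped at the third evaluation (step 300).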


In [18]:
# Pre-training Checks and Debugging

print("🔍 PRE-TRAINING VALIDATION CHECKS")
print("=" * 50)

# Check dataset structure
print("\n📊 Dataset Structure Check:")
print(f"Available splits: {list(tokenized_dataset.keys())}")
for split_name, split_data in tokenized_dataset.items():
    print(f"   {split_name}: {len(split_data)} samples")
    print(f"   Features: {list(split_data.features.keys())}")

# Check sample data structure
print("\n🔍 Sample Data Inspection:")
sample = tokenized_dataset["train"][0]
print(f"Sample keys: {sample.keys()}")
print(f"Text type: {type(sample.get('text', 'N/A'))}")
print(f"Labels type: {type(sample.get('labels', 'N/A'))}")
if 'labels' in sample:
    labels_array = np.array(sample['labels'])
    print(f"Labels shape: {labels_array.shape}")
    print(f"Labels dtype: {labels_array.dtype}")
    print(f"Labels sum: {labels_array.sum()}")
    print(f"Sample labels: {sample['labels']}")

# Check tokenizer output
print(f"\nTokenizer info:")
print(f"   Input IDs shape: {np.array(sample['input_ids']).shape}")
print(f"   Attention mask shape: {np.array(sample['attention_mask']).shape}")

# Model configuration check
print(f"\n🤖 Model Configuration:")
print(f"   Model type: {type(model).__name__}")
print(f"   Number of labels: {model.config.num_labels}")
print(f"   Problem type: {getattr(model.config, 'problem_type', 'Not set')}")

# GPU/CPU check
import torch
if torch.cuda.is_available():
    print(f"\n💻 GPU Information:")
    print(f"   Device: {torch.cuda.get_device_name()}")
    print(f"   Memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"   Memory reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
else:
    print(f"\n💻 Using CPU for training")

# Training configuration summary
print(f"\n⚙️ Training Configuration Summary:")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")
print(f"   Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"   Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   Total epochs: {training_args.num_train_epochs}")
print(f"   Warmup ratio: {training_args.warmup_ratio}")
print(f"   Weight decay: {training_args.weight_decay}")
print(f"   FP16 enabled: {training_args.fp16}")

# Estimate training time
train_samples = len(tokenized_dataset["train"])
batch_size = training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps
steps_per_epoch = train_samples // batch_size
total_steps = steps_per_epoch * training_args.num_train_epochs

print(f"\n⏱️ Training Estimates:")
print(f"   Steps per epoch: {steps_per_epoch}")
print(f"   Total training steps: {total_steps}")
print(f"   Evaluation every: {training_args.eval_steps} steps")
print(f"   Number of evaluations: {total_steps // training_args.eval_steps}")

print(f"\n✅ Pre-training checks completed!")
print("🚀 Ready to start training...")

# Data Format Debugging and Fix

print("🔍 DATA FORMAT DEBUGGING AND FIXING")
print("=" * 60)

# Check the original dataset structure
print("\n📊 Original Dataset Structure:")
sample = dataset["train"][0]
print(f"Sample keys: {sample.keys()}")
print(f"Text type: {type(sample['text'])}")
print(f"Text content: {str(sample['text'])[:100]}...")
print(f"Labels type: {type(sample['labels'])}")
print(f"Labels: {sample['labels']}")

# The issue is likely in the preprocess_function
# Let's check what the tokenized dataset looks like
print(f"\n🔍 Tokenized Dataset Structure:")
if 'tokenized_dataset' in globals():
    tokenized_sample = tokenized_dataset["train"][0]
    print(f"Tokenized sample keys: {tokenized_sample.keys()}")
    for key, value in tokenized_sample.items():
        print(f"   {key}: type={type(value)}, shape/len={getattr(value, 'shape', len(value) if hasattr(value, '__len__') else 'N/A')}")
        if key == 'text' and hasattr(value, '__iter__') and not isinstance(value, str):
            print(f"      First few elements: {list(value)[:3] if hasattr(value, '__iter__') else value}")

# Fix the preprocess function
def fixed_preprocess_function(examples):
    """
    Fixed preprocessing function for multi-label classification
    """
    # Handle batch processing
    if isinstance(examples['text'], list):
        texts = examples['text']
        labels = examples['labels']
    else:
        texts = [examples['text']]
        labels = [examples['labels']]
    
    # Tokenize the texts
    tokenized = tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=512,
        return_tensors=None  # Return lists, not tensors
    )
    
    # Ensure labels are properly formatted
    processed_labels = []
    for label_list in labels:
        if isinstance(label_list, (list, tuple)):
            # Convert to float and ensure it's a list
            processed_labels.append([float(x) for x in label_list])
        else:
            # Handle single values
            processed_labels.append([float(label_list)])
    
    tokenized['labels'] = processed_labels
    return tokenized

# Re-tokenize the dataset with the fixed function
print(f"\n🔧 Re-tokenizing dataset with fixed function...")
try:
    tokenized_dataset_fixed = dataset.map(
        fixed_preprocess_function,
        batched=True,
        remove_columns=dataset["train"].column_names,
        desc="Tokenizing"
    )
    
    print(f"✅ Re-tokenization successful!")
    
    # Check the fixed dataset
    print(f"\n✅ Fixed Dataset Structure:")
    fixed_sample = tokenized_dataset_fixed["train"][0]
    print(f"Fixed sample keys: {fixed_sample.keys()}")
    for key, value in fixed_sample.items():
        print(f"   {key}: type={type(value)}, len={len(value) if hasattr(value, '__len__') else 'N/A'}")
        if key == 'labels':
            print(f"      Labels: {value}")
    
    # Update the global tokenized_dataset
    tokenized_dataset = tokenized_dataset_fixed
    
    print(f"\n📊 Dataset sizes after fixing:")
    for split_name in tokenized_dataset.keys():
        print(f"   {split_name}: {len(tokenized_dataset[split_name])} samples")
        
except Exception as e:
    print(f"❌ Re-tokenization failed: {e}")
    print("🔍 Let's try a simpler approach...")
    
    # Alternative: Manual tokenization
    def simple_tokenize_sample(sample):
        text = str(sample['text'])  # Ensure it's a string
        labels = sample['labels']
        
        # Tokenize
        tokenized = tokenizer(
            text,
            truncation=True,
            padding=True,
            max_length=512,
            return_tensors=None
        )
        
        # Ensure labels are float list
        if isinstance(labels, (list, tuple)):
            tokenized['labels'] = [float(x) for x in labels]
        else:
            tokenized['labels'] = [float(labels)]
            
        return tokenized
    
    # Apply simple tokenization
    tokenized_dataset = dataset.map(simple_tokenize_sample, desc="Simple tokenization")
    print(f"✅ Simple tokenization completed!")

# Verify the final dataset
print(f"\n✅ FINAL DATASET VERIFICATION:")
final_sample = tokenized_dataset["train"][0]
print(f"Final sample structure:")
for key, value in final_sample.items():
    print(f"   {key}: type={type(value)}, len/shape={len(value) if hasattr(value, '__len__') else 'N/A'}")
    if key == 'labels':
        print(f"      Sample labels: {value}")
        print(f"      Labels dtype: {type(value[0]) if isinstance(value, list) and len(value) > 0 else 'N/A'}")

print(f"\n🎯 Dataset is now ready for training!")

🔍 PRE-TRAINING VALIDATION CHECKS

📊 Dataset Structure Check:
Available splits: ['train', 'val', 'test']
   train: 11597 samples
   Features: ['labels', 'input_ids', 'attention_mask']
   val: 2485 samples
   Features: ['labels', 'input_ids', 'attention_mask']
   test: 2486 samples
   Features: ['labels', 'input_ids', 'attention_mask']

🔍 Sample Data Inspection:
Sample keys: dict_keys(['labels', 'input_ids', 'attention_mask'])
Text type: <class 'str'>
Labels type: <class 'list'>
Labels shape: (27,)
Labels dtype: float64
Labels sum: 1.0
Sample labels: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Tokenizer info:
   Input IDs shape: (512,)
   Attention mask shape: (512,)

🤖 Model Configuration:
   Model type: ModernBertForSequenceClassification
   Number of labels: 27
   Problem type: multi_label_classification

💻 GPU Information:
   Device: NVIDIA GeForce RTX 4080 Laptop GPU
   Memory allocated: 1.7

Tokenizing: 100%|██████████| 11597/11597 [01:02<00:00, 185.67 examples/s]
Tokenizing: 100%|██████████| 2485/2485 [00:13<00:00, 185.46 examples/s]
Tokenizing: 100%|██████████| 2486/2486 [00:13<00:00, 184.84 examples/s]

✅ Re-tokenization successful!

✅ Fixed Dataset Structure:
Fixed sample keys: dict_keys(['labels', 'input_ids', 'attention_mask'])
   labels: type=<class 'list'>, len=27
      Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
   input_ids: type=<class 'list'>, len=512
   attention_mask: type=<class 'list'>, len=512

📊 Dataset sizes after fixing:
   train: 11597 samples
   val: 2485 samples
   test: 2486 samples

✅ FINAL DATASET VERIFICATION:
Final sample structure:
   labels: type=<class 'list'>, len/shape=27
      Sample labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
      Labels dtype: <class 'int'>
   input_ids: type=<class 'list'>, len/shape=512
   attention_mask: type=<class 'list'>, len/shape=512

🎯 Dataset is now ready for training!
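The verification above inspects only the first training sample. A minimal sketch of a check over every label vector follows. `validate_label_vectors` is a hypothetical helper, not part of the notebook, and it assumes each label vector is a plain Python list of 0/1 values with one entry per class (27 in this dataset, 3 in the toy data below).

```python
def validate_label_vectors(label_rows, num_labels):
    """Return (row_index, reason) pairs for malformed multi-hot label vectors."""
    problems = []
    for idx, row in enumerate(label_rows):
        if len(row) != num_labels:
            problems.append((idx, f"expected {num_labels} entries, got {len(row)}"))
        elif any(v not in (0, 1) for v in row):
            # 0.0 and 1.0 compare equal to 0 and 1, so float labels pass this test
            problems.append((idx, "non-binary value"))
    return problems


# Toy rows standing in for the notebook's label column (assumed layout).
rows = [
    [0, 1, 0],        # valid integer labels
    [0.0, 0.0, 1.0],  # valid float labels
    [1, 0],           # wrong length
    [0.5, 0, 1],      # not binary
]
issues = validate_label_vectors(rows, num_labels=3)
for idx, reason in issues:
    print(f"row {idx}: {reason}")
```

An empty result would indicate that every row has the expected length and contains only 0/1 values; it says nothing about whether the labels themselves are correct.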





In [None]:
# Post-Training Evaluation and Testing

print("🔍 COMPREHENSIVE MODEL EVALUATION")
print("=" * 60)

# Evaluate on validation set
print("\n📊 Validation Set Evaluation:")
val_results = trainer.evaluate()

# Display key metrics
key_metrics = [
    'eval_f1_micro', 'eval_f1_macro', 'eval_accuracy', 'eval_hamming_loss',
    'eval_precision_micro', 'eval_recall_micro', 'eval_roc_auc_macro'
]

for metric in key_metrics:
    if metric in val_results:
        print(f"   {metric}: {val_results[metric]:.4f}")

# Test on test set if available
if "test" in tokenized_dataset:
    print("\n🎯 Test Set Evaluation:")
    test_results = trainer.evaluate(eval_dataset=tokenized_dataset["test"])
    
    for metric in key_metrics:
        if metric in test_results:
            print(f"   {metric}: {test_results[metric]:.4f}")

# Get predictions for detailed analysis
print("\n🔬 Detailed Prediction Analysis:")

# Predict on validation set
val_predictions = trainer.predict(tokenized_dataset["val"])
val_probs = sigmoid(val_predictions.predictions)
val_binary = (val_probs > 0.5).astype(int)
val_true = val_predictions.label_ids

# Use comprehensive evaluation function
detailed_metrics = comprehensive_evaluation(
    y_true=val_true,
    y_pred_proba=val_probs,
    y_pred_binary=val_binary
)

print("\n📈 Comprehensive Metrics Summary:")
print("-" * 50)

# Group metrics by type
metric_groups = {
    'Precision': ['precision_micro', 'precision_macro', 'precision_samples', 'precision_weighted'],
    'Recall': ['recall_micro', 'recall_macro', 'recall_samples', 'recall_weighted'],
    'F1-Score': ['f1_micro', 'f1_macro', 'f1_samples', 'f1_weighted'],
    'ROC-AUC': ['roc_auc_macro', 'roc_auc_weighted', 'roc_auc_samples'],
    'PR-AUC': ['pr_auc_macro', 'pr_auc_weighted', 'pr_auc_samples'],
    'Other': ['accuracy', 'hamming_loss', 'jaccard_macro', 'jaccard_samples']
}

for group_name, metrics in metric_groups.items():
    print(f"\n{group_name}:")
    for metric in metrics:
        if metric in detailed_metrics:
            print(f"   {metric}: {detailed_metrics[metric]:.4f}")

# Sample predictions analysis
print("\n🔍 Sample Predictions Analysis:")
sample_size = min(5, len(val_true))
for i in range(sample_size):
    print(f"\nSample {i+1}:")
    print(f"   True labels: {val_true[i]}")
    print(f"   Predicted:   {val_binary[i]}")
    print(f"   Probabilities: {val_probs[i]}")
    print(f"   Match: {'✅' if np.array_equal(val_true[i], val_binary[i]) else '❌'}")

# Model performance summary
print(f"\n{'='*60}")
print("🏆 MODEL PERFORMANCE SUMMARY")
print(f"{'='*60}")
print(f"✅ Best Metric (F1-Micro): {detailed_metrics['f1_micro']:.4f}")
print(f"📊 Accuracy: {detailed_metrics['accuracy']:.4f}")
print(f"🔻 Hamming Loss: {detailed_metrics['hamming_loss']:.4f}")
print(f"🎯 Macro F1: {detailed_metrics['f1_macro']:.4f}")

if detailed_metrics['f1_micro'] > 0.7:
    print("🎉 Excellent performance! Model is ready for deployment.")
elif detailed_metrics['f1_micro'] > 0.5:
    print("👍 Good performance! Consider fine-tuning for better results.")
else:
    print("⚠️ Performance needs improvement. Consider:")
    print("   - More training epochs")
    print("   - Different learning rate")
    print("   - Data augmentation")
    print("   - Different model architecture")

print(f"\n💾 Model saved to: {training_args.output_dir}")
print("🚀 Training and evaluation completed successfully!")
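The cell above calls a `sigmoid` helper and thresholds the probabilities at 0.5, but the helper itself is not shown in this section. The sketch below is a pure-Python stand-in, assuming an elementwise logistic function, together with a hand-rolled micro and macro F1 on toy data. The function names and the toy logits are illustrative and are not the notebook's `comprehensive_evaluation`.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function for a single logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def binarize(logit_rows, threshold=0.5):
    """Turn per-label logits into 0/1 predictions."""
    return [[int(sigmoid(x) > threshold) for x in row] for row in logit_rows]


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the label never occurs or fires."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Micro F1 pools counts over all labels; macro F1 averages per-label F1."""
    num_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(num_labels)]  # [tp, fp, fn] per label
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    macro = sum(f1(*c) for c in counts) / num_labels
    return micro, macro


# Toy data: label 0 is frequent and easy, label 1 is rare and always missed.
logits = [[2.0, -3.0], [1.5, -2.0], [3.0, -1.0], [-2.0, -0.5]]
y_true = [[1, 0], [1, 0], [1, 1], [0, 1]]
y_pred = binarize(logits)
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.2f}, macro F1 = {macro:.2f}")
```

In this toy case the frequent label dominates the pooled counts, so micro F1 stays high while macro F1 is pulled down by the rare label. A large gap between the two on real data usually points to weak performance on infrequent classes.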


In [14]:
# 🔧 FIXED TOKENIZATION AND DATA FORMAT
# This section addresses the data format issues that cause training failures

import torch
import numpy as np

def preprocess_function(examples):
    """
    Proper tokenization function for multi-label classification.
    Ensures all outputs are compatible with HuggingFace Trainer.
    """
    # Handle batch vs single example
    if isinstance(examples['text'], str):
        texts = [examples['text']]
        labels = [examples['labels']]
    else:
        texts = examples['text']
        labels = examples['labels']
    
    # Tokenize the texts
    tokenized = tokenizer(
        texts,
        truncation=True,
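        padding=True,  # Pads to the longest sequence in each map() batch
        max_length=512,  # Adjust based on your model's limit
        return_tensors=None  # Return Python lists; the data collator builds tensors
    )
    
    # Convert labels to float lists for BCEWithLogitsLoss.
    # NOTE: the recorded output of this cell still reports int64 labels, so this
    # conversion alone did not change the stored dtype; the explicit cast_column
    # step in a later cell is what sets it to float32.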
    if isinstance(labels[0], (list, np.ndarray)):
        tokenized['labels'] = [np.array(label, dtype=np.float32).tolist() for label in labels]
    else:
        tokenized['labels'] = [np.array(labels, dtype=np.float32).tolist()]
    
    return tokenized

print("🔧 Re-tokenizing dataset with fixed function...")

# Apply the tokenization function
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=['text'],  # Remove the problematic text column
    desc="Tokenizing dataset"
)

# Verify the tokenized dataset structure
print("\n✅ Tokenized dataset verification:")
print(f"Features: {list(tokenized_dataset['train'].features.keys())}")

# Check a sample
sample = tokenized_dataset["train"][0]
print(f"\nSample structure:")
for key, value in sample.items():
    if isinstance(value, (list, np.ndarray)):
        value_info = f"List/Array of length {len(value)}, dtype: {type(value[0]) if value else 'empty'}"
        if key == 'labels':
            value_info += f", shape: {np.array(value).shape}, sum: {np.sum(value)}"
    else:
        value_info = f"Type: {type(value)}, Value: {value}"
    print(f"  {key}: {value_info}")

# Verify labels are float
sample_labels = np.array(sample['labels'])
print(f"\n🎯 Labels verification:")
print(f"  Labels dtype: {sample_labels.dtype}")
print(f"  Labels shape: {sample_labels.shape}")
print(f"  Expected shape: ({len(class_name)},)")
print(f"  Labels range: [{sample_labels.min():.1f}, {sample_labels.max():.1f}]")

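# np.array() over a Python list never yields float32, so test for any float dtype
if not np.issubdtype(sample_labels.dtype, np.floating):
    print("⚠️ Warning: Labels are not floating point, this may cause training issues")
else:
    print("✅ Labels are stored as floating-point values")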

print(f"\n📊 Dataset sizes after tokenization:")
for split_name, split_data in tokenized_dataset.items():
    print(f"  {split_name}: {len(split_data)} samples")

🔧 Re-tokenizing dataset with fixed function...


Tokenizing dataset: 100%|██████████| 11597/11597 [01:04<00:00, 178.84 examples/s]
Tokenizing dataset:   0%|          | 0/2485 [00:00<?, ? examples/s]
Tokenizing dataset: 100%|██████████| 2485/2485 [00:13<00:00, 181.60 examples/s]
Tokenizing dataset:   0%|          | 0/2486 [00:00<?, ? examples/s]
Tokenizing dataset: 100%|██████████| 2486/2486 [00:13<00:00, 184.39 examples/s]


✅ Tokenized dataset verification:
Features: ['labels', 'input_ids', 'attention_mask']

Sample structure:
  labels: List/Array of length 27, dtype: <class 'int'>, shape: (27,), sum: 1
  input_ids: List/Array of length 512, dtype: <class 'int'>
  attention_mask: List/Array of length 512, dtype: <class 'int'>

🎯 Labels verification:
  Labels dtype: int64
  Labels shape: (27,)
  Expected shape: (27,)
  Labels range: [0.0, 1.0]

📊 Dataset sizes after tokenization:
  train: 11597 samples
  val: 2485 samples
  test: 2486 samples
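Every sample above ends up with exactly 512 tokens, which follows from `truncation=True` with `max_length=512`. A rough sketch of how one might quantify what that truncation discards is shown below. `truncation_stats` is a hypothetical helper and the token counts are made up; in practice the lengths would come from tokenizing the raw texts without truncation.

```python
def truncation_stats(token_lengths, max_length):
    """Return (share of documents truncated, share of tokens discarded)."""
    truncated_docs = sum(1 for n in token_lengths if n > max_length)
    kept_tokens = sum(min(n, max_length) for n in token_lengths)
    total_tokens = sum(token_lengths)
    return truncated_docs / len(token_lengths), 1 - kept_tokens / total_tokens


# Hypothetical untruncated token counts for four documents.
lengths = [300, 512, 1024, 4096]
doc_share, token_share = truncation_stats(lengths, max_length=512)
print(f"documents truncated: {doc_share:.0%}")
print(f"tokens discarded:    {token_share:.1%}")
```

Running the same calculation for a few candidate values of `max_length` gives a quick view of the trade-off between context coverage and memory cost.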





In [None]:
# Enhanced Training Configuration for Multi-label Classification
training_args = TrainingArguments(
    # Output and logging
    output_dir="./model_output",
    logging_dir="./logs",
    logging_steps=50,
    logging_strategy="steps",
    
    # Learning parameters
    learning_rate=2e-5,
    lr_scheduler_type="linear",  # Linear decay
    warmup_ratio=0.1,  # 10% warmup
    weight_decay=0.01,
    
    # Batch sizes (adjust based on GPU memory)
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    gradient_accumulation_steps=4,  # Effective batch size = 3 * 4 = 12
    
    # Training epochs and evaluation
    num_train_epochs=3,  # Increased for better convergence
    eval_strategy="steps",  # More frequent evaluation
    eval_steps=100,  # Evaluate every 100 steps
    
    # Saving strategy
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,  # Keep at most 3 checkpoints (the best one is always retained)
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1_micro",  # Use micro F1 for model selection
    greater_is_better=True,
    
    # Early stopping is configured on the Trainer below via EarlyStoppingCallback;
    # TrainingArguments does not accept early_stopping_patience.
    
    # Memory and performance optimization
    dataloader_pin_memory=True,
    dataloader_num_workers=2,
    remove_unused_columns=False,  # Keep all columns for multi-label
    
    # Mixed precision for faster training (if GPU supports it)
    fp16=True,  # Enable if using compatible GPU
    
    # Reproducibility
    seed=42,
    data_seed=42,
    
    # Report metrics
    report_to="none",  # Disable wandb/tensorboard reporting
    run_name="multi_label_posture_classification",
)

# Initialize trainer with enhanced configuration
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["val"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    # Stop if eval_f1_micro does not improve for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

print("🚀 Starting training with enhanced configuration...")
print(f"📊 Training samples: {len(tokenized_dataset['train'])}")
print(f"📊 Validation samples: {len(tokenized_dataset['val'])}")
print(f"🎯 Target metric: {training_args.metric_for_best_model}")
print(f"⏱️ Total epochs: {training_args.num_train_epochs}")
print(f"🔄 Evaluation every: {training_args.eval_steps} steps")
print(f"💾 Saving every: {training_args.save_steps} steps")

# Start training
trainer.train()

  trainer = Trainer(
W0621 11:00:46.017000 1007029 torch/_inductor/utils.py:1250] [1/0] Not enough SMs to use max_autotune_gemm mode


RuntimeError: result type Float can't be cast to the desired output type Long
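Independently of the dtype error above, the schedule implied by the training configuration can be estimated with simple arithmetic. The sketch below assumes a single GPU and uses ceiling division throughout; the exact rounding inside `Trainer` may differ by a step or so between library versions, so the figures are approximate.

```python
import math

# Values copied from the notebook's TrainingArguments and dataset sizes.
train_samples = 11597
per_device_batch = 3
grad_accum = 4
epochs = 3
warmup_ratio = 0.1
eval_steps = 100
patience = 3  # evaluations without improvement before early stopping

effective_batch = per_device_batch * grad_accum
micro_batches_per_epoch = math.ceil(train_samples / per_device_batch)
updates_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum)
total_updates = updates_per_epoch * epochs
warmup_updates = math.ceil(total_updates * warmup_ratio)
evaluations = total_updates // eval_steps
patience_window = patience * eval_steps  # optimizer steps tolerated without gain

print(f"effective batch size:  {effective_batch}")
print(f"updates per epoch:     {updates_per_epoch}")
print(f"total updates:         {total_updates}")
print(f"warmup updates:        {warmup_updates}")
print(f"evaluations scheduled: {evaluations}")
print(f"early-stopping window: {patience_window} updates")
```

With roughly 29 evaluations over the run, a patience of 3 means training can stop after about a tenth of the schedule passes without an improvement in the selection metric.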

In [16]:
# 🔧 EXPLICIT LABEL TYPE CONVERSION
# Convert labels to float32 using HuggingFace datasets features

from datasets import Sequence, Value
import torch

print("🔄 Converting labels to float32 using datasets.cast_column...")

# Define the proper feature type for multi-label classification
# Labels should be a sequence of floats (one per class)
label_feature = Sequence(Value("float32"), length=len(class_name))

# Cast the labels column to float32 for all splits
for split_name in tokenized_dataset.keys():
    tokenized_dataset[split_name] = tokenized_dataset[split_name].cast_column("labels", label_feature)

# Verify the fix
print(f"\n✅ Labels conversion verification:")
sample = tokenized_dataset["train"][0]
sample_labels = np.array(sample['labels'])
print(f"  Labels dtype: {sample_labels.dtype}")
print(f"  Labels shape: {sample_labels.shape}")
print(f"  Sample labels: {sample['labels'][:5]}...")  # Show first 5 labels
print(f"  HF Feature type: {tokenized_dataset['train'].features['labels']}")

# Test tensor conversion
test_labels = torch.tensor(sample['labels'], dtype=torch.float32)
print(f"  PyTorch tensor dtype: {test_labels.dtype}")
print(f"  PyTorch tensor shape: {test_labels.shape}")

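# NOTE: np.array() over a Python list of floats defaults to float64, so this
# check fails even after a successful cast. The HF feature type printed above
# is the authoritative indicator of the stored dtype.
if sample_labels.dtype == np.float32: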
    print("✅ SUCCESS: Labels are now properly formatted as float32")
    print("🚀 Ready for training!")
else:
    print(f"❌ ISSUE: Labels are still {sample_labels.dtype}, expected float32")

🔄 Converting labels to float32 using datasets.cast_column...


Casting the dataset: 100%|██████████| 11597/11597 [00:00<00:00, 232166.06 examples/s]
Casting the dataset: 100%|██████████| 2485/2485 [00:00<00:00, 761458.61 examples/s]
Casting the dataset: 100%|██████████| 2486/2486 [00:00<00:00, 839536.21 examples/s]


✅ Labels conversion verification:
  Labels dtype: float64
  Labels shape: (27,)
  Sample labels: [0.0, 0.0, 0.0, 0.0, 0.0]...
  HF Feature type: Sequence(feature=Value(dtype='float32', id=None), length=27, id=None)
  PyTorch tensor dtype: torch.float32
  PyTorch tensor shape: torch.Size([27])
❌ ISSUE: Labels are still float64, expected float32
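The float cast matters because the multi-label setup trains with `BCEWithLogitsLoss`, which treats each target as a value in [0, 1] rather than a class index; the earlier `RuntimeError` about casting Float to Long is consistent with integer labels reaching that loss. The sketch below is a pure-Python illustration of the per-label computation in its numerically stable form, not PyTorch's implementation, and the logits and targets are toy values.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the stable form max(x, 0) - x * y + log(1 + exp(-|x|)),
    which avoids overflow for large positive or negative logits.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Toy example: three labels for one document, targets stored as floats.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
loss = bce_with_logits(logits, targets)
print(f"loss = {loss:.4f}")
```

Because the target `y` multiplies the logit directly, the formula is arithmetic on real numbers, which is why the labels are stored as floats rather than integer class ids.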





In [23]:
# 🎉 FINAL MODEL EVALUATION
# Comprehensive evaluation of the trained multi-label classification model

import torch
from sklearn.metrics import classification_report
import numpy as np

print("🔬 FINAL MODEL EVALUATION")
print("=" * 60)

# First, let's check the test set data types and fix if needed
print("🔍 Checking test set data types...")
test_sample = tokenized_dataset["test"][0]
test_labels = np.array(test_sample['labels'])
print(f"Test labels dtype: {test_labels.dtype}")

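# NOTE: np.array() over a Python list yields float64, so this branch always runs;
# re-casting is harmless when the column is already stored as float32.
if test_labels.dtype != np.float32: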
    print("⚠️ Test set labels need conversion, performing conversion...")
    # Re-apply the label conversion to test set
    from datasets import Sequence, Value
    label_feature = Sequence(Value("float32"), length=len(class_name))
    tokenized_dataset["test"] = tokenized_dataset["test"].cast_column("labels", label_feature)
    print("✅ Test set labels converted to float32")

# Use predict method instead of evaluate to avoid evaluation issues
print("📊 Generating predictions on test set...")
predictions = trainer.predict(tokenized_dataset["test"])

# Convert predictions to probabilities and binary predictions
y_pred_proba = torch.sigmoid(torch.tensor(predictions.predictions)).numpy()
y_pred = (y_pred_proba > 0.5).astype(int)
y_true = predictions.label_ids.astype(int)

print(f"Prediction shape: {y_pred.shape}")
print(f"True labels shape: {y_true.shape}")

# Calculate comprehensive metrics manually using our evaluation function
# Argument order: comprehensive_evaluation(y_true, y_pred_proba, y_pred_binary)
print("📊 Calculating comprehensive metrics...")
detailed_metrics = comprehensive_evaluation(y_true, y_pred_proba, y_pred_binary=y_pred)

print(f"\n🏆 TEST SET RESULTS:")
print(f"{'='*50}")

# Print all the comprehensive metrics
metric_groups = {
    "📈 Primary Metrics": ["f1_micro", "f1_macro", "f1_weighted", "f1_samples"],
    "🎯 Precision": ["precision_micro", "precision_macro", "precision_weighted", "precision_samples"],
    "🔍 Recall": ["recall_micro", "recall_macro", "recall_weighted", "recall_samples"],
    "📊 Other Metrics": ["accuracy", "hamming_loss", "jaccard_samples", "jaccard_macro", "jaccard_weighted"],
    "📡 AUC Metrics": ["roc_auc_macro", "roc_auc_weighted", "roc_auc_samples", 
                       "pr_auc_macro", "pr_auc_weighted", "pr_auc_samples"]
}

for group_name, metrics in metric_groups.items():
    print(f"\n{group_name}:")
    for metric in metrics:
        if metric in detailed_metrics:
            print(f"  {metric.upper()}: {detailed_metrics[metric]:.4f}")

# Per-class performance
print(f"\n📋 PER-CLASS PERFORMANCE:")
print(f"{'='*50}")
class_report = classification_report(
    y_true, y_pred, 
    target_names=class_name, 
    output_dict=True,
    zero_division=0
)

# Show performance for each class
for class_label in class_name:
    if class_label in class_report:
        metrics = class_report[class_label]
        support = int(metrics['support'])
        print(f"{class_label:30s} | P: {metrics['precision']:.3f} | R: {metrics['recall']:.3f} | F1: {metrics['f1-score']:.3f} | Support: {support:4d}")

# Overall summary
print(f"\n🎯 OVERALL PERFORMANCE SUMMARY:")
print(f"{'='*50}")
macro_avg = class_report['macro avg']
weighted_avg = class_report['weighted avg']

print(f"🔹 Macro Average    | P: {macro_avg['precision']:.3f} | R: {macro_avg['recall']:.3f} | F1: {macro_avg['f1-score']:.3f}")
print(f"🔹 Weighted Average | P: {weighted_avg['precision']:.3f} | R: {weighted_avg['recall']:.3f} | F1: {weighted_avg['f1-score']:.3f}")

# Performance assessment
f1_micro = detailed_metrics.get('f1_micro', 0)
print(f"\n🏆 FINAL ASSESSMENT:")
print(f"{'='*50}")
if f1_micro > 0.8:
    assessment = "🌟 EXCELLENT! Model shows outstanding performance."
elif f1_micro > 0.7:
    assessment = "✅ VERY GOOD! Model performance is strong and ready for deployment."
elif f1_micro > 0.6:
    assessment = "👍 GOOD! Model shows solid performance with room for improvement."
elif f1_micro > 0.5:
    assessment = "⚠️ MODERATE! Consider additional training or data improvements."
else:
    assessment = "❌ NEEDS IMPROVEMENT! Significant enhancements required."

print(f"Micro F1 Score: {f1_micro:.4f}")
print(f"Assessment: {assessment}")

print(f"\n💾 Model and results saved to: {training_args.output_dir}")
print(f"🎉 Multi-label legal posture classification training completed successfully!")

# Save the best model explicitly
print(f"\n💾 Saving final model...")
trainer.save_model(f"{training_args.output_dir}/final_model")
tokenizer.save_pretrained(f"{training_args.output_dir}/final_model")
print(f"✅ Final model saved to: {training_args.output_dir}/final_model")

🔬 FINAL MODEL EVALUATION
🔍 Checking test set data types...
Test labels dtype: float64
⚠️ Test set labels need conversion, performing conversion...


Casting the dataset: 100%|██████████| 2486/2486 [00:00<00:00, 141437.29 examples/s]

✅ Test set labels converted to float32
📊 Generating predictions on test set...





Prediction shape: (2486, 27)
True labels shape: (2486, 27)
📊 Calculating comprehensive metrics...

🏆 TEST SET RESULTS:

📈 Primary Metrics:
  F1_MICRO: 0.8166
  F1_MACRO: 0.4383
  F1_WEIGHTED: 0.7839
  F1_SAMPLES: 0.8150

🎯 Precision:
  PRECISION_MICRO: 0.8602
  PRECISION_MACRO: 0.5993
  PRECISION_WEIGHTED: 0.8275
  PRECISION_SAMPLES: 0.8644

🔍 Recall:
  RECALL_MICRO: 0.7772
  RECALL_MACRO: 0.4058
  RECALL_WEIGHTED: 0.7772
  RECALL_SAMPLES: 0.8182

📊 Other Metrics:
  ACCURACY: 0.6010
  HAMMING_LOSS: 0.0197
  JACCARD_SAMPLES: 0.7609
  JACCARD_MACRO: 0.3490
  JACCARD_WEIGHTED: 0.7040

📡 AUC Metrics:
  ROC_AUC_MACRO: 0.9469
  ROC_AUC_WEIGHTED: 0.9684
  ROC_AUC_SAMPLES: 0.9853
  PR_AUC_MACRO: 0.5811
  PR_AUC_WEIGHTED: 0.8353
  PR_AUC_SAMPLES: 0.9286

📋 PER-CLASS PERFORMANCE:
Appellate Review               | P: 0.915 | R: 0.973 | F1: 0.943 | Support:  663
Juvenile Delinquency Proceeding | P: 0.944 | R: 0.895 | F1: 0.919 | Support:   19
Motion for Attorney's Fees     | P: 0.724 | R: 0.647 | F