# Project Overview and Current Focus

## What We've Accomplished So Far

- **Data Preparation:** Loaded CFPB consumer complaint data, performed cleaning, preprocessing, and feature engineering of complaint narratives.
- **Sampling:** Created a 20k sample demo dataset, initially with original class imbalance, later re-sampled to ensure balanced class representation for robust model evaluation.
- **Text Representation:** Tokenized complaint narratives and generated padded sequences (max length = 200) using Keras Tokenizer.
- **Embeddings:** Downloaded and aligned pre-trained FastText word vectors with the vocabulary for these complaints, resulting in a complete embedding matrix.
- **Classic and Deep Learning Models:** Built and compared a variety of models for product classification:
    - **Feedforward Neural Network:** Performed poorly/underfit, confirming the need for sequential/contextual architectures.
    - **BiLSTM/CNN (with FastText):** Achieved moderate-to-good accuracy on majority classes, but struggled with rare classes due to imbalance.
    - **Class-Weighted Models:** Tried class weights to address imbalance; this improved recall for minority classes but reduced overall accuracy and precision.
- **Evaluation & Interpretation:** Performed detailed error analysis, learning curve interpretation, and confusion matrix breakdown to understand each model's strengths and limitations.

## What We Are Doing Next

- **Goal:** Advance to state-of-the-art deep learning by implementing two modern NLP approaches:
    1. **BiLSTM + Attention:** Add a custom Attention layer on top of BiLSTM using balanced data and FastText embeddings to improve focus on relevant tokens and interpretability.
    2. **Transformer Fine-Tuning:** Fine-tune a pre-trained language model (RoBERTa/DistilBERT) on our product classification task for SOTA performance.

- **Objectives:**
    - Compare attention-enhanced BiLSTM vs vanilla BiLSTM performance on balanced data.
    - Demonstrate modern transformer fine-tuning skills and achieve best possible classification metrics.
    - Evaluate interpretability through attention weights and model comparison across all architectures.
    - Create comprehensive model comparison showcasing progression from classic ML to modern transformers.

- **Context:** This completes our NLP pipeline progression from traditional embeddings through deep learning to transformer models, demonstrating full-stack data science and modern NLP expertise for portfolio/interview purposes.
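The attention pooling named in approach 1 above can be illustrated outside Keras. The following is a minimal NumPy sketch of the kind of pooling the custom layer later in this notebook performs: score each timestep with `tanh(h · w)`, apply a softmax over time, and return the weighted sum. The `mask` argument is an assumption added for illustration (the notebook's layer does not mask padding), and `attention_pool` and its parameter names are hypothetical.

```python
import numpy as np

def attention_pool(h, w, mask=None):
    """Attention-weighted sum over the time axis.

    h:    (batch, time, features) sequence of BiLSTM outputs
    w:    (features, 1) scoring vector (learned in the real layer)
    mask: optional (batch, time) boolean array, True for real tokens
          (an assumption; not part of the notebook's layer)
    """
    e = np.tanh(h @ w)                      # scores, shape (batch, time, 1)
    if mask is not None:
        # Push padded positions to a very negative score so softmax gives ~0.
        e = np.where(mask[..., None], e, -1e9)
    e = e - e.max(axis=1, keepdims=True)    # numerical stability
    a = np.exp(e)
    a = a / a.sum(axis=1, keepdims=True)    # softmax over the time axis
    context = (h * a).sum(axis=1)           # (batch, features)
    return context, a[..., 0]

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 5, 4))              # toy batch: 2 sequences, 5 steps
w = rng.normal(size=(4, 1))
mask = np.array([[True, True, True, False, False],
                 [True, True, True, True, True]])
context, weights = attention_pool(h, w, mask)
```

The returned `weights` are what one would inspect for interpretability: each row sums to 1 and shows how much each timestep contributed to the pooled vector.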


In [None]:
%pip install gensim



In [None]:

from google.colab import drive
import os
import pandas as pd
drive.mount('/content/drive')
load_path = '/content/drive/MyDrive/Data Science course/Major Projects/Projects/Smart Support NLP - Major'

cleaned_data = pd.read_parquet(os.path.join(load_path, 'cleaned_data.parquet'))

# SAMPLE ONLY 20k
demo_data = cleaned_data.sample(20000, random_state=42).reset_index(drop=True)

demo_data.to_csv(os.path.join(load_path, 'demo_data_20k.csv'), index=False)

import zipfile, gensim
ft_zip = os.path.join(load_path, 'embeddings/fasttext-wiki-news-subwords-300.kv.zip')
ft_extracted_path = os.path.join(load_path, 'embeddings')
ft_file = os.path.join(ft_extracted_path, 'fasttext-wiki-news-subwords-300.kv')

if os.path.exists(ft_zip):
    if not os.path.exists(ft_file):
        print(f"Extracting {os.path.basename(ft_zip)} to {ft_extracted_path}...")
        with zipfile.ZipFile(ft_zip, 'r') as zipf:
            zipf.extractall(ft_extracted_path)
        print("Extraction complete.")
    else:
        print(f"FastText model already extracted to {ft_extracted_path}. Skipping extraction.")

    try:
        ft_model = gensim.models.KeyedVectors.load(ft_file, mmap='r') # Load from the correct path
        print(f"FastText model loaded from {ft_file}.")
    except Exception as e:
        print(f"Error loading FastText model from {ft_file}: {e}")
        # Handle the error or exit if the model is essential
else:
    print(f"Error: Zip file not found at {ft_zip}")
    # Handle the error or exit if the zip file is essential


# Tokenizer and sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(demo_data['cleaned_narrative'])
sequences = tokenizer.texts_to_sequences(demo_data['cleaned_narrative'])
max_len = 200
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')

# Embedding matrix
import numpy as np
import joblib # Import joblib for saving the tokenizer

if 'ft_model' in locals(): # Proceed only if the model was loaded successfully
    embedding_dim = ft_model.vector_size  # 300
    vocab_size = len(tokenizer.word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype='float32')
    oov_count = 0
    for word, i in tokenizer.word_index.items():
        if i >= vocab_size:
            continue
        try:
            embedding_matrix[i] = ft_model[word]
        except KeyError:
            oov_count += 1

    print(f"Embedding matrix created: shape={embedding_matrix.shape}, OOV words={oov_count}/{vocab_size-1} ({oov_count/(vocab_size-1)*100:.2f}%)")

    # SAVE ARTIFACTS
    np.save(os.path.join(load_path, 'fasttext_embedding_matrix_20k.npy'), embedding_matrix)
    joblib.dump(tokenizer, os.path.join(load_path, 'tokenizer_fasttext_20k.joblib'))
    print("20k embedding matrix and tokenizer saved.")
else:
    print("FastText model not loaded. Skipping embedding matrix creation and saving.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
FastText model already extracted to /content/drive/MyDrive/Data Science course/Major Projects/Projects/Smart Support NLP - Major/embeddings. Skipping extraction.
FastText model loaded from /content/drive/MyDrive/Data Science course/Major Projects/Projects/Smart Support NLP - Major/embeddings/fasttext-wiki-news-subwords-300.kv.
Embedding matrix created: shape=(23324, 300), OOV words=5716/23323 (24.51%)
20k embedding matrix and tokenizer saved.
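
The 24.51% OOV figure above is measured over all 23,323 words in `tokenizer.word_index`, while `Tokenizer(num_words=10000)` only emits indices below 10000 in the sequences. A rough sketch of measuring coverage over just the indices a model can actually see follows. `oov_rate`, the toy `word_index`, and the `known` set are hypothetical stand-ins for the tokenizer and the FastText vocabulary.

```python
def oov_rate(word_index, known_words, num_words=None):
    """Fraction of vocabulary words without a pre-trained vector.

    word_index:  dict mapping word -> rank-based index (1 = most frequent)
    known_words: set of words that have a pre-trained vector
    num_words:   if given, only indices below this value are counted,
                 mirroring how the Keras Tokenizer caps emitted indices
    """
    words = [w for w, i in word_index.items()
             if num_words is None or i < num_words]
    missing = sum(1 for w in words if w not in known_words)
    return missing / len(words)

# Toy data (hypothetical): rarer words are more likely to be OOV.
word_index = {"<OOV>": 1, "credit": 2, "loan": 3, "xxxx": 4, "acct": 5}
known = {"credit", "loan"}

full_rate = oov_rate(word_index, known)                 # all 5 entries
capped_rate = oov_rate(word_index, known, num_words=4)  # indices 1..3 only
print(f"full vocabulary: {full_rate:.2%}, capped: {capped_rate:.2%}")
```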


In [None]:
# Target Encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(demo_data['Product'])

# Convert to one-hot encoding
y_categorical = to_categorical(y_encoded)

# Info
print(f"Number of product classes: {len(label_encoder.classes_)}")
print(f"Sample encoded labels: {y_encoded[:10]}")
print(f"One-hot shape: {y_categorical.shape}")

# class distribution
class_dist = pd.Series(y_encoded).value_counts().sort_index()
for idx, count in class_dist.items():
    print(f"{label_encoder.classes_[idx]}: {count}")

Number of product classes: 18
Sample encoded labels: [15  5  5  3  6  6  3  7  6  6]
One-hot shape: (20000, 18)
Bank account or service: 748
Checking or savings account: 680
Consumer Loan: 498
Credit card: 942
Credit card or prepaid card: 1174
Credit reporting: 1634
Credit reporting, credit repair services, or other personal consumer reports: 4845
Debt collection: 4442
Money transfer, virtual currency, or money service: 293
Money transfers: 86
Mortgage: 2804
Other financial service: 13
Payday loan: 101
Payday loan, title loan, or personal loan: 225
Prepaid card: 72
Student loan: 1137
Vehicle loan or lease: 305
Virtual currency: 1
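
The distribution above is heavily skewed, so raw accuracy needs a reference point. The sketch below uses only the counts printed above to compute the accuracy of always predicting the largest class and the ratio between the largest and smallest classes.

```python
# Class counts copied from the output above.
counts = {
    "Bank account or service": 748,
    "Checking or savings account": 680,
    "Consumer Loan": 498,
    "Credit card": 942,
    "Credit card or prepaid card": 1174,
    "Credit reporting": 1634,
    "Credit reporting, credit repair services, or other personal consumer reports": 4845,
    "Debt collection": 4442,
    "Money transfer, virtual currency, or money service": 293,
    "Money transfers": 86,
    "Mortgage": 2804,
    "Other financial service": 13,
    "Payday loan": 101,
    "Payday loan, title loan, or personal loan": 225,
    "Prepaid card": 72,
    "Student loan": 1137,
    "Vehicle loan or lease": 305,
    "Virtual currency": 1,
}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
majority_baseline = counts[majority_class] / total        # always-predict-majority accuracy
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"Total samples: {total}")
print(f"Majority class: {majority_class}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
print(f"Largest/smallest class ratio: {imbalance_ratio:.0f}")
```

Any model whose test accuracy is close to the baseline is doing little better than predicting the most common product.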


In [None]:
# Splitting the data
from sklearn.model_selection import train_test_split
import numpy as np # Import numpy

# Find the index of the "Virtual currency" class
virtual_currency_index = label_encoder.transform(['Virtual currency'])[0]

# Filter out samples belonging to the "Virtual currency" class
filtered_indices = np.where(y_encoded != virtual_currency_index)[0]
X_filtered = padded_sequences[filtered_indices]
y_filtered = y_categorical[filtered_indices]
y_encoded_filtered = y_encoded[filtered_indices]


# First split: separate test set (20%) from filtered data
X_temp, X_test, y_temp, y_test, y_temp_encoded, y_test_encoded = train_test_split(
    X_filtered, y_filtered, y_encoded_filtered,
    test_size=0.2,
    random_state=42,
    stratify=y_encoded_filtered # Stratify based on filtered encoded labels
)

X_train, X_val, y_train, y_val, y_train_encoded, y_val_encoded = train_test_split(
    X_temp, y_temp, y_temp_encoded,
    test_size=0.1875, # (0.15 / 0.80) of original data
    random_state=42,
    stratify=y_temp_encoded # Stratify based on encoded labels of the temporary set
)


# Display split info
print(f"Training set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Total: {X_train.shape[0] + X_val.shape[0] + X_test.shape[0]}")

# Verify shapes
print(f"\nX_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"Number of classes: {y_train.shape[1]}")

Training set: 12999 samples
Validation set: 3000 samples
Test set: 4000 samples
Total: 19999

X_train shape: (12999, 200)
y_train shape: (12999, 18)
Number of classes: 18
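
The split sizes above follow from the two-stage split: 20% is held out for testing, then 18.75% of the remaining 80% (0.1875 × 0.80 = 0.15) becomes validation, giving roughly 65/15/20. A sketch of that arithmetic follows, assuming the held-out portion is rounded up as scikit-learn's `train_test_split` does for a float `test_size`. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n, test_frac, val_frac_of_rest):
    """Sizes of a train/val/test split done in two stages."""
    n_test = math.ceil(n * test_frac)            # first split: hold out test
    n_rest = n - n_test
    n_val = math.ceil(n_rest * val_frac_of_rest) # second split: hold out validation
    return n_rest - n_val, n_val, n_test

# 19,999 samples remain after dropping the single "Virtual currency" row.
train_n, val_n, test_n = split_sizes(19999, 0.2, 0.1875)
print(train_n, val_n, test_n)
```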


In [None]:
# Custom Attention layer
from tensorflow.keras.layers import Layer, Input
import tensorflow as tf
from sklearn.metrics import accuracy_score

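class Attention(Layer):
    """Attention pooling over the time axis of a (batch, time, features) tensor."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def build(self, input_shape):
        # One learnable scoring vector that maps each timestep's features to a scalar.
        self.W = self.add_weight(name='att_weight',
                                 shape=(input_shape[-1], 1),
                                 initializer='normal',
                                 trainable=True)
        super().build(input_shape)

    def call(self, x):
        # Score each timestep, normalize the scores over time, return the weighted sum.
        # Note: padded timesteps are not masked, so they also receive attention weight.
        e = tf.keras.backend.tanh(tf.keras.backend.dot(x, self.W))
        a = tf.keras.backend.softmax(e, axis=1)
        output = x * a
        return tf.keras.backend.sum(output, axis=1)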

# Calculate Class Weights
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

y_train_int = np.argmax(y_train, axis=1)
class_weights_arr = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train_int),
    y=y_train_int
)
class_weight_dict = dict(enumerate(class_weights_arr))

print("Class weights dictionary:")
for k, v in class_weight_dict.items():
    print(f"Class {k} ({label_encoder.classes_[k]}): {v:.3f}")

# Model Construction
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

input_seq = Input(shape = (X_train.shape[1],))
embedding_layer = Embedding(
    input_dim = embedding_matrix.shape[0],
    output_dim = embedding_matrix.shape[1],
    weights = [embedding_matrix],
    trainable = False
)(input_seq)

bilstm_out = Bidirectional(LSTM(64, return_sequences=True))(embedding_layer)
attn_out = Attention()(bilstm_out)
drop1 = Dropout(0.4)(attn_out)
dense = Dense(64, activation = 'relu')(drop1)
drop2 = Dropout(0.3)(dense)
output = Dense(y_train.shape[1], activation = 'softmax')(drop2)

model = Model(inputs = input_seq, outputs = output)
model.compile(
    loss = 'categorical_crossentropy',
    optimizer = 'adam',
    metrics = ['accuracy']
)

print("\nModel Architecture:")
model.summary()

# Model Training
print('\nStarting training with class weights...')
history = model.fit(
    X_train, y_train,
    epochs = 8,
    batch_size = 128,
    validation_data = (X_val, y_val),
    class_weight = class_weight_dict,
    verbose = 1
)

# Model Evaluation
from sklearn.metrics import classification_report, confusion_matrix

print("\nEvaluating model on test set...")
y_test_pred_prob = model.predict(X_test)
y_test_pred = np.argmax(y_test_pred_prob, axis=1)
y_test_true = np.argmax(y_test, axis=1)

# Create a list of target names excluding 'Virtual currency'
target_names_filtered = [name for name in label_encoder.classes_ if name != 'Virtual currency']

print(f"\nTest Accuracy: {accuracy_score(y_test_true, y_test_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_true, y_test_pred, target_names=target_names_filtered))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test_true, y_test_pred)
print(cm)

# --- Save Model ---
model.save('bilstm_attention_model.h5')
print("\nModel saved as 'bilstm_attention_model.h5'")

Class weights dictionary:
Class 0 (Bank account or service): 1.573
Class 1 (Checking or savings account): 1.730
Class 2 (Consumer Loan): 2.367
Class 3 (Credit card): 1.247
Class 4 (Credit card or prepaid card): 1.002
Class 5 (Credit reporting): 0.720
Class 6 (Credit reporting, credit repair services, or other personal consumer reports): 0.243
Class 7 (Debt collection): 0.265
Class 8 (Money transfer, virtual currency, or money service): 4.024
Class 9 (Money transfers): 13.654
Class 10 (Mortgage): 0.419
Class 11 (Other financial service): 95.581
Class 12 (Payday loan): 11.586
Class 13 (Payday loan, title loan, or personal loan): 5.237
Class 14 (Prepaid card): 16.269
Class 15 (Student loan): 1.035
Class 16 (Vehicle loan or lease): 3.862

Model Architecture:



Starting training with class weights...
Epoch 1/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 6s 39ms/step - accuracy: 0.0760 - loss: 2.8154 - val_accuracy: 0.2553 - val_loss: 2.7433
Epoch 2/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - accuracy: 0.1312 - loss: 2.8263 - val_accuracy: 0.1800 - val_loss: 2.5903
Epoch 3/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - accuracy: 0.2229 - loss: 2.6509 - val_accuracy: 0.1973 - val_loss: 2.5138
Epoch 4/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - accuracy: 0.2383 - loss: 2.6525 - val_accuracy: 0.2057 - val_loss: 2.1387
Epoch 5/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 4s 34ms/step - accuracy: 0.2555 - loss: 2.3972 - val_accuracy: 0.3230 - val_loss: 2.1474
Epoch 6/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - accuracy: 0.2672 - loss: 2.2660 - val_accuracy: 0.2693 - val_




Test Accuracy: 0.3255

Classification Report:
                                                                              precision    recall  f1-score   support

                                                     Bank account or service       0.50      0.01      0.03       150
                                                 Checking or savings account       0.30      0.52      0.38       136
                                                               Consumer Loan       0.08      0.12      0.09       100
                                                                 Credit card       0.31      0.09      0.13       188
                                                 Credit card or prepaid card       0.30      0.23      0.26       235
                                                            Credit reporting       0.19      0.74      0.30       327
Credit reporting, credit repair services, or other personal consumer reports       0.34      0.04      0.07       969
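
The weights printed in the class weights dictionary above come from scikit-learn's `balanced` heuristic, `n_samples / (n_classes * count_c)`. A minimal pure-Python sketch of that formula follows. `balanced_class_weights` is a hypothetical helper, not the library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}

# Toy labels: class 0 appears three times, class 1 once.
toy_weights = balanced_class_weights([0, 0, 0, 1])
print(toy_weights)

# With 12,999 training rows and 17 classes, a class seen 8 times would get
# 12999 / (17 * 8), which matches the 95.581 printed for class 11.
# The count of 8 is an inference; it is not shown in the output.
rare_weight = 12999 / (17 * 8)
print(f"{rare_weight:.3f}")
```

Weights of this size mean a single rare-class example can outweigh hundreds of majority-class examples in the loss, which is one plausible reason the weighted run trains less stably.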
        

In [None]:
# Model Construction (without class weights)
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout, Input, Layer
import tensorflow as tf
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Re-define the Attention layer if it's not globally available (or ensure it is)
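class Attention(Layer):
    """Attention pooling over the time axis of a (batch, time, features) tensor."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def build(self, input_shape):
        # One learnable scoring vector that maps each timestep's features to a scalar.
        self.W = self.add_weight(name='att_weight',
                                 shape=(input_shape[-1], 1),
                                 initializer='normal',
                                 trainable=True)
        super().build(input_shape)

    def call(self, x):
        # Score each timestep, normalize the scores over time, return the weighted sum.
        # Note: padded timesteps are not masked, so they also receive attention weight.
        e = tf.keras.backend.tanh(tf.keras.backend.dot(x, self.W))
        a = tf.keras.backend.softmax(e, axis=1)
        output = x * a
        return tf.keras.backend.sum(output, axis=1)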


input_seq = Input(shape = (X_train.shape[1],))
embedding_layer = Embedding(
    input_dim = embedding_matrix.shape[0],
    output_dim = embedding_matrix.shape[1],
    weights = [embedding_matrix],
    trainable = False
)(input_seq)

bilstm_out = Bidirectional(LSTM(64, return_sequences=True))(embedding_layer)
attn_out = Attention()(bilstm_out)
drop1 = Dropout(0.4)(attn_out)
dense = Dense(64, activation = 'relu')(drop1)
drop2 = Dropout(0.3)(dense)
output = Dense(y_train.shape[1], activation = 'softmax')(drop2)

model_no_weights = Model(inputs = input_seq, outputs = output)
model_no_weights.compile(
    loss = 'categorical_crossentropy',
    optimizer = 'adam',
    metrics = ['accuracy']
)

print("\nModel Architecture (without class weights):")
model_no_weights.summary()

# Model Training (without class weights)
print('\nStarting training without class weights...')
history_no_weights = model_no_weights.fit(
    X_train, y_train,
    epochs = 8,
    batch_size = 128,
    validation_data = (X_val, y_val),
    verbose = 1  # no class_weight argument in this run
)

# Model Evaluation (without class weights)
print("\nEvaluating model without class weights on test set...")
y_test_pred_prob_no_weights = model_no_weights.predict(X_test)
y_test_pred_no_weights = np.argmax(y_test_pred_prob_no_weights, axis=1)
y_test_true = np.argmax(y_test, axis=1)

# Create a list of target names excluding 'Virtual currency'
target_names_filtered = [name for name in label_encoder.classes_ if name != 'Virtual currency']

print(f"\nTest Accuracy (without class weights): {accuracy_score(y_test_true, y_test_pred_no_weights):.4f}")
print("\nClassification Report (without class weights):")
print(classification_report(y_test_true, y_test_pred_no_weights, target_names=target_names_filtered))

print("\nConfusion Matrix (without class weights):")
cm_no_weights = confusion_matrix(y_test_true, y_test_pred_no_weights)
print(cm_no_weights)

# --- Save Model ---
model_no_weights.save('bilstm_attention_model_no_weights.h5')
print("\nModel saved as 'bilstm_attention_model_no_weights.h5'")


Model Architecture (without class weights):



Starting training without class weights...
Epoch 1/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 7s 45ms/step - accuracy: 0.2173 - loss: 2.5307 - val_accuracy: 0.3607 - val_loss: 2.0238
Epoch 2/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 4s 37ms/step - accuracy: 0.3362 - loss: 2.0861 - val_accuracy: 0.3417 - val_loss: 1.9541
Epoch 3/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - accuracy: 0.3345 - loss: 1.9266 - val_accuracy: 0.4260 - val_loss: 1.7097
Epoch 4/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - accuracy: 0.4402 - loss: 1.6962 - val_accuracy: 0.5033 - val_loss: 1.5365
Epoch 5/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 4s 36ms/step - accuracy: 0.4948 - loss: 1.5377 - val_accuracy: 0.5340 - val_loss: 1.4425
Epoch 6/8
102/102 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - accuracy: 0.5162 - loss: 1.4870 - val_accuracy: 0.5563 - v




Test Accuracy (without class weights): 0.5308

Classification Report (without class weights):
                                                                              precision    recall  f1-score   support

                                                     Bank account or service       0.28      0.38      0.32       150
                                                 Checking or savings account       0.36      0.17      0.23       136
                                                               Consumer Loan       0.00      0.00      0.00       100
                                                                 Credit card       0.62      0.03      0.05       188
                                                 Credit card or prepaid card       0.39      0.33      0.36       235
                                                            Credit reporting       0.00      0.00      0.00       327
Credit reporting, credit repair services, or other personal consumer reports  

## Project Progress Summary

This notebook documents how we built and evaluated models for classifying consumer complaints by product. After the initial data preparation and exploration, we compared several modeling approaches:

*   We started with **classic and deep learning models**, including a Feedforward Neural Network and a BiLSTM/CNN, using FastText embeddings. These models provided a baseline and exposed how hard it is to classify imbalanced text data.
*   We then moved to a **BiLSTM with a custom Attention layer**, trained with and without class weights to measure their effect, particularly on less frequent classes. On the test set, the class-weighted model reached 0.3255 accuracy and the unweighted model reached 0.5308, so class weights cost overall accuracy in this setup.

With these baselines in place, the next phase **fine-tunes a pre-trained transformer model, DistilBERT**, to test whether its pre-trained language understanding improves classification metrics on our dataset.

In [None]:
from google.colab import drive
import os
import pandas as pd

drive.mount('/content/drive')
load_path = '/content/drive/MyDrive/Data Science course/Major Projects/Projects/Smart Support NLP - Major'

if os.path.exists(os.path.join(load_path, 'demo_data_20k.csv')):
    demo_data = pd.read_csv(os.path.join(load_path, 'demo_data_20k.csv'))
    print(f"Loaded existing demo_data_20k.csv: {demo_data.shape}")
else:
    cleaned_data = pd.read_parquet(os.path.join(load_path, 'cleaned_data.parquet'))
    demo_data = cleaned_data.sample(20000, random_state=42).reset_index(drop=True)
    demo_data.to_csv(os.path.join(load_path, 'demo_data_20k.csv'), index=False)
    print(f"Created demo_data_20k.csv: {demo_data.shape}")

print(f"Columns: {list(demo_data.columns)}")
print(f"Product classes: {demo_data['Product'].nunique()}")

Mounted at /content/drive
Loaded existing demo_data_20k.csv: (20000, 20)
Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID', 'narrative_length', 'cleaned_narrative']
Product classes: 18


In [None]:
!pip install -q transformers datasets accelerate torch

import os
import numpy as np
import pandas as pd
import torch
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback
)
from datasets import Dataset, DatasetDict

In [None]:
# 1. Load data
print("1. Loading data ....")
demo_data = pd.read_csv('/content/drive/MyDrive/Data Science course/Major Projects/Projects/Smart Support NLP - Major/demo_data_20k.csv')
print(f"Data shape: {demo_data.shape}")


1. Loading data ....
Data shape: (20000, 20)


In [None]:
# # 2. Subsample and Filter
# print("2. Subsampling and Filtering data ....")
# grouped = demo_data.groupby('Product', group_keys=False)
# # Subsample, then filter out groups with less than 2 samples for stratification
# subset = grouped.apply(lambda x: x.sample(min(len(x), 200), random_state=42)).groupby('Product').filter(lambda x: len(x) >= 2)
# print(f"Subset shape after filtering: {subset.shape}")
# print(f"Product classes in subset: {subset['Product'].nunique()}")

# Use the full demo_data, since subsampling left too little data
subset = demo_data
print(f"Using full data: {subset.shape}")
print(f"Product classes in data: {subset['Product'].nunique()}")

# Filter out classes with fewer than 2 samples so that stratified splitting works
product_counts = subset['Product'].value_counts()
classes_to_keep = product_counts[product_counts >= 2].index
subset = subset[subset['Product'].isin(classes_to_keep)]
print(f"Subset shape after filtering for stratification: {subset.shape}")
print(f"Product classes in subset after filtering for stratification: {subset['Product'].nunique()}")

Using full data: (20000, 20)
Product classes in data: 18
Subset shape after filtering for stratification: (19999, 20)
Product classes in subset after filtering for stratification: 17


In [None]:
# 3. Split data
print("3. Splitting data ....")
# Stratify based on the filtered subset['Product']
train_df, temp_df = train_test_split(subset, test_size=0.2, stratify=subset['Product'], random_state=42)
# Stratify the second split based on the temporary dataframe's product column
val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df['Product'], random_state=42)
print(f"Train: {train_df.shape} | Val: {val_df.shape} | Test: {test_df.shape}")


3. Splitting data ....
Train: (15999, 20) | Val: (2000, 20) | Test: (2000, 20)


In [None]:
# 4. Label encoding
print("4. Label encoding ....")
label_encoder = LabelEncoder()
# Fit the encoder on the product names present in the filtered subset
label_encoder.fit(subset['Product'])

# Transform the 'Product' column to numerical labels for all dataframes.
# assign() returns new DataFrames, which avoids pandas' SettingWithCopyWarning.
train_df = train_df.assign(label=label_encoder.transform(train_df['Product']))
val_df = val_df.assign(label=label_encoder.transform(val_df['Product']))
test_df = test_df.assign(label=label_encoder.transform(test_df['Product']))
num_labels = len(label_encoder.classes_)
print(f"Number of classes for training: {num_labels}")
print(f"Label classes for training: {list(label_encoder.classes_)}")

4. Label encoding ....
Number of classes for training: 17
Label classes for training: ['Bank account or service', 'Checking or savings account', 'Consumer Loan', 'Credit card', 'Credit card or prepaid card', 'Credit reporting', 'Credit reporting, credit repair services, or other personal consumer reports', 'Debt collection', 'Money transfer, virtual currency, or money service', 'Money transfers', 'Mortgage', 'Other financial service', 'Payday loan', 'Payday loan, title loan, or personal loan', 'Prepaid card', 'Student loan', 'Vehicle loan or lease']


In [None]:
# 5. Tokenization
print("Initializing tokenizer ....")
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
def tokenize_function(batch):
    if 'cleaned_narrative' not in batch:
        raise ValueError("Batch does not contain 'cleaned_narrative' column.")
    return tokenizer(batch['cleaned_narrative'], truncation=True, padding='max_length', max_length=128)

print("Tokenizing datasets ....")
# Create Dataset objects from pandas DataFrames, including the 'label' column
train_ds = Dataset.from_pandas(train_df[['cleaned_narrative', 'label']])
val_ds = Dataset.from_pandas(val_df[['cleaned_narrative', 'label']])
test_ds = Dataset.from_pandas(test_df[['cleaned_narrative', 'label']])

# Map the tokenization function over the datasets
train_ds = train_ds.map(tokenize_function, batched=True)
val_ds = val_ds.map(tokenize_function, batched=True)
test_ds = test_ds.map(tokenize_function, batched=True)

# Set the format to PyTorch tensors, specifying the columns to keep
train_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
val_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
print("Tokenization done.")


Initializing tokenizer ....
Tokenizing datasets ....



Tokenization done.
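
`max_length=128` truncates longer narratives, so it can help to check how many texts exceed that budget before settling on the value. The sketch below uses whitespace tokens as a rough proxy. This is an assumption: DistilBERT's WordPiece tokenizer usually produces more tokens than a whitespace split, so the estimate undercounts truncation. `truncated_fraction` is a hypothetical helper.

```python
def truncated_fraction(texts, max_length=128, reserved=2):
    """Share of texts whose whitespace-token count exceeds the budget.

    reserved accounts for the [CLS] and [SEP] tokens that BERT-style
    tokenizers add to every sequence.
    """
    budget = max_length - reserved
    too_long = sum(1 for t in texts if len(t.split()) > budget)
    return too_long / len(texts)

# Toy texts (hypothetical); in the notebook this would be
# train_df['cleaned_narrative'].tolist().
sample = ["short complaint", "word " * 200, "another short one"]
frac = truncated_fraction(sample, max_length=128)
print(f"Estimated share of truncated texts: {frac:.2%}")
```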


In [None]:
# 6. Load model
print("Loading DistilBERT model ....")
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=num_labels)

Loading DistilBERT model ....


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# 7. Training arguments and trainer setup
output_dir = f"./distilbert-finetuned-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
print(f"Output directory = {output_dir}")
training_args = TrainingArguments(
    output_dir = output_dir,
    eval_strategy = 'epoch',
    save_strategy = 'epoch',
    logging_strategy = 'steps',
    logging_steps = 10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs = 3,
    learning_rate = 2e-5,
    warmup_steps = 100,
    weight_decay = 0.01,
    load_best_model_at_end = True,
    metric_for_best_model = 'eval_loss',
    save_total_limit=2,
    fp16=True,
    seed=42,
    report_to="none",
    dataloader_num_workers=0,
    greater_is_better=False
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=-1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'macro_f1': f1_score(labels, preds, average='macro'),
        'weighted_f1': f1_score(labels, preds, average='weighted')
    }

Output directory = ./distilbert-finetuned-20251006-111927
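
`compute_metrics` reports both macro and weighted F1 because the two diverge under class imbalance: macro F1 averages per-class scores equally, while weighted F1 weights each class by its support. A small pure-Python sketch with toy labels illustrates the gap. The data and the helper names `per_class_f1`, `macro_f1`, and `weighted_f1` are hypothetical.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 per class, with 0.0 when a class has no true or predicted samples."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)      # every class counts equally

def weighted_f1(y_true, y_pred):
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)

# Toy example: class 0 is common and predicted well, class 1 is rare and missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, weighted F1 = {weighted:.4f}")
```

A large gap between the two, as in the toy example, signals that rare classes are being missed even when overall accuracy looks acceptable.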


In [None]:
# --- Class Weight Calculation ---
# Calculate class weights to handle class imbalance
from sklearn.utils import class_weight

print("Calculating class weights for imbalance handling....")
class_labels = np.unique(train_df['label'])
weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=class_labels,
    y=train_df['label'].values
)
# Convert weights to a PyTorch tensor
class_weights_tensor = torch.tensor(weights, dtype=torch.float32).to(model.device)


Calculating class weights for imbalance handling....
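
One way to use such weights is a class-weighted cross-entropy, where each example's loss is scaled by the weight of its true class. The NumPy sketch below mirrors the weighted-mean reduction that PyTorch's `CrossEntropyLoss(weight=...)` applies by default (the sum of weighted losses divided by the sum of the targets' weights). `weighted_cross_entropy` and the toy arrays are hypothetical.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Mean cross-entropy where each sample is weighted by its true class."""
    logits = logits - logits.max(axis=1, keepdims=True)          # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]              # per-sample loss
    w = class_weights[labels]                                     # per-sample weight
    return (w * nll).sum() / w.sum()

logits = np.array([[2.0, 0.0],
                   [0.0, 2.0],
                   [2.0, 0.0]])
labels = np.array([0, 1, 1])          # the last example is misclassified

uniform = weighted_cross_entropy(logits, labels, np.array([1.0, 1.0]))
upweighted = weighted_cross_entropy(logits, labels, np.array([1.0, 5.0]))
print(f"uniform = {uniform:.4f}, class 1 upweighted = {upweighted:.4f}")
```

Raising the weight of class 1 increases the loss here because the misclassified example belongs to that class, which is how the weighting pushes the model toward rare classes.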


In [None]:
# --- Custom Trainer for Weighted Loss ---
class WeightedLossTrainer(Trainer):
    """Subclassing the Trainer to inject class weights into the CrossEntropyLoss function."""
    def __init__(self, *args, class_weights_tensor=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights_tensor = class_weights_tensor


    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):  # num_items_in_batch is passed by newer Trainer versions and is unused here
        # Retrieve labels and remove them from inputs for the model forward pass
        labels = inputs.pop("labels")
        # Forward pass
        outputs = model(**inputs)
        logits = outputs.get('logits')

        # Compute custom weighted loss
        # The weight parameter in CrossEntropyLoss handles class imbalance by
        # scaling the loss contribution of each class.
        # Use self.class_weights_tensor
        loss_fct = torch.nn.CrossEntropyLoss(weight=self.class_weights_tensor.to(logits.device))


        # Calculate loss (logits: [batch_size, num_labels], labels: [batch_size])
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))

        return (loss, outputs) if return_outputs else loss

In [None]:
# 8. Starting training
print("Initializing WeightedLossTrainer and starting training ....")
trainer = WeightedLossTrainer(
    model = model,
    args = training_args,
    train_dataset = train_ds,
    eval_dataset = val_ds,
    compute_metrics = compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=1)],
    class_weights_tensor=class_weights_tensor # Pass the tensor here
)
trainer.train()
print('Training done ................')

Initializing WeightedLossTrainer and starting training ....


Epoch,Training Loss,Validation Loss,Accuracy,Macro F1,Weighted F1
1,1.4413,1.367026,0.544,0.335312,0.526502
2,1.2481,1.268488,0.5635,0.41316,0.555385
3,1.0653,1.234289,0.5865,0.440218,0.583363


Training done ................


In [None]:
# 9. Save model
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Model and tokenizer saved to {output_dir}")

Model and tokenizer saved to ./distilbert-finetuned-20251006-111927


In [None]:
# 10. Evaluate on test set
print("Evaluating on test set...")
results = trainer.evaluate(test_ds)
print(f"Test results: {results}")

preds = trainer.predict(test_ds)
y_pred = np.argmax(preds.predictions, axis=1)
# Get true labels from the test_ds Dataset object
y_true = test_ds['label']

# Create a list of target names based on the classes present in the subset
target_names = label_encoder.classes_[np.unique(y_true)]

print("\nClassification Report:")
# Use the label_encoder.classes_ for target names, but ensure they correspond to the classes in subset
print(classification_report(y_true, y_pred, target_names=target_names))

print("Fine Tuning complaints on DistilBERT completed successfully.")

Evaluating on test set...


Test results: {'eval_loss': 1.2850532531738281, 'eval_accuracy': 0.5895, 'eval_macro_f1': 0.4471136220536465, 'eval_weighted_f1': 0.5911367626628946, 'eval_runtime': 1.9582, 'eval_samples_per_second': 1021.34, 'eval_steps_per_second': 63.834, 'epoch': 3.0}

Classification Report:
                                                                              precision    recall  f1-score   support

                                                     Bank account or service       0.41      0.21      0.28        75
                                                 Checking or savings account       0.44      0.75      0.55        68
                                                               Consumer Loan       0.14      0.16      0.15        50
                                                                 Credit card       0.38      0.56      0.45        94
                                                 Credit card or prepaid card       0.46      0.42      0.44       117
          



## Fine-tuning DistilBERT for Product Classification

This section details the process and results of fine-tuning a pre-trained DistilBERT model for the consumer complaint product classification task. This represents our progression to state-of-the-art NLP techniques following experiments with traditional embeddings and attention-enhanced BiLSTM models.

**Fine-tuning Process and Setup:**

1.  **Data Preparation:** We utilized the previously prepared 20k sample dataset. Crucially, we filtered out product classes with fewer than 2 samples to enable stratified splitting, ensuring representative class distribution across training, validation, and test sets. The cleaned complaint narratives were used as input text.
2.  **Label Encoding:** Product categories were encoded into numerical labels using `LabelEncoder`, ensuring compatibility with the model's output layer. The number of unique classes after filtering was 17.
3.  **Tokenization:** The `DistilBertTokenizerFast` for `distilbert-base-uncased` was used to tokenize the complaint narratives. Sequences were truncated and padded to a fixed length of 128 tokens. This length is a choice made to limit memory use and training time; DistilBERT itself accepts sequences of up to 512 tokens.
4.  **Model Loading:** The PyTorch `DistilBertForSequenceClassification` model was loaded from the `distilbert-base-uncased` checkpoint (the Hugging Face `Trainer` used here works with PyTorch models, not the `TF*` classes). The output layer was configured to have `num_labels=17`, matching the number of classes in our filtered dataset.
5.  **Training Arguments and Trainer:** We defined `TrainingArguments` to configure the fine-tuning process. Key parameters included:
    *   `output_dir`: Directory for saving checkpoints and logs.
    *   `eval_strategy` and `save_strategy`: Set to `'epoch'` to evaluate and save the model at the end of each training epoch.
    *   `learning_rate`: A small learning rate (2e-5) is used, which is typical for fine-tuning to avoid rapidly overwriting the pre-trained knowledge.
    *   `per_device_train_batch_size` and `per_device_eval_batch_size`: Set to 16.
    *   `num_train_epochs`: Set to 3.
    *   `load_best_model_at_end=True`: To load the model with the best validation performance after training.
    *   `metric_for_best_model='eval_loss'`: Using validation loss to determine the best model.
    *   `fp16=True`: Enabled for faster training on compatible hardware.
    *   `report_to="none"`: Disabled logging to external platforms like Weights & Biases.
    *   An `EarlyStoppingCallback` with a patience of 1 was used to stop training if the validation loss did not improve for one epoch, preventing overfitting.
6.  **Evaluation Metrics:** A `compute_metrics` function was defined to calculate Accuracy, Macro F1-score, and Weighted F1-score during evaluation.
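
The filtering and stratified splitting described in step 1 can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual cell: the column names `Product` and `cleaned_narrative` follow the dataset, while the toy rows and the 80/20 ratio are assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the sampled complaints (hypothetical data)
toy = pd.DataFrame({
    "cleaned_narrative": [f"complaint {i}" for i in range(21)],
    "Product": ["Mortgage"] * 10 + ["Debt collection"] * 10 + ["Virtual currency"],
})

# Drop classes with fewer than 2 samples: a stratified split needs
# at least 2 members per class, otherwise it raises an error
counts = toy["Product"].value_counts()
keep = counts[counts >= 2].index
filtered = toy[toy["Product"].isin(keep)].copy()

# Stratify on the label so both splits keep the same class proportions
train_df, test_df = train_test_split(
    filtered, test_size=0.2, stratify=filtered["Product"], random_state=42
)
print(train_df["Product"].value_counts())
print(test_df["Product"].value_counts())
```

The singleton class is removed before the split, and each remaining class appears in both splits in the same proportion as in the filtered data.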

**Evaluation and Interpretation of Results:**

The model was evaluated on the held-out test set. The key metrics are:

*   **Test Accuracy: 0.5895** - This represents the overall proportion of correctly classified complaints. An accuracy of nearly 59% is a significant improvement over the previously attempted BiLSTM models (which achieved around 33% with class weights and 53% without), indicating the superior capability of the fine-tuned transformer model.
*   **Macro F1-score: 0.4471** - The Macro F1-score is the unweighted average of the F1-scores for each class. It treats all classes equally, regardless of their size. A Macro F1 of 0.45 suggests that the model's performance varies significantly across different classes, and it likely struggles with the less frequent (minority) classes. If the model performed equally well on all classes, the Macro F1 would be closer to the overall accuracy. The discrepancy indicates that while the model is doing well on average across samples (accuracy), its performance is not balanced across different product categories.
*   **Weighted F1-score: 0.5911** - The Weighted F1-score calculates the F1-score for each class and then averages them, weighted by the number of samples in each class. This metric is heavily influenced by the performance on larger (majority) classes. A Weighted F1 of 0.59, which is close to the overall accuracy, suggests that the model performs much better on the majority classes. The large difference between Macro and Weighted F1 confirms that the model is biased towards predicting the more frequent product categories.
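
To make the gap between the two averages concrete, here is a minimal sketch on made-up labels (not the project's data). It assumes a degenerate classifier that only ever predicts the majority class, which is the extreme case of the bias described above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: class 0 is frequent (8 samples), class 1 is rare (2 samples)
y_true = np.array([0] * 8 + [1] * 2)
# A classifier that always predicts the majority class
y_pred = np.zeros(10, dtype=int)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={acc:.3f}  macro_f1={macro:.3f}  weighted_f1={weighted:.3f}")
```

Accuracy (0.8) and Weighted F1 (about 0.71) stay high because the frequent class dominates them, while Macro F1 drops to about 0.44 because the rare class contributes an F1 of zero with equal weight.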

**Classification Report Breakdown:**

The detailed classification report provides per-class metrics (precision, recall, F1-score, support). The report printed in the evaluation output above shows this bias:

*   Classes like 'Credit reporting, credit repair services, or other personal consumer reports', 'Debt collection', and 'Mortgage' (which are likely majority classes based on the support counts) generally have higher precision, recall, and F1-scores.
*   Conversely, many minority classes (e.g., 'Consumer Loan', 'Money transfer, virtual currency, or money service', 'Payday loan', 'Prepaid card', 'Vehicle loan or lease') have significantly lower or even zero precision and recall, resulting in very low or zero F1-scores. The model is likely failing to predict any samples for some of these rare classes.

**Issues and Potential Refinements:**

While the overall accuracy and weighted F1-score are encouraging and represent a significant improvement, the low Macro F1-score and the detailed classification report highlight that class imbalance is still a major challenge affecting the model's ability to generalize to less frequent product categories.

Potential refinements to address the class imbalance and improve performance on minority classes include:

*   **Class Weighting in Trainer:** The standard Hugging Face `Trainer` has no built-in `class_weight` argument. Class weights have to be injected into the loss by subclassing `Trainer` and overriding `compute_loss`, as the `WeightedLossTrainer` above does for the PyTorch model. For a TensorFlow/Keras model, the equivalent is to pass `class_weight` to `model.fit` or to write a custom weighted loss.
*   **Oversampling Minority Classes:** Techniques like Random Oversampling or SMOTE could be applied to the training data to increase the number of samples in minority classes. Care must be taken to apply this only to the training set to avoid data leakage.
*   **Undersampling Majority Classes:** Reducing the number of samples in majority classes in the training data can also help balance the dataset, though this might lead to losing valuable information.
*   **Combining Oversampling and Undersampling:** Using a hybrid approach can be effective.
*   **Exploring Different Metrics for Best Model:** While `eval_loss` is a common metric for saving the best model, we could experiment with `metric_for_best_model='eval_macro_f1'` (together with `greater_is_better=True`, since a higher F1 is better) to explicitly select for better performance across all classes, even if it slightly reduces overall accuracy.
*   **More Training Epochs:** While early stopping was used, perhaps slightly more training epochs with a larger patience could allow the model to learn more, provided it doesn't lead to overfitting.
*   **Different Transformer Models:** Experimenting with other pre-trained transformer models (e.g., RoBERTa, ELECTRA) might yield better results.
*   **Larger Dataset:** Fine-tuning on a larger subset of the original data (if computational resources allow) could provide more data for the model to learn from, especially for the less frequent classes.
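
As an illustration of the oversampling option, the sketch below balances a toy training split by random oversampling with plain pandas. It is an assumption-laden example rather than part of the notebook: the frame, its `label` values, and the choice to oversample every class up to the majority count are all hypothetical. The validation and test splits would be left untouched.

```python
import pandas as pd

# Hypothetical training split with one frequent and two rarer classes
train_df = pd.DataFrame({
    "cleaned_narrative": [f"text {i}" for i in range(12)],
    "label": [0] * 8 + [1] * 3 + [2] * 1,
})

# Oversample every class up to the size of the largest class
target = train_df["label"].value_counts().max()

parts = []
for _, grp in train_df.groupby("label"):
    parts.append(grp)  # always keep the original rows
    if len(grp) < target:
        # Draw the missing rows with replacement from the same class
        extra = grp.sample(n=target - len(grp), replace=True, random_state=42)
        parts.append(extra)

# Shuffle so duplicated rows are not grouped together
train_balanced = (
    pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)
)
print(train_balanced["label"].value_counts())
```

Because the duplicates are exact copies of existing narratives, this adds no new information and can encourage overfitting on the rare classes, which is why it is usually compared against class weighting rather than assumed to be better.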

The fine-tuned DistilBERT model provides a strong foundation, but further efforts are needed to improve its ability to classify minority product categories effectively.

**Next, we fine-tune DistilBERT on the full cleaned dataset (about 383k complaints), which gives the model many more examples of the rare classes.**

In [None]:
# Load the cleaned_data.parquet
import pandas as pd
import os
from google.colab import drive

if not os.path.exists('/content/drive'):
    drive.mount('/content/drive')

load_path = '/content/drive/MyDrive/Data Science course/Major Projects/Projects/Smart Support NLP - Major'
cleaned_data = pd.read_parquet(os.path.join(load_path, 'cleaned_data.parquet'))

print("Cleaned data loaded successfully.")
print(f"Shape of cleaned data: {cleaned_data.shape}")
print(f"Columns in cleaned data: {list(cleaned_data.columns)}")

Mounted at /content/drive
Cleaned data loaded successfully.
Shape of cleaned data: (383512, 20)
Columns in cleaned data: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID', 'narrative_length', 'cleaned_narrative']


In [None]:
# data distribution (class counts) for the 'Product' column
print("\nProduct class distribution in cleaned data:")
product_counts_cleaned = cleaned_data['Product'].value_counts()
print(product_counts_cleaned)
print(f"\nNumber of unique product classes: {cleaned_data['Product'].nunique()}")


Product class distribution in cleaned data:
Product
Credit reporting, credit repair services, or other personal consumer reports    92364
Debt collection                                                                 86683
Mortgage                                                                        52984
Credit reporting                                                                31584
Student loan                                                                    21809
Credit card or prepaid card                                                     21379
Credit card                                                                     18836
Bank account or service                                                         14884
Checking or savings account                                                     12881
Consumer Loan                                                                    9474
Vehicle loan or lease                                                            5745
M

In [None]:
# Preprocessing steps for fine-tuning (Label Encoding and Tokenization)

# 1. Filter out classes with fewer than 2 samples so that the stratified
# train/validation/test split performed later does not fail on singleton classes.
product_counts_cleaned = cleaned_data['Product'].value_counts()
classes_to_keep_cleaned = product_counts_cleaned[product_counts_cleaned >= 2].index
cleaned_data_filtered = cleaned_data[cleaned_data['Product'].isin(classes_to_keep_cleaned)].copy()

print(f"\nShape of cleaned data after filtering for stratification: {cleaned_data_filtered.shape}")
print(f"Product classes in filtered cleaned data: {cleaned_data_filtered['Product'].nunique()}")


# 2. Label Encoding
from sklearn.preprocessing import LabelEncoder
from datasets import ClassLabel, Features, Value # Import ClassLabel, Features, Value

label_encoder_cleaned = LabelEncoder()
# Fit the encoder on the product names present in the filtered cleaned data
cleaned_data_filtered['label'] = label_encoder_cleaned.fit_transform(cleaned_data_filtered['Product'])
num_labels_cleaned = len(label_encoder_cleaned.classes_)

print(f"\nNumber of classes after encoding: {num_labels_cleaned}")
print(f"Encoded label classes: {list(label_encoder_cleaned.classes_)}")


# 3. Tokenization (using the same tokenizer as for fine-tuning)
from transformers import DistilBertTokenizerFast
from datasets import Dataset # Import Dataset

print("\nInitializing tokenizer ....")
# Using the same tokenizer as used for DistilBERT fine-tuning
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

def tokenize_function(batch):
    if 'cleaned_narrative' not in batch:
        raise ValueError("Batch does not contain 'cleaned_narrative' column.")
    return tokenizer(batch['cleaned_narrative'], truncation=True, padding='max_length', max_length=128)


print("Tokenizing cleaned data ....")
# Create a Dataset object from the filtered cleaned data
# Define features with ClassLabel for the 'label' column
features = Features({
    'cleaned_narrative': Value(dtype='string'),
    'label': ClassLabel(names=list(label_encoder_cleaned.classes_))
})
# Reset the index before creating the Dataset to avoid index column issues
cleaned_data_for_dataset = cleaned_data_filtered[['cleaned_narrative', 'label']].reset_index(drop=True)
cleaned_ds = Dataset.from_pandas(cleaned_data_for_dataset, features=features)


# Map the tokenization function over the dataset
cleaned_ds = cleaned_ds.map(tokenize_function, batched=True)

# Set the format to PyTorch tensors (or TensorFlow if you prefer for TF models)
# We'll set it to PyTorch format as the previous fine-tuning used PyTorch Trainer
cleaned_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

print("Tokenization of cleaned data done.")
print(f"Tokenized dataset columns: {cleaned_ds.column_names}")
print(f"Label column feature type: {cleaned_ds.features['label']}")


Shape of cleaned data after filtering for stratification: (383512, 20)
Product classes in filtered cleaned data: 18

Number of classes after encoding: 18
Encoded label classes: ['Bank account or service', 'Checking or savings account', 'Consumer Loan', 'Credit card', 'Credit card or prepaid card', 'Credit reporting', 'Credit reporting, credit repair services, or other personal consumer reports', 'Debt collection', 'Money transfer, virtual currency, or money service', 'Money transfers', 'Mortgage', 'Other financial service', 'Payday loan', 'Payday loan, title loan, or personal loan', 'Prepaid card', 'Student loan', 'Vehicle loan or lease', 'Virtual currency']

Initializing tokenizer ....
Tokenizing cleaned data ....



Tokenization of cleaned data done.
Tokenized dataset columns: ['cleaned_narrative', 'label', 'input_ids', 'attention_mask']
Label column feature type: ClassLabel(names=['Bank account or service', 'Checking or savings account', 'Consumer Loan', 'Credit card', 'Credit card or prepaid card', 'Credit reporting', 'Credit reporting, credit repair services, or other personal consumer reports', 'Debt collection', 'Money transfer, virtual currency, or money service', 'Money transfers', 'Mortgage', 'Other financial service', 'Payday loan', 'Payday loan, title loan, or personal loan', 'Prepaid card', 'Student loan', 'Vehicle loan or lease', 'Virtual currency'])


In [None]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
import torch
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification, # Import the model class
    Trainer, # Import Trainer if used in this cell
    TrainingArguments, # Import TrainingArguments if used in this cell
    EarlyStoppingCallback # Import EarlyStoppingCallback if used in this cell
)
from datasets import Dataset, DatasetDict # Import Dataset and DatasetDict if used in this cell
from sklearn.utils import class_weight # Import class_weight


# 6. Load model
print("Loading DistilBERT model ....")
# Use num_labels_cleaned from the preprocessing of the full dataset
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=num_labels_cleaned)

# 7. Training arguments and trainer setup
output_dir = f"./distilbert-finetuned-full-data-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
print(f"Output directory = {output_dir}")
training_args = TrainingArguments(
    output_dir = output_dir,
    eval_strategy = 'epoch',
    save_strategy = 'epoch',
    logging_strategy = 'steps',
    logging_steps = 10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs = 3,
    learning_rate = 2e-5,
    warmup_steps = 100,
    weight_decay = 0.01,
    load_best_model_at_end = True,
    metric_for_best_model = 'eval_loss',
    save_total_limit=2,
    fp16=True,
    seed=42,
    report_to="none",
    dataloader_num_workers=0,
    greater_is_better=False
)

def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  preds = np.argmax(predictions, axis = -1)
  return {
      'accuracy': accuracy_score(labels, preds),
      'macro_f1': f1_score(labels, preds, average = 'macro'),
      'weighted_f1': f1_score(labels, preds, average = 'weighted')
  }

# Class Weight Calculation ---
# Calculate class weights to handle class imbalance

print("Calculating class weights for imbalance handling....")
# Calculate weights based on the labels in the full cleaned data before splitting
# Access the 'label' column as a list or array from the Dataset object
class_labels_full = cleaned_ds['label']
# Convert to numpy array of integers explicitly for compute_class_weight
y_for_weights = np.array(list(class_labels_full), dtype=int)

weights_full = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_for_weights), # Use unique from the numpy array
    y=y_for_weights # Use the numpy array for y
)
# Convert weights to a PyTorch tensor
class_weights_tensor_full = torch.tensor(weights_full, dtype=torch.float32).to(model.device)

# Custom Trainer for Weighted Loss ---
class WeightedLossTrainer(Trainer):
    """Subclassing the Trainer to inject class weights into the CrossEntropyLoss function."""
    def __init__(self, *args, class_weights_tensor=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights_tensor = class_weights_tensor


    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        # num_items_in_batch is accepted (and ignored) because newer Trainer versions pass it
        # Retrieve labels and remove them from inputs for the model forward pass
        labels = inputs.pop("labels")
        # Forward pass
        outputs = model(**inputs)
        logits = outputs.get('logits')


        # Compute custom weighted loss
        # The weight parameter in CrossEntropyLoss handles class imbalance by
        # scaling the loss contribution of each class.
        # Use self.class_weights_tensor
        loss_fct = torch.nn.CrossEntropyLoss(weight=self.class_weights_tensor.to(logits.device))


        # Calculate loss (logits: [batch_size, num_labels], labels: [batch_size])
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))


        return (loss, outputs) if return_outputs else loss

Loading DistilBERT model ....


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Output directory = ./distilbert-finetuned-full-data-20251007-075339
Calculating class weights for imbalance handling....


In [None]:
# Split cleaned_ds into train, val, and test sets for fine-tuning
from datasets import DatasetDict

print("Splitting cleaned data for fine-tuning ....")

# Use the tokenized cleaned_ds Dataset for splitting
# Splitting the Dataset using datasets library's train_test_split
# This returns a DatasetDict
train_testvalid_full = cleaned_ds.train_test_split(test_size=0.2, stratify_by_column='label', seed=42)

# Split the test_valid further into validation and test
test_valid_full = train_testvalid_full['test'].train_test_split(test_size=0.5, stratify_by_column='label', seed=42)

train_ds_full = train_testvalid_full['train']
val_ds_full = test_valid_full['train']
test_ds_full = test_valid_full['test']


print(f"Train set (full data): {len(train_ds_full)} samples")
print(f"Validation set (full data): {len(val_ds_full)} samples")
print(f"Test set (full data): {len(test_ds_full)} samples")

# 8. Starting training
print("Initializing WeightedLossTrainer and starting training with full data....")
trainer = WeightedLossTrainer(
    model = model, # Use the model initialized in the previous cell
    args = training_args, # Use training_args from the previous cell
    train_dataset = train_ds_full, # Use full training data
    eval_dataset = val_ds_full, # Use full validation data
    compute_metrics = compute_metrics, # Use compute_metrics from the previous cell
    callbacks = [EarlyStoppingCallback(early_stopping_patience=1)],
    class_weights_tensor=class_weights_tensor_full # Pass the tensor for full data
)
trainer.train()
print('Training done ................')

# 9. Save model
trainer.save_model(output_dir) # Use output_dir from the previous cell
tokenizer.save_pretrained(output_dir) # Use tokenizer from preprocessing
print(f"Model and tokenizer saved to {output_dir}")

# 10. Evaluate on test set
print("Evaluating on test set (full data)...")
results_full = trainer.evaluate(test_ds_full) # Evaluate on full test data
print(f"Test results (full data): {results_full}")


preds_full = trainer.predict(test_ds_full) # Predict on full test data
y_pred_full = np.argmax(preds_full.predictions, axis=1)
# Get true labels from the test_ds_full Dataset object
y_true_full = test_ds_full['label']

# Use label_encoder_cleaned which was fitted on the full filtered data
target_names_full = label_encoder_cleaned.classes_[np.unique(y_true_full)]


print("\nClassification Report (full data):")
print(classification_report(y_true_full, y_pred_full, target_names=target_names_full))

print("Fine Tuning complaints on DistilBERT with full data completed successfully.")

Splitting cleaned data for fine-tuning ....
Train set (full data): 306809 samples
Validation set (full data): 38351 samples
Test set (full data): 38352 samples
Initializing WeightedLossTrainer and starting training with full data....


Epoch,Training Loss,Validation Loss,Accuracy,Macro F1,Weighted F1
1,0.9936,1.021794,0.682173,0.506405,0.68627
2,0.6415,1.016793,0.69508,0.53347,0.698462
3,0.6231,1.028906,0.709577,0.54892,0.714171


Training done ................
Model and tokenizer saved to ./distilbert-finetuned-full-data-20251007-075339
Evaluating on test set (full data)...


Test results (full data): {'eval_loss': 1.0231534242630005, 'eval_accuracy': 0.6932363370880267, 'eval_macro_f1': 0.5288980649847599, 'eval_weighted_f1': 0.6971259381336563, 'eval_runtime': 36.8424, 'eval_samples_per_second': 1040.974, 'eval_steps_per_second': 65.061, 'epoch': 3.0}

Classification Report (full data):
                                                                              precision    recall  f1-score   support

                                                     Bank account or service       0.51      0.65      0.57      1489
                                                 Checking or savings account       0.60      0.45      0.51      1288
                                                               Consumer Loan       0.40      0.51      0.45       947
                                                                 Credit card       0.46      0.69      0.55      1884
                                                 Credit card or prepaid card       0.56   



## Fine-tuning DistilBERT on the Full Dataset

Having explored initial models and fine-tuned DistilBERT on a smaller sample, we proceeded to fine-tune the DistilBERT model on the entire cleaned dataset to leverage its full potential and improve performance, especially on minority classes.

**Process Highlights:**

*   The full cleaned dataset was loaded, filtered to ensure each class had at least 2 samples for stratification, and then label-encoded.
*   The dataset was tokenized using the DistilBERT tokenizer with a max length of 128 and prepared as a `datasets.Dataset` with the 'label' column cast to `ClassLabel` for proper handling.
*   The full dataset was split into training, validation, and test sets using stratified splitting to maintain class distribution.
*   A `DistilBertForSequenceClassification` model was loaded with the appropriate number of output labels (18, reflecting the classes in the full filtered dataset).
*   Balanced class weights were computed from the labels of the full dataset before splitting (the stratified split gives the training set nearly the same distribution) to address class imbalance during training, and were applied through a custom `WeightedLossTrainer`.
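*   The model was fine-tuned for 3 epochs with early stopping (patience 1) on validation loss. The lowest validation loss (1.0168) occurred at epoch 2, so `load_best_model_at_end=True` restores that checkpoint for the test evaluation.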
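
For reference, the `'balanced'` weights used here follow the formula `n_samples / (n_classes * count_per_class)`, so rarer classes receive proportionally larger weights. The sketch below checks this on a small made-up label array; the labels are hypothetical and only the `compute_class_weight` call mirrors the notebook.

```python
import numpy as np
from sklearn.utils import class_weight

# Hypothetical imbalanced labels: 6 / 3 / 1 samples for classes 0 / 1 / 2
y = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y)

weights = class_weight.compute_class_weight(
    class_weight="balanced", classes=classes, y=y
)

# The same values from the formula n_samples / (n_classes * count_per_class)
manual = len(y) / (len(classes) * np.bincount(y))

print(weights)
print(manual)
```

In this toy case the rarest class gets a weight six times larger than the most frequent one, which is the ratio of their sample counts.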

**Performance Evaluation (Full Dataset):**

The model's performance was evaluated on the held-out test set, yielding the following key metrics:

*   **Test Accuracy: 0.6932** - The overall proportion of correctly classified samples. This shows a significant improvement in overall accuracy compared to the fine-tuning on the 20k sample (0.5895), indicating that training on a larger dataset has helped the model generalize better.
*   **Macro F1-score: 0.5289** - The unweighted average of F1-scores across all classes. This metric is a good indicator of the model's performance across both majority and minority classes, treating them equally. A Macro F1 of 0.53 is a notable improvement over the 20k sample result (0.4471), suggesting that the model is performing better on the less frequent classes when trained on the full dataset with class weights.
*   **Weighted F1-score: 0.6971** - The average of F1-scores weighted by the number of samples in each class. This metric is heavily influenced by the performance on majority classes. A Weighted F1 of 0.70 is close to the overall accuracy, as expected, and also shows improvement over the 20k sample result (0.5911).

**Interpretation:**

The results from fine-tuning on the full dataset with class weights demonstrate a substantial improvement across all key metrics compared to the previous attempts, including the fine-tuning on the smaller 20k sample.

*   The higher **Test Accuracy** and **Weighted F1-score** indicate that the model is much better at classifying the majority classes when trained on more data.
*   Crucially, the improved **Macro F1-score** suggests that the combined effect of using the full dataset and applying class weights has helped the model learn to classify minority classes more effectively, leading to a more balanced performance across all product categories.

While there is still a gap between the Macro F1 and Weighted F1 (indicating that imbalance still poses a challenge, though less severe than before), the results are promising and demonstrate the power of fine-tuning on a larger, more representative dataset with appropriate techniques to handle imbalance.

We have successfully fine-tuned a DistilBERT model on the full dataset, achieving significantly better performance metrics.

## Merging Rare Product Classes

To further address the class imbalance observed in the product categories, particularly the poor performance on classes with very few samples, we will merge some of the rare product classes into more frequent or related categories. This strategy aims to increase the number of training examples for the merged categories, potentially improving the model's ability to learn and classify these instances more effectively.

Based on the class distribution and domain knowledge, the following merging strategy is applied:

*   'Virtual currency' and 'Money transfers' are merged into 'Money transfer, virtual currency, or money service'.
*   'Other financial service' is merged into 'Bank account or service'.
*   'Prepaid card' is merged into 'Credit card or prepaid card'.
*   'Payday loan' is merged into 'Payday loan, title loan, or personal loan'.
*   'Consumer Loan' is merged into 'Vehicle loan or lease' (This merge is based on the assumption of some overlap or similarity in consumer complaints related to these loan types. This can be adjusted based on further analysis or domain expertise).

The merging is performed by creating a mapping from the rare class names to their target merged class names and then using the `.replace()` method on the 'Product' column to create a new 'Product_merged' column. We will then examine the new class distribution to see the effect of the merging.
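
Before applying a mapping like this, it can help to check that it does not chain (no source class is also a merge target) and that no rows are lost. The following is a minimal sketch on a toy `Product` column; the counts and the one-entry mapping are made up for illustration.

```python
import pandas as pd

# Toy 'Product' column (hypothetical counts)
products = pd.Series(
    ["Mortgage"] * 3 + ["Prepaid card"] * 1 + ["Credit card or prepaid card"] * 2,
    name="Product",
)
merge_map = {"Prepaid card": "Credit card or prepaid card"}

# A source class must not also be a target, otherwise merges would chain
overlap = set(merge_map) & set(merge_map.values())
assert not overlap, f"chained merge for: {overlap}"

# Classes absent from the mapping are left unchanged by replace()
merged = products.replace(merge_map)

print(merged.value_counts())
```

The row count stays the same and the rare class disappears from the distribution, with its samples added to the target class.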

In [None]:
!pip install -q transformers datasets accelerate torch

import os
import numpy as np
import pandas as pd
import torch
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, f1_score, accuracy_score
from sklearn.utils import class_weight
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback
)
from datasets import Dataset, Features, Value, ClassLabel

# 1. Load data
print("1. Loading data...")
from google.colab import drive  # needed when this cell runs in a fresh session
drive.mount('/content/drive')
load_path = '/content/drive/MyDrive/Data Science course/Major Projects/Projects/Smart Support NLP - Major'
cleaned_data = pd.read_parquet(os.path.join(load_path, 'cleaned_data.parquet'))
print(f"Shape of cleaned data: {cleaned_data.shape}")
print(f"Columns: {list(cleaned_data.columns)}")
print("\nProduct class distribution:")
print(cleaned_data['Product'].value_counts())

1. Loading data...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Shape of cleaned data: (383512, 20)
Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID', 'narrative_length', 'cleaned_narrative']

Product class distribution:
Product
Credit reporting, credit repair services, or other personal consumer reports    92364
Debt collection                                                                 86683
Mortgage                                                                        52984
Credit reporting                                                                31584
Student loan                                               

In [None]:
# 2. Merge rare classes
print("2. Merging rare classes...")
merge_map = {
    'Virtual currency': 'Money transfer, virtual currency, or money service',
    'Other financial service': 'Bank account or service',
    'Money transfers': 'Money transfer, virtual currency, or money service',
    'Prepaid card': 'Credit card or prepaid card',
    'Payday loan': 'Payday loan, title loan, or personal loan',
    'Consumer Loan': 'Vehicle loan or lease'
}
cleaned_data['Product_merged'] = cleaned_data['Product'].replace(merge_map)
print("Merged classes:", merge_map)
print("New class distribution:\n", cleaned_data['Product_merged'].value_counts())

2. Merging rare classes...
Merged classes: {'Virtual currency': 'Money transfer, virtual currency, or money service', 'Other financial service': 'Bank account or service', 'Money transfers': 'Money transfer, virtual currency, or money service', 'Prepaid card': 'Credit card or prepaid card', 'Payday loan': 'Payday loan, title loan, or personal loan', 'Consumer Loan': 'Vehicle loan or lease'}
New class distribution:
 Product_merged
Credit reporting, credit repair services, or other personal consumer reports    92364
Debt collection                                                                 86683
Mortgage                                                                        52984
Credit reporting                                                                31584
Credit card or prepaid card                                                     22829
Student loan                                                                    21809
Credit card                                       

In [None]:
# 3. Filter classes with <2 samples
print("3. Filtering classes with <2 samples...")
product_counts = cleaned_data['Product_merged'].value_counts()
classes_to_keep = product_counts[product_counts >= 2].index
cleaned_data_filtered = cleaned_data[cleaned_data['Product_merged'].isin(classes_to_keep)].copy()
print(f"Shape after filtering: {cleaned_data_filtered.shape}")
print(f"Number of classes: {cleaned_data_filtered['Product_merged'].nunique()}")

3. Filtering classes with <2 samples...
Shape after filtering: (383512, 21)
Number of classes: 12


In [None]:
# 4. Label encoding
print("4. Label encoding...")
label_encoder = LabelEncoder()
cleaned_data_filtered['label'] = label_encoder.fit_transform(cleaned_data_filtered['Product_merged'])
num_labels = len(label_encoder.classes_)
print(f"Number of classes: {num_labels}")
print(f"Label classes: {list(label_encoder.classes_)}")

4. Label encoding...
Number of classes: 12
Label classes: ['Bank account or service', 'Checking or savings account', 'Credit card', 'Credit card or prepaid card', 'Credit reporting', 'Credit reporting, credit repair services, or other personal consumer reports', 'Debt collection', 'Money transfer, virtual currency, or money service', 'Mortgage', 'Payday loan, title loan, or personal loan', 'Student loan', 'Vehicle loan or lease']


In [None]:
# 5. Tokenization
print("Initializing tokenizer...")
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

def tokenize_function(batch):
    if 'cleaned_narrative' not in batch:
        raise ValueError("Batch does not contain 'cleaned_narrative' column.")
    return tokenizer(batch['cleaned_narrative'], truncation=True, padding='max_length', max_length=200)

print("Tokenizing cleaned data...")
features = Features({
    'cleaned_narrative': Value(dtype='string'),
    'label': ClassLabel(names=list(label_encoder.classes_))
})
cleaned_data_for_dataset = cleaned_data_filtered[['cleaned_narrative', 'label']].reset_index(drop=True)
cleaned_ds = Dataset.from_pandas(cleaned_data_for_dataset, features=features)
cleaned_ds = cleaned_ds.map(tokenize_function, batched=True)
cleaned_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
print("Tokenization done.")
print(f"Tokenized dataset columns: {cleaned_ds.column_names}")

Initializing tokenizer...
Tokenizing cleaned data...
Tokenization done.
Tokenized dataset columns: ['cleaned_narrative', 'label', 'input_ids', 'attention_mask']


In [None]:
# 6. Split data
print("Splitting cleaned data for fine-tuning...")
train_testvalid = cleaned_ds.train_test_split(test_size=0.2, stratify_by_column='label', seed=42)
test_valid = train_testvalid['test'].train_test_split(test_size=0.5, stratify_by_column='label', seed=42)
train_ds = train_testvalid['train']
val_ds = test_valid['train']
test_ds = test_valid['test']
print(f"Train set: {len(train_ds)} samples")
print(f"Validation set: {len(val_ds)} samples")
print(f"Test set: {len(test_ds)} samples")


Splitting cleaned data for fine-tuning...
Train set: 306809 samples
Validation set: 38351 samples
Test set: 38352 samples


In [None]:
# 7. Load model
print("Loading DistilBERT model...")
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=num_labels)
model.to('cuda')

Loading DistilBERT model...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [None]:
# 8. Compute class weights
print("Calculating class weights...")
y_for_weights = np.array(cleaned_ds['label'], dtype=int)
class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_for_weights), y=y_for_weights)
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float).to('cuda')

Calculating class weights...
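
For reference, the `'balanced'` heuristic used in the cell above gives each class the weight `n_samples / (n_classes * class_count)`, so rarer classes receive larger weights. The plain-Python sketch below reproduces that formula on made-up labels; the helper name `balanced_class_weights` and the toy data are illustrative assumptions, not part of the project code.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for a list of labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}

# Toy, made-up labels: class 0 is common, class 2 is rare.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels)

for cls, w in weights.items():
    print(f"class {cls}: weight {w:.3f}")
```

A class whose frequency equals the average class frequency gets a weight of 1, which is why the weights scale the loss without changing its overall magnitude much.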


In [None]:
# 9. Custom Trainer
class WeightedLossTrainer(Trainer):
    def __init__(self, *args, class_weights_tensor=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights_tensor = class_weights_tensor

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get('logits')
        loss_fct = torch.nn.CrossEntropyLoss(weight=self.class_weights_tensor.to(logits.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
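
To make the effect of the `weight` argument concrete, here is a minimal plain-Python sketch of a class-weighted cross-entropy with mean reduction, where the weighted per-sample losses are divided by the sum of the weights of the target classes. The function name, logits, and weights are made-up illustrations; the project itself relies on `torch.nn.CrossEntropyLoss` as shown above.

```python
import math

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-sample cross-entropy.

    Each sample's loss is scaled by the weight of its true class, and the total
    is normalised by the sum of those weights (mean reduction with weights).
    """
    total, weight_sum = 0.0, 0.0
    for row, y in zip(logits, labels):
        # Numerically stable log-sum-exp over the logits of this sample.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]  # negative log-probability of the true class
        total += class_weights[y] * nll
        weight_sum += class_weights[y]
    return total / weight_sum

# Toy batch: sample 0 (class 0) is predicted confidently, sample 1 (class 1) is not.
toy_logits = [[2.0, 0.0], [0.0, 0.0]]
toy_labels = [0, 1]

uniform = weighted_cross_entropy(toy_logits, toy_labels, [1.0, 1.0])
reweighted = weighted_cross_entropy(toy_logits, toy_labels, [0.5, 2.0])
print(f"uniform weights: {uniform:.4f}, rare class up-weighted: {reweighted:.4f}")
```

With the rare class up-weighted, the poorly predicted rare-class sample dominates the batch loss, which is the mechanism that pushes the model toward better minority-class recall.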

In [None]:
# 10. Training arguments
output_dir = os.path.join(load_path, f"distilbert-finetuned-merged-{datetime.now().strftime('%Y%m%d-%H%M%S')}")
print(f"Output directory: {output_dir}")
training_args = TrainingArguments(
    output_dir=output_dir,
    eval_strategy='epoch',
    save_strategy='epoch',
    logging_strategy='steps',
    logging_steps=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_steps=100,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='eval_macro_f1',
    greater_is_better=True,
    save_total_limit=2,
    fp16=True,
    seed=42,
    report_to="none",
    dataloader_num_workers=0
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=-1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'macro_f1': f1_score(labels, preds, average='macro'),
        'weighted_f1': f1_score(labels, preds, average='weighted')
    }

Output directory: /content/drive/MyDrive/Data Science course/Major Projects/Projects/Smart Support NLP - Major/distilbert-finetuned-merged-20251008-060259


In [None]:
# 11. Start training
print("Initializing WeightedLossTrainer and starting training...")
trainer = WeightedLossTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    class_weights_tensor=class_weights_tensor
)
trainer.train()
print("Training done.")

Initializing WeightedLossTrainer and starting training...


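| Epoch | Training Loss | Validation Loss | Accuracy | Macro F1 | Weighted F1 |
|-------|---------------|-----------------|----------|----------|-------------|
| 1     | 0.8515        | 0.817943        | 0.690595 | 0.647663 | 0.69046     |
| 2     | 0.951         | 0.777135        | 0.719043 | 0.674091 | 0.720365    |
| 3     | 0.462         | 0.801859        | 0.74032  | 0.688505 | 0.742916    |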


Training done.
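
The log above is worth reading against the checkpoint-selection settings (`metric_for_best_model='eval_macro_f1'`, `early_stopping_patience=2`). The sketch below replays the logged validation metrics through a simplified version of that logic. The patience rule here is an approximation written for illustration, not the library's implementation.

```python
# Metrics copied from the training log above: (epoch, validation loss, macro F1).
history = [
    (1, 0.817943, 0.647663),
    (2, 0.777135, 0.674091),
    (3, 0.801859, 0.688505),
]

def simulate_early_stopping(history, patience):
    """Simplified patience rule: stop after `patience` evaluations with no new best macro F1.

    Returns (best_epoch, stopped_at), where stopped_at is None if training ran to the end.
    """
    best_epoch, best_score, bad_evals = None, float("-inf"), 0
    for epoch, _val_loss, macro_f1 in history:
        if macro_f1 > best_score:
            best_epoch, best_score, bad_evals = epoch, macro_f1, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_epoch, epoch
    return best_epoch, None

best_epoch, stopped_at = simulate_early_stopping(history, patience=2)
lowest_loss_epoch = min(history, key=lambda row: row[1])[0]
print(f"best by macro F1: epoch {best_epoch}, stopped early at: {stopped_at}")
print(f"lowest validation loss: epoch {lowest_loss_epoch}")
```

Macro F1 improves at every epoch, so the patience counter never advances and the epoch 3 checkpoint is the one kept, even though validation loss is lowest at epoch 2. Selecting on validation loss instead would have kept a different checkpoint.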


In [None]:
# 12. Save model
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Model and tokenizer saved to {output_dir}")

Model and tokenizer saved to /content/drive/MyDrive/Data Science course/Major Projects/Projects/Smart Support NLP - Major/distilbert-finetuned-merged-20251008-060259


In [None]:
# 13. Evaluate on test set
print("Evaluating on test set...")
results = trainer.evaluate(test_ds)
print(f"Test results: {results}")

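preds = trainer.predict(test_ds)
y_pred = np.argmax(preds.predictions, axis=1)
y_true = preds.label_ids
# Restrict the report to labels present in the test set so that
# label indices and class names stay aligned even if a class is missing
labels_present = np.unique(y_true)
target_names = label_encoder.classes_[labels_present]

print("\nClassification Report:")
print(classification_report(y_true, y_pred, labels=labels_present, target_names=target_names))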

print("Fine-tuning DistilBERT with merged classes completed successfully.")

Evaluating on test set...


Test results: {'eval_loss': 0.8072065711021423, 'eval_accuracy': 0.7381883604505632, 'eval_macro_f1': 0.6866992855344299, 'eval_weighted_f1': 0.7406354100531907, 'eval_runtime': 52.9127, 'eval_samples_per_second': 724.816, 'eval_steps_per_second': 45.301, 'epoch': 3.0}

Classification Report:
                                                                              precision    recall  f1-score   support

                                                     Bank account or service       0.55      0.54      0.54      1517
                                                 Checking or savings account       0.57      0.58      0.57      1288
                                                                 Credit card       0.51      0.64      0.57      1883
                                                 Credit card or prepaid card       0.59      0.59      0.59      2283
                                                            Credit reporting       0.51      0.71      0.59      31

## Fine-tuning DistilBERT with Merged Product Classes

This section presents the method and results of fine-tuning a pre-trained DistilBERT model on the full cleaned dataset (383,512 complaints), with rare product classes merged to reduce class imbalance. It builds on the earlier experiments with FastText embeddings and the initial transformer fine-tuning on a smaller subset.

**Modeling Approach and Configuration:**

The core of this phase involved leveraging the transfer learning capabilities of a pre-trained transformer model.

*   **Base Model:** DistilBERT (specifically `distilbert-base-uncased`) was selected as the base architecture. DistilBERT is a smaller, faster, and lighter version of BERT, making it suitable for environments with computational constraints while retaining a significant portion of BERT's language understanding capabilities.
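*   **Task Adaptation:** The pre-trained DistilBERT model was adapted for sequence classification with `DistilBertForSequenceClassification`, which adds a newly initialized classification head (a `pre_classifier` linear layer followed by a `classifier` linear layer) on top of the encoder. The head outputs one logit per target product class; predictions are taken as the argmax over these logits.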
*   **Data Strategy:** The full cleaned dataset was utilized. Prior to model training, a critical data preprocessing step involved **merging rare product classes** into more frequent or semantically related categories. This reduced the number of distinct classes from 18 to 12, effectively increasing the sample size for the merged categories and creating a more favorable class distribution for training. The data was then split into stratified training, validation, and test sets to ensure representative class distribution across splits.
*   **Tokenization:** The standard `DistilBertTokenizerFast` was used to tokenize the complaint narratives, applying truncation and padding to a fixed maximum sequence length of 200 tokens.
*   **Class Imbalance Handling:** To address the class imbalance that remains after merging, **balanced class weights** were calculated from the label distribution of the full merged dataset (`cleaned_ds`, before splitting). Because the splits are stratified, this distribution matches that of the training set. The weights were passed to the cross-entropy loss through a custom `Trainer` subclass (`WeightedLossTrainer`), so misclassified minority-class samples incur a higher penalty.
*   **Training Configuration:** The model was fine-tuned using the Hugging Face `Trainer` with the following key `TrainingArguments`:
    *   **Optimizer:** AdamW (the `Trainer` default, since no optimizer was specified) with a small learning rate (2e-5), standard for fine-tuning to avoid disrupting pre-trained weights.
    *   **Batch Size:** 16 per device for both training and evaluation.
    *   **Epochs:** Trained for 3 epochs, with an `EarlyStoppingCallback` monitoring `eval_macro_f1` with a patience of 2 evaluations. Validation macro F1 improved at every epoch (0.648, 0.674, 0.689), so early stopping never triggered and all 3 epochs ran.
    *   **Evaluation and Saving Strategy:** Set to evaluate and save the model checkpoint at the end of each epoch (`eval_strategy='epoch'`, `save_strategy='epoch'`).
    *   **Metric for Best Model:** `eval_macro_f1` was chosen as the metric to determine the best model to load at the end of training, prioritizing performance balance across all classes over overall accuracy.
    *   **Other:** Warmup steps (100), weight decay (0.01), FP16 mixed precision for faster training, and a fixed random seed (42) for reproducibility were also configured.
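
To give a feel for how the learning rate and warmup settings interact, the sketch below computes a linear warmup followed by linear decay to zero. It assumes the `Trainer` default linear scheduler, a single device, and no gradient accumulation; none of these are stated explicitly in the notebook. The function name is hypothetical, and the values (`learning_rate=2e-5`, `warmup_steps=100`, batch size 16, 3 epochs, 306,809 training samples) are taken from the cells above.

```python
import math

def linear_warmup_decay_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumptions: one device, no gradient accumulation, incomplete last batch kept.
train_samples, batch_size, epochs = 306809, 16, 3
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")

for step in (0, 50, 100, total_steps // 2, total_steps):
    lr = linear_warmup_decay_lr(step, 2e-5, 100, total_steps)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

Under these assumptions the warmup covers well under 1% of training, so for nearly the whole run the learning rate is simply decaying from 2e-5 toward zero.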

**Evaluation and Performance Analysis:**

The fine-tuned model was evaluated on the held-out test set. The key performance indicators provide insights into the model's effectiveness in classifying consumer complaints across the merged product categories.

*   **Test Accuracy: 0.7382** - About 73.8% of the test complaints were assigned the correct merged product category, up from 0.5895 for fine-tuning on the unmerged 20k sample and 0.6932 on the full unmerged data. Because the label set shrank from 18 to 12 classes, part of this gain reflects a simpler task rather than a better model, so the figures are not strictly like-for-like.
*   **Macro F1-score: 0.6867** - The Macro F1-score, which is the unweighted average of the F1-scores for each individual merged class, is a critical metric for evaluating performance on imbalanced datasets. A Macro F1 of 0.687 signifies a significant improvement over the previous fine-tuning attempts (0.4471 on 20k sample, 0.5289 on full unmerged data). This indicates that the merging strategy, combined with class weighting, has been effective in improving the model's ability to classify minority classes more accurately, leading to a more balanced performance across all categories.
*   **Weighted F1-score: 0.7406** - The Weighted F1-score, which averages the F1-scores weighted by the number of samples in each class, is closer to the overall accuracy (0.7382). This metric is more influenced by the performance on the larger classes. A Weighted F1 of 0.741 suggests strong performance on the majority merged categories.
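
The difference between the two averages can be seen on a tiny example. The plain-Python sketch below computes per-class F1, then the macro (unweighted) and weighted (support-weighted) averages. The labels are made up, and the helper names are illustrative; the project itself uses `sklearn.metrics.f1_score`. For simplicity the sketch only considers classes that appear in `y_true`.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: (f1, support)} for every class that appears in y_true."""
    stats = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        stats[cls] = (f1, tp + fn)  # support = number of true samples of this class
    return stats

def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 averages classes equally; weighted F1 averages them by support."""
    stats = per_class_f1(y_true, y_pred)
    macro = sum(f1 for f1, _ in stats.values()) / len(stats)
    total = sum(n for _, n in stats.values())
    weighted = sum(f1 * n for f1, n in stats.values()) / total
    return macro, weighted

# Made-up labels: class "A" is frequent and easy, class "B" is rare and harder.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

Here the weaker F1 on the rare class pulls the macro average down more than the weighted one, which is the same pattern as the gap between 0.687 and 0.741 reported above.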

**Classification Report Deep Dive:**

The detailed classification report (printed by the test-set evaluation cell above) provides a per-class breakdown of precision, recall, and F1-score for the 12 merged classes. Analyzing this report reveals:

*   **Improved Minority Class Performance:** Compared to the classification reports from fine-tuning on the unmerged data, the F1-scores for classes that were previously very rare have significantly improved. For example, categories like 'Money transfer, virtual currency, or money service', 'Payday loan, title loan, or personal loan', and 'Vehicle loan or lease' show much more respectable precision, recall, and F1-scores. This directly reflects the positive impact of merging and class weighting on the model's ability to handle less frequent cases.
*   **Strong Performance on Majority Classes:** Classes like 'Credit reporting, credit repair services, or other personal consumer reports', 'Debt collection', and 'Mortgage' continue to exhibit high precision, recall, and F1-scores, benefiting from both a larger number of samples and the powerful features learned by the transformer model.
*   **Balanced Performance:** The reduced gap between the Macro F1-score (0.687) and the Weighted F1-score (0.741) compared to previous models (e.g., 0.45 vs 0.59 on 20k sample) is a strong indicator that the model's performance is now much more balanced across all merged classes, rather than being heavily skewed towards the largest categories.

**Conclusion of this Fine-tuning Phase:**

The fine-tuning of DistilBERT on the full dataset with the implemented class merging strategy has yielded significantly improved and more balanced classification performance. The notable increase in Macro F1-score demonstrates the effectiveness of addressing class imbalance through data manipulation (merging) and weighted loss. This model represents the best performance achieved so far in this project.



## Project Conclusion: End-to-End Consumer Complaint Classification

This project successfully developed and evaluated several natural language processing models for classifying CFPB consumer complaints into product categories. The journey progressed from foundational techniques to state-of-the-art deep learning, showcasing an end-to-end approach to tackling a real-world text classification problem with imbalanced data.

**Project Progression and Methodologies:**

1.  **Data Preparation and Exploration:** The project began with loading, cleaning, and preprocessing consumer complaint narratives. Initial data exploration revealed the inherent class imbalance in the product categories, a key challenge addressed throughout the modeling phases.
2.  **Traditional and Deep Learning Baselines:** Early modeling efforts established baselines using traditional embeddings (FastText) with a Feedforward Neural Network and a BiLSTM/CNN architecture. These models provided initial insights into the data complexity and the limitations of simpler approaches on this dataset, particularly concerning minority classes.
3.  **Exploring Attention Mechanisms:** An attention layer was integrated with the BiLSTM model to investigate its impact on model focus and performance. Experiments with and without class weighting at this stage highlighted the complexities of handling imbalance with deep learning and the nuanced effects of weighting on overall vs. per-class performance.
4.  **Advancing to Transformer Fine-tuning:** To take advantage of pre-trained transformer language models, the project moved on to fine-tuning DistilBERT. This was a significant step, using transfer learning to benefit from the linguistic knowledge acquired during DistilBERT's pre-training.
5.  **Addressing Imbalance with Full Data and Merging:** The fine-tuning was first performed on a smaller data sample and then scaled up to the full cleaned dataset. To further mitigate class imbalance, a data-centric strategy of merging rare product classes into more frequent, related categories was successfully implemented. Balanced class weights were incorporated during training on the merged dataset to ensure the model did not become overly biased towards majority classes.

**Key Outcomes and Performance:**

The fine-tuning of DistilBERT on the full dataset with merged classes and class weighting yielded the best performance metrics observed throughout the project. The improved **Macro F1-score** (0.687) compared to earlier models demonstrated a significant step towards more balanced performance across all product categories, including those that were previously rare. The high **Weighted F1-score** (0.741) and **Test Accuracy** (0.738) indicated strong overall classification capability, particularly for the more prevalent merged classes.

**Demonstrated Skills:**

This project effectively showcases a range of essential data science and NLP skills:

*   **NLP Fundamentals:** Data cleaning, preprocessing, tokenization, and using word embeddings (FastText).
*   **Deep Learning:** Building and experimenting with sequential models (BiLSTM), implementing custom layers (Attention), and understanding the effects of techniques like class weighting.
*   **Transfer Learning & Fine-tuning:** Applying pre-trained transformer models (DistilBERT) to a specific downstream task.
*   **Transformer Architectures:** Understanding the benefits and application of transformer-based models for complex NLP tasks.
*   **Data Handling & Imbalance:** Strategies for handling large datasets, sampling, splitting data, and explicitly addressing class imbalance through data manipulation (merging) and algorithmic techniques (class weighting).
*   **Model Evaluation and Interpretation:** Utilizing comprehensive metrics (Accuracy, Precision, Recall, F1-score, Macro/Weighted F1) and interpreting classification reports and confusion matrices to understand model strengths and weaknesses.
*   **Iterative Development:** Progressing through different modeling approaches, evaluating results, and refining the strategy based on observations.

**Final Thoughts:**

The fine-tuned DistilBERT model with merged classes provides a robust solution for classifying consumer complaints. While opportunities for further refinement might exist (e.g., exploring other transformer models, different merging strategies), the current model represents a significant achievement in building an effective and more balanced classifier for this imbalanced text dataset. This project serves as a solid foundation for potential deployment or further exploration into aspects like model interpretability or real-time inference.

This marks the conclusion of the development and evaluation phase of this consumer complaint classification project.