# Problem Statement: **Customer Support Ticket Classification**

Your mission is to develop an automated system that classifies each customer complaint into one of the following classes:

	1) Billing and Payments
	2) Customer Service
	3) General Inquiry
	4) Human Resources
	5) IT Support
	6) Product Support
	7) Returns and Exchanges
	8) Sales and Pre-Sales
	9) Service Outages and Maintenance
	10) Technical Support

We will be using "Customer Support on Twitter" for this problem.

**References:**

* Dataset: [Link](https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter)

Credits: https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter

The spaCy NLP library is used for:

	- Lowercasing
	- Punctuation removal
	- Stopword removal
	- Lemmatization

For tokenization, embeddings, and classification, **BERT** is used.

**Note: Because spaCy's static word vectors do NOT capture context, the BERT transformer model is chosen to produce contextual embeddings.**

#### 1) Load the dataset:

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
#import pandas library
import pandas as pd

# Path to the dataset stored in Google Drive
file_path = "/content/drive/MyDrive/Colab Notebooks/datasets/aa_dataset-tickets-multi-lang-5-2-50-version.csv"

#read the dataset at file_path and load it into dataframe "df"
df= pd.read_csv(file_path)

#print the shape of data
print(df.shape)

#Filter English tickets only
df = df[df['language'] == 'en']

#print the top5 rows
df.head()

#print(df['queue'].value_counts())

(28587, 16)


Unnamed: 0,subject,body,answer,type,queue,priority,language,version,tag_1,tag_2,tag_3,tag_4,tag_5,tag_6,tag_7,tag_8
1,Account Disruption,"Dear Customer Support Team,\n\nI am writing to...","Thank you for reaching out, <name>. We are awa...",Incident,Technical Support,high,en,51,Account,Disruption,Outage,IT,Tech Support,,,
2,Query About Smart Home System Integration Feat...,"Dear Customer Support Team,\n\nI hope this mes...",Thank you for your inquiry. Our products suppo...,Request,Returns and Exchanges,medium,en,51,Product,Feature,Tech Support,,,,,
3,Inquiry Regarding Invoice Details,"Dear Customer Support Team,\n\nI hope this mes...",We appreciate you reaching out with your billi...,Request,Billing and Payments,low,en,51,Billing,Payment,Account,Documentation,Feedback,,,
4,Question About Marketing Agency Software Compa...,"Dear Support Team,\n\nI hope this message reac...",Thank you for your inquiry. Our product suppor...,Problem,Sales and Pre-Sales,medium,en,51,Product,Feature,Feedback,Tech Support,,,,
5,Feature Query,"Dear Customer Support,\n\nI hope this message ...",Thank you for your inquiry. Please specify whi...,Request,Technical Support,high,en,51,Feature,Product,Documentation,Feedback,,,,


#### **2) Preprocessing: Load spaCy model**

In [None]:
#run this cell to install the large English spaCy model (en_core_web_lg)
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 400.7/400.7 MB 1.1 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
#import spacy and load the language model downloaded
import spacy

nlp = spacy.load("en_core_web_lg")

#### **3) Preprocess the text using the spaCy pipeline**

In [None]:
def preprocess_text(text):
  doc = nlp(text)
  tokens = []
  for token in doc:
    # Remove punctuation, spaces, stop words
    if token.is_punct or token.is_space or token.is_stop:
      continue

    # Lemmatize (normalize words to base form) and lower case
    tokens.append(token.lemma_.lower())

  return " ".join(tokens)


clean_text = preprocess_text("India has won the world cup after 28 years. That's a PHENOMINAL achievement!!!")
print(clean_text)

india win world cup 28 year phenominal achievement


#### **MODEL 1: Logistic Regression (Baseline)**

In [None]:
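from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    precision_recall_fscore_support,
)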

# TF-IDF Vectorization
tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1,2)
)

# Preprocess text and add it in df
df['processed_text'] = df['body'].apply(preprocess_text)

# Encode target label into numbers
label_encoder = LabelEncoder()
df['labels'] = label_encoder.fit_transform(df['queue'])

X = df['processed_text']
y = df['labels']

X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train_tfidf = tfidf.fit_transform(X_train_lr)
X_test_tfidf = tfidf.transform(X_test_lr)

# Logistic Regression model
lr_model = LogisticRegression(
    max_iter=1000,
    class_weight=None,  # IMPORTANT: no class weighting for the baseline
    n_jobs=-1
)

# Train the logistic regression model
lr_model.fit(X_train_tfidf, y_train_lr)

# Predictions
y_pred_lr = lr_model.predict(X_test_tfidf)

# Score against the labels from this cell's split (y_test_lr)
lr_acc = accuracy_score(y_test_lr, y_pred_lr)
lr_balanced_acc = balanced_accuracy_score(y_test_lr, y_pred_lr)
lr_precision, lr_recall, lr_f1, _ = precision_recall_fscore_support(
    y_test_lr, y_pred_lr, average='weighted'
)
lr_macro_f1 = precision_recall_fscore_support(
    y_test_lr, y_pred_lr, average='macro'
)[2]

print(classification_report(
    y_test_lr,
    y_pred_lr,
    target_names=label_encoder.classes_
))

#### **MODEL 2: BERT BASE (No class weighting, No Oversampling)**

**Version 1:** Inference only with Hugging Face Transformers (no training loop). Use this approach when:
  - You are doing inference predictions only
  - You want fast prototyping
  - You want minimal code
  - You don't care about custom layers or tensors


In [None]:
# import transformers library
from transformers import BertTokenizer, BertForSequenceClassification
import torch

In [None]:
# BERT Classification pipeline

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=10  # One output per ticket class listed in the problem statement
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### **Inference only (spaCy → BERT pipeline)**


In [None]:
def classify_ticket(text):

    # 1. Preprocess using spaCy
    clean_text = preprocess_text(text)

    # 2. Tokenize with BERT tokenizer
    inputs = tokenizer(
        clean_text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=128
    )

    # 3. Forward pass through BERT model
    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

    return predicted_class

In [None]:
result = classify_ticket("Unable to swipe ID card")
print(result)
print(model.config.id2label[result])


#### **Tokenization and Classification using BERT**
**Version 2**: With PyTorch. Use this approach when:

- You want to train or fine-tune BERT
- You want full control over:
    - Model architecture
    - Gradients
    - Optimizers
    - Loss functions
- You want to experiment deeply with NLP

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from torch.utils.data import DataLoader
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

from sklearn.metrics import (
    classification_report,
    accuracy_score,
    balanced_accuracy_score,  # used by compute_metrics below
    precision_recall_fscore_support,
)

import numpy as np

In [None]:
# Check if CUDA (GPU) is available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Using device: {device}")

Using device: cuda


#### **Train validation test Split**

In [None]:
# Split data
train_df, test_df = train_test_split(
    df[['processed_text', 'labels']],  # DataFrame with both features and labels
    test_size=0.2,
    random_state=42,
    stratify=df['labels']
)

# Further split train into train and validation
train_df, val_df = train_test_split(
    train_df,
    test_size=0.1,  # 10% of train for validation
    random_state=42,
    stratify=train_df['labels']
)
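
The two-stage split above produces roughly a 72% / 8% / 20% train / validation / test partition. As a quick sanity check, the sketch below reproduces the split sizes by hand. It assumes scikit-learn's rule of rounding a fractional `test_size` up, and the total of 16338 rows is an assumption obtained by adding the three dataset sizes shown in the tokenization progress output (11763 + 1307 + 3268). The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# Assumed total: sum of the train/val/test sizes reported by the Map progress output
n_total = 16338

n_train_full, n_test = split_sizes(n_total, 0.2)   # first split: 80% / 20%
n_train, n_val = split_sizes(n_train_full, 0.1)    # second split: 90% / 10% of the 80%

print(n_train, n_val, n_test)  # 11763 1307 3268
```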

#### Convert to HF Dataset:

In [None]:
# Convert DF to HF Dataset
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

#### Load Tokenizer

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

#### Define Tokenization function with padding and truncation


In [None]:
def tokenize_function(examples):
  """
  Tokenize the processed_text column
  """
  return tokenizer(
        examples['processed_text'],
        truncation=True,
        padding='max_length',  # Static padding
        max_length=256,        # Adjust based on your text length
        return_tensors=None    # Don't return tensors yet
    )

#### Apply tokenization to train, validation and test datasets

In [None]:
tokenized_train_dataset = train_dataset.map(
    tokenize_function,
    batched=True  # Process in batches (faster)
)

tokenized_val_dataset = val_dataset.map(
    tokenize_function,
    batched=True  # Process in batches (faster)
)

tokenized_test_dataset = test_dataset.map(
    tokenize_function,
    batched=True
)

Map:   0%|          | 0/11763 [00:00<?, ? examples/s]

Map:   0%|          | 0/1307 [00:00<?, ? examples/s]

Map:   0%|          | 0/3268 [00:00<?, ? examples/s]

In [None]:
# Check a tokenized example
example = tokenized_train_dataset[0]
print(example)

{'processed_text': 'system experience periodic slowdown peak hour likely server overload attempt address restart server implement load balancing', 'labels': 9, '__index_level_0__': 2652, 'input_ids': [101, 2291, 3325, 15861, 4030, 7698, 4672, 3178, 3497, 8241, 2058, 11066, 3535, 4769, 23818, 8241, 10408, 7170, 20120, 102, 0, 0, 0, ... (padding zeros continue; output truncated)

#### CRITICAL UNDERSTANDING: DataLoader is NOT NEEDED with the Trainer API!

What Trainer does internally:
1. Creates DataLoaders automatically from your datasets
2. Uses batch_size from TrainingArguments
3. Handles shuffling (on for training, off for evaluation)
4. Uses data_collator for batching
5. Manages all the iteration logic

Trainer API flow:

Dataset → Trainer (creates DataLoader internally) → Training

You handle:
- Dataset preparation
- Tokenization
- TrainingArguments
- Trainer creation

Trainer handles:
- DataLoader creation
- Batching
- Shuffling
- Device placement
- Training loop
- Evaluation
- Checkpointing
#### Load the BERT BASE model (no class weights)

In [None]:
num_classes = len(label_encoder.classes_)

bert_base_model_no_weights  = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = num_classes

)
print(f"Model loaded with {num_classes} classes")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded with 10 classes


#### Data collator for dynamic padding

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
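
Note that `tokenize_function` already pads every example to `max_length=256`, so by the time the collator sees a batch all sequences have the same length and there is nothing left to pad dynamically. The plain-Python sketch below contrasts the two strategies; `pad_static` and `pad_dynamic` are hypothetical helpers, and `PAD_ID = 0` matches the trailing zeros in the tokenized example above.

```python
PAD_ID = 0  # padding id, as seen in the trailing zeros of the tokenized example

def pad_static(sequences, max_length):
    """Pad every sequence to a fixed length, regardless of the batch contents."""
    return [seq + [PAD_ID] * (max_length - len(seq)) for seq in sequences]

def pad_dynamic(sequences):
    """Pad every sequence only up to the longest sequence in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [PAD_ID] * (longest - len(seq)) for seq in sequences]

batch = [[101, 2291, 3325, 102], [101, 8241, 102]]

static_batch = pad_static(batch, max_length=256)
dynamic_batch = pad_dynamic(batch)

print(len(static_batch[0]), len(dynamic_batch[0]))  # 256 4
```

If dynamic padding is the goal, one option is to drop `padding='max_length'` from `tokenize_function` and let the collator pad each batch; short tickets then cost far fewer padded positions per step.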

#### Define Compute Metrics Function

In [None]:
def compute_metrics(eval_pred):
    """
    Compute metrics for evaluation

    Args:
        eval_pred: tuple of (predictions, labels)

    Returns:
        Dictionary of metrics
    """
    logits, labels = eval_pred

    # Get predicted class (argmax of logits)
    predictions = np.argmax(logits, axis=1)

    # Calculate metrics
    accuracy = accuracy_score(labels, predictions)
    balanced_accuracy = balanced_accuracy_score(labels, predictions)
    
    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
        labels,
        predictions,
        average='macro'
    )

    precision_weighted, recall_weighted, f1_weighted, _ = precision_recall_fscore_support(
        labels,
        predictions,
        average='weighted'
    )

    return {
        'accuracy': accuracy,
        'balanced_accuracy': balanced_accuracy,
        'f1_macro': f1_macro,
        'f1_weighted': f1_weighted
    }
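
Balanced accuracy is the mean of the per-class recalls, which is why it is reported next to plain accuracy for an imbalanced label set. The toy sketch below (hypothetical data, standard library only) shows how a model that always predicts the majority class can look good on accuracy while balanced accuracy exposes the problem.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)

# Toy data: 8 examples of class 0, 2 of class 1; the model always predicts class 0
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```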


#### Define TrainingArguments

In [None]:
training_args_no_weights = TrainingArguments(
    # Output directory
    output_dir='./results_no_weights',

    # Training hyperparameters
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=3e-5,
    weight_decay=0.01,

    # Mixed-precision training (FP16): usually faster on GPUs that support it
    fp16=torch.cuda.is_available(),  # Automatic: True if GPU, False if CPU

    # DataLoader settings
    dataloader_num_workers=4,  # Parallel data loading
    dataloader_pin_memory=True,  # Faster data transfer to GPU

    # Evaluation strategy
    eval_strategy="epoch",  # Evaluate at the end of each epoch
    save_strategy="epoch",   # Save checkpoint at the end of each epoch

    # Logging
    logging_dir='./logs',
    logging_strategy="steps",
    logging_steps=100,

    warmup_ratio=0.1,
    lr_scheduler_type="linear",

    # Model saving
    load_best_model_at_end=True,  # Load best model at end
    metric_for_best_model="accuracy",  # Metric to determine best model
    greater_is_better=True,

    # Reproducibility
    seed=42,
)
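
The combination `warmup_ratio=0.1` with `lr_scheduler_type="linear"` means the learning rate climbs linearly from 0 to `learning_rate` over the first 10% of steps and then decays linearly back to 0. The following is a minimal sketch of that shape, not the library's scheduler. The step count assumes a single device, no gradient accumulation, and the 11763 training examples reported by the tokenization output; `linear_warmup_decay` is a hypothetical helper.

```python
import math

def linear_warmup_decay(step, total_steps, warmup_steps, base_lr):
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumptions: 11763 training examples, batch size 16, 5 epochs, one device
steps_per_epoch = math.ceil(11763 / 16)
total_steps = 5 * steps_per_epoch
warmup_steps = math.ceil(0.1 * total_steps)

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = linear_warmup_decay(step, total_steps, warmup_steps, base_lr=3e-5)
    print(f"step {step:>5}: lr = {lr:.2e}")
```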

#### Create Trainer

In [None]:
trainer_no_weights = Trainer(
    model=bert_base_model_no_weights,
    args=training_args_no_weights,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)



#### Train the BERT BASE model

In [None]:
train_result_no_weights = trainer_no_weights.train()





| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 1.6619 | 1.600457 | 0.440704 | 0.376446 | 0.440704 | 0.385099 |
| 2 | 1.532 | 1.495754 | 0.482785 | 0.470085 | 0.482785 | 0.434945 |
| 3 | 1.2872 | 1.37846 | 0.522571 | 0.534016 | 0.522571 | 0.492765 |
| 4 | 1.034 | 1.324733 | 0.55241 | 0.540982 | 0.55241 | 0.534225 |
| 5 | 0.8529 | 1.303828 | 0.581484 | 0.57213 | 0.581484 | 0.567528 |




#### Evaluate BERT Base on the Test Dataset

In [None]:
test_results_no_weights = trainer_no_weights.evaluate(tokenized_test_dataset)





#### Get predictions from the BERT Base (no class weights) model on the test dataset

In [None]:
predictions_output_no_weights = trainer_no_weights.predict(tokenized_test_dataset)
# Extract predictions and labels
predictions_no_weights = np.argmax(predictions_output_no_weights.predictions, axis=1)
true_labels = predictions_output_no_weights.label_ids



#### Classification Report BERT Base

In [None]:
report = classification_report(
    true_labels,
    predictions_no_weights,
    target_names=label_encoder.classes_,
    digits=4
)
print(report)

                                 precision    recall  f1-score   support

           Billing and Payments     0.8322    0.7931    0.8122       319
               Customer Service     0.4746    0.4461    0.4599       482
                General Inquiry     1.0000    0.0213    0.0417        47
                Human Resources     0.9412    0.2286    0.3678        70
                     IT Support     0.4836    0.4562    0.4695       388
                Product Support     0.4552    0.5122    0.4820       615
          Returns and Exchanges     0.5652    0.3171    0.4062       164
            Sales and Pre-Sales     0.5435    0.2427    0.3356       103
Service Outages and Maintenance     0.7200    0.6767    0.6977       133
              Technical Support     0.5708    0.7064    0.6314       947

                       accuracy                         0.5548      3268
                      macro avg     0.6586    0.4400    0.4704      3268
                   weighted avg     0.5691    0.5
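
The summary rows of the report follow directly from the per-class rows: the macro average weights every class equally, while the weighted average scales each class by its support, which for recall is the same thing as overall accuracy. The sketch below recomputes both from the recall and support columns printed above; small differences in the last digit are possible because the inputs are already rounded to four decimals.

```python
# (recall, support) per class, copied from the classification report above
report_rows = {
    "Billing and Payments": (0.7931, 319),
    "Customer Service": (0.4461, 482),
    "General Inquiry": (0.0213, 47),
    "Human Resources": (0.2286, 70),
    "IT Support": (0.4562, 388),
    "Product Support": (0.5122, 615),
    "Returns and Exchanges": (0.3171, 164),
    "Sales and Pre-Sales": (0.2427, 103),
    "Service Outages and Maintenance": (0.6767, 133),
    "Technical Support": (0.7064, 947),
}

total_support = sum(support for _, support in report_rows.values())

# Macro average: plain mean over classes, so rare classes count as much as common ones
macro_recall = sum(recall for recall, _ in report_rows.values()) / len(report_rows)

# Weighted average: each class contributes in proportion to its support
weighted_recall = sum(r * s for r, s in report_rows.values()) / total_support

print(total_support, round(macro_recall, 4), round(weighted_recall, 4))  # 3268 0.44 0.5548
```

The gap between 0.44 and 0.5548 comes from low-recall minority classes such as General Inquiry and Human Resources, which pull the macro figure down but barely move the weighted one.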

In [None]:
counts = df['labels'].value_counts()

for label_id, count in counts.items():
    print(f"Class {label_id} ({label_encoder.classes_[label_id]}): {count}")

#### **Handle Imbalance**

In [None]:
!pip install imbalanced-learn



In [None]:
from imblearn.over_sampling import RandomOverSampler

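# Re-create the same 80/20 split as above (same random_state and stratification).
# Caution: this training portion still contains the rows held out in val_df, so
# validation metrics for the oversampled run are likely optimistic; rely on the test set.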
X_train_original, X_test, y_train_original, y_test = train_test_split(
    df['processed_text'].values,
    df['labels'].values,
    test_size=0.2,
    random_state=42,
    stratify=df['labels']
)

# Apply oversampling
ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(
    X_train_original.reshape(-1, 1),
    y_train_original
)

X_train_resampled = X_train_resampled.flatten()

# Balanced dataset
train_dataset_balanced = Dataset.from_dict({
    'processed_text': X_train_resampled.tolist(),
    'labels': y_train_resampled.tolist()
})


In [None]:
print("\nAfter oversampling:")
print(f"  Shape: {X_train_resampled.shape}")
unique, counts = np.unique(y_train_resampled, return_counts=True)
for u, c in zip(unique, counts):
    print(f"  Class {u}: {c} samples")


After oversampling:
  Shape: (37900,)
  Class 0: 3790 samples
  Class 1: 3790 samples
  Class 2: 3790 samples
  Class 3: 3790 samples
  Class 4: 3790 samples
  Class 5: 3790 samples
  Class 6: 3790 samples
  Class 7: 3790 samples
  Class 8: 3790 samples
  Class 9: 3790 samples
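
Random oversampling simply duplicates minority-class examples until every class matches the largest one, which is why each class ends up with 3790 samples above. Below is a standard-library sketch of the idea on toy data. It is not imbalanced-learn's implementation, and `random_oversample` is a hypothetical helper.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    """Duplicate randomly chosen examples until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap to the majority-class size
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels

# Toy data: three billing tickets (class 0) and one outage ticket (class 1)
texts = ["refund please", "card declined", "invoice missing", "vpn down"]
labels = [0, 0, 0, 1]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))  # Counter({0: 3, 1: 3})
```

Because the added rows are exact copies, oversampling should be applied to the training portion only; copies that leak into validation or test data make scores look better than they are.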


#### Tokenize, Train and Evaluate the Oversampled BERT model

In [None]:

# Tokenize
tokenized_train_balanced = train_dataset_balanced.map(
    tokenize_function,
    batched=True
)

# Load a fresh BERT model for the oversampled run
bert_model_oversampled  = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=num_classes
)

# Training arguments
training_args_oversampled = TrainingArguments(
    # Output directory
    output_dir='./results_bert_oversampled',

    # Training hyperparameters
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=3e-5,
    weight_decay=0.01,

    # Mixed-precision training (FP16): usually faster on GPUs that support it
    fp16=torch.cuda.is_available(),  # Automatic: True if GPU, False if CPU

    # DataLoader settings
    dataloader_num_workers=4,  # Parallel data loading
    dataloader_pin_memory=True,  # Faster data transfer to GPU

    # Evaluation strategy
    eval_strategy="epoch",  # Evaluate at the end of each epoch
    save_strategy="epoch",   # Save checkpoint at the end of each epoch

    # Logging
    logging_dir='./logs',
    logging_strategy="steps",
    logging_steps=100,

    warmup_ratio=0.1,
    lr_scheduler_type="linear",

    # Model saving
    load_best_model_at_end=True,  # Load best model at end
    metric_for_best_model="accuracy",  # Metric to determine best model
    greater_is_better=True,

    # Reproducibility
    seed=42,
)

trainer_oversampled = Trainer(
    model=bert_model_oversampled,
    args=training_args_oversampled,
    train_dataset=tokenized_train_balanced,  # Balanced dataset
    eval_dataset=tokenized_val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

train_result_oversampled = trainer_oversampled.train()

# Evaluate on test
test_results_oversampled = trainer_oversampled.evaluate(tokenized_test_dataset)

# Predictions
predictions_output_oversampled = trainer_oversampled.predict(tokenized_test_dataset)
# Extract predictions and labels
predictions_oversampled = np.argmax(predictions_output_oversampled.predictions, axis=1)
true_labels = predictions_output_oversampled.label_ids


Map:   0%|          | 0/37900 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.8124 | 1.058339 | 0.570008 | 0.596079 | 0.570008 | 0.560906 |
| 2 | 0.4035 | 0.587185 | 0.768936 | 0.787841 | 0.768936 | 0.762799 |
| 3 | 0.2394 | 0.272344 | 0.89824 | 0.912512 | 0.89824 | 0.896292 |
| 4 | 0.1385 | 0.10433 | 0.9671 | 0.969349 | 0.9671 | 0.967347 |
| 5 | 0.0769 | 0.037478 | 0.986993 | 0.987158 | 0.986993 | 0.987014 |




TrainOutput(global_step=11845, training_loss=0.5158429384332208, metrics={'train_runtime': 2333.6926, 'train_samples_per_second': 81.202, 'train_steps_per_second': 5.076, 'total_flos': 2.4931563170304e+16, 'train_loss': 0.5158429384332208, 'epoch': 5.0})

#### Classification Report BERT oversampled

In [None]:
report_oversampled = classification_report(
    true_labels,
    predictions_oversampled,
    target_names=label_encoder.classes_,
    digits=4
)
print(report_oversampled)

#### **Final comparison of all models**

In [None]:
comparison_data = {
    'Model': [
        'Logistic Regression',
        'BERT (No Weighting)',
        'BERT + Oversampling'
    ],
    'Accuracy': [
        lr_acc,
        test_results_no_weights['eval_accuracy'],
        test_results_oversampled['eval_accuracy']
    ],
    'Balanced Accuracy': [
        lr_balanced_acc,
        test_results_no_weights['eval_balanced_accuracy'],
        test_results_oversampled['eval_balanced_accuracy']
    ],
    'Macro F1': [
        lr_macro_f1,
        test_results_no_weights['eval_f1_macro'],
        test_results_oversampled['eval_f1_macro']
    ],
    'Weighted F1': [
        lr_f1,
        test_results_no_weights['eval_f1_weighted'],
        test_results_oversampled['eval_f1_weighted']
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))

#### Save the final model

In [None]:
import os

#Set save path in Google Drive

model_dir  = '/content/drive/MyDrive/models/'

if not os.path.exists(model_dir):
    os.makedirs(model_dir)
    print(f"Created directory: {model_dir}")
else:
    print(f"Directory already exists: {model_dir}")

# Create project-specific subfolder
project_name = 'text_classification'
project_dir = os.path.join(model_dir, project_name)

if not os.path.exists(project_dir):
    os.makedirs(project_dir)
    print(f"Created project directory: {project_dir}")
else:
    print(f"Project directory already exists: {project_dir}")

save_path = "/content/drive/MyDrive/models/text_classification/ticket_classification_nlp_model"
# save_model is a Trainer method; predictions_oversampled is only an array of class ids
trainer_oversampled.save_model(save_path)
tokenizer.save_pretrained(save_path)

Created directory: /content/drive/MyDrive/models/
✓ Created project directory: /content/drive/MyDrive/models/text_classification


('/content/drive/MyDrive/models/text_classification/ticket_classification_nlp_model/tokenizer_config.json',
 '/content/drive/MyDrive/models/text_classification/ticket_classification_nlp_model/special_tokens_map.json',
 '/content/drive/MyDrive/models/text_classification/ticket_classification_nlp_model/vocab.txt',
 '/content/drive/MyDrive/models/text_classification/ticket_classification_nlp_model/added_tokens.json')

#### Save label encoder

In [None]:
import os
import pickle

label_encoder_path = os.path.join(save_path, 'label_encoder.pkl')

print("\nSaving label encoder to Google Drive...")

try:
    with open(label_encoder_path, 'wb') as f:
        pickle.dump(label_encoder, f)
    print(f"Label encoder saved to: {label_encoder_path}")
except Exception as e:
    print(f"Error: {e}")



Saving label encoder to Google Drive...
Label encoder saved to: /content/drive/MyDrive/models/text_classification/ticket_classification_nlp_model/label_encoder.pkl


In [3]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 400.7/400.7 MB 2.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


#### Predictions

In [4]:
# ============================================================
# PREDICTIONS USING TRAINER API
# ============================================================

# Load model and create Trainer
from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    Trainer,
    TrainingArguments
)
from datasets import Dataset
import pickle
import numpy as np
import torch

# Mount Drive and load
from google.colab import drive
drive.mount('/content/drive')

save_path = "/content/drive/MyDrive/models/text_classification/ticket_classification_nlp_model"

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained(save_path)
tokenizer = BertTokenizer.from_pretrained(save_path)

# Load label encoder
with open(f'{save_path}/label_encoder.pkl', 'rb') as f:
    label_encoder = pickle.load(f)

print("Model, tokenizer, and label encoder loaded")

# Create Trainer for inference
training_args = TrainingArguments(
    output_dir='./predictions',
    per_device_eval_batch_size=32,  # Batch size for predictions
    fp16=torch.cuda.is_available(),
)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
)

print("Trainer created for inference")

# Prepare your texts for prediction
# Examples for Prediction on new texts
new_texts = [
    "I have a billing issue with my account",
    "Technical support needed for software bug",
    "How do I return my purchase?",
    "My password is not working"
]

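# Preprocess texts (should match training; note this version lowercases before
# lemmatizing, while the training-time function lowercases the lemma afterwards)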
import spacy
nlp = spacy.load('en_core_web_lg')

def preprocess_text(text):
    if not text:
        return ""
    doc = nlp(str(text).lower())
    tokens = [
        token.lemma_ for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space and token.text.strip()
    ]
    return ' '.join(tokens)

processed_texts = [preprocess_text(text) for text in new_texts]

# Create Dataset from new texts
predict_dataset = Dataset.from_dict({
    'processed_text': processed_texts
})

# Tokenize
def tokenize_function(examples):
    return tokenizer(
        examples['processed_text'],
        truncation=True,
        max_length=256,
    )

tokenized_predict = predict_dataset.map(tokenize_function, batched=True)

print(f"Prepared {len(tokenized_predict)} texts for prediction")

# Make predictions using Trainer.predict()
print("\nMaking predictions...")

predictions_output = trainer.predict(tokenized_predict)

# Extract predictions
logits = predictions_output.predictions  # Raw model outputs
predicted_labels = np.argmax(logits, axis=1)  # Get predicted class indices
probabilities = torch.softmax(torch.tensor(logits), dim=1).numpy()  # Convert to probabilities

# Display results
print("\n" + "="*80)
print("PREDICTIONS")
print("="*80 + "\n")

for i, (text, pred_idx, probs) in enumerate(zip(new_texts, predicted_labels, probabilities)):
    predicted_class = label_encoder.classes_[pred_idx]
    confidence = probs[pred_idx]

    # Get top 3 predictions
    top_3_indices = np.argsort(probs)[-3:][::-1]

    print(f"Text {i+1}: {text}")
    print(f"Predicted: {predicted_class} (confidence: {confidence:.4f})")
    print(f"Top 3 predictions:")
    for rank, idx in enumerate(top_3_indices, 1):
        cls = label_encoder.classes_[idx]
        prob = probs[idx]
        print(f"  {rank}. {cls}: {prob:.4f} ({prob*100:.1f}%)")
    print("-" * 80)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Model, tokenizer, and label encoder loaded
Trainer created for inference




Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Prepared 4 texts for prediction

Making predictions...





PREDICTIONS

Text 1: I have a billing issue with my account
Predicted: Billing and Payments (confidence: 0.9998)
Top 3 predictions:
  1. Billing and Payments: 0.9998 (100.0%)
  2. General Inquiry: 0.0001 (0.0%)
  3. Service Outages and Maintenance: 0.0000 (0.0%)
--------------------------------------------------------------------------------
Text 2: Technical support needed for software bug
Predicted: Technical Support (confidence: 0.9655)
Top 3 predictions:
  1. Technical Support: 0.9655 (96.5%)
  2. Product Support: 0.0339 (3.4%)
  3. Customer Service: 0.0002 (0.0%)
--------------------------------------------------------------------------------
Text 3: How do I return my purchase?
Predicted: Billing and Payments (confidence: 0.9928)
Top 3 predictions:
  1. Billing and Payments: 0.9928 (99.3%)
  2. Returns and Exchanges: 0.0061 (0.6%)
  3. Technical Support: 0.0003 (0.0%)
--------------------------------------------------------------------------------
Text 4: My password is not work