# Multi-Class Text Classification with BERT 🚀

[![Python Version](https://img.shields.io/badge/python-3.8%2B-blue)](https://www.python.org/downloads/release)
[![PyTorch Version](https://img.shields.io/badge/pytorch-1.8%2B-orange)](https://pytorch.org/get-started/locally/)

## Project Overview

### 🏢 Business Overview
In this NLP project, we aim to perform multiclass text classification using a pre-trained BERT model. 

### 🎯 Aim
The goal is to leverage the power of the BERT (Bidirectional Encoder Representations) model, an open-source ML framework for Natural Language Processing, to achieve state-of-the-art results in multiclass text classification.

## Data Description

The dataset includes customer complaints about financial products, with columns for complaint text and product labels. The task is to predict the product category based on the complaint text.

## Tech Stack

- **Language:** Python
- **Libraries:** pandas, torch, nltk, numpy, pickle, re, tqdm, sklearn, transformers

## Prerequisite

1. Install the torch framework
2. Understanding of Multiclass Text Classification using Naive Bayes
3. Familiarity with Skip Gram Model for Word Embeddings
4. Knowledge of building Multi-Class Text Classification Models with RNN and LSTM
5. Understanding Text Classification Model with Attention Mechanism in NLP

## Approach

1. **Data Processing**
   - Read CSV, handle null values, encode labels, preprocess text.

2. **Model Building**
   - Create BERT model, define dataset, train and test functions.

3. **Training**
   - Load data, split, create datasets and loaders.
   - Train BERT model on GPU/CPU.

4. **Predictions**
   - Make predictions on new text data.

## Project Structure

- **Input:** complaints.csv
- **Output:** bert_pre_trained.pth, label_encoder.pkl, labels.pkl, tokens.pkl
- **Source:** model.py, data.py, utils.py
- **Files:** Engine.py, bert.ipynb, processing.py, predict.py, README.md, requirements.txt

## Takeaways

1. Solving business problems using pre-trained models.
2. Leveraging BERT for text classification.
3. Data preparation and model training.
4. Making predictions on new data.



In [4]:
import re
import torch
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch.nn as nn
from transformers import BertModel
from transformers import BertTokenizer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

  from .autonotebook import tqdm as notebook_tqdm
2024-11-11 21:03:22.434590: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-11 21:03:22.570337: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1731339202.637467    7554 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1731339202.656461    7554 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-11 21:03:22.775148: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorF

In [6]:
df = pd.read_csv('../../processed_data/full_data.csv')  


In [8]:
df['sub_category'].unique()

array(['cyber bullying/stalking/sexting', 'fraud call/vishing',
       'online gambling  betting', 'online job fraud',
       'upi related frauds', 'internet banking related fraud',
       'rape/gang rape-sexually abusive content', 'other',
       'profile hacking identity theft',
       'debit/credit card fraud or sim swap fraud',
       'ewallet related fraud', 'data breach/theft',
       'cheating by impersonation',
       'denial of service (dos)/distributed denial of service (ddos) attacks',
       'fakeimpersonating profile', 'cryptocurrency fraud',
       'sexually explicit act', 'sexually obscene material',
       'malware attack', 'business email compromise/email takeover',
       'email hacking', 'hacking/defacement',
       'unauthorised access/data breach', 'sql injection',
       'provocative speech for unlawful acts', 'ransomware attack',
       'cyber terrorism',
       'child pornography/child sexual abuse material (csam)',
       'tampering with computer source documen

In [16]:
lr = 1e-3
seq_len = 20
dropout = 0.5
num_epochs = 10
label_col = "sub_category"
tokens_path = "tokens.pkl"
labels_path = "labels.pkl"
data_path = "../../processed_data/full_data.csv"
model_path = "bert_pre_trained.pth"
text_col_name = "crimeaditionalinfo"
label_encoder_path = "label_encoder.pkl"
product_map = {'Vehicle loan or lease': 'vehicle_loan',
               'Credit reporting, credit repair services, or other personal consumer reports': 'credit_report',
               'Credit card or prepaid card': 'card',
               'Money transfer, virtual currency, or money service': 'money_transfer',
               'virtual currency': 'money_transfer',
               'Mortgage': 'mortgage',
               'Payday loan, title loan, or personal loan': 'loan',
               'Debt collection': 'debt_collection',
               'Checking or savings account': 'savings_account',
               'Credit card': 'card',
               'Bank account or service': 'savings_account',
               'Credit reporting': 'credit_report',
               'Prepaid card': 'card',
               'Payday loan': 'loan',
               'Other financial service': 'others',
               'Virtual currency': 'money_transfer',
               'Student loan': 'loan',
               'Consumer Loan': 'loan',
               'Money transfers': 'money_transfer'}

In [17]:
def save_file(name, obj):
    """
    Function to save an object as pickle file
    """
    with open(name, 'wb') as f:
        pickle.dump(obj, f)


def load_file(name):
    """
    Function to load a pickle object
    """
    return pickle.load(open(name, "rb"))

## Process text data
---

In [18]:
data = df

In [19]:
data.dropna(subset=[text_col_name], inplace=True)

In [20]:
data

Unnamed: 0,category,sub_category,crimeaditionalinfo
0,online and social media related crime,cyber bullying/stalking/sexting,i had continue received random calls and abusi...
1,online financial fraud,fraud call/vishing,the above fraudster is continuously messaging ...
2,online gambling betting,online gambling betting,he is acting like a police and demanding for m...
3,online and social media related crime,online job fraud,in apna job i have applied for job interview f...
4,online financial fraud,fraud call/vishing,i received a call from lady stating that she w...
...,...,...,...
124882,online and social media related crime,online matrimonial fraud,a lady named rashmi probably a fake name had c...
124883,online financial fraud,internet banking related fraud,i am mr chokhe ram two pers mobile number wer...
124884,any other cyber crime,other,mai bibekbraj maine pahle ki complain kar chuk...
124885,online financial fraud,internet banking related fraud,received url link for updating kyc from mobile...


In [21]:
# data.replace({label_col: product_map}, inplace=True)

### Encode labels

In [22]:
label_encoder = LabelEncoder()
label_encoder.fit(data[label_col])
labels = label_encoder.transform(data[label_col])

In [23]:
save_file(labels_path, labels)
save_file(label_encoder_path, label_encoder)

### Process the text column

In [24]:
input_text = list(data[text_col_name])

In [25]:
len(input_text)

124887

### Convert text to lower case

In [26]:
input_text = [i.lower() for i in tqdm(input_text)]

  0%|          | 0/124887 [00:00<?, ?it/s]

100%|██████████| 124887/124887 [00:00<00:00, 940261.51it/s]


### Remove punctuations except apostrophe

In [27]:
input_text = [re.sub(r"[^\w\d'\s]+", " ", i)
             for i in tqdm(input_text)]

  0%|          | 0/124887 [00:00<?, ?it/s]

100%|██████████| 124887/124887 [00:01<00:00, 115211.32it/s]


### Remove digits

In [28]:
input_text = [re.sub("\d+", "", i) for i in tqdm(input_text)]

  input_text = [re.sub("\d+", "", i) for i in tqdm(input_text)]
100%|██████████| 124887/124887 [00:00<00:00, 151811.24it/s]


### Remove more than one consecutive instance of 'x'

In [29]:
input_text = [re.sub(r'[x]{2,}', "", i) for i in tqdm(input_text)]

100%|██████████| 124887/124887 [00:00<00:00, 207912.22it/s]


### Remove multiple spaces with single space

In [30]:
input_text = [re.sub(' +', ' ', i) for i in tqdm(input_text)]

100%|██████████| 124887/124887 [00:02<00:00, 58427.91it/s]


### Tokenize the text

In [31]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [32]:
input_text[0]

'i had continue received random calls and abusive messages in my whatsapp someone added my number in a unknown facebook group name with only girls and still getting calls from unknown numbers pls help me and sort out the issue as soon as possible thank you'

In [33]:
sample_tokens = tokenizer(input_text[0], padding="max_length",
                         max_length=seq_len, truncation=True,
                         return_tensors="pt")

In [34]:
sample_tokens

{'input_ids': tensor([[  101,   178,  1125,  2760,  1460,  7091,  3675,  1105, 22898,  7416,
          1107,  1139,  1184,  3202,  8661,  1800,  1896,  1139,  1295,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [35]:
sample_tokens["input_ids"]

tensor([[  101,   178,  1125,  2760,  1460,  7091,  3675,  1105, 22898,  7416,
          1107,  1139,  1184,  3202,  8661,  1800,  1896,  1139,  1295,   102]])

In [36]:
sample_tokens["attention_mask"]

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [37]:
tokens = [tokenizer(i, padding="max_length", max_length=seq_len, 
                    truncation=True, return_tensors="pt") 
         for i in tqdm(input_text)]

100%|██████████| 124887/124887 [01:38<00:00, 1266.55it/s]


### Save the tokens

In [38]:
save_file(tokens_path, tokens)

## Create Bert model
---

In [39]:
class BertClassifier(nn.Module):
    
    def __init__(self, dropout, num_classes):
        super(BertClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-cased')
        for param in self.bert.parameters():
            param.required_grad = False
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, num_classes)
        self.activation = nn.ReLU()
    
    def forward(self, input_ids, attention_mask):
        _, bert_output = self.bert(input_ids=input_ids,
                                  attention_mask=attention_mask,
                                  return_dict=False)
        dropout_output = self.activation(self.dropout(bert_output))
        final_output = self.linear(dropout_output)
        return final_output

## Create PyTorch Dataset
---

In [40]:
class TextDataset(torch.utils.data.Dataset):
    
    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = labels
        
    def __len__(self):
        return len(self.tokens)
    
    def __getitem__(self, idx):
        return self.labels[idx], self.tokens[idx]

### Function to train the model

In [41]:
def train(train_loader, valid_loader, model, criterion, optimizer, 
          device, num_epochs, model_path):
    """
    Function to train the model
    :param train_loader: Data loader for train dataset
    :param valid_loader: Data loader for validation dataset
    :param model: Model object
    :param criterion: Loss function
    :param optimizer: Optimizer
    :param device: CUDA or CPU
    :param num_epochs: Number of epochs
    :param model_path: Path to save the model
    """
    best_loss = 1e8
    for i in range(num_epochs):
        print(f"Epoch {i+1} of {num_epochs}")
        valid_loss, train_loss = [], []
        model.train()
        # Train loop
        for batch_labels, batch_data in tqdm(train_loader):
            input_ids = batch_data["input_ids"]
            attention_mask = batch_data["attention_mask"]
            # Move data to GPU if available
            batch_labels = batch_labels.to(device)
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            input_ids = torch.squeeze(input_ids, 1)
            # Forward pass
            batch_output = model(input_ids, attention_mask)
            batch_output = torch.squeeze(batch_output)
            # Calculate loss
            ###batch_labels = batch_labels.type(torch.LongTensor)
            loss = criterion(batch_output, batch_labels)
            train_loss.append(loss.item())
            optimizer.zero_grad()
            # Backward pass
            loss.backward()
            # Gradient update step
            optimizer.step()
        model.eval()
        # Validation loop
        for batch_labels, batch_data in tqdm(valid_loader):
            input_ids = batch_data["input_ids"]
            attention_mask = batch_data["attention_mask"]
            # Move data to GPU if available
            batch_labels = batch_labels.to(device)
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            input_ids = torch.squeeze(input_ids, 1)
            # Forward pass
            batch_output = model(input_ids, attention_mask)
            batch_output = torch.squeeze(batch_output)
            # Calculate loss
            ###batch_labels = batch_labels.type(torch.LongTensor)
            loss = criterion(batch_output, batch_labels)
            valid_loss.append(loss.item())
        t_loss = np.mean(train_loss)
        v_loss = np.mean(valid_loss)
        print(f"Train Loss: {t_loss}, Validation Loss: {v_loss}")
        if v_loss < best_loss:
            best_loss = v_loss
            # Save model if validation loss improves
            torch.save(model.state_dict(), model_path)
        print(f"Best Validation Loss: {best_loss}")

### Function to test the model

In [42]:
def test(test_loader, model, criterion, device):
    """
    Function to test the model
    :param test_loader: Data loader for test dataset
    :param model: Model object
    :param criterion: Loss function
    :param device: CUDA or CPU
    """
    model.eval()
    test_loss = []
    test_accu = []
    for batch_labels, batch_data in tqdm(test_loader):
        input_ids = batch_data["input_ids"]
        attention_mask = batch_data["attention_mask"]
        # Move data to GPU if available
        batch_labels = batch_labels.to(device)
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        input_ids = torch.squeeze(input_ids, 1)
        # Forward pass
        batch_output = model(input_ids, attention_mask)
        batch_output = torch.squeeze(batch_output)
        # Calculate loss
        ###batch_labels = batch_labels.type(torch.LongTensor)
        loss = criterion(batch_output, batch_labels)
        test_loss.append(loss.item())
        batch_preds = torch.argmax(batch_output, axis=1)
        # Move predictions to CPU
        if torch.cuda.is_available():
            batch_labels = batch_labels.cpu()
            batch_preds = batch_preds.cpu()
        # Compute accuracy
        test_accu.append(accuracy_score(batch_labels.detach().
                                        numpy(),
                                        batch_preds.detach().
                                        numpy()))
    test_loss = np.mean(test_loss)
    test_accu = np.mean(test_accu)
    print(f"Test Loss: {test_loss}, Test Accuracy: {test_accu}")

## Train Bert model
---

### Load the files

In [43]:
tokens = load_file(tokens_path)
labels = load_file(labels_path)
label_encoder = load_file(label_encoder_path)
num_classes = len(label_encoder.classes_)

In [44]:
num_classes

42

### Split data into train, validation and test sets

In [45]:
X_train, X_test, y_train, y_test = train_test_split(tokens, labels,
                                                   test_size=0.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, 
                                                      y_train,
                                                     test_size=0.25)

### Create PyTorch datasets

In [46]:
train_dataset = TextDataset(X_train, y_train)
valid_dataset = TextDataset(X_valid, y_valid)
test_dataset = TextDataset(X_test, y_test)

### Create data loaders

In [47]:
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=16,
                                           shuffle=True,
                                           drop_last=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset,
                                           batch_size=16)
test_loader = torch.utils.data.DataLoader(test_dataset, 
                                         batch_size=16)

### Create model object

In [48]:
device = torch.device("cuda:0" if torch.cuda.is_available()
                     else "cpu")

In [49]:
device

device(type='cpu')

In [50]:
model = BertClassifier(dropout, num_classes)

### Define loss function and optimizer

In [51]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

### Move the model to GPU if available

In [52]:
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()

### Training loop

In [53]:
train(train_loader, valid_loader, model, criterion, optimizer,
     device, num_epochs, model_path)

Epoch 1 of 10


  6%|▌         | 270/4683 [06:21<1:44:00,  1.41s/it]


KeyboardInterrupt: 

### Test the model

In [41]:
test(test_loader, model, criterion, device)

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:57<00:00,  1.73it/s]

Test Loss: 1.6601403439044953, Test Accuracy: 0.453125





## Predict on new text
---

In [42]:
input_text = '''I am a victim of Identity Theft & currently have an Experian account that 
I can view my Experian Credit Report and getting notified when there is activity on 
my Experian Credit Report. For the past 3 days I've spent a total of approximately 9 
hours on the phone with Experian. Every time I call I get transferred repeatedly and 
then my last transfer and automated message states to press 1 and leave a message and 
someone would call me. Every time I press 1 I get an automatic message stating than you 
before I even leave a message and get disconnected. I call Experian again, explain what 
is happening and the process begins again with the same end result. I was trying to have 
this issue attended and resolved informally but I give up after 9 hours. There are hard 
hit inquiries on my Experian Credit Report that are fraud, I didn't authorize, or recall 
and I respectfully request that Experian remove the hard hit inquiries immediately just 
like they've done in the past when I was able to speak to a live Experian representative 
in the United States. The following are the hard hit inquiries : BK OF XXXX XX/XX/XXXX 
XXXX XXXX XXXX  XX/XX/XXXX XXXX  XXXX XXXX  XX/XX/XXXX XXXX  XX/XX/XXXX XXXX  XXXX 
XX/XX/XXXX'''

In [43]:
input_text = input_text.lower()
input_text = re.sub(r"[^\w\d'\s]+", " ", input_text)
input_text = re.sub("\d+", "", input_text)
input_text = re.sub(r'[x]{2,}', "", input_text)
input_text = re.sub(' +', ' ', input_text)

In [44]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [45]:
tokens = tokenizer(input_text, padding="max_length",
                 max_length=seq_len, truncation=True,
                 return_tensors="pt")

In [46]:
input_ids = tokens["input_ids"]
attention_mask = tokens["attention_mask"]

In [47]:
device = torch.device("cuda:0" if torch.cuda.is_available()
                     else "cpu")

In [48]:
input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)

In [49]:
input_ids = torch.squeeze(input_ids, 1)

In [50]:
label_encoder = load_file(label_encoder_path)
num_classes = len(label_encoder.classes_)

In [51]:
# Create model object
model = BertClassifier(dropout, num_classes)

# Load trained weights
model.load_state_dict(torch.load(model_path))

# Move the model to GPU if available
if torch.cuda.is_available():
    model = model.cuda()
    
# Forward pass
out = torch.squeeze(model(input_ids, attention_mask))

# Find predicted class
prediction = label_encoder.classes_[torch.argmax(out)]
print(f"Predicted Class: {prediction}")

Predicted Class: credit_report
