# LLM-Based Text Classification Project Documentation

## Data Source
The dataset used for this project is the AG News Classification Dataset from Kaggle. It consists of news articles labeled with one of four categories: World, Sports, Business, and Sci/Tech. The training split contains 120,000 articles and the test split 7,600, both evenly balanced across the four classes, which makes the dataset well suited to fine-tuning and evaluating compact transformer models such as BERT Mini and MiniLM.

Dataset link: https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset

## **Summary of the Approach**
This project involves adapting transformer-based language models (BERT Mini and MiniLM) for text classification on a news dataset. The models predict one of four categories: World, Sports, Business, and Sci/Tech.

### **Steps Taken:**

### 1. **Data Preprocessing:**
- Loaded training and testing datasets.
- Combined the `Title` and `Description` fields into a single text field.
- Cleaned text by removing HTML tags, punctuation and other special characters, stand-alone single letters, and extra whitespace, then lowercased it.
- Shifted labels from 1-4 to 0-3, since `CrossEntropyLoss` expects class indices in the range 0 to C-1.
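
The cleaning steps above can be sketched on a single string. This is a minimal illustration rather than the project's exact function: the name `clean_text_sketch` and the sample input are assumptions, and HTML tags are stripped first on the assumption that they should be removed as whole units before punctuation is dropped.

```python
import re

def clean_text_sketch(text: str) -> str:
    """Illustrative cleaner; the name and the step order are assumptions."""
    text = re.sub(r"<[^>]*>", " ", text)       # drop HTML tags while < and > still exist
    text = re.sub(r"[^\w\s]", " ", text)       # replace punctuation with spaces
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)  # drop stand-alone single letters
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.lower().strip()

sample = "Fears for T N pension <b>after</b> talks!"
cleaned = clean_text_sketch(sample)
print(cleaned)
```

Note that dropping single letters also removes initials such as "T N", which matches the cleaned test rows shown later in this document.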

### 2. **Model Selection:**
- **Models Used:**
  - `BERT Mini` (`prajjwal1/bert-mini`): Compact BERT model (4 layers, hidden size 256) for reduced training and inference time.
  - `MiniLM` (`microsoft/MiniLM-L12-H384-uncased`): Distilled model (12 layers, hidden size 384) that is larger than BERT Mini but still far smaller than BERT Base.
- **Rationale:** These models balance performance and computational efficiency.

### 3. **Transforming the Model:**
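- Loaded each checkpoint with `AutoModelForSequenceClassification` and `num_labels=4`, which attaches a newly initialized classification head with four outputs on top of the pre-trained encoder.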
- Applied Cross-Entropy Loss for classification.
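
To make the loss concrete, the sketch below computes cross-entropy for one example from four raw head outputs (logits). It is a plain-Python illustration of the formula, not the PyTorch implementation used in the project; the logit values and the class order (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech) are assumptions.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits from a 4-way classification head
logits = [0.2, -1.0, 3.1, 0.5]
loss_correct = cross_entropy(logits, target=2)  # true label has the largest logit
loss_wrong = cross_entropy(logits, target=1)    # true label has the smallest logit
print(f"{loss_correct:.4f} {loss_wrong:.4f}")
```

The loss is small when the logit of the true class dominates and grows as probability mass moves to other classes, which is the signal that drives fine-tuning.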

### 4. **Model Training:**
- Created PyTorch dataloaders using a custom `NewsDataset` class.
- Trained each model for one epoch using:
  - Optimizer: AdamW
  - Learning Rate: 2e-5
  - Loss Function: CrossEntropyLoss
  - Batch Size: 16
  - Maximum Sequence Length: 128 tokens
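
With a batch size of 16, the number of steps per epoch follows directly from the split sizes. The short check below uses the row counts reported later in this document (4 x 30,000 training rows and 4 x 1,900 test rows); the helper name `steps_per_epoch` is introduced here only for illustration.

```python
import math

def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches in one pass when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

train_steps = steps_per_epoch(120_000, 16)
test_steps = steps_per_epoch(7_600, 16)
print(train_steps, test_steps)  # 7500 475
```

These values agree with the 7500 training and 475 evaluation iterations shown in the progress bars of the recorded runs.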

### 5. **Model Evaluation:**
- Evaluated models on the test set using accuracy and support-weighted precision, recall, and F1-score.
- Compared performance metrics between models.
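
The evaluation code in this project calls scikit-learn with `average='weighted'`. The sketch below re-implements that weighting in plain Python to show what it computes; the function name `weighted_prf` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the classes in y_true."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        p_c = tp / predicted if predicted else 0.0   # per-class precision
        r_c = tp / count                             # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n                           # share of true examples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = weighted_prf(y_true, y_pred)
print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
```

One consequence of this weighting is that weighted recall always equals accuracy, which is why the two numbers coincide in the reported results.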

---

### **Challenges Encountered:**
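1. **Training Time:** Utilized GPU when available for faster training; in the recorded run, one epoch took about 3 min 22 s for BERT Mini and about 14 min 25 s for MiniLM.
2. **Text Length Limitation:** Truncated inputs to 128 tokens (and padded shorter ones to the same length) so that every batch has a fixed shape.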
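
The effect of truncation and padding can be sketched on plain lists of token ids. This is a simplified stand-in for what the tokenizer does with `padding='max_length'` and `truncation=True`: the helper name `pad_or_truncate` and the padding id of 0 are assumptions, and a real tokenizer additionally keeps special tokens such as `[CLS]` and `[SEP]` in place.

```python
def pad_or_truncate(token_ids, max_len=128, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_len long."""
    ids = list(token_ids[:max_len])   # truncate anything past max_len
    mask = [1] * len(ids)             # 1 marks real tokens
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

short_ids, short_mask = pad_or_truncate(range(1, 6), max_len=8)
long_ids, long_mask = pad_or_truncate(range(1, 300), max_len=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask lets the model ignore padded positions, while anything beyond the limit is simply discarded, so very long articles lose their tail.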

---

### **Discussion:**

#### **1. Why This Dataset and Models?**
- The dataset is suitable for text classification and publicly available.
- The selected models are efficient and perform well for small-to-medium datasets.

#### **2. Differences from Language Modeling:**
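- Language modeling predicts a token at each position (the next token for causal models, or a masked token for BERT-style pre-training), while classification predicts a single class label for the whole input.
- Both objectives use cross-entropy loss; in classification it is computed over the four class labels from a classification head, rather than over the vocabulary at each token position.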
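
One concrete way to see the difference is the size of the output space. The sketch below compares the cross-entropy of a model that guesses uniformly over four classes with one that guesses uniformly over a vocabulary; the vocabulary size of 30,522 is an assumption (the usual size of the uncased BERT WordPiece vocabulary), not a value taken from this project.

```python
import math

NUM_CLASSES = 4        # World, Sports, Business, Sci/Tech
VOCAB_SIZE = 30_522    # assumed size of the uncased BERT WordPiece vocabulary

def uniform_cross_entropy(num_outcomes: int) -> float:
    """Loss of a model that assigns equal probability to every outcome."""
    return math.log(num_outcomes)

classification_baseline = uniform_cross_entropy(NUM_CLASSES)
language_model_baseline = uniform_cross_entropy(VOCAB_SIZE)
print(f"{classification_baseline:.3f} {language_model_baseline:.3f}")
```

The chance-level loss for the classifier is about 1.39, so the training losses of 0.2075 and 0.2690 reported in this document are well below chance after a single epoch.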

#### **3. Limitations and Improvements:**
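- **Limitations:** A single dataset of limited size and scope (120,000 training articles in four broad categories), one training epoch per model, and no hyperparameter tuning or validation split.
- **Improvements:** Larger or more varied datasets, ensemble models, and hyperparameter optimization on a held-out validation split.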
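
Hyperparameter optimization could start from a simple grid search. The sketch below is hypothetical throughout: `grid_search`, `train_and_score`, and the candidate values are not part of this project, and the stand-in scorer exists only so the example runs without training a model. In practice the score would be a validation metric from a real fine-tuning run.

```python
from itertools import product

def grid_search(train_and_score, learning_rates, batch_sizes):
    """Try every (lr, batch_size) pair; return the best pair and its score."""
    best_config, best_score = None, float("-inf")
    for lr, bs in product(learning_rates, batch_sizes):
        score = train_and_score(lr=lr, batch_size=bs)
        if score > best_score:
            best_config, best_score = (lr, bs), score
    return best_config, best_score

# Stand-in scorer so the sketch runs without training; it is NOT a real result.
def fake_score(lr, batch_size):
    return -abs(lr - 3e-5) - abs(batch_size - 32) * 1e-6

best_config, best_score = grid_search(fake_score, [1e-5, 2e-5, 3e-5, 5e-5], [16, 32])
print(best_config)
```

Selecting the configuration on a validation split rather than on the test set keeps the final test metrics unbiased.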

---

### **Documented Code Snippets:**

In [10]:
# Connecting it with Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [11]:
# Changing to the appropriate folder
%cd /content/drive/MyDrive/Colab-Notebooks
!ls

/content/drive/MyDrive/Colab-Notebooks
test.csv  train.csv  Untitled0.ipynb


In [12]:
# Importing the libraries

import pandas as pd
import re
import torch
from torch.utils.data import DataLoader, Dataset
from torch import nn
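# transformers' own AdamW is deprecated in favor of the PyTorch implementation
# (note: torch's default weight_decay is 0.01, versus 0.0 in the deprecated class)
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification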
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm

In [13]:
# Data Loading
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [None]:
train_df.head()

| | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |


In [14]:
test_df.head()

| | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Fears for T N pension after talks | Unions representing workers at Turner Newall... |
| 1 | 4 | The Race is On: Second Private Team Sets Launc... | SPACE.com - TORONTO, Canada -- A second\team o... |
| 2 | 4 | Ky. Company Wins Grant to Study Peptides (AP) | AP - A company founded by a chemistry research... |
| 3 | 4 | Prediction Unit Helps Forecast Wildfires (AP) | AP - It's barely dawn when Mike Fitzpatrick st... |
| 4 | 4 | Calif. Aims to Limit Farm-Related Smog (AP) | AP - Southern California's smog-fighting agenc... |


In [15]:
#Data Preprocessing

train_df['text'] = train_df['Title'] + " " + train_df['Description']
test_df['text'] = test_df['Title'] + " " + test_df['Description']

train_df = train_df[['text', 'Class Index']]
test_df = test_df[['text', 'Class Index']]

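def clean_text(text):
    # Strip HTML tags first, while the angle brackets are still present
    text = re.sub(r"<[^>]*>", " ", text)
    # Replace punctuation and other special characters with spaces
    text = re.sub(r"[^\w\s]", " ", text)
    # Drop stand-alone single letters (initials, leftovers from contractions)
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)
    # Lowercase, trim, then collapse runs of whitespace
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)
    return text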

train_df['clean_text'] = [clean_text(t) for t in train_df['text']]
test_df['clean_text'] = [clean_text(t) for t in test_df['text']]

In [16]:
train_df.head()

| | text | Class Index | clean_text |
|---|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 3 | wall st bears claw back into the black reuters... |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 3 | carlyle looks toward commercial aerospace reut... |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 3 | oil and economy cloud stocks outlook reuters r... |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 3 | iraq halts oil exports from main southern pipe... |
| 4 | Oil prices soar to all-time record, posing new... | 3 | oil prices soar to all time record posing new ... |


In [17]:
test_df.head()

| | text | Class Index | clean_text |
|---|---|---|---|
| 0 | Fears for T N pension after talks Unions repre... | 3 | fears for pension after talks unions represent... |
| 1 | The Race is On: Second Private Team Sets Launc... | 4 | the race is on second private team sets launch... |
| 2 | Ky. Company Wins Grant to Study Peptides (AP) ... | 4 | ky company wins grant to study peptides ap ap ... |
| 3 | Prediction Unit Helps Forecast Wildfires (AP) ... | 4 | prediction unit helps forecast wildfires ap ap... |
| 4 | Calif. Aims to Limit Farm-Related Smog (AP) AP... | 4 | calif aims to limit farm related smog ap ap so... |


In [18]:
# Adjusting Class Index to Start from 0
train_df['Class Index'] = train_df['Class Index'] - 1
test_df['Class Index'] = test_df['Class Index'] - 1

In [19]:
train_df['Class Index'].value_counts()

| Class Index | count |
|---|---|
| 2 | 30000 |
| 3 | 30000 |
| 1 | 30000 |
| 0 | 30000 |


In [20]:
test_df['Class Index'].value_counts()

| Class Index | count |
|---|---|
| 2 | 1900 |
| 3 | 1900 |
| 1 | 1900 |
| 0 | 1900 |


In [21]:
# Loading Pre-Trained Tokenizers and Models
# We use Hugging Face's Transformers library to load tokenizers and models for fine-tuning:
# - BERT Mini: A lightweight version of BERT, suitable for limited computational resources.
# - MiniLM: A compact language model designed for tasks requiring fast inference.
# Both models are configured for sequence classification with 4 output labels corresponding to the dataset classes.

tokenizer_bert = AutoTokenizer.from_pretrained("prajjwal1/bert-mini")
model_bert = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-mini", num_labels=4)

tokenizer_minilm = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")
model_minilm = AutoModelForSequenceClassification.from_pretrained("microsoft/MiniLM-L12-H384-uncased", num_labels=4)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-mini and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/MiniLM-L12-H384-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
# Custom Dataset Class for News Classification
# This class inherits from PyTorch's Dataset class and is used to preprocess and manage the news classification data.

class NewsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, padding='max_length', truncation=True, max_length=self.max_len, return_tensors="pt")
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'label': torch.tensor(label, dtype=torch.long)
        }

In [23]:
# Prepare Text and Label Lists for Training and Testing
# Extracts text data and labels from the training and testing DataFrames

train_texts = train_df['clean_text'].tolist()
train_labels = train_df['Class Index'].tolist()

test_texts = test_df['clean_text'].tolist()
test_labels = test_df['Class Index'].tolist()

# Create DataLoaders for BERT and MiniLM Models
# DataLoaders are used to efficiently load and batch data for model training and evaluation.

train_loader_bert = DataLoader(NewsDataset(train_texts, train_labels, tokenizer_bert), batch_size=16, shuffle=True)
test_loader_bert = DataLoader(NewsDataset(test_texts, test_labels, tokenizer_bert), batch_size=16)

train_loader_minilm = DataLoader(NewsDataset(train_texts, train_labels, tokenizer_minilm), batch_size=16, shuffle=True)
test_loader_minilm = DataLoader(NewsDataset(test_texts, test_labels, tokenizer_minilm), batch_size=16)

In [24]:
# Training Function
# This function trains the model on the training data using the specified optimizer and loss criterion.

def train_model(model, train_loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader, desc="Training"):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, labels)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    return total_loss / len(train_loader)

# Evaluation Function
# This function evaluates the model on the test data and calculates performance metrics.

def evaluate_model(model, test_loader, device):
    model.eval()
    predictions, true_labels = [], []
    with torch.no_grad():
        for batch in tqdm(test_loader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().numpy()
            predictions.extend(preds)
            true_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(true_labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average='weighted')
    return acc, precision, recall, f1

In [26]:
# Device Configuration
# Check if a GPU is available and set the device accordingly; otherwise, use the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model Setup for BERT Mini
model_bert.to(device)

# Define the Optimizer and Loss Function
optimizer_bert = AdamW(model_bert.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

# Train the Model
train_loss = train_model(model_bert, train_loader_bert, optimizer_bert, criterion, device)
print(f"\nTraining Loss: {train_loss:.4f}")

# Evaluate the Model
acc, precision, recall, f1 = evaluate_model(model_bert, test_loader_bert, device)
print(f"\nBERT Mini - Accuracy: {acc:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")

Training: 100%|██████████| 7500/7500 [03:22<00:00, 37.05it/s]



Training Loss: 0.2075


Evaluating: 100%|██████████| 475/475 [00:07<00:00, 60.18it/s]


BERT Mini - Accuracy: 0.9246, Precision: 0.9244, Recall: 0.9246, F1: 0.9245





In [27]:
# Model Setup for MiniLM
model_minilm.to(device)

# Define the Optimizer
optimizer_minilm = AdamW(model_minilm.parameters(), lr=2e-5)

# Train the Model
train_loss = train_model(model_minilm, train_loader_minilm, optimizer_minilm, criterion, device)
print(f"Training Loss: {train_loss:.4f}")

# Evaluate the Model
acc, precision, recall, f1 = evaluate_model(model_minilm, test_loader_minilm, device)
print(f"MiniLM - Accuracy: {acc:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")

Training: 100%|██████████| 7500/7500 [14:25<00:00,  8.66it/s]


Training Loss: 0.2690


Evaluating: 100%|██████████| 475/475 [00:17<00:00, 26.46it/s]

MiniLM - Accuracy: 0.9299, Precision: 0.9299, Recall: 0.9299, F1: 0.9298
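
The two runs above can be compared directly. The sketch below copies the reported accuracy, F1, and one-epoch training times from the outputs and derives the trade-off; the helper `to_seconds` and the `runs` dictionary are names introduced here for illustration only.

```python
def to_seconds(mm_ss: str) -> int:
    """Convert a tqdm-style 'MM:SS' string to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)

# Values copied from the run outputs above (one epoch each).
runs = {
    "BERT Mini": {"accuracy": 0.9246, "f1": 0.9245, "train_time": "03:22"},
    "MiniLM": {"accuracy": 0.9299, "f1": 0.9298, "train_time": "14:25"},
}

accuracy_gain = runs["MiniLM"]["accuracy"] - runs["BERT Mini"]["accuracy"]
time_ratio = (to_seconds(runs["MiniLM"]["train_time"])
              / to_seconds(runs["BERT Mini"]["train_time"]))
print(f"Accuracy gain: {accuracy_gain * 100:.2f} points, "
      f"training time ratio: {time_ratio:.1f}x")
```

In this single run, MiniLM gains roughly half a percentage point of accuracy at about four times the training time, so the better choice depends on whether accuracy or compute budget matters more.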





# PROJECT REPORT: LLM-BASED TEXT CLASSIFICATION

## APPROACH SUMMARY:
This project uses transformer-based models (BERT Mini and MiniLM) to classify news articles into one of four categories: World, Sports, Business, and Sci/Tech.

## STEPS TAKEN:

## 1. DATA PREPROCESSING:
- Combined the `Title` and `Description` columns and cleaned the text by removing special characters and extra spaces.
- Adjusted the class labels for compatibility with PyTorch models.

## 2. MODEL SELECTION:
- Used **BERT Mini** (`prajjwal1/bert-mini`) and **MiniLM** (`microsoft/MiniLM-L12-H384-uncased`) for efficient classification.
- Both models are lightweight and perform well for smaller datasets.

## 3. TRAINING AND EVALUATION:
- Fine-tuned the models using PyTorch with the AdamW optimizer, Cross-Entropy loss, and a batch size of 16.
- Evaluated performance using accuracy, precision, recall, and F1-score.

## CHALLENGES:
- **Training Time:** Optimized by using GPU when available.
- **Text Length:** Managed long text inputs by truncating them.

## KEY RESULTS:
- **BERT Mini** reached a test accuracy of 0.9246 (weighted F1 0.9245) after one epoch, which took about 3 min 22 s of training in the recorded run.
- **MiniLM** reached a test accuracy of 0.9299 (weighted F1 0.9298) after one epoch, which took about 14 min 25 s, roughly four times longer for a gain of about half a percentage point.

## FUTURE IMPROVEMENTS:
- Larger datasets, hyperparameter tuning, and ensemble models could further improve performance.