
# News Multi-Classification using NLP Techniques

This project implements a news classification system using NLP techniques. We use the AG News dataset, which contains news headlines labeled with one of four classes:

0: World

1: Sports

2: Business

3: Sci/Tech

The goal is to first build a baseline classifier using logistic regression and then develop a neural network classifier for comparison.

Below, we describe each step and provide detailed comments.

In [None]:
!pip install datasets
!pip install gensim
# !pip install seaborn


In [None]:
import numpy as np
import pandas as pd

from datasets import load_dataset
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import random
np.random.seed(42)

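Note: this cell raised `ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject`. This error usually means a compiled package was built against a different NumPy version than the one currently installed (for example, after a `pip install` changes NumPy mid-session). Restarting the runtime after installation typically resolves it.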

# 1 - Data Prep

## 1.1 - Data Preprocessing
We load the AG News dataset with the Hugging Face `datasets` library. We then lowercase the headlines and use CountVectorizer to convert them into bag-of-words features, and report the vocabulary size and the shapes of the resulting data matrices.

TODO: Check the label distribution and address any class imbalance in the training data.

## 1.2 - Load the AG News dataset

In [None]:
# The default split provides train and test

from datasets import load_dataset

dataset = load_dataset("fancyzhx/ag_news")

In [None]:
import pandas as pd
from collections import Counter

# Reuse the `dataset` object loaded in the previous cell (no need to load it again)

# Convert dataset to pandas DataFrame
df_train = pd.DataFrame(dataset['train'])
df_test = pd.DataFrame(dataset['test'])

# Count labels in training and test sets
train_counts = Counter(df_train['label'])
test_counts = Counter(df_test['label'])

# Print label distribution
print("Training Set Label Counts:", train_counts)
print("Testing Set Label Counts:", test_counts)

## 1.3 - Extract training and test texts and labels

In [None]:
# For convenience, extract training and test texts and labels.
# Column access avoids iterating over the dataset one example at a time.

train_texts = list(dataset["train"]["text"])
train_labels = list(dataset["train"]["label"])

test_texts = list(dataset["test"]["text"])
test_labels = list(dataset["test"]["label"])

print(f"Number of training examples (full): {len(train_texts)}")
print(f"Number of test examples: {len(test_texts)}")

## 1.4 - Subsample the training data

Comment out the following code cell to train on all 120k rows.

In [None]:
import random
random.seed(42)

indices = list(range(len(train_texts)))
random.shuffle(indices)

slice_indices = indices[:30000]  # Select 30k rows (approx. 25%)

# slice_indices = indices[:120000]  # Select 120k rows

train_texts = [train_texts[i] for i in slice_indices]
train_labels = [train_labels[i] for i in slice_indices]

print(f"Using {len(train_texts)} training examples for experiments.")

## 1.5 - Display a snapshot of the data (first 5 rows)

In [None]:
snapshot = pd.DataFrame({"Headline": train_texts, "Label": train_labels}).head(5)
print("Data Snapshot (first 5 rows):")
print(snapshot)

## 1.6 - Feature Extraction with Bag-of-Words / TF‑IDF
We use ***CountVectorizer*** from scikit-learn to tokenize the headlines and convert them into numerical bag-of-words count features.

*   All words are converted to lowercase.
*   We use a minimum document frequency to filter rare tokens.

We then report the vocabulary size and the shapes of the training and test matrices.
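To make the effect of lowercasing and the minimum document frequency concrete, here is a minimal sketch on a made-up three-headline corpus. The headlines and the `min_df=2` threshold are illustrative assumptions, not values taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical headlines, for illustration only)
toy_texts = [
    "Stocks rally as markets open",
    "Markets fall as oil prices rise",
    "Oil stocks rally",
]

# Keep only tokens that appear in at least 2 documents
toy_vectorizer = CountVectorizer(lowercase=True, min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_texts)

toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)          # ['as', 'markets', 'oil', 'rally', 'stocks']
print(toy_matrix.shape)   # (3, 5)
```

Tokens such as "open" and "prices" occur in only one toy headline, so they are filtered out, and "Stocks" and "stocks" are counted as the same token.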

In [None]:
# Create a CountVectorizer with lowercase conversion and minimum document frequency threshold
vectorizer = CountVectorizer(lowercase=True, min_df=3)

# Fit on training texts and transform both training and test texts
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Report the vocabulary size and matrix shapes
vocab_size = len(vectorizer.vocabulary_)
print(f"\nVocabulary size: {vocab_size}\n")
print(f"Shape of training data matrix: {X_train.shape}") # (n_train, vocab_size)
print(f"Shape of test data matrix: {X_test.shape}")      # (n_test, vocab_size)


# 2 - Baseline Classifier: Logistic Regression

We implement a baseline logistic regression classifier.

*   First, we train without regularization.
*   Then, we train with L2 regularization.

We report the classification accuracies on the training and test sets.
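scikit-learn's `LogisticRegression` controls regularization through `C`, the inverse of the regularization strength. As a sketch for the binary case with labels $y_i \in \{-1, +1\}$, the L2-penalized objective can be written in either of the two equivalent forms below. The symbol $\lambda$ is our own notation for the penalty weight, and the second form is obtained by dividing the first by $C$.

```latex
\min_{w,\,b}\; \frac{1}{2}\lVert w\rVert_2^2 + C \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i (x_i^\top w + b)}\bigr)
\;\;\Longleftrightarrow\;\;
\min_{w,\,b}\; \frac{\lambda}{2}\lVert w\rVert_2^2 + \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i (x_i^\top w + b)}\bigr),
\qquad \lambda = \frac{1}{C}
```

Under this convention, a very large `C` makes the penalty negligible, while `C = 0.1` corresponds to $\lambda = 10$.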

## 2.1 - Logistic Regression **without** Regularization

In [None]:
print("\n### Logistic Regression Baseline (No Regularization) ###\n")

# Instantiate logistic regression with a very high C to effectively turn off regularization
lr_no_reg = LogisticRegression(fit_intercept=True, C=1e9, solver="liblinear", max_iter=1000)

# Train the classifier
lr_no_reg.fit(X_train, train_labels)

# Predict on training and test sets
train_preds_no_reg = lr_no_reg.predict(X_train)
test_preds_no_reg = lr_no_reg.predict(X_test)

# Computing accuracy scores
train_acc_no_reg = accuracy_score(train_labels, train_preds_no_reg)
test_acc_no_reg = accuracy_score(test_labels, test_preds_no_reg)

print(f"Training Accuracy (No Reg): {train_acc_no_reg:.4f}")
print(f"Test Accuracy (No Reg): {test_acc_no_reg:.4f}")


## 2.2 - Logistic Regression with L2 Regularization

In [None]:
print("\n### Logistic Regression Baseline (With L2 Regularization, lambda=10.0) ###\n")

# For L2 regularization with lambda=10.0, set C = 1/lambda = 0.1.
lr_reg = LogisticRegression(fit_intercept=True, C=0.1, solver="liblinear", max_iter=1000)

# Train the classifier
lr_reg.fit(X_train, train_labels)

# Predict on training and test sets
train_preds_reg = lr_reg.predict(X_train)
test_preds_reg = lr_reg.predict(X_test)

# Computing accuracy scores
train_acc_reg = accuracy_score(train_labels, train_preds_reg)
test_acc_reg = accuracy_score(test_labels, test_preds_reg)

print(f"Training Accuracy (With Reg): {train_acc_reg:.4f}")
print(f"Test Accuracy (With Reg): {test_acc_reg:.4f}")

## 2.3 - Additional Evaluation (Precision, Recall, F1-score, Confusion Matrix)
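As a rough illustration of how these metrics relate to the confusion matrix, the sketch below derives per-class precision, recall, and F1 from a small hand-made matrix. The helper name `per_class_metrics` and the toy numbers are assumptions for illustration; rows are true classes and columns are predicted classes, matching scikit-learn's `confusion_matrix` layout.

```python
def per_class_metrics(cm):
    """Compute (precision, recall, f1) per class from a square confusion matrix.

    cm[i][j] counts examples with true class i that were predicted as class j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Hypothetical 2-class confusion matrix, for illustration only
toy_cm = [[8, 2],
          [1, 9]]
toy_metrics = per_class_metrics(toy_cm)
for k, (p, r, f1) in enumerate(toy_metrics):
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```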


In [None]:
print("\nClassification Report (Test, With Regularization):")
print(classification_report(test_labels, test_preds_reg))

print("Confusion Matrix:")
print(confusion_matrix(test_labels, test_preds_reg))


## 2.4 - Confusion Matrix Heatmap

In [None]:
import seaborn as sns

label_names = ["World", "Sports", "Business", "Sci/Tech"]
cm = confusion_matrix(test_labels, test_preds_reg)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_names, yticklabels=label_names)
plt.title("Confusion Matrix - Logistic Regression (With Reg)")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

# 3 - Neural Network Classifier

Next, we implement a simple feedforward neural network classifier for the news classification task. The architecture is as follows:

1. Embedding Layer: Converts word indices to dense vectors.

2. First Hidden Layer: Sums the embeddings over the sequence and applies a linear transformation followed by a sigmoid activation.

3. Second Hidden Layer: Applies another linear transformation and sigmoid activation.

4. Output Layer: Applies a final linear transformation to produce the logits for the 4 classes; softmax is applied to the logits when class probabilities are needed.

We use PyTorch to implement the model.

We then train the network and evaluate its performance.
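Before the PyTorch version, the four steps above can be traced in plain Python on a toy example. This is only a sketch: the sizes, the random weights, and the helper names (`linear`, `sigmoid`, `rand_matrix`, `forward`) are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

def linear(vec, weights, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def sigmoid(vec):
    return [1.0 / (1.0 + math.exp(-v)) for v in vec]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumptions): 6 vocabulary entries, 3-dim embeddings,
# 4 hidden units, 4 output classes
n_vocab, n_embed, n_hidden, n_classes = 6, 3, 4, 4
embedding = rand_matrix(n_vocab, n_embed)
embedding[0] = [0.0] * n_embed     # index 0 is padding and contributes nothing
W1, b1 = rand_matrix(n_hidden, n_embed), [0.0] * n_hidden
W2, b2 = rand_matrix(n_hidden, n_hidden), [0.0] * n_hidden
W_out, b_out = rand_matrix(n_classes, n_hidden), [0.0] * n_classes

def forward(token_ids):
    # Sum the embeddings over the sequence -> one vector of size n_embed
    summed = [sum(embedding[t][d] for t in token_ids) for d in range(n_embed)]
    h1 = sigmoid(linear(summed, W1, b1))   # first hidden layer
    h2 = sigmoid(linear(h1, W2, b2))       # second hidden layer
    return linear(h2, W_out, b_out)        # logits, one per class

logits = forward([2, 5, 3, 0, 0])          # last two positions are padding
print(len(logits))                         # 4
```

Because the padding embedding is all zeros, summing over a padded sequence gives the same representation as summing over the unpadded one.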

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

# Setting torch seed for reproducibility
torch.manual_seed(42)


## 3.1 - Custom Dataset and DataLoader
We define a custom dataset to process news headlines. Each headline is tokenized, mapped to indices using the vocabulary from CountVectorizer, and then padded or truncated to a fixed length.
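The padding and truncation step can be sketched in isolation as follows. The helper name `pad_or_truncate` is hypothetical; it mirrors the logic used inside the dataset class.

```python
def pad_or_truncate(indices, max_len, pad_idx=0):
    """Return a list of exactly max_len indices."""
    if len(indices) < max_len:
        return indices + [pad_idx] * (max_len - len(indices))
    return indices[:max_len]

print(pad_or_truncate([7, 3, 9], max_len=5))           # [7, 3, 9, 0, 0]
print(pad_or_truncate([7, 3, 9, 4, 1, 8], max_len=5))  # [7, 3, 9, 4, 1]
```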

In [None]:
# Build a word-to-index mapping from CountVectorizer's vocabulary (word -> index).
# CountVectorizer assigns index 0 to a real word, so we shift every index by 2 and
# reserve index 0 for padding and index 1 for unknown (out-of-vocabulary) tokens.
pad_idx = 0
default_idx = 1 # Unknown words get index 1

vocab = {"<pad>": pad_idx, "<unk>": default_idx}
vocab.update({word: idx + 2 for word, idx in vectorizer.vocabulary_.items()})

# Define maximum sequence length (e.g., 20 tokens)
max_seq_len = 20

# Reuse the vectorizer's own analyzer (lowercasing plus its token pattern) so that
# tokens match the vocabulary keys; a plain whitespace split leaves punctuation attached.
analyzer = vectorizer.build_analyzer()

def tokenize_text(text):
  tokens = analyzer(text)
  indices = [vocab.get(token, default_idx) for token in tokens]

  return indices


In [None]:
class AGNewsDataset(Dataset):
  def __init__(self, texts, labels, max_len=max_seq_len):
    self.texts = texts
    self.labels = labels
    self.max_len = max_len

  def __len__(self):
    return len(self.texts)

  def __getitem__(self, idx):
    indices = tokenize_text(self.texts[idx])
    if len(indices) < self.max_len:
        indices = indices + [pad_idx] * (self.max_len - len(indices))
    else:
        indices = indices[:self.max_len]
    label = self.labels[idx]

    return torch.tensor(indices, dtype=torch.long), torch.tensor(label, dtype=torch.long)

In [None]:
# Create dataset and DataLoader instances
train_dataset = AGNewsDataset(train_texts, train_labels, max_len=max_seq_len)
test_dataset = AGNewsDataset(test_texts, test_labels, max_len=max_seq_len)

batch_size = 64

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


In [None]:
class NeuralClassifier(nn.Module):
  def __init__(self, vocab_size, embed_size, hidden_size, num_classes=4):
    """ Initialize the neural network.
    Args:
        vocab_size (int): Size of the vocabulary.
        embed_size (int): Dimensionality of word embeddings.
        hidden_size (int): Size of hidden layers.
        num_classes (int): Number of news categories.
    """
    super().__init__()

    # Layers defined as per 'Neural Networks for NLP.pdf' slides 27-29:
    #   Step 1: Embedding layer (with padding support)
    #   Step 2: Linear layer from embed_size to hidden_size (first hidden layer)
    #   Step 3: Second linear layer from hidden_size to hidden_size (second hidden layer)
    #   Step 4: Output linear layer from hidden_size to num_classes

    # Add more layers here:

    self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=pad_idx)
    self.fc1 = nn.Linear(embed_size, hidden_size)
    self.fc2 = nn.Linear(hidden_size, hidden_size)
    self.fc_out = nn.Linear(hidden_size, num_classes)

  def forward(self, x):
    """
    Forward pass of the model.

    Args:
        x (Tensor): Input tensor of shape [batch_size, seq_len].

    Returns:
        logits (Tensor): Logits of shape [batch_size, num_classes].
    """
    embedded = self.embedding(x)               # [B, L, E]
    summed = embedded.sum(dim=1)               # [B, E] fixed-size representation
    h1 = torch.sigmoid(self.fc1(summed))       # first hidden layer
    h2 = torch.sigmoid(self.fc2(h1))           # second hidden layer
    logits = self.fc_out(h2)                   # output layer (logits)

    return logits


## 3.2 - Training and Evaluation Functions
We define functions for training one mini-batch and evaluating the model over an entire DataLoader.
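Both functions rely on the cross-entropy loss computed from logits. As a minimal sketch of what that loss measures for a single example, the hypothetical helper below computes the negative log-probability of the true class using a numerically stable log-sum-exp. PyTorch's `F.cross_entropy` averages this quantity over the mini-batch by default.

```python
import math

def cross_entropy_from_logits(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Uniform logits over 4 classes: the loss equals log(4)
uniform_loss = cross_entropy_from_logits([0.0, 0.0, 0.0, 0.0], label=2)
# A confident, correct prediction gives a loss close to 0
confident_loss = cross_entropy_from_logits([5.0, 0.0, 0.0, 0.0], label=0)
print(f"{uniform_loss:.4f} {confident_loss:.4f}")
```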

### 3.2.1 - Training Function:


In [None]:
def train_batch(batch, model, optimizer):
  """ Train on one mini-batch
  Args:
    batch (tuple): (texts, labels) from DataLoader.
    model (nn.Module): Neural network.
    optimizer: Optimizer instance.

  Returns:
      loss (float): Mini-batch loss.
  """
  # set in training mode
  model.train()

  # reset gradients accumulated from the previous step
  optimizer.zero_grad()

  # forward: prediction
  texts, labels = batch
  logits = model(texts)
  loss = F.cross_entropy(logits, labels)

  # backward: gradient computation
  loss.backward()

  # norm clipping, in case the gradient norm is too large
  torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

  # update parameters using the computed gradients
  optimizer.step()

  return loss.item()

### 3.2.2 - Evaluating Function:

In [None]:
def evaluate(loader, model):
  """ Evaluate model on given DataLoader.
  Args:
    loader (DataLoader): Evaluation data.
    model (nn.Module): Neural network.

  Returns:
    avg_loss (float): Average loss.
    accuracy (float): Classification accuracy.
  """
  model.eval()
  total_loss = 0.0
  total_correct = 0
  total_examples = 0
  with torch.no_grad():
      for texts, labels in loader:
          logits = model(texts)
          loss = F.cross_entropy(logits, labels)
          total_loss += loss.item() * texts.size(0)
          preds = torch.argmax(logits, dim=1)
          total_correct += (preds == labels).sum().item()
          total_examples += texts.size(0)
  avg_loss = total_loss / total_examples
  accuracy = total_correct / total_examples

  return avg_loss, accuracy


## 3.3 - Training the Neural Network Classifier
Hyperparameters:

- Embedding size: 64

- Hidden size: 64

- Optimizer: SGD with learning rate 0.05

- Epochs: 50 (for demonstration; increase as needed)

We record the training loss at each epoch, along with the loss and accuracy on the test set, which this notebook uses as its validation set.
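Because the training loop evaluates on the test set after every epoch, choosing the best epoch from those numbers can overstate performance. One common alternative, sketched here as an assumption rather than as what this notebook does, is to hold out part of the training data for validation. The helper name `train_val_split` and the 10% fraction are illustrative.

```python
import random

def train_val_split(n_examples, val_fraction=0.1, seed=42):
    """Return (train_indices, val_indices) as two disjoint shuffled index lists."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    n_val = int(round(n_examples * val_fraction))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = train_val_split(30000, val_fraction=0.1)
print(len(train_idx), len(val_idx))  # 27000 3000
```

The resulting index lists could then be used to build separate training and validation datasets, leaving the test set for a single final evaluation.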

In [None]:
# Hyperparameters
embed_size = 64
hidden_size = 64
num_classes = 4
num_epochs = 50 # Increase for final training
learning_rate = 0.05

# Initialize the neural network model
model_nn = NeuralClassifier(
    vocab_size=len(vocab),
    embed_size=embed_size,
    hidden_size=hidden_size,
    num_classes=num_classes)

# Use SGD optimizer
optimizer_nn = torch.optim.SGD(
    model_nn.parameters(), lr=learning_rate)

# Record training progress
train_losses = []
val_losses = []
val_accuracies = []

print("\n### Training Neural Network Classifier ###\n")
for epoch in range(1, num_epochs + 1):
    # 1. Train for one epoch
    epoch_loss = 0.0
    for batch in train_loader:
        loss = train_batch(batch, model_nn, optimizer_nn)
        epoch_loss += loss

    # 2. Compute average training loss for this epoch
    avg_train_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_train_loss)

    # 3. Evaluate on the test/validation set
    val_loss, val_acc = evaluate(test_loader, model_nn)
    val_losses.append(val_loss)
    val_accuracies.append(val_acc)

    # 4. Print a single summary line for this epoch
    print(f"Epoch {epoch}: "
          f"Train Loss = {avg_train_loss:.4f}, "
          f"Val Loss = {val_loss:.4f}, "
          f"Val Accuracy = {val_acc:.4f}")

print("\nNeural Network Training Completed.")
print(f"Best Validation Accuracy: {max(val_accuracies):.4f}")
print(f"Final Validation Accuracy: {val_accuracies[-1]:.4f}")

## 3.4 - Visualization and Summary
We plot the training and validation loss curves. In addition, note that evaluation metrics such as precision, recall, and F1-score (as well as a confusion matrix) should be computed for a thorough evaluation. For language model-based approaches, perplexity would also be computed.


In [None]:
plt.figure(figsize=(8, 5))
plt.plot(range(1, num_epochs+1), train_losses, color="red", label="Training Loss")
plt.plot(range(1, num_epochs+1), val_losses, color="blue", label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training vs. Validation Loss")
plt.legend()
plt.show()

# 4 - Additional Experiments and Quantitative Analysis
This section performs further analysis on the dataset:
1. Plotting the frequency distribution of words (to check Zipf’s law).
2. Generating word embeddings using Word2Vec and displaying similar words.
3. Summarizing performance metrics in a table.

## 4.1 - Frequency Distribution Plot

In [None]:
# Flatten all tokens from training texts (using simple whitespace split)
all_tokens = []
for text in train_texts:
    all_tokens.extend(text.lower().split())

token_counts = Counter(all_tokens)

# Sort tokens by frequency
sorted_counts = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
tokens, frequencies = zip(*sorted_counts)

# Plot rank-frequency (log-log scale)
ranks = range(1, len(frequencies) + 1)

plt.figure(figsize=(8, 5))
plt.loglog(ranks, frequencies, marker=".")
plt.xlabel("Rank")
plt.ylabel("Frequency")
plt.title("Word Frequency Distribution (Zipf's Law)")
plt.show()

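On the log-log plot, frequency falls off roughly linearly with rank, which means frequency is approximately inversely proportional to rank.

Hence, the word distribution in the training headlines approximately follows Zipf's law.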
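A visual check can be complemented with a number: under Zipf's law, the slope of log-frequency against log-rank should be close to -1. The sketch below estimates that slope by least squares; the helper name `zipf_slope` and the synthetic counts are assumptions for illustration.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) vs. log(rank).

    `frequencies` must be sorted in descending order.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var

# Synthetic counts that follow frequency = 1000 / rank exactly (illustration only)
ideal = [1000 / rank for rank in range(1, 51)]
print(round(zipf_slope(ideal), 3))  # -1.0
```

In this notebook, the same function could be applied to the sorted `frequencies` computed for the plot.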

## 4.2 - Word Embeddings using Word2Vec
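Word2Vec's `most_similar` ranks words by the cosine similarity between their vectors. As a minimal sketch of that ranking, the example below uses made-up 3-dimensional vectors; the helper name `cosine_similarity` and the vectors themselves are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    "market": [0.9, 0.1, 0.0],
    "stocks": [0.8, 0.2, 0.1],
    "soccer": [0.0, 0.1, 0.9],
}
query = "market"
ranked = sorted(
    (w for w in toy_vectors if w != query),
    key=lambda w: cosine_similarity(toy_vectors[query], toy_vectors[w]),
    reverse=True,
)
print(ranked)  # ['stocks', 'soccer']
```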

In [None]:
from gensim.models import Word2Vec

# Tokenize the training texts (simple whitespace split)
tokenized_texts = [text.lower().split() for text in train_texts]

# Train Word2Vec model (using a small dimension for demonstration)
w2v_model = Word2Vec(sentences=tokenized_texts, vector_size=50, window=5, min_count=3, workers=4, seed=42)

# Display top 5 similar words for selected keywords
keywords = ["economy", "games", "technology", "market"]
for word in keywords:
    if word in w2v_model.wv:
        similar = w2v_model.wv.most_similar(word, topn=5)
        print(f"\nTop similar words to '{word}':")
        for sim_word, score in similar:
            print(f"  {sim_word}: {score:.4f}")
    else:
        print(f"\n'{word}' not found in vocabulary.")


## 4.3 - Performance Comparison Table

In [None]:
# Create a summary table for performance metrics
import pandas as pd

# We have baseline (lr_reg) and neural network (model_nn) performance metrics.
# Here we use the test accuracy from earlier experiments.
performance_data = {
    "Model": ["Logistic Regression (With Reg)", "Neural Network"],
    "Test Accuracy": [test_acc_reg, val_accuracies[-1]]
}

performance_df = pd.DataFrame(performance_data)
print("Performance Comparison:")
print(performance_df)

# 5 - Error Analysis
Below, we discuss examples where the models fail. For instance, consider headlines that are misclassified by the baseline model.
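Beyond reading individual examples, it can help to see which class pairs are confused most often. The sketch below counts (true, predicted) pairs among the errors; the helper name `top_confusions` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def top_confusions(true_labels, pred_labels, label_names, k=3):
    """Return the k most frequent (true, predicted) name pairs among the errors."""
    errors = Counter(
        (label_names[t], label_names[p])
        for t, p in zip(true_labels, pred_labels)
        if t != p
    )
    return errors.most_common(k)

# Hypothetical labels and predictions, for illustration only
names = ["World", "Sports", "Business", "Sci/Tech"]
y_true = [0, 2, 2, 2, 3, 3, 1]
y_pred = [0, 3, 3, 3, 2, 3, 1]
confusions = top_confusions(y_true, y_pred, names)
print(confusions)
# [(('Business', 'Sci/Tech'), 3), (('Sci/Tech', 'Business'), 1)]
```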


In [None]:
# For demonstration, print 5 misclassified examples from logistic regression with regularization
misclassified = []
for text, true_label, pred_label in zip(test_texts, test_labels, test_preds_reg):
    if true_label != pred_label:
        misclassified.append((text, true_label, pred_label))
    if len(misclassified) >= 5:
        break

print("Examples of Misclassified Headlines (Logistic Regression with Reg):")
for idx, (text, true_label, pred_label) in enumerate(misclassified, start=1):
    print(f"\nExample {idx}:")
    print(f"Headline: {text}")
    print(f"True Label: {true_label} ({label_names[true_label]})")
    print(f"Predicted Label: {pred_label} ({label_names[pred_label]})")

# 6 - Add more models here

Compare our models with established models such as XGBoost and random forests.

Find standard methods for reporting results.

# Manual Headline Prediction
Enter a news headline manually and see the predictions from:
1. Baseline Logistic Regression (using CountVectorizer features)
2. Neural Network Classifier (using our custom tokenization and padding)

The output includes the predicted text label and the confidence (as a percentage).



In [None]:
# Define label mapping
label_mapping = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Input: Manually enter a headline
headline = "Local sports team wins championship in a stunning upset"
print(f"Headline: {headline}\n")


# -------------------------------------------------------------------------------------------------------

# ---- Baseline Prediction using Logistic Regression ----
# Vectorize the headline using the previously fitted CountVectorizer
X_input = vectorizer.transform([headline])

# Get probability estimates
probs_baseline = lr_reg.predict_proba(X_input)[0]

# Determine the predicted label and confidence
pred_label_baseline = lr_reg.predict(X_input)[0]
confidence_baseline = np.max(probs_baseline) * 100

print("Baseline Logistic Regression:")
print(f" Predicted Label: {pred_label_baseline}: {label_mapping[pred_label_baseline]}")
print(f" Confidence: {confidence_baseline:.2f}%")
print()

# -------------------------------------------------------------------------------------------------------

# ---- Neural Network Prediction ----
# Tokenize the headline using our custom function
indices = tokenize_text(headline)
# Pad or truncate the sequence to max_seq_len
if len(indices) < max_seq_len:
    indices = indices + [pad_idx] * (max_seq_len - len(indices))
else:
    indices = indices[:max_seq_len]
# Convert to a torch tensor and add batch dimension
input_tensor = torch.tensor([indices], dtype=torch.long)

# Predict with the neural network
model_nn.eval()
with torch.no_grad():
    logits = model_nn(input_tensor)
    # Apply softmax to obtain probability distribution
    probs_nn = F.softmax(logits, dim=1)[0]
    pred_label_nn = torch.argmax(probs_nn).item()
    confidence_nn = probs_nn[pred_label_nn].item() * 100

print("Neural Network Classifier:")
print(f" Predicted Label: {pred_label_nn}: {label_mapping[pred_label_nn]}")
print(f" Confidence: {confidence_nn:.2f}%")

# 7 - LSTM 