# NLP Topic Classification and Clustering Demo

This demo uses the 20 Newsgroups dataset to demonstrate:

• TF-IDF classification  
• SentenceTransformer embedding classification  
• Topic clustering and hierarchical topic tree generation  

Dataset size: 10,000 documents across 20 topics

In [1]:
import sys

sys.path.append("..")

from src.data_loader import load_raw_dataset, subsample_dataset

import pandas as pd
from IPython.display import Image, display

## Dataset Overview

We load the 20 Newsgroups dataset and subsample to 10,000 documents.

In [2]:
texts, labels, label_names = load_raw_dataset()

texts, labels = subsample_dataset(
    texts,
    labels,
    max_samples=10000
)

print("Dataset loaded")
print("Number of documents:", len(texts))
print("Number of classes:", len(label_names))

Dataset loaded
Number of documents: 10000
Number of classes: 20


## TF-IDF Classification Results

TF-IDF converts documents into sparse numerical features.

Of the classifiers trained on these features, Linear SVM performed best, reaching about 71% accuracy.
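To make "sparse numerical features" concrete, here is a minimal standard-library sketch of TF-IDF weighting. It is not the demo's own feature extraction code, which is not shown in this notebook. The function name `tfidf_vectors`, the whitespace tokenizer, and the three example sentences are illustrative assumptions. The smoothed IDF formula and L2 normalization follow the defaults of scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, L2-normalized."""
    # Assumption: naive lowercase + whitespace tokenization, for illustration only.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight = raw term count * smoothed IDF, ln((1 + n) / (1 + df)) + 1.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical mini-corpus (not taken from 20 Newsgroups).
docs = [
    "the rocket launch was delayed",
    "the hockey game was delayed",
    "the rocket reached orbit",
]
vectors = tfidf_vectors(docs)

for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Terms that occur in every document, such as "the", receive the lowest weight, and terms absent from a document are simply not stored, which is what makes the representation sparse.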

In [None]:
print("TF-IDF Classification Results:\n")

display(
    Image(
        "../outputs/tfidf/tfidf_confusion_matrix_compact_Linear_SVM.png",
        width=600
    )
)

TF-IDF Classification Results:



FileNotFoundError: No such file or directory: '../outputs/bow/tfidf_confusion_matrix_compact_Linear_SVM.png'

<IPython.core.display.Image object>

## SentenceTransformer Embedding Classification

SentenceTransformer converts documents into dense semantic vectors.

These embeddings capture semantic meaning rather than word frequency.
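Dense vectors are usually compared with cosine similarity, which is one way to see what "semantic meaning rather than word frequency" buys you. The sketch below uses hand-written three-dimensional vectors as stand-ins for real embeddings; the values and variable names are made up for illustration and were not produced by any SentenceTransformer model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (real ones have hundreds of dimensions).
space_doc_a = [0.9, 0.1, 0.0]
space_doc_b = [0.8, 0.2, 0.1]
hockey_doc = [0.1, 0.9, 0.2]

same_topic = cosine_similarity(space_doc_a, space_doc_b)
different_topic = cosine_similarity(space_doc_a, hockey_doc)

print(f"same topic:      {same_topic:.3f}")
print(f"different topic: {different_topic:.3f}")
```

Two documents can score as similar here even when they share no words, because similarity is measured between vector directions rather than between vocabularies.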

In [None]:
print("SentenceTransformer Classification Results:\n")

display(
    Image(
        "../outputs/embeddings/confusion_matrix_compact_Linear_SVM.png",
        width=600
    )
)

## Model Comparison

We compare classification accuracy and Macro-F1 scores across feature types.

TF-IDF achieved slightly higher classification accuracy than SentenceTransformer embeddings: about 71% with Linear SVM, compared to about 69% for embeddings. A likely explanation is that TF-IDF captures specific keywords that strongly indicate topic categories in the 20 Newsgroups dataset.

SentenceTransformer embeddings, while slightly less accurate for classification here, capture semantic meaning, which makes them a natural fit for clustering and topic discovery. In this demo, TF-IDF is therefore the stronger choice for supervised classification, and the embeddings are used for the unsupervised semantic analysis.
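For reference, the two metrics in the comparison can be computed as follows. This is a minimal standard-library sketch, not the evaluation code that produced `model_comparison.csv`; the function names and the toy labels are illustrative assumptions. Macro-F1 is the unweighted mean of the per-class F1 scores, so small classes count as much as large ones.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: an imbalanced two-class example.
y_true = ["sci", "sci", "sci", "sci", "rec", "rec"]
y_pred = ["sci", "sci", "sci", "sci", "sci", "rec"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy: {acc:.3f}")
print(f"macro-F1: {f1:.3f}")
```

In this toy example Macro-F1 is lower than accuracy because the single error falls on the smaller class, which is why reporting both is informative.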

In [None]:
comparison = pd.read_csv(
    "../outputs/comparisons/model_comparison.csv",
    index_col=0
)

comparison

## Elbow Method for Optimal Cluster Selection

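We use the elbow method to choose the number of topic clusters: clustering is run for a range of cluster counts, and we look for the point where adding more clusters stops reducing the within-cluster error appreciably.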
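The idea behind the elbow method can be sketched on toy data. The code below is a minimal standard-library illustration, not the clustering code used to produce the plot: it runs a simple k-means with a deterministic farthest-point initialization (an assumption made here so the result is reproducible), records the within-cluster sum of squared distances for each k, and picks the k after which the improvement falls off most sharply. That last selection rule is one common heuristic; in practice the elbow is often read off the plot by eye.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_inertia(points, k, n_iter=20):
    """Run a basic k-means and return the within-cluster sum of squares."""
    # Deterministic farthest-point initialization (illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))

    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else center
            for members, center in zip(clusters, centers)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Hypothetical 2-D data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (0, 10), (0, 11), (1, 10), (1, 11),
]

ks = [1, 2, 3, 4]
inertias = [kmeans_inertia(points, k) for k in ks]

# drops[i] is the improvement gained by going from ks[i] to ks[i + 1] clusters.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
# Elbow heuristic: the k after which the improvement shrinks the most.
elbow_index = max(range(1, len(drops)), key=lambda i: drops[i - 1] - drops[i])
elbow_k = ks[elbow_index]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.2f}")
print("elbow at k =", elbow_k)
```

On real document embeddings the curve is usually much smoother than on this toy data, so the chosen cluster count should be treated as a reasonable estimate rather than a uniquely correct answer.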

In [None]:
display(
    Image(
        "../outputs/clustering/elbow_plot.png",
        width=500
    )
)

## Topic Tree

We generate a hierarchical topic tree using clustering and LLM-based labeling.

Top-level clusters contain subtopic clusters.
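A two-level hierarchy like this can be represented as a small tree of nodes. The sketch below is an assumption about the shape of the data, not the demo's implementation: the `TopicNode` class, the indented text layout, and the example labels and document ids are all hypothetical, and the labels are typed in by hand where the demo would obtain them from an LLM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicNode:
    """One cluster in the topic tree, with optional subtopic clusters."""
    label: str
    doc_ids: List[int] = field(default_factory=list)
    children: List["TopicNode"] = field(default_factory=list)

    def size(self):
        """Number of documents in this cluster, including all subtopics."""
        return len(self.doc_ids) + sum(child.size() for child in self.children)


def render(node, depth=0):
    """Return the tree as a list of indented text lines."""
    lines = [f"{'  ' * depth}{node.label} [{node.size()}]"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical tree: labels and document ids are placeholders.
root = TopicNode(
    "All topics",
    children=[
        TopicNode(
            "Sports",
            children=[
                TopicNode("Hockey", doc_ids=[0, 1]),
                TopicNode("Baseball", doc_ids=[2]),
            ],
        ),
        TopicNode("Science", children=[TopicNode("Space", doc_ids=[3, 4, 5])]),
    ],
)

lines = render(root)
print("\n".join(lines))
```

Storing document ids only at the leaves keeps each document in exactly one subtopic, and parent sizes are derived by summing over children.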

In [None]:
with open("../outputs/clustering/topic_tree.txt", encoding="utf-8") as f:
    print(f.read())

## Summary

• TF-IDF with Linear SVM achieved the highest classification accuracy (about 71%, versus about 69% for embeddings)  
• SentenceTransformer embeddings enabled semantic clustering  
• The elbow method was used to choose the number of clusters  
• LLM-generated labels produced an interpretable topic hierarchy  