- This project implements a hybrid approach to classify unlabeled news articles into meaningful categories by combining unsupervised topic modeling and supervised text classification using BERT.
- I used the Kaggle News Category Dataset to build the model: https://www.kaggle.com/datasets/rmisra/news-category-dataset.
**Workflow:**

1. **Unsupervised Topic Modeling:**
   - Leveraged BERTopic to discover latent topics in the news articles.
   - Generated interpretable topics from each topic's top words and their probabilities.
   - Example topics:
     - Politics: `["government", "election", "policy"]`
     - Economy: `["market", "economy", "trade"]`
2. **Pseudo-Labeling:**
   - Selected high-confidence topic assignments to pseudo-label articles.
   - Used these labels as training data for the supervised phase.
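The high-confidence selection can be sketched as a simple threshold over the document-topic probability matrix. The `0.7` cutoff and the tiny probability matrix are illustrative assumptions, not the project's actual settings.

```python
# Sketch: keep only documents whose best-topic probability clears a
# threshold; those become pseudo-labeled training examples.
# The 0.7 cutoff is an illustrative assumption.
import numpy as np

def pseudo_label(probs: np.ndarray, threshold: float = 0.7):
    """probs: (n_docs, n_topics) document-topic probabilities.
    Returns (kept_indices, labels) for confident assignments only."""
    best_topic = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    keep = confidence >= threshold
    return np.flatnonzero(keep), best_topic[keep]

probs = np.array([
    [0.90, 0.05, 0.05],   # confident -> topic 0
    [0.40, 0.35, 0.25],   # ambiguous -> dropped
    [0.10, 0.10, 0.80],   # confident -> topic 2
])
idx, labels = pseudo_label(probs)
print(idx.tolist(), labels.tolist())  # [0, 2] [0, 2]
```

Ambiguous documents dropped here are exactly the ones later handled in the iterative-refinement step.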
3. **Supervised Learning with `TFDistilBertForSequenceClassification`:**
   - Fine-tuned DistilBERT on the pseudo-labeled data for multi-class classification.
4. **Iterative Refinement:**
   - Used feedback loops to refine topic modeling outputs and improve label quality.
   - Incorporated domain knowledge to resolve ambiguous or low-confidence samples.
5. **Explainability:**
   - Visualized topics with word clouds for interpretability.
   - Cross-checked BERT predictions against topic modeling results to ensure consistency.
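The consistency cross-check can be as simple as an agreement rate between the two label sources. The arrays below are hypothetical; in practice they would come from the classifier's predictions and BERTopic's assignments on the same documents.

```python
# Sketch: measure how often the fine-tuned classifier agrees with the
# topic-model assignment on the same documents. The example arrays are
# hypothetical stand-ins for real model outputs.
def agreement_rate(bert_preds, topic_labels):
    assert len(bert_preds) == len(topic_labels)
    matches = sum(p == t for p, t in zip(bert_preds, topic_labels))
    return matches / len(bert_preds)

bert_preds   = [0, 1, 1, 2, 0]   # classifier outputs
topic_labels = [0, 1, 2, 2, 0]   # topic-model assignments
print(f"agreement: {agreement_rate(bert_preds, topic_labels):.0%}")  # agreement: 80%
```

A low agreement rate flags documents worth revisiting in the iterative-refinement step.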
**Setup:**

- Create a virtual environment: `python3 -m venv venv && source venv/bin/activate`
- Install the dependencies by running `make requirements` in your terminal.
- Follow `unsupervised.ipynb` to generate pseudo-labels with topic modeling. Feel free to experiment as you wish.
- Once the final input file is generated from topic modeling, follow `supervised.ipynb` to build the supervised text classification model. Note: training can take hours depending on your infrastructure; it took about 6 hours on my MacBook.
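The `make requirements` target referenced above is assumed to look roughly like the following; this is a hypothetical sketch, so check the repository's actual Makefile.

```make
# Hypothetical Makefile target; the repository's real Makefile may differ.
requirements:
	pip install -r requirements.txt
```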


