- This project implements a hybrid approach to classify unlabeled news articles into meaningful categories by combining unsupervised topic modeling and supervised text classification using BERT.
- I used the Kaggle News Category Dataset to build the model: https://www.kaggle.com/datasets/rmisra/news-category-dataset.
**Workflow:**

1. **Unsupervised Topic Modeling:**
   - Leveraged BERTopic to discover latent topics in the news articles.
   - Generated interpretable topics from each topic's top words and their probabilities.
   - Example topics:
     - Politics: `["government", "election", "policy"]`
     - Economy: `["market", "economy", "trade"]`
2. **Pseudo-Labeling:**
   - Selected high-confidence topic assignments to pseudo-label articles.
   - Used these labels as training data for the supervised phase.
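The high-confidence selection can be sketched as a simple threshold over the document-topic probability matrix. The `0.7` cutoff and the tiny probability matrix are illustrative assumptions, not the project's actual settings.

```python
# Sketch: keep only documents whose best-topic probability clears a
# threshold; those become pseudo-labeled training examples.
# The 0.7 cutoff is an illustrative assumption.
import numpy as np

def pseudo_label(probs: np.ndarray, threshold: float = 0.7):
    """probs: (n_docs, n_topics) document-topic probabilities.
    Returns (kept_indices, labels) for confident assignments only."""
    best_topic = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    keep = confidence >= threshold
    return np.flatnonzero(keep), best_topic[keep]

probs = np.array([
    [0.90, 0.05, 0.05],   # confident -> topic 0
    [0.40, 0.35, 0.25],   # ambiguous -> dropped
    [0.10, 0.10, 0.80],   # confident -> topic 2
])
idx, labels = pseudo_label(probs)
print(idx.tolist(), labels.tolist())  # [0, 2] [0, 2]
```

Ambiguous documents dropped here are exactly the ones later handled in the iterative-refinement step.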
3. **Supervised Learning with `TFDistilBertForSequenceClassification`:**
   - Fine-tuned DistilBERT on the pseudo-labeled data for multi-class classification.
4. **Iterative Refinement:**
   - Used feedback loops to refine topic modeling outputs and improve label quality.
   - Incorporated domain knowledge to resolve ambiguous or low-confidence samples.
5. **Explainability:**
   - Visualized topics with word clouds for interpretability.
   - Cross-checked BERT predictions against topic modeling results to ensure consistency.
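The consistency cross-check can be as simple as an agreement rate between the two label sources. The arrays below are hypothetical; in practice they would come from the classifier's predictions and BERTopic's assignments on the same documents.

```python
# Sketch: measure how often the fine-tuned classifier agrees with the
# topic-model assignment on the same documents. The example arrays are
# hypothetical stand-ins for real model outputs.
def agreement_rate(bert_preds, topic_labels):
    assert len(bert_preds) == len(topic_labels)
    matches = sum(p == t for p, t in zip(bert_preds, topic_labels))
    return matches / len(bert_preds)

bert_preds   = [0, 1, 1, 2, 0]   # classifier outputs
topic_labels = [0, 1, 2, 2, 0]   # topic-model assignments
print(f"agreement: {agreement_rate(bert_preds, topic_labels):.0%}")  # agreement: 80%
```

A low agreement rate flags documents worth revisiting in the iterative-refinement step.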
**Setup:**

- Create a virtual environment: `python3 -m venv venv && source venv/bin/activate`
- Install the dependencies by running `make requirements` in your terminal.
- Follow `unsupervised.ipynb` to generate pseudo-labels with topic modeling. Feel free to experiment as you wish.
- Once the final input file is generated from topic modeling, follow `supervised.ipynb` to build the supervised text classification model. Note: training can take hours depending on your infrastructure; it took about 6 hours on my MacBook.
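The `make requirements` target referenced above is assumed to look roughly like the following; this is a hypothetical sketch, so check the repository's actual Makefile.

```make
# Hypothetical Makefile target; the repository's real Makefile may differ.
requirements:
	pip install -r requirements.txt
```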


