Indonesia News Title Classification

Project Intro/Objective

This project demonstrates a simple text classification task using basic features (Bag of Words and TF-IDF) and basic models (Naive Bayes and Logistic Regression). Along the way, it applies several important machine learning concepts, such as cross-validation, confusion matrices, simple hyperparameter tuning with randomized search, and data preprocessing. The final model reaches 81% accuracy and an 81% F1 score on the test set. The project also shows how to deploy the model using the Flask framework. Hopefully it helps beginners who are entering the ML world, especially those interested in text processing.

Methods Used

Exploratory Data Analysis
Bag of Words
TF-IDF
Cross Validation
ML Model (logistic regression, naive bayes)
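
To make the two feature types above concrete, here is a minimal from-scratch sketch of Bag of Words counts and TF-IDF weights. It is not the code from the notebook: the function names, the whitespace tokenizer, and the three sample titles are made up for illustration, and the IDF formula used, ln((1 + n) / (1 + df)) + 1 with L2 normalization, is one common smoothed variant assumed here.

```python
import math
from collections import Counter


def tokenize(title):
    # Assumption: lowercase + whitespace split. A real pipeline would likely
    # also stem words and remove stop words before counting.
    return title.lower().split()


def bag_of_words(titles):
    """Return (vocabulary, count vectors) for a list of titles."""
    tokenized = [tokenize(t) for t in titles]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[index[tok]] = count
        vectors.append(row)
    return vocab, vectors


def tf_idf(count_vectors):
    """Reweight count vectors so that words found in many titles count less."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: in how many titles does each word appear?
    df = [sum(1 for row in count_vectors if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (assumed variant).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in count_vectors:
        scaled = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled)) or 1.0
        weighted.append([v / norm for v in scaled])  # L2-normalize each row
    return weighted


# Hypothetical sample titles, not taken from the dataset.
titles = ["harga emas naik", "harga saham turun", "timnas menang"]
vocab, counts = bag_of_words(titles)
weights = tf_idf(counts)

print(vocab)
print(counts[0])
print([round(w, 3) for w in weights[0]])
```

In the first title, "harga" appears in two of the three titles, so its TF-IDF weight ends up lower than the weight of "emas", which appears in only one. Bag of Words alone would give both words the same count.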

Requirements

All code is written in Python. The analysis and modelling were done in Jupyter Notebook, and the model is served with Flask. The libraries and services used are:

Data Science libraries (numpy, sklearn)
Indonesian NLP library (PySastrawi)
Web Framework library (flask)
Heroku account to host the model in the cloud

Project Description

The dataset used in this project is the Indonesian News Title dataset, which contains more than 90,000 news titles. You can download the dataset here. As the name suggests, the goal is to classify each news title by topic. The topics are divided into nine categories: News, Hot, Finance, Travel, Inet, Health, Oto, Food, and Sport. The main challenge of this project is that the categories are imbalanced. Here is the distribution of all categories:

[Figure: distribution of news titles per category]

I used Bag of Words and TF-IDF features in this project because they are simpler and easier to understand than more complex representations such as word embeddings. Some papers suggest that word vectors combined with LDA topic modelling could also be used, but this technique is not yet explored in this project. Surprisingly, these simple features alone already produce a relatively good result. Here are the final scores on the test dataset:

Metrics	Score
Accuracy	81%
Recall	81%
Precision	82%
F1 Score	81%
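
For readers unfamiliar with these metrics, the sketch below shows how accuracy, precision, recall, and F1 can be computed for a multi-class problem. The README does not state which averaging was used, so support-weighted averaging is an assumption here, and the function name and the tiny label lists are invented for illustration only.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    total = len(pairs)
    support = Counter(y_true)  # number of true examples per class
    accuracy = sum(t == p for t, p in pairs) / total

    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        # Weight each class by its share of the true labels (assumption).
        weight = support[label] / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, chosen only to keep the arithmetic easy to follow.
y_true = ["News", "News", "News", "Sport", "Food"]
y_pred = ["News", "News", "Sport", "Sport", "News"]
scores = classification_scores(y_true, y_pred)
print(scores)
```

With support-weighted averaging, recall is always equal to accuracy. That would be consistent with the identical accuracy and recall values in the table, although it does not prove which averaging the project used.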

Replication Steps

To replicate this project:

Clone this repo
Activate the virtual environment
Install the requirements listed in requirements.txt using pip install -r requirements.txt
Run the Flask server using the flask run command

API Endpoints:

Route	Required Parameters	Return
/predictNewsTitle	q (news title)	Given news title with its predicted category
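
As a rough usage sketch, the endpoint above could be called from Python as follows. Several details are assumptions rather than facts from this README: that the route accepts GET requests with q as a query-string parameter, that the server runs on the Flask development default of http://127.0.0.1:5000, and the helper names themselves. The response format is not documented here, so the body is returned as raw text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: default address of the Flask development server.
BASE_URL = "http://127.0.0.1:5000"


def build_predict_url(title, base_url=BASE_URL):
    """Build the request URL, URL-encoding the news title into `q`."""
    return f"{base_url}/predictNewsTitle?{urlencode({'q': title})}"


def predict_category(title, base_url=BASE_URL):
    """Send the request and return the raw response body as text.

    Requires the Flask server to be running; assumes a GET request.
    """
    with urlopen(build_predict_url(title, base_url)) as response:
        return response.read().decode("utf-8")


# Building the URL needs no running server (hypothetical sample title).
url = build_predict_url("harga emas naik")
print(url)
```

Once the server is running, calling predict_category("harga emas naik") should return whatever the route sends back for that title.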

Featured Notebooks/Analysis/Deliverables

Jupyter Notebook

Contact

Reach me through ibamibrahim0 [at] gmail

