An end-to-end text classification project covering data collection, model training, and deployment.
The model can classify books into 141 different genres.
The keys of deployment\genre_types_encoded.json
list the supported genres
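The exact layout of the mapping file isn't shown here; assuming it maps each genre name to an integer id, the labels can be recovered from its keys like this (the sample mapping below is illustrative, not the real file):

```python
import json

# Illustrative stand-in for deployment/genre_types_encoded.json;
# the real file maps each of the 141 genre names to an integer id.
sample = '{"Fantasy": 0, "Science Fiction": 1, "Mystery": 2}'

genre_map = json.loads(sample)
genres = sorted(genre_map, key=genre_map.get)  # genre names ordered by id
print(genres)  # ['Fantasy', 'Science Fiction', 'Mystery']
```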
Data was collected from this Goodreads list: https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once
The data collection process is divided into 2 steps:
- Book URL Scraping: The book URLs were scraped with
scraper\book_url_scraper.py
and the URLs are stored along with the book titles in scraper\book_urls.csv
- Book Details Scraping: Using the URLs, each book's description and genres were scraped with
scraper\book_details_scraper.py
and stored in data\book_detils.csv
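The selectors actually used live in the scraper scripts; as a rough sketch, book links on a Goodreads list page can be pulled from anchors pointing at /book/show/ (the HTML snippet and the regex below are assumptions for illustration, not the scripts' real logic):

```python
import re

# Minimal stand-in for a fragment of the Goodreads list page HTML.
html = '''
<a class="bookTitle" href="/book/show/2767052-the-hunger-games">
  <span itemprop="name">The Hunger Games</span>
</a>
'''

# Pair each /book/show/ link with the title inside its anchor.
pattern = re.compile(
    r'href="(/book/show/[^"]+)".*?<span itemprop="name">([^<]+)</span>',
    re.DOTALL,
)
books = [("https://www.goodreads.com" + path, title.strip())
         for path, title in pattern.findall(html)]
print(books)
```

In the real scripts these (URL, title) pairs would be written to scraper\book_urls.csv, and a second pass over each book page would collect the description and genre shelves.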
In total, I scraped 6,313 book details
Initially, there were 640 distinct genres in the dataset. After some analysis, I found that 499 of them were rare (probably custom genres added by users), so I removed them, leaving 141 genres. I then removed descriptions left without any genre, resulting in 6,104 samples.
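The exact cutoff used for "rare" isn't stated here; a stdlib sketch of the same two-stage filtering idea, with toy data and an assumed minimum-count threshold:

```python
from collections import Counter

# Toy samples: (description, genre list); the real data has 6,313 rows.
samples = [
    ("desc a", ["Fantasy", "obscure-user-shelf"]),
    ("desc b", ["Fantasy", "Mystery"]),
    ("desc c", ["rare-shelf"]),
]

# Count how often each genre appears across all samples.
counts = Counter(g for _, genres in samples for g in genres)

# Keep genres seen at least MIN_COUNT times (the cutoff is an assumption).
MIN_COUNT = 2
kept = {g for g, c in counts.items() if c >= MIN_COUNT}

# Strip rare genres, then drop samples left with no genres at all.
cleaned = []
for desc, genres in samples:
    genres = [g for g in genres if g in kept]
    if genres:
        cleaned.append((desc, genres))
print(cleaned)  # [('desc a', ['Fantasy']), ('desc b', ['Fantasy'])]
```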
Fine-tuned a distilroberta-base
model from HuggingFace Transformers using fastai and Blurr. The model training notebook can be viewed here
The trained model is over 300 MB on disk. I compressed it with ONNX quantization, bringing it under 80 MB.
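The roughly 4x shrink (300+ MB down to under 80 MB) matches what int8 quantization predicts: each float32 weight (4 bytes) becomes one int8 value (1 byte) plus a shared scale. A stdlib sketch of that per-tensor linear quantization idea (the actual compression used ONNX quantization tooling, not this code):

```python
from array import array

# A toy float32 weight tensor.
weights = array("f", [0.5, -1.2, 3.4, -0.01, 2.0])

# Per-tensor symmetric linear quantization to int8.
scale = max(abs(w) for w in weights) / 127.0
quantized = array("b", [round(w / scale) for w in weights])
dequantized = [q * scale for q in quantized]

ratio = weights.itemsize / quantized.itemsize  # bytes per weight: 4 -> 1
print(ratio)  # 4.0: consistent with the 300+ MB -> <80 MB reduction
```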
The compressed model is deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in the deployment
folder or here
Deployed a Flask app that takes a book description and shows the predicted genres as output. Check the flask
branch. The website is live here
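The flask branch has the real app; a minimal sketch of the same request flow, with the model call stubbed out (the route name and response shape are assumptions):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_genres(description: str) -> list:
    # Placeholder: the real app runs the trained model here.
    return ["Fantasy"] if description else []

@app.route("/predict", methods=["POST"])
def predict():
    description = request.get_json(force=True).get("description", "")
    return jsonify({"genres": predict_genres(description)})

# app.run() would serve the site; Flask's test client can exercise it:
# client = app.test_client()
# client.post("/predict", json={"description": "A wizard's journey"})
```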