An end-to-end text classification project covering data collection, model training, and deployment.
The model can classify books into 141 different genres.
The keys of deployment\genre_types_encoded.json
list the supported genres
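The exact layout of the mapping file isn't shown here; assuming it maps each genre name to an integer id, the labels can be recovered from its keys like this (the sample mapping below is illustrative, not the real file):

```python
import json

# Illustrative stand-in for deployment/genre_types_encoded.json;
# the real file maps each of the 141 genre names to an integer id.
sample = '{"Fantasy": 0, "Science Fiction": 1, "Mystery": 2}'

genre_map = json.loads(sample)
genres = sorted(genre_map, key=genre_map.get)  # genre names ordered by id
print(genres)  # ['Fantasy', 'Science Fiction', 'Mystery']
```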
Data was collected from this Goodreads list: https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once
The data collection process is divided into 2 steps:
- Book URL Scraping: The book URLs were scraped with
scraper\book_url_scraper.py
and the URLs are stored along with the book titles in scraper\book_urls.csv
- Book Details Scraping: Using the URLs, each book's description and genres were scraped with
scraper\book_details_scraper.py
and stored in data\book_detils.csv
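The selectors actually used live in the scraper scripts; as a rough sketch, book links on a Goodreads list page can be pulled from anchors pointing at /book/show/ (the HTML snippet and the regex below are assumptions for illustration, not the scripts' real logic):

```python
import re

# Minimal stand-in for a fragment of the Goodreads list page HTML.
html = '''
<a class="bookTitle" href="/book/show/2767052-the-hunger-games">
  <span itemprop="name">The Hunger Games</span>
</a>
'''

# Pair each /book/show/ link with the title inside its anchor.
pattern = re.compile(
    r'href="(/book/show/[^"]+)".*?<span itemprop="name">([^<]+)</span>',
    re.DOTALL,
)
books = [("https://www.goodreads.com" + path, title.strip())
         for path, title in pattern.findall(html)]
print(books)
```

In the real scripts these (URL, title) pairs would be written to scraper\book_urls.csv, and a second pass over each book page would collect the description and genre shelves.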
In total, I scraped 6,313 book details
Initially, there were 640 distinct genres in the dataset. After some analysis, I found that 499 of them were rare (probably custom genres added by users), so I removed them, leaving 141 genres. I then removed descriptions left without any genre, resulting in 6,104 samples.
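The exact cutoff used for "rare" isn't stated here; a stdlib sketch of the same two-stage filtering idea, with toy data and an assumed minimum-count threshold:

```python
from collections import Counter

# Toy samples: (description, genre list); the real data has 6,313 rows.
samples = [
    ("desc a", ["Fantasy", "obscure-user-shelf"]),
    ("desc b", ["Fantasy", "Mystery"]),
    ("desc c", ["rare-shelf"]),
]

# Count how often each genre appears across all samples.
counts = Counter(g for _, genres in samples for g in genres)

# Keep genres seen at least MIN_COUNT times (the cutoff is an assumption).
MIN_COUNT = 2
kept = {g for g, c in counts.items() if c >= MIN_COUNT}

# Strip rare genres, then drop samples left with no genres at all.
cleaned = []
for desc, genres in samples:
    genres = [g for g in genres if g in kept]
    if genres:
        cleaned.append((desc, genres))
print(cleaned)  # [('desc a', ['Fantasy']), ('desc b', ['Fantasy'])]
```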
Fine-tuned a distilroberta-base
model from HuggingFace Transformers using fastai and Blurr. The model training notebook can be viewed here
The trained model is over 300 MB on disk. I compressed it with ONNX quantization, bringing it under 80 MB.
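The roughly 4x shrink (300+ MB down to under 80 MB) matches what int8 quantization predicts: each float32 weight (4 bytes) becomes one int8 value (1 byte) plus a shared scale. A stdlib sketch of that per-tensor linear quantization idea (the actual compression used ONNX quantization tooling, not this code):

```python
from array import array

# A toy float32 weight tensor.
weights = array("f", [0.5, -1.2, 3.4, -0.01, 2.0])

# Per-tensor symmetric linear quantization to int8.
scale = max(abs(w) for w in weights) / 127.0
quantized = array("b", [round(w / scale) for w in weights])
dequantized = [q * scale for q in quantized]

ratio = weights.itemsize / quantized.itemsize  # bytes per weight: 4 -> 1
print(ratio)  # 4.0: consistent with the 300+ MB -> <80 MB reduction
```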
The compressed model is deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in the deployment
folder or here
Deployed a Flask app that takes a book description and shows the predicted genres as output. Check the flask
branch. The website is live here
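The flask branch has the real app; a minimal sketch of the same request flow, with the model call stubbed out (the route name and response shape are assumptions):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_genres(description: str) -> list:
    # Placeholder: the real app runs the trained model here.
    return ["Fantasy"] if description else []

@app.route("/predict", methods=["POST"])
def predict():
    description = request.get_json(force=True).get("description", "")
    return jsonify({"genres": predict_genres(description)})

# app.run() would serve the site; Flask's test client can exercise it:
# client = app.test_client()
# client.post("/predict", json={"description": "A wizard's journey"})
```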