Final Capstone

Book Recommender using BERT NLP on scraped Blurbs

There are 5 notebooks involved in this project:

EDA -- Exploratory Data Analysis. Some plots and statistics on the data I am using
Blurb Scraping -- The scrapy code I used to scrape the blurb data
Blurb Cleaning -- A lot of cleaning was required on the blurbs that I scraped in order to be used. This includes using Bert-as-a-Service to get vector embeddings of each blurb
Parameter Tuning -- Running the model in different circumstances to find an optimal RMSE
Model Execution -- Functions to implement the recommender system, and utilizes optimal cleaned and processed data.

Project Overview

This doesn’t explain much of the WHY or my thoughts, just a bullet outline of the project. The ‘why’ is addressed more in the powerpoint.

https://docs.google.com/presentation/d/1iW6igOogAVr58uQi9FfgY0fs_O62Kxh5bgsa_M2OJOU/edit?usp=sharing

Goal: Build a book recommender

Conduct exploratory data analysis on my dataset
Scrape the ‘blurbs’ for each book in my dataset
Clean scraped books by matching them to the correct ISBN
Generate vector embeddings of the blurbs using BERT
Cluster the vector embeddings using K-Means
Use SVD to predict a user-item matrix where the items are clusters
Tune parameters of SVD to achieve a low RMSE
Compare to multiple different baseline RMSEs
Evaluate RMSE on ISBNs despite training on Clusters to get a real evaluation
Use the predicted matrix to find the highest rated clusters
Find the most popular books from the recommended clusters, and recommend them
Create a version of the model that is FAST so it can be used in real-time by a webapp or company that would use my product.

Next Steps/Possible Improvements

3/4 of the books were lost in the process of scraping blurbs and cleaning them and matching them to the correct book. The model would improve greatly with a fresh scrape. Other improvements to the data source would involve a more streamlined and consistent blurb cleaning process, since some of my algorithms are 'hacks' and only really work for this data. The data source I trained the model on is limited and 16 years old. Being able to implement this model into an existing book-related web service would improve the amount of data greatly
The BERT vector embeddings used were a simple arithmetic mean of the vectors for the all of the sentences in the blurb. The BERT model was also not pre-trained on any of my data and was simply 'out of the box.' In the future I would train the neural network model on the data that I have. It was just a little complicated and out of my expertise to do this, and BERT is known for being very powerful out of the box.
Languages have an undesired effect on the model. It is good at recommending German books to German readers, but is not able to recommend anything content-related within a language other than the primary language in the dataset: English. For example, a user that only reads French romance novels, would get recommended French sci-fi, as opposed to English romance novels. If I had enough data for other languages I could run seperate models... Or I could find translations of the blurbs, and just always recommend English books, and non-English readers could look for translated copies based on the recommendations.
There are tons of repeat books that have different ISBNs in the dataset. I should try to combine these.
Explore other clustering methods better and run more evaluations on the clustering. K-Means is intuitively a good model for this data, and is finding subjectively meaningful clusters, but other algorithms with different parameters could be explored.
More thorough parameter tuning would improve the model to having a better RMSE. However, a full gridsearch would require additional computer power and time.
Other, more naive clustering methods should be considered and evaluated as baselines. For example, scraping the 'genre' of the book and clustering by that instead of BERT vectors is simpler and might provide better results. Clustering by Authors doesn't solve the sparsity problem very well since there are so many Authors still.
The model only predicts the 'best' books right now. For a book recommender, where the user very likely has not reported all of the books they have read, a system that recommends high quality, lesser known books is probably superior. I would like to incorporate more 'business-solution' functions to the final model.
Create a webapp to help show off this model and let people start getting recommendations!

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
Blurb Cleaning.ipynb		Blurb Cleaning.ipynb
BlurbScraper.ipynb		BlurbScraper.ipynb
EDA.ipynb		EDA.ipynb
ModelExecution.ipynb		ModelExecution.ipynb
ParameterTuning.ipynb		ParameterTuning.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Final Capstone

Project Overview

Next Steps/Possible Improvements

About

Uh oh!

Releases

Packages

Languages

joedobrow/FinalCapstone

Folders and files

Latest commit

History

Repository files navigation

Final Capstone

Project Overview

Next Steps/Possible Improvements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages