# COGS 118B - Final Project

## Book Recommendation System

### Group members

- Natalia Abdulmawla
- Aarya Patel
- Brian Lee
- Holden Ly
- Yanxiong Chen

### Abstract

The goal of our project was to explore the Book-Crossing dataset and build a book recommendation system from it. The dataset contains a total of 8 different features, with 1,149,780 ratings of 271,379 books given on a scale of 1 to 10 by 278,858 reviewers. To build this system we applied unsupervised learning techniques such as k-means clustering and principal component analysis, looking for what works best on our dataset while exploring it for underlying patterns. This allowed us to recommend books by using the similarities between user preferences. We then attempted to evaluate the performance of the algorithm. Through this project we aimed to demonstrate the challenges of providing personalized book recommendations based on a person's interests, using the books and their given features.

### Background

Recent decades have seen an explosion of all kinds of content, from literature to music to fashion trends, driven in particular by the increasing ease of access to the internet. By the nature of the global marketplace, the volume of options for a user to choose from can be overwhelming, leaving people with the challenge of making informed decisions about the content they want to consume. This has led to the rise of recommendation systems, which personalize choices based on a user's preferences and behaviors to alleviate this burden<a name="rs"></a>[<sup>[2]</sup>](#rsnote). Recommendation systems have become far-reaching in our digital lives due to their practical applications in fields such as e-commerce, entertainment, and social media, where they enhance user experience and satisfaction<a name="icwww"></a>[<sup>[3]</sup>](#icwwwnote). To provide personalized choices, these systems leverage machine learning algorithms to predict a user's preferences and make recommendations accordingly.

Research into recommendation systems has explored various approaches to the challenge of providing the most accurate personalized recommendations to users. Collaborative filtering is one of the most popular techniques. It identifies patterns in the interactions between users and items and groups together users with similar tastes, enabling personalized recommendations of items based on the preferences of a group of users. Another popular technique is content-based filtering, which uses the characteristics of an item to identify and recommend items that align with a user's preferences. Machine learning techniques such as k-means clustering and principal component analysis, among many others, have also been of great use in building and improving recommendation systems by enabling researchers to identify meaningful patterns in large datasets<a name="rsf"></a>[<sup>[4]</sup>](#rsfnote).

Book recommendations are one application of these recommendation systems, frequently used by book-oriented platforms such as Goodreads with the goal of helping a user discover books that align with their preferences and interests. In turn, these platforms have accumulated large amounts of data on user preferences and behaviors, which are then used to continuously improve the performance and accuracy of the recommendation systems they use.

Our project of developing a basic personalized book recommendation system focuses on leveraging the Book-Crossing dataset, a collection of user ratings and book information. Compiled by Cai-Nicolas Ziegler of the Institute for Information Systems, the Book-Crossing dataset comprises 1,149,780 ratings (on a scale of 1-10 stars) of 271,379 books provided by 278,858 reviewers, making it a valuable dataset for exploring user preferences and behaviors<a name="bx"></a>[<sup>[1]</sup>](#bxnote). It provides a comprehensive view of user preferences towards different books, giving us a solid groundwork for the development of a recommendation system.

Through our project we seek to build upon the existing research by employing unsupervised learning techniques to uncover hidden patterns within the Book-Crossing dataset. By grouping reviewers and books based on their ratings and other features, we seek a deeper understanding of user preferences towards books and their characteristics, along with a deeper understanding of the unsupervised learning techniques used in building a recommendation system and their real-world applications.

### Problem Statement

The goal of our project is to develop a book recommendation system using the Book-Crossing dataset. The dataset represents a large collection of user-book interactions, including ratings provided by users for various books. By utilizing the 1.1 million ratings of 271,379 books by 278,858 reviewers and applying various unsupervised learning techniques, we seek to create a personalized book recommendation system.

The main challenge is properly applying these techniques to the vast amount of data available in the dataset. To overcome this challenge we first explore the data to identify patterns and similarities among the books, the reviewers, and their ratings. This basic exploration shows how we must clean and preprocess the data, by handling missing values and outliers and ensuring consistency, so that machine learning techniques are easier to apply. The problem and its solution can be reproduced by applying the same methodology of data preprocessing, feature engineering, and model building to a similar dataset, enabling someone to create a similar recommendation system for a different domain such as music or e-commerce.


### Data

In order to build our recommendation system we needed to find a dataset with the appropriate information. We found a dataset from the Institute for Information Systems<a name="bx"></a>[<sup>[1]</sup>](#bxnote). This consists of three different datasets:

**Books.csv** with 271379 observations and 5 features: a unique numerical identifier for each book under the column **'ISBN'**, the title of the book under the column **'Title'**, the author of the book under the column **'Author'**, the numerical year the book was published under the column **'Year'**, and the name of the publisher that released the book under the column **'Publisher'**. Each observation represents a published book and some of the information that describes it.

**Ratings.csv** with 1149780 observations and 3 features: a unique numerical identifier for each reviewer, which provides anonymity, under the column **'User-ID'**, a unique numerical identifier for each book being rated under the column **'ISBN'**, and the rating given by the reviewer to the book on a range of 1-10 under the column **'Rating'**. Each observation represents a rating given by a reviewer to a book.

**Users.csv** with 278859 observations and 2 features: a unique identifier for each reviewer under the column **'User-ID'**, and the age of the reviewer under the column **'Age'**. Each observation represents a unique anonymized reviewer and their personal information.

Before using the data we performed a basic exploratory data analysis to surface information and patterns within it. This showed us how the data had to be cleaned to be usable. For example, in the Users dataset many reviewers had missing ages or illogical outlier ages, such as over 130 or under 5 years old. We handled these by replacing them with the mean age of the reviewers. We also filtered out data that was unnecessary and would affect our results or computation time, for example by including only ratings of books that are found in our books dataset. Other cleaning and preprocessing tasks included renaming columns to make them easier to use, converting data types and encoding data, and dropping any duplicates.
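The cleaning steps described above can be sketched roughly as follows. This is a minimal standard-library illustration on made-up rows, not the project's actual preprocessing code: the toy data, the helper names (`is_valid_age`, `clean_ages`, `filter_and_dedupe`), the exact age cutoffs, and the choice to compute the mean over valid ages only are all assumptions.

```python
from statistics import mean

# Hypothetical toy rows standing in for Users.csv, Books.csv, and Ratings.csv.
users = [
    {"User-ID": 1, "Age": 34},
    {"User-ID": 2, "Age": None},   # missing age
    {"User-ID": 3, "Age": 204},    # implausible age
    {"User-ID": 4, "Age": 26},
]
books = [{"ISBN": "0001"}, {"ISBN": "0002"}]
ratings = [
    {"User-ID": 1, "ISBN": "0001", "Rating": 8},
    {"User-ID": 1, "ISBN": "0001", "Rating": 8},   # exact duplicate
    {"User-ID": 2, "ISBN": "9999", "Rating": 5},   # ISBN not in books
    {"User-ID": 4, "ISBN": "0002", "Rating": 10},
]

MIN_AGE, MAX_AGE = 5, 130  # assumed cutoffs, loosely based on the text


def is_valid_age(age):
    """True when an age is present and inside the assumed plausible range."""
    return age is not None and MIN_AGE <= age <= MAX_AGE


def clean_ages(rows):
    """Replace missing or implausible ages with the mean of the valid ages."""
    fill = mean(r["Age"] for r in rows if is_valid_age(r["Age"]))
    return [
        {**r, "Age": r["Age"] if is_valid_age(r["Age"]) else fill}
        for r in rows
    ]


def filter_and_dedupe(rating_rows, book_rows):
    """Keep ratings whose ISBN is in the books table and drop exact duplicates."""
    known = {b["ISBN"] for b in book_rows}
    seen, kept = set(), []
    for r in rating_rows:
        key = (r["User-ID"], r["ISBN"], r["Rating"])
        if r["ISBN"] in known and key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


clean_users = clean_ages(users)
clean_ratings = filter_and_dedupe(ratings, books)
```

On the real files the same logic would more naturally be expressed with a dataframe library, but the order of operations (fix ages, filter by known ISBNs, drop duplicates) stays the same.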

### Proposed Solution

To develop a book recommendation system using the Book-Crossing dataset, we will employ unsupervised learning techniques such as k-means clustering and Gaussian mixture models using libraries such as scikit-learn and scipy. This gives us deeper insight into patterns in the data that we cannot see using simpler methods of visualizing the data. Should we discover that these techniques are not the most appropriate for our data, we will consider more traditional techniques used in the creation of a recommendation system, such as content-based filtering or collaborative filtering. The approach of particular interest is collaborative filtering, which uses the patterns in user-item interactions to identify reviewers with similar preferences and subsequently recommend items liked by those reviewers. This is done by calculating user similarity through measures such as cosine similarity. Because it follows a common methodology of exploring and preprocessing a dataset and then applying the appropriate algorithms, our solution is entirely reproducible by another person interested in developing a recommendation system.
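To make the collaborative filtering idea concrete, here is a small sketch of user-based filtering with cosine similarity. It is an illustration under assumptions rather than the project's implementation: the toy `ratings` dictionary, the function names, and the similarity-weighted averaging used to rank unseen books are all hypothetical choices.

```python
from math import sqrt

# Hypothetical ratings: user -> {ISBN: rating}. Unrated books are simply absent.
ratings = {
    "u1": {"A": 9, "B": 8, "C": 2},
    "u2": {"A": 8, "B": 9, "D": 10},
    "u3": {"A": 2, "C": 9, "E": 8},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating vectors (missing = 0)."""
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, ratings, top_n=3):
    """Rank books the target has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == target:
            continue
        sim = cosine_similarity(ratings[target], their)
        if sim <= 0:
            continue  # ignore users with no overlap in rated books
        for isbn, rating in their.items():
            if isbn not in ratings[target]:
                scores[isbn] = scores.get(isbn, 0.0) + sim * rating
                weights[isbn] = weights.get(isbn, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]


recommendations = recommend("u1", ratings)
```

Here `u1` agrees far more with `u2` than with `u3` on the books they share, so the book that only `u2` rated highly is ranked first.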

### Evaluation Metrics

To evaluate the aforementioned proposed solution, we utilize two main metrics: silhouette score, implemented via scikit-learn's metrics package, and root mean squared error (RMSE), implemented via the surprise module. 

The silhouette score is calculated by taking the mean of (b - a) / max(a, b) over all samples, where a is the mean intra-cluster distance and b is the mean distance between a sample and the nearest cluster that the sample does not belong to. We used it primarily as a metric to iteratively search for the optimal n_clusters parameter in our proposed KMeans solution. A score near 1 indicates well-separated clusters that are clearly definable in terms of labeling, a score near 0 indicates data points straddling the boundary between two clusters, and a score near -1 points to incorrect clustering.
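The formula above can be written out directly. The following is a hand-rolled sketch for illustration only; in practice scikit-learn's `silhouette_score` would be used instead, and the toy points and labelings here are made up. Singleton clusters are given a score of 0, which is a common convention and an assumption of this sketch.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all samples."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0
        # a: mean distance to the other members of the sample's own cluster
        # (the distance from the point to itself is 0, so it drops out).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Two tight groups far apart, labeled sensibly and then badly.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

A search over n_clusters would repeat this scoring for each candidate clustering and keep the value with the highest score.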

The RMSE metric, calculated by $$\sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}(r_{ui} - \hat{r}_{ui})^2}$$ where $r_{ui}$ is the true rating of item $i$ by user $u$, $\hat{r}_{ui}$ is the predicted rating, and $\hat{R}$ is the set of predictions, summarizes the magnitude of the errors between predicted and actual values. In this case, it was used as an accuracy metric on the testing data predictions of the various collaborative filtering algorithms with respect to their ground truths. The score is on the same scale as the data, in this case ratings between 1-10, and because the errors are squared before averaging, large errors are weighted more heavily than small ones. A score near 0 represents extremely accurate predictions, while larger scores, such as above 5, indicate significant errors in the model and its predictive ability.
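As a worked example of the formula, the sketch below computes RMSE for a handful of made-up (true rating, predicted rating) pairs. The `rmse` helper and the sample values are hypothetical and stand in for the library routine used in the project.

```python
from math import sqrt


def rmse(predictions):
    """RMSE over an iterable of (true_rating, estimated_rating) pairs."""
    pairs = list(predictions)
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = sum((r - est) ** 2 for r, est in pairs)
    return sqrt(squared_errors / len(pairs))


# Squared errors are 1, 0, 9, and 1, so the mean squared error is 11 / 4 = 2.75.
predictions = [(8, 7), (5, 5), (10, 7), (3, 4)]
score = rmse(predictions)
```

The single miss of 3 points dominates the result, which shows how squaring makes RMSE sensitive to large individual errors.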

### Results

#### Subsection X

### Discussion

#### Interpreting Results

#### Limitations

#### Ethics & Privacy

#### Conclusion

### Footnotes

<a name="bxnote"></a>1.[^](#bx): Ziegler, C. (Aug 2004) Book-Crossing Dataset *Institut für Informatik, Universität Freiburg* https://web.archive.org/web/20200511092532/http://www2.informatik.uni-freiburg.de/~cziegler/BX/<br>

<a name="rsnote"></a>2.[^](#rs): Marcuzzo, M., et al. (28 Jul 2022) Recommendation Systems: An Insight Into Current Development and Future Research Challenges *IEEE Access* https://ieeexplore.ieee.org/document/9843966<br>

<a name="icwwwnote"></a>3.[^](#icwww): Ziegler, C., et al. (May 2005) Improving Recommendation Lists Through Topic Diversification *IWC32* doi:10.1145/1060745.1060754<br>

<a name="rsfnote"></a>4.[^](#rsf): Linden, G., et al. (January 2003) Amazon.com Recommendations: Item-to-Item Collaborative Filtering *IEEE Internet Computing* https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf<br>

<a name="cocnote"></a>5.[^](#coc): Co-clustering *Surprise Documentation* https://surprise.readthedocs.io/en/stable/co_clustering.html<br>

<a name="acnote"></a>6.[^](#ac): Accuracy *Surprise Documentation* https://surprise.readthedocs.io/en/stable/accuracy.html<br>

<a name="pwcnote"></a>7.[^](#pwc): Recommendation Systems *Papers With Code* https://paperswithcode.com/task/recommendation-systems<br>

<a name="kgnote"></a>8.[^](#kg): Book-Crossing Dataset https://www.kaggle.com/datasets/somnambwl/bookcrossing-dataset/data<br>