# BOOK RECOMMENDER SYSTEM

**MEMBERS:**

1. Shraddha Audut
2. Sraavya Idumudi
3. Emily Tran
4. Lina Urrego
5. Liya Yonas

## Introduction



Finding a good book to read can often feel overwhelming, especially given the sheer number of titles available across genres, authors, and styles. Readers frequently struggle to discover books that match their preferences, leading to frustration or disengagement from reading altogether.
Our project aims to address this challenge by building a book recommendation system that predicts books a user might enjoy based on their previous reading history and preferences. By leveraging book metadata (e.g., title, author, genre) and user interactions (e.g., ratings and reviews), we aim to create a system that provides personalized and meaningful book suggestions.
The ultimate goal is to simplify the process of book discovery for readers, making it more enjoyable and tailored to their tastes.

---



## Preprocessing Data


---



For the preprocessing portion of the assignment, the CSV files from the zip archive were loaded into the notebook. Each file had to be read with 'latin-1' encoding because of a formatting issue. Each dataframe was then checked for null values. The dataframe with the most null values was 'users.csv', with 110762 null rows in the 'Age' column. The 'books.csv' dataframe had a total of 7 rows containing null values. All rows with null values were dropped.

All columns in 'books.csv' were of object type, even though 'Year-Of-Publication' holds only numbers. After filtering the column for string values, none were found, so the column was cast with the '.astype(int)' method. Filtering did show some year values set to 0 or greater than 2024. These were dropped because they are not realistic; years before 1920 were dropped as well. To condense the user dataframe further, any ages under 12 or over 70 were dropped. Book ratings equal to 0 were dropped, given that the rating scale chosen for this experiment was 1 to 10.

The columns 'Image-URL-L', 'Image-URL-M' and 'Image-URL-S' were dropped because they are not used in the modeling. The 'Publisher' column was dropped as well, as was 'Location' from 'users.csv'. Lastly, the dataframes were merged to bring all the information together: 'users.csv' and 'ratings.csv' were merged first on 'User-ID', and the result was then merged with 'books.csv' on 'ISBN'. This gave us a more complete and robust dataframe to work with. To avoid duplicated titles, the average rating per unique title was calculated and stored in a separate dataframe.
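The filter, merge, and average steps described above can be illustrated on a few made-up rows. This is only a minimal sketch: the toy values and the variable names `users`, `ratings`, `books`, `merged`, and `mean_by_title` are assumptions for illustration, not the notebook's actual data.

```python
import pandas as pd

# Toy stand-ins for users.csv, ratings.csv and books.csv (all values are made up).
users = pd.DataFrame({"User-ID": [1, 2, 3], "Age": [25.0, 8.0, 40.0]})
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3, 3, 3],
    "ISBN": ["A", "B", "A", "A", "C", "B"],
    "Book-Rating": [8, 0, 9, 6, 10, 4],
})
books = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    # ISBNs A and C share a title (two editions of the same book).
    "Book-Title": ["Title One", "Title Two", "Title One"],
    "Year-Of-Publication": [1999, 2001, 2005],
})

# Keep ages from 12 to 70 (inclusive) and ratings above 0, using boolean masks.
users = users[users["Age"].between(12, 70)]
ratings = ratings[ratings["Book-Rating"] > 0]

# Inner joins: rows without a match on the other side are dropped.
merged = (
    users.merge(ratings, on="User-ID", how="inner")
         .merge(books, on="ISBN", how="inner")
)

# One value per title: the mean of every rating that title received.
mean_by_title = merged.groupby("Book-Title")["Book-Rating"].mean()
print(mean_by_title)
```

Because the two editions share a title, their ratings are pooled into a single average, which is the reason for grouping on 'Book-Title' rather than on 'ISBN'.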


In [None]:
! pip install kaggle

In [None]:
#!/bin/bash
!kaggle datasets download saurabhbagchi/books-dataset


Dataset URL: https://www.kaggle.com/datasets/saurabhbagchi/books-dataset
License(s): CC0-1.0
Downloading books-dataset.zip to /content
 85% 21.0M/24.6M [00:00<00:00, 28.3MB/s]
100% 24.6M/24.6M [00:00<00:00, 31.4MB/s]


In [None]:
! unzip books-dataset.zip

Archive:  books-dataset.zip
  inflating: books_data/books.csv    
  inflating: books_data/ratings.csv  
  inflating: books_data/users.csv    


In [None]:
import pandas as pd

In [None]:
books_df = pd.read_csv('books_data/books.csv', encoding='latin-1', on_bad_lines='warn', sep=";")

Skipping line 43667: expected 8 fields, saw 10
Skipping line 51751: expected 8 fields, saw 9

  books_df = pd.read_csv('books_data/books.csv', encoding='latin-1', on_bad_lines='warn', sep=";")
Skipping line 104319: expected 8 fields, saw 9
Skipping line 121768: expected 8 fields, saw 9

  books_df = pd.read_csv('books_data/books.csv', encoding='latin-1', on_bad_lines='warn', sep=";")
Skipping line 150789: expected 8 fields, saw 9
Skipping line 157128: expected 8 fields, saw 9
Skipping line 180189: expected 8 fields, saw 9
Skipping line 185738: expected 8 fields, saw 9

  books_df = pd.read_csv('books_data/books.csv', encoding='latin-1', on_bad_lines='warn', sep=";")
Skipping line 220626: expected 8 fields, saw 9
Skipping line 227933: expected 8 fields, saw 11
Skipping line 228957: expected 8 fields, saw 10
Skipping line 245933: expected 8 fields, saw 9
Skipping line 251296: expected 8 fields, saw 9
Skipping line 259941: expected 8 fields, saw 9
Skipping line 261529: expected 8 fields, 

In [None]:
books_df.head(15)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
5,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...
6,0425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.0...,http://images.amazon.com/images/P/0425176428.0...,http://images.amazon.com/images/P/0425176428.0...
7,0671870432,PLEADING GUILTY,Scott Turow,1993,Audioworks,http://images.amazon.com/images/P/0671870432.0...,http://images.amazon.com/images/P/0671870432.0...,http://images.amazon.com/images/P/0671870432.0...
8,0679425608,Under the Black Flag: The Romance and the Real...,David Cordingly,1996,Random House,http://images.amazon.com/images/P/0679425608.0...,http://images.amazon.com/images/P/0679425608.0...,http://images.amazon.com/images/P/0679425608.0...
9,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner,http://images.amazon.com/images/P/074322678X.0...,http://images.amazon.com/images/P/074322678X.0...,http://images.amazon.com/images/P/074322678X.0...


In [None]:
ratings_df = pd.read_csv('books_data/ratings.csv', encoding='latin-1', on_bad_lines='warn',sep=";")

In [None]:
ratings_df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [None]:
users_df = pd.read_csv('books_data/users.csv', encoding='latin-1', on_bad_lines='warn', sep=";")


In [None]:
print(books_df.shape)
print(ratings_df.shape)
print(users_df.shape)

(271360, 8)
(1149780, 3)
(278858, 3)


In [None]:
print(" ")
print('Books')
print(" ")
print(books_df.isna().sum())
print(" ")
print('Ratings')
print(" ")
print(ratings_df.isna().sum())
print(" ")
print('Users')
print(" ")
print(users_df.isna().sum())

 
Books
 
ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64
 
Ratings
 
User-ID        0
ISBN           0
Book-Rating    0
dtype: int64
 
Users
 
User-ID          0
Location         0
Age         110762
dtype: int64


In [None]:
books_df = books_df.dropna()
users_df = users_df.dropna()

In [None]:
print(" ")
print('Books')
print(" ")
print(books_df.isna().sum())
print(" ")
print('Users')
print(" ")
print(users_df.isna().sum())

 
Books
 
ISBN                   0
Book-Title             0
Book-Author            0
Year-Of-Publication    0
Publisher              0
Image-URL-S            0
Image-URL-M            0
Image-URL-L            0
dtype: int64
 
Users
 
User-ID     0
Location    0
Age         0
dtype: int64


In [None]:
print(books_df.shape)
print(ratings_df.shape)
print(users_df.shape)

(271353, 8)
(1149780, 3)
(168096, 3)


After dropping null rows, books_df has 271353 rows; only 7 rows were dropped. users_df now has 168096 rows: 110762 rows were dropped, specifically those where the user's age was missing.

In [None]:
books_df = books_df.drop(columns=['Image-URL-M', 'Image-URL-L', 'Image-URL-S'])

Dropped columns 'Image-URL-S', 'Image-URL-M' and 'Image-URL-L' given that we do not need these images.

In [None]:
books_df.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication',
       'Publisher'],
      dtype='object')

In [None]:
books_df.dtypes

Unnamed: 0,0
ISBN,object
Book-Title,object
Book-Author,object
Year-Of-Publication,object
Publisher,object


In [None]:
years = books_df['Year-Of-Publication'].unique()
sorted(years)

TypeError: '<' not supported between instances of 'str' and 'int'

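'Year-Of-Publication' should be an integer column. The TypeError above shows that it currently mixes str and int values, although no non-numeric strings stand out, so the column can be cast to int directly.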
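One way to check a mixed column for non-numeric entries before casting is `pd.to_numeric` with `errors="coerce"`. The sketch below runs on a made-up series, not on the real column. The non-numeric entry is included only to show what the check would flag, and the names `years`, `as_numbers`, and `non_numeric` are illustrative.

```python
import pandas as pd

# Hypothetical column mixing ints and strings, as a CSV parse can produce.
years = pd.Series(
    [2002, "1999", "2001", 0, "DK Publishing Inc"],
    name="Year-Of-Publication",
)

# errors="coerce" turns anything that cannot be parsed as a number into NaN
# instead of raising an exception.
as_numbers = pd.to_numeric(years, errors="coerce")

# Entries that became NaN are the genuinely non-numeric values.
non_numeric = years[as_numbers.isna()]
print(non_numeric.tolist())

# Count how many values are stored as str versus int.
print(years.map(type).value_counts())
```

If `non_numeric` is empty, as the notebook found for the real data, `.astype(int)` is safe to apply.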

In [None]:
books_df['Year-Of-Publication'] = books_df['Year-Of-Publication'].astype(int)

In [None]:
year_is_zero = books_df[books_df['Year-Of-Publication'] == 0].index
books_df.drop(year_is_zero, inplace=True)

In [None]:
year_is_larger = books_df[books_df['Year-Of-Publication'] > 2024].index
books_df.drop(year_is_larger, inplace=True)

In [None]:
year_is_less = books_df[books_df['Year-Of-Publication'] < 1920].index
books_df.drop(year_is_less, inplace=True)

In [None]:
books_df.shape

(266679, 5)

In [None]:
books_df.dtypes

Unnamed: 0,0
ISBN,object
Book-Title,object
Book-Author,object
Year-Of-Publication,int64
Publisher,object


In [None]:
users_df.dtypes

Unnamed: 0,0
User-ID,int64
Location,object
Age,float64


In [None]:
age_is_below_twelve = users_df[users_df['Age'] < 12].index
users_df.drop(age_is_below_twelve, inplace=True)

In [None]:
age_is_above_seventy = users_df[users_df['Age'] > 70].index
users_df.drop(age_is_above_seventy, inplace=True)

In [None]:
users_df.shape

(164859, 3)

In [None]:
rating_is_zero = ratings_df[ratings_df['Book-Rating'] == 0].index
ratings_df.drop(rating_is_zero, inplace=True)

In [None]:
ratings_df.shape

(433671, 3)

In [None]:
books_df.drop(columns=['Publisher'], inplace=True)

In [None]:
users_df.drop(columns=['Location'], inplace=True)

In [None]:
print(books_df.columns)
print(ratings_df.columns)
print(users_df.columns)

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication'], dtype='object')
Index(['User-ID', 'ISBN', 'Book-Rating'], dtype='object')
Index(['User-ID', 'Age'], dtype='object')


In [None]:
a = users_df
b = ratings_df

In [None]:
user_ratings_df = pd.merge(a, b, on='User-ID', how = 'inner')

In [None]:
user_ratings_df.columns

Index(['User-ID', 'Age', 'ISBN', 'Book-Rating'], dtype='object')

In [None]:
user_ratings_df.shape

(299554, 4)

In [None]:
user_ratings_df.head()

Unnamed: 0,User-ID,Age,ISBN,Book-Rating
0,10,26.0,8477024456,6
1,19,14.0,375759778,7
2,42,17.0,553582747,7
3,44,51.0,440223571,8
4,51,34.0,440225701,9


In [None]:
c = user_ratings_df
d = books_df
books_ratings_df = pd.merge(c, d, on='ISBN', how = 'inner')

In [None]:
books_ratings_df.head()

Unnamed: 0,User-ID,Age,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication
0,19,14.0,375759778,7,Prague : A Novel,ARTHUR PHILLIPS,2003
1,42,17.0,553582747,7,From the Corner of His Eye,Dean Koontz,2001
2,44,51.0,440223571,8,This Year It Will Be Different: And Other Stories,Maeve Binchy,1997
3,51,34.0,440225701,9,The Street Lawyer,JOHN GRISHAM,1999
4,56,24.0,671623249,7,LONESOME DOVE,Larry McMurtry,1986


In [None]:
books_ratings_df.shape

(260889, 7)

In [None]:
agg_title= books_ratings_df.groupby('Book-Title').agg(Rating=('Book-Rating', 'mean'), Book_Author=('Book-Author', 'first'), Year_Of_Publication=('Year-Of-Publication', 'first'))

In [None]:
agg_title.head()

Unnamed: 0_level_0,Rating,Book_Author,Year_Of_Publication
Book-Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Ask Lily (Young Women of Faith: Lily Series, Book 5)",8.0,Nancy N. Rue,2001
Dark Justice,10.0,Jack Higgins,2004
"Earth Prayers From around the World: 365 Prayers, Poems, and Invocations for Honoring the Earth",8.333333,Elizabeth Roberts,1991
Final Fantasy Anthology: Official Strategy Guide (Brady Games),10.0,David Cassady,1999
Flight of Fancy: American Heiresses (Zebra Ballad Romance),8.0,Tracy Cozzens,2002


In [None]:
new_books_df = pd.DataFrame(agg_title.to_records())

In [None]:
new_books_df.head()

Unnamed: 0,Book-Title,Rating,Book_Author,Year_Of_Publication
0,"Ask Lily (Young Women of Faith: Lily Series, ...",8.0,Nancy N. Rue,2001
1,Dark Justice,10.0,Jack Higgins,2004
2,Earth Prayers From around the World: 365 Pray...,8.333333,Elizabeth Roberts,1991
3,Final Fantasy Anthology: Official Strategy Gu...,10.0,David Cassady,1999
4,Flight of Fancy: American Heiresses (Zebra Ba...,8.0,Tracy Cozzens,2002


In [None]:
new_books_df.shape

(105796, 4)

In [None]:
books_ratings_df.columns

Index(['User-ID', 'Age', 'ISBN', 'Book-Rating', 'Book-Title', 'Book-Author',
       'Year-Of-Publication'],
      dtype='object')

In [None]:
# Keep only the columns needed for collaborative filtering
collab_df = books_ratings_df[['User-ID', 'Book-Title', 'Book-Rating']].copy()

In [None]:
collab_df.shape

(260889, 3)

## Modeling


---



In [None]:
import numpy as np
import matplotlib.pyplot as plt

!pip install scikit-surprise


Collecting scikit-surprise
  Using cached scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357277 sha256=acb9c199b26e177289c2c8c46322b5c336c571f665dc91a3b4b68eeab4257fd6
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.4


In [None]:
from surprise import SVD, Reader, Dataset
from surprise.model_selection import train_test_split

In [None]:
reader = Reader(rating_scale=(1, 10))
collab_df = Dataset.load_from_df(collab_df, reader)

In [None]:
coll_sample = books_ratings_df[['User-ID','Book-Title','Book-Rating']].sample(frac=0.1, random_state=42)
data_sample = Dataset.load_from_df(coll_sample[['User-ID', 'Book-Title', 'Book-Rating']], Reader(rating_scale=(1, 10)))

In [None]:
trainset, testset = train_test_split(data_sample, test_size=.20)

In [None]:
from surprise import KNNWithMeans

In [None]:
sim_options = {
    "name": "cosine",
    "user_based": False,  # To compute similarities between items
}

algo = KNNWithMeans(sim_options=sim_options)
trainset = data_sample.build_full_trainset()

In [None]:
algo.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7ca95bc44d90>

In [None]:
target = "Animal Farm"
book_to_inner_id = {trainset.to_raw_iid(inner_id): inner_id for inner_id in trainset.all_items()}

In [None]:
if target in book_to_inner_id:
    inner_id = book_to_inner_id[target]

    # Get similar books
    similarities = algo.get_neighbors(inner_id, k=10)  # Get top 10 similar items

    # Convert back to raw IDs (book titles)
    similar_books = [trainset.to_raw_iid(inner_id) for inner_id in similarities]
    print(f"Books similar to '{target}':")
    for book in similar_books:
        print(book)
else:
    print(f"Book '{target}' not found in the dataset.")

Books similar to 'Animal Farm':
The Girl Who Loved Tom Gordon
1984
Matilda
The Outsiders (Now in Speak!)
Tara Road
The Mile High Club (Kinky Friedman Novels (Hardcover))
A Man and His Mother: An Adopted Son's Search
Big Fish: A Novel of Mythic Proportions
A Life Less Ordinary: A Novel
Adventures of Tom Sawyer


#### Evaluating how well the model works on the dataframe

Reference: https://realpython.com/build-recommendation-engine-collaborative-filtering/

## Evaluation

---



To understand how well our book recommender system works, we computed Precision@K, the proportion of the top K recommended items that are relevant to the user. Besides showing how well the recommender performs, this metric helps us identify where it needs improvement.

In [None]:
from surprise import accuracy
from collections import defaultdict

In [None]:
testset = trainset.build_testset()  # Use all interactions as the test set
predictions = algo.test(testset)

Above, we built a test set from every interaction in the training set, so the model is scored on ratings it has already seen. Precision@K works by identifying a "ground truth" based on a user's actual interactions: an item is relevant if the user rated it highly. For example, if the user rated "To Kill a Mockingbird" 7 or higher, it is marked as a relevant book for that user.

The Top K looks at the top k items in the recommendation list and counts how many are in the ground truth. For this Precision@K evaluation, our k was defined as 3.
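As a worked example of the counting described above, the sketch below computes Precision@K for a single user. The ranked list, the relevant set, and this simplified single-user `precision_at_k` helper are made up for illustration; the notebook's own function averages the same quantity over all users.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the first k recommended items that are in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Divide by k (not by len(top_k)), matching the notebook's convention.
    return hits / k


# Hypothetical ranked list for one user, best prediction first.
recommended = ["1984", "Matilda", "Tara Road", "Big Fish", "Animal Farm"]
# Hypothetical ground truth: books this user actually rated 7 or higher.
relevant = {"1984", "Tara Road", "Animal Farm"}

# Two of the top three ("1984" and "Tara Road") are relevant, so this is 2/3.
print(precision_at_k(recommended, relevant, k=3))
```

Dividing by k means a user with fewer than k recommended items can never reach a precision of 1.0, which is worth keeping in mind when most users have rated only a few books.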

In [None]:
def get_top_n(predictions, n=10):
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)  # Sort by estimated rating
        top_n[uid] = user_ratings[:n]

    return top_n

top_n = get_top_n(predictions, n=5)  # Example with top-5 recommendations

def precision_at_k(top_n, ground_truth, k=5):
    precisions = []
    for uid, recommended_items in top_n.items():
        recommended_items = [iid for iid, _ in recommended_items[:k]]
        relevant_items = ground_truth.get(uid, [])

        if relevant_items:
            relevant_and_recommended = set(recommended_items) & set(relevant_items)
            precision = len(relevant_and_recommended) / k
            precisions.append(precision)

    return np.mean(precisions)

# Create ground truth from testset
ground_truth = defaultdict(list)
for uid, iid, true_r in testset:
    if true_r >= 7:  # Example threshold for relevance
        ground_truth[uid].append(iid)

# Calculate Precision@k
precision_k = precision_at_k(top_n, ground_truth, k=3)
print(f"Precision@3: {precision_k:.4f}")

Precision@3: 0.5125


After running Precision@K with K = 3, our recommender system scores 0.5125, or 51.25%. This means that, on average, about 51.25% of each user's top 3 recommendations were relevant. Because the test set was built from the training interactions, this figure likely overstates performance on unseen ratings. We consider it a reasonable score given the size of the catalog and the fact that most users rate only a small number of books.

To improve this model in the future, tuning (for example, changing the value of K) or feature engineering could help improve the precision of the recommender system.
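Any such change is easier to judge when precision is measured on interactions the model did not see during training. The following is a minimal, library-free sketch of a hold-out split, assuming ratings are stored as (user, item, rating) tuples. The helper name `holdout_split` and the toy data are hypothetical, and this is not the surprise library's own `train_test_split`.

```python
import random


def holdout_split(ratings, test_fraction=0.2, seed=42):
    """Shuffle (user, item, rating) tuples and split them into train and test lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up interactions, for illustration only (ratings fall between 5 and 10).
ratings = [(u, f"book-{b}", 5 + (u + b) % 6) for u in range(5) for b in range(4)]
train, test = holdout_split(ratings)

# No interaction appears on both sides, so test scores reflect unseen data.
assert not set(train) & set(test)
print(len(train), len(test))
```

The model would then be fitted on `train` only, and Precision@K computed from predictions on `test`.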

## Storytelling & Conclusion

---



Through this project, we gained valuable insights into building recommendation systems and explored their real-world applications. Below are the key takeaways from our work:

Achieving Our Goal:
We successfully developed a book recommendation system that suggests personalized book options based on user data and book metadata. Our main objective, helping readers find books they would enjoy, was met using item-based collaborative filtering (KNNWithMeans with cosine similarity).

Insights from Data:
Patterns in user preferences revealed fascinating trends. For example, popular genres like mystery and science fiction appealed to a wide audience, while niche genres attracted dedicated but smaller groups of readers. This information can guide publishers and retailers in catering to diverse markets.

Challenges and Areas for Improvement:
We identified limitations, such as addressing the "cold-start problem" for new users or books. In the future, we plan to incorporate advanced methods like natural language processing to analyze review sentiments and add contextual recommendations based on user mood or seasonality.

Lessons Learned:
This project provided a deeper understanding of data mining techniques, algorithm selection, and data preprocessing. Beyond technical skills, we learned the importance of critically evaluating results and continuously improving models for better performance.

Overall, this project emphasized how machine learning can create personalized experiences and how critical thinking drives successful outcomes in real-world applications.

## Impact

---



Our project has a broader impact that goes beyond just recommending books. Here is how it makes a difference:

Social Impact:
By making it easier to find enjoyable books, our system can encourage people to read more, ultimately promoting literacy and intellectual growth. It also connects readers to diverse genres and authors, fostering a richer appreciation for literature.

Ethical Considerations:
Recommender systems rely heavily on user data, making data privacy a critical concern. Our system emphasizes the need to handle user information responsibly and transparently. Additionally, we recognize that algorithms may unintentionally reinforce biases by favoring popular books, so future refinements will focus on promoting fairness and diversity.

Support for Authors and Publishers:
The system has the potential to spotlight lesser-known authors and works, providing them with increased visibility and opportunities. This could diversify the literary market while helping readers discover hidden gems.

Potential Risks:
While our intentions are positive, we acknowledge potential risks, such as over-reliance on algorithms or the possibility of narrowing user preferences by recommending similar books repeatedly. Balancing personalization with exploration will be a future priority.

In conclusion, our book recommendation system has the potential to positively influence the reading community, from encouraging literacy to supporting authors. At the same time, this project highlighted the importance of considering social and ethical responsibilities when deploying machine learning solutions.

## GitHub Repository

---



https://discord.com/channels/@me/1310627522186055730/1313200032630968435