# Book Recommender System

This project builds a Book Recommender System using data sourced from Kaggle. 
It applies basic unsupervised learning techniques to recommend books to a user based on authors, ratings, and categories. 


## Key Features:

### Data Source: Kaggle

The project uses the Book-Crossing dataset published on Kaggle and recommends a next read based on the book the user read most recently.

### Unsupervised Learning Algorithms:

The core of the recommender system is the application of basic unsupervised learning algorithms. It derives personalized book suggestions directly from the data, without the need for labeled training data.

### Author-based Recommendations:

The recommender system uses authorship to suggest books: given a title you have read, it surfaces other books by the same author.

### Rating-driven Suggestions:

Ratings drive these recommendations. The system compares how users rated each book and suggests titles whose rating patterns correlate with those of the book you last read.


### Category-specific Suggestions:

Dive into the genres you are interested in. Whether you are a fan of mystery, romance, or science fiction, the recommender system tailors suggestions to the categories of the books you read.


## Import Libraries

In [1]:
import pandas as pd
import re
import time


## Load Data

In [2]:
# https://www.kaggle.com/datasets/ruchi798/bookcrossing-dataset/data
books1 = pd.read_csv('./data/Preprocessed1.csv')
books2 = pd.read_csv('./data/Preprocessed2.csv')
books3 = pd.read_csv('./data/Preprocessed3.csv')
books4 = pd.read_csv('./data/Preprocessed4.csv')
books5 = pd.read_csv('./data/Preprocessed5.csv')
books6 = pd.read_csv('./data/Preprocessed6.csv')
print(f'1st Set of Books  size: {books1.shape}')
print(f'2nd Set of Books  size: {books2.shape}')
print(f'3rd Set of Books  size: {books3.shape}')
print(f'4th Set of Books  size: {books4.shape}')
print(f'5th Set of Books  size: {books5.shape}')
print(f'6th Set of Books  size: {books6.shape}')

books = pd.concat([books1, books2, books3, books4, books5, books6], axis=0)
print(f'Shape of Books: {books.shape}')
books.head()

1st Set of Books  size: (185693, 19)
2nd Set of Books  size: (186463, 19)
3rd Set of Books  size: (189340, 19)
4th Set of Books  size: (200477, 19)
5th Set of Books  size: (215255, 19)
6th Set of Books  size: (53947, 19)
Shape of Books: (1031175, 19)


Unnamed: 0.1,Unnamed: 0,user_id,location,age,isbn,rating,book_title,book_author,year_of_publication,publisher,img_s,img_m,img_l,Summary,Language,Category,city,state,country
0,0,2,"stockton, california, usa",18.0,195153448,0,Classical Mythology,Mark P. O. Morford,2002.0,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,Provides an introduction to classical myths pl...,en,['Social Science'],stockton,california,usa
1,1,8,"timmins, ontario, canada",34.7439,2005018,5,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],timmins,ontario,canada
2,2,11400,"ottawa, ontario, canada",49.0,2005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],ottawa,ontario,canada
3,3,11676,"n/a, n/a, n/a",34.7439,2005018,8,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],,,
4,4,41385,"sudbury, ontario, canada",34.7439,2005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],sudbury,ontario,canada


## How it Works:



### Data Preprocessing:

The first step is to preprocess the data: handle missing values, clean the fields, and organize them into a structured format.

### Unsupervised Learning Model:

Using basic unsupervised learning algorithms, the system uncovers hidden patterns in the dataset and identifies associations between books, authors, and categories without manual labeling.

### User-centric Recommendations:

The recommender system takes user input, such as favorite authors, preferred categories, and historical ratings, to generate personalized book recommendations.

### Using scikit-learn for Recommendations:

The recommender system uses scikit-learn's cosine similarity function to recommend books based on the last book the user borrowed.
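
To make the cosine-similarity idea concrete, here is a minimal pure-Python sketch of the measure computed on two rating vectors. The function name `cosine_similarity_pair` and the toy ratings are hypothetical and not part of the notebook. Returning `0.0` for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine_similarity_pair(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: an unrated book is similar to nothing
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical ratings: each list holds one book's ratings from four users
# (0 means "not rated", mirroring fillna(0) in the notebook).
book_a = [5, 0, 8, 0]
book_b = [10, 0, 16, 0]  # same rating pattern as book_a, only scaled
book_c = [0, 7, 0, 9]    # rated by a disjoint set of users

sim_ab = cosine_similarity_pair(book_a, book_b)
sim_ac = cosine_similarity_pair(book_a, book_c)
print(round(sim_ab, 6), round(sim_ac, 6))
```

Cosine similarity depends only on the direction of the vectors, so `book_a` and `book_b` score as identical even though their magnitudes differ, while books with no raters in common score 0.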


In [3]:
dataFrame = books.copy()
print(f'Original Dataset Shape: {dataFrame.shape}')
dataFrame.dropna(inplace=True)

dataFrame.drop(columns = ['Unnamed: 0','location',
                   'img_s','img_m','city','age',
                   'state','Language','country',
                   'year_of_publication'],axis=1,inplace = True)

categories = dataFrame['Category'].unique()



# Clean up categories
dataFrame.drop(index=dataFrame[dataFrame['Category'] == '9'].index, inplace=True)

# Raw string so that \W is passed to the regex engine rather than treated as a string escape
dataFrame['Category'] = dataFrame['Category'].apply(lambda x: re.sub(r'[\W_]+',' ',x).strip())

categories = dataFrame['Category'].unique()
print(f"Unique Categories:{categories}") 

# Inspect the rating values. Rows with a rating of 0 are kept;
# uncomment the drop below to remove them.
#dataFrame.drop(index=dataFrame[dataFrame['rating'] == 0].index, inplace=True)
ratings = dataFrame['rating'].unique()
print(f"Unique Ratings:{ratings}")

print(f'Cleanedup Dataset Shape: {dataFrame.shape}')


Original Dataset Shape: (1031175, 19)
Unique Categories:['Actresses' '1940 1949' 'Fiction' ... 'Algonquian Indians' 'Menus'
 'Merchants']
Unique Ratings:[ 5  0  8  6  9  7 10  4  1  2  3]
Cleanedup Dataset Shape: (93945, 9)
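
To see what the category cleanup does in isolation, the following sketch applies the same regular expression to a few strings. The helper name `clean_category` and the sample inputs are hypothetical. The raw values in the dataset may be formatted differently.

```python
import re

def clean_category(raw):
    """Collapse every run of non-alphanumeric characters (and underscores)
    into a single space, then trim the ends."""
    return re.sub(r'[\W_]+', ' ', raw).strip()

# Hypothetical raw values in the list-like form shown in the Category column
samples = ["['Social Science']", "['Actresses']", "['1940-1949']"]
cleaned = [clean_category(s) for s in samples]
print(cleaned)  # ['Social Science', 'Actresses', '1940 1949']
```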


# Item-Based Collaborative Filtering

## 1. Rating-driven Suggestions

In [4]:
book_basic_data = dataFrame.copy()
book_basic_data.drop(columns = ['book_author','publisher','img_l',
                   'Summary', 'Category'],axis=1,inplace = True)

print(f'Number of Books used for recommendation: {book_basic_data.shape}')



def recommend_by_ratings(book_title):
    if book_title in dataFrame['book_title'].values:
    
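        # Count ratings per title; value_counts() returns a Series indexed by title
        num_ratings = book_basic_data['book_title'].value_counts()
        #print(f'Number of ratings of the book: {num_ratings}')
        
        # Titles rated at most once carry too little signal to correlate
        less_rating_books = num_ratings[num_ratings <= 1].index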
        common_books = book_basic_data[~book_basic_data['book_title'].isin(less_rating_books)]
        #print(f'common_books :::{common_books}')
        
        start = time.time()
        
        user_book_df = pd.pivot_table(data=common_books, index=['user_id'],
                                                    columns=['book_title'],
                                                    values='rating')
        
        end = time.time()
        print(f'Time Taken: {end - start} seconds')
        user_book_df.fillna(0, inplace=True)
        book = user_book_df[book_title]
        
    
        recom_books = pd.DataFrame(user_book_df.corrwith(book). \
                                      sort_values(ascending=False)).reset_index(drop=False)
        
        print(f'Top 10 Recommended Books: \n{recom_books.head(10)}')
    else:
        print(f'Book is not available in dataset. Try different book')


## Assuming the user read the book 'Wild Animus'
print(f'Recommending for `Wild Animus`') 
recommend_by_ratings('Wild Animus')

Number of Books used for recommendation: (93945, 4)
Recommending for `Wild Animus`
Time Taken: 9.055074214935303 seconds
Top 10 Recommended Books: 
                                          book_title         0
0                                        Wild Animus  1.000000
1  A Cat in the Manger: An Alice Nestleton Myster...  0.165174
2                                     Marley's Ghost  0.165174
3  The Berenstain Bears and the Trouble with Grow...  0.165174
4                              More Than a Carpenter  0.165174
5                                              Candy  0.124081
6  She Said Yes: The Unlikely Martyrdom of Cassie...  0.123225
7                       Myth Directions (Myth Books)  0.116554
8                                          Earthclan  0.116554
9                                A College of Magics  0.116554


In [5]:
print(f'Recommending for `Clara Callan`') 
recommend_by_ratings('Clara Callan')

Recommending for `Clara Callan`
Time Taken: 9.160154819488525 seconds
Top 10 Recommended Books: 
                                          book_title         0
0                                       Clara Callan  1.000000
1                                          Treasures  0.847990
2          Blood is the Sky (An Alex McKnight Novel) -0.000074
3                               INTRUDER IN THE WIND -0.000074
4             Cannery Row (Steinbeck \Essentials\")" -0.000074
5                                 Holiday for Murder -0.000074
6                                 Vegetarian Cooking -0.000074
7  Match Made In Texas (Harlequin American Romanc... -0.000074
8  Last Dance (Man Of The Month, Freedom Valley) ... -0.000074
9               Cassie's Cowboy Daddy (Desire, 1439) -0.000074
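
`corrwith` computes the Pearson correlation between the chosen book's column and every other column of the user-book matrix. The sketch below reproduces that measure in pure Python on hypothetical toy ratings (the function name `pearson` and the data are illustrative only). It also shows a side effect of filling missing ratings with 0: two books with no raters in common end up slightly negatively correlated, which is consistent with the small negative scores in the output above.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings from five users; 0 stands for "not rated" after fillna(0)
target = [8, 0, 0, 10, 0]
similar = [7, 0, 0, 9, 0]    # rated by the same users, with similar scores
unrelated = [0, 6, 0, 0, 0]  # rated by a different user only

r_similar = pearson(target, similar)
r_unrelated = pearson(target, unrelated)
print(round(r_similar, 4), round(r_unrelated, 4))
```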


## 2. Category-specific Suggestions

In [6]:
book_with_categories = dataFrame.copy()
book_with_categories.drop(columns = ['rating','publisher','img_l',
                   'Summary','book_author'],axis=1,inplace = True)

print(f'Number of Books used for recommendation: {book_with_categories.shape}')

def recommend_by_category(book_title):
    if book_title in dataFrame['book_title'].values:
    
        # Count rows per category; value_counts() returns a Series indexed by category
        count_by_categories = book_with_categories['Category'].value_counts()
        #print(f'count_by_categories : {count_by_categories}')
        
        # Categories that appear only once cannot link two books
        books_with_few_categories = count_by_categories[count_by_categories <= 1].index
        #print(f'books_with_few_categories: {books_with_few_categories}')
        
        # .copy() avoids pandas' SettingWithCopyWarning when the flag column is added below
        books_with_more_categories = book_with_categories[~book_with_categories['Category'].isin(books_with_few_categories)].copy()
        #print(f'books_with_more_categories: {books_with_more_categories}')
        
        books_with_more_categories['belongs_to'] = True
        
        book_category_df = books_with_more_categories.pivot_table(index=['Category'],
                                                    columns=['book_title'],
                                                    values='belongs_to', aggfunc='any', fill_value=False)
        #print(f'book_category_df : {book_category_df}')
        book = book_category_df[book_title]
    
        recom_books = pd.DataFrame(book_category_df.corrwith(book). \
                                      sort_values(ascending=False)).reset_index(drop=False)
        
        print(f'Recommended Books: \n{recom_books}')
    else:
        print(f'Book is not available in dataset. Try different book')


## Assuming the user read the book 'Click'
print(f'Recommending for `Click`') 
recommend_by_category('Click')

Number of Books used for recommendation: (93945, 4)
Recommending for `Click`




Recommended Books: 
                                              book_title         0
0                 The Sex Files (Harlequin Blaze, No 67)  1.000000
1            Gaudi Afternoon: A Cassandra Reilly Mystery  1.000000
2      Other Amanda  (Loving Dangerously) (Harlequin ...  1.000000
3                                   Other (Childe Cycle)  1.000000
4      The Victorian Fairy Tale Book (Pantheon Fairy ...  1.000000
...                                                  ...       ...
35666                                  Flaubert's Parrot -0.001009
35667                                 Pilgrim's Progress -0.001009
35668                                            Matilda -0.001236
35669                        Another Roadside Attraction -0.001236
35670                                       Little Women -0.001236

[35671 rows x 2 columns]
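
In the boolean category-by-title matrix, a title scores 1.0 against the chosen book only when it has exactly the same pattern of categories. Under that reading, the top of the list can also be produced by a direct lookup. The sketch below is an alternative illustration, not the notebook's method. The function name `same_category_titles` and the rows are hypothetical.

```python
from collections import defaultdict

def same_category_titles(rows, title):
    """Return the other titles whose set of categories equals that of `title`."""
    categories_by_title = defaultdict(set)
    for book_title, category in rows:
        categories_by_title[book_title].add(category)
    if title not in categories_by_title:
        return []
    target = categories_by_title[title]
    return sorted(
        other
        for other, categories in categories_by_title.items()
        if other != title and categories == target
    )

# Hypothetical (book_title, Category) pairs; duplicates mimic several ratings per book
rows = [
    ("Book A", "Fiction"),
    ("Book B", "Fiction"),
    ("Book C", "Cooking"),
    ("Book D", "Fiction"),
    ("Book A", "Fiction"),
]
matches = same_category_titles(rows, "Book A")
print(matches)  # ['Book B', 'Book D']
```

Unlike the correlation approach, this lookup gives no ranking among the matches, so it is only useful for the "same category" portion of the result.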


## 3. Author-based Recommendations

In [7]:
book_author_data = dataFrame.copy()
book_author_data.drop(columns = ['rating','publisher','img_l',
                   'Summary'],axis=1,inplace = True)

print(f'Number of Books: {len(book_author_data)}')

def recommend_by_author(book_title):
    if book_title in dataFrame['book_title'].values:
    
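        # Count rows per author; value_counts() returns a Series indexed by author
        count_by_authors = book_author_data['book_author'].value_counts()
        #print(f'count_by_authors : {count_by_authors}')
        # Keep only authors with more than 200 rows in the dataset
        authors_with_less_book = count_by_authors[count_by_authors <= 200].index
        #print(f'authors_with_less_book: {authors_with_less_book}')
        
        # .copy() avoids pandas' SettingWithCopyWarning when the flag column is added below
        authors_with_more_books = book_author_data[~book_author_data['book_author'].isin(authors_with_less_book)].copy()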
        #print(f'authors_with_more_books: {authors_with_more_books}')
        
        authors_with_more_books['wrote_book'] = True
        
        author_book_df = authors_with_more_books.pivot_table(index=['book_author'],
                                                    columns=['book_title'],
                                                    values='wrote_book', aggfunc='any', fill_value=False)
        #print(f'author_book_df : {author_book_df}')
        book = author_book_df[book_title]
        
        #print(f'{book}')
    
        recom_books = pd.DataFrame(author_book_df.corrwith(book). \
                                      sort_values(ascending=False)).reset_index(drop=False)
        
        print(f'Recommended Books: {recom_books}')
    else:
        print(f'Book is not available in dataset. Try different book')

## Assuming the user read the book 'Timeline'
recommend_by_author('Timeline')

Number of Books: 93945
Recommended Books:                                             book_title        0
0                                                Congo  1.00000
1    Michael Crichton: A New Collection of Three Co...  1.00000
2                                             Timeline  1.00000
3                              The Great Train Robbery  1.00000
4                                           Disclosure  1.00000
..                                                 ...      ...
792                                   Answered Prayers -0.05547
793                                              Wings -0.05547
794                                       Kaleidoscope -0.05547
795                                        Dating Game -0.05547
796                                 The Long Road Home -0.05547

[797 rows x 2 columns]




## 4. Using scikit-learn

In [8]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

start = time.time()

book_df = dataFrame.copy()
# Create a user-item matrix
user_item_matrix = book_df.pivot(index='user_id', columns='isbn', values='rating').fillna(0)

# Calculate item-item similarity using cosine similarity
item_similarity = cosine_similarity(user_item_matrix.T)

# Convert the similarity matrix into a DataFrame
item_similarity_df = pd.DataFrame(item_similarity, index=user_item_matrix.columns, columns=user_item_matrix.columns)
end = time.time()
print(f'Time Taken: {end - start} seconds')

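def get_book_recommendations(book_id, n=5):
    """Return the ISBNs of the n books most similar to book_id."""
    # Drop the query book explicitly instead of assuming it sorts first
    scores = item_similarity_df[book_id].drop(book_id)
    similar_books = scores.sort_values(ascending=False).index[:n]
    return similar_books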



Time Taken: 154.02481412887573 seconds


In [9]:
# Get recommendations for book with ID 0002005018 i.e. Clara Callan
book_id_to_recommend = '0002005018' 
recommendations = get_book_recommendations(book_id_to_recommend)

print(f"Top 5 Recommendations for Book {book_id_to_recommend}:")
print(recommendations)

# Map each recommended ISBN to its first matching title (None if the ISBN is missing)
isbn_to_title = book_df.drop_duplicates('isbn').set_index('isbn')['book_title']
recommended_books_name = [isbn_to_title.get(bookid) for bookid in recommendations]
# Print the results
for item, book_title in zip(recommendations, recommended_books_name):
    print(f"Item: {item}, book_title: {book_title}")



Top 5 Recommendations for Book 0002005018:
Index(['0440214009', '0743244176', '074324088X', '0743240928', '0743241908'], dtype='object', name='isbn')
Item: 0440214009, book_title: Treasures
Item: 0743244176, book_title: Miss Manners' Guide to Rearing Perfect Children
Item: 074324088X, book_title: I Hate To See That Evening Sun Go Down : Collected Stories
Item: 0743240928, book_title: That Old Ace in the Hole : A Novel
Item: 0743241908, book_title: What We Saw : The Events of September 11, 2001, in Words, Pictures, and Video


## Conclusion:
This Book Recommender System uses one attribute at a time to recommend the next book. It can be extended so that all attributes are used together to produce a single, better-informed suggestion. It could also use the Summary column to extract keywords for each book and recommend titles with similar content.
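
One possible way to combine the attributes is a weighted average of the per-attribute scores. The sketch below is only an illustration of that idea. The function name `blend_scores`, the weights, and the scores are all hypothetical, and a title missing from one table is assumed to score 0 there.

```python
def blend_scores(score_tables, weights):
    """Weighted average of per-attribute similarity scores, best first."""
    total_weight = sum(weights[name] for name in score_tables)
    combined = {}
    for name, scores in score_tables.items():
        for title, score in scores.items():
            # Titles absent from a table implicitly contribute 0 for that attribute
            combined[title] = combined.get(title, 0.0) + weights[name] * score / total_weight
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical similarity scores per attribute for three candidate titles
score_tables = {
    "rating": {"X": 0.8, "Y": 0.1},
    "category": {"X": 0.0, "Y": 1.0, "Z": 1.0},
    "author": {"Z": 1.0},
}
weights = {"rating": 0.5, "category": 0.25, "author": 0.25}

ranked = blend_scores(score_tables, weights)
print([title for title, _ in ranked])  # ['Z', 'X', 'Y']
```

The weights control how much each attribute influences the final ranking and would need tuning against real user feedback.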