> # Content Based GoodBooks Recommender System

In this project, our aim is to find a dataset that can be used to inform a content-based recommender system and to build a Python notebook that:

   (1) Loads the dataset

   (2) Creates a content-based recommender system

   (3) Uses quantitative metrics to evaluate the recommendations of the system.

## About Dataset

To begin with, we have selected a dataset from Kaggle containing ten thousand books and one million ratings.

The dataset can be found [here](https://www.kaggle.com/datasets/zygmunt/goodbooks-10k?select=ratings.csv).

Some information about the dataset:

   * Contains ratings for ten thousand popular books. 
   * Generally, there are 100 ratings for each book, although some have fewer.
   * There are also books marked to read by the users, book metadata (author, year, etc.) and tags.
   * As to the source, let's say that these ratings were found on the internet. 

Contents of dataset:

   * **ratings.csv** contains ratings
   * **to_read.csv** provides IDs of the books marked "to read" by each user, as user_id,book_id pairs
   * **books.csv** has metadata for each book (goodreads IDs, authors, title, average rating, etc.)
   * **book_tags.csv** contains tags/shelves/genres assigned by users to books
   * **tags.csv** translates tag IDs to names
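
To make the relationship between **book_tags.csv** and **tags.csv** concrete, here is a minimal sketch of how tag IDs could be translated into tag names per book. The column names (`goodreads_book_id`, `tag_id`, `count`, `tag_name`) are assumptions about the CSV headers, the rows are made-up toy data, and `tag_names_per_book` is a hypothetical helper, not part of this project's `functions` package.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for tags.csv and book_tags.csv (made-up rows, assumed column names)
TAGS_CSV = """tag_id,tag_name
1,fantasy
2,classics
3,young-adult
"""

BOOK_TAGS_CSV = """goodreads_book_id,tag_id,count
3,1,500
3,3,200
2657,2,900
"""


def tag_names_per_book(tags_file, book_tags_file):
    """Map each book ID to its tag names, most frequently used tag first."""
    # tag_id -> tag_name lookup built from tags.csv
    id_to_name = {row["tag_id"]: row["tag_name"] for row in csv.DictReader(tags_file)}

    # collect (count, tag_name) pairs for every book listed in book_tags.csv
    per_book = defaultdict(list)
    for row in csv.DictReader(book_tags_file):
        per_book[row["goodreads_book_id"]].append(
            (int(row["count"]), id_to_name[row["tag_id"]])
        )

    # sort each book's tags by descending usage count and keep only the names
    return {
        book_id: [name for _, name in sorted(pairs, key=lambda pair: -pair[0])]
        for book_id, pairs in per_book.items()
    }


book_tag_names = tag_names_per_book(io.StringIO(TAGS_CSV), io.StringIO(BOOK_TAGS_CSV))
print(book_tag_names)
```

With the real files, the two `io.StringIO` objects would be replaced by open file handles for `data/tags.csv` and `data/book_tags.csv`.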

## Content Based Recommender System

* We will start by **importing the necessary libraries**

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
from datetime import datetime

from functions.import_and_preprocess_initial_data import import_data
from functions.import_and_preprocess_initial_data import preprocess_data

from functions.content_based_recsys import load_books
from functions.content_based_recsys import get_book_similarity
from functions.content_based_recsys import recommend_books
from functions.content_based_recsys import generate_users
from functions.content_based_recsys import exploit_simulate
from functions.content_based_recsys import compute_average_precision

pd.set_option('display.max_columns', None)

### Read Data

* Next, we will **import** and **pre-process** the **initial data** and **save** them in a **CSV** file

In [2]:
# paths to load data
path_books = 'data/books.csv'
path_ratings = 'data/ratings.csv'
path_book_tags = 'data/book_tags.csv'
path_tags = 'data/tags.csv'

# function to import data
books, ratings, book_tags, tags = import_data(path_books, path_ratings, path_book_tags, path_tags)

# function to preprocess data
df = preprocess_data(ratings, books, book_tags, tags)

# export preprocess data to csv
df.to_csv('data/preprocessed_data.csv',index=False)

#print df info
print(f'df.shape: {df.shape}',end='\n\n')
df.head(5)

df.shape: (9258, 10)



| id | authors | original_publication_year | original_title | average_rating | ratings_count_new | per_positive_ratings | per_negative_ratings | per_average_ratings | total_number_of_tags | tag_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 2767052 | Suzanne Collins | 2008.0 | The Hunger Games | 4.34 | 4942365 | 84.7 | 3.9 | 11.3 | 287490 | favorites, currently-reading, young-adult, fic... |
| 3 | J.K. Rowling, Mary GrandPré | 1997.0 | Harry Potter and the Philosopher's Stone | 4.44 | 4800065 | 86.8 | 3.7 | 9.5 | 786374 | to-read, favorites, fantasy, currently-reading... |
| 41865 | Stephenie Meyer | 2005.0 | Twilight | 3.57 | 3916824 | 56.9 | 22.8 | 20.3 | 121636 | young-adult, fantasy, favorites, vampires, ya,... |
| 2657 | Harper Lee | 1960.0 | To Kill a Mockingbird | 4.25 | 3340896 | 81.3 | 5.3 | 13.4 | 148466 | classics, favorites, to-read, classic, histori... |
| 4671 | F. Scott Fitzgerald | 1925.0 | The Great Gatsby | 3.89 | 2773745 | 67.9 | 10.2 | 21.9 | 134429 | classics, favorites, fiction, classic, books-i... |


### Load Preprocessed Data
* We **load the preprocessed data** we saved before using the `load_books` function
* This function:
   * takes as **input** the **path of the CSV** file where we have stored the preprocessed data
   * **creates a new class** named `book`, which defines the attributes of a book object
   * **returns** a **list** named `books` containing each book object, a **matrix** called `tags_similarity_matrix` holding the cosine similarities between the book tags, and a **dictionary** called `title_index` holding the index of each book title

In [4]:
# set starting time
start_time = datetime.now()

# function to load pre-processed data regarding books
books, tags_similarity_matrix, title_index = load_books('data/preprocessed_data.csv')

# end time
end_time = datetime.now()

# total execution time
total_time = end_time - start_time

print('Loading of Preprocessed Data Completed', end='\n\n')
print(f'Execution time: {total_time}')

Loading of Preprocessed Data Completed

Execution time: 0:00:04.738516
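
The internals of `load_books` are not shown here, so the following is only a sketch of what a cosine similarity matrix over book tags could look like. It assumes each book's tags are treated as a simple bag-of-words count vector; the real function may vectorize tags differently (for example with TF-IDF). The tag lists and the helper name `cosine_similarity` are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity between two tag lists treated as count vectors."""
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)

    # dot product over the tags the two books share
    dot = sum(counts_a[tag] * counts_b[tag] for tag in counts_a.keys() & counts_b.keys())

    # Euclidean norms of the two count vectors
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))

    # a book without tags is defined here as similar to nothing
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# made-up tag lists for three books
toy_book_tags = [
    ["fantasy", "young-adult", "magic"],
    ["fantasy", "magic", "favorites"],
    ["classics", "historical-fiction"],
]

# pairwise similarities: entry [i][j] compares book i with book j
toy_similarity_matrix = [
    [cosine_similarity(a, b) for b in toy_book_tags] for a in toy_book_tags
]

for row in toy_similarity_matrix:
    print([round(value, 2) for value in row])
```

The first two toy books share two of their three tags, so their similarity is 2/3, while the third book shares no tags with the others and scores 0.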


### Recommendations

* In this step we will run a **recommendation example** for a given book title, to see how our recommender system works
* For this purpose, we will run the function `recommend_books`
* This function:
   * takes as **input** a **book_title** and some other parameters
   * **returns** the recommendations for the given title, sorted by their similarity score; each entry pairs a book title with its score and the contribution of each factor
* The parameter `weights` contains the weight we want to assign to each factor; each weight can **range between 0 and 1**

In [5]:
# define weights
weights = {'authors':1,
           'publication_year':1,
           'rating':1,
           'positive_ratings':1,
           'tags':1
          } 

In [6]:
# set example title (you can also use the title: 'The Hunger Games')
example_title = "Harry Potter and the Philosopher's Stone"

# set starting time
start_time = datetime.now()

# define how many books to recommend
books_to_recommend = 10

# run function for recommendation
recommendations = recommend_books(example_title,
                                  books_to_recommend,
                                  books,
                                  weights,
                                  tags_similarity_matrix,
                                  title_index)


# print the recommendations
print(f'Below are the books that are similar to <{example_title}>',end='\n\n')

for i, recommendation in enumerate(recommendations):
    print(f'Recommendation {i+1}')
    print("-"*17)
    print(f'Book Title: {recommendation[0]}')
    print(f'Similarity Score: {recommendation[1][0]}')
    print(f'Factors: {recommendation[1][1]}',end='\n\n')


# end time
end_time = datetime.now()

# total execution time
total_time = end_time - start_time

print(f'Execution time: {total_time}')

Below are the books that are similar to <Harry Potter and the Philosopher's Stone>

Recommendation 1
-----------------
Book Title: Harry Potter and the Deathly Hallows
Similarity Score: 3.43
Factors: [('authors', 1.0), ('positive_ratings', 0.95), ('tags', 0.92), ('rating', 0.46), ('publication_year', 0.1)]

Recommendation 2
-----------------
Book Title: Harry Potter and the Half-Blood Prince
Similarity Score: 3.42
Factors: [('authors', 1.0), ('positive_ratings', 0.96), ('tags', 0.93), ('rating', 0.45), ('publication_year', 0.08)]

Recommendation 3
-----------------
Book Title: Harry Potter and the Chamber of Secrets
Similarity Score: 3.4
Factors: [('authors', 1.0), ('positive_ratings', 0.98), ('tags', 0.97), ('rating', 0.44), ('publication_year', 0.01)]

Recommendation 4
-----------------
Book Title: Harry Potter and the Goblet of Fire
Similarity Score: 3.36
Factors: [('authors', 1.0), ('positive_ratings', 0.96), ('tags', 0.92), ('rating', 0.45), ('publication_year', 0.03)]

[output truncated]
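
In the output above, where every weight is set to 1, each similarity score equals the sum of its listed factors (for the first recommendation, 1.0 + 0.95 + 0.92 + 0.46 + 0.1 = 3.43). That is consistent with a weighted sum of per-factor similarities. The sketch below illustrates that idea under this assumption; `weighted_similarity` is a hypothetical helper and not the actual code of `recommend_books`.

```python
def weighted_similarity(factor_scores, weights):
    """Combine per-factor similarities into one score using a weighted sum.

    Returns the total score and the weighted factors, largest contribution first.
    """
    # scale each factor by its weight (factors without a weight contribute 0)
    contributions = {
        name: weights.get(name, 0) * score for name, score in factor_scores.items()
    }
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    total = round(sum(contributions.values()), 2)
    return total, ranked


example_weights = {
    'authors': 1,
    'publication_year': 1,
    'rating': 1,
    'positive_ratings': 1,
    'tags': 1,
}

# factor values copied from "Recommendation 1" in the output above
example_factors = {
    'authors': 1.0,
    'positive_ratings': 0.95,
    'tags': 0.92,
    'rating': 0.46,
    'publication_year': 0.1,
}

example_score, example_ranked = weighted_similarity(example_factors, example_weights)
print(example_score)
print(example_ranked)
```

Lowering a weight, for example setting `'authors'` to 0.5, would reduce how strongly that factor influences the ranking.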

### Evaluation

* In our next step, we want to **evaluate** the results of our recommender system **based on quantitative metrics**.
* For this reason we will:

     (1) Generate Fake Users
     
     (2) Make Recommendations for those Users
     
     (3) Evaluate results based on **Mean Average Precision (MAP) Score**

* We will start by **Generating Fake Users**
* For that purpose, we will use the function `generate_users`, which **creates 10 fake users with 5 random seed books each**

In [7]:
# define factors to consider when generating fake users
factors = ['authors',
           'publication_year',
           'rating',
           'positive_ratings',
           'tags'
          ]

In [8]:
# set starting time
start_time = datetime.now()

#run function to generate fake users
users = generate_users(books, tags_similarity_matrix, title_index,factors)

# end time
end_time = datetime.now()

# total execution time
total_time = end_time - start_time

print(f'Execution time: {total_time}')

10 fake users have been created succesfully

Execution time: 0:00:09.215274
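
As a rough illustration of the seed-book step only, the sketch below draws 5 distinct random titles for each of 10 users. It is an assumption about one part of what `generate_users` does; the real function also receives the similarity matrix and the factors, so it presumably derives more than the seed books. The helper name `generate_fake_users` and the placeholder titles are hypothetical.

```python
import random


def generate_fake_users(titles, n_users=10, n_seed_books=5, seed=None):
    """Assign each fake user a random sample of distinct seed book titles."""
    # a dedicated generator keeps the sampling reproducible when a seed is given
    rng = random.Random(seed)
    return {
        f'user_{i + 1}': rng.sample(titles, n_seed_books) for i in range(n_users)
    }


# placeholder titles standing in for the real book list
toy_titles = [f'Book {i}' for i in range(1, 101)]

fake_users = generate_fake_users(toy_titles, seed=42)
for user, seed_books in fake_users.items():
    print(user, seed_books)
```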


* Then, we will **make recommendations for the fake users**
* We will use the function `exploit_simulate`, which **recommends 50 books** to each fake user and **returns which books each user likes or dislikes**

In [9]:
# set starting time
start_time = datetime.now()

# function to assign likes and dislikes to each fake user
user_recommended_books = exploit_simulate(users, books, title_index, tags_similarity_matrix)

# end time
end_time = datetime.now()

# total execution time
total_time = end_time - start_time

print(f'Execution time: {total_time}')

For user 1 the Total Likes out of 50 are : 45
For user 2 the Total Likes out of 50 are : 28
For user 3 the Total Likes out of 50 are : 27
For user 4 the Total Likes out of 50 are : 27
For user 5 the Total Likes out of 50 are : 32
For user 6 the Total Likes out of 50 are : 13
For user 7 the Total Likes out of 50 are : 24
For user 8 the Total Likes out of 50 are : 27
For user 9 the Total Likes out of 50 are : 43
For user 10 the Total Likes out of 50 are : 22
Execution time: 0:14:04.190004


* Now that we have created fake users and recorded their preferences, we continue with the **evaluation of our recommender system**
* Our evaluation will be based on **Mean Average Precision (MAP)**.
* We will first calculate the **average precision score for each user**, using the function `compute_average_precision`, and then we will **compute the mean across all users**.

In [10]:
# initialize list to store the precision of each user
users_avg_precisions = []

# for each user 
for user, recommended_books in user_recommended_books.items():
    
    # compute average precision of each user
    avg_precision = compute_average_precision(recommended_books)
    
    # append to the list
    users_avg_precisions.append(avg_precision)

# compute the mean average precision
mean_avg_precision = round(np.mean(users_avg_precisions),2)

print(f'Mean Average Precision (MAP): {mean_avg_precision}')

Mean Average Precision (MAP): 0.62
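
The body of `compute_average_precision` is not shown in this notebook. For reference, a common definition of average precision over a ranked list of like/dislike outcomes is the mean of precision@k taken at every rank k where the user liked the book. The sketch below implements that definition on made-up data; it is an assumption that the project's function follows the same formula, and the input format (a list of booleans in recommendation order) is hypothetical.

```python
def average_precision(liked_flags):
    """Average precision for one user.

    liked_flags: booleans in recommendation order, True where the user liked the book.
    """
    hits = 0
    precision_sum = 0.0
    for rank, liked in enumerate(liked_flags, start=1):
        if liked:
            hits += 1
            # precision@rank, counted only at ranks holding a liked book
            precision_sum += hits / rank
    # a user with no liked books gets an average precision of 0
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_user_flags):
    """Mean of the per-user average precision scores."""
    scores = [average_precision(flags) for flags in per_user_flags]
    return sum(scores) / len(scores) if scores else 0.0


# made-up outcomes for two users
toy_outcomes = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = (1/2) / 1
]

toy_map = mean_average_precision(toy_outcomes)
print(round(toy_map, 2))
```

Because precision is only sampled at liked positions, a list that places liked books early scores higher than one with the same number of likes placed late.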
