# Hybrid book recommender system

The objective of this project is to create a hybrid book recommendation system that combines three types of recommenders:
- Simple recommender
- Collaborative filtering engine
- **Content-based recommenders**

For this recommender we will use the dataset available at https://www.kaggle.com/sp1thas/book-depository-dataset?select=dataset.csv

More information on the three types of recommenders is available at https://www.datacamp.com/community/tutorials/recommender-systems-python (a partial tutorial that uses a different dataset).

## Imports

In [1]:
import pandas as pd
import numpy as np

# Preprocess dataset.csv

## Load data

In [2]:
dataset_raw = pd.read_csv('dataset2/dataset.csv',
                         dtype={'authors':str,
                                'categories':str,
                                'description':str,
                                'id':np.int32,
                                'image-url':str,
                                'isbn10':str,
                                'isbn13':str,
                                'lang':str,
                                'title':str},
                         parse_dates=['publication-date'],
                         usecols=['authors',
                                  'categories',
                                  'description',
                                  'id',
                                  'image-url',
                                  'isbn10',
                                  'isbn13',
                                  'lang',
                                  'publication-date',
                                  'title'])

In [3]:
dataset_raw.rename(columns={'authors':'Authors',
                          'categories':'Categories',
                          'description':'Description',
                          'id':'ID',
                          'image-url':'Image_URL',
                          'isbn10':'ISBN10',
                          'isbn13':'ISBN13',
                          'lang':'Language',
                          'publication-date':'Publication_date',
                          'title':'Title'},inplace=True)

## Explore 'dataset_raw'

In [4]:
dataset_raw.head(3)

Unnamed: 0,Authors,Categories,Description,ID,Image_URL,ISBN10,ISBN13,Language,Publication_date,Title
0,[1],"[214, 220, 237, 2646, 2647, 2659, 2660, 2679]",SOLDIER FIVE is an elite soldier's explosive m...,-2095311218,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,184018907X,9781840189070,en,2004-10-14,Soldier Five : The Real Truth About The Bravo ...
1,"[2, 3]","[235, 3386]",John Moran and Carl Williams were the two bigg...,-2090952917,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,184454737X,9781844547371,en,2009-03-13,Underbelly : The Gangland War
2,[4],"[358, 2630, 360, 2632]",Sir Phillip knew that Eloise Bridgerton was a ...,185860283,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,8416327866,9788416327867,es,2020-04-30,"A Sir Phillip, Con Amor"


In [5]:
dataset_raw.count()

Authors             1109383
Categories          1109383
Description         1029296
ID                  1109383
Image_URL           1109356
ISBN10              1109383
ISBN13              1109383
Language            1048976
Publication_date    1106780
Title               1109383
dtype: int64

In [6]:
dataset_raw.isna().sum()

Authors                 0
Categories              0
Description         80087
ID                      0
Image_URL              27
ISBN10                  0
ISBN13                  0
Language            60407
Publication_date     2603
Title                   0
dtype: int64

About 7% of books have no description, 5% have no language defined, and 0.2% have no publication date.
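
The percentages above can be computed directly rather than by hand. The following is a minimal sketch on a toy frame; `toy` and its values are hypothetical stand-ins for `dataset_raw`, not the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for dataset_raw
toy = pd.DataFrame({
    'Description': ['a', None, 'c', 'd'],
    'Language': ['en', 'es', None, None],
    'Title': ['t1', 't2', 't3', 't4'],
})

# isna() yields a boolean frame; the column-wise mean is the
# fraction of missing values, scaled here to a percentage
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct)
```

On the real data, the same expression applied to `dataset_raw` should give the share of missing values per column.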

# Preprocess authors.csv

## Load data

In [7]:
authors_raw = pd.read_csv('dataset2/authors.csv',
                         dtype={'author_name':str,
                                'author_id':np.int32})

## Explore 'authors_raw'

In [8]:
print(authors_raw)

        author_id        author_name
0            9561                NaN
1          451324      # House Press
2          454250      # Petal Press
3          249724    #GARCIA MIGUELE
4          287710  #Worldlcass Media
...           ...                ...
654016     237785                丘宏義
654017      77701           國立彰化師範大學
654018     410709                張成秋
654019     618322                 灰雁
654020     373580                 菊子

[654021 rows x 2 columns]

In [10]:
authors_raw.rename(columns={'author_id':'Author_ID',
                            'author_name':'Author_Name'},inplace=True)

In [18]:
authors_raw.loc[authors_raw['Author_Name'].isna()]

Unnamed: 0,Author_ID,Author_Name
0,9561,
449420,5766,


Two authors have missing names. We leave them in place: a book whose only authors are these two will simply have NaN in its author list.
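
To illustrate how those NaN entries would surface, here is a small sketch that resolves author IDs to names. It assumes the `Authors` column holds strings such as `'[2, 3]'` (as the `head()` output suggests) and parses them with `ast.literal_eval`; the helper `author_names` and the toy frames are hypothetical, not part of the notebook.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical toy data: author 2 has no name
authors = pd.DataFrame({'Author_ID': [1, 2, 3],
                        'Author_Name': ['A. Writer', np.nan, 'C. Author']})
books = pd.DataFrame({'Title': ['Book A', 'Book B', 'Book C'],
                      'Authors': ['[1]', '[2]', '[2, 3]']})

# Lookup table from author ID to author name
id_to_name = authors.set_index('Author_ID')['Author_Name'].to_dict()

def author_names(raw):
    """Parse a string like '[2, 3]' and map each ID to its name."""
    ids = ast.literal_eval(raw)
    return [id_to_name.get(i, np.nan) for i in ids]

books['Author_Names'] = books['Authors'].apply(author_names)
print(books[['Title', 'Author_Names']])
```

In this sketch a missing name stays in the list as NaN, so later steps can decide whether to drop or keep such entries.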

# Preprocess categories.csv

## Load data

In [25]:
categories_raw = pd.read_csv('dataset2/categories.csv')

In [26]:
categories_raw.columns

Index(['category_id', 'category_name'], dtype='object')

In [27]:
categories_raw.dtypes

category_id       int64
category_name    object
dtype: object

In [28]:
categories_raw.rename(columns={'category_id':'Category_ID',
                            'category_name':'Category_Name'},inplace=True)

In [29]:
categories_raw.head()

Unnamed: 0,Category_ID,Category_Name
0,1998,.Net Programming
1,176,20th Century & Contemporary Classical Music
2,3291,20th Century & Contemporary Classical Music
3,2659,20th Century History: C 1900 To C 2000
4,2661,21st Century History: From C 2000 -


In [30]:
categories_raw.isna().sum()

Category_ID      0
Category_Name    0
dtype: int64

There are no empty category names.
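
The `head()` output above does show one name listed under two different IDs (176 and 3291). A quick check for such repeated names could be sketched as follows, using only the rows visible in that output; the variable names are illustrative.

```python
import pandas as pd

# Rows copied from the head() output of categories_raw
cats = pd.DataFrame({
    'Category_ID': [1998, 176, 3291, 2659],
    'Category_Name': ['.Net Programming',
                      '20th Century & Contemporary Classical Music',
                      '20th Century & Contemporary Classical Music',
                      '20th Century History: C 1900 To C 2000'],
})

# keep=False marks every row whose name occurs more than once
dupes = cats[cats.duplicated('Category_Name', keep=False)]

# Collect the IDs that share each repeated name
ids_per_name = dupes.groupby('Category_Name')['Category_ID'].apply(list)
print(ids_per_name)
```

Whether repeated names should be merged is a design decision for the later recommender steps; this only makes them visible.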

# Preprocess formats.csv

## Load data

In [31]:
formats_raw = pd.read_csv('dataset2/formats.csv')

In [33]:
formats_raw.dtypes

format_id       int64
format_name    object
dtype: object

In [34]:
formats_raw.rename(columns={'format_id':'Format_ID',
                            'format_name':'Format_Name'},inplace=True)

In [35]:
formats_raw.head()

Unnamed: 0,Format_ID,Format_Name
0,21,Address
1,5,Audio
2,27,Bath
3,44,Big
4,14,Board


In [37]:
formats_raw.isna().sum()

Format_ID      0
Format_Name    0
dtype: int64

There are no empty format names.

# Export preprocessed dataframes

In [38]:
dataset_raw.to_pickle("./pickle_files/d2_dataset.pkl")
authors_raw.to_pickle("./pickle_files/d2_authors.pkl")
categories_raw.to_pickle("./pickle_files/d2_categories.pkl")
formats_raw.to_pickle("./pickle_files/d2_formats.pkl")
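
Note that `to_pickle` does not create missing directories, so `./pickle_files` must exist before the export. The sketch below shows a round trip with `pd.read_pickle` on a toy frame; it writes to a temporary directory (an assumption made here so the example has no side effects) instead of the notebook's real path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with two rows taken from the formats_raw head() output
toy = pd.DataFrame({'Format_ID': [21, 5],
                    'Format_Name': ['Address', 'Audio']})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'pickle_files'
    # Create the target directory first; to_pickle will not do it
    out_dir.mkdir(parents=True, exist_ok=True)

    path = out_dir / 'd2_formats.pkl'
    toy.to_pickle(path)

    # Reload and confirm that values and dtypes survived the round trip
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```

Unlike CSV, a pickle keeps column dtypes such as `int32` and parsed dates, which avoids repeating the `dtype` and `parse_dates` arguments when the frames are loaded again.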

### End of preprocessing