# Hybrid recommender system

The objective of this project is to create a hybrid book recommendation system that combines three types of recommenders:
- Simple recommender
- Collaborative filtering engine
- **Content-based recommenders**

For this recommender we will use the dataset available at https://www.kaggle.com/sp1thas/book-depository-dataset?select=dataset.csv

More information on the three types of recommenders is available at https://www.datacamp.com/community/tutorials/recommender-systems-python (partial tutorial, using a different dataset).

**Content recommender**

The system will recommend books that are similar to a particular book using the pairwise cosine similarity scores based on the book's:

- Categories
- Description (synopsis)
- Title
- Authors

In order to simplify this exercise, only books in English will be recommended (to reduce stop word processing).

## Imports

In [1]:
import pandas as pd

# Load preprocessed data

In [2]:
dataset = pd.read_pickle("./pickle_files/d2_dataset.pkl")

In [3]:
authors = pd.read_pickle("./pickle_files/d2_authors.pkl")

In [4]:
categories = pd.read_pickle("./pickle_files/d2_categories.pkl")

In [5]:
formats = pd.read_pickle("./pickle_files/d2_formats.pkl")

In [6]:
dataset.head()

Unnamed: 0,Authors,Categories,Description,ID,Image_URL,ISBN10,ISBN13,Language,Publication_date,Title
0,[1],"[214, 220, 237, 2646, 2647, 2659, 2660, 2679]",SOLDIER FIVE is an elite soldier's explosive m...,-2095311218,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,184018907X,9781840189070,en,2004-10-14,Soldier Five : The Real Truth About The Bravo ...
1,"[2, 3]","[235, 3386]",John Moran and Carl Williams were the two bigg...,-2090952917,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,184454737X,9781844547371,en,2009-03-13,Underbelly : The Gangland War
2,[4],"[358, 2630, 360, 2632]",Sir Phillip knew that Eloise Bridgerton was a ...,185860283,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,8416327866,9788416327867,es,2020-04-30,"A Sir Phillip, Con Amor"
3,"[5, 6, 7, 8]","[377, 2978, 2980]",The Third Book of General Ignorance gathers t...,930776004,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,571308996,9780571308996,en,2015-10-01,QI: The Third Book of General Ignorance
4,[9],"[2813, 2980]",The Try Guys deliver their first book-an inspi...,367819524,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,8352518,9780008352516,en,2019-06-18,The Hidden Power of F*cking Up


In [7]:
authors.head()

Unnamed: 0,Author_ID,Author_Name
0,9561,
1,451324,# House Press
2,454250,# Petal Press
3,249724,#GARCIA MIGUELE
4,287710,#Worldlcass Media


In [8]:
categories.head()

Unnamed: 0,Category_ID,Category_Name
0,1998,.Net Programming
1,176,20th Century & Contemporary Classical Music
2,3291,20th Century & Contemporary Classical Music
3,2659,20th Century History: C 1900 To C 2000
4,2661,21st Century History: From C 2000 -


In [9]:
formats.head()

Unnamed: 0,Format_ID,Format_Name
0,21,Address
1,5,Audio
2,27,Bath
3,44,Big
4,14,Board


# Content based recommender

We will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document.
Scikit-learn has a built-in TfidfVectorizer class to compute the TF-IDF matrix.

Steps:
1. Reduce the dataset to a more manageable size (managing a large dataset is not the objective of this exercise)
2. Remove stop words ('the', 'a', etc.)
3. Replace not-a-number values with a blank string
4. Construct the TF-IDF matrix on the data.
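
As a rough illustration of steps 2 to 4, here is a minimal sketch on a hypothetical three-row DataFrame (the `toy` names and sentences are made up for this example and are not part of the dataset). It only shows the order of operations, not the exact preprocessing used later in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the book descriptions.
toy = pd.DataFrame({
    "Description": [
        "A detective hunts a killer in the city",
        None,  # a missing description
        "The detective and the killer meet again",
    ]
})

# Step 3: replace missing values with an empty string.
toy["Description"] = toy["Description"].fillna("")

# Steps 2 and 4: drop English stop words and build the TF-IDF matrix.
toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy["Description"])

print(toy_matrix.shape)               # (documents, vocabulary terms)
print(sorted(toy_tfidf.vocabulary_))  # terms that survived stop word removal
```

The row for the missing description ends up as an all-zero vector, so it has no similarity with any other document.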

## Reduce the dataset to a more manageable size

In [10]:
dataset.shape[0]

1109383

### Remove books in languages other than 'en'

As mentioned in the introduction, in order to simplify this exercise, only books in English will be recommended (to simplify stop word processing).

In [11]:
dataset.Language.value_counts()

en     986575
es      25366
de      16180
fr       7495
pl       2924
        ...  
ae          1
aus         1
lad         1
dak         1
rm          1
Name: Language, Length: 162, dtype: int64

In [12]:
dataset.drop(dataset.loc[dataset['Language']!='en'].index, inplace=True)

In [13]:
dataset.shape[0]

986575

### Remove books without descriptions

In [14]:
dataset.Description.isna().sum()

34331

In [15]:
dataset.drop(dataset.loc[dataset.Description.isna()].index, inplace=True)

In [16]:
dataset.shape[0]

952244

### Drop duplicated titles

In [17]:
dataset.Title.value_counts().head(3)

Journal Your Life's Journey : Journals To Write In For Women Cute Plain Blank Notebooks                     208
Rental Property Record Book : Rental Property Landlord Income Maintenance Management Tracker Record Book    178
Bullet Journal : Dot Journaling 110 pages - Size A4 - notebook 8.5" x 11" Dotted paper                      177
Name: Title, dtype: int64

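One book appears more than 200 times! Let's keep a single entry per title. Sorting by publication date in ascending order and then dropping duplicates (which keeps the first occurrence) retains the oldest edition of each title.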

In [18]:
dataset.sort_values(by = "Publication_date", na_position ="last", inplace=True)

In [19]:
dataset.drop_duplicates("Title", inplace=True)

In [20]:
dataset.shape[0]

820220
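
The cells above sort by publication date in ascending order and then call `drop_duplicates`, whose default `keep="first"` retains the earliest-dated row for each title. The sketch below shows this on hypothetical toy data, together with `keep="last"` in case the most recent edition is preferred instead.

```python
import pandas as pd

# Hypothetical toy data: two editions of "Book A".
toy = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book A"],
    "Publication_date": ["2010-05-01", "2001-01-01", "1999-03-15"],
})

toy_sorted = toy.sort_values(by="Publication_date", na_position="last")

# keep="first" (the default) retains the earliest date per title.
oldest = toy_sorted.drop_duplicates("Title")

# keep="last" retains the most recent edition instead.
newest = toy_sorted.drop_duplicates("Title", keep="last")

print(oldest)
print(newest)
```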

### Drop duplicated ISBN10

In [21]:
dataset.drop_duplicates("ISBN10", inplace=True)

In [22]:
dataset.shape[0]

820220

### Drop books with missing values

In [23]:
dataset.dropna(axis = 0, inplace = True)

In [24]:
dataset.shape[0]

820188

### Drop books yet to be published

In [25]:
dataset.drop(dataset.loc[dataset.Publication_date > "2021-11-01"].index, inplace=True)

In [26]:
dataset.shape[0]

819350

### Use only books in a popular genre

In [27]:
categories

Unnamed: 0,Category_ID,Category_Name
0,1998,.Net Programming
1,176,20th Century & Contemporary Classical Music
2,3291,20th Century & Contemporary Classical Music
3,2659,20th Century History: C 1900 To C 2000
4,2661,21st Century History: From C 2000 -
...,...,...
2770,1634,Zoology: Invertebrates
2771,1644,Zoology: Mammals
2772,1639,Zoology: Vertebrates
2773,3007,Zoos & Wildlife Parks


In [28]:
book_categories = dataset.Categories.copy()

In [29]:
book_categories.head()

1071127    [242, 295, 2722, 2723, 2724, 2725, 2736, 2737,...
217812                                                [1477]
901711                                                 [287]
361771                            [335, 341, 2622, 362, 370]
748231                     [249, 271, 273, 2802, 2812, 2982]
Name: Categories, dtype: object

In [30]:
book_categories.iloc[0]

'[242, 295, 2722, 2723, 2724, 2725, 2736, 2737, 2738, 2740, 2741, 3234, 783, 3328]'

In [31]:
type(book_categories.iloc[0])

str

Remove brackets

In [32]:
book_categories = book_categories.apply(lambda x: x[1:-1])

In [33]:
book_categories.iloc[0]

'242, 295, 2722, 2723, 2724, 2725, 2736, 2737, 2738, 2740, 2741, 3234, 783, 3328'
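
Since each `Categories` value is a string that looks like a Python list, an alternative to slicing off the brackets is to parse it. The sketch below uses `ast.literal_eval` on hypothetical toy strings; it assumes every value is a well-formed list literal, which has not been checked for the full dataset.

```python
import ast
import pandas as pd

# Hypothetical toy data in the same string format as dataset.Categories.
toy = pd.Series(["[334, 350, 2624]", "[1477]"])

# Parse each string into a real list of integers.
parsed = toy.apply(ast.literal_eval)

# One row per (book, category) pair, convenient for counting.
exploded = parsed.explode()

print(parsed.iloc[0])
print(exploded.value_counts())
```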

In [34]:
type(book_categories)

pandas.core.series.Series

In [35]:
book_categories = book_categories.to_frame()

In [36]:
book_categories

Unnamed: 0,Categories
1071127,"242, 295, 2722, 2723, 2724, 2725, 2736, 2737, ..."
217812,1477
901711,287
361771,"335, 341, 2622, 362, 370"
748231,"249, 271, 273, 2802, 2812, 2982"
...,...
927073,"1520, 1541, 1549, 1593, 1721, 1778, 1791, 1833..."
909485,"1721, 1727"
967678,"107, 108, 110, 1823"
1108881,"3092, 3098, 3100, 3106"


In [37]:
expanded_categories = book_categories["Categories"].str.split(" ", expand = True)

In [38]:
expanded_categories

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,77,78,79,80,81,82,83,84,85,86
1071127,242,295,2722,2723,2724,2725,2736,2737,2738,2740,...,,,,,,,,,,
217812,1477,,,,,,,,,,...,,,,,,,,,,
901711,287,,,,,,,,,,...,,,,,,,,,,
361771,335,341,2622,362,370,,,,,,...,,,,,,,,,,
748231,249,271,273,2802,2812,2982,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
927073,1520,1541,1549,1593,1721,1778,1791,1833,2577,,...,,,,,,,,,,
909485,1721,1727,,,,,,,,,...,,,,,,,,,,
967678,107,108,110,1823,,,,,,,...,,,,,,,,,,
1108881,3092,3098,3100,3106,,,,,,,...,,,,,,,,,,


In [39]:
expanded_categories.iloc[:,0].value_counts()

334,     29015
292,     17643
214,     14153
2488,     9514
336,      8991
         ...  
1572,        1
1017         1
1239         1
1185         1
1679,        1
Name: 0, Length: 3587, dtype: int64

There are almost 30 thousand books whose first category is 334 (shown as "334," because splitting on spaces keeps the comma).
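
Because the split above uses a single space as the separator, an ID that is followed by other categories keeps its comma ("334,"), while the same ID appearing alone does not ("334"), so the two are counted separately. A small sketch on hypothetical toy strings shows the difference when splitting on ", " instead:

```python
import pandas as pd

# Hypothetical toy data in the same format as book_categories["Categories"].
toy = pd.Series(["334, 350, 2624", "334", "1477, 334"])

# Splitting on a space leaves the comma attached: "334," versus "334".
by_space = toy.str.split(" ", expand=True)
print(by_space.iloc[:, 0].value_counts())

# Splitting on ", " yields clean IDs, so identical categories are counted together.
by_comma = toy.str.split(", ", expand=True)
print(by_comma.iloc[:, 0].value_counts())
```

This does not change the conclusion here, since 334 remains the most frequent first category either way in the output above.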

In [40]:
categories.loc[categories.Category_ID == 334]

Unnamed: 0,Category_ID,Category_Name
568,334,Contemporary Fiction


Let's keep only contemporary fiction books in order to reduce our dataset.

In [41]:
dataset = dataset[dataset.Categories.str.contains("334")]

In [42]:
dataset.shape[0]

51524

OK. Let's go forward with these 51524 books of contemporary fiction
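
One caveat: `str.contains("334")` is a substring match, so it could also select books whose category list contains an ID such as 3341 or 1334. Whether such IDs occur in this dataset was not verified here, so the 51524 figure may include a few books from other categories. A stricter variant, sketched on hypothetical toy strings, uses regular expression word boundaries:

```python
import pandas as pd

# Hypothetical toy data in the same string format as dataset.Categories.
toy = pd.Series(["[334, 350]", "[3341, 2624]", "[290, 334]", "[1334]"])

# Substring match: also hits 3341 and 1334.
loose = toy.str.contains("334")

# Whole-number match: \b marks the boundary between a digit and a non-digit.
strict = toy.str.contains(r"\b334\b", regex=True)

print(loose.tolist())
print(strict.tolist())
```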

In [43]:
dataset['Description'].str.len().max()

10687

The longest book description has 10687 characters, which is far too long.
Let's truncate each description to at most 100 characters, cutting back to the last complete word.

In [44]:
def smart_truncate(content):
    if len(content) <= 100:
        return content
    else:
        return ' '.join(content[:100+1].split(' ')[0:-1])

In [45]:
dataset['Short_Description'] = dataset.Description.apply(smart_truncate)

In [158]:
dataset.Short_Description.iloc[0]

'The King in Yellow is a book of short stories by the American writer Robert W. Chambers, first'
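
To see how the truncation behaves at word boundaries, the sketch below repeats the same logic with the limit exposed as a parameter (the `limit` argument is an addition for this illustration only) and applies it to a short sentence.

```python
def smart_truncate(content, limit=100):
    # Same logic as the cell above; `limit` is a hypothetical parameter
    # added here so the behavior is visible on a short string.
    if len(content) <= limit:
        return content
    # Take one extra character, then drop the last (possibly partial) word.
    return ' '.join(content[:limit + 1].split(' ')[0:-1])

text = "The quick brown fox jumps over the lazy dog"

print(smart_truncate(text, limit=12))  # cut falls inside "brown"
print(smart_truncate(text, limit=15))  # cut falls exactly after "brown"
```

Taking `limit + 1` characters before splitting is what lets a word that ends exactly at the limit survive: the extra character is the following space, so the dropped last piece is empty.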

We need to reset the index before we proceed.

In [67]:
dataset.head()

Unnamed: 0,Authors,Categories,Description,ID,Image_URL,ISBN10,ISBN13,Language,Publication_date,Title,Short_Description
379552,[25746],"[334, 350, 2624, 355, 2629, 358, 2630]",The King in Yellow is a book of short stories ...,-1553238272,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,238226201X,9782382262016,en,1895-01-01,The King in Yellow,The King in Yellow is a book of short stories ...
397601,[75858],"[334, 350, 2624]","""To the priests, the soldiers, the judges, to ...",1823040226,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,1463573219,9781463573218,en,1899-06-01,Torture Garden,"""To the priests, the soldiers, the judges, to ..."
782624,[83068],"[334, 335, 2978, 2983, 3098]",Three Men on the Bummel (also known as Three M...,2133882611,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,1774415607,9781774415603,en,1900-01-01,Three Men on Wheels,Three Men on the Bummel (also known as Three M...
785201,[83068],"[334, 2978, 2983]",They and I is an English humor classic from Je...,2133883946,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,177441693X,9781774416938,en,1903-04-01,They and I,They and I is an English humor classic from Je...
767556,[83068],"[290, 334, 2978, 2983]",This classic collection of humourous essays fr...,2133879778,https://d1w7fb2mkkr3kw.cloudfront.net/assets/i...,1774412772,9781774412770,en,1905-06-01,Idle Ideas in 1905,This classic collection of humourous essays fr...


In [70]:
dataset.reset_index(drop = True, inplace = True)

# Content recommender system using 'Description'

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [48]:
tfidf = TfidfVectorizer(stop_words='english')

### Construct the TF-IDF matrix

In [49]:
tfidf_matrix = tfidf.fit_transform(dataset['Short_Description'])

In [50]:
tfidf_matrix.shape

(51524, 39471)

The vocabulary contains almost forty thousand terms across about 51 thousand books.

In [51]:
from sklearn.metrics.pairwise import linear_kernel

The next step takes a bit of time (and memory).

In [52]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [53]:
cosine_sim.shape

(51524, 51524)
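
`linear_kernel` computes plain dot products. It yields cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`), so every document vector has unit length. A small check on hypothetical toy sentences:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy documents.
docs = [
    "short stories by an American writer",
    "classic American short fiction",
    "a guide to garden plants",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Each row has unit Euclidean length because of the default norm="l2".
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(row_norms)

dot = linear_kernel(toy_matrix, toy_matrix)
cos = cosine_similarity(toy_matrix, toy_matrix)
print(np.allclose(dot, cos))
```

Skipping the normalization step that `cosine_similarity` performs is the reason `linear_kernel` is the cheaper choice for TF-IDF input.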

In [56]:
cosine_sim_df = pd.DataFrame(data=cosine_sim)

In [58]:
cosine_sim

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.10913896, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.10913896, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.08979574],
       [0.        , 0.        , 0.        , ..., 0.        , 0.08979574,
        1.        ]])

### Save cosine similarity score

(!) The next command generates an 8 GB file.

In [None]:
cosine_sim_df.to_pickle("./pickle_files/cosine_similarity_score.pkl")

# Load cosine similarity score - only if you didn't run the code above

(!) This requires about 23 GB of memory

In [46]:
cosine_sim_df = pd.read_pickle("./pickle_files/cosine_similarity_score.pkl")

In [47]:
cosine_sim = cosine_sim_df.to_numpy()
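
A dense 51524 x 51524 matrix of 8-byte floats takes roughly 21 GB before any DataFrame overhead, which is in line with the memory warning above. If that is too much, one possible alternative (not what this notebook does) is to keep only the sparse TF-IDF matrix and compute a single row of similarities on demand. The sketch below uses hypothetical toy sentences, and `similarity_row` is a helper name invented for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Size of a dense float64 n x n matrix for the 51524 books kept above.
n_books = 51524
dense_bytes = n_books * n_books * 8
print(f"{dense_bytes / 1e9:.1f} GB")

# Hypothetical toy documents standing in for the short descriptions.
docs = [
    "short stories by an American writer",
    "classic American short fiction",
    "a guide to garden plants",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

def similarity_row(matrix, idx):
    """Similarity of document idx against all documents, as a 1-D array."""
    return linear_kernel(matrix[idx], matrix).ravel()

# The on-demand row matches the corresponding row of the full matrix.
full = linear_kernel(toy_matrix, toy_matrix)
row = similarity_row(toy_matrix, 0)
print(np.allclose(row, full[0]))
```

The trade-off is that each lookup repeats a sparse matrix product instead of reading a precomputed row.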

### Construct a reverse map of indices and book ISBN10

In [74]:
indices = pd.Series(dataset.index, index=dataset['ISBN10']).drop_duplicates()

In [100]:
def get_recommendations(ISBN10, cosine_sim=cosine_sim):
    # Get the index of the book that matches the ISBN
    idx = indices[ISBN10]

    # Get the pairwise similarity scores of all books with that book
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the books based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar books, skipping the first
    # entry, which is normally the book itself
    sim_scores = sim_scores[1:11]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the ISBN10 of the top 10 most similar books
    return dataset['ISBN10'].iloc[book_indices]
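
The slice `sim_scores[1:11]` assumes the queried book is the first entry after sorting. If another book has an identical short description (similarity 1.0), that may not hold. A hedged alternative is to exclude the query position explicitly; `top_k_similar` below is a hypothetical helper, shown on a made-up similarity row rather than on the real matrix.

```python
import numpy as np

def top_k_similar(sim_row, idx, k=10):
    """Positions of the k highest scores in sim_row, excluding position idx."""
    order = np.argsort(sim_row)[::-1]  # highest score first
    order = order[order != idx]        # drop the query book itself
    return order[:k]

# Hypothetical similarity scores of book 0 against five books.
sim_row = np.array([1.0, 0.2, 0.0, 0.7, 0.5])

print(top_k_similar(sim_row, idx=0, k=3))
```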

Let's find books similar to the first book in the dataset:

In [101]:
dataset.Title[0]

'The King in Yellow'

In [102]:
pd.set_option('display.max_colwidth', None)

In [103]:
dataset.loc[get_recommendations('238226201X', cosine_sim).index, ["Title", "Short_Description"]]

Unnamed: 0,Title,Short_Description
7438,The King in Yellow and Other Horror,"Every story of The King in Yellow has something riveting about it ... so perfectly realized, they"
2080,Men without Women,CLASSIC SHORT STORIES FROM THE MASTER OF AMERICAN FICTION
31289,"The Yellow Wallpaper (Wisehouse Classics - First 1892 Edition, with the Original Illustrations by Joseph Henry Hatfield)","THE YELLOW WALLPAPER is a story by the American writer Charlotte Perkins Gilman, first published in"
28946,The Best American Catholic Short Stories : A Sheed & Ward Collection,The Best American Catholic Short Stories captures twenty of the best short stories from thirteen
18051,The Signet Classic Book of American Short Stories,The best of American short fiction
15547,The Wizdom of Oz,Read this book and find your very own yellow brick road to enlightenment! Have you ever felt short
20598,Assignment In Eternity,"Classic novellas and short stories from the Dean of Science Fiction, Robert"
17739,The Collected Stories : World's End; Sinning with Annie; Jungle Bells; the Consul's File; the London Embassy;,"The Collected Stories is a collection of award-winning writer Paul Theroux's short stories, each of"
4831,N.P.,N.P. is the title of a last collection of short stories by a celebrated Japanese writer. Written in
30321,100 Years of the Best American Short Stories,The Best American Short Stories is the longest running and best-selling series of short fiction in


# Save recommender

In [104]:
indices.to_pickle("./pickle_files/indices.pkl")

### End of preprocessing