# Books Recommender Engine: Preprocessing and Collaborative Recommender

- [The Error](#The-Error)

### Contents:
- [Imports](#Imports)
- [Reading in the Data](#Reading-in-the-Data)
- [Preprocessing](#Preprocessing)
  * [Pivot Table](#Pivot-Table)
  * [Sparse Matrix](#Sparse-Matrix)

## Imports

In [1]:
#importing the packages
import pandas as pd
import sys
from scipy import sparse 
from sklearn.metrics.pairwise import pairwise_distances

%config InlineBackend.figure_format = 'retina'

## Preprocessing

In [2]:
#reading in the data
explicit_ratings = pd.read_csv('./datasets/explicit_ratings.csv')
explicit_ratings.head()

Unnamed: 0.1,Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,24,88006,1551669498,1,Girls Night,Stef Ann Holm,2002,Mira
1,25,76151,1551669498,2,Girls Night,Stef Ann Holm,2002,Mira
2,26,229011,1551669498,3,Girls Night,Stef Ann Holm,2002,Mira
3,27,28700,1551669498,6,Girls Night,Stef Ann Holm,2002,Mira
4,28,47866,1551669498,7,Girls Night,Stef Ann Holm,2002,Mira


In [3]:
#For a user-based recommender I only need the user, rating, and title, so I'm dropping the other columns
explicit_ratings.drop(columns=['ISBN', 'Book-Author', 'Year-Of-Publication', 'Publisher'], inplace=True)

In [4]:
#this is too much data for a pivot table, so let's keep only users with at least 100 ratings and
#see if that set is small enough to create a pivot table
user_counts = explicit_ratings['User-ID'].value_counts()
sample_ratings = explicit_ratings[explicit_ratings['User-ID'].isin(user_counts[user_counts >= 100].index)]

In [5]:
#checking the shape of the sample
sample_ratings.shape

(79536, 4)

### Pivot Table

In [6]:
#seeing if my sample size is small enough to create a pivot table
pivot = sample_ratings.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')

In [7]:
#it worked, and I want to check out the updated size
#We have 348 users rating a total of 48,237 titles
pivot.shape

(48237, 348)
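
To make the pivot step concrete, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical; only the column names are taken from the notebook. It shows that missing (title, user) pairs become NaN and that `pivot_table` averages duplicate pairs, since its default `aggfunc` is the mean.

```python
import pandas as pd

# Hypothetical toy ratings that reuse the notebook's column names.
toy = pd.DataFrame({
    'User-ID': [1, 1, 2, 2, 2],
    'Book-Title': ['A', 'B', 'A', 'A', 'C'],
    'Book-Rating': [5, 7, 6, 8, 9],
})

# One row per title, one column per user. The default aggfunc is the mean,
# so user 2's two ratings of 'A' (6 and 8) are averaged into one cell.
toy_pivot = toy.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')

# Titles a user never rated show up as NaN.
print(toy_pivot)
```

Most cells of the real pivot are NaN for the same reason: each user rates only a small share of the titles.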

### Sparse Matrix

The pivot table is large, and calculating pairwise distances on it can be slow and memory-hungry, so I am converting it from a pandas DataFrame to a SciPy sparse matrix.

In [8]:
#getting the in-memory size of the pivot table, in bytes
sys.getsizeof(pivot)

138697528

In [9]:
#preprocessing step: filling NaNs with zeros, then converting to a sparse CSR matrix
pivot_sparse = sparse.csr_matrix(pivot.fillna(0))

In [10]:
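#getting the updated size; note that for a SciPy sparse matrix sys.getsizeof only counts the
#Python wrapper object, not the arrays holding the data, so this number is not comparable to the one above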
sys.getsizeof(pivot_sparse)

56
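
The 56 bytes reported here are only the size of the Python object that wraps the matrix. A rough way to see how much memory the stored values actually take is to add up the three NumPy arrays behind a CSR matrix. This is a sketch on a made-up matrix; `dense` and `csr` are hypothetical stand-ins, not the notebook's `pivot_sparse`.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the filled pivot: mostly zeros, a few ratings.
dense = np.zeros((1000, 50))
dense[::100, 0] = 5.0  # 10 non-zero entries

csr = sparse.csr_matrix(dense)

# A CSR matrix keeps its contents in three NumPy arrays (data, indices, indptr);
# their combined nbytes is the memory the stored entries occupy.
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(dense.nbytes, csr_bytes)
```

Applying the same three-array sum to `pivot_sparse` should give a figure that can be compared fairly with the size of `pivot`.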

## Recommender

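Note: I tried both `pairwise_distances` and `cosine_similarity` from `sklearn.metrics.pairwise` and got better scores with `pairwise_distances`, though they are still not great. With `metric="cosine"` it returns cosine distance (1 minus cosine similarity), so smaller values mean more similar titles.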

In [11]:
#setting up the recommender 
recommender = pairwise_distances(pivot_sparse, metric="cosine")

In [12]:
#verifying the shape of the engine: there should be one row and one column per title
recommender.shape

(48237, 48237)
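
As a quick check of what the numbers in this matrix mean, the sketch below compares `pairwise_distances(..., metric="cosine")` with `cosine_similarity` on a tiny made-up matrix. The `toy` ratings are hypothetical; the two sklearn functions are the ones mentioned above.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Hypothetical 3-title x 3-user rating matrix (0 = not rated).
toy = sparse.csr_matrix(np.array([
    [5.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]))

dist = pairwise_distances(toy, metric="cosine")
sim = cosine_similarity(toy)

# Cosine distance is 1 minus cosine similarity, so the two agree up to rounding.
print(np.allclose(dist, 1.0 - sim))

# Titles with no raters in common (rows 1 and 2) end up at distance 1.0.
print(dist[1, 2])
```

This is why most entries in the matrix are exactly 1.0 and why sorting in ascending order puts the closest titles first.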

In [13]:
#creating a dataframe to bring the title names back into view
recommender_df = pd.DataFrame(recommender, columns=pivot.index, index=pivot.index)
recommender_df.head(3)

Book-Title,Dark Justice,Final Fantasy Anthology: Official Strategy Guide Brady Games,Highland Desire Zebra Splendor Historical Romances,How to Travel with a Salmon and Other Essays,Lamb to the Slaughter and Other Stories Penguin 60s S.,Murder of a Sleeping Beauty Scumble River Mysteries Paperback,Nonbook Materials: The Organization of Integrated Collections,This Place Has No Atmosphere Laurel-Leaf Books,Travel Companion Chile and Easter Island Travel Companion,the Devil Will Drag You Under,...,sharks reading discovery,stardust,termcap & terminfo O'Reilly Nutshell,the Dark Light Years,them Modern Library,together by christmas,whataboutrick.com: a poetic tribute to Richard A. Ricci,¿Eres tu mi mamá?Are You My Mother?,Ángeles fugaces Falling Angels,Über das Fernsehen.
Dark Justice,0.0,1.0,1.0,1.0,1.0,0.353838,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.386428,1.0,1.0,1.0,1.0
Final Fantasy Anthology: Official Strategy Guide Brady Games,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Highland Desire Zebra Splendor Historical Romances,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Evaluation of the Recommender Engine

In [48]:
#this helps find how a title is listed; raising the head() value shows more matches,
#which is useful when a title has several editions or foreign versions
q = 'Harry Potter'
explicit_ratings[explicit_ratings['Book-Title'].str.contains(q)]['Book-Title'].head(1)

83    Harry Potter and the Chamber of Secrets
Name: Book-Title, dtype: object

In [68]:
#Looking up recommendations for those who liked The Lovely Bones:
recommender_df['The Lovely Bones: A Novel'].sort_values()[1:11]

Book-Title
The Book of Ruth Oprah's Book Club Paperback                          0.669837
Gap Creek: The Story Of A Marriage                                    0.677572
Where the Heart Is Oprah's Book Club Paperback                        0.682920
A Painted House                                                       0.682925
The Other Side and Back: A Psychic's Guide to Our World and Beyond    0.683016
Bastard Out of Carolina                                               0.691984
The Nanny Diaries: A Novel                                            0.694710
Life of Pi                                                            0.696095
Lucky                                                                 0.708716
Bel Canto: A Novel                                                    0.711325
Name: The Lovely Bones: A Novel, dtype: float64
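
The `[1:11]` slice assumes the queried title sorts first with a distance of 0. That can fail if another title has an identical rating pattern and also sits at distance 0. A small hypothetical helper (`top_n_similar` is my name, not part of the notebook) drops the queried label explicitly instead; the 3x3 table below is made-up data standing in for `recommender_df`.

```python
import pandas as pd

def top_n_similar(distance_df, label, n=10):
    """Return the n labels closest to `label`, excluding `label` itself."""
    # Drop the queried label explicitly rather than assuming it sorts first.
    return distance_df[label].drop(label).sort_values().head(n)

# Hypothetical distance table: 'B' and 'C' are at distance 0 from each other.
titles = ['A', 'B', 'C']
toy_df = pd.DataFrame(
    [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.0],
     [0.9, 0.0, 0.0]],
    index=titles, columns=titles,
)

print(top_n_similar(toy_df, 'B', n=2))
```

With the real data the call would look like `top_n_similar(recommender_df, 'The Lovely Bones: A Novel')`.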

In [29]:
#Looking up recommendations for those who liked To Kill a Mockingbird:
recommender_df['To Kill a Mockingbird'].sort_values()[1:11]

Book-Title
The Catcher in the Rye                                              0.662548
STONES FROM THE RIVER                                               0.699207
The Hotel New Hampshire                                             0.705292
Wicked: The Life and Times of the Wicked Witch of the West          0.711798
The God of Small Things                                             0.713226
Five Quarters of the Orange                                         0.714590
Drowning Ruth                                                       0.717430
Lives on the Boundary                                               0.718341
Rebecca                                                             0.719051
One Day in the Life of Ivan Denisovich Signet Classics Paperback    0.719239
Name: To Kill a Mockingbird, dtype: float64

In [39]:
#Looking up recommendations for those who liked Harry Potter and the Chamber of Secrets:
recommender_df['Harry Potter and the Chamber of Secrets'].sort_values()[1:11]

Book-Title
Sarong Party Girl                                        0.219131
The Complete Idiot's GuideR to Scrapbooking              0.219131
HOW LEOPARD GOT SPOTS Little Barefoot Books              0.219131
Hey Good Looking Sweet Dreams No 82                      0.219131
Hints from Heloise  Co                                   0.219131
101 WaysMeet MR Rt Bantam Sweet Dreams Romances          0.219131
Love and Marriage                                        0.219131
Goodbye Forever Bantam Sweet Dreams Romances             0.219131
Truth about Me & Bobby V Bantam Sweet Dreams Romances    0.219131
The Lengthening Shadow                                   0.219131
Name: Harry Potter and the Chamber of Secrets, dtype: float64

In [30]:
#Looking up recommendations for those who liked Interview with the Vampire:
recommender_df['Interview with the Vampire'].sort_values()[1:11]

Book-Title
The Tale of the Body Thief Vampire Chronicles Paperback         0.469585
The Queen of the Damned Vampire Chronicles Paperback            0.582157
Pandora: New Tales of the Vampires New Tales of the Vampires    0.615537
Black Beauty                                                    0.621014
The Witching Hour Lives of the Mayfair Witches                  0.647606
Skeleton Crew                                                   0.659668
Vittorio the Vampire: New Tales of the Vampires                 0.663360
Catch 22                                                        0.665097
Dr. Seuss's A B C I Can Read It All by Myself Beginner Books    0.665864
Violin                                                          0.668535
Name: Interview with the Vampire, dtype: float64

### Item Pivot Table

In [14]:
item_pivot = sample_ratings.pivot_table(index='User-ID', columns='Book-Title', values='Book-Rating')

In [15]:
item_pivot.shape

(348, 48237)

### Sparse Matrix

The pivot table is large, and calculating pairwise distances on it can be slow and memory-hungry, so I am converting it from a pandas DataFrame to a SciPy sparse matrix.

In [17]:
#getting the in-memory size of the pivot table, in bytes
sys.getsizeof(item_pivot)

134294616

In [18]:
#preprocessing step: filling NaNs with zeros, then converting to a sparse CSR matrix
sparse_pivot = sparse.csr_matrix(item_pivot.fillna(0))

In [19]:
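#getting the updated size; note that for a SciPy sparse matrix sys.getsizeof only counts the
#Python wrapper object, not the arrays holding the data, so this number is not comparable to the one above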
sys.getsizeof(sparse_pivot)

56

## Recommender

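Note: I tried both `pairwise_distances` and `cosine_similarity` from `sklearn.metrics.pairwise` and got better scores with `pairwise_distances`, though they are still not great. With `metric="cosine"` it returns cosine distance (1 minus cosine similarity), so smaller values mean more similar users.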

In [20]:
#setting up the recommender 
item_recommender = pairwise_distances(sparse_pivot, metric="cosine")

In [21]:
#verifying the shape of the engine: there should be one row and one column per user
item_recommender.shape

(348, 348)

In [22]:
#creating a dataframe to bring the title names back into view
item_recommender_df = pd.DataFrame(item_recommender, columns=item_pivot.index, index=item_pivot.index)
item_recommender_df.head(3)

User-ID,2033,2276,4017,5582,6242,6251,6543,6575,7346,8067,...,244685,245410,245827,246311,247429,247447,248718,249894,250709,277427
2033,0.0,0.964824,1.0,1.0,1.0,0.952951,1.0,1.0,0.987337,1.0,...,1.0,0.978062,0.993965,1.0,1.0,0.990691,0.987177,0.955503,1.0,1.0
2276,0.964824,0.0,1.0,1.0,1.0,0.996316,1.0,0.99143,0.979152,1.0,...,0.973343,0.993569,0.989884,1.0,1.0,0.99343,0.998632,0.991481,1.0,1.0
4017,1.0,1.0,0.0,0.977109,0.903242,0.987751,0.976813,0.92977,0.990496,0.995667,...,1.0,1.0,0.991148,0.954538,1.0,0.992415,0.992658,0.994793,0.987908,0.991718


## The Error

## Evaluation of the Recommender Engine

In [23]:
#looking up user 2033 with the string label '2033'; this raises a KeyError
item_recommender_df['2033']

KeyError: '2033'

In [44]:
#trying the same User-ID on recommender_df, which is indexed by book title, so this fails as well
recommender_df['2033'].sort_values()[1:11]

KeyError: '2033'
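
A likely cause, assuming `User-ID` was read from the CSV as integers: the labels of `item_recommender_df` are then ints such as `2033`, and the string `'2033'` is a different key. The sketch below reproduces this on a made-up 2x2 table (`toy_users` is hypothetical) and shows the lookup with an integer label.

```python
import pandas as pd

# Hypothetical 2x2 distance table with integer labels, like item_recommender_df.
toy_users = pd.DataFrame(
    [[0.0, 0.96], [0.96, 0.0]],
    index=[2033, 2276], columns=[2033, 2276],
)

# An integer dtype here means the labels are ints, not strings.
print(toy_users.columns.dtype)

try:
    toy_users['2033']  # string label: not found
except KeyError as err:
    print('KeyError:', err)

# The integer label works; drop the user itself before sorting by distance.
closest = toy_users[2033].drop(2033).sort_values()
print(closest)
```

Checking `item_recommender_df.columns.dtype` in the notebook would confirm whether this is what happened. If it is, `item_recommender_df[2033]` should work, while `recommender_df` would still need a book title as its key.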