<img src="./assets/smallbookstack.jpg" style="float: left; margin: 20px; height: 100px">

# Book Recommender Engines Capstone Project<br><br>Collaborator-Based: Preprocessing and Engine<br>
***

### Contents:
- [Imports](#Imports)
- [Reading in the Data](#Reading-in-the-Data)
- [Preprocessing](#Preprocessing)
  * [Pivot Table](#Pivot-Table)
  * [Sparse Matrix](#Sparse-Matrix)
- [Recommender](#Recommender)
- [Evaluation of the Recommender Engine](#Evaluation-of-the-Recommender-Engine)

## Imports

In [1]:
#importing the packages
import pandas as pd
import sys
from scipy import sparse 
from sklearn.metrics.pairwise import pairwise_distances

%config InlineBackend.figure_format = 'retina'

## Preprocessing

In [2]:
#reading in the data
explicit_ratings = pd.read_csv('./datasets/explicit_ratings.csv')
explicit_ratings.head()

Unnamed: 0.1,Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,24,88006,1551669498,1,Girls Night,Stef Ann Holm,2002,Mira
1,25,76151,1551669498,2,Girls Night,Stef Ann Holm,2002,Mira
2,26,229011,1551669498,3,Girls Night,Stef Ann Holm,2002,Mira
3,27,28700,1551669498,6,Girls Night,Stef Ann Holm,2002,Mira
4,28,47866,1551669498,7,Girls Night,Stef Ann Holm,2002,Mira


In [3]:
#For a user-based recommender, I only need the user, rating, and title, so I'm going to drop the others
explicit_ratings.drop(columns=['Unnamed: 0', 'ISBN', 'Book-Author', 'Year-Of-Publication', 'Publisher'], inplace=True)

In [4]:
#this is too much data for a pivot table though, so let's pull the users who have at least 200 ratings and
#see if that's a small enough set to create a pivot table
user_counts = explicit_ratings['User-ID'].value_counts()
sample_ratings = explicit_ratings[explicit_ratings['User-ID'].isin(user_counts[user_counts >= 200].index)]

In [5]:
#checking out the shape of the sample
sample_ratings.shape

(44443, 3)
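
The filtering step above can be sketched on a toy frame. The ratings and the threshold of 2 below are made up for illustration; only the column names and the `value_counts` / `isin` pattern come from the notebook.

```python
import pandas as pd

# Toy ratings frame (made-up values) with the same column names as the notebook.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "Book-Rating": [5, 7, 9, 8, 6, 10],
    "Book-Title": ["A", "B", "C", "A", "C", "B"],
})

min_ratings = 2  # hypothetical threshold; the notebook uses 200

# Count ratings per user, then keep only rows from users at or above the threshold.
toy_counts = toy_ratings["User-ID"].value_counts()
active_users = toy_counts[toy_counts >= min_ratings].index
toy_sample = toy_ratings[toy_ratings["User-ID"].isin(active_users)]

print(toy_sample["User-ID"].nunique(), "users,", len(toy_sample), "rows")
```

Raising the threshold shrinks the number of users (and therefore pivot columns) quickly, which is what makes the pivot table feasible.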

### Pivot Table

In [6]:
#seeing if my sample size is small enough to create a pivot table
pivot = sample_ratings.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')

In [7]:
#it worked, and I want to check out the updated size
#The output below shows 32,476 titles (rows) rated by 84 users (columns)
pivot.shape

(32476, 84)
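
For readers unfamiliar with `pivot_table`, here is a minimal sketch on made-up ratings. It shows that each title becomes a row, each user a column, and any title a user has not rated becomes `NaN`.

```python
import pandas as pd

# Toy ratings (made-up values) with the notebook's column names.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "Book-Rating": [5, 7, 8, 10],
    "Book-Title": ["A", "B", "A", "B"],
})

# Titles as rows, users as columns, ratings as values (mean if duplicated).
toy_pivot = toy_ratings.pivot_table(
    index="Book-Title", columns="User-ID", values="Book-Rating"
)

print(toy_pivot.shape)
print(int(toy_pivot.isna().sum().sum()), "missing ratings")
```

With real data most cells are missing, which is why the next step converts the table to a sparse matrix.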

### Sparse Matrix

The pivot table is large, and computing pairwise distances on it can be slow and memory-hungry, so I am converting it from a pandas DataFrame to a SciPy sparse matrix.

In [8]:
#getting the size of the pivot file
sys.getsizeof(pivot)

24754726

In [9]:
#preprocessing step of converting nans to zeros.
pivot_sparse = sparse.csr_matrix(pivot.fillna(0))

In [10]:
#getting the updated size (note: sys.getsizeof only measures the wrapper object, not the arrays that hold the data)
sys.getsizeof(pivot_sparse)

56
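
The 56 bytes reported above is the size of the Python wrapper object only; `sys.getsizeof` does not follow the references to the `data`, `indices`, and `indptr` arrays that actually store a CSR matrix. A more representative comparison, sketched here on a made-up mostly-zero matrix, sums the `nbytes` of those three arrays. The helper name `csr_nbytes` is my own, not part of SciPy.

```python
import sys
import numpy as np
from scipy import sparse

# Toy dense matrix that is mostly zeros, standing in for pivot.fillna(0).
rng = np.random.default_rng(0)
dense = np.zeros((1000, 50))
rows = rng.integers(0, 1000, size=500)
cols = rng.integers(0, 50, size=500)
dense[rows, cols] = rng.integers(1, 11, size=500)

csr = sparse.csr_matrix(dense)

def csr_nbytes(matrix):
    """Bytes held by the three arrays that back a CSR matrix."""
    return matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes

print("dense bytes:   ", dense.nbytes)
print("sparse bytes:  ", csr_nbytes(csr))
print("getsizeof(csr):", sys.getsizeof(csr))  # wrapper object only
```

The sparse matrix is still much smaller than the dense one, just not as dramatically as the `getsizeof` number suggests.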

## Recommender

Note: I tried sklearn's `pairwise_distances` and `cosine_similarity` and got better results with `pairwise_distances`, though they are still not great.
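
One thing worth checking when comparing the two: with `metric="cosine"`, `pairwise_distances` returns cosine distance, which is 1 minus cosine similarity. The sketch below uses a made-up 3x3 ratings matrix to confirm that relationship, so any difference in results would more likely come from the sort direction than from the metric itself.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Toy ratings matrix: rows are titles, columns are users, 0 means "not rated".
toy = sparse.csr_matrix(np.array([
    [5.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]))

distance = pairwise_distances(toy, metric="cosine")
similarity = cosine_similarity(toy)

# Cosine distance is 1 - cosine similarity, so the two matrices mirror each other.
print(np.allclose(distance, 1.0 - similarity))
```

With distances, the closest titles have the smallest values (sort ascending); with similarities, they have the largest (sort descending).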

In [11]:
#setting up the recommender 
recommender = pairwise_distances(pivot_sparse, metric="cosine")

In [12]:
#verifying the shape of the engine to make sure the numbers are the same
recommender.shape

(32476, 32476)

In [13]:
#creating a dataframe to bring the title names back into view
recommender_df = pd.DataFrame(recommender, columns=pivot.index, index=pivot.index)
recommender_df.head(3)

Book-Title,Dark Justice,Final Fantasy Anthology: Official Strategy Guide Brady Games,Lamb to the Slaughter and Other Stories Penguin 60s S.,Murder of a Sleeping Beauty Scumble River Mysteries Paperback,Nonbook Materials: The Organization of Integrated Collections,This Place Has No Atmosphere Laurel-Leaf Books,!%@ A Nutshell handbook,!Arriba! Comunicacion y cultura,###############################################################################################################################################################################################################################################################,$14 In The Bank Cathy Collection,...,schÃ?Â¶ner wohnen.doc. Ein WG- Roman.,sed & awk 2nd Edition,stardust,termcap & terminfo O'Reilly Nutshell,the Dark Light Years,them Modern Library,together by christmas,whataboutrick.com: a poetic tribute to Richard A. Ricci,Â¿Eres tu mi mamÃ¡?Are You My Mother?,Ã?Â?ber das Fernsehen.
Dark Justice,0.0,1.0,1.0,0.252591,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,0.386428,1.0,1.0,1.0
Final Fantasy Anthology: Official Strategy Guide Brady Games,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Lamb to the Slaughter and Other Stories Penguin 60s S.,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Evaluation of the Recommender Engine

In [49]:
#this code helps find how a title is listed; increasing the head value shows more matches
#if there are any, which is common with foreign editions of titles
q = 'Harry Potter'
explicit_ratings[explicit_ratings['Book-Title'].str.contains(q)]['Book-Title'].head(1)

83    Harry Potter and the Chamber of Secrets
Name: Book-Title, dtype: object

In [50]:
#Looking up recommendations for those who liked The Lovely Bones:
recommender_df['The Lovely Bones: A Novel'].sort_values()[1:11]

Book-Title
Empire Falls                            0.409501
The Nanny Diaries: A Novel              0.430060
Life of Pi                              0.437717
Running with Scissors: A Memoir         0.454954
The Liar's Club: A Memoir               0.458086
Tell No One                             0.464186
Flashback                               0.466779
Name of the Rose-Nla                    0.481264
No Second  Chance                       0.492641
Club Dead Southern Vampire Mysteries    0.493534
Name: The Lovely Bones: A Novel, dtype: float64

In [51]:
#Looking up recommendations for those who liked To Kill a Mockingbird:
recommender_df['To Kill a Mockingbird'].sort_values()[1:11]

Book-Title
The Catcher in the Rye                               0.491385
Animal Farm                                          0.518920
Tears of the Giraffe No.1 Ladies Detective Agency    0.523610
Break in                                             0.529655
What's Bred in the Bone                              0.531692
Hamlet                                               0.546357
Black Beauty                                         0.549343
The Bad Girl's Guide to the Party Life               0.550735
Island of the Sequined Love Nun                      0.551166
Still Life with Woodpecker                           0.561161
Name: To Kill a Mockingbird, dtype: float64

In [52]:
#Looking up recommendations for those who liked Harry Potter and the Chamber of Secrets:
recommender_df['Harry Potter and the Chamber of Secrets'].sort_values()[1:11]

Book-Title
Dear Annie: A No-nonsense Guide to Getting Dressed                   0.0
Life Isn't All Ha Ha Hee Hee                                         0.0
East of Ealing                                                       0.0
Quote...unquote                                                      0.0
Far from the Madding Crowd Penguin Classics                          0.0
Pierre Et Gilles Postcardbooks                                       0.0
Tom's Midnight Garden                                                0.0
The Secret Thoughts of Cats The Secret Thoughts Of:                  0.0
The Secret Thoughts of Dogs The Secret Thoughts Series               0.0
The Nation's Favourite Comic Poems: A Selection of Humorous Verse    0.0
Name: Harry Potter and the Chamber of Secrets, dtype: float64
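
A likely explanation for the all-zero distances above: if two titles were each rated by only the same single user, their rating vectors point in the same direction and the cosine distance is 0, whatever the ratings were. The sketch below reproduces this on made-up data and shows one possible mitigation, a hypothetical `recommend` helper that drops the queried title by label and ignores titles with fewer than `min_raters` raters. It is an illustration under those assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

# Toy title-by-user ratings (made-up values); NaN means "not rated".
toy_pivot = pd.DataFrame(
    {
        101: [9.0, 8.0, np.nan, 7.0],
        102: [np.nan, np.nan, np.nan, 6.0],
        103: [np.nan, np.nan, 5.0, 8.0],
    },
    index=pd.Index(["Book A", "Book B", "Book C", "Book D"], name="Book-Title"),
)

toy_dist = pd.DataFrame(
    pairwise_distances(toy_pivot.fillna(0), metric="cosine"),
    index=toy_pivot.index,
    columns=toy_pivot.index,
)

def recommend(title, dist_df, pivot_df, min_raters=2, n=10):
    """Hypothetical helper: nearest titles, skipping sparsely rated ones."""
    rater_counts = pivot_df.notna().sum(axis=1)
    eligible = rater_counts[rater_counts >= min_raters].index
    scores = dist_df.loc[eligible, title].drop(labels=title, errors="ignore")
    return scores.sort_values().head(n)

# Book A and Book B were each rated only by user 101, so their distance is ~0.
print(round(toy_dist.loc["Book A", "Book B"], 6))
print(recommend("Book A", toy_dist, toy_pivot))
```

Dropping the title by label is also safer than slicing with `[1:11]`, because ties at 0.0 mean the queried title is not guaranteed to sort first.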

In [53]:
#Looking up recommendations for those who liked Interview with the Vampire:
recommender_df['Interview with the Vampire'].sort_values()[1:11]

Book-Title
The Tale of the Body Thief Vampire Chronicles Paperback       0.407516
The Scarlet Letter                                            0.448679
Selected Poems                                                0.454054
Cry to Heaven                                                 0.454633
Seinlanguage                                                  0.462390
Angels in America: Millennium Approaches Angels in America    0.476248
Black Beauty                                                  0.501449
Poemcrazy: Freeing Your Life With Words                       0.504500
Candide Candide                                               0.506849
Walden and Other Writings                                     0.507646
Name: Interview with the Vampire, dtype: float64

### Item Pivot Table

In [14]:
item_pivot = sample_ratings.pivot_table(index='User-ID', columns='Book-Title', values='Book-Rating')

In [15]:
item_pivot.shape

(348, 48237)

### Sparse Matrix

The pivot table is large, and computing pairwise distances on it can be slow and memory-hungry, so I am converting it from a pandas DataFrame to a SciPy sparse matrix.

In [16]:
#getting the size of the pivot file
sys.getsizeof(item_pivot)

134294616

In [17]:
#preprocessing step of converting nans to zeros.
sparse_pivot = sparse.csr_matrix(item_pivot.fillna(0))

In [18]:
#getting the updated size (note: sys.getsizeof only measures the wrapper object, not the arrays that hold the data)
sys.getsizeof(sparse_pivot)

56

## Recommender

Note: I tried sklearn's `pairwise_distances` and `cosine_similarity` and got better results with `pairwise_distances`, though they are still not great.

In [19]:
#setting up the recommender 
item_recommender = pairwise_distances(sparse_pivot, metric="cosine")

In [20]:
#verifying the shape of the engine to make sure the numbers are the same
item_recommender.shape

(348, 348)

In [21]:
#creating a dataframe to bring the title names back into view
item_recommender_df = pd.DataFrame(item_recommender, columns=item_pivot.index, index=item_pivot.index)
item_recommender_df.head(3)

User-ID,2033,2276,4017,5582,6242,6251,6543,6575,7346,8067,...,244685,245410,245827,246311,247429,247447,248718,249894,250709,277427
2033,0.0,0.964824,1.0,1.0,1.0,0.952951,1.0,1.0,0.987337,1.0,...,1.0,0.978062,0.993965,1.0,1.0,0.990691,0.987177,0.955503,1.0,1.0
2276,0.964824,0.0,1.0,1.0,1.0,0.996316,1.0,0.99143,0.979152,1.0,...,0.973343,0.993569,0.989884,1.0,1.0,0.99343,0.998632,0.991481,1.0,1.0
4017,1.0,1.0,0.0,0.977109,0.903242,0.987751,0.976813,0.92977,0.990496,0.995667,...,1.0,1.0,0.991148,0.954538,1.0,0.992415,0.992658,0.994793,0.987908,0.991718


## The Error

## Evaluation of the Recommender Engine

In [25]:
#checking the column labels (the User-IDs) of the dataframe
item_recommender_df.columns

Int64Index([  2033,   2276,   4017,   5582,   6242,   6251,   6543,   6575,
              7346,   8067,
            ...
            244685, 245410, 245827, 246311, 247429, 247447, 248718, 249894,
            250709, 277427],
           dtype='int64', name='User-ID', length=348)

In [27]:
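#Attempting to look up the users most similar to user 2033. This raises a KeyError:
#' 2033 ' is a padded string rather than an integer User-ID, and recommender_df is keyed by book titles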
recommender_df[' 2033 '].sort_values()[1:11]

KeyError: ' 2033 '
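
The `KeyError` above appears to have two causes: the key is a string with surrounding spaces while the User-ID labels are integers, and the lookup goes to `recommender_df` (keyed by title) rather than `item_recommender_df` (keyed by User-ID). The sketch below reproduces the failure and the fix on a made-up 3x3 distance frame; on the real data the equivalent lookup would presumably be `item_recommender_df[2033]`.

```python
import pandas as pd

# Toy user-by-user distance frame (made-up values) with integer User-ID labels.
user_ids = pd.Index([2033, 2276, 4017], name="User-ID")
toy_user_dist = pd.DataFrame(
    [[0.0, 0.96, 1.0],
     [0.96, 0.0, 0.99],
     [1.0, 0.99, 0.0]],
    index=user_ids,
    columns=user_ids,
)

# A padded string is not one of the integer labels, so this lookup fails.
try:
    toy_user_dist[" 2033 "]
except KeyError as err:
    print("KeyError:", err)

# Using the integer label works; drop the user itself before ranking.
nearest = toy_user_dist[2033].drop(labels=2033).sort_values()
print(nearest.index.tolist())
```

Note that this frame ranks similar users, not books, so turning it into book recommendations would still need a further step that looks up what those nearby users rated highly.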

**Please continue to [3-Content-Based-Data-Cleaning-and-EDA.ipynb](./3-Content-Based-Data-Cleaning-and-EDA.ipynb) for the next step in the project: Content-Based: Data Cleaning and EDA**