<img align="left" src="../assets/books.png">

# Book Recommender Engines

## Part 2: Collaborator-Based Preprocessing and Engine<br>
***

#### Contents:
- [Imports](#Imports)
- [Reading in the Data](#Reading-in-the-Data)
- [Preprocessing](#Preprocessing)
  * [Pivot Table](#Pivot-Table)
  * [Sparse Matrix](#Sparse-Matrix)
- [Recommender](#Recommender)
- [Evaluation of the Recommender Engine](#Evaluation-of-the-Recommender-Engine)

### Imports

In [1]:
#importing the packages
import pandas as pd
import sys
from scipy import sparse 
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

#importing warnings to turn off future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Preprocessing

For preprocessing, I dropped the columns I didn't need, then converted the data into a pivot table. From there, I created a sparse matrix so that the size of the data would not slow down processing.

In [2]:
#reading in the data
explicit_ratings = pd.read_csv('../datasets/book_crossing/explicit_ratings.csv')
explicit_ratings.head()

| | Unnamed: 0 | User-ID | ISBN | Book-Rating | Book-Title | Book-Author | Year-Of-Publication | Publisher |
|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 88006 | 1551669498 | 1 | Girls Night | Stef Ann Holm | 2002 | Mira |
| 1 | 25 | 76151 | 1551669498 | 2 | Girls Night | Stef Ann Holm | 2002 | Mira |
| 2 | 26 | 229011 | 1551669498 | 3 | Girls Night | Stef Ann Holm | 2002 | Mira |
| 3 | 27 | 28700 | 1551669498 | 6 | Girls Night | Stef Ann Holm | 2002 | Mira |
| 4 | 28 | 47866 | 1551669498 | 7 | Girls Night | Stef Ann Holm | 2002 | Mira |


For a recommender built on user ratings, I only need the user, rating, and title columns, so I'm going to drop the others.

In [19]:
#Dropping unneeded columns (columns= replaces the positional axis argument, which newer pandas no longer accepts)
explicit_ratings.drop(columns=['Unnamed: 0', 'ISBN', 'Book-Author', 'Year-Of-Publication', 'Publisher'], inplace=True)

This is still too much data for a pivot table, though, so let's keep only the users who have at least 200 ratings and see if that set is small enough to pivot.

In [20]:
#Creating a sample of users
user_counts = explicit_ratings['User-ID'].value_counts()
sample_ratings = explicit_ratings[explicit_ratings['User-ID'].isin(user_counts[user_counts >= 200].index)]

In [21]:
#checking the shape of the sample
sample_ratings.shape

(42747, 3)

### Pivot Table

In [22]:
#seeing if my sample size is small enough to create a pivot table
pivot = sample_ratings.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')

In [23]:
#it worked, and I want to check out the updated size
pivot.shape

(34466, 80)
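
As a rough illustration of what the pivot step produces, here is a minimal sketch on made-up ratings. The toy data and the names `toy` and `toy_pivot` are invented for this example and are not part of the dataset. Each row is a title, each column is a user, and any title a user has not rated shows up as NaN.

```python
import pandas as pd

# Toy ratings, invented purely for illustration (not from the dataset)
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3],
    "Book-Title": ["Dune", "Emma", "Dune", "Ulysses", "Emma"],
    "Book-Rating": [9, 5, 8, 7, 6],
})

# Same call shape as the notebook: titles as rows, users as columns
toy_pivot = toy.pivot_table(index="Book-Title", columns="User-ID", values="Book-Rating")
print(toy_pivot)
```

Most cells in the real pivot are NaN for the same reason: each user has rated only a small fraction of the titles.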

### Sparse Matrix

The pivot table is large, and a dense table of this size can slow down the pairwise distance calculation, so I am converting it from a pandas DataFrame to a SciPy sparse matrix (CSR format).

In [24]:
#getting the size of the pivot file
sys.getsizeof(pivot)

25245407

In [25]:
#preprocessing step of converting nans to zeros.
pivot_sparse = sparse.csr_matrix(pivot.fillna(0))

In [26]:
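#getting the updated size (sys.getsizeof only measures the wrapper object here,
#not the arrays that hold the data, so this understates the real memory use)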
sys.getsizeof(pivot_sparse)

56
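
For a fairer size comparison, one option is to add up the three arrays that back a CSR matrix. The sketch below runs on a randomly generated stand-in for the pivot table, so the numbers it prints are not the notebook's. `csr_nbytes`, `demo_pivot`, and `demo_sparse` are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Small stand-in for `pivot`: mostly NaN, like the real title-by-user table
rng = np.random.default_rng(0)
values = np.where(rng.random((1000, 80)) < 0.02, 5.0, np.nan)
demo_pivot = pd.DataFrame(values)

demo_sparse = sparse.csr_matrix(demo_pivot.fillna(0))

def csr_nbytes(m):
    """Bytes held by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

dense_bytes = demo_pivot.to_numpy().nbytes
sparse_bytes = csr_nbytes(demo_sparse)
print(dense_bytes, sparse_bytes)
```

Calling `csr_nbytes(pivot_sparse)` on the real matrix should give a number that can be compared with the dense size directly.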

### Recommender

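To calculate cosine similarity, I tried both pairwise_distances and cosine_similarity from scikit-learn's sklearn.metrics.pairwise module and got very similar results. However, cosine_similarity ran faster on my computer, so I stuck with that. Because the pivot table has titles as rows, the result is a title-by-title similarity matrix.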
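
The two functions are closely related: with `metric="cosine"`, `pairwise_distances` returns cosine distance, which is one minus cosine similarity. A minimal check on an invented toy matrix (not the real ratings) looks like this:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Toy title-by-user matrix (rows are titles); values invented for illustration
ratings = sparse.csr_matrix(np.array([
    [5.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]))

sim = cosine_similarity(ratings)                      # similarity in [0, 1] for non-negative ratings
dist = pairwise_distances(ratings, metric="cosine")   # distance = 1 - similarity

print(np.allclose(sim, 1.0 - dist))
```

If the distance version is used instead, the most similar titles are the ones with the smallest values, so the sort order has to be ascending.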

In [27]:
#setting up the recommender 
recommender = cosine_similarity(pivot_sparse)

In [28]:
#verifying the shape of the engine: it should have one row and one column per title in the pivot table
recommender.shape

(34466, 34466)

In [29]:
#creating a dataframe to bring the title names back into view
recommender_df = pd.DataFrame(recommender, columns=pivot.index, index=pivot.index)
recommender_df.head(3)

| Book-Title | Dark Justice | Final Fantasy Anthology: Official Strategy Guide (Brady Games) | Good Wives: Image and Reality in the Lives of Women in Northern New England | Highland Desire (Zebra Splendor Historical Romances) | Murder of a Sleeping Beauty (Scumble River Mysteries (Paperback)) | Nonbook Materials: The Organization of Integrated Collections | This Place Has No Atmosphere (Laurel-Leaf Books) | !Arriba! Comunicacion y cultura | !Trato hecho!: Spanish for Real Life | $14 In The Bank (Cathy Collection) | ... | street bible | termcap & terminfo (O'Reilly Nutshell) | the Dark Light Years | them (Modern Library) | together by christmas | wet sand | whataboutrick.com: a poetic tribute to Richard A. Ricci | ¡Corre | ¿Eres tu mi mamá?/Are You My Mother? | Über das Fernsehen. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dark Justice | 1.0 | 0.0 | 0.0 | 0.0 | 0.747409 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.613572 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Final Fantasy Anthology: Official Strategy Guide (Brady Games) | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Good Wives: Image and Reality in the Lives of Women in Northern New England | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Evaluation of the Recommender Engine

For evaluating the engine, I ran searches on a variety of titles in different genres to see what would show up. What follows first is a search tool that can be used to find titles. You have to enter the title exactly as listed to find books that are similar to it, so the next cells are handy for that.

In [30]:
#widening the column display so we can see the full titles (None means no limit; -1 is deprecated)
pd.set_option('display.max_colwidth', None)

The following code helps find how a title is listed. Adjusting the head value will show more of the listed options if there are any, which is common with foreign-language versions of titles.

In [55]:
#Code to search for titles
q = 'Dark Justice'
explicit_ratings[explicit_ratings['Book-Title'].str.contains(q)]['Book-Title'].head()

54339     Dark Justice (Ben Kincaid)
209842    Dark Justice              
209843    Dark Justice              
284668     Dark Justice             
Name: Book-Title, dtype: object
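
The search above is case-sensitive, treats the query as a regular expression, and returns repeated titles. A possible variant, sketched under the assumption that the titles to search are the ones in `pivot.index`, is shown below. `find_titles` is a hypothetical helper and the sample titles are made up.

```python
import pandas as pd

def find_titles(titles, query, limit=10):
    """Return up to `limit` unique titles containing `query`, ignoring case."""
    titles = pd.Series(pd.Index(titles).unique())
    # regex=False treats characters such as "(" literally
    mask = titles.str.contains(query, case=False, regex=False, na=False)
    return titles[mask].head(limit).tolist()

# Stand-in for pivot.index; these strings are made up for the example
sample_titles = pd.Index([
    " Dark Justice",
    "Dark Justice (Ben Kincaid)",
    "Dune",
    "Dark Justice (Ben Kincaid)",
])
print(find_titles(sample_titles, "dark justice"))
```

Searching `pivot.index` rather than the full ratings table has the side benefit that every hit is guaranteed to be a valid column in `recommender_df`.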

In [31]:
#Looking up recommendations for those who liked The Lovely Bones:
recommender_df['The Lovely Bones: A Novel'].sort_values(ascending=False)[1:11]

Book-Title
Empire Falls                                     0.588190
Running with Scissors: A Memoir                  0.545046
Nights in Rodanthe                               0.533569
Name of the Rose-Nla                             0.518736
The Nanny Diaries: A Novel                       0.517222
Fried Green Tomatoes at the Whistle Stop Cafe    0.507825
No Second  Chance                                0.507359
Club Dead (Southern Vampire Mysteries)           0.506466
Eyes of Prey                                     0.503926
The Liar's Club: A Memoir                        0.492674
Name: The Lovely Bones: A Novel, dtype: float64
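
The `[1:11]` slice assumes the searched title sorts into first place. When several other titles tie at 1.0, that is not guaranteed, so the slice may drop a different title and leave the searched one in the list. A minimal alternative, using a hypothetical helper named `top_similar` and an invented similarity table, drops the title by label instead:

```python
import pandas as pd

def top_similar(similarity_df, title, n=10):
    """Return the n titles most similar to `title`, excluding `title` itself."""
    scores = similarity_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)

# Tiny symmetric similarity table with invented scores
names = ["A", "B", "C"]
toy_sim = pd.DataFrame(
    [[1.0, 1.0, 0.2],
     [1.0, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=names, columns=names,
)
print(top_similar(toy_sim, "A", n=2))
```

With the real data the call would look like `top_similar(recommender_df, 'The Lovely Bones: A Novel')`.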

In [32]:
#Looking up recommendations for those who liked To Kill a Mockingbird:
recommender_df['To Kill a Mockingbird'].sort_values(ascending=False)[1:11]

Book-Title
Animal Farm                                                             0.495847
One Day in the Life of Ivan Denisovich (Signet Classics (Paperback))    0.484952
Rumble Fish (Laurel-Leaf Contemporary Fiction)                          0.473494
Looking for Rachel Wallace                                              0.471411
Dinky Hocker Shoots                                                     0.467858
The Widening Gyre                                                       0.451176
The \LATE NIGHT WITH DAVID LETTERMAN\" BOOK OF TOP TEN LISTS"           0.451176
Flowers for Algernon (Bantam Classic)                                   0.451176
Vale of the Vole (Xanth)                                                0.451088
The Kalahari Typing School for Men (No. 1 Ladies' Detective Agency)     0.450596
Name: To Kill a Mockingbird, dtype: float64

In [33]:
#Looking up recommendations for those who liked Dune:
recommender_df['Dune'].sort_values(ascending=False)[1:11]

Book-Title
And That's My Final Offer! (His A Doonesbury book)    0.997785
A Pirate Looks at Fifty                               0.993884
Mars Crossing                                         0.986394
National LampoonPresents True Facts: the Book         0.980581
Einstein's Bridge                                     0.974391
House Atreides (Dune: House Trilogy                   0.767982
Still Pumped From Using The Mouse                     0.727825
It Came From The Far Side                             0.719425
Yukon Ho!                                             0.714563
Die letzte Verschwörung. Roman.                       0.707107
Name: Dune, dtype: float64

In [34]:
#Looking up recommendations for those who liked Interview with the Vampire:
recommender_df['Interview with the Vampire'].sort_values(ascending=False)[1:11]

Book-Title
The Tale of the Body Thief (Vampire Chronicles (Paperback))    0.695873
Seinlanguage                                                   0.631424
Candide (Candide)                                              0.579207
BEST AMERICAN POETRY 1993                                      0.552866
Risen                                                          0.521179
The Only Astrology Book You'll Ever Need                       0.520706
The Witching Hour (Lives of the Mayfair Witches)               0.517412
The Vampire Lestat (Vampire Chronicles                         0.516246
Awakening                                                      0.510640
Disney's: The Great Mouse Detective (Golden Look Look Book)    0.502266
Name: Interview with the Vampire, dtype: float64

In [35]:
#Looking up recommendations for those who liked Harry Potter and the Chamber of Secrets:
recommender_df['Harry Potter and the Chamber of Secrets'].sort_values(ascending=False)[1:11]

Book-Title
Death By Spaghetti                                           1.0
Vegetarian Cooking (Rd Home Handbooks)                       1.0
The Vengeance of If...                                       1.0
The Wimbledon Poisoner                                       1.0
The Pegasus book of ponds and streams; (The Pegasus books    1.0
Tales of Ancient Egypt (Puffin Classics)                     1.0
The Little White Horse                                       1.0
The Revolutionary If                                         1.0
Lucia Triumphant                                             1.0
Lucia in London (Isis)                                       1.0
Name: Harry Potter and the Chamber of Secrets, dtype: float64

Of the books I searched for, Interview with the Vampire gave the best results in terms of what I can eyeball as similar matches. It recommended books in the same series, which makes sense. Romance titles appeared to recommend other romance titles. Big popular books like Harry Potter had trouble getting good recommendations, though. That is likely because the user set is small (80 on the most recent run of the model), so the recommendations are based on a very select audience. With so few users, a title rated by only one of them gets a cosine similarity of exactly 1.0 with every other title that only that same user rated, which would explain the run of 1.0 scores for Harry Potter and the Chamber of Secrets.
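
The effect of a small user set on the scores can be reproduced with a few invented rating vectors. This is only a sketch of the mechanism, not a diagnosis of the actual data: when two titles were rated by the same single user and nobody else, their vectors point in the same direction and cosine similarity is 1.0 regardless of the ratings themselves.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are titles, columns are users; ratings invented for illustration.
# Only user 0 rated the first two titles, so their vectors point the same way.
toy = np.array([
    [9.0, 0.0, 0.0],   # popular title, but only one rater in this sample
    [4.0, 0.0, 0.0],   # unrelated title rated only by that same user
    [8.0, 7.0, 0.0],   # title with a second rater
])

sim = cosine_similarity(toy)
print(sim[0].round(3))
```

The second title scores a perfect match with the first even though the ratings differ (9 versus 4), while the third title, which has more evidence behind it, scores lower.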

I tried changing my user sample from a 100-book threshold to a 200-book threshold and found that the higher threshold gave me better scores. When I dropped down to users who rated 50 books, my scores got worse. This was unexpected, because we really do need a larger sample of users here to get more balanced recommendations. It is likely that, in this pool of users, keeping fewer users who each rate more books gives us a more like-minded subset.
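
One way to compare thresholds side by side is to wrap the filter in a small helper. This is only a sketch: `sample_by_activity` is a hypothetical name and the toy ratings are invented. It reports how many users and titles survive each cutoff, not the quality of the recommendations.

```python
import pandas as pd

def sample_by_activity(ratings, min_ratings):
    """Keep only rows from users with at least `min_ratings` ratings."""
    counts = ratings["User-ID"].value_counts()
    keep = counts[counts >= min_ratings].index
    return ratings[ratings["User-ID"].isin(keep)]

# Invented ratings: user 1 has three, user 2 has two, user 3 has one
toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "Book-Title": ["a", "b", "c", "a", "b", "d"],
    "Book-Rating": [5, 6, 7, 8, 9, 10],
})

for threshold in (1, 2, 3):
    sample = sample_by_activity(toy, threshold)
    print(threshold, sample["User-ID"].nunique(), sample["Book-Title"].nunique())
```

On the real data the thresholds would be 50, 100, and 200, passed along with `explicit_ratings`.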

Also, the dataset needs more title cleaning. There are many variations of the Harry Potter titles, including foreign-language editions and extra notations in the title such as (Paperback). If I had more time, I would look at removing non-English titles and stripping those extra notations to consolidate different versions of the same title.
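
A first pass at that consolidation might look like the sketch below. It is not something I ran on the dataset. `normalize_title` is a hypothetical helper, and it assumes the extra notations always come at the end of the title, so it cuts everything from the first opening parenthesis onward. That assumption also handles titles whose closing parenthesis is missing.

```python
import re

def normalize_title(title):
    """Drop a trailing parenthetical note, collapse whitespace, and lowercase."""
    # Assumption: notes such as "(Paperback)" only appear at the end of a title
    without_note = re.sub(r"\s*\(.*$", "", title)
    return re.sub(r"\s+", " ", without_note).strip().lower()

examples = [
    "The Tale of the Body Thief (Vampire Chronicles (Paperback))",
    "The Vampire Lestat (Vampire Chronicles",
    " Dark Justice",
    "No Second  Chance",
]
for title in examples:
    print(repr(normalize_title(title)))
```

The normalized string could then be used as the pivot index in place of Book-Title. A title that starts with a parenthesis would normalize to an empty string, so those cases would need separate handling.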

**Please continue to [3-Content-Based-Data-Cleaning-and-EDA.ipynb](./3-Content-Based-Data-Cleaning-and-EDA.ipynb) for the next step in the project: Content-Based: Data Cleaning and EDA**