<img src="./assets/bookstack.jpg" style="float: left; margin: 20px; height: 100px">

# Book Recommender Engines Capstone Project<br><br>Content-Based: Preprocessing and Engine<br>
***

### Contents:
- [Imports](#Imports)
- [Reading in the Data](#Reading-in-the-Data)
- [Preprocessing](#Preprocessing)
- [Recommender](#Recommender)
- [Evaluation of the Recommender Engine](#Evaluation-of-the-Recommender-Engine)

## Imports

In [143]:
#importing the packages
import pandas as pd
import sys
from scipy import sparse 
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

%config InlineBackend.figure_format = 'retina'

## Preprocessing

In [144]:
#reading in the data
goodreads_sample = pd.read_csv('./datasets/goodreads_sample.csv')
#dropping the unnamed columns
goodreads_sample.drop(columns='Unnamed: 0', inplace = True)
#checking out the file
goodreads_sample.head()

| | author_name | book_average_rating | book_title | genre_1 | genre_2 | num_ratings | num_reviews | pages | publish_date | score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | J.K. Rowling | 4.56 | Harry Potter and the Half-Blood Prince | Fantasy | Young Adult | 2036961 | 32557 | 652 | 2005 | 1217 |
| 1 | J.K. Rowling | 4.48 | Harry Potter and the Order of the Phoenix | Fantasy | Young Adult | 2087093 | 34321 | 870 | 2003 | 690 |
| 2 | J.K. Rowling | 4.55 | Harry Potter and the Prisoner of Azkaban | Fantasy | Young Adult | 2276977 | 44377 | 435 | 1999 | 368 |
| 3 | Douglas Adams | 4.38 | The Ultimate Hitchhiker's Guide to the Galaxy | Science Fiction | Fiction | 255070 | 4753 | 815 | 1996 | 2374 |
| 4 | Bill Bryson | 4.2 | A Short History of Nearly Everything | Nonfiction | Science | 240843 | 10362 | 544 | 2003 | 1079 |


I need to drop the columns I won't be using for the recommender. Given how many authors and genres there are, I expected that turning all of them into dummies would produce too many columns for a pivot table, so my initial plan was to start with genre_1 alone and drop author_name and genre_2. For now, the cell below keeps all three and drops only num_reviews and book_average_rating.
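
One way to size up that concern before encoding is to count the distinct values in each categorical column, since `pd.get_dummies` creates roughly one column per distinct value. The sketch below uses a small made-up frame (`toy` is illustrative only; the column names follow the dataset above) and is not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for goodreads_sample; these rows are illustrative only.
toy = pd.DataFrame({
    'author_name': ['J.K. Rowling', 'J.K. Rowling', 'Douglas Adams', 'Bill Bryson'],
    'genre_1': ['Fantasy', 'Fantasy', 'Science Fiction', 'Nonfiction'],
    'genre_2': ['Young Adult', 'Young Adult', 'Fiction', 'Science'],
})

# Distinct values per column: an upper bound on the dummy columns each one adds.
cardinality = toy[['author_name', 'genre_1', 'genre_2']].nunique()
print(cardinality)

# With drop_first=True, each column contributes (distinct values - 1) dummies.
expected_dummy_cols = int((cardinality - 1).sum())
print(expected_dummy_cols)
```

Running the same `nunique()` call on the real DataFrame would show how many dummy columns to expect from each field before committing to the encoding.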

In [145]:
#I considered dropping author name and genre_2 because I believe it will be too much data to make them into dummies
#But I will come back and reassess after I get a working engine
#goodreads_sample.drop(columns=['author_name', 'genre_2'], inplace=True)
#passing columns= by keyword; the positional axis argument is no longer accepted in pandas 2.x
goodreads_sample.drop(columns=['num_reviews', 'book_average_rating'], inplace=True)

In [146]:
#setting the titles to the index
goodreads_sample.set_index('book_title', inplace = True)

In [147]:
goodreads_sample.head()

| book_title | author_name | genre_1 | genre_2 | num_ratings | pages | publish_date | score |
|---|---|---|---|---|---|---|---|
| Harry Potter and the Half-Blood Prince | J.K. Rowling | Fantasy | Young Adult | 2036961 | 652 | 2005 | 1217 |
| Harry Potter and the Order of the Phoenix | J.K. Rowling | Fantasy | Young Adult | 2087093 | 870 | 2003 | 690 |
| Harry Potter and the Prisoner of Azkaban | J.K. Rowling | Fantasy | Young Adult | 2276977 | 435 | 1999 | 368 |
| The Ultimate Hitchhiker's Guide to the Galaxy | Douglas Adams | Science Fiction | Fiction | 255070 | 815 | 1996 | 2374 |
| A Short History of Nearly Everything | Bill Bryson | Nonfiction | Science | 240843 | 544 | 2003 | 1079 |


In [148]:
#turning author_name, genre_1, and genre_2 into dummies
goodreads_dummies = pd.get_dummies(goodreads_sample, columns=['author_name', 'genre_1', 'genre_2'], drop_first=True)

In [149]:
goodreads_dummies.head()

| book_title | num_ratings | pages | publish_date | score | author_name_A. Kirk | author_name_A. Digger Stolz | author_name_A. Lee Martinez | author_name_A. Lynden Rolland | author_name_A. Manette Ansay | author_name_A. Meredith Walters | ... | genre_2_Thriller | genre_2_Travel | genre_2_Unfinished | genre_2_War | genre_2_Warfare | genre_2_Westerns | genre_2_Womens Fiction | genre_2_World War II | genre_2_Writing | genre_2_Young Adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Harry Potter and the Half-Blood Prince | 2036961 | 652 | 2005 | 1217 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Harry Potter and the Order of the Phoenix | 2087093 | 870 | 2003 | 690 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Harry Potter and the Prisoner of Azkaban | 2276977 | 435 | 1999 | 368 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| The Ultimate Hitchhiker's Guide to the Galaxy | 255070 | 815 | 1996 | 2374 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A Short History of Nearly Everything | 240843 | 544 | 2003 | 1079 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Recommender

The dummy encoding produces too many columns to work with in a pivot table, so I skipped the pivot step for this engine and computed cosine similarity directly on the dummy-encoded DataFrame.
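
Because most dummy columns are zero for any given book, a sparse matrix is one way to keep this direct approach manageable. The sketch below is an assumption about how that could look, not the code used in this notebook; `toy` and its values are made up, and `dense_output=False` is the `cosine_similarity` option that keeps the result sparse.

```python
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Made-up rows standing in for the real data.
toy = pd.DataFrame(
    {'pages': [652, 870, 544],
     'genre_1': ['Fantasy', 'Fantasy', 'Nonfiction']},
    index=['Book A', 'Book B', 'Book C'],
)

# One 0/1 column per genre; dtype=float keeps the frame purely numeric.
dummies = pd.get_dummies(toy, columns=['genre_1'], dtype=float)

# CSR format stores only the non-zero entries of the feature matrix.
matrix = sparse.csr_matrix(dummies.to_numpy(dtype=float))

# cosine_similarity accepts sparse input; dense_output=False returns a sparse result.
sim = cosine_similarity(matrix, dense_output=False)
print(sim.shape)
```

The similarity matrix itself is still n by n, and it stays mostly non-zero whenever every row has non-zero numeric columns, so the saving here is mainly on the feature matrix.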

In [150]:
#setting up the recommender 
recommender = cosine_similarity(goodreads_dummies)

In [151]:
#verifying the shape of the engine to make sure the numbers are the same
recommender.shape

(15144, 15144)
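
A square matrix of this size is large in memory. As a rough worked calculation, assuming the similarity values are stored as 8-byte floats (`float64`, which is an assumption about the array's dtype), the footprint comes to roughly 1.7 GiB:

```python
n_books = 15144        # rows and columns reported by recommender.shape
bytes_per_value = 8    # assumption: float64 entries

total_bytes = n_books * n_books * bytes_per_value
print(f"{total_bytes / 1024**3:.2f} GiB")
```

On the real array, `recommender.nbytes` reports the actual figure.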

In [152]:
#creating a dataframe to bring the title names back into view
recommender_df = pd.DataFrame(recommender, columns=goodreads_dummies.index, index=goodreads_dummies.index)
recommender_df.head(3)

| book_title | Harry Potter and the Half-Blood Prince | Harry Potter and the Order of the Phoenix | Harry Potter and the Prisoner of Azkaban | The Ultimate Hitchhiker's Guide to the Galaxy | A Short History of Nearly Everything | Notes from a Small Island | The Mother Tongue: English and How It Got That Way | Hatchet | Changeling | The Known World | ... | Save Me from Myself | Somewhere on Maui | Dead by Morning | Jade City | Grasping at Eternity | If I Let You Go | Becoming Human | Shanghai Nobody | Slay | The Baghdad Clock |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Harry Potter and the Half-Blood Prince | 1.0 | 1.0 | 1.0 | 0.999935 | 0.999964 | 0.999307 | 0.988661 | 0.999897 | 0.247019 | 0.997949 | ... | 0.14125 | 0.03927 | 0.0895 | 0.807403 | 0.629421 | 0.267568 | 0.596655 | 0.031374 | 0.008376 | 0.302195 |
| Harry Potter and the Order of the Phoenix | 1.0 | 1.0 | 1.0 | 0.999932 | 0.999963 | 0.999299 | 0.988624 | 0.999893 | 0.24679 | 0.997946 | ... | 0.141014 | 0.039015 | 0.089382 | 0.80733 | 0.629278 | 0.267378 | 0.596502 | 0.031297 | 0.008236 | 0.301964 |
| Harry Potter and the Prisoner of Azkaban | 1.0 | 1.0 | 1.0 | 0.99993 | 0.999961 | 0.999291 | 0.988594 | 0.99989 | 0.246595 | 0.997935 | ... | 0.140811 | 0.038821 | 0.089196 | 0.807214 | 0.629122 | 0.267202 | 0.596338 | 0.031158 | 0.008049 | 0.301773 |
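
The near-1.0 scores between books in unrelated genres are consistent with the unscaled numeric columns (especially `num_ratings`, which runs into the millions) outweighing the 0/1 dummy columns in the cosine calculation. One possible remedy, offered as an assumption and not something this notebook does, is to rescale the numeric columns before computing similarity. A minimal sketch with scikit-learn's `MinMaxScaler` on made-up rows:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Made-up rows: similar popularity, different genres.
toy = pd.DataFrame(
    {'num_ratings': [2036961, 1900000, 240843],
     'pages': [652, 400, 544],
     'genre_Fantasy': [1, 0, 0],
     'genre_Thriller': [0, 1, 0],
     'genre_Nonfiction': [0, 0, 1]},
    index=['Fantasy book', 'Thriller book', 'Nonfiction book'],
)

# Similarity on raw values: num_ratings dominates every vector.
raw_df = pd.DataFrame(cosine_similarity(toy), index=toy.index, columns=toy.index)

# Rescale only the numeric columns to the 0-1 range used by the dummies.
scaled = toy.astype(float)
numeric_cols = ['num_ratings', 'pages']
scaled[numeric_cols] = MinMaxScaler().fit_transform(scaled[numeric_cols])
scaled_df = pd.DataFrame(cosine_similarity(scaled), index=toy.index, columns=toy.index)

print(raw_df.round(4))
print(scaled_df.round(4))
```

In this toy case the fantasy and thriller rows score close to 1.0 on raw values and noticeably lower once the numeric columns share the dummies' scale, so genre differences start to register.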


## Evaluation of the Recommender Engine

In [153]:
#reading in the data
find_title = pd.read_csv('./datasets/goodreads_sample.csv')
#dropping the unnamed columns
find_title.drop(columns='Unnamed: 0', inplace = True)

In [154]:
#this helps find how a title is listed; increasing the value passed to head() shows more matches
#if there are any, which can happen especially with foreign editions of a title
q = 'Lovely Bones'
find_title[find_title['book_title'].str.contains(q)]['book_title'].head()

7605    The Lovely Bones
Name: book_title, dtype: object
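
The lookup above is case-sensitive and treats the query as a regular expression. A small helper along these lines (the name `find_titles` and the sample titles are hypothetical) is one way to make the search more forgiving:

```python
import pandas as pd

def find_titles(titles, query, limit=5):
    """Return up to `limit` titles that contain `query`, ignoring case."""
    # regex=False treats the query as literal text, so characters like '(' are safe.
    mask = titles.str.contains(query, case=False, regex=False, na=False)
    return titles[mask].head(limit)

# Made-up sample standing in for find_title['book_title'].
titles = pd.Series(
    ['The Lovely Bones', 'Gone Girl', 'The Girl on the Train'],
    name='book_title',
)
matches = find_titles(titles, 'lovely bones')
print(matches.tolist())
```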

In [155]:
#Looking up recommendations for those who liked Harry Potter and the Half-Blood Prince:
recommender_df['Harry Potter and the Half-Blood Prince'].sort_values(ascending=False)[1:11]

book_title
Mockingjay                                   1.0
The Help                                     1.0
The Lovely Bones                             1.0
The Da Vinci Code                            1.0
Harry Potter and the Order of the Phoenix    1.0
Angels & Demons                              1.0
The Girl on the Train                        1.0
Animal Farm                                  1.0
The Girl with the Dragon Tattoo              1.0
Gone Girl                                    1.0
Name: Harry Potter and the Half-Blood Prince, dtype: float64

In [156]:
#Looking up recommendations for those who liked The Lovely Bones:
recommender_df['The Lovely Bones'].sort_values(ascending=False)[1:11]

book_title
Mockingjay                                1.0
The Help                                  1.0
The Da Vinci Code                         1.0
Harry Potter and the Half-Blood Prince    1.0
The Girl on the Train                     1.0
The Giver                                 1.0
Animal Farm                               1.0
The Book Thief                            1.0
Angels & Demons                           1.0
Memoirs of a Geisha                       1.0
Name: The Lovely Bones, dtype: float64
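
Slicing with `[1:11]` assumes the queried title sorts into the first position. With many scores tied at 1.0 that is not guaranteed, so a different book may be skipped instead. A hedged alternative is to drop the queried title explicitly before sorting; the helper name `top_recommendations` and the toy matrix below are hypothetical, and the sketch assumes titles in the index are unique.

```python
import pandas as pd

def top_recommendations(sim_df, title, n=10):
    """Return the n highest-scoring titles for `title`, excluding the title itself."""
    # Remove the queried book by label so ties cannot push it out of position 0.
    scores = sim_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)

# Made-up similarity matrix with a tie at 1.0 between A and B.
toy_sim = pd.DataFrame(
    [[1.0, 1.0, 0.2],
     [1.0, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=['A', 'B', 'C'],
    columns=['A', 'B', 'C'],
)
result = top_recommendations(toy_sim, 'A', n=2)
print(result)
```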