# Books Recommender Engine: Preprocessing and Content Recommender

### Contents:
- [Imports](#Imports)
- [Reading in the Data](#Reading-in-the-Data)
- [Preprocessing](#Preprocessing)
- [Recommender](#Recommender)
- [The Error](#The-Error)
- [Evaluation of the Recommender Engine](#Evaluation-of-the-Recommender-Engine)

## Imports

In [108]:
#importing the packages
import pandas as pd
import sys
from scipy import sparse 
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

%config InlineBackend.figure_format = 'retina'

## Reading in the Data

In [116]:
#reading in the data
goodreads_sample = pd.read_csv('./datasets/goodreads_sample.csv')
#dropping the leftover unnamed index column
goodreads_sample.drop(columns='Unnamed: 0', inplace = True)
#checking out the file
goodreads_sample.head()

| | author_name | book_average_rating | book_title | genre_1 | genre_2 | num_ratings | num_reviews | pages | publish_date | score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Carolyn Keene | 4.24 | Nancy Drew: #1-6 [Box Set] | Mystery | Young Adult | 2883 | 108 | 0 | 1930 | 2937 |
| 1 | C.S. Lewis | 4.12 | Space Trilogy: Out of the Silent Planet... | Science Fiction | Fiction | 8258 | 358 | 0 | 1938 | 1706 |
| 2 | Paul Zindel | 3.59 | The Pigman | Fiction | Young Adult | 24602 | 1328 | 0 | 1968 | 3447 |
| 3 | Paul Scott | 4.48 | Raj Quartet-4v-Boxed Jewell in Crown | Fiction | Cultural | 861 | 67 | 0 | 1976 | 3981 |
| 4 | Catherine Christian | 3.96 | The Pendragon | Mythology | Fantasy | 373 | 24 | 0 | 1979 | 14141 |

## Preprocessing

Next I need to drop the columns I won't be using for the recommender. Given how many distinct authors and genres there are, I believe encoding all of them would create too many dummy fields for a pivot table. So I'm going to start with genre_1 alone, test that first, and drop the others for now.
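One way to check that intuition before encoding anything is to count the distinct values in each candidate column. This is a minimal sketch on a small made-up frame (`books` and its values are stand-ins for `goodreads_sample`, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for goodreads_sample; the values are illustrative only.
books = pd.DataFrame({
    "author_name": ["Carolyn Keene", "C.S. Lewis", "Paul Zindel", "Paul Scott"],
    "genre_1": ["Mystery", "Science Fiction", "Fiction", "Fiction"],
    "genre_2": ["Young Adult", "Fiction", "Young Adult", "Cultural"],
})

# Each distinct value becomes one dummy column, so nunique() approximates
# how many columns get_dummies would add for each field.
cardinality = books[["author_name", "genre_1", "genre_2"]].nunique()
print(cardinality)
```

On the real data, a column whose count runs into the thousands (authors, most likely) would be the first candidate to leave out.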

In [117]:
#dropping the columns I won't use in this first pass
goodreads_sample.drop(columns=['author_name', 'genre_2'], inplace=True)

In [118]:
#setting the titles to the index
goodreads_sample.set_index('book_title', inplace = True)

In [119]:
goodreads_sample.head()

| book_title | book_average_rating | genre_1 | num_ratings | num_reviews | pages | publish_date | score |
|---|---|---|---|---|---|---|---|
| Nancy Drew: #1-6 [Box Set] | 4.24 | Mystery | 2883 | 108 | 0 | 1930 | 2937 |
| Space Trilogy: Out of the Silent Planet / Perelandra / That Hideous Strength | 4.12 | Science Fiction | 8258 | 358 | 0 | 1938 | 1706 |
| The Pigman | 3.59 | Fiction | 24602 | 1328 | 0 | 1968 | 3447 |
| Raj Quartet-4v-Boxed Jewell in Crown | 4.48 | Fiction | 861 | 67 | 0 | 1976 | 3981 |
| The Pendragon | 3.96 | Mythology | 373 | 24 | 0 | 1979 | 14141 |


In [113]:
#turning genre_1 into dummies
goodreads_dummies = pd.get_dummies(goodreads_sample, columns=['genre_1'], drop_first=True)

In [114]:
goodreads_dummies.head()

| book_title | book_average_rating | num_ratings | num_reviews | pages | publish_date | score | genre_1_Adult Fiction | genre_1_Adventure | genre_1_Amish | genre_1_Animals | ... | genre_1_Travel | genre_1_Unfinished | genre_1_United States | genre_1_War | genre_1_Warfare | genre_1_Westerns | genre_1_Womens Fiction | genre_1_World War II | genre_1_Writing | genre_1_Young Adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nancy Drew: #1-6 [Box Set] | 4.24 | 2883 | 108 | 0 | 1930 | 2937 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Space Trilogy: Out of the Silent Planet / Perelandra / That Hideous Strength | 4.12 | 8258 | 358 | 0 | 1938 | 1706 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| The Pigman | 3.59 | 24602 | 1328 | 0 | 1968 | 3447 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Raj Quartet-4v-Boxed Jewell in Crown | 4.48 | 861 | 67 | 0 | 1976 | 3981 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| The Pendragon | 3.96 | 373 | 24 | 0 | 1979 | 14141 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Recommender

Because of the volume of columns, I couldn't build a pivot table for this one, so I went directly to cosine similarity on the dummy-encoded feature table.

In [103]:
#setting up the recommender: pairwise cosine similarity between every pair of books
recommender = cosine_similarity(goodreads_dummies)

In [104]:
#verifying the shape of the engine: it should be square, with one row and one column per book
recommender.shape

(21487, 21487)
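For reference, the cosine similarity between two rows is their dot product divided by the product of their lengths. The sketch below reproduces that with NumPy on made-up rows (the numbers and the `cosine_sim_matrix` helper are illustrative, not from the notebook). It also shows something that may be worth checking on the real data: when columns sit on very different scales, a large-magnitude column such as `num_ratings` can dominate the score.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms  # assumes no row is all zeros
    return unit @ unit.T

# Hypothetical rows laid out as [book_average_rating, num_ratings, genre dummy].
rows = np.array([
    [4.2, 2883.0, 1.0],
    [3.6, 24602.0, 0.0],
    [4.5, 861.0, 0.0],
])

sim = cosine_sim_matrix(rows)
print(sim.shape)
print(np.round(sim, 4))
```

Here every pair scores close to 1 even though the ratings and genre flags differ, because the `num_ratings` column is orders of magnitude larger than the others. If the real matrix behaves the same way, scaling the numeric columns before computing similarity is one option to consider.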

In [105]:
#creating a dataframe to bring the title names back into view
recommender_df = pd.DataFrame(recommender, columns=goodreads_dummies.index, index=goodreads_dummies.index)
recommender_df.head(3)

| book_title | Nancy Drew: #1-6 [Box Set] | Space Trilogy: Out of the Silent Planet / Perelandra / That Hideous Strength | The Pigman | Raj Quartet-4v-Boxed Jewell in Crown | The Pendragon | House on Mango Street | Baby-Sitters Club Boxed Set #1 | Baby-Sitters Club Boxed Set #1 | The Clan of the Cave Bear, the Valley of Horses, the Mammoth Hunters, the Plains of Passage | The Clan of the Cave Bear, the Valley of Horses, the Mammoth Hunters, the Plains of Passage | ... | The Second World War | The Second World War | The Second World War | Pandora | Pandora | The Complete Aubrey/Maturin Novels (5 Volumes) | Worm | Worm | Frog and Toad Together | Songs from the Phenomenal Nothing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nancy Drew: #1-6 [Box Set] | 1.0 | 0.827883 | 0.749036 | 0.874143 | 0.71489 | 0.64524 | 0.822424 | 0.894445 | 0.811121 | 0.823193 | ... | 0.412825 | 0.64663 | 0.629695 | 0.278614 | 0.238524 | 0.341433 | 0.566352 | 0.469355 | 0.710558 | 0.784855 |
| Space Trilogy: Out of the Silent Planet / Perelandra / That Hideous Strength | 0.827883 | 1.0 | 0.987174 | 0.452859 | 0.251017 | 0.958874 | 0.989643 | 0.988292 | 0.9842 | 0.986008 | ... | 0.385163 | 0.413053 | 0.414455 | 0.184436 | 0.172674 | 0.267043 | 0.599353 | 0.578487 | 0.978302 | 0.325851 |
| The Pigman | 0.749036 | 0.987174 | 1.0 | 0.344132 | 0.173505 | 0.988866 | 0.961473 | 0.952079 | 0.954264 | 0.95486 | ... | 0.336067 | 0.344565 | 0.347201 | 0.127419 | 0.119164 | 0.222291 | 0.571006 | 0.559481 | 0.99766 | 0.187054 |


## The Error

Both lookup cells in the evaluation section below fail with a `KeyError`.

## Evaluation of the Recommender Engine

In [106]:
#this looks up how a title is listed in the data; raising the head() value lists more matches,
#if there are any, which can happen especially with foreign-language versions of a title
q = 'The Pendragon'
goodreads_sample[goodreads_sample['book_title'].str.contains(q)]['book_title'].head(1)

KeyError: 'book_title'
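This `KeyError` is consistent with `book_title` having been moved into the index by the earlier `set_index` call, which means it is no longer a column. A minimal sketch of searching the index instead, on a small made-up frame (`books` is a stand-in for `goodreads_sample`). Passing `regex=False` is my own addition so that characters such as `#` or `[` in titles are matched literally:

```python
import pandas as pd

# Hypothetical stand-in for goodreads_sample after set_index('book_title').
books = pd.DataFrame(
    {"genre_1": ["Mythology", "Fiction", "Mystery"]},
    index=pd.Index(
        ["The Pendragon", "The Pigman", "Nancy Drew: #1-6 [Box Set]"],
        name="book_title",
    ),
)

q = "The Pendragon"
# The titles live in the index now, so search there rather than in a column.
matches = books.index[books.index.str.contains(q, regex=False)]
print(matches[:1].tolist())
```

Widening the slice (`matches[:5]`, say) plays the same role as raising the `head()` value in the original cell.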

In [90]:
#Looking up recommendations for those who liked The Pendragon:
recommender_df['The Pendragon'].sort_values()[1:11]

KeyError: 'The Pendragon'
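Two things may be worth checking here. First, the error suggests that `recommender_df` had no 'The Pendragon' column when this cell ran. The cell numbers show it was executed (In [90]) before the latest run of the cells that build `recommender_df` (In [103] to In [105]), so re-running the notebook from top to bottom might be enough to clear it. Second, `sort_values()` sorts in ascending order, so `[1:11]` would return some of the least similar titles rather than the most similar. The sketch below does a descending lookup on a toy similarity frame. `top_similar` is a hypothetical helper, not part of the notebook, and the scores are made up:

```python
import pandas as pd

titles = ["The Pendragon", "The Pigman", "House on Mango Street", "Worm"]
# Toy symmetric similarity matrix; values are illustrative only.
sim = pd.DataFrame(
    [
        [1.00, 0.17, 0.20, 0.90],
        [0.17, 1.00, 0.99, 0.57],
        [0.20, 0.99, 1.00, 0.60],
        [0.90, 0.57, 0.60, 1.00],
    ],
    index=titles,
    columns=titles,
)

def top_similar(sim_df, title, n=10):
    """Return the n titles most similar to `title`, highest score first."""
    scores = sim_df[title]
    # Duplicate titles make this a DataFrame; keep the first matching column.
    if isinstance(scores, pd.DataFrame):
        scores = scores.iloc[:, 0]
    # Drop the book itself, then sort from most to least similar.
    return scores.drop(labels=title).sort_values(ascending=False).head(n)

recs = top_similar(sim, "The Pendragon", n=2)
print(recs)
```

Dropping the title by label, rather than skipping position 0, avoids relying on the book itself always sorting first, which matters when the data contains duplicate titles.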