<img align="left" src="./assets/books.png">

# Book Recommender Engines

## Part 4: Content-Based: Preprocessing and Engine<br>
***

#### Contents:
- [Imports](#Imports)
- [Reading in the Data](#Reading-in-the-Data)
- [Preprocessing](#Preprocessing)
- [Recommender](#Recommender)
- [Evaluation of the Recommender Engine](#Evaluation-of-the-Recommender-Engine)
- [Conclusion and Next Steps](#Conclusion-and-Next-Steps)

### Imports

In [1]:
#importing the packages
import pandas as pd
import sys
from scipy import sparse 
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

### Preprocessing

For preprocessing, I needed to drop the columns I wouldn't be using and create dummy variables for the category columns. Unlike the data for the collaborator engine, this data can't be put into a pivot table because of the volume of columns the dummies create.

I tested the preprocessed data in the model a variety of ways to see which features would give me better scores. For example, when a book was part of a series, I checked whether other books from that series were recommended. I also watched for age-appropriate titles: Interview with the Vampire and If You Give a Mouse a Cookie should not be recommended together.

One interesting find from testing was that removing pages meant picture books would get recommended with Harry Potter. My best scores came from including all the columns and creating dummies for all three of the category columns, so I have removed my drop-columns code from the notebook.

In [23]:
#reading in the data
goodreads_sample = pd.read_csv('./datasets/goodreads_sample.csv')
#dropping the unnamed columns
goodreads_sample.drop(columns='Unnamed: 0', inplace = True)
#checking out the file
goodreads_sample.head()

,author_name,book_average_rating,book_title,genre_1,genre_2,num_ratings,num_reviews,pages,publish_date,score
0,J.K. Rowling,4.56,Harry Potter and the Half-Blood Prince,Fantasy,Young Adult,2036961,32557,652,2005,1217
1,J.K. Rowling,4.48,Harry Potter and the Order of the Phoenix,Fantasy,Young Adult,2087093,34321,870,2003,690
2,J.K. Rowling,4.55,Harry Potter and the Prisoner of Azkaban,Fantasy,Young Adult,2276977,44377,435,1999,368
3,Douglas Adams,4.38,The Ultimate Hitchhiker's Guide to the Galaxy,Science Fiction,Fiction,255070,4753,815,1996,2374
4,Bill Bryson,4.2,A Short History of Nearly Everything,Nonfiction,Science,240843,10362,544,2003,1079


In [24]:
#setting the titles to the index
goodreads_sample.set_index('book_title', inplace = True)

In [26]:
#creating dummies
goodreads_dummies = pd.get_dummies(goodreads_sample, columns=['author_name', 'genre_1', 'genre_2'], drop_first=True)

In [34]:
#making sure the index and dummies were set up right
goodreads_dummies.head(1)

book_title,book_average_rating,num_ratings,num_reviews,pages,publish_date,score,author_name_A. Kirk,author_name_A. Digger Stolz,author_name_A. Lee Martinez,author_name_A. Lynden Rolland,...,genre_2_Thriller,genre_2_Travel,genre_2_Unfinished,genre_2_War,genre_2_Warfare,genre_2_Westerns,genre_2_Womens Fiction,genre_2_World War II,genre_2_Writing,genre_2_Young Adult
Harry Potter and the Half-Blood Prince,4.56,2036961,32557,652,2005,1217,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
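
To make the dummy step concrete, here is a minimal sketch on a toy frame. The rows, titles, and author names are hypothetical; only the column names and the `pd.get_dummies(..., drop_first=True)` call mirror the notebook. The `dtype=int` argument is an addition so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Toy frame with hypothetical rows; column names mirror the notebook's data.
books = pd.DataFrame(
    {
        "book_title": ["Book A", "Book B", "Book C"],
        "author_name": ["Author X", "Author Y", "Author X"],
        "genre_1": ["Fantasy", "Nonfiction", "Fantasy"],
        "pages": [652, 544, 435],
    }
).set_index("book_title")

# drop_first=True drops the first (sorted) level of each category column,
# so "Author X" and "Fantasy" become the implicit baseline.
encoded = pd.get_dummies(
    books, columns=["author_name", "genre_1"], drop_first=True, dtype=int
)
print(encoded)
```

Each remaining category level becomes its own column, which is why the real data ends up with thousands of columns once every author is dummied.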


### Recommender

Since this data set isn't set up as a pivot table or a sparse matrix, the cosine similarity calculation will take a little longer to run.

In [27]:
#setting up the recommender 
recommender = cosine_similarity(goodreads_dummies.iloc[:,:])

In [28]:
#verifying the shape of the engine to make sure the numbers are the same
recommender.shape

(15144, 15144)
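
scikit-learn's `cosine_similarity` also accepts SciPy sparse input, so one option that was not tested in this notebook is converting the dummied frame to a CSR matrix first. The sketch below uses a hypothetical three-row stand-in for `goodreads_dummies` and only checks that both routes give the same scores; whether it actually speeds up the full 15,144-row calculation is an assumption that would need timing.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for goodreads_dummies: mostly zeros after dummying.
toy_dummies = pd.DataFrame(
    {
        "pages": [652, 870, 544],
        "genre_1_Nonfiction": [0, 0, 1],
        "genre_2_Young Adult": [1, 1, 0],
    },
    index=["Book A", "Book B", "Book C"],
)

# Convert to a float CSR matrix; only the nonzero entries are stored.
toy_sparse = sparse.csr_matrix(toy_dummies.to_numpy(dtype=float))

dense_scores = cosine_similarity(toy_dummies)
sparse_scores = cosine_similarity(toy_sparse)  # returns a dense array by default

print(np.allclose(dense_scores, sparse_scores))
```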

In [29]:
#creating a dataframe to bring the title names back into view
recommender_df = pd.DataFrame(recommender, columns=goodreads_dummies.index, index=goodreads_dummies.index)
recommender_df.head(3)

book_title,Harry Potter and the Half-Blood Prince,Harry Potter and the Order of the Phoenix,Harry Potter and the Prisoner of Azkaban,The Ultimate Hitchhiker's Guide to the Galaxy,A Short History of Nearly Everything,Notes from a Small Island,The Mother Tongue: English and How It Got That Way,Hatchet,Changeling,The Known World,...,Save Me from Myself,Somewhere on Maui,Dead by Morning,Jade City,Grasping at Eternity,If I Let You Go,Becoming Human,Shanghai Nobody,Slay,The Baghdad Clock
Harry Potter and the Half-Blood Prince,1.0,1.0,0.999994,0.999931,0.999599,0.998868,0.986936,0.999416,0.247355,0.994958,...,0.141587,0.039364,0.089998,0.7954,0.626063,0.267909,0.596201,0.031485,0.008415,0.302575
Harry Potter and the Order of the Phoenix,1.0,1.0,0.999995,0.99993,0.99961,0.998874,0.986926,0.999427,0.247139,0.99499,...,0.141362,0.039112,0.089896,0.795411,0.625976,0.267732,0.596077,0.03141,0.008277,0.302369
Harry Potter and the Prisoner of Azkaban,0.999994,0.999995,1.0,0.999929,0.999685,0.998951,0.987071,0.999512,0.247021,0.99521,...,0.141228,0.038935,0.089813,0.795837,0.626184,0.267639,0.596098,0.031291,0.008096,0.302337


### Evaluation of the Recommender Engine

For evaluating the engine, I ran searches on a variety of titles in different genres to see what would show up. What follows is a search tool that can be used to find titles. You have to enter the title exactly as listed to find books that are similar to it, so the next cells are handy for that.

In [30]:
#reading in fresh data because the other dataframe has titles as the index now
find_title = pd.read_csv('./datasets/goodreads_sample.csv')
#dropping the unnamed columns
find_title.drop(columns='Unnamed: 0', inplace = True)

In [21]:
#this is code to help find how the title is listed, adjusting the head value will give you more listed options
#if there are any, which there can be especially with foreign versions of titles
q = 'Interview with'
find_title[find_title['book_title'].str.contains(q)]['book_title'].head()

1122    Interview with the Vampire
Name: book_title, dtype: object
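
Because `str.contains` is case-sensitive by default, a search for `'interview with'` would miss the title above. A minimal sketch of a more forgiving lookup follows; the `find_titles` helper and the toy titles are hypothetical, and `regex=False` is an added choice so that punctuation in a query is matched literally.

```python
import pandas as pd

# Hypothetical titles standing in for find_title['book_title'].
titles = pd.DataFrame(
    {"book_title": ["Interview with the Vampire", "The Vampire Lestat", "Holes"]}
)

def find_titles(frame, query, limit=5):
    """Return up to `limit` titles containing `query`, ignoring case."""
    mask = frame["book_title"].str.contains(query, case=False, regex=False, na=False)
    return frame.loc[mask, "book_title"].head(limit)

matches = find_titles(titles, "interview WITH")
print(matches)
```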

In [31]:
#Looking up recommendations for those who liked Harry Potter and the Half-Blood Prince:
recommender_df['Harry Potter and the Half-Blood Prince'].sort_values(ascending=False)[1:11]

book_title
Harry Potter and the Order of the Phoenix    1.000000
The Golden Compass                           1.000000
Of Mice and Men                              0.999999
The Hobbit                                   0.999999
Eragon                                       0.999999
Memoirs of a Geisha                          0.999998
The Great Gatsby                             0.999998
Gone with the Wind                           0.999997
Deception Point                              0.999996
Where the Wild Things Are                    0.999996
Name: Harry Potter and the Half-Blood Prince, dtype: float64

In [32]:
#Looking up recommendations for those who liked The Lovely Bones:
recommender_df['The Lovely Bones'].sort_values(ascending=False)[1:11]

book_title
The Catcher in the Rye                      1.000000
The Hitchhiker's Guide to the Galaxy        0.999999
1984                                        0.999999
To Kill a Mockingbird                       0.999999
Holes                                       0.999999
Harry Potter and the Prisoner of Azkaban    0.999998
Slaughterhouse-Five                         0.999998
Animal Farm                                 0.999998
La naranja mecánica                         0.999996
The Grapes of Wrath                         0.999996
Name: The Lovely Bones, dtype: float64

In [33]:
#Looking up recommendations for those who liked Interview with the Vampire:
recommender_df['Interview with the Vampire'].sort_values(ascending=False)[1:11]

book_title
The Nanny Diaries                                        0.999979
If You Give a Mouse a Cookie                             0.999974
The World According to Garp                              0.999963
The Bourne Identity                                      0.999938
Ella Enchanted                                           0.999935
The Purpose Driven Life: What on Earth Am I Here for?    0.999930
Breakfast of Champions                                   0.999918
Fables, Vol. 1: Legends in Exile                         0.999906
White Fang                                               0.999904
Tales of a Fourth Grade Nothing                          0.999902
Name: Interview with the Vampire, dtype: float64
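
The `[1:11]` slice in the lookups above assumes the searched title sorts into first place. When other books tie at 1.0, that is not guaranteed, so the searched title could appear in its own list. A minimal sketch of a lookup that removes the title explicitly is shown here; the `recommend` helper and the toy scores are hypothetical, and it assumes each title appears only once in the index.

```python
import pandas as pd

# Hypothetical 3x3 similarity frame with the same layout as recommender_df.
toy_titles = ["Book A", "Book B", "Book C"]
toy_scores = pd.DataFrame(
    [[1.0, 1.0, 0.2], [1.0, 1.0, 0.3], [0.2, 0.3, 1.0]],
    index=toy_titles,
    columns=toy_titles,
)

def recommend(sim_df, title, n=10):
    """Top-n most similar titles, excluding the searched title itself."""
    scores = sim_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)

top = recommend(toy_scores, "Book B", n=2)
print(top)
```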

While I tried a variety of feature combinations with this engine, I think this direction is flawed based on the titles being recommended. The scores look great, but on closer inspection I wouldn't recommend a children's picture book (If You Give a Mouse a Cookie) to someone reading Interview with the Vampire, and that's only one example of the faulty recommendations I'm seeing. I think the issue is that content-based engines need more data than this data set has. Book descriptions would be a good start, since fuller content paints a better picture of what a book is actually about.
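
One factor this notebook did not test is feature scale. Columns such as `num_ratings` run into the millions while the dummy columns are 0 or 1, so the large counts may dominate the cosine score. That would be consistent with nearly every score above landing close to 1.0, but it is a hypothesis rather than a finding. The sketch below uses hypothetical rows and scikit-learn's `StandardScaler` to show the effect on a toy example. It is not the notebook's method.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Hypothetical rows; column names follow the notebook's data.
toy = pd.DataFrame(
    {
        "num_ratings": [2_000_000, 1_900_000, 2_100_000],
        "pages": [650, 32, 870],
        "genre_1_Childrens": [0, 1, 0],
        "genre_1_Fantasy": [1, 0, 1],
    },
    index=["Long fantasy A", "Picture book", "Long fantasy B"],
)
numeric_cols = ["num_ratings", "pages"]

# Raw features: the rating counts swamp everything else.
raw_scores = cosine_similarity(toy)

# Standardize only the numeric columns, leaving the dummies as 0/1.
scaled = toy.astype(float)
scaled[numeric_cols] = StandardScaler().fit_transform(toy[numeric_cols])
scaled_scores = cosine_similarity(scaled)

print(pd.DataFrame(raw_scores, index=toy.index, columns=toy.index).round(4))
print(pd.DataFrame(scaled_scores, index=toy.index, columns=toy.index).round(4))
```

On this toy data the raw scores are all close to 1, while after scaling the picture book scores lower against the fantasy title than the second fantasy title does. Whether the same holds on the Goodreads sample would need to be checked.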

### Conclusion and Next Steps

Two engines were created based on book data. One was a collaborator engine based on user ratings and the other was a content engine based on product features. Of the two engines, the collaborator engine gave me slightly better recommendations based on book subject matter than the content engine did. The latter gave better scores, but looking closer I can see the subject matter doesn't match up as well. With that in mind, I would recommend a collaborator-based engine, unless you don't have the option to collect a growing list of user ratings. In that case I would suggest a content-based engine, but I would recommend pulling in more descriptive content and using Natural Language Processing and Word2Vec to assess the relationships between the words and make recommendations based on that.
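
As a rough illustration of the text-based direction, the sketch below scores books on their descriptions. It swaps in TF-IDF, a simpler technique than the Word2Vec approach suggested above, because it needs no trained embedding model. The descriptions are hypothetical, since the data set here has no description column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical descriptions; the notebook's data set has no description column.
descriptions = {
    "Vampire novel": "a vampire tells the dark story of his immortal life",
    "Picture book": "a mouse asks for a cookie and then a glass of milk",
    "Gothic sequel": "the dark immortal vampire returns in a gothic story",
}

# Turn each description into a weighted bag of words, ignoring common English words.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(list(descriptions.values()))

# Rows and columns follow the order of the dictionary keys.
text_scores = cosine_similarity(tfidf)
print(text_scores.round(3))
```

Here the two vampire descriptions share vocabulary and score above zero against each other, while the picture book shares no content words with the vampire novel.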

I do see great potential in these engines. Had I more time, the next steps I would have taken and recommend considering are the following:

1. Collect a larger master data set to test the models on. The bookstore's data would be ideal here. Or, data can be collected via web scraping or an API that includes product descriptions and user reviews.

2. I ran out of time to look into importing foreign-language packages. How foreign-language titles are handled in the data will need to be considered in the next steps: either import a package that can handle them, or remove them from the data set.

3. There was some author-related data in the Goodreads set that I didn't get a chance to test. I would like to see whether it would help with getting better recommendations.

4. Explore options to incorporate consumer usability for the engine by adding a web-based front-end.