#### Data 612 - Project 2 : Content-Based and Collaborative Filtering<br>Date: June 18, 2019<br>Team Info: 
+ Christina Valore
+ Juliann McEachern 
+ Rajwant Mishra

<h1 align="center">Goodreads Books Recommender Systems</h1>

## Dataset Selection

Data was obtained from [goodbooks2017](#cite-goodbooks2017). Add more details here:
+  `books`: dataset
+  `book_tags`: dataset
+  `tags`: dataset
+  `ratings`: dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data from remote CSV files hosted on GitHub into pandas DataFrames
books = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/books.csv')
book_tags = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/book_tags.csv')
tags = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/tags.csv')
ratings = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/ratings.tar.gz', 
                      compression='gzip')

# Clean ratings data
ratings = ratings.drop('ratings.csv', axis=1)
ratings = ratings[:-1].astype(int)

In [2]:
book_tags.head()

| | goodreads_book_id | tag_id | count |
|---|---|---|---|
| 0 | 1 | 30574 | 167697 |
| 1 | 1 | 11305 | 37174 |
| 2 | 1 | 11557 | 34173 |
| 3 | 1 | 8717 | 12986 |
| 4 | 1 | 33114 | 12716 |


In [3]:
books.head()

| | book_id | goodreads_book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | ... | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9780439000000.0 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 4780653 | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... |
| 1 | 2 | 3 | 3 | 4640799 | 491 | 439554934 | 9780440000000.0 | J.K. Rowling, Mary GrandPré | 1997.0 | Harry Potter and the Philosopher's Stone | ... | 4602479 | 4800065 | 75867 | 75504 | 101676 | 455024 | 1156318 | 3011543 | https://images.gr-assets.com/books/1474154022m... | https://images.gr-assets.com/books/1474154022s... |
| 2 | 3 | 41865 | 41865 | 3212258 | 226 | 316015849 | 9780316000000.0 | Stephenie Meyer | 2005.0 | Twilight | ... | 3866839 | 3916824 | 95009 | 456191 | 436802 | 793319 | 875073 | 1355439 | https://images.gr-assets.com/books/1361039443m... | https://images.gr-assets.com/books/1361039443s... |
| 3 | 4 | 2657 | 2657 | 3275794 | 487 | 61120081 | 9780061000000.0 | Harper Lee | 1960.0 | To Kill a Mockingbird | ... | 3198671 | 3340896 | 72586 | 60427 | 117415 | 446835 | 1001952 | 1714267 | https://images.gr-assets.com/books/1361975680m... | https://images.gr-assets.com/books/1361975680s... |
| 4 | 5 | 4671 | 4671 | 245494 | 1356 | 743273567 | 9780743000000.0 | F. Scott Fitzgerald | 1925.0 | The Great Gatsby | ... | 2683664 | 2773745 | 51992 | 86236 | 197621 | 606158 | 936012 | 947718 | https://images.gr-assets.com/books/1490528560m... | https://images.gr-assets.com/books/1490528560s... |


In [4]:
tags.head()

| | tag_id | tag_name |
|---|---|---|
| 0 | 0 | - |
| 1 | 1 | --1- |
| 2 | 2 | --10- |
| 3 | 3 | --12- |
| 4 | 4 | --122- |


In [5]:
ratings.head()

| | user_id | book_id | rating |
|---|---|---|---|
| 0 | 1 | 258 | 5 |
| 1 | 2 | 4081 | 4 |
| 2 | 2 | 260 | 5 |
| 3 | 2 | 9296 | 5 |
| 4 | 2 | 2318 | 3 |


## Content-Based Filtering 

Through content-based filtering, we generated recommendations from individual item profiles built from our `books`, `book_tags`, and `tags` datasets.

#### Item Profile 

Using a few data transformations, we created an item profile for each book, whose features are the book's tags concatenated into a single string.
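
To make the transformation concrete, here is a minimal plain-Python sketch of the same group-and-concatenate idea. The rows and tag names are made up for illustration, and `build_profiles` is a hypothetical helper rather than part of the notebook, which does this step with pandas.

```python
from collections import defaultdict

# Hypothetical (book title, tag) pairs standing in for the joined books/tags table.
rows = [
    ("Book A", "fantasy"),
    ("Book A", "dragons"),
    ("Book B", "classics"),
    ("Book A", "epic"),
    ("Book B", "fiction"),
]

def build_profiles(pairs):
    """Concatenate all tags for each book into one comma-separated profile string."""
    grouped = defaultdict(list)
    for title, tag in pairs:
        grouped[title].append(tag)
    return {title: ", ".join(tags) for title, tags in grouped.items()}

profiles = build_profiles(rows)
print(profiles["Book A"])  # fantasy, dragons, epic
```

Each profile string then plays the role of one "document" for the text-mining step that follows.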

In [6]:
# CBF data cleaning
# Select only books written in English and keep the Goodreads book id, title, and authors
filter_list = ['eng', 'en-US', 'en-GB', 'en-CA', 'en']
eng_books = books[books.language_code.isin(filter_list)]
subset_books = eng_books[['goodreads_book_id', 'title', 'authors']]

# Attach tag names to each book-tag pair, then join the result to the book subset
join_tags = book_tags.set_index('tag_id').join(tags.set_index('tag_id')).drop('count', axis=1)
join_book = pd.merge(subset_books, join_tags, on='goodreads_book_id')
CBF_tags = join_book.groupby(['goodreads_book_id','title','authors'],
                             as_index=False).agg(lambda x:', '.join(x)).rename({'tag_name':'tags'}, axis=1)

We converted the tags column (each book's profile) into a term frequency-inverse document frequency (TF-IDF) matrix. TF-IDF scores each term by how often it appears in a profile, down-weighted by how many profiles contain it, so distinctive tags count for more than ubiquitous ones.
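
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand using only the standard library. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which matches scikit-learn's default settings; it handles single words only (no n-grams or stop words), and the toy profiles are invented.

```python
import math
from collections import Counter

# Invented tag profiles; each string plays the role of one book's `tags` value.
profiles = [
    "fantasy dragons magic",
    "fantasy magic school",
    "classics fiction school",
]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of profiles containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

vectors = tfidf(profiles)
# "dragons" appears in one profile and "fantasy" in two, so "dragons" scores higher.
print(vectors[0]["dragons"] > vectors[0]["fantasy"])  # True
```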

We then created a cosine similarity matrix from the TF-IDF vectors to drive our recommendation predictions. Finally, we built a `CBF_recommend` function, which uses the cosine similarities to identify the top *n* matches for a particular book based solely on its profile.
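
The ranking idea can be sketched without any libraries. The vectors below are hypothetical term weights, not real TF-IDF output, and `top_n` is an illustrative helper rather than the notebook's function. The sketch also checks that a plain dot product on unit-length vectors equals the cosine similarity, which is why a linear kernel can stand in for cosine similarity on L2-normalized TF-IDF rows.

```python
import math

# Hypothetical sparse term-weight vectors for four books.
vectors = {
    "Book A": {"fantasy": 3.0, "magic": 4.0},
    "Book B": {"fantasy": 4.0, "magic": 3.0},
    "Book C": {"magic": 1.0, "school": 1.0},
    "Book D": {"classics": 2.0},
}

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return {t: w / norm for t, w in u.items()}

def top_n(title, n):
    """Rank every other book by cosine similarity to `title` and keep the n closest."""
    scores = [
        (other, cosine(vectors[title], vec))
        for other, vec in vectors.items()
        if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

# On unit-length vectors the dot product equals the cosine similarity.
unit_a, unit_b = normalize(vectors["Book A"]), normalize(vectors["Book B"])
print(math.isclose(dot(unit_a, unit_b), cosine(vectors["Book A"], vectors["Book B"])))  # True
print([name for name, _ in top_n("Book A", 2)])  # ['Book B', 'Book C']
```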

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Configure the TF-IDF vectorizer for tags (word 1- to 3-grams, English stop words removed)
# min_df=1 keeps every term that appears in at least one profile
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

# Generate TF-IDF matrix for tags
tf_idf_matrix = vectorizer.fit_transform(CBF_tags['tags'])

# Generate cosine similarity matrix for tags
# (TF-IDF rows are L2-normalized by default, so the linear kernel equals cosine similarity)
co_sim = linear_kernel(tf_idf_matrix, tf_idf_matrix)

# Map each title to its row index for lookups inside the function
indices = pd.Series(data=CBF_tags.index, index=CBF_tags['title'])

# Book recommendation function 
def CBF_recommend(title, n):
    if n > 0: # logical statement to ensure valid input for n
        recommendations = CBF_tags[['title', 'authors']] # set recommendation output: title, author
        idx = indices[title] # set index to title
        
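        # list similarity scores, sort descending, and drop the first row (the book itself)
        score = pd.DataFrame(enumerate(co_sim[idx]), columns=['ID', 'score']).drop('ID', axis=1).sort_values('score', ascending = False).iloc[1:,]
  
        # recommend n books; this slice starts at position 1, so the closest remaining match is skipped as well
        top_n = score[1:n+1]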
        test = recommendations.iloc[top_n.index].join(top_n)
        test.index = np.arange(1, len(test) + 1)
        return test
    else: 
        print("Select a value greater than 0 and try again.")

#### Content-Based Filtering Examples

The following examples test our `CBF_recommend` function and show the cosine similarity scores of the recommended books.

In [8]:
CBF_recommend('To Kill a Mockingbird', 3)

| | title | authors | score |
|---|---|---|---|
| 1 | Of Mice and Men | John Steinbeck | 0.520684 |
| 2 | The Great Gatsby | F. Scott Fitzgerald | 0.512877 |
| 3 | Lord of the Flies | William Golding | 0.495521 |


In [9]:
CBF_recommend('Nineteen Minutes', 3)

| | title | authors | score |
|---|---|---|---|
| 1 | The Tenth Circle | Jodi Picoult | 0.352281 |
| 2 | Salem Falls | Jodi Picoult | 0.344941 |
| 3 | Handle with Care | Jodi Picoult | 0.323383 |


In [10]:
CBF_recommend('A Game of Thrones (A Song of Ice and Fire, #1)', 3)

| | title | authors | score |
|---|---|---|---|
| 1 | A Feast for Crows (A Song of Ice and Fire, #4) | George R.R. Martin | 0.685467 |
| 2 | A Dance with Dragons (A Song of Ice and Fire, #5) | George R.R. Martin | 0.676265 |
| 3 | A Storm of Swords (A Song of Ice and Fire, #3) | George R.R. Martin | 0.660945 |


#### Content-Based Recommendations from User Input 

The `booksearch` function below lets users search for book titles within our goodbooks compilation. Users can use the output to find the exact title to pass to the recommender.

In [15]:
#pip install fuzzywuzzy
#pip install python-Levenshtein
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

choices = CBF_tags['title']
search_value = input("Search book titles: ")

def booksearch(title):
    fuzzy = process.extract(title, choices)  # use the function argument, not the global input
    results = [x[0] for x in fuzzy]
    print("\n".join(str(x) for x in results))

booksearch(search_value)

Search book titles: sun
In a Sunburned Country
The Sun Also Rises
Rising Sun
Sunshine
Half of a Yellow Sun


We also created a `title_recommendations` function, which finds the best title match for the user's input and runs that title through our content-based recommender. The user also chooses how many recommendations to receive.
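
For readers without fuzzywuzzy installed, the best-match step can be approximated with the standard library. The sketch below swaps in `difflib.SequenceMatcher`, whose scoring differs from fuzzywuzzy's, so treat it as an illustration of the idea rather than a drop-in replacement. `best_match` is a hypothetical helper, and the titles are the ones returned by the search example above.

```python
from difflib import SequenceMatcher

# Titles taken from the search output shown earlier in this notebook.
titles = [
    "In a Sunburned Country",
    "The Sun Also Rises",
    "Rising Sun",
    "Sunshine",
    "Half of a Yellow Sun",
]

def best_match(query, choices):
    """Return the (title, ratio) pair whose lowercase form is closest to the query."""
    scored = [
        (choice, SequenceMatcher(None, query.lower(), choice.lower()).ratio())
        for choice in choices
    ]
    return max(scored, key=lambda pair: pair[1])

title, ratio = best_match("sun also rises", titles)
print(title)  # The Sun Also Rises
```

The matched title, rather than the raw user input, is what gets passed on to the recommender, so small typos or missing words do not cause a failed lookup.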

In [16]:
title_input = input("Input book title to view recommendations: ")
    
def title_recommendations(title): 
    recommend_n = input("Input number of recommendations you would like to receive: ")
    user_selection = process.extractOne(title, choices)[0]  # use the function argument, not the global input
    print("\n Recommending titles based on the book: ",user_selection)
    return CBF_recommend(user_selection, int(recommend_n))

title_recommendations(title_input)

Input book title to view recommendations: sun also rises
Input number of recommendations you would like to receive: 10

 Recommending titles based on the book:  The Sun Also Rises


| | title | authors | score |
|---|---|---|---|
| 1 | The Sound and the Fury | William Faulkner | 0.525416 |
| 2 | To Have and Have Not | Ernest Hemingway | 0.50585 |
| 3 | Tender Is the Night | F. Scott Fitzgerald | 0.496121 |
| 4 | A Room with a View | E.M. Forster | 0.492896 |
| 5 | As I Lay Dying | William Faulkner | 0.490031 |
| 6 | One Flew Over the Cuckoo's Nest | Ken Kesey | 0.483343 |
| 7 | For Whom the Bell Tolls | Ernest Hemingway | 0.474811 |
| 8 | The Grapes of Wrath | John Steinbeck | 0.469059 |
| 9 | The Awakening | Kate Chopin | 0.463095 |
| 10 | An American Tragedy | Theodore Dreiser, Richard R. Lingeman | 0.45822 |


#### Content-Based Analysis

Upon initial review, the `CBF_recommend` function appears to match book recommendations effectively based on the item profiles we created. This method works nicely because it does not require data on other users and does not rank items by popularity.

However, we found this method suffered from a common drawback of the content-based approach: overspecialization. Unlike the "To Kill a Mockingbird" recommendations, the top results for "Nineteen Minutes" and "A Game of Thrones" are all other novels by the same author as the book we searched for.
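
One simple way to put a number on this effect is the share of recommendations that have an author in common with the query book. The sketch below is a hypothetical check, not part of the notebook. The author lists are copied from the example outputs above, and the query author for "Nineteen Minutes" is assumed to be Jodi Picoult, consistent with the same-author observation.

```python
def same_author_share(query_authors, recommended_authors):
    """Fraction of recommendations sharing at least one author with the query book."""
    query = {a.strip() for a in query_authors.split(",")}
    hits = sum(
        1
        for authors in recommended_authors
        if query & {a.strip() for a in authors.split(",")}
    )
    return hits / len(recommended_authors)

# Author columns copied from the example outputs above.
mockingbird = same_author_share(
    "Harper Lee", ["John Steinbeck", "F. Scott Fitzgerald", "William Golding"]
)
nineteen_minutes = same_author_share(
    "Jodi Picoult", ["Jodi Picoult", "Jodi Picoult", "Jodi Picoult"]
)
print(mockingbird, nineteen_minutes)  # 0.0 1.0
```

A share near 1.0 suggests the recommender is mostly echoing the query book's own author.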

## User-User Collaborative Filtering 

## Item-Item Collaborative Filtering 

## Summary
Please provide at least one graph, and a textual summary of your findings and recommendations. 

## Sources

**To do: figure out jupyter nbconvert citations**

http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/

@article{goodbooks2017,
    author = {Zajac, Zygmunt},
    title = {Goodbooks-10k: a new dataset for book recommendations},
    year = {2017},
    publisher = {FastML},
    journal = {FastML},
    howpublished = {\url{http://fastml.com/goodbooks-10k}},
}
