#### Data 612 - Project 2: Content-Based and Collaborative Filtering<br>Date: June 18, 2019<br>Team Info:
+ Christina Valore
+ Juliann McEachern 
+ Rajwant Mishra

<h1 align="center">Good Books Recommender System</h1>

## Dataset Selection

Data was obtained from [goodbooks2017](#cite-goodbooks2017).

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data from CSV files hosted on GitHub into pandas dataframes
book_tags = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/book_tags.csv')
tags = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/tags.csv')
books = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/books.csv')
ratings = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/ratings.tar.gz', 
                      compression='gzip')

#### Data Cleaning

In [None]:
# Clean ratings data
ratings = ratings.drop('ratings.csv', axis=1)
ratings = ratings[:-1].astype(int)

In [365]:
# Clean books data
## select only books written in English
filter_list = ['eng', 'en-US', 'en-GB', 'en-CA', 'en']
books_df = books[books.language_code.isin(filter_list)]

## subset columns
books_df = books_df[['book_id', 'goodreads_book_id', 'isbn', 'authors', 'title', 'original_publication_year', 'average_rating']]

## drop 15 occurrences of no publication year
books_df = books_df.dropna(axis=0, subset=['original_publication_year'])

## change publication year data type to int
books_df['original_publication_year'] = books_df['original_publication_year'].astype(int)

## join book_tags, tags, and books dataframes
merge_tags = pd.merge(book_tags, tags, on='tag_id')
group_tags = pd.DataFrame(merge_tags.groupby('goodreads_book_id')['tag_name'].apply(', '.join))
reindex_tags = group_tags.reset_index().rename({'tag_name':'tags'}, axis=1)
tagged_books = pd.merge(books_df, reindex_tags, on='goodreads_book_id')

## view tagged_books
tagged_books


| | book_id | goodreads_book_id | isbn | authors | title | original_publication_year | average_rating | tags |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2767052 | 439023483 | Suzanne Collins | The Hunger Games (The Hunger Games, #1) | 2008 | 4.34 | to-read, fantasy, favorites, currently-reading... |
| 1 | 2 | 3 | 439554934 | J.K. Rowling, Mary GrandPré | Harry Potter and the Sorcerer's Stone (Harry P... | 1997 | 4.44 | to-read, fantasy, favorites, currently-reading... |
| 2 | 3 | 41865 | 316015849 | Stephenie Meyer | Twilight (Twilight, #1) | 2005 | 3.57 | to-read, fantasy, favorites, currently-reading... |
| 3 | 4 | 2657 | 61120081 | Harper Lee | To Kill a Mockingbird | 1960 | 4.25 | to-read, favorites, currently-reading, young-a... |
| 4 | 5 | 4671 | 743273567 | F. Scott Fitzgerald | The Great Gatsby | 1925 | 3.89 | to-read, favorites, currently-reading, young-a... |
| 5 | 6 | 11870085 | 525478817 | John Green | The Fault in Our Stars | 2012 | 4.26 | to-read, favorites, currently-reading, young-a... |
| 6 | 7 | 5907 | 618260307 | J.R.R. Tolkien | The Hobbit | 1937 | 4.25 | to-read, fantasy, favorites, currently-reading... |
| 7 | 8 | 5107 | 316769177 | J.D. Salinger | The Catcher in the Rye | 1951 | 3.79 | to-read, favorites, currently-reading, young-a... |
| 8 | 9 | 960 | 1416524797 | Dan Brown | Angels & Demons (Robert Langdon, #1) | 2000 | 3.85 | to-read, fantasy, favorites, currently-reading... |
| 9 | 10 | 1885 | 679783261 | Jane Austen | Pride and Prejudice | 1813 | 4.24 | to-read, favorites, young-adult, fiction, book... |
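
The tag-joining step above can be hard to follow on the full dataset. The sketch below repeats the same merge, group, and join pattern on a tiny made-up example; the `toy_*` names and all values are hypothetical and are not part of the goodbooks data.

```python
import pandas as pd

# Toy stand-ins for book_tags and tags (values are made up for illustration)
toy_book_tags = pd.DataFrame({
    'goodreads_book_id': [1, 1, 2, 2, 2],
    'tag_id': [10, 11, 10, 12, 11],
})
toy_tags = pd.DataFrame({
    'tag_id': [10, 11, 12],
    'tag_name': ['to-read', 'fantasy', 'classics'],
})

# Attach tag names to each (book, tag) pair
merged = pd.merge(toy_book_tags, toy_tags, on='tag_id')

# Collapse each book's tag names into one comma-separated string
toy_profiles = (
    merged.groupby('goodreads_book_id')['tag_name']
          .apply(', '.join)
          .reset_index()
          .rename(columns={'tag_name': 'tags'})
)
print(toy_profiles)
```

Each book ends up with a single row whose `tags` column holds all of its tag names, which is the text that the item profile is later built from.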


## Content-Based Filtering 

Content-based filtering builds recommendations for each reader individually, from books whose features resemble those of a book the reader already likes.

#### Item profile

We start by creating an item profile for each book which contains features such as authors or tags.

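TF-IDF (term frequency times inverse document frequency) weights each term by how often it appears in a book's profile, scaled down by how many other books share that term.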
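
As a rough illustration of how TF-IDF vectors turn into similarity scores, the sketch below uses three made-up tag strings (the `toy_docs` list is hypothetical). It relies on scikit-learn's default `norm='l2'` setting, under which each TF-IDF row has unit length, so a plain dot product (`linear_kernel`) gives the same result as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up tag strings standing in for tagged_books['tags']
toy_docs = [
    'fantasy magic dragons',
    'fantasy magic wizards',
    'romance contemporary',
]

vectorizer = TfidfVectorizer()  # default norm='l2'
tfidf = vectorizer.fit_transform(toy_docs)

# Each row has unit length, so the dot product equals cosine similarity
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
sim = linear_kernel(tfidf, tfidf)

print(np.allclose(row_norms, 1.0))
print(np.allclose(sim, cosine_similarity(tfidf)))
print(sim.round(2))
```

The first two profiles share two of their three terms and so score well above zero against each other, while the third shares no terms with either and scores zero.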

In [448]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

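# Create TF-IDF feature matrix and cosine similarity matrix for tags.
# TfidfVectorizer L2-normalizes rows by default, so the dot product from
# linear_kernel is equal to cosine similarity.
# min_df=1 (the default) keeps every term, the same effect as min_df=0,
# which newer scikit-learn releases may reject.
TF = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')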
TFIDF_tag_matrix = TF.fit_transform(tagged_books['tags'])
tag_csm = linear_kernel(TFIDF_tag_matrix, TFIDF_tag_matrix)

# Create TF-IDF features matrix and cosine similarity matrix for authors 
TFIDF_author_matrix = TF.fit_transform(tagged_books['authors'])
author_csm = linear_kernel(TFIDF_author_matrix, TFIDF_author_matrix)

# Create array and indices series for recommender functions
titles = tagged_books['title']
authors = tagged_books['authors']
indices = pd.Series(tagged_books.index, index=tagged_books['title'])

# Recommend books from cosine similarity score of book tags
def tag_recommender(title):
    # Look up the row index for this title
    idx = indices[title]
    
    # list and sort similarity scores 
    score = list(enumerate(tag_csm[idx]))
    score = sorted(score, key=lambda x: x[1], reverse=True)
    
    # recommend top 5 books, skipping the first entry (the book itself)
    top_five = score[1:6]
    book_indices = [i[0] for i in top_five]
    return titles.iloc[book_indices]

# Recommend books from cosine similarity score of authors
def author_recommender(title):
    # Look up the row index for this title
    idx = indices[title]
    
    # list and sort similarity scores 
    score = list(enumerate(author_csm[idx]))
    score = sorted(score, key=lambda x: x[1], reverse=True)
    
    # recommend top 5 books, skipping the first entry (the book itself)
    top_five = score[1:6]
    book_indices = [i[0] for i in top_five]
    return titles.iloc[book_indices]
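
The two recommenders above differ only in which similarity matrix they read. One possible consolidation is sketched below; the `recommend` helper, its parameters, and the toy data are hypothetical and not part of the original notebook. It also removes the query book explicitly instead of assuming it sorts first, and it uses the first matching row when a title appears more than once.

```python
import numpy as np
import pandas as pd

def recommend(title, sim_matrix, titles, n=5):
    """Return the n titles most similar to `title`, excluding the book itself."""
    # Position of the first row whose title matches
    matches = np.flatnonzero((titles == title).to_numpy())
    if matches.size == 0:
        raise KeyError(f'Title not found: {title!r}')
    idx = matches[0]

    # Rank all books by similarity (highest first) and drop the query book
    order = np.argsort(-sim_matrix[idx], kind='stable')
    order = order[order != idx][:n]
    return titles.iloc[order]

# Toy data (made up) to exercise the helper
toy_titles = pd.Series(['A', 'B', 'C', 'D'])
toy_sim = np.array([
    [1.0, 0.9, 0.1, 0.5],
    [0.9, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])
print(recommend('A', toy_sim, toy_titles, n=2).tolist())  # ['B', 'D']
```

Under these assumptions, a call such as `recommend('Gone Girl', tag_csm, titles)` would play the role of `tag_recommender('Gone Girl')`, and passing `author_csm` instead would play the role of `author_recommender`.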


In [451]:
# Test functions

book_test = ['The Great Gatsby', 'Gone Girl']

for i in book_test:
    print('Top 5 Recommendations for:', i, '\n', tag_recommender(i),' \n \n')
    
for i in book_test:
    print('Top 5 Recommendations for:', i, '\n', author_recommender(i),' \n \n')

Top 5 Recommendations for: The Great Gatsby 
 7       The Catcher in the Rye
31             Of Mice and Men
3        To Kill a Mockingbird
27           Lord of the Flies
1492               Ethan Frome
Name: title, dtype: object  
 

Top 5 Recommendations for: Gone Girl 
 219                Dark Places
236              Sharp Objects
2444    The Kind Worth Killing
1288              Pretty Girls
58       The Girl on the Train
Name: title, dtype: object  
 

Top 5 Recommendations for: The Great Gatsby 
 1134                                  Tender Is the Night
2166                                This Side of Paradise
3356                  The Curious Case of Benjamin Button
6593                                    The Short Stories
7649    The Billionaire's Obsession ~ Simon (The Billi...
Name: title, dtype: object  
 

Top 5 Recommendations for: Gone Girl 
 219                            Dark Places
236                          Sharp Objects
2050                           The Grownup
1952 

## User-User Collaborative Filtering 

## Item-Item Collaborative Filtering 

## Summary
Please provide at least one graph, and a textual summary of your findings and recommendations. 

## Sources

**To do: figure out jupyter nbconvert citations**

http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/

@article{goodbooks2017,
    author = {Zajac, Zygmunt},
    title = {Goodbooks-10k: a new dataset for book recommendations},
    year = {2017},
    publisher = {FastML},
    journal = {FastML},
    howpublished = {\url{http://fastml.com/goodbooks-10k}}
}
