**Data 612: Project 3 - Matrix Factorization Methods<br>Christina Valore, Juliann McEachern, & Rajwant Mishra<br>Due: June 25, 2019**

<h1 align="center">Goodreads Recommender Systems</h1>

<h2 style="color:#088A68;">Getting Started</h2>

For Project 3, we chose to continue our work with Goodreads books and build a recommender system that uses implicit matrix factorization techniques. As we have learned, computing the singular value decomposition (SVD) of a large matrix can be computationally expensive. Our work therefore focuses on a small subset of the Goodreads book data we previously explored. 

We will also compare the performance of our functions and calculations to the results generated by the `surprise` package. 
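
To make the factorization idea concrete, the following is a minimal sketch of a truncated SVD computed with NumPy. The matrix `R` and the helper `rank_k_approx` are made up for illustration and are not part of the Goodreads data or of our final method.

```python
import numpy as np

# Toy binary user-item matrix (rows: users, columns: books); values are made up.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

# Thin SVD: R = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

def rank_k_approx(U, s, Vt, k):
    """Rebuild the matrix from the k largest singular values only."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

R_full = rank_k_approx(U, s, Vt, k=len(s))  # reproduces R up to rounding
R_2 = rank_k_approx(U, s, Vt, k=2)          # low-rank scores for every cell

# Frobenius-norm reconstruction error of the rank-2 approximation.
err_2 = np.linalg.norm(R - R_2)
```

Keeping only the largest `k` singular values is what makes the approach usable as a recommender: cells that were zero in `R` receive non-zero scores in `R_2`, which can then be ranked per user.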

#### Python Dependencies

In [1]:
# The usual suspects 
import numpy as np, pandas as pd 

# Visualization packages
import seaborn as sns

# Scikits packages
## Surprise!
from surprise.model_selection import train_test_split
from surprise import KNNWithMeans, SVD, Dataset, Reader, accuracy

## TFIDF
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer


#### Data Preparation  

In [2]:
books = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/books.csv')
ratings = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/ratings.tar.gz', 
                      compression='gzip')

In [3]:
book_tags = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/book_tags.csv')
tags = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/tags.csv')

In [4]:
r=ratings[:-1].astype(int).drop('ratings.csv', axis=1)
b=books[['goodreads_book_id','book_id', 'title', 'authors']]
df=r.set_index('book_id').join(b.set_index('book_id')).drop('goodreads_book_id', axis=1).reset_index()
df.head()

Unnamed: 0,book_id,user_id,rating,title,authors
0,1,2886,5,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins
1,1,6158,5,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins
2,1,3991,4,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins
3,1,5281,5,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins
4,1,5721,5,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins


In [5]:
t = book_tags.set_index('tag_id').join(tags.set_index('tag_id')).drop('count', axis=1).merge(b, on='goodreads_book_id')
CBF = t.groupby(['goodreads_book_id','book_id','title','authors'],as_index=False).agg(lambda x:', '.join(x)).rename({'tag_name':'tags'}, axis=1).drop('goodreads_book_id', axis=1)
CBF.head()

Unnamed: 0,book_id,title,authors,tags
0,27,Harry Potter and the Half-Blood Prince (Harry ...,"J.K. Rowling, Mary GrandPré","2005, 5-star, 5-stars, adventure, all-time-fav..."
1,21,Harry Potter and the Order of the Phoenix (Har...,"J.K. Rowling, Mary GrandPré","2003, 5-star, action, all-time-favorites, all-..."
2,2,Harry Potter and the Sorcerer's Stone (Harry P...,"J.K. Rowling, Mary GrandPré","5-stars, adventure, all-time-favorites, all-ti..."
3,18,Harry Potter and the Prisoner of Azkaban (Harr...,"J.K. Rowling, Mary GrandPré, Rufus Beck","5-star, 5-stars, adventure, all-time-favorites..."
4,24,Harry Potter and the Goblet of Fire (Harry Pot...,"J.K. Rowling, Mary GrandPré","5-star, 5-stars, adventure, all-time-favorites..."


In [6]:
# Reuse the CBF TF-IDF model to select a subset for analysis
tdv = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english') # build vectorizer; min_df=1 keeps every term
tfidf = tdv.fit_transform(CBF['tags']) # TF-IDF matrix of book tags
cos = linear_kernel(tfidf, tfidf) # dot product; equals cosine similarity because TF-IDF rows are L2-normalized
i = pd.Series(data=CBF.index, index=CBF['title']) # map each title to its row position in CBF

# Recommendation function 
def recommend(title):
    subset = CBF[['book_id','title', 'authors']] # set recommendation output
    idx = i[title] # set index to title
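    # similarity scores for this title, highest first; iloc[1:] drops the top row (normally the title itself)
    score = pd.DataFrame(enumerate(cos[idx]), columns=['ID', 'score']).drop('ID', axis=1).sort_values('score', ascending = False).iloc[1:,]
    # keep 25 books; the slice starts at 1, so the closest remaining match is skipped as well
    top_n = score[1:26]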
    rec = subset.iloc[top_n.index].join(top_n)
    rec.index = np.arange(1, len(rec) + 1)
    return rec
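
The recommendation function treats a plain dot product as cosine similarity. This works because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). The sketch below shows the same idea with NumPy only. The weights in `X` and the helper `top_n_similar` are hypothetical and stand in for the real tag vectors.

```python
import numpy as np

# Hypothetical TF-IDF-like weights for four items over three terms.
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as item 0
    [0.0, 1.0, 3.0],
    [3.0, 0.0, 1.0],
])

# L2-normalize each row so every item vector has unit length.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

# On unit-length rows the plain dot product is the cosine similarity.
sim = X_norm @ X_norm.T

def top_n_similar(sim, idx, n):
    """Return the positions of the n most similar items, excluding idx itself."""
    order = np.argsort(-sim[idx], kind="stable")  # descending by score
    order = order[order != idx]                   # drop the query item by position
    return order[:n]

neighbors = top_n_similar(sim, idx=0, n=2)
```

Dropping the query item by position, rather than by assuming it sorts first, keeps the result stable when another item has an identical vector and ties for the top score.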


In [7]:
sub=recommend("The Great Gatsby").drop('score',axis=1).merge(df, on=['book_id', 'title', 'authors'], how='inner')

In [8]:
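# books rated by each user within the subset
s = sub.groupby(['user_id'])['book_id'].apply(list)
# keep users who rated more than one book
s = s[s.str.len() > 1].reset_index()
# order users by number of books rated, most first
v = s.book_id.str.len().sort_values(ascending=False).index
v1 = s.reindex(v).reset_index(drop=True)
# take 100 users, starting from the second row; named top_users to avoid shadowing the built-in id()
top_users = v1[1:101].drop('book_id',axis=1)

In [9]:
subset = top_users.merge(sub, on='user_id', how='inner')
# aggfunc='count' counts ratings per user and title, so the matrix records presence rather than rating value
matrix = subset.pivot_table(index='user_id', columns='title', values='rating', aggfunc='count', fill_value=0)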
matrix

user_id,A Farewell to Arms,A Separate Peace,A Streetcar Named Desire,Animal Farm,Cannery Row,Death of a Salesman,Ethan Frome,Fahrenheit 451,Frankenstein,Great Expectations,...,The Adventures of Huckleberry Finn,The Awakening,The Grapes of Wrath,The Old Man and the Sea,The Pearl,The Scarlet Letter,The Sun Also Rises,Their Eyes Were Watching God,To Kill a Mockingbird,Wuthering Heights
209,0,0,0,1,0,1,0,1,0,1,...,1,1,1,0,1,1,1,0,1,1
233,0,1,1,1,0,1,0,1,1,1,...,1,1,1,0,1,1,0,1,1,0
247,1,1,1,0,1,1,1,1,1,0,...,0,0,1,0,1,1,1,1,1,0
947,1,1,1,1,0,1,1,0,0,1,...,1,1,1,1,1,1,0,0,0,1
1590,0,0,1,1,0,0,1,1,1,1,...,1,1,1,1,0,1,0,1,1,1
2575,1,0,1,1,0,0,1,1,0,0,...,1,0,1,1,1,1,1,0,1,1
2617,0,0,0,1,1,0,0,1,1,1,...,1,1,1,1,1,0,1,1,1,1
2792,0,0,1,1,0,1,0,1,1,1,...,1,0,1,1,0,1,0,0,1,1
2895,1,0,0,1,0,1,0,1,1,1,...,1,0,1,1,1,0,1,0,1,1
3786,0,0,0,1,1,0,1,1,1,0,...,1,1,1,1,1,1,0,1,1,1
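
The user-by-title matrix above holds only 0s and 1s because the pivot counts ratings instead of keeping their values. The following sketch reproduces that step on a made-up ratings table. The frame `toy` and the `sparsity` figure are illustrative assumptions, not results from the Goodreads subset.

```python
import pandas as pd

# Made-up ratings in long format: one row per (user, book) rating.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "title":   ["Animal Farm", "Frankenstein", "Animal Farm", "The Pearl", "Frankenstein"],
    "rating":  [5, 3, 4, 2, 5],
})

# aggfunc='count' discards the rating value and records only that a rating
# exists; fill_value=0 marks user-book pairs with no rating.
interactions = toy.pivot_table(index="user_id", columns="title",
                               values="rating", aggfunc="count", fill_value=0)

# Share of empty user-book cells (valid here because every count is 0 or 1).
sparsity = 1 - interactions.to_numpy().mean()
```

Swapping `aggfunc="count"` for `aggfunc="mean"` would keep the explicit 1-5 ratings instead, which is the form that rating-prediction models such as those in `surprise` expect.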


#### Data Visualization

In [10]:
# TO DO

<h2 style="color:#088A68;">Singular Value Decomposition</h2>

<h2 style="color:#088A68;">Alternating Least Squares Method</h2>
We may also explore the alternating least squares (ALS) technique in this section.
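
As a starting point, here is a minimal sketch of ALS with ridge regularization, written with NumPy only. It is a simplification under stated assumptions: every cell of the matrix, including the zeros, is treated as observed with equal weight, whereas ALS variants designed for implicit feedback usually weight observed interactions by a confidence term. The function `als`, its parameters, and the toy matrix `R` are hypothetical.

```python
import numpy as np

def als(R, k=2, reg=0.1, n_iters=20, seed=0):
    """Factor R ~ P @ Q.T by alternating ridge-regression updates."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors
    ridge = reg * np.eye(k)
    losses = []
    for _ in range(n_iters):
        # Fix Q and solve for P: P = R Q (Q^T Q + reg I)^-1
        P = np.linalg.solve(Q.T @ Q + ridge, Q.T @ R.T).T
        # Fix P and solve for Q: Q = R^T P (P^T P + reg I)^-1
        Q = np.linalg.solve(P.T @ P + ridge, P.T @ R).T
        # Regularized squared error after one full sweep
        loss = np.sum((R - P @ Q.T) ** 2) + reg * (np.sum(P ** 2) + np.sum(Q ** 2))
        losses.append(loss)
    return P, Q, losses

# Toy binary user-item matrix; values are made up.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

P, Q, losses = als(R, k=2)
scores = P @ Q.T  # predicted affinity for every user-book pair
```

Each half-step is an exact least-squares solution with the other factor held fixed, so the recorded loss cannot increase from one sweep to the next. That makes the `losses` list a simple convergence check.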

<h2 style="color:#088A68;">Surprise!</h2>

<h2 style="color:#088A68;">Analysis</h2>

---
#### References: 
*  **[Goodbooks-10k:](http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/)** A New Dataset for Book Recommendations