#### Data 612 - Project 2 : Content-Based and Collaborative Filtering<br>Date: June 18, 2019<br>Team Info: 
+ Christina Valore
+ Juliann McEachern 
+ Rajwant Mishra

<h1 align="center">Goodreads Books Recommender Systems</h1>

## Dataset Selection

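Data was obtained from [goodbooks2017](#cite-goodbooks2017). The project uses four tables:
+  `books`: book metadata, including IDs, authors, titles, and rating counts
+  `book_tags`: tags assigned to each book (`goodreads_book_id`, `tag_id`), with a `count` column for each pair
+  `tags`: lookup from `tag_id` to `tag_name`
+  `ratings`: user ratings (`user_id`, `book_id`, `rating`)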

In [269]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data from the project's GitHub repository into pandas dataframes
books = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/books.csv')
book_tags = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/book_tags.csv')
tags = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/tags.csv')
ratings = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-2/data/ratings.tar.gz', 
                      compression='gzip')

# Clean ratings data
ratings = ratings.drop('ratings.csv', axis=1)
ratings = ratings[:-1].astype(int)

In [270]:
book_tags.head()

Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697
1,1,11305,37174
2,1,11557,34173
3,1,8717,12986
4,1,33114,12716


In [271]:
books.head()

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [272]:
tags.head()

Unnamed: 0,tag_id,tag_name
0,0,-
1,1,--1-
2,2,--10-
3,3,--12-
4,4,--122-


In [273]:
ratings.head()

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


## Content-Based Filtering 

Content-based filtering generates recommendations for each item from its own profile. We built these item profiles from our `books`, `book_tags`, and `tags` datasets.

#### Item Profile 

Using a few data transformations, we create an individual profile for each book whose features are the book's concatenated tags.

In [274]:
# CBF Data Cleaning
## select only books written in English and subset goodreads book id, title, and authors
filter_list = ['eng', 'en-US', 'en-GB', 'en-CA', 'en']
eng_books = books[books.language_code.isin(filter_list)]
subset_books = eng_books[['goodreads_book_id', 'title', 'authors']]

# attach tag names to book_tags, then join the result to the book subset
join_tags = book_tags.set_index('tag_id').join(tags.set_index('tag_id')).drop('count', axis=1)
join_book = pd.merge(subset_books, join_tags, on='goodreads_book_id')
CBF_tags = join_book.groupby(['goodreads_book_id','title','authors'],
                             as_index=False).agg(lambda x:', '.join(x)).rename({'tag_name':'tags'}, axis=1)
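
The profile-building step can be illustrated on a toy example. This is a minimal sketch with made-up data; the `toy_*` names are hypothetical stand-ins for `books`, `book_tags`, and `tags`, and the chaining differs slightly from the cell above while aiming for the same result.

```python
import pandas as pd

# Toy stand-ins for the books, book_tags, and tags tables (values are made up)
toy_books = pd.DataFrame({
    'goodreads_book_id': [1, 2],
    'title': ['Book A', 'Book B'],
    'authors': ['Author A', 'Author B'],
})
toy_book_tags = pd.DataFrame({
    'goodreads_book_id': [1, 1, 2],
    'tag_id': [10, 11, 10],
})
toy_tags = pd.DataFrame({'tag_id': [10, 11], 'tag_name': ['fantasy', 'classics']})

# Attach tag names to each (book, tag) pair, then attach book metadata
named = toy_book_tags.merge(toy_tags, on='tag_id')
joined = toy_books.merge(named, on='goodreads_book_id')

# Collapse to one row per book with a comma-separated tag string
toy_profiles = (joined.groupby(['goodreads_book_id', 'title', 'authors'])['tag_name']
                      .agg(', '.join)
                      .reset_index()
                      .rename(columns={'tag_name': 'tags'}))
print(toy_profiles)
```

Each row of `toy_profiles` plays the role of one item profile: a single string of tags that can be fed to a text vectorizer.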

We passed the tags column (the profile) through a term frequency-inverse document frequency (TF-IDF) vectorizer, which scores each term by how distinctive it is for a given profile.

We then created a cosine similarity matrix for the book tags to make our recommendation predictions. Finally, we built a `CBF_recommend` function, which uses the cosine similarities to identify the top *n* matches for a particular book based solely on its profile.
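
A small sketch of why a linear kernel can stand in for cosine similarity here, assuming the default settings of `TfidfVectorizer` (which L2-normalizes each row). The `toy_docs` strings are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up tag profiles for three hypothetical books
toy_docs = [
    'fantasy, dragons, epic fantasy',
    'fantasy, dragons, magic',
    'classics, american literature',
]
toy_tfidf = TfidfVectorizer(analyzer='word', stop_words='english').fit_transform(toy_docs)

# Rows are unit length, so the plain dot product equals the cosine similarity
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)
print(np.round(toy_sim, 3))
```

The two fantasy profiles share terms and score above zero against each other, while the third profile shares no terms with the first and scores zero.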

In [275]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

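# Configure the TF-IDF vectorizer (min_df=1 keeps every term and is accepted by all scikit-learn versions)
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

# Generate the TF-IDF matrix for tags, then the cosine similarity matrix
# (TF-IDF rows are L2-normalized, so the linear kernel equals cosine similarity)
tf_idf_matrix = vectorizer.fit_transform(CBF_tags['tags'])
co_sim = linear_kernel(tf_idf_matrix, tf_idf_matrix)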

# Create list to match title indices in function
indices = pd.Series(data=CBF_tags.index, index=CBF_tags['title']) 

# Book recommendation function 
def CBF_recommend(title, n):
    if n > 0: # logical statement to ensure valid input for n
        recommendations = CBF_tags[['title', 'authors']] # set recommendation output: title, author
        idx = indices[title] # set index to title
        
        # list and sort similarity scores 
        score = pd.DataFrame(enumerate(co_sim[idx]), columns=['ID', 'score']).drop('ID', axis=1).sort_values('score', ascending = False).iloc[1:,]
  
        # recommend top n books 
        top_n = score[1:n+1]
        test = recommendations.iloc[top_n.index].join(top_n)
        test.index = np.arange(1, len(test) + 1)
        return test
    else: 
        print("Select a value greater than 0 and try again.")

#### Content-Based Filtering Examples

The following examples test our `CBF_recommend` function and show the cosine similarity score of each recommended book.

In [276]:
CBF_recommend('To Kill a Mockingbird', 3)

Unnamed: 0,title,authors,score
1,Of Mice and Men,John Steinbeck,0.520684
2,The Great Gatsby,F. Scott Fitzgerald,0.512877
3,Lord of the Flies,William Golding,0.495521


In [277]:
CBF_recommend('Nineteen Minutes', 3)

Unnamed: 0,title,authors,score
1,The Tenth Circle,Jodi Picoult,0.352281
2,Salem Falls,Jodi Picoult,0.344941
3,Handle with Care,Jodi Picoult,0.323383


In [278]:
CBF_recommend('A Game of Thrones (A Song of Ice and Fire, #1)', 3)

Unnamed: 0,title,authors,score
1,"A Feast for Crows (A Song of Ice and Fire, #4)",George R.R. Martin,0.685467
2,"A Dance with Dragons (A Song of Ice and Fire, #5)",George R.R. Martin,0.676265
3,"A Storm of Swords (A Song of Ice and Fire, #3)",George R.R. Martin,0.660945


#### Content-Based Recommendations from User Input 

The `booksearch` function below allows users to search for book titles within our goodbooks compilation. Users can take the output to guide their search for specific item recommendations.

In [280]:
# pip install fuzzywuzzy
# pip install python-Levenshtein
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

choices = CBF_tags['title']
search_value = input("Search book titles: ")

def booksearch(title):
    # fuzzy-match the title passed in (rather than the global search_value)
    fuzzy = process.extract(title, choices)
    results = [x[0] for x in fuzzy]
    print("\n".join(str(x) for x in results))

booksearch(search_value)


We also created a `title_recommendations` function, which finds the best title match for the user's input and runs the selection through our content-based recommender. The user can also select the number of recommendations they wish to receive.

In [None]:
title_input = input("Input book title to view recommendations: ")
    
def title_recommendations(title): 
    recommend_n = input("Input number of recommendations you would like to receive: ")
    # match the title passed in (rather than the global title_input) to the closest known title
    user_selection = process.extractOne(title, choices)[0]
    print("\nRecommending titles based on the book:", user_selection)
    return CBF_recommend(user_selection, int(recommend_n))

title_recommendations(title_input)
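
For readers without `fuzzywuzzy` installed, the title lookup can be approximated with the standard library. This is a sketch of a swapped-in technique (`difflib`), not the matcher used in this notebook; `closest_titles`, `toy_titles`, and the `cutoff` value are hypothetical choices for illustration.

```python
import difflib

# Hypothetical titles standing in for CBF_tags['title']
toy_titles = ['To Kill a Mockingbird', 'The Great Gatsby', 'Lord of the Flies']

def closest_titles(query, titles, n=3, cutoff=0.4):
    """Return up to n titles ranked by difflib similarity to the query (case-insensitive)."""
    lowered = {t.lower(): t for t in titles}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[h] for h in hits]

print(closest_titles('to kill a mockingbrd', toy_titles))
```

The first element of the returned list could then be passed to `CBF_recommend` in place of the `process.extractOne` result.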

#### Content-Based Analysis

Upon initial review, the `CBF_recommend` function appears to match book recommendations effectively based on the created item profiles. This method works nicely because it does not require data on other users and does not rank items by popularity.

However, we found this method suffers from a common drawback of the content-based approach: over-specialization. Unlike the "To Kill a Mockingbird" recommendations, the top results for "Nineteen Minutes" and "A Game of Thrones" are all other novels by the same author as the book we searched for.

## User-User Collaborative Filtering 

## Item-Item Collaborative Filtering 

User-user collaborative filtering has two common drawbacks:

1. Data sparsity: with a large number of items, each user has rated only a tiny percentage of them, which makes the correlation coefficient between users less reliable.
2. User profiles change quickly, so the entire model has to be recomputed, which is expensive in both time and computation.

To address these issues, we use item-item collaborative filtering.

<b>ITEM-ITEM collaborative filtering</b>
Item-item collaborative filtering looks for items that are similar to the items a user has already rated and recommends the most similar ones. Here, item-item similarity does not mean that two items share attributes, the way a fountain pen and a Pilot pen are similar because both are pens. Instead, two items are similar when people treat them the same way in terms of likes and dislikes.
This method is more stable than user-based collaborative filtering because the average item has many more ratings than the average user, so an individual rating has less impact.

To predict a rating for a target item *i*, we look at the set of items the target user has rated, compute how similar each one is to *i*, and select the *k* most similar items. The similarity between two items is calculated from the ratings of the users who have rated both items, using cosine similarity.<br>
Once we have the similarities between items, the prediction is computed as a weighted average of the target user's ratings on these similar items. The formula is very similar to the one for user-based collaborative filtering, except that the weights are similarities between items instead of between users, and we use the current user's ratings on other items instead of other users' ratings on the current item.
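
The two steps described above (item-item cosine similarity, then a similarity-weighted average of the user's own ratings) can be sketched on a toy matrix. This is an illustrative sketch, not the notebook's implementation: `R`, `item_cosine_similarity`, and `predict_rating` are hypothetical names, and unrated cells are simply treated as 0 rather than restricting to co-rating users.

```python
import numpy as np

# Toy user-item matrix: rows are users, columns are items, 0 means "not rated"
R = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
], dtype=float)

def item_cosine_similarity(ratings):
    """Cosine similarity between item columns, treating unrated cells as 0."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    return (ratings.T @ ratings) / np.outer(norms, norms)

def predict_rating(ratings, sim, user, item, k=2):
    """Weighted average of the user's own ratings on the k rated items most similar to `item`."""
    rated = np.flatnonzero(ratings[user])
    rated = rated[rated != item]
    if rated.size == 0:
        return 0.0
    top = rated[np.argsort(sim[item, rated])[::-1][:k]]
    weights = sim[item, top]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ ratings[user, top] / weights.sum())

S = item_cosine_similarity(R)
print(np.round(S, 3))
print(predict_rating(R, S, user=0, item=2))
```

Because the prediction is a convex combination of the user's existing ratings, it always falls between the lowest and highest rating that user gave to the selected neighbours.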

In [281]:
# make necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import correlation, cosine
import ipywidgets as widgets
from IPython.display import display, clear_output
from sklearn.metrics import pairwise_distances
from sklearn.metrics import mean_squared_error
from math import sqrt
import sys, os
from contextlib import contextmanager

# default neighbourhood size and distance metric
k = 4
metric = 'cosine'

# Return the first n items of a dict
def show_dict_item(n, dict_obj):
    return {k: dict_obj[k] for k in list(dict_obj)[:n]}


In [282]:

## Test on a subset of roughly 10K ratings (rows 1 through 9,999)
test_rating = ratings[1:10000]
test_rating.head()

Unnamed: 0,user_id,book_id,rating
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3
5,2,26,4


In [283]:
len(test_rating)

9999

In [284]:
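# books_df is not defined earlier in this notebook; the column subset below is
# reconstructed from the merged output shown underneath (an assumption)
books_df = books[['book_id', 'goodreads_book_id', 'isbn', 'authors', 'title',
                  'original_publication_year', 'average_rating']]
test_rating_item = pd.merge(test_rating, books_df, on='book_id')
test_rating_item.head()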

Unnamed: 0,user_id,book_id,rating,goodreads_book_id,isbn,authors,title,original_publication_year,average_rating
0,2,4081,4,231,312424442.0,Tom Wolfe,I am Charlotte Simmons,2004,3.4
1,258,4081,5,231,312424442.0,Tom Wolfe,I am Charlotte Simmons,2004,3.4
2,364,4081,4,231,312424442.0,Tom Wolfe,I am Charlotte Simmons,2004,3.4
3,316,4081,2,231,312424442.0,Tom Wolfe,I am Charlotte Simmons,2004,3.4
4,2,260,5,4865,,Dale Carnegie,How to Win Friends and Influence People,1936,4.13


In [285]:

test_rating_item = test_rating_item[['book_id','title','user_id','rating']]
test_rating_item.head()

Unnamed: 0,book_id,title,user_id,rating
0,4081,I am Charlotte Simmons,2,4
1,4081,I am Charlotte Simmons,258,5
2,4081,I am Charlotte Simmons,364,4
3,4081,I am Charlotte Simmons,316,2
4,260,How to Win Friends and Influence People,2,5


In [286]:
test_M1 =pd.pivot_table(test_rating_item,index='user_id',columns='title',values='rating',fill_value=0)

test_M1.T.head()

title \ user_id,1,2,4,6,8,9,10,11,15,18,...,429,439,440,444,446,447,449,452,453,454
"Angels (Walsh Family, #3)",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
'Salem's Lot,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"'Tis (Frank McCourt, #2)",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100 Selected Poems,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1776,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
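
The pivot step can be checked on a small example. This is a minimal sketch with made-up ratings; `toy_ratings` and `toy_matrix` are hypothetical names.

```python
import pandas as pd

# Made-up ratings in long format
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title':   ['Book A', 'Book B', 'Book A', 'Book C'],
    'rating':  [5, 3, 4, 2],
})

# Rows are users, columns are titles; unrated pairs are filled with 0
toy_matrix = pd.pivot_table(toy_ratings, index='user_id', columns='title',
                            values='rating', fill_value=0)
print(toy_matrix)

# Share of cells that hold no rating (sparsity)
sparsity = (toy_matrix.values == 0).mean()
print(f"sparsity: {sparsity:.2f}")
```

Note that `pivot_table` aggregates with the mean by default, so if a user rated the same title more than once, the cell would hold the average of those ratings.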


In [287]:
# Find the 5 nearest neighbours of each item by cosine distance
n = 5 
cosine_nn = NearestNeighbors(n_neighbors=n, algorithm='brute',metric='cosine')
item_cosine_nn_fit = cosine_nn.fit(test_M1.T.values)
item_distance, item_indices = item_cosine_nn_fit.kneighbors(test_M1.T.values )

In [288]:
# Cosine distance and item indices
# Note: this is cosine distance, not cosine similarity;
# similarity can be obtained as 1 - cosine distance
item_distance[1:3],item_indices[1:3]

(array([[0.        , 0.17798027, 0.22217421, 0.2602069 , 0.28073273],
        [0.        , 0.40911613, 0.41654003, 0.41654003, 0.41654003]]),
 array([[   1,  986, 1081, 1496, 1233],
        [   2,  143,  191, 1606, 2002]], dtype=int64))
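
To make the shape of these two arrays concrete, here is a minimal sketch on a made-up item-by-user matrix; `toy_items` and `toy_similarity` are hypothetical names.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy item-by-user matrix: each row is one item's rating vector (made-up values)
toy_items = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

nn = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine').fit(toy_items)
dist, idx = nn.kneighbors(toy_items)

# Column 0 is each item itself (distance ~0), column 1 is its closest other item
toy_similarity = 1 - dist
print(idx)
print(np.round(toy_similarity, 3))
```

In this toy case every item is its own nearest neighbour. When several items have identical rating vectors, they tie at distance 0, so an item is not guaranteed to come first in its own neighbour list; the `[11, 3, ...]` row in a later output of this notebook is an example.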

In [289]:
# List every book along with its 5 nearest neighbours (candidate recommendations)
items_dic = {}

for i in range (len(test_M1.T.index)):
    item_idx = item_indices[i]
    col_names = test_M1.T.index[item_idx].tolist()
    items_dic[test_M1.T.index[i]] = col_names

In [318]:
# Books recommended for each item title (first 2 entries)
show_dict_item(2,items_dic)

{' Angels (Walsh Family, #3)': ['The Druid of Shannara (Heritage of Shannara, #2)',
  'The Elfstones of Shannara  (The Original Shannara Trilogy, #2)',
  'Intensity',
  'Rose Madder',
  'The Black Unicorn (Magic Kingdom of Landover, #2)'],
 "'Salem's Lot": ["'Salem's Lot",
  'Night Shift',
  'Pet Sematary',
  'The Dead Zone',
  'Skeleton Crew']}

In [291]:
# Find the list of books read (rated) by each user

has_read= {}
row_indexes = {}
for i,row in test_M1.iterrows():
    rows = [x for x in range(0,len(test_M1.columns))]
    combine = list(zip(row.index,row.values,rows))
    read = [(x,z) for x,y,z in combine if y!=0]
    index = [i[1] for i in read]
    row_names = [i[0] for i in read]
    row_indexes[i] = index
    has_read[i] = row_names




In [320]:
# has_read.items()
type(has_read),show_dict_item(2,has_read)

(dict,
 {1: ['Balzac and the Little Chinese Seamstress',
   'Gilead (Gilead, #1)',
   'Housekeeping',
   'Never Let Me Go',
   'The Book Thief',
   'The History of Love',
   'The Sea'],
  2: ['Harry Potter Collection (Harry Potter, #1-6)',
   'Heart of Darkness',
   'How to Win Friends and Influence People',
   'I am Charlotte Simmons',
   'Memoirs of a Geisha',
   'The Da Vinci Code (Robert Langdon, #2)',
   'The Drama of the Gifted Child: The Search for the True Self',
   'The House of God',
   'The Millionaire Next Door: The Surprising Secrets of Americas Wealthy',
   'Who Moved My Cheese?']})

In [315]:
#row_indexes.items(),
type(row_indexes),show_dict_item(2,row_indexes)

{1: [197, 581, 688, 977, 1405, 1616, 1875],
 2: [633, 647, 692, 705, 899, 1486, 1518, 1629, 1737, 2162]}

In [294]:
item_indices[1:5]

array([[   1,  986, 1081, 1496, 1233],
       [   2,  143,  191, 1606, 2002],
       [  11,    3,  256,   69, 1764],
       [   4, 1303,  894, 1837, 1453]], dtype=int64)

In [295]:
# Sample of data 
# row_indexes item = (454, [5, 83, 215, 302, 437, 1031, 1707, 2063, 2098])  i.e. user ID and column indices of the books read
# item_distance = (array([[0.        , 0.17798027, 0.22217421, 0.2602069 , 0.28073273],
#         [0.        , 0.40911613, 0.41654003, 0.41654003, 0.41654003]]),
# item_indices =  array([[   1,  986, 1081, 1496, 1233],
#         [   2,  143,  191, 1606, 2002]], dtype=int64))
#-----------------------------------------------------------------------------------------------------------------
# For each user (e.g. user 454), look up the books already read, find their neighbours in
# item_indices and item_distance, drop the books the user has already read, and sort the
# remainder to store as the final recommendations

top_rec = {}
for user, read_idx in row_indexes.items():
    # flatten the neighbour indices and distances of every book the user has read
    item_idx = [j for i in item_indices[read_idx] for j in i]
    item_dist = [j for i in item_distance[read_idx] for j in i]
    # pair each neighbour with its distance
    combine = list(zip(item_dist, item_idx))
    # keep out the books already read
    diction = {i: d for d, i in combine if i not in read_idx}
    # sort so that the most similar (smallest distance) items come first
    sort = sorted(diction.items(), key=lambda x: x[1])
    # map column indices back to book titles using the user-item matrix,
    # e.g. test_M1.columns[2116] is 'Vernon God Little'
    recommendations = [(test_M1.columns[i], d) for i, d in sort]
    top_rec[user] = recommendations
          


In [296]:
# number of users in the user-item matrix
len(test_M1.index)

255

In [297]:
# Some tests
# check that the user exists in the user-item matrix
2 in test_M1.index 

True

In [298]:
# list all the books read by user 2
print("Books read so far: \n{}".format('\n'.join(has_read[2])))

Books read so far: 
Harry Potter Collection (Harry Potter, #1-6)
Heart of Darkness
How to Win Friends and Influence People
I am Charlotte Simmons
Memoirs of a Geisha
The Da Vinci Code (Robert Langdon, #2)
The Drama of the Gifted Child: The Search for the True Self
The House of God
The Millionaire Next Door: The Surprising Secrets of Americas Wealthy
Who Moved My Cheese?


In [299]:
# Build the final recommendation
def get_book_recommendation(user,number_of_rec=5):
    if user in test_M1.index :
        print("Books read so far: \n\n{}".format('\n'.join(has_read[user])))
        print()
        print("\n\nTOP {} RECOMMENDATIONS:".format(number_of_rec))
        # Print each book title along with its similarity score (1 - cosine distance)
        for title, dist in top_rec[user][:number_of_rec]:
            print('{} with similarity: {:.4f}'.format(title, 1-dist))
    else:
        print("Sorry, user not found")
    

In [300]:
# Propose recommendations for User 2
get_book_recommendation(2)


Books read so far: 

Harry Potter Collection (Harry Potter, #1-6)
Heart of Darkness
How to Win Friends and Influence People
I am Charlotte Simmons
Memoirs of a Geisha
The Da Vinci Code (Robert Langdon, #2)
The Drama of the Gifted Child: The Search for the True Self
The House of God
The Millionaire Next Door: The Surprising Secrets of Americas Wealthy
Who Moved My Cheese?



TOP 5 RECOMMENDATIONS:
Plainsong (Plainsong, #1) with similarity: 0.6402
Random Family: Love, Drugs, Trouble, and Coming of Age in the Bronx with similarity: 0.6402
Tender at the Bone: Growing Up at the Table with similarity: 0.6402
One Good Turn (Jackson Brodie, #2) with similarity: 0.6402
A Supposedly Fun Thing I'll Never Do Again:  Essays and Arguments with similarity: 0.6247


### Predict Rating of books using Item to Item (IICF)

In [301]:
# Predict rating of books
item_distance_p = 1 - item_distance  # actual similarity (1 - cosine distance)
test_predction = item_distance_p.T.dot(test_M1.T.values)/np.array([np.abs(item_distance_p.T).sum(axis = 1)]).T
real_rating = test_M1.T.values[item_distance_p.argsort()[0]]
test_predction
item_distance_p.T.dot(test_M1.T.values)

array([[ 23.        ,  43.        , 364.        , ...,  13.        ,
        244.        ,  42.        ],
       [ 12.84371217,  30.60439691, 222.17943689, ...,   7.15978527,
        149.09282941,  22.07803942],
       [ 12.77437879,  28.78289484, 207.62542351, ...,   6.49645683,
        141.98679436,  20.30402605],
       [ 12.74387464,  25.70639651, 195.08457672, ...,   6.43217442,
        135.55939244,  19.68470576],
       [ 12.15220253,  24.04089531, 187.81210418, ...,   6.40233125,
        129.96588445,  18.14283567]])

In [302]:
test_predction

array([[0.01039313, 0.01943064, 0.1644826 , ..., 0.00587438, 0.11025757,
        0.01897876],
       [0.0069546 , 0.01657164, 0.12030552, ..., 0.00387687, 0.08073065,
        0.0119548 ],
       [0.00720121, 0.01622559, 0.11704326, ..., 0.0036622 , 0.08004125,
        0.01144585],
       [0.00744793, 0.01502364, 0.11401366, ..., 0.00375917, 0.07922524,
        0.01150437],
       [0.00733897, 0.0145188 , 0.11342365, ..., 0.0038665 , 0.07848911,
        0.01095684]])

### RMSE Calculation

In [303]:
def rmse(prediction,real):
    prediction = prediction[real.nonzero()].flatten()
    real= real[real.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction,real))
    

In [304]:
error_rate = rmse(test_predction,real_rating)
print("Accuracy:{:.3f}".format(100-error_rate))
print("RMSE: {:.3f}".format(error_rate))
error_rate

Accuracy:96.353
RMSE: 3.647


3.6470512670697888
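
The idea behind the `rmse` helper, scoring only the cells that hold an observed rating, can be verified on a tiny example. This is a sketch with made-up arrays; `masked_rmse`, `toy_real`, and `toy_pred` are hypothetical names.

```python
import numpy as np

def masked_rmse(prediction, real):
    """RMSE over the cells where `real` holds an observed (non-zero) rating."""
    mask = real != 0
    diff = prediction[mask] - real[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Two observed ratings (5 and 3); the zeros are unrated cells and are ignored
toy_real = np.array([[5.0, 0.0], [0.0, 3.0]])
toy_pred = np.array([[4.0, 2.0], [1.0, 5.0]])
print(masked_rmse(toy_pred, toy_real))
```

The errors on the two observed cells are -1 and 2, so the result is the square root of 2.5. RMSE is expressed in rating units (here a 1 to 5 scale), so it is usually reported on its own rather than converted to a percentage.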

## Summary
Please provide at least one graph, and a textual summary of your findings and recommendations. 

## Sources

**To do: figure out jupyter nbconvert citations**

http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/

@article{goodbooks2017,
    author = {Zajac, Zygmunt},
    title = {Goodbooks-10k: a new dataset for book recommendations},
    year = {2017},
    publisher = {FastML},
    journal = {FastML},
    howpublished = {\url{http://fastml.com/goodbooks-10k}},
}
