# The story of building a collaborative filter
## Introduction

The purpose of this notebook is to practice building a collaborative recommender. I followed code posted on Kaggle, which also hosts the data.

* Code credit: [Collaborative recommender system on Goodreads](https://www.kaggle.com/sriharshavogeti/collaborative-recommender-system-on-goodreads)
* Data: [goodbooks-10k](https://www.kaggle.com/zygmunt/goodbooks-10k)

## Preliminaries

Import the necessary packages

In [126]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Read the data

In [107]:
df_books = pd.read_csv('../data/books.csv')
df_ratings = pd.read_csv('../data/ratings.csv')
df_tags = pd.read_csv('../data/tags.csv')
df_book_tags = pd.read_csv('../data/book_tags.csv')

Look at the info for each dataframe 

In [108]:
df_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
id                           10000 non-null int64
book_id                      10000 non-null int64
best_book_id                 10000 non-null int64
work_id                      10000 non-null int64
books_count                  10000 non-null int64
isbn                         9300 non-null object
isbn13                       9415 non-null float64
authors                      10000 non-null object
original_publication_year    9979 non-null float64
original_title               9415 non-null object
title                        10000 non-null object
language_code                8916 non-null object
average_rating               10000 non-null float64
ratings_count                10000 non-null int64
work_ratings_count           10000 non-null int64
work_text_reviews_count      10000 non-null int64
ratings_1                    10000 non-null int64
ratings_2                    10000 n

In [27]:
df_books.head(2)

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...


In [34]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981756 entries, 0 to 981755
Data columns (total 3 columns):
book_id    981756 non-null int64
user_id    981756 non-null int64
rating     981756 non-null int64
dtypes: int64(3)
memory usage: 22.5 MB


In [29]:
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34252 entries, 0 to 34251
Data columns (total 2 columns):
tag_id      34252 non-null int64
tag_name    34252 non-null object
dtypes: int64(1), object(1)
memory usage: 535.3+ KB


In [30]:
df_book_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999912 entries, 0 to 999911
Data columns (total 3 columns):
goodreads_book_id    999912 non-null int64
tag_id               999912 non-null int64
count                999912 non-null int64
dtypes: int64(3)
memory usage: 22.9 MB


In [149]:
print(f'The number of reviews is {len(df_ratings)}.')

The number of reviews is 981756.


## More preliminaries...

This section is for tests to understand the following code.

First, let's look at the `groupby` method:
* It returns a GroupBy object that contains information about the groups.
* An aggregation function (such as `mean`, `max`, or `sum`) has to be applied to the groups to see a result.
* The example below is from the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).

In [155]:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df

Unnamed: 0,Animal,Max Speed
0,Falcon,380.0
1,Falcon,370.0
2,Parrot,24.0
3,Parrot,26.0


In [162]:
# a notebook cell displays only its last expression, so only the sum appears below
df.groupby(['Animal']).mean()
df.groupby(['Animal']).max()
df.groupby(['Animal']).sum()

Animal,Max Speed
Falcon,750.0
Parrot,50.0


So we take the first step in our recommender and group the user ratings by book.

In [156]:
# create groups in the dataframe
testdf = df_ratings[['user_id','rating']].groupby(df_ratings['book_id'])


In [163]:
print(f'There are {len(testdf.groups.keys())} groups (think: books) in the ratings dataframe.')

There are 10000 groups (think: books) in the ratings dataframe.


In [167]:
testdf.get_group(1)
print(f'In group 1 there are {len(testdf.get_group(1))} reviews.')

In group 1 there are 100 reviews.


## Recommender prep code
Here, we build a list of dictionaries from the ratings dataframe: one dictionary per book, mapping each `user_id` to that user's rating. See the comments in the code.


In [113]:
listOfDictonaries = []
indexMap = {}
reverseIndexMap = {}
ptr = 0

for book_key in testdf.groups.keys():  # the keys are the book_ids that have ratings
    tempDict = {}

    groupDF = testdf.get_group(book_key)  # new df: every rating for this one book

    for idx in range(0, len(groupDF)):  # for each rating of this book
        # column 0 is user_id, column 1 is rating, so tempDict maps user_id -> rating
        tempDict[groupDF.iloc[idx, 0]] = groupDF.iloc[idx, 1]

    listOfDictonaries.append(tempDict)

    # remember which list position belongs to which book, in both directions
    indexMap[ptr] = book_key         # list position -> book_id
    reverseIndexMap[book_key] = ptr  # book_id -> list position (for the next step)
    ptr = ptr + 1

So, we now have a list of dictionaries where each dictionary holds one book's ratings, keyed by user.
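
The loop above reads one cell at a time with `iloc`, which can be slow on almost a million ratings. Below is a minimal sketch of a shorter way to build the same kind of structure by iterating over the groupby directly. The `toy_*` names and the rating values are made up for illustration; only the column names come from `df_ratings`.

```python
import pandas as pd

# Toy stand-in for df_ratings (hypothetical values, same column names).
toy_ratings = pd.DataFrame({
    'book_id': [1, 1, 2, 2, 2],
    'user_id': [10, 11, 10, 12, 13],
    'rating':  [5, 3, 4, 2, 1],
})

toy_dicts = []      # one {user_id: rating} dictionary per book
toy_index_map = {}  # list position -> book_id

# Iterating over a groupby yields (key, sub-dataframe) pairs in sorted key order.
for position, (book_id, group) in enumerate(toy_ratings.groupby('book_id')):
    # zip pairs each user_id with its rating; .tolist() gives plain Python ints
    toy_dicts.append(dict(zip(group['user_id'].tolist(), group['rating'].tolist())))
    toy_index_map[position] = int(book_id)

# book_id -> list position, the inverse of toy_index_map
toy_reverse_index_map = {book_id: pos for pos, book_id in toy_index_map.items()}

print(toy_dicts)
print(toy_index_map)
```

As in the original loop, if the same user rated a book more than once, the later rating overwrites the earlier one in the dictionary.
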
_____________

An aside: do all the books have 100 ratings? (Answer: no.)

In [114]:
len(listOfDictonaries[0])

100

In [1]:
# for idx in range(len(listOfDictonaries)):  # index by list position, not by book_id
#     if len(listOfDictonaries[idx]) != 100:
#         print(idx)

In [116]:
len(listOfDictonaries[11])

99
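
The same question can probably be answered without the list of dictionaries, by counting rows per group. This is a small sketch on made-up data; `toy_ratings` and `expected_count` are hypothetical names, and on the real data the frame would be `df_ratings` and the count would be 100.

```python
import pandas as pd

# Toy stand-in for df_ratings (hypothetical values, same column names).
toy_ratings = pd.DataFrame({
    'book_id': [1, 1, 1, 2, 2, 3],
    'user_id': [10, 11, 12, 10, 13, 11],
    'rating':  [5, 3, 4, 2, 1, 4],
})

expected_count = 3  # hypothetical; the notebook asks about 100

# size() returns the number of rows in each group, indexed by book_id
ratings_per_book = toy_ratings.groupby('book_id').size()

# keep only the books whose row count differs from the expected one
short = ratings_per_book[ratings_per_book != expected_count]
print(short)
```

Note that `size()` counts rows, while the length of a dictionary counts distinct users, so the two numbers can differ if a user rated the same book more than once.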

_______________

## Create the collaborative recommender

In [117]:
# use DictVectorizer
dictVectorizer = DictVectorizer(sparse=True)  # create a DictVectorizer instance
vector = dictVectorizer.fit_transform(listOfDictonaries)  # fit & transform the dictionaries
vector  # a sparse matrix of floats: one row per book, one column per user

<10000x53424 sparse matrix of type '<class 'numpy.float64'>'
	with 979478 stored elements in Compressed Sparse Row format>

____________________________

Whoa! What is going on here? 
* `DictVectorizer` transforms lists of feature-value mappings (the dictionaries) into vectors.
* The example below is from the documentation ([here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html)):

In [175]:
v = DictVectorizer(sparse=False)
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
D_df = pd.DataFrame(D)
X = v.fit_transform(D)
D

[{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]

In [176]:
D_df

Unnamed: 0,bar,baz,foo
0,2.0,,1
1,,1.0,3


In [178]:
X

array([[2., 0., 1.],
       [0., 1., 3.]])
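
To see which column is which, the fitted vectorizer can be inspected. This sketch repeats the documentation example and looks at the `feature_names_` and `vocabulary_` attributes of the fitted `DictVectorizer`.

```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = v.fit_transform(D)

# column labels, in column order (sorted by default)
print(v.feature_names_)

# the same information as a lookup: feature name -> column index
print(v.vocabulary_)
```

Two things are worth noticing. A key that is missing from a dictionary becomes 0 in the array, not NaN as in `D_df`. And in the recommender the keys are `user_id` values, so each of the 53424 columns corresponds to one user and a 0 means "this user did not rate this book".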

_______________

We compute a cosine similarity matrix to find books whose rating vectors point in similar directions, that is, books that the same users rated in a similar way.
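
For intuition, here is a minimal sketch of the calculation on a tiny made-up matrix, using only NumPy. Cosine similarity between two vectors is their dot product divided by the product of their lengths. The `toy` matrix and its values are hypothetical; `cosine_similarity` from scikit-learn computes the same quantity for every pair of rows.

```python
import numpy as np

# Hypothetical ratings: rows are books, columns are users, 0 means "no rating".
toy = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])

# length (Euclidean norm) of each book's rating vector
norms = np.linalg.norm(toy, axis=1)

# dot product of every pair of rows, divided by the product of their lengths
toy_similarity = (toy @ toy.T) / np.outer(norms, norms)

print(np.round(toy_similarity, 3))
```

Every book has similarity 1 with itself, and two books with no raters in common (rows 0 and 2 here) get similarity 0.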

In [119]:
pairwiseSimilarity = cosine_similarity(vector)

In [120]:
type(pairwiseSimilarity)

numpy.ndarray

## Bazinga!!
And finally...

In [140]:
def printBookDetails(bookID):
    print("Title:", df_books[df_books['id']==bookID]['original_title'].values[0])
    print("Author:",df_books[df_books['id']==bookID]['authors'].values[0])
    print("Printing Book-ID:",bookID)
    print("=================++++++++++++++=========================")


def getTopRecommandations(bookID):
    row = reverseIndexMap[bookID]
    print("------INPUT BOOK--------")
    printBookDetails(bookID)
    print("-------RECOMMENDATIONS----------")
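    # argsort is ascending, so the most similar rows come last; [-7:-2] keeps five
    # positions and skips the final two (the last one is normally the input book itself)
    for i in np.argsort(pairwiseSimilarity[row])[-7:-2][::-1]:
        printBookDetails(indexMap[i])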
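
The slice `[-7:-2][::-1]` is easy to misread, so here is a small sketch of what it selects. The similarity values in `toy_row` are made up, and the sketch assumes the book's own similarity of 1 sorts into the last position.

```python
import numpy as np

# Hypothetical similarity row for one book; position 2 is the book itself.
toy_row = np.array([0.10, 0.80, 1.00, 0.35, 0.60, 0.05, 0.45, 0.20])

# positions sorted from least similar to most similar
order = np.argsort(toy_row)

# the slice used in getTopRecommandations: five positions, skipping the last two
as_written = order[-7:-2][::-1]

# an alternative that skips only the last position (the book itself)
top_five = order[-6:-1][::-1]

print(as_written)
print(top_five)
```

On this toy row, the slice as written leaves out position 1, which is the closest neighbour after the book itself. If the intent is the five nearest neighbours, `[-6:-1]` may be closer to it, but that is my reading of the code rather than something the original Kaggle source states.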

In [141]:
printBookDetails(1245)

Title: The Brethren
Author: John Grisham
Printing Book-ID: 1245


In [127]:
getTopRecommandations(1)

------INPUT BOOK--------
Title: The Hunger Games
Author: Suzanne Collins
Printing Book-ID: 1
-------RECOMMENDATIONS----------
Title: The Help
Author: Kathryn Stockett
Printing Book-ID: 31
Title: Harry Potter and the Philosopher's Stone
Author: J.K. Rowling, Mary GrandPré
Printing Book-ID: 2
Title: Mockingjay
Author: Suzanne Collins
Printing Book-ID: 20
Title: Twilight
Author: Stephenie Meyer
Printing Book-ID: 3
Title: The Secret Garden
Author: Frances Hodgson Burnett
Printing Book-ID: 93


We can search by author.

In [146]:
df_books[df_books['authors']=='Ray Bradbury']

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
47,48,4381,4381,1272463,507,307347974,9780307000000.0,Ray Bradbury,1953.0,Fahrenheit 451,...,570498,1176240,30694,28366,64289,238242,426292,419051,https://images.gr-assets.com/books/1351643740m...,https://images.gr-assets.com/books/1351643740s...
636,637,76778,76778,4636013,271,553278223,9780553000000.0,Ray Bradbury,1950.0,The Martian Chronicles,...,143236,156328,5204,1666,5808,28385,56556,63913,https://images.gr-assets.com/books/1374049948m...,https://images.gr-assets.com/books/1374049948s...
1640,1641,248596,248596,1183550,136,380729407,9780381000000.0,Ray Bradbury,1962.0,Something Wicked This Way Comes,...,64813,71886,4526,1121,3850,15942,27380,23593,https://images.gr-assets.com/books/1409596011m...,https://images.gr-assets.com/books/1409596011s...
1979,1980,24830,24830,1065861,123,000712774X,9780007000000.0,Ray Bradbury,1951.0,The Illustrated Man,...,49551,57968,2615,341,1688,10484,22836,22619,https://images.gr-assets.com/books/1374049820m...,https://images.gr-assets.com/books/1374049820s...
3126,3127,50033,50033,1627774,136,671037706,9780671000000.0,Ray Bradbury,1957.0,Dandelion Wine,...,32867,39979,3226,730,1958,7453,13001,16837,https://images.gr-assets.com/books/1374049845m...,https://images.gr-assets.com/books/1374049845s...
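
One caveat: `==` only matches when the `authors` field is exactly the search string, so a book with several listed authors (such as "J.K. Rowling, Mary GrandPré" above) would be missed. A minimal sketch of a substring search with `str.contains`, on a toy frame whose rows are copied from the outputs in this notebook (`toy_books` is a hypothetical name):

```python
import pandas as pd

# Toy stand-in for df_books with a few rows seen earlier in the notebook.
toy_books = pd.DataFrame({
    'id': [2, 48, 637],
    'authors': ['J.K. Rowling, Mary GrandPré', 'Ray Bradbury', 'Ray Bradbury'],
    'title': ["Harry Potter and the Philosopher's Stone",
              'Fahrenheit 451',
              'The Martian Chronicles'],
})

# exact match: finds nothing, because the field also lists the illustrator
exact = toy_books[toy_books['authors'] == 'J.K. Rowling']

# substring match: regex=False treats the dots literally, na=False skips missing values
partial = toy_books[toy_books['authors'].str.contains('J.K. Rowling', regex=False, na=False)]

print(len(exact))
print(partial[['id', 'title']])
```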


In [147]:
getTopRecommandations(3127)

------INPUT BOOK--------
Title: Dandelion Wine
Author: Ray Bradbury
Printing Book-ID: 3127
-------RECOMMENDATIONS----------
Title: The Illustrated Man
Author: Ray Bradbury
Printing Book-ID: 1980
Title: The Martian Chronicles
Author: Ray Bradbury
Printing Book-ID: 637
Title: Demian: Die Geschichte einer Jugend
Author: Hermann Hesse
Printing Book-ID: 2097
Title: White Fang
Author: Jack London
Printing Book-ID: 935
Title: I, Robot
Author: Isaac Asimov
Printing Book-ID: 432
