### Collaborative Filtering with K Nearest Neighbors

Collaborative filtering recommends products that were liked by "similar" users. The nearest neighbors algorithm uses the ratings of the "most similar" users.


Let's start by loading the data. What do we know about Pandas? Try to remember how we read a CSV file. Using that, load the ratings .csv file as given.

In [2]:
#@title


import pandas as pd
dataFile='data/BX-CSV-Dump/BX-Book-Ratings.csv'
data=pd.read_csv(dataFile,sep=";",header=0,names=["user","isbn","rating"], 
                encoding = 'iso-8859-1')
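To see what `header=0` combined with `names` does, here is a small sketch that parses an in-memory sample instead of the real file. The sample rows and the original header labels are assumptions made up for illustration; only the `read_csv` arguments mirror the cell above.

```python
import io
import pandas as pd

# Hypothetical sample in the same semicolon-separated layout as the ratings file.
sample = io.StringIO(
    "User-ID;ISBN;Book-Rating\n"
    "276725;034545104X;0\n"
    "276726;0155061224;5\n"
)

# header=0 marks the first row as the header; names= then replaces its labels.
ratingsSample = pd.read_csv(sample, sep=";", header=0,
                            names=["user", "isbn", "rating"])
print(ratingsSample)
```

Because one ISBN contains a letter, the whole `isbn` column is kept as text, so leading zeros in the other ISBNs survive.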

How will you display the first few lines of data?

In [3]:
#@title
data.head()

,user,isbn,rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


Now, let's get the books. You have to read them from a .csv file again.

In [0]:
#@title
bookFile='BX-Books.csv'
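# on_bad_lines="skip" (pandas >= 1.3) replaces the removed error_bad_lines=False;
# the encoding is assumed to match the ratings file
books=pd.read_csv(bookFile,sep=";",header=0,on_bad_lines="skip", usecols=[0,1,2],index_col=0,names=['isbn',"title","author"],
                  encoding='iso-8859-1')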

Show the first few lines of the data you just collected. 

In [0]:
#@title
books.head()

isbn,title,author
0195153448,Classical Mythology,Mark P. O. Morford
0002005018,Clara Callan,Richard Bruce Wright
0060973129,Decision in Normandy,Carlo D'Este
0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
0393045218,The Mummies of Urumchi,E. J. W. Barber


Define a function named bookMeta that returns the title and author of a book when given the ISBN.

In [0]:
#@title
def bookMeta(isbn):
    title = books.at[isbn,"title"]
    author = books.at[isbn,"author"]
    return title, author
bookMeta("0671027360")

('Angels &amp; Demons', 'Dan Brown')
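Note that `books.at` raises a `KeyError` when the ISBN is not in the index. Below is a minimal sketch of a more forgiving lookup, assuming you would rather get `None` values than an exception. The names `bookMetaSafe` and `toyBooks` are hypothetical and exist only for this illustration; the two rows are copied from outputs shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the books table, indexed by ISBN.
toyBooks = pd.DataFrame(
    {"title": ["Angels &amp; Demons", "The Da Vinci Code"],
     "author": ["Dan Brown", "Dan Brown"]},
    index=pd.Index(["0671027360", "0385504209"], name="isbn"),
)

def bookMetaSafe(isbn, table):
    """Return (title, author), or (None, None) if the ISBN is not in the table."""
    if isbn not in table.index:
        return None, None
    return table.at[isbn, "title"], table.at[isbn, "author"]

print(bookMetaSafe("0671027360", toyBooks))
print(bookMetaSafe("0000000000", toyBooks))
```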

Keep only the ratings whose ISBN appears in the index of books, so that every rated book has a title and an author.

In [0]:
#@title
data = data[data["isbn"].isin(books.index)]

Now define a function named faveBooks that takes a user and a number N. Return that user's N highest ratings, sorted in descending order, together with the title and author of each book.

In [0]:
#@title
def faveBooks(user,N):
    userRatings = data[data["user"]==user]
    # Highest ratings first; copy the top N rows so adding a column does not touch data
    sortedRatings = userRatings.sort_values("rating",ascending=False)[:N].copy()
    sortedRatings["title"] = sortedRatings["isbn"].apply(bookMeta)
    return sortedRatings

Call the function you made above for a sample user to make sure it works.

In [0]:
#@title
faveBooks(204622,5)

,user,isbn,rating,title
844955,204622,0967560500,10,"(Natural Hormonal Enhancement, Rob Faigin)"
844935,204622,0671027360,10,"(Angels &amp; Demons, Dan Brown)"
844926,204622,0385504209,10,"(The Da Vinci Code, Dan Brown)"
844958,204622,097173660X,9,"(Life After School Explained, Cap &amp; Compass)"
844920,204622,0060935464,9,"(To Kill a Mockingbird, Harper Lee)"


Display the shape of the data here.

In [0]:
#@title
data.shape

(1031175, 3)

Count how many ratings each ISBN has received. You can use the Pandas value_counts() function here. Store the result in a variable named usersPerISBN, then show its first few lines and its shape.

In [0]:
#@title
usersPerISBN = data.isbn.value_counts()
usersPerISBN.head(10)

0971880107    2502
0316666343    1295
0385504209     883
0060928336     732
0312195516     723
044023722X     647
0142001740     615
067976402X     614
0671027360     586
0446672211     585
Name: isbn, dtype: int64

In [0]:
#@title
usersPerISBN.shape

(270170,)

Do the same per user: store the counts in ISBNsPerUser and show its shape.

In [0]:
#@title
ISBNsPerUser = data.user.value_counts()

In [0]:
#@title
ISBNsPerUser.shape

(92107,)

Keep only the books rated by more than 10 users (usersPerISBN>10) and the users who rated more than 10 books (ISBNsPerUser>10).

In [0]:
#@title
data = data[data["isbn"].isin(usersPerISBN[usersPerISBN>10].index)]

In [0]:
#@title
data = data[data["user"].isin(ISBNsPerUser[ISBNsPerUser>10].index)]
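The counting and filtering steps can be tried on a toy table first. This is a sketch under stated assumptions: the six rows, the names `toy`, `usersPerBook`, `booksPerUser` and `minCount`, and the threshold of 1 (instead of 10) are all made up so that the example stays non-empty.

```python
import pandas as pd

toy = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 3],
    "isbn":   ["A", "B", "C", "A", "B", "A"],
    "rating": [5, 3, 0, 4, 0, 9],
})

# Number of ratings per book and per user.
usersPerBook = toy["isbn"].value_counts()
booksPerUser = toy["user"].value_counts()

minCount = 1  # the notebook uses 10
keptBooks = usersPerBook[usersPerBook > minCount].index
keptUsers = booksPerUser[booksPerUser > minCount].index

# Same effect as applying the two filters one after the other.
filtered = toy[toy["isbn"].isin(keptBooks) & toy["user"].isin(keptUsers)]
print(filtered)
```

As in the notebook, the counts are computed once before filtering and are not recomputed afterwards, so a book or user that passed the threshold on the full data can end up with fewer ratings in the filtered data.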

Create a pivot table named userItemRatingMatrix with one row per user and one column per ISBN, as explained in the theory. Each cell holds that user's rating of that book, or NaN if the user has not rated it. Show the first few lines and the shape of this matrix.

In [0]:
#@title
userItemRatingMatrix=pd.pivot_table(data, values='rating',
                                    index=['user'], columns=['isbn'])

In [0]:
#@title
userItemRatingMatrix.head()

user,0002005018,0002251760,0002259834,0002558122,0006480764,000648302X,0006485200,000649840X,000651202X,0006512062,...,8845906884,8845915611,8878188212,8885989403,9074336329,9074336469,950491036X,9681500830,9681500954,9871138016
8,5.0,,,,,,,,,,...,,,,,,,,,,
99,,,,,,,,,,,...,,,,,,,,,,
242,,,,,,,,,,,...,,,,,,,,,,
243,,,,,,,,,,,...,,,,,,,,,,
254,,,,,,,,,,,...,,,,,,,,,,


In [0]:
#@title
userItemRatingMatrix.shape

(10706, 15451)
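A toy pivot makes the structure of this matrix easier to see. The four ratings and the names `toy` and `toyMatrix` below are assumptions for illustration only; the `pivot_table` call uses the same arguments as the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "user":   [1, 1, 2, 3],
    "isbn":   ["A", "B", "A", "C"],
    "rating": [5, 3, 4, 9],
})

# One row per user, one column per ISBN; unrated pairs become NaN.
toyMatrix = pd.pivot_table(toy, values="rating", index=["user"], columns=["isbn"])
print(toyMatrix)
print(toyMatrix.shape)
```

Since `pivot_table` aggregates with the mean by default, a user who rated the same ISBN more than once would get the average of those ratings in the corresponding cell.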

Let's pick two users and do some analysis.

In [0]:
#@title
user1 = 204622
user2 = 255489

Transpose the matrix to extract user 1's and user 2's ratings, and display the first few lines.

In [0]:
#@title
user1Ratings = userItemRatingMatrix.transpose()[user1]
user1Ratings.head()

isbn
0002005018   NaN
0002251760   NaN
0002259834   NaN
0002558122   NaN
0006480764   NaN
Name: 204622, dtype: float64

In [0]:
#@title
user2Ratings = userItemRatingMatrix.transpose()[user2]

Now measure how far apart the two users are with the Hamming distance. Import hamming from scipy.spatial.distance and apply it to the two rating vectors, then wrap the computation in a function named distance.

In [0]:
#@title
from scipy.spatial.distance import hamming 
hamming(user1Ratings,user2Ratings)

0.9999352792699502
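Why is the distance so close to 1? The Hamming distance is the fraction of positions where the two vectors differ, and NaN never compares equal to NaN, so every book that neither user rated counts as a mismatch. With 15451 books in the matrix, the value above is consistent with the two users agreeing on exactly one rating (1 - 1/15451). The sketch below shows the effect on two made-up vectors; `hammingOnCommon` is a hypothetical helper, not part of SciPy, that restricts the comparison to books both users rated.

```python
import numpy as np
from scipy.spatial.distance import hamming

a = np.array([5.0, np.nan, 3.0, np.nan])
b = np.array([5.0, np.nan, 4.0, 2.0])

# Raw distance: only position 0 matches, and NaN vs NaN counts as a mismatch.
raw = hamming(a, b)

def hammingOnCommon(u, v):
    """Hamming distance over the positions where both vectors hold a rating."""
    both = ~np.isnan(u) & ~np.isnan(v)
    if not both.any():
        return np.nan
    return hamming(u[both], v[both])

common = hammingOnCommon(a, b)
print(raw, common)
```

The notebook keeps the raw variant, which rewards users who gave identical ratings to the same books and treats everything else as a difference.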

In [0]:
#@title
import numpy as np
def distance(user1,user2):
    # Hamming distance between two users' rating vectors; NaN if it cannot be computed
    try:
        user1Ratings = userItemRatingMatrix.transpose()[user1]
        user2Ratings = userItemRatingMatrix.transpose()[user2]
        dist = hamming(user1Ratings,user2Ratings)
    except Exception:
        dist = np.nan
    return dist

Compute the distance between two users and display it.

In [0]:
#@title
distance(204622,10118)

0.9998705585399004

Pick one user and build a DataFrame named allUsers that lists every other user, leaving the chosen user out. The goal is to find the distance between that user and all the other ones. Store the distances in a column named distance.

In [0]:
#@title
user = 204622
allUsers = pd.DataFrame(userItemRatingMatrix.index)
allUsers = allUsers[allUsers.user!=user]
allUsers.head()

,user
0,8
1,99
2,242
3,243
4,254


In [0]:
#@title
allUsers["distance"] = allUsers["user"].apply(lambda x: distance(user,x))

In [0]:
#@title
allUsers.head()

,user,distance
0,8,1.0
1,99,1.0
2,242,0.999935
3,243,0.999935
4,254,1.0


Let's use K Nearest Neighbors now! Set K to 10, sort allUsers by distance in ascending order, and display the K nearest users.

In [0]:
#@title
K = 10
KnearestUsers = allUsers.sort_values(["distance"],ascending=True)["user"][:K]

In [0]:
#@title
KnearestUsers

3201     82893
3368     87555
2624     68555
1813     48046
5401    140036
7584    198711
565      16795
8866    232131
239       7346
9693    251422
Name: user, dtype: int64

Wrap these steps in a function named nearestNeighbors and use it to find the KnearestUsers for the user we picked.

In [0]:
#@title
def nearestNeighbors(user,K=10):
    allUsers = pd.DataFrame(userItemRatingMatrix.index)
    allUsers = allUsers[allUsers.user!=user]
    allUsers["distance"] = allUsers["user"].apply(lambda x: distance(user,x))
    KnearestUsers = allUsers.sort_values(["distance"],ascending=True)["user"][:K]
    return KnearestUsers
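Calling `distance` once per user transposes the full matrix on every call, which can be slow. The following is a minimal sketch of a vectorized alternative, assuming the matrix fits in memory as a NumPy array. `nearestNeighborsVectorized` and `toyMatrix` are hypothetical names, and unlike `nearestNeighbors` the sketch returns the distances indexed by user id rather than only the user ids.

```python
import numpy as np
import pandas as pd

def nearestNeighborsVectorized(matrix, user, K=10):
    """Return the K smallest Hamming distances from `user`, indexed by user id."""
    target = matrix.loc[user].to_numpy()
    others = matrix.drop(index=user)
    # Elementwise comparison; NaN never equals anything, as with scipy's hamming.
    mismatches = others.to_numpy() != target
    distances = pd.Series(mismatches.mean(axis=1), index=others.index)
    return distances.nsmallest(K)

toyMatrix = pd.DataFrame(
    [[5.0, np.nan, 3.0],
     [5.0, np.nan, 4.0],
     [1.0, 2.0, np.nan]],
    index=pd.Index([1, 2, 3], name="user"),
    columns=pd.Index(["A", "B", "C"], name="isbn"),
)
nearest = nearestNeighborsVectorized(toyMatrix, user=1, K=1)
print(nearest)
```

In the toy matrix, user 2 shares one identical rating with user 1 and user 3 shares none, so user 2 is the nearest neighbor.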

In [0]:
#@title
KnearestUsers = nearestNeighbors(user)

In [0]:
#@title
KnearestUsers

3201     82893
3368     87555
2624     68555
1813     48046
5401    140036
7584    198711
565      16795
8866    232131
239       7346
9693    251422
Name: user, dtype: int64

Now select the ratings of the nearest neighbors and store them in NNRatings.

In [0]:
#@title
NNRatings = userItemRatingMatrix[userItemRatingMatrix.index.isin(KnearestUsers)]
NNRatings

user,0002005018,0002251760,0002259834,0002558122,0006480764,000648302X,0006485200,000649840X,000651202X,0006512062,...,8845906884,8845915611,8878188212,8885989403,9074336329,9074336469,950491036X,9681500830,9681500954,9871138016
7346,,,,,,,,,,,...,,,,,,,,,,
16795,,,,,,,,,,,...,,,,,,,,,,
48046,,,,,,,,,,,...,,,,,,,,,,
68555,,,,,,,,,,,...,,,,,,,,,,
82893,,,,,,,,,,,...,,,,,,,,,,
87555,,,,,,,,,,,...,,,,,,,,,,
140036,,,,,,,,,,,...,,,,,,,,,,
198711,,,,,,,,,,,...,,,,,,,,,,
232131,,,,,,,,,,,...,,,,,,,,,,
251422,,,,,,,,,,,...,,,,,,,,,,


Find the average rating of each book in NNRatings, ignoring missing values.

In [0]:
#@title
avgRating = NNRatings.apply(np.nanmean).dropna()
avgRating.head()

isbn
0007154615    1.5
0020125305    0.0
0020125607    0.0
0020198817    0.0
0020198906    8.0
dtype: float64
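`np.nanmean` emits a RuntimeWarning for every column that has no ratings at all. A possible alternative, sketched here on made-up data, is `DataFrame.mean`, which skips NaN by default and returns NaN for an all-NaN column. The names `toyNN` and `toyAvg` are assumptions for this illustration.

```python
import numpy as np
import pandas as pd

toyNN = pd.DataFrame(
    {"A": [8.0, np.nan], "B": [np.nan, np.nan], "C": [2.0, 4.0]},
    index=pd.Index([10, 20], name="user"),
)

# Column-wise mean that ignores NaN; the all-NaN column B is then dropped.
toyAvg = toyNN.mean().dropna()
print(toyAvg)
```

Book A keeps the single rating it received, and book C gets the mean of its two ratings.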

Now find the books the user has already read.

In [0]:
#@title
booksAlreadyRead = userItemRatingMatrix.transpose()[user].dropna().index
booksAlreadyRead

Index([u'006016848X', u'0060935464', u'0140042598', u'0140178724',
       u'0142004278', u'0380732238', u'0385504209', u'0425109720',
       u'0425152898', u'0440136482', u'0440241162', u'0451191145',
       u'0451197127', u'0553096060', u'0671027360', u'0671027387',
       u'0671666258', u'0688174574', u'0743225708', u'076790592X',
       u'0785264280', u'0786868716', u'0802131867', u'0802132952',
       u'0971880107', u'1853260045', u'1853260126', u'1853260207',
       u'185326041X', u'1878424114'],
      dtype='object', name=u'isbn')

Remove the books in booksAlreadyRead from avgRating so that they are not recommended again.

In [0]:
#@title
avgRating = avgRating[~avgRating.index.isin(booksAlreadyRead)]

Now for the final part: recommending the top N books for a user. Make a variable named topNISBNs that holds the ISBNs of the N highest average ratings. Then define a function named topN that combines all the steps above, and call it for a user to output the final result.

In [0]:
#@title
N=3
topNISBNs = avgRating.sort_values(ascending=False).index[:N]

In [0]:
#@title
pd.Series(topNISBNs).apply(bookMeta)

0              (Love, Greg &amp; Lauren, Greg Manning)
1    (The Two Towers (The Lord of the Rings, Part 2...
2    (Harry Potter and the Sorcerer's Stone (Book 1...
Name: isbn, dtype: object
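The last two steps, dropping the books already read and keeping the N highest averages, can be checked on a toy Series. The values and the names `toyAvg`, `alreadyRead`, `candidates` and `topTwo` are made up for illustration.

```python
import pandas as pd

toyAvg = pd.Series({"A": 8.0, "B": 9.5, "C": 3.0, "D": 7.0})
alreadyRead = pd.Index(["B"])

# Exclude books the user has read, then take the two best remaining averages.
candidates = toyAvg[~toyAvg.index.isin(alreadyRead)]
topTwo = candidates.sort_values(ascending=False).index[:2]
print(list(topTwo))
```

Book B has the highest average, but it is excluded because the user has already read it.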

In [0]:
#@title
def topN(user,N=3):
    KnearestUsers = nearestNeighbors(user)
    NNRatings = userItemRatingMatrix[userItemRatingMatrix.index.isin(KnearestUsers)]
    avgRating = NNRatings.apply(np.nanmean).dropna()
    booksAlreadyRead = userItemRatingMatrix.transpose()[user].dropna().index
    avgRating = avgRating[~avgRating.index.isin(booksAlreadyRead)]
    topNISBNs = avgRating.sort_values(ascending=False).index[:N]
    return pd.Series(topNISBNs).apply(bookMeta)

In [0]:
#@title
faveBooks(204813,10)

,user,isbn,rating,title
845417,204813,0399149848,10,"(Birthright, Nora Roberts)"
845407,204813,0385504209,10,"(The Da Vinci Code, Dan Brown)"
845382,204813,0373218036,10,"(Truly, Madly Manhattan, Nora Roberts)"
845359,204813,0142001805,10,"(The Eyre Affair: A Novel, Jasper Fforde)"
845431,204813,0446527793,10,"(The Guardian, Nicholas Sparks)"
845416,204813,0399149392,10,"(Chesapeake Blue (Quinn Brothers (Hardcover)),..."
845432,204813,0446531332,9,"(Nights in Rodanthe, Nicholas Sparks)"
845434,204813,0446606243,9,"(The Tenth Justice, Brad Meltzer)"
845451,204813,0671027360,9,"(Angels &amp; Demons, Dan Brown)"
845433,204813,0446532452,9,"(The Wedding, Nicholas Sparks)"


In [0]:
#@title
topN(204813,10)

0    (Waiting For Nick (Silhouette Special Edition)...
1           (Wringer (Trophy Newbery), Jerry Spinelli)
2    (The Star Wars Trilogy: Star Wars, the Empire ...
3          (One, Two, Buckle My Shoe, Agatha Christie)
4                          (On the Road, Jack Kerouac)
5                 (Dead Poets Society, N.H. Kleinbaum)
6     (Go Ask Alice (Avon/Flare Book), James Jennings)
7                        (Carolina Moon, Nora Roberts)
8    (Illusions: The Adventures of a Reluctant Mess...
9    (You Just Don't Duct Tape a Baby!: True Tales ...
Name: isbn, dtype: object