## Find Top 10 Recommendations for a user X in destination Las Vegas

### <u>Set up the data</u>

In [21]:
import pandas as pd
dataFile='/home/navanga/Dev/Notebooks/results-111489203.csv'
userRatingData=pd.read_csv(dataFile,sep=",",header=0,names=["user","productcode","rating"])

In [22]:
userRatingData.head()

| | user | productcode | rating |
|---|---|---|---|
| 0 | 22431518 | 2224GHOST | 5 |
| 1 | 22431518 | 33837P1 | 5 |
| 2 | 21441947 | 30711A | 5 |
| 3 | 22402085 | 3738BICYCLE | 5 |
| 4 | 22431518 | 29843P2 | 5 |


Count the distinct products and the number of times each has been rated

In [23]:
userRatingsPerProductCode = userRatingData.productcode.value_counts()
userRatingsPerProductCode.head()

3731VATICAN          906
3731COLOSSEUM        469
6980ROME             312
2970AH29             277
2142TYO_F800_F820    270
Name: productcode, dtype: int64

Number of unique products, i.e. the number of columns the rating matrix would have before any filtering

In [11]:
userRatingsPerProductCode.shape

(17602,)

Count the distinct users and the number of products each has rated

In [24]:
productsPerUser = userRatingData.user.value_counts()
productsPerUser.head()

2           90
10711298    49
8241481     30
15659698    27
20865781    27
Name: user, dtype: int64

Number of unique users, i.e. the number of rows the rating matrix would have before any filtering

In [10]:
productsPerUser.shape

(60074,)

Consider only products that have been rated by more than 10 users

In [25]:
userRatingData = userRatingData[userRatingData["productcode"].isin(userRatingsPerProductCode[userRatingsPerProductCode > 10].index)]

Consider only users who have rated more than 10 products

In [26]:
userRatingData = userRatingData[userRatingData["user"].isin(productsPerUser[productsPerUser > 10].index)]
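
The two filters above both use counts taken from the unfiltered data, so a user can pass the user filter while having fewer ratings left once rare products are removed. The sketch below shows one possible variation on toy data that recomputes the per-user counts after the product filter. The `toy` frame and the `min_count` threshold are made up for illustration and are not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy ratings, standing in for userRatingData
toy = pd.DataFrame({
    "user":        [1, 1, 1, 2, 2, 3],
    "productcode": ["A", "B", "C", "A", "B", "A"],
    "rating":      [5, 4, 5, 3, 5, 4],
})
min_count = 1  # stand-in for the threshold of 10 used above

# Keep products that were rated more than min_count times
product_counts = toy["productcode"].value_counts()
toy = toy[toy["productcode"].isin(product_counts[product_counts > min_count].index)]

# Recompute per-user counts on the filtered frame, then filter users
user_counts = toy["user"].value_counts()
toy = toy[toy["user"].isin(user_counts[user_counts > min_count].index)]

print(toy)
```

Here product C and user 3 are dropped, leaving users 1 and 2 with products A and B.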

### <u>Create the Matrix</u>

In [27]:
userProductRatingMatrix=pd.pivot_table(userRatingData, values='rating',
                                    index=['user'], columns=['productcode'])

In [28]:
userProductRatingMatrix.head()

| user | 10559P5 | 10559P7 | 10784P1 | 10784P4 | 10919P3 | 10923P2 | 10981P1 | 11057P1 | 11143P1 | 11146P1 | ... | 9218P15 | 9218P18 | 9376P7 | 9406P19 | 9511P2 | 9555P1 | 9819EIFFELDINNER | 9860P5 | 9872P7 | 9910P1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 5.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10332 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 55246 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82381 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 92033 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [16]:
userProductRatingMatrix.shape

(227, 937)
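
To see what the pivot does, here is a minimal sketch on made-up data (the `toy` frame is hypothetical). With its default settings, `pd.pivot_table` averages duplicate ratings for the same user and product, and leaves `NaN` where a user has not rated a product.

```python
import pandas as pd

# Hypothetical toy ratings; user 2 rated product A twice
toy = pd.DataFrame({
    "user":        [1, 1, 2, 2, 2],
    "productcode": ["A", "B", "A", "A", "C"],
    "rating":      [5, 4, 3, 5, 2],
})

# Rows are users, columns are products, cells are (mean) ratings
toy_matrix = pd.pivot_table(toy, values="rating",
                            index="user", columns="productcode")
print(toy_matrix)
```

User 2's two ratings of product A (3 and 5) collapse into a single cell holding their mean, 4.0.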

Import the Hamming distance, which is used below to measure how dissimilar two users' ratings are

In [29]:
from scipy.spatial.distance import hamming 


### <u>Find the Nearest Neighbours</u>

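Compute the distance between two users: take each user's ratings across all products and compute the Hamming distance between the two rating vectors. If the lookup fails (for example, a user is missing from the matrix), return NaN. Because NaN never compares equal to NaN, products that neither user has rated count as mismatches, which is why the distances on this sparse matrix sit at or near 1.0.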

In [30]:
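import numpy as np

def distance(user1, user2):
    """Hamming distance between two users' rating rows; NaN if a user is missing."""
    try:
        user1Ratings = userProductRatingMatrix.loc[user1]
        user2Ratings = userProductRatingMatrix.loc[user2]
        # Fraction of product columns where the two rating vectors differ.
        # NaN never equals NaN, so unrated products also count as differences.
        return hamming(user1Ratings, user2Ratings)
    except KeyError:
        # One of the users is not in the matrix (for example, filtered out above).
        return np.nan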

In [None]:
distance(21005956,2)
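
The effect of missing ratings on this distance is easier to see on a small example. The sketch below uses plain NumPy and assumes, as SciPy documents, that `hamming` returns the proportion of positions at which the two vectors disagree. Both helper functions are hypothetical and not part of the notebook; `corated_hamming` is one possible variation that compares only products both users have rated.

```python
import numpy as np

def hamming_fraction(u, v):
    """Fraction of positions where u and v differ (NaN never equals NaN)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.mean(u != v)

def corated_hamming(u, v):
    """Hypothetical variant: compare only positions rated by both users."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    both = ~np.isnan(u) & ~np.isnan(v)
    if not both.any():
        return np.nan  # no co-rated products, distance undefined
    return np.mean(u[both] != v[both])

a = [5, np.nan, np.nan, 4, np.nan]
b = [5, np.nan, 3, 5, np.nan]

print(hamming_fraction(a, b))  # 4 of 5 positions differ
print(corated_hamming(a, b))   # 1 of 2 co-rated positions differ
```

With the plain definition, the two positions where both entries are `NaN` count as differences, so very sparse rows end up close to 1.0 regardless of how similar the co-rated products are.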

Get all the users other than the active user

In [32]:
activeUser = 21005956
allUsers = pd.DataFrame(userProductRatingMatrix.index)
allUsers = allUsers[allUsers.user != activeUser]

In [33]:
allUsers.head()

| | user |
|---|---|
| 0 | 2 |
| 1 | 10332 |
| 2 | 55246 |
| 3 | 82381 |
| 4 | 92033 |


Add a new distance column: the distance between the active user and each of the other users

In [35]:
allUsers["distance"] = allUsers["user"].apply(lambda x: distance(activeUser,x))

In [48]:
allUsers.head()

| | user | distance |
|---|---|---|
| 0 | 2 | 1.0 |
| 1 | 10332 | 1.0 |
| 2 | 55246 | 1.0 |
| 3 | 82381 | 1.0 |
| 4 | 92033 | 1.0 |


Get the Nearest Neighbours for the active user

In [39]:
K = 10
KnearestUsers = allUsers.sort_values(["distance"],ascending=True)["user"][:K]

In [41]:
KnearestUsers

56      5172665
80      7785448
125    12372823
101     9617339
145    15129184
146    15144335
147    15227340
148    15337419
149    15344886
150    15475826
Name: user, dtype: int64

In [49]:
def nearestNeighbors(user,K=10):
    allUsers = pd.DataFrame(userProductRatingMatrix.index)
    allUsers = allUsers[allUsers.user!=user]
    allUsers["distance"] = allUsers["user"].apply(lambda x: distance(user,x))
    KnearestUsers = allUsers.sort_values(["distance"],ascending=True)["user"][:K]
    return KnearestUsers

In [None]:
KnearestUsers = nearestNeighbors(activeUser, K)

In [None]:
KnearestUsers.head()

### <u>Find the top N recommendations for a given user</u>

Get the ratings of the nearest neighbours for all products

In [50]:
NNRatings = userProductRatingMatrix[userProductRatingMatrix.index.isin(KnearestUsers)]

In [51]:
NNRatings

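| user | 10559P5 | 10559P7 | 10784P1 | 10784P4 | 10919P3 | 10923P2 | 10981P1 | 11057P1 | 11143P1 | 11146P1 | ... | 9218P15 | 9218P18 | 9376P7 | 9406P19 | 9511P2 | 9555P1 | 9819EIFFELDINNER | 9860P5 | 9872P7 | 9910P1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5172665 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7785448 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9617339 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12372823 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15129184 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15144335 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15227340 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15337419 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15344886 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15475826 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |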


Average each product column to get the nearest neighbours' mean rating for that product; products that none of the neighbours rated are dropped

In [52]:
avgRating = NNRatings.apply(np.nanmean).dropna()



In [55]:
avgRating.head()

productcode
15693HOUSES    4.0
2050BR         5.0
2050PAJ        5.0
2050PE         5.0
2138B84        4.0
dtype: float64
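
As a small illustration of this averaging step, the sketch below uses a made-up neighbour matrix (`toy_nn` is hypothetical). It relies on `DataFrame.mean()`, which skips `NaN` by default, as an alternative way to take the per-column mean; it should give the same values as applying `np.nanmean` column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings from three neighbours; product B is rated by nobody
toy_nn = pd.DataFrame(
    {
        "A": [5.0, 4.0, np.nan],
        "B": [np.nan, np.nan, np.nan],
        "C": [np.nan, 3.0, np.nan],
    },
    index=[101, 102, 103],
)

# Column means ignore NaN; all-NaN columns yield NaN and are dropped
toy_avg = toy_nn.mean().dropna()
print(toy_avg)
```

Product A averages to 4.5 from its two ratings, C keeps its single rating of 3.0, and B disappears because no neighbour rated it.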

Find the products already rated by the active user: dropna() removes the products without a rating, leaving only those the active user has rated

In [56]:
productsAlreadyRated = userProductRatingMatrix.transpose()[activeUser].dropna().index
productsAlreadyRated

Index(['2142TYO_F300_F308', '2142TYO_F800_F820', '28575P2', '5754ICNDMZJSA',
       '6006TYOAPTHTL', '6006TYOHTLAPT'],
      dtype='object', name='productcode')

Remove products already rated

In [57]:
avgRating = avgRating[~avgRating.index.isin(productsAlreadyRated)]

In [58]:
avgRating.head()

productcode
15693HOUSES    4.0
2050BR         5.0
2050PAJ        5.0
2050PE         5.0
2138B84        4.0
dtype: float64

Top N products with the highest average rating among the nearest neighbours

In [59]:
N=10
topNProducts = avgRating.sort_values(ascending=False).index[:N]

In [62]:
topNProducts

Index(['9205P4', '3864BCNHTLAPT', '3858EE036', '3731VATICAN', '3731TUSCANY',
       '7812P2', '3627PARHTLAPTCDG', '3588SEGWAY01', '3542SB', '3253OSPREY'],
      dtype='object', name='productcode')

In [63]:
def topRecommendations(user,N=10):
    KnearestUsers = nearestNeighbors(user)
    NNRatings = userProductRatingMatrix[userProductRatingMatrix.index.isin(KnearestUsers)]
    avgRating = NNRatings.apply(np.nanmean).dropna()
    productsAlreadyRated = userProductRatingMatrix.transpose()[user].dropna().index
    avgRating = avgRating[~avgRating.index.isin(productsAlreadyRated)]
    topProductCodes = avgRating.sort_values(ascending=False).index[:N]
    return topProductCodes

In [64]:
topRecommendations(21005956)



Index(['9205P4', '3864BCNHTLAPT', '3858EE036', '3731VATICAN', '3731TUSCANY',
       '7812P2', '3627PARHTLAPTCDG', '3588SEGWAY01', '3542SB', '3253OSPREY'],
      dtype='object', name='productcode')
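
The whole pipeline can be checked by hand on a tiny matrix. The sketch below is a self-contained toy version of the steps above, not the notebook's own functions: `toy_matrix`, `toy_distance` and `toy_recommend` are hypothetical names, and the distance is written with NumPy as the fraction of differing positions rather than by calling SciPy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: rows are users, columns are products, NaN = not rated
toy_matrix = pd.DataFrame(
    {
        "A": [5.0, 5.0, 1.0, np.nan],
        "B": [4.0, 4.0, np.nan, 2.0],
        "C": [np.nan, 5.0, 3.0, np.nan],
        "D": [np.nan, np.nan, 2.0, 5.0],
    },
    index=pd.Index([1, 2, 3, 4], name="user"),
)

def toy_distance(matrix, user1, user2):
    """Fraction of product columns where the two rating rows differ."""
    u = matrix.loc[user1].to_numpy()
    v = matrix.loc[user2].to_numpy()
    return np.mean(u != v)

def toy_recommend(matrix, user, k=2, n=2):
    """Top-n unrated products by mean rating among the k nearest users."""
    others = [u for u in matrix.index if u != user]
    distances = pd.Series({u: toy_distance(matrix, user, u) for u in others})
    # Stable sort so ties keep their original order
    neighbours = distances.sort_values(kind="stable").index[:k]
    avg = matrix.loc[neighbours].mean().dropna()
    already_rated = matrix.loc[user].dropna().index
    avg = avg.drop(already_rated, errors="ignore")
    return avg.sort_values(ascending=False).index[:n].tolist()

recs = toy_recommend(toy_matrix, user=1)
print(recs)
```

For user 1, the nearest neighbours are users 2 and 3. Their mean ratings for the products user 1 has not rated are 4.0 for C and 2.0 for D, so C is recommended first.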