
___
## Recommender Systems with KNN 

Welcome to the code notebook for Recommender Systems with Python. In this lecture we will develop basic recommendation systems using Python and pandas.

In this notebook, we focus on a basic item-based recommender: given a particular item (in this case, a company), it suggests the items that are most similar to it.


## Import Libraries

In [1]:
import numpy as np
import pandas as pd

## Get the Data

In [2]:

xl = pd.ExcelFile('ratings.xlsx')
df1 = xl.parse('Sheet1')

In [3]:
df1.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,28,3,2,2013-07-04 19:4:25
1,16,24,4,2011-07-15 10:2:12
2,24,39,5,2011-08-25 19:34:13
3,9,18,1,2010-04-26 18:39:6
4,23,29,3,2008-10-25 14:49:47


Now let's get the company titles:

In [4]:
xl1 = pd.ExcelFile('redcrow_company_list.xlsx')
company_titles = xl1.parse('Sheet1')
company_titles.head()

Unnamed: 0,item_id,title
0,1,Acclinate Genetics
1,2,Activ Surgical
2,3,AIM Medical Robotics
3,4,AMChart
4,5,AngioInsight Inc


We can merge them together:

In [5]:
df = pd.merge(df1,company_titles,on='item_id')

df.head()


Unnamed: 0,user_id,item_id,rating,timestamp,title
0,28,3,2,2013-07-04 19:4:25,AIM Medical Robotics
1,17,3,4,2009-07-25 13:55:2,AIM Medical Robotics
2,30,3,3,2013-01-08 14:1:26,AIM Medical Robotics
3,28,3,3,2011-01-23 16:55:0,AIM Medical Robotics
4,30,3,1,2009-10-01 23:23:34,AIM Medical Robotics
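
As a side note on the merge step, `pd.merge` performs an inner join by default, so any rating whose `item_id` has no matching title is silently dropped. The sketch below uses small made-up frames (`toy_ratings`, `toy_titles`, and item 99 are illustrative assumptions, not part of the notebook's data) to show one way of checking for unmatched rows:

```python
import pandas as pd

# Hypothetical toy frames standing in for df1 and company_titles.
toy_ratings = pd.DataFrame({
    'user_id': [28, 16, 24],
    'item_id': [3, 1, 99],      # item 99 has no title on purpose
    'rating':  [2, 4, 5],
})
toy_titles = pd.DataFrame({
    'item_id': [1, 3],
    'title':   ['Acclinate Genetics', 'AIM Medical Robotics'],
})

# how='left' keeps every rating; validate raises if item_id is duplicated
# in the titles frame; indicator adds a '_merge' column describing the match.
merged = pd.merge(toy_ratings, toy_titles, on='item_id', how='left',
                  validate='many_to_one', indicator=True)

# Ratings that found no company title.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['user_id', 'item_id', 'rating']])
```

If `unmatched` is empty on the real data, the default inner join and the left join give the same result.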


# EDA

Let's explore the data a bit and look at how many ratings each company has received.

## Visualization Imports

In [6]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

In [7]:
combine_company_rating = df.dropna(axis = 0, subset = ['title'])
company_ratingCount = (combine_company_rating.
     groupby(by = ['title'])['rating'].
     count().
     reset_index().
     rename(columns = {'rating': 'totalRatingCount'})
     [['title', 'totalRatingCount']]
    )
company_ratingCount.head()

Unnamed: 0,title,totalRatingCount
0,AIM Medical Robotics,13
1,AMChart,15
2,ARIZ Precision Medicine,12
3,Acclinate Genetics,14
4,Activ Surgical,9


In [8]:
rating_with_totalRatingCount = combine_company_rating.merge(company_ratingCount, on = 'title', how = 'left')
rating_with_totalRatingCount.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title,totalRatingCount
0,28,3,2,2013-07-04 19:4:25,AIM Medical Robotics,13
1,17,3,4,2009-07-25 13:55:2,AIM Medical Robotics,13
2,30,3,3,2013-01-08 14:1:26,AIM Medical Robotics,13
3,28,3,3,2011-01-23 16:55:0,AIM Medical Robotics,13
4,30,3,1,2009-10-01 23:23:34,AIM Medical Robotics,13


In [9]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(company_ratingCount['totalRatingCount'].describe())

count   39.000
mean    12.821
std      3.523
min      6.000
25%     10.500
50%     13.000
75%     14.500
max     22.000
Name: totalRatingCount, dtype: float64


In [10]:
# Keep only companies that have at least `popularity_threshold` ratings
popularity_threshold = 9
rating_popular_company = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
rating_popular_company.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title,totalRatingCount
0,28,3,2,2013-07-04 19:4:25,AIM Medical Robotics,13
1,17,3,4,2009-07-25 13:55:2,AIM Medical Robotics,13
2,30,3,3,2013-01-08 14:1:26,AIM Medical Robotics,13
3,28,3,3,2011-01-23 16:55:0,AIM Medical Robotics,13
4,30,3,1,2009-10-01 23:23:34,AIM Medical Robotics,13


In [11]:
rating_popular_company.shape

(471, 6)

In [12]:
## First, let's create a pivot matrix (one row per company, one column per user)

company_features_df=rating_popular_company.pivot_table(index='title',columns='user_id',values='rating').fillna(0)
company_features_df.head()

title / user_id,1,2,3,4,5,6,7,8,9,10,...,21,22,23,24,25,26,27,28,29,30
AIM Medical Robotics,0.0,2.0,2.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.5,0.0,2.0
AMChart,0.0,4.0,0.0,0.0,4.0,4.0,0.0,0.0,5.0,4.0,...,0.0,0.0,5.0,0.0,0.0,0.0,0.0,3.0,5.0,2.0
ARIZ Precision Medicine,3.0,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,3.0,1.0
Acclinate Genetics,4.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0,...,4.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.5,0.0
Activ Surgical,4.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
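
The fractional entries in the pivot table (for example 2.5 for user 28 and AIM Medical Robotics) appear because `pivot_table` aggregates duplicate (title, user) pairs, and its default aggregation is the mean. User 28 rated AIM Medical Robotics twice (2 and 3), which averages to 2.5. A minimal sketch on toy data (the `toy` frame is an illustrative assumption that mirrors those two rows):

```python
import pandas as pd

# Toy ratings; the two rows for user 28 mirror the duplicate pair seen in df.head().
toy = pd.DataFrame({
    'user_id': [28, 28, 17, 30],
    'title':   ['AIM Medical Robotics', 'AIM Medical Robotics',
                'AIM Medical Robotics', 'AMChart'],
    'rating':  [2, 3, 4, 5],
})

# Duplicate (title, user_id) pairs are averaged (default aggfunc is the mean);
# missing pairs become NaN and are then replaced with 0.
toy_pivot = toy.pivot_table(index='title', columns='user_id', values='rating').fillna(0)
print(toy_pivot)
```

Note that after `fillna(0)` a zero means "no rating recorded", not "rated zero".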


Now let's convert the pivot table to a sparse matrix and fit a nearest-neighbors model that uses cosine distance:

In [13]:
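from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Store the ratings matrix (mostly zeros) in compressed sparse row format
company_features_df_matrix = csr_matrix(company_features_df.values)

# Brute-force neighbor search using cosine distance between company rating vectors
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(company_features_df_matrix)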

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)
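
To make the `metric='cosine'` choice more concrete, here is a minimal NumPy sketch of cosine distance, defined as one minus the cosine of the angle between two rating vectors. The helper name `cosine_distance` and the vectors `a`, `b`, `c` are illustrative assumptions, not part of the notebook:

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v). Assumes neither vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = [4.0, 0.0, 5.0]
b = [2.0, 0.0, 2.5]   # same direction as a, half the magnitude
c = [0.0, 3.0, 0.0]   # rated only by a user who did not rate a

print(cosine_distance(a, b))   # close to 0: same rating pattern
print(cosine_distance(a, c))   # close to 1: no overlap in raters
```

Because only the direction of the vectors matters, two companies rated by the same users in the same proportions are treated as near neighbors even if one received systematically lower scores.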

In [14]:

company_features_df.shape

(35, 30)

In [15]:
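# Pick a random company (row of the pivot table) as the query
query_index = np.random.choice(company_features_df.shape[0])
print(query_index)
# Request 6 neighbors: the first is expected to be the query company itself, followed by 5 recommendations
distances, indices = model_knn.kneighbors(company_features_df.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 6)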

34


In [16]:

company_features_df.head()

title / user_id,1,2,3,4,5,6,7,8,9,10,...,21,22,23,24,25,26,27,28,29,30
AIM Medical Robotics,0.0,2.0,2.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.5,0.0,2.0
AMChart,0.0,4.0,0.0,0.0,4.0,4.0,0.0,0.0,5.0,4.0,...,0.0,0.0,5.0,0.0,0.0,0.0,0.0,3.0,5.0,2.0
ARIZ Precision Medicine,3.0,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,3.0,1.0
Acclinate Genetics,4.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0,...,4.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.5,0.0
Activ Surgical,4.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(company_features_df.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, company_features_df.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for Maternity Partners LLC:

1: Healthy Within, with distance of 0.41139150116768064:
2: Livia Medicines, with distance of 0.5147402323292576:
3: Boinca Therapeutics, with distance of 0.5158547002294919:
4: Maculus Therapeutix, with distance of 0.5162038620953012:
5: EnamelPure, with distance of 0.5509695360434179:
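
The lookup above starts from a random row index. One possible way to query by company name instead is to wrap the same `kneighbors` call in a small helper. This is only a sketch: the function name `recommend_similar`, its parameters, and the toy companies below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend_similar(features_df, model, title, n_recommendations=5):
    """Return (title, distance) pairs for the nearest neighbors of `title`."""
    query = features_df.loc[[title]].values          # shape (1, n_users)
    # Ask for one extra neighbor, since the query row itself is normally returned.
    k = min(n_recommendations + 1, len(features_df))
    distances, indices = model.kneighbors(query, n_neighbors=k)
    pairs = [(features_df.index[i], float(d))
             for i, d in zip(indices.flatten(), distances.flatten())
             if features_df.index[i] != title]       # drop the query itself
    return pairs[:n_recommendations]

# Toy company-by-user matrix standing in for company_features_df.
toy_features = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [4.0, 5.0, 0.0, 1.0],
     [0.0, 0.0, 5.0, 4.0],
     [0.0, 1.0, 4.0, 5.0]],
    index=['Company A', 'Company B', 'Company C', 'Company D'],
)
toy_model = NearestNeighbors(metric='cosine', algorithm='brute').fit(toy_features.values)

for name, dist in recommend_similar(toy_features, toy_model, 'Company A', n_recommendations=2):
    print('{0}: {1:.3f}'.format(name, dist))
```

With the notebook's own objects, the equivalent call would be `recommend_similar(company_features_df, model_knn, 'Maternity Partners LLC')`.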
