**Using k-Nearest Neighbors to Identify User Ratings**

This model uses neighborhood-based collaborative filtering to recommend a small number of restaurants to a user, based on that user's previously stated preferences for similar restaurants. As mentioned earlier, the sample used for this model includes only reviewers who have reviewed at least 150 restaurants, so each user's stated preferences are already present in the sample.

The model included here is based on a solution to the same problem for CS109a in 2013. The documentation for this problem can be found here: http://nbviewer.jupyter.org/github/cs109/content/blob/master/HW4_solutions.ipynb


In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegressionCV
%matplotlib inline

**Read in Data**

The training and test samples used here were created before the analysis began. They are the same training and test sets used by the previous models in this project.

In [2]:
train_data = pd.read_csv('Data/train/WI/train_150.csv')
test_data = pd.read_csv('Data/test/WI/test_150.csv')

In [3]:
train_data.shape, test_data.shape

((659, 13), (178, 13))

In [4]:
train_data = train_data[0:250]

In [5]:
train_data.head()

Unnamed: 0,review_date,business_longitude,business_id,business_categories,business_name,business_state,review_score,user_id,user_average_rating,business_review_count,business_average_rating,business_latitude,user_review_count
0,2017-02-23,-89.475496,kWItkhjHRuzfC11CP1E0ZQ,"['Mexican', 'Restaurants']",Lalo's Mexican Restaurant,WI,4.0,8teQ4Zc9jpl_ffaPJUn6Ew,3.99,28,4.0,43.083088,560
1,2014-08-06,-89.401581,npqc1DO90A5TzavGmYyGhA,"['Beer Gardens', 'Nightlife', 'Bars', 'Hot Dog...",OSS,WI,4.0,8teQ4Zc9jpl_ffaPJUn6Ew,3.99,169,4.5,43.06768,560
2,2015-08-17,-89.395633,QxqMKO5IZCQUUWIx8nTDJw,"['Sandwiches', 'Coffee & Tea', 'Restaurants', ...",Colectivo Coffee,WI,4.0,8teQ4Zc9jpl_ffaPJUn6Ew,3.99,33,3.5,43.074867,560
3,2016-06-08,-89.52035,OzlqZV3-Pywyml2TC52WDw,"['Restaurants', 'American (Traditional)']",Mr. Brews Taphouse,WI,3.0,8teQ4Zc9jpl_ffaPJUn6Ew,3.99,38,4.0,42.99464,560
4,2016-06-06,-89.381547,cTPdZ9Va0HCLESf5gNnXwA,"['Bars', 'Lounges', 'Pizza', 'Nightlife', 'Res...",Lucille,WI,4.0,8teQ4Zc9jpl_ffaPJUn6Ew,3.99,135,3.5,43.074523,560


**Data Cleaning**

Because this model is based on a user's previous experiences with similar restaurants, we need a way to define which restaurants in this dataset are similar to one another. One such measure is the "common user support" of a pair of restaurants: the number of users who have rated both restaurants in the pair. We compute it here because it is used throughout the rest of this problem as a proxy for how similar a pair of restaurants may be to each other.

In [None]:
restaurants=train_data.business_id.unique()
supports=[]
for i,rest1 in enumerate(restaurants):
    for j,rest2 in enumerate(restaurants):
        if  i < j:
            rest1_reviewers = train_data[train_data.business_id==rest1].user_id.unique()
            rest2_reviewers = train_data[train_data.business_id==rest2].user_id.unique()
            common_reviewers = set(rest1_reviewers).intersection(rest2_reviewers)
            supports.append(len(common_reviewers))
print("Mean support is:",np.mean(supports))
plt.hist(supports)
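
The double loop above re-filters `train_data` for every pair of restaurants. As a minimal sketch of the same common-support idea, the example below builds each restaurant's reviewer set once and then intersects the sets pairwise. The `toy` dataframe and the names `reviewers` and `toy_supports` are hypothetical and used only for illustration; only the `business_id` and `user_id` column names come from the data above.

```python
from itertools import combinations

import pandas as pd

# Toy review table (hypothetical data, not drawn from the Yelp sample).
toy = pd.DataFrame({
    "business_id": ["A", "A", "B", "B", "B", "C"],
    "user_id":     ["u1", "u2", "u1", "u2", "u3", "u3"],
})

# Build each restaurant's set of reviewers once.
reviewers = toy.groupby("business_id")["user_id"].apply(set)

# Common support = number of users who reviewed both restaurants in a pair.
toy_supports = {
    (r1, r2): len(reviewers[r1] & reviewers[r2])
    for r1, r2 in combinations(reviewers.index, 2)
}
print(toy_supports)
```

Here restaurants A and B share two reviewers, B and C share one, and A and C share none. The counts are the same as those produced by the nested loop; only the bookkeeping differs.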

In [6]:
import pickle

In [None]:
# save the array of unique restaurant ids
with open("test_restaurants", 'wb') as fileObject:
    pickle.dump(restaurants, fileObject)

In [None]:
# save the list of pairwise common supports
with open("test_supports", 'wb') as fileObject:
    pickle.dump(supports, fileObject)

In [None]:
# save rest1_reviewers as left over from the final iteration of the loop above
with open("test_rest1_reviewers", 'wb') as fileObject:
    pickle.dump(rest1_reviewers, fileObject)

In [None]:
# save rest2_reviewers as left over from the final iteration of the loop above
with open("test_rest2_reviewers", 'wb') as fileObject:
    pickle.dump(rest2_reviewers, fileObject)

In [None]:
# save common_reviewers as left over from the final iteration of the loop above
with open("test_common_reviewers", 'wb') as fileObject:
    pickle.dump(common_reviewers, fileObject)

In [7]:
#fileObject = open("test_common_reviewers",'rb')

#common_reviewers = pickle.load(fileObject)  
rest2_reviewers = pickle.load(open("test_rest2_reviewers",'rb')) 
rest1_reviewers = pickle.load(open("test_rest1_reviewers",'rb')) 
supports = pickle.load(open("test_supports",'rb')) 
restaurants = pickle.load(open("test_restaurants",'rb'))

#db = pickle.load(open("test_db",'rb'))

In [None]:
#supports_saved

In [None]:
#restaurants_saved

Now that we have defined similar restaurants, we use this information to create a correlation measure between pairs of restaurants.

In [None]:
#Now create the pearson correlation coefficient between the newly corrected user ratings for users 
#who have reviewed the same restaurants. 

In [8]:
from scipy.stats import pearsonr
def pearson_sim(rest1_reviews, rest2_reviews, n_common):
    """
    Given a subframe of restaurant 1 reviews and a subframe of restaurant 2 reviews,
    where the reviewers are those who have reviewed both restaurants, return 
    the pearson correlation coefficient between the user average subtracted ratings.
    The case for zero common reviewers is handled separately. It's
    ok to return a NaN if any of the individual variances are 0.
    """
    if n_common==0:
        rho=0.
    else:
        # subtract each user's average from the score that user gave the restaurant
        diff1=rest1_reviews['review_score']-rest1_reviews['user_average_rating']
        diff2=rest2_reviews['review_score']-rest2_reviews['user_average_rating']
        try:
            rho=pearsonr(diff1, diff2)[0]
        except Exception:
            # pearsonr can fail, e.g. with too few points or zero variance
            return 0
    return rho
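
To make the "user average subtracted" idea concrete, here is a small sketch with made-up numbers. It assumes three hypothetical users who reviewed both restaurants, listed in the same order for each restaurant, and it uses `np.corrcoef` rather than `pearsonr` so that it depends only on NumPy. All arrays below are illustrative and not taken from the dataset.

```python
import numpy as np

# Hypothetical data: three users who reviewed both restaurants,
# in the same user order for every array.
user_avg = np.array([3.0, 4.0, 3.5])      # each user's average rating
stars_rest1 = np.array([4.0, 4.0, 5.0])   # scores given to restaurant 1
stars_rest2 = np.array([3.0, 5.0, 4.0])   # scores given to restaurant 2

# Remove each user's personal rating bias before correlating.
diff1 = stars_rest1 - user_avg
diff2 = stars_rest2 - user_avg

# Pearson correlation coefficient between the centered ratings.
rho = np.corrcoef(diff1, diff2)[0, 1]
print(rho)
```

A negative value here means that users who rated restaurant 1 above their own average tended to rate restaurant 2 below it.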

In [9]:
#Ok but let's say you get a particular business. Here's a way to spit out the dataframe of the reviews for 
#that particular restaurant

In [10]:
def get_restaurant_reviews(restaurant_id, df, set_of_users):
    """
    given a restaurant id and a set of reviewers, return the sub-dataframe of their
    reviews.
    """
    mask = (df.user_id.isin(set_of_users)) & (df.business_id==restaurant_id)
    reviews = df[mask]
    # keep only the first review from each user
    reviews = reviews[~reviews.user_id.duplicated()]
    return reviews

In [11]:
#Now that you have these reviews, calculate the similarity between restaurants at the database level

In [12]:
"""
Function
--------
calculate_similarity

Parameters
----------
rest1 : string
    The id of restaurant 1
rest2 : string
    The id of restaurant 2
df : DataFrame
  A dataframe of reviews, such as train_data above
similarity_func : func
  A function like pearson_sim above which takes two dataframes of individual
  restaurant reviews made by a common set of reviewers, and the number of
  common reviews. This function returns the similarity of the two restaurants
  based on the common reviews.
  
Returns
--------
A tuple
  The first element of the tuple is the similarity and the second the
  common support n_common. If the similarity is a NaN, set it to 0
"""
def calculate_similarity(rest1, rest2, df, similarity_func):
    # find common reviewers
    rest1_reviewers = df[df.business_id==rest1].user_id.unique()
    rest2_reviewers = df[df.business_id==rest2].user_id.unique()
    common_reviewers = set(rest1_reviewers).intersection(rest2_reviewers)
    n_common=len(common_reviewers)
    #get reviews
    rest1_reviews = get_restaurant_reviews(rest1, df, common_reviewers)
    rest2_reviews = get_restaurant_reviews(rest2, df, common_reviewers)
    sim=similarity_func(rest1_reviews, rest2_reviews, n_common)
    if np.isnan(sim):
        return 0, n_common
    return sim, n_common

In [13]:
#Create now a database of similarities and common supporters!

In [14]:
class Database:
    "A class representing a database of similarities and common supports"
    
    def __init__(self, df):
        "the constructor, takes a reviews dataframe like train_data as its argument"
        database={}
        self.df=df
        self.uniquebizids={v:k for (k,v) in enumerate(df.business_id.unique())}
        keys=self.uniquebizids.keys()
        l_keys=len(keys)
        self.database_sim=np.zeros([l_keys,l_keys])
        # np.int was removed in NumPy 1.24; the builtin int is equivalent here
        self.database_sup=np.zeros([l_keys, l_keys], dtype=int)
        
    def populate_by_calculating(self, similarity_func):
        """
        a populator for every pair of businesses in df. takes similarity_func like
        pearson_sim as argument
        """
        items=self.uniquebizids.items()
        for b1, i1 in items:
            for b2, i2 in items:
                if i1 < i2:
                    sim, nsup=calculate_similarity(b1, b2, self.df, similarity_func)
                    self.database_sim[i1][i2]=sim
                    self.database_sim[i2][i1]=sim
                    self.database_sup[i1][i2]=nsup
                    self.database_sup[i2][i1]=nsup
                elif i1==i2:
                    nsup=self.df[self.df.business_id==b1].user_id.count()
                    self.database_sim[i1][i1]=1.
                    self.database_sup[i1][i1]=nsup
                    

    def get(self, b1, b2):
        "returns a tuple of similarity,common_support given two business ids"
        sim=self.database_sim[self.uniquebizids[b1]][self.uniquebizids[b2]]
        nsup=self.database_sup[self.uniquebizids[b1]][self.uniquebizids[b2]]
        return (sim, nsup)

In [15]:
#Now let's make the database and save it as db

In [16]:
np.seterr(all='raise')
db=Database(train_data)
db.populate_by_calculating(pearson_sim)

In [17]:
def shrunk_sim(sim, n_common, reg=3.):
    "takes a similarity and shrinks it down by using the regularizer"
    ssim=(n_common*sim)/(n_common+reg)
    return ssim
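
As a quick worked example of the shrinkage, the sketch below repeats the formula from `shrunk_sim` and applies it to a raw similarity of 1.0 at several hypothetical support levels, using the default `reg=3`. The support values are chosen only for illustration.

```python
def shrunk_sim(sim, n_common, reg=3.):
    "same formula as above: scale the similarity by n_common / (n_common + reg)"
    return (n_common * sim) / (n_common + reg)

# A raw similarity of 1.0 at increasing levels of common support.
for n_common in (1, 3, 30):
    print(n_common, round(shrunk_sim(1.0, n_common), 3))
# 1 0.25
# 3 0.5
# 30 0.909
```

With a single common reviewer, a perfect correlation is pulled down to 0.25, while with 30 common reviewers it stays close to 1. Similarities backed by little evidence therefore count for less when neighbors are ranked.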

In [18]:
"""
Function
--------
knearest

Parameters
----------
restaurant_id : string
    The id of the restaurant whose nearest neighbors we want
set_of_restaurants : array
    The set of restaurants from which we want to find the nearest neighbors
dbase : instance of Database class.
    A database of similarities, on which the get method can be used to get the similarity
  of two businesses. e.g. dbase.get(rid1,rid2)
k : int
    the number of nearest neighbors desired, default 7
reg: float
    the regularization.
    
  
Returns
--------
A sorted list
    of the top k similar restaurants. The list is a list of tuples
    (business_id, shrunken similarity, common support).
"""
from operator import itemgetter
def knearest(restaurant_id, set_of_restaurants, dbase, k=7, reg=3.):
    """
    Given a restaurant_id, dataframe, and database, get a sorted list of the
    k most similar restaurants from the entire database.
    """
    similars=[]
    for other_rest_id in set_of_restaurants:
        if other_rest_id!=restaurant_id:
            sim, nc=dbase.get(restaurant_id, other_rest_id)
            ssim=shrunk_sim(sim, nc, reg=reg)
            similars.append((other_rest_id, ssim, nc ))
    similars=sorted(similars, key=itemgetter(1), reverse=True)
    return similars[0:k]
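
The effect of ranking by shrunken rather than raw similarity can be seen in a toy sketch. The dictionary `toy_pairs` below is a hypothetical stand-in for what `dbase.get` would return for one target restaurant; the ids and numbers are invented for illustration.

```python
from operator import itemgetter

# Hypothetical (similarity, common support) values relative to one target restaurant.
toy_pairs = {"B": (0.9, 1), "C": (0.6, 12), "D": (0.8, 4)}
reg = 3.0

# Shrink each similarity, then sort in descending order of the shrunken value.
ranked = sorted(
    ((rid, (nc * sim) / (nc + reg), nc) for rid, (sim, nc) in toy_pairs.items()),
    key=itemgetter(1),
    reverse=True,
)

# Keep the k = 2 nearest neighbors.
top2 = ranked[:2]
print([rid for rid, ssim, nc in top2])
```

By raw similarity the order would be B, D, C, but B rests on a single common reviewer. After shrinkage the order becomes C, D, B, so the two nearest neighbors are C and D.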

In [19]:
testbizid="kWItkhjHRuzfC11CP1E0ZQ"
testbizid2="npqc1DO90A5TzavGmYyGhA"

In [20]:
def biznamefromid(df, theid):
    return df['business_name'][df['business_id']==theid].values[0]
def usernamefromid(df, theid):
    return df['user_id'][df['user_id']==theid].values[0]

In [21]:
def get_user_top_choices(user_id, df, numchoices=5):
    "get the sorted top 5 restaurants for a user by the star rating the user gave them"
    # Note: no sorting is applied, so this returns the user's first numchoices reviews in dataframe order.
    # To rank by rating instead, insert .sort_values('business_average_rating', ascending=False) before .head()
    udf=df[df.user_id==user_id][['business_id','business_average_rating']].head(numchoices)
    return udf
testuserid="8teQ4Zc9jpl_ffaPJUn6Ew"
print("For user", testuserid, "top choices are:") 
bizs=get_user_top_choices(testuserid, train_data)['business_id'].values
[biznamefromid(train_data, business_id) for business_id in bizs]

For user 8teQ4Zc9jpl_ffaPJUn6Ew top choices are:


["Lalo's Mexican Restaurant",
 'OSS',
 'Colectivo Coffee',
 'Mr. Brews Taphouse',
 'Lucille']

In [22]:
"""
Function
--------
get_top_recos_for_user

Parameters
----------
userid : string
    The id of the user for whom we want the top recommendations
df : Dataframe
    The dataframe of restaurant reviews such as train_data
dbase : instance of Database class.
    A database of similarities, on which the get method can be used to get the similarity
  of two businesses. e.g. dbase.get(rid1,rid2)
n: int
    the n top choices of the user by star rating
k : int
    the number of nearest neighbors desired, default 7
reg: float
    the regularization.
    
  
Returns
--------
A sorted list
    of the top recommendations. The list is a list of tuples
    (business_id, business_avg). You are combining the k-nearest recommendations 
    for each of the user's n top choices, removing duplicates and the ones the user
    has already rated.
"""
def get_top_recos_for_user(userid, df, dbase, n=5, k=7, reg=3.):
    bizs=get_user_top_choices(userid, df, numchoices=n)['business_id'].values
    rated_by_user=df[df.user_id==userid].business_id.values
    tops=[]
    for ele in bizs:
        t=knearest(ele, df.business_id.unique(), dbase, k=k, reg=reg)
        for e in t:
            if e[0] not in rated_by_user:
                tops.append(e)

    #there might be repeats. unique it
    ids=[e[0] for e in tops]
    uids={k:0 for k in list(set(ids))}

    topsu=[]
    for e in tops:
        if uids[e[0]] == 0:
            topsu.append(e)
            uids[e[0]] =1
    topsr=[]     
    for r, s,nc in topsu:
        avg_rate=df[df.business_id==r].review_score.mean()
        topsr.append((r,avg_rate))
        
    topsr=sorted(topsr, key=itemgetter(1), reverse=True)

    if n < len(topsr):
        return topsr[0:n]
    else:
        return topsr

In [29]:
print("For user", usernamefromid(train_data,testuserid), "the top recommendations are:")
toprecos=get_top_recos_for_user(testuserid, train_data, db, n=5, k=7, reg=3.)
for biz_id, biz_avg in toprecos:
    print(biznamefromid(train_data,biz_id), "| Average Rating |", biz_avg)

For user 8teQ4Zc9jpl_ffaPJUn6Ew the top recommendations are:


In [23]:
"""
Function
--------
knearest_amongst_userrated

Parameters
----------
restaurant_id : string
    The id of the restaurant whose nearest neighbors we want
user_id : string
    The id of the user, in whose reviewed restaurants we want to find the neighbors
df: Dataframe
    The dataframe of reviews such as train_data
dbase : instance of Database class.
    A database of similarities, on which the get method can be used to get the similarity
  of two businesses. e.g. dbase.get(rid1,rid2)
k : int
    the number of nearest neighbors desired, default 7
reg: float
    the regularization.
    
  
Returns
--------
A sorted list
    of the top k similar restaurants. The list is a list of tuples
    (business_id, shrunken similarity, common support).
"""
def knearest_amongst_userrated(restaurant_id, user_id, df, dbase, k=7, reg=3.):
    dfuser=df[df.user_id==user_id]
    bizsuserhasrated=dfuser.business_id.unique()
    return knearest(restaurant_id, bizsuserhasrated, dbase, k=k, reg=reg)

In [24]:
"""
Function
--------
rating

Parameters
----------
df: Dataframe
    The dataframe of reviews such as train_data
dbase : instance of Database class.
    A database of similarities, on which the get method can be used to get the similarity
  of two businesses. e.g. dbase.get(rid1,rid2)
restaurant_id : string
    The id of the restaurant whose nearest neighbors we want
user_id : string
    The id of the user, in whose reviewed restaurants we want to find the neighbors
k : int
    the number of nearest neighbors desired, default 7
reg: float
    the regularization.
    
  
Returns
--------
A float
    which is the imputed rating that we predict that user_id will make for restaurant_id
"""
def rating(df, dbase, restaurant_id, user_id, k=7, reg=3.):
    # column names follow train_data: review_score, business_average_rating, user_average_rating
    mu=df.review_score.mean()
    users_reviews=df[df.user_id==user_id]
    nsum=0.
    scoresum=0.
    nears=knearest_amongst_userrated(restaurant_id, user_id, df, dbase, k=k, reg=reg)
    restaurant_mean=df[df.business_id==restaurant_id].business_average_rating.values[0]
    user_mean=users_reviews.user_average_rating.values[0]
    scores=[]
    for r,s,nc in nears:
        scoresum=scoresum+s
        scores.append(s)
        r_reviews_row=users_reviews[users_reviews['business_id']==r]
        r_stars=r_reviews_row.review_score.values[0]
        r_avg=r_reviews_row.business_average_rating.values[0]
        rminusb=(r_stars - (r_avg + user_mean - mu))
        nsum=nsum+s*rminusb
    baseline=(user_mean +restaurant_mean - mu)
    #we might have nears, but there might be no commons, giving us a pearson of 0
    if scoresum > 0.:
        val =  nsum/scoresum + baseline
    else:
        val=baseline
    return val
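
The arithmetic inside `rating` can be traced by hand with a few made-up numbers. The sketch below assumes a hypothetical overall mean, user mean, and restaurant mean, plus two neighbor restaurants the user has already rated. Every value is illustrative and not taken from the dataset.

```python
# Hypothetical inputs.
mu = 3.7               # mean score over all reviews
user_mean = 4.0        # this user's average rating
restaurant_mean = 3.5  # average rating of the restaurant being predicted

# Each neighbor: (shrunken similarity, score the user gave it, its average rating).
neighbors = [(0.5, 5.0, 4.0), (0.25, 3.0, 3.5)]

# Baseline prediction for the target restaurant.
baseline = user_mean + restaurant_mean - mu

# Similarity-weighted average of how far the user's scores for the
# neighbors deviated from those neighbors' own baselines.
nsum = sum(s * (stars - (avg + user_mean - mu)) for s, stars, avg in neighbors)
scoresum = sum(s for s, stars, avg in neighbors)

# Fall back to the baseline when the similarities do not sum to a positive value.
predicted = nsum / scoresum + baseline if scoresum > 0 else baseline
print(predicted)
```

The baseline here is 3.8, and the neighbors contribute a weighted deviation of about +0.2, so the predicted rating comes out at roughly 4.0.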
        


In [26]:
print( "User Average", train_data[train_data.user_id==testuserid].business_average_rating.mean(),"for",usernamefromid(train_data,testuserid))
print( "Predicted ratings for top choices calculated earlier:")
for biz_id,biz_avg in toprecos:
    print(biznamefromid(train_data, biz_id),"|",rating(train_data, db, biz_id, testuserid, k=7, reg=3.),"|","Average",biz_avg )

User Average 3.7535714285714286 for 8teQ4Zc9jpl_ffaPJUn6Ew
Predicted ratings for top choices calculated earlier:


In [None]:
print(testbizid, biznamefromid(train_data,testbizid))
print(testbizid2, biznamefromid(train_data, testbizid2))


In [None]:
tops=knearest(testbizid, train_data.business_id.unique(), db, k=7, reg=3.)
print("For ",biznamefromid(train_data, testbizid), ", top matches are:")
for i, (biz_id, sim, nc) in enumerate(tops):
    print(i,biznamefromid(train_data,biz_id), "| Sim", sim, "| Support",nc)

In [None]:
#Find the nearest restaurants based on what the user has already rated themselves

In [None]:
#Return the predicted rating someone might give to a particular restaurant
