**Using k-Nearest Neighbors to Identify User Ratings**

This model uses neighborhood collaborative filtering to recommend a small number of restaurants to a user, based on that user's stated preferences for similar restaurants. As mentioned previously, the sample used for this model includes only reviewers who have reviewed at least 150 restaurants, so each user's stated preferences are already present in the sample.

The model included here is based on a solution to the same problem for CS109a in 2013. The documentation for this problem can be found here: http://nbviewer.jupyter.org/github/cs109/content/blob/master/HW4_solutions.ipynb


In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegressionCV
%matplotlib inline

**Read in Data**

The training and test samples used here were created before the analysis began. They are the same training and test sets used in the previous models throughout this project.

In [2]:
train_data = pd.read_csv('Data/train/ON/train_150.csv')
test_data = pd.read_csv('Data/test/ON/test_150.csv')

In [3]:
train_data.shape, test_data.shape

((28542, 13), (7303, 13))

In [4]:
train_data.head()

Unnamed: 0,review_date,business_longitude,business_id,business_categories,business_name,business_state,review_score,user_id,user_average_rating,business_review_count,business_average_rating,business_latitude,user_review_count
0,2013-03-11,-79.379528,3dfJjFeXVb8UoESrypybeQ,"['Restaurants', 'Japanese']",Ninki Japanese Cuisine,ON,4.0,--Qh8yKWAvIP4V4K8ZPfHA,3.19,73,2.5,43.649349,503
1,2012-12-02,-79.40677,8MWiywu09bhLWIpzVaD4gw,"['Breakfast & Brunch', 'Restaurants']",Aunties & Uncles,ON,4.0,--Qh8yKWAvIP4V4K8ZPfHA,3.19,312,3.5,43.657,503
2,2014-01-07,-79.412422,JsAjP49bCk9KjJmmmjyM4w,"['Bakeries', 'French', 'Restaurants', 'Food']",Clafouti Patisserie & Café,ON,4.0,--Qh8yKWAvIP4V4K8ZPfHA,3.19,82,3.5,43.645248,503
3,2013-04-26,-79.380229,ZsttjmFUQvZ2KOGyTRy6mQ,"['Soup', 'Restaurants', 'Salad', 'Breakfast & ...",Soup Nutsy,ON,5.0,--Qh8yKWAvIP4V4K8ZPfHA,3.19,46,3.5,43.647664,503
4,2012-03-11,-79.443617,id7aJmK8mFs5XSXNa4YgvA,"['Cafes', 'Restaurants', 'Breakfast & Brunch']",The Pinball Cafe,ON,4.0,--Qh8yKWAvIP4V4K8ZPfHA,3.19,8,4.0,43.639342,503


**Data Cleaning**

Because this model is based on a user's previous experiences with similar restaurants, we need a way to define which restaurants in this dataset are similar to one another. One such measure is "common user support": the number of users who have rated both restaurants in a given pair. We use common user support throughout the rest of this problem as a proxy for how similar a pair of restaurants may be to each other.

In [None]:
restaurants=train_data.business_id.unique()
supports=[]
for i,rest1 in enumerate(restaurants):
    for j,rest2 in enumerate(restaurants):
        if  i < j:
            rest1_reviewers = train_data[train_data.business_id==rest1].user_id.unique()
            rest2_reviewers = train_data[train_data.business_id==rest2].user_id.unique()
            common_reviewers = set(rest1_reviewers).intersection(rest2_reviewers)
            supports.append(len(common_reviewers))
print("Mean support is:",np.mean(supports))
plt.hist(supports)
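
The loop above re-filters the full dataframe for every pair of restaurants, which can be slow on a sample of this size. The following is a minimal sketch of the same common-support count that builds each restaurant's reviewer set only once. The names `toy_reviews`, `reviewers`, and `toy_supports` are hypothetical, and the data is made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (business_id, user_id) pairs; with the real data these could
# come from zip(train_data.business_id, train_data.user_id).
toy_reviews = [
    ("A", "u1"), ("A", "u2"), ("A", "u3"),
    ("B", "u1"), ("B", "u2"),
    ("C", "u3"),
]

# Collect the set of reviewers for each restaurant in a single pass.
reviewers = defaultdict(set)
for business_id, user_id in toy_reviews:
    reviewers[business_id].add(user_id)

# Common support of a pair = size of the intersection of the two reviewer sets.
toy_supports = {
    (r1, r2): len(reviewers[r1] & reviewers[r2])
    for r1, r2 in combinations(sorted(reviewers), 2)
}
print(toy_supports)  # {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 0}
```

The number of pairs still grows quadratically with the number of restaurants, so this only removes the repeated dataframe filtering, not the pairwise cost itself.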

Now that we have defined similar restaurants, we use this information to create a correlation measure between pairs of restaurants.

In [None]:
#Now create the pearson correlation coefficient between the newly corrected user ratings for users 
#who have reviewed the same restaurants. 

In [None]:
from scipy.stats import pearsonr
def pearson_sim(rest1_reviews, rest2_reviews, n_common):
    """
    Given a subframe of restaurant 1 reviews and a subframe of restaurant 2 reviews,
    where the reviewers are those who have reviewed both restaurants, return 
    the pearson correlation coefficient between the user average subtracted ratings.
    The case for zero common reviewers is handled separately. It's
    ok to return a NaN if any of the individual variances are 0.
    """
    if n_common==0:
        rho=0.
    else:
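        # sort both subframes by user so the two rating vectors are paired user-by-user
        rest1_reviews=rest1_reviews.sort_values('user_id')
        rest2_reviews=rest2_reviews.sort_values('user_id')
        # user-average-subtracted ratings: each review score minus that user's mean rating
        diff1=(rest1_reviews['review_score']-rest1_reviews['user_average_rating']).values
        diff2=(rest2_reviews['review_score']-rest2_reviews['user_average_rating']).values
        try:
            rho=pearsonr(diff1, diff2)[0]
        except Exception:
            # e.g. too few common reviewers, or zero variance once np.seterr(all='raise') is set
            return 0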
    return rho
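
To make the idea of a correlation between user-average-subtracted ratings concrete, here is a small worked sketch in plain Python. The ratings are hypothetical, and the helper `pearson` is an illustrative stand-in for the library routine, not part of the notebook.

```python
from math import sqrt

def pearson(x, y):
    """Plain-Python Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dev_x, dev_y))
    return cov / sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))

# Hypothetical data: three users who reviewed both restaurants, in the same order.
user_avg    = [3.0, 4.0, 3.5]
stars_rest1 = [4.0, 5.0, 3.0]
stars_rest2 = [5.0, 4.5, 2.5]

# Subtract each user's own average so generous and harsh raters are comparable.
diff1 = [s - u for s, u in zip(stars_rest1, user_avg)]   # [1.0, 1.0, -0.5]
diff2 = [s - u for s, u in zip(stars_rest2, user_avg)]   # [2.0, 0.5, -1.0]

rho = pearson(diff1, diff2)
print(round(rho, 3))  # 0.866
```

Both vectors must list the common reviewers in the same order, since the correlation pairs entries by position.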

In [None]:
#Ok but let's say you get a particular business. Here's a way to spit out the dataframe of the reviews for 
#that particular restaurant

In [None]:
def get_restaurant_reviews(restaurant_id, df, set_of_users):
    """
    Given a restaurant id and a set of reviewers, return the sub-dataframe of their
    reviews.
    """
    mask = (df.user_id.isin(set_of_users)) & (df.business_id==restaurant_id)
    reviews = df[mask]
    # keep only the first review from each user
    reviews = reviews[~reviews.user_id.duplicated()]
    return reviews

In [None]:
#Now that you have these reviews, calculate the similarity between restaurants at the database level

In [None]:
"""
Function
--------
calculate_similarity

Parameters
----------
rest1 : string
    The id of restaurant 1
rest2 : string
    The id of restaurant 2
df : DataFrame
  A dataframe of reviews, such as train_data above
similarity_func : func
  A function like pearson_sim above which takes two dataframes of individual
  restaurant reviews made by a common set of reviewers, and the number of
  common reviews. This function returns the similarity of the two restaurants
  based on the common reviews.
  
Returns
--------
A tuple
  The first element of the tuple is the similarity and the second the
  common support n_common. If the similarity is a NaN, set it to 0
"""
def calculate_similarity(rest1, rest2, df, similarity_func):
    # find common reviewers
    rest1_reviewers = df[df.business_id==rest1].user_id.unique()
    rest2_reviewers = df[df.business_id==rest2].user_id.unique()
    common_reviewers = set(rest1_reviewers).intersection(rest2_reviewers)
    n_common=len(common_reviewers)
    #get reviews
    rest1_reviews = get_restaurant_reviews(rest1, df, common_reviewers)
    rest2_reviews = get_restaurant_reviews(rest2, df, common_reviewers)
    sim=similarity_func(rest1_reviews, rest2_reviews, n_common)
    if np.isnan(sim):
        return 0, n_common
    return sim, n_common

In [None]:
#Create now a database of similarities and common supporters!

In [None]:
class Database:
    "A class representing a database of similarities and common supports"
    
    def __init__(self, df):
        "the constructor, takes a reviews dataframe like train_data as its argument"
        self.df=df
        # map each business id to a row/column index in the matrices below
        self.uniquebizids={v:k for (k,v) in enumerate(df.business_id.unique())}
        keys=self.uniquebizids.keys()
        l_keys=len(keys)
        self.database_sim=np.zeros([l_keys,l_keys])
        # np.int has been removed from recent NumPy releases; the builtin int is equivalent
        self.database_sup=np.zeros([l_keys, l_keys], dtype=int)
        
    def populate_by_calculating(self, similarity_func):
        """
        a populator for every pair of businesses in df. takes similarity_func like
        pearson_sim as argument
        """
        items=self.uniquebizids.items()
        for b1, i1 in items:
            for b2, i2 in items:
                if i1 < i2:
                    sim, nsup=calculate_similarity(b1, b2, self.df, similarity_func)
                    self.database_sim[i1][i2]=sim
                    self.database_sim[i2][i1]=sim
                    self.database_sup[i1][i2]=nsup
                    self.database_sup[i2][i1]=nsup
                elif i1==i2:
                    nsup=self.df[self.df.business_id==b1].user_id.count()
                    self.database_sim[i1][i1]=1.
                    self.database_sup[i1][i1]=nsup
                    

    def get(self, b1, b2):
        "returns a tuple of similarity,common_support given two business ids"
        sim=self.database_sim[self.uniquebizids[b1]][self.uniquebizids[b2]]
        nsup=self.database_sup[self.uniquebizids[b1]][self.uniquebizids[b2]]
        return (sim, nsup)

In [None]:
#Now let's make the database and save it as db

In [None]:
np.seterr(all='raise')
db=Database(train_data)
db.populate_by_calculating(pearson_sim)

In [None]:
#db.head

In [None]:
#Find the nearest restaurants based on what the user has already rated themselves

In [None]:
"""
Function
--------
knearest_amongst_userrated

Parameters
----------
restaurant_id : string
    The id of the restaurant whose nearest neighbors we want
user_id : string
    The id of the user, in whose reviewed restaurants we want to find the neighbors
df: Dataframe
    The dataframe of reviews, such as train_data
dbase : instance of Database class.
    A database of similarities, on which the get method can be used to get the similarity
  of two businesses, e.g. dbase.get(rid1,rid2)
k : int
    the number of nearest neighbors desired, default 7
reg: float
    the regularization.
    
  
Returns
--------
A sorted list
    of the top k similar restaurants. The list is a list of tuples
    (business_id, shrunken similarity, common support).
"""
def knearest_amongst_userrated(restaurant_id, user_id, df, dbase, k=7, reg=3.):
    dfuser=df[df.user_id==user_id]
    bizsuserhasrated=dfuser.business_id.unique()
    return knearest(restaurant_id, bizsuserhasrated, dbase, k=k, reg=reg)
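
`knearest_amongst_userrated` calls a helper named `knearest` that is not defined in this section. The following is a minimal sketch of what such a helper could look like, not the notebook's own implementation. It assumes the shrinkage form `n_common * sim / (n_common + reg)`, which is a common choice but is not confirmed by this notebook. `shrunk_sim` and `ToyDB` are hypothetical names, and `ToyDB` only mimics the `get` interface of `Database` so that the example runs on its own.

```python
from operator import itemgetter

def shrunk_sim(sim, n_common, reg=3.):
    """Shrink a similarity toward 0 when it rests on few common reviewers (assumed form)."""
    return (n_common * sim) / (n_common + reg)

def knearest(restaurant_id, set_of_restaurants, dbase, k=7, reg=3.):
    """Return up to k (business_id, shrunken similarity, common support) tuples."""
    similars = []
    for other_id in set_of_restaurants:
        if other_id != restaurant_id:
            sim, nc = dbase.get(restaurant_id, other_id)
            similars.append((other_id, shrunk_sim(sim, nc, reg=reg), nc))
    # Highest shrunken similarity first.
    return sorted(similars, key=itemgetter(1), reverse=True)[:k]

class ToyDB:
    """Hypothetical stand-in exposing the same get() interface as Database."""
    def __init__(self, table):
        self.table = table

    def get(self, b1, b2):
        return self.table[frozenset((b1, b2))]

toy_db = ToyDB({
    frozenset(("A", "B")): (0.9, 1),    # high similarity, one common reviewer
    frozenset(("A", "C")): (0.6, 12),   # lower similarity, many common reviewers
})
nearest = knearest("A", ["A", "B", "C"], toy_db, k=2)
print(nearest)
```

In this toy case restaurant C ranks above B: its raw similarity is lower, but it is backed by far more common reviewers, so it is shrunk much less.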

In [None]:
#Return the predicted rating someone might give to a particular restaurant

In [None]:
"""
Function
--------
rating

Parameters
----------
df: Dataframe
    The dataframe of reviews, such as train_data
dbase : instance of Database class.
    A database of similarities, on which the get method can be used to get the similarity
  of two businesses, e.g. dbase.get(rid1,rid2)
restaurant_id : string
    The id of the restaurant whose nearest neighbors we want
user_id : string
    The id of the user, in whose reviewed restaurants we want to find the neighbors
k : int
    the number of nearest neighbors desired, default 7
reg: float
    the regularization.
    
  
Returns
--------
A float
    which is the imputed rating that we predict that user_id will make for restaurant_id
"""
def rating(df, dbase, restaurant_id, user_id, k=7, reg=3.):
    # column names follow train_data: review_score, business_average_rating, user_average_rating
    mu=df.review_score.mean()
    users_reviews=df[df.user_id==user_id]
    nsum=0.
    scoresum=0.
    nears=knearest_amongst_userrated(restaurant_id, user_id, df, dbase, k=k, reg=reg)
    restaurant_mean=df[df.business_id==restaurant_id].business_average_rating.values[0]
    user_mean=users_reviews.user_average_rating.values[0]
    scores=[]
    for r,s,nc in nears:
        scoresum=scoresum+s
        scores.append(s)
        r_reviews_row=users_reviews[users_reviews['business_id']==r]
        r_stars=r_reviews_row.review_score.values[0]
        r_avg=r_reviews_row.business_average_rating.values[0]
        rminusb=(r_stars - (r_avg + user_mean - mu))
        nsum=nsum+s*rminusb
    baseline=(user_mean +restaurant_mean - mu)
    #we might have nears, but there might be no commons, giving us a pearson of 0
    if scoresum > 0.:
        val =  nsum/scoresum + baseline
    else:
        val=baseline
    return val
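
Read as a formula, the `rating` function above appears to compute the following. The notation is introduced here for illustration only: $\bar{Y}_u$ is the user's average rating (`user_mean`), $\bar{Y}_m$ is the restaurant's average rating, $\mu$ is the overall mean rating (`mu`), $N_k(m;u)$ is the set of the $k$ nearest restaurants to $m$ among those user $u$ has rated, and $s_j$ is the shrunken similarity between $m$ and neighbor $j$.

```latex
\hat{Y}^{\text{base}}_{um} = \bar{Y}_u + \bar{Y}_m - \mu

\hat{Y}_{um} = \hat{Y}^{\text{base}}_{um}
  + \frac{\sum_{j \in N_k(m;u)} s_j \left( Y_{uj} - \hat{Y}^{\text{base}}_{uj} \right)}
         {\sum_{j \in N_k(m;u)} s_j}
```

When the sum of similarities in the denominator is not positive, the code falls back to the baseline $\hat{Y}^{\text{base}}_{um}$ alone.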