<img src='universe.jpg' style="float: right; width: 340px;" alt="Drawing">
 
The main goal is to recommend the hotel most similar to a selected hotel. There are other ways of recommending hotels (content-based k-means clustering, for one, as introduced in another tutorial, '__hotel_clustering_based_on_reviews__'). To __look up__ the most similar hotel in our __universe of hotels__, imagine your favorite hotel __lives__ on Planet A. When you travel to a different planet (sorry, __city__), you want to find a hotel that reminds you of your favorite on Planet A. A k-dimensional tree (k-d tree) can help you locate that hotel in almost no time. I'll implement an __Approximate Nearest Neighbors__ search on a __7-dimensional tree__ that has been tailored to our dataset of 10,000 hotels rated by over 1.04 million users on TripAdvisor.

In this tutorial, I'll implement a fast search algorithm for a content-based k-Nearest Neighbor (kNN) model. kNN models were valued for their interpretability in the early days of recommender systems. I'll define a similarity measure, i.e. how similar any two chosen hotels are, based on their features. This measure guides us through the __universe__ of __similar hotels__: the model compares it across hotels to decide how __similar__ a pair of hotels are in the rating structures given by travelers, which makes kNN a very intuitive way to express the __degree of similarity__ between two items.

In statistics, the __curse of dimensionality__ makes many problems much harder as the number of dimensions increases. In a 1D space (a straight line on paper), this search is essentially a lookup: whichever hotel sits next to your favorite is the most similar one. In a 2D plane (a map laid out on a table), we can draw circles centered at your favorite hotel to search for the next most similar one. In a 3D world like ours, yet another dimension is added to the plane, and search time increases as a result.
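
To make the 1D case concrete, here is a minimal sketch in plain Python. The ratings are made up and the function name `nearest_1d` is hypothetical; it finds the closest value in a sorted list by binary search, checking only the two neighbors of the insertion point.

```python
import bisect

def nearest_1d(sorted_values, query):
    """Return the value in sorted_values closest to query (ties go to the smaller value)."""
    if not sorted_values:
        return None
    i = bisect.bisect_left(sorted_values, query)
    # Only the values on either side of the insertion point can be the closest.
    candidates = sorted_values[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda v: abs(v - query))

ratings = sorted([3.1, 4.4, 2.7, 3.9, 4.8])  # hypothetical 1D "hotels"
print(nearest_1d(ratings, 4.0))  # 3.9
```

A k-d tree generalizes this idea: instead of one sorted axis, it cycles through the axes while splitting.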

Hotels in our model are measured in 7 dimensions: rooms, service, cleanliness, front desk, business service, value, and location. Each gets a score from 0 to 5, where 1 - 5 are travelers' ratings. Using a k-dimensional tree, the most similar hotel is retrieved automatically from the training set that we feed into the algorithm.

I used Python 2 to write this tutorial, with the following libraries:

In [1]:
import pandas as pd 
import pprint
import numpy as np
import sklearn.preprocessing as pp
from numpy import linalg

First, we import our example dataset, which I scraped and cleaned from TripAdvisor. Each hotel id is unique and represents a hotel in our database. The ratings of rooms, service, cleanliness, front desk, business service, value, and location are mean ratings given by travelers.

In [2]:
hotel_ratings = pd.read_csv('../../dataset/nearest_neighbors_hotel.csv')
hotel_ratings.head()

Unnamed: 0,hotel id,rooms,service,cleanliness,front desk,business service,value,location
0,72572,3.644444,4.275556,4.368889,0.76,0.56,4.08,3.822222
1,72586,3.130769,3.838462,3.761538,0.553846,0.423077,3.792308,3.615385
2,73855,3.337423,3.607362,3.95092,0.699387,0.257669,3.251534,3.736196
3,73943,3.677852,4.45302,4.442953,1.030201,0.637584,4.036913,3.637584
4,73947,3.207792,4.11039,4.019481,0.305195,0.194805,3.909091,3.396104


Hotel names and details are in the following dataframe.

In [3]:
hotel_names = pd.read_csv('../../dataset/hotel_info.csv')
hotel_names.head()
#hotel_names[hotel_names['hotel id'] == 79868]

Unnamed: 0,hotel name,hotel id,city,state,zip code,low price,high price
0,Hilton Garden Inn Baltimore White Marsh,100407,Baltimore,MD 21236,21236,$135,$193
1,Hotel Monaco Seattle - a Kimpton Hotel,100504,Seattle,WA 98101,98101,$184,$345
2,Warwick Seattle Hotel,100505,Seattle,WA 98121,98121,$129,$228
3,Hotel Seattle,100506,Seattle,WA 98101,98101,$96,$118
4,Inn at the Market,100507,Seattle,WA 98101,98101,$199,$299


Next, we need to define a distance that measures similarity. We choose the Euclidean distance:

In [4]:
def euc_distance(point1, point2):
    diff = np.subtract(point1, point2)
    dist = np.linalg.norm(diff)
    return dist
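
As a quick sanity check of the distance measure, the sketch below recomputes the Euclidean distance in plain Python for the first two rows of the ratings table shown earlier. `euc_distance_py` is a hypothetical stand-in for `euc_distance`, written so the block runs without NumPy.

```python
import math

def euc_distance_py(point1, point2):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

# rooms, service, cleanliness, front desk, business service, value, location
hotel_72572 = [3.644444, 4.275556, 4.368889, 0.76, 0.56, 4.08, 3.822222]
hotel_72586 = [3.130769, 3.838462, 3.761538, 0.553846, 0.423077, 3.792308, 3.615385]

d = euc_distance_py(hotel_72572, hotel_72586)
print(round(d, 3))
```

The smaller the distance, the more alike two hotels are across all 7 rating dimensions; a hotel is at distance 0 from itself.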

In [5]:
# we pick the first 3000 hotels to illustrate an application of our algorithm.
# randomly shuffle rows
hotel_ratings = hotel_ratings.sample(frac=1)
hotel_ratings.head()

Unnamed: 0,hotel id,rooms,service,cleanliness,front desk,business service,value,location
2148,2514450,3.928144,3.736527,4.281437,0.856287,0.479042,3.892216,3.550898
1009,228949,2.857143,3.25,3.521429,1.057143,0.442857,3.157143,3.028571
2524,2515264,3.457627,3.898305,4.116223,0.849879,0.610169,3.612591,2.995157
2531,2515308,3.289256,4.107438,4.371901,0.355372,0.272727,4.14876,3.347107
2772,2515830,4.074334,4.415147,4.552595,0.00561,0.007013,4.468443,4.00561


In [6]:
# build training and test sets after the random shuffle;
# keep only the 7 rating columns (hotel ids are looked up again later by row position)
features = ['rooms', 'service', 'cleanliness', 'front desk', 'business service', 'value', 'location']
training = hotel_ratings[:2000].reset_index(drop=True)[features]
test = hotel_ratings[2000:].reset_index(drop=True)[features]
training.head()

Unnamed: 0,rooms,service,cleanliness,front desk,business service,value,location
0,3.928144,3.736527,4.281437,0.856287,0.479042,3.892216,3.550898
1,2.857143,3.25,3.521429,1.057143,0.442857,3.157143,3.028571
2,3.457627,3.898305,4.116223,0.849879,0.610169,3.612591,2.995157
3,3.289256,4.107438,4.371901,0.355372,0.272727,4.14876,3.347107
4,4.074334,4.415147,4.552595,0.00561,0.007013,4.468443,4.00561


To generate a tree structure, we define a function that takes an array of (7-dimensional) points as input and generates a binary search tree (BST) in which each split is performed on one of the 7 dimensions in turn, until no more splits are possible.

<img src='KDTree-animation.gif' style="float: center; width: 1500px;" alt="Drawing">

(_source:[Wikipedia: kd tree](https://en.wikipedia.org/wiki/K-d_tree)_)

In [7]:
def build_kdtree(points, depth=0):
    n = len(points)

    if n <= 0:
        return None
    # splitting by alternating axis ('k' is a global that must be set before the first call):
    axis = depth % k

    sorted_points = sorted(points, key=lambda point: point[axis])
    mid = n // 2  # integer division, so the index is an int in both Python 2 and 3

    return {
        'point': sorted_points[mid],
        'left': build_kdtree(sorted_points[:mid], depth + 1),
        'right': build_kdtree(sorted_points[mid + 1:], depth + 1)
    }

In [8]:
# turning DataFrames into NumPy arrays
# (.values works across pandas versions; as_matrix() was removed in pandas 1.0)
training_points = training.values
test_points = test.values

Now, we are ready to __grow__ our first tree!

In [20]:
k = 7
kdtree = build_kdtree(training_points)

An example of what the tree looks like:

In [21]:
ppr = pprint.PrettyPrinter(indent=4)
ppr.pprint(kdtree['left']['left']['left']['left']['left']['left']['left']['left'])

{   'left': {   'left': {   'left': None,
                            'point': array([1.39877301, 1.6993865 , 1.84662577, 0.16564417, 0.07361963,
       1.89570552, 2.19631902]),
                            'right': None},
                'point': array([1.60493827, 1.83950617, 1.98148148, 0.66666667, 0.2345679 ,
       1.95061728, 2.72222222]),
                'right': {   'left': None,
                             'point': array([1.92322097, 2.32209738, 2.4494382 , 0.52996255, 0.17602996,
       2.64419476, 2.71348315]),
                             'right': None}},
    'point': array([1.8404908 , 2.33128834, 2.14110429, 0.25153374, 0.09815951,
       2.39263804, 1.73619632]),
    'right': {   'left': {   'left': None,
                             'point': array([1.67460317, 2.34920635, 1.79365079, 0.6984127 , 0.3015873 ,
       2.06349206, 3.19047619]),
                             'right': None},
                 'point': array([1.85628743, 2.41916168, 2.25748503, 0.26347305, 0.089

This tree is a BST whose first split is on dimension 'rooms', then 'service', and so on, until it starts from 'rooms' all over again. Human brains cannot visualize this high-dimensional space, but the basic principles are just as simple as in the 2D version in the animation above.
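
The same construction is easier to follow in two dimensions. The sketch below is a self-contained variant of `build_kdtree` that takes `k` as an argument (an assumption made here so the block runs on its own; the tutorial's version reads a global `k`). It builds a tree from six made-up 2D points; the name `build_kdtree_toy` is hypothetical.

```python
def build_kdtree_toy(points, depth=0, k=2):
    n = len(points)
    if n == 0:
        return None
    axis = depth % k                      # 0 = x, 1 = y, then x again, ...
    sorted_points = sorted(points, key=lambda p: p[axis])
    mid = n // 2                          # median along the current axis
    return {
        'point': sorted_points[mid],
        'left': build_kdtree_toy(sorted_points[:mid], depth + 1, k),
        'right': build_kdtree_toy(sorted_points[mid + 1:], depth + 1, k),
    }

toy_points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
toy_tree = build_kdtree_toy(toy_points)

print(toy_tree['point'])            # (7, 2): median by x at the root
print(toy_tree['left']['point'])    # (5, 4): median by y among points left of the root
print(toy_tree['right']['point'])   # (9, 6): median by y among points right of the root
```

The 7-dimensional tree works the same way, except that the axis cycles through all seven rating columns before returning to 'rooms'.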

To avoid going over all branches in search of the _closest_ point, we apply a pruned version of the BST search: at each split we skip the half of the tree on the far side whenever none of its nodes can be _closer_ to our _planet_ (in the universe of hotels) than the best candidate found so far. We first define a helper that picks the closer of two candidate points.

In [11]:
# a function that decides if points 'p1' or 'p2' is closer to our 'pivot'
def closer_distance(pivot, p1, p2):
    if p1 is None:
        return p2
    if p2 is None:
        return p1

    # calculate the euclidean distances
    d1 = euc_distance(pivot, p1)
    d2 = euc_distance(pivot, p2)
    
    # choose the closer point
    if d1 < d2:
        return p1
    else:
        return p2

Next, we need to search our BST for the closest neighbor of a given point.

In [12]:
# a function that searches a grown BST 'root' for the point closest to 'point'.
def kdtree_closest_point(root, point, depth=0):
    if root is None:
        return None
    # alternating split by axis
    axis = depth % k

    # pick the branch on the query's side of the split first;
    # the other branch is only visited if it might hold a closer point
    if point[axis] < root['point'][axis]:
        next_branch = root['left']
        opposite_branch = root['right']
    else:
        next_branch = root['right']
        opposite_branch = root['left']

    # recursively find the closest point in the near branch and compare it with this node
    best = closer_distance(point,
                           kdtree_closest_point(next_branch,
                                                point,
                                                depth + 1),
                           root['point'])

    # search the opposite branch only if the splitting plane is closer than the current best
    if euc_distance(point, best) > abs(point[axis] - root['point'][axis]):
        best = closer_distance(point,
                               kdtree_closest_point(opposite_branch,
                                                    point,
                                                    depth + 1),
                               best)
    # return the closest point
    return best
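
One way to gain confidence in the pruning logic is to compare the tree search against a brute-force scan. The sketch below is a self-contained, pure-Python restatement of the functions above. The names `build`, `nearest`, and `brute_force` and the random test points are assumptions for illustration, and `k` is passed explicitly instead of being read from a global. On random 7-dimensional points it checks that both approaches find a neighbor at the same distance.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build(points, k, depth=0):
    if not points:
        return None
    axis = depth % k
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {'point': pts[mid],
            'left': build(pts[:mid], k, depth + 1),
            'right': build(pts[mid + 1:], k, depth + 1)}

def closer(pivot, p1, p2):
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    return p1 if dist(pivot, p1) < dist(pivot, p2) else p2

def nearest(root, query, k, depth=0):
    if root is None:
        return None
    axis = depth % k
    if query[axis] < root['point'][axis]:
        near, far = root['left'], root['right']
    else:
        near, far = root['right'], root['left']
    best = closer(query, nearest(near, query, k, depth + 1), root['point'])
    # Visit the far side only if the splitting plane is closer than the current best.
    if dist(query, best) > abs(query[axis] - root['point'][axis]):
        best = closer(query, nearest(far, query, k, depth + 1), best)
    return best

def brute_force(points, query):
    # Scan every point; slow but obviously correct.
    return min(points, key=lambda p: dist(query, p))

random.seed(0)
k = 7
points = [tuple(random.uniform(0, 5) for _ in range(k)) for _ in range(500)]
queries = [tuple(random.uniform(0, 5) for _ in range(k)) for _ in range(20)]
tree = build(points, k)

agree = all(dist(q, nearest(tree, q, k)) == dist(q, brute_force(points, q))
            for q in queries)
print(agree)  # True when the tree search matches the brute-force scan on every query
```

The brute-force scan touches every point for each query, while the tree search skips whole subtrees, which is where the speed-up comes from.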

Let's roll up our sleeves and start __climbing__ (searching)!

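Suppose I want to know which hotel is most similar to __The Bentley Hotel__ in __New York City__. The lookup below returns three listings for it (hotel ids __2514381__, __2514617__, and __99302__); the query that follows uses __2514617__.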

In [23]:
# uncomment the following if you are running this code for the first time
#hotel_names.set_index('hotel name', inplace=True)
hotel_names.loc['The Bentley Hotel']

hotel name,hotel id,city,state,zip code,low price,high price
The Bentley Hotel,2514381,New York City,NY 10065,10065,$219,$438
The Bentley Hotel,2514617,New York City,NY 10065,10065,$219,$446
The Bentley Hotel,99302,New York City,NY 10065,10065,$219,$446


In [24]:
#hotel_ratings.reset_index('hotel id', inplace=True)
# pull the 7 rating columns of the selected hotel as a (1, 7) NumPy array
feature_cols = ['rooms', 'service', 'cleanliness', 'front desk', 'business service',
                'value', 'location']
test_hotel = hotel_ratings[hotel_ratings['hotel id'] == 2514617][feature_cols].values
test_hotel

array([[3.16058394, 3.26277372, 3.69099757, 0.92214112, 0.38686131,
        3.32846715, 2.62530414]])

To find the most similar hotel, we run __kdtree_closest_point__ on our entire __kdtree__ (built from the training rows of the _hotel_ratings_ dataframe), with the __test_hotel__ array as input.

In [25]:
rec_hotel = kdtree_closest_point(kdtree, test_hotel[0])
rec_hotel

array([3.16584158, 3.28217822, 3.7029703 , 0.92821782, 0.38861386,
       3.33910891, 2.62623762])

In [18]:
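# keep only rows where all 7 coordinates match (a plain == would also report partial matches)
recommend = np.argwhere((training_points == rec_hotel).all(axis=1))
favorite = np.argwhere((test_points == test_hotel).all(axis=1))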
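
A note on looking up a row by value: comparing a 2D array with a vector using `==` works element by element, so `np.argwhere` on the raw comparison reports every matching cell, including rows that share only some coordinates with the target. The sketch below uses made-up numbers to show the difference when all coordinates are required to match via `.all(axis=1)`.

```python
import numpy as np

points = np.array([[3.0, 4.0, 1.0],
                   [2.0, 4.0, 5.0],
                   [2.0, 6.0, 5.0]])
target = np.array([2.0, 6.0, 5.0])

# (row, column) pairs for every matching cell
cell_matches = np.argwhere(points == target)
# indices of rows that match the target in every column
row_matches = np.flatnonzero((points == target).all(axis=1))

print(cell_matches[0][0])  # 1: row 1 shares only two coordinates with the target
print(row_matches[0])      # 2: the row that matches in all columns
```

With real-valued mean ratings, partial matches are unlikely but not impossible, so requiring a full-row match is the safer lookup.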

In [19]:
print ('Recommended hotel with hotel id {}'.format(hotel_ratings.iloc[recommend[0][0]]['hotel id'].astype(int)))

Recommended hotel with hotel id 79868


In [None]:
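# look up the recommended hotel by the id computed above instead of a hard-coded one
rec_id = hotel_ratings.iloc[recommend[0][0]]['hotel id'].astype(int)
hotel_names[hotel_names['hotel id'] == rec_id]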