# Hotel Recommendations

## Question
How would you design a recommendation system for hotels, given the example data below? Assume that a much larger dataset could be obtained, but with the same structure.
We have:
* Demographic information about users
* History of hotel bookings

## Data

In [1]:
import pandas as pd

In [2]:
demographics = pd.DataFrame({
    'id': [i for i in range(6)],
    'user':['Alisha', 'Bob', 'Chee', 'David', 'Edwina', 'Faisal'],
    'age': [18, 28, 29, 45, 47, 60],
    'location': ['New York', 'London', 'New York', 'London', 'New York', 'London'],
    'gender': ['f', 'm', 'm', 'm', 'f', 'm']
})

hotels = pd.DataFrame({
    'id': [i for i in range(4)],
    'hotel_name': ['Mariott', 'Holiday Inn', 'Sofitel', 'Hilton']
})

bookings = pd.DataFrame({
    'user_id': [0, 0, 0, 1, 1, 2, 3, 4, 5, 5, 5, 5],
    'hotel_id': [3, 2, 2, 0, 0, 3, 2, 1, 3, 0, 3, 2]
})

In [3]:
hotels

,id,hotel_name
0,0,Mariott
1,1,Holiday Inn
2,2,Sofitel
3,3,Hilton


In [4]:
bookings

,user_id,hotel_id
0,0,3
1,0,2
2,0,2
3,1,0
4,1,0
5,2,3
6,3,2
7,4,1
8,5,3
9,5,0
10,5,3
11,5,2


In [5]:
demographics

,id,user,age,location,gender
0,0,Alisha,18,New York,f
1,1,Bob,28,London,m
2,2,Chee,29,New York,m
3,3,David,45,London,m
4,4,Edwina,47,New York,f
5,5,Faisal,60,London,m


## Simple Heuristics
A very simple approach is to use the demographic information directly to predict which hotels to recommend.

One such heuristic is to recommend hotels that are popular with users in the same age group and location as the target user. For example, suppose we have a new user called Greg who is 55 and lives in London. We could pick out existing users in the same age group and location and recommend the hotels they have stayed at.

In [6]:
# Simple grouping of ages: under 30 vs. 30 and over (the second group is labelled '>30')
demographics['age_group'] = demographics.age.apply(lambda x: '< 30' if x < 30 else '>30')

In [7]:
# Join tables to be able to make recommendations
hotel_bookings = hotels.merge(bookings, left_on='id', right_on='hotel_id').merge(demographics,left_on='user_id',right_on='id').drop(['id_x','id_y'], axis=1)

In [8]:
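# So we would recommend these to Greg. Note that count() counts bookings rather than
# distinct users, so a user who booked the same hotel twice contributes twice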
hotel_bookings[(hotel_bookings.age_group == '>30')&(hotel_bookings.location == 'London')][['hotel_name', 'user_id']].groupby(
    'hotel_name').count().rename(columns={'user_id':'count_users'}).sort_values('count_users',ascending=False)

hotel_name,count_users
Hilton,2
Sofitel,2
Mariott,1


So using this very simple heuristic, where we pick out users who have the same demographic characteristics, we would recommend either the Hilton or the Sofitel.

## Sparsity and Simple Recommender System
In the simple system above, there happened to be existing users in Greg's demographic group. However, if we made the age groups narrower or introduced new locations, some groups would contain no users and we would start to have gaps in our recommendations.
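
To make the sparsity problem concrete, the sketch below rebuilds the example tables and swaps the two coarse age groups for decade-wide bins. The decade binning and the helper `recommend_for_group` are assumptions made for illustration; they are not part of the analysis above.

```python
import pandas as pd

demographics = pd.DataFrame({
    'id': list(range(6)),
    'user': ['Alisha', 'Bob', 'Chee', 'David', 'Edwina', 'Faisal'],
    'age': [18, 28, 29, 45, 47, 60],
    'location': ['New York', 'London', 'New York', 'London', 'New York', 'London'],
})
hotels = pd.DataFrame({
    'id': list(range(4)),
    'hotel_name': ['Mariott', 'Holiday Inn', 'Sofitel', 'Hilton'],
})
bookings = pd.DataFrame({
    'user_id': [0, 0, 0, 1, 1, 2, 3, 4, 5, 5, 5, 5],
    'hotel_id': [3, 2, 2, 0, 0, 3, 2, 1, 3, 0, 3, 2],
})

# Hypothetical narrower grouping: one bin per decade (18 -> 10, 28 -> 20, ...)
demographics['age_decade'] = (demographics.age // 10) * 10

def recommend_for_group(age: int, location: str) -> pd.Series:
    """Count bookings per hotel among users in the same decade and location."""
    decade = (age // 10) * 10
    peers = demographics[(demographics.age_decade == decade)
                         & (demographics.location == location)]
    peer_bookings = bookings[bookings.user_id.isin(peers.id)].merge(
        hotels, left_on='hotel_id', right_on='id')
    return peer_bookings.hotel_name.value_counts()

# No existing user is a Londoner in their fifties, so Greg gets no recommendations
greg_recs = recommend_for_group(55, 'London')
# Bob is the only Londoner in his twenties, so this group reflects a single user
bob_recs = recommend_for_group(28, 'London')
print(greg_recs.empty, bob_recs.to_dict())
```

With decade-wide bins Greg's group is empty, and even the non-empty groups can rest on a single user's bookings.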

* At first glance this looks like a case for content-based filtering, because we have a user-item matrix plus additional feature information. However, the features describe users rather than hotels, and content-based filtering relies on having features about the content (here, hotels), so we can't apply it
* We could do some kind of collaborative filtering, but the standard approach would not make use of the demographic info we have
* It makes most sense to build a user-user similarity measure based on demographics and use this for recommendations
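
The second bullet can be illustrated with a minimal sketch of standard user-user collaborative filtering, in which users are compared only through their booking counts. Cosine similarity is one common choice and is an assumption here, as are the names `user_item` and `user_similarity`.

```python
import numpy as np
import pandas as pd

bookings = pd.DataFrame({
    'user_id': [0, 0, 0, 1, 1, 2, 3, 4, 5, 5, 5, 5],
    'hotel_id': [3, 2, 2, 0, 0, 3, 2, 1, 3, 0, 3, 2],
})

# User-item matrix: one row per user, one column per hotel, values are booking counts
user_item = pd.crosstab(bookings.user_id, bookings.hotel_id)

# Cosine similarity between the users' booking vectors
vectors = user_item.to_numpy(dtype=float)
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
user_similarity = pd.DataFrame(unit @ unit.T,
                               index=user_item.index,
                               columns=user_item.index)
print(user_similarity.round(2))
```

No demographic column enters this calculation, and a new user such as Greg has no row in `user_item` at all, so this approach on its own cannot score him.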

In [9]:
# Create a simple similarity metric based on demographic info:
# the number of attributes (age group, gender, location) on which two users agree
MAX_SIMILARITY = 3  # maximum possible score, one point per matching attribute
def similarity(u1: pd.Series, u2: pd.Series) -> int:
    return int(u1.age_group == u2.age_group) + int(u1.gender == u2.gender) + int(u1.location == u2.location)

In [10]:
# Create a new user Greg and compute a series containing his similarity to the other users.
# Greg's gender is not given in the text; 'm' is assumed here, which is consistent with the scores shown below
user_greg = pd.Series({'user':'Greg','age':55, 'age_group': '>30', 'location':'London', 'gender': 'm'})
similarity_to_greg = demographics.apply(lambda row: similarity(row, user_greg), axis=1)
similarity_to_greg.name = 'user_similarity_greg'

In [11]:
# Create a dataframe of weighted similarities by joining the similarity info to the bookings table and hotels table 
weighted_similarity = bookings.merge(hotels, left_on='hotel_id', right_on='id').merge(similarity_to_greg, left_on='user_id', right_index=True)
weighted_similarity['weighted_similarity'] = weighted_similarity.user_similarity_greg/3
weighted_similarity

,user_id,hotel_id,id,hotel_name,user_similarity_greg,weighted_similarity
0,0,3,3,Hilton,0,0.0
1,0,2,2,Sofitel,0,0.0
2,0,2,2,Sofitel,0,0.0
3,1,0,0,Mariott,2,0.666667
4,1,0,0,Mariott,2,0.666667
5,2,3,3,Hilton,1,0.333333
6,3,2,2,Sofitel,3,1.0
7,4,1,1,Holiday Inn,1,0.333333
8,5,3,3,Hilton,3,1.0
9,5,0,0,Mariott,3,1.0
10,5,3,3,Hilton,3,1.0
11,5,2,2,Sofitel,3,1.0


In [12]:
# Finally, score hotels by summing the weighted similarity per hotel. Each row of the weighted_similarity
# dataframe relates to one booking, so grouping and summing also takes account of the number of bookings
weighted_similarity[['hotel_name', 'weighted_similarity']].groupby('hotel_name').sum().sort_values('weighted_similarity',ascending=False)

hotel_name,weighted_similarity
Hilton,2.333333
Mariott,2.333333
Sofitel,2.0
Holiday Inn,0.333333


So with the current data we would recommend the Hilton or the Mariott. Each hotel's score is the sum, over its bookings, of the normalised similarity between Greg and the user who made the booking, and these two hotels tie for the highest score.
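
As a worked check on the table above, the score for a hotel $h$ can be written as follows. The notation is introduced here for illustration only: $B_h$ is the set of bookings at hotel $h$, $u_b$ is the user who made booking $b$, and $s(u_b, \text{Greg})$ is the value in the `user_similarity_greg` column, which ranges from 0 to 3.

$$\text{score}(h) = \sum_{b \in B_h} \frac{s(u_b, \text{Greg})}{3}$$

For the Hilton, the entries in `bookings` come from Alisha (0), Chee (1) and Faisal twice (3 each), giving $(0 + 1 + 3 + 3)/3 \approx 2.33$. For the Mariott, Bob's two bookings (2 each) and Faisal's one (3) give $(2 + 2 + 3)/3 \approx 2.33$, which is why the two hotels tie.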

Unlike the very simple system above, this improved system can make recommendations for a user even when there is no data about the exact demographic group the user belongs to.
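
The sketch below illustrates this with a hypothetical new user from Paris, a location that no existing user shares. The user, the helper `demographic_matches` and the variable names are all assumptions made for the example; the scoring follows the same idea as above (matching attributes out of three, summed over bookings).

```python
import pandas as pd

demographics = pd.DataFrame({
    'id': list(range(6)),
    'age': [18, 28, 29, 45, 47, 60],
    'location': ['New York', 'London', 'New York', 'London', 'New York', 'London'],
    'gender': ['f', 'm', 'm', 'm', 'f', 'm'],
})
demographics['age_group'] = demographics.age.apply(lambda x: '< 30' if x < 30 else '>30')
hotels = pd.DataFrame({
    'id': list(range(4)),
    'hotel_name': ['Mariott', 'Holiday Inn', 'Sofitel', 'Hilton'],
})
bookings = pd.DataFrame({
    'user_id': [0, 0, 0, 1, 1, 2, 3, 4, 5, 5, 5, 5],
    'hotel_id': [3, 2, 2, 0, 0, 3, 2, 1, 3, 0, 3, 2],
})

def demographic_matches(u1: pd.Series, u2: pd.Series) -> int:
    """Number of attributes (age group, gender, location) on which two users agree."""
    return (int(u1.age_group == u2.age_group)
            + int(u1.gender == u2.gender)
            + int(u1.location == u2.location))

# Hypothetical new user: under 30, female, based in Paris
new_user = pd.Series({'age_group': '< 30', 'gender': 'f', 'location': 'Paris'})

# Exact-group heuristic: nobody shares both age group and location
exact_peers = demographics[(demographics.age_group == new_user.age_group)
                           & (demographics.location == new_user.location)]

# Similarity-weighted scoring still works, because partial matches count
weights = demographics.apply(lambda row: demographic_matches(row, new_user), axis=1) / 3
weights.index = demographics.id
scored = bookings.merge(hotels, left_on='hotel_id', right_on='id')
scored['weight'] = scored.user_id.map(weights)
scores = scored.groupby('hotel_name')['weight'].sum().sort_values(ascending=False)
print(exact_peers.empty)
print(scores)
```

The exact-match filter returns no peers, while the weighted scores still produce a ranking driven by users who share the new user's age group or gender.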

## Summary/Comments
* In this case we are using a user-user recommendation system, which works because we have demographic information about users. This is a very simplified example, with only three coarse attributes weighted equally, and it wouldn't work well in real-world cases
* In most real-world examples item-item recommendation is used, since there are usually many more users than items and similarities between items tend to be more stable than similarities between users
* Here we are trying to take advantage of the demographic info, so all of the above ideas are based on the assumption that people in a similar demographic group (here a combination of age, gender and location) are more likely to want to stay in similar hotels
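
For comparison, here is a minimal sketch of the item-item approach mentioned above, using only the bookings table. Hotels are compared by the cosine similarity of their booking-count vectors; that choice of measure and the name `item_similarity` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

hotels = pd.DataFrame({
    'id': list(range(4)),
    'hotel_name': ['Mariott', 'Holiday Inn', 'Sofitel', 'Hilton'],
})
bookings = pd.DataFrame({
    'user_id': [0, 0, 0, 1, 1, 2, 3, 4, 5, 5, 5, 5],
    'hotel_id': [3, 2, 2, 0, 0, 3, 2, 1, 3, 0, 3, 2],
})

# User-item matrix of booking counts, then one vector per hotel (the columns)
user_item = pd.crosstab(bookings.user_id, bookings.hotel_id)
item_vectors = user_item.to_numpy(dtype=float).T
unit = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)

# Label rows and columns with hotel names, in the same order as the matrix columns
names = hotels.set_index('id').hotel_name.reindex(user_item.columns).to_numpy()
item_similarity = pd.DataFrame(unit @ unit.T, index=names, columns=names)
print(item_similarity.round(2))
```

Here the Sofitel and the Hilton come out as the most similar pair because Alisha and Faisal booked both, whereas the Holiday Inn shares no guests with any other hotel.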