# Which hotel type will an Expedia customer book?

Planning your dream vacation, or even a weekend escape, can be an overwhelming affair. With hundreds, even thousands, of hotels to choose from at every destination, it's difficult to know which will suit your personal preferences. Should you go with an old standby with those pillow mints you like, or risk a new hotel with a trendy pool bar? 

Expedia wants to take the proverbial rabbit hole out of hotel search by providing personalized hotel recommendations to their users. This is no small task for a site with hundreds of millions of visitors every month!

Currently, Expedia uses search parameters to adjust their hotel recommendations, but there isn't enough customer-specific data to personalize them for each user. In this competition, Expedia is challenging Kagglers to contextualize customer data and predict the likelihood that a user will stay at each of 100 different hotel groups.

Expedia has provided logs of customer behavior. These include what customers searched for, how they interacted with search results (click/book), and whether or not the search result was a travel package. The data in this competition are a random selection from Expedia and are not representative of the overall statistics.

Expedia is interested in predicting which hotel group a user is going to book. Expedia has in-house algorithms to form hotel clusters, where similar hotels for a search (based on historical price, customer star ratings, geographical location relative to the city center, etc.) are grouped together. These hotel clusters serve as good identifiers of which types of hotels people are going to book, while avoiding outliers such as new hotels that don't have historical data.

The goal of this competition is to predict the booking outcome (hotel cluster) for a user event, based on the search and other attributes associated with that user event.

The train and test datasets are split based on time: training data are from 2013 and 2014, while test data are from 2015. Training data include all the users in the logs, covering both click events and booking events. Test data include only booking events.
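Because the real split is by time, a local validation set can mimic it by holding out the most recent bookings. The block below is a minimal sketch on made-up rows, not the competition's own procedure. It assumes the booking flag column is named `is_booking` and the target column `hotel_cluster`; the `time_split` helper and the cutoff date are hypothetical choices for illustration.

```python
import pandas as pd

# Tiny made-up sample; is_booking and hotel_cluster are assumed column names.
events = pd.DataFrame({
    "date_time": ["2013-03-01 10:00:00", "2013-11-15 08:30:00",
                  "2014-02-20 21:45:00", "2014-09-05 13:10:00"],
    "is_booking": [0, 1, 1, 0],
    "hotel_cluster": [12, 45, 45, 7],
})
events["date_time"] = pd.to_datetime(events["date_time"])


def time_split(df, cutoff):
    """Return (rows before cutoff, booking rows on or after cutoff)."""
    cutoff = pd.Timestamp(cutoff)
    fit_part = df[df["date_time"] < cutoff]
    # Mirror the test set, which contains booking events only.
    holdout = df[(df["date_time"] >= cutoff) & (df["is_booking"] == 1)]
    return fit_part, holdout


fit_part, holdout = time_split(events, "2014-01-01")
print(len(fit_part), len(holdout))
```

Keeping clicks in the fitting portion but dropping them from the holdout matches the description above: the training logs contain both event types, while the test set contains bookings only.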

The destinations.csv file consists of features extracted from hotel review text.

Note that some srch_destination_id values in the train/test files don't exist in the destinations.csv file. This is because some hotels are new and don't have enough features in the latent space. Your algorithm should be able to handle this missing information.
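One way to cope with the missing destinations is a left join that keeps every search row, plus an explicit flag for rows that found no match. This is a sketch on made-up data, not a recommended modelling choice: only two of the latent columns (`d1`, `d2`) are shown, and both the `dest_missing` column and the fill value of 0.0 are assumptions made for illustration.

```python
import pandas as pd

# Made-up stand-ins for the train/test rows and for destinations.csv.
searches = pd.DataFrame({"srch_destination_id": [1, 2, 99]})
dest = pd.DataFrame({
    "srch_destination_id": [1, 2],
    "d1": [-2.1, -2.3],
    "d2": [-1.9, -2.2],
})

# A left join keeps every search row, even when the destination has no features.
merged = searches.merge(dest, on="srch_destination_id", how="left")

# Flag the unmatched rows first, then fill the gaps with a placeholder value.
latent_cols = ["d1", "d2"]
merged["dest_missing"] = merged[latent_cols].isna().all(axis=1)
merged[latent_cols] = merged[latent_cols].fillna(0.0)
print(merged)
```

Recording the flag before filling preserves the information that the destination was unknown, which a model may be able to use.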

In [1]:
import os
import sys
import numpy as np
import pandas as pd
import csv

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [None]:
# A wildcard import pollutes the namespace; import submodules explicitly when they are needed.
import sklearn

In [2]:
# read the data description tables from the Kaggle competition page
details_page = 'https://www.kaggle.com/c/expedia-hotel-recommendations/data'

details = pd.read_html(details_page)

In [3]:
details[0]

Unnamed: 0,File Name,Available Formats
0,sample_submission.csv,.gz (3.52 mb)


In [4]:
details[1]

Unnamed: 0,Column name,Description,Data type
0,date_time,Timestamp,string
1,site_name,ID of the Expedia point of sale (i.e. Expedia....,int
2,posa_continent,ID of continent associated with site_name,int
3,user_location_country,The ID of the country the customer is located,int
4,user_location_region,The ID of the region the customer is located,int
5,user_location_city,The ID of the city the customer is located,int
6,orig_destination_distance,Physical distance between a hotel and a custom...,double
7,user_id,ID of user,int
8,is_mobile,"1 when a user connected from a mobile device, ...",tinyint
9,is_package,1 if the click/booking was generated as a part...,int


In [5]:
details[2]

Unnamed: 0,Column name,Description,Data type
0,srch_destination_id,ID of the destination where the hotel search w...,int
1,d1-d149,latent description of search regions,double


In [6]:
data_dir = 'D:\\Documents\\data_sources\\kaggle\\expeida\\'
# train = pd.read_csv(data_dir + 'train.csv')  # memory error with this
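
When a single `pd.read_csv` call runs out of memory, one common workaround is to read the file in chunks and aggregate as you go. The sketch below uses a small in-memory stand-in for train.csv so that it runs on its own. The chunk size and the statistic being counted are arbitrary choices for illustration, and whether chunking is enough for the real file depends on the machine.

```python
import io
import pandas as pd

# Stand-in for train.csv so the sketch is self-contained; pass the real path instead.
sample_csv = io.StringIO(
    "user_id,is_mobile,is_package\n"
    "1,0,1\n"
    "2,1,0\n"
    "3,0,0\n"
    "4,1,1\n"
    "5,0,1\n"
)

total_rows = 0
package_rows = 0
# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(sample_csv, chunksize=2):
    total_rows += len(chunk)
    package_rows += int(chunk["is_package"].sum())

print(total_rows, package_rows)
```

Only one chunk is held in memory at a time, so peak usage is bounded by the chunk size instead of the file size.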

In [None]:
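data = []
# In Python 3 the csv module expects a text-mode file opened with newline=''.
with open(data_dir + 'train.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)  # default delimiter is a comma, as in a .csv file
    for row in reader:
        data.append(row)  # append in place; `data + [row]` copies the list every time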

In [None]:
data[0:10]

In [None]:
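# `train` was never created above (the full read_csv hit a memory error),
# so load only a sample of rows; the row count here is an arbitrary choice.
train = pd.read_csv(data_dir + 'train.csv', nrows=100000)
train.head()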

In [None]:
test = pd.read_csv(data_dir + 'test.csv')
destinations = pd.read_csv(data_dir + 'destinations.csv')

In [None]:
test.head()

In [None]:
destinations.head()