# Expedia Hotel Recommendation

Expedia provided logs of customer behavior for a Kaggle competition (https://www.kaggle.com/c/expedia-hotel-recommendations).

The goal of the analysis is to predict the booking outcome (hotel cluster) for a user event.

This notebook shows a machine learning approach, based on the Python scikit-learn library, to achieve this goal.

### File Descriptions

train.csv - the training set

test.csv - the test set

destinations.csv - hotel search latent attributes

### Loading data sets

In [2]:
### Import libraries
import pandas as pd
import numpy as np
from datetime import datetime

In [3]:
train = pd.read_csv("train.csv", parse_dates=['date_time', 'srch_ci', 'srch_co'])

In [4]:
test = pd.read_csv("test.csv", parse_dates=['srch_ci','srch_co'])

In [5]:
destinations = pd.read_csv("destinations.csv")
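
With tens of millions of rows in train.csv, memory use can become a concern. A minimal sketch of one way to reduce it, assuming narrower integer types are wide enough for the chosen columns (the dtype mapping and the sample rows below are illustrative assumptions, and a tiny in-memory CSV stands in for the real file):

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (hypothetical rows, real column names).
csv_text = io.StringIO(
    "date_time,site_name,is_booking,cnt,hotel_cluster\n"
    "2014-08-11 07:46:59,2,0,3,1\n"
    "2014-08-11 08:22:12,2,1,1,1\n"
    "2014-08-09 18:05:16,11,0,1,21\n"
)

# Assumed dtypes: smaller than pandas' default int64.
dtypes = {"site_name": "int16", "is_booking": "int8", "hotel_cluster": "int8"}

# usecols skips columns that are not needed; dtype narrows the ones that are kept.
df = pd.read_csv(
    csv_text,
    usecols=["date_time", "site_name", "is_booking", "hotel_cluster"],
    dtype=dtypes,
    parse_dates=["date_time"],
)
print(df.dtypes)
```

The same `usecols` and `dtype` arguments can be passed when reading the real files; whether the narrower types are safe depends on the actual value ranges, which should be checked first.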

### Check data frame dimensions

In [6]:
train.shape

(37670293, 24)

In [7]:
test.shape

(2528243, 22)

In [8]:
destinations.shape

(62106, 150)

### Explore datasets

In [9]:
# summary of train data
#train.info()

In [10]:
## Check correlation of column features to hotel_cluster
train.corr(numeric_only=True)["hotel_cluster"]

site_name                   -0.022408
posa_continent               0.014938
user_location_country       -0.010477
user_location_region         0.007453
user_location_city           0.000831
orig_destination_distance    0.007260
user_id                      0.001052
is_mobile                    0.008412
is_package                   0.038733
channel                      0.000707
srch_adults_cnt              0.012309
srch_children_cnt            0.016261
srch_rm_cnt                 -0.005954
srch_destination_id         -0.011712
srch_destination_type_id    -0.032850
is_booking                  -0.021548
cnt                          0.002944
hotel_continent             -0.013963
hotel_country               -0.024289
hotel_market                 0.034205
hotel_cluster                1.000000
Name: hotel_cluster, dtype: float64
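
All correlations are close to zero. Since hotel_cluster is a categorical label stored as an integer, a linear correlation with it is not very informative in any case. One alternative way to look for signal, sketched here on hypothetical toy data, is to measure how concentrated the clusters are within each value of a feature:

```python
import pandas as pd

# Hypothetical toy data with two real column names.
toy = pd.DataFrame({
    "srch_destination_id": [1, 1, 1, 2, 2, 2],
    "hotel_cluster":       [5, 5, 7, 9, 9, 9],
})

# For each destination, the share of rows taken by its most frequent cluster.
# value_counts sorts in descending order, so the first entry is the largest share.
top_share = (
    toy.groupby("srch_destination_id")["hotel_cluster"]
       .agg(lambda s: s.value_counts(normalize=True).iloc[0])
)
print(top_share)
```

A share near 1 means the feature value almost determines the cluster; a share near 1/100 would mean it carries almost no information.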

In [11]:
# fill missing distances with 0, then change the column format from float to integer
train['orig_destination_distance'] = train['orig_destination_distance'].fillna(0).astype(int)
test['orig_destination_distance'] = test['orig_destination_distance'].fillna(0).astype(int)
test_sub=pd.DataFrame(test.drop(['date_time', 'srch_ci','srch_co', 'is_mobile'], axis=1))
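
Filling missing distances with 0 makes them indistinguishable from rows whose distance really is (or truncates to) zero. A possible alternative, sketched on hypothetical toy values, records which rows were missing before filling them. The column name `distance_missing` is an assumption for illustration and is not part of the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values for the distance column.
toy = pd.DataFrame({"orig_destination_distance": [2234.26, np.nan, 0.0]})

# Flag the missing rows first, so that the information survives the fill.
toy["distance_missing"] = toy["orig_destination_distance"].isna().astype(int)

# Fill and convert to integer, as in the cell above (astype(int) truncates).
toy["orig_destination_distance"] = (
    toy["orig_destination_distance"].fillna(0).astype(int)
)
print(toy)
```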

In [13]:
# Number of unique user ids in the train and test datasets
# (the notebook displays only the value of the last expression, i.e. the test count)
len(train.user_id.unique())
len(test.user_id.unique())

1181577

In [14]:
#list(train.hotel_cluster)
len(train.hotel_cluster.unique())

100

In [16]:
train_c = list(train.columns)
test_c = list(test.columns)

In [18]:
## Columns in train and not test
tr = [x for x in train_c if x not in test_c]
tr

['is_booking', 'cnt', 'hotel_cluster']

In [19]:
## Columns in test and not train
te = [x for x in test_c if x not in train_c]
te

['id']
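
The same comparison can be written with pandas index operations. A small sketch on hypothetical empty frames that reuse a few of the real column names:

```python
import pandas as pd

# Hypothetical frames with a subset of the real column names.
train_toy = pd.DataFrame(columns=["user_id", "is_booking", "cnt", "hotel_cluster"])
test_toy = pd.DataFrame(columns=["id", "user_id"])

# Index.difference returns the labels present in one index but not in the other.
only_in_train = train_toy.columns.difference(test_toy.columns).tolist()
only_in_test = test_toy.columns.difference(train_toy.columns).tolist()
print(only_in_train, only_in_test)
```

Note that `difference` returns the labels sorted, so the order can differ from the list comprehension above.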

### Model training and Validation

In [20]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Python list of features
feature_cols = [
    'site_name', 'posa_continent', 'user_location_country',
    'user_location_region', 'user_location_city', 'orig_destination_distance',
    'user_id', 'is_mobile', 'is_package',
    'channel', 'srch_adults_cnt',
    'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id',
    'srch_destination_type_id', 'hotel_continent',
    'hotel_country', 'hotel_market',
]

X = train[feature_cols]

# select hotel_cluster for classification
y = train['hotel_cluster']

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


# train model with KNN algorithm
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

In [21]:
# Accuracy for the KNN model
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))

0.30623980231


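An accuracy of about 0.31 is low in absolute terms, although random guessing among the 100 hotel clusters would be right only about 1% of the time.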
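
To put an accuracy figure in context, it can help to compare it with a trivial baseline and with other values of `n_neighbors`. A minimal sketch on synthetic data (generated with `make_classification`, not the Expedia data; the sample size, class count and list of k values are arbitrary assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 500 rows, 8 features, 4 classes.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=4, random_state=1
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, random_state=1)

scores = {}

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(Xs_train, ys_train)
scores["baseline"] = accuracy_score(ys_test, baseline.predict(Xs_test))

# KNN with a few neighbourhood sizes.
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(Xs_train, ys_train)
    scores["knn_k=%d" % k] = accuracy_score(ys_test, model.predict(Xs_test))

for name, value in scores.items():
    print(name, round(value, 3))
```

On the real data the same loop would be much slower, so running it on a random subsample first may be more practical.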

In [61]:
# print the first 30 true and predicted responses
print('True:     ', y_test.values[0:30])
print('Predicted:', y_pred[0:30])

True:      [74 85 13 98 61 77 17 71 24 41 83 84  1 34  5 48 90 61  5 39  8 45  9 46  5
 64 34 36  3 92]
Predicted: [74 85 55 95 61  7 47 31 24 96  7 93  1 11 80 32 66 61 95 16  8 45 99 25 48
 52 11 59  3 52]


### Make prediction on test.csv

In [38]:
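# predict from the same feature columns, in the same order, that the model was trained on
# (test_sub contains 'id' and lacks 'is_mobile', so it does not match feature_cols)
test_sub['hotel_cluster'] = knn.predict(test[feature_cols])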

In [62]:
df=pd.DataFrame(test_sub, columns=['id','hotel_cluster'])

# write one prediction per user event to file
df.to_csv('file_submission.csv', index=False)
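
The file above holds a single predicted cluster per row. If the competition accepts several ranked candidates per row (assumed here to be a space-separated list; the exact format should be checked on the competition page), the ranking can be derived from `predict_proba`. A minimal sketch on hypothetical toy data, using `n_neighbors=3` because a single neighbour yields only one candidate:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two features, three cluster labels.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]])
y_toy = np.array([10, 10, 20, 20, 30, 30])

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

X_new = np.array([[0, 0], [9, 9]])
proba = knn_toy.predict_proba(X_new)  # shape: (rows, number of classes)

top_n = 2  # assumption for the toy data; a real submission might use 5
order = np.argsort(-proba, axis=1)[:, :top_n]  # column indices, best first
top_labels = knn_toy.classes_[order]           # map indices back to labels

submission = pd.DataFrame({
    "id": range(len(X_new)),
    "hotel_cluster": [" ".join(str(c) for c in row) for row in top_labels],
})
print(submission)
```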