In this take-home test, I build an xgboost model to predict whether a listing will receive a booking request for a given calendar night. Before diving into the model, let's first take a look at the data.

In [None]:
# load libraries

import pandas as pd
from sklearn.utils import shuffle
from sklearn import metrics, preprocessing
import xgboost as xgb
import numpy as np

In [5]:
# load data
data = pd.read_csv('TH_data_challenge.tsv', sep='\t')

In [14]:
# look at the data
data.tail()

Unnamed: 0,dim_is_requested,ds_night,ds,id_listing_anon,id_user_anon,m_effective_daily_price,m_pricing_cleaning_fee,dim_market,dim_lat,dim_lng,...,general_market_m_reservation_requests_0_6_ds_night,general_market_m_is_booked_0_6_ds_night,m_available_listings_ds_night,kdt_score,r_kdt_listing_views_0_6_avg_n100,r_kdt_n_active_n100,r_kdt_n_available_n100,r_kdt_m_effective_daily_price_n100_p50,r_kdt_m_effective_daily_price_available_n100_p50,r_kdt_m_effective_daily_price_booked_n100_p50
184274,True,2015-06-29,2015-05-30,d55be0a0-f3c9-4771-9d2f-1f1cb692e32c,8d6018eb-03e5-424c-9eb8-d65a4d244a5b,102.666667,0,San Francisco,37.753384,-122.500374,...,31.142857,1,2014,1.177419,6.358163,61,32,95.0,99.0,80
184275,True,2015-08-17,2015-07-18,e3e9dc53-125a-474e-935d-bfb5868c3966,7e3ef9e8-537a-41de-add9-b795e30b721e,160.333333,40,San Francisco,37.765903,-122.42749,...,32.571429,1,2032,1.153846,2.163265,61,24,126.5,135.0,100
184276,True,2015-09-03,2015-08-04,cc7e4269-f92a-4b8e-a444-e951f80f7b76,a43e3fae-c169-481a-ad43-121e9ab64fb0,36.084746,30,San Francisco,37.773777,-122.40313,...,38.428571,1,2242,1.559524,1.071429,10,6,70.0,70.0,55
184277,False,2015-10-07,2015-09-07,b5266b83-1dc7-47ba-8d42-dd66037f79b8,41d906f7-2556-4309-a406-f9a7a9e2c30c,0.0,0,San Francisco,37.758934,-122.4373,...,36.142857,1,2631,1.211268,1.852041,96,31,220.0,271.5,228
184278,False,2015-12-15,2015-11-15,3ff697ff-2c9c-40dc-b9bb-71e4fc62d4b8,7066dc20-25e5-45e1-88d0-ec897fb4aa94,250.0,80,San Francisco,37.77246,-122.41706,...,25.857143,1,3561,0.75,1.35102,69,27,217.5,250.0,145


In [16]:
print(data.shape)

(184279, 46)


The task is a binary classification problem: the target is 'dim_is_requested', and the remaining 45 columns are candidate features (one of them, 'Unnamed: 0', appears to be just the row index carried over from the file). Inspecting the features shows that 'ds_night' is equivalent to 'ds_night_day_of_year' and that 'ds' + 30 days = 'ds_night'. Therefore, we can remove 'ds_night' and 'ds' as they are redundant. Similarly, 'id_listing_anon' is equivalent to 'id_user_anon', and hence one of them can be removed from the feature list.
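The redundancy claims above can be checked mechanically rather than by eye. Below is a minimal sketch on a few toy rows, written in plain Python so that it runs without the dataset. The dates are copied from the sample output above, the short ids are made up, and the helper names `constant_offset` and `one_to_one` are hypothetical.

```python
from datetime import date, timedelta

# Toy rows standing in for the real data (ids are made up for illustration).
rows = [
    {"ds": date(2015, 5, 30), "ds_night": date(2015, 6, 29),
     "id_listing_anon": "a", "id_user_anon": "u1"},
    {"ds": date(2015, 7, 18), "ds_night": date(2015, 8, 17),
     "id_listing_anon": "b", "id_user_anon": "u2"},
    {"ds": date(2015, 8, 4), "ds_night": date(2015, 9, 3),
     "id_listing_anon": "a", "id_user_anon": "u1"},
]


def constant_offset(rows, days=30):
    """True if ds_night is always exactly `days` after ds."""
    return all(r["ds_night"] - r["ds"] == timedelta(days=days) for r in rows)


def one_to_one(rows, left, right):
    """True if each `left` value maps to exactly one `right` value and vice versa."""
    forward, backward = {}, {}
    for r in rows:
        if forward.setdefault(r[left], r[right]) != r[right]:
            return False
        if backward.setdefault(r[right], r[left]) != r[left]:
            return False
    return True


offset_ok = constant_offset(rows)
ids_ok = one_to_one(rows, "id_listing_anon", "id_user_anon")
print(offset_ok)
print(ids_ok)
```

If either check fails on the full data, the corresponding column is not strictly redundant and dropping it loses information.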

In [None]:
# filter redundant features
filters = ['ds_night', 'ds', 'id_user_anon']
data = data.drop(filters, axis=1)

In [19]:
# look at the filtered data
data.tail()

Unnamed: 0,dim_is_requested,id_listing_anon,m_effective_daily_price,m_pricing_cleaning_fee,dim_market,dim_lat,dim_lng,dim_room_type,dim_person_capacity,dim_is_instant_bookable,...,general_market_m_reservation_requests_0_6_ds_night,general_market_m_is_booked_0_6_ds_night,m_available_listings_ds_night,kdt_score,r_kdt_listing_views_0_6_avg_n100,r_kdt_n_active_n100,r_kdt_n_available_n100,r_kdt_m_effective_daily_price_n100_p50,r_kdt_m_effective_daily_price_available_n100_p50,r_kdt_m_effective_daily_price_booked_n100_p50
184274,True,d55be0a0-f3c9-4771-9d2f-1f1cb692e32c,102.666667,0,San Francisco,37.753384,-122.500374,Private room,1,False,...,31.142857,1,2014,1.177419,6.358163,61,32,95.0,99.0,80
184275,True,e3e9dc53-125a-474e-935d-bfb5868c3966,160.333333,40,San Francisco,37.765903,-122.42749,Private room,2,True,...,32.571429,1,2032,1.153846,2.163265,61,24,126.5,135.0,100
184276,True,cc7e4269-f92a-4b8e-a444-e951f80f7b76,36.084746,30,San Francisco,37.773777,-122.40313,Shared room,4,False,...,38.428571,1,2242,1.559524,1.071429,10,6,70.0,70.0,55
184277,False,b5266b83-1dc7-47ba-8d42-dd66037f79b8,0.0,0,San Francisco,37.758934,-122.4373,Entire home/apt,2,True,...,36.142857,1,2631,1.211268,1.852041,96,31,220.0,271.5,228
184278,False,3ff697ff-2c9c-40dc-b9bb-71e4fc62d4b8,250.0,80,San Francisco,37.77246,-122.41706,Entire home/apt,4,False,...,25.857143,1,3561,0.75,1.35102,69,27,217.5,250.0,145


To train an xgboost classifier, the data has to be converted into DMatrix format. DMatrix does not yet support 'object' columns, so all 'object'-typed columns are label-encoded first.

In [None]:
# encode object types
for f in data.columns:
    if data[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(data[f].values))
        data[f] = lbl.transform(list(data[f].values))

In [None]:
# shuffle the data and create dummies for categorical variables
y_var = 'dim_is_requested'
cat_var = ['dim_market', 'dim_room_type', 'dim_is_instant_bookable', 'cancel_policy', 
           'dim_has_wireless_internet']

data = shuffle(data, random_state=0)
data = pd.get_dummies(data, columns=cat_var)

Four variables can be treated as either numerical or categorical: 'ds_night_day_of_week', 'ds_night_day_of_year', 'ds_checkin_gap' and 'ds_checkout_gap'. I tested all the combinations in the cross-validation step (explained later in this writeup) and decided to treat them as numerical, which gave higher prediction accuracy.
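To make the numerical-versus-categorical choice concrete, here is a small sketch of the two encodings for a day-of-week value. It assumes the days are coded 0 to 6, which the dataset may or may not follow, and the function names are hypothetical.

```python
def as_numeric(day_of_week):
    """Keep the value as a single ordered number (one column)."""
    return [float(day_of_week)]


def as_one_hot(day_of_week, n_levels=7):
    """Expand the value into n_levels indicator columns (levels 0..n_levels-1 assumed)."""
    return [1.0 if level == day_of_week else 0.0 for level in range(n_levels)]


numeric = as_numeric(5)
one_hot = as_one_hot(5)
print(numeric)   # [5.0]
print(one_hot)   # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```

With the numerical encoding, a tree can isolate a contiguous range of days with a single threshold split. The one-hot encoding needs one split per level, and for a variable like day of year it would add several hundred sparse columns.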

In [None]:
# split data into training and testing sets
training = data[:int(len(data)*0.8)]
testing = data[int(len(data)*0.8):]

In [24]:
print(training.shape)
print(testing.shape)

(147423, 54)
(36856, 54)


In [None]:
# convert training data into DMatrix format
xgdmat = xgb.DMatrix(training[[i for i in training.columns if i != y_var]], training[y_var], 
                     missing=np.nan)

In [37]:
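# specify parameters for xgboost and train the xgboost model
# NOTE: 'alpha' is the L1 regularization term. 'beta' is not a documented
# xgboost parameter (the L2 term is named 'lambda'), so it likely has no
# effect here; it is kept unchanged so the run matches the output reported below.
our_params = {'eta':0.1, 'seed':0, 'subsample':0.8, 'colsample_bytree':0.8, 
              'objective':'binary:logistic', 'max_depth':10, 'min_child_weight':5, 
              'alpha':100, 'beta':100}
xgb_model = xgb.train(our_params, xgdmat, num_boost_round=300)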

The dataset has a small number of missing values. I did not handle them explicitly because 1) there are not many of them; and 2) xgboost handles missing values internally, which usually performs better than manually filling them in with zeros or medians. At each split, xgboost automatically learns the best direction to send rows whose value is missing. Equivalently, this can be viewed as automatically "learning" the best imputation for missing values based on the reduction in training loss.
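The idea of a learned default direction can be illustrated with a deliberately simplified sketch. It is not xgboost's actual implementation, which scores splits using gradient statistics. Here a single split is evaluated with a plain squared-error loss, missing values are represented as `None`, and the function names are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / float(len(values))
    return sum((v - mean) ** 2 for v in values)


def best_default_direction(xs, ys, threshold):
    """Pick the side for missing values (None) that gives the lower loss."""
    left = [y for x, y in zip(xs, ys) if x is not None and x < threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"


# Made-up data: the rows with a missing feature value all have label 1.
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [0, 0, 1, 1, 1, 1]
direction = best_default_direction(xs, ys, threshold=5.0)
print(direction)  # right
```

Because the rows with missing values look like the right-hand group, sending them right gives the lower loss, which is the sense in which the direction acts as a learned imputation.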

In [38]:
# score the test data and compute accuracy of classification
y_pred = xgb_model.predict(xgb.DMatrix(testing[[i for i in testing.columns if i != y_var]],
                                       testing[y_var], missing=np.nan))
predictions = [round(value) for value in y_pred]
accuracy = metrics.accuracy_score(testing[y_var].values, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 83.12%
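
An accuracy figure is easier to interpret next to the accuracy of always predicting the most common class. The sketch below shows the 0.5 thresholding used above together with such a majority-class baseline. The labels and scores are made up for illustration and are not taken from the dataset, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))


def majority_baseline(y_true):
    """Accuracy of always predicting the most common class (labels are 0/1)."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / float(len(y_true))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.3, 0.4, 0.2]
y_pred = [1 if s > 0.5 else 0 for s in scores]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline(y_true)
print(model_acc)     # 0.75
print(baseline_acc)  # 0.625
```

On the real test set, the same comparison would use `testing[y_var]` as the labels.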


This is a good accuracy score for this problem, which is what we would expect given 1) the capabilities of the xgboost model; 2) the modest complexity of the problem; and 3) the fact that all the major parameters were tuned via extensive cross-validation experiments. Specifically, I tested all combinations of 'eta' in [0.01, 0.05, 0.1, 0.2], 'max_depth' in [3, 5, 10], 'min_child_weight' in [1, 3, 5], alpha regularization in [1, 50, 100], beta regularization in [1, 50, 100], and the number of boosting rounds in [100, 200, 300]. As mentioned in previous steps, I also tested whether any of ['ds_night_day_of_week', 'ds_night_day_of_year', 'ds_checkin_gap', 'ds_checkout_gap'] should be treated as categorical features.
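The exhaustive search described above can be sketched as follows. This is only an outline of the enumeration, not the code that was actually run. `toy_score` is a hypothetical stand-in for a cross-validated accuracy, which in practice might come from wrapping `xgb.cv`. The key `lambda` is an assumption: the text calls this term "beta", while xgboost names its L2 regularization term `lambda`.

```python
from itertools import product

# Candidate values as listed in the text above.
grid = {
    "eta": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 5, 10],
    "min_child_weight": [1, 3, 5],
    "alpha": [1, 50, 100],
    "lambda": [1, 50, 100],  # assumption: the "beta" term from the text
    "num_boost_round": [100, 200, 300],
}


def iter_param_sets(grid):
    """Yield one dict per combination of candidate values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def pick_best(grid, score_fn):
    """Return the parameter set with the highest score."""
    return max(iter_param_sets(grid), key=score_fn)


def toy_score(params):
    # Hypothetical stand-in for a cross-validated accuracy; not a real model.
    return -abs(params["eta"] - 0.1) - abs(params["max_depth"] - 10)


all_sets = list(iter_param_sets(grid))
best = pick_best(grid, toy_score)
print(len(all_sets))  # 972
print(best["eta"], best["max_depth"])
```

The grid has 4 * 3 * 3 * 3 * 3 * 3 = 972 combinations, each of which needs a full cross-validation run, so the search is only practical because the dataset is modest in size.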

In [45]:
# output feature importance
print('Top Features:\n')
print(pd.DataFrame(xgb_model.get_fscore(),
                   index=[0]).transpose().sort_values(0, ascending=False).head(7))

Top Features:

                                  0
m_effective_daily_price        1413
dim_lat                        1050
dim_lng                         853
m_available_listings_ds_night   845
image_quality_score             796
price_booked_most_recent        770
ds_night_day_of_year            725




The top features align with our intuition, as price, location, timing and availability are usually the determining factors for listing bookings. (The scores above are split counts, i.e. how many times each feature is used to split across all trees.) In this case, image quality is also a key feature, which makes intuitive sense given the nature of online shopping.

Further Discussion:


* Baseline and model improvement: we could certainly start by modeling the task with a simple logistic regression baseline. However, given the nature of the problem and the structure of the data, an experienced data scientist should be able to judge (without implementing and testing) that gradient boosting methods (e.g., xgboost, lightgbm, etc.) are the way to go. In this test, I first implemented the model with parameters chosen by educated guess based on experience; then, because the dataset is not very large, I ran cross-validation to fine-tune the parameters and improve model performance.

* Identify new features: although many new features can be derived from the given dataset, I think 'closeness-to-major-attractions', derived from the location data, and 'ds-night-to-next-holiday', derived from the time data, would be especially useful. The likelihood that a listing is booked on a particular night depends heavily on 1) whether the listed property is close to major attractions such as national parks; and 2) whether a holiday is approaching around the predicted booking night.

* How to use the model and findings: 1) Since the predicted scores indicate how likely a listing is to be booked, Airbnb can use these probabilities to improve its listing rankings and drive more click-throughs and bookings. For example, 'probability' * 'profit' seems to me a very good ranking indicator. 2) The results suggest that certain features, such as pricing, location and image quality, are more important than others. Airbnb can develop site functionality to 'promote' listings that have particular advantages. For example, we can promote property A for its location and price, and property B for its high-quality photos and availability.

* Finally, the given dataset could potentially be used for many other prediction tasks. The most applicable one, I believe, is demand forecasting.
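
The two candidate features mentioned in the list above (closeness to a major attraction and days to the next holiday) could be derived roughly as follows. This is a minimal sketch under stated assumptions: the attraction coordinates and the holiday list are illustrative placeholders rather than part of the dataset, and the function names are hypothetical.

```python
import math
from datetime import date


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def days_to_next_holiday(night, holidays):
    """Days from `night` to the first holiday on or after it (None if none left)."""
    upcoming = [h for h in holidays if h >= night]
    return (min(upcoming) - night).days if upcoming else None


# Illustrative inputs only: an approximate attraction location and a short
# list of 2015 US holidays.
attraction = (37.8199, -122.4783)
holidays = [date(2015, 7, 4), date(2015, 9, 7),
            date(2015, 11, 26), date(2015, 12, 25)]

# Listing coordinates and night taken from the first sample row shown earlier.
distance = haversine_km(37.753384, -122.500374, attraction[0], attraction[1])
gap = days_to_next_holiday(date(2015, 6, 29), holidays)
print(round(distance, 1))
print(gap)  # 5
```

In a real pipeline the distance would probably be taken to the nearest of several attractions per market, and the holiday calendar would need to match the guests' likely countries of origin.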

Given the time limit, the above is the immediate approach I came up with. With more time and a better understanding of the data, I would definitely like to do more feature engineering and compare the results across different methods.