This notebook covers several experiments (binary classification and three-class classification) along with some findings observed while building those models. The NDCG scorer used in the notebook comes from a script published on Kaggle. NDCG scorer: https://www.kaggle.com/davidgasquez/airbnb-recruiting-new-user-bookings/ndcg-scorer

## Part 1 : date_first_booking based prediction model

As seen during the exploratory analysis, 58% of the records have 'NDF' (No Destination Found) as the country destination. The analysis also found a 1-to-1 correspondence between a NaN 'date_first_booking' and 'NDF'. That is, date_first_booking is NaN for every instance whose country destination is 'NDF', and only for those.

In this part, I use that information to build a model that predicts country_destination as NDF whenever date_first_booking is NaN. I therefore drop all rows labelled NDF from the training data and train a RandomForest classifier on the remaining rows. Once the model has produced predictions for the remaining countries, I overwrite the prediction with NDF for every test record whose date_first_booking is NaN.
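The claimed relationship is easy to verify before relying on it. The following is a minimal sketch on a small made-up frame (the rows are hypothetical, and the raw string labels are an assumption; in the preprocessed file NDF is encoded as 0). It cross-tabulates a missing date_first_booking against the NDF label.

```python
import pandas as pd

# Hypothetical toy rows; the real check would run on the training file.
toy = pd.DataFrame({
    'date_first_booking': [None, '2014-01-05', None, '2014-03-02', None],
    'country_destination': ['NDF', 'US', 'NDF', 'FR', 'NDF'],
})

dfb_missing = toy['date_first_booking'].isnull()
is_ndf = toy['country_destination'] == 'NDF'

# Rows: is the booking date missing?  Columns: is the label NDF?
table = pd.crosstab(dfb_missing, is_ndf)
print(table)

# The relationship is exactly 1-to-1 only if the two masks agree on every row.
one_to_one = bool((dfb_missing == is_ndf).all())
print(one_to_one)
```

If the off-diagonal cells of the table are zero, the shortcut used in this part is safe for that data.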

### Load data and split train-test data.

In [50]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('preprocessed_airbnb_train.csv')
labels = data.loc[:,'country_destination']
data = data.drop(['country_destination'], axis=1)

[tr_data, te_data, 
 tr_labels, te_labels] = cross_validation.train_test_split(data, labels, random_state=20160202, test_size=0.33)

In [51]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import make_scorer

def dcg_score(y_true, y_score, k=5):
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])

    gain = 2 ** y_true - 1

    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)

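def ndcg_score(te_labels, predict, k):
    # One-hot encode the true labels, one column per class.
    # Fitting on the number of columns in `predict` (rather than the number
    # of samples) keeps the encoded matrix small.
    lb = LabelBinarizer()
    lb.fit(range(predict.shape[1]))
    T = lb.transform(te_labels)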

    scores = []

    # Iterate over each y_true and compute the DCG score
    for y_true, y_score in zip(T, predict):
        actual = dcg_score(y_true, y_score, k)
        best = dcg_score(y_true, y_true, k)  
        score = float(actual) / float(best)
        scores.append(score)

    return np.mean(scores)

# NDCG Scorer function
ndcg_scorer = make_scorer(ndcg_score, needs_proba=True, k=5)
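Because every user has exactly one true destination, the per-sample NDCG computed above reduces to 1 / log2(rank + 1), where rank is the position of the true class in the top-k list (and 0 if it is not in the list). The sketch below illustrates that on made-up probabilities; `ndcg_single` is a hypothetical helper, not part of the Kaggle script.

```python
import numpy as np

def ndcg_single(true_label, proba, k=5):
    """NDCG@k for one sample with exactly one relevant class."""
    ranking = np.argsort(proba)[::-1][:k]        # top-k classes, best first
    hits = np.where(ranking == true_label)[0]    # position of the true class, if present
    if len(hits) == 0:
        return 0.0
    return 1.0 / np.log2(hits[0] + 2)            # rank 1 -> 1.0, rank 2 -> ~0.63

proba = np.array([0.1, 0.6, 0.3])   # hypothetical probabilities for classes 0, 1, 2
print(ndcg_single(1, proba))        # true class ranked first
print(ndcg_single(2, proba))        # true class ranked second
print(ndcg_single(0, proba, k=2))   # true class outside the top 2
```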

### Drop all NDF rows from the training set

In [58]:
train = tr_data.copy()
train['country_destination'] = tr_labels

# NDF is encoded as 0
train_ndf_dropped = train[ train.country_destination != 0 ]

tr_labels = train_ndf_dropped.loc[:,'country_destination']
tr_data = train_ndf_dropped.drop(['country_destination'], axis=1)

### Train using RandomForest

In [59]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=600,criterion='gini', min_samples_leaf=50)
rf = rf.fit(tr_data, tr_labels)

In [60]:
rf_predict = rf.predict_proba( te_data )

### Modify the prediction: NDF probability is 1.0 where date_first_booking is NaN and 0 otherwise

In [61]:
# Prepend a zero-probability column for NDF (class 0), which was removed from the training set
x = np.insert(rf_predict, 0, 0, axis=1)

In [62]:
te = te_data.copy()
for i in range(12):
    te['pr%d'%i] = x[:, i]

In [63]:
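# dfb_year == 0 marks rows with no date_first_booking:
# force NDF (pr0) to 1.0 and every country probability to 0.0
te.loc[ te.dfb_year == 0, 'pr0'] = 1.0
for i in range(1,12):
    te.loc[ te.dfb_year == 0, 'pr%d'%i] = 0.0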

In [64]:
# Name the probability columns explicitly so unrelated columns starting with 'pr' are never picked up
filter_col = ['pr%d' % i for i in range(12)]
rf_predict_fixed = te[filter_col].values
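The same override can be done directly on the probability matrix without adding columns to the test frame. This is a sketch on made-up numbers, assuming a boolean mask `dfb_missing` that plays the role of `te.dfb_year == 0`; the array names are hypothetical.

```python
import numpy as np

# Hypothetical inputs: 4 test rows, 3 non-NDF classes.
country_proba = np.array([[0.7, 0.2, 0.1],
                          [0.5, 0.3, 0.2],
                          [0.6, 0.3, 0.1],
                          [0.2, 0.5, 0.3]])
dfb_missing = np.array([True, False, True, False])  # stands in for te.dfb_year == 0

# Prepend an NDF column (class 0) filled with zeros.
full_proba = np.insert(country_proba, 0, 0.0, axis=1)

# Rows without a booking date: all probability mass on NDF, none on the countries.
full_proba[dfb_missing] = 0.0
full_proba[dfb_missing, 0] = 1.0
print(full_proba)
```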

### Get NDCG score for this model

In [65]:
score = ndcg_score(te_labels.values, rf_predict_fixed, k=5)
print(score)

0.927350768403


This shows that a 1-to-1 correlation between certain attribute values and the labels can be leveraged to improve the score drastically. The result is likely optimistic, though, because date_first_booking effectively encodes the label.

## Part 2 : NDF prediction

In order to get a more realistic model, in this section I build models on the preprocessed data without the date_first_booking information.

The model here is a two-level classifier. The first level is a binary classifier that predicts whether the destination is NDF or not. In the second level, a multi-class classifier assigns the non-NDF records to one of the valid countries. The first-level classifier is chosen by comparing the performance of Linear Regression (with its output thresholded at 0.5) and a boosted DecisionTree. The second-level classifier is a RandomForest.
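The notebook does not show how the two levels are merged into one ranked prediction. One possible way, given here only as an assumption, is to treat the first level's output as P(not NDF) and scale the second level's country probabilities by it. All numbers and names below are hypothetical.

```python
import numpy as np

# Hypothetical outputs for 2 users and 3 valid countries.
p_booked = np.array([0.2, 0.9])                 # level 1: P(not NDF)
p_country = np.array([[0.7, 0.2, 0.1],          # level 2: P(country | not NDF)
                      [0.5, 0.4, 0.1]])

# P(NDF) = 1 - P(not NDF);  P(country) = P(not NDF) * P(country | not NDF)
combined = np.column_stack([1.0 - p_booked, p_booked[:, None] * p_country])
print(combined)
```

Each row of `combined` sums to 1, with NDF in column 0, so it could be passed to an NDCG scorer in the same layout as the Part 1 predictions.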

### Update labels to classify as either NDF or non-NDF destination

In [11]:
# Drop date_first_booking information and apply StandardScaler to the remaining data
ndf_data = data.drop(['dfb_year', 'dfb_month', 'dfb_day'], axis=1)
ndf_data = pd.DataFrame(preprocessing.StandardScaler().fit_transform(ndf_data))

[tr_data, te_data, 
 tr_labels, te_labels] = cross_validation.train_test_split(ndf_data, labels, random_state=20160202, test_size=0.33)

tr_labels_adjusted = tr_labels.copy()
# Mark all valid countries as 1 and NDF as 0
tr_labels_adjusted.loc[ tr_labels_adjusted[::] > 0 ] = 1

In [13]:
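# lr_predict holds the Linear Regression output for te_data (the fitting cell is not shown)
# Threshold the continuous output at 0.5 to get hard 0/1 predictions
lr_predict_corrected_2 = [ 1 if v > 0.5 else 0 for v in lr_predict.flatten()]
te_labels_adjusted = te_labels.copy()
te_labels_adjusted.loc[ te_labels_adjusted[::] > 0 ] = 1
correct = 0
for l, t in zip(te_labels_adjusted, lr_predict_corrected_2):
    if l==t:
        correct += 1

# Accuracy: fraction of test rows whose NDF / non-NDF prediction is correct
print(float(correct)/float(len(te_labels_adjusted)))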

0.595408793424


Linear Regression doesn't classify the NDF labels very well: its accuracy of about 59.5% is barely above the roughly 58% obtained by always predicting NDF.

## Part 3 : Binary classification for US/Non-US prediction without NDF

In this part I try to build a model that predicts among the valid countries, so all rows with NDF as country_destination are dropped. The data exploration showed that almost 30% of the full data has U.S.A. as the country_destination. So in this model, I use Linear Regression, a Boosted Decision Tree and LinearSVC to classify between US and non-US, and select the best method.

In [24]:
data = pd.read_csv('preprocessed_airbnb_train.csv')
data = data[ data.country_destination != 0 ]
labels = data.loc[:,'country_destination']
data = data.drop(['country_destination'], axis=1)

data = pd.DataFrame(preprocessing.StandardScaler().fit_transform(data))

[tr_data, te_data, 
 tr_labels, te_labels] = cross_validation.train_test_split(data, labels, random_state=20160202, test_size=0.33)

tr_labels_adjusted = tr_labels.copy()
# Mark all non-US countries as 0 and US is 1
tr_labels_adjusted.loc[ tr_labels_adjusted[::] > 1 ] = 0

### Binary classifier using AdaBoost Decision Tree

In [27]:
dt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=200)
dt.fit(tr_data, tr_labels_adjusted.values.ravel())
dt_predict = dt.predict(te_data)

In [28]:
dt_predict_corrected_2 = [ 1 if v > 0.5 else 0 for v in dt_predict.flatten()]
te_labels_adjusted = te_labels.copy()
te_labels_adjusted.loc[ te_labels_adjusted[::] > 1 ] = 0
correct = 0
for l, t in zip(te_labels_adjusted, dt_predict_corrected_2):
    if l==t:
        correct += 1

print(float(correct)/float(len(te_labels_adjusted)))

0.7030334015


All three models gave an accuracy of about 70% (only the AdaBoost run is shown above). This is because, after removing the NDF rows, about 70% of the data belongs to the US class, so blindly predicting US for every instance gives the same accuracy. The data is highly skewed, and none of the models beat this 70% baseline.
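The "blind prediction" baseline can be made explicit with scikit-learn's `DummyClassifier`. This is a sketch on synthetic labels with the same 70/30 skew, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with the same 70/30 skew as the US / non-US split.
y = np.array([1] * 70 + [0] * 30)
X = np.zeros((len(y), 1))           # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
baseline_acc = baseline.score(X, y)
print(baseline_acc)
```

Any real classifier has to beat `baseline_acc` before its accuracy says anything about the features.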

### Trying to change the proportion of US and Non-US countries

In [32]:
labels.value_counts()

1     62376
2     10094
3      5023
4      2835
5      2324
6      2249
7      1428
8      1061
9       762
10      539
11      217
Name: country_destination, dtype: int64

The distribution of the labels above shows a disproportionate number of records labelled as US. So I tried to balance the dataset by sampling only 25,000 of the records labelled US.

In [36]:
data = pd.read_csv('preprocessed_airbnb_train.csv')
data = data[ data.country_destination != 0 ]
data_us = data[ data.country_destination == 1 ]
data_us = data_us.sample(n=25000)
data = data[ data.country_destination != 1 ]
# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
data = pd.concat([data, data_us])

labels = data.loc[:,'country_destination']
data = data.drop(['country_destination'], axis=1)

data = pd.DataFrame(preprocessing.StandardScaler().fit_transform(data))

[tr_data, te_data, 
 tr_labels, te_labels] = cross_validation.train_test_split(data, labels, random_state=20160202, test_size=0.33)

tr_labels_adjusted = tr_labels.copy()
# Mark all non-US countries as 0 and US is 1
tr_labels_adjusted.loc[ tr_labels_adjusted[::] > 1 ] = 0

In [37]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(tr_data, tr_labels_adjusted.values.ravel())
lr_predict = lr.predict(te_data)

In [38]:
lr_predict_corrected_2 = [ 1 if v > 0.5 else 0 for v in lr_predict.flatten()]
te_labels_adjusted = te_labels.copy()
te_labels_adjusted.loc[ te_labels_adjusted[::] > 1 ] = 0
correct = 0
for l, t in zip(te_labels_adjusted, lr_predict_corrected_2):
    if l==t:
        correct += 1

print(float(correct)/float(len(te_labels_adjusted)))

0.525638010114


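Unfortunately this model did not improve on the baseline either: with roughly 26,500 non-US and 25,000 US records, always predicting non-US already scores about 51.5%, and the model reaches only 52.6%.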
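The baseline after undersampling follows directly from the label counts printed earlier in this part. The sketch below redoes that arithmetic; it assumes the train/test split keeps roughly the same class proportions.

```python
# Label counts after dropping NDF, copied from the labels.value_counts() output.
counts = {1: 62376, 2: 10094, 3: 5023, 4: 2835, 5: 2324, 6: 2249,
          7: 1428, 8: 1061, 9: 762, 10: 539, 11: 217}

n_us_sampled = 25000
n_non_us = sum(v for k, v in counts.items() if k != 1)

# Share of the majority class before and after undersampling the US rows.
share_us_before = counts[1] / float(sum(counts.values()))
share_non_us_after = n_non_us / float(n_non_us + n_us_sampled)
print(round(share_us_before, 3))
print(round(share_non_us_after, 3))
```

After undersampling, non-US becomes the (slight) majority class, so the baseline to beat drops from about 70% to about 51.5%.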

## Part 4 : 3 Class Prediction

Since the majority of the data is NDF or US, it would be interesting to see the performance of a model that predicts whether the destination is US, NDF or other. So the labels are modified such that every country other than 'NDF' and 'US' is changed to 'other'.

In [43]:
data = pd.read_csv('preprocessed_airbnb_train.csv')
data = data.drop(['dfb_year', 'dfb_month', 'dfb_day'], axis=1)
labels = data.loc[:,'country_destination'].copy()
# Collapse every country other than NDF (0) and US (1) into a single class 2 ('other')
labels.loc[ labels > 1 ] = 2
data = data.drop(['country_destination'], axis=1)

data = pd.DataFrame(preprocessing.StandardScaler().fit_transform(data))

[tr_data, te_data, 
 tr_labels, te_labels] = cross_validation.train_test_split(data, labels, random_state=20160202, test_size=0.33)


In [45]:
rf = RandomForestClassifier(n_estimators=600,criterion='gini', min_samples_leaf=50, random_state=20160202)
rf = rf.fit(tr_data, tr_labels)
rf_predict = rf.predict_proba( te_data )

In [46]:
score = ndcg_score(te_labels.values, rf_predict, k=3)
print(score)

0.845515158034


In [47]:
rf.score(te_data, te_labels)

0.6252076264569344

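The nDCG score is 0.845, which looks good, but with only three classes and k=3 the true class always appears in the ranking, so the score cannot fall below 0.5. The plain accuracy of 62.5% is the more conservative measure of this model.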
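To put the 0.845 in context, the sketch below computes two reference points for NDCG with three classes and k=3, assuming one true class per sample: the worst possible score and the expected score of a ranking chosen uniformly at random.

```python
import numpy as np

ranks = np.array([1, 2, 3])                 # possible positions of the true class
ndcg_by_rank = 1.0 / np.log2(ranks + 1)     # NDCG when the true class sits at each rank

worst_case = ndcg_by_rank.min()             # true class always ranked last
random_guess = ndcg_by_rank.mean()          # every rank equally likely
print(worst_case)
print(random_guess)
```

A random ranking already scores about 0.71 on this scale, so the gap to 0.845 is narrower than the raw number suggests.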