# XGBoost

* Fits an XGBoost model to the training data.
* Iterates over parameters with cross-validation.
* Currently ignores the date features because they expand into a large number of factor levels; waiting on improved preprocessing.
* Warning: cross-validation takes a long time.
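
The cross-validation loop itself is not included in this notebook. A minimal sketch of how such a parameter search could look with `xgb.cv` follows, run on synthetic data. The grid values, `nrounds`, and `nfold` are illustrative assumptions, not the settings actually used here.

```r
library(xgboost)

set.seed(1066)

# Synthetic three-class data standing in for the real training matrix
n <- 300
x <- matrix(rnorm(n * 4), ncol = 4)
y <- sample(0:2, n, replace = TRUE)
dtrain <- xgb.DMatrix(data = x, label = y)

# Hypothetical grid; the values actually searched are not recorded in this notebook
grid <- expand.grid(eta = c(0.1, 0.3), max_depth = c(3, 6))
grid$best_merror <- NA_real_

for (i in seq_len(nrow(grid))) {
  cv <- xgb.cv(
    params = list(
      objective   = "multi:softprob",
      eval_metric = "merror",
      num_class   = 3,
      eta         = grid$eta[i],
      max_depth   = grid$max_depth[i]
    ),
    data    = dtrain,
    nrounds = 10,
    nfold   = 3,
    verbose = FALSE
  )
  # Lowest mean held-out error across boosting rounds for this parameter set
  grid$best_merror[i] <- min(cv$evaluation_log$test_merror_mean)
}

# Parameter sets sorted from best to worst cross-validated error
grid[order(grid$best_merror), ]
```

Run time grows with the number of grid rows times `nfold` times `nrounds`, which is why the full search is slow on the real data.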

In [1]:
# Libraries
library(xgboost)
library(dplyr)
library(Matrix)
library(data.table)
library(Ckmeans.1d.dp)
library(e1071)
library(caret)

# Set Seed
set.seed(1066)

# Name of Run
NAME <- "_eg_1"


Attaching package: 'dplyr'

The following object is masked from 'package:xgboost':

    slice

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


Attaching package: 'data.table'

The following objects are masked from 'package:dplyr':

    between, last

Loading required package: lattice
Loading required package: ggplot2


**Currently removes date features because of the large number of factor levels.**

In [2]:
# Read the preprocessed data and drop rows with missing values
dat_raw <- readRDS("../Data/users_PP.RDS") %>%
    na.omit()

# Drop columns that are not used as predictors (id, dataset, age_cln, age_cln2)
# Date features are currently excluded as well
dat <- dat_raw %>%
    select(-c(id, dataset, age_cln, age_cln2)) %>%
    data.table(keep.rownames = FALSE)

In [3]:
# One-hot encoding  
# https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html
sparse_dat <- sparse.model.matrix(country_destination ~ . -1, data = dat)

# Find the training set
sparse_tr <- sparse_dat[dat_raw$dataset == "train",]
tr <- dat[dat_raw$dataset == "train",]
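
To illustrate what the formula `country_destination ~ . -1` does, here is a small sketch on a made-up data frame. The column names `target`, `gender`, and `age` are hypothetical and do not reflect the real schema of `users_PP.RDS`.

```r
library(Matrix)

# Toy data frame (hypothetical columns, not the real data)
toy <- data.frame(
  target = factor(c("US", "NDF", "US", "FR")),
  gender = factor(c("F", "M", "F", "M")),
  age    = c(31, 45, 27, 52)
)

# `target ~ . - 1` uses every other column as a predictor and drops the intercept
sparse_toy <- sparse.model.matrix(target ~ . - 1, data = toy)

colnames(sparse_toy)  # "genderF" "genderM" "age"
dim(sparse_toy)       # 4 rows, 3 columns
```

With the intercept removed, the first factor keeps one indicator column per level, while numeric columns pass through unchanged. Any further factors would still drop their reference level under R's default treatment contrasts.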

In [4]:
xgb <- xgboost(data = sparse_tr,
               label = as.numeric(tr$country_destination) - 1,  # zero-based class labels
               eta = 0.1,
               max_depth = 9,
               nrounds = 25,
               subsample = 0.5,
               colsample_bytree = 0.5,
               eval_metric = "merror",
               objective = "multi:softprob",
               num_class = 12,
               nthread = 3
)

[0]	train-merror:0.464063
[1]	train-merror:0.459373
[2]	train-merror:0.454256
[3]	train-merror:0.451848
[4]	train-merror:0.449214
[5]	train-merror:0.448111
[6]	train-merror:0.447270
[7]	train-merror:0.446568
[8]	train-merror:0.446041
[9]	train-merror:0.445189
[10]	train-merror:0.444449
[11]	train-merror:0.443721
[12]	train-merror:0.443521
[13]	train-merror:0.442567
[14]	train-merror:0.442179
[15]	train-merror:0.440661
[16]	train-merror:0.439921
[17]	train-merror:0.439056
[18]	train-merror:0.438855
[19]	train-merror:0.437864
[20]	train-merror:0.437476
[21]	train-merror:0.436711
[22]	train-merror:0.435971
[23]	train-merror:0.435770
[24]	train-merror:0.435168
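
The label passed to `xgboost` above is `as.numeric(tr$country_destination) - 1` because multiclass objectives expect integer class labels starting at 0, and `num_class` must match the number of classes. A small sketch, using a made-up factor `dest`, of how both could be derived from the factor itself rather than hard-coded:

```r
# Toy destination factor (hypothetical values)
dest <- factor(c("US", "NDF", "FR", "US", "NDF"))

# Factor levels are sorted, so FR = 0, NDF = 1, US = 2
label_toy <- as.integer(dest) - 1L

# Number of classes taken from the factor instead of a literal
num_class_toy <- nlevels(dest)

# Map zero-based class indices back to destination names
levels(dest)[label_toy + 1L]
```

Subsetting rows does not drop unused factor levels, so `nlevels()` on the training subset still counts every destination present in the full data.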


In [5]:
saveRDS(xgb, paste0("./Models/xgb_model", NAME, ".RDS"))

# Predictions
We use the `predictions` function (defined in `Predictions.R`) to evaluate the model on both the training set and the test set. The tables below show that the predicted probabilities lead to NDF and US being predicted almost exclusively (a single training row is predicted as other). The accuracy at this point is also quite low.

In [6]:
#dataset <- dat_raw$dataset
#target <- dat$country_destination
#save(xgb, sparse_dat, dataset, target, file = "test.RData")

In [7]:
source("Predictions.R")
pred <- predictions(xgb, sparse_dat, dat_raw$dataset, dat$country_destination)

pred$pred_tr %>% table()
pred$acc_tr

pred$pred_ts %>% table()
pred$acc_ts

.
  NDF other    US 
48703     1 31033 

.
  NDF    US 
16064 10301 
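
`Predictions.R` is not shown in this notebook. As a rough sketch of what such a helper might do, the code below turns `multi:softprob` output into class labels and an accuracy figure. It assumes the probabilities arrive as a flat vector with each observation's class probabilities stored contiguously, which is how older releases of the xgboost R package return them; newer releases may already return a matrix. The function name `classify_softprob` and the toy values are hypothetical.

```r
# Hypothetical helper: turn a flat multi:softprob output into class labels
# probs:      numeric vector of length n * num_class, ordered observation by observation
# levels_chr: class names in the same order as the zero-based labels
classify_softprob <- function(probs, levels_chr) {
  num_class <- length(levels_chr)
  # byrow = TRUE gives one row per observation and one column per class
  prob_mat <- matrix(probs, ncol = num_class, byrow = TRUE)
  # max.col picks the column index of the largest probability in each row
  factor(levels_chr[max.col(prob_mat, ties.method = "first")], levels = levels_chr)
}

# Toy example with three classes and two observations
levels_chr <- c("NDF", "US", "other")
probs <- c(0.6, 0.3, 0.1,   # observation 1: NDF is most probable
           0.2, 0.7, 0.1)   # observation 2: US is most probable
pred_toy  <- classify_softprob(probs, levels_chr)
truth_toy <- factor(c("NDF", "NDF"), levels = levels_chr)

table(pred_toy)             # counts per predicted class
mean(pred_toy == truth_toy) # accuracy: 0.5 on this toy data
```

Taking only the most probable class explains why dominant classes such as NDF and US crowd out the rarer destinations in the tables above.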

## Submission
https://www.kaggle.com/indradenbakker/airbnb-recruiting-new-user-bookings/rscript-0-86547/discussion  
Following the example script linked above, the submission file currently just lists the top 5 predictions for each user, in order.  
Submission page: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/submissions/attach
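
`Generate_submission.R` is not shown here. The sketch below illustrates one way to pick the five most probable destinations per user and stack them in long format, one row per (id, country) pair. The helper name `top_k_long`, the toy ids, and the probabilities are hypothetical.

```r
# Hypothetical helper: keep the k most probable classes per user, in long format
# prob_mat:   matrix with one row per user and one column per class
# ids:        user ids, one per row of prob_mat
# levels_chr: class names in column order
top_k_long <- function(prob_mat, ids, levels_chr, k = 5) {
  # For each row, rank classes from most to least probable and keep the first k;
  # apply() returns a k x n matrix, so as.vector() reads it user by user
  top <- apply(prob_mat, 1, function(p) {
    levels_chr[order(p, decreasing = TRUE)[seq_len(k)]]
  })
  data.frame(
    id      = rep(ids, each = k),
    country = as.vector(top),
    stringsAsFactors = FALSE
  )
}

# Toy example: two users, six classes
levels_chr <- c("NDF", "US", "other", "FR", "IT", "GB")
prob_mat <- rbind(
  c(0.50, 0.25, 0.10, 0.08, 0.05, 0.02),
  c(0.10, 0.40, 0.05, 0.30, 0.03, 0.12)
)
sub_toy <- top_k_long(prob_mat, ids = c("u1", "u2"), levels_chr = levels_chr)
sub_toy

# Writing the file (hypothetical file name):
# write.csv(sub_toy, "submission.csv", row.names = FALSE, quote = FALSE)
```

Each user contributes exactly `k` rows, so the result has `k` times as many rows as there are users, matching the 138835 = 5 × 27767 rows seen in `final$file` below.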

In [8]:
# Generate predictions on competition test set. 
# compare prediction to results
source("Generate_submission.R")
sparse_test <- sparse_dat[dat_raw$dataset == "test_external",]
id <- as.character(dat_raw[dat_raw$dataset == "test_external", "id"])

str(sparse_test)
final <- submission(xgb, sparse_test, id, paste0("xgb", NAME))

Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:712680] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ p       : int [1:166] 0 27767 27767 27949 28585 32308 39767 45598 49050 51079 ...
  ..@ Dim     : int [1:2] 27767 165
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:27767] "106103" "106104" "106105" "106106" ...
  .. ..$ : chr [1:165] "X" "age_bucket0-4" "age_bucket100+" "age_bucket15-19" ...
  ..@ x       : num [1:712680] 213452 213457 213458 213463 213464 ...
  ..@ factors : list()


In [9]:
name <- paste0("xgb", NAME)  # NAME already starts with "_"
save(xgb, sparse_test, id, name, file = "test.RData")
str(final)

List of 2
 $ df  :'data.frame':	27767 obs. of  6 variables:
  ..$ V1: Factor w/ 2 levels "AU","IT": 2 2 1 1 2 1 1 2 1 2 ...
  .. ..- attr(*, "names")= chr [1:27767] "V1" "V2" "V3" "V4" ...
  ..$ V2: Factor w/ 5 levels "AU","CA","FR",..: 1 1 4 4 1 4 4 1 4 1 ...
  .. ..- attr(*, "names")= chr [1:27767] "V1" "V2" "V3" "V4" ...
  ..$ V3: Factor w/ 7 levels "AU","CA","ES",..: 4 6 6 6 4 6 6 4 6 4 ...
  .. ..- attr(*, "names")= chr [1:27767] "V1" "V2" "V3" "V4" ...
  ..$ V4: Factor w/ 8 levels "AU","CA","ES",..: 6 2 2 4 2 2 2 6 4 6 ...
  .. ..- attr(*, "names")= chr [1:27767] "V1" "V2" "V3" "V4" ...
  ..$ V5: Factor w/ 9 levels "AU","CA","ES",..: 2 4 4 2 5 4 4 2 2 2 ...
  .. ..- attr(*, "names")= chr [1:27767] "V1" "V2" "V3" "V4" ...
  ..$ id: chr [1:27767] "5uwns89zht" "szx28ujmhf" "guenkfjcbq" "fyomoivygn" ...
 $ file:'data.frame':	138835 obs. of  2 variables:
  ..$ id     : Factor w/ 27767 levels "0031awlkjq","0057snrdpu",..: 4446 4446 4446 4446 4446 22340 22340 22340 22340 22340 ...
  ..$

In [10]:
head(final$file)

| | id | country |
|---|---|---|
| 1 | 5uwns89zht | IT |
| 2 | 5uwns89zht | AU |
| 3 | 5uwns89zht | FR |
| 4 | 5uwns89zht | NL |
| 5 | 5uwns89zht | CA |
| 6 | szx28ujmhf | IT |
