# Preprocessing

In this script we preprocess the training user data and split it into internal training and testing sets, so that it is in a form that machine learning models can process.

In [1]:
# Libraries
library(ggplot2)
library(caret)
library(dplyr)
library(readr)

# Set seed
set.seed(1066)

Loading required package: lattice

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [2]:
# Read in the data
# Assumes the data files from the competition are in "../Data/"
# Note: users_FE.csv is read here in place of "../Data/train_users_2.csv"
train_users_2 <- read.csv("../Data/users_FE.csv")
test_users <- read.csv("../Data/test_users.csv")
sessions <- read_csv("../Data/sessions.csv")
str(train_users_2)

'data.frame':	275547 obs. of  35 variables:
 $ X                      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ id                     : Factor w/ 275547 levels "00023iyk9l","0005ytdols",..: 129856 61844 34121 88356 63040 190164 167259 2769 77033 52565 ...
 $ age_cln                : int  NA 38 56 42 41 NA 46 47 50 46 ...
 $ age_cln2               : int  NA 38 56 42 41 NA 46 47 50 46 ...
 $ age_bucket             : Factor w/ 20 levels "0-4","100+","15-19",..: NA 7 12 8 8 NA 9 9 11 9 ...
 $ dac_year               : int  2010 2011 2010 2011 2010 2010 2010 2010 2010 2010 ...
 $ dac_month              : int  6 5 9 12 9 1 1 1 1 1 ...
 $ dac_day                : int  28 25 28 5 14 1 2 3 4 4 ...
 $ dac_week               : int  26 21 39 49 37 0 0 1 1 1 ...
 $ dac_yearweek           : int  201026 201121 201039 201149 201037 201000 201000 201001 201001 201001 ...
 $ dac_yearmonth          : int  201006 201105 201009 201112 201009 201001 201001 201001 201001 201001 ...
 $ dac_yearmonthday       : int  

## Test / Train merge / index
Split the internal training data into a training set and a testing set so that we don't need to evaluate against the Kaggle leaderboard every time.

Merge the three sets into one so that feature engineering is applied identically to each.

In [14]:
ii <- train_users_2$dataset == "train"


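# Index internal training and testing sets
# createDataPartition samples within each level of country_destination,
# so the 75/25 split is stratified by destination
tr_index <- createDataPartition(y = train_users_2$country_destination, p = .75, list = FALSE)
train_users_2$set <- "test_internal"
train_users_2$set[tr_index] <- "train"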

# Merge training and testing sets
# "NDF" is only a placeholder label so the columns line up for rbind
test_users$country_destination <- "NDF"
test_users$set <- "test_external"
dat <- rbind(train_users_2, test_users)

dat$set <- as.factor(dat$set)
table(dat$set)


test_external test_internal         train 
        62096         53358        160093 
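
Once feature engineering has been applied to the merged frame, the three sets have to be pulled apart again using the `set` column. The following is a minimal sketch of one way to do that in base R. It assumes a toy frame `dat_toy` with made-up values standing in for `dat`; the `*_part` names are hypothetical and do not appear in the original script.

```r
# Toy stand-in for the merged `dat` frame (hypothetical values)
dat_toy <- data.frame(
  id = paste0("u", 1:6),
  country_destination = c("US", "NDF", "FR", "US", "NDF", "NDF"),
  set = factor(c("train", "train", "train", "test_internal",
                 "test_external", "test_external")),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per level of `set`
parts <- split(dat_toy, dat_toy$set)
train_part         <- parts[["train"]]
test_internal_part <- parts[["test_internal"]]
test_external_part <- parts[["test_external"]]

# The "NDF" label on the external test rows is only a placeholder,
# so drop it to avoid treating it as ground truth by accident
test_external_part$country_destination <- NULL
```

Splitting on a marker column rather than on row positions keeps the recovery step valid even if rows are reordered or filtered during feature engineering.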

## Feature Engineering

In [15]:
# Logical vector flagging the rows of dat for which we have session data (~1/3)
sessions_label <- dat$id %in% sessions$user_id
sum(sessions_label)

# Remove the date_first_booking variable as it is 100% missing on test data (for obvious reasons)
dat <- select(dat, -date_first_booking)
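
The membership check in the cell above can be illustrated on a small example. This is a sketch with made-up data: `users_toy` and `sessions_toy` are hypothetical stand-ins for `dat` and `sessions`, and the values are chosen only so that the share comes out at one third.

```r
# Toy stand-ins for the `dat` and `sessions` tables (hypothetical values)
users_toy <- data.frame(
  id = c("u1", "u2", "u3", "u4", "u5", "u6"),
  stringsAsFactors = FALSE
)
sessions_toy <- data.frame(
  user_id = c("u2", "u2", "u5", "u9"),
  action  = c("search", "view", "search", "view"),
  stringsAsFactors = FALSE
)

# TRUE for each user that has at least one row in the sessions table;
# repeated user_id values and ids absent from users_toy do not matter
has_sessions <- users_toy$id %in% sessions_toy$user_id

n_with_sessions     <- sum(has_sessions)   # number of users with sessions
share_with_sessions <- mean(has_sessions)  # proportion of users with sessions
```

`%in%` is built on `match()`, which converts factors to character before comparing, so the check also works when one id column is a factor (as `id` is in the `str()` output above) and the other is a character vector.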

## Save Output

In [16]:
saveRDS(dat, "../Data/users.RDS")
write_csv(dat, "../Data/users.csv")
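
The two output formats behave differently when read back in. The sketch below uses a toy frame and temporary files, and swaps in base R's `write.csv` / `read.csv` for the CSV side rather than `readr`, so it is an illustration of the general difference and not a reproduction of the cell above. All names in it are hypothetical.

```r
# Toy frame with a factor whose level order is deliberately non-alphabetical
toy <- data.frame(
  id  = c("u1", "u2", "u3"),
  set = factor(c("train", "test_internal", "test_external"),
               levels = c("train", "test_internal", "test_external")),
  stringsAsFactors = FALSE
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

# RDS restores the R object as it was, including factor levels and order
rds_round_trip_ok <- identical(from_rds, toy)

# CSV stores plain text, so column types are re-inferred on reading and
# the original level order is not carried over
csv_levels_kept <- identical(levels(from_csv$set), levels(toy$set))
```

For later R steps, `readRDS("../Data/users.RDS")` is therefore the safer way to reload the result, while the CSV copy is convenient for tools outside R.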