# Preprocessing

In this script we preprocess the training user data and split it into internal training and testing sets, so that the result is in a form that machine learning models can process.

In [6]:
# Libraries
library(ggplot2)
library(caret)
library(dplyr)
library(readr)

# Set seed
set.seed(1066)

In [7]:
# Read in the data
# Assumes the data files from the competition are in "../Data/"
train_users_2 <- read.csv("../Data/train_users_2.csv")
test_users <- read.csv("../Data/test_users.csv")
sessions <- read_csv("../Data/sessions.csv")
str(train_users_2)

'data.frame':	213451 obs. of  16 variables:
 $ id                     : Factor w/ 213451 levels "00023iyk9l","0005ytdols",..: 100523 48039 26485 68504 48956 147281 129610 2144 59779 40826 ...
 $ date_account_created   : Factor w/ 1634 levels "2010-01-01","2010-01-02",..: 171 502 263 696 249 1 2 3 4 4 ...
 $ timestamp_first_active : num  2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
 $ date_first_booking     : Factor w/ 1977 levels "","2010-01-02",..: 1 1 194 960 35 2 4 10 190 3 ...
 $ gender                 : Factor w/ 4 levels "-unknown-","FEMALE",..: 1 3 2 2 1 1 2 2 2 1 ...
 $ age                    : num  NA 38 56 42 41 NA 46 47 50 46 ...
 $ signup_method          : Factor w/ 3 levels "basic","facebook",..: 2 2 1 2 1 1 1 1 1 1 ...
 $ signup_flow            : int  0 0 3 0 0 0 0 0 0 0 ...
 $ language               : Factor w/ 25 levels "ca","cs","da",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ affiliate_channel      : Factor w/ 8 levels "api","content",..: 3 8 3 3 3 4 4 3 4 4 ...
 $ affiliate_p
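
The `str()` output above shows text columns stored as factors, which was the default behaviour of `read.csv()` before R 4.0.0. On newer R versions the same call returns character columns instead. A minimal sketch of requesting factors explicitly, using a made-up two-row CSV (the values and the name `toy_users` are illustrative, not taken from the competition files):

```r
# Toy CSV text standing in for the real file (illustrative values only)
csv_text <- "id,gender,age\nu1,FEMALE,38\nu2,-unknown-,NA"

# Ask for factors explicitly so the result does not depend on the R version
toy_users <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(toy_users)
is.factor(toy_users$gender)
```

Whether factors or character columns are preferable depends on the downstream model; the point is only that the choice can be made explicit.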

## Test / Train merge / index
Split the training data into internal training and testing sets so that we do not need to evaluate against the Kaggle leaderboard every time.

Merge the three sets (internal training, internal testing, and the external test set) into one so that feature engineering is applied identically to each.

In [8]:
# Index internal training and testing sets
tr_index <- createDataPartition(y = train_users_2$country_destination, p = .75, list = FALSE)
train_users_2$dataset <- "test_internal"
train_users_2$dataset[tr_index] <- "train"

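# Merge training and testing sets
# test_users gets a placeholder outcome ("NDF") so that its columns line up with train_users_2 for rbind()
test_users$country_destination <- "NDF"
test_users$dataset <- "test_external"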
dat <- rbind(train_users_2, test_users)

dat$dataset <- as.factor(dat$dataset)
table(dat$dataset)


test_external test_internal         train 
        62096         53358        160093 
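
When the outcome is a factor, `createDataPartition()` samples within each class, so the class proportions in the `train` and `test_internal` parts stay close to those of the full training data. The idea can be sketched in base R on toy labels. This is a rough illustration, not caret's actual implementation; `toy_labels` and `stratified_index` are hypothetical names and the class counts are made up.

```r
set.seed(1066)

# Toy outcome with imbalanced classes (made-up counts)
toy_labels <- factor(rep(c("NDF", "US", "other"), times = c(600, 300, 100)))

# Hypothetical helper: sample a fraction p of the row indices within each class
stratified_index <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

tr_idx <- stratified_index(toy_labels)

# Compare class proportions in the full data and in the training part
prop.table(table(toy_labels))
prop.table(table(toy_labels[tr_idx]))
```

A plain `sample()` over all rows would give similar proportions on average, but rare classes could end up under-represented in one of the parts.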

## Feature Engineering

In [9]:
# Logical vector flagging the users for whom we have session data (~1/3)
sessions_label <- dat$id %in% sessions$user_id
sum(sessions_label)

# Remove the date_first_booking variable, as it is 100% missing in the test data (users there have not booked yet)
dat <- select(dat, -date_first_booking)
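
The `%in%` operator used for `sessions_label` returns one logical value per user, so `sum()` counts the users with session data and `mean()` gives their share. A small sketch with made-up ids (`toy_users`, `toy_sessions` and `has_sessions` are illustrative names only):

```r
# Toy stand-ins for the user table and the sessions table (made-up ids)
toy_users <- data.frame(id = c("u1", "u2", "u3", "u4", "u5", "u6"))
toy_sessions <- data.frame(user_id = c("u2", "u2", "u5", "u9"))

# TRUE where the user id appears at least once in the sessions table
has_sessions <- toy_users$id %in% toy_sessions$user_id

sum(has_sessions)   # number of users with session data
mean(has_sessions)  # share of users with session data
```

Repeated ids in the sessions table do not matter here, and session ids with no matching user are simply ignored.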

## Save Output

In [10]:
saveRDS(dat, "../Data/users.RDS")
write_csv(dat, "../Data/users.csv")
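
The two output formats behave differently when read back: an RDS file restores R column types such as factors, whereas a CSV file is portable but has its types re-inferred on import. A minimal sketch of that difference, assuming temporary files and base R's `write.csv()`/`read.csv()` in place of readr to keep the example dependency-free (the data frame `toy` is made up):

```r
# Toy data frame with one character and one factor column
toy <- data.frame(
  id = c("u1", "u2"),
  dataset = factor(c("train", "test_internal")),
  stringsAsFactors = FALSE
)

# Temporary paths so nothing is written next to the real data
rds_path <- tempfile(fileext = ".RDS")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

class(from_rds$dataset)  # type of the column restored from RDS
class(from_csv$dataset)  # type of the column re-read from CSV
```

Later R scripts would typically load the RDS file, while the CSV copy is convenient for tools outside R.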