# Final Approach
This notebook contains the code submitted as final in the Genpact Machine Learning Hackathon. The software used is:
* R version 3.5.1
    * Caret package 6.0-80
    * Metrics package 0.1.4

Libraries:

In [2]:
library(caret)
library(Metrics)

### Meal based approach
In this approach we fit a model for each meal ID. We work under the assumption that the same type of meal follow the same beahviour.

Data loading:

In [3]:
train <- read.csv('Data/train.csv')
trainCenter <- read.csv('Data/fulfilment_center_info.csv')
trainMeal <- read.csv('Data/meal_info.csv')
test <- read.csv('Data/test_QoiMO9B.csv')
submission <- read.csv('Data/sample_submission_hSlSoT6.csv')

Data engineering (Delete ID and factorize some caterogical features):

In [4]:
train$id <- NULL # Delete id
train$emailer_for_promotion <- factor(train$emailer_for_promotion) # Factorize
train$homepage_featured <- factor(train$homepage_featured) # Factorize

test$emailer_for_promotion <- factor(test$emailer_for_promotion) # Factorize
test$homepage_featured <- factor(test$homepage_featured) # Factorize

trainCenter$city_code <- factor(trainCenter$city_code) # Factorize
trainCenter$region_code <- factor(trainCenter$region_code) # Factorize

Join datasets by center_id and factorize:

In [5]:
trainDEF <- merge(train, trainCenter, by = 'center_id')
trainDEF$center_id <- factor(trainDEF$center_id) # Factorize

testDEF <- merge(test, trainCenter, by = 'center_id')
testDEF$center_id <- factor(testDEF$center_id) # Factorize

Reorder columns:

In [6]:
trainDEF <- trainDEF[, c(1:7, 9:ncol(trainDEF), 8)]
testDEF <- testDEF[, c(2, 1, 3:ncol(testDEF))]

Now we'll split the train set based on the meal ID, then for each meal ID we'll split the subset into train and test (the last 10 weeks):

In [7]:
trainBYmeal <- list()
for(meal in unique(trainDEF$meal_id)) {
  
  auxDF <- trainDEF[trainDEF$meal_id == meal, -3] # Delete meal_id column
  
  trainBYmeal[[as.character(meal)]][['train']] <- auxDF[auxDF$week %in% 1:135, ]
  trainBYmeal[[as.character(meal)]][['test']] <- auxDF[auxDF$week %in% 136:145, ]
  
}

The loss function RMSLE is not included in caret so we need to define it manually:

In [8]:
RMSLECaretFunc <- function(data, lev = NULL, model = NULL) {
  
  rmsleCaret <- Metrics::rmsle(data$obs, data$pred)
  c(RMSLEcaret = -rmsleCaret)
  
}

We will calculate and save the standard prediction, augmented by 1.10 and reduced by 0.9:

In [9]:
submission_1 <- submission
submission_1 <- cbind.data.frame(submission_1,
                                 num_ordersReduced = rep(0, nrow(submission_1)),
                                 num_ordersAugmented = rep(0, nrow(submission_1)))

We also need to create a csv saving the RSMLE achieved for each meal ID:

In [10]:
metrics <- data.frame(meal = rep(0, length(trainBYmeal)),
                      RMSLENormal = rep(0, length(trainBYmeal)),
                      RMSLEreduced = rep(0, length(trainBYmeal)),
                      RMSLEAugmented = rep(0, length(trainBYmeal)))

Now we will train the models (ranger implementation of Random Forests), the process is the following:
* Iterate over each meal ID
    * Find optimal hyperparameters by validation and calculate RMSLE on test
    * Once the optimal hyperparameters are known fit the model on all the train data (for that ID)
    * Make the predictions

During the process we update the RMSLE table.

In [None]:
set.seed(15122018)
ctrl <- trainControl(method = 'cv',
                     number = 3,
                     summaryFunction = RMSLECaretFunc)

for(i in 1:length(trainBYmeal)) {
  
  elemMeal <- trainBYmeal[[i]]
  
  # Find optimal paramaters 
  model <- suppressMessages(train(num_orders ~ .,
                                  data = elemMeal[['train']],
                                  method = 'ranger',
                                  trControl = ctrl,
                                  tuneLength = 2,
                                  metric = 'RMSLEcaret'))
  
  optmParam <- model$bestTune
  
  # Predictions over the pseudo test set
  predsPseudo <- predict(model, elemMeal[['test']][, -11])
  
  # Update metrics CSV
  metrics[i, 1] <- names(trainBYmeal[i])
  metrics[i, 2] <- Metrics::rmsle(predsPseudo, elemMeal[['test']][, 11])
  metrics[i, 3] <- Metrics::rmsle(predsPseudo, elemMeal[['test']][, 11] * 0.9)
  metrics[i, 4] <- Metrics::rmsle(predsPseudo, elemMeal[['test']][, 11] * 1.1)
  
  # Fit final model
  modelFinal <- suppressMessages(train(num_orders ~ .,
                                       data = rbind.data.frame(elemMeal[['train']],
                                                               elemMeal[['test']]),
                                       method = 'ranger',
                                       trControl = trainControl(method = 'none',
                                                                summaryFunction = RMSLECaretFunc),
                                       tuneGrid = optmParam,
                                       metric = 'RMSLEcaret'))
  
  # Final predictions over the real test set
  testAUX <- testDEF[testDEF$meal_id == as.numeric(names(trainBYmeal[i])),
                     -4]
  predsFinal <- predict(modelFinal, testAUX[, -1])
  testAUX$preds <- predsFinal
  
  for (j in 1:nrow(testAUX)) {
    
    submission_1[submission_1$id == testAUX[j, 'id'], 'num_orders'] <-
      testAUX[j, 'preds']
    
    submission_1[submission_1$id == testAUX[j, 'id'], 'num_ordersReduced'] <-
      testAUX[j, 'preds'] * 0.9
    
    submission_1[submission_1$id == testAUX[j, 'id'], 'num_ordersAugmented'] <-
      testAUX[j, 'preds'] * 1.1
    
  } 
  
}

Save the predictions and its RMSE values as csv files:

In [None]:
write.csv(submission_1, file = 'submission_1.csv', row.names = FALSE)
write.csv(metrics, file = 'metrics.csv', row.names = FALSE)

Once generated these files we can make submissions based on:
* Original preds: with no changes at all
* Reduced preds: Multiplied by 0.9
* Augmented preds: Multiplied by 1.10
* Combined preds: Select the prediction for each meal ID based on the RMSLE values

In [None]:
metrics <- read.csv('metrics.csv')
submission_1 <- read.csv('submission_1.csv')

# Original preds
preds_1_sub <- submission_1[, c(1, 2)]
write.csv(preds_1_sub, 'preds_1_sub.csv', row.names = FALSE) 

# Reduced preds
preds_2_sub <- submission_1[, c(1, 3)]
colnames(preds_2_sub)[2] <- 'num_orders'
write.csv(preds_2_sub, 'preds_2_sub.csv', row.names = FALSE)

# Augmented preds
preds_3_sub <- submission_1[, c(1, 4)]
colnames(preds_3_sub)[2] <- 'num_orders'
write.csv(preds_3_sub, 'preds_3_sub.csv', row.names = FALSE)

# Combined preds (test dataframe needed)
preds_4_sub <- submission_1[, c(1, 2)]
preds_4_sub[, 2] <- rep(0, nrow(preds_4_sub))

metricsMins <- data.frame(meal = metrics[, 1],
                          whichMin = apply(metrics[, -1], 1, which.min))

for(i in preds_4_sub$id) {
  
  mealFORid <- test[test$id == i, 'meal_id']
  bestPred <- metricsMins[metricsMins$meal == mealFORid, 2]
  
  preds_4_sub[preds_4_sub$id == i, 'num_orders'] <-
    submission_1[submission_1$id == i, bestPred + 1]
  
}

write.csv(preds_4_sub, 'preds_4_sub.csv', row.names = FALSE) # LB: 58.3519

### Center based approach
In this approach we fit a model for each center ID. We work under the assumption that each center has its own behaviour patterns based on the location, country, people nearby...

(The process its similar to the previous one and there are a lot of of code repetition which is not a good practice. My justification is that I didn't have enough time to write nice code and, to maint the beauty of the original script, I decided not to change anything).

Data loading:

In [None]:
train <- read.csv('Data/train.csv')
trainCenter <- read.csv('Data/fulfilment_center_info.csv')
trainMeal <- read.csv('Data/meal_info.csv')
test <- read.csv('Data/test_QoiMO9B.csv')
submission <- read.csv('Data/sample_submission_hSlSoT6.csv')

Data engineering (Delete ID and factorize some caterogical features):

In [None]:
train$id <- NULL # Delete id
train$emailer_for_promotion <- factor(train$emailer_for_promotion) # Factorize
train$homepage_featured <- factor(train$homepage_featured) # Factorize

test$emailer_for_promotion <- factor(test$emailer_for_promotion) # Factorize
test$homepage_featured <- factor(test$homepage_featured) # Factorize

Join datasets by meal_id and factorize:

In [None]:
trainDEF <- merge(train, trainMeal, by = 'meal_id')
trainDEF$meal_id <- factor(trainDEF$meal_id) # Factorize

testDEF <- merge(test, trainMeal, by = 'meal_id')
testDEF$meal_id <- factor(testDEF$meal_id) # Factorize

Reorder columns:

In [None]:
trainDEF <- trainDEF[, c(1:7, 9:ncol(trainDEF), 8)]
testDEF <- testDEF[, c(2, 1, 3:ncol(testDEF))]

Now we'll split the train set based on the center ID, then for each meal ID we'll split the subset into train and test (the last 10 weeks):

In [None]:
trainBYcenter <- list()
for(center in unique(trainDEF$center_id)) {
  
  auxDF <- trainDEF[trainDEF$center_id == center, -3] # Delete center_id column
  
  trainBYcenter[[as.character(center)]][['train']] <- auxDF[auxDF$week %in% 1:135, ]
  trainBYcenter[[as.character(center)]][['test']] <- auxDF[auxDF$week %in% 136:145, ]
  
}

We will calculate and save the standard prediction, augmented by 1.10 and reduced by 0.9:

In [None]:
submission_1 <- submission
submission_1 <- cbind.data.frame(submission_1,
                                 num_ordersReduced = rep(0, nrow(submission_1)),
                                 num_ordersAugmented = rep(0, nrow(submission_1)))

We also need to create a csv saving the RSMLE achieved for a each center ID:

In [None]:
metrics <- data.frame(center = rep(0, length(trainBYcenter)),
                      RMSLENormal = rep(0, length(trainBYcenter)),
                      RMSLEreduced = rep(0, length(trainBYcenter)),
                      RMSLEAugmented = rep(0, length(trainBYcenter)))

Now we will train the models (ranger implementation of Random Forests), the process is the following:
* Iterate over each center ID
    * Find optimal hyperparameters by validation and calculate RMSLE on test
    * Once the optimal hyperparameters are known fit the model on all the train data (for that ID)
    * Make the predictions

During the process we update the RMSLE table.

In [None]:
set.seed(15122018)
ctrl <- trainControl(method = 'cv',
                     number = 3,
                     summaryFunction = RMSLECaretFunc)

for(i in 1:length(trainBYcenter)) {
  
  elemCenter <- trainBYcenter[[i]]
  
  # Find optimal paramaters
  model <- train(num_orders ~ .,
                 data = elemCenter[['train']],
                 method = 'ranger',
                 trControl = ctrl,
                 tuneLength = 2,
                 metric = 'RMSLEcaret')
  
  optmParam <- model$bestTune
  
  # Predictions over the pseudo test set
  predsPseudo <- predict(model, elemCenter[['test']][, -11])
  
  # Update metrics CSV
  metrics[i, 1] <- names(trainBYcenter[i])
  metrics[i, 2] <- Metrics::rmsle(predsPseudo, elemCenter[['test']][, 'num_orders'])
  metrics[i, 3] <- Metrics::rmsle(predsPseudo, elemCenter[['test']][, 'num_orders'] * 0.9)
  metrics[i, 4] <- Metrics::rmsle(predsPseudo, elemCenter[['test']][, 'num_orders'] * 1.1)
  
  # Fit final model
  modelFinal <- train(num_orders ~ .,
                      data = rbind.data.frame(elemCenter[['train']],
                                              elemCenter[['test']]),
                      method = 'ranger',
                      trControl = trainControl(method = 'none',
                                               summaryFunction = RMSLECaretFunc),
                      tuneGrid = optmParam,
                      metric = 'RMSLEcaret')
  
  # Final predictions over the real test set
  testAUX <- testDEF[testDEF$center_id == as.numeric(names(trainBYcenter[i])),
                     -4]
  predsFinal <- predict(modelFinal, testAUX[, -1])
  testAUX$preds <- predsFinal
  
  for (j in 1:nrow(testAUX)) {
    
    submission_1[submission_1$id == testAUX[j, 'id'], 'num_orders'] <-
      testAUX[j, 'preds']
    
    submission_1[submission_1$id == testAUX[j, 'id'], 'num_ordersReduced'] <-
      testAUX[j, 'preds'] * 0.9
    
    submission_1[submission_1$id == testAUX[j, 'id'], 'num_ordersAugmented'] <-
      testAUX[j, 'preds'] * 1.1
    
  } 
  
}

Save the predictions and its RMSE values as csv files:

In [None]:
write.csv(submission_1, file = 'submission_1_2nd.csv', row.names = FALSE)
write.csv(metrics, file = 'metrics_2nd.csv', row.names = FALSE)

Once generated these files we can make submissions based on:
* Original preds: with no changes at all
* Reduced preds: Multiplied by 0.9
* Augmented preds: Multiplied by 1.10
* Combined preds: Select the prediction for each meal ID based on the RMSLE values

In [None]:
metrics <- read.csv('metrics_2nd.csv')
submission_1 <- read.csv('submission_1_2nd.csv')

# Original preds
preds_1_sub <- submission_1[, c(1, 2)]
write.csv(preds_1_sub, 'preds_1_sub_2nd.csv', row.names = FALSE) 

# Reduced preds
preds_2_sub <- submission_1[, c(1, 3)]
colnames(preds_2_sub)[2] <- 'num_orders'
write.csv(preds_2_sub, 'preds_2_sub_2nd.csv', row.names = FALSE) 

# Augmented preds
preds_3_sub <- submission_1[, c(1, 4)]
colnames(preds_3_sub)[2] <- 'num_orders'
write.csv(preds_3_sub, 'preds_3_sub_2nd.csv', row.names = FALSE) 

# Combined preds (test dataframe needed)
preds_4_sub <- submission_1[, c(1, 2)]
preds_4_sub[, 2] <- rep(0, nrow(preds_4_sub))

metricsMins <- data.frame(center = metrics[, 1],
                          whichMin = apply(metrics[, -1], 1, which.min))

for(i in preds_4_sub$id) {
  
  centerFORid <- test[test$id == i, 'center_id']
  bestPred <- metricsMins[metricsMins$center == centerFORid, 2]
  
  preds_4_sub[preds_4_sub$id == i, 'num_orders'] <-
    submission_1[submission_1$id == i, bestPred + 1]
  
}

write.csv(preds_4_sub, 'preds_4_sub_2nd.csv', row.names = FALSE) 

### Final Ensemble
The best results were obtained by the reduced predictions, that's why the final submission is a simple ensemble of these two predictions.

In [None]:
firstApproachPREDS <- read.csv('preds_2_sub.csv') # Reduced preds (meal approach)
secondApproachPREDS <- read.csv('preds_2_sub_2nd.csv') # Reduced preds (center approach)

ensemble <- data.frame(id = firstApproachPREDS$id,
                       num_orders = rowMeans(data.frame(firstApproachPREDS$num_orders,
                                                        secondApproachPREDS$num_orders)))

write.csv(ensemble, file = 'ensembleFS.csv', row.names = FALSE)