In [1]:
options(jupyter.plot_mimetypes = c("text/plain", "image/png"))

In [23]:
library(ggplot2); library(caret); library(plyr); library(dplyr)

In [4]:
set.seed(1234)

In [5]:
data <- read.csv("indego_df_model.csv")

In [6]:
data$start_hour <- factor(data$start_hour)
data$start_station <- factor(data$start_station)
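# Rename "very short" so every class level is a valid R name (caret requires this when classProbs = TRUE).
# Going through as.character() also works when read.csv() returns strings instead of factors (R >= 4.0).
data$duration_type <- as.character(data$duration_type)
data$duration_type[data$duration_type == "very short"] <- "very_short"
data$duration_type <- factor(data$duration_type, levels = c("very_short", "short", "medium", "long"))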

In [33]:
data_small <- sample_n(data, 5000)

In [34]:
str(data_small)

'data.frame':	5000 obs. of  22 variables:
 $ X                  : int  66587 407901 539462 1874021 1028033 1013834 109059 653232 1204661 1634862 ...
 $ trip_id            : int  38711074 203530174 92073603 4054767 183055625 180646194 39220841 60823819 4691997 4496302 ...
 $ duration           : num  16 11 18 25 70 13 88 9 13 10 ...
 $ start_time         : Factor w/ 833217 levels "2015-04-23 07:44:00",..: 339604 731641 489692 16822 692355 687773 356053 414470 271393 185706 ...
 $ end_time           : Factor w/ 819051 levels "2015-04-23 07:45:00",..: 331697 718572 479571 16204 679636 675043 347897 405401 264787 180869 ...
 $ start_station      : Factor w/ 127 levels "3004","3005",..: 18 2 31 19 83 57 53 17 83 7 ...
 $ start_lat          : num  40 39.9 39.9 40 39.9 ...
 $ start_lon          : num  -75.2 -75.1 -75.2 -75.2 -75.2 ...
 $ end_station        : int  3026 3100 3020 3028 3046 3129 3058 3032 3010 3066 ...
 $ end_lat            : num  39.9 39.9 39.9 39.9 40 ...
 $ end_lon           

In [35]:
inTrain <- createDataPartition(y=data_small$duration_type, p=0.7, list=FALSE)
training <- data_small[inTrain,]
testing <- data_small[-inTrain,]
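
`createDataPartition()` samples within each level of `y`, so the class shares in the two partitions should be close. A minimal sketch of how that could be checked, using a made-up label vector `y` (the class probabilities below are illustrative assumptions, not values taken from the Indego data):

```r
library(caret)

set.seed(1234)
# Hypothetical imbalanced labels, for illustration only
y <- factor(sample(c("very_short", "short", "medium", "long"),
                   size = 1000, replace = TRUE,
                   prob = c(0.10, 0.65, 0.20, 0.05)))

# Stratified 70/30 split, same call shape as in the cell above
idx <- createDataPartition(y = y, p = 0.7, list = FALSE)

# Class shares in each partition
train_share <- prop.table(table(y[idx]))
test_share  <- prop.table(table(y[-idx]))
round(rbind(train = train_share, test = test_share), 3)
```

On the notebook's own objects the same check would be `prop.table(table(training$duration_type))` and `prop.table(table(testing$duration_type))`.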

In [17]:
# formula <- duration_type ~ start_station + passholder_type + start_weekday + start_hour
formula <- duration_type ~ isBday + passholder_type + start_hour

# 1. RandomForest model

In [18]:
# Create model with default parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
mtry <- sqrt(ncol(training))

In [19]:
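# control and metric are not passed here, so train() falls back to its defaults (25 bootstrap resamples)
model_rf <- train(formula, data=training, method="rf", prox=TRUE)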
print(model_rf)

Random Forest 

1402 samples
   3 predictor
   4 classes: 'very_short', 'short', 'medium', 'long' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 1402, 1402, 1402, 1402, 1402, 1402, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      
   2    0.6599207  0.004230985
  15    0.6409927  0.129502829
  29    0.6389667  0.127981506

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 2.
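
An accuracy near 0.66 combined with a Kappa near 0 is the pattern one would expect from a model that mostly predicts a single class. The sketch below shows this with toy labels in base R; `kappa_stat` is a hypothetical helper written for illustration, and the class counts are invented, not taken from the Indego data.

```r
# Accuracy and Cohen's kappa from observed and predicted labels (base R only)
kappa_stat <- function(obs, pred) {
  lv  <- union(levels(factor(obs)), levels(factor(pred)))
  tab <- table(factor(pred, levels = lv), factor(obs, levels = lv))
  n   <- sum(tab)
  p_o <- sum(diag(tab)) / n                      # observed agreement (accuracy)
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  c(Accuracy = p_o, Kappa = (p_o - p_e) / (1 - p_e))
}

# Toy data: 66 of 100 trips fall in one class (illustrative assumption)
obs  <- factor(rep(c("very_short", "short", "medium", "long"),
                   times = c(10, 66, 18, 6)))
# A "model" that always predicts the majority class
pred <- factor(rep("short", 100), levels = levels(obs))

res <- kappa_stat(obs, pred)
res
```

Here the always-majority predictor reaches 0.66 accuracy with a Kappa of 0, which is why Kappa can be the more informative metric on imbalanced classes.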


In [20]:
# ranger is another RF package with speed improvement
model_ranger <- train(formula, data=training, method="ranger", metric=metric, trControl=control)
print(model_ranger)

“package ‘ranger’ was built under R version 3.4.4”
Attaching package: ‘ranger’

The following object is masked from ‘package:randomForest’:

    importance



Random Forest 

1402 samples
   3 predictor
   4 classes: 'very_short', 'short', 'medium', 'long' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 1261, 1262, 1262, 1263, 1261, 1262, ... 
Resampling results across tuning parameters:

  mtry  splitrule   Accuracy   Kappa       
   2    gini        0.6552688  0.0002113749
   2    extratrees  0.6557450  0.0009880549
  15    gini        0.6424468  0.1160161205
  15    extratrees  0.6393583  0.1115679930
  29    gini        0.6391152  0.1126795598
  29    extratrees  0.6362816  0.1089425659

Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were mtry = 2 and splitrule = extratrees.
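
Both random forest fits report 3 predictors yet try `mtry` values up to 29. This is likely because the formula interface of `train()` expands each factor into dummy columns before fitting, so `mtry` counts columns of the design matrix rather than terms in the formula. A small sketch with `model.matrix()` on a toy data frame (the rows and the passholder labels are made up for illustration):

```r
# Toy data frame with the same column names as in the formula (values are hypothetical)
toy <- data.frame(
  duration_type   = factor(c("short", "long", "short", "medium")),
  isBday          = c(1, 0, 1, 1),
  passholder_type = factor(c("Indego30", "Walk-up", "Indego30", "Walk-up")),
  start_hour      = factor(c(7, 8, 17, 18))
)

# Design matrix for the same right-hand side as the model formula
mm <- model.matrix(duration_type ~ isBday + passholder_type + start_hour, data = toy)

# Columns excluding the intercept: 1 (isBday) + 1 (passholder_type) + 3 (start_hour)
n_dummies <- ncol(mm) - 1
n_dummies
```

With the full data, a factor such as `start_hour` with many levels contributes most of the expanded columns.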


# 2. XGBoost model

In [30]:
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 5, number = 3, 
                        classProbs = TRUE,
                        allowParallel = TRUE)

xgb.grid <- expand.grid(nrounds = 100,
                        eta = c(0.4,0.7),
                        max_depth = c(2,3,4),
                        gamma = 0,
                        colsample_bytree = .7,
                        min_child_weight = 1,
                        subsample = c(.8, 1)
                       )
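
Only `eta`, `max_depth`, and `subsample` vary in this grid, so its size is easy to check before training. The sketch below rebuilds the same grid so that it runs on its own and gives a rough count of model fits, assuming every combination is fitted once per resample plus one final fit on the full training set (`n_combos`, `n_resamples`, and `n_fits` are names introduced here for illustration):

```r
# Same tuning grid as above, repeated so this block is self-contained
xgb.grid <- expand.grid(nrounds = 100,
                        eta = c(0.4, 0.7),
                        max_depth = c(2, 3, 4),
                        gamma = 0,
                        colsample_bytree = 0.7,
                        min_child_weight = 1,
                        subsample = c(0.8, 1))

n_combos    <- nrow(xgb.grid)             # 2 eta x 3 max_depth x 2 subsample
n_resamples <- 3 * 5                      # 3 folds repeated 5 times
n_fits      <- n_combos * n_resamples + 1 # plus the final fit (rough estimate)

c(combinations = n_combos, resamples = n_resamples, fits = n_fits)
```

Widening any of the three varying parameters multiplies the number of fits, so it can help to check this count before extending the grid.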

In [36]:
model_xgb <- train(formula,
                   data = training,
                   method = "xgbTree",
                   trControl = cv.ctrl,
                   tuneGrid = xgb.grid,
                   verbose = TRUE,
                   metric = "Kappa"
                   )

print(model_xgb)

eXtreme Gradient Boosting 

3502 samples
   3 predictor
   4 classes: 'very_short', 'short', 'medium', 'long' 

No pre-processing
Resampling: Cross-Validated (3 fold, repeated 5 times) 
Summary of sample sizes: 2334, 2335, 2335, 2336, 2334, 2334, ... 
Resampling results across tuning parameters:

  eta  max_depth  subsample  Accuracy   Kappa    
  0.4  2          0.8        0.6420907  0.1452312
  0.4  2          1.0        0.6430603  0.1499159
  0.4  3          0.8        0.6434029  0.1572003
  0.4  3          1.0        0.6434599  0.1538785
  0.4  4          0.8        0.6426612  0.1506443
  0.4  4          1.0        0.6431175  0.1533987
  0.7  2          0.8        0.6420329  0.1461228
  0.7  2          1.0        0.6419753  0.1472666
  0.7  3          0.8        0.6433464  0.1566673
  0.7  3          1.0        0.6435749  0.1506564
  0.7  4          0.8        0.6438604  0.1616273
  0.7  4          1.0        0.6427749  0.1554699

Tuning parameter 'nrounds' was held constant at a v