In [3]:
library(data.table)
library(repr)
library(lubridate)
library(rpart)
library(partykit)
library(ggplot2)
library(Metrics)
library(TSdist)
library(dtw)
library(TSrepr)
library(TunePareto)
library(caret)
library(writexl)
library(forecast)
library(tidyr)
library(randomForest)
library(rattle)
options(repr.plot.width=10, repr.plot.height=10)

## Model Selection and Test Part

In [82]:
dt <- fread("C:/Users/kaan9/OneDrive/Masaüstü/bulk_imbalance_son.csv")
total_vol <- data.table(dt[,c("date","hour","downRegulationZeroCoded",
                              "upRegulationZeroCoded","net","system_direction")])
colnames(total_vol) <- c("date","hour","yat_vol","yal_vol","net_imb","direction")

In [83]:
wt <- fread("C:/Users/kaan9/OneDrive/Masaüstü/weather_son.csv")
wt$loc <- paste("loc",as.character(wt$lat),as.character(wt$lon),sep="_")
wt <- data.table(pivot_wider(wt[,c(1,2,7,5,6)],names_from = c(loc,variable),values_from =value))
wt$day <- wday(wt$date)
wt$month <- month(wt$date)
total_vol <- wt[total_vol,on=.(date,hour)]
total_vol$t <- 1:nrow(total_vol)
total_vol[, direction:=ifelse(net_imb>50, "Positive" , ifelse(net_imb<(-50),"Negative","Neutral"))]
total_vol[, net_imb:=ifelse(net_imb<(-5000), (-5000), ifelse(net_imb>5000, 5000, net_imb))]

The data is read, and a trend index, the day of the week and the month are added as features. The weather data is pivoted to wide format, which contributes 42 additional features. The multivariate time series is then reduced to univariate series in two ways: with PCA and with a Random Forest.
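
The labelling and clipping rules used in the cell above can be checked on a toy vector. This is a minimal sketch with made-up imbalance values (they are not taken from the data set); it only assumes the ±50 dead band and the ±5000 clipping limits that appear in the code.

```r
# Toy net imbalance values (hypothetical numbers, not from the data set).
net_imb <- c(-7200, -120, -50, 0, 49, 51, 6400)

# Three-way label with a +/-50 dead band, as in the cell above.
direction <- ifelse(net_imb > 50, "Positive",
             ifelse(net_imb < -50, "Negative", "Neutral"))

# Clip the regression target to [-5000, 5000].
# pmin/pmax gives the same result as the nested ifelse in the notebook.
net_imb_clipped <- pmax(pmin(net_imb, 5000), -5000)

print(data.frame(net_imb, direction, net_imb_clipped))
```

Values exactly equal to -50 or 50 fall in the "Neutral" class because both comparisons are strict. The labels are computed before clipping, which matches the order of the two statements in the notebook.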

In [84]:
pca <- princomp(total_vol[,-c("date","yat_vol","yal_vol","net_imb","direction","t")])
summary(pca)

Importance of components:
                            Comp.1       Comp.2       Comp.3      Comp.4
Standard deviation     704.3836339 122.42625879 101.36671251 93.81010051
Proportion of Variance   0.8474425   0.02560005   0.01755022  0.01503111
Cumulative Proportion    0.8474425   0.87304257   0.89059278  0.90562389
                            Comp.5      Comp.6      Comp.7       Comp.8
Standard deviation     88.26658029 84.20265311 82.61431247 74.974045479
Proportion of Variance  0.01330713  0.01210998  0.01165742  0.009600937
Cumulative Proportion   0.91893102  0.93104099  0.94269841  0.952299347
                             Comp.9    Comp.10      Comp.11      Comp.12
Standard deviation     70.751694181 68.4074839 61.486591144 56.258304100
Proportion of Variance  0.008549986  0.0079928  0.006457324  0.005405863
Cumulative Proportion   0.960849333  0.9688421  0.975299457  0.980705320
                            Comp.13      Comp.14      Comp.15    Comp.16
Standard deviation     52.018

In [6]:
total_vol[,pca1:=pca$scores[,1]]

Only the first principal component is kept, since it explains about 84.7% of the variance.
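
The proportion of variance reported by `summary(pca)` is the squared standard deviation of each component divided by the total. The sketch below reproduces that calculation on synthetic data (the column names and scales are assumptions for illustration). It also shows that a covariance-based `princomp`, as used in this notebook, can be dominated by whichever columns have the largest scale, so the share of the first component may partly reflect units rather than structure.

```r
set.seed(1)
# Hypothetical data: one large-scale column and two small-scale columns.
x <- cbind(big    = rnorm(200, sd = 100),
           small1 = rnorm(200, sd = 1),
           small2 = rnorm(200, sd = 1))

pca_raw <- princomp(x)               # covariance-based, as in the notebook
pca_std <- princomp(x, cor = TRUE)   # correlation-based, i.e. standardised columns

# Proportion of variance = squared standard deviation over the total.
prop_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
prop_std <- pca_std$sdev^2 / sum(pca_std$sdev^2)

print(round(prop_raw, 4))
print(round(prop_std, 4))
```

Whether standardising would help the classification here is not tested in this notebook; the sketch only illustrates how the reported proportions arise.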

In [7]:
# Row index of the last training hour (2020-12-31, hour 23)
date_fltr <- total_vol[, which(date=="2020-12-31" & hour==23)]
total_vol_train <- total_vol[1:date_fltr,]
total_vol_test <- total_vol[-(1:date_fltr)]

The first two years are used as training data, and classification is evaluated on the last year. A simple random forest without parameter tuning is fitted on the training set.

In [8]:
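start <- Sys.time()
# pca1 is excluded so the forest is trained on the raw features only, as in the printed call below
model_rf <- randomForest(net_imb ~.-date-yat_vol-yal_vol-direction-t-pca1,total_vol_train)
end <- Sys.time()
save(model_rf,file="C:/Users/kaan9/OneDrive/Masaüstü/rf_model_2years.Rdata")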

In [22]:
model_rf


Call:
 randomForest(formula = net_imb ~ . - date - yat_vol - yal_vol -      direction - t, data = total_vol_train[, -paste0("pca", 1:15)]) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 15

          Mean of squared residuals: 318384.9
                    % Var explained: 60.52

In [21]:
load("C:/Users/kaan9/OneDrive/Masaüstü/rf_model_2years.Rdata")

In [10]:
fc <- predict(model_rf,total_vol_test)
total_vol_test$rf <- fc

The last year is predicted by the random forest model, so there are now two univariate series: the first principal component and the random forest predictions. As a first check, the multivariate raw data, the univariate PCA series and the univariate random forest series are each classified for every hour with the k-nearest neighbors algorithm and Euclidean distance.

In [115]:
control1 <- trainControl(method="cv",
                        number=10)
knn_raw <- list()
for(i in 12:23){
    knn_raw[[i]] <- train(direction~.,
                  data=total_vol[hour==i,-c("pca1","date","yat_vol","yal_vol","net_imb","t")],
                  method="knn",
                  metric="Accuracy",
                  trControl=control1,
                  tuneGrid = expand.grid(.k=c(10,20,30,40)))
}

In [116]:
knn_raw

[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL

[[6]]
NULL

[[7]]
NULL

[[8]]
NULL

[[9]]
NULL

[[10]]
NULL

[[11]]
NULL

[[12]]
k-Nearest Neighbors 

1107 samples
  45 predictor
   3 classes: 'Negative', 'Neutral', 'Positive' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 996, 996, 998, 996, 997, 996, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
  10  0.5619286  0.1454441
  20  0.5664493  0.1444277
  30  0.5709610  0.1492826
  40  0.5664737  0.1332574

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 30.

[[13]]
k-Nearest Neighbors 

1107 samples
  45 predictor
   3 classes: 'Negative', 'Neutral', 'Positive' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 997, 995, 996, 996, 997, 996, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
  10  0.5547505  0.1507693
  20  0.5601073  

With the multivariate raw data, k-NN gives hourly accuracies between 0.50 and 0.65.

In [14]:
control1 <- trainControl(method="cv",
                        number=10)
knn_pca <- list()
for(i in 12:23){
    knn_pca[[i]] <- train(direction~pca1,
                  data=total_vol[hour==i,],
                  method="knn",
                  metric="Accuracy",
                  trControl=control1,
                  tuneGrid = expand.grid(.k=c(5,15,25,35,50)))
}

In [15]:
knn_pca

[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL

[[6]]
NULL

[[7]]
NULL

[[8]]
NULL

[[9]]
NULL

[[10]]
NULL

[[11]]
NULL

[[12]]
k-Nearest Neighbors 

1108 samples
   1 predictor
   3 classes: 'Negative', 'Neutral', 'Positive' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 997, 997, 997, 997, 998, 997, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.6067205  0.2507370
  15  0.5938920  0.1971945
  25  0.5749153  0.1497103
  35  0.5659395  0.1246118
  50  0.5839170  0.1431025

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.

[[13]]
k-Nearest Neighbors 

1108 samples
   1 predictor
   3 classes: 'Negative', 'Neutral', 'Positive' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 997, 997, 998, 995, 997, 997, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.5956721  0

When classification uses only the univariate first principal component, the results are similar to those of the multivariate raw data. In some hours PCA is better than the raw data, and in others it is worse.

In [11]:
control1 <- trainControl(method="cv",
                        number=10)
knn_rf <- list()
for(i in 12:23){
    knn_rf[[i]] <- train(direction~rf,
                  data=total_vol_test[hour==i,],
                  method="knn",
                  metric="Accuracy",
                  trControl=control1,
                  tuneGrid = expand.grid(.k=c(5,15,25,35,50)))    
}


In [17]:
knn_rf

[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL

[[6]]
NULL

[[7]]
NULL

[[8]]
NULL

[[9]]
NULL

[[10]]
NULL

[[11]]
NULL

[[12]]
k-Nearest Neighbors 

377 samples
  1 predictor
  3 classes: 'Negative', 'Neutral', 'Positive' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 339, 339, 339, 338, 340, 340, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa     
   5  0.6043112  0.20460584
  15  0.6023161  0.14349661
  25  0.5547799  0.04478103
  35  0.5890761  0.12762990
  50  0.5866506  0.13773197

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.

[[13]]
k-Nearest Neighbors 

377 samples
  1 predictor
  3 classes: 'Negative', 'Neutral', 'Positive' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 339, 340, 339, 339, 339, 340, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.5648138  0

When k-NN is applied to the univariate random forest series, the results are better than with both PCA and the raw data. Some hours reach an accuracy of 0.7.

In [124]:
total_vol_test <- total_vol_test[,c("date","hour","pca1","rf","direction")]
save(total_vol_test,file="C:/Users/kaan9/OneDrive/Masaüstü/test_data_son.Rdata")

In [None]:
load("C:/Users/kaan9/OneDrive/Masaüstü/test_data_son.Rdata")

To improve accuracy, the data is separated by hour and by time-window size. For each of the 12 test hours there are 3 window sizes (12, 24 and 36 hours) and 2 univariate series (PCA and RF), so 72 time series data sets are available.
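
The windowing step turns one long series into rows of lagged values. A minimal sketch with base R's `embed()` on a toy series (hypothetical values, window size 3 instead of 12, 24 or 36) shows the resulting layout; the notebook builds the same columns with `shift()` and then drops incomplete rows.

```r
# Toy hourly series (hypothetical values).
x <- c(10, 11, 13, 12, 15, 18, 17)
w <- 3  # window size; the notebook uses 12, 24 and 36

# embed() returns one row per time step t >= w with
# columns x[t], x[t-1], ..., x[t-w+1].
lagged <- embed(x, w)
colnames(lagged) <- paste0(0:(w - 1), "_hours_before")
print(lagged)
```

Each row is one candidate time series for classification: column `0_hours_before` holds the value at the target hour and the remaining columns hold the preceding hours.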

In [153]:
series <- list()
start_hours_before <- 0
window_sizes <- c(12,24,36)
cols <- c("rf","pca1")
for(c in cols){
    for(w in window_sizes){
        tmp <- data.table(total_vol_test)
        tmp[, paste0((start_hours_before):(w+start_hours_before-1), "_hours_before") := shift(tmp[[c]], (start_hours_before):(w+start_hours_before-1))]
        tmp <- tmp[complete.cases(tmp),]
        for(h in 12:23){
           hour <- paste("hour",as.character(h),sep="")
           window <- paste("window",as.character(w),sep="")
           st <- which(colnames(tmp)==paste(start_hours_before,"hours_before",sep="_"))
           e <- length(colnames(tmp))
           dir <- which(colnames(tmp)=="direction") 
           indices <- c(1,dir,st:e)
           series[[paste(hour,window,c,sep="_")]]  <- tmp[hour==h,..indices]
        }
    }
}

In addition to the raw univariate PCA and RF series, every series is also represented with SAX and with a regression tree. With 3 representations for each of the 72 series, there are 216 time series data sets in total.
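
To make the SAX representation concrete, the sketch below implements the textbook recipe in base R: z-normalise the window, average over segments of length `q` (piecewise aggregate approximation), and map each segment mean to one of `a` letters using standard normal quantiles. The helper `sax_sketch` is hypothetical and written only for illustration; `TSrepr::repr_sax` may differ in details such as normalisation and edge handling.

```r
# Simplified SAX (hypothetical helper, not the TSrepr implementation).
sax_sketch <- function(x, q = 2, a = 4) {
  z <- (x - mean(x)) / sd(x)                 # z-normalise the window
  seg <- ceiling(seq_along(z) / q)           # segment id for every point
  paa <- as.numeric(tapply(z, seg, mean))    # piecewise aggregate approximation
  breaks <- qnorm(seq_len(a - 1) / a)        # equiprobable breakpoints under N(0, 1)
  letters[findInterval(paa, breaks) + 1]     # one letter per segment
}

x <- c(1, 1, 5, 5, 6, 6, 10, 10)             # toy window of 8 hourly values
print(sax_sketch(x, q = 2, a = 4))
```

With `q = 2` and `a = 4`, the same settings as in the notebook, a window of 12 values becomes 6 letters drawn from a 4-letter alphabet. The notebook then replaces each letter by the mean of the original values that share it, so the SAX series stays numeric and can be fed to the distance functions.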

In [155]:
time_start <- Sys.time()
for(n in names(series)){
  series_long <- melt(series[[n]],id.vars = c("date","direction"))
  long_dt <- data.table()
  for(d in unique(series_long$date)){
    temp_dt <- series_long[date==d,]
    temp_tree <- rpart(value~variable,temp_dt,minbucket=1,minsplit=2)
    temp_pred <- predict(temp_tree)
    temp_dt$tree <- temp_pred
    temp_dt[,t:=1:.N]
    temp_sax <- repr_sax(temp_dt$value, q = 2, a = 4)
    dummy_time=c(1:(length(temp_sax)-1))*2
    dummy_time=c(dummy_time,nrow(temp_dt))  
    dt_sax=data.table(t=dummy_time,sax_rep_char=temp_sax)
    temp_dt <- merge(temp_dt,dt_sax,by="t",all.x=T)
    temp_dt[,sax_rep_char_num:=nafill(as.numeric(as.factor(sax_rep_char)),'nocb')] # from data.table  
    temp_dt[,sax_rep:=mean(value),by=list(sax_rep_char_num)]  
    long_dt <- rbind(long_dt,temp_dt)
  }
  series[[paste(n,"tree",sep="")]] <- dcast(long_dt,date+direction~variable,value.var="tree")
  series[[paste(n,"sax",sep="")]] <- dcast(long_dt,date+direction~variable,value.var="sax_rep")  
}
time_end <- Sys.time()

In [158]:
for(n in names(series)){
    series[[n]] <- series[[n]][date>="2021-01-02"]
}

In [None]:
save(series,file="C:/Users/kaan9/OneDrive/Masaüstü/series_test.RData")

In [16]:
load("C:/Users/kaan9/OneDrive/Masaüstü/series_test.RData")

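To apply k-NN with different distance measures, a distance matrix is computed for every time series data set. Three measures are used: Euclidean distance, dynamic time warping (DTW) and an edit-based distance labelled "edr" in the code. Note that the call passes distance='erp' to TSDatabaseDistances, which is edit distance with real penalty (ERP) rather than edit distance for real sequences (EDR). With 3 distance measures and 216 data sets, there are 648 distance matrices for k-NN.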
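
The difference between Euclidean distance and DTW is easiest to see on two series with the same shape but a one-step shift. The function `dtw_sketch` below is a hypothetical, minimal dynamic-programming implementation with a Sakoe-Chiba band and an absolute-difference local cost. It is not the algorithm configuration of the `dtw` package, whose default step pattern and normalisation differ, so the numbers are only indicative.

```r
# Minimal DTW with a Sakoe-Chiba band (hypothetical helper for illustration).
dtw_sketch <- function(x, y, window = 10) {
  n <- length(x); m <- length(y)
  stopifnot(abs(n - m) <= window)            # otherwise no path fits in the band
  cost <- matrix(Inf, n + 1, m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    # Only cells with |i - j| <= window are reachable.
    for (j in max(1, i - window):min(m, i + window)) {
      d <- abs(x[i] - y[j])
      cost[i + 1, j + 1] <- d + min(cost[i, j + 1],   # advance in x only
                                    cost[i + 1, j],   # advance in y only
                                    cost[i, j])       # advance in both
    }
  }
  cost[n + 1, m + 1]
}

a <- c(0, 1, 2, 3, 2, 1, 0)
b <- c(0, 0, 1, 2, 3, 2, 1)   # same shape, shifted by one step

print(sqrt(sum((a - b)^2)))   # Euclidean distance penalises the shift
print(dtw_sketch(a, b))       # DTW realigns most of it
```

Setting `window = 0` forces a point-by-point alignment, which removes the warping entirely. The band width of 10 used in the notebook therefore limits how far the alignment may drift within a 12, 24 or 36 hour window.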

In [36]:
time_start1 <- Sys.time()
distances <- list()
for(n in names(series)){
 
    distances[[paste(n,"euc",sep="_")]] <- as.matrix(dist(series[[n]][,3:length(series[[n]])]))
    diag(distances[[paste(n,"euc",sep="_")]]) <- 1000000
    
    distances[[paste(n,"dtw",sep="_")]] <- as.matrix(dtwDist(series[[n]][,3:length(series[[n]])],window.type='sakoechiba',window.size=10))
    diag(distances[[paste(n,"dtw",sep="_")]]) <- 1000000
    
    distances[[paste(n,"edr",sep="_")]] <- as.matrix(TSDatabaseDistances(series[[n]][,3:length(series[[n]])],distance='erp',g=0.5))
    diag(distances[[paste(n,"edr",sep="_")]]) <- 1000000
    
}
save(distances,file="C:/Users/kaan9/OneDrive/Masaüstü/distances_test.RData")
time_end1 <- Sys.time()

In [37]:
load("C:/Users/kaan9/OneDrive/Masaüstü/distances_test.RData")

To find the best models, 10-fold cross-validation with 5 repetitions is applied. The cross-validation test indices are computed before running k-NN.
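
Stratified folds keep the class mix of the full data in every fold. The base R sketch below builds one repetition of stratified 10-fold test indices on hypothetical labels; it is meant to show the idea behind `generateCVRuns(..., stratified = TRUE)`, not to reproduce that function's exact assignment.

```r
set.seed(131)
# Hypothetical class labels with an imbalanced distribution.
labels <- c(rep("Positive", 30), rep("Negative", 50), rep("Neutral", 20))
nfold <- 10

# Assign fold ids within each class so that every fold keeps the class mix.
fold_id <- integer(length(labels))
for (cl in unique(labels)) {
  idx <- which(labels == cl)
  fold_id[idx] <- sample(rep_len(seq_len(nfold), length(idx)))
}

# Test indices per fold, analogous to one repetition of the CV runs.
folds <- split(seq_along(labels), fold_id)
print(table(labels, fold_id))
```

Repeating the loop with different seeds would give the 5 repetitions. Computing the indices once and reusing them for every value of k means all candidate models are scored on identical splits.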

In [49]:
set.seed(131)
cv_indices <- list()
trainclasses <- list()
for(n in names(distances)){
    
    seri <- substr(n,start = 1,stop=gregexpr(pattern ='_',n)[[1]][3]-1)
    trainclasses[[n]] <- series[[seri]]$direction
    cv_indices[[n]] <- generateCVRuns(trainclasses[[n]], ntimes =5, nfold = 10, 
                                      leaveOneOut = FALSE, stratified = TRUE)
}


In [56]:
nn_classify_cv=function(dist_matrix,train_class,test_indices,k=1){
    
    test_distances_to_train=dist_matrix[test_indices,]
    test_distances_to_train=test_distances_to_train[,-test_indices]
    train_class=train_class[-test_indices]
    #print(str(test_distances_to_train))
    ordered_indices=apply(test_distances_to_train,1,order)
    if(k==1){
        # class labels are character, so they must not be coerced to numeric
        nearest_class=train_class[as.numeric(ordered_indices[1,])]
        nearest_class=data.table(id=test_indices,nearest_class)
    } else {
        nearest_class=apply(ordered_indices[1:k,],2,function(x) {train_class[x]})
        nearest_class=data.table(id=test_indices,t(nearest_class))
    }
    
    long_nn_class=melt(nearest_class,'id')
    class_counts=long_nn_class[,.N,list(id,value)]
    class_counts[,predicted_prob:=N/k]
    wide_class_prob_predictions=dcast(class_counts,id~value,value.var='predicted_prob')
    wide_class_prob_predictions[is.na(wide_class_prob_predictions)]=0
    class_predictions=class_counts[,list(predicted=value[which.max(N)]),by=list(id)]
    return(list(prediction=class_predictions,prob_estimates=wide_class_prob_predictions))
    
}
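
The core of the function above is a majority vote among the k smallest entries of one row of a precomputed distance matrix. A stripped-down sketch for a single test point follows; the distance values, labels and the helper `knn_vote` are hypothetical.

```r
# Toy distance matrix for 5 series (hypothetical values).
# The large diagonal masks self-matches, as in the notebook.
d <- matrix(c(1e6, 1,   2,   8,   9,
              1,   1e6, 1.5, 7,   8,
              2,   1.5, 1e6, 6,   7,
              8,   7,   6,   1e6, 1,
              9,   8,   7,   1,   1e6), nrow = 5, byrow = TRUE)
classes <- c("Positive", "Positive", "Positive", "Negative", "Negative")

knn_vote <- function(dist_row, train_class, k) {
  nearest <- order(dist_row)[seq_len(k)]   # indices of the k smallest distances
  votes <- table(train_class[nearest])
  names(votes)[which.max(votes)]           # ties go to the alphabetically first label
}

test_idx <- 3
# Drop the test point from both the distances and the labels before voting.
pred <- knn_vote(d[test_idx, -test_idx], classes[-test_idx], k = 3)
print(pred)
```

Working from a distance matrix keeps the classifier independent of the distance measure, which is why the same function can be reused for the Euclidean, DTW and edit-based matrices.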

In [58]:
time_starta <- Sys.time()
k_levels <- c(10,20,30,40)
results <- data.table(hour=character(0),model=character(0),k=numeric(0),mean_acc=numeric(0),sd_acc=numeric(0))
for(matr in names(distances)){
    hour <- substr(matr,start = 1,stop=gregexpr(pattern ='_',matr)[[1]][1]-1)
    for(k in k_levels){
        acc_temp <- numeric(0)
        for(rep in 1:5){
            this_rep <- cv_indices[[matr]][[rep]]
            for(fold in 1:10){
                test_indices <- this_rep[[fold]]
                preds <- nn_classify_cv(distances[[matr]],trainclasses[[matr]],
                                        test_indices,k)$prediction$predicted
                acc_temp <- c(acc_temp,
                              Metrics::accuracy(trainclasses[[matr]][test_indices],preds))
            }
        }
        results <- rbind(results,list(hour,matr,k,mean(acc_temp),sd(acc_temp)))
    }
}
save(results,file="C:/Users/kaan9/OneDrive/Masaüstü/results_test.RData")
time_enda <- Sys.time()

In [None]:
load("C:/Users/kaan9/OneDrive/Masaüstü/results_test.RData")

k-NN is applied to all 648 distance matrices. For every test hour there are 2 univariate series, 3 window sizes, 3 representations, 3 distance measures and 4 values of k. That gives 2 × 3 × 3 × 3 × 4 = 216 time series classification models for every test hour between 12 and 23, and the accuracy of each model is calculated. The best model by accuracy for every hour is shown in the table below.

In [4]:
tmp <- results[,max(mean_acc),by=list(hour)]
best <- results[tmp,on=.(hour=hour,mean_acc=V1)]
best

hour,model,k,mean_acc,sd_acc
<chr>,<chr>,<dbl>,<dbl>,<dbl>
hour12,hour12_window36_pca1sax_dtw,20,0.651394,0.04616396
hour13,hour13_window12_rf_euc,20,0.6583357,0.05978142
hour14,hour14_window12_pca1_dtw,20,0.7206686,0.03238162
hour15,hour15_window12_pca1_dtw,20,0.6936131,0.0461904
hour16,hour16_window24_pca1sax_edr,30,0.6935277,0.03293997
hour17,hour17_window12_rftree_euc,10,0.7166714,0.04893535
hour18,hour18_window12_rf_edr,20,0.7089758,0.04337602
hour19,hour19_window24_rf_dtw,30,0.7020341,0.02199065
hour20,hour20_window12_rfsax_euc,30,0.6788762,0.0507944
hour21,hour21_window36_rftree_edr,10,0.6460171,0.05104843
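
The counts quoted in this section (648 distance matrices, 216 models per hour) can be verified by enumerating the design grid. In this sketch the label "raw" for the unmodified series is an assumption for illustration, since the notebook uses no suffix for it.

```r
# One row per distance matrix: hour x series x window x representation x distance.
grid <- expand.grid(hour = 12:23,
                    series = c("rf", "pca1"),
                    window = c(12, 24, 36),
                    representation = c("raw", "tree", "sax"),
                    distance = c("euc", "dtw", "edr"),
                    stringsAsFactors = FALSE)
k_levels <- c(10, 20, 30, 40)

n_matrices <- nrow(grid)
n_models_per_hour <- n_matrices / length(12:23) * length(k_levels)
print(c(matrices = n_matrices, models_per_hour = n_models_per_hour))
```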


In [5]:
# Append the chosen k to each model name, e.g. "hour12_window36_pca1sax_dtw_k20"
best_models <- paste0(best$model,"_k",best$k)

In [7]:
save(best_models,file="C:/Users/kaan9/OneDrive/Masaüstü/best_test.RData")

## Prediction and Results Part

For the prediction part, the same data manipulation steps are followed.

In [4]:
dt <- fread("C:/Users/kaan9/OneDrive/Masaüstü/bulk_imbalance_son.csv")
total_vol <- data.table(dt[,c("date","hour","downRegulationZeroCoded",
                              "upRegulationZeroCoded","net","system_direction")])
colnames(total_vol) <- c("date","hour","yat_vol","yal_vol","net_imb","direction")

In [5]:
wt <- fread("C:/Users/kaan9/OneDrive/Masaüstü/weather_son.csv")
wt$loc <- paste("loc",as.character(wt$lat),as.character(wt$lon),sep="_")
wt <- data.table(pivot_wider(wt[,c(1,2,7,5,6)],names_from = c(loc,variable),values_from =value))
wt$day <- wday(wt$date)
wt$month <- month(wt$date)
total_vol <- wt[total_vol,on=.(date,hour)]
total_vol$t <- 1:nrow(total_vol)
total_vol[, direction:=ifelse(net_imb>50, "Positive" , ifelse(net_imb<(-50),"Negative","Neutral"))]
total_vol[, net_imb:=ifelse(net_imb<(-5000), (-5000), ifelse(net_imb>5000, 5000, net_imb))]

In [6]:
pca <- princomp(total_vol[,-c("date","yat_vol","yal_vol","net_imb","direction","t")])
summary(pca)

Importance of components:
                            Comp.1       Comp.2       Comp.3      Comp.4
Standard deviation     704.3836339 122.42625879 101.36671251 93.81010051
Proportion of Variance   0.8474425   0.02560005   0.01755022  0.01503111
Cumulative Proportion    0.8474425   0.87304257   0.89059278  0.90562389
                            Comp.5      Comp.6      Comp.7       Comp.8
Standard deviation     88.26658029 84.20265311 82.61431247 74.974045479
Proportion of Variance  0.01330713  0.01210998  0.01165742  0.009600937
Cumulative Proportion   0.91893102  0.93104099  0.94269841  0.952299347
                             Comp.9    Comp.10      Comp.11      Comp.12
Standard deviation     70.751694181 68.4074839 61.486591144 56.258304100
Proportion of Variance  0.008549986  0.0079928  0.006457324  0.005405863
Cumulative Proportion   0.960849333  0.9688421  0.975299457  0.980705320
                            Comp.13      Comp.14      Comp.15    Comp.16
Standard deviation     52.018

In [7]:
total_vol[,pca1:=pca$scores[,1]]

In [8]:
date_fltr <- which(total_vol$date=="2022-01-12")[total_vol[which(total_vol$date=="2022-01-12")]$hour==16]
total_vol_train <- total_vol[1:date_fltr,]

For the predictions, the Random Forest model is trained on all available data.

In [None]:
start <- Sys.time()
model_rf_all <- randomForest(net_imb ~.-date-yat_vol-yal_vol-direction-t-pca1,total_vol_train)
end <- Sys.time()
save(model_rf_all,file="C:/Users/kaan9/OneDrive/Masaüstü/rf_model_all.Rdata")

In [9]:
load("C:/Users/kaan9/OneDrive/Masaüstü/rf_model_all.Rdata")
model_rf_all


Call:
 randomForest(formula = net_imb ~ . - date - yat_vol - yal_vol -      direction - t - pca1, data = total_vol_train) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 15

          Mean of squared residuals: 378077.6
                    % Var explained: 62.9

In [None]:
control1 <- trainControl(method="cv",
                        number=10)
start <- Sys.time()
model_ranger <-  train(net_imb ~.-date-yat_vol-yal_vol-direction-t-pca1, 
                       data = total_vol_train,
                       method = 'ranger',
                       metric = 'RMSE',
                       trControl = control1,
                       # grid matching the saved model that is printed in the next cell
                       tuneGrid = expand.grid(.mtry=c(5,10,15),.splitrule="variance",.min.node.size=c(3,5,7)))
end <- Sys.time()
save(model_ranger,file="C:/Users/kaan9/OneDrive/Masaüstü/ranger_model.Rdata")

In [10]:
load("C:/Users/kaan9/OneDrive/Masaüstü/ranger_model.Rdata")
model_ranger

Random Forest 

26585 samples
   51 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 23925, 23927, 23927, 23926, 23927, 23928, ... 
Resampling results across tuning parameters:

  mtry  min.node.size  RMSE      Rsquared   MAE     
   5    3              650.5268  0.6498134  473.3433
   5    5              653.8487  0.6459687  476.2564
   5    7              658.5292  0.6397895  479.8987
  10    3              632.6367  0.6636021  460.0678
  10    5              636.0772  0.6601107  462.6722
  10    7              639.5813  0.6558312  465.6089
  15    3              625.6155  0.6685003  455.0549
  15    5              628.2432  0.6654816  456.8414
  15    7              631.4221  0.6619813  459.5056

Tuning parameter 'splitrule' was held constant at a value of variance
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 15, splitrule = variance
 and min.node.size = 3.

In [None]:
control1 <- trainControl(method="cv",
                        number=10)
start <- Sys.time()
model_ranger_son <-  train(net_imb ~.-date-yat_vol-yal_vol-direction-t-pca1, 
                       data = total_vol_train,
                       num.trees=500,
                       method = 'ranger',
                       metric = 'RMSE',
                       trControl = control1,
                       tuneGrid = expand.grid(.mtry=15,.splitrule="variance",.min.node.size=3))
end <- Sys.time()
save(model_ranger_son,file="C:/Users/kaan9/OneDrive/Masaüstü/ranger_model_son.Rdata")

Random forests are fitted with both the randomForest and ranger packages, and the ranger parameters are tuned with cross-validation. The best RF model, which is used to transform the data to a univariate series, is shown below.

In [11]:
load("C:/Users/kaan9/OneDrive/Masaüstü/ranger_model_son.Rdata")
model_ranger_son

Random Forest 

26585 samples
   51 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 23926, 23926, 23926, 23926, 23928, 23926, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  624.4039  0.6715859  454.5149

Tuning parameter 'mtry' was held constant at a value of 15
Tuning parameter 'splitrule' was held constant at a value of variance
Tuning parameter 'min.node.size' was held constant at a value of 3

In [None]:
series <- list()
start_hours_before <- 0
window_sizes <- c(12,24,36)
cols <- c("rf","pca1")
for(c in cols){
    for(w in window_sizes){
        tmp <- data.table(total_vol)
        tmp[, paste0((start_hours_before):(w+start_hours_before-1), "_hours_before") := shift(tmp[[c]], (start_hours_before):(w+start_hours_before-1))]
        tmp <- tmp[complete.cases(tmp),]
        for(h in 12:23){
           hour <- paste("hour",as.character(h),sep="")
           window <- paste("window",as.character(w),sep="")
           st <- which(colnames(tmp)==paste(start_hours_before,"hours_before",sep="_"))
           e <- length(colnames(tmp))
           dir <- which(colnames(tmp)=="direction") 
           indices <- c(1,dir,st:e)
           series[[paste(hour,window,c,sep="_")]]  <- tmp[hour==h,..indices]
        }
    }
}

In [None]:
time_start <- Sys.time()
for(n in names(series)){
  series_long <- melt(series[[n]],id.vars = c("date","direction"))
  long_dt <- data.table()
  for(d in unique(series_long$date)){
    temp_dt <- series_long[date==d,]
    temp_tree <- rpart(value~variable,temp_dt,minbucket=1,minsplit=2)
    temp_pred <- predict(temp_tree)
    temp_dt$tree <- temp_pred
    temp_dt[,t:=1:.N]
    temp_sax <- repr_sax(temp_dt$value, q = 2, a = 4)
    dummy_time=c(1:(length(temp_sax)-1))*2
    dummy_time=c(dummy_time,nrow(temp_dt))  
    dt_sax=data.table(t=dummy_time,sax_rep_char=temp_sax)
    temp_dt <- merge(temp_dt,dt_sax,by="t",all.x=T)
    temp_dt[,sax_rep_char_num:=nafill(as.numeric(as.factor(sax_rep_char)),'nocb')] # from data.table  
    temp_dt[,sax_rep:=mean(value),by=list(sax_rep_char_num)]  
    long_dt <- rbind(long_dt,temp_dt)
  }
  series[[paste(n,"tree",sep="")]] <- dcast(long_dt,date+direction~variable,value.var="tree")
  series[[paste(n,"sax",sep="")]] <- dcast(long_dt,date+direction~variable,value.var="sax_rep")  
}
time_end <- Sys.time()

In [12]:
load("C:/Users/kaan9/OneDrive/Masaüstü/best_test.RData")
load("C:/Users/kaan9/OneDrive/Masaüstü/series_rapor.RData")

In [13]:
trainclasses <- list()
for(h in 12:23){
    trainclasses[[as.character(h)]] <- total_vol[hour==h,]$direction
}

The time series data is built with the same steps. The "predict_knn" function shown below returns the predictions for hours 12 to 23 of the requested day, using the best model found for each hour.

In [14]:
predict_knn <- function(bestModels,date){
    
    today <- date
    predictions <- character(0)
    for(n in bestModels){
        seri <- substr(n,start = 1,stop=gregexpr(pattern ='_',n)[[1]][3]-1)
        hour <- as.numeric(substr(n,start = 5,stop=gregexpr(pattern ='_',n)[[1]][1]-1))
        dis <- substr(n,start = gregexpr(pattern ='_',n)[[1]][3]+1,stop=gregexpr(pattern ='_',n)[[1]][4]-1)
        k <- as.numeric(substr(n,start = gregexpr(pattern ='_',n)[[1]][4]+2,stop=nchar(n)))
        train_class <- trainclasses[[as.character(hour)]]   

        if(dis=="edr"){ 
           dist_matrix <- TSDatabaseDistances(X = series[[seri]][,3:length(series[[seri]])],
                                              Y = series[[seri]][date==today,3:length(series[[seri]])],        
                                              distance='erp',g=0.5) 
        }else if(dis=="dtw"){  
            dist_matrix <- dtwDist(mx=series[[seri]][,3:length(series[[seri]])],
                                   my=series[[seri]][date==today,3:length(series[[seri]])],
                                   window.type='sakoechiba',window.size=10)      
        }else if(dis=="euc"){    
            dist_matrix <- TSDatabaseDistances(X = series[[seri]][,3:length(series[[seri]])],
                                               Y = series[[seri]][date==today,3:length(series[[seri]])],        
                                               distance='euc')      
        }

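        # Mask the last entry, which is assumed to be the target day itself, so it cannot be its own neighbour
        dist_matrix[length(dist_matrix)] <- 1000000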
        ordered_indices <- order(dist_matrix)
        nearest_class <- train_class[ordered_indices[1:k]]
        tmp_table <- table(nearest_class)    
        pred <- names(which.max(tmp_table))
        predictions <- c(predictions,pred)
    }
    return(predictions)
    
}
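
The function above recovers the series key, hour, distance measure and k from each model name by locating underscores with `gregexpr`. The same fields can be read more directly with `strsplit`. The helper `parse_model_name` is hypothetical and assumes names of the form `hour<H>_window<W>_<series>_<distance>_k<K>`, as produced for `best_models`.

```r
# Hypothetical helper: split a model id of the form used in best_models.
parse_model_name <- function(n) {
  parts <- strsplit(n, "_", fixed = TRUE)[[1]]
  list(series   = paste(parts[1:3], collapse = "_"),   # key into the `series` list
       hour     = as.numeric(sub("hour", "", parts[1])),
       distance = parts[4],
       k        = as.numeric(sub("k", "", parts[5])))
}

info <- parse_model_name("hour12_window36_pca1sax_dtw_k20")
str(info)
```

Splitting once and indexing the parts avoids repeating the `gregexpr` call for every field and fails more visibly if a name does not have the expected five parts.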

In [15]:
predict_knn(best_models,"2022-01-22")

After the competition period ended, the whole period was predicted again.

In [17]:
# Set the direction for hours 12-23 of 2022-01-22 by reference
total_vol[hour %in% 12:23 & date=="2022-01-22", direction:="Positive"]

In [19]:
test_start <- "2022-01-09"
test_end <- "2022-01-22"
test_dates <- unique(total_vol[hour %in% c(12:23),][(date>=test_start) & (date<=test_end),date])
real <- total_vol[hour %in% c(12:23),][(date>=test_start) & (date<=test_end),direction]

In [20]:
preds <- character(0)
for(d in test_dates){
    preds <- c(preds,predict_knn(best_models,d))
}

In [21]:
res_dt <- total_vol[hour %in% c(12:23),][(date>=test_start) & (date<=test_end),c("date","hour","direction")]
res_dt[,preds:=preds]

In [22]:
accu <- numeric(0)
for(d in test_dates){
    real_temp <- res_dt[date==d,direction]
    pred_temp <- res_dt[date==d,preds]
    accu <- c(accu,Metrics::accuracy(actual = real_temp,predicted = pred_temp))
}

The table below shows the daily accuracy for every day in the test period. The overall accuracy over the test period is 0.744.

In [54]:
res_son <- data.table("date"=test_dates,"Daily Accuracy"=accu)
res_son

date,Daily Accuracy
<date>,<dbl>
2022-01-09,0.75
2022-01-10,0.6666667
2022-01-11,0.9166667
2022-01-12,1.0
2022-01-13,0.75
2022-01-14,0.4166667
2022-01-15,0.6666667
2022-01-16,0.6666667
2022-01-17,0.3333333
2022-01-18,0.9166667


In [24]:
mean(accu)

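Two naive baselines are used for comparison: the direction observed 24 hours earlier and the direction observed 168 hours (one week) earlier. Their accuracies for the test period are shown below.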

In [40]:
baseline_accuracy_daily = function(dat, start_date, end_date){
    
    dat[, baseline1_prediction:=shift(direction, n=24, type="lag")]
    dat[, baseline2_prediction:=shift(direction, n=168, type="lag")]
    
    dat = dat[date>=start_date & date<=end_date & hour>=12]
    
    dat[,result_1:=ifelse(direction==baseline1_prediction, 1, 0)]
    dat[,result_2:=ifelse(direction==baseline2_prediction, 1, 0)]
    
    daily_summary = dat[,list(Baseline1_Daily_Accuracy=mean(result_1),
                              Baseline2_Daily_Accuracy=mean(result_2)), c("date")]
    
    return(daily_summary)
    
}

baseline_accuracy_daily(total_vol, as.Date("2022-01-09"), as.Date("2022-01-22"))

date,Baseline1_Daily_Accuracy,Baseline2_Daily_Accuracy
<date>,<dbl>,<dbl>
2022-01-09,0.75,0.5
2022-01-10,0.5,0.8333333
2022-01-11,0.6666667,0.5
2022-01-12,1.0,0.3333333
2022-01-13,0.8333333,0.8333333
2022-01-14,0.4166667,0.3333333
2022-01-15,0.25,0.9166667
2022-01-16,0.6666667,0.8333333
2022-01-17,0.1666667,0.5833333
2022-01-18,0.4166667,1.0
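
The two baselines are plain lags of the label series. The base R sketch below reproduces the idea on hypothetical labels that flip from "Positive" to "Negative" after five days, which makes the behaviour of the two lags easy to predict. The helper `lag_vec` is an assumption standing in for `data.table::shift`.

```r
# Hypothetical hourly labels for 9 days (216 hours): Positive for 5 days, then Negative.
direction <- rep(c("Positive", "Negative"), each = 120, length.out = 216)

# Base R lag: prepend NAs and drop the tail.
lag_vec <- function(x, n) c(rep(NA, n), head(x, -n))

baseline1 <- lag_vec(direction, 24)    # same hour, previous day
baseline2 <- lag_vec(direction, 168)   # same hour, previous week

# Score only the hours where both baselines are defined.
ok <- !is.na(baseline2)
acc1 <- mean(direction[ok] == baseline1[ok])
acc2 <- mean(direction[ok] == baseline2[ok])
print(c(lag24 = acc1, lag168 = acc2))
```

In this toy case the daily lag adapts to the regime change within a day while the weekly lag still predicts the old regime. On real data either baseline can win on a given day, as the table above shows.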


The table below compares the baselines and the model predictions by daily accuracy.

In [46]:
final_res <- baseline_accuracy_daily(total_vol, as.Date("2022-01-09"), as.Date("2022-01-22"))
final_res$Model_Daily_Accuracy <- res_son$`Daily Accuracy`
final_res$date <- as.character(final_res$date)
rbind(final_res,list("Mean",mean(final_res$Baseline1_Daily_Accuracy),
                     mean(final_res$Baseline2_Daily_Accuracy),mean(final_res$Model_Daily_Accuracy)))

date,Baseline1_Daily_Accuracy,Baseline2_Daily_Accuracy,Model_Daily_Accuracy
<chr>,<dbl>,<dbl>,<dbl>
2022-01-09,0.75,0.5,0.75
2022-01-10,0.5,0.8333333,0.6666667
2022-01-11,0.6666667,0.5,0.9166667
2022-01-12,1.0,0.3333333,1.0
2022-01-13,0.8333333,0.8333333,0.75
2022-01-14,0.4166667,0.3333333,0.4166667
2022-01-15,0.25,0.9166667,0.6666667
2022-01-16,0.6666667,0.8333333,0.6666667
2022-01-17,0.1666667,0.5833333,0.3333333
2022-01-18,0.4166667,1.0,0.9166667


In [72]:
baseline_accuracy_hourly = function(dat, start_date, end_date){
    
    dat[, baseline1_prediction:=shift(direction, n=24, type="lag")]
    dat[, baseline2_prediction:=shift(direction, n=168, type="lag")]
    
    dat = dat[date>=start_date & date<=end_date & hour>=12]
    
    dat[,result_1:=ifelse(direction==baseline1_prediction, 1, 0)]
    dat[,result_2:=ifelse(direction==baseline2_prediction, 1, 0)]
    
    hourly_summary = dat[,list(Baseline1_Hourly_Accuracy=mean(result_1),
                               Baseline2_Hourly_Accuracy=mean(result_2)), c("hour")]
    
    return(hourly_summary)
    
}

hourly_dt <- baseline_accuracy_hourly(total_vol, as.Date("2022-01-09"), as.Date("2022-01-22"))


The table below compares the baselines and the model predictions by hourly accuracy.

In [77]:
hourly_acc <- numeric(0)
for(h in 12:23){
    hourly_acc <- c(hourly_acc,Metrics::accuracy(actual = res_dt[hour==h,direction],predicted = res_dt[hour==h,preds]))
}
hourly_dt$Model_Hourly_Accuracy <- hourly_acc
hourly_dt

hour,Baseline1_Hourly_Accuracy,Baseline2_Hourly_Accuracy,Model_Hourly_Accuracy
<int>,<dbl>,<dbl>,<dbl>
12,0.5,0.3571429,0.2142857
13,0.5714286,0.5714286,0.7142857
14,0.5714286,0.7857143,0.7142857
15,0.8571429,0.7857143,0.7142857
16,0.6428571,0.4285714,0.6428571
17,0.8571429,0.8571429,0.9285714
18,1.0,0.8571429,1.0
19,0.5714286,0.6428571,0.7857143
20,0.5714286,0.6428571,0.7857143
21,0.5714286,0.7142857,0.7857143
