In [2]:
library(data.table)
library(lubridate)
library(rpart)
library(partykit)
library(ggplot2)
library(Metrics)
library(TSrepr)
library(TunePareto)
library(caret)
library(forecast)
library(tidyr)
library(rattle)
options(repr.plot.width=10, repr.plot.height=10)

In [3]:
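# Forecast accuracy summary: returns a one-row data.frame of error metrics.
accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean                             # coefficient of variation
  FBias=sum(error)/sum(actual)           # forecast bias
  # Note: the small constant is added to the ratio, not to the denominator,
  # so MAPE is not finite whenever actual contains zeros.
  MAPE=sum(abs(error/actual+0.0000001))/n
  RMSE=sqrt(sum(error^2)/n)
  MAD=sum(abs(error))/n                  # mean absolute deviation
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean                         # weighted MAPE
  l=data.frame(n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE)
  return(l)
}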

## Summary

The aim of this assignment is to forecast the hourly total production of multiple small solar power plants from the given data. The data consist of hourly production values and 175 features: temperature, radiation flux, humidity and total cloud cover (4 different layers) for 25 different locations. The scenario is to forecast the next day (d+1) using the data available up to yesterday (d-1). Explainable Boosted Linear Regression (EBLR) is used to forecast the production values. Before the models are constructed, missing values are handled. Then a base model, a simple linear regression, is fitted. Finally, the regression model is improved with decision trees that learn from the errors of the previous models. In conclusion, forecasts are made according to the forecasting scenario and compared with a naive forecast, which uses the last available values as the forecast for future values.

## Approach

### Handling Missing Values

In [32]:
dt <- fread("C:/Users/kaan9/OneDrive/Masaüstü/production_with_weather_data.csv")

In [33]:
na_rows <- which(rowSums(is.na(dt))>0)
dt[na_rows,]

date,hour,production,DSWRF_surface_38_35,DSWRF_surface_38_35.25,DSWRF_surface_38_35.5,DSWRF_surface_38_35.75,DSWRF_surface_38_36,DSWRF_surface_38.25_35,DSWRF_surface_38.25_35.25,⋯,TMP_2.m.above.ground_38.75_35,TMP_2.m.above.ground_38.75_35.25,TMP_2.m.above.ground_38.75_35.5,TMP_2.m.above.ground_38.75_35.75,TMP_2.m.above.ground_38.75_36,TMP_2.m.above.ground_39_35,TMP_2.m.above.ground_39_35.25,TMP_2.m.above.ground_39_35.5,TMP_2.m.above.ground_39_35.75,TMP_2.m.above.ground_39_36
<date>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2021-06-09,19,7.9012,,,,,,,,⋯,,,,,,,,,,
2021-11-26,14,189.8403,,,,,,,,⋯,284.835,284.405,283.225,282.735,282.425,283.215,283.015,282.375,282.495,282.715
2021-11-26,16,35.4222,272.98,269.88,235.88,183.32,79.18,260.46,251.68,⋯,285.017,284.447,283.517,282.757,282.217,283.637,283.197,282.347,282.457,282.957
2021-12-01,7,0.260118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,,,,,,,,,,
2021-12-13,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,,,,,,,,,,
2021-12-22,18,0.0,,,,,,,,⋯,269.097,269.567,268.067,263.837,264.207,268.907,269.037,268.077,267.567,267.297
2021-12-25,16,46.84658,,,,,,,,⋯,275.997,275.377,275.497,271.187,269.877,274.217,274.307,273.747,273.017,273.247


After reading the data, only 7 rows contain missing observations. Because so few values are missing, they are unlikely to influence the model, so each missing value is filled with the value from the previous hour.

In [34]:
for(i in 4:length(dt)){
    col <- dt[,..i]
    col[which(is.na(col))] <- col[which(is.na(col))-1]
    dt[,colnames(dt)[i]:=col]
}
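
The previous-hour fill can also be written as a small vector helper. The sketch below is an illustration in base R, not the code used for the results in this notebook; `fill_prev` is a hypothetical name. Unlike a single one-step shift, it also carries a value forward across consecutive missing hours, and it leaves a leading NA untouched because there is nothing to copy from.

```r
# Last-observation-carried-forward fill for a single vector.
# fill_prev is a hypothetical helper name used only for this illustration.
fill_prev <- function(x) {
  idx <- seq_along(x)
  idx[is.na(x)] <- 0L          # mark missing positions
  idx <- cummax(idx)           # index of the most recent non-missing value
  out <- x
  ok <- idx > 0                # positions that have a previous value to copy
  out[ok] <- x[idx[ok]]
  out
}

x <- c(NA, 1, NA, NA, 4)
filled <- fill_prev(x)
print(filled)
```

If data.table is already loaded, `nafill(x, type = "locf")` offers the same kind of fill for numeric columns.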

In [35]:
dt[na_rows,]

date,hour,production,DSWRF_surface_38_35,DSWRF_surface_38_35.25,DSWRF_surface_38_35.5,DSWRF_surface_38_35.75,DSWRF_surface_38_36,DSWRF_surface_38.25_35,DSWRF_surface_38.25_35.25,⋯,TMP_2.m.above.ground_38.75_35,TMP_2.m.above.ground_38.75_35.25,TMP_2.m.above.ground_38.75_35.5,TMP_2.m.above.ground_38.75_35.75,TMP_2.m.above.ground_38.75_36,TMP_2.m.above.ground_39_35,TMP_2.m.above.ground_39_35.25,TMP_2.m.above.ground_39_35.5,TMP_2.m.above.ground_39_35.75,TMP_2.m.above.ground_39_36
<date>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2021-06-09,19,7.9012,550.78,440.38,489.32,577.92,525.26,373.3,358.16,⋯,295.043,294.743,294.043,291.543,290.143,294.543,293.943,291.843,291.843,290.943
2021-11-26,14,189.8403,438.98,444.28,392.4,387.82,404.92,421.96,417.02,⋯,284.835,284.405,283.225,282.735,282.425,283.215,283.015,282.375,282.495,282.715
2021-11-26,16,35.4222,272.98,269.88,235.88,183.32,79.18,260.46,251.68,⋯,285.017,284.447,283.517,282.757,282.217,283.637,283.197,282.347,282.457,282.957
2021-12-01,7,0.260118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,276.78,276.52,275.38,273.56,273.62,276.91,277.12,275.11,274.43,274.48
2021-12-13,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,278.374,277.674,279.054,277.584,275.714,279.184,278.534,278.054,277.234,277.384
2021-12-22,18,0.0,193.0,194.44,184.04,183.86,178.42,181.98,177.16,⋯,269.097,269.567,268.067,263.837,264.207,268.907,269.037,268.077,267.567,267.297
2021-12-25,16,46.84658,412.68,417.42,412.18,408.1,401.86,395.74,391.78,⋯,275.997,275.377,275.497,271.187,269.877,274.217,274.307,273.747,273.017,273.247


In [36]:
dt[,w_day:=as.character(wday(date,label=T))]
dt[,mon:=as.character(month(date,label=T))]
dt[,trnd:=1:.N]
dt$production <- log(dt$production+1)
dt$lag_48 <- shift(dt$production,48)
dt <- dt[date>"2019-09-02"]

Trend, day-of-week and month features are added to the data, along with the production value lagged by 48 hours (lag_48). Since production is non-negative, a natural log transformation, log(production + 1), is applied to the output variable to prevent the linear regression from predicting negative values.
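
As a small illustration of the transformation, the sketch below applies log(x + 1) to a few production values taken from the table above and maps them back. It uses base R's `log1p` and `expm1`, which are equivalent to `log(x + 1)` and `exp(x) - 1`; the variable names are only for this example.

```r
# A few production values from the rows shown earlier, including a zero.
production <- c(0, 7.9012, 189.8403)

y <- log1p(production)        # same as log(production + 1); defined at zero
back <- expm1(y)              # exact inverse: exp(y) - 1
no_shift <- exp(y)            # forgetting the "- 1" overshoots by exactly 1

print(cbind(production, y, back, no_shift))
```

The last column shows that exponentiating without subtracting 1 leaves every value one unit too high on the original scale.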

In [37]:
train_dt <- dt[date<="2021-10-28",]
test_dt <- dt[date=="2021-10-30",]

Since the aim is to predict the next day's values (d+1) from data available up to d-1, data up to "2021-10-28" is used as training data, and "2021-10-30", the last day before the forecast period, is used as the test data.

### EBLR Model

To construct a base learner, a simple linear regression is fitted with trend, day of week, month and hour of day as regressors.

In [38]:
lm_model <- lm(production~trnd+w_day+mon+as.factor(hour),train_dt)
summary(lm_model)


Call:
lm(formula = production ~ trnd + w_day + mon + as.factor(hour), 
    data = train_dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5866 -0.3699  0.0397  0.3903  1.6995 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.909e-01  3.187e-02   9.129  < 2e-16 ***
trnd               5.393e-06  8.992e-07   5.998 2.04e-09 ***
w_dayCum          -1.131e-02  1.726e-02  -0.655  0.51228    
w_dayÇar           6.066e-03  1.722e-02   0.352  0.72468    
w_dayPaz           4.162e-02  1.726e-02   2.412  0.01587 *  
w_dayPer           9.350e-03  1.726e-02   0.542  0.58804    
w_dayPzt           1.090e-02  1.726e-02   0.631  0.52784    
w_daySal           2.809e-02  1.723e-02   1.631  0.10295    
monAra            -1.060e+00  2.378e-02 -44.562  < 2e-16 ***
monEki            -3.998e-01  2.157e-02 -18.534  < 2e-16 ***
monEyl            -1.912e-01  2.164e-02  -8.839  < 2e-16 ***
monHaz             1.706e-01  2.342e-02   7.285 3.34e-13 ***
monKa

In [39]:
test_pred1 <- predict(lm_model,test_dt)
accu(test_dt$production,test_pred1)

n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
24,2.124579,2.475094,1.164981,-0.06105456,inf,0.5319069,0.3046095,0.1433741,0.1433741
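
The `inf` in the MAPE column comes from hours where the actual value is zero, while WMAPE stays finite because it divides the total absolute error by the total of the actuals. The sketch below reproduces this on made-up numbers; the values are hypothetical and only mimic a log-scale day with zeros at night.

```r
# Hypothetical actuals and forecasts; two hours have zero actual value.
actual   <- c(0, 0, 2, 4, 6)
forecast <- c(0.1, 0, 2.5, 3.5, 6.5)
error    <- actual - forecast

ape   <- abs(error / actual)                     # per-hour percentage error
mape  <- mean(ape)                               # not finite: division by zero
wmape <- sum(abs(error)) / sum(abs(actual))      # total error over total actual

# For non-negative actuals, MAD / mean(actual) is the same quantity.
mad_over_mean <- mean(abs(error)) / mean(actual)

print(c(mape = mape, wmape = wmape, mad_over_mean = mad_over_mean))
```

This is why MADP and WMAPE coincide in the accuracy tables of this notebook: the transformed production is never negative.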


The first model has an R-squared value of 0.9285, which is good for a base learner, and a WMAPE of 0.1433 on the test data. To improve the model, a decision tree is fitted to the residuals to find the node with the largest absolute mean residual.

In [40]:
f_count <- 1
residuals <- lm_model$residuals
tree_learner <- rpart(residuals~.-date-production,
                       data=train_dt,
                       control=rpart.control(cp=0,maxdepth=4))
best_node <- as.numeric(rownames(tree_learner$frame[which.max(abs(tree_learner[[1]]$yval)),]))
b <- path.rpart(tree_learner,best_node)
depth <- length(b[[1]])
for(i in 2:depth){
    if(startsWith(b[[1]][i],"mon") | startsWith(b[[1]][i],"w_day")){
        f_count <- f_count+1
        next
    }
    train_dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    test_dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    f_count <- f_count+1
}


 node number: 31 
   root
   lag_48>=2.238
   TCDC_entire.atmosphere_39_35.5< 84.35
   hour>=16.5
   mon=Ağu,Haz,Mar,May,Nis,Tem


Node 31 has the largest absolute mean residual, so the splits on its path are added to the feature space as indicator (filter) variables. The code above fits a tree and adds the numeric splits as variables automatically. For splits on month or weekday, however, the feature is generated manually, as seen below.

In [41]:
train_dt[,fltr4:=as.numeric(mon %in% c("Ağu","Haz","Mar","May","Nis","Tem"))]
test_dt[,fltr4:=as.numeric(mon %in% c("Ağu","Haz","Mar","May","Nis","Tem"))]
dt[,fltr4:=as.numeric(mon %in% c("Ağu","Haz","Mar","May","Nis","Tem"))]
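
The manual step exists because a categorical split such as `mon=Ağu,Haz,...` is not valid R syntax and cannot be passed to `eval(parse(...))` directly. One possible way to handle both kinds of split in a single helper is sketched below. `rule_to_indicator` is a hypothetical function, not part of rpart, and it assumes that a rule without `<` or `>` has the form `variable=level1,level2,...` with no `=` or `,` inside the level names.

```r
# Turn one split description (as printed in the node paths above) into a
# 0/1 indicator. Hypothetical helper for illustration only.
rule_to_indicator <- function(rule, data) {
  if (!grepl("[<>]", rule)) {
    # Categorical split: "var=level1,level2,..."
    parts <- strsplit(rule, "=", fixed = TRUE)[[1]]
    lvls  <- strsplit(parts[2], ",", fixed = TRUE)[[1]]
    as.numeric(data[[parts[1]]] %in% lvls)
  } else {
    # Numeric split such as "hour>=16.5": evaluate it inside the data
    as.numeric(eval(parse(text = rule), envir = data))
  }
}

# Toy data with made-up rows
toy <- data.frame(hour = c(5, 17, 20), mon = c("Haz", "Oca", "Tem"))
toy$f_hour <- rule_to_indicator("hour>=16.5", toy)
toy$f_mon  <- rule_to_indicator("mon=Haz,Tem", toy)
print(toy)
```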

In [42]:
lm_model2 <- lm(production~trnd+w_day+mon+as.factor(hour)+
                fltr1:fltr2:fltr3:fltr4
                ,train_dt)
summary(lm_model2)


Call:
lm(formula = production ~ trnd + w_day + mon + as.factor(hour) + 
    fltr1:fltr2:fltr3:fltr4, data = train_dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6298 -0.3134  0.0242  0.3427  2.2397 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              1.991e-01  2.965e-02   6.715 1.94e-11 ***
trnd                     5.713e-06  8.353e-07   6.839 8.19e-12 ***
w_dayCum                -9.653e-03  1.603e-02  -0.602  0.54702    
w_dayÇar                 6.123e-03  1.600e-02   0.383  0.70188    
w_dayPaz                 4.046e-02  1.603e-02   2.524  0.01160 *  
w_dayPer                 6.560e-03  1.603e-02   0.409  0.68243    
w_dayPzt                 1.413e-02  1.603e-02   0.881  0.37814    
w_daySal                 2.559e-02  1.600e-02   1.599  0.10984    
monAra                  -9.225e-01  2.223e-02 -41.501  < 2e-16 ***
monEki                  -2.636e-01  2.019e-02 -13.052  < 2e-16 ***
monEyl                  -5.492e-02 

In [43]:
test_pred2 <- predict(lm_model2,test_dt)
accu(test_dt$production,test_pred2)

n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
24,2.124579,2.475094,1.164981,-0.06259739,inf,0.4194979,0.2744418,0.1291747,0.1291747


After adding these 4 filters, the new linear model's R-squared increases to 0.9383 and its WMAPE drops to 0.1291. To improve further, another decision tree is fitted to the residuals.
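
The `fltr1:fltr2:fltr3:fltr4` term in the model formula is an interaction of four 0/1 indicators. For numeric columns, R builds such a term as the element-wise product, so it equals 1 only for rows that satisfy every split on the path to the selected node. The sketch below checks this on a made-up three-filter example; the column names `f1`, `f2`, `f3` are hypothetical.

```r
# Made-up indicator columns: only the first row satisfies all conditions.
d <- data.frame(
  f1 = c(1, 1, 0, 1),
  f2 = c(1, 0, 1, 1),
  f3 = c(1, 1, 1, 0)
)

mm   <- model.matrix(~ f1:f2:f3, data = d)   # design matrix for the term
leaf <- d$f1 * d$f2 * d$f3                   # membership in the tree node

print(cbind(interaction = mm[, "f1:f2:f3"], leaf = leaf))
```

The model therefore receives a single extra regressor per iteration, whose coefficient can be read as a correction applied to the rows of that node.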

In [44]:
residuals <- lm_model2$residuals
tree_learner <- rpart(residuals~.-date-production,
                       data=train_dt,
                       control=rpart.control(cp=0,maxdepth=4))
best_node <- as.numeric(rownames(tree_learner$frame[which.max(abs(tree_learner[[1]]$yval)),]))
b <- path.rpart(tree_learner,best_node)
depth <- length(b[[1]])
for(i in 2:depth){
    if(startsWith(b[[1]][i],"mon") | startsWith(b[[1]][i],"w_day")){
        f_count <- f_count+1
        next
    }
    train_dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    test_dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    f_count <- f_count+1
}


 node number: 31 
   root
   lag_48>=1.229
   TCDC_low.cloud.layer_38.5_35.5< 39.3
   hour< 7.5
   mon=Ağu,Haz,May,Nis,Tem


Again, the node with the largest absolute residual is node 31. Its split information is added to the data as new features.

In [45]:
train_dt[,fltr8:=as.numeric(mon %in% c("Ağu","Haz","May","Nis","Tem"))]
test_dt[,fltr8:=as.numeric(mon %in% c("Ağu","Haz","May","Nis","Tem"))]
dt[,fltr8:=as.numeric(mon %in% c("Ağu","Haz","May","Nis","Tem"))]

In [46]:
lm_model3 <- lm(production~trnd+w_day+mon+as.factor(hour)+
                fltr1:fltr2:fltr3:fltr4+
                fltr5:fltr6:fltr7:fltr8
                ,train_dt)
summary(lm_model3)


Call:
lm(formula = production ~ trnd + w_day + mon + as.factor(hour) + 
    fltr1:fltr2:fltr3:fltr4 + fltr5:fltr6:fltr7:fltr8, data = train_dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6840 -0.2745  0.0144  0.2980  2.3572 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              1.259e-01  2.743e-02   4.591 4.44e-06 ***
trnd                     5.582e-06  7.719e-07   7.231 4.96e-13 ***
w_dayCum                -8.985e-03  1.481e-02  -0.607  0.54414    
w_dayÇar                 3.156e-03  1.478e-02   0.213  0.83096    
w_dayPaz                 3.962e-02  1.481e-02   2.675  0.00748 ** 
w_dayPer                 3.828e-03  1.482e-02   0.258  0.79612    
w_dayPzt                 1.478e-02  1.482e-02   0.998  0.31850    
w_daySal                 2.364e-02  1.479e-02   1.598  0.10998    
monAra                  -7.921e-01  2.067e-02 -38.316  < 2e-16 ***
monEki                  -1.328e-01  1.880e-02  -7.063 1.68e-12 ***
monEyl   

In [47]:
test_pred3 <- predict(lm_model3,test_dt)
accu(test_dt$production,test_pred3)

n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
24,2.124579,2.475094,1.164981,-0.0625062,inf,0.3884094,0.2774711,0.1306005,0.1306005


With these filter variables, the model has an R-squared value of 0.9473, but the test WMAPE increases slightly (from 0.1291 to 0.1306). Since the R-squared still increases, another decision tree is constructed.

In [48]:
residuals <- lm_model3$residuals
tree_learner <- rpart(residuals~.-date-production,
                       data=train_dt,
                       control=rpart.control(cp=0,maxdepth=4))
best_node <- as.numeric(rownames(tree_learner$frame[which.max(abs(tree_learner[[1]]$yval)),]))
b <- path.rpart(tree_learner,best_node)
depth <- length(b[[1]])
for(i in 2:depth){
    if(startsWith(b[[1]][i],"mon") | startsWith(b[[1]][i],"w_day")){
        f_count <- f_count+1
        next
    }
    train_dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    test_dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    f_count <- f_count+1
}


 node number: 16 
   root
   TCDC_entire.atmosphere_38.5_35>=96.85
   lag_48>=0.02411
   RH_2.m.above.ground_38.5_35.5>=88.35
   mon=Ara,Kas,Oca,Şub


Node 16 has the largest absolute residual, so its split information is again added to the data as filter variables.

In [49]:
train_dt[,fltr12:=as.numeric(mon %in% c("Ara","Kas","Oca","Şub"))]
test_dt[,fltr12:=as.numeric(mon %in% c("Ara","Kas","Oca","Şub"))]
dt[,fltr12:=as.numeric(mon %in% c("Ara","Kas","Oca","Şub"))]

In [50]:
lm_model4 <- lm(production~trnd+w_day+mon+as.factor(hour)+
                fltr1:fltr2:fltr3:fltr4+
                fltr5:fltr6:fltr7:fltr8+
                fltr9:fltr10:fltr11:fltr12
                ,train_dt)
summary(lm_model4)


Call:
lm(formula = production ~ trnd + w_day + mon + as.factor(hour) + 
    fltr1:fltr2:fltr3:fltr4 + fltr5:fltr6:fltr7:fltr8 + fltr9:fltr10:fltr11:fltr12, 
    data = train_dt)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81390 -0.25146  0.01316  0.27873  2.37327 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 1.213e-01  2.551e-02   4.754 2.00e-06 ***
trnd                        3.733e-06  7.187e-07   5.194 2.08e-07 ***
w_dayCum                   -1.333e-03  1.378e-02  -0.097 0.922931    
w_dayÇar                    4.690e-03  1.375e-02   0.341 0.733018    
w_dayPaz                    2.514e-02  1.378e-02   1.825 0.068090 .  
w_dayPer                    8.748e-04  1.378e-02   0.063 0.949381    
w_dayPzt                   -6.324e-03  1.378e-02  -0.459 0.646368    
w_daySal                    2.494e-02  1.375e-02   1.813 0.069784 .  
monAra                     -6.511e-01  1.940e-02 -33.562  < 2e-16 ***
mo

In [51]:
test_pred4 <- predict(lm_model4,test_dt)
accu(test_dt$production,test_pred4)

n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
24,2.124579,2.475094,1.164981,-0.05598848,inf,0.373917,0.24904,0.1172185,0.1172185


In [52]:
residuals <- lm_model4$residuals
tree_learner <- rpart(residuals~.-date-production,
                       data=train_dt,
                       control=rpart.control(cp=0,maxdepth=4))
best_node <- as.numeric(rownames(tree_learner$frame[which.max(abs(tree_learner[[1]]$yval)),]))
b <- path.rpart(tree_learner,best_node)
depth <- length(b[[1]])
for(i in 2:depth){
    if(startsWith(b[[1]][i],"mon") | startsWith(b[[1]][i],"w_day")){
        f_count <- f_count+1
        next
    }
    train_dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    test_dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    f_count <- f_count+1
}


 node number: 31 
   root
   lag_48>=1.725
   TCDC_entire.atmosphere_38.5_35.5< 85.95
   fltr3>=0.5
   mon=Eyl,Şub


In [53]:
train_dt[,fltr16:=as.numeric(mon %in% c("Eyl","Şub"))]
test_dt[,fltr16:=as.numeric(mon %in% c("Eyl","Şub"))]
dt[,fltr16:=as.numeric(mon %in% c("Eyl","Şub"))]

In [54]:
lm_model5 <- lm(production~trnd+w_day+mon+as.factor(hour)+
                fltr1:fltr2:fltr3:fltr4+
                fltr5:fltr6:fltr7:fltr8+
                fltr9:fltr10:fltr11:fltr12+
                fltr13:fltr14:fltr15:fltr16
                ,train_dt)
summary(lm_model5)


Call:
lm(formula = production ~ trnd + w_day + mon + as.factor(hour) + 
    fltr1:fltr2:fltr3:fltr4 + fltr5:fltr6:fltr7:fltr8 + fltr9:fltr10:fltr11:fltr12 + 
    fltr13:fltr14:fltr15:fltr16, data = train_dt)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.82871 -0.23682  0.00891  0.26771  2.49183 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  1.230e-01  2.493e-02   4.937 8.02e-07 ***
trnd                         3.867e-06  7.023e-07   5.506 3.71e-08 ***
w_dayCum                    -2.525e-03  1.346e-02  -0.188   0.8512    
w_dayÇar                     4.184e-03  1.343e-02   0.311   0.7555    
w_dayPaz                     2.336e-02  1.346e-02   1.735   0.0827 .  
w_dayPer                    -2.046e-03  1.346e-02  -0.152   0.8792    
w_dayPzt                    -6.583e-03  1.347e-02  -0.489   0.6250    
w_daySal                     2.150e-02  1.344e-02   1.600   0.1096    
monAra                      -6.38

In [55]:
test_pred5 <- predict(lm_model5,test_dt)
accu(test_dt$production,test_pred5)

n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
24,2.124579,2.475094,1.164981,-0.05726881,inf,0.3409184,0.2415926,0.1137132,0.1137132


In [56]:
residuals <- lm_model5$residuals
tree_learner <- rpart(residuals~.-date-production-lag_48,
                       data=train_dt,
                       control=rpart.control(cp=0,maxdepth=4))
best_node <- as.numeric(rownames(tree_learner$frame[which.max(abs(tree_learner[[1]]$yval)),]))
b <- path.rpart(tree_learner,best_node)
depth <- length(b[[1]])
for(i in 2:depth){
    if(startsWith(b[[1]][i],"mon") | startsWith(b[[1]][i],"w_day")){
        f_count <- f_count+1
        next
    }
    train_dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    test_dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    dt[,paste("fltr",as.character(f_count),sep=""):=as.numeric(eval(parse(text = b[[1]][i])))]
    f_count <- f_count+1
}


 node number: 19 
   root
   fltr13< 0.5
   TMP_2.m.above.ground_38.25_35.5>=279.2
   DSWRF_surface_38_35.5>=449
   TCDC_middle.cloud.layer_38.25_36>=1.6


In [57]:
lm_model6 <- lm(production~trnd+w_day+mon+as.factor(hour)+
                fltr1:fltr2:fltr3:fltr4+
                fltr5:fltr6:fltr7:fltr8+
                fltr9:fltr10:fltr11:fltr12+
                fltr13:fltr14:fltr15:fltr16+
                fltr17:fltr18:fltr19:fltr20
                ,train_dt)
summary(lm_model6)


Call:
lm(formula = production ~ trnd + w_day + mon + as.factor(hour) + 
    fltr1:fltr2:fltr3:fltr4 + fltr5:fltr6:fltr7:fltr8 + fltr9:fltr10:fltr11:fltr12 + 
    fltr13:fltr14:fltr15:fltr16 + fltr17:fltr18:fltr19:fltr20, 
    data = train_dt)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.83028 -0.23521  0.00817  0.26688  2.50360 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  1.212e-01  2.487e-02   4.875 1.09e-06 ***
trnd                         3.840e-06  7.006e-07   5.481 4.27e-08 ***
w_dayCum                    -4.935e-04  1.343e-02  -0.037   0.9707    
w_dayÇar                     5.695e-03  1.340e-02   0.425   0.6709    
w_dayPaz                     2.488e-02  1.343e-02   1.852   0.0640 .  
w_dayPer                    -6.964e-05  1.343e-02  -0.005   0.9959    
w_dayPzt                    -4.526e-03  1.344e-02  -0.337   0.7363    
w_daySal                     2.405e-02  1.341e-02   1.794   0.0729 . 

In [58]:
test_pred6 <- predict(lm_model6,test_dt)
accu(test_dt$production,test_pred6)

n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
24,2.124579,2.475094,1.164981,-0.05636799,inf,0.3385884,0.240189,0.1130525,0.1130525


The EBLR procedure is applied for 5 iterations. While the base regression model has an R-squared value of 0.9285 and a WMAPE of 0.1433, the final model has an R-squared value of 0.9567 and a WMAPE of 0.1130 on the test data. The new features, derived from decision trees fitted to the residuals, therefore improved the model.
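
A single iteration of this procedure can be condensed into a few lines. The sketch below runs one iteration on synthetic data, not on the production data; `toy`, `x1`, `x2` and `fltr` are made-up names, and the tree depth is reduced to 2 to keep the example small. It mirrors the steps above: fit a linear model, fit a shallow tree on its residuals, pick the node with the largest absolute mean residual, and add the product of its split indicators as a new regressor.

```r
library(rpart)

set.seed(1)
n <- 400
toy <- data.frame(x1 = runif(n), x2 = runif(n))
# Linear signal plus a bump in one corner that a linear model cannot capture
toy$y <- 2 * toy$x1 + 3 * (toy$x1 > 0.5 & toy$x2 > 0.5) + rnorm(n, sd = 0.1)

# Step 1: base learner
base_fit <- lm(y ~ x1, data = toy)
toy$res  <- residuals(base_fit)

# Step 2: shallow tree on the residuals
tree <- rpart(res ~ x1 + x2, data = toy,
              control = rpart.control(cp = 0, maxdepth = 2))

# Step 3: node with the largest absolute mean residual and its path
frame <- tree$frame
worst <- as.numeric(rownames(frame)[which.max(abs(frame$yval))])
rules <- path.rpart(tree, worst, print.it = FALSE)[[1]][-1]  # drop "root"

# Step 4: product of the split indicators marks the rows in that node
ind <- rep(1, n)
for (r in rules) ind <- ind * as.numeric(eval(parse(text = r), envir = toy))
toy$fltr <- ind

# Step 5: refit with the new regressor
boosted_fit <- lm(y ~ x1 + fltr, data = toy)
print(c(base = summary(base_fit)$r.squared,
        boosted = summary(boosted_fit)$r.squared))
```

Because the second model nests the first, its training R-squared cannot be lower; whether the test error also improves has to be checked separately, as the WMAPE values above show.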

In [59]:
lm_model_son <- lm(production~trnd+w_day+mon+as.factor(hour)+lag_48+
                fltr1:fltr2:fltr3:fltr4+
                fltr5:fltr6:fltr7:fltr8+
                fltr9:fltr10:fltr11:fltr12+
                fltr13:fltr14:fltr15:fltr16+
                fltr17:fltr18:fltr19:fltr20   
                ,train_dt)
summary(lm_model_son)


Call:
lm(formula = production ~ trnd + w_day + mon + as.factor(hour) + 
    lag_48 + fltr1:fltr2:fltr3:fltr4 + fltr5:fltr6:fltr7:fltr8 + 
    fltr9:fltr10:fltr11:fltr12 + fltr13:fltr14:fltr15:fltr16 + 
    fltr17:fltr18:fltr19:fltr20, data = train_dt)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.61983 -0.09915  0.00101  0.13012  2.62905 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  3.112e-02  1.897e-02   1.641 0.100880    
trnd                         9.115e-07  5.346e-07   1.705 0.088175 .  
w_dayCum                    -2.565e-03  1.024e-02  -0.251 0.802124    
w_dayÇar                     3.003e-03  1.022e-02   0.294 0.768815    
w_dayPaz                     4.271e-02  1.024e-02   4.172 3.04e-05 ***
w_dayPer                    -3.711e-03  1.024e-02  -0.362 0.717026    
w_dayPzt                     5.217e-03  1.024e-02   0.509 0.610484    
w_daySal                     7.682e-03  1.022e-02   0.752 0.

In [60]:
test_pred_son <- predict(lm_model_son,test_dt)
accu(test_dt$production,test_pred_son)

n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
24,2.124579,2.475094,1.164981,0.04628246,inf,0.2309529,0.1507577,0.07095887,0.07095887


As a final improvement, the lagged production value is added to the final model. The 48th lag raises the R-squared to 0.9748 and lowers the test WMAPE to 0.0709.

## Forecast and Conclusion

To forecast, the training data is extended by one day at each step, starting with data up to 30 October; a new model is then fitted on the updated training data and a forecast is made for the day two days after the last training day.

In [66]:
train_end <- nrow(dt[date<="2021-10-30"])
test_dates <- unique(dt$date[dt$date<="2021-12-25" & dt$date>="2021-11-01"])
preds <- numeric(0)
for(d in test_dates ){
    train_tmp <- dt[1:train_end,]
    model <- lm(production~trnd+w_day+mon+as.factor(hour)+lag_48+
                fltr1:fltr2:fltr3:fltr4+
                fltr5:fltr6:fltr7:fltr8+
                fltr9:fltr10:fltr11:fltr12+
                fltr13:fltr14:fltr15:fltr16+
                fltr17:fltr18:fltr19:fltr20   
                ,train_tmp)
    p <- predict(model,newdata=dt[date==d,])
    preds <- c(preds,p)
    train_end <- train_end + 24
}
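
The same rolling scheme can be expressed by selecting the training rows by date instead of advancing a row counter, which assumes that every day contributes exactly 24 rows. The sketch below shows this variant on synthetic data in base R; `toy` and `y` are made-up, and the model is reduced to an hour-of-day effect. It is an alternative formulation, not the code that produced the results in this notebook.

```r
set.seed(42)
dates <- seq(as.Date("2021-10-01"), as.Date("2021-10-20"), by = "day")
toy <- data.frame(date = rep(dates, each = 24),
                  hour = rep(0:23, times = length(dates)))
toy$y <- sin(pi * toy$hour / 24) + rnorm(nrow(toy), sd = 0.05)

test_dates <- dates[dates >= as.Date("2021-10-15")]
preds <- numeric(0)
for (i in seq_along(test_dates)) {
  d <- test_dates[i]                    # indexing keeps the Date class
  train <- toy[toy$date <= d - 2, ]     # data available up to d-1 of the forecast scenario
  fit <- lm(y ~ as.factor(hour), data = train)
  preds <- c(preds, predict(fit, newdata = toy[toy$date == d, ]))
}
print(length(preds))
```

Iterating over an index rather than over the dates themselves avoids the loss of the Date class that happens when a `for` loop runs directly over a date vector.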

In [76]:
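real <- exp(dt[date %in% test_dates,production])-1  # invert log(production+1)
# Note: production was transformed with log(x+1), so the exact inverse is exp(x)-1;
# the two forecasts below use exp(x) only, as in the outputs shown.
base_sc <- exp(dt[date %in% test_dates,lag_48])     # naive forecast: value from 48 hours earlier
accu(real,base_sc)
preds_trns <- exp(preds)                            # EBLR forecast
accu(real,preds_trns)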

n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1296,45.46808,79.29,1.743861,-0.01978126,inf,51.70043,23.52425,0.5173795,0.5173795


n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1296,45.46808,79.29,1.743861,0.07300363,inf,41.83656,19.50246,0.4289263,0.4289263


Predictions are transformed back to production values by exponentiation. Over the test period of nearly 2 months, the predictions have a WMAPE of 0.4289, while the naive forecast has a WMAPE of 0.5173. EBLR's error is therefore about 17% lower than the baseline's (equivalently, the naive forecast's error is more than 20% higher than EBLR's). Learning from the previous errors noticeably improves the linear regression model.
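
The size of the gap depends on which forecast is taken as the reference. The short calculation below uses the two WMAPE values from the tables above and computes it both ways; the variable names are only for this example.

```r
wmape_naive <- 0.5173795   # naive forecast (48-hour lag)
wmape_eblr  <- 0.4289263   # EBLR forecast

# Error reduction relative to the naive baseline
reduction <- (wmape_naive - wmape_eblr) / wmape_naive
# How much higher the naive error is relative to EBLR
increase  <- (wmape_naive - wmape_eblr) / wmape_eblr

print(round(c(reduction = reduction, increase = increase), 3))
```

Measured against the naive baseline the reduction is about 17%, while measured against EBLR the naive error is about 21% higher.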