# Training Data Cleaning & Exploration - Linear Regression

**Overview of Implementation**
1. <a href="#section1">Data Cleaning</a>
2. <a href="#section2">Linear Regression</a>
    1. <a href="#section2a">Model 0</a>: baseline, no year
    2. <a href="#section2b">Model 1</a>: trainset.imputation, no time data
    3. <a href="#section2c">Model 2</a>: trainset.imputation, no time data, high-VIF predictors removed (motor_torque, generator_temperature)
    4. <a href="#section2d">Model 2P</a>: Model 2 + high p-value factors removed (turbine_status, shaft_temperature, gearbox_temperature, blade_length, windmill_height)
    5. <a href="#section2e">Model 3</a>: Model 2P + manual removal of inputs that can only be observed after installation (windmill_body_temperature, rotor_torque, engine_temperature)

## <a id='section1'>1. Data Cleaning</a>
Import data & explore statistics

In [1]:
library(data.table)
library(ggplot2)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang


In [2]:
# Import train data using data.table fread function
data.imputation <- fread("../data/train_data_lr.csv", stringsAsFactors = TRUE)

cat("\nNumber of NA values for imputed: ", sum(is.na(data.imputation)))

summary(data.imputation)
colnames(data.imputation)


Number of NA values for imputed:  0

 blade_breadth         year          month             mday      
 Min.   :0.2001   Min.   :2018   Min.   : 1.000   Min.   : 1.00  
 1st Qu.:0.3481   1st Qu.:2019   1st Qu.: 3.000   1st Qu.: 8.00  
 Median :0.3996   Median :2019   Median : 6.000   Median :15.00  
 Mean   :0.3979   Mean   :2019   Mean   : 6.222   Mean   :15.55  
 3rd Qu.:0.4504   3rd Qu.:2019   3rd Qu.: 9.000   3rd Qu.:23.00  
 Max.   :0.5000   Max.   :2019   Max.   :12.000   Max.   :31.00  
                                                                 
      wday            hour         wind_speed      atmospheric_temperature
 Min.   :0.000   Min.   : 0.00   Min.   :-104.60   Min.   :-166.759       
 1st Qu.:1.000   1st Qu.: 6.00   1st Qu.:  33.48   1st Qu.:   5.841       
 Median :3.000   Median :12.00   Median :  93.59   Median :  15.567       
 Mean   :2.998   Mean   :11.58   Mean   :  83.46   Mean   :   0.594       
 3rd Qu.:5.000   3rd Qu.:18.00   3rd Qu.:  95.60   3rd Qu.:  24.053       
 Max.   :6.000   Max. 

In [3]:
data.imputation$year = as.factor(data.imputation$year)
data.imputation$month = as.factor(data.imputation$month)
data.imputation$mday = as.factor(data.imputation$mday)
data.imputation$wday = as.factor(data.imputation$wday)
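The four conversions above can also be written as one step over a vector of column names. The following is a minimal sketch in base R on a made-up toy data frame (`toy` and `time_cols` are illustrative names, not part of the notebook); `hour` is left numeric, as it is in the cell above.

```r
# Toy data frame standing in for data.imputation (values are made up)
toy <- data.frame(
  year  = c(2018, 2019, 2019),
  month = c(1, 6, 12),
  mday  = c(3, 15, 31),
  wday  = c(0, 3, 6),
  hour  = c(0, 12, 23)
)

# Columns to treat as categorical; hour stays numeric
time_cols <- c("year", "month", "mday", "wday")
toy[time_cols] <- lapply(toy[time_cols], as.factor)

str(toy)
```

Keeping the column names in one vector makes it easier to see at a glance which time fields enter the model as factors.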

In [4]:
#sample split into train and test set
library(caTools)
set.seed(2021)
train <- sample.split(Y=data.imputation$windmill_generated_power, SplitRatio=0.7)

trainset.imputation <- subset(data.imputation, train == TRUE)
testset.imputation <- subset(data.imputation, train == FALSE)
paste("number of rows of imputed trainset: ", nrow(trainset.imputation))
paste("number of rows of imputed testset: ",nrow(testset.imputation))

In [5]:
head(trainset.imputation)

blade_breadth,year,month,mday,wday,hour,wind_speed,atmospheric_temperature,shaft_temperature,blades_angle,...,area_temperature,windmill_body_temperature,wind_direction,resistance,rotor_torque,blade_length,windmill_height,windmill_generated_power,turbine_status,cloud_level
0.3140648,2019,8,4,0,14,94.82002,-99.0,41.723019,-0.9034229,...,26.89787,64.71157,239.83639,2730.311,42.08467,2.217542,24.28169,6.766521,BA,Medium
0.4473414,2019,4,17,3,18,16.02625,-99.0,44.072819,-0.1968448,...,33.84939,43.00875,528.00399,1222.931,11.80511,2.917922,33.59351,5.089173,BD,Low
0.354881,2019,7,8,1,21,48.73783,12.71681,43.217778,-99.0,...,30.55316,-99.0,339.86603,1177.637,18.38487,2.93881,29.94482,8.536889,BA,Low
0.301911,2019,5,24,5,12,91.99617,39.43655,41.873082,69.4844587,...,23.15148,41.19579,248.81426,1662.733,23.0571,2.939582,24.55546,3.90696,BB,Medium
0.4448594,2019,7,5,5,10,11.59651,10.16427,-6.382525,1.2848402,...,22.95195,40.14483,279.15467,1188.048,12.03866,2.902797,20.15427,3.566115,BCB,Medium
0.4288537,2019,7,23,2,7,11.11376,12.68041,42.226497,-99.0,...,29.41502,41.26187,16.22034,1180.577,12.98519,2.586635,60.07083,3.75782,AC,Medium


### The imputed trainset now contains no NAs and is suitable for use in Linear Regression.

## <a id='section2'>2. Linear Regression</a>

In [6]:
library(car)

Loading required package: carData
"package 'carData' was built under R version 3.6.3"

### <a id="section2a">LR Model 0 (baseline)</a>
**trainset.imputation with selected time data (no year)**

In [7]:
# Develop model on trainset.imputation, including selected time data
m0 <- lm(windmill_generated_power ~ . - year, data = trainset.imputation)
summary(m0)


Call:
lm(formula = windmill_generated_power ~ . - year, data = trainset.imputation)

Residuals:
   Min     1Q Median     3Q    Max 
-7.532 -1.168 -0.127  1.030 13.777 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                3.104e+00  2.724e-01  11.395  < 2e-16 ***
blade_breadth              7.627e-01  2.531e-01   3.014 0.002587 ** 
month2                    -2.193e-01  6.851e-02  -3.201 0.001373 ** 
month3                    -2.206e+00  7.477e-02 -29.499  < 2e-16 ***
month4                    -3.221e+00  8.481e-02 -37.977  < 2e-16 ***
month5                    -2.117e+00  7.617e-02 -27.790  < 2e-16 ***
month6                    -1.366e+00  8.008e-02 -17.060  < 2e-16 ***
month7                    -1.604e+00  7.570e-02 -21.189  < 2e-16 ***
month8                    -1.965e+00  7.279e-02 -26.994  < 2e-16 ***
month9                    -2.261e+00  7.595e-02 -29.773  < 2e-16 ***
month10                   -1.796e+00  1.724e-01 -10.417  

In [8]:
# Residuals = Error = Actual windmill_generated_power - Model Predicted windmill_generated_power
RMSE.m0.train.imputation <- sqrt(mean(residuals(m0)^2))  # RMSE on trainset based on m0 model.
print(RMSE.m0.train.imputation)
summary(abs(residuals(m0)))  # Check Min Abs Error and Max Abs Error.

[1] 1.757599


    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.00004  0.52029  1.11093  1.35246  1.87716 13.77667 

In [9]:
# Apply model from trainset to predict on testset.
predict.m0.test.imputation <- predict(m0, newdata = testset.imputation)
testset.imputation.error <- testset.imputation$windmill_generated_power - predict.m0.test.imputation

# Testset Errors
RMSE.m0.test.imputation <- sqrt(mean(testset.imputation.error^2))
print(RMSE.m0.test.imputation)
summary(abs(testset.imputation.error))

[1] 1.745235


    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000323 0.496531 1.073644 1.342705 1.905631 9.947762 
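The train and test RMSE calculations are repeated for every model in this notebook. A small helper can make that pattern explicit. This is only a sketch: `rmse` is a hypothetical helper name, and the example runs on the built-in `mtcars` data with an arbitrary deterministic split, not on the windmill data.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy example on mtcars: first 22 rows as train, the rest as test (arbitrary split)
train_rows <- seq_len(nrow(mtcars)) <= 22
fit <- lm(mpg ~ wt + hp, data = mtcars[train_rows, ])

# Train RMSE from fitted values; equivalent to sqrt(mean(residuals(fit)^2))
rmse_train <- rmse(mtcars$mpg[train_rows], fitted(fit))

# Test RMSE from predictions on rows the model has not seen
rmse_test <- rmse(mtcars$mpg[!train_rows],
                  predict(fit, newdata = mtcars[!train_rows, ]))

print(c(train = rmse_train, test = rmse_test))
```

A test RMSE close to the train RMSE, as reported for Model 0 above, suggests the model is not badly overfitting the trainset.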

In [10]:
# Check for multicollinearity
vif(m0)

Term,GVIF,Df,GVIF^(1/(2*Df))
blade_breadth,1.071393,1,1.035081
month,3.90507,11,1.063879
mday,1.817169,30,1.010004
wday,1.291296,6,1.021532
hour,1.12938,1,1.062723
wind_speed,1.300991,1,1.14061
atmospheric_temperature,1.02599,1,1.012912
shaft_temperature,1.053324,1,1.026316
blades_angle,1.10665,1,1.051974
gearbox_temperature,1.025612,1,1.012725
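For terms with `Df = 1`, the GVIF reduces to the ordinary variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The `GVIF^(1/(2*Df))` column is then simply its square root. The sketch below reproduces that definition by hand on the built-in `mtcars` data; `manual_vif` is a hypothetical helper, not a function from `car`.

```r
# Manual VIF: regress each predictor on the others and apply 1 / (1 - R^2)
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

preds <- c("wt", "hp", "disp")
v <- manual_vif(mtcars, preds)

print(v)        # comparable to the GVIF column when Df = 1
print(sqrt(v))  # comparable to the GVIF^(1/(2*Df)) column when Df = 1
```

For factor terms such as `month` or `mday`, `Df` is greater than 1, and the `GVIF^(1/(2*Df))` column is the value that is comparable across terms with different degrees of freedom.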


### <a id="section2b">LR Model 1</a>
**trainset.imputation, no time data**

In [11]:
# Develop model on trainset.imputation, excluding time data
m1 <- lm(windmill_generated_power ~ . - year - mday - wday - month - hour, data = trainset.imputation)
summary(m1)


Call:
lm(formula = windmill_generated_power ~ . - year - mday - wday - 
    month - hour, data = trainset.imputation)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8555 -1.3697 -0.2158  1.1343 16.0174 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -2.421e-01  2.572e-01  -0.941 0.346683    
blade_breadth              6.588e-01  2.895e-01   2.275 0.022902 *  
wind_speed                -3.798e-03  3.288e-04 -11.551  < 2e-16 ***
atmospheric_temperature   -2.119e-03  3.919e-04  -5.406 6.56e-08 ***
shaft_temperature         -1.599e-03  6.501e-04  -2.459 0.013947 *  
blades_angle              -1.489e-04  3.755e-04  -0.397 0.691680    
gearbox_temperature        1.322e-03  4.005e-04   3.301 0.000965 ***
engine_temperature         3.046e-02  3.169e-03   9.614  < 2e-16 ***
motor_torque               3.310e-03  6.231e-05  53.130  < 2e-16 ***
generator_temperature     -9.614e-02  2.768e-03 -34.728  < 2e-16 ***
atmospheric_p

In [12]:
# Residuals = Error = Actual windmill_generated_power - Model Predicted windmill_generated_power
RMSE.m1.train.imputation <- sqrt(mean(residuals(m1)^2))  # RMSE on trainset based on m1 model.
print(RMSE.m1.train.imputation)
summary(abs(residuals(m1)))  # Check Min Abs Error and Max Abs Error.

[1] 2.017218


     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.000119  0.587901  1.270727  1.551903  2.185962 16.017448 

In [13]:
# Apply model from trainset to predict on testset.
predict.m1.test.imputation <- predict(m1, newdata = testset.imputation)
testset.imputation.error <- testset.imputation$windmill_generated_power - predict.m1.test.imputation

# Testset Errors
RMSE.m1.test.imputation <- sqrt(mean(testset.imputation.error^2))
print(RMSE.m1.test.imputation)
summary(abs(testset.imputation.error))

[1] 1.993386


     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.000025  0.589467  1.258778  1.540312  2.201629 11.495299 

In [14]:
# Check for multicollinearity
vif(m1)

Term,GVIF,Df,GVIF^(1/(2*Df))
blade_breadth,1.068155,1,1.033516
wind_speed,1.290777,1,1.136124
atmospheric_temperature,1.019871,1,1.009886
shaft_temperature,1.048959,1,1.024187
blades_angle,1.098709,1,1.048193
gearbox_temperature,1.022946,1,1.011408
engine_temperature,1.276638,1,1.129884
motor_torque,9.019779,1,3.003295
generator_temperature,10.225088,1,3.197669
atmospheric_pressure,1.068788,1,1.033822
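Two predictors stand out in the rows shown above. The sketch below flags them programmatically from the GVIF values copied from that output. The cutoff of 5 is an assumption (a common rule of thumb); the notebook does not state which threshold was used.

```r
# GVIF values copied from the vif(m1) rows shown above
gvif <- c(
  blade_breadth           = 1.068155,
  wind_speed              = 1.290777,
  atmospheric_temperature = 1.019871,
  shaft_temperature       = 1.048959,
  blades_angle            = 1.098709,
  gearbox_temperature     = 1.022946,
  engine_temperature      = 1.276638,
  motor_torque            = 9.019779,
  generator_temperature   = 10.225088,
  atmospheric_pressure    = 1.068788
)

vif_threshold <- 5  # assumption: rule-of-thumb cutoff, not stated in the notebook
high_vif <- names(gvif)[gvif > vif_threshold]
print(high_vif)
```

With this cutoff the flagged predictors are `motor_torque` and `generator_temperature`, the two that Model 2 drops.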


### <a id="section2c">LR Model 2</a>
**trainset.imputation, no time data, high-VIF predictors removed (motor_torque, generator_temperature)**

In [15]:
# Develop model on trainset.imputation, excluding time data and factors with high VIF
# high VIF: motor_torque, generator_temperature
m2 <- lm(windmill_generated_power ~ . 
         - year 
         - mday 
         - wday 
         - month 
         - hour 
         - motor_torque 
         - generator_temperature, data = trainset.imputation)
summary(m2)


Call:
lm(formula = windmill_generated_power ~ . - year - mday - wday - 
    month - hour - motor_torque - generator_temperature, data = trainset.imputation)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6686 -1.6016  0.0151  1.4458 15.1802 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -5.156e+00  2.699e-01 -19.107  < 2e-16 ***
blade_breadth              1.540e+00  3.228e-01   4.772 1.84e-06 ***
wind_speed                -2.587e-03  3.425e-04  -7.553 4.51e-14 ***
atmospheric_temperature   -2.419e-03  4.362e-04  -5.545 2.99e-08 ***
shaft_temperature         -1.560e-03  7.259e-04  -2.149  0.03167 *  
blades_angle              -5.142e-03  4.055e-04 -12.683  < 2e-16 ***
gearbox_temperature        1.144e-03  4.471e-04   2.558  0.01054 *  
engine_temperature         4.897e-02  3.489e-03  14.037  < 2e-16 ***
atmospheric_pressure       5.283e-07  1.053e-07   5.015 5.37e-07 ***
area_temperature           8.449e-02  2.793

In [16]:
# Residuals = Error = Actual windmill_generated_power - Model Predicted windmill_generated_power
RMSE.m2.train.imputation <- sqrt(mean(residuals(m2)^2))  # RMSE on trainset based on m2 model.
print(RMSE.m2.train.imputation)
summary(abs(residuals(m2)))  # Check Min Abs Error and Max Abs Error.

[1] 2.252434


     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.000348  0.725964  1.522702  1.780580  2.498034 15.180239 

In [17]:
# Apply model from trainset to predict on testset.
predict.m2.test.imputation <- predict(m2, newdata = testset.imputation)
testset.imputation.error <- testset.imputation$windmill_generated_power - predict.m2.test.imputation

# Testset Errors
RMSE.m2.test.imputation <- sqrt(mean(testset.imputation.error^2))
print(RMSE.m2.test.imputation)
summary(abs(testset.imputation.error))

[1] 2.225206


     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.000328  0.715462  1.491257  1.762371  2.497867 10.858205 

In [18]:
# Check for multicollinearity
vif(m2)

Term,GVIF,Df,GVIF^(1/(2*Df))
blade_breadth,1.065066,1,1.03202
wind_speed,1.123981,1,1.06018
atmospheric_temperature,1.013622,1,1.006788
shaft_temperature,1.048947,1,1.024181
blades_angle,1.027386,1,1.013601
gearbox_temperature,1.022579,1,1.011226
engine_temperature,1.241284,1,1.114129
atmospheric_pressure,1.034468,1,1.017088
area_temperature,1.2654,1,1.1249
windmill_body_temperature,1.012411,1,1.006186


### <a id="section2d">LR Model 2P</a>
**Model 2 + high p-value factors removed (turbine_status, shaft_temperature, gearbox_temperature, blade_length, windmill_height)**

In [19]:
# Develop model on m2, excluding high p-value factors
# high p-value: turbine_status, shaft_temperature, gearbox_temperature, blade_length, windmill_height
m2p <- lm(windmill_generated_power ~ . 
         - year 
         - mday 
         - wday 
         - month 
         - hour 
         - motor_torque 
         - generator_temperature
         - turbine_status
         - shaft_temperature
         - gearbox_temperature
         - blade_length
         - windmill_height, data = trainset.imputation)
summary(m2p)


Call:
lm(formula = windmill_generated_power ~ . - year - mday - wday - 
    month - hour - motor_torque - generator_temperature - turbine_status - 
    shaft_temperature - gearbox_temperature - blade_length - 
    windmill_height, data = trainset.imputation)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6073 -1.6000  0.0046  1.4512 15.1822 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -5.257e+00  2.545e-01 -20.653  < 2e-16 ***
blade_breadth              1.562e+00  3.227e-01   4.841 1.31e-06 ***
wind_speed                -2.580e-03  3.426e-04  -7.530 5.37e-14 ***
atmospheric_temperature   -2.406e-03  4.362e-04  -5.517 3.50e-08 ***
blades_angle              -5.171e-03  4.053e-04 -12.759  < 2e-16 ***
engine_temperature         4.853e-02  3.398e-03  14.280  < 2e-16 ***
atmospheric_pressure       5.292e-07  1.053e-07   5.024 5.13e-07 ***
area_temperature           8.447e-02  2.793e-03  30.246  < 2e-16 ***
windmill_b

In [20]:
# Residuals = Error = Actual windmill_generated_power - Model Predicted windmill_generated_power
RMSE.m2p.train.imputation <- sqrt(mean(residuals(m2p)^2))  # RMSE on trainset based on m2p model.
print(RMSE.m2p.train.imputation)
summary(abs(residuals(m2p)))  # Check Min Abs Error and Max Abs Error.

[1] 2.254904


    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.00004  0.73087  1.52327  1.78287  2.50643 15.18221 

In [21]:
# Apply model from trainset to predict on testset.
predict.m2p.test.imputation <- predict(m2p, newdata = testset.imputation)
testset.imputation.error <- testset.imputation$windmill_generated_power - predict.m2p.test.imputation

# Testset Errors
RMSE.m2p.test.imputation <- sqrt(mean(testset.imputation.error^2))
print(RMSE.m2p.test.imputation)
summary(abs(testset.imputation.error))

[1] 2.224104


     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.000763  0.719582  1.490907  1.760613  2.496258 10.782528 

In [22]:
# Check for multicollinearity
vif(m2p)

Term,GVIF,Df,GVIF^(1/(2*Df))
blade_breadth,1.063516,1,1.031269
wind_speed,1.123131,1,1.059779
atmospheric_temperature,1.012322,1,1.006142
blades_angle,1.025697,1,1.012767
engine_temperature,1.176767,1,1.084789
atmospheric_pressure,1.033578,1,1.01665
area_temperature,1.264031,1,1.124291
windmill_body_temperature,1.0108,1,1.005385
wind_direction,1.061588,1,1.030334
resistance,1.172322,1,1.082738


### <a id="section2e">LR Model 3</a>
**Model 2P + manual removal of inputs that can only be observed after installation (windmill_body_temperature, rotor_torque, engine_temperature)**

In [23]:
# Develop model on m2p, excluding post-installation observations
# observations: windmill_body_temperature, rotor_torque, engine_temperature
m3 <- lm(windmill_generated_power ~ . 
         - year 
         - mday 
         - wday 
         - month 
         - hour 
         - motor_torque 
         - generator_temperature
         - turbine_status
         - shaft_temperature
         - gearbox_temperature
         - blade_length
         - windmill_height
         - windmill_body_temperature
         - rotor_torque
         - engine_temperature, data = trainset.imputation)
summary(m3)


Call:
lm(formula = windmill_generated_power ~ . - year - mday - wday - 
    month - hour - motor_torque - generator_temperature - turbine_status - 
    shaft_temperature - gearbox_temperature - blade_length - 
    windmill_height - windmill_body_temperature - rotor_torque - 
    engine_temperature, data = trainset.imputation)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9495 -1.5770  0.0284  1.4638 14.0921 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -3.938e+00  2.346e-01 -16.790  < 2e-16 ***
blade_breadth            1.705e+00  3.270e-01   5.215 1.86e-07 ***
wind_speed              -1.779e-03  3.443e-04  -5.167 2.41e-07 ***
atmospheric_temperature -2.102e-03  4.418e-04  -4.758 1.97e-06 ***
blades_angle            -5.558e-03  4.104e-04 -13.542  < 2e-16 ***
atmospheric_pressure     4.271e-07  1.066e-07   4.006 6.22e-05 ***
area_temperature         9.837e-02  2.716e-03  36.219  < 2e-16 ***
wind_direction           5

In [24]:
# Residuals = Error = Actual windmill_generated_power - Model Predicted windmill_generated_power
RMSE.m3.train.imputation <- sqrt(mean(residuals(m3)^2))  # RMSE on trainset based on m3 model.
print(RMSE.m3.train.imputation)
summary(abs(residuals(m3)))  # Check Min Abs Error and Max Abs Error.

[1] 2.286162


     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.000183  0.735454  1.522536  1.801100  2.522995 14.092124 

In [25]:
# Apply model from trainset to predict on testset.
predict.m3.test.imputation <- predict(m3, newdata = testset.imputation)
testset.imputation.error <- testset.imputation$windmill_generated_power - predict.m3.test.imputation

# Testset Errors
RMSE.m3.test.imputation <- sqrt(mean(testset.imputation.error^2))
print(RMSE.m3.test.imputation)
summary(abs(testset.imputation.error))

[1] 2.257213


     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.000343  0.727997  1.500098  1.783388  2.496319 11.004963 

In [26]:
# Check for multicollinearity
vif(m3)

Term,GVIF,Df,GVIF^(1/(2*Df))
blade_breadth,1.062695,1,1.030871
wind_speed,1.104045,1,1.050736
atmospheric_temperature,1.010539,1,1.005256
blades_angle,1.023219,1,1.011543
atmospheric_pressure,1.030131,1,1.014954
area_temperature,1.163218,1,1.078526
wind_direction,1.054502,1,1.026889
resistance,1.145312,1,1.070193
cloud_level,1.076488,2,1.018597
