## Predict Power Generation of Solar Panels
### GMiS CAHSI Data Analytics Hackathon 2020 - Team 14: Mask-araid
----
__Question__: What DC voltage will a solar panel generate at a given time in the future, given the weather conditions?

__Software Used__: Jupyter Lab

__Programming Language__: Python 

### Research
First, let's do some research about solar panels. According to [1876 Energy](https://www.1876energy.com/5-factors-that-affect-solar-panel-efficiency/) and [Trace Software](https://www.trace-software.com/blog/which-are-the-factors-that-affect-solar-panels-efficiency/), the factors that contribute most to solar panel output are temperature, energy conversion efficiency (power), shade, solar radiation, and location (longitude and latitude). Solar panels also work more efficiently in cold temperatures, which lets the panel produce more voltage and more electricity. Rain and snow have no effect on solar panels; however, cloudy days and humidity can slow down production.

### Step I: EDA
First, we will perform some exploratory data analysis (EDA) to get a feel for the data.

In [1]:
import pandas as pd

In [2]:
data_set = pd.read_csv("cahsi_data_2020/D1.csv")

In [3]:
data_set.head(100)

Unnamed: 0,weather_datetime,solar_datetime,solarRadiation,uvHigh,winddirAvg,humidityHigh,humidityLow,humidityAvg,qcStatus,tempHigh,...,windchillAvg,heatindexHigh,heatindexLow,heatindexAvg,pressureMax,pressureMin,pressureTrend,precipRate,precipTotal,DC
0,2020-02-07 14:29:00,2020-02-07 14:29:1,627.70,7.0,195,24,24,24,-1,65,...,65,65,65,65,30.06,30.05,0.60,0.0,0.0,42.036
1,2020-02-07 14:34:00,2020-02-07 14:34:1,617.31,7.0,129,24,23,23,-1,68,...,67,68,66,67,30.06,30.05,-0.15,0.0,0.0,42.126
2,2020-02-07 14:39:00,2020-02-07 14:39:1,608.13,6.0,108,24,23,23,-1,68,...,67,68,67,67,30.06,30.05,0.00,0.0,0.0,42.264
3,2020-02-07 14:44:00,2020-02-07 14:44:1,582.57,6.0,87,25,24,24,-1,67,...,66,67,66,66,30.06,30.05,-0.15,0.0,0.0,42.204
4,2020-02-07 14:49:00,2020-02-07 14:49:1,571.67,6.0,38,24,24,24,-1,66,...,66,66,66,66,30.05,30.04,-0.15,0.0,0.0,42.360
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2020-02-07 22:24:00,2020-02-07 22:24:1,0.00,0.0,255,41,40,40,1,51,...,51,51,51,51,30.15,30.14,0.15,0.0,0.0,0.186
96,2020-02-07 22:29:00,2020-02-07 22:29:1,0.00,0.0,3,43,41,42,1,51,...,51,51,50,51,30.15,30.14,0.15,0.0,0.0,0.192
97,2020-02-07 22:34:00,2020-02-07 22:34:1,0.00,0.0,299,42,40,41,1,51,...,50,51,50,50,30.15,30.14,0.00,0.0,0.0,0.192
98,2020-02-07 22:39:00,2020-02-07 22:39:1,0.00,0.0,233,42,41,41,1,51,...,51,51,51,51,30.15,30.15,0.00,0.0,0.0,0.192


In [4]:
data_set.tail()

Unnamed: 0,weather_datetime,solar_datetime,solarRadiation,uvHigh,winddirAvg,humidityHigh,humidityLow,humidityAvg,qcStatus,tempHigh,...,windchillAvg,heatindexHigh,heatindexLow,heatindexAvg,pressureMax,pressureMin,pressureTrend,precipRate,precipTotal,DC
7955,2020-03-30 21:29:00,2020-03-30 21:29:1,0.0,0.0,153,25,25,25,1,62,...,62,62,62,62,30.25,30.24,0.0,0.0,0.0,0.03
7956,2020-03-30 21:34:00,2020-03-30 21:34:1,0.0,0.0,160,25,25,25,1,62,...,62,62,62,62,30.25,30.24,0.0,0.0,0.0,0.024
7957,2020-03-30 21:39:00,2020-03-30 21:39:1,0.0,0.0,188,25,25,25,1,62,...,62,62,62,62,30.25,30.24,0.0,0.0,0.0,0.03
7958,2020-03-30 21:44:00,2020-03-30 21:44:1,0.0,0.0,153,25,25,25,1,62,...,62,62,62,62,30.25,30.24,-0.15,0.0,0.0,0.024
7959,2020-03-30 21:49:00,2020-03-30 21:49:1,0.0,0.0,107,25,25,25,1,62,...,62,62,62,62,30.25,30.25,0.0,0.0,0.0,0.024


__Observation__: As it gets later in the day, solar radiation, UV, and temperature decrease. The DC voltage also decreases.
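One way to check this observation over the whole data set, rather than from the first and last rows only, is to average the readings by hour of day. The sketch below assumes `weather_datetime` parses cleanly with `pd.to_datetime`. `hourly_profile` is a hypothetical helper name, and the `sample` frame holds four rows copied from the output above as a stand-in for `D1.csv`.

```python
import pandas as pd

def hourly_profile(df, time_col="weather_datetime",
                   value_cols=("solarRadiation", "DC")):
    """Return the mean of value_cols for each hour of the day."""
    hours = pd.to_datetime(df[time_col]).dt.hour
    return df.groupby(hours)[list(value_cols)].mean()

# Four rows copied from the head(100) output, standing in for D1.csv.
sample = pd.DataFrame({
    "weather_datetime": ["2020-02-07 14:29:00", "2020-02-07 14:34:00",
                         "2020-02-07 22:24:00", "2020-02-07 22:29:00"],
    "solarRadiation": [627.70, 617.31, 0.0, 0.0],
    "DC": [42.036, 42.126, 0.186, 0.192],
})

profile = hourly_profile(sample)
print(profile)
```

On the real data, `hourly_profile(data_set)` should give one row per hour, which makes the daily rise and fall easier to see than scrolling through raw rows.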

In [5]:
# what other columns are there?
data_set.columns

Index(['weather_datetime', 'solar_datetime', 'solarRadiation', 'uvHigh',
       'winddirAvg', 'humidityHigh', 'humidityLow', 'humidityAvg', 'qcStatus',
       'tempHigh', 'tempLow', 'tempAvg', 'windspeedHigh', 'windgustLow',
       'windspeedAvg', 'dewptHigh', 'dewptLow', 'dewptAvg', 'windchillHigh',
       'windchillAvg', 'heatindexHigh', 'heatindexLow', 'heatindexAvg',
       'pressureMax', 'pressureMin', 'pressureTrend', 'precipRate',
       'precipTotal', 'DC'],
      dtype='object')

In [6]:
# what's the size of our data?
data_set.shape

(7960, 29)

In [7]:
# how distributed is the data?
data_set.describe()

Unnamed: 0,solarRadiation,uvHigh,winddirAvg,humidityHigh,humidityLow,humidityAvg,qcStatus,tempHigh,tempLow,tempAvg,...,windchillAvg,heatindexHigh,heatindexLow,heatindexAvg,pressureMax,pressureMin,pressureTrend,precipRate,precipTotal,DC
count,7960.0,7960.0,7960.0,7960.0,7960.0,7960.0,7960.0,7960.0,7960.0,7960.0,...,7960.0,7960.0,7960.0,7960.0,7960.0,7960.0,7960.0,7960.0,7960.0,7960.0
mean,180.382851,1.844472,182.790075,45.861683,44.912437,45.077261,0.89397,53.745729,53.420603,53.554397,...,53.41897,53.581533,53.236935,53.380653,30.200309,30.193024,0.000974,0.000469,0.012936,19.539283
std,264.082275,2.84604,78.432376,21.862087,21.940977,21.924786,0.311143,11.622671,11.565509,11.589908,...,11.720226,11.353402,11.265849,11.305647,0.137469,0.137532,0.095205,0.00505,0.053583,19.753129
min,0.0,0.0,0.0,11.0,10.0,10.0,-1.0,27.0,27.0,27.0,...,26.0,27.0,27.0,27.0,29.85,29.83,-0.6,0.0,0.0,0.0
25%,0.0,0.0,139.0,28.0,27.0,27.0,1.0,45.0,45.0,45.0,...,45.0,45.0,45.0,45.0,30.1,30.1,0.0,0.0,0.0,0.03
50%,0.0,0.0,197.0,43.0,42.0,42.0,1.0,54.0,54.0,54.0,...,54.0,54.0,54.0,54.0,30.18,30.18,0.0,0.0,0.0,8.379
75%,333.12,3.0,215.0,60.0,59.0,60.0,1.0,62.0,62.0,62.0,...,62.0,62.0,62.0,62.0,30.25,30.25,0.0,0.0,0.0,39.8235
max,986.88,10.0,359.0,98.0,98.0,98.0,1.0,83.0,83.0,83.0,...,83.0,80.0,80.0,80.0,30.61,30.6,0.6,0.13,0.37,43.71


In [8]:
# Use pd.DataFrame.corr function to see what correlations can be identified between DC and other features.
data_set.corr(method="spearman")

Unnamed: 0,solarRadiation,uvHigh,winddirAvg,humidityHigh,humidityLow,humidityAvg,qcStatus,tempHigh,tempLow,tempAvg,...,windchillAvg,heatindexHigh,heatindexLow,heatindexAvg,pressureMax,pressureMin,pressureTrend,precipRate,precipTotal,DC
solarRadiation,1.0,0.937831,-0.342999,-0.399324,-0.408921,-0.405694,-0.051987,0.4525,0.44402,0.448101,...,0.446924,0.452235,0.442935,0.447495,0.046178,0.045571,-0.073512,0.001359,0.035785,0.814947
uvHigh,0.937831,1.0,-0.316591,-0.407405,-0.418103,-0.414418,-0.048438,0.454691,0.445485,0.449924,...,0.448278,0.454565,0.444371,0.449403,0.062698,0.06251,-0.096479,-0.043045,0.00651,0.700912
winddirAvg,-0.342999,-0.316591,1.0,0.319881,0.321128,0.320633,-0.058951,-0.362586,-0.361985,-0.362065,...,-0.361569,-0.362126,-0.361636,-0.361757,0.196054,0.195846,0.011043,-0.018541,0.00361,-0.259023
humidityHigh,-0.399324,-0.407405,0.319881,1.0,0.99886,0.999456,-0.132945,-0.765681,-0.763568,-0.764586,...,-0.7593,-0.765219,-0.763032,-0.764093,0.169556,0.169504,0.027204,0.197191,0.362307,-0.102328
humidityLow,-0.408921,-0.418103,0.321128,0.99886,1.0,0.999736,-0.132788,-0.767797,-0.765245,-0.766474,...,-0.761256,-0.767374,-0.764697,-0.765993,0.167956,0.167909,0.029168,0.196765,0.360056,-0.109135
humidityAvg,-0.405694,-0.414418,0.320633,0.999456,0.999736,1.0,-0.132995,-0.766938,-0.764562,-0.765708,...,-0.760461,-0.766498,-0.764012,-0.765218,0.168278,0.168241,0.02843,0.196955,0.36102,-0.106906
qcStatus,-0.051987,-0.048438,-0.058951,-0.132945,-0.132788,-0.132995,1.0,0.050799,0.052421,0.051585,...,0.058814,0.050671,0.052195,0.051358,-0.130895,-0.131655,-0.013845,0.023923,0.051293,-0.128036
tempHigh,0.4525,0.454691,-0.362586,-0.765681,-0.767797,-0.766938,0.050799,1.0,0.99903,0.999402,...,0.998271,0.999708,0.998698,0.999141,-0.451769,-0.452192,-0.045211,-0.126698,-0.170193,0.176902
tempLow,0.44402,0.445485,-0.361985,-0.763568,-0.765245,-0.764562,0.052421,0.99903,1.0,0.999544,...,0.998397,0.998736,0.999677,0.999288,-0.455311,-0.455749,-0.043467,-0.125614,-0.17011,0.170358
tempAvg,0.448101,0.449924,-0.362065,-0.764586,-0.766474,-0.765708,0.051585,0.999402,0.999544,1.0,...,0.998819,0.999124,0.999221,0.999748,-0.453805,-0.45422,-0.04426,-0.126174,-0.170251,0.173594


__Observation__: DC appears to be strongly correlated with:
* ```solarRadiation``` - 0.81
* ```uvHigh``` - 0.70

and weakly correlated with:

* ```tempHigh```
* ```tempLow```
* ```tempAvg```
* ```windchillAvg```
* ```heatindexHigh```
* ```heatindexLow```
* ```heatindexAvg```
* ```precipTotal```

Does this reflect any information gathered from our research?
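The full correlation matrix is hard to scan, so it can help to pull out only the `DC` column and sort it. This is a minimal sketch: `rank_correlations` is a hypothetical helper, and the `toy` frame is made-up data with the same kind of columns as `D1.csv`. Non-numeric columns such as the datetime strings are filtered out first with `select_dtypes`.

```python
import pandas as pd

def rank_correlations(df, target="DC", method="spearman"):
    """Correlation of every numeric column with target, strongest first."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr(method=method)[target].drop(target)
    # Order by absolute value so strong negative correlations rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up data (assumption): one column rises with DC, one mostly falls.
toy = pd.DataFrame({
    "solarRadiation": [0, 100, 200, 300, 400, 500],
    "humidityAvg": [60, 50, 55, 40, 30, 35],
    "DC": [0.1, 10.0, 20.0, 30.0, 40.0, 42.0],
    "weather_datetime": ["t0", "t1", "t2", "t3", "t4", "t5"],
})

ranking = rank_correlations(toy)
print(ranking)
```

Calling `rank_correlations(data_set)` on the real data would list every numeric feature, including the ones hidden behind `...` in the truncated output.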

### Step II: Feature Selection

We will split the data into features and labels and convert them into arrays to be used for our model.

In [9]:
import numpy as np

In [10]:
# we want to predict DC
labels = np.array(data_set['DC'])

In [21]:
# Remove the labels and unimportant features from the features list.

col = [
 'weather_datetime',
 'solar_datetime',
 'winddirAvg',
 'humidityHigh',
 'humidityLow',
 'humidityAvg',
 'heatindexLow',
 'heatindexHigh',
 'heatindexAvg',
 'qcStatus',
 'windspeedHigh',
 'windgustLow',
 'windspeedAvg',
 'dewptHigh',
 'dewptLow',
 'dewptAvg',
 'windchillHigh',
 'windchillAvg',
 'pressureMax',
 'pressureMin',
 'pressureTrend',
 'precipRate',
 'precipTotal',
 'DC']

features = data_set.drop(col, axis=1)
feature_list = list(features.columns)  # keep the column names for later reference
features = np.array(features)

### Step III: Build and Train Model
Split the data into train and test sets.

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
# Note: the test size is small because I want to train on as much data as possible, since we have a separate test set.
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.1)

In [24]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (7164, 5)
Training Labels Shape: (7164,)
Testing Features Shape: (796, 5)
Testing Labels Shape: (796,)
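`train_test_split` shuffles rows at random by default, so the shapes above stay the same but the exact rows in each split change from run to run. If repeatable scores are wanted, a seed can be fixed with `random_state`. The sketch below uses made-up arrays in place of `features` and `labels`, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up arrays with the same layout as features/labels (assumption).
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.random(100)

# Splitting twice with the same seed gives identical partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.1, random_state=42)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```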


In [25]:
# the features we will be using to predict DC
feature_list

['solarRadiation', 'uvHigh', 'tempHigh', 'tempLow', 'tempAvg']

#### Step III.i: Hyperparameter Tuning
Hyperparameter tuning searches for the parameter values that work best for the model, which is much better than guessing. It isn't perfect, but it gives us some clues about what to try.

In [26]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt

----
```Gradient Boost```

----

In [56]:
gradient_boost_model = GradientBoostingRegressor()
gradient_params = {'learning_rate': sp_randFloat(),
                'subsample'    : sp_randFloat(),
                'n_estimators' : sp_randInt(200, 2000),
                'max_depth'    : sp_randInt(10, 110)
             }
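To see which ranges the search draws from, the `scipy.stats` distributions in `gradient_params` can be sampled directly. This is a small illustration only: the sample size and seed are arbitrary, and the narrower `uniform(loc=0.5, scale=0.5)` range is an assumption about what might be a reasonable alternative for `subsample`, not something taken from the notebook.

```python
from scipy.stats import randint, uniform

float_dist = uniform()                     # continuous uniform on [0, 1]
int_dist = randint(200, 2000)              # integers from 200 up to 1999
narrow_dist = uniform(loc=0.5, scale=0.5)  # continuous uniform on [0.5, 1.0]

floats = float_dist.rvs(size=5, random_state=0)
ints = int_dist.rvs(size=5, random_state=0)
narrow = narrow_dist.rvs(size=5, random_state=0)

print(floats)
print(ints)
print(narrow)
```

Because `uniform()` covers the whole unit interval, the search can draw a `subsample` close to 0 or a `learning_rate` close to 1; a narrower range is one way to keep candidates away from those extremes.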

In [58]:
random_gradient = RandomizedSearchCV(estimator= gradient_boost_model, param_distributions = gradient_params, cv = 3, verbose=2, n_iter = 100, n_jobs=-1)
random_gradient.fit(train_features, train_labels)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 14.9min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 28.2min finished


RandomizedSearchCV(cv=3, estimator=GradientBoostingRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aedd0>,
                                        'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aecd0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2ae050>,
                                        'subsample': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aee10>},
                   verbose=2)

In [59]:
# Results from Random Search
print(" Results from Random Search " )
print("\n The best estimator across ALL searched params:\n", random_gradient.best_estimator_)
print("\n The best score across ALL searched params:\n", random_gradient.best_score_)
print("\n The best parameters across ALL searched params:\n", random_gradient.best_params_)
print(random_gradient.score(test_features , test_labels))

 Results from Random Search 

 The best estimator across ALL searched params:
 GradientBoostingRegressor(learning_rate=0.01794706377831745, max_depth=32,
                          n_estimators=785, subsample=0.2873167459093807)

 The best score across ALL searched params:
 0.9655473309872132

 The best parameters across ALL searched params:
 {'learning_rate': 0.01794706377831745, 'max_depth': 32, 'n_estimators': 785, 'subsample': 0.2873167459093807}
0.9688789434674687
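By default `RandomizedSearchCV` refits the best candidate on the full training data, so the tuned model is available as `best_estimator_` without retyping its parameters. The sketch below shows this on made-up data with a much smaller search than the one above; the parameter ranges, `n_iter`, and the seeds are assumptions chosen so that it runs quickly.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Made-up regression data (assumption): the target depends on column 0.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = 40 * X[:, 0] + rng.normal(scale=0.5, size=200)

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_distributions={
        "learning_rate": uniform(0.01, 0.3),  # uniform on [0.01, 0.31]
        "n_estimators": randint(20, 60),
        "max_depth": randint(2, 5),
    },
    n_iter=5, cv=3, random_state=0,
)
search.fit(X, y)

# best_estimator_ is already refit on all of X, y (refit=True is the default).
best_model = search.best_estimator_
sample_predictions = best_model.predict(X[:3])
print(search.best_params_)
print(sample_predictions)
```

In the notebook, `random_gradient.best_estimator_` plays the same role and could be used for prediction directly.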


#### Step III.ii: Random Forest Model

In [61]:
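# Instantiate the model with 785 decision trees, reusing n_estimators and max_depth
# from the gradient boosting search above.
# criterion is left at its default (mean squared error); the string "mse" is not
# accepted by newer scikit-learn releases.
rf = RandomForestRegressor(n_estimators = 785, 
                           max_depth = 32, 
                           min_samples_split = 2)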

In [62]:
# Train the model on training data
rf.fit(train_features, train_labels)

RandomForestRegressor(max_depth=32, n_estimators=785)
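A fitted random forest exposes `feature_importances_`, which gives a rough view of which inputs the trees rely on and can be compared with the correlations from Step I. The sketch below uses made-up data in which only the first column drives the target; the names are borrowed from `feature_list`, and the forest is kept small so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["solarRadiation", "uvHigh", "tempHigh", "tempLow", "tempAvg"]

# Made-up data (assumption): only column 0 influences the target.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = 40 * X[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Pair each name with its importance and list the most important first.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

With the notebook's model, the equivalent call would be `zip(feature_list, rf.feature_importances_)`.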

#### Step III.iii: Accuracy - R2 Score
Let's see how well our model scores on the held-out portion of the training data.

In [73]:
y_pred = rf.predict(test_features)

In [75]:
from sklearn.metrics import r2_score

r2_score(test_labels, y_pred)

0.9709460517127418

__Comment__: Our model has an R2 score of about 0.97! That's not bad at all.
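For reference, the R2 score is one minus the ratio of the residual sum of squares to the total sum of squares, so it measures explained variance rather than a percentage of correct answers. The sketch below computes it by hand alongside the root mean squared error, which is expressed in the same units as `DC`. The helper names and the four sample values are made up for illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values (assumption), not taken from the notebook's data.
y_true = [0.0, 10.0, 20.0, 40.0]
y_hat = [1.0, 9.0, 22.0, 38.0]

r2_value = r2_manual(y_true, y_hat)
rmse_value = rmse(y_true, y_hat)
print(round(r2_value, 4), round(rmse_value, 4))
```

On the notebook's split, `r2_manual(test_labels, y_pred)` should agree with `r2_score(test_labels, y_pred)`.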

### Step IV: Predictions Using Test Dataset
Now we will test our model using the test set. Remember that whatever we did to the training set must also be done to the testing set!

In [63]:
test_set = pd.read_csv("cahsi_data_2020/D2.csv")


col = [
 'weather_datetime',
 'solar_datetime',
 'winddirAvg',
 'humidityHigh',
 'humidityLow',
 'humidityAvg',
 'heatindexLow',
 'heatindexHigh',
 'heatindexAvg',
 'qcStatus',
 'windspeedHigh',
 'windgustLow',
 'windspeedAvg',
 'dewptHigh',
 'dewptLow',
 'dewptAvg',
 'windchillHigh',
 'windchillAvg',
 'pressureMax',
 'pressureMin',
 'pressureTrend',
 'precipRate',
 'precipTotal']

testset_features = test_set.drop(col, axis = 1)
testset_features = np.array(testset_features)
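Dropping columns from each file separately works only as long as both files share the same remaining columns in the same order. A more defensive option is to select the wanted columns by name with one shared helper. The sketch below is an illustration under that assumption: `select_features` and `FEATURES` are hypothetical names, and both frames contain made-up values.

```python
import pandas as pd

# The five columns that remain after the drop in Step II.
FEATURES = ["solarRadiation", "uvHigh", "tempHigh", "tempLow", "tempAvg"]

def select_features(df, feature_names=FEATURES):
    """Return the feature columns as an array, always in the same order."""
    missing = [name for name in feature_names if name not in df.columns]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return df[list(feature_names)].to_numpy()

# Made-up frames (assumption): the test-like frame has no DC column
# and lists its columns in a different order.
train_like = pd.DataFrame({
    "solarRadiation": [627.7, 0.0], "uvHigh": [7.0, 0.0],
    "tempHigh": [65, 51], "tempLow": [64, 51], "tempAvg": [65, 51],
    "humidityAvg": [24, 40], "DC": [42.036, 0.186],
})
test_like = pd.DataFrame({
    "tempAvg": [60], "humidityAvg": [30], "uvHigh": [3.0],
    "tempLow": [59], "tempHigh": [61], "solarRadiation": [333.12],
})

train_X = select_features(train_like)
test_X = select_features(test_like)
print(train_X.shape, test_X.shape)
```

The helper raises a `KeyError` when a feature is missing, which is easier to diagnose than a model silently receiving columns in the wrong order.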

In [64]:
# Use the forest's predict method on the test data
predictions = rf.predict(testset_features)

In [65]:
predictions

array([1.20963236, 1.20963236, 0.70364704, ..., 6.45444127, 6.45444127,
       6.45444127])

### Step V: Dump predictions into text file for later use.

In [67]:
print('Predictions:\n', predictions)

# Write one prediction per line; the with-block closes the file automatically.
with open("answer.txt", "w") as file:
    for num in predictions:
        file.write(str(num))
        file.write("\n")

Predictions:
 [1.20963236 1.20963236 0.70364704 ... 6.45444127 6.45444127 6.45444127]
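NumPy can also write and read such a file in one call each, which is handy for checking that the saved values round-trip. This is a sketch with three values copied from the output above; the temporary directory and the `%.8f` format are assumptions, and the format rounds to eight decimal places, unlike `str(num)`.

```python
import os
import tempfile

import numpy as np

# Three values copied from the predictions output above.
sample_predictions = np.array([1.20963236, 1.20963236, 0.70364704])

# Write to a temporary directory so the sketch does not overwrite answer.txt.
path = os.path.join(tempfile.mkdtemp(), "answer.txt")
np.savetxt(path, sample_predictions, fmt="%.8f")  # one value per line

restored = np.loadtxt(path)
print(restored)
```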
