Coursera: Introduction to machine learning for project management

In this project we will build a model to predict the electrical energy output of a Combined Cycle Power Plant, which uses a combination of gas turbines, steam turbines, and heat recovery steam generators to generate power. We have a set of 9568 hourly average ambient environmental readings from sensors at the power plant which we will use in our model.

The columns in the data consist of hourly average ambient variables: 
- Temperature (T; column AT in the data) in the range 1.81°C to 37.11°C,
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar,
- Relative Humidity (RH) in the range 25.56% to 100.16%,
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg,
- Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW (the target we are trying to predict).


1. For the problem described in the Project Topic section above, determine what type of machine learning approach is needed and select an appropriate output metric to evaluate performance in accomplishing the task.

-> This is a regression problem, since we need to predict a numerical value, so we will start with linear regression. We will use MAPE (mean absolute percentage error) as the output metric because it is easy to interpret.
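
As a minimal sketch of what that metric measures, the function below computes MAPE by hand. The helper name `mape` and the predicted values are hypothetical and only for illustration; the three actual values are the first PE readings shown later in the data.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.01 means 1%)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Each term is |actual - predicted| / |actual|, so actual values must be non-zero.
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# First three PE readings from the data, paired with made-up predictions.
actual = [463.26, 444.37, 488.56]
predicted = [460.00, 450.00, 485.00]  # hypothetical values, not model output
print(f"MAPE: {mape(actual, predicted):.4%}")
```

Because PE stays between 420.26 and 495.76 MW, the denominator is never close to zero, which is the main situation in which MAPE becomes unreliable.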


In [2]:
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()

2. Determine which possible features we may want to use in the model, and identify the different algorithms we might consider.

-> The output power of a gas turbine depends mainly on atmospheric parameters such as relative humidity, atmospheric pressure, and atmospheric temperature. The output power of a steam turbine correlates directly with exhaust vacuum. (https://www.sciencedirect.com/science/article/pii/S0306261904000601)

-> So for this case we will consider all four variables.

-> We can use linear regression as a parametric algorithm and decision tree regression as a non-parametric algorithm.

3. Split your data to create a test set to evaluate the performance of your final model. Then, using your training set, determine a validation strategy for comparing different models - a fixed validation set or cross-validation.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('CCPP_data.csv')
X = data.iloc[:, :4]  # first 4 columns: AT, V, AP, RH
Y = data[['PE']]  # target column PE, kept as a one-column DataFrame
type(X)
data.head(3)


Unnamed: 0,AT,V,AP,RH,PE
0,14.96,41.76,1024.07,73.17,463.26
1,25.18,62.96,1020.04,59.08,444.37
2,5.11,39.4,1012.16,92.14,488.56


In [87]:
X.head(3)



Unnamed: 0,AT,V,AP,RH
0,14.96,41.76,1024.07,73.17
1,25.18,62.96,1020.04,59.08
2,5.11,39.4,1012.16,92.14


In [89]:
Y.head(3)

Unnamed: 0,PE
0,463.26
1,444.37
2,488.56


In [79]:
# Split the data, holding out 30% as the test set
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
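
The split above is random, so the rows that land in the test set change on every run. A possible way to make it repeatable is to pass `random_state`, sketched here on placeholder arrays with the same number of rows as the data set. The names `X_demo` and `y_demo` and the seed value 42 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 9568  # number of readings stated in the introduction
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder feature: the row number
y_demo = np.zeros(n_rows)                  # placeholder target

# A fixed random_state makes the shuffle, and therefore the split, repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("same test rows on both runs:", np.array_equal(X_te, X_te2))
```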

In [81]:
regressor.fit(x_train,y_train)
regressor.coef_
#Model Coefficients
#y_pred = regressor.predict(x_test)


array([[-1.98200564, -0.23102433,  0.05284114, -0.16153373]])

In [119]:
from sklearn.model_selection import KFold, cross_val_score

k_folds = KFold(n_splits = 5)

scores = cross_val_score(regressor, X, Y, cv = k_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Cross Validation Scores:  [0.93053597 0.92681472 0.93389127 0.92680208 0.92464499]
Average CV Score:  0.9285378066739307
Number of CV Scores used in Average:  5
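
The scores above are R² values, the default score for scikit-learn regressors, and they are computed on the full data set, including the rows held out for testing. The sketch below shows one way to cross-validate on the training portion only and to report MAPE instead. It runs on synthetic stand-in data, so `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions, not the plant data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the plant data: 4 features and a target near 450.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
true_coef = np.array([-2.0, -0.2, 0.05, -0.15])  # made-up coefficients
y_demo = 450 + X_demo @ true_coef + rng.normal(scale=0.5, size=500)

# Hold out the test set first, then cross-validate on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

folds = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mape = cross_val_score(
    LinearRegression(), X_tr, y_tr, cv=folds,
    scoring="neg_mean_absolute_percentage_error",
)
cv_mape = -neg_mape  # scikit-learn negates error metrics so that higher is better

print("MAPE per fold:", cv_mape)
print("Mean CV MAPE:", cv_mape.mean())
```

With this arrangement the test set plays no part in model selection, so it remains an unbiased check for the final model.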


4. Use your validation approach to compare at least two different models.
   
Now we will test the regression tree to see whether it performs better.

In [165]:
from sklearn.tree import DecisionTreeRegressor
Treeregressor = DecisionTreeRegressor(random_state=0)
Treeregressor.fit(x_train, y_train)
print(Treeregressor.get_depth())


27


In [163]:
tree_scores = cross_val_score(Treeregressor, X, Y, cv = k_folds)

print("Cross Validation Scores: ", tree_scores)
print("Average CV Score: ", tree_scores.mean())
print("Number of CV Scores used in Average: ", len(tree_scores))

Cross Validation Scores:  [0.92232798 0.92398902 0.93539541 0.93383827 0.92758009]
Average CV Score:  0.9286261544521682
Number of CV Scores used in Average:  5


5. To conclude, both models perform almost identically. The decision tree has an average cross-validation R² of 0.92863 versus 0.92854 for linear regression, a difference of about 0.0001 (0.01 percentage points). Now we evaluate the performance of our final model, the decision tree, on the test set.

In [141]:
predictions = Treeregressor.predict(x_test)
predictions

array([460.14, 432.83, 486.91, ..., 456.54, 482.3 , 439.61])

In [169]:
y_test.head(3)

Unnamed: 0,PE
8291,457.98
5717,434.94
6710,479.16


In [145]:
Treeregressor.score(x_test,y_test)

0.9207734263571175
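
The value returned by `score` is R², not the MAPE chosen in step 1. A minimal sketch of reporting both metrics on a held-out test set follows. It uses synthetic stand-in data (`X_demo`, `y_demo`, and the generating formula are assumptions), so the printed numbers are not results for the power plant data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 4 features and a target near 450.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(600, 4))
y_demo = (450 + 20 * X_demo[:, 0] - 10 * X_demo[:, 1]
          + rng.normal(scale=1.0, size=600))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# MAPE is returned as a fraction, so 0.01 corresponds to 1%.
test_mape = mean_absolute_percentage_error(y_te, tree.predict(X_te))
test_r2 = tree.score(X_te, y_te)  # R^2, the default regressor score

print(f"Test MAPE: {test_mape:.4%}")
print(f"Test R^2:  {test_r2:.4f}")
```

On the real data, the same call would take `y_test` and `Treeregressor.predict(x_test)` as its two arguments.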

In [149]:
# Impurity (MSE) of every node in the fitted tree, across all levels (depth 27)
extracted_MSEs = Treeregressor.tree_.impurity

for idx, MSE in enumerate(extracted_MSEs):
    print("Node {} has MSE {}".format(idx, MSE))

Node 0 has MSE 288.5651287169312
Node 1 has MSE 87.33070096030133
Node 2 has MSE 42.58895413079881
Node 3 has MSE 33.12731455534231
Node 4 has MSE 27.160101551242406
Node 5 has MSE 30.274553849827498
Node 6 has MSE 28.606759113114094
Node 7 has MSE 7.286188020429108
Node 8 has MSE 8.583622222242411
Node 9 has MSE 2.5000044843181968e-05
Node 10 has MSE 0.0
Node 11 has MSE 5.820766091346741e-11
Node 12 has MSE -2.9103830456733704e-11
Node 13 has MSE 5.722882439411478
Node 14 has MSE 4.225467729615048
Node 15 has MSE 9.986768749979092
Node 16 has MSE 1.6128999999782536
Node 17 has MSE 0.0
Node 18 has MSE 0.0
Node 19 has MSE 0.39062499997089617
Node 20 has MSE 0.0
Node 21 has MSE 0.0
Node 22 has MSE 2.3302409722236916
Node 23 has MSE 2.158837500057416
Node 24 has MSE 1.677606222248869
Node 25 has MSE 1.2930066326225642
Node 26 has MSE 0.9311984374653548
Node 27 has MSE 0.0
Node 28 has MSE 0.46471428574295714
Node 29 has MSE 0.1856959999713581
Node 30 has MSE 0.0
Node 31 has MSE 0.131300000