Coursera: Introduction to machine learning for project management

In this project we will build a model to predict the electrical energy output of a Combined Cycle Power Plant, which uses a combination of gas turbines, steam turbines, and heat recovery steam generators to generate power. We have a set of 9568 hourly average ambient environmental readings from sensors at the power plant which we will use in our model.

The columns in the data consist of hourly average ambient variables:
- Temperature (T, column AT in the data) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89-1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg
- Net hourly electrical energy output (PE) in the range 420.26-495.76 MW (the target we are trying to predict)


1. For the problem described in the Project Topic section above, determine what type of machine learning approach is needed and select an appropriate output metric to evaluate performance in accomplishing the task.

-> This is a supervised regression problem, because we need to predict a numerical value, so we will start with linear regression. We will use mean absolute percentage error (MAPE) as the output metric because it is easy to interpret.
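As a reference for the metric chosen above, MAPE averages the absolute error relative to the true value. The sketch below is a plain-Python illustration; the function name mape and the sample values are hypothetical and not part of the project code.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Each term is |actual - predicted| / |actual|; actual values must be non-zero.
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Hypothetical outputs in MW, chosen only to illustrate the calculation.
actual = [400.0, 500.0]
predicted = [420.0, 475.0]
print(mape(actual, predicted))
```

Because PE ranges from 420.26 to 495.76 MW, the denominator never approaches zero, which is the situation where MAPE becomes unstable.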


In [1]:
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()

2. Determine which possible features we may want to use in the model, and identify the different algorithms we might consider.

-> The output power of a gas turbine depends mainly on atmospheric parameters such as relative humidity, atmospheric pressure, and atmospheric temperature. The output power of a steam turbine is directly correlated with exhaust vacuum. (https://www.sciencedirect.com/science/article/pii/S0306261904000601)

-> So for this case we will consider all four variables.

-> We can use linear regression as a parametric algorithm and decision tree regression as a non-parametric algorithm.

3. Split your data to create a test set to evaluate the performance of your final model. Then, using your training set, determine a validation strategy for comparing different models - a fixed validation set or cross-validation.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('CCPP_data.csv')
X=data.iloc[:,:4] # first 4 columns: AT, V, AP, RH
Y=data.iloc[:,-2:-1] # second-to-last column (RH); the last column, PE, would be data.iloc[:,-1:]
type(X)
data.head(3)


      AT      V       AP     RH      PE
0  14.96  41.76  1024.07  73.17  463.26
1  25.18  62.96  1020.04  59.08  444.37
2   5.11   39.4  1012.16  92.14  488.56


In [3]:
X.head(3)



      AT      V       AP     RH
0  14.96  41.76  1024.07  73.17
1  25.18  62.96  1020.04  59.08
2   5.11   39.4  1012.16  92.14


In [4]:
Y.head(3)

      RH
0  73.17
1  59.08
2  92.14
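The Y.head(3) output above shows RH rather than PE. Positional slices in iloc follow Python's half-open slicing rules, so the effect can be checked on a plain list of the column names. This is a minimal sketch; the list columns simply mirrors the header printed by data.head(3), and the variable names are illustrative.

```python
# Column order as printed by data.head(3).
columns = ["AT", "V", "AP", "RH", "PE"]

features = columns[:4]           # first four columns
second_to_last = columns[-2:-1]  # stops before the last column
last = columns[-1:]              # the last column only

print(features)        # ['AT', 'V', 'AP', 'RH']
print(second_to_last)  # ['RH']
print(last)            # ['PE']
```

Under that reading, data.iloc[:, -1:] (or data[['PE']]) would select the PE target, while data.iloc[:, -2:-1] selects RH, which is already one of the features.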


In [5]:
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size = 0.3) # split the data
# 30% of the data is held out as the test set
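The split above draws a different random 30% on every run because no seed is set. A minimal plain-Python sketch of a reproducible 70/30 split follows; split_indices and the seed value are hypothetical, and in scikit-learn the same effect is usually obtained by passing random_state to train_test_split.

```python
import random


def split_indices(n_rows, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) after a seeded shuffle of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 7 3
```

Fixing the seed keeps the test rows the same between runs, so scores from different models can be compared on identical data.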

In [6]:
regressor.fit(x_train,y_train)
regressor.coef_
#Model Coefficients
#y_pred = regressor.predict(x_test)


array([[8.30382978e-16, 3.26863259e-16, 5.64628776e-16, 1.00000000e+00]])

In [7]:
from sklearn.model_selection import KFold, cross_val_score

k_folds = KFold(n_splits = 5)

scores = cross_val_score(regressor, X, Y, cv = k_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Cross Validation Scores:  [1. 1. 1. 1. 1.]
Average CV Score:  1.0
Number of CV Scores used in Average:  5
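To make the validation strategy concrete, the sketch below reproduces in plain Python how an unshuffled 5-fold split partitions the rows into consecutive blocks, each serving once as the validation set. The helper kfold_slices is hypothetical; it is meant to mirror KFold(n_splits=5) without shuffling, under the assumption that the first n % k folds receive one extra row.

```python
def kfold_slices(n_rows, n_splits=5):
    """Return (start, stop) bounds of each consecutive validation fold."""
    base, extra = divmod(n_rows, n_splits)
    bounds, start = [], 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        bounds.append((start, start + size))
        start += size
    return bounds


folds = kfold_slices(9568)  # 9568 rows, as in the data set
for start, stop in folds:
    print(start, stop, stop - start)
```

The score that cross_val_score reports for each fold is the estimator's default score, which for scikit-learn regressors is R².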


4. Use your validation approach to compare at least two different models.
   
Next we test a decision tree regressor to see whether it performs better.

In [8]:
from sklearn.tree import DecisionTreeRegressor
Treeregressor = DecisionTreeRegressor(random_state=0)
Treeregressor.fit(x_train, y_train)
print(Treeregressor.get_depth())


16


In [9]:
tree_scores = cross_val_score(Treeregressor, X, Y, cv = k_folds)

print("Cross Validation Scores: ", tree_scores)
print("Average CV Score: ", tree_scores.mean())
print("Number of CV Scores used in Average: ", len(tree_scores))

Cross Validation Scores:  [0.99999766 0.99999664 0.99999806 0.99999713 0.9999938 ]
Average CV Score:  0.9999966585159201
Number of CV Scores used in Average:  5


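5. To conclude, both models perform almost exactly the same: linear regression averages a cross-validation score (R²) of 1.0 and the decision tree 0.9999967, a difference of about 0.0003 percentage points. Scores this close to perfect are expected here because the target Y holds RH, which is also one of the input features, as the Y.head(3) output and the fitted coefficients (approximately 0, 0, 0, 1) show. Now we evaluate the performance of the decision tree on the test set; the score method used below reports R² rather than the MAPE chosen earlier.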

In [10]:
predictions = Treeregressor.predict(x_test)
predictions

array([84.79, 84.29, 53.69, ..., 99.28, 77.45, 52.95])

In [11]:
y_test.head(3)

         RH
2558  84.79
1856  84.29
2760  53.69


In [12]:
Treeregressor.score(x_test,y_test)

0.9999966335325362
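The value returned by score above is the coefficient of determination R². As a reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The function r2 and the sample numbers below are hypothetical and only illustrate the formula.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
print(r2(actual, actual))                # perfect predictions give 1.0
print(r2(actual, [2.5, 2.5, 2.5, 2.5]))  # predicting the mean gives 0.0
```

A score of 0.9999966 therefore means the residual error is a tiny fraction of the variance of the test targets.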

In [13]:
# Impurity (MSE) of every node in the fitted tree
extracted_MSEs = Treeregressor.tree_.impurity

for idx, MSE in enumerate(extracted_MSEs):
    print("Node {} has MSE {}".format(idx,MSE))

Node 0 has MSE 214.99210990675692
Node 1 has MSE 84.01170591445634
Node 2 has MSE 38.26158814811197
Node 3 has MSE 17.39483796836953
Node 4 has MSE 10.39921252419822
Node 5 has MSE 6.796657856400088
Node 6 has MSE 3.7837538461538998
Node 7 has MSE 1.2629839999998467
Node 8 has MSE 0.10562499999991815
Node 9 has MSE 0.0
Node 10 has MSE -1.1368683772161603e-13
Node 11 has MSE 0.10148888888863894
Node 12 has MSE 0.0
Node 13 has MSE 0.034225000000105865
Node 14 has MSE 0.0
Node 15 has MSE 1.1368683772161603e-13
Node 16 has MSE 0.2450187500003267
Node 17 has MSE 0.09175555555543724
Node 18 has MSE 0.0
Node 19 has MSE 0.015624999999772626
Node 20 has MSE 0.0
Node 21 has MSE 0.0
Node 22 has MSE 0.04745600000012473
Node 23 has MSE 0.006400000000098771
Node 24 has MSE 0.0
Node 25 has MSE 1.1368683772161603e-13
Node 26 has MSE 0.015022222222341952
Node 27 has MSE 0.004900000000475302
Node 28 has MSE 0.0
Node 29 has MSE 0.0
Node 30 has MSE -2.2737367544323206e-13
Node 31 has MSE 0.612291666667488
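With the default squared-error criterion, the impurity of a node is the variance of the target values that reach it, which is why pure nodes report 0.0. The sketch below computes that variance two ways; the function names are hypothetical, and the second form (mean of squares minus squared mean) is assumed here to resemble what the library evaluates, which would explain why rounding can yield a tiny negative number such as the -1.1368683772161603e-13 seen above.

```python
def node_mse(values):
    """Mean squared deviation of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def node_mse_shortcut(values):
    """Same quantity as mean(v^2) - mean(v)^2; sensitive to rounding."""
    mean = sum(values) / len(values)
    return sum(v * v for v in values) / len(values) - mean * mean


node = [73.17, 59.08, 92.14]  # the three target values shown by Y.head(3)
print(node_mse(node))
print(node_mse_shortcut(node))
print(node_mse([84.79, 84.79]))  # identical targets: a pure node, 0.0
```

Values on the order of 1e-13 in the listing can be read as zero for practical purposes.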