# Machine Learning Foundation for Product Managers

## Course Assignment

#### Objective: To build a machine learning model to predict the electricity output of a Combined Cycle Power Plant

## Exploratory Data Analysis

In [1]:
import pandas as pd

data = pd.read_csv('CCPP_data.csv')
print("Data shape: {}".format(data.shape))
print("Column names: {}".format(data.columns))
data.head(10)

Data shape: (9568, 5)
Column names: Index(['AT', 'V', 'AP', 'RH', 'PE'], dtype='object')


| | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.40 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.50 | 1009.23 | 96.62 | 473.90 |
| 5 | 26.27 | 59.44 | 1012.23 | 58.77 | 443.67 |
| 6 | 15.89 | 43.96 | 1014.02 | 75.24 | 467.35 |
| 7 | 9.48 | 44.71 | 1019.12 | 66.43 | 478.42 |
| 8 | 14.64 | 45.00 | 1021.78 | 41.25 | 475.98 |
| 9 | 11.74 | 43.56 | 1015.14 | 70.72 | 477.50 |


In [2]:
data.describe()

| | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| count | 9568.0 | 9568.0 | 9568.0 | 9568.0 | 9568.0 |
| mean | 19.651231 | 54.305804 | 1013.259078 | 73.308978 | 454.365009 |
| std | 7.452473 | 12.707893 | 5.938784 | 14.600269 | 17.066995 |
| min | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| 25% | 13.51 | 41.74 | 1009.1 | 63.3275 | 439.75 |
| 50% | 20.345 | 52.08 | 1012.94 | 74.975 | 451.55 |
| 75% | 25.72 | 66.54 | 1017.26 | 84.83 | 468.43 |
| max | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |


The dataset contains 9,568 records. Each record holds the hourly average readings from ambient sensors at the power plant. There are 5 columns in the dataset, labelled 'AT', 'V', 'AP', 'RH' and 'PE'.

- **AT**: Ambient Temperature, in the range of 1.81°C (min) to 37.11°C (max)
- **V**: Exhaust Vacuum, in the range of 25.36 cm Hg (min) to 81.56 cm Hg (max)
- **AP**: Ambient Pressure, in the range of 992.89 millibar (min) to 1033.30 millibar (max)
- **RH**: Relative Humidity, in the range of 25.56% (min) to 100.16% (max)
- **PE**: Net hourly electrical energy output, in the range of 420.26 MW (min) to 495.76 MW (max)

PE is the target variable that the machine learning model will predict.

We first confirm that the data set has no missing values.

In [3]:
print(data.isnull().any())

AT    False
V     False
AP    False
RH    False
PE    False
dtype: bool


We now prepare the data for model training and validation. The steps are:
1. Assign column 'PE' as the target variable
2. Drop column 'PE' from the data to form the input features
3. Split the data into a TRAIN set and a TEST set at an 80:20 ratio

In [4]:
from sklearn.model_selection import train_test_split

# Use column 'PE' as the target variable
Y = data['PE'].values

# Use the remaining four columns as the input features
X = data.drop('PE', axis=1).values

# Split the data into TRAIN and TEST sets at an 80:20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=42
)
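
To make the 80:20 split concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. It illustrates the idea only and is not scikit-learn's implementation; the function name `shuffled_split` and the ten-row toy data are assumptions made for this example, and the exact rows chosen will differ from those picked by `train_test_split`.

```python
import random


def shuffled_split(rows, test_fraction=0.20, seed=42):
    """Shuffle the row indices, then hold out a fraction of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Toy data (hypothetical): ten row identifiers instead of the real 9,568 records
rows = list(range(10))
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=42` in the cell above: rerunning the notebook yields the same TRAIN and TEST sets, so the scores are reproducible.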

## Modeling Approach

### Use of Linear Regression model

A conventional Linear Regression model is used as the first model. All four feature columns are used as inputs to predict the target variable 'PE'.

**R2** and **Mean Square Error** are used to evaluate the Linear Regression model on the TEST set.

In [5]:
from sklearn.linear_model import LinearRegression

LR_model = LinearRegression()
LR_model.fit(X_train, Y_train)
print("Linear Regression Model coefficients: {}".format(LR_model.coef_))

Linear Regression Model coefficients: [-1.98589969 -0.23209358  0.06219991 -0.15811779]
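
The printed coefficients are easier to read when paired with their column names. The sketch below assumes the coefficient order matches the column order left after dropping 'PE' (AT, V, AP, RH) and simply copies the printed values; `feature_names` and `effects` are names introduced for this example.

```python
# Column order after dropping 'PE' (assumed to match the order of coef_)
feature_names = ["AT", "V", "AP", "RH"]

# Coefficient values copied from the printed output above
coefficients = [-1.98589969, -0.23209358, 0.06219991, -0.15811779]

effects = dict(zip(feature_names, coefficients))

for name, coef in effects.items():
    direction = "decrease" if coef < 0 else "increase"
    print(f"{name}: +1 unit -> {abs(coef):.3f} MW {direction} in predicted PE")
```

Each value describes the change in predicted PE for a one-unit change in that input with the other inputs held fixed. Because the inputs are measured in different units, the magnitudes are not directly comparable as a ranking of importance.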


**R2 score**, also known as the coefficient of determination, measures how well the model reproduces the observed results. It is the proportion of the total variation in the target variable that the model explains: a score of 1 means perfect prediction, and a score of 0 means the model does no better than always predicting the mean.

**Mean Square Error** is the average of the squared differences between the predicted values and the true values. It is always non-negative, and values closer to zero are better.
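
Both metrics can be computed by hand, which may help make the definitions concrete. The sketch below uses four made-up PE values (not taken from the dataset) and hypothetical helper names `mean_square_error` and `r2`; scikit-learn's `mean_squared_error` and `r2_score` compute the same quantities.

```python
def mean_square_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only (in MW)
y_true = [460.0, 450.0, 440.0, 470.0]
y_pred = [458.0, 452.0, 441.0, 467.0]

mse_value = mean_square_error(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print("Mean Square Error: {0:.4f}".format(mse_value))
print("R2 Score: {0:.4f}".format(r2_value))
```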

In [6]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# Validate the model on the test set
predictions = LR_model.predict(X_test)

print("R2 Score: {0:.4f}".format(r2_score(Y_test, predictions)))
print("Mean Square Error: {0:.4f}".format(mean_squared_error(Y_test, predictions)))

R2 Score: 0.9301
Mean Square Error: 20.2737


### Use of Gradient Boosting Regression model

A Gradient Boosting Regression model is used as the second model for comparison. All four feature columns are used as inputs to predict the target variable 'PE'.

**R2** and **Mean Square Error** are used to evaluate the Gradient Boosting Regression model on the TEST set.

In [7]:
from sklearn.ensemble import GradientBoostingRegressor

GB_model = GradientBoostingRegressor(random_state=21, n_estimators=400)
GB_model.fit(X_train, Y_train)

GradientBoostingRegressor(n_estimators=400, random_state=21)

In [8]:
# Validate the model on the test set
predictions = GB_model.predict(X_test)

print("R2 Score: {0:.4f}".format(r2_score(Y_test, predictions)))
print("Mean Square Error: {0:.4f}".format(mean_squared_error(Y_test, predictions)))

R2 Score: 0.9606
Mean Square Error: 11.4333
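
Mean Square Error is expressed in squared units (MW²), so its square root is often easier to interpret as a typical prediction error in MW. The sketch below uses only the two Mean Square Error values reported above; the name `reported_mse` is introduced for this example.

```python
import math

# Mean Square Error values copied from the two test-set results above
reported_mse = {
    "Linear Regression": 20.2737,
    "Gradient Boosting": 11.4333,
}

# Root mean square error, in the same unit as PE (MW)
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.2f} MW")
```

On this test set the Gradient Boosting model's typical error is roughly 3.4 MW against roughly 4.5 MW for Linear Regression, on a target that ranges from 420.26 MW to 495.76 MW.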
