# Intro

Regression models for a Combined Cycle Power Plant (CCPP). This notebook builds regression models such as polynomial regression, SVR, decision tree regression, and random forest regression to predict the plant's net hourly electrical energy output (EP, stored in the `PE` column). The dataset comes from the UCI Machine Learning Repository, available <a href="https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant">here</a>.

# Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
dataset = pd.read_csv("./dataset/CCPP.csv")

In [3]:
dataset.head()

,AT,V,AP,RH,PE
0,14.96,41.76,1024.07,73.17,463.26
1,25.18,62.96,1020.04,59.08,444.37
2,5.11,39.4,1012.16,92.14,488.56
3,20.86,57.32,1010.24,76.64,446.48
4,10.82,37.5,1009.23,96.62,473.9


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AT      9568 non-null   float64
 1   V       9568 non-null   float64
 2   AP      9568 non-null   float64
 3   RH      9568 non-null   float64
 4   PE      9568 non-null   float64
dtypes: float64(5)
memory usage: 373.9 KB


In [5]:
dataset.describe()

,AT,V,AP,RH,PE
count,9568.0,9568.0,9568.0,9568.0,9568.0
mean,19.651231,54.305804,1013.259078,73.308978,454.365009
std,7.452473,12.707893,5.938784,14.600269,17.066995
min,1.81,25.36,992.89,25.56,420.26
25%,13.51,41.74,1009.1,63.3275,439.75
50%,20.345,52.08,1012.94,74.975,451.55
75%,25.72,66.54,1017.26,84.83,468.43
max,37.11,81.56,1033.3,100.16,495.76


In [6]:
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [7]:
x

array([[  14.96,   41.76, 1024.07,   73.17],
       [  25.18,   62.96, 1020.04,   59.08],
       [   5.11,   39.4 , 1012.16,   92.14],
       ...,
       [  31.32,   74.33, 1012.92,   36.48],
       [  24.48,   69.45, 1013.86,   62.39],
       [  21.6 ,   62.52, 1017.23,   67.87]])

In [8]:
y

array([463.26, 444.37, 488.56, ..., 429.57, 435.74, 453.28])

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
# split the dataset
# 80% for training and 20% for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
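
As a quick check on the 80/20 split, the sketch below works out how many rows land in each set for the 9568-row dataset. It assumes that a fractional `test_size` is rounded up to a whole number of test samples and that the training set receives the remaining rows. The helper `split_sizes` is a hypothetical name for illustration, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    gets whatever rows are left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 9568 is the row count reported by dataset.info()
n_train, n_test = split_sizes(9568, test_size=0.2)
print(n_train, n_test)
```

Under that assumption the split gives 7654 training rows and 1914 test rows; comparing against `x_train.shape` and `x_test.shape` in the notebook would confirm it.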

# Model

In [12]:
from sklearn.ensemble import RandomForestRegressor

In [13]:
regressor = RandomForestRegressor(n_estimators=10, random_state=0)

In [14]:
regressor.fit(x_train, y_train)

RandomForestRegressor(n_estimators=10, random_state=0)

## Prediction

In [15]:
y_pred = regressor.predict(x_test)

In [16]:
y_pred

array([434.049, 458.785, 463.02 , ..., 469.479, 439.566, 460.385])

# Evaluating the model $(R^2)$

## Normal $R^2$

In [17]:
from sklearn.metrics import r2_score

In [18]:
r2 = r2_score(y_test, y_pred)
r2

0.9615908334363876
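
To make the score more concrete, $R^2$ is defined as $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The sketch below computes this in plain Python on four made-up values (illustrative only, not taken from the CCPP data). The function name `r2_manual` is hypothetical; on the same inputs it should agree with `r2_score`.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared residuals between true and predicted values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the true values
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up demo values, not CCPP measurements
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

r2_demo = r2_manual(y_true_demo, y_pred_demo)
print(round(r2_demo, 4))
```

A value close to 1 means the predictions explain most of the variance in the target, which is how the 0.96 above can be read.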

## Adjusted $R^2$

In [19]:
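n = len(x_test)       # number of test samples
p = x_test.shape[1]   # number of predictors (features)

# adjusted R^2: penalizes R^2 for the number of predictors used
1 - (1 - r2) * (n - 1) / (n - p - 1)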

0.9615103532550076
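
The cell above evaluates $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, with $n$ the number of test samples and $p$ the number of predictors. A minimal sketch of the same formula as a reusable function follows. The name `adjusted_r2` is hypothetical, and the value `n_samples=1914` is an assumption (20% of 9568 rows, rounded up) rather than something printed in the notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# r2 is the score reported above; n_samples=1914 is an assumed test-set size,
# n_features=4 corresponds to the AT, V, AP, RH columns
adj = adjusted_r2(0.9615908334363876, n_samples=1914, n_features=4)
print(adj)
```

Because $n$ is much larger than $p$ here, the adjustment is tiny; the gap between the two scores grows as more predictors are added relative to the number of samples.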