<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Case-Study:" data-toc-modified-id="Case-Study:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Case Study:</a></span><ul class="toc-item"><li><span><a href="#Data-Set-Reference:" data-toc-modified-id="Data-Set-Reference:-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data Set Reference:</a></span></li></ul></li></ul></div>

# Case Study:

**Data**:
A Combined Cycle Power Plant (CCPP) is composed of gas turbines (GT), steam turbines (ST), and heat recovery steam generators. In a CCPP, electricity is generated by gas and steam turbines that are combined in one cycle, with heat transferred from one turbine to the other. The ambient variables (temperature, pressure, and humidity) affect GT performance. Exhaust vacuum is collected from the steam turbine and affects ST performance.

The data set was collected from a CCPP over six years (2006-2011), while the plant was set to work at full load. The following input variables (hourly averages) were collected:
1. Ambient Temperature (AT), 
2. Ambient Pressure (AP), 
3. Relative Humidity (RH), and 
4. Exhaust Vacuum (EV) 

The aim is to predict the net hourly electrical energy output (EP) of the plant. The data is given in the Regression-5.csv file.

**Hypothesis**:  
Our underlying hypothesis is that the input variables are _linearly_ related to the output variable.  
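
Written out, the hypothesis corresponds to a model of roughly the following form. The symbols $\beta_0,\dots,\beta_4$ (coefficients) and $\varepsilon$ (noise term) are notation assumed here for illustration; the variable order follows the columns of the data file.

$$
\mathrm{EP} = \beta_0 + \beta_1\,\mathrm{AT} + \beta_2\,\mathrm{EV} + \beta_3\,\mathrm{AP} + \beta_4\,\mathrm{RH} + \varepsilon
$$

When all variables are standardized to zero mean, as is done later in this case study, the intercept $\beta_0$ can be dropped.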

**Objective**:  
The objective of this case study is to identify the input variables' relationship with the output variable. Specifically, conduct a regression analysis, and estimate the best coefficients that capture the underlying relationship.
_Note: It is not necessary that all the input variables are related to the output variable._
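
For orientation, the three estimators compared in this case study (OLS, Ridge, and Lasso) minimize the objectives below, as documented for scikit-learn's `LinearRegression`, `Ridge`, and `Lasso`. Here $X$ is the input matrix, $y$ the output vector, $w$ the coefficient vector, $n$ the number of samples, and $\alpha$ the penalty coefficient; this notation is assumed here and does not appear elsewhere in the notebook.

$$
\begin{aligned}
\text{OLS:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 \\
\text{Ridge:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\ \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
$$

The $\ell_1$ penalty in Lasso can shrink some coefficients exactly to zero, which is why it is a natural check for input variables that are unrelated to the output.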

In [1]:
# Reading & describing the data 
import numpy as np
import pandas as pd
df = pd.read_csv('data/Regression-5.csv', delimiter = ',')
df.describe()

| | AT | EV | AP | RH | EP |
|---|---|---|---|---|---|
| count | 9568.0 | 9568.0 | 9568.0 | 9568.0 | 9568.0 |
| mean | 19.651231 | 54.305804 | 1013.259078 | 73.308978 | 454.365009 |
| std | 7.452473 | 12.707893 | 5.938784 | 14.600269 | 17.066995 |
| min | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| 25% | 13.51 | 41.74 | 1009.1 | 63.3275 | 439.75 |
| 50% | 20.345 | 52.08 | 1012.94 | 74.975 | 451.55 |
| 75% | 25.72 | 66.54 | 1017.26 | 84.83 | 468.43 |
| max | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |


In [2]:
# Generate Train - Test splits
from sklearn.model_selection import train_test_split
X = df.iloc[:,:-1].values
y = df.iloc[:, -1].values 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22)

In [3]:
# Scaling the Train - Test splits
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(np.c_[X_train,y_train])

A_train = scaler.transform(np.c_[X_train,y_train])
X_train = A_train[:,:-1]
y_train = A_train[:,-1]

A_test = scaler.transform(np.c_[X_test,y_test])
X_test = A_test[:,:-1]
y_test = A_test[:,-1]
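
The scaling step can be illustrated with plain NumPy. The sketch below uses synthetic data (not the plant data) and hypothetical variable names; it only shows the idea that the mean and standard deviation are estimated on the training split and then reused unchanged on the test split, which is what fitting `StandardScaler` on the training data and calling `transform` on both splits does.

```python
import numpy as np

# Synthetic stand-in for a feature matrix (hypothetical data, not the plant data)
rng = np.random.default_rng(0)
X = rng.normal(loc=20.0, scale=7.0, size=(1000, 4))
X_tr, X_te = X[:700], X[700:]

# Statistics come from the training split only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# The same training statistics are applied to both splits
Z_tr = (X_tr - mu) / sigma
Z_te = (X_te - mu) / sigma

print("train mean:", Z_tr.mean(axis=0).round(6))
print("train std: ", Z_tr.std(axis=0).round(6))
print("test mean: ", Z_te.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because no test-set statistics are used.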

In [4]:
# Regression Analysis: Mean Squared Error Metric
from sklearn.metrics import mean_squared_error

## OLS
from sklearn.linear_model import LinearRegression
reg1 = LinearRegression(fit_intercept=False).fit(X_train, y_train)
y_pred1 = reg1.predict(X_test)
print('The MSE using OLS is:', mean_squared_error(y_test, y_pred1))


## Ridge
from sklearn.linear_model import RidgeCV
reg2 = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], fit_intercept=False,cv=10).fit(X_train, y_train)
y_pred2 = reg2.predict(X_test)
print('The MSE using Ridge is:', mean_squared_error(y_test, y_pred2))


## Lasso
from sklearn.linear_model import LassoCV
reg3 = LassoCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3],fit_intercept=False,cv=10, random_state=0).fit(X_train, y_train)
y_pred3 = reg3.predict(X_test)
print('The MSE using Lasso is:', mean_squared_error(y_test, y_pred3))

The MSE using OLS is: 0.07091976100937038
The MSE using Ridge is: 0.07091950720057445
The MSE using Lasso is: 0.07092372057034313
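
Because the output variable was standardized, these MSE values are in standardized units. A rough conversion back to the original units of EP multiplies the MSE by the squared standard deviation of EP. The sketch below assumes the full-data standard deviation from `df.describe()` as a stand-in for the training-set value held by the scaler, so the result is approximate.

```python
import math

# Ridge test MSE reported above (standardized units)
mse_std = 0.07091950720057445

# Std of EP from df.describe(); an assumption standing in for the
# training-set std that the fitted scaler actually used
ep_std = 17.066995

mse_orig = mse_std * ep_std ** 2   # MSE in squared original units of EP
rmse_orig = math.sqrt(mse_orig)    # typical error in original units of EP

print(f"Approximate RMSE in original units of EP: {rmse_orig:.2f}")
```

Since the standardized output has variance close to 1, an MSE of about 0.07 also suggests that roughly 7% of the variance in EP is left unexplained by the linear fit, again only as an approximation.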


In [5]:
## Details of the best estimates
print('The best penalty coefficient is:', reg2.alpha_)
print('The best coefficient estimates are:', reg2.coef_.tolist())

The best penalty coefficient is: 0.1
The best coefficient estimates are: [-0.8681247972122161, -0.1718378232119017, 0.019518121580732375, -0.13712536395312042]
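
To read these estimates, it helps to attach them to the variable names and rank them by magnitude. The sketch below assumes the coefficients follow the column order of the data file (AT, EV, AP, RH), since `X` was taken as all columns except the last. It also rescales each coefficient to original units using the standard deviations from `df.describe()`, which are full-data values used here as an approximation to the training-set values.

```python
# Assumed order: same as the input columns of the data file
names = ["AT", "EV", "AP", "RH"]
coef_std = [-0.8681247972122161, -0.1718378232119017,
            0.019518121580732375, -0.13712536395312042]

# Standard deviations from df.describe() (approximation to the scaler's values)
x_std = {"AT": 7.452473, "EV": 12.707893, "AP": 5.938784, "RH": 14.600269}
y_std = 17.066995

# Rank inputs by the magnitude of their standardized coefficient
ranked = sorted(zip(names, coef_std), key=lambda p: abs(p[1]), reverse=True)

# Change in EP per one original unit of each input
coef_orig = {n: b * y_std / x_std[n] for n, b in zip(names, coef_std)}

for n, b in ranked:
    print(f"{n}: standardized {b:+.4f}, per original unit {coef_orig[n]:+.4f}")
```

Under these assumptions, AT has by far the largest standardized coefficient and AP the smallest.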


Summary of the Case Study-2:
- There were 9568 samples/observations in total.
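- Lasso does not perform better than OLS or Ridge on this data: the three test MSEs agree to four decimal places (about 0.0709 in standardized units).
- Thus, all input variables seem to be relevant, although the standardized Ridge coefficient for AP (about 0.02) is much smaller in magnitude than the one for AT (about -0.87).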

## Data Set Reference:
6. Regression-5: _Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615._