# HOW TO USE OPTICHILL

### IMPORTING THE NECESSARY MODULES TO RUN THE CODE

In [1]:
import pandas as pd
import numpy as np
import glob
import os
from sklearn import metrics
from sklearn.ensemble import GradientBoostingRegressor
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import seaborn as sns

# code to add to import from optichill folder
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from optichill import bas_filter
from optichill import GBM_model

### FILTERING OUT THE DATA

* Use `bas_filter.train_single_plt` to import the data and filter out redundant features and alarms, producing training and testing datasets that are ready to use.

Splitting the data from Plant 1 into training and testing data:

In [2]:
train_data = [
    'Plt1 m 2016-11.csv', 'Plt1 m 2016-12.csv', 'Plt1 m 2017-01.csv', 'Plt1 m 2017-02.csv',
    'Plt1 m 2017-03.csv'
]
test_data = [ 
    'Plt1 m 2017-04.csv', 'Plt1 m 2017-05.csv', 'Plt1 m 2017-06.csv', 'Plt1 m 2017-07.csv', 
    'Plt1 m 2017-08.csv', 'Plt1 m 2017-09.csv', 'Plt1 m 2017-10.csv', 'Plt1 m 2017-11.csv', 
    'Plt1 m 2017-12.csv', 'Plt1 m 2018-01.csv', 'Plt1 m 2018-02.csv', 'Plt1 m 2018-03.csv', 
    'Plt1 m 2018-04.csv' 
]

points_list = '../../Plt1/Plt1 Points List.xlsx'
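
The two file lists above cover consecutive months, so they can also be generated instead of typed by hand. The following is a minimal sketch, assuming the monthly files keep the `Plt1 m YYYY-MM.csv` naming pattern; `monthly_files` is a hypothetical helper and is not part of `optichill`.

```python
def monthly_files(start, end, prefix="Plt1 m "):
    """Return one CSV name per month from start to end, inclusive.

    start and end are (year, month) tuples. The naming pattern is
    assumed from the file lists shown in this notebook.
    """
    year, month = start
    names = []
    while (year, month) <= end:
        names.append(f"{prefix}{year}-{month:02d}.csv")
        month += 1
        if month > 12:
            # Roll over to January of the next year
            year, month = year + 1, 1
    return names


# Same split as above: 5 months for training, the next 13 for testing
train_data = monthly_files((2016, 11), (2017, 3))
test_data = monthly_files((2017, 4), (2018, 4))
```

Generating the names this way makes it easier to move the train/test boundary without retyping the lists.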


Two dataframes (training and testing) are obtained using the `train_single_plt` function:

In [3]:
df_train, df_test = bas_filter.train_single_plt(
    '../../Plt1', train_data, test_data, points_list, 
    include_alarms = True, dim_remove = []
) 

Filtering Training Set
['../../Plt1\\Plt1 m 2016-11.csv']
['../../Plt1\\Plt1 m 2016-12.csv']
['../../Plt1\\Plt1 m 2017-01.csv']
['../../Plt1\\Plt1 m 2017-02.csv']
['../../Plt1\\Plt1 m 2017-03.csv']
Descriptors in the points list that are not in the datasets.
CommunicationFailure_COV
CH3COM1F
CH3Ready
CH4COM1F
CH4Ready
CH4SURGE
CH5COM1F
CH5Ready
Original data contains 40396 points and 413 dimensions.
Filtered data contains 35940 points and 193 dimensions.
Filtering Test Set
['../../Plt1\\Plt1 m 2017-04.csv']
['../../Plt1\\Plt1 m 2017-05.csv']
['../../Plt1\\Plt1 m 2017-06.csv']
['../../Plt1\\Plt1 m 2017-07.csv']
['../../Plt1\\Plt1 m 2017-08.csv']
['../../Plt1\\Plt1 m 2017-09.csv']
['../../Plt1\\Plt1 m 2017-10.csv']
['../../Plt1\\Plt1 m 2017-11.csv']
['../../Plt1\\Plt1 m 2017-12.csv']
['../../Plt1\\Plt1 m 2018-01.csv']
['../../Plt1\\Plt1 m 2018-02.csv']
['../../Plt1\\Plt1 m 2018-03.csv']
['../../Plt1\\Plt1 m 2018-04.csv']
Descriptors in the points list that are not in the datasets.
Commun

Split the data into the target, kW/Ton, and all the other features. This is similar to splitting the data into "x and y" axes:

In [4]:
ytrain = df_train['kW/Ton']
ytest = df_test['kW/Ton']
xtrain = df_train.drop(['kW/Ton'], axis=1)
xtest = df_test.drop(['kW/Ton'], axis=1)

### USING GBM (GRADIENT BOOSTING MACHINES) FOR DETERMINING FEATURE IMPORTANCE AND PREDICTING EFFICIENCY 

* Train the model by passing the "x and y" datasets to the `GBM_model.train_model` function. The R<sup>2</sup> is printed below:

In [6]:
GBM_model.train_model(xtrain, ytrain, xtest, ytest)

-1.8853841244359621
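
The score printed above is negative. Under the usual definition, R<sup>2</sup> = 1 - SS_res / SS_tot, this happens whenever the predictions fit the evaluated data worse than always predicting the mean of the observed values. The sketch below illustrates the definition in plain Python with toy numbers (not plant data). Whether `train_model` computes its score in exactly this way is an assumption, and `r2` is a hypothetical helper.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of a constant prediction at the mean
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]  # toy values for illustration only

perfect = r2(y_true, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r2(y_true, [2.0, 2.0, 2.0])  # always predict the mean
reversed_ = r2(y_true, [3.0, 2.0, 1.0])  # trend is backwards

print(perfect, mean_only, reversed_)
```

A perfect fit scores 1, the mean-only baseline scores 0, and anything worse than that baseline goes below 0.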

In [7]:
GBM_model.predict_model()

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.01, loss='ls', max_depth=6, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=500, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)

Save the feature importance list to a .csv file:

In [8]:
GBM_model.feature_importance_list('Plt1.csv', xtest)

The feature importance list was created as Plt1.csv
