# Modeling d0eop_blancura
Using the features of stage d0eop, predict the brightness (blancura) at the D1 inlet, i.e. the output of stage EOP (target).

-------
**INFO MODEL**
- **stage**: d0eop
- **target**: brightness at the D1 inlet (blancura entrada d1)

## Root folder and read env variables

In [1]:
import os
# set the working directory to the repository root (parent of the notebook folder) so outputs are saved there
actual_path = os.path.abspath(os.getcwd())
root_path = os.path.dirname(actual_path)  # portable: does not assume '\\' as the path separator
os.chdir(root_path)
print('root path: ', root_path)

root path:  D:\github-mi-repo\Optimization-Industrial-Process-Advanced


In [2]:
import os
from dotenv import load_dotenv, find_dotenv # package used in jupyter notebook to read the variables in file .env

# load the variables defined in the .env file into the environment
load_dotenv(find_dotenv())

# read the env variables and save them as python variables
PROJECT_GCP = os.environ.get("PROJECT_GCP", "")

## RUN TRAINING

In [3]:
import pandas as pd
import numpy as np
from google.cloud import bigquery
import gcsfs
import pickle

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline

# transform
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer

# models
from sklearn.linear_model import LinearRegression # lr
from sklearn.linear_model import Ridge # ridge
from sklearn.linear_model import Lasso # lasso
from sklearn.tree import DecisionTreeRegressor # tree
from sklearn.ensemble import GradientBoostingRegressor #gb
from sklearn.ensemble import RandomForestRegressor #rf
#from xgboost import XGBRegressor # xgb
from sklearn.neural_network import MLPRegressor # mlp

In [4]:
# ### development

# PROJECT_ID = PROJECT_GCP
# ! gcloud config set project $PROJECT_ID

### 1. Read data

In [5]:
path_data = 'artifacts/data/data.pkl'
data = pd.read_pickle(path_data)
data.head()

| datetime | 230AIT446.PNT | 240AIC022.MEAS | 240AIC126.MEAS | 240AIC224.MEAS | 240AIC286.MEAS | 240AIC324.MEAS | 240AIC433.MEAS | 240AIT063A.PNT | 240AIT063B.PNT | 240AIT225A.PNT | ... | S240ALDP022 | S240ALDP031 | S240ALDP032 | S276PER002 | S2MAQUINAT07 | S76ALE017 | SSTRIPPING015 | calc_prod_d0 | calc_prod_d1 | calc_prod_p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-01-01 00:05:00 | 11.55504 | 2.983948 | 11.346645 | 4.413519 | 4.352375 | 10.441675 | 4.292521 | 5.86932 | 62.37495 | 1.837519 | ... | 91.49 | 1.8 | 11.4 | 11.77 | 1.5712 | 173.6 | 964.0 | 3240.8635 | 3313.6215 | 3259.3745 |
| 2021-01-01 00:10:00 | 11.55232 | 3.015669 | 11.353215 | 4.413179 | 4.347186 | 10.43217 | 4.289684 | 5.86932 | 62.37495 | 1.81402 | ... | 91.49 | 1.8 | 11.4 | 11.77 | 1.5712 | 173.6 | 964.0 | 3260.7475 | 3301.692 | 3208.6785 |
| 2021-01-01 00:15:00 | 11.549955 | 3.018903 | 11.355525 | 4.408321 | 4.355828 | 10.410115 | 4.284427 | 5.86932 | 62.37495 | 1.81402 | ... | 91.49 | 1.8 | 11.4 | 11.77 | 1.5712 | 173.6 | 964.0 | 3265.5765 | 3284.133 | 3210.779 |
| 2021-01-01 00:20:00 | 11.547145 | 3.001164 | 11.326725 | 4.408659 | 4.361292 | 10.379145 | 4.285478 | 5.83575 | 62.37495 | 1.81402 | ... | 91.49 | 1.7 | 11.3 | 11.77 | 1.5712 | 173.6 | 964.0 | 3253.775 | 3271.926 | 3221.7745 |
| 2021-01-01 00:25:00 | 11.54316 | 3.017393 | 11.336345 | 4.408596 | 4.356374 | 10.387205 | 4.304148 | 5.802179 | 62.37495 | 1.81402 | ... | 91.49 | 1.6 | 11.2 | 11.77 | 1.5712 | 173.6 | 964.0 | 3236.979 | 3267.305 | 3227.6935 |


### 2. Define target
A stage can have multiple targets, but each model supported by gurobi must predict a single target.

In [6]:
list_target = ['240AIT225B.PNT'] # blancura_d1
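
Because each model handed to gurobi must predict exactly one target, a small guard can make that constraint explicit before training. This is a minimal sketch; `check_single_target` is a hypothetical helper, not part of the project.

```python
def check_single_target(list_target):
    """Return the only tag in list_target, or raise if there is not exactly one."""
    if len(list_target) != 1:
        raise ValueError(
            f"expected exactly one target, got {len(list_target)}: {list_target}"
        )
    return list_target[0]


target_tag = check_single_target(['240AIT225B.PNT'])  # blancura_d1
print(target_tag)
```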

### 3. Define features
During exploratory data analysis you should select the features that will be used to predict the target. In this example, these features are defined manually in a list.

In stage D0EOP there are 4 controllable variables (VC, "variables controlables"):
- dioxide (D0)
- oxygen (EOP)
- peroxide (EOP)
- soda (EOP)

Note that the list of controllable features is a subset of the list of features.

In [7]:
##### NOTE: TO SIMPLIFY THIS EXAMPLE, THE FEATURES USED IN MODEL BLANCURA_D0EOP ARE THE SAME AS THOSE USED IN MODEL MICROKAPPA_D0EOP
list_features = [
    "240AIT063A.PNT", #kappa_d0
    "240AIT063B.PNT", #brillo_d0
    "calc_prod_d0", #calc_prod_d0
    "240FY050.RO02", #especifico_dioxido_d0 - VC
    "240FY11PB.RO01", #especifico_peroxido_eop - VC
    "240FY118B.RO01", #especifico_oxigeno_eop - VC
    "240FY107A.RO01", #especifico_soda_eop - VC
    "SSTRIPPING015", #dqo_evaporadores
    "S276PER002", #concentracion_clo2_d0
    "240AIC022.MEAS", #ph_a
]

In [8]:
# list_features_controlables = [
#     "240FY050.RO02", #especifico_dioxido_d0 - VC,
#     "240FY118B.RO01", #especifico_oxigeno_eop - VC
#     "240FY11PB.RO01", #especifico_peroxido_eop - VC
#     "240FY107A.RO01", #especifico_soda_eop - VC
# ]
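
The subset relation between the two lists can be checked mechanically. The sketch below reuses the tags from the cells above; `not_in_features` is a name introduced here only for illustration.

```python
list_features = [
    "240AIT063A.PNT", "240AIT063B.PNT", "calc_prod_d0",
    "240FY050.RO02", "240FY11PB.RO01", "240FY118B.RO01", "240FY107A.RO01",
    "SSTRIPPING015", "S276PER002", "240AIC022.MEAS",
]
list_features_controlables = [
    "240FY050.RO02",   # especifico_dioxido_d0
    "240FY118B.RO01",  # especifico_oxigeno_eop
    "240FY11PB.RO01",  # especifico_peroxido_eop
    "240FY107A.RO01",  # especifico_soda_eop
]

# every controllable tag must also be a model feature
not_in_features = sorted(set(list_features_controlables) - set(list_features))
if not_in_features:
    raise ValueError(f"controllable tags missing from list_features: {not_in_features}")
print("all controllable features are model features")
```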

### 4. Read master tags data for this stage. Sort features used to train according to this order

In [9]:
### read master table - list tags
stage = 'd0eop'
path_maestro_tags_d0eop = f'config/config_ml_models_development/MaestroTags-{stage}-general.xlsx'
maestro_tags = pd.read_excel(path_maestro_tags_d0eop)
maestro_tags

| | TAG | TAG_DESCRIPTION | DESCRIPCION | ETAPA | CLASIFICACION | USE_PREVIOUS_MODEL | USE_NEXT_MODEL |
|---|---|---|---|---|---|---|---|
| 0 | 240FI020A.PNT | prod_total | Producción Total | A | NC | | |
| 1 | calc_prod_d0 | calc_prod_d0 | Producción entrada D0 (prod entrada A dezplazada) | D0 | NC | | |
| 2 | 240FI020B.PNT | prod_eop | Prod entrada EOP | EOP | NC | | |
| 3 | 240LIT010.PNT | nivel_tac_cafe | Nivel torre TAC Café | A | NC | | |
| 4 | 240AIC022.MEAS | ph_a | pH entrada etapa Acida | A | NC | | |
| 5 | 240TIC023.MEAS | temperatura_a | Temperatura etapa Acida | A | NC | | |
| 6 | 240FY024A.RO01 | especifico_acido_a | Específico Acido sulfúrico | A | NC | | |
| 7 | 240AIT063B.PNT | brillo_d0 | Brillo salida etapa A (entrada D0) | D0 | NC | | |
| 8 | 240AIT063A.PNT | kappa_d0 | Kappa salida etapa A (entrada D0) | D0 | NC | | |
| 9 | S276PER002 | concentracion_clo2_d0 | Concentración ClO2 | D0 | NC | | |


In [10]:
### sort the list of features according to the order in the master table

list_features = [tag for tag in maestro_tags['TAG'].tolist() if tag in list_features]
#list_features_controlables = [tag for tag in maestro_tags['TAG'].tolist() if tag in list_features_controlables]
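
Note that the list comprehension above silently drops any feature whose tag is absent from the master table, so a typo in a tag would go unnoticed. A minimal sketch of a stricter variant follows; `sort_by_master` is a hypothetical helper, and the toy master order uses only a few tags from the table above.

```python
def sort_by_master(list_features, master_tags):
    """Order list_features as in master_tags; fail loudly on tags the master table lacks."""
    missing = [tag for tag in list_features if tag not in master_tags]
    if missing:
        raise KeyError(f"tags not found in master table: {missing}")
    return [tag for tag in master_tags if tag in list_features]


# toy master order (illustrative subset of the real table)
master_tags = ["calc_prod_d0", "240AIC022.MEAS", "240AIT063B.PNT", "240AIT063A.PNT", "S276PER002"]

sorted_features = sort_by_master(["240AIT063A.PNT", "calc_prod_d0"], master_tags)
print(sorted_features)  # ['calc_prod_d0', '240AIT063A.PNT']
```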

### 5. Split train test

In [11]:
# RANDOM split train test
X_train, X_test, y_train, y_test = train_test_split(data[list_features], 
                                                    data[list_target], 
                                                    test_size = 0.2, 
                                                    random_state=42
                                                   )

In [12]:
# # TIME SERIES SPLIT
# X_train, X_test, y_train, y_test = train_test_split(data[list_features], 
#                                                     data[list_target], 
#                                                     test_size = 0.2, 
#                                                     shuffle = False
#                                                    )
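
The data are readings taken every 5 minutes, so a random split places neighbouring, nearly identical rows in both train and test, which may make the test score optimistic. The sketch below shows a chronological hold-out on synthetic data; `chronological_split` and the toy frame are assumptions introduced only for illustration.

```python
import pandas as pd


def chronological_split(frame, test_size=0.2):
    """Keep the oldest rows for training and the most recent rows for testing."""
    frame = frame.sort_index()
    n_test = int(round(len(frame) * test_size))
    cut = len(frame) - n_test
    return frame.iloc[:cut], frame.iloc[cut:]


# synthetic 5-minute index, mimicking the layout of the real data
index = pd.date_range("2021-01-01 00:05:00", periods=10, freq="5min")
toy = pd.DataFrame({"calc_prod_d0": range(10)}, index=index)

train, test = chronological_split(toy, test_size=0.2)
print(len(train), len(test))
print(train.index.max() < test.index.min())
```

Comparing the test R2 of both splits gives a rough idea of how much the random split benefits from temporal proximity.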

In [13]:
### save data TRAIN - TEST
name_model = 'd0eop_blancura'
os.makedirs(f'artifacts/data_training/{name_model}', exist_ok=True)  # make sure the output folder exists


# ---
# save X_train (the `with` block closes the file, so no explicit close() is needed)
path_X_train = f'artifacts/data_training/{name_model}/X_train.pkl'
with open(path_X_train, "wb") as output:
    pickle.dump(X_train, output)

# save y_train
path_y_train = f'artifacts/data_training/{name_model}/y_train.pkl'
with open(path_y_train, "wb") as output:
    pickle.dump(y_train, output)


# ---
# save X_test
path_X_test = f'artifacts/data_training/{name_model}/X_test.pkl'
with open(path_X_test, "wb") as output:
    pickle.dump(X_test, output)

# save y_test
path_y_test = f'artifacts/data_training/{name_model}/y_test.pkl'
with open(path_y_test, "wb") as output:
    pickle.dump(y_test, output)

### 6. Train lr and evaluate R2

In [14]:
# # train lr
# lr = LinearRegression()
# lr.fit(X_train, y_train)

In [15]:
# train lr
lr_model = LinearRegression()

lr = Pipeline([
    ('scaler', StandardScaler() ), # MinMaxScaler is not supported by gurobi
    ('poly_feature_2', PolynomialFeatures(2)),
    ('lr',  lr_model)
])

lr.fit(X_train, y_train)

In [16]:
# r2 score
r2_lr_train = lr.score(X_train, y_train)
r2_lr_test = lr.score(X_test, y_test)

print('r2_train: ', r2_lr_train)
print('r2_test: ', r2_lr_test)

r2_train:  0.7706366240333764
r2_test:  0.7684082750001481
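
For reference, the `score` method of scikit-learn regressors returns the coefficient of determination, where $y_i$ are the observed targets, $\hat{y}_i$ the predictions and $\bar{y}$ the mean of the observed targets in the evaluated set:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

A value of 1 is a perfect fit, and 0 matches a model that always predicts $\bar{y}$.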


### 7. Train Gradient Boosting and evaluate R2

In [17]:
# train
param_n_estimators = 20
gb_simple_model = GradientBoostingRegressor(
    random_state=42,
    n_estimators=param_n_estimators,
    min_samples_split=0.2,
    min_samples_leaf=0.1,
    # max_depth=2,
)

gb_simple = Pipeline([
    ('poly_feature_2', PolynomialFeatures(2)),
    ('scaler', StandardScaler() ), # MinMaxScaler is not supported by gurobi
    ('gb_simple',  gb_simple_model)
])

# y_train is a one-column DataFrame; flatten it to 1-D to avoid sklearn's DataConversionWarning
gb_simple.fit(X_train, y_train.values.ravel())


In [18]:
# r2 score
r2_gb_simple_train = gb_simple.score(X_train, y_train)
r2_gb_simple_test = gb_simple.score(X_test, y_test)

print('r2_train: ', r2_gb_simple_train)
print('r2_test: ', r2_gb_simple_test)

r2_train:  0.7028494076256613
r2_test:  0.7003605276910014


### 8. Linear Regression custom transformations in columns
The objective of this model is to achieve a higher score, and also to test whether gurobi accepts custom transformations in its optimization models.

In [19]:
# test transformation v1
transformer_log_v1 = FunctionTransformer(np.log1p)
transformer_log_v1.transform(X_train['calc_prod_d0'])

datetime
2022-05-19 06:10:00    8.207356
2022-08-31 15:55:00    8.135041
2022-08-18 22:20:00    8.094981
2021-02-15 22:30:00    8.112742
2021-02-23 20:55:00    8.013620
                         ...   
2022-06-18 03:20:00    8.159363
2022-08-03 00:45:00    8.134254
2022-05-19 06:40:00    8.201076
2022-09-21 05:45:00    8.201623
2022-08-12 02:15:00    8.156949
Name: calc_prod_d0, Length: 114560, dtype: float64

In [20]:
# test transformation of only one column v2
transformer_log_v2 = ColumnTransformer(
    transformers=[
        ('log production', FunctionTransformer(np.log1p), ['calc_prod_d0'])
    ],
    # remainder='passthrough'
)


transformer_log_v2 # DOES NOT WORK

In [21]:
# test transformation of only one column v3
transformer_log_v3 = make_column_transformer(
    (FunctionTransformer(np.log1p), ["calc_prod_d0"]),
    remainder='passthrough' # Leave other columns unchanged
)

transformer_log_v3
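
One likely reason the v2 transformer is marked as not working is that `ColumnTransformer` defaults to `remainder='drop'`, so every column not listed is discarded. The sketch below compares both settings on a toy frame built from two rows of the data shown earlier; `log_production` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# two rows copied from data.head(), restricted to three columns
toy = pd.DataFrame({
    "calc_prod_d0": [3240.8635, 3260.7475],
    "S276PER002": [11.77, 11.77],
    "240AIC022.MEAS": [2.983948, 3.015669],
})


def log_production(remainder):
    """Build the log transformer for calc_prod_d0 with the given remainder policy."""
    return ColumnTransformer(
        transformers=[("log production", FunctionTransformer(np.log1p), ["calc_prod_d0"])],
        remainder=remainder,
    )


dropped = np.asarray(log_production("drop").fit_transform(toy))
kept = np.asarray(log_production("passthrough").fit_transform(toy))
print(dropped.shape)  # (2, 1): only the transformed column survives
print(kept.shape)     # (2, 3): transformed column first, then the untouched ones
```

With `remainder='passthrough'` the transformed column is moved to the front, so the column order seen by later steps differs from the input order.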

In [22]:
#### create pipeline training lr

# create transformation
transformer_log = make_column_transformer(
    (FunctionTransformer(np.log1p), ["calc_prod_d0"]),
    remainder='passthrough' # Leave other columns unchanged
)


# create model lr
lr_model = LinearRegression()


# create pipeline
lr_log_prod = Pipeline([
    ('log production', transformer_log),
    ('scaler', StandardScaler() ), # MinMaxScaler is not supported by gurobi
    ('poly_feature_2', PolynomialFeatures(2)),
    ('lr',  lr_model)
])

lr_log_prod.fit(X_train, y_train)

In [23]:
# r2 score
r2_lr_log_prod_train = lr_log_prod.score(X_train, y_train)
r2_lr_log_prod_test = lr_log_prod.score(X_test, y_test)

print('r2_train: ', r2_lr_log_prod_train)
print('r2_test: ', r2_lr_log_prod_test)

r2_train:  0.7700459374494136
r2_test:  0.7677098007932315


In [24]:
coefficients = lr_log_prod.named_steps['lr'].coef_
coefficients.shape

(1, 66)
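
The 66 coefficients match what `PolynomialFeatures(2)` produces from 10 input features with its default `include_bias=True`: 1 bias term, 10 linear terms, 10 squares and 45 pairwise products. A small sketch of the count, where `n_polynomial_terms` is a hypothetical helper:

```python
from math import comb


def n_polynomial_terms(n_features, degree, include_bias=True):
    """Number of monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1


n_terms = n_polynomial_terms(n_features=10, degree=2)
print(n_terms)  # 66
```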

### 9. Linear Regression Adding certain transformations in the data
- Dioxide squared
- Dioxide times inlet brightness
- e raised to minus (a) * dioxide, where (a) is a parameter provided by an expert
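
Written out, with $x$ the specific ClO2 dose and $b$ the inlet brightness (symbols introduced here only for illustration), the three candidate terms are:

$$x^2, \qquad x \cdot b, \qquad e^{-a x}$$

The code below uses $a = 1/10$.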

In [25]:
# define function: negative exponential of the ClO2 dose
def euler_dioxido(values):
    factor_a = 1/10  # parameter (a) provided by the expert
    output = np.exp(-factor_a * values)
    return output

In [26]:
# # examples euler dioxido
# transformer_euler = FunctionTransformer(euler_dioxido)
# transformer_euler.transform(X_train['240FY11PB.RO01'])

In [27]:
#### create pipeline training lr

# create transformation
# NOTE: '240FY11PB.RO01' is tagged especifico_peroxido_eop in list_features; the dioxide tag is '240FY050.RO02'.
# The scores reported below were obtained with the column as written here.
transformer_euler = make_column_transformer(
    (FunctionTransformer(euler_dioxido), ['240FY11PB.RO01']),
    remainder='passthrough' # Leave other columns unchanged
)


# create model lr
lr_model = LinearRegression()


# create pipeline
lr_euler_dioxido = Pipeline([
    ('euler dioxido', transformer_euler),
    ('scaler', StandardScaler() ), # MinMaxScaler is not supported by gurobi
    ('poly_feature_2', PolynomialFeatures(2)),
    ('lr',  lr_model)
])

lr_euler_dioxido.fit(X_train, y_train)

In [28]:
# r2 score
r2_lr_euler_dioxido_train = lr_euler_dioxido.score(X_train, y_train)
r2_lr_euler_dioxido_test = lr_euler_dioxido.score(X_test, y_test)

print('r2_train: ', r2_lr_euler_dioxido_train)
print('r2_test: ', r2_lr_euler_dioxido_test)

r2_train:  0.7703282062919419
r2_test:  0.7681920622396127


## IMPORTANT: GUROBI DOESN'T ACCEPT CUSTOM TRANSFORMATIONS IN COLUMNS.
For example, with this transformation, even though the function that creates the transformation is defined, gurobi does not accept it. It returns the error:

NoModel: Can't do model for functiontransformer: No implementation found
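
Since the error only appears once the model reaches gurobi, it can help to inspect a pipeline for `FunctionTransformer` steps before saving it. The sketch below only walks the scikit-learn objects and does not call gurobi; `find_function_transformers` is a hypothetical helper.

```python
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def find_function_transformers(estimator, path=""):
    """List the paths of all FunctionTransformer steps nested inside `estimator`."""
    if isinstance(estimator, FunctionTransformer):
        return [path]
    found = []
    if isinstance(estimator, Pipeline):
        for name, step in estimator.steps:
            found += find_function_transformers(step, f"{path}/{name}".lstrip("/"))
    elif isinstance(estimator, ColumnTransformer):
        for name, transformer, _columns in estimator.transformers:
            found += find_function_transformers(transformer, f"{path}/{name}".lstrip("/"))
    return found


plain = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])
with_log = Pipeline([
    ("log production", make_column_transformer(
        (FunctionTransformer(np.log1p), ["calc_prod_d0"]),
        remainder="passthrough",
    )),
    ("scaler", StandardScaler()),
    ("lr", LinearRegression()),
])

print(find_function_transformers(plain))
print(find_function_transformers(with_log))
```

An empty list means no custom column transformation was found; any entry points at a step that, according to the error above, gurobi would reject.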

## SAVE OUTPUTS TRAINING

When training finishes, the following outputs must be generated:

----
#### Analytical artifact:
- **trained model**, saved as pkl

----
#### List of features:
- **list of features** (all the features the model sees)

- **list of controllable features** (all the features the model sees that are controllable variables and therefore decision variables in an optimization model)

- **list of targets** (list with the target of the model)


----
#### Example Input:
- **X_train.head(1)**: the order of the features used and the column names must be known, and both must be respected. They can be deduced from the list of features, but an example training instance X is saved anyway

In [29]:
name_model = 'd0eop_blancura'

### 1. Save artifact model
Model versioning is not implemented in this example.

In [30]:
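# NOTE: lr_log_prod contains a FunctionTransformer (np.log1p); per the section above, gurobi may reject it
artifact_model_to_save = lr_log_prod

# save model (the `with` block closes the file, so no explicit close() is needed)
path_model = f'artifacts/models/{name_model}/model.pkl'
with open(path_model, "wb") as output:
    pickle.dump(artifact_model_to_save, output)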

In [31]:
# print model saved to gurobi
artifact_model_to_save

### 2. Save list of features
Save the master tag table with only the tags used to train the model.

IMPORTANT NOTE: remember that the list of features was sorted according to the master table, so this is the order used for training. This table also distinguishes between non-controllable, controllable and target variables.

In [32]:
# generate a list of features + target
list_features_target = list_features + list_target

# filter the master tag table to keep only the features + target used to train the ml models
maestro_tags = maestro_tags[maestro_tags['TAG'].isin(list_features_target)]
maestro_tags = maestro_tags.reset_index(drop=True)

In [33]:
# save the master table in the config folder that will be used to create the optimization engine
path_list_features_target_to_optimization = f'config/optimization_engine/ml_models/MaestroTags-{name_model}-general.xlsx'
maestro_tags.to_excel(path_list_features_target_to_optimization, index = False)

In [40]:
maestro_tags

| | TAG | TAG_DESCRIPTION | DESCRIPCION | ETAPA | CLASIFICACION | USE_PREVIOUS_MODEL | USE_NEXT_MODEL |
|---|---|---|---|---|---|---|---|
| 0 | calc_prod_d0 | calc_prod_d0 | Producción entrada D0 (prod entrada A dezplazada) | D0 | NC | | |
| 1 | 240AIC022.MEAS | ph_a | pH entrada etapa Acida | A | NC | | |
| 2 | 240AIT063B.PNT | brillo_d0 | Brillo salida etapa A (entrada D0) | D0 | NC | | |
| 3 | 240AIT063A.PNT | kappa_d0 | Kappa salida etapa A (entrada D0) | D0 | NC | | |
| 4 | S276PER002 | concentracion_clo2_d0 | Concentración ClO2 | D0 | NC | | |
| 5 | SSTRIPPING015 | dqo_evaporadores | DQO Evaporadores | D0 | NC | | |
| 6 | 240FY050.RO02 | especifico_dioxido_d0 | Específico ClO2 | D0 | C | | |
| 7 | 240FY11PB.RO01 | especifico_peroxido_eop | Esp. Peróxido | EOP | C | | |
| 8 | 240FY118B.RO01 | especifico_oxigeno_eop | Esp. Oxígeno | EOP | C | | |
| 9 | 240FY107A.RO01 | especifico_soda_eop | Esp. Soda EOP | EOP | C | | |


### 3. Save example input

In [34]:
# example input
example_input = X_train.head(1)
example_input

| datetime | calc_prod_d0 | 240AIC022.MEAS | 240AIT063B.PNT | 240AIT063A.PNT | S276PER002 | SSTRIPPING015 | 240FY050.RO02 | 240FY11PB.RO01 | 240FY118B.RO01 | 240FY107A.RO01 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2022-05-19 06:10:00 | 3666.8315 | 3.018025 | 61.48198 | 6.910001 | 11.28 | 489.0 | 6.656102 | 5.811579 | 0.999272 | 9.314641 |


In [35]:
# save example input

path_example_input_ml_model = f'config/optimization_engine/ml_models/ExampleInputsModels-{name_model}.xlsx'
example_input.to_excel(path_example_input_ml_model)

### 4. Save configuration piecewise model
Save the tag and the threshold that divide the data among different models, together with the name of each model used. This output is useful when a piecewise model is trained (models segmented according to a threshold on one or more features).

**This output must always be generated to connect to the optimization engine. It is only useful when a piecewise model was trained, but to keep the same structure across all models this file must always be generated.**

In [36]:
# default values to save when no segmentation is trained
no_apply_segmentation_string = 'no_apply'
no_apply_segmentation_number = 0

In [37]:
# generate a dictionary with all model info
dict_model_info = {}

# generate a dictionary with the segmentation-specific info. Keep the list structure (lists allow multiple thresholds and multiple models)
info_piecewise = {"tag": no_apply_segmentation_string,
                 "threshold": [no_apply_segmentation_number], 
                 "names_pkl_models": ["model"] # names of the pkl files of the trained models. Although no segmentation was trained, this name must be saved correctly
                }

# append into global dict
dict_model_info["Segmentation_Production"] = info_piecewise

dict_model_info

{'Segmentation_Production': {'tag': 'no_apply',
  'threshold': [0],
  'names_pkl_models': ['model']}}

In [38]:
import json
path_json_info_model = f'config/optimization_engine/ml_models/InfoModel-{name_model}.json'
with open(path_json_info_model, 'w') as archivo:
    json.dump(dict_model_info, archivo)
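
Because the optimization engine relies on this file always having the same structure, a quick structural check after a JSON round trip can catch mistakes early. This is a minimal sketch: `validate_model_info` is a hypothetical helper, and the rule of one pkl name per threshold is an assumption inferred from the no-segmentation example above.

```python
import json

REQUIRED_KEYS = {"tag", "threshold", "names_pkl_models"}


def validate_model_info(dict_model_info):
    """Check that every segmentation entry has the expected keys and list lengths."""
    for segmentation, info in dict_model_info.items():
        missing = REQUIRED_KEYS - set(info)
        if missing:
            raise ValueError(f"{segmentation}: missing keys {sorted(missing)}")
        # ASSUMPTION: one pkl name per threshold, as in the no-segmentation example
        if len(info["threshold"]) != len(info["names_pkl_models"]):
            raise ValueError(f"{segmentation}: threshold and names_pkl_models differ in length")
    return True


dict_model_info = {
    "Segmentation_Production": {
        "tag": "no_apply",
        "threshold": [0],
        "names_pkl_models": ["model"],
    }
}

# same round trip as writing the file and reading it back
reloaded = json.loads(json.dumps(dict_model_info))
print(validate_model_info(reloaded))  # True
```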