# Modeling d0eop_microkappa
Use the features of stage d0eop to predict microkappa (the target), which is measured at stage D1 (the output of stage EOP).

-------
**INFO MODEL**
- **stage**: d0eop
- **target**: microkappa

## Root folder and read env variables

In [1]:
import os
# set the working directory to the repository root (parent of the notebook folder) so outputs are saved there
actual_path = os.path.abspath(os.getcwd())
root_path = os.path.dirname(actual_path)  # portable way to drop the last path component
os.chdir(root_path)
print('root path: ', root_path)

root path:  D:\github-mi-repo\Optimization-Industrial-Process


In [2]:
import os
from dotenv import load_dotenv, find_dotenv  # reads the variables defined in the .env file

# load the environment variables from .env
load_dotenv(find_dotenv())

# read the environment variables and store them as Python variables
PROJECT_GCP = os.environ.get("PROJECT_GCP", "")

## RUN

In [3]:
import pandas as pd
import numpy as np
from google.cloud import bigquery
import gcsfs
import pickle

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline

# transform
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures

# models
from sklearn.linear_model import LinearRegression # lr
from sklearn.linear_model import Ridge # ridge
from sklearn.linear_model import Lasso # lasso
from sklearn.tree import DecisionTreeRegressor # tree
from sklearn.ensemble import GradientBoostingRegressor #gb
from sklearn.ensemble import RandomForestRegressor #rf
from xgboost import XGBRegressor # xgb
from sklearn.neural_network import MLPRegressor # mlp

In [4]:
### development

PROJECT_ID = PROJECT_GCP
! gcloud config set project $PROJECT_ID

Updated property [core/project].


### 1. Read data

In [5]:
path_data = 'artifacts/data/data.pkl'
data = pd.read_pickle(path_data)
data.head()

datetime,230AIT446.PNT,240AIC022.MEAS,240AIC126.MEAS,240AIC224.MEAS,240AIC286.MEAS,240AIC324.MEAS,240AIC433.MEAS,240AIT063A.PNT,240AIT063B.PNT,240AIT225A.PNT,...,S240ALDP022,S240ALDP031,S240ALDP032,S276PER002,S2MAQUINAT07,S76ALE017,SSTRIPPING015,calc_prod_d0,calc_prod_d1,calc_prod_p
2021-01-01 00:05:00,11.55504,2.983948,11.346645,4.413519,4.352375,10.441675,4.292521,5.86932,62.37495,1.837519,...,91.49,1.8,11.4,11.77,1.5712,173.6,964.0,3240.8635,3313.6215,3259.3745
2021-01-01 00:10:00,11.55232,3.015669,11.353215,4.413179,4.347186,10.43217,4.289684,5.86932,62.37495,1.81402,...,91.49,1.8,11.4,11.77,1.5712,173.6,964.0,3260.7475,3301.692,3208.6785
2021-01-01 00:15:00,11.549955,3.018903,11.355525,4.408321,4.355828,10.410115,4.284427,5.86932,62.37495,1.81402,...,91.49,1.8,11.4,11.77,1.5712,173.6,964.0,3265.5765,3284.133,3210.779
2021-01-01 00:20:00,11.547145,3.001164,11.326725,4.408659,4.361292,10.379145,4.285478,5.83575,62.37495,1.81402,...,91.49,1.7,11.3,11.77,1.5712,173.6,964.0,3253.775,3271.926,3221.7745
2021-01-01 00:25:00,11.54316,3.017393,11.336345,4.408596,4.356374,10.387205,4.304148,5.802179,62.37495,1.81402,...,91.49,1.6,11.2,11.77,1.5712,173.6,964.0,3236.979,3267.305,3227.6935


### 2. Read master tags data for this stage

In [6]:
stage = 'd0eop'
path_maestro_tags_d0eop = f'config/config_models/MaestroTags-{stage}-general.xlsx'
maestro_tags = pd.read_excel(path_maestro_tags_d0eop)
maestro_tags

Unnamed: 0,TAG,TAG_DESCRIPTION,DESCRIPCION,ETAPA,CLASIFICACION
0,240FI020A.PNT,prod_total,Producción Total,A,NC
1,calc_prod_d0,calc_prod_d0,Producción entrada D0 (prod entrada A dezplazada),D0,NC
2,240FI020B.PNT,prod_eop,Prod entrada EOP,EOP,NC
3,240LIT010.PNT,nivel_tac_cafe,Nivel torre TAC Café,A,NC
4,240AIC022.MEAS,ph_a,pH entrada etapa Acida,A,C
5,240TIC023.MEAS,temperatura_a,Temperatura etapa Acida,A,C
6,240FY024A.RO01,especifico_acido_a,Específico Acido sulfúrico,A,NC
7,240AIT063B.PNT,brillo_d0,Brillo salida etapa A (entrada D0),D0,NC
8,240AIT063A.PNT,kappa_d0,Kappa salida etapa A (entrada D0),D0,NC
9,S276PER002,concentracion_clo2_d0,Concentración ClO2,D0,NC


### 3. Define target
A stage may have several targets. However, the models supported by Gurobi must predict a single target, so the target list below contains exactly one tag.

In [7]:
list_target = ['240AIT225A.PNT'] # microkappa_d1
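
Because each model must predict a single target, a small guard can catch an accidental multi-target list before training. The following is a minimal sketch; the helper name `check_single_target` is hypothetical and not part of the original notebook.

```python
def check_single_target(targets):
    """Hypothetical helper: return the only target name, or raise otherwise."""
    if len(targets) != 1:
        raise ValueError(
            f"expected exactly one target, got {len(targets)}: {targets}"
        )
    return targets[0]


list_target = ['240AIT225A.PNT']  # microkappa_d1, as defined in this notebook
target_name = check_single_target(list_target)
print(target_name)
```

Passing a list with two tags would raise a `ValueError` instead of silently training a multi-output model.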

### 4. Define features
During the exploratory data analysis you should select the features that will be used to predict the target. In this example, these features are defined manually in a list.

Stage D0EOP has 4 controllable variables (VC, the controllable features):
- dioxide d0
- oxygen eop
- peroxide eop
- soda eop

In [8]:
list_features = [
    "240AIT063A.PNT", #kappa_d0
    "240AIT063B.PNT", #brillo_d0
    "calc_prod_d0", #calc_prod_d0
    "240FY050.RO02", #especifico_dioxido_d0 - VC
    "SSTRIPPING015", #dqo_evaporadores
    "S276PER002", #concentracion_clo2_d0
    "240AIC022.MEAS", #ph_a
    "240FY118B.RO01", #especifico_oxigeno_eop - VC
    "240FY11PB.RO01", #especifico_peroxido_eop - VC
    "240FY107A.RO01", #especifico_soda_eop - VC
]

In [9]:
list_features_controlables = [
    "240FY050.RO02", #especifico_dioxido_d0 - VC
    "240FY118B.RO01", #especifico_oxigeno_eop - VC
    "240FY11PB.RO01", #especifico_peroxido_eop - VC
    "240FY107A.RO01", #especifico_soda_eop - VC
]
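
Since the two lists are maintained by hand, it may help to verify that every feature exists in the data and that every controllable feature is also in the feature list. The sketch below assumes a hypothetical helper `validate_features` and uses a toy DataFrame with made-up values; only the tag names come from this notebook.

```python
import pandas as pd


def validate_features(df, features, controllable):
    """Hypothetical helper: check hand-written feature lists against the data."""
    missing = [c for c in features if c not in df.columns]
    if missing:
        raise KeyError(f"features missing from data: {missing}")
    outside = [c for c in controllable if c not in features]
    if outside:
        raise ValueError(f"controllable features not in feature list: {outside}")
    return True


# toy data: the values are illustrative only, not real process measurements
toy_data = pd.DataFrame({
    "240AIT063A.PNT": [6.0, 6.1],  # kappa_d0
    "240FY050.RO02": [7.0, 7.2],   # especifico_dioxido_d0 - VC
})
toy_features = ["240AIT063A.PNT", "240FY050.RO02"]
toy_controllable = ["240FY050.RO02"]

is_valid = validate_features(toy_data, toy_features, toy_controllable)
print(is_valid)
```

In the notebook itself the call would presumably be `validate_features(data, list_features, list_features_controlables)`.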

### 5. Split train test

In [10]:
X_train, X_test, y_train, y_test = train_test_split(data[list_features], 
                                                    data[list_target], 
                                                    test_size = 0.2, 
                                                    random_state=42
                                                   )

In [11]:
# save
# path_X_train = f'artifacts/data/modeling_{stage}/X_train.pkl'
# with open(path_X_train, "wb") as output:
#     pickle.dump(X_train, output)
#     output.close()

### 6. Train lr and evaluate R2

In [12]:
# baseline: plain linear regression (the variable lr is overwritten by the pipeline in the next cell)
lr = LinearRegression()
lr.fit(X_train, y_train)

In [13]:
# train lr on degree-2 polynomial features
lr_model = LinearRegression()

lr = Pipeline([
    ('poly_feature_2', PolynomialFeatures(2)),
    #('scaler', MinMaxScaler() ), 
    ('lr',  lr_model)
])

lr.fit(X_train, y_train)
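
To make the effect of `PolynomialFeatures(2)` concrete, the sketch below counts the columns the linear regression actually receives when the input has 10 features, as `list_features` does. The placeholder array `X_dummy` is an assumption used only to trigger the expansion; its values play no role.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10  # length of list_features in this notebook
X_dummy = np.zeros((1, n_features))  # placeholder row, values are irrelevant

poly = PolynomialFeatures(2)
n_expanded = poly.fit_transform(X_dummy).shape[1]

# 1 bias column + 10 linear terms + 10 squares + 45 pairwise products
print(n_expanded)  # 66
```

The model is therefore still linear in its coefficients, but it is fitted on 66 columns rather than 10.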

In [14]:
# r2 score
r2_lr_train = lr.score(X_train, y_train)
r2_lr_test = lr.score(X_test, y_test)

print('r2_train: ', r2_lr_train)
print('r2_test: ', r2_lr_test)

r2_train:  0.6418639812663169
r2_test:  0.6441467041612028
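
For reference, `score` on a regression pipeline returns the coefficient of determination (R²) of its predictions. The sketch below checks this on synthetic data; the data and variable names are illustrative assumptions, not the notebook's process data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic data with an interaction term, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Pipeline([
    ("poly_feature_2", PolynomialFeatures(2)),
    ("lr", LinearRegression()),
])
model.fit(X, y)

score_from_pipeline = model.score(X, y)
score_from_metric = r2_score(y, model.predict(X))
scores_match = bool(np.isclose(score_from_pipeline, score_from_metric))
print(scores_match)
```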


### 7. Save artifact model

In [15]:
# example input
X_train.head(1)

datetime,240AIT063A.PNT,240AIT063B.PNT,calc_prod_d0,240FY050.RO02,SSTRIPPING015,S276PER002,240AIC022.MEAS,240FY118B.RO01,240FY11PB.RO01,240FY107A.RO01
2022-12-28 22:55:00,6.621296,59.40062,3486.1635,7.026081,460.0,11.29,2.918974,1.098129,6.418634,8.966175


In [16]:
# save model
path_model_lr = 'artifacts/models/d0eop_microkappa/lr.pkl'
with open(path_model_lr, "wb") as output:
    pickle.dump(lr, output)  # the with block closes the file on exit
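
A quick way to confirm that the saved artifact is usable is to load it back and compare predictions on an example input. The sketch below does this round trip with a small synthetic pipeline and a temporary directory; the column names `f1`, `f2`, `f3` and the data are hypothetical stand-ins for `X_train` and `y_train`.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic stand-in for X_train / y_train (random values, hypothetical columns)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y = 2.0 * X["f1"] + X["f2"] * X["f3"]

model = Pipeline([
    ("poly_feature_2", PolynomialFeatures(2)),
    ("lr", LinearRegression()),
])
model.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path_model = os.path.join(tmp_dir, "lr.pkl")
    with open(path_model, "wb") as output:
        pickle.dump(model, output)
    with open(path_model, "rb") as saved:
        loaded = pickle.load(saved)

# the example input keeps the same column names and order used for training
example_input = X.head(1)
same_prediction = bool(
    np.allclose(model.predict(example_input), loaded.predict(example_input))
)
print(same_prediction)
```

Pickled scikit-learn models are generally meant to be loaded with the same library version that produced them, and only from trusted sources.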

### 8. Output training
When training finishes, the following outputs must have been generated:

----
#### Analytical artifact:
- **trained model**, saved as a pkl file

----
#### Example input:
- **X_train.head(1)**: the order of the features and the column names must be known. Both must be respected

----
#### Feature lists:
- **list of features** (all the features the model sees)

- **list of controllable features** (all the features the model sees that are controllable variables and are therefore decision variables in an optimization model)

- **list of targets** (list containing the model's target)
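
One possible way to keep these lists next to the model is to write them to a JSON file, which preserves list order. This is only a sketch: the file name `lr_metadata.json`, the dictionary keys and the temporary directory are assumptions, not part of the original workflow.

```python
import json
import os
import tempfile

list_features = [
    "240AIT063A.PNT", "240AIT063B.PNT", "calc_prod_d0", "240FY050.RO02",
    "SSTRIPPING015", "S276PER002", "240AIC022.MEAS", "240FY118B.RO01",
    "240FY11PB.RO01", "240FY107A.RO01",
]
list_features_controlables = [
    "240FY050.RO02", "240FY118B.RO01", "240FY11PB.RO01", "240FY107A.RO01",
]
list_target = ["240AIT225A.PNT"]

# hypothetical metadata layout; the key names are illustrative
metadata = {
    "features": list_features,
    "controllable_features": list_features_controlables,
    "target": list_target,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path_metadata = os.path.join(tmp_dir, "lr_metadata.json")  # hypothetical name
    with open(path_metadata, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    with open(path_metadata, encoding="utf-8") as f:
        restored = json.load(f)

print(restored["features"] == list_features)
```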