# Model Embedding Example and Prediction Verification

This script demonstrates how to train a single model class, embed the model, and solve the optimization problem. We fix a sample from our generated data and solve the optimization problem with all elements of $\mathbf{x}$ equal to our data. In general, we might have some elements of $\mathbf{x}$ that are fixed, called our "conceptual variables," and the remaining indices are our decision variables. By fixing all elements of $\mathbf{x}$, we can verify that the model prediction matches the original sklearn model.

## Load the relevant packages

In [9]:
import opticl

In [11]:
import pandas as pd
import numpy as np
import math
from sklearn.utils.extmath import cartesian
import time
import sys
import os
import time

from sklearn.metrics import roc_auc_score, r2_score, mean_squared_error
from sklearn.cluster import KMeans

In [19]:
from pyomo import environ
from pyomo.environ import *

## Initialize data
We will work with a basic dataset from `sklearn`.

In [12]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=200, n_features = 20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=1)
X_train = pd.DataFrame(X_train).add_prefix('col')
X_test = pd.DataFrame(X_test).add_prefix('col')

## Train the chosen model type

In [13]:
# alg = 'rf' 
# alg_run = 'rf_shallow'
alg = alg_run = 'mlp'

The user can optionally select a manual parameter grid for the cross-validation procedure. We implement a default parameter grid; see **run_MLmodels.py** for details on the tuned parameters. If you wish to use the default, leave ```parameter_grid = None``` (or do not specify any grid).

In [14]:
parameter_grid = None
# parameter_grid = {'hidden_layer_sizes': [(5,),(10,)]}

In [16]:
from sklearn.neural_network import MLPRegressor, MLPClassifier

s = 0
version = 'test'
outcome = 'temp'

m, perf = opticl.run_model(X_train, y_train, X_test, y_test, alg_run, task = 'continuous', 
                       seed = s, cv_folds = 5, 
                       # The user can manually specify the parameter grid for cross-validation if desired
                       parameter_grid = parameter_grid,
                       save = False)

------------- Initialize grid  ----------------
------------- Running model  ----------------
Algorithm = mlp, metric = None
------------- Model evaluation  ----------------
-------------------training evaluation-----------------------
Train MSE: 8.017082756749335e-07
Train R2: 0.9999999999802882
-------------------testing evaluation-----------------------
Test MSE: 60.24952015596481
Test R2: 0.9987048913993389
------------- Save results  ----------------


After training the model, we will save the trained model in the format needed for embedding the constraints. See **ConstraintLearning.py** for the specific format that is extracted per method. We also save the performance of the model to use in the automated model selection pipeline (if desired).

We also create the save directory if it does not exist.



In [17]:
if not os.path.exists('results/%s/' % alg):
    os.makedirs('results/%s/' % alg)
    
constraintL = opticl.ConstraintLearning(X_train, y_train, m, alg)
constraint_add = constraintL.constraint_extrapolation('r')
constraint_add.to_csv('results/%s/%s_%s_model.csv' % (alg, version, outcome), index = False)

perf.to_csv('results/%s/%s_%s_performance.csv' % (alg, version, outcome), index= False)

### Check: what should the result be for our sample observation, if all x are fixed?

#### Choose sample to test
This will be the observation ("patient") that we feed into the optimization model.

In [22]:
sample_id = 1
sample = X_train.loc[sample_id:sample_id,:].reset_index(drop = True)

Calculate model prediction directly in sklearn.

In [23]:
m.predict(sample)

array([290.34368228])

## Optimization formulation
We will embed the model trained above. The model could also be selected using the model selection pipeline, which we demonstrate in the WFP example script.

If manually specifying the model, as we are here, the key elements of the ``model_master`` dataframe are:
- model_type: algorithm name.
- outcome: name of outcome of interest; this is relevant in the case of multiple learned outcomes.
- save_path: file name of the extracted model.
- objective: the weight of the objective if it should be included as an additive term in the objective. A weight of 0 omits it from the objective entirely.
- lb/ub: the lower (or upper) bound that we wish to apply to the learned outcome. If there is no bound, it should be set to ``None``.

In this case, we set the outcome to be our only objective term, which will allow us to verify that the predictions are consistent between the embedded model and the sklearn prediction function.

In [18]:
model_master = pd.DataFrame(columns = ['model_type','outcome','save_path','lb','ub','objective'])

model_master.loc[0,'model_type'] = alg
model_master.loc[0,'save_path'] = 'results/%s/%s_%s_model.csv' % (alg, version, outcome)
model_master.loc[0,'outcome'] = outcome
model_master.loc[0,'objective'] = 1
model_master.loc[0,'ub'] = None
model_master.loc[0,'lb'] = None

In [35]:
model = ConcreteModel()

## We will create our x decision variables, and fix them all to our sample's values for model verification.
N = X_train.columns
#model.x = model.addVars(N, vtype=GRB.CONTINUOUS, name='x', lb = -math.inf)
model.x = Var(N, domain=Reals)

def fix_value(model, index):
    return model.x[index] == sample.loc[0,index]

model.Constraint1 = Constraint(N, rule=fix_value)

## Specify any non-learned objective components - none here for feasibility
model.OBJ = Objective(expr=0, sense=minimize)

In [36]:
final_model = opticl.optimization_MIP(model, model.x, model_master, X_train, tr = False)

Embedding objective function for temp


In [37]:
final_model.pprint()

2 Set Declarations
    Constraint1_index : Size=1, Index=None, Ordered=False
        Key  : Dimen : Domain : Size : Members
        None :     1 :    Any :   20 : {'col0', 'col1', 'col10', 'col11', 'col12', 'col13', 'col14', 'col15', 'col16', 'col17', 'col18', 'col19', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9'}
    x_index : Size=1, Index=None, Ordered=False
        Key  : Dimen : Domain : Size : Members
        None :     1 :    Any :   20 : {'col0', 'col1', 'col10', 'col11', 'col12', 'col13', 'col14', 'col15', 'col16', 'col17', 'col18', 'col19', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9'}

6 Var Declarations
    l : Size=0, Index=Any
        Key : Lower : Value : Upper : Fixed : Stale : Domain
    v : Size=10, Index=Any
        Key            : Lower : Value : Upper : Fixed : Stale : Domain
        ('temp', 0, 0) :     0 :  None :  None : False :  True : NonNegativeReals
        ('temp', 0, 1) :     0 :  None :  None : False :  True : NonNegat

In [None]:
opt = SolverFactory('gurobi')
results = opt.solve(final_model) 

In [34]:
value(final_model.y['temp'])

231.03123109014757

### Check for equality between sklearn and embedded model

In [11]:
math.isclose(model.ObjVal,m.predict(sample)[0],abs_tol=1e-5)

True