# Model Embedding Example and Prediction Verification

This script demonstrates how to train a single model class, embed the trained model in an optimization problem, and solve that problem. We pick one sample from our generated data and solve the optimization problem with every element of $\mathbf{x}$ fixed to that sample's values. In general, some elements of $\mathbf{x}$ are fixed (our "contextual variables") and the remaining elements are our decision variables. Fixing all elements of $\mathbf{x}$ lets us verify that the embedded model's prediction matches that of the original sklearn model.
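The verification logic can be written out explicitly. As a sketch, let $\hat{h}$ denote the trained predictive model, $y$ the learned outcome, and $\bar{\mathbf{x}}$ the chosen sample (these symbols are introduced here for illustration and are not part of the original notebook). With the learned outcome as the only objective term and every input fixed, the problem is

$$
\begin{aligned}
\min_{\mathbf{x},\, y} \quad & y \\
\text{s.t.} \quad & y = \hat{h}(\mathbf{x}), \\
& x_i = \bar{x}_i \quad \text{for all } i.
\end{aligned}
$$

The feasible region is the single point $(\bar{\mathbf{x}}, \hat{h}(\bar{\mathbf{x}}))$, so the optimal objective value must equal $\hat{h}(\bar{\mathbf{x}})$, the sklearn prediction for the sample. A gap between the two would point to an error in the embedding.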

## Load the relevant packages

In [1]:
import pandas as pd
import numpy as np
import math
from sklearn.utils.extmath import cartesian
import time
import sys
import os

from sklearn.metrics import roc_auc_score, r2_score, mean_squared_error
from sklearn.cluster import KMeans

In [2]:
import opticl
from pyomo import environ
from pyomo.environ import *

## Initialize data
We will work with a synthetic regression dataset generated by `sklearn`'s `make_regression`.

In [4]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=200, n_features = 20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=1)
X_train = pd.DataFrame(X_train).add_prefix('col')
X_test = pd.DataFrame(X_test).add_prefix('col')

## Train the chosen model type

In [5]:
# alg = 'rf' 
# alg_run = 'rf_shallow'
alg = alg_run = 'mlp'

The user can optionally supply a manual parameter grid for the cross-validation procedure. A default parameter grid is implemented; see **run_MLmodels.py** for details on the tuned parameters. To use the default, leave ``parameter_grid = None`` (or do not specify any grid).

In [6]:
parameter_grid = None
# parameter_grid = {'hidden_layer_sizes': [(5,),(10,)]}
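To make the size of a manual grid concrete, the sketch below enumerates every parameter combination a grid search would evaluate. It assumes the cross-validation follows the usual exhaustive grid-search convention, which this notebook does not confirm. The helper `expand_grid` and the extra `alpha` entry are hypothetical and added only for illustration.

```python
from itertools import product


def expand_grid(parameter_grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = list(parameter_grid)
    value_lists = [parameter_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical grid: 'hidden_layer_sizes' comes from the commented-out line
# above; 'alpha' is added here purely for illustration.
example_grid = {
    'hidden_layer_sizes': [(5,), (10,)],
    'alpha': [0.0001, 0.001],
}

candidates = expand_grid(example_grid)
for params in candidates:
    print(params)
```

Two values for each of two parameters give four candidates, and each candidate is fitted once per cross-validation fold, so grid size multiplies training time.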

In [7]:
from sklearn.neural_network import MLPRegressor, MLPClassifier

s = 0
version = 'test'
outcome = 'temp'

m, perf = opticl.run_model(X_train, y_train, X_test, y_test, alg_run, task = 'continuous', 
                       seed = s, cv_folds = 5, 
                       # The user can manually specify the parameter grid for cross-validation if desired
                       parameter_grid = parameter_grid,
                       save = False)

------------- Initialize grid  ----------------
------------- Running model  ----------------
Algorithm = mlp, metric = None
------------- Model evaluation  ----------------
-------------------training evaluation-----------------------
Train MSE: 8.017082756749335e-07
Train R2: 0.9999999999802882
-------------------testing evaluation-----------------------
Test MSE: 60.24952015596481
Test R2: 0.9987048913993389
------------- Save results  ----------------


After training the model, we save it in the format needed for embedding the constraints. See **ConstraintLearning.py** for the specific format that is extracted per method. We also save the model's performance for use in the automated model selection pipeline (if desired).

The save directory is created first if it does not exist.

In [8]:
# Create the results directory if it does not already exist
os.makedirs('results/%s/' % alg, exist_ok=True)

constraintL = opticl.ConstraintLearning(X_train, y_train, m, alg)
constraint_add = constraintL.constraint_extrapolation('r')
constraint_add.to_csv('results/%s/%s_%s_model.csv' % (alg, version, outcome), index = False)

perf.to_csv('results/%s/%s_%s_performance.csv' % (alg, version, outcome), index= False)

### Check: what should the result be for our sample observation, if all x are fixed?

#### Choose sample to test
This will be the observation that we feed into the optimization model.

In [9]:
sample_id = 1
sample = X_train.loc[sample_id:sample_id,:].reset_index(drop = True)

Calculate the model prediction directly in sklearn.

In [10]:
m.predict(sample)

array([290.34368228])
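It helps to see what this prediction computes, since the same arithmetic is what the optimization model has to reproduce. The sketch below is a plain-Python forward pass for a network with one hidden layer, assuming ReLU hidden activations and an identity output (the scikit-learn `MLPRegressor` defaults). The weights are hypothetical toy values, not those of the trained model `m`.

```python
def relu(value):
    """Rectified linear unit: max(0, value)."""
    return max(0.0, value)


def mlp_predict(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a one-hidden-layer ReLU network with a linear output."""
    hidden = [
        relu(sum(w * x_i for w, x_i in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical toy network: 2 inputs, 2 hidden nodes, 1 output.
hidden_weights = [[1.0, -2.0], [0.5, 0.5]]
hidden_biases = [0.0, -1.0]
output_weights = [3.0, 2.0]
output_bias = 1.0

toy_prediction = mlp_predict([2.0, 0.5], hidden_weights, hidden_biases,
                             output_weights, output_bias)
print(toy_prediction)  # 3 * 1.0 + 2 * 0.25 + 1.0 = 4.5
```

The only nonlinear step is the `max(0, ...)` in each hidden node, which is why embedding a ReLU network in a mixed-integer program needs one binary variable per hidden node.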

## Optimization formulation
We will embed the model trained above. The model could also be selected using the model selection pipeline, which we demonstrate in the WFP example script.
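For a ReLU network, embedding typically relies on a big-M encoding of each hidden node. The formulation below is a standard one, given as a sketch: the symbols $v_j$, $z_j$, $a_j$, $M_L$, and $M_U$ are notation assumed here, and the exact constraints that opticl generates may differ. For hidden node $j$ with pre-activation $a_j = \sum_i w_{ji} x_i + b_j$:

$$
\begin{aligned}
v_j &\ge a_j, \\
v_j &\le a_j + M_L (1 - z_j), \\
v_j &\le M_U z_j, \\
v_j &\ge 0, \quad z_j \in \{0, 1\}.
\end{aligned}
$$

When $z_j = 1$ the first two constraints force $v_j = a_j$, and when $z_j = 0$ the last two force $v_j = 0$, so $v_j = \max(0, a_j)$ provided $M_L \ge \max(0, -a_j)$ and $M_U \ge \max(0, a_j)$ over the feasible inputs.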

If manually specifying the model, as we are here, the key elements of the ``model_master`` dataframe are:
- model_type: algorithm name.
- outcome: name of outcome of interest; this is relevant in the case of multiple learned outcomes.
- save_path: file name of the extracted model.
- objective: the weight of the learned outcome when it is included as an additive term in the objective. A weight of 0 omits it from the objective entirely.
- lb/ub: the lower (or upper) bound that we wish to apply to the learned outcome. If there is no bound, it should be set to ``None``.

In this case, we set the outcome to be our only objective term, which will allow us to verify that the predictions are consistent between the embedded model and the sklearn prediction function.
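The roles of the `objective`, `lb`, and `ub` fields can be illustrated with a small stand-alone function. This is only a sketch of the semantics described in the list above: `evaluate_outcome` is a hypothetical helper, not part of opticl, and it checks a given prediction rather than building solver constraints.

```python
def evaluate_outcome(prediction, objective_weight, lb=None, ub=None):
    """Return (objective_term, is_feasible) for one learned outcome.

    objective_weight scales the prediction's contribution to the objective
    (0 removes it); lb and ub are optional bounds, with None meaning no bound.
    """
    is_feasible = ((lb is None or prediction >= lb)
                   and (ub is None or prediction <= ub))
    return objective_weight * prediction, is_feasible


# Setting used in this notebook: objective weight 1, no bounds.
objective_only = evaluate_outcome(290.344, objective_weight=1)

# Hypothetical alternative: treat the outcome purely as a constraint.
constraint_only = evaluate_outcome(290.344, objective_weight=0, ub=250)

print(objective_only)
print(constraint_only)
```

With weight 1 and no bounds, the outcome is always feasible and the objective equals the prediction, which is what makes the verification in this notebook possible.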

In [11]:
model_master = pd.DataFrame(columns = ['model_type','outcome','save_path','lb','ub','objective'])

model_master.loc[0,'model_type'] = alg
model_master.loc[0,'save_path'] = 'results/%s/%s_%s_model.csv' % (alg, version, outcome)
model_master.loc[0,'outcome'] = outcome
model_master.loc[0,'objective'] = 1
model_master.loc[0,'ub'] = None
model_master.loc[0,'lb'] = None

#### Pyomo

In [13]:
model_pyo = ConcreteModel()

## We will create our x decision variables, and fix them all to our sample's values for model verification.
N = X_train.columns
model_pyo.x = Var(N, domain=Reals)

def fix_value(model_pyo, index):
    return model_pyo.x[index] == sample.loc[0,index]

model_pyo.Constraint1 = Constraint(N, rule=fix_value)

## Specify any non-learned objective components - none here 
model_pyo.OBJ = Objective(expr=0, sense=minimize)

In [14]:
final_model_pyo = opticl.optimization_MIP(model_pyo, model_pyo.x, model_master, X_train, tr = False)
# final_model_pyo.pprint()
opt = SolverFactory('gurobi')
results = opt.solve(final_model_pyo) 

Embedding objective function for temp


#### Gurobipy

In [None]:
## Load files for gurobipy implementation (for comparison)
sys.path.append(os.path.abspath('../../opticl'))
import gurobipy as grb
import embed_mip_gurobi as em

In [12]:
model_grb = grb.Model()

## We will create our x decision variables, and fix them all to our sample's values for model verification.
N = X_train.columns
x = model_grb.addVars(N, vtype=grb.GRB.CONTINUOUS, name='x', lb = -math.inf)
model_grb.addConstrs(x[i] == sample.loc[0,i] for i in N)
model_grb.update()
final_model_grb = em.optimization_MIP(model_grb, x, model_master, X_train, tr = False)
final_model_grb.write('test.lp')
final_model_grb.optimize()

Academic license - for non-commercial use only - expires 2022-08-14
Using license file /Users/hollywiberg/gurobi.lic
Gurobi Optimizer version 9.1.2 build v9.1.2rc0 (mac64)
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads
Optimize a model with 51 rows, 41 columns and 481 nonzeros
Model fingerprint: 0xc1cf5a2a
Variable types: 31 continuous, 10 integer (10 binary)
Coefficient statistics:
  Matrix range     [1e-03, 1e+05]
  Objective range  [1e+00, 1e+00]
  Bounds range     [1e+00, 1e+00]
  RHS range        [7e-02, 1e+05]
Presolve removed 51 rows and 41 columns
Presolve time: 0.00s
Presolve: All rows and columns removed

Explored 0 nodes (0 simplex iterations) in 0.01 seconds
Thread count was 1 (of 8 available processors)

Solution count 1: 290.344 

Optimal solution found (tolerance 1.00e-04)
Best objective 2.903436822752e+02, best bound 2.903436822752e+02, gap 0.0000%
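The model sizes in this log (51 rows, 41 columns, 10 binaries) are consistent with a single hidden layer of 10 ReLU nodes under a big-M encoding: 20 fixing constraints, 3 constraints per node, and 1 output constraint give 51 rows, while 20 inputs, 10 node outputs, 10 binaries, and 1 outcome give 41 columns. This reading is an inference from the log, not something the notebook states. The sketch below checks by brute force why presolve can remove everything: once a node's pre-activation is fixed, the assumed big-M constraints admit exactly one output value. The function name and the constant `big_m` are hypothetical.

```python
def feasible_relu_outputs(pre_activation, big_m, candidates):
    """Return candidate v values satisfying the big-M ReLU constraints
    for at least one choice of the binary indicator z."""
    feasible = set()
    for z in (0, 1):
        for v in candidates:
            if (v >= pre_activation
                    and v <= pre_activation + big_m * (1 - z)
                    and v <= big_m * z
                    and v >= 0):
                feasible.add(v)
    return sorted(feasible)


# Candidate outputs on a grid from -5 to 5 in steps of 0.25.
candidates = [k / 4 for k in range(-20, 21)]

positive_case = feasible_relu_outputs(1.5, 10.0, candidates)
negative_case = feasible_relu_outputs(-2.25, 10.0, candidates)

print(positive_case)  # only max(0, 1.5) survives
print(negative_case)  # only max(0, -2.25) survives
```

Because every input is fixed, each node's output is determined in turn, leaving the solver nothing to search over.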


### Check for equality between sklearn and embedded models

In [15]:
print("True outcome: %.3f" % m.predict(sample)[0])
print("Gurobipy output: %.3f" % model_grb.ObjVal)
print("Pyomo output: %.3f" % final_model_pyo.OBJ())

True outcome: 290.344
Gurobipy output: 290.344
Pyomo output: 290.344
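Printing to three decimals can hide small discrepancies, so a tolerance-based comparison is a more robust check. The sketch below uses the sklearn prediction and the Gurobi best objective copied from the outputs above. The helper `predictions_match` and the tolerance `1e-6` are assumptions chosen for illustration.

```python
import math

# Values copied from the outputs above.
sklearn_prediction = 290.34368228
gurobi_objective = 2.903436822752e+02


def predictions_match(reference, embedded, rel_tol=1e-6):
    """Return True if the embedded value agrees with the reference
    to within the given relative tolerance."""
    return math.isclose(reference, embedded, rel_tol=rel_tol)


match = predictions_match(sklearn_prediction, gurobi_objective)
print("Match within tolerance:", match)
```

In the notebook itself, the same check could be applied to `m.predict(sample)[0]` and the solver objective values in place of the copied literals.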
