# Model Prediction Verification

This script demonstrates how to train a single model class, embed the trained model in an optimization problem, and solve that problem for *regression* tasks (i.e., continuous outcome prediction). We pick one sample from our generated data and solve the optimization problem with every element of $\mathbf{x}$ fixed to that sample's values. In general, some elements of $\mathbf{x}$ are fixed (our "contextual variables") and the remaining elements are our decision variables. By fixing all elements of $\mathbf{x}$, we can verify that the prediction of the embedded model matches that of the original sklearn model.
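
The check can also be written compactly. The notation here is shorthand introduced for illustration and is not defined by the package: $\hat{h}$ denotes the trained predictor, $\bar{\mathbf{x}}$ the chosen sample, and $y$ the learned outcome.

$$
\begin{aligned}
\min_{\mathbf{x},\, y} \quad & y \\
\text{s.t.} \quad & y = \hat{h}(\mathbf{x}), \\
& x_i = \bar{x}_i \quad \text{for all } i .
\end{aligned}
$$

Because every $x_i$ is fixed, the feasible set contains a single point, so the optimal objective value should equal $\hat{h}(\bar{\mathbf{x}})$, which is the sklearn prediction for the sample.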

## Load the relevant packages

In [1]:
import pandas as pd
import numpy as np
import math
from sklearn.utils.extmath import cartesian
import time
import sys
import os

from sklearn.metrics import roc_auc_score, r2_score, mean_squared_error
from sklearn.cluster import KMeans

In [2]:
import opticl
from pyomo import environ
from pyomo.environ import *

## Initialize data
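We will work with a synthetic regression dataset generated with `make_regression` from `sklearn` (200 samples, 20 features), split into training and test sets.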

In [3]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=200, n_features = 20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=1)
X_train = pd.DataFrame(X_train).add_prefix('col')
X_test = pd.DataFrame(X_test).add_prefix('col')
X_train

,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
0,1.584662,0.018317,-0.218921,0.554458,0.532759,-1.507076,-0.004103,1.706662,0.064656,1.135484,1.301202,-0.376086,1.203994,-1.226023,-0.776604,-0.123396,0.958069,0.260800,-1.072332,0.712571
1,0.127315,0.664080,-0.588780,0.317858,-0.426667,-0.072892,0.068032,-0.628463,0.858286,-0.070950,0.307331,0.186212,1.004093,0.339488,1.394081,-1.311324,0.995704,0.873006,0.757328,0.292931
2,-0.097976,0.578025,-0.776809,-1.559095,-0.670203,0.465758,1.130605,0.510315,-0.730213,2.378240,-0.792741,-0.226675,1.131265,-0.253619,1.393016,-0.284635,-2.739142,1.050003,-0.306036,-0.457663
3,-0.667930,-1.083960,-0.633375,0.955877,0.261402,-1.520746,-1.217973,-0.057682,0.852833,0.004939,-0.788276,0.976803,0.579087,1.919169,-0.536705,-2.160269,-1.993137,0.844518,-0.079740,0.477895
4,0.680070,-0.517094,-0.174703,2.190700,-1.896361,-0.287308,0.901487,-0.248635,0.213534,-0.319802,-0.646917,0.986335,0.248799,0.043669,0.495211,-0.997027,2.528326,-0.296641,1.331457,-0.226314
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,0.614726,1.495885,2.293718,-0.214654,1.021248,0.353567,-0.477124,1.037039,-1.019520,-0.348984,0.524750,-0.830011,0.599213,0.672620,0.606404,0.675454,-0.035990,-1.470237,1.005687,2.428877
146,-2.506441,-0.135977,0.956122,-0.237942,1.155288,-1.942589,1.122328,-0.106794,1.192686,-2.114164,0.438166,-0.705841,0.282676,1.451429,0.621083,-0.797270,-0.997020,-0.826097,-2.037201,-0.618037
147,-1.004765,-0.520928,-0.068925,-0.498428,-0.304744,1.707986,-0.486471,-0.491599,-0.215651,0.754817,-0.162497,0.752761,1.254887,-0.454928,-0.091675,-0.859160,0.088282,0.241109,0.007260,1.986539
148,-0.840332,-0.880045,-0.762087,0.475760,0.373943,-1.642020,-0.638848,1.955754,-0.009185,1.271048,-0.879264,1.053166,-0.022013,0.473292,-0.288755,-0.653662,0.855587,0.338300,1.228505,-0.675083
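
As a quick sanity check on the split, the sketch below rebuilds the data with the same calls and inspects the resulting shapes. It relies on `train_test_split` holding out 25% of the rows when no `test_size` is given, so 200 samples should yield 150 training rows and 50 test rows.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Same generation and split calls as in the cell above.
X, y = make_regression(n_samples=200, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# add_prefix turns the integer column labels 0..19 into 'col0'..'col19'.
X_train = pd.DataFrame(X_train).add_prefix('col')
X_test = pd.DataFrame(X_test).add_prefix('col')

print(X_train.shape, X_test.shape)
print(list(X_train.columns[:3]))
```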


## Train the chosen model type

In [4]:
# alg = 'rf' 
alg = 'gbm'
task_type = 'continuous'

The user can optionally supply a manual parameter grid for the cross-validation procedure. A default parameter grid is implemented; see **run_MLmodels.py** for details on the tuned parameters. To use the default, leave `parameter_grid = None` (or do not specify any grid).

In [5]:
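parameter_grid = None
# Example of a manual grid. Note that `hidden_layer_sizes` is a neural-network
# (MLP) parameter, so this particular grid would not apply to 'gbm'.
# parameter_grid = {'hidden_layer_sizes': [(5,),(10,)]}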
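
For a gradient boosting model, a manual grid might look like the sketch below. It assumes that the grid keys are hyperparameter names of the underlying sklearn estimator and that the grid is consumed by a cross-validated grid search. Both are assumptions about the package internals, so check **run_MLmodels.py** before relying on them. The sketch uses sklearn's `GridSearchCV` directly as a stand-in, and the grid values are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=200, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Hypothetical grid: keys are GradientBoostingRegressor hyperparameters.
parameter_grid = {
    'n_estimators': [10, 20],
    'max_depth': [2, 3],
    'learning_rate': [0.1, 0.2],
}

# Stand-in for the cross-validation step, using five folds.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=1),
    param_grid=parameter_grid,
    cv=5,
    scoring='r2',
)
search.fit(X_train, y_train)
print(search.best_params_)
```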

In [6]:
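s = 1                 # random seed
version = 'test'
outcome = 'temp'      # name of the learned outcome

# File that will hold the extracted model
model_save = 'results/%s/%s_%s_model.csv' % (alg, version, outcome)

# 'rf' is trained as the 'rf_shallow' variant; other algorithms keep their name
alg_run = alg if alg != 'rf' else 'rf_shallow'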
m, perf = opticl.run_model(X_train, y_train, X_test, y_test, alg_run, outcome, task = task_type, 
                       seed = s, cv_folds = 5, 
                       # The user can manually specify the parameter grid for cross-validation if desired
                       parameter_grid = parameter_grid,
                       save_path = model_save,
                       save = False)

------------- Initialize grid  ----------------
------------- Running model  ----------------
Algorithm = gbm, metric = None
saving... results/gbm_temp_trained.pkl
------------- Model evaluation  ----------------
-------------------training evaluation-----------------------
Train MSE: 4314.00082576947
Train R2: 0.8939305706172604
-------------------testing evaluation-----------------------
Test MSE: 17814.940522763252
Test R2: 0.6170544988313675


After training the model, we save it in the format needed for embedding the constraints. See **constraint_learning.py** for the specific format that is extracted for each method. We also save the performance of the model for use in the automated model selection pipeline (if desired).

The save directory is created first if it does not already exist.

In [7]:
# Create the save directory if needed
os.makedirs('results/%s/' % alg, exist_ok=True)

# Extract the trained model into the tabular format used for embedding
constraintL = opticl.ConstraintLearning(X_train, y_train, m, alg)
constraint_add = constraintL.constraint_extrapolation(task_type)
constraint_add.to_csv(model_save, index = False)

# Save the performance metrics for the (optional) model selection pipeline
perf.to_csv('results/%s/%s_%s_performance.csv' % (alg, version, outcome), index= False)

In [8]:
constraint_add

,Tree_id,ID,col0,col1,col2,col3,col4,col5,col6,col7,...,col14,col15,col16,col17,col18,col19,threshold,prediction,initial_prediction,learning_rate
0,0,1,0.0,0,0.0,0,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0,0.0,0.0,0.362847,-147.677247,3.181564,0.2
0,0,1,0.0,0,0.0,0,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0,0.0,0.0,0.040611,-147.677247,3.181564,0.2
0,0,2,0.0,0,0.0,0,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0,0.0,0.0,0.362847,26.265507,3.181564,0.2
0,0,2,0.0,0,0.0,0,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0,0.0,0.0,-0.040612,26.265507,3.181564,0.2
0,0,3,0.0,0,0.0,0,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0,0.0,0.0,-0.362848,-36.183599,3.181564,0.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,19,2,0.0,0,0.0,0,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0,0.0,0.0,-0.653505,29.789185,3.181564,0.2
0,19,3,0.0,-1,0.0,0,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0,0.0,0.0,-1.771010,117.421701,3.181564,0.2
0,19,3,0.0,1,0.0,0,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0,0.0,0.0,1.913692,117.421701,3.181564,0.2
0,19,4,0.0,-1,0.0,0,0.0,0.0,0.0,0.0,...,0,0.0,0.0,0,0.0,0.0,-1.771010,71.336522,3.181564,0.2
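
One way to read this table, stated here as an assumption rather than as documented behavior: rows sharing a (`Tree_id`, `ID`) pair describe one leaf, each row encodes a linear condition `sum(coef_i * x_i) <= threshold` over the feature columns, and `prediction` is the value of that leaf. Under the usual gradient boosting formula, the ensemble output is `initial_prediction + learning_rate * (sum of the active leaf values across trees)`. The sketch below applies that reading to a small hand-made table. The rows and the helper `evaluate_leaf_table` are hypothetical and are not taken from the output above.

```python
def evaluate_leaf_table(rows, x, initial_prediction, learning_rate):
    """Evaluate a boosted-tree ensemble stored as leaf constraints.

    rows: list of dicts with keys 'Tree_id', 'ID', 'coef' (feature -> weight),
          'threshold', and 'prediction'.
    x:    dict mapping feature name -> value.
    """
    # Group rows by leaf and track whether every condition of that leaf holds.
    leaves = {}
    for row in rows:
        key = (row['Tree_id'], row['ID'])
        lhs = sum(w * x[f] for f, w in row['coef'].items())
        holds = lhs <= row['threshold']
        active, _ = leaves.get(key, (True, None))
        leaves[key] = (active and holds, row['prediction'])

    # Sum the value of the single active leaf in each tree.
    total = 0.0
    for tree in {t for t, _ in leaves}:
        active_values = [v for (t, _), (a, v) in leaves.items() if t == tree and a]
        assert len(active_values) == 1, "expected exactly one active leaf per tree"
        total += active_values[0]
    return initial_prediction + learning_rate * total


# Hypothetical two-tree ensemble; each tree has a single split on 'col1'.
rows = [
    {'Tree_id': 0, 'ID': 1, 'coef': {'col1': 1}, 'threshold': 0.5, 'prediction': -10.0},
    {'Tree_id': 0, 'ID': 2, 'coef': {'col1': -1}, 'threshold': -0.5, 'prediction': 20.0},
    {'Tree_id': 1, 'ID': 1, 'coef': {'col1': 1}, 'threshold': 1.0, 'prediction': 4.0},
    {'Tree_id': 1, 'ID': 2, 'coef': {'col1': -1}, 'threshold': -1.0, 'prediction': -6.0},
]
result = evaluate_leaf_table(rows, {'col1': 0.8}, initial_prediction=3.0, learning_rate=0.2)
print(result)
```

With `col1 = 0.8`, the second leaf of tree 0 and the first leaf of tree 1 are active, so the leaf values 20 and 4 are summed, scaled by the learning rate, and added to the initial prediction.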


### Check: what should the result be for our sample observation, if all x are fixed?

#### Choose sample to test
This is the observation that we feed into the optimization model.

In [9]:
sample_id = 1
sample = X_train.loc[sample_id:sample_id,:].reset_index(drop = True)

Calculate the model prediction for this sample directly with sklearn.

In [10]:
m.predict(sample)

array([182.75931286])

## Optimization formulation
We will embed the model trained above. The model could also be selected using the model selection pipeline, which we demonstrate in the WFP example script.

When the model is specified manually, as it is here, the key elements of the ``model_master`` dataframe are:
- model_type: algorithm name.
- outcome: name of the outcome of interest; this is relevant when there are multiple learned outcomes.
- save_path: file name of the extracted model.
- objective: the weight of the learned outcome when it is included as an additive term in the objective. A weight of 0 omits it from the objective entirely.
- lb/ub: the lower and upper bounds to apply to the learned outcome. If there is no bound, set the value to ``None``.
- task: the task type (``'continuous'`` here).
- features: the list of feature columns used by the model.

In this case, we set the outcome to be our only objective term, which will allow us to verify that the predictions are consistent between the embedded model and the sklearn prediction function.

In [11]:
model_master = pd.DataFrame(columns = ['model_type','outcome','save_path','lb','ub','objective'])

model_master.loc[0,'model_type'] = alg
model_master.loc[0,'save_path'] = 'results/%s/%s_%s_model.csv' % (alg, version, outcome)
model_master.loc[0,'outcome'] = outcome
model_master.loc[0,'objective'] = 1
model_master.loc[0,'ub'] = None
model_master.loc[0,'lb'] = None
model_master.loc[0,'task'] = task_type
model_master['SCM_counterfactuals'] = None
model_master['features'] = [[col for col in X_train.columns]]

#### Solve with Pyomo

In [12]:
model_pyo = ConcreteModel()

## We will create our x decision variables, and fix them all to our sample's values for model verification.
N = X_train.columns
model_pyo.x = Var(N, domain=Reals)

def fix_value(model_pyo, index):
    return model_pyo.x[index] == sample.loc[0,index]

model_pyo.Constraint1 = Constraint(N, rule=fix_value)

## Specify any non-learned objective components - none here 
model_pyo.OBJ = Objective(expr=0, sense=minimize)

In [13]:
# Embed the learned model; tr = False means no trust-region constraint is added
final_model_pyo = opticl.optimization_MIP(model_pyo, model_pyo.x, model_master, X_train, tr = False)
# final_model_pyo.pprint()
opt = SolverFactory('gurobi')
results = opt.solve(final_model_pyo)

Embedding objective function for temp


### Check for equality between sklearn and embedded models

In [14]:
print("True outcome: %.3f" % m.predict(sample)[0])
print("Pyomo output: %.3f" % final_model_pyo.OBJ())

True outcome: 182.759
Pyomo output: 182.759
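
Comparing values printed to three decimals is a visual check. A programmatic comparison with an explicit tolerance is more robust, since a MIP solver returns values only up to its numerical tolerances. The helper below is a hypothetical sketch using the standard library. The tolerances are illustrative, and the second number is a stand-in for the value returned by `final_model_pyo.OBJ()`.

```python
import math

def predictions_match(sklearn_value, embedded_value, rel_tol=1e-6, abs_tol=1e-6):
    """Return True when the two predictions agree within the given tolerances."""
    return math.isclose(sklearn_value, embedded_value, rel_tol=rel_tol, abs_tol=abs_tol)

sklearn_value = 182.75931286      # m.predict(sample)[0] from the output above
embedded_value = 182.75931290     # stand-in for final_model_pyo.OBJ()

print(predictions_match(sklearn_value, embedded_value))
print(predictions_match(sklearn_value, 183.0))
```

In the notebook itself, the same helper could be called as `predictions_match(m.predict(sample)[0], final_model_pyo.OBJ())`.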
