# Building symbolic metamodels

A symbolic metamodel takes a machine learning model as input and outputs a symbolic equation describing its response surface, as illustrated in the figure below. This notebook walks through the steps for building a symbolic metamodel of an XGBoost model fitted to the UCI "Absenteeism at work" dataset: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work

<img src="images/FigB1.png" width="640"/>

We first import the necessary modules for symbolic metamodeling.

In [1]:
from pysymbolic.algorithms.symbolic_metamodeling import *

Next, we import the libraries needed for data processing, splitting the data into training and testing samples, and fitting the XGBoost model.

In [2]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


data          = pd.read_csv("data/absenteeism.csv", delimiter=';')

feature_names = ['Transportation expense', 'Distance from Residence to Work',
                 'Service time', 'Age', 'Work load Average/day ', 'Hit target',
                 'Disciplinary failure', 'Education', 'Son', 'Social drinker',
                 'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index']

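# Rescale every feature to [0, 1] (the scaler is fitted on all rows, before the split)
scaler        = MinMaxScaler(feature_range=(0, 1))
X             = scaler.fit_transform(data[feature_names])
# Binary target: 1 if absenteeism exceeds 4 hours, else 0
Y             = ((data['Absenteeism time in hours'] > 4) * 1)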

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

model         = XGBClassifier()

model.fit(X_train, Y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=24, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
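The preprocessing in the cell above does two things: `MinMaxScaler` rescales each feature column to the range [0, 1], and the target is set to 1 when absenteeism exceeds 4 hours. A minimal pure-Python sketch of both steps on made-up numbers follows. The toy values and the helper names `min_max_scale` and `binarize` are illustrative and not part of the notebook.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: there is no spread to rescale, so map everything to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def binarize(hours, threshold=4):
    """Label 1 when the value is strictly above the threshold, else 0."""
    return [1 if h > threshold else 0 for h in hours]


ages = [27, 33, 45, 58]        # toy feature column (hypothetical values)
absent_hours = [2, 4, 8, 40]   # toy target column (hypothetical values)

scaled_ages = min_max_scale(ages)
labels = binarize(absent_hours)
print(scaled_ages)
print(labels)
```

Note that a value of exactly 4 hours falls in class 0, because the comparison is strict.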

Let us examine the AUC-ROC of the fitted XGBoost model on the test data.

In [3]:
roc_auc_score(Y_test, model.predict_proba(X_test)[:, 1])

0.700890272148234
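To make the number above easier to read: AUC-ROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on a tiny hand-made example. The helper `auc_by_pairs` and the toy labels and scores are assumptions for illustration; the notebook itself uses `roc_auc_score`.

```python
def auc_by_pairs(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = auc_by_pairs(y_toy, scores_toy)
print(auc_toy)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the score above sits between the two.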

How much better does XGBoost perform compared to a standard logistic regression?

In [4]:
model_L = LogisticRegression()

model_L.fit(X_train, Y_train)

roc_auc_score(Y_test, model_L.predict_proba(X_test)[:, 1])

0.6709250144759699

Now let us create a symbolic metamodel instance. The constructor takes the fitted model **model** and the training features **X_train**, as follows:

In [5]:
metamodel = symbolic_metamodel(model, X_train)

metamodel.fit(num_iter=10, batch_size=X_train.shape[0], learning_rate=.01)

---- Tuning the basis functions ----


  0%|          | 0/15 [00:00<?, ?it/s]

----  Optimizing the metamodel  ----


  0%|          | 0/10 [00:00<?, ?it/s]
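The fitting cell above passes `batch_size=X_train.shape[0]`, so each of the `num_iter=10` optimization steps uses the whole training set. The sketch below illustrates only the general idea behind metamodeling: fit a simple parametric formula to a black box's predictions (not to the labels) by full-batch gradient descent. It is not pysymbolic's algorithm. The `black_box` function, the surrogate form `sigmoid(a * x + b)`, and all hyperparameter values are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def black_box(x):
    """Hypothetical stand-in for a fitted model's predicted probability."""
    return sigmoid(3.0 * x - 1.0)


def fit_surrogate(xs, targets, num_iter=2000, learning_rate=0.5):
    """Fit g(x) = sigmoid(a*x + b) to the targets by full-batch gradient
    descent on the mean squared error between g(x) and the targets."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(num_iter):
        grad_a = grad_b = 0.0
        for x, t in zip(xs, targets):
            g = sigmoid(a * x + b)
            # Derivative of (g - t)^2 w.r.t. z, with g = sigmoid(z)
            common = 2.0 * (g - t) * g * (1.0 - g)
            grad_a += common * x
            grad_b += common
        # One update per pass over all points (batch size = data set size)
        a -= learning_rate * grad_a / n
        b -= learning_rate * grad_b / n
    return a, b


xs = [i / 20 for i in range(21)]             # inputs on [0, 1], like scaled features
targets = [black_box(x) for x in xs]         # the black box's outputs, not labels
a_hat, b_hat = fit_surrogate(xs, targets)
mse = sum((sigmoid(a_hat * x + b_hat) - t) ** 2 for x, t in zip(xs, targets)) / len(xs)
print(a_hat, b_hat, mse)
```

The point of the design is that the surrogate never sees the labels: it is trained to reproduce the black box's response surface, which is what makes the resulting formula a description of the model rather than of the data.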

Now let us see how this metamodel performs using the **evaluate** method...

In [6]:
Y_metamodel = metamodel.evaluate(X_test)

roc_auc_score(Y_test, Y_metamodel)

0.7468514765489288

On this test split, the metamodel's AUC-ROC (0.747) is close to, and slightly higher than, that of the original XGBoost model (0.701). Next, we look at the exact symbolic equation of the metamodel in terms of Meijer G-functions.

By invoking the **.fit()** method of the **symbolic_metamodel** class, we essentially map the XGBoost model into a space of interpretable symbolic equations, as shown in the figure below, without much loss in predictive accuracy.

<img src="images/FigB2.png" width="320"/>

Now we show how to extract the exact and approximate equations $g(x)$ from the metamodel object...

In [7]:
metamodel.exact_expression

1/(exp(-8.37077369171791*re(X0**4.88056967687008*hyper((1.0, 1.0), (4.30032465503742,), 2.05906167705875*X0*exp_polar(I*pi))) - 2.91481664479597e-20*re(X1**2.284041992711*hyper((1.0, 1.0), (2.38184499180744,), 2.99776479501188e-5*X1*exp_polar(I*pi))) + 1.3226567831806e-7*re(X10**2.21453176598295*hyper((1.0, 1.0), (2.29755416346768,), 0.015052326675425*X10*exp_polar(I*pi))) - 0.186301809769693*re(X11**2.39032547969092*hyper((1.0, 1.0), (2.43105549058114,), 0.776666756204049*X11*exp_polar(I*pi))) - 1.84461761074991e-17*re(X12**2.33746723859666*hyper((1.0, 1.0), (2.45038642754615,), 0.000214240699597301*X12*exp_polar(I*pi))) + 1.76587270831449e-7*re(X13**2.4559735429552*hyper((1.0, 1.0), (2.57692280582931,), 0.0345870204145743*X13*exp_polar(I*pi))) + 9.0784143183138e-18*re(X14**2.3287289826686*hyper((1.0, 1.0), (2.43848081408145,), 0.000217324022912984*X14*exp_polar(I*pi))) - 5.63260580259647e-14*re(X2**2.38807927992325*hyper((1.0, 1.0), (2.50603582049886,), 0.00110190752161027*X2*exp_pol

Because this equation involves hypergeometric functions, we might prefer to work with the polynomial approximation...

In [8]:
metamodel.approx_expression

1/(2.22950777571167*exp(-0.235503679692941*X0**3*X1**3 + 0.252193322947565*X0**3*X10**3 - 0.333256000886837*X0**3*X11**3 + 0.912406413355178*X0**3*X12**3 + 0.0934610860237479*X0**3*X13**3 + 0.25669417593652*X0**3*X14**3 - 0.197944339187641*X0**3*X2**3 - 0.287313775801068*X0**3*X3**3 + 0.355009022752357*X0**3*X4**3 - 0.570540288538447*X0**3*X5**3 - 0.671922114423197*X0**3*X6**3 + 0.261718813935586*X0**3*X7**3 - 0.828798555389674*X0**3*X8**3 + 0.348030643322906*X0**3*X9**3 - 14.5191539080898*X0**3 + 0.0736107782771074*X0**2*X1**2 - 0.0856300457299928*X0**2*X10**2 + 0.138886226924399*X0**2*X11**2 - 0.340934919826468*X0**2*X12**2 - 0.0410897662513246*X0**2*X13**2 - 0.100470670960164*X0**2*X14**2 + 0.0782029649582984*X0**2*X2**2 + 0.129186226988679*X0**2*X3**2 - 0.147641373354917*X0**2*X4**2 + 0.204871110464076*X0**2*X5**2 + 0.355947767662687*X0**2*X6**2 - 0.117996408666321*X0**2*X7**2 + 0.366287253537644*X0**2*X8**2 - 0.187996404830256*X0**2*X9**2 + 13.6052230325643*X0**2 + 2.7099614672001
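How pysymbolic derives `approx_expression` is not shown in this notebook, so the following is only a generic illustration of why a hypergeometric term can be replaced by a polynomial. The `hyper((1.0, 1.0), (b,), z)` terms in the exact expression denote the hypergeometric function 2F1(1, 1; b; z), whose power series has coefficients n! / (b)_n, where (b)_n is the rising factorial. Truncating that series gives a polynomial in z. The helper name `hyp2f1_11b_series`, the choice b = 2 (for which the closed form -ln(1 - z) / z is known), and the evaluation point are assumptions made for this sketch.

```python
import math


def hyp2f1_11b_series(b, z, degree):
    """Truncated power series of 2F1(1, 1; b; z), up to and including z**degree.

    The n-th coefficient is (1)_n (1)_n / ((b)_n n!) = n! / (b)_n, where
    (b)_n = b (b + 1) ... (b + n - 1). The series converges for |z| < 1.
    """
    total = 0.0
    coeff = 1.0  # coefficient for n = 0
    for n in range(degree + 1):
        total += coeff * z ** n
        # Ratio of consecutive coefficients: c_{n+1} / c_n = (n + 1) / (b + n)
        coeff *= (n + 1) / (b + n)
    return total


z = -0.5
exact = -math.log(1.0 - z) / z                   # closed form of 2F1(1, 1; 2; z)
cubic = hyp2f1_11b_series(2.0, z, degree=3)      # cubic polynomial approximation
longer = hyp2f1_11b_series(2.0, z, degree=30)    # many more terms for comparison
print(exact, cubic, longer)
```

At this point the cubic truncation is already within 0.01 of the closed form, which is the trade-off a polynomial approximation offers: a much simpler formula for a small loss of accuracy inside the region of convergence.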