# Building symbolic metamodels

A symbolic metamodel takes a machine learning model as input and outputs a symbolic equation describing its response surface, as illustrated in the figure below. This notebook walks through the steps for building a symbolic metamodel for an XGBoost model fitted to the UCI "Absenteeism at work" dataset: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work

<img src="images/FigB1.png" width="640"/>

We first import the necessary modules for symbolic metamodeling.

In [1]:
from pysymbolic.algorithms.symbolic_metamodeling import *

Next, we import the libraries needed for data processing, splitting the data into training and testing samples, and fitting the XGBoost model.

In [2]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


data          = pd.read_csv("data/absenteeism.csv", delimiter=';')

feature_names = ['Transportation expense', 'Distance from Residence to Work',
                 'Service time', 'Age', 'Work load Average/day ', 'Hit target',
                 'Disciplinary failure', 'Education', 'Son', 'Social drinker',
                 'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index']

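# Rescale every feature to the range [0, 1]
scaler        = MinMaxScaler(feature_range=(0, 1))
X             = scaler.fit_transform(data[feature_names])
# Binary label: 1 if the absence lasted more than 4 hours, 0 otherwise
Y             = ((data['Absenteeism time in hours'] > 4) * 1)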

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

model         = XGBClassifier()

model.fit(X_train, Y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=16, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
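The two preprocessing steps in the cell above (min-max scaling of the features and thresholding the absence time at 4 hours) can be illustrated on a toy example. This is a minimal sketch in plain Python; the helper names `min_max_scale` and `binarize_hours` and the toy numbers are hypothetical and are not part of the notebook or of scikit-learn.

```python
# Toy illustration of the preprocessing above; all numbers are made up.

def min_max_scale(column):
    """Rescale a list of numbers to [0, 1], one feature at a time."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0.0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def binarize_hours(hours, threshold=4):
    """Return 1 when the absence exceeds the threshold, else 0."""
    return [1 if h > threshold else 0 for h in hours]


ages = [27, 33, 58, 40]    # hypothetical 'Age' values
hours = [2, 4, 8, 40]      # hypothetical absence times in hours

scaled_ages = min_max_scale(ages)
labels = binarize_hours(hours)

print(scaled_ages)
print(labels)
```

Note that an absence of exactly 4 hours falls in class 0, because the comparison is strict.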

Let us examine the AUC-ROC of the fitted XGBoost model on the test data.

In [3]:
roc_auc_score(Y_test, model.predict_proba(X_test)[:, 1])

0.700890272148234
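To make the number above easier to read: the AUC-ROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all positive/negative pairs. The function name `pairwise_auc` and the toy data are illustrative only; in practice `roc_auc_score` should be used, as in the notebook.

```python
def pairwise_auc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(pos) * len(neg))


y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, s_toy)
print(auc_toy)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.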

How much better does XGBoost perform compared to a standard logistic regression?

In [4]:
model_L = LogisticRegression()

model_L.fit(X_train, Y_train)

roc_auc_score(Y_test, model_L.predict_proba(X_test)[:, 1])

0.6709250144759699

Now let us create a symbolic metamodel instance. The constructor takes the fitted model **model**, the training features **X_train**, and the string 'regression' as a third argument:

In [5]:
X_train.shape

(495, 15)

In [6]:
X_train.shape[0]

495

In [7]:
metamodel = symbolic_metamodel(model, X_train, 'regression')

metamodel.fit(num_iter=10, batch_size=X_train.shape[0], learning_rate=.01)

---- Tuning the basis functions ----


  0%|          | 0/15 [00:00<?, ?it/s]

----  Optimizing the metamodel  ----


  0%|          | 0/10 [00:00<?, ?it/s]
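The arguments `num_iter`, `batch_size` and `learning_rate` suggest an iterative, gradient-based fit of the metamodel to the black-box model's outputs. The sketch below shows that general idea on a one-dimensional toy problem: a two-parameter surrogate is tuned by full-batch gradient descent to match a stand-in black box. It is not pysymbolic's actual procedure; `black_box`, `surrogate`, `fit_surrogate` and all constants are assumptions made for illustration.

```python
import math


def black_box(x):
    # Stand-in for a fitted model's predicted probability (hypothetical).
    return 1.0 / (1.0 + math.exp(-(3.0 * x - 1.0)))


def surrogate(x, a, b):
    # Simple parametric surrogate: a sigmoid of a linear function of x.
    return 1.0 / (1.0 + math.exp(-(a * x + b)))


def mse(xs, a, b):
    # Mean squared gap between surrogate and black-box outputs.
    return sum((surrogate(x, a, b) - black_box(x)) ** 2 for x in xs) / len(xs)


def fit_surrogate(xs, num_iter=2000, learning_rate=0.5):
    """Full-batch gradient descent on the mean squared gap."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(num_iter):
        grad_a = grad_b = 0.0
        for x in xs:
            g = surrogate(x, a, b)
            # Chain rule: d/dz (g - f)^2 = 2 (g - f) g (1 - g) for g = sigmoid(z)
            common = 2.0 * (g - black_box(x)) * g * (1.0 - g)
            grad_a += common * x / n
            grad_b += common / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


xs = [i / 20 for i in range(21)]          # inputs in [0, 1], like scaled features
loss_before = mse(xs, 0.0, 0.0)
a_hat, b_hat = fit_surrogate(xs)
loss_after = mse(xs, a_hat, b_hat)
print(loss_before, loss_after)
```

The key point is that the surrogate is trained against the black box's outputs rather than against the true labels.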

Now let us see how this metamodel performs using the **evaluate** method...

In [8]:
Y_metamodel = metamodel.evaluate(X_test)

roc_auc_score(Y_test, Y_metamodel)

0.6922408801389693
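The AUC-ROC above compares the metamodel's scores with the true labels. Since a metamodel is meant to describe the black-box model's response surface, it can also be useful to measure how closely its outputs track the black box's own predictions. The sketch below uses two simple agreement measures, the mean absolute gap and the Pearson correlation. The function names and the toy scores are hypothetical.

```python
import math


def mean_abs_gap(p_model, p_meta):
    """Average absolute difference between two score vectors."""
    return sum(abs(a - b) for a, b in zip(p_model, p_meta)) / len(p_model)


def pearson(u, v):
    """Pearson correlation between two equally long score vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


# Made-up scores standing in for black-box and metamodel outputs
p_model_toy = [0.10, 0.35, 0.60, 0.90]
p_meta_toy = [0.15, 0.30, 0.65, 0.85]

gap = mean_abs_gap(p_model_toy, p_meta_toy)
corr = pearson(p_model_toy, p_meta_toy)
print(gap, corr)
```

With the notebook's variables, one would pass `model.predict_proba(X_test)[:, 1]` and `metamodel.evaluate(X_test)`, assuming `evaluate` returns one score per test row, as its use with `roc_auc_score` in the cell above implies.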

The metamodel's test AUC-ROC (0.692) is close to that of the original XGBoost model (0.701). Next, we look at the exact symbolic equation of the metamodel, expressed in terms of Meijer G-functions.

By invoking the **.fit()** method of the **symbolic_metamodel** class, we essentially map the XGBoost model into a space of interpretable symbolic equations, as shown in the figure below, without much loss in predictive accuracy.

<img src="images/FigB2.png" width="320"/>

Now we show how to extract the exact and approximate forms of the equation $g(x)$ from the metamodel object...

In [9]:
metamodel.exact_expression

1/(exp(0.245800578798176*re(X0**1.30856567492641*hyper((1.0, 1.0), (1.22555170654358,), 1.38907279877692*X0*exp_polar(I*pi))) - 0.382699148949428*re(X1**0.529163036035237*hyper((1.0, 1.0), (0.856074182807829,), 1.57734272324556*X1*exp_polar(I*pi))) + 0.228277444647308*re(X10**2.03305606307999*hyper((1.0, 1.0), (2.05093948636499,), 0.783319767733504*X10*exp_polar(I*pi))) + 3.64657903528539*re(X11**0.862032977081594*hyper((1.0, 1.0), (1.15556155348886,), 2.66746107739585*X11*exp_polar(I*pi))) - 1.15624459855416*re(X12**0.365401485447434*hyper((1.0, 1.0), (0.800280269202145,), 2.61240002054092*X12*exp_polar(I*pi))) + 0.226943960966134*re(X13**0.660197564559027*hyper((1.0, 1.0), (1.0188109089646,), 1.43410992705438*X13*exp_polar(I*pi))) - 1.35240378687709*re(X14**0.268530405361707*hyper((1.0, 1.0), (0.85730171418828,), 2.1373582725263*X14*exp_polar(I*pi))) - 0.473694990650615*re(X2**0.734035974987053*hyper((1.0, 1.0), (0.77767111879157,), 1.91846247568242*X2*exp_polar(I*pi))) - 0.515385617

Because this equation involves hypergeometric functions, we might prefer to work with the approximate expression, in which polynomial terms take the place of the hypergeometric functions:

In [10]:
metamodel.approx_expression

1/(0.371813217591705*exp(-0.0318210928949518*X0**3*X1**3 + 0.0175700241430365*X0**3*X10**3 + 0.0134779059163627*X0**3*X11**3 + 0.120184748028559*X0**3*X12**3 + 0.0512547510986315*X0**3*X13**3 + 0.0315071953857997*X0**3*X14**3 - 0.0468587725536519*X0**3*X2**3 - 0.0138324955577671*X0**3*X3**3 + 0.0515908165663782*X0**3*X4**3 - 0.00255179484023309*X0**3*X5**3 - 0.0907101254562517*X0**3*X6**3 + 0.0358805715971848*X0**3*X7**3 - 0.0112474361664426*X0**3*X8**3 + 0.0170554692503384*X0**3*X9**3 + 0.00052690378920895*X0**3 + 0.0131973502087724*X0**2*X1**2 - 0.00791189684425721*X0**2*X10**2 - 0.00593878574153117*X0**2*X11**2 - 0.0533769577577954*X0**2*X12**2 - 0.0227705170624254*X0**2*X13**2 - 0.0140177844388415*X0**2*X14**2 + 0.0209302071750483*X0**2*X2**2 + 0.00619055056852978*X0**2*X3**2 - 0.0226503485716472*X0**2*X4**2 + 0.00113872189592208*X0**2*X5**2 + 0.0410417052928694*X0**2*X6**2 - 0.0161284050611851*X0**2*X7**2 + 0.00491400793300847*X0**2*X8**2 - 0.00728376025821415*X0**2*X9**2 - 0.0304

In [11]:
str(metamodel.approx_expression)

'1/(0.371813217591705*exp(-0.0318210928949518*X0**3*X1**3 + 0.0175700241430365*X0**3*X10**3 + 0.0134779059163627*X0**3*X11**3 + 0.120184748028559*X0**3*X12**3 + 0.0512547510986315*X0**3*X13**3 + 0.0315071953857997*X0**3*X14**3 - 0.0468587725536519*X0**3*X2**3 - 0.0138324955577671*X0**3*X3**3 + 0.0515908165663782*X0**3*X4**3 - 0.00255179484023309*X0**3*X5**3 - 0.0907101254562517*X0**3*X6**3 + 0.0358805715971848*X0**3*X7**3 - 0.0112474361664426*X0**3*X8**3 + 0.0170554692503384*X0**3*X9**3 + 0.00052690378920895*X0**3 + 0.0131973502087724*X0**2*X1**2 - 0.00791189684425721*X0**2*X10**2 - 0.00593878574153117*X0**2*X11**2 - 0.0533769577577954*X0**2*X12**2 - 0.0227705170624254*X0**2*X13**2 - 0.0140177844388415*X0**2*X14**2 + 0.0209302071750483*X0**2*X2**2 + 0.00619055056852978*X0**2*X3**2 - 0.0226503485716472*X0**2*X4**2 + 0.00113872189592208*X0**2*X5**2 + 0.0410417052928694*X0**2*X6**2 - 0.0161284050611851*X0**2*X7**2 + 0.00491400793300847*X0**2*X8**2 - 0.00728376025821415*X0**2*X9**2 - 0.030
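Once an expression is available as a string, it can be parsed and turned into an ordinary numerical function. The printed output (with `re`, `hyper` and `exp_polar`) suggests that the expressions are SymPy objects; that is an assumption here. The sketch below uses a short made-up expression in the same `X0`, `X1`, ... naming, since the output shown above is truncated and cannot be parsed as is.

```python
import math
import sympy as sp

# Made-up expression in the same style as the metamodel output (not the real one)
expr_text = "1/(1 + exp(-(0.8*X0**2 - 0.3*X0*X1 + 0.1)))"
expr = sp.sympify(expr_text)

# Order the symbols by feature index; plain string sorting would put X10 before X2
variables = sorted(expr.free_symbols, key=lambda s: int(s.name[1:]))

# Compile the symbolic expression into a plain Python function
predict = sp.lambdify(variables, expr, modules="math")

value = predict(0.5, 0.2)   # X0 = 0.5, X1 = 0.2
print(variables, value)
```

Sorting by the numeric suffix matters for expressions with more than ten features, where the printed term order (X0, X1, X10, X11, ...) does not match the column order of the data.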