In [1]:
# ML_in_Finance-Interaction
# Author: Matthew Dixon
# Version: 1.0 (08.09.2019)
# License: MIT
# Email: matthew.dixon@iit.edu
# Notes: tested on Mac OS X with Python 3.6.9 and the following packages:
# numpy=1.18.1, keras=2.3.1, tensorflow=2.0.0, statsmodels=0.10.1, scikit-learn=0.22.1
# Citation: Please cite the following reference if this notebook is used for research purposes:
# Dixon M.F., I. Halperin and P. Bilokon, Machine Learning in Finance: From Theory to Practice, Springer Graduate textbook Series, 2020. 

In [None]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from keras.wrappers.scikit_learn import KerasRegressor
import statsmodels.api as sm
import sklearn

# Overview
The purpose of this notebook is to illustrate a neural network interpretability method that is consistent with linear regression, including a model with an interaction term.

In linear regression, provided the independent variables are scaled, one can view the regression coefficients as a measure of the importance of the variables and of their interaction effect. Equivalently, the dependent variable can be differentiated w.r.t. the inputs to give the coefficients (evaluated at the origin, or averaged over zero-mean inputs when an interaction term is present), with the interaction coefficient obtained from the cross-term in the Hessian.

Similarly, the derivatives of the network w.r.t. the inputs are a non-linear generalization of interpretability in a linear regression model with interaction effects. Moreover, we should expect the neural network gradients to approximate the regression model coefficients when the data is generated by a linear regression model with interaction terms. 
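
As a small illustration of the point above, the following sketch differentiates a linear model with an interaction term numerically. The coefficient values, the evaluation point `x0` and the helper names `gradient_fd` and `hessian_fd` are assumptions made for this example only. It suggests that the gradient equals the coefficients at the origin and that the off-diagonal Hessian entry equals the interaction coefficient.

```python
import numpy as np

# Hypothetical coefficients for Y = b0 + b1*X1 + b2*X2 + b12*X1*X2
b0, b1, b2, b12 = 0.0, 1.0, 1.0, 1.0

def f(x):
    """Linear regression model with an interaction term."""
    return b0 + b1 * x[0] + b2 * x[1] + b12 * x[0] * x[1]

def gradient_fd(f, x, h=1e-5):
    """Central finite-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        g[j] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian_fd(f, x, h=1e-4):
    """Central finite-difference approximation of the Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    p = x.size
    H = np.zeros((p, p))
    for j in range(p):
        for k in range(p):
            ej = np.zeros(p)
            ej[j] = h
            ek = np.zeros(p)
            ek[k] = h
            H[j, k] = (f(x + ej + ek) - f(x + ej - ek)
                       - f(x - ej + ek) + f(x - ej - ek)) / (4 * h**2)
    return H

x0 = np.array([0.3, -0.7])
grad = gradient_fd(f, x0)                   # analytically [b1 + b12*x2, b2 + b12*x1]
hess = hessian_fd(f, x0)                    # analytically [[0, b12], [b12, 0]]
grad_at_zero = gradient_fd(f, np.zeros(2))  # analytically [b1, b2]
```

Away from the origin the gradient picks up the term `b12 * x`, which is why the sensitivities of a model with an interaction vary from one observation to the next.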

Several simple experiments, corresponding to Section 4 of Chapter 5, are performed to illustrate the properties of network interpretability.

## Simple Data Generation Process (DGP)

Generate data from a regression model with an interaction term:

$Y=X_1+X_2 + X_1X_2+\epsilon~, ~~X_1, X_2 \sim N(0,1)~, ~~\epsilon \sim N(0,\sigma_n^2)$

In [3]:
M = 5000 # Number of samples
np.random.seed(7) # Set NumPy's random seed for reproducibility
X = np.zeros(shape=(M, 2))
sigma_n = 0.01
X[:int(M/2), 0] = np.random.randn(int(M/2))
X[:int(M/2), 1] = np.random.randn(int(M/2))
# Use antithetic sampling to reduce sample bias in the mean
X[int(M/2):, 0] = -X[:int(M/2), 0]
X[int(M/2):, 1] = -X[:int(M/2), 1]

eps = np.random.randn(M)
Y = X[:, 0] + X[:, 1] + X[:, 0]*X[:, 1] + sigma_n*eps.flatten()
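
The effect of the antithetic construction can be checked with a short sketch. It uses its own generator and draws the numbers in a different order from the cell above, so the values are not those of the notebook; the names `X_anti` and `X_plain` are introduced here for illustration.

```python
import numpy as np

rng = np.random.RandomState(7)     # separate generator; not the notebook's exact draws
half = 2500
Z = rng.randn(half, 2)
X_anti = np.vstack([Z, -Z])        # every draw is paired with its mirror image
X_plain = rng.randn(2 * half, 2)   # ordinary i.i.d. draws for comparison

mean_anti = X_anti.mean(axis=0)    # zero up to floating-point rounding
mean_plain = X_plain.mean(axis=0)  # small but non-zero sampling error

# The product X1*X2 is unchanged when both signs flip, so the antithetic
# pairs do not centre the interaction regressor.
mean_prod_anti = (X_anti[:, 0] * X_anti[:, 1]).mean()
print(mean_anti, mean_plain, mean_prod_anti)
```

Because the sample mean of the product is not forced to zero, a regression that omits the interaction term can be expected to absorb that mean into its intercept.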

## Use ordinary least squares to fit a linear model to the data

As a baseline for the neural network, we first fit an OLS regression with statsmodels.

In [4]:
ols_results = sm.OLS(Y, sm.add_constant(X)).fit()

For each input, get the predicted $Y$ value according to the model

In [5]:
y_ols = ols_results.predict(sm.add_constant(X))

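View the characteristics of the resulting model. You should observe that the intercept is close to zero and the other coefficients are close to one. Because the regressors are only a constant, $X_1$ and $X_2$, the unit-variance interaction $X_1X_2$ is left in the residual and the reported $R^2$ is only about 0.67.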

In [6]:
ols_results.summary()

Dep. Variable:,y,R-squared:,0.669
Model:,OLS,Adj. R-squared:,0.669
Method:,Least Squares,F-statistic:,5052.0
Date:,"Mon, 18 May 2020",Prob (F-statistic):,0.0
Time:,16:09:50,Log-Likelihood:,-7103.5
No. Observations:,5000,AIC:,14210.0
Df Residuals:,4997,BIC:,14230.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0243,0.014,1.713,0.087,-0.004,0.052
x1,0.9999,0.014,70.236,0.000,0.972,1.028
x2,1.0000,0.014,70.164,0.000,0.972,1.028

Omnibus:,846.889,Durbin-Watson:,2.024
Prob(Omnibus):,0.0,Jarque-Bera (JB):,16614.377
Skew:,0.168,Prob(JB):,0.0
Kurtosis:,11.924,Cond. No.,1.02
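
The baseline above regresses on $X_1$ and $X_2$ only. As a sketch of how the interaction coefficient could be recovered by least squares, the product $X_1X_2$ can be added as a third regressor. This step is not part of the original notebook; it uses `np.linalg.lstsq` rather than statsmodels, and a fresh sample with an arbitrary seed.

```python
import numpy as np

rng = np.random.RandomState(0)     # arbitrary seed chosen for this sketch
M, sigma_n = 5000, 0.01
X = rng.randn(M, 2)
Y = X[:, 0] + X[:, 1] + X[:, 0] * X[:, 1] + sigma_n * rng.randn(M)

# Design matrix: constant, X1, X2 and the product X1*X2
A = np.column_stack([np.ones(M), X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
coef, _, _, _ = np.linalg.lstsq(A, Y, rcond=None)

resid = Y - A @ coef
r_squared = 1.0 - resid.var() / Y.var()
print(coef, r_squared)
```

With the product included, the four coefficients should be close to (0, 1, 1, 1) and the fit should explain almost all of the variance, since only the small noise term remains.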


## Compare with a feedforward NN with no hidden layers

Recall that a feedforward network with no hidden layers and no activation function is a linear regression model.
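
To make this equivalence concrete, here is a minimal NumPy sketch. It trains the map $\hat{y} = Xw + b$ by full-batch gradient descent on the mean squared error (a simplification; the notebook itself uses Keras with the Adam optimizer and mini-batches) and compares the result with the closed-form least-squares solution. The toy data, seed, learning rate and iteration count are assumptions for this example.

```python
import numpy as np

rng = np.random.RandomState(1)     # arbitrary seed for this sketch
M = 2000
X = rng.randn(M, 2)
Y = X[:, 0] + X[:, 1] + 0.01 * rng.randn(M)   # purely linear toy target

# Closed-form least-squares solution: [intercept, slope_1, slope_2]
A = np.column_stack([np.ones(M), X])
ols_coef = np.linalg.lstsq(A, Y, rcond=None)[0]

# "Network" with no hidden layer and no activation: y_hat = X @ w + b
w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(500):
    err = X @ w + b - Y                  # residuals of the current fit
    w -= lr * (2.0 / M) * (X.T @ err)    # gradient of the MSE w.r.t. w
    b -= lr * 2.0 * err.mean()           # gradient of the MSE w.r.t. b

nn_coef = np.concatenate([[b], w])
print(ols_coef, nn_coef)
```

Both procedures minimize the same convex objective, so the trained weights and bias should agree with the regression coefficients up to optimization error.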

Create a build function for the linear perceptron, which transforms the inputs directly to a single output

In [7]:
def linear_NN0_model(l1_reg=0.0):    
    model = Sequential()
    model.add(Dense(1, input_dim=2, kernel_initializer='normal')) 
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae', 'mse'])
    return model

Define an early stopping callback that terminates training once the training loss has failed to improve for 10 consecutive epochs, which we take as a sign that the weights have converged.

In [8]:
es = EarlyStopping(monitor='loss', mode='min', verbose=1, patience=10)
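
The stopping rule can be sketched in plain Python. This is a simplified illustration of patience-based early stopping, not Keras's own implementation; the function name `early_stopping_epoch` and the loss history are hypothetical.

```python
def early_stopping_epoch(losses, patience=10):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: stop once the monitored loss has failed to improve on
    its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss      # new best value: reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical loss history: improves for 3 epochs, then plateaus
history = [1.0, 0.5, 0.4] + [0.4] * 20
stop = early_stopping_epoch(history, patience=10)
print(stop)
```

A larger `patience` tolerates longer plateaus before stopping, at the cost of extra epochs once the loss has genuinely flattened.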

Pass the build function for our model and the training parameters to the `KerasRegressor` constructor to create a Scikit-learn-compatible regression model. This allows you to take advantage of Scikit-learn's built-in tools and estimator methods, and to incorporate the model into Scikit-learn pipelines.

In [9]:
lm = KerasRegressor(build_fn=linear_NN0_model, epochs=40, batch_size=10, verbose=1, callbacks=[es])

Train the model

In [10]:
lm.fit(X, Y)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 00023: early stopping


<keras.callbacks.callbacks.History at 0x14066a978>

### Check that the weights are close to one
The weights should be close to unity. The bias term is the second array returned by `get_weights()` and should be close to zero.

In [11]:
W = lm.model.layers[0].get_weights()[0]
b = lm.model.layers[0].get_weights()[1]
print("Weights: "+ str(W))
print("Bias: " + str(b))

Weights: [[0.99630827]
 [0.99237144]]
Bias: [0.02239072]


## Compare with a feedforward NN with one hidden layer (unactivated)

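This time we create a neural network with one hidden layer of 10 units and no activation function, so the composition of the two layers is still a linear map of the inputs.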

In [12]:
n = 10 # number of hidden units

In [13]:
def linear_NN1_model(l1_reg=0.0):    
    model = Sequential()
    # Note the first argument passed to the Dense layer constructor
    model.add(Dense(n, input_dim=2, kernel_initializer='normal')) 
    model.add(Dense(1, kernel_initializer='normal', activation='linear'))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae', 'mse'])
    return model

In [14]:
lm = KerasRegressor(build_fn=linear_NN1_model, epochs=50, batch_size=10, verbose=1, callbacks=[es])

Train the model

In [15]:
lm.fit(X, Y)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 00012: early stopping


<keras.callbacks.callbacks.History at 0x1a47f43eb8>

Extract the trained weights from the model

In [16]:
W1 = lm.model.get_weights()[0]
b1 = lm.model.get_weights()[1]
W2 = lm.model.get_weights()[2]
b2 = lm.model.get_weights()[3]
print(W1, W2)

[[-0.38235897  0.30883455 -0.27476394  0.37077746  0.33718014 -0.30436334
  -0.2637127  -0.30291387 -0.29929903  0.28886756]
 [-0.26061237  0.2982374  -0.3480544   0.25751233  0.20222877 -0.3529237
  -0.32209945 -0.2524027  -0.3108291   0.31747958]] [[-0.28780663]
 [ 0.30990782]
 [-0.32593182]
 [ 0.30765072]
 [ 0.3182374 ]
 [-0.3202046 ]
 [-0.2919182 ]
 [-0.36224043]
 [-0.33498996]
 [ 0.34460506]]


### Check that the coefficients are close to one and the intercept is close to zero

In [17]:
beta_0 = np.dot(np.transpose(W2), b1) + b2
beta_1 = np.dot(np.transpose(W2), W1[0])
beta_2 = np.dot(np.transpose(W2), W1[1])

In [18]:
print(beta_0, beta_1, beta_2)

[0.03100312] [1.0006595] [0.9364493]
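
The coefficients above rely on the fact that two stacked linear layers collapse to a single linear map. A minimal check of that identity, using random stand-in weights rather than the trained ones (the seed and shapes are assumptions for this sketch):

```python
import numpy as np

rng = np.random.RandomState(2)   # arbitrary seed; weights are random stand-ins
p, n = 2, 10
W1 = rng.randn(p, n)             # same layout as a Keras Dense kernel: (inputs, units)
b1 = rng.randn(n)
W2 = rng.randn(n, 1)
b2 = rng.randn(1)

X = rng.randn(5, p)

# Forward pass through the two linear layers
hidden = X @ W1 + b1
y_net = hidden @ W2 + b2         # shape (5, 1)

# Equivalent single linear map
beta = (W1 @ W2)[:, 0]           # slopes, shape (p,)
beta_0 = (b1 @ W2 + b2)[0]       # intercept
y_lin = beta_0 + X @ beta        # shape (5,)

print(np.allclose(y_net[:, 0], y_lin))
```

Without a non-linear activation the hidden layer adds parameters but no expressive power, so this network still cannot represent the $X_1X_2$ term.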


## Compare with a feedforward NN with one hidden layer ($tanh$ activated)

Finally, we create another model with a 10-unit hidden layer, this time with a $tanh$ activation function.

In [19]:
# number of hidden neurons
n = 10

In [20]:
def linear_NN1_model_act(l1_reg=0.0):    
    model = Sequential()
    # Note the activation parameter passed to the layer constructor
    model.add(Dense(n, input_dim=2, kernel_initializer='normal', activation='tanh'))
    model.add(Dense(1, kernel_initializer='normal')) 
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae', 'mse'])
    return model

In [21]:
lm = KerasRegressor(build_fn=linear_NN1_model_act, epochs=100, batch_size=10, verbose=1, callbacks=[es])

Train the model

In [22]:
lm.fit(X, Y)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100


Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


<keras.callbacks.callbacks.History at 0x1a4851dda0>

### Compute the sensitivities
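
The sensitivities in this section follow from the chain rule. As a sketch, using the array layout of the Keras weights (the notation $W^{(1)} \in \mathbb{R}^{p \times n}$, $W^{(2)} \in \mathbb{R}^{n}$, $I_m$ and $Z_m$ is introduced here for illustration), write the fitted network as

$$\hat{Y}(X) = \sum_{m=1}^{n} W^{(2)}_m Z_m + b^{(2)}, \qquad Z_m = \tanh(I_m), \qquad I_m = \sum_{j=1}^{p} W^{(1)}_{jm} X_j + b^{(1)}_m.$$

Since $\tanh'(I) = 1 - \tanh^2(I)$ and $\tanh''(I) = -2\tanh(I)\left(1 - \tanh^2(I)\right)$, the first and second derivatives are

$$\frac{\partial \hat{Y}}{\partial X_j} = \sum_{m=1}^{n} W^{(2)}_m \left(1 - Z_m^2\right) W^{(1)}_{jm}, \qquad \frac{\partial^2 \hat{Y}}{\partial X_j \partial X_k} = \sum_{m=1}^{n} W^{(2)}_m \left(-2 Z_m \left(1 - Z_m^2\right)\right) W^{(1)}_{jm} W^{(1)}_{km},$$

and the intercept is the network evaluated at the origin,

$$\beta_0 = \hat{Y}(0) = \sum_{m=1}^{n} W^{(2)}_m \tanh\left(b^{(1)}_m\right) + b^{(2)}.$$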

In [23]:
# Assumes a single hidden layer with tanh activation and a linear output layer
def sensitivities(lm, X):
    # Trained weights: W1 is (p, n), b1 is (n,), W2 is (n, 1), b2 is (1,)
    W1, b1, W2, b2 = lm.model.get_weights()
    w2 = W2[:, 0] # output weights as a vector of length n

    M, p = np.shape(X)

    beta = np.zeros((M, p+1), dtype='float32')
    beta_interact = np.zeros((M, p, p), dtype='float32')

    beta[:, 0] = np.dot(w2, np.tanh(b1)) + b2[0] # intercept \beta_0 = F_{W,b}(0)
    for i in range(M):
        Z1 = np.tanh(np.dot(np.transpose(W1), X[i]) + b1) # hidden-layer output

        D = 1 - Z1**2 # first derivative of tanh
        D_prime = -2 * Z1 * (1 - Z1**2) # second derivative of tanh, needed for interaction term

        for j in range(p):
            beta[i, j+1] = np.dot(w2, D * W1[j]) # dY/dX_j at observation i
            # Interaction term
            for k in range(p):
                beta_interact[i, j, k] = np.dot(w2, W1[j] * D_prime * W1[k]) # d^2Y/dX_j dX_k

    return beta, beta_interact
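
One way to gain confidence in closed-form sensitivities of this kind is to compare them with finite differences. The sketch below does so for a small tanh network with random stand-in weights, not the trained model; the function names `net` and `analytic`, the seed and the evaluation point are assumptions for this example.

```python
import numpy as np

rng = np.random.RandomState(3)    # arbitrary seed; random stand-in weights
p, n = 2, 10
W1, b1 = rng.randn(p, n), rng.randn(n)
w2, b2 = rng.randn(n), rng.randn()

def net(x):
    """Single-hidden-layer tanh network with a linear output."""
    return w2 @ np.tanh(x @ W1 + b1) + b2

def analytic(x):
    """Gradient and Hessian of net at x from the closed-form expressions."""
    Z = np.tanh(x @ W1 + b1)
    d1 = 1.0 - Z**2                  # first derivative of tanh
    d2 = -2.0 * Z * (1.0 - Z**2)     # second derivative of tanh
    grad = W1 @ (w2 * d1)            # shape (p,)
    hess = (W1 * (w2 * d2)) @ W1.T   # shape (p, p)
    return grad, hess

x0 = np.array([0.2, -0.4])
grad, hess = analytic(x0)

# Central finite differences for comparison
h = 1e-4
E = np.eye(p)
grad_fd = np.array([(net(x0 + h * E[j]) - net(x0 - h * E[j])) / (2 * h)
                    for j in range(p)])
hess_fd = np.array([[(net(x0 + h * E[j] + h * E[k]) - net(x0 + h * E[j] - h * E[k])
                      - net(x0 - h * E[j] + h * E[k]) + net(x0 - h * E[j] - h * E[k]))
                     / (4 * h**2)
                     for k in range(p)] for j in range(p)])

print(np.abs(grad - grad_fd).max(), np.abs(hess - hess_fd).max())
```

The two printed discrepancies should be small relative to the derivatives themselves, and the Hessian should be symmetric, as the interaction matrix is.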

In [24]:
beta, beta_inter = sensitivities(lm, X)

### Check that the intercept is close to zero and the coefficients are close to one

In [25]:
print(np.mean(beta, axis=0))

[-0.04780496  1.0240353   1.0067265 ]


In [26]:
print(np.std(beta, axis=0))

[1.7955754e-06 9.7666699e-01 9.8022658e-01]


In [27]:
print(np.mean(beta_inter, axis=0)) # off-diagonals are interaction terms

[[-0.01265506  0.9733205 ]
 [ 0.9733205  -0.02740895]]
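
For reference, the exact sensitivities of the data generation process can be written down directly: $\partial Y/\partial X_1 = 1 + X_2$, $\partial Y/\partial X_2 = 1 + X_1$, and the cross-derivative is 1 with zero diagonal terms. The sketch below evaluates these on a fresh antithetic sample (arbitrary seed, not the notebook's data) to indicate what the network statistics above should approximate.

```python
import numpy as np

rng = np.random.RandomState(4)   # arbitrary seed; not the notebook's sample
half = 2500
Z = rng.randn(half, 2)
X = np.vstack([Z, -Z])           # antithetic pairs, as in the data generation step

# Exact derivatives of Y = X1 + X2 + X1*X2 (the noise does not depend on X)
dY_dX1 = 1.0 + X[:, 1]
dY_dX2 = 1.0 + X[:, 0]
cross = 1.0                      # d^2Y / dX1 dX2, constant everywhere

mean_sens = np.array([dY_dX1.mean(), dY_dX2.mean()])
std_sens = np.array([dY_dX1.std(), dY_dX2.std()])
print(mean_sens, std_sens, cross)
```

The standard deviation of roughly one in the network's gradients is therefore expected: it reflects the interaction term, which makes each sensitivity depend on the other input, rather than a poor fit.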
