# Generalising Influence Functions with Autograd

This notebook is an implementation of the data splitting estimator for the Shannon Entropy in 'Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations' by Kandasamy et al. 2015 https://arxiv.org/abs/1411.4342

The process is the same as theirs except, instead of implementing the analytical form derived in the paper, we use autograd.

#### Acknowledgements
I'd like to thank Nic Ford, Sina Akbari, and Jalal Etesami for their patience in helping me work through this topic.

### Form

This work is concerned with functions of the form $$T(p) = \phi \left( \int \nu(p) d\nu \right)$$

where $T(p)$ is the target functional we wish to estimate, and $p$ is a density.

Notice already that Shannon entropy $-\int p \log p$ can be expressed in the same form. There are a couple of ways to do this, but one way is with $\phi(p) = p$ (i.e. the identity function) and $\nu(p) = p \log p$. 

### Pathwise Derivative

The data we collect enables us to estimate $T(p)$ but not the true population quantity $T(q)$. i.e. we have access to $P$ but not to $Q$. We assume that $T(p)$ is 'close enough' and lies on a path to $T(q)$. This allows us to define the pathwise or 'Gateaux' derivative as:

$$
T'(H; P) = \left. \frac{\partial T(P+tH)}{\partial t} \right \vert_{t=0}
$$

### Influence Function

Assuming that $T$ is Gateaux differentiable at $P$ then a function $\psi:\mathcal{X} \rightarrow \mathbb{R}$ which satisfies $T'(Q-P;P) = \int \psi(x; P)dQ(x)$ is known as the influence function:

$$ \psi(x, P) = T'(\delta_x - P, P) =\left. \frac{\partial T((1-t)P+t\delta_x)}{\partial t} \right \vert_{t=0}$$

### Von Mises

Following a generalization of the Taylor expansion to functionals, the true target quantity $T(Q)$ which we wish to estimate can be expressed as:

$$
T(Q) = T(P) + T'(Q-P;P) + R_2 = T(P) + \int \psi(x;P)dQ(x) + R_2
$$

In words, the true quantity is equal to the estimatable quantity plus the integral of the influence function and some higher order error term(s).

Following a little substitution, the expression can be written as:

$$
T(q) = T(p) + \phi ' \left( \int \nu(p)\right) \int (q-p)\nu ' (p) + R_2
$$

expanding the second term...

$$
T(q) = T(p) + \phi ' \left( \int \nu(p)\right) \left(  \int q\nu ' (p) - \int p \nu ' (p)  \right)+ R_2
$$


### Estimating $T(q)$

As we do not have access to $Q$ we can approximate it using samples from our dataset. This is where our data splitting will come in handy.

The rest of the process is described in line with the code below.


###  Make some imports and define some constants

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.autograd.functional as func
import torch.autograd as grad
from sklearn.neighbors import KernelDensity
torch.pi = torch.tensor(torch.acos(torch.zeros(1)).item() * 2)

# Set up an example for finding the Shannon Entropy of a Gaussian
n = 10000
true_mu = 0
true_sigma = 1
dx = 0.01  # as we are estimating densities with sums we will multiply by dx
runs = 500

def nu(a, dx):
    return - a * torch.log(a/dx) 

def phi(b):
    return b

def entropy(p, dx):
    return phi((nu(p, dx)).sum())

In [2]:
# calculate true entropy
GT_psi = 0.5 * torch.log(2 * torch.pi * torch.exp(torch.tensor([1])) * true_sigma**2)
    
updated_psis = []
naive_psis = []
for i in range(runs):
    if i % 10 == 0:
        print('===== Run {} ======'.format(i))
    x = ((torch.randn(n) + true_mu) * true_sigma).reshape(-1,1)

    # data splits
    x1 = x[:len(x)//2]
    x2 = x[len(x)//2:]

    # estimate density using first half of data
    kde_ds1 = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(x1)

    # estimate density using second half of data
    kde_ds2 = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(x2)

    # define domain x
    r_ds1 = np.arange(-10, 10, dx).reshape(-1,1)
    # get density of domain
    p_r_ds1 = np.exp(kde_ds1.score_samples(r_ds1)) * dx
    p_r_ds1 = torch.tensor(p_r_ds1)

    # define domain x
    r_ds2 = np.arange(-10, 10, dx).reshape(-1,1)
    # get density of domain
    p_r_ds2 = np.exp(kde_ds1.score_samples(r_ds2)) * dx
    p_r_ds2 = torch.tensor(p_r_ds2)

    # calculate estimated entropy
    est_ds1 = entropy(p_r_ds1, dx)
    est_ds2 = entropy(p_r_ds2, dx)
    est = (est_ds1 + est_ds2) / 2
    naive_psis.append(est)
    
    # A
    int_nu_p_ds1 = nu(p_r_ds1, dx).sum()  # 1.
    int_nu_p_ds1.requires_grad_(True)   # 2.
    phi_int_nu_p_ds1 = phi(int_nu_p_ds1)  # 3.
    phi_int_nu_p_ds1.backward(torch.ones(phi_int_nu_p_ds1.shape))  # 4.
    A_ds1 = int_nu_p_ds1.grad.data  # 5.

    p_xi_ds2 = torch.tensor(np.exp(kde_ds1.score_samples(x2))) * dx  # 1.
    p_xi_ds2.requires_grad_(True)  # 2. 
    nu_p_xi_ds2 = nu(p_xi_ds2, dx)  # 3.
    nu_p_xi_ds2.backward(torch.ones(nu_p_xi_ds2.shape))  # 4.
    nu_pr_p_xi_ds2 = p_xi_ds2.grad.data  # 5.
    B_ds1 = nu_pr_p_xi_ds2.mean()  # 6.

    p_r_ds1.requires_grad_(True)  # 1.
    nu_p_ds1 = nu(p_r_ds1, dx)  # 2.
    nu_p_ds1.backward(torch.ones(nu_p_ds1.shape))  # 3.
    nu_pr_p_ds1 = p_r_ds1.grad.data  # 4. 
    C_ds1 = (p_r_ds1 * nu_pr_p_ds1).sum()  # 5.

    psi_ds1 = A_ds1 * (B_ds1 - C_ds1)

    # PART 2

    # A
    int_nu_p_ds2 = nu(p_r_ds2, dx).sum()
    int_nu_p_ds2.requires_grad_(True)
    phi_int_nu_p_ds2 = phi(int_nu_p_ds2)
    phi_int_nu_p_ds2.backward(torch.ones(phi_int_nu_p_ds2.shape))
    A_ds2 = int_nu_p_ds2.grad.data

    # B -> 1/n * sum ( nu_pr(p(x_i))) using ds 2

    p_xi_ds1 = torch.tensor(np.exp(kde_ds2.score_samples(x1))) * dx
    p_xi_ds1.requires_grad_(True)
    nu_p_xi_ds1 = nu(p_xi_ds1, dx)
    nu_p_xi_ds1.backward(torch.ones(nu_p_xi_ds1.shape))
    nu_pr_p_xi_ds1 = p_xi_ds1.grad.data
    B_ds2 = nu_pr_p_xi_ds1.mean()

    # C
    p_r_ds2.requires_grad_(True)
    nu_p_ds2 = nu(p_r_ds2, dx)
    nu_p_ds2.backward(torch.ones(nu_p_ds2.shape))
    nu_pr_p_ds2 = p_r_ds2.grad.data
    C_ds2 = (p_r_ds2 * nu_pr_p_ds2).sum()

    psi_ds2 = A_ds2 * (B_ds2 - C_ds2)

    psi = (psi_ds1 + psi_ds2) / 2

    updated_est = est + psi
    updated_psis.append(updated_est)



In [3]:
naive_psis = np.asarray(naive_psis)
updated_psis = torch.FloatTensor(updated_psis).detach().numpy()
GT_psi = GT_psi.numpy()[0]
print('True psi: ', GT_psi)
print('naive psi: ', naive_psis.mean(), ' relative bias:',
      (naive_psis.mean() - GT_psi)/GT_psi * 100, '%')
print('updated TMLE psi: ', updated_psis.mean(), ' relative bias:',
      (updated_psis.mean() - GT_psi)/GT_psi * 100, '%')
print('Reduction in bias:', np.abs(naive_psis.mean() - GT_psi)/GT_psi * 100 - 
     np.abs(updated_psis.mean() - GT_psi)/GT_psi * 100, '%')

True psi:  1.4189385
naive psi:  1.4375673514037324  relative bias: 1.312871107701906 %
updated TMLE psi:  1.4203782  relative bias: 0.10146250715479255 %
Reduction in bias: 1.2114086005471134 %


In [23]:
# This takes the reduction in relative bias for each simulation first, then takes an average
# (Owing to the nonlinearity of the ||x|| function, this gives different results which are
# worth considering.)
print('naive psi var:', naive_psis.var())
print('updated psi var:', updated_psis.var())
errors_naive = (naive_psis - GT_psi)/GT_psi *100
errors_updated = (updated_psis - GT_psi)/GT_psi *100
diff_errors = np.abs(errors_naive) - np.abs(errors_updated)
print('Average of reductions:', diff_errors.mean(), '%')

naive psi var: 9.849999785692589e-05
updated psi var: 5.5809996e-05
Average of reductions: 0.9019158318410758 %
