Kernel Conditional Equality of Distribution Test #225

Open
rflperry opened this issue Oct 25, 2021 · 0 comments
Labels: enhancement (New feature or request), ndd (Issues for NeuroData Design)

Comments

rflperry commented Oct 25, 2021

Popular 2-sample tests often measure only the difference in means, or the conditional difference in means.

Tests such as MMD instead test for differences in the entire distribution, generalizing beyond only the first moment (the mean).
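
For reference (my notation, not taken from the paper), the squared MMD between samples {x_i}_{i=1}^{n} and {y_j}_{j=1}^{m} with kernel k is estimated by

\widehat{\mathrm{MMD}}^2 = \frac{1}{n^2} \sum_{i,i'} k(x_i, x_{i'}) - \frac{2}{nm} \sum_{i,j} k(x_i, y_j) + \frac{1}{m^2} \sum_{j,j'} k(y_j, y_{j'}),

which (in population) is zero for a characteristic kernel if and only if the two distributions are equal.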

The relatively recent paper linked here presents a Kernel Conditional Discrepancy (KCD) test for the difference between two conditional distributions, Y_1 | X_1 vs. Y_2 | X_2. Like MMD, it generalizes the conditional test beyond just the first moment, and it provides interpretable ways to analyze where the discrepancy comes from.
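
As I understand the paper (my notation below, not necessarily theirs), the statistic computed in the code is an average squared RKHS distance between estimated conditional mean embeddings,

\hat{T} = \frac{1}{n} \sum_{i=1}^{n} \left\lVert \hat{\mu}_{Y_1 \mid X_1 = x_i} - \hat{\mu}_{Y_2 \mid X_2 = x_i} \right\rVert_{\mathcal{H}_Y}^2, \qquad \hat{\mu}_{Y \mid X = x} = \sum_{j} \left[ (K + \lambda I)^{-1} k_x \right]_j \, l(y_j, \cdot),

where the outer sum runs over the pooled sample, K and k_x are the Gram matrix and kernel vector on X within each group, l is the kernel on Y, and \lambda is a ridge regularizer. The null distribution is then approximated by redrawing the group labels from an estimated propensity score e(x) = P(Z = 1 | X = x) and recomputing the statistic.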

The author's code is as follows; it relies on the falkon package for efficient kernel computation and kernel ridge regression. I have (tentatively) correct code that relies on sklearn rather than falkon and replicates the results from the paper; a minimal sklearn-based propensity sketch and a toy usage example follow the code block below.

import numpy as np
from falkon import kernels, models, FalkonOptions, gsc_losses
import torch


#  The following function computes the kernel matrix using the Gaussian kernel between vectors x and y, with bandwidth
#  sigma. If y=None, the Gram matrix is computed based on x. sigma can be a vector of bandwidths, one for each
#  dimension.
def kernel_matrix(x, y=None, sigma=None):
    if y is None:
        y = x
    p = x.shape[1]
    if isinstance(sigma, float):
        sigma = sigma * np.ones(p)
    each = [0] * p
    for i in range(p):
        xi = x[:, i]
        yi = y[:, i]
        each[i] = np.exp(-0.5 * (np.reshape(xi ** 2, [-1, 1]) + np.reshape(yi ** 2, [1, -1])
                                 - 2 * np.matmul(np.reshape(xi, [-1, 1]), np.reshape(yi, [1, -1]))) / (sigma[i] ** 2))
    return np.prod(each, axis=0)


#  The following function divides x and y into control and treatment groups, x_c, x_t, y_c and y_t respectively,
#  based on the treatment assignment vector z. It also computes the matrix W for the control and treatment group,
#  which is the inverse of sum of the Gram matrix and the identity matrix times the regularisation parameter.
def data_prep(x, y, z, sigma_x, reg):
    x_c = x[np.array(1 - z, dtype=bool), :]
    x_t = x[np.array(z, dtype=bool), :]
    y_c = y[np.array(1 - z, dtype=bool), :]
    y_t = y[np.array(z, dtype=bool), :]
    k_c = kernel_matrix(x_c, sigma=sigma_x)
    k_t = kernel_matrix(x_t, sigma=sigma_x)
    w_c = np.linalg.inv(k_c + reg * np.identity(int(np.sum(1 - z))))
    w_t = np.linalg.inv(k_t + reg * np.identity(int(np.sum(z))))
    return x_c, x_t, y_c, y_t, w_c, w_t


#  This function computes the witness function between control and treatment groups at arg_x and arg_y.
def witness(arg_x, arg_y, x_c, x_t, y_c, y_t, w_c, w_t, sigma_x, sigma_y):
    l_c = kernel_matrix(y_c, arg_y, sigma_y)
    l_t = kernel_matrix(y_t, arg_y, sigma_y)
    k_c = kernel_matrix(x_c, arg_x, sigma_x)
    k_t = kernel_matrix(x_t, arg_x, sigma_x)
    return (k_t.T @ w_t @ l_t - k_c.T @ w_c @ l_c).T


#  This function carries out generalised kernel ridge regression for U-statistic regression, at arg_x.
def var_regression(arg_x, x, y, sigma_x, reg):
    arg_x = np.tile(arg_x, 2)
    n = x.shape[0]
    p = x.shape[1]
    x_pairs = np.zeros((n ** 2, 2 * p))
    x_pairs[:, range(p)] = np.repeat(x, n, axis=0)
    x_pairs[:, range(p, 2 * p)] = np.tile(x, (n, 1))
    h = np.reshape(0.5 * (np.power(y, 2) + np.power(y, 2).T - 2 * np.matmul(y, y.T)), -1)
    kernel = kernels.GaussianKernel(sigma=torch.tensor(sigma_x))
    options = FalkonOptions(use_cpu=True, keops_active='no')
    model = models.Falkon(kernel=kernel, penalty=reg, M=n, options=options)
    model.fit(torch.from_numpy(x_pairs), torch.from_numpy(h))
    var = model.predict(torch.from_numpy(arg_x)).numpy()
    return var


#  This function computes the test statistic, t
def t(x, x_c, x_t, y_c, y_t, w_c, w_t, sigma_x, sigma_y, size):
    k_c_tilde = kernel_matrix(x, x_c, sigma=sigma_x)
    k_t_tilde = kernel_matrix(x, x_t, sigma=sigma_x)
    l_c = kernel_matrix(y_c, sigma=sigma_y)
    l_t = kernel_matrix(y_t, sigma=sigma_y)
    l_ct = kernel_matrix(y_c, y_t, sigma=sigma_y)
    first = np.trace(k_c_tilde @ w_c @ l_c @ w_c @ k_c_tilde.T)
    second = np.trace(k_c_tilde @ w_c @ l_ct @ w_t @ k_t_tilde.T)
    third = np.trace(k_t_tilde @ w_t @ l_t @ w_t @ k_t_tilde.T)
    return (first - 2 * second + third) / size


#  This function conducts a hypothesis test of equality between the control and treatment conditional distributions,
#  using no_perm number of permutations.
def test(x, y, z, x_c, x_t, y_c, y_t, w_c, w_t, sigma_x, sigma_y, size, reg, no_perm, propensity=None):
    t_hat = t(x, x_c, x_t, y_c, y_t, w_c, w_t, sigma_x, sigma_y, size)
    if propensity is not None:
        e_hat = propensity
    else:
        kernel = kernels.GaussianKernel(sigma=torch.tensor(sigma_x))
        options = FalkonOptions(use_cpu=True, keops_active='no')
        loss = gsc_losses.LogisticLoss(kernel)
        logistic_model = models.LogisticFalkon(kernel=kernel, M=size, loss=loss, options=options,
                                               penalty_list=[1e-2, 1e-4, 1e-6, 1e-6, 1e-8],
                                               iter_list=[3, 3, 3, 8, 8])
        if len(x.shape) == 1:
            x = np.reshape(x, [-1, 1])
        zz = np.ones(size)
        zz[z < 0.5] = -1
        logistic_model.fit(torch.from_numpy(x), torch.from_numpy(zz))
        predictions = np.reshape(logistic_model.predict(torch.from_numpy(x)).numpy(), -1)
        e_hat = np.exp(predictions) / (1 + np.exp(predictions))
    t_k = np.zeros(no_perm)
    for k in range(no_perm):
        z_tilde = np.random.binomial(1, e_hat)
        x_tilde_c, x_tilde_t, y_tilde_c, y_tilde_t, w_tilde_c, w_tilde_t = data_prep(x, y, z_tilde, sigma_x, reg)
        t_k[k] = t(x, x_tilde_c, x_tilde_t, y_tilde_c, y_tilde_t, w_tilde_c, w_tilde_t, sigma_x, sigma_y, size)
    p = (1 + np.sum(t_k > t_hat)) / (1 + no_perm)
    return p
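
For what it's worth, here is a minimal sketch (not the sklearn port mentioned above) of how the falkon-dependent propensity step could be swapped for scikit-learn, together with a toy usage example of the functions above. The helper estimate_propensity_sklearn, the bandwidths, and the toy data are illustrative assumptions, and plain logistic regression is only a stand-in for the kernel logistic regression in the author's code; supplying propensity explicitly keeps test() from reaching its falkon branch.

import numpy as np
from sklearn.linear_model import LogisticRegression


#  Hypothetical helper: plain (linear) logistic regression standing in for kernel
#  logistic regression; returns the estimated propensity P(Z = 1 | X = x) per row.
def estimate_propensity_sklearn(x, z):
    clf = LogisticRegression().fit(x, z)
    return clf.predict_proba(x)[:, 1]


#  Toy data in which treatment and control share the same conditional Y | X,
#  so the test should typically fail to reject.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
z = rng.binomial(1, 0.5, size=n)
y = x + 0.1 * rng.normal(size=(n, 1))

sigma_x, sigma_y, reg = 1.0, 1.0, 1e-3
x_c, x_t, y_c, y_t, w_c, w_t = data_prep(x, y, z, sigma_x, reg)
p_value = test(x, y, z, x_c, x_t, y_c, y_t, w_c, w_t, sigma_x, sigma_y,
               size=n, reg=reg, no_perm=100,
               propensity=estimate_propensity_sklearn(x, z))
print(p_value)

Since both groups share the same conditional distribution here, the test should fail to reject far more often than not; shifting or rescaling y in the treatment group should drive the p-value down.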

rflperry added the enhancement (New feature or request) label on Oct 25, 2021
sampan501 added the ndd (Issues for NeuroData Design) label on Feb 15, 2022