# Example use case of the Generative Conditional Independence Test

This notebook provides a simple use case of the Generative Conditional Independence Test (GCIT), which tests whether two variables are independent given other variables that may be related to the quantities of interest. More details can be found in the paper *"Conditional Independence Testing using Generative Adversarial Networks"* by *Alexis Bellot and Mihaela van der Schaar*.

## What is the Generative Conditional Independence Test?

Conditional independence tests ask whether two variables $X$ and $Y$ behave independently of each other after accounting for the effect of confounders $Z$. Such questions can be written as a hypothesis testing problem: $\mathcal H_0: X \perp\!\!\!\perp Y \mid Z$ versus the general alternative that conditional independence does not hold. Given a significance level, a conditional independence test decides whether or not to reject the null hypothesis $\mathcal H_0$.
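
For readers who prefer an explicit form, the null hypothesis can be stated as a factorisation of the conditional distribution. This is the standard definition of conditional independence, written here under the assumption that the relevant conditional densities exist:

$$
\mathcal H_0:\; p(x, y \mid z) = p(x \mid z)\, p(y \mid z) \ \text{ for all } x, y \text{ and all } z \text{ with } p(z) > 0,
\qquad
\mathcal H_1:\; p(x, y \mid z) \neq p(x \mid z)\, p(y \mid z) \ \text{ for some } x, y, z.
$$

Equivalently, under $\mathcal H_0$ we have $p(y \mid x, z) = p(y \mid z)$: once $Z$ is known, $X$ carries no further information about $Y$.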

A number of studies have shown such tests to fail when a large number of variables $Z$ confound the relationship between $X$ and $Y$. This work describes a test that is empirically more robust and whose performance guarantees do not depend on the number of variables involved.

Our test is based on a modification of Generative Adversarial Networks (GANs) that simulates from a distribution under the assumption of conditional independence, while maintaining good power in high dimensional data. In our procedure, after training, the first step involves simulating from our network to generate data sets consistent with $\mathcal H_0$. We then define a test statistic to capture the $X-Y$ dependency in each sample and compute an empirical distribution which approximates the behaviour of the statistic under $\mathcal H_0$ and can be directly compared to the statistic observed on the real data to make a decision. 
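
The simulate-then-compare procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GCIT implementation: the trained GAN generator is replaced by a simple linear regression of $X$ on $Z$ with resampled residuals (a stand-in), the statistic is the absolute Pearson correlation, and the names `abs_correlation`, `fit_linear_sampler` and `monte_carlo_pvalue` are hypothetical. The p-value here is a one-sided comparison on an absolute statistic, which may differ from the convention used by GCIT itself.

```python
import numpy as np

def abs_correlation(x, y):
    """Test statistic: absolute Pearson correlation between x and y."""
    return abs(np.corrcoef(x.ravel(), y.ravel())[0, 1])

def fit_linear_sampler(x, z):
    """Stand-in for the trained generator (NOT a GAN): regress x on z and
    resample residuals, so that synthetic x depends on z only."""
    design = np.column_stack([np.ones(len(z)), z])  # add an intercept column
    beta, *_ = np.linalg.lstsq(design, x, rcond=None)
    resid = x - design @ beta

    def sample(rng):
        # Bootstrap the residuals to draw a new x consistent with H0.
        idx = rng.integers(0, len(resid), size=len(resid))
        return design @ beta + resid[idx]

    return sample

def monte_carlo_pvalue(x, y, z, n_draws=200, seed=0):
    """Compare the observed statistic with its simulated null distribution."""
    rng = np.random.default_rng(seed)
    sample = fit_linear_sampler(x, z)
    observed = abs_correlation(x, y)
    # Empirical distribution of the statistic on data sets generated under H0.
    null_stats = np.array([abs_correlation(sample(rng), y) for _ in range(n_draws)])
    # Fraction of null statistics at least as large as the observed one
    # (the +1 terms keep the estimate strictly positive).
    return (1 + np.sum(null_stats >= observed)) / (1 + n_draws)
```

Calling `monte_carlo_pvalue(x, y, z)` on arrays of shape `(n, 1)`, `(n, 1)` and `(n, dz)` returns a value in $(0, 1]$; small values indicate that the observed $X-Y$ dependency is larger than what the null simulations produce.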

Let us first generate some data with the generate_samples_random helper, imported here from the utils module:

In [4]:
from utils import generate_samples_random

# Draw 500 samples with one-dimensional x and y and a 100-dimensional z
x, y, z = generate_samples_random(size=500, sType='CI', dx=1, dy=1, dz=100,
                                  fixed_function='nonlinear', dist_z='gaussian')

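Here we sample Gaussian random variables $Z$ and pass them through non-linear functions to obtain $X$ and $Y$. With sType='CI', both $X$ and $Y$ depend on $Z$ but are conditionally independent of each other given $Z$, so the null hypothesis holds for this data set.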
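
To make the difference between the null and the alternative concrete, here is a toy generator written for this notebook. It is an illustrative assumption, not the code in `utils`: the function name `toy_samples`, the `tanh` transform, the noise scale and the `dependent` flag are all hypothetical choices.

```python
import numpy as np

def toy_samples(size=500, dz=100, dependent=False, seed=0):
    """Toy post-nonlinear data: x and y are noisy non-linear functions of z.

    With dependent=False, y is built from z only, so x and y are
    conditionally independent given z. With dependent=True, x also
    enters y directly, so the null hypothesis is violated.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((size, dz))
    # Random projection weights, scaled so that z @ a has unit variance.
    a = rng.standard_normal((dz, 1)) / np.sqrt(dz)
    b = rng.standard_normal((dz, 1)) / np.sqrt(dz)
    x = np.tanh(z @ a + 0.5 * rng.standard_normal((size, 1)))
    y_input = z @ b + 0.5 * rng.standard_normal((size, 1))
    if dependent:
        y_input = y_input + 2.0 * x  # direct effect of x on y
    y = np.tanh(y_input)
    return x, y, z
```

Because the random draws are made in the same order in both settings, calling the function twice with the same seed gives identical `x` and `z` and differs only in `y`, which isolates the effect of the direct $X \to Y$ link.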

### Now, can we say whether there is a relationship between $X$ and $Y$ that is not due to $Z$?

To answer this question, the GCIT takes into account the associations of $Z$ and $Y$ separately to mimic a setting where the null hypothesis holds, i.e. $X$ is irrelevant for inferring $Y$ once $Z$ is known, and then compares this synthetic setting to the actual observations. To obtain a *p-value*, we simply call GCIT with the data arrays as arguments:

In [6]:
from GCIT import *

alpha = 0.05
pval = GCIT(x, y, z, verbose=False)

# This is a two-sided test, so for a level 0.05 test the p-value is compared against alpha/2
if pval < alpha/2:
    print('p-value is',pval,'- There is enough evidence in the data to reject the null hypothesis.')
else:
    print('p-value is',pval,'- There is not enough evidence in the data to reject the null hypothesis.')

p-value is 0.711 - There is not enough evidence in the data to reject the null hypothesis.
