# Causal discovery with `TIGRAMITE`

TIGRAMITE is a Python module for time series analysis. It reconstructs graphical models (conditional independence graphs) from discrete or continuous-valued time series based on the PCMCI method and creates high-quality plots of the results.

PCMCI is described here: J. Runge et al. (2018): Detecting Causal Associations in Large Nonlinear Time Series Datasets. https://arxiv.org/abs/1702.07007v2

This tutorial explains missing values and masking and gives walk-through examples. See the following paper for theoretical background:
Runge, Jakob. 2018. “Causal Network Reconstruction from Time Series: From Theoretical Assumptions to Practical Estimation.” Chaos: An Interdisciplinary Journal of Nonlinear Science 28 (7): 075310.

In [None]:
# Imports
import numpy
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline     
## use `%matplotlib notebook` for interactive figures
# plt.style.use('ggplot')
import sklearn

import tigramite
from tigramite import data_processing as pp
from tigramite import plotting as tp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests import ParCorr, GPDC, CMIknn, CMIsymb
from tigramite.models import LinearMediation, Prediction


In [None]:
var_names = [r'$X^0$', r'$X^1$', r'$X^2$', r'$X^3$']

## Missing values and masking

### Missing values

Tigramite consistently handles missing values. For example, missing values denoted as ``999.`` in the data can be flagged with ``pp.DataFrame(data, missing_flag=999.)``, as in the cell below. All time slices in which a missing value occurs in any variable are then dismissed, with time lags handled consistently. To avoid biases, the subsequent samples for all lags up to ``2*tau_max`` are dismissed as well. Missing values and masking will be described in more detail in a future paper.
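To make the dismissal rule concrete, here is a minimal NumPy sketch of which time indices survive when one value is flagged as missing. The helper `usable_time_indices` is hypothetical and written only for illustration. It is not Tigramite's internal implementation, and it assumes that a sample at time `t` is dropped whenever any of the slices `t - max_lag, ..., t` contains the flag.

```python
import numpy as np

def usable_time_indices(data, missing_flag, max_lag):
    """Hypothetical helper: time indices t whose window [t - max_lag, t] has no missing value."""
    # A time slice is "bad" if any variable carries the missing flag there
    bad = np.any(data == missing_flag, axis=1)
    T = data.shape[0]
    keep = []
    for t in range(max_lag, T):
        # Drop t if it, or any of the max_lag preceding slices, is bad
        if not bad[t - max_lag:t + 1].any():
            keep.append(t)
    return np.array(keep, dtype=int)

# Toy data: 10 time steps, 2 variables, one missing value at t = 4
toy = np.arange(20, dtype=float).reshape(10, 2)
toy[4, 1] = 999.
tau_max = 1
kept = usable_time_indices(toy, missing_flag=999., max_lag=2 * tau_max)
print(kept)
```

With `tau_max = 1`, the slice at `t = 4` and the two following slices are removed, so a single missing value costs `2*tau_max + 1` samples in this sketch.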

In [None]:
numpy.random.seed(1)
data = numpy.random.randn(100, 3)
for t in range(1, 100):
    data[t, 0] += 0.7*data[t-1, 0] 
    data[t, 1] += 0.6*data[t-1, 1] + 0.6*data[t-1,0]
    data[t, 2] += 0.5*data[t-1, 2] + 0.6*data[t-1,1]
# Randomly mark 10% of values as missing values in variable 2
data[numpy.random.permutation(100)[:10], 2] = 999.
tp.plot_timeseries(data, missing_flag=999., var_names=var_names)
dataframe = pp.DataFrame(data, missing_flag=999.)
pcmci_parcorr = PCMCI(dataframe=dataframe, cond_ind_test=ParCorr(verbosity=3), 
                      var_names=var_names, verbosity=4)
results = pcmci_parcorr.run_pcmci(tau_max=2, pc_alpha=0.2)
pcmci_parcorr.print_significant_links(
        p_matrix = results['p_matrix'], 
        val_matrix = results['val_matrix'],
        alpha_level = 0.01)

### Masking

Unlike missing values, masking can be used to include or exclude samples depending on the situation. For example, in climate research we are frequently interested in detecting the drivers of a target variable *only in winter months*. Thus, in all independence tests $X \perp Y | Z$ carried out during a PCMCI analysis, we require samples of $Y$ to be from the winter only, while lagged samples of $X$ or $Z$ can also come from the previous summer. This can be achieved by setting ``mask_type='y'`` when initializing ``ParCorr`` and using ``mask`` to restrict $Y$ to the winter months (in the example below, the summer samples are the ones set to ``True``, i.e. masked out). If we want *all* samples, including those of $X$ and $Z$, to be restricted to winter months, we need to mark them in ``mask`` as well and set ``mask_type='yxz'``. Correspondingly, ``mask_type='z'`` or any other combination is possible. Missing values and masking will be described in more detail in a future paper.
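The effect of ``mask_type`` on sample selection can be sketched with plain NumPy. The function `selected_samples` below is a hypothetical illustration, not part of Tigramite. It assumes that ``True`` in the mask means "masked out" and that nodes are given as `(variable, lag)` pairs with non-positive lags.

```python
import numpy as np

def selected_samples(mask, x, y, z=(), mask_type='y'):
    """Hypothetical helper: time indices t used in a test of x -> y given z.

    x, y are (variable, lag) pairs with lag <= 0; z is a sequence of such pairs.
    mask_type is None or a string made of the letters 'x', 'y', 'z'.
    """
    mask = np.asarray(mask).astype(bool)
    T = mask.shape[0]
    nodes = {'x': [x], 'y': [y], 'z': list(z)}
    max_lag = max(-lag for group in nodes.values() for (_, lag) in group)
    keep = np.ones(T, dtype=bool)
    keep[:max_lag] = False  # not enough history for the lagged nodes
    for role in (mask_type or ''):
        for var, lag in nodes[role]:
            # Mask value of node (var, t + lag) for t = max_lag, ..., T - 1
            shifted = mask[max_lag + lag:T + lag, var]
            keep[max_lag:] &= ~shifted
    return np.flatnonzero(keep)

# Toy mask: t = 0..3 is "summer" (masked), t = 4..7 is "winter" (unmasked)
mask = np.zeros((8, 2), dtype=bool)
mask[:4] = True
x, y = (0, -1), (1, 0)  # test X^0 at lag 1 against X^1 at lag 0

only_y = selected_samples(mask, x, y, mask_type='y')    # Y in winter, X may be in summer
x_and_y = selected_samples(mask, x, y, mask_type='xy')  # both in winter
no_mask = selected_samples(mask, x, y, mask_type=None)  # mask ignored
print(only_y, x_and_y, no_mask)
```

In this toy case ``mask_type='y'`` keeps the sample at `t = 4`, whose lagged $X$ value still lies in the masked summer period, while ``mask_type='xy'`` drops it.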

In the following example, we generate data whose underlying causal process differs between winter and summer months. In particular, we assume that a causal effect has opposite sign in the two seasons.

In [None]:
# Masking demo: We consider time series where the winter half year is generated by a
# different causal process than the summer half year.
numpy.random.seed(42)
T = 1000
data = numpy.random.randn(T, 2)
data_mask = numpy.zeros(data.shape)
for t in range(1, T):
    if (t % 365) < 3*30 or (t % 365) > 8*30:
        # Winter half year
        data[t, 0] +=  0.4*data[t-1, 0]
        data[t, 1] +=  0.3*data[t-1, 1] + 0.9*data[t-1, 0]
    else:
        # Summer half year: mask out this sample and its lag-1 predecessor
        data_mask[[t, t-1]] = True
        data[t, 0] +=  0.4*data[t-1, 0]
        data[t, 1] +=  0.3*data[t-1, 1] - 0.9*data[t-1, 0]

T, N = data.shape
dataframe = pp.DataFrame(data, mask=data_mask)
tp.plot_timeseries(data, figsize=(8,3),  var_names=var_names, use_mask=True, mask=data_mask, 
                             grey_masked_samples='data')


In [None]:
# Setup analysis
def run_and_plot(cond_ind_test, fig_ax):
    # Run PCMCI with the given test and plot the links significant at alpha_level
    pcmci = PCMCI(dataframe=dataframe, cond_ind_test=cond_ind_test, var_names=var_names)
    results = pcmci.run_pcmci(tau_max=2, pc_alpha=0.2)
    link_matrix = pcmci.return_significant_parents(pq_matrix=results['p_matrix'],
            val_matrix=results['val_matrix'], alpha_level=0.01)['link_matrix']
    tp.plot_graph(fig_ax=fig_ax, val_matrix=results['val_matrix'],
                  link_matrix=link_matrix, var_names=var_names)

In [None]:
# Causal graph of whole year yields no link because effects average out
fig  = plt.figure(figsize=(3,2)); ax=fig.add_subplot(111)
run_and_plot(ParCorr(mask_type=None), (fig, ax))

# Causal graph of winter half only gives positive link
fig  = plt.figure(figsize=(3,2)); ax=fig.add_subplot(111)
run_and_plot(ParCorr(mask_type='y'), (fig, ax))

# Causal graph of summer half only gives negative link
fig  = plt.figure(figsize=(3,2)); ax=fig.add_subplot(111)
# Invert the mask so that winter is masked out and summer is analyzed
dataframe.mask = (dataframe.mask == False)
run_and_plot(ParCorr(mask_type='y'),  (fig, ax))
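The cancellation seen in the whole-year graph can be reproduced in isolation with plain NumPy. The sketch below is a simplified illustration, not the tutorial's model: it drops the autoregressive terms and the lag, and only keeps a dependence of strength 0.9 whose sign flips between two equally long regimes. The variable names and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # samples per regime

x = rng.standard_normal(2 * n)
noise = rng.standard_normal(2 * n)
# +1 in the first regime ("winter"), -1 in the second ("summer")
sign = np.where(np.arange(2 * n) < n, 1.0, -1.0)
y = 0.9 * sign * x + noise

corr_pos = np.corrcoef(x[:n], y[:n])[0, 1]  # first regime only
corr_neg = np.corrcoef(x[n:], y[n:])[0, 1]  # second regime only
corr_all = np.corrcoef(x, y)[0, 1]          # both regimes pooled

print("regime 1:", round(corr_pos, 2))
print("regime 2:", round(corr_neg, 2))
print("pooled:  ", round(corr_all, 2))
```

Within each regime the correlation is clearly positive or negative, whereas the pooled correlation is close to zero, which is why a linear test on the unmasked sample finds no link.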


Note, however, that the failure to detect the link on the whole sample occurs only for partial correlation, because the positive and negative dependencies cancel out. Using CMIknn recovers the link (but gets a false positive for this realization):

In [None]:
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=CMIknn(mask_type=None), var_names=var_names)
results = pcmci.run_pcmci(tau_max=2,pc_alpha=0.2)
link_matrix = pcmci.return_significant_parents(pq_matrix=results['p_matrix'],
        val_matrix=results['val_matrix'], alpha_level=0.01)['link_matrix']
fig  = plt.figure(figsize=(3,2)); ax=fig.add_subplot(111)
tp.plot_graph(fig_ax = (fig, ax),  val_matrix=results['val_matrix'],
              link_matrix=link_matrix, var_names=var_names)