# Statsmodels Methods Overview

In [None]:
# This cell sets up the environment when executed in Google Colab.
try:
    import google.colab
    !curl -s https://raw.githubusercontent.com/ibs-lab/cedalion/dev/scripts/colab_setup.py -o colab_setup.py
    # Select branch with --branch "branch name" (default is "dev")
    %run colab_setup.py
except ImportError:
    pass

In [None]:
# this cell won't be rendered in the sphinx documentation to keep the page short

import matplotlib.pyplot as p
import numpy as np
import pandas as pd
import xarray as xr

import cedalion
import cedalion.datasets
import cedalion.io
import cedalion.models.glm as glm
import cedalion.nirs
import cedalion.plots as plots
import cedalion.sigproc.frequency
from cedalion import units

xr.set_options(display_expand_data=False);

rec = cedalion.datasets.get_fingertapping()

# rename trials
rec.stim.cd.rename_events(
    {
        "1.0": "control",
        "2.0": "Tapping/Left",
        "3.0": "Tapping/Right",
        "15.0": "sentinel",
    }
)
rec.stim = rec.stim[rec.stim.trial_type != "sentinel"]

# differential pathlength factors
dpf = xr.DataArray(
    [6, 6],
    dims="wavelength",
    coords={"wavelength": rec["amp"].wavelength},
)

# calculate optical density and concentrations
rec["od"] = cedalion.nirs.int2od(rec["amp"])
rec["conc"] = cedalion.nirs.od2conc(rec["od"], rec.geo3d, dpf, spectrum="prahl")

# Frequency filter. Here we use only a highpass (fmax = 0) to remove slow
# drifts. Another possible option would be to use drift regressors in the
# design matrix.
fmin = 0.02 * units.Hz
fmax = 0 * units.Hz

rec["conc_filtered"] = cedalion.sigproc.frequency.freq_filter(rec["conc"], fmin, fmax)


ts = rec["conc_filtered"]

# split time series into two based on channel distance
ts_long, ts_short = cedalion.nirs.split_long_short_channels(
    rec["conc_filtered"], rec.geo3d, distance_threshold=1.5 * units.cm
)

dms = (
    glm.design_matrix.hrf_regressors(
        ts_long, rec.stim, glm.Gamma(tau=0 * units.s, sigma=3 * units.s)
    )
    & glm.design_matrix.closest_short_channel_regressor(ts_long, ts_short, rec.geo3d)
)
dms.channel_wise[0] = dms.channel_wise[0].pint.dequantify()
dms.channel_wise[0] /= dms.channel_wise[0].max("time")

dm = dms.common



Cedalion uses statsmodels for its GLM fitting functionality, and this notebook gives an overview of some common statsmodels methods. The `glm.fit` function returns an `xr.DataArray` of statsmodels `RegressionResults` objects with dimensions (channel, chromo). Any `RegressionResults` method or attribute can be accessed on this DataArray using the `.sm` accessor. A full list of available methods and attributes can be found in the [statsmodels documentation](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html).

In this notebook, we'll assume that we have already loaded our data and set up the GLM. See the other GLM notebooks for details on setup.

We'll start by fitting the GLM and displaying the resulting object.

In [None]:
results = glm.fit(ts_long, dms)
display(results)

The results object is a DataArray of statsmodels [RegressionResults](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html): one for each channel/chromophore. In order to call a method on our results object, we just use the accessor `.sm`, followed by the RegressionResults method. Cedalion handles calling the method on each individual RegressionResults object in the results DataArray, and returns the outputs in a new DataArray with the appropriate dimensions. This allows us to get information on all channels simply and concisely.


### Beta Coefficients (params)

First, we'll retrieve the coefficients of the GLM fit.

In [None]:
results.sm.params

### Standard Error

The attribute [bse](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.bse.html#statsmodels.regression.linear_model.RegressionResults.bse) returns the standard errors of the parameter estimates.

In [None]:
results.sm.bse

### Confidence Intervals (conf_int)

The method [conf_int](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.conf_int.html#statsmodels.regression.linear_model.RegressionResults.conf_int) calculates the confidence interval of the fitted parameters. We can specify the alpha level for the confidence interval (default 5%).

In the output, the index conf_int marks the low (conf_int=0) and high (conf_int=1) endpoints of the confidence interval.

In [None]:
results.sm.conf_int(alpha=0.05)

### Covariance and Variance

The method [cov_params](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.cov_params.html) computes the covariance matrix of the parameter estimates. Note that we can recover the variances from the diagonal elements of the matrix.

In [None]:
results.sm.cov_params()

Here we visualize the covariance matrix for a single channel and chromophore (the first of each):

In [None]:
p.imshow(results.sm.cov_params()[0,0,:,:]);

The convenience function `sm.regressor_variances` computes the variances of the regressors, i.e. the diagonal elements of the covariance matrices.

In [None]:
# returns diagonal elements of the cov matrices
results.sm.regressor_variances()
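The relation between the covariance matrix, the variances and the standard errors can be worked through by hand. The sketch below assumes ordinary least squares with homoscedastic noise, where the covariance of the estimates is the residual variance times the inverse of `X.T @ X`; other noise models compute the covariance differently. All data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 150, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.5 * rng.normal(size=n)

# OLS estimate and residual variance
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
sigma2 = resid @ resid / (n - k)  # unbiased estimate with n - k degrees of freedom

cov = sigma2 * XtX_inv        # covariance matrix of the estimates
variances = np.diag(cov)      # per-regressor variances (diagonal elements)
std_err = np.sqrt(variances)  # standard errors
print(std_err)
```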

### Statistical Tests - T-values, P-values

Statsmodels also has sophisticated functionality for performing statistical tests on regression results.

The attribute [tvalues](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.tvalues.html) simply returns the t-statistic for each coefficient.


In [None]:
results.sm.tvalues

The method [t_test](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.t_test.html#statsmodels.regression.linear_model.RegressionResults.t_test) allows for general linear hypothesis tests.

We can specify contrasts either by passing an r-matrix or through strings. See the [patsy documentation](https://patsy.readthedocs.io/en/latest/API-reference.html#patsy.DesignInfo.linear_constraint) for details on specifying linear contrasts using strings.

The method returns an array of statsmodels [ContrastResults objects](https://www.statsmodels.org/dev/generated/statsmodels.stats.contrast.ContrastResults.html#statsmodels.stats.contrast.ContrastResults).

In [None]:
# Specifying hypotheses for t-test as string
hypotheses = "HRF Tapping/Left = HRF control, HRF Tapping/Right = HRF control"

contrast_results = results.sm.t_test(hypotheses)
display(contrast_results)


We can use the `.sm` accessor on the resulting `DataArray` of `ContrastResults` objects, just like we did before with the array of `RegressionResults` objects.

The convenience functions `sm.t_values()` and `sm.p_values()` return the t- and p-values of the contrast, respectively.

The `sm.map` method works analogously to Python's built-in `map` function, applying a given function to each cell of the DataArray.

Below, we extract the t-values of the contrast using first the map method and then the convenience function.

In [None]:
# Extracting t-values from the contrast results
display(contrast_results.sm.map(lambda i : i.tvalue, name="hypothesis"))

display(contrast_results.sm.t_values())

In [None]:
# Extracting p-values from the contrast results
display(contrast_results.sm.p_values())

### Plotting Uncertainty Bands

In this section, we explore a technique, still in development, for visualizing the uncertainty of a GLM fit with many regressors. We draw multiple samples of the beta coefficients from a multivariate normal distribution defined by the estimated coefficients and their covariance. Each sample of betas is then used to predict a time course. Finally, we plot the mean prediction together with a shaded band of ±3 standard deviations across samples, which captures the variability due to uncertainty in the fitted parameters. The band is quite narrow because uncertainty is low in this toy example.

FIXME: Band even smaller than before?

In [None]:
# Sample betas

betas = results.sm.params
cov = results.sm.cov_params()

sampled_betas = xr.zeros_like(betas).expand_dims({"sample" : 100}, axis=-1).copy()
for i_ch in range(sampled_betas.shape[0]):
    for i_cr in range(sampled_betas.shape[1]):
        sampled_betas[i_ch, i_cr, :, :] = np.random.multivariate_normal(
            betas[i_ch, i_cr, :],
            cov[i_ch, i_cr, :, :],
            size=100,
        ).T

display(sampled_betas)

In [None]:
# Predicting the time series using the sampled betas
pred = glm.predict(ts_long, sampled_betas, dms)
display(pred)

In [None]:
# Plot a band between mean-3*std and mean+3*std
# We select a 20 second window for better visualization
pred_mean = pred.mean("sample")
pred_std = pred.std("sample")

mm = pred_mean.loc[slice(60,80), "S5D5", "HbO"]
ss = pred_std.loc[slice(60,80), "S5D5", "HbO"]

p.plot(mm.time, mm, c="r")
p.fill_between(mm.time, mm-3*ss, mm+3*ss, fc="y", alpha=.8)

p.xlabel("time / s")
p.ylabel(r"$\Delta$  c / uM");
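The sampling procedure of this section can also be sketched in plain NumPy for a single time series. The design matrix, coefficient estimates and covariance below are made-up values for illustration, and the matrix product stands in for `glm.predict`; this is not cedalion's implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy design matrix: one boxcar-like regressor plus a constant
n_time = 100
t = np.arange(n_time)
X = np.column_stack([(t % 40 < 10).astype(float), np.ones(n_time)])

# Assumed estimates and covariance (made-up values for illustration)
beta_hat = np.array([1.0, 0.2])
cov = np.array([[0.04, 0.0], [0.0, 0.01]])

# Draw samples of the betas and turn each into a predicted time course
n_samples = 100
sampled = rng.multivariate_normal(beta_hat, cov, size=n_samples)  # (sample, regressor)
pred = sampled @ X.T                                              # (sample, time)

# Mean prediction and a band of +/- 3 standard deviations across samples
pred_mean = pred.mean(axis=0)
pred_std = pred.std(axis=0)
lower, upper = pred_mean - 3 * pred_std, pred_mean + 3 * pred_std
print(lower.shape, upper.shape)
```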