---
jupyter:
jupytext:
text_representation:
extension: .Rmd
format_name: rmarkdown
format_version: '1.2'
jupytext_version: 1.11.5
---
$\newcommand{L}[1]{\| #1 \|}\newcommand{VL}[1]{\L{ \vec{#1} }}\newcommand{R}[1]{\operatorname{Re}\,(#1)}\newcommand{I}[1]{\operatorname{Im}\, (#1)}$
## Validating the GLM against scipy
```{python}
import numpy as np
import numpy.linalg as npl
import matplotlib.pyplot as plt
# Print array values to 4 decimal places
np.set_printoptions(precision=4)
```
Make some random but predictable data with `numpy.random`:
```{python}
# Make random number generation predictable
np.random.seed(1966)
# Make a fake regressor and data.
n = 20
x = np.random.normal(10, 2, size=n)
y = np.random.normal(20, 1, size=n)
plt.plot(x, y, '+')
```
Do a simple linear regression with the GLM:
$$
\newcommand{\yvec}{\vec{y}}
\newcommand{\xvec}{\vec{x}}
\newcommand{\evec}{\vec{\varepsilon}}
\newcommand{\Xmat}{\boldsymbol X}
\newcommand{\bvec}{\vec{\beta}}
\newcommand{\bhat}{\hat{\bvec}}
\newcommand{\yhat}{\hat{\yvec}}
\newcommand{\ehat}{\hat{\evec}}
\newcommand{\cvec}{\vec{c}}
\newcommand{\rank}{\textrm{rank}}
y_i = c + b x_i + e_i \implies \\
\yvec = \Xmat \bvec + \evec
$$
```{python}
X = np.ones((n, 2))
X[:, 1] = x
B = npl.pinv(X).dot(y)
B
E = y - X.dot(B)
```
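As a cross-check (an addition, not in the original notes), the pseudoinverse fit should agree with the explicit least-squares solution from the normal equations, $\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec$, because this design matrix has full column rank. Rebuilding the data so the cell runs standalone:

```{python}
import numpy as np
import numpy.linalg as npl

np.random.seed(1966)
n = 20
x = np.random.normal(10, 2, size=n)
y = np.random.normal(20, 1, size=n)
X = np.ones((n, 2))
X[:, 1] = x
# Fit via the pseudoinverse, as in the text.
B_pinv = npl.pinv(X).dot(y)
# Normal equations solution; valid here because X has full column rank.
B_normal = npl.inv(X.T.dot(X)).dot(X.T).dot(y)
# npl.lstsq solves the same least-squares problem directly.
B_lstsq, _, _, _ = npl.lstsq(X, y, rcond=None)
print(np.allclose(B_pinv, B_normal), np.allclose(B_pinv, B_lstsq))
```

The pseudoinverse form is the one that generalizes: it still gives a (minimum-norm) solution when the design is rank deficient, where `npl.inv` would fail.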
Build the t statistic:
$$
\hat{\sigma}^2 = \frac{1}{n - \rank(\Xmat)} \sum_i \hat{e}_i^2 \\
t = \frac{\cvec^T \bhat}
{\sqrt{\hat{\sigma}^2 \cvec^T (\Xmat^T \Xmat)^+ \cvec}}
$$
```{python}
# Contrast vector selects slope parameter
c = np.array([0, 1])
df = n - npl.matrix_rank(X)
sigma_2 = np.sum(E ** 2) / df
c_b_cov = c.dot(npl.pinv(X.T.dot(X))).dot(c)
t = c.dot(B) / np.sqrt(sigma_2 * c_b_cov)
t
```
Test the t statistic against a t distribution with `df` degrees of freedom:
```{python}
import scipy.stats
t_dist = scipy.stats.t(df=df)
# One-tailed p value (t here is positive)
p_value = 1 - t_dist.cdf(t)
p_value
# Two-tailed p value is 2 * the one-tailed value, because the
# distribution is symmetric about zero
2 * p_value
```
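An aside, not needed for the validation: `1 - t_dist.cdf(t)` loses precision when `t` is large, because the CDF is then very close to 1. The `scipy.stats` distributions also provide the survival function `sf`, which computes $1 - \textrm{cdf}$ directly; it agrees with the subtraction for moderate `t` and is the safer idiom in general:

```{python}
import numpy as np
import scipy.stats

t_dist = scipy.stats.t(df=18)
# sf is the survival function: 1 - cdf, computed directly rather than
# by subtracting from 1, so it keeps precision far into the tail.
p_sf = t_dist.sf(2.0)
p_naive = 1 - t_dist.cdf(2.0)
print(p_sf, p_naive)
```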
Now do the same test with `scipy.stats.linregress`:
```{python}
res = scipy.stats.linregress(x, y)
res.slope
res.intercept
# This is the same as the manual GLM fit
np.allclose(B, [res.intercept, res.slope])
# p value is always two-tailed
res.pvalue
np.allclose(p_value * 2, res.pvalue)
```
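A further check we can add: `linregress` also returns `stderr`, the standard error of the slope, and the t statistic is just the slope estimate divided by its standard error. Recomputing the two-tailed p value that way (standalone cell, data rebuilt from the same seed):

```{python}
import numpy as np
import scipy.stats

np.random.seed(1966)
n = 20
x = np.random.normal(10, 2, size=n)
y = np.random.normal(20, 1, size=n)
res = scipy.stats.linregress(x, y)
# t statistic for the slope = estimate / standard error.
t_from_res = res.slope / res.stderr
# Two-tailed p value from the t distribution with n - 2 degrees of freedom.
p_from_t = 2 * scipy.stats.t(df=n - 2).sf(np.abs(t_from_res))
print(np.isclose(p_from_t, res.pvalue))
```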
Now run the same procedure for a two-sample t-test, splitting the 20 observations into two groups of 10:
```{python}
X2 = np.zeros((n, 2))
X2[:10, 0] = 1
X2[10:, 1] = 1
X2
B2 = npl.pinv(X2).dot(y)
E2 = y - X2.dot(B2)
c2 = np.array([-1, 1])
df = n - npl.matrix_rank(X2)
sigma_2 = np.sum(E2 ** 2) / df
c_b_cov = c2.dot(npl.pinv(X2.T.dot(X2))).dot(c2)
t = c2.dot(B2) / np.sqrt(sigma_2 * c_b_cov)
t
t_dist = scipy.stats.t(df=df)
# One-tailed p value; t is negative, so take the lower tail
p_value_2 = t_dist.cdf(t)
p_value_2
# Two-tailed p value
p_value_2 * 2
```
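An equivalent parameterization, added here for comparison: an intercept column plus a single group indicator. The indicator's coefficient is then directly the difference in group means, and the `[0, 1]` contrast gives the same t statistic as the two-indicator design above (standalone cell, data rebuilt from the same seed):

```{python}
import numpy as np
import numpy.linalg as npl

np.random.seed(1966)
n = 20
_ = np.random.normal(10, 2, size=n)  # keep the random stream aligned
y = np.random.normal(20, 1, size=n)
# Intercept plus indicator for membership in the second group.
X3 = np.ones((n, 2))
X3[:10, 1] = 0
B3 = npl.pinv(X3).dot(y)
E3 = y - X3.dot(B3)
c3 = np.array([0, 1])  # selects the group-difference parameter
df = n - npl.matrix_rank(X3)
sigma_2 = np.sum(E3 ** 2) / df
c_b_cov = c3.dot(npl.pinv(X3.T.dot(X3))).dot(c3)
t3 = c3.dot(B3) / np.sqrt(sigma_2 * c_b_cov)
print(t3)
```

The fitted values, residuals and degrees of freedom are identical in the two designs; only the meaning of the parameters changes.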
The same test using `scipy.stats.ttest_ind`, the t-test for two independent
samples. Note that the sign of the returned statistic is opposite to our t,
because `ttest_ind(a, b)` tests `mean(a) - mean(b)`, the reverse of our
`[-1, 1]` contrast:
```{python}
scipy.stats.ttest_ind(y[:10], y[10:])
```
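To confirm the agreement numerically (a check added here), compare the `ttest_ind` result against the GLM contrast t, with the argument order swapped to match the `[-1, 1]` contrast (group 2 minus group 1); the cell rebuilds the data so it runs standalone:

```{python}
import numpy as np
import numpy.linalg as npl
import scipy.stats

np.random.seed(1966)
n = 20
_ = np.random.normal(10, 2, size=n)  # keep the random stream aligned
y = np.random.normal(20, 1, size=n)
# Two-sample design: one indicator column per group.
X2 = np.zeros((n, 2))
X2[:10, 0] = 1
X2[10:, 1] = 1
B2 = npl.pinv(X2).dot(y)
E2 = y - X2.dot(B2)
c2 = np.array([-1, 1])
df = n - npl.matrix_rank(X2)
sigma_2 = np.sum(E2 ** 2) / df
c_b_cov = c2.dot(npl.pinv(X2.T.dot(X2))).dot(c2)
t = c2.dot(B2) / np.sqrt(sigma_2 * c_b_cov)
# Argument order matched to the contrast: group 2 minus group 1.
res = scipy.stats.ttest_ind(y[10:], y[:10])
print(np.isclose(t, res.statistic))
```

`ttest_ind` assumes equal group variances by default (`equal_var=True`), which is exactly the pooled-variance assumption the GLM makes here.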