<a href="https://colab.research.google.com/github/lawgorithm/large_scale_inference/blob/main/Hotelling_T2_Tests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import f

# Simulate observed data

In [2]:
def generate_normal_normal_samples(n_samples, mu, tau2):
    """
    Generates samples from a Normal-Normal hierarchical model. The observation
    variances V are sampled from an Inverse Gamma distribution.

    Args:
        n_samples (int): The number of samples to generate.
        mu (float): The prior mean of theta.
        tau2 (float): The prior variance of theta.

    Returns:
        tuple: A tuple containing:
            - theta (array-like): The generated theta values.
            - y (array-like): The generated y values.
            - V (array-like): The generated V values (observation variances).
    """

    # Generate theta values from the prior distribution
    theta = np.random.normal(mu, np.sqrt(tau2), n_samples)

    # Simulate V values from an Inverse Gamma distribution (example prior)
    alpha = 5  # Shape parameter for the Inverse Gamma
    beta = 5   # Scale parameter for the Inverse Gamma
    V = 1.0 / np.random.gamma(alpha, 1.0 / beta, n_samples)

    # Generate y values from the conditional distribution (vectorized)
    y = np.random.normal(theta, np.sqrt(V))

    return theta, y, V

In [3]:
# Example usage:
n_samples = 100
mu = 0
tau2 = 2

theta, y, V = generate_normal_normal_samples(n_samples, mu, tau2)

np.quantile(V, [0.025, 0.975])

# Note: alpha and beta are hard-coded inside generate_normal_normal_samples,
# so changing the prior on V means editing them there. For example,
# alpha = beta = 10 concentrates V more tightly around 1
# (the Inverse Gamma mean is beta / (alpha - 1) = 10/9).

df = pd.DataFrame(
    {
        "true_theta": theta,
        "y": y,
        "SE": V**0.5
    }
)

In [4]:
df

Unnamed: 0,true_theta,y,SE
0,0.610196,0.963685,1.519286
1,0.288472,0.648897,1.426743
2,1.012014,-0.285794,1.479295
3,2.132634,2.410277,1.019552
4,-0.481728,-1.315689,1.039709
...,...,...,...
95,-0.137834,-0.104785,0.711252
96,-0.718057,0.507717,0.789721
97,-0.806625,-1.272065,1.177175
98,0.797020,-1.672403,1.115580


# One-Sample Hotelling's T² Test

**Overview**

The purpose of the one-sample test is to perform inference on the mean vector of a multivariate normal distribution. For example, are the means all zero? Hotelling's T² test is a simultaneous test, so it gives a more holistic picture than testing each variable separately (which would inflate the Type I error rate through multiple comparisons). It accounts not just for the difference in means for each variable, but also for the covariance (i.e., the way the variables are correlated with each other).

**References**

* https://en.wikipedia.org/wiki/Hotelling%27s_T-squared_distribution

**Statistical Model**

$$
(x_1, x_2, \ldots, x_n) \stackrel{\text{iid}}{\sim} \operatorname{MVN}(\mu, \Sigma) $$

where $\mu \in \mathbb{R}^p$ and $\Sigma \in \mathbb{R}^{p \times p}$, with $\Sigma$ being a positive definite covariance matrix.

**Multivariate Testing:**

For the one-sample test, the null hypothesis is:

$$ H_0: \mu = \mu_0 $$

For example, suppose you ran 100 randomized experiments and your treatment effect estimates are all normally distributed. This test could be used to test whether all the true effects are actually zero.



**How It Works:**

The test statistic is based on the squared distance between the observed sample mean vector and the hypothesized mean vector, scaled by the inverse of the sample covariance matrix. This quantity is the squared Mahalanobis distance.

**Test Statistic:**

For the one-sample case, the statistic is given by:

$$ T^2 = n(\bar{\mathbf{x}} - \mathbf{\mu_0})^T S^{-1} (\bar{\mathbf{x}} - \mathbf{\mu_0})$$

where:

*   n is the sample size.
*   $\bar{\mathbf{x}}$ is the vector of sample means.
*   $\mathbf{\mu_0}$ is the hypothesized mean vector.
*   S is the sample covariance matrix (unbiased)

$$ S = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$$
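
As a minimal sketch of the quantities above, the snippet below builds S from its definition, checks it against `np.cov`, and evaluates T². The simulated data, the seed, and the variable names (`X`, `centered`, `S_matches_numpy`) are illustrative assumptions, not part of the analysis in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
n, p = 50, 3                     # n > p so that S is invertible
X = rng.normal(size=(n, p))      # rows are observations
mu0 = np.zeros(p)                # hypothesized mean vector

x_bar = X.mean(axis=0)
centered = X - x_bar
S = centered.T @ centered / (n - 1)   # unbiased sample covariance

# np.cov with rowvar=False uses the same n - 1 normalization
S_matches_numpy = np.allclose(S, np.cov(X, rowvar=False))

diff = x_bar - mu0
# T² = n * (squared Mahalanobis distance); solve() avoids an explicit inverse
T2 = n * diff @ np.linalg.solve(S, diff)
print(S_matches_numpy, T2)
```

Using `np.linalg.solve` rather than forming the inverse of S is a numerical design choice; it computes the same quadratic form.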

This statistic can be converted to an F-distributed statistic for hypothesis testing. The relevant statistic is:

$$ F = \frac{n - p}{(n - 1) p} \, T^2 $$

Under the null, F has the following distribution:

$$ F \sim F_{p,n - p}$$
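
One way to sanity-check this null distribution is a small Monte Carlo experiment: simulate many datasets for which the null is true and confirm that the test rejects at roughly the nominal rate. The sketch below assumes standard normal data with n > p; the sample sizes, seed, and replicate count are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)          # illustrative seed
n, p, alpha, n_rep = 30, 4, 0.05, 2000
crit = f.ppf(1 - alpha, p, n - p)       # critical value of F_{p, n-p}

rejections = 0
for _ in range(n_rep):
    X = rng.normal(size=(n, p))         # null is true: mu = mu0 = 0
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)         # unbiased sample covariance
    T2 = n * x_bar @ np.linalg.solve(S, x_bar)
    F_stat = (n - p) / ((n - 1) * p) * T2
    rejections += int(F_stat > crit)

rejection_rate = rejections / n_rep     # should be close to alpha
print(rejection_rate)
```

If the F conversion is correct, the printed rejection rate should land near 0.05, up to Monte Carlo error.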

**Intuition**

To gather intuition for this test statistic, consider the easier case where $\Sigma$ is known. Recall that $\bar{\mathbf{x}} - \mathbf{\mu_0}$ is mean zero under the null.

For an observation $\mathbf{x}$, define $\mathbf{z} = \mathbf{x} - \mathbf{\mu_0}$.
Under the null, we have:

$$
\mathbf{z} \stackrel{\text{iid}}{\sim} \operatorname{MVN}(\mathbf{0}, \Sigma) $$

Furthermore, let $\Omega = (\Sigma^{-1})^{1/2}$ be a matrix square root of the inverse of $\Sigma$. Then define $\mathbf{v} = \Omega \mathbf{z}$. We then have:

$$
\mathbf{v} \stackrel{\text{iid}}{\sim} \operatorname{MVN}(\mathbf{0}, I) $$

The inner product of $\mathbf{v}$ with itself is therefore a sum of $p$ squared independent standard normals, which is $\chi^2$ with $p$ degrees of freedom:

$$\mathbf{v}^T\mathbf{v} \sim \chi^2_{p} $$

So we can do inference under the null with these transformations, provided that we can find the matrix square root of $\Sigma^{-1}$. Unpacking it all, we have:

$$(\Omega \mathbf{z})^T(\Omega \mathbf{z}) \sim \chi^2_{p} $$
$$\mathbf{z}^T \Omega^T \Omega \mathbf{z} \sim \chi^2_{p} $$
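$$\mathbf{z}^T ((\Sigma^{-1})^{1/2})^T (\Sigma^{-1})^{1/2} \mathbf{z} \sim \chi^2_{p} $$
$$\mathbf{z}^T \Sigma^{-1}\mathbf{z} \sim \chi^2_{p} $$
$$(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu_0}) \sim \chi^2_{p} $$

The same argument applies to the sample mean: under the null, $\sqrt{n}(\bar{\mathbf{x}} - \mathbf{\mu_0}) \sim \operatorname{MVN}(\mathbf{0}, \Sigma)$, so $n(\bar{\mathbf{x}} - \mathbf{\mu_0})^T \Sigma^{-1} (\bar{\mathbf{x}} - \mathbf{\mu_0}) \sim \chi^2_{p}$. This is the $T^2$ statistic with the known $\Sigma$ in place of $S$.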

Note that a symmetric positive definite (SPD) matrix is always invertible, and its inverse is also SPD. An SPD matrix always has a unique SPD square root.
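
The whitening argument can be checked numerically. The sketch below assumes an arbitrary SPD matrix `Sigma` (constructed here purely for illustration), builds the symmetric square root of its inverse from an eigendecomposition, and confirms both that the transformed covariance is the identity and that $\mathbf{v}^T\mathbf{v}$ equals $\mathbf{z}^T \Sigma^{-1} \mathbf{z}$.

```python
import numpy as np

rng = np.random.default_rng(2)          # illustrative seed
p = 3
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)         # an arbitrary SPD covariance matrix

# Symmetric square root of Sigma^{-1} via the eigendecomposition of Sigma
eigvals, eigvecs = np.linalg.eigh(Sigma)
Omega = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

z = rng.multivariate_normal(np.zeros(p), Sigma)   # one draw under the null
v = Omega @ z                                     # whitened vector

whitened_cov = Omega @ Sigma @ Omega.T            # should be the identity
quad_form = z @ np.linalg.solve(Sigma, z)         # z^T Sigma^{-1} z
print(np.allclose(whitened_cov, np.eye(p)), np.isclose(v @ v, quad_form))
```

The eigendecomposition is only one way to obtain a square root; a Cholesky factor of $\Sigma^{-1}$ would give a different $\Omega$ with the same quadratic form.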

**Assumptions:**


* Multivariate Normality: We assume that the observations are independently drawn from a multivariate normal distribution.
* n > p
    * For a general covariance matrix, there are a lot of parameters, so we need some minimum number of data points to identify them. In particular, S needs to be full rank to be invertible. S is built from n deviations from the sample mean, and those deviations sum to zero, so its rank is at most n - 1. S is therefore singular whenever n ≤ p.
    * Note: we don't have this problem in the special case where the covariance matrix is diagonal (independence between the entries). In that case, we only need non-zero variance estimates, so n > 1 is sufficient.
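
A quick numerical illustration of the n > p requirement: with fewer observations than variables, the sample covariance matrix is rank deficient and cannot be inverted reliably. The sizes below (n = 19, p = 28) and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)      # illustrative seed
n, p = 19, 28                       # fewer observations than variables
X = rng.normal(size=(n, p))

S = np.cov(X, rowvar=False)         # p x p sample covariance
rank_S = np.linalg.matrix_rank(S)   # at most n - 1 because of mean centering
print(S.shape, rank_S, rank_S < p)
```

Because the rank falls short of p, any "inverse" returned for S is numerical noise, which is why T² computed in this regime is not meaningful.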


**Why Use It?**

Use it when you have several interrelated dependent variables and want to test for an overall difference in their means.

**Summary**

Hotelling's T² test extends the idea of comparing means from one dimension to multiple dimensions by incorporating the relationships between variables via their covariance. It's a powerful tool when you're dealing with multivariate data and want to make joint inferences about several means at once.

In [5]:
def hotelling_t2_from_stats(x_bar, S, mu0, n):
    """
    Performs the one-sample Hotelling's T² test given the sample mean vector,
    the unbiased sample covariance matrix, and the hypothesized mean vector.

    Parameters:
      x_bar : array_like
          Sample mean vector (p-dimensional).
      S : array_like
          Unbiased sample covariance matrix (p x p), computed as:
          S = (1/(n-1)) * sum[(x_i - x_bar)(x_i - x_bar)^T].
      mu0 : array_like
          Hypothesized mean vector (p-dimensional).
      n : int
          Sample size.

    Returns:
      T2 : float
          Hotelling's T² statistic.
      F_stat : float
          Corresponding F statistic.
      p_value : float
          p-value for the test.
    """
    x_bar = np.asarray(x_bar)
    mu0 = np.asarray(mu0)
    diff = x_bar - mu0
    # Compute T² statistic: T² = n * (x_bar - mu0)^T S^{-1} (x_bar - mu0)
    T2 = n * np.dot(np.dot(diff.T, np.linalg.inv(S)), diff)

    p = len(x_bar)  # dimensionality
    # Convert T² to an F statistic:
    # Under H0, F = [(n - p)/(p (n-1))] * T² ~ F_{p, n-p}
    F_stat = (n - p) / (p * (n - 1)) * T2
    # Survival function (1 - CDF); more accurate than 1 - f.cdf in the far tail
    p_value = f.sf(F_stat, p, n - p)

    return T2, F_stat, p_value

In [10]:
# Example usage:
np.random.seed(42)
n = 19  # sample size
p = 28  # number of variables (note n < p, so S below is singular)

# Generate some multivariate data (each row is an observation)
data = np.random.randn(n, p)

# Compute the sample mean and unbiased covariance matrix
x_bar = np.mean(data, axis=0)
S = np.cov(data, rowvar=False)

# Hypothesized mean vector (e.g., all zeros)
mu0 = np.zeros(p)

T2, F_stat, p_val = hotelling_t2_from_stats(x_bar, S, mu0, n)
print("Hotelling's T² statistic:", T2)
print("F statistic:", F_stat)
print("p-value:", p_val)

Hotelling's T² statistic: -1.3021815736284767e+18
F statistic: 2.3253242386222796e+16
p-value: nan


In [11]:
T2, F_stat, p_val = hotelling_t2_from_stats(
    x_bar=df['y'],
    S=np.diag(df['SE']**2),
    mu0=np.zeros(len(df['y'])),
    n=19)

print(T2, F_stat, p_val)

6580.5484418992255 -296.1246798854651 nan


In [12]:
# The result above is invalid because n < p (n = 19, p = 100), so the F conversion does not apply.
# One might expect the fact that S is diagonal to "save" the test, but it does not.

When n ≤ p, the standard Hotelling's T² statistic (and its conversion to an F statistic) isn't valid because its derivation assumes that (n - 1)S follows a Wishart distribution with n - 1 degrees of freedom and is invertible, which requires n > p. This is true even if you know that the off-diagonals of the covariance matrix are zero (in which case the covariance matrix is still estimable, even though n < p).

In the special case where the covariance is known to be diagonal due to independence, each variable essentially yields a univariate t-statistic. One solution is to combine these independent t-statistics into a single test statistic and then use a simulation (or permutation) approach to compute a p-value. For example, define for each variable

$$ t_i = \frac{\sqrt{n} (\bar{x}_i - \mu_{0,i})}{s_i}$$

where $s_i^2$ is the sample variance of variable $i$ (with n-1 degrees of freedom). Then you can form:

$$T^2 = \sum_{i=1}^p t^2_i$$

Under the null hypothesis, each $t_i$ follows a t-distribution with n-1 degrees of freedom and they are independent. Unfortunately, the distribution of $T^2$ (the sum of p independent squared t-variables) does not have a simple closed form. However, you can approximate its null distribution by simulation. In short, you use computation to obtain the sampling distribution of $T^2$, which you can then use to construct a p-value.
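
The simulation approach described above could be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `sum_t2_test` is hypothetical, the input is assumed to be an n x p matrix of raw observations with independent normal columns, and the add-one correction in the p-value is an extra assumption that keeps the Monte Carlo p-value strictly positive.

```python
import numpy as np

def sum_t2_test(X, mu0, n_sim=10_000, seed=0):
    """Monte Carlo p-value for T2 = sum_i t_i^2 (hypothetical helper).

    Assumes the p columns of X are independent and normally distributed.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    s = X.std(axis=0, ddof=1)                    # per-variable sample SD
    t = np.sqrt(n) * (X.mean(axis=0) - mu0) / s  # univariate t-statistics
    T2_obs = np.sum(t ** 2)
    # Null draws: p independent t_{n-1} variables per replicate, squared and summed
    T2_null = np.sum(rng.standard_t(n - 1, size=(n_sim, p)) ** 2, axis=1)
    # Add-one correction (an assumption here) keeps the p-value above zero
    p_value = (1 + np.sum(T2_null >= T2_obs)) / (n_sim + 1)
    return T2_obs, p_value

rng = np.random.default_rng(4)       # illustrative seed
n, p = 19, 100                       # n < p is fine for this test
X_null = rng.normal(size=(n, p))     # all true means are zero
X_alt = X_null + 1.0                 # all true means shifted to one

T2_null_obs, p_null = sum_t2_test(X_null, np.zeros(p))
T2_alt_obs, p_alt = sum_t2_test(X_alt, np.zeros(p))
print(p_null, p_alt)
```

Note that this test needs the raw replicate observations for each variable (or at least each variable's sample mean, sample variance, and n); it works with n < p because no p x p matrix is ever inverted.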

# Two-Sample Hotelling's T² Test


In [9]:
# TODO