## Meta-analysis exercises for DAIR3



In [1]:
import numpy as np
import pandas as pd
import scipy.stats.distributions as dist

**Background:** The function _gen_study_dat_ below simulates data from multiple independent two-arm studies.  We are able to specify the number of studies (n_study), the population effect size (pes), and some parameters that control how the per-arm sample sizes are generated (arm_size_mean, arm_size_cv, and arm_size_cor).  Finally, var_cv controls how the variances of the data in each study are simulated.  The average data variance is 1, but different studies have different variances and the [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) of these variances is given by var_cv.

In [2]:
def gen_study_dat(n_study, pes, arm_size_mean, arm_size_cv, arm_size_cor, var_cv):
    """
    Simulate data for meta-analysis.  Each study in the meta-analysis is a two-arm study.
    The population effect sizes are identical when pes is a scalar (homogeneous studies)
    and may differ across studies when pes is a vector (heterogeneous studies).
    
    Parameters
    ----------
    n_study : number of studies
    pes : population effect size (can be scalar for homogeneous or vector for heterogeneous studies)
    arm_size_mean : the expected sample size of one study arm
    arm_size_cv : the coefficient of variation of study arm sizes
    arm_size_cor : the correlation between the sample sizes of the two arms (on copula scale)
    var_cv : the coefficient of variation of the unexplained variance
    
    Notes
    -----
    The unexplained variance always has mean 1.
    """
    # Generate sample sizes for two arms in each study using a Gaussian copula
    z = np.random.normal(size=(n_study, 2))
    z[:, 1] = arm_size_cor*z[:, 0] + np.sqrt(1-arm_size_cor**2)*z[:, 1]
    u = dist.norm.cdf(z)
    v = (arm_size_mean * arm_size_cv)**2
    a = arm_size_mean**2 / v
    b = v / arm_size_mean
    N = dist.gamma(a, scale=b).ppf(u)
    N = np.ceil(N).astype(int)
    N1 = N[:, 0]
    N2 = N[:, 1]
    
    # Now generate the per-study scale sig (gamma with mean 1 and CV var_cv);
    # it is used as a standard deviation below
    v = var_cv**2
    sig = np.random.gamma(1/v, scale=v, size=n_study)
    
    f = (N1 + N2) / (N1 * N2)
    se = np.sqrt(sig**2 * f)
    md = pes + np.random.normal(size=n_study) * se
    
    return md, sig, N1, N2
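
The sample-size step in _gen_study_dat_ matches a gamma distribution to a target mean and coefficient of variation. A minimal sketch of that moment matching is shown below; the helper name `gamma_from_mean_cv` and the seed are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def gamma_from_mean_cv(mean, cv):
    """Return (shape, scale) of a gamma distribution with the given mean and CV."""
    var = (mean * cv) ** 2      # variance implied by the mean and CV
    shape = mean ** 2 / var     # algebraically equal to 1 / cv**2
    scale = var / mean          # algebraically equal to mean * cv**2
    return shape, scale

# Same arm-size settings as the call used for Exercise 1
shape, scale = gamma_from_mean_cv(30, 0.5)

# Empirical check with a seeded generator (seed chosen arbitrarily)
rng = np.random.default_rng(0)
draws = rng.gamma(shape, scale=scale, size=200_000)
emp_mean = draws.mean()
emp_cv = draws.std() / draws.mean()
print(shape, scale, emp_mean, emp_cv)
```

Because _gen_study_dat_ rounds the gamma draws up with `np.ceil`, the realized average arm size is slightly larger than arm_size_mean.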

In [3]:
def cochrane_Q(md, se):
    """
    Assess heterogeneity in a meta-analysis using Cochran's Q test.
    
    Parameters:
    -----------
    md : vector of point estimates (e.g. mean differences)
    se : vector of standard errors
    
    Returns:
    --------
    q : Cochran's Q statistic
    pval : A p-value testing the null hypothesis of no heterogeneity
    I2 : The I2 statistic quantifying the extent of heterogeneity
    """
    w = 1 / se**2
    w /= w.sum()
    pe = np.dot(md, w) # pooled estimate
    q = np.sum((md - pe)**2 / se**2) # q-statistic
    pval = 1 - dist.chi2(len(md) - 1).cdf(q)
    I2 = max(0.0, 1 - (len(md) - 1) / q)  # truncated at zero by convention
    return q, pval, I2
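
For reference, the quantities computed in _cochrane_Q_ can be written out as follows. The notation is introduced here for illustration: $k$ is the number of studies, $\hat\theta_i$ is the point estimate from study $i$, and $s_i$ is its standard error.

```latex
w_i = \frac{1/s_i^2}{\sum_{j=1}^{k} 1/s_j^2}, \qquad
\hat\theta = \sum_{i=1}^{k} w_i \hat\theta_i, \qquad
Q = \sum_{i=1}^{k} \frac{(\hat\theta_i - \hat\theta)^2}{s_i^2}, \qquad
I^2 = 1 - \frac{k-1}{Q}
```

Under homogeneity, $Q$ is approximately chi-square distributed with $k-1$ degrees of freedom, which is the reference distribution used for the p-value. Many references truncate $I^2$ at zero when $Q < k-1$.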

In [5]:
md, sig, N1, N2 = gen_study_dat(20, 0.1, 30, 0.5, 0.7, 0.6)

**Exercise 1:**

The cell above generates data from 20 two-arm studies that are suitable for a meta-analysis.  The vectors _md_, _sig_, _N1_, and _N2_ contain, respectively, the estimated mean difference, estimated pooled standard deviation, arm 1 sample size, and arm 2 sample size for each study.  All results pertain to a set of 20 studies estimating a common parameter of interest.

Answer the following sequence of questions, which could arise in a meta-analysis using these data.

a: Efficiently estimate the consensus effect size

b: Estimate the standard error of the consensus effect size from part a

c: Modify the study characteristics (the parameters in _gen_study_dat_) to identify a setting where we reject the null hypothesis of zero consensus effect around half of the time (review the documentation for the _gen_study_dat_ function above to understand what the parameters mean). 

d: In the scenario that you constructed in part c, around how many of the studies would have been considered to produce statistically significant evidence of an effect if considered in isolation?

e: Configure the parameters for _gen_study_dat_ as you like, then calculate a p-value for each study (considered in isolation), and use Fisher's method to produce an overall p-value.  Try to find a setting where the p-value for Fisher's method is less than 0.05 around half of the time.

f: Generate data using _gen_study_dat_ that has homogeneous effect sizes.  Calculate the Cochran Q-statistic and confirm that the results are what you would expect.  Then generate data having heterogeneous effect sizes, and again check the results of the Cochran Q-statistic.

In [6]:
# part a
f = (N1 + N2) / (N1 * N2)
se = sig * np.sqrt(f)
w = 1 / se**2
w /= w.sum()
np.dot(md, w)

0.1309356177925515

In [7]:
# part b
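va = 1 / np.sum(1 / se**2)  # variance of the inverse-variance weighted estimate
se_pooled = np.sqrt(va)     # separate name so the per-study se is not overwritten
se_pooled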

0.042845065007678325
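
Parts a and b can be combined into one helper. The following is a minimal sketch under the fixed-effect assumption that all studies estimate the same parameter and that the standard errors are known; the function name `pooled_estimate` and the toy inputs are made up for illustration.

```python
import numpy as np

def pooled_estimate(md, se):
    """Inverse-variance weighted estimate and its standard error."""
    md = np.asarray(md, dtype=float)
    se = np.asarray(se, dtype=float)
    prec = 1 / se**2                       # per-study precision
    est = np.sum(prec * md) / prec.sum()   # precision-weighted mean
    est_se = np.sqrt(1 / prec.sum())       # SE of the weighted mean
    return est, est_se

# Toy inputs (not notebook output): two equally precise studies
est_equal, se_equal = pooled_estimate([0.0, 1.0], [1.0, 1.0])

# Toy inputs: the second study is half as precise, so it gets less weight
est_uneven, se_uneven = pooled_estimate([0.0, 1.0], [1.0, 2.0])
print(est_equal, se_equal, est_uneven, se_uneven)
```

With equal standard errors the estimate reduces to the simple average; with unequal standard errors it is pulled toward the more precise study.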

In [8]:
# part c
md, sig, N1, N2 = gen_study_dat(20, 0.1, 20, 0.5, 0.7, 0.1)
f = (N1 + N2) / (N1 * N2)
se = sig * np.sqrt(f)
w = 1 / se**2
w /= w.sum()
es = np.dot(md, w)
va = 1 / np.sum(1 / se**2)  # variance of the pooled estimate
se_pooled = np.sqrt(va)
es / se_pooled  # z-statistic; reject at the 5% level when |z| > 1.96

1.4592409926601186
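
A single z-statistic does not show how often the null hypothesis is rejected. One way to approach part c is to repeat the simulation many times and record the rejection rate. The sketch below assumes a compact re-implementation of _gen_study_dat_, here called `simulate_studies`, that takes a seeded generator so the results are reproducible; that function, `rejection_rate`, the seed, and the grid of effect sizes are all illustrative choices rather than part of the notebook.

```python
import numpy as np
from scipy import stats

def simulate_studies(rng, n_study, pes, arm_size_mean, arm_size_cv,
                     arm_size_cor, var_cv):
    """Mirror of gen_study_dat that returns (md, se) and uses a Generator."""
    z = rng.normal(size=(n_study, 2))
    z[:, 1] = arm_size_cor * z[:, 0] + np.sqrt(1 - arm_size_cor**2) * z[:, 1]
    u = stats.norm.cdf(z)
    v = (arm_size_mean * arm_size_cv) ** 2
    gam = stats.gamma(arm_size_mean**2 / v, scale=v / arm_size_mean)
    N = np.ceil(gam.ppf(u)).astype(int)
    N1, N2 = N[:, 0], N[:, 1]
    sig = rng.gamma(1 / var_cv**2, scale=var_cv**2, size=n_study)
    se = sig * np.sqrt((N1 + N2) / (N1 * N2))
    md = pes + rng.normal(size=n_study) * se
    return md, se

def rejection_rate(pes, n_rep=500, seed=0, **kwargs):
    """Fraction of simulated meta-analyses whose pooled |z| exceeds 1.96."""
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(0.975)
    n_reject = 0
    for _ in range(n_rep):
        md, se = simulate_studies(rng, pes=pes, **kwargs)
        prec = 1 / se**2
        est = np.sum(prec * md) / prec.sum()
        est_se = np.sqrt(1 / prec.sum())
        n_reject += int(abs(est / est_se) > z_crit)
    return n_reject / n_rep

settings = dict(n_study=20, arm_size_mean=20, arm_size_cv=0.5,
                arm_size_cor=0.7, var_cv=0.1)

# Scan a few effect sizes and look for a rejection rate near 0.5
rates = {pes: rejection_rate(pes, **settings) for pes in [0.0, 0.1, 0.15, 0.3]}
print(rates)
```

The rate at an effect size of zero estimates the type I error rate, which serves as a sanity check on the simulation before tuning the parameters.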

In [9]:
# part d
md, sig, N1, N2 = gen_study_dat(20, 0.1, 20, 0.5, 0.7, 0.1)
f = (N1 + N2) / (N1 * N2)
se = sig * np.sqrt(f)
(np.abs(md / se) > 2).sum()

0
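
The count above comes from one simulated data set. For a rough expectation, the power of a single study can be approximated analytically. The sketch below assumes an idealized study with exactly 20 subjects per arm, a unit standard deviation, and a known standard error; `two_sided_power` is a hypothetical helper, and the 1.96 cutoff is used in place of the rounded value 2.

```python
import numpy as np
from scipy import stats

def two_sided_power(effect, se, alpha=0.05):
    """Normal-approximation power of a two-sided z-test with known standard error."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = effect / se
    # Probability that the z-statistic falls in either rejection region
    return stats.norm.cdf(-z_crit + shift) + stats.norm.cdf(-z_crit - shift)

# Idealized study (assumed values): 20 subjects per arm, unit standard deviation
n1 = n2 = 20
se = 1.0 * np.sqrt((n1 + n2) / (n1 * n2))

power = two_sided_power(0.1, se)
expected_hits = 20 * power  # expected number of significant studies out of 20
print(power, expected_hits)
```

In this idealized setting the per-study power is only slightly above the 5% false positive rate, so seeing few or no individually significant studies is unsurprising.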

In [10]:
# part e
md, sig, N1, N2 = gen_study_dat(20, 0.25, 20, 0.5, 0.7, 0.1)
f = (N1 + N2) / (N1 * N2)
se = sig * np.sqrt(f)
w = 1 / se**2
w /= w.sum()
es = np.dot(md, w)
pval = 2*dist.norm.cdf(-np.abs(md / se))
fm = -2*np.log(pval).sum()
meta_p = 1 - dist.chi2(2*len(md)).cdf(fm)
meta_p

0.0014388131015138361
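
Fisher's method as used in part e can be isolated into a small function and cross-checked against `scipy.stats.combine_pvalues`. This is a minimal sketch: `fisher_combined_p` is a hypothetical helper and the p-values are made up for illustration.

```python
import numpy as np
from scipy import stats

def fisher_combined_p(pvals):
    """Combine independent p-values with Fisher's method."""
    pvals = np.asarray(pvals, dtype=float)
    stat = -2 * np.sum(np.log(pvals))            # chi-square with 2k df under the null
    return stats.chi2(2 * len(pvals)).sf(stat)   # upper-tail probability

# Made-up p-values for illustration
pvals = [0.20, 0.08, 0.30, 0.04]
combined = fisher_combined_p(pvals)

# Cross-check against SciPy's implementation
stat_ref, p_ref = stats.combine_pvalues(pvals, method="fisher")
print(combined, p_ref)
```

Note that two-sided p-values carry no information about the direction of each effect, so studies with effects of opposite sign can still combine into a small overall p-value.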

In [42]:
# part f
md, sig, N1, N2 = gen_study_dat(20, 0.25, 20, 0.5, 0.7, 0.1)
f = (N1 + N2) / (N1 * N2)
se = sig * np.sqrt(f)
q, pval, I2 = cochrane_Q(md, se)

md, sig, N1, N2 = gen_study_dat(20, np.random.choice([0, 0.5], size=20), 20, 0.5, 0.7, 0.1)
f = (N1 + N2) / (N1 * N2)
se = sig * np.sqrt(f)
q, pval, I2 = cochrane_Q(md, se)
pval

0.005545465524741777
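
To confirm what to expect from the Q-statistic in part f, it helps to look at its distribution over many replications rather than one draw. The sketch below sidesteps _gen_study_dat_ and assumes a common, known standard error of 0.3 for all studies; the helper `cochran_q`, the seed, and these settings are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def cochran_q(md, se):
    """Cochran's Q statistic for estimates md with standard errors se."""
    w = 1 / se**2
    pe = np.sum(w * md) / w.sum()       # pooled estimate
    return np.sum(w * (md - pe) ** 2)

rng = np.random.default_rng(0)
k, n_rep = 20, 2000
se = np.full(k, 0.3)                    # assumed common standard error

q_hom = np.empty(n_rep)
q_het = np.empty(n_rep)
for r in range(n_rep):
    # Homogeneous: every study estimates the same effect
    q_hom[r] = cochran_q(0.25 + rng.normal(size=k) * se, se)
    # Heterogeneous: each study's effect is 0 or 0.5 at random
    effects = rng.choice([0.0, 0.5], size=k)
    q_het[r] = cochran_q(effects + rng.normal(size=k) * se, se)

crit = stats.chi2(k - 1).ppf(0.95)
rej_hom = np.mean(q_hom > crit)         # should be close to 0.05
rej_het = np.mean(q_het > crit)         # power against this heterogeneity
print(q_hom.mean(), q_het.mean(), rej_hom, rej_het)
```

Under homogeneity the average Q should sit near its degrees of freedom, k - 1, while heterogeneous effects shift Q upward and raise the rejection rate.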