## Meta-analysis exercises for DAIR3



In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import scipy.stats.distributions as dist

**Background:** The function _gen_study_dat_ below simulates data from multiple two-arm studies.  You do not need to understand the internals of this function, but we will use it below to simulate data.  Through this function, we are able to specify the number of studies (n_study), the population effect size (pes), and some parameters that control how the per-arm sample sizes are generated (arm_size_mean, arm_size_cv, and arm_size_cor).  The parameter var_cv controls how the variability of the data in each study is simulated: the within-study standard deviation has mean 1, but it differs across studies, and its [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) is given by var_cv.  Finally, clust_icc sets the correlation among the results of studies that belong to the same cluster; a value of 0 produces unclustered (independent) studies.

In [None]:
def gen_study_dat(n_study, pes, arm_size_mean, arm_size_cv, arm_size_cor, var_cv, clust_icc):
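    """
    Simulate data for meta-analysis.  Each study in the meta-analysis is a two-arm study.
    If pes is a scalar, the population effect sizes are identical (the population is
    homogeneous); if pes is a vector, each study has its own population effect size.
    
    Parameters
    ----------
    n_study : number of studies
    pes : population effect size (can be scalar for homogeneous or vector for heterogeneous studies)
    arm_size_mean : the expected sample size of one study arm
    arm_size_cv : the coefficient of variation of study arm sizes
    arm_size_cor : the correlation between the sample sizes of the two arms (on copula scale)
    var_cv : the coefficient of variation of the within-study standard deviation
    clust_icc : the correlation between estimation errors of studies in the same
        cluster; if 0, the studies are not clustered
    
    Returns
    -------
    md : estimated mean differences (one per study)
    sig : within-study standard deviations
    N1, N2 : sample sizes of the two arms
    clust : cluster labels (integers 0-4), or None if clust_icc is 0
    
    Notes
    -----
    The within-study standard deviation always has mean 1.
    """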
    # Generate sample sizes for two arms in each study using a Gaussian copula
    z = np.random.normal(size=(n_study, 2))
    z[:, 1] = arm_size_cor*z[:, 0] + np.sqrt(1-arm_size_cor**2)*z[:, 1]
    u = dist.norm.cdf(z)
    v = (arm_size_mean * arm_size_cv)**2
    a = arm_size_mean**2 / v
    b = v / arm_size_mean
    N = dist.gamma(a, scale=b).ppf(u)
    N = np.ceil(N).astype(int)
    N1 = N[:, 0]
    N2 = N[:, 1]
    
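    # Now generate per-study standard deviations with mean 1 and
    # coefficient of variation var_cv
    v = var_cv**2
    sig = np.random.gamma(1/v, scale=v, size=n_study)
    
    # Standard error of the mean difference: sig * sqrt(1/N1 + 1/N2)
    f = (N1 + N2) / (N1 * N2)
    se = np.sqrt(sig**2 * f)
    z = np.random.normal(size=n_study)
    if clust_icc == 0:
        clust = None
    else:
        # Assign each study to one of 5 clusters, then mix in a shared
        # component so that errors within a cluster have correlation clust_icc
        clust = np.random.choice(range(5), n_study)
        for i in range(5):
            jj = np.flatnonzero(clust == i)
            if len(jj) > 0:
                z[jj] = np.sqrt(clust_icc)*np.random.normal() + np.sqrt(1 - clust_icc)*z[jj]
    md = pes + z*se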
    
    return md, sig, N1, N2, clust

Below is an implementation of [Cochran's Q test](https://en.wikipedia.org/wiki/Cochran%27s_Q_test), a widely used approach for testing for heterogeneity among study results.  The accompanying $I^2$ statistic quantifies the extent of that heterogeneity.

In [None]:
def cochran_Q(md, se):
    """
    Assess heterogeneity in a meta-analysis using Cochran's Q approach.
    
    Parameters:
    -----------
    md : vector of point estimates (e.g. mean differences)
    se : vector of standard errors
    
    Returns:
    --------
    q : Cochran's Q statistic
    pval : A p-value testing the null hypothesis of no heterogeneity
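    I2 : The I2 statistic quantifying the extent of heterogeneity (truncated at 0)
    """
    w = 1 / se**2 # inverse-variance weights
    w /= w.sum() # normalize the weights to sum to 1
    pe = np.dot(md, w) # pooled estimate
    q = np.sum((md - pe)**2 / se**2) # q-statistic
    df = len(md) - 1 # degrees of freedom under the null hypothesis
    pval = dist.chi2(df).sf(q) # upper tail probability
    I2 = max(0, 1 - df / q) # negative values are conventionally reported as 0
    return q, pval, I2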
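
For reference, the quantities computed by _cochran_Q_ can be written out as follows.  The notation is introduced here for illustration: $\hat\theta_i$ and $s_i$ are the point estimate and standard error of study $i$ (the entries of _md_ and _se_), and $k$ is the number of studies.

$$
w_i = \frac{1}{s_i^2}, \qquad
\hat\theta = \frac{\sum_{i=1}^k w_i \hat\theta_i}{\sum_{i=1}^k w_i}, \qquad
Q = \sum_{i=1}^k w_i \left(\hat\theta_i - \hat\theta\right)^2, \qquad
I^2 = 1 - \frac{k-1}{Q}
$$

Under the null hypothesis of no heterogeneity, $Q$ approximately follows a $\chi^2$ distribution with $k-1$ degrees of freedom, which is the reference distribution used for the p-value.  Many references additionally report negative values of $I^2$ as zero.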

Below we simulate study data with a particular set of parameters, then we present some plots to illustrate the data.

In [None]:
md, sig, N1, N2, clust = gen_study_dat(20, 0.1, 30, 0.5, 0.7, 0.6, 0)

In [None]:
plt.plot(N1, N2, "o")
plt.grid(True)
plt.xlabel("Arm 1 sample size")
plt.ylabel("Arm 2 sample size")

In [None]:
plt.plot(md, sig, "o")
plt.grid(True)
plt.xlabel("Treatment effect")
plt.ylabel("Standard deviation")

**Assessment 1:**

The call to _gen_study_dat_ above generates data from 20 two-arm studies that are suitable for a meta-analysis.  The vectors _md_, _sig_, _N1_, and _N2_ contain, respectively, the estimated mean difference (treatment effect), estimated pooled standard deviation, arm 1 sample size, and arm 2 sample size.  All results pertain to a set of 20 studies estimating a common parameter of interest.
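
Note that _cochran_Q_ expects standard errors, while _gen_study_dat_ returns standard deviations and sample sizes.  As a minimal sketch, the standard error of a two-arm mean difference with pooled standard deviation _sig_ could be computed as below.  The helper name `mean_diff_se` and the demo values are hypothetical and are not output of _gen_study_dat_.

```python
import numpy as np

def mean_diff_se(sig, N1, N2):
    """Standard error of a two-arm mean difference with pooled SD sig."""
    sig = np.asarray(sig, dtype=float)
    N1 = np.asarray(N1, dtype=float)
    N2 = np.asarray(N2, dtype=float)
    return sig * np.sqrt(1 / N1 + 1 / N2)

# Hypothetical values for three studies (illustration only)
sig_demo = np.array([1.0, 0.8, 1.3])
N1_demo = np.array([25, 40, 18])
N2_demo = np.array([30, 35, 22])
se_demo = mean_diff_se(sig_demo, N1_demo, N2_demo)
print(se_demo)
```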

Answer the following sequence of questions, which could arise in a meta-analysis using these data.

a: Efficiently estimate the consensus effect size

b: Estimate the standard error of the consensus effect size from part a

c: Modify the study characteristics (the parameters in _gen_study_dat_) to identify a setting where we reject the null hypothesis of zero consensus effect around half of the time (review the documentation for the _gen_study_dat_ function above to understand what the parameters mean).

d: In the scenario that you constructed in part c, around how many of the studies would have been considered to produce statistically significant evidence of an effect if considered in isolation?

e: Configure the parameters for _gen_study_dat_ as you like, then calculate a p-value for each study (considered in isolation), and use Fisher's method to produce an overall p-value.  Try to find a setting where the p-value for Fisher's method is less than 0.05 around half of the time.
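
As a starting point for these questions, here is a minimal sketch of inverse-variance pooling and of Fisher's method.  It is one possible approach under stated assumptions, not the intended solution: the function names and demo values are hypothetical, the per-study p-values use a two-sided normal approximation, and the studies are assumed independent.

```python
import numpy as np
from scipy import stats

def pooled_estimate(md, se):
    """Inverse-variance weighted estimate and its standard error."""
    md = np.asarray(md, dtype=float)
    w = 1 / np.asarray(se, dtype=float) ** 2
    est = np.sum(w * md) / np.sum(w)
    est_se = np.sqrt(1 / np.sum(w))
    return est, est_se

def two_sided_pvalues(md, se):
    """Per-study two-sided p-values based on a normal approximation."""
    z = np.asarray(md, dtype=float) / np.asarray(se, dtype=float)
    return 2 * stats.norm.sf(np.abs(z))

def fisher_pvalue(pvals):
    """Fisher's method: -2*sum(log p) is chi-square with 2k df under the null."""
    pvals = np.asarray(pvals, dtype=float)
    stat = -2 * np.sum(np.log(pvals))
    return stats.chi2.sf(stat, df=2 * len(pvals))

# Hypothetical mean differences and standard errors for four studies
md_demo = np.array([0.30, 0.10, 0.25, -0.05])
se_demo = np.array([0.20, 0.15, 0.25, 0.30])

est, est_se = pooled_estimate(md_demo, se_demo)
pvals = two_sided_pvalues(md_demo, se_demo)
print("pooled estimate:", est, "standard error:", est_se)
print("studies significant in isolation:", np.sum(pvals < 0.05))
print("Fisher p-value:", fisher_pvalue(pvals))
```

To estimate how often a test rejects, these calculations can be repeated over many simulated data sets and the proportion of p-values below 0.05 recorded.  Because two-sided p-values discard the direction of each effect, one-sided p-values may be preferable for Fisher's method when the direction matters.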

In [None]:
# part 1a

In [None]:
# part 1b

In [None]:
# part 1c

In [None]:
# part 1d

In [None]:
# part 1e

**Assessment 2:**

a: Generate data using _gen_study_dat_ that have homogeneous effect sizes.  Calculate the Cochran Q-statistic and confirm that the results are what you would expect.

b: Generate data using _gen_study_dat_ that have heterogeneous effect sizes.  Calculate the Cochran Q-statistic and briefly interpret your findings.

c: Suppose we have a situation where of our 20 studies, around half were conducted in rural settings and the remainder were conducted in urban settings, with the treatment effect being much stronger in rural compared to urban settings.  Simulate a "rural/urban" variable, and then use _gen_study_dat_ to simulate data reflecting these heterogeneous treatment effects.  Then use generalized least squares to efficiently assess the treatment effect while accounting for both differing precisions of different studies, and heterogeneity due to the difference between rural and urban settings.
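
For part c, generalized least squares with a known diagonal covariance reduces to weighted least squares with weights equal to the inverse squared standard errors.  The sketch below shows this with plain NumPy.  The helper `gls_known_variance`, the effect sizes 0.5 and 0.1, and the simulated standard errors are hypothetical choices for illustration; in the exercise, the vector `pes` would be passed to _gen_study_dat_ instead of simulating _md_ directly.

```python
import numpy as np

def gls_known_variance(y, X, var):
    """GLS with known diagonal covariance diag(var): weights are 1/var."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    w = 1 / np.asarray(var, dtype=float)
    XtWX = X.T @ (w[:, None] * X)
    XtWy = X.T @ (w * y)
    cov = np.linalg.inv(XtWX)  # covariance matrix of the coefficients
    beta = cov @ XtWy
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n_study = 20

# Hypothetical rural/urban indicator: half the studies are rural (1)
rural = rng.permutation(np.repeat([0, 1], n_study // 2))

# Hypothetical population effect sizes: stronger effect in rural settings
pes = np.where(rural == 1, 0.5, 0.1)

# Stand-in for gen_study_dat output: simulated standard errors and estimates
se_demo = rng.uniform(0.1, 0.4, size=n_study)
md_demo = pes + se_demo * rng.standard_normal(n_study)

# Design matrix: intercept (urban effect) and rural indicator (rural - urban)
X = np.column_stack([np.ones(n_study), rural])
beta, beta_se = gls_known_variance(md_demo, X, se_demo**2)
print("urban effect:", beta[0], "rural minus urban:", beta[1])
print("standard errors:", beta_se)
```

The same point estimates could be obtained with `sm.WLS` using weights `1 / se**2`, though its reported standard errors additionally estimate a scale factor rather than treating the study variances as known.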

In [None]:
# part 2a

In [None]:
# part 2b 

In [None]:
# part 2c

**Assessment 3:**

Use _gen_study_dat_ to simulate studies that fall into clusters (e.g. a cluster may be a set of studies using similar protocols).  

a: Use an appropriate approach to estimate the consensus effect size.

b: For comparison, estimate the consensus effect size ignoring the clustering.
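
One possible approach, sketched below, is generalized least squares with a block-structured covariance matrix in which the estimation errors of two studies in the same cluster have correlation `icc`.  This mirrors how _gen_study_dat_ introduces clustering, but it assumes the correlation is known, which would rarely hold in practice.  The function name and demo values are hypothetical.

```python
import numpy as np

def clustered_pooled_estimate(md, se, clust, icc):
    """GLS estimate of a common effect when errors are correlated within clusters."""
    md = np.asarray(md, dtype=float)
    se = np.asarray(se, dtype=float)
    clust = np.asarray(clust)

    # Correlation matrix: icc within a cluster, 0 between clusters, 1 on the diagonal
    same = clust[:, None] == clust[None, :]
    corr = np.where(same, float(icc), 0.0)
    np.fill_diagonal(corr, 1.0)
    sigma = corr * np.outer(se, se)

    ones = np.ones(len(md))
    sinv_ones = np.linalg.solve(sigma, ones)  # Sigma^{-1} 1
    denom = ones @ sinv_ones                  # 1' Sigma^{-1} 1
    est = (sinv_ones @ md) / denom
    return est, np.sqrt(1 / denom)

# Hypothetical data: five studies in two clusters
md_demo = np.array([0.30, 0.20, 0.35, 0.05, 0.10])
se_demo = np.array([0.20, 0.15, 0.25, 0.30, 0.20])
clust_demo = np.array([0, 0, 0, 1, 1])

est_c, se_c = clustered_pooled_estimate(md_demo, se_demo, clust_demo, icc=0.5)
est_0, se_0 = clustered_pooled_estimate(md_demo, se_demo, clust_demo, icc=0.0)
print("accounting for clustering:", est_c, se_c)
print("ignoring clustering:", est_0, se_0)
```

Setting `icc=0` recovers the usual inverse-variance weighted estimate, which gives a direct comparison for part b.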

In [None]:
# part 3a

In [None]:
# part 3b