# Demonstration of cluster-level analysis using ENAT data and Stata
<ul>
    <li>Date: August 4th, 2024</li>
    <li>Author: Luke C. Mullany, PhD MS MHS</li>
</ul>

In this document, we demonstrate simple approaches to cluster-level analysis of randomized controlled trials, using Stata. The data for the demonstration come from the Enhancing Nutrition and Antenatal Infection Treatment (ENAT) study, a cluster-randomized, community-based effectiveness study conducted in 12 rural health centres in Amhara, Ethiopia (<a href="https://bmjpaedsopen.bmj.com/content/6/1/e001327">Lee AC et al, BMJ Paediatr Open. 2022 Jan;6(1):e001327. doi: 10.1136/bmjpo-2021-001327</a>).


## Load the data
We load the dataset `baby_level_cluster_rct_data.dta` and rename the default frame to `main_data`.

In [None]:
set more off
clear all
use "../../source_data_for_git_notebooks/data/baby_level_cluster_rct_data.dta"

frame rename default main_data


The data used here are a subset of the data available from the ENAT study. Specifically, the frame `main_data` holds baby-level information and contains three types of variables:

<ul>
    <li>Identifiers</li>
    <ul>
        <li><code>study_id</code>: this is a unique identifier for the baby</li>
        <li><code>facility_trt</code>: this identifies the treatment arm to which the cluster (facility) was allocated; in ENAT this is either <code>NON-ENP</code> or <code>ENP</code>.</li>
        <li><code>cluster_label</code>: this is an identifier for the cluster name (i.e. the name of the facility)</li>
    </ul>
    <li>Outcomes</li>
    <ul>
        <li><code>livebirth</code>: binary outcome indicating if baby born alive or not</li>
        <li><code>weight</code>: continuous outcome; weight of baby in grams</li>
        <li><code>preterm</code>: binary outcome indicating if baby born preterm or not</li>
        <li><code>gest_age</code>: continuous outcome; gestational age at outcome</li>
    </ul>
    <li>Covariates</li>
    <ul>
        <li><code>mother_ht</code>: height of mother (cm)</li>
        <li><code>mother_age</code>: age of mother</li>
        <li><code>occupation</code>: occupation of mother/head of household</li>
        <li><code>bmi</code>: continuous measure of mother's BMI</li>
        <li><code>mother_parity</code>: parity of mother</li>
        <li><code>education</code>: education of the mother</li>
        <li><code>land_ownership</code>: land ownership status of the household</li>
        <li><code>sex</code>: sex of baby/fetus</li>
    </ul>
</ul>
    
### The first five rows of the data appear below
            
    

In [None]:
%head

## Cluster-level Analysis

In the ENAT study, clusters (not individual participants) were allocated to the nutrition intervention (or not). In this type of design, individual-level analyses that do not account for the cluster randomization will generally underestimate the variance of the effect estimate, leading to a false sense of confidence in that estimate. There are many ways to account for cluster-level randomization: it can be done in individual-level analyses (for example, through random effects models or generalized estimating equations) or at the cluster level, using aggregate analyses. This notebook focuses on the latter approach, which is commonly suggested when the number of clusters is small.
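
One common way to quantify how much the variance is understated is the design effect. The following is a rough sketch under simplifying assumptions: all clusters have the same size $m$, and outcomes within a cluster share an intracluster correlation coefficient $\rho$. Both symbols are introduced here for illustration and are not estimated in this notebook.

$$
\mathrm{DEFF} = 1 + (m - 1)\,\rho
$$

Under these assumptions, the variance of an arm-level mean is roughly $\mathrm{DEFF}$ times the variance that an analysis ignoring clustering would report. For example, with hypothetical values $m = 100$ and $\rho = 0.02$, $\mathrm{DEFF} = 1 + 99 \times 0.02 = 2.98$, so the standard error would be understated by a factor of about $\sqrt{2.98} \approx 1.73$.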

## Simple cluster-level analysis of a continuous variable, without adjustment
For this example, we will use the outcome `weight`. Cluster-level analysis is very simple in this instance: we compute the mean of the outcome variable in each cluster, and then do a t-test of the difference between the treatment groups. Below, we use the `collapse` command and retain the cluster-level (i.e., facility) allocation. Note that we restrict to observations where `weight_time_met==1`.

In [None]:
capture frame copy main_data cluster_level, replace
capture frame change cluster_level

collapse (mean) weight if weight_time_met==1, by(facility_trt cluster_label)

In [None]:
%head 12

Now that we have estimated the cluster-specific mean weights, we can simply do a t-test comparing the six facilities assigned to `ENP` with the six assigned to `NON-ENP`.

In [None]:
ttest weight, by(facility_trt) reverse

So we see that the mean weight of babies was 28.7 grams lower in the ENP clusters than in the NON-ENP clusters, with a confidence interval extending from -101 grams to +44 grams. This is not that different from what we would obtain from an individual-level analysis that accounts for the clustering. Let's examine what we obtain from such an individual-level analysis, using GEE.

In [None]:
frame main_data: xtgee weight facility_trt if weight_time_met==1, i(cluster_label) 

Note that the analysis at the individual level results in a similar estimate and standard error. Above, we used GEE, but we could have obtained a similar result via a simple random effects model:
```stata
xtreg weight facility_trt if weight_time_met==1, mle i(cluster_label)
```

Note that if we did not account for the clustering at all, our estimate of the difference would still be similar, but the standard error would be substantially smaller (i.e., underestimated).


In [None]:
frame change main_data
regress weight facility_trt if weight_time_met==1
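
As a cross-check on the cluster-level t-test used earlier in this section, the same comparison can be written as a regression on the cluster means. The sketch below uses simulated data in a scratch frame so that the ENAT frames are left untouched. The frame name `sim_demo`, the variables `cluster`, `arm` and `u`, and all parameter values are hypothetical and chosen only for illustration.

```stata
* Build a hypothetical dataset in its own frame: 12 clusters, 6 per arm, 50 babies each
capture frame drop sim_demo
frame create sim_demo
frame sim_demo {
    set seed 12345
    set obs 12
    gen cluster = _n
    gen byte arm = cluster > 6
    gen double u = rnormal(0, 50)                   // cluster-level random effect (grams)
    expand 50                                       // 50 babies per cluster
    gen double weight = 3000 + u + rnormal(0, 400)  // no true arm effect is simulated

    * One row per cluster, then the two-sample t-test on the cluster means
    collapse (mean) weight, by(arm cluster)
    quietly ttest weight, by(arm)
    scalar d_ttest = r(mu_1) - r(mu_2)
    scalar t_ttest = r(t)

    * The same comparison as a regression of the cluster means on arm
    quietly regress weight arm
    scalar b_reg = _b[arm]
    scalar t_reg = _b[arm]/_se[arm]
}
di "ttest:   difference = " d_ttest "   t = " t_ttest
di "regress: coefficient = " b_reg "   t = " t_reg
```

The two results agree up to sign, because `ttest` reports group 1 minus group 2 while the regression coefficient is arm 1 minus arm 0. Both use the pooled variance with 10 degrees of freedom.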

## What if we need to adjust for individual-level covariates?
This is also fairly straightforward. Instead of aggregating the outcome itself, we first fit a regression model at the individual level that includes the covariates but ignores treatment allocation. We then compute the mean of the residuals from this model in each cluster and do a t-test on those cluster-level means.
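
In symbols, the two-stage procedure can be sketched as follows. The notation is introduced here for illustration only: $y_{ij}$ is the weight of baby $i$ in cluster $j$, $x_{ij}$ is that baby's covariate vector, $n_j$ is the number of babies in cluster $j$ with a residual, and $J_1$ and $J_0$ are the numbers of clusters in the two arms.

$$
r_{ij} = y_{ij} - \hat{\alpha} - x_{ij}^{\top}\hat{\beta}, \qquad
\bar{r}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} r_{ij}, \qquad
\hat{\delta}_{\text{adj}} = \frac{1}{J_1}\sum_{j \in \text{ENP}} \bar{r}_j \;-\; \frac{1}{J_0}\sum_{j \in \text{NON-ENP}} \bar{r}_j
$$

Here $\hat{\alpha}$ and $\hat{\beta}$ are the least-squares estimates from the individual-level model fitted without the treatment indicator, and the t-test is applied to the $\bar{r}_j$, one value per cluster. The sign of the reported difference depends on the group order given to `ttest`.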

In [None]:
* 0. Refresh the frame where we do our collapsing with a copy of the main data
frame copy main_data cluster_level, replace
frame change cluster_level
keep if weight_time_met==1
* 1. First individual regression model with covariates that need adjustment
quietly regress weight bmi mother_age occupation land_ownership
* 2. Predict the residuals
quietly predict resid, residuals
* 3. get the mean of the residuals at the cluster level and do ttest
collapse (mean) residuals = resid, by(facility_trt cluster_label)
* 4. repeat same t-test from above
ttest residuals, by(facility_trt) reverse


To convince yourself that this is actually an estimate of the treatment effect adjusted for those covariates, repeat the same steps without adjusting for anything, and see that this approach gives the same answer as our original approach above (i.e., in the unadjusted case).



In [None]:
* 0. Refresh the frame where we do our collapsing with a copy of the main data
frame copy main_data cluster_level, replace
frame change cluster_level
keep if weight_time_met==1
* 1. Intercept-only individual-level regression model (no covariates this time)
quietly regress weight
* 2. Predict the residuals
quietly predict resid, residuals
* 3. get the mean of the residuals at the cluster level and do ttest
collapse (mean) mean_weight = weight mean_residual = resid, by(facility_trt cluster_label)
gen difference = mean_weight - mean_residual
list
* 4. repeat same t-test from above
ttest mean_residual, by(facility_trt) reverse

<h2 style="color:red">Why does this work (i.e., give exactly the same answer as the approach above)?</h2>
Notice that the mean residual in each cluster is simply the mean weight in that cluster minus a constant. Since subtracting the same constant from every cluster mean cannot change the t-test result, we get the same answer as above for the unadjusted treatment effect. What <strong>is</strong> this constant difference between the cluster-level mean weights and the cluster-level mean residuals? It is the overall mean weight in the sample, which is exactly what the intercept-only individual-level model estimates.
<br>
Therefore, when the individual-level model instead includes covariates, the residuals are deviations from the covariate-adjusted expected weight, and a t-test of the difference in the cluster-specific means of those residuals gives an estimate of the adjusted treatment effect.
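
The argument can be written out in one line. As a sketch, with notation introduced here for illustration: $y_{ij}$ is the weight of baby $i$ in cluster $j$, $\bar{y}_j$ is the mean weight in cluster $j$, and $\bar{y}$ is the overall mean across all babies in the analysis. An intercept-only least-squares model has fitted value $\bar{y}$ for every baby, so

$$
r_{ij} = y_{ij} - \bar{y}
\quad\Longrightarrow\quad
\bar{r}_j = \bar{y}_j - \bar{y}
$$

Every cluster mean is shifted by the same amount $\bar{y}$, so the difference between the arm means of the $\bar{r}_j$ and their standard deviations are identical to those of the $\bar{y}_j$. The t statistic and confidence interval are therefore unchanged.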
<hr>

## Binary outcomes

What if our outcome is a binary variable, like `preterm`? Here, the approach is similar, but instead of examining the mean of the outcome, we examine the ratio of the observed to the expected number of events. We start with the cluster-level proportions themselves.

In [None]:
* 0. Refresh the frame where we do our collapsing with a copy of the main data
frame copy main_data cluster_level, replace
frame change cluster_level
keep if preterm!=.

In [None]:
collapse (mean) preterm, by(facility_trt cluster_label)
list
ttest preterm, by(facility_trt) reverse

The estimate of relative risk here is just the ratio of the two arm-level means of the cluster-level proportions:

In [None]:
di r(mu_1)/r(mu_2)

Notice that there is a residual-based approach to this as well, which we can later leverage to do cluster-level adjusted analyses for dichotomous outcomes, as we did above for continuous outcomes.

In [None]:
* 0. Refresh the frame where we do our collapsing with a copy of the main data
frame copy main_data cluster_level, replace
frame change cluster_level
keep if preterm!=.
* 1. estimate a log-binomial model of the outcome, without regard to cluster or treatment
qui glm preterm, family(bin) link(log)
* 2. Predict the observed and expected number of outcomes. 
* To do this, we predict the probability from the model, which of course is a constant over the
* entire dataset in this case, since we didn't adjust for anything
predict prob, xb
replace prob = exp(prob)
* The observed is just the sum of the observed preterm variable (because it is a zero-one outcome),
* while the expected is the sum of the predicted probabilities
collapse (sum) expected = prob observed = preterm, by(facility_trt cluster_label)
list

Now we take the ratio of observed over expected and do a t-test on those ratios; we get exactly the same estimate as before. Note that this is an estimate of the relative risk. The observed preterm risk is about 20% lower in the ENP arm, but there is little statistical evidence that the true risk of preterm birth differs between the groups.


In [None]:
capture gen ratio = observed/expected
qui ttest ratio, by(facility_trt) reverse
di r(mu_1)/r(mu_2)
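
To see why the observed/expected ratios reproduce the ratio of cluster-level proportions in the unadjusted case, the sketch below repeats both calculations on simulated data in a scratch frame. The frame name `oe_demo`, the variables `cluster`, `arm`, `y`, `p` and `prob`, and the event probability are all hypothetical. As a shortcut, the overall proportion from `summarize` stands in for the prediction of an intercept-only model.

```stata
* Hypothetical data: 12 clusters of unequal size, 6 per arm, binary outcome y
capture frame drop oe_demo
frame create oe_demo
frame oe_demo {
    set seed 2024
    set obs 12
    gen cluster = _n
    gen byte arm = cluster > 6
    expand 40 + 5*cluster                 // cluster sizes from 45 to 100
    gen byte y = runiform() < 0.15

    * Expected probability with no covariates is the overall proportion
    quietly summarize y
    gen double prob = r(mean)

    * Cluster-level proportion, expected count and observed count in one pass
    collapse (mean) p = y (sum) expected = prob observed = y, by(arm cluster)
    gen double ratio = observed/expected

    * Approach 1: ratio of arm means of the cluster-level proportions
    quietly ttest p, by(arm)
    scalar rr_means = r(mu_1)/r(mu_2)

    * Approach 2: ratio of arm means of observed/expected
    quietly ttest ratio, by(arm)
    scalar rr_oe = r(mu_1)/r(mu_2)
}
di "Ratio of arm means of cluster proportions: " rr_means
di "Ratio of arm means of observed/expected:   " rr_oe
```

Each cluster's observed/expected ratio is its proportion divided by the same overall proportion, so that constant cancels when the two arm means are divided.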

## What about confidence intervals for the relative risk based on cluster-level summaries?
Below is a small program that displays the estimate and its confidence interval. Note that this is an ado file, which can be added temporarily or permanently to your path (see `help sysdir` or `help adopath` for more information). If you examine the code, you will see that it relies on values returned in `r()` by Stata's built-in `ttest` command, and therefore must be called immediately after `ttest`.

```stata
capture program drop t_test_binary_ci

program define t_test_binary_ci
  syntax, [Level(real 95)]
  * get the estimate of relative risk, which is the ratio of the means (of cluster-level means) in group 1 and
  * the means (of cluster-level means) of group 2
  scalar est =  r(mu_1)/r(mu_2)
 
  * approximate CI at the requested level:
  * first get the variance (V) on the log scale, then use the t distribution, with the
  * appropriate degrees of freedom, to get a multiplicative error factor
  scalar V = r(sd_1)^2/(r(N_1)*r(mu_1)^2) + r(sd_2)^2/(r(N_2)*r(mu_2)^2)
  local alpha = (100-`level')/100
  scalar err_factor = exp(invt(r(df_t), `alpha'/2)*sqrt(V))

  * get the lower bound as the estimate multiplied by this factor, while the 
  * upper bound is the estimate divided by this factor (the factor is below 1,
  * because invt() evaluated at alpha/2 is negative)
  scalar lower = est*err_factor
  scalar upper = est/err_factor

  di "{hline 40}"
  di "Est" _col(16) "`level'% CI"
  di "{hline 40}"

  di %4.3f est _col(14) "( " %4.3f lower " - " %4.3f upper " )"
  di "{hline 40}"

end
```
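
The interval computed by the program can be written compactly. This is a sketch of the calculation as read from the code above, with notation introduced here for illustration: $\bar{x}_k$, $s_k$ and $n_k$ are the mean, standard deviation and number of cluster-level summaries in arm $k$ (the `r(mu_k)`, `r(sd_k)` and `r(N_k)` values left behind by `ttest`), $\nu$ is `r(df_t)`, and $\alpha$ is one minus the confidence level.

$$
\widehat{RR} = \frac{\bar{x}_1}{\bar{x}_2}, \qquad
V = \frac{s_1^2}{n_1\,\bar{x}_1^2} + \frac{s_2^2}{n_2\,\bar{x}_2^2}, \qquad
\text{CI} = \widehat{RR}\cdot\exp\!\left(\pm\, t_{\nu,\,1-\alpha/2}\,\sqrt{V}\right)
$$

Here $V$ is a first-order (delta-method) approximation to the variance of $\log\widehat{RR}$, treating the two arms as independent. Because the interval is built on the log scale, it is symmetric around the estimate in ratio terms rather than in absolute terms.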

Here we add the location of this ado file to `adopath`.


In [None]:
capture program drop t_test_binary_ci
capture adopath + "../ado_files/"

In [None]:
qui ttest ratio, by(facility_trt) reverse
t_test_binary_ci, level(99)

## Adjusting for covariates for binary outcomes
We follow a similar approach as before:
1. estimate an adjusted model with only the covariates of interest
2. use the model to get an adjusted-expected number of outcomes per cluster
3. get the ratio of observed over expected for each cluster
4. do a t-test on these ratios, and obtain a CI using the above program

In [None]:
* 0. Refresh the frame where we do our collapsing with a copy of the main data

frame copy main_data cluster_level, replace
frame change cluster_level
quietly {

    * 1. First individual regression model with covariates that need adjustment
    quietly glm preterm bmi mother_age occupation land_ownership, family(binomial) link(log)
    * Keep only the babies used in the model, so that observed and expected counts
    * are summed over the same babies
    keep if e(sample)
    * 2. Get prediction (linear predictor on the log scale, then exponentiate)
    predict prob if e(sample), xb
    replace prob = exp(prob)
    * The observed is just the sum of the observed preterm variable (because it is a zero-one outcome),
    * while the expected is the sum of the predicted probabilities
    collapse (sum) expected = prob observed = preterm, by(facility_trt cluster_label)
    * 3. Create ratio and do t-tests
    gen ratio = observed/expected
    noi ttest ratio, by(facility_trt) reverse
    
}

t_test_binary_ci
