# Multi-level and Marginal Models - Linear and Logistic Regression - NHANES Dataset

This notebook applies multi-level and marginal models for linear and logistic regression to the NHANES dataset. In contrast to ordinary linear and logistic regression, these models assume that the observations are not independent but correlated; this can be due to:

- Correlations within clusters, which arise when a complex sample is drawn in clusters (e.g., by neighborhood).
- Repeated measures over time, typical of longitudinal studies that track how a variable evolves.

Marginal and multi-level models differ from each other in these aspects:

- Multi-level models use random coefficients, i.e., differences between clusters (subject, neighborhood, etc.) are captured as random variables or effects, in addition to the regular fixed effects or constant coefficients. The variance of the random effects is estimated, and as a result each cluster effectively has its own model.
- Marginal models do not have random coefficients; instead, they have the regular linear/logistic regression parameters or coefficients. However, the within-cluster correlation between observations is also estimated and used to adjust the standard errors of the coefficients. This allows for more realistic inference (i.e., hypothesis testing or confidence interval computation). That dependence is summarized by a single overall structure, i.e., cluster-specific effects are not estimated, unlike in multi-level models.
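
As a rough sketch of the distinction above, consider a linear model with a single covariate $x_{ij}$ for observation $i$ in cluster $j$. The symbols $u_j$, $\tau^2$, $\sigma^2$ and $\rho$ are notation introduced here only for illustration, and the marginal model is written with an exchangeable correlation, which is an assumption of this sketch rather than a requirement of marginal models in general.

$$
\begin{aligned}
\text{Multi-level (random intercept):}\quad & y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij}, \quad u_j \sim N(0, \tau^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2) \\
\text{Marginal:}\quad & E[y_{ij} \mid x_{ij}] = \beta_0 + \beta_1 x_{ij}, \quad \operatorname{Corr}(y_{ij}, y_{kj}) = \rho \quad (i \neq k)
\end{aligned}
$$

In the first form the cluster effect $u_j$ is part of the model for each cluster; in the second only the mean structure and an overall within-cluster correlation $\rho$ are specified.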

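Overview of contents:
1. Data Analysis and Pre-Processing
    - 1.1 Cluster Ids
    - 1.2 Intraclass Correlation (ICC)
    - 1.3 Conditional Intraclass Correlation
2. Marginal Models with Dependent Data
    - 2.1 Marginal Model for Linear Regression and Comparison with OLS
    - 2.2 Marginal Model for Logistic Regression and Comparison with GLM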

## 1. Data Analysis and Pre-Processing

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

In [2]:
# Read the data file
da = pd.read_csv("nhanes_2015_2016.csv")

In [5]:
# Keep only the variables used by our models and drop the rest
# SDMVSTRA and SDMVPSU account for clustering in the complex sample:
# https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#SDMVSTRA
# https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#SDMVPSU
# (named keep_vars to avoid shadowing the built-in vars())
keep_vars = ["BPXSY1",
             "RIDAGEYR",
             "RIAGENDR",
             "RIDRETH1",
             "DMDEDUC2",
             "BMXBMI",
             "SMQ020",
             "SDMVSTRA",
             "SDMVPSU"]
# Drop rows with any missing values
da = da[keep_vars].dropna()

### 1.1 Cluster Ids

Roughly speaking, in NHANES the data
are collected by selecting a limited number of counties in the US,
then selecting subregions of these counties, then selecting people
within these subregions.  Since counties are geographically
constrained, it is expected that people within a county are more
similar to each other than they are to people in other counties.

For privacy reasons, county ids are encoded as "masked variance units" (MVUs), which can be retrieved with `SDMVSTRA` and `SDMVPSU`.

In [7]:
da["group"] = 10*da.SDMVSTRA + da.SDMVPSU
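
The encoding above can be checked on a toy example. The stratum and PSU codes below are made up for illustration and are not NHANES data; the sketch assumes, as the formula `10*SDMVSTRA + SDMVPSU` implicitly does, that every PSU code is below 10. It also shows an alternative labeling with `groupby(...).ngroup()` that does not depend on that assumption.

```python
import pandas as pd

# Hypothetical stratum/PSU codes, made up for illustration (not NHANES data)
toy = pd.DataFrame({
    "SDMVSTRA": [119, 119, 120, 120, 120],
    "SDMVPSU":  [1,   2,   1,   2,   1],
})

# Same encoding as in the notebook: unique as long as every PSU code is below 10
toy["group"] = 10 * toy.SDMVSTRA + toy.SDMVPSU

# Alternative that does not rely on the PSU range: label each distinct pair
toy["group_alt"] = toy.groupby(["SDMVSTRA", "SDMVPSU"]).ngroup()

# Both labelings should define the same number of clusters
n_pairs = len(toy[["SDMVSTRA", "SDMVPSU"]].drop_duplicates())
print(toy)
print(n_pairs, toy["group"].nunique(), toy["group_alt"].nunique())
```

Either labeling works as the `groups` argument, since only the partition into clusters matters, not the label values.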

### 1.2 Intraclass Correlation (ICC)

The intraclass correlation (ICC) measures the dependence between observations within the same cluster or group. `ICC = 1` means perfect dependence and `ICC = 0` means independence. Note that the ICC is similar to, but not the same as, Pearson's correlation; seemingly small values like `0.03` are not that small for an ICC, because their effect on standard errors grows with the cluster size.
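
One way to see why an ICC of `0.03` is not negligible is the standard design-effect approximation for a sample mean with equal-sized clusters, `1 + (m - 1) * ICC`, where `m` is the cluster size. The sketch below is only illustrative: the helper `design_effect` and the cluster size of 150 are assumptions introduced here, not values taken from NHANES.

```python
def design_effect(icc, cluster_size):
    """Approximate variance inflation of a sample mean under equal-sized clusters."""
    return 1.0 + (cluster_size - 1) * icc

icc = 0.03          # value quoted in the text
cluster_size = 150  # hypothetical cluster size, chosen only for illustration

deff = design_effect(icc, cluster_size)
se_inflation = deff ** 0.5  # standard errors scale with the square root of the variance
print(f"design effect: {deff:.2f}, SE inflation: {se_inflation:.2f}")
```

With large clusters, even a small ICC multiplies the variance of an estimate several times over compared with an independent sample of the same size.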

In [11]:
# ICC of blood pressure
# Small values like `0.03` are not that small in ICC.
model = sm.GEE.from_formula("BPXSY1 ~ 1", groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=da)
result = model.fit()
print(result.cov_struct.summary())

The correlation between two observations in the same cluster is 0.030
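
To build intuition for what this number measures, the sketch below simulates clustered data with a known ICC and recovers it with the classical one-way ANOVA estimator for balanced clusters. This is a simple moment estimator written for illustration, not the estimation procedure used inside `GEE`; the function `anova_icc` and all simulation settings are assumptions of the example.

```python
import numpy as np

def anova_icc(y, n_clusters, m):
    """One-way ANOVA estimator of the ICC for balanced clusters of size m."""
    y = np.asarray(y, dtype=float).reshape(n_clusters, m)
    cluster_means = y.mean(axis=1)
    # Mean squares between and within clusters
    msb = m * np.sum((cluster_means - y.mean()) ** 2) / (n_clusters - 1)
    msw = np.sum((y - cluster_means[:, None]) ** 2) / (n_clusters * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)

rng = np.random.default_rng(0)
n_clusters, m = 200, 50          # hypothetical design
tau2, sigma2 = 1.0, 9.0          # between- and within-cluster variances

u = rng.normal(0, np.sqrt(tau2), size=(n_clusters, 1))    # shared cluster effect
e = rng.normal(0, np.sqrt(sigma2), size=(n_clusters, m))  # individual noise
y = 100 + u + e

true_icc = tau2 / (tau2 + sigma2)
est_icc = anova_icc(y, n_clusters, m)
print(f"true ICC: {true_icc:.3f}, estimated ICC: {est_icc:.3f}")
```

The ICC is the share of the total variance that is due to differences between clusters, which is why it equals the correlation between two observations from the same cluster.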


In [12]:
# Recode smoking to a simple binary variable
da["smq"] = da.SMQ020.replace({2: 0, 7: np.nan, 9: np.nan})

In [13]:
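# ICC of several variables
# The values are similar to that of blood pressure,
# except for SDMVSTRA: it is part of the group definition, so it is
# constant within each group and its ICC is close to 1 by construction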
for v in ["BPXSY1", "RIDAGEYR", "BMXBMI", "smq", "SDMVSTRA"]:
    model = sm.GEE.from_formula(v + " ~ 1", groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=da)
    result = model.fit()
    print(v, result.cov_struct.summary())

BPXSY1 The correlation between two observations in the same cluster is 0.030
RIDAGEYR The correlation between two observations in the same cluster is 0.035
BMXBMI The correlation between two observations in the same cluster is 0.039
smq The correlation between two observations in the same cluster is 0.026
SDMVSTRA The correlation between two observations in the same cluster is 0.959


### 1.3 Conditional Intraclass Correlation

In [21]:
# The ICC above is computed without taking any other variable into account
# Here we control for age
# We know that blood pressure increases with age; that is explained variation
# If the age distribution differs between clusters, part of the ICC is due to age
# Thus, we add age to the model to obtain a more reliable ICC,
# and indeed the ICC value decreases
# Lesson: introduce control variables in the model before computing the ICC
model = sm.GEE.from_formula("BPXSY1 ~ RIDAGEYR", groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=da)
result = model.fit()
print(result.cov_struct.summary())

The correlation between two observations in the same cluster is 0.019


In [22]:
# Create a labeled version of the gender variable
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

In [23]:
# Again, we refine the model and introduce other variables that might explain the outcome
# That way, the ICC is more accurate
# C(RIDRETH1): 5 levels of ethnicity; C() converts integers to categorical levels
model = sm.GEE.from_formula("BPXSY1 ~ RIDAGEYR + RIAGENDRx + BMXBMI + C(RIDRETH1)",
           groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=da)
result = model.fit()
print(result.cov_struct.summary())

The correlation between two observations in the same cluster is 0.013


## 2. Marginal Models with Dependent Data

### 2.1 Marginal Model for Linear Regression and Comparison with OLS

If we have dependent data and we use ordinary linear models (i.e., not multi-level or marginal models), the point estimates of the mean parameters or coefficients are still meaningful, but the standard errors are generally wrong (often too small). Thus, the significance of the parameters will be misstated. Therefore, GEE should be used instead of OLS to fit dependent data with continuous outcomes.

In [39]:
# Fit a linear model with OLS
model1 = sm.OLS.from_formula("BPXSY1 ~ RIDAGEYR + RIAGENDRx + BMXBMI + C(RIDRETH1)",
           data=da)
result1 = model1.fit()

In [40]:
# Fit a marginal linear model using GEE to handle dependent data
model2 = sm.GEE.from_formula("BPXSY1 ~ RIDAGEYR + RIAGENDRx + BMXBMI + C(RIDRETH1)",
           groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=da)
result2 = model2.fit()

In [41]:
# Create dataframe which contains model parameters and standard errors
# Linear regression: OLS vs GEE (assuming dependent data)
x = pd.DataFrame({"OLS_params": result1.params, "OLS_SE": result1.bse,
                  "GEE_params": result2.params, "GEE_SE": result2.bse})
x = x[["OLS_params", "OLS_SE", "GEE_params", "GEE_SE"]]
print(x)

                   OLS_params    OLS_SE  GEE_params    GEE_SE
Intercept           91.736583  1.339378   92.168530  1.384309
RIAGENDRx[T.Male]    3.671294  0.453763    3.650245  0.454498
C(RIDRETH1)[T.2]     0.855488  0.819486    0.159296  0.767025
C(RIDRETH1)[T.3]    -1.796132  0.671954   -2.233280  0.760228
C(RIDRETH1)[T.4]     3.813314  0.732355    3.105654  0.881580
C(RIDRETH1)[T.5]    -0.455347  0.808948   -0.439831  0.813675
RIDAGEYR             0.478699  0.012901    0.474101  0.018493
BMXBMI               0.278015  0.033285    0.280205  0.038553


### 2.2 Marginal Model for Logistic Regression and Comparison with GLM

As before, GEE should be used instead of GLM to fit dependent data with binary outcomes.

In [42]:
# Relabel the levels, convert rare categories to missing.
da["DMDEDUC2x"] = da.DMDEDUC2.replace({1: "lt9", 2: "x9_11", 3: "HS", 4: "SomeCollege",
                                       5: "College", 7: np.nan, 9: np.nan})

In [43]:
# Fit a basic GLM
model1 = sm.GLM.from_formula("smq ~ RIDAGEYR + RIAGENDRx + C(DMDEDUC2x)",
           family=sm.families.Binomial(), data=da)
result1 = model1.fit()
result1.summary()

Dep. Variable:                smq   No. Observations:       5093
Model:                        GLM   Df Residuals:           5086
Model Family:            Binomial   Df Model:                  6
Link Function:              logit   Scale:                   1.0
Method:                      IRLS   Log-Likelihood:      -3201.2
Date:            Wed, 30 Mar 2022   Deviance:             6402.4
Time:                    17:11:23   Pearson chi2:         5100.0
No. Iterations:                 4
Covariance Type:        nonrobust

                                coef  std err        z  P>|z|  [0.025  0.975]
Intercept                    -2.3060    0.114  -20.174  0.000  -2.530  -2.082
RIAGENDRx[T.Male]             0.9096    0.060   15.118  0.000   0.792   1.028
C(DMDEDUC2x)[T.HS]            0.9434    0.090   10.521  0.000   0.768   1.119
C(DMDEDUC2x)[T.SomeCollege]   0.8322    0.084    9.865  0.000   0.667   0.998
C(DMDEDUC2x)[T.lt9]           0.2662    0.109    2.438  0.015   0.052   0.480
C(DMDEDUC2x)[T.x9_11]         1.0986    0.107   10.296  0.000   0.889   1.308
RIDAGEYR                      0.0183    0.002   10.582  0.000   0.015   0.022


In [44]:
# Fit a marginal GLM using GEE
model2 = sm.GEE.from_formula("smq ~ RIDAGEYR + RIAGENDRx + C(DMDEDUC2x)",
           groups="group", family=sm.families.Binomial(),
           cov_struct=sm.cov_struct.Exchangeable(), data=da)
result2 = model2.fit(start_params=result1.params)

In [46]:
# Create dataframe which contains model parameters and standard errors
# Logistic regression: GLM vs GEE (assuming dependent data)
x = pd.DataFrame({"GLM_params": result1.params, "GLM_SE": result1.bse,
                  "GEE_params": result2.params, "GEE_SE": result2.bse})
x = x[["GLM_params", "GLM_SE", "GEE_params", "GEE_SE"]]
print(x)

                             GLM_params    GLM_SE  GEE_params    GEE_SE
Intercept                     -2.305999  0.114308   -2.249820  0.140567
RIAGENDRx[T.Male]              0.909597  0.060167    0.908682  0.062342
C(DMDEDUC2x)[T.HS]             0.943364  0.089663    0.887965  0.095397
C(DMDEDUC2x)[T.SomeCollege]    0.832227  0.084361    0.771636  0.104449
C(DMDEDUC2x)[T.lt9]            0.266228  0.109183    0.321784  0.141327
C(DMDEDUC2x)[T.x9_11]          1.098561  0.106697    1.062149  0.138401
RIDAGEYR                       0.018257  0.001725    0.017416  0.001803
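
Because both models use a logit link, the coefficients above are on the log-odds scale, and exponentiating them gives odds ratios. The sketch below does this for the `RIAGENDRx[T.Male]` row of the GEE column, copying the estimate and standard error from the table; the Wald-type interval with `z = 1.96` is a common approximation assumed here for illustration.

```python
import math

# GEE estimate and standard error for RIAGENDRx[T.Male], copied from the table above
coef, se = 0.908682, 0.062342

z = 1.96  # approximate 97.5% quantile of the standard normal (Wald interval)
odds_ratio = math.exp(coef)
ci_low = math.exp(coef - z * se)
ci_high = math.exp(coef + z * se)

print(f"odds ratio: {odds_ratio:.2f}, approx. 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```

Since the GEE standard error is slightly larger than the GLM one, the same calculation with the GLM values would give a somewhat narrower interval around almost the same odds ratio.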
