# GEE analysis of growth trajectories of children

Generalized estimating equations (GEE) are commonly used in longitudinal
data analysis.  Here we consider a dataset in which repeated measures of
weight were made on young children over several years in early childhood.
GEE allows us to use linear modeling techniques similar to OLS while
still rigorously accounting for the repeated measures aspect of the data.

The data we will use are obtained from this page:
http://www.bristol.ac.uk/cmm/learning/support/datasets

These are the packages we will be using:

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

The data are in "fixed width" format, so we read them with
pd.read_fwf and give the character range of each column explicitly:

In [2]:
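# Half-open (start, end) character ranges for the five columns
colspecs = [(0, 4), (4, 7), (7, 12), (12, 16), (16, 17)]
df = pd.read_fwf("../data/growth/ASIAN.DAT", colspecs=colspecs, header=None)
df.columns = ["Id", "Age", "Weight", "BWeight", "Gender"]
# Indicator for girls (Gender is coded 2 for female)
df["Female"] = (df.Gender == 2).astype(int)
# Drop records with any missing value
df = df.dropna()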
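
To make the meaning of the column ranges concrete, here is a minimal sketch in plain Python of roughly what happens to a single line.  The record below is invented for illustration and does not come from ASIAN.DAT; pandas additionally infers data types and handles missing values, which this sketch does not attempt.

```python
# Hypothetical 17-character record (values invented for illustration),
# built field by field so the widths 4, 3, 5, 4, 1 are easy to see.
line = "  12" + " 95" + " 5200" + "3100" + "2"

names = ["Id", "Age", "Weight", "BWeight", "Gender"]
colspecs = [(0, 4), (4, 7), (7, 12), (12, 16), (16, 17)]

# Each (start, end) pair is a half-open interval, so it maps directly
# onto a Python slice of the line.  int() ignores the padding spaces.
record = {name: int(line[start:end])
          for name, (start, end) in zip(names, colspecs)}
print(record)
```

Because the ranges are half-open, the end of one column is the start of the next, and the widths are simply the differences end - start.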

Some of the analyses below use base-2 logarithms of the weights:

In [3]:
df["LogWeight"] = np.log2(df.Weight)
df["LogBWeight"] = np.log2(df.BWeight)

The first model that we consider treats weight as a linear function of age,
birth weight, and sex, and ignores the repeated measures structure.  The point
estimates from this model are valid, but the standard errors are not, because
they treat repeated measurements on the same child as independent.

In [4]:
model0 = sm.GLM.from_formula("Weight ~ Age + BWeight + Female", data=df)
rslt0 = model0.fit()
print(rslt0.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                 Weight   No. Observations:                 1572
Model:                            GLM   Df Residuals:                     1568
Model Family:                Gaussian   Df Model:                            3
Link Function:               identity   Scale:                      2.0045e+06
Method:                          IRLS   Log-Likelihood:                -13634.
Date:                Mon, 17 Feb 2020   Deviance:                   3.1431e+09
Time:                        13:48:28   Pearson chi2:                 3.14e+09
No. Iterations:                     3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   2601.2871    231.095     11.256      0.0
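
To see why ignoring the grouping distorts the standard errors, the following sketch uses simulated data (not the growth data) in which each cluster shares a random intercept and a cluster-level covariate.  All names and parameter values here are hypothetical.  It fits ordinary least squares and compares the usual standard errors with cluster-robust "sandwich" standard errors, which is the kind of correction that GEE applies under an independence working correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per = 200, 4                     # hypothetical cluster layout
groups = np.repeat(np.arange(n_groups), n_per)

# Covariate and random intercept are both constant within a cluster,
# so observations in the same cluster are positively correlated.
x = np.repeat(rng.normal(size=n_groups), n_per)
u = np.repeat(rng.normal(size=n_groups), n_per)
y = 1.0 + 0.5 * x + u + rng.normal(size=n_groups * n_per)

# Ordinary least squares fit
X = np.column_stack([np.ones_like(x), x])
bread = np.linalg.inv(X.T @ X)
beta = bread @ X.T @ y
resid = y - X @ beta

# Usual standard errors, which assume independent observations
sigma2 = resid @ resid / (len(y) - X.shape[1])
se_naive = np.sqrt(np.diag(sigma2 * bread))

# Cluster-robust standard errors: sum the score contributions within
# each cluster before forming the outer products.
meat = np.zeros((2, 2))
for g in range(n_groups):
    idx = groups == g
    score = X[idx].T @ resid[idx]
    meat += np.outer(score, score)
se_robust = np.sqrt(np.diag(bread @ meat @ bread))

print("naive SE: ", se_naive)
print("robust SE:", se_robust)
```

In this setup the robust standard error for the slope should come out noticeably larger than the naive one, reflecting that the clustered observations carry less information than the same number of independent observations would.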

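Here is a GEE model with the same mean structure as the model in the cell
above.  With the default independence working correlation the point
estimates are unchanged, but because the observations are grouped by
child (Id), the standard errors are now meaningful: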

In [5]:
model1 = sm.GEE.from_formula("Weight ~ Age + BWeight + Female", groups="Id", data=df)
rslt1 = model1.fit()
print(rslt1.summary())

                               GEE Regression Results                              
Dep. Variable:                      Weight   No. Observations:                 1572
Model:                                 GEE   No. clusters:                      568
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   5
Family:                           Gaussian   Mean cluster size:                 2.8
Dependence structure:         Independence   Num. iterations:                     3
Date:                     Mon, 17 Feb 2020   Scale:                     2004506.825
Covariance type:                    robust   Time:                         13:48:29
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   2601.2871    268.766      9.679      0.000    2074.515    3128.059
Age    

Now we fit the same model as a log/log regression.  Specifically,
the relationship between weight in childhood at a given age and
birth weight is modeled as a log/log relationship.  This means that if
two children of the same sex have birth weights that differ
by a given percentage, say $x$, then their childhood weights at a
given age differ on average by approximately $b\cdot x$ percent,
where $b$ is the coefficient of LogBWeight in the model.  The
approximation is good when $x$ is small; exactly, a birth weight
ratio of $r$ corresponds to a childhood weight ratio of $r^b$.
Typically we anticipate that $0 \le b \le 1$ in this type of
regression.  If $b \approx 1$ then, say, two kids whose weights at
birth differ by 20% will continue to have weights differing by 20% as
they age.  If $b < 1$, then the difference in childhood will be
smaller than the 20% difference at birth.
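
The percentage interpretation can be checked with a little arithmetic.  In the sketch below the values of $b$ are hypothetical and are not estimates from the model; the function name is also ours.  Note that the base of the logarithm does not affect $b$, as long as both variables are logged with the same base.

```python
def childhood_ratio(birth_ratio, b):
    """Ratio of expected childhood weights implied by a log/log model
    with slope b, for two children whose birth weights have the given ratio."""
    return birth_ratio ** b

birth_ratio = 1.2  # birth weights differ by 20%
for b in (1.0, 0.5, 0.25):  # hypothetical slopes
    exact = 100 * (childhood_ratio(birth_ratio, b) - 1)
    approx = 100 * b * (birth_ratio - 1)
    print(f"b={b}: exact difference {exact:.1f}%, b*x approximation {approx:.1f}%")
```

For $b = 1$ the two figures agree exactly, and for smaller $b$ the approximation $b\cdot x$ slightly overstates the exact difference.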

In [6]:
model2 = sm.GEE.from_formula("LogWeight ~ Age + LogBWeight + Female", groups="Id", data=df)
rslt2 = model2.fit()
print(rslt2.summary())

                               GEE Regression Results                              
Dep. Variable:                   LogWeight   No. Observations:                 1572
Model:                                 GEE   No. clusters:                      568
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   5
Family:                           Gaussian   Mean cluster size:                 2.8
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Mon, 17 Feb 2020   Scale:                           0.094
Covariance type:                    robust   Time:                         13:48:29
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      9.0936      0.501     18.151      0.000       8.112      10.076
Age    

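It isn't very likely that weight varies either linearly or
exponentially with age (a linear age effect on the log scale is an
exponential effect on the original scale).  We can use splines to
capture a much broader range of relationships.  Here bs(Age, 4)
expands Age into a cubic B-spline basis with four degrees of freedom.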
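
To give a feel for what a B-spline basis is, here is a hand-rolled sketch using the Cox-de Boor recursion in NumPy.  It is only an illustration, not the code that the formula machinery runs, and the function name and test grid are our own.  We assume, based on our understanding of patsy's defaults, that a cubic basis with four degrees of freedom and no intercept corresponds to one interior knot at the median with the first basis column dropped.

```python
import numpy as np

def bspline_basis(x, interior_knots, lower, upper, degree=3):
    """Evaluate all B-spline basis functions at x (Cox-de Boor recursion)."""
    x = np.asarray(x, dtype=float)
    # Full knot vector: each boundary knot is repeated degree + 1 times.
    t = np.concatenate([np.full(degree + 1, lower),
                        np.asarray(interior_knots, dtype=float),
                        np.full(degree + 1, upper)])
    # Degree 0: indicators of the half-open knot intervals.
    B = np.zeros((len(x), len(t) - 1))
    for i in range(len(t) - 1):
        B[:, i] = (t[i] <= x) & (x < t[i + 1])
    # Put the right endpoint into the last non-empty interval.
    last = np.nonzero(np.diff(t) > 0)[0].max()
    B[x == upper, last] = 1.0
    # Raise the degree one step at a time.
    for d in range(1, degree + 1):
        B_new = np.zeros((len(x), len(t) - d - 1))
        for i in range(len(t) - d - 1):
            left_den = t[i + d] - t[i]
            right_den = t[i + d + 1] - t[i + 1]
            if left_den > 0:
                B_new[:, i] += (x - t[i]) / left_den * B[:, i]
            if right_den > 0:
                B_new[:, i] += (t[i + d + 1] - x) / right_den * B[:, i + 1]
        B = B_new
    return B

x = np.linspace(0, 10, 50)               # hypothetical age grid
B_full = bspline_basis(x, [np.median(x)], x.min(), x.max())
B_model = B_full[:, 1:]                  # drop one column (no intercept)
print(B_full.shape, B_model.shape)
print(np.allclose(B_full.sum(axis=1), 1.0))
```

The full basis sums to one at every point, which is why one column is dropped when the model already contains an intercept.  The regression then estimates one coefficient per remaining column, and the fitted age curve is the corresponding weighted sum of these smooth bumps.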

In [7]:
model3 = sm.GEE.from_formula("LogWeight ~ bs(Age, 4) + LogBWeight + Female", groups="Id", data=df)
rslt3 = model3.fit()
print(rslt3.summary())

                               GEE Regression Results                              
Dep. Variable:                   LogWeight   No. Observations:                 1572
Model:                                 GEE   No. clusters:                      568
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   5
Family:                           Gaussian   Mean cluster size:                 2.8
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Mon, 17 Feb 2020   Scale:                           0.024
Covariance type:                    robust   Time:                         13:48:29
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         7.9489      0.384     20.675      0.000       7.195       8.70

It is quite possible that the relationship between birth weight and
childhood weight differs between girls and boys.  The
LogBWeight*Female term adds an interaction that captures this
possibility.

In [8]:
model4 = sm.GEE.from_formula("LogWeight ~ bs(Age, 4) + LogBWeight*Female", groups="Id", data=df)
rslt4 = model4.fit()
print(rslt4.summary())

                               GEE Regression Results                              
Dep. Variable:                   LogWeight   No. Observations:                 1572
Model:                                 GEE   No. clusters:                      568
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   5
Family:                           Gaussian   Mean cluster size:                 2.8
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Mon, 17 Feb 2020   Scale:                           0.024
Covariance type:                    robust   Time:                         13:48:29
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             8.0705      0.621     12.991      0.000       6.85

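Although GEE does not require us to specify an accurate covariance
structure, we will have more power if we do so.  Here we use an
exchangeable working correlation, in which every pair of measurements
on the same child has the same correlation.  We will also learn
something about the strength of the within-subject dependence that
we would not learn when using the independence model.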
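
As a small illustration of what "exchangeable" means, the sketch below builds the working correlation matrix for one cluster.  The correlation value and the helper name are hypothetical and are not taken from the fitted model.

```python
import numpy as np

def exchangeable_corr(n, rho):
    """n x n correlation matrix with ones on the diagonal and rho elsewhere."""
    return (1 - rho) * np.eye(n) + rho * np.ones((n, n))

n, rho = 5, 0.6          # hypothetical cluster size and correlation
R = exchangeable_corr(n, rho)
print(R)

# Variance of the mean of n unit-variance observations with this
# correlation, relative to 1/n for independent observations.
var_mean = R.sum() / n**2
print(var_mean, (1 + (n - 1) * rho) / n)
```

The last line shows the familiar factor $1 + (n - 1)\rho$ by which positive within-cluster correlation inflates the variance of a cluster mean.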

In [9]:
model5 = sm.GEE.from_formula("LogWeight ~ bs(Age, 4) + LogBWeight + Female", groups="Id",
                             cov_struct=sm.cov_struct.Exchangeable(), data=df)
rslt5 = model5.fit()
print(rslt5.summary())
print(rslt5.cov_struct.summary())

                               GEE Regression Results                              
Dep. Variable:                   LogWeight   No. Observations:                 1572
Model:                                 GEE   No. clusters:                      568
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   5
Family:                           Gaussian   Mean cluster size:                 2.8
Dependence structure:         Exchangeable   Num. iterations:                     6
Date:                     Mon, 17 Feb 2020   Scale:                           0.024
Covariance type:                    robust   Time:                         13:48:30
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         7.7638      0.361     21.482      0.000       7.055       8.47

In general, it is better to use the default "robust" (sandwich) approach for
covariance estimation.  This allows the working correlation model to be
mis-specified, while still yielding valid parameter estimates and
standard errors.  If you are very confident that your working
correlation model is correct, you can specify the "naive" approach to
covariance estimation, as below.  In this case, the standard errors will be
meaningful only if the working correlation model is correct.
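
For reference, the two covariance estimates can be written in the standard GEE notation.  The symbols here are introduced for this note and do not appear elsewhere in the notebook: for subject $i$, $y_i$ is the response vector, $\mu_i$ its fitted mean, $D_i = \partial \mu_i / \partial \beta$, and $V_i$ the working covariance matrix.

$$
A = \sum_i D_i^\top V_i^{-1} D_i, \qquad
M = \sum_i D_i^\top V_i^{-1} (y_i - \mu_i)(y_i - \mu_i)^\top V_i^{-1} D_i
$$

$$
\widehat{\mathrm{Cov}}_{\text{naive}}(\hat\beta) = A^{-1}, \qquad
\widehat{\mathrm{Cov}}_{\text{robust}}(\hat\beta) = A^{-1} M A^{-1}
$$

If the working covariance is correct, $M$ has the same expectation as $A$ and the two estimates agree in large samples.  When it is not, only the robust form remains valid.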

In [10]:
model6 = sm.GEE.from_formula("LogWeight ~ bs(Age, 4) + LogBWeight + Female", groups="Id",
                             cov_struct=sm.cov_struct.Exchangeable(), data=df)
rslt6 = model6.fit(cov_type="naive")
print(rslt6.summary())

                               GEE Regression Results                              
Dep. Variable:                   LogWeight   No. Observations:                 1572
Model:                                 GEE   No. clusters:                      568
Method:                        Generalized   Min. cluster size:                   1
                      Estimating Equations   Max. cluster size:                   5
Family:                           Gaussian   Mean cluster size:                 2.8
Dependence structure:         Exchangeable   Num. iterations:                     6
Date:                     Mon, 17 Feb 2020   Scale:                           0.024
Covariance type:                     naive   Time:                         13:48:31
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         7.7638      0.249     31.119      0.000       7.275       8.25