## Predictors of birth weight

In this notebook we use data on births in the United States to identify and understand predictors of birth weight.  Low birth weight is associated with several immediate risks for the newborn including infection and mortality, as well as longer-term risks for developmental delays and chronic health conditions in adulthood. Low birth weight is usually defined as the birth weight being less than 2500 grams, but this threshold does not play a role in the analyses below.  Instead, we will use quantitative birth weight as an outcome, and consider predictors of birth weight on this quantitative scale.

The data used here are a complete record (a census) of all live births in the United States in specific years (for example, in 1971 there are around 1.8 million births). The raw data and documentation are available from the National Center for Health Statistics (NCHS) [here](https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm), or directly from [this page](https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/natality).  You should use the [prep.py](https://github.com/kshedden/case_studies/birthweight/prep.py) script to download the data and generate the CSV files that are used in the analyses below.

This notebook will present the data, and fit some initial models to explain how birth weight varies with respect to several factors.  A proper analysis of the data should be driven by a scientific aim or hypothesis.  This notebook only provides the scaffolding for such an analysis.  It is up to you to build on this scaffold to produce a coherent analytic narrative.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from pathlib import Path
from scipy.stats import kendalltau

Set the path below to point to the location of the data files (which are prepared by the prep.py script).

In [None]:
pa = "/home/kshedden/data/Teaching/birthweight/births"
pa = Path(pa)

We will start by working with the 1971 data.

In [None]:
da = pd.read_csv(pa / "1971.csv.gz")

In [None]:
da.shape

The documentation on the NCHS page explains what the variables mean.  Most are self-explanatory.  "interval" is the duration in months to the previous birth, and is coded as "888" for firstborn children.

In [None]:
da.head()

In [None]:
plt.hist(da["interval"])
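
Because "888" is a code rather than a real duration, a histogram of the raw column is dominated by a spike at that value.  The following is a minimal sketch, on made-up values rather than the NCHS data, of one possible way to set the code aside when summarizing the intervals.  The toy data frame and the column names `firstborn` and `interval_months` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): 888 marks a first-born child.
toy = pd.DataFrame({"interval": [24, 888, 36, 888, 18, 60]})

# Keep the first-born information as its own indicator column ...
toy["firstborn"] = toy["interval"] == 888

# ... and put NaN in place of the code in a separate column, so that
# the code is not treated as a duration in summaries or plots.
toy["interval_months"] = toy["interval"].where(~toy["firstborn"])

print(toy)
print(toy["interval_months"].mean())
```

The recoded values are kept in a separate column here because a complete-case `dropna()` applied afterwards would otherwise remove every first-born child.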

A few variables have missing values.  Rather than impute them, we will do a "complete case analysis", dropping all observations with missing values on any variable.  However, before we proceed to the complete case analysis, we will check whether there are differences between the cases with and without missing values.

The table below shows how many observations are missing for each variable.

In [None]:
da.isnull().sum(0)

In [None]:
da0 = da.copy()
da = da.dropna().copy()

Below is a cross-tab showing how often each pair of variables is missing simultaneously.  For example, this shows us that dadrace and dadage are usually missing together.

In [None]:
C = da0.isnull()
C = np.asarray(C).astype(np.int64)
S = np.dot(C.T, C)
cn = da0.columns
ii = np.flatnonzero(S.sum(0) > 0)
S = S[ii, :][:, ii]
cn = cn[ii]
S = pd.DataFrame(S, columns=cn, index=cn)
S
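
To see why the matrix product of the missingness indicators gives joint missing counts, here is a small sketch on made-up data (the toy frame and its column names are hypothetical).  Entry (j, k) of the product counts the cases in which variables j and k are both missing, and the diagonal holds the per-variable missing counts.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 'a' and 'b' are missing together in two rows,
# 'c' is missing alone in one row, and 'd' is never missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0],
    "c": [np.nan, 2.0, 3.0, 4.0, 5.0],
    "d": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# 0/1 indicator matrix of missingness (rows = cases, columns = variables).
M = toy.isnull().to_numpy().astype(np.int64)

# Entry (j, k) of M'M counts the cases where variables j and k are both
# missing; the diagonal holds the per-variable missing counts.
J = pd.DataFrame(M.T @ M, index=toy.columns, columns=toy.columns)
print(J)
```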

Complementing the contingency table above, we can inspect the correlation matrix of the missingness indicators, i.e. for each pair of variables, the correlation across cases between the indicator that one variable is missing and the indicator that the other is missing.  This makes it easy to see that the missingness statuses of dadrace and dadage are highly correlated, and the missingness statuses of birthorder and interval are also correlated.  For all other pairs of variables, missingness seems to be nearly uncorrelated.

In [None]:
C = C[:, ii]
np.corrcoef(C.T)

Below we can see that the birthweight distributions are similar regardless of whether birthorder is missing.  When dadage is missing, birthweight tends to be slightly lower than when dadage is observed.  Also, cases where dadage is missing are more likely to be first-born children (birthorder = 1), and when dadage is missing, momage tends to be somewhat lower than when dadage is observed.

These analyses show that the data here are not "missing completely at random".  But for now, we won't get into the various forms of missing data in any more detail.

In [None]:
for missing_vname in ["birthorder", "dadage"]:
    dmiss = da0.loc[da0[missing_vname].isnull(), :]
    dobs = da0.loc[da0[missing_vname].notnull(), :]
    for vname in ["birthweight", "birthorder", "momage"]:
        if missing_vname != vname:
            if vname == "birthorder":
                plt.boxplot([np.sqrt(dmiss[vname].dropna()), np.sqrt(dobs[vname].dropna())])
            else:
                plt.boxplot([dmiss[vname].dropna(), dobs[vname].dropna()])
            plt.ylabel(vname)
            if vname == "birthweight":
                plt.gca().set_ylim(2000, 5000)
            plt.gca().set_xticklabels([f"Missing {missing_vname}", f"Not missing {missing_vname}"])
            plt.show()

In [None]:
da["dadrace"].isnull().sum()

In [None]:
print(da.shape)
da.head()

Below is a quantile plot of the birth weight values.  This marginal distribution is not of primary interest since our focus will be on the conditional distribution of birth weight relative to potential risk factors.  Not surprisingly, the marginal distribution of birth weights is right-skewed.

In [None]:
plt.grid(True)
plt.xticks(np.linspace(0, 1, 11))
plt.yticks(np.linspace(0, 10000, 11))
u = np.linspace(0, 1, da.shape[0])
plt.plot(u, np.sort(da["birthweight"]))
plt.xlabel("Probability point")
plt.ylabel("Birth weight")

Maternal and paternal age are both plausible predictors of birth weight.  These two variables are correlated, as shown below.

In [None]:
hb = plt.hexbin(da["momage"], da["dadage"], gridsize=30, vmax=4)
ar = hb.get_array()
ar = np.log10(1 + ar)
hb.set_array(ar)
plt.xlabel("Mother age")
plt.ylabel("Father age")
plt.colorbar()
plt.show()

Plurality is the number of births resulting from a single pregnancy.  While twin births are fairly common, only a tiny fraction of multiple births involve three or more babies.

In [None]:
da["plurality"].value_counts()

Birthorder is the order of a given birth among all births to the same mother.  Birthorder=1 is the first-born child to a mother.

In [None]:
plt.grid(True)
plt.plot(da["birthorder"].value_counts(), "-o")
plt.xlabel("Birth order")
plt.ylabel("Frequency")

Some of the birth order values are implausibly high, so we clip this variable to have a maximum value of 10.

In [None]:
da["birthorder"] = da["birthorder"].clip(1, 10)

We will be interested in the extent to which births in the same county are more similar than births in different counties.  This could be an indication of unobserved county-level heterogeneity (i.e. county-level characteristics that predict birth weight but that are not included as predictors in our regression models).

In [None]:
da["location"] = [f"{state}_{county}" for (state,county) in zip(da["state"], da["county"])]

The distribution of group (cluster) sizes is shown below.  The median group size is well over 100.

In [None]:
gs = da.groupby("location").size()
gss = np.sort(gs.values)
pp = np.linspace(0, 1, gss.size)
plt.grid(True)
plt.plot(pp, np.log10(gss))
plt.ylabel("Log10 group size")
plt.xlabel("Probability point")

Our main tool will be generalized linear models (GLM), using the framework known as "generalized estimating equations" (GEE) that allows us to fit GLMs to clustered data.

When fitting regression models, especially more complicated forms that involve nonlinear and nonadditive components, it is often useful to center any quantitative predictors.

In [None]:
momage_mean = da["momage"].mean()
da["momage_cen"] = da["momage"] - momage_mean
dadage_mean = da["dadage"].mean()
da["dadage_cen"] = da["dadage"] - dadage_mean
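
One reason centering helps is that it reduces the collinearity between a variable and its square.  The sketch below illustrates this with simulated ages (made up for illustration, not the NCHS data); the age range and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated ages (made up for illustration; not the NCHS data).
age = rng.uniform(15, 45, size=10_000)
age_cen = age - age.mean()

# Correlation between the linear and quadratic terms,
# before and after centering.
r_raw = np.corrcoef(age, age**2)[0, 1]
r_cen = np.corrcoef(age_cen, age_cen**2)[0, 1]
print(f"raw: {r_raw:.3f}, centered: {r_cen:.3f}")
```

With a roughly symmetric age distribution the centered linear and quadratic terms are nearly uncorrelated, whereas the raw terms are almost perfectly correlated.  For a skewed distribution the reduction would be less complete.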

We begin with a basic model that incorporates some plausible predictors of birth weight that are available in the NCHS data.  This model includes main effects for all variables, and models maternal and paternal age (momage and dadage) quadratically for reasons that will be explored below.  This initial model is a "gamma" GLM, which means that the conditional variance is modeled as being proportional to the square of the conditional mean.  Despite the name, this does not require the response variable to follow a Gamma distribution.

In [None]:
fml = "birthweight ~ sex + momage + I(momage_cen**2) + dadage + I(dadage_cen**2) + plurality + birthorder"
m0 = sm.GLM.from_formula(fml, data=da, family=sm.families.Gamma(link=sm.families.links.Log()))
r0 = m0.fit(scale="X2")
r0.summary()
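
The mean/variance relationship assumed by a gamma GLM can be illustrated with a short simulation.  This sketch uses gamma draws only as a convenient example of data whose variance is proportional to the squared mean; as noted above, the GLM itself does not require gamma-distributed responses.  The shape parameter and the three mean values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = 25.0  # hypothetical shape parameter, chosen only for illustration

cvs = []
for mu in [2000.0, 3000.0, 4000.0]:
    # A gamma variable with this shape and scale = mu / shape has
    # mean mu and variance mu**2 / shape.
    y = rng.gamma(shape, mu / shape, size=200_000)
    cv = y.std() / y.mean()
    cvs.append(cv)
    print(f"mean={y.mean():8.1f}  sd={y.std():7.1f}  cv={cv:.3f}")
```

The standard deviation grows with the mean, but the coefficient of variation (sd divided by mean) stays close to 1/sqrt(shape) at every mean, which is what "variance proportional to the square of the mean" implies.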

The birth weights are likely to be "clustered" (non-independent) by county.  The GEE approach to fitting a GLM accounts for this possibility.

In [None]:
m1 = sm.GEE.from_formula(fml, groups=da["location"], data=da, 
                         family=sm.families.Gamma(link=sm.families.links.Log()),
                         cov_struct=sm.cov_struct.Exchangeable())
r1 = m1.fit()
r1.summary()

The model fit above (r1) treats the birth weight observations as being "exchangeable" within counties.  This means that any two births in the same county are correlated to the same extent.  An important aspect of using GEE is that the dependence structure is a "working dependence structure", and can be mis-specified without invalidating the estimated mean structure.  The "intra-class correlation" (ICC), which gives the average pairwise correlation between births in the same county, is shown below.

In [None]:
r1.cov_struct.summary()

Although the ICC is small, when some of the clusters are large (as is the case here), it can be consequential.
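
To make this concrete: if all pairs of observations in a cluster of size n have correlation rho, the variance of the cluster mean is (sigma^2 / n) * (1 + (n - 1) * rho), so the factor 1 + (n - 1) * rho measures the inflation relative to independent data.  The sketch below evaluates this factor for a few cluster sizes.  The function name and the ICC value are hypothetical, and the ICC used here is not the estimate from the fitted model.

```python
def variance_inflation(n, icc):
    """Factor by which Var(cluster mean) exceeds the independent-data
    value sigma^2 / n, when all pairs of observations in a cluster of
    size n have correlation icc."""
    return 1 + (n - 1) * icc

icc = 0.001  # hypothetical value, not the estimate from the fitted model
for n in [10, 100, 1000, 10000]:
    print(n, variance_inflation(n, icc))
```

Even with a very small ICC, the inflation factor is far from 1 once the cluster size reaches the thousands.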

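As noted above, the Gamma GLM/GEE analysis takes the conditional variance to be proportional to the square of the conditional mean.  If this is correct, the Pearson residuals, which are scaled by the modeled standard deviation, should have roughly the same spread at all fitted values.  Below we assess whether this is consistent with the data.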

In [None]:
dx = pd.DataFrame({"fit": r1.fittedvalues, "resid": r1.resid_pearson})
dx["aresid"] = np.abs(dx["resid"])
dx = dx.sort_values(by="fit")
dx["fitgroup"] = pd.qcut(dx["fit"], 20)
aa = dx.groupby("fitgroup", observed=True).agg(
    fit=("fit", "median"),
    q50=("aresid", "median"),
    q10=("aresid", lambda x: np.quantile(x, 0.1)),
    q90=("aresid", lambda x: np.quantile(x, 0.9)),
)

In [None]:
ax = plt.axes([0.1, 0.1, 0.75, 0.9])
ax.grid(True)
ax.set_ylim(0, 0.4)
ax.plot(aa["fit"], aa["q10"], label="q10")
ax.plot(aa["fit"], aa["q50"], label="q50")
ax.plot(aa["fit"], aa["q90"], label="q90")
ax.set_xlabel("Predicted birth weight")
ax.set_ylabel("Absolute Pearson residual quantiles")
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="center right")
leg.draw_frame(False)

Below we calculate Kendall's tau between the fitted and observed birth weights as a measure of concordance, showing that the model only explains a rather small fraction of the birth weight variation.  This is common when working with data on human biological traits.  Nevertheless, many of the associations are strongly statistically significant, and the effect sizes are large enough to have important implications.

In [None]:
kendalltau(r1.fittedvalues, da["birthweight"])
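
For intuition about what Kendall's tau measures, the sketch below computes it naively as the proportion of concordant pairs minus the proportion of discordant pairs.  The helper `kendall_tau_a` is a hypothetical illustration that assumes no ties; it is not how `scipy.stats.kendalltau` is implemented, and scipy's default variant additionally adjusts for ties.  The simulated data are made up.

```python
import numpy as np

def kendall_tau_a(x, y):
    """Naive O(n^2) Kendall tau-a: (concordant - discordant) / pairs.
    Assumes no ties; intended for small illustrative inputs only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    iu = np.triu_indices(x.size, k=1)  # each unordered pair once
    sx = np.sign(x[:, None] - x[None, :])[iu]
    sy = np.sign(y[:, None] - y[None, :])[iu]
    return np.sum(sx * sy) / sx.size

rng = np.random.default_rng(0)
signal = rng.normal(size=500)
noisy = signal + 3 * rng.normal(size=500)  # weak association

tau_monotone = kendall_tau_a(signal, signal**3)  # monotone transform
tau_weak = kendall_tau_a(signal, noisy)
print(tau_monotone, tau_weak)
```

Because tau depends only on the ordering of the values, a strictly increasing transformation leaves it unchanged, while added noise pulls it toward zero.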

We modeled the roles of parental ages quadratically.  For maternal age, this reveals the inverted U-shaped relationship below.  This type of plot shows the conditional mean for the dependent variable (birth weight) versus an independent variable of interest (parental age), holding all other covariates fixed.  Since there are no interactions, the specific value at which we fix the other variables is unimportant: the choice only translates the curve vertically on the log scale, which with the log link used here corresponds to rescaling the plotted curve by a constant factor.  But when interactions are present (i.e. the model is not additive), more work is needed to see how parental age and birth weight are related for each possible setting of the other covariates.

In [None]:
dp = da.head(100).copy()

def set_parent_ages(dp, momage, dadage):
    dp["momage"] = momage
    dp["momage_cen"] = dp["momage"] - momage_mean
    dp["dadage"] = dadage
    dp["dadage_cen"] = dp["dadage"] - dadage_mean
    return dp

dp["sex"] = "female"
dp["plurality"] = 1
dp["birthorder"] = 1

In [None]:
age = np.linspace(15, 50, 100)
dp = set_parent_ages(dp, age, 25)
ym = r1.predict(dp)
plt.grid(True)
plt.plot(age, ym)
plt.xlabel("Maternal age")
plt.ylabel("Birth weight")
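
Because the model uses a log link, fixing the other covariates at different values shifts the curve additively on the log scale, which corresponds to multiplying the curve by a constant on the birth-weight scale.  The sketch below checks this with hypothetical coefficients (the values of `b0`, `b_age`, `b_age2` and `b_plural` are made up and are not estimates from the fitted model).

```python
import numpy as np

# Hypothetical coefficients on the log scale
# (not estimates from the fitted model).
b0, b_age, b_age2, b_plural = 8.1, 0.004, -0.0004, -0.25

age_cen = np.linspace(-10, 20, 7)

def mean_weight(age_cen, plurality):
    # Log link: the linear predictor is exponentiated to give the mean.
    lin = b0 + b_age * age_cen + b_age2 * age_cen**2 + b_plural * (plurality - 1)
    return np.exp(lin)

single = mean_weight(age_cen, 1)
twin = mean_weight(age_cen, 2)

# The ratio is the same at every age: exp(b_plural).
print(twin / single)
```

The shape of the curve in age is therefore the same whichever reference values are chosen, up to a constant multiplicative factor.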

The association between paternal age and birth weight also has an inverted U shape, but note that the curvature is much less pronounced (and less statistically significant) than for maternal age.

In [None]:
dp = set_parent_ages(dp, 25, age)
yp = r1.predict(dp)
plt.grid(True)
plt.plot(age, yp)
plt.xlabel("Paternal age")
plt.ylabel("Birth weight")

A more insightful comparison of the roles of maternal and paternal age is obtained by plotting the two curves together.

In [None]:
plt.grid(True)
plt.plot(age, ym, label="Mom age")
plt.plot(age, yp, label="Dad age")
ha, lb = plt.gca().get_legend_handles_labels()
plt.figlegend(ha, lb, loc="center right")
plt.xlabel("Age")
plt.ylabel("Birth weight")