# Predictors of birth weight

In this notebook we use data on births in the United States to identify and understand predictors of birth weight.  Low birth weight is associated with several immediate risks for the newborn including infection and mortality, as well as longer-term risks for developmental delays and chronic health conditions in adulthood. Low birth weight is usually defined as the birth weight being less than 2500 grams, but this threshold does not play a role in the analyses below.  Instead, we will use quantitative birth weight as an outcome, and consider predictors of birth weight on this quantitative scale.

The data used here are a complete record (a census) of all live births in the United States in specific years.  The sample sizes are large; for example, in 1971 there were around 1.8 million births. The raw data and documentation are available from the National Center for Health Statistics (NCHS) [here](https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm), or directly from [this page](https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/natality).  You should use the [prep.py](https://github.com/kshedden/case_studies/birthweight/prep.py) script to download the data and generate the CSV files that are used in the analyses below.

This notebook will present the data, and fit some initial models to explain how birth weight varies with respect to several factors.  A proper analysis of the data should be driven by a scientific aim or hypothesis.  This notebook only provides the scaffolding for such an analysis.  It is up to you to build on this scaffold to produce a coherent analytic narrative.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from pathlib import Path
from scipy.stats import kendalltau
import patsy

Set the path below to point to the location of the data files (which are prepared by the prep.py script).

In [None]:
pa = "/home/kshedden/data/Teaching/birthweight/births"
pa = Path(pa)

We will start by working with the 1971 data.

In [None]:
da = pd.read_csv(pa / "1971.csv.gz")

This is the number of observations (rows) and variables (columns) in the data file.

In [None]:
da.shape

The [documentation](https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm) on the NCHS page explains what the variables mean.  Most of the variables are self-explanatory.  "interval" is the duration in months to the previous birth, and is coded as "777"  for firstborn children.

In [None]:
da.head()
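
Because "777" is a code rather than a duration, it should not be treated as 777 months if "interval" is ever used quantitatively.  The sketch below shows one possible way to handle this, using a small synthetic data frame rather than the NCHS data.  The column names "firstborn" and "interval_months" are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the "interval" column; 777 marks a firstborn child.
toy = pd.DataFrame({"interval": [777, 24, 36, 777, 18]})

# Indicator for firstborn children (hypothetical column name).
toy["firstborn"] = (toy["interval"] == 777).astype(int)

# Replace the code 777 with NaN so it is not read as a duration in months.
toy["interval_months"] = toy["interval"].where(toy["interval"] != 777, np.nan)

print(toy)
```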

## Missing values

A few variables have missing values.  Rather than impute them, we will do a "complete case analysis", dropping all observations with missing values on any variable.  However, before we proceed to the complete case analysis, we will check whether there are systematic differences between the cases with and without missing values.

The table below shows how many observations are missing for each variable.

In [None]:
da.isnull().sum(0)

Since we will be doing a complete case analysis, we will drop all the observations with missing values on any variable next, but we also save a copy of the original data to use for further assessment of the missing data patterns.

In [None]:
da0 = da.copy()
da = da.dropna().copy()

Below is a cross-tab showing how often each pair of variables is missing simultaneously.  For example, this shows us that dadrace and dadage are usually missing together.

In [None]:
C = da0.isnull()
C = np.asarray(C).astype(np.int64)
S = np.dot(C.T, C)
cn = da.columns
ii = np.flatnonzero(S.sum(0) > 0)
S = S[ii, :][:, ii]
cn = cn[ii]
S = pd.DataFrame(S, columns=cn, index=cn)
S

Complementing the contingency table above, we can inspect the correlation matrix between the indicators of different variable pairs being missing for each case.  This makes it easy to see that the missingness indicators for dadrace and dadage are highly correlated, and the missingness indicators for birthorder and interval are also correlated.  For all other pairs of variables, missingness seems to be nearly uncorrelated.

In [None]:
C = C[:, ii]
np.corrcoef(C.T)

Birthweight is the key outcome variable in our analysis, so we will pay special attention to it when considering the missing data patterns.

Below we can see that the birthweight distributions are similar regardless of whether birthorder is missing.  When dadage is missing, birthweight tends to be slightly lower than when dadage is observed.  Also, cases where dadage is missing are more likely to be first-born children (birthorder = 1), and when dadage is missing, momage tends to be somewhat lower than when dadage is observed.

These analyses show that the data here are not "missing completely at random".  But for now, we won't get into the various forms of missing data in any more detail.

In [None]:
for missing_vname in ["birthorder", "dadage"]:
    dmiss = da0.loc[da0[missing_vname].isnull(), :]
    dobs = da0.loc[da0[missing_vname].notnull(), :]
    for vname in ["birthweight", "birthorder", "momage"]:
        if missing_vname != vname:
            if vname == "birthorder":
                plt.boxplot([np.sqrt(dmiss[vname].dropna()), np.sqrt(dobs[vname].dropna())])
            else:
                plt.boxplot([dmiss[vname].dropna(), dobs[vname].dropna()])
            plt.ylabel(vname)
            if vname == "birthweight":
                plt.gca().set_ylim(2000, 5000)
            plt.gca().set_xticklabels([f"Missing {missing_vname}", f"Not missing {missing_vname}"])
            plt.show()
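
If data were missing completely at random, missingness would be unrelated to every variable, so an observed covariate would have about the same distribution in rows with and without a missing value.  The sketch below illustrates the comparison numerically on synthetic data in which missingness is made to depend on age.  The variable names and the missingness mechanism are assumptions chosen for illustration and are not taken from the NCHS data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20000

# Synthetic data: the chance that "x" is missing depends on "age".
toy = pd.DataFrame({"age": rng.normal(25, 5, n), "x": rng.normal(size=n)})
p_miss = 1 / (1 + np.exp((toy["age"] - 22) / 2))  # younger -> more often missing
toy.loc[rng.uniform(size=n) < p_miss, "x"] = np.nan

# Compare the observed covariate between rows with and without missing x.
is_miss = toy["x"].isnull()
mean_miss = toy.loc[is_miss, "age"].mean()
mean_obs = toy.loc[~is_miss, "age"].mean()

# Standardized mean difference; values far from zero argue against MCAR.
smd = (mean_miss - mean_obs) / toy["age"].std()
print(f"mean age, x missing: {mean_miss:.2f}; x observed: {mean_obs:.2f}; SMD: {smd:.2f}")
```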

In [None]:
print(da.shape)
da.head()

## Marginal distributions of the outcome and covariates

Below is a quantile plot of the birth weight values.  This marginal distribution is not of primary interest since our focus will be on the conditional distribution of birth weight relative to potential risk factors.  Not surprisingly, the marginal distribution of birth weights is right-skewed.

In [None]:
plt.grid(True)
plt.xticks(np.linspace(0, 1, 11))
plt.yticks(np.linspace(0, 10000, 11))
u = np.linspace(0, 1, da.shape[0])
plt.plot(u, np.sort(da["birthweight"]))
plt.xlabel("Probability point")
plt.ylabel("Birth weight")

Maternal and paternal age are both plausible predictors of birth weight.  These two variables are correlated, as shown below.

In [None]:
hb = plt.hexbin(da["momage"], da["dadage"], gridsize=30, vmax=4)
ar = hb.get_array()
ar = np.log10(1 + ar)
hb.set_array(ar)
plt.xlabel("Mother age")
plt.ylabel("Father age")
plt.colorbar()
plt.show()

Plurality is the number of births resulting from a single pregnancy.  While twin births are fairly common, only a tiny fraction of multiple births involve three or more babies.

In [None]:
da["plurality"].value_counts()

Birthorder is the order of a given birth among all births to the same mother.  Birthorder=1 is the first-born child to a mother.

In [None]:
plt.grid(True)
plt.plot(da["birthorder"].value_counts(), "-o")
plt.xlabel("Birth order")
plt.ylabel("Frequency")

Some of the birth order values are implausibly high, so we clip this variable to have a maximum value of 10.

In [None]:
da["birthorder"] = da["birthorder"].clip(1, 10)

## Geographic clustering

We will be interested in the extent to which births in the same county are more similar than births in different counties.  This could be an indication of unobserved county-level heterogeneity (i.e. county-level characteristics that predict birth weight but that are not included as predictors in our regression models).  To facilitate this analysis, we create a categorical variable for each state/county combination.

In [None]:
da["location"] = [f"{state}_{county}" for (state,county) in zip(da["state"], da["county"])]

The distribution of group (cluster) sizes is shown below.  The median group size is approximately 100.

In [None]:
gs = da.groupby("location").size()
gss = np.sort(gs.values)
pp = np.linspace(0, 1, gss.size)
plt.grid(True)
plt.plot(pp, np.log10(gss))
plt.ylabel("Log10 group size")
plt.xlabel("Probability point")

## Regression analysis for a single year

Our main tool will be generalized linear models (GLM), using the framework known as "generalized estimating equations" (GEE) that allows us to fit GLMs to clustered data.

When fitting regression models, especially more complicated forms that involve nonlinear and nonadditive components, it is often useful to center any quantitative predictors.

In [None]:
momage_mean = da["momage"].mean()
da["momage_cen"] = da["momage"] - momage_mean
dadage_mean = da["dadage"].mean()
da["dadage_cen"] = da["dadage"] - dadage_mean
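
One reason centering helps is that a raw quantitative variable and its square are usually very highly correlated, whereas after centering the linear and quadratic terms are much less so.  The sketch below shows this with synthetic ages drawn uniformly between 15 and 45; these values are an assumption for illustration and are not the NCHS data.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(15, 45, size=5000)  # synthetic ages, not the NCHS data
age_cen = age - age.mean()

# Correlation between the linear and quadratic terms, raw versus centered.
r_raw = np.corrcoef(age, age**2)[0, 1]
r_cen = np.corrcoef(age_cen, age_cen**2)[0, 1]
print(f"raw: {r_raw:.3f}, centered: {r_cen:.3f}")
```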

We begin with a basic model that incorporates some plausible predictors of birth weight that are available in the NCHS data. This model includes main effects for all variables, and models maternal age (momage) and paternal age (dadage) quadratically, for reasons that will be explored below.  This initial model is a "gamma" GLM, which means that the conditional variance is modeled as being proportional to the square of the conditional mean.  Despite the name, this does not require the response variable to follow a Gamma distribution.

In [None]:
fml = "birthweight ~ sex + momage + I(momage_cen**2) + dadage + I(dadage_cen**2) + plurality + birthorder"
m0 = sm.GLM.from_formula(fml, data=da, family=sm.families.Gamma(link=sm.families.links.log()))
r0 = m0.fit(scale="X2")
r0.summary()
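
A variance proportional to the square of the mean is equivalent to a constant coefficient of variation (standard deviation divided by mean).  The sketch below simulates gamma data with a fixed shape parameter at three different means to show this.  The shape value and the means are arbitrary choices for illustration, not estimates from the birth data.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = 25.0  # with a fixed shape, variance = mean**2 / shape

cvs = []
for mu in [2500.0, 3000.0, 3500.0]:
    # Gamma draws with mean mu: scale = mu / shape.
    y = rng.gamma(shape, scale=mu / shape, size=200000)
    cv = y.std() / y.mean()
    cvs.append(cv)
    print(f"mean={y.mean():.0f}  sd={y.std():.0f}  cv={cv:.3f}")
```

The standard deviation grows with the mean while the coefficient of variation stays near 1/sqrt(shape).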

The birth weights are likely to be "clustered" (non-independent) by county.  The GEE approach to fitting a GLM accounts for this possibility.

In [None]:
m1 = sm.GEE.from_formula(fml, groups=da["location"], data=da, 
                         family=sm.families.Gamma(link=sm.families.links.log()),
                         cov_struct=sm.cov_struct.Independence())
r1 = m1.fit()
r1.summary()

A defining property of GLMs is that the residuals are orthogonal to the covariates (for this model, the relevant residuals are the working residuals).  This is how we know that all information in the covariates has been incorporated into the model (subject to the way that the model was specified).

In [None]:
np.dot(m1.exog.T, r1.resid_working)
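
The orthogonality can be traced to the estimating equations.  As a sketch, assume the log link $\eta_i = x_i^\top \beta = \log \mu_i$ and the gamma variance function $V(\mu) = \mu^2$ used above, and take the working residual to be $(y_i - \mu_i)\, g'(\mu_i)$ with $g = \log$ (this is our understanding of how the working residual is defined).  Under working independence the estimating equations are

$$
\sum_i x_i \, \frac{\partial \mu_i}{\partial \eta_i} \, \frac{y_i - \mu_i}{V(\mu_i)}
= \sum_i x_i \, \mu_i \, \frac{y_i - \mu_i}{\mu_i^2}
= \sum_i x_i \, \frac{y_i - \mu_i}{\mu_i} = 0 .
$$

Since $g'(\mu_i) = 1/\mu_i$, the final sum is the product of the transposed design matrix with the vector of working residuals, so at the fitted parameters this product is zero up to numerical error.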

The model fit above (r1) treats the birth weight observations as being independent within counties.  This independence is actually a "working independence" and does not need to be true.  The GEE approach is more precise when the "working correlation model" is correctly specified, but the results remain valid even if this is not the case.  It is common to fit working models with independent working correlation structures even when we strongly suspect that the data are not independent.

If we want to get more insight into the within-cluster dependence structure, we can refit the model with an "exchangeable" within-county correlation model.  This allows any two births in the same county to be correlated to the same extent.  This provides us with an "intra-class correlation" (ICC), estimating the average pairwise correlation between births in the same county, as shown below.

In [None]:
m2 = sm.GEE.from_formula(fml, groups=da["location"], data=da, 
                         family=sm.families.Gamma(link=sm.families.links.log()),
                         cov_struct=sm.cov_struct.Exchangeable())
r2 = m2.fit()
r2.cov_struct.summary()

Although the ICC is small, when some of the clusters are large (as is the case here), it can be consequential.
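
A rough way to see this is the standard "design effect" approximation, 1 + (m - 1) * ICC, which describes how much the variance of a sample mean is inflated when the data come in clusters of equal size m.  The sketch below evaluates it for a few cluster sizes.  The ICC value of 0.01 and the cluster sizes are illustrative assumptions, not estimates from the birth data, and the helper name `design_effect` is hypothetical.

```python
def design_effect(cluster_size, icc):
    """Approximate variance inflation for a mean from equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc

# Illustrative values only; these are not estimates from the birth data.
for m in [10, 100, 1000, 10000]:
    print(m, design_effect(m, 0.01))
```

Even with a small ICC, the inflation grows roughly linearly in the cluster size, which is why large counties matter.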

### Heteroscedasticity and assessing the mean/variance relationship

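As noted above, the Gamma GLM/GEE analysis takes the conditional variance to be proportional to the square of the conditional mean.  If this is consistent with the data, the spread of the Pearson residuals should be roughly constant across the range of fitted values.  Below we assess whether this is the case.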

In [None]:
dx = pd.DataFrame({"fit": r1.fittedvalues, "resid": r1.resid_pearson})
dx["aresid"] = np.abs(dx["resid"])
dx = dx.sort_values(by="fit")
dx["fitgroup"] = pd.qcut(dx["fit"], 20)
aa = dx.groupby("fitgroup").agg({"fit": "median", 
      "aresid": ["median", lambda x: np.quantile(x, 0.1), lambda x: np.quantile(x, 0.9)]})
aa.columns = ["fit", "q50", "q10", "q90"]

In [None]:
ax = plt.axes([0.1, 0.1, 0.75, 0.9])
ax.grid(True)
ax.set_ylim(0, 0.4)
ax.plot(aa["fit"], aa["q10"], label="q10")
ax.plot(aa["fit"], aa["q50"], label="q50")
ax.plot(aa["fit"], aa["q90"], label="q90")
ax.set_xlabel("Predicted birth weight")
ax.set_ylabel("Absolute Pearson residual quantiles")
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="center right")
leg.draw_frame(False)

### Predictive concordance

Below we calculate a measure of concordance, showing that the model only explains a rather small fraction of the birth weight variation.  This is common when working with data on human biological traits. Nevertheless, many of the associations are strongly statistically significant, and the effect sizes are large enough to have important implications. 

In [None]:
kendalltau(r1.fittedvalues, da["birthweight"])
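
To make the concordance measure concrete, the sketch below computes a simple version of Kendall's tau by hand on four made-up pairs of fitted and observed values.  It counts concordant and discordant pairs and ignores tie corrections (this version is often called tau-a), so it matches a tie-corrected statistic only when there are no ties.  The function name `kendall_tau_a` and the numbers are hypothetical.

```python
import numpy as np
from itertools import combinations

def kendall_tau_a(x, y):
    """Tau-a: (concordant - discordant) / number of pairs, with no tie correction."""
    conc = 0
    disc = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
        conc += int(s > 0)
        disc += int(s < 0)
    return (conc - disc) / len(pairs)

# Made-up fitted and observed values, for illustration only.
fitted = np.array([3100.0, 3200.0, 3300.0, 3400.0])
observed = np.array([2900.0, 3600.0, 3000.0, 3500.0])

tau = kendall_tau_a(fitted, observed)
# Without ties, (1 + tau) / 2 is the proportion of concordant pairs.
print(tau, (1 + tau) / 2)
```

A tau near zero means that the fitted values order a randomly chosen pair of births correctly only slightly more often than a coin flip would.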

### Interpreting the fitted model

We modeled the roles of parental ages quadratically.  For maternal age, this reveals the inverted U-shaped relationship below.  This type of plot shows the conditional mean for the dependent variable (birth weight) versus an independent variable of interest (parental age), holding all other covariates fixed.  Since there are no interactions, the specific value at which we fix the other variables is unimportant (because of the log link, the choice only rescales the graph by a constant factor, which is a vertical translation on the log scale).  But when interactions are present (i.e. the model is not additive), more work is needed to see how parental age and birth weight are related for each possible setting of the other covariates.

In [None]:
dp = da.head(100).copy()

def set_parent_ages(dp, momage, dadage):
    dp["momage"] = momage
    dp["momage_cen"] = dp["momage"] - momage_mean
    dp["dadage"] = dadage
    dp["dadage_cen"] = dadage - dadage_mean
    return dp

dp["sex"] = "female"
dp["plurality"] = 1
dp["birthorder"] = 1

In [None]:
age = np.linspace(15, 50, 100)
dp = set_parent_ages(dp, age, 25)
ym = r1.predict(dp)
plt.grid(True)
plt.plot(age, ym)
plt.xlabel("Maternal age")
plt.ylabel("Expected birth weight")

The association between paternal age and birthweight also has an inverted U shape, but note that the amplitude (and significance level) are much weaker compared to maternal age.

In [None]:
dp = set_parent_ages(dp, 25, age)
yp = r1.predict(dp)
plt.grid(True)
plt.plot(age, yp)
plt.xlabel("Paternal age")
plt.ylabel("Expected birth weight")

A more insightful comparison of the roles of maternal and paternal age is obtained by plotting the two curves together.

In [None]:
plt.grid(True)
plt.plot(age, ym, label="Mom age")
plt.plot(age, yp, label="Dad age")
ha, lb = plt.gca().get_legend_handles_labels()
plt.figlegend(ha, lb, loc="center right")
plt.xlabel("Age")
plt.ylabel("Expected birth weight")

Below, in a single plot, we show the mean birthweights for female and male babies, for single and twin births, and for birth orders from 1 to 5.

In [None]:
dp = set_parent_ages(dp, 25, 25)
sex = np.concatenate((np.repeat("female", 25), np.repeat("male", 25)))
sex = np.concatenate((sex, sex))
dp["sex"] = sex
plurality = np.concatenate((np.ones(50), 2*np.ones(50)))
dp["plurality"] = plurality
birthorder = np.linspace(1, 5, 25)
dp["birthorder"] = np.concatenate((birthorder, birthorder, birthorder, birthorder))
yp = r1.predict(dp)

In [None]:
plt.grid(True)
plt.plot(birthorder, yp[0:25], label="Female single")
plt.plot(birthorder, yp[25:50], label="Male single")
plt.plot(birthorder, yp[50:75], label="Female twin")
plt.plot(birthorder, yp[75:], label="Male twin")
ha, lb = plt.gca().get_legend_handles_labels()
plt.figlegend(ha, lb, loc="center right")
plt.xlabel("Birthorder")
plt.ylabel("Expected birth weight")
plt.gca().set_xticks([1, 2, 3, 4, 5]);

### Contrasts

To obtain a more precise result targeting a specific research question, we can devise a contrast that reflects the scientific research aim as directly as possible.  For example, we may want to focus on the difference between the mother being 35 versus 30 years of age, for a first born child.  Since most births are single births, we can set plurality=1.  One way to proceed would be to take the conditional mean birth weight to a 35 year old mother minus the conditional mean birth weight to a 30 year old mother, setting the father's age to the mean age of all fathers.  This might be interpreted as the purely "biological" effect of maternal age.  However, as seen above, mother's age and father's age are correlated.  For many purposes, a more natural approach would be to hold the father's age fixed at the average father age for the given mother age.  To do that, we can fit an auxiliary model for father age in relation to mother age.

In [None]:
m3 = sm.OLS.from_formula("dadage ~ bs(momage, 5)", data=da)
r3 = m3.fit()

As usual, we will do a bit of verification before using this model.

In [None]:
dp = da.iloc[0:100, :].copy()
momage = np.linspace(15, 40, 100)
dp = set_parent_ages(dp, momage, 25)
yp = r3.predict(dp)

plt.grid(True)
plt.plot(dp["momage"], yp)
plt.xlabel("Mother age")
plt.ylabel("Expected father age")

Below are the average father ages corresponding to mother ages of 30 and 35.

In [None]:
dq = da.iloc[0:2, :].reset_index().copy()
dq["momage"] = [30, 35]
yp = r3.predict(dq)
yp

Below are the estimated average birth weights corresponding to these parent ages.

In [None]:
dp = da.iloc[0:2, :].reset_index().copy()
# yp holds the expected father ages for mother ages [30, 35]; reverse it to match [35, 30].
dp = set_parent_ages(dp, [35, 30], yp.values[::-1])
dp["plurality"] = 1
dp["birthorder"] = 1
dp["sex"] = ["female", "female"]
yp = r2.predict(dp)
yp

The contrast is the difference between these fitted birth weights:

In [None]:
np.dot(yp, [1, -1])

The precision with which the parameters are estimated is extremely high, so there is little statistical uncertainty here.  The greater risks are systematic errors due to model misspecification or confounding.

In [None]:
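# Design matrix rows for the two parent-age profiles in dp.
dm = patsy.dmatrix(m1.data.design_info, dp, return_type="dataframe")
# Fitted means from r1, so the estimate and its covariance come from the same fit.
mu = np.asarray(r1.predict(dp))
# Delta method: with a log link, the gradient of each fitted mean is mu * x.
jm = mu[:, None] * np.asarray(dm)
cm = np.dot(np.dot(jm, np.asarray(r1.cov_params())), jm.T)
d = np.array([1, -1])
c = np.sqrt(np.dot(d, np.dot(cm, d)))
e = np.dot(d, mu)
print("Expected difference (age 35 mother minus age 30 mother): ", e)
print("Standard error of the expected difference: ", c)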
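
Because the model uses a log link, the parameter covariance matrix describes uncertainty on the log scale, while the contrast is a difference of means in grams.  One way to connect the two is the delta method, in which the gradient of each fitted mean with respect to the coefficients is the mean times the covariate row.  The sketch below works through this for a hypothetical two-coefficient model; the coefficients, covariance matrix, and design rows are made up for illustration and are not taken from the fitted models above.

```python
import numpy as np

# Hypothetical model: log(mu) = b0 + b1 * x, with made-up values.
beta = np.array([8.0, 0.002])
cov = np.array([[1e-6, -2e-8], [-2e-8, 1e-9]])  # made-up covariance of beta
X = np.array([[1.0, 35.0], [1.0, 30.0]])        # design rows for two profiles

mu = np.exp(X @ beta)   # fitted means on the response scale
J = mu[:, None] * X     # Jacobian d(mu)/d(beta) under a log link
d = np.array([1.0, -1.0])

diff = d @ mu                              # contrast on the response scale
se_diff = np.sqrt(d @ J @ cov @ J.T @ d)   # delta-method standard error
print(diff, se_diff)
```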

## Heterogeneity and non-additivity

The models considered above are additive in the sense that the mean structure, on the log scale (due to the GLM link function), is an additive function of main effects for each of the covariates.  This implies that the change in the log expected birth weight as one factor changes is constant in terms of the other covariates.  Below we fit models that do not have this additivity, and use score testing to assess how well they fit.

Score tests are used to compare nested models, so we add non-additivity (through the inclusion of interactions) one at a time, building a sequence of successively more complex models.

In [None]:
fml = "birthweight ~ sex * (sex + momage + I(momage_cen**2) + dadage + I(dadage_cen**2) + plurality + birthorder)"
m4 = sm.GEE.from_formula(fml, groups=da["location"], data=da, 
                         family=sm.families.Gamma(link=sm.families.links.log()),
                         cov_struct=sm.cov_struct.Independence())
r4 = m4.fit()

In [None]:
fml = "birthweight ~ (sex + momage) * (sex + momage + I(momage_cen**2) + dadage + I(dadage_cen**2) + plurality + birthorder)"
m5 = sm.GEE.from_formula(fml, groups=da["location"], data=da, 
                         family=sm.families.Gamma(link=sm.families.links.log()),
                         cov_struct=sm.cov_struct.Independence())
r5 = m5.fit()

In [None]:
fml = "birthweight ~ (sex + momage + plurality) * (sex + momage + I(momage_cen**2) + dadage + I(dadage_cen**2) + plurality + birthorder)"
m6 = sm.GEE.from_formula(fml, groups=da["location"], data=da, 
                         family=sm.families.Gamma(link=sm.families.links.log()),
                         cov_struct=sm.cov_struct.Independence())
r6 = m6.fit()

In [None]:
fml = "birthweight ~ (sex + momage + dadage + plurality) * (sex + momage + I(momage_cen**2) + dadage + I(dadage_cen**2) + plurality + birthorder)"
m7 = sm.GEE.from_formula(fml, groups=da["location"], data=da, 
                         family=sm.families.Gamma(link=sm.families.links.log()),
                         cov_struct=sm.cov_struct.Independence())
r7 = m7.fit()

In [None]:
fml = "birthweight ~ (sex + momage + dadage + plurality + birthorder) * (sex + momage + I(momage_cen**2) + dadage + I(dadage_cen**2) + plurality + birthorder)"
m8 = sm.GEE.from_formula(fml, groups=da["location"], data=da, 
                         family=sm.families.Gamma(link=sm.families.links.log()),
                         cov_struct=sm.cov_struct.Independence())
r8 = m8.fit()

Below we use score tests to compare each model to its parent.

In [None]:
t41 = m4.compare_score_test(r1)
t54 = m5.compare_score_test(r4)
t65 = m6.compare_score_test(r5)
t76 = m7.compare_score_test(r6)
t87 = m8.compare_score_test(r7)

The results of the score tests are below, and suggest that the most complex model (with the least additive structure) fits the data best.

In [None]:
for t in [t41, t54, t65, t76, t87]:
    print(t)
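
A score test statistic is usually referred to a chi-square distribution whose degrees of freedom equal the number of parameters added in the larger model.  The sketch below shows how a p-value follows from a statistic and its degrees of freedom.  The helper name `score_pvalue` and the input values are hypothetical and are not results from the models above.

```python
from scipy.stats import chi2

def score_pvalue(statistic, df):
    """Upper-tail chi-square probability for a score test statistic."""
    return chi2.sf(statistic, df)

# Illustrative inputs only.
print(score_pvalue(3.84, 1))
print(score_pvalue(20.0, 6))
```

With samples as large as these, even small departures from additivity can give very small p-values, so the size of the interaction effects should be examined alongside the tests.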

We will now briefly attempt to understand what non-additivity is expressed by this model.  We will focus here on the roles of maternal age and birth order.  The plot below suggests that for younger mothers, a first born child is larger than a second born child, but for older mothers, this relationship is reversed.

In [None]:
dp = da.iloc[0:100, :].copy()
momage1 = np.linspace(15, 40, 50)
momage = np.concatenate((momage1, momage1))
dp = set_parent_ages(dp, momage, 25)
dp["plurality"] = 1
dp["birthorder"] = 1
dp["sex"] = "female"
dp["birthorder"] = np.concatenate((np.repeat(1, 50), np.repeat(2, 50)))

In [None]:
yp = r8.predict(dp)

plt.axes([0.1, 0.1, 0.75, 0.8])
plt.grid(True)
plt.plot(momage1, yp[0:50], "-", color="orange", label="1")
plt.plot(momage1, yp[50:], "-", color="purple", label="2")
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="center right", title="Birthorder")
leg.draw_frame(False)
plt.xlabel("Maternal age")
plt.ylabel("Expected birth weight")

Another form of heterogeneity is heterogeneity over time, which might also be referred to as "non-stationarity".  To assess this, we will move forward 20 years to 1991 and compare the fitted models for 1971 and 1991.

In [None]:
db = pd.read_csv(pa / "1991.csv.gz")
db = db.dropna()
db["location"] = [f"{state}_{county}" for (state,county) in zip(db["state"], db["county"])]
db["momage_cen"] = db["momage"] - momage_mean
db["dadage_cen"] = db["dadage"] - dadage_mean
db

Before looking at birthweight, we consider a few key covariates.  A difference in the covariate distribution between two cohorts is known as "distribution shift".  Below we see that both mothers and fathers are older in 1991 compared to 1971.

In [None]:
print(np.median(da["momage"]))
print(np.median(db["momage"]))

In [None]:
print(np.median(da["dadage"]))
print(np.median(db["dadage"]))

Also, the mean birthorder is lower in 1991, reflecting slightly smaller family sizes.

In [None]:
print(np.mean(da["birthorder"]))
print(np.mean(db["birthorder"]))

The rate of multiple births is slightly greater in 1991, potentially reflecting assisted reproductive technologies.

In [None]:
print(np.mean(da["plurality"] > 1))
print(np.mean(db["plurality"] > 1))

We will use a quantile-quantile (QQ) plot to compare the marginal distributions of birthweights in 1971 and 1991. The result suggests a "location shift", in which at each quantile, the 1991 birthweights are greater by about 60 grams.

In [None]:
pp = np.linspace(0.1, 0.9, 9)
q71 = np.quantile(da["birthweight"], pp)
q91 = np.quantile(db["birthweight"], pp)

plt.grid(True)
plt.plot(q71, q91)
plt.xlabel("1971")
plt.ylabel("1991")
plt.axline((3000, 3000), slope=1, color="grey")
(q91 - q71).mean()
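
To see why a roughly constant gap between quantiles indicates a location shift, the sketch below compares quantile differences for two synthetic alternatives: a pure shift of 60 units and a pure change of scale around the same center.  The distributions are simulated under assumed parameters and are not the birth weight data.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(3300, 500, size=100000)   # synthetic "earlier" sample
b = a + 60                               # pure location shift
c = 1.1 * (a - a.mean()) + a.mean()      # pure scale change, same center

pp = np.linspace(0.1, 0.9, 9)
shift_gap = np.quantile(b, pp) - np.quantile(a, pp)
scale_gap = np.quantile(c, pp) - np.quantile(a, pp)

print(shift_gap)  # the same gap at every quantile
print(scale_gap)  # the gap changes sign between low and high quantiles
```

In a QQ plot, the first case appears as a line parallel to the diagonal, while the second appears as a line with a slope different from one.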

Now we turn to comparing the two models for birthweight.  For simplicity we will compare the models including only main effects.

In [None]:
fml = "birthweight ~ sex + momage + I(momage_cen**2) + dadage + I(dadage_cen**2) + plurality + birthorder"
m9 = sm.GEE.from_formula(fml, data=db, groups=db["location"], family=sm.families.Gamma(link=sm.families.links.log()))
r9 = m9.fit(scale="X2")
r9.summary()

In [None]:
r1.summary()

The coefficients are quite similar.

In [None]:
pd.DataFrame({"Coef1971": r1.params, "Coef1991": r9.params})