# Demographic predictors of birth counts in US counties from 2011-2020

In this notebook we will explore predictors of natality, defined here as birth count per county/year, for a subset of US counties between 2011 and 2020.  

In [None]:
import pandas as pd
import numpy as np
from prep import births, demog, pop, na, age_groups, rucc, adi
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.nonparametric.smoothers_lowess import lowess

View some of the raw data:

In [None]:
births.head()

In [None]:
pop.head()

In [None]:
adi.head()

In [None]:
rucc.head()

In [None]:
demog.head()

Create a dataframe for modeling by merging the birth data with the population, RUCC, and ADI data.

In [None]:
da = pd.merge(births, pop, on="FIPS", how="left")
da = pd.merge(da, rucc, on="FIPS", how="left")
da = pd.merge(da, adi, left_on="FIPS", right_on="FIPS5", how="left")
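
Left merges silently produce missing values for counties that have no match in the right-hand table, and they duplicate rows if a county appears more than once there. The sketch below shows one way to check for both using the `validate` and `indicator` arguments of `pd.merge`. The toy tables and their values are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the births and RUCC tables (all values are made up).
births_toy = pd.DataFrame({
    "FIPS": ["01001", "01001", "01003", "99999"],
    "year": [2011, 2012, 2011, 2011],
    "Births": [650, 640, 2100, 30],
})
rucc_toy = pd.DataFrame({"FIPS": ["01001", "01003"], "RUCC_2013": [2, 3]})

# validate= raises an error if a county appears more than once in the
# right-hand table; indicator= adds a _merge column recording whether
# each row found a match.
merged = pd.merge(births_toy, rucc_toy, on="FIPS", how="left",
                  validate="many_to_one", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "FIPS"].unique()
print(merged)
print("FIPS codes with no RUCC match:", list(unmatched))
```

Rows flagged as `left_only` are the ones that a later `dropna` would remove.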

Population will be used as an offset below, so we log it here.  We also log-transform the ADI national rank and drop rows with missing values.

In [None]:
da["logPop"] = np.log(da["Population"])
da["logADINatRank"] = np.log(1 + da["ADI_NATRANK"])
da = da.dropna()

It's not always essential to center quantitative variables, but it makes it easier to interpret their effects, especially when interactions are used.

In [None]:
print(da["year"].mean())
da["yearc"] = da["year"] - da["year"].mean()
da["logPopc"] = da["logPop"] - da["logPop"].mean()
da["RUCC_2013c"] = da["RUCC_2013"] - da["RUCC_2013"].mean()
da["logADINatRank"] = da["logADINatRank"] - da["logADINatRank"].mean()
da["logADINatRankZ"] = da["logADINatRank"] / da["logADINatRank"].std()
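
To illustrate why centering helps when interactions are present, the sketch below fits the same linear model with an interaction twice, once with an uncentered year-like variable and once with it centered. The data are simulated and the variable names are hypothetical. The interaction coefficient is unchanged, but the main effect of `x` changes from "slope at year 0" to "slope at the average year".

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
t = rng.uniform(2011, 2020, size=n)   # an uncentered "year"-like variable
tc = t - t.mean()

# Simulated outcome: the slope of x is 0.5 at the mean of t
y = 1.0 + 0.5 * x + 0.1 * tc + 0.2 * x * tc + rng.normal(scale=0.1, size=n)

def fit(x, t, y):
    # Least squares fit of y on 1, x, t and the x*t interaction
    X = np.column_stack([np.ones_like(x), x, t, x * t])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_raw = fit(x, t, y)    # main effect of x is the slope at t = 0
b_ctr = fit(x, tc, y)   # main effect of x is the slope at the mean of t

print("x main effect, raw t:     ", b_raw[1])
print("x main effect, centered t:", b_ctr[1])
print("interaction (raw, centered):", b_raw[3], b_ctr[3])
```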

One point that we will focus on below is the extent to which urbanicity (RUCC) and/or adversity (ADI) predict natality (after controlling for population size).  RUCC and ADI are related, but this does not mean that we cannot attempt to disentangle their roles in the regression.  The plot below gives a sense of how they are related, showing the standardized log ADI rank against the centered RUCC along with a lowess smooth.

In [None]:
px = da[["RUCC_2013c", "logADINatRankZ"]]
lm = lowess(px["logADINatRankZ"], px["RUCC_2013c"])
plt.plot(px["RUCC_2013c"], px["logADINatRankZ"], "o")
plt.xlabel("RUCC")
plt.ylabel("ADI (Z-score)")
plt.plot(lm[:, 0], lm[:, 1])
plt.grid(True)

## Scaling by population size and offsets

It is natural to expect a 1-1 scaling between total population size and the number of births.  That is, all else equal we would expect two counties that differ by a factor of two in population size to differ by a factor of two in natality.  To assess this, we can make a scatterplot of the birth count versus the population size in log-space.  Under the expected 1-1 scaling, the least-squares slope of log births on log population (computed as b below) should be close to 1.

In [None]:
da = da.sort_values(["FIPS", "year"])
da.head()

In [None]:
da["logBirths"] = np.log(da["Births"])
sns.scatterplot(da, x="logPop", y="logBirths", alpha=0.3)
plt.grid(True)
# Least-squares slope; ddof=1 matches the default normalization of np.cov
b = np.cov(da["logPop"], da["logBirths"])[0, 1] / np.var(da["logPop"], ddof=1)
b
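
As a check on the slope calculation, the sketch below uses simulated data in which births are exactly proportional to population up to multiplicative noise. It confirms that the covariance/variance ratio agrees with the slope from an ordinary least squares fit. The rate of 0.012 and the noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
log_pop = rng.normal(loc=10, scale=1.5, size=500)

# Births proportional to population (slope 1 on the log scale),
# with multiplicative noise; the rate 0.012 is an arbitrary choice.
log_births = np.log(0.012) + 1.0 * log_pop + rng.normal(scale=0.2, size=500)

# Slope as a ratio of sample covariance to sample variance
slope_cov = np.cov(log_pop, log_births)[0, 1] / np.var(log_pop, ddof=1)

# Slope from a degree-1 polynomial (least squares) fit
slope_fit, intercept_fit = np.polyfit(log_pop, log_births, 1)

print(slope_cov, slope_fit)
```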

## Assessing the variance structure and mean/variance relationships

Since we have 10 years of data for each county, we can treat these as replicates to estimate the mean and variance within each county (over the 10 years covered by the dataset).  This is one way for us to assess the mean/variance relationship.

In [None]:
mv = births.groupby("FIPS")["Births"].agg(["mean", "var"])
lmv = np.log(mv)
mv

Regress log variance on log mean

In [None]:
mr = sm.OLS.from_formula("var ~ mean", lmv).fit()
print(mr.summary())

Plot the log variance against the log mean.  If variance = phi * mean, then log(variance) = log(phi) + log(mean), i.e. the slope is 1 and the intercept is log(phi).  If variance = phi * mean^a then log(variance) = log(phi) + a * log(mean).

In [None]:
plt.clf()
plt.grid(True)
plt.plot(lmv["mean"], lmv["var"], "o", alpha=0.2, rasterized=True)
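# Grey: variance = mean (the Poisson relationship)
plt.axline((8, 8), slope=1, color="grey")
# Purple: slope 1 through the centroid, i.e. variance = phi * mean
plt.axline((lmv["mean"].mean(0), lmv["var"].mean(0)), slope=1, color="purple")
# Orange: fitted regression line of log variance on log mean
plt.axline((8, mr.params.iloc[0]+8*mr.params.iloc[1]), slope=mr.params.iloc[1], color="orange")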
plt.xlabel("Log mean", size=16)
plt.ylabel("Log variance", size=16)
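
To see what slopes to expect in this kind of plot, the sketch below simulates replicated data from two families with known mean/variance relationships: Poisson (variance = mean, slope 1) and Gamma with a fixed shape (variance proportional to mean^2, slope 2). The group counts, replicate counts, and shape parameter are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_groups, n_rep = 2000, 50
mu = np.exp(rng.uniform(np.log(20), np.log(2000), size=n_groups))

def loglog_slope(y):
    # Slope of log within-group variance on log within-group mean
    m = y.mean(axis=1)
    v = y.var(axis=1, ddof=1)
    return np.polyfit(np.log(m), np.log(v), 1)[0]

# Poisson: variance equals the mean
pois = rng.poisson(mu[:, None], size=(n_groups, n_rep))

# Gamma with fixed shape: variance = mean^2 / shape
shape = 25.0
gam = rng.gamma(shape, mu[:, None] / shape, size=(n_groups, n_rep))

slope_pois = loglog_slope(pois)
slope_gam = loglog_slope(gam)
print("Poisson slope:", slope_pois)
print("Gamma slope:  ", slope_gam)
```

A fitted slope near 2 in the real data would therefore point toward a Gamma-like variance function rather than a Poisson-like one.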

## Urbanicity and time trends as predictors of natality

Below we fit a GLM, which is not appropriate since we have repeated measures on counties.  Due to the repeated measures, the uncertainty assessments (standard errors, p-values, confidence intervals, score tests) will be invalid, but the point estimates of the coefficients are still meaningful.

In [None]:
fml = "Births ~ logPop + RUCC_2013 + logADINatRankZ"
m0 = sm.GLM.from_formula(fml, family=sm.families.Poisson(), data=da)
r0 = m0.fit() # Poisson
r0x = m0.fit(scale="X2") # Quasi-Poisson
r0x.summary()
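
The quasi-Poisson fit keeps the Poisson point estimates and rescales the standard errors by the square root of an estimated dispersion. The sketch below is a hand-rolled illustration of that idea in NumPy on simulated overdispersed counts. It estimates the dispersion as the Pearson chi-square statistic divided by the residual degrees of freedom, which is the usual definition and is assumed here to be what `scale="X2"` requests. It is not the statsmodels implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
mu_true = np.exp(1.0 + 0.3 * x)

# Overdispersed counts: a gamma-mixed Poisson with Var = mu + mu^2 / 2
y = rng.poisson(mu_true * rng.gamma(2.0, 0.5, size=n))

# Poisson regression with log link, fit by Fisher scoring
beta = np.array([np.log(y.mean()), 0.0])
for _ in range(25):
    mu = np.exp(X @ beta)
    beta = beta + np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))

mu = np.exp(X @ beta)

# Pearson chi-square divided by residual degrees of freedom
scale = np.sum((y - mu) ** 2 / mu) / (n - X.shape[1])

# Model-based Poisson standard errors, then the quasi-Poisson rescaling
se_poisson = np.sqrt(np.diag(np.linalg.inv(X.T @ (mu[:, None] * X))))
se_quasi = np.sqrt(scale) * se_poisson

print("coefficients:", beta)
print("dispersion:  ", scale)
print("SE (Poisson):", se_poisson)
print("SE (quasi):  ", se_quasi)
```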

Using GEE accounts for the correlated data.  After accounting for clustering by county, ADI remains significant but RUCC loses significance.

In [None]:
m1 = sm.GEE.from_formula(fml, groups="FIPS", family=sm.families.Poisson(), data=da)
r1 = m1.fit() # Poisson and quasi-Poisson are the same for GEE
r1x = m1.fit(scale="X2")
r1x.summary()
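
The reason a county-level covariate can lose significance once clustering is accounted for is easiest to see in a linear-model analogue. The sketch below simulates clustered data with a covariate that is constant within clusters and has no true effect, then compares the naive standard error with a cluster-robust sandwich standard error. This is a simplified stand-in for what GEE does, not the GEE estimator itself, and all settings are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n_clusters, n_per = 300, 10
g = np.repeat(np.arange(n_clusters), n_per)   # cluster labels

x_cluster = rng.normal(size=n_clusters)       # covariate constant within cluster
u_cluster = rng.normal(size=n_clusters)       # shared cluster-level effect
x = x_cluster[g]
y = 1.0 + 0.0 * x + u_cluster[g] + rng.normal(size=g.size)

X = np.column_stack([np.ones_like(x), x])
bread = np.linalg.inv(X.T @ X)
beta = bread @ X.T @ y
resid = y - X @ beta

# Naive covariance: treats all observations as independent
sigma2 = resid @ resid / (len(y) - X.shape[1])
se_naive = np.sqrt(np.diag(sigma2 * bread))

# Sandwich covariance: sum the score contributions within each cluster
scores = X * resid[:, None]
cluster_scores = np.zeros((n_clusters, X.shape[1]))
np.add.at(cluster_scores, g, scores)
meat = cluster_scores.T @ cluster_scores
se_robust = np.sqrt(np.diag(bread @ meat @ bread))

print("naive SE: ", se_naive)
print("robust SE:", se_robust)
```

With ten observations per cluster and substantial within-cluster correlation, the naive standard error for the cluster-level covariate is considerably too small.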

Use log population as an offset instead of a covariate

In [None]:
m2 = sm.GEE.from_formula("Births ~ RUCC_2013 + logADINatRankZ", groups="FIPS", offset="logPop",
                         family=sm.families.Poisson(), data=da)
r2 = m2.fit(scale="X2")
r2.summary()
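
Using log population as an offset fixes its coefficient at 1, so the model describes the birth rate rather than the birth count. In the simplest case, a Poisson model with only an intercept and the offset, the estimate has a closed form: the exponentiated intercept is total births divided by total population. The sketch below checks this on simulated data with an assumed rate of 0.012.

```python
import numpy as np

rng = np.random.default_rng(5)
pop = np.exp(rng.normal(10, 1.5, size=400))
rate = 0.012                      # assumed births per person per year
births = rng.poisson(rate * pop)

# Intercept-only Poisson model with offset log(pop):
#   log E[births] = log(pop) + b0
# The score equation sum(births - pop * exp(b0)) = 0 gives
b0 = np.log(births.sum() / pop.sum())

fitted = np.exp(np.log(pop) + b0)
score = np.sum(births - fitted)   # zero at the estimate

print("estimated rate:", np.exp(b0))
print("score at estimate:", score)
```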

Below we construct a diagnostic plot for the variance structure that does not require there to be replicates  (in general there will be no replicates, and even here it is unclear whether we can treat the 10 years of data within each county as replicates).  If the variance structure is correctly specified, then the absolute Pearson residuals should have constant conditional mean with respect to the fitted values.

In [None]:
plt.clf()
plt.grid(True)
lfv = np.log(r2.fittedvalues).values
apr = np.abs(r2.resid_pearson)
ii = np.argsort(lfv)
lfv = lfv[ii]
apr = apr[ii]
ff = sm.nonparametric.lowess(apr, lfv)
plt.plot(lfv, apr, "o", alpha=0.2, rasterized=True)
plt.plot(ff[:, 0], ff[:, 1], "-", color="orange")
plt.title("Poisson mean/variance model")
plt.xlabel("Log predicted mean", size=16)
plt.ylabel("Absolute Pearson residual", size=16)
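
The logic of this diagnostic can be checked on simulated data where the true variance function is known. In the sketch below the data have variance proportional to the squared mean, and Pearson residuals are computed around the true means under two candidate variance functions. The mean range and shape parameter are arbitrary assumptions. Under the wrong (Poisson) variance function the absolute residuals grow with the mean, while under the correct one they are flat.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20000
mu = np.exp(rng.uniform(np.log(10), np.log(10000), size=n))
shape = 20.0
y = rng.gamma(shape, mu / shape)   # Var(y) = mu^2 / shape

def abs_pearson(y, mu, varfunc):
    # Absolute Pearson residual: |y - mu| / sqrt(V(mu))
    return np.abs(y - mu) / np.sqrt(varfunc(mu))

apr_pois = abs_pearson(y, mu, lambda m: m)        # V(mu) = mu
apr_gam = abs_pearson(y, mu, lambda m: m ** 2)    # V(mu) = mu^2

# Compare average absolute residuals for large versus small means
hi = mu > np.median(mu)
ratio_pois = apr_pois[hi].mean() / apr_pois[~hi].mean()
ratio_gam = apr_gam[hi].mean() / apr_gam[~hi].mean()

print("Poisson variance function, high/low ratio:", ratio_pois)
print("Gamma variance function, high/low ratio:  ", ratio_gam)
```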

The Poisson variance model did not fit well based on the diagnostic plot above, so we next consider a Gamma family to better match the mean/variance relationship.

In [None]:
m3 = sm.GEE.from_formula("Births ~ RUCC_2013", groups="FIPS", offset="logPop",
                         family=sm.families.Gamma(link=sm.families.links.log()), data=da)
r3 = m3.fit(scale="X2")
r3.summary()

Diagnostic plot for mean/variance relationship with gamma model.

In [None]:
plt.clf()
plt.grid(True)
lfv = np.log(r3.fittedvalues).values
apr = np.abs(r3.resid_pearson)
ii = np.argsort(lfv)
lfv = lfv[ii]
apr = apr[ii]
ff = sm.nonparametric.lowess(apr, lfv)
plt.plot(lfv, apr, "o", alpha=0.2, rasterized=True)
plt.plot(ff[:, 0], ff[:, 1], "-", color="orange")
plt.title("Gamma mean/variance model")
plt.xlabel("Log predicted mean", size=16)
plt.ylabel("Absolute Pearson residual", size=16)

Now we proceed to fit and interpret some regression models.  Here we use an exchangeable correlation structure in the GEE.  Since RUCC and ADI are constant within groups, the parameter estimates and standard errors are the same as with the independence model.  The first model only considers the roles of urbanicity (RUCC) and deprivation (ADI).

In [None]:
m4 = sm.GEE.from_formula("Births ~ RUCC_2013 + logADINatRankZ", groups="FIPS", offset="logPop",
                         cov_struct=sm.cov_struct.Exchangeable(),
                         family=sm.families.Gamma(link=sm.families.links.log()), data=da)
r4 = m4.fit(scale="X2")
r4.summary()

Now we consider the role of urbanicity as well as the potential for a linear time trend.

In [None]:
m5 = sm.GEE.from_formula("Births ~ RUCC_2013 + logADINatRankZ + year", groups="FIPS", offset="logPop",
                         cov_struct=sm.cov_struct.Exchangeable(),
                         family=sm.families.Gamma(link=sm.families.links.log()), data=da)
r5 = m5.fit(scale="X2")
r5.summary()

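Now we consider the possibility that the linear time trend differs by level of deprivation and urbanicity.  This model uses the centered versions of RUCC and year so that the main effects remain interpretable alongside the interactions.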

In [None]:
m6 = sm.GEE.from_formula("Births ~ (logADINatRankZ + RUCC_2013c) * yearc", groups="FIPS", offset="logPop",
                         cov_struct=sm.cov_struct.Exchangeable(),
                         family=sm.families.Gamma(link=sm.families.links.log()), data=da)
r6 = m6.fit(scale="X2")
r6.summary()

Score tests comparing pairs of nested models:

In [None]:
print(r5.model.compare_score_test(r4))
print(r6.model.compare_score_test(r5))

## Principal Components Regression

We begin by log-transforming and double centering the demographic data.

In [None]:
demogx = np.asarray(demog)
demogx = np.log(1 + demogx)
demogx -= demogx.mean()
demogx -= demogx.mean(0)
demogx -= demogx.mean(1)[:, None]
demog
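
Double centering means that both the row means and the column means of the resulting matrix are zero. The sketch below applies the same three steps to a small made-up matrix and checks this property.

```python
import numpy as np

rng = np.random.default_rng(7)

# A small made-up count matrix, log-transformed as in the notebook
A = np.log1p(rng.poisson(50, size=(6, 4)).astype(float))

A -= A.mean()                    # remove the grand mean
A -= A.mean(axis=0)              # remove column means
A -= A.mean(axis=1)[:, None]     # remove row means

print("column means:", A.mean(axis=0))
print("row means:   ", A.mean(axis=1))
```

Removing the row means after the column means does not disturb the column centering, because the row means of a column-centered matrix themselves average to zero.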

Get factors (principal components) from the demographic data.

In [None]:
u, s, vt = np.linalg.svd(demogx, full_matrices=False)
v = vt.T
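
One common summary of such a decomposition is the proportion of variance explained (PVE) by each factor, which is the squared singular value divided by the sum of all squared singular values. The sketch below computes it for a simulated matrix with three strong factors. The matrix is made up and is not the demographic data.

```python
import numpy as np

rng = np.random.default_rng(8)

# Simulated matrix: rank-3 signal plus unit-variance noise
Z = 3 * rng.normal(size=(200, 3)) @ rng.normal(size=(3, 30))
Z = Z + rng.normal(size=(200, 30))
Z -= Z.mean(axis=0)

u, s, vt = np.linalg.svd(Z, full_matrices=False)

pve = s ** 2 / np.sum(s ** 2)   # proportion of variance per factor
cum_pve = np.cumsum(pve)        # cumulative proportion

print("PVE of first 5 factors:", np.round(pve[:5], 3))
print("cumulative PVE after 3 factors:", cum_pve[2])
```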

Convert the coefficients back to the original coordinates

In [None]:
def convert_coef(c, npc):
    return np.dot(v[:, 0:npc], c/s[0:npc])
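
The conversion rests on the identity X v_k = s_k u_k for each singular triple. A linear predictor built from the first npc columns of u with coefficients c therefore equals X times the vector v[:, 0:npc] (c / s[0:npc]). The sketch below verifies this numerically on a random matrix. The names `Xd` and `beta` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(9)
Xd = rng.normal(size=(50, 8))    # stand-in for the centered data matrix
u, s, vt = np.linalg.svd(Xd, full_matrices=False)
v = vt.T

npc = 3
c = rng.normal(size=npc)         # coefficients on the first npc columns of u

# Same mapping as convert_coef: back to the original coordinates
beta = v[:, :npc] @ (c / s[:npc])

# The two linear predictors agree
lhs = Xd @ beta
rhs = u[:, :npc] @ c
print(np.max(np.abs(lhs - rhs)))
```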

Put the demographic factors into a dataframe

In [None]:
m = {("pc%02d" % k) : u[:, k] for k in range(100)}
m["FIPS"] = demog.index
demog_f = pd.DataFrame(m)

Merge demographic information into the births data

In [None]:
da = pd.merge(da, demog_f, on="FIPS", how="left")

Include this number of factors in subsequent models

In [None]:
npc = 10

A GLM, not appropriate since we have repeated measures on counties

In [None]:
fml = "Births ~ (logPopc + RUCC_2013c + logADINatRankZ) * yearc + " + " + ".join(["pc%02d" % j for j in range(npc)])
m7 = sm.GLM.from_formula(fml, family=sm.families.Poisson(), data=da)
r7 = m7.fit(scale="X2")
r7.summary()

GEE accounts for the correlated data

In [None]:
m8 = sm.GEE.from_formula(fml, groups="FIPS",
         family=sm.families.Gamma(link=sm.families.links.log()), data=da)
r8 = m8.fit(scale="X2")
r8.summary()

Use log population as an offset instead of a covariate

In [None]:
fml = "Births ~ " + " + ".join(["pc%02d" % j for j in range(npc)])
m9 = sm.GEE.from_formula(fml, groups="FIPS", offset="logPop",
         family=sm.families.Gamma(link=sm.families.links.log()), data=da)
r9 = m9.fit(scale="X2")
r9.summary()

Restructure the coefficients so that the age bands are in the columns.

In [None]:
def restructure(c):
    ii = pd.MultiIndex.from_tuples(na)
    c = pd.Series(c, index=ii)
    c = c.unstack()
    return c

This function fits a Gamma GLM to the data using 'npc' principal components as explanatory variables (using GEE to account for non-independence), then converts the coefficients back to the original variables.

In [None]:
def fitmodel(npc):
    # A GEE using log population as an offset
    fml = "Births ~ 1" if npc == 0 else "Births ~ RUCC_2013c*yearc + " + " + ".join(["pc%02d" % j for j in range(npc)])
    m = sm.GEE.from_formula(fml, groups="FIPS", family=sm.families.Gamma(link=sm.families.links.log()),
                            offset=da["logPop"], data=da)
    r = m.fit(scale="X2")

    # Convert the coefficients back to the original coordinates
    c = convert_coef(r.params[4:], npc)

    # Restructure the coefficients so that the age bands are
    # in the columns.
    c = restructure(c)

    return c, m, r

Plot styling information

In [None]:
colors = {"A": "purple", "B": "orange", "N": "lime", "W": "red"}
lt = {"F": "-", "M": ":"}
sym = {"H": "s", "N": "o"}
ages = range(0, 20)

Fit models with these numbers of PCs.

In [None]:
pcs = [0, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

In [None]:
models = []
for npc in pcs:

    c, m, r = fitmodel(npc)
    models.append((m, r))

    plt.clf()
    plt.figure(figsize=(9, 7))
    ax = plt.axes([0.14, 0.18, 0.7, 0.75])
    ax.grid(True)
    for i in range(c.shape[0]):
        a = c.index[i]
        la = "/".join(a)
        ax.plot(ages, c.iloc[i, :], lt[a[2]] + sym[a[1]], color=colors[a[0]],
                label=la)

    # Setup the horizontal axis labels
    ax.set_xticks(ages)
    ax.set_xticklabels(age_groups)
    for x in plt.gca().get_xticklabels():
        x.set_rotation(-90)

    ha, lb = plt.gca().get_legend_handles_labels()
    leg = plt.figlegend(ha, lb, loc="center right")
    leg.draw_frame(False)

    plt.xlabel("Age group", size=17)
    plt.ylabel("Coefficient", size=17)
    plt.title("%d factors" % npc)
    plt.show()

Use score tests to get a sense of the number of PC factors to include.  The proportion of variance explained (PVE) by each factor, which can be computed from the singular values s, is also worth considering.

In [None]:
for k in range(len(pcs) - 1):
    st = models[k+1][0].compare_score_test(models[k][1])
    print("%d versus %d: p=%f" % (pcs[k+1], pcs[k], st["p-value"]))

In [None]:
# https://www.jstor.org/stable/45118439

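def dv(a, b, scale):
    # NOTE: this early return makes dv the plain sum of squared differences,
    # so the code below it is unreachable as written.
    return ((b-a)**2).sum()
    # Variance-function-based distance (currently unused)
    vpa = 2 * scale * a
    vpb = 2 * scale * b
    u = np.log(vpb + np.sqrt(1 + vpb**2)) - np.log(vpa + np.sqrt(1 + vpa**2))
    u += vpb * np.sqrt(1 + vpb**2) - vpa * np.sqrt(1 + vpa**2)
    return (u**2).sum()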

for k in range(10):
    res = models[k][1]
    scale = res.scale
    n = len(res.model.endog)
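    # Linear predictor of the null model: intercept plus the offset
    icept = res.params.iloc[0] + da["logPop"]
    numer = dv(res.model.endog, res.fittedvalues, scale)
    denom = dv(res.model.endog, np.exp(icept), scale)
    #assert(np.all(denom >= numer))
    # Named rsq to avoid overwriting the fitted result r2 defined earlier
    rsq = 1 - numer / denom
    #print(denom)
    print("%15.1f %15.1f %15.3f" % (numer, denom, rsq))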
    #print(res.params[0])
  
#print(res.summary())
#res = models[5][1]
#icept = res.params[0] + da["logPop"]
#plt.plot(res.model.endog, res.fittedvalues, "o", alpha=0.2)
#plt.plot(res.model.endog, np.exp(icept), "o", alpha=0.2)
#plt.grid(True)