# Analyzing demographic variation among US counties using biplots

The data considered here are a single year of population counts for US counties.  The population within each county is partitioned into 2 x 2 x 4 x 19 = 304 demographic cells (sex ⨯ Hispanic ethnicity status ⨯ race ⨯ age).  See the [prep.py](https://github.com/kshedden/case_studies/blob/main/natality/prep.py) script for more information.

In [None]:
import numpy as np
import pandas as pd
from prep import demog, births
import matplotlib.pyplot as plt

This is what the demographic data looks like in its initial form:

In [None]:
demog.head()

Make sure that we use only the counties for which we have natality data.

In [None]:
fips = np.asarray(births["FIPS"].unique())
demogx = demog.reindex(fips)
demogx.head()

Get the national population counts in each race ⨯ ethnicity ⨯ sex cell, aggregating over age groups.  This is a simplified summary that is not used in any of the subsequent analyses.

In [None]:
demogy = demogx.copy()
demogy.columns = pd.MultiIndex.from_tuples([tuple(x.split("_")) for x in demogy.columns])
demogy = demogy.unstack()
demogy = demogy.reset_index()
demogy.columns = ["race", "ethnicity", "sex", "age", "FIPS", "pop"]
demogy.groupby(["race", "ethnicity", "sex"])["pop"].sum()

The total number of people included in this dataset is given below.

In [None]:
demogy["pop"].sum()

Convert the demographic data to an array, and save the county totals for use below.

In [None]:
demogz = np.asarray(demogx)
totpop = demogz.sum(1)

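Log-transform the counts to stabilize the variance and make the distributions more symmetric.  Adding 1 before taking the log avoids taking the log of zero for empty cells.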

In [None]:
demogz = np.log(1 + demogz)
totpopx = np.log(1 + totpop)
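The effect of the log transform on skewness can be illustrated with a small sketch.  The counts here are synthetic (lognormal draws chosen only for illustration, not the county data), and `skewness` is a helper defined just for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed "counts" (an assumption for illustration only)
counts = np.floor(rng.lognormal(mean=8.0, sigma=1.5, size=5000))

def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return (z**3).mean()

skew_raw = skewness(counts)
skew_log = skewness(np.log(1 + counts))

# The raw counts are strongly right-skewed; the logged counts are
# much closer to symmetric.
print(skew_raw, skew_log)
```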

Double center the data, saving the mean parameters so that we can standardize passive variables below.

In [None]:
gm = demogz.mean()
demogz -= gm
totpopx -= gm
colmn = demogz.mean(0)
demogz -= colmn
totpopx -= totpopx.mean()
rowmn = demogz.mean(1)
demogz -= rowmn[:, None]
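As a check on the centering steps above, the following sketch applies the same three subtractions to a small random matrix (a toy stand-in, not the county data) and confirms that both the row means and the column means are zero afterwards.  It also shows that double centering removes one dimension, so the smallest singular value is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4)) + 3.0  # toy matrix with a nonzero mean

x = x - x.mean()            # remove the grand mean
x = x - x.mean(0)           # remove the column means
x = x - x.mean(1)[:, None]  # remove the row means

row_means = x.mean(1)
col_means = x.mean(0)

# Both sets of means are zero up to rounding error
print(np.abs(row_means).max(), np.abs(col_means).max())

# Each row sums to zero, so the columns are linearly dependent and the
# smallest singular value is numerically zero.
sv = np.linalg.svd(x, compute_uv=False)
print(sv)
```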

Factor the data matrix using the singular value decomposition (SVD).

In [None]:
u, s, vt = np.linalg.svd(demogz, full_matrices=False)
v = vt.T

To understand how many dimensions are contributing variation, we can consider the singular values.  A plot of the raw singular values is not that informative:

In [None]:
plt.clf()
plt.grid(True)
plt.plot(s)
plt.ylabel("Singular value")
plt.xlabel("Position")
plt.show()

Now we can consider some simple models for the singular values, including an exponential model $\lambda_i = a\exp(-bi)$ or a powerlaw model $\lambda_i = a/i^b$, where $i$ is the position of the singular value.  These models can be assessed by plotting the singular values in semi-log space or in log space, as shown below.  The final singular value is omitted from these plots because double centering makes it essentially zero.  These plots suggest a "multiphasic" relationship that is, strictly speaking, neither exponential nor powerlaw.  One interpretation is that there are 10-12 large singular values followed by an exponentially decreasing pattern of "tail singular values".

In [None]:
# Plot in semi-log space
plt.clf()
plt.grid(True)
ii = np.arange(1, len(s) + 1)
plt.plot(ii[0:-1], np.log(s[0:-1]), "-o", alpha=0.4)
plt.xlabel("Position")
plt.ylabel("Log singular value")
plt.title("Assess fit of exponential model")
plt.show()

# Plot in log space
plt.clf()
plt.grid(True)
ii = np.arange(1, len(s) + 1)
plt.plot(np.log(ii[0:-1]), np.log(s[0:-1]), "-o", alpha=0.4)
plt.xlabel("Log position")
plt.ylabel("Log singular value")
plt.title("Assess fit of powerlaw model")
plt.show()
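The visual assessment can be complemented by fitting a straight line in each of the two coordinate systems.  The sketch below does this for synthetic singular values that follow an exact exponential model; the values of `a` and `b` and the helper `line_fit` are assumptions made for this illustration, not quantities estimated from the county data.

```python
import numpy as np

# Synthetic singular values from an exponential model (hypothetical a, b)
a, b = 5.0, 0.3
i = np.arange(1, 31)
s_demo = a * np.exp(-b * i)

def line_fit(x, y):
    """Least-squares line fit; returns the slope and the R^2."""
    slope, icept = np.polyfit(x, y, 1)
    resid = y - (icept + slope * x)
    return slope, 1 - resid.var() / y.var()

# Exponential model: log(s) is linear in i with slope -b
slope_exp, r2_exp = line_fit(i, np.log(s_demo))

# Powerlaw model: log(s) is linear in log(i) with slope -b
slope_pow, r2_pow = line_fit(np.log(i), np.log(s_demo))

print(slope_exp, r2_exp)
print(slope_pow, r2_pow)
```

Because these synthetic values are exactly exponential, the semi-log fit is perfect and recovers the slope `-b`, while the log-log fit is visibly worse.  For real singular values that are "multiphasic", neither fit should be expected to be perfect.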

For biplots, the singular values are partitioned between the left
and right singular vectors, so that the row scores are $US^\alpha$
and the column scores are $VS^{1-\alpha}$.  Setting alpha = 1 gives
a distance interpretation for rows (counties), alpha = 0 gives a
distance interpretation for columns (demographic categories), and
alpha = 0.5 does not have a strict distance interpretation.

In [None]:
alpha = 0.5
uu = np.dot(u, np.diag(s**alpha))
vv = np.dot(v, np.diag(s**(1-alpha)))
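The role of alpha can be checked numerically.  The sketch below uses a small random matrix (not the county data) and a helper `pairwise_dist` written only for this example.  It confirms that the product of the row and column scores reproduces the matrix for any alpha, and that the distance interpretations for alpha = 1 and alpha = 0 hold exactly when all components are retained.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))  # toy data matrix
u, s, vt = np.linalg.svd(x, full_matrices=False)
v = vt.T

def pairwise_dist(z):
    """Euclidean distances between all pairs of rows of z."""
    d = z[:, None, :] - z[None, :, :]
    return np.sqrt((d**2).sum(-1))

# For any alpha, the scores reconstruct the matrix: uu @ vv.T == x
recon_ok = True
for alpha in [0.0, 0.5, 1.0]:
    uu = u * s**alpha        # same as u @ diag(s**alpha)
    vv = v * s**(1 - alpha)  # same as v @ diag(s**(1 - alpha))
    recon_ok = recon_ok and np.allclose(uu @ vv.T, x)

# alpha = 1: distances among row scores equal distances among rows of x
rows_ok = np.allclose(pairwise_dist(u * s), pairwise_dist(x))

# alpha = 0: distances among column scores equal distances among columns of x
cols_ok = np.allclose(pairwise_dist(v * s), pairwise_dist(x.T))

print(recon_ok, rows_ok, cols_ok)
```

In a biplot only two components are shown at a time, so the plotted distances approximate, rather than equal, the distances in the full data.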

Specify some parameters for plotting.

In [None]:
colors = {"A": "purple", "B": "orange", "N": "lime", "W": "red"}
lt = {"F": "-", "M": ":"}
sym = {"H": "s", "N": "o"}
ages = range(0, 19)

In [None]:
def generate_biplot(uu, vv, sex, c, fips, j0=0, j1=1, highlight={}):
    """
    Produce a biplot of components 'j1' versus 'j0' (zero-based positions)
    based on the row scores in 'uu' and the column scores in 'vv'.  The column
    labels are in 'c', and 'sex' ("F" or "M") selects the demographic
    categories to plot and sets the plot title.  'fips' lists the county FIPS
    codes in row order.  The dictionary 'highlight' contains key/value pairs
    mapping FIPS codes to letters that are plotted to indicate the locations
    of specific counties.
    """

    # Map FIPS codes to row positions in the data
    fipsm = {v:i for i,v in enumerate(fips)}

    plt.figure(figsize=(10, 8))
    ax = plt.axes([0.1, 0.1, 0.76, 0.8])
    ax.grid(True)

    # Plot the counties as grey points
    plt.plot(uu[:, j0], uu[:, j1], 'o', color="grey", alpha=0.3)

    # Plot letters corresponding to the selected counties.
    for k,v in highlight.items():
        jj = fipsm[k]   
        plt.text(uu[jj, j0], uu[jj, j1], v, color="blue", size=20)
    
    # Plot the demographic categories as colored points, joined
    # by lines connecting the age groups in order.
    for race in ["A", "B", "N", "W"]:
        for eth in ["H", "N"]:
            la = "%s_%s_%s" % (race, eth, sex)
            ii = [i for (i,x) in enumerate(c) if x.startswith(la)]
            # Line/marker format ('fmt' avoids shadowing the global 'sym')
            fmt = "-o" if eth == "H" else "-s"
            ax.plot(vv[ii, j0], vv[ii, j1], fmt, color=colors[race], label=la, ms=5, mfc="none")
            ax.text(vv[ii[-1], j0], vv[ii[-1], j1], eth, ha="left", va="top", color=colors[race])

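    # Plot the total population as a passive variable.  This uses the
    # globals 's' and 'totpopx'.  When alpha = 0.5, uu.T @ uu = diag(s), so
    # 'px' holds the least-squares coefficients of totpopx on the row scores.
    px = np.linalg.solve(np.diag(s), np.dot(uu.T, totpopx))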
    pt = px[[j0, j1]]
    pt /= np.linalg.norm(pt)
    pt *= 1.5
    ax.annotate("Pop", xy=(0, 0), xytext=(pt[0], pt[1]), 
                arrowprops=dict(facecolor='black', arrowstyle="<-"))
            
    ax.set_xlabel("Component %d" % (j0+1), size=18)
    ax.set_ylabel("Component %d" % (j1+1), size=18)

    ha, lb = ax.get_legend_handles_labels()
    leg = plt.figlegend(ha, lb, loc="center right")
    leg.draw_frame(False)
    ax.set_title("Female" if sex == "F" else "Male")

    plt.show()
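The passive-variable step in `generate_biplot` can be understood as a regression of the passive variable on the row scores.  The sketch below, which uses random data and a hypothetical passive variable `y` rather than the county totals, checks that the `np.linalg.solve` expression agrees with an ordinary least-squares fit when alpha = 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 6))  # toy data matrix
y = rng.normal(size=50)       # hypothetical passive variable
u, s, vt = np.linalg.svd(x, full_matrices=False)

alpha = 0.5
uu = u * s**alpha  # row scores

# The expression used for the passive variable in generate_biplot
px = np.linalg.solve(np.diag(s), uu.T @ y)

# Ordinary least squares of y on the row scores
beta, *_ = np.linalg.lstsq(uu, y, rcond=None)

# With alpha = 0.5, uu.T @ uu = diag(s), so the two agree
same = np.allclose(px, beta)
print(same)

# For a general alpha, uu.T @ uu = diag(s**(2 * alpha)), so the
# least-squares coefficients would instead be:
px_general = (uu.T @ y) / s**(2 * alpha)
```

If a different alpha were used, the `solve` expression would no longer match the least-squares fit, and the `px_general` form would be the one that does.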

To reduce overplotting, produce separate biplots for females and for
males.

In [None]:
def make_biplots(j0, j1, highlight={}):
    c = demog.columns.to_list()
    for sex in ["F", "M"]:
        cx = [x.split("_") for x in c]
        ii = [i for (i,x) in enumerate(cx) if x[2] == sex]
        ii = np.asarray(ii, dtype=int)
        generate_biplot(uu, vv[ii, :], sex, [c[i] for i in ii], fips, j0=j0, j1=j1, highlight=highlight)

Annotate these counties in the biplots.

In [None]:
highlight = {"26163": "W", # Wayne County MI
             "06085": "S", # Santa Clara CA
             "25005": "B", # Bristol MA
             "17031": "C", # Cook IL
             "46103": "P", # Pennington SD
             "06037": "L", # Los Angeles, CA
            }

In [None]:
make_biplots(0, 1, highlight=highlight)

In [None]:
make_biplots(2, 3, highlight=highlight)