# Understanding variation in lifespans of notable people using the BHHT data

The BHHT (Brief History of Human Time) project provides a dataset about "notable people" based mainly on Wikipedia biography articles.

The analyses below focus on lifespans of the people in the BHHT dataset, aiming to understand how lifespans vary based on factors including era of birth, the geographic region where the person lived, and the person's sex.

This analysis uses survival analysis methods, allowing us to use information from still-living people.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pathlib import Path
from scipy.stats.distributions import chi2

Change the path below as needed to point to the directory containing the data file.

In [None]:
pa = Path("/home/kshedden/data/Teaching/bhht")

Load the dataset.  Use the latin-1 encoding since the file contains some data that is not valid UTF-8.  Passing `nrows=####` reduces the run time while developing (the call below reads 500,000 rows), but always remove it and use the complete data to get final results.

In [None]:
df = pd.read_csv(pa / Path("cross-verified-database.csv.gz"), encoding="latin-1", nrows=500000)

Create a lifespan variable (years of life).  It will be missing for people who are currently living.

In [None]:
df["lifespan"] = df["death"] - df["birth"]

Exclude people born before 1500, because there is too little data to gain a meaningful understanding of lifespans prior to this year.  We also exclude people born after 1970, as these people are not yet at risk for most causes of death that are associated with aging.

In [None]:
dx = df.query("1500 <= birth <= 1970")

Retain only variables to be analyzed below.

In [None]:
dx = dx[["birth", "lifespan", "gender", "un_region"]]
dx.head()

A small number of people have missing or "Other" gender, but this sample is too small to draw conclusions from, so we retain only the people recorded as female or male.

In [None]:
print(dx.gender.value_counts())
dx = dx.loc[dx["gender"].isin(["Female", "Male"]), :]

Censor lifespans at the last year when anyone died in the dataset.

In [None]:
censor_year = df["death"].max()
print("Censoring at %d" % censor_year)
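# Living people have no lifespan; use their age at the censoring year instead.
dx["clifespan"] = dx["lifespan"].fillna(censor_year - dx["birth"])
# Event indicator: 1 if the death was observed, 0 if the lifespan is censored.
dx["died"] = dx["lifespan"].notnull().astype(int)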
dx.head()
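
To make the censoring step concrete, here is a minimal sketch on a three-person toy table.  The birth and death years and the name `toy_censor_year` are made up for illustration and are not taken from the BHHT file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the third person has no recorded death year.
toy = pd.DataFrame({"birth": [1900, 1920, 1950],
                    "death": [1970, 2000, np.nan]})
toy_censor_year = 2021  # assumed censoring year, for illustration only

toy["lifespan"] = toy["death"] - toy["birth"]
# Observed lifespans are kept; missing ones become the age at the censoring year.
toy["clifespan"] = toy["lifespan"].fillna(toy_censor_year - toy["birth"])
# died = 1 for an observed death, 0 for a censored (still living) person.
toy["died"] = toy["lifespan"].notnull().astype(int)
print(toy)
```

The third person contributes the information "survived at least 71 years" rather than being discarded, which is what allows survival methods to use still-living people.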

Now we can drop all rows with missing data.

In [None]:
dx = dx.drop("lifespan", axis=1)
dx = dx.dropna()
dx.head()

Create a variable indicating the century in which a person was born.  The value 0 corresponds to people born in the 1500s, 1 to the 1600s, and so on.

In [None]:
dx["era"] = np.floor((dx["birth"] - 1500) / 100)
dx.head()

Another factor of interest is the region where the person lived, which has five levels coded as follows:

In [None]:
dx["un_region"].value_counts()

Two other factors of interest are gender and era (of birth).  Marginal distributions for these variables are as below:

In [None]:
dx["gender"].value_counts()

In [None]:
dx["era"].value_counts()

## Marginal survival functions

Below we plot the marginal survival functions for people born in each century.  These survival functions are estimated using the product-limit (Kaplan-Meier) method.  Note that the curve for people born in the 1900s suggests that about 10% of notable people live to be 100.  This is unlikely to be true, and may be an artifact of improvements in health within the 20th century: people born later in the century are (by construction) more likely to be censored, and also presumably have longer lifespans.  This produces dependent censoring, which biases the survival function estimates.

In [None]:
plt.figure(figsize=(8, 5))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.8])
plt.grid(True)
for k,g in dx.groupby("era"):
    sf = sm.SurvfuncRight(g.clifespan, g.died)
    la = "%.0f" % (1500 + k*100)
    plt.plot(sf.surv_times, sf.surv_prob, label=la)
plt.xlim(0, 100)
plt.xlabel("Age", size=15)
plt.ylabel("Proportion alive", size=15)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="right")
leg.draw_frame(False)
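
The product-limit estimate used above can be written out by hand in a few lines.  The following is a minimal sketch on made-up data, intended only to show the arithmetic; the function name `product_limit` is hypothetical and this is not the statsmodels implementation.

```python
import numpy as np

def product_limit(time, died):
    """Hand-rolled product-limit (Kaplan-Meier) estimate, for illustration."""
    time = np.asarray(time, dtype=float)
    died = np.asarray(died, dtype=int)
    # The estimate only changes at ages where at least one death occurs.
    event_times = np.unique(time[died == 1])
    surv = []
    s = 1.0
    for t in event_times:
        n_risk = np.sum(time >= t)                    # under observation at age t
        n_events = np.sum((time == t) & (died == 1))  # deaths at age t
        s *= 1 - n_events / n_risk
        surv.append(s)
    return event_times, np.array(surv)

# Hypothetical toy data: ages at death or censoring, and the event indicator.
toy_time = [50, 60, 60, 70, 80]
toy_died = [1, 1, 0, 1, 0]
km_times, km_surv = product_limit(toy_time, toy_died)
print(km_times, km_surv)
```

Censored people count in the risk set up to their censoring age but never count as deaths, which is how they influence the estimate.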

We can also estimate the marginal survival function for each gender and for each region.

In [None]:
plt.figure(figsize=(8, 5))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.8])
plt.grid(True)
for k,g in dx.groupby("gender"):
    sf = sm.SurvfuncRight(g.clifespan, g.died)
    plt.plot(sf.surv_times, sf.surv_prob, label=k)
plt.xlim(0, 100)
plt.xlabel("Age", size=15)
plt.ylabel("Proportion alive", size=15)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="right")
leg.draw_frame(False)

In [None]:
plt.figure(figsize=(8, 5))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.8])
plt.grid(True)
for k,g in dx.groupby("un_region"):
    sf = sm.SurvfuncRight(g.clifespan, g.died)
    plt.plot(sf.surv_times, sf.surv_prob, label=k)
plt.xlim(0, 100)
plt.xlabel("Age", size=15)
plt.ylabel("Proportion alive", size=15)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="right")
leg.draw_frame(False)

The results above may be heavily influenced by confounding.  The people from Oceania tended to live more recently, and people who lived more recently tend to have longer lifespans.  Conversely, many of the notable people from the 1500s through the 1700s are from Europe, and lifespans tended to be shorter in these historical eras.  Looking at one factor at a time, it is not clear whether the "driver" of lifespan variation is geography (where a person lived) or time (when a person lived).

## Marginal hazard functions

A very important concept in survival analysis is the hazard function.  In this case, since the times are discrete (ages are in whole years), we can estimate the hazard easily as the ratio of the number of events (deaths) to the number at risk, for each age.  These are plotted below, stratified by era.

In [None]:
plt.figure(figsize=(8, 5))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.8])
plt.grid(True)
for k,g in dx.groupby("era"):
    sf = sm.SurvfuncRight(g.clifespan, g.died)
    la = "%.0f" % (1500 + k*100)
    plt.plot(sf.surv_times, sf.n_events/sf.n_risk, label=la)
plt.xlabel("Age", size=15)
plt.ylabel("Hazard", size=15)
plt.xlim(0, 90)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="right")
leg.draw_frame(False)
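
As a minimal sketch of the discrete hazard calculation described above, the toy example below (made-up ages, not BHHT data) computes the number at risk and the number of deaths at each age, and then checks that the running product of one minus the hazard gives the product-limit survival function.

```python
import numpy as np

# Hypothetical toy data: ages at death or censoring, and the event indicator.
toy_time = np.array([50, 60, 60, 70, 80])
toy_died = np.array([1, 1, 0, 1, 0])

ages = np.unique(toy_time)
# Number of people still under observation at each age.
n_risk = np.array([np.sum(toy_time >= a) for a in ages])
# Number of observed deaths at each age.
n_events = np.array([np.sum((toy_time == a) & (toy_died == 1)) for a in ages])

# Discrete hazard: probability of dying at age a given survival to age a.
hazard = n_events / n_risk

# The survival function is the running product of (1 - hazard).
surv = np.cumprod(1 - hazard)
print(hazard, surv)
```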

Below we estimate and plot the (marginal) hazard function for each gender.

In [None]:
plt.figure(figsize=(8, 5))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.8])
plt.grid(True)
for k,g in dx.groupby("gender"):
    sf = sm.SurvfuncRight(g.clifespan, g.died)
    plt.plot(sf.surv_times, sf.n_events/sf.n_risk, label=k)
plt.xlabel("Age", size=15)
plt.ylabel("Hazard", size=15)
plt.xlim(0, 90)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="right")
leg.draw_frame(False)

## Proportional hazard modeling

Create a translated birth year setting year 1500 as year zero.  This makes it easier to interpret the proportional hazard regression models so that effects are relative to 1500 as a reference year.

In [None]:
dx["birth1500"] = dx["birth"] - 1500

Fit a proportional hazards regression model.

In [None]:
fml = "clifespan ~ birth1500 + gender + un_region"
m0 = sm.PHReg.from_formula(fml, dx, status="died")
r0 = m0.fit()
r0.summary()
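
The coefficients of a proportional hazards model are on the log hazard scale, so exponentiating them gives hazard ratios.  As a minimal sketch, the value `beta_birth` below is hypothetical (it is not taken from the fitted model) and shows how a per-year coefficient translates into a hazard ratio for two people born a century apart.

```python
import numpy as np

# Hypothetical coefficient, NOT a result from the model above:
# the change in log hazard per one-year increase in birth year.
beta_birth = -0.002

# Hazard ratio comparing two people born 100 years apart, other covariates equal.
hr_century = np.exp(beta_birth * 100)
print(hr_century)
```

A hazard ratio below 1 means that the later-born person has a lower hazard of death at every age, under the proportional hazards assumption.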

We use partial regression plots to visualize the fitted model.  The function implemented below plots the partial effect of birth year on the log hazard.

In [None]:
def plot_birthyear_partial(dx, rr):
    dp = dx.iloc[0:100, :].copy()
    dp["gender"] = "Female"
    # Region is arbitrary but must be fixed
    dp["un_region"] = "Africa"
    dp["birth"] = np.linspace(1500, 1970, 100)
    dp["birth1500"] = dp["birth"] - 1500
    lhr = rr.predict(exog=dp).predicted_values

    plt.clf()
    plt.grid(True)
    plt.plot(dp["birth"].values, lhr)
    plt.xlabel("Birth year", size=15)
    plt.ylabel("Contribution to the log hazard", size=15)

In the first model, there is a linear main effect for birth year.  The hazard of death is decreasing as year of birth increases.

In [None]:
plot_birthyear_partial(dx, r0)

Next we include a quadratic effect for birth year, to see if the relationship between year of birth and log mortality hazard might be curved (parabolic), holding all other variables fixed. 

In [None]:
fml = "clifespan ~ birth1500 + I(birth1500**2) + gender + un_region"
m1 = sm.PHReg.from_formula(fml, dx, status="died")
r1 = m1.fit()
r1.summary()

Since the quadratic term for year of birth is statistically significant, there is evidence for curvature in the relationship between birth year and mortality hazard.  However, this model could be mis-specified -- the true relationship might be non-quadratic.

In [None]:
plot_birthyear_partial(dx, r1)

Next we add a cubic term to the model.

In [None]:
fml = "clifespan ~ birth1500 + I(birth1500**2) + I(birth1500**3) + gender + un_region"
m2 = sm.PHReg.from_formula(fml, dx, status="died")
r2 = m2.fit()
r2.summary()

Based on this model, the mortality hazard was fairly constant until around 1800, then it began to drop.

In [None]:
plot_birthyear_partial(dx, r2)

High-order polynomials make poor basis functions.  A more effective approach is to use polynomial splines, which are piecewise polynomials.  Below we use a cubic spline basis with four degrees of freedom to capture the birth year effect.

In [None]:
fml = "clifespan ~ bs(birth1500, 4) + gender + un_region"
m3 = sm.PHReg.from_formula(fml, dx, status="died")
r3 = m3.fit()
r3.summary()

In [None]:
plot_birthyear_partial(dx, r3)

Above we found that year of birth is a strong predictor of the mortality hazard.  We also have a strong sex difference, with males having a much greater hazard.  Next we consider whether the year of birth effect differs by sex.

In [None]:
fml = "clifespan ~ bs(birth1500, 4) * gender + un_region"
m4 = sm.PHReg.from_formula(fml, dx, status="died")
r4 = m4.fit()
r4.summary()

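To assess whether the birth year x gender interaction is significant, we can use a likelihood ratio test.  The test statistic is twice the difference in log-likelihoods between the two nested models, and it is compared to a chi-square distribution whose degrees of freedom equal the number of parameters added by the interaction: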

In [None]:
stat = 2*(r4.llf - r3.llf)
dof = len(r4.params) - len(r3.params)
print("stat=", stat)
print("dof=", dof)
1 - chi2(dof).cdf(stat)
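
The same calculation can be wrapped in a small helper for reuse with other pairs of nested models.  This is a sketch; the function name `lr_test` and the log-likelihood values in the example call are made up for illustration.  It uses the chi-square survival function `chi2.sf`, which tends to be more accurate than `1 - cdf` when the p-value is very small.

```python
from scipy.stats import chi2

def lr_test(llf_full, llf_reduced, dof):
    """Likelihood ratio test for two nested models.

    llf_full and llf_reduced are the maximized log-likelihoods, and dof
    is the number of extra parameters in the full model.
    """
    stat = 2 * (llf_full - llf_reduced)
    pvalue = chi2.sf(stat, dof)
    return stat, pvalue

# Hypothetical log-likelihood values, for illustration only.
stat, pvalue = lr_test(-1000.0, -1005.0, 4)
print(stat, pvalue)
```

With fitted models this could be called as `lr_test(r4.llf, r3.llf, len(r4.params) - len(r3.params))`.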

To understand the meaning of the interaction, we plot below the sex-specific contributions of year of birth to the log hazard.

In [None]:
def plot_birthyear_partial_sexdiff(dx, rr):
    dp = dx.iloc[0:200, :].copy()
    dp["gender"] = np.concatenate((["Female"]*100, ["Male"]*100))
    dp["female"] = np.concatenate((np.ones(100), np.zeros(100)))
    dp["un_region"] = "Africa"
    b = np.linspace(1500, 1970, 100)
    dp["birth"] = np.concatenate((b, b))
    dp["birth1500"] = dp["birth"] - 1500
    lhr = rr.predict(exog=dp).predicted_values

    plt.figure(figsize=(8, 5))
    plt.clf()
    plt.axes([0.1, 0.1, 0.75, 0.8])
    plt.grid(True)
    plt.plot(dp.iloc[0:100, :]["birth"].values, lhr[0:100], label="Female")
    plt.plot(dp.iloc[100:200, :]["birth"].values, lhr[100:200], label="Male")
    ha, lb = plt.gca().get_legend_handles_labels()
    leg = plt.figlegend(ha, lb, loc="center right")
    leg.draw_frame(False)
    plt.xlabel("Birth year", size=15)
    plt.ylabel("Contribution to the log hazard", size=15)

In [None]:
plot_birthyear_partial_sexdiff(dx, r4)

Above we considered an interaction between a categorical variable (gender) and a quantitative variable (birth1500).  Since the quantitative variable was modeled with splines, this creates an interaction between the gender indicator and each basis function of birth1500.  A more parsimonious way to model interactions involving splines is to model the main effect with a spline, but use only linear terms for the interaction.  This technique is illustrated below.

In [None]:
dx["female"] = (dx["gender"] == "Female").astype(int)
fml = "clifespan ~ bs(birth1500, 4) + female + birth1500 : female + un_region"
m5 = sm.PHReg.from_formula(fml, dx, status="died")
r5 = m5.fit()
r5.summary()

In [None]:
plot_birthyear_partial_sexdiff(dx, r5)

## Baseline hazard functions

The estimated baseline cumulative hazard function reflects the age-specific hazard of death, controlling for all covariates in the model.  The cumulative hazard is easy to estimate but not straightforward to interpret.

In [None]:
ti, cumhaz, surv = r0.baseline_cumulative_hazard[0]

plt.clf()
plt.grid(True)
plt.plot(ti, cumhaz)
plt.xlim(0, 100)
plt.xlabel("Age", size=15)
plt.ylabel("Cumulative hazard", size=15)

Estimate and plot the baseline cumulative hazard function on the log scale.

In [None]:
ti, cumhaz, surv = r0.baseline_cumulative_hazard[0]

plt.clf()
plt.grid(True)
plt.plot(ti, np.log(np.clip(cumhaz, 1e-4, np.inf)))
plt.xlim(0, 100)
plt.xlabel("Age", size=15)
plt.ylabel("Log cumulative hazard", size=15)

Next we estimate and plot the baseline hazard function using numerical differentiation.

In [None]:
ti, chaz, surv = r0.baseline_cumulative_hazard[0]
haz = np.diff(chaz) / np.diff(ti)
shaz = sm.nonparametric.lowess(np.log(haz), ti[0:-1])

plt.clf()
plt.grid(True)
plt.plot(shaz[:, 0], shaz[:, 1])
plt.xlim(0, 100)
plt.xlabel("Age", size=15)
plt.ylabel("log hazard", size=15)
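
To see why differencing the cumulative hazard recovers the hazard, the sketch below applies the same finite-difference step to a cumulative hazard with a known derivative.  The Weibull form and the values of `shape` and `scale` are assumptions chosen for illustration and are not estimated from the BHHT data.

```python
import numpy as np

# Assumed toy cumulative hazard H(t) = (t / scale)**shape (Weibull form).
shape, scale = 3.0, 80.0
age = np.arange(1.0, 101.0)
cumhaz_toy = (age / scale) ** shape

# Finite differences approximate the hazard on each one-year interval.
haz_toy = np.diff(cumhaz_toy) / np.diff(age)

# Exact hazard h(t) = (shape / scale) * (t / scale)**(shape - 1),
# evaluated at the midpoint of each interval for comparison.
mid = (age[:-1] + age[1:]) / 2
haz_true = (shape / scale) * (mid / scale) ** (shape - 1)

max_rel_err = np.max(np.abs(haz_toy - haz_true) / haz_true)
print(max_rel_err)
```

With an estimated cumulative hazard the differences are much noisier than in this smooth example, which is why the code above smooths the log hazard with lowess before plotting.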

## Stratification

Fit a sex-stratified proportional hazards regression model.

In [None]:
fml = "clifespan ~ bs(birth, 4) + un_region"
m6 = sm.PHReg.from_formula(fml, dx, status="died", strata="gender")
r6 = m6.fit()
r6.summary()

Plot the estimated baseline hazard function for each sex:

In [None]:
bh = r6.baseline_cumulative_hazard
snames = m6.surv.stratum_names

plt.figure(figsize=(8, 5))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.8])
plt.grid(True)
for k in 0,1:
    ti = bh[k][0]
    chaz = bh[k][1]
    haz = np.diff(chaz) / np.diff(ti)
    shaz = sm.nonparametric.lowess(np.log(haz), ti[0:-1], frac=0.5)
    plt.plot(shaz[:,0], shaz[:,1], label=snames[k])
plt.xlabel("Age", size=15)
plt.ylabel("Log hazard", size=15)
plt.xlim(0, 100)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="right")
leg.draw_frame(False)