# Understanding variation in lifespans of notable people using the BHHT data

The BHHT (Brief History of Human Time) project provides a dataset about "notable people" based mainly on Wikipedia biography articles.

The analyses below focus on lifespans of the people in the BHHT dataset, aiming to understand how lifespans vary based on factors including era of birth, the geographic region where the person lived, and the person's sex and occupation.

This analysis uses survival analysis methods, allowing us to use information from still-living people.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pathlib import Path
from scipy.stats.distributions import chi2

Change the path below as needed to point to the directory holding the data file.

In [None]:
pa = Path("/home/kshedden/mynfs/data/Teaching/bhht")

Load the dataset.  Use the latin-1 encoding since there is some non-UTF data in the file.  The "nrows" argument below limits the number of rows read, which reduces the run time when developing; remove it to use the complete data for final results.

In [None]:
df = pd.read_csv(pa / Path("cross-verified-database.csv.gz"), encoding="latin-1", nrows=200000)

Create a lifespan variable (years of life).  It will be missing for people who are currently living.

In [None]:
df["lifespan"] = df["death"] - df["birth"]

Exclude people born before 1500, since there is too little data to gain a meaningful understanding of the trends in lifespan prior to this year.

In [None]:
dx = df.query("birth >= 1500")

Retain only variables to be analyzed below.

In [None]:
dx = dx[["birth", "lifespan", "gender", "un_region", "level1_main_occ"]]
dx.head()

A small number of people have missing or "Other" gender, but this sample is too small to draw conclusions from, so we retain only people coded as female or male.

In [None]:
print(dx.gender.value_counts())
dx = dx.loc[dx["gender"].isin(["Female", "Male"]), :]

Drop uninformative occupation codes.

In [None]:
dx = dx.loc[~dx["level1_main_occ"].isin(["Missing", "Other"]), :]

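Censor lifespans at 2022.  For people with no recorded lifespan (those who are still living), the observed time is their age in 2022 and the "died" indicator is 0; for everyone else the observed time is the lifespan and the indicator is 1.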

In [None]:
censor_year = 2022
dx["clifespan"] = dx["lifespan"].fillna(censor_year - dx["birth"])
dx["died"] = 1 - 1*dx["lifespan"].isnull()
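The censoring step can be illustrated on a few records without any libraries. This is a minimal sketch; the birth and death years below are invented for illustration and are not taken from the BHHT data.

```python
# Toy records: (birth year, death year or None if still living).
# These values are invented for illustration only.
records = [(1900, 1975), (1950, None), (1980, None), (1820, 1861)]
censor_year = 2022

rows = []
for birth, death in records:
    # Event indicator: 1 if the death was observed, 0 if censored.
    died = 0 if death is None else 1
    # Observed time: age at death if known, else age reached by censor_year.
    time = (death if died else censor_year) - birth
    rows.append((time, died))

print(rows)
```

A censored record still carries information: the person is known to have survived at least as long as the observed time.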

Now we can drop all rows with missing data.

In [None]:
dx = dx.drop("lifespan", axis=1)
dx = dx.dropna()
dx.head()

Create a variable indicating the century in which a person was born, coded as 0 for the 1500s, 1 for the 1600s, and so on.

In [None]:
dx["era"] = np.floor((dx["birth"] - 1500) / 100)

One of the factors we will consider as a predictor of lifespan is occupation, which has the levels and frequencies shown below:

In [None]:
dx["level1_main_occ"].value_counts()

Another factor of interest is the region where the person lived, which has five levels coded as follows:

In [None]:
dx["un_region"].value_counts()

Plot the marginal survival functions for people born in each century.  These survival functions are estimated using the product limit (Kaplan-Meier) method.

In [None]:
plt.figure(figsize=(8, 5))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.8])
plt.grid(True)
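for k, g in dx.groupby("era"):
    # Skip people born in 2000 or later (era 5); they are too young
    # for the survival function to be informative.
    if k == 5:
        continue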
    sf = sm.SurvfuncRight(g.clifespan, g.died)
    la = "%.0f" % (1500 + k*100)
    plt.plot(sf.surv_times, sf.surv_prob, label=la)
plt.xlabel("Age", size=15)
plt.ylabel("Proportion alive", size=15)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="right")
leg.draw_frame(False)
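To make the product-limit idea concrete, the sketch below computes a Kaplan-Meier estimate by hand on a tiny invented sample. The function name `kaplan_meier` and the data are hypothetical; this is not how `SurvfuncRight` is implemented internally, only an illustration of the same estimator.

```python
def kaplan_meier(times, events):
    """Return (event_times, survival_probs) for right-censored data."""
    # Only times at which at least one death occurs change the estimate.
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, s = [], 1.0
    for t in event_times:
        # Everyone whose observed time is at least t is still at risk at t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        # Multiply by the conditional probability of surviving past t.
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, surv

# Invented sample: observed times and event indicators (1 = died, 0 = censored).
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
km_times, km_surv = kaplan_meier(times, events)
print(list(zip(km_times, km_surv)))
```

Censored observations never trigger a drop in the curve, but they do count in the risk set up to the time they are censored.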

Create a translated birth year setting year 1500 as year zero.  This makes it easier to interpret the proportional hazard regression models so that effects are relative to 1500 as a reference year.

In [None]:
dx["birth1500"] = dx["birth"] - 1500

Fit a proportional hazards regression model.

In [None]:
fml = "clifespan ~ birth1500 + gender + level1_main_occ + un_region"
m0 = sm.PHReg.from_formula(fml, dx, status="died")
r0 = m0.fit()
r0.summary()
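Coefficients in a proportional hazards model are on the log hazard scale, so exponentiating a coefficient gives a hazard ratio. The sketch below uses a made-up coefficient for birth year (it is not a fitted value from this analysis) to show how a per-year effect translates to a per-century effect.

```python
import math

# Hypothetical log hazard ratio per one-year increase in birth year.
# This number is an assumption for illustration, not a fitted estimate.
coef_per_year = -0.002

# Hazard ratio for people born one year apart.
hr_per_year = math.exp(coef_per_year)

# Hazard ratio for people born 100 years apart: effects add on the
# log scale, so they multiply on the hazard scale.
hr_per_century = math.exp(100 * coef_per_year)

print(round(hr_per_year, 4), round(hr_per_century, 4))
```

A hazard ratio below 1 means that, at any given age, the later-born group has a lower hazard of death, holding the other covariates fixed.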

Plot the partial effect of birth year.

In [None]:
def plot_birthyear_partial(dx, rr):
    dp = dx.iloc[0:100, :].copy()
    dp["gender"] = "Female"
    # Occupation and region are arbitrary but must be fixed
    dp["level1_main_occ"] = "Leadership"
    dp["un_region"] = "Asia"
    dp["birth"] = np.linspace(1500, 2000, 100)
    dp["birth1500"] = dp["birth"] - 1500
    lhr = rr.predict(exog=dp).predicted_values

    plt.clf()
    plt.grid(True)
    plt.plot(dp["birth"], lhr)
    plt.xlabel("Birth year", size=15)
    plt.ylabel("Contribution to the log hazard", size=15)

In the first model, there is a linear main effect for birth year.  The hazard of death is decreasing as year of birth increases.

In [None]:
plot_birthyear_partial(dx, r0)

Next we include a quadratic effect for birth year, to see if the relationship between year of birth and log mortality hazard might be curved (parabolic), holding all other variables fixed. 

In [None]:
fml = "clifespan ~ birth1500 + I(birth1500**2) + gender + level1_main_occ + un_region"
m1 = sm.PHReg.from_formula(fml, dx, status="died")
r1 = m1.fit()
r1.summary()

Since the quadratic term for year of birth is statistically significant, there is evidence for curvature in the relationship between birth year and the log mortality hazard.  However, this model could be mis-specified -- the true relationship might be non-quadratic.

In [None]:
plot_birthyear_partial(dx, r1)

Next we add a cubic term to the model.

In [None]:
fml = "clifespan ~ birth1500 + I(birth1500**2) + I(birth1500**3) + gender + level1_main_occ + un_region"
m2 = sm.PHReg.from_formula(fml, dx, status="died")
r2 = m2.fit()
r2.summary()

Based on this model, the mortality hazard was fairly constant for people born before around 1800, and it then began to drop for people born later.

In [None]:
plot_birthyear_partial(dx, r2)

High-order polynomials make poor basis functions.  A more effective approach is to use polynomial splines, which are piecewise polynomials.  Below we use a cubic spline basis with four degrees of freedom to capture the birth year effect.

In [None]:
fml = "clifespan ~ bs(birth1500, 4) + gender + level1_main_occ + un_region"
m3 = sm.PHReg.from_formula(fml, dx, status="died")
r3 = m3.fit()
r3.summary()

In [None]:
plot_birthyear_partial(dx, r3)
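To show what a cubic spline basis looks like, the sketch below evaluates B-spline basis functions with the Cox-de Boor recursion in plain Python. The helper `bspline_basis` is hypothetical, and the single interior knot at 250 is an assumption made for illustration; the formula interface chooses its knots from the data, so the fitted basis will generally differ.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at x
    using the Cox-de Boor recursion."""
    # Degree-0 functions are indicators of the half-open knot intervals.
    b = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
         for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        nb = []
        for i in range(len(knots) - d - 1):
            left = right = 0.0
            den1 = knots[i + d] - knots[i]
            den2 = knots[i + d + 1] - knots[i + 1]
            # Terms with a zero-width knot span are defined to be zero.
            if den1 > 0:
                left = (x - knots[i]) / den1 * b[i]
            if den2 > 0:
                right = (knots[i + d + 1] - x) / den2 * b[i + 1]
            nb.append(left + right)
        b = nb
    return b

# Cubic basis on [0, 500) with one assumed interior knot at 250.
# Boundary knots are repeated degree + 1 times.
knots = [0, 0, 0, 0, 250, 500, 500, 500, 500]
basis = bspline_basis(100.0, knots, 3)
print(basis, sum(basis))
```

This knot vector yields five basis functions that sum to one at every point of [0, 500). A regression that already has an intercept or baseline drops one of them, which leaves four degrees of freedom.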

Above we found that year of birth is a strong predictor of the mortality hazard.  We also have a strong sex difference, with males having a much greater hazard.  Next we consider whether the year of birth effect differs by sex.

In [None]:
fml = "clifespan ~ bs(birth1500, 4) * gender + level1_main_occ + un_region"
m4 = sm.PHReg.from_formula(fml, dx, status="died")
r4 = m4.fit()
r4.summary()

To assess whether the birth year by gender interaction is significant, we can use a likelihood ratio test comparing the model with the interaction to the model without it:

In [None]:
stat = 2*(r4.llf - r3.llf)
dof = len(r4.params) - len(r3.params)
# Use the survival function (upper tail probability), which stays accurate for very small p-values
chi2.sf(stat, dof)
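The arithmetic of the likelihood ratio test can be checked by hand when the degrees of freedom are even, because the chi-square tail probability then has a closed form. In the sketch below, the log-likelihood values are invented and the helper `chi2_sf_even` is hypothetical; it is only meant to show what the test computes.

```python
import math

def chi2_sf_even(stat, dof):
    """Upper tail probability of a chi-square variable with even dof,
    using the closed form exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form requires positive even dof")
    half = stat / 2.0
    term, total = 1.0, 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Invented log-likelihoods for a smaller model and a larger nested model.
llf_small, llf_big = -1000.0, -994.0
lrt_stat = 2 * (llf_big - llf_small)  # twice the log-likelihood gain
lrt_dof = 4                           # assumed number of added parameters
p_value = chi2_sf_even(lrt_stat, lrt_dof)
print(lrt_stat, p_value)
```

A small p-value indicates that the added parameters improve the fit by more than would be expected by chance if the smaller model were correct.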

To understand the meaning of the interaction, we plot below the sex-specific contributions of year of birth to the log hazard.

In [None]:
def plot_birthyear_partial_sexdiff(dx, rr):
    dp = dx.iloc[0:200, :].copy()
    dp["gender"] = np.concatenate((["Female"]*100, ["Male"]*100))
    dp["level1_main_occ"] = "Leadership"
    dp["un_region"] = "Asia"
    b = np.linspace(1500, 2000, 100)
    dp["birth"] = np.concatenate((b, b))
    dp["birth1500"] = dp["birth"] - 1500
    lhr = rr.predict(exog=dp).predicted_values

    plt.figure(figsize=(8, 5))
    plt.clf()
    plt.axes([0.1, 0.1, 0.75, 0.8])
    plt.grid(True)
    plt.plot(dp.iloc[0:100, :]["birth"], lhr[0:100], label="Female")
    plt.plot(dp.iloc[100:200, :]["birth"], lhr[100:200], label="Male")
    ha, lb = plt.gca().get_legend_handles_labels()
    leg = plt.figlegend(ha, lb, loc="center right")
    leg.draw_frame(False)
    plt.xlabel("Birth year", size=15)
    plt.ylabel("Contribution to the log hazard", size=15)

In [None]:
plot_birthyear_partial_sexdiff(dx, r4)

## Baseline hazard functions

The estimated baseline cumulative hazard function reflects the age-specific hazard of death.  The cumulative hazard is easy to estimate but not straightforward to interpret.

In [None]:
ti, cumhaz, surv = r0.baseline_cumulative_hazard[0]

plt.clf()
plt.grid(True)
plt.plot(ti, cumhaz)
plt.xlabel("Age", size=15)
plt.ylabel("Cumulative hazard", size=15)
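One way to read a cumulative hazard is through the identity S(t) = exp(-H(t)), which links it to the survival function. The sketch below applies this identity to a few invented cumulative hazard values; the numbers are assumptions for illustration and are not estimates from the model above.

```python
import math

# Hypothetical cumulative hazard values at increasing ages.
toy_cumhaz = [0.0, 0.1, 0.5, 1.0, 3.0]

# Survival probability implied by each cumulative hazard value.
toy_surv = [math.exp(-h) for h in toy_cumhaz]

for h, s in zip(toy_cumhaz, toy_surv):
    print(f"H = {h:.1f}  ->  S = {s:.3f}")
```

For example, a cumulative hazard of 1 corresponds to a survival probability of about 0.37.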

Estimate and plot the baseline cumulative hazard function on the log scale.

In [None]:
ti, cumhaz, surv = r0.baseline_cumulative_hazard[0]

plt.clf()
plt.grid(True)
plt.plot(ti, np.log(np.clip(cumhaz, 1e-4, np.inf)))
plt.xlabel("Age", size=15)
plt.ylabel("Log cumulative hazard", size=15)

Next we estimate and plot the baseline hazard function using numerical differentiation of the cumulative hazard, followed by lowess smoothing of the log hazard.

In [None]:
ti, chaz, surv = r0.baseline_cumulative_hazard[0]
haz = np.diff(chaz) / np.diff(ti)
shaz = sm.nonparametric.lowess(np.log(haz), ti[0:-1])

plt.clf()
plt.grid(True)
plt.plot(shaz[:, 0], shaz[:, 1])
plt.xlabel("Age", size=15)
plt.ylabel("Log hazard", size=15)
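The numerical differentiation step can be sanity-checked on a case where the answer is known. The sketch below assumes a Gompertz hazard h(t) = a exp(bt) with made-up parameters, builds its exact cumulative hazard, and recovers the hazard with finite differences. It compares at interval midpoints, a small variation on the left endpoints used above.

```python
import math

# Hypothetical Gompertz parameters (not estimated from the data).
a, b = 1e-4, 0.08

ages = [float(t) for t in range(0, 101)]
# Exact cumulative hazard: the integral of a * exp(b * t) from 0 to t.
gomp_cumhaz = [a / b * (math.exp(b * t) - 1.0) for t in ages]

# Finite-difference estimate of the hazard on each age interval.
n = len(ages) - 1
fd_haz = [(gomp_cumhaz[i + 1] - gomp_cumhaz[i]) / (ages[i + 1] - ages[i])
          for i in range(n)]

# Compare with the true hazard at the interval midpoints.
mid = [(ages[i] + ages[i + 1]) / 2 for i in range(n)]
true_haz = [a * math.exp(b * t) for t in mid]
max_rel_err = max(abs(h - th) / th for h, th in zip(fd_haz, true_haz))
print(max_rel_err)
```

On the log scale a Gompertz hazard is a straight line with slope b, which is a useful reference shape when reading the log hazard plots in this section.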

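Fit a sex-stratified proportional hazards regression model.  Stratifying gives each sex its own baseline hazard function while the covariate effects are shared.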

In [None]:
fml = "clifespan ~ bs(birth1500, 4) + level1_main_occ + un_region"
m4 = sm.PHReg.from_formula(fml, dx, status="died", strata="gender")
r4 = m4.fit()
r4.summary()

Plot the baseline hazard function for each sex.

In [None]:
bh = r4.baseline_cumulative_hazard
snames = m4.surv.stratum_names

plt.figure(figsize=(8, 5))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.8])
plt.grid(True)
for k in (0, 1):
    ti = bh[k][0]
    chaz = bh[k][1]
    haz = np.diff(chaz) / np.diff(ti)
    shaz = sm.nonparametric.lowess(np.log(haz), ti[0:-1])
    plt.plot(shaz[:,0], shaz[:,1], label=snames[k])
plt.xlabel("Age", size=15)
plt.ylabel("Log hazard", size=15)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="right")
leg.draw_frame(False)