# Blood pressure in US adults - analysis using dimension reduction regression

This notebook demonstrates the use of dimension reduction regression to understand the predictors of adult systolic blood pressure in the NHANES data.

In [None]:
from statsmodels.regression.dimred import SIR
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from statsmodels.nonparametric.smoothers_lowess import lowess
from read import *

We will use 10 demographic and anthropometric predictors of systolic blood pressure (SBP).

In [None]:
vx = ["RIAGENDR", "RIDAGEYR", "BMXWT", "BMXHT", "BMXBMI", "BMXLEG",
      "BMXARML", "BMXARMC", "BMXWAIST", "BMXHIP"]
vn = ["BPXSY1"] + vx

In [None]:
dx = df.loc[:, vn].dropna()
dx.head()

The SIR code below requires numerically coded variables, so we recode sex as +1 for females and -1 for males.

In [None]:
# map returns a plain integer column (F -> 1, M -> -1)
dx["RIAGENDRx"] = dx.RIAGENDR.map({"F": 1, "M": -1})

Dimension reduction regression focuses on explaining deviations from the mean, so we mean center the quantitative variables here.  Only the `float64` columns are centered, so the coded sex variable is left unchanged.

In [None]:
for m in dx.columns:
    if dx[m].dtype == np.float64:
        dx[m] -= dx[m].mean()

Next we fit a dimension reduction (DR) regression model using sliced inverse regression (SIR).

In [None]:
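y = np.asarray(dx["BPXSY1"])
# Use the numerically coded sex variable in place of the string-valued one
vz = [x.replace("RIAGENDR", "RIAGENDRx") for x in vx]
X = np.asarray(dx[vz], dtype=np.float64)
m = SIR(y, X)
r = m.fit()
# Each column of r.params is an estimated direction; project X onto all of them
scores = np.dot(X, r.params)
print(scores.shape)
# Coefficients of the first two directions
r.params[:, 0:2]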
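
For intuition about what SIR estimates, the following is a minimal NumPy sketch of the textbook algorithm applied to synthetic data: whiten the predictors, sort the observations by the response, average the whitened predictors within slices, and take the leading eigenvectors of the weighted covariance of the slice means. The function name `sir_directions`, the slice count, and the simulated data are illustrative assumptions; the statsmodels implementation may differ in its details.

```python
import numpy as np

def sir_directions(y, X, n_slices=20):
    """Textbook SIR sketch (hypothetical helper, not the statsmodels code).

    Returns eigenvalues and directions, sorted by decreasing eigenvalue.
    """
    n, p = X.shape
    # Sort rows by the response and center the predictors
    order = np.argsort(y)
    Xc = X[order] - X.mean(axis=0)
    # Whiten: cov = L L^T, so Z = Xc L^{-T} has identity covariance
    cov = Xc.T @ Xc / n
    L = np.linalg.cholesky(cov)
    Z = np.linalg.solve(L, Xc.T).T
    # Means of the whitened predictors within slices of the sorted response
    slices = np.array_split(Z, n_slices)
    means = np.array([s.mean(axis=0) for s in slices])
    weights = np.array([len(s) for s in slices]) / n
    # Weighted covariance of the slice means
    M = (means * weights[:, None]).T @ means
    evals, evecs = np.linalg.eigh(M)
    idx = np.argsort(evals)[::-1]
    evals, evecs = evals[idx], evecs[:, idx]
    # Map the directions back to the original predictor coordinates
    dirs = np.linalg.solve(L.T, evecs)
    return evals, dirs

# Synthetic single-index data: y depends on X only through X @ b
rng = np.random.default_rng(0)
n, p = 2000, 5
X_sim = rng.normal(size=(n, p))
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y_sim = np.exp(X_sim @ b) + 0.1 * rng.normal(size=n)

evals, dirs = sir_directions(y_sim, X_sim)
# Cosine of the angle between the leading estimated direction and b
cos = abs(dirs[:, 0] @ b) / np.linalg.norm(dirs[:, 0])
print("leading eigenvalues:", evals[:2])
print("|cosine| with true direction:", cos)
```

In this simulation only one direction matters, so the first eigenvalue should dominate the rest. In the NHANES analysis the first two directions are retained because both turn out to be interpretable.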

To understand the fitted DR regression model, we stratify on the j'th score, then plot the mean of SBP with respect to the k'th score.

In [None]:
def plotstrat(j, k, scores):
    """Stratify on quintiles of score j, then plot smoothed SBP against score k."""

    dp = pd.DataFrame({"x": scores[:, k], "y": dx.BPXSY1})
    dp["strat"] = pd.qcut(scores[:, j], 5)

    # Calling plt.clf() before plt.figure() leaves a stray empty figure, so it is omitted
    plt.figure(figsize=(7, 4))
    plt.axes([0.12, 0.12, 0.65, 0.8])
    plt.grid(True)
    for ky, dv in dp.groupby("strat", observed=True):
        # Evaluate the lowess fit on a regular grid within this stratum
        xx = np.linspace(dv.x.min(), dv.x.max(), 100)
        lw = lowess(dv.y, dv.x)
        f = interp1d(lw[:, 0], lw[:, 1])
        la = "%.2f-%.2f" % (ky.left, ky.right)
        plt.plot(xx, f(xx), "-", label=la)

    ha, lb = plt.gca().get_legend_handles_labels()
    leg = plt.figlegend(ha, lb, loc="center right")
    leg.draw_frame(False)
    leg.set_title("Score %d" % (j + 1))

    plt.xlabel("Score %d" % (k + 1), size=15)
    plt.ylabel("SBP (centered)", size=15)
    plt.show()

We can stratify on score 2 then plot against score 1, and then we can stratify on score 1 and plot against score 2. 

From the first plot below (stratifying on score 2), we see that expected blood pressure is increasing in score 1 for every fixed value of score 2.  However, the rates of increase differ.  People with greater values of score 2 have a steeper increase of expected SBP with respect to score 1.

From the second plot below (stratifying on score 1), we see that expected SBP can be either increasing or decreasing with respect to score 2, depending on the value of score 1.  For large values of score 1, expected SBP is increasing in score 2, while for small (negative) values of score 1, expected SBP is decreasing in score 2.

In [None]:
plotstrat(1, 0, scores)
plotstrat(0, 1, scores)

In [None]:
cols = {"F": "orange", "M": "purple"}

To understand what the scores mean, we can plot each score against each covariate.

In [None]:
for j in range(2):
    for x in dx.columns:
        if x in ["RIAGENDR", "RIAGENDRx"]:
            continue
        plt.figure(figsize=(7, 5))
        plt.grid(True)
        sname = "Score %d" % (j + 1)
        # SBP is the response, so it goes on the vertical axis;
        # every other variable goes on the horizontal axis
        if x != "BPXSY1":
            xlab, ylab, xv, yv = x, sname, dx[x], scores[:, j]
        else:
            xlab, ylab, xv, yv = sname, x, scores[:, j], dx[x]
        plt.xlabel(xlab, size=15)
        plt.ylabel(ylab, size=15)
        dp = pd.DataFrame({"x": np.asarray(xv), "y": np.asarray(yv),
                           "sex": np.asarray(dx["RIAGENDR"])})
        for sex in "F", "M":
            dz = dp.loc[dp.sex == sex, :]
            lw = lowess(dz["y"], dz["x"])
            plt.plot(dz["x"], dz["y"], "o", mfc="none", alpha=0.2, color=cols[sex],
                     label=sex, rasterized=True)
            plt.plot(lw[:, 0], lw[:, 1], "-", color=cols[sex])
        ha, lb = plt.gca().get_legend_handles_labels()
        leg = plt.figlegend(ha, lb, loc="center right")
        leg.draw_frame(False)


Here is one way of understanding the results above:

Score 1 appears to be "sex adjusted age".  The relationship between score 1 and age is nearly perfectly linear, and there is a difference in intercept but not in the slope based on sex.  Score 1 increases linearly in age, at around 0.5 points per decade of life, and males are about 0.5 units greater than females at the same age. 

Score 1 also has associations with other variables, but the interpretation of score 1 as "sex adjusted age" seems most straightforward, as there is much more scatter in the other relationships.  Score 1 associates positively with SBP, so arguably score 1 captures much of the role of sex and age in relation to blood pressure.
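
One way to check the "sex adjusted age" reading numerically is to regress a score on age, sex, and their interaction: a clear age slope, a nonzero sex offset, and an interaction near zero would match the description above. The sketch below uses a hypothetical helper, `fit_age_sex`, and synthetic data whose slope (0.05 per year) and offset (0.5) are chosen only to mirror the description; they are not estimates from NHANES.

```python
import numpy as np

def fit_age_sex(score, age, sex):
    """Least squares fit of score ~ 1 + age + male + age:male (hypothetical helper)."""
    age = np.asarray(age, dtype=float)
    male = (np.asarray(sex) == "M").astype(float)
    D = np.column_stack([np.ones_like(age), age, male, age * male])
    coef, *_ = np.linalg.lstsq(D, np.asarray(score, dtype=float), rcond=None)
    return dict(zip(["intercept", "age", "male", "age:male"], coef))

# Synthetic stand-ins for centered age, sex, and score 1
rng = np.random.default_rng(0)
n = 2000
age_sim = rng.uniform(18, 80, size=n)
age_sim -= age_sim.mean()
sex_sim = rng.choice(["F", "M"], size=n)
score_sim = 0.05 * age_sim + 0.5 * (sex_sim == "M") + rng.normal(scale=0.1, size=n)

fit = fit_age_sex(score_sim, age_sim, sex_sim)
for name, value in fit.items():
    print("%-10s %8.4f" % (name, value))
```

With the objects defined in this notebook, the corresponding call would be `fit_age_sex(scores[:, 0], dx["RIDAGEYR"], dx["RIAGENDR"])`.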

Score 2 also plays an important role, in that it moderates the relationship between SBP and score 1.  People with a greater positive value of score 2 will have a steeper slope between SBP and score 1, while people with a negative value of score 2 will have a weaker (but still positive) relationship between SBP and score 1.
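
To make the idea of moderation concrete, here is a small simulation in which the slope of the response on one score depends on a second score, while the second score has no effect of its own. The mean structure `(5 + 3 * s2) * s1` and all names are illustrative assumptions, not quantities estimated from the NHANES data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
s1 = rng.normal(size=n)  # stand-in for score 1
s2 = rng.normal(size=n)  # stand-in for score 2
# Hypothetical mean structure: the slope on s1 is 5 + 3 * s2
y_sim = (5 + 3 * s2) * s1 + rng.normal(scale=2, size=n)

# Assign each observation to a quintile of s2 (labels 0 through 4)
edges = np.quantile(s2, [0.2, 0.4, 0.6, 0.8])
stratum = np.digitize(s2, edges)

# Fit a separate straight line of y on s1 within each stratum of s2
slopes = []
for k in range(5):
    ii = stratum == k
    slope, intercept = np.polyfit(s1[ii], y_sim[ii], 1)
    slopes.append(slope)
    print("s2 quintile %d: slope of y on s1 = %.2f" % (k + 1, slope))
```

The within-stratum slopes are all positive and grow with the quintile of `s2`, which is the same qualitative pattern seen when SBP is plotted against score 1 within strata of score 2.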

Score 2 is related to several of the anthropometric measures, but is most strongly related to BMI and arm circumference.  People with greater BMI and arm circumference have lower values of score 2, which, as noted above, moderates the relationship between SBP and score 1 (age/sex).  People with greater body fat have greater SBP at earlier ages, but their SBP increases more slowly with age.

Also of note is that score 2 on its own has minimal association with SBP and only a modest association with age.