# Influence of salt intake behavior on blood pressure
## STATS 506 group project
### Author: Xinjun Li

This report was generated with Jupyter Notebook for the STATS 506 group project at the University of Michigan.

Our project aims to answer the following question:

>**Is salt intake associated with blood pressure?**
**If so, to what extent is that relationship mediated or moderated by age or waist size?**

We use [NHANES](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx) data for the analysis.

Required software and packages to run the code are as follows:
* Python 3
* os
* pandas
* numpy
* scipy
* statsmodels
* patsy
* matplotlib

In [29]:
# Import packages
import os
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import patsy
import statsmodels.api as sm
from statsmodels.stats.mediation import Mediation
import matplotlib.pyplot as plt

In [76]:
# Set working directory
os.chdir('D:/学习/密歇根/STAT506/Group project/stats506/')

# read data
demo_ = pd.read_excel("RawData/Demographics_15_16.xlsx")
BMI_ = pd.read_excel("RawData/Body_measures_2015_16.xlsx")
bp_ = pd.read_excel('RawData/Blood_Pressure_2015_16.xlsx')
nutr_ = pd.read_excel("RawData/Dietary_nutrients_firstday_2015_16.xlsx")

In [92]:
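# Work on copies so the raw tables can be reused without re-reading the files
demo=demo_.copy()
BMI=BMI_.copy()
bp=bp_.copy()
nutr=nutr_.copy()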

### Data cleaning

We first need to clean the raw data and join the different data sets.

In [93]:
# select useful columns
demo=demo.set_index('SEQN'
                    ).filter(items=['RIDAGEYR']
                    ).dropna()


BMI=BMI.set_index('SEQN'
                  ).filter(items=['BMXWAIST']
                  ).dropna()


nutr1=nutr.set_index('SEQN'
                    ).filter(items=['DBD100']
                    ).dropna(
                    )


# Calculate the mean systolic and diastolic blood pressure over the repeated readings
bp=bp.set_index('SEQN'
                ).filter(regex='^(BPXSY|BPXDI)')
bp=bp.assign(SY=bp.filter(regex='^BPXSY').mean(axis=1, skipna = True),
             DI=bp.filter(regex='^BPXDI').mean(axis=1, skipna = True)
             ).filter(items=['SY','DI']).dropna()


# Merge all data sets on the respondent ID (SEQN)
df1=bp.join(demo,how='inner').join(BMI,how='inner').join(nutr1,how='inner')
# Center every variable at its sample mean
df1=df1-df1.mean()
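
The averaging and centering steps above can be illustrated on a toy table. The following is a minimal sketch with made-up readings (the values and the three-row frame are hypothetical, not NHANES data). It shows that `mean(axis=1, skipna=True)` averages whatever readings are present in a row, and that subtracting the mean leaves a column centered at zero.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: two systolic columns, one value missing per row at most
toy = pd.DataFrame(
    {"BPXSY1": [120.0, 130.0, np.nan], "BPXSY2": [124.0, np.nan, 110.0]},
    index=pd.Index([1, 2, 3], name="SEQN"),
)

# Row-wise mean over the available readings (NaN values are skipped)
toy_sy = toy.filter(regex="^BPXSY").mean(axis=1, skipna=True)
print(toy_sy.tolist())  # [122.0, 130.0, 110.0]

# Center the series at its mean, as done for df1 above
centered = toy_sy - toy_sy.mean()
print(abs(centered.mean()) < 1e-9)
```

A row is dropped by the later `dropna()` only when all of its readings are missing, because a row with at least one reading still has a defined mean.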

Take a look at the data we are about to work on.

In [94]:
# Show data summary of numeric variables
print(df1.describe())

                 SY            DI      RIDAGEYR      BMXWAIST        DBD100
count  4.682000e+03  4.682000e+03  4.682000e+03  4.682000e+03  4.682000e+03
mean  -2.212668e-14  3.851378e-14  6.501421e-15 -2.538316e-13  4.044418e-16
std    1.730776e+01  1.337119e+01  2.160344e+01  1.901313e+01  8.491468e-01
min   -4.551360e+01 -6.605639e+01 -3.012281e+01 -4.731115e+01 -6.749252e-01
25%   -1.218026e+01 -7.389719e+00 -2.012281e+01 -1.391115e+01 -6.749252e-01
50%   -2.846932e+00  6.102805e-01 -3.122811e+00 -6.111491e-01 -6.749252e-01
75%    8.486402e+00  8.610281e+00  1.787719e+01  1.218885e+01  3.250748e-01
max    8.715307e+01  5.794361e+01  4.187719e+01  7.798885e+01  7.325075e+00
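
In the summary above, the centered `DBD100` ranges from about -0.67 to about 7.33, a span of 8, so the raw variable runs from 1 to 9. NHANES questionnaire items commonly reserve codes such as 7 and 9 for "Refused" and "Don't know". Assuming that applies to `DBD100` (an assumption to check against the NHANES codebook), those rows would need to be treated as missing before centering and model fitting. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical raw salt-use responses; 7 and 9 are ASSUMED to be
# non-substantive codes ("Refused" / "Don't know")
raw = pd.DataFrame(
    {"DBD100": [1, 2, 3, 9, 1, 7]},
    index=pd.Index([11, 12, 13, 14, 15, 16], name="SEQN"),
)

valid_codes = [1, 2, 3]  # assumed substantive answer categories
clean = raw[raw["DBD100"].isin(valid_codes)]

print(clean["DBD100"].tolist())  # [1, 2, 3, 1]
```

If the assumption holds, leaving such codes in would treat "Don't know" as the highest level of salt use, which could distort the estimated slopes.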


### Fit OLS

Now fit ordinary least squares (OLS) models to the data.
The models used are:
`DI ~ DBD100` and `SY ~ DBD100`
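
Because every column of `df1` was centered, the fitted intercept in a one-predictor model should be zero up to rounding error, and the slope reduces to the sample covariance divided by the variance of the predictor. The sketch below checks this with plain NumPy rather than statsmodels, on simulated data (the data and the names `x`, `y` are illustrative, not from NHANES).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.9 * x + rng.normal(size=200)

# Center both variables
xc = x - x.mean()
yc = y - y.mean()

# OLS with intercept via least squares on the design matrix [1, x]
X = np.column_stack([np.ones_like(xc), xc])
intercept, slope = np.linalg.lstsq(X, yc, rcond=None)[0]

# Closed form for the slope: sum(x*y) / sum(x*x) on centered data
slope_closed = (xc @ yc) / (xc @ xc)

print(abs(intercept) < 1e-10)
print(np.isclose(slope, slope_closed))
```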

#### Diastolic result

In [95]:
# fit ols to Diastolic measurements
ols_DI=ols('DI~DBD100',data=df1).fit()
# Print the summary
print(ols_DI.summary())

                            OLS Regression Results                            
Dep. Variable:                     DI   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     14.87
Date:                Thu, 12 Dec 2019   Prob (F-statistic):           0.000117
Time:                        02:02:32   Log-Likelihood:                -18776.
No. Observations:                4682   AIC:                         3.756e+04
Df Residuals:                    4680   BIC:                         3.757e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3.759e-14      0.195   1.93e-13      1.0

#### Systolic result

In [55]:
# fit ols to Systolic measurements
ols_SY=ols('SY~DBD100',data=df1).fit()
# Print the summary
print(ols_SY.summary())

                            OLS Regression Results                            
Dep. Variable:                     SY   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     14.68
Date:                Thu, 12 Dec 2019   Prob (F-statistic):           0.000129
Time:                        01:37:13   Log-Likelihood:                -19985.
No. Observations:                4682   AIC:                         3.997e+04
Df Residuals:                    4680   BIC:                         3.999e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    119.4778      0.253    472.714      0.0

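From the above results, both models are significant at the 5% level (F-test p-values of 0.000117 and 0.000129). Because each model has a single predictor, this is equivalent to the coefficient of salt intake being significant.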
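
In a regression with a single predictor, the overall F-statistic equals the square of the slope's t-statistic, so the model-level p-value and the slope's p-value coincide. A small NumPy sketch on simulated data (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Residual variance with n - 2 degrees of freedom
sigma2 = resid @ resid / (n - 2)

# t-statistic of the slope
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_slope = beta[1] / np.sqrt(cov_beta[1, 1])

# F-statistic of the model: explained mean square / residual mean square
ss_total = np.sum((y - y.mean()) ** 2)
ss_model = ss_total - resid @ resid
f_stat = (ss_model / 1) / sigma2

print(np.isclose(f_stat, t_slope ** 2))
```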

### Moderation effect of waist size


First, add two columns that shift waist size up and down by one standard deviation.

In [96]:
df1['waist_sd'] = df1['BMXWAIST'].std()
df1['waist_up']=df1['BMXWAIST']+df1['waist_sd']
df1['waist_down']=df1['BMXWAIST']-df1['waist_sd']
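
Shifting the moderator does not change the fit. It only changes which waist value the `DBD100` coefficient refers to. With `y = b0 + b1*x + b2*w + b3*x*w`, replacing `w` by `w + s` gives an `x` coefficient of `b1 - s*b3`, which is the slope of `x` at `w = -s`. A model fitted with `waist_up` therefore reports the salt-intake slope at one standard deviation below the mean waist size, and a model fitted with `waist_down` reports it at one standard deviation above. The sketch below checks this identity with NumPy on simulated data (variable names and coefficients are illustrative, and `fit_interaction` is a hypothetical helper).

```python
import numpy as np

def fit_interaction(x, w, y):
    """OLS fit of y ~ 1 + x + w + x*w; returns [b0, b1, b2, b3]."""
    X = np.column_stack([np.ones_like(x), x, w, x * w])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
w = rng.normal(scale=2.0, size=n)
y = 1.0 + 0.5 * x + 0.3 * w + 0.4 * x * w + rng.normal(size=n)

s = w.std(ddof=1)                      # one sample standard deviation
b = fit_interaction(x, w, y)           # original moderator
b_up = fit_interaction(x, w + s, y)    # moderator shifted up by s

# Coefficient of x after the shift equals the slope of x at w = -s
print(np.isclose(b_up[1], b[1] - s * b[3]))
# The interaction coefficient is unchanged by the shift
print(np.isclose(b_up[3], b[3]))
```

This is also why shifted and unshifted fits share the same R-squared, F-statistic, and log-likelihood: they are reparametrizations of one model.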

#### Diastolic result

Fit model: `DI ~ DBD100 + BMXWAIST + DBD100 * BMXWAIST`

In [97]:
moderation_DI = ols('DI ~ DBD100 + BMXWAIST + DBD100 * BMXWAIST', data=df1).fit()
print(moderation_DI.summary())

                            OLS Regression Results                            
Dep. Variable:                     DI   R-squared:                       0.090
Model:                            OLS   Adj. R-squared:                  0.090
Method:                 Least Squares   F-statistic:                     154.6
Date:                Thu, 12 Dec 2019   Prob (F-statistic):           1.65e-95
Time:                        02:14:56   Log-Likelihood:                -18563.
No. Observations:                4682   AIC:                         3.713e+04
Df Residuals:                    4678   BIC:                         3.716e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           0.0101      0.187     

In [98]:
# waist shifted up by one SD, so the DBD100 coefficient is the slope at one SD below the mean waist size
moderation_DI_up = ols('DI ~ DBD100 + waist_up + DBD100 * waist_up', data=df1).fit()
print(moderation_DI_up.summary())

                            OLS Regression Results                            
Dep. Variable:                     DI   R-squared:                       0.090
Model:                            OLS   Adj. R-squared:                  0.090
Method:                 Least Squares   F-statistic:                     154.6
Date:                Thu, 12 Dec 2019   Prob (F-statistic):           1.65e-95
Time:                        02:15:04   Log-Likelihood:                -18563.
No. Observations:                4682   AIC:                         3.713e+04
Df Residuals:                    4678   BIC:                         3.716e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          -3.9335      0.265    -

In [99]:
# waist shifted down by one SD, so the DBD100 coefficient is the slope at one SD above the mean waist size
moderation_DI_down = ols('DI ~ DBD100 + waist_down + DBD100 * waist_down', data=df1).fit()
print(moderation_DI_down.summary())

                            OLS Regression Results                            
Dep. Variable:                     DI   R-squared:                       0.090
Model:                            OLS   Adj. R-squared:                  0.090
Method:                 Least Squares   F-statistic:                     154.6
Date:                Thu, 12 Dec 2019   Prob (F-statistic):           1.65e-95
Time:                        02:15:11   Log-Likelihood:                -18563.
No. Observations:                4682   AIC:                         3.713e+04
Df Residuals:                    4678   BIC:                         3.716e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             3.9537      0.26

In the models above, neither the interaction term nor the salt intake coefficient is significant at the 5% level.
There is no evidence that waist size moderates the relationship between salt intake and diastolic blood pressure.

#### Systolic result

Fit model: `SY ~ DBD100 + BMXWAIST + DBD100 * BMXWAIST`

In [100]:
moderation_SY = ols('SY ~ DBD100 + BMXWAIST + DBD100 * BMXWAIST', data=df1).fit()
print(moderation_SY.summary())

                            OLS Regression Results                            
Dep. Variable:                     SY   R-squared:                       0.177
Model:                            OLS   Adj. R-squared:                  0.176
Method:                 Least Squares   F-statistic:                     334.6
Date:                Thu, 12 Dec 2019   Prob (F-statistic):          7.40e-197
Time:                        02:15:37   Log-Likelihood:                -19537.
No. Observations:                4682   AIC:                         3.908e+04
Df Residuals:                    4678   BIC:                         3.911e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          -0.0099      0.230     

In [101]:
# waist shifted up by one SD, so the DBD100 coefficient is the slope at one SD below the mean waist size
moderation_SY_up = ols('SY ~ DBD100 + waist_up + DBD100 * waist_up', data=df1).fit()
print(moderation_SY_up.summary())

                            OLS Regression Results                            
Dep. Variable:                     SY   R-squared:                       0.177
Model:                            OLS   Adj. R-squared:                  0.176
Method:                 Least Squares   F-statistic:                     334.6
Date:                Thu, 12 Dec 2019   Prob (F-statistic):          7.40e-197
Time:                        02:15:58   Log-Likelihood:                -19537.
No. Observations:                4682   AIC:                         3.908e+04
Df Residuals:                    4678   BIC:                         3.911e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          -7.2380      0.326    -

In [102]:
# waist shifted down by one SD, so the DBD100 coefficient is the slope at one SD above the mean waist size
moderation_SY_down = ols('SY ~ DBD100 + waist_down + DBD100 * waist_down', data=df1).fit()
print(moderation_SY_down.summary())

                            OLS Regression Results                            
Dep. Variable:                     SY   R-squared:                       0.177
Model:                            OLS   Adj. R-squared:                  0.176
Method:                 Least Squares   F-statistic:                     334.6
Date:                Thu, 12 Dec 2019   Prob (F-statistic):          7.40e-197
Time:                        02:16:32   Log-Likelihood:                -19537.
No. Observations:                4682   AIC:                         3.908e+04
Df Residuals:                    4678   BIC:                         3.911e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             7.2183      0.32

In the models above, neither the interaction term nor the salt intake coefficient is significant at the 5% level.
There is no evidence that waist size moderates the relationship between salt intake and systolic blood pressure.

### Mediation effect of age

Fit model: `RIDAGEYR ~ DBD100` to see whether there is a relationship between age and salt intake.

In [103]:
# test if there is relationship between age and salt intake.
age_D = ols('RIDAGEYR ~ DBD100', data=df1).fit()
print(age_D.summary())

                            OLS Regression Results                            
Dep. Variable:               RIDAGEYR   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     38.72
Date:                Thu, 12 Dec 2019   Prob (F-statistic):           5.31e-10
Time:                        02:20:09   Log-Likelihood:                -21011.
No. Observations:                4682   AIC:                         4.203e+04
Df Residuals:                    4680   BIC:                         4.204e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -8.743e-16      0.314  -2.78e-15      1.0

The model is significant, so age and salt intake behavior are related.

#### Diastolic result

Fit model: `DI ~ DBD100 + RIDAGEYR`.

In [104]:
mediation_DI = ols('DI ~ DBD100 + RIDAGEYR', data=df1).fit()
print(mediation_DI.summary())

                            OLS Regression Results                            
Dep. Variable:                     DI   R-squared:                       0.072
Model:                            OLS   Adj. R-squared:                  0.072
Method:                 Least Squares   F-statistic:                     182.8
Date:                Thu, 12 Dec 2019   Prob (F-statistic):           3.64e-77
Time:                        02:20:17   Log-Likelihood:                -18608.
No. Observations:                4682   AIC:                         3.722e+04
Df Residuals:                    4679   BIC:                         3.724e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3.759e-14      0.188      2e-13      1.0

The model is significant, so age might mediate the relationship between salt intake and diastolic blood pressure.

In [105]:
# outcome model and mediator model
med_model_DI=sm.OLS.from_formula('DI ~ RIDAGEYR+DBD100', data=df1)
mediator_DI=sm.OLS.from_formula('RIDAGEYR ~ DBD100', data=df1)

# estimate the mediation effects (exposure: DBD100, mediator: RIDAGEYR)
med_DI = Mediation(med_model_DI,mediator_DI,'DBD100','RIDAGEYR').fit()
print(med_DI.summary())

                          Estimate  Lower CI bound  Upper CI bound  P-value
ACME (control)            0.378793        0.198924        0.563067    0.000
ACME (treated)            0.378793        0.198924        0.563067    0.000
ADE (control)             0.493689        0.059676        0.900952    0.026
ADE (treated)             0.493689        0.059676        0.900952    0.026
Total effect              0.872482        0.416557        1.328894    0.000
Prop. mediated (control)  0.426604        0.245650        0.855243    0.000
Prop. mediated (treated)  0.426604        0.245650        0.855243    0.000
ACME (average)            0.378793        0.198924        0.563067    0.000
ADE (average)             0.493689        0.059676        0.900952    0.026
Prop. mediated (average)  0.426604        0.245650        0.855243    0.000


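All the mediation effects (ACME) are significant at the 5% level, which means that age is a mediator between salt intake and diastolic blood pressure.
The direct effect (ADE) is also significant (p = 0.026), so the mediation is partial.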
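
For linear models fitted by OLS, the decomposition in the table can be checked by hand: the total effect `c` from `Y ~ X` equals the direct effect `c'` from `Y ~ X + M` plus the indirect effect `a*b`, where `a` is the slope of `M ~ X` and `b` is the coefficient of `M` in `Y ~ X + M`. In the table above, 0.379 (ACME) + 0.494 (ADE) = 0.872 (total effect). The following NumPy sketch verifies the identity on simulated data. The variables `x`, `m`, `y`, their coefficients, and the helper `fit_ols` are illustrative.

```python
import numpy as np

def fit_ols(columns, y):
    """OLS coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=n)                       # exposure
m = 0.6 * x + rng.normal(size=n)             # mediator
y = 0.3 * x + 0.5 * m + rng.normal(size=n)   # outcome

a = fit_ols([x], m)[1]               # path X -> M
c_total = fit_ols([x], y)[1]         # total effect of X on Y
_, c_direct, b = fit_ols([x, m], y)  # direct effect and path M -> Y

indirect = a * b
print(np.isclose(c_total, c_direct + indirect))
```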

#### Systolic result

Fit model: `SY ~ DBD100 + RIDAGEYR`.

In [106]:
mediation_SY = ols('SY ~ DBD100 + RIDAGEYR', data=df1).fit()
print(mediation_SY.summary())

                            OLS Regression Results                            
Dep. Variable:                     SY   R-squared:                       0.327
Model:                            OLS   Adj. R-squared:                  0.326
Method:                 Least Squares   F-statistic:                     1134.
Date:                Thu, 12 Dec 2019   Prob (F-statistic):               0.00
Time:                        02:23:41   Log-Likelihood:                -19067.
No. Observations:                4682   AIC:                         3.814e+04
Df Residuals:                    4679   BIC:                         3.816e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -2.212e-14      0.208  -1.07e-13      1.0

The model is significant, even though the coefficient of salt intake is not significant at the 5% level.
Age might mediate the relationship between salt intake and systolic blood pressure.

In [108]:
# outcome model and mediator model
med_model_SY=sm.OLS.from_formula('SY ~ RIDAGEYR+DBD100', data=df1)
mediator_SY=sm.OLS.from_formula('RIDAGEYR ~ DBD100', data=df1)

# estimate the mediation effects (exposure: DBD100, mediator: RIDAGEYR)
med_SY = Mediation(med_model_SY,mediator_SY,'DBD100','RIDAGEYR').fit()
print(med_SY.summary())

                          Estimate  Lower CI bound  Upper CI bound  P-value
ACME (control)            1.034208        0.525030        1.529603    0.000
ACME (treated)            1.034208        0.525030        1.529603    0.000
ADE (control)             0.093089       -0.367313        0.540152    0.720
ADE (treated)             0.093089       -0.367313        0.540152    0.720
Total effect              1.127297        0.430358        1.775847    0.004
Prop. mediated (control)  0.918998        0.604010        1.648115    0.004
Prop. mediated (treated)  0.918998        0.604010        1.648115    0.004
ACME (average)            1.034208        0.525030        1.529603    0.000
ADE (average)             0.093089       -0.367313        0.540152    0.720
Prop. mediated (average)  0.918998        0.604010        1.648115    0.004


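All the mediation effects (ACME) are significant at the 5% level, which means that age is a mediator between salt intake and systolic blood pressure.
The direct effect (ADE) is not significant (p = 0.720), so most of the association runs through age.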
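
The confidence bounds reported by `Mediation` come from simulation. As an independent check, an interval for the indirect effect `a*b` can also be approximated with a nonparametric percentile bootstrap. The sketch below shows that alternative technique, not the procedure statsmodels runs. It uses simulated data with illustrative names and a hypothetical helper `indirect_effect`.

```python
import numpy as np

def indirect_effect(x, m, y):
    """Product-of-coefficients estimate a*b from two OLS fits."""
    ones = np.ones(len(x))
    a = np.linalg.lstsq(np.column_stack([ones, x]), m, rcond=None)[0][1]
    b = np.linalg.lstsq(np.column_stack([ones, x, m]), y, rcond=None)[0][2]
    return a * b

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
m = 0.6 * x + rng.normal(size=n)
y = 0.3 * x + 0.5 * m + rng.normal(size=n)

# Resample rows with replacement and recompute a*b each time
n_boot = 500
boot = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot[i] = indirect_effect(x[idx], m[idx], y[idx])

# 95% percentile interval for the indirect effect
lower, upper = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect 95% CI: [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero points to the same conclusion as a significant ACME.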

## Summary

The analysis above shows that salt intake behavior is significantly associated with blood pressure
(both diastolic and systolic).
This association is not moderated by waist size.
Age is a mediator between salt intake behavior and blood pressure (both diastolic and systolic).
