# Influence of salt intake behavior on blood pressure
## STATS 506 group project
### Author: Xinjun Li

This report was generated with Jupyter Notebook for the STATS 506 group project at the University of Michigan.

Our project aims to answer the following question:

>**Is salt intake associated with blood pressure?**
**If so, to what extent is that relationship mediated or moderated by age or waist size?**

We use [NHANES](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx) data for the analysis.

Required software and packages to run the code are as follows:
* Python3 
* os
* pandas
* numpy
* scipy
* statsmodels
* patsy
* matplotlib

In [1]:
# Import packages
import os
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import patsy
import statsmodels.api as sm
from statsmodels.stats.mediation import Mediation
import matplotlib.pyplot as plt

In [2]:
# Set working directory
os.chdir('D:/学习/密歇根/STAT506/Group project/stats506/')

# read data
demo = pd.read_excel("RawData/Demographics_15_16.xlsx")
BMI = pd.read_excel("RawData/Body_measures_2015_16.xlsx")
bp = pd.read_excel('RawData/Blood_Pressure_2015_16.xlsx')
nutr = pd.read_excel("RawData/Dietary_nutrients_firstday_2015_16.xlsx")

### Data cleaning

We first clean the raw data and then join the separate datasets.

In [3]:
# select useful columns
# ! Note: we need to drop values '9' or '99' which represent "don't know"
demo=demo.set_index('SEQN'
                    ).filter(items=['RIDAGEYR']  # 'RIAGENDR','RIDRETH3'
                    ).dropna()
# demo[['RIAGENDR','RIDRETH3']]=demo[['RIAGENDR','RIDRETH3']].astype('category')
BMI=BMI.set_index('SEQN'
                  ).filter(items=['BMXWAIST']   # ,'BMXWT','BMXHT'
                  ).dropna()
nutr=nutr.set_index('SEQN'
                    ).filter(items=['DBD100']  #,'DBQ095Z','DRQSPREP'
                    ).dropna(
                    ).query('DBD100 != 9'
                    ).astype('category')

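# Calculate the mean systolic (SY) and diastolic (DI) pressure over the repeated readings
# (anchored patterns match only the numbered BPXSY*/BPXDI* reading columns)
bp=bp.set_index('SEQN'
                ).filter(regex=r'^BPX(SY|DI)\d')
bp=bp.assign(SY=bp.filter(regex=r'^BPXSY\d').mean(axis=1, skipna = True),
             DI=bp.filter(regex=r'^BPXDI\d').mean(axis=1, skipna = True)
             ).filter(items=['SY','DI']).dropna()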

# Merge all datasets on the participant ID (SEQN)
df=bp.join(demo,how='inner').join(BMI,how='inner').join(nutr,how='inner')
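
The row-wise averaging step above can be illustrated on a toy table. This is only a sketch: the readings are made up, and the anchored column patterns assume that the repeated readings are named `BPXSY1`, `BPXSY2`, ... and `BPXDI1`, `BPXDI2`, ... A participant with no readings at all gets a missing mean and is dropped.

```python
import numpy as np
import pandas as pd

# Toy readings for three hypothetical participants (values are made up)
toy_bp = pd.DataFrame(
    {
        "SEQN": [1, 2, 3],
        "BPXSY1": [120.0, np.nan, np.nan],
        "BPXSY2": [124.0, 130.0, np.nan],
        "BPXDI1": [80.0, 70.0, np.nan],
        "BPXDI2": [78.0, np.nan, np.nan],
        "BPXPLS": [60.0, 72.0, 66.0],  # pulse column: must not enter either mean
    }
).set_index("SEQN")

# Select only the numbered systolic / diastolic reading columns
sy_cols = toy_bp.filter(regex=r"^BPXSY\d+$")
di_cols = toy_bp.filter(regex=r"^BPXDI\d+$")

# Average the available readings per participant, skipping missing values,
# then drop participants for whom a mean could not be computed
toy_means = pd.DataFrame(
    {
        "SY": sy_cols.mean(axis=1, skipna=True),
        "DI": di_cols.mean(axis=1, skipna=True),
    }
).dropna()
print(toy_means)
```

Because `skipna=True` averages whatever readings exist, participants with one reading and participants with several are kept on equal footing.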

Take a look at the data we are about to work on.

In [4]:
# Show data summary of numeric variables
print(df.describe())

In [None]:
# Show data summary of categorical variables
print(df.describe(include='category'))

Generate plots showing the relationship between salt intake behavior and blood pressure.

In [None]:
# plot Diastolic and salt intakes
DI_salt=df[['DBD100','DI']
            ].pivot(columns='DBD100', values='DI'
            )
DI_salt.columns=["Rarely","Occasionally","Very often"]
DI_salt.boxplot(grid=False)
plt.title("Diastolic blood pressure by frequency of adding salt to food at the table")
plt.suptitle("")
plt.show()

In [None]:
# plot Systolic and salt intakes
SY_salt=df[['DBD100','SY']
            ].pivot(columns='DBD100', values='SY'
            )
SY_salt.columns=["Rarely","Occasionally","Very often"]
SY_salt.boxplot(grid=False)
plt.title("Systolic blood pressure by frequency of adding salt to food at the table")
plt.suptitle("")
plt.show()

In [None]:
# Q-Q plots to check normality of diastolic pressure within each salt intake group
ax1 = plt.subplot(131)
a=DI_salt[['Rarely']].dropna().to_numpy()
a=np.reshape(a,(len(a)))
qqplot = stats.probplot(a, plot=plt)
plt.yticks([])

ax2 = plt.subplot(132)
b=DI_salt[['Occasionally']].dropna().to_numpy()
b=np.reshape(b,(len(b)))
qqplot = stats.probplot(b, plot=plt)
plt.yticks([])

ax3 = plt.subplot(133)
c=DI_salt[['Very often']].dropna().to_numpy()
c=np.reshape(c,(len(c)))
qqplot = stats.probplot(c, plot=plt)
plt.yticks([])

ax1.set_title('Rarely')
ax2.set_title('Occasionally')
ax3.set_title('Very often')

plt.show()

In [None]:
# Q-Q plots to check normality of systolic pressure within each salt intake group
ax1 = plt.subplot(131)
a=SY_salt[['Rarely']].dropna().to_numpy()
a=np.reshape(a,(len(a)))
qqplot = stats.probplot(a, plot=plt)
plt.yticks([])

ax2 = plt.subplot(132)
b=SY_salt[['Occasionally']].dropna().to_numpy()
b=np.reshape(b,(len(b)))
qqplot = stats.probplot(b, plot=plt)
plt.yticks([])

ax3 = plt.subplot(133)
c=SY_salt[['Very often']].dropna().to_numpy()
c=np.reshape(c,(len(c)))
qqplot = stats.probplot(c, plot=plt)
plt.yticks([])

ax1.set_title('Rarely')
ax2.set_title('Occasionally')
ax3.set_title('Very often')

plt.show()
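
Q-Q plots are judged by eye, so a numeric summary can help when comparing groups. When called without `plot=`, `stats.probplot` also returns the correlation coefficient `r` of the least-squares line through the Q-Q points; values close to 1 indicate points close to a straight line. The sketch below uses synthetic samples (not the NHANES data) purely to show the call.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic samples, for illustration only
normal_sample = rng.normal(loc=70, scale=10, size=500)    # roughly bell-shaped
skewed_sample = rng.exponential(scale=10, size=500) + 60  # right-skewed

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r for the normal sample: {r_normal:.3f}")
print(f"r for the skewed sample: {r_skewed:.3f}")
```

This `r` is only a descriptive aid, not a formal test of normality.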

### Fit OLS

Now fit ordinary least squares (OLS) models to the data.
The models are:
`DI ~ DBD100` and `SY ~ DBD100`
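
Since `DBD100` is categorical, the formula interface dummy-codes it with the first level as the reference. With a single categorical predictor, the intercept is the mean of the reference group and each remaining coefficient is the difference between that group's mean and the reference mean. A minimal sketch on a made-up table (the names `salt` and `bp` are hypothetical):

```python
import pandas as pd
from statsmodels.formula.api import ols

# Made-up data: three salt-intake levels, two observations each
toy = pd.DataFrame(
    {
        "salt": pd.Categorical([1, 1, 2, 2, 3, 3]),
        "bp": [70.0, 72.0, 74.0, 76.0, 80.0, 82.0],
    }
)

fit = ols("bp ~ salt", data=toy).fit()
group_means = toy.groupby("salt", observed=True)["bp"].mean()

# Intercept = mean of level 1; other coefficients = differences from level 1
print(fit.params)
print(group_means)
```

Reading the coefficients this way makes it easier to interpret the summaries that follow: each `DBD100` coefficient compares one salt intake level with the reference level.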

#### Diastolic result

In [None]:
# fit ols to Diastolic measurements
ols_DI=ols('DI~DBD100',data=df).fit()
# Print the summary
print(ols_DI.summary())

In [None]:
print(anova_lm(ols_DI))

#### Systolic result

In [None]:
# fit ols to Systolic measurements
ols_SY=ols('SY~DBD100',data=df).fit()
# Print the summary
print(ols_SY.summary())

In [None]:
print(anova_lm(ols_SY))

From the results above, both models are significant and every coefficient is significant (at the 95% confidence level).

### Moderation effect of waist size

First, add two columns holding waist size shifted up and down by one standard deviation.

In [None]:
df['waist_sd'] = df['BMXWAIST'].std()
df['waist_up']=df['BMXWAIST']+df['waist_sd']
df['waist_down']=df['BMXWAIST']-df['waist_sd']
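
For background, in a model with an interaction the coefficient of the predictor is its slope where the moderator equals zero, so shifting the moderator changes that coefficient but leaves the interaction coefficient unchanged. One common convention, shown here as a sketch on synthetic data with a continuous predictor and hypothetical names `x`, `w`, `y`, is to subtract the value of interest (for example mean plus one standard deviation) so that zero corresponds to that value.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 300
# Synthetic data with a true interaction between x and w
sim = pd.DataFrame(
    {"x": rng.normal(size=n), "w": rng.normal(loc=100, scale=15, size=n)}
)
sim["y"] = (
    60 + 2.0 * sim["x"] + 0.1 * sim["w"]
    + 0.05 * sim["x"] * sim["w"] + rng.normal(size=n)
)

# Re-center the moderator so that 0 means "one SD above the mean"
w_high = sim["w"].mean() + sim["w"].std()
sim["w_at_high"] = sim["w"] - w_high

base = ols("y ~ x * w", data=sim).fit()
shifted = ols("y ~ x * w_at_high", data=sim).fit()

# Slope of x at w = w_high, computed two equivalent ways
slope_from_base = base.params["x"] + base.params["x:w"] * w_high
slope_from_shifted = shifted.params["x"]
print(slope_from_base, slope_from_shifted)
```

The benefit of refitting with the shifted moderator is that the summary table directly reports a standard error and p-value for the slope at that moderator value.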

#### Diastolic result

Fit model: `DI ~ DBD100 + BMXWAIST + DBD100 * BMXWAIST`

In [None]:
moderation_DI = ols('DI ~ DBD100 + BMXWAIST + DBD100 * BMXWAIST', data=df).fit()
print(moderation_DI.summary())

In [None]:
print(anova_lm(moderation_DI))
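
A note on the formula syntax: in patsy, `a * b` is shorthand for `a + b + a:b`, and repeated terms are only included once. The formula `DI ~ DBD100 + BMXWAIST + DBD100 * BMXWAIST` should therefore build the same design matrix as the shorter `DI ~ DBD100 * BMXWAIST`. The sketch below checks this on a made-up table that reuses the column names for readability.

```python
import pandas as pd
import patsy

# Made-up predictor values, for illustration only
toy = pd.DataFrame(
    {
        "DBD100": pd.Categorical([1, 2, 3, 1, 2, 3]),
        "BMXWAIST": [80.0, 95.0, 110.0, 102.0, 88.0, 91.0],
    }
)

# Right-hand sides only: the long form and the short form
long_form = patsy.dmatrix("DBD100 + BMXWAIST + DBD100 * BMXWAIST", data=toy)
short_form = patsy.dmatrix("DBD100 * BMXWAIST", data=toy)

print(long_form.design_info.column_names)
print(short_form.design_info.column_names)
```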

In [None]:
# one standard deviation above mean
moderation_DI_up = ols('DI ~ DBD100 + waist_up + DBD100 * waist_up', data=df).fit()
print(moderation_DI_up.summary())

In [None]:
# one standard deviation below mean
moderation_DI_down = ols('DI ~ DBD100 + waist_down + DBD100 * waist_down', data=df).fit()
print(moderation_DI_down.summary())

For the models above, the coefficients of the interaction terms and of salt intake itself are not significant (at the 95% confidence level).
There is no moderation effect of waist size on the relationship between salt intake and diastolic blood pressure.

#### Systolic result

Fit model: `SY ~ DBD100 + BMXWAIST + DBD100 * BMXWAIST`

In [None]:
moderation_SY = ols('SY ~ DBD100 + BMXWAIST + DBD100 * BMXWAIST', data=df).fit()
print(moderation_SY.summary())

In [None]:
print(anova_lm(moderation_SY))

In [None]:
# one standard deviation above mean
moderation_SY_up = ols('SY ~ DBD100 + waist_up + DBD100 * waist_up', data=df).fit()
print(moderation_SY_up.summary())

In [None]:
# one standard deviation below mean
moderation_SY_down = ols('SY ~ DBD100 + waist_down + DBD100 * waist_down', data=df).fit()
print(moderation_SY_down.summary())

For the models above, the coefficients of the interaction terms and of salt intake itself are not significant (at the 95% confidence level).
There is no moderation effect of waist size on the relationship between salt intake and systolic blood pressure.

### Mediation effect of age

Fit the model `RIDAGEYR ~ DBD100` to see whether age is related to salt intake.

In [None]:
# test if there is relationship between age and salt intake.
age_D = ols('RIDAGEYR ~ DBD100', data=df).fit()
print(age_D.summary())

The model is significant, so age is related to salt intake behavior.

#### Diastolic result

Fit model: `DI ~ DBD100 + RIDAGEYR`.

In [None]:
mediation_DI = ols('DI ~ DBD100 + RIDAGEYR', data=df).fit()
print(mediation_DI.summary())

The model is significant. Age might be a mediator between salt intake and diastolic blood pressure.

In [None]:
# Create design matrix
DI,model_mat = patsy.dmatrices("DI ~ DBD100 + RIDAGEYR", data=df)
df_med_DI=pd.DataFrame(model_mat).iloc[:,1:]
df_med_DI.columns=['DBD2','DBD3','RIDAGEYR']
df_med_DI['DI']=DI

# outcome model and mediator model
med_model_DI=sm.OLS.from_formula('DI ~ RIDAGEYR+DBD2+DBD3', data=df_med_DI)
mediator_DI=sm.OLS.from_formula('RIDAGEYR ~ DBD2+DBD3', data=df_med_DI)

# run the mediation analysis and print the summary
med_DI = Mediation(med_model_DI,mediator_DI,['DBD2','DBD3'],'RIDAGEYR').fit()
print(med_DI.summary())

All of the mediation effects (ACME) are significant (at the 95% confidence level),
which means that age is a mediator between salt intake and diastolic blood pressure.
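
For intuition about what the mediation effect measures: in linear models without an exposure-mediator interaction, the indirect effect corresponds to the product of two regression coefficients, `a` (exposure to mediator) and `b` (mediator to outcome, holding exposure fixed), and for OLS the total effect splits exactly into direct plus indirect. The sketch below checks this on synthetic data with a continuous exposure; the names `exposure`, `mediator`, and `outcome` are hypothetical, and this is not the simulation-based procedure that `Mediation` runs.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
n = 500
# Synthetic data in which part of the exposure effect runs through the mediator
sim = pd.DataFrame({"exposure": rng.normal(size=n)})
sim["mediator"] = 0.5 * sim["exposure"] + rng.normal(size=n)
sim["outcome"] = (
    0.3 * sim["exposure"] + 0.8 * sim["mediator"] + rng.normal(size=n)
)

# a: effect of the exposure on the mediator
a = ols("mediator ~ exposure", data=sim).fit().params["exposure"]

# b: effect of the mediator on the outcome; direct: exposure effect given mediator
outcome_fit = ols("outcome ~ exposure + mediator", data=sim).fit()
b = outcome_fit.params["mediator"]
direct = outcome_fit.params["exposure"]

# total: exposure effect when the mediator is left out of the model
total = ols("outcome ~ exposure", data=sim).fit().params["exposure"]

indirect = a * b
print(f"total={total:.3f}  direct={direct:.3f}  indirect={indirect:.3f}")
```

Comparing `indirect` with `total` gives a rough sense of how much of the association passes through the mediator.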

#### Systolic result

Fit model: `SY ~ DBD100 + RIDAGEYR`.

In [None]:
mediation_SY = ols('SY ~ DBD100 + RIDAGEYR', data=df).fit()
print(mediation_SY.summary())

The model is significant.
Even though the coefficients of salt intake are not significant (at the 95% confidence level),
age might be a mediator between salt intake and systolic blood pressure.

In [None]:
# Create design matrix
SY,model_mat = patsy.dmatrices("SY ~ DBD100 + RIDAGEYR", data=df)
df_med_SY=pd.DataFrame(model_mat).iloc[:,1:]
df_med_SY.columns=['DBD2','DBD3','RIDAGEYR']
df_med_SY['SY']=SY

# outcome model and mediator model
med_model_SY=sm.OLS.from_formula('SY ~ RIDAGEYR+DBD2+DBD3', data=df_med_SY)
mediator_SY=sm.OLS.from_formula('RIDAGEYR ~ DBD2+DBD3', data=df_med_SY)

# run the mediation analysis and print the summary
med_SY = Mediation(med_model_SY,mediator_SY,['DBD2','DBD3'],'RIDAGEYR').fit()
print(med_SY.summary())

All of the mediation effects (ACME) are significant,
which means that age is a mediator between salt intake and systolic blood pressure.

## Summary

From the analysis above, salt intake behavior is significantly associated with blood pressure
(both diastolic and systolic).
The association between salt intake behavior and blood pressure (both diastolic and systolic) is not moderated by waist size.
Age is a mediator between salt intake behavior and blood pressure (both diastolic and systolic).

## References

1. <https://en.wikipedia.org/wiki/Moderation_(statistics)>
2. <http://web.pdx.edu/~newsomj/semclass/ho_mediation.pdf>

## Acknowledgements

I would like to thank Dr. Henderson for always patiently answering my questions.
I would also like to thank my fellow group members, Jingyan Lu and Karthik G.
Your work is very inspiring, and I could not have finished this report without your contributions.