# Blood Pressure Analysis and Prediction
## Josh McComack
### Objectives
I wish to investigate the feasibility of predicting systolic and diastolic blood pressure based on the [National Health and Nutrition Examination Survey](https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey/home) (NHANES) data set from 2013-2014 found on kaggle.com. The primary questions I wish to answer are:
1. Which variables from the survey are most predictive empirically, and do they correspond to what mainstream literature identifies as key factors in blood pressure levels?
2. Comparing scikit-learn’s SGDRegressor, MultiTaskLasso, and RandomForestRegressor, which regression model offers the best predictions on this data set?
3. Does the best model perform well enough to serve as a possible supplementary or alternative way of “measuring” blood pressure?


### Import Libraries and Data
First we will import our libraries and data. The NHANES data is stored in 6 separate CSV files.

In [None]:
# Import libraries
import pandas as pd

In [None]:
# Load Data
path = "./national-health-and-nutrition-examination-survey/"
demographic_df = pd.read_csv(path + "demographic.csv")
diet_df = pd.read_csv(path + "diet.csv")
exam_df = pd.read_csv(path + "examination.csv")
labs_df = pd.read_csv(path + "labs.csv")
# There seems to be a non UTF-8 Character in the medications data set.
meds_df = pd.read_csv(path + "medications.csv", encoding = "latin1")
questionnaire_df = pd.read_csv(path + "questionnaire.csv")
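The `latin1` workaround used for the medications file can be wrapped in a small helper that tries UTF-8 first and only falls back when decoding fails. This is a minimal sketch: `read_csv_with_fallback` is a hypothetical name, not a pandas function, and the list of encodings to try is an assumption.

```python
import pandas as pd


def read_csv_with_fallback(filepath, encodings=("utf-8", "latin1")):
    """Hypothetical helper: return the first read that decodes without error."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(filepath, encoding=encoding)
        except UnicodeDecodeError as err:
            # Remember the failure and try the next candidate encoding.
            last_error = err
    if last_error is None:
        raise ValueError("no encodings were supplied")
    raise last_error
```

With this helper, all six files could be loaded through the same call, for example `read_csv_with_fallback(path + "medications.csv")`.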

In [None]:
# Join all data frames into a single data frame
df = demographic_df
for frame in [diet_df, exam_df, labs_df, meds_df, questionnaire_df]:
    df = df.merge(frame, on="SEQN", how="outer")
df.head(10)
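One thing worth checking after an outer join like this is whether any frame holds more than one row per `SEQN`. If the medications file lists one row per drug (which the `RXDDRUG` column suggests, though that is an assumption here), each participant's other columns are repeated once per medication. A sketch on toy data, with all values invented for illustration:

```python
import pandas as pd

# Toy frames; the values are made up purely for illustration.
demo = pd.DataFrame({"SEQN": [1, 2, 3], "RIAGENDR": [1, 2, 1]})
meds = pd.DataFrame({
    "SEQN": [1, 1, 2],
    "RXDDRUG": ["LISINOPRIL", "METFORMIN", "ATORVASTATIN"],
})

merged = demo.merge(meds, on="SEQN", how="outer")

# Participant 1 appears twice because they have two medication rows.
rows_per_participant = merged.groupby("SEQN").size()
print(rows_per_participant)

# Number of rows whose SEQN has already appeared earlier in the merged frame.
n_repeated = merged["SEQN"].duplicated().sum()
print(n_repeated)
```

On the real data, `df["SEQN"].is_unique` gives a quick yes/no answer. If it is `False`, one option is to aggregate the one-to-many frame to a single row per participant before merging.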

### Data Cleaning and Selection
Our data set contains a total of 1824 different columns, with an overview of the columns found [here](https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey/home). To get started, however, we will focus on features that have been identified as possible factors in blood pressure levels:
* Age [1][2]
* Gender [2]
* Sodium intake [1]
* Smoking [1][2]
* Exercise levels [1][2]
* Alcohol intake [1][2]
* Diabetes [2]
* Weight or Obesity [1][2]

* Family history of high blood pressure [1][2]
* Genetics (which probably ties in with gender and family history) [1]
* Stress levels [1][2]
* Chronic kidney disease [1][2]


References:
* [1] [www.webmd.com](https://www.webmd.com/hypertension-high-blood-pressure/guide/blood-pressure-causes#1)
* [2] [www.heart.org](http://www.heart.org/en/health-topics/high-blood-pressure/why-high-blood-pressure-is-a-silent-killer/know-your-risk-factors-for-high-blood-pressure)


From the demographic data set, we can use the survey participants' age and gender identified in columns DMDHRAGE and RIAGENDR defined in the detailed column description found [here](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Demographics&CycleBeginYear=2013).

In [None]:
demographic_df.head()

In [None]:
demographic_df[['DMDHRAGE', 'RIAGENDR']].head()
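A caveat on the age column: the CDC demographics variable list describes `DMDHRAGE` as the age of the household reference person, while `RIDAGEYR` is the participant's own age at screening. The two can differ for anyone who is not the reference person, such as a child in the household. The sketch below uses invented values to show the comparison; whether the Kaggle CSV includes `RIDAGEYR` is an assumption to verify before swapping columns.

```python
import pandas as pd

# Toy rows with invented values; column names follow the CDC demographics codebook.
toy_demo = pd.DataFrame({
    "SEQN": [1, 2, 3],
    "RIDAGEYR": [45, 12, 70],   # participant's age at screening
    "DMDHRAGE": [45, 41, 68],   # household reference person's age
})

# Rows where the participant's age differs from the reference person's age.
differs = toy_demo["RIDAGEYR"] != toy_demo["DMDHRAGE"]
print(toy_demo.loc[differs, ["SEQN", "RIDAGEYR", "DMDHRAGE"]])
print(f"{differs.mean():.0%} of rows differ")
```

Running the same comparison on `demographic_df` would show how often the two columns disagree in the real survey.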

The dietary data set contains potentially useful information about sodium intake and diet; however, the detailed column description does not explain how this information is encoded. So we will use the more quantifiable serum sodium column LBXSNASI taken from the [laboratory data set](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Laboratory&CycleBeginYear=2013).

Note: there seem to be many more sodium columns missing from this data set.

In [None]:
labs_df[['LBXSNASI']].head()

The [questionnaire](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Questionnaire&CycleBeginYear=2013) contains a variety of data about smoking, exercise, alcohol habits, and diabetes. A few columns we will use are:
##### Smoking

* SMQ040: Do you now smoke cigarettes?
* SMQ020: Have you smoked at least 100 cigarettes in your entire life?
* SMD650: During the past 30 days, on the days that you smoked, about how many cigarettes did you smoke per day?
* SMD641: On how many of the past 30 days did you smoke a cigarette?

##### Exercise

* PAQ677: On how many of the past 7 days did you exercise or participate in physical activity for at least 20 minutes that made you sweat and breathe hard, such as basketball, soccer, running, swimming laps, fast bicycling, fast dancing, or similar activities?
* PAQ678: On how many of the past 7 days did you do exercises to strengthen or tone your muscles, such as push-ups, sit-ups, or weight lifting?

##### Alcohol

* ALQ101: In any one year, have you had at least 12 drinks of any type of alcoholic beverage? By a drink, I mean a 12 oz. beer, a 5 oz. glass of wine, or one and a half ounces of liquor.
* ALQ130: In the past 12 months, on those days that you drank alcoholic beverages, on the average, how many drinks did you have?

##### Diabetes
* DIQ010: Other than during pregnancy, have you ever been told by a doctor or health professional that you have diabetes or sugar diabetes?

In [None]:
questionnaire_df[['SMQ040', 'SMQ020', 'SMD650', 'SMD641', 'PAQ677', 'PAQ678', 'ALQ101', 'ALQ130', 'DIQ010']].head()
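NHANES questionnaire items typically reserve sentinel codes for "Refused" and "Don't know" answers (for example 7 and 9 on yes/no items, 777 and 999 on wider numeric fields). Left in place, these would be read as real quantities by a regression model. The sketch below replaces them with `NaN`. The code lists in `MISSING_CODES` are assumptions that should be checked against each variable's codebook entry, and `mask_missing_codes` is a hypothetical helper name.

```python
import pandas as pd

# Assumed "Refused" / "Don't know" codes per column; verify against the CDC codebook.
MISSING_CODES = {
    "SMQ020": [7, 9],
    "ALQ101": [7, 9],
    "DIQ010": [7, 9],
    "ALQ130": [777, 999],
}


def mask_missing_codes(frame, missing_codes=MISSING_CODES):
    """Return a copy of frame with the listed sentinel codes replaced by NaN."""
    cleaned = frame.copy()
    for column, codes in missing_codes.items():
        if column in cleaned.columns:
            # mask() sets entries to NaN wherever the condition is True.
            cleaned[column] = cleaned[column].mask(cleaned[column].isin(codes))
    return cleaned


# Toy data with invented responses.
toy_q = pd.DataFrame({
    "SMQ020": [1, 2, 9],
    "ALQ130": [2, 999, 777],
})
print(mask_missing_codes(toy_q))
```

The input frame is left untouched, so the raw codes remain available if a "don't know" answer later turns out to be informative in its own right.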

Examining the medications data set, we can see who is taking medication for blood pressure and/or diabetes. 


In [None]:
meds_df[['RXDDRUG', 'RXDRSD1']].head(11)
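One way to turn these rows into a per-participant feature is to flag anyone with at least one medication whose stated reason for use mentions hypertension. This is only a sketch on invented rows: it assumes `RXDRSD1` holds a text description of the reason for use, that a simple keyword match on "hypertension" is adequate, and `takes_bp_med` is a hypothetical column name.

```python
import pandas as pd

# Toy medication rows; drug names and reasons are invented for illustration.
toy_meds = pd.DataFrame({
    "SEQN": [1, 1, 2, 3],
    "RXDDRUG": ["LISINOPRIL", "METFORMIN", "ATORVASTATIN", None],
    "RXDRSD1": [
        "Essential (primary) hypertension",
        "Type 2 diabetes mellitus",
        "Pure hypercholesterolemia",
        None,
    ],
})

# Row-level flag: does the reason mention hypertension? Missing reasons count as False.
mentions_bp = toy_meds["RXDRSD1"].str.contains("hypertension", case=False, na=False)

# Collapse to one value per participant: True if any of their medications matched.
bp_med_flag = mentions_bp.groupby(toy_meds["SEQN"]).any().rename("takes_bp_med")
print(bp_med_flag)
```

Because the result has one entry per `SEQN`, it can be joined onto the other frames without multiplying rows. The same pattern with a different keyword would give a diabetes-medication flag.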

The data set doesn't seem to contain direct information on stress levels, chronic kidney disease, or family history of high blood pressure. But from the examination data set we can collect information on body mass index and blood pressure readings:

##### Body Mass Index

* BMXBMI: Body Mass Index (kg/m**2)

##### Blood Pressure
* BPXDI1 to BPXDI4: Diastolic blood pressure in mm Hg (first through fourth readings)
* BPXSY1 to BPXSY4: Systolic blood pressure in mm Hg (first through fourth readings)

In [None]:
exam_df[['BMXBMI', 'BPXDI1', 'BPXDI2', 'BPXDI3', 'BPXDI4', 'BPXSY1', 'BPXSY2', 'BPXSY3', 'BPXSY4']].head()
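Since each participant can have up to four readings, a single prediction target per person has to be derived from them. One simple option, shown here as a sketch rather than as the approach this analysis commits to, is the row-wise mean over whichever readings are present. The toy values are invented, and `systolic_mean` and `diastolic_mean` are hypothetical column names.

```python
import numpy as np
import pandas as pd

# Toy examination rows with invented readings; NaN marks a reading that was not taken.
toy_exam = pd.DataFrame({
    "BPXSY1": [120.0, 138.0, np.nan],
    "BPXSY2": [118.0, np.nan, np.nan],
    "BPXSY3": [122.0, 142.0, np.nan],
    "BPXSY4": [np.nan, np.nan, np.nan],
    "BPXDI1": [80.0, 90.0, np.nan],
    "BPXDI2": [78.0, np.nan, np.nan],
    "BPXDI3": [82.0, 86.0, np.nan],
    "BPXDI4": [np.nan, np.nan, np.nan],
})

systolic_cols = [f"BPXSY{i}" for i in range(1, 5)]
diastolic_cols = [f"BPXDI{i}" for i in range(1, 5)]

# Row-wise mean over the available readings; rows with no readings stay NaN.
targets = pd.DataFrame({
    "systolic_mean": toy_exam[systolic_cols].mean(axis=1),
    "diastolic_mean": toy_exam[diastolic_cols].mean(axis=1),
})
print(targets)
```

Participants with no readings at all end up with `NaN` targets and would need to be dropped before fitting any of the regressors.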