# Predicting Blood Pressure Based on Diet
By: [Kelly Wu](https://www.linkedin.com/in/kelly-wu-nj/)

## Problem Statement

Most of us have gone to the doctor at least once and gone through the basic health checks: height, weight, temperature, and blood pressure. We're all familiar with having an arm squeezed tightly by a cuff while the doctor listens with a stethoscope and watches the little monitor intently. Then we all hope never to hear that we have high blood pressure. High blood pressure, also known as a "silent killer," normally causes no symptoms but can lead to a heart attack or stroke. What's worse, hypertension is so common that it is a leading risk for death and disability worldwide, according to Dr. Paul Whelton, an expert in hypertension and kidney disease at Tulane University.

What is blood pressure? Blood pressure is given as two numbers. The first number represents the pressure in your blood vessels as the heart beats (systolic pressure). The second is the pressure as your heart relaxes and fills with blood (diastolic pressure). Normal blood pressure is considered to be 120/80 or lower, while high blood pressure is considered to be 140/90 or higher. So what affects blood pressure? There are numerous factors that can affect blood pressure and it's normal for it to fluctuate throughout the day. The time of day, the foods you eat, and stress are a few contributors to blood pressure. 
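
The thresholds above can be written as a small lookup. This is a minimal sketch: the function name `classify_blood_pressure` and the "in between" label are hypothetical and only illustrate the two cutoffs quoted in this section. It is not a clinical guideline.

```python
def classify_blood_pressure(systolic, diastolic):
    """Label a reading using the 120/80 and 140/90 cutoffs quoted above.

    Hypothetical helper for illustration only.
    """
    # Either number at or above its upper cutoff counts as high
    if systolic >= 140 or diastolic >= 90:
        return "high"
    # Both numbers at or below the lower cutoffs count as normal
    if systolic <= 120 and diastolic <= 80:
        return "normal"
    # Everything else falls between the two cutoffs
    return "in between"


print(classify_blood_pressure(118, 76))  # normal
print(classify_blood_pressure(156, 62))  # high
```

Note that a reading is flagged as high when either number crosses its cutoff, which is why 156/62 is labeled high despite the low diastolic value.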

Food is essential to life, but the majority of the population simply eats what tastes good or what's convenient. Few people go to the trouble of calculating the macronutrients they need each day. Culture is another factor that shapes diet: rice may be the primary carbohydrate in one culture and pasta in another. With such differences in diet, and with diet being a factor that affects blood pressure, can we predict blood pressure simply from what we eat? Maybe our predictions will prompt us to rethink what we eat or to eat with more variety.

After fitting our regression models, we will use their $R^2$ scores to judge how much of the variation in blood pressure each model explains. From there, we can look for relationships between diet, basic body measurements such as height and weight, and blood pressure. Our goal is to help nutritionists better assist their clients and to let the typical layman estimate their risk of high blood pressure without the hassle of going to a physician or spending money on an at-home blood pressure machine.
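
For readers unfamiliar with $R^2$, here is a minimal sketch of how the score is computed. The helper `r_squared` and the sample numbers are made up for illustration and are not results from this project.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Made-up systolic readings, not NHANES data
actual = [120.0, 130.0, 140.0, 150.0]
baseline = [135.0] * 4  # always predicting the mean of `actual`

print(r_squared(actual, actual))    # perfect predictions score 1.0
print(r_squared(actual, baseline))  # predicting the mean scores 0.0
```

A score near 1 means the model explains most of the variation in blood pressure, while a score near 0 means it does no better than always guessing the average.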

## Executive Summary

We begin by gathering public data from the National Health and Nutrition Examination Survey (NHANES) for 2013-2014. Luckily, [Kaggle](https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey#diet.csv) has organized the data for us into simple `.csv` files.

## Contents: 
- [Imports](#Imports)
- [Cleaning Our Data](#Cleaning-Our-Data)
    - [Handling Null Values](#Handling-Null-Values)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    - [Visualizations](#Visualizations)
- [Preprocessing](#Preprocessing)
- [Regression Modeling](#Regression-Modeling)
    - [Lasso](#Lasso)
    - [Ridge](#Ridge)
    - [Elastic Net](#Elastic-Net)
- [Outside Research](#Outside-Research)
- [Conclusions](#Conclusions)
- [Recommendations](#Recommendations)
- [Sources](#Sources)

### Imports
Import our necessary libraries and `csv` files that contain our datasets to help answer our problem statement. Our datasets contain responses from almost 10,000 individuals regarding various health and dietary questions. 

[Back to Contents](#Contents:)

In [67]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline

In [68]:
# Importing our datasets
diet_df = pd.read_csv('./datasets/diet.csv')
exam_df = pd.read_csv('./datasets/examination.csv')

In [69]:
# Preview our dietary data
diet_df.head()

Unnamed: 0,SEQN,WTDRD1,WTDR2D,DR1DRSTZ,DR1EXMER,DRABF,DRDINT,DR1DBIH,DR1DAY,DR1LANG,...,DRD370QQ,DRD370R,DRD370RQ,DRD370S,DRD370SQ,DRD370T,DRD370TQ,DRD370U,DRD370UQ,DRD370V
0,73557,16888.327864,12930.890649,1,49.0,2.0,2.0,6.0,2.0,1.0,...,,,,,,,,,,
1,73558,17932.143865,12684.148869,1,59.0,2.0,2.0,4.0,1.0,1.0,...,,2.0,,2.0,,2.0,,2.0,,2.0
2,73559,59641.81293,39394.236709,1,49.0,2.0,2.0,18.0,6.0,1.0,...,,,,,,,,,,
3,73560,142203.069917,125966.366442,1,54.0,2.0,2.0,21.0,3.0,1.0,...,,,,,,,,,,
4,73561,59052.357033,39004.892993,1,63.0,2.0,2.0,18.0,1.0,1.0,...,,2.0,,2.0,,2.0,,2.0,,2.0


In [70]:
# Preview our examinations data
exam_df.head()

Unnamed: 0,SEQN,PEASCST1,PEASCTM1,PEASCCT1,BPXCHR,BPAARM,BPACSZ,BPXPLS,BPXPULS,BPXPTY,...,CSXLEAOD,CSXSOAOD,CSXGRAOD,CSXONOD,CSXNGSOD,CSXSLTRT,CSXSLTRG,CSXNART,CSXNARG,CSAEFFRT
0,73557,1,620.0,,,1.0,4.0,86.0,1.0,1.0,...,2.0,1.0,1.0,1.0,4.0,62.0,1.0,,,1.0
1,73558,1,766.0,,,1.0,4.0,74.0,1.0,1.0,...,3.0,1.0,2.0,3.0,4.0,28.0,1.0,,,1.0
2,73559,1,665.0,,,1.0,4.0,68.0,1.0,1.0,...,2.0,1.0,2.0,3.0,4.0,49.0,1.0,,,3.0
3,73560,1,803.0,,,1.0,2.0,64.0,1.0,1.0,...,,,,,,,,,,
4,73561,1,949.0,,,1.0,3.0,92.0,1.0,1.0,...,3.0,1.0,4.0,3.0,4.0,,,,,1.0


In [71]:
# Look at the count of rows and columns 
diet_df.shape

(9813, 168)

In [72]:
# Look at the count of rows and columns 
exam_df.shape

(9813, 224)

From importing and previewing our datasets, we can see that there are many `NaN` values, hundreds of columns that may or may not be crucial to us, and column names that don't make clear what values they hold.
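
One way to put a number on "many `NaN` values" is to compute the share of missing entries per column. The sketch below uses a tiny stand-in frame (`toy_df`, with made-up values) so it runs on its own. Applying the same line to `diet_df` or `exam_df` is an assumption about how one might proceed, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for diet_df; the values are illustrative, not NHANES data
toy_df = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "DR1TSODI": [1323.0, np.nan, 2943.0, np.nan],
    "DRD370V": [np.nan, np.nan, np.nan, 2.0],
})

# isna() marks missing cells; the column-wise mean is the fraction missing
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns near the top of this listing are mostly empty and are natural candidates to drop before modeling.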

### Cleaning Our Data
The dietary and examination variable lists on the NHANES website tell us what each column refers to, which lets us do some cleaning. From those lists, we decide which columns matter to us and give each one a new name that makes its contents easier to recognize.

[Back to Contents](#Contents:)

In [73]:
# Creating a data dictionary to rename columns
diet_dict = {
    'DR1.320Z' : 'water', 
    'DR1TCAFF' : 'caffeine', 
    'DR1TALCO' : 'alcohol', 
    'DR1TCALC' : 'calcium', 
    'DR1TCARB' : 'carbs', 
    'DR1TCHOL' : 'cholesterol',
    'DR1TFIBE' : 'fiber', 
    'DR1TNUMF' : 'total_foods', 
    'DR1TPROT' : 'protein', 
    'DR1TPOTA' : 'potassium', 
    'DR1TSODI' : 'sodium',
    'DR1TSUGR' : 'sugar',
    'DR1TTFAT' : 'fat',
    'DRQSDIET' : 'on_a_diet', 
}

In [74]:
# Creating a data dictionary to rename columns
exam_dict = {
    'BPXDI1' : 'diastolic_1', 
    'BPXDI2' : 'diastolic_2',
    'BPXDI3' : 'diastolic_3',
    'BPXDI4' : 'diastolic_4',
    'BPXSY1' : 'systolic_1', 
    'BPXSY2' : 'systolic_2', 
    'BPXSY3' : 'systolic_3', 
    'BPXSY4' : 'systolic_4',
    'PEASCTM1' : 'bp_time', 
    'PEASCCT1' : 'bp_comment', 
    'PEASCST1' : 'bp_status', 
    'BMXHT' : 'height', 
    'BMXWT' : 'weight'
}

In [75]:
# Isolating desired columns in dataframe
# (the dictionary keys are the original NHANES column names)
diet_columns = list(diet_dict)

diet_df = diet_df[diet_columns]

In [76]:
# Isolating desired columns in dataframe
# (the dictionary keys are the original NHANES column names)
exam_columns = list(exam_dict)

exam_df = exam_df[exam_columns]

In [77]:
# Rename columns with values from data dictionaries
diet = diet_df.rename(columns = diet_dict)
exam = exam_df.rename(columns = exam_dict)

In [78]:
# Preview diet dataframe
diet.head(3)

Unnamed: 0,water,caffeine,alcohol,calcium,carbs,cholesterol,fiber,total_foods,protein,potassium,sodium,sugar,fat,on_a_diet
0,960.0,203.0,0.0,949.0,239.59,209.0,10.8,11.0,43.63,2228.0,1323.0,176.47,52.81,2.0
1,360.0,240.0,119.0,3193.0,423.78,2584.0,16.7,8.0,338.13,4930.0,9726.0,44.99,124.29,2.0
2,1254.0,45.0,0.0,877.0,224.39,88.0,9.9,27.0,64.61,1694.0,2943.0,102.9,65.97,1.0


In [79]:
# Preview exams dataframe
exam.head(3)

Unnamed: 0,diastolic_1,diastolic_2,diastolic_3,diastolic_4,systolic_1,systolic_2,systolic_3,systolic_4,bp_time,bp_comment,bp_status,height,weight
0,72.0,76.0,74.0,,122.0,114.0,102.0,,620.0,,1,171.3,78.3
1,62.0,80.0,42.0,,156.0,160.0,156.0,,766.0,,1,176.8,89.5
2,90.0,76.0,80.0,,140.0,140.0,146.0,,665.0,,1,175.3,88.9


After extracting the columns of importance to us and renaming them, we still need to handle the null values that we can see when previewing our cleaned datasets.

### Handling Null Values
Null values are unlikely to help us solve our problem statement, since we are determining the risk of high blood pressure from an individual's diet. If there are no nutritional values for an individual, we may not be able to properly train our model or make a prediction for that person.

[Back to Contents](#Contents:)

In [80]:
diet.isna().sum()

water          1152
caffeine       1282
alcohol        1282
calcium        1282
carbs          1282
cholesterol    1282
fiber          1282
total_foods    1152
protein        1282
potassium      1282
sodium         1282
sugar          1282
fat            1282
on_a_diet      1030
dtype: int64

In [81]:
exam.isna().sum()

diastolic_1    2641
diastolic_2    2404
diastolic_3    2405
diastolic_4    9298
systolic_1     2641
systolic_2     2404
systolic_3     2405
systolic_4     9298
bp_time         305
bp_comment     9493
bp_status         0
height          746
weight           90
dtype: int64
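
The counts above suggest that the fourth reading is almost always missing while the first three are usually present. One possible way to handle this, shown here as a sketch rather than the approach this notebook necessarily takes, is to average whichever readings exist for each person and drop only the rows with no reading at all. The frame `toy_exam` and the column `systolic_mean` are hypothetical, and the third row is made up to show the all-missing case.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the renamed exam frame
toy_exam = pd.DataFrame({
    "systolic_1": [122.0, np.nan, np.nan],
    "systolic_2": [114.0, 160.0, np.nan],
    "systolic_3": [102.0, 156.0, np.nan],
    "systolic_4": [np.nan, np.nan, np.nan],
})

systolic_cols = [c for c in toy_exam.columns if c.startswith("systolic_")]

# mean(axis=1) skips NaN by default, so each row averages the readings it has;
# a row with no readings at all stays NaN
toy_exam["systolic_mean"] = toy_exam[systolic_cols].mean(axis=1)

# Drop only the rows where no systolic reading was recorded
cleaned = toy_exam.dropna(subset=["systolic_mean"])
print(cleaned[["systolic_mean"]])
```

The same pattern would apply to the diastolic columns. Whether to average, impute, or drop is a modeling decision that depends on why the readings are missing.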

### Exploratory Data Analysis 
[Back to Contents](#Contents:)

### Visualizations
[Back to Contents](#Contents:)

### Preprocessing 
[Back to Contents](#Contents:)

### Regression Modeling 
[Back to Contents](#Contents:)

### Lasso 
[Back to Contents](#Contents:)

### Ridge 
[Back to Contents](#Contents:)

### Elastic Net
[Back to Contents](#Contents:)

### Outside Research
[Back to Contents](#Contents:)

### Conclusions 
[Back to Contents](#Contents:)

### Recommendations 
[Back to Contents](#Contents:)

### Sources
[Back to Contents](#Contents:)
- [Blood Pressure Matters](https://newsinhealth.nih.gov/2016/01/blood-pressure-matters)
- [Dietary Data](https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Dietary&CycleBeginYear=2013)
- [Dietary Variable List](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Dietary&CycleBeginYear=2013)
- [Examination Data](https://wwwn.cdc.gov/Nchs/Nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013)
- [Examination Variable List](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Examination&CycleBeginYear=2013)
- [NHANES Datasets](https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey#diet.csv)