# Data Preparation Notebook

### Introduction
This notebook applies any additional filtering or cleaning needed after the data has been extracted from its online source. Each extra processing step is justified, and the final result is saved to a file for later use.

    
#### Preparation steps and justifications
Each filtering action is justified by findings from the EDA. The list below describes each transformation and the reason it is needed.

**Include only "Yes"/"No" values in "Coronary heart disease" column**
- Since the scope of this project is to predict whether or not a patient has coronary heart disease (CHD), the values in this column must be clear and unambiguous; we cannot naively map "null" or "don't know" values to "no".
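
As a rough illustration of this rule, the sketch below keeps only rows with a definite answer. It assumes the column stores numeric survey codes where 1 means "Yes" and 2 means "No", with other codes (for example 7 or 9) standing for refused or "don't know" answers; that coding is an assumption and should be checked against the data before use. The helper `keep_definite_chd_answers` and the toy frame are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

CHD_COLUMN = "Coronary heart disease"

# Assumed coding (not confirmed by this notebook): 1 = Yes, 2 = No.
# Anything else, including NaN, is treated as "no definite answer".
VALID_CHD_CODES = [1.0, 2.0]


def keep_definite_chd_answers(frame: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows whose CHD answer is a definite Yes or No."""
    # isin() evaluates to False for NaN, so missing answers are dropped too.
    mask = frame[CHD_COLUMN].isin(VALID_CHD_CODES)
    return frame.loc[mask].reset_index(drop=True)


# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "SEQN": [1.0, 2.0, 3.0, 4.0, 5.0],
    CHD_COLUMN: [np.nan, 2.0, 9.0, 1.0, 7.0],
})

filtered = keep_definite_chd_answers(toy)
print(filtered)
```

Dropping these rows, rather than recoding them as "No", keeps the target column free of guessed labels at the cost of a smaller dataset.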


In [5]:
cleaned_data_filename = 'nhanes_data_processed.parquet'

In [6]:
import numpy as np
import pandas as pd
from IPython.display import Markdown, display
import os

# Display all rows and columns
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns

data_directory = 'data'
raw_data_filename = 'nhanes_data.parquet'
resulting_filepath = os.path.join(data_directory, cleaned_data_filename)
original_filepath = os.path.join(data_directory, raw_data_filename)

### Read in our dataset

In [7]:
df = pd.read_parquet(original_filepath)
df.head()

Unnamed: 0,SEQN,Weight,Body mass index,Year Range,Systolic,Diastolic,Gender,Age,Diabetes,Glycohemoglobin,Cholesterol,High-density lipoprotein (HDL),Albumin,Alanine aminotransferase (ALT),Aspartate aminotransferase (AST),Alkaline phosphatase (ALP),Gamma-glutamyl transferase (GGT),Glucose,Iron,Lactate dehydrogenase (LDH),Phosphorus,Bilirubin,Protein,Triglycerides,Uric acid,Creatinine,White blood cells,Basophils,Red blood cells,Hemoglobin,Red blood cell width,Platelet count,Mean volume of platelets,Coronary heart disease,Blood related diabetes,Blood related stroke,Moderate-work,Vigorous-work
0,1.0,12.5,14.9,1999-2000,,,2.0,29.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2.0,75.4,24.9,1999-2000,106.0,58.0,1.0,926.0,2.0,4.7,5.56,1.39,45.0,16.0,19.0,62.0,20.0,78.0,11.28,140.0,1.066,12.0,72.0,1.298,362.8,61.9,7.6,5.397605e-79,4.73,14.1,13.7,214.0,7.7,2.0,2.0,2.0,,3.0
2,3.0,32.9,17.63,1999-2000,110.0,60.0,2.0,125.0,2.0,,3.34,0.78,,,,,,,,,,,,,,,7.5,5.397605e-79,4.52,13.7,11.7,270.0,8.6,,,,,
3,4.0,13.3,,1999-2000,,,1.0,22.0,2.0,,,,,,,,,,,,,,,,,,8.8,0.1,4.77,9.3,15.3,471.0,7.8,,,,,
4,5.0,92.5,29.1,1999-2000,122.0,82.0,1.0,597.0,2.0,5.5,7.21,1.08,45.0,28.0,22.0,63.0,34.0,95.0,24.54,133.0,1.033,8.6,73.0,3.85,404.5,70.7,5.9,5.397605e-79,5.13,14.5,13.1,209.0,10.4,2.0,2.0,2.0,17.0,1.0
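
Before filtering, it can help to tally what the target column actually contains, missing values included. The sketch below does this for the "Coronary heart disease" values of the five rows displayed above; the small `preview` frame is a hypothetical stand-in, and on the full dataset the same call would be made on `df` instead.

```python
import numpy as np
import pandas as pd

CHD_COLUMN = "Coronary heart disease"

# The CHD values from the five rows shown in the preview above.
preview = pd.DataFrame({CHD_COLUMN: [np.nan, 2.0, np.nan, np.nan, 2.0]})

# dropna=False keeps missing values in the tally instead of hiding them.
chd_counts = preview[CHD_COLUMN].value_counts(dropna=False)
n_missing = int(preview[CHD_COLUMN].isna().sum())

print(chd_counts)
print(f"Missing CHD answers: {n_missing} of {len(preview)}")
```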


## Filtering