# Heart Risk Factors Analysis

In [2]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

This notebook examines the heart_disease_health_indicators_BRFSS2015 [dataset](https://www.kaggle.com/datasets/waqi786/heart-attack-dataset) found on Kaggle.

As someone with relatives working in the medical industry, I want to use my
Computer Science skills to help medical professionals.

Some goals for this notebook are as follows:
1. Conduct exploratory data analysis on the dataset to highlight correlations in the data
2. Use regression models to predict the risk of a heart attack

Other models may be used in the future!

## Exploratory Data Analysis

In [3]:
# Reading in the Kaggle Dataset
filename = os.path.join(os.getcwd(), "data", "heart_disease.csv")
df = pd.read_csv(filename, header=0)
df

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253675,0.0,1.0,1.0,1.0,45.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,5.0,0.0,1.0,5.0,6.0,7.0
253676,0.0,1.0,1.0,1.0,18.0,0.0,0.0,2.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,1.0,0.0,11.0,2.0,4.0
253677,0.0,0.0,0.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,5.0,2.0
253678,0.0,1.0,0.0,1.0,23.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,7.0,5.0,1.0


In [16]:
df.dtypes

Gender                   object
Age                       int64
Blood Pressure (mmHg)     int64
Cholesterol (mg/dL)       int64
Has Diabetes             object
Smoking Status           object
Chest Pain Type          object
Treatment                object
dtype: object

Of the object-typed columns, Has Diabetes and Gender can be converted into numerical representations (0, 1).
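A minimal sketch of that binary encoding, using a small toy frame rather than the real `df`. The `'Yes'`/`'No'` labels for Has Diabetes are an assumption (its unique values are not shown in this notebook), and which category gets 0 or 1 is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for df; the "Yes"/"No" labels are assumed, not taken from the data.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Has Diabetes": ["Yes", "No", "Yes"],
})

# Explicit mappings keep the meaning of 0 and 1 documented in one place.
gender_codes = {"Male": 0, "Female": 1}
diabetes_codes = {"No": 0, "Yes": 1}

encoded = toy.copy()
encoded["Gender"] = encoded["Gender"].map(gender_codes)
encoded["Has Diabetes"] = encoded["Has Diabetes"].map(diabetes_codes)
print(encoded)
```

Any label missing from a mapping becomes `NaN` after `map`, so checking `encoded.isna().sum()` afterwards is a cheap way to catch unexpected values.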

In [23]:
for col in df:
    print("Column: ", col)
    print(df[col].unique())

Column:  Gender
['Male' 'Female']
Column:  Age
[70 55 42 84 86 66 33 73 63 88 69 78 89 71 30 77 76 74 45 34 61 52 49 81
 39 32 46 67 41 35 80 85 57 79 62 48 36 64 60 50 72 65 40 51 82 75 31 43
 37 38 54 44 59 58 53 83 56 87 47 68]
Column:  Blood Pressure (mmHg)
[181 103  95 106 187 125 182 115 174 154 133 165 153 110 107 112  91 101
 141 124 109 143 197 149 104 159 193 135 190 129 126 134 172 179 111 192
 180 166 119 139 116 191 120 158 138 198 162 142 169 178 196 164 161 168
 113 185 148 171 176 183 147  97 175 105 145  98 128 195 146 167 163 144
 156 122 152 136 151 150 114 127 186 184 137  96 188 100 173 199 132 160
 194  99  94 170 140 130 123 117 189 157 131 121 118 102  93 108  90  92
 155]
Column:  Cholesterol (mg/dL)
[262 253 295 270 296 271 288 286 254 150 236 171 215 182 242 179 227 259
 273 212 222 285 266 209 157 191 268 161 274 248 205 280 255 188 246 297
 181 249 258 235 201 204 198 200 186 217 176 233 216 210 172 165 190 183
 156 229 294 195 220 243 265 283 225 234 166 1

The other categorical columns, Chest Pain Type and Treatment, can either be split into separate indicator columns with One-Hot Encoding
or mapped to numerical values.
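Both options can be sketched on a toy frame. The category labels below are hypothetical placeholders, since the actual values of these two columns are not shown here.

```python
import pandas as pd

# Toy stand-in for df; the category labels are hypothetical.
toy = pd.DataFrame({
    "Chest Pain Type": ["Typical Angina", "Non-anginal Pain", "Typical Angina"],
    "Treatment": ["Medication", "Lifestyle Changes", "Medication"],
})
cat_cols = ["Chest Pain Type", "Treatment"]

# Option 1: one-hot encoding, one 0/1 indicator column per category.
one_hot = pd.get_dummies(toy, columns=cat_cols, dtype=int)

# Option 2: integer codes, one number per category (alphabetical order here).
label_encoded = toy.copy()
for col in cat_cols:
    label_encoded[col] = label_encoded[col].astype("category").cat.codes

print(one_hot)
print(label_encoded)
```

Integer codes imply an ordering between categories that a regression model will treat as meaningful, so one-hot encoding is usually the safer choice for unordered categories like these.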

Age, Blood Pressure, and Cholesterol should be binned into ranges:
- Age: 30 to 40, 50 to 60, 70+
- Blood Pressure:
- Cholesterol:
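One way to build such ranges is `pd.cut`. The sketch below bins Age only, and the decade-wide edges are an assumption chosen to cover the observed 30 to 89 span without gaps; they are not the final ranges for this notebook. Blood Pressure and Cholesterol could be binned the same way once their cut points are decided.

```python
import pandas as pd

# Toy ages within the span seen in the data (30 to 89).
ages = pd.Series([30, 39, 40, 55, 69, 70, 89], name="Age")

# Assumed edges: left-closed decades, with everything from 70 up in one bin.
age_edges = [30, 40, 50, 60, 70, 90]
age_labels = ["30-39", "40-49", "50-59", "60-69", "70+"]

# right=False makes each bin include its lower edge and exclude its upper edge.
age_group = pd.cut(ages, bins=age_edges, labels=age_labels, right=False)
print(pd.concat([ages, age_group.rename("Age Group")], axis=1))
```

Values outside the outermost edges come back as `NaN`, so the first and last edges need to cover the full range of the column.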