# Data Processing and Analysis of the UCI Heart Disease Dataset
##### Authors: Jourdan Hourican, Lenna Wolffe, Madison Tarasuik, John Beliveau

The purpose of this notebook is to import the dataset, clean the data by replacing each missing value with the mean of its column, and extract the features and target for analysis with k-means clustering.
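
The contents of `data_processing.py` are not shown in this notebook, so the following is only a minimal sketch of the cleaning and clustering steps described above, not the script itself. The function name `impute_and_cluster`, the toy DataFrame, the standardization step, and the parameter values (`n_clusters`, `n_init`, `seed`) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def impute_and_cluster(features, n_clusters=3, seed=0):
    """Mean-impute missing values, standardize, and run k-means.

    Hypothetical helper; returns (imputed DataFrame, array of cluster labels).
    """
    # Replace each NaN with the mean of its own column.
    imputer = SimpleImputer(strategy="mean")
    imputed = pd.DataFrame(
        imputer.fit_transform(features),
        columns=features.columns,
        index=features.index,
    )

    # Assumption: standardize so that large-valued columns (e.g. chol)
    # do not dominate the Euclidean distances used by k-means.
    scaled = StandardScaler().fit_transform(imputed)

    # Fixed random_state makes the cluster assignment reproducible.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(scaled)
    return imputed, labels


# Toy stand-in for the real features (values are illustrative only).
toy = pd.DataFrame(
    {
        "age": [63, 67, 41, 56, 57, 44],
        "chol": [233, 286, 204, 236, 354, 263],
        "ca": [0, 3, np.nan, 0, 0, np.nan],
    }
)
imputed, labels = impute_and_cluster(toy, n_clusters=2)
print(imputed["ca"].tolist())
print(labels)
```

With the real data, the features would presumably come from `fetch_ucirepo(id=45).data.features` in the `ucimlrepo` package rather than from a hand-built DataFrame. Whether the original script scales the features before clustering is not stated here, so that step should be treated as a design choice rather than a description of the script.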

### Dataset
https://archive.ics.uci.edu/dataset/45/heart+disease

### Key Research Questions:

1. Are there specific combinations of risk factors that form clusters in this dataset?
2. Can we identify clusters based on clinical measurements (e.g., cholesterol levels, resting blood pressure, maximum heart rate) and their association with heart disease?

In [3]:
%%capture
# This hides the output of the cell

# Install necessary packages
%pip install ucimlrepo
%pip install scikit-learn

# Suppress a known KMeans warning so it is not displayed
import warnings
warnings.filterwarnings("ignore", message="KMeans is known to have a memory leak on Windows with MKL")

In [2]:
# Run the Python script that imports, cleans, and processes the dataset, then runs k-means clustering on it.
%run data_processing.py

test completed
/nPerfomring K-means Clustering
/nAnalyzing Clustering Results
/nCluster Summary Statistics:
               age       sex        cp    trestbps        chol       fbs  \
Cluster                                                                    
0        58.141509  0.839623  3.773585  135.537736  251.537736  0.207547   
1        49.084746  0.932203  2.762712  127.669492  229.364407  0.118644   
2        57.468354  0.088608  2.924051  132.531646  266.075949  0.113924   

          restecg     thalach     exang   oldpeak     slope        ca  \
Cluster                                                                 
0        1.273585  131.264151  0.688679  1.941509  2.000000  1.261059   
1        0.694915  164.796610  0.118644  0.562712  1.322034  0.347599   
2        1.050633  151.531646  0.151899  0.541772  1.481013  0.367089   

             thal  
Cluster            
0        6.195606  
1        4.500000  
2        3.123218  
/nUnit test completed.


### Discussion

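By running the clustering script, we can identify clusters based on clinical measurements, and these clusters seem to provide further insight into the association with heart disease. For example, Cluster 0 appears to have the most severe cardiovascular symptoms and risk factors: it has the lowest mean maximum heart rate (thalach ≈ 131), the highest rate of exercise-induced angina (exang ≈ 0.69), and the largest ST depression (oldpeak ≈ 1.94). Cluster 1, the youngest group (mean age ≈ 49), shows better cardiovascular fitness and lower risk factors, with the highest mean maximum heart rate (thalach ≈ 165) and a low rate of exercise-induced angina (exang ≈ 0.12). Cluster 2 is predominantly female (mean sex ≈ 0.09) and has the highest mean cholesterol (chol ≈ 266), but its exang and oldpeak values are close to those of Cluster 1.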
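
The summary table above does not include the target column, so the association with heart disease is inferred from the risk factors and not measured directly. One way to check it, sketched below, is to compute the fraction of patients with a positive diagnosis in each cluster. The helper name `disease_rate_by_cluster` and the toy labels and targets are hypothetical; the sketch also assumes the dataset's usual target encoding, where 0 means no disease and 1 to 4 indicate disease.

```python
import pandas as pd


def disease_rate_by_cluster(labels, target):
    """Return the fraction of rows with heart disease in each cluster.

    Hypothetical helper. Assumes target == 0 means no disease and any
    positive value means disease is present.
    """
    has_disease = pd.Series(target).gt(0)
    clusters = pd.Series(labels, name="Cluster")
    # The mean of a boolean Series is the proportion of True values.
    return has_disease.groupby(clusters).mean()


# Toy values for illustration only; these are not results from the dataset.
toy_labels = [0, 0, 0, 1, 1, 2, 2, 2]
toy_target = [2, 1, 0, 0, 0, 0, 1, 0]
rates = disease_rate_by_cluster(toy_labels, toy_target)
print(rates)
```

Applied to the real cluster labels and target, a clearly higher rate in Cluster 0 than in Cluster 1 would support the interpretation given above.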