Develop a predictive model to determine likelihood of heart disease given common medical metrics

knishina/heart_attack

heart_attack

Summary.

This project takes data consisting of attributes that are hypothesized to contribute to heart disease and uses it to build a predictive model for heart disease. The data was obtained through Kaggle. The following modules were used to analyze and visualize the data and to build the model: pandas for data munging; matplotlib and seaborn for data visualization; and sklearn and eli5 for model building and its associated analyses.

The data have fourteen attributes. They include:

  • age: age in years
  • sex: 1 = male; 0 = female
  • cp: chest pain type
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholesterol in mg/dl
  • fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
  • restecg: resting electrocardiographic results
  • thalach: maximum heart rate achieved
  • exang: exercise induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment
  • ca: number of major vessels (0-3) colored by fluoroscopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  • target: 1 or 0 (presumably 1 = heart disease; 0 = no heart disease)
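
As a minimal sketch of how the fourteen columns above might be read and type-checked, the snippet below parses CSV text using only the standard library. The helper name `parse_rows` is hypothetical, the two sample rows are made up for illustration, and the choice to treat `oldpeak` as the only non-integer field is an assumption. The project itself used pandas, where `pandas.read_csv` would be the usual equivalent.

```python
import csv
import io

# Column names as listed above.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]


def parse_rows(text):
    """Parse CSV text into a list of dicts with numeric values.

    Assumption: every field is integer-coded except oldpeak, which is a float.
    """
    reader = csv.DictReader(io.StringIO(text))
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for raw in reader:
        row = {name: (float(raw[name]) if name == "oldpeak" else int(raw[name]))
               for name in COLUMNS}
        rows.append(row)
    return rows


# Made-up sample rows, for illustration only (not taken from the dataset).
SAMPLE = (
    ",".join(COLUMNS) + "\n"
    "60,1,2,130,240,0,1,150,0,1.5,1,0,3,1\n"
    "45,0,0,120,210,0,0,165,1,0.0,2,1,7,0\n"
)

rows = parse_rows(SAMPLE)
print(len(rows), rows[0]["age"], rows[0]["oldpeak"])
```

Checking the header up front means a renamed or missing column fails with a clear error instead of surfacing later during model building.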

Data Analysis and Visualization.

Of the fourteen columns in the heart disease data, thirteen are features covering commonly available patient measurements. The fourteenth, target, records whether the individual has heart disease.

Three main questions were investigated: is there a trend of heart disease with any of the following?

  • Age
  • Gender
  • Chest pain type

Before any of those questions were addressed, a heatmap was generated to survey possible correlations.


[Figure: data heatmap]

Data Heatmap

All categories were set against each other, producing a heatmap that indicates positive and negative correlations. Considering the various columns in reference to target, there are a few notable positive and negative relationships.

  • positive relationships include: cp, thalach, and slope.
  • negative relationships include: age, sex, exang, oldpeak, ca, and thal.
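
To make the sign of these relationships concrete, here is a minimal sketch of the Pearson correlation coefficient, which is the default measure returned by pandas `DataFrame.corr()` and presumably what the heatmap displays. The `exang` and `target` values below are made up for illustration and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up binary columns: most rows with exang = 1 have target = 0.
exang = [0, 0, 0, 1, 1, 1, 0, 1]
target = [1, 1, 1, 0, 0, 0, 0, 1]

r = pearson(exang, target)
print(round(r, 2))  # negative: exang = 1 tends to go with target = 0
```

A negative value means the two columns tend to move in opposite directions. With a binary feature, the sign therefore depends on which group is coded as 1.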



Age as an indicator for heart disease

This plot considers age and its role as an indicator for heart disease. The legend indicates heart disease (1) v. no heart disease (0). In this case, the bar graph indicates that there is little to no correlation between age and heart disease. This is consistent with the heatmap, which shows only a weak negative correlation of -0.23.

[Figure: heart disease by age]



Gender as an indicator for heart disease

This plot considers sex and its role as an indicator for heart disease. First, the data is skewed toward males: the ratio of males to females in this study is 2:1. Second, the female population has a higher rate of heart disease than the male population. Because sex is coded 1 = male and 0 = female, the heatmap shows this as a negative correlation with target (-0.28). The magnitude is modest, so sex on its own is not a strong indicator of heart disease.

[Figure: heart disease by gender]
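
Because the groups differ in size, it helps to compare rates rather than raw counts. The sketch below computes the fraction of positive targets per group. The function name `disease_rate_by_group` is hypothetical, and the sample is made up to mirror a 2:1 male-to-female ratio; it does not reproduce the dataset's actual rates.

```python
from collections import defaultdict


def disease_rate_by_group(groups, targets):
    """Return {group: fraction of rows in that group with target == 1}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, target in zip(groups, targets):
        totals[group] += 1
        positives[group] += target
    return {group: positives[group] / totals[group] for group in totals}


# Made-up sample: six males (1) and three females (0), i.e. a 2:1 ratio.
sex = [1, 1, 1, 1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 0, 0, 0, 1, 1, 0]

rates = disease_rate_by_group(sex, target)
print(rates)
```

In this made-up sample both groups contain two positive cases, yet the female rate is twice the male rate, which is the kind of pattern a count-based bar graph can hide.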



Chest pain as an indicator for heart disease

This plot considers chest pain type (cp) as an indicator for heart disease. Where cp is 1 or higher, the incidence of heart disease is high. Where cp is 0, which indicates no chest pain, the data correlate strongly with the absence of heart disease. According to the heatmap, the correlation between cp and target is 0.43, a positive correlation. That means that cp is likely an indicator of heart disease.

[Figure: heart disease by chest pain type]



Model Building.

Three models were trained and tested: linear regression, logistic regression, and a support vector machine (SVM). Of the three, the linear regression model had poor predictive performance, so detailed results are given only for the other two.
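
To illustrate what a logistic regression model does, here is a minimal from-scratch sketch with a single feature, fitted by batch gradient descent. It is not the project's implementation, which used sklearn. The data, learning rate, and epoch count are made-up assumptions chosen so the example is easy to follow.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=200):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w, b, xs):
    """Predict class 1 when the modeled probability is at least 0.5."""
    return [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up, linearly separable data: one standardized feature, binary target.
x_train = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
x_test = [-1.2, -0.3, 0.4, 1.8]
y_test = [0, 0, 1, 1]

w, b = fit_logistic(x_train, y_train)
train_acc = accuracy(y_train, predict(w, b, x_train))
test_acc = accuracy(y_test, predict(w, b, x_test))
print(train_acc, test_acc)
```

Unlike linear regression, the sigmoid keeps the output between 0 and 1, so it can be read as a probability of the positive class. This is one reason logistic regression tends to suit a binary target better.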

Logistic Regression

  • Accuracy score: Train = 0.864; Test = 0.885.
  • Classification report:

[Figure: classification report for logistic regression]

SVM

  • Accuracy score: Train = 0.855; Test = 0.869.
  • Classification report:

[Figure: classification report for SVM]
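
The accuracy scores and classification reports above summarize how predictions compare with true labels. As a minimal sketch, the quantities in such a report can be computed from the four confusion counts. The labels below are made up, and the helper name `classification_metrics` is hypothetical; in sklearn the usual equivalents are `sklearn.metrics.accuracy_score` and `sklearn.metrics.classification_report`.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For a medical screening task, recall on the heart disease class is often worth reading alongside accuracy, since a missed case and a false alarm carry different costs.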

Logistic regression is the better of the two models, with higher accuracy on both the train and test sets. Below are the weight per feature and the ROC curve for the logistic model.

[Figure: logistic regression feature weights]

[Figure: logistic regression ROC curve]
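
For readers unfamiliar with ROC curves, the sketch below shows how the curve's points and the area under it (AUC) can be derived from predicted probabilities by sweeping the decision threshold. The labels and scores are made up, and the helper names are hypothetical; in sklearn the usual equivalents are `sklearn.metrics.roc_curve` and `sklearn.metrics.roc_auc_score`.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points for a ROC curve."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Sort by descending score, then lower the threshold one sample at a time.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Tied scores share one threshold, so emit a point only after the tie.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(round(area, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. Equivalently, it is the fraction of positive/negative pairs in which the positive case receives the higher score.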


License.

This project is licensed under the MIT License - see the LICENSE file for details.
