# Predicting heart disease using machine learning
This notebook uses several Python-based machine learning and data science libraries to build a model that predicts heart disease. To run it, first create the required environment with: `conda env create --prefix ./env -f environment.yml`

The following approach will be taken:
1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation

## 1. Problem Definition
Given a clinical dataset, can we predict if someone has heart disease?


## 2. Data
Column descriptions:

* Id: unique ID for each patient
* Age: age of the patient in years
* Origin: place of study
* Sex: Male/Female
* Cp: chest pain type (typical angina, atypical angina, non-anginal, asymptomatic)
* Trestbps: resting blood pressure (in mm Hg on admission to the hospital)
* Chol: serum cholesterol in mg/dl
* Fbs: whether fasting blood sugar > 120 mg/dl
* Restecg: resting electrocardiographic results (normal, stt abnormality, lv hypertrophy)
* Thalach: maximum heart rate achieved
* Exang: exercise-induced angina (True/False)
* Oldpeak: ST depression induced by exercise relative to rest
* Slope: the slope of the peak exercise ST segment
* Ca: number of major vessels (0-3) colored by fluoroscopy
* Thal: normal, fixed defect, or reversible defect
* Num: the predicted attribute

The original data comes from Kaggle: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data


## 3. Evaluation
Check whether the model can reach 95% accuracy when predicting whether a patient has heart disease.
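
The 95% goal can be phrased as a simple check on predicted versus true labels. The sketch below is only an illustration: `accuracy`, `meets_target`, and `TARGET_ACCURACY` are hypothetical names, and the labels are made-up toy values, not results from this dataset. scikit-learn's `accuracy_score` computes the same fraction.

```python
TARGET_ACCURACY = 0.95  # the goal stated in this section


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def meets_target(y_true, y_pred, target=TARGET_ACCURACY):
    """Return True when accuracy reaches the target threshold."""
    return accuracy(y_true, y_pred) >= target


# Toy labels for illustration only (1 = heart disease, 0 = no heart disease).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

print(accuracy(y_true, y_pred))      # 9 of 10 correct -> 0.9
print(meets_target(y_true, y_pred))  # False, since 0.9 < 0.95
```

Accuracy alone can hide poor performance on the rarer class, which is why the notebook also imports precision, recall, and F1 metrics.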


## 4. Features
Creating a data dictionary
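
One way to keep the column descriptions from section 2 next to the code is a plain Python dictionary. This is a minimal sketch, assuming the lowercase column names that appear in the `df.head()` output of this notebook. The names `data_dictionary`, `describe`, and `undocumented` are illustrative, and mapping `target` to the `Num` description is an assumption.

```python
# Descriptions are taken from the column list in section 2.
data_dictionary = {
    "age": "age of the patient in years",
    "sex": "sex of the patient (Male/Female)",
    "cp": "chest pain type (typical angina, atypical angina, "
          "non-anginal, asymptomatic)",
    "trestbps": "resting blood pressure (in mm Hg on admission to the hospital)",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "whether fasting blood sugar > 120 mg/dl",
    "restecg": "resting electrocardiographic results",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise-induced angina",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "the slope of the peak exercise ST segment",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "normal, fixed defect, or reversible defect",
    # Assumption: `target` in the loaded CSV plays the role of `Num`.
    "target": "the predicted attribute",
}


def describe(column):
    """Look up a column description, with a fallback for unknown columns."""
    return data_dictionary.get(column, "no description available")


def undocumented(columns):
    """Return the columns that have no entry in the data dictionary."""
    return [c for c in columns if c not in data_dictionary]


print(describe("chol"))
print(undocumented(["age", "chol", "Unnamed: 0"]))
```

Calling `undocumented(df.columns)` after loading the data would flag any column that still needs a description.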


In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [11]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay

In [12]:
df = pd.read_csv("heart-disease.csv")
df.head()

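| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |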
