# Classifying Heart Disease

In this project, we'll try to classify the presence of heart disease in an individual using a dataset collected by the Cleveland Clinic Foundation.

We'll be using the [Heart Disease Data Set](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) from the UCI Machine Learning Repository. The data comes from the Cleveland Clinic Foundation, which recorded patient characteristics such as age and chest pain type so that the presence of heart disease could be predicted. This is a prime example of how machine learning can help solve problems that have a real impact on people's lives.

> Note: The dataset has already been partially cleaned. The original dataset has a multi-class target (values 0 to 4); in this version it has been reduced to the binary `present` column.
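
To make the note concrete, here is a minimal sketch of how a multi-class target could be collapsed into a binary one. It assumes the original target column is named `num` (as in the UCI documentation) and uses a small made-up sample; it is not necessarily how this copy of the dataset was prepared.

```python
import pandas as pd

# Hypothetical sample of the original multi-class target (`num`, 0 to 4).
original = pd.DataFrame({"num": [0, 2, 1, 0, 0, 3, 4]})

# Collapse to a binary label: 0 stays 0, any value from 1 to 4 becomes 1.
original["present"] = (original["num"] > 0).astype(int)

print(original)
```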

In [19]:
#importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

In [32]:
heart = pd.read_csv('heart_disease.csv')
heart.head()

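|   | Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | present |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 1 | 2 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 1 |
| 2 | 3 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 3 | 4 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 4 | 5 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |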


In [33]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  303 non-null    int64  
 1   age         303 non-null    int64  
 2   sex         303 non-null    int64  
 3   cp          303 non-null    int64  
 4   trestbps    303 non-null    int64  
 5   chol        303 non-null    int64  
 6   fbs         303 non-null    int64  
 7   restecg     303 non-null    int64  
 8   thalach     303 non-null    int64  
 9   exang       303 non-null    int64  
 10  oldpeak     303 non-null    float64
 11  slope       303 non-null    int64  
 12  ca          303 non-null    object 
 13  thal        303 non-null    object 
 14  present     303 non-null    int64  
dtypes: float64(1), int64(12), object(2)
memory usage: 35.6+ KB


In [34]:
heart[['Unnamed: 0', 'ca', 'thal']]

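|   | Unnamed: 0 | ca | thal |
|---|---|---|---|
| 0 | 1 | 0.0 | 6.0 |
| 1 | 2 | 3.0 | 3.0 |
| 2 | 3 | 2.0 | 7.0 |
| 3 | 4 | 0.0 | 3.0 |
| 4 | 5 | 0.0 | 3.0 |
| ... | ... | ... | ... |
| 298 | 299 | 0.0 | 7.0 |
| 299 | 300 | 2.0 | 7.0 |
| 300 | 301 | 1.0 | 7.0 |
| 301 | 302 | 1.0 | 3.0 |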


## Cleaning the Dataset Further

Even though the dataset was partially cleaned, there are some variables in the dataset that need to be addressed.

1. The first column doesn't seem to carry any information and is probably a leftover artifact of an earlier attempt to clean and archive the dataset. The column will be removed entirely.
2. The second issue is the `ca` and `thal` columns. As per the dataset's official documentation:
    - `ca`
        - entered as an 'integer' in original dataset; is instead dtype 'object' (string)
        - "number of major vessels (0-3) colored by flourosopy"
        - last observation seems to suggest that there are errant entries (`?`)
    - `thal`
        - considered 'categorical' in dataset;
            - 3 = normal
            - 6 = fixed defect
            - 7 = reversable defect

Let's clean up these columns.
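
As an aside, both issues could also be handled when the file is read, rather than afterwards. The sketch below is only an illustration under assumptions: it uses a small made-up CSV in place of `heart_disease.csv`, and it assumes that `?` is the only placeholder for missing values.

```python
import io
import pandas as pd

# Hypothetical stand-in for heart_disease.csv: a few rows with the same quirks
# (an unnamed leading column and '?' placeholders).
sample_csv = io.StringIO(
    ",age,ca,thal,present\n"
    "1,63,0.0,6.0,0\n"
    "2,67,3.0,3.0,1\n"
    "3,38,?,3.0,0\n"
)

# index_col=0 uses the leftover column as the index instead of keeping it as data;
# na_values="?" turns the placeholder into NaN so the columns parse as floats.
sample = pd.read_csv(sample_csv, index_col=0, na_values="?")

print(sample.dtypes)
print(sample["ca"].isna().sum())
```

With this approach `ca` and `thal` arrive as `float64` columns, and the missing entries show up in `isna()` instead of hiding as strings.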

In [13]:
#remove the unnecessary column
heart.drop(columns=['Unnamed: 0'], inplace=True)
heart.head()

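|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | present |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |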


Let's see what all the values are in the `ca` column.

In [55]:

heart['ca'].value_counts()

0.0    176
1.0     65
2.0     38
3.0     20
?        4
Name: ca, dtype: int64

In [56]:
heart['present'].value_counts()

0    164
1    139
Name: present, dtype: int64

In [54]:
#try to convert 'ca' to integers (this fails because of the '?' entries)
heart['ca'].astype('float').astype('int')
heart.describe()

ValueError: could not convert string to float: '?'
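
The error comes from the four `?` entries seen in the `value_counts()` output. One possible way around it, sketched here on a small made-up sample and not on the real data, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN`. Dropping the affected rows afterwards is an assumption made for illustration; imputing them would be another option.

```python
import pandas as pd

# Hypothetical sample mimicking the 'ca' column: numeric strings plus a '?' placeholder.
sample = pd.DataFrame({"ca": ["0.0", "3.0", "2.0", "?", "1.0"]})

# errors="coerce" converts anything unparseable (here '?') to NaN instead of raising.
sample["ca"] = pd.to_numeric(sample["ca"], errors="coerce")
print(sample["ca"].isna().sum())  # number of entries that could not be parsed

# One option (an assumption, not the author's choice): drop those rows, then cast to int.
cleaned = sample.dropna(subset=["ca"]).astype({"ca": int})
print(cleaned["ca"].tolist())
```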

## Exploratory Data Analysis

In [27]:
heart.describe()

|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | present |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.438944 | 0.679868 | 3.158416 | 131.689769 | 246.693069 | 0.148515 | 0.990099 | 149.607261 | 0.326733 | 1.039604 | 1.60066 | 0.458746 |
| std | 9.038662 | 0.467299 | 0.960126 | 17.599748 | 51.776918 | 0.356198 | 0.994971 | 22.875003 | 0.469794 | 1.161075 | 0.616226 | 0.49912 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 48.0 | 0.0 | 3.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 56.0 | 1.0 | 3.0 | 130.0 | 241.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 2.0 | 0.0 |
| 75% | 61.0 | 1.0 | 4.0 | 140.0 | 275.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 1.0 |


In [31]:
heart['present'].value_counts(normalize=True)

0    0.541254
1    0.458746
Name: present, dtype: float64

There are more cases where heart disease isn't observed in an individual (`present` = `0`, about 54%) than cases where heart disease was confirmed (`present` = `1`, about 46%).
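
This split also gives a useful reference point for any later classifier: a model that always predicts the majority class would already be right about 54% of the time. A minimal sketch of that arithmetic, using the counts printed by `value_counts()` above:

```python
# Class counts taken from the value_counts() output earlier in the notebook.
counts = {0: 164, 1: 139}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = counts[majority_class] / total
print(majority_class, round(baseline_accuracy, 4))
```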

In [37]:
# Checking potential predictors
heart.groupby("present").agg(
    {
        "age": "mean",
        "sex": "mean",
        "cp": "mean",
        "trestbps": "mean",
        "chol": "mean",
        "fbs": "mean",
        "restecg": "mean",
        "thalach": "mean",
        "exang": "mean",
        "oldpeak": "mean",
        "slope": "mean",
    }
)

| present | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.585366 | 0.560976 | 2.792683 | 129.25 | 242.640244 | 0.140244 | 0.835366 | 158.378049 | 0.140244 | 0.586585 | 1.408537 |
| 1 | 56.625899 | 0.820144 | 3.589928 | 134.568345 | 251.47482 | 0.158273 | 1.172662 | 139.258993 | 0.546763 | 1.574101 | 1.827338 |


In [45]:
heart.groupby("present").agg(pd.Series.mode)[['ca', 'thal']]


| present | ca | thal |
|---|---|---|
| 0 | 0.0 | 3.0 |
| 1 | 0.0 | 7.0 |
