# Using Logistic Regression for Classifying Heart Disease

## 1. Introduction

This is a guided project from Dataquest's course "Logistic Regression Modeling in Python".

The aim is to implement a logistic regression machine learning model on a sanitized version of a real-life [Heart Disease dataset](https://archive.ics.uci.edu/dataset/45/heart+disease) from the UC Irvine Machine Learning Repository. The data was donated by the Cleveland Clinic Foundation, which recorded various patient characteristics, such as age and chest pain. The model uses these characteristics to classify the presence of heart disease in an individual.

The dataset contains these attributes:
1. **age**: age in years.
2. **sex**: gender (1 = male; 0 = female).
3. **cp**: chest pain type:
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
4. **trestbps**: resting blood pressure (in mm Hg on admission to the hospital).
5. **chol**: serum cholesterol in mg/dl.
6. **fbs**: fasting blood sugar > 120 mg/dl (1 = true; 0 = false).
7. **restecg**: resting electrocardiographic results.
8. **thalach**: maximum heart rate achieved.
9. **exang**: exercise induced angina (1 = yes; 0 = no).
10. **oldpeak**: ST depression induced by exercise relative to rest.
11. **slope**: the slope of the peak exercise ST segment:
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
12. **ca**: number of major vessels (0-3) colored by fluoroscopy.
13. **thal**: 3 = normal; 6 = fixed defect; 7 = reversible defect.
14. **present** (the predicted attribute): diagnosis of heart disease:
    - Value 0: not present
    - Value 1: present

In [None]:
import pandas as pd

# Read the data into a dataframe
heart = pd.read_csv('heart_disease.csv')

## 2. Exploring the Dataset

In [None]:
# Display the first five rows of the dataframe
heart.head()

The column "Unnamed: 0" appears to just be an index, which means that it's redundant and we can get rid of it.

In [None]:
heart = heart.drop('Unnamed: 0', axis=1)

In [None]:
# Double check the columns and rows in the dataset
heart.info()

There are 14 columns and 303 rows in total.

As listed above, the columns **ca** and **thal** are categorical.

Despite having numerical values, the following columns are in fact categorical based on their descriptions in the data dictionary at the top: **sex**, **cp**, **fbs**, **exang**, and **slope**.
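
One way to make this distinction explicit is to cast such columns to pandas' `category` dtype. The following is a minimal sketch on a small made-up frame rather than the real dataset; the `demo` name and its values are hypothetical, and the same call could be applied to `heart` if desired.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real `heart` dataframe
demo = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "sex": [1, 1, 0, 1],
    "cp": [1, 4, 2, 3],
    "fbs": [1, 0, 0, 0],
    "exang": [0, 1, 0, 0],
    "slope": [3, 2, 1, 1],
})

# Columns the data dictionary describes as categories despite their numeric codes
categorical_cols = ["sex", "cp", "fbs", "exang", "slope"]

# Cast each listed column to 'category' so pandas treats the codes as labels
demo = demo.astype({col: "category" for col in categorical_cols})

print(demo.dtypes)
```

After the cast, `demo.describe()` would summarize only `age`, since category columns are no longer treated as numeric.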

In [None]:
# Let's check whether 'restecg' is categorical
heart['restecg'].unique()

The column **restecg** appears to be categorical as well.

In [None]:
# Double check for missing values.
heart.isna().sum()

There are no missing values.
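
Note that `isna()` only counts values pandas itself recognizes as missing. If a file marks gaps with a placeholder string such as `"?"` (an assumption for illustration, not something confirmed for this dataset), those entries would not be counted. A minimal sketch of an extra check, using hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy rows; the "?" entries are invented to illustrate the check
demo = pd.DataFrame({
    "age": [63, 67, 41],
    "ca": ["0.0", "?", "2.0"],
    "thal": ["6.0", "3.0", "?"],
})

# isna() reports no gaps because "?" is an ordinary string
total_na = int(demo.isna().sum().sum())
print(total_na)

# Count "?" entries in the non-numeric columns only
text_cols = demo.select_dtypes(exclude="number")
placeholder_counts = (text_cols == "?").sum()
print(placeholder_counts)
```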

### Exploratory Data Analysis: Descriptive Statistics

In [None]:
# Display the descriptive statistics for the 'heart' dataframe
heart.describe()

In [None]:
# Take a look at the descriptive statistics for the categorical columns stored as objects.
heart.describe(include=['object'])

### Exploratory Data Analysis: Visualizations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

ax = sns.boxplot(x="age", data=heart)
plt.show()
plt.clf()

In [None]:
ax = sns.countplot(x="sex", data=heart)
plt.show()
plt.clf()

In [None]:
ax = sns.countplot(x="cp", data=heart)
plt.show()
plt.clf()

In [None]:
ax = sns.boxplot(x="trestbps", data=heart)
plt.show()
plt.clf()

In [None]:
ax = sns.boxplot(x="chol", data=heart)
plt.show()
plt.clf()

In [None]:
ax = sns.countplot(x="fbs", data=heart)
plt.show()
plt.clf()

For fasting blood sugar, the majority are in the less than 120 mg/dl category (value 0).
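
To put a number on a "majority" like this, `value_counts(normalize=True)` returns the share of rows in each category. A minimal sketch on hypothetical values (the `fbs_demo` series is made up and does not reflect the real proportions):

```python
import pandas as pd

# Hypothetical values standing in for heart['fbs']
fbs_demo = pd.Series([0, 0, 1, 0, 0, 0, 1, 0], name="fbs")

# normalize=True returns the proportion of rows per category instead of raw counts
shares = fbs_demo.value_counts(normalize=True)
print(shares)
```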

In [None]:
ax = sns.countplot(x="restecg", data=heart)
plt.show()
plt.clf()

In [None]:
ax = sns.boxplot(x="thalach", data=heart)
plt.show()
plt.clf()

In [None]:
ax = sns.countplot(x="exang", data=heart)
plt.show()
plt.clf()
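
The plotting cells above repeat the same pattern for each column: a box plot for continuous columns and a count plot for categorical ones. As a sketch of one way to condense that pattern, the hypothetical helper `plot_columns` below draws all requested columns on a single row of axes. The helper name, the figure size, and the toy `demo` frame are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_columns(df, continuous, categorical):
    """Draw a box plot per continuous column and a count plot per categorical column."""
    columns = list(continuous) + list(categorical)
    # squeeze=False keeps `axes` two-dimensional even when there is only one column
    fig, axes = plt.subplots(1, len(columns), figsize=(4 * len(columns), 3), squeeze=False)
    for ax, col in zip(axes[0], columns):
        if col in continuous:
            sns.boxplot(x=col, data=df, ax=ax)
        else:
            sns.countplot(x=col, data=df, ax=ax)
        ax.set_title(col)
    fig.tight_layout()
    return fig


# Hypothetical toy rows standing in for the real `heart` dataframe
demo = pd.DataFrame({
    "age": [63, 67, 41, 56, 57],
    "thalach": [150, 108, 172, 178, 163],
    "sex": [1, 1, 0, 1, 0],
    "exang": [0, 1, 0, 0, 1],
})

fig = plot_columns(demo, continuous=["age", "thalach"], categorical=["sex", "exang"])
```

In a notebook with `%matplotlib inline`, the figure is rendered at the end of the cell; elsewhere, calling `plt.show()` displays it.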