
# Heart Attack Analysis & Prediction

The purpose of this notebook is to practice and refresh my data science skills in Python.

For this notebook, I will be analyzing the Heart Attack Analysis & Prediction dataset found on Kaggle.

https://www.kaggle.com/ronitf/heart-disease-uci?select=heart.csv

## Data Description
* age: Age of patient
* sex: 1 = male, 0 = female
* cp: Chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic; note that this file codes cp as 0 - 3)
* trestbps: Resting blood pressure (mmHg)
* chol: Cholesterol in mg/dl fetched via BMI sensor
* fbs: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
* restecg: Resting Electrocardiographic results (0 = normal, 1 = having ST-T wave abnormality [T wave inversions and/or ST elevation or depression of > 0.05 mV], 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
* thalach: Maximum heart rate achieved
* exang: Exercise induced angina (1 = yes, 0 = no)
* oldpeak: ST depression induced by exercise relative to rest
* slope: The slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping)
* ca: Number of major vessels colored by fluoroscopy (0 - 3; this file also contains the value 4)
* thal: 1 = normal; 2 = fixed defect; 3 = reversible defect (this file also contains the value 0)
* target: 0 = less chance of having a heart attack, 1 = more chance of a heart attack


## Load Libraries & Data

In [1]:
import pandas as pd
import numpy as np
import sklearn
import plotly
import plotly.graph_objects as go
import plotly.express as px

In [2]:
### Load Data
data = pd.read_csv('https://raw.githubusercontent.com/iscarff123/Practice/main/heart.csv')

In [3]:
data

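|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |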


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       303 non-null    int64   
 1   sex       303 non-null    category
 2   cp        303 non-null    int64   
 3   trestbps  303 non-null    int64   
 4   chol      303 non-null    int64   
 5   fbs       303 non-null    int64   
 6   restecg   303 non-null    int64   
 7   thalach   303 non-null    int64   
 8   exang     303 non-null    int64   
 9   oldpeak   303 non-null    float64 
 10  slope     303 non-null    int64   
 11  ca        303 non-null    int64   
 12  thal      303 non-null    int64   
 13  target    303 non-null    int64   
dtypes: category(1), float64(1), int64(12)
memory usage: 31.3 KB


Convert the coded variables to the categorical dtype.

In [13]:
### Columns that hold category codes rather than measurements
categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']

### Convert them all to the pandas category dtype in one call
data = data.astype({col: 'category' for col in categorical_cols})

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       303 non-null    int64   
 1   sex       303 non-null    category
 2   cp        303 non-null    category
 3   trestbps  303 non-null    int64   
 4   chol      303 non-null    int64   
 5   fbs       303 non-null    category
 6   restecg   303 non-null    category
 7   thalach   303 non-null    int64   
 8   exang     303 non-null    category
 9   oldpeak   303 non-null    float64 
 10  slope     303 non-null    category
 11  ca        303 non-null    category
 12  thal      303 non-null    category
 13  target    303 non-null    category
dtypes: category(9), float64(1), int64(4)
memory usage: 15.8 KB
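
Once a column is categorical, its integer codes can be given readable labels, which makes later plots easier to interpret. A minimal sketch, assuming the coding from the data description (1 = male, 0 = female); `toy` is a made-up stand-in for `data['sex']`, not the real column:

```python
import pandas as pd

# Hypothetical stand-in for data['sex'], using the coding 1 = male, 0 = female
toy = pd.Series([1, 0, 1, 1, 0], name='sex').astype('category')

# rename_categories relabels each category; the row order and dtype stay the same
labeled = toy.cat.rename_categories({0: 'female', 1: 'male'})

print(labeled.value_counts())
```

The same call on the real frame would be `data['sex'].cat.rename_categories({0: 'female', 1: 'male'})`.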


## Data Exploration

### Summary Statistics

In [15]:
data.describe()

|  | age | trestbps | chol | thalach | oldpeak |
|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 131.623762 | 246.264026 | 149.646865 | 1.039604 |
| std | 9.082101 | 17.538143 | 51.830751 | 22.905161 | 1.161075 |
| min | 29.0 | 94.0 | 126.0 | 71.0 | 0.0 |
| 25% | 47.5 | 120.0 | 211.0 | 133.5 | 0.0 |
| 50% | 55.0 | 130.0 | 240.0 | 153.0 | 0.8 |
| 75% | 61.0 | 140.0 | 274.5 | 166.0 | 1.6 |
| max | 77.0 | 200.0 | 564.0 | 202.0 | 6.2 |


1    207
0     96
Name: sex, dtype: int64
0    143
2     87
1     50
3     23
Name: cp, dtype: int64
0    258
1     45
Name: fbs, dtype: int64
1    152
0    147
2      4
Name: restecg, dtype: int64
0    204
1     99
Name: exang, dtype: int64
2    142
1    140
0     21
Name: slope, dtype: int64
0    175
1     65
2     38
3     20
4      5
Name: ca, dtype: int64
2    166
3    117
1     18
0      2
Name: thal, dtype: int64
1    165
0    138
Name: target, dtype: int64
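
The cell that produced the counts above is not shown. Output of this shape can be generated by looping over the categorical columns and calling `value_counts()` on each. A sketch on a small made-up frame (`toy` and its values are illustrative, not the heart data):

```python
import pandas as pd

# Made-up frame standing in for `data`; values are illustrative only
toy = pd.DataFrame({
    'age': [63, 37, 41, 56],
    'sex': [1, 1, 0, 1],
    'target': [1, 1, 0, 0],
}).astype({'sex': 'category', 'target': 'category'})

# Pick out only the category-typed columns, then count each level
categorical_cols = toy.select_dtypes(include='category').columns
counts = {col: toy[col].value_counts() for col in categorical_cols}

for col, vc in counts.items():
    print(vc)
```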


There are more than twice as many men (207) as women (96), going by the coding in the data description. The target is fairly well balanced (165 vs. 138).

### Scatter Matrix

In [6]:
### Scatter plot of continuous variables
fig = px.scatter_matrix(data,
                        dimensions = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak'])
fig

We may have some outliers. Let's look at the same plots colored by sex and by target.

In [37]:
fig = px.scatter_matrix(data,
                        dimensions = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak'],
                        color = 'sex')
fig

In [40]:
fig = px.scatter_matrix(data,
                        dimensions = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak'],
                        color = 'target')
fig
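
To follow up on the possible outliers with something more concrete than the plots, one common convention (not used in this notebook so far) is Tukey's 1.5 × IQR rule. A sketch with a hypothetical helper `iqr_outliers` and a handful of values in the range of `chol`, for illustration only:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Illustrative cholesterol-like values; 564 sits far from the rest
toy_chol = pd.Series([204, 233, 236, 250, 354, 564], name='chol')
mask = iqr_outliers(toy_chol)

print(toy_chol[mask])
```

On the real frame this would be `data.loc[iqr_outliers(data['chol'])]`. The threshold `k = 1.5` is a convention, not a test; flagged rows still need a look before anything is dropped.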

Next step: grouped bar charts for the categorical variables.
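
A possible starting point for the grouped bar charts: count the rows for each pair of levels, then pass the counts to `px.bar` with `barmode='group'`. This is a sketch on a made-up frame; `toy`, its values, and the helper `grouped_bar` are assumptions for illustration, and pairing `cp` with `target` is just one choice.

```python
import pandas as pd

# Made-up frame standing in for `data`; values are illustrative only
toy = pd.DataFrame({
    'cp':     [0, 0, 1, 2, 2, 2],
    'target': [0, 0, 1, 1, 1, 0],
})

# Count rows for each (cp, target) pair; these are the bar heights
counts = toy.groupby(['cp', 'target']).size().reset_index(name='count')
print(counts)

def grouped_bar(counts: pd.DataFrame):
    """Draw one bar per (cp, target) pair, side by side within each cp level."""
    import plotly.express as px  # imported here so the counting step runs without plotly
    # Cast target to str so plotly uses discrete colors instead of a color scale
    return px.bar(counts.assign(target=counts['target'].astype(str)),
                  x='cp', y='count', color='target', barmode='group')
```

Calling `grouped_bar(counts)` returns a figure that can be displayed the same way as the scatter matrices.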

Also to do: study ANOVA.
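
As a study aid, the one-way ANOVA F statistic can be computed by hand: it is the ratio of between-group variance to within-group variance. A minimal sketch, assuming a continuous column compared across the levels of one categorical column; the helper `one_way_anova_f` and the `toy` values are made up for illustration.

```python
import pandas as pd

def one_way_anova_f(df: pd.DataFrame, value: str, group: str) -> float:
    """F statistic for a one-way ANOVA of `value` across the levels of `group`."""
    grand_mean = df[value].mean()
    groups = df.groupby(group)[value]
    # Between-group sum of squares: group size times squared gap between group mean and grand mean
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    # Within-group sum of squares: squared gap between each value and its own group mean
    ss_within = ((df[value] - groups.transform('mean')) ** 2).sum()
    df_between = groups.ngroups - 1
    df_within = len(df) - groups.ngroups
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up heart-rate-like values for three chest pain groups
toy = pd.DataFrame({
    'cp':      [0, 0, 0, 1, 1, 1, 2, 2, 2],
    'thalach': [151, 152, 153, 152, 153, 154, 156, 157, 158],
})

f_stat = one_way_anova_f(toy, 'thalach', 'cp')
print(f_stat)
```

A large F means the group means differ by more than the spread within groups would suggest. `scipy.stats.f_oneway` computes the same statistic and also returns a p-value.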