# Miniproject Joshua Kutschera (heart attack dataset)
This project is based on the [Heart Attack Analysis & Prediction Dataset](https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset/data) from Kaggle.com.

[Idea for chart](https://github.com/holoviz/holoviews/issues/3821)

## About this dataset

    age : Age of the patient

    sex : Sex of the patient, 1=male, 0=female
    
    cp : Chest pain type
            Value 1: typical angina
            Value 2: atypical angina
            Value 3: non-anginal pain
            Value 4: asymptomatic
            
    trtbps : resting blood pressure (in mm Hg)

    chol : cholesterol in mg/dl fetched via BMI sensor

    fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
    
    restecg : resting electrocardiographic results
            Value 0: normal
            Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
            Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
            
    thalachh : maximum heart rate achieved

    exng: exercise induced angina (1 = yes; 0 = no)

    oldpeak: ST depression induced by exercise relative to rest (float)

    slp: Slope of the peak exercise ST segment 
        1= upsloping
        2= flat
        3= downsloping
        
    caa: number of major vessels (0-3)

    thall: A blood disorder called thalassemia
        Value 0: NULL (dropped from the dataset previously)
        Value 1: fixed defect (no blood flow in some part of the heart)
        Value 2: normal blood flow
        Value 3: reversible defect (a blood flow is observed but it is not normal)
    
    output (target): 0 = less chance of heart attack, 1 = more chance of heart attack
    
## Dataset Overview
- The dataset has 303 rows and 14 columns.
- The 13 columns from `age` to `thall` are the features.
- `output` (1 or 0) is the target variable (1 = more chance of heart attack, 0 = less chance).
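
The feature/target split described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: `split_features_target` is a hypothetical helper, and the small `sample` frame only reproduces the five rows printed by `heart_data.head()` further down so that the snippet runs without the CSV file.

```python
import pandas as pd

# Column names as they appear in heart.csv (taken from the printed head()).
columns = [
    "age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
    "thalachh", "exng", "oldpeak", "slp", "caa", "thall", "output",
]

# Stand-in data: the five rows shown by heart_data.head() in this notebook.
rows = [
    [63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1],
    [37, 1, 2, 130, 250, 0, 1, 187, 0, 3.5, 0, 0, 2, 1],
    [41, 0, 1, 130, 204, 0, 0, 172, 0, 1.4, 2, 0, 2, 1],
    [56, 1, 1, 120, 236, 0, 1, 178, 0, 0.8, 2, 0, 2, 1],
    [57, 0, 0, 120, 354, 0, 1, 163, 1, 0.6, 2, 0, 2, 1],
]
sample = pd.DataFrame(rows, columns=columns)


def split_features_target(df, target="output"):
    """Hypothetical helper: return (features, target) from a heart DataFrame."""
    X = df.drop(columns=[target])  # every column except the target
    y = df[target]                 # 1 = more chance, 0 = less chance
    return X, y


X, y = split_features_target(sample)
print(X.shape)                # number of rows and feature columns
print(y.value_counts())       # class balance of the target
```

With the full dataset, the same helper could be called as `split_features_target(heart_data)` once the CSV has been loaded.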

## Questions
 -   Who is at risk of having a heart attack?
 -   What is the biggest factor playing into being at risk?


 -   Find your favourite dataset
 -   Describe origin and specification of these data
 -   Find 2 research questions (prediction context)
 -   Plot variables of interest (both 1- and 2-variable plots)
 -   Interpret plots
 -   Fit tree, random forest to answer questions
 -   Describe performance (validation, train/test, cross validation)
 -   Compare with linear model (or logistic linear model)
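
The last three tasks (fit a tree and a random forest, describe performance with cross validation, compare with a logistic model) could be approached along the following lines. This is only a sketch under stated assumptions: `compare_models` is a hypothetical helper, the hyperparameters (`max_depth=3`, `n_estimators=100`, 5 folds) are illustrative choices, and the data comes from `make_classification` as a synthetic stand-in with the same shape as the heart dataset (303 rows, 13 features), so the printed scores say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, cv=5, random_state=42):
    """Hypothetical helper: mean cross-validated accuracy per model."""
    models = {
        "decision_tree": DecisionTreeClassifier(
            max_depth=3, random_state=random_state
        ),
        "random_forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
        # Scaling inside the pipeline keeps each fold's test part unseen
        # when the scaler is fitted.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


# Synthetic stand-in data (assumption): same shape as heart.csv features.
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=0
)
scores = compare_models(X_demo, y_demo)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Reporting the mean over folds rather than a single train/test split makes the comparison less dependent on one particular split, which matters with only 303 rows.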

In [1]:
# Step 1: Import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error
from sklearn.metrics import log_loss
from math import sqrt
from scipy.stats import entropy

In [3]:
# Step 2: Load and Describe Dataset
heart_data = pd.read_csv("Data/heart.csv")

print("First 5 rows of the dataset:")
print(heart_data.head())

# Display basic information about the dataset
print("\nDataset Info:")
heart_data.info()  # info() prints its report itself and returns None

# Describe the dataset statistics
print("\nDataset Description:")
print(heart_data.describe())

First 5 rows of the dataset:
   age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  oldpeak  slp  \
0   63    1   3     145   233    1        0       150     0      2.3    0   
1   37    1   2     130   250    0        1       187     0      3.5    0   
2   41    0   1     130   204    0        0       172     0      1.4    2   
3   56    1   1     120   236    0        1       178     0      0.8    2   
4   57    0   0     120   354    0        1       163     1      0.6    2   

   caa  thall  output  
0    0      1       1  
1    0      2       1  
2    0      2       1  
3    0      2       1  
4    0      2       1  

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol 

In [8]:
# import numpy as np
# data = np.loadtxt("./Data/heart.csv", delimiter=",", skiprows=1)
# X = data[:, :-1]
# Y = data[:, -1]

In [None]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score
# 
# X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# 
# clf = RandomForestClassifier(n_estimators=10, random_state=42)
# 
# clf.fit(X_train, Y_train)
# 
# 
# predictions = clf.predict(X_test)
# accuracy = accuracy_score(Y_test, predictions)
# print("Test Accuracy:", accuracy)
# feature_importances = clf.feature_importances_
# 
# print("Feature Importances:")
# for i, importance in enumerate(feature_importances):
#     print(f"Feature {i}: {importance}")
# 


Test Accuracy: 0.8524590163934426
Feature Importances:
Feature 0: 0.07656258321357846
Feature 1: 0.037190916301048496
Feature 2: 0.08372473207079724
Feature 3: 0.073102984540921
Feature 4: 0.08830322771673946
Feature 5: 0.017574305947343678
Feature 6: 0.02388805612055446
Feature 7: 0.1288887993723594
Feature 8: 0.09452967791135913
Feature 9: 0.09609913623121151
Feature 10: 0.06726894376432285
Feature 11: 0.1263750158499526
Feature 12: 0.08649162095981171
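
The importances above are indexed by position, which makes them hard to read. A minimal sketch of mapping them back to column names follows. It assumes the features keep the CSV column order (as they would with `X = data[:, :-1]` in the commented-out cell) and uses the printed values rounded to four decimals; the variable names `feature_names`, `importances`, and `ranked` are illustrative.

```python
import pandas as pd

# Feature columns in CSV order (all columns except the final `output`).
feature_names = [
    "age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
    "thalachh", "exng", "oldpeak", "slp", "caa", "thall",
]

# Importances copied from the output above, rounded to four decimals.
importances = [
    0.0766, 0.0372, 0.0837, 0.0731, 0.0883, 0.0176, 0.0239,
    0.1289, 0.0945, 0.0961, 0.0673, 0.1264, 0.0865,
]

# Pair each importance with its column name and sort, largest first.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

Under that ordering assumption, the largest values belong to `thalachh` (Feature 7) and `caa` (Feature 11), followed by `oldpeak` (Feature 9). With only 10 trees in the forest, this ranking may shift noticeably between random seeds.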
