# Heart Attack Prediction Case Study
## Seminar, Data Science Case Study
__Week 9__, 25/04/15

### __Problem definition:__ Predict whether a patient will have a heart attack or not


## Dataset description: 

| **Feature**     | **Description**                                                                                         | **Values**                                                                                 |
|------------------|---------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| **Age**         | Age of the individual.                                                                                  | Numerical                                                                                 |
| **Sex**         | Biological sex of the individual.                                                                       | Male, Female                                                                              |
| **exang**       | Whether the individual experiences chest pain caused by exercise (indicates potential heart problems).  | 1 = Yes, 0 = No                                                                          |
| **ca**          | Number of major blood vessels that show narrowing or blockage (higher value = more severe issues).      | 0 to 3                                                                                   |
| **cp**          | Type of chest pain experienced.                                                                         | 1 = Typical angina (heart-related pain)                                                  |
|                  |                                                                                                         | 2 = Atypical angina (not directly related to heart issues)                               |
|                  |                                                                                                         | 3 = Non-anginal pain (unrelated to the heart)                                            |
|                  |                                                                                                         | 4 = Asymptomatic (no chest pain)                                                        |
| **trtbps**      | Resting blood pressure (in mm Hg). Reflects blood force on vessel walls while at rest.                  | Numerical                                                                                 |
| **chol**        | Cholesterol level in blood (mg/dl). High values indicate a higher risk of heart disease.                | Numerical                                                                                 |
| **fbs**         | Fasting blood sugar level (>120 mg/dl indicates possible diabetes).                                     | 1 = True, 0 = False                                                                      |
| **rest_ecg**    | Resting electrocardiogram results (measures heart's electrical activity).                               | 0 = Normal                                                                               |
|                  |                                                                                                         | 1 = Abnormalities in heart signals (e.g., irregular heartbeat)                          |
|                  |                                                                                                         | 2 = Enlarged heart due to strain/damage                                                 |
| **thalach**     | Maximum heart rate achieved during physical activity (lower values may indicate weaker performance).    | Numerical                                                                                 |
| **target**      | Label to predict: the patient's heart-attack risk class.                                                | 0 = Lower chance of heart attack                                                         |
|                  |                                                                                                         | 1 = Higher chance of heart attack                                                       |


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn  # installed under the package name scikit-learn
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# If a package is missing, install it first, e.g.: !pip3 install <package-name>

## Data Preparation

In [None]:
print(data.shape)
data.head()

In [None]:
print(o2_saturation.shape)
o2_saturation.head()

In [None]:
data.info()

## Exploratory Data Analysis (EDA)

In [None]:
data.describe()

In [None]:
# What is the distribution of the Age column?


In [None]:
# What is the split between males and females in the sample?


In [None]:
# Add the o2_saturation data to the DataFrame 'data'


In [None]:
corr = data.corr(numeric_only=True)  # skip non-numeric columns, e.g. a text-coded Sex

plt.figure(figsize=(15, 10))
sns.heatmap(corr, annot=True)

In [None]:
# Specify X and y and split the dataset


In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

## Modeling
Using __Random Forest Classifier__

### Random Forest Classifier

#### Overview
Random Forest is an ensemble method that can be used for both classification and regression tasks.

It builds multiple decision trees and combines their predictions, which usually gives better accuracy and less overfitting than a single decision tree.

#### How it Works
1. **Data Sampling**: Random subsets of the training data are created using bootstrapping (sampling rows with replacement).
2. **Tree Construction**: A decision tree is built for each subset, considering only a random subset of features at each split.
3. **Prediction**:
   - **Classification**: The final prediction is based on majority voting across all trees.
   - **Regression**: The final prediction is the average of all tree predictions.
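The three steps above can be sketched by hand with plain decision trees. This is a minimal illustration on synthetic data, not the seminar's dataset; the names `X_demo`, `y_demo`, `trees` and `demo_preds` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a 0/1 majority vote cannot tie
trees = []
for _ in range(n_trees):
    # 1. Data sampling: bootstrap sample (row indices drawn with replacement)
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # 2. Tree construction: max_features="sqrt" limits every split
    #    to a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# 3. Prediction: majority vote. Labels are 0/1, so the mean vote per
#    sample is the share of trees that voted for class 1.
all_votes = np.array([t.predict(X_demo) for t in trees])  # (n_trees, n_samples)
vote_share = all_votes.mean(axis=0)
demo_preds = (vote_share > 0.5).astype(int)

print("Training accuracy of the hand-made forest:", (demo_preds == y_demo).mean())
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from this simplified version.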

#### Advantages
- Handles high-dimensional data well.
- Works with both numerical and categorical data (in scikit-learn, categorical features must first be encoded as numbers).
- Provides feature importance scores, helping in feature selection.

#### Disadvantages
- Requires more computational resources due to multiple trees.
- Less interpretable than a single decision tree.
- May not perform well on highly imbalanced datasets without proper preprocessing.


In [None]:
# specify model


# train model


# make prediction


In [None]:
forest_preds

### Evaluate the result

In [None]:
# Manually calculate number of True positive and False negative predictions


In [None]:
sns.heatmap(confusion_matrix(y_test, forest_preds), annot=True)
plt.xlabel("Predicted Labels")
plt.ylabel("Actual Labels")
plt.title("Confusion Matrix")

In [None]:
# Based on the confusion matrix (rows = actual labels, columns = predicted labels), fill in:
true_positive = 
true_negative = 
false_positive = 
false_negative = 

### Evaluation Metrics
#### Precision
**Description**: Precision measures the proportion of correctly identified positive instances among all instances predicted as positive. It focuses on how accurate the positive predictions are.

**Rationale**: High precision means fewer false positives, which is crucial in scenarios like spam detection or medical diagnosis where false positives can have significant consequences.

**Formula**:  
$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$


#### Recall
**Description**: Recall, or sensitivity, measures the proportion of correctly identified positive instances out of all actual positive instances.

**Rationale**: High recall ensures fewer false negatives, important in cases like disease detection where missing a positive case can be critical.

**Formula**:  
$$
\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

#### F1 Score
**Description**: The F1 Score is the harmonic mean of Precision and Recall, balancing both metrics. It is used when there’s a need to balance precision and recall, especially in cases with imbalanced datasets.

**Rationale**: The F1 score gives a single performance metric that penalizes extreme values of Precision or Recall.

**Formula**:  
$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

#### Accuracy
**Description**: Accuracy measures the proportion of correctly classified instances (both positive and negative) out of all instances.

**Rationale**: Accuracy is a simple metric but can be misleading on imbalanced datasets, as it doesn’t distinguish between types of errors.

**Formula**:  
$$
\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
$$
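As a quick check of the four formulas, the sketch below computes them by hand from a confusion matrix and compares the results with scikit-learn's own metric functions. The label lists `y_true_demo` and `y_pred_demo` are made up for illustration and are not results from the heart dataset.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels, invented for this example only
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("precision:", precision, "| sklearn:", precision_score(y_true_demo, y_pred_demo))
print("recall:   ", recall, "| sklearn:", recall_score(y_true_demo, y_pred_demo))
print("f1:       ", f1, "| sklearn:", f1_score(y_true_demo, y_pred_demo))
print("accuracy: ", accuracy, "| sklearn:", accuracy_score(y_true_demo, y_pred_demo))
```

Here 3 of the 4 actual positives are found and 1 of the 4 positive predictions is wrong, so precision, recall and F1 are all 0.75, while accuracy is 8/10 = 0.8.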

### Result Saving

In [None]:
# create empty dataframe with indexes: ['precision','recall', 'f1_score','accuracy']


In [None]:
# write a function which will return 4 metrics:
# [round(sklearn.metrics.precision_score(y_test, forest_preds), 2),round(sklearn.metrics.recall_score(y_test, forest_preds), 2),round(sklearn.metrics.f1_score(y_test, forest_preds), 2),round(sklearn.metrics.accuracy_score(y_test, forest_preds), 2)]




In [None]:
# add the result of model_1 into the df 'result'
result['model_first'] = 
result

In [None]:
forest.predict(X_test)[:5]

In [None]:
res_proba = forest.predict_proba(X_test)
res_proba[:5]

## Hyperparameter tuning

### Change the threshold
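A classifier's `predict_proba` output can be turned into labels with any cut-off, not only the implicit 0.5. The sketch below shows the idea on synthetic data; `demo_forest`, `proba_class_1` and the helper `predict_with_threshold` are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

demo_forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Column order of predict_proba follows demo_forest.classes_, here [0, 1],
# so column 1 holds the probability of class 1
proba_class_1 = demo_forest.predict_proba(X_te)[:, 1]

def predict_with_threshold(proba, threshold):
    """Label a sample 1 when its class-1 probability reaches the threshold."""
    return (np.asarray(proba) >= threshold).astype(int)

preds_default = predict_with_threshold(proba_class_1, 0.5)
preds_strict = predict_with_threshold(proba_class_1, 0.7)

print("Predicted positives at 0.5:", preds_default.sum())
print("Predicted positives at 0.7:", preds_strict.sum())
```

Raising the threshold can only reduce the number of predicted positives, which tends to trade recall for precision.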

In [None]:
# get the result for a threshold = 0.7


In [None]:
# Generate results for each threshold from 0.1 to 0.9 and add them to the 'result' df


In [None]:
result

In [None]:
result.T.sort_values('accuracy', ascending=False)

### Using RandomizedSearchCV for hyperparameter tuning

### Hyperparameters
Key hyperparameters to tune in Random Forest:

1. `n_estimators`
- Description: The number of trees in the Random Forest.
- Impact: More trees generally make predictions more stable, with diminishing returns, but also increase computational cost. Commonly tuned to find a balance between performance and efficiency.

2. `max_depth`
- Description: The maximum depth of each tree in the forest.
- Impact: Limits the growth of trees to avoid overfitting. Shallower trees generalize better but may underfit the data.

3. `max_features`
- Description: The number of features to consider when splitting a node.
- `"sqrt"`: Considers the square root of the total number of features.
- `"log2"`: Considers the logarithm (base 2) of the total number of features.
- `None`: Considers all features.
- Impact: Controls how the model selects features, impacting diversity among trees and overall performance.

4. `min_samples_split`
- Description: The minimum number of samples required to split an internal node.
- Impact: Higher values restrict tree splitting, preventing overfitting but increasing the risk of underfitting.

5. `min_samples_leaf`
- Description: The minimum number of samples required to be in a leaf node.
- Impact: Larger values prevent the model from learning overly specific patterns, improving generalization but potentially missing finer details.
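The sketch below shows how these five hyperparameters can be passed to `RandomizedSearchCV` and how the best combination is read back. It runs on synthetic data with a deliberately small search space to keep it fast; `demo_grid` and `demo_search` are illustrative names, and the candidate values are not recommendations for the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative search space covering the five hyperparameters
demo_grid = {
    "n_estimators": [20, 40],
    "max_depth": [3, 5, None],
    "max_features": ["sqrt", "log2", None],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}

demo_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=demo_grid,
    n_iter=5,            # number of random combinations to evaluate
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
    random_state=0,      # makes the sampled combinations reproducible
)
demo_search.fit(X_demo, y_demo)

print("Best parameters:", demo_search.best_params_)
print("Best mean CV accuracy:", round(demo_search.best_score_, 3))
```

After fitting, `demo_search.best_estimator_` is a forest refit on all the data passed to `fit` with the best parameters found, so it can be used for prediction directly.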

In [None]:
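estimator = RandomForestClassifier()
# Candidate values; RandomizedSearchCV samples combinations instead of trying all of them
grid = {"n_estimators": [50, 60, 70, 80, 90, 100, 110, 120],
        "max_depth": [3, 5, 7, 9, 11, 13, 15],
        "max_features" : ["sqrt", "log2", None],
        "min_samples_split": [2, 3, 4, 5],
        "min_samples_leaf": [1, 2, 3, 4]}

# n_iter and cv are written out at their scikit-learn defaults;
# random_state (an added assumption) makes the sampled combinations reproducible
rand_search_model = RandomizedSearchCV(estimator=estimator,
                                       param_distributions=grid,
                                       n_iter=10,
                                       cv=5,
                                       random_state=42)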

In [None]:
# Retrain the model with tuned parameters and compare with model 1


In [None]:
rand_search_model.best_params_

In [None]:
result['model_hp_tuned'] = 
result.T.sort_values('accuracy', ascending=False)

### Feature importance

In [None]:
importances = forest.feature_importances_

In [None]:
forest_importances = pd.Series(importances, index = X.columns)
forest_importances = forest_importances.sort_values(ascending=False)
fig, ax = plt.subplots()
forest_importances.plot.bar(ax=ax)
ax.set_title("Feature importances")
fig.tight_layout()

In [None]:
# Drop the least important feature and compare the result with model 1


In [None]:
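# Fill in the arguments: training features without the dropped column, and the training labels
feat_model = rand_search_model.fit()
# Fill in the argument: test features without the dropped column
feat_pred = feat_model.predict()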

In [None]:
result['model_features_tuned'] = 
result.T.sort_values('accuracy', ascending=False)

## Best Model Presentation

Let's combine what we have learned (tuned hyperparameters, feature selection, and the best threshold) to produce the final model.

In [None]:
# Define the parameters
final_classifier = RandomForestClassifier(
    #<params>
)

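# Fit the model on your training data with the chosen feature removed (fill in the arguments)
final_classifier.fit()

# Generate class probabilities for the matching test features
final_probs = final_classifier.predict_proba(t)

# Use the best threshold: i[0] is the class-0 probability,
# so replace the 0 below with the chosen cut-off
final_preds = [0 if i[0] >= 0 # Insert threshold
                else 1 for i in final_probs]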

In [None]:
result['model_final'] =
result.T.sort_values('accuracy', ascending=False)

In [None]:
# Let's see the confusion matrix for the final model
sns.heatmap(confusion_matrix(y_test, final_preds), annot=True)
plt.xlabel("Predicted Labels")
plt.ylabel("Actual Labels")
plt.title("Final Model Confusion Matrix")