# Classification: Heart Failure Data

For this part of the project, you will create a classification model. In the first part, you will train a decision tree model, and in the second part (optional and advanced), you will train a support vector machine model.

The dataset contains the medical records of 260 patients who experienced heart failure, collected during their follow-up period. Each patient profile includes 11 clinical features. For more information on these features, you can refer to the metadata [here](https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records).

## Import modules

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
import random

## Load data

In [None]:
link_to_file = "https://raw.githubusercontent.com/Center-for-Health-Data-Science/Python_part2/main/data/project_work/heart_failure.csv"

# Load in the data


## EDA and data cleaning

The first step with our data is Exploratory Data Analysis (EDA). Use these questions to guide your analysis:

* Which features/explanatory variables are present? Are they numeric or categorical? Should they all be interpreted the same way? What do you want to use as the outcome variable?
* Are there missing values?
* Is there an index or ID you should remove?
* Create bar plots for the categorical variables and check if the categories are balanced.
* Create box plots and summary statistics for the numeric variables. Check their distributions and ranges. Are there outliers present?
* Remove data you think is unreliable or wrong.
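
The checks in the list above could be sketched as follows. This is a minimal illustration on a made-up data frame: the column names (`patient_id`, `age`, `smoking`, `death`) and the age threshold of 120 are assumptions for the example, not properties of the heart failure dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "age": [60.0, 72.0, np.nan, 55.0, 410.0, 65.0],
    "smoking": [0, 1, 0, 0, 1, 0],
    "death": [0, 1, 0, 0, 1, 1],
})

# Data types and number of missing values per column
print(toy.dtypes)
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop the ID column, which carries no clinical information
toy = toy.drop(columns="patient_id")

# Summary statistics for a numeric column; a maximum of 410 is implausible for an age
print(toy["age"].describe())

# Category balance for a binary variable
smoking_counts = toy["smoking"].value_counts()
print(smoking_counts)

# Keep only rows with a known and plausible age (the cut-off of 120 is an assumption)
cleaned = toy[toy["age"].notna() & (toy["age"] <= 120)]
print(cleaned)
```

Whether a value counts as unreliable is a judgment call that depends on the variable, so the threshold should be chosen per feature.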

## Correlations
Take a look at the correlation between the numeric features and the outcome variable. Which numeric features exhibit the highest correlation with the outcome variable?
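
As a rough sketch of the idea, the example below computes a correlation matrix on a small made-up data frame and ranks the features by the absolute value of their correlation with the outcome. The column names are hypothetical, and the outcome is assumed to be coded as 0/1.

```python
import pandas as pd

# Hypothetical toy data; the outcome "death" is coded as 0/1
toy = pd.DataFrame({
    "age": [50, 60, 70, 80, 90, 55],
    "ejection_fraction": [60, 45, 38, 30, 20, 50],
    "death": [0, 0, 1, 1, 1, 0],
})

# Pairwise Pearson correlations (the pandas default)
corr = toy.corr()
print(corr)

# Correlation of each feature with the outcome, strongest first
corr_with_outcome = corr["death"].drop("death").sort_values(key=abs, ascending=False)
print(corr_with_outcome)
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.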

In [None]:
# Define the correlation matrix


In [None]:
# Plot the correlation matrix in a heatmap


## Prepare data for modelling

As you have learned during the course, there are several steps you should take to prepare your data before training a model. Before you read the list below, take a moment to recall some of those steps.

* Scale numeric values: Define the standard scaler object, then fit and transform the numeric values.
* Convert categorical features to the appropriate data type so Python can interpret them as categories. Do they need to be dummy coded?
* Identify the outcome variable and ensure it is of the correct data type for analysis.
* Split your data and outcome variable into training and test sets.
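
The steps above can be sketched on a toy data frame as shown here. The columns (`age`, `nyha_class`, `death`), the 25% test size, and the random seed are assumptions made for the illustration. The sketch scales before splitting to mirror the order of the list; fitting the scaler on the training set only is a common alternative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; column names are illustrative only
toy = pd.DataFrame({
    "age": [50, 60, 70, 80, 90, 55, 65, 75],
    "nyha_class": ["I", "II", "III", "II", "III", "I", "II", "III"],
    "death": [0, 0, 1, 1, 1, 0, 0, 1],
})

# Separate the features from the outcome and store the outcome as a category
X = toy.drop(columns="death")
y = toy["death"].astype("category")

# Scale the numeric feature to mean 0 and standard deviation 1
scaler = StandardScaler()
X["age"] = scaler.fit_transform(X[["age"]]).ravel()

# Dummy code the categorical feature with more than two levels
X = pd.get_dummies(X, columns=["nyha_class"], drop_first=True)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```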


In [None]:
from sklearn.preprocessing import StandardScaler


In [None]:
from sklearn.model_selection import train_test_split


## PCA

Make a PCA of the scaled numeric features to investigate the structure of the data.
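
A minimal sketch of a PCA with scikit-learn is shown below. It uses randomly generated numbers in place of the real features, and the choice of two components is an assumption made so the result can be shown in a scatter plot.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the numeric features (30 observations, 4 features)
rng = np.random.default_rng(0)
numeric = pd.DataFrame(rng.normal(size=(30, 4)), columns=["f1", "f2", "f3", "f4"])

# PCA is sensitive to scale, so standardize first
scaled = StandardScaler().fit_transform(numeric)

# Project the data onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
pca_df = pd.DataFrame(components, columns=["PC1", "PC2"])

# Share of the total variance captured by each component
print(pca.explained_variance_ratio_)
print(pca_df.head())
```

The resulting `pca_df` can be passed to a scatter plot, for example colored by the outcome variable, to see whether the classes separate.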


## Training the model

Now it is time to train a decision tree model with your training data. Set `max_leaf_nodes` to 6.
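
For orientation, here is a small sketch of fitting a size-limited tree on synthetic data generated with `make_classification`. The synthetic data and the fixed `random_state` are assumptions for the example. It prints the tree as text with `export_text`, a plain-text alternative to `plot_tree`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)

# Limit the tree to at most 6 leaves to keep it small and readable
tree = DecisionTreeClassifier(max_leaf_nodes=6, random_state=0)
tree.fit(X_toy, y_toy)

# Text representation of the fitted decision rules
print(export_text(tree))
print("Number of leaves:", tree.get_n_leaves())
```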

In [None]:
# Define model
from sklearn.tree import DecisionTreeClassifier


# Fit model to training data


# Plot the tree (take a look at the exercises)
from sklearn.tree import plot_tree


## Model evaluation

To evaluate model performance, apply the trained decision tree model to the test data. Assess the predicted labels and compare them to the known true classes (alive or dead, if you have used death as the outcome variable).

In [None]:
# Predict the outcome of X_test
y_pred =

For a classification model, we can use the confusion matrix to assess model performance.

<details>
<summary>Confusion matrix explained</summary>

|               | Predicted Positive | Predicted Negative |
|---------------|---------------------|---------------------|
| **Actual Positive**   | True Positive (TP)       | False Negative (FN)      |
| **Actual Negative**   | False Positive (FP)      | True Negative (TN)       |


- **True Positive (TP)**: The number of actual positive cases correctly predicted as positive.
- **False Negative (FN)**: The number of actual positive cases incorrectly predicted as negative.
- **False Positive (FP)**: The number of actual negative cases incorrectly predicted as positive.
- **True Negative (TN)**: The number of actual negative cases correctly predicted as negative.




</details>
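
The short sketch below builds a confusion matrix from made-up labels (1 = positive, 0 = negative). Note that `confusion_matrix` orders rows and columns by sorted label, so with 0/1 labels the default layout is `[[TN, FP], [FN, TP]]`, which differs from the table above. Passing `labels=[1, 0]` reproduces the table's layout.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1 = positive, 0 = negative)
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat_toy = [1, 0, 1, 0, 1, 0, 0, 0]

# Default layout: rows are true labels, columns are predictions, labels sorted (0, 1)
cm = confusion_matrix(y_true_toy, y_hat_toy)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)

# Same counts arranged as in the table above (positive class first)
cm_positive_first = confusion_matrix(y_true_toy, y_hat_toy, labels=[1, 0])
print(cm_positive_first)
```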

In [None]:
from sklearn.metrics import confusion_matrix

# Define confusion matrix


Now we calculate the precision score, which is the proportion of predicted cases that are actual cases. Consider what a 'case' represents in this dataset and what the precision score signifies in this context.

<details>
<summary>Hint</summary>

The formula for the precision score is:

$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$



In the context of this dataset, and if you used 'death' as your outcome variable, the precision score signifies how many of those predicted to be dead are actually dead.



</details>

In [None]:
from sklearn.metrics import precision_score

# Define precision score


Finally, we will calculate the recall score, which is the proportion of actual cases that are predicted as cases. Again, consider what this represents in the context of this dataset.

<details>
<summary>Hint</summary>

The formula for recall score is:

$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$

In the context of this dataset, and if you used 'death' as your outcome variable, the recall score signifies the proportion of actual dead individuals that were predicted as dead by the model.

</details>
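
As a worked example of the precision and recall formulas, the sketch below computes both by hand from the confusion matrix counts and compares them with `precision_score` and `recall_score`. The labels are made up for the illustration, with 1 as the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels (1 = positive, 0 = negative)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat_toy = [1, 1, 0, 0, 1, 0, 0, 0]

# Counts from the confusion matrix (default layout is [[TN, FP], [FN, TP]])
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_hat_toy).ravel()

# Precision = TP / (TP + FP), recall = TP / (TP + FN)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

# The scikit-learn functions should give the same values
precision_sklearn = precision_score(y_true_toy, y_hat_toy)
recall_sklearn = recall_score(y_true_toy, y_hat_toy)

print("Precision:", precision_manual, precision_sklearn)
print("Recall:", recall_manual, recall_sklearn)
```

Here two of the three predicted positives are correct (precision 2/3), while only two of the four actual positives are found (recall 1/2).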

In [None]:
from sklearn.metrics import recall_score

# Define recall score


## Model interpretation

Let's look at the feature importances of the model using the `feature_importances_` attribute. Which features are the most important for the model? How were these correlated to the outcome variable in the correlation matrix? Are the important features what you expect? Why might they not be? How could the model be improved?

<details>
<summary>Hint</summary>

The dataset contains only 260 observations, and the categorical variables are not completely balanced (which is acceptable). Training with a larger dataset would improve generalization and potentially enhance performance when predicting the test data.

</details>
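
One possible way to inspect the feature importances is to pair them with the feature names in a pandas Series and sort them. The sketch below does this for a tree fitted on synthetic data; the data and the names `feature_0` to `feature_4` are placeholders for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data, with placeholder feature names
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(5)]

tree = DecisionTreeClassifier(max_leaf_nodes=6, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its feature name and sort, most important first
importances = pd.Series(tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Features that the tree never splits on get an importance of 0, which is common for a tree limited to a handful of leaves.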

# Optional and advanced part: Support Vector Machine

If you feel comfortable with the previous section and want an extra challenge, this section is for you!

We are going to fit a model that you haven't encountered yet: the Support Vector Machine (SVM). Use Google (or your preferred search engine) to find out which module from scikit-learn you need to import to use this model.

## Training the model

In [None]:
# Import the model
from sklearn import XXX

# Initialize the model with default settings


# Fit the model with your training data (use the same training data as for the decision tree if you want to be able to compare the performance of both models)


## Model evaluation

Use the model to predict the outcome of `X_test`.

Construct the confusion matrix.

Calculate the precision score. Do you remember what this indicates?

Calculate the recall score. Do you remember what this indicates?

Discuss the model performance with the person next to you.

## Model optimization

So far, we have trained the support vector machine model using the default settings. First, let’s examine the parameters of the model we just created using its `get_params()` method. What is the default value of the kernel parameter?
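
`get_params()` is available on every scikit-learn estimator and returns a dictionary of parameter names and values. The sketch below shows the pattern on a decision tree, as used in the first part, so the call on your own SVM model is left to you.

```python
from sklearn.tree import DecisionTreeClassifier

# Any scikit-learn estimator works here; a decision tree is used for illustration
model = DecisionTreeClassifier(max_leaf_nodes=6)

# Dictionary mapping parameter names to their current values
params = model.get_params()
for name, value in params.items():
    print(name, "=", value)

# Look up a single parameter by name
print("criterion:", params["criterion"])
```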

Now, let’s examine the possible arguments for the kernel parameter. To do this, we will use the `help()` function.

Choose one of the possible kernel arguments (other than the default) and run the model again. Evaluate how this model performs compared to the default settings.

You will need to run the model with all possible kernel arguments to determine which kernel performs best with your data. Exclude the 'precomputed' argument. Try to create a loop or a function to avoid repetitive code.
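
A generic pattern for comparing several values of one parameter is sketched below. To avoid giving away the SVM solution, it loops over the `criterion` parameter of a decision tree on synthetic data. The helper `evaluate_setting` is a hypothetical name introduced for this example, and the same structure can be adapted to the kernel argument.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training and test data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

def evaluate_setting(criterion):
    """Fit a tree with the given criterion and return its test-set scores."""
    model = DecisionTreeClassifier(
        criterion=criterion, max_leaf_nodes=6, random_state=0
    )
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    return {
        "precision": precision_score(y_te, y_hat),
        "recall": recall_score(y_te, y_hat),
    }

# One call per parameter value, collected in a dictionary
results = {c: evaluate_setting(c) for c in ["gini", "entropy"]}
for setting, scores in results.items():
    print(setting, scores)
```

Collecting the scores in a dictionary keeps the comparison in one place and makes it easy to turn into a table or plot afterwards.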

Discuss with your group or the person next to you which kernel performed the best.