# HSMA Exercise

The data loaded in this exercise comes from seven acute stroke units and records whether each patient received clot-busting treatment for stroke.  There are lots of features, and a description of each one can be found in the file stroke_data_feature_descriptions.csv.

Train a decision tree model to try to predict whether or not a stroke patient receives clot-busting treatment.  Use the prompts below to write each section of code.

## Core - Fitting and Evaluating a Decision Tree

Run the code below to import the dataset and the libraries we need. 

In [None]:
import pandas as pd
import numpy as np
# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from ydata_profiling import ProfileReport

# Load data
data = pd.read_csv('../datasets/hsma_stroke.csv')
# Make all data 'float' type
data = data.astype(float)
# Show data
data.head()

Look at an overview of the data by running the code below.

We're going to use a library we haven't covered before to give a quick summary of the dataframe. 

You used this data last week, so it should feel familiar to you. 

Do you prefer this method or the code you used last week in the logistic regression exercise?

In [None]:
profile = ProfileReport(data)

profile.to_notebook_iframe()

Load in the 'stroke_data_feature_descriptions' dataframe and view that too - you can just view the whole dataframe with pandas rather than using the ProfileReport. 

Hint: it's in the same folder as the hsma_stroke.csv dataset we imported above.

Divide the data into features and labels.
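One possible way to do this is sketched below. It runs on a synthetic stand-in dataset so that it is self-contained; the column name `label` is a placeholder, not the real label column in hsma_stroke.csv, so swap in your own dataframe and column name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke data; 'label' is a placeholder column name
features, labels = make_classification(n_samples=500, n_features=6, random_state=42)
data = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(6)])
data["label"] = labels.astype(float)

# X holds every column except the label; y holds only the label
X = data.drop("label", axis=1)
y = data["label"]

# Hold back 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Setting `random_state` keeps the split the same each time you run the cell, which makes later comparisons between models fairer.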

Fit a Decision Tree model.
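A minimal sketch of fitting the model, assuming a train/test split already exists. Synthetic stand-in data is generated here only so the cell runs on its own; in the exercise you would use the split you made from the stroke data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (replace with your stroke features and labels)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Create the model with default settings and fit it to the training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print(f"Tree depth: {model.get_depth()}, leaves: {model.get_n_leaves()}")
```

Decision trees split on one feature at a time, so the features do not need to be standardised first.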

Use the trained model to predict labels in both training and test sets, and calculate and compare accuracy.
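The sketch below shows one way to compare training and test accuracy. It uses synthetic stand-in data so that it is self-contained; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (replace with your stroke features and labels)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Predict labels for both sets
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Accuracy is the proportion of predictions that match the true label
accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

print(f"Accuracy of predicting training data = {accuracy_train:.3f}")
print(f"Accuracy of predicting test data = {accuracy_test:.3f}")
```

A large gap between the two numbers, with training accuracy much higher, is a sign that the tree is overfitting the training data.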

Calculate the additional model metrics.

- precision
- specificity
- recall (sensitivity)
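All three metrics can be worked out from the confusion matrix. The sketch below does this on synthetic stand-in data; scikit-learn has no dedicated specificity function, so it is calculated here directly from the true negatives and false positives.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (replace with your stroke features and labels)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision = tp / (tp + fp)    # of those predicted positive, how many were right
recall = tp / (tp + fn)       # of the true positives, how many were found
specificity = tn / (tn + fp)  # of the true negatives, how many were found

print(f"Precision:   {precision:.3f}")
print(f"Recall:      {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
```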

Plot the decision tree.
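One way to draw the tree is with `plot_tree` from `sklearn.tree`, sketched below on synthetic stand-in data. The feature and class names are placeholders, and `max_depth=3` is used only to keep the plot readable.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data (replace with your stroke features and labels)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
feature_names = [f"feature_{i}" for i in range(6)]  # placeholder names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# A shallow tree is much easier to read when plotted
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(16, 8))
annotations = plot_tree(
    model,
    feature_names=feature_names,
    class_names=["no treatment", "treatment"],  # order follows model.classes_
    filled=True,
    fontsize=8,
    ax=ax,
)
```

With a real dataframe, `feature_names=list(X.columns)` would label the splits with the actual column names.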

## Extension - Refining Your Decision Tree

Let's experiment by changing a few parameters.

### Maximum Depth

Try changing the value of the 'max_depth' parameter when setting up your DecisionTreeClassifier.
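A sketch of how you might compare several depths in one go, using synthetic stand-in data. The depths tried here are arbitrary; `None` means the tree grows until every leaf is pure.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (replace with your stroke features and labels)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

rows = []
for depth in [1, 2, 3, 5, 8, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    rows.append({
        "max_depth": str(depth),
        "train_accuracy": model.score(X_train, y_train),
        "test_accuracy": model.score(X_test, y_test),
    })

results = pd.DataFrame(rows)
print(results)
```

Look for the depth where test accuracy stops improving even though training accuracy keeps rising.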

### Minimum Samples

Try changing the values of 'min_samples_split' (the default value is 2).

Now try adjusting 'min_samples_leaf' (the default is 1).
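The sketch below compares the default settings with a larger `min_samples_split` and a larger `min_samples_leaf`, on synthetic stand-in data. The value 20 is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (replace with your stroke features and labels)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

settings = [
    {"min_samples_split": 2, "min_samples_leaf": 1},   # the defaults
    {"min_samples_split": 20, "min_samples_leaf": 1},
    {"min_samples_split": 2, "min_samples_leaf": 20},
]

rows = []
for params in settings:
    model = DecisionTreeClassifier(random_state=42, **params).fit(X_train, y_train)
    rows.append({
        **params,
        "n_leaves": model.get_n_leaves(),
        "train_accuracy": model.score(X_train, y_train),
        "test_accuracy": model.score(X_test, y_test),
    })

results = pd.DataFrame(rows)
print(results)
```

Both parameters limit how finely the tree can divide the data, so larger values give a smaller tree with fewer leaves.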

### Split Criterion

Compare the performance when using:

- Gini Impurity
- Entropy
- Log Loss

## Comparing Performance with a Logistic Regression Model

Copy your code in from last week's logistic regression exercise (or write this in from scratch - there isn't much that is different to the decision tree model!). 

**Remember - you will need to standardise the data for the logistic regression model!**
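As a reminder of the pattern, here is a minimal sketch on synthetic stand-in data. The scaler is fitted on the training set only and then applied to both sets, so that no information from the test set leaks into training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (replace with your stroke features and labels)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training data only, then transform both sets
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Fit the logistic regression on the standardised training data
log_reg = LogisticRegression()
log_reg.fit(X_train_std, y_train)

accuracy_train = log_reg.score(X_train_std, y_train)
accuracy_test = log_reg.score(X_test_std, y_test)
print(f"Logistic regression training accuracy = {accuracy_train:.3f}")
print(f"Logistic regression test accuracy = {accuracy_test:.3f}")
```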

Look at these additional metrics as well:

- precision
- specificity
- recall (sensitivity)

Use the cell below to write out an interpretation of the performance of the logistic regression model and the decision tree. 

Think about the presence of **false positives** and **false negatives**. 

Which might you be more interested in minimising in this model?

Hint - giving thrombolysis to good candidates for it can lead to less disability after stroke and improved outcomes. However, there is a risk that giving thrombolysis to the wrong person could lead to additional bleeding on the brain and worse outcomes. What might you want to balance?

## Challenge Exercises

### Bonus Exercise 1

Have a read of this article on feature importance in decision trees: [Article Link](https://www.codecademy.com/article/fe-feature-importance-final)

In particular, make sure you read the section "Pros and cons of using Gini importance" so you can understand some of the things you need to keep in mind when looking at feature importance in trees.

We can access the feature importance by running the following code:

In [None]:
# modify this code to point towards your decision tree model object (make sure that object
# was using the gini index as the criterion)
feature_importances = _______.feature_importances_

How does this compare to the feature importance for your logistic regression?

In [None]:
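# modify this code to point towards your logistic regression model object
# (a logistic regression model has no feature_importances_ attribute; its
# coefficients, fitted on standardised data, play a similar role)
coefficients = _______.coef_[0]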

Can you create a graph showing the feature importance for each model?

Try ordering your plot so that the features with the most importance are at the top.
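One way to build such a plot for the decision tree is sketched below, using synthetic stand-in data with placeholder feature names. A horizontal bar chart draws the first row at the bottom, so sorting in ascending order puts the most important feature at the top.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (replace with your stroke features and labels)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
feature_names = [f"feature_{i}" for i in range(6)]  # placeholder names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(X_train, y_train)

# Pair each importance with its feature name, then sort for plotting
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=True)

ax = importances.plot.barh(figsize=(8, 5))
ax.set_xlabel("Gini importance")
ax.set_title("Decision tree feature importance")
```

For a logistic regression model, the same approach could be applied to the absolute values of the coefficients.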

### Bonus Exercise 2

Can you improve the accuracy of your decision tree model by changing the size of your train / test split?
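A sketch of how you might try several split sizes, using synthetic stand-in data. The sizes and the `max_depth=5` setting are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (replace with your stroke features and labels)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

rows = []
for test_size in [0.1, 0.2, 0.25, 0.3, 0.4]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_train, y_train)
    rows.append({
        "test_size": test_size,
        "n_train": len(X_train),
        "n_test": len(X_test),
        "test_accuracy": model.score(X_test, y_test),
    })

results = pd.DataFrame(rows)
print(results)
```

Bear in mind that each split size also changes which rows end up in the test set, so small differences in accuracy may just be noise rather than a real improvement.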



### Bonus Exercise 3

Try dropping some features from your data.  

Can you improve the performance of your model this way?