# HSMA Exercise

The data loaded in this exercise comes from seven acute stroke units and records whether each patient received clot-busting treatment (thrombolysis) for stroke.  There are many features; a description of each can be found in the file stroke_data_feature_descriptions.csv.

Train a random forest model to try to predict whether or not a stroke patient receives clot-busting treatment.  Use the prompts below to write each section of code.

## Core Tasks

Run the code below to import the dataset and the libraries we need. 

In [None]:
import pandas as pd
import numpy as np
# Import machine learning methods
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from ydata_profiling import ProfileReport

# Download data
# (not required if running locally and have previously downloaded data)

download_required = False

if download_required:

    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '2004_titanic/master/jupyter_notebooks/data/hsma_stroke.csv'
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory = './data/'
    os.makedirs(data_directory, exist_ok=True)

    # Save data to data subfolder
    data.to_csv(data_directory + 'hsma_stroke.csv', index=False)

# Load data (adjust this path if your copy is saved elsewhere, e.g. ./data/)
data = pd.read_csv('../datasets/hsma_stroke.csv')
# Make all data 'float' type
data = data.astype(float)
# Show data
data.head()

Look at an overview of the data. Choose whichever method you like.

(e.g. something like the 'head' method from pandas.)
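
As one possible starting point, the sketch below prints the shape, the missing-value counts and summary statistics for a DataFrame. A small made-up table stands in for `data` so the snippet runs on its own, and its column names are hypothetical. The `ProfileReport` class imported above is another option if you prefer a fuller report.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `data`; the column names are hypothetical
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Age': rng.integers(40, 95, size=200).astype(float),
    'Onset known': rng.integers(0, 2, size=200).astype(float),
    'Clotbuster given': rng.integers(0, 2, size=200).astype(float),
})

print(demo.shape)            # (number of rows, number of columns)
print(demo.isna().sum())     # missing values per column

# One row of summary statistics per column
summary = demo.describe().T
print(summary)
```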

Divide the data into features and labels, then split them into training and test sets.
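
A minimal sketch of this step follows. It assumes the label column is called 'Clotbuster given'; that name is an assumption, so check stroke_data_feature_descriptions.csv for the real one. A made-up DataFrame stands in for `data`, and because the later prompts refer to training and test sets, the sketch also holds back a test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for `data`; the column names are assumptions
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Age': rng.integers(40, 95, size=200).astype(float),
    'Onset known': rng.integers(0, 2, size=200).astype(float),
    'Clotbuster given': rng.integers(0, 2, size=200).astype(float),
})

label_col = 'Clotbuster given'  # assumed name of the label column

# Features are every column except the label; labels are the label column
X = demo.drop(label_col, axis=1)
y = demo[label_col]

# Hold back 25% of the rows for testing; fix the seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)
```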

Fit a random forest model.

Use the trained model to predict labels in both training and test sets.

Calculate and compare accuracy across training and test sets.
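
The three prompts above could be sketched as follows. Synthetic data from `make_classification` stands in for the stroke features and labels, and the parameter values are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke features (X) and labels (y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the random forest on the training set only
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict labels for both sets
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Accuracy is the proportion of predictions that match the true labels
accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

print(f'Training accuracy: {accuracy_train:.3f}')
print(f'Test accuracy:     {accuracy_test:.3f}')
```

A large gap between the two figures suggests the model is over-fitting the training data.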

Look at the other model metrics.

- precision
- specificity
- recall (sensitivity)
- f1
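
One way to obtain these four metrics is sketched below on synthetic stand-in data. scikit-learn provides scoring functions for precision, recall and f1; specificity is computed here from the confusion matrix as TN / (TN + FP).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke features (X) and labels (y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred_test = model.predict(X_test)

precision = precision_score(y_test, y_pred_test)
recall = recall_score(y_test, y_pred_test)      # also called sensitivity
f1 = f1_score(y_test, y_pred_test)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_test).ravel()
specificity = tn / (tn + fp)

print(f'Precision:   {precision:.3f}')
print(f'Specificity: {specificity:.3f}')
print(f'Recall:      {recall:.3f}')
print(f'F1:          {f1:.3f}')
```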


Plot a confusion matrix for your model.

Plot a normalised confusion matrix for your model.

## Part 2 - Refining Your Random Forest

Let's experiment by changing a few parameters.

After changing the parameters, look at the model metrics like accuracy, precision, and recall.

Tweak the parameters to see what model performance you can achieve.

### Maximum Depth
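
As a sketch of how you might explore this parameter, the loop below refits the forest with several `max_depth` values and records training and test accuracy for each. The data is synthetic and the list of depths is an arbitrary illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke features (X) and labels (y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

results = {}
for depth in [2, 4, 8, None]:   # None places no limit on tree depth
    model = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    # score() returns accuracy for classifiers
    results[depth] = (model.score(X_train, y_train),
                      model.score(X_test, y_test))
    print(f'max_depth={depth}: train={results[depth][0]:.3f}, '
          f'test={results[depth][1]:.3f}')
```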

### Number of Trees
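
A similar sketch for the number of trees varies `n_estimators` and collects accuracy, precision and recall on the test set. Synthetic data is used, and the tree counts are arbitrary examples.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke features (X) and labels (y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

rows = []
for n_trees in [10, 50, 200]:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append({
        'n_estimators': n_trees,
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
    })

# One row per setting makes the comparison easy to read
summary = pd.DataFrame(rows)
print(summary)
```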

## Part 3 - Comparing Performance with a Decision Tree Model

Copy your code in from the previous exercise on decision trees.

If you tuned your decision tree, you can bring in the best-performing of your decision tree models.
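
If you do not have your earlier code to hand, the sketch below shows one way to fit a decision tree on the same split as the random forest so that the two can be compared like for like. The data is synthetic and `max_depth=5` is an arbitrary illustrative value, not a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the stroke features (X) and labels (y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

models = {
    'Decision tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit both models on the same training set and score on the same test set
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f'{name}: test accuracy = {test_accuracy[name]:.3f}')
```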

Look at all of the metrics.

- precision
- specificity
- recall (sensitivity)
- f1

Plot a confusion matrix for the decision tree model. 

Plot a normalised confusion matrix for the decision tree model. 

## Extension

### ROC and AUC

Create receiver operating characteristic (ROC) curves, labelled with the area under the curve (AUC).
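
A minimal sketch for a single model is given below, using synthetic data. The curve is built from predicted probabilities for the positive class rather than from predicted labels, and the AUC is written into the legend.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke features (X) and labels (y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Probability of the positive class (second column of predict_proba)
proba = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f'Random forest (AUC = {auc:.3f})')
ax.plot([0, 1], [0, 1], linestyle='--', color='grey', label='Chance')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.legend()
```

To compare models, you could repeat the `ax.plot` call for each model on the same axes.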

### Comparing Performance with a Logistic Regression Model

Copy your code in from last week's logistic regression exercise. 

**Remember - you will need to standardise the data for the logistic regression model!**
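
As a reminder of what standardising involves, here is a short sketch on synthetic data. The scaler is fitted on the training set only and then applied to both sets, so that no information from the test set leaks into training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stroke features (X) and labels (y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Learn each feature's mean and standard deviation from the training set
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
# Apply the same training-set statistics to the test set
X_test_std = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_std, y_train)

print('Training feature means:', np.round(X_train_std.mean(axis=0), 3))
print(f'Test accuracy: {model.score(X_test_std, y_test):.3f}')
```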

Look at all of the metrics.

- precision
- specificity
- recall (sensitivity)
- f1


Plot a confusion matrix for the logistic regression model. 

Plot a normalised confusion matrix for the logistic regression model.

### Comparing all of the models

In the previous exercise, we compared the performance of the logistic regression model and the decision tree model.

Now consider the random forest too. 

Compare and contrast the confusion matrices for each of these.

If one of these models were to be selected, which model would you recommend to put into use, and why?

Remember: giving thrombolysis to good candidates for it can lead to less disability after stroke and improved outcomes. However, there is a risk that giving thrombolysis to the wrong person could lead to additional bleeding on the brain and worse outcomes. What might you want to balance?

You can write your answer into the empty cell below.

## Challenge

### Challenge Exercise 1

Try plotting all of your confusion matrices onto a single matplotlib figure. 

Make sure you give each of these a title.

Hint: You'll need to create multiple matplotlib subplots.

Now do the same for the normalised confusion matrices.

### Challenge Exercise 2

Using a random forest gives us another way to look at feature importance in our datasets.

Take a look at this example from the scikit-learn documentation:
https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Calculate the feature importance using both methods (mean decrease in impurity, and feature permutation) for the dataset we have been working with in this exercise. 

Do they produce the same ranking of feature importance? 
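
A rough sketch of both calculations on synthetic data is shown below; the feature names are placeholders. Mean decrease in impurity is read from the fitted forest's `feature_importances_` attribute, while permutation importance is measured here on the test set with `permutation_importance`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke features (X) and labels (y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]  # placeholders
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Method 1: mean decrease in impurity, computed during training
mdi = pd.Series(model.feature_importances_, index=feature_names)

# Method 2: drop in test score when each feature is shuffled in turn
perm = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42)
perm_mean = pd.Series(perm.importances_mean, index=feature_names)

# Rank 1 is the most important feature under each method
ranking = pd.DataFrame({
    'mdi_rank': mdi.rank(ascending=False),
    'permutation_rank': perm_mean.rank(ascending=False),
}).sort_values('mdi_rank')
print(ranking)
```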

### Challenge Exercise 3

Can you improve accuracy of your random forest by changing the size of your train / test split?  
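
One way to investigate is to loop over several `test_size` values, as in this sketch on synthetic data. The sizes listed are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke features (X) and labels (y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

test_accuracy = {}
for test_size in [0.1, 0.25, 0.4]:
    # Re-split the data with a different proportion held back for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    test_accuracy[test_size] = model.score(X_test, y_test)
    print(f'test_size={test_size}: n_test={len(y_test)}, '
          f'test accuracy={test_accuracy[test_size]:.3f}')
```

Bear in mind that a smaller test set leaves more data for training but gives a noisier estimate of accuracy, so small differences between runs may not be meaningful.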



### Challenge Exercise 4

Try dropping some features from your data.  

Can you improve the performance of your random forest this way?
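
As one illustrative strategy among many, the sketch below drops the three features with the lowest impurity-based importance and refits the model. The data is synthetic, the feature names are placeholders, and `fit_and_score` is a hypothetical helper defined only for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke data, with placeholder column names
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)


def fit_and_score(train_features, test_features):
    """Hypothetical helper: fit a forest, return it with its test accuracy."""
    forest = RandomForestClassifier(n_estimators=100, random_state=42)
    forest.fit(train_features, y_train)
    return forest, forest.score(test_features, y_test)


# Baseline using every feature
model, baseline = fit_and_score(X_train, X_test)

# Pick the three least important features according to the baseline model
importances = pd.Series(model.feature_importances_, index=X_train.columns)
to_drop = importances.nsmallest(3).index

# Refit without those features
_, reduced = fit_and_score(
    X_train.drop(columns=to_drop), X_test.drop(columns=to_drop))

print(f'All features:     {baseline:.3f}')
print(f'Dropped {list(to_drop)}: {reduced:.3f}')
```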