# A Simple Scikit-Learn Classification Workflow

This notebook shows a brief workflow you might use with `scikit-learn` to build a machine learning model that classifies whether or not a patient has heart disease. The problem we will be exploring is binary classification (a sample can only be one of two things).

In one sentence:

> Given clinical parameters about a patient, can we predict whether or not they have heart disease?
> - 1 = Yes, they have heart disease
> - 0 = No, they don't have heart disease

**Note:** The data we are using is from the Cleveland database, which is available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/heart+Disease

The workflow follows the diagram below:

<img src="images/sklearn-workflow.png"/>

**Note:** This workflow assumes your data is ready to be used with machine learning models (is numerical, has no missing values).

In [None]:
# install required packages
!pip install pandas numpy matplotlib scikit-learn

In [None]:
# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## 1. Get the data ready

In [None]:
# Import dataset
heart_disease = pd.read_csv("data/heart-disease.csv")

# View the data
heart_disease.head()
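Since this workflow assumes the data is numerical and has no missing values, it can be worth checking both before going further. The snippet below is a minimal sketch of such a check on a small made-up DataFrame (`toy_df` and its values are hypothetical stand-ins, not the real heart disease data); the same two checks could be run on `heart_disease`.

```python
import pandas as pd

# Hypothetical toy data standing in for a prepared dataset
toy_df = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "target": [1, 1, 0],
})

# True only if every column has a numeric dtype
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in toy_df.dtypes)

# Total number of missing values across the whole DataFrame
n_missing = int(toy_df.isna().sum().sum())

print(f"All columns numeric: {all_numeric}")
print(f"Missing values: {n_missing}")
```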

In this example, we're going to use all of the columns except the target column to predict the target column.

In other words, we'll use a patient's medical and demographic data to predict whether or not they have heart disease.


In [None]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# View the data shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape
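To see what `test_size=0.2` does, here is a small sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical, chosen only so the row counts are easy to follow). With 10 samples, 20% go to the test set, so the split should leave 8 rows for training and 2 for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Same call pattern as above: hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in each set every time the cell runs.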

## 2. Choose the model/estimator

You can do this using the [Scikit-Learn machine learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

<img src="images/sklearn-ml-map.png" width=500/>

In Scikit-Learn, machine learning models are referred to as estimators.

In this case, since we're working on a classification problem, we've chosen the `RandomForestClassifier` estimator, which is part of the `sklearn.ensemble` module.

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

## 3. Fit the model to the data and use it to make a prediction

A model will (attempt to) learn the patterns in a dataset when you call its `fit()` method and pass it the data.

In [None]:
model.fit(X_train, y_train)

Once a model has learned patterns in data, you can use it to make predictions with the `predict()` method.

In [None]:
# Make predictions
y_preds = model.predict(X_test)

In [None]:
# Calculate accuracy
accuracy = model.score(X_test, y_test)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

In [None]:
# This will be in the same format as y_test
y_preds

In [None]:
X_test

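Select a single sample of data and use it to make a prediction.

`X_test` is the data the model hasn't seen before (testing data).

`X_test.iloc[0]` is the data of a single sample, here the first row of the test set (any other row position works the same way).

In [None]:
# First row of the test set (position 0 is an arbitrary choice)
X_test.iloc[0]

In [None]:
# The same patient in the original DataFrame, looked up by index label
heart_disease.loc[X_test.index[0]]

In [None]:
# Make a prediction on a single sample (input has to be 2D: one row, all feature columns)
model.predict(X_test.iloc[[0]])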
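The single-sample prediction needs a 2D input because Scikit-Learn estimators expect data shaped `(n_samples, n_features)`, and selecting one row from a DataFrame gives a 1D Series. The sketch below shows the shapes involved using a made-up two-row DataFrame (`toy`, its values, and its index labels are hypothetical).

```python
import pandas as pd

# Hypothetical two-row feature table with arbitrary index labels
toy = pd.DataFrame({"age": [63, 37], "chol": [233, 250]}, index=[179, 228])

# Single brackets return a 1D Series: one value per feature
row_series = toy.iloc[0]

# Double brackets return a one-row DataFrame, which is already 2D
row_frame = toy.iloc[[0]]

# Alternatively, reshape the 1D values into one row with all features
row_array = row_series.to_numpy().reshape(1, -1)

print(row_series.shape, row_frame.shape, row_array.shape)
```

Passing the one-row DataFrame keeps the column names attached, while the reshaped array carries only the values.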

## 4. Evaluate the model

A trained model/estimator can be evaluated by calling its `score()` method and passing it a collection of data. For classifiers, `score()` returns the mean accuracy on that data.

In [None]:
# On the training set
model.score(X_train, y_train)

In [None]:
# On the test set (unseen)
model.score(X_test, y_test)

In [None]:
print(f"Model accuracy on the test set: {model.score(X_test, y_test) * 100:.2f}%")
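For a classifier, `score()` reports mean accuracy, so it should agree with computing accuracy from the predictions directly. The sketch below checks this on synthetic data from `make_classification` (the synthetic data and the names `X_demo`, `y_demo`, and `clf` are assumptions for illustration, not part of the heart disease example).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_tr, y_tr)

# Accuracy via the estimator's score() method
score_value = clf.score(X_te, y_te)

# Accuracy computed from the predictions: fraction of correct labels
manual_value = accuracy_score(y_te, clf.predict(X_te))

print(f"score(): {score_value:.4f} | accuracy_score(): {manual_value:.4f}")
```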

## 5. Experiment to improve (hyperparameter tuning)

A model's first evaluation metrics aren't always its last. One way to improve a model's predictions is with hyperparameter tuning.
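Hyperparameters are the settings you choose when creating an estimator, as opposed to the patterns it learns during `fit()`. As a rough sketch, `get_params()` lists them along with their current values; the four names printed here are just a commonly tuned selection, not a recommendation.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are fixed when the estimator is created, before fit()
default_params = RandomForestClassifier().get_params()

# Print a few commonly tuned ones (illustrative selection, not exhaustive)
for name in ("n_estimators", "max_depth", "min_samples_split", "max_features"):
    print(f"{name}: {default_params[name]}")
```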

In [None]:
# Try different numbers of estimators (n_estimators is a hyperparameter you can change)
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100:.2f}%")
    print("")

**Note:** It's best practice to test different hyperparameters with a validation set or cross-validation.
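One way to follow this practice is to carve a validation set out of the training data, pick the hyperparameter on the validation set, and touch the test set only once at the end. The code below is a minimal sketch of that idea on synthetic data (the data, the candidate values `10, 50, 100`, and all variable names are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# First split: hold out a test set that is never used for tuning
X_trainval, X_te, y_trainval, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Second split: carve a validation set out of the remaining data
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

# Pick the n_estimators value with the best validation accuracy
best_n, best_val = None, -1.0
for n in (10, 50, 100):
    candidate = RandomForestClassifier(n_estimators=n, random_state=42).fit(X_tr, y_tr)
    val_score = candidate.score(X_val, y_val)
    if val_score > best_val:
        best_n, best_val = n, val_score

# Refit on train + validation data, then evaluate once on the test set
final_model = RandomForestClassifier(n_estimators=best_n, random_state=42)
final_model.fit(X_trainval, y_trainval)
test_score = final_model.score(X_te, y_te)

print(f"Best n_estimators: {best_n} (validation accuracy {best_val:.2f})")
print(f"Test accuracy: {test_score:.2f}")
```

Because the test set plays no part in choosing `n_estimators`, the final test accuracy is a fairer estimate of how the model would do on new data.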

In [None]:
from sklearn.model_selection import cross_val_score

# Try different numbers of estimators, with and without cross-validation
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100:.2f}%")
    print(f"Cross-validation score: {np.mean(cross_val_score(model, X, y, cv=5)) * 100:.2f}%")
    print("")

## 6. Save a model for later use

A trained model can be exported and saved so it can be imported and used later. One way to save a model is using Python's `pickle` module.

In [None]:
import pickle

# Save trained model to file
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(model, f)

In [None]:
# Load a saved model and make a prediction on a single example
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
# Double brackets keep the first test row as a one-row DataFrame (2D input)
loaded_model.predict(X_test.iloc[[0]])
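A quick sanity check after loading is to confirm the restored model predicts the same labels as the original. The sketch below does this on synthetic data and writes to a temporary directory so nothing is left on disk (the data, the file name `demo_model.pkl`, and the variable names are assumptions for illustration).

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small fitted model (illustrative only)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=42)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(clf, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same_predictions = bool(np.array_equal(clf.predict(X_demo), restored.predict(X_demo)))
print(f"Predictions match after reload: {same_predictions}")
```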