# A Simple Scikit-Learn Classification Workflow

This notebook shows a brief workflow you might use with `scikit-learn` to build a machine learning model that classifies whether or not a patient has heart disease.

It follows the diagram below:

<img src="images/sklearn-workflow.png"/>

**Note:** This workflow assumes your data is ready to be used with machine learning models (is numerical, has no missing values).
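
A quick way to check both assumptions before modelling is sketched below. The small `df` DataFrame is a made-up stand-in for your own data, not the heart disease dataset.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in data; replace with your own DataFrame
df = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "target": [1, 1, 1],
})

# True if there is not a single missing value anywhere in the DataFrame
no_missing = not df.isna().any().any()

# True if every column has a numeric dtype
all_numeric = all(is_numeric_dtype(df[col]) for col in df.columns)

print(f"No missing values: {no_missing}")
print(f"All columns numeric: {all_numeric}")
```

If either check comes back `False`, the data needs cleaning or encoding before it can be passed to an estimator.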

## 0. An end-to-end Scikit-Learn workflow

In [58]:
# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [22]:
# 1. Get the data ready
import pandas as pd
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In this example, we're going to use all of the columns except the target column to predict the target column.

In other words, using a patient's medical and demographic data to predict whether or not they have heart disease.

In [23]:
# Create x (features matrix)
x = heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease["target"]

## 2. Choose the model/estimator

You can do this using the [Scikit-Learn machine learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

<img src="images/sklearn-ml-map.png" width=500/>

In Scikit-Learn, machine learning models are referred to as estimators.

In this case, since we're working on a classification problem, we've chosen the `RandomForestClassifier` estimator, which is part of the `ensemble` module.

In [24]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)

# We'll keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
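
Any of these hyperparameters can be changed after the estimator is created. The sketch below uses a separate, hypothetical `clf_demo` instance and arbitrary values (50 trees, depth 5) purely to illustrate `set_params()`; they are not recommended settings.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical estimator used only for this illustration
clf_demo = RandomForestClassifier(n_estimators=100)

# Override two hyperparameters in place (values chosen arbitrarily)
clf_demo.set_params(n_estimators=50, max_depth=5)

# Confirm the new values are stored on the estimator
params = clf_demo.get_params()
print(params["n_estimators"], params["max_depth"])
```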

## 3. Fit the model to the data and use it to make a prediction

A model will (attempt to) learn the patterns in a dataset when you call its `fit()` method and pass it the data.

In [15]:
# 3. Split the data into training and test sets before fitting

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
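
To see what `test_size=0.2` does to the row counts, here is a small sketch on stand-in arrays. The 303-row size is an assumption, chosen because a 20% split of it yields a 61-row test set like the one in this notebook. The names `x_demo`, `y_demo` and the `random_state=42` value are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with an assumed 303 rows and 2 feature columns
x_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.arange(303) % 2

# random_state fixes the shuffle so the split is the same on every run
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

print(x_tr.shape, x_te.shape)  # (242, 2) (61, 2)
```

Without `random_state`, each run shuffles differently, so scores can vary slightly from one run to the next.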

In [18]:
import sklearn
sklearn.show_versions()


System:
    python: 3.11.3 | packaged by Anaconda, Inc. | (main, Apr 19 2023, 23:46:34) [MSC v.1916 64 bit (AMD64)]
executable: D:\ml_projects\env\python.exe
   machine: Windows-10-10.0.22621-SP0

Python dependencies:
      sklearn: 1.2.2
          pip: 23.0.1
   setuptools: 66.0.0
        numpy: 1.24.3
        scipy: 1.10.1
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.7.1
       joblib: 1.1.1
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: D:\ml_projects\env\Library\bin\mkl_rt.2.dll
         prefix: mkl_rt
       user_api: blas
   internal_api: mkl
        version: 2023.1-Product
    num_threads: 10
threading_layer: intel

       filepath: D:\ml_projects\env\vcomp140.dll
         prefix: vcomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 12


In [26]:
# Fit the model to the training data
clf.fit(x_train, y_train);

Once a model has learned patterns in data, you can use it to make predictions with the `predict()` method.

In [29]:
# Make a prediction
# Note: this call fails because predict() expects a 2D array of shape (n_samples, n_features)
y_label = clf.predict(np.array([0, 2, 3, 4]))



ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
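
The error message suggests `reshape(1, -1)` for a single sample. A minimal sketch of that fix follows. It uses a synthetic four-feature dataset from `make_classification` and a hypothetical `clf_demo` model, not the heart disease classifier. With the model trained above, a sample would also need 13 values, one per feature column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 100 samples, 4 features
x_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(x_demo, y_demo)

single_sample = np.array([0, 2, 3, 4])   # shape (4,): 1D, rejected by predict()
as_row = single_sample.reshape(1, -1)    # shape (1, 4): one sample, four features

prediction = clf_demo.predict(as_row)
print(as_row.shape, prediction.shape)    # (1, 4) (1,)
```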

In [30]:
x_test

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
178,43,1,0,120,177,0,0,120,1,2.5,1,0,3
95,53,1,0,142,226,0,0,111,1,0.0,2,0,3
117,56,1,3,120,193,0,0,162,0,1.9,1,0,3
105,68,0,2,120,211,0,0,115,0,1.5,1,0,2
195,59,1,0,170,326,0,0,140,1,3.4,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2
106,69,1,3,160,234,1,0,131,0,0.1,1,1,2
122,41,0,2,112,268,0,0,172,1,0.0,2,0,2
22,42,1,0,140,226,0,1,178,0,0.0,2,0,2


In [31]:
# Predict labels for the test set (a 2D DataFrame, so this works)
y_preds = clf.predict(x_test)

In [32]:
y_preds

array([0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0], dtype=int64)

In [33]:
y_test

178    0
95     1
117    1
105    1
195    0
      ..
6      1
106    1
122    1
22     1
197    0
Name: target, Length: 61, dtype: int64

## 4. Evaluate the model

A trained model/estimator can be evaluated by calling its `score()` method and passing it a collection of data. For classifiers, `score()` returns the mean accuracy.

In [35]:
# 4. Evaluate the model on the training data
clf.score(x_train, y_train)

1.0

In [37]:
# Evaluate the model on the test data (data it wasn't trained on)
clf.score(x_test, y_test)

0.8524590163934426

In [39]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.85      0.82      0.84        28
           1       0.85      0.88      0.87        33

    accuracy                           0.85        61
   macro avg       0.85      0.85      0.85        61
weighted avg       0.85      0.85      0.85        61



In [40]:
confusion_matrix(y_test, y_preds)

array([[23,  5],
       [ 4, 29]], dtype=int64)

In [42]:
accuracy_score(y_test, y_preds)

0.8524590163934426
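
The accuracy, precision and recall figures above can all be recovered from the confusion matrix. The sketch below copies the matrix values from the output and redoes the arithmetic by hand for class 1; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows are true labels, columns are predictions
cm = np.array([[23, 5],
               [4, 29]])

# For a binary problem, ravel() yields these four counts in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)      # of the predicted positives, how many were right
recall_1 = tp / (tp + fn)         # of the actual positives, how many were found

print(f"accuracy:  {accuracy:.4f}")     # 0.8525
print(f"precision: {precision_1:.2f}")  # 0.85
print(f"recall:    {recall_1:.2f}")     # 0.88
```

These match the `accuracy_score()` result and the class 1 row of the classification report.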

## 5. Experiment to improve (hyperparameter tuning)

A model's first evaluation metrics aren't always its last. One way to improve a model's predictions is with hyperparameter tuning.

In [54]:
# 5. Improve a model
# Try different values of n_estimators
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
    print(f"Model accuracy on test set:{clf.score(x_test, y_test) * 100:.2f}%")
    print("")

trying model with 10 estimators...
Model accuracy on test set:81.97%

trying model with 20 estimators...
Model accuracy on test set:85.25%

trying model with 30 estimators...
Model accuracy on test set:81.97%

trying model with 40 estimators...
Model accuracy on test set:81.97%

trying model with 50 estimators...
Model accuracy on test set:81.97%

trying model with 60 estimators...
Model accuracy on test set:83.61%

trying model with 70 estimators...
Model accuracy on test set:80.33%

trying model with 80 estimators...
Model accuracy on test set:85.25%

trying model with 90 estimators...
Model accuracy on test set:83.61%
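
The loop above picks `n_estimators` by looking at test-set accuracy, which means the test set is no longer fully unseen. One common alternative, sketched here as an assumption and not as part of the original workflow, is to compare settings with cross-validation on the training data using `GridSearchCV`. The data is synthetic (`make_classification`) and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with 13 features, like the heart disease features matrix
x_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

# Candidate values to compare (chosen arbitrarily for this sketch)
param_grid = {"n_estimators": [10, 50, 90]}

# 5-fold cross-validation on the training data only
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(x_tr, y_tr)

# The test set is touched once, after the best setting has been chosen
test_accuracy = search.score(x_te, y_te)
print(search.best_params_)
print(f"Held-out accuracy: {test_accuracy * 100:.2f}%")
```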



## 6. Save a model for later use

A trained model can be exported and saved so it can be imported and used later. One way to save a model is using Python's `pickle` module.

In [55]:
# 6. Save a model and load it
import pickle

# Use a context manager so the file is closed after writing
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [56]:
# Load the saved model and score it on the test set
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(x_test, y_test)

0.8360655737704918
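
A simple way to confirm a save/load round trip worked is to check that the restored model makes the same predictions as the original. The sketch below does this with a hypothetical `clf_demo` model trained on synthetic data and a temporary file, so it doesn't touch `random_forest_model_1.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model to save
x_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(x_demo, y_demo)

# Write the model to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(clf_demo, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly what the original predicts
same_predictions = np.array_equal(clf_demo.predict(x_demo), restored.predict(x_demo))
print(same_predictions)  # True
```

Only unpickle files you trust, and load them with the same Scikit-Learn version that saved them where possible.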