# SCIKIT LEARN WORKFLOW
## END TO END SCIKIT LEARN WORKFLOW

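What does "end to end" mean? Here it means walking through every step of a Scikit-Learn project, from getting the data ready to saving the trained model.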

One thing to note: Scikit-Learn is a very large library, so in practice it is better to import only what you need.

### Random Forest Classifier Workflow for Classifying Heart Disease.
#### 1. Get the data ready
I will use the heart disease data, which is stored in CSV format, so I need to import it first. Pandas is the appropriate library for loading the data into a DataFrame and processing it.

In [1]:
# get the data heart-disease.csv
import pandas as pd
heart_disease_data = pd.read_csv('../data/heart-disease.csv')
heart_disease_data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


***NOTE***
Every column of heart_disease_data except the target column is a patient characteristic (a feature). The target column is the end result, whether the patient has heart disease or not, and we want to find how it is connected to each characteristic.

In [2]:
# first we need to separate the target column from other data
x_data = heart_disease_data.drop('target', axis=1)
x_data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [3]:
# now the target label will be in the y_data
y_data = heart_disease_data['target']
y_data.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

***NOTE:***
Now I need to split x_data and y_data into training and test sets. To do this I will use the Scikit-Learn library.

The import name is a bit different from the library name: the package is named sklearn, so that is what I need to import.

Since this library is quite big, I need to be specific about what I import. This time it is the model_selection module: I want to randomly split x_data and y_data (both taken from heart_disease_data), so I will use the train_test_split function.

In [4]:
# import the library
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

# just to be safe if we need to handle some matrices I will import numpy as well
import numpy as np

In [5]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data)
# the results are too big to display, so I will just check the shape of each one
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((227, 13), (227,), (76, 13), (76,))

***NOTE:***
The shapes above are important.

I need to validate that x_train and y_train have the same first dimension (in this case 227).

The same must be true for x_test and y_test (in this case 76).

It means 227 randomly chosen rows will be used for training and the remaining 76 for testing.
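The split can also be made repeatable. A minimal sketch, assuming a made-up stand-in for the heart disease data (303 rows, 13 feature columns; the names x_demo and y_demo are hypothetical): test_size=0.25 is the default proportion, and random_state fixes which rows end up in each subset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the heart disease data: 303 rows, 13 feature columns
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(303, 13)))
y_demo = pd.Series(rng.integers(0, 2, size=303), name="target")

# test_size=0.25 is the default; random_state makes the split repeatable
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42
)

# Rows must match within each pair, and every row lands in exactly one subset
assert x_tr.shape[0] == y_tr.shape[0]
assert x_te.shape[0] == y_te.shape[0]
assert x_tr.shape[0] + x_te.shape[0] == len(x_demo)
print(x_tr.shape, x_te.shape)  # (227, 13) (76, 13)
```

The sizes match the ones above because 25% of 303 rows is rounded up to 76 test rows, leaving 227 for training.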

#### 2. Choose the model and hyperparameters
Now this is the very heart of Scikit-Learn: a variety of models. In machine learning, models that predict a class are known as classifiers; in the Scikit-Learn documentation, models in general are more commonly called estimators.

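So what are hyperparameters? They are the settings of a model that are chosen before training (for example n_estimators, the number of trees in the forest), as opposed to the parameters the model learns from the data.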

Next we will use Random Forest Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=forest#sklearn.ensemble.RandomForestClassifier

Learn about the Random Forest model: https://en.wikipedia.org/wiki/Random_forest

It is an ensemble learning method, which is why it lives in the sklearn.ensemble module.

In [6]:
# import the Random Forest 
from sklearn.ensemble import RandomForestClassifier
# create an instance of the classifier (default hyperparameters) and keep it in a short variable name
clf = RandomForestClassifier()

In [7]:
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
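The values listed by get_params() are the hyperparameters of the estimator. A small sketch of how they can be changed (the variable name forest and the chosen values are only for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters can be set when the estimator is created ...
forest = RandomForestClassifier(n_estimators=50, max_depth=5)
print(forest.get_params()["n_estimators"])  # 50

# ... or changed later with set_params, before calling fit
forest.set_params(n_estimators=20)
print(forest.get_params()["n_estimators"])  # 20
```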

#### 3. Fit the model to the data and use it to make predictions
Fitting the model on the data involves passing it the data and asking it to figure out the patterns.

If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.

If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.

First I need to build the forest of trees from the training set, in this case **x_train and y_train.**

Read more about the random forest classifier's fit method: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit



In [8]:
# train (fit) the random forest classifier on the training data
clf.fit(x_train, y_train)

***NOTE:***
The result of clf.fit does not need to be stored in a variable. The fit method trains the clf object in place (the fitted trees are stored on the object itself) and returns that same object.

Here clf = RandomForestClassifier() is an **object**, just like a **DataFrame** in Pandas. In the same way that you can directly modify an instance of a DataFrame (in this case heart_disease_data) by cutting it, processing it, etc., calling fit modifies clf.
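A quick way to check this behaviour of fit, sketched on synthetic data from make_classification (X_demo, y_demo and demo_clf are hypothetical names, not the heart disease data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 13 feature columns, like x_data
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=10, random_state=0)
returned = demo_clf.fit(X_demo, y_demo)

# fit returns the very same object it was called on
print(returned is demo_clf)  # True

# the fitted trees are stored on the object itself
print(len(demo_clf.estimators_))  # 10
```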

#### Use the model to make a prediction
The whole point of training a machine learning model is to use it to make some kind of prediction in the future.

Once our model instance is trained, you can use the predict() method to predict a target value given a set of features. In other words, use the model, along with some unlabelled data, to predict the label.

***CAUTION***: the data you predict on has to have the same feature columns (same number, same order) as the data you trained on.

Fortunately we already have the correct shape, since we built our train and test data using the Scikit-Learn **train_test_split()** function, which splits the rows and keeps all the columns. This ensures x_train and x_test have the same number of columns (the second value in each shape output, which we already verified above).
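To see what happens when the columns do not match, here is a minimal sketch on synthetic data (the names are hypothetical). The number of rows is free, but the number of feature columns must equal what the model saw in fit:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Any number of rows is fine as long as there are 13 feature columns
preds = demo_clf.predict(X_demo[:5])
print(preds.shape)  # (5,)

# Dropping a column changes the feature count, so predict refuses the input
try:
    demo_clf.predict(X_demo[:5, :12])
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print("predict failed:", err)
```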

In [9]:
# just to verify
x_test

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
112,64,0,2,140,313,0,1,133,0,0.2,2,0,3
193,60,1,0,145,282,0,0,142,1,2.8,1,2,3
43,53,0,0,130,264,0,0,143,0,0.4,1,0,2
261,52,1,0,112,230,0,1,160,0,0.0,2,1,2
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
109,50,0,0,110,254,0,0,159,0,0.0,2,0,2
262,53,1,0,123,282,0,1,95,1,2.0,1,2,3
266,55,0,0,180,327,0,2,117,1,3.4,1,0,2
132,42,1,1,120,295,0,1,162,0,0.0,2,0,2


In [10]:
# now use the predict method to make predictions on the test data
y_preds = clf.predict(x_test)

#### 4. Evaluate the model
Now that we've made some predictions (y_preds), we can start to use some more Scikit-Learn methods to figure out how good our model is.

Each model or estimator has a built-in score method. This method measures how well the model was able to learn the patterns between the features and labels. For a classifier, it returns how accurate the model's predictions are.

Using the Random Forest Classifier score method:

**score(X, y[, sample_weight])** https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.score

Return the mean accuracy on the given test data and labels.

In [11]:
clf.score(x_train, y_train)

1.0

In [12]:
clf.score(x_test, y_test)

0.9078947368421053

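***NOTE:*** these numbers are the mean accuracy on each of the two data sets. The perfect score of 1.0 is on the training data, which the model has already seen. The score on the test data, which the model did not see during training, is the better estimate of how it will perform on new data.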

#### Use additional evaluation methods and reporting functions
These are part of the sklearn.metrics module.

See the [Metrics and scoring: quantifying the quality of predictions](https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation) section and the Pairwise metrics, Affinities and Kernels section of the user guide for further details.

The [sklearn.metrics module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) includes score functions, performance metrics and pairwise metrics and distance computations.

Here is the list of functions I need from this module:
1. [classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report)
1. [confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix)
1. [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)


[3.3.2.7. Classification report](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report)

The classification_report function builds a text report showing the main classification metrics.

**sklearn.metrics.classification_report(y_true, y_pred, *, labels=None, target_names=None, sample_weight=None, digits=2, output_dict=False, zero_division='warn')**[source](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report)

In [13]:
# import the additional evaluation and reporting functions from sklearn.metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# now print the classification report: y_true is y_test, y_pred is y_preds
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.93      0.85      0.89        33
           1       0.89      0.95      0.92        43

    accuracy                           0.91        76
   macro avg       0.91      0.90      0.91        76
weighted avg       0.91      0.91      0.91        76



From this module I also need the [confusion matrix (here is the user guide on it)](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix). Also here is the [Wikipedia page on the confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

**sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)**

returns: 
    C: ndarray of shape (n_classes, n_classes)

Confusion matrix whose i-th row and j-th column entry indicates the number of samples with true label being i-th class and predicted label being j-th class.

In [14]:
# use the confusion matrix with y_true = y_test and y_pred = y_preds
confusion_matrix(y_test, y_preds)

array([[28,  5],
       [ 2, 41]], dtype=int64)
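The numbers in the classification report can be recomputed by hand from this matrix. A small sketch with NumPy only, using the matrix printed above (rows are the true classes, columns are the predicted classes):

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predicted classes
cm = np.array([[28, 5],
               [2, 41]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually that class

print(round(accuracy, 4))   # 0.9079
print(precision.round(2))   # [0.93 0.89]
print(recall.round(2))      # [0.85 0.95]
```

These agree with the precision and recall columns of the classification report and with the test score of 0.9078947368421053.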

Next is the [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)

**sklearn.metrics.accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)**

Returns: score: float

If normalize == True, return the fraction of correctly classified samples (float), else returns the number of correctly classified samples (int).

The best performance is 1 with normalize == True and the number of samples with normalize == False.

*see the example in the source guide*
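A tiny sketch of both modes with made-up labels (y_true_demo and y_pred_demo are hypothetical, not taken from the heart disease data):

```python
from sklearn.metrics import accuracy_score

y_true_demo = [0, 1, 1, 0]
y_pred_demo = [0, 1, 0, 0]  # the third prediction is wrong

# normalize=True (default): fraction of correct predictions
fraction = accuracy_score(y_true_demo, y_pred_demo)

# normalize=False: number of correct predictions (3 of the 4 match)
count = accuracy_score(y_true_demo, y_pred_demo, normalize=False)

print(fraction)  # 0.75
print(count)
```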

In [15]:
# use the accuracy_score with y_true = y_test and y_pred = y_preds
accuracy_score(y_test, y_preds)

0.9078947368421053

#### 5. Experiment to improve
The first model we build is often referred to as a baseline.

Once we have a baseline model, like we have here, it is important to remember that this is often not the final model we will use.

The next step in the workflow is to try to improve upon the baseline model.

There are two ways to look at this:
1. From a model perspective
1. From a data perspective

From a model perspective, this may involve things such as using a more complex model or tuning the current model's hyper-parameters.

From a data perspective, this may involve collecting more data or improving the data quality. This gives the model a better chance to learn the patterns within.

If you are already working on an existing dataset, it is often easier to try a series of model perspective experiments first, then turn to the data perspective if you are not getting the results you are looking for. This sounds like our current case, where we already have the heart disease data frame.

One thing you should be aware of: if you are tuning a model's hyper-parameters in a series of experiments, your results should always be **cross-validated**.

Cross-validation is a way of making sure that the results you are getting are consistent across different training and test sets. It evaluates the model on multiple versions of the training and test split, so a good score is not just pure luck from the way one particular split was created.
- try different hyper-parameters
- All different parameters should be cross validated
    - **CAUTION:** beware of cross validation for time series problems
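To illustrate the caution about time series, here is a sketch comparing plain k-fold splitting with TimeSeriesSplit from sklearn.model_selection. The ten time-ordered samples are hypothetical. With plain k-fold, some folds train on samples that come after the test samples, which for time series means training on the future.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Ten observations in time order (hypothetical data)
samples = np.arange(10).reshape(-1, 1)

# Plain k-fold: the first fold tests on [0 1] and trains on later samples
kfold_splits = list(KFold(n_splits=5).split(samples))
for train_idx, test_idx in kfold_splits:
    print("KFold           train", train_idx, "test", test_idx)

# TimeSeriesSplit: every training index precedes every test index
ts_splits = list(TimeSeriesSplit(n_splits=4).split(samples))
for train_idx, test_idx in ts_splits:
    print("TimeSeriesSplit train", train_idx, "test", test_idx)
```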

Different models have different hyper-parameters you can tune. For our model, the Random Forest Classifier, we will start by trying different values for n_estimators.

In [16]:
# Try different numbers of estimators (trees) ... (no cross-validation)
# seed the random number generator so the results are repeatable;
# this is important since we will compare these results with the cross-validated ones
np.random.seed(42)

# we will try n_estimators of (10, 20, 30, 40, 50, 60, 70, 80, 90)
for i in range(10,100,10):
    print(f"trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
    print(f"Model accuracy on test: {model.score(x_test, y_test) * 100}%")
    print("")

trying model with 10 estimators...
Model accuracy on test: 84.21052631578947%

trying model with 20 estimators...
Model accuracy on test: 89.47368421052632%

trying model with 30 estimators...
Model accuracy on test: 86.8421052631579%

trying model with 40 estimators...
Model accuracy on test: 85.52631578947368%

trying model with 50 estimators...
Model accuracy on test: 85.52631578947368%

trying model with 60 estimators...
Model accuracy on test: 90.78947368421053%

trying model with 70 estimators...
Model accuracy on test: 89.47368421052632%

trying model with 80 estimators...
Model accuracy on test: 89.47368421052632%

trying model with 90 estimators...
Model accuracy on test: 89.47368421052632%



In [17]:
# now we will compare the different numbers of estimators but with cross validation
# import the cross-validation score function
from sklearn.model_selection import cross_val_score

# re-seed the random generator so it produces the same series of numbers
np.random.seed(42)

# rerun similar test but adding cross validation
for i in range(10,100,10):
    print(f"trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
    print(f"Model accuracy on test: {model.score(x_test, y_test) * 100}%")
    print(f"Cross-validation score: {np.mean(cross_val_score(model, x_data, y_data, cv=5)) * 100}%")
    print("")

trying model with 10 estimators...
Model accuracy on test: 84.21052631578947%
Cross-validation score: 78.53551912568305%

trying model with 20 estimators...
Model accuracy on test: 88.1578947368421%
Cross-validation score: 79.84699453551912%

trying model with 30 estimators...
Model accuracy on test: 86.8421052631579%
Cross-validation score: 80.50819672131148%

trying model with 40 estimators...
Model accuracy on test: 86.8421052631579%
Cross-validation score: 82.15300546448088%

trying model with 50 estimators...
Model accuracy on test: 85.52631578947368%
Cross-validation score: 81.1639344262295%

trying model with 60 estimators...
Model accuracy on test: 86.8421052631579%
Cross-validation score: 83.47540983606557%

trying model with 70 estimators...
Model accuracy on test: 84.21052631578947%
Cross-validation score: 81.83060109289617%

trying model with 80 estimators...
Model accuracy on test: 89.47368421052632%
Cross-validation score: 82.81420765027322%

trying model with 90 estimato

***NOTE***: the cross_val_score function returns an array with one score per fold (5 here, since cv=5). The scores must therefore be averaged to get a single practical number, which is why NumPy's mean is applied to the cross_val_score result.
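A short sketch of what cross_val_score returns, run on synthetic data from make_classification (hypothetical names, not the heart disease data). Looking at the spread of the fold scores as well as the mean can show how much the result depends on the split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=20, random_state=0), X_demo, y_demo, cv=5
)

print(scores.shape)  # (5,) one accuracy value per fold
print(scores)
print(scores.mean(), scores.std())
```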

[learn more on cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)

Question: why is a new model built each time a new number of estimators is tried? Could we just reuse the clf object made earlier and see the changes from there?
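One possible answer, as far as I understand it: a single object can be reused by changing its hyperparameter with set_params and calling fit again. With the default warm_start=False, each call to fit rebuilds the forest from scratch, so the result is equivalent to creating a new model. A sketch on synthetic data (all names here are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

reused = RandomForestClassifier(random_state=0)
tree_counts = []
for n in (10, 20, 30):
    reused.set_params(n_estimators=n)  # change the hyperparameter on the same object
    reused.fit(X_tr, y_tr)             # refit; the previous forest is discarded
    tree_counts.append(len(reused.estimators_))
    print(n, reused.score(X_te, y_te))

print(tree_counts)  # [10, 20, 30]
```

Creating a fresh model in each iteration, as the notebook does, keeps every experiment independent and is easier to read.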

Another way to find the best hyper-parameters is to use [GridSearchCV from sklearn.model_selection](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV).

In [18]:
# now we use GridSearchCV
from sklearn.model_selection import GridSearchCV

# re-seeding the random number generator
np.random.seed(42)

# the main part of the grid search is defining the param_grid
# param_grid is a dict with str keys and list values
# each key (str) is a parameter of the estimator; its list holds the values to try in order to find the best one
param_grid = {'n_estimators': list(range(10, 100, 10))}

# preparing the grid
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)

# use the grid to fit the whole data
grid.fit(x_data, y_data)

# show the best parameters found
grid.best_params_

{'n_estimators': 80}
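Besides best_params_, the fitted grid also keeps the cross-validated score of every candidate. A sketch on synthetic data with a smaller grid (demo_grid and the data are hypothetical, so the numbers will differ from the heart disease results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

demo_grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20, 30]},
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)  # the winning parameter combination
print(demo_grid.best_score_)   # its mean cross-validated accuracy

# mean cross-validated accuracy for every candidate in the grid
for params, mean_score in zip(
    demo_grid.cv_results_["params"], demo_grid.cv_results_["mean_test_score"]
):
    print(params, round(mean_score, 3))
```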

In [19]:
# best_estimator_ is the Random Forest Classifier that the grid search refit using grid.best_params_
# we replace our clf (from the first attempt) with this best estimator
clf = grid.best_estimator_
clf

In [20]:
# refit the best model on the training data only
clf = clf.fit(x_train, y_train)

In [21]:
# score the newly fitted model on the test data
clf.score(x_test, y_test)

0.868421052631579

#### 6. Save a model for someone else to use
When you've done a few experiments and you're happy with how your model is doing, you'll likely want someone else to be able to use it.

This may come in the form of a teammate or colleague trying to replicate and validate your results, or a customer using your model as part of a service or application you offer.

Saving a model also allows you to reuse it later without having to retrain it, which is helpful, especially when your training times start to increase.

You can save a scikit-learn model using [Python's built-in pickle module](https://docs.python.org/3/library/pickle.html), which serializes objects.

In this section I will serialize the model from the number-of-estimators experiment. Note that the last model from that loop was built with n_estimators of 90.

The score for this last model with 90 estimators is 86.8421052631579%, which we will use to validate the model once it is re-loaded.

In [22]:
# we use pickle to serialize the model (only the model not the whole notebook)
import pickle

# save the object called model (the last one from the n_estimators loop, n_estimators=90)
# the with block closes the file once the model is written
with open('random_forest_model_1.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

***NOTE***: After this command is run you can confirm that the random_forest_model_1.pkl file is created in the same folder as this ipynb file.

Let's validate by loading the model from the random_forest_model_1.pkl file

In [23]:
with open('random_forest_model_1.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)
loaded_model

***NOTE***: The loaded model is the same type and has the same n_estimators parameter as the latest model (which is 90). But we still need to verify that the model is consistent, which means it must have the same performance score as the original model.

In [24]:
# testing the loaded_model consistency
# the score should match the model (n_estimators=90) score which is 86.8421052631579%
loaded_model.score(x_test, y_test)

0.868421052631579

***NOTE***: the score on x_test and y_test is 0.868421052631579, which is the same as 86.8421052631579%.

This means the loaded_model is consistent with the original serialized model (n_estimators=90).
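Comparing scores is one check; comparing the predictions themselves is a stricter one. A self-contained sketch on synthetic data (hypothetical names), serializing to bytes in memory instead of a file:

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
original = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Serialize to bytes and back; the same works with a file via pickle.dump / pickle.load
restored = pickle.loads(pickle.dumps(original))

# The restored model should reproduce every prediction of the original
same_predictions = np.array_equal(original.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # True
```

As the pickle documentation warns, only load pickle files that come from a source you trust.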