## The Banknote Dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.

# created by Nikolay K. MTK: 673010


It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. The variable names are as follows:

* Variance of Wavelet Transformed image (continuous).
* Skewness of Wavelet Transformed image (continuous).
* Kurtosis of Wavelet Transformed image (continuous).
* Entropy of image (continuous).
* Class (0 for authentic, 1 for inauthentic).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 50%.
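
The majority-class baseline mentioned above can be checked directly from the label column. The following is a minimal sketch, not part of the original notebook: the helper name `majority_class_baseline` and the toy labels are assumptions made for illustration, and the real figure would come from passing the dataset's class column instead.

```python
import pandas as pd


def majority_class_baseline(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    # value_counts(normalize=True) gives the share of each class
    return float(labels.value_counts(normalize=True).max())


# Toy labels for illustration only (NOT the real banknote data)
toy_labels = pd.Series([0, 0, 0, 1, 1])
print(majority_class_baseline(toy_labels))  # 0.6
```

Any model worth keeping should beat this number on held-out data.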

In [11]:
# Let's lay out our plan here
solution_steps = [
    "1. Getting the data ready",
    "2. Choose the right estimator/algorithm for our problems",
    "3. Fit the model/algorithm and use it to make predictions on our data",
    "4. Evaluating a model",
    "5. Improve a model",
    "6. Save and load a trained model"]

In [12]:
solution_steps

['1. Getting the data ready',
 '2. Choose the right estimator/algorithm for our problems',
 '3. Fit the model/algorithm and use it to make predictions on our data',
 '4. Evaluating a model',
 '5. Improve a model',
 '6. Save and load a trained model']

In [6]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [10]:
import pandas as pd
banknote_class = pd.read_excel("data/banknote.xlsx")
banknote_class

| | Variance of Wavelet Transformed image | Skewness of Wavelet Transformed image | Kurtosis of Wavelet Transformed image | Entropy of image | Class (0 for authentic, 1 for inauthentic) |
|---|---|---|---|---|---|
| 0 | 3.62160 | 8.66610 | -2.8073 | -0.44699 | 0 |
| 1 | 4.54590 | 8.16740 | -2.4586 | -1.46210 | 0 |
| 2 | 3.86600 | -2.63830 | 1.9242 | 0.10645 | 0 |
| 3 | 3.45660 | 9.52280 | -4.0112 | -3.59440 | 0 |
| 4 | 0.32924 | -4.45520 | 4.5718 | -0.98880 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1367 | 0.40614 | 1.34920 | -1.4501 | -0.55949 | 1 |
| 1368 | -1.38870 | -4.87730 | 6.4774 | 0.34179 | 1 |
| 1369 | -3.75030 | -13.45860 | 17.5932 | -2.77710 | 1 |
| 1370 | -3.56370 | -8.38270 | 12.3930 | -1.28230 | 1 |


## 1. Getting our data ready to be used with machine learning
Split the data into features and labels (usually `X` and `y`).

In [15]:
# Features
X = banknote_class.drop("Class (0 for authentic, 1 for inauthentic)", axis=1)
X.head()

# Labels
y = banknote_class["Class (0 for authentic, 1 for inauthentic)"]
y.head()

# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1097, 4), (275, 4), (1097,), (275,))
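
Because the two classes are not perfectly balanced, it can help to keep the class ratio identical in the training and test sets. The sketch below uses the `stratify` argument of `train_test_split` on toy data; the arrays `X_toy` and `y_toy` are assumptions for illustration and are not the banknote data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration: 100 samples, 60 of class 0 and 40 of class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The share of class 1 is preserved in the training and test sets
print(y_tr.mean(), y_te.mean())  # 0.4 0.4
```

In the notebook this would amount to adding `stratify=y` to the existing call.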

## 2. Choose the right estimator/algorithm for our problems
Scikit-Learn refers to machine learning models and algorithms as estimators.

This is a classification problem: predicting a category (0 or 1). Because there are only two categories, it is a 2-class (binary) classification problem.

A classification estimator is often referred to as a model, an estimator (as in the Scikit-Learn documentation) or `clf` (short for classifier).

Hyperparameters are like knobs on an oven you can tune to cook your favourite dish.

To pick a model, we used the "Choosing the right estimator" flowchart from the Scikit-Learn site:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

### 2.1 Picking a machine learning model for a classification problem
Consulting the map, it suggests trying `LinearSVC`.

In [18]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

# Instantiate LinearSVC and fit it to the training data
clf = LinearSVC(dual=False)
clf.fit(X_train, y_train)

# Evaluate the LinearSVC on the training data (accuracy on data it has already seen)
clf.score(X_train, y_train)

0.9881494986326345

## 3. Fit the model/algorithm on our data and use it to make predictions
### 3.1 Fitting the model to the data
Different names for:

* `X` = features, features variables, data

* `y` = labels, targets, target variables

In [20]:
# Fit the model to the data (training the ML model)
clf.fit(X_train, y_train)

LinearSVC(dual=False)

In [22]:
# Evaluate the LinearSVC (use the patterns the model has learned)
clf.score(X_test, y_test)

0.9854545454545455

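### 3.2 Make predictions using an ML model
Scikit-Learn classifiers offer two common ways to make predictions:

    1. predict(), which returns class labels
    2. predict_proba(), which returns class probabilities

`LinearSVC` does not implement predict_proba(), so only predict() is used in this notebook. Its decision_function() method returns signed confidence scores instead.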
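
Since `LinearSVC` has no `predict_proba()`, here is a hedged sketch of two alternatives: reading raw scores from `decision_function()`, and wrapping the model in `CalibratedClassifierCV` to obtain probability estimates. The synthetic data from `make_classification` and the names `svc`, `calibrated`, `X_demo` and `y_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 4-feature binary problem (stand-in for the banknote data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

svc = LinearSVC(dual=False).fit(X_demo, y_demo)
print(hasattr(svc, "predict_proba"))  # False

# Signed scores: positive means class 1, negative means class 0
scores = svc.decision_function(X_demo[:5])

# Calibration wrapper that adds predict_proba() on top of LinearSVC
calibrated = CalibratedClassifierCV(LinearSVC(dual=False), cv=5)
calibrated.fit(X_demo, y_demo)
proba = calibrated.predict_proba(X_demo[:5])  # one row per sample, one column per class
print(proba.shape)  # (5, 2)
```

The sign of each score agrees with the label returned by `predict()`, while the calibrated probabilities in each row sum to 1.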

In [24]:
X_test.head()

| | Variance of Wavelet Transformed image | Skewness of Wavelet Transformed image | Kurtosis of Wavelet Transformed image | Entropy of image |
|---|---|---|---|---|
| 430 | 1.5691 | 6.3465 | -0.1828 | -2.4099 |
| 588 | -0.27802 | 8.1881 | -3.1338 | -2.5276 |
| 296 | 0.051979 | 7.0521 | -2.0541 | -3.1508 |
| 184 | -1.7559 | 11.9459 | 3.0946 | -4.8978 |
| 244 | 2.4287 | 9.3821 | -3.2477 | -1.4543 |


In [25]:
# Use a trained model to make predictions
clf.predict(X_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0], dtype=int64)

In [26]:
np.array(y_test) 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0], dtype=int64)

In [27]:
# Compare predictions to truth labels to evaluate the model
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)

0.9854545454545455

In [28]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.9854545454545455
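
The comparison `y_preds == y_test` produces a boolean array, and its mean is the accuracy because `True` counts as 1. The same array can also point at the misclassified samples. A small sketch on toy arrays (the names and values below are assumptions, not the notebook's data):

```python
import numpy as np

# Toy truth labels and predictions for illustration
y_true_toy = np.array([0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1])

matches = y_pred_toy == y_true_toy          # element-wise comparison
accuracy = matches.mean()                   # fraction of correct predictions
wrong_positions = np.flatnonzero(~matches)  # positions where prediction != truth

print(accuracy)         # 0.8
print(wrong_positions)  # [2]
```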

In [31]:
# Let's predict() on the first 10 rows of the same test data...
clf.predict(X_test[:10])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

In [32]:
X_test[:10]

| | Variance of Wavelet Transformed image | Skewness of Wavelet Transformed image | Kurtosis of Wavelet Transformed image | Entropy of image |
|---|---|---|---|---|
| 430 | 1.5691 | 6.3465 | -0.1828 | -2.4099 |
| 588 | -0.27802 | 8.1881 | -3.1338 | -2.5276 |
| 296 | 0.051979 | 7.0521 | -2.0541 | -3.1508 |
| 184 | -1.7559 | 11.9459 | 3.0946 | -4.8978 |
| 244 | 2.4287 | 9.3821 | -3.2477 | -1.4543 |
| 590 | 4.6352 | -3.0087 | 2.6773 | 1.212 |
| 78 | 0.24835 | 7.6439 | 0.9885 | -0.87371 |
| 708 | 5.1731 | 3.9606 | -1.983 | 0.40774 |
| 411 | 4.0047 | 0.45937 | 1.3621 | 1.6181 |
| 43 | 0.96441 | 5.8395 | 2.3235 | 0.066365 |


## 4. Evaluating a machine learning model
Three ways to evaluate Scikit-Learn models/estimators:

1. Estimator's built-in score() method
2. The scoring parameter
3. Problem-specific metric functions

You can read more about these here: https://scikit-learn.org/stable/modules/model_evaluation.html
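
The second option in the list, the `scoring` parameter, can be sketched as follows. This is an illustration on synthetic data, not the notebook's own run: `X_demo`, `y_demo` and `model` are assumed names, and `"accuracy"` and `"f1"` are two of Scikit-Learn's predefined scorer strings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic 4-feature binary problem (stand-in for the banknote data)
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
model = LinearSVC(dual=False)

# Same model and folds, scored with two different metrics
acc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
f1 = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")

print(np.round(acc, 3))
print(np.round(f1, 3))
```

When `scoring` is omitted, a classifier falls back to its own `score()` method, which is accuracy.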

In [35]:
# Import cross_val_score for cross-validation
from sklearn.model_selection import cross_val_score
# Evaluate using the estimator's built-in score() method
clf.score(X_test, y_test)

0.9854545454545455

In [36]:
cross_val_score(clf, X, y, cv=5)

array([0.98909091, 0.98181818, 0.97810219, 1.        , 0.98905109])

In [38]:
cross_val_score(clf, X, y, cv=10)

array([0.98550725, 0.99275362, 0.96350365, 0.99270073, 0.97080292,
       0.98540146, 1.        , 1.        , 0.99270073, 0.98540146])
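
Per-fold scores are easier to compare once they are summarised. The sketch below copies the fold scores printed above into arrays (the names `cv5_scores` and `cv10_scores` are assumptions) and reports their mean and spread.

```python
import numpy as np

# Fold scores copied from the two cross_val_score outputs above
cv5_scores = np.array([0.98909091, 0.98181818, 0.97810219, 1.0, 0.98905109])
cv10_scores = np.array([0.98550725, 0.99275362, 0.96350365, 0.99270073, 0.97080292,
                        0.98540146, 1.0, 1.0, 0.99270073, 0.98540146])

for name, scores in [("cv=5", cv5_scores), ("cv=10", cv10_scores)]:
    # The mean summarises performance, the standard deviation shows fold-to-fold variation
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

Both means land close to 0.987, in line with the single train/test split score.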

In [39]:
# Different classification metrics

# Accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_preds))

# Confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))

# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))

0.9854545454545455
[[146   2]
 [  2 125]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       148
           1       0.98      0.98      0.98       127

    accuracy                           0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275
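
The figures in the classification report can be re-derived by hand from the confusion matrix printed above. In Scikit-Learn's layout, rows are true classes and columns are predicted classes, so for a binary problem the four cells are TN, FP, FN and TP. A short worked sketch (variable names are assumptions):

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[146, 2],
               [2, 125]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 271 / 275
precision_1 = tp / (tp + fp)      # 125 / 127, for class 1
recall_1 = tp / (tp + fn)         # 125 / 127, for class 1
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The accuracy matches the 0.9854 reported earlier, and the class 1 values round to the 0.98 shown in the report.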



## 5. Improve through experimentation
There are two main ways to improve a model's baseline metrics (the first evaluation metrics you get): from a data perspective and from a model perspective.

From a data perspective, ask:

* Could we collect more data? In machine learning, more data is generally better, as it gives a model more opportunities to learn patterns.
* Could we improve our data? This could mean filling in missing values or finding a better encoding (turning things into numbers) strategy.

From a model perspective, ask:

* Is there a better model we could use? If you've started out with a simple model, could you use a more complex one? The Scikit-Learn machine learning map shows examples of this: ensemble methods are generally considered more complex models.
* Could we improve the current model? If the model you're using performs well straight out of the box, can the hyperparameters be tuned to make it even better?

Hyperparameters are settings on a model that you can adjust to change how it finds patterns, potentially improving its results. Adjusting hyperparameters is referred to as hyperparameter tuning.

In [41]:
# How to find a model's hyperparameters
clf = LinearSVC()
clf.get_params() # returns a dict of the model's adjustable hyperparameters

{'C': 1.0,
 'class_weight': None,
 'dual': True,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'loss': 'squared_hinge',
 'max_iter': 1000,
 'multi_class': 'ovr',
 'penalty': 'l2',
 'random_state': None,
 'tol': 0.0001,
 'verbose': 0}
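
Any key in this dictionary can be changed either when constructing the estimator or later with `set_params()`. A brief sketch, where the values chosen for `C` and `max_iter` are arbitrary assumptions for illustration:

```python
from sklearn.svm import LinearSVC

model = LinearSVC(dual=False)

# Change hyperparameters on an existing (not yet fitted) estimator
model.set_params(C=0.5, max_iter=5000)

params = model.get_params()
print(params["C"], params["max_iter"])  # 0.5 5000
```

The estimator has to be fitted again after its hyperparameters change for the new settings to take effect.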

In [45]:
# Example of adjusting hyperparameters by hand

# Split data into X & y
X = banknote_class.drop("Class (0 for authentic, 1 for inauthentic)", axis=1) # use all columns except target
y = banknote_class["Class (0 for authentic, 1 for inauthentic)"] # we want to predict y using X

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

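# Instantiate two models (both use identical settings here, so their scores match;
# change a hyperparameter such as C on one of them to compare different settings)
clf_1 = LinearSVC(dual=False)
clf_2 = LinearSVC(dual=False)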

# Fit both models on training data
clf_1.fit(X_train, y_train)
clf_2.fit(X_train, y_train)

# Evaluate both models on test data and see which is best
print(clf_1.score(X_test, y_test))
print(clf_2.score(X_test, y_test))

0.9854227405247813
0.9854227405247813
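
Adjusting hyperparameters by hand does not scale well, so a common next step is a cross-validated grid search. The following is a sketch under stated assumptions: it runs on synthetic data rather than the banknote data, and the grid of `C` values is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Synthetic 4-feature binary problem (stand-in for the banknote data)
X_demo, y_demo = make_classification(n_samples=400, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Candidate values for the regularisation strength C (illustrative only)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Try every candidate with 5-fold cross-validation on the training set
search = GridSearchCV(LinearSVC(dual=False), param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)           # best C found on the training folds
test_score = search.score(X_te, y_te)  # accuracy of the refitted best model
print(test_score)
```

`GridSearchCV` refits the best configuration on the whole training set by default, so `search` can be used like any fitted classifier afterwards.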


## 6. Save and reload your trained model
You can save and load a model with pickle.

In [68]:
# Saving a model with pickle
import pickle

# Save an existing model to file (the with block closes the file afterwards)
with open("lsvc_linear_svc_model_banknote_1.pkl", "wb") as f:
    pickle.dump(clf_2, f)

In [69]:
# Load a saved pickle model (the with block closes the file afterwards)
with open("lsvc_linear_svc_model_banknote_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)

# Evaluate loaded model
loaded_pickle_model.score(X_test, y_test)

0.9890909090909091
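
An alternative to pickle is `joblib`, which is installed alongside Scikit-Learn. The sketch below is an illustration on synthetic data: the file name `demo_model.joblib` and the temporary directory are assumptions, chosen so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 4-feature binary problem (stand-in for the banknote data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LinearSVC(dual=False).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)      # save the fitted model
    reloaded = joblib.load(model_path)  # load it back

# The reloaded model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print(same)  # True
```

As with pickle, only load model files from sources you trust, and reload them with the same library versions used to save them where possible.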