# Docker Exercise
## Implement Docker in an end-to-end project for Bank Note Authentication

## Bank Note Authentication
The data were extracted from images of genuine and forged banknote-like specimens. The images were digitized with an industrial camera normally used for print inspection. The final images are 400 x 400 pixels. Owing to the object lens and the distance to the specimen, the resulting gray-scale pictures have a resolution of about 660 dpi. A Wavelet Transform tool was used to extract features from the images.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

## Import Dataset

In [2]:
data = pd.read_csv("BankNote_Authentication.csv")

## Understanding the Data

In [3]:
data.head()
# the Wavelet Transform tool extracted four features from each image: variance, skewness, curtosis and entropy
# the model uses these features to predict whether a bank note is authentic (1) or not (0)

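|   | variance | skewness | curtosis | entropy | class |
|---|----------|----------|----------|---------|-------|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |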


In [4]:
data.shape
# has 1372 rows of data
# 4 features 1 target

(1372, 5)

In [5]:
data.isnull().sum()

variance    0
skewness    0
curtosis    0
entropy     0
class       0
dtype: int64

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   variance  1372 non-null   float64
 1   skewness  1372 non-null   float64
 2   curtosis  1372 non-null   float64
 3   entropy   1372 non-null   float64
 4   class     1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


## Feature Engineering

In [7]:
# class proportions: always predicting the majority class (0) gives a baseline accuracy of about 55.5%
baseline = data['class'].value_counts(normalize=True)
baseline

0    0.555394
1    0.444606
Name: class, dtype: float64
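The baseline can also be worked out by hand. The sketch below is only an illustration: the class counts (762 and 610) are not printed in the notebook but are inferred from the proportions above multiplied by 1372 rows, so treat them as an assumption.

```python
from collections import Counter

# Class counts inferred from the proportions printed above
# (0.555394 * 1372 and 0.444606 * 1372); treat them as an assumption.
labels = [0] * 762 + [1] * 610

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 6))
```

Any trained model should beat this figure comfortably before it is worth deploying.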

In [8]:
# set X as features
# set y as target
X = data.iloc[:,:-1].astype(np.float32)
y = data.iloc[:,-1]

In [9]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   variance  1372 non-null   float32
 1   skewness  1372 non-null   float32
 2   curtosis  1372 non-null   float32
 3   entropy   1372 non-null   float32
dtypes: float32(4)
memory usage: 21.6 KB


In [10]:
# split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=32)
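The split above is purely random. Because the classes are somewhat imbalanced, one possible variation is a stratified split, which keeps the class ratio nearly the same in the training and test sets. The sketch below uses stand-in data, not the banknote CSV: the label counts are inferred from the proportions reported earlier, and `X_demo` and `y_demo` are hypothetical names holding random placeholder features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the class counts inferred from the notebook's
# proportions (762 zeros, 610 ones); the features are random placeholders.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 762 + [1] * 610)
X_demo = rng.normal(size=(len(y_demo), 4)).astype(np.float32)

# stratify=y_demo keeps the class ratio nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=32, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

The two printed means are the share of class 1 in each split; with stratification they should agree closely.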

## Modelling and Evaluation

In [11]:
# create and fit model
model = RandomForestClassifier()
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [12]:
# prediction
y_pred = model.predict(X_test)

In [13]:
# Evaluate model
accuracy_score(y_test, y_pred)

0.9927272727272727
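To put the score in context, the arithmetic below recovers how many test rows were misclassified. It is a sketch that assumes scikit-learn rounds a fractional `test_size` up to a whole number of rows, and it takes the row count and accuracy directly from the outputs above.

```python
import math

n_samples = 1372   # rows reported by data.shape
test_size = 0.2    # fraction passed to train_test_split

# scikit-learn rounds a fractional test size up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Reported accuracy from the cell above.
accuracy = 0.9927272727272727
n_correct = round(accuracy * n_test)
n_errors = n_test - n_correct

print(n_train, n_test, n_correct, n_errors)
```

Under these assumptions the model got 273 of 275 test notes right, so the accuracy rests on two misclassified samples.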

## Pickle Model

In [14]:
# save the fitted model to a pickle file
with open("model.pkl", "wb") as pickle_out:
    pickle.dump(model, pickle_out)

In [15]:
# sanity-check prediction on one sample; feature order is variance, skewness, curtosis, entropy
array = np.array([[0.5,2,1,0.2]])
model.predict(array)

array([0], dtype=int64)
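Before the pickled model is used outside the notebook, it may be worth confirming that it loads back and predicts the same values. The following is a minimal sketch of that round trip, not the notebook's own code: it trains a small stand-in model on synthetic data (`X_demo`, `y_demo` and `demo_model` are hypothetical names) and writes to a temporary directory so the real `model.pkl` is left untouched.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with four features; not the banknote dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)).astype(np.float32)
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_model = RandomForestClassifier(n_estimators=10, random_state=0)
demo_model.fit(X_demo, y_demo)

# Write the model to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly.
sample = np.array([[0.5, 2, 1, 0.2]], dtype=np.float32)
same = np.array_equal(demo_model.predict(X_demo), loaded_model.predict(X_demo))
print(same, loaded_model.predict(sample).shape)
```

Pickle files are generally tied to the library version that wrote them, so the environment that loads the model should use the same scikit-learn version as the one that trained it.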