# ML Model Building

Data source:

https://www.kaggle.com/ritesaluja/bank-note-authentication-uci-data

Data were extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A wavelet transform tool was used to extract features from the images.

The dataset can be used for binary classification sample problems.

In [1]:
# import libraries for loading and handling the data
import pandas as pd
import numpy as np



In [3]:
df = pd.read_csv('BankNote_Authentication.csv')
df.head()

,variance,skewness,curtosis,entropy,class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   variance  1372 non-null   float64
 1   skewness  1372 non-null   float64
 2   curtosis  1372 non-null   float64
 3   entropy   1372 non-null   float64
 4   class     1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


In [7]:
df.describe(exclude='int')

,variance,skewness,curtosis,entropy
count,1372.0,1372.0,1372.0,1372.0
mean,0.433735,1.922353,1.397627,-1.191657
std,2.842763,5.869047,4.31003,2.101013
min,-7.0421,-13.7731,-5.2861,-8.5482
25%,-1.773,-1.7082,-1.574975,-2.41345
50%,0.49618,2.31965,0.61663,-0.58665
75%,2.821475,6.814625,3.17925,0.39481
max,6.8248,12.9516,17.9274,2.4495


In [8]:
# separate predictors and target
X = df.drop('class', axis=1)
y = df['class']


In [10]:
X.head()


,variance,skewness,curtosis,entropy
0,3.6216,8.6661,-2.8073,-0.44699
1,4.5459,8.1674,-2.4586,-1.4621
2,3.866,-2.6383,1.9242,0.10645
3,3.4566,9.5228,-4.0112,-3.5944
4,0.32924,-4.4552,4.5718,-0.9888


In [11]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: class, dtype: int64

In [12]:
## Train test split
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)
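
As a rough check of what this split produces, the sketch below repeats the same call on a synthetic stand-in with the same shape as the banknote data (1372 rows, 4 features). The names `X_demo`, `y_demo`, `n_train`, and `n_test` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): same shape as the banknote data, random values.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1372, 4))
y_demo = rng.integers(0, 2, size=1372)

# Same arguments as the notebook's split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# The test size is rounded up: ceil(0.2 * 1372) = 275; the rest is training data.
n_train, n_test = len(X_tr), len(X_te)
print(n_train, n_test)
```

If the two classes were noticeably imbalanced, passing `stratify=y` to `train_test_split` would be one option for keeping the class proportions equal in both parts; the notebook does not do this.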

In [18]:
# implement Random Forest with Cross Validation
from sklearn.ensemble import RandomForestClassifier # or RandomForestRegressor for regression tasks
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

In [17]:
# instantiate Random Forest model
rf_model = RandomForestClassifier(random_state=42)


In [19]:
# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5]
}

In [20]:
# Set up GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)


In [21]:
# Fit the grid search
grid_search.fit(X_train, y_train)


Fitting 5 folds for each of 18 candidates, totalling 90 fits
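
The counts in this message follow from the grid and the `cv` setting. A minimal sketch of the arithmetic, using scikit-learn's `ParameterGrid` to enumerate the combinations (the variable names ending in `_demo` are illustrative and not from the notebook):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined in the notebook.
param_grid_demo = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
}

# 3 * 3 * 2 combinations of hyperparameters.
n_candidates = len(ParameterGrid(param_grid_demo))

# Each candidate is trained once per cross-validation fold.
n_folds = 5
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)
```

With the default `refit=True`, `GridSearchCV` additionally trains the best candidate once more on the full training set after these fits.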


In [22]:
# Print results
print("Best hyperparameters found:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Evaluate on test data
best_rf_model = grid_search.best_estimator_
test_accuracy = best_rf_model.score(X_test, y_test)
print("Accuracy on the test set:", test_accuracy)

Best hyperparameters found: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 50}
Best cross-validation score: 0.9945330012453301
Accuracy on the test set: 0.9963636363636363
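
A test accuracy of 0.99636 is consistent with 274 of 275 test samples being classified correctly, but accuracy alone does not show which class the error falls in. The sketch below shows one way to break accuracy down with a confusion matrix. It runs on synthetic data from `make_classification` as a stand-in (an assumption), so its numbers say nothing about the banknote model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 4 numeric features, binary target.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=4, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred, labels=[0, 1])
acc = accuracy_score(y_te, y_pred)

# Accuracy is the diagonal (correct predictions) divided by all predictions.
print(cm)
print("accuracy:", acc)
```

In the notebook itself, the equivalent call would presumably be `confusion_matrix(y_test, best_rf_model.predict(X_test))`.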


In [23]:
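# predict one new sample; a DataFrame keeps the feature names used during fitting
grid_search.predict(pd.DataFrame([[2, 3, 4, 1]], columns=X.columns))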



array([0], dtype=int64)
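
`predict` returns only the class label. When a confidence estimate is useful, `predict_proba` returns the per-class probabilities. A minimal sketch on a small stand-in model follows; the synthetic data and the names `demo_model` and `sample` are assumptions, while the four column names are those of the dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["variance", "skewness", "curtosis", "entropy"]

# Hypothetical stand-in data with the same four column names.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=4, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)
demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(X_demo, y_demo)

# One new sample, passed with matching column names.
sample = pd.DataFrame([[2, 3, 4, 1]], columns=feature_names)
label = demo_model.predict(sample)        # predicted class label
proba = demo_model.predict_proba(sample)  # one row, one column per class
print(label, proba)
```

The same two calls are available on the fitted `grid_search` object, which delegates to the refit best estimator.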

In [None]:
# Save the fitted grid search (including the refit best model) using pickle
import pickle
with open("grid_search.pkl", "wb") as pickle_out:
    pickle.dump(grid_search, pickle_out)
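
To use the saved model later, it has to be loaded back with `pickle.load`. The sketch below shows a save-and-load round trip on a small stand-in model written to a temporary directory. The file name `model_demo.pkl` and the synthetic data are assumptions chosen so that the example runs on its own.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in model trained on synthetic data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=4, n_redundant=0, random_state=0
)
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.pkl")

    # Save: the with-block closes the file even if dump raises.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load: only unpickle files from a trusted source.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

A pickled scikit-learn model is generally only safe to load with the same scikit-learn version that produced it, so it is worth recording that version alongside the file.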