# Prediction of counterfeit notes using classification models

### Data Set Information:
Source: the Banknote Authentication dataset from the UCI Machine Learning Repository.

Data were extracted from images taken of genuine and forged banknote-like specimens. The specimens were digitized with an industrial camera normally used for print inspection. The final images are 400 x 400 pixels. Given the object lens and the distance to the specimen, this produced grayscale pictures with a resolution of about 660 dpi. A wavelet transform tool was used to extract features from the images.


#### Attribute Information:

1. variance of Wavelet Transformed image (continuous)
2. skewness of Wavelet Transformed image (continuous)
3. kurtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous)
5. class (integer)
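
To make the first three attributes concrete, the sketch below computes variance, skewness and excess kurtosis for a short list of numbers using only the standard library. The list `coeffs` is a made-up stand-in for wavelet coefficients, and the population (biased) moment formulas are an assumption: the source does not say which estimator or wavelet tool produced the published values.

```python
from statistics import fmean

def central_moment(values, order):
    """Return the central moment of the given order (population form)."""
    m = fmean(values)
    return fmean([(v - m) ** order for v in values])

def describe(values):
    """Return (variance, skewness, excess kurtosis) of a list of numbers."""
    var = central_moment(values, 2)
    skew = central_moment(values, 3) / var ** 1.5
    kurt = central_moment(values, 4) / var ** 2 - 3.0  # 0 for a normal distribution
    return var, skew, kurt

# Hypothetical values standing in for wavelet coefficients
coeffs = [-2.0, -1.0, 0.0, 1.0, 2.0]
variance, skewness, kurtosis = describe(coeffs)
print(variance, skewness, kurtosis)
```

A symmetric sample such as this one has zero skewness, and its negative excess kurtosis reflects tails that are lighter than those of a normal distribution.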

Machine learning algorithms learn from data. To identify whether a banknote is real, we therefore need a dataset that contains both genuine and counterfeit banknotes along with their features.

The dataset contains 1372 records, one per banknote. The first four columns are the features used to predict whether a note is genuine or counterfeit. The fifth column, `class`, is the label, which was assigned by a human: 0 represents a genuine banknote and 1 a counterfeit one.

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import pandas as pd
from pandas.plotting import scatter_matrix
from numpy import mean

%matplotlib inline
import matplotlib.pyplot as plt

In [3]:
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

In [4]:
## 1.a: Read the comma-separated data from the text file below with pandas and store it in the variable 'df'.
# Pass the names list below to the names argument of pd.read_csv and print the first 5 rows.
file_name = "data_banknote_authentication.txt"
names=['variance','skewness','kurtosis','entropy','class']
df = pd.read_csv(file_name,sep=',',header=None,names = names)
df.head()

| | variance | skewness | kurtosis | entropy | class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


In [5]:
# 1.b 
# Print the number of columns and rows of the dataset

print(df.shape)
print("")
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.')

(1372, 5)

There are 1372 rows and 5 columns.


In [6]:
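# Print the number of records for each class in the dataset

class_counts = df['class'].value_counts()
print(class_counts)
print("")
# Read the counts from the data instead of hard-coding them
print(f'There are {class_counts.loc[0]} and {class_counts.loc[1]} records in Class 0 and 1 respectively.')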

0    762
1    610
Name: class, dtype: int64

There are 762 and 610 records in Class 0 and 1 respectively.
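
The class counts give a reference point for the accuracy figures that follow. As a small worked example (not part of the original notebook), a classifier that always predicts the majority class would reach the accuracy computed below, so any useful model should clearly beat it. The dictionary `counts_by_class` simply restates the printed counts.

```python
# Class counts as printed by value_counts() above
counts_by_class = {0: 762, 1: 610}

total = sum(counts_by_class.values())
majority_class = max(counts_by_class, key=counts_by_class.get)

# Accuracy of always predicting the majority class
baseline_accuracy = counts_by_class[majority_class] / total

print(f"total records: {total}")
print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```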


In [7]:
# 2.a Train-test split
# The first 4 columns of the data frame are the explanatory variables X; the last column is the explained variable, class.
# Split the data frame into X and y, then do a train-test split with 20% held out as the test/validation set.
# Use a random seed of 55.

X = df[["variance","skewness","kurtosis","entropy"]]

y = df["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

In [8]:
# 2.b: Train a logistic regression classifier by fitting on the training data, 
# predict on the test set and print out the accuracy

LR = LogisticRegression()
LR.fit(X_train, y_train)
y_pred = LR.predict(X_test)
accuracy_score(y_test, y_pred)

0.9963636363636363
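
The score is easier to interpret as a count. The sketch below assumes that a fractional test size is rounded up to a whole number of rows; that is an assumption about `train_test_split`, although it is consistent with the scores reported in this notebook. Under that assumption the test set holds 275 notes, and the reported accuracy corresponds to a single misclassified note. The value of `n_wrong` is inferred from the score, not read from the model.

```python
import math

n_total = 1372
test_fraction = 0.2

# Assumed rounding rule: the test set size is rounded up
n_test = math.ceil(test_fraction * n_total)
n_train = n_total - n_test

# Hypothetical error count that reproduces the reported score
n_wrong = 1
accuracy = (n_test - n_wrong) / n_test

print(n_test, n_train)
print(accuracy)
```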

In [9]:
# 2.c Evaluating an algorithm with different parameters
# Evaluate the kNN algorithm by looping over values of the n_neighbors parameter,
# fitting on the training set, scoring on the held-out test set and the training set,
# and appending the accuracy scores to a list.
# Plot the accuracy scores as a function of the n_neighbors parameter

# Run kNN for n_neighbors = 1..25 and store the results
knn_r_acc = []

for i in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    test_score = knn.score(X_test, y_test)
    train_score = knn.score(X_train, y_train)
    knn_r_acc.append((i, test_score, train_score))

# Use a new name so the banknote data frame 'df' is not overwritten
knn_scores = pd.DataFrame(knn_r_acc, columns=['K','Test Score','Train Score'])

knn_scores

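| | K | Test Score | Train Score |
|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 |
| 1 | 2 | 1.0 | 1.0 |
| 2 | 3 | 1.0 | 0.999088 |
| 3 | 4 | 1.0 | 1.0 |
| 4 | 5 | 1.0 | 1.0 |
| 5 | 6 | 1.0 | 1.0 |
| 6 | 7 | 1.0 | 1.0 |
| 7 | 8 | 1.0 | 1.0 |
| 8 | 9 | 1.0 | 1.0 |
| 9 | 10 | 1.0 | 1.0 |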
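
For readers unfamiliar with what `n_neighbors` controls, here is a minimal pure-Python sketch of k-nearest-neighbours classification: find the k training points closest to a query and take a majority vote over their labels. The helper `knn_predict` and the toy points are hypothetical. The sketch uses Euclidean distance and breaks ties in favour of the closer neighbour; it is not scikit-learn's implementation.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Sort training indices by Euclidean distance to the query
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # Among equal counts, the label seen first (the closer neighbour) wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D data: three points per class (made up for illustration)
train_points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
                (3.0, 3.0), (3.2, 2.7), (2.8, 3.4)]
train_labels = [0, 0, 0, 1, 1, 1]

for k in (1, 3, 5):
    print(k, knn_predict(train_points, train_labels, (0.4, 0.4), k))
```

With well-separated classes like these, the prediction does not change as k grows, which mirrors the flat scores in the table above.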


In [10]:
# Using k-fold cross-validation
# 1. Use a seed of 55 and accuracy as the scoring metric.
# 2. Build a list named models containing LinearDiscriminantAnalysis, KNeighborsClassifier,
#    DecisionTreeClassifier, GaussianNB and SVC.
# 3. Use 10 folds.
# 4. Loop through each model in models:
#    4.a: Set cv as model_selection.KFold with the chosen n_splits and seed, and shuffle=True.
#    4.b: Compute the cross_val_score output for the model, training data and scoring metric.
#    4.c: Print the mean accuracy score for each model.

def get_models():
    models = list()
    models.append(LinearDiscriminantAnalysis())
    models.append(KNeighborsClassifier())
    models.append(DecisionTreeClassifier())
    models.append(GaussianNB())
    models.append(SVC())
    return models

def evaluate_model(cv, model):
    scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    return mean(scores)

cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=55)

models = get_models()

cv_results = list()

for model in models:
    cv_mean = evaluate_model(cv, model)
    cv_results.append(cv_mean)
    print('%s: cv=%.3f' % (type(model).__name__, cv_mean))

LinearDiscriminantAnalysis: cv=0.973
KNeighborsClassifier: cv=1.000
DecisionTreeClassifier: cv=0.985
GaussianNB: cv=0.841
SVC: cv=0.995
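
The 10-fold procedure used above can be sketched without scikit-learn: shuffle the row indices once, cut them into k nearly equal folds, and let each fold serve as the validation set exactly once. The function `kfold_indices` is a hypothetical helper for illustration, and its shuffle will not reproduce the exact folds that `KFold(random_state=55)` generates. The sample count of 1097 assumes a 275-row test set was held out of the 1372 records.

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Return a list of (train_indices, validation_indices) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # The first (n_samples % n_splits) folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds

# Assumed number of training rows: 1372 records minus a 275-row test set
folds = kfold_indices(n_samples=1097, n_splits=10, seed=55)
fold_sizes = [len(val) for _, val in folds]
print(fold_sizes)
```

Each model is then fitted on the training part of a fold and scored on its validation part, and the ten scores are averaged, which is what the `cv=` figures above report.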
