# Fraud Detection
## Solution
The problem is to provide a classification method that correctly labels a signature as genuine, disguised, or simulated (fraud).

**I propose two deep learning classifiers, a shallow model and a MiniVGG model, both of which are CNN architectures.**

## Libraries

In [1]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from pyimagesearch.preprocessing import ImageToArrayPreprocessor
from pyimagesearch.preprocessing import AspectAwarePreprocessor
from pyimagesearch.datasets import SimpleDatasetLoader
from pyimagesearch.nn.conv import ShallowNet
from pyimagesearch.nn.conv import MiniVGGNet
from keras.optimizers import SGD
from imutils import paths
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import argparse
import os

Using TensorFlow backend.


## Dataset
The objective of the challenge is to classify a signature into one of three classes (Genuine, Disguised, Simulated). For this reason I work only with these three classes.

### Reading the training dataset

In [3]:
dataset_path = 'TrainingSet'
image_paths = list(paths.list_images(dataset_path))
class_names = [pt.split(os.path.sep)[-2] for pt in image_paths]
class_names = [str(x) for x in np.unique(class_names)]

In [4]:
class_names

['1', '2', '3']

Where:
- 1:  Genuine
- 2:  Disguised
- 3:  Simulated(fraud)
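The folder-to-class mapping above can be kept in code so that reports show readable names. A minimal sketch, assuming images are stored as `TrainingSet/<label>/<file>.png`; `LABEL_NAMES` and `label_from_path` are hypothetical helpers, not part of `pyimagesearch`:

```python
import os

# Hypothetical lookup table for the three folder names used in this notebook.
LABEL_NAMES = {"1": "Genuine", "2": "Disguised", "3": "Simulated"}

def label_from_path(image_path):
    """Return the readable class name from the parent folder of an image path."""
    folder = os.path.basename(os.path.dirname(image_path))
    return LABEL_NAMES[folder]

print(label_from_path(os.path.join("TrainingSet", "2", "D037.png")))
```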

Some example image paths:

In [31]:
image_paths[::10]

['TrainingSet/2/D037.png',
 'TrainingSet/2/D114.png',
 'TrainingSet/1/G111.png',
 'TrainingSet/1/G097.png',
 'TrainingSet/1/G003.png',
 'TrainingSet/1/G102.png',
 'TrainingSet/1/G172.png',
 'TrainingSet/1/G084.png',
 'TrainingSet/1/G015.png',
 'TrainingSet/1/G187.png',
 'TrainingSet/3/S049.png',
 'TrainingSet/3/S138.png',
 'TrainingSet/3/S032.png',
 'TrainingSet/3/S081.png',
 'TrainingSet/3/S196.png',
 'TrainingSet/3/S142.png',
 'TrainingSet/3/S145.png',
 'TrainingSet/3/S022.png',
 'TrainingSet/3/S028.png',
 'TrainingSet/3/S008.png']

Resize every image to 128x90 pixels (width x height) and load the dataset

In [6]:
width = 128
height = 90
aap = AspectAwarePreprocessor(width, height)
iap = ImageToArrayPreprocessor()

In [7]:
sdl = SimpleDatasetLoader(preprocessors=[aap, iap])
(data, labels) = sdl.load(image_paths, verbose=50)

[INFO] processed 50/200
[INFO] processed 100/200
[INFO] processed 150/200
[INFO] processed 200/200


In [8]:
data = data.astype("float") / 255.0

Split the dataset into training and test sets. 30% of the dataset is held out for testing.

In [10]:
(trainX, testX, trainY, testY) = train_test_split(data, labels, test_size=0.30, random_state=42)

Number of images for training and for testing

In [32]:
print("Training images: ", len(trainX))
print("Testing images: ", len(testX))

Training images:  140
Testing images:  60


Count the number of images of each class in the training and test sets. Roughly 30% of the images of each class end up in the test set.

In [33]:
from collections import Counter

In [35]:
print(Counter(trainY))
print("Total:", len(trainY))

Counter({'3': 73, '1': 53, '2': 14})
Total: 140


In [37]:
print(Counter(testY))
print("Total:", len(testY))

Counter({'3': 31, '1': 23, '2': 6})
Total: 60
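The per-class proportions above come out close to 30% even though `train_test_split` was called without `stratify`, so the match is not guaranteed for a different `random_state`. Passing `stratify=labels` to `train_test_split` would enforce it. The idea behind a stratified split can be sketched in plain Python; `stratified_split` is a hypothetical helper written for illustration, not the scikit-learn implementation:

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.30, seed=42):
    """Return (train_idx, test_idx) with about test_size of each class held out."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Class totals obtained by adding the train and test counts printed above.
labels = ["1"] * 76 + ["2"] * 20 + ["3"] * 104
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

With these totals the held-out counts per class are fixed by the rounding rule, whichever images the shuffle happens to pick.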


Convert the labels to the one-hot format expected by the output layer of the CNN models

In [16]:
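# Fit the binarizer on the training labels only and reuse it for the test labels,
# so both sets share the same class-to-column mapping.
lb = LabelBinarizer()
trainY_b = lb.fit_transform(trainY)
testY_b = lb.transform(testY)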

The string labels and the one-hot rows then represent the same information

In [40]:
print(trainY[:5])
print(trainY_b[:5])

['3' '3' '1' '2' '1']
[[0 0 1]
 [0 0 1]
 [1 0 0]
 [0 1 0]
 [1 0 0]]
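What the binarizer does for three or more classes can be written out by hand. The following is a minimal sketch, not the `LabelBinarizer` implementation; `one_hot` is a hypothetical helper that orders the columns by sorted class name:

```python
def one_hot(labels):
    """Encode string labels as one-hot rows, with columns in sorted class order."""
    classes = sorted(set(labels))
    column = {label: i for i, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1
        rows.append(row)
    return classes, rows

classes, encoded = one_hot(["3", "3", "1", "2", "1"])
print(classes)
print(encoded)
```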


## Imbalanced classes
The dataset does not have the same number of images for each class (reference, disguised, genuine and simulated). To address this problem I **modify the cost function of the models to balance the dataset**

In [2]:
from sklearn.utils import class_weight

This is done with the `compute_class_weight` function, which assigns each class a weight that is inversely proportional to its number of instances. In our case we have:

In [18]:
# keyword arguments are required by recent scikit-learn releases
class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(trainY), y=trainY)

In [19]:
class_weights

array([0.88050314, 3.33333333, 0.63926941])

We get the following weight for each class:
- 1:  Genuine --> 0.88050314
- 2:  Disguised --> 3.33333333
- 3:  Simulated --> 0.63926941

These weights are taken into account in the cost function of the model
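The `'balanced'` mode computes each weight as `n_samples / (n_classes * class_count)`. A short plain-Python sketch reproduces the values above from the training counts; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * counts[label]) for label in sorted(counts)}

# Training-set counts reported earlier: 53 genuine, 14 disguised, 73 simulated.
train_labels = ["1"] * 53 + ["2"] * 14 + ["3"] * 73
weights = balanced_class_weights(train_labels)
for label, weight in weights.items():
    print(label, round(weight, 8))
```

The rarest class (disguised, 14 images) receives the largest weight, so each of its errors counts roughly five times as much as an error on a simulated signature.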

## Classification
### Shallow Model
Train this model for 5 epochs

In [20]:
# initialize the optimizer and model
print("[INFO] compiling model...")
optShallowNet = SGD(lr=0.005)
modelShallowNet = ShallowNet.build(width=width, height=height, depth=1, classes=3)
modelShallowNet.compile(loss="categorical_crossentropy", optimizer=optShallowNet,
                        metrics=["accuracy"])
# train the network
print("[INFO] training network...")
EPOCH = 5
# Keras documents class_weight as a dict mapping class index to weight
H_ShallowNet = modelShallowNet.fit(trainX, trainY_b, validation_data=(testX, testY_b),
                                   class_weight=dict(enumerate(class_weights)),
                                   batch_size=2, epochs=EPOCH, verbose=1)

[INFO] compiling model...
[INFO] training network...
Train on 140 samples, validate on 60 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Architecture of the model

In [21]:
modelShallowNet.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 90, 128, 32)       320       
_________________________________________________________________
activation_1 (Activation)    (None, 90, 128, 32)       0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 368640)            0         
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 1105923   
_________________________________________________________________
activation_2 (Activation)    (None, 3)                 0         
Total params: 1,106,243
Trainable params: 1,106,243
Non-trainable params: 0
_________________________________________________________________
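The parameter counts in the summary can be checked by hand. The sketch below assumes a 3x3 convolution kernel on a single input channel with the spatial size left unchanged, which is what the 320 parameters and the (90, 128, 32) output shape imply; the helper names are made up for illustration:

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    """One weight per input plus one bias, for each unit."""
    return (inputs + 1) * units

height, width, filters, classes = 90, 128, 32, 3
conv = conv2d_params(3, 3, 1, filters)   # assumed 3x3 kernel, 1 input channel
flat = height * width * filters          # size of the flattened feature map
dense = dense_params(flat, classes)
print(conv, flat, dense, conv + dense)
```

Almost all of the 1,106,243 parameters sit in the single dense layer, which is a lot to fit with only 140 training images.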


Prediction

In [22]:
# evaluate the network
print("[INFO] evaluating network...")
predictionsmodelShallowNet = modelShallowNet.predict(testX, batch_size=2)
print(classification_report(testY_b.argmax(axis=1), predictionsmodelShallowNet.argmax(axis=1), target_names=class_names))


[INFO] evaluating network...
              precision    recall  f1-score   support

           1       0.00      0.00      0.00        23
           2       0.00      0.00      0.00         6
           3       0.52      1.00      0.68        31

    accuracy                           0.52        60
   macro avg       0.17      0.33      0.23        60
weighted avg       0.27      0.52      0.35        60





Confusion matrix

In [24]:
from sklearn.metrics import confusion_matrix

In [25]:
confusion_matrix(testY_b.argmax(axis=1), predictionsmodelShallowNet.argmax(axis=1))

array([[ 0,  0, 23],
       [ 0,  0,  6],
       [ 0,  0, 31]])

The results are poor: every signature is classified as **simulated (fraud)**. One likely reason is the small number of images for each class.
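The 0.52 accuracy is exactly what a classifier that always predicts the most frequent class would score on this test set, which can be checked from the test counts alone. A small sketch (the variable names are illustrative):

```python
from collections import Counter

# Test-set class counts reported earlier in this notebook.
test_counts = Counter({"1": 23, "2": 6, "3": 31})

majority_label, majority_count = test_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(test_counts.values())
print(majority_label, round(baseline_accuracy, 2))
```

Any model worth keeping should beat this majority-class baseline on per-class recall, not only on accuracy.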

#### Possible solutions
- Data augmentation: increase the number of images of each class in order to have more samples and a balanced dataset.
- Use a more powerful CNN model

I decided to propose a second model, **MiniVGG**, which is a more powerful CNN.
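Data augmentation is not implemented in this notebook. As a rough illustration of the idea, the sketch below creates shifted copies of each image with NumPy; `random_shift` and `augment` are hypothetical helpers, the stand-in data is random, and the shift range is an arbitrary choice. In practice Keras's `ImageDataGenerator` covers the same ground with more transformations.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    """Shift a (height, width, channels) image by a few pixels, padding with edge values."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    pad = ((max_shift, max_shift), (max_shift, max_shift), (0, 0))
    padded = np.pad(image, pad, mode="edge")
    top, left = max_shift - dy, max_shift - dx
    h, w = image.shape[:2]
    return padded[top:top + h, left:left + w]

def augment(images, labels, copies, max_shift=4, seed=0):
    """Return `copies` shifted variants of every image, with matching labels."""
    rng = np.random.default_rng(seed)
    new_images, new_labels = [], []
    for image, label in zip(images, labels):
        for _ in range(copies):
            new_images.append(random_shift(image, max_shift, rng))
            new_labels.append(label)
    return np.stack(new_images), np.array(new_labels)

# Stand-in data with the image shape used in this notebook (90x128, 1 channel).
images = np.random.default_rng(1).random((3, 90, 128, 1))
labels = np.array(["1", "2", "3"])
aug_images, aug_labels = augment(images, labels, copies=5)
print(aug_images.shape, aug_labels.shape)
```

Generating more copies for the disguised class than for the others would also reduce the imbalance.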

### MiniVGG model
Training the model

In [26]:
print("[INFO] compiling model...")
opt = SGD(lr=0.05)
model = MiniVGGNet.build(width=width, height=height, depth=1, classes=len(class_names))
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

# train the network
print("[INFO] training network...")
H = model.fit(trainX, trainY_b, validation_data=(testX, testY_b), batch_size=2, epochs=EPOCH, verbose=1)

[INFO] compiling model...
[INFO] training network...
Train on 140 samples, validate on 60 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Prediction

In [27]:
# evaluate the network
print("[INFO] evaluating network...")
predictionsmodelMiniVGG = model.predict(testX, batch_size=2)
print(classification_report(testY_b.argmax(axis=1), predictionsmodelMiniVGG.argmax(axis=1), target_names=class_names))


[INFO] evaluating network...
              precision    recall  f1-score   support

           1       0.42      1.00      0.59        23
           2       0.67      0.33      0.44         6
           3       0.50      0.03      0.06        31

    accuracy                           0.43        60
   macro avg       0.53      0.46      0.36        60
weighted avg       0.49      0.43      0.30        60



In [28]:
confusion_matrix(testY_b.argmax(axis=1), predictionsmodelMiniVGG.argmax(axis=1))

array([[23,  0,  0],
       [ 3,  2,  1],
       [29,  1,  1]])

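These results are more useful than the previous ones, even though overall accuracy drops from 0.52 to 0.43: the model no longer predicts a single class, and the macro-averaged F1 score rises from 0.23 to 0.36. All 23 genuine signatures are classified correctly. Of the 6 disguised signatures, 2 are classified correctly, 3 as genuine and 1 as simulated (fraud). Of the 31 simulated (fraud) signatures, only 1 is classified correctly; 29 are classified as genuine and 1 as disguised.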
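The figures in the classification report follow directly from the confusion matrix. A plain-Python sketch of that calculation, with rows as true classes and columns as predicted classes (`per_class_metrics` is an illustrative helper, not a scikit-learn function):

```python
# MiniVGG confusion matrix from above; class order is 1, 2, 3.
confusion = [
    [23, 0, 0],
    [3, 2, 1],
    [29, 1, 1],
]

def per_class_metrics(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    metrics = []
    for k in range(len(matrix)):
        true_positive = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics.append((precision, recall, f1))
    return metrics

metrics = per_class_metrics(confusion)
accuracy = sum(confusion[k][k] for k in range(3)) / sum(map(sum, confusion))
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)
for name, (p, r, f1) in zip(["1", "2", "3"], metrics):
    print(name, round(p, 2), round(r, 2), round(f1, 2))
print("accuracy", round(accuracy, 2), "macro F1", round(macro_f1, 2))
```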

## Conclusions
- Data augmentation is necessary to obtain more samples of each class.
- MiniVGG is my baseline model, but it needs further tuning to adapt it to this problem.
- Another technique that could be tried is signature matching based on a distance metric. For example, use the reference signatures as the pattern and match the other kinds of signature (genuine or fraud) against them. A fraud should lie farther from the reference signatures than a genuine signature does.
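The distance-based matching idea can be sketched as follows. This is only an illustration under strong assumptions: each signature is already reduced to a fixed-length feature vector, Euclidean distance is used, and the threshold and toy vectors are made up. None of this is implemented in the notebook.

```python
import math

def nearest_reference_distance(features, reference_features):
    """Smallest Euclidean distance from a feature vector to any reference vector."""
    return min(math.dist(features, ref) for ref in reference_features)

def looks_genuine(features, reference_features, threshold):
    """Accept the signature when it lies within `threshold` of some reference."""
    return nearest_reference_distance(features, reference_features) <= threshold

# Toy 3-dimensional feature vectors, invented for illustration only.
references = [[0.10, 0.80, 0.30], [0.12, 0.78, 0.33]]
questioned_close = [0.11, 0.79, 0.31]
questioned_far = [0.60, 0.20, 0.90]

print(looks_genuine(questioned_close, references, threshold=0.1))
print(looks_genuine(questioned_far, references, threshold=0.1))
```

The threshold would have to be chosen on held-out genuine and simulated samples, and the feature extractor could be the convolutional part of a trained network.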

## Save the model as an H5 file
Save the MiniVGG model to disk

In [41]:
model.save("modelMiniVGGNet.h5")
print("Saved model to disk")

Saved model to disk
