# <center>Fraud Detection - TESTING</center>

# Victor Francisco
email: victorfco27@gmail.com

# Brief approach overview

The approach extracts shape and texture features from each image and uses them to train several classifiers as signature fraud detection models.
Images are first rescaled to 60% of their original size, keeping their height-to-width proportions, so that performance is not compromised.
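
As a rough illustration of the rescaling step, the sketch below applies `skimage.transform.rescale` with a single scalar factor to a synthetic array. The array and its size are assumptions made for the example; they are not taken from the dataset.

```python
import numpy as np
from skimage.transform import rescale

# Synthetic grayscale "scan" of 100 rows by 200 columns (a stand-in for a real image)
image = np.random.default_rng(0).random((100, 200))

# A single scalar factor is applied to both axes, so the proportions are kept
small = rescale(image, 0.6, anti_aliasing=True)

print(image.shape, "->", small.shape)
```

Passing one factor rather than a pair of factors is what keeps the height and width in the same ratio.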

### Importing all necessary libraries

In [1]:
import os
import glob
import cv2 as cv
import numpy as np
import pandas as pd
import SimpleITK as sitk
import matplotlib.pyplot as plt
from radiomics import featureextractor
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from skimage.io import imread, imshow
from skimage.filters import gaussian, threshold_otsu, median
from skimage.morphology import disk, dilation, erosion
from skimage.transform import rotate, rescale
from statistics import mean, stdev
from joblib import dump, load

## Image processing functions

### Create signature segmentation mask (not used for training by default; if you did use it during training, a few changes to the mask creation are needed for the test case)

Parameters: image as array.

Returns: binarized image as array of 1 and 0 values.

Note: Created to test whether it is better to use the whole image or a ROI (region of interest) obtained by segmenting the signature. In that test, using the whole image gave better classification results.

In [2]:
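def create_mask(img):
    # Grayscale erosion with a disk of radius 5 grows the dark regions
    eroded = erosion(img, disk(5))
    # Gaussian smoothing (sigma = 3) before thresholding
    blur = gaussian(eroded, 3)
    # Otsu's method picks a global threshold
    thresh = threshold_otsu(blur)
    binary = blur > thresh
    # Invert so that pixels at or below the threshold (the darker ones) become True
    binary = ~binary

    return binary.astype(int)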
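
To see what the mask looks like, here is a minimal sketch that runs the same steps on a synthetic page. The page (white background with one dark bar) is an assumption made for illustration, and the function is repeated only so the example is self-contained.

```python
import numpy as np
from skimage.filters import gaussian, threshold_otsu
from skimage.morphology import disk, erosion

def create_mask(img):
    # Same steps as the function above
    eroded = erosion(img, disk(5))
    blur = gaussian(eroded, sigma=3)
    thresh = threshold_otsu(blur)
    binary = ~(blur > thresh)
    return binary.astype(int)

# Synthetic page: white background (1.0) with one dark horizontal bar (0.0)
page = np.ones((100, 100))
page[40:60, 20:80] = 0.0

mask = create_mask(page)
print(mask.shape, np.unique(mask))
```

On this toy input the bar is marked with 1 and the background with 0. Real scans may need different structuring-element and sigma values.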

### Extract PyRadiomics features from image within a given mask

Parameters: path to image

Returns: list of features extracted from the image

Note: Features are extracted from the whole image, rescaled to 60% of its original size, so the mask is an array of ones with the same size as the rescaled image. The features are based on shape and texture.

[PyRadiomics]: https://pyradiomics.readthedocs.io/en/latest/features.html

More information on [PyRadiomics] features.

In [3]:
def extract_pyrad_features(img_path):
    image = imread(img_path, as_gray = True)
    image = rescale(image, 0.6, anti_aliasing = True)

    img = sitk.GetImageFromArray(image)

    # GetSize() returns the size as (x, y, z), while the array is indexed (z, y, x); [::-1] reverses the order
    mask_arr = np.ones(img.GetSize()[::-1], dtype = 'int')

    # Get the SimpleITK image object from the array
    mask = sitk.GetImageFromArray(mask_arr)
    
    # Use this line instead of the above to consider using the image segmentation mask option
#     mask = sitk.GetImageFromArray(create_mask(image))

    # Copy geometric information from the image (origin, spacing, direction)
    mask.CopyInformation(img)

    # Store the full mask
#     sitk.WriteImage(mask, '{}_mask.nrrd'.format(os.path.splitext(img_path)[0]), True)  # True specifies it can use compression
    
    extractor = featureextractor.RadiomicsFeatureExtractor()
    
    # Disable all feature classes, then enable every class except the 3D shape class
    extractor.disableAllFeatures()
    extractor.enableFeatureClassByName('shape2D')
    extractor.enableFeatureClassByName('firstorder')
    extractor.enableFeatureClassByName('glcm')
    extractor.enableFeatureClassByName('gldm')
    extractor.enableFeatureClassByName('glrlm')
    extractor.enableFeatureClassByName('glszm')
    extractor.enableFeatureClassByName('ngtdm')
    
    result = extractor.execute(img, mask)
    
    # The result starts with configuration information that precedes the actual extracted values.
    # The commented prints below check where the feature values begin.
    
#     print(list(result.keys())[21])
#     print(list(result.values())[21])
#     print(list(result.keys())[22])
#     print(list(result.values())[22])
    
    feat_values = list(result.values())[22:]
    
    return feat_values
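
The slice `[22:]` relies on the number of configuration entries at the start of the result. A possible alternative, sketched here on a small hand-written dictionary, is to drop entries by their `diagnostics_` key prefix. The keys and values below are illustrative stand-ins for what `extractor.execute` returns, and the resulting feature order would still have to match the one used when the classifier was trained.

```python
from collections import OrderedDict

# Hypothetical stand-in for the mapping returned by extractor.execute(img, mask)
result = OrderedDict([
    ("diagnostics_Versions_PyRadiomics", "v3.0"),
    ("diagnostics_Image-original_Mean", 0.42),
    ("original_shape2D_Perimeter", 812.5),
    ("original_firstorder_Mean", 0.42),
    ("original_glcm_Contrast", 3.1),
])

# Keep only the feature entries, preserving the extraction order
feat_names = [key for key in result if not key.startswith("diagnostics_")]
feat_values = [result[key] for key in feat_names]

print(feat_names)
```

Keeping `feat_names` alongside the values also makes it possible to label the dataframe columns later.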

## Data manipulation

### Images path definition

ref_path: path to the reference images folder

que_path: path to questioned images folder

In [4]:
ref_path = '/path/to/TestSet/Reference'
que_path = '/path/to/TestSet/Questioned'

ref_imgs = glob.glob(ref_path + "/*" )
que_imgs = glob.glob(que_path + "/*" )
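
`glob.glob` returns files in arbitrary order and matches every file in the folder. The sketch below shows one way to get a deterministic, image-only listing. The temporary folder, the file names, and the extension list are assumptions made for the example.

```python
import glob
import os
import tempfile

# Extensions treated as images in this sketch (an assumption, adjust as needed)
image_exts = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty placeholder files, including one non-image file
    for name in ("Q03.png", "Q01.png", "notes.txt", "Q02.PNG"):
        open(os.path.join(folder, name), "w").close()

    # Sort for a reproducible order and keep only image extensions
    paths = sorted(
        p for p in glob.glob(os.path.join(folder, "*"))
        if os.path.splitext(p)[1].lower() in image_exts
    )
    names = [os.path.basename(p) for p in paths]

print(names)
```

A fixed order makes it easier to match each prediction back to its file afterwards.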

### Feature extraction from reference images

In [5]:
ref_list = []
for i in ref_imgs:
    i_features = extract_pyrad_features(i)
    ref_list.append(i_features)
    
ref_df = pd.DataFrame(ref_list)

  warn('The default multichannel argument (None) is deprecated.  Please '
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated

### Feature extraction from questioned images

In [6]:
que_list = []
for i in que_imgs:
    i_features = extract_pyrad_features(i)
    que_list.append(i_features)
    
que_df = pd.DataFrame(que_list)

GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated

### Definition of labels for each image class

Genuine: 'g'; 
Disguise: 'd'; 
Simulated/Fraud/Forged: 'f'; 

### Combining the extracted features of the questioned and reference images in the same dataframe

In [7]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
full_df = pd.concat([que_df, ref_df], ignore_index = True)

### Checking for possible NaN in the dataset

In [8]:
full_df.isnull().values.any()

False
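
If the check above ever returned `True`, the next step would be to find where the missing values are. A minimal sketch on a made-up dataframe (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Small made-up feature table with one missing value
demo_df = pd.DataFrame({
    0: [1.0, 2.0, 3.0],
    1: [0.5, np.nan, 0.7],
    2: [9.0, 8.0, 7.0],
})

# Columns (features) and rows (images) that contain at least one NaN
nan_columns = demo_df.columns[demo_df.isnull().any()].tolist()
nan_rows = demo_df.index[demo_df.isnull().any(axis=1)].tolist()

print(nan_columns, nan_rows)
```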

### Data normalization

The selected normalization method is standardization, since it gives a better understanding of how the classes differ for each feature.

Since we do not want to predict reference images, we fit the scaler on the combined dataset (reference and questioned) and then apply the transformation only to the questioned data.

In [9]:
scaler = StandardScaler()
scaler.fit(full_df)
norm_df = scaler.transform(que_df)

In [10]:
norm_df = pd.DataFrame(norm_df)
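
The fit-on-combined, transform-on-questioned pattern can be checked by hand on random data. In the sketch below the `*_demo` frames are made-up stand-ins for the real feature tables. It compares `StandardScaler` with the explicit formula `z = (x - mean) / std`, where the mean and the population standard deviation come from the combined data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
que_demo = pd.DataFrame(rng.normal(5.0, 2.0, size=(6, 3)))  # stand-in for que_df
ref_demo = pd.DataFrame(rng.normal(5.0, 2.0, size=(4, 3)))  # stand-in for ref_df
full_demo = pd.concat([que_demo, ref_demo], ignore_index=True)

# Fit on the combined data, transform only the questioned rows
scaler = StandardScaler().fit(full_demo)
norm_demo = scaler.transform(que_demo)

# Same computation by hand, using the population standard deviation (ddof = 0)
full_arr = full_demo.to_numpy()
manual = (que_demo.to_numpy() - full_arr.mean(axis=0)) / full_arr.std(axis=0)

print(np.allclose(norm_demo, manual))
```

Because the statistics come from the combined data, the transformed questioned rows do not necessarily have zero mean and unit variance on their own.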

### Loading classifier

Set clf_path to the path of the saved model. Since I saved it in the same folder as this notebook, it should run as is.

In [11]:
clf_path = 'fraud.joblib'
classifier = load(clf_path)

In [12]:
prediction = classifier.predict(norm_df)
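
For reference, the save and load round trip with `joblib` might look like the sketch below. The k-nearest-neighbours model, the random training data, and the temporary file name are assumptions for illustration; the type of classifier stored in `fraud.joblib` is not shown in this notebook.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 4))    # made-up features
y_train = np.array(["g", "f"] * 10)   # made-up labels

# Hypothetical model, used only to demonstrate the round trip
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.joblib")
    dump(model, demo_path)      # what the training notebook would do
    restored = load(demo_path)  # what this notebook does

# The restored model should give the same predictions as the original
X_new = rng.normal(size=(5, 4))
same = np.array_equal(model.predict(X_new), restored.predict(X_new))
print(same)
```

The loaded model expects the same number and order of features as it saw during training, so the extraction settings here need to match those of the training notebook.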

### Prediction results

In [18]:
prediction

array(['f', 'f', 'f', 'f', 'f', 'g', 'f', 'f', 'f', 'g', 'f', 'f', 'f',
       'g', 'f', 'g', 'f', 'g', 'f', 'g', 'f', 'f', 'f', 'f', 'f', 'f',
       'g', 'f', 'f', 'g', 'g', 'f', 'g', 'f', 'f', 'd', 'f', 'f', 'f',
       'g', 'd', 'f', 'd', 'g', 'g', 'g', 'g', 'f', 'g', 'g', 'g', 'd',
       'f', 'f', 'g', 'f', 'g', 'g', 'f', 'f', 'f', 'f', 'f', 'g', 'f',
       'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'g', 'f', 'g', 'g', 'g',
       'f', 'g', 'd', 'g', 'f', 'f', 'g', 'g', 'd', 'f', 'f', 'f', 'g',
       'g', 'g', 'f', 'f', 'g', 'f', 'g', 'g', 'f'], dtype=object)
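
The raw array is hard to read on its own. One way to summarise it, sketched here with made-up file names and predictions rather than the real ones, is to pair each label with its file and count the labels per class.

```python
import os

import numpy as np
import pandas as pd

# Hypothetical stand-ins for que_imgs and prediction
demo_imgs = [
    "/data/Questioned/Q01.png",
    "/data/Questioned/Q02.png",
    "/data/Questioned/Q03.png",
    "/data/Questioned/Q04.png",
]
demo_pred = np.array(["f", "g", "f", "d"], dtype=object)

# Label meanings as defined earlier in the notebook
meanings = {"g": "genuine", "d": "disguise", "f": "forged"}

report = pd.DataFrame({
    "file": [os.path.basename(p) for p in demo_imgs],
    "label": demo_pred,
})
report["meaning"] = report["label"].map(meanings)

# Number of images predicted for each class
counts = report["label"].value_counts().to_dict()

print(report)
print(counts)
```

This pairing is only valid if the predictions are in the same order as the file list used for feature extraction.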