# <center>Fraud Detection - TRAINING</center>

# Victor Francisco
email: victorfco27@gmail.com

# Brief approach overview

The adopted approach consists of using shape and texture features extracted from the image in order to train different classifiers into models for signature fraud detection.
Images were first rescaled to 60% of the real size to not compromise performance, plus they keep their height and width proportions.

### Importing all necessary libraries

In [1]:
import os
import glob
import cv2 as cv
import numpy as np
import pandas as pd
import SimpleITK as sitk
import matplotlib.pyplot as plt
from radiomics import featureextractor
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from skimage.io import imread, imshow
from skimage.filters import gaussian, threshold_otsu, median
from skimage.morphology import disk, dilation, erosion
from skimage.transform import rotate, rescale
from statistics import mean, stdev
from joblib import dump, load

## Image processing functions

### Create signature segmentation mask (It is not being used by default)

Parameters: image as array.

Returns: binarized image as array of 1 and 0 values.

Note: Created for testing the hypothesis of whether use the whole image or create a ROI (region of interest) withing signature segmentation. As result of the test, using the whole image showed higher classification results.

In [2]:
def create_mask(img):
    eroded = erosion(img, disk(5))
    blur = gaussian(eroded, 3)
    thresh = threshold_otsu(blur)
    binary = blur > thresh
    binary = ~binary
    
    return binary.astype(int)

### Extract PyRadiomics features from image within a given mask

Parameters: path to image

Returns: list of features extracted from the image

Note: We are assuming the whole image rescaled to 60% of the actual size for feature extraction. Thus we create a mask of same size of the used image. The features are based on shape and texture.

[PyRadiomics]: https://pyradiomics.readthedocs.io/en/latest/features.html

More information on [PyRadiomics] features.

In [3]:
def extract_pyrad_features(img_path):
    image = imread(img_path, as_gray = True)
    image = rescale(image, 0.6, anti_aliasing = True)

    img = sitk.GetImageFromArray(image)

    # GetSize() return x, y, z size, array should be z, y, x. use [::-1] to reverse direction
    mask_arr = np.ones(img.GetSize()[::-1], dtype = 'int')

    # Get the SimpleITK image object from the array
    mask = sitk.GetImageFromArray(mask_arr)
    
    # Use this line instead of the above to consider using the image segmentation mask option
#     mask = sitk.GetImageFromArray(create_mask(image))

    # Copy geometric information from the image (origin, spacing, direction)
    mask.CopyInformation(img)

    # Store the full mask
#     sitk.WriteImage(mask, '{}_mask.nrrd'.format(os.path.splitext(img_path)[0]), True)  # True specifies it can use compression
    
    extractor = featureextractor.RadiomicsFeatureExtractor()
    
    # Disable all feature classes and enable all but shape3D
    extractor.disableAllFeatures()
    extractor.enableFeatureClassByName('shape2D')
    extractor.enableFeatureClassByName('firstorder')
    extractor.enableFeatureClassByName('glcm')
    extractor.enableFeatureClassByName('gldm')
    extractor.enableFeatureClassByName('glrlm')
    extractor.enableFeatureClassByName('glszm')
    extractor.enableFeatureClassByName('ngtdm')
    
    result = extractor.execute(img, mask)
    
    # Checking beginning of feature list, since it contains configuration information prior to the actual
    # extracted values
    
#     print(list(result.keys())[21])
#     print(list(result.values())[21])
#     print(list(result.keys())[22])
#     print(list(result.values())[22])
    
    feat_values = list(result.values())[22:]
    
    return feat_values

## Data manipulation

### Images path definition

ref_path: path to reference images folder

gen_path: path to genuine images folder

dis_path: path to disguise images folder

sim_path: path to simulated/forged/fraud images folder

In [4]:
ref_path = '/media/victorfco/Storage/Projects/Data/candidate-data/02-FraudDetection/TrainingSet/Reference'
gen_path = '/media/victorfco/Storage/Projects/Data/candidate-data/02-FraudDetection/TrainingSet/Genuine'
dis_path = '/media/victorfco/Storage/Projects/Data/candidate-data/02-FraudDetection/TrainingSet/Disguise'
sim_path = '/media/victorfco/Storage/Projects/Data/candidate-data/02-FraudDetection/TrainingSet/Simulated'

ref_imgs = glob.glob(ref_path + '/*' )
gen_imgs = glob.glob(gen_path + '/*' )
dis_imgs = glob.glob(dis_path + '/*' )
sim_imgs = glob.glob(sim_path + '/*' )

### Feature extraction from reference images

In [5]:
ref_list = []
for i in ref_imgs:
    i_features = extract_pyrad_features(i)
    ref_list.append(i_features)
    
ref_df = pd.DataFrame(ref_list)

  warn('The default multichannel argument (None) is deprecated.  Please '
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated


### Feature extraction from genuine images

In [6]:
gen_list = []
for i in gen_imgs:
    i_features = extract_pyrad_features(i)
    gen_list.append(i_features)
    
gen_df = pd.DataFrame(gen_list)

GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Avera

### Feature extraction from disguise images

In [7]:
dis_list = []
for i in dis_imgs:
    i_features = extract_pyrad_features(i)
    dis_list.append(i_features)
    
dis_df = pd.DataFrame(dis_list)

GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Avera

### Feature extraction from simulated/fraud images

In [8]:
sim_list = []
for i in sim_imgs:
    i_features = extract_pyrad_features(i)
    sim_list.append(i_features)
    
sim_df = pd.DataFrame(sim_list)

GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Avera

GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated
GLCM is symmetrical, therefore Sum Average = 2 * Joint Avera

### Definition of labels for each image class

Genuine: 'g'; 
Disguise: 'd'; 
Simulated/Fraud: 'f'; 
Reference: 'g'

Note: Since reference images are 'genuine', we are including them in the dataset as this class.

In [9]:
gen_labels = pd.DataFrame({'labels': ['g'] * gen_df.shape[0]})
dis_labels = pd.DataFrame({'labels': ['d'] * dis_df.shape[0]})
sim_labels = pd.DataFrame({'labels': ['f'] * sim_df.shape[0]})
ref_labels = pd.DataFrame({'labels': ['g'] * ref_df.shape[0]})

### Combining the extracted features of every image class in the same dataframe

In [10]:
full_df = gen_df.append(dis_df, ignore_index = True)
full_df = full_df.append(sim_df, ignore_index = True)
full_df = full_df.append(ref_df, ignore_index = True)

### Checking for possible NaN in the dataset

In [11]:
full_df.isnull().values.any()

False

### Combining class labels of every image in the same dataframe

In [12]:
labels_df = gen_labels.append(dis_labels, ignore_index = True)
labels_df = labels_df.append(sim_labels, ignore_index = True)
labels_df = labels_df.append(ref_labels, ignore_index = True)

### Data normalization

The selected method for data normalization is Standardization, it gives a better understading of the difference of classes for each feature.

In [13]:
scaler = StandardScaler()
norm_df = scaler.fit_transform(full_df)

In [14]:
norm_df = pd.DataFrame(norm_df)

### Data split process

Data was divided in TRAINING and TEST sets. The test set consists of 33% of the whole dataset.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(norm_df, labels_df, test_size=0.33, random_state=1)

We can verify that 102 features were extracted from each image.

In [16]:
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)

Training Features Shape: (140, 102)
Training Labels Shape: (140, 1)
Testing Features Shape: (69, 102)
Testing Labels Shape: (69, 1)


### Function to train models for each chosen classifier

Parameters: four arrays corresponding to train data, train data class labels, test data and test data class labels.

Returns: dictionary with accuracy and confusion matrix for each classifier model.

Note: We decided to evaluate four different classifiers: K-Nearest Neighbors (KNN), Naive Bayes (NB), Support Vector Machine (SVC) and Multilayer Perceptron (MLP). For KNN, we used 5 different values of k (1, 3, 5, 7 and 9). SVC kernel is RBF. MLP with simple 2 hidden layers (a and b) of sizes a = n_features+n_classes/2 and b = a/2. 

In [17]:
def training_classifiers(X_train, y_train, X_test, y_test):
    knn1 = KNeighborsClassifier(n_neighbors=1)
    knn3 = KNeighborsClassifier(n_neighbors=3)
    knn5 = KNeighborsClassifier(n_neighbors=5)
    knn7 = KNeighborsClassifier(n_neighbors=7)
    knn9 = KNeighborsClassifier(n_neighbors=9)
    nb = GaussianNB()
    svc = SVC(kernel = 'rbf', gamma = 'scale', random_state = 1)
    mlp = MLPClassifier(hidden_layer_sizes = (52, 26), max_iter = 1000, learning_rate = 'adaptive',
                        solver = 'adam', random_state = 1, tol = 0.000000001)
    
    knn1.fit(X_train, y_train)
    knn3.fit(X_train, y_train)
    knn5.fit(X_train, y_train)
    knn7.fit(X_train, y_train)
    knn9.fit(X_train, y_train)
    nb.fit(X_train, y_train)
    svc.fit(X_train, y_train)
    mlp.fit(X_train, y_train)

    y_knn1_pred = knn1.predict(X_test)
    y_knn3_pred = knn3.predict(X_test)
    y_knn5_pred = knn5.predict(X_test)
    y_knn7_pred = knn7.predict(X_test)
    y_knn9_pred = knn9.predict(X_test)
    y_nb_pred = nb.predict(X_test)
    y_svc_pred = svc.predict(X_test)
    y_mlp_pred = mlp.predict(X_test)

    knn1_acc = metrics.accuracy_score(y_test, y_knn1_pred)
    knn3_acc = metrics.accuracy_score(y_test, y_knn3_pred)
    knn5_acc = metrics.accuracy_score(y_test, y_knn5_pred)
    knn7_acc = metrics.accuracy_score(y_test, y_knn7_pred)
    knn9_acc = metrics.accuracy_score(y_test, y_knn9_pred)
    nb_acc = metrics.accuracy_score(y_test, y_nb_pred)
    svc_acc = metrics.accuracy_score(y_test, y_svc_pred)
    mlp_acc = metrics.accuracy_score(y_test, y_mlp_pred)
    
    result = {
        'knn-1': {
            'acc': knn1_acc,
            'c_matrix': metrics.confusion_matrix(y_test, y_knn1_pred)
        },
        'knn-3': {
            'acc': knn3_acc,
            'c_matrix': metrics.confusion_matrix(y_test, y_knn3_pred)
        },
        'knn-5': {
            'acc': knn5_acc,
            'c_matrix': metrics.confusion_matrix(y_test, y_knn5_pred)
        },
        'knn-7': {
            'acc': knn7_acc,
            'c_matrix': metrics.confusion_matrix(y_test, y_knn7_pred)
        },
        'knn-9': {
            'acc': knn9_acc,
            'c_matrix': metrics.confusion_matrix(y_test, y_knn9_pred)
        },
        'nb': {
            'acc': nb_acc,
            'c_matrix': metrics.confusion_matrix(y_test, y_nb_pred)
        },
        'svc': {
            'acc': svc_acc,
            'c_matrix': metrics.confusion_matrix(y_test, y_svc_pred)
        },
        'mlp': {
            'acc': mlp_acc,
            'c_matrix': metrics.confusion_matrix(y_test, y_mlp_pred)
        }
    }
    
    return result

### Function to print out accuracy and confusion matrix for each model classification result

Parameters: dictionary having accuracy and confusion matrix for each trained classifier.

Returns: None.

It only prints results.

In [18]:
def show_acc_cmat(result):
    for clf in result:
        print(clf)
        print('Accuracy: {:.2f}%'.format((result[clf]['acc'])*100))
        print('Confusion Matrix:\n{}'.format(result[clf]['c_matrix']))
        print('')
            

### Fuction to show a summary of the k-fold cross-validation process accuracy results

Parameters: dictionary with a list of accuracy values for each trained model.

Returns: None.

It only prints results.

In [19]:
def show_training_acc(result_summary):
    for clf in result_summary:
        print(clf)
        print('Accuracy (%):')
        print('Max: {:.2f}'.format(max(result_summary[clf])*100))
        print('Min: {:.2f}'.format(min(result_summary[clf])*100))
        print('Mean: {:.2f}'.format(mean(result_summary[clf])*100))
        print('SD: {:.2f}'.format(stdev(result_summary[clf])*100))
        print('')

## Stratified K-Fold (SKF) Cross-Validation process (k = 10 folds)

In [20]:
result_summary = {
    'knn-1': [],
    'knn-3': [],
    'knn-5': [],
    'knn-7': [],
    'knn-9': [],
    'nb': [],
    'svc': [],
    'mlp': []
}

k = 1
skf = StratifiedKFold(n_splits = 10)

for train_index, val_index in skf.split(X_train, y_train):

    X_skf_train, X_skf_test = X_train.iloc[train_index], X_train.iloc[val_index]
    y_skf_train, y_skf_test = y_train.iloc[train_index], y_train.iloc[val_index]
    
    fold_result = training_classifiers(X_skf_train, y_skf_train, X_skf_test, y_skf_test)
    
    print('')
    print('Fold {}'.format(k))
    k = k + 1
    show_acc_cmat(fold_result)
    
    for clf in result_summary:
        result_summary[clf].append(fold_result[clf]['acc'])

  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 1
knn-1
Accuracy: 62.50%
Confusion Matrix:
[[1 0 1]
 [0 4 4]
 [1 0 5]]

knn-3
Accuracy: 75.00%
Confusion Matrix:
[[1 0 1]
 [0 5 3]
 [0 0 6]]

knn-5
Accuracy: 68.75%
Confusion Matrix:
[[1 1 0]
 [0 5 3]
 [0 1 5]]

knn-7
Accuracy: 75.00%
Confusion Matrix:
[[1 1 0]
 [0 6 2]
 [0 1 5]]

knn-9
Accuracy: 75.00%
Confusion Matrix:
[[0 2 0]
 [0 6 2]
 [0 0 6]]

nb
Accuracy: 68.75%
Confusion Matrix:
[[2 0 0]
 [0 5 3]
 [0 2 4]]

svc
Accuracy: 93.75%
Confusion Matrix:
[[2 0 0]
 [0 7 1]
 [0 0 6]]

mlp
Accuracy: 81.25%
Confusion Matrix:
[[2 0 0]
 [0 6 2]
 [0 1 5]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 2
knn-1
Accuracy: 75.00%
Confusion Matrix:
[[1 0 1]
 [1 6 1]
 [0 1 5]]

knn-3
Accuracy: 87.50%
Confusion Matrix:
[[1 0 1]
 [0 8 0]
 [0 1 5]]

knn-5
Accuracy: 75.00%
Confusion Matrix:
[[1 0 1]
 [0 6 2]
 [0 1 5]]

knn-7
Accuracy: 81.25%
Confusion Matrix:
[[1 1 0]
 [0 7 1]
 [0 1 5]]

knn-9
Accuracy: 75.00%
Confusion Matrix:
[[1 1 0]
 [0 7 1]
 [0 2 4]]

nb
Accuracy: 62.50%
Confusion Matrix:
[[1 1 0]
 [0 4 4]
 [0 1 5]]

svc
Accuracy: 81.25%
Confusion Matrix:
[[1 1 0]
 [0 7 1]
 [0 1 5]]

mlp
Accuracy: 87.50%
Confusion Matrix:
[[1 0 1]
 [0 8 0]
 [0 1 5]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 3
knn-1
Accuracy: 68.75%
Confusion Matrix:
[[2 0 0]
 [0 5 3]
 [0 2 4]]

knn-3
Accuracy: 81.25%
Confusion Matrix:
[[2 0 0]
 [0 6 2]
 [0 1 5]]

knn-5
Accuracy: 81.25%
Confusion Matrix:
[[2 0 0]
 [0 6 2]
 [0 1 5]]

knn-7
Accuracy: 75.00%
Confusion Matrix:
[[2 0 0]
 [0 6 2]
 [0 2 4]]

knn-9
Accuracy: 87.50%
Confusion Matrix:
[[2 0 0]
 [0 7 1]
 [0 1 5]]

nb
Accuracy: 56.25%
Confusion Matrix:
[[2 0 0]
 [0 4 4]
 [0 3 3]]

svc
Accuracy: 87.50%
Confusion Matrix:
[[2 0 0]
 [0 8 0]
 [0 2 4]]

mlp
Accuracy: 87.50%
Confusion Matrix:
[[1 0 1]
 [0 8 0]
 [0 1 5]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 4
knn-1
Accuracy: 85.71%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 2 3]]

knn-3
Accuracy: 85.71%
Confusion Matrix:
[[1 1 0]
 [0 7 0]
 [0 1 4]]

knn-5
Accuracy: 100.00%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 0 5]]

knn-7
Accuracy: 100.00%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 0 5]]

knn-9
Accuracy: 100.00%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 0 5]]

nb
Accuracy: 85.71%
Confusion Matrix:
[[2 0 0]
 [0 6 1]
 [0 1 4]]

svc
Accuracy: 100.00%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 0 5]]

mlp
Accuracy: 92.86%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 1 4]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 5
knn-1
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

knn-3
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

knn-5
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

knn-7
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 2 3]]

knn-9
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

nb
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

svc
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

mlp
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 6
knn-1
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 1 4]]

knn-3
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 0 5]]

knn-5
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 1 4]]

knn-7
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 0 5]]

knn-9
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 1 4]]

nb
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 0 5]]

svc
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 5 1]
 [0 1 4]]

mlp
Accuracy: 76.92%
Confusion Matrix:
[[0 0 1]
 [1 6 0]
 [0 1 4]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 7
knn-1
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

knn-3
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

knn-5
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

knn-7
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

knn-9
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

nb
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [1 5 1]
 [0 2 3]]

svc
Accuracy: 92.31%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 0 5]]

mlp
Accuracy: 84.62%
Confusion Matrix:
[[0 1 0]
 [0 7 0]
 [0 1 4]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 8
knn-1
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 1 4]]

knn-3
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

knn-5
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 0 5]]

knn-7
Accuracy: 61.54%
Confusion Matrix:
[[1 0 0]
 [1 2 4]
 [0 0 5]]

knn-9
Accuracy: 61.54%
Confusion Matrix:
[[1 0 0]
 [1 2 4]
 [0 0 5]]

nb
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

svc
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 0 5]]

mlp
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 1 4]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 9
knn-1
Accuracy: 61.54%
Confusion Matrix:
[[0 1 0]
 [0 5 2]
 [1 1 3]]

knn-3
Accuracy: 69.23%
Confusion Matrix:
[[0 1 0]
 [0 6 1]
 [1 1 3]]

knn-5
Accuracy: 61.54%
Confusion Matrix:
[[0 1 0]
 [0 5 2]
 [1 1 3]]

knn-7
Accuracy: 53.85%
Confusion Matrix:
[[0 1 0]
 [0 4 3]
 [1 1 3]]

knn-9
Accuracy: 53.85%
Confusion Matrix:
[[0 1 0]
 [0 4 3]
 [0 2 3]]

nb
Accuracy: 61.54%
Confusion Matrix:
[[1 0 0]
 [0 4 3]
 [0 2 3]]

svc
Accuracy: 69.23%
Confusion Matrix:
[[0 1 0]
 [0 6 1]
 [0 2 3]]

mlp
Accuracy: 61.54%
Confusion Matrix:
[[0 1 0]
 [0 5 2]
 [1 1 3]]


Fold 10
knn-1
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [1 5 1]
 [0 0 5]]

knn-3
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [1 5 1]
 [0 0 5]]

knn-5
Accuracy: 100.00%
Confusion Matrix:
[[1 0 0]
 [0 7 0]
 [0 0 5]]

knn-7
Accuracy: 92.31%
Confusion Matrix:
[[1 0 0]
 [0 7 0]
 [0 1 4]]

knn-9
Accuracy: 92.31%
Confusion Matrix:
[[1 0 0]
 [0 7 0]
 [0 1 4]]

nb
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [2 4 1]
 [0 1 4]]

svc
Accuracy



### SKF results summary

In [21]:
show_training_acc(result_summary)

knn-1
Accuracy (%):
Max: 85.71
Min: 61.54
Mean: 74.58
SD: 8.91

knn-3
Accuracy (%):
Max: 87.50
Min: 69.23
Mean: 80.64
SD: 5.87

knn-5
Accuracy (%):
Max: 100.00
Min: 61.54
Mean: 79.42
SD: 12.68

knn-7
Accuracy (%):
Max: 100.00
Min: 53.85
Mean: 76.97
SD: 13.67

knn-9
Accuracy (%):
Max: 100.00
Min: 53.85
Mean: 77.60
SD: 14.03

nb
Accuracy (%):
Max: 85.71
Min: 56.25
Mean: 71.17
SD: 9.79

svc
Accuracy (%):
Max: 100.00
Min: 69.23
Mean: 84.71
SD: 9.84

mlp
Accuracy (%):
Max: 92.86
Min: 61.54
Mean: 81.83
SD: 9.36



During training and validation, in average, we can assume SVC as best performer classifier.

## Model evaluation on test set

This step shows the classifier models performance according to data not used during training and validation processes.

In [22]:
test_result = training_classifiers(X_train, y_train, X_test, y_test)
show_acc_cmat(test_result)

  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


knn-1
Accuracy: 88.41%
Confusion Matrix:
[[ 5  0  1]
 [ 1 25  5]
 [ 0  1 31]]

knn-3
Accuracy: 89.86%
Confusion Matrix:
[[ 5  0  1]
 [ 0 27  4]
 [ 1  1 30]]

knn-5
Accuracy: 86.96%
Confusion Matrix:
[[ 6  0  0]
 [ 1 25  5]
 [ 1  2 29]]

knn-7
Accuracy: 89.86%
Confusion Matrix:
[[ 5  0  1]
 [ 0 28  3]
 [ 0  3 29]]

knn-9
Accuracy: 89.86%
Confusion Matrix:
[[ 5  0  1]
 [ 0 27  4]
 [ 0  2 30]]

nb
Accuracy: 81.16%
Confusion Matrix:
[[ 5  1  0]
 [ 3 22  6]
 [ 0  3 29]]

svc
Accuracy: 95.65%
Confusion Matrix:
[[ 5  1  0]
 [ 0 31  0]
 [ 0  2 30]]

mlp
Accuracy: 91.30%
Confusion Matrix:
[[ 5  0  1]
 [ 0 29  2]
 [ 0  3 29]]





#### SVC showed higher accuracy and confusion matrix with less misclassified instances.

## Principal Component Analysis (PCA) - SKF cross validation

Here we decided to evaluate whether we can reduce the feature vector to optimize training and to obtain a possible better classification result.
In this case, we want the minimum number of principal components to retain 95% of the data variance.

In [23]:
pca = PCA(.95)

pca_df = pd.DataFrame(pca.fit_transform(X_train))
print('Number of components: {}'.format(pca.n_components_))
pca_n_comp = pca.n_components_

result_summary = {
    'knn-1': [],
    'knn-3': [],
    'knn-5': [],
    'knn-7': [],
    'knn-9': [],
    'nb': [],
    'svc': [],
    'mlp': []
}

k = 1
skf = StratifiedKFold(n_splits = 10)

for train_index, val_index in skf.split(pca_df, y_train):

    X_skf_train, X_skf_test = X_train.iloc[train_index], X_train.iloc[val_index]
    y_skf_train, y_skf_test = y_train.iloc[train_index], y_train.iloc[val_index]
    
    fold_result = training_classifiers(X_skf_train, y_skf_train, X_skf_test, y_skf_test)
    
    print('')
    print('Fold {}'.format(k))
    k = k + 1
    show_acc_cmat(fold_result)
    
    for clf in result_summary:
        result_summary[clf].append(fold_result[clf]['acc'])

  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


Number of components: 6


  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 1
knn-1
Accuracy: 62.50%
Confusion Matrix:
[[1 0 1]
 [0 4 4]
 [1 0 5]]

knn-3
Accuracy: 75.00%
Confusion Matrix:
[[1 0 1]
 [0 5 3]
 [0 0 6]]

knn-5
Accuracy: 68.75%
Confusion Matrix:
[[1 1 0]
 [0 5 3]
 [0 1 5]]

knn-7
Accuracy: 75.00%
Confusion Matrix:
[[1 1 0]
 [0 6 2]
 [0 1 5]]

knn-9
Accuracy: 75.00%
Confusion Matrix:
[[0 2 0]
 [0 6 2]
 [0 0 6]]

nb
Accuracy: 68.75%
Confusion Matrix:
[[2 0 0]
 [0 5 3]
 [0 2 4]]

svc
Accuracy: 93.75%
Confusion Matrix:
[[2 0 0]
 [0 7 1]
 [0 0 6]]

mlp
Accuracy: 81.25%
Confusion Matrix:
[[2 0 0]
 [0 6 2]
 [0 1 5]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 2
knn-1
Accuracy: 75.00%
Confusion Matrix:
[[1 0 1]
 [1 6 1]
 [0 1 5]]

knn-3
Accuracy: 87.50%
Confusion Matrix:
[[1 0 1]
 [0 8 0]
 [0 1 5]]

knn-5
Accuracy: 75.00%
Confusion Matrix:
[[1 0 1]
 [0 6 2]
 [0 1 5]]

knn-7
Accuracy: 81.25%
Confusion Matrix:
[[1 1 0]
 [0 7 1]
 [0 1 5]]

knn-9
Accuracy: 75.00%
Confusion Matrix:
[[1 1 0]
 [0 7 1]
 [0 2 4]]

nb
Accuracy: 62.50%
Confusion Matrix:
[[1 1 0]
 [0 4 4]
 [0 1 5]]

svc
Accuracy: 81.25%
Confusion Matrix:
[[1 1 0]
 [0 7 1]
 [0 1 5]]

mlp
Accuracy: 87.50%
Confusion Matrix:
[[1 0 1]
 [0 8 0]
 [0 1 5]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 3
knn-1
Accuracy: 68.75%
Confusion Matrix:
[[2 0 0]
 [0 5 3]
 [0 2 4]]

knn-3
Accuracy: 81.25%
Confusion Matrix:
[[2 0 0]
 [0 6 2]
 [0 1 5]]

knn-5
Accuracy: 81.25%
Confusion Matrix:
[[2 0 0]
 [0 6 2]
 [0 1 5]]

knn-7
Accuracy: 75.00%
Confusion Matrix:
[[2 0 0]
 [0 6 2]
 [0 2 4]]

knn-9
Accuracy: 87.50%
Confusion Matrix:
[[2 0 0]
 [0 7 1]
 [0 1 5]]

nb
Accuracy: 56.25%
Confusion Matrix:
[[2 0 0]
 [0 4 4]
 [0 3 3]]

svc
Accuracy: 87.50%
Confusion Matrix:
[[2 0 0]
 [0 8 0]
 [0 2 4]]

mlp
Accuracy: 87.50%
Confusion Matrix:
[[1 0 1]
 [0 8 0]
 [0 1 5]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 4
knn-1
Accuracy: 85.71%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 2 3]]

knn-3
Accuracy: 85.71%
Confusion Matrix:
[[1 1 0]
 [0 7 0]
 [0 1 4]]

knn-5
Accuracy: 100.00%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 0 5]]

knn-7
Accuracy: 100.00%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 0 5]]

knn-9
Accuracy: 100.00%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 0 5]]

nb
Accuracy: 85.71%
Confusion Matrix:
[[2 0 0]
 [0 6 1]
 [0 1 4]]

svc
Accuracy: 100.00%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 0 5]]

mlp
Accuracy: 92.86%
Confusion Matrix:
[[2 0 0]
 [0 7 0]
 [0 1 4]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 5
knn-1
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

knn-3
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

knn-5
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

knn-7
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 2 3]]

knn-9
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

nb
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

svc
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]

mlp
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 2 3]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 6
knn-1
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 1 4]]

knn-3
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 0 5]]

knn-5
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 1 4]]

knn-7
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 0 5]]

knn-9
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 1 4]]

nb
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 0 5]]

svc
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 5 1]
 [0 1 4]]

mlp
Accuracy: 76.92%
Confusion Matrix:
[[0 0 1]
 [1 6 0]
 [0 1 4]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 7
knn-1
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

knn-3
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

knn-5
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

knn-7
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

knn-9
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

nb
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [1 5 1]
 [0 2 3]]

svc
Accuracy: 92.31%
Confusion Matrix:
[[1 0 0]
 [0 6 1]
 [0 0 5]]

mlp
Accuracy: 84.62%
Confusion Matrix:
[[0 1 0]
 [0 7 0]
 [0 1 4]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 8
knn-1
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 1 4]]

knn-3
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

knn-5
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 0 5]]

knn-7
Accuracy: 61.54%
Confusion Matrix:
[[1 0 0]
 [1 2 4]
 [0 0 5]]

knn-9
Accuracy: 61.54%
Confusion Matrix:
[[1 0 0]
 [1 2 4]
 [0 0 5]]

nb
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 0 5]]

svc
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [1 4 2]
 [0 0 5]]

mlp
Accuracy: 76.92%
Confusion Matrix:
[[1 0 0]
 [0 5 2]
 [0 1 4]]



  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



Fold 9
knn-1
Accuracy: 61.54%
Confusion Matrix:
[[0 1 0]
 [0 5 2]
 [1 1 3]]

knn-3
Accuracy: 69.23%
Confusion Matrix:
[[0 1 0]
 [0 6 1]
 [1 1 3]]

knn-5
Accuracy: 61.54%
Confusion Matrix:
[[0 1 0]
 [0 5 2]
 [1 1 3]]

knn-7
Accuracy: 53.85%
Confusion Matrix:
[[0 1 0]
 [0 4 3]
 [1 1 3]]

knn-9
Accuracy: 53.85%
Confusion Matrix:
[[0 1 0]
 [0 4 3]
 [0 2 3]]

nb
Accuracy: 61.54%
Confusion Matrix:
[[1 0 0]
 [0 4 3]
 [0 2 3]]

svc
Accuracy: 69.23%
Confusion Matrix:
[[0 1 0]
 [0 6 1]
 [0 2 3]]

mlp
Accuracy: 61.54%
Confusion Matrix:
[[0 1 0]
 [0 5 2]
 [1 1 3]]


Fold 10
knn-1
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [1 5 1]
 [0 0 5]]

knn-3
Accuracy: 84.62%
Confusion Matrix:
[[1 0 0]
 [1 5 1]
 [0 0 5]]

knn-5
Accuracy: 100.00%
Confusion Matrix:
[[1 0 0]
 [0 7 0]
 [0 0 5]]

knn-7
Accuracy: 92.31%
Confusion Matrix:
[[1 0 0]
 [0 7 0]
 [0 1 4]]

knn-9
Accuracy: 92.31%
Confusion Matrix:
[[1 0 0]
 [0 7 0]
 [0 1 4]]

nb
Accuracy: 69.23%
Confusion Matrix:
[[1 0 0]
 [2 4 1]
 [0 1 4]]

svc
Accuracy



### SKF results summary after PCA

In [24]:
show_training_acc(result_summary)

knn-1
Accuracy (%):
Max: 85.71
Min: 61.54
Mean: 74.58
SD: 8.91

knn-3
Accuracy (%):
Max: 87.50
Min: 69.23
Mean: 80.64
SD: 5.87

knn-5
Accuracy (%):
Max: 100.00
Min: 61.54
Mean: 79.42
SD: 12.68

knn-7
Accuracy (%):
Max: 100.00
Min: 53.85
Mean: 76.97
SD: 13.67

knn-9
Accuracy (%):
Max: 100.00
Min: 53.85
Mean: 77.60
SD: 14.03

nb
Accuracy (%):
Max: 85.71
Min: 56.25
Mean: 71.17
SD: 9.79

svc
Accuracy (%):
Max: 100.00
Min: 69.23
Mean: 84.71
SD: 9.84

mlp
Accuracy (%):
Max: 92.86
Min: 61.54
Mean: 81.83
SD: 9.36



During validation, the models do not seem to improve in performance.

## PCA Analysis - Evaluation of models on test set

Considering the number of principal components selected in the previous section, we perform an evaluation on test data.

In [25]:
pca_2 = PCA(n_components = pca_n_comp)
pca_train_df = pd.DataFrame(pca_2.fit_transform(X_train))
pca_test_df = pd.DataFrame(pca_2.fit_transform(X_test))

pca_test_result = training_classifiers(pca_train_df, y_train, pca_test_df, y_test)
show_acc_cmat(pca_test_result)

  if sys.path[0] == '':
  del sys.path[0]
  
  from ipykernel import kernelapp as app
  app.launch_new_instance()
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


knn-1
Accuracy: 60.87%
Confusion Matrix:
[[ 0  5  1]
 [11 14  6]
 [ 1  3 28]]

knn-3
Accuracy: 56.52%
Confusion Matrix:
[[ 0  6  0]
 [11 12  8]
 [ 1  4 27]]

knn-5
Accuracy: 55.07%
Confusion Matrix:
[[ 0  6  0]
 [11 11  9]
 [ 1  4 27]]

knn-7
Accuracy: 60.87%
Confusion Matrix:
[[ 0  6  0]
 [ 9 14  8]
 [ 1  3 28]]

knn-9
Accuracy: 60.87%
Confusion Matrix:
[[ 0  6  0]
 [ 9 13  9]
 [ 1  2 29]]

nb
Accuracy: 59.42%
Confusion Matrix:
[[ 0  5  1]
 [ 4 16 11]
 [ 0  7 25]]

svc
Accuracy: 57.97%
Confusion Matrix:
[[ 0  6  0]
 [ 7 17  7]
 [ 0  9 23]]

mlp
Accuracy: 60.87%
Confusion Matrix:
[[ 1  5  0]
 [ 8 17  6]
 [ 1  7 24]]





#### We can see a degradation of performance by all classifiers once we reduced the feature vector size. Thus we will proceed to train our final model WITHOUT any PCA procedure.

## Training final model

Since SVC showed highest performance among all other classifiers, we decided to select it as to-go option for this problem.

In [26]:
svc = SVC(kernel = 'rbf', gamma = 'scale', random_state = 1)
svc.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=1, shrinking=True, tol=0.001,
    verbose=False)

### Saving trained model

In [27]:
dump(svc, 'fraud.joblib')

['fraud.joblib']