<a href="https://colab.research.google.com/github/marparven1/MachineLearning_Classification/blob/main/Fraud_Detection_NN_Loyola_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise: Fraud detection
This exercise puts supervised learning concepts and techniques into practice, classification in particular. It is divided into the following sections:
1. data exploration using classical statistical techniques,
2. feature selection,
3. training of different classifiers,
4. evaluation.

# Libraries:
We import all the libraries used in the exercise.




In [None]:
# Import libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
!pip install --upgrade --no-cache-dir gdown

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gdown
  Downloading gdown-4.5.3.tar.gz (14 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: gdown
  Building wheel for gdown (PEP 517) ... done
  Created wheel for gdown: filename=gdown-4.5.3-py3-none-any.whl size=14841 sha256=5b3667eb301f647f04062066dd99343785ca86aaf491418bd1d472516a395999
  Stored in directory: /tmp/pip-ephem-wheel-cache-l_6nzvlq/wheels/94/8d/0b/bdcd83555c3555f91a33f6c2384428d9f163c7d75ab0d272b4
Successfully built gdown
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.4.0
    Uninstalling gdown-4.4.0:
      Successfully uninstalled gdown-4.4.0
Successfully installed gdown-4.5.3


# Data:
We download the data from Google Drive to the Google Colab instance.

In [None]:
!gdown --id 1bWFuwCD_elqw3_jzR1Xob1YMnu-c7HcF

Downloading...
From: https://drive.google.com/uc?id=1bWFuwCD_elqw3_jzR1Xob1YMnu-c7HcF
To: /content/creditcard.csv
100% 150M/150M [00:02<00:00, 62.2MB/s]


In [None]:
# Ref.: https://www.kaggle.com/isaikumar/creditcardfraud/version/1
# Data frame with credit card data

# Read the CSV file
df = pd.read_csv( 'creditcard.csv' )

# 1. Data analysis

1.   Display basic information about the dataframe.
2.   Compute the correlation between each variable and the class. The larger the absolute value of the coefficient, the stronger the linear association between the variable and `Class`, the column that indicates whether a transaction is fraudulent or legitimate.


In [None]:
# 1
# Details on data
print(df.shape)
print(df.columns)

# 2
# Compute correlation of features with respect to Class
# By default uses Pearson correlation, which measures linear association:
# +1 or -1 means perfectly linearly correlated, 0 means no linear correlation
df.corr()["Class"].sort_values(key=abs,ascending=False)

(284807, 31)
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')


Class     1.000000
V17      -0.326481
V14      -0.302544
V12      -0.260593
V10      -0.216883
V16      -0.196539
V3       -0.192961
V7       -0.187257
V11       0.154876
V4        0.133447
V18      -0.111485
V1       -0.101347
V9       -0.097733
V5       -0.094974
V2        0.091289
V6       -0.043643
V21       0.040413
V19       0.034783
V20       0.020090
V8        0.019875
V27       0.017580
Time     -0.012323
V28       0.009536
V24      -0.007221
Amount    0.005632
V13      -0.004570
V26       0.004455
V15      -0.004223
V25       0.003308
V23      -0.002685
V22       0.000805
Name: Class, dtype: float64
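
Pearson correlation only captures linear association, so a coefficient near zero does not by itself prove that a feature is useless. The following is a minimal sketch on synthetic data; the column names `linear`, `quadratic` and `target` are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)

# Hypothetical toy frame, not related to creditcard.csv
toy = pd.DataFrame({
    "linear": 2.0 * x + rng.normal(0.0, 0.1, size=x.size),  # linear in x, plus noise
    "quadratic": x ** 2,                                     # fully determined by x, but not linearly
    "target": x,
})

# Same pattern as the cell above: correlate with the target, sort by absolute value
corr = toy.corr()["target"].sort_values(key=abs, ascending=False)
print(corr)
```

In this sketch `quadratic` ends up with a coefficient close to zero even though it is a deterministic function of the target, which is why a low-ranked variable may still be worth trying in a nonlinear classifier.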

# 2. Feature selection
1. Select the features to use. The feature matrix $X$ must not contain the target column `Class`.
2. Check the number of samples in each class.

In [None]:
# 1
# Extract features and labels

#
# Obs. Modify the following code to select set of features
#
# The following line allows to select some variables to be used
X = df.loc[:,['V2','Amount']]

# Class is the target to predict (classify)
y = df.Class

# 2
# Number of samples per class
unique, counts = np.unique(y, return_counts=True)
for i in range(0,len(unique)):
  print('Class %i: Samples %i' % (unique[i], counts[i]) )

# Number of attributes
print('Number of attributes: %i' % (X.shape)[1])

Class 0: Samples 284315
Class 1: Samples 492
Number of attributes: 2
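
The class counts above show a heavily imbalanced problem. As a quick worked calculation using only those two printed numbers, the sketch below derives the fraud rate and the accuracy of a trivial model that always predicts class 0; the variable names are illustrative.

```python
# Class counts printed by the cell above
n_valid = 284315   # Class 0 (legitimate)
n_fraud = 492      # Class 1 (fraud)
n_total = n_valid + n_fraud

fraud_rate = n_fraud / n_total
# Accuracy of a trivial model that always predicts Class 0
baseline_accuracy = n_valid / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-predict-0 accuracy: {baseline_accuracy:.4%}")
```

Since fewer than 0.2% of the transactions are fraudulent, plain accuracy is a weak indicator here, and per-class precision and recall are more informative.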


# 3. Training the classifier

1. Split the dataset into training and test sets.
2. Select and train a classifier.
3. Compute performance metrics.



In [None]:
# Classifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# 1
# Split train and test 
# stratify=y means the same % of classes is present in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)

# 2
#
# Selection of classifier
#
clf = KNeighborsClassifier(n_neighbors=5)
#clf = DecisionTreeClassifier()
#clf = LogisticRegression(solver='lbfgs',max_iter=500)
#clf = GaussianNB()

# Train classifier
clf.fit(X_train,y_train)

# Optional: show the class prior probabilities
# (GaussianNB only)
#print( clf.class_prior_ )

# Optional: show the logistic regression coefficients
# (LogisticRegression only)
# print( clf.coef_ )

KNeighborsClassifier()
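
The `stratify=y` argument used in the split keeps the class proportions the same in both halves, which matters when one class is this rare. The following small sketch uses synthetic labels to show the effect; the 1% positive rate and the names `X_toy` and `y_toy` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 20 positives out of 2000 (arbitrary)
y_toy = np.array([1] * 20 + [0] * 1980)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy
)

# With stratification both halves receive the same share of positives
print("train positives:", y_tr.sum(), "of", len(y_tr))
print("test positives: ", y_te.sum(), "of", len(y_te))
```

Without `stratify`, the number of positives in each half would vary with the random seed, and a test set could end up with very few fraud cases to evaluate on.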

# 4. Evaluation

In [None]:
# 3
# Confusion matrix
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

print(' Confusion matrix ------------------')
print(cm)

sum_diag = sum(cm[i][i] for i in range(2))
sum_all = sum(sum(cm))

print('\n % Corr ----------------------------')
print( sum_diag/sum_all*100 )

# Classification report
from sklearn.metrics import classification_report
# Names follow the sorted label order: 0 = valid, 1 = fraud
target_names = ['0', '1']

print('\n Classification report ------------------')
report = classification_report(y_test, y_pred, target_names=target_names,output_dict=True)
print('Class 1 (Fraud): %s' % report['1'])
print('Class 0 (Valid): %s' % report['0'])

# Amount of fraud detected
a = X_test.Amount
# TP 
ind = (y_pred == 1) & (y_test == 1)
# Amount for TP
amount_detected = np.sum( a[ ind ] ) 
# Total amount of fraud
ind = (y_test == 1)
amount_total = np.sum( a[ ind ] ) 

print('\n Amount ------------------ \n Detected = %f, Total = %f, Percentage = %f' % (amount_detected, amount_total, (amount_detected/amount_total)*100))

 Confusion matrix ------------------
[[142149      9]
 [   222     24]]

 % Corr ----------------------------
99.83778545546474

 Classification report ------------------
Class 1 (Fraud): {'precision': 0.7272727272727273, 'recall': 0.0975609756097561, 'f1-score': 0.17204301075268819, 'support': 246}
Class 0 (Valid): {'precision': 0.9984406936805951, 'recall': 0.9999366901616511, 'f1-score': 0.9991881319654587, 'support': 142158}

 Amount ------------------ 
 Detected = 1413.870000, Total = 30229.360000, Percentage = 4.677142
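
The fraud-class metrics in the report can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, labels in sorted order) and uses only the four counts shown in the output.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[142149, 9],
               [222, 24]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision = tp / (tp + fp)   # share of fraud alerts that were real fraud
recall = tp / (tp + fn)      # share of real fraud that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy is high mainly because of the 142149 legitimate transactions classified correctly. Recall shows that only 24 of the 246 fraud cases in the test set were detected, which agrees with the small share of the fraud amount recovered.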


# Exercise

The goal of the exercise is to compare different classifiers on the fraud detection problem.

For each classifier, evaluate the following indicators: FN, FP, precision, recall, and the percentage of the fraud amount detected. Visualizing the confusion matrices is recommended, so that the classifiers can be compared alongside the other indicators.

## Classifiers
1. Nearest neighbors
2. Decision trees
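
One possible way to organize the comparison is a small helper that fits a classifier and returns the requested indicators. This is only a sketch under stated assumptions: it runs on synthetic data from `make_classification` instead of `creditcard.csv`, the `amount` column is randomly generated, and the helper name `evaluate` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: about 2% positives, made-up amounts
X, y = make_classification(n_samples=4000, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.98], random_state=0)
amount = np.random.default_rng(0).uniform(1.0, 500.0, size=len(y))

X_tr, X_te, y_tr, y_te, _, amt_te = train_test_split(
    X, y, amount, test_size=0.5, random_state=0, stratify=y)

def evaluate(clf):
    """Fit clf on the training half and return the exercise indicators."""
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred, labels=[0, 1]).ravel()
    detected = amt_te[(y_pred == 1) & (y_te == 1)].sum()
    total = amt_te[y_te == 1].sum()
    return {
        "FN": int(fn), "FP": int(fp),
        "precision": precision_score(y_te, y_pred, zero_division=0),
        "recall": recall_score(y_te, y_pred, zero_division=0),
        "pct_amount_detected": 100.0 * detected / total,
    }

classifiers = {
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}
results = {name: evaluate(clf) for name, clf in classifiers.items()}
for name, metrics in results.items():
    print(name, metrics)
```

To apply the same idea to the notebook, the synthetic arrays would be replaced by the `X_train`, `X_test`, `y_train`, `y_test` split and the `Amount` column of `X_test` used earlier.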


