# Classification Assignment: Detect Fraudulent Transactions

Estimated Time: 2-3 hours

In this assignment, you will build a classifier that detects whether a mobile money transaction is fraudulent.

Data source: [Synthetic Financial Datasets For Fraud Detection](https://www.kaggle.com/ntnu-testimon/paysim1)

Attributes:
- step: a unit of time in the real world; here, 1 step is 1 hour
- type: transaction type, one of CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER
- amount: amount of the transaction in local currency
- nameOrig: customer who started the transaction
- oldbalanceOrg: customer's initial balance before the transaction
- newbalanceOrig: customer's balance after the transaction
- nameDest: recipient ID of the transaction
- oldbalanceDest: recipient's initial balance before the transaction
- newbalanceDest: recipient's balance after the transaction
- isFlaggedFraud: flags illegal attempts to transfer more than 200,000 in a single transaction

Target variable to predict:
- isFraud: identifies a transaction as fraudulent (1) or non-fraudulent (0)

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, accuracy_score
import pickle
import numpy as np

%matplotlib inline
```

## Load data

Use `pd.read_csv` to load the data from `paysim_transactions_64k.csv`.
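A minimal sketch of the loading step, assuming the CSV file sits in the same directory as the notebook. The helper name `load_transactions` is illustrative and not part of the assignment.

```python
import pandas as pd


def load_transactions(path="paysim_transactions_64k.csv"):
    """Read the transactions CSV into a DataFrame."""
    return pd.read_csv(path)


# In the notebook (assumes the file is in the working directory):
# df = load_transactions()
# df.head()
```

Calling `df.info()` afterwards is a quick way to confirm the column names and dtypes match the attribute list above.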

## Data cleaning (not needed)

No cleaning is needed because the dataset contains no NaN values.
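The absence of NaN values can be verified rather than assumed. The sketch below counts missing values per column; the helper name `count_missing` is illustrative.

```python
import pandas as pd


def count_missing(df):
    """Return the number of NaN values in each column."""
    return df.isna().sum()


# In the notebook, every count is expected to be zero:
# count_missing(df)
```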

## Data encoding
Label-encode the following columns, converting strings to numbers:
  * `type`
  * `nameOrig`
  * `nameDest`
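One possible way to do this is with a separate `LabelEncoder` per column, as sketched below. Keeping the fitted encoders around is a design choice of this sketch (it allows mapping codes back to the original strings); the helper name `encode_columns` is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_columns(df, columns=("type", "nameOrig", "nameDest")):
    """Return a label-encoded copy of df and the fitted encoders."""
    encoded = df.copy()  # leave the caller's DataFrame untouched
    encoders = {}
    for col in columns:
        enc = LabelEncoder()
        # Each distinct string becomes an integer code (classes are sorted).
        encoded[col] = enc.fit_transform(encoded[col])
        encoders[col] = enc
    return encoded, encoders


# In the notebook:
# df, encoders = encode_columns(df)
```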

## Data exploration

- Plot the correlation matrix
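A sketch of the correlation plot using a seaborn heatmap. The figure size, colour map, and the helper name `plot_correlation_matrix` are arbitrary choices made for this example. Restricting to numeric columns is only a guard; after label encoding all columns should already be numeric.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlation_matrix(df):
    """Plot pairwise Pearson correlations of the numeric columns."""
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax)
    ax.set_title("Correlation matrix")
    return corr, ax


# In the notebook:
# corr, ax = plot_correlation_matrix(df)
```

The row or column for `isFraud` shows which attributes move together with the target.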

## Data sampling

- Check for class imbalance in the target column (`isFraud`)
- If the classes are imbalanced:
   - First, hold out a portion (5%) of each class as test data
   - Then, use random over-sampling to increase the minority class in the training data to 10,000 samples
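The steps above can be sketched as follows. This version assumes a stratified `train_test_split` for the 5% hold-out and `sklearn.utils.resample` for random over-sampling (the `imbalanced-learn` package would be an alternative). It also reads "to 10000" as 10,000 minority-class rows in the training set. The helper name and the `random_state` value are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def split_and_oversample(df, target="isFraud", test_size=0.05,
                         minority_size=10000, random_state=42):
    """Hold out a stratified test set, then over-sample the training minority."""
    # Stratifying keeps roughly test_size of EACH class in the test set.
    train_df, test_df = train_test_split(
        df, test_size=test_size, stratify=df[target],
        random_state=random_state)

    minority_label = train_df[target].value_counts().idxmin()
    minority = train_df[train_df[target] == minority_label]
    majority = train_df[train_df[target] != minority_label]

    # Draw minority rows with replacement until minority_size is reached.
    if len(minority) < minority_size:
        minority = resample(minority, replace=True,
                            n_samples=minority_size,
                            random_state=random_state)

    # Shuffle so the duplicated rows are not grouped at the end.
    balanced = pd.concat([majority, minority]).sample(
        frac=1, random_state=random_state)
    return balanced.reset_index(drop=True), test_df.reset_index(drop=True)


# In the notebook:
# df["isFraud"].value_counts(normalize=True)  # inspect the imbalance first
# train_df, test_df = split_and_oversample(df)
```

Splitting before over-sampling matters: it keeps duplicated minority rows out of the test set, so the evaluation is not inflated by rows the model has already seen.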

## PCA plot

Visualise the dataset in 2D using PCA.
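A possible sketch of the 2D projection. Standardising the features before PCA is an assumption of this example (PCA is sensitive to feature scale, and the balance and amount columns have much larger ranges than the others). The marker size, transparency, and helper name are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def plot_pca_2d(X, y):
    """Project X onto its first two principal components, coloured by class."""
    X_scaled = StandardScaler().fit_transform(X)
    components = PCA(n_components=2).fit_transform(X_scaled)

    y = np.asarray(y)
    fig, ax = plt.subplots(figsize=(8, 6))
    for label, name in [(0, "non-fraud"), (1, "fraud")]:
        mask = y == label
        ax.scatter(components[mask, 0], components[mask, 1],
                   s=8, alpha=0.5, label=name)
    ax.set_xlabel("Principal component 1")
    ax.set_ylabel("Principal component 2")
    ax.legend()
    return components, ax


# In the notebook (X = feature columns, y = isFraud):
# components, ax = plot_pca_2d(train_df.drop(columns="isFraud"), train_df["isFraud"])
```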

## Train model

- Scale the training features by fitting a scaler on the training set, then reuse the fitted scaler to transform the test set
- Train two classifiers: SGDClassifier and SVC

Note: There is no need to use `learning_curve` because the minority class is too small. Just call `.fit()` on the whole training set.
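A minimal sketch of the training step, assuming `StandardScaler` as the scaler and default hyperparameters for both classifiers. The helper name and the `random_state` value are illustrative.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_classifiers(X_train, y_train, random_state=42):
    """Fit a scaler on the training set, then train both classifiers."""
    X_scaler = StandardScaler()
    X_train_scaled = X_scaler.fit_transform(X_train)  # fit on training data only

    models = {
        "SGDClassifier": SGDClassifier(random_state=random_state),
        "SVC": SVC(),
    }
    for model in models.values():
        model.fit(X_train_scaled, y_train)
    return X_scaler, models


# In the notebook:
# X_scaler, models = train_classifiers(X_train, y_train)
# X_test_scaled = X_scaler.transform(X_test)  # transform only, no refitting
```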

## Evaluation metrics

- Compare the F1 scores between the classifiers.
- Plot the confusion matrix for both classifiers.
- Plot the ROC curve and compute the Area Under the Curve (AUC) for both classifiers.
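The three metrics above could be computed along these lines. The sketch assumes `models` is a dict of fitted classifiers and that the test features are already scaled. It uses `decision_function` scores for the ROC curve, which both `SGDClassifier` (default hinge loss) and `SVC` provide. The helper name and plot layout are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, auc, confusion_matrix,
                             f1_score, roc_curve)


def evaluate_models(models, X_test_scaled, y_test):
    """Return F1 and AUC per model; draw confusion matrices and ROC curves."""
    n_axes = len(models) + 1  # one confusion matrix per model + one ROC panel
    fig, axes = plt.subplots(1, n_axes, figsize=(5 * n_axes, 4))
    roc_ax = axes[-1]
    results = {}

    for ax, (name, model) in zip(axes, models.items()):
        y_pred = model.predict(X_test_scaled)
        scores = model.decision_function(X_test_scaled)

        fpr, tpr, _ = roc_curve(y_test, scores)
        roc_auc = auc(fpr, tpr)
        results[name] = {"f1": f1_score(y_test, y_pred), "auc": roc_auc}

        cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
        ConfusionMatrixDisplay(cm, display_labels=[0, 1]).plot(ax=ax)
        ax.set_title(name)

        roc_ax.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.3f})")

    roc_ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
    roc_ax.set_xlabel("False positive rate")
    roc_ax.set_ylabel("True positive rate")
    roc_ax.legend()
    return results


# In the notebook:
# results = evaluate_models(models, X_test_scaled, y_test)
```

Because the test set keeps the original class imbalance, F1 and AUC are more informative here than plain accuracy.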

## Deployment and Prediction

* Determine which model is better based on AUC
* Save X_scaler and **the better** model
* Load the scaler and model
* Get predictions for the fraudulent transactions in the test set
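A sketch of these four steps using `pickle`. The file name `fraud_model.pkl`, the choice to store scaler and model together in one file, and all helper names are assumptions of this example.

```python
import pickle

import numpy as np


def pick_best_by_auc(auc_by_model):
    """Return the name of the model with the highest AUC."""
    return max(auc_by_model, key=auc_by_model.get)


def save_artifacts(X_scaler, model, path="fraud_model.pkl"):
    """Pickle the scaler and model together so they stay in sync."""
    with open(path, "wb") as f:
        pickle.dump({"scaler": X_scaler, "model": model}, f)


def load_artifacts(path="fraud_model.pkl"):
    """Load the scaler and model saved by save_artifacts."""
    with open(path, "rb") as f:
        artifacts = pickle.load(f)
    return artifacts["scaler"], artifacts["model"]


def predict_fraud_rows(X_scaler, model, X_test, y_test):
    """Predict only for test rows whose true label is fraud (1)."""
    fraud_mask = np.asarray(y_test) == 1
    X_fraud = X_test[fraud_mask]  # works for DataFrames and arrays
    return model.predict(X_scaler.transform(X_fraud))


# In the notebook (auc_by_model maps model name -> AUC from the evaluation step):
# best_name = pick_best_by_auc(auc_by_model)
# save_artifacts(X_scaler, models[best_name])
# X_scaler, model = load_artifacts()
# predict_fraud_rows(X_scaler, model, X_test, y_test)
```

Since every row passed to the model here is a true fraud case, the share of predictions equal to 1 is the model's recall on the test set.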