
# Training a Classifier to Detect Fraudulent Financial Transactions
First, we load our dataset. Note that, for privacy reasons, all features except `Time` and `Amount` have generic names and were obtained through [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis).

In [None]:
import pandas as pd
import os
path = "/data/mlproject21" if os.path.exists("/data/mlproject21") else "."
df = pd.read_csv(os.path.join(path, "transactions.csv.zip"))
df.head()

## Creating the Test Set
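We split the data into a training set and a test set, holding out 20% of the samples for testing and stratifying on `Class` so that both sets keep the same ratio of fraudulent transactions: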

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns = "Class"),
                                                    df["Class"],
                                                    test_size = 0.2,
                                                    stratify = df["Class"])
print(f"{y_train.size} train samples\n{y_test.size} test samples")
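
The effect of `stratify` is easiest to see on a small example. The sketch below uses hypothetical toy labels (`X_toy`, `y_toy`, 2% positives) rather than the real transactions, and fixes `random_state` only to make the illustration repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 1,000 samples, 2% positives (not the real dataset).
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(y_toy.size).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# With stratify, both splits keep roughly the same positive rate as y_toy.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Without stratification, a rare class can end up noticeably over- or under-represented in a small test set, which makes the test score less reliable.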

## Fitting a Dummy Classifier
For now, we use a dummy classifier: no matter the input, the predicted probability that a transaction is fraudulent always equals the fraction of fraudulent transactions in the training set.

In [None]:
from sklearn.dummy import DummyClassifier
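# strategy="prior" is the default since scikit-learn 0.24; we set it explicitly
# so that predict_proba returns the training class ratios in older versions too.
dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)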
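
To check what such a classifier actually outputs, here is a minimal sketch on hypothetical toy data (`X_demo`, `y_demo`, 5% positives), not the real transactions. The `"prior"` strategy is passed explicitly because the default strategy of `DummyClassifier` has changed across scikit-learn versions:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: 5% of the labels are positive (not the real dataset).
y_demo = np.array([1] * 5 + [0] * 95)
X_demo = np.zeros((y_demo.size, 3))  # the dummy model ignores the features

prior_dummy = DummyClassifier(strategy="prior").fit(X_demo, y_demo)
proba = prior_dummy.predict_proba(X_demo[:4])

# Every row is identical: [P(class 0), P(class 1)], i.e. the training class ratios.
print(proba)
```

Because every sample receives the same probability, this model cannot rank transactions at all. It serves only as a baseline that any real model should beat.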

## Implementing the Predict Function
You will have to implement the predict function, which we will run to check the performance of your model on a secret test set. The better you perform, the higher you'll be on the leaderboard! For now, our solution simply uses the dummy classifier to make the predictions.

Note that this predict function should return the value of the decision function of your model for each transaction in `values`. In the probabilistic case, these are the probabilities that the input values are of target class 1 (i.e. fraud). The higher the value of the decision function, the more likely that a transaction is fraudulent.

In [None]:
def leader_board_predict_fn(values):
    # sklearn.dummy.DummyClassifier.predict_proba returns an array of shape (N, 2).
    # Column 0 is the probability of target class 0
    # Column 1 is the probability of target class 1
    return dummy.predict_proba(values)[:, 1]
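
As a rough sketch of how a non-dummy model could fill the same role, the example below fits a `LogisticRegression` on hypothetical synthetic data (`X_syn`, `y_syn`). The model choice, the data, and the two helper function names are assumptions made purely for illustration. It shows that returning class-1 probabilities or raw decision-function values leads to the same ranking of samples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical synthetic data standing in for the transactions (not the real dataset).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 4))
y_syn = (X_syn[:, 0] + 0.5 * rng.normal(size=500) > 1.0).astype(int)

# LogisticRegression is only an illustrative choice of model.
model = LogisticRegression().fit(X_syn, y_syn)

def predict_fn_proba(values):
    # Probability of class 1, same convention as the dummy classifier.
    return model.predict_proba(values)[:, 1]

def predict_fn_decision(values):
    # Raw decision-function value; larger means "more likely class 1".
    return model.decision_function(values)

auc_proba = roc_auc_score(y_syn, predict_fn_proba(X_syn))
auc_decision = roc_auc_score(y_syn, predict_fn_decision(X_syn))
print(auc_proba, auc_decision)
```

For logistic regression the probability is a monotone function of the decision value, so both variants order the samples identically. Only that ordering matters for a ranking-based score.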

## Testing the Dummy Classifier
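To measure the classifier's performance on the test set, we will use the [ROC AUC score](https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics). The best possible score is 1. A classifier whose scores carry no ranking information, such as one that returns the same value for every transaction, scores 0.5.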

In [None]:
### LEADER BOARD TEST
from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_test, leader_board_predict_fn(X_test))
print(f"Leaderboard Score: {score}")
### LEADER BOARD TEST
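
One way to build intuition for the ROC AUC score is to read it as the fraction of (fraudulent, legitimate) pairs in which the fraudulent sample receives the higher score, with ties counting as half. The helper `pairwise_auc` below is a hypothetical, quadratic-time sketch of that interpretation on made-up labels. It is not how `roc_auc_score` is implemented:

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))

# Made-up labels and scores for illustration only.
labels = [0, 0, 1, 0, 1]
constant_scores = [0.2] * 5                 # same score for every sample
perfect_scores = [0.1, 0.2, 0.9, 0.3, 0.8]  # every positive outranks every negative

print(pairwise_auc(labels, constant_scores))  # 0.5
print(pairwise_auc(labels, perfect_scores))   # 1.0
```

A constant score ties every pair and therefore yields 0.5, which is the baseline that a model with real predictive power should exceed.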