<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Financial-Transactions" data-toc-modified-id="Financial-Transactions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Financial Transactions</a></span><ul class="toc-item"><li><span><a href="#The-Leaderboard-Predict-function" data-toc-modified-id="The-Leaderboard-Predict-function-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>The Leaderboard Predict function</a></span></li><li><span><a href="#Testing-your-Implementation" data-toc-modified-id="Testing-your-Implementation-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Testing your Implementation</a></span></li></ul></li></ul></div>

# Financial Transactions

The ability to identify fraudulent transactions is of great interest to the payments industry. In this notebook, you will use the binary classifier you trained on the transactions dataset to detect fraud.

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.metrics import roc_auc_score
import pathlib

In [2]:
path = "/data/mlproject22" if os.path.exists("/data/mlproject22") else "."
train_data = pd.read_csv(os.path.join(path, "transactions.csv.zip"))
X_train = train_data.drop(columns = "Class")
y_train = train_data["Class"]

## The Leaderboard Predict function
Replace the comment and `NotImplementedError` in `leader_board_predict_fn` with code that loads your model parameters and returns the likelihood of fraud for each transaction (i.e. each row) in the `values` DataFrame. The returned array should contain a single decision function value for each transaction, indicating how likely the transaction is to be fraudulent (i.e. to belong to target class $1$). The higher the decision function value, the more likely it is that the transaction is fraud.
You can import the packages you require.

In [3]:
import joblib
def leader_board_predict_fn(values):
    
    # Load the trained model parameters
    rf_classifier = joblib.load("/home/csaz7668/random_forest_params.pkl")
    scaler = joblib.load("/home/csaz7668/random_forest_scaler.pkl")

    X = values
    X_scaled = scaler.transform(X)

    # Predict the likelihood of fraud (decision function values) for each transaction
    decision_function_values = rf_classifier.predict_proba(X_scaled)[:, 1]  # Get the probability of the positive class
    print(decision_function_values)
    return decision_function_values

## Testing your Implementation
Your model should return the probability or decision function value that indicates the likelihood of fraud for each input transaction. To verify that this is the case, we run your model on a subset of the transactions dataset it was trained on. A hidden cell performs the actual test on the unseen test set and computes your leaderboard score as the [ROC AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html).
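
Because ROC AUC depends only on how the transactions are ranked, either a probability or an unnormalized decision function value is acceptable. The sketch below illustrates this with a hypothetical helper, `pairwise_auc`, which computes the AUC as the fraction of (fraud, non-fraud) pairs in which the fraud transaction receives the higher score. It is a plain-Python illustration on made-up toy values, not the scikit-learn implementation and not part of the scoring code.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    Hypothetical helper for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute ROC AUC")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores (made up for illustration, not taken from the dataset).
labels = [0, 0, 1, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.2, 0.8, 0.9]

# A strictly increasing transformation of the scores, standing in for an
# unnormalized decision function value. The ranking is unchanged.
shifted = [10 * p - 3 for p in probabilities]

auc_prob = pairwise_auc(labels, probabilities)
auc_shifted = pairwise_auc(labels, shifted)
print(auc_prob, auc_shifted)
```

In this toy example, 8 of the 9 (fraud, non-fraud) pairs are ordered correctly, so both calls give the same AUC of 8/9. The pairwise loop is quadratic in the number of rows, so it is only suitable for small illustrations like this one.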

In [4]:
def get_score():
    """
    Function to compute scores for train and test datasets.
    """

    import pandas as pd
    import numpy as np
    import os
    from sklearn.metrics import roc_auc_score
    import pathlib

    try:
        path = "/data/mlproject22" if os.path.exists("/data/mlproject22") else "."
        test_data = pd.read_csv(os.path.join(path, "transactions.csv.zip"))
        X_test = test_data.drop(columns = "Class")
        y_test = test_data["Class"]
        decision_function_values = leader_board_predict_fn(X_test)
        assert decision_function_values.shape == (X_test.shape[0],)
        dataset_score = roc_auc_score(y_test, decision_function_values)
        assert dataset_score >= 0.0 and dataset_score <= 1.0
    except Exception:
        dataset_score = float("nan")
    print(f"Train Dataset Score: {dataset_score}")

    import os
    import pwd
    import time
    import datetime
    import pandas as pd
    user_id = pwd.getpwuid( os.getuid() ).pw_name
    curtime = time.time()
    dt_now = datetime.datetime.now().strftime("%Y-%m-%d %H:%M")

    try:
        HIDDEN_DATASET_PATH = os.path.expanduser("/data/mlproject22-test-data")
        test_data = pd.read_csv(os.path.join(HIDDEN_DATASET_PATH,"transactions_scoreboard.csv.zip"))
        X_test = test_data.drop(columns=["Class"])
        y_test = test_data["Class"]
        decision_function_values = leader_board_predict_fn(X_test)
        hiddendataset_score = roc_auc_score(y_test, decision_function_values)
        print(f"Test Dataset Score: {hiddendataset_score}")
        score_dict = dict(
            score_hidden=hiddendataset_score,
            score_train=dataset_score,
            unixtime=curtime,
            user=user_id,
            dt=dt_now,
            comment="",
        )
    except Exception as e:
        err = str(e)
        score_dict = dict(
            score_hidden=float("nan"),
            score_train=dataset_score,
            unixtime=curtime,
            user=user_id,
            dt=dt_now,
            comment=err
        )

    #if list(pathlib.Path(os.getcwd()).parents)[0].name == 'source':
    #    print("we are in the source directory... replacing values.")
    #    print(pd.DataFrame([score_dict]))
    #    score_dict["score_hidden"] = -1
    #    score_dict["score_train"] = -1
    #    print("new values:")
    #    print(pd.DataFrame([score_dict]))

    pd.DataFrame([score_dict]).to_csv("transactions.csv", index=False)
    
get_score()

[0. 0. 0. ... 0. 0. 0.]
Train Dataset Score: 0.9935252189198022
[0. 0. 0. ... 0. 0. 0.]
Test Dataset Score: 0.9527033889667291
