<a href="https://colab.research.google.com/github/mdvandergon/financial_transaction_scoring/blob/main/transaction_aml_scoring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transaction AML Scoring
This notebook builds an anti-money-laundering (AML) fraud classifier from TensorFlow BoostedTrees and a UMAP embedding of the transaction graph.

This project uses sample data from IBM's AMLSim project. You can find their repo here: https://github.com/IBM/AMLSim/

They also have a wiki page about the data.
https://github.com/IBM/AMLSim/wiki/Data-Schema-for-Input-Parameters-and-Generated-Data-Set#transactions-transactionscsv



In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Step 0 - Set up the environment

In [None]:
! pip install --user --upgrade numpy"<1.19.0,>=1.16.0" powerlaw python-dateutil plotly plotly_express==0.4.1 pandas tensorflow==2.3.1 umap-learn

## Step 1 - Get the IBM AML Transaction Data

AMLSim can generate transaction data itself, but the project also publishes pre-generated sample datasets, which this notebook uses.

In [None]:
# example data is available on Dropbox :)
! wget "https://www.dropbox.com/sh/l3grpumqfgbxqak/AAC8YT4fdn0AYKhyZ5b3Ax16a?dl=1" -O aml.zip

In [None]:
! unzip aml.zip -d data/
! echo "DONE!"

In [None]:
# requires 7-Zip (apt install p7zip)
! p7zip -d data/100vertices-10Kedges.7z
# medium dataset, read by the model-training cells in Step 3
! p7zip -d data/10Kvertices-1Medges.7z
# optional: the large dataset needs much more disk space
# ! p7zip -d data/1Mvertices-100Medges.7z

# The work begins...

In [None]:
import plotly.express as px
import pandas as pd
import json
import umap
import numpy as np
import tensorflow as tf
import sklearn.neighbors
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from matplotlib import pyplot as plt
from scipy.spatial import distance

In [None]:
%load_ext tensorboard

In [None]:
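# create a co-occurrence matrix where (sender, receiver) pairs are tallied up
def create_cooccurance_matrix(df: pd.DataFrame, col_0: str, col_1: str, normed=False):
  # IDs index the matrix directly, so they must be non-negative integers
  src = df[col_0].to_numpy(dtype=int)
  dst = df[col_1].to_numpy(dtype=int)
  n = max(src.max(), dst.max())
  mtx = np.zeros((n + 1, n + 1))
  # np.add.at accumulates repeated pairs (a plain fancy-index += would count each pair once)
  np.add.at(mtx, (src, dst), 1)
  if normed:
    # scale by the Frobenius norm of the whole matrix
    mtx = mtx / np.linalg.norm(mtx)
  return mtx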
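
As a small worked example of the tallying step, the sketch below builds the matrix for five made-up transactions between three accounts. The arrays and the `toy_` names are hypothetical and are not part of the AMLSim data.

```python
import numpy as np

# Toy sender/receiver ID arrays (hypothetical data, not from AMLSim).
toy_senders = np.array([0, 0, 1, 2, 0])
toy_receivers = np.array([1, 1, 2, 0, 2])

toy_n = int(max(toy_senders.max(), toy_receivers.max()))
toy_mtx = np.zeros((toy_n + 1, toy_n + 1))
# Row = sender, column = receiver; repeated pairs add up.
np.add.at(toy_mtx, (toy_senders, toy_receivers), 1)

print(toy_mtx)
```

Account 0 sends to account 1 twice, so the entry at row 0, column 1 is 2. The matrix is not symmetric because direction matters.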

In [None]:
# placeholder default; pass a fitted model, because embedding_ only exists after fit()
MODEL = umap.UMAP()
def embed(x: int, model: umap.UMAP = MODEL):
  return model.embedding_[x]

def get_embedding_features(df: pd.DataFrame, col_0: str, col_1: str, model=MODEL,
                           embed_suffix ='_EMBED', dst_col='EMBED_DISTANCE'):
  # row i of model.embedding_ is the vector for account ID i
  n_components = model.embedding_.shape[1]
  for col in [col_0, col_1]:
    df.loc[: , col + embed_suffix] = df[col].apply(lambda x: embed(x, model=model))
    # split out components as features
    for c in range(n_components):
      df.loc[:, col + f'{embed_suffix}_{c}'] = df[col + embed_suffix].apply(lambda x: x[c])

  # compute the cosine distance: float
  us = df[col_0 + embed_suffix].values
  vs = df[col_1 + embed_suffix].values
  cos = np.array([distance.cosine(u, v) for u,v in zip(us, vs)])
  df.loc[:, dst_col] = cos

  return df
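
To make the `EMBED_DISTANCE` feature concrete, here is a minimal sketch of the cosine distance between sender and receiver vectors, written in plain NumPy rather than with `scipy.spatial.distance.cosine`. The four-account embedding is made up for illustration.

```python
import numpy as np

# Hypothetical 3-component embedding for four accounts (row index = account ID).
toy_embedding = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as account 0
    [0.0, 1.0, 0.0],   # orthogonal to account 0
    [-1.0, 0.0, 0.0],  # opposite direction to account 0
])

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

toy_senders = np.array([0, 0, 0])
toy_receivers = np.array([1, 2, 3])
toy_dist = np.array([cosine_distance(toy_embedding[s], toy_embedding[r])
                     for s, r in zip(toy_senders, toy_receivers)])
print(toy_dist)
```

The distance depends only on direction, not on vector length, which is why accounts 0 and 1 are at distance 0 even though their vectors differ in magnitude.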

In [None]:
# based on the data investigation, these are our VIP columns
src_col = 'SENDER_ACCOUNT_ID'
dst_col = 'RECEIVER_ACCOUNT_ID'
TARGET_COL = 'IS_FRAUD'
LABEL_COL = 'ALERT_ID'

## Step 2 - Transaction EDA and Graph Embedding with UMAP

Refer to this link for data dictionary: https://github.com/IBM/AMLSim/wiki/Data-Schema-for-Input-Parameters-and-Generated-Data-Set#transactions-transactionscsv

In [None]:
transactions_path = '100vertices-10Kedges/transactions.csv'
sample_df = pd.read_csv(transactions_path)
print(sample_df.shape)

In [None]:
sample_df.head()

In [None]:
# class balance: fraud vs. non-fraud transactions
sample_df[TARGET_COL].value_counts()

In [None]:
co_mtx = create_cooccurance_matrix(sample_df, src_col, dst_col)

In [None]:
# label each account_id that appears in at least one IS_FRAUD transaction
n = max(sample_df[src_col].max(), sample_df[dst_col].max())
fraudulent = sample_df[(sample_df[TARGET_COL] == True)]
fraud_parties = pd.concat([fraudulent[src_col], fraudulent[dst_col]])
fraud_parties = set(fraud_parties.values.ravel())  # set of parties in a fraudulent transaction
fraud_label = np.array([1 if i in fraud_parties else 0 for i in range(n+1)])
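
The account-level labelling can be sketched on a toy example. This version uses `np.isin` instead of a Python set lookup, and all data and `toy_` names are hypothetical.

```python
import numpy as np

# Hypothetical transactions: sender, receiver and fraud flag per row.
toy_src = np.array([0, 1, 2, 3])
toy_dst = np.array([1, 2, 3, 4])
toy_is_fraud = np.array([False, True, False, False])

toy_n = int(max(toy_src.max(), toy_dst.max()))
# Accounts on either side of a flagged transaction.
toy_parties = np.union1d(toy_src[toy_is_fraud], toy_dst[toy_is_fraud])
# One 0/1 label per account ID, aligned with the rows of the co-occurrence matrix.
toy_label = np.isin(np.arange(toy_n + 1), toy_parties).astype(int)
print(toy_label)
```

Only the transaction from account 1 to account 2 is flagged, so those two accounts get label 1 and every other account gets 0.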

### UMAP embedding to approximate local area

In [None]:
n_components = 3
metric = 'correlation' # hellinger, correlation
model = umap.UMAP(n_components=n_components,
                  metric=metric)
embedding = model.fit(co_mtx)  # fit() returns the fitted UMAP object; coordinates live in embedding.embedding_
# umap.plot.points(mapper, values=np.arange(100000), theme='viridis')

In [None]:
# Optional: flag outliers with LocalOutlierFactor, which compares local k-nearest-neighbour density (-1 is an outlier)
# outlier_scores = sklearn.neighbors.LocalOutlierFactor(contamination=0.001428).fit_predict(embedding.embedding_)

In [None]:
px.scatter_3d(x=embedding.embedding_[:,0], y=embedding.embedding_[:,1], z=embedding.embedding_[:,2], color=fraud_label,
              title=f"Tx Graph Embedding on small AMLSim dataset ({metric})")

There is a time-varying UMAP available that would be worth investigating:
https://umap-learn.readthedocs.io/en/latest/aligned_umap_politics_demo.html

In [None]:
# other ideas:

# 1- Aligned UMAP
# create a relation map, which is just a map from ACCOUNT_ID to an id
# ids = set(np.concatenate([sample_df[src_col].values, sample_df[dst_col].values])) 
# relation_dict = {x:i for i, x in enumerate(ids)}

# 2 - there might be an accelerated way to build the co-occurrence matrix; left for later
# encoded_src = tf.keras.utils.to_categorical(df[src_col])
# encoded_dst = tf.keras.utils.to_categorical(df[dst_col])
# create a co-occurrence matrix
# df_asint = df.astype(int)
# coocc = df_asint.T.dot(df_asint)
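
The relation-map idea can be sketched as follows, assuming account IDs may be sparse or have large gaps (whether that holds for a given AMLSim dataset is an assumption). `np.unique` with `return_inverse=True` maps each distinct ID to a dense index, so the matrix size depends on the number of accounts rather than on the largest ID.

```python
import numpy as np

# Hypothetical account IDs with large gaps between them.
toy_src = np.array([1001, 1001, 50000, 7])
toy_dst = np.array([50000, 7, 7, 1001])

# Map every distinct account ID to a dense index 0..k-1.
ids, inverse = np.unique(np.concatenate([toy_src, toy_dst]), return_inverse=True)
src_idx, dst_idx = inverse[:len(toy_src)], inverse[len(toy_src):]

# The matrix is now k x k instead of (max_id + 1) x (max_id + 1).
dense_mtx = np.zeros((len(ids), len(ids)))
np.add.at(dense_mtx, (src_idx, dst_idx), 1)

relation_dict = {int(acct): i for i, acct in enumerate(ids)}
print(relation_dict)
print(dense_mtx.shape)
```

With this mapping, any later embedding lookup has to go through `relation_dict` first, because row `i` no longer corresponds to account ID `i`.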

### Turn our embedding into a feature set

In [None]:
get_embedding_features(sample_df.head().copy(), src_col, dst_col, model=embedding)  # copy() avoids writing into a view of sample_df

## Step 3 - TensorFlow model training and benchmark against a linear baseline

**Targets: Y**

1) IS_FRAUD: binary (derived from ALERT_ID, categorical)

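**Features: X**

1) TX_AMOUNT: float

2) SENDER_ACCOUNT_ID: categorical (one-hot encoded)

3) EMBED_DISTANCE: float (cosine distance between the sender and receiver embeddings)

4) SENDER_ACCOUNT_ID_EMBED (one column per embedding component)

5) RECEIVER_ACCOUNT_ID_EMBED (one column per embedding component)

TX_TYPE is also categorical, but it has only one value in this dataset, so it is left out.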

In [None]:
tf.random.set_seed(42)

In [None]:
# switch to the larger dataset (10K vertices, 1M edges)
transactions_path = '10Kvertices-1Medges/transactions.csv'
df = pd.read_csv(transactions_path)
print(df.shape)

In [None]:
# train/eval/test split
# FYI: If we want to predict the alert code, we can't have negative labels, so a null label will be "0" instead of -1
# df.loc[:, LABEL_COL] = df.loc[:, LABEL_COL] + 1
# simulate a cut point for a test set
test_idx = int(df.shape[0] * 0.1)
# (train/eval)/test split:
df_test = df.iloc[-test_idx:, :]
X_test = df_test.drop(TARGET_COL, axis=1)
y_test = df_test[TARGET_COL].astype(int)

# train/eval: every row before the test cut point
df_ = df.iloc[:-test_idx, :]
X = df_.drop(TARGET_COL, axis=1)
y = df_[TARGET_COL].astype(int)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.33, random_state=42)
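
A tail holdout of this kind can be checked on a toy index. The sketch assumes rows are ordered oldest first, which the notebook does not verify, and the `toy_` names are hypothetical.

```python
import numpy as np

toy_rows = np.arange(20)               # stand-in for row positions, oldest first (assumed)
toy_test_n = int(len(toy_rows) * 0.1)  # 10% of the rows

toy_test = toy_rows[-toy_test_n:]      # last 10% held out as the test set
toy_rest = toy_rows[:-toy_test_n]      # everything before the cut point

print(len(toy_rest), len(toy_test))
```

The two slices are disjoint and together cover every row. Note that a cut size of 0 would make `[-0:]` select the whole array, so very small datasets need a guard.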

In [None]:
# upsample fraudulent transactions
# recombine features and target
d = pd.concat([X_train, y_train], axis=1)

# separate data using target column
not_fraud = d[d[TARGET_COL] == 0]
fraud = d[d[TARGET_COL] == 1]

# upsample IS_FRAUD
n_samples = d.shape[0] // 10  # upsample fraud to 10% of the training rows: a rough proxy for a business that can review about 10% of transactions
fraud_upsampled = resample(fraud,
                          replace=True,
                          n_samples=n_samples,
                          random_state=42)

# recreate training set
d_train = pd.concat([not_fraud, fraud_upsampled])
X_train = d_train.drop(TARGET_COL, axis=1)
y_train = d_train[TARGET_COL]

# check new class counts
y_train.value_counts()
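
The effect of upsampling on class counts can be sketched with NumPy alone. This uses `Generator.choice` with replacement in place of `sklearn.utils.resample`, on made-up labels.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical labels: 1,000 transactions, 5 of them fraudulent.
toy_y = np.zeros(1000, dtype=int)
toy_y[:5] = 1

fraud_idx = np.flatnonzero(toy_y == 1)
not_fraud_idx = np.flatnonzero(toy_y == 0)

# Draw fraud rows with replacement until they number 10% of the original set.
n_up = len(toy_y) // 10
fraud_up_idx = rng.choice(fraud_idx, size=n_up, replace=True)

toy_train_idx = np.concatenate([not_fraud_idx, fraud_up_idx])
toy_counts = np.bincount(toy_y[toy_train_idx])
print(toy_counts)
```

The five original fraud rows are repeated until there are 100 of them, while the 995 non-fraud rows are left untouched. No new information is created; the duplicates only change how much weight the minority class gets during training.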

In [None]:
# obtain our embeddings without leaking information from our eval or test set
n_components = 3
metric = 'euclidean' # hellinger, euclidean, correlation
print("building matrix...")
co_mtx = create_cooccurance_matrix(X_train, src_col, dst_col)
print("training embedding...")
train_embedding = umap.UMAP(n_components=n_components,
                            metric=metric).fit(co_mtx)
print("done. applying embeddings to train")
X_train = get_embedding_features(X_train, src_col, dst_col, model=train_embedding)

# experimental - create a feature based on local outliers (KNN)
# print("predicting outliers...")
# src_outlier_feature_col = 'SRC_OUTLIER_SCORE'
# dst_outlier_feature_col = 'DST_OUTLIER_SCORE'
# clf = sklearn.neighbors.LocalOutlierFactor(contamination=0.001428)
# _scores = clf.fit(train_embedding.embedding_)  # fit() populates negative_outlier_factor_; fit_predict would return -1/1 labels
# outlier_scores_src = [clf.negative_outlier_factor_[i] for i in X_train[src_col].values]  # get a score for each sender_id
# outlier_scores_dst = [clf.negative_outlier_factor_[i] for i in X_train[dst_col].values]
# X_train.loc[:, src_outlier_feature_col] = outlier_scores_src
# X_train.loc[:, dst_outlier_feature_col] = outlier_scores_dst

In [None]:
src_embed_cols = [f'SENDER_ACCOUNT_ID_EMBED_{c}' for c in range(n_components)]
dst_embed_cols = [f'RECEIVER_ACCOUNT_ID_EMBED_{c}' for c in range(n_components)]
color_scale = px.colors.cmocean.matter
px.scatter_3d(x=X_train[src_embed_cols[0]],
              y=X_train[src_embed_cols[1]],
              z=X_train[src_embed_cols[2]],
              opacity=0.5,
              color_continuous_scale=color_scale,
              color=y_train.astype(float),
              title=f"Embedding Space ({metric})")

In [None]:
# apply the embedding model to the eval set
print("getting embed on eval set...")
X_eval = get_embedding_features(X_eval, src_col, dst_col, model=train_embedding)

# Test -- could refit the embedding on train + eval first, but reusing the train embedding is quicker
print("getting embed on test set...")
X_test = get_embedding_features(X_test, src_col, dst_col, model=train_embedding)
# X_test = X_test[CATEGORICAL_COLUMNS + NUMERIC_COLUMNS]

In [None]:
# filter by cols we want to model
# to learn more about feature columns: https://www.tensorflow.org/tutorials/structured_data/feature_columns

src_embed_cols = [f'SENDER_ACCOUNT_ID_EMBED_{c}' for c in range(n_components)]
dst_embed_cols = [f'RECEIVER_ACCOUNT_ID_EMBED_{c}' for c in range(n_components)]
CATEGORICAL_COLUMNS = [src_col]  # could also have TX_TYPE, but it only has one value
NUMERIC_COLUMNS = ['TX_AMOUNT', 'EMBED_DISTANCE'] + src_embed_cols + dst_embed_cols

def one_hot_cat_column(feature_name, vocab):
  return tf.feature_column.indicator_column(
      tf.feature_column.categorical_column_with_vocabulary_list(feature_name,
                                                 vocab))
feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
  # Need to one-hot encode categorical features.
  vocabulary = X_train[feature_name].unique()
  feature_columns.append(one_hot_cat_column(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name,
                                           dtype=tf.float32))

In [None]:
# drop cols
X_train = X_train[CATEGORICAL_COLUMNS + NUMERIC_COLUMNS]
X_eval = X_eval[CATEGORICAL_COLUMNS + NUMERIC_COLUMNS]
X_test = X_test[CATEGORICAL_COLUMNS + NUMERIC_COLUMNS]

In [None]:
X_train.head(1)

In [None]:
# training description
print(X_train.shape, y_train.shape)
print(y_train.name, "\n--- on: ---\n", "\n".join(list(X_train.columns)))
print(y_train.value_counts())

In [None]:
# split the training set into 5 batches per pass
NUM_EXAMPLES = len(y_train)
EXP_PER_BATCH = 5  # number of batches per pass over the training set (despite the name)
def make_input_fn(X, y, n_epochs=None, shuffle=True):
  def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
    if shuffle:
      dataset = dataset.shuffle(NUM_EXAMPLES)
    # For training, cycle through the dataset as many times as needed (n_epochs=None).
    dataset = dataset.repeat(n_epochs)
    # Each batch holds 1/EXP_PER_BATCH of the training examples.
    dataset = dataset.batch(NUM_EXAMPLES // EXP_PER_BATCH)
    return dataset
  return input_fn

# Training and evaluation input functions.
train_input_fn = make_input_fn(X_train, y_train)
eval_input_fn = make_input_fn(X_eval, y_eval, shuffle=False, n_epochs=1)
test_input_fn = make_input_fn(X_test, y_test, shuffle=False, n_epochs=1)
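
The batch arithmetic is easy to get wrong, so here is a small worked example for a single pass (`n_epochs=1`) over a dataset of a made-up size. The `toy_` values are hypothetical.

```python
import math

toy_num_examples = 1003   # hypothetical dataset size
toy_batches = 5           # plays the role of EXP_PER_BATCH

batch_size = toy_num_examples // toy_batches
batches_per_pass = math.ceil(toy_num_examples / batch_size)
last_batch = toy_num_examples - (batches_per_pass - 1) * batch_size

print(batch_size, batches_per_pass, last_batch)
```

Because integer division rounds down, a single pass can yield one extra, smaller batch when the size is not divisible by 5. During training the dataset repeats before it is batched, so batches there are always full and simply run across pass boundaries.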

### train a baseline linear classifier

In [None]:
linear_est = tf.estimator.LinearClassifier(feature_columns)

# Train model.
linear_est.train(train_input_fn, max_steps=100)

# Evaluation.
result = linear_est.evaluate(eval_input_fn)
print(pd.Series(result))

In [None]:
pred_dicts = list(linear_est.predict(eval_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])

probs.plot(kind='hist', bins=50, title='predicted probabilities')

In [None]:
fpr, tpr, _ = roc_curve(y_eval, probs)
plt.plot(fpr, tpr)
plt.title('ROC curve')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.xlim(0,)
plt.ylim(0,)
plt.show()
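
For readers who want to see what `roc_curve` computes, the following is a minimal NumPy sketch of the ROC points and the area under the curve. It assumes there are no tied scores and uses made-up labels and scores; it is an illustration, not a replacement for scikit-learn's implementation.

```python
import numpy as np

# Hypothetical labels and scores, for illustration only.
toy_y = np.array([0, 0, 1, 0, 1, 1])
toy_p = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays, one point per cut, scores sorted high to low."""
    order = np.argsort(-scores)
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted)        # true positives when flagging the top-k scores
    fps = np.cumsum(1 - y_sorted)    # false positives for the same cut
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

toy_fpr, toy_tpr = roc_points(toy_y, toy_p)
# Trapezoidal area under the curve.
toy_auc = float(np.sum(np.diff(toy_fpr) * (toy_tpr[1:] + toy_tpr[:-1]) / 2))
print(toy_auc)
```

The area equals the fraction of (fraud, non-fraud) pairs in which the fraud example receives the higher score: here 6 of 9 pairs, or about 0.667.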

### TensorFlow BoostedTrees model training

In [None]:
! rm -rf bt_cls/

In [None]:
# The input function above splits the training set into 5 batches.
model_dir = 'bt_cls'
n_batches = 5 # batches per layer; 5 batches cover the whole dataset
max_depth = 4
l2_reg = 1e-8
max_steps = 100
# prune_mode = 'post'
# tree_complexity = 1e-4
est = tf.estimator.BoostedTreesClassifier(feature_columns, 
                                          max_depth=max_depth,
                                          l2_regularization=l2_reg,
                                          n_batches_per_layer=n_batches,
                                          model_dir=model_dir)

# Training stops when the requested number of trees is built or when
# max_steps is reached, whichever comes first.
est.train(train_input_fn, max_steps=max_steps)

# Eval.
result = est.evaluate(eval_input_fn)
print(result)

### BoostedTree evaluation results

In [None]:
pred_dicts = list(est.predict(eval_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])

probs.plot(kind='hist', bins=50, title='predicted probabilities')

In [None]:
fpr, tpr, _ = roc_curve(y_eval, probs)
plt.plot(fpr, tpr)
plt.title('ROC curve')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.xlim(0,)
plt.ylim(0,)
plt.show()

In [None]:
importances = est.experimental_feature_importances(normalize=True)
df_imp = pd.Series(importances)

# Visualize importances
N = X_train.shape[1]
ax = (df_imp.iloc[0:N][::-1]
    .plot(kind='barh',
          color='blue',
          title='Gain feature importances',
          figsize=(10, 6)))
ax.grid(False, axis='y')

In [None]:
color_scale = px.colors.cmocean.matter
px.scatter_3d(x=X_eval[src_embed_cols[0]],
              y=X_eval[src_embed_cols[1]],
              z=X_eval[src_embed_cols[2]],
              opacity=0.5,
              color_continuous_scale=color_scale,
              color=probs.values, text=["fraud" if x else "" for x in y_eval.values],
              title="Needles in a Haystack: Tx Graph Embedding on AMLSim dataset (red = potential AML)")

In [None]:
px.scatter(x=X_eval['EMBED_DISTANCE'],
              y=probs,
              opacity=0.8,
              color_continuous_scale=color_scale,
              color=y_eval.astype(int),
              title="Needles in a Haystack: Graph distance for targets is close")

In [None]:
px.scatter(x=X_eval['TX_AMOUNT'],
              y=probs,
              opacity=0.8,
              color_continuous_scale=color_scale,
              color=y_eval.astype(int),
              title="(Small) Needles in a Haystack: fraudulent transactions are small")

In [None]:
# VaR = probs * X_eval['TX_AMOUNT'].values  # Expected Value at Risk
# px.histogram(x=VaR)
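
The commented-out idea above weights each score by the transaction amount. A minimal sketch with made-up scores and amounts shows how that changes the review order compared with ranking by probability alone.

```python
import numpy as np

toy_probs = np.array([0.02, 0.90, 0.10])          # hypothetical model scores
toy_amounts = np.array([500.0, 120.0, 10000.0])   # hypothetical TX_AMOUNT values

# Expected amount at risk per transaction: P(fraud) * amount.
toy_var = toy_probs * toy_amounts
# Review order: largest expected exposure first.
review_order = np.argsort(-toy_var)
print(toy_var, review_order)
```

The transaction with the highest score (0.90) is only second in the queue, because the large 10,000 transfer carries more expected exposure even at a score of 0.10.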

## Test set results


In [None]:
result = linear_est.evaluate(test_input_fn)
print("Linear Cls Test Set results")
print(result)

In [None]:
result = est.evaluate(test_input_fn)
print("Boosted Trees Cls Test Set results")
print(result)

In [None]:
# Could fit the co-occurrence matrix on train and eval together.
# Because this was all generated by the same process, I don't think it is necessary
# co_mtx = create_cooccurance_matrix(pd.concat([X_train, X_eval]), src_col, dst_col)
# emb = umap.UMAP(n_components=n_components).fit(co_mtx)
# df_test = get_embedding_features(df_test, src_col, dst_col, model=emb)

### Export to Saved Model

The exported model can be loaded later with `tf.saved_model.load`: https://www.tensorflow.org/api_docs/python/tf/saved_model/load

In [None]:
# This is mysterious, but this example made it easy: https://www.tensorflow.org/lattice/tutorials/canned_estimators#creating_input_fn
srv_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(
        feature_spec=tf.feature_column.make_parse_example_spec(feature_columns))
est.export_saved_model('saved_model', srv_fn, as_text=True)

Zip the exported model and download it from Colab.

In [None]:
from google.colab import files
import os
import zipfile

In [None]:
export_num = 1606669696  # timestamped directory written by export_saved_model; check the output above for your value
path = f'saved_model/{export_num}'
export_name = 'transaction_scorer.zip'

with zipfile.ZipFile(export_name, 'w', zipfile.ZIP_DEFLATED) as zipf:
  for root, dirs, filepaths in os.walk(path):
    for f in filepaths:
      zipf.write(os.path.join(root, f))

files.download(export_name)
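
Hard-coding the export number means editing the cell after every export. One possible alternative, sketched below, is to pick the newest timestamped subdirectory automatically. `latest_export_dir` is a hypothetical helper, not part of TensorFlow, and the demonstration runs on a throwaway directory rather than on a real export.

```python
import os
import tempfile

def latest_export_dir(base_dir):
    """Return the path of the numerically largest all-digit subdirectory of base_dir."""
    stamps = [d for d in os.listdir(base_dir)
              if d.isdigit() and os.path.isdir(os.path.join(base_dir, d))]
    if not stamps:
        raise FileNotFoundError(f"no timestamped export found in {base_dir!r}")
    return os.path.join(base_dir, max(stamps, key=int))

# Demonstration on a temporary directory with two fake exports.
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("1606669696", "1606670000"):
        os.makedirs(os.path.join(tmp, stamp))
    newest = os.path.basename(latest_export_dir(tmp))
    print(newest)
```

In the notebook this would be called as `latest_export_dir('saved_model')`. The value returned by `est.export_saved_model(...)` is also the export path, so capturing it in a variable is another option.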

In [None]:
# download the graph embedding
embed_name = 'transaction_graph.npy'
np.save(embed_name, train_embedding.embedding_)
files.download(embed_name)
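
Once saved, the embedding is a plain array in which the row index is the account ID, so it can be reloaded and queried without UMAP installed. The sketch below round-trips a made-up array through a temporary file.

```python
import os
import tempfile
import numpy as np

# Hypothetical embedding standing in for train_embedding.embedding_.
toy_embedding = np.arange(12, dtype=np.float32).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "transaction_graph.npy")
    np.save(path, toy_embedding)
    loaded = np.load(path)

# Row index = account ID, so a lookup is plain integer indexing.
account_id = 2
vector = loaded[account_id]
print(vector)
```

Accounts that were not present when the embedding was fitted have no row, so a scoring service would need a fallback for unseen IDs.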

### FYI: Porting to AWS Sagemaker

In [None]:
# run on AWS
# https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#prepare-a-training-script
# aml_estimator = TensorFlow(entry_point='aml_cls.py',
#                              role=role,
#                              train_instance_count=2,
#                              train_instance_type='ml.p3.2xlarge',
#                              framework_version='2.3.0',
#                              py_version='py3',
#                              distributions={'parameter_server': {'enabled': True}})