# Fraud detection in Credit Card transactions with LSTM model
In this demo, we show TensorFlow functionality on IBM Z/LinuxONE using credit card transaction data in CSV format, provided by IBM on kaggle.com. The data contains both normal transactions and transactions that are likely fraudulent, with patterns in the historical data such as time since the last transaction, amount of money transferred, location (city, country), type of card payment (online/offline), etc. The data is transformed and joined in a Pandas DataFrame, then split into training, validation and test datasets. A Long Short-Term Memory (LSTM) model is then used to predict Fraud Yes/No for the transactions.

Gates are part of the hidden mechanics of the LSTM layer in TensorFlow/Keras. The LSTM model in this demo is built with 2 hidden LSTM layers. Each layer implicitly contains the gates depicted below.  

![LSTM CC Diagram](<./LSTM for CC on IBM Telum Chip.png>)

First, we set up a machine learning environment by importing key libraries such as TensorFlow (with Keras), NumPy, Pandas, and others for data processing, math operations, model saving/loading, and visualization. We also configure TensorFlow to suppress non-critical log messages and print the TensorFlow version.

In [None]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import math
import joblib
import pydot
print(tf.__version__)

In [None]:
timesteps = 7 # Six past transactions followed by current transaction

# Load dataset into Pandas DataFrame
In this section, we load credit card transactions from a CSV file using the Pandas library, convert columns as needed, sort the rows by user and card, and skip the first ```timesteps - 1``` transactions of each user-card combination, because they do not have enough history to form a full sequence. The remaining rows are split into training (50%), validation (30%), and test (20%) sets by randomly selecting indices for each set. The resulting indices are used later to access specific rows of the original DataFrame.

In [None]:
tdf = pd.read_csv('transactions_history.csv')
tdf['Merchant Name'] = tdf['Merchant Name'].astype(str)
tdf.sort_values(by=['User','Card'], inplace=True)
tdf.reset_index(inplace=True, drop=True)
print (tdf.info())

# Get first of each User-Card combination
first = tdf[['User','Card']].drop_duplicates()
f = np.array(first.index)

# Drop the first timesteps - 1 transactions of each User-Card combination (not enough history)
drop_list = np.concatenate([np.arange(x,x + timesteps - 1) for x in f])
index_list = np.setdiff1d(tdf.index.values,drop_list)

# Split into 0.5 train, 0.3 validate, 0.2 test
tot_length = index_list.shape[0]
train_length = tot_length // 2
validate_length = (tot_length - train_length) * 3 // 5
test_length = tot_length - train_length - validate_length
print (tot_length,train_length,validate_length, test_length)

# Generate list of indices for train, validate, test
np.random.seed(1111)
train_indices = np.random.choice(index_list, train_length, replace=False)
tv_list = np.setdiff1d(index_list, train_indices)
validate_indices = np.random.choice(tv_list, validate_length, replace=False)
test_indices = np.setdiff1d(tv_list, validate_indices)
print (train_indices, validate_indices, test_indices)
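
The split arithmetic above can be checked on toy data. The following is a minimal sketch, assuming 1000 hypothetical row indices; the helper name `split_indices` and the use of a local `RandomState` instead of the global seed are illustrative choices, not part of the notebook.

```python
import numpy as np

def split_indices(index_list, seed=1111):
    """Split indices into disjoint 50% / 30% / 20% train, validation and test sets."""
    total = index_list.shape[0]
    train_length = total // 2
    # 3/5 of the remaining half is 30% of the total; the rest (20%) is the test set.
    validate_length = (total - train_length) * 3 // 5
    rng = np.random.RandomState(seed)
    train = rng.choice(index_list, train_length, replace=False)
    rest = np.setdiff1d(index_list, train)
    validate = rng.choice(rest, validate_length, replace=False)
    test = np.setdiff1d(rest, validate)
    return train, validate, test

toy_indices = np.arange(6, 1006)  # 1000 hypothetical row positions
train, validate, test = split_indices(toy_indices)
print(len(train), len(validate), len(test))  # 500 300 200
```

Because each set is drawn without replacement from what the previous set left over, no index can appear in more than one set.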

# Generate a test sample as original dataset + test indices
Working with indices speeds up processing of large datasets such as millions of transaction records. The split above stores only row indices, so each sequence can be looked up directly in the sorted DataFrame without copying data or sorting again. The function below writes a test sample to disk: the rows needed for the selected indices, plus the indices themselves.

In [None]:
def create_test_sample(df, indices):
    print(indices)
    rows = indices.shape[0]
    index_array = np.zeros((rows, timesteps), dtype=int)
    for i in range(timesteps):
        index_array[:,i] = indices + 1 - timesteps + i
    uniques = np.unique(index_array.flatten())
    df.loc[uniques].to_csv('test_220_40k.csv',index_label='Index')
    np.savetxt('test_220_40k.indices',indices.astype(int),fmt='%d')

create_test_sample(tdf, validate_indices[:40000])  # Writes the test sample files that the quick test reads later
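
The loop that fills `index_array` turns each selected index into a window of `timesteps` consecutive row positions that ends at that index. A minimal sketch of the same computation, written with broadcasting; the helper name `build_windows` and the sample indices are hypothetical.

```python
import numpy as np

timesteps = 7

def build_windows(indices, timesteps):
    """Return an array of shape (len(indices), timesteps) of row positions.

    Each row ends at the given index and starts timesteps - 1 rows earlier.
    """
    offsets = np.arange(timesteps) + 1 - timesteps   # [-6, -5, ..., 0]
    return np.asarray(indices, dtype=int)[:, None] + offsets

windows = build_windows(np.array([6, 10]), timesteps)
print(windows)
# [[ 0  1  2  3  4  5  6]
#  [ 4  5  6  7  8  9 10]]
print(np.unique(windows.flatten()).shape[0])  # 11 distinct rows are needed
```

This also shows why the first `timesteps - 1` rows of every user-card combination were excluded from `index_list`: a window ending there would reach back into the rows of a different card.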

# Define custom mapping functions
To control how text columns are treated, we define custom mapping (encoder) functions using the ```astype``` and ```to_datetime``` methods in Pandas. These functions transform the values of a column into numerical representations of a specific type suitable for model training. For example, the function ```amtEncoder``` maps a credit card transaction amount, given as a string in USD with a leading ```$```, to a log-scaled float value.

In [None]:
def timeEncoder(X):
    X_hm = X['Time'].str.split(':', expand=True)
    d = pd.to_datetime(dict(year=X['Year'],month=X['Month'],day=X['Day'],hour=X_hm[0],minute=X_hm[1])).astype(int)
    return pd.DataFrame(d)

def amtEncoder(X):
    amt = X.apply(lambda x: x[1:]).astype(float).map(lambda amt: max(1,amt)).map(math.log)
    return pd.DataFrame(amt)

def decimalEncoder(X,length=5):
    dnew = pd.DataFrame()
    for i in range(length):
        dnew[i] = np.mod(X,10) 
        X = np.floor_divide(X,10)
    return dnew

def fraudEncoder(X):
    return np.where(X == 'Yes', 1, 0).astype(int)
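
To make the effect of two of these encoders concrete, here is a simplified sketch that applies the same arithmetic to plain Python and NumPy values instead of Pandas columns. The names `encode_amount` and `decimal_digits` and the sample values are hypothetical stand-ins for `amtEncoder` and `decimalEncoder`.

```python
import math
import numpy as np

def encode_amount(text):
    """'$134.09' -> log(max(1, 134.09)); amounts below 1, including refunds, map to 0.0."""
    return math.log(max(1.0, float(text[1:])))

def decimal_digits(values, length=5):
    """Split non-negative integers into `length` base-10 digits, least significant first."""
    values = np.asarray(values)
    digits = []
    for _ in range(length):
        digits.append(np.mod(values, 10))
        values = np.floor_divide(values, 10)
    return np.stack(digits, axis=1)

print(encode_amount("$134.09"))   # natural log of 134.09
print(encode_amount("$-77.00"))   # 0.0
print(decimal_digits([123, 45]))
# [[3 2 1 0 0]
#  [5 4 0 0 0]]
```

One likely motivation for the digit split is size: a label-encoded column with thousands of categories becomes at most 5 digits with 10 possible values each once every digit is one-hot encoded, instead of one column per category.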

# Fit pre-processing function using DataFrameMapper and pickle the mapper
In machine learning, pre-processing transforms raw data into a structured format by handling missing values, scaling features, encoding categorical variables, and addressing imbalanced data. It also involves tasks like feature engineering, outlier removal, and dimensionality reduction to enhance model performance and accuracy. These steps ensure the data is clean and ready for model training.

In [None]:
save_dir = 'shared/fraud_lstm'
os.makedirs(save_dir, exist_ok=True)

In [None]:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.impute import SimpleImputer

mapper = DataFrameMapper([('Is Fraud?', FunctionTransformer(fraudEncoder)),
                          (['Merchant State'], [SimpleImputer(strategy='constant'), FunctionTransformer(np.ravel),
                                               LabelEncoder(), FunctionTransformer(decimalEncoder), OneHotEncoder()]),
                          (['Zip'], [SimpleImputer(strategy='constant'), FunctionTransformer(np.ravel),
                                     FunctionTransformer(decimalEncoder), OneHotEncoder()]),
                          ('Merchant Name', [LabelEncoder(), FunctionTransformer(decimalEncoder), OneHotEncoder()]),
                          ('Merchant City', [LabelEncoder(), FunctionTransformer(decimalEncoder), OneHotEncoder()]),
                          ('MCC', [LabelEncoder(), FunctionTransformer(decimalEncoder), OneHotEncoder()]),
                          (['Use Chip'], [SimpleImputer(strategy='constant'), LabelBinarizer()]),
                          (['Errors?'], [SimpleImputer(strategy='constant'), LabelBinarizer()]),
                          (['Year','Month','Day','Time'], [FunctionTransformer(timeEncoder), MinMaxScaler()]),
                          ('Amount', [FunctionTransformer(amtEncoder), MinMaxScaler()])
                         ], input_df=True, df_out=True)
mapper.fit(tdf)

Serialize and deserialize the fitted mapper as a Python object using the joblib module. This lets you save the fitted pre-processing function to a file and load it later without having to fit it again.

In [None]:
# Pass the path directly so joblib opens and closes the file itself
mapper_path = os.path.join(save_dir, 'fitted_mapper.pkl')
joblib.dump(mapper, mapper_path)
mapper = joblib.load(mapper_path)

# Transform a small sample to get the number of mapped features (including label)

In [None]:
mapped_sample = mapper.transform(tdf[:100])
mapped_size = mapped_sample.shape[-1]
print(mapped_size)

# Utilities 

A set of helper functions and classes

In [None]:
def print_trainable_parameters(model):
    # Print each trainable variable of a Keras model with its shape and parameter count
    total = 0
    for variable in model.trainable_variables:
        shape = variable.shape
        parameters = 1
        for dim in shape:
            parameters *= dim
        total += parameters
        print (variable.name, shape, parameters)
    print(total)

def f1(conf):
    precision = float(conf[1][1]) / (conf[1][1]+conf[0][1])
    recall = float(conf[1][1]) / (conf[1][1]+conf[1][0])
    return 2 * precision * recall / (precision + recall)

class TP(tf.keras.metrics.TruePositives):
    def update_state(self, y_true, y_pred, sample_weight=None):
        super().update_state(y_true[-1,:,:], y_pred[-1,:,:], sample_weight)

class FP(tf.keras.metrics.FalsePositives):
    def update_state(self, y_true, y_pred, sample_weight=None):
        super().update_state(y_true[-1,:,:], y_pred[-1,:,:], sample_weight)

class FN(tf.keras.metrics.FalseNegatives):
    def update_state(self, y_true, y_pred, sample_weight=None):
        super().update_state(y_true[-1,:,:], y_pred[-1,:,:], sample_weight)

class TN(tf.keras.metrics.TrueNegatives):
    def update_state(self, y_true, y_pred, sample_weight=None):
        super().update_state(y_true[-1,:,:], y_pred[-1,:,:], sample_weight)
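
The `f1` helper indexes its argument as a 2x2 confusion matrix in which `conf[1][1]` holds true positives, `conf[0][1]` false positives and `conf[1][0]` false negatives. A worked example with hypothetical counts, using a restated version of the helper under the name `f1_from_confusion`:

```python
def f1_from_confusion(conf):
    """F1 score from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    tn, fp = conf[0]
    fn, tp = conf[1]
    precision = tp / (tp + fp)   # share of predicted frauds that are real
    recall = tp / (tp + fn)      # share of real frauds that were caught
    return 2 * precision * recall / (precision + recall)

conf = [[90, 10],   # hypothetical counts: 90 TN, 10 FP
        [5, 20]]    #                       5 FN, 20 TP
score = f1_from_confusion(conf)
print(round(score, 4))  # 0.7273
```

Accuracy for the same matrix is 110/125 = 0.88, which illustrates why F1 is the more informative number when fraud cases are rare.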


# Defining and compiling the LSTM model with TensorFlow
The LSTM (Long Short-Term Memory) model is defined first with a set of parameters, using the Keras API integrated in TensorFlow for simplicity.
```input_shape=tf_input``` describes one input sequence: ```timesteps=7``` is the number of time steps per input sequence, and ```input_size``` is the number of features per time step.
Training runs in batches of 16.

The ```plot_model``` visualization function in the Keras library creates a graphical representation of the LSTM model's architecture.

In [None]:
units = [200,200]
input_size = mapped_size - 1
output_size = 1
batch_size = 16
timesteps = 7
tf_input = ([timesteps, input_size])

lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(units[0], input_shape=tf_input, batch_size=batch_size, time_major=True, return_sequences=True),
    tf.keras.layers.LSTM(units[1], return_sequences=True, time_major=True),
    tf.keras.layers.Dense(output_size, activation='sigmoid')
])

lstm_model.summary()
tf.keras.utils.plot_model(lstm_model, 'model.png', show_shapes=True)
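
The parameter counts that `summary()` reports follow from the gate structure: a standard LSTM layer with bias holds four weight sets (input gate, forget gate, cell candidate, output gate), each with an input kernel, a recurrent kernel and a bias vector. A small sketch of that arithmetic; `input_size = 220` is a hypothetical value chosen for illustration, since the real value depends on `mapped_size`.

```python
def lstm_param_count(input_dim, units):
    """Trainable weights of one standard LSTM layer with bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    """Trainable weights of one fully connected layer with bias."""
    return input_dim * units + units

input_size = 220      # hypothetical number of features per time step
units = [200, 200]
first = lstm_param_count(input_size, units[0])
second = lstm_param_count(units[0], units[1])
head = dense_param_count(units[1], 1)
print(first, second, head)      # 336800 320800 201
print(first + second + head)    # 657801
```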

In [None]:
# Compile model
metrics=['accuracy', 
    TP(name='TP'),
    FP(name='FP'),
    FN(name='FN'),
    TN(name='TN'),
    tf.keras.metrics.TruePositives(name='tp'),
    tf.keras.metrics.FalsePositives(name='fp'),
    tf.keras.metrics.FalseNegatives(name='fn'),
    tf.keras.metrics.TrueNegatives(name='tn')
   ]

lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=metrics)
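
The ```binary_crossentropy``` loss selected above penalizes confident wrong predictions far more than confident correct ones. A NumPy sketch of the formula follows; the clipping constant `eps` is an assumption made here to keep the logarithm finite, and the sample values are hypothetical.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

# One fraud (1) and one normal (0) transaction
confident = binary_crossentropy([1, 0], [0.9, 0.1])   # both predictions right
wrong = binary_crossentropy([1, 0], [0.1, 0.9])       # both predictions wrong
print(confident, wrong)
```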

# Generator for training batches
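The generator function produces one batch of data at a time from the indices instead of handing the whole training set to the model at once. Each pass of its outer loop builds a balanced set of all fraud rows plus an equally sized random sample of non-fraud rows, so large datasets can be streamed and processed on the fly.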

In [None]:
def gen_training_batch(df, mapper, index_list, batch_size):
    np.random.seed(98765)
    train_df = df.loc[index_list]
    non_fraud_indices = train_df[train_df['Is Fraud?'] == 'No'].index.values
    fraud_indices = train_df[train_df['Is Fraud?'] == 'Yes'].index.values
    fsize = fraud_indices.shape[0]
    while True:
        indices = np.concatenate((fraud_indices,np.random.choice(non_fraud_indices,fsize,replace=False)))
        np.random.shuffle(indices)
        rows = indices.shape[0]
        index_array = np.zeros((rows, timesteps), dtype=int)
        for i in range(timesteps):
            index_array[:,i] = indices + 1 - timesteps + i
        full_df = mapper.transform(df.loc[index_array.flatten()])
        target_buffer = full_df['Is Fraud?'].to_numpy().reshape(rows, timesteps, 1)
        data_buffer = full_df.drop(['Is Fraud?'],axis=1).to_numpy().reshape(rows, timesteps, -1)

        batch_ptr = 0
        while (batch_ptr + batch_size) <= rows:
            data = data_buffer[batch_ptr:batch_ptr+batch_size]
            targets = target_buffer[batch_ptr:batch_ptr+batch_size]
            batch_ptr += batch_size
            yield data,targets
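
The balancing step inside the generator can be isolated as follows. This is a minimal sketch on hypothetical labels with 1% fraud; the helper name `balanced_epoch` is not part of the notebook, and it returns row positions only, without the feature transformation.

```python
import numpy as np

def balanced_epoch(labels, batch_size, rng):
    """Return shuffled row positions with as many non-fraud as fraud rows, in full batches."""
    labels = np.asarray(labels)
    fraud = np.flatnonzero(labels == 1)
    non_fraud = np.flatnonzero(labels == 0)
    # Undersample the majority class to the size of the fraud class
    sampled = rng.choice(non_fraud, fraud.shape[0], replace=False)
    epoch = np.concatenate((fraud, sampled))
    rng.shuffle(epoch)
    usable = (epoch.shape[0] // batch_size) * batch_size  # drop an incomplete final batch
    return epoch[:usable].reshape(-1, batch_size)

rng = np.random.RandomState(98765)
labels = np.array([1] * 10 + [0] * 990)   # hypothetical: 10 fraud rows out of 1000
batches = balanced_epoch(labels, batch_size=4, rng=rng)
print(batches.shape)            # (5, 4)
print(labels[batches].mean())   # 0.5
```

Every fraud row is reused in each pass, while the non-fraud rows are redrawn, so over many passes the model sees a wide variety of normal transactions.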

# TensorFlow training - LSTM model
Training the LSTM model involves a forward pass, loss calculation, backward pass and weight update for every batch, repeated in a loop over epochs. The ```ModelCheckpoint``` callback, as configured here, saves the model weights to the checkpoint location after each epoch. After training, the final weights and the full model are saved in ```save_dir```.

In [None]:
steps_per_epoch = 500
checkpoint_dir = "./checkpoints/ccf_220_keras_lstm_static/"
filepath = checkpoint_dir + "iter-{epoch:02d}/model.ckpt"
batch_size = 16

print ("Learning...")
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=filepath, save_weights_only=True, verbose=1)
train_generate = gen_training_batch(tdf,mapper,train_indices,batch_size)
lstm_model.fit(train_generate, epochs=5, steps_per_epoch=steps_per_epoch, verbose=1, callbacks=[cp_callback])

lstm_model.save_weights(os.path.join(save_dir,"wts"))
lstm_model.save(save_dir)
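
The ```{epoch:02d}``` placeholder in ```filepath``` is an ordinary Python format field that the checkpoint callback fills in with the 1-based epoch number. The sketch below expands it by hand for the five epochs and also works out how many sequences the settings above feed to the model.

```python
checkpoint_dir = "./checkpoints/ccf_220_keras_lstm_static/"
filepath = checkpoint_dir + "iter-{epoch:02d}/model.ckpt"

# One checkpoint path per epoch, numbered from 1
paths = [filepath.format(epoch=epoch) for epoch in range(1, 6)]
print(paths[0])    # ./checkpoints/ccf_220_keras_lstm_static/iter-01/model.ckpt
print(paths[-1])   # ./checkpoints/ccf_220_keras_lstm_static/iter-05/model.ckpt

steps_per_epoch, batch_size, epochs = 500, 16, 5
sequences_seen = steps_per_epoch * batch_size * epochs
print(sequences_seen)  # 40000
```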

# Generator for validation/test batches
Similar to the generator used in the training process, we use a generator function to produce batches during the testing step.

In [None]:
def gen_test_batch(df, mapper, indices, batch_size):
    rows = indices.shape[0]
    index_array = np.zeros((rows, timesteps), dtype=int)
    for i in range(timesteps):
        index_array[:,i] = indices + 1 - timesteps + i
    count = 0
    while (count + batch_size <= rows):        
        full_df = mapper.transform(df.loc[index_array[count:count+batch_size].flatten()])
        data = full_df.drop(['Is Fraud?'],axis=1).to_numpy().reshape(batch_size, timesteps, -1)
        targets = full_df['Is Fraud?'].to_numpy().reshape(batch_size, timesteps, 1)
        count += batch_size
        yield data, targets
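
Note that the loop condition only yields complete batches, which matches a model built with a fixed ```batch_size```; any rows left over at the end are not evaluated. A small sketch of that behaviour, with a hypothetical row count and a helper name (`full_batches`) introduced only for illustration:

```python
import numpy as np

def full_batches(indices, batch_size):
    """Yield consecutive slices of exactly batch_size indices; a shorter tail is skipped."""
    count = 0
    while count + batch_size <= indices.shape[0]:
        yield indices[count:count + batch_size]
        count += batch_size

indices = np.arange(4500)   # hypothetical number of test rows
batches = list(full_batches(indices, batch_size=2000))
skipped = indices.shape[0] - sum(len(b) for b in batches)
print(len(batches), skipped)  # 2 500
```

With 40,000 sampled indices and a batch size of 2000, as in the quick test, the division is exact and nothing is skipped.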

# TensorFlow testing - LSTM model
During the model testing step in TensorFlow/Keras, we evaluate the trained LSTM model on unseen data (a test set) to measure its final performance. The model is rebuilt with a larger batch size, the trained weights are loaded from ```save_dir```, and the model is compiled and then tested via the ```evaluate``` function.
The Quick test runs on the saved 40,000-index sample, and the Full test runs on the complete set of test indices.

In [None]:
batch_size = 2000

input_size=mapped_size-1
output_size=1
units=[200,200]
timesteps = 7
tf_input = ([timesteps, input_size])

new_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(units[0], input_shape=tf_input, batch_size=batch_size, time_major=True, return_sequences=True),
    tf.keras.layers.LSTM(units[1], return_sequences=True, time_major=True),
    tf.keras.layers.Dense(output_size, activation='sigmoid')
])
new_model.load_weights(os.path.join(save_dir,"wts"))
new_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=metrics)
ddf = pd.read_csv('test_220_40k.csv', dtype={"Merchant Name":"str"}, index_col='Index')
indices = np.loadtxt('test_220_40k.indices', dtype=int)  # read row indices as integers, not floats

batch_size = 2000

print("\nQuick test")
test_generate = gen_test_batch(ddf,mapper,indices,batch_size)
new_model.evaluate(test_generate, verbose=1)

print("\nFull test")
test_generate = gen_test_batch(tdf,mapper,test_indices,batch_size)
new_model.evaluate(test_generate, verbose=1)

## Convert to ONNX
tf2onnx is a tool that converts TensorFlow models into ONNX (Open Neural Network Exchange) format, making them compatible with other platforms and backends. ONNX is an open-source format designed for interoperability across different deep learning frameworks, allowing models trained in TensorFlow (or other frameworks like PyTorch) to be used on different platforms without needing to be re-trained or re-implemented. In our context here, models trained on x86 can be exported to ONNX and deployed on IBM Z/LinuxONE for inferencing. 

![LSTM CC Diagram](<./Train Anywhere Inference on IBM Z.png>)

In [None]:
# Convert into ONNX format
from tensorflow import keras
import tf2onnx
import onnx
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
# convert to onnx
onnx_model,_ = tf2onnx.convert.from_keras(new_model, opset=13)

# Save model in ONNX format
onnx.save_model(onnx_model, "fraud_lstm.onnx")
