# Identifying Fraudulent Transactions

The objective of this notebook is to demonstrate that adding graph features to a Machine Learning pipeline often results in more accurate predictions. To accomplish this, we’ll use the [PaySim](https://github.com/voutilad/paysim-demo) dataset and go through the following steps:
1. Build a Binary Classifier, using traditional Machine Learning (ML) features, to detect fraudulent transactions.
2. Retrain the Binary Classifier on an enhanced set of features (by adding graph features).
3. Compare the performance measures of both models.
4. Look at the feature weights (importance) in the model.
5. Use Regularization to select the most important features.
6. Discuss the Precision/Recall Threshold.

## Preparing the Environment

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from neo4j import GraphDatabase
from multiprocessing import Pool,cpu_count
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer, StandardScaler, label_binarize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix,precision_score,recall_score,f1_score,roc_auc_score,precision_recall_curve

## Utility Classes and Functions

### Database Class

This class encapsulates the connection to the Neo4j Database and the execution of queries. It has the following instance variables:
1. _driver: Neo4j driver instance
2. query: Cypher query that was run last
3. data: Pandas Data Frame object containing the results of a Cypher query

In [None]:
class Database:
    
    def __init__(self, uri, user, password,encrypted=False):
        self._driver = GraphDatabase.driver(uri, auth=(user, password),encrypted=encrypted)

    def close(self):
        self._driver.close()
        
    def getData(self,query):
        self.query = query
        with self._driver.session() as session:
            results = session.run(query)
            # Read the records while the session is still open
            self.data = pd.DataFrame(results.values(),columns=results.keys())
    
    def runQuery(self,query):
        self.query = query
        with self._driver.session() as session:
            session.run(query)

### Datasets Class

This class is used to split a dataset into training and test sets. It has the following instance variables:
1. original_data: Original Data Frame passed as an argument
2. data: Original dataset after dropping the attributes that are not going to be used as ML features
3. train_set and test_set: Resulting Data Frames after the train/test split
4. X_Train_DF, X_Test_DF: Train and test Data Frames after removing the fraud labels
5. Y_Train, Y_Test: Train and test fraud labels

In [None]:
class Datasets():
    
    def __init__(self,data,test_size=.25):
        self.original_data = data
        self.data = data.drop(['data_chunk','txOrder','fromId','toId'],axis=1)
        # A client's first transaction has no previous transaction type
        self.data['previousTxType'] = self.data['previousTxType'].fillna('First')
        train_set, test_set = train_test_split(self.data,test_size=test_size,random_state=42)
        self.train_set = train_set.copy()
        self.test_set = test_set.copy()
        self.X_Train_DF = self.train_set.drop('fraudulentTx',axis=1)
        self.Y_Train = self.train_set['fraudulentTx'].copy()
        self.X_Test_DF = self.test_set.drop('fraudulentTx',axis=1)
        self.Y_Test = self.test_set['fraudulentTx'].copy()

### AttributeSelector Class

This class is used to select either the quantitative or the qualitative attributes from the dataset.

In [None]:
class AttributeSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, quantitative = True): 
        self.quantitative = quantitative
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        if self.quantitative:
            self.attribute_names = list(X.select_dtypes(exclude=['object','category']).columns)
        else:
            self.attribute_names = list(X.select_dtypes(include=['object','category']).columns)
        return X[self.attribute_names].values
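The dtype-based selection that `AttributeSelector` relies on can be seen in isolation on a toy Data Frame. This is only a sketch: the column values below are made up for illustration and are not taken from the PaySim data.

```python
import pandas as pd

# Toy Data Frame (hypothetical values) with one qualitative and two quantitative columns
toy = pd.DataFrame({
    'txType': pd.Series(['Transfer', 'CashOut', 'Payment'], dtype='category'),
    'txAmount': [120.0, 75.5, 10.0],
    'pageRank': [0.15, 0.90, 0.20],
})

# Same select_dtypes calls as in AttributeSelector.transform
quantitative_cols = list(toy.select_dtypes(exclude=['object', 'category']).columns)
qualitative_cols = list(toy.select_dtypes(include=['object', 'category']).columns)

print(quantitative_cols)
print(qualitative_cols)
```

Note that any column that is neither `object` nor `category` (booleans included) ends up on the quantitative side.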

### MyLabelBinarizer Class

This class is used to create a one-hot encoding for all the qualitative attributes.

In [None]:
class MyLabelBinarizer(BaseEstimator, TransformerMixin):
    
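    def fit(self, x, y=None):
        # Learn the categories of each column during fit, so that transform
        # produces the same columns for the training and the test set
        self.classes_ = [np.unique(x[:,col]) for col in range(x.shape[1])]
        return self
    
    def transform(self, x, y=None):
        transformations = []
        for col in range(x.shape[1]):
            transformations.append(label_binarize(x[:,col],classes=self.classes_[col]))
        return np.hstack(transformations)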
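To make the encoding concrete, here is a small sketch of what `label_binarize` returns for a single qualitative column. The category values are illustrative only. One detail worth knowing: columns follow the sorted order produced by `np.unique`, and a column with exactly two categories is encoded as a single 0/1 column rather than two.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Toy qualitative column (hypothetical values)
tx_types = np.array(['Transfer', 'CashOut', 'Payment', 'CashOut'], dtype=object)
classes = np.unique(tx_types)  # sorted: CashOut, Payment, Transfer
one_hot = label_binarize(tx_types, classes=classes)
print(one_hot)

# With only two categories, label_binarize emits a single column
binary = np.array(['First', 'Transfer', 'First'], dtype=object)
single_col = label_binarize(binary, classes=np.unique(binary))
print(single_col.shape)
```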

### Functions

These functions compute, in parallel, the minimum, mean and maximum amount over each client's last 7 transactions (the current transaction included) and add these features to the dataset.

In [None]:
def prevTxFeatures(data,lastK=7):
    for i in range(data.shape[0]):
        last7TxAmount = data.iloc[:i+1,2].tail(lastK)
        data.iloc[i,7] = last7TxAmount.min()
        data.iloc[i,8] = last7TxAmount.mean()
        data.iloc[i,9] = last7TxAmount.max()
    return(data)

def applyParallel(dfGrouped, func):
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pd.concat(ret_list)

def groupApply(df):
    return df.groupby('fromId').apply(prevTxFeatures)
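As a possible alternative to the explicit loop, the same per-client window statistics can be sketched with a pandas grouped rolling window. This is not the approach used in this notebook: the helper name `rolling_tx_features` is hypothetical, and the sketch assumes the rows are already ordered by transaction order within each client.

```python
import pandas as pd

def rolling_tx_features(df, last_k=7):
    """Min/mean/max of txAmount over each client's last_k transactions (current one included)."""
    out = df.copy()
    rolling = out.groupby('fromId')['txAmount'].rolling(last_k, min_periods=1)
    for name, stat in (('min', rolling.min()), ('mean', rolling.mean()), ('max', rolling.max())):
        # Drop the fromId index level so the result aligns with the original rows
        out[f'{name}{last_k}PrevTxAmount'] = stat.reset_index(level=0, drop=True)
    return out

# Toy data (hypothetical values): two clients, window of 2 to keep the numbers small
toy = pd.DataFrame({'fromId': [1, 1, 1, 2, 2],
                    'txAmount': [10.0, 30.0, 20.0, 5.0, 15.0]})
features = rolling_tx_features(toy, last_k=2)
print(features)
```

Because the work is vectorized inside pandas, a variant like this may remove the need for the multiprocessing pool, although that has not been measured here.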

## Traditional ML Pipeline

### Neo4j Database Information

In [None]:
uri = 'bolt://neo4jdb:7687'
graph = Database(uri,user='neo4j',password='DS_Training')

### Loading the Data

Computing the previous transaction features described above takes approximately 15 minutes. To avoid waiting for this process to complete, a Data Frame containing these features was persisted as a pickle file. The following code cell loads this Data Frame into memory.

In [None]:
data = pd.read_pickle('/data/paysim_transactions.pkl.gzip',compression='gzip')
data.head()

The following 3 code cells can be skipped; they only demonstrate how the previous ML features were obtained and persisted to disk.

cypherQuery = '''
MATCH(c:Client)-[:PERFORMED]->(t:Transaction)-[:TO]->(n)
WITH t,c,labels(t)[0] AS txType,t.amount AS txAmount,t.globalStep AS txOrder,
	t.fraud AS fraudulentTx,id(n) AS toId
OPTIONAL MATCH(t)<-[:NEXT]-(pt:Transaction)
RETURN id(c) AS fromId,txType,txAmount,txOrder,fraudulentTx,toId,labels(pt)[0] AS previousTxType
ORDER BY id(c),txOrder
'''
graph.getData(cypherQuery)

start = time.time()
data = graph.data.assign(min7PrevTxAmount=0.0,mean7PrevTxAmount=0.0,max7PrevTxAmount=0.0)
data["data_chunk"] = data["fromId"].mod(cpu_count() * 3)
data = applyParallel(data.groupby('data_chunk'),groupApply)
end = time.time()
print("Execution time: " + str(end - start))

data.to_pickle('./paysim_transactions.pkl.gzip',compression='gzip')

### Train and Test Split

In [None]:
ml_data = Datasets(data)

In [None]:
ml_data.train_set.info()

### Train Set Descriptive Statistics

#### Quantitative Features

In [None]:
ml_data.X_Train_DF.describe()

#### Qualitative Features

In [None]:
frequency = {}
missing = {}
for col in ml_data.X_Train_DF.select_dtypes(include=['object','category']).columns:
    frequency[col] = ml_data.X_Train_DF[col].value_counts()
    missing[col] = ml_data.X_Train_DF[col].isnull().sum()

In [None]:
frequency

In [None]:
missing

### Fraud Labels Sampling Bias

In [None]:
ml_data.data.fraudulentTx.value_counts()/ml_data.data.shape[0]

In [None]:
ml_data.Y_Train.value_counts()/len(ml_data.Y_Train)
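The two cells above check that the share of fraudulent labels in the training set matches the full dataset. If the proportions were to drift, one option (not used in this notebook) is a stratified split through the `stratify` argument of `train_test_split`. The sketch below uses toy labels with a made-up 4% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption): 40 positives out of 1,000 observations
labels = np.array([1] * 40 + [0] * 960)
features = np.arange(len(labels)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels)

# Both splits keep the positive rate of the full label vector
print(labels.mean(), y_train.mean(), y_test.mean())
```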

### Preparing the Training Set for Machine Learning

In [None]:
quant_pipeline = Pipeline([('selector',AttributeSelector()),
                           ('std_scaler',StandardScaler())
                          ])

qual_pipeline = Pipeline([('selector',AttributeSelector(quantitative=False)),
                          ('label_binarizer',MyLabelBinarizer())
                         ])

data_prep_pipeline = FeatureUnion(transformer_list=[('quant_pipeline',quant_pipeline),
                                               ('qual_pipeline',qual_pipeline)])
X_Prepared = data_prep_pipeline.fit_transform(ml_data.X_Train_DF)
print(X_Prepared.shape)

### Stochastic Gradient Descent Classifier (SGD) Accuracy

The following code cell trains an SGD model and evaluates its accuracy using 5-fold cross-validation.

In [None]:
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_Prepared, ml_data.Y_Train)
cross_val_score(sgd_clf, X_Prepared, ml_data.Y_Train, cv=5, scoring="accuracy")

The model correctly classified 99.7% of the transactions. However, this is simply because only about 1% of the transactions are labeled as fraudulent, so a model that always classifies a transaction as non-fraudulent would still classify 99% of the transactions correctly. Beats Nostradamus!

This demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (some labels are much more frequent than others).
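The point can be checked with a few lines of NumPy. The labels below are synthetic (exactly 1% positives, chosen to mirror the rough proportion mentioned above), and the "model" simply predicts non-fraudulent for every transaction.

```python
import numpy as np

# Synthetic labels (assumption): 100 fraudulent transactions out of 10,000
y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1

# A "model" that always predicts non-fraudulent
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
frauds_caught = int(((y_pred == 1) & (y_true == 1)).sum())
print(accuracy, frauds_caught)
```

The accuracy looks excellent even though not a single fraudulent transaction is detected.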

### SGD Confusion Matrix

A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count how often the model classifies transactions incorrectly. Each row of this matrix represents an observed (actual) class, while each column represents a predicted class.

| Observations/Predictions | Negative | Positive |
|:---------|:------------:|:----------:|
| Negative | TN | FP |
| Positive | FN | TP |

Where:
* TP = True Positive
* FP = False Positive
* FN = False Negative
* TN = True Negative
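The layout described above can be verified on a handful of hand-made labels (illustrative values only). For a binary problem, `ravel()` unpacks the matrix in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = fraudulent, 0 = non-fraudulent
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()
print(matrix)
print(tn, fp, fn, tp)
```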



In [None]:
predictions = cross_val_predict(sgd_clf,X_Prepared,ml_data.Y_Train,cv=5)
ml_conf_matrix = confusion_matrix(ml_data.Y_Train,predictions)
ml_conf_matrix

### SGD Performance Measures

Although the confusion matrix provides a lot of information, there are more concise metrics available.

* __Precision__ is the accuracy of the positive predictions. 

  $precision =\frac{TP}{TP + FP}$
  
    
* __Recall__, also called sensitivity or true positive rate (TPR), is the ratio of positive instances that were correctly classified 

    $recall = \frac{TP}{TP + FN}$

* __$F_1$ Score__ is the harmonic mean of precision and recall

    $F_1 = \frac{TP}{TP + \frac{FP + FN}{2}}$
    
* __ROC AUC__ is the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5.
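As a quick arithmetic check of the formulas above, the following sketch computes the three measures from made-up confusion matrix counts and confirms that the $F_1$ expression agrees with the harmonic mean of precision and recall.

```python
# Made-up counts for illustration
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = tp / (tp + (fp + fn) / 2)

# Harmonic mean of precision and recall, written out explicitly
harmonic_mean = 2 * precision * recall / (precision + recall)
print(precision, recall, f1, harmonic_mean)
```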

In [None]:
performance_measures={}
performance_measures['precision'] = precision_score(ml_data.Y_Train,predictions)
performance_measures['recall'] = recall_score(ml_data.Y_Train,predictions)
performance_measures['f1'] = f1_score(ml_data.Y_Train, predictions)
performance_measures['roc_auc'] = roc_auc_score(ml_data.Y_Train,predictions)
pd.Series(performance_measures)

## Graph Powered ML Pipeline

### Graph Algorithms

The following code cell creates a monopartite in-memory graph named Transactions using a Cypher projection. The nodes of this in-memory graph represent clients (together with the banks and merchants they transact with), and the relationships represent the transactions among them. The weight of each relationship is the total amount transferred between the two nodes. This graph is then used to execute the Page Rank, Betweenness, Triangle Count and Closeness algorithms, and their results are stored in the Neo4j Database as properties of the nodes.

In [None]:
namedGraphQuery = '''
CALL gds.graph.create.cypher('Transactions',
	'MATCH(n) WHERE n:Client OR n:Bank OR n:Merchant RETURN id(n) AS id',
    'MATCH(c:Client)-[:PERFORMED]->(t:Transaction)-[:TO]->(n)
	RETURN id(c) AS source, id(n) AS target,sum(t.amount) AS totalAmount'
);
'''

pageRankQuery='''
CALL gds.pageRank.write('Transactions',{relationshipWeightProperty:'totalAmount',writeProperty:'pageRankScore'});
'''

betweennessQuery='''
CALL gds.betweenness.write('Transactions',{writeProperty:'betweennessScore'});
'''

triangleCountQuery='''
CALL gds.triangleCount.write('Transactions',{writeProperty:'triangleCount'})
'''

closenessQuery = '''
CALL gds.alpha.closeness.write('Transactions',{writeProperty:'closeness'})
'''

removeNamedGraph='''
CALL gds.graph.drop('Transactions')
'''

graph.runQuery(namedGraphQuery)
graph.runQuery(pageRankQuery)
graph.runQuery(betweennessQuery)
graph.runQuery(triangleCountQuery)
graph.runQuery(closenessQuery)
graph.runQuery(removeNamedGraph)
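The relationship projection above collapses all transactions between a given source and target into a single weighted edge. The pandas sketch below mimics that aggregation on toy data to show what `totalAmount` represents. It is only an illustration of the idea with hypothetical values, not a replacement for the Cypher projection.

```python
import pandas as pd

# Toy transactions (hypothetical values)
transactions = pd.DataFrame({
    'fromId': [1, 1, 2, 1],
    'toId': [10, 10, 10, 20],
    'txAmount': [100.0, 50.0, 30.0, 5.0],
})

# One edge per (source, target) pair, weighted by the total amount transferred
edges = (transactions
         .groupby(['fromId', 'toId'], as_index=False)['txAmount'].sum()
         .rename(columns={'fromId': 'source', 'toId': 'target', 'txAmount': 'totalAmount'}))
print(edges)
```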

The following code cell retrieves the results of the algorithms run previously as a Pandas Data Frame.

### Loading the Graph Features

In [None]:
graphFeaturesQuery='''
MATCH(n)
WHERE n:Client OR n:Merchant OR n:Bank
RETURN id(n) AS id,n.pageRankScore AS pageRank,n.betweennessScore AS betweenness,n.triangleCount AS triangleCount,
    n.closeness AS closeness
'''
graph.getData(graphFeaturesQuery)
graph.close()
graphFeatures = graph.data

### Combining ML Features and Graph Features.

In [None]:
allFeatures = ml_data.original_data.merge(graphFeatures,left_on='toId',right_on='id',how='inner')

In [None]:
allFeatures.head()
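The merge attaches the graph features of the destination node (`toId`) to every transaction. A small sketch on toy frames shows the effect. The values are made up, and the `validate='many_to_one'` argument is an optional safeguard added here for illustration (it is not part of the cell above): it raises an error if a node id appears more than once in the features table.

```python
import pandas as pd

# Toy frames (hypothetical values)
transactions = pd.DataFrame({'fromId': [1, 2, 3],
                             'toId': [10, 10, 20],
                             'txAmount': [100.0, 30.0, 5.0]})
node_features = pd.DataFrame({'id': [10, 20, 30],
                              'pageRank': [0.9, 0.2, 0.15]})

merged = transactions.merge(node_features, left_on='toId', right_on='id',
                            how='inner', validate='many_to_one')
print(merged)
```

With `how='inner'`, a transaction whose destination has no row in the features table would be dropped, so comparing row counts before and after the merge is a cheap sanity check.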

### Train and Test Split

In [None]:
graph_ml_data = Datasets(allFeatures.drop('id',axis=1))

### Quantitative Descriptive Statistics

In [None]:
graph_ml_data.data.describe()

### Preparing the Data for ML, Training and Obtaining Accuracy for Stochastic Gradient Descent Classifier

In [None]:
X_Prepared = data_prep_pipeline.fit_transform(graph_ml_data.X_Train_DF)
sgd_clf.fit(X_Prepared, graph_ml_data.Y_Train)
cross_val_score(sgd_clf, X_Prepared, graph_ml_data.Y_Train, cv=5, scoring="accuracy")

### Graph Powered ML Confusion Matrix

In this section we are obtaining the Graph Powered ML model Confusion Matrix and comparing it with the Traditional ML model Confusion Matrix.

In [None]:
predictions = cross_val_predict(sgd_clf,X_Prepared,graph_ml_data.Y_Train,cv=5)
graph_ml_conf_matrix = confusion_matrix(graph_ml_data.Y_Train,predictions)
print(graph_ml_conf_matrix)
print('Change')
print(graph_ml_conf_matrix - ml_conf_matrix)

As the change matrix above shows, the Graph Powered ML model correctly classified 1,322 transactions that were previously False Positives and 573 that were previously False Negatives.

### Graph Powered ML Performance Measures

In [None]:
graph_performance_measures={}
graph_performance_measures['precision'] = precision_score(graph_ml_data.Y_Train,predictions)
graph_performance_measures['recall'] = recall_score(graph_ml_data.Y_Train,predictions)
graph_performance_measures['f1'] = f1_score(graph_ml_data.Y_Train, predictions)
graph_performance_measures['roc_auc'] = roc_auc_score(graph_ml_data.Y_Train,predictions)
pd.Series(graph_performance_measures)

#### Comparing Traditional ML and Graph Powered ML Performance Measures

In [None]:
pd.concat([pd.Series(performance_measures,name='ML'),pd.Series(graph_performance_measures,name='Graph_ML')],axis=1)

### Features Weight (Importance)

#### Obtaining the names of the features

In [None]:
quantitative_features = data_prep_pipeline.get_params()['quant_pipeline'].get_params()['steps'][0][1].attribute_names
qualitative_features = data_prep_pipeline.get_params()['qual_pipeline'].get_params()['steps'][0][1].attribute_names
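all_features = list(quantitative_features)
for col in qualitative_features:
    # The one-hot columns follow the sorted category order (np.unique), so the names must use the same order
    all_features = all_features + [col + '_' + x for x in sorted(graph_ml_data.X_Train_DF[col].unique())]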
all_features

#### Associating each feature with its corresponding weight

In [None]:
sorted(zip(np.abs(sgd_clf.coef_[0]), all_features), reverse=True)

#### Regularization of the Graph Powered ML Model

In [None]:
sgd_clf = SGDClassifier(random_state=42,penalty='elasticnet')
sgd_clf.fit(X_Prepared, graph_ml_data.Y_Train)
predictions = cross_val_predict(sgd_clf,X_Prepared,graph_ml_data.Y_Train,cv=5)
graph_ml_conf_matrix - confusion_matrix(graph_ml_data.Y_Train,predictions)

In [None]:
sorted(zip(np.abs(sgd_clf.coef_[0]), all_features), reverse=True)
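To see why an elastic net penalty helps with feature selection, the sketch below fits two SGD classifiers on synthetic data and counts how many weights each one sets to exactly zero. Everything here is an assumption made for illustration: the data comes from `make_classification`, and the `alpha` and `l1_ratio` values are arbitrary rather than tuned for the PaySim features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data (assumption): 20 features, only 3 of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=42)

l2_clf = SGDClassifier(random_state=42, penalty='l2').fit(X, y)
enet_clf = SGDClassifier(random_state=42, penalty='elasticnet',
                         l1_ratio=0.5, alpha=0.01).fit(X, y)

# The L1 part of the elastic net penalty can push uninformative weights to exactly zero
l2_zeros = int(np.sum(l2_clf.coef_[0] == 0))
enet_zeros = int(np.sum(enet_clf.coef_[0] == 0))
print('Zero weights with l2:', l2_zeros)
print('Zero weights with elasticnet:', enet_zeros)
```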

### Graph Powered ML Model Test Performance Measures

In [None]:
# Reuse the transformers fitted on the training set instead of refitting them on the test set
X_Test_Prepared = data_prep_pipeline.transform(graph_ml_data.X_Test_DF)
# Predict with the model trained on the training set
predictions = sgd_clf.predict(X_Test_Prepared)
test_performance_measures={}
test_performance_measures['precision'] = precision_score(graph_ml_data.Y_Test,predictions)
test_performance_measures['recall'] = recall_score(graph_ml_data.Y_Test,predictions)
test_performance_measures['f1'] = f1_score(graph_ml_data.Y_Test, predictions)
test_performance_measures['roc_auc'] = roc_auc_score(graph_ml_data.Y_Test,predictions)
print(confusion_matrix(graph_ml_data.Y_Test,predictions))
pd.Series(test_performance_measures)

### Precision/Recall Tradeoff

The $F_1$ score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall. As the following code cell demonstrates, it is fairly easy to create a classifier with virtually any precision you want just by setting a high enough threshold. However, a high precision classifier is not very useful if its recall is too low!

In [None]:
Y_Scores = cross_val_predict(sgd_clf,X_Prepared,graph_ml_data.Y_Train,cv=5,method='decision_function')
precisions, recalls, thresholds = precision_recall_curve(graph_ml_data.Y_Train, Y_Scores)

In [None]:
fig, axes = plt.subplots(1,2,figsize=(17,10))
axes[0].plot(thresholds, precisions[:-1], "b--", label="Precision")
axes[0].plot(thresholds, recalls[:-1], "g-", label="Recall")
axes[0].set_xlabel("Threshold")
axes[0].legend(loc="best")
axes[0].set_ylim([0, 1])
axes[0].set_title('Precision & Recall Against Threshold')
axes[1].plot(recalls,precisions)
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision Against Recall')
plt.show()

In [None]:
Y_Train_Recall_98 = (Y_Scores > -1)
precision = precision_score(graph_ml_data.Y_Train,Y_Train_Recall_98)
recall = recall_score(graph_ml_data.Y_Train,Y_Train_Recall_98)
print('Precision: {0}'.format(precision))
print('Recall: {0}'.format(recall))
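The effect of moving the threshold can also be reproduced on a handful of hand-made decision scores. The sketch below uses illustrative values only, and the helper `precision_recall_at` is hypothetical. It shows precision rising and recall falling as the threshold increases.

```python
import numpy as np

# Toy labels and decision scores (hypothetical values)
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0])

def precision_recall_at(threshold):
    """Precision and recall when every score above the threshold is flagged as fraud."""
    predicted = scores > threshold
    tp = np.sum(predicted & (y_true == 1))
    fp = np.sum(predicted & (y_true == 0))
    fn = np.sum(~predicted & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

low = precision_recall_at(-1.0)
high = precision_recall_at(0.75)
print('Threshold -1.0:', low)
print('Threshold 0.75:', high)
```

With the lower threshold every fraudulent transaction is caught at the cost of one false alarm; with the higher threshold the false alarm disappears but one fraudulent transaction is missed.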