***
***
# __`Visualization of the Paysim dataset and preparation of the dataframe`__

In [None]:
# import libraries, authenticate, and create the graph entity

import pandas as pd
import numpy as np
from py2neo import Graph, Node, Relationship
from graphdatascience import GraphDataScience


url = "insert-url-here"
pwd = "insert-password-here"
gds = GraphDataScience(url, auth=('neo4j', pwd))
graph = Graph(url, auth=('neo4j' , pwd))



Let's see how the dataset initially appears in _Neo4j_.

<div align="center">
    <img src="pics/init.png" width="600" height="auto"> <center>
    <br>

We first apply the 'Weakly Connected Components' algorithm (based on the "union find" data structure). It uncovers hidden networks that make up complex fraud rings built on common identities, such as multiple applicants all residing at the same address. To do so, we check whether pairs of clients share an email, a phone number or an SSN. We first need to create a new relationship that we call "SHARED_IDENTIFIERS", so the first step is to identify the client pairs that share personal information.

In [None]:
query = """MATCH (c1:Client)-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]->(info)
<-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]-(c2:Client)
WHERE c1.id<>c2.id
WITH c1, c2, count(*) as cnt
MERGE (c1) - [:SHARED_IDENTIFIERS {count: cnt}] - (c2);"""
graph.run(query).to_data_frame()

The output shows all the relationships between the clients, identified by their ID, together with the variable freq, which represents the number of identifiers they share.

<div align="center">
    <img src="pics/frequencies.png" width="400" height="auto"> <center>
    <br>
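
As a rough illustration of what the query above computes, the following sketch counts shared identifiers for each pair of clients in plain Python. The client ids and identifier values are made-up toy data, and `shared_identifier_counts` is a hypothetical helper, not part of py2neo or Neo4j. Unlike the Cypher pattern, which matches each pair in both orders, this sketch visits each unordered pair once.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: client id -> identifiers (email, phone, SSN) attached to it.
client_identifiers = {
    "c1": {"a@example.com", "555-0100", "ssn-1"},
    "c2": {"a@example.com", "555-0100", "ssn-2"},
    "c3": {"b@example.com", "555-0199", "ssn-2"},
    "c4": {"d@example.com", "555-0142", "ssn-4"},
}


def shared_identifier_counts(identifiers_by_client):
    """Return {(client_a, client_b): number of identifiers both clients use}."""
    # Invert the mapping: identifier -> clients that use it.
    clients_by_identifier = defaultdict(set)
    for client, identifiers in identifiers_by_client.items():
        for identifier in identifiers:
            clients_by_identifier[identifier].add(client)

    # Every pair of clients attached to the same identifier shares it once.
    counts = defaultdict(int)
    for clients in clients_by_identifier.values():
        for a, b in combinations(sorted(clients), 2):
            counts[(a, b)] += 1
    return dict(counts)


pair_counts = shared_identifier_counts(client_identifiers)
for (a, b), cnt in sorted(pair_counts.items()):
    print(a, b, cnt)
```

In this toy data `c4` shares nothing with anyone, so it gets no pair at all, just as it would get no SHARED_IDENTIFIERS relationship in the graph.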

The 'SHARED_IDENTIFIERS' relationship was added, which links people (real or fictitious) who share the same identity.
From this relationship we create a temporary in-memory graph, which we need in order to apply the 'Weakly Connected Components' algorithm. In particular, this makes it possible to detect clusters (groups of connected nodes, where each group is one connected component).

In [None]:
query = """CALL gds.graph.project('WCC_GRAPH', 'Client', 
{
    SHARED_IDENTIFIERS: {
        type: 'SHARED_IDENTIFIERS',
        properties: {
            count: {
                property: 'count'
            }
        }
    }

})  YIELD graphName, nodeCount, relationshipCount;"""

wcc_graph = graph.run(query).to_data_frame()


Visualization of the "Shared Identifiers" relationship:

<div align="center">
    <img src="pics/shared_id.png" width="600" height="auto"> <center>
    <br>

Execution of the Weakly Connected Components algorithm:

In [None]:
query = """CALL gds.wcc.stream('WCC_GRAPH')
YIELD componentId, nodeId
WITH componentId AS cluster, gds.util.asNode(nodeId) AS client
WITH cluster, collect(client.id) AS clients
WITH *, size(clients) AS clusterSize
WHERE clusterSize > 1
UNWIND clients AS client
MATCH (c:Client)
WHERE c.id = client
SET c.firstPartyFraudGroup = cluster
WITH cluster, clusterSize, COLLECT(DISTINCT client) AS clientsInCluster
RETURN cluster, clusterSize, clientsInCluster;

"""
graph.run(query)

cluster,clusterSize,clientsInCluster
37,4,"['4307389536474215', '4454357737255718', '4227754063695630', '4297626800379477']"
248,2,"['4026791806639957', '4169382142859298']"
334,11,"['4047841290742877', '4777446725804270', '4689748983194261', '4284694673228754', '4309833640194449', '4316366410034140', '4735236106636727', '4187135907098538', '4361847869567817', '4910140986334626', '4114683318919154']"
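
To make the idea behind the algorithm concrete, here is a minimal union-find sketch in plain Python. It is only an illustration of the technique, not the GDS implementation; the node names and edges are invented, and `connected_components` is a hypothetical helper. As in the query above, components with a single node are discarded.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components using union-find.

    Returns only the components with more than one node.
    """
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, shortening the path along the way (path halving).
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Edge direction is ignored: that is what "weakly" connected means.
    for a, b in edges:
        union(a, b)

    components = {}
    for node in nodes:
        components.setdefault(find(node), []).append(node)
    return [sorted(members) for members in components.values() if len(members) > 1]


# Hypothetical toy graph: c6 has no SHARED_IDENTIFIERS edge.
nodes = ["c1", "c2", "c3", "c4", "c5", "c6"]
edges = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
clusters = connected_components(nodes, edges)
print(clusters)
```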


After finding the clusters we run a second algorithm, Node Similarity, which is based on the idea that two nodes are similar if they share the same neighbours. In particular, we use the Jaccard metric, also known as the Jaccard Similarity Score: the number of neighbours common to A and B (the intersection) divided by the number of distinct neighbours of A or B (the union).

<div align="center">
    <img src="pics/jaccard_score.png" width="600" height="auto"> <center>
    <br>

For example, if there are two common identifiers:

<div align="center">
    <img src="pics/example.png" width="450" height="auto"> <center>
    <br>

So there are three possible values, 0.2, 0.5 and 1.0, depending on whether one, two or three identifiers are in common.
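
The three values can be reproduced with a few lines of Python. This is a sketch under the assumption that each client has exactly three identifiers (email, phone, SSN); the identifier strings are placeholders and `jaccard` is a hypothetical helper written for this example.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Assumption: every client has exactly three identifiers.
client_a = {"email-1", "phone-1", "ssn-1"}

one_shared = jaccard(client_a, {"email-1", "phone-2", "ssn-2"})    # 1 / 5
two_shared = jaccard(client_a, {"email-1", "phone-1", "ssn-2"})    # 2 / 4
three_shared = jaccard(client_a, {"email-1", "phone-1", "ssn-1"})  # 3 / 3

print(one_shared, two_shared, three_shared)
```

With one shared identifier the union holds five distinct values, with two it holds four, and with three it holds three, which is where 0.2, 0.5 and 1.0 come from.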

After creating the "SIMILARITY" relationship, let's apply the similarity algorithm based on "Jaccard Score".

In [None]:
query="""CALL gds.nodeSimilarity.stream('WCC_GRAPH', { topK: 15 })
YIELD node1, node2, similarity
WITH gds.util.asNode(node1) AS client1, gds.util.asNode(node2) AS client2, similarity
OPTIONAL MATCH (client1)-[rel:SIMILARITY]-(client2)
RETURN client1.id AS node1, client2.id AS node2, similarity, rel.similarity AS relationshipSimilarity
ORDER BY similarity DESC
LIMIT 15;"""
similarity = graph.run(query).to_data_frame()
similarity

I also compute, for each client node, a centrality score given by the sum of the weights of all its edges. Based on this score and a threshold, the value of the Boolean variable prediction is set.
A shared_identifiers column is also added, holding the maximum number of shared identifiers for each client node.

In [None]:
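import pandas as pd

# load the input table
df = pd.read_csv('dataframe.csv')

# for each client: sum the scores (centrality) and keep the largest identifier count
result_df = df.groupby('client1').agg({'score': 'sum', 'count': 'max'}).reset_index()
result_df = result_df.rename(columns={'score': 'centrality', 'count': 'shared_identifiers'})

# overwrite the CSV with the per-client table
result_df.to_csv('dataframe.csv', index=False)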


The dataframe that I will use for the machine learning phase looks like this:

<div align="center">
    <img src="pics/final_df.png" width="450" height="auto"> <center>
    <br>

But first, we have to balance the dataframe, as is appropriate in a classification problem.

At this stage, a threshold value of 2.3 (the 80th percentile) is also set: a client whose centrality is at or above it is considered fraudulent. We thus have two classes: positive and negative.

In [None]:

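# reload the per-client table saved in the previous step
result_df = pd.read_csv('dataframe.csv')

# label a client as fraudulent (1) when its centrality reaches the 2.3 threshold
result_df['prediction'] = (result_df['centrality'] >= 2.3).astype(int)

result_df.to_csv('dataframe.csv', index=False)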

<div align="center">
    <img src="pics/classes.png" width="550" height="auto"> <center>
    <br>
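
For readers who want to see how a percentile-based threshold of this kind could be derived, here is a small sketch. The centrality values are invented for the example, and the nearest-rank definition of the percentile is an assumption; other definitions (for instance with interpolation) can give slightly different cut-offs.

```python
import math


def percentile_nearest_rank(values, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical centrality scores, one per client.
centrality = [0.4, 0.9, 1.1, 1.5, 1.8, 2.0, 2.1, 2.3, 2.6, 3.0]

threshold = percentile_nearest_rank(centrality, 80)
labels = [1 if value >= threshold else 0 for value in centrality]

print(threshold, labels)
```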

We therefore choose 1.7 as the cut-off: rows with a centrality below this value are dropped, because keeping them would leave the classes strongly unbalanced (271 positive against 65 negative).

***
***
# __`Machine Learning`__

We use the _Extreme Gradient Boosting_ (XGBoost) algorithm. It is an iterative model based on decision trees in which each iteration tries to minimise an objective function. As the objective function decreases, the loss decreases and the quality of the predictions improves.

<div align="center">
    <img src="pics/flowchart.png" width="550" height="auto"> <center>
    <br>

Flow chart of the algorithm.

The objective function is the sum of two terms:<br>
-Loss term:<br>
        Measures how much the model's predictions deviate from the actual values. For regression, a common choice is the mean squared error (L2 loss). In our case (a binary classification model) the log loss is appropriate.<br>

-Regularisation term (penalisation):<br>
        Penalises model complexity in order to limit overfitting. It depends on parameters such as the number of leaves.

<div align="center">
    <img src="pics/obj.png" width="550" height="auto"> <center>
    <br>
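
The two terms can be written out in a few lines of Python. This is a sketch rather than XGBoost's internal code: the penalty follows the commonly cited form `gamma * T + 0.5 * lam * sum(w^2)`, where `T` is the number of leaves and `w` are the leaf weights, and the names `gamma`, `lam`, `tree_penalty` and `objective` are chosen for this example. The L2 term on the leaf weights goes beyond what the text above mentions.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Summed binary log loss over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


def tree_penalty(leaf_weights, gamma=0.0, lam=1.0):
    """Complexity penalty of one tree: number of leaves plus L2 on leaf weights."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true, y_prob, trees, gamma=0.0, lam=1.0):
    """Loss term plus the penalty of every tree in the ensemble."""
    return log_loss(y_true, y_prob) + sum(tree_penalty(t, gamma, lam) for t in trees)


# Toy example: two samples and one tree with two leaves.
value = objective([1, 0], [0.9, 0.1], trees=[[0.5, -0.5]], gamma=0.1, lam=1.0)
print(round(value, 4))
```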

I also use cross-validation to check how well the model generalises. In particular, I use k = 4 folds: there are 4 iterations, and in each one a different fold is used as the test set while the other k-1 folds are used as the training set.

<div align="center">
    <img src="pics/crossval.png" width="550" height="auto"> <center>
    <br>
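
The fold rotation can be sketched without any library. Note that this is plain, unstratified k-fold on row indices, written only to show the mechanics; it does not preserve class proportions and does not shuffle, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 4))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each row index lands in the test set of exactly one fold and in the training set of the other three.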

Metrics: Accuracy, Precision, Recall.

<div align="center">
    <img src="pics/metrics.png" width="550" height="auto"> <center>
    <br>
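
As a quick reference, the three metrics can be computed directly from the counts of true/false positives and negatives. The labels below are invented for the example, and `classification_metrics` is a hypothetical helper; returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: 4 real positives, 5 predicted positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```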

Code and results:

In [None]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import cross_validate, StratifiedKFold, train_test_split
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
import shap

df = pd.read_csv('df.csv')
df = df.drop(df[df['centrality'] < 1.7].index)

#standardization of the features
scaler = StandardScaler()
numerical_features = ['shared_identifiers', 'centrality']
df[numerical_features] = scaler.fit_transform(df[numerical_features])

#extraction of the features and the target
X = df.drop(['client', 'prediction'], axis=1)
y = df['prediction']

#split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

#XGBoost model
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=3,
    eval_metric='logloss',
    use_label_encoder=False
)

#metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score), 
    'f1' : make_scorer(f1_score)
}

#cross-validation with 4 folds
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
results = cross_validate(model, X_train, y_train, cv=cv, scoring=scoring)

#average scores
average_accuracy = results['test_accuracy'].mean()
average_precision = results['test_precision'].mean()
average_recall = results['test_recall'].mean()


#results
print(f'Test Accuracy: {average_accuracy:.4f}')
print(f'Test Precision: {average_precision:.4f}')
print(f'Test Recall: {average_recall:.4f}')



<div align="left">
    <img src="pics/scores.png" width="450" height="auto"> <center>
    <br>