<a href="https://colab.research.google.com/github/guerinjeanmarc/FraudWorkshop/blob/main/P2p_Fraud_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Fraud Detection With Neo4j & Graph Data Science

This analysis uses [Neo4j and Graph Data Science (GDS)](https://neo4j.com/docs/graph-data-science/current/) to explore an anonymized data sample from a Peer-to-Peer (P2P) payment platform.  The notebook is split up into the following sections to cover various stages of the graph data science workflow:

- Notebook Setup
- Part 1: Exploring Connected Fraud Data
- Part 2: Resolving Fraud Communities using Entity Resolution and Community Detection
- Part 3: Recommending Suspicious Accounts With Centrality & Node Similarity
- Part 4: Predicting Fraud Risk Accounts with Machine Learning

Original Source: https://github.com/neo4j-product-examples/demo-fraud-detection-with-p2p

## Notebook Setup <a name="p0"></a>

In [None]:
# additional dependencies 
# !pip install graphdatascience 
# !pip install scikit-learn==1.1.1 

In [None]:
import pandas as pd
import os 
pd.set_option('display.width', 0)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 50)

### Neo4j Settings

In [None]:
# HOST = 'neo4j://localhost'
# USERNAME = 'neo4j'
# PASSWORD = 'password'
# DATABASE = 'p2p'

In [None]:
# replace XX with your attendee number and enter your password
HOST = 'neo4j+s://neo4j-sky.graphdatabase.ninja:443'
USERNAME = 'attendeeXX'
PASSWORD = 'your_password_here'
DATABASE = 'p2pXX'

In [None]:
# import os
# from dotenv import load_dotenv
# load_dotenv('credentials.env')

### Connect to Graph Data Science

In [None]:
from graphdatascience import GraphDataScience

# Use Neo4j URI and credentials according to your setup
gds = GraphDataScience(HOST, auth=(USERNAME, PASSWORD), aura_ds=False)
gds.set_database(DATABASE)

## Part 1: Exploring Connected Fraud Data

Neo4j Browser is available at https://neo4j-sky.graphdatabase.ninja/browser/

Bloom is available at https://neo4j-sky.graphdatabase.ninja/bloom

### Dataset Introduction
***
<span style="color:green"> _Below is a query you can use to visualize the graph schema in Neo4j Browser_ </span> 

`CALL db.schema.visualization`
***
We will be using an anonymized sample of user accounts and transactions from a real-world Peer-to-Peer (P2P) platform. Prior to ingesting the data into the graph, the original identification numbers were removed and categorical values were masked. Each user account has a unique 128-bit identifier, while the other nodes, representing unique credit cards, devices, and IP addresses, have been assigned random UUIDs. These identifiers are stored as the `guid` property in the graph schema.

Each user node has a property to indicate money transfer fraud (named `fraudMoneyTransfer`) that is 1 for known fraud and 0 otherwise. This indicator is determined by a combination of credit card chargeback events and manual review. A chargeback is an action taken by a bank to reverse electronic payments. It involves reversing a payment and triggering a dispute resolution process, often for billing errors and unauthorized credit use. In short, a user must have at least one chargeback to be considered fraudulent. Only a small proportion of the user accounts, roughly 0.7%, are flagged for fraud.

Below is a breakdown of high-level counts by labels and relationships as well as the flagged accounts.

In [None]:
# total node counts
gds.run_cypher('''
    CALL apoc.meta.stats()
    YIELD labels
    UNWIND keys(labels) AS nodeLabel
    RETURN nodeLabel, labels[nodeLabel] AS nodeCount
''')

In [None]:
# total relationship counts
gds.run_cypher('''
    CALL apoc.meta.stats()
    YIELD relTypesCount
    UNWIND keys(relTypesCount) AS relationshipType
    RETURN relationshipType, relTypesCount[relationshipType] AS relationshipCount
''')

In [None]:
#fraud money transfer flags
gds.run_cypher('MATCH(u:User) RETURN u.fraudMoneyTransfer AS fraudMoneyTransfer, count(u) AS cnt')

### A Closer Look at Cards and Devices

As a first step, we break out the ratio of fraud vs. non-fraud users connected to credit cards and devices below. We find that very few cards or devices are used mostly by flagged accounts. This is a bit surprising: if a card or device is used by a flagged account, we would expect most of the other accounts that use the same card or device to be flagged for fraud as well.

In [None]:
# Setting a label for flagged users will enable faster lookups in cypher and faster gds projections
gds.run_cypher('MATCH(u:User) WHERE u.fraudMoneyTransfer=1 SET u:FlaggedUser RETURN count(u)')

To evaluate the connectivity of cards and devices to multiple Users and FlaggedUsers, we project the relevant nodes and relationships into an in-memory graph, and then use [Degree Centrality](https://neo4j.com/docs/graph-data-science/2.1/algorithms/degree-centrality/) with the GDS library to count the incoming and outgoing relationships. `writeNodeProperties` ([Node Operations Docs](https://neo4j.com/docs/graph-data-science/current/graph-catalog-node-ops/)) captures the results and writes them back to the database as properties on the specified node types.
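
As a rough illustration of what the REVERSE-oriented degree measures, the sketch below counts user relationships per identifier on a toy edge list in plain Python, then derives the flagged-to-total ratio that a later cell stores as `flaggedRatio`. It does not call GDS, and all user and identifier names are made up for the example.

```python
from collections import Counter

# Toy (user, relationship type, identifier) edges; all names are hypothetical.
edges = [
    ("u1", "HAS_CC", "card_a"),
    ("u2", "HAS_CC", "card_a"),
    ("u2", "USED", "device_x"),
    ("u3", "USED", "device_x"),
    ("u3", "HAS_IP", "ip_1"),
]
flagged_users = {"u2"}

# In the stored graph, relationships point from users to identifiers. Counting
# them per identifier is what out-degree on a REVERSE projection measures.
degree = Counter(identifier for _, _, identifier in edges)

# Restricting the source nodes to flagged users mirrors the FlaggedUser projection.
flagged_degree = Counter(
    identifier for user, _, identifier in edges if user in flagged_users
)

flagged_ratio = {i: flagged_degree[i] / degree[i] for i in degree}
print(flagged_ratio)
```

Reversing the orientation matters because a default degree on the natural direction would be counted on the user side, whereas here the count is needed on the card, device, or IP side.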

In [None]:
# GDS degree centrality to count the number of Users connected to each identifier type - Card, Device, IP
g, _  = gds.graph.project('id-projection', ['User', 'Card', 'Device', 'IP'],{
        'HAS_CC': {'orientation': 'REVERSE'},
        'HAS_IP': {'orientation': 'REVERSE'},
        'USED': {'orientation': 'REVERSE'}
    })
gds.degree.mutate(g, mutateProperty='degree')
gds.graph.writeNodeProperties(g, ['degree'], ['Card', 'Device', 'IP'])
g.drop()

In [None]:
# GDS degree centrality to count the number of Flagged Users connected to each identifier type - Card, Device, IP
g, _  = gds.graph.project('id-projection', ['FlaggedUser', 'Card', 'Device', 'IP'],{
        'HAS_CC': {'orientation': 'REVERSE'},
        'HAS_IP': {'orientation': 'REVERSE'},
        'USED': {'orientation': 'REVERSE'}
    })
gds.degree.mutate(g, mutateProperty='flaggedDegree')
gds.graph.writeNodeProperties(g, ['flaggedDegree'], ['Card', 'Device', 'IP'])
g.drop()

In [None]:
# Calculate the ratio of flagged users to total users
gds.run_cypher('''
    MATCH(n) WHERE n:Card OR n:Device OR n:IP
    SET n.flaggedRatio = toFloat(n.flaggedDegree)/toFloat(n.degree)
''')

*** 
<span style="color:green"> _Run these queries to view some of the results in Neo4j Browser_ </span> 


Review the results

```cypher
MATCH (n:Card) 
RETURN n.guid AS cardId, n.degree AS degree, n.flaggedDegree AS flaggedDegree, n.flaggedRatio AS flaggedUserRatio
ORDER BY degree DESC LIMIT 10
```

View use of a single card by flagged and un-flagged users

```cypher 
MATCH p=(n:Card {guid: "4f694da1-efe6-4087-bfd4-4c28b642b45f"})-[]-(u:User)
RETURN p
```
***

<br>

Below we can calculate the ratio of flagged to unflagged users. This is where we find that very few cards or devices are used mostly by flagged accounts (for example, only 31 Cards are used _only_ by FlaggedUsers).
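
The bucketing logic of these queries can be sketched in plain Python as follows. The `(degree, flaggedRatio)` pairs are invented for illustration; only the three categories and the `degree > 1` filter are taken from the notebook's queries.

```python
from collections import Counter

# Hypothetical (degree, flaggedRatio) pairs for a handful of cards.
cards = [(1, 1.0), (2, 0.0), (3, 0.0), (2, 0.5), (4, 1.0), (5, 0.0)]

def bucket(ratio):
    """Map a flagged ratio onto the three categories used in the query."""
    if ratio == 0:
        return "0"
    if ratio == 1:
        return "1"
    return "Between 0-1"

# Only identifiers shared by more than one user are considered.
shared = [ratio for degree, ratio in cards if degree > 1]
counts = Counter(bucket(r) for r in shared)
percent = {b: round(c / len(shared), 3) for b, c in counts.items()}
print(counts, percent)
```

Identifiers with a single user are excluded because their ratio is trivially 0 or 1 and says nothing about sharing between accounts.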

In [None]:
print('Flagged User Ratio for Card Count')
gds.run_cypher('''
    MATCH(n:Card) WHERE n.degree > 1
    WITH toFloat(count(n)) AS total
    MATCH(n:Card) WHERE n.degree > 1
    WITH n, total, CASE
        WHEN n.flaggedRatio=0 THEN '0'
        WHEN n.flaggedRatio=1  THEN '1'
        ELSE 'Between 0-1' END AS flaggedUserRatio
    RETURN flaggedUserRatio, count(n) AS count, round(toFloat(count(n))/total,3) AS percentCount, total as totalUsers
    ORDER BY flaggedUserRatio
''')

In [None]:
print('Flagged User Ratio for Device Count')
gds.run_cypher('''
    MATCH(n:Device) WHERE n.degree > 1
    WITH toFloat(count(n)) AS total
    MATCH(n:Device) WHERE n.degree > 1
    WITH n, total, CASE
        WHEN n.flaggedRatio=0 THEN '0'
        WHEN n.flaggedRatio=1  THEN '1'
        ELSE 'Between 0-1' END AS flaggedUserRatio 
    RETURN flaggedUserRatio, count(n) as count, round(toFloat(count(n))/total,3) AS percentCount, total as totalUsers 
    ORDER BY flaggedUserRatio
''')

## Part 2: Resolving Fraud Communities using Entity Resolution and Community Detection

#### Exploring Potential Fraud Patterns with Community Detection

Given the mixed usage of devices by flagged and unflagged users we discovered above, we suspect fraud activity is not completely labeled. This may be because of the limited chargeback logic used to flag fraud. At the same time, we do not want to simply label every user that shares a card or device with another flagged account, since it is possible that a benign user's device and/or card was used fraudulently by another. Since fraudsters are actively avoiding being identified, actors committing fraud are often not just represented by a single user account, but rather by multiple accounts and identifiers which, hopefully for us, share some connections and similarities.

<br>

With graphs, we can attempt to roughly identify these fragmented identities with **Community Detection**, a large set of methods that attempt to partition graphs into well-connected groups, a.k.a. communities, where the connectivity within a community is significantly higher than the connectivity to the rest of the graph.

<br>
In fraud detection we often need to identify communities that reflect underlying groups of individuals. Due to expectations of auditability, it is important the process be explainable. In this section we will follow a scalable and explainable process to identify additional fraud risk users. 

* Define Entity Resolution (ER) rules to link Users who may be the same individual with derived relationships
* Use the deterministic and explainable algorithm [Weakly Connected Components (WCC)](https://neo4j.com/docs/graph-data-science/current/algorithms/wcc/) to resolve the communities based on our derived relationships 
* Label all users in communities that include flagged accounts as fraud risks.
<br>

### Entity Resolution Business Rules

We will now use Entity Resolution (ER) to resolve groups of individuals behind sets of user accounts. For this analysis, we will use some pretty straightforward ER business logic. If either of the two below conditions are true, we will resolve two user accounts by linking them together with a new relationship type.

1. One user sent money to another user that shares the same credit card
2. Two users share a card or device connected to less than or equal to 10 total accounts, and those two users also share at least two other identifiers of type credit card, device, or IP address
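
The two rules above can be sketched on toy data in plain Python. This is only an illustration of the logic under simplifying assumptions: the users, identifiers, and degrees are hypothetical, and "at least two other identifiers" is expressed as more than two shared identifiers in total.

```python
from itertools import combinations

# Toy data; every name below is hypothetical.
cards = {"u1": {"c1"}, "u2": {"c1"}, "u3": {"c2", "c3"}, "u4": {"c2", "c3"}}
devices = {"u3": {"d1"}, "u4": {"d1"}}
ips = {"u1": {"ip1"}, "u2": {"ip1"}, "u3": {"ip9"}, "u4": {"ip9"}}
p2p = {("u1", "u2")}  # (sender, receiver)
degree = {"c1": 2, "c2": 2, "c3": 2, "d1": 2}  # users per card/device

def ids(table, user):
    return table.get(user, set())

# Rule 1: a P2P transfer between two users who share a credit card.
rule1 = {(a, b) for a, b in p2p if ids(cards, a) & ids(cards, b)}

# Rule 2: the pair shares a card or device with degree <= 10 and
# more than two identifiers (card, device, or IP) in total.
rule2 = set()
users = sorted(set(cards) | set(devices) | set(ips))
for a, b in combinations(users, 2):
    shared_cd = (ids(cards, a) & ids(cards, b)) | (ids(devices, a) & ids(devices, b))
    shared_all = shared_cd | (ids(ips, a) & ids(ips, b))
    if any(degree[i] <= 10 for i in shared_cd) and len(shared_all) > 2:
        rule2.add((a, b))

print(rule1, rule2)
```

Here `u1` and `u2` are linked by rule 1 only (they share one card and one IP, which is too little for rule 2), while `u3` and `u4` are linked by rule 2.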

You could switch out or add different rules to the above; these are just examples. In a real-world scenario these business rules would be reviewed by subject-matter experts (SMEs) and possibly be backed by further supervised machine learning on manually labeled data. More advanced techniques for this type of ER are possible in graph and we describe them in [this whitepaper](https://neo4j.com/whitepapers/graph-data-science-use-cases-entity-resolution/) and [this blog](https://neo4j.com/developer-blog/exploring-supervised-entity-resolution-in-neo4j/).

For a P2P dataset, we do not necessarily want to label all senders/receivers of flagged user transactions as fraudulent since some fraud schemes involve transactions with victims. Furthermore, additional identifiers such as IP may be inexact and cards + devices can be fraudulently controlled/used without the owner's permission. Hence we use somewhat stringent rules that align with the patterns noted in Part 1. We can apply relationships to reflect these business rules using Cypher:


In [None]:
# P2P with shared card rule
gds.run_cypher('''
    MATCH (u1:User)-[r:P2P]->(u2)
    WITH u1, u2, count(r) AS cnt
    MATCH (u1)-[:HAS_CC]->(n)<-[:HAS_CC]-(u2)
    WITH u1, u2, count(DISTINCT n) AS cnt
    MERGE(u1)-[s:P2P_WITH_SHARED_CARD]->(u2)
    RETURN count(DISTINCT s) AS cnt
''')

In [None]:
# Shared ids rule
gds.run_cypher('''
    MATCH (u1:User)-[:HAS_CC|USED]->(n)<-[:HAS_CC|USED]-(u2)
    WHERE n.degree <= 10 AND id(u1) < id(u2)
    WITH u1, u2, count(DISTINCT n) as cnt
    MATCH (u1)-[:HAS_CC|USED|HAS_IP]->(m)<-[:HAS_CC|USED|HAS_IP]-(u2)
    WITH u1, u2, count(DISTINCT m) as cnt
    WHERE cnt > 2
    MERGE(u1)-[s:SHARED_IDS]->(u2)
    RETURN count(DISTINCT s)
''')

### Using Weakly Connected Components (WCC) to Resolve Communities

[Weakly Connected Components (WCC)](https://neo4j.com/docs/graph-data-science/current/algorithms/wcc/) is a practical and highly scalable community detection algorithm. It is also deterministic and very explainable. It defines a community simply as a set of nodes connected by a subset of relationship types in the graph. This makes WCC a good choice for formal community assignment in production fraud detection settings.
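
To show why WCC is deterministic and easy to explain, here is a minimal union-find sketch in plain Python. It is a teaching example, not the GDS implementation, and the users and links are hypothetical stand-ins for the `SHARED_IDS` and `P2P_WITH_SHARED_CARD` relationships.

```python
def weakly_connected_components(nodes, edges):
    """Assign each node a component id using union-find; edge direction is ignored."""
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller root so the result does not depend on edge order.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    return {n: find(n) for n in nodes}

# Hypothetical users linked by derived ER relationships.
users = ["u1", "u2", "u3", "u4", "u5"]
links = [("u1", "u2"), ("u3", "u2"), ("u4", "u5")]
wcc = weakly_connected_components(users, links)
print(wcc)
```

Two users end up with the same component id exactly when a chain of derived relationships connects them, which is the property an auditor can verify by following the links.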

Below we run WCC on a graph projection consisting of Users and the ER relationships created above, then write out the resulting community IDs as `wccId`.


In [None]:
g, _ = gds.graph.project('comm-projection', ['User'], {
    'SHARED_IDS': {'orientation': 'UNDIRECTED'},
    'P2P_WITH_SHARED_CARD': {'orientation': 'UNDIRECTED'}
})

df = gds.wcc.write(g, writeProperty='wccId')
g.drop()
# df

### Labeling Fraud Risk User Accounts

As these communities are meant to label underlying groups of individuals, if even one flagged account is in the community, we will label all user accounts in the group as fraud risks:
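
On toy data, this labeling rule amounts to the following plain-Python sketch. The community ids and flags are hypothetical; the notebook itself applies the rule with the Cypher statements in the next cells.

```python
# Hypothetical WCC output and original chargeback-based flags.
wcc_id = {"u1": 0, "u2": 0, "u3": 0, "u4": 1, "u5": 1, "u6": 2}
flagged = {"u2", "u6"}

# Communities containing at least one flagged account.
flagged_communities = {wcc_id[u] for u in flagged}

# Every member of such a community becomes a fraud risk (1), all others 0.
fraud_risk = {u: int(c in flagged_communities) for u, c in wcc_id.items()}

newly_identified = sorted(u for u, r in fraud_risk.items() if r and u not in flagged)
print(fraud_risk, newly_identified)
```

In this example `u1` and `u3` become fraud risks because they share a community with the flagged `u2`, while the flagged `u6` stays alone in a single-user community.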


In [None]:
gds.run_cypher('''
    MATCH (f:FlaggedUser)
    WITH collect(DISTINCT f.wccId) AS flaggedCommunities
    MATCH(u:User) WHERE u.wccId IN flaggedCommunities
    SET u:FraudRiskUser
    SET u.fraudRisk=1
    RETURN count(u)
''')

In [None]:
gds.run_cypher('''
    MATCH (u:User) WHERE NOT u:FraudRiskUser
    SET u.fraudRisk=0
    RETURN count(u)
''')

### WCC Community Statistics

The breakdown of communities by size is listed below. The majority are single-user communities. Only a small portion have multiple users, and of those, community sizes are mostly 2 and 3. Larger communities are rare. However, if we look at the `fraudUserCount` column, we will see that the majority of fraud risk accounts reside in multi-user communities. The 118 fraud accounts in single-user communities are flagged users (via the original chargeback logic) that have yet to be resolved to a community.

In [None]:
gds.run_cypher( '''
    MATCH (u:User)
    WITH u.wccId AS community, count(u) AS cSize, sum(u.fraudRisk) AS cFraudSize
    WITH community, cSize, cFraudSize,
    CASE
        WHEN cSize=1 THEN ' 1'
        WHEN cSize=2 THEN ' 2'
        WHEN cSize=3 THEN ' 3'
        WHEN cSize>3 AND cSize<=10 THEN ' 4-10'
        WHEN cSize>10 AND cSize<=50 THEN '11-50'
        WHEN cSize>50 THEN '>50' END AS componentSize
    RETURN componentSize, 
        count(*) AS numberOfComponents, 
        sum(cSize) AS totalUserCount, 
        sum(cFraudSize) AS fraudUserCount 
    ORDER BY componentSize
''')

***
<span style="color:green"> To view individual communities in Neo4j Browser </span>

First, get a list of "interesting" communities that contain a mix of labeled fraud Users (`fraudMoneyTransfer=1`), and fraud risk Users inferred from the WCC community detection labeling. 
```cypher
MATCH (u:User)
WITH u.wccId AS community, count(u) AS cSize, sum(u.fraudMoneyTransfer) as cLabeledFraud
WHERE (cSize<>cLabeledFraud) AND cLabeledFraud>0
RETURN community as communityId, cSize as communitySize, cLabeledFraud as labeledFraudUsers ORDER BY communitySize DESC
```

Next, choose a communityId and set it as a parameter - for example community 2153
```cypher 
:param id=>2153;
```

Finally, copy-paste the following query into your browser cell 
<span style="color:red"> ** note, delete the backslash before the parameter id! This is just a markdown formatting workaround.</span>
```cypher
MATCH(u1:User{wccId: \$id})-[r1:HAS_CC|USED]->(n)<-[r2:HAS_CC|USED]-(u2:User{wccId: $id})
WITH *
OPTIONAL MATCH (u1)-[r3:P2P]-(u2)
RETURN *
```


Overall, you will notice a high degree of overlapping connectivity of identifiers and P2P transactions between users, which we should expect given our ER rules.
*** 

### Outcomes of Fraud Risk Labeling
Fraud risk labeling identified 211 new fraud risk user accounts, nearly doubling the number of known fraud users (an 87.5% increase). We also see that 65% of the money flowing between previously flagged accounts and other users can be attributed to the newly identified risk accounts:
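
As a sketch of how such a share is computed, the plain-Python example below takes the transfers between a flagged account and a non-flagged counterpart and asks what fraction of that amount involves a newly identified risk account. The transfers and amounts are invented and do not reproduce the notebook's figures.

```python
# Hypothetical P2P transfers: (sender, receiver, totalAmount).
transfers = [
    ("f1", "r1", 200.0),   # flagged -> newly identified risk account
    ("r2", "f1", 400.0),   # newly identified risk account -> flagged
    ("f2", "n1", 400.0),   # flagged -> unlabeled user
    ("n2", "n3", 900.0),   # no flagged account involved
]
flagged = {"f1", "f2"}
fraud_risk = {"f1", "f2", "r1", "r2"}  # flagged plus WCC-inferred accounts

def counterpart(sender, receiver):
    """Return the non-flagged side of a transfer touching exactly one flagged user."""
    if sender in flagged and receiver not in flagged:
        return receiver
    if receiver in flagged and sender not in flagged:
        return sender
    return None

total = sum(amt for s, r, amt in transfers if counterpart(s, r))
risk = sum(amt for s, r, amt in transfers if counterpart(s, r) in fraud_risk)
print(round(risk / total, 3))
```

The direction of the transfer is ignored on purpose, matching the undirected `P2P` pattern in the Cypher query.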

In [None]:
gds.run_cypher('''
   MATCH (:FlaggedUser)-[r:P2P]-(u)  WHERE NOT u:FlaggedUser
   WITH toFloat(sum(r.totalAmount)) AS p2pTotal
   MATCH (u:FraudRiskUser)-[r:P2P]-(:FlaggedUser) WHERE NOT u:FlaggedUser
   WITH p2pTotal,  toFloat(sum(r.totalAmount)) AS fraudRiskP2pTotal
   RETURN round((fraudRiskP2pTotal)/p2pTotal,3) AS p
''').p[0]

Additionally, while the 211 newly identified accounts represent less than 1% of total users in the sample, 12.7% of the total P2P amount in the sample involved the newly identified accounts as senders or receivers:

In [None]:
gds.run_cypher('''
   MATCH (:User)-[r:P2P]->()
   WITH toFloat(sum(r.totalAmount)) AS p2pTotal
   MATCH (u:FraudRiskUser)-[r:P2P]-() WHERE NOT u:FlaggedUser
   WITH p2pTotal, toFloat(sum(r.totalAmount)) AS fraudRiskP2pTotal
   RETURN round((fraudRiskP2pTotal)/p2pTotal,3) AS p
''').p[0]

Finally, we can see an improvement in card and device discrimination with many more cards and devices being used by fraud risk accounts exclusively.

In [None]:
# GDS degree centrality to count the number of FraudRisk Users connected to each identifier type - Card, Device, IP
g, _  = gds.graph.project('id-projection', ['FraudRiskUser', 'Card', 'Device', 'IP'],{
        'HAS_CC': {'orientation': 'REVERSE'},
        'HAS_IP': {'orientation': 'REVERSE'},
        'USED': {'orientation': 'REVERSE'}
    })
gds.degree.mutate(g, mutateProperty='fraudRiskDegree')
gds.graph.writeNodeProperties(g, ['fraudRiskDegree'], ['Card', 'Device', 'IP'])
g.drop()

In [None]:
gds.run_cypher('''
    MATCH(n) WHERE n:Card OR n:Device OR n:IP
    SET n.fraudRiskRatio = toFloat(n.fraudRiskDegree)/toFloat(n.degree)
''')

Below we recalculate the ratio, this time for fraud risk users rather than flagged users. We now see an increase in cards/devices used only by FraudRiskUsers (from 31 to 351 for Cards).

In [None]:
gds.run_cypher('''
    MATCH(n:Card) WHERE n.degree > 1
    WITH toFloat(count(n)) AS total
    MATCH(n:Card) WHERE n.degree > 1
    WITH n, total, CASE
        WHEN n.fraudRiskRatio=0 THEN '0'
        WHEN n.fraudRiskRatio=1  THEN '1'
        ELSE 'Between 0-1' END AS fraudRiskRatio
    WITH fraudRiskRatio, n, total
    RETURN fraudRiskRatio, count(n) AS count, round(toFloat(count(n))/total,3) AS percentCount, total as totalUsers
    ORDER BY fraudRiskRatio
''')

In [None]:
gds.run_cypher('''
    MATCH(n:Device) WHERE n.degree > 1
    WITH toFloat(count(n)) AS total
    MATCH(n:Device) WHERE n.degree > 1
    WITH n, total, CASE
        WHEN n.fraudRiskRatio=0 THEN '0'
        WHEN n.fraudRiskRatio=1  THEN '1'
        ELSE 'Between 0-1' END AS fraudRiskRatio
    RETURN fraudRiskRatio, count(n) AS count, round(toFloat(count(n))/total,3) AS percentCount, total as totalUsers 
    ORDER BY fraudRiskRatio
''')

The aggregate P2P statistics combined with improvements in Card and Device metrics are significant given the limited scope of the previously flagged fraud which focused on chargebacks.  These results strongly imply that there are more sophisticated networks of fraudulent money flows behind the chargebacks rather than the chargebacks being isolated occurrences.

## Part 3: Recommending Suspicious Accounts With Centrality & Node Similarity

In parts 1 & 2 we explored the graph and identified high risk fraud communities. At this stage, we may want to expand beyond our business logic to automatically identify other users that are suspiciously similar to the fraud risks already identified. Neo4j and GDS make it simple to triage and recommend such suspect users for further investigation in a matter of seconds. We can leverage both centrality and similarity algorithms for this.


### Using Weighted Degree Centrality to Recommend Potential High Risk Accounts

We can quickly and easily generate a ranked list of suspicious user accounts with weighted degree centrality. Specifically, we can calculate the degree centrality of users with respect to their identifiers (Devices, Cards, and IPs), weighted by the `fraudRiskRatio` values we computed in Part 2. In this case, a simple Cypher query suffices.


In [None]:
gds.run_cypher('''
    MATCH(f:FraudRiskUser)-[:HAS_CC|HAS_IP|USED]->(n)
    WITH DISTINCT n
    MATCH(u:User)-[:HAS_CC|HAS_IP|USED]->(n) WHERE NOT u:FraudRiskUser
    WITH left(u.guid,8) as uid,
        sum(n.fraudRiskRatio) AS totalIdFraudRisk,
        count(n) AS numberFraudRiskIds
    WITH uid, totalIdFraudRisk,
        numberFraudRiskIds,
        totalIdFraudRisk/toFloat(numberFraudRiskIds) AS averageFraudIdRisk
    WHERE averageFraudIdRisk >= 0.25
    RETURN uid, totalIdFraudRisk, numberFraudRiskIds, averageFraudIdRisk
    ORDER BY totalIdFraudRisk DESC LIMIT 10
''')

Users in the above result list are sorted by how much identifying information they share with previously labeled fraud risks, the ones with the most being at the top. Technically speaking, these users are ranked by their total ID fraud risk, which is equal to the sum of the `fraudRiskRatio` values of the identifiers they are connected to. In the query we also implement a lower limit on the average fraud risk to avoid users that just have a lot of high-degree identifiers (likely proxy IP addresses shared by only a small fraction of fraud risk users). This sort of filtering can be tweaked by use case to get the right balance between total risk and average risk.
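
The ranking and the average-risk filter can be sketched in plain Python as follows. The identifiers, ratios, and users are hypothetical; only the 0.25 threshold and the sort by total risk come from the query above.

```python
# Hypothetical identifiers with their fraudRiskRatio from Part 2.
fraud_risk_ratio = {"card_a": 1.0, "device_x": 0.5, "ip_1": 0.02, "ip_2": 0.01}

# Identifiers each non-risk user shares with at least one fraud risk user.
user_ids = {
    "u1": ["card_a", "device_x"],
    "u2": ["ip_1", "ip_2"],
    "u3": ["device_x", "ip_1"],
}

MIN_AVERAGE = 0.25  # same threshold as the query above

ranking = []
for user, ids in user_ids.items():
    total = sum(fraud_risk_ratio[i] for i in ids)
    average = total / len(ids)
    if average >= MIN_AVERAGE:
        ranking.append((user, total, len(ids), average))

# Highest total identifier risk first.
ranking.sort(key=lambda row: row[1], reverse=True)
print(ranking)
```

In this toy case `u2` is dropped: it only shares busy, low-ratio IP addresses, so its average risk falls below the threshold even though it shares two identifiers.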

In a real-world fraud detection use case, these results can be triaged by analysts to label more fraud accounts and grow labeled fraud communities.

### Using Node Similarity to Expand on Fraud Communities

Simple calculations like weighted degree centrality work well for identifying suspicious users over the whole graph, but what if we are interested in how users are related to a specific fraud risk community or set of communities? Perhaps we hypothesize that communities of fraud risk users are actually bigger than currently represented but we don't have exact business rules to apply. We can leverage similarity algorithms to help us score and recommend users for this.

GDS offers multiple algorithms for similarity. In this analysis we will focus on the aptly named  [Node Similarity](https://neo4j.com/docs/graph-data-science/current/algorithms/node-similarity/) algorithm. Node similarity parallelizes well and is explainable. It identifies pairs of similar nodes based on a straightforward Jaccard similarity calculation. So while other ML-based similarity approaches like FastRP + KNN covered [in this post](https://neo4j.com/developer-blog/exploring-practical-recommendation-systems-in-neo4j/) scale well for running globally on very large graphs, Node Similarity is a good choice where explainability is important and you can narrow down the universe of comparisons to a subset of your data. Examples of narrowing down include focusing on just single communities, newly added users, or users within a specific proximity to suspect accounts. In this analysis we will take the third approach, filtering to just those Cards, Devices, and IP addresses that connect to at least one fraud risk account from Part 2.
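
For reference, the unweighted Jaccard calculation that underlies this kind of similarity can be written in a few lines of plain Python. The users and identifiers are hypothetical, and the sketch ignores the cutoffs and other options the GDS procedure offers.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical users and the identifiers (cards, devices, IPs) they connect to.
neighbors = {
    "u1": {"card_a", "device_x", "ip_1"},
    "u2": {"card_a", "device_x", "ip_2"},
    "u3": {"ip_2"},
}

print(jaccard(neighbors["u1"], neighbors["u2"]))  # 2 shared of 4 distinct
print(jaccard(neighbors["u2"], neighbors["u3"]))  # 1 shared of 3 distinct
```

Because the score is just shared identifiers over all identifiers of the pair, an analyst can explain any similarity by listing the identifiers two accounts have in common.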


Below we apply three queries to calculate node similarity. The first query enables the identifier filtering by setting a new label on the Card, Device, and IP address nodes that connect to fraud risk accounts. The first query also weights relationships by the inverse of degree centrality, essentially downplaying the importance of identifiers in proportion to the number of other users they connect to. This is important as some identifiers, particularly IP addresses, can connect to hundreds or thousands of users, in which case the identifier may be very generic (like a proxy IP address) and not as relevant to true user identity. The second and third queries project the graph and write relationships back to the database with a score to represent similarity strength between user node pairs. You will notice that we use a similarity cutoff of 0.01 in the third query, which is intended to rule out weak associations and keep the similarities relevant.
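
To see how inverse-degree weights change the picture, the sketch below uses one common weighted generalization of Jaccard (sum of per-identifier minimum weights over sum of maximum weights). Treat the exact formula GDS applies as something to confirm in its documentation; the identifiers and degrees here are hypothetical, and the weight function assumes a degree greater than 1.

```python
def inverse_degree_weight(degree):
    """Weight from the notebook's first query: 1 / (degree - 1); assumes degree > 1."""
    return 1.0 / (degree - 1.0)

def weighted_jaccard(a, b):
    """Sum of per-identifier minimum weights over sum of maximum weights."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator else 0.0

# Hypothetical identifier degrees: a card shared by 2 users, a proxy-like IP by 101.
degree = {"card_a": 2, "ip_proxy": 101, "device_x": 3}
weight = {k: inverse_degree_weight(d) for k, d in degree.items()}

u1 = {k: weight[k] for k in ("card_a", "ip_proxy")}
u2 = {k: weight[k] for k in ("card_a", "device_x")}
u3 = {k: weight[k] for k in ("ip_proxy", "device_x")}

print(round(weighted_jaccard(u1, u2), 3))  # shares the rare card
print(round(weighted_jaccard(u1, u3), 3))  # shares only the busy IP
```

Under these assumptions, the pair that shares only the proxy-like IP scores below 0.01 and would be removed by the similarity cutoff, while the pair sharing a rarely used card scores high.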

In [None]:
# label identifiers and users that are close to fraud risk users and assign inverse degree weight
gds.run_cypher('''
    MATCH(f:FraudRiskUser)-[:HAS_CC|HAS_IP|USED]->(n)
    WITH DISTINCT n
    MATCH(n)<-[r:HAS_CC|HAS_IP|USED]-(u)
    SET n:FraudSharedId
    SET r.inverseDegreeWeight = 1.0/(n.degree-1.0)
    RETURN count(DISTINCT n)
''')

In [None]:
# This cell takes a minute
g, _ = gds.graph.project('similarity-projection', ['User', 'FraudSharedId'], ['HAS_CC', 'USED', 'HAS_IP'],
                         relationshipProperties=['inverseDegreeWeight'])



df = gds.nodeSimilarity.write(g, writeRelationshipType='SIMILAR_IDS', writeProperty='score',
                              similarityCutoff=0.01, relationshipWeightProperty='inverseDegreeWeight', concurrency=4)
g.drop()
df

From there, we can run a Cypher query to rank users by how similar they are to known fraud risk communities.

In [None]:
#get nodes similar to the high risk ones
gds.run_cypher('''
    MATCH (f:FraudRiskUser)
    WITH f.wccId AS componentId, count(*) AS numberOfUsers, collect(f) AS users
    UNWIND users AS f
    MATCH (f)-[s:SIMILAR_IDS]->(u:User) WHERE NOT u:FraudRiskUser AND numberOfUsers > 2
    RETURN u.guid AS userId, sum(s.score) AS totalScore, collect(DISTINCT componentId) AS closeToCommunityIds 
    ORDER BY totalScore DESC LIMIT 5
''')

***

<span style="color:green">  Let’s take a look at the first user in the list, user `0b3f278ff6b348fb1a599479d9321cd9`.</span> 
<br>This user account seems interesting in the sense that they connect to two different communities.

```cypher
MATCH (u:User {guid: "0b3f278ff6b348fb1a599479d9321cd9"})-[:SIMILAR_IDS]-(u2:FraudRiskUser)
WITH u as targetUser, collect(distinct u2.wccId) as communities 
UNWIND communities as c 
MATCH (u2)-[:SHARED_IDS|P2P_WITH_SHARED_CARD]-(u3:User {wccId:c})
RETURN *
```
You can see how this user connects to the two fraud risk communities and how the similarity relationships were based on shared IP addresses. This user seems to act as a sort of bridge between the two communities, suggesting not only that the user is likely part of the fraud communities but also that the two communities may actually be one and the same.
***


Overall, centrality and similarity metrics like these are fast and easy to implement with Neo4j and GDS. They can help advance your Data Science approach by introducing automated and semi-supervised processes to assist in targeted triage and identification of suspicious user accounts based on previously labeled data.

## Part 4: Predicting Fraud Risk Accounts with Machine Learning

In real-world scenarios we often don't know which user accounts are fraudulent ahead of time. There will be cases, like with this dataset, where some accounts get flagged due to business rules, for example chargeback history, or via user reporting mechanisms. However, as we saw in parts 1 & 2 above, those flags don't tell the whole story. The community detection and recommendation approaches we previously covered can go a very long way in helping us understand this story and label additional fraud risk users and patterns. That said, there are multiple reasons why we may want to add supervised Machine Learning to predict the additional fraud risks:

 - __Proactive Detection:__ We can train a model to identify fraudulent actors ahead of time (such as before additional chargebacks or system flags) and better identify new communities that aren't connected to older known fraud accounts.
 - __Measurable Performance:__ Supervised learning models produce clear performance metrics that enable us to evaluate and adjust as needed.
 - __Automation:__ Supervised Machine Learning automates the prediction of fraud risk accounts.

In the sections below we will walk through how to engineer graph features for ML, export those features to Python, then train and evaluate an ML model for fraud classification.

### ML Pipeline for Node Classification

In [None]:
pipe, _ = gds.beta.pipeline.nodeClassification.create("gds_demo")

In [None]:
# Add target labels
gds.run_cypher('''
    MATCH (u:User)
    SET u.fraudLabel = u.fraudRisk - u.fraudMoneyTransfer
''')

### Feature Engineering
If we want a machine learning model to successfully classify fraud risk user accounts, we need to supply features that will be informative for the task. The commands below engineer graph features using GDS. This includes features from [WCC](https://neo4j.com/docs/graph-data-science/current/algorithms/wcc/) community sizes, [PageRank](https://neo4j.com/docs/graph-data-science/current/algorithms/page-rank/), and [degree centrality](https://neo4j.com/docs/graph-data-science/current/algorithms/degree-centrality/).
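
For intuition about what a PageRank feature captures, here is a small power-iteration sketch in plain Python. It is a textbook, normalized variant written for illustration and is not how GDS computes or scales its scores; the users and transfers are hypothetical.

```python
def pagerank(nodes, edges, damping=0.85, max_iterations=1000, tolerance=1e-7):
    """Unweighted PageRank by power iteration; a teaching sketch, not GDS's implementation."""
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {
            n: (1 - damping) / len(nodes) + damping * dangling / len(nodes)
            for n in nodes
        }
        for source, targets in out_links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        converged = sum(abs(new_rank[n] - rank[n]) for n in nodes) < tolerance
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical P2P transfers: u1 and u2 both send money to u3.
scores = pagerank(["u1", "u2", "u3"], [("u1", "u3"), ("u2", "u3")])
print(max(scores, key=scores.get))
```

A user who receives transfers from many (or from highly ranked) senders gets a higher score, which is why PageRank over the P2P relationships can carry information that a plain degree count misses.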

#### Community Features

In [None]:
# Capture features for community size from our WCC community detection in Part 2, and if the user is part of a community
gds.run_cypher('''
    MATCH (u:User)
    WITH u.wccId AS componentId, count(*) AS communitySize, collect(u) AS users
    WITH communitySize, toInteger(communitySize > 1) AS partOfCommunity, users
    UNWIND users as u
    SET u.communitySize = communitySize
    SET u.partOfCommunity = partOfCommunity;
''')

#### ID Centrality Features

In [None]:
g, _ = gds.graph.project("p2p-sharedid-features", 
    {
        "User": {"label": "User", "properties":["communitySize", "partOfCommunity", "fraudLabel"]},
        "Card": {"label": "Card"},
        "Device": {"label": "Device"},
        "IP": {"label": "IP"}
    },
    {
        "HAS_IP": {"type": "HAS_IP", "orientation": "NATURAL"},
        "HAS_CC": {"type": "HAS_CC", "orientation": "NATURAL"},
        "USED": {"type": "USED", "orientation": "NATURAL"},
        "P2P": {"type": "P2P", "orientation": "NATURAL", "aggregation": "SUM", "properties": ["totalAmount"]},
        "P2P_REVERSE": {"type": "P2P", "orientation": "REVERSE", "aggregation": "SUM", "properties": ["totalAmount"]},
        "SHARED_IDS": {"type": "SHARED_IDS", "orientation": "UNDIRECTED"},
        "P2P_WITH_SHARED_CARD": {"type": "P2P_WITH_SHARED_CARD", "orientation": "NATURAL"}
    }
)

In [None]:
# capture how many cards, devices, or IPs used by each user with degree centrality
gds.degree.mutate(g, nodeLabels=['User', 'Card'], relationshipTypes=['HAS_CC'], mutateProperty='cardDegree')
gds.degree.mutate(g, nodeLabels=['User', 'Device'], relationshipTypes=['USED'], mutateProperty='deviceDegree')
gds.degree.mutate(g, nodeLabels=['User', 'IP'], relationshipTypes=['HAS_IP'], mutateProperty='ipDegree');

#### P2P With Cards and Id Sharing Centrality Features

In [None]:
gds.degree.mutate(g, nodeLabels=["User"], relationshipTypes=["SHARED_IDS"], mutateProperty="sharedIdsDegree")

gds.pageRank.mutate(g, nodeLabels=["User"], relationshipTypes=["P2P_WITH_SHARED_CARD"], 
                    maxIterations=1000, mutateProperty="p2pSharedCardPageRank")

gds.pageRank.mutate(g, nodeLabels=["User"], relationshipTypes=["P2P"], maxIterations=1000,
                     mutateProperty="p2pSentPageRank")

gds.pageRank.mutate(g, nodeLabels=["User"], relationshipTypes=["P2P_REVERSE"], maxIterations=1000,
                     relationshipWeightProperty='totalAmount', mutateProperty="p2pReceivedWeightedPageRank")

gds.degree.mutate(g, nodeLabels=["User"], relationshipTypes=["P2P_REVERSE"], 
                  relationshipWeightProperty='totalAmount', mutateProperty="p2pReceivedWeightedDegree")

### Feature Selection
Select features from the list of features we generated in the previous step.

In [None]:
pipe.selectFeatures(["communitySize", "partOfCommunity", "cardDegree", "deviceDegree", "ipDegree",
                    "sharedIdsDegree", "p2pSharedCardPageRank", "p2pSentPageRank",
                     "p2pReceivedWeightedPageRank", "p2pReceivedWeightedDegree"
                    ])

In [None]:
pipe.feature_properties()

In [None]:
pipe.addLogisticRegression(maxEpochs=500, penalty=0.01)

In [None]:
# Train the pipeline with node property "fraudLabel" as the target, reporting accuracy, per-class recall, and weighted F1
trained_pipe_model, res = pipe.train(g, targetNodeLabels=['User'], modelName="my-model", targetProperty="fraudLabel", metrics=["ACCURACY", "RECALL(CLASS=*)", "F1_WEIGHTED"], concurrency=4)

In [None]:
trained_pipe_model.metrics()

#### Write computed features to the database

In [None]:
gds.graph.writeNodeProperties(g, ["cardDegree", "deviceDegree", "ipDegree",
                    "sharedIdsDegree", "p2pSharedCardPageRank", "p2pSentPageRank",
                     "p2pReceivedWeightedPageRank", "p2pReceivedWeightedDegree"
                    ], ['User'])
g.drop()

### Machine Learning Training & Evaluation using scikit-learn

#### Add target label weights

In [None]:
# for use with ML frameworks that take into account weights on the target label when training the model
gds.run_cypher('''
MATCH (u:User) WITH count(u) as numSamples
MATCH (u:User) WHERE u.fraudLabel > 0  WITH numSamples, count(u) as positiveSamples
MATCH (u:User)  WITH u, numSamples, positiveSamples,

CASE u.fraudLabel
WHEN 1 THEN toFloat(numSamples)/(2 * positiveSamples)
ELSE toFloat(numSamples)/(2 * (numSamples - positiveSamples))
END as weight

SET u.fraudLabelWeight = weight
''')
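
The Cypher above gives each user a "balanced" class weight, `numSamples / (2 * classCount)`, so that the fraud-risk and non-fraud-risk classes carry the same total weight. This is the same heuristic behind scikit-learn's `class_weight='balanced'` option. As a minimal sketch of that arithmetic in plain Python, using hypothetical toy labels and an illustrative helper named `balanced_weights` (neither is part of the original notebook):

```python
from collections import Counter

def balanced_weights(labels):
    """Return one weight per sample: n_samples / (n_classes * count_of_its_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]

# Hypothetical toy labels: 2 fraud-risk users (1) among 10 users
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_weights = balanced_weights(toy_labels)

# Sum the weights per class to check that both classes carry equal total weight
fraud_total = sum(w for w, l in zip(toy_weights, toy_labels) if l == 1)
other_total = sum(w for w, l in zip(toy_weights, toy_labels) if l == 0)
print(toy_weights[0], toy_weights[-1], fraud_total, other_total)
```

The rarer class receives the larger per-sample weight, which is why the fraud-risk users end up weighted more heavily than the rest.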

#### Get and Prepare data

In [None]:
df = gds.run_cypher('''
    MATCH(u:User)
    RETURN u.guid AS guid,
        u.wccId AS wccId,
        u.fraudLabel AS fraudLabel,
        u.fraudLabelWeight AS fraudLabelWeight,
        u.sharedIdsDegree AS sharedIdsDegree,
        u.p2pSharedCardPageRank AS p2pSharedCardPageRank,
        u.p2pSentPageRank AS p2pSentPageRank,
        u.p2pReceivedWeightedPageRank AS p2pReceivedWeightedPageRank,
        u.p2pReceivedWeightedDegree AS p2pReceivedWeightedDegree,
        u.ipDegree AS ipDegree,
        u.cardDegree AS cardDegree,
        u.deviceDegree AS deviceDegree,
        u.communitySize AS communitySize,
        u.partOfCommunity AS partOfCommunity
''')
df

In [None]:
X = df.drop(columns=['fraudLabel', 'fraudLabelWeight', 'wccId', 'guid'])
y = df.fraudLabel

In [None]:
X

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

#### Model Training and Evaluation

For purposes of this demo we are going to use a random forest classifier. Other classifiers including logistic regression, SVM, Neural Nets and Boosting variants could work as well. Going into the exact pros and cons of these models is out of scope here. Overall, exploring classification with Random Forests is a safe bet since they are relatively robust to feature scaling and collinearity issues and require minimal tuning to get working well.

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500, random_state=0, max_depth=5, bootstrap=True, class_weight='balanced')
clf.fit(X_train, y_train)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
print('Accuracy of random forest classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
print('\nConfusion Matrix: ')
disp = ConfusionMatrixDisplay.from_predictions(y_test, clf.predict(X_test), display_labels=clf.classes_,
                                               cmap='Greys', colorbar=False)
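
Accuracy alone can look strong when one class dominates, which is common with fraud labels, so it helps to read precision and recall for the fraud-risk class off the confusion matrix as well. The sketch below shows the arithmetic; the counts are hypothetical and are not results from this notebook, and `precision_recall_f1` is an illustrative helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical confusion-matrix cells, for illustration only
tn, fp, fn, tp = 900, 50, 10, 40
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With the fitted classifier, `sklearn.metrics.classification_report(y_test, clf.predict(X_test))` reports the same quantities per class.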

Below is a ranked list of the most influential features, measured by permutation importance. Among the most important are community size, shared IDs degree, and P2P shared-card PageRank.

In [None]:
from sklearn.inspection import permutation_importance
result = permutation_importance(clf, X_train, y_train, random_state=0)
pd.DataFrame(abs(result['importances_mean']),index=X_train.columns).sort_values(0, ascending=False)
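
Permutation importance measures how much a model's score drops when a single feature column is shuffled, which breaks that feature's relationship to the target. The following is a simplified sketch of the idea on hypothetical toy data, not scikit-learn's implementation; `permutation_importance_sketch`, `toy_model`, and the data are all illustrative.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows where the prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance_sketch(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving all other columns untouched
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: feature 0 determines the label, feature 1 is noise
data_rng = random.Random(42)
toy_rows = [[data_rng.randint(0, 9), data_rng.randint(0, 9)] for _ in range(200)]
toy_labels = [int(r[0] >= 5) for r in toy_rows]

def toy_model(row):
    # Stand-in "model" that only looks at feature 0
    return int(row[0] >= 5)

toy_importances = permutation_importance_sketch(toy_model, toy_rows, toy_labels)
print(toy_importances)
```

Shuffling the noise feature leaves the score unchanged, so its importance is zero, while shuffling the informative feature causes a large drop. Computing importances on held-out data such as `X_test` may give a better sense of which features generalize.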

### Investigating Unlabeled High-Probability Fraud Risk Predictions
The labeling from Part 2 wasn't perfect. Now that we have trained a machine learning model, investigating user accounts that were predicted as high-probability fraud risks despite not being labeled as such (ostensible false positives) can bring further insights.

The commands below isolate some cases from the test set so we can visualize them in Neo4j Browser or [Neo4j Bloom](https://neo4j.com/product/bloom/).

In [None]:
# Retrieve high-probability predictions for users in the test set that are not labeled as fraud risk
y_prob = clf.predict_proba(X_test)
y_test_df = y_test.to_frame(name='cls')
y_test_df['predictedProbability']=y_prob[:, 1]
test_prob_df = y_test_df[(y_test_df.predictedProbability > 0.88) & (y_test_df.cls == 0)] \
    .join(df[['guid','wccId', 'communitySize']])
test_prob_df

In [None]:
#Write back to database for investigation in Bloom
for index, row in test_prob_df.iterrows():
    gds.run_cypher('''
        MATCH(u:User) WHERE u.guid = $guid
        SET u.predictedProbability = $predictedProbability
    ''', params = row.to_dict())
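
The loop above issues one query per row, which is fine for a handful of users. For larger result sets, a common alternative is to send all rows as a single list parameter and `UNWIND` it in Cypher. The sketch below is one possible way to do that; `build_writeback_rows`, `WRITEBACK_QUERY`, and the example records are hypothetical names and data, and the final call is commented out because it assumes the live `gds` connection and `test_prob_df` from the cells above.

```python
def build_writeback_rows(records):
    """Keep only the fields the Cypher statement needs, as plain Python types."""
    return [
        {"guid": str(r["guid"]), "predictedProbability": float(r["predictedProbability"])}
        for r in records
    ]

# One statement for all rows instead of one query per row
WRITEBACK_QUERY = '''
    UNWIND $rows AS row
    MATCH (u:User) WHERE u.guid = row.guid
    SET u.predictedProbability = row.predictedProbability
'''

# Hypothetical records standing in for test_prob_df.to_dict("records")
example_records = [
    {"guid": "abc", "cls": 0, "predictedProbability": 0.91, "wccId": 7, "communitySize": 4},
    {"guid": "def", "cls": 0, "predictedProbability": 0.95, "wccId": 9, "communitySize": 3},
]
rows = build_writeback_rows(example_records)
print(rows)

# With a live connection, assuming `gds` and `test_prob_df` from the cells above:
# gds.run_cypher(WRITEBACK_QUERY,
#                params={"rows": build_writeback_rows(test_prob_df.to_dict("records"))})
```

Casting to `str` and `float` keeps the parameters as built-in Python types, which avoids surprises if the DataFrame holds NumPy scalars.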

*** 
<span style="color:green"> You can look at the predicted fraudulent users and their community using an example like the below. </span>
<br>Hint: Set the User node caption to be `predictedProbability` to more easily identify the suspect! 

```cypher 
MATCH (u:User) WHERE u.guid="c4d3c05ebf06a6b7b2f10833a51a0b70" 
WITH u.wccId as commId
MATCH p=(c:User)-[:P2P|HAS_IP|HAS_CC|USED*1..2]-(d:User) 
WHERE c.wccId=commId and d.wccId=commId and id(c)<>id(d)
RETURN p;
``` 
*** 
Perhaps unsurprisingly, these examples exhibit P2P payments with shared card behavior. They also have a relatively large number of credit cards (the median degree centrality on cards is 3). This could be a sign of fraud, though it is hard to know on an anonymized dataset like this. This is where subject matter expert review and iteration come in. If this behavior turns out to be a clear indicator of fraud, it means we are predicting fraud proactively, before chargebacks take place, which is the ideal. In that case, re-labeling these users appropriately and re-training our ML model as more data comes in will further improve predictive performance. If, on the other hand, some of this behavior turns out to be benign, we can adjust the feature engineering and model so the ML learns to rule out such cases, which will likewise improve predictive performance and increase our understanding of fraud patterns. Either way, it is a win.

### Additional Visualization with Bloom
<br>You can also explore more custom styling and user workflows with the Neo4j visualization tool [Bloom](https://neo4j.com/docs/bloom-user-guide/current/) - try the query above as a "Saved Cypher" Search phrase, with guid as a parameter! 



------
## Clean Up
This section will help clean all the additional graph elements and properties created in the above workflow.

In [None]:
# drop all pipelines 
p_names = gds.beta.pipeline.list().pipelineName.tolist()
for p_name in p_names: 
    p = gds.pipeline.get(p_name)
    gds.beta.pipeline.drop(p)

In [None]:
# drop all in-memory graphs
gdf = gds.graph.list()
# only drop graphs that belong to the current database
g_names = gdf[gdf['database']==DATABASE].graphName.tolist()
for g_name in g_names:
    g = gds.graph.get(g_name)
    gds.graph.drop(g)

In [None]:
# drop all models
models = gds.run_cypher("CALL gds.beta.model.list()")
for m in models['modelInfo']:
    gds.run_cypher('''CALL gds.beta.model.drop("{}")'''.format(m['modelName']))

In [None]:
# delete created relationships
gds.run_cypher('MATCH (:User)-[r:SHARED_IDS]->() DELETE r')
gds.run_cypher('MATCH (:User)-[r:P2P_WITH_SHARED_CARD]->() DELETE r')
gds.run_cypher('MATCH (:User)-[r:SIMILAR_IDS]->() DELETE r')

In [None]:
# remove created node labels
gds.run_cypher('MATCH (u:FlaggedUser) REMOVE u:FlaggedUser')
gds.run_cypher('MATCH (u:FraudRiskUser) REMOVE u:FraudRiskUser')
gds.run_cypher('MATCH (u:FraudSharedId) REMOVE u:FraudSharedId')

In [None]:
# remove created node properties
gds.run_cypher('''
    MATCH (n)
    REMOVE n.wccId,
        n.sharedIdsDegree,
        n.predictedProbability,
        n.partOfCommunity,
        n.p2pSharedCardPageRank,
        n.p2pSharedCardDegree,
        n.p2pSentWeightedPageRank,
        n.p2pSentWeightedDegree,
        n.p2pSentPageRank,
        n.p2pSentDegree,
        n.p2pReversedSharedCardPageRank,
        n.p2pReversedSharedCardDegree,
        n.p2pReceivedWeightedPageRank,
        n.p2pReceivedWeightedDegree,
        n.p2pReceivedPageRank,
        n.p2pReceivedDegree,
        n.louvainCommunityId,
        n.ipDegree,
        n.fraudLabel,
        n.fraudLabelWeight,
        n.fraudRiskRatio,
        n.fraudRiskDegree,
        n.fraudRisk,
        n.flaggedRatio,
        n.flaggedDegree,
        n.deviceDegree,
        n.degree,
        n.communitySize,
        n.cardDegree
''')

In [None]:
# remove created relationship properties
gds.run_cypher('MATCH ()-[r]->() REMOVE r.inverseDegreeWeight')