<a href="https://colab.research.google.com/github/guerinjeanmarc/FraudWorkshop/blob/main/Paysim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fraud Detection and Investigation Using the Graph Data Science Library

An example GDS workflow that demonstrates fraud detection and investigation with Neo4j Graph Data Science. This notebook contains snippets of Cypher and Python code, each with a brief explanation, to help with the demo.

We will use the GDS library to get you started with a few scenarios in first-party and synthetic identity fraud detection and investigation.

## Notebook Setup
We need a dedicated environment where Neo4j and GDS are available. In our case, we will use the Graph Data Science sandbox.

- Go to https://sandbox.neo4j.com
- Log in and click **New Project**
- Select **Fraud Detection**, then click **Create**

In [None]:
#install dependencies
!pip install graphdatascience


### Neo4j Settings


In [None]:
# Replace XXX with your connection details
BoltURL = 'bolt://XXXXXX:7687'
Username = 'neo4j'
password = 'XXXXXXX'

### Connect to Graph Data Science

In [None]:
from graphdatascience import GraphDataScience

# Use Neo4j URI and credentials according to your setup
gds = GraphDataScience(BoltURL, auth=(Username, password), aura_ds=False)

# Check connection
print(gds.version())

2.4.1


### Helper function

In [None]:
def clear_graph_by_name(g_name):
    """Drop the projected graph named g_name from the catalog, if it exists."""
    if gds.graph.exists(g_name)["exists"]:
        g = gds.graph.get(g_name)
        gds.graph.drop(g)



### Problem Definition
#### What is Fraud?
Fraud occurs when an individual, a group of individuals, or a business entity intentionally deceives another individual or business entity by misrepresenting identity, products, services, or financial transactions, or by making false promises with no intention of fulfilling them.

#### Fraud Categories
- First-party Fraud
  - An individual, or a group of individuals, misrepresents their identity or gives false information when applying for a product or service, either to receive more favourable rates or with no intention of repayment.
- Second-party Fraud
  - An individual knowingly gives their identity or personal information to another individual to commit fraud, or someone perpetrates fraud on their behalf.
- Third-party Fraud
  - An individual, or a group of individuals, creates or uses another person’s identity or personal details to open or take over an account.


### Exercises
We will use the Neo4j GDS library to detect and label two types of fraudsters:

1. First party fraudsters (Module #1)
2. Money Mules (Module #2)

### Preliminary Data Analysis

We will use the Paysim dataset for the hands-on exercises. Paysim is a synthetic dataset that mimics a real-world mobile money transfer network.

Let’s explore the dataset.

1. Database Schema and Stats
2. Nodes and Relationships
3. Transaction Types

#### Database Schema and Stats

In Neo4j Browser, let's look at the schema:

```Cypher
CALL db.schema.visualization()
```
`db.schema.visualization` shows all node labels. We can select only the labels we are interested in with `apoc.meta.subGraph`:
```Cypher
CALL apoc.meta.subGraph({labels:['Client','Email','SSN','Phone','Transaction','Bank','Merchant']})
```

We can explore the graph in Neo4j Browser by writing Cypher queries that:
- Show some (:Client) nodes
- Return the number of (:Client) nodes
- Return the number of (:SSN) nodes
- Show some relationships between (:Client) and (:SSN)
- Return the number of relationships between (:Client) and (:SSN)
- Return the number of Clients that do not have an SSN
- Return the number of distinct SSNs that are shared between 2 Clients
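
As a starting point for the questions above, here is a hedged sketch of one possible query per bullet. The dictionary `EXPLORATION_QUERIES` and the helper `run_exploration` are hypothetical names introduced for illustration. The queries assume the `(:Client)-[:HAS_SSN]->(:SSN)` direction used elsewhere in this notebook, and the last one reads "shared between 2 Clients" as exactly two.

```python
# Hypothetical starter queries, one per exploration question above.
EXPLORATION_QUERIES = {
    "sample_clients": "MATCH (c:Client) RETURN c LIMIT 10",
    "client_count": "MATCH (c:Client) RETURN count(c) AS clients",
    "ssn_count": "MATCH (s:SSN) RETURN count(s) AS ssns",
    "client_ssn_sample": "MATCH p = (:Client)-[:HAS_SSN]->(:SSN) RETURN p LIMIT 10",
    "client_ssn_rel_count": "MATCH (:Client)-[r:HAS_SSN]->(:SSN) RETURN count(r) AS rels",
    "clients_without_ssn": (
        "MATCH (c:Client) WHERE NOT (c)-[:HAS_SSN]->() "
        "RETURN count(c) AS clients"
    ),
    # Assumption: "shared between 2 Clients" means exactly two clients.
    "ssns_shared_by_two_clients": (
        "MATCH (s:SSN)<-[:HAS_SSN]-(c:Client) "
        "WITH s, count(DISTINCT c) AS n WHERE n = 2 "
        "RETURN count(s) AS ssns"
    ),
}


def run_exploration(gds_client, queries=EXPLORATION_QUERIES):
    """Run every query through gds_client.run_cypher and return {name: result}."""
    return {name: gds_client.run_cypher(query) for name, query in queries.items()}
```

With the `gds` object created in the setup cells, `run_exploration(gds)` would return one result per question. The two "show" queries are easier to read when pasted into Neo4j Browser.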


In [None]:
# total node counts
gds.run_cypher('''
    CALL apoc.meta.stats()
    YIELD labels
    UNWIND keys(labels) AS nodeLabel
    RETURN nodeLabel, labels[nodeLabel] AS nodeCount
''')

Unnamed: 0,nodeLabel,nodeCount
0,Debit,4392
1,Bank,3
2,Email,2229
3,Mule,433
4,SSN,2238
5,Payment,74577
6,Merchant,347
7,Transaction,323489
8,Phone,2234
9,CashOut,76023


In [None]:
# total relationship counts
gds.run_cypher('''
    CALL apoc.meta.stats()
    YIELD relTypesCount
    UNWIND keys(relTypesCount) AS relationshipType
    RETURN relationshipType, relTypesCount[relationshipType] AS relationshipCount
''')

Unnamed: 0,relationshipType,relationshipCount
0,HAS_SSN,2433
1,LAST_TX,2332
2,PERFORMED,323489
3,NEXT,321157
4,HAS_EMAIL,2433
5,TO,323489
6,FIRST_TX,2332
7,HAS_PHONE,2433


## Module #1: First-party Fraud
Synthetic identity fraud and first-party fraud can be identified by performing entity link analysis to detect identities linked to other identities via shared PII.

There are three types of personally identifiable information (PII) in this dataset: SSN, email, and phone number.

Our hypothesis is that clients who share identifiers are suspicious and have a higher potential to commit fraud. However, not all shared identifier links are suspicious; for example, two people may legitimately share an email address. Hence, we compute a fraud score based on shared PII relationships and label the top X percentile of clients as fraudsters.

We will first identify clients that share identifiers and then create a new relationship between them.
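
To make the idea of "clients sharing identifiers" concrete, here is a minimal pure-Python sketch that counts shared identifiers per client pair. The toy data and the function name `shared_identifier_counts` are hypothetical; the notebook itself does this step in Cypher against the database.

```python
from collections import defaultdict
from itertools import combinations

# Toy data (hypothetical): each client maps to the identifiers it uses.
client_identifiers = {
    "alice": {"ssn:1", "email:a", "phone:1"},
    "bob": {"ssn:1", "email:a", "phone:2"},
    "carol": {"ssn:3", "email:c", "phone:2"},
    "dave": {"ssn:4", "email:d", "phone:4"},
}


def shared_identifier_counts(identifiers_by_client):
    """Return {(client_a, client_b): number of identifiers the pair shares}."""
    # Invert the mapping: identifier -> set of clients that use it.
    clients_by_identifier = defaultdict(set)
    for client, identifiers in identifiers_by_client.items():
        for identifier in identifiers:
            clients_by_identifier[identifier].add(client)

    # Every pair of clients attached to the same identifier shares it once.
    counts = defaultdict(int)
    for clients in clients_by_identifier.values():
        for a, b in combinations(sorted(clients), 2):
            counts[(a, b)] += 1
    return dict(counts)


pairs = shared_identifier_counts(client_identifiers)
print(pairs)
```

Sorting each pair plays the same role as the `id(c1) < id(c2)` filter in the Cypher query: every pair is counted once rather than twice.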

### Identify clients sharing PII


In [None]:
gds.run_cypher('''
    MATCH (c1:Client)-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]->(n)<-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]-(c2:Client)
    WHERE id(c1) < id(c2)
    RETURN c1.id, c1.name, c2.id, c2.name, count(*) AS freq
    ORDER BY freq DESC;
''')

Unnamed: 0,c1.id,c1.name,c2.id,c2.name,freq
0,4952527271473904,Lauren Carver,4816336012071985,Khloe Lowery,3
1,4883445100935916,Claire Morin,4708373581412325,Isaac Gallegos,3
2,4658150168863397,Ashley Butler,4100374538108184,Brandon Howell,3
3,4673951123644611,Christian Reese,4795773320377768,Matthew Riddle,3
4,4192214340630620,Lauren Stanley,4912097363222923,Zoe Miranda,3
...,...,...,...,...,...
754,4910140986334626,Juan Trujillo,4114683318919154,Nathaniel Myers,1
755,4454780847105236,Andrea Cummings,4210575070378533,Adrian Jacobson,1
756,4721862020593706,Alexandra Duke,4210575070378533,Adrian Jacobson,1
757,4445521165797820,Oliver Gentry,4210575070378533,Adrian Jacobson,1


In [None]:
# Number of unique clients sharing PII
gds.run_cypher('''
    MATCH (c1:Client)-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]->(n)<-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]-(c2:Client)
    WHERE id(c1) <> id(c2)
    RETURN count(DISTINCT c1.id) AS freq;
''')

Unnamed: 0,freq
0,336


### Create a new relationship
Create a new relationship to connect clients that share identifiers, and store the number of shared identifiers as a property on that relationship.


In [None]:
gds.run_cypher('''
    MATCH (c1:Client)-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]->(n)<-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]-(c2:Client)
    WHERE id(c1) < id(c2)
    WITH c1, c2, count(*) as cnt
    MERGE (c1) - [:SHARED_IDENTIFIERS {count: cnt}] -> (c2);
''')

Visualize the new relationship created above.

```Cypher
MATCH p = (:Client) - [s:SHARED_IDENTIFIERS] -> (:Client) WHERE s.count >= 2 RETURN p limit 25
```


### Graph Algorithms

Graph algorithms are used to compute metrics for graphs, nodes, or relationships.

They can provide insights on relevant entities in the graph (centralities, ranking), or inherent structures like communities (community-detection, graph-partitioning, clustering).

The Neo4j Graph Data Science (GDS) library contains many graph algorithms. The algorithms are divided into categories that represent different problem classes. For more information, see the Algorithms section of the GDS documentation.

### Fraud detection workflow in Neo4j GDS

We will construct a workflow with graph algorithms to detect fraud rings, score clients based on the number of common connections, and rank them to select the top few suspicious clients and label them as fraudsters.

1. Identify clusters of clients sharing PII using a community detection algorithm (Weakly Connected Components)
2. Find similar clients within the clusters using pairwise similarity algorithms (Node Similarity)
3. Calculate and assign a fraud score to clients using centrality algorithms (Degree Centrality)
4. Use the computed fraud scores to label clients as potential fraudsters

### Graph Projection

A central concept in the GDS library is the management of in-memory graphs. Graph algorithms run on a graph data model that is a projection of the Neo4j property graph data model. For more information, see Graph Management in the GDS documentation.

A projected graph can be stored in the catalog under a user-defined name. Using that name, the graph can be referred to by any algorithm in the library.

In [None]:
# clear the graph if it exists beforehand
clear_graph_by_name('wcc')

# create graph projection
g, _ = gds.graph.project(
    'wcc',
    ['Client'],
    {
        'SHARED_IDENTIFIERS': {'orientation': 'UNDIRECTED', 'properties': ['count']}
    }
)

print(f"Created {g.name()} with {g.node_count():,} nodes, {g.relationship_count():,} relationships")

Created wcc with 2,433 nodes, 1,518 relationships


In [None]:
# Show the graph catalog
gds.graph.list()

Unnamed: 0,degreeDistribution,graphName,database,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema,schemaWithOrientation
0,"{'p99': 7, 'min': 0, 'max': 9, 'mean': 0.62392...",wcc,neo4j,9067 KiB,9284736,2433,1518,{'relationshipProjection': {'SHARED_IDENTIFIER...,0.000257,2023-07-18T09:33:58.007080267+00:00,2023-07-18T09:34:00.538351527+00:00,"{'graphProperties': {}, 'relationships': {'SHA...","{'graphProperties': {}, 'relationships': {'SHA..."


In [None]:
# delete selected graph
#g.drop()

graphName                                                              wcc
database                                                             neo4j
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                             2433
relationshipCount                                                     1518
configuration            {'relationshipProjection': {'SHARED_IDENTIFIER...
density                                                           0.000257
creationTime                           2023-04-26T15:34:47.411091770+00:00
modificationTime                       2023-04-26T15:34:47.728501503+00:00
schema                   {'graphProperties': {}, 'relationships': {'SHA...
schemaWithOrientation    {'graphProperties': {}, 'relationships': {'SHA...
Name: 0, dtype: object

### Memory Estimation and Graph Projection

It is good practice to run a memory estimate before creating your graph to make sure you have enough memory for the in-memory graph. For more information, see Memory Estimation in the GDS documentation.

Named graphs can be created using either a native projection or a Cypher projection. Native projections provide the best performance by reading from the Neo4j store files. Cypher projections are more flexible and expressive but place less focus on performance than native projections. For more information, see Native and Cypher Projection in the GDS documentation.

In [None]:
# get memory estimate before creating the graph
gds.graph.project.estimate(
    ['Client'],
    {'SHARED_IDENTIFIERS': {'orientation': 'UNDIRECTED', 'properties':['count']} }
)

requiredMemory                                  [612 KiB ... 3205 KiB]
treeView             graph projection: [612 KiB ... 3205 KiB]\n|-- ...
mapView              {'components': [{'components': [{'memoryUsage'...
bytesMin                                                        626784
bytesMax                                                       3282936
nodeCount                                                         2433
relationshipCount                                                  759
heapPercentageMin                                                  0.1
heapPercentageMax                                                  0.1
Name: 0, dtype: object

### 1. Identify groups of clients sharing PII (Fraud rings)

Run Weakly Connected Components (WCC) to find clusters of clients sharing PII.

WCC finds groups of connected nodes, where all nodes in the same group form a connected component when relationship direction is ignored. WCC is often used early in an analysis to understand the structure of a graph. More information is available in the WCC documentation.
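
For intuition, the following is a minimal union-find sketch of weakly connected components on a toy graph. It is an illustration under stated assumptions, not how GDS implements WCC internally; the function name and the toy nodes and edges are hypothetical.

```python
def weakly_connected_components(nodes, edges):
    """Return {node: component_id}, treating every edge as undirected."""
    parent = {node: node for node in nodes}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of the two endpoints of every edge.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Map roots to consecutive ids, in the spirit of consecutiveIds=True.
    ids_by_root = {}
    result = {}
    for node in nodes:
        root = find(node)
        if root not in ids_by_root:
            ids_by_root[root] = len(ids_by_root)
        result[node] = ids_by_root[root]
    return result


toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]
components = weakly_connected_components(toy_nodes, toy_edges)
print(components)
```

Here `a`, `b`, and `c` end up in one component, `d` and `e` in another, and the isolated node `f` forms a component of size one, which mirrors the many single-client components in the real dataset.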

In [None]:
df = gds.wcc.write(g, writeProperty='wccId', consecutiveIds=True)
g.drop()
df

writeMillis                                                            294
nodePropertiesWritten                                                 2433
componentCount                                                        2148
componentDistribution    {'p99': 7, 'min': 1, 'max': 12, 'mean': 1.1326...
postProcessingMillis                                                    37
preProcessingMillis                                                      0
computeMillis                                                           38
configuration            {'jobId': 'fbb9cc05-b22f-4988-9938-34a2134e0b9...
Name: 0, dtype: object

In [None]:
gds.run_cypher('''
    MATCH (c:Client)
    RETURN c.id as clientId, c.name as name, c.wccId as wccId
    ORDER BY wccId LIMIT 20
''')

Unnamed: 0,clientId,name,wccId
0,4528449536009586,Parker Coleman,0
1,4215465225552213,Nolan Whitfield,1
2,4252021221910485,Madeline Bennett,2
3,4391421102919948,Alex Bradley,3
4,4091319647578836,Maya Brooks,4
5,4312387486707181,Grace Hammond,5
6,4189330002136246,Benjamin Moss,5
7,4872929154943952,David Olson,5
8,4823433093413060,Alex Chen,5
9,4834589450383852,Logan Little,5


In [None]:
# Use cypher to filter clusters based on the size (>1) and then set a property on Client nodes
gds.run_cypher('''
    MATCH (c:Client)
    WITH c.wccId as cluster, collect(c.id) as clients
    WITH cluster, clients, size(clients) AS clusterSize
    WHERE clusterSize > 1
    UNWIND clients as client
    MATCH (c:Client) WHERE c.id = client
    SET c.firstPartyFraudGroup = cluster
    SET c:FirstPartyFraudGroup
''')

### Collect and visualize clusters in Neo4j Browser

Visualize clusters with 9 or more client nodes:

```Cypher
MATCH (c:Client)
WITH c.firstPartyFraudGroup AS fpGroupID, collect(c.id) AS fGroup
WITH *, size(fGroup) AS groupSize WHERE groupSize >= 9
WITH collect(fpGroupID) AS fraudRings
MATCH p=(c:Client)-[:HAS_SSN|HAS_EMAIL|HAS_PHONE]->()
WHERE c.firstPartyFraudGroup IN fraudRings
RETURN p
```

### 2. Pairwise similarity scores for additional context

We have observed that some identifiers (email, SSN, phone number) are connected to more than one client, which points to reuse of identifiers among clients.

We hypothesize that identities connected to highly reused identifiers have a higher potential to commit fraud.

We can compute pairwise similarity scores using the Jaccard metric, build additional relationships between clients that share identifiers, and score each pair by its Jaccard score.
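
As a reminder of what the Jaccard metric measures, here is a small self-contained sketch: the size of the intersection of two identifier sets divided by the size of their union. The function name `jaccard` and the toy identifier sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets have no basis for comparison; return 0.0 by convention.
    return len(a & b) / len(union) if union else 0.0


alice = {"ssn:1", "email:a", "phone:1"}
bob = {"ssn:1", "email:a", "phone:2"}
dave = {"ssn:4", "email:d", "phone:4"}

print(jaccard(alice, bob))   # 2 shared out of 4 distinct identifiers
print(jaccard(alice, dave))  # nothing shared
```

Two clients that share two of their four distinct identifiers score 0.5, while clients with nothing in common score 0.0.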

In [None]:
# Graph Projection
# clear the graph if it exists beforehand, so the cell can be re-run
clear_graph_by_name('similarity')

g, _ = gds.graph.project('similarity', ['FirstPartyFraudGroup', 'Email', 'Phone', 'SSN'], {
    'HAS_EMAIL': {'orientation': 'UNDIRECTED'},
    'HAS_PHONE': {'orientation': 'UNDIRECTED'},
    'HAS_SSN': {'orientation': 'UNDIRECTED'}
})

### Write similarity scores to the in-memory graph (Mutate)

We can mutate the in-memory graph by adding the algorithm's results to it as new relationships or properties, without writing anything to the database.

In [None]:
df = gds.nodeSimilarity.mutate(g, mutateRelationshipType='SIMILAR_IDS', mutateProperty='score')
df

preProcessingMillis                                                       0
computeMillis                                                           310
mutateMillis                                                            104
postProcessingMillis                                                     -1
nodesCompared                                                           746
relationshipsWritten                                                   2938
similarityDistribution    {'p1': 0.14285707473754883, 'max': 1.000007152...
configuration             {'topK': 10, 'similarityMetric': 'JACCARD', 'b...
Name: 0, dtype: object

### 3. Calculate First-party Fraud Score

We compute the first-party fraud score using the weighted degree centrality algorithm.

In this step, we compute and assign a fraud score (firstPartyFraudScore) to clients in the clusters identified in the previous steps, based on SIMILAR_IDS relationships weighted by their similarity score (the `score` property).

Weighted degree centrality adds up the similarity scores on the SIMILAR_IDS relationships of a given node in a cluster and assigns the sum as that node's firstPartyFraudScore. A high score identifies a client who is similar to many others in the cluster in terms of shared identifiers. A higher firstPartyFraudScore represents a greater potential for committing fraud.
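
The summation described above can be sketched in a few lines of plain Python. This is a toy illustration that assumes each node accumulates the weights of its outgoing relationships, which is our reading of the default (natural) orientation in GDS degree centrality. The function name and the sample relationships are hypothetical.

```python
from collections import defaultdict


def weighted_degree(relationships):
    """Sum relationship weights per source node.

    relationships: iterable of (source, target, weight) triples.
    Assumption: outgoing relationships are counted (natural orientation).
    """
    scores = defaultdict(float)
    for source, _target, weight in relationships:
        scores[source] += weight
    return dict(scores)


# Toy similarity relationships (hypothetical values).
similar_ids = [
    ("a", "b", 0.5),
    ("a", "c", 0.25),
    ("b", "a", 0.5),
]
fraud_scores = weighted_degree(similar_ids)
print(fraud_scores)
```

Client `a` is similar to two other clients and so collects a higher score than `b`, which is similar to only one.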

In [None]:
#Write back centrality scores as firstPartyFraudScore to the database using write mode.
df = gds.degree.write(g, nodeLabels= ['FirstPartyFraudGroup'], relationshipTypes=['SIMILAR_IDS'],
                       relationshipWeightProperty='score', writeProperty='firstPartyFraudScore')

In [None]:
# return top10 first party fraudsters
gds.run_cypher('''
    MATCH (c:Client)
    WHERE c.firstPartyFraudScore IS NOT NULL
    RETURN c.id AS id, c.name AS name, c.firstPartyFraudScore as score
    ORDER BY score DESC LIMIT 10
''')

Unnamed: 0,id,name,score
0,4024985944102082,Charlotte Foster,3.5
1,4268433407129628,Jose Roberson,3.2
2,4830783673717400,Scarlett Solomon,3.2
3,4614177132519923,Ryan Patel,3.1
4,4371660075922934,Allison Alvarez,3.1
5,4189330002136246,Benjamin Moss,3.0
6,4632977841783696,Julia Ortega,3.0
7,4029043591201321,Brooklyn Harrison,2.9
8,4818802026065667,Madeline Ramos,2.9
9,4359490519123048,Landon Welch,2.9


### 4. Attach fraudster labels

We find clients whose first-party fraud score is above a chosen percentile threshold and label them as fraudsters. In this example, using the 95th percentile as the threshold, we add the label FirstPartyFraudster to the Client node.
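
The thresholding step can be sketched in plain Python as follows. Cypher's `percentileCont` is documented as using linear interpolation; the exact index convention used here (`p * (n - 1)`) is an assumption, and the function name and toy scores are hypothetical.

```python
def percentile_cont(values, p):
    """Percentile with linear interpolation between neighbouring values."""
    xs = sorted(values)
    if not xs:
        raise ValueError("values must not be empty")
    position = p * (len(xs) - 1)       # fractional index into the sorted list
    lower = int(position)              # index of the value at or below position
    upper = min(lower + 1, len(xs) - 1)
    fraction = position - lower
    return xs[lower] + (xs[upper] - xs[lower]) * fraction


# Toy scores (hypothetical); the notebook uses p = 0.95 on real scores.
scores = {"a": 0.5, "b": 1.0, "c": 1.5, "d": 2.0, "e": 3.5}
threshold = percentile_cont(scores.values(), 0.75)
flagged = sorted(name for name, s in scores.items() if s > threshold)
print(threshold, flagged)
```

As in the Cypher query, the comparison is strict, so a client whose score equals the threshold is not labeled.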

In [None]:
gds.run_cypher('''
    MATCH(c:Client)
    WHERE c.firstPartyFraudScore IS NOT NULL
    WITH percentileCont(c.firstPartyFraudScore, 0.95) AS firstPartyFraudThreshold

    MATCH(c:Client)
    WHERE c.firstPartyFraudScore > firstPartyFraudThreshold
    SET c:FirstPartyFraudster
''')

In [None]:
# count clients labeled as first-party fraudsters (score above the 95th percentile)
gds.run_cypher('''
    MATCH (c:FirstPartyFraudster)
    RETURN count(c)
''')

Unnamed: 0,count(c)
0,17


## End of Module #1: First-party Fraud

In this module, we:

1. Identified clusters of clients sharing PII
2. Computed pairwise similarity based on shared PII
3. Computed a first-party fraud score
4. Labeled some clients as first-party fraudsters

## Module #2: Second-party Fraud / Money Mules

According to the FBI, criminals recruit money mules to help launder proceeds derived from online scams and frauds. Money mules add layers of distance between victims and fraudsters, which makes it harder for law enforcement to accurately trace money trails.

In this exercise, we detect money mules in the Paysim dataset. Our hypothesis is that clients who transfer money to or from first-party fraudsters are suspects for second-party fraud.

1. Identify and explore transactions (money transfers) between first-party fraudsters and other clients
2. Detect second-party fraud networks

### Transactions between first-party fraudsters and clients

The first step is to find clients who were not identified as first-party fraudsters but who transact with first-party fraudsters.

```Cypher
MATCH p=(:Client:FirstPartyFraudster)-[]-(:Transaction)-[]-(c:Client)
WHERE NOT c:FirstPartyFraudster
RETURN p;
```

Also, let's find out what types of transactions these clients perform with first-party fraudsters.

In [None]:
gds.run_cypher('''
    MATCH (:Client:FirstPartyFraudster)-[]-(txn:Transaction)-[]-(c:Client)
    WHERE NOT c:FirstPartyFraudster
    UNWIND labels(txn) AS transactionType
    RETURN transactionType, count(*) AS freq;
''')

Unnamed: 0,transactionType,freq
0,Transfer,89
1,Transaction,89


### Create new relationships

Let’s go ahead and create TRANSFER_TO relationships between clients with the FirstPartyFraudster label and other clients. We also add the total amount from all such transactions as a property on the TRANSFER_TO relationships.

Since the total amount transferred from a fraudster to a client and the total amount transferred in the reverse direction are not the same, we have to create the relationships in two separate queries.

- TRANSFER_TO relationship from a fraudster to a client (look at the directions in the queries)
- Add the SecondPartyFraudSuspect label to these clients

In [None]:
gds.run_cypher('''
    MATCH (c1:FirstPartyFraudster)-[]->(t:Transaction)-[]->(c2:Client)
    WHERE NOT c2:FirstPartyFraudster
    WITH c1, c2, sum(t.amount) AS totalAmount
    SET c2:SecondPartyFraudSuspect
    CREATE (c1)-[:TRANSFER_TO {amount:totalAmount}]->(c2);
''')

- TRANSFER_TO relationship from a client to a fraudster.

In [None]:
gds.run_cypher('''
    MATCH (c1:FirstPartyFraudster)<-[]-(t:Transaction)<-[]-(c2:Client)
    WHERE NOT c2:FirstPartyFraudster
    WITH c1, c2, sum(t.amount) AS totalAmount
    SET c2:SecondPartyFraudSuspect
    CREATE (c1)<-[:TRANSFER_TO {amount:totalAmount}]-(c2);
''')
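
The two queries above aggregate transaction amounts per direction. As a rough illustration, the same aggregation can be sketched in plain Python on toy data. The function name `aggregate_transfers` and the sample transactions are hypothetical.

```python
from collections import defaultdict


def aggregate_transfers(transactions, fraudsters):
    """Total amount per directed (sender, receiver) pair.

    Only transfers where exactly one side is a flagged fraudster are kept,
    mirroring the WHERE NOT ...:FirstPartyFraudster filter.
    """
    totals = defaultdict(float)
    for sender, receiver, amount in transactions:
        if (sender in fraudsters) != (receiver in fraudsters):
            totals[(sender, receiver)] += amount
    return dict(totals)


transactions = [
    ("fraud_1", "client_a", 100.0),
    ("fraud_1", "client_a", 50.0),
    ("client_a", "fraud_1", 30.0),
    ("client_b", "client_c", 999.0),  # no fraudster involved, ignored
]
transfer_to = aggregate_transfers(transactions, fraudsters={"fraud_1"})
suspects = {client for pair in transfer_to for client in pair} - {"fraud_1"}
print(transfer_to, suspects)
```

Because the key is the directed pair, money sent to a fraudster and money received from one stay in separate totals, which is why the notebook needs one query per direction.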

### Visualize relationships in Neo4j Browser

Visualize the newly created TRANSFER_TO relationships:
```Cypher
MATCH p=(:Client:FirstPartyFraudster)-[:TRANSFER_TO]-(c:Client)
WHERE NOT c:FirstPartyFraudster
RETURN p;
```

### Second-party Fraud

Our objective is to find clients who may have supported the first-party fraudsters but were not identified as potential first-party fraudsters.

Our hypothesis is that clients who perform transactions of type Transfer, in which they either send money to or receive money from first-party fraudsters, are suspects for second-party fraud.

To identify such clients, we use the TRANSFER_TO relationships and follow this recipe:

1. Use WCC (community detection) to identify networks of clients who are connected to first-party fraudsters
2. Use PageRank (centrality) to score clients based on their influence in terms of the amount of money transferred to or from fraudsters
3. Assign a risk score (secondPartyFraudScore) to these clients

### 1. Graph Projection and WCC

Let’s use a native projection to create an in-memory graph with Client nodes and TRANSFER_TO relationships.

In [None]:
g, _ = gds.graph.project('SecondPartyFraudNetwork',
    'Client',
    'TRANSFER_TO',
    relationshipProperties='amount'
)

We will check whether there are any clusters with more than one client in them. If there are, we add a secondPartyFraudGroup tag to those clients so that we can find them later using local queries.

- Write results to the database

In [None]:
gds.wcc.write(g, writeProperty='wccId2')

writeMillis                                                            187
nodePropertiesWritten                                                 2433
componentCount                                                        2378
componentDistribution    {'p99': 1, 'min': 1, 'max': 18, 'mean': 1.0231...
postProcessingMillis                                                     4
preProcessingMillis                                                      0
computeMillis                                                            5
configuration            {'jobId': '91ed3b0b-0267-4312-bbd8-807990ffce4...
Name: 0, dtype: object

In [None]:
gds.run_cypher('''
    MATCH (c:Client)
    WITH c.wccId2 as clusterId, collect(c.id) AS cluster
    WITH clusterId, size(cluster) as clusterSize, cluster
    WHERE clusterSize > 1
    UNWIND cluster as client
    MATCH (c:Client {id:client})
    SET c.secondPartyFraudGroup = clusterId
    SET c:SecondPartyFraudGroup
''')
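
The Cypher above groups clients by `wccId2` and keeps only the groups with more than one member. A minimal pure-Python equivalent on toy data looks like this; the function name and the sample component ids are hypothetical.

```python
from collections import defaultdict


def multi_member_groups(component_by_client):
    """Return {component_id: sorted members} for components of size > 1."""
    groups = defaultdict(list)
    for client, component in component_by_client.items():
        groups[component].append(client)
    return {
        component: sorted(members)
        for component, members in groups.items()
        if len(members) > 1
    }


component_by_client = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 2, "f": 2}
fraud_groups = multi_member_groups(component_by_client)
print(fraud_groups)
```

Client `c` sits alone in its component and is dropped, just as single-client components receive no SecondPartyFraudGroup label.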

In [None]:
gds.run_cypher('''
  MATCH (c:SecondPartyFraudGroup)
  WITH c.secondPartyFraudGroup as groupId, COLLECT(DISTINCT c.name) as names
  RETURN groupId, size(names) as groupSize, names ORDER BY groupSize DESC
''')

Unnamed: 0,groupId,groupSize,names
0,21,18,"[Zoe Burgess, Levi Hogan, Allison Freeman, Ben..."
1,47,16,"[Makayla Gonzalez, Gabriella Buchanan, Joseph ..."
2,8,13,"[Julia Barlow, Grayson Cortez, Evelyn Craig, H..."
3,1465,5,"[Jason Walker, Eva Dillard, Charlotte Foster, ..."
4,34,3,"[Angel Barton, Elizabeth Britt, Landon Welch]"
5,1771,3,"[Brandon Mcintosh, Gabriel Oliver, Julia Ortega]"
6,2077,3,"[Stella Mcconnell, Samantha Mueller, Aaliyah T..."
7,2021,2,"[Damian Lynch, Benjamin Moss]"


### 2. Second-party Fraudster PageRank scores

Use PageRank to find out which suspects have relatively higher fraud scores. Please note that relationships are weighted by the total amount transferred.

- Write results to the database
- Store the PageRank score on each client as the `pageRankScore` property, which serves as the secondPartyFraudScore
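
For intuition about how relationship weights influence the ranking, here is a small weighted PageRank sketch using simple power iteration. It assumes the un-normalised update `score = (1 - d) + d * sum(incoming contributions)`, where each source splits its score across its outgoing relationships in proportion to their weights, and it assumes all weights are positive. The details of the GDS implementation may differ; the function name and toy transfers are hypothetical.

```python
def weighted_pagerank(nodes, edges, damping=0.85, max_iterations=20, tolerance=1e-7):
    """Weighted PageRank by power iteration.

    edges: list of (source, target, weight) with positive weights.
    """
    # Total outgoing weight per node, used to split each node's score.
    out_weight = {node: 0.0 for node in nodes}
    for source, _target, weight in edges:
        out_weight[source] += weight

    scores = {node: 1.0 - damping for node in nodes}
    for _ in range(max_iterations):
        incoming = {node: 0.0 for node in nodes}
        for source, target, weight in edges:
            incoming[target] += scores[source] * weight / out_weight[source]
        new_scores = {
            node: (1.0 - damping) + damping * incoming[node] for node in nodes
        }
        delta = max(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if delta < tolerance:
            break
    return scores


# Toy transfers (hypothetical): the fraudster sends more money to "m" than to "x".
toy_nodes = ["fraudster", "m", "x"]
toy_edges = [("fraudster", "m", 300.0), ("fraudster", "x", 100.0)]
ranks = weighted_pagerank(toy_nodes, toy_edges)
print(ranks)
```

Nodes with no incoming transfers stay at the base score of `1 - damping`, and the client receiving the larger share of the transferred amount ranks higher.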

In [None]:
gds.pageRank.write(g, relationshipTypes=['TRANSFER_TO'], maxIterations=1000, relationshipWeightProperty='amount',
                   writeProperty='pageRankScore')

writeMillis                                                              72
nodePropertiesWritten                                                  2433
ranIterations                                                             3
didConverge                                                            True
centralityDistribution    {'p99': 0.14999961853027344, 'min': 0.14999961...
postProcessingMillis                                                     55
preProcessingMillis                                                       0
computeMillis                                                            64
configuration             {'maxIterations': 1000, 'writeConcurrency': 4,...
Name: 0, dtype: object

In [None]:
gds.run_cypher('''
    MATCH(c:Client)
    RETURN c.id, c.name, c.pageRankScore
    ORDER BY c.pageRankScore DESC LIMIT 10
''')

Unnamed: 0,c.id,c.name,c.pageRankScore
0,4343063345299248,Brayden Weiss,2.04975
1,4977596531678389,Michael Rodriquez,1.833
2,4029043591201321,Brooklyn Harrison,1.8075
3,4288767058170373,Colton Browning,1.48875
4,4912587051525728,Sarah Klein,1.425
5,4583937317122539,Aiden Hurst,1.2975
6,4659802546143350,Eva Dillard,0.640875
7,4446118457512030,Kennedy Keith,0.5325
8,4413678751619160,Stella Mcconnell,0.405
9,4668965540204665,Brandon Mcintosh,0.405


In [None]:
# clean up
g.drop()

graphName                                          SecondPartyFraudNetwork
database                                                             neo4j
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                             2433
relationshipCount                                                       55
configuration                                                           {}
density                                                           0.000009
creationTime                           2023-04-26T04:20:29.659688402+00:00
modificationTime                       2023-04-26T04:20:29.735019677+00:00
schema                   {'graphProperties': {}, 'relationships': {'TRA...
schemaWithOrientation    {'graphProperties': {}, 'relationships': {'TRA...
Name: 0, dtype: object

## End of Module #2

In this module, we accomplished the following tasks:

1. Identified clusters of clients and first-party fraudsters transferring money between them
2. Calculated a second-party fraud score and identified second-party fraudsters