<a href="https://colab.research.google.com/github/mihir-shah-bh/graph-summit-apac-2023/blob/develop/GDS_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Graph Data Science Workshop with Neo4j

Click on the link below to open a Colab version of the notebook. You will be able to create your own version.

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/neo4j-field/graph-summit-apac-2023/blob/main/GDS_Workshop.ipynb" target="_blank">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo">Run your own notebook in Colab
    </a>
  </td>
</table>

---
## Target

Perform fraud analysis on a set of people and transactions using graphs and data science.  

## Context

This notebook allows you to load a dataset based on an updated version of [PaySim](https://www.sisu.io/posts/paysim/).  
PaySim uses an agent-based model and anonymized, aggregate transactional data from a real mobile money network operator to create synthetic financial datasets that academics and hackers can use to explore ways of detecting fraudulent behavior.  
Using this [code](https://github.com/voutilad/paysim), you can generate your own dataset with different characteristics (size, fraud occurrences...).    

We're going to leverage [Neo4j Graph Data Science (GDS)](https://neo4j.com/docs/graph-data-science/current/algorithms/) to investigate the data and uncover fraud patterns and fraudsters.  

## Dataset

The dataset used in this notebook represents money transfers between around 2500 clients, 75 merchants, 5 banks with 175000 transactions across 30 days.  
There are 5 types of transactions:  
* CashIn: a client moves money into the network via a merchant
* CashOut: a client moves money out of the network via a merchant
* Debit: a client moves money into a bank
* Transfer: a client sends money to another client
* Payment: a client exchanges money for something via a merchant

We will try to identify clients who are fraudsters, potentially targeting other users with fake accounts to accept payments for goods that are stolen, illegal or even non-existent.  
We added some client details (Phone, Email, SSN) to the original PaySim model to help identify clients using fake or stolen credentials to gain access to the system (first party fraud).  

---

## Let's get a graph database

We will use a Neo4j graph database created on the [Neo4j sandbox](https://neo4j.com/sandbox/).  
Once connected, on the _Select a project_ page, go to the section _Your own data_ and select the _Blank Sandbox_.  
Click on the _Create_ button at the bottom of the page.  
After a few seconds, you should see the screen below.  
<img src="https://github.com/mihir-shah-bh/graph-summit-apac-2023/blob/develop/img/sandbox_start.png?raw=1" alt="Sandbox Start" width="75%" title="Sandbox Start">  

And once it's up and running, you can access the connection details by clicking on the top right down arrow and picking the *Connection details* tab.  
You will need 2 things:
* Password  
* Bolt URL   

<img src="https://github.com/mihir-shah-bh/graph-summit-apac-2023/blob/develop/img/sandbox_details.png?raw=1" alt="Sandbox Details" width="75%" title="Sandbox Details">  

---

## Let's code

First we will install and import the [Neo4j GDS Python library](https://pypi.org/project/graphdatascience/).  

In [1]:
# Install Neo4j GDS Python Client
import sys
!{sys.executable} -m pip install graphdatascience

# Import our GDS entry point
from graphdatascience import GraphDataScience

Collecting graphdatascience
  Downloading graphdatascience-1.7-py3-none-any.whl (938 kB)
Collecting multimethod<2.0,>=1.0 (from graphdatascience)
  Downloading multimethod-1.9.1-py3-none-any.whl (10 kB)
Collecting neo4j<6.0,>=4.4.2 (from graphdatascience)
  Downloading neo4j-5.9.0.tar.gz (188 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting textdistance<5.0,>=4.0 (from graphdatascience)
  Downloading textdistance-4.5.0-py3-none-any.whl (31 kB)
Building wheels for collected packages: neo4j
  Building wheel for neo4j (pyproject.toml) ...

### Instantiate your GDS Session

Use Neo4j/Bolt URI and credentials according to your setup  

For local standalone instance Bolt connection without auth    
`gds = GraphDataScience("bolt://localhost:7687", auth=None)`  

For local standalone instance Bolt connection with auth    
`gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "<password>"))`  

For remote cluster Neo4j connection with auth  
`gds = GraphDataScience("neo4j://<FQDN or IP Address>:7687", auth=("neo4j", "<password>"))`  

For remote standalone instance Bolt connection with auth   
`gds = GraphDataScience("bolt://<FQDN or IP Address>:7687", auth=("neo4j", "<password>"))`
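To avoid pasting a password into the notebook itself, the connection details can be read from environment variables. This is a minimal sketch only: the variable names `NEO4J_URI`, `NEO4J_USER` and `NEO4J_PASSWORD` and the helper `connection_settings` are assumptions made for illustration and are not part of the workshop or of the `graphdatascience` library.

```python
import os


def connection_settings(env=None):
    """Return (uri, auth) suitable for GraphDataScience(uri, auth=auth).

    The environment variable names are hypothetical; adapt them to your setup.
    """
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    password = env.get("NEO4J_PASSWORD")
    # No password configured: fall back to an unauthenticated connection.
    auth = (env.get("NEO4J_USER", "neo4j"), password) if password else None
    return uri, auth


# Demonstration with an explicit dictionary instead of the real environment.
uri, auth = connection_settings(
    {"NEO4J_URI": "bolt://example.invalid:7687", "NEO4J_PASSWORD": "<password>"}
)
print(uri, auth[0])
# gds = GraphDataScience(uri, auth=auth)  # same call as in the examples above
```

Passing a plain dictionary keeps the helper easy to try out without touching the real environment.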

In [3]:
# >> Update the password and the URL here <<
gds = GraphDataScience("bolt://3.236.29.67:7687", auth=("neo4j", "fire-carrier-sea"))

### Check the GDS version installed

In [4]:
print(f"Neo4j GDS Version: {gds.version()}")

Neo4j GDS Version: 2.3.6


### Optional - Set database if you're not using the default _neo4j_ database.

Not applicable for Neo4j Sandbox as we have only one database named _neo4j_.

In [5]:
#gds.set_database("my-db")

### Cleaning the database or making it ready for a rerun of the notebook.
We are starting with a fresh, clean database. If the database was previously loaded, however, we can clear it out first here. We will then load the data from CSV files by running [Cypher](https://neo4j.com/developer/cypher/) queries. Set the RELOAD_DATA flag to False to skip the reload when experimenting with different algorithms later on.

In [None]:
# Set flag to control reloading of all data
RELOAD_DATA = True


if RELOAD_DATA: # Delete everything; takes a few minutes on a full database
    gds.run_cypher(
        """
        MATCH (n) CALL {
          WITH n
          DETACH DELETE n
        } IN TRANSACTIONS OF 10 ROWS;
        """
    )
else: # Reset the properties written by GDS when rerunning the notebook without erasing everything
    gds.run_cypher(
        """
        MATCH (c:Client) SET c.fraud_group = null, c.intra_fraud_group = null, c.score = null;
        """
    )
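`IN TRANSACTIONS OF 10 ROWS` commits the delete in small batches instead of one large transaction. The batching idea itself, independent of Neo4j, can be sketched in plain Python. The helper `batched` and the node ids below are made up for illustration.

```python
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


node_ids = range(25)  # stand-in for the nodes matched by MATCH (n)
batch_sizes = [len(batch) for batch in batched(node_ids, 10)]
print(batch_sizes)  # [10, 10, 5]
```

Each batch corresponds to one committed transaction, which keeps memory use bounded on a large database.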

### Test reading some data

Using [LOAD CSV](https://neo4j.com/docs/cypher-manual/current/clauses/load-csv/), we are loading csv files into the database, creating the graph on the fly.  
The first cell is to test the file access, by reading it and showing only the first 5 rows.  

In [None]:
# Checking if we can access the data
if RELOAD_DATA:
    nodeListCSV = gds.run_cypher(
    """
    LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/clients.csv" AS row
    RETURN row.NAME as Name, row.PHONENUMBER as phoneNumber, row.SSN as SSN, row.EMAIL as email LIMIT 5
    """
    )

    # The object returned is a pandas DataFrame, so we can explore it using standard pandas methods.
    # Kept inside the if block so the cell does not fail when RELOAD_DATA is False.
    display(nodeListCSV.head(5))

### Creating constraints and indexes

For data integrity, we will create [constraints](https://neo4j.com/docs/cypher-manual/current/constraints/) to have a robust graph data model. Each constraint enforces uniqueness of an identifier for a given label. An index is also created for the name property on Client nodes, which allows fast lookups when querying clients by name.

In [None]:
if RELOAD_DATA:
    # Create the uniqueness constraints first, then the index on Client.name
    CONSTRAINTS = [
      "CREATE CONSTRAINT ClientConstraint IF NOT EXISTS FOR (p:Client) REQUIRE p.id IS UNIQUE;",
      "CREATE CONSTRAINT EmailConstraint IF NOT EXISTS FOR (p:Email) REQUIRE p.email IS UNIQUE;",
      "CREATE CONSTRAINT PhoneConstraint IF NOT EXISTS FOR (p:Phone) REQUIRE p.phoneNumber IS UNIQUE;",
      "CREATE CONSTRAINT SSNConstraint IF NOT EXISTS FOR (p:SSN) REQUIRE p.ssn IS UNIQUE;",
      "CREATE CONSTRAINT MerchantConstraint IF NOT EXISTS FOR (p:Merchant) REQUIRE p.id IS UNIQUE;",
      "CREATE CONSTRAINT BankConstraint IF NOT EXISTS FOR (p:Bank) REQUIRE p.id IS UNIQUE;",
      "CREATE CONSTRAINT TransactionConstraint IF NOT EXISTS FOR (p:Transaction) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE CONSTRAINT DebitConstraint IF NOT EXISTS FOR (p:Debit) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE CONSTRAINT CashInConstraint IF NOT EXISTS FOR (p:CashIn) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE CONSTRAINT CashOutConstraint IF NOT EXISTS FOR (p:CashOut) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE CONSTRAINT TransferConstraint IF NOT EXISTS FOR (p:Transfer) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE CONSTRAINT PaymentConstraint IF NOT EXISTS FOR (p:Payment) REQUIRE p.globalStep IS UNIQUE;",
      "CREATE INDEX      ClientNameIndex IF NOT EXISTS FOR (n:Client) ON (n.name)"
    ]
    for c in CONSTRAINTS:
        gds.run_cypher(c)
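The constraint list is repetitive, so the statements could also be generated from a mapping of label to key property. This is only a sketch: `UNIQUE_KEYS` and `constraint_statement` are names introduced here for illustration, and the generated text simply follows the pattern of the statements in the cell above.

```python
# Label -> property that must be unique, taken from the statements above.
UNIQUE_KEYS = {
    "Client": "id",
    "Email": "email",
    "Phone": "phoneNumber",
    "SSN": "ssn",
    "Merchant": "id",
    "Bank": "id",
    "Transaction": "globalStep",
}


def constraint_statement(label, prop):
    """Build one CREATE CONSTRAINT statement in the style used above."""
    return (
        f"CREATE CONSTRAINT {label}Constraint IF NOT EXISTS "
        f"FOR (p:{label}) REQUIRE p.{prop} IS UNIQUE;"
    )


statements = [constraint_statement(label, prop) for label, prop in UNIQUE_KEYS.items()]
print(statements[0])
# for statement in statements:
#     gds.run_cypher(statement)   # requires the gds session created earlier
```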

### Loading all the data

We will load 7 csv files:  
* one for clients   
* one for merchants  
* five for transactions  

We can see how each node is created with a label and at least one property.  
Each transaction is itself a node, linked to the client who made it by a PERFORMED relationship and to its recipient by a TO relationship.  

In [None]:
if RELOAD_DATA:

    # Load Clients data
    gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/clients.csv" AS row
        WITH row
        MERGE (c:Client { id: row.ID })
        SET c.name = row.NAME
        MERGE (p:Phone { phoneNumber: row.PHONENUMBER })
        MERGE (c)-[:HAS_PHONE]->(p)
        MERGE (s:SSN { ssn: row.SSN })
        MERGE (c)-[:HAS_SSN]->(s)
        MERGE (e:Email { email: row.EMAIL })
        MERGE (c)-[:HAS_EMAIL]->(e);
    """
    )

    # Load Merchants data
    gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/merchants.csv" AS row
        WITH row
        MERGE (m:Merchant { id: row.ID })
        SET m.name = row.NAME, m.highRisk = toBoolean(row.HIGHRISK);
    """
    )

    # Load Debit data
    gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/debit.csv" AS row
        WITH row
        MERGE (b:Bank { id: row.IDDEST })
        SET b.name = row.NAMEDEST
        MERGE (c:Client { id: row.IDORIG })
        MERGE (t:Transaction:Debit { globalStep: toInteger(row.GLOBALSTEP) })
        SET t.amount = toFloat(row.AMOUNT)
        MERGE (t)-[:TO]->(b)
        MERGE (c)-[:PERFORMED]->(t);
    """
    )

    # Load CashIn data (the largest file, so this takes a few seconds)
    gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/cashin.csv" AS row
        CALL {
            WITH row
            MERGE (m:Merchant { id: row.IDDEST })
            SET m.name = row.NAMEDEST
            MERGE (c:Client { id: row.IDORIG })
            MERGE (t:Transaction:CashIn { globalStep: toInteger(row.GLOBALSTEP) })
            SET t.amount = toFloat(row.AMOUNT)
            MERGE (t)-[:TO]->(m)
            MERGE (c)-[:PERFORMED]->(t)
        } IN TRANSACTIONS OF 10 ROWS;
    """
    )

    # Load CashOut data
    gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/cashout.csv" AS row
        CALL {
            WITH row
            MERGE (m:Merchant { id: row.IDDEST })
            SET m.name = row.NAMEDEST
            MERGE (c:Client { id: row.IDORIG })
            SET c.name = row.NAMEORIG
            MERGE (t:Transaction:CashOut { globalStep: toInteger(row.GLOBALSTEP) })
            SET t.amount = toFloat(row.AMOUNT)
            MERGE (t)-[:TO]->(m)
            MERGE (c)-[:PERFORMED]->(t)
        } IN TRANSACTIONS OF 10 ROWS;
    """
    )

    # Load Payment data
    gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/payment.csv" AS row
        CALL {
            WITH row
            MERGE (m:Merchant { id: row.IDDEST })
            SET m.name = row.NAMEDEST
            MERGE (c:Client { id: row.IDORIG })
            SET c.name = row.NAMEORIG
            MERGE (t:Transaction:Payment { globalStep: toInteger(row.GLOBALSTEP) })
            SET t.amount = toFloat(row.AMOUNT)
            MERGE (t)-[:TO]->(m)
            MERGE (c)-[:PERFORMED]->(t)
        } IN TRANSACTIONS OF 5 ROWS;
    """
    )

    # Load Transfer data
    gds.run_cypher(
    """
    LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/data/transfer.csv" AS row
    CALL {
        WITH row
        MERGE (cd:Client { id: row.IDDEST })
        SET cd.name = row.NAMEDEST
        MERGE (co:Client { id: row.IDORIG })
        SET co.name = row.NAMEORIG
        MERGE (t:Transaction:Transfer { globalStep: toInteger(row.GLOBALSTEP) })
        SET t.amount = toFloat(row.AMOUNT)
        MERGE (t)-[:TO]->(cd)
        MERGE (co)-[:PERFORMED]->(t)
    } IN TRANSACTIONS OF 5 ROWS;
    """
    )

We have now taken a series of flat data sources and constructed a rich graph representation of the connections present in the sample dataset. At this point we have the following data model:

<img src="https://github.com/mihir-shah-bh/graph-summit-apac-2023/blob/develop/img/initial_data_model.png?raw=1" alt="Initial graph data model" width="75%"  title="Initial Graph Data Model">  

---
### Enriching the graph

Using the transaction details, we are able to enrich the model by adding the ordering of the transaction using the global step (a synthetic timestamp of sorts).

In [None]:
if RELOAD_DATA:
    # Update data model with new relationships
    gds.run_cypher(
    """
    MATCH (c:Client) with c.id as clientId
    CALL {
        WITH clientId
        MATCH (c:Client {id: clientId})-[:PERFORMED]->(tx:Transaction)
        WITH c, tx ORDER BY tx.globalStep
        WITH c, collect(tx) AS txs
        WITH c, txs, head(txs) AS _start, last(txs) AS _last

        MERGE (c)-[:FIRST_TX]->(_start)
        MERGE (c)-[:LAST_TX]->(_last)
        WITH c, apoc.coll.pairsMin(txs) AS pairs

        UNWIND pairs AS pair
          WITH pair[0] AS a, pair[1] AS b
          MERGE (a)-[n:NEXT]->(b)
    } IN TRANSACTIONS OF 10 ROWS;
    """
    )

These changes have created a new layer in the graph, where relationships show transactions in chronological order:

<img src="https://github.com/mihir-shah-bh/graph-summit-apac-2023/blob/develop/img/enhanced_data_model.png?raw=1" alt="Enhanced graph data model" width="75%" title="Enhanced Graph Data Model">
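The enrichment step can be illustrated without a database. The sketch below mirrors what the Cypher does for a single client: sort the transactions by global step, pick the first and last, and link each transaction to the next. `pairs_min` is a plain-Python stand-in for `apoc.coll.pairsMin`, and the step values are made up.

```python
def pairs_min(items):
    """Adjacent pairs of a list: [a, b, c] -> [[a, b], [b, c]]."""
    return [[a, b] for a, b in zip(items, items[1:])]


# Hypothetical globalStep values for one client's transactions, unordered.
steps = [42, 7, 19]
ordered = sorted(steps)

first_tx, last_tx = ordered[0], ordered[-1]   # targets of FIRST_TX and LAST_TX
next_pairs = pairs_min(ordered)               # one NEXT relationship per pair
print(first_tx, last_tx, next_pairs)
```

A client with a single transaction gets the same node as both first and last transaction and no NEXT relationships, since there are no adjacent pairs.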

This allows us to query and view transaction data in different ways. For example, we can show John's transactions as a group or in their order, as below:

Performed transactions | Ordered transactions
- | -
![alt](https://github.com/mihir-shah-bh/graph-summit-apac-2023/blob/develop/img/performed_relationships.png?raw=1) | ![alt](https://github.com/mihir-shah-bh/graph-summit-apac-2023/blob/develop/img/ordered_relationships.png?raw=1)




### Having a first look at the dataset

Neo4j maintains statistics of the various node labels and relationship types found in the active database. We can gain access to this information by calling apoc.meta.stats with our Python client. Here we simply ask for the stats and then list the relative frequency of each label (transaction types, clients, merchants, etc.) in our dataset.

In [None]:
result = gds.run_cypher(
    """
    CALL apoc.meta.stats() YIELD nodeCount, labels
    UNWIND keys(labels) as label
    RETURN label as nodeLabel,
        labels[label] as frequency,
        round(toFloat(labels[label])/nodeCount, 3) as relativeFrequency
    ORDER BY frequency DESC
    """
)
result

### Let's look at how money is exchanged across entities

Here we are using some Cypher aggregations to perform analysis across all transactions in the database. The first phase calculates the total count and value of all transactions. Next we aggregate on each transaction type (label) to calculate the relative value and count percentages of each.

In [None]:
result = gds.run_cypher(
    """
    MATCH (t:Transaction)
    WITH sum(t.amount) AS globalSum, count(t) AS globalCnt
    MATCH (t:Transaction)
    WITH labels(t)[1] as txType, count(t) as txCnt, sum(t.amount) as txTotal, globalSum, globalCnt
    RETURN
        txType,
        toInteger(round(txTotal/1000000)) + 'M' AS TotalMarketValue,
        round(100 * txTotal / globalSum, 1) AS `%MarketValue`,
        round(100 * toFloat(txCnt) / globalCnt, 1) AS `%MarketTransactions`,
        toInteger(txTotal / txCnt) AS AvgTransactionValue,
        txCnt AS NumberOfTransactions
    ORDER BY `%MarketTransactions` DESC
    """
)
result
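The same two-phase aggregation (global totals first, then per-type shares) can be sketched in plain Python on a handful of made-up rows. The amounts and the `summary` structure are illustrative assumptions, not values from the dataset.

```python
from collections import defaultdict

# Made-up (type, amount) rows standing in for Transaction nodes.
transactions = [
    ("CashIn", 100.0),
    ("CashIn", 300.0),
    ("Payment", 50.0),
    ("Transfer", 550.0),
]

# Phase 1: totals across all transactions.
global_sum = sum(amount for _, amount in transactions)
global_cnt = len(transactions)

# Phase 2: totals per transaction type.
totals = defaultdict(float)
counts = defaultdict(int)
for tx_type, amount in transactions:
    totals[tx_type] += amount
    counts[tx_type] += 1

summary = {
    tx_type: {
        "pct_value": round(100 * totals[tx_type] / global_sum, 1),
        "pct_transactions": round(100 * counts[tx_type] / global_cnt, 1),
        "avg_value": int(totals[tx_type] / counts[tx_type]),
    }
    for tx_type in totals
}
print(summary["CashIn"])
```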

---
## Let's do some Graph Data Science!

Now that our graph is constructed and filled with data, we can use Neo4j Graph Data Science to look for anomalies in the graph that are often associated with fraudulent behaviour.

One of the unique features of the Neo4j platform is that GDS can be used on projections generated directly from a live transactional database. This is critical for identifying fraud as it occurs, as opposed to performing batch analysis on stale data after the fact.

Our first step is to define a new graph projection. Projections are created in memory from live data and may be used immediately for analysis. Our first graph projection will be used to analyse client identification data provided at signup. All projections must be uniquely named, so we first check whether our 'firstPartyFraud' graph already exists and, if so, drop it so that it can be recreated.

---

In [None]:
# Name of our first graph projection, which we will use with the WCC algorithm
graphName = 'firstPartyFraud'

# Remove any existing projection with the same name, in case the notebook is rerun
if gds.graph.exists(graphName).exists:
    gds.graph.drop(gds.graph.get(graphName))

### Discovering first party fraud

By checking connections among clients based on the identity information from their accounts, we can identify potentially fake profiles and shady clients. Our projection only needs to contain the slice of data pertinent to the type of analysis we are performing. Projections may also be created directly from a Cypher query to target even more specific data when required, however in this case we are using a 'native projection' based on the types of nodes and relationships only.

We can start with a memory estimate of our projection. This is an optional yet useful step for ensuring that our GDS instance is large enough for the task at hand. No projection is created here; the call only returns an estimate of the in-memory footprint the projection would have.  

In [None]:
gds.graph.project.estimate(
    ['Client', 'SSN', 'Email', 'Phone'],     # Nodes to be added in the projection
    ['HAS_SSN', 'HAS_EMAIL', 'HAS_PHONE'])   # Relationships to be added in the projection

We can see from the output that projections are highly optimised and very compact in memory as they contain only the information we request (in this case only selected nodes and their connections, no unnecessary properties).


### Next, create the projection to be used

Once we are happy with the estimate, we use very similar syntax to create the actual projection. Here we pass our graph name and receive a reference to the projection in the variable 'projection'. In addition, projectionPandas is returned with some statistics about the creation and the projection itself.

In [None]:
projection, projectionPandas = gds.graph.project(
    graphName,
    ['Client', 'SSN', 'Email', 'Phone'],
    ['HAS_SSN', 'HAS_EMAIL', 'HAS_PHONE'])

projectionPandas


### Selecting our algorithm - Weakly Connected Components

One hallmark of first party fraud is re-use of stolen personal data in the creation of multiple fraudulent accounts. Often bad actors purchase the same stolen information and it will therefore present as a number of accounts with various combinations of the same information.  

When you look at the information shared by multiple accounts as a graph, groups of fraudulent accounts tend to form a densely connected subgraph. Legitimate accounts are typically isolated or share only a little information (maybe a phone number or email in common), whereas large groups of clients with many connections are often associated with stolen information.

The [Weakly Connected Components](https://neo4j.com/docs/graph-data-science/current/algorithms/wcc/) algorithm is well suited to this purpose: it identifies sets of nodes that are connected to each other, ignoring relationship direction, but not connected to the rest of the graph. If we can find larger groups of connected clients using WCC, there is a strong chance they are related to stolen data reuse and first party fraud.
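To make the idea concrete, here is a small union-find sketch of weakly connected components on a toy identity graph. It illustrates the concept only and is not how GDS implements WCC; the node names are made up, and the component ids it produces are arbitrary labels.

```python
def weakly_connected_components(nodes, edges):
    """Map each node to a component id using union-find, ignoring edge direction."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, shortening the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_t] = root_s   # merge the two components

    return {node: find(node) for node in nodes}


# Toy identity graph: clients c1 and c2 share a phone, c3 is on its own.
nodes = ["c1", "c2", "c3", "phone1", "email1", "email3"]
edges = [("c1", "phone1"), ("c2", "phone1"), ("c1", "email1"), ("c3", "email3")]

components = weakly_connected_components(nodes, edges)
print(components["c1"] == components["c2"], components["c1"] == components["c3"])
```

Clients c1 and c2 land in the same component because they share an identifier node, which is exactly the signal we are looking for.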


### Running WCC in streaming mode

Algorithms can be run in a number of modes depending on the use case. The streaming mode returns the result of an algorithm as a stream, just like the result of a Cypher query. Let's try executing the WCC algorithm on our new projection in streaming mode.

In [None]:
result = gds.wcc.stream(projection)
result.head(10)

In [None]:
result.groupby(['componentId']).count().sort_values('nodeId', ascending=False).head(10)

As we can see, WCC in streaming mode returns the component (or group) ID for each node in the projection. Listing the rows alone is not very informative, but we can count the nodes in each group and find the IDs of the largest groups discovered by the WCC algorithm (note that these counts include all nodes in the group, including the identifying data nodes).


### Writing group information directly to the database

While streaming mode is useful for identifying and analysing groups directly in the notebook, we may want to actually write group membership information into a property on the node itself to enable further analysis (either in Bloom or using it in a subsequent algorithm execution).

We can use the write mode directly on algorithm execution. In this case the WCC algorithm will write the componentId directly to the node into a property of our choice. This works well when we are happy for all nodes to have data written.

In the example below, however, we *only* want to write a 'fraud_group' property when the node belongs to a group containing more than one client; groups with a single client are not of interest to our analysis. In this case we use Cypher to process the output of the WCC algorithm, choose only those clients in a group of size > 1, then manually write each componentId using a SET clause. Note also that this method only matches on the Client label, as clients are the only group members we are interested in analysing.
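The filtering logic can be sketched in plain Python on a few made-up stream rows. The row layout (label, client id, component id) is an assumption chosen for illustration; in the notebook the client id comes from the node's `id` property.

```python
from collections import defaultdict

# Made-up stream rows: (node label, PaySim client id or None, componentId).
rows = [
    ("Client", "c1", 0), ("Phone", None, 0), ("Client", "c2", 0),
    ("Client", "c3", 1), ("Email", None, 1),
]

clients_by_component = defaultdict(list)
for label, client_id, component_id in rows:
    if client_id is not None:          # identifier nodes carry no client id
        clients_by_component[component_id].append(client_id)

# Keep only components with more than one client, as the WHERE clause does.
fraud_group = {
    client_id: component_id
    for component_id, client_ids in clients_by_component.items()
    if len(client_ids) > 1
    for client_id in client_ids
}
print(fraud_group)
```

Client c3 receives no group because it is the only client in its component.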

In [None]:
#result_wcc = gds.wcc.write(projection, writeProperty='fraud_group')
#result_wcc

result_wcc = gds.run_cypher("""
CALL gds.wcc.stream('""" + graphName + """') YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId).id) AS clientIds       // Fetch the node from the db and collect its PaySim id
WITH *, size(clientIds) AS groupSize WHERE groupSize > 1                 // clientIds is a list of PaySim ids; keep groups of more than one
UNWIND clientIds AS clientId                                             // Unwind the list, MATCH each client, and tag it individually.
    MATCH (c:Client {id:clientId})
    SET c.fraud_group = componentId;
""")
result_wcc

### Take a closer look at our potential fraud groups

Now that we have identified our possible fraud groups and have a property recording which group (if any) each client belongs to, we can use this data to look more closely at the larger groups. We also create an index on our new fraud_group property to make Cypher queries referencing this property even faster.

In [None]:
# Create an index on the new property just created by the wcc algorithm on Clients
gds.run_cypher("CREATE INDEX ClientFraudIndex IF NOT EXISTS FOR (c:Client) ON (c.fraud_group);")

In [None]:
# Look at the community created by the algorithm
# We can see the biggest community has 10 elements
result = gds.run_cypher("""
  MATCH (c:Client) WHERE c.fraud_group IS NOT NULL
  WITH c.fraud_group AS groupId, collect(c.id) AS members
  WITH groupId, size(members) AS groupSize
  WITH collect(groupId) AS groupsOfSize, groupSize
  RETURN groupSize, size(groupsOfSize) AS numOfGroups, groupsOfSize as FraudGroupIds
  ORDER BY groupSize DESC;
""")
result.head(10)

## Using Bloom to visualise fraud groups

<img src="https://github.com/mihir-shah-bh/graph-summit-apac-2023/blob/develop/img/opening_bloom.png?raw=1" alt="Opening Bloom" width="75%"  title="Opening Bloom">

Let's take a look at some of these communities in Neo4j Bloom. We will download the perspective from the bloom directory of the workshop GitHub repository and import it.

<a id="raw-url" href="https://raw.githubusercontent.com/neo4j-field/graph-summit-apac-2023/main/bloom/graph_summit_workshop.json">Click here to download Bloom Perspective</a>

Now use the import button in Bloom to add the perspective to our new Bloom instance.

<img src="https://github.com/mihir-shah-bh/graph-summit-apac-2023/blob/develop/img/import_perspective.png?raw=1" alt="Import Bloom Perspective" width="75%" title="Import Bloom Perspective">

We can now click on the perspective to open it and explore the dataset further. For example, searching for "Find client with name Carson Wynn" in the search bar and then using a scene action (right click) on Carson's node to "Show suspected fraud group" provides us with information about the data this user has in common with others in the group.

Try the following search phrases using Bloom

* Find client with name John Kirby
  * Select and right click on John's node to use scene actions
* Show largest first party fraud groups
* Find client with name Carson Wynn
  * Select and right click on Carson to explain fraud group

<img src="https://github.com/mihir-shah-bh/graph-summit-apac-2023/blob/develop/img/first_party_fraud.png?raw=1" alt="Fraud group 4162" width="75%" title="Fraud group 4162">



## Finding interconnections *between* fraud groups

While finding, identifying and removing fraudulent accounts is a great use case for graph analytics, the real power comes from being able to dig deeper and further into connections using multiple datasets assembled into a powerful representation of connections in your data.

Now that we have identified suspected fraudulent accounts, what can we learn from any transaction activity they have been able to perform? There must be a way for members to profit from these accounts, and looking deeper at connections between the groups might lead us to central players in a larger fraud operation.

The following Cypher looks at transactional relationships that members of larger fraud groups have with accounts outside of their immediate group. Obviously transfers within the group are expected but looking at how money moves out of the group is a key to finding the central actors in a larger organisation.

In [None]:
# Focus on fraud groups with more than 5 members
fraudGroupMinSize = 5

result = gds.run_cypher("""
  MATCH (c:Client) WHERE c.fraud_group IS NOT NULL
  WITH c.fraud_group AS groupId, collect(c.id) AS members
  WITH groupId, size(members) AS groupSize WHERE groupSize > $gs
  MATCH (:Client {fraud_group:groupId})-[]-(txn:Transaction)-[]-(c:Client)
  WHERE c.fraud_group IS NULL
  UNWIND labels(txn) AS txnType
  RETURN distinct(txnType), count(txnType);
""", params= {'gs': fraudGroupMinSize} )
result

Here we see that, among the roughly 175,000 transactions in the dataset, a relatively small number emanate outward from these groups. We can capture this information as a layer in the graph and use it to further analyse these "suspicious" connections.

### Let's create a new property to identify suspect clients

Let's use these suspicious connections to create a new meta-graph of TRANSACTED_WITH relationships. The Cypher code below identifies these suspects transacting outside of each fraud ring, marks them with a suspect property and connects them together with the new relationship type.


In [None]:
result = gds.run_cypher("""
  MATCH (c:Client) WHERE c.fraud_group IS NOT NULL
  WITH c.fraud_group AS groupId, collect(c.id) AS members
  WITH groupId, size(members) AS groupSize WHERE groupSize > $gs
  MATCH (c1:Client {fraud_group:groupId})-[]-(t:Transaction)-[]-(c2:Client)
  WHERE c2.fraud_group IS NULL
  SET c1.suspect = true, c2.suspect = true
  MERGE (c1)-[r:TRANSACTED_WITH]->(c2)
  ON CREATE SET r += t
  RETURN count(r);
""", params= {'gs': fraudGroupMinSize})
result
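The marking step can be sketched in plain Python on made-up data. All names and values below are illustrative assumptions: `fraud_group` maps each client to its group (or `None`), and each transfer is a (sender, receiver) pair standing in for a Client-Transaction-Client path.

```python
# Made-up data: fraud_group per client (None = not in any group).
fraud_group = {"a": 1, "b": 1, "c": None, "d": None}
group_sizes = {1: 6}
min_size = 5
transfers = [("a", "b"), ("a", "c"), ("d", "b"), ("c", "d")]

suspects = set()
transacted_with = set()
for sender, receiver in transfers:
    # The Cypher pattern is undirected, so try both orientations.
    for inside, outside in ((sender, receiver), (receiver, sender)):
        group = fraud_group[inside]
        if group is None or group_sizes[group] <= min_size:
            continue                      # not a member of a large fraud group
        if fraud_group[outside] is None:  # counterpart is outside every group
            suspects.update((inside, outside))
            transacted_with.add((inside, outside))

print(sorted(suspects), sorted(transacted_with))
```

The transfer between a and b stays inside the group and the one between c and d involves no group member, so neither creates a TRANSACTED_WITH edge.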

### We found some suspect transactions, let's investigate them using WCC again

Now that we have built a metagraph of transactions between members of our previously identified fraud groups and other clients outside of those groups, we are able to create a projection of these "intra-group" connections for further analysis.

In [None]:
graphName2 = 'intraGroupTransactions'

# Remove existing graph with the same name
if gds.graph.exists(graphName2).exists:
    gds.graph.drop(gds.graph.get(graphName2))

### Creating a new projection using only suspect clients

This time we are using a [Cypher projection](https://neo4j.com/docs/graph-data-science/current/management-ops/projections/graph-project-cypher/) that specifically targets only those client nodes marked as suspects and our new TRANSACTED_WITH relationships.

In [None]:
projection2, projectionPandas2 = gds.graph.project.cypher(graphName2,
          'MATCH (c:Client {suspect:true}) RETURN id(c) AS id',
          'MATCH (c1:Client {suspect:true})-[r:TRANSACTED_WITH]->(c2:Client) RETURN id(c1) AS source, id(c2) as target')
projectionPandas2

We can now run the Weakly Connected Components algorithm across this subgraph projection to identify any communities that exist in this connected web of individual fraud groups. Note this time we are executing the WCC algorithm in 'write' mode, which will directly write the detected group id for every projected node as a property on these nodes in the database.

In [None]:
result = gds.wcc.write(projection2, writeProperty='intra_fraud_group');

In [None]:
# Create an index on the new property
gds.run_cypher("CREATE INDEX IntraGroupIndex IF NOT EXISTS FOR (c:Client) ON (c.intra_fraud_group);")

In [None]:
result = gds.run_cypher("""
MATCH (c:Client) WHERE c.intra_fraud_group IS NOT NULL
WITH c.intra_fraud_group AS intraGroupId, collect(c.id) AS members
RETURN intraGroupId, size(members) AS groupSize
ORDER BY groupSize DESC;
""")
result.head(5)

## Finding the *really* bad actors using Betweenness Centrality

Now that we have discovered communities of transaction activity between our original first party fraud groups, what information can we glean from them? Typically when we have multiple 'cells' of fraudulent activity, there needs to be a process for 'exiting' or profiting from the underlying activity. Often we are looking for a central entity via which most of the fraudulent activity will eventually flow.

These accounts may at first have appeared legitimate and have been created with unique credentials that did not flag them as fraudulent, however using the [Betweenness Centrality](https://neo4j.com/docs/graph-data-science/current/algorithms/betweenness-centrality/) Graph Data Science algorithm we can quickly find these central nodes in a wider fraud operation.

We will make our final projection from the largest "intra-fraud" group discovered in the previous step. This will be the group that connects the most first party fraud "cells" together, and it is likely to uncover any particularly important or central players in the wider operation.
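As an illustration of what betweenness centrality measures, the sketch below runs Brandes' algorithm on a tiny undirected toy graph. It is not the GDS implementation, the node names are made up, and the exact values GDS reports depend on how the projection treats relationship direction.

```python
from collections import deque


def betweenness(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph {node: [neighbours]}."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search counting shortest paths from `source`.
        order = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        paths[source], distance[source] = 1, 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    paths[neighbour] += paths[node]
                    predecessors[neighbour].append(node)
        # Accumulate dependencies from the farthest nodes back to the source.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}


# Toy graph: "hub" sits between three otherwise unconnected accounts.
graph = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = betweenness(graph)
print(scores)
```

Every shortest path between two of the outer accounts passes through the hub, so the hub gets the only non-zero score. This is the pattern we hope to see for a central player in the fraud operation.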

In [None]:
graphName3 = 'betweenness'

# Remove existing graph with the same name
if gds.graph.exists(graphName3).exists:
    gds.graph.drop(gds.graph.get(graphName3))

In [None]:
# This projection selects only the largest intra fraud group community
projection3, projectionPandas3 = gds.graph.project.cypher(graphName3,
    """MATCH (c:Client) WHERE c.intra_fraud_group IS NOT NULL WITH c.intra_fraud_group AS secondGroupId, collect(c.id) AS members
       WITH secondGroupId, size(members) AS groupSize ORDER BY groupSize DESC LIMIT 1
       MATCH (c:Client {intra_fraud_group:secondGroupId})-[r:TRANSACTED_WITH]-(c2:Client)
       RETURN id(c) AS id
    """,
    """MATCH (c:Client) WHERE c.intra_fraud_group IS NOT NULL WITH c.intra_fraud_group AS secondGroupId, collect(c.id) AS members
       WITH secondGroupId, size(members) AS groupSize ORDER BY groupSize DESC LIMIT 1
       MATCH (c1:Client {intra_fraud_group:secondGroupId})-[:TRANSACTED_WITH]-(c2:Client)
       RETURN id(c1) AS source, id(c2) AS target
    """)

In [None]:
result = gds.betweenness.write(projection3, writeProperty='score')
result

## Using Bloom to highlight key fraudsters!

Now that we are able to identify the largest intra-fraud group community and have calculated a betweenness centrality score for each of the nodes in this group, we can visualise this data using Bloom. Using a [saved Cypher search phrase](https://neo4j.com/docs/bloom-user-guide/current/bloom-tutorial/search-phrases-advanced/), Bloom is able to search for and render the largest community.

Bloom is also able to use rule-based scene rendering to colour and size nodes and relationships based on any of their data properties. In this example we have used the betweenness centrality score calculated above to highlight the central or important nodes in the suspected fraud community.

Try the following search phrase in Bloom

* Show intra-group transactions
  * Use Bloom rule based styling to highlight centrality results

<img src="https://github.com/mihir-shah-bh/graph-summit-apac-2023/blob/develop/img/betweeness_analysis.png?raw=1" alt="Visualising betweeness centrality" width="100%" height="100%" title="Visualising betweeness centrality">
