# Fraud Analytics

## Creating a fraud graph

### Set-up

In [None]:
from utils.set_env_variables import set_environment_variables
import os

set_environment_variables()

neptune_cluster_id = "neptunedbcluster-gdaeeuitulg4"
neptune_endpoint = os.getenv('NEPTUNE_ENDPOINT')
neptune_port = os.getenv('NEPTUNE_PORT')
neptune_auth_mode = os.getenv('NEPTUNE_AUTH_MODE')
s3_role_arn = os.getenv('S3_ROLE_ARN')
ssl = os.getenv('SSL')
ssl_verify = os.getenv('SSL_VERIFY')
aws_region = os.getenv('AWS_REGION')

In [None]:
%%graph_notebook_config
{
  "host": "${neptune_endpoint}",
  "neptune_service": "neptune-db",
  "port": "${neptune_port}",
  "auth_mode": "${neptune_auth_mode}",
  "load_from_s3_arn": "${s3_role_arn}",
  "ssl": "${ssl}",
  "ssl_verify": "${ssl_verify}",
  "aws_region": "${aws_region}"
}

### Adding sample data

In [None]:
import graph_notebook as gn
config = gn.configuration.get_config.get_config()

s3_bucket = f"s3://aws-neptune-customer-samples-{config.aws_region}/sample-datasets/gremlin/Fraud/"
region = config.aws_region
load_arn = config.load_from_s3_arn

In [None]:
from utils.neptune_bulk_load import bulk_load_neptune

# Initiate the bulk load process
load_status = bulk_load_neptune(neptune_cluster_id, neptune_endpoint, neptune_port, s3_bucket, "csv", s3_role_arn, aws_region)

print(load_status)

In [None]:
%load_status 9461335e-212e-404f-87fe-bf13d30bebdc

### Switch connection to openCypher

In [None]:
%%graph_notebook_config
{
  "host": "${neptune_endpoint}",
  "neptune_service": "neptune-graph",
  "port": "${neptune_port}",
  "auth_mode": "${neptune_auth_mode}",
  "load_from_s3_arn": "${s3_role_arn}",
  "ssl": "${ssl}",
  "ssl_verify": "${ssl_verify}",
  "aws_region": "${aws_region}"
}

### Set visualization and configuration options

The cell below configures the visualization to use specific colors and icons for the different parts of the data model.

In [None]:
%%graph_notebook_vis_options

{
  "groups": {
    "Account": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf2bb",
        "color": "red"
      }
    },
    "Transaction": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf155",
        "color": "green"
      }
    },
    "Merchant": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf290",
        "color": "orange"
      }
    },
    "DateOfBirth": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf1fd",
        "color": "blue"
      }
    },
    "EmailAddress": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf1fa",
        "color": "blue"
      }
    },
    "Address": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf015",
        "color": "blue"
      }
    },
    "IpAddress": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf109",
        "color": "blue"
      }
    },
    "PhoneNumber": {
      "shape": "icon",
      "icon": {
        "face": "'Font Awesome 5 Free'",
        "weight": "bold",
        "code": "\uf095",
        "color": "blue"
      }
    }
  },
  "edges": {
    "color": {
      "inherit": false
    },
    "smooth": {
      "enabled": true,
      "type": "straightCross"
    },
    "arrows": {
      "to": {
        "enabled": false,
        "type": "arrow"
      }
    },
    "font": {
      "face": "courier new"
    }
  }
}

### Data model
The fraud graph included in this example contains synthetic data that models credit card accounts, account holder information, merchants, and the transactions performed when an account holder purchases goods or services from a merchant.

**Account and features**

An Account has a number of features, including physical Address, IpAddress, DateOfBirth of the account holder, EmailAddress, and contact PhoneNumber. An account holder can have multiple email addresses and phone numbers.

## Identifying Fraud Rings
Detecting fraud rings involves identifying unusual or suspicious patterns in data. These patterns can vary depending on the type of fraud and the context in which it occurs. Here are some common patterns that analysts and machine learning models might look for:

* Unusual Behavior Patterns:
    * Frequency: Unusually high or low transaction frequencies for certain accounts.
    * Time of Activity: Transactions occurring at unusual times or outside regular business hours.
    * Location: Transactions from unexpected or geographically distant locations.

* Transaction Specifics:
    * Transaction Amounts: Unusually large or small transactions compared to historical behavior.
    * Transaction Types: Identifying unusual types of transactions for a specific user.
    
* Social Network Analysis:
    * Connections: Identifying networks of accounts that frequently transact with each other.
    * Topology Analysis: Examining the structure of connections between accounts.
    
These are just a few of the patterns you can look for to find fraud rings. In this notebook we detect anomalous behavior using Social Network Analysis, finding groups of accounts that are disproportionately highly connected with one another. We then use these groups to perform a topological analysis by looking at the structure of the connections between the accounts.

To begin, we run a graph algorithm that finds groups of highly connected nodes. Algorithms that accomplish this belong to a category called `Community Detection`. Community detection algorithms calculate meaningful groups or clusters of nodes within a network, revealing hidden patterns and structures that can provide insights into the organization and dynamics of complex systems.

Neptune Analytics supports a variety of community detection algorithms. This demonstration uses one known as **Label Propagation**.

The label propagation algorithm is a semi-supervised machine learning algorithm that assigns labels to nodes based on the consensus of their neighboring nodes. It starts by assigning labels to a small subset of nodes. These labels are then propagated iteratively: each node adopts the label held by the largest number of its neighbors.
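
To make the propagation step concrete, here is a minimal pure-Python sketch. It is not the Neptune Analytics implementation. It assumes the common unsupervised variant in which every node starts with its own label (rather than seeding a small subset), visits nodes in a fixed order, and breaks ties by picking the largest label so that the run is deterministic. The function name `label_propagation` and the toy graph are illustrative only.

```python
from collections import Counter


def label_propagation(adjacency, max_iterations=20):
    """Return {node: community label} for an undirected graph.

    `adjacency` maps each node to a list of its neighbors.
    Assumption: every node starts in its own community.
    """
    labels = {node: node for node in adjacency}
    for _ in range(max_iterations):
        changed = False
        for node in adjacency:
            neighbors = adjacency[node]
            if not neighbors:
                continue  # isolated nodes keep their own label
            # Count how often each label appears among the neighbors.
            counts = Counter(labels[nb] for nb in neighbors)
            top = max(counts.values())
            # Tie-break on the largest label (an arbitrary, deterministic choice).
            new_label = max(lbl for lbl, cnt in counts.items() if cnt == top)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # no label moved, so the assignment is stable
    return labels


# Toy graph: two triangles joined by a single bridge edge (c-d).
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}
print(label_propagation(toy_graph))
```

On this toy graph the two triangles end up with different labels. Production implementations typically randomize the visiting order and the tie-breaking, so results can vary between runs.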

In [None]:
# from utils.open_cypher_query import NeptuneCypherUtility

# neptune_util = NeptuneCypherUtility(cluster_endpoint=neptune_endpoint, region_name=aws_region)

# community_data = neptune_util.eval_label_propagation()
# print(community_data)

In [None]:
# %%oc

# MATCH (n)
# CALL neptune.algo.labelPropagation(n)
# YIELD community
# RETURN community, count(n) as size
# ORDER BY size DESC

**Investigate anomalous groups using the `community_data` variable.**

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

# Create a NumPy histogram with one bin per unit of size, up to the largest community
# (assumes the results are sorted by size in descending order)
df = pd.DataFrame(community_data['results'])
hist = np.histogram(df.get('size'), bins=community_data['results'][0]['size'])

# Plot the histogram using Plotly
fig = px.bar(hist[0].tolist(), title = "Community Size Distribution")
fig.update_layout(xaxis_title='Community Size', yaxis_title='Occurrences', title_x=0.5)
fig.update_traces(showlegend=False)
fig.show()

**Persist `community_data` to graph.**

In [None]:
# %%oc

# CALL neptune.algo.labelPropagation.mutate({writeProperty: 'community'})

**Query a single community**

In [None]:
%%oc -g community

MATCH (n) 
WITH n.community as community, count(n.community) as community_size 
ORDER BY community_size DESC LIMIT 1
MATCH (n) WHERE n.community = community
MATCH p=(n)-[]->()
RETURN p

### Centrality

Centrality in a graph refers to measures that identify the most important or central nodes in a network graph. Some common centrality measures include:

- Degree centrality - Counts the number of edges connected to a node. Nodes with higher degree are more central or "connected" in the graph.

- Closeness centrality - Calculates how close a node is to all other nodes by finding the shortest paths. Nodes with high closeness can spread information more quickly.

- PageRank - A variant of eigenvector centrality used by Google Search to rank website importance. Important pages are those linked to by other important pages.

In general, nodes with high centrality values are considered influential, visible, and critical to efficient network flow. Centrality helps identify the most important nodes to target in a network.
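
As a rough illustration of two of these measures, the sketch below computes degree centrality and a basic power-iteration PageRank in plain Python. The function names, the damping factor of 0.85, the fixed iteration count, and the edge list are all assumptions made for this example, and the code is independent of how Neptune computes these scores.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Count the edges touching each node, ignoring edge direction."""
    degree = defaultdict(int)
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return dict(degree)


def pagerank(edges, damping=0.85, iterations=50):
    """Basic PageRank by power iteration over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is shared evenly by all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Made-up edges: three nodes point at "hub", and "hub" points back at "a".
sample_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
print(degree_centrality(sample_edges))
print(pagerank(sample_edges))
```

In this example `hub` has both the highest degree and the highest PageRank, while `a` ranks well above `b` and `c` because it is linked to by the important node.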

**Demonstrate evaluation of closeness centrality**
Closeness centrality is the inverse of the average shortest-path distance from a node to all other nodes in a network, so nodes that sit near everything else score highest.
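
The following sketch shows one simple way to compute closeness for a single node of an unweighted graph using breadth-first search. It assumes the score is the number of reachable nodes divided by the sum of hop distances to them. Neptune may normalize the score differently, so treat the numbers as illustrative. The function name and sample graph are hypothetical.

```python
from collections import deque


def closeness_centrality(adjacency, node):
    """Closeness of `node`: reachable node count / sum of hop distances."""
    distances = {node: 0}
    queue = deque([node])
    # Breadth-first search yields shortest hop counts in an unweighted graph.
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    reachable = len(distances) - 1  # exclude the start node itself
    return reachable / total if total else 0.0


# A three-node path: a - b - c
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(closeness_centrality(path, "b"))  # 1.0, the middle node is one hop from everything
print(closeness_centrality(path, "a"))
```

The middle node scores higher than either end of the path, which matches the intuition that it can reach the rest of the network in fewer hops.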

In [None]:
%%oc

MATCH (n) 
CALL neptune.algo.closenessCentrality(n, {numSources: 8192})
YIELD score
RETURN n, score 
ORDER BY score DESC LIMIT 1

**Persist centrality value back into our graph as `centrality`**

In [None]:
%%oc
CALL neptune.algo.closenessCentrality.mutate({numSources: 8192, writeProperty: "centrality"})

## Examining a Fraud Ring

A common workflow for fraud ring investigation is to look at the most important nodes inside an anomalous community. Combine the community anomalies with the centrality measurements to find the 5 most important nodes warranting investigation.
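
The selection logic can be sketched in plain Python, assuming the `community` and `centrality` properties have already been exported into a dictionary keyed by node id. The function name, node ids, and property values below are made up for illustration.

```python
from collections import Counter


def top_nodes_in_largest_community(nodes, limit=5):
    """Return the ids of the `limit` most central nodes in the largest community.

    `nodes` maps node id -> {"community": ..., "centrality": ...}.
    """
    sizes = Counter(props["community"] for props in nodes.values())
    largest, _ = sizes.most_common(1)[0]
    members = [nid for nid, props in nodes.items() if props["community"] == largest]
    return sorted(members, key=lambda nid: nodes[nid]["centrality"], reverse=True)[:limit]


# Hypothetical export of node properties.
sample_nodes = {
    "account-1": {"community": 7, "centrality": 0.41},
    "account-2": {"community": 7, "centrality": 0.63},
    "merchant-1": {"community": 7, "centrality": 0.58},
    "account-3": {"community": 2, "centrality": 0.90},
}
print(top_nodes_in_largest_community(sample_nodes, limit=2))
```

Note that `account-3` is excluded despite its high score because it does not belong to the largest community.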

In [None]:
%%oc -g community

MATCH (n) 
WITH n.community as community, count(n.community) as community_size 
ORDER BY community_size DESC LIMIT 1
MATCH (n) 
WHERE n.community = community
RETURN n
ORDER BY n.centrality DESC LIMIT 5

**Include graph traversal from nodes**

In [None]:
%%oc -g community -sd 30000

MATCH (n) 
WITH n.community as community, count(n.community) as community_size 
ORDER BY community_size DESC LIMIT 1
MATCH (n) 
WHERE n.community = community
WITH n ORDER BY n.centrality DESC LIMIT 5
MATCH p=(n)-[]-()-[]-()
RETURN p

## Analyzing the results

At this juncture, the suspicious nodes would be presented to a subject matter expert (e.g., a fraud analyst) to review and make a final determination as to whether or not they represent fraud.

### Mark as Fraud/Not Fraud

Assume a domain expert has determined that the `merchant-48` node in our graph is fraudulent. Mark this node as fraudulent by setting its `isFraud` property to `True`.

In [None]:
%%oc -d value -l 20
MATCH (a)
WHERE id(a)='merchant-48'
SET a.isFraud=True
RETURN a

### Find all items within three hops of the fraudulent merchant
Entities close to the fraudulent actor may warrant additional scrutiny.
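
To illustrate what "within three hops" means, here is a small breadth-first search sketch over an in-memory adjacency map. It returns the reachable nodes and their hop distance rather than full paths, so it is a simplification of a path query. The function name `within_hops` and the neighbors attached to `merchant-48` are invented for this example.

```python
from collections import deque


def within_hops(adjacency, start, max_hops=3):
    """Return {node: hop distance} for nodes at most `max_hops` edges from `start`."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    del distances[start]  # report only the surrounding nodes
    return distances


# Made-up chain: merchant-48 - txn-1 - account-1 - email-1 - account-2
chain = {
    "merchant-48": ["txn-1"],
    "txn-1": ["merchant-48", "account-1"],
    "account-1": ["txn-1", "email-1"],
    "email-1": ["account-1", "account-2"],
    "account-2": ["email-1"],
}
print(within_hops(chain, "merchant-48"))
```

Here `account-2` is four hops away and is therefore left out, while the shared `email-1` node at three hops is still included.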

In [None]:
%%oc -d value -l 20

MATCH p=(a)-[*1..3]-()
WHERE a.isFraud=True
RETURN p

## Conclusion

This notebook used a credit card dataset with account- and transaction-centric queries to perform a graph-based fraud ring analysis using a guilt-by-association approach. First, we identified the groups in our data. Next, we identified the most influential nodes within these groups and stored this information in our graph. Finally, we used this information to explore the connections around the most influential entities and identify other potentially fraudulent accounts.

Finding and understanding fraud rings is a problem that requires the ability to query, analyze, and explore the connections between accounts, transactions, and account features.  Combining the ability to query a graph with the ability to run network analysis and graph algorithms on top of that data enables us to derive novel insights from this data. 