# A fraud example with GNN

In some problems, representing the data as a graph exposes patterns that stay hidden when each record is examined in isolation. This fraud example illustrates such a case.

The example is based on a real case, although it has been heavily simplified.

## Problem setup

The problem we will solve is commonly faced by e-commerce businesses. You have a list of products that users can buy, and a list of users. Users can be consumers or businesses. Consumers face a low barrier to entry: creating a consumer account is easy, and the account is usually not verified. Consumers usually buy far fewer products than businesses. In this specific scenario, bad actors attempt fraud by signing up for consumer accounts and buying lots of products, as if they were businesses.

### The data

For simplicity, the data consists of only three tables: a table of customers, a table of products and a table of purchases.

Customers have two attributes: an identification number (customer_id) and an indicator variable is_consumer that denotes whether they are consumers (1 if they are consumers, 0 if they are businesses).

For simplicity, products have no attributes, only a product identification number called SKU.

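Purchases (stored as the orders table in the code) have three attributes: the id of the customer that made the purchase (customer_id), the id of the product (product_id, which holds the product's SKU) and whether the purchase was fraudulent or not (is_fraud).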


## Generating the data

### Products 
The first table we will generate is the products table. There are 10,000 products, with SKUs running from 100001 to 110000.

In [2]:
import pandas as pd

products_df = pd.DataFrame(range(100001, 110000 + 1), columns=['sku'])

In [3]:
products_df

         sku
0     100001
1     100002
2     100003
3     100004
4     100005
...      ...
9995  109996
9996  109997
9997  109998
9998  109999
9999  110000

[10000 rows x 1 columns]


### Customers

We generate 10,000 customers and assign each one a type at random. A customer has an 87% probability of being a genuine consumer, a 3% probability of being a bad actor posing as a consumer (a label that would be unknown to us in practice) and a 10% probability of being a business.

In [4]:
from numpy.random import default_rng
rng = default_rng()
vals = rng.uniform(low=0.0, high=1.0, size=10000)    

In [5]:
# Collect the rows first and build the DataFrame once at the end;
# calling pd.concat inside the loop copies the whole table on every iteration.
rows = []
for cid, v in enumerate(vals, start=1):
    if v < 0.03:
        rows.append((cid, 1, 1))  # bad actor posing as a consumer (3%)
    elif v < 0.9:
        rows.append((cid, 1, 0))  # genuine consumer (87%)
    else:
        rows.append((cid, 0, 0))  # business (10%)

customers_df = pd.DataFrame(rows, columns=['customer_id', 'is_consumer', 'is_bad_actor'])

print(customers_df)


     customer_id is_consumer is_bad_actor
0              1           1            0
1              2           1            0
2              3           1            0
3              4           1            0
4              5           1            0
...          ...         ...          ...
9995        9996           0            0
9996        9997           1            0
9997        9998           1            0
9998        9999           1            0
9999       10000           0            0

[10000 rows x 3 columns]
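
As a quick sanity check, the observed segment shares can be compared with the 87% / 3% / 10% targets. The sketch below is not part of the original notebook: it regenerates the labels in a vectorized way using the same thresholds, and the helper name `make_customers` and the fixed seed 42 are assumptions chosen only to make the check reproducible.

```python
import numpy as np
import pandas as pd

def make_customers(n, seed=42):
    """Hypothetical helper: same thresholds as the loop above, but vectorized."""
    rng = np.random.default_rng(seed)
    vals = rng.uniform(low=0.0, high=1.0, size=n)
    return pd.DataFrame({
        "customer_id": np.arange(1, n + 1),
        # bad actors (v < 0.03) also carry is_consumer = 1
        "is_consumer": (vals < 0.9).astype(int),
        "is_bad_actor": (vals < 0.03).astype(int),
    })

check_df = make_customers(10_000)

# Share of customers falling in each of the three segments
shares = {
    "bad_actor": (check_df["is_bad_actor"] == 1).mean(),
    "consumer": ((check_df["is_consumer"] == 1) & (check_df["is_bad_actor"] == 0)).mean(),
    "business": (check_df["is_consumer"] == 0).mean(),
}
print(shares)
```

With 10,000 customers the shares should land close to the targets, though not exactly on them, since each label is an independent random draw.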


### Orders

Now that we have customers and products, we can create the table of orders. Although the "is_bad_actor" variable is unknown to us, we can assume that we eventually learn whether each order was fraudulent or not. At that point, we could cancel the accounts that generate lots of fraudulent orders, but in practice bad actors just create new accounts.

The ordering profile is going to be as follows:
- Consumers order on average 5 products, with a fraud rate of 1% (if they are *not* bad actors)
- Businesses order on average 50 products, with a fraud rate of 0.1% (in this example, businesses are never bad actors)
- Bad actors look like consumers but mimic businesses: they order on average 50 products, with a fraud rate of 99%
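
The profile above implies a rough overall fraud share, which can be worked out before generating any orders. The sketch below is a back-of-the-envelope calculation that combines the customer shares from the Customers section with the averages in this list. The dictionary name `profile` is only illustrative, and the estimate ignores the effect of truncating order sizes to whole numbers.

```python
# (share of customers, mean products ordered, fraud rate), taken from the text
profile = {
    "consumer": (0.87, 5, 0.01),
    "business": (0.10, 50, 0.001),
    "bad_actor": (0.03, 50, 0.99),
}

# Expected number of order lines, and of fraudulent order lines, per customer
expected_orders = sum(share * n for share, n, _ in profile.values())
expected_fraud = sum(share * n * rate for share, n, rate in profile.values())
expected_fraud_share = expected_fraud / expected_orders

print(f"order lines per customer: {expected_orders:.2f}")
print(f"expected fraud share:     {expected_fraud_share:.4f}")
```

Under these assumptions roughly 14% of order lines are expected to be fraudulent, and almost all of them come from the 3% of customers who are bad actors.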

In [6]:
# Accumulate plain lists and build the DataFrame once at the end
order_customers, order_products, order_frauds = [], [], []

for _, row in customers_df.iterrows():
    cid = row['customer_id']
    # Mean and standard deviation of the order size, and the fraud rate, per segment
    if row['is_bad_actor'] == 1:
        mean, std, fraud_rate = 50, 10, 0.99
    elif row['is_consumer'] == 1:
        mean, std, fraud_rate = 5, 2, 0.01
    else:  # business
        mean, std, fraud_rate = 50, 10, 0.001

    # Draw the order size as a scalar; negative draws are clipped to zero
    n_products = max(0, int(rng.normal(mean, std)))
    # endpoint=True makes the upper bound inclusive, so SKU 110000 can be drawn too
    product_list = rng.integers(100001, 110000, size=n_products, endpoint=True)
    # Each order line is fraudulent with probability fraud_rate
    fraud_list = (rng.uniform(low=0.0, high=1.0, size=n_products) < fraud_rate).astype(int)

    order_customers.extend([cid] * n_products)
    order_products.extend(product_list.tolist())
    order_frauds.extend(fraud_list.tolist())

orders_df = pd.DataFrame({'customer_id': order_customers,
                          'product_id': order_products,
                          'is_fraud': order_frauds})

In [7]:
orders_df['is_fraud'].sum() / orders_df['is_fraud'].count()

0.1485726601226513
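
The overall ratio hides how different the three segments are. A minimal sketch of a per-segment breakdown follows, assuming the column names used in this notebook. The helper name `fraud_by_segment` is hypothetical, and the toy tables are hand-made for illustration only (they are not the generated data).

```python
import pandas as pd

def fraud_by_segment(orders_df, customers_df):
    """Hypothetical helper: order volume and fraud rate per customer segment."""
    merged = orders_df.merge(customers_df, on="customer_id", how="left")
    # Derive one segment label per order line from the two indicator columns
    merged["segment"] = "business"
    merged.loc[merged["is_consumer"] == 1, "segment"] = "consumer"
    merged.loc[merged["is_bad_actor"] == 1, "segment"] = "bad_actor"
    merged["is_fraud"] = merged["is_fraud"].astype(int)
    return merged.groupby("segment").agg(
        n_orders=("is_fraud", "size"),
        n_customers=("customer_id", "nunique"),
        fraud_rate=("is_fraud", "mean"),
    )

# Toy tables: customer 1 is a consumer, 2 a bad actor, 3 a business
toy_customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "is_consumer": [1, 1, 0],
    "is_bad_actor": [0, 1, 0],
})
toy_orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3, 3],
    "product_id": [100001, 100002, 100003, 100004, 100005, 100001, 100006],
    "is_fraud": [0, 0, 1, 1, 0, 0, 0],
})
print(fraud_by_segment(toy_orders, toy_customers))
```

Calling `fraud_by_segment(orders_df, customers_df)` on the generated tables should give fraud rates near the 1%, 0.1% and 99% set in the ordering profile. Keep in mind that the bad_actor row is only available here because the data is simulated; in a real setting that label is unknown.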

## Saving the data for the next step

Let's save the data in CSVs so that we can start from here in the next step.

In [8]:
customers_df.to_csv("customers.csv", index=False)
products_df.to_csv("products.csv", index=False)
orders_df.to_csv("orders.csv", index=False)
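
For the next step, the files can be read back with `pd.read_csv`. The sketch below assumes the three file names used above; the helper `load_tables` and its consistency checks are hypothetical additions, and the round trip is demonstrated on throwaway toy tables in a temporary directory rather than on the generated data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_tables(data_dir="."):
    """Hypothetical helper: read back the three CSVs and run basic checks."""
    data_dir = Path(data_dir)
    customers = pd.read_csv(data_dir / "customers.csv")
    products = pd.read_csv(data_dir / "products.csv")
    orders = pd.read_csv(data_dir / "orders.csv")
    # Every order line should reference a known customer and a known SKU
    assert orders["customer_id"].isin(customers["customer_id"]).all()
    assert orders["product_id"].isin(products["sku"]).all()
    return customers, products, orders

# Round trip on toy tables (illustration only)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    pd.DataFrame({"customer_id": [1, 2], "is_consumer": [1, 0], "is_bad_actor": [0, 0]}).to_csv(
        tmp / "customers.csv", index=False)
    pd.DataFrame({"sku": [100001, 100002]}).to_csv(tmp / "products.csv", index=False)
    pd.DataFrame({"customer_id": [1, 2], "product_id": [100002, 100001], "is_fraud": [0, 0]}).to_csv(
        tmp / "orders.csv", index=False)
    customers, products, orders = load_tables(tmp)

print(orders)
```

Note that the products table names its key `sku` while the orders table calls it `product_id`, so any join between the two has to map one name onto the other.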