# Analysis code for detecting fraud

We first survey methods for determining whether a given transaction is fraudulent. Only a few possible explanations of fraud are available to us, and only some of these can be tested with the data we have.

Information can be found here:
- https://www.bluefin.com/support/identifying-fraudulent-transactions/

In [1]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("MAST30034 Project 2")
    .config("spark.driver.memory", '4g')
    .config("spark.executor.memory", '8g')
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.sql.parquet.enableVectorizedReader","false")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .getOrCreate()
)

22/09/15 17:28:55 WARN Utils: Your hostname, dash_surface resolves to a loopback address: 127.0.1.1; using 172.31.10.93 instead (on interface eth0)
22/09/15 17:28:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/09/15 17:28:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/09/15 17:28:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


### Read in the transactional data

Until the transactional preprocessing is done, we use the raw file.

In [3]:
sdf_transactions1 = spark.read.parquet('../data/tables/transactions_20210228_20210827_snapshot')
sdf_transactions2 = spark.read.parquet('../data/tables/transactions_20210828_20220227_snapshot')
sdf_transactions3 = spark.read.parquet('../data/tables/transactions_20220228_20220828_snapshot')

sdf_transactions = sdf_transactions1.union(sdf_transactions2)
sdf_transactions = sdf_transactions.union(sdf_transactions3)

                                                                                

In [7]:
sdf_transactions.printSchema()

root
 |-- user_id: long (nullable = true)
 |-- merchant_abn: long (nullable = true)
 |-- dollar_value: double (nullable = true)
 |-- order_id: string (nullable = true)
 |-- order_datetime: date (nullable = true)



In [5]:
sdf_transactions.count()

                                                                                

14195505

In [15]:
sdf_transactions.select('order_id').distinct().count()

                                                                                

14195505

In [16]:
sdf_transactions

user_id,merchant_abn,dollar_value,order_id,order_datetime
18478,62191208634,63.255848959735246,949a63c8-29f7-4ab...,2021-08-20
2,15549624934,130.3505283105634,6a84c3cf-612a-457...,2021-08-20
18479,64403598239,120.15860593212784,b10dcc33-e53f-425...,2021-08-20
3,60956456424,136.6785200286976,0f09c5a5-784e-447...,2021-08-20
18479,94493496784,72.96316578355305,f6c78c1a-4600-4c5...,2021-08-20
3,76819856970,448.529684285612,5ace6a24-cdf0-4aa...,2021-08-20
18479,67609108741,86.4040605836911,d0e180f0-cb06-42a...,2021-08-20
3,34096466752,301.5793450525113,6fb1ff48-24bb-4f9...,2021-08-20
18482,70501974849,68.75486276223054,8505fb33-b69a-412...,2021-08-20
4,49891706470,48.89796461900801,ed11e477-b09f-4ae...,2021-08-20


No order ID is duplicated (the distinct count equals the row count), so checking for multiple purchases of the same item isn't possible.

Given our current data, the only way we can check for fraud is to look for customers who make numerous purchases in a single day. We will check how this relates to fraud and try to build a heuristic from it.

In [17]:
from pyspark.sql import functions as F

In [51]:
user_purchase = sdf_transactions.groupBy(F.col('user_id'), F.col('order_datetime')).agg({
    'user_id': 'count'
})

In [25]:
user_purchase.count()

                                                                                

8976957

In [24]:
user_purchase

                                                                                

user_id,order_datetime,count(user_id)
18488,2021-08-20,3
686,2021-08-20,2
19292,2021-08-20,1
778,2021-08-20,2
786,2021-08-20,1
19476,2021-08-20,2
19497,2021-08-20,2
19628,2021-08-20,2
19631,2021-08-20,1
19672,2021-08-20,3


In [30]:
# Make folder directory to save per user information

import os

if not os.path.exists("../data/curated/fraud_analysis"):
    os.makedirs("../data/curated/fraud_analysis")

In [31]:
##################################
# DON'T RUN, NOT TIME EFFICIENT
##################################

# For each user ID, find min, max and average number of purchases, then create a list of tuples for easier access.
# Each tuple will contain (user_id, min, max, avg)
# As a first step, each user's daily counts are written to their own parquet folder.
# Every write is a separate Spark job, which is why this loop is so slow.

max_user_id = 24081  # Largest user_id in database

for user_id in range(1, max_user_id + 1):
    user_purchase.where(F.col('user_id') == user_id).write.parquet(
        f"../data/curated/fraud_analysis/user_{user_id}"
    )

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.10/dist-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: 
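
The per-user summary described in the comments above (min, max and average daily purchases) does not need one job per user; it can be accumulated in a single pass over the grouped rows. Below is a minimal standard-library sketch of that logic. The rows and the function name `summarise_daily_counts` are hypothetical and only mimic the shape of `user_purchase`. In Spark, the same result would presumably come from a single `groupBy('user_id')` followed by `agg` with `F.min`, `F.max` and `F.avg`.

```python
from collections import defaultdict

# Hypothetical (user_id, order_datetime, daily_count) rows, shaped like user_purchase.
rows = [
    (1, "2021-08-20", 2),
    (1, "2021-08-21", 6),
    (1, "2021-08-22", 1),
    (2, "2021-08-20", 3),
    (2, "2021-08-21", 3),
]


def summarise_daily_counts(rows):
    """Return {user_id: (min, max, avg)} of daily purchase counts in one pass."""
    # Running state per user: [min, max, total purchases, number of days]
    stats = defaultdict(lambda: [float("inf"), 0, 0, 0])
    for user_id, _date, count in rows:
        s = stats[user_id]
        s[0] = min(s[0], count)
        s[1] = max(s[1], count)
        s[2] += count
        s[3] += 1
    return {u: (s[0], s[1], s[2] / s[3]) for u, s in stats.items()}


summary = summarise_daily_counts(rows)
for user_id, (lo, hi, avg) in sorted(summary.items()):
    print(user_id, lo, hi, avg)
```

Each row is visited once, so the cost grows with the number of rows rather than with the number of users.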

In [34]:
user_1 = user_purchase.where(F.col('user_id') == 1)

In [36]:
user_1.agg(F.max("count(user_id)")).first()[0]

                                                                                

6

In [37]:
user_purchase.agg(F.max("count(user_id)")).first()[0]

                                                                                

12

In [39]:
user_purchase.where(F.col("count(user_id)") == 12)

                                                                                

user_id,order_datetime,count(user_id)
5369,2021-12-26,12


In [42]:
# Read in fraud data

fraud_sdf = spark.read.csv("../data/tables/consumer_fraud_probability.csv", header=True)

In [45]:
fraud_sdf

user_id,order_datetime,fraud_probability
6228,2021-12-19,97.6298077657765
21419,2021-12-10,99.24738020302328
5606,2021-10-17,84.05825045251777
3101,2021-04-17,91.42192091901347
22239,2021-10-19,94.70342477508036
16556,2022-02-20,89.65663294494827
10278,2021-09-28,83.59136689427714
15790,2021-12-30,71.77065889280253
5233,2021-08-29,85.87123303878818
230,2021-08-28,86.28328808934151


In [58]:
user_purchase.where((F.col("user_id") == 21419)).count()

                                                                                

380

In [48]:
user_purchase.where((F.col("user_id") == 21419) & (F.col("order_datetime") == "2021-12-10"))

user_id,order_datetime,count(user_id)
21419,2021-12-10,1


In [52]:
sdf_transactions.where((F.col("user_id") == 21419) & (F.col("order_datetime") == "2021-12-10"))

user_id,merchant_abn,dollar_value,order_id,order_datetime
21419,23686790459,67706.74019097649,079cc8aa-eadd-4f3...,2021-12-10


In [55]:
sdf_transactions.where(F.col("user_id") == 21419).select(F.mean("dollar_value"))

                                                                                

avg(dollar_value)
302.1584895025078


In [56]:
sdf_transactions.where(F.col("user_id") == 21419).select(F.stddev("dollar_value"))

                                                                                

stddev_samp(dollar_value)
2851.042573410824


It appears that the probability of fraud can possibly be found by treating a transaction as an 'extreme' value under some assumed distribution for the general customer base and calculating how likely such a value is. Thus, we will analyse the data we have to identify outliers, which we can then convert into probabilities under that distribution.

We also check if fraud is detectable by a large number of purchases made by a single person over a small timeframe.
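
The outlier idea can be illustrated with a small worked example, assuming (purely for illustration) that a customer's transaction values are normally distributed. It reuses the mean, standard deviation and flagged transaction value shown above for user 21419. The normality assumption and the helper names `z_score` and `upper_tail_probability` are ours, and this is not necessarily how `fraud_probability` in the provided data was produced.

```python
import math

# Values taken from the cells above for user 21419.
mean_value = 302.1584895025078
std_value = 2851.042573410824
flagged_value = 67706.74019097649


def z_score(value, mean, std):
    """Number of standard deviations that value lies above the mean."""
    return (value - mean) / std


def upper_tail_probability(z):
    """P(Z > z) for a standard normal Z; erfc stays accurate for large z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z = z_score(flagged_value, mean_value, std_value)
p = upper_tail_probability(z)
print(f"z = {z:.2f}, upper-tail probability = {p:.3g}")
```

Under this assumption the flagged transaction lies more than 23 standard deviations above the customer's mean. The mean and standard deviation used here include the flagged transaction itself, so a more robust estimate of the customer's usual spending may be preferable.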

In [None]:
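# Join fraud data with transactional data on user and date (the date condition is assumed)

fraud_sdf.join(sdf_transactions, how='left', on=((fraud_sdf.user_id == sdf_transactions.user_id)
    & (fraud_sdf.order_datetime == sdf_transactions.order_datetime)))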

## Feature engineering

The possible features that we came up with are:
1. Revenue band levels per transaction
2. The difference between a purchase amount and the customer's 'average' spending
3. A burst of very frequent purchases from a customer who usually doesn't purchase much
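
Features 2 and 3 from the list above could be computed along the following lines. This is a rough standard-library sketch on made-up transactions, not the final feature pipeline; the function `build_features` and the feature names `amount_minus_user_mean` and `daily_count_over_user_mean` are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical transactions: (user_id, order_datetime, dollar_value).
transactions = [
    (7, "2021-12-01", 40.0),
    (7, "2021-12-02", 60.0),
    (7, "2021-12-03", 50.0),
    (7, "2021-12-03", 55.0),
    (7, "2021-12-03", 45.0),
    (7, "2021-12-03", 950.0),
]


def build_features(transactions):
    """Return one feature dict per transaction."""
    amounts = defaultdict(list)   # user_id -> all dollar values
    daily_counts = Counter()      # (user_id, day) -> purchases that day
    for user_id, day, value in transactions:
        amounts[user_id].append(value)
        daily_counts[(user_id, day)] += 1

    mean_amount = {u: mean(v) for u, v in amounts.items()}

    counts_by_user = defaultdict(list)
    for (user_id, _day), n in daily_counts.items():
        counts_by_user[user_id].append(n)
    mean_daily = {u: mean(c) for u, c in counts_by_user.items()}

    return [
        {
            "user_id": u,
            "order_datetime": d,
            # Feature 2: how far this purchase is from the user's average spend
            "amount_minus_user_mean": v - mean_amount[u],
            # Feature 3: this day's purchase count relative to the user's usual day
            "daily_count_over_user_mean": daily_counts[(u, d)] / mean_daily[u],
        }
        for u, d, v in transactions
    ]


features = build_features(transactions)
for row in features:
    print(row)
```

In this toy data the last transaction stands out on both features: it is far above the user's mean spend and falls on a day with twice the user's usual number of purchases.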