# Fraud Detection on Amazon SageMaker

Consider a scenario in which customers make transactions over time at terminals spread out in space.\
Our aim is to build a system that detects whether a transaction is fraudulent.

### Entities

__Customers__
- geographic coordinates
- spending behavior
- available terminals (we assume customers only make transactions on terminals that lie within a radius of _RADIUS_ of their location)

__Terminals__
- geographic coordinates

__Transactions__
- date
- customer
- terminal
- amount
- label: legitimate (0) or fraudulent (1)

#### Notes

The simulated datasets highlight most of the issues that fraud detection practitioners face with real-world data. In particular, they include __class imbalance__ (fewer than 1% of transactions are fraudulent), a mix of __numerical and categorical features__, non-trivial __relationships between features__, and __time-dependent__ fraud scenarios.

In [None]:
!pip install pandas --upgrade
from functions import *

### Customers generation

In [None]:
customers_df = generate_customers(5000)
customers_df

### Terminals generation

In [None]:
terminals_df = generate_terminals(10000)
terminals_df

Let's take a look at one customer and the terminals available to them:

In [None]:
customer = customers_df.iloc[3]
print(customer)
RADIUS = 5
plot_customer_terminals(customer.x_customer_id, customer.y_customer_id, terminals_df, radius=RADIUS)

Let's associate each customer with the terminals that lie within _RADIUS_ of their location:

In [None]:
# Terminal coordinates as a float array of shape (n_terminals, 2)
x_y_terminals = terminals_df[['x_terminal_id', 'y_terminal_id']].values.astype(float)
# For each customer, list the terminals within RADIUS, then count them
customers_df['available_terminals'] = customers_df.apply(lambda x: get_list_terminals_within_radius(x, x_y_terminals=x_y_terminals, r=RADIUS), axis=1)
customers_df['nb_terminals'] = customers_df.available_terminals.apply(len)
customers_df
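
The helper `get_list_terminals_within_radius` comes from the `functions` module, whose source is not shown here. As a rough illustration of the idea only, the sketch below uses a hypothetical `terminals_within_radius` function. It assumes Euclidean distance and a strict "less than `r`" comparison, and the real helper may differ on both points.

```python
import numpy as np

def terminals_within_radius(customer_xy, x_y_terminals, r):
    """Hypothetical sketch: row indices of terminals closer than r to the customer."""
    customer_xy = np.asarray(customer_xy, dtype=float)
    x_y_terminals = np.asarray(x_y_terminals, dtype=float)
    # Squared coordinate differences between the customer and every terminal
    squared_diff = np.square(x_y_terminals - customer_xy)
    # Euclidean distance from the customer to each terminal
    distances = np.sqrt(squared_diff.sum(axis=1))
    # Keep the positions of terminals strictly inside the radius (assumption)
    return np.flatnonzero(distances < r).tolist()

# Toy data: four terminals, one customer at the origin
terminals = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0], [1.0, 1.0]])
nearby = terminals_within_radius((0.0, 0.0), terminals, r=5)
print(nearby)
```

Computing all distances in one vectorized step keeps the per-customer cost low, which matters when the function is applied row by row to thousands of customers.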

### Transactions generation

Let's generate transactions for every customer over _NUMBER_OF_DAYS_ days:

In [None]:
NUMBER_OF_DAYS = 183
transactions_df = customers_df.groupby('CUSTOMER_ID').apply(lambda x : generate_transactions(x.iloc[0], nb_days=NUMBER_OF_DAYS)).reset_index(drop=True)
transactions_df

Let's plot the generated transactions over time:

In [None]:
plot_transactions_over_time(transactions_df)

Let's add the fraud label to the transactions according to three scenarios:
- __Scenario 1__: any transaction whose amount is more than 220 is fraudulent. This provides an obvious fraud pattern that should always be detected.
- __Scenario 2__: every day, a list of 2 terminals is drawn at random. All transactions on these terminals in the next 28 days are marked as fraudulent.
- __Scenario 3__: every day, a list of 3 customers is drawn at random. In the next 14 days, one third of their transactions have their amounts multiplied by 5 and are marked as fraudulent.
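
To make the first two scenarios concrete, here is a minimal sketch on a toy DataFrame. It is not the `add_frauds` implementation, which is not shown in this notebook. The function name `label_scenarios_1_and_2`, the column names `TX_TIME_DAYS`, `TERMINAL_ID`, `TX_AMOUNT` and `TX_FRAUD`, and the reading of "the next 28 days" as the window `[day, day + 28)` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def label_scenarios_1_and_2(transactions, nb_days, seed=0):
    """Hypothetical sketch of scenarios 1 and 2 (column names are assumed)."""
    labeled = transactions.copy()
    labeled["TX_FRAUD"] = 0
    # Scenario 1: any amount above 220 is fraudulent
    labeled.loc[labeled["TX_AMOUNT"] > 220, "TX_FRAUD"] = 1
    # Scenario 2: each day, draw 2 terminals and flag their next 28 days
    rng = np.random.default_rng(seed)
    terminal_ids = labeled["TERMINAL_ID"].unique()
    for day in range(nb_days):
        compromised = rng.choice(terminal_ids, size=2, replace=False)
        # Assumed window: from the draw day up to, but excluding, day + 28
        in_window = (labeled["TX_TIME_DAYS"] >= day) & (labeled["TX_TIME_DAYS"] < day + 28)
        on_compromised = labeled["TERMINAL_ID"].isin(compromised)
        labeled.loc[in_window & on_compromised, "TX_FRAUD"] = 1
    return labeled

toy = pd.DataFrame({
    "TX_TIME_DAYS": [0, 1, 2, 40],
    "TERMINAL_ID": [10, 11, 10, 11],
    "TX_AMOUNT": [50.0, 250.0, 30.0, 20.0],
})
# nb_days=0 skips the terminal draws, so only scenario 1 applies
only_s1 = label_scenarios_1_and_2(toy, nb_days=0)
# With a single draw over just two terminals, both are compromised on day 0
both = label_scenarios_1_and_2(toy, nb_days=1)
print(only_s1["TX_FRAUD"].tolist(), both["TX_FRAUD"].tolist())
```

Scenario 3 follows the same pattern, with customers drawn instead of terminals and the amounts of the selected transactions modified before labeling.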


In [None]:
transactions_df = add_frauds(customers_df, terminals_df, transactions_df)

In [None]:
plot_transactions_over_time(transactions_df)

Let's plot some statistics of the transactions:

In [None]:
plot_transactions_stats(transactions_df)
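
The class imbalance mentioned in the notes can also be checked numerically. The sketch below is one way to do it, assuming the label is stored in a 0/1 column named `TX_FRAUD`. That name is a guess, since the schema of `transactions_df` is not shown here. It is demonstrated on a toy frame, not on the generated data.

```python
import pandas as pd

def fraud_rate(transactions, label_col="TX_FRAUD"):
    """Share of transactions labeled fraudulent (label column name is assumed)."""
    # The mean of a 0/1 column is the fraction of ones
    return float(transactions[label_col].mean())

toy = pd.DataFrame({"TX_FRAUD": [0, 0, 0, 1]})
rate = fraud_rate(toy)
print(f"{rate:.2%}")
```

On the generated data the call would be `fraud_rate(transactions_df)`, provided the label column carries the assumed name.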

Let's save the data we just generated to S3:

In [None]:
S3_PATH = 's3://sagemaker-us-east-1-996912938507/endtoendmlsm/data/generated/'
save_data(customers_df, terminals_df, transactions_df, S3_PATH)
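
`save_data` is defined in the `functions` module and its file format is not shown here. As an illustration only, a hypothetical helper `save_frames` could write one CSV file per DataFrame under a common prefix. The CSV format and the file names are assumptions. The sketch writes to a temporary local directory; pandas can also write to an `s3://` prefix when the optional `s3fs` package is installed.

```python
import tempfile
import pandas as pd

def save_frames(frames, prefix):
    """Hypothetical sketch: write each DataFrame to <prefix>/<name>.csv."""
    paths = {}
    for name, frame in frames.items():
        # Normalize the prefix so that it works with or without a trailing slash
        path = f"{prefix.rstrip('/')}/{name}.csv"
        frame.to_csv(path, index=False)
        paths[name] = path
    return paths

# Local round trip on toy data
with tempfile.TemporaryDirectory() as tmp_dir:
    toy = {"customers": pd.DataFrame({"CUSTOMER_ID": [0, 1]})}
    written = save_frames(toy, tmp_dir)
    reloaded = pd.read_csv(written["customers"])
    print(reloaded["CUSTOMER_ID"].tolist())
```

Note that a column holding Python lists, such as `available_terminals`, is stored as text in a CSV file and would need to be parsed back after loading.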