# DSCI445 Term Project - Bank Account Fraud Detection
### Jakob Wickham, Nick Brady, Noah Sturgeon

This was hell to work on

To whoever wants to update this, read here to know how to make slides: https://www.geeksforgeeks.org/creating-interactive-slideshows-in-jupyter-notebooks/

In [1]:
import pandas as pd, kagglehub

In [2]:
# Downloads the dataset into the local kagglehub cache; only run this if you want a local copy
path = kagglehub.dataset_download("sgpjesus/bank-account-fraud-dataset-neurips-2022")

data: pd.DataFrame = pd.read_csv(f"{path}/Base.csv")

## 1. Introduction

Fraud detection is an important yet volatile problem. With so many factors that can easily blend into one another, how can one tell whether a transaction actually came from the account's owner?

### The Dataset

The Bank Account Fraud (BAF) NeurIPS 2022 datasets are a collection of simulated, reality-based datasets that emulate real-world bank fraud data. The one used here has 32 columns: 31 features and a classification column, `fraud_bool`, stating whether the transaction was fraudulent.

As is typical for fraud detection, the dataset is very imbalanced: it consists mostly of non-fraudulent transactions, with very few fraudulent ones. This was the first of many hurdles we had to overcome.

In [10]:
pd.DataFrame(data.groupby(data['fraud_bool'])['fraud_bool'].count())

| fraud_bool | fraud_bool (count) |
|---|---|
| 0 | 988971 |
| 1 | 11029 |


Because the data is imbalanced, many standard techniques can't be applied to it directly. Simple algorithms like logistic regression, K-nearest neighbors, or random forests don't perform well on the raw dataset as is.

On top of that, the usual performance measures for those models (accuracy, recall, precision, and the like) told us very little: they only looked good on the non-fraud transactions. So we had to dig into the dataset to see what we could do to improve this.
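To make the point about misleading metrics concrete, here is a minimal sketch using only the class counts from the `fraud_bool` table above. The "always predict non-fraud" baseline is a hypothetical model introduced for illustration, not one of the models from this project.

```python
# Class counts taken from the fraud_bool table above.
n_legit = 988_971
n_fraud = 11_029
n_total = n_legit + n_fraud

# Hypothetical baseline (an assumption for illustration):
# predict "not fraud" for every single row.
true_positives = 0          # no fraud case is ever flagged
false_negatives = n_fraud   # every fraud case is missed
true_negatives = n_legit    # every legitimate case is "correct"

accuracy = (true_positives + true_negatives) / n_total
fraud_recall = true_positives / (true_positives + false_negatives)

print(f"fraud rate:        {n_fraud / n_total:.4%}")
print(f"baseline accuracy: {accuracy:.4%}")
print(f"fraud recall:      {fraud_recall:.1%}")
```

A model that never flags anything scores roughly 98.9% accuracy while catching no fraud at all, which is why a single headline number says little on data this imbalanced.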

### Exploratory Data Analysis

So what are some features that we can remove without much worry?

Well, there's this feature:

In [6]:
pd.DataFrame(data.groupby(["device_fraud_count"])["fraud_bool"].count())

| device_fraud_count | fraud_bool (count) |
|---|---|
| 0 | 1000000 |


It's all 0, so that's an easy feature to get rid of. But the rest have some values and meaning to them, so let's explore some more.
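The same check can be done for every column at once. The sketch below runs on a small toy frame rather than the real `data`; the values are made up, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for `data` (values are invented for illustration).
toy = pd.DataFrame({
    "device_fraud_count": [0, 0, 0, 0],
    "fraud_bool": [0, 0, 1, 0],
    "bank_months_count": [3, -1, 12, 7],
})

# A column with a single distinct value carries no information for a model.
constant_columns = [
    col for col in toy.columns if toy[col].nunique(dropna=False) == 1
]

# Drop every constant column in one step.
reduced = toy.drop(columns=constant_columns)

print(constant_columns)
print(list(reduced.columns))
```

On the full dataset, the same list comprehension over `data.columns` should pick out `device_fraud_count`, since the table above shows it is 0 in every row.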

Some numerical features in this dataset contain invalid values, which show up as negative numbers. Let's check how many numerical features have a majority of their data missing or invalid:

In [None]:
numeric_missing_value_columns = ['prev_address_months_count', 'current_address_months_count', 'intended_balcon_amount', 'bank_months_count', 'session_length_in_minutes', 'device_distinct_emails_8w']

missing_data_summary = pd.DataFrame({
    'Missing Data Count': [(data[col] < 0).sum() for col in numeric_missing_value_columns],
    'Percentage Missing': [(data[col] < 0).mean() * 100 for col in numeric_missing_value_columns]
}, index=numeric_missing_value_columns)

missing_data_summary.sort_values(by='Percentage Missing', ascending=False)

| Feature | Missing Data Count | Percentage Missing |
|---|---|---|
| intended_balcon_amount | 742523 | 74.2523 |
| prev_address_months_count | 712920 | 71.292 |
| bank_months_count | 253635 | 25.3635 |
| current_address_months_count | 4254 | 0.4254 |
| session_length_in_minutes | 2015 | 0.2015 |
| device_distinct_emails_8w | 359 | 0.0359 |


More than 70% of the values in `intended_balcon_amount` and `prev_address_months_count` are invalid or missing, so we decided to get rid of these columns.

For the rest of those columns, we decided to impute the invalid values, replacing them with the median of that column. Some models don't work well with missing data, so this was necessary.
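The drop-then-impute step could look roughly like the sketch below. It runs on a toy frame with invented values, and it assumes the median is computed over the valid (non-negative) values only; the original notebook may have computed it differently.

```python
import pandas as pd

# Toy stand-in for `data` (values are invented for illustration).
toy = pd.DataFrame({
    "intended_balcon_amount": [-1.0, -1.0, -1.0, 50.0, 20.0],
    "bank_months_count": [-1, 4, 10, 6, -1],
    "session_length_in_minutes": [3.5, -1.0, 8.0, 2.5, 4.0],
})

majority_missing = ["intended_balcon_amount"]  # mostly invalid: drop
to_impute = ["bank_months_count", "session_length_in_minutes"]

# Drop the columns that are mostly invalid.
cleaned = toy.drop(columns=majority_missing)

for col in to_impute:
    # Turn negative (invalid) entries into NaN so they don't skew the median.
    valid = cleaned[col].where(cleaned[col] >= 0)
    # Assumption: impute with the median of the valid values only.
    cleaned[col] = valid.fillna(valid.median())

print(cleaned)
```

Masking the negatives before taking the median matters most for a column like `bank_months_count`, where about a quarter of the rows are invalid and would otherwise pull the median toward the placeholder value.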