# Setup

Import the required modules.

In [None]:
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import sklearn.linear_model
import sklearn.neighbors
import sklearn.neural_network

import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing

Initialize the random seed for reproducibility, so that we all get the same results.

In [None]:
np.random.seed(42)

## Data directory and Google Colab

I write and run notebooks on Google Colab with data on Google Drive. This cell checks if the notebook is running on Google Colab. If so, it connects to Google Drive. Otherwise, it will look for the data in the current directory.

In [None]:
import sys
RUNNING_ON_COLAB = 'google.colab' in sys.modules

if RUNNING_ON_COLAB:
  from google.colab import drive
  drive.mount('/content/drive')
  DIR = "drive/MyDrive"
else:
  DIR = "."

# Comparison of 3 classifiers with credit card transactions

We have learned three classifiers in class:

1. k-Nearest Neighbors (with majority voting, i.e. the prediction is the most common class among the k nearest neighbors);

2. logistic regression (binomial / binary, or multinomial / multi-class);

3. neural networks, or the multi-layer perceptron.

Here, we estimate these three classifiers on a dataset of credit card transactions.
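Students: to make the majority-voting rule in item 1 concrete, here is a minimal from-scratch sketch. The function `knn_predict` and the toy arrays are hypothetical, for illustration only (in the exercises below you would more likely use `sklearn.neighbors`). It assumes Euclidean distance, and a tie goes to whichever label was counted first, which is one reason to prefer an odd `k` for binary problems.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points labeled 0 and three points labeled 1.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_toy = np.array([0, 0, 1, 1, 1])

prediction = knn_predict(X_toy, y_toy, np.array([1.0, 1.1]), k=3)
print(prediction)
```

The three nearest neighbors of the new point all carry label 1, so the vote is unanimous here.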

## Data

We use a publicly available dataset on [credit card transactions](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud). For privacy reasons, the explanatory variables (`V1`-`V28`) in this dataset are obfuscated: they are the components obtained from "dimensionality reduction" of the original features via Principal Component Analysis, or PCA. The other variables are the transaction time (`Time`), the transaction amount (`Amount`), and the label (`Class`: 1 for fraud, 0 for non-fraud). Before running the next cell, you need to extract `creditcard.csv` from `creditcardfraud.zip` into the data directory.

In [None]:
DATA_FILEPATH = os.path.join(DIR, "creditcard.csv")
df = pd.read_csv(DATA_FILEPATH)

Rebalance the dataset by undersampling the legitimate transactions, so that we have approximately the same number of legitimate and fraudulent transactions.

In [None]:
y = df["Class"]

initial_shape = df.shape

legitimate = df[y == 0]
fraudulent = df[y == 1]

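# Sample from the legitimate rows only; sampling from df could also draw
# fraudulent rows, which would then be duplicated by the concatenation below.
legitimate = legitimate.sample(n=fraudulent.shape[0], random_state=42)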

# Students: this line is the dataframe equivalent of list.extend().
# The argument axis=0 ensures that we concatenate two dataframes vertically, by
# adding rows (axis=1 would concatenate horizontally, adding columns).

df = pd.concat([legitimate, fraudulent], axis=0)
final_shape = df.shape

print(f"Rebalanced dataset from {initial_shape} to {final_shape}")
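Students: if the `axis` argument of `pd.concat` is still unclear, the following small sketch may help. The two dataframes `top` and `bottom` are made up for illustration and are not part of the credit card data.

```python
import pandas as pd

# Hypothetical toy dataframes with the same columns.
top = pd.DataFrame({"Amount": [10.0, 20.0], "Class": [0, 0]})
bottom = pd.DataFrame({"Amount": [99.0], "Class": [1]})

# axis=0 stacks the rows: 2 rows + 1 row, same 2 columns.
stacked = pd.concat([top, bottom], axis=0)
print(stacked.shape)

# axis=1 places the dataframes side by side: same 2 rows, 2 + 2 columns.
side_by_side = pd.concat([top, top], axis=1)
print(side_by_side.shape)

# The original row labels are kept, so labels can repeat after stacking.
print(list(stacked.index))
```

Because the original row labels are kept, the stacked dataframe has a repeated label `0`. Passing `ignore_index=True` to `pd.concat` would renumber the rows from 0 instead.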

## TODO: data check and summary statistics

_The first thing to do with a new dataset is to check a few rows of the data and print summary statistics._

## TODO: Plot the data

_The second thing to do with a new dataset is to plot the data. If you want, you can use `legitimate` and `fraudulent` from the cell above, without having to subset the dataframe like we did in lecture._

## TODO: Train-validation-test split: 60-20-20

_Regarding the split proportions: the common 80-20 split is often associated with the Pareto principle, and you can read about it [here](https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio). Here, the 80% is further divided into 60% training data and 20% validation data._
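One possible way to obtain a 60-20-20 split is to call `train_test_split` twice. The sketch below uses made-up stand-in data (`X_demo`, `y_demo`) instead of the credit card dataframe, and stratifying on the label is an assumption, not a requirement of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, balanced binary labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0, 1] * 50)

# Step 1: hold out 20% of the rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

# Step 2: split the remaining 80% into training and validation sets.
# A quarter of the remaining 80% is 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The second call uses `test_size=0.25` rather than `0.20` because it is applied to the 80% that remains after the first call.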

## TODO: Logistic regression / classification


## TODO: k-Nearest Neighbors

_Suggestion: use `k` in `[1, 5, 10, 50, 100]`._
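As a rough sketch of how the suggested values of `k` could be compared on a validation set: the data below is synthetic (generated with `make_classification`), and both the feature scaling and the use of accuracy as the metric are assumptions. For fraud detection you may prefer other metrics from `sklearn.metrics`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the credit card dataset.
X_demo, y_demo = make_classification(n_samples=600, n_features=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# kNN relies on distances, so put the features on a common scale.
# The scaler is fit on the training data only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Fit one model per k and record its validation accuracy.
knn_val_accuracy = {}
for k in [1, 5, 10, 50, 100]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_scaled, y_train)
    knn_val_accuracy[k] = accuracy_score(y_val, model.predict(X_val_scaled))

# Keep the k with the highest validation accuracy.
best_k = max(knn_val_accuracy, key=knn_val_accuracy.get)
print(knn_val_accuracy, best_k)
```

Note that `k` cannot exceed the number of training rows, which matters for `k = 100` on a small training set.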

## TODO: Neural networks

_Suggestion: use the same hidden layers as in lecture: `[(20, 15), (20, 10), (15, 10), (20, 15, 10)]`._
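The same validation loop can be sketched for the suggested hidden layers with `MLPClassifier`. Again, the data is synthetic, and `max_iter=1000`, the scaling step, and the accuracy metric are assumptions chosen for illustration, not settings taken from lecture.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the credit card dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Neural networks usually train more reliably on standardized features.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Fit one network per architecture and record its validation accuracy.
mlp_val_accuracy = {}
for layers in [(20, 15), (20, 10), (15, 10), (20, 15, 10)]:
    model = MLPClassifier(hidden_layer_sizes=layers, max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)
    mlp_val_accuracy[layers] = accuracy_score(y_val, model.predict(X_val_scaled))

# Keep the architecture with the highest validation accuracy.
best_layers = max(mlp_val_accuracy, key=mlp_val_accuracy.get)
print(mlp_val_accuracy, best_layers)
```

Each tuple gives the number of units per hidden layer, so `(20, 15, 10)` is a network with three hidden layers.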

## TODO: Comparison of the three classifiers on test dataset

## TODO: summarize your findings

_What patterns did you find in the data? Which classifier is the best, and why?_