Testing Branch

# Business Statement

- Problem solved
    - Hypothesis: non-fraud behavior does not change over time (the data has consistent spatial and temporal features)
        - Consistency score as a feature?
 
- Prediction of fraud + the reason for fraud
    - Traditional models (interpretability) vs black-box models

- Current Challenges


# EDA (Data Understanding)

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc


In [None]:
%pip install -U imbalanced-learn

In [None]:
cc_fraud = pd.read_csv('files/creditcard.csv')
cc_fraud.head()

In [None]:
transactions = pd.read_csv('files/cc_transactions.csv')
transactions.head()

In [None]:
baf_base = pd.read_csv('files/Base.csv')
baf_base.head()

In [None]:
baf_base.corr(numeric_only=True)

In [None]:
baf_base.info()

In [None]:
transactions.info()

In [None]:
'''
target = transactions['FRAUD']
features = transactions[['AMOUNT', 'TIME_SECONDS', 'DURING_WEEKEND']]
print(target.describe())
features.describe()
'''

In [None]:
'''
plt.figure(figsize=(20, 5))
for i, col in enumerate(features.columns):
    plt.subplot(1, 3, i+1)
    plt.plot(transactions[col], target, 'o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('FRAUD')
    plt.tight_layout()
'''

In [None]:
y = transactions['FRAUD']
X = transactions.drop(columns=['FRAUD', 'CUSTOMER_ID', 'DATETIME'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)


In [None]:
dt = DecisionTreeClassifier(criterion='entropy', random_state=10)
dt.fit(X_train, y_train)

In [None]:
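# Score with the predicted fraud probability (class 1) rather than hard 0/1
# labels, so the ROC curve can use every threshold the tree provides
y_score = dt.predict_proba(X_test)[:, 1]

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc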

In [None]:
max_depths = list(range(1, 33))
train_results = []
test_results = []
for max_depth in max_depths:
    dt = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth, random_state=10)
    dt.fit(X_train, y_train)
    # AUC on the training set, from predicted fraud probabilities
    train_scores = dt.predict_proba(X_train)[:, 1]
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_scores)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    # AUC on the held-out test set
    test_scores = dt.predict_proba(X_test)[:, 1]
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, test_scores)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

plt.figure(figsize=(12,6))
plt.plot(max_depths, train_results, 'b', label='Train AUC')
plt.plot(max_depths, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('Tree depth')
plt.legend()
plt.show()

- Current Datasets: Kaggle CC Fraud + Bank Account Fraud
- Distributions, trends, outliers
- Look at Normal (0) vs Fraud (1) class imbalance
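
The class-imbalance check can be sketched as below. The series here is a small synthetic stand-in (an assumption for illustration); in this notebook the input would be a target column such as `transactions['FRAUD']`.

```python
import pandas as pd

# Synthetic stand-in for the target column (0 = normal, 1 = fraud).
# In the notebook this would be something like transactions['FRAUD'].
fraud = pd.Series([0] * 95 + [1] * 5, name='FRAUD')

# Absolute counts and relative share of each class
class_counts = fraud.value_counts().sort_index()
class_share = fraud.value_counts(normalize=True).sort_index()

print(class_counts)
print(class_share)
```

A heavily skewed share is what motivates the resampling step and the choice of recall and AUC over plain accuracy later on.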

# Data Processing

- Nulls
- Apply filters to realize our assumptions
- Standardize variables
- SMOTE to balance target
- Train/Test split
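
A minimal sketch of these steps on synthetic data follows. It assumes the split happens first, so that the scaler and the oversampling only ever see training rows. The `smote_sketch` helper is a hypothetical, hand-rolled illustration of the SMOTE idea (interpolating between minority neighbours); in the notebook itself the `SMOTE` class from imbalanced-learn would presumably take its place.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Hypothetical helper: create n_new synthetic minority rows by
    interpolating between a minority row and one of its k nearest
    minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours because each row normally comes back as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Synthetic data: 1000 rows, 5% fraud, one missing value
rng = np.random.default_rng(10)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:50] = 1
X[0, 0] = np.nan

# Nulls: drop rows with any missing feature
keep = ~np.isnan(X).any(axis=1)
X, y = X[keep], y[keep]

# Split first (stratified), so later steps cannot leak test information
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=10)

# Standardize using statistics from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Oversample the minority class in the training set until classes are balanced
minority = X_train_s[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_syn = smote_sketch(minority, n_new)
X_bal = np.vstack([X_train_s, X_syn])
y_bal = np.concatenate([y_train, np.ones(n_new, dtype=int)])
```

The test set is left untouched by resampling, so evaluation still reflects the original class balance.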


# Features

Types:
- Account related features: account number, card exp date, etc.
- Transaction related features: POS number, transaction time, amount, etc.
- Customer related features: customer number, type of customer, etc.

Feature transformation:
- Date/time variables: weekday or weekend
- Customer spending: average spending amount + number of transactions
- Risk score: average number of fraudulent transactions over a given window
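
One way these transformations might look in pandas is sketched below, on a tiny hand-made frame that reuses the `CUSTOMER_ID`, `DATETIME`, `AMOUNT`, `FRAUD`, and `DURING_WEEKEND` column names from the transactions data. The new column names and the window definition (the last two prior transactions per customer) are assumptions for illustration. Each feature uses only earlier transactions, so a row never sees its own label.

```python
import pandas as pd

# Tiny hand-made example; the real input would be the transactions frame
tx = pd.DataFrame({
    'CUSTOMER_ID': [1, 1, 1, 2, 2],
    'DATETIME': pd.to_datetime([
        '2023-01-06 10:00',  # Friday
        '2023-01-07 11:00',  # Saturday
        '2023-01-09 09:00',  # Monday
        '2023-01-07 12:00',  # Saturday
        '2023-01-08 13:00',  # Sunday
    ]),
    'AMOUNT': [10.0, 30.0, 20.0, 100.0, 50.0],
    'FRAUD': [0, 1, 0, 0, 0],
})
tx = tx.sort_values(['CUSTOMER_ID', 'DATETIME']).reset_index(drop=True)

# Date/time: 1 for Saturday (5) or Sunday (6), else 0
tx['DURING_WEEKEND'] = (tx['DATETIME'].dt.dayofweek >= 5).astype(int)

# Customer spending: count and mean amount of the customer's earlier transactions
amounts = tx.groupby('CUSTOMER_ID')['AMOUNT']
tx['CUST_PRIOR_TX_COUNT'] = amounts.cumcount()
tx['CUST_PRIOR_AVG_AMOUNT'] = amounts.transform(
    lambda s: s.shift().expanding().mean())

# Risk score: fraud rate over the customer's last 2 prior transactions
# (window size and definition are assumptions); no history -> 0
tx['RISK_SCORE'] = (
    tx.groupby('CUSTOMER_ID')['FRAUD']
      .transform(lambda s: s.shift().rolling(window=2, min_periods=1).mean())
      .fillna(0.0)
)

print(tx)
```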

# Baseline Model

- Logistic Regression
    - Coefficients, statistical significance, explainability
- Decision Tree
    - Classification criteria, feature importance
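
As a rough sketch of what each baseline exposes for interpretation, the example below fits both models on synthetic data in which the label depends mostly on `AMOUNT` (an assumption made so the outcome is easy to read). Note that scikit-learn reports coefficients but not p-values, so a significance test would need another tool.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic features named after columns in the transactions data
rng = np.random.default_rng(10)
X = pd.DataFrame({
    'AMOUNT': rng.normal(size=500),
    'TIME_SECONDS': rng.normal(size=500),
    'DURING_WEEKEND': rng.integers(0, 2, size=500),
})
# Synthetic label driven almost entirely by AMOUNT (illustration only)
y = (X['AMOUNT'] + 0.1 * rng.normal(size=500) > 1.0).astype(int)

# Logistic regression: one signed coefficient per feature
logit = LogisticRegression(max_iter=1000).fit(X, y)
coefficients = pd.Series(logit.coef_[0], index=X.columns)

# Decision tree: impurity-based importances that sum to 1
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=10).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)

print(coefficients.sort_values(key=abs, ascending=False))
print(importances.sort_values(ascending=False))
```

Coefficient magnitudes are only comparable across features when the features are on a common scale, which is one reason to standardize before fitting.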

# Evaluation

- Metrics: ROC (AUC Score), Recall, Confusion Matrix
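
The three metrics can be computed with scikit-learn as in the sketch below. The labels and scores are made up for illustration, and the 0.5 decision threshold is an assumption; AUC uses the continuous scores, while recall and the confusion matrix use the thresholded predictions.

```python
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score

# Made-up labels and predicted fraud probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.9, 0.05]

# Hard predictions at an assumed threshold of 0.5
y_pred = [int(s >= 0.5) for s in y_score]

roc_auc = roc_auc_score(y_true, y_score)   # ranking quality across all thresholds
recall = recall_score(y_true, y_pred)      # share of fraud cases that were caught
cm = confusion_matrix(y_true, y_pred)      # rows: actual 0/1, columns: predicted 0/1

print(roc_auc, recall)
print(cm)
```

For fraud detection, recall is the number to watch when missed fraud is costlier than a false alarm; lowering the threshold trades false positives for fewer misses.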

# Hyperparameter Tuning --> Optimal Model

- Ensemble Methods:
    - Random Forest
    - Boosting
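
A possible shape for the tuning step is sketched below with a random forest and a deliberately tiny grid; the data is synthetic and the grid values are placeholders, not recommendations. A boosting model such as `GradientBoostingClassifier` could be swapped in with its own parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (about 10% positives) standing in for the fraud set
X, y = make_classification(n_samples=600, n_features=6,
                           weights=[0.9, 0.1], random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=10)

# Placeholder grid; a real search would cover more values
param_grid = {'n_estimators': [25, 50], 'max_depth': [3, 6]}

# Cross-validated search on the training set, scored by ROC AUC
search = GridSearchCV(RandomForestClassifier(random_state=10),
                      param_grid, scoring='roc_auc', cv=3)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_auc = search.score(X_test, y_test)  # ROC AUC of the refit best model on held-out data

print(search.best_params_, test_auc)
```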

# Conclusion