<a href="https://colab.research.google.com/github/nickhitt/AMEX-fraud/blob/master/Amex_Credit_Default.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# American Express Default Prediction Competition

Credit default prediction is central to managing risk in a consumer lending business. It allows lenders to optimize lending decisions, which leads to a better customer experience and sound business economics.

In this competition, I apply my machine learning skills to predict credit default for American Express. Specifically, I use an industrial-scale dataset to build a machine learning model that challenges the current model in production. The training, validation, and testing datasets include time-series behavioral data and anonymized customer profile information.

This notebook is organised according to the following sections.

1. Loading the data
2. Exploratory Data Analysis
3. Pre-Processing/Feature Engineering
4. Train-Test Split
5. Model Selection
6. Model Tuning
7. Predictive Output



### 1.0 Data Loading

First, let's upload the AMEX data to Colab from the local machine.

In [2]:
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd

train = pd.read_feather('/content/train.feather')
# Due to memory constraints, the test set is loaded later
# test = pd.read_feather('/content/test.feather')

train_labels = pd.read_csv("/content/train_labels.csv")
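
Since memory is the constraint here, it can help to measure how much a frame occupies and to see what a narrower dtype saves. The following is a minimal sketch on a toy frame, not the notebook's actual data. The values are made up, and casting to `float32` is an assumption for illustration (the feather files may already use compact dtypes).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (hypothetical values).
toy = pd.DataFrame({
    "customer_ID": ["a", "b", "c"],
    "P_2": np.array([0.1, 0.2, 0.3], dtype="float64"),
})

# Total memory footprint in bytes, including string contents.
bytes_before = toy.memory_usage(deep=True).sum()

# Downcast the numeric column to halve its storage.
toy["P_2"] = toy["P_2"].astype("float32")
bytes_after = toy.memory_usage(deep=True).sum()

print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
```

The same `memory_usage(deep=True)` call on `train` shows whether loading the test set alongside it is feasible.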

### 2.0 Exploratory Data Analysis
Let's look through the data to see what we're dealing with.

In [None]:
train.info()

In [None]:
train.head()

**Multiple Observations Per Customer**

As the header says, each customer has many observations (one per billing period). We'll need to keep this in mind during feature engineering and pre-processing, because the model ultimately needs one row per customer.

In [None]:
unique_customers = train['customer_ID'].nunique()
print(f'There are {unique_customers} customers')

total_obs = train.shape[0]
print(f'There are {total_obs} observations')
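
Beyond the two totals, the distribution of observations per customer shows whether every customer has the same number of billing periods. Here is a minimal sketch on a toy frame; the values are made up, and `P_2` is used only as a placeholder numeric column.

```python
import pandas as pd

# Toy stand-in for `train` (hypothetical values).
toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b", "c"],
    "S_2": pd.to_datetime([
        "2017-03-01", "2017-04-01", "2017-05-01",
        "2017-03-01", "2017-04-01", "2017-03-01",
    ]),
    "P_2": [0.9, 0.8, 0.85, 0.4, 0.3, 0.6],
})

# Number of rows (billing periods) recorded for each customer.
obs_per_customer = toy.groupby("customer_ID").size()

# Summary statistics of the per-customer counts.
print(obs_per_customer.describe())
```

Running the same `groupby("customer_ID").size()` on `train` would show how uneven the histories are, which matters for aggregations such as `std` that are undefined for a single observation.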

**Class Imbalance**

The labels also appear to be imbalanced between defaults and non-defaults. This makes sense, as these individuals are already customers. Presumably AMEX has criteria for evaluating applicants before they become customers, to reduce the risk of default.

In [None]:
import seaborn as sns
sns.set_theme(style="darkgrid")
ax = sns.countplot(x="target", data=train_labels)
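
The count plot shows the imbalance visually; the exact proportions can be read off with `value_counts(normalize=True)`. This is a minimal sketch on toy labels (the values are made up and do not reflect the real default rate).

```python
import pandas as pd

# Toy stand-in for `train_labels` (hypothetical values).
toy_labels = pd.DataFrame({
    "customer_ID": list("abcdefgh"),
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Share of each class; the index holds the class labels 0 and 1.
class_share = toy_labels["target"].value_counts(normalize=True)

# With 0/1 labels, the mean of the target is the default rate.
default_rate = toy_labels["target"].mean()

print(class_share)
print(f"Default rate: {default_rate:.1%}")
```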

### 3.0 Feature Engineering and Pre-Processing

In [None]:
# Feature engineering function
def process_and_feature_engineer(df):
    """Aggregate per-statement rows into one row of features per customer."""
    # FEATURE ENGINEERING FROM
    # https://www.kaggle.com/code/huseyincot/amex-agg-data-how-it-created
    all_cols = [c for c in df.columns if c not in ['customer_ID', 'S_2']]
    # We know there are categorical features, so name them to apply a separate aggregation
    cat_features = ["B_30", "B_38", "D_114", "D_116", "D_117", "D_120", "D_126", "D_63", "D_64", "D_66", "D_68"]
    num_features = [col for col in all_cols if col not in cat_features]

    # Numerical features: summary statistics over each customer's history
    num_agg = df.groupby("customer_ID")[num_features].agg(['median', 'std', 'min', 'max'])
    # Flatten the (feature, statistic) MultiIndex into names like "P_2_median"
    num_agg.columns = ['_'.join(x) for x in num_agg.columns]

    # Categorical features: count, most recent value, and number of distinct values
    cat_agg = df.groupby("customer_ID")[cat_features].agg(['count', 'last', 'nunique'])
    cat_agg.columns = ['_'.join(x) for x in cat_agg.columns]

    # Both frames are indexed by customer_ID, so concatenate column-wise
    df = pd.concat([num_agg, cat_agg], axis=1)
    del num_agg, cat_agg
    print('shape after engineering', df.shape)

    return df

In [None]:
# Aggregate the training data to one row per customer
train = process_and_feature_engineer(train)
# test = process_and_feature_engineer(test)
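
To make the aggregation concrete, the sketch below applies the same pattern as `process_and_feature_engineer` to a toy frame with one numerical and one categorical column. The values are made up. The final join against the labels is an assumption about how the aggregated features would later be aligned with `train_labels`; it is not a step taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the raw statements (hypothetical values).
toy = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "b", "b"],
    "P_2": [0.9, 0.7, 0.2, 0.4, 0.6],
    "D_63": ["CO", "CR", "CO", "CO", "CO"],
})

# Numerical column: summary statistics per customer.
num_agg = toy.groupby("customer_ID")[["P_2"]].agg(["median", "std", "min", "max"])
num_agg.columns = ["_".join(col) for col in num_agg.columns]

# Categorical column: count, most recent value, number of distinct values.
cat_agg = toy.groupby("customer_ID")[["D_63"]].agg(["count", "last", "nunique"])
cat_agg.columns = ["_".join(col) for col in cat_agg.columns]

# One row per customer, indexed by customer_ID.
features = pd.concat([num_agg, cat_agg], axis=1)

# Assumed follow-up step: align labels with features on customer_ID.
toy_labels = pd.DataFrame({"customer_ID": ["b", "a"], "target": [1, 0]})
labelled = features.join(toy_labels.set_index("customer_ID"), how="inner")

print(labelled)
```

Because the aggregated frame is indexed by `customer_ID`, joining on the index keeps each label attached to the right customer regardless of row order in the labels file.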

In [None]:
from sklearn.preprocessing import StandardScaler
