# ⚡⚡ PyTorch Quickstart for the American Express - Default Prediction competition
This notebook shows how to define and train a PyTorch LSTM that leverages the time-series structure of the data.

I expect Deep Learning models to dominate in this competition, so here's a simple LSTM architecture.

The hyperparameters were barely tuned, so this baseline leaves room for improvement.

**Please consider upvoting if you find this work helpful. Don't fork without upvoting!**



## Why PyTorch Lightning?

Lightning is simply organized PyTorch code. There's NO new framework to learn. For more details about Lightning visit the repo:

https://github.com/PyTorchLightning/pytorch-lightning

Run on CPU, GPU clusters or TPU, without any code changes



# Imports


In [None]:
# Mount Google Drive in Colab
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    pass

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Change the current working directory; assumes the notebook file sits in the project root folder
import os
def get_root_dir():
    if os.path.exists('/content/drive/MyDrive/Colab/'):
        return '/content/drive/MyDrive/Colab/4-AMEX/AMEX Project/notebooks'  # running in Colab
    else:
        return './'  # running locally

# Change the working directory from Python (equivalent to `cd`; a bare `!cd` does not work)
os.chdir(get_root_dir())

# %cd .//drive/MyDrive/AMEX\ Team/deep_learning

In [None]:
!pip install pytorch_lightning

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import gc
import numpy as np
from tqdm.notebook import tqdm

# Torch and Sklearn
import pytorch_lightning as pl
import torch
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset
from torchmetrics import Metric
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning import Trainer
import torch.nn.functional as F
from torch import nn, Tensor
from torchmetrics.utilities import rank_zero_warn

# Typing 
from typing import Optional

In [None]:
# File system
test_files    = [f"../data/4-PreCompressed/FilledWithRandomForest/test_RandomForest(PreCompressed)_{i}.parquet" for i in range(10)]
model_weights = 'mycheckpoints/epoch=6-step=10171.ckpt'

# Data
batch_size   = 256
num_workers  = 4

# Model 
in_features=171
hidden_dim=512
num_layers=1
learning_rate=8e-3

# Data

Reading and preprocessing the data

We read the data from @raddar's [dataset](https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format), which I split into chunks in this [dataset](https://www.kaggle.com/datasets/what5up/amex-data-integer-dtypes-100k-cid-per-chunk). @raddar has denoised the data, so we can achieve better results with his dataset than with the original competition CSV files.

We also convert the dataframe into a 3D-tensor dataset as highlighted by Chris Deotte [here](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327828) 
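
As a rough illustration of that conversion, the sketch below reshapes a toy dataframe into a `(customers, 13, features)` array. The sizes and column names (`c0`, `f0`, ...) are made up for the example; the only assumption carried over from the notebook is that rows are grouped by customer and that every customer has exactly 13 rows.

```python
import numpy as np
import pandas as pd

# Toy sizes (hypothetical); the notebook itself uses 13 statements and 171 features
n_customers, seq_len, n_features = 3, 13, 4

# Rows are grouped by customer: 13 consecutive rows per customer_ID
ids = np.repeat([f"c{i}" for i in range(n_customers)], seq_len)
values = np.arange(n_customers * seq_len * n_features, dtype=np.float32)
values = values.reshape(-1, n_features)

df = pd.DataFrame(values, columns=[f"f{j}" for j in range(n_features)])
df.insert(0, "customer_ID", ids)

feature_cols = df.columns[1:]

# (n_customers * 13, n_features) -> (n_customers, 13, n_features)
x = df[feature_cols].to_numpy().reshape(-1, seq_len, n_features)

# unique() keeps order of first appearance, so x[i] belongs to customer_order[i]
customer_order = df["customer_ID"].unique()

print(x.shape)
```

The reshape only lines up with `customer_order` because the rows are already contiguous per customer; shuffled rows would silently mix customers.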



In [None]:
def load_test_df(test_file):
    test = pd.read_parquet(test_file)
    test['S_2'] = pd.to_datetime(test['S_2'])
    tmp = test[['customer_ID','S_2']].groupby('customer_ID').count()

    missing_cids = []
    for nb_available_rows in range(1, 14):
        cids = tmp[tmp['S_2'] == nb_available_rows].index.values
        batch_missing_cids = [cid for cid in cids for _ in range(13 - nb_available_rows)]
        missing_cids.extend(batch_missing_cids)

    test_part2 = test.iloc[:len(missing_cids)].copy()
    test_part2.loc[:] = np.nan
    test_part2['customer_ID'] = missing_cids

    test = pd.concat([test_part2, test])
    
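    # Stable sort: keeps each customer's rows in their existing order, with the padding rows first
    test = test.sort_values('customer_ID', kind='mergesort')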
    return test
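
The padding idea can be tried on a toy frame. This is a minimal sketch, not the function above: `pad_to_fixed_length`, the toy column `P_2` and the customer IDs are illustrative, and it assumes no customer has more than 13 rows.

```python
import numpy as np
import pandas as pd

SEQ_LEN = 13  # statements per customer expected by the model


def pad_to_fixed_length(df, seq_len=SEQ_LEN):
    """Prepend all-NaN rows so that every customer_ID has exactly seq_len rows."""
    counts = df.groupby("customer_ID").size()
    n_missing = seq_len - counts
    # Repeat each customer_ID once per missing row (zero times if already complete)
    missing_ids = np.repeat(n_missing.index.to_numpy(), n_missing.to_numpy())
    # Padding rows carry only the customer_ID; every other column is NaN
    padding = pd.DataFrame({"customer_ID": missing_ids}).reindex(columns=df.columns)
    padded = pd.concat([padding, df], ignore_index=True)
    # Stable sort keeps padding rows ahead of the real rows of each customer
    return padded.sort_values("customer_ID", kind="mergesort").reset_index(drop=True)


toy = pd.DataFrame({
    "customer_ID": ["a"] * 13 + ["b"] * 2,  # "b" has only 2 statements
    "P_2": np.arange(15, dtype=float),
})
padded = pad_to_fixed_length(toy)
print(padded.groupby("customer_ID").size())
```

Placing the padding before the real rows keeps the most recent statement in the last time step, which is the one the LSTM head reads.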

In [None]:
class DataModule(pl.LightningDataModule):

    def __init__(self, all_data: pd.DataFrame, batch_size: int = batch_size, num_workers: int = num_workers):
        super().__init__()
        self.all_data = all_data
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.sc = StandardScaler()

    def prepare_data(self):
        pass

    def setup(self, stage=None):
        # All data columns except customer_ID and S_2 (the first two) are features
        features = self.all_data.columns[2:]
        self.all_data[features] = self.sc.fit_transform(self.all_data[features])
        self.all_data[features] = self.all_data[features].fillna(0)
        
        # https://www.kaggle.com/competitions/amex-default-prediction/discussion/327828 !! Many Thanks @Chris Deotte for your sharing
        all_tensor_x   = torch.reshape(torch.tensor(self.all_data[features].to_numpy()), (-1, 13, 171)).float()
        
        self.predict_tensor = TensorDataset(all_tensor_x)

    def predict_dataloader(self):
        return DataLoader(self.predict_tensor, batch_size=self.batch_size, num_workers=self.num_workers)

# Metrics : [Implementation Source](https://www.kaggle.com/code/rohanrao/amex-competition-metric-implementations)

In [None]:
## https://www.kaggle.com/code/rohanrao/amex-competition-metric-implementations

class AmexMetric(Metric):
    is_differentiable: Optional[bool] = False

    # Set to True if the metric reaches its optimal value when it is maximized.
    # Set to False if it reaches its optimal value when it is minimized.
    higher_is_better: Optional[bool] = True

    # Set to True if the metric during 'update' requires access to the global metric
    # state for its calculations. If not, setting this to False indicates that all
    # batch states are independent and we will optimize the runtime of 'forward'
    full_state_update: bool = True

    def __init__(self):
        super().__init__()
        
        self.add_state("all_true", default=[], dist_reduce_fx="cat")
        self.add_state("all_pred", default=[], dist_reduce_fx="cat")

        rank_zero_warn(
            "Metric `Amex` will save all targets and predictions in buffer."
            " For large datasets this may lead to large memory footprint."
        )

    def update(self, y_pred: torch.Tensor, y_true: torch.Tensor):
        
        y_true = y_true.double()
        y_pred = y_pred.double()
        
        self.all_true.append(y_true)
        self.all_pred.append(y_pred)
        
    def compute(self):
        y_true = torch.cat(self.all_true)
        y_pred = torch.cat(self.all_pred)
        # count of positives and negatives
        n_pos = y_true.sum()
        n_neg = y_pred.shape[0] - n_pos

        # sort by descending prediction values
        indices = torch.argsort(y_pred, dim=0, descending=True)
        preds, target = y_pred[indices], y_true[indices]

        # filter the top 4% by cumulative row weights
        weight = 20.0 - target * 19.0
        cum_norm_weight = (weight / weight.sum()).cumsum(dim=0)
        four_pct_filter = cum_norm_weight <= 0.04

        # default rate captured at 4%
        d = target[four_pct_filter].sum() / n_pos

        # weighted gini coefficient
        lorentz = (target / n_pos).cumsum(dim=0)
        gini = ((lorentz - cum_norm_weight) * weight).sum()

        # max weighted gini coefficient
        gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))

        # normalized weighted gini coefficient
        g = gini / gini_max
        
        return 0.5 * (g + d)


def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x == 0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()

    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x == 0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename(columns={'target': 'prediction'})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)
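
For quick sanity checks, the pandas version above can be mirrored in plain NumPy. The sketch below is an illustrative re-implementation (the name `amex_metric_np` is not from the original source), computing the mean of the normalized weighted Gini `G` and the default rate captured at 4% `D`, with negatives weighted 20 and positives weighted 1.

```python
import numpy as np


def amex_metric_np(y_true, y_pred):
    """0.5 * (G + D), following the pandas reference implementation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    def weighted_gini(scores):
        # Sort targets by descending score
        order = np.argsort(-scores, kind="stable")
        target = y_true[order]
        weight = np.where(target == 0, 20.0, 1.0)
        random = np.cumsum(weight / weight.sum())
        lorentz = np.cumsum(target * weight) / np.sum(target * weight)
        return np.sum((lorentz - random) * weight)

    # D: share of positives found in the top 4% of cumulative weight
    order = np.argsort(-y_pred, kind="stable")
    target = y_true[order]
    weight = np.where(target == 0, 20.0, 1.0)
    cutoff = int(0.04 * weight.sum())
    captured = target[np.cumsum(weight) <= cutoff].sum() / target.sum()

    # G: weighted Gini normalized by the Gini of a perfect ranking
    gini = weighted_gini(y_pred) / weighted_gini(y_true)
    return 0.5 * (gini + captured)


y = np.array([1] * 10 + [0] * 90)
perfect = amex_metric_np(y, y)       # ranking identical to the labels
reversed_ = amex_metric_np(y, 1 - y)  # worst possible ranking
print(perfect, reversed_)
```

On this toy input a perfect ranking reaches the maximum score, while the reversed ranking captures no defaults in the top 4% and ends up with a negative score.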


## Model

In [None]:
class LSTMClassifier(nn.Module):
    """Very simple implementation of LSTM-based time-series classifier."""

    def __init__(self, input_dim, hidden_dim, num_layers, output_dim, device):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, 100)  #just change this to 300
        self.fc2 = nn.Linear(100, output_dim)
        self.device = device

    def forward(self, x):
        h0, c0 = self.init_hidden(x)
        out, (_, _) = self.rnn(x, (h0, c0))
        out = F.relu(self.fc1(out[:, -1, :]))
        out = torch.sigmoid(self.fc2(out))
        #out = torch.nn.functional.softmax(out ,dim= 1)
        return out

    def init_hidden(self, x):
        batch_size = x.size(0)
        # Create the initial states on the same device and with the same dtype as the input
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=x.device, dtype=x.dtype)
        c0 = torch.zeros_like(h0)
        return h0, c0


class TsLstmLightning(pl.LightningModule):
    def __init__(self, in_features, hidden_dim, num_layers, learning_rate):
        super(TsLstmLightning, self).__init__()

        self.learning_rate = learning_rate

        self.train_amex_metric = AmexMetric()
        self.val_amex_metric   = AmexMetric()

        self.model = LSTMClassifier(in_features, hidden_dim, num_layers, 1, device = self.device) # changed to 2

        self.num_parameters = count_parameters(self.model)

        self.loss_fn = nn.BCELoss(reduction="mean")

    def forward(self, x):
        res = self.model(x)
        return res

    def training_step(self, batch, batch_idx):
        X, target = batch
        preds = self(X)  # (batch_size, 1)
        preds = preds.squeeze(1)

        loss = self.loss_fn(preds, target)
        
        self.train_amex_metric.update(preds, target)

        self.log_dict({'train_loss': loss, 'train_amex_metric': self.train_amex_metric}, on_step=True, on_epoch=True, prog_bar=True, logger=True)

        return {'loss': loss}

    def validation_step(self, batch, batch_idx):
        with torch.no_grad():
            X, target = batch
            preds = self(X)  
            preds = preds.squeeze(1)

            loss = self.loss_fn(preds, target)
            
            self.val_amex_metric.update(preds, target)

            self.log_dict({'val_loss': loss, 'val_amex_metric': self.val_amex_metric }, on_step=True, on_epoch=True, prog_bar=True, logger=True)

            return {'loss': loss}


    def predict_step(self, batch, batch_idx, dataloader_idx=None):
        with torch.no_grad():
            X = batch[0]
            preds = self(X)
            return preds.detach().cpu()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
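
A note on the schedule in `configure_optimizers`: `StepLR` is created with `step_size=1` and no explicit `gamma`, and PyTorch's default `gamma` is 0.1, so the learning rate is divided by 10 after every scheduler step. Assuming Lightning's default of stepping the scheduler once per epoch, the arithmetic looks like this (`step_lr_schedule` is a hypothetical helper written for this illustration, not a PyTorch function):

```python
def step_lr_schedule(base_lr, n_epochs, step_size=1, gamma=0.1):
    """Learning rate used in each epoch under a StepLR-style decay."""
    return [base_lr * gamma ** (epoch // step_size) for epoch in range(n_epochs)]


# 8e-3 is the learning_rate set at the top of the notebook
schedule = step_lr_schedule(base_lr=8e-3, n_epochs=7)
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

With such a steep decay the later epochs barely update the weights, so a larger `gamma` or `step_size` may be worth trying when tuning.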

# Inference Loop 

In [None]:
# load_from_checkpoint is a classmethod, so call it on the class instead of on a throwaway instance
model = TsLstmLightning.load_from_checkpoint(checkpoint_path=model_weights, in_features=in_features, hidden_dim=hidden_dim, num_layers=num_layers, learning_rate=learning_rate)
trainer = Trainer(gpus=[0] if torch.cuda.is_available() else None)
all_ss = []
for test_file in test_files:
    test_df = pd.read_parquet(test_file)
    print(f"Test Shape: {test_df.shape}")
    dm = DataModule(test_df, batch_size=batch_size)
    
    customer_ID = test_df['customer_ID'].unique()

    del test_df 
    gc.collect()
    
    prediction = trainer.predict(model, dm)
    
    prediction = torch.cat(prediction).detach().numpy().squeeze()
    print(pd.DataFrame(prediction))

    ss = pd.DataFrame({'customer_ID': customer_ID, 'prediction':prediction})
    all_ss.append(ss)
    
    del dm
    gc.collect()


ss = pd.concat(all_ss)
ss.to_csv('../results/4.4-LSTM-submission.csv', index=False)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


Test Shape: (1300000, 173)



              0
0      0.007913
1      0.000923
2      0.032652
3      0.588429
4      0.894673
...         ...
99995  0.000657
99996  0.205784
99997  0.004863
99998  0.475661
99999  0.000612

[100000 rows x 1 columns]
Test Shape: (1300000, 173)



              0
0      0.749944
1      0.001468
2      0.000433
3      0.046641
4      0.000076
...         ...
99995  0.009059
99996  0.532231
99997  0.004014
99998  0.856911
99999  0.108834

[100000 rows x 1 columns]
Test Shape: (1300000, 173)



              0
0      0.000228
1      0.006747
2      0.000759
3      0.050580
4      0.024868
...         ...
99995  0.702036
99996  0.817741
99997  0.001897
99998  0.003674
99999  0.862492

[100000 rows x 1 columns]
Test Shape: (1300000, 173)



              0
0      0.033723
1      0.000824
2      0.001562
3      0.000401
4      0.537696
...         ...
99995  0.000180
99996  0.361669
99997  0.093653
99998  0.022425
99999  0.000770

[100000 rows x 1 columns]
Test Shape: (1300000, 173)



              0
0      0.001599
1      0.799451
2      0.017060
3      0.035948
4      0.216876
...         ...
99995  0.339571
99996  0.005991
99997  0.585671
99998  0.591210
99999  0.000516

[100000 rows x 1 columns]
Test Shape: (1300000, 173)



              0
0      0.527176
1      0.014960
2      0.002837
3      0.981295
4      0.174950
...         ...
99995  0.004149
99996  0.000116
99997  0.000329
99998  0.000799
99999  0.410499

[100000 rows x 1 columns]
Test Shape: (1300000, 173)



              0
0      0.354432
1      0.554808
2      0.797854
3      0.000330
4      0.997213
...         ...
99995  0.000681
99996  0.019122
99997  0.671495
99998  0.126772
99999  0.000420

[100000 rows x 1 columns]
Test Shape: (1300000, 173)



              0
0      0.848950
1      0.000667
2      0.253712
3      0.150043
4      0.080738
...         ...
99995  0.986280
99996  0.002148
99997  0.207627
99998  0.764231
99999  0.002138

[100000 rows x 1 columns]
Test Shape: (1300000, 173)



              0
0      0.199189
1      0.704597
2      0.001509
3      0.298037
4      0.018369
...         ...
99995  0.000558
99996  0.070768
99997  0.001007
99998  0.000234
99999  0.009504

[100000 rows x 1 columns]
Test Shape: (320073, 173)



              0
0      0.269025
1      0.028386
2      0.596153
3      0.383417
4      0.001522
...         ...
24616  0.006550
24617  0.592939
24618  0.685268
24619  0.346999
24620  0.172991

[24621 rows x 1 columns]


# Possible next steps

1. Handle missing values in the data better (e.g. do not drop customers that have fewer than 13 records, do not fill N/A with 0, replace previous -1 values with N/A, ...)
2. Enhance the model (More LSTM / 1D CNN / Transformers / Param optimisation / ...)
3. Better understand the predictive features through feature engineering (feature selection or permutation feature importance, for instance)
4. (1. + 2.) Unsupervised Transformer pre-training on time series: https://arxiv.org/abs/2010.02803
5. Model ensemble with other classification techniques
6. Enable Cross-validation (Stratified K-Fold, ...)
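
For step 6, a customer-level stratified split could look like the sketch below. It runs on random toy data (the array shapes and the 25% positive rate are made up) and only shows how the fold indices are obtained; training one model per fold is left out.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n_customers = 1000

# Toy stand-ins: one (13, 5) sequence and one binary label per customer
x = rng.normal(size=(n_customers, 13, 5)).astype(np.float32)
y = (rng.random(n_customers) < 0.25).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# split() only needs the number of samples and the labels, so a dummy X is enough
folds = list(skf.split(np.zeros(n_customers), y))

for fold, (train_idx, valid_idx) in enumerate(folds):
    x_train, x_valid = x[train_idx], x[valid_idx]
    print(fold, x_train.shape, x_valid.shape, round(y[valid_idx].mean(), 3))
```

Splitting on customers rather than on rows keeps all 13 statements of a customer in the same fold, and stratifying on the label keeps the default rate similar across folds.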


In [None]:
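# NOTE: `train_file` is not defined anywhere in this notebook; set it to a parquet path before running this cell
print(pd.read_parquet(train_file))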

## Data split

In [None]:
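def split_data(df, chunk_rows=1_300_000, n_chunks=10):
    """Write `df` to `n_chunks` parquet files of at most `chunk_rows` rows each.

    1,300,000 rows = 100,000 customers x 13 statements, so no customer is split
    across files as long as every customer has exactly 13 rows.
    """
    for i in range(n_chunks):
        start = i * chunk_rows
        # iloc clips at the end of the frame, so the last chunk may be shorter
        df_store = df.iloc[start:start + chunk_rows, :]
        df_store.to_parquet(f"./input/FilledWithRandomForest/test_RandomForest(PreCompressed)_{i}.parquet")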
#to_parquet("./input/FilledWithRandomForest/test_RandomForest(PreCompressed)_0.parquet")

In [None]:
#print(pd.read_parquet("./input/FilledWithRandomForest/test_RandomForest(PreCompressed)_0.parquet"))