# ⚡⚡ PyTorch Quickstart for the American Express - Default Prediction competition
This notebook shows how to define and train a PyTorch LSTM that leverages the time-series structure of the data.

I expect Deep Learning models to dominate in this competition, so here's a simple LSTM architecture.

The hyperparameters were barely tuned, so this baseline leaves room for improvement.

**Please consider upvoting if you find this work helpful. Don't fork without upvoting!**



## Why PyTorch Lightning?

Lightning is simply organized PyTorch code. There's NO new framework to learn. For more details about Lightning visit the repo:

https://github.com/PyTorchLightning/pytorch-lightning

Run on CPU, GPU clusters, or TPU without any code changes.



# Imports


In [1]:
# Mount Google Drive in Colab
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    pass

Mounted at /content/drive


In [2]:
# Change the working directory; assumes the notebook sits in the project root folder
import os
def get_root_dir():
    if os.path.exists('/content/drive/MyDrive/Colab/'):
        return '/content/drive/MyDrive/Colab/4-AMEX/AMEX Project/notebooks'  # running in Colab
    else:
        return './'  # running locally

# Equivalent to `cd`; a bare `!cd` does not work here
os.chdir(get_root_dir())

# %cd .//drive/MyDrive/AMEX\ Team/deep_learning

In [3]:
!pip install pytorch_lightning

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytorch_lightning
  Downloading pytorch_lightning-1.7.2-py3-none-any.whl (705 kB)
Collecting torchmetrics>=0.7.0
  Downloading torchmetrics-0.9.3-py3-none-any.whl (419 kB)
Collecting tensorboard>=2.9.1
  Downloading tensorboard-2.10.0-py3-none-any.whl (5.9 MB)
Collecting pyDeprecate>=0.3.1
  Downloading pyDeprecate-0.3.2-py3-none-any.whl (10 kB)
Installing collected packages: torchmetrics, tensorboard, pyDeprecate, pytorch-lightning
  Attempting uninstall: tensorboard
    Found existing installation: tensorboard 2.8.0
    Uninstalling tensorboard-2.8.0:
      Successfully uninstalled tensorboard-2.8.0
ERROR: pip's dependency resolver does not currently take into account all the pac

In [4]:
import pandas as pd
import gc
import numpy as np
from tqdm.notebook import tqdm

# Torch and Sklearn
import pytorch_lightning as pl
import torch
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset
from torchmetrics import Metric
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning import Trainer
import torch.nn.functional as F
from torch import nn, Tensor
from torchmetrics.utilities import rank_zero_warn

# Typing 
from typing import Optional

In [12]:
# File system
test_files    = [f"../data/4-PreCompressed/FilledWithRandomForest/test_RandomForest(PreCompressed)_{i}.parquet" for i in range(10)]
model_weights = 'mycheckpoints/epoch=6-step=10171.ckpt'

# Data
batch_size   = 256
num_workers  = 2

# Model 
in_features=171
hidden_dim=512
num_layers=1
learning_rate=8e-3

# Data

Reading and preprocessing the data

We read the data from @raddar's [dataset](https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format), which I split into [chunks of 100k customers](https://www.kaggle.com/datasets/what5up/amex-data-integer-dtypes-100k-cid-per-chunk). @raddar denoised the data, so this dataset can give better results than the original competition CSV files.

We also convert the dataframe into a 3D tensor (customers × 13 statements × features), as described by Chris Deotte [here](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327828).
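As a minimal sketch of that reshape, assuming the rows are already sorted by customer and every customer has exactly 13 rows. The toy sizes and the names `flat` and `cube` are illustrative and do not come from the notebook.

```python
import numpy as np

# Toy sizes: 4 customers, 13 statements each, 171 features (as in this notebook)
n_customers, seq_len, n_features = 4, 13, 171

# Flat table: one row per (customer, statement), customers stored contiguously
flat = np.arange(n_customers * seq_len * n_features, dtype=np.float32)
flat = flat.reshape(n_customers * seq_len, n_features)

# 3D view: cube[c, t] is statement t of customer c
cube = flat.reshape(-1, seq_len, n_features)

print(cube.shape)  # (4, 13, 171)
```

The reshape only regroups consecutive rows, so it is correct only when each customer's 13 rows are adjacent in the flat table.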



In [13]:
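def load_test_df(test_file):
    """Read one test chunk and pad every customer to 13 rows with all-NaN rows."""
    test = pd.read_parquet(test_file)
    test['S_2'] = pd.to_datetime(test['S_2'])
    # Number of statements available per customer
    tmp = test[['customer_ID','S_2']].groupby('customer_ID').count()

    # Repeat each customer_ID once per missing statement (13 - available rows)
    missing_cids = []
    for nb_available_rows in range(1, 14):
        cids = tmp[tmp['S_2'] == nb_available_rows].index.values
        batch_missing_cids = [cid for cid in cids for _ in range(13 - nb_available_rows)]
        missing_cids.extend(batch_missing_cids)

    # Build all-NaN padding rows that share the columns of the real data
    test_part2 = test.iloc[:len(missing_cids)].copy()
    test_part2.loc[:] = np.nan
    test_part2['customer_ID'] = missing_cids

    test = pd.concat([test_part2, test])
    
    # Group each customer's real and padding rows together
    test = test.sort_values('customer_ID')
    return test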
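The padding idea can be tried on a toy frame. This is a sketch under assumptions, not the notebook's exact procedure: the sequence length is shortened to 3, the column `feat` is made up, and the padding rows are explicitly sorted before the real statements so that the most recent statement stays in the last time step.

```python
import numpy as np
import pandas as pd

SEQ_LEN = 3  # 13 in the notebook; shortened for readability

toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "c", "c"],
    "S_2": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01",
                           "2018-03-01", "2018-02-01", "2018-03-01"]),
    "feat": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # hypothetical feature column
})

# One padding row per missing statement of each customer
counts = toy.groupby("customer_ID")["S_2"].count()
missing = np.repeat(counts.index.to_numpy(), SEQ_LEN - counts.to_numpy())
padding = pd.DataFrame({"customer_ID": missing})  # other columns become NaN/NaT

padded = pd.concat([padding, toy], ignore_index=True)
# Padding rows have S_2 = NaT; na_position="first" places them before real rows
padded = padded.sort_values(["customer_ID", "S_2"], na_position="first")
padded = padded.reset_index(drop=True)

tensor = padded["feat"].to_numpy().reshape(-1, SEQ_LEN)
print(tensor)
```

Sorting on `customer_ID` alone groups the rows but leaves the order inside each customer unspecified, which is why this sketch adds `S_2` as a second sort key.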

In [14]:
class DataModule(pl.LightningDataModule):

    def __init__(self, all_data: pd.DataFrame, batch_size: int = batch_size, num_workers: int = num_workers):
        super().__init__()
        self.all_data = all_data
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.sc = StandardScaler()

    def prepare_data(self):
        pass

    def setup(self, stage=None):
        # All data columns except customer_ID and S_2 (the first two columns) are features
        features = self.all_data.columns[2:]
        self.all_data[features] = self.sc.fit_transform(self.all_data[features])
        self.all_data[features] = self.all_data[features].fillna(0)
        
        # https://www.kaggle.com/competitions/amex-default-prediction/discussion/327828 !! Many Thanks @Chris Deotte for your sharing
        all_tensor_x   = torch.reshape(torch.tensor(self.all_data[features].to_numpy()), (-1, 13, 171)).float()
        
        self.predict_tensor = TensorDataset(all_tensor_x)
    def predict_dataloader(self):
        return DataLoader(self.predict_tensor, batch_size=self.batch_size, num_workers=self.num_workers)
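In `setup`, the features are standardized first and the remaining NaNs are then replaced with 0. Since scikit-learn's `StandardScaler` ignores NaNs when fitting and keeps them when transforming, this amounts to mean imputation. The NumPy-only sketch below illustrates that equivalence on a made-up array; it mimics the scaler with `nanmean`/`nanstd` rather than calling scikit-learn.

```python
import numpy as np

# Hypothetical feature matrix with missing values
x = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 30.0]])

# Statistics computed while ignoring NaNs (population std, ddof=0)
mean = np.nanmean(x, axis=0)
std = np.nanstd(x, axis=0)

# Route used in the notebook: scale, then fill NaN with 0
filled = np.nan_to_num((x - mean) / std, nan=0.0)

# Reference route: impute the column mean, then scale
mean_imputed = np.where(np.isnan(x), mean, x)
reference = (mean_imputed - mean) / std

print(np.allclose(filled, reference))
```

Note that `fit_transform` is called on whatever frame the `DataModule` receives, so each test chunk is scaled with its own statistics.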

# Metrics : [Implementation Source](https://www.kaggle.com/code/rohanrao/amex-competition-metric-implementations)

In [15]:
## https://www.kaggle.com/code/rohanrao/amex-competition-metric-implementations

class AmexMetric(Metric):
    is_differentiable: Optional[bool] = False

    # Set to True if the metric reaches its optimal value when it is maximized.
    # Set to False if it is optimal when minimized.
    higher_is_better: Optional[bool] = True

    # Set to True if the metric during 'update' requires access to the global metric
    # state for its calculations. If not, setting this to False indicates that all
    # batch states are independent and we will optimize the runtime of 'forward'
    full_state_update: bool = True

    def __init__(self):
        super().__init__()
        
        self.add_state("all_true", default=[], dist_reduce_fx="cat")
        self.add_state("all_pred", default=[], dist_reduce_fx="cat")

        rank_zero_warn(
            "Metric `Amex` will save all targets and predictions in buffer."
            " For large datasets this may lead to large memory footprint."
        )

    def update(self, y_pred: torch.Tensor, y_true: torch.Tensor):
        
        y_true = y_true.double()
        y_pred = y_pred.double()
        
        self.all_true.append(y_true)
        self.all_pred.append(y_pred)
        
    def compute(self):
        y_true = torch.cat(self.all_true)
        y_pred = torch.cat(self.all_pred)
        # count of positives and negatives
        n_pos = y_true.sum()
        n_neg = y_pred.shape[0] - n_pos

        # sort by descending prediction values
        indices = torch.argsort(y_pred, dim=0, descending=True)
        preds, target = y_pred[indices], y_true[indices]

        # filter the top 4% by cumulative row weights
        weight = 20.0 - target * 19.0
        cum_norm_weight = (weight / weight.sum()).cumsum(dim=0)
        four_pct_filter = cum_norm_weight <= 0.04

        # default rate captured at 4%
        d = target[four_pct_filter].sum() / n_pos

        # weighted gini coefficient
        lorentz = (target / n_pos).cumsum(dim=0)
        gini = ((lorentz - cum_norm_weight) * weight).sum()

        # max weighted gini coefficient
        gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))

        # normalized weighted gini coefficient
        g = gini / gini_max
        
        return 0.5 * (g + d)


def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x == 0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()

    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x == 0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename(columns={'target': 'prediction'})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)
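To sanity-check the metric, here is a NumPy re-implementation that follows the same steps as the pandas `amex_metric` above (weight 20 for negatives, top-4% capture, normalized weighted Gini). The function name `amex_metric_np` and the toy data are illustrative. It uses a stable sort, so results can differ from the pandas version when predictions contain ties.

```python
import numpy as np

def amex_metric_np(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    def sorted_target_and_weight(scores):
        # Sort by descending score; negatives get weight 20, positives weight 1
        order = np.argsort(-scores, kind="stable")
        target = y_true[order]
        weight = np.where(target == 0, 20.0, 1.0)
        return target, weight

    def weighted_gini(scores):
        target, weight = sorted_target_and_weight(scores)
        random = np.cumsum(weight / weight.sum())
        lorentz = np.cumsum(target * weight) / np.sum(target * weight)
        return np.sum((lorentz - random) * weight)

    # Share of positives captured within the top 4% of cumulative weight
    target, weight = sorted_target_and_weight(y_pred)
    cutoff = int(0.04 * weight.sum())
    captured = target[np.cumsum(weight) <= cutoff].sum() / target.sum()

    # Gini of the predictions normalized by the Gini of a perfect ranking
    gini = weighted_gini(y_pred) / weighted_gini(y_true)
    return 0.5 * (gini + captured)

# Toy data: 5 defaulters followed by 95 non-defaulters
y_true = np.array([1] * 5 + [0] * 95)
y_score = np.linspace(1.0, 0.0, 100)  # ranks every defaulter first

perfect = amex_metric_np(y_true, y_score)
reversed_score = amex_metric_np(y_true, y_score[::-1])
print(perfect, reversed_score)
```

On this toy set a perfect ranking scores 1.0 because all five positives fall inside the 4% cutoff; with a higher default rate the capture term would stay below 1 even for a perfect ranking.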


## Model

In [16]:
class LSTMClassifier(nn.Module):
    """Very simple implementation of LSTM-based time-series classifier."""

    def __init__(self, input_dim, hidden_dim, num_layers, output_dim, device):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, 100)  #just change this to 300
        self.fc2 = nn.Linear(100, output_dim)
        self.device = device

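    def forward(self, x):
        h0, c0 = self.init_hidden(x)
        out, (_, _) = self.rnn(x, (h0, c0))
        # Keep only the last time step, then project it to the 100-unit layer
        out = F.relu(self.fc1(out[:, -1, :]))
        # The fc2 + sigmoid classification head is disabled; the model returns
        # a softmax over the 100 fc1 activations instead
        #out = torch.sigmoid(self.fc2(out))
        out = F.softmax(out, dim=1)
        return out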

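    def init_hidden(self, x):
        # Create the initial states on the same device and dtype as the input,
        # so the model also runs on CPU when CUDA is installed but not used
        batch_size = x.size(0)
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=x.device, dtype=x.dtype)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=x.device, dtype=x.dtype)
        return h0, c0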


class TsLstmLightning(pl.LightningModule):
    def __init__(self, in_features, hidden_dim, num_layers, learning_rate):
        super(TsLstmLightning, self).__init__()

        self.learning_rate = learning_rate

        self.train_amex_metric = AmexMetric()
        self.val_amex_metric   = AmexMetric()

        self.model = LSTMClassifier(in_features, hidden_dim, num_layers, 1, device = self.device) # output_dim=1 only sizes fc2

        self.num_parameters = count_parameters(self.model)

        self.loss_fn = nn.BCELoss(reduction="mean")

    def forward(self, x):
        res = self.model(x)
        return res

    def training_step(self, batch, batch_idx):
        X, target = batch
        preds = self(X)  # (batch_size, 1)
        preds = preds.squeeze(1)

        loss = self.loss_fn(preds, target)
        
        self.train_amex_metric.update(preds, target)

        self.log_dict({'train_loss': loss, 'train_amex_metric': self.train_amex_metric}, on_step=True, on_epoch=True, prog_bar=True, logger=True)

        return {'loss': loss}

    def validation_step(self, batch, batch_idx):
        with torch.no_grad():
            X, target = batch
            preds = self(X)  
            preds = preds.squeeze(1)

            loss = self.loss_fn(preds, target)
            
            self.val_amex_metric.update(preds, target)

            self.log_dict({'val_loss': loss, 'val_amex_metric': self.val_amex_metric }, on_step=True, on_epoch=True, prog_bar=True, logger=True)

            return {'loss': loss}


    def predict_step(self, batch, batch_idx, dataloader_idx=None):
        with torch.no_grad():
            X = batch[0]
            preds = self(X)
            return preds.detach().cpu()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
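`configure_optimizers` uses `StepLR` with `step_size=1` and leaves `gamma` at its default of 0.1, so the learning rate is divided by 10 after every epoch. The plain-Python sketch below reproduces that schedule for the notebook's `learning_rate=8e-3`; the helper `step_lr` is illustrative and only mirrors the closed-form rule, it does not call PyTorch.

```python
def step_lr(initial_lr, epoch, step_size=1, gamma=0.1):
    """Learning rate after `epoch` completed epochs under a StepLR-style decay."""
    return initial_lr * gamma ** (epoch // step_size)

# Epochs 0..6, matching the 'epoch=6' checkpoint name used in this notebook
schedule = [step_lr(8e-3, epoch) for epoch in range(7)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

If training ran with this scheduler unchanged, the step size would be around 8e-9 by epoch 6, so a larger `gamma` or `step_size` may be worth trying.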

# Inference Loop 

In [17]:
# load_from_checkpoint is a classmethod, so no throwaway instance is needed first
model = TsLstmLightning.load_from_checkpoint(checkpoint_path=model_weights, in_features=in_features, hidden_dim=hidden_dim, num_layers=num_layers, learning_rate=learning_rate)
trainer = Trainer(gpus=[0] if torch.cuda.is_available() else None)
all_ss = []
for test_file in test_files:
    # test_df = load_test_df(test_file)
    test_df = pd.read_parquet(test_file)
    print(f"Test Shape: {test_df.shape}")
    dm = DataModule(test_df, batch_size=batch_size)
    
    customer_ID = test_df['customer_ID'].unique()

    del test_df 
    gc.collect()
    
    prediction = trainer.predict(model, dm)
    
    prediction = torch.cat(prediction).detach().numpy().squeeze()
    print(pd.DataFrame(prediction))

    # ss = pd.DataFrame({'customer_ID': customer_ID, 'prediction':prediction})
    # all_ss.append(ss)
    all_ss.append(prediction)
    del dm
    gc.collect()

  f"Setting `Trainer(gpus={gpus!r})` is deprecated in v1.7 and will be removed"
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


Test Shape: (1300000, 173)


INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

                 0             1             2             3             4   \
0      2.055982e-07  2.055982e-07  2.055982e-07  2.055982e-07  2.055982e-07   
1      2.208800e-09  2.208800e-09  2.208800e-09  2.208800e-09  2.208800e-09   
2      2.607889e-05  2.607889e-05  2.607889e-05  2.607889e-05  2.607889e-05   
3      4.951381e-03  4.951381e-03  4.951381e-03  4.951381e-03  4.951381e-03   
4      1.613368e-05  1.613368e-05  1.613368e-05  1.613368e-05  1.613368e-05   
...             ...           ...           ...           ...           ...   
99995  1.764340e-10  1.764340e-10  1.764340e-10  1.764340e-10  1.764340e-10   
99996  7.909304e-03  7.909304e-03  7.909304e-03  7.909304e-03  7.909304e-03   
99997  2.319168e-07  2.319168e-07  2.319168e-07  2.319168e-07  2.319168e-07   
99998  8.819036e-03  8.819036e-03  8.819036e-03  8.819036e-03  8.819036e-03   
99999  6.964077e-11  6.964077e-11  6.964077e-11  6.964077e-11  6.964077e-11   

                 5             6             7     

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

                 0             1             2             3             4   \
0      4.098254e-04  4.098254e-04  4.098254e-04  4.098254e-04  4.098254e-04   
1      4.370426e-09  4.370426e-09  4.370426e-09  4.370426e-09  4.370426e-09   
2      1.833527e-10  1.833527e-10  1.833527e-10  1.833527e-10  1.833527e-10   
3      4.536791e-05  4.536791e-05  4.536791e-05  4.536791e-05  4.536791e-05   
4      1.801553e-12  1.801553e-12  1.801553e-12  1.801553e-12  1.801553e-12   
...             ...           ...           ...           ...           ...   
99995  4.441290e-07  4.441290e-07  4.441290e-07  4.441290e-07  4.441290e-07   
99996  7.122105e-03  7.122105e-03  7.122105e-03  7.122105e-03  7.122105e-03   
99997  8.560260e-09  8.560260e-09  8.560260e-09  8.560260e-09  8.560260e-09   
99998  1.861886e-05  1.861886e-05  1.861886e-05  1.861886e-05  1.861886e-05   
99999  1.784711e-03  1.784711e-03  1.784711e-03  1.784711e-03  1.784711e-03   

                 5             6             7     

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

                 0             1             2             3             4   \
0      3.810533e-11  3.810533e-11  3.810533e-11  3.810533e-11  3.810533e-11   
1      1.889068e-07  1.889068e-07  1.889068e-07  1.889068e-07  1.889068e-07   
2      3.000618e-10  3.000618e-10  3.000618e-10  3.000618e-10  3.000618e-10   
3      1.101843e-04  1.101843e-04  1.101843e-04  1.101843e-04  1.101843e-04   
4      5.007942e-05  5.007942e-05  5.007942e-05  5.007942e-05  5.007942e-05   
...             ...           ...           ...           ...           ...   
99995  2.130042e-03  2.130042e-03  2.130042e-03  2.130042e-03  2.130042e-03   
99996  7.530137e-05  7.530137e-05  7.530137e-05  7.530137e-05  7.530137e-05   
99997  1.164387e-08  1.164387e-08  1.164387e-08  1.164387e-08  1.164387e-08   
99998  2.815296e-08  2.815296e-08  2.815296e-08  2.815296e-08  2.815296e-08   
99999  6.185622e-05  6.185622e-05  6.185622e-05  6.185622e-05  6.185622e-05   

                 5             6             7     

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

                 0             1             2             3             4   \
0      4.337671e-05  4.337671e-05  4.337671e-05  4.337671e-05  4.337671e-05   
1      1.232520e-09  1.232520e-09  1.232520e-09  1.232520e-09  1.232520e-09   
2      6.566179e-09  6.566179e-09  6.566179e-09  6.566179e-09  6.566179e-09   
3      6.547959e-11  6.547959e-11  6.547959e-11  6.547959e-11  6.547959e-11   
4      7.067632e-03  7.067632e-03  7.067632e-03  7.067632e-03  7.067632e-03   
...             ...           ...           ...           ...           ...   
99995  1.311469e-11  1.311469e-11  1.311469e-11  1.311469e-11  1.311469e-11   
99996  9.263034e-03  9.263034e-03  9.263034e-03  9.263034e-03  9.263034e-03   
99997  7.645303e-04  7.645303e-04  7.645303e-04  7.645303e-04  7.645303e-04   
99998  1.405386e-06  1.405386e-06  1.405386e-06  1.405386e-06  1.405386e-06   
99999  1.383934e-09  1.383934e-09  1.383934e-09  1.383934e-09  1.383934e-09   

                 5             6             7     

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

                 0             1             2             3             4   \
0      1.478228e-08  1.478228e-08  1.478228e-08  1.478228e-08  1.478228e-08   
1      2.035447e-04  2.035447e-04  2.035447e-04  2.035447e-04  2.035447e-04   
2      3.421506e-06  3.421506e-06  3.421506e-06  3.421506e-06  3.421506e-06   
3      6.037209e-05  6.037209e-05  6.037209e-05  6.037209e-05  6.037209e-05   
4      8.373846e-03  8.373846e-03  8.373846e-03  8.373846e-03  8.373846e-03   
...             ...           ...           ...           ...           ...   
99995  9.827582e-03  9.827582e-03  9.827582e-03  9.827582e-03  9.827582e-03   
99996  9.838717e-08  9.838717e-08  9.838717e-08  9.838717e-08  9.838717e-08   
99997  4.625531e-03  4.625531e-03  4.625531e-03  4.625531e-03  4.625531e-03   
99998  4.708587e-03  4.708587e-03  4.708587e-03  4.708587e-03  4.708587e-03   
99999  5.053094e-10  5.053094e-10  5.053094e-10  5.053094e-10  5.053094e-10   

                 5             6             7     

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

                 0             1             2             3             4   \
0      7.419950e-03  7.419950e-03  7.419950e-03  7.419950e-03  7.419950e-03   
1      2.499035e-06  2.499035e-06  2.499035e-06  2.499035e-06  2.499035e-06   
2      1.680810e-08  1.680810e-08  1.680810e-08  1.680810e-08  1.680810e-08   
3      3.423283e-08  3.423283e-08  3.423283e-08  3.423283e-08  3.423283e-08   
4      6.391680e-03  6.391680e-03  6.391680e-03  6.391680e-03  6.391680e-03   
...             ...           ...           ...           ...           ...   
99995  7.696554e-08  7.696554e-08  7.696554e-08  7.696554e-08  7.696554e-08   
99996  1.749536e-12  1.749536e-12  1.749536e-12  1.749536e-12  1.749536e-12   
99997  1.018905e-10  1.018905e-10  1.018905e-10  1.018905e-10  1.018905e-10   
99998  5.269848e-10  5.269848e-10  5.269848e-10  5.269848e-10  5.269848e-10   
99999  8.862520e-03  8.862520e-03  8.862520e-03  8.862520e-03  8.862520e-03   

                 5             6             7     

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

                 0             1             2             3             4   \
0      9.968466e-03  9.968466e-03  9.968466e-03  9.968466e-03  9.968466e-03   
1      6.567898e-03  6.567898e-03  6.567898e-03  6.567898e-03  6.567898e-03   
2      2.611553e-04  2.611553e-04  2.611553e-04  2.611553e-04  2.611553e-04   
3      2.382486e-11  2.382486e-11  2.382486e-11  2.382486e-11  2.382486e-11   
4      2.343451e-09  2.343451e-09  2.343451e-09  2.343451e-09  2.343451e-09   
...             ...           ...           ...           ...           ...   
99995  1.422558e-09  1.422558e-09  1.422558e-09  1.422558e-09  1.422558e-09   
99996  2.960449e-06  2.960449e-06  2.960449e-06  2.960449e-06  2.960449e-06   
99997  1.783907e-03  1.783907e-03  1.783907e-03  1.783907e-03  1.783907e-03   
99998  2.488585e-03  2.488585e-03  2.488585e-03  2.488585e-03  2.488585e-03   
99999  1.282343e-10  1.282343e-10  1.282343e-10  1.282343e-10  1.282343e-10   

                 5             6             7     

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

                 0             1             2             3             4   \
0      2.787723e-05  2.787723e-05  2.787723e-05  2.787723e-05  2.787723e-05   
1      3.063334e-10  3.063334e-10  3.063334e-10  3.063334e-10  3.063334e-10   
2      9.109149e-03  9.109149e-03  9.109149e-03  9.109149e-03  9.109149e-03   
3      4.151918e-03  4.151918e-03  4.151918e-03  4.151918e-03  4.151918e-03   
4      7.172082e-04  7.172082e-04  7.172082e-04  7.172082e-04  7.172082e-04   
...             ...           ...           ...           ...           ...   
99995  4.763151e-08  4.763151e-08  4.763151e-08  4.763151e-08  4.763151e-08   
99996  9.693175e-09  9.693175e-09  9.693175e-09  9.693175e-09  9.693175e-09   
99997  7.992134e-03  7.992134e-03  7.992134e-03  7.992134e-03  7.992134e-03   
99998  2.375791e-04  2.375791e-04  2.375791e-04  2.375791e-04  2.375791e-04   
99999  9.556767e-09  9.556767e-09  9.556767e-09  9.556767e-09  9.556767e-09   

                 5             6             7     

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

                 0             1             2             3             4   \
0      7.630185e-03  7.630185e-03  7.630185e-03  7.630185e-03  7.630185e-03   
1      2.003130e-03  2.003130e-03  2.003130e-03  2.003130e-03  2.003130e-03   
2      5.686637e-09  5.686637e-09  5.686637e-09  5.686637e-09  5.686637e-09   
3      9.763117e-03  9.763117e-03  9.763117e-03  9.763117e-03  9.763117e-03   
4      5.026405e-06  5.026405e-06  5.026405e-06  5.026405e-06  5.026405e-06   
...             ...           ...           ...           ...           ...   
99995  4.470641e-10  4.470641e-10  4.470641e-10  4.470641e-10  4.470641e-10   
99996  5.803254e-04  5.803254e-04  5.803254e-04  5.803254e-04  5.803254e-04   
99997  2.003094e-09  2.003094e-09  2.003094e-09  2.003094e-09  2.003094e-09   
99998  3.099181e-11  3.099181e-11  3.099181e-11  3.099181e-11  3.099181e-11   
99999  1.365551e-07  1.365551e-07  1.365551e-07  1.365551e-07  1.365551e-07   

                 5             6             7     

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

                 0             1             2             3             4   \
0      9.355933e-03  9.355933e-03  9.355933e-03  9.355933e-03  9.355933e-03   
1      6.176445e-05  6.176445e-05  6.176445e-05  6.176445e-05  6.176445e-05   
2      4.624709e-03  4.624709e-03  4.624709e-03  4.624709e-03  4.624709e-03   
3      9.504968e-03  9.504968e-03  9.504968e-03  9.504968e-03  9.504968e-03   
4      3.194408e-09  3.194408e-09  3.194408e-09  3.194408e-09  3.194408e-09   
...             ...           ...           ...           ...           ...   
24616  4.844218e-07  4.844218e-07  4.844218e-07  4.844218e-07  4.844218e-07   
24617  4.647495e-03  4.647495e-03  4.647495e-03  4.647495e-03  4.647495e-03   
24618  1.428834e-03  1.428834e-03  1.428834e-03  1.428834e-03  1.428834e-03   
24619  9.748556e-03  9.748556e-03  9.748556e-03  9.748556e-03  9.748556e-03   
24620  3.052128e-03  3.052128e-03  5.986717e-03  3.052128e-03  3.052128e-03   

                 5             6             7     
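In the printouts above, most rows show the same value in every visible column. One plausible explanation, given that `forward` applies a softmax to ReLU activations: every unit that ReLU sets to zero receives the identical softmax value `exp(0) / Z`. The NumPy sketch below demonstrates the effect on random, made-up activations; it does not use the trained model.

```python
import numpy as np

def softmax(z, axis=1):
    # Subtract the row maximum for numerical stability
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
pre_activation = rng.normal(size=(4, 100))  # hypothetical fc1 outputs
pre_activation[:, :5] -= 10.0               # force the first 5 units negative

hidden = np.maximum(pre_activation, 0.0)    # ReLU zeroes the negative units
probs = softmax(hidden, axis=1)

# The zeroed units share one value per row
print(probs[:, :5])
```

With 100 units, a row whose activations are all zero yields exactly 0.01 in every column, which would be consistent with the largest printed values staying just below 1e-2.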

In [18]:
column_names = [f"LSTM-Embedding{i}" for i in range(1, 101)]

In [19]:
df = pd.DataFrame()
for array in all_ss:
    new_df = pd.DataFrame(array, columns=column_names)
    print(new_df.shape)
    df = pd.concat([df, new_df], axis=0)
    print(df)

print(df.head(3))
print(df.shape)

df.to_parquet("test-LSTM-Embedding.parquet", engine="pyarrow")


(100000, 100)
       LSTM-Embedding1  LSTM-Embedding2  LSTM-Embedding3  LSTM-Embedding4  \
0         2.055982e-07     2.055982e-07     2.055982e-07     2.055982e-07   
1         2.208800e-09     2.208800e-09     2.208800e-09     2.208800e-09   
2         2.607889e-05     2.607889e-05     2.607889e-05     2.607889e-05   
3         4.951381e-03     4.951381e-03     4.951381e-03     4.951381e-03   
4         1.613368e-05     1.613368e-05     1.613368e-05     1.613368e-05   
...                ...              ...              ...              ...   
99995     1.764340e-10     1.764340e-10     1.764340e-10     1.764340e-10   
99996     7.909304e-03     7.909304e-03     7.909304e-03     7.909304e-03   
99997     2.319168e-07     2.319168e-07     2.319168e-07     2.319168e-07   
99998     8.819036e-03     8.819036e-03     8.819036e-03     8.819036e-03   
99999     6.964077e-11     6.964077e-11     6.964077e-11     6.964077e-11   

       LSTM-Embedding5  LSTM-Embedding6  LSTM-Embedding7  LST
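The saved frame above carries no customer identifiers, and its index restarts at 0 for every chunk. A possible alternative, sketched on made-up data, keeps the `customer_ID` next to each embedding row and concatenates once at the end. The `chunks` list and its values are hypothetical; in the notebook the pairs would come from the `customer_ID` array and `prediction` of each loop iteration.

```python
import numpy as np
import pandas as pd

column_names = [f"LSTM-Embedding{i}" for i in range(1, 4)]  # 100 columns in the notebook

# Hypothetical (customer ids, embedding matrix) pairs, one per test chunk
chunks = [
    (np.array(["a", "b"]), np.array([[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]])),
    (np.array(["c"]), np.array([[0.5, 0.25, 0.25]])),
]

frames = []
for ids, embedding in chunks:
    frame = pd.DataFrame(embedding, columns=column_names)
    frame.insert(0, "customer_ID", ids)  # keep the key next to its features
    frames.append(frame)

# A single concat avoids re-copying the growing frame in every iteration
result = pd.concat(frames, ignore_index=True)
print(result)
```

This pairing is only valid if the order of `test_df['customer_ID'].unique()` matches the row order of the reshaped tensor, that is, if each customer's rows are contiguous in the chunk.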

# Possible next steps

1. Handle missing values better (e.g., do not drop customers with fewer than 13 records, do not fill N/A with 0, replace the earlier -1 placeholder values with N/A, ...)
2. Enhance the model (more LSTM layers / 1D CNN / Transformers / hyperparameter optimisation / ...)
3. Understand the predictive features better through feature engineering (feature selection or permutation feature importance, for instance)
4. Combine 1 and 2: unsupervised Transformer pre-training on time series (https://arxiv.org/abs/2010.02803)
5. Ensemble the model with other classification techniques
6. Enable cross-validation (Stratified K-Fold, ...)


In [None]:
print(pd.read_parquet(train_file))

## Data split

In [None]:
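def split_data(df):
    # 1,300,000 rows = 100,000 customers x 13 statements per chunk
    rows_per_chunk = 1300000
    for i in range(10):
        # iloc clips at the end of the frame, so the last chunk may be shorter
        df_store = df.iloc[i * rows_per_chunk:(i + 1) * rows_per_chunk, :]
        df_store.to_parquet(f"./input/FilledWithRandomForest/test_RandomForest(PreCompressed)_{i}.parquet")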
# to_parquet("./input/FilledWithRandomForest/test_RandomForest(PreCompressed)_0.parquet")
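The slice boundaries used by `split_data` can be checked without loading any data. The sketch below computes them for a row count inferred from the prediction printouts earlier in this notebook (nine chunks of 100,000 customers and a last one ending at index 24620, with 13 rows per customer). The helper `chunk_bounds` is illustrative.

```python
def chunk_bounds(n_rows, rows_per_chunk):
    """Return (start, end) row offsets that cover n_rows in fixed-size chunks."""
    return [(start, min(start + rows_per_chunk, n_rows))
            for start in range(0, n_rows, rows_per_chunk)]

SEQ_LEN = 13
n_customers = 9 * 100_000 + 24_621  # inferred from the printed outputs
n_rows = n_customers * SEQ_LEN

bounds = chunk_bounds(n_rows, 100_000 * SEQ_LEN)
for i, (start, end) in enumerate(bounds):
    print(i, start, end, (end - start) // SEQ_LEN)
```

Because every chunk size is a multiple of 13, no customer is split across two files as long as each customer has exactly 13 contiguous rows.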

In [None]:
#print(pd.read_parquet("./input/FilledWithRandomForest/test_RandomForest(PreCompressed)_0.parquet"))