# The task

This [Kaggle competition](https://www.kaggle.com/c/rpaa-hackathon-time-series/overview) poses a time series classification problem.

## The data
We use data about customers of a bank. For each of the past 10 months, we extracted 10 numerical features describing the customer's activity in that month.

## The target variable
The task is to predict whether a given customer will buy a bank product if contacted by the bank, based on their activity over the past 10 months.

## Goal of the competition
The goal is to build the model with the highest performance, measured by AUC, on the test customers. You can explore various time series methods to beat the top score.


# Load data

The data will be loaded into NumPy as a 3D array with the following dimensions:

- Sample (Customer): Each sample consists of the last 10 transactions of a single customer.
- Time Steps (Transactions): The last 10 transactions of a customer, ordered by time, with the newest one last.
- Features: Properties of each transaction. Each transaction has 10 features, scaled to the range 0 to 1.

The target y is binary: 0 means the customer will not take another product if offered one, and 1 means they will.

In [97]:
from functools import reduce
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
import tensorflow as tf

In [102]:
number_of_features = 10

def load_data_into_numpy(file_path, number_of_features):
    loaded_2d_array = np.loadtxt(file_path)
    # Reshape the flat 2D array back into the original 3D matrix (samples, time steps, features).
    loaded_3d_array = loaded_2d_array.reshape(loaded_2d_array.shape[0], loaded_2d_array.shape[1] // number_of_features, number_of_features)
    return loaded_3d_array

X = load_data_into_numpy('X_train.csv', number_of_features)
y = np.loadtxt('y_train.csv').astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_test_submission = load_data_into_numpy('X_test.csv', number_of_features)
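
To make the reshape concrete, here is a minimal sketch on a made-up array (2 customers, 3 time steps, 4 features; these sizes are illustrative, not the competition's). It assumes, as `load_data_into_numpy` does, that each row of the flat file stores the time steps one after another, with all features of a time step kept together.

```python
import numpy as np

n_customers, n_steps, n_features = 2, 3, 4  # toy sizes, for illustration only

# A toy 3D array standing in for the real data.
original = np.arange(n_customers * n_steps * n_features).reshape(
    n_customers, n_steps, n_features
)

# Flatten each customer to one row, as it would be stored in a text file.
flat = original.reshape(n_customers, n_steps * n_features)

# Recover the 3D layout the same way load_data_into_numpy does.
restored = flat.reshape(flat.shape[0], flat.shape[1] // n_features, n_features)

print(flat.shape)      # (2, 12)
print(restored.shape)  # (2, 3, 4)
```

The round trip only recovers the original array if the file was written with the same row-major layout, which is what the loader relies on.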

In [8]:
display(pd.DataFrame(X_train[0, :, :], columns=[f'f_{num}' for num in range(10)]))

| | f_0 | f_1 | f_2 | f_3 | f_4 | f_5 | f_6 | f_7 | f_8 | f_9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.503082 | 0.504669 | 0.499191 | 0.501236 | 0.498047 | 0.504181 | 0.502396 | 0.495938 | 0.500259 | 0.500504 |
| 1 | 0.503113 | 0.503388 | 0.49968 | 0.500992 | 0.497269 | 0.50531 | 0.502121 | 0.494087 | 0.500061 | 0.501007 |
| 2 | 0.50267 | 0.501328 | 0.499283 | 0.500626 | 0.49855 | 0.506775 | 0.502304 | 0.496088 | 0.499863 | 0.501267 |
| 3 | 0.502808 | 0.499451 | 0.499542 | 0.500366 | 0.501068 | 0.506592 | 0.502365 | 0.495688 | 0.499542 | 0.501785 |
| 4 | 0.502716 | 0.497269 | 0.499939 | 0.499954 | 0.502365 | 0.497574 | 0.502319 | 0.494637 | 0.499557 | 0.502335 |
| 5 | 0.502762 | 0.499481 | 0.499512 | 0.499466 | 0.498932 | 0.500244 | 0.502731 | 0.492986 | 0.499496 | 0.502167 |
| 6 | 0.502029 | 0.499313 | 0.499924 | 0.499054 | 0.49649 | 0.502258 | 0.503143 | 0.48985 | 0.49939 | 0.502136 |
| 7 | 0.502106 | 0.498108 | 0.499985 | 0.4991 | 0.501556 | 0.502152 | 0.503998 | 0.489967 | 0.499603 | 0.502396 |
| 8 | 0.502121 | 0.5 | 0.499634 | 0.498962 | 0.501984 | 0.503433 | 0.504364 | 0.490184 | 0.499557 | 0.50206 |
| 9 | 0.501892 | 0.496704 | 0.5 | 0.498749 | 0.497482 | 0.497742 | 0.503876 | 0.49357 | 0.49971 | 0.501831 |


The target in this problem is balanced: about half of the training labels are 1.

In [23]:
y_train.mean()

0.49943416771944443

# Baseline model RandomForestClassifier

Let's build a baseline model, using a standard classifier, without taking into consideration the time dimension of the data.

In this case, we take the average value of each feature over the last 10 transactions.

In [61]:
X_train_classifier = X_train.mean(axis=1)
X_test_classifier = X_test.mean(axis=1)

rf = RandomForestClassifier(max_depth=5, class_weight='balanced')
# NOTE: cross_val_score defaults to accuracy for classifiers; pass scoring='roc_auc' for a CV AUC.
auc_val_rf = np.mean(cross_val_score(rf, X_train_classifier, y_train))
rf.fit(X_train_classifier, y_train)
auc_test_rf = roc_auc_score(y_test, rf.predict_proba(X_test_classifier)[:,1])

print(f'CV performance of Random Forest is {auc_val_rf}')
print(f'Test performance of Random Forest is {auc_test_rf}')

CV performance of Random Forest is 0.6442375
Test performance of Random Forest is 0.6920054766131971
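
Note that `cross_val_score` uses the estimator's default scorer, which for classifiers is accuracy, so the CV figure above is not directly comparable to the test AUC. A minimal sketch of requesting AUC explicitly, shown on synthetic data from `make_classification` (an assumption made so the snippet is self-contained; the sizes and seeds are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the averaged features (not the competition data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)

# scoring='roc_auc' makes each fold report AUC instead of accuracy.
auc_folds = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring='roc_auc')
print(f'CV AUC: {auc_folds.mean():.3f} +/- {auc_folds.std():.3f}')
```

With this change, the CV and hold-out numbers measure the same thing and can be compared directly.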


## Add summary features

In [93]:
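def gen_aggregate_feats(X):
    operations = ['mean', 'std', 'max', 'min', 'ptp', 'prod']
    # Apply each operation along the time axis (1) and the feature axis (2).
    # np.<op>(X, ...) is used because ndarray.ptp was removed in NumPy 2.0.
    feat_list = [getattr(np, op)(X, axis=ax) for op in operations for ax in [1,2]]

    return np.concatenate(feat_list, axis=1)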

X_train_classifier = gen_aggregate_feats(X_train)
X_test_classifier = gen_aggregate_feats(X_test)

rf = RandomForestClassifier(max_depth=5, class_weight='balanced')
# Default scoring is accuracy, so this CV figure is not an AUC.
auc_val_rf = np.mean(cross_val_score(rf, X_train_classifier, y_train))
rf.fit(X_train_classifier, y_train)
auc_test_rf = roc_auc_score(y_test, rf.predict_proba(X_test_classifier)[:,1])

print(f'CV performance of Random Forest is {auc_val_rf}')
print(f'Test performance of Random Forest is {auc_test_rf}')

CV performance of Random Forest is 0.6789125
Test performance of Random Forest is 0.7451180079796056
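
As a rough illustration of what `gen_aggregate_feats` produces: each operation is applied once along the time axis (axis 1) and once along the feature axis (axis 2), and each application yields 10 columns, so 6 operations give 6 × 2 × 10 = 120 columns per customer. The sketch below checks this on random data; it uses the `np.<op>(X, axis=...)` function form rather than array methods, which is an assumption made for portability across NumPy versions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((5, 10, 10))  # 5 made-up customers, 10 time steps, 10 features

operations = ['mean', 'std', 'max', 'min', 'ptp', 'prod']

# One block of columns per (operation, axis) pair.
blocks = [getattr(np, op)(X_demo, axis=ax) for op in operations for ax in (1, 2)]
features = np.concatenate(blocks, axis=1)

print(features.shape)  # (5, 120)
```

The first 10 columns are the per-feature means over time, i.e. exactly the features used by the first baseline.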


In [95]:
X_classifier = gen_aggregate_feats(X)
rf.fit(X_classifier, y)
X_test_submission_classifier_rf = gen_aggregate_feats(X_test_submission)
y_test_submission_proba_rf = rf.predict_proba(X_test_submission_classifier_rf)[:, 1]

## Use all raw rows and features per customer

In [63]:
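# Flatten each customer's 10 x 10 matrix into a single row of 100 values.
X_train_classifier = X_train.reshape((X_train.shape[0], -1))
X_test_classifier = X_test.reshape((X_test.shape[0], -1))

rf = RandomForestClassifier(max_depth=5, class_weight='balanced')
# Default scoring is accuracy, so this CV figure is not an AUC.
auc_val_rf = np.mean(cross_val_score(rf, X_train_classifier, y_train))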
rf.fit(X_train_classifier, y_train)
auc_test_rf = roc_auc_score(y_test, rf.predict_proba(X_test_classifier)[:,1])

print(f'CV performance of Random Forest is {auc_val_rf}')
print(f'Test performance of Random Forest is {auc_test_rf}')

CV performance of Random Forest is 0.6512375
Test performance of Random Forest is 0.7022538251227204


## Summary + raw feats

In [105]:
X_train_classifier = np.append(gen_aggregate_feats(X_train), X_train.reshape((X_train.shape[0], 100)), axis=1)
X_test_classifier = np.append(gen_aggregate_feats(X_test), X_test.reshape((X_test.shape[0], 100)), axis=1)

rf = RandomForestClassifier(max_depth=5, class_weight='balanced')
# Default scoring is accuracy, so this CV figure is not an AUC.
auc_val_rf = np.mean(cross_val_score(rf, X_train_classifier, y_train))
rf.fit(X_train_classifier, y_train)
auc_test_rf = roc_auc_score(y_test, rf.predict_proba(X_test_classifier)[:,1])

print(f'CV performance of Random Forest is {auc_val_rf}')
print(f'Test performance of Random Forest is {auc_test_rf}')

CV performance of Random Forest is 0.6782625
Test performance of Random Forest is 0.7455067966335072


## Summary of raw + diff

In [158]:
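def gen_aggregate_feats(X):
    operations = ['mean', 'std', 'max', 'min', 'ptp', 'prod', 'sum']
    # Summaries of the raw values along the time axis (1) and the feature axis (2).
    # np.<op>(X, ...) is used because ndarray.ptp was removed in NumPy 2.0.
    feat_list = [getattr(np, op)(X, axis=ax) for op in operations for ax in [1,2]]

    # The same summaries computed on the differences between neighbouring entries.
    feat_list += [getattr(np, op)(np.diff(X, axis=ax), axis=ax) for op in operations for ax in [1,2]]

    return np.concatenate(feat_list, axis=1)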

X_train_classifier = gen_aggregate_feats(X_train)
X_test_classifier = gen_aggregate_feats(X_test)

In [114]:
rf = RandomForestClassifier(max_depth=5, class_weight='balanced')
# Default scoring is accuracy, so this CV figure is not an AUC.
auc_val_rf = np.mean(cross_val_score(rf, X_train_classifier, y_train))
rf.fit(X_train_classifier, y_train)
auc_test_rf = roc_auc_score(y_test, rf.predict_proba(X_test_classifier)[:,1])

print(f'CV performance of Random Forest is {auc_val_rf}')
print(f'Test performance of Random Forest is {auc_test_rf}')

CV performance of Random Forest is 0.710875
Test performance of Random Forest is 0.7897193365244342
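
The extra features here summarise step-to-step changes instead of levels. A small sketch of what `np.diff` along the time axis does, using a hand-made customer (the values are invented for illustration):

```python
import numpy as np

# One customer, 4 time steps, 2 features (toy values).
X_demo = np.array([[[0.50, 0.10],
                    [0.52, 0.10],
                    [0.51, 0.30],
                    [0.55, 0.20]]])

# Differences between consecutive time steps: shape goes from (1, 4, 2) to (1, 3, 2).
steps = np.diff(X_demo, axis=1)

# Aggregating the differences gives one value per feature.
mean_change = steps.mean(axis=1)  # average step-to-step change
max_change = steps.max(axis=1)    # largest single increase

print(steps.shape)  # (1, 3, 2)
print(mean_change)
print(max_change)
```

The mean of the differences equals the overall change from first to last step divided by the number of steps taken, so it acts as a simple trend feature.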


In [115]:
X_classifier = gen_aggregate_feats(X)
rf.fit(X_classifier, y)
X_test_submission_classifier_rf = gen_aggregate_feats(X_test_submission)
y_test_submission_proba_rf = rf.predict_proba(X_test_submission_classifier_rf)[:, 1]

# Build baseline model LSTM

Now we can take the time dimension of the data into account.

In [16]:
# One-hot encode the target so it matches the 2-unit softmax output below.
y_encoded = np.vstack([1-y_train, y_train]).T

lstm = Sequential()
# Input: 10 time steps with 10 features each.
lstm.add(LSTM(32, input_shape=(10, 10)))
lstm.add(Dense(2, activation = "softmax"))
lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=[tf.keras.metrics.AUC()])
lstm.fit(X_train, y_encoded, epochs=100, batch_size=100, validation_split=0.1, verbose=2)


Epoch 1/100
900/900 - 4s - loss: 0.6934 - auc: 0.4990 - val_loss: 0.6931 - val_auc: 0.5244
Epoch 2/100
900/900 - 2s - loss: 0.6930 - auc: 0.5038 - val_loss: 0.6926 - val_auc: 0.5143
Epoch 3/100
900/900 - 2s - loss: 0.6913 - auc: 0.5116 - val_loss: 0.6932 - val_auc: 0.4950
Epoch 4/100
900/900 - 2s - loss: 0.6827 - auc: 0.5399 - val_loss: 0.6854 - val_auc: 0.5177
Epoch 5/100
900/900 - 2s - loss: 0.6682 - auc: 0.5801 - val_loss: 0.6648 - val_auc: 0.5924
Epoch 6/100
900/900 - 2s - loss: 0.6647 - auc: 0.5900 - val_loss: 0.6649 - val_auc: 0.5898
Epoch 7/100
900/900 - 2s - loss: 0.6643 - auc: 0.5900 - val_loss: 0.6622 - val_auc: 0.5932
Epoch 8/100
900/900 - 2s - loss: 0.6631 - auc: 0.5939 - val_loss: 0.6726 - val_auc: 0.5718
Epoch 9/100
900/900 - 2s - loss: 0.6615 - auc: 0.5984 - val_loss: 0.6618 - val_auc: 0.5885
Epoch 10/100
900/900 - 2s - loss: 0.6585 - auc: 0.6066 - val_loss: 0.6612 - val_auc: 0.5806
Epoch 11/100
900/900 - 2s - loss: 0.6531 - auc: 0.6192 - val_loss: 0.6516 - val_auc: 0.62

<tensorflow.python.keras.callbacks.History at 0x7f909cb8da00>

In [17]:
# Sequential.predict_proba was removed in TensorFlow 2.6; predict returns the softmax probabilities.
y_test_proba_lstm = lstm.predict(X_test)[:, 1]
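
Taking column 1 of the prediction works because the two-column target built with `np.vstack([1 - y_train, y_train]).T` is a one-hot encoding: column 0 flags class 0 and column 1 flags class 1, so column 1 of the softmax output can be read as the probability of a purchase. A quick check on a made-up label vector:

```python
import numpy as np

y_demo = np.array([0, 1, 1, 0])  # invented labels

# Same construction as in the LSTM cell.
y_encoded_demo = np.vstack([1 - y_demo, y_demo]).T

# Equivalent one-hot encoding via indexing into an identity matrix.
one_hot = np.eye(2, dtype=int)[y_demo]

print(y_encoded_demo)
```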



# Prepare submission files

In [84]:
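# Score the competition test set (X_test_submission), not the local hold-out split.
y_test_submission_proba_lstm = lstm.predict(X_test_submission)[:, 1]
submission = pd.DataFrame({'y_pred': y_test_submission_proba_lstm})
submission.index.name = 'id'
submission.to_csv('lstm_submission.csv')

# y_test_submission_proba_rf was computed above from the Random Forest refitted on all training data.
submission = pd.DataFrame({'y_pred': y_test_submission_proba_rf})
submission.index.name = 'id'
submission.to_csv('rf_submission.csv')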

In [59]:
# rf was last fitted on the aggregate features, so the submission data needs the same transformation.
X_test_submission_classifier_rf = gen_aggregate_feats(X_test_submission)
y_test_submission_proba_rf = rf.predict_proba(X_test_submission_classifier_rf)[:, 1]

In [116]:
submission = pd.DataFrame({'y_pred': y_test_submission_proba_rf})
submission.index.name = 'id'
submission.to_csv('rf_submission.csv')
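
For reference, a sketch of the file layout this produces: the unnamed integer index is written as the `id` column, followed by `y_pred`. The probabilities below are placeholders, and the snippet writes to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd

proba_demo = np.array([0.25, 0.75, 0.5])  # placeholder probabilities

submission_demo = pd.DataFrame({'y_pred': proba_demo})
submission_demo.index.name = 'id'

# Write to a string buffer so the result can be inspected without touching disk.
buffer = io.StringIO()
submission_demo.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```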

# Ideas for improvement


- LSTM improvements, e.g. early stopping, dropout, more layers
- Hyperparameter optimization
- Other time series models to try:
    - Hidden Markov models
    - Conditional random fields
    - GRU
    - Dilated Recurrent Neural Networks
    - CNN
    - Fully Convolutional Network
    - LSTM + CNN
    - Rotation Forest / Proximity Forest
- Packages to check:
    - [neural_prophet](https://github.com/ourownstory/neural_prophet)
    - [pyts](https://johannfaouzi.github.io/pyts/)
    - [cesium](https://github.com/cesium-ml/cesium)
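
As one possible starting point for the hyperparameter optimization item, here is a minimal grid search over the Random Forest depth and size, scored by AUC. The data is synthetic and the grid values are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the engineered features (not the competition data).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Illustrative grid only; real ranges would depend on the feature set.
param_grid = {'max_depth': [3, 5], 'n_estimators': [25, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring='roc_auc',  # the competition metric
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'Best CV AUC: {search.best_score_:.3f}')
```

On the real data, the same pattern would take the aggregate features and `y_train` in place of the synthetic arrays.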
    
Sources:
- https://dzlab.github.io/timeseries/2018/11/24/timeseries-classification/