**<font size=5>Santander-Customer-Transaction-Prediction</font>**

This is the second Kaggle competition I have completed. The task is to predict whether a Santander customer will make a specific transaction.
This kernel records how the work was done.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
import random
import lightgbm as lgb
import seaborn as sns

**<font size=5>Loading data</font>**

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train_copy = train.copy()
test_copy = test.copy()

**<font size=5>Data exploration</font>**

In [None]:
train_copy.head()

In [None]:
test_copy.head()

In [None]:
train_copy.shape

In [None]:
test_copy.shape

As we can see, both the training and testing data contain 200k rows of customer records with 200 anonymized features (no real-world names). Each set also has a customer ID column, and the training set has an additional target column that indicates whether the customer made the transaction.

In [None]:
if (train_copy.isnull().sum().sum() == 0):
    print('There is no missing value in training set')
else:
    print('There are ' + str(train_copy.isnull().sum().sum()) + ' missing values in training set')

In [None]:
if (test_copy.isnull().sum().sum() == 0):
    print('There is no missing value in testing set')
else:
    print('There are ' + str(test_copy.isnull().sum().sum()) + ' missing values in testing set')

Great! There are no missing values in either the training or the testing set.

In [None]:
train_copy.describe()

In [None]:
test_copy.describe()

As shown, the summary statistics of each feature are quite similar between the training and testing data.

In [None]:
# Pass the column as x explicitly; newer seaborn releases expect keyword arguments here
sns.countplot(x=train_copy['target'])

So the data are imbalanced: only about 10% of the records have target 1.

To handle this problem, one can either undersample the '0' records down to roughly the number of positive records using the 'sample()' command, at the cost of losing a lot of information, or apply a resampling method to create additional positive records.
In my experiments, resampling did not improve the AUC much, while undersampling improved it, but not enough.
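
The undersampling option can be sketched on a toy frame. This is only a minimal illustration, not the code behind the final submission; the frame `toy`, its single feature, and the fixed `random_state` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy data with roughly the same 10% positive rate as the competition target
# (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'var_0': rng.normal(size=1000),
    'target': np.repeat([0, 1], [900, 100]),
})

positives = toy[toy['target'] == 1]
negatives = toy[toy['target'] == 0]

# Keep every positive row and draw an equally sized random subset of negatives.
negatives_down = negatives.sample(n=len(positives), random_state=0)

# Shuffle so the two classes are interleaved.
balanced = pd.concat([positives, negatives_down]).sample(frac=1, random_state=0)
class_counts = balanced['target'].value_counts().to_dict()
print(class_counts)
```

The 800 discarded negative rows are the information loss mentioned above.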

In [None]:
train_y = train_copy['target'].values
train_X_column_name = train_copy.drop(['target', 'ID_code'], axis=1).columns
train_X = train_copy.drop(['target', 'ID_code'], axis=1).values
test_X = test_copy.drop(['ID_code'], axis=1).values
train_X_copy = train_X.copy()
test_X_copy = test_X.copy()

So far, I have separated the training data into features and targets, and extracted the features from the testing data.

**<font size=5>Preprocessing data</font>**

Thanks to Felipe Mello's idea of [detecting fake data](http://www.kaggle.com/felipemello/step-by-step-guide-to-the-magic-lb-0-922) in the test set, the AUC score of my results improved a lot.

So, the basic ideas are:
1. There exists fake data in the test dataset
2. The unique values in each column of the test set are important

To find the real samples, I look for samples that have a unique value in at least one feature among all 200k records, because a value that occurs only once cannot have been copied from another sample. Conversely, if none of a sample's feature values is unique, the sample was probably synthesized from others.
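
The uniqueness test can be checked by hand on a tiny matrix. The sketch below uses a hypothetical 4-sample, 2-feature array `toy_X` in which the last row only reuses values from other rows; it mirrors the `np.unique` call used in this kernel but is not the kernel's own cell.

```python
import numpy as np

# Hypothetical data: the last row borrows 1.0 from row 0 and 6.0 from row 1.
toy_X = np.array([
    [1.0, 5.0],
    [2.0, 6.0],
    [3.0, 7.0],
    [1.0, 6.0],
])

unique_flag = np.zeros(toy_X.shape, dtype=bool)
for feature in range(toy_X.shape[1]):
    # return_index gives the first row where each distinct value occurs,
    # return_counts gives how often that value occurs in the column.
    _, first_index, counts = np.unique(
        toy_X[:, feature], return_index=True, return_counts=True
    )
    # A value with count 1 occurs only at its first index, so flag that row.
    unique_flag[first_index[counts == 1], feature] = True

# A row counts as real if at least one of its values is unique in its column.
has_unique_value = unique_flag.any(axis=1)
print(has_unique_value)
```

Only the last row ends up without any unique value, so it is the one flagged as probably synthetic.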

In [None]:
# unique_count[i, j] is 1 when sample i holds a value that occurs only once in feature j
unique_count = np.zeros_like(test_X)
for feature in range(test_X.shape[1]):
    _, index_, count_ = np.unique(test_X[:, feature], return_index=True, return_counts=True)
    unique_count[index_[count_ == 1], feature] += 1

# Real samples have at least one unique value; synthetic samples have none
real_sample_index = np.argwhere(np.sum(unique_count, axis=1) > 0)[:, 0]
synthetic_sample_index = np.argwhere(np.sum(unique_count, axis=1) == 0)[:, 0]

test_X_real = test_X[real_sample_index].copy()

Then we show how many real and synthetic samples are in the test set.

In [None]:
print('There are ' + str(len(real_sample_index)) + ' real data samples in test set')
print('There are ' + str(len(synthetic_sample_index)) + ' synthetic data samples in test set')

This is a reasonable result: there are exactly 100k real and 100k synthetic test samples.

For each synthetic sample, I look for the features whose value matches exactly one sample in the real set. That matching real sample must be one of the synthetic sample's generators.
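
A small worked case may make the generator search easier to follow. The arrays `toy_real` and `toy_synthetic` below are hypothetical; the masking steps follow the same idea as the kernel's loop, applied to a single synthetic sample.

```python
import numpy as np

# Three hypothetical real samples and one synthetic sample built from their values.
toy_real = np.array([
    [1.0, 5.0, 9.0],
    [2.0, 6.0, 8.0],
    [3.0, 5.0, 7.0],
])
toy_synthetic = np.array([2.0, 5.0, 7.0])

# matches[i, j] is True when real sample i has the synthetic value in feature j.
matches = toy_real == toy_synthetic

# Keep only features where exactly one real sample matches; feature 1 is
# ambiguous here because the value 5.0 appears in two real samples.
features_mask = matches.sum(axis=0) == 1

# A real sample is a verified generator if it matches on any unambiguous feature.
verified_mask = matches[:, features_mask].any(axis=1)
generators = np.flatnonzero(verified_mask)
print(generators)
```

Real sample 0 also shares the value 5.0 with the synthetic sample, but since that match is ambiguous it is not counted as a verified generator.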

In [None]:
generator_for_each_synthetic_sample = []
for cur_sample_index in synthetic_sample_index[:20000]:
    cur_synthetic_sample = test_X[cur_sample_index]
    potential_generators = test_X_real == cur_synthetic_sample

    features_mask = np.sum(potential_generators, axis=0) == 1
    verified_generators_mask = np.any(potential_generators[:, features_mask], axis=1)
    verified_generators_for_sample = real_sample_index[np.argwhere(verified_generators_mask)[:, 0]]
    generator_for_each_synthetic_sample.append(set(verified_generators_for_sample))


Next, find the public and private leaderboard splits among the real samples.

In [None]:
public_LB = generator_for_each_synthetic_sample[0]
for x in generator_for_each_synthetic_sample:
    if public_LB.intersection(x):
        public_LB = public_LB.union(x)

private_LB = generator_for_each_synthetic_sample[1]
for x in generator_for_each_synthetic_sample:
    if private_LB.intersection(x):
        private_LB = private_LB.union(x)

private_LB = list(private_LB)
public_LB = list(public_LB)
full = np.concatenate([train_X, np.concatenate([test_X[private_LB], test_X[public_LB]])])

In [None]:
print('There are ' + str(len(public_LB)) + ' data samples for public score in real data set')
print('There are ' + str(len(private_LB)) + ' data samples for private score in real data set')

This is expected, because half of the real test data is used for the private score and the other half for the public score.
This technique is specific to the competition, so it may not be helpful in the real world.
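
The grouping step relies on the fact that generator sets from the same split overlap, while sets from different splits do not. Here is a minimal sketch of that merge with hypothetical index sets; `grow_group` is a helper name introduced only for this example.

```python
# Hypothetical generator sets: indices 0-3 belong to one split, 10-12 to the other.
generator_sets = [{0, 1}, {10, 11}, {1, 2}, {11, 12}, {2, 3}]

def grow_group(seed, candidate_sets):
    """Single pass: absorb every set that overlaps the group built so far."""
    group = set(seed)
    for candidate in candidate_sets:
        if group & candidate:
            group |= candidate
    return group

# As in the kernel, the two groups are seeded with the first two sets, which
# presumes those two sets come from different splits.
group_a = grow_group(generator_sets[0], generator_sets)
group_b = grow_group(generator_sets[1], generator_sets)
print(sorted(group_a), sorted(group_b))
```

Because the merge is a single pass, a set that only connects to the group through a later set would be missed; with many overlapping sets this is unlikely to matter, but it is a limitation of the simple loop.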

In [None]:
full = pd.DataFrame(full)
full.columns = train_X_column_name
train_X = pd.DataFrame(train_X)
train_X.columns = train_X_column_name
test_X = pd.DataFrame(test_X)
test_X.columns = train_X_column_name

for feat in ['var_' + str(x) for x in range(200)]:
    count_values = full.groupby(feat)[feat].count()
    train_X['new_' + feat] = count_values.loc[train_X[feat]].values
    test_X['new_' + feat] = count_values.loc[test_X[feat]].values

After combining the training data with the real test data (which has no target), I created a new column for each feature that holds the number of times each value occurs in the combined data.

It is OK to take the test data into consideration here because no test targets are involved: only the training data and their corresponding targets are plugged into the model.
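
The value-count feature is easy to verify on a toy column. The sketch below uses `value_counts` with `map` instead of the `groupby` and `.loc` lookup in the cell above; this is a swapped-in equivalent for illustration, and the frames `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd

# Hypothetical single-feature frames.
toy_train = pd.DataFrame({'var_0': [1.5, 2.0, 1.5]})
toy_test = pd.DataFrame({'var_0': [2.0, 3.0, 1.5]})

# Count how often each value occurs across train and test together.
combined = pd.concat([toy_train, toy_test], ignore_index=True)
counts = combined['var_0'].value_counts()

# Attach the count of each row's own value as a new column.
toy_train['new_var_0'] = toy_train['var_0'].map(counts)
toy_test['new_var_0'] = toy_test['var_0'].map(counts)
print(toy_train)
print(toy_test)
```

One difference in behaviour: `map` yields NaN for a value that is absent from `counts`, whereas a `.loc` lookup with a missing label raises a `KeyError`.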

In [None]:
seed = 0
param = {
    'num_leaves': 8,
    'min_data_in_leaf': 17,
    'learning_rate': 0.01,
    'min_sum_hessian_in_leaf': 9.67,
    'bagging_fraction': 0.8329,
    'bagging_freq': 2,
    'feature_fraction': 1,
    'lambda_l1': 0.6426,
    'lambda_l2': 0.3067,
    'min_gain_to_split': 0.02832,
    'max_depth': -1,
    'seed': seed,
    'feature_fraction_seed': seed,
    'bagging_seed': seed,
    'drop_seed': seed,
    'data_random_seed': seed,
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'verbosity': -1,
    'metric': 'auc',
    'is_unbalance': True,
    'save_binary': True,
    'boost_from_average': 'false',
    'num_threads': 8
}


iterations = 110
test_hat = np.zeros([200000, 200])
for i, feature in enumerate(['var_' + str(x) for x in range(200)]):  # loop over all features
    # Train one small model on the raw feature and its value-count column
    feat_choices = [feature, 'new_' + feature]
    lgb_train = lgb.Dataset(train_X[feat_choices], train_y)
    # verbose_eval is omitted: it was removed from lgb.train in LightGBM 4.0,
    # and 'verbosity': -1 in param already silences logging
    gbm = lgb.train(param, lgb_train, iterations)
    test_hat[:, i] = gbm.predict(test_X[feat_choices], num_iteration=gbm.best_iteration)

sub_preds = test_hat.sum(axis=1)

So I used the Light Gradient Boosting Machine (LightGBM) here.

The reason is that LightGBM is an ensemble of decision trees, which typically performs better than a single tree on tabular data, and 'Light' refers to its faster training compared with conventional gradient boosting implementations.

Another trick here is that I applied LightGBM column by column, which sped up the process and gave a better result.
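
Since the per-feature predictions are combined with a plain sum, the submitted scores can exceed 1. AUC depends only on how the samples are ordered, so this does not affect the metric. The sketch below checks that on hypothetical scores; `auc_from_scores` is a small helper written for this example, not a library function.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Hypothetical targets and per-feature predictions (4 samples, 2 features).
y = np.array([0, 0, 1, 1])
per_feature = np.array([
    [0.10, 0.30],
    [0.50, 0.50],
    [0.35, 0.60],
    [0.80, 0.70],
])

summed = per_feature.sum(axis=1)      # not a probability: the largest value is above 1
rescaled = summed / summed.max()      # squeezed into [0, 1], same ordering

auc_summed = auc_from_scores(y, summed)
auc_rescaled = auc_from_scores(y, rescaled)
print(auc_summed, auc_rescaled)
```

Rescaling by a positive constant keeps the ordering, so both versions get the same AUC.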

In [None]:
sub = pd.DataFrame()
sub['ID_code'] = test_copy['ID_code']
sub['target'] = sub_preds
sub.to_csv('submission.csv', index=False)