# Santander Value Prediction Challenge
Link [here](https://www.kaggle.com/c/santander-value-prediction-challenge/data).

First, we make the usual imports for this type of project. The data is structured (tabular), and since a random forest is the only technique I know right now, let's start with that.

## Imports

In [133]:
%load_ext autoreload
%autoreload 2

%matplotlib inline


In [134]:
from fastai.imports import *
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display

from sklearn import metrics
import feather

In [135]:
PATH = "data/santander/"
!ls {PATH}

sample_submission.csv  test.csv  train.csv


## The problem

According to Epsilon research, 80% of customers are more likely to do business with you if you provide personalized service. Banking is no exception.

The digitalization of everyday lives means that customers expect services to be delivered in a personalized and timely manner… and often before they've even realized they need the service. In their 3rd Kaggle competition, Santander Group aims to go a step beyond recognizing that there is a need to provide a customer a financial service and intends to determine the amount or value of the customer's transaction. This means anticipating customer needs in a more concrete, but also simple and personal way. With so many choices for financial services, this need is greater now than ever before.

In this competition, Santander Group is asking Kagglers to help them identify the value of transactions for each potential customer. This is a first step that Santander needs to nail in order to personalize their services at scale.

## The data

We are provided with an anonymized dataset containing numeric feature variables, the numeric target column, and a string ID column. The task is to predict the value of the target column in the test set.

### File descriptions

- `train.csv` - the training set
- `test.csv` - the test set
- `sample_submission.csv` - a sample submission file in the correct format

## Let's look at the data

In [136]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

df_raw = pd.read_csv(f'{PATH}train.csv', low_memory=False)
df_test_raw = pd.read_csv(f'{PATH}test.csv', low_memory=False)

In [137]:
display_all(df_raw.tail().T)

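| | 4454 | 4455 | 4456 | 4457 | 4458 |
|---|---|---|---|---|---|
| ID | ff85154c8 | ffb6b3f4f | ffcf61eb6 | ffea67e98 | ffeb15d25 |
| target | 1.065e+06 | 48000 | 2.8e+06 | 1e+07 | 2e+07 |
| 48df886f9 | 0 | 0 | 0 | 0 | 0 |
| 0deb4b6a8 | 0 | 0 | 0 | 0 | 0 |
| 34b15f335 | 0 | 0 | 0 | 0 | 0 |
| a8cb14b00 | 0 | 0 | 0 | 0 | 0 |
| 2f0771a37 | 0 | 0 | 0 | 0 | 0 |
| 30347e683 | 0 | 0 | 0 | 0 | 0 |
| d08d1fbe3 | 0 | 0 | 0 | 0 | 0 |
| 6ee66e115 | 0 | 0 | 0 | 0 | 0 |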


There's not much we can conclude here. The data looks simple, but most of the visible values are zero, which leads me to think that either:

1. The zeroes are missing data and we should ignore them, or
2. The zeroes actually mean something.

Unfortunately the feature names are meaningless: anonymized hex strings that tell us nothing about what they measure. The sheer number of zeroes leads me to hypothesize that some of these features are zero in every row. If so, we can safely remove those columns from our prediction model. To check this, we'll describe the data.

In [138]:
description = df_raw.describe().T
display_all(description)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| target | 4459.0 | 5.944923e+06 | 8.234312e+06 | 30000.0 | 600000.0 | 2260000.0 | 8000000.00 | 4.000000e+07 |
| 48df886f9 | 4459.0 | 1.465493e+04 | 3.893298e+05 | 0.0 | 0.0 | 0.0 | 0.00 | 2.000000e+07 |
| 0deb4b6a8 | 4459.0 | 1.390895e+03 | 6.428302e+04 | 0.0 | 0.0 | 0.0 | 0.00 | 4.000000e+06 |
| 34b15f335 | 4459.0 | 2.672245e+04 | 5.699652e+05 | 0.0 | 0.0 | 0.0 | 0.00 | 2.000000e+07 |
| a8cb14b00 | 4459.0 | 4.530164e+03 | 2.359124e+05 | 0.0 | 0.0 | 0.0 | 0.00 | 1.480000e+07 |
| 2f0771a37 | 4459.0 | 2.640996e+04 | 1.514730e+06 | 0.0 | 0.0 | 0.0 | 0.00 | 1.000000e+08 |
| 30347e683 | 4459.0 | 3.070811e+04 | 5.770590e+05 | 0.0 | 0.0 | 0.0 | 0.00 | 2.070800e+07 |
| d08d1fbe3 | 4459.0 | 1.686522e+04 | 7.512756e+05 | 0.0 | 0.0 | 0.0 | 0.00 | 4.000000e+07 |
| 6ee66e115 | 4459.0 | 4.669208e+03 | 1.879449e+05 | 0.0 | 0.0 | 0.0 | 0.00 | 1.040000e+07 |
| 20aa07010 | 4459.0 | 2.569407e+06 | 9.610183e+06 | 0.0 | 0.0 | 0.0 | 600000.00 | 3.196120e+08 |


In [139]:
# A feature whose min and max are both 0 is zero in every row
columns_with_zeroes = description[(description['min'] == 0.0) & (description['max'] == 0.0)].index.tolist()

There are 256 features that are zero in every row. They carry no information, so they can't help with predictions.
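The same check can also be run directly on the frame, without going through `describe()`. Below is a minimal sketch on a tiny made-up frame, assuming there are no missing values; the helper name `all_zero_columns` and the columns `a`, `b`, `c` are placeholders, not competition features.

```python
import pandas as pd

def all_zero_columns(frame: pd.DataFrame) -> list:
    """Return the names of numeric columns whose every value is zero."""
    numeric = frame.select_dtypes(include="number")
    mask = (numeric == 0).all(axis=0)  # one boolean per column
    return mask[mask].index.tolist()

# Toy data standing in for df_raw; 'b' is the only all-zero column.
toy = pd.DataFrame({
    "ID": ["x", "y", "z"],
    "a": [0, 3, 0],
    "b": [0, 0, 0],
    "c": [1.5, 0.0, 0.0],
})
zero_cols = all_zero_columns(toy)
print(zero_cols)
```

On the real data this would be called as `all_zero_columns(df_raw)`; the string `ID` column is skipped because only numeric columns are inspected.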

## Evaluation

The evaluation metric for this competition is Root Mean Squared Logarithmic Error (RMSLE). We therefore take the log of the target and train on that, so that plain RMSE on the transformed values gives us what we need.
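To make that link concrete, here is a small sketch with made-up numbers. RMSLE is conventionally defined with log(1 + x) rather than log(x); for targets as large as ours (the minimum is 30,000) the two are nearly indistinguishable, which the last line checks. The helper names `rmsle` and `rmse_plain` are only for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, using log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def rmse_plain(a, b):
    """Ordinary root mean squared error."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Made-up targets and predictions on the original scale
y_true = np.array([30000.0, 600000.0, 2260000.0])
y_pred = np.array([45000.0, 500000.0, 3000000.0])

direct = rmsle(y_true, y_pred)
via_logs = rmse_plain(np.log1p(y_true), np.log1p(y_pred))  # RMSE on log-transformed values
with_plain_log = rmse_plain(np.log(y_true), np.log(y_pred))

print(direct, via_logs, with_plain_log)
```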

### Submission File

For every row in `test.csv`, the submission file should contain two columns: `ID` and `target`. The `ID` is the value of the `ID` column for that row in `test.csv`. The file should contain a header and have the following format:

```
ID,target
000137c73,5944923.322036332
00021489f,5944923.322036332
0004d7953,5944923.322036332
etc.
```
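Because we train on the log of the target, predictions have to be mapped back to the original scale with `np.exp` before they are written out. The sketch below shows one way to build a file in this format; `make_submission` is a hypothetical helper, and the IDs and values are simply the ones from the sample above.

```python
import io

import numpy as np
import pandas as pd

def make_submission(ids, log_preds):
    """Build a submission frame from test IDs and predictions made on the log scale."""
    preds = np.exp(np.asarray(log_preds, dtype=float))  # undo the log transform
    return pd.DataFrame({"ID": list(ids), "target": preds})

ids = ["000137c73", "00021489f", "0004d7953"]
log_preds = np.log([5944923.322036332] * 3)  # stand-in for model.predict(...) output

submission = make_submission(ids, log_preds)

# Written to an in-memory buffer here; pass a file path to to_csv for a real submission.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```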

In [140]:
df_raw.target = np.log(df_raw.target)

## Initial processing
Let's start with brute force: no columns are dropped other than `target` (the variable we need to predict) and `ID` (not relevant for modelling).

In [141]:
regressor = RandomForestRegressor(n_jobs=-1)

In [142]:
regressor.fit(df_raw.drop(['target', 'ID'], axis=1), df_raw.target)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [143]:
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/santander-raw')
df_test_raw.to_feather('tmp/santander-test-raw')

In [144]:
df_raw = feather.read_dataframe('tmp/santander-raw')

In [145]:
display_all(df_raw.tail().T)

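| | 4454 | 4455 | 4456 | 4457 | 4458 |
|---|---|---|---|---|---|
| ID | ff85154c8 | ffb6b3f4f | ffcf61eb6 | ffea67e98 | ffeb15d25 |
| target | 13.8785 | 10.779 | 14.8451 | 16.1181 | 16.8112 |
| 48df886f9 | 0 | 0 | 0 | 0 | 0 |
| 0deb4b6a8 | 0 | 0 | 0 | 0 | 0 |
| 34b15f335 | 0 | 0 | 0 | 0 | 0 |
| a8cb14b00 | 0 | 0 | 0 | 0 | 0 |
| 2f0771a37 | 0 | 0 | 0 | 0 | 0 |
| 30347e683 | 0 | 0 | 0 | 0 | 0 |
| d08d1fbe3 | 0 | 0 | 0 | 0 | 0 |
| 6ee66e115 | 0 | 0 | 0 | 0 | 0 |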


In [146]:
columns_with_zeroes.append('ID')
df, y, nas = proc_df(df_raw.drop(columns_with_zeroes, axis=1), 'target')

In [147]:
regressor.fit(df, y)
regressor.score(df, y)

0.8675575327807954

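So an `r^2` of ~0.87, huh? That looks alright, but it's measured on the same data we trained on. Let's see how good we really are by holding out a validation set from the training set. The test set turns out to be far larger than the training set (see the row counts below), so the validation set can't match its size; we'll hold out the last 10% of the training rows instead.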

In [148]:
len(df_test_raw)

49342

In [149]:
len(df_raw)

4459

In [150]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = len(df_raw) // 10  # hold out 10% of the training rows for validation
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((4014, 4735), (4014,), (445, 4735), (445,))
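The split above holds out the last rows of the file as they come. If the row order carries no meaning, a shuffled split is a common alternative. Here is a minimal sketch using scikit-learn's `train_test_split` on toy arrays; the `random_state` value is arbitrary, and nothing here is meant to replace `split_vals`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and targets standing in for df and y; row i of X is [2i, 2i + 1].
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)

# Shuffle, then hold out 10% of the rows for validation.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=42)

print(X_tr.shape, X_va.shape, y_tr.shape, y_va.shape)
```

Features and targets are shuffled together, so each validation row still lines up with its own target.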

Let's try our model again, this time with separate training and validation sets.

In [151]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

In [152]:
%time regressor.fit(X_train, y_train)
print_score(regressor)

CPU times: user 22.2 s, sys: 120 ms, total: 22.3 s
Wall time: 7.07 s
[0.62597457481766, 1.4611807023963985, 0.8715314137457423, 0.322310799772487]


So we have an `r^2` of ~0.87 on the training set and ~0.32 on the validation set, with an RMSE of ~0.63 versus ~1.46. A gap this large means we're overfitting badly. Going by our RMSLE, we're at around position 3100 of 4600 on the Kaggle leaderboard right now.
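A usual first step against overfitting in a random forest is to grow more trees, require more than one sample per leaf, and let each split see only a fraction of the features. Setting `oob_score=True` adds an out-of-bag estimate, which `print_score` already prints when it is available. The sketch below runs on synthetic data, and the parameter values are illustrative starting points rather than tuned settings for this competition.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first of ten features drives the target.
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(
    n_estimators=40,      # more trees than the 10 used above
    min_samples_leaf=3,   # stop splitting before leaves hold a single row
    max_features=0.5,     # each split considers half of the features
    oob_score=True,       # score each row using trees that never saw it
    random_state=0,
)
model.fit(X, y)

train_r2 = model.score(X, y)
oob_r2 = model.oob_score_
print(train_r2, oob_r2)
```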