In this notebook, we preprocess the data, feed it to gradient boosting tree models, and score 1.39 on the public leaderboard.

The workflow is as follows:
1. **Data preprocessing**. The goal is better time and space efficiency. We round the values and remove constant, duplicate, and insignificant features. The key is to make sure the preprocessing does not hurt accuracy.
2. **Feature transform**. The goal is to help the models capture the information in the data and to fight overfitting. We drop features whose distributions differ between the training and test sets, add statistical features, and add low-dimensional representations as features.
3. **Modeling**. We use two models, xgboost and lightgbm, and average their predictions for the final submission.

Stay tuned, more updates will come.

References:
* [Distribution of Test vs. Training data](https://www.kaggle.com/nanomathias/distribution-of-test-vs-training-data)
* [Ensemble of LGBM and XGB](https://www.kaggle.com/lightsalsa/ensemble-of-lgbm-and-xgb)
* [predict house prices-model tuning & ensemble](https://www.kaggle.com/alexpengxiao/predict-house-prices-model-tuning-ensemble)
* [Stacked Regressions : Top 4% on LeaderBoard](https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard)

**step 1**: Load the train and test data, drop constant and duplicate columns, and round the features to NUM_OF_DECIMALS decimals. NUM_OF_DECIMALS is an empirical value that can be tuned.
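As a small illustration of the column cleanup in this step, here is a sketch on a toy frame. The column names are made up, and `DataFrame.T.duplicated()` is an alternative to the pairwise loop used in the notebook, not the notebook's own method. Transposing a very wide frame costs memory, so treat this as a sketch rather than a drop-in replacement.

```python
import pandas as pd

# Toy frame (hypothetical column names): "c" is constant, "b_copy" duplicates "b".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [0.0, 5.0, 0.0],
    "c": [7.0, 7.0, 7.0],
    "b_copy": [0.0, 5.0, 0.0],
})

# Drop columns that hold a single distinct value.
constant_cols = df.columns[df.nunique() == 1]
df = df.drop(columns=constant_cols)

# Transpose so that columns become rows, then flag rows equal to an earlier row.
duplicate_mask = df.T.duplicated()
df = df.loc[:, ~duplicate_mask.values]

print(list(df.columns))
```

Only the first occurrence of each duplicated column is kept, which matches the behaviour of the loop in the cell for this step.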

In [1]:
#coding:utf-8
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import warnings
warnings.filterwarnings("ignore")

In [2]:
print(os.listdir("/Users/szkfzx/datasets/santander-value-prediction-challenge"))
train = pd.read_csv('/Users/szkfzx/datasets/santander-value-prediction-challenge/train.csv')
test = pd.read_csv('/Users/szkfzx/datasets/santander-value-prediction-challenge/test.csv')

['test.csv', 'train.csv', 'sample_submission.csv']


In [3]:
# Drop the ID column and handle the target column
test_ID = test['ID']
y_train = train['target']
y_train = np.log1p(y_train) # log1p=log(x+1)
train.drop("ID", axis = 1, inplace = True)
train.drop("target", axis = 1, inplace = True)
test.drop("ID", axis = 1, inplace = True)

In [4]:
# Drop columns that contain only one value
cols_with_onlyone_val = train.columns[train.nunique() == 1]
train.drop(cols_with_onlyone_val.values, axis=1, inplace=True)
test.drop(cols_with_onlyone_val.values, axis=1, inplace=True)

In [5]:
# Round to 32 decimal places
NUM_OF_DECIMALS = 32
train = train.round(NUM_OF_DECIMALS)
test = test.round(NUM_OF_DECIMALS)

In [6]:
# Remove duplicate columns (keep the first occurrence of each)
colsToRemove = []
columns = train.columns
for i in range(len(columns)-1):
    v = train[columns[i]].values
    for j in range(i + 1,len(columns)):
        if np.array_equal(v, train[columns[j]].values):
            colsToRemove.append(columns[j])
train.drop(colsToRemove, axis=1, inplace=True) 
test.drop(colsToRemove, axis=1, inplace=True) 
print(train.shape)
print(test.shape)

(4459, 4730)
(49342, 4730)


**step 2**: Select features by importance. We fit a weak RandomForestRegressor to get the feature importances and keep the top NUM_OF_FEATURES features. NUM_OF_FEATURES is a hyperparameter that can be tuned.
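To make the selection rule concrete, here is a minimal sketch on synthetic data. The feature names `f0` to `f4`, the target construction, and `top_k` are assumptions for illustration only. It ranks columns by `feature_importances_` and keeps the first `top_k`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only "f0" carries signal (hypothetical setup).
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 5), columns=["f0", "f1", "f2", "f3", "f4"])
y = 10.0 * X["f0"] + 0.1 * rng.randn(200)

# A small forest is enough to rank the features.
forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Sort features by importance and keep the top_k most important ones.
top_k = 2
ranked = pd.Series(forest.feature_importances_, index=X.columns)
ranked = ranked.sort_values(ascending=False)
selected = ranked.index[:top_k]
X_selected = X[selected]

print(ranked)
```

The same `selected` index would be applied to the test frame, so that both frames keep identical columns in the same order.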

In [7]:
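# model_selection and ensemble are usually used together
from sklearn import model_selection
from sklearn import ensemble
NUM_OF_FEATURES = 1000
def rmsle(y, pred):
    # y and pred are already log1p-transformed, so this RMSE is the RMSLE on the original scale
    return np.sqrt(np.mean(np.power(y - pred, 2)))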

x1, x2, y1, y2 = model_selection.train_test_split(
    train, y_train.values, test_size=0.20, random_state=5)
model = ensemble.RandomForestRegressor(n_jobs=-1, random_state=7)
model.fit(x1, y1)
print(rmsle(y2, model.predict(x2)))

col = pd.DataFrame({'importance': model.feature_importances_, 'feature': train.columns}).sort_values(
    by=['importance'], ascending=[False])[:NUM_OF_FEATURES]['feature'].values
train = train[col]
test = test[col]
train.shape

1.5367277914935242


(4459, 1000)

In [8]:
col  # the 1000 most important features


array(['f190486d6', 'eeb9cd3aa', 'c47340d97', '58e2e02e6', '15ace8c9f',
       '6eef030c1', '20aa07010', '0c9462c08', 'b43a7cfd5', '9306da53f',
       '6c0e0801a', '024c577b9', '9fd594eec', '2ec5b290f', '2288333b4',
       'fc99f9426', '62e59a501', 'fb0f5dbfe', '1702b5bf0', '58232a6fb',
       '66ace2992', 'f74e8f13d', '6786ea46d', 'd6bb78916', 'adb64ff71',
       '491b9ee45', 'aac52d8d9', 'ced6a7e91', '7a7da3079', '73687e512',
       'dd85a900c', '703885424', '5c6487af1', '8e4d0fe45', '4da206d28',
       'bb1113dbb', 'c9eda7d9c', 'df838756c', 'e176a204a', 'd746efbfe',
       '241f0f867', 'cd24eae8a', '26ab20ff9', '32174174c', '26fc93eb7',
       'c5a231d81', 'f8b733d3f', 'e17f1f07c', '8f3740670', '18cad608c',
       '963a49cdc', '08af3dd45', '64e38e7a2', '0ff32eb98', '324921c7b',
       '30b3daec2', 'e222309b0', '7623d805a', '707f193d9', '70feb1494',
       '29c059dd2', 'fb49e4212', 'aa164b93b', '3c8a3ced0', 'ed8951a75',
       '1931ccfdd', 'c270cb02b', '02861e414', '0d51722ca', '87ff

**step 3**: We compare the training and test data with the two-sample Kolmogorov-Smirnov test. This is a two-sided test of the null hypothesis that two independent samples are drawn from the same continuous distribution ([see more](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ks_2samp.html)). If a feature is distributed differently in the training set than in the test set, we remove it, since what the model learns from it during training cannot generalize. THRESHOLD_P_VALUE and THRESHOLD_STATISTIC are hyperparameters.
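The filtering rule can be tried on synthetic samples first. The sketch below assumes normally distributed toy columns and a hypothetical helper `is_shifted`. It reuses the two thresholds from this step and only illustrates how `ks_2samp` behaves when a column is shifted between the two sets.

```python
import numpy as np
from scipy.stats import ks_2samp

THRESHOLD_P_VALUE = 0.01
THRESHOLD_STATISTIC = 0.3

def is_shifted(train_values, test_values):
    """Hypothetical helper: True when the KS test flags a large, significant gap."""
    statistic, pvalue = ks_2samp(train_values, test_values)
    return pvalue <= THRESHOLD_P_VALUE and statistic > THRESHOLD_STATISTIC

rng = np.random.RandomState(0)
train_col = rng.normal(0.0, 1.0, 2000)
test_same = rng.normal(0.0, 1.0, 2000)      # same distribution as train_col
test_shifted = rng.normal(2.0, 1.0, 2000)   # mean moved by two standard deviations

print(is_shifted(train_col, test_same))
print(is_shifted(train_col, test_shifted))
```

Requiring both a small p-value and a large statistic matters with samples this large: tiny differences can already be statistically significant, and the statistic threshold keeps those columns.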

In [9]:
# Drop columns whose distributions differ strongly between the training and test sets
from scipy.stats import ks_2samp
THRESHOLD_P_VALUE = 0.01 # needs tuning
THRESHOLD_STATISTIC = 0.3 # needs tuning
diff_cols = []
for col in train.columns:
    statistic, pvalue = ks_2samp(train[col].values, test[col].values)
    if pvalue <= THRESHOLD_P_VALUE and np.abs(statistic) > THRESHOLD_STATISTIC:
        diff_cols.append(col)
for col in diff_cols:
    if col in train.columns:
        train.drop(col, axis=1, inplace=True)
        test.drop(col, axis=1, inplace=True)

In [10]:
print(train.shape)
print(test.shape)

(4459, 1000)
(49342, 1000)


**step 4**: We add row-wise statistical features to the original features, and also add low-dimensional representations (random projections) as features. NUM_OF_COM is a hyperparameter.
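A toy example may help to see what these features look like. The frame below and its column names are invented. The sketch masks zeros so that the statistics cover non-zero values only, and then projects the rows to two components with `SparseRandomProjection`.

```python
import numpy as np
import pandas as pd
from sklearn.random_projection import SparseRandomProjection

# Toy sparse-looking frame (hypothetical values); the last row is all zeros.
df = pd.DataFrame({
    "x1": [0.0, 2.0, 0.0],
    "x2": [4.0, 0.0, 0.0],
    "x3": [6.0, 8.0, 0.0],
})

# Zeros become NaN here, and pandas skips NaN in row-wise reductions.
nonzero = df[df != 0]
stats = pd.DataFrame({
    "count_not0": (df != 0).sum(axis=1),
    "mean": nonzero.mean(axis=1),
    "max": nonzero.max(axis=1),
})
print(stats)

# Project the three original columns down to two random components.
transformer = SparseRandomProjection(n_components=2, random_state=0)
projected = transformer.fit_transform(df.values)
print(projected.shape)
```

Rows without any non-zero value get NaN statistics. xgboost and lightgbm accept missing values, which is why the NaN entries do not need to be filled before modeling.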

In [11]:
# Add statistical features as columns of the training and test sets
from sklearn import random_projection
ntrain = len(train)
ntest = len(test)
tmp = pd.concat([train,test]) # input for the random projection
weight = ((train != 0).sum()/len(train)).values
tmp_train = train[train!=0]
tmp_test = test[test!=0]
# weighted sum of the non-zero values (weight = share of non-zero rows per column)
train["weight_count"] = (tmp_train*weight).sum(axis=1)
test["weight_count"] = (tmp_test*weight).sum(axis=1)
# number of non-zero values
train["count_not0"] = (train != 0).sum(axis=1)
test["count_not0"] = (test != 0).sum(axis=1)
# sum (note: this also includes the weight_count and count_not0 columns added above)
train["sum"] = train.sum(axis=1)
test["sum"] = test.sum(axis=1)
# variance
train["var"] = tmp_train.var(axis=1)
test["var"] = tmp_test.var(axis=1)
# median
train["median"] = tmp_train.median(axis=1)
test["median"] = tmp_test.median(axis=1)
# mean
train["mean"] = tmp_train.mean(axis=1)
test["mean"] = tmp_test.mean(axis=1)
# standard deviation
train["std"] = tmp_train.std(axis=1)
test["std"] = tmp_test.std(axis=1)
# maximum
train["max"] = tmp_train.max(axis=1)
test["max"] = tmp_test.max(axis=1)
# minimum
train["min"] = tmp_train.min(axis=1)
test["min"] = tmp_test.min(axis=1)
# skewness
train["skew"] = tmp_train.skew(axis=1)
test["skew"] = tmp_test.skew(axis=1)
# kurtosis
train["kurtosis"] = tmp_train.kurtosis(axis=1)
test["kurtosis"] = tmp_test.kurtosis(axis=1)

del(tmp_train)
del(tmp_test)

NUM_OF_COM = 100 # needs tuning
transformer = random_projection.SparseRandomProjection(n_components = NUM_OF_COM)
RP = transformer.fit_transform(tmp)
rp = pd.DataFrame(RP)
columns = ["RandomProjection{}".format(i) for i in range(NUM_OF_COM)]
rp.columns = columns

rp_train = rp[:ntrain]
rp_test = rp[ntrain:]
rp_test.index = test.index

#concat RandomProjection and raw data
train = pd.concat([train,rp_train],axis=1)
test = pd.concat([test,rp_test],axis=1)

del(rp_train)
del(rp_test)


In [22]:
train.shape

(4459, 1111)

In [23]:
test.shape

(49342, 1111)

In [15]:
tmp.head()

Unnamed: 0,f190486d6,eeb9cd3aa,c47340d97,58e2e02e6,15ace8c9f,6eef030c1,20aa07010,0c9462c08,b43a7cfd5,9306da53f,...,2b58a21fc,c3f400e36,e8bd579ae,4bf2b8e7c,1bf8c2597,b7d59d3b5,ea26c7fe6,467c54d35,1834f29f5,acc4a8e68
0,1866666.66,700000.0,0.0,12066666.66,4100000.0,900000.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4000000.0,0.0,0.0,0.0,0.0
1,0.0,2225000.0,0.0,2850000.0,0.0,800000.0,2200000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,8000000.0,0.0,0.0,37662000.0,2000000.0,5400000.0,0.0,1180000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
weight


array([3.46265979e-01, 3.39089482e-01, 3.48732900e-01, 3.39313747e-01,
       3.33034313e-01, 3.26306347e-01, 3.35052702e-01, 1.59677058e-01,
       3.22493833e-01, 1.61919713e-01, 1.67302086e-01, 3.15990132e-01,
       3.40883606e-01, 3.31240188e-01, 1.72460193e-01, 3.40435075e-01,
       3.31240188e-01, 3.35276968e-01, 3.36398296e-01, 3.33034313e-01,
       3.41107872e-01, 3.35052702e-01, 5.09082754e-02, 3.22269567e-01,
       3.46938776e-01, 3.41556403e-01, 1.59452792e-01, 1.59228527e-01,
       1.55640278e-01, 1.55640278e-01, 3.11729087e-02, 3.39313747e-01,
       3.45144651e-01, 1.65732227e-01, 1.78515362e-01, 1.53173357e-01,
       1.97353667e-02, 1.29625477e-01, 3.44247589e-01, 4.53016371e-02,
       3.27651940e-01, 1.58107199e-01, 5.27023996e-02, 1.61246916e-01,
       3.34604171e-01, 3.50975555e-01, 3.07243777e-02, 3.22942364e-02,
       1.56088809e-01, 2.01838977e-02, 3.29894595e-01, 2.33236152e-02,
       5.83090379e-02, 3.50975555e-01, 3.37295358e-01, 1.47566719e-01,
      

In [20]:
train.head()

Unnamed: 0,f190486d6,eeb9cd3aa,c47340d97,58e2e02e6,15ace8c9f,6eef030c1,20aa07010,0c9462c08,b43a7cfd5,9306da53f,...,RandomProjection90,RandomProjection91,RandomProjection92,RandomProjection93,RandomProjection94,RandomProjection95,RandomProjection96,RandomProjection97,RandomProjection98,RandomProjection99
0,1866666.66,700000.0,0.0,12066666.66,4100000.0,900000.0,0.0,0.0,0.0,0.0,...,7732193.0,-17376350.0,-16423180.0,3936389.0,7347927.0,-1799492.0,2249365.0,0.0,0.0,4611199.0
1,0.0,2225000.0,0.0,2850000.0,0.0,800000.0,2200000.0,0.0,0.0,0.0,...,-674809.6,-1124683.0,1124683.0,9447334.0,4498731.0,5061072.0,0.0,0.0,21368970.0,-14845810.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,449873.1,0.0,0.0,0.0,4498731.0,3092877.0,0.0,0.0,0.0,6748096.0
3,2000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-56234.13,-56234.13
4,0.0,0.0,8000000.0,0.0,0.0,37662000.0,2000000.0,5400000.0,0.0,1180000.0,...,0.0,3036643.0,0.0,17804850.0,5623413.0,0.0,112468.3,0.0,0.0,23428260.0


In [21]:
test.head()

Unnamed: 0,f190486d6,eeb9cd3aa,c47340d97,58e2e02e6,15ace8c9f,6eef030c1,20aa07010,0c9462c08,b43a7cfd5,9306da53f,...,RandomProjection90,RandomProjection91,RandomProjection92,RandomProjection93,RandomProjection94,RandomProjection95,RandomProjection96,RandomProjection97,RandomProjection98,RandomProjection99
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,113949400.0,0.0,0.0,-7660.336,-8676830.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,14000000.0,245000.0,0.0,0.0,2533333.34,4383000.0,...,0.0,12237110.0,-112468.265038,1424598.0,14620.87,1210159.0,-2811707.0,0.0,-10684490.0,5061072.0
3,0.0,40000000.0,0.0,0.0,40000000.0,0.0,0.0,0.0,25010000.0,0.0,...,2118527.0,11779930.0,0.0,11028640.0,16193640.0,-22109980.0,-23699310.0,1690398.0,2811707.0,356149.5
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**step 5**: Define the cross-validation method and the models. xgboost and lightgbm are the base models. Their hyperparameters were already tuned by grid search, so we use them directly here. NUM_FOLDS can be treated as a hyperparameter.

In [12]:
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone # transformer base class, regressor base class
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
# Define the evaluation method for a given model: k-fold cross-validation on the training set.
# The loss is the root mean squared logarithmic error between target and prediction.
# Note: train and y_train are read as global variables.
NUM_FOLDS = 5 # needs tuning
def rmsle_cv(model):
    # Note: get_n_splits() returns the integer NUM_FOLDS, so cross_val_score
    # runs a plain, unshuffled K-fold here.
    kf = KFold(NUM_FOLDS, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse= np.sqrt(-cross_val_score(model, train, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)
#ensemble method: model averaging
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # Fit clones of the original models so that the
    # base models themselves are left untouched
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]  
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([ model.predict(X) for model in self.models_ ])
        return np.mean(predictions, axis=1)

model_xgb = xgb.XGBRegressor(colsample_bytree=0.055, colsample_bylevel =0.5, 
                             gamma=1.5, learning_rate=0.02, max_depth=32, 
                             objective='reg:linear',booster='gbtree',
                             min_child_weight=57, n_estimators=1000, reg_alpha=0, 
                             reg_lambda = 0,eval_metric = 'rmse', subsample=0.7, 
                             silent=1, n_jobs = -1, early_stopping_rounds = 14,
                             random_state =7, nthread = -1)
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=144,
                              learning_rate=0.005, n_estimators=720, max_depth=13,
                              metric='rmse',is_training_metric=True,
                              max_bin = 55, bagging_fraction = 0.8,verbose=-1,
                              bagging_freq = 5, feature_fraction = 0.9) 
score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))
averaged_models = AveragingModels(models = (model_xgb, model_lgb))
score = rmsle_cv(averaged_models)
print("averaged score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))

Xgboost score: 1.3541 (0.0640)

LGBM score: 1.3450 (0.0518)

averaged score: 1.3413 (0.0580)
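The `rmsle_cv` helper above passes the result of `get_n_splits` as `cv`. The following check on toy data (the array `X` is made up) shows what that value is, and how the folds of an integer `cv` differ from the folds of the shuffled `KFold` object. It is only meant to illustrate the API behaviour.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # ten toy samples
splitter = KFold(n_splits=5, shuffle=True, random_state=42)

# get_n_splits returns the number of folds, not a splitter.
n_splits = splitter.get_n_splits(X)
print(type(n_splits), n_splits)

# For a regressor, an integer cv is turned into an unshuffled KFold:
# the test folds are contiguous blocks of rows.
plain_folds = [test.tolist() for _, test in KFold(n_splits=n_splits).split(X)]
print(plain_folds)

# Passing the KFold object itself keeps shuffle=True and random_state=42.
shuffled_folds = [test.tolist() for _, test in splitter.split(X)]
print(shuffled_folds)
```

To evaluate on shuffled folds, the `KFold` object itself would be passed as `cv`. The scores printed in this notebook come from the integer variant.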



**step 6**: Average the two base models and submit the final predictions.
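Because the target was transformed with `np.log1p` in step 1, the model outputs live on the log scale and are mapped back with `np.expm1` before submission. The sketch below uses invented target and prediction values. It checks the round trip, and that RMSE on log-scale values equals RMSLE on the original scale.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error."""
    return np.sqrt(np.mean((a - b) ** 2))

def rmsle(a, b):
    """Root mean squared logarithmic error on the original scale."""
    return np.sqrt(np.mean((np.log1p(a) - np.log1p(b)) ** 2))

# Hypothetical targets and predictions on the original scale.
y_true = np.array([0.0, 1000.0, 250000.0])
y_pred = np.array([10.0, 900.0, 300000.0])

# log1p followed by expm1 recovers the original values.
restored = np.expm1(np.log1p(y_true))
print(np.allclose(restored, y_true))

# RMSE computed on log1p values is the RMSLE of the raw values.
print(rmse(np.log1p(y_true), np.log1p(y_pred)), rmsle(y_true, y_pred))
```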

In [13]:
averaged_models.fit(train.values, y_train)
pred = np.expm1(averaged_models.predict(test.values))
ensemble = pred
sub = pd.DataFrame()
sub['ID'] = test_ID
sub['target'] = ensemble
sub.to_csv('submission.csv',index=False)

#Xgboost score: 1.3582 (0.0640)
#LGBM score: 1.3437 (0.0519)
#averaged score: 1.3431 (0.0586)

#Xgboost score: 1.3566 (0.0525)
#LGBM score: 1.3477 (0.0497)
#averaged score: 1.3438 (0.0516)

#Xgboost score: 1.3540 (0.0621)
#LGBM score: 1.3463 (0.0485)
#averaged score: 1.3423 (0.0556)

In [14]:
sub.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,49332,49333,49334,49335,49336,49337,49338,49339,49340,49341
ID,000137c73,00021489f,0004d7953,00056a333,00056d8eb,0005fc190,000787e86,0008510a0,000895faf,000986fba,...,ffef8aa08,fff0ee67d,fff2aa673,fff479492,fff64bf93,fff73b677,fff7b5923,fff7c698f,fff8dba89,fffbe2f6f
target,1.86981e+06,2.58875e+06,2.27823e+06,4.80144e+06,2.59181e+06,1.57824e+06,4110160.0,4.00744e+06,990123,2.49286e+06,...,1.1938e+06,1.02373e+06,1.8121e+06,1.04772e+06,2.14463e+06,2.07522e+06,3.89852e+06,1.45459e+06,199857,3.45621e+06
