# Combining your model with a model without outliers

Assuming that you have already finished your feature engineering and you have two datasets:

- ***train_clean.csv***
- ***test_clean.csv***

In train_clean.csv, there's an **'outlier' column with values 1/0**.

You also have your best LB submission:
- ***3.695.csv*** (thanks to **Ashish Patel(阿希什)**. My original model can't reach this score, so I use this idea to improve the submission and get a better LB score.)

The pipeline works as follows:
1. Train a model on a training set without outliers. (we get: **Model_1**)
2. Train a model to classify outliers. (we get: **Model_2**)
3. Use **Model_2** to predict whether a card_id in the test set is an outlier. (we get: **Outlier_Likelihood**)
4. Select the card_ids with the top 10% (or some other ratio) of **Outlier_Likelihood** scores. (we get: **Outlier_ID**)
5. Build the combined submission: use your **best submission (that is, your best model)** for the **Outlier_ID** rows of the test set and **Model_1** for the rest of the test set.

The basic idea behind this pipeline is:
1. Training a model without outliers makes it more accurate for non-outliers.
2. A large proportion of the error is caused by outliers, so we need a model trained with outliers to predict them. How do we find them? Build a classifier!
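
A toy calculation illustrates the second point. The numbers below are made up for illustration (990 ordinary rows with a residual of 1.5 and 10 outlier rows with a residual of 30, roughly a 1% outlier rate); they are not taken from the competition data.

```python
import math

# Hypothetical residuals: 99% ordinary rows, 1% badly missed outliers.
n_normal, err_normal = 990, 1.5
n_outlier, err_outlier = 10, 30.0

sse_normal = n_normal * err_normal ** 2      # squared error from ordinary rows
sse_outlier = n_outlier * err_outlier ** 2   # squared error from outlier rows

rmse_all = math.sqrt((sse_normal + sse_outlier) / (n_normal + n_outlier))
outlier_share = sse_outlier / (sse_normal + sse_outlier)

print(f"overall RMSE: {rmse_all:.3f}")
print(f"share of squared error from outliers: {outlier_share:.1%}")
```

Under these assumed numbers, about 80% of the squared error comes from 1% of the rows, which is why handling the outliers separately can pay off.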

In [3]:
import numpy as np
import pandas as pd
import time
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import log_loss



# Part 1 Training Model Without Outliers

In [4]:
%%time
df_train = pd.read_csv('transformed_train_data.csv')
df_test = pd.read_csv('transformed_test_data.csv')

CPU times: user 18.8 s, sys: 1.61 s, total: 20.4 s
Wall time: 21.6 s


In [5]:
df_train['outliers'] = (df_train['target'] < -30).astype('int64')

## filtering out outliers

In [6]:
df_train = df_train[df_train['outliers'] == 0]
target = df_train['target']
del df_train['target']
features = [c for c in df_train.columns if c not in ['card_id', 'first_active_month','outliers']]
categorical_feats = [c for c in features if 'feature_' in c]

## parameters

In [7]:
param = {'objective':'regression',
         'num_leaves': 31,
         'min_data_in_leaf': 25,
         'max_depth': 7,
         'learning_rate': 0.01,
         'lambda_l1':0.13,
         "boosting": "gbdt",
         "feature_fraction":0.85,
         'bagging_freq':8,
         "bagging_fraction": 0.9 ,
         "metric": 'rmse',
         "verbosity": -1,
         "random_state": 2333}

## training model

In [8]:
%%time
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2333)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train,df_train['outliers'].values)):
    print("fold {}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][features], label=target.iloc[trn_idx])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][features], label=target.iloc[val_idx])#, categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval= 100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))

fold 0
Training until validation scores don't improve for 200 rounds.
[100]	training's rmse: 1.60829	valid_1's rmse: 1.61962
[200]	training's rmse: 1.5775	valid_1's rmse: 1.59335
[300]	training's rmse: 1.56206	valid_1's rmse: 1.58225
[400]	training's rmse: 1.55129	valid_1's rmse: 1.57565
[500]	training's rmse: 1.54312	valid_1's rmse: 1.57182
[600]	training's rmse: 1.53633	valid_1's rmse: 1.56944
[700]	training's rmse: 1.5303	valid_1's rmse: 1.56763
[800]	training's rmse: 1.52504	valid_1's rmse: 1.56655
[900]	training's rmse: 1.52021	valid_1's rmse: 1.56575
[1000]	training's rmse: 1.51562	valid_1's rmse: 1.56498
[1100]	training's rmse: 1.51132	valid_1's rmse: 1.5646
[1200]	training's rmse: 1.50705	valid_1's rmse: 1.56435
[1300]	training's rmse: 1.50316	valid_1's rmse: 1.56417
[1400]	training's rmse: 1.49923	valid_1's rmse: 1.56391
[1500]	training's rmse: 1.49521	valid_1's rmse: 1.56383
[1600]	training's rmse: 1.49136	valid_1's rmse: 1.56392
[1700]	training's rmse: 1.48751	valid_1's rmse

In [9]:
model_without_outliers = pd.DataFrame({"card_id":df_test["card_id"].values})
model_without_outliers["target"] = predictions

# Part 2 Training Model For Outlier Classification

In [10]:
%%time
df_train = pd.read_csv('transformed_train_data.csv')
df_test = pd.read_csv('transformed_test_data.csv')


CPU times: user 17.8 s, sys: 1.43 s, total: 19.3 s
Wall time: 19.8 s


In [11]:
df_train['outliers'] = (df_train['target'] < -30).astype('int64')

## using outliers column as labels instead of target column

In [12]:
target = df_train['outliers']
del df_train['outliers']
del df_train['target']

In [13]:
features = [c for c in df_train.columns if c not in ['card_id', 'first_active_month']]
categorical_feats = [c for c in features if 'feature_' in c]

## parameters

In [14]:
param = {'num_leaves': 31,
         'min_data_in_leaf': 30, 
         'objective':'binary',
         'max_depth': 6,
         'learning_rate': 0.01,
         "boosting": "rf",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'binary_logloss',
         "lambda_l1": 0.1,
         "verbosity": -1,
         "random_state": 2333}

## training model

In [15]:
%%time
folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

start = time.time()


for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train.values, target.values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][features], label=target.iloc[trn_idx], categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][features], label=target.iloc[val_idx], categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(log_loss(target, oof)))

fold n°0




Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.0446375	valid_1's binary_logloss: 0.0472483
[200]	training's binary_logloss: 0.044623	valid_1's binary_logloss: 0.0472331
Early stopping, best iteration is:
[71]	training's binary_logloss: 0.0446202	valid_1's binary_logloss: 0.0472369
fold n°1
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.0449786	valid_1's binary_logloss: 0.0457934
[200]	training's binary_logloss: 0.0449465	valid_1's binary_logloss: 0.0457732
Early stopping, best iteration is:
[31]	training's binary_logloss: 0.0449785	valid_1's binary_logloss: 0.0457192
fold n°2
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.0453193	valid_1's binary_logloss: 0.0451575
[200]	training's binary_logloss: 0.0453964	valid_1's binary_logloss: 0.0452476
Early stopping, best iteration is:
[2]	training's binary_logloss: 0.0457734	valid_1's binary_l
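
To put the log-loss values in the log above in context, it helps to compare them with a classifier that predicts the same probability for every row. The sketch below computes that baseline, assuming the 1.06% outlier rate quoted in Part 3 of this notebook.

```python
import math

p = 0.0106  # outlier rate in the training set, as quoted in this notebook

# Log loss of a model that predicts probability p for every row.
baseline_logloss = -(p * math.log(p) + (1 - p) * math.log(1 - p))
print(f"constant-prediction log loss: {baseline_logloss:.4f}")
```

The baseline comes out at roughly 0.059, so the validation log loss of about 0.045 to 0.047 in the log is an improvement, though a moderate one. That is one reason to select a generous number of candidate outliers later rather than only the expected count.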

In [16]:
# 'target' is the predicted probability that an observation is an outlier
df_outlier_prob = pd.DataFrame({"card_id":df_test["card_id"].values})
df_outlier_prob["target"] = predictions
df_outlier_prob.head()

| | card_id | target |
|---|---|---|
| 0 | C_ID_0ab67a22ab | 0.071122 |
| 1 | C_ID_130fd0cbdd | 0.001902 |
| 2 | C_ID_b709037bc5 | 0.006806 |
| 3 | C_ID_d27d835a9f | 0.001891 |
| 4 | C_ID_2b5e3df5c2 | 0.001891 |


# Part 3 Combining Submission:
So far so good!
We now have three datasets:

1. Best Submission
2. Prediction Using Model Without Outliers
3. Probability of Outliers In Test set


In [17]:
# If the test set has the same ratio of outliers as the training set (1.06%),
# then the number of outliers in the test set is about:
123623*0.0106

1310.4038

In [23]:
# To avoid missing predictable outliers, we keep the 10000 card_ids with the highest outlier likelihood.
outlier_id = pd.DataFrame(df_outlier_prob.sort_values(by='target',ascending = False).head(10000)['card_id'])
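
The cell above hard-codes the number of rows to keep. If you prefer to express the cut-off as a ratio, as in step 4 of the pipeline, the selection can be sketched as follows. The toy frame and the `ratio` value are hypothetical stand-ins; only the column names match the notebook.

```python
import pandas as pd

# Toy stand-in for df_outlier_prob (hypothetical values).
df_outlier_prob = pd.DataFrame({
    "card_id": [f"C_ID_{i:02d}" for i in range(20)],
    "target": [i / 100 for i in range(20)],
})

ratio = 0.10  # hypothetical cut-off: keep the top 10% most likely outliers
n_top = max(1, round(len(df_outlier_prob) * ratio))

# nlargest returns the n_top rows with the highest predicted probability.
outlier_id = df_outlier_prob.nlargest(n_top, "target")[["card_id"]]
print(outlier_id)
```

With the 123623 test rows mentioned earlier, a 10% ratio would select about 12362 card_ids, slightly more than the 10000 used in the cell above.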

In [24]:
best_submission = pd.read_csv('tidy_elo_3.63306.csv')

In [25]:
most_likely_liers = best_submission.merge(outlier_id,how='right')
most_likely_liers.head()

| | card_id | target |
|---|---|---|
| 0 | C_ID_0ab67a22ab | -3.017024 |
| 1 | C_ID_7f1041e8e1 | -6.830067 |
| 2 | C_ID_8eaa79db4f | -4.653352 |
| 3 | C_ID_17cb2f55f2 | 1.552973 |
| 4 | C_ID_562a791678 | -3.105685 |


In [26]:
%%time
for card_id in most_likely_liers['card_id']:
    model_without_outliers.loc[model_without_outliers['card_id']==card_id,'target']\
    = most_likely_liers.loc[most_likely_liers['card_id']==card_id,'target'].values

CPU times: user 1min 54s, sys: 2.39 s, total: 1min 56s
Wall time: 1min 57s
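
The loop above takes about two minutes because every iteration scans the whole frame. A vectorized alternative is sketched below on toy data. It assumes that each `card_id` appears only once in `most_likely_liers` and that the best submission contains no missing targets; the toy values are made up.

```python
import pandas as pd

# Toy stand-ins for the two frames used in the notebook (hypothetical values).
model_without_outliers = pd.DataFrame({
    "card_id": ["a", "b", "c", "d"],
    "target": [0.1, 0.2, 0.3, 0.4],
})
most_likely_liers = pd.DataFrame({
    "card_id": ["c", "a"],
    "target": [-5.0, -3.0],
})

# Look-up table: card_id -> target from the best submission.
override = most_likely_liers.set_index("card_id")["target"]

# Rows flagged as likely outliers get the best-submission value;
# all other rows map to NaN and keep the Model_1 prediction.
mapped = model_without_outliers["card_id"].map(override)
model_without_outliers["target"] = mapped.fillna(model_without_outliers["target"])
print(model_without_outliers)
```

This should give the same result as the loop, with a single hash look-up per row instead of a full scan per card_id.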


In [27]:
model_without_outliers.to_csv("combining_submission.csv", index=False)