# Combining your model with a model trained without outliers

This notebook assumes you have already finished your feature engineering and have two datasets:

- ***train_clean.csv***
- ***test_clean.csv***

The pipeline flows as follows:
1. Train a model on a training set with the outliers removed. (we get: **Model_1**)
2. Train a model to classify outliers. (we get: **Model_2**)
3. Use **Model_2** to predict whether each card_id in the test set is an outlier. (we get: **Outlier_Likelihood**)
4. Select the card_ids from **Outlier_Likelihood** with the top 10% (or some other ratio) of scores. (we get: **Outlier_ID**)
5. Build the final submission by using your **best submission (that is, your best model)** for the card_ids in **Outlier_ID** and **Model_1** for the rest of the test set.

The basic idea behind this pipeline is:
1. Training without outliers makes the model more accurate for non-outliers.
2. A large proportion of the error is caused by outliers, so we need a model trained with outliers to predict them. How do we find them in the test set? Build a classifier!
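
To make the second point concrete, here is a small arithmetic sketch of how a few large errors can dominate RMSE. All numbers below (row count, outlier count, error sizes) are hypothetical values chosen for illustration and are not taken from the competition data.

```python
import numpy as np

# Hypothetical numbers for illustration only (not from the competition data).
n_rows = 100_000      # total rows
n_out = 1_000         # 1% of rows treated as outliers
err_normal = 1.5      # assumed absolute error on a non-outlier row
err_outlier = 30.0    # assumed absolute error on an outlier row

# Non-outlier errors first, outlier errors last.
errors = np.concatenate([
    np.full(n_rows - n_out, err_normal),
    np.full(n_out, err_outlier),
])
sq = errors ** 2

outlier_share = sq[-n_out:].sum() / sq.sum()   # fraction of squared error from outliers
rmse_all = np.sqrt(sq.mean())                  # RMSE over all rows
rmse_normal = np.sqrt(sq[:-n_out].mean())      # RMSE over non-outlier rows only

print("share of squared error from outliers: {:.3f}".format(outlier_share))
print("RMSE (all rows): {:.3f}".format(rmse_all))
print("RMSE (non-outliers only): {:.3f}".format(rmse_normal))
```

Under these assumed numbers, 1% of the rows account for roughly 80% of the total squared error, which is why handling outliers separately can pay off.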

In [1]:
import numpy as np
import pandas as pd
import time
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import log_loss

# Part 1 Training Model Without Outliers

In [2]:
%%time
df_train = pd.read_csv('../input/predicting-outliers-to-improve-your-score/train_clean.csv')
df_test = pd.read_csv('../input/predicting-outliers-to-improve-your-score/test_clean.csv')

CPU times: user 6.66 s, sys: 1.6 s, total: 8.25 s
Wall time: 8.28 s


## filtering out outliers

In [3]:
df_train = df_train[df_train['outliers'] == 0]
target = df_train['target']
del df_train['target']
features = [c for c in df_train.columns if c not in ['card_id', 'first_active_month','outliers']]
categorical_feats = [c for c in features if 'feature_' in c]

## parameters

In [4]:
param = {'objective':'regression',
         'num_leaves': 31,
         'min_data_in_leaf': 25,
         'max_depth': 7,
         'learning_rate': 0.01,
         'lambda_l1':0.13,
         "boosting": "gbdt",
         "feature_fraction":0.85,
         'bagging_freq':8,
         "bagging_fraction": 0.9 ,
         "metric": 'rmse',
         "verbosity": -1,
         "random_state": 2333}

## training model

In [5]:
%%time
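# Every remaining row has outliers == 0, so stratifying on that column
# behaves like a plain shuffled K-fold split here.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2333)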
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train,df_train['outliers'].values)):
    print("fold {}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][features], label=target.iloc[trn_idx])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][features], label=target.iloc[val_idx])#, categorical_feature=categorical_feats)

    num_round = 10000
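    # Note: verbose_eval and early_stopping_rounds were removed from lgb.train in LightGBM 4.0;
    # on newer versions pass callbacks=[lgb.early_stopping(200), lgb.log_evaluation(100)] instead.
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval= 100, early_stopping_rounds = 200)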
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))

fold 0
Training until validation scores don't improve for 200 rounds.
[100]	training's rmse: 1.60695	valid_1's rmse: 1.61793
[200]	training's rmse: 1.57615	valid_1's rmse: 1.59127
[300]	training's rmse: 1.5615	valid_1's rmse: 1.58071
[400]	training's rmse: 1.55208	valid_1's rmse: 1.57484
[500]	training's rmse: 1.54492	valid_1's rmse: 1.57124
[600]	training's rmse: 1.53916	valid_1's rmse: 1.5691
[700]	training's rmse: 1.53428	valid_1's rmse: 1.56768
[800]	training's rmse: 1.52983	valid_1's rmse: 1.56669
[900]	training's rmse: 1.52574	valid_1's rmse: 1.56596
[1000]	training's rmse: 1.5219	valid_1's rmse: 1.56532
[1100]	training's rmse: 1.51842	valid_1's rmse: 1.56506
[1200]	training's rmse: 1.51513	valid_1's rmse: 1.56475
[1300]	training's rmse: 1.51195	valid_1's rmse: 1.56444
[1400]	training's rmse: 1.50873	valid_1's rmse: 1.56409
[1500]	training's rmse: 1.50566	valid_1's rmse: 1.56382
[1600]	training's rmse: 1.5024	valid_1's rmse: 1.56365
[1700]	training's rmse: 1.49932	valid_1's rmse:

In [6]:
model_without_outliers = pd.DataFrame({"card_id":df_test["card_id"].values})
model_without_outliers["target"] = predictions

# Part 2 Training Model For Outliers Classification

In [7]:
%%time
df_train = pd.read_csv('../input/predicting-outliers-to-improve-your-score/train_clean.csv')
df_test = pd.read_csv('../input/predicting-outliers-to-improve-your-score/test_clean.csv')

CPU times: user 6.63 s, sys: 192 ms, total: 6.82 s
Wall time: 6.83 s


## using outliers column as labels instead of target column

In [8]:
target = df_train['outliers']
del df_train['outliers']
del df_train['target']

In [9]:
features = [c for c in df_train.columns if c not in ['card_id', 'first_active_month']]
categorical_feats = [c for c in features if 'feature_' in c]

## parameters

In [10]:
param = {'num_leaves': 31,
         'min_data_in_leaf': 30, 
         'objective':'binary',
         'max_depth': 6,
         'learning_rate': 0.01,
         "boosting": "rf",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'binary_logloss',
         "lambda_l1": 0.1,
         "verbosity": -1,
         "random_state": 2333}

## training model

In [11]:
%%time
folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

start = time.time()


for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train.values, target.values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][features], label=target.iloc[trn_idx], categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][features], label=target.iloc[val_idx], categorical_feature=categorical_feats)

    num_round = 10000
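    # On LightGBM >= 4.0, use callbacks (lgb.early_stopping, lgb.log_evaluation) in place of the last two arguments.
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)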
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(log_loss(target, oof)))

fold n°0




Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.174642	valid_1's binary_logloss: 0.186048
[200]	training's binary_logloss: 0.172788	valid_1's binary_logloss: 0.183575
[300]	training's binary_logloss: 0.169804	valid_1's binary_logloss: 0.180559
[400]	training's binary_logloss: 0.170139	valid_1's binary_logloss: 0.181009
[500]	training's binary_logloss: 0.170433	valid_1's binary_logloss: 0.181282
Early stopping, best iteration is:
[309]	training's binary_logloss: 0.169472	valid_1's binary_logloss: 0.180257
fold n°1
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.184304	valid_1's binary_logloss: 0.175788
[200]	training's binary_logloss: 0.183427	valid_1's binary_logloss: 0.175073
[300]	training's binary_logloss: 0.180141	valid_1's binary_logloss: 0.172068
[400]	training's binary_logloss: 0.180621	valid_1's binary_logloss: 0.17233
[500]	training's binary_logloss: 0.181361	valid_1's binary_lo
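
As a sanity check on the binary_logloss values in the log above, it can help to compare them with a constant predictor that always outputs the base rate. The sketch below assumes the 1.06% training-set outlier rate quoted in Part 3 of this notebook; `constant_log_loss` is a hypothetical helper written for this illustration, not part of any library.

```python
import numpy as np

# Assumption: 1.06% is the training-set outlier rate quoted in Part 3.
outlier_rate = 0.0106

def constant_log_loss(p):
    """Expected log loss of always predicting probability p
    when the true positive rate is also p."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

baseline = constant_log_loss(outlier_rate)
print("baseline log loss: {:.4f}".format(baseline))
```

The logged validation values (about 0.17 to 0.18) are above this baseline, which suggests the scores from the `rf` model may not be well calibrated as probabilities. Because the next part only ranks card_ids by score, this may not hurt the pipeline, but it is probably safer to read the scores as a ranking rather than as literal probabilities.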

In [12]:
### 'target' is the probability of whether an observation is an outlier
df_outlier_prob = pd.DataFrame({"card_id":df_test["card_id"].values})
df_outlier_prob["target"] = predictions
df_outlier_prob.head()

| | card_id | target |
|---|---|---|
| 0 | C_ID_0ab67a22ab | 0.212083 |
| 1 | C_ID_130fd0cbdd | 0.00468 |
| 2 | C_ID_b709037bc5 | 0.007228 |
| 3 | C_ID_d27d835a9f | 0.00468 |
| 4 | C_ID_2b5e3df5c2 | 0.00468 |


# Part 3 Combining Submission:
So far so good!
We now have three datasets:

1. Best Submission
2. Prediction Using Model Without Outliers
3. Probability of Outliers In Test set


In [13]:
# If the test set has the same ratio of outliers as the training set (1.06%),
# then the number of outliers in the test set is about:
123623*0.0106

1310.4038

In [14]:
# To avoid missing predictable outliers, we take the 25,000 card_ids with the highest outlier likelihood
# (about 20% of the test set, far more than the ~1,310 outliers expected above).
outlier_id = pd.DataFrame(df_outlier_prob.sort_values(by='target',ascending = False).head(25000)['card_id'])
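
If you would rather select by ratio than by a fixed count, the cut-off can be derived from the length of the score table. The following is a minimal sketch on a toy frame; `top_fraction_ids` is a hypothetical helper, and the ids and scores are made up.

```python
import pandas as pd

# Toy stand-in for df_outlier_prob; ids and scores are made up.
df_outlier_prob = pd.DataFrame({
    "card_id": ["C_ID_a", "C_ID_b", "C_ID_c", "C_ID_d", "C_ID_e",
                "C_ID_f", "C_ID_g", "C_ID_h", "C_ID_i", "C_ID_j"],
    "target": [0.21, 0.005, 0.007, 0.004, 0.55,
               0.01, 0.02, 0.003, 0.09, 0.006],
})

def top_fraction_ids(scores, fraction):
    """Return the card_ids whose score is in the top `fraction` of rows."""
    n_top = max(1, int(round(len(scores) * fraction)))
    # nlargest sorts by 'target' in descending order and keeps the first n_top rows.
    return scores.nlargest(n_top, "target")[["card_id"]].reset_index(drop=True)

outlier_id = top_fraction_ids(df_outlier_prob, 0.2)
print(outlier_id)
```

With the real table, `fraction=0.2` would select roughly the same number of rows as the fixed 25,000 used above.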

In [15]:
best_submission = pd.read_csv('../input/finaldata/submission_ashish.csv')

In [16]:
most_likely_liers = best_submission.merge(outlier_id,how='right')
most_likely_liers.head()

| | card_id | target |
|---|---|---|
| 0 | C_ID_0ab67a22ab | -2.636719 |
| 1 | C_ID_6d8dba8475 | -0.907665 |
| 2 | C_ID_7f1041e8e1 | -5.957256 |
| 3 | C_ID_22e4a47c72 | -0.08232 |
| 4 | C_ID_b54cfad8b2 | -0.793633 |


In [17]:
%%time
for card_id in most_likely_liers['card_id']:
    model_without_outliers.loc[model_without_outliers['card_id']==card_id,'target']\
    = most_likely_liers.loc[most_likely_liers['card_id']==card_id,'target'].values

CPU times: user 7min, sys: 416 ms, total: 7min
Wall time: 7min
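
The row-by-row loop above took about 7 minutes. A vectorized lookup can do the same replacement much faster. The sketch below uses toy frames with made-up ids and values, and assumes each card_id appears at most once in `most_likely_liers`.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames; ids and values are made up.
model_without_outliers = pd.DataFrame({
    "card_id": ["C_ID_a", "C_ID_b", "C_ID_c", "C_ID_d"],
    "target": [0.10, -0.20, 0.30, -0.40],
})
most_likely_liers = pd.DataFrame({
    "card_id": ["C_ID_b", "C_ID_d"],
    "target": [-5.0, -2.5],
})

# Series of override values indexed by card_id; map() needs these ids to be unique.
override = most_likely_liers.set_index("card_id")["target"]

combined = model_without_outliers.copy()
# Look up each card_id in the override table; ids without an override become NaN.
mapped = combined["card_id"].map(override)
# Keep the override where one exists, otherwise fall back to the Model_1 prediction.
combined["target"] = mapped.fillna(combined["target"])
print(combined)
```

One difference from the loop: if an override value is itself missing (for example, a card_id absent from the best submission after the right merge), this version keeps the Model_1 prediction, whereas the loop would write NaN.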


In [18]:
model_without_outliers.to_csv("combining_submission.csv", index=False)