# `kts` Workflow
## Stacking 
You can stack models in your `Modeling.ipynb` notebook, but I decided to show it in a separate one.

In [1]:
%pylab inline
import pandas as pd
import kts
from kts import *

Populating the interactive namespace from numpy and matplotlib


## Data Loading

In [2]:
kts.ls()
features
features.define_in_scope(globals())

['train', 'test']

[family_size, family_size_encode, encode_age_and_sex, select_numeric]

In [3]:
%%time
train = kts.load('train')
test = kts.load('test')

CPU times: user 13.8 ms, sys: 4.54 ms, total: 18.4 ms
Wall time: 25.2 ms


## Stacking Section

In [4]:
lb

ID,Score,std,Model,FS,Description,FS description,Model source,FS source,Splitter
6F5CC3,0.875949,0.0204519,cb_default,fs_4,"same catboost, numeric + family_sz + age + sex + f_sz encoded","numeric and family size, (age, sex, family_sz) encoded",CatBoostClassifier(loss_function='Logloss'),"FeatureSet(fc_before=stl.concat([select_numeric, family_size]),  fc_after=stl.concat([encode_age_and_sex, family_size_encode]),  target_column='Survived', group_column=None)","StratifiedKFold(n_splits=5, random_state=None, shuffle=False)"
686777,0.874757,0.0216935,cb_default,fs_4_bltn_20,"same catboost, (numeric + family_sz + age + sex + f_sz encoded) select 20 best",Select 20 best features from fs_4 using BuiltinImportance,CatBoostClassifier(loss_function='Logloss'),"FeatureSet(fc_before=stl.concat([select_numeric, family_size]),  fc_after=stl.concat([encode_age_and_sex, family_size_encode]),  target_column='Survived', group_column=None).select(20, lb['6F5CC3'], BuiltinImportance())","StratifiedKFold(n_splits=5, random_state=None, shuffle=False)"
CD1636,0.872937,0.0254346,cb_default,fs_3,"same catboost, numeric + family_sz + age + sex","numeric and family size, age and sex",CatBoostClassifier(loss_function='Logloss'),"FeatureSet(fc_before=stl.concat([select_numeric, family_size]),  fc_after=stl.concat([encode_age_and_sex]),  target_column='Survived', group_column=None)","StratifiedKFold(n_splits=5, random_state=None, shuffle=False)"
38E933,0.773755,0.0443118,cb_default,fs_2,"same catboost, numeric + family_sz",original numeric and family size,CatBoostClassifier(loss_function='Logloss'),"FeatureSet(fc_before=stl.concat([select_numeric, family_size]),  fc_after=stl.empty_like,  target_column='Survived', group_column=None)","StratifiedKFold(n_splits=5, random_state=None, shuffle=False)"
B1C44F,0.750426,0.0642686,cb_default,fs_1,baseline catboost on numeric features,Baseline: original numeric features,CatBoostClassifier(loss_function='Logloss'),"FeatureSet(fc_before=select_numeric,  fc_after=stl.empty_like,  target_column='Survived', group_column=None)","StratifiedKFold(n_splits=5, random_state=None, shuffle=False)"


Let's take the 3 best models from the top of `lb`:

In [5]:
best_ids = list(lb.head(3).index)
best_ids

['6F5CC3', '686777', 'CD1636']

`kts.stack(ids_to_stack, inner_splitter)` returns two objects: a special `Validator` for stacking and a `FeatureConstructor` that yields the predictions of the listed models. The feature constructor can be passed to `FeatureSet` like any other.

In [6]:
from sklearn.model_selection import StratifiedKFold

val_stack, fc_stack = kts.stack(best_ids, inner_splitter=StratifiedKFold(3))
val_stack
fc_stack

Validator(Refiner(inner_splitter=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
    outer_splitter=StratifiedKFold(n_splits=5, random_state=None, shuffle=False)), roc_auc_score)

stl.stack(ids=['6F5CC3', '686777', 'CD1636'])
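The `Refiner` output above pairs an outer splitter with an inner one. How `kts` applies them internally is not shown in this notebook, so the sketch below is only an assumption about the general idea of nested out-of-fold stacking: within each outer fold, first-level models are refit on inner folds of the outer training part, so that every meta-feature comes from a model that never saw that row. It is plain Python, ignores stratification, and the helpers `kfold_indices` and `nested_oof_plan` are hypothetical names, not part of `kts`.

```python
def kfold_indices(n_samples, n_splits):
    """Split range(n_samples) into n_splits contiguous folds (no shuffling)."""
    # The first n_samples % n_splits folds get one extra row.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_oof_plan(n_samples, outer_splits=5, inner_splits=3):
    """Yield (outer_train, outer_valid, inner_pairs) for nested stacking.

    inner_pairs lists (fit_rows, predict_rows): a first-level model fitted on
    fit_rows produces out-of-fold predictions for predict_rows.
    """
    all_rows = list(range(n_samples))
    for outer_valid in kfold_indices(n_samples, outer_splits):
        held_out = set(outer_valid)
        outer_train = [i for i in all_rows if i not in held_out]
        inner_pairs = []
        for inner_fold in kfold_indices(len(outer_train), inner_splits):
            predict_rows = [outer_train[j] for j in inner_fold]
            predicted = set(predict_rows)
            fit_rows = [i for i in outer_train if i not in predicted]
            inner_pairs.append((fit_rows, predict_rows))
        yield outer_train, outer_valid, inner_pairs


# Check the leakage-free property on a toy dataset of 20 rows.
for outer_train, outer_valid, inner_pairs in nested_oof_plan(20):
    covered = sorted(row for _, predict_rows in inner_pairs for row in predict_rows)
    assert covered == outer_train  # each training row gets exactly one OOF prediction
    for fit_rows, predict_rows in inner_pairs:
        assert not set(fit_rows) & set(predict_rows)  # never predict a row used for fitting
        assert not set(fit_rows) & set(outer_valid)   # outer validation rows stay unseen
```

With 5 outer and 3 inner splits, as in the cell above, each first-level model would be refit 15 times under this scheme, which is the price of keeping the second-level validation score honest.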

First, let's build a feature set for blending them with LogisticRegression:

In [7]:
fs_6 = FeatureSet(fc_stack,
                  df_input=train,
                  target_column='Survived',
                  description='fs for blending of 3 best models'
                 )

In [8]:
fs_6

FeatureSet(fc_before=stl.stack(ids=['6F5CC3', '686777', 'CD1636']),
           fc_after=stl.empty_like,
           target_column='Survived', group_column=None)

In [9]:
fs_6[:5]

PassengerId,6F5CC3,686777,CD1636
1,0.106602,0.072579,0.075637
2,0.974403,0.976034,0.974329
3,0.580968,0.618617,0.585384
4,0.975883,0.977233,0.977854
5,0.138547,0.168221,0.140078
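To make the blending step concrete: a logistic regression over these columns learns one weight per first-level model plus an intercept, and passes the weighted sum through a sigmoid. The sketch below shows that with batch gradient descent in plain Python. The predictions and labels are toy values invented for illustration, `fit_blender` and `blend` are hypothetical helper names, and the `LogisticRegression` used in this notebook additionally applies L2 regularization (see `penalty='l2'` in its parameters), so its coefficients would differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def blend(weights, bias, row):
    """Blended probability: sigmoid of a weighted sum of base-model predictions."""
    return sigmoid(bias + sum(w * p for w, p in zip(weights, row)))


def fit_blender(rows, labels, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_models = len(rows[0])
    weights, bias = [0.0] * n_models, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_models, 0.0
        for row, y in zip(rows, labels):
            error = blend(weights, bias, row) - y  # derivative of log loss w.r.t. the logit
            grad_b += error
            for k, p in enumerate(row):
                grad_w[k] += error * p
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


# Toy first-level predictions (one column per model) and toy labels.
X = [[0.10, 0.08, 0.12], [0.95, 0.97, 0.90], [0.60, 0.55, 0.65],
     [0.90, 0.92, 0.97], [0.15, 0.20, 0.10], [0.30, 0.40, 0.35]]
y = [0, 1, 1, 1, 0, 0]

weights, bias = fit_blender(X, y)
blended = [blend(weights, bias, row) for row in X]
```

Because the blender only sees three highly correlated columns, it has very few parameters to fit, which is one reason blending is a cheap first thing to try before a heavier second-level model.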


`fs_6` consists only of the predictions of the first-level models, whereas `fs_7` also includes all the features from `fs_4`, which was the best feature set for single models. Let's see whether that helps:

In [12]:
fs_7 = FeatureSet([select_numeric, family_size, fc_stack],
                  [encode_age_and_sex, family_size_encode],
                  df_input=train,
                  target_column='Survived',
                  description='fs_4 features + stacking of 3 best'
                 )

## Validation Section

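`Validator(sklearn_splitter, sklearn_metric)` is used to validate models on feature sets. The `val_stack` object returned by `kts.stack` is such a validator, so it is scored in the same way: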
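As a rough mental model of what a validator of this kind computes, the sketch below scores a model on each validation fold with ROC AUC and summarizes the folds as a mean and a standard deviation, matching the `Score` and `std` columns of the leaderboard. This is an assumption about the general technique, not the `kts` implementation: `validate`, `roc_auc` and `identity_model` are hypothetical names, and whether `kts` reports the population or the sample standard deviation is not stated in this notebook (the population form is used here).

```python
from statistics import mean, pstdev


def roc_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def validate(fit_predict, X, y, folds, metric=roc_auc):
    """Score a model on each validation fold and summarize with mean and std."""
    scores = []
    for valid_idx in folds:
        held_out = set(valid_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx], [y[i] for i in train_idx],
                            [X[i] for i in valid_idx])
        scores.append(metric([y[i] for i in valid_idx], preds))
    return mean(scores), pstdev(scores)


def identity_model(X_train, y_train, X_valid):
    """Toy 'model' that ignores the training data and returns the single feature as its score."""
    return [row[0] for row in X_valid]


# Toy data: one feature per row, two hand-written validation folds.
X = [[0.1], [0.9], [0.4], [0.8], [0.3], [0.7], [0.6], [0.2]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
folds = [[0, 1, 2, 3], [4, 5, 6, 7]]

score, std = validate(identity_model, X, y, folds)
# Fold AUCs are 1.0 and 0.5, so score == 0.75 and std == 0.25
```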

In [10]:
lr = zoo.bc.LogisticRegression()
val_stack.score(lr, fs_6, description='blending of 3 best models with LogReg')



0.943435728218337

Wow, not bad for just blending! Let's stack another CatBoost over those three models.

In [15]:
cb_stack = zoo.bc.CatBoostClassifier(iterations=100)
val_stack.score(cb_stack, fs_7, description='light catboost, fs_4 + stack(3 best)', verbose=False)

0.947121330816983

Yes, it is slightly better. Here's our leaderboard at this point:

In [16]:
lb

ID,Score,std,Model,FS,Description,FS description,Model source,FS source,Splitter
82762B,0.947121,0.022932,cb_09c,fs_7,"light catboost, fs_4 + stack(3 best)",fs_4 features + stacking of 3 best,"CatBoostClassifier(loss_function='Logloss', iterations=100)","FeatureSet(fc_before=stl.concat([select_numeric, family_size, stl.stack(ids=['6F5CC3', '686777', 'CD1636'])]),  fc_after=stl.concat([encode_age_and_sex, family_size_encode]),  target_column='Survived', group_column=None)","Refiner(inner_splitter=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),  outer_splitter=StratifiedKFold(n_splits=5, random_state=None, shuffle=False))"
24E01D,0.943436,0.0276431,lr_293,fs_6,blending of 3 best models with LogReg,fs for blending of 3 best models,"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)","FeatureSet(fc_before=stl.stack(ids=['6F5CC3', '686777', 'CD1636']),  fc_after=stl.empty_like,  target_column='Survived', group_column=None)","Refiner(inner_splitter=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),  outer_splitter=StratifiedKFold(n_splits=5, random_state=None, shuffle=False))"
6F5CC3,0.875949,0.0204519,cb_default,fs_4,"same catboost, numeric + family_sz + age + sex + f_sz encoded","numeric and family size, (age, sex, family_sz) encoded",CatBoostClassifier(loss_function='Logloss'),"FeatureSet(fc_before=stl.concat([select_numeric, family_size]),  fc_after=stl.concat([encode_age_and_sex, family_size_encode]),  target_column='Survived', group_column=None)","StratifiedKFold(n_splits=5, random_state=None, shuffle=False)"
686777,0.874757,0.0216935,cb_default,fs_4_bltn_20,"same catboost, (numeric + family_sz + age + sex + f_sz encoded) select 20 best",Select 20 best features from fs_4 using BuiltinImportance,CatBoostClassifier(loss_function='Logloss'),"FeatureSet(fc_before=stl.concat([select_numeric, family_size]),  fc_after=stl.concat([encode_age_and_sex, family_size_encode]),  target_column='Survived', group_column=None).select(20, lb['6F5CC3'], BuiltinImportance())","StratifiedKFold(n_splits=5, random_state=None, shuffle=False)"
CD1636,0.872937,0.0254346,cb_default,fs_3,"same catboost, numeric + family_sz + age + sex","numeric and family size, age and sex",CatBoostClassifier(loss_function='Logloss'),"FeatureSet(fc_before=stl.concat([select_numeric, family_size]),  fc_after=stl.concat([encode_age_and_sex]),  target_column='Survived', group_column=None)","StratifiedKFold(n_splits=5, random_state=None, shuffle=False)"
38E933,0.773755,0.0443118,cb_default,fs_2,"same catboost, numeric + family_sz",original numeric and family size,CatBoostClassifier(loss_function='Logloss'),"FeatureSet(fc_before=stl.concat([select_numeric, family_size]),  fc_after=stl.empty_like,  target_column='Survived', group_column=None)","StratifiedKFold(n_splits=5, random_state=None, shuffle=False)"
B1C44F,0.750426,0.0642686,cb_default,fs_1,baseline catboost on numeric features,Baseline: original numeric features,CatBoostClassifier(loss_function='Logloss'),"FeatureSet(fc_before=select_numeric,  fc_after=stl.empty_like,  target_column='Survived', group_column=None)","StratifiedKFold(n_splits=5, random_state=None, shuffle=False)"


This concludes my basic introduction to `kts`. We created features in `kts` format, trained some models, examined feature importances, and even blended and stacked models.  
I didn't cover some important topics, such as creating your own feature constructors that behave differently for train and validation dataframes (like `stl.target_encoding`), defining your own metrics, or creating features from more than one dataframe.  
All of this is possible and will be covered in more advanced tutorials based on real competitions.  
You can always contact [me](https://telegram.me/konodyuk) to clarify anything about `kts`, report a bug, or suggest a competition to use as the next example.