# `kts` Workflow
## Feature Engineering

In [1]:
%pylab inline
import pandas as pd
import kts
from kts import *

Populating the interactive namespace from numpy and matplotlib


Let's read the data and save it to kts storage for faster access. We set `PassengerId` as the index because kts assumes each object has a unique index.

In [3]:
%%time
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.set_index('PassengerId', inplace=True)
test.set_index('PassengerId', inplace=True)
test['Survived'] = -1
kts.save(train, 'train')
kts.save(test, 'test')

CPU times: user 30 ms, sys: 6.67 ms, total: 36.7 ms
Wall time: 39.9 ms
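Since kts relies on a unique index, it can be worth checking that property before saving. The following is a minimal sketch in plain pandas (no kts calls). The toy frames are illustrative stand-ins for the Titanic data, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
df = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
df = df.set_index("PassengerId")

# True when every index label occurs exactly once.
index_is_unique = df.index.is_unique

# verify_integrity=True makes pandas raise ValueError on duplicate labels.
dup = pd.DataFrame({"PassengerId": [1, 1], "Survived": [0, 1]})
try:
    dup.set_index("PassengerId", verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
```

Passing `verify_integrity=True` to `set_index` is a cheap way to fail early instead of discovering duplicate ids later in the pipeline.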


In [4]:
train.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Now let's do some feature engineering. Each function should take a single argument, the input dataframe (`train` in this case), and return a block of new, uniquely named features. We can preview a feature constructor with `@preview(df_name, [sizes_to_eval])`, which runs it on the first rows of the dataframe for each requested size:

In [10]:
@preview(train)
def family_size(df):
    res = stl.empty_like(df)
    res['family_sz'] = df['SibSp'] + df['Parch'] + 1
    res['is_alone'] = (res['family_sz'] == 1).astype(int)
    return res

PassengerId,family_sz,is_alone
1,2,0
2,2,0


PassengerId,family_sz,is_alone
1,2,0
2,2,0
3,1,1
4,2,0


PassengerId,family_sz,is_alone
1,2,0
2,2,0
3,1,1
4,2,0
5,1,1
6,1,1


In [12]:
@preview(train, [10])
def family_size(df):
    res = stl.empty_like(df)
    res['family_sz'] = df['SibSp'] + df['Parch'] + 1
    res['is_alone'] = (res['family_sz'] == 1).astype(int)
    return res

PassengerId,family_sz,is_alone
1,2,0
2,2,0
3,1,1
4,2,0
5,1,1
6,1,1
7,1,1
8,5,0
9,3,0
10,2,0
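To make the idea behind the preview outputs above concrete, here is a rough plain-pandas sketch. It is not how kts implements `@preview`. `run_on_heads` and `family_size_plain` are hypothetical names, and `pd.DataFrame(index=df.index)` is assumed to play the role of `stl.empty_like(df)`. The toy values are the `SibSp` and `Parch` columns of the five rows shown by `train.head()`.

```python
import pandas as pd

def run_on_heads(func, df, sizes=(2, 4, 6)):
    """Apply func to the first n rows of df for each n in sizes (hypothetical helper)."""
    return {n: func(df.head(n)) for n in sizes}

def family_size_plain(df):
    # An index-only frame, assumed equivalent to stl.empty_like(df).
    res = pd.DataFrame(index=df.index)
    res["family_sz"] = df["SibSp"] + df["Parch"] + 1
    res["is_alone"] = (res["family_sz"] == 1).astype(int)
    return res

# SibSp / Parch for the first five passengers, as shown by train.head().
toy = pd.DataFrame(
    {"SibSp": [1, 1, 0, 1, 0], "Parch": [0, 0, 0, 0, 0]},
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

previews = run_on_heads(family_size_plain, toy, sizes=(2, 4))
for n, frame in previews.items():
    print(f"first {n} rows:")
    print(frame)
```

Running a constructor on several small prefixes of the data catches shape and naming mistakes quickly, before the function is applied to the full dataframe.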


When we're sure the function works fine, we can `@register` it. Of course, in a real project we would preview and register a function in the same cell.

In [13]:
@register
def family_size(df):
    res = stl.empty_like(df)
    res['family_sz'] = df['SibSp'] + df['Parch'] + 1
    res['is_alone'] = (res['family_sz'] == 1).astype(int)
    return res

Now the function is saved in the `kts.features` list and can be defined in any notebook with `features.define_in_scope(globals())`; see `Modelling.ipynb` for reference.

In [17]:
features

[family_size]

kts has a standard library of simple feature generators:
* `stl.discretize(columns, n_bins)`
* `stl.discretize_quantile(columns, n_bins)`
* `stl.target_encoding(columns_to_encode, target_columns, aggregation)`
* `stl.make_ohe(columns)`

and some service functors:
* `stl.empty_like(df)` - returns only the index of a dataframe
* `stl.identity(df)` - returns the same dataframe
* `stl.concat([func1, func2])(df)` - create features with `func1` and `func2`, and merge them
* `stl.compose([func1, func2])(df)` - apply `func1` and `func2` to `df` sequentially

In [21]:
stl.empty_like(train.head(2))
stl.identity(train.head(2))

1
2


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
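The `concat` and `compose` functors listed above can be pictured with a short plain-Python sketch. This is an illustration under assumptions, not the kts source. `concat_sketch`, `compose_sketch`, `double_a` and `add_one` are hypothetical names, and the composition is assumed to run left to right (`func1` first, then `func2`).

```python
import pandas as pd

def concat_sketch(funcs):
    """Return a function that applies every func to df and joins the results column-wise."""
    def apply(df):
        return pd.concat([f(df) for f in funcs], axis=1)
    return apply

def compose_sketch(funcs):
    """Return a function that pipes df through funcs in order (assumed left to right)."""
    def apply(df):
        for f in funcs:
            df = f(df)
        return df
    return apply

def double_a(df):
    # Produces one new, uniquely named column on the same index.
    return pd.DataFrame({"a_x2": df["a"] * 2}, index=df.index)

def add_one(df):
    # Adds 1 to every value, keeping column names unchanged.
    return df + 1

toy = pd.DataFrame({"a": [1, 2]}, index=pd.Index([10, 11], name="id"))

merged = concat_sketch([double_a, add_one])(toy)  # columns a_x2 and a, side by side
piped = compose_sketch([double_a, add_one])(toy)  # a_x2 computed as a * 2 + 1
```

Both helpers return a function rather than a dataframe, which matches the `stl.concat([...])(df)` calling convention used in this notebook.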


We can use previously created features to build new ones. You may want to preview with different input dataframe sizes to make sure that target encoding works as expected:

In [25]:
@preview(train, [20, 30])
def family_size_encode(df):
    tmp = family_size(df)
    tmp['Survived'] = df['Survived']
    return stl.concat([
        stl.target_encoding(['family_sz', 'is_alone'], 'Survived'),
        stl.make_ohe(['family_sz', 'is_alone'])
    ])(tmp)

PassengerId,family_sz_te_Survived_mean,is_alone_te_Survived_mean,family_sz_ohe_1,family_sz_ohe_2,family_sz_ohe_3,family_sz_ohe_5,family_sz_ohe_6,family_sz_ohe_7,is_alone_ohe_0,is_alone_ohe_1
1,0.6,0.5,0,1,0,0,0,0,1,0
2,0.6,0.5,0,1,0,0,0,0,1,0
3,0.5,0.5,1,0,0,0,0,0,0,1
4,0.6,0.5,0,1,0,0,0,0,1,0
5,0.5,0.5,1,0,0,0,0,0,0,1
6,0.5,0.5,1,0,0,0,0,0,0,1
7,0.5,0.5,1,0,0,0,0,0,0,1
8,0.0,0.5,0,0,0,1,0,0,1,0
9,1.0,0.5,0,0,1,0,0,0,1,0
10,0.6,0.5,0,1,0,0,0,0,1,0


PassengerId,family_sz_te_Survived_mean,is_alone_te_Survived_mean,family_sz_ohe_1,family_sz_ohe_2,family_sz_ohe_3,family_sz_ohe_5,family_sz_ohe_6,family_sz_ohe_7,is_alone_ohe_0,is_alone_ohe_1
1,0.6,0.461538,0,1,0,0,0,0,1,0
2,0.6,0.461538,0,1,0,0,0,0,1,0
3,0.529412,0.529412,1,0,0,0,0,0,0,1
4,0.6,0.461538,0,1,0,0,0,0,1,0
5,0.529412,0.529412,1,0,0,0,0,0,0,1
6,0.529412,0.529412,1,0,0,0,0,0,0,1
7,0.529412,0.529412,1,0,0,0,0,0,0,1
8,0.0,0.461538,0,0,0,1,0,0,1,0
9,1.0,0.461538,0,0,1,0,0,0,1,0
10,0.6,0.461538,0,1,0,0,0,0,1,0
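The encoded values in the two previews differ because a mean target encoding depends on which rows it is computed from. A minimal plain-pandas sketch of the idea follows. `mean_target_encode` and `one_hot` are hypothetical helpers that only imitate the column naming seen above. They are not the kts implementations, which may differ (for example in how they avoid target leakage).

```python
import pandas as pd

def mean_target_encode(df, columns, target):
    """Replace each category with the mean of `target` over rows sharing that category."""
    res = pd.DataFrame(index=df.index)
    for col in columns:
        res[f"{col}_te_{target}_mean"] = df.groupby(col)[target].transform("mean")
    return res

def one_hot(df, columns):
    """One indicator column per distinct value, named <col>_ohe_<value>."""
    parts = [pd.get_dummies(df[col], prefix=f"{col}_ohe", dtype=int) for col in columns]
    return pd.concat(parts, axis=1)

# family_sz and Survived for the first five passengers shown earlier.
toy = pd.DataFrame(
    {"family_sz": [2, 2, 1, 2, 1], "Survived": [0, 1, 1, 1, 0]},
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

enc_5 = mean_target_encode(toy, ["family_sz"], "Survived")
enc_4 = mean_target_encode(toy.head(4), ["family_sz"], "Survived")
ohe = one_hot(toy, ["family_sz"])

# Passenger 3 is encoded differently depending on how many rows are seen.
print(enc_5.loc[3, "family_sz_te_Survived_mean"])
print(enc_4.loc[3, "family_sz_te_Survived_mean"])
```

In this sketch each row's own target contributes to its encoding, so the values shift as rows are added or removed. That sensitivity is one reason to inspect previews at more than one size.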


In [26]:
@register
def family_size_encode(df):
    tmp = family_size(df)
    tmp['Survived'] = df['Survived']
    return stl.concat([
        stl.target_encoding(['family_sz', 'is_alone'], 'Survived'),
        stl.make_ohe(['family_sz', 'is_alone'])
    ])(tmp)

Let's create a few more features:

In [43]:
# @preview(train, [200])
@register
def encode_age_and_sex(df):
    tmp = stl.discretize(['Age'], 5)(df)
    tmp['Sex'] = df['Sex']
    tmp['Survived'] = df['Survived']
    return stl.concat([
        stl.target_encoding(['disc_5_Age', 'Sex'], 'Survived'),
        stl.make_ohe(['disc_5_Age', 'Sex'])
    ])(tmp)
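For intuition about the binning step, the sketch below shows two common ways to discretize a numeric column in plain pandas: equal-width bins with `pd.cut` and quantile bins with `pd.qcut`. Mapping these onto `stl.discretize` and `stl.discretize_quantile` is an assumption, since the exact bin edges and missing-value handling used by kts are not shown in this notebook. The column name `disc_5_Age` is borrowed from the cell above, and the ages are those of the five rows shown by `train.head()`.

```python
import pandas as pd

ages = pd.Series(
    [22.0, 38.0, 26.0, 35.0, 35.0],
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
    name="Age",
)

# Equal-width binning: the range [min, max] is split into 5 intervals of equal length.
equal_width = pd.cut(ages, bins=5, labels=False)

# Quantile binning: edges are chosen so bins hold roughly equal numbers of rows.
# duplicates="drop" guards against repeated edges when many values coincide.
equal_freq = pd.qcut(ages, q=5, labels=False, duplicates="drop")

binned = pd.DataFrame({"disc_5_Age": equal_width, "disc_q_5_Age": equal_freq})
print(binned)
```

Equal-width bins are easy to interpret but can be nearly empty when the distribution is skewed. Quantile bins keep group sizes balanced, which tends to stabilise per-bin statistics such as a target mean.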

In [49]:
# @preview(train)
@register
def select_numeric(df):
    res = stl.column_selector(['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'])(df)
    return res.fillna(-1)

Now we are ready to train our first model in `ModelingStart.ipynb`.