# Tutorial

xfeat provides transformation classes for feature engineering and exploration.

In [1]:
import pandas as pd
import xfeat

# This tutorial shows how to use xfeat with a synthetic example dataset.
csv_filepath = "./example_dataset.csv"

# Serialize the original csv file into the feather file format.
# For simplicity, we assume that train and test are concatenated to perform feature engineering.
xfeat.utils.compress_df(pd.read_csv(csv_filepath)).to_feather("./example_dataset.ftr")

In [2]:
# Check the serialized data.
df = pd.read_feather("./example_dataset.ftr")
df.head()

Unnamed: 0,id,user_id,target,loan_amt,funded_amt,installment,cat1,cat2,cat3,cat4,cat5
0,b026324c6904b,2a9cb4b88d6d61c81d1,1,5465.659668,5465.659668,137,AE,Q,DD,Y,O9
1,26ab0db90d72e,28ad0ba1e22ee510510,1,9444.073242,9444.073242,100,AE,I,AA,X,O5
2,6d7fce9fee471,194aa8b5b6e47267f03,0,13622.605469,13622.605469,176,CC,Q,A,X,O2
3,48a24b70a0b37,6535542b996af517398,1,1582.478394,1582.478394,179,CC,O,AE,X,O5
4,1dcca23355272,056f04fe8bf20edfce0,1,17355.650391,20826.779297,133,AE,O,AE,X,O5


------

## Categorical columns

xfeat's data transformation classes take a DataFrame as input and return a DataFrame as output.
They support both pandas and cuDF DataFrames.

`xfeat.SelectCategorical` extracts only the categorical columns from the input DataFrame.

`xfeat.Pipeline` chains data transformation objects so that they are applied in sequence.

In [3]:
import xfeat

xfeat.SelectCategorical().fit_transform(df).head()

Unnamed: 0,id,user_id,cat1,cat2,cat3,cat4,cat5
0,b026324c6904b,2a9cb4b88d6d61c81d1,AE,Q,DD,Y,O9
1,26ab0db90d72e,28ad0ba1e22ee510510,AE,I,AA,X,O5
2,6d7fce9fee471,194aa8b5b6e47267f03,CC,Q,A,X,O2
3,48a24b70a0b37,6535542b996af517398,CC,O,AE,X,O5
4,1dcca23355272,056f04fe8bf20edfce0,AE,O,AE,X,O5


In [4]:
from xfeat import SelectCategorical, LabelEncoder, Pipeline

In [5]:
# Select the categorical columns from the DataFrame and label-encode them.
# The encoded values are stored in columns named with the suffix given by `output_suffix`.
# Setting `output_suffix=""` stores the result under the original column name.
encoder = Pipeline([
    SelectCategorical(exclude_cols=["id", "user_id"]),
    LabelEncoder(output_suffix=""),
])

In [6]:
encoder.fit_transform(df).head()

Unnamed: 0,cat1,cat2,cat3,cat4,cat5
0,0,0,0,0,0
1,0,1,1,1,1
2,1,0,2,1,2
3,1,2,3,1,1
4,0,2,3,1,1
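
The codes in the output above are consistent with each distinct value receiving the next integer in order of first appearance. The sketch below illustrates that idea in plain Python. `label_encode` is a hypothetical helper written for this illustration, not part of xfeat, and xfeat's actual implementation may differ.

```python
def label_encode(values):
    """Map each distinct value to an integer code in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            # The next unused integer becomes the code for a new value.
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


# First five values of the `cat2` column shown in the df.head() output.
codes, mapping = label_encode(["Q", "I", "Q", "O", "O"])
print(codes)    # [0, 1, 0, 2, 2]
print(mapping)  # {'Q': 0, 'I': 1, 'O': 2}
```

The printed codes match the `cat2` column of the encoded output for the same five rows.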


`xfeat.cat_encoder.ConcatCombination` creates new columns by combining the input columns.

In [7]:
from xfeat import ConcatCombination


encoder = Pipeline([
    SelectCategorical(exclude_cols=["id", "user_id"]),

    # If there are many categorical columns,
    # users can specify the columns to be combined with the `input_cols` kwarg.
    # `r=2` sets how many columns are combined at a time.
    ConcatCombination(drop_origin=True, output_suffix="", r=2),
    
    LabelEncoder(output_suffix=""),
])
encoder.fit_transform(df).head()

Unnamed: 0,cat1cat2,cat1cat3,cat1cat4,cat1cat5,cat2cat3,cat2cat4,cat2cat5,cat3cat4,cat3cat5,cat4cat5
0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1
2,2,2,2,2,2,2,2,2,2,2
3,3,3,2,3,3,3,3,3,3,1
4,4,4,1,1,3,3,3,3,3,1
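
To see where the ten output columns come from, the sketch below enumerates every pair of the five categorical columns with `itertools.combinations`. It is a plain-Python illustration rather than xfeat's code, and it assumes that the combined value is simply the two strings concatenated.

```python
from itertools import combinations

columns = ["cat1", "cat2", "cat3", "cat4", "cat5"]
# First row of the example dataset.
row = {"cat1": "AE", "cat2": "Q", "cat3": "DD", "cat4": "Y", "cat5": "O9"}

# Every unordered pair of columns (r=2), in the order the columns are listed.
pairs = list(combinations(columns, 2))
names = [a + b for a, b in pairs]

# Assumption: a combined value is the plain concatenation of both strings.
combined = {a + b: row[a] + row[b] for a, b in pairs}

print(len(names))            # 10
print(names[:3])             # ['cat1cat2', 'cat1cat3', 'cat1cat4']
print(combined["cat1cat2"])  # AEQ
```

Five columns taken two at a time give 10 pairs, which matches the number and naming of the columns in the output above.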


------

## Numerical columns

`xfeat.SelectNumerical` extracts only the numerical columns from the input DataFrame.

`xfeat.ArithmeticCombinations` creates new columns by applying an arithmetic operator to combinations of the input columns.

In [8]:
from xfeat import SelectNumerical

SelectNumerical(exclude_cols=["target"]).fit_transform(df).head()

Unnamed: 0,loan_amt,funded_amt,installment
0,5465.659668,5465.659668,137
1,9444.073242,9444.073242,100
2,13622.605469,13622.605469,176
3,1582.478394,1582.478394,179
4,17355.650391,20826.779297,133


In [9]:
from xfeat import ArithmeticCombinations


encoder = Pipeline([
    SelectNumerical(exclude_cols=["target"]),
    ArithmeticCombinations(
        drop_origin=True,
        operator="+",
        r=2,
        output_suffix="",
    ),
])
encoder.fit_transform(df).head()

Unnamed: 0,loan_amtfunded_amt,loan_amtinstallment,funded_amtinstallment
0,10931.319336,5602.659668,5602.659668
1,18888.146484,9544.073242,9544.073242
2,27245.210938,13798.605469,13798.605469
3,3164.956787,1761.478394,1761.478394
4,38182.429688,17488.650391,20959.779297
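
The first output row can be reproduced by hand. The sketch below adds each pair of numerical columns for the first row of the dataset. It is plain Python for illustration only and does not call xfeat.

```python
from itertools import combinations
import math

# First row of the numerical columns shown earlier.
row = {"loan_amt": 5465.659668, "funded_amt": 5465.659668, "installment": 137}

# operator="+" with r=2: add the values of every pair of columns.
sums = {a + b: row[a] + row[b] for a, b in combinations(row, 2)}

for name, value in sums.items():
    print(name, round(value, 6))
```

The three keys correspond to `loan_amtfunded_amt`, `loan_amtinstallment`, and `funded_amtinstallment`, and the values agree with the first row of the output above up to floating-point rounding.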


------

## Lambda Encoder

`xfeat.LambdaEncoder` takes a lambda function as an argument and transforms the columns of the data frame.

In [10]:
from xfeat import LambdaEncoder


encoder = Pipeline([
    SelectNumerical(exclude_cols=["target"]),
    ArithmeticCombinations(
        drop_origin=True,
        operator="+",
        r=2,
        output_suffix="",
    ),

    LambdaEncoder(
        lambda x: float(str(x)[:5]),
        output_prefix="",
        output_suffix="",
        drop_origin=True,
    ),
])

encoder.fit_transform(df).head()

Unnamed: 0,loan_amtfunded_amt,loan_amtinstallment,funded_amtinstallment
0,10931.0,5602.0,5602.0
1,18888.0,9544.0,9544.0
2,27245.0,13798.0,13798.0
3,3164.0,1761.0,1761.0
4,38182.0,17488.0,20959.0
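
It helps to check what the lambda does to a single value. It keeps the first five characters of the value's string form, so it truncates by character count rather than by numeric precision. The sketch below applies the same function to a few Python floats; `truncate` is just a named copy of the lambda for illustration.

```python
def truncate(x):
    # Same body as the lambda passed to LambdaEncoder in the cell above.
    return float(str(x)[:5])


print(truncate(10931.319336))  # 10931.0 ("10931" is kept)
print(truncate(3164.956787))   # 3164.0  ("3164." is kept)
print(truncate(123456.7))      # 12345.0 (a digit of the integer part is lost)
```

The last call shows a caveat: for values with more than five integer digits, this function changes the magnitude instead of only dropping the fractional part.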


## Serialize/Deserialize

The fitted parameters of an encoder can be serialized and deserialized with pickle.

**Serialize:**

In [11]:
import pickle

df_train = pd.read_feather("./example_dataset.ftr").head(10)
df_test = pd.read_feather("./example_dataset.ftr").tail(10)


encoder = Pipeline([
    SelectCategorical(exclude_cols=["id", "user_id"]),
    LabelEncoder(output_suffix=""),
])
df_train_encoded = encoder.fit_transform(df_train)

with open("label_encoder.pkl", "wb") as f:
    pickle.dump(encoder, f)
    
df_train_encoded.head()

Unnamed: 0,cat1,cat2,cat3,cat4,cat5
0,0,0,0,0,0
1,0,1,1,1,1
2,1,0,2,1,2
3,1,2,3,1,1
4,0,2,3,1,1


**Deserialize:**

In [12]:
with open("label_encoder.pkl", "rb") as f:
    encoder = pickle.load(f)

encoder.transform(df_test).head()

Unnamed: 0,cat1,cat2,cat3,cat4,cat5
10,-1,1,6,1,0
11,1,1,-1,0,2
12,-1,2,-1,1,-1
13,1,0,2,-1,0
14,0,3,-1,0,-1


The label encoding mapping fitted on the training data is reused for the test data. Values that were not seen during fitting are assigned -1 in this case.
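
The fit, pickle, and transform round trip can be illustrated without xfeat. In the sketch below the fitted state is a plain dict, and `fit` and `transform` are hypothetical helpers written for this example. xfeat's encoder stores its state differently, but the handling of unseen values mirrors the -1 codes in the output above.

```python
import pickle


def fit(values):
    """Build a value -> code mapping in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping


def transform(mapping, values):
    """Encode values with a fitted mapping; unseen values become -1."""
    return [mapping.get(value, -1) for value in values]


# Fit on "train" values, then serialize and restore the fitted state.
mapping = fit(["AE", "AE", "CC"])
restored = pickle.loads(pickle.dumps(mapping))

# "ZZ" never appeared during fitting, so it is encoded as -1.
encoded = transform(restored, ["CC", "AE", "ZZ"])
print(encoded)  # [1, 0, -1]
```

Because the restored mapping is identical to the fitted one, values seen in training receive the same codes after deserialization.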