For the following examples, we'll use the `seaborn` Titanic dataset.


In [1]:
import pandas as pd
import seaborn as sns

df = sns.load_dataset("titanic")
# Add a synthetic date column (one consecutive day per row), stored as strings
df["signup_date"] = pd.date_range(start="1912-04-01", periods=len(df)).astype(str)
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,signup_date
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False,1912-04-01
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1912-04-02
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True,1912-04-03
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,1912-04-04
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True,1912-04-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True,1914-09-04
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,1914-09-05
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False,1914-09-06
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True,1914-09-07


## handle_data

Helper function for data handling. Its features include formatting columns and handling NaN values.


In [2]:
from ml_qol import handle_data

df = handle_data(df, date_col="signup_date")
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,signup_date
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,Nan,Southampton,no,False,1912-04-01
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1912-04-02
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,Nan,Southampton,yes,True,1912-04-03
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,1912-04-04
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,Nan,Southampton,no,True,1912-04-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,Nan,Southampton,no,True,1914-09-04
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,1914-09-05
888,0,3,female,,1,2,23.4500,S,Third,woman,False,Nan,Southampton,no,False,1914-09-06
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True,1914-09-07


#### `handle_data` parameters

- **df**: Dataset as pd.DataFrame
- **date_col**: name of date column to convert with pd.to_datetime()
- **log_cols**: list of column names to transform with np.log1p. Useful for reducing skew in heavy-tailed distributions
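
To make the two documented transformations concrete, here is a minimal sketch of the underlying pandas and NumPy calls. It does not call `handle_data`. The toy frame and the `fare_log1p` column name are assumptions made for illustration, and whether `handle_data` overwrites the column or adds a new one is not stated here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data; values are illustrative only.
toy = pd.DataFrame(
    {
        "signup_date": ["1912-04-01", "1912-04-02", "1912-04-03"],
        "fare": [0.0, 7.25, 512.3292],
    }
)

# What date_col is documented to use: parse strings into datetime values.
toy["signup_date"] = pd.to_datetime(toy["signup_date"])

# What log_cols is documented to use: log(1 + x), which is defined at x = 0.
# Writing to a new column is a choice of this sketch, not of the library.
toy["fare_log1p"] = np.log1p(toy["fare"])

print(toy.dtypes)
print(toy)
```

Because `np.log1p` computes log(1 + x), zero fares map to zero instead of negative infinity, which is why it is a common choice for non-negative, right-skewed columns such as `fare`.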


# Feature Engineering


## expand_date

Uses a date column to add features such as year, month, day, weekday, and week of year.


In [11]:
from ml_qol import expand_date

fe_df = expand_date(df, date_col="signup_date")
fe_df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,...,embark_town,alive,alone,signup_date,year,month,day,weekday,is_weekend,week_of_year
0,0,3,male,22.0,1,0,7.2500,S,Third,man,...,Southampton,no,False,1912-04-01,1912,4,1,0,False,14
1,1,1,female,38.0,1,0,71.2833,C,First,woman,...,Cherbourg,yes,False,1912-04-02,1912,4,2,1,False,14
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,...,Southampton,yes,True,1912-04-03,1912,4,3,2,False,14
3,1,1,female,35.0,1,0,53.1000,S,First,woman,...,Southampton,yes,False,1912-04-04,1912,4,4,3,False,14
4,0,3,male,35.0,0,0,8.0500,S,Third,man,...,Southampton,no,True,1912-04-05,1912,4,5,4,False,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,...,Southampton,no,True,1914-09-04,1914,9,4,4,False,36
887,1,1,female,19.0,0,0,30.0000,S,First,woman,...,Southampton,yes,True,1914-09-05,1914,9,5,5,True,36
888,0,3,female,,1,2,23.4500,S,Third,woman,...,Southampton,no,False,1914-09-06,1914,9,6,6,True,36
889,1,1,male,26.0,0,0,30.0000,C,First,man,...,Cherbourg,yes,True,1914-09-07,1914,9,7,0,False,37


#### `expand_date` parameters

- **df**: Dataset as pd.DataFrame
- **date_col**: name of date column used to create additional features
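
The columns shown in the output above can be reproduced with plain pandas `.dt` accessors. The sketch below is a hypothetical stand-in (`expand_date_sketch` is not part of `ml_qol`) and assumes that `weekday` counts from Monday = 0 and that `week_of_year` is the ISO week number, which is consistent with the sample rows.

```python
import pandas as pd


def expand_date_sketch(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Hypothetical stand-in for expand_date, built only on pandas .dt accessors."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day"] = dates.dt.day
    out["weekday"] = dates.dt.dayofweek  # Monday = 0 ... Sunday = 6
    out["is_weekend"] = out["weekday"] >= 5  # Saturday or Sunday
    out["week_of_year"] = dates.dt.isocalendar().week.astype(int)  # ISO week
    return out


# Two dates taken from the sample output above.
toy = pd.DataFrame({"signup_date": ["1912-04-01", "1914-09-05"]})
expanded = expand_date_sketch(toy, "signup_date")
print(expanded)
```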


## combine_features

Creates new features by combining numerical and categorical columns, for example by taking products and ratios of numerical columns or concatenating categorical columns.


In [12]:
from ml_qol import combine_features

fe_df = combine_features(
    fe_df,
    target_col="survived",
    num_features=["fare", "age"],
    cat_features=["sex", "class"],
)
fe_df

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,...,year,month,day,weekday,is_weekend,week_of_year,fare_times_age,fare_div_age,sex_class,survived
0,3,male,22.0,1,0,7.2500,S,Third,man,True,...,1912,4,1,0,False,14,159.5000,0.329545,male_Third,0
1,1,female,38.0,1,0,71.2833,C,First,woman,False,...,1912,4,2,1,False,14,2708.7654,1.875876,female_First,1
2,3,female,26.0,0,0,7.9250,S,Third,woman,False,...,1912,4,3,2,False,14,206.0500,0.304808,female_Third,1
3,1,female,35.0,1,0,53.1000,S,First,woman,False,...,1912,4,4,3,False,14,1858.5000,1.517143,female_First,1
4,3,male,35.0,0,0,8.0500,S,Third,man,True,...,1912,4,5,4,False,14,281.7500,0.230000,male_Third,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,2,male,27.0,0,0,13.0000,S,Second,man,True,...,1914,9,4,4,False,36,351.0000,0.481481,male_Second,0
887,1,female,19.0,0,0,30.0000,S,First,woman,False,...,1914,9,5,5,True,36,570.0000,1.578947,female_First,1
888,3,female,,1,2,23.4500,S,Third,woman,False,...,1914,9,6,6,True,36,,,female_Third,0
889,1,male,26.0,0,0,30.0000,C,First,man,True,...,1914,9,7,0,False,37,780.0000,1.153846,male_First,1


#### `combine_features` parameters

- **df**: Dataset as pd.DataFrame
- **target_col**: name of the target column. Passing this ensures the target is excluded when new features are created, which prevents target leakage
- **num_features**: list of numerical columns to build new features from. Creates products and ratios for each combination. Defaults to all numerical columns, which may lead to high memory usage and poor performance
- **cat_features**: list of categorical columns to build new features from. Creates new features by concatenating two categorical columns together. Defaults to all categorical columns, which may lead to high memory usage and poor performance
- **safety_check**: ensures the length of num_features/cat_features does not exceed 50, since longer lists may cause high memory usage and poor performance. Defaults to True. Pass safety_check=False to allow larger computations
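
The pairwise construction described above can be sketched with `itertools.combinations`. This is a hypothetical re-implementation (`combine_features_sketch` is not part of `ml_qol`); the output column names follow the pattern visible in the sample output (`fare_times_age`, `fare_div_age`, `sex_class`), and any handling of zero denominators or the target column is left out.

```python
from itertools import combinations

import pandas as pd


def combine_features_sketch(df, num_features, cat_features):
    """Hypothetical pairwise feature builder; not the library's code."""
    out = df.copy()
    # Numerical pairs: one product and one ratio per pair.
    for a, b in combinations(num_features, 2):
        out[f"{a}_times_{b}"] = out[a] * out[b]
        out[f"{a}_div_{b}"] = out[a] / out[b]  # pandas yields inf/NaN on zero division
    # Categorical pairs: string concatenation joined by an underscore.
    for a, b in combinations(cat_features, 2):
        out[f"{a}_{b}"] = out[a].astype(str) + "_" + out[b].astype(str)
    return out


# Two rows taken from the sample output above.
toy = pd.DataFrame(
    {
        "fare": [7.25, 53.1],
        "age": [22.0, 35.0],
        "sex": ["male", "female"],
        "class": ["Third", "First"],
    }
)
combined = combine_features_sketch(toy, ["fare", "age"], ["sex", "class"])
print(combined)
```

The number of pairs grows quadratically: n columns give n(n-1)/2 pairs, so 50 numerical columns would already yield 1225 pairs and, in this sketch, 2450 new columns. That growth is the likely motivation for the `safety_check` limit.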


## target_encode

Target encodes the columns in the provided list. The encoding is computed over a k-fold split so that no data leakage occurs. Numerical columns are binned first and then target encoded.

Currently, only regression tasks are supported, so we'll use `fare` as the target column.


In [None]:
from ml_qol import target_encode, target_encode_test
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2)

te_df = target_encode(train_df, target_col="fare", cols=["age", "class", "sex"])
te_df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,...,age_te_max,age_te_min,class_te_mean,class_te_median,class_te_max,class_te_min,sex_te_mean,sex_te_median,sex_te_max,sex_te_min
0,1,2,female,28.0,0,0,13.0000,S,Second,woman,...,82.1708,7.7958,21.365726,17.37500,73.5000,0.0,45.544580,23.0,512.3292,7.225
1,1,2,female,48.0,1,2,65.0000,S,Second,woman,...,76.7292,7.8542,21.365726,17.37500,73.5000,0.0,45.544580,23.0,512.3292,7.225
2,0,3,female,45.0,1,4,27.9000,S,Third,woman,...,164.8667,6.9750,13.704060,8.05000,69.5500,0.0,45.544580,23.0,512.3292,7.225
3,0,1,male,24.0,0,0,79.2000,C,First,man,...,263.0000,7.0500,83.533028,61.37920,512.3292,0.0,26.450229,13.0,512.3292,0.000
4,0,2,female,24.0,0,0,13.0000,S,Second,woman,...,263.0000,7.0500,21.365726,17.37500,73.5000,0.0,45.544580,23.0,512.3292,7.225
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
707,0,2,male,31.0,1,1,37.0042,C,Second,man,...,57.0000,7.7500,19.516824,13.93125,73.5000,0.0,26.025379,11.5,512.3292,0.000
708,0,3,male,21.0,0,0,7.9250,S,Third,man,...,34.3750,7.2500,14.099136,8.05000,69.5500,0.0,26.025379,11.5,512.3292,0.000
709,0,1,male,36.0,0,0,40.1250,C,First,man,...,512.3292,0.0000,84.445144,57.00000,512.3292,0.0,26.025379,11.5,512.3292,0.000
710,0,3,male,25.0,0,0,7.7417,Q,Third,man,...,151.5500,7.0500,14.099136,8.05000,69.5500,0.0,26.025379,11.5,512.3292,0.000
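

For readers unfamiliar with k-fold (out-of-fold) target encoding, the idea can be sketched as follows: each row is encoded with statistics computed only from the other folds, so a row's own target value never contributes to its encoding. The function below is a hypothetical illustration, not the `ml_qol` implementation. It handles a single categorical column, skips the binning step for numerical columns, and the fold count and seed are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def target_encode_sketch(df, target_col, col, n_splits=5, seed=0):
    """Hypothetical out-of-fold target encoding for one categorical column."""
    out = df.reset_index(drop=True).copy()
    stats = ["mean", "median", "max", "min"]
    for s in stats:
        out[f"{col}_te_{s}"] = np.nan

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(out):
        # Statistics come only from rows outside the fold being encoded.
        table = out.iloc[fit_idx].groupby(col)[target_col].agg(stats)
        keys = out.loc[enc_idx, col]
        for s in stats:
            out.loc[enc_idx, f"{col}_te_{s}"] = keys.map(table[s]).to_numpy()
    return out


# Illustrative data only: 20 rows alternating between two classes.
toy = pd.DataFrame(
    {"class": ["First", "Third"] * 10, "fare": np.arange(20, dtype=float)}
)
enc = target_encode_sketch(toy, target_col="fare", col="class")
print(enc.head())
```

Because the statistics depend on the fold, rows with the same category can receive slightly different encoded values, which matches the pattern in the output above.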


`target_encode_test` target encodes the test set using values computed from the training set.


In [39]:
X_test = test_df.drop(columns="fare")

te_test_df = target_encode_test(
    train_df, X_test, target_col="fare", cols=["age", "class", "sex"]
)
te_test_df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,embarked,class,who,adult_male,...,age_te_max,age_te_min,class_te_mean,class_te_median,class_te_max,class_te_min,sex_te_mean,sex_te_median,sex_te_max,sex_te_min
0,1,2,female,3.0,1,2,C,Second,child,False,...,31.3875,15.9000,20.504589,14.75,73.5000,0.0,43.959491,22.67915,512.3292,7.225
1,0,3,male,27.0,1,0,C,Third,man,True,...,211.5000,7.8958,13.829861,8.05,69.5500,0.0,25.631248,11.50000,512.3292,0.000
2,1,3,female,4.0,0,1,C,Third,child,False,...,81.8583,11.1333,13.829861,8.05,69.5500,0.0,43.959491,22.67915,512.3292,7.225
3,0,3,male,,0,0,Q,Third,man,True,...,,,13.829861,8.05,69.5500,0.0,25.631248,11.50000,512.3292,0.000
4,0,3,male,21.0,0,0,S,Third,man,True,...,77.9583,7.2500,13.829861,8.05,69.5500,0.0,25.631248,11.50000,512.3292,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174,0,1,male,,0,0,S,First,man,True,...,,,81.428702,59.40,512.3292,0.0,25.631248,11.50000,512.3292,0.000
175,1,1,female,,1,0,C,First,woman,False,...,,,81.428702,59.40,512.3292,0.0,43.959491,22.67915,512.3292,7.225
176,0,1,male,54.0,0,0,S,First,man,True,...,78.2667,14.0000,81.428702,59.40,512.3292,0.0,25.631248,11.50000,512.3292,0.000
177,0,3,male,36.0,1,1,S,Third,man,True,...,512.3292,0.0000,13.829861,8.05,69.5500,0.0,25.631248,11.50000,512.3292,0.000
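

Encoding a test set differs from encoding the training set: no folds are needed, because the test rows carry no target values that could leak. A minimal sketch of that idea, assuming per-category statistics are computed on the full training set and joined onto the test rows, is shown below. `target_encode_test_sketch` is a hypothetical helper, not the library function, and leaving unseen categories as NaN is a choice of this sketch.

```python
import pandas as pd


def target_encode_test_sketch(train, test, target_col, col):
    """Hypothetical test-set encoding from full training-set statistics."""
    stats = ["mean", "median", "max", "min"]
    table = train.groupby(col)[target_col].agg(stats)
    table.columns = [f"{col}_te_{s}" for s in stats]
    # Left join on the category; categories unseen in training become NaN.
    return test.join(table, on=col)


# Illustrative data only.
train = pd.DataFrame(
    {
        "class": ["First", "First", "Third", "Third", "Third"],
        "fare": [50.0, 100.0, 5.0, 10.0, 15.0],
    }
)
test = pd.DataFrame({"class": ["Third", "First", "Second"]})

encoded = target_encode_test_sketch(train, test, target_col="fare", col="class")
print(encoded)
```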
