# Target Encoding benchmarks

This repo contains a set of benchmarks comparing different Target Encoding options (and comparing them with a One Hot Encoder).

Target encoding is a method to encode a categorical variable in which each category is encoded based on the effect it has on the target variable y. It can be especially useful for high-cardinality categorical data, for which a one hot encoding would result in high dimensionality.
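As an illustration (not code from this repo), a minimal sketch of the naive version in pandas, with made-up column names and data:

```python
import pandas as pd

# Naive target encoding: replace each category by the mean of the target y
# observed for that category in the training data.
df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "y":    [1,   0,   1,   1,   0,   1],
})
category_means = df.groupby("city")["y"].mean()
df["city_encoded"] = df["city"].map(category_means)
print(df)
```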

References:

That paper describes an Empirical Bayes method to shrink the per-category expected value of the target toward the overall (prior) mean.
Other approaches to prevent overfitting are a leave-one-out method (where the target of the sample itself is not used when determining the expected value), adding noise, or determining the expected value within a cross-validation scheme.
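For illustration, a sketch of one common shrinkage formula (an m-estimate / Bayesian-average style blend of the category mean and the global mean); the smoothing weight `m` and the function name are illustrative, not taken from any of the benchmarked libraries:

```python
import pandas as pd

def smoothed_target_encode(train, col, target, m=10.0):
    """Blend each category's mean target with the global mean.

    Categories with few observations are pulled toward the global mean;
    larger m means more observations are needed to trust the category mean.
    """
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return train[col].map(smoothed).fillna(global_mean)
```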

Currently available implementations:

## Set-up

Needed packages:

```
conda install scikit-learn pandas matplotlib seaborn statsmodels xgboost lightgbm joblib
```

Additional packages I installed from git master:

```
cd repos

git clone https://github.com/Robin888/hccEncoding-project.git
cd hccEncoding-project
pip install -e .
cd ..

git clone https://github.com/dirty-cat/dirty_cat.git
cd dirty_cat
pip install -e .
cd ..

git clone https://github.com/scikit-learn-contrib/categorical-encoding.git
cd categorical-encoding
pip install -e .
cd ..
```

Download the data:

```
python download_data.py
```

A brief overview of the datasets is given in overview_datasets.ipynb.

Run the benchmarks:

```
python main_test.py
```

TODO:

- datasets: look for other appropriate datasets. Current ideas:

  - add the Criteo Terabyte Click Logs dataset
  - a generated dataset (both one with a uniform distribution and one with rare categories)

  And expand the overview of the categories in each dataset.

- Add the LeaveOneOutEncoder from category_encoders (see the usage sketch after this list) and the CountFeaturizer from the sklearn PR.

- Investigate the different options:

  - Check the different implementations and what the differences are.

  - More clearly benchmark the different options (with/without shrinking, with/without cross-validation, different hyperparameters, ...), and investigate the different results for those.
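As a usage reference for the TODO item above, a minimal sketch with the LeaveOneOutEncoder from category_encoders (the data and column name are made up):

```python
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({"city": ["a", "a", "b", "b", "c"]})
y = pd.Series([1, 0, 1, 1, 0])

# With y passed to fit_transform, each row's own target is left out of the
# category mean used to encode that row, which reduces target leakage.
encoder = ce.LeaveOneOutEncoder(cols=["city"])
X_encoded = encoder.fit_transform(X, y)
print(X_encoded)
```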

An overview of the initial runs of the benchmark is in overview_results.ipynb (on nbviewer), but the results still need to be investigated.

The benchmark code is based on the code provided by Patricio Cerda et al. (2018): https://arxiv.org/pdf/1806.00979.pdf ("Similarity encoding for learning with dirty categorical variables").

## Literature review

What we are describing here as "target encoding" is also known as likelihood encoding, impact coding or effect coding.

https://datascience.stackexchange.com/questions/11024/encoding-categorical-variables-using-likelihood-estimation

Calculating the statistics naively, you will get a biased estimate and risk overfitting. Several methods to prevent this have been described.
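For illustration, a minimal sketch of one such method, out-of-fold (cross-validated) target encoding, where each row is encoded using category means computed on the other folds; the function name and defaults are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(train, col, target, n_splits=5):
    """Encode each row with category means computed on the other folds only,
    so a row's own target value never contributes to its own encoding."""
    global_mean = train[target].mean()
    encoded = np.full(len(train), np.nan)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, enc_idx in kf.split(train):
        means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded[enc_idx] = train.iloc[enc_idx][col].map(means).to_numpy()
    # Categories unseen in the other folds fall back to the global mean.
    return pd.Series(encoded, index=train.index).fillna(global_mean)
```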

Other implementations:

https://datascience.stackexchange.com/questions/11024/encoding-categorical-variables-using-likelihood-estimation

Win-vector blog:

https://stats.stackexchange.com/questions/52132/how-to-do-regression-with-effect-coding-instead-of-dummy-coding-in-r

https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features

vtreat

catboost

https://en.wikipedia.org/wiki/Bayesian_average

https://stackoverflow.com/questions/34314277/what-should-be-taken-as-m-in-m-estimate-of-probability-in-naive-bayes

https://github.com/Dpananos/Categorical-Features

Possible data:

http://varianceexplained.org/r/empirical-bayes-book/

http://varianceexplained.org/r/simulation-bayes-baseball/

http://varianceexplained.org/statistics/beta_distribution_and_baseball/

http://varianceexplained.org/r/empirical_bayes_baseball/

http://varianceexplained.org/r/hierarchical_bayes_baseball/

https://en.wikipedia.org/wiki/Empirical_Bayes_method

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2872278/