# Entry 14 - Encoding Categoricals

## The Problem

Most machine learning algorithms require numeric features. As usual, decision trees and random forests are the exception (tree-based algorithms are more forgiving in general). The last time I played with R, categorical variables could stay categorical for decision trees and random forests.

My tool of choice, scikit-learn, [doesn't allow for categoricals](https://scikit-learn.org/stable/faq.html#why-do-categorical-variables-need-preprocessing-in-scikit-learn-compared-to-other-tools). All features must be encoded as numeric values. The reasons for this have to do with the [extensive amount of work](https://scikit-learn.org/stable/faq.html#why-does-scikit-learn-not-directly-work-with-for-example-pandas-dataframe) needed to support categorical types.
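A minimal sketch of that restriction, using toy data invented for illustration (the `color` values and labels are not from any real dataset): fitting a scikit-learn decision tree on a string feature should fail because the values cannot be converted to floats.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: a single string-valued feature.
X = [["red"], ["green"], ["blue"], ["green"]]
y = [1, 0, 1, 0]

fit_failed = False
try:
    DecisionTreeClassifier().fit(X, y)
except ValueError as err:
    # scikit-learn tries to cast the feature matrix to floats and cannot.
    fit_failed = True
    print(f"fit failed: {err}")
```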

## The Options

- Scikit-learn's preprocessing module
  - Binarizer
  - LabelBinarizer
  - LabelEncoder
  - OneHotEncoder
  - OrdinalEncoder
  - label_binarize
- Scikit-learn's feature_extraction module
  - DictVectorizer
  - FeatureHasher
- category-encoders
  - Backward Difference Contrast
  - BaseN
  - Binary
  - Count
  - Hashing
  - Helmert Contrast
  - James-Stein Estimator
  - LeaveOneOut
  - M-estimator
  - Ordinal
  - One-Hot
  - Polynomial Contrast
  - Sum Contrast
  - Target Encoding
  - Weight of Evidence
- pandas
  - .astype('category') method + .cat.codes attribute
  - pd.get_dummies() function
  - .replace() method + dictionary mapper
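The three pandas options can be sketched roughly as follows. The data, the column names, and the `size_order` mapping are made up for illustration.

```python
import pandas as pd

# Toy data, invented for illustration. dtype=object keeps the string
# columns behaving the same way across pandas versions.
df = pd.DataFrame(
    {
        "color": ["red", "green", "blue", "green"],
        "size": ["S", "M", "L", "M"],
    },
    dtype=object,
)

# 1. .astype('category') + .cat.codes: integer codes, with the
#    categories sorted alphabetically (blue=0, green=1, red=2).
color_codes = df["color"].astype("category").cat.codes

# 2. pd.get_dummies(): one indicator column per category.
color_dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

# 3. .replace() + dictionary mapper: hand-picked integers, which is
#    handy when the categories have a natural order.
size_order = {"S": 0, "M": 1, "L": 2}  # hypothetical ordering
size_codes = df["size"].replace(size_order).astype("int64")
```

Note that none of these remember the mapping for later reuse on new data; that bookkeeping is left to the caller.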

## The Proposed Solution

The **category-encoders** package appeals to me. Benefits include:
- Fully compatible with scikit-learn's transformers (it can be included in pipelines)
- First-class support for pandas dataframes as an input (and optionally as output)
- Can explicitly configure which columns in the data are encoded by name or index, or infer non-numeric columns regardless of input type
- Portability: train a transformer on data, pickle it, reuse it later and get the same thing out
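- All encoders are imported from a single library
- Largest number of encoding options of the three libraries
- The BaseN encoder's base parameter switches between encodings (base 1 is one-hot, base 2 is binary, and a base equal to the number of categories is ordinal), so the encoding itself can become a tunable hyperparameter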

Compare this with **scikit-learn**, where:
- Each encoder has to be imported by name from the preprocessing module
- The encoders have no option for choosing which columns to encode (they assume every column passed to them is categorical), so column selection has to be handled separately
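For comparison, here is a rough sketch of the scikit-learn route. Since `OneHotEncoder` encodes every column it receives, one way to restrict it to a named column is to wrap it in `ColumnTransformer` from `sklearn.compose`. The data and column names are again invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration.
df = pd.DataFrame(
    {
        "color": ["red", "green", "blue", "green"],
        "size_cm": [10.0, 12.5, 9.0, 11.0],
    }
)

# The encoder is pointed at "color" by the ColumnTransformer, not by
# the encoder itself; the remaining column is passed through unchanged.
column_encoder = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["color"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = column_encoder.fit_transform(df)
```

The result is a plain NumPy array with the one-hot columns first and the passed-through column last, so column names have to be tracked separately.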

The **pandas** options are rather limited: only three methods. They also seem to require more code than the other options.