## Integer Encoding

Integer encoding consists of **replacing the categories with digits** from 1 to n (or 0 to n-1, depending on the implementation), where n is the number of distinct categories of the variable. The numbers are **assigned arbitrarily.** This encoding method allows for quick benchmarking of machine learning models.

**Advantages:** Straightforward to implement! Does not expand the feature space!

**Limitations:** Does not capture any information about the category labels! Not suitable for linear models, because the arbitrary numbers imply an order and distances between categories that do not exist!
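To make the idea concrete, here is a minimal sketch on made-up data (the `colours` series is hypothetical and not part of the housing dataset). It uses `pd.factorize`, which numbers the categories in order of first appearance; any other assignment of distinct integers would be an equally valid integer encoding.

```python
import pandas as pd

# Toy data, made up for illustration only.
colours = pd.Series(["blue", "red", "blue", "green", "red"])

# pd.factorize returns 0-based codes and the distinct categories,
# numbered in the order in which they first appear.
codes, categories = pd.factorize(colours)

print(list(categories))  # the distinct categories
print(list(codes))       # encoding from 0 to n-1
print(list(codes + 1))   # the same encoding shifted to 1..n
```

The integers carry no meaning beyond identifying the category: "green" is not "twice" anything just because it received a larger code.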

In [31]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from feature_engine.encoding import OrdinalEncoder

In [32]:
data = pd.read_excel('HousingPrices.xls', 
    usecols=['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice'])
data.head()

| | Neighborhood | Exterior1st | Exterior2nd | SalePrice |
|---|---|---|---|---|
| 0 | CollgCr | VinylSd | VinylSd | 208500.0 |
| 1 | Veenker | MetalSd | MetalSd | 181500.0 |
| 2 | CollgCr | VinylSd | VinylSd | 223500.0 |
| 3 | Crawfor | Wd Sdng | Wd Shng | 140000.0 |
| 4 | NoRidge | VinylSd | VinylSd | 250000.0 |


**Check how many labels each variable has!**

In [33]:
for col in data.columns:
    print(col, ': ', len(data[col].unique()), ' labels')

Neighborhood :  25  labels
Exterior1st :  16  labels
Exterior2nd :  17  labels
SalePrice :  664  labels


**Explore the unique categories!**

In [34]:
data['Neighborhood'].unique()

array(['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst',
       'NWAmes', 'OldTown', 'BrkSide', 'Sawyer', 'NridgHt', 'NAmes',
       'SawyerW', 'IDOTRR', 'MeadowV', 'Edwards', 'Timber', 'Gilbert',
       'StoneBr', 'ClearCr', 'NPkVill', 'Blmngtn', 'BrDale', 'SWISU',
       'Blueste'], dtype=object)

In [35]:
data['Exterior1st'].unique()

array(['VinylSd', 'MetalSd', 'Wd Sdng', 'HdBoard', 'BrkFace', 'WdShing',
       'CemntBd', 'Plywood', 'AsbShng', 'Stucco', 'BrkComm', 'AsphShn',
       'Stone', 'ImStucc', 'CBlock', nan], dtype=object)

In [36]:
data['Exterior2nd'].unique()

array(['VinylSd', 'MetalSd', 'Wd Shng', 'HdBoard', 'Plywood', 'Wd Sdng',
       'CmentBd', 'BrkFace', 'Stucco', 'AsbShng', 'Brk Cmn', 'ImStucc',
       'AsphShn', 'Stone', 'Other', 'CBlock', nan], dtype=object)
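Note that `nan` appears among the unique values of Exterior1st and Exterior2nd, so the label counts printed earlier include missing data as if it were a category. The sketch below, on a small made-up series, shows the difference between `len(s.unique())`, which counts `nan`, and `s.nunique()`, which by default does not.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value (illustrative, not the real column).
exterior = pd.Series(["VinylSd", "MetalSd", np.nan, "VinylSd"])

n_with_nan = len(exterior.unique())   # nan is counted as a value
n_without_nan = exterior.nunique()    # nan is ignored by default
n_missing = int(exterior.isna().sum())

print(n_with_nan, n_without_nan, n_missing)
```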

### Important: learn the encoding from the train set

We select which digit to assign to each category using the train set, and then use those mappings in the test set.

In [37]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']], # predictors
    data['SalePrice'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
X_train.shape, X_test.shape

((2043, 3), (876, 3))

## Integer encoding with pandas

**It is quick and returns a pandas DataFrame.** But it does not store the mappings learned from the train data so that they can be applied to the test data! **We need to capture and save the mappings manually, one variable at a time**, if we are planning to use them in production.

**Create a dictionary with the mappings of categories to numbers!**

In [38]:
ordinal_mapping = {k: i for i, k in 
                   enumerate(X_train['Neighborhood'].unique(), 0)}
ordinal_mapping

{'Edwards': 0,
 'BrkSide': 1,
 'Veenker': 2,
 'ClearCr': 3,
 'Timber': 4,
 'OldTown': 5,
 'Somerst': 6,
 'NAmes': 7,
 'NWAmes': 8,
 'NridgHt': 9,
 'CollgCr': 10,
 'Sawyer': 11,
 'SawyerW': 12,
 'Crawfor': 13,
 'Mitchel': 14,
 'IDOTRR': 15,
 'Gilbert': 16,
 'StoneBr': 17,
 'Blmngtn': 18,
 'MeadowV': 19,
 'SWISU': 20,
 'BrDale': 21,
 'NoRidge': 22,
 'NPkVill': 23,
 'Blueste': 24}

The dictionary indicates which number will replace each category. Numbers were assigned arbitrarily from 0 to n - 1 where n is the number of distinct categories.

**Replace the labels with the integers!**

In [39]:
X_train['Neighborhood'] = X_train['Neighborhood'].map(ordinal_mapping)
X_test['Neighborhood'] = X_test['Neighborhood'].map(ordinal_mapping)

**Explore the result!**

In [40]:
X_train['Neighborhood'].head(10)

1448    0
1397    1
1       2
384     3
530     4
588     3
1027    4
2779    5
453     6
2057    7
Name: Neighborhood, dtype: int64
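One thing to keep in mind with this approach: `Series.map` with a dictionary returns `NaN` for any value that is not a key of the dictionary. The sketch below uses small made-up series (not the real split) to show what would happen if the test set contained a category that was not present in the train set.

```python
import pandas as pd

# Toy train and test data; 'NoRidge' is deliberately absent from train.
train = pd.Series(["CollgCr", "Veenker", "CollgCr", "Crawfor"])
test = pd.Series(["Veenker", "NoRidge"])

# Learn the mapping from the train data only.
mapping = {k: i for i, k in enumerate(train.unique(), 0)}

# Categories without an entry in the mapping become NaN.
test_encoded = test.map(mapping)
print(mapping)
print(test_encoded)
```

Checking the encoded test set for unexpected `NaN` values is therefore a cheap sanity check after mapping.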

**We can turn the previous commands into 2 functions!**

In [41]:
def find_category_mappings(df, variable):
    # map each category to an arbitrary integer from 0 to n-1
    return {k: i for i, k in enumerate(df[variable].unique(), 0)}

def integer_encode(train, test, variable, ordinal_mapping):
    # replace the categories in place in both data sets
    train[variable] = train[variable].map(ordinal_mapping)
    test[variable] = test[variable].map(ordinal_mapping)

**A loop over the remaining categorical variables!**

In [42]:
for variable in ['Exterior1st', 'Exterior2nd']:
    mappings = find_category_mappings(X_train, variable)
    integer_encode(X_train, X_test, variable, mappings)

**The result!**

In [43]:
X_train.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 1448 | 0 | 0 | 0 |
| 1397 | 1 | 0 | 1 |
| 1 | 2 | 0 | 1 |
| 384 | 3 | 1 | 0 |
| 530 | 4 | 1 | 0 |


## Integer Encoding with Scikit-learn

In [44]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']], # predictors
    data['SalePrice'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
X_train.shape, X_test.shape

((2043, 3), (876, 3))

**Create an encoder!**

In [45]:
le = LabelEncoder()
le.fit(X_train['Neighborhood'])

LabelEncoder()

**See the unique classes!**

In [46]:
le.classes_

array(['Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr',
       'Crawfor', 'Edwards', 'Gilbert', 'IDOTRR', 'MeadowV', 'Mitchel',
       'NAmes', 'NPkVill', 'NWAmes', 'NoRidge', 'NridgHt', 'OldTown',
       'SWISU', 'Sawyer', 'SawyerW', 'Somerst', 'StoneBr', 'Timber',
       'Veenker'], dtype=object)

In [47]:
X_train['Neighborhood'] = le.transform(X_train['Neighborhood'])
X_test['Neighborhood'] = le.transform(X_test['Neighborhood'])
X_train.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 1448 | 7 | MetalSd | HdBoard |
| 1397 | 3 | MetalSd | MetalSd |
| 1 | 24 | MetalSd | MetalSd |
| 384 | 4 | HdBoard | HdBoard |
| 530 | 23 | HdBoard | HdBoard |


Unfortunately, the **LabelEncoder** works on one variable at a time. However, **there is a way to automate this for all the categorical variables**. The approach below is taken from this [Stack Overflow thread](https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn).

In [48]:
from collections import defaultdict  # additional import required

In [49]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']], # predictors
    data['SalePrice'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
X_train.shape, X_test.shape

((2043, 3), (876, 3))

In [50]:
d = defaultdict(LabelEncoder)

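**Encode the variables! Then use the dictionary of fitted encoders to encode future data!**

In [54]:
# fit one LabelEncoder per column on the train set and store it in d
train_transformed = X_train.apply(lambda x: d[x.name].fit_transform(x))
# reuse the fitted encoders on the test set; transform raises an error
# for categories that were not seen in the train set
test_transformed = X_test.apply(lambda x: d[x.name].transform(x))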

In [55]:
train_transformed.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 1448 | 7 | 6 | 6 |
| 1397 | 3 | 6 | 8 |
| 1 | 24 | 6 | 8 |
| 384 | 4 | 5 | 6 |
| 530 | 23 | 5 | 6 |


In [56]:
test_transformed.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 1815 | 17 | 9 | 11 |
| 2871 | 7 | 7 | 8 |
| 1232 | 12 | 5 | 6 |
| 1977 | 16 | 10 | 12 |
| 22 | 5 | 10 | 12 |


**Apply the inverse transform to recover the original labels!**

tmp = train_transformed.apply(lambda x: d[x.name].inverse_transform(x))
tmp.head()

Finally, there is another Scikit-learn transformer, the OrdinalEncoder, which encodes multiple variables at the same time. However, by default this transformer returns a NumPy array without column names, so it is not my favourite implementation. More details here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

## Integer Encoding with Feature-Engine

In [69]:
data1 = data[['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice']].dropna()

In [70]:
X_train, X_test, y_train, y_test = train_test_split(
    data1[['Neighborhood', 'Exterior1st', 'Exterior2nd']], # predictors
    data1['SalePrice'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [71]:
ordinal_enc = OrdinalEncoder(
    encoding_method='arbitrary',
    variables=['Neighborhood', 'Exterior1st', 'Exterior2nd'])
ordinal_enc.fit(X_train)

OrdinalEncoder(encoding_method='arbitrary',
               variables=['Neighborhood', 'Exterior1st', 'Exterior2nd'])

**In the encoder dict we can observe the numbers assigned to each category for all the indicated variables!**

In [72]:
ordinal_enc.encoder_dict_

{'Neighborhood': {'CollgCr': 0,
  'ClearCr': 1,
  'BrkSide': 2,
  'Edwards': 3,
  'SWISU': 4,
  'Sawyer': 5,
  'Crawfor': 6,
  'NAmes': 7,
  'Mitchel': 8,
  'Timber': 9,
  'Gilbert': 10,
  'Somerst': 11,
  'MeadowV': 12,
  'OldTown': 13,
  'BrDale': 14,
  'NWAmes': 15,
  'NridgHt': 16,
  'SawyerW': 17,
  'NoRidge': 18,
  'IDOTRR': 19,
  'NPkVill': 20,
  'StoneBr': 21,
  'Blmngtn': 22,
  'Veenker': 23,
  'Blueste': 24},
 'Exterior1st': {'VinylSd': 0,
  'Wd Sdng': 1,
  'WdShing': 2,
  'HdBoard': 3,
  'MetalSd': 4,
  'AsphShn': 5,
  'BrkFace': 6,
  'Plywood': 7,
  'CemntBd': 8,
  'Stucco': 9,
  'BrkComm': 10,
  'AsbShng': 11,
  'ImStucc': 12,
  'CBlock': 13,
  'Stone': 14},
 'Exterior2nd': {'VinylSd': 0,
  'Wd Sdng': 1,
  'Plywood': 2,
  'Wd Shng': 3,
  'HdBoard': 4,
  'MetalSd': 5,
  'AsphShn': 6,
  'CmentBd': 7,
  'BrkFace': 8,
  'Stucco': 9,
  'ImStucc': 10,
  'Stone': 11,
  'AsbShng': 12,
  'Brk Cmn': 13,
  'CBlock': 14,
  'Other': 15}}

**The list of variables that the encoder will transform!**

In [73]:
ordinal_enc.variables_

['Neighborhood', 'Exterior1st', 'Exterior2nd']

**Explore the result!**

In [74]:
X_train = ordinal_enc.transform(X_train)
X_test = ordinal_enc.transform(X_test)
X_train.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 64 | 0 | 0 | 0 |
| 682 | 1 | 1 | 1 |
| 960 | 2 | 1 | 2 |
| 1384 | 3 | 2 | 3 |
| 1100 | 4 | 1 | 1 |


**Note**

If the argument `variables` is left as None, then the encoder will automatically identify all categorical variables. Is that not sweet?

The encoder will not encode numerical variables. So if some of your numerical variables are in fact categories, you will need to re-cast them as object before using the encoder.
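Re-casting a numerical variable as object can be done with `astype` in pandas. The sketch below uses a made-up data frame; the column names `zone_code` and `area` are hypothetical and only serve to illustrate the step.

```python
import pandas as pd

# Toy data: 'zone_code' holds numbers that are really category identifiers.
df = pd.DataFrame({
    "zone_code": [20, 60, 20, 70],
    "area": [850.0, 1200.0, 900.0, 1500.0],
})

# Re-cast the numeric codes as object so they are treated as categories.
df["zone_code"] = df["zone_code"].astype("O")
print(df.dtypes)
```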

Note, if there is a category in the test set for which the encoder doesn't have a number to assign (the category was not seen in the train set), the encoder will raise an error.
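A simple way to anticipate this situation is to compare the categories of the two sets before transforming. The helper below, `find_unseen_categories`, is a hypothetical function written for this sketch (it is not part of Feature-engine), and the data frames are made up.

```python
import pandas as pd

# Toy data; 'NoRidge' appears only in the test set.
train = pd.DataFrame({"Neighborhood": ["CollgCr", "Veenker", "Crawfor"]})
test = pd.DataFrame({"Neighborhood": ["Veenker", "NoRidge"]})

def find_unseen_categories(train, test, variables):
    """Return, per variable, the categories present in test but not in train."""
    unseen = {}
    for variable in variables:
        seen_in_train = set(train[variable].dropna().unique())
        seen_in_test = set(test[variable].dropna().unique())
        missing = seen_in_test - seen_in_train
        if missing:
            unseen[variable] = sorted(missing)
    return unseen

unseen = find_unseen_categories(train, test, ["Neighborhood"])
print(unseen)
```

An empty dictionary means every category in the test set already has a number assigned by the encoder.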