# Encoding categorical data for machine learning models

Categorical data must be encoded numerically before it can be used in scikit-learn models.
Important: simply assigning an integer to each distinct category is not appropriate, because the model would treat the feature as a continuous input (as if the categories were ordered), which is usually not desired. The usual way to encode categorical data is a "one-hot" binary representation: a feature with n distinct categories is represented by a binary array of length n, with a single 1 whose position is different for each category.
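
To make the difference concrete, here is a minimal numpy-only sketch. The colors feature is a hypothetical example, not data from this document; it shows how integer codes impose an order while one-hot rows do not.

```python
import numpy as np

# Hypothetical feature with three categories (names chosen for illustration)
colors = np.array(['red', 'green', 'blue', 'green'])

# Integer codes: np.unique sorts the categories and returns each sample's index
categories, codes = np.unique(colors, return_inverse=True)
print(categories)  # ['blue' 'green' 'red']
print(codes)       # [2 1 0 1], which implies blue < green < red

# One-hot: row i of the identity matrix is the encoding of category i
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)
```

Each row of one_hot contains exactly one 1, so no category is "larger" than another.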

Encoding can be done with several tools: OneHotEncoder and LabelBinarizer in scikit-learn, or the pandas.get_dummies function.

Preprocessing steps:
- Is the data sparse?
    - If yes, use **MaxAbsScaler**, **maxabs_scale**, or **scale** / **StandardScaler** with with_mean=False (all of these accept a scipy.sparse matrix as input).
- Are there outliers in the data?
    - If yes, use **RobustScaler** or **robust_scale**.
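
The checklist above can be sketched as follows. The small matrices are made-up illustration data, and this is only one possible way to apply the scalers.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler, RobustScaler, StandardScaler

# Sparse input: MaxAbsScaler divides each column by its maximum absolute
# value, so zeros stay zero and the matrix stays sparse.
X_sparse = sp.csr_matrix([[1.0, 0.0], [0.0, -4.0], [2.0, 0.0]])
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)
print(X_maxabs.toarray())

# StandardScaler accepts sparse input only when centering is disabled.
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(sp.issparse(X_std))

# Dense input with an outlier: RobustScaler centers on the median and
# scales by the interquartile range, so the outlier has little influence
# on how the other values are scaled.
X_outlier = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_robust = RobustScaler().fit_transform(X_outlier)
print(X_robust.ravel())
```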

## Encoding numpy matrices with OneHotEncoder

When a matrix is given as an input to a scikit-learn model, the model takes the columns as features and the rows as data samples. In the same way, when a OneHotEncoder is fitted to a data matrix, it encodes each column $i$ separately in an array of length $n_i$, where $n_i$ is the number of distinct values in the column.
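
As a minimal sketch of these per-column widths, assuming scikit-learn 0.20 or later, where a fitted encoder exposes a categories_ attribute (the outputs in this document come from an older release that reports n_values_ instead):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[1, 2], [0, 4], [2, 0]])
encoder = OneHotEncoder().fit(X)

# categories_ holds one array of distinct values per input column
for i, cats in enumerate(encoder.categories_):
    print("column", i, "->", cats, "width", len(cats))

# The encoded width is the sum of the per-column widths
total_width = sum(len(cats) for cats in encoder.categories_)
print(encoder.transform(X).shape)
```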

The following example shows how to encode a numpy matrix and a numpy vector with OneHotEncoder.

In [23]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array([[1, 2], [0, 4], [2, 0]])
print(data)

[[1 2]
 [0 4]
 [2 0]]


In [24]:
enc = OneHotEncoder()
enc.fit(data)
enc.n_values_
encoded = enc.transform(data)
print(encoded.toarray())

# If vector c is a non-categorical feature, we can just append it to the encoded categorical data as follows
# .toarray() must be used!
c = np.array([[1], [0], [3]])
d = np.concatenate((encoded.toarray(), c), axis=1)
print(d)

# todo: compare with pd.factorize

[[ 0.  1.  0.  0.  1.  0.]
 [ 1.  0.  0.  0.  0.  1.]
 [ 0.  0.  1.  1.  0.  0.]]
[[ 0.  1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  1.  1.  0.  0.  3.]]


The values in column 0 go up to 2, which gives a maximum of 3 possible values (including zero). Column 1 goes up to 4, which results in 5 possible values. An encoded sample could therefore need 8 binary digits: 3 for the first feature and 5 for the second. However, in our example data the second feature is missing two values, 1 and 3, so the encoder uses 6 digits in total instead of 8. An example follows.

new_sample = np.array([[0, 2]])
enc_sample = enc.transform(new_sample).toarray()
print(enc_sample)

If we use a value for the second feature that does not occur in the fitted data (but is still smaller than the maximum value), the value is not encoded, and all digits for that feature are zero:

In [26]:
enc.transform(np.array([[0,3]])).toarray()

array([[ 1.,  0.,  0.,  0.,  0.,  0.]])

If a value is larger than the maximum value in the feature to which the encoder was fitted, the transform method raises an error: "ValueError: unknown categorical feature present [9] during transform".

In [27]:
# enc.transform(np.array([[0,9]])).toarray()
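
If unseen values should not raise an error, one option is the handle_unknown parameter that appears in the encoder's repr in this document. The sketch below assumes that 'ignore' encodes any unseen value as all zeros for that feature, which is how current scikit-learn documents it; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array([[1, 2], [0, 4], [2, 0]])

# handle_unknown='ignore' encodes unseen categories as all zeros for that
# feature instead of raising an error
tolerant_enc = OneHotEncoder(handle_unknown='ignore').fit(data)
ignored = tolerant_enc.transform(np.array([[0, 9]])).toarray()
print(ignored)

# The default, handle_unknown='error', raises a ValueError
strict_enc = OneHotEncoder().fit(data)
try:
    strict_enc.transform(np.array([[0, 9]]))
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```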

## Encoding pandas dataframe with OneHotEncoder

In [28]:
import pandas as pd

df = pd.DataFrame({'a': [0, 2, 3, 1], 'b': [0, 5, 2, 1]}, columns=['a', 'b'])
df

   a  b
0  0  0
1  2  5
2  3  2
3  1  1


In [29]:
enc.fit(df)

OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)

In [30]:
enc.n_values_

array([4, 6])

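Note that n_values_ reports the maximum value plus one for each column (4 for a, 6 for b). Column b, however, contains only 4 distinct values (0, 1, 2 and 5), just like column a, so the encoder uses 8 digits in total to represent a sample.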

In [31]:
new_sample = pd.DataFrame({'a': [2], 'b': [5]}, columns=['a', 'b'])
enc.transform(new_sample).toarray()

array([[ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

Let's encode a whole dataframe (I'll use the same one as earlier).

In [32]:
enc.transform(df).toarray()

array([[ 1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.]])

If for some reason you need to encode only a single column of a pandas dataframe, you can do it in the following way, by first reshaping the column from a Series object to a two-dimensional numpy array with a single column:

In [33]:
enc.fit(df.a.values.reshape(len(df.a),1))
print("encoded sample:", enc.transform(np.array([[2]])).toarray())

print("df.a type:", type(df.a))
print("df.a reshaped type:", type(df.a.values.reshape(len(df.a),1)))


encoded sample: [[ 0.  0.  1.  0.]]
df.a type: <class 'pandas.core.series.Series'>
df.a reshaped type: <class 'numpy.ndarray'>


Note that only **categorical** data needs to be encoded, not **continuous numerical** data. In the scikit-learn version used here, OneHotEncoder only works on numeric categorical data, not on string categories (newer releases, 0.20 and later, also accept strings).
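
For comparison, a minimal sketch of encoding string categories directly, assuming scikit-learn 0.20 or later (it will not run on the older release that produced the outputs in this document). The city names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single string feature, shaped as a column (one row per sample)
cities = np.array([['london'], ['new york'], ['chicago'], ['london']])

str_enc = OneHotEncoder().fit(cities)
print(str_enc.categories_)  # categories are sorted: chicago, london, new york

encoded_cities = str_enc.transform(cities).toarray()
print(encoded_cities)
```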

## LabelBinarizer

In [12]:
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])
print(lb.classes_)
lb.transform([1,4])

[1 2 4 6]


array([[1, 0, 0, 0],
       [0, 0, 1, 0]])

In [13]:
feature = ['london', 'new york', 'chicago']
lb.fit(feature)
print(lb.classes_)

# Encode the entire list
feature_encoded = lb.transform(feature)
print(feature_encoded)
print('-'*10)

# Encode single inputs
print('london:', lb.transform(['london']))
print('chicago:', lb.transform(['chicago']))
print('new york:', lb.transform(['new york']))
print('-'*10)

# Reverse back 
print(lb.inverse_transform(feature_encoded))

['chicago' 'london' 'new york']
[[0 1 0]
 [0 0 1]
 [1 0 0]]
----------
london: [[0 1 0]]
chicago: [[1 0 0]]
new york: [[0 0 1]]
----------
['london' 'new york' 'chicago']


So LabelBinarizer first sorts the labels (alphabetically for strings) and then encodes them in that order.

In [14]:
lb.fit_transform(['yes', 'no', 'no', 'yes'])

array([[1],
       [0],
       [0],
       [1]])
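
Note that with only two classes LabelBinarizer returns a single column rather than two. If a two-column one-hot matrix is needed, one possible workaround (a sketch, not the only way) is to build the complementary column by hand:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

binary_lb = LabelBinarizer()
y = binary_lb.fit_transform(['yes', 'no', 'no', 'yes'])
print(binary_lb.classes_)  # ['no' 'yes']
print(y.shape)             # (4, 1)

# Build an explicit two-column matrix: the first column marks classes_[0]
# ('no'), the second marks classes_[1] ('yes')
two_columns = np.hstack([1 - y, y])
print(two_columns)
```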

When encoding a whole matrix, the simplest way is to pass it column by column:

In [15]:
# Our data has two features: city, state
ml_data = np.array([['amsterdam', 'netherlands'], ['new york', 'us'], ['tokyo', 'japan']])

# Collect the encoding of each column, then join them side by side
encoded_columns = []
for i in range(ml_data.shape[1]):
    column = ml_data[:,i]
    lb.fit(column)
    print("lb classes:", lb.classes_)
    encoded_feature = lb.transform(column)
    print(encoded_feature)
    encoded_columns.append(encoded_feature)

encoded_matrix = np.concatenate(encoded_columns, axis=1)

print('matrix:')
print(encoded_matrix)

lb classes: ['amsterdam' 'new york' 'tokyo']
[[1 0 0]
 [0 1 0]
 [0 0 1]]
lb classes: ['japan' 'netherlands' 'us']
[[0 1 0]
 [0 0 1]
 [1 0 0]]
matrix:
[[1 0 0 0 1 0]
 [0 1 0 0 0 1]
 [0 0 1 1 0 0]]


The resulting matrix (3x6) can be used as input to an ML model. In this way we expanded 2 features (city and state) into 6 features encoded in a 'one-hot' style.
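
Because a single LabelBinarizer is refitted for every column, only the last fit survives the loop. If new samples need to be encoded later, one option is to keep one fitted binarizer per column. The sketch below is illustrative; binarizers and encode are hypothetical names, not part of scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

ml_data = np.array([['amsterdam', 'netherlands'],
                    ['new york', 'us'],
                    ['tokyo', 'japan']])

# One fitted binarizer per column, so each can be reused later
binarizers = [LabelBinarizer().fit(ml_data[:, i]) for i in range(ml_data.shape[1])]

def encode(rows):
    """One-hot encode each column of `rows` with its own binarizer."""
    parts = [b.transform(rows[:, i]) for i, b in enumerate(binarizers)]
    return np.concatenate(parts, axis=1)

# Encode a new sample with the binarizers fitted above
new_rows = np.array([['tokyo', 'us']])
print(encode(new_rows))
```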

The same approach works for pandas DataFrames:

In [39]:
import pandas as pd
df = pd.DataFrame(ml_data, columns=['city', 'state'])
df_new = pd.DataFrame([])

for column in df.columns:
    lb.fit(df[column])
    encoded_feature = lb.transform(df[column])
    # lb.classes_ holds the sorted distinct labels, one per encoded column
    for i in range(encoded_feature.shape[1]):
        df_new[lb.classes_[i]] = encoded_feature[:,i]
        
print(df)
print(df_new)

        city        state
0  amsterdam  netherlands
1   new york           us
2      tokyo        japan
   amsterdam  new york  tokyo  japan  netherlands  us
0          1         0      0      0            1   0
1          0         1      0      0            0   1
2          0         0      1      1            0   0


## pandas.get_dummies()

The above example is much easier with the pandas $get\_dummies$ function.

In [72]:
pd.get_dummies(df)

   city_amsterdam  city_new york  city_tokyo  state_japan  state_netherlands  state_us
0               1              0           0            0                  1         0
1               0              1           0            0                  0         1
2               0              0           1            1                  0         0


In [74]:
pd.get_dummies(['one', 'two', 'three', 'one'])

   one  three  two
0    1      0    0
1    0      0    1
2    0      1    0
3    1      0    0
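
One caveat: get_dummies only creates columns for the categories present in the data it is given, so new data may produce fewer columns than the training data. A minimal sketch of one way to align them, using made-up frames (recent pandas versions return booleans rather than 0/1, hence the cast to int):

```python
import pandas as pd

train = pd.DataFrame({'city': ['amsterdam', 'new york', 'tokyo']})
new = pd.DataFrame({'city': ['tokyo', 'tokyo']})

train_dummies = pd.get_dummies(train)

# The new data only contains 'tokyo', so get_dummies yields a single column
new_dummies = pd.get_dummies(new)
print(list(new_dummies.columns))

# Reindex to the training columns, filling the missing ones with 0
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0).astype(int)
print(new_aligned)
```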


## LabelEncoder

$LabelEncoder$ simply assigns a numerical value to each distinct categorical input. Like $LabelBinarizer$, it works on one feature at a time: the input must be passed as a one-dimensional list or array, not as a matrix.

In [38]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder().fit_transform(['london', 'new york', 'tokyo', 'london'])
le

array([0, 1, 2, 0])
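
The cell above keeps only the encoded array. If the mapping is needed later, for example to decode predictions, a possible variation is to keep the fitted encoder itself (city_encoder is an illustrative name):

```python
from sklearn.preprocessing import LabelEncoder

city_encoder = LabelEncoder()
codes = city_encoder.fit_transform(['london', 'new york', 'tokyo', 'london'])

print(city_encoder.classes_)                  # sorted distinct labels
print(codes)                                  # integer code of each input
print(city_encoder.inverse_transform(codes))  # back to the city names
print(city_encoder.transform(['tokyo']))      # encode a new input
```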

Once the categorical input is encoded as numerical values, we can use the $OneHotEncoder$ to represent those values as binary data in 'one-hot' style.

In [66]:
# Sparse output is the default, so no extra argument is needed
enc = OneHotEncoder()
# Reshape the 1-D array of codes into a single column
le_column = le.reshape(-1, 1)
enc_input = enc.fit_transform(le_column).toarray()
enc_input

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.]])

If needed, you can then put the result in a pandas DataFrame.

In [59]:
dfn = pd.DataFrame(enc_input, columns=['london', 'new york', 'tokyo'])
dfn

   london  new york  tokyo
0     1.0       0.0    0.0
1     0.0       1.0    0.0
2     0.0       0.0    1.0
3     1.0       0.0    0.0


With the three code cells above, we converted a single feature (given as a list) into 3 features in binary 'one-hot' format.

## Sparse matrices

In [69]:
import scipy.sparse as sp
X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
print(type(X))
X.toarray()

<class 'scipy.sparse.csc.csc_matrix'>


array([[1, 2],
       [0, 3],
       [7, 6]], dtype=int64)
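
Sparse matrices are also relevant when a non-categorical column is appended to one-hot encoded data. As an alternative to converting the encoded data with .toarray() first, the following sketch keeps everything sparse by using scipy.sparse.hstack; it assumes the encoder returns a sparse matrix, which is its default.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

data = np.array([[1, 2], [0, 4], [2, 0]])
c = np.array([[1], [0], [3]])  # non-categorical feature, one value per sample

encoded_sparse = OneHotEncoder().fit_transform(data)  # sparse by default

# Convert the dense column to sparse, then stack the blocks horizontally
combined = sp.hstack([encoded_sparse, sp.csr_matrix(c)]).tocsr()
print(combined.shape)
print(combined.toarray())
```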