# Category encoding

Hello again, welcome back to the machine learning book with scikit-learn. In this chapter, we'll discuss category encoding.

The features in our data sometimes come as labels or categories: for example, the state where a person lives, their educational level, or their marital status. And remember, at the risk of sounding repetitive, machine learning algorithms work with numerical values.

In [None]:
import pandas as pd
import numpy as np

dataset = pd.DataFrame([
    ("Mexico", "Married", "High school"),
    ("Colombia", "Single", "Undergraduate"),
    ("Guinea Equatorial", "Divorced", "College"),
    ("Mexico", "Single", "Primary"),
    ("Colombia", "Single", "Primary"),
], columns=["Country", "Marital status", "Education"])

dataset


Below, we'll look at several ways to encode categorical values so that machine learning algorithms can use them.

## One-hot encoding

A first approach to representing categorical variables as numerical values is *one-hot encoding*.

In simple terms, one-hot encoding converts a categorical variable into a matrix of zeros and ones. Each column in the matrix represents one of the unique values the variable can take, and each row represents an observation or sample. If a sample belongs to a specific category, the entry in that category's column will be 1, while all other entries in the same row will be 0.
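To make the idea concrete, here is a minimal sketch that builds this kind of matrix by hand with NumPy. The variable names (`countries`, `codes`, `one_hot`) are my own, chosen for illustration, and this is not meant to mirror how scikit-learn works internally; it only shows the zeros-and-ones layout described above.

```python
import numpy as np

# The same country values used in the sample dataset.
countries = np.array(
    ["Mexico", "Colombia", "Guinea Equatorial", "Mexico", "Colombia"]
)

# np.unique returns the sorted unique values and, for every sample,
# the position of its category within that sorted list.
categories, codes = np.unique(countries, return_inverse=True)

# One row per sample, one column per category, all zeros to start.
one_hot = np.zeros((len(countries), len(categories)), dtype=int)

# Place a single 1 in each row, in the column of that sample's category.
one_hot[np.arange(len(countries)), codes] = 1

print(categories)
print(one_hot)
```

Every row ends up with exactly one 1, which is where the name "one-hot" comes from.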

For example, taking our sample dataset, let's encode the country using scikit-learn's `OneHotEncoder`.

We import it from `sklearn.preprocessing` and create an instance:

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()


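Next, we train our encoder using `fit`, passing it the column we want to encode. The double brackets keep the selection as a two-dimensional DataFrame, which is the shape scikit-learn transformers expect: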

In [None]:
encoder.fit(dataset[['Country']])


Then we can apply the encoding with `transform`. By default, `OneHotEncoder` returns a sparse matrix, because the result of one-hot encoding consists mostly of zeros. To inspect it, we convert it to a dense matrix with `todense`:

In [None]:
country_transformed = encoder.transform(dataset[['Country']])
country_transformed.todense()


You can see the order of the columns by inspecting the `categories_` attribute:

In [None]:
encoder.categories_


Notice that the categories are listed in the same order as the columns of the matrix: the first category corresponds to the first column, and so on.

### Inverse transformation

Like many other transformers, `OneHotEncoder` also has an `inverse_transform` method, which converts the encoded matrix back into the original categories:

In [None]:
encoder.inverse_transform(
    np.asarray(country_transformed.todense())
)


### Extra arguments

The `OneHotEncoder` class has several extra arguments, but I consider only a couple of them important to mention.

It's common to train your encoder on one dataset and later apply it to new data. In our case, the training dataset only had three countries, so what will happen when your model receives another country in the future? That's precisely what we can control with the `handle_unknown` argument.

Let's create two encoders, each with a different behavior. While we're at it, we'll ask the encoders to return a dense matrix directly by setting `sparse_output=False` (this argument was introduced in scikit-learn 1.2; earlier versions call it `sparse`):

In [None]:
error_encoder = OneHotEncoder(handle_unknown='error', sparse_output=False)
ignore_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)


Then we train them with our existing data:

In [None]:
error_encoder.fit(dataset[['Country']])
ignore_encoder.fit(dataset[['Country']])


And let's see what happens when we try to test them with new data:

In [None]:
new_data = pd.DataFrame(['Costa Rica'], columns=['Country'])


First, let's try the encoder configured with `error`, which is also the default behavior. I'm going to wrap the call in a *try-except* block to catch the exception:

In [None]:
try:
    error_encoder.transform(new_data)
except ValueError as ve:
    print(ve)


If we try the encoder that we told to ignore unknown values, it returns a row of all zeros, because the new country matches none of the known categories:

In [None]:
ignore_encoder.transform(new_data)


### When to use `OneHotEncoder`?

It's good to use this tool when our categories don't have a predefined order, such as in the case of countries. We can't define which one is greater than the other, no matter how patriotic we feel.

## Ordinal encoding

Other variables do carry a notion of order or hierarchy: these are ordinal categorical variables. Think about the level of education within our dataset.

Depending on the problem we're facing, we can define that having completed primary education is less than having completed higher education.

To reflect these types of relationships, we can use the `OrdinalEncoder`:

In [None]:
from sklearn.preprocessing import OrdinalEncoder


Then we create an object of the class, passing as an argument the categories that our variable can take, in the order we want them to be considered. If we don't specify them, the encoder infers them from the data and sorts them (alphabetically, for strings), which rarely matches the order we actually mean:

In [None]:
ordinal_encoder = OrdinalEncoder(categories=[[
 "Primary", "Secondary", "High school", "Undergraduate", "College"
]])


And now we can train the encoder:

In [None]:
ordinal_encoder.fit(dataset[['Education']])


And when transforming the dataset, we obtain the expected result:

In [None]:
ordinal_encoder.transform(dataset[['Education']])


### Extra Arguments

Like the *one-hot* encoder, `OrdinalEncoder` has several extra arguments, but perhaps the most important is the one that specifies how to behave with previously unseen information.

Let's experiment with the two possible values, `error` and `use_encoded_value`:

In [None]:
error_encoder = OrdinalEncoder(categories=[[
 "Primary", "Secondary", "High school", "Undergraduate", "College"
]], handle_unknown='error')

error_encoder.fit(dataset[['Education']])


Once again, we wrap the call in a *try-except* block to catch the error:

In [None]:
try:
    error_encoder.transform([["Kindergarten"]])
except ValueError as ve:
    print(ve)


On the other hand, we can create an encoder that assigns a fallback value to unknown categories by setting `handle_unknown` to `use_encoded_value`. In this case, it is also necessary to set the `unknown_value` argument:

In [None]:
default_encoder = OrdinalEncoder(
    categories=[[
        "Primary", "Secondary", "High school", "Undergraduate", "College"
    ]],
    handle_unknown='use_encoded_value',
    unknown_value=np.nan,
)

default_encoder.fit(dataset[['Education']])


And if we try to transform a value that didn't exist previously:

In [None]:
default_encoder.transform([["Kindergarten"]])


The unseen category receives `np.nan`, the value we set as `unknown_value`, instead of causing an error.
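`np.nan` is not the only option for `unknown_value`. As a sketch, assuming you would rather keep the output free of missing values, you can pick an integer sentinel such as `-1`. It must not collide with the codes already assigned to the known categories (0 through 4 here). The names `levels`, `train`, and `sentinel_encoder` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Education levels in the order we want them to be considered.
levels = ["Primary", "Secondary", "High school", "Undergraduate", "College"]

# The education column from the sample dataset, rebuilt here.
train = pd.DataFrame(
    {"Education": ["High school", "Undergraduate", "College", "Primary", "Primary"]}
)
new_data = pd.DataFrame({"Education": ["Kindergarten", "Primary"]})

# -1 lies outside the range of codes used for known categories,
# so it cannot be confused with any of them.
sentinel_encoder = OrdinalEncoder(
    categories=[levels],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
sentinel_encoder.fit(train)

encoded = sentinel_encoder.transform(new_data)
print(encoded)
```

Whether a sentinel or `np.nan` is the better choice depends on what the downstream model does with each; some estimators reject missing values, while a value like `-1` may be read as "lower than primary".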

### When is it better to use `OrdinalEncoder`?

Use `OrdinalEncoder` when the categories of your variable have a natural order, so that the order is preserved when converting from strings to numbers.
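To see why the explicit `categories` list matters for preserving that order, this sketch compares an encoder that infers the categories from the data with one that receives them explicitly. It rebuilds the education column so it can run on its own, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The education column from the sample dataset, rebuilt here.
education = pd.DataFrame(
    {"Education": ["High school", "Undergraduate", "College", "Primary", "Primary"]}
)

# Without `categories`, the encoder sorts the values it finds in the data.
default_order = OrdinalEncoder()
default_codes = default_order.fit_transform(education)
print(default_order.categories_)
print(default_codes.ravel())

# With `categories`, the codes follow the order we define.
explicit_order = OrdinalEncoder(categories=[[
    "Primary", "Secondary", "High school", "Undergraduate", "College"
]])
explicit_codes = explicit_order.fit_transform(education)
print(explicit_codes.ravel())
```

In the inferred version, "College" gets the lowest code simply because it comes first alphabetically, so the numbers no longer reflect the educational hierarchy.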

```{hint}
Both `OrdinalEncoder` and `OneHotEncoder` can be trained on more than one column at a time. How would you encode the marital status at the same time as either of the other two columns? Better yet, which encoder makes more sense for that attribute of our data?
```

In the next chapter, we'll look at an interesting technique that lets you go from continuous values to categorical values. See you there.