# Feature hashing

Hello, welcome back to this Machine Learning book with scikit-learn. In this chapter, I will talk about how to deal with high-cardinality categorical data, that is, variables that can take on many distinct values.

In machine learning, there is a well-known technique called *feature hashing* or, among friends, the *hashing trick*. This technique applies a hash function to the name of a feature (for a categorical variable, the category itself) and uses the result to pick a position within a fixed-size array.

The basic idea is to convert the input into a more compact and easier-to-process form. Instead of building and storing a complete list of every feature seen in the data, each feature is mapped directly to a column index, so the output has the same length no matter how many distinct features appear.
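To make the idea concrete, here is a minimal pure-Python sketch of the mapping. It is not how scikit-learn implements it: the helper `hash_features` is hypothetical, and MD5 is used only because it gives the same result on every run, whereas scikit-learn relies on MurmurHash3.

```python
import hashlib

def hash_features(sample, n_features):
    """Map a {name: value} dict to a fixed-length list (hypothetical helper)."""
    vector = [0.0] * n_features
    for name, value in sample.items():
        # Hash the feature *name*; MD5 is an assumption made for this sketch,
        # scikit-learn uses MurmurHash3 instead.
        digest = hashlib.md5(name.encode("utf-8")).digest()
        # Reduce the hash to a valid position in the output vector.
        index = int.from_bytes(digest[:4], "little") % n_features
        # Features that land on the same position are added together.
        vector[index] += value
    return vector

sample = {'apple': 2, 'banana': 1, 'orange': 3}
vector = hash_features(sample, n_features=8)
print(vector)
```

Whatever the feature names are, the output always has exactly `n_features` positions, and no dictionary of known names has to be kept in memory.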

Let's see it with an example. We'll start by creating a list of dictionaries:

In [None]:
data = [{'apple': 2, 'banana': 1, 'orange': 3},
        {'banana': 4, 'orange': 1},
        {'kiwi': 3, 'pineapple': 5}]

Then we import the `FeatureHasher` class from the `sklearn.feature_extraction` module:

In [None]:
from sklearn.feature_extraction import FeatureHasher

And we create an object of the class, setting the `n_features` parameter, which determines the number of columns in each resulting vector:

In [None]:
hasher = FeatureHasher(n_features=4)

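And then we can call the `fit_transform` method to transform our dataset in a single step. `FeatureHasher` is stateless, so `fit` has nothing to learn and calling `transform` alone would give the same result: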

In [None]:
hashed_data = hasher.fit_transform(data)

The result of the transformation is always a SciPy sparse matrix, because `FeatureHasher` is typically used with many columns, most of which are zero for any given sample. That is why here I'm converting it to a dense matrix with the `todense` method, so that we can look at the values:

In [None]:
hashed_data.todense()

If the results are not what you expected, I understand: at first glance it's difficult to interpret what the hasher is doing. For now, just remember that we obtained 4-dimensional vectors, exactly as we specified in the constructor with `n_features` equal to 4.
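As a quick check, and assuming the same `data` list used above, the following sketch inspects the type and shape of the result. It also compares `fit_transform` with a plain `transform` on a fresh hasher, which should match because the hasher keeps no state.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction import FeatureHasher

data = [{'apple': 2, 'banana': 1, 'orange': 3},
        {'banana': 4, 'orange': 1},
        {'kiwi': 3, 'pineapple': 5}]

hashed_fit = FeatureHasher(n_features=4).fit_transform(data)
# A fresh hasher that was never fitted: there is nothing to learn.
hashed_only = FeatureHasher(n_features=4).transform(data)

print(sparse.issparse(hashed_fit))   # the output is sparse
print(hashed_fit.shape)              # one row per sample, n_features columns
print(np.array_equal(hashed_fit.toarray(), hashed_only.toarray()))
```

`toarray` is used here instead of `todense` only because it returns a plain NumPy array, which is convenient for comparisons.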

## Extra Parameters

In the previous example, we used dictionaries as input values; however, it is also common to pass each sample as a list of strings, where every string counts as a feature with an implied value of 1. For this, we can set the `input_type` argument to `'string'`:

In [None]:
hasher = FeatureHasher(n_features=4, input_type='string')
hashed_data = hasher.transform([
    ['cat', 'dog', 'bird'],
    ['cat', 'bird'],
    ['fish', 'dog'],
])
hashed_data.todense()
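One way to read string input is as shorthand for a dictionary in which every string has the value 1. The sketch below, which reuses the samples from the cell above and builds the equivalent dictionaries itself, compares both forms:

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

tokens = [['cat', 'dog', 'bird'],
          ['cat', 'bird'],
          ['fish', 'dog']]
# The same samples written as dictionaries with an explicit value of 1.
as_counts = [{token: 1 for token in sample} for sample in tokens]

from_strings = FeatureHasher(n_features=4, input_type='string').transform(tokens)
from_dicts = FeatureHasher(n_features=4, input_type='dict').transform(as_counts)

same = np.array_equal(from_strings.toarray(), from_dicts.toarray())
print(same)
```

If a string appears more than once in a sample, its contributions are added, which is what makes this input type handy for word counts.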

## Explanation of the values

Let's return to those confusing values. Whenever a hash function is involved, we are bound to suffer collisions, particularly when the number of features is very low, as in our case with `n_features` equal to 4. A collision means that different features are assigned the same position within the vector, so their values are added together. In our first example this is unavoidable: the data contains five distinct fruit names and there are only four positions. To mitigate the effect of collisions, `FeatureHasher` uses the sign of the hash to decide whether each value is added or subtracted, so that colliding values tend to cancel out instead of accumulating. This behavior is controlled by the `alternate_sign` parameter, which is enabled by default, and it is why you may also see negative values.
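To see the effect of the sign, the sketch below hashes the same dictionaries from the first example twice: once with the default settings and once with `alternate_sign=False`. The exact positions depend on the hash, so no particular output is assumed here.

```python
from sklearn.feature_extraction import FeatureHasher

data = [{'apple': 2, 'banana': 1, 'orange': 3},
        {'banana': 4, 'orange': 1},
        {'kiwi': 3, 'pineapple': 5}]

# Default behaviour: the sign of the hash decides whether a value is added or subtracted.
signed = FeatureHasher(n_features=4).transform(data).toarray()
# Without alternating signs, colliding values simply accumulate.
unsigned = FeatureHasher(n_features=4, alternate_sign=False).transform(data).toarray()

print(signed)
print(unsigned)
# With signs disabled, each row adds up to the total of the original values.
print(unsigned.sum(axis=1))
```

Comparing the two matrices shows which entries were flipped, and any entry in the unsigned matrix that does not match an original value is the sum of a collision.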

```{hint} For homework, I leave it to you to experiment with the value of `n_features`: choose a value large enough to make collisions unlikely. The scikit-learn documentation advises using a power of two, since otherwise the features will not be mapped evenly to the columns.
```
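As a starting point for that experiment, here is a small sketch. The helper `columns_used` is hypothetical: it hashes each fruit name from the first example on its own row and counts how many distinct columns are occupied. Fewer columns than names means that at least two names collided.

```python
from sklearn.feature_extraction import FeatureHasher

names = ['apple', 'banana', 'orange', 'kiwi', 'pineapple']

def columns_used(n_features):
    """Count the distinct columns the names land in (hypothetical helper)."""
    hasher = FeatureHasher(n_features=n_features, input_type='string')
    # One row per name, so each row has exactly one non-zero column.
    hashed = hasher.transform([[name] for name in names])
    return len(set(hashed.indices))

# Try a few powers of two, as the scikit-learn documentation advises.
results = {n: columns_used(n) for n in (4, 8, 16, 32, 64)}
for n, used in results.items():
    print(f"n_features={n}: {len(names)} names occupy {used} columns")
```

With `n_features=4` a collision is guaranteed, because five names cannot fit in four columns; larger values make collisions less likely but never impossible.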

## Conclusion

While feature hashing is a powerful technique in ML, it is not equally beneficial in all scenarios. It pays off when a variable has so many distinct values that keeping one column per value is impractical. When the attributes have low cardinality there is little to gain, and, as we saw in the example, a low value of `n_features` causes collisions that make us lose information.

In these cases, other techniques such as one-hot encoding or label encoding may be more appropriate.

Want to learn how to work with text? See you in the next chapter.