# Feature Hashing

Also known as the **hashing trick**.

It consists of applying a hash function to a feature (its name, or a categorical value) to map it to a specific position in a fixed-size array.
The idea is to produce a compact and simple output whose width is chosen up front, without storing a vocabulary of the features seen.

It is not ideal for data with low cardinality (use one-hot encoding instead).
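
To make the mechanism concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: `hash_features` is a hypothetical helper, and MD5 from `hashlib` stands in for the MurmurHash3 function that `FeatureHasher` uses, so the column positions will not match the outputs shown further down.

```python
import hashlib


def hash_features(record, n_features=10, alternate_sign=True):
    """Map a {feature_name: value} dict to a fixed-length list of floats.

    Illustrative only: MD5 is an assumption made for this sketch,
    scikit-learn's FeatureHasher uses MurmurHash3 instead.
    """
    vector = [0.0] * n_features
    for name, value in record.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        # The first 4 bytes of the digest choose the column.
        index = int.from_bytes(digest[:4], "little") % n_features
        # One further bit chooses the sign, so colliding features
        # tend to cancel out instead of always piling up.
        sign = -1.0 if alternate_sign and (digest[4] & 1) else 1.0
        vector[index] += sign * value
    return vector


data = [
    {'apple': 2, 'banana': 1, 'orange': 3},
    {'banana': 4, 'orange': 1},
    {'kiwi': 3, 'pineapple': 5},
]

hashed = [hash_features(record) for record in data]
for row in hashed:
    print(row)
```

Every row has exactly `n_features` entries regardless of how many distinct feature names appear, which is what keeps the output compact. The price is that two names can share a column.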

In [5]:
data = [
    {'apple': 2, 'banana': 1, 'orange': 3},
    {'banana': 4, 'orange': 1},
    {'kiwi':3, 'pineapple': 5}
]

In [6]:
from sklearn.feature_extraction import FeatureHasher

In [7]:
# n_features: number of output features
hasher = FeatureHasher(n_features=10)

In [8]:
# FeatureHasher is a stateless estimator: fit() learns nothing from the data, so transform() alone is enough (fit_transform also works)
hashed_data = hasher.fit_transform(data)

In [9]:
# Sparse matrix of shape (3, 10):
# 3 rows: one per dict in data
# 10 columns: n_features defined for the FeatureHasher
hashed_data

<3x10 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [10]:
# Convert the sparse matrix to a dense NumPy matrix to display the values (only do this with small matrices)
hashed_data.todense()

matrix([[ 2.,  0.,  0.,  0.,  0., -4.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0., -5.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., -8.,  0.,  0.,  0.,  0.,  0.,  0.]])

---

You can also use lists of strings as input for the hasher.

In [12]:
hasher = FeatureHasher(n_features=10, input_type='string')

In [14]:
hashed_data = hasher.transform([
    ['cat', 'dog', 'bird'],
    ['cat', 'bird'],
    ['fish', 'dog'],
])

In [18]:
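# Negative values come from the sign that the hash assigns to each feature (alternate_sign=True by default), not from collisions
# The all-zero second row is a collision: 'cat' and 'bird' land in the same column with opposite signs and cancel out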
hashed_data.todense()

matrix([[ 0.,  0.,  0., -1.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0., -1.,  0.,  1.,  0.,  0.,  0.,  0.]])
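
The all-zero second row above can only arise if `'cat'` and `'bird'` share a column with opposite signs, which shows how easily collisions happen when `n_features` is small. The sketch below gives a rough feel for how collisions fall as the output gets wider. It rests on assumptions: `bucket` and `count_collisions` are hypothetical helpers, the vocabulary is synthetic, and MD5 again stands in for MurmurHash3, so the counts only illustrate the trend and are not what `FeatureHasher` would produce.

```python
import hashlib


def bucket(token, n_features):
    """Column index for a token (MD5 stand-in, assumed for illustration)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features


def count_collisions(tokens, n_features):
    """Number of distinct tokens that land in an already occupied column."""
    distinct = set(tokens)
    occupied = {bucket(token, n_features) for token in distinct}
    return len(distinct) - len(occupied)


# Synthetic vocabulary of 1000 distinct tokens (hypothetical data).
vocabulary = [f"token_{i}" for i in range(1000)]

for n_features in (1, 128, 1024, 2 ** 14, 2 ** 18):
    collisions = count_collisions(vocabulary, n_features)
    print(f"n_features={n_features:>6}: {collisions} collisions")
```

With fewer columns than distinct tokens, collisions are unavoidable. Widening the output reduces them at the cost of a larger, still sparse, matrix.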

In [32]:
# scikit-learn recommends using a power of 2 for n_features so that features are mapped evenly to the columns
hasher = FeatureHasher(n_features=128, input_type='string')

In [33]:
hashed_data = hasher.transform([
    ['cat', 'dog', 'bird'],
    ['cat', 'bird'],
    ['fish', 'dog'],
])

In [34]:
hashed_data.todense()

matrix([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  0.,
          1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.