# Design Pattern: Hashed Feature

### Problem

The Hashed Feature design pattern addresses three possible problems associated with categorical features:
- Incomplete vocabulary: the training data may not contain every possible category, for example because of random sampling.
- Model size due to cardinality: a one-hot encoded high-cardinality feature yields feature vectors whose length is in the thousands to millions, and there may be too few training observations to learn from them.
- Cold start: after the model is placed into production, new categories of our feature might appear. The model will be unable to make predictions for these, and so a separate serving infrastructure will be required to handle such cold-start problems.

The pattern groups categorical values into a fixed number of buckets and accepts collisions (a loss of information) in the data representation as the trade-off.

### Solution
The Hashed Feature design pattern represents a categorical input variable by doing the following:
1. Converting the categorical input into a unique string.
2. Invoking a deterministic (no random seeds or salt) and portable (so that the same algorithm can be used in both training and serving) hashing algorithm on the string.
3. Taking the remainder when the hash result is divided by the desired number of buckets. Many hashing algorithms return a signed integer, and in many languages (SQL and C, for example) the remainder of a negative integer is itself negative, so the absolute value of the result is taken.
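
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used later in this notebook: the helper name hash_bucket and the choice of SHA-256 are hypothetical. Python's built-in hash() is avoided because it is salted per process for strings, which would make it non-deterministic between training and serving.

```python
import hashlib


def hash_bucket(category: str, num_buckets: int) -> int:
    """Map a category string to a bucket index in [0, num_buckets)."""
    # Step 1: the category is used as a string and encoded to bytes.
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    # Step 2: deterministic, portable hash; the first 8 bytes are read as a
    # signed 64-bit integer, so the value can be negative.
    fingerprint = int.from_bytes(digest[:8], byteorder="big", signed=True)
    # Step 3: remainder, then absolute value. In Python, % with a positive
    # divisor is already non-negative, so abs() only mirrors the step above.
    return abs(fingerprint % num_buckets)


# Group a few airport codes into 3 buckets (illustrative sample).
codes = ["ABE", "ABI", "ABQ", "ABR", "ABY"]
buckets = {}
for code in codes:
    buckets.setdefault(hash_bucket(code, num_buckets=3), []).append(code)
print(buckets)
```

With five codes and only three buckets, at least two codes must share a bucket. That collision is the loss of information the pattern accepts.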


External References:
- https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63

## Imports and gathering data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# available on https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

airport = pd.read_csv("data/flights_2019-01.csv", usecols=[0,1])
airport = airport.groupby('DEST').sum().reset_index()

In [3]:
print(f'Data points: {len(airport)}')
airport.head()

Data points: 346


| | DEST | FLIGHTS |
|---|---|---|
| 0 | ABE | 340.0 |
| 1 | ABI | 170.0 |
| 2 | ABQ | 1732.0 |
| 3 | ABR | 62.0 |
| 4 | ABY | 84.0 |


## Hashing data

In this section, we hash the airport codes using scikit-learn and TensorFlow.

### sklearn

In [4]:
from sklearn.feature_extraction import FeatureHasher

In [5]:
# determine the number of buckets to group our data
num_buckets = 10

fh = FeatureHasher(n_features=num_buckets, input_type='string')
hashed_features = fh.fit_transform(airport['DEST'])
hashed_features = hashed_features.toarray()

In [6]:
hashed_airports = pd.concat([airport, pd.DataFrame(hashed_features)], axis=1)

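Each airport now has 10 hashed columns. Note that with input_type='string', FeatureHasher expects every sample to be an iterable of strings, so passing the raw 'DEST' column hashes each airport code character by character. That is why a row such as ACT has several non-zero entries. The negative values come from the signed hash that FeatureHasher applies by default (alternate_sign=True). Some newer scikit-learn versions reject this kind of input with an error instead.  
Remember that you can drop the 'DEST' column in a real-world scenario.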

In [7]:
hashed_airports.head(10)

Unnamed: 0,DEST,FLIGHTS,0,1,2,3,4,5,6,7,8,9
0,ABE,340.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,ABI,170.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,ABQ,1732.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ABR,62.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,ABY,84.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0
5,ACT,116.0,0.0,-2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,ACV,130.0,0.0,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7,ACY,298.0,0.0,-1.0,1.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0
8,ADK,9.0,-1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
9,ADQ,53.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
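
To get exactly one bucket per airport code, each sample can be wrapped in a one-element list so that FeatureHasher treats the whole code as a single token. The sketch below is a hedged variant of the cell above, not a rerun of it: the codes list is a small sample copied from the table, and alternate_sign=False is an assumption made here so that every entry stays non-negative.

```python
from sklearn.feature_extraction import FeatureHasher

num_buckets = 10
# Small sample of DEST values taken from the table above (illustrative only).
codes = ["ABE", "ABI", "ABQ", "ABR", "ABY"]

# alternate_sign=False disables the signed hash, so entries are 0.0 or 1.0.
fh = FeatureHasher(n_features=num_buckets, input_type="string", alternate_sign=False)

# Each sample is a list holding one token: the full airport code.
one_bucket_per_code = fh.transform([[code] for code in codes]).toarray()
print(one_bucket_per_code)
```

With the DataFrame from this notebook, the same wrapping could be written as airport['DEST'].apply(lambda d: [d]). Every row then contains a single 1.0, and two codes collide whenever that 1.0 falls in the same column.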


### TensorFlow

https://www.tensorflow.org/tutorials/structured_data/feature_columns

In [8]:
import tensorflow as tf

In [9]:
from tensorflow import feature_column
from tensorflow.keras import layers

In [10]:
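# number of hash buckets, matching the scikit-learn example
num_buckets = 10
# categorical column that hashes each 'DEST' string into one of num_buckets integer ids
airports = feature_column.categorical_column_with_hash_bucket(
    key='DEST',
    hash_bucket_size=num_buckets,
    dtype=tf.dtypes.string)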
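
The feature_column API is deprecated in recent TensorFlow releases in favor of Keras preprocessing layers. As a rough sketch of the same idea, assuming TensorFlow 2.6 or later where tf.keras.layers.Hashing is available, the layer below maps each airport code to an integer bucket id. The sample codes are illustrative and not part of the original notebook.

```python
import tensorflow as tf

num_buckets = 10
# Hashing layer: maps each input string to an integer id in [0, num_buckets).
hashing = tf.keras.layers.Hashing(num_bins=num_buckets)

# A few DEST values as a column of strings (illustrative sample).
codes = tf.constant([["ABE"], ["ABI"], ["ABQ"]])
bucket_ids = hashing(codes)
print(bucket_ids.numpy())
```

The layer is stateless and deterministic, so it returns the same bucket id for a given code at training and at serving time, including for codes never seen during training.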