# **Encoding Categorical Variables**

The most basic approach to representing categorical values as numeric data is to create dummy or indicator variables.
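As a minimal sketch of indicator variables, the snippet below uses `pandas.get_dummies` on a small made-up column. The town values and the `town` prefix are illustrative assumptions, not values read from the OkCupid files.

```python
import pandas as pd

# Toy data: illustrative town names, not loaded from the OkCupid files.
towns = pd.DataFrame({"where_town": ["oakland", "berkeley", "oakland", "alameda"]})

# One 0/1 indicator column per category, in sorted category order.
# Passing drop_first=True would drop one level, which avoids perfect
# collinearity in linear models.
dummies = pd.get_dummies(towns["where_town"], prefix="town", dtype=int)
print(dummies)
```

Each row has exactly one indicator set to 1, namely the one for its own category.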

In [1]:
import pandas as pd
import numpy as np

When a category is rare, it might not appear in every resample, which leaves its indicator column with zero variance. A zero-variance filter can help in that case.

Prior to resampling, we can try to identify which indicator variables have near-zero variance and remove them from the pool of predictors.
A <b style="color: #ff4299">threshold</b> can be used to decide when a variable counts as near-zero variance.
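One possible way to apply such a threshold to 0/1 indicator columns is sketched below. The column names, the data, and the cutoff of `0.15` are all assumptions chosen for illustration; the rule used here (share of the less common value) is only one of several reasonable definitions of near-zero variance.

```python
import pandas as pd

# Hypothetical indicator columns; 1 marks membership in the category.
X = pd.DataFrame({
    "town_oakland": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "town_fairfax": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "town_nowhere": [0] * 10,
})

threshold = 0.15  # assumed cutoff on the minority share; tune for your data

# For a 0/1 column, the mean is the fraction of rows in the category.
freq = X.mean()
# Measure rarity on whichever of the two values is less common.
minority_share = pd.concat([freq, 1 - freq], axis=1).min(axis=1)

# Keep only the columns whose minority share reaches the threshold.
keep = minority_share[minority_share >= threshold].index.tolist()
X_filtered = X[keep]
print(keep)
```

Here `town_nowhere` has zero variance and `town_fairfax` falls below the assumed cutoff, so only `town_oakland` is kept.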

Even so, we might not want to remove those variables. Another approach is to add an <u style="color: #ff4299">Other</u> category that pools the rarer categories.

Again, a <b style="color: #ff4299">threshold</b> can be used to decide when a category is rare enough to be pooled. Since this step might affect the performance metrics of the model, we include it in the resampling process as well.
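The pooling step can be sketched as two small helpers: one that learns the rare levels from the data used for fitting, and one that applies them. The function names `fit_rare_levels` and `pool_rare`, the toy data, and the `0.10` cutoff are hypothetical. Keeping the two steps separate makes it possible to learn the rare levels inside each resample and apply them to the held-out rows.

```python
import pandas as pd

# Toy training column; the counts are made up for illustration.
towns = pd.Series(
    ["oakland"] * 6 + ["berkeley"] * 3 + ["fairfax"] + ["benicia"],
    name="where_town",
)
threshold = 0.10  # assumed minimum share for a category to keep its own level


def fit_rare_levels(train, threshold):
    """Return the set of categories whose relative frequency is below threshold."""
    share = train.value_counts(normalize=True)
    return set(share[share < threshold].index)


def pool_rare(values, rare_levels, other="other"):
    """Replace every rare category with a single pooled label."""
    return values.where(~values.isin(rare_levels), other)


rare = fit_rare_levels(towns, threshold)
pooled = pool_rare(towns, rare)
print(pooled.value_counts())
```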

Another approach is hashing. When hashes are used to create dummy variables, the procedure is called **"feature hashing"** or the **"hash trick"**.

In [3]:
from sklearn.feature_extraction import FeatureHasher

# Use the default float dtype so the hashed values (and their signs) are preserved.
h = FeatureHasher(n_features=16)
D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
# Hash the example dictionaries above; `location` is only defined in a later cell.
f = h.transform(D)
f.toarray()

#### Load the data

In [4]:
okc_train_df = pd.read_csv("./data/okc/okc_train.csv")
okc_test_df = pd.read_csv("./data/okc/okc_test.csv")
okc_down_df = pd.read_csv("./data/okc/okc_down.csv")
okc_sampled_df = pd.read_csv("./data/okc/okc_sampled.csv")

In [28]:
from sklearn.feature_extraction import FeatureHasher

sample_towns = [
  'alameda', 'belmont', 'benicia', 'berkeley', 'castro_valley', 'daly_city', 
  'emeryville', 'fairfax', 'martinez', 'menlo_park', 'mountain_view', 'oakland', 
  'other', 'palo_alto', 'san_francisco', 'san_leandro', 'san_mateo', 
  'san_rafael', 'south_san_francisco', 'walnut_creek'
]

location = okc_train_df['where_town'].drop_duplicates().sort_values().reset_index()
location.head(3)

# feature_hasher = FeatureHasher(n_features=16, alternate_sign=False)

# feature_hasher.fit_transform(location['where_town'])
# location

   index where_town
0     77    alameda
1    625     albany
2     50    belmont


In [27]:
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**4, binary=True, alternate_sign=False)
X = vectorizer.fit_transform(location['where_town'])
result_df = pd.DataFrame(X.toarray())

# result_df['location'] = location['where_town']

result_df.insert(0, 'location', location['where_town'])

result_df = result_df[result_df['location'].isin(sample_towns)]

print(location.shape)
print(result_df)

(51, 2)
               location    0    1    2    3    4    5    6    7    8    9  \
0               alameda  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0   
2               belmont  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0   
4               benicia  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
5              berkeley  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0   
7         castro_valley  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0   
9             daly_city  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0   
12           emeryville  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
13              fairfax  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0   
21             martinez  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0   
22           menlo_park  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
26        mountain_view  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
28              oakland  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0

- Less aliasing allows for better interpretation of the results. If a predictor has a significant impact on the model, it is probably a good idea to understand why. When multiple categories are aliased due to a collision, untangling the true effect may be very difficult, although the collision may still help narrow down which categories are affecting the response.
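To make aliasing concrete, the sketch below assigns the 20 sample towns to 16 hash columns and lists the columns shared by more than one town. It uses `hashlib.md5` only because it is deterministic across runs; this is an assumption made for illustration, and scikit-learn's hashers use MurmurHash3, so the actual column assignments will differ. The helper name `bucket` is hypothetical.

```python
import hashlib
from collections import defaultdict

towns = [
    "alameda", "belmont", "benicia", "berkeley", "castro_valley", "daly_city",
    "emeryville", "fairfax", "martinez", "menlo_park", "mountain_view", "oakland",
    "other", "palo_alto", "san_francisco", "san_leandro", "san_mateo",
    "san_rafael", "south_san_francisco", "walnut_creek",
]
n_features = 16


def bucket(name, n_features):
    """Map a category name to a column index via a deterministic hash."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


# Group towns by the column they hash to.
buckets = defaultdict(list)
for town in towns:
    buckets[bucket(town, n_features)].append(town)

# Columns holding more than one town are aliased.
collisions = {col: names for col, names in buckets.items() if len(names) > 1}
print(collisions)
```

With 20 categories and only 16 columns, at least one collision is guaranteed; increasing `n_features` reduces aliasing at the cost of more columns.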

### Supervised Encoding

#### **Likelihood encoding**

For example, for the Ames housing data, we might calculate the mean or median sale price of a house for each neighborhood from the training data and use this statistic to represent the factor level in the model. There are a variety of ways to estimate the effects and these will be discussed below.
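A minimal sketch of the mean-based version of this idea is shown below. The prices and neighborhood labels are made up rather than taken from the Ames data, and falling back to the overall training mean for a neighborhood that was not seen in training is an assumption, not something prescribed by the text.

```python
import pandas as pd

# Made-up training and test data; not the real Ames housing values.
train = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "B", "C"],
    "sale_price": [200_000, 220_000, 150_000, 160_000, 170_000, 300_000],
})
test = pd.DataFrame({"neighborhood": ["A", "C", "D"]})

# Estimate the effect of each level from the training data only.
means = train.groupby("neighborhood")["sale_price"].mean()
overall = train["sale_price"].mean()

# Replace each level with its training-set statistic.
train["neighborhood_enc"] = train["neighborhood"].map(means)
# Assumed fallback: levels unseen in training get the overall training mean.
test["neighborhood_enc"] = test["neighborhood"].map(means).fillna(overall)
print(test)
```

Swapping `.mean()` for `.median()` gives the median-based variant mentioned above.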