# Chapter 05

# Categorical Variables: Counting Eggs in the Age of Robotic Chickens

## 1. Encoding Categorical Variables

Categorical variables are usually not numeric.

How do we change non-numeric categories into numbers?



### Method 1. One-Hot Encoding

- Assign 1 to the feature for the observed category and 0 to all the others.
- You can do this with ``` sklearn.preprocessing.OneHotEncoder ```.
- A categorical variable with _k_ possible categories is encoded as a feature vector of length _k_.

| Cities|e1|e2|e3|
| ------|---|---|---|
| San Francisco|1|0|0|
| New York |0|1|0|
| Seattle|0|0|1|

- You can put this formally:

    #### _e_<sub>1</sub> + _e_<sub>2</sub> + .. + _e_<sub>k</sub> = 1
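
The sum-to-one constraint can be checked with a small pure-Python sketch. The helper name `one_hot` and the alphabetical ordering of categories are assumptions made for illustration; they are not part of pandas or scikit-learn.

```python
def one_hot(values):
    """Return (categories, rows), where each row is a one-hot list."""
    categories = sorted(set(values))           # assumed ordering: alphabetical
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                      # exactly one bit is set per row
        rows.append(row)
    return categories, rows


cities = ["San Francisco", "New York", "Seattle", "New York"]
categories, encoded = one_hot(cities)
for city, row in zip(cities, encoded):
    # Every row sums to 1, which is the constraint e1 + ... + ek = 1
    print(city, row, "sum =", sum(row))
```

Because every row sums to 1, any one column can be recovered from the others. That is the extra degree of freedom that dummy coding removes.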

### Method 2. Dummy Coding

- What's the matter with one-hot-encoding? 
    : it allows for k degrees of freedom, while the variable itself needs only _k_-1. 
    
- Dummy Coding removes the extra degree of freedom by using only _k_-1 features in the representation. 


| Cities|e1|e2|
| ------|---|---|
| San Francisco|1|0|
| New York |0|1|
| Seattle|0|0|

- The resulting model is more interpretable than with one-hot encoding.

Toy Dataset Example

Suppose we have the following table of apartment rental prices:

| |City|Rent|
| ------|---|---|
| 0|SF|3999|
| 1|SF|4000|
| 2|SF|4001|
| 3|NYC|3499|
| 4|NYC|3500|
| 5|NYC|3501|
| 6|Seattle|2499|
| 7|Seattle|2500|
| 8|Seattle|2501|


We can try a *linear regressor* to predict rental price based solely on the identity of the city. 

- Suppose we have a linear regression model: 

y = w<sub>1</sub>x<sub>1</sub> + .. + w<sub>n</sub>x<sub>n</sub> + _b_

In [3]:
import pandas as pd
from sklearn import linear_model

Define a toy dataset of apartment rental prices in New York, San Francisco, and Seattle

In [4]:
df = pd.DataFrame({
    'City' : ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC', 
             'Seattle', 'Seattle', 'Seattle'], 
    'Rent' : [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]
})

In [5]:
df['Rent'].mean()

3333.3333333333335

Convert the categorical variables in the DataFrame to one-hot encoding and fit a linear regression model

In [6]:
one_hot_df = pd.get_dummies(df, prefix = ['city'])

In [7]:
one_hot_df

| |Rent|city_NYC|city_SF|city_Seattle|
|---|---|---|---|---|
|0|3999|0|1|0|
|1|4000|0|1|0|
|2|4001|0|1|0|
|3|3499|1|0|0|
|4|3500|1|0|0|
|5|3501|1|0|0|
|6|2499|0|0|1|
|7|2500|0|0|1|
|8|2501|0|0|1|


The book uses ``` model = linear_regression.LinearRegression()```, but this raises an error.

The ``` linear_regression.LinearRegression()``` method has been updated as follows:

``` linear_model.LinearRegression() ```

From: scikit-learn official documentation
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [10]:
model = linear_model.LinearRegression()

In [11]:
model.fit(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']],
         one_hot_df['Rent'])

LinearRegression()

In [12]:
model.coef_

array([ 166.66666667,  666.66666667, -833.33333333])

In [13]:
model.intercept_

3333.3333333333335

Train a linear regression model on the dummy-coded data.

Specify the ```drop_first``` flag to get dummy coding.


In [14]:
dummy_df = pd.get_dummies(df, prefix = ['city'], drop_first = True)

In [15]:
dummy_df

| |Rent|city_SF|city_Seattle|
|---|---|---|---|
|0|3999|1|0|
|1|4000|1|0|
|2|4001|1|0|
|3|3499|0|0|
|4|3500|0|0|
|5|3501|0|0|
|6|2499|0|1|
|7|2500|0|1|
|8|2501|0|1|


In [16]:
model.fit(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])

LinearRegression()

In [17]:
model.coef_

array([  500., -1000.])

In [18]:
model.intercept_

3500.0

#### Summary: 

Linear Regression Learned Coefficients

| |x1 (NYC)|x2 (SF)|x3 (Seattle)|b|
|---|---|---|---|---|
|One-hot encoding|166.67|666.67|-833.33|3333.33|
|Dummy coding|0|500|-1000|3500|

- One-hot-encoding
    - the intercept term (b)  = the global mean of the target variable, ``` Rent ```. 
    - each of the linear coefficients = how much that city's average rent **differs** from the **global mean** (b) 

- Dummy-coding
    - bias coefficient = mean value of the response variable _y_ for the reference category (e.g., NYC)
    - coefficient for the _i_ th feature = difference between the mean response value for the _i_ th category and the mean of the reference category (e.g., NYC)
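
These interpretations can be checked by hand from the group means of the toy rent data. The following is a minimal sketch using only the standard library; the dictionary layout and variable names are chosen for illustration.

```python
from statistics import mean

# Toy rent data from the DataFrame defined earlier
rents = {
    "SF": [3999, 4000, 4001],
    "NYC": [3499, 3500, 3501],
    "Seattle": [2499, 2500, 2501],
}

city_means = {city: mean(vals) for city, vals in rents.items()}
global_mean = mean(v for vals in rents.values() for v in vals)

# One-hot encoding: coefficient = city mean - global mean
one_hot_coefs = {c: m - global_mean for c, m in city_means.items()}

# Dummy coding with NYC as reference: coefficient = city mean - NYC mean
reference = "NYC"
dummy_coefs = {
    c: m - city_means[reference] for c, m in city_means.items() if c != reference
}

print("global mean:", global_mean)
print("one-hot coefficients:", one_hot_coefs)
print("dummy intercept:", city_means[reference])
print("dummy coefficients:", dummy_coefs)
```

The differences match the fitted values in the summary table: roughly 166.67, 666.67, and -833.33 around a global mean of 3333.33 for one-hot encoding, and 500 and -1000 around the NYC mean of 3500 for dummy coding.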
    
    

### Method 3. Effect Coding

- another variant of categorical variable encoding
- very similar to dummy coding
- what's different? the reference category is now represented by the vector of all **-1's**

Effect coding of a categorical variable representing 3 cities

| |e1|e2|
|---|---|---|
|San Francisco|1|0|
|New York|0|1|
|Seattle|-1|-1|

- This results in linear regression models that are even **simpler to interpret**

- See what happens below if we put **effect coding as input**

In [19]:
#Linear regression with effect coding
effect_df = dummy_df.copy()

The book uses ```df.ix```, but recent versions of pandas have removed the `.ix` indexer. Use ```loc``` instead.

p. 82, Example 5-2, second code snippet:

Old version: 

``` effect_df.ix[3:5, ['city_SF', 'city_Seattle']] = -1.0 ```

=> 

Updated version: 

``` effect_df.loc[3:5, ['city_SF', 'city_Seattle']] = -1.0 ```

In [23]:
effect_df.loc[3:5,['city_SF','city_Seattle']] = -1.0

In [24]:
effect_df

| |Rent|city_SF|city_Seattle|
|---|---|---|---|
|0|3999|1.0|0.0|
|1|4000|1.0|0.0|
|2|4001|1.0|0.0|
|3|3499|-1.0|-1.0|
|4|3500|-1.0|-1.0|
|5|3501|-1.0|-1.0|
|6|2499|0.0|1.0|
|7|2500|0.0|1.0|
|8|2501|0.0|1.0|


In [25]:
model.fit(effect_df[['city_SF', 'city_Seattle']],effect_df['Rent'])

LinearRegression()

In [26]:
model.coef_

array([ 666.66666667, -833.33333333])

In [27]:
model.intercept_

3333.3333333333335
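
With effect coding the intercept is again the global mean, and each coefficient is that city's offset from it. The reference category (NYC in this run) has no coefficient of its own, but because its rows are coded as all -1's, its offset is the negative sum of the other coefficients. A small arithmetic sketch, with the numbers copied from the output above:

```python
# Values copied from the fitted effect-coding model above
intercept = 3333.3333333333335
coefs = {"SF": 666.66666667, "Seattle": -833.33333333}

# Reference rows have x = (-1, -1), so the prediction for NYC is
# intercept - sum(coefs); its effect is the negative sum of the others.
nyc_effect = -sum(coefs.values())
nyc_prediction = intercept + nyc_effect

print("NYC effect:", round(nyc_effect, 2))
print("NYC predicted rent:", round(nyc_prediction, 2))
```

This recovers the 166.67 offset that one-hot encoding reported for NYC, along with NYC's mean rent of 3500.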

****

### Pros and Cons of Categorical Variable Encodings

#### 1. One-hot encoding
- **Pros**: each feature clearly corresponds to a category; missing data can be encoded as the all-zeros vector, in which case the output should be the overall mean of the target variable
- **Cons**: redundant, which allows for multiple valid models for the same problem

#### 2. Dummy coding
- **Pros**: not redundant; gives rise to unique and interpretable models
- **Cons**: cannot easily handle missing data, since the all-zeros vector is already mapped to the reference category; it also encodes each category's effect relative to the reference category, which may look strange

#### 3. Effect coding
- **Pros**: avoids the missing-data problem by using a different code for the reference category
- **Cons**: the vector of all -1's is dense, which is expensive for both storage and computation
    
    
---

Popular ML software packages (e.g., Pandas and scikit-learn) have opted for dummy coding or one-hot encoding (instead of effect coding)


### HOWEVER, 

all 3 encoding techniques break down when the number of categories becomes ***VERY LARGE***. ***Different strategies*** are needed to handle extremely large categorical variables. 

    

Existing Solutions: 

- Do nothing fancy with the encoding! Just use LOTS and LOTS of machines!

- Compress the features (Feature hashing or Bin counting)

### Solution 1. Feature Hashing

A hash function: 
- is a deterministic function
- maps a potentially unbounded integer to a finite integer range [1, m]
- can map multiple inputs to the same output, since the input domain is potentially larger than the output range; this is called a collision

<img src = "./img-y/Figure5.png/">

Uniform hash function: 

- takes in numbered keys and routes each one to one of _m_ bins, so that roughly the same number of keys lands in each bin
- keys with the same number always get routed to the same bin
- maintains the feature space while reducing storage and processing time during ML training and evaluation cycles

Hash functions can be constructed for any object that can be represented numerically: 
- numbers
- strings
- complex structures, etc. 

Feature hashing: 
- compresses the original feature vector into an _m_-dimensional vector by applying a hash function to the feature ID

In [28]:
#Feature hashing for word features

def hash_features(word_list, m):
    output = [0]*m
    for word in word_list:
        index = hash_fcn(word) % m  # hash_fcn: any deterministic hash function (not defined here)
        output[index] += 1
    return output
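
The snippet above leaves `hash_fcn` undefined. A runnable sketch follows, assuming an MD5-based `hash_fcn` built on the standard library's `hashlib`; this is an arbitrary choice for illustration, not the book's. Python's built-in `hash()` is avoided because string hashes are randomized per process by default, so bin assignments would not be reproducible across runs.

```python
import hashlib


def hash_fcn(word):
    """Deterministic integer hash of a string (MD5 is an assumed choice)."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def hash_features(word_list, m):
    """Count each word into one of m bins selected by hash_fcn."""
    output = [0] * m
    for word in word_list:
        index = hash_fcn(word) % m
        output[index] += 1
    return output


words = ["the", "quick", "brown", "fox", "the"]
vec = hash_features(words, m=8)
print(vec)
```

The output always has length `m` regardless of vocabulary size, and the repeated word lands in the same bin both times.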

Signed feature hashing

- adds a sign component to feature hashing so that counts are either added to or subtracted from the hashed bin 

- ensures that the inner products between hashed features are equal in expectation to those of the original features

In [29]:
#Example 5-4. Signed feature hashing

def hash_features(word_list, m):
    output = [0] * m 
    for word in word_list:
        index = hash_fcn(word) % m 
        sign_bit = sign_hash(word) % 2
        if (sign_bit == 0):
            output[index] -= 1
        else:
            output[index] += 1
    return output
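
Example 5-4 relies on two undefined helpers, `hash_fcn` and `sign_hash`. One way to make it runnable is sketched below, assuming both are derived from MD5 with different prefixes so that the bin index and the sign are chosen independently. The prefixes and the function name `signed_hash_features` are hypothetical and introduced only for this illustration.

```python
import hashlib


def _md5_int(text):
    """Deterministic integer digest of a string."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def hash_fcn(word):
    # Assumed construction: prefix the word so this hash differs from sign_hash
    return _md5_int("index:" + word)


def sign_hash(word):
    return _md5_int("sign:" + word)


def signed_hash_features(word_list, m):
    """Add or subtract 1 in the hashed bin depending on the sign bit."""
    output = [0] * m
    for word in word_list:
        index = hash_fcn(word) % m
        sign = 1 if sign_hash(word) % 2 else -1
        output[index] += sign
    return output


words = ["cheap", "pills", "meeting", "agenda", "cheap"]
vec = signed_hash_features(words, m=8)
print(vec)
```

When two different words collide in a bin with opposite signs, their counts partially cancel instead of piling up. This is why the inner product is preserved only in expectation.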

From the book: 

    The value of the inner product after hashing is within O(1/√m) of the original inner product, so the size of the hash table m can be selected based on acceptable errors. In practice, picking the right m could take some trial and error.
    
    Feature hashing can be used for models that involve the inner product of feature vectors and coefficients, such as linear models and kernel methods. It has been demonstrated to be successful in the task of spam filtering (Weinberger et al., 2009). In the case of targeted advertising, McMahan et al. (2013) report not being able to get the prediction errors down to an acceptable level unless m is on the order of billions, which does not constitute enough saving in space.
    
    One downside to feature hashing is that the hashed features, being aggregates of original features, are no longer interpretable.

Let's take a look at another example, ***Yelp reviews***

In [1]:
#Example 5-5. Feature hashing  (a.k.a. "the hashing trick")

import pandas as pd
import json 


In [2]:
# Load the first 10,000 reviews

f = open('/Users/yklee/study/yelp/yelp_academic_dataset_review.json')

In [3]:
js = [] #json file

In [4]:
for i in range(10000):
    js.append(json.loads(f.readline()))

In [5]:
f.close()

In [6]:
review_df = pd.DataFrame(js)

In [7]:
review_df.shape

(10000, 9)

In [68]:
# Define m as the number of unique business_ids

In [8]:
m = len(review_df.business_id.unique())
m  # The book reports 528; more may have been added since. The Korean edition reports 8577.

4398

In [9]:
#Let's hash features!

from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=m, input_type='string')

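# Each sample must be an iterable of strings, so wrap every ID in a list.
# Bare strings get hashed character by character, and newer scikit-learn versions may reject them.
f = h.transform([[business_id] for business_id in review_df['business_id']])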

How does this affect feature **interpretability**? 

In [10]:
review_df['business_id'].unique().tolist()[0:5]

['-MhfebM0QIsKt87iDN-FNw',
 'lbrU8StCq3yDfr-QMnGrmQ',
 'HQl28KMwrEKHqhFrrDqVNQ',
 '5JxlZaqCnk1MnbgRirs40Q',
 'IS4cv902ykd8wj1TR0N3-A']

In [11]:
f.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

... Not great. BUT! let's see the storage size of our feature. 

In [12]:
from sys import getsizeof

print('Our pandas Series, in bytes: ', getsizeof(review_df['business_id']))

print('Our hashed numpy array, in bytes: ', getsizeof(f))

Our pandas Series, in bytes:  790160
Our hashed numpy array, in bytes:  64
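
A caveat on the comparison above: `sys.getsizeof` only measures the Python wrapper object of a SciPy sparse matrix, not the buffers it points to, so 64 bytes understates the real footprint. A fairer estimate sums the underlying arrays. The sketch below assumes the hashed output is stored in CSR format and uses a synthetic stand-in matrix with one nonzero per row, not the Yelp data; `sparse_nbytes` is a hypothetical helper name.

```python
import sys

import numpy as np
from scipy.sparse import csr_matrix

n_rows, n_cols = 10_000, 4_398                 # shapes taken from the cells above
rng = np.random.default_rng(0)
rows = np.arange(n_rows)
cols = rng.integers(0, n_cols, size=n_rows)    # synthetic: one nonzero per row
data = np.ones(n_rows)
X = csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))


def sparse_nbytes(mat):
    """Bytes held by the three CSR buffers (values, column indices, row pointers)."""
    return mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes


print("getsizeof(X), wrapper only:", sys.getsizeof(X))
print("CSR buffers, in bytes:     ", sparse_nbytes(X))
print("dense float64, in bytes:   ", n_rows * n_cols * 8)
```

The sparse representation is still far smaller than a dense array of the same shape, but it is not smaller than the original column of IDs by the margin the 64-byte figure suggests.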


From the book: 

    We can clearly see how using feature hashing will benefit us computationally, **sacrificing immediate user interpretability.** This is an easy trade-off to accept when progressing from data exploration and visualization into a machine learning pipeline for large datasets.

## Example 5-6. Bin-counting example

### click-through rate prediction by Avazu

link => https://www.kaggle.com/c/avazu-ctr-prediction

add.ref = https://www.kaggle.com/ozlerhakan/counting-eggs-in-the-age-of-robotic-chickens



In [13]:
df = pd.read_csv('/Users/yklee/study/click-dataset/train.csv', nrows=10000)

FileNotFoundError: [Errno 2] File /Users/yklee/study/click-dataset/train.csv does not exist: '/Users/yklee/study/click-dataset/train.csv'

In [15]:
# In a notebook cell, shell commands need a leading "!"
!head -n100000 train.csv > train_subset.csv

In [17]:
import gzip
import random
from datetime import datetime

import pandas as pd

n = 40428967  # total number of records in the clickstream data
sample_size = 1000000
# Row numbers to skip (row 0 is the header); building this list over ~40M rows is slow
skip_values = sorted(random.sample(range(1, n), n - sample_size))
# pd.datetime is no longer available in recent pandas; use the standard library instead
parse_date = lambda val: datetime.strptime(val, '%y%m%d%H')
with gzip.open('/Users/yklee/study/click-dataset/train.gz') as f:
    train = pd.read_csv(f, parse_dates=['hour'], date_parser=parse_date, skiprows=skip_values)

KeyboardInterrupt: 

In [18]:
import pandas as pd

In [23]:
bin_df = pd.read_csv('/Users/yklee/study/click-dataset/train.gz', nrows=10000, compression='gzip',
                     header=0, sep=',', quotechar='"',
                     on_bad_lines='skip')  # replaces error_bad_lines=False (pandas >= 1.3)

In [31]:
bin_df.head(100)

| |id|click|hour|C1|banner_pos|site_id|site_domain|site_category|app_id|app_domain|...|device_type|device_conn_type|C14|C15|C16|C17|C18|C19|C20|C21|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1000009418151094273|0|14102100|1005|0|1fbe01fe|f3845767|28905ebd|ecad2386|7801e8d9|...|1|2|15706|320|50|1722|0|35|-1|79|
|1|10000169349117863715|0|14102100|1005|0|1fbe01fe|f3845767|28905ebd|ecad2386|7801e8d9|...|1|0|15704|320|50|1722|0|35|100084|79|
|2|10000371904215119486|0|14102100|1005|0|1fbe01fe|f3845767|28905ebd|ecad2386|7801e8d9|...|1|0|15704|320|50|1722|0|35|100084|79|
|3|10000640724480838376|0|14102100|1005|0|1fbe01fe|f3845767|28905ebd|ecad2386|7801e8d9|...|1|0|15706|320|50|1722|0|35|100084|79|
|4|10000679056417042096|0|14102100|1005|1|fe8cc448|9166c161|0569f928|ecad2386|7801e8d9|...|1|0|18993|320|50|2161|0|35|-1|157|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|95|10015376300289320595|0|14102100|1005|0|1fbe01fe|f3845767|28905ebd|ecad2386|7801e8d9|...|1|0|15701|320|50|1722|0|35|100084|79|
|96|10015405794859644629|1|14102100|1005|0|1fbe01fe|f3845767|28905ebd|ecad2386|7801e8d9|...|1|0|15701|320|50|1722|0|35|100084|79|
|97|10015629448289660116|1|14102100|1005|0|1fbe01fe|f3845767|28905ebd|ecad2386|7801e8d9|...|1|0|15708|320|50|1722|0|35|-1|79|
|98|100156980486870304|0|14102100|1005|0|1fbe01fe|f3845767|28905ebd|ecad2386|7801e8d9|...|1|0|15706|320|50|1722|0|35|-1|79|


In [33]:
# How many unique features should we have afterwards?
# The click data was loaded into bin_df above, not df
len(bin_df['device_id'].unique())

1075

In [34]:
def click_counting(x, bin_column):
    clicks = pd.Series(x[x['click'] > 0][bin_column].value_counts(), name='clicks')
    no_clicks = pd.Series(x[x['click'] < 1][bin_column].value_counts(), name='no_clicks')
    
    counts = pd.DataFrame([clicks,no_clicks]).T.fillna('0')
    counts['total_clicks'] = counts['clicks'].astype('int64') + counts['no_clicks'].astype('int64')
    return counts

In [36]:
def bin_counting(counts):
    counts['N+'] = counts['clicks']\
                    .astype('int64')\
                    .divide(counts['total_clicks'].astype('int64'))
    counts['N-'] = counts['no_clicks']\
                    .astype('int64')\
                    .divide(counts['total_clicks'].astype('int64'))
    # Despite its name, this column holds the odds ratio N+/N- (no logarithm is taken)
    counts['log_N+'] = counts['N+'].divide(counts['N-'])
    # If we wanted to only return bin-counting properties, 
    # we would filter here
    bin_counts = counts.filter(items= ['N+', 'N-', 'log_N+'])
    return counts, bin_counts
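
The `log_N+` column computed in `bin_counting` is the plain ratio N+/N-, so any bin with zero non-clicks produces `inf`. One possible variant takes an actual logarithm and adds a small smoothing constant to both counts. It is sketched here as an assumption and is not the book's code; the function name `smoothed_log_odds` and the parameter `alpha` are hypothetical.

```python
import math


def smoothed_log_odds(clicks, no_clicks, alpha=1.0):
    """log((clicks + alpha) / (no_clicks + alpha)); alpha > 0 avoids division by zero."""
    return math.log((clicks + alpha) / (no_clicks + alpha))


# (clicks, no_clicks) pairs taken from the device_clicks output in this notebook
examples = {
    "a99f214a": (1561, 7163),
    "25635c83": (3, 0),
    "c357dbff": (2, 15),
}
for device, (clicks, no_clicks) in examples.items():
    print(device, round(smoothed_log_odds(clicks, no_clicks), 3))
```

With smoothing, a device with 3 clicks and 0 non-clicks gets a finite positive score instead of `inf`, and devices that mostly do not click get negative scores.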
    

In [37]:
bin_column = 'device_id'

In [39]:
device_clicks = click_counting(bin_df.filter(items = [bin_column, 'click']), bin_column)

device_all, device_bin_counts = bin_counting(device_clicks.copy())

In [40]:
device_clicks.head()

| |clicks|no_clicks|total_clicks|
|---|---|---|---|
|a99f214a|1561|7163|8724|
|25635c83|3|0|3|
|c357dbff|2|15|17|
|9af87478|2|0|2|
|135f7d9a|2|0|2|


In [41]:
device_all.head()

| |clicks|no_clicks|total_clicks|N+|N-|log_N+|
|---|---|---|---|---|---|---|
|a99f214a|1561|7163|8724|0.178932|0.821068|0.217925|
|25635c83|3|0|3|1.0|0.0|inf|
|c357dbff|2|15|17|0.117647|0.882353|0.133333|
|9af87478|2|0|2|1.0|0.0|inf|
|135f7d9a|2|0|2|1.0|0.0|inf|


In [42]:
device_bin_counts.head()

| |N+|N-|log_N+|
|---|---|---|---|
|a99f214a|0.178932|0.821068|0.217925|
|25635c83|1.0|0.0|inf|
|c357dbff|0.117647|0.882353|0.133333|
|9af87478|1.0|0.0|inf|
|135f7d9a|1.0|0.0|inf|


In [43]:
len(device_bin_counts)

1075

In [44]:
device_all.sort_values(by = 'total_clicks', ascending = False).head(4)

| |clicks|no_clicks|total_clicks|N+|N-|log_N+|
|---|---|---|---|---|---|---|
|a99f214a|1561|7163|8724|0.178932|0.821068|0.217925|
|c357dbff|2|15|17|0.117647|0.882353|0.133333|
|3c0208dc|0|9|9|0.0|1.0|0.0|
|a167aa83|0|9|9|0.0|1.0|0.0|
