<h1><center>STAT/CSE 416 Section 7: Variable Encoding</center></h1>
<center><b>Section:</b> Review</center>
<center><b>Instructor:</b> Emilija Perković</center>
<center><b>TA:</b> ---</center>
<center><b>Date:</b> February 23, 2023</center>

<center><b>Authors</b>: Anne Wagner</center>

# Variable Encoding with Python

Many of the previous methods are usable with both categorical and continuous data, but when using any model with categorical data there are some extra data processing steps we should take into consideration. Consider this data set about five dinosaurs.

|  Name |      Species     |    Diet   |      Diet (Specific)     | Mesozoic Period |
|:-----:|:----------------:|:---------:|:------------------------:|:---------------:|
| Terry |   Tyrannosaurus  | Carnivore |     Whoever it wants     |    Cretaceous   |
| Danny |    Diplodocus    | Herbivore |       Tree foliage       |     Jurassic    |
| Stacy |    Stegosaurus   | Herbivore |          Grazing         |     Jurassic    |
| Timmy |    Triceratops   | Herbivore |          Grazing         |    Cretaceous   |
| Penny | Procompsognathus | Carnivore | Smaller, cuter dinosaurs |     Triassic    |

We have three categorical variables, *Diet*, *Diet (Specific)*, and *Mesozoic Period*, with 2, 4, and 3 categories respectively. How we treat them depends on how we want to use these variables, as well as what type of model we intend to fit. Let's first consider the *Diet* variable.

## Binary encoding

The simplest encoding, for a variable with only two categories, is to assign one category a value of $0$ and the other a value of $1$. We might re-express the table as follows.

|  Name |      Species     |    Diet   | Diet Enc |      Diet (Specific)     | Mesozoic Period |
|:-----:|:----------------:|:---------:|:--------:|:------------------------:|:---------------:|
| Terry |   Tyrannosaurus  | Carnivore |     0    |     Whoever it wants     |    Cretaceous   |
| Danny |    Diplodocus    | Herbivore |     1    |       Tree foliage       |     Jurassic    |
| Stacy |    Stegosaurus   | Herbivore |     1    |          Grazing         |     Jurassic    |
| Timmy |    Triceratops   | Herbivore |     1    |          Grazing         |    Cretaceous   |
| Penny | Procompsognathus | Carnivore |     0    | Smaller, cuter dinosaurs |     Triassic    |

A binary encoding allows us to use the *Diet* variable in a model such as Linear Regression, either as a shift in the intercept or, through interaction terms, as a modifier of other variables' effects. A Decision Tree can split on whether the value is $< 0.5$ or $> 0.5$, and in a Logistic Regression the diet can influence the score just like any other numeric feature.

This only works when there are exactly two possible categories, so hopefully no dinosaurs decide to be omnivorous. It also assumes the model *can* treat the variable like any other continuous variable. Naive Bayes does not mix categorical and continuous variables well: a Gaussian Naive Bayes model, for example, would assume a Gaussian distribution for the binary feature.
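As a minimal sketch of the binary encoding above, the mapping can be written as a plain dictionary lookup. The name `DIET_CODES` and the choice of which category receives 0 are assumptions for illustration; the membership check shows why a third category such as 'Omnivore' cannot be represented.

```python
# Hypothetical mapping: which category gets 0 and which gets 1 is arbitrary.
DIET_CODES = {"Carnivore": 0, "Herbivore": 1}

# Diets of Terry, Danny, Stacy, Timmy, and Penny, in table order.
diets = ["Carnivore", "Herbivore", "Herbivore", "Herbivore", "Carnivore"]
diet_enc = [DIET_CODES[d] for d in diets]
print(diet_enc)  # [0, 1, 1, 1, 0]

# A category outside the two known ones has no code at all.
omnivore_supported = "Omnivore" in DIET_CODES
print(omnivore_supported)  # False
```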

## N-class encoding

When a feature has more than two classes, such as the *Mesozoic Period* or *Diet (Specific)* features, a simple binary encoding won't suffice. We could consider assigning each value a distinct integer to tell them apart, as in this table.


|  Name |      Species     |    Diet   | Diet Enc |      Diet (Specific)     | DietSpec Enc | Mesozoic Period | Period Enc |
|:-----:|:----------------:|:---------:|:--------:|:------------------------:|:------------:|:---------------:|:----------:|
| Terry |   Tyrannosaurus  | Carnivore |     0    |     Whoever it wants     |       0      |    Cretaceous   |      2     |
| Danny |    Diplodocus    | Herbivore |     1    |       Tree foliage       |       1      |     Jurassic    |      1     |
| Stacy |    Stegosaurus   | Herbivore |     1    |          Grazing         |       2      |     Jurassic    |      1     |
| Timmy |    Triceratops   | Herbivore |     1    |          Grazing         |       2      |    Cretaceous   |      2     |
| Penny | Procompsognathus | Carnivore |     0    | Smaller, cuter dinosaurs |       3      |     Triassic    |      0     |

This potentially works for one variable, but causes many problems for the other. Which is okay, and why? Which has problems?

### K or K-1 Binary Encoding



It makes no sense to treat the *Diet (Specific)* variable as ordered. If we used the integer encoding above in a decision tree, the tree might try to identify herbivores by splitting on $\text{DietSpec Enc} > 0.5$, classifying the Tyrannosaurus correctly and the Procompsognathus incorrectly. The ordering of these categories is meaningless, so even when such a split improves prediction accuracy, the split itself is similarly meaningless.

We instead encode the variable as either $K$ or $K-1$ binary features, where $K$ is the number of observed classes. The data set would then look like this, with each *DSE* column abbreviating one specific diet.
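The problem with that split can be checked directly. The sketch below, in plain Python, applies the rule `DietSpec Enc > 0.5` to the integer codes from the table above; the dictionary names are assumptions for illustration.

```python
# Arbitrary integer codes for Diet (Specific), copied from the table.
diet_spec_enc = {"Terry": 0, "Danny": 1, "Stacy": 2, "Timmy": 2, "Penny": 3}

# True labels: is the dinosaur a herbivore?
is_herbivore = {"Terry": False, "Danny": True, "Stacy": True,
                "Timmy": True, "Penny": False}

# The candidate split: predict 'herbivore' whenever the code exceeds 0.5.
predicted = {name: code > 0.5 for name, code in diet_spec_enc.items()}

misclassified = [name for name in predicted
                 if predicted[name] != is_herbivore[name]]
print(misclassified)  # ['Penny']
```

Relabeling the categories with different integers would change which dinosaurs end up on each side of the threshold, which is why the split carries no real meaning.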

|  Name |      Species     |    Diet   | Diet Enc |      Diet (Specific)     | DSE WiW | DSE TF | DSE G | DSE SCD | Mesozoic Period | Period Enc |
|:-----:|:----------------:|:---------:|:--------:|:------------------------:|:-------:|:------:|:-----:|:-------:|:---------------:|:----------:|
| Terry |   Tyrannosaurus  | Carnivore |     0    |     Whoever it wants     |    1    |    0   |   0   |    0    |    Cretaceous   |      2     |
| Danny |    Diplodocus    | Herbivore |     1    |       Tree foliage       |    0    |    1   |   0   |    0    |     Jurassic    |      1     |
| Stacy |    Stegosaurus   | Herbivore |     1    |          Grazing         |    0    |    0   |   1   |    0    |     Jurassic    |      1     |
| Timmy |    Triceratops   | Herbivore |     1    |          Grazing         |    0    |    0   |   1   |    0    |    Cretaceous   |      2     |
| Penny | Procompsognathus | Carnivore |     0    | Smaller, cuter dinosaurs |    0    |    0   |   0   |    1    |     Triassic    |      0     |

We can choose either to leave one of these categories off or to include all $K$ categories.

If we include all $K$ categories,
* Unobserved classes can be handled: there is no baseline class, so an unseen class is represented by every indicator being equal to 0.
* Some methods may have collinearity issues that cause convergence problems (e.g. Linear Regression), because the $K$ indicators always sum to 1 and so duplicate the intercept.

If we drop one class, leaving $K-1$ indicators,
* Unobserved classes will be treated as an instance of the dropped class, which is rolled into the model's 'intercept' term.
* We avoid collinearity problems when not using regularization.
* Our *Diet Enc* variable is a $K-1$ encoding, so it could not handle an 'Omnivore' in the data set.

As with the earlier binary encoding, this lets us treat the new variables just as we would any other numeric features.
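The collinearity point can be illustrated with a short sketch, assuming `pandas` and `numpy` are available. It builds the $K$ and $K-1$ indicator columns with `pd.get_dummies`, adds an intercept column of ones, and compares the number of columns with the matrix rank. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

diet_spec = pd.Series(
    ["Whoever it wants", "Tree foliage", "Grazing", "Grazing",
     "Smaller, cuter dinosaurs"],
    name="DietSpec",
)

# K indicator columns (one per category) and K-1 (first category dropped).
k_cols = pd.get_dummies(diet_spec, dtype=float)
k_minus_1_cols = pd.get_dummies(diet_spec, drop_first=True, dtype=float)

# Prepend an intercept column of ones, as a linear regression would.
ones = np.ones((len(diet_spec), 1))
X_full = np.hstack([ones, k_cols.to_numpy()])
X_drop = np.hstack([ones, k_minus_1_cols.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)

# With all K indicators the columns sum to the intercept, so the rank
# is smaller than the number of columns; with K-1 the two match.
print(X_full.shape[1], rank_full)  # 5 4
print(X_drop.shape[1], rank_drop)  # 4 4
```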

### Ordinal Encoding

Unlike the encoding of the specific diets, the order of *Period Enc* does have meaning. The numeric value of each code corresponds to the chronological order in which the periods took place. If a decision tree were to split on $\text{Period Enc} > 0.5$, it would be separating out the dinosaur species that lived after the Triassic period. Because the categories are ordered, we can take advantage of a similarly ordered encoding.

An ordinal encoding
* Maintains the ordered relationship of categories
* Assumes a constant distance between categories (i.e., 4 is as far from 2 as 3 is from 1)
* Assumes a linear relationship between categories (similar to above)

This can be useful any time a clear ordering is present, such as age ranges, but unequal ranges can run counter to the linear relationship assumption. The ranges [(18-25),(26-40),(40-50),(50-65),(65+)] are clearly ordered, but of unequal widths. This would be fine in a decision tree, which only uses the ordering, but would be less useful in something like logistic regression, which treats the codes as evenly spaced.
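A small sketch of an ordinal encoding, assuming `pandas` is available: an ordered `pd.Categorical` assigns integer codes that follow the order we supply rather than alphabetical order. The variable names are illustrative.

```python
import pandas as pd

# Periods of Terry, Danny, Stacy, Timmy, and Penny, in table order.
periods = ["Cretaceous", "Jurassic", "Jurassic", "Cretaceous", "Triassic"]

# Chronological order, earliest first; a period's code is its position.
chronological = ["Triassic", "Jurassic", "Cretaceous"]
period_cat = pd.Categorical(periods, categories=chronological, ordered=True)

period_codes = period_cat.codes.tolist()
print(period_codes)  # [2, 1, 1, 2, 0]

# A threshold on the codes now has a chronological meaning:
# codes above 0.5 are exactly the periods after the Triassic.
after_triassic = [code > 0.5 for code in period_codes]
print(after_triassic)  # [True, True, True, True, False]
```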

# Examples

In [1]:
import pandas as pd

Terry={"Name":"Terry", 
       "Species":"Tyrannosaurus", 
       "Diet":"Carnivore", 
       "DietSpec":"Whoever it wants", 
       "Period":"Cretaceous"}
Danny={"Name":"Danny", 
       "Species":"Diplodocus", 
       "Diet":"Herbivore", 
       "DietSpec":"Tree foliage", 
       "Period":"Jurassic"}
Stacy={"Name":"Stacy", 
       "Species":"Stegosaurus", 
       "Diet":"Herbivore", 
       "DietSpec":"Grazing", 
       "Period":"Jurassic"}
Timmy={"Name":"Timmy", 
       "Species":"Triceratops", 
       "Diet":"Herbivore", 
       "DietSpec":"Grazing", 
       "Period":"Cretaceous"}
Penny={"Name":"Penny", 
       "Species":"Procompsognathus", 
       "Diet":"Carnivore", 
       "DietSpec":"Smaller, cuter dinosaurs", 
       "Period":"Triassic"}

data=pd.DataFrame([Terry,Danny,Stacy,Timmy,Penny])
print(data)

    Name           Species       Diet                  DietSpec      Period
0  Terry     Tyrannosaurus  Carnivore          Whoever it wants  Cretaceous
1  Danny        Diplodocus  Herbivore              Tree foliage    Jurassic
2  Stacy       Stegosaurus  Herbivore                   Grazing    Jurassic
3  Timmy       Triceratops  Herbivore                   Grazing  Cretaceous
4  Penny  Procompsognathus  Carnivore  Smaller, cuter dinosaurs    Triassic


In [2]:
from sklearn import preprocessing

#We can use the LabelEncoder for a simple binary encoding
enc=preprocessing.LabelEncoder()

#We have to 'fit' it first to determine all the classes
enc.fit(data['Diet'])

#Then 'transform' maps each class to its integer code
data['DietEnc']=enc.transform(data['Diet'])

print(data[['Name','Diet','DietEnc']])

    Name       Diet  DietEnc
0  Terry  Carnivore        0
1  Danny  Herbivore        1
2  Stacy  Herbivore        1
3  Timmy  Herbivore        1
4  Penny  Carnivore        0


In [3]:
#For an ordinal encoding, we need to tell it the order of the categories
ord_enc=preprocessing.OrdinalEncoder(categories=[['Triassic','Jurassic','Cretaceous']])

#The encoder expects a 2D input, hence the double brackets
ord_enc.fit(data[['Period']])
data['PeriodEnc']=ord_enc.transform(data[['Period']])

print(data[['Name','Period','PeriodEnc']])

    Name      Period  PeriodEnc
0  Terry  Cretaceous        2.0
1  Danny    Jurassic        1.0
2  Stacy    Jurassic        1.0
3  Timmy  Cretaceous        2.0
4  Penny    Triassic        0.0


In [4]:
#The OneHotEncoder can encode either K or K-1 binary features
ohe = preprocessing.OneHotEncoder(drop='first')

#We will first encode with K-1 features by 'dropping' 
#  the first feature and folding it into the baseline
ohe.fit(data[['DietSpec']])
tmp=ohe.transform(data[['DietSpec']]).toarray()
data['DSEf0']=tmp[:,0]
data['DSEf1']=tmp[:,1]
data['DSEf2']=tmp[:,2]

print(data[['Name','DietSpec','DSEf0','DSEf1','DSEf2']])
#Note that 'Grazing' (the first category alphabetically) has been taken as the 'base case'

    Name                  DietSpec  DSEf0  DSEf1  DSEf2
0  Terry          Whoever it wants    0.0    0.0    1.0
1  Danny              Tree foliage    0.0    1.0    0.0
2  Stacy                   Grazing    0.0    0.0    0.0
3  Timmy                   Grazing    0.0    0.0    0.0
4  Penny  Smaller, cuter dinosaurs    1.0    0.0    0.0


In [5]:
ohe2 = preprocessing.OneHotEncoder()

#We will encode all K features
ohe2.fit(data[['DietSpec']])
tmp=ohe2.transform(data[['DietSpec']]).toarray()
data['DSEf0']=tmp[:,0]
data['DSEf1']=tmp[:,1]
data['DSEf2']=tmp[:,2]
data['DSEf3']=tmp[:,3]

print(data[['Name','DietSpec','DSEf0','DSEf1','DSEf2','DSEf3']])

    Name                  DietSpec  DSEf0  DSEf1  DSEf2  DSEf3
0  Terry          Whoever it wants    0.0    0.0    0.0    1.0
1  Danny              Tree foliage    0.0    0.0    1.0    0.0
2  Stacy                   Grazing    1.0    0.0    0.0    0.0
3  Timmy                   Grazing    1.0    0.0    0.0    0.0
4  Penny  Smaller, cuter dinosaurs    0.0    1.0    0.0    0.0
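The handling of unobserved classes with all $K$ features can be sketched with the encoder's `handle_unknown='ignore'` option, which encodes a category not seen during `fit` as a row of zeros. The 'Fish' diet below is a made-up category used only for illustration, and the data is rebuilt so the block runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"DietSpec": ["Whoever it wants", "Tree foliage",
                                   "Grazing", "Grazing",
                                   "Smaller, cuter dinosaurs"]})

# 'ignore' tells the encoder not to raise an error on unseen categories.
ohe_unknown = OneHotEncoder(handle_unknown="ignore")
ohe_unknown.fit(train[["DietSpec"]])

# 'Fish' is a hypothetical diet that never appeared in the training data.
new_dino = pd.DataFrame({"DietSpec": ["Fish"]})
encoded = ohe_unknown.transform(new_dino[["DietSpec"]]).toarray()
print(encoded)  # [[0. 0. 0. 0.]]
```

With the default setting, `handle_unknown='error'`, the same `transform` call raises an error instead.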


In [6]:
#We can even use the OneHotEncoder to encode multiple variables at once.
ohe3 = preprocessing.OneHotEncoder()

#We will encode all K features
ohe3.fit(data[['Diet','DietSpec','Period']])
tmp=ohe3.transform(data[['Diet','DietSpec','Period']]).toarray()

print(ohe3.categories_)
print(tmp)

[array(['Carnivore', 'Herbivore'], dtype=object), array(['Grazing', 'Smaller, cuter dinosaurs', 'Tree foliage',
       'Whoever it wants'], dtype=object), array(['Cretaceous', 'Jurassic', 'Triassic'], dtype=object)]
[[1. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 1. 0.]
 [0. 1. 1. 0. 0. 0. 0. 1. 0.]
 [0. 1. 1. 0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0. 0. 1.]]


In [7]:
#We can also drop the first category of each variable when encoding several at once.
ohe4 = preprocessing.OneHotEncoder(drop='first')

#We will encode K-1 features
ohe4.fit(data[['Diet','DietSpec','Period']])
tmp=ohe4.transform(data[['Diet','DietSpec','Period']]).toarray()

print(ohe4.categories_)
print(tmp)

[array(['Carnivore', 'Herbivore'], dtype=object), array(['Grazing', 'Smaller, cuter dinosaurs', 'Tree foliage',
       'Whoever it wants'], dtype=object), array(['Cretaceous', 'Jurassic', 'Triassic'], dtype=object)]
[[0. 0. 0. 1. 0. 0.]
 [1. 0. 1. 0. 1. 0.]
 [1. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 1.]]
