# CATEGORICAL DATA

To be effective, many machine learning algorithms need the data passed to them to be discerning, discriminating, and independent.

## OVERVIEW

- [Categorical Features](#CATEGORICAL-FEATURES)
- [Encoding Categorical Data](#ENCODING-CATEGORICAL-DATA)  
    - [Using PANDAS](#Using-PANDAS)  
    - [Using SKLEARN](#Using-SKLEARN)
- [Dummy Variable Trap in Regression Models](#DUMMY-VARIABLE-TRAP-in-REGRESSION-MODELS)

## CATEGORICAL FEATURES

Categorical (textual) data has to be encoded as numeric data, taking into account whether it is **ordinal** or **nominal**. **Ordinal** features are categorical values that are ordered or can be sorted. In contrast, **nominal** features don't imply any order.

For nominal categorical data, one solution is **one-hot encoding**, which creates one binary column per category and so avoids the artificial order that integer codes would suggest.
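
For an ordinal feature, the integer codes should follow the order of the categories. The following is a minimal sketch, assuming a hypothetical `Size` column with the order S < M < L; neither the column nor the order is part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical ordinal feature; the order S < M < L is an assumption for illustration.
sizes = pd.DataFrame({'Size': ['M', 'S', 'L', 'M']})

# Declare the order explicitly so that the integer codes respect it.
size_order = ['S', 'M', 'L']
sizes['Size_encoded'] = pd.Categorical(
    sizes['Size'], categories=size_order, ordered=True
).codes
```

Without the explicit `categories` list, pandas would sort the labels alphabetically (L, M, S), which would not match the intended order.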

## ENCODING CATEGORICAL DATA

In [1]:
import pandas as pd

dataset = pd.read_csv('../data/00_DataPreparation/missing_data_example.csv')

In [2]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [3]:
dataset.dtypes

Country       object
Age          float64
Salary       float64
Purchased     object
dtype: object

### Using PANDAS

**Encode**

In [4]:
df = dataset.copy()
df['Country_encoded'] = df['Country'].astype('category').cat.codes
df

Unnamed: 0,Country,Age,Salary,Purchased,Country_encoded
0,France,44.0,72000.0,No,0
1,Spain,27.0,48000.0,Yes,2
2,Germany,30.0,54000.0,No,1
3,Spain,38.0,61000.0,No,2
4,Germany,40.0,,Yes,1
5,France,35.0,58000.0,Yes,0
6,Spain,,52000.0,No,2
7,France,48.0,79000.0,Yes,0
8,Germany,50.0,83000.0,No,1
9,France,37.0,67000.0,Yes,0


**Mapping**

In [5]:
df = dataset.copy()
mapper = {'France':0,
          'Germany':1,
          'Spain':2}
print(mapper)
df['Country_encoded'] = df['Country'].map(mapper)
df

{'France': 0, 'Germany': 1, 'Spain': 2}


Unnamed: 0,Country,Age,Salary,Purchased,Country_encoded
0,France,44.0,72000.0,No,0
1,Spain,27.0,48000.0,Yes,2
2,Germany,30.0,54000.0,No,1
3,Spain,38.0,61000.0,No,2
4,Germany,40.0,,Yes,1
5,France,35.0,58000.0,Yes,0
6,Spain,,52000.0,No,2
7,France,48.0,79000.0,Yes,0
8,Germany,50.0,83000.0,No,1
9,France,37.0,67000.0,Yes,0


In [6]:
df = dataset.copy()
mapper = {label:idx for idx, label in enumerate(df['Country'].unique())}
print(mapper)
df['Country_encoded'] = df['Country'].map(mapper)
df

{'France': 0, 'Spain': 1, 'Germany': 2}


Unnamed: 0,Country,Age,Salary,Purchased,Country_encoded
0,France,44.0,72000.0,No,0
1,Spain,27.0,48000.0,Yes,1
2,Germany,30.0,54000.0,No,2
3,Spain,38.0,61000.0,No,1
4,Germany,40.0,,Yes,2
5,France,35.0,58000.0,Yes,0
6,Spain,,52000.0,No,1
7,France,48.0,79000.0,Yes,0
8,Germany,50.0,83000.0,No,2
9,France,37.0,67000.0,Yes,0


**One-hot-encoding**

In [7]:
df = dataset.copy()
df = pd.get_dummies(df, columns=['Country'])
df

Unnamed: 0,Age,Salary,Purchased,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,No,1,0,0
1,27.0,48000.0,Yes,0,0,1
2,30.0,54000.0,No,0,1,0
3,38.0,61000.0,No,0,0,1
4,40.0,,Yes,0,1,0
5,35.0,58000.0,Yes,1,0,0
6,,52000.0,No,0,0,1
7,48.0,79000.0,Yes,1,0,0
8,50.0,83000.0,No,0,1,0
9,37.0,67000.0,Yes,1,0,0


### Using SKLEARN

**Encode**

In [8]:
from sklearn.preprocessing import LabelEncoder
labelEncoder = LabelEncoder()
df = dataset.copy()
df['Country_encoded'] = labelEncoder.fit_transform(df['Country'])
df

Unnamed: 0,Country,Age,Salary,Purchased,Country_encoded
0,France,44.0,72000.0,No,0
1,Spain,27.0,48000.0,Yes,2
2,Germany,30.0,54000.0,No,1
3,Spain,38.0,61000.0,No,2
4,Germany,40.0,,Yes,1
5,France,35.0,58000.0,Yes,0
6,Spain,,52000.0,No,2
7,France,48.0,79000.0,Yes,0
8,Germany,50.0,83000.0,No,1
9,France,37.0,67000.0,Yes,0


**One-hot-encode**

In [9]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder = LabelEncoder()
df = dataset.copy()
df['Country_encoded'] = labelEncoder.fit_transform(df['Country'])
oneHotEncoder = OneHotEncoder()
data_feature_one_hot_encoded = oneHotEncoder.fit_transform(df[['Country_encoded']]).toarray()
data_feature_one_hot_encoded

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [10]:
# labelEncoder.classes_ lists the original labels in the order of their integer codes
countries = labelEncoder.classes_
for i, country in enumerate(countries):
    df[country] = data_feature_one_hot_encoded[:, i]
df

Unnamed: 0,Country,Age,Salary,Purchased,Country_encoded,France,Germany,Spain
0,France,44.0,72000.0,No,0,1.0,0.0,0.0
1,Spain,27.0,48000.0,Yes,2,0.0,0.0,1.0
2,Germany,30.0,54000.0,No,1,0.0,1.0,0.0
3,Spain,38.0,61000.0,No,2,0.0,0.0,1.0
4,Germany,40.0,,Yes,1,0.0,1.0,0.0
5,France,35.0,58000.0,Yes,0,1.0,0.0,0.0
6,Spain,,52000.0,No,2,0.0,0.0,1.0
7,France,48.0,79000.0,Yes,0,1.0,0.0,0.0
8,Germany,50.0,83000.0,No,1,0.0,1.0,0.0
9,France,37.0,67000.0,Yes,0,1.0,0.0,0.0
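
The intermediate label-encoding step may not be needed: assuming a reasonably recent scikit-learn version, `OneHotEncoder` accepts string categories directly. The sketch below uses a small stand-in DataFrame rather than the dataset above, and the names `countries`, `encoder`, `labels`, and `one_hot` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Small stand-in for the Country column used above.
countries = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain']})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(countries[['Country']]).toarray()

# categories_ holds the sorted category labels, one array per input column.
labels = list(encoder.categories_[0])
one_hot = pd.DataFrame(encoded, columns=labels, index=countries.index)
```

Reading the column names from `categories_` keeps the labels and the encoded columns in the same order without rebuilding the mapping by hand.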


## DUMMY VARIABLE TRAP in REGRESSION MODELS

Using categorical data in multiple regression models is a powerful way to include non-numeric data types in a regression model. Categorical data refers to data values that represent categories: a fixed number of unordered values, for instance gender (male/female) or season (summer/winter/spring/fall). In a regression model, these values can be represented by dummy variables: variables containing values such as 1 or 0 that represent the presence or absence of the categorical value.  

When including dummy variables in a regression model, however, one should be careful of the **Dummy Variable Trap**. The Dummy Variable Trap is a scenario in which the independent variables are **multicollinear**: two or more variables are highly correlated, so that, in simple terms, one variable can be predicted from the others.  

To demonstrate the Dummy Variable Trap, take the case of gender (male/female) as an example. Including a dummy variable for each is redundant (if male is 0, female is 1, and vice versa); however, doing so will result in the following linear model:  

y ~ b + {0|1} male + {0|1} female

Intuitively, there is a duplicate category: if we dropped the male category, it would still be inherently defined by the female category (a female value of zero indicates male, and vice versa).  

The solution to the dummy variable trap is to drop one of the categorical variables (or, alternatively, drop the intercept constant): if there are m categories, use m-1 in the model. The value left out can be thought of as the reference value, and the fitted values of the remaining categories represent the change from this reference.
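
One way to see the redundancy numerically is to compare the rank of the design matrix with and without the extra dummy column. This is a minimal sketch rather than a full regression fit; the names `full` and `reduced` are illustrative, and it relies on the `drop_first` parameter of `pd.get_dummies`.

```python
import numpy as np
import pandas as pd

gender = pd.Series(['male', 'female', 'female', 'male', 'male'])

# Design matrix with an intercept column and one dummy per category.
full = pd.get_dummies(gender).astype(float)
full.insert(0, 'intercept', 1.0)

# Same design matrix, but with the first dummy ('female') dropped.
reduced = pd.get_dummies(gender, drop_first=True).astype(float)
reduced.insert(0, 'intercept', 1.0)

# The male and female columns sum to the intercept column,
# so the full matrix has fewer independent columns than it has columns.
rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())
print(full.shape[1], rank_full)        # number of columns vs. rank
print(reduced.shape[1], rank_reduced)  # number of columns vs. rank
```

When the rank is lower than the number of columns, the coefficients of a linear model are not uniquely determined, which is the practical consequence of the trap.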

In [11]:
import pandas as pd

example = ['male', 'female', 'female', 'male', 'male']
example

['male', 'female', 'female', 'male', 'male']

To drop the first dummy variable, we can set the `drop_first` parameter of the `get_dummies` function to `True`:

In [12]:
pd.get_dummies(example, drop_first=True)

Unnamed: 0,male
0,1
1,0
2,0
3,1
4,1


If we have more than two categories, the dropped variable can be thought of as the absence of all other options, represented by zeros in every column.
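
As a small illustration with three categories, the sketch below applies `drop_first=True` to a short, made-up list of countries (not the full dataset used earlier). The dropped category shows up as a row of zeros.

```python
import pandas as pd

countries = pd.Series(['France', 'Spain', 'Germany', 'Spain'], name='Country')

# 'France' sorts first alphabetically, so it becomes the dropped (reference) category.
dummies = pd.get_dummies(countries, prefix='Country', drop_first=True).astype(int)
print(dummies)
```

Here the first row (France) has zeros in both the `Country_Germany` and `Country_Spain` columns.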