# Handling Categorical Data

In this section, we use simple examples to see how to handle categorical data with pandas and scikit-learn. Categorical data comes in two kinds: ordinal and nominal features. Ordinal features are categorical values that can be sorted or ordered; for example, T-shirt size is ordinal because we can define an order such as XL > L > M. Nominal features, in contrast, do not imply any order; T-shirt color is nominal because it makes no sense to say that red is larger than blue.

In [24]:
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [41]:
df = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': ['M', 'L', 'XL'],
    'price': [10.1, 13.5, 15.3],
    'class_label': ['class_2', 'class_1', 'class_2'],
})

df.head()

| | color | size | price | class_label |
|---|---|---|---|---|
| 0 | green | M | 10.1 | class_2 |
| 1 | red | L | 13.5 | class_1 |
| 2 | blue | XL | 15.3 | class_2 |


The data above contains the following:
 - Nominal feature: color
 - Ordinal feature: size
 - Numeric feature: price

## Mapping Ordinal Features

In [5]:
# Convert Categorical Values into Integers
size_mapping = {
                'XL' : 3,
                'L' : 2,
                'M' : 1
}

df['size'].map(size_mapping)


0    1
1    2
2    3
Name: size, dtype: int64
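
To get from the integer codes back to the original size labels later, one option is to invert the dictionary. The following is a minimal sketch: it rebuilds a one-column DataFrame so that it runs on its own, and `df_sizes` and `inv_size_mapping` are names introduced here for illustration only.

```python
import pandas as pd

# Standalone copy of the example data (only the column needed here)
df_sizes = pd.DataFrame({'size': ['M', 'L', 'XL']})

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# Forward mapping: string labels -> integers that preserve the order M < L < XL
encoded = df_sizes['size'].map(size_mapping)

# Reverse mapping: swap keys and values to recover the original labels
inv_size_mapping = {v: k for k, v in size_mapping.items()}
decoded = encoded.map(inv_size_mapping)

print(encoded.tolist())
print(decoded.tolist())
```

Note that `map` returns a new Series rather than changing the DataFrame in place, and any label missing from the dictionary becomes NaN.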

In [14]:
categories = CategoricalDtype(['M', 'L', 'XL'], ordered = True)

df['size'] = df['size'].astype(categories)

df['size']

0     M
1     L
2    XL
Name: size, dtype: category
Categories (3, object): ['M' < 'L' < 'XL']
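
An ordered categorical keeps the readable labels while still carrying the order information. The sketch below, which uses a standalone Series and illustrative variable names, shows two ways that order can be used: reading the integer codes and comparing against a label.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

sizes = pd.Series(['M', 'L', 'XL'])
size_dtype = CategoricalDtype(['M', 'L', 'XL'], ordered=True)
sizes_cat = sizes.astype(size_dtype)

# Integer codes follow the declared category order (M=0, L=1, XL=2)
codes = sizes_cat.cat.codes

# Ordered categoricals support comparisons against one of their categories
larger_than_m = sizes_cat > 'M'

print(codes.tolist())
print(larger_than_m.tolist())
```

Unlike the dictionary mapping, the codes here start at 0, so the two approaches yield different integers for the same order.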

## Encoding Class Labels

We need to remember that class labels are not ordinal, and it doesn’t matter which integer number we
assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0:

In [17]:
class_mapping = {
                label: idx for idx, label in enumerate(np.unique(df['class_label']))
}

df['class_label'] = df['class_label'].map(class_mapping)

df.head()

| | color | size | price | class_label |
|---|---|---|---|---|
| 0 | green | M | 10.1 | 1 |
| 1 | red | L | 13.5 | 0 |
| 2 | blue | XL | 15.3 | 1 |


In [20]:
# We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class
# labels back to the original string representation:

inv_class_mapping = {v:k for k, v in class_mapping.items()}

df['class_label'] = df['class_label'].map(inv_class_mapping)

df

| | color | size | price | class_label |
|---|---|---|---|---|
| 0 | green | M | 10.1 | class_2 |
| 1 | red | L | 13.5 | class_1 |
| 2 | blue | XL | 15.3 | class_2 |


Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this:

```
class_le = LabelEncoder()
y = class_le.fit_transform(df['class_label'].values)
```
Note that the fit_transform method is just a shortcut for calling fit and transform separately, and
we can use the inverse_transform method to transform the integer class labels back into their original string representation:

```
class_le.inverse_transform(y)
```
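
As a self-contained sketch of the round trip, the example below applies LabelEncoder to a plain NumPy array holding the same three labels (the array `labels` is a stand-in for the class_label column, introduced here for illustration). The fitted `classes_` attribute shows which string each integer stands for.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-in for the class_label column of the example data
labels = np.array(['class_2', 'class_1', 'class_2'])

class_le = LabelEncoder()
y = class_le.fit_transform(labels)

# classes_ holds the sorted unique labels; each label's position is its code
print(class_le.classes_)
print(y)

# inverse_transform recovers the original string labels
restored = class_le.inverse_transform(y)
print(restored)
```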


## Performing One-Hot Encoding on Nominal Features

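Since scikit-learn’s estimators for classification treat class labels as categorical data that does not imply any order (nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows: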

In [33]:
# Work on an explicit copy so the assignment below does not write to a slice of df
X = df[['color', 'size', 'price']].copy()

color_le = LabelEncoder()

X['color'] = color_le.fit_transform(X['color'])

X


| | color | size | price |
|---|---|---|---|
| 0 | 1 | M | 10.1 |
| 1 | 2 | L | 13.5 |
| 2 | 0 | XL | 15.3 |


After executing the preceding code, the color column of the DataFrame X now holds the new color
values, which are encoded as follows:
- blue = 0
- green = 1
- red = 2

> If we stop at this point and feed the array to our classifier, we will make one of the most common
mistakes in dealing with categorical data. Although the color values don’t
come in any particular order, common classification models, such as the ones covered in the previous
chapters, will now assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect, a classifier could still produce useful results. However, those results would not be optimal.
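
One way to see the problem is to compare distances between colors under the two encodings. The sketch below is an illustration only, with hypothetical variable names; it uses the integer codes listed above and hand-written one-hot vectors.

```python
import numpy as np

# Integer codes as produced by the label encoding above
int_codes = {'blue': 0, 'green': 1, 'red': 2}

# One-hot vectors for the same three colors
one_hot = {
    'blue': np.array([1., 0., 0.]),
    'green': np.array([0., 1., 0.]),
    'red': np.array([0., 0., 1.]),
}

# With integer codes, red appears twice as far from blue as green does
d_int_blue_green = abs(int_codes['blue'] - int_codes['green'])
d_int_blue_red = abs(int_codes['blue'] - int_codes['red'])

# With one-hot vectors, every pair of distinct colors is equally far apart
d_oh_blue_green = np.linalg.norm(one_hot['blue'] - one_hot['green'])
d_oh_blue_red = np.linalg.norm(one_hot['blue'] - one_hot['red'])

print(d_int_blue_green, d_int_blue_red)
print(d_oh_blue_green, d_oh_blue_red)
```

The integer encoding introduces a spacing between colors that the data never contained, whereas the one-hot representation treats all colors symmetrically.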

In [40]:
# Using OneHotEncoder to encode nominal values
X = df[['color', 'size', 'price']].values

color_ohe = OneHotEncoder()

color_ohe.fit_transform(X[:,0].reshape(-1, 1)).toarray()

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
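
To check which output column corresponds to which color, the fitted encoder can be inspected. The following standalone sketch (the array `colors` stands in for the color column) reads the `categories_` attribute and maps the one-hot rows back to labels with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the color column, shaped as a single-feature 2D array
colors = np.array(['green', 'red', 'blue']).reshape(-1, 1)

color_ohe = OneHotEncoder()
encoded = color_ohe.fit_transform(colors).toarray()

# categories_ lists the learned categories per input feature;
# their order matches the order of the one-hot output columns
print(color_ohe.categories_[0])

# inverse_transform maps one-hot rows back to the original labels
restored = color_ohe.inverse_transform(encoded)
print(restored.ravel())
```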

Note that we applied the OneHotEncoder to only a single column, (X[:, 0].reshape(-1, 1)), to avoid
modifying the other two columns in the array as well. If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer, which accepts a list of (name, transformer, column(s)) tuples as follows:

```
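from sklearn.compose import ColumnTransformer
# Map size to integers first; the string labels cannot be cast to float
X = df[['color', 'size', 'price']].copy()
X['size'] = X['size'].map(size_mapping)
c_transf = ColumnTransformer([('onehot', OneHotEncoder(), [0]),
                              ('nothing', 'passthrough', [1, 2])])
c_transf.fit_transform(X.values).astype(float)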
```
> In the preceding code example, we specified that we want to modify only the first column and leave
the other two columns untouched via the 'passthrough' argument.
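
A variation on the same idea is to pass the DataFrame directly and select columns by name. The sketch below is one possible arrangement, not the only one: it also encodes the ordinal size column inside the transformer using scikit-learn's OrdinalEncoder with an explicit category order, and it sets `sparse_threshold=0` so that a dense array is returned. The DataFrame `features` and the step name 'ordinal' are introduced here for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Standalone copy of the three feature columns
features = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': ['M', 'L', 'XL'],
    'price': [10.1, 13.5, 15.3],
})

c_transf = ColumnTransformer(
    [
        # Nominal feature: one column per color
        ('onehot', OneHotEncoder(), ['color']),
        # Ordinal feature: integers that follow the given order M < L < XL
        ('ordinal', OrdinalEncoder(categories=[['M', 'L', 'XL']]), ['size']),
        # Numeric feature: left unchanged
        ('nothing', 'passthrough', ['price']),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_enc = c_transf.fit_transform(features)
print(X_enc)
```

The output columns appear in the order in which the transformers are listed: three color columns, then size, then price.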

In [43]:
# Using get_dummies
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)

| | price | color_green | color_red | size_M | size_XL |
|---|---|---|---|---|---|
| 0 | 10.1 | 1 | 0 | 1 | 0 |
| 1 | 13.5 | 0 | 1 | 0 | 0 |
| 2 | 15.3 | 0 | 0 | 0 | 1 |
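
The effect of `drop_first=True` in the call above is easiest to see side by side. This minimal sketch uses a standalone one-column DataFrame; `dtype=int` is passed so that the dummy columns print as 0/1 regardless of the pandas version.

```python
import pandas as pd

colors = pd.DataFrame({'color': ['green', 'red', 'blue']})

# One dummy column per category
full = pd.get_dummies(colors, dtype=int)

# The first category (alphabetically, blue) is dropped
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

print(full)
print(reduced)
```

No information is lost by dropping the column: a row with zeros in both color_green and color_red must be blue.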
