<a href="https://colab.research.google.com/github/mojtaba732/ML_Practice/blob/main/categoricalData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Categorical data encoding with pandas**

In [34]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [5]:
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']])
df.columns = ['color', 'size', 'price', 'classlabel']
df

,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


**Mapping ordinal features**

In [7]:
size_mapping = {'XL': 3,'L': 2,'M': 1}
df['size'] = df['size'].map(size_mapping)
df

,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2


In [8]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object
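
One caveat with `map` that the cells above do not show: a label missing from the dictionary becomes NaN instead of raising an error. The sketch below illustrates this with a hypothetical extra size `'S'`; the names `sizes`, `encoded` and `unmapped` are assumptions made for this example only.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# 'S' is a hypothetical label that is absent from size_mapping
sizes = pd.Series(['M', 'L', 'XL', 'S'])
encoded = sizes.map(size_mapping)

# Labels without a dictionary entry are mapped to NaN, so it is worth
# checking for them before the column is used for modelling
unmapped = sizes[encoded.isna()].unique()
print(encoded)
print('Unmapped labels:', list(unmapped))
```

Checking for NaN right after the mapping step is a cheap way to catch typos or new categories in the raw data.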

**Encoding class labels**

In [11]:
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

In [12]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

,color,size,price,classlabel
0,green,1,10.1,1
1,red,2,13.5,0
2,blue,3,15.3,1


In [13]:
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2


In [16]:
# Alternatively, scikit-learn's LabelEncoder class does the same thing directly
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

array([1, 0, 1])

In [17]:
class_le.inverse_transform(y)

array(['class2', 'class1', 'class2'], dtype=object)
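
To see which integer the encoder assigned to which label, the fitted `classes_` attribute can be inspected. This is a minimal sketch that rebuilds the label array by hand; `labels` and `code_to_label` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class2', 'class1', 'class2'])
class_le = LabelEncoder()
y = class_le.fit_transform(labels)

# classes_ holds the sorted unique labels; the position of a label
# in this array is the integer code that fit_transform assigned to it
print(class_le.classes_)
code_to_label = dict(enumerate(class_le.classes_))
print(code_to_label)
```

The resulting dictionary plays the same role as `inv_class_mapping` in the manual approach.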

**Performing one-hot encoding on nominal features**

In [19]:
X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [25]:
X0 = color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()
X0

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [33]:
# Place the one-hot color columns next to the remaining size and price columns
NX = np.concatenate((X0, X[:, 1:]), axis=1)
NX

array([[0.0, 1.0, 0.0, 1, 10.1],
       [0.0, 0.0, 1.0, 2, 13.5],
       [1.0, 0.0, 0.0, 3, 15.3]], dtype=object)

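Alternatively, we can use the ColumnTransformer to one-hot encode only the color column and pass the size and price columns through unchanged: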

In [35]:
X = df[['color', 'size', 'price']].values

In [36]:
c_transf = ColumnTransformer([
    ('onehot', OneHotEncoder(), [0]),
    ('nothing', 'passthrough', [1, 2])
])
c_transf.fit_transform(X).astype(float)

array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])
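
Since this notebook is about pandas, it may be worth noting that `pd.get_dummies` offers a pandas-only route to a similar result. The sketch below rebuilds the small DataFrame with the already mapped `size` column; passing `dtype=float` is an assumption made here so that the dummy columns come out numeric regardless of the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': [1, 2, 3],
                   'price': [10.1, 13.5, 15.3]})

# Only the listed column is expanded into dummy columns;
# size and price are left unchanged and appear first in the result
dummies = pd.get_dummies(df, columns=['color'], dtype=float)
print(dummies)
```

Unlike the scikit-learn encoder, `get_dummies` does not remember the categories it saw, so the ColumnTransformer approach is usually the safer choice when the same encoding has to be applied to new data later.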

When we one-hot encode a dataset, we have to keep in mind that this introduces multicollinearity,
which can be an issue for certain methods (for instance, methods that require matrix
inversion). If features are highly correlated, matrices are computationally difficult to invert, which
can lead to numerically unstable estimates. To reduce the correlation among variables, we can simply
remove one feature column from the one-hot encoded array. We do not lose any
information by removing a feature column: for example, if we remove the column color_blue,
the feature information is still preserved, since observing color_green=0 and color_red=0
implies that the observation must be blue.
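
The redundancy can be checked numerically. The following is a small sketch, assuming a model with an intercept column (an assumption added here, not stated above) and a hypothetical six-row sample with two observations per color: with all three dummy columns the design matrix is rank-deficient, and dropping one column restores full column rank.

```python
import numpy as np

# Hypothetical sample: category codes 0=blue, 1=green, 2=red, two rows each
codes = np.array([1, 2, 0, 1, 2, 0])
onehot = np.eye(3)[codes]            # columns: blue, green, red
intercept = np.ones((len(codes), 1))

# The three dummy columns always sum to the intercept column
full = np.hstack([intercept, onehot])
# Dropping the first (blue) dummy column removes that exact dependency
reduced = np.hstack([intercept, onehot[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(rank_full, 'of', full.shape[1], 'columns')
print(rank_reduced, 'of', reduced.shape[1], 'columns')
```

A rank below the number of columns means the matrix product used in, for example, ordinary least squares cannot be inverted exactly.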

**In order to drop a redundant column** via the OneHotEncoder, we set drop='first'. The categories='auto' argument is the default in recent scikit-learn versions and is passed explicitly here:

In [38]:
color_ohe = OneHotEncoder(categories='auto', drop='first')
c_transf = ColumnTransformer([('onehot', color_ohe, [0]), ('nothing', 'passthrough', [1, 2]) ])
c_transf.fit_transform(X).astype(float)

array([[ 1. ,  0. ,  1. , 10.1],
       [ 0. ,  1. ,  2. , 13.5],
       [ 0. ,  0. ,  3. , 15.3]])
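
To confirm which category was dropped, the fitted encoder can be inspected through its `categories_` and `drop_idx_` attributes. This is a minimal sketch on the color column alone; `colors`, `categories` and `dropped` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['green'], ['red'], ['blue']], dtype=object)
color_ohe = OneHotEncoder(categories='auto', drop='first')
encoded = color_ohe.fit_transform(colors).toarray()

# categories_ lists the sorted categories per input column;
# drop_idx_ gives the index of the category that was dropped
categories = color_ohe.categories_[0]
dropped = categories[int(color_ohe.drop_idx_[0])]
print('categories:', list(categories))
print('dropped:', dropped)
print(encoded)
```

Because the categories are sorted alphabetically, `drop='first'` removes `blue` here, which matches the color_blue example in the text.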