# **Encoding Nominal Categorical Features**

**Problem**   
You have a feature with nominal classes that has no intrinsic ordering (e.g., apple, pear, banana).
**Solution**    
One-hot encode the feature using scikit-learn's LabelBinarizer:

In [None]:
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer
feature = np.array([
            ["Texas"],
            ["California"],
            ["Texas"],
            ["Delaware"],
            ["Texas"]])

# create one-hot encoder
one_hot = LabelBinarizer()

# one-hot encode the feature
one_hot.fit_transform(feature)


array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])

In [None]:
# view feature classes
one_hot.classes_

array(['California', 'Delaware', 'Texas'], dtype='<U10')

In [None]:
# reverse one_hot encoding
one_hot.inverse_transform(one_hot.transform(feature))

array(['Texas', 'California', 'Texas', 'Delaware', 'Texas'], dtype='<U10')

**We can even use pandas to one-hot encode the feature with get_dummies**

In [None]:
import pandas as pd
pd.get_dummies(feature[:,0])

   California  Delaware  Texas
0       False     False   True
1        True     False  False
2       False     False   True
3       False      True  False
4       False     False   True


**To handle a situation where each observation lists multiple classes, use MultiLabelBinarizer**

In [None]:
# create multiclass feature
multiclass_feature = [
    ("Texas","Florida"),
    ("California","Alabama"),
    ("Texas","Florida"),
    ("Delaware","Florida"),
    ("Texas","Alabama")
]

# create multiclass one-hot encoder
one_hot_multiclass = MultiLabelBinarizer()

# one-hot encode multiclass feature
one_hot_multiclass.fit_transform(multiclass_feature)

array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])

In [None]:
# view classes
one_hot_multiclass.classes_

array(['Alabama', 'California', 'Delaware', 'Florida', 'Texas'],
      dtype=object)
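
As a quick check on the encoding above, the sketch below round-trips a small multiclass feature through MultiLabelBinarizer. The names `observations`, `encoder`, and `decoded` are illustrative and not part of the original notebook. Note that inverse_transform returns each observation's classes in the sorted order of classes_, not in the order they were originally listed.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# illustrative data: each observation lists two classes
observations = [("Texas", "Florida"), ("Delaware", "Florida"), ("Texas", "Alabama")]

# fit the encoder and one-hot encode every observation
encoder = MultiLabelBinarizer()
encoded = encoder.fit_transform(observations)

# map the indicator rows back to tuples of class labels
decoded = encoder.inverse_transform(encoded)

print(encoder.classes_)
print(encoded)
print(decoded)
```

Comparing the decoded tuples to the originals as sets avoids depending on the label order within each tuple.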

# **Encoding Ordinal Categorical Features**   
**Problem**     
You have an ordinal categorical feature (e.g., high, medium, low).     
**Solution**    
Use the pandas DataFrame replace method to transform string labels into their numerical equivalents:

In [None]:
import pandas as pd
# create feature
df = pd.DataFrame({"Score" : ["Low", "Low", "Medium", "High"]})

# create mapper
scale_mapper = {
    "Low" : 1,
    "Medium" : 2,
    "High" : 3
}

# replace feature values with scale
df["Score"].replace(scale_mapper)

0    1
1    1
2    2
3    3
Name: Score, dtype: int64
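
One caveat with replace is that a label missing from the mapper passes through unchanged, so a stray string can survive unnoticed. As a hedged alternative, the sketch below uses Series.map, which turns unmapped labels into NaN so they are easy to find. The extra label "Very High" and the names `scores`, `encoded`, and `unmapped` are assumptions added for illustration.

```python
import pandas as pd

# illustrative feature containing one label the mapper does not know
scores = pd.Series(["Low", "Low", "Medium", "High", "Very High"], name="Score")

scale_mapper = {"Low": 1, "Medium": 2, "High": 3}

# map() yields NaN for any label that is absent from the mapper
encoded = scores.map(scale_mapper)

# collect the labels that could not be encoded
unmapped = scores[encoded.isna()].unique().tolist()

print(encoded.tolist())
print(unmapped)
```

Checking `unmapped` before training is a cheap way to catch typos or new categories in the raw data.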

# **Encoding Dictionaries of Features**    
**Problem**   
You have a dictionary and want to convert it into a feature matrix.     
**Solution**   
Use DictVectorizer:


In [None]:
from sklearn.feature_extraction import DictVectorizer
data_dict = [
    {"Red": 2, "Blue": 4},
    {"Red": 4, "Blue": 3},
    {"Red": 1, "Yellow": 2},
    {"Red": 2, "Yellow": 2}
]

# create dictionary vectorizer
dictvectorizer = DictVectorizer(sparse = False) # force DictVectorizer to output a dense matrix

# convert dictionary to feature matrix
features = dictvectorizer.fit_transform(data_dict)
features

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

In [None]:
# get feature names
dictvectorizer.get_feature_names_out()

array(['Blue', 'Red', 'Yellow'], dtype=object)

# **Imputing Missing Class Values**    
**Problem**   
You have a categorical feature containing missing values that you want to replace with predicted
values.    
**Solution**   
The ideal solution is to train a machine learning classifier algorithm to predict the missing
values, commonly a k-nearest neighbors (KNN) classifier:


In [None]:
# load libraries
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])

X_with_nan = np.array([[np.nan, 0.87, 1.13],
                       [np.nan, -0.67, -0.22]])

# train KNN learner
clf = KNeighborsClassifier(3, weights = 'distance')
trained_model = clf.fit(X[:, 1:], X[:, 0])

# predict the class of the missing values
imputed_values = trained_model.predict(X_with_nan[:, 1:])

# join column of predicted class with their other features
X_with_imputed = np.hstack((imputed_values.reshape(-1, 1), X_with_nan[:, 1:]))

# join the two feature matrices
np.vstack((X_with_imputed, X))



array([[ 0.  ,  0.87,  1.13],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

In [None]:
# SimpleImputer replaces the older sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer

# join the two feature matrices
X_complete = np.vstack((X_with_nan, X))

imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

imputer.fit_transform(X_complete)

array([[ 0.5 ,  0.87,  1.13],
       [ 0.5 , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])
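
Mean imputation fills the missing class with 0.5 in the output above, which is not a valid class label. As a hedged alternative for categorical features, SimpleImputer also accepts strategy="most_frequent", which always fills with a class that was actually observed. The sketch rebuilds the same matrices so that it runs on its own; `mode_imputer` and `X_mode_imputed` are illustrative names.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# same data as above: the first column is the class, the rest are features
X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])
X_with_nan = np.array([[np.nan, 0.87, 1.13],
                       [np.nan, -0.67, -0.22]])
X_complete = np.vstack((X_with_nan, X))

# fill missing values with the most frequent value of each column
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
X_mode_imputed = mode_imputer.fit_transform(X_complete)

print(X_mode_imputed[:, 0])
```

In this tiny example the two classes are tied, so which one is chosen depends on the imputer's tie-breaking rule. The point is only that the filled value is always one of the existing classes.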

# **Handling Imbalanced Classes**    
**Problem**     
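You have a target vector with highly imbalanced classes.      
**Solution**     
Weight the classes during training, downsample the majority class, or upsample the minority class.      
The examples below use Fisher's Iris dataset (download link 1: https://www.kaggle.com/uciml/iris      
link 2: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/).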


In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()

features = iris.data
target = iris.target

# remove first 40 observations
features = features[40:, :]
target = target[40:]

# create binary target vector indicating if class 0
target = np.where((target == 0),0, 1)
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [None]:
# create weights
weights = {0: 0.9, 1: 0.1}

# create random forest classifier with weights
RandomForestClassifier(class_weight = weights)

In [None]:
RandomForestClassifier(class_weight = "balanced")
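
For intuition about what class_weight = "balanced" does, scikit-learn documents the balanced weight of a class as n_samples / (n_classes * number of observations in that class). The sketch below computes that weight by hand and with scikit-learn's compute_class_weight helper for a target with 10 observations of class 0 and 100 of class 1, the same proportions as the target above. The `toy_target` array is constructed purely for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# illustrative imbalanced target: 10 observations of class 0, 100 of class 1
toy_target = np.array([0] * 10 + [1] * 100)
classes = np.unique(toy_target)

# balanced weight of a class = n_samples / (n_classes * class count)
manual = len(toy_target) / (len(classes) * np.bincount(toy_target))

# the same weights computed by scikit-learn
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=toy_target)

print(dict(zip(classes.tolist(), balanced.tolist())))
```

The minority class receives a weight ten times larger than the majority class, mirroring the 1:10 class ratio.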

In [None]:
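# indices of the observations in each class
i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]

# number of observations in each class
n_class0 = len(i_class0)
n_class1 = len(i_class1)

# downsample class 1: sample without replacement until it matches class 0 in size
i_class1_downsampled = np.random.choice(i_class1, size = n_class0, replace = False)

# join class 0's target vector with the downsampled class 1's target vector
np.hstack((target[i_class0], target[i_class1_downsampled]))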

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [None]:
# Join together class 0's feature matrix with the class 1's feature matrix
np.vstack((features[i_class0, :], features[i_class1_downsampled, :]))[0:5]

array([[5. , 3.5, 1.3, 0.3],
       [4.5, 2.3, 1.3, 0.3],
       [4.4, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [5.1, 3.8, 1.9, 0.4]])
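
Because np.random.choice draws different indices on every run, the downsampled matrices above are not reproducible. A minimal sketch of a seeded downsampling helper follows. The function name `downsample_majority`, the seed value, and the toy arrays are hypothetical, and the helper uses NumPy's Generator API (np.random.default_rng) in place of the legacy np.random.choice.

```python
import numpy as np

def downsample_majority(features, target, seed=0):
    """Cut every class down to the size of the smallest class."""
    rng = np.random.default_rng(seed)  # seeded generator for reproducible draws
    classes, counts = np.unique(target, return_counts=True)
    n_min = counts.min()
    # for each class, keep n_min randomly chosen row indices (no replacement)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(target == c), size=n_min, replace=False)
        for c in classes
    ])
    return features[keep], target[keep]

# illustrative data: 3 observations of class 0, 8 of class 1
toy_features = np.arange(22).reshape(11, 2)
toy_target = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

balanced_features, balanced_target = downsample_majority(toy_features, toy_target)
print(np.bincount(balanced_target))
```

Calling the helper again with the same seed returns the same rows, which makes downstream results easier to compare.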

In [None]:
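# upsample class 0: sample with replacement until it matches class 1 in size
i_class0_upsampled = np.random.choice(i_class0, size = n_class1, replace = True)
# join class 0's upsampled target vector with class 1's target vector
np.concatenate((target[i_class0_upsampled],target[i_class1]))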

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1])

In [None]:
# join together class 0's upsampled feature matrix with class 1's feature matrix
np.vstack((features[i_class0_upsampled,:], features[i_class1,:]))[0:5]

array([[5.3, 3.7, 1.5, 0.2],
       [4.4, 3.2, 1.3, 0.2],
       [5.3, 3.7, 1.5, 0.2],
       [4.6, 3.2, 1.4, 0.2],
       [4.5, 2.3, 1.3, 0.3]])

# **With Dataset**

**Encoding Nominal Categorical Features**  
The same recipe applied to the Iris dataset: one-hot encode the target labels using scikit-learn's LabelBinarizer:

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelBinarizer
iris = load_iris()  # load the dataset once
feature1 = iris.data  # Select only the features (data)
target1 = iris.target  # Select only the target (labels)

one_hot1 = LabelBinarizer()  # Create an instance of LabelBinarizer
one_hot_encoded = one_hot1.fit_transform(target1)  # Fit and transform the labels
print(one_hot_encoded)

[[1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 

In [None]:
# view feature classes
one_hot1.classes_

array([0, 1, 2])

In [None]:
# reverse one_hot1 encoding
one_hot1.inverse_transform(one_hot1.transform(target1))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [None]:
import pandas as pd
pd.get_dummies(feature1[:,0])

Unnamed: 0,4.3,4.4,4.5,4.6,4.7,4.8,4.9,5.0,5.1,5.2,...,6.8,6.9,7.0,7.1,7.2,7.3,7.4,7.6,7.7,7.9
0,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
146,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
147,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
148,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MultiLabelBinarizer
iris = load_iris()  # load the dataset once
feature2 = iris.data  # Select only the features (data)
target2 = iris.target  # Select only the target (labels)

one_hot2 = MultiLabelBinarizer()  # Create an instance of MultiLabelBinarizer
one_hot_encoded2 = one_hot2.fit_transform(feature2)  # Treat each row's four measurements as a set of labels
print(one_hot_encoded2)

[[0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [None]:
# view feature classes
one_hot2.classes_

array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6,
       1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9,
       3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2,
       4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5,
       5.6, 5.7, 5.8, 5.9, 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8,
       6.9, 7.0, 7.1, 7.2, 7.3, 7.4, 7.6, 7.7, 7.9], dtype=object)

In [None]:
# reverse one_hot2 encoding
one_hot2.inverse_transform(one_hot2.fit_transform(feature2))

[(0.2, 1.4, 3.5, 5.1),
 (0.2, 1.4, 3.0, 4.9),
 (0.2, 1.3, 3.2, 4.7),
 (0.2, 1.5, 3.1, 4.6),
 (0.2, 1.4, 3.6, 5.0),
 (0.4, 1.7, 3.9, 5.4),
 (0.3, 1.4, 3.4, 4.6),
 (0.2, 1.5, 3.4, 5.0),
 (0.2, 1.4, 2.9, 4.4),
 (0.1, 1.5, 3.1, 4.9),
 (0.2, 1.5, 3.7, 5.4),
 (0.2, 1.6, 3.4, 4.8),
 (0.1, 1.4, 3.0, 4.8),
 (0.1, 1.1, 3.0, 4.3),
 (0.2, 1.2, 4.0, 5.8),
 (0.4, 1.5, 4.4, 5.7),
 (0.4, 1.3, 3.9, 5.4),
 (0.3, 1.4, 3.5, 5.1),
 (0.3, 1.7, 3.8, 5.7),
 (0.3, 1.5, 3.8, 5.1),
 (0.2, 1.7, 3.4, 5.4),
 (0.4, 1.5, 3.7, 5.1),
 (0.2, 1.0, 3.6, 4.6),
 (0.5, 1.7, 3.3, 5.1),
 (0.2, 1.9, 3.4, 4.8),
 (0.2, 1.6, 3.0, 5.0),
 (0.4, 1.6, 3.4, 5.0),
 (0.2, 1.5, 3.5, 5.2),
 (0.2, 1.4, 3.4, 5.2),
 (0.2, 1.6, 3.2, 4.7),
 (0.2, 1.6, 3.1, 4.8),
 (0.4, 1.5, 3.4, 5.4),
 (0.1, 1.5, 4.1, 5.2),
 (0.2, 1.4, 4.2, 5.5),
 (0.2, 1.5, 3.1, 4.9),
 (0.2, 1.2, 3.2, 5.0),
 (0.2, 1.3, 3.5, 5.5),
 (0.1, 1.4, 3.6, 4.9),
 (0.2, 1.3, 3.0, 4.4),
 (0.2, 1.5, 3.4, 5.1),
 (0.3, 1.3, 3.5, 5.0),
 (0.3, 1.3, 2.3, 4.5),
 (0.2, 1.3, 3.2, 4.4),
 (0.6, 1.6,

In [None]:
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris dataset into a DataFrame
iris = load_iris()
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_iris["Species"] = iris.target

# Define mapper
species_mapper = {
    0: "setosa",
    1: "versicolor",
    2: "virginica"
}

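# Replace 'Species' values with mapped values (replace returns a new Series, so assign it back)
df_iris["Species"] = df_iris["Species"].replace(species_mapper)
df_iris["Species"].head(150)

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: Species, Length: 150, dtype: object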
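
As a hedged variation, the species names can be read from the dataset's own target_names attribute instead of a hand-written mapper, which avoids typos in the labels. The names `name_mapper` and `species` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# build {0: 'setosa', 1: 'versicolor', 2: 'virginica'} from the dataset itself
name_mapper = {code: str(name) for code, name in enumerate(iris.target_names)}

# map integer codes to species names; map() returns a new Series
species = pd.Series(iris.target, name="Species").map(name_mapper)

print(species.value_counts())
```

Each of the three species should appear 50 times, matching the 150 observations in the dataset.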

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_extraction import DictVectorizer

# Load Iris dataset into a DataFrame
iris = load_iris()
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_iris["Species"] = iris.target

# Convert DataFrame to a list of dictionaries
data_dict = df_iris.to_dict(orient='records')

# Create a dictionary vectorizer
dictvectorizer = DictVectorizer(sparse=False)

# Convert dictionary to feature matrix
features = dictvectorizer.fit_transform(data_dict)

print(features)

[[0.  1.4 0.2 5.1 3.5]
 [0.  1.4 0.2 4.9 3. ]
 [0.  1.3 0.2 4.7 3.2]
 [0.  1.5 0.2 4.6 3.1]
 [0.  1.4 0.2 5.  3.6]
 [0.  1.7 0.4 5.4 3.9]
 [0.  1.4 0.3 4.6 3.4]
 [0.  1.5 0.2 5.  3.4]
 [0.  1.4 0.2 4.4 2.9]
 [0.  1.5 0.1 4.9 3.1]
 [0.  1.5 0.2 5.4 3.7]
 [0.  1.6 0.2 4.8 3.4]
 [0.  1.4 0.1 4.8 3. ]
 [0.  1.1 0.1 4.3 3. ]
 [0.  1.2 0.2 5.8 4. ]
 [0.  1.5 0.4 5.7 4.4]
 [0.  1.3 0.4 5.4 3.9]
 [0.  1.4 0.3 5.1 3.5]
 [0.  1.7 0.3 5.7 3.8]
 [0.  1.5 0.3 5.1 3.8]
 [0.  1.7 0.2 5.4 3.4]
 [0.  1.5 0.4 5.1 3.7]
 [0.  1.  0.2 4.6 3.6]
 [0.  1.7 0.5 5.1 3.3]
 [0.  1.9 0.2 4.8 3.4]
 [0.  1.6 0.2 5.  3. ]
 [0.  1.6 0.4 5.  3.4]
 [0.  1.5 0.2 5.2 3.5]
 [0.  1.4 0.2 5.2 3.4]
 [0.  1.6 0.2 4.7 3.2]
 [0.  1.6 0.2 4.8 3.1]
 [0.  1.5 0.4 5.4 3.4]
 [0.  1.5 0.1 5.2 4.1]
 [0.  1.4 0.2 5.5 4.2]
 [0.  1.5 0.2 4.9 3.1]
 [0.  1.2 0.2 5.  3.2]
 [0.  1.3 0.2 5.5 3.5]
 [0.  1.4 0.1 4.9 3.6]
 [0.  1.3 0.2 4.4 3. ]
 [0.  1.5 0.2 5.1 3.4]
 [0.  1.3 0.3 5.  3.5]
 [0.  1.3 0.3 4.5 2.3]
 [0.  1.3 0.2 4.4 3.2]
 [0.  1.6 0