# Handling Categorical Data
We frequently represent qualitative information in categories such as gender, color, or car brand.<br>
Sets of categories with no intrinsic ordering are called nominal.<br>
Examples of nominal categories include:
- Blue, Red, Green
- Man, Woman
- Banana, Strawberry, Apple
<br>
In contrast, when a set of categories has some natural ordering, we refer to it as ordinal.<br>
For example:
- Low, Medium, High
- Young, Old
- Agree, Neutral, Disagree
<br>
Furthermore, categorical information is often represented in data as a vector or column of strings (e.g., "Maine", "Texas", "Delaware").<br>
Our goal is to transform the data in a way that properly captures the information in the categories (ordinality, relative intervals between categories, etc.).
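
To make the nominal/ordinal distinction concrete, here is a minimal sketch using `pd.Categorical`. The specific values and the `Low < Medium < High` ordering are assumptions chosen for illustration; they are not part of the data used later in this notebook.

```python
import pandas as pd

# Nominal: categories carry no order
colors = pd.Categorical(["Blue", "Red", "Green", "Red"])

# Ordinal: we state the order explicitly (Low < Medium < High is our assumption)
levels = pd.Categorical(
    ["Low", "High", "Medium", "Low"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)

# Whether pandas treats the nominal categories as ordered
nominal_is_ordered = colors.ordered
# Integer position of each value within the declared category order
ordinal_codes = levels.codes.tolist()
```

The integer codes respect the declared order, but they say nothing about the distance between categories; that has to be encoded separately when it matters.
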

In [None]:
# Libraries
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris


In [None]:
# Variables
feature = np.array([["Texas"],["California"],["Texas"],["Delaware"],["Texas"]])
multiclass_feature = [("Texas", "Florida"),("California", "Alabama"),("Texas", "Florida"),("Delaware", "Florida"),("Texas", "Alabama")]
data_dict = [
    {"Red": 2, "Blue": 4},
    {"Red": 4, "Blue": 3},
    {"Red": 1, "Yellow": 2},
    {"Red": 2, "Yellow": 2}
]

## Problem
You have a feature with nominal classes that has no intrinsic ordering (e.g., apple,
pear, banana), and you want to encode the feature into numerical values.
## Solution
One-hot encode the feature using scikit-learn’s LabelBinarizer:

In [None]:
# Encoding Nominal Categorical Features
one_hot = LabelBinarizer()
one_hot.fit_transform(feature)
# Output Classes
one_hot.classes_
# Reverse one-hot encoding
one_hot.inverse_transform(one_hot.transform(feature))

array(['Texas', 'California', 'Texas', 'Delaware', 'Texas'], dtype='<U10')

In [None]:
# Using Pandas to one-hot encode the feature
pd.get_dummies(feature[:,0])

   California  Delaware  Texas
0       False     False   True
1        True     False  False
2       False     False   True
3       False      True  False
4       False     False   True
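
scikit-learn documents `LabelBinarizer` mainly as a transformer for target labels, so for input features `OneHotEncoder` may be a more natural fit. The sketch below is one possible alternative, not the method used above. It assumes we want categories unseen during fitting to be encoded as all zeros (`handle_unknown="ignore"`), and "Maine" is a hypothetical unseen value.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

feature = np.array([["Texas"], ["California"], ["Texas"], ["Delaware"], ["Texas"]])

# The encoder returns a sparse matrix by default; convert it for display
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(feature).toarray()

# A category that was not present during fitting becomes a row of zeros
unseen = encoder.transform(np.array([["Maine"]])).toarray()
```

The learned categories are available in `encoder.categories_`, sorted alphabetically in the same way as `one_hot.classes_`.
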


In [None]:
# One-hot encode a feature where each observation has multiple classes (multilabel)
one_hot_multiclass = MultiLabelBinarizer()
one_hot_multiclass.fit_transform(multiclass_feature)
one_hot_multiclass.classes_

array(['Alabama', 'California', 'Delaware', 'Florida', 'Texas'],
      dtype=object)

In [None]:
# Encoding Ordinal Categorical Features
dataframe = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High"]})
# Create mapper
scale_mapper = {"Low":1,"Medium":2,"High":3}
# Map each category to its numeric value
dataframe["Score"].map(scale_mapper)

0    1
1    1
2    2
3    2
4    3
Name: Score, dtype: int64
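
The mapper above assumes equal spacing between categories. When the intervals are not equal, the numeric values can reflect that. The sketch below is illustrative only: the category "Barely More Than Medium" and its value 2.1 are hypothetical choices.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {"Score": ["Low", "Low", "Medium", "Barely More Than Medium", "High"]}
)

# Hypothetical mapper: the extra category sits just above "Medium"
scale_mapper = {
    "Low": 1,
    "Medium": 2,
    "Barely More Than Medium": 2.1,
    "High": 3,
}

# Map each category to its numeric value
encoded_scores = dataframe["Score"].map(scale_mapper)
```

Choosing these values is a modeling decision: they should reflect what we believe about the relative distance between categories.
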

In [None]:
# Encoding Dictionaries of Features
dictvectorizer = DictVectorizer(sparse=False)
features = dictvectorizer.fit_transform(data_dict)
feature_names = dictvectorizer.get_feature_names_out()
feature_names
pd.DataFrame(features, columns=feature_names)

   Blue  Red  Yellow
0   4.0  2.0     0.0
1   3.0  4.0     0.0
2   0.0  1.0     2.0
3   0.0  2.0     2.0


In [None]:
# Create word count dictionaries for four documents
doc_1_word_count = {"Red": 2, "Blue": 4}
doc_2_word_count = {"Red": 4, "Blue": 3}
doc_3_word_count = {"Red": 1, "Yellow": 2}
doc_4_word_count = {"Red": 2, "Yellow": 2}
# Create list
doc_word_counts = [doc_1_word_count, doc_2_word_count, doc_3_word_count, doc_4_word_count]
# Convert list of word count dictionaries into feature matrix
dictvectorizer.fit_transform(doc_word_counts)

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])
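
The cells above pass `sparse=False` to get a dense array. By default `DictVectorizer` returns a sparse matrix, which stores only the values that appear in the dictionaries. A minimal sketch of that default, reusing the same word counts:

```python
from sklearn.feature_extraction import DictVectorizer

data_dict = [
    {"Red": 2, "Blue": 4},
    {"Red": 4, "Blue": 3},
    {"Red": 1, "Yellow": 2},
    {"Red": 2, "Yellow": 2},
]

# Default behaviour: the result is a SciPy sparse matrix
sparse_vectorizer = DictVectorizer()
sparse_features = sparse_vectorizer.fit_transform(data_dict)

# Number of explicitly stored values versus total cells in the matrix
stored_values = sparse_features.nnz
total_cells = sparse_features.shape[0] * sparse_features.shape[1]

# Convert to a dense array only when the matrix is small enough
dense_features = sparse_features.toarray()
```

With many documents and a large vocabulary, most cells are zero, so the sparse form can save a great deal of memory.
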

In [None]:
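# Imputing missing class values
# Feature matrix with the categorical feature in the first column
X = np.array([[0, 2.10, 1.45],[1, 1.18, 1.33],[0, 1.22, 1.27],[1, -0.21, -1.19]])
# Feature matrix with missing values in the categorical feature
X_with_nan = np.array([[np.nan, 0.87, 1.31],[np.nan, -0.67, -0.22]])
# Train a KNN classifier on the observations whose class is known
clf = KNeighborsClassifier(3, weights='distance')
trained_model = clf.fit(X[:,1:], X[:,0])
# Predict the class of the observations with missing values
imputed_values = trained_model.predict(X_with_nan[:,1:])
# Join the predicted classes with their other features
X_with_imputed = np.hstack((imputed_values.reshape(-1,1), X_with_nan[:,1:]))
# Join two feature matrices
np.vstack((X_with_imputed, X))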

array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

In [None]:
# Fill in missing values with the feature’s most frequent value
X_complete = np.vstack((X_with_nan, X))
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X_complete)

array([[ 0.  ,  0.87,  1.31],
       [ 0.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

In [None]:
# Handling Imbalanced Classes
iris = load_iris()
iris_features = iris.data
iris_target = iris.target
# Remove the first 40 observations to make the classes highly imbalanced
iris_features = iris_features[40:,:]
iris_target = iris_target[40:]
# Create binary target vector indicating if class 0
iris_target = np.where((iris_target == 0), 0, 1)
# Look at the imbalanced target vector
iris_target

# Create weights
weights = {0: 0.9, 1: 0.1}
# Create random forest classifier with explicit class weights
RandomForestClassifier(class_weight=weights)
# Or weight classes inversely proportional to their frequencies
RandomForestClassifier(class_weight='balanced')

In [None]:
# Downsampling
# Indices of each class's observations
i_class0 = np.where(iris_target == 0)[0]
i_class1 = np.where(iris_target == 1)[0]
# Number of observations in each class
n_class0 = len(i_class0)
n_class1 = len(i_class1)
# For every observation of class 0, randomly sample from class 1 without replacement
i_class1_downsampled = np.random.choice(i_class1, size=n_class0, replace=False)
# Join together class 0's target vector with the downsampled class 1's target vector
np.hstack((iris_target[i_class0], iris_target[i_class1_downsampled]))
# Join together class 0's feature matrix with the downsampled class 1's feature matrix
np.vstack((iris_features[i_class0,:], iris_features[i_class1_downsampled,:]))[0:5]

array([[5. , 3.5, 1.3, 0.3],
       [4.5, 2.3, 1.3, 0.3],
       [4.4, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [5.1, 3.8, 1.9, 0.4]])

In [None]:
# Upsampling
# For every observation in class 1, randomly sample from class 0 with replacement
i_class0_upsampled = np.random.choice(i_class0, size=n_class1, replace=True)
# Join together class 0's upsampled target vector with class 1's target vector
np.concatenate((iris_target[i_class0_upsampled], iris_target[i_class1]))
# Join together class 0's upsampled feature matrix with class 1's feature matrix
np.vstack((iris_features[i_class0_upsampled,:], iris_features[i_class1,:]))[0:5]

In the real world, imbalanced classes are everywhere: most visitors don’t click the
buy button, and many types of cancer are thankfully rare. For this reason, handling
imbalanced classes is a common activity in machine learning.

Our best strategy is simply to collect more observations, especially observations
from the minority class. However, this is often not possible, so we have to resort
to other options.

A second strategy is to use a model evaluation metric better suited to imbalanced
classes. Accuracy is often used as a metric for evaluating the performance of a model,
but when imbalanced classes are present, accuracy can be ill-suited. For example,
if only 0.5% of observations have some rare cancer, then even a naive model that
predicts nobody has cancer will be 99.5% accurate, despite never detecting a single
case. Some better metrics we discuss in later chapters are confusion matrices,
precision, recall, F1 scores, and ROC curves.
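
The cancer example can be checked numerically. The sketch below assumes a hypothetical population of 1,000 patients, 5 of whom (0.5%) have the disease, and compares accuracy with recall for a model that always predicts "no cancer".

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical ground truth: 5 positive cases out of 1,000 (0.5%)
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1

# Naive model: predict the negative class for everyone
y_pred = np.zeros(1000, dtype=int)

# Fraction of all predictions that are correct
accuracy = accuracy_score(y_true, y_pred)
# Fraction of actual positive cases that the model found
recall = recall_score(y_true, y_pred)
```

Accuracy comes out at 0.995 while recall is 0.0, which is why recall (or a confusion matrix) gives a more honest picture here.
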

A third strategy is to use the class weighting parameters included in implementations
of some models. This allows the algorithm to adjust for imbalanced classes.
Fortunately, many scikit-learn classifiers have a class_weight parameter, making it a
good option.
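
To see what `class_weight='balanced'` does, we can compute the weights directly. The sketch below uses a hypothetical target with 10 observations of class 0 and 100 of class 1, the same split as the trimmed iris target above, and compares scikit-learn's `compute_class_weight` helper with the formula `n_samples / (n_classes * np.bincount(y))`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 10 observations of class 0, 100 of class 1
y = np.array([0] * 10 + [1] * 100)
classes = np.unique(y)

# Weights as computed by scikit-learn for the 'balanced' setting
balanced_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights computed by hand
manual_weights = len(y) / (len(classes) * np.bincount(y))
```

The minority class receives a weight of 5.5 and the majority class 0.55, so errors on the rare class count ten times as much.
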

The fourth and fifth strategies are related: downsampling and upsampling. In
downsampling we create a random subset of the majority class of equal size to the
minority class. In upsampling we repeatedly sample with replacement from the minority
class until it is the same size as the majority class. The decision between
downsampling and upsampling is context-specific, and in general we should try both to
see which produces better results.
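
Both resampling strategies can be sketched side by side. The example below is self-contained and uses assumptions of our own: a seeded NumPy generator for reproducibility, a placeholder feature matrix of random values, and a target with 10 minority and 100 majority observations.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (our choice)

# Hypothetical data: 10 minority (class 0) and 100 majority (class 1) observations
y = np.array([0] * 10 + [1] * 100)
X = rng.normal(size=(110, 4))  # placeholder feature values

i_minority = np.where(y == 0)[0]
i_majority = np.where(y == 1)[0]

# Downsampling: draw a majority subset the size of the minority, without replacement
i_majority_down = rng.choice(i_majority, size=len(i_minority), replace=False)
i_down = np.concatenate((i_minority, i_majority_down))
X_down, y_down = X[i_down], y[i_down]

# Upsampling: draw minority observations with replacement up to the majority size
i_minority_up = rng.choice(i_minority, size=len(i_majority), replace=True)
i_up = np.concatenate((i_minority_up, i_majority))
X_up, y_up = X[i_up], y[i_up]
```

Indexing features and target with the same index array keeps rows and labels aligned. Downsampling discards information, while upsampling repeats minority observations, so it is worth resampling only the training split and evaluating on untouched data.
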