# Qualitative data, Nominal and Ordinal

**It is often useful to measure objects in terms of quality rather than quantity. This qualitative information is often represented as an observation's membership in a discrete category such as gender, color, or brand of car. Sets of categories with no intrinsic ordering are called nominal. Examples of nominal categories: 1. Red, Green, Blue 2. Man, Woman 3. Mango, Orange, Apple. When a set of categories has some natural ordering, we refer to it as ordinal. Examples are: 1. Low, Medium, High 2. Young, Old 3. Agree, Neutral, Disagree**

# Encoding nominal categorical features

## One Hot Encoding

In [1]:
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

In [2]:
feature = np.array([
    ["Chittagong"],
    ["Sylhet"],
    ["Toronto"],
    ["Toronto"],
    ["Nashville"],
    ["Singapore city"]
])

In [3]:
one_hot = LabelBinarizer()

In [4]:
one_hot.fit_transform(feature)

array([[1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0]])

In [5]:
#use the classes_ attribute to output the classes
one_hot.classes_

array(['Chittagong', 'Nashville', 'Singapore city', 'Sylhet', 'Toronto'],
      dtype='<U14')

In [6]:
#if we want to reverse the one hot encoding, we can use inverse_transform
one_hot.inverse_transform(one_hot.transform(feature))

array(['Chittagong', 'Sylhet', 'Toronto', 'Toronto', 'Nashville',
       'Singapore city'], dtype='<U14')
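
LabelBinarizer works here, but scikit-learn also provides OneHotEncoder, which is aimed at feature columns rather than target labels. The following is a minimal sketch under that assumption; the city "Dhaka" is a made-up unseen value added only to show how `handle_unknown="ignore"` behaves.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

feature = np.array([
    ["Chittagong"],
    ["Sylhet"],
    ["Toronto"],
    ["Toronto"],
    ["Nashville"],
    ["Singapore city"]
])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")

# the default output is a SciPy sparse matrix; toarray() makes it dense for display
encoded = encoder.fit_transform(feature).toarray()

# categories_ holds one array of learned classes per input column
print(encoder.categories_[0])

# a city that was not seen during fit (hypothetical value) becomes a row of zeros
unseen = encoder.transform(np.array([["Dhaka"]])).toarray()
print(unseen)
```

Ignoring unknown categories is one design choice among several; raising an error (the default) may be preferable when an unseen category signals a data problem.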

### Using pandas for one hot encoding

In [7]:
import pandas as pd

In [8]:
pd.get_dummies(feature[:, 0]) #create dummy variables for features

   Chittagong  Nashville  Singapore city  Sylhet  Toronto
0           1          0               0       0        0
1           0          0               0       1        0
2           0          0               0       0        1
3           0          0               0       0        1
4           0          1               0       0        0
5           0          0               1       0        0


In [9]:
#one helpful ability of sklearn is to handle a situation where each observation lists multiple classes
#create multiclass features
multiclass_feature = np.array([
    ("Chittagong", "Sylhet"),
    ("Singapore city", "Manilla"),
    ("Toronto", "Montreal"),
])

In [10]:
one_hot_multiclass = MultiLabelBinarizer() #create multiclass one hot encoder

In [11]:
one_hot_multiclass.fit_transform(multiclass_feature)

array([[1, 0, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 1]])

In [12]:
one_hot_multiclass.classes_

array(['Chittagong', 'Manilla', 'Montreal', 'Singapore city', 'Sylhet',
       'Toronto'], dtype=object)

**Since the classes have no intrinsic ordering, encoding them as numerical values erroneously imposes an ordering that is not present. The proper strategy is to create a binary feature for each class in the original feature. This is called one hot encoding. It is often recommended that after one hot encoding a feature, we drop one of the one hot encoded features in the resulting matrix to avoid linear dependence.**
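
One way to drop a one hot encoded column, sketched here with pandas, is the `drop_first` parameter of `get_dummies`. The variable names `cities`, `full` and `reduced` are only illustrative.

```python
import pandas as pd

cities = pd.Series([
    "Chittagong", "Sylhet", "Toronto", "Toronto", "Nashville", "Singapore city"
])

# full encoding: one column per class, so every row sums to 1
# (the columns are linearly dependent)
full = pd.get_dummies(cities)

# drop_first=True removes the first class in sorted order ("Chittagong" here);
# an observation of that class is then represented by a row of all zeros
reduced = pd.get_dummies(cities, drop_first=True)

print(full.shape, reduced.shape)
print(list(reduced.columns))
```

The dropped class acts as the reference level, so no information is lost.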

# Encoding ordinal categorical features

In [15]:
#use pandas dataframe's replace method to transform string labels to numerical equivalents
dataframe = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "Low", "Medium", "High"]})

In [16]:
#create mapper
scale_mapper = {"Low": 1,"Medium": 2, "High": 3}

In [17]:
#replace feature values with scale
dataframe["Score"].replace(scale_mapper)

0    1
1    1
2    2
3    2
4    1
5    2
6    3
Name: Score, dtype: int64

In [23]:
#often we have a feature with classes that have some kind of natural ordering, such as: strongly agree, agree, neutral, disagree, strongly disagree
#it is important that our choice of numeric values is based on our prior information about the ordinal classes
#in our solution, High is literally three times larger than Low
#this is fine in many instances, but can break down if the assumed intervals between the classes are not equal
dataframe = pd.DataFrame({
    "Score": [
        "Low", "Low", "Medium", "Medium", "High", "Barely More Than Medium"
    ]
})

In [24]:
scale_mapper = {"Low": 1, "Medium": 2, "Barely More Than Medium": 3, "High": 4}

In [25]:
dataframe["Score"].replace(scale_mapper)

0    1
1    1
2    2
3    2
4    4
5    3
Name: Score, dtype: int64

In [26]:
#in this example, the distance between Low and Medium is the same as the distance between Medium and Barely More Than Medium
#which is almost certainly not accurate
#the best approach is to be conscious about the numerical values mapped to classes
scale_mapper = {"Low": 1, "Medium": 2, "Barely More Than Medium": 2.1, "High": 3}

In [27]:
dataframe["Score"].replace(scale_mapper)

0    1.0
1    1.0
2    2.0
3    2.0
4    3.0
5    2.1
Name: Score, dtype: float64
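
If equally spaced codes are acceptable, scikit-learn's OrdinalEncoder can produce them from an explicit category order. This is a sketch of an alternative to the `replace` approach, not a replacement for it: the codes start at 0 and are always one unit apart, so it cannot express a mapping such as 2.1 for "Barely More Than Medium".

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

dataframe = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "Low", "Medium", "High"]})

# pass the classes in their natural order; without this argument
# the encoder would sort them alphabetically (High, Low, Medium)
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])

# the encoder expects a 2D input, hence the double brackets
encoded = encoder.fit_transform(dataframe[["Score"]])
print(encoded.ravel())
```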

# Encoding dictionaries of features

In [28]:
from sklearn.feature_extraction import DictVectorizer

In [29]:
data_dict = [
    {"Red": 0, "Green": 0, "Blue": 0},
    {"Red": 10, "Green": 0, "Blue": 20},
    {"Red": 20, "Green": 20, "Blue": 120},
    {"Red": 30, "Green": 10, "Blue": 10},
    {"Red": 0, "Green": 50, "Blue": 0},
    {"Red": 0, "Green": 10, "Blue": 30},
    {"Red": 50, "Green": 10, "Blue": 0}
]

In [30]:
#create dictionary vectorizer
dictvectorizer = DictVectorizer(sparse = False)

In [31]:
#convert dictionary to feature matrix
features = dictvectorizer.fit_transform(data_dict)

In [32]:
features

array([[  0.,   0.,   0.],
       [ 20.,   0.,  10.],
       [120.,  20.,  20.],
       [ 10.,  10.,  30.],
       [  0.,  50.,   0.],
       [ 30.,  10.,   0.],
       [  0.,  10.,  50.]])

In [33]:
#by default DictVectorizer outputs a sparse matrix that only stores nonzero elements. This can be very helpful when we have massive matrices
#and we want to minimize the memory requirements
#we can force DictVectorizer to output a dense matrix using sparse = False

In [34]:
#get the feature names
dictvectorizer.get_feature_names_out() #get_feature_names is deprecated

array(['Blue', 'Green', 'Red'], dtype=object)

In [35]:
pd.DataFrame(features, columns = dictvectorizer.get_feature_names_out())

    Blue  Green   Red
0    0.0    0.0   0.0
1   20.0    0.0  10.0
2  120.0   20.0  20.0
3   10.0   10.0  30.0
4    0.0   50.0   0.0
5   30.0   10.0   0.0
6    0.0   10.0  50.0


**This is a common situation when working with NLP. For example, we might have a collection of documents and for each document we have a dictionary containing the number of times every word appears in the document. Using DictVectorizer, we can easily create a feature matrix where every feature is the number of times a word appears in each document.**

In [36]:
#create word count dictionaries for five documents
doc_1_word_count = {"A": 10, "As": 50, "Are": 10, "Be": 30, "Biscuits": 10}
doc_2_word_count = {"A": 0, "As": 20, "Are": 10, "Be": 30, "Biscuits": 50}
doc_3_word_count = {"A": 40, "As": 30, "Are": 10, "Be": 20, "Biscuits": 0}
doc_4_word_count = {"A": 10, "As": 20, "Are": 30, "Be": 30, "Biscuits": 50}
doc_5_word_count = {"A": 50, "As": 50, "Are": 10, "Be": 0, "Biscuits": 0}

In [37]:
#create list
doc_word_counts = [doc_1_word_count, doc_2_word_count, doc_3_word_count, doc_4_word_count, doc_5_word_count]

In [38]:
#convert list of word count dictionaries into feature matrix
dictvectorizer.fit_transform(doc_word_counts)

array([[10., 10., 50., 30., 10.],
       [ 0., 10., 20., 30., 50.],
       [40., 10., 30., 20.,  0.],
       [10., 30., 20., 30., 50.],
       [50., 10., 50.,  0.,  0.]])

In [39]:
#this is just a toy example where there are only five unique words, so there are only five features in our matrix
#you can imagine that if each document were actually a book in a university library, our feature matrix would be very large
#in that case we would want to keep sparse = True, which is the default

# Imputing missing class values

In [78]:
#suppose you have a categorical feature containing missing values that you want to replace with predicted values
#the ideal solution is to train a ML classifier algorithm to predict the missing values
#commonly a knn classifier

In [79]:
from sklearn.neighbors import KNeighborsClassifier

In [80]:
#create a feature matrix with categorical feature
X = np.array([
    [1, 1.2, 20.2],
    [1, 10.2, 2.6],
    [0, 1.7, 2.6],
    [1, 53.5, 3.2]
])

In [81]:
#create feature matrix with missing values in the categorical feature
X_with_nan = np.array([
    [np.nan, 20.1, 20.1],
    [np.nan, 20.3, 2.1]
])

In [82]:
#train knn learner
clf = KNeighborsClassifier(3, weights = "distance")
trained_model = clf.fit(X[:, 1:], X[:, 0])

In [83]:
#predict missing values' class
imputed_values = trained_model.predict(X_with_nan[:, 1:])

In [84]:
#join column of predicted class with their other features
X_with_imputed = np.hstack((imputed_values.reshape(-1, 1), X_with_nan[:, 1:]))

In [85]:
X_with_imputed

array([[ 1. , 20.1, 20.1],
       [ 1. , 20.3,  2.1]])

In [86]:
#join two feature matrices
np.vstack((X_with_imputed, X))

array([[ 1. , 20.1, 20.1],
       [ 1. , 20.3,  2.1],
       [ 1. ,  1.2, 20.2],
       [ 1. , 10.2,  2.6],
       [ 0. ,  1.7,  2.6],
       [ 1. , 53.5,  3.2]])

In [87]:
#an alternative solution is to fill in missing values with the feature's most frequent value
from sklearn.impute import SimpleImputer

In [88]:
X_complete = np.vstack((X_with_nan, X))

In [89]:
X_complete

array([[ nan, 20.1, 20.1],
       [ nan, 20.3,  2.1],
       [ 1. ,  1.2, 20.2],
       [ 1. , 10.2,  2.6],
       [ 0. ,  1.7,  2.6],
       [ 1. , 53.5,  3.2]])

In [90]:
imputer = SimpleImputer(strategy = "most_frequent")
imputer.fit_transform(X_complete)

array([[ 1. , 20.1, 20.1],
       [ 1. , 20.3,  2.1],
       [ 1. ,  1.2, 20.2],
       [ 1. , 10.2,  2.6],
       [ 0. ,  1.7,  2.6],
       [ 1. , 53.5,  3.2]])

In [91]:
#it is recommended to include a binary feature indicating which observations contain imputed values
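
One way to get such a binary feature is the `add_indicator` parameter of SimpleImputer. The sketch below rebuilds the same matrix so that it runs on its own; the name `X_flagged` is only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_complete = np.array([
    [np.nan, 20.1, 20.1],
    [np.nan, 20.3, 2.1],
    [1, 1.2, 20.2],
    [1, 10.2, 2.6],
    [0, 1.7, 2.6],
    [1, 53.5, 3.2]
])

# add_indicator=True appends one binary column for each feature
# that had missing values during fit (here only the first feature)
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
X_flagged = imputer.fit_transform(X_complete)

# the last column is 1 where the class value was imputed, 0 otherwise
print(X_flagged)
```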

# Handling imbalanced classes

**Suppose you have a target vector with highly imbalanced classes. The best first step is to collect more data. If that isn't possible, change the metrics used to evaluate your model. If that doesn't work, consider using a model's built-in class weight parameters, downsampling, or upsampling.**
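
To see why the choice of metric matters, consider a toy case: a hypothetical model that always predicts the majority class on a target with 10 observations of class 0 and 100 of class 1. Accuracy looks good, while balanced accuracy and minority-class recall expose the problem. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# imbalanced ground truth: 10 observations of class 0, 100 of class 1
y_true = np.array([0] * 10 + [1] * 100)

# a useless "model" that always predicts the majority class
y_pred = np.ones_like(y_true)

accuracy = accuracy_score(y_true, y_pred)                    # 100 / 110
balanced = balanced_accuracy_score(y_true, y_pred)           # mean of per-class recall
minority_recall = recall_score(y_true, y_pred, pos_label=0)  # recall for class 0

print(accuracy, balanced, minority_recall)
```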

In [93]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

In [94]:
iris = load_iris()
features = iris.data
target = iris.target

In [99]:
#remove first 40 observations
features = features[40:, :]
target = target[40:]

In [101]:
#create binary target vector indicating if class 0
target = np.where((target == 0), 0, 1)

In [104]:
target #imbalanced target vector

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [105]:
#many algorithms in sklearn offer a parameter to weight classes during training to counteract the effect of their imbalance
#RandomForestClassifier includes a class_weight parameter
weights = {0: 0.9, 1: 0.1}
#create random forest classifier with weights
RandomForestClassifier(class_weight = weights)

RandomForestClassifier(class_weight={0: 0.9, 1: 0.1})

In [106]:
#or you can pass "balanced" to automatically create weights inversely proportional to class frequencies
RandomForestClassifier(class_weight = "balanced")

RandomForestClassifier(class_weight='balanced')
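
To inspect the weights that "balanced" would produce, scikit-learn exposes `compute_class_weight`. The sketch below uses a made-up target with the same 10 to 100 split and compares the result with the formula n_samples / (n_classes * class counts).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy target with the same imbalance: 10 of class 0, 100 of class 1
target = np.array([0] * 10 + [1] * 100)
classes = np.unique(target)

# weights computed by scikit-learn for class_weight="balanced"
balanced_weights = compute_class_weight("balanced", classes=classes, y=target)

# the same weights computed by hand
manual_weights = len(target) / (len(classes) * np.bincount(target))

# dictionary form, which could be passed as class_weight to a classifier
weights = dict(zip(classes.tolist(), balanced_weights.tolist()))
print(weights)
```

The minority class receives a weight ten times that of the majority class, mirroring the 1 to 10 ratio of their frequencies.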

In [107]:
#another way is to downsample the majority class or upsample the minority class
#indices of each class' observations
i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]

In [109]:
#number of observations in each class
n_class0 = len(i_class0)
n_class1 = len(i_class1)
n_class0, n_class1

(10, 100)

In [113]:
#for every observation of class 0, randomly sample
#from class 1 without replacement
i_class1_downsampled = np.random.choice(i_class1, size = n_class0, replace = False)

In [114]:
#join together class 0's target vector with the downsampled class 1's target vector
np.hstack((target[i_class0], target[i_class1_downsampled]))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [119]:
#join together class 0's feature matrix with the downsampled class 1's feature matrix
#np.vstack((features[i_class0, :], features[i_class1_downsampled, :]))[0:5]

In [120]:
#for every observation in class 1, randomly sample from class 0 with replacement
i_class0_upsampled = np.random.choice(i_class0, size = n_class1, replace = True)

In [122]:
len(i_class0_upsampled)

100

In [123]:
i_class0_upsampled

array([2, 4, 5, 0, 6, 8, 5, 6, 8, 6, 0, 6, 7, 6, 6, 2, 8, 8, 1, 5, 6, 6,
       6, 2, 1, 1, 1, 3, 2, 9, 2, 4, 6, 3, 9, 1, 9, 1, 1, 4, 7, 9, 4, 6,
       7, 1, 2, 9, 2, 1, 7, 8, 4, 3, 1, 7, 5, 5, 5, 8, 1, 8, 5, 4, 5, 5,
       9, 7, 4, 5, 1, 6, 5, 4, 0, 5, 7, 3, 0, 9, 4, 3, 3, 6, 7, 6, 2, 2,
       4, 5, 2, 2, 5, 3, 4, 6, 5, 5, 7, 8], dtype=int64)

In [124]:
#join together class 0's upsampled target vector with class 1's target vector
np.concatenate((target[i_class0_upsampled], target[i_class1]))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1])

In [128]:
#join together class 0's upsampled feature matrix with class 1's feature matrix
#np.vstack((features[i_class0_upsampled, :], features[i_class1, :]))[0:5]
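
The same upsampling can be sketched with `sklearn.utils.resample`, which samples the features and the target together and accepts a `random_state` for reproducibility. The block rebuilds the imbalanced iris data so that it runs on its own; names such as `minority` and `features_balanced` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils import resample

# rebuild the imbalanced data: drop the first 40 observations, then binarize the target
iris = load_iris()
features = iris.data[40:, :]
target = np.where(iris.target[40:] == 0, 0, 1)

# boolean mask selecting the minority class (class 0)
minority = target == 0

# sample the minority class with replacement until it matches the majority class size
features_up, target_up = resample(
    features[minority],
    target[minority],
    replace=True,
    n_samples=int((~minority).sum()),
    random_state=0,
)

# join the upsampled minority class with the untouched majority class
features_balanced = np.vstack((features_up, features[~minority]))
target_balanced = np.concatenate((target_up, target[~minority]))

print(np.bincount(target_balanced))
```

Upsampling only duplicates existing minority observations, so it is usually applied to the training split alone to keep duplicates out of the evaluation data.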