# Introduction

> It is often useful to measure objects not in terms of their quantity but in terms of
some quality. We frequently represent qualitative information in categories such as
gender, color, or brand of car. However, not all categorical data is the same.
> > Sets of categories with no intrinsic ordering are called `nominal`. Examples of nominal
categories include:
> > - Blue, Red, Green
> > - Man, Woman
> > - Banana, Strawberry, Apple

>> In contrast, when a set of categories has some natural ordering we refer to it as
`ordinal`. For example:
>> - Low, Medium, High
>> - Young, Old
>> - Agree, Neutral, Disagree

> - The problem is that most
machine learning algorithms require inputs to be numerical values, whereas categorical data is often stored as strings.
> - The k-nearest neighbors algorithm is an example of an algorithm that requires
numerical data.
>> One step in the algorithm is calculating the distances between observations (often using Euclidean distance).

> - When an observation's value is a string, we need to convert the string into some
numerical format so that it can be input into the Euclidean distance equation.
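
To make the distance step concrete, here is a minimal sketch of the Euclidean distance computation that k-nearest neighbors relies on. The vectors `a` and `b` are made-up observations used only for illustration; the point is that the arithmetic works on numbers and fails on raw strings.

```python
import numpy as np

# Two hypothetical observations that have already been encoded as numbers
a = np.array([1.0, 0.0, 2.5])
b = np.array([0.0, 1.0, 2.5])

# Euclidean distance: square root of the summed squared differences
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)

# The same arithmetic is undefined for raw strings
try:
    "Texas" - "California"
    subtraction_failed = False
except TypeError:
    # Python cannot subtract one string from another
    subtraction_failed = True
print(subtraction_failed)
```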

# Encoding Nominal Categorical Features
> - We might think the proper strategy is to assign each class a numerical value (e.g.,
Texas = 1, California = 2).
> - However, when our classes have no intrinsic ordering
(e.g., Texas isn’t “less” than California), our numerical values erroneously create an
ordering that is not present.
>> - The proper strategy is to create a binary feature for each class in the original feature.
This is often called `one-hot encoding` (in machine learning literature) or `dummying`
(in statistical and research literature).
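
The sketch below illustrates why integer codes can mislead a distance-based model. The `Delaware = 3` code and the column order of the one-hot vectors are assumptions added for the example; only Texas = 1 and California = 2 come from the text.

```python
import numpy as np

# Hypothetical integer codes for three unordered states
integer_codes = {"Texas": 1, "California": 2, "Delaware": 3}

# One binary column per class (assumed column order: California, Delaware, Texas)
one_hot_codes = {
    "California": np.array([1, 0, 0]),
    "Delaware": np.array([0, 1, 0]),
    "Texas": np.array([0, 0, 1]),
}

# Integer codes imply Texas is closer to California than to Delaware
int_tx_ca = abs(integer_codes["Texas"] - integer_codes["California"])
int_tx_de = abs(integer_codes["Texas"] - integer_codes["Delaware"])

# One-hot codes keep every pair of classes equally far apart
hot_tx_ca = np.linalg.norm(one_hot_codes["Texas"] - one_hot_codes["California"])
hot_tx_de = np.linalg.norm(one_hot_codes["Texas"] - one_hot_codes["Delaware"])

print(int_tx_ca, int_tx_de)
print(hot_tx_ca, hot_tx_de)
```

With integer codes the two distances differ even though the states have no ordering; with one-hot vectors they are identical.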

In [1]:
# If you have a feature with nominal classes, which have no intrinsic ordering (e.g., apple, pear, banana),
# and you want to encode the feature into numerical values,
# you can one-hot encode the feature using scikit-learn’s LabelBinarizer:
# Import libraries
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer
# Create feature
feature = np.array([["Texas"],
 ["California"],
 ["Texas"],
 ["Delaware"],
 ["Texas"]])
# Create one-hot encoder
one_hot = LabelBinarizer()
# One-hot encode feature
one_hot.fit_transform(feature)

array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])
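
As an alternative sketch, scikit-learn also provides `OneHotEncoder`, which is designed for feature columns rather than labels. This is not the approach used in the cell above; the `"Nevada"` value is a made-up unseen class, included to show what `handle_unknown="ignore"` does.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same single-column feature as above
feature = np.array([["Texas"],
                    ["California"],
                    ["Texas"],
                    ["Delaware"],
                    ["Texas"]])

# handle_unknown="ignore" encodes unseen classes as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")

# The default output is a SciPy sparse matrix; toarray() makes it dense for display
encoded = encoder.fit_transform(feature).toarray()
print(encoded)

# A class that was not present during fit becomes a row of zeros
unseen = encoder.transform(np.array([["Nevada"]])).toarray()
print(unseen)
```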

In [3]:
# We can use the classes_ attribute to output the classes:
# View feature classes
one_hot.classes_

array(['California', 'Delaware', 'Texas'], dtype='<U10')

In [4]:
# If we want to reverse the one-hot encoding, we can use inverse_transform:
# Reverse one-hot encoding
one_hot.inverse_transform(one_hot.transform(feature))

array(['Texas', 'California', 'Texas', 'Delaware', 'Texas'], dtype='<U10')

In [6]:
# We can even use pandas to one-hot encode the feature:
# Import library
import pandas as pd
# Create dummy variables from feature
pd.get_dummies(feature[:,0])

   California  Delaware  Texas
0       False     False   True
1        True     False  False
2       False     False   True
3       False      True  False
4       False     False   True


In [8]:
# One helpful feature of scikit-learn is the ability to handle a situation where each
# observation lists multiple classes:
# Create multiclass feature
multiclass_feature = [("Texas", "Florida"),
 ("California", "Alabama"),
 ("Texas", "Florida"),
 ("Delaware", "Florida"),
 ("Texas", "Alabama")]
# Create multiclass one-hot encoder
one_hot_multiclass = MultiLabelBinarizer()
# One-hot encode multiclass feature
one_hot_multiclass.fit_transform(multiclass_feature)

array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])

In [9]:
# Once again, we can see the classes with the classes_ attribute:
# View classes
one_hot_multiclass.classes_

array(['Alabama', 'California', 'Delaware', 'Florida', 'Texas'],
      dtype=object)
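
The multilabel encoding can also be reversed. The sketch below assumes the same `multiclass_feature` list as above and uses `MultiLabelBinarizer.inverse_transform`; note that the recovered tuples follow the order of `classes_`, so they are compared here as sets.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Same multilabel feature as above
multiclass_feature = [("Texas", "Florida"),
                      ("California", "Alabama"),
                      ("Texas", "Florida"),
                      ("Delaware", "Florida"),
                      ("Texas", "Alabama")]

one_hot_multiclass = MultiLabelBinarizer()
encoded = one_hot_multiclass.fit_transform(multiclass_feature)

# Map each binary row back to the tuple of classes it represents
recovered = one_hot_multiclass.inverse_transform(encoded)

# Element order inside each tuple may differ from the input, so compare as sets
round_trip_ok = [set(r) for r in recovered] == [set(t) for t in multiclass_feature]
print(round_trip_ok)
```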

# Encoding Ordinal Categorical Features
> Often we have a feature with classes that have some kind of natural ordering.
>> A famous example is the Likert scale:
>>> Strongly Agree • Agree • Neutral • Disagree • Strongly Disagree

> - When encoding the feature for use in machine learning, we need to transform the
ordinal classes into numerical values that maintain the notion of ordering.

In [10]:
# If you have an ordinal categorical feature (e.g., high, medium, low),
# and you want to transform it into numerical values,
# use the pandas Series replace method to transform string labels to numerical equivalents:
# Load library
import pandas as pd
# Create features
dataframe = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High"]})
# Create mapper
scale_mapper = {"Low":1,
 "Medium":2,
 "High":3}
# Replace feature values with scale
dataframe["Score"].replace(scale_mapper)

0    1
1    1
2    2
3    2
4    3
Name: Score, dtype: int64
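
Two alternative ways to get the same ordinal codes are sketched below, as options rather than as the method used in this chapter. `Series.map` turns any label missing from the mapper into `NaN`, which makes typos visible, and an ordered `pd.Categorical` stores the ordering explicitly. The variable names are illustrative.

```python
import pandas as pd

scores = pd.Series(["Low", "Low", "Medium", "Medium", "High"])
scale_mapper = {"Low": 1, "Medium": 2, "High": 3}

# map() returns NaN for any label missing from the mapper, which exposes typos
mapped = scores.map(scale_mapper)
print(mapped.tolist())

# An ordered categorical stores the ordering itself; its codes start at 0
ordered = pd.Categorical(scores,
                         categories=["Low", "Medium", "High"],
                         ordered=True)

# Shift by one so the codes match the mapper used above
codes = ordered.codes + 1
print(codes.tolist())
```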

In [11]:
# It is important that our choice of numeric values is based on our prior information on
# the ordinal classes. In our solution, high is literally three times larger than low. This
# is fine in many instances but can break down if the assumed intervals between the classes are not equal:
dataframe = pd.DataFrame({"Score": ["Low",
                                    "Low",
                                    "Medium",
                                    "Medium",
                                    "High",
                                    "Barely More Than Medium"]})
scale_mapper = {"Low":1,
                "Medium":2,
                "Barely More Than Medium":3,
                "High":4}
dataframe["Score"].replace(scale_mapper)

0    1
1    1
2    2
3    2
4    4
5    3
Name: Score, dtype: int64

In [12]:
# A better approach:
# Choose numerical values that reflect how far apart the classes actually are:
scale_mapper = {"Low":1,
                "Medium":2,
                "Barely More Than Medium":2.1,
                "High":3}
dataframe["Score"].replace(scale_mapper)

0    1.0
1    1.0
2    2.0
3    2.0
4    3.0
5    2.1
Name: Score, dtype: float64

# Encoding Dictionaries of Features
> - Sometimes each observation arrives as a dictionary rather than as a row of a matrix. This is a common situation when working with natural language processing.
> - For example, we might have a collection of documents and for each document we have
a dictionary containing the number of times every word appears in the document.
>> Using `DictVectorizer`, we can easily create a feature matrix where every feature is
the number of times a word appears in each document:
>>> 1. Create word count dictionaries for four documents
>>>`doc_1_word_count = {"Red": 2, "Blue": 4}` <br>
`doc_2_word_count = {"Red": 4, "Blue": 3}`<br>
`doc_3_word_count = {"Red": 1, "Yellow": 2}`<br>
`doc_4_word_count = {"Red": 2, "Yellow": 2}`<br><br>
>>> 2. Create list
`doc_word_counts = [doc_1_word_count,
 doc_2_word_count,
 doc_3_word_count,
 doc_4_word_count]`<br>
>>> 3. Convert list of word count dictionaries into feature matrix

In [19]:
# If you have a dictionary and want to convert it into a feature matrix, use DictVectorizer:
# Import library
from sklearn.feature_extraction import DictVectorizer
# Create dictionary
data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

# Create dictionary vectorizer
dictvectorizer = DictVectorizer(sparse=False)
# Convert dictionary to feature matrix
features = dictvectorizer.fit_transform(data_dict)
# View feature matrix
features

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

- By default `DictVectorizer` outputs a `sparse matrix` that only stores elements with
a value other than 0. This can be very helpful when we have massive matrices
(often encountered in natural language processing) and want to minimize the memory requirements.
- We can force `DictVectorizer` to output a dense matrix using
`sparse=False`.
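
A short sketch of the default sparse behavior, using the same word-count dictionaries as above. The `nnz` attribute of the SciPy sparse matrix reports how many elements are actually stored; the variable names are illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

# With the default settings the result is a SciPy sparse matrix
sparse_vectorizer = DictVectorizer()
sparse_features = sparse_vectorizer.fit_transform(data_dict)

# Number of stored elements versus total cells in the matrix
stored = sparse_features.nnz
total = sparse_features.shape[0] * sparse_features.shape[1]
print(stored, total)

# Convert to a dense array only when the matrix is small enough to fit in memory
dense_features = sparse_features.toarray()
print(dense_features)
```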

In [23]:
# We can get the names of each generated feature using the feature_names_ attribute:
# Get feature names
feature_names = dictvectorizer.feature_names_
# View feature names
feature_names

['Blue', 'Red', 'Yellow']

In [24]:
# While not necessary, for the sake of illustration we can create a pandas DataFrame to
# view the output better:
# Import library
import pandas as pd
# Create dataframe from features
pd.DataFrame(features, columns=feature_names)

   Blue  Red  Yellow
0   4.0  2.0     0.0
1   3.0  4.0     0.0
2   0.0  1.0     2.0
3   0.0  2.0     2.0


- In the above toy example there are only three unique words (Red, Yellow, Blue), so there
are only three features in our matrix.
- However, you can imagine that if each document
was actually a book in a university library, our feature matrix would be very large (and
then we would want to set `sparse` to `True`).
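
The word-count dictionaries in this section were written by hand. As a rough sketch of where such dictionaries could come from, the standard library's `collections.Counter` can count words per document. The two document strings are invented for the example, and splitting on whitespace is a simplifying assumption rather than a real tokenizer.

```python
from collections import Counter

# Hypothetical raw documents
documents = ["red blue blue red blue blue",
             "red yellow yellow"]

# Count the words in each document; dict() makes the result a plain dictionary
doc_word_counts = [dict(Counter(doc.split())) for doc in documents]
print(doc_word_counts)
```

The resulting list of dictionaries has the shape that `DictVectorizer.fit_transform` expects.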

# Imputing Missing Class Values
> When a categorical feature has missing values, the ideal solution is to open
our toolbox of machine learning algorithms to predict the values of the missing
observations.
> > We can accomplish this by treating the feature with the missing values
as the target vector and the other features as the feature matrix.
> > > A commonly used
algorithm is KNN.
In [25]:
# If you have a categorical feature containing missing values that you want to replace with predicted values,
# the ideal solution is to train a machine learning classifier algorithm to predict the
# missing values, commonly a k-nearest neighbors (KNN) classifier:
# Load libraries
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# Create feature matrix with categorical feature
X = np.array([[0, 2.10, 1.45],
 [1, 1.18, 1.33],
 [0, 1.22, 1.27],
 [1, -0.21, -1.19]])
# Create feature matrix with missing values in the categorical feature
X_with_nan = np.array([[np.nan, 0.87, 1.31],
 [np.nan, -0.67, -0.22]])
# Train KNN learner
clf = KNeighborsClassifier(3, weights='distance')
trained_model = clf.fit(X[:,1:], X[:,0])
# Predict class of missing values
imputed_values = trained_model.predict(X_with_nan[:,1:])
# Join column of predicted class with their other features
X_with_imputed = np.hstack((imputed_values.reshape(-1,1), X_with_nan[:,1:]))
# Join two feature matrices
np.vstack((X_with_imputed, X))

array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])
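
The same idea can be wrapped in a small helper that works on a single array containing `NaN` values. This is a sketch under the assumption that only one column has missing values; the function name `impute_column_with_knn` and its parameters are hypothetical, not part of scikit-learn. Unlike the cell above, it keeps the rows in their original order.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def impute_column_with_knn(X, column=0, n_neighbors=3):
    """Fill NaNs in one categorical column using KNN trained on the complete rows."""
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X[:, column])
    if not missing.any():
        return X
    # All columns except the one being imputed act as the feature matrix
    other_columns = np.delete(np.arange(X.shape[1]), column)
    clf = KNeighborsClassifier(n_neighbors, weights="distance")
    clf.fit(X[~missing][:, other_columns], X[~missing, column])
    # Predict the missing classes and write them back in place
    X[missing, column] = clf.predict(X[missing][:, other_columns])
    return X

# Same values as in the cell above, stacked into one array
X_all = np.array([[np.nan, 0.87, 1.31],
                  [np.nan, -0.67, -0.22],
                  [0, 2.10, 1.45],
                  [1, 1.18, 1.33],
                  [0, 1.22, 1.27],
                  [1, -0.21, -1.19]])

X_filled = impute_column_with_knn(X_all)
print(X_filled)
```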

In [26]:
# An alternative solution is to fill in missing values with the feature’s most frequent value:
from sklearn.impute import SimpleImputer
# Join the two feature matrices
X_complete = np.vstack((X_with_nan, X))
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X_complete)

array([[ 0.  ,  0.87,  1.31],
       [ 0.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

# Handling Imbalanced Classes
> - When classes are imbalanced, our best strategy is simply to collect more observations, especially observations
from the minority class. However, often this is just not possible, so we have to resort
to other options.
>________________________________________
>> - A second strategy is to use a model evaluation metric better suited to imbalanced
classes.
>>> - Accuracy is often used as a metric for evaluating the performance of a model,
but when imbalanced classes are present, accuracy can be ill suited. For example,
if only 0.5% of observations have some rare cancer, then even a naive model that
predicts nobody has cancer will be 99.5% accurate. Clearly this is not ideal. Some
better metrics we discuss in later chapters are confusion matrices, precision, recall, F1
scores, and ROC curves.
>>___________________________________________
>> - A third strategy is to use the class weighting parameters included in implementations
of some models. This allows the algorithm to adjust for imbalanced classes. Fortunately,
many scikit-learn classifiers have a `class_weight` parameter, making it a good
option.
>>__________________________________________
>> - The fourth and fifth strategies are related: downsampling and upsampling. In downsampling
we create a random subset of the majority class of equal size to the minority
class.
>>> - In upsampling we repeatedly sample with replacement from the minority class
to make it equal in size to the majority class.
>>> - The decision between using downsampling and upsampling is context-specific, and in general we should try both to see
which produces better results.
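
The accuracy example above can be checked with a few lines of plain Python. The population size of 1,000 is an assumption chosen so that 0.5% corresponds to 5 positive cases.

```python
# 1,000 hypothetical patients, 0.5% (5) of whom have the rare cancer
y_true = [1] * 5 + [0] * 995

# A naive model that predicts nobody has cancer
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual positives that the model found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy, recall)
```

The model scores 99.5% accuracy while finding none of the positive cases, which is why recall and related metrics are more informative here.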

In [27]:
# If you have a target vector with highly imbalanced classes
# and you want to make adjustments so that you can handle the class imbalance:
# 1- Collect more data.
# 2- If that isn’t possible, change the metrics used to evaluate your
# model.
# 3- If that doesn’t work, consider using a model’s built-in class weight parameters
# (if available), downsampling, or upsampling. We cover evaluation metrics in a
# later chapter, so for now let’s focus on class weight parameters, downsampling, and
# upsampling.
# ---------------------------------------------------------------------------------------
# To demonstrate our solutions, we need to create some data with imbalanced classes.
# Fisher’s Iris dataset contains three balanced classes of 50 observations, each indicating
# the species of flower (Iris setosa, Iris virginica, and Iris versicolor).
# To unbalance the dataset, we remove 40 of the 50 Iris setosa observations and then
# merge the Iris virginica and Iris versicolor classes.
# The result is 10 observations of Iris setosa (class 0) and 100 observations of not Iris setosa (class 1):
# Load libraries
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load iris data
iris = load_iris()
# Create feature matrix
features = iris.data
# Create target vector
target = iris.target
# Remove first 40 observations
features = features[40:,:]
target = target[40:]
# Create binary target vector indicating if class 0
target = np.where((target == 0), 0, 1)
# Look at the imbalanced target vector
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [28]:
# Many algorithms in scikit-learn offer a parameter to weight classes during training
# to counteract the effect of their imbalance. 
# RandomForestClassifier is a popular classification algorithm and includes a class_weight parameter.
# You can pass an argument explicitly specifying the desired class weights:
# Create weights
weights = {0: 0.9, 1: 0.1}
# Create random forest classifier with weights
RandomForestClassifier(class_weight=weights)

In [30]:
# Or you can pass "balanced", which automatically creates weights inversely
# proportional to class frequencies:
# Train a random forest with balanced class weights
RandomForestClassifier(class_weight="balanced")
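
To see what "inversely proportional to class frequencies" means in numbers, the sketch below applies the formula given in the scikit-learn documentation for `class_weight="balanced"`, `n_samples / (n_classes * np.bincount(y))`, to a target vector with the same class counts as above (10 and 100). The target is rebuilt by hand here so the example is self-contained.

```python
import numpy as np

# Same class counts as the imbalanced Iris target: 10 of class 0, 100 of class 1
target = np.array([0] * 10 + [1] * 100)

# Observations per class
class_counts = np.bincount(target)

# Balanced weights: n_samples / (n_classes * count of each class)
balanced_weights = len(target) / (len(class_counts) * class_counts)
print(balanced_weights)
```

The minority class receives a weight ten times larger than the majority class, mirroring the 1:10 ratio of their frequencies.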

In [31]:
# Alternatively, we can downsample the majority class or upsample the minority class.
# In downsampling, we randomly sample without replacement from the majority class
# (i.e., the class with more observations) to create a new subset of observations equal
# in size to the minority class:

# Indices of each class's observations
i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]
# Number of observations in each class
n_class0 = len(i_class0)
n_class1 = len(i_class1)
# For every observation of class 0, randomly sample
# from class 1 without replacement
i_class1_downsampled = np.random.choice(i_class1, size=n_class0, replace=False)
# Join together class 0's target vector with the
# downsampled class 1's target vector
np.hstack((target[i_class0], target[i_class1_downsampled]))
# Join together class 0's feature matrix with the
# downsampled class 1's feature matrix
np.vstack((features[i_class0,:], features[i_class1_downsampled,:]))[0:5]

array([[5. , 3.5, 1.3, 0.3],
       [4.5, 2.3, 1.3, 0.3],
       [4.4, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [5.1, 3.8, 1.9, 0.4]])
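
The cell above builds the downsampled target and features in two separate expressions and uses an unseeded random draw. A possible refinement, sketched here, is a helper that returns both arrays together and takes a seed for reproducibility. The function name `downsample_majority`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def downsample_majority(features, target, majority_class=1, seed=0):
    """Return features and target with the majority class cut to the minority size."""
    rng = np.random.default_rng(seed)
    i_minority = np.where(target != majority_class)[0]
    i_majority = np.where(target == majority_class)[0]
    # Sample without replacement so no majority observation is repeated
    i_majority_down = rng.choice(i_majority, size=len(i_minority), replace=False)
    keep = np.concatenate((i_minority, i_majority_down))
    # Index both arrays with the same rows so they stay aligned
    return features[keep], target[keep]

# Toy data: 2 observations of class 0 and 10 of class 1
toy_features = np.arange(24, dtype=float).reshape(12, 2)
toy_target = np.array([0, 0] + [1] * 10)

features_down, target_down = downsample_majority(toy_features, toy_target)
print(features_down.shape, np.bincount(target_down))
```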

In [32]:
# Our other option is to upsample the minority class. 
# In upsampling, for every observation in the majority class,
# we randomly select an observation from the minority class with replacement.
# The result is the same number of observations from the minority and majority classes. 
# Upsampling is implemented very similarly to downsampling, just in reverse:
# For every observation in class 1, randomly sample from class 0 with
# replacement
i_class0_upsampled = np.random.choice(i_class0, size=n_class1, replace=True)
# Join together class 0's upsampled target vector with class 1's target vector
np.concatenate((target[i_class0_upsampled], target[i_class1]))
# Join together class 0's upsampled feature matrix with class 1's feature matrix
np.vstack((features[i_class0_upsampled,:], features[i_class1,:]))[0:5]

array([[5. , 3.5, 1.3, 0.3],
       [5.1, 3.8, 1.9, 0.4],
       [4.4, 3.2, 1.3, 0.2],
       [5.1, 3.8, 1.6, 0.2],
       [5.1, 3.8, 1.6, 0.2]])
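
Upsampling can also be written with scikit-learn's `resample` utility, which samples several arrays consistently so features and target stay aligned. This is an alternative sketch, not the approach used in the cell above; the toy data and variable names are made up for the example.

```python
import numpy as np
from sklearn.utils import resample

# Toy data: 2 observations of class 0 (minority) and 10 of class 1 (majority)
toy_features = np.arange(24, dtype=float).reshape(12, 2)
toy_target = np.array([0, 0] + [1] * 10)

minority = toy_target == 0
n_majority = int((~minority).sum())

# Sample the minority rows with replacement until they match the majority count
features_min_up, target_min_up = resample(toy_features[minority],
                                          toy_target[minority],
                                          replace=True,
                                          n_samples=n_majority,
                                          random_state=0)

# Join the upsampled minority class with the untouched majority class
features_up = np.vstack((features_min_up, toy_features[~minority]))
target_up = np.concatenate((target_min_up, toy_target[~minority]))
print(features_up.shape, np.bincount(target_up))
```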

# END of Chapter 5 --> Handling Categorical Data