## Intermediate Machine Learning

Categorical Variables

https://www.kaggle.com/code/alexisbcook/categorical-variables

A **categorical variable** takes only a limited number of values. Such variables must be preprocessed before they can be used in most machine learning models.
- Examples:
    - Never, Rarely, Most days, Every day
    - Honda, Toyota, Ford

### Three Approaches

#### 1. Drop Categorical Variables
The easiest approach is to remove the categorical variables from the dataset. This works well only if the columns do not contain useful information.

#### 2. Ordinal Encoding
**Ordinal encoding** assigns each unique value to a different integer. This approach assumes that the categories are ordered: in the example below, "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3). Not all categorical variables have a clear ordering; those that do are called **ordinal variables**. Ordinal encoding can be expected to work well with ordinal variables in tree-based models such as decision trees and random forests.

| Breakfast | Encoded value |
| :--- | :---: |
| Every day | 3 |
| Never | 0 |
| Rarely | 1 |
| Most days | 2 |
| Never | 0 |
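
The mapping in the table can be sketched with plain pandas. This is a minimal illustration, assuming a toy `Breakfast` column and a hand-written `breakfast_order` dictionary (both names are made up for this example):

```python
import pandas as pd

# Toy data mirroring the table above (hypothetical, for illustration only)
df = pd.DataFrame({"Breakfast": ["Every day", "Never", "Rarely", "Most days", "Never"]})

# Explicit ranking from least to most frequent
breakfast_order = {"Never": 0, "Rarely": 1, "Most days": 2, "Every day": 3}

# Replace each category with its rank
df["Breakfast_encoded"] = df["Breakfast"].map(breakfast_order)
print(df)
```

Writing the mapping by hand keeps the integers aligned with the real-world ranking of the categories.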

#### 3. One-Hot Encoding
**One-hot encoding** creates new columns indicating the presence (or absence) of each possible value in the original data. For example:

| Color | | Red | Yellow | Green | 
| --- | --- | :---: | :---: | :---: |
| Red | | 1 | 0 | 0 |
| Red | | 1 | 0 | 0 |
| Yellow | | 0 | 1 | 0 |
| Green | | 0 | 0 | 1 |
| Yellow | | 0 | 1 | 0 |

*Color* is a categorical variable with three categories. The corresponding one-hot encoding contains a column for each possible value, and one row for each row in the original dataset. In contrast to ordinal encoding, one-hot encoding *does not* assume an ordering of the categories. You can expect this approach to work particularly well if there is no clear ordering in the categorical data. Categorical variables without an intrinsic ranking are called **nominal variables**. 
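
As a quick sketch, the table above can be reproduced with `pandas.get_dummies`. The toy `Color` column is an assumption for illustration, and `dtype=int` is passed so the output shows 0/1 rather than booleans:

```python
import pandas as pd

# Toy data mirroring the table above (hypothetical, for illustration only)
df = pd.DataFrame({"Color": ["Red", "Red", "Yellow", "Green", "Yellow"]})

# One indicator column per distinct value
one_hot = pd.get_dummies(df["Color"], dtype=int)

# get_dummies sorts the new columns alphabetically; reorder to match the table
one_hot = one_hot[["Red", "Yellow", "Green"]]
print(one_hot)
```

Each row contains exactly one 1, in the column for that row's original value.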

In [1]:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('./melbourne-housing-snapshot/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
```
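
The cardinality filter used above can be seen in isolation on a small example. This is only a sketch: the `toy` DataFrame, its values, and the threshold of 4 are made up for illustration and are not taken from the housing data.

```python
import pandas as pd

# Hypothetical data: "Type" repeats values, "Address" is unique per row
toy = pd.DataFrame({
    "Type": ["h", "u", "h", "t", "u"],
    "Address": ["1 A St", "2 B Rd", "3 C Ave", "4 D Ct", "5 E Ln"],
    "Rooms": [3, 2, 4, 3, 1],
})

# Number of unique values in each text column
cardinality = toy[["Type", "Address"]].nunique()
print(cardinality)

# Keep only the text columns below an (arbitrary) cardinality threshold
low_card = cardinality[cardinality < 4].index.tolist()
print(low_card)
```

A column like `Address` would add one new column per row if one-hot encoded, which is why high-cardinality columns are left out here.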

In [2]:
```python
X_train.head()
```

| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 0.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 193.0 | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 555.0 | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 265.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 673.0 | -37.7623 | 144.8272 | 4217.0 |


Obtain a list of all the categorical variables in the training data. This can be done by checking the data type (**dtype**) of each column. The `object` dtype indicates a column has text. For this dataset, the columns with text indicate categorical variables. 

In [3]:
```python
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables: ")
print(object_cols)
```

```text
Categorical variables:
['Type', 'Method', 'Regionname']
```


### Define Function to Measure Quality of Each Approach

Define a function `score_dataset()` to compare the three approaches to dealing with categorical variables. This function reports the mean absolute error (MAE) from a random forest model. In general, the MAE should be as low as possible.

In [4]:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
```
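
For reference, the MAE is the average of the absolute differences between the predictions and the true values. A minimal sketch with made-up numbers (the arrays below are hypothetical and not taken from the housing data):

```python
import numpy as np

# Hypothetical true prices and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Mean of the absolute errors: (10 + 10 + 30) / 3
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
```

Because the errors are not squared, the MAE is expressed in the same units as the target (here, price).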

### Score from Approach 1 (Drop Categorical Variables)

Drop the `object` columns with the `select_dtypes()` method. 

In [5]:
```python
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables): ")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
```

```text
MAE from Approach 1 (Drop categorical variables):
175703.48185157913
```


### Score from Approach 2 (Ordinal Encoding)

`scikit-learn` has an `OrdinalEncoder` class that can be used to get ordinal encodings. The encoder is fitted on the categorical columns of the training data and then applied to the validation data; each column is encoded separately.

In [6]:
```python
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding): ")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
```

```text
MAE from Approach 2 (Ordinal Encoding):
165936.40548390493
```


In the previous code cell, each unique value in a column is assigned a different integer. By default, `OrdinalEncoder` numbers the categories in sorted order, which is arbitrary with respect to their meaning. This is common practice and simpler than providing custom labels, but we can expect a boost in performance if better-informed labels are provided for all ordinal variables.
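
As a minimal sketch of what better-informed labels could look like, `OrdinalEncoder` accepts a `categories` argument listing the values of each column in the desired order. The toy `Breakfast` column below is an assumption for illustration; the housing columns used in this section do not have an obvious ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column (not part of the housing data)
df = pd.DataFrame({"Breakfast": ["Every day", "Never", "Rarely", "Most days", "Never"]})

# Default behaviour: categories are sorted alphabetically,
# so the integers do not reflect how often breakfast is eaten
default_codes = OrdinalEncoder().fit_transform(df[["Breakfast"]]).ravel()
print(default_codes)

# Explicit order: one list of categories per encoded column, lowest rank first
order = [["Never", "Rarely", "Most days", "Every day"]]
ranked_codes = OrdinalEncoder(categories=order).fit_transform(df[["Breakfast"]]).ravel()
print(ranked_codes)
```

With the explicit order, a larger integer always means a more frequent habit, which gives a tree-based model meaningful split points.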

### Score from Approach 3 (One-Hot Encoding)

The `OneHotEncoder` class from scikit-learn is used to get one-hot encodings. Several parameters customize its behavior:
- Set `handle_unknown='ignore'` to avoid errors when the validation data contains classes that aren't represented in the training data.
- Set `sparse_output=False` to ensure that the encoded columns are returned as a numpy array (instead of a sparse matrix). This parameter was named `sparse` in scikit-learn versions before 1.2.
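
To illustrate the effect of `handle_unknown='ignore'`, here is a minimal sketch on made-up data (the `Color` column and its values are assumptions for illustration, not part of the housing dataset). It calls `.toarray()` on the default sparse result instead of passing a dense-output parameter, so it should not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Blue" appears only in the validation set
train = pd.DataFrame({"Color": ["Red", "Yellow", "Green"]})
valid = pd.DataFrame({"Color": ["Red", "Blue"]})

# Learn the categories from the training data only
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Convert the sparse result to a dense numpy array for printing
encoded_valid = encoder.transform(valid).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded_valid)           # the unseen "Blue" row is all zeros
```

Without `handle_unknown='ignore'`, transforming the row containing "Blue" would raise an error.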

To use the encoder, supply only the categorical columns that we want to be one-hot encoded.

In [10]:
```python
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removes the index so it has to be put back:
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns to be replaced by one-hot encoding
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print("MAE from Approach 3 (One-Hot Encoding): ")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
```

```text
MAE from Approach 3 (One-Hot Encoding):
166089.4893009678
```


### Which approach is the best?

Dropping the categorical columns (**Approach 1**) performed the worst: it had the highest MAE.
As for **Approach 2** and **Approach 3**, the returned MAE scores are so close in value that there does not appear to be any meaningful benefit to one over the other.

In general, one-hot encoding (**Approach 3**) typically performs best and dropping the categorical columns (**Approach 1**) typically performs worst, but it varies on a case-by-case basis.

### Conclusion

A lot of datasets contain categorical data. A data scientist will be much more effective knowing how to use this common data type. 