## Categorical Variables

### Pre. Obtain a list of all categorical variables
```python
    s = (X_train.dtypes == 'object')
    object_cols = list(s[s].index)
```

### Option 1. Drop Categorical Variables

* This approach only works well if the dropped columns do not contain useful information.

```python
    drop_X_train = X_train.select_dtypes(exclude=['object'])
    drop_X_valid = X_valid.select_dtypes(exclude=['object'])
```

### Option 2. Ordinal Encoding

* Assigns each unique value a different integer.
* Validation data often contains values that do not appear in the training data, which causes an error when the fitted encoder transforms them.
* There are several ways to fix this issue:
  - you can write a custom ordinal encoder to deal with new categories,
  - or you can drop the problematic categorical columns.

```python
    from sklearn.preprocessing import OrdinalEncoder

    # Categorical columns in the training data
    object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

    # Columns that can be safely ordinal encoded
    good_label_cols = [col for col in object_cols if 
                    set(X_valid[col]).issubset(set(X_train[col]))]
            
    # Problematic columns that will be dropped from the dataset
    bad_label_cols = list(set(object_cols)-set(good_label_cols))

    # Drop categorical columns that will not be encoded
    label_X_train = X_train.drop(bad_label_cols, axis=1)
    label_X_valid = X_valid.drop(bad_label_cols, axis=1)

    # Apply ordinal encoder 
    ordinal_encoder = OrdinalEncoder()
    label_X_train[good_label_cols] = ordinal_encoder.fit_transform(X_train[good_label_cols])
    label_X_valid[good_label_cols] = ordinal_encoder.transform(X_valid[good_label_cols])
```
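
As an alternative to dropping columns, the following is a minimal sketch of letting the encoder itself absorb unseen categories. It assumes scikit-learn 0.24 or later, where `OrdinalEncoder` accepts `handle_unknown="use_encoded_value"` together with `unknown_value`. The toy frames `toy_train` and `toy_valid` and the sentinel `-1` are made up for illustration and are not part of the original notes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "purple" appears only in the validation set
toy_train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
toy_valid = pd.DataFrame({"color": ["blue", "purple"]})

# Categories not seen during fit are mapped to the sentinel value -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(toy_train[["color"]])
valid_encoded = encoder.transform(toy_valid[["color"]])

# Known categories get their sorted position (blue=0, green=1, red=2)
print(valid_encoded.ravel())
```

Whether a sentinel such as `-1` is reasonable depends on the model: tree-based models usually tolerate it, while linear models will treat it as just another point on the numeric scale.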

### Option 3. One-Hot Encoding

* creates new columns indicating the presence (or absence) of each possible value in the original data.
* does not assume an ordering of the categories.
* works particularly well if there is no clear ordering in the categorical data.
* does not perform well if the categorical variable takes on a large number of values.

```python
    # cf) handle_unknown='ignore' : avoids errors when the validation data contains classes that aren't represented in the training data
    # cf) sparse_output=False : ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix).
    #     (scikit-learn versions before 1.2 name this parameter sparse=False)
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Apply one-hot encoder to each column with categorical data
    OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
    OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

    # One-hot encoding removed index; put it back
    OH_cols_train.index = X_train.index
    OH_cols_valid.index = X_valid.index

    # Remove categorical columns (will replace with one-hot encoding)
    num_X_train = X_train.drop(object_cols, axis=1)
    num_X_valid = X_valid.drop(object_cols, axis=1)

    # Add one-hot encoded columns to numerical features
    OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
    OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

    # The encoded columns have integer names; recent scikit-learn models
    # require all column names to be strings
    OH_X_train.columns = OH_X_train.columns.astype(str)
    OH_X_valid.columns = OH_X_valid.columns.astype(str)
```

## Exercise Notes

* Check the number of unique values per column before One-Hot Encoding
```python
    # Get number of unique entries in each column with categorical data
    object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
    d = dict(zip(object_cols, object_nunique))

    # Print number of unique entries by column, in ascending order
    print(sorted(d.items(), key=lambda x: x[1]))
```

* Separate low cardinality columns from high cardinality columns
```python
    # Columns that will be one-hot encoded
    low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

    # Columns that will be dropped from the dataset
    high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
```