# Introduction

A **categorical variable** takes only a limited number of values.  

- Consider a survey that asks how often you eat breakfast and provides four options: "Never", "Rarely", "Most days", or "Every day".  In this case, the data is categorical, because responses fall into a fixed set of categories.
- If people responded to a survey about which brand of car they owned, the responses would fall into categories like "Honda", "Toyota", and "Ford".  In this case, the data is also categorical.

You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first.  In this tutorial, we'll compare three approaches that you can use to prepare your categorical data.

# Three Approaches

### 1) Drop Categorical Variables

The easiest approach to dealing with categorical variables is to simply remove them from the dataset.  This approach will only work well if the columns do not contain useful information.
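As a minimal sketch of this approach, the snippet below keeps only the numeric columns of a small, made-up DataFrame. The data and column names are illustrative assumptions, not the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": ["h", "u", "h"],          # categorical (text) column
    "Landsize": [150.0, 0.0, 420.0],
})

# Keep only the numeric columns, which drops the text column "Type"
numeric_only = df.select_dtypes(include="number")

print(list(numeric_only.columns))  # ['Rooms', 'Landsize']
```

Selecting by `"number"` rather than listing the columns to drop keeps the snippet independent of how many categorical columns the data has.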

### 2) Ordinal Encoding

**Ordinal encoding** assigns each unique value to a different integer.

![tut3_ordinalencode](https://storage.googleapis.com/kaggle-media/learn/images/tEogUAr.png)

This approach assumes an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).

This assumption makes sense in this example, because there is an indisputable ranking to the categories.  Not all categorical variables have a clear ordering in the values, but we refer to those that do as **ordinal variables**.  For tree-based models (like decision trees and random forests), you can expect ordinal encoding to work well with ordinal variables.
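One possible way to apply this encoding with pandas is sketched below. It assumes the breakfast survey from the introduction and states the ranking explicitly, so the integers follow the intended order rather than alphabetical order. The sample responses are made up.

```python
import pandas as pd

# Hypothetical survey responses
breakfast = pd.Series(["Rarely", "Every day", "Never", "Most days", "Rarely"])

# State the ranking explicitly, from lowest to highest
order = ["Never", "Rarely", "Most days", "Every day"]

# An ordered Categorical stores each value as the integer position of its category
categorical = pd.Categorical(breakfast, categories=order, ordered=True)
encoded = pd.Series(categorical.codes)

print(encoded.tolist())  # [1, 3, 0, 2, 1]
```

Any response that is not in `order` would receive the code -1, so it is worth checking for that value before training a model.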

### 3) One-Hot Encoding

**One-hot encoding** creates new columns indicating the presence (or absence) of each possible value in the original data.  To understand this, we'll work through an example.

![tut3_onehot](https://storage.googleapis.com/kaggle-media/learn/images/TW5m0aJ.png)

In the original dataset, "Color" is a categorical variable with three categories: "Red", "Yellow", and "Green".  The corresponding one-hot encoding contains one column for each possible value, and one row for each row in the original dataset.  Wherever the original value was "Red", we put a 1 in the "Red" column; if the original value was "Yellow", we put a 1 in the "Yellow" column, and so on.  

In contrast to ordinal encoding, one-hot encoding *does not* assume an ordering of the categories.  Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., "Red" is neither _more_ nor _less_ than "Yellow").  We refer to categorical variables without an intrinsic ranking as **nominal variables**.

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values, because every value adds a column. As a rule of thumb, you generally won't use it for variables taking more than 15 different values.
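The "Color" example can be reproduced in a few lines. This is only a sketch on made-up rows; it uses the pandas `get_dummies()` function, with `dtype=int` so the result prints as 0s and 1s.

```python
import pandas as pd

# Hypothetical rows for the "Color" example
colors = pd.DataFrame({"Color": ["Red", "Red", "Yellow", "Green", "Yellow"]})

# The number of unique values tells us how many columns one-hot encoding will create
n_values = colors["Color"].nunique()

# One column per value; each row has a single 1 in the column matching its color
one_hot = pd.get_dummies(colors["Color"], dtype=int)

print(n_values)               # 3
print(list(one_hot.columns))  # ['Green', 'Red', 'Yellow']
print(one_hot)
```

Checking `nunique()` first is a cheap way to decide whether a column is small enough for one-hot encoding.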

# Example

We will work with the [Melbourne Housing dataset](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home).  

We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in `X_train`, `X_valid`, `y_train`, and `y_valid`.

In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('https://raw.githubusercontent.com/modengann/Data-sets/master/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

After running the above code, we have the training and validation data in `X_train`, `X_valid`, `y_train`, and `y_valid`.

We take a peek at the training data with the `head()` method below.

In [None]:
X_train.head()

|  | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 0.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 193.0 | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 555.0 | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 265.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 673.0 | -37.7623 | 144.8272 | 4217.0 |


In [None]:
y_train

12167     481000.0
6524      895000.0
8413      651500.0
2919      482500.0
6043      591000.0
           ...    
13123    1280000.0
3264      915000.0
9845     1020000.0
10799     760000.0
2732     1225000.0
Name: Price, Length: 10864, dtype: float64

Next, we obtain a list of all of the categorical variables in the training data.

We do this by checking the data type (or **dtype**) of each column.  The `object` dtype indicates a column has text (there are other things it could theoretically be, but that's unimportant for our purposes).  For this dataset, the columns with text indicate categorical variables.

In [None]:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['Type', 'Method', 'Regionname']


In [None]:
X_train.dtypes

Type              object
Method            object
Regionname        object
Rooms              int64
Distance         float64
Postcode         float64
Bedroom2         float64
Bathroom         float64
Landsize         float64
Lattitude        float64
Longtitude       float64
Propertycount    float64
dtype: object

We have three categorical columns here: Type, Method, and Regionname. We can create dummy columns for each of these.

## One-hot encoding
One-hot encoding converts n categories into n features. You can use the pandas `get_dummies()` function to one-hot encode columns. The function takes a DataFrame and, optionally, a list of the categorical columns to convert (the `columns` argument), and returns a new DataFrame in which those columns are replaced by their one-hot encoded versions. If no list is given, every text (`object`) or `category` column is encoded. The `prefix` argument sets the prefix of the new column names, which can improve readability; passing a list supplies one prefix per encoded column, as in the cell below.

In [None]:
one_hot_encoded_training_predictors = pd.get_dummies(X_train, prefix=['type', 'method', 'reg'])
one_hot_encoded_training_predictors

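|  | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount | type_h | ... | method_SP | method_VB | reg_Eastern Metropolitan | reg_Eastern Victoria | reg_Northern Metropolitan | reg_Northern Victoria | reg_South-Eastern Metropolitan | reg_Southern Metropolitan | reg_Western Metropolitan | reg_Western Victoria |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12167 | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 0.0 | -37.85984 | 144.98670 | 13240.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6524 | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 193.0 | -37.85800 | 144.90050 | 6380.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8413 | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 555.0 | -37.79880 | 144.82200 | 3755.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2919 | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 265.0 | -37.70830 | 144.91580 | 8870.0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6043 | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 673.0 | -37.76230 | 144.82720 | 4217.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13123 | 3 | 5.2 | 3056.0 | 3.0 | 1.0 | 212.0 | -37.77695 | 144.95785 | 11918.0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3264 | 3 | 10.5 | 3081.0 | 3.0 | 1.0 | 748.0 | -37.74160 | 145.04810 | 2947.0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9845 | 4 | 6.7 | 3058.0 | 4.0 | 2.0 | 441.0 | -37.73572 | 144.97256 | 11204.0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 10799 | 3 | 12.0 | 3073.0 | 3.0 | 1.0 | 606.0 | -37.72057 | 145.02615 | 21650.0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |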


Two things to notice:

1. The original categorical columns are dropped.
2. Each new column name starts with the prefix passed to `get_dummies()`.
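Both observations are easy to check on a tiny example. The sketch below uses a made-up three-row DataFrame rather than the housing data; `dtype=int` is passed only so the output shows 0s and 1s.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
toy = pd.DataFrame({"Type": ["h", "u", "t"], "Rooms": [3, 1, 2]})

encoded = pd.get_dummies(toy, columns=["Type"], prefix="type", dtype=int)

# "Type" is gone, and every new column carries the "type_" prefix
print(list(encoded.columns))  # ['Rooms', 'type_h', 'type_t', 'type_u']
```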

## Dummy encoding
Dummy encoding, on the other hand, creates n-1 features for n categories, omitting the first category. The omitted (base) category is encoded by the absence of a 1 in all the other columns. For example, the Type column of our original DataFrame has only three possible values, so two columns are enough: the third value is indicated by 0s in both of the included columns. This is useful because the full set of one-hot columns is collinear (any one of them can be computed from the others), so the same information is represented more than once and can be overemphasized. Removing this redundancy matters most with small datasets or when there are only a few columns.

In [None]:
pd.get_dummies(X_train, columns = ["Type"], drop_first = True, prefix = "T")

|  | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount | T_t | T_u |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12167 | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 0.0 | -37.85984 | 144.98670 | 13240.0 | 0 | 1 |
| 6524 | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 193.0 | -37.85800 | 144.90050 | 6380.0 | 0 | 0 |
| 8413 | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 555.0 | -37.79880 | 144.82200 | 3755.0 | 0 | 0 |
| 2919 | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 265.0 | -37.70830 | 144.91580 | 8870.0 | 0 | 1 |
| 6043 | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 673.0 | -37.76230 | 144.82720 | 4217.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13123 | SP | Northern Metropolitan | 3 | 5.2 | 3056.0 | 3.0 | 1.0 | 212.0 | -37.77695 | 144.95785 | 11918.0 | 0 | 0 |
| 3264 | S | Eastern Metropolitan | 3 | 10.5 | 3081.0 | 3.0 | 1.0 | 748.0 | -37.74160 | 145.04810 | 2947.0 | 0 | 0 |
| 9845 | PI | Northern Metropolitan | 4 | 6.7 | 3058.0 | 4.0 | 2.0 | 441.0 | -37.73572 | 144.97256 | 11204.0 | 0 | 0 |
| 10799 | S | Northern Metropolitan | 3 | 12.0 | 3073.0 | 3.0 | 1.0 | 606.0 | -37.72057 | 145.02615 | 21650.0 | 0 | 0 |


# One-hot vs. dummies
The two methods have different advantages. One-hot encoding generally creates more explainable features, as each category gets its own weight that can be inspected after training. However, one-hot encoding may create features that are entirely collinear, because the same information is represented multiple times.

Take, for example, a simpler categorical column recording the sex of the survey takers. Once a male column records a 1 for male, a separate female column adds nothing: whenever the male column is 0, we already know the person is female. This double representation can lead to instability in your models, and dummy encoding would be more appropriate.
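The redundancy can be seen directly in a small sketch. The four responses below are made up, and the column is assumed to hold only the two values shown.

```python
import pandas as pd

# Hypothetical survey column with two possible values
sex = pd.Series(["male", "female", "female", "male"], name="sex")

# One-hot encoding: two columns, "female" and "male"
one_hot = pd.get_dummies(sex, dtype=int)

# Dummy encoding: the first category ("female") is dropped
dummy = pd.get_dummies(sex, drop_first=True, dtype=int)

# In the one-hot version, each column is fully determined by the other
redundant = (one_hot["female"] == 1 - one_hot["male"]).all()

print(redundant)            # True
print(list(dummy.columns))  # ['male']
```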

# Other considerations
## Limiting your columns
Both one-hot encoding and dummy encoding may create a huge number of columns if a column contains too many different categories. In these cases, you may want to create columns only for the most common values. You can check the number of occurrences of each category using the `value_counts()` method on a specific column.
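As an illustration, the sketch below counts the categories of a short, made-up column; the values are placeholders rather than the real Regionname data.

```python
import pandas as pd

# Hypothetical categorical column
regions = pd.Series(
    ["North", "North", "North", "South", "South", "East", "West"],
    name="Regionname",
)

# Occurrences of each category, most common first
counts = regions.value_counts()

print(counts)
```

The result is a Series indexed by category, so `counts["North"]` returns 3 here.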

Once you have the counts, you can use them to limit which values you include by first creating a mask for the values that occur fewer than n times. A mask is a series of booleans indicating which values in a column should be affected. First, find the categories that occur fewer than n times by filtering the counts and taking the `index` attribute of the result, then pass these categories to the `isin()` method of the column. With the mask in hand, you can replace the categories that occur fewer than n times with a value of your choice, such as "Other".
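These steps might look like the following sketch. The data, the threshold `n = 2`, and the replacement label "Other" are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame(
    {"Regionname": ["North", "North", "North", "South", "South", "East", "West"]}
)

n = 2  # hypothetical threshold: categories seen fewer than n times are "rare"

counts = df["Regionname"].value_counts()

# Categories that occur fewer than n times
rare = counts[counts < n].index

# Boolean mask: True for rows whose category is rare
mask = df["Regionname"].isin(rare)

# Replace the rare categories with a single catch-all value
df.loc[mask, "Regionname"] = "Other"

print(df["Regionname"].value_counts().to_dict())
```

After the replacement, the column has three categories instead of four, so encoding it creates fewer columns.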