# Cardinality - Categorical Variables

## Introduction

In this section, we will look at what cardinality is and what we need to consider when building machine learning models with highly cardinal variables.

## Cardinality definition

- The values of a categorical variable are selected from a group of categories (also called labels). That is, the values of categorical variables are generally strings and not numbers.
- The number of different labels in a variable is known as **cardinality**.
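As a minimal sketch of this definition, the cardinality of each column can be counted with pandas. The toy data and the column names `gender` and `city` below are made up for illustration and are not the table used in this section.

```python
import pandas as pd

# Hypothetical toy data; values and column names are illustrative only.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "city": ["London", "Leeds", "York", "Bath", "London"],
})

# nunique() counts the distinct labels in each column, i.e. its cardinality.
cardinality = df.nunique()
print(cardinality)
```

Here `gender` has 2 distinct labels and `city` has 4, so `city` is the more cardinal of the two.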

## Cardinality examples

- The variable gender contains only two labels (the cardinality is 2) in this example
- Vehicle Make contains nine labels in the example table
- Variables such as city or postcode can contain a huge number of different labels. They are highly cardinal.

![](../imgs/cardinality.png)


## Cardinality effects

Are multiple labels in a categorical variable a challenge for machine learning?

What are the things we need to consider when working with categorical and highly cardinal variables? 

## Cardinality: Impacts

There are a few things to consider when working with categorical variables:

- First, scikit-learn, which is the standard Python library for machine learning, does not support strings as inputs. Therefore, we need to transform those strings into numbers if we want to use them.
- Second, highly cardinal variables may lead to an uneven distribution of the labels between the train and test sets. This, in turn, may result in:
  - Overfitting, particularly in tree-based algorithms
  - Operational problems when we want to use those models live

# Strings and categorical encoding

Scikit-learn does not support strings as inputs. Therefore, we need to encode those categories into numbers.

We can use a variety of categorical variable encoding techniques to do so. These techniques can alter the feature space and also the interactions between the variables.
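To illustrate how the choice of encoding changes the feature space, here is a small sketch comparing two of scikit-learn's encoders. The column name `make` and its values are assumptions made for the example; which encoder suits a given model is a separate question.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical single-column dataset with three distinct labels.
X = pd.DataFrame({"make": ["Ford", "BMW", "Ford", "Audi"]})

# Ordinal encoding: one numeric column, each label mapped to an integer code.
ordinal = OrdinalEncoder().fit_transform(X)

# One-hot encoding: one binary column per label, so the feature space grows
# with the cardinality of the variable.
one_hot = OneHotEncoder().fit_transform(X).toarray()

print(ordinal.shape)  # (4, 1)
print(one_hot.shape)  # (4, 3)
```

With one-hot encoding, a variable with hundreds of labels would produce hundreds of columns, which is one reason cardinality matters when picking an encoding.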

# Uneven distribution between train and test sets

When splitting the data into train and test sets, some categories of the variable may land only in the train set and some only in the test set.

If labels or categories land only on the training set, the trained models may end up over-fitting to those labels. 

On the other hand, if categories appear only in the test set, the models will not know how to interpret them because they did not see those labels during training.

For example, let's look at the variable `Vehicle Make` in this table. When we divide the dataset into train and test sets, we can see that some labels like Mercedes, Citroen and Nissan appear only in the training set, whereas other labels like Seat, Toyota and BMW appear only in the test set.

![](../imgs/cardinality2.png)

Those labels that appear only on the train set may cause over-fitting, whereas those labels that appear only on the test set may cause operational problems. 
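A quick way to check for this situation is to compare the sets of labels in each split. The sketch below uses a hypothetical `vehicle_make` column and a fixed positional split so the result is reproducible; the rows are invented to mirror the labels mentioned above, not copied from the table.

```python
import pandas as pd

# Hypothetical data echoing the labels discussed above.
df = pd.DataFrame({"vehicle_make": [
    "Mercedes", "Ford", "Citroen", "Audi", "Nissan", "Ford",  # train rows
    "Seat", "Audi", "Toyota", "BMW", "Ford",                  # test rows
]})

# Fixed positional split for illustration (a random split works the same way).
train, test = df.iloc[:6], df.iloc[6:]

train_labels = set(train["vehicle_make"])
test_labels = set(test["vehicle_make"])

# Labels seen in only one of the two sets.
only_in_train = sorted(train_labels - test_labels)
only_in_test = sorted(test_labels - train_labels)

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```

In this toy split, Citroen, Mercedes and Nissan are found only in the train set, and BMW, Seat and Toyota only in the test set.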

## Overfitting

Why can those labels cause over-fitting?

### Cardinality and overfitting

- Variables with too many labels tend to dominate over those with fewer labels, particularly in **tree-based algorithms**.

- A large number of labels within a variable may introduce noise with little, if any, information.

- Reducing cardinality may help improve model performance 
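This section does not prescribe a specific way to reduce cardinality. One common option, shown here only as an assumption, is to group infrequent labels into a single category. The 10% threshold, the `"Other"` label and the `city` data are all hypothetical choices.

```python
import pandas as pd

# Hypothetical variable: two frequent labels and three rare ones.
s = pd.Series(["London"] * 5 + ["Leeds"] * 4 + ["York", "Bath", "Hull"], name="city")

# Relative frequency of each label.
freq = s.value_counts(normalize=True)

# Labels below an (arbitrary) 10% frequency threshold are treated as rare.
rare = freq[freq < 0.10].index

# Keep frequent labels as they are; replace rare ones with "Other".
reduced = s.where(~s.isin(rare), "Other")

print(s.nunique(), "->", reduced.nunique())
```

In practice, the rare labels would be identified on the train set only and the same mapping then applied to the test set.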

## Operational problems

- Models learn how to make predictions using the information they were shown in the training set.
- For new, unseen categories, the models will be unable to perform a calculation. They will not be able to understand what to make of the new label, and this could end up in the return of an error instead of a prediction. 
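In a scikit-learn workflow, this kind of failure often shows up at the encoding step, before the model is even called. The sketch below, using a hypothetical `make` column, shows `OneHotEncoder` raising an error on an unseen label with its default settings, and one possible mitigation via `handle_unknown="ignore"`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical splits: "Toyota" never appears in the training data.
train = pd.DataFrame({"make": ["Ford", "Audi", "Nissan"]})
test = pd.DataFrame({"make": ["Ford", "Toyota"]})

# Default behaviour: an unknown category raises an error at transform time.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    failed = False
except ValueError as err:
    failed = True
    print("Encoding failed:", err)

# With handle_unknown="ignore", an unseen label is encoded as all zeros.
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = tolerant.transform(test).toarray()
print(encoded)
```

Ignoring unknown labels avoids the error, but the model still has no information about the new category, so its prediction for those rows should be treated with care.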

## Summary

- Strings need to be encoded as numbers for use with Scikit-Learn
- High cardinality may cause overfitting and operational problems
- Reducing cardinality may improve model performance