# Cardinality

The values of a categorical variable are selected from a group of categories, also called labels. For example, in the variable _gender_ the categories are male and female, whereas in the variable _city_ the labels could be London, Manchester, Brighton, and so on.

Categorical variables can contain different numbers of categories. The variable _gender_ contains only 2 labels, but a variable like _city_ or _postcode_ can contain a huge number of labels.

The number of different labels is known as cardinality. A high number of labels within a variable is known as __high cardinality__.


## Is high cardinality a problem?

High cardinality poses the following challenges: 

- Variables with too many labels tend to dominate those with only a few labels, particularly in **decision tree-based** algorithms.

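- High cardinality may introduce noise: many labels appear in only a handful of observations, so there is little data from which to learn a reliable pattern for each one.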

- Some of the labels may only be present in the training data set and not in the test set, so machine learning algorithms may over-fit to the training set.

- Some labels may appear only in the test set, leaving the machine learning algorithms unable to perform a calculation over the new (unseen) observation.
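The two train/test points above can be checked directly by comparing the sets of labels in each split. The following is a minimal sketch on toy data; the column name `city` and its values are illustrative assumptions, not part of the demo dataset.

```python
import pandas as pd

# Hypothetical toy splits; `city` and its values are illustrative only.
train = pd.DataFrame({"city": ["London", "Manchester", "Brighton", "London"]})
test = pd.DataFrame({"city": ["London", "Leeds", "Brighton"]})

train_labels = set(train["city"].unique())
test_labels = set(test["city"].unique())

# Labels the model can learn from but will never be evaluated on.
only_in_train = train_labels - test_labels

# Labels the model has never seen when it is asked to predict.
only_in_test = test_labels - train_labels

print("Only in train:", sorted(only_in_train))
print("Only in test:", sorted(only_in_test))
```

The more labels a variable has, the more likely both sets are non-empty after a random split.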

**Algorithms based on decision trees can be biased towards variables with high cardinality**.

Below is a demo about the effect of high cardinality on the performance of various machine learning algorithms.

## In this Demo:

- Learn how to quantify cardinality.
- See examples of high and low cardinality variables.
- Understand the effect of cardinality in train and test sets.
- Evaluate the effect of cardinality on machine learning model performance.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# let's load the titanic dataset.

data = pd.read_csv('shipdata.csv')

data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


The categorical variables are Name, Sex, Ticket, Cabin and Embarked.

**Note** that Ticket and Cabin contain both letters and numbers, so they could be treated as Mixed Variables. In this demo, I will treat them as categorical.

In [4]:
# Let's inspect the cardinality: the number
# of different labels. Note that unique()
# counts a missing value (NaN) as a label.

print('Number of categories in the variable Name: {}'.format(
    len(data.Name.unique())))

print('Number of categories in the variable Sex: {}'.format(
    len(data.Sex.unique())))

print('Number of categories in the variable Embarked: {}'.format(
    len(data.Embarked.unique())))


Number of categories in the variable Name: 891
Number of categories in the variable Sex: 2
Number of categories in the variable Embarked: 4
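One detail worth knowing when counting labels this way: `len(Series.unique())` counts a missing value (NaN) as a label, whereas `Series.nunique()` ignores it by default. The sketch below uses a small made-up series (the values are illustrative, not the real column) to show the difference.

```python
import numpy as np
import pandas as pd

# Illustrative values only: three real labels plus one missing value.
embarked = pd.Series(["S", "C", "Q", np.nan, "S"])

# unique() returns NaN as one of the distinct values.
count_with_nan = len(embarked.unique())

# nunique() drops NaN unless dropna=False is passed.
count_without_nan = embarked.nunique()
count_with_nan_again = embarked.nunique(dropna=False)

print(count_with_nan, count_without_nan, count_with_nan_again)
```

Which count is appropriate depends on whether missing data will later be treated as a category of its own.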


In [5]:
# let's explore the values of Cabin.

# Cabin has 148 different labels (see the count
# printed further below), therefore the variable
# is highly cardinal.

data.Cabin.unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', ...

In [6]:
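# Let's capture the first letter of cabin.
# astype(str) turns missing values (NaN) into the string 'nan',
# so rows with a missing cabin get the letter 'n'.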

data['Cabin_reduced'] = data['Cabin'].astype(str).str[0]

data[['Cabin', 'Cabin_reduced']].head()

| | Cabin | Cabin_reduced |
|---|---|---|
| 0 | NaN | n |
| 1 | C85 | C |
| 2 | NaN | n |
| 3 | C123 | C |
| 4 | NaN | n |


In [7]:
print('Number of categories in the variable Cabin: {}'.format(
    len(data.Cabin.unique())))

print('Number of categories in the variable Cabin reduced: {}'.format(
    len(data.Cabin_reduced.unique())))

Number of categories in the variable Cabin: 148
Number of categories in the variable Cabin reduced: 9
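One of the 9 reduced categories is the letter `n`, which stands for missing cabins rather than a real deck. As a variant sketch (not what this demo does), the first letter can be taken without converting to string first, so missing values stay missing, or they can be given an explicit label. The toy series and the label `"Missing"` below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative values in the style of the Cabin column.
cabin = pd.Series([np.nan, "C85", np.nan, "C123", "E46"])

# Taking the first character via the .str accessor leaves NaN untouched.
first_letter = cabin.str[0]

# Optionally make missingness an explicit, readable category.
first_letter_explicit = first_letter.fillna("Missing")

print(first_letter_explicit.tolist())
```

Either way the cardinality is reduced to the deck letters, and the missing cabins are no longer disguised as a deck called `n`.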


The Random Forests perform considerably better on the training set than on the test set. This indicates that the model is over-fitting: it predicts the outcome well on the data it was trained on, but it fails to generalise to unseen data.
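The same gap between training and test performance can be reproduced in miniature. The sketch below is a deliberately simple stand-in, not the Random Forests model referred to above: it "learns" by memorising the mean target of each label. The data, the column name `passenger_id`, and the `predict` helper are all hypothetical.

```python
import pandas as pd

# Toy data: `passenger_id` is unique per row (very high cardinality) and
# the target alternates 0, 1, 0, 1, ..., so the id carries no real signal.
df = pd.DataFrame({
    "passenger_id": [f"id_{i}" for i in range(100)],
    "target": [i % 2 for i in range(100)],
})

train, test = df.iloc[:70], df.iloc[70:]

# "Model": remember the mean target of every label seen during training.
label_means = train.groupby("passenger_id")["target"].mean()

# Labels never seen in training fall back to the overall training mean.
fallback = train["target"].mean()


def predict(frame):
    """Predict 1 when the memorised mean for the label exceeds 0.5."""
    probs = frame["passenger_id"].map(label_means).fillna(fallback)
    return (probs > 0.5).astype(int)


train_acc = (predict(train) == train["target"]).mean()
test_acc = (predict(test) == test["target"]).mean()

print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

Because every label is unique, the lookup is perfect on the training rows and no better than a constant guess on the test rows, which is the over-fitting pattern in its most extreme form.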