# Cardinality

We learned that the categories of a categorical variable are often called labels. The cardinality of a categorical variable is the number of distinct labels it has. For example, if the variable corresponds to a person's gender, it will have cardinality 2: man and woman. If the variable concerns cities, it can have a very high cardinality, since it can take on many different values such as London, Rio de Janeiro, Tokyo, New York, etc.
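As a quick illustration of the definition, the sketch below counts the distinct labels per column on a small toy table. The data and the names `toy_df` and `cardinality` are made up for this example and are not part of the dataset used later.

```python
import pandas as pd

# Toy data (hypothetical, only to illustrate the definition)
toy_df = pd.DataFrame({
    "gender": ["man", "woman", "woman", "man", "woman"],
    "city": ["London", "Tokyo", "Rio de Janeiro", "New York", "London"],
})

# nunique() returns the number of distinct labels in each column,
# which is the cardinality of each variable
cardinality = toy_df.nunique()
print(cardinality)
```

Here `gender` has only two labels, while `city` already has four in just five rows, and would keep growing as more rows are added.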

In this notebook we will explore the issue of cardinality from the following aspects:
- Show how the presence of categorical variables with high cardinality can harm machine learning models, mainly those based on decision trees;
- In tree-based models, categorical variables with high cardinality tend to bias the model's decisions toward them, since they have many more labels (and therefore many more candidate splits) than variables with lower cardinality;
- The presence of variables with high cardinality increases the chances of a subset of the labels only being present in the training data or test data;
- If a subset of the labels is only present in the training data, the model tends to overfit, since it learns the behavior of these labels during training but never encounters them in the test set;
- If a subset of the labels is only present in the test data, the model's performance will be harmed, since it will deal with categories on which it was not trained;
- Show that creating variables with low cardinality from those with high cardinality improves the model's performance and is a good practice to be carried out.

============================================================================================================================================================================================

## Hands-On: Titanic Dataset

To explore the cardinality of categorical variables, we will work with the Titanic dataset. Let's start by observing the cardinality of the categorical variables present in it and, after that, reduce the cardinality in those that have a high cardinality by creating new variables from them.

In [1]:
# basic libraries
import pandas as pd 
import numpy as np 

# scikit-learn train test split
from sklearn.model_selection import train_test_split

# scikit-learn classifier models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

# scikit-learn metric to evaluate the performance
from sklearn.metrics import roc_auc_score

In [2]:
# loading the titanic dataset
path_titanic = '../datasets/titanic.csv'
titanic_df = pd.read_csv(path_titanic)
titanic_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"


The categorical variables in this dataset are: `sex`, `ticket`, `cabin` and `embarked`. The variables `cabin` and `ticket` contain both letters and numbers and could also be classified as mixed variables, which we discussed in the previous section. At this point, however, we will treat them as categorical variables. Let's check the cardinality of each of them, together with the number of distinct values of `name` for comparison.

In [3]:
# verifying the cardinality of these categorical features
print(f"Number of names in Titanic: {len(titanic_df.name.unique())}")
print(f"Number of unique labels of sex variable: {len(titanic_df.sex.unique())}") # sex
print(f"Number of unique labels of ticket variable: {len(titanic_df.ticket.unique())}") # ticket
print(f"Number of unique labels of cabin variable: {len(titanic_df.cabin.unique())}") # cabin 
print(f"Number of unique labels of embarked variable: {len(titanic_df.embarked.unique())}") # embarked
print(f"Number of passengers in Titanic: {len(titanic_df)}")


Number of names in Titanic: 1307
Number of unique labels of sex variable: 2
Number of unique labels of ticket variable: 929
Number of unique labels of cabin variable: 182
Number of unique labels of embarked variable: 4
Number of passengers in Titanic: 1309
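One detail to keep in mind when reading these counts: `len(series.unique())` also counts a missing value (`NaN`) as a label, whereas `nunique()` ignores missing values by default. The sketch below shows the difference on a toy column; the data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data)
embarked_toy = pd.Series(["S", "C", "Q", np.nan, "S"])

# unique() keeps NaN, so it is counted as one more label
n_with_nan = len(embarked_toy.unique())

# nunique() drops NaN by default (dropna=True)
n_without_nan = embarked_toy.nunique()

# nunique(dropna=False) counts NaN again, matching len(unique())
n_explicit = embarked_toy.nunique(dropna=False)

print(n_with_nan, n_without_nan, n_explicit)
```

Whether a missing value should count as a label depends on how it will be handled later; the point is only to know which convention a given count uses.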


Here it is clear that some variables have low cardinality (`sex` and `embarked`) and others have high cardinality (`ticket` and `cabin`). Let's work with the `cabin` variable and look at the distinct labels it has:

In [4]:
# unique categories present on cabin feature
titanic_df.cabin.unique()

array(['B5', 'C22', 'E12', 'D7', 'A36', 'C101', nan, 'C62', 'B35', 'A23',
       'B58', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C99', 'C52', 'T',
       'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B4', 'E40',
       'B38', 'E24', 'B51', 'B96', 'C46', 'E31', 'E8', 'B61', 'B77', 'A9',
       'C89', 'A14', 'E58', 'E49', 'E52', 'E45', 'B22', 'B26', 'C85',
       'E17', 'B71', 'B20', 'A34', 'C86', 'A16', 'A20', 'A18', 'C54',
       'C45', 'D20', 'A29', 'C95', 'E25', 'C111', 'C23', 'E36', 'D34',
       'D40', 'B39', 'B41', 'B102', 'C123', 'E63', 'C130', 'B86', 'C92',
       'A5', 'C51', 'B42', 'C91', 'C125', 'D10', 'B82', 'E50', 'D33',
       'C83', 'B94', 'D49', 'D45', 'B69', 'B11', 'E46', 'C39', 'B18',
       'D11', 'C93', 'B28', 'C49', 'B52', 'E60', 'C132', 'B37', 'D21',
       'D19', 'C124', 'D17', 'B101', 'D28', 'D6', 'D9', 'B80', 'C106',
       'B79', 'C47', 'D30', 'C90', 'E38', 'C78', 'C30', 'C118', 'D36',
       'D48', 'D47', 'C105', 'B36', 'B30', 'D43', 'B24', 'C2', 'C65',


The idea now is to reduce this cardinality by creating a new variable that keeps only the first letter of each label. The `cabin` variable describes where people stayed on the Titanic: the leading letter identifies the sector of the ship, which is related to the price class, while the number identifies the individual room. Therefore, it makes sense to use only the letter to create the new variable.

In [5]:
# creating a new variable from cabin to reduce cardinality
titanic_df['cabin_reduced'] = titanic_df.cabin.astype(str).str[0]

Let us verify the cardinality of this new variable:

In [6]:
print(f"Number of unique labels for cabin reduced variable: {len(titanic_df.cabin_reduced.unique())}")

Number of unique labels for cabin reduced variable: 9
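In the outputs further below, passengers without a cabin appear under the reduced label `n`, which comes from converting missing values to strings before taking the first character. A possible variant, sketched here on toy data, keeps missing cabins under an explicit label instead. The label `"Missing"` and the variable names are assumptions made for this example, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy cabin column with one missing value (hypothetical data)
cabin_toy = pd.Series(["B5", "C22", np.nan, "E12", "C101"])

# .str[0] leaves missing values as NaN, so they can be
# replaced afterwards with an explicit, readable label
cabin_reduced_toy = cabin_toy.str[0].fillna("Missing")

print(cabin_reduced_toy.tolist())
```

Either choice yields one extra label for missing cabins; the explicit name only makes it harder to confuse with a real sector letter.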


We managed to reduce the cardinality from $182$ to $9$ using this new variable. As we will see later, using the variable with the higher cardinality harms our machine learning models, while the lower-cardinality variable we just created helps them. Before that, however, let's show how high cardinality affects our training and test sets.

## High Cardinality Creates an Uneven Distribution of Categories

Let's assume that our data consists only of the variables `cabin`, `cabin_reduced` and `sex`, in addition to the target `survived`. Let's create the training and test sets from them.

In [7]:
# creating training and test sets from this subset of features
features_subset = ['cabin', 'cabin_reduced', 'sex']

# splitting the data in train and test sets
X_train, X_test, y_train, y_test = (
    train_test_split(
        titanic_df[features_subset],
        titanic_df['survived'],
        test_size=0.3,
        random_state=0
    )
)

Now, it is interesting to do the following analysis: see how many of the labels for the variables `cabin` and `cabin_reduced` are found in the training data and the test data.

In [8]:
# set of labels which are in train set and not in test set - cabin feature
labels_in_train = [
    label for label in X_train.cabin.unique() if label not in X_test.cabin.unique()
]

In [9]:
# length of this set
len(labels_in_train)

113

Of the $182$ categories, $113$ are found only in the training set! As we will see, this contributes to overfitting in our models, since they learn the behavior of labels that never appear in the test set. Let's see how many categories are present in the test set but not in the training set:

In [10]:
# set of labels which are in test set and not in train set - cabin feature
labels_in_test = [
    label for label in X_test.cabin.unique() if label not in X_train.cabin.unique()
]

In [11]:
# length of this set
len(labels_in_test)

36

A problem also occurs here: $36$ categories of the `cabin` variable are present only in the test set. This means that a trained model will never have seen these labels, which will cause its performance to decline. So high cardinality is indeed a problem. Let's check whether this problem is reduced when we use the variable `cabin_reduced`.

In [12]:
# set of labels which are in train set and not in test set - cabin_reduced feature
labels_in_train = [
    label for label in X_train.cabin_reduced.unique() if label not in X_test.cabin_reduced.unique()
]

In [13]:
# length of this set
len(labels_in_train)

1

Very good! Only one label of the variable `cabin_reduced` is present in the training set and not in the test set. Let's check how many are present in the test set alone:

In [14]:
# set of labels which are in test set and not in train set - cabin_reduced feature
labels_in_test = [
    label for label in X_test.cabin_reduced.unique() if label not in X_train.cabin_reduced.unique()
]

In [15]:
# length of this set
len(labels_in_test)

0

In this case, every label present in the test set is also present in the training set! In other words, by using a variable with reduced cardinality, we largely avoid the uneven distribution of labels between the training and test sets.
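The four comparisons above can also be written with Python set differences, which avoids recomputing the unique labels for every element. The sketch below uses toy data; the names are hypothetical. Note that it drops missing values before comparing, which is an assumption of this example and may differ from how the list comprehensions above treat `NaN`.

```python
import pandas as pd

# Toy train and test columns (hypothetical data)
train_toy = pd.Series(["A1", "B2", "C3", "C3", "D4"])
test_toy = pd.Series(["B2", "C3", "E5"])

# Distinct labels on each side, ignoring missing values
train_labels = set(train_toy.dropna().unique())
test_labels = set(test_toy.dropna().unique())

# Labels seen on only one side of the split
only_in_train = train_labels - test_labels
only_in_test = test_labels - train_labels

print(sorted(only_in_train), sorted(only_in_test))
```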

## Impact of Cardinality on the Performance of Machine Learning Models

Our next task is to verify the impact of cardinality on the performance of machine learning models. As we know, scikit-learn models do not work with variables in string format; the inputs must be numeric. Therefore, we first need to create dictionaries that map the labels of the variables `cabin` and `cabin_reduced` into numbers.

In [16]:
# creating a dictionary for cabin feature from X_train
dict_cabin = {
    i: j for j, i in enumerate(X_train.cabin.unique(), 0)
}

In [18]:
# creating a dictionary for cabin_reduced feature from X_train
dict_cabin_reduced = {
    i: j for j, i in enumerate(X_train.cabin_reduced.unique())
}

Now let's use them to actually map the variables `cabin` and `cabin_reduced` into numeric values.

In [19]:
# mapping cabin and cabin_reduced into numbers
X_train.loc[:, 'cabin_mapped'] = X_train.loc[:, 'cabin'].map(dict_cabin)
X_test.loc[:, 'cabin_mapped'] = X_test.loc[:, 'cabin'].map(dict_cabin)
X_train.loc[:, 'cabin_reduced_mapped'] = X_train.loc[:, 'cabin_reduced'].map(dict_cabin_reduced)
X_test.loc[:, 'cabin_reduced_mapped'] = X_test.loc[:, 'cabin_reduced'].map(dict_cabin_reduced)

In [20]:
X_train[['cabin', 'cabin_mapped']].head(25)

Unnamed: 0,cabin,cabin_mapped
501,,0
588,,0
402,,0
1193,,0
686,,0
971,,0
117,E36,1
540,,0
294,C68,2
261,E24,3


In [21]:
X_train[['cabin_reduced', 'cabin_reduced_mapped']].head(20)

Unnamed: 0,cabin_reduced,cabin_reduced_mapped
501,n,0
588,n,0
402,n,0
1193,n,0
686,n,0
971,n,0
117,E,1
540,n,0
294,C,2
261,E,1


We can see that the variables `cabin` and `cabin_reduced` were correctly mapped into numeric values. Let's now map the variable `sex` into numeric values:

In [22]:
# mapping sex feature into numerical one
X_train.loc[:, 'sex'] = X_train.loc[:, 'sex'].map({'male': 0, 'female': 1})
X_test.loc[:, 'sex'] = X_test.loc[:, 'sex'].map({'male': 0, 'female': 1})

It is important to check that there are no more null values in the training and test sets:

In [23]:
# seeing if there are null values in X_train for mapped features
X_train[['cabin_mapped', 'cabin_reduced_mapped', 'sex']].isnull().sum()

cabin_mapped            0
cabin_reduced_mapped    0
sex                     0
dtype: int64

In [24]:
# seeing if there are null values in X_test for mapped features
X_test[['cabin_mapped', 'cabin_reduced_mapped', 'sex']].isnull().sum()

cabin_mapped            41
cabin_reduced_mapped     0
sex                      0
dtype: int64

The training data does not contain null values, as expected. However, in the test set there are $41$ null values in the `cabin_mapped` variable. This happened because the mapping dictionary was built from `X_train` only and, as we saw, some labels are present only in the test set. These labels are not in the dictionary, so the rows that contain them receive null values. For these cases, we will fill the null values with $0$ when making predictions. Note that $0$ is also the code assigned to missing cabins in the table above, so these unseen cabins are treated as if the cabin were missing.
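An alternative to reusing an existing code is to reserve a dedicated one for labels that were not seen in training. The sketch below shows the idea on toy data; the value `-1`, the constant `UNSEEN_CODE` and the other names are assumptions made for illustration, and this is not what the cells below do.

```python
import pandas as pd

# Toy train and test columns (hypothetical data)
train_toy = pd.Series(["A1", "B2", "C3"])
test_toy = pd.Series(["B2", "Z9"])  # "Z9" never appears in training

# Build the mapping from the training labels only
mapping = {label: code for code, label in enumerate(train_toy.unique())}

# Reserve a code that no training label uses (assumed convention)
UNSEEN_CODE = -1

# Labels missing from the dictionary become NaN and are then replaced
test_mapped = test_toy.map(mapping).fillna(UNSEEN_CODE).astype(int)

print(test_mapped.tolist())
```

With a dedicated code, unseen labels stay distinguishable from every label the model was trained on, although the model still has no information about how they behave.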

### Random Forest Classifier 

The first model we will train is a Random Forest. The idea is to show how the presence of high cardinality changes the model's predictions. We will use the area under the ROC curve (ROCAUC) as the metric. First, we will train the model using the variable `cabin_mapped`, which has high cardinality.

In [25]:
# training a random forest classifier using cabin_mapped feature

# random forest model
random_forest = RandomForestClassifier(
    n_estimators=200,
    random_state=31,
)

# fitting
random_forest.fit(
    X_train[['cabin_mapped', 'sex']], 
    y_train
)

# predictions on train and test sets
predictions_train = random_forest.predict_proba(
    X_train[['cabin_mapped', 'sex']]
)
predictions_test = random_forest.predict_proba(
    X_test[['cabin_mapped', 'sex']].fillna(0)
)

Let's see the results by calculating ROCAUC:

In [26]:
print("Train Set:")
print(f"The ROCAUC for the model tested on trained data is: {np.round(roc_auc_score(y_train, predictions_train[:,1]), 3)}")
print("Test Set:")
print(f"The ROCAUC for the model tested on test data is: {np.round(roc_auc_score(y_test, predictions_test[:,1]), 3)}")

Train Set:
The ROCAUC for the model tested on trained data is: 0.854
Test Set:
The ROCAUC for the model tested on test data is: 0.773


As can be seen, the performance differs between the two sets: it is noticeably higher on the training data ($0.854$) than on the test data ($0.773$), which indicates overfitting. Let's now use the variable `cabin_reduced_mapped`:

In [27]:
# training a random forest classifier using cabin_reduced_mapped feature

# random forest model
random_forest = RandomForestClassifier(
    n_estimators=200,
    random_state=31,
)

# fitting
random_forest.fit(
    X_train[['cabin_reduced_mapped', 'sex']], 
    y_train
)

# predictions on train and test sets
predictions_train = random_forest.predict_proba(
    X_train[['cabin_reduced_mapped', 'sex']]
)
predictions_test = random_forest.predict_proba(
    X_test[['cabin_reduced_mapped', 'sex']].fillna(0)
)

In [28]:
print("Train Set:")
print(f"The ROCAUC for the model tested on trained data is: {np.round(roc_auc_score(y_train, predictions_train[:,1]), 3)}")
print("Test Set:")
print(f"The ROCAUC for the model tested on test data is: {np.round(roc_auc_score(y_test, predictions_test[:,1]), 3)}")

Train Set:
The ROCAUC for the model tested on trained data is: 0.816
Test Set:
The ROCAUC for the model tested on test data is: 0.802


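Note that the overfitting is greatly reduced: the gap between the training and test ROCAUC drops from about $0.08$ ($0.854$ vs. $0.773$) to about $0.014$ ($0.816$ vs. $0.802$), and the test score itself increases. This indicates that decreasing the cardinality improves how well the model generalizes. Let's repeat the procedure for an AdaBoost classifier.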
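Since the same fit-and-score steps are repeated for every model in this notebook, they could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `train_test_auc` and the synthetic data are hypothetical and only serve to make the example self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def train_test_auc(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return (train ROC AUC, test ROC AUC)."""
    model.fit(X_tr, y_tr)
    # Column 1 of predict_proba holds the probability of the positive class
    auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return auc_train, auc_test


# Synthetic data (hypothetical): two integer-coded features, binary target
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 2))
y = (X[:, 0] + rng.integers(0, 2, size=200) > 2).astype(int)

auc_train, auc_test = train_test_auc(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X[:150], y[:150], X[150:], y[150:],
)
print(f"train: {auc_train:.3f}, test: {auc_test:.3f}")
```

Any estimator that implements `fit` and `predict_proba` could be passed in place of the Random Forest, so the same helper would cover the other models used in this notebook.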

### AdaBoost Classifier

In [29]:
# training an AdaBoost classifier using cabin_mapped feature

# adaboost model
adaboost = AdaBoostClassifier(
    n_estimators=200,
    random_state=32,
)

# fitting
adaboost.fit(
    X_train[['cabin_mapped', 'sex']], 
    y_train
)

# predictions on train and test sets
predictions_train = adaboost.predict_proba(
    X_train[['cabin_mapped', 'sex']]
)
predictions_test = adaboost.predict_proba(
    X_test[['cabin_mapped', 'sex']].fillna(0)
)

Let's see the results by calculating ROCAUC:

In [30]:
print("Train Set:")
print(f"The ROCAUC for the model tested on trained data is: {np.round(roc_auc_score(y_train, predictions_train[:,1]), 3)}")
print("Test Set:")
print(f"The ROCAUC for the model tested on test data is: {np.round(roc_auc_score(y_test, predictions_test[:,1]), 3)}")

Train Set:
The ROCAUC for the model tested on trained data is: 0.83
Test Set:
The ROCAUC for the model tested on test data is: 0.76


Again, the model overfits the training data ($0.83$ on the training set vs. $0.76$ on the test set). Let's see what happens when we use the `cabin_reduced_mapped` variable:

In [31]:
# training an AdaBoost classifier using cabin_reduced_mapped feature

# adaboost
adaboost = AdaBoostClassifier(
    n_estimators=200,
    random_state=32,
)

# fitting
adaboost.fit(
    X_train[['cabin_reduced_mapped', 'sex']], 
    y_train
)

# predictions on train and test sets
predictions_train = adaboost.predict_proba(
    X_train[['cabin_reduced_mapped', 'sex']]
)
predictions_test = adaboost.predict_proba(
    X_test[['cabin_reduced_mapped', 'sex']].fillna(0)
)

In [32]:
print("Train Set:")
print(f"The ROCAUC for the model tested on trained data is: {np.round(roc_auc_score(y_train, predictions_train[:,1]), 3)}")
print("Test Set:")
print(f"The ROCAUC for the model tested on test data is: {np.round(roc_auc_score(y_test, predictions_test[:,1]), 3)}")

Train Set:
The ROCAUC for the model tested on trained data is: 0.816
Test Set:
The ROCAUC for the model tested on test data is: 0.8


Again, test performance improves (from $0.76$ to $0.8$) and the gap between the training and test scores shrinks. In other words, overfitting is reduced by using a variable with a lower cardinality. Let's repeat the same procedure for a logistic regression.

### Logistic Regression 

In [33]:
# training a logistic regression using cabin_mapped feature

# logistic regression
logit = LogisticRegression(
    solver='lbfgs',
    random_state=33,
)

# fitting
logit.fit(
    X_train[['cabin_mapped', 'sex']], 
    y_train
)

# predictions on train and test sets
predictions_train = logit.predict_proba(
    X_train[['cabin_mapped', 'sex']]
)
predictions_test = logit.predict_proba(
    X_test[['cabin_mapped', 'sex']].fillna(0)
)

Let's see the results by calculating ROCAUC:

In [34]:
print("Train Set:")
print(f"The ROCAUC for the model tested on trained data is: {np.round(roc_auc_score(y_train, predictions_train[:,1]), 3)}")
print("Test Set:")
print(f"The ROCAUC for the model tested on test data is: {np.round(roc_auc_score(y_test, predictions_test[:,1]), 3)}")

Train Set:
The ROCAUC for the model tested on trained data is: 0.813
Test Set:
The ROCAUC for the model tested on test data is: 0.775


Again, the model overfits the training data ($0.813$ vs. $0.775$), although the gap is smaller than for the two previous models. Performance will possibly improve when using `cabin_reduced_mapped`:

In [35]:
# training a logistic regression using cabin_reduced_mapped feature

# logistic regression
logit = LogisticRegression(
    solver='lbfgs',
    random_state=33,
)

# fitting
logit.fit(
    X_train[['cabin_reduced_mapped', 'sex']], 
    y_train
)

# predictions on train and test sets
predictions_train = logit.predict_proba(
    X_train[['cabin_reduced_mapped', 'sex']]
)
predictions_test = logit.predict_proba(
    X_test[['cabin_reduced_mapped', 'sex']].fillna(0)
)

In [36]:
print("Train Set:")
print(f"The ROCAUC for the model tested on trained data is: {np.round(roc_auc_score(y_train, predictions_train[:,1]), 3)}")
print("Test Set:")
print(f"The ROCAUC for the model tested on test data is: {np.round(roc_auc_score(y_test, predictions_test[:,1]), 3)}")

Train Set:
The ROCAUC for the model tested on trained data is: 0.812
Test Set:
The ROCAUC for the model tested on test data is: 0.801


And that is what happens, as we expected: the test ROCAUC rises to $0.801$, close to the $0.812$ obtained on the training data. Lastly, let's test Gradient Boosting:

### Gradient Boosting Classifier

In [37]:
# training a gradient boosting classifier using cabin_mapped feature

# gradient boosting classifier
gradient_boosting = GradientBoostingClassifier(
    n_estimators=200,
    random_state=34,
)

# fitting
gradient_boosting.fit(
    X_train[['cabin_mapped', 'sex']], 
    y_train
)

# predictions on train and test sets
predictions_train = gradient_boosting.predict_proba(
    X_train[['cabin_mapped', 'sex']]
)
predictions_test = gradient_boosting.predict_proba(
    X_test[['cabin_mapped', 'sex']].fillna(0)
)

Let's see the results by calculating ROCAUC:

In [38]:
print("Train Set:")
print(f"The ROCAUC for the model tested on trained data is: {np.round(roc_auc_score(y_train, predictions_train[:,1]), 3)}")
print("Test Set:")
print(f"The ROCAUC for the model tested on test data is: {np.round(roc_auc_score(y_test, predictions_test[:,1]), 3)}")

Train Set:
The ROCAUC for the model tested on trained data is: 0.859
Test Set:
The ROCAUC for the model tested on test data is: 0.777


Again, the model overfits the training data ($0.859$ on the training set vs. $0.777$ on the test set). Let's see what happens when we use the `cabin_reduced_mapped` variable:

In [39]:
# training a gradient boosting using cabin_reduced_mapped feature

# gradient boosting classifier
gradient_boosting = GradientBoostingClassifier(
    n_estimators=200,
    random_state=34,
)

# fitting
gradient_boosting.fit(
    X_train[['cabin_reduced_mapped', 'sex']], 
    y_train
)

# predictions on train and test sets
predictions_train = gradient_boosting.predict_proba(
    X_train[['cabin_reduced_mapped', 'sex']]
)
predictions_test = gradient_boosting.predict_proba(
    X_test[['cabin_reduced_mapped', 'sex']].fillna(0)
)

In [40]:
print("Train Set:")
print(f"The ROCAUC for the model tested on trained data is: {np.round(roc_auc_score(y_train, predictions_train[:,1]), 3)}")
print("Test Set:")
print(f"The ROCAUC for the model tested on test data is: {np.round(roc_auc_score(y_test, predictions_test[:,1]), 3)}")

Train Set:
The ROCAUC for the model tested on trained data is: 0.817
Test Set:
The ROCAUC for the model tested on test data is: 0.802


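Again, using a variable with a lower cardinality reduces overfitting: the test ROCAUC rises from $0.777$ to $0.802$, while the training score ($0.817$) stays close to it.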
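To compare the four models side by side, the scores printed above can be collected and the gap between training and test ROCAUC computed for each case. The sketch below simply copies the reported numbers into a dictionary; the names `reported` and `gaps` are introduced here for illustration.

```python
# (train ROCAUC, test ROCAUC) as printed in the outputs above
reported = {
    "Random Forest": {
        "cabin_mapped": (0.854, 0.773),
        "cabin_reduced_mapped": (0.816, 0.802),
    },
    "AdaBoost": {
        "cabin_mapped": (0.83, 0.76),
        "cabin_reduced_mapped": (0.816, 0.8),
    },
    "Logistic Regression": {
        "cabin_mapped": (0.813, 0.775),
        "cabin_reduced_mapped": (0.812, 0.801),
    },
    "Gradient Boosting": {
        "cabin_mapped": (0.859, 0.777),
        "cabin_reduced_mapped": (0.817, 0.802),
    },
}

# Gap between training and test scores for each model and feature
gaps = {
    (model, feature): round(train - test, 3)
    for model, by_feature in reported.items()
    for feature, (train, test) in by_feature.items()
}

for (model, feature), gap in gaps.items():
    print(f"{model:<20} {feature:<22} gap = {gap:.3f}")
```

For every model, the gap obtained with `cabin_reduced_mapped` is smaller than the one obtained with `cabin_mapped`, and the test score is higher.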