### Cardinality

The values of a categorical variable are selected from a group of categories, also called labels. For example, in the variable gender the categories or labels are male and female, whereas in the variable city the labels can be London, Manchester, Brighton and so on.

Different categorical variables contain different numbers of labels or categories. The variable gender contains only 2 labels, but a variable like city or postcode can contain a huge number of different labels.

The number of different labels within a categorical variable is known as cardinality. A high number of labels within a variable is known as high cardinality.
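
As a small illustration, cardinality can be counted with pandas' `nunique`. The toy DataFrame below is made up for this sketch and is not part of the Titanic data used later.

```python
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female"],
    "city": ["London", "Manchester", "Brighton", "London", "Leeds"],
})

# nunique() counts the distinct labels per column (missing values are ignored by default)
cardinality = toy.nunique()
print(cardinality)
```

Here gender has a cardinality of 2 and city a cardinality of 4.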

###### Are multiple labels in a categorical variable a problem?
High cardinality may pose the following problems:

- Variables with too many labels tend to dominate over those with only a few labels, particularly in tree-based algorithms.

- A large number of labels within a variable may introduce noise with little, if any, information, making machine learning models prone to over-fitting.

- Some labels may be present only in the training set and not in the test set, so machine learning algorithms may over-fit to the training set.

- Conversely, some labels may appear only in the test set, leaving the machine learning algorithms unable to perform a calculation over the new (unseen) observation.

In particular, tree methods can be biased towards variables with lots of labels (variables with high cardinality). Thus, their performance may be affected by high cardinality.

Below, I will show the effect of high cardinality of variables on the performance of different machine learning algorithms, and how a quick fix to reduce the number of labels, without any sort of data insight, already helps to boost performance.

In this demo, we will:
- Learn how to quantify cardinality
- See examples of high and low cardinality variables
- Understand the effect of cardinality when preparing train and test sets
- Visualise the effect of cardinality on machine learning model performance

Dataset: Titanic 

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

# imports for ML models
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# evaluation metrics
from sklearn.metrics import roc_auc_score

# data split 
from sklearn.model_selection import train_test_split


In [2]:
# load the datasets
data = pd.read_csv('../datasets/titanic.csv')
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"


In [3]:
# let's investigate the number of different labels in different categorical variables
# note: unique() also counts NaN as a label, whereas value_counts() ignores it by default
print(f"The # categories in Name: {len(data.name.unique())}")
print(f"The # categories in Gender: {len(data.sex.unique())}, and counts:\n {data.sex.value_counts()}")
print(f"The # categories in Cabin: {len(data.cabin.unique())}, and counts:\n {data.cabin.value_counts()}")
print(f"The # categories in Embarked: {len(data.embarked.unique())}, and counts:\n {data.embarked.value_counts()}")
print(f"# Passengers: {data.shape[0]}")
print(f"The # categories in Ticket: {len(data.ticket.unique())}, and counts:\n {data.ticket.value_counts()}")

The # categories in Name: 1307
The # categories in Gender: 2, and counts:
 male      843
female    466
Name: sex, dtype: int64
The # categories in Cabin: 182, and counts:
 F       8
C23     6
G6      5
B57     5
F33     4
       ..
A16     1
B80     1
C103    1
C53     1
C90     1
Name: cabin, Length: 181, dtype: int64
The # categories in Embarked: 4, and counts:
 S    914
C    270
Q    123
Name: embarked, dtype: int64
# Passengers: 1309
The # categories in Ticket: 929, and counts:
 CA. 2343        11
CA 2144          8
1601             8
PC 17608         7
S.O.C. 14879     7
                ..
349217           1
A. 2. 39186      1
A.5. 18509       1
11774            1
250653           1
Name: ticket, Length: 929, dtype: int64


In [4]:
# let's investigate cabin
data.cabin.unique()

array(['B5', 'C22', 'E12', 'D7', 'A36', 'C101', nan, 'C62', 'B35', 'A23',
       'B58', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C99', 'C52', 'T',
       'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B4', 'E40',
       'B38', 'E24', 'B51', 'B96', 'C46', 'E31', 'E8', 'B61', 'B77', 'A9',
       'C89', 'A14', 'E58', 'E49', 'E52', 'E45', 'B22', 'B26', 'C85',
       'E17', 'B71', 'B20', 'A34', 'C86', 'A16', 'A20', 'A18', 'C54',
       'C45', 'D20', 'A29', 'C95', 'E25', 'C111', 'C23', 'E36', 'D34',
       'D40', 'B39', 'B41', 'B102', 'C123', 'E63', 'C130', 'B86', 'C92',
       'A5', 'C51', 'B42', 'C91', 'C125', 'D10', 'B82', 'E50', 'D33',
       'C83', 'B94', 'D49', 'D45', 'B69', 'B11', 'E46', 'C39', 'B18',
       'D11', 'C93', 'B28', 'C49', 'B52', 'E60', 'C132', 'B37', 'D21',
       'D19', 'C124', 'D17', 'B101', 'D28', 'D6', 'D9', 'B80', 'C106',
       'B79', 'C47', 'D30', 'C90', 'E38', 'C78', 'C30', 'C118', 'D36',
       'D48', 'D47', 'C105', 'B36', 'B30', 'D43', 'B24', 'C2', 'C65',


The cabin variable has 181 distinct labels (182 if missing values are counted as a label), so we need to engineer it. How?
- Let's keep only the first letter of each cabin.

Rationale: the first letter indicates the deck on which the cabin was located, and is therefore an indication of both social class status and proximity to the surface of the Titanic. Both are known to influence the probability of survival.



In [5]:
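# creating a new variable for cabin: keep only the first letter (the deck)
# note: astype(str) turns missing values into the string 'nan', so they get the label 'n'
data['new_cabin'] = data.cabin.astype(str).str[0]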
data[['new_cabin', 'cabin']].head()

Unnamed: 0,new_cabin,cabin
0,B,B5
1,C,C22
2,C,C22
3,C,C22
4,C,C22


In [6]:
# now, let's see the new cabin variable
print(f"The # categories in new Cabin: {len(data.new_cabin.unique())}, and counts:\n {data.new_cabin.value_counts()}")

The # categories in new Cabin: 9, and counts:
 n    1014
C      94
B      65
D      46
E      41
A      22
F      21
G       5
T       1
Name: new_cabin, dtype: int64
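
The label `n` in the output above is the first character of the string `'nan'`, which is what missing cabins become after `astype(str)`. A minimal sketch of one alternative, which gives missing cabins an explicit label instead, is shown here. The label `'Missing'` and the toy series are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy cabin values, including a missing one (illustrative only)
cabin = pd.Series(["B5", "C22", np.nan, "E12", "F"])

# .str[0] leaves missing values as missing, so they can be labelled explicitly
deck = cabin.str[0].fillna("Missing")
print(deck.tolist())
```

Either way the variable ends up with one extra category for passengers without a recorded cabin; the explicit label only makes that category easier to read.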


In [7]:
# Now, let's use the cabin, new_cabin, and sex columns to build ML models
cols = ['cabin', 'new_cabin', 'sex']
X_train, X_test, y_train, y_test = train_test_split(
    data[cols], data['survived'], test_size=0.3, random_state=0
)

print(f"shape train X: {X_train.shape}, shape test X: {X_test.shape}")

shape train X: (916, 3), shape test X: (393, 3)


###### High cardinality leads to uneven distribution of categories in train and test sets

When a variable is highly cardinal, some categories often land only in the training set, or only in the test set. If present only in the training set, they may lead to over-fitting. If present only in the test set, the machine learning algorithm will not know how to handle them, as it has not seen them during training.

In [11]:
# let's investigate our train and test datasets
# note: NaN does not compare equal to itself, so the missing-value marker
# is likely counted in both lists below

only_in_train = [
    x for x in X_train.cabin.unique() if x not in X_test.cabin.unique()
]

only_in_test = [
    x for x in X_test.cabin.unique() if x not in X_train.cabin.unique()
]

print(f"Cabins only present in train dataset: {len(only_in_train)}")
print(f"Cabins only present in test dataset: {len(only_in_test)}")

Cabins only present in train dataset: 113
Cabins only present in test dataset: 36


Variables with high cardinality tend to have values (i.e., categories) present in the training set that are not present in the test set, and vice versa. This causes problems at training time (due to over-fitting) and when scoring new data (how should the model deal with unseen categories?).

This problem is largely mitigated by reducing the cardinality of the variable.

In [13]:
# Since we reduced the cabin cardinality, let's repeat the check on new_cabin

reduced_only_in_train = [
    x for x in X_train.new_cabin.unique() if x not in X_test.new_cabin.unique()
]

reduced_only_in_test = [
    x for x in X_test.new_cabin.unique() if x not in X_train.new_cabin.unique()
]

print(f"Reduced Cabins only present in train dataset: {len(reduced_only_in_train)}")
print(f"Reduced Cabins only present in test dataset: {len(reduced_only_in_test)}")

Reduced Cabins only present in train dataset: 1
Reduced Cabins only present in test dataset: 0


Observe how, by reducing the cardinality, there is now only 1 label in the training set that is not present in the test set, and no label in the test set that is missing from the training set.
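
The same comparison can also be sketched with Python sets, which avoids rescanning the array for every label. In this sketch missing values are dropped first, because NaN does not compare equal to itself and could otherwise be counted as a label unique to each set. The toy series stand in for `X_train.cabin` and `X_test.cabin` and are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy train/test columns standing in for X_train.cabin and X_test.cabin
train_cabin = pd.Series(["B5", "C22", np.nan, "E12"])
test_cabin = pd.Series(["C22", np.nan, "D7"])

# dropna() removes the missing-value marker before the comparison
train_labels = set(train_cabin.dropna().unique())
test_labels = set(test_cabin.dropna().unique())

# Set difference gives the labels that appear on one side only
only_in_train = train_labels - test_labels
only_in_test = test_labels - train_labels

print(sorted(only_in_train))
print(sorted(only_in_test))
```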

##### How does cardinality affect machine learning model performance?

Now let's create a dictionary to encode the variables: each category is replaced by a number. This is neither the only nor the best way to encode categories.

In [18]:
cabin_dict = {k:i for i,k in enumerate(X_train.cabin.unique(),0)}
cabin_dict

{nan: 0,
 'E36': 1,
 'C68': 2,
 'E24': 3,
 'C22': 4,
 'D38': 5,
 'B50': 6,
 'A24': 7,
 'C111': 8,
 'F': 9,
 'C6': 10,
 'C87': 11,
 'E8': 12,
 'B45': 13,
 'C93': 14,
 'D28': 15,
 'D36': 16,
 'C125': 17,
 'B35': 18,
 'T': 19,
 'B73': 20,
 'B57': 21,
 'A26': 22,
 'A18': 23,
 'B96': 24,
 'G6': 25,
 'C78': 26,
 'C101': 27,
 'D9': 28,
 'D33': 29,
 'C128': 30,
 'E50': 31,
 'B26': 32,
 'B69': 33,
 'E121': 34,
 'C123': 35,
 'B94': 36,
 'A34': 37,
 'D': 38,
 'C39': 39,
 'D43': 40,
 'E31': 41,
 'B5': 42,
 'D17': 43,
 'F33': 44,
 'E44': 45,
 'D7': 46,
 'A21': 47,
 'D34': 48,
 'A29': 49,
 'D35': 50,
 'A11': 51,
 'B51': 52,
 'D46': 53,
 'E60': 54,
 'C30': 55,
 'D26': 56,
 'E68': 57,
 'A9': 58,
 'B71': 59,
 'D37': 60,
 'F2': 61,
 'C55': 62,
 'C89': 63,
 'C124': 64,
 'C23': 65,
 'C126': 66,
 'E49': 67,
 'E46': 68,
 'D19': 69,
 'B58': 70,
 'C82': 71,
 'B52': 72,
 'C92': 73,
 'E45': 74,
 'C65': 75,
 'E25': 76,
 'B3': 77,
 'D40': 78,
 'C91': 79,
 'B102': 80,
 'B61': 81,
 'A20': 82,
 'B36': 83,
 'C7': 84,

In [21]:
# First, let's add a new column that maps the cabin categories to numbers using the dict above
X_train.loc[:, 'cabin_mapped'] = X_train.loc[:, 'cabin'].map(cabin_dict)
X_test.loc[:, 'cabin_mapped'] = X_test.loc[:, 'cabin'].map(cabin_dict)

X_train[['cabin_mapped', 'cabin']].head(15)

Unnamed: 0,cabin_mapped,cabin
501,0,
588,0,
402,0,
1193,0,
686,0,
971,0,
117,1,E36
540,0,
294,2,C68
261,3,E24


In [22]:
# Previously, we engineered the cabin variable into new_cabin. Let's repeat the previous step
# for this column

# create a new dict to map new_cabin
new_cabin_dict = {k:i for i, k in enumerate(X_train.new_cabin.unique(), 0)}

X_train.loc[:, 'new_cabin'] = X_train.loc[:,'new_cabin'].map(new_cabin_dict)
X_test.loc[:, 'new_cabin'] = X_test.loc[:,'new_cabin'].map(new_cabin_dict)

X_train.head(15)

Unnamed: 0,cabin,new_cabin,sex,cabin_mapped
501,,0,female,0
588,,0,female,0
402,,0,female,0
1193,,0,male,0
686,,0,female,0
971,,0,male,0
117,E36,1,female,1
540,,0,female,0
294,C68,2,male,2
261,E24,1,male,3


We can see that new_cabin takes the same value for all cabins that share a first letter: 2 for cabins starting with C and 1 for those starting with E. In cabin_mapped, by contrast, each cabin has its own value. The reason is that new_cabin was already engineered to keep only the deck letter.

In [23]:
# let's map the sex variable as well; one-hot encoding is another option that we will see later

X_train.loc[:, 'sex'] = X_train.loc[:,'sex'].map({'male':0, 'female':1})
X_test.loc[:, 'sex'] = X_test.loc[:,'sex'].map({'male':0, 'female':1})

X_train.head(15)

Unnamed: 0,cabin,new_cabin,sex,cabin_mapped
501,,0,1,0
588,,0,1,0
402,,0,1,0
1193,,0,0,0
686,,0,1,0
971,,0,0,0
117,E36,1,1,1
540,,0,1,0
294,C68,2,0,2
261,E24,1,0,3


In [24]:
# check for missing values in the train set

X_train[['new_cabin', 'cabin_mapped', 'sex']].isnull().sum()

new_cabin       0
cabin_mapped    0
sex             0
dtype: int64

In [25]:
X_test[['new_cabin', 'cabin_mapped', 'sex']].isnull().sum()

new_cabin        0
cabin_mapped    41
sex              0
dtype: int64

In the test set, there are now 41 missing values for the highly cardinal variable. These were introduced when encoding the categories into numbers.

How?

Many categories exist only in the test set. Thus, when we created our encoding dictionary using only the train set, we did not generate a number to replace those labels present only in the test set. As a consequence, they were encoded as NaN. We will see in future notebooks how to tackle this problem. For now, I will fill those missing values with 0.
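
A minimal sketch of this mechanism on toy data: `map` returns NaN for any label missing from the dictionary. Note that in this notebook the code 0 already stands for a missing cabin, so filling with 0 merges unseen cabins with that category. One possible alternative, shown here purely as an assumption for illustration, is to reserve a separate code such as -1 for unseen labels.

```python
import pandas as pd

# Toy data: "D" never appears in the training labels
train = pd.Series(["A", "B", "C", "A"])
test = pd.Series(["A", "D", "C"])

# Build the mapping from the training labels only, as in the notebook
mapping = {label: code for code, label in enumerate(train.unique())}

# Labels that are missing from the mapping come back as NaN
encoded = test.map(mapping)
print(encoded.isnull().sum())

# One option: give unseen labels a reserved code (-1 is an arbitrary choice)
encoded_reserved = encoded.fillna(-1).astype(int)
print(encoded_reserved.tolist())
```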

In [27]:
# let's check the number of different categories in the encoded variables
len(X_train.cabin_mapped.unique()), len(X_train.new_cabin.unique())


(147, 9)

From the above we note immediately that, of the original 182 cabin labels in the dataset (counting missing values as a label), only 147 are present in the training set. We also see how we reduced the number of different categories to just 9 in our previous step.

Let's train ML models 

##### Random Forests 


In [28]:
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['cabin_mapped', 'sex']], y_train)

# Eval
pred_train = rf.predict_proba(X_train[['cabin_mapped', 'sex']])
pred_test = rf.predict_proba(X_test[['cabin_mapped', 'sex']].fillna(0))

print("Train set:")
print(f"RF roc-auc: {roc_auc_score(y_train, pred_train[:,1])}")

print("Test set:")
print(f"RF roc-auc: {roc_auc_score(y_test, pred_test[:,1])}")

Train set:
RF roc-auc: 0.853790650048556
Test set:
RF roc-auc: 0.7691361097284443


This is a clear sign of over-fitting: the model does not generalise to the test set as well as it fits the training set. Let's swap in the engineered variable and retrain the RF model.

In [35]:
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model with engineered feature
rf.fit(X_train[['new_cabin', 'sex']], y_train)

# Eval
pred_train = rf.predict_proba(X_train[['new_cabin', 'sex']])
pred_test = rf.predict_proba(X_test[['new_cabin', 'sex']])

print("Train set:")
print(f"RF roc-auc: {roc_auc_score(y_train, pred_train[:,1])}")

print("Test set:")
print(f"RF roc-auc: {roc_auc_score(y_test, pred_test[:,1])}")

Train set:
RF roc-auc: 0.8163420365403872
Test set:
RF roc-auc: 0.8017670482827277


We can see now that the Random Forests over-fit the training set far less. In addition, the model is better at generalising the predictions (compare the test-set roc-auc of this model with that of the model above: 0.80 vs 0.77).

We could also try other remedies, such as hyperparameter tuning.


##### AdaBoost

In [37]:
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['cabin_mapped', 'sex']], y_train)

#eval 
pred_train = ada.predict_proba(X_train[['cabin_mapped', 'sex']])
pred_test = ada.predict_proba(X_test[['cabin_mapped', 'sex']].fillna(0))

print("Train set:")
print(f"AdaBoost roc-auc: {roc_auc_score(y_train, pred_train[:,1])}")

print("Test set:")
print(f"AdaBoost roc-auc: {roc_auc_score(y_test, pred_test[:,1])}")

Train set:
AdaBoost roc-auc: 0.8296861713101102
Test set:
AdaBoost roc-auc: 0.7604391350035948


Again, we can see over-fitting.

In [38]:
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['new_cabin', 'sex']], y_train)

#eval 
pred_train = ada.predict_proba(X_train[['new_cabin', 'sex']])
pred_test = ada.predict_proba(X_test[['new_cabin', 'sex']])

print("Train set:")
print(f"AdaBoost roc-auc: {roc_auc_score(y_train, pred_train[:,1])}")

print("Test set:")
print(f"AdaBoost roc-auc: {roc_auc_score(y_test, pred_test[:,1])}")

Train set:
AdaBoost roc-auc: 0.8161256723642566
Test set:
AdaBoost roc-auc: 0.8001078480172557


Similarly, the AdaBoost model trained on the variable with high cardinality over-fits the training set, whereas the AdaBoost model trained on the low-cardinality variable does not, and therefore does a better job of generalising the predictions.

In addition, building an AdaBoost model on a Cabin variable with fewer categories is a) simpler and b) more robust: should a new cabin appear in the test set, taking just its first letter will most likely produce a category the model has already seen during training.


##### Logistic Regression

In [44]:
# let's train the model with cabin_mapped first, and then with new_cabin

lr = LogisticRegression(solver='lbfgs', random_state=44)

# train the model
lr.fit(X_train[['cabin_mapped', 'sex']], y_train)

# eval 
pred_train = lr.predict_proba(X_train[['cabin_mapped', 'sex']])
pred_test = lr.predict_proba(X_test[['cabin_mapped', 'sex']].fillna(0))


print("Train set")
print(f"LR roc-auc: {roc_auc_score(y_train, pred_train[:, 1])}")

print("Test set")
print(f"LR roc-auc: {roc_auc_score(y_test, pred_test[:, 1])}")

Train set
LR roc-auc: 0.8133909298124677
Test set
LR roc-auc: 0.7750815773463858


In [45]:
# let's try using the engineered feature for cabin
lr = LogisticRegression(solver='lbfgs', random_state=44)

# train the model
lr.fit(X_train[['new_cabin', 'sex']], y_train)

# eval 
pred_train = lr.predict_proba(X_train[['new_cabin', 'sex']])
pred_test = lr.predict_proba(X_test[['new_cabin', 'sex']])


print("Train set")
print(f"LR roc-auc: {roc_auc_score(y_train, pred_train[:, 1])}")

print("Test set")
print(f"LR roc-auc: {roc_auc_score(y_test, pred_test[:, 1])}")

Train set
LR roc-auc: 0.8123468468695123
Test set
LR roc-auc: 0.8008268347989602


We reach the same conclusion as with the previous models. Let's try one last model.

##### Gradient Boosting 

In [46]:
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['cabin_mapped', 'sex']], y_train)

# eval 
pred_train = gbc.predict_proba(X_train[['cabin_mapped', 'sex']])
pred_test = gbc.predict_proba(X_test[['cabin_mapped', 'sex']].fillna(0))


print("Train set")
print(f"GBM roc-auc: {roc_auc_score(y_train, pred_train[:, 1])}")

print("Test set")
print(f"GBM roc-auc: {roc_auc_score(y_test, pred_test[:, 1])}")


Train set
GBM roc-auc: 0.862631390919749
Test set
GBM roc-auc: 0.7733117637298823


In [47]:
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['new_cabin', 'sex']], y_train)

# eval 
pred_train = gbc.predict_proba(X_train[['new_cabin', 'sex']])
pred_test = gbc.predict_proba(X_test[['new_cabin', 'sex']].fillna(0))


print("Train set")
print(f"GBM roc-auc: {roc_auc_score(y_train, pred_train[:, 1])}")

print("Test set")
print(f"GBM roc-auc: {roc_auc_score(y_test, pred_test[:, 1])}")

Train set
GBM roc-auc: 0.816719415917359
Test set
GBM roc-auc: 0.8015181682429069


Gradient Boosted trees are indeed over-fitting to the training set in those cases where the variable Cabin has a lot of labels. This was expected as tree methods tend to be biased to variables with plenty of categories.
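
The fit and evaluate cells above repeat the same steps for each model and feature set. They could be wrapped in a small helper, sketched below. The function name `fit_and_score` and the synthetic data are assumptions for illustration and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def fit_and_score(model, X_train, y_train, X_test, y_test, cols):
    """Fit `model` on `cols` and return (train ROC-AUC, test ROC-AUC)."""
    model.fit(X_train[cols], y_train)
    pred_train = model.predict_proba(X_train[cols])[:, 1]
    # fillna(0) mirrors how the notebook handles unseen categories in the test set
    pred_test = model.predict_proba(X_test[cols].fillna(0))[:, 1]
    return roc_auc_score(y_train, pred_train), roc_auc_score(y_test, pred_test)


# Synthetic data for demonstration only; this is not the Titanic data
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.integers(0, 5, 200), "b": rng.integers(0, 2, 200)})
y = (X["b"] + rng.integers(0, 2, 200) > 0).astype(int)

train_auc, test_auc = fit_and_score(
    LogisticRegression(solver="lbfgs"),
    X.iloc[:150], y.iloc[:150], X.iloc[150:], y.iloc[150:],
    ["a", "b"],
)
print(train_auc, test_auc)
```

With the notebook's data, a call such as `fit_and_score(rf, X_train, y_train, X_test, y_test, ['new_cabin', 'sex'])` would replace one of the cells above.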


Note: 
A feature with high cardinality can lead to model over-fitting and needs careful feature engineering. The right variable encoding depends on the nature of the problem and the data. We will examine this in more detail later on.
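
To put the results side by side, the gap between train and test roc-auc can be computed for each model. The values below are copied from the outputs above and rounded to four decimals; the snippet is only a convenience for comparing them.

```python
# ROC-AUC values copied from the outputs above: (train, test)
results = {
    ("Random Forests", "cabin_mapped"): (0.8538, 0.7691),
    ("Random Forests", "new_cabin"): (0.8163, 0.8018),
    ("AdaBoost", "cabin_mapped"): (0.8297, 0.7604),
    ("AdaBoost", "new_cabin"): (0.8161, 0.8001),
    ("Logistic Regression", "cabin_mapped"): (0.8134, 0.7751),
    ("Logistic Regression", "new_cabin"): (0.8123, 0.8008),
    ("Gradient Boosting", "cabin_mapped"): (0.8626, 0.7733),
    ("Gradient Boosting", "new_cabin"): (0.8167, 0.8015),
}

# A larger train-test gap suggests stronger over-fitting
gaps = {key: round(train - test, 4) for key, (train, test) in results.items()}
for (model, feature), gap in gaps.items():
    print(f"{model:<20} {feature:<13} train-test gap: {gap:.4f}")
```

For all four models the gap is smaller with new_cabin than with cabin_mapped.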