## Cardinality

The values of a categorical variable are categories (labels). A variable can contain only 2 labels, like gender, or many more, like city or postcode. A **high number of labels** within a variable is known as __high cardinality__. Reducing the number of labels often helps machine learning models generalise better.

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression  # Build ML models
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score  # Evaluate the models
from sklearn.model_selection import train_test_split  # Separate data into train and test

In [38]:
data = pd.read_csv('titanic.csv')
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"


The categorical variables in this dataset are **Name, Sex, Ticket, Cabin and Embarked**. **Ticket and Cabin** contain both letters and numbers, so they could be treated as Mixed Variables. For now, we will treat them as categorical.

**What is the cardinality of these categorical variables?**

In [39]:
print('Number of categories in the variable Name: {}'.format(
    len(data.name.unique())))
print('Number of categories in the variable Gender: {}'.format(
    len(data.sex.unique())))
print('Number of categories in the variable Ticket: {}'.format(
    len(data.ticket.unique())))
print('Number of categories in the variable Cabin: {}'.format(
    len(data.cabin.unique())))
print('Number of categories in the variable Embarked: {}'.format(
    len(data.embarked.unique())))
print('Total number of passengers in the Titanic: {}'.format(len(data)))

Number of categories in the variable Name: 1307
Number of categories in the variable Gender: 2
Number of categories in the variable Ticket: 929
Number of categories in the variable Cabin: 182
Number of categories in the variable Embarked: 4
Total number of passengers in the Titanic: 1309


While the variable **Sex** contains only 2 categories and **Embarked** 4 (low cardinality), the variables **Ticket, Name and Cabin** have high cardinality.
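A more compact way to get the same kind of counts is `DataFrame.nunique`. The sketch below uses a small made-up DataFrame (the `toy` data is an illustrative assumption, not the Titanic file). Passing `dropna=False` counts NaN as a label, which matches what `len(series.unique())` does.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a few Titanic columns (values are made up).
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'cabin': ['B5', 'C22', np.nan, 'C22'],
    'embarked': ['S', 'C', 'S', np.nan],
})

# Count distinct labels per column; dropna=False counts NaN as its own label.
cardinality = toy.nunique(dropna=False)
print(cardinality)

# Cardinality relative to the number of rows can help spot
# variables where almost every row has its own label.
print(cardinality / len(toy))
```

With the default `dropna=True`, missing values are not counted, so the numbers can be one lower than those printed by the cell above.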

In [40]:
data.cabin.unique()

array(['B5', 'C22', 'E12', 'D7', 'A36', 'C101', nan, 'C62', 'B35', 'A23',
       'B58', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C99', 'C52', 'T',
       'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B4', 'E40',
       'B38', 'E24', 'B51', 'B96', 'C46', 'E31', 'E8', 'B61', 'B77', 'A9',
       'C89', 'A14', 'E58', 'E49', 'E52', 'E45', 'B22', 'B26', 'C85',
       'E17', 'B71', 'B20', 'A34', 'C86', 'A16', 'A20', 'A18', 'C54',
       'C45', 'D20', 'A29', 'C95', 'E25', 'C111', 'C23', 'E36', 'D34',
       'D40', 'B39', 'B41', 'B102', 'C123', 'E63', 'C130', 'B86', 'C92',
       'A5', 'C51', 'B42', 'C91', 'C125', 'D10', 'B82', 'E50', 'D33',
       'C83', 'B94', 'D49', 'D45', 'B69', 'B11', 'E46', 'C39', 'B18',
       'D11', 'C93', 'B28', 'C49', 'B52', 'E60', 'C132', 'B37', 'D21',
       'D19', 'C124', 'D17', 'B101', 'D28', 'D6', 'D9', 'B80', 'C106',
       'B79', 'C47', 'D30', 'C90', 'E38', 'C78', 'C30', 'C118', 'D36',
       'D48', 'D47', 'C105', 'B36', 'B30', 'D43', 'B24', 'C2', 'C65',


To reduce the cardinality, we capture only **the first letter** of the cabin. This letter indicates the deck on which the cabin was located, and therefore also reflects social class status and proximity to the surface of the Titanic.

In [41]:
data['Cabin_reduced'] = data['cabin'].astype(str).str[0]
data[['cabin', 'Cabin_reduced']].head()

Unnamed: 0,cabin,Cabin_reduced
0,B5,B
1,C22,C
2,C22,C
3,C22,C
4,C22,C


In [42]:
print('Number of categories in the variable Cabin: {}'.format(
    len(data.cabin.unique())))
print('Number of categories in the variable Cabin reduced: {}'.format(
    len(data.Cabin_reduced.unique())))

Number of categories in the variable Cabin: 182
Number of categories in the variable Cabin reduced: 9


We reduced the number of different labels **from 182 to 9**.
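Note that `astype(str)` appears to turn missing cabins into the string `'nan'` in the pandas version used here, so their first letter becomes `'n'`; that is presumably one of the 9 labels counted above. Below is a minimal sketch on made-up values, assuming we would rather give missing cabins an explicit label (`'Missing'` is an arbitrary choice, not something used in this notebook).

```python
import numpy as np
import pandas as pd

cabin = pd.Series(['B5', 'C22', np.nan, 'E12', 'T'])

# As in the cell above: cast to string first, then take the first character.
# In the pandas versions this notebook appears to use, NaN becomes 'nan'.
as_in_notebook = cabin.astype(str).str[0]
print(as_in_notebook.tolist())

# Alternative: take the first letter only where a cabin exists
# (NaN stays NaN), then label the missing ones explicitly.
deck = cabin.str[0].fillna('Missing')
print(deck.tolist())
```

Either way the number of labels is the same; the second version only makes the "no cabin recorded" group easier to recognise.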

**Separate into training and testing set for ML models!**

In [46]:
use_cols = ['cabin', 'Cabin_reduced', 'sex']
X_train, X_test, y_train, y_test = train_test_split(
    data[use_cols], 
    data['survived'],  
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((916, 3), (393, 3))

### High cardinality leads to uneven distribution of categories in train and test sets

In [47]:
unique_to_train_set = [
    x for x in X_train.cabin.unique() if x not in X_test.cabin.unique()]
len(unique_to_train_set)

113

There are **113 Cabins** present only in the training set, and **not in the testing set**.

In [48]:
unique_to_test_set = [
    x for x in X_test.cabin.unique() if x not in X_train.cabin.unique()]
len(unique_to_test_set)

36

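There are also **36 Cabins** present only in the test set. Labels seen only in the training set encourage **over-fitting**, and labels seen only in the test set are ones the model does not know how to handle. Both problems are largely overcome by **reducing the cardinality** of the variable. See below.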

In [49]:
unique_to_train_set = [
    x for x in X_train['Cabin_reduced'].unique()
    if x not in X_test['Cabin_reduced'].unique()]
len(unique_to_train_set)

1

In [50]:
unique_to_test_set = [
    x for x in X_test['Cabin_reduced'].unique()
    if x not in X_train['Cabin_reduced'].unique()]
len(unique_to_test_set)

0
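The same checks can also be written with set differences, which avoids recomputing `unique()` for every element. Here is a minimal sketch on made-up train and test labels (the helper name `labels_only_in` is hypothetical). Missing values are dropped first, because NaN does not compare equal to itself and would otherwise need separate handling.

```python
import numpy as np
import pandas as pd


def labels_only_in(first, second):
    """Return the labels present in `first` but absent from `second`."""
    return set(first.dropna().unique()) - set(second.dropna().unique())


train_cabin = pd.Series(['B5', 'C22', 'E12', np.nan, 'C85'])
test_cabin = pd.Series(['C22', 'E31', np.nan])

# Full cabin labels: several labels are exclusive to one of the sets.
print(labels_only_in(train_cabin, test_cabin))
print(labels_only_in(test_cabin, train_cabin))

# First letter only: fewer labels are exclusive to one of the sets.
print(labels_only_in(train_cabin.str[0], test_cabin.str[0]))
print(labels_only_in(test_cabin.str[0], train_cabin.str[0]))
```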

Observe how, after reducing the cardinality, there is only 1 label in the training set that is not present in the test set, and no label in the test set that is missing from the training set.

### Effect of cardinality on Machine Learning Model Performance

In order to evaluate the effect of these categorical variables on machine learning models, I will quickly replace the categories with numbers. See below.

**Re-map Cabin into numbers so we can use it to train ML models! First create a dictionary that maps each Cabin label in the training set to a number!..**

In [51]:
cabin_dict = {k: i for i, k in enumerate(X_train.cabin.unique(), 0)}
cabin_dict

{nan: 0,
 'E36': 1,
 'C68': 2,
 'E24': 3,
 'C22': 4,
 'D38': 5,
 'B50': 6,
 'A24': 7,
 'C111': 8,
 'F': 9,
 'C6': 10,
 'C87': 11,
 'E8': 12,
 'B45': 13,
 'C93': 14,
 'D28': 15,
 'D36': 16,
 'C125': 17,
 'B35': 18,
 'T': 19,
 'B73': 20,
 'B57': 21,
 'A26': 22,
 'A18': 23,
 'B96': 24,
 'G6': 25,
 'C78': 26,
 'C101': 27,
 'D9': 28,
 'D33': 29,
 'C128': 30,
 'E50': 31,
 'B26': 32,
 'B69': 33,
 'E121': 34,
 'C123': 35,
 'B94': 36,
 'A34': 37,
 'D': 38,
 'C39': 39,
 'D43': 40,
 'E31': 41,
 'B5': 42,
 'D17': 43,
 'F33': 44,
 'E44': 45,
 'D7': 46,
 'A21': 47,
 'D34': 48,
 'A29': 49,
 'D35': 50,
 'A11': 51,
 'B51': 52,
 'D46': 53,
 'E60': 54,
 'C30': 55,
 'D26': 56,
 'E68': 57,
 'A9': 58,
 'B71': 59,
 'D37': 60,
 'F2': 61,
 'C55': 62,
 'C89': 63,
 'C124': 64,
 'C23': 65,
 'C126': 66,
 'E49': 67,
 'E46': 68,
 'D19': 69,
 'B58': 70,
 'C82': 71,
 'B52': 72,
 'C92': 73,
 'E45': 74,
 'C65': 75,
 'E25': 76,
 'B3': 77,
 'D40': 78,
 'C91': 79,
 'B102': 80,
 'B61': 81,
 'A20': 82,
 'B36': 83,
 'C7': 84,

We see how NaN takes the value 0 in the new variable, E36 takes the value 1, C68 takes the value 2, and so on.

**Replace the letters in the reduced cabin variable with the same procedure! First create the replacement dictionary!**
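The cells that apply the dictionaries are not shown here, so the following is only a minimal sketch of how such a re-mapping might look, on made-up data. It assumes the encoded column is called `Cabin_mapped` (the name used in later cells) and that `Series.map` does the replacement; `Cabin_reduced_mapped` and `reduced_dict` are hypothetical names, and the notebook's actual code may differ.

```python
import pandas as pd

train = pd.DataFrame({'cabin': ['C68', 'E36', 'E24', 'C22']})

# Full labels: one integer per distinct cabin, in order of appearance.
cabin_dict = {k: i for i, k in enumerate(train['cabin'].unique())}
train['Cabin_mapped'] = train['cabin'].map(cabin_dict)

# Reduced labels: keep the first letter, then encode it the same way.
train['Cabin_reduced'] = train['cabin'].str[0]
reduced_dict = {k: i for i, k in enumerate(train['Cabin_reduced'].unique())}
train['Cabin_reduced_mapped'] = train['Cabin_reduced'].map(reduced_dict)

print(train)
```

In this toy example every cabin gets its own number in `Cabin_mapped`, while cabins that share a first letter share a number in `Cabin_reduced_mapped`.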

In the reduced variable, E36 and E24 take the same number, 1, because we are capturing only the letter. They both start with E.

**Re-map the categorical variable Sex into numbers, then check the missing values!..**

In [54]:
X_train.loc[:, 'sex'] = X_train.loc[:, 'sex'].map({'male': 0, 'female': 1})
X_test.loc[:, 'sex'] = X_test.loc[:, 'sex'].map({'male': 0, 'female': 1})
X_train.sex.head()

501     1
588     1
402     1
1193    0
686     1
Name: sex, dtype: int64

In [55]:
X_train[['Cabin_mapped', 'Cabin_reduced', 'sex']].isnull().sum()

Cabin_mapped     0
Cabin_reduced    0
sex              0
dtype: int64

The training set has no missing values. In the test set, however, there are now 41 missing values for the highly cardinal variable. These were introduced when encoding the categories into numbers.

How?

Many categories exist only in the test set. Thus, when we created our encoding dictionary using only the train set, we did not generate a number to replace those labels present only in the test set. As a consequence, they were encoded as NaN. We will see in future notebooks how to tackle this problem. For now, I will fill those missing values with 0.

**Let's check the number of different categories in the encoded variables!**

We note immediately that of the **original 182 cabins** in the dataset, **only 147 are present in the training set**. We also see how we **reduced the number of different categories to just 9** in our previous step.
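How unseen labels turn into NaN can be reproduced on a toy example. This is a minimal sketch with made-up cabin labels, assuming the encoding is applied with `Series.map` and a dictionary built from the training labels only.

```python
import pandas as pd

train_cabin = pd.Series(['E36', 'C68', 'E24'])
test_cabin = pd.Series(['C68', 'B5', 'E36', 'D7'])

# The mapping is learned from the training labels only.
cabin_dict = {k: i for i, k in enumerate(train_cabin.unique())}

# Labels the dictionary has never seen ('B5' and 'D7') are mapped to NaN.
test_mapped = test_cabin.map(cabin_dict)
print(test_mapped.isnull().sum())

# Stop-gap used in this notebook: fill those NaN with 0.
# Note that 0 is a code the dictionary already uses (here for 'E36',
# in the notebook for NaN), so the two groups become indistinguishable.
test_filled = test_mapped.fillna(0)
print(test_filled.tolist())
```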

### Random Forests

**Evaluate the effect of labels in machine learning algorithms! Build model on data with high cardinality for cabin! Call the model!..**

We observe that the performance of the Random Forests on the training set is considerably better than its performance on the test set. This indicates that the model is over-fitting: it does a great job at predicting the outcome on the dataset it was trained on, but it lacks the power to generalise the prediction to unseen data.

**Build model on data with low cardinality for cabin! Call the model!..**

In [59]:
rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train[['Cabin_reduced', 'sex']], y_train)  # train the model
pred_train = rf.predict_proba(X_train[['Cabin_reduced', 'sex']])  # make predictions
pred_test = rf.predict_proba(X_test[['Cabin_reduced', 'sex']])
print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Random Forests roc-auc: 0.8163420365403872
Test set
Random Forests roc-auc: 0.8017670482827277


We can see now that **the Random Forests no longer over-fit to the training set**: the roc-auc is 0.82 on the training set and 0.80 on the test set. In addition, the model is much better at generalising the predictions (compare the roc-auc of this model on the test set with the roc-auc of the model above, also on the test set).

**We can overcome the effect of high cardinality by adjusting the hyper-parameters of the random forests**. Here, I want to show you that given the same model, with identical hyper-parameters, **high cardinality may cause the model to over-fit**.
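As a rough illustration of the first point, the sketch below compares a default random forest with a depth-limited one. Everything in it is an assumption made for illustration: the data is randomly generated rather than taken from the Titanic file, the helper `train_test_roc_auc` is hypothetical, and `max_depth=3` is an arbitrary value, not a tuned one.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def train_test_roc_auc(model, X_train, X_test, y_train, y_test):
    """Fit `model` and return its roc-auc on the train and test sets."""
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return train_auc, test_auc


# Synthetic data: a binary feature that drives the target, plus a
# high-cardinality integer code that carries no signal at all.
rng = np.random.RandomState(0)
n = 600
X = pd.DataFrame({
    'sex': rng.randint(0, 2, n),
    'cabin_code': rng.randint(0, 150, n),
})
y = (rng.rand(n) < np.where(X['sex'] == 1, 0.75, 0.2)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

default_rf = RandomForestClassifier(n_estimators=200, random_state=39)
shallow_rf = RandomForestClassifier(
    n_estimators=200, max_depth=3, random_state=39)

for name, model in [('default', default_rf), ('max_depth=3', shallow_rf)]:
    train_auc, test_auc = train_test_roc_auc(
        model, X_train, X_test, y_train, y_test)
    print('{}: train roc-auc {:.3f}, test roc-auc {:.3f}'.format(
        name, train_auc, test_auc))
```

Limiting the depth restricts how finely the trees can split on the many-valued code, which is one way the gap between train and test scores might be narrowed. Other hyper-parameters, such as `min_samples_leaf`, could be tried in the same way.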

### AdaBoost

**Build model on data with plenty of categories in Cabin! Call the model!..**

In [60]:
ada = AdaBoostClassifier(n_estimators=200, random_state=44)
ada.fit(X_train[['Cabin_mapped', 'sex']], y_train)  # train the model
pred_train = ada.predict_proba(X_train[['Cabin_mapped', 'sex']])  # make predictions
pred_test = ada.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))
print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Adaboost roc-auc: 0.8296861713101102
Test set
Adaboost roc-auc: 0.7604391350035948


**Build model on data with fewer categories in Cabin! Call the model!..**

In [61]:
ada = AdaBoostClassifier(n_estimators=200, random_state=44)
ada.fit(X_train[['Cabin_reduced', 'sex']], y_train)  # train the model
pred_train = ada.predict_proba(X_train[['Cabin_reduced', 'sex']])  # make predictions
pred_test = ada.predict_proba(X_test[['Cabin_reduced', 'sex']].fillna(0))
print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Adaboost roc-auc: 0.8161256723642566
Test set
Adaboost roc-auc: 0.8001078480172557


Similarly, the AdaBoost model trained on the variable with **high cardinality over-fits** the train set (roc-auc 0.83 on train vs 0.76 on test), whereas the AdaBoost trained on the **low-cardinality variable does not over-fit** (0.82 vs 0.80) and therefore does a better job of generalising the predictions.

In addition, building an AdaBoost model on a variable with fewer categories in Cabin is a) **simpler** and b) more robust: should a **different category** appear in the test set, taking just the first letter of the cabin means the ML model will know how to handle it, as long as that letter was seen during training.
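Point b) can be illustrated on a toy example. This is a minimal sketch with made-up cabin labels (`C999` and `E1` are invented cabins that do not appear in the training labels; `full_dict` and `deck_dict` are hypothetical names).

```python
import pandas as pd

train_cabin = pd.Series(['C22', 'C85', 'E36', 'B5'])
new_cabin = pd.Series(['C999', 'E1'])  # never seen during training

# Encoders built from the training labels only.
full_dict = {k: i for i, k in enumerate(train_cabin.unique())}
deck_dict = {k: i for i, k in enumerate(train_cabin.str[0].unique())}

# Full labels: both new cabins are unknown to the dictionary, so they map to NaN.
print(new_cabin.map(full_dict).isnull().sum())

# First letter only: decks 'C' and 'E' were seen during training,
# so both new cabins receive a valid code.
print(new_cabin.str[0].map(deck_dict).tolist())
```

A cabin on a deck that never appears in the training set would still map to NaN, so the reduction lowers the risk of unseen labels rather than removing it.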

### Logistic Regression

**Build model on data with plenty of categories in Cabin! Call the model!..**

In [62]:
logit = LogisticRegression(random_state=44, solver='lbfgs')
logit.fit(X_train[['Cabin_mapped', 'sex']], y_train)  # train the model
pred_train = logit.predict_proba(X_train[['Cabin_mapped', 'sex']])  # make predictions
pred_test = logit.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))
print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Logistic regression roc-auc: 0.8133909298124677
Test set
Logistic regression roc-auc: 0.7750815773463858


**Build model on data with fewer categories in Cabin! Call the model!..**

In [63]:
logit = LogisticRegression(random_state=44, solver='lbfgs')
logit.fit(X_train[['Cabin_reduced', 'sex']], y_train)  # train the model
pred_train = logit.predict_proba(X_train[['Cabin_reduced', 'sex']])  # make predictions
pred_test = logit.predict_proba(X_test[['Cabin_reduced', 'sex']].fillna(0))
print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Logistic regression roc-auc: 0.8123468468695123
Test set
Logistic regression roc-auc: 0.8008268347989602


We can draw the same conclusion for Logistic Regression: **reducing the cardinality improves the performance and generalisation of the algorithm**.

### Gradient Boosted Classifier

**Build model on data with plenty of categories in Cabin! Call the model!..**

In [64]:
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)
gbc.fit(X_train[['Cabin_mapped', 'sex']], y_train)  # train the model
pred_train = gbc.predict_proba(X_train[['Cabin_mapped', 'sex']])  # make predictions
pred_test = gbc.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))
print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Gradient Boosted Trees roc-auc: 0.862631390919749
Test set
Gradient Boosted Trees roc-auc: 0.7733117637298823


**Build model on data with fewer categories in Cabin! Call the model!..**

In [65]:
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)
gbc.fit(X_train[['Cabin_reduced', 'sex']], y_train)  # train the model
pred_train = gbc.predict_proba(X_train[['Cabin_reduced', 'sex']])  # make predictions
pred_test = gbc.predict_proba(X_test[['Cabin_reduced', 'sex']].fillna(0))
print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Gradient Boosted Trees roc-auc: 0.816719415917359
Test set
Gradient Boosted Trees roc-auc: 0.8015181682429069


**Gradient Boosted trees are indeed over-fitting** to the training set when the variable Cabin has a lot of labels (roc-auc 0.86 on train vs 0.77 on test, compared with 0.82 vs 0.80 for the reduced variable). This was expected, as **tree methods** tend to be biased towards variables with plenty of categories.