See [Black-Box to Glass-Box modeling \[RANDOM_SEED=42\]](Black-Box%20to%20Glass-Box%20modeling%20%5BRANDOM_SEED%3D42%5D.ipynb) for an overview and steps 1 through 6.
# Steps
7. Run SPLIT -> lookahead depth = 2, max depth = 5, lambda = 0.005, 0.006, 0.007, 0.008, 0.009, 0.01
8. Run XGBoost to select features -> baseline iteration, then cumulative gain = 80%, 90%, 95%, 97.5%, 99%
9. Run SPLIT with the selected features

## Get the training, validation, and testing datasets

In [1]:
import pandas as pd
import dimex as dx

RANDOM_SEED = 50

test_encoded_filename = '../airline-passenger-satisfaction/test_clean_encoded.csv'
test_encoded = pd.read_csv(test_encoded_filename, index_col=0)

x_val, x_test, y_val, y_test = dx.split_dataset(test_encoded, test_size=2/3, random_state=RANDOM_SEED)

train_encoded_filename = '../airline-passenger-satisfaction/train_clean_encoded_balanced.csv'
train_encoded = pd.read_csv(train_encoded_filename, index_col=0)

labels = train_encoded.columns[-1]

x_train_balanced, y_train_balanced = train_encoded.drop(columns=[labels]), train_encoded[labels]

## 7 - Run SPLIT -> Lookahead depth = 2, max depth = 5, lambda = 0.005-0.01

In [2]:
# Iteration 0 -> Lambda = 0.005
model_0, tree_0, model_data_0 = dx.train_split(x_train_balanced, y_train_balanced, 2, 5, 0.005)
train_prediction_0 = dx.prediction_split(model_0, x_train_balanced, y_train_balanced)
val_prediction_0 = dx.prediction_split(model_0, x_val, y_val)

# Iteration 1 -> Lambda = 0.006
model_1, tree_1, model_data_1 = dx.train_split(x_train_balanced, y_train_balanced, 2, 5, 0.006)
train_prediction_1 = dx.prediction_split(model_1, x_train_balanced, y_train_balanced)
val_prediction_1 = dx.prediction_split(model_1, x_val, y_val)

# Iteration 2 -> Lambda = 0.007
model_2, tree_2, model_data_2 = dx.train_split(x_train_balanced, y_train_balanced, 2, 5, 0.007)
train_prediction_2 = dx.prediction_split(model_2, x_train_balanced, y_train_balanced)
val_prediction_2 = dx.prediction_split(model_2, x_val, y_val)

# Iteration 3 -> Lambda = 0.008
model_3, tree_3, model_data_3 = dx.train_split(x_train_balanced, y_train_balanced, 2, 5, 0.008)
train_prediction_3 = dx.prediction_split(model_3, x_train_balanced, y_train_balanced)
val_prediction_3 = dx.prediction_split(model_3, x_val, y_val)

# Iteration 4 -> Lambda = 0.009
model_4, tree_4, model_data_4 = dx.train_split(x_train_balanced, y_train_balanced, 2, 5, 0.009)
train_prediction_4 = dx.prediction_split(model_4, x_train_balanced, y_train_balanced)
val_prediction_4 = dx.prediction_split(model_4, x_val, y_val)

# Iteration 5 -> Lambda = 0.01
model_5, tree_5, model_data_5 = dx.train_split(x_train_balanced, y_train_balanced, 2, 5, 0.01)
train_prediction_5 = dx.prediction_split(model_5, x_train_balanced, y_train_balanced)
val_prediction_5 = dx.prediction_split(model_5, x_val, y_val)

models_data = [model_data_0, model_data_1, model_data_2, model_data_3, model_data_4, model_data_5]
train_predictions = [train_prediction_0, train_prediction_1, train_prediction_2, train_prediction_3, train_prediction_4, train_prediction_5]
val_predictions = [val_prediction_0, val_prediction_1, val_prediction_2, val_prediction_3, val_prediction_4, val_prediction_5]

split_results = []

for data, train_pred, val_pred in zip(models_data, train_predictions, val_predictions):
    split_results.append({"lambda": data["lambda"],
                          "leaves": data["leaves"],
                          "training runtime (s)": format(data["runtime"], ".2f"),
                          "training accuracy": format(train_pred[1], ".2%"),
                          "validation accuracy": format(val_pred[1], ".2%")})

display(pd.DataFrame(split_results))

| | lambda | leaves | training runtime (s) | training accuracy | validation accuracy |
|---|---|---|---|---|---|
| 0 | 0.005 | 6 | 5.32 | 88.12% | 89.06% |
| 1 | 0.006 | 6 | 5.19 | 88.12% | 89.06% |
| 2 | 0.007 | 6 | 5.16 | 88.12% | 89.06% |
| 3 | 0.008 | 5 | 5.13 | 87.27% | 88.38% |
| 4 | 0.009 | 5 | 5.16 | 87.27% | 88.38% |
| 5 | 0.01 | 5 | 5.15 | 87.27% | 88.38% |
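
The six near-identical blocks in the cell above could be collapsed into a loop. Here's a minimal sketch of that idea. `sweep_lambdas` is a hypothetical helper (not part of `dimex`), and it only assumes what the cell already shows: the training function returns `(model, tree, model_data)` and the prediction function returns something whose second element is the accuracy. The `fake_*` functions are stand-ins so the sketch runs on its own.

```python
def sweep_lambdas(train_fn, predict_fn, x_train, y_train, x_val, y_val,
                  lambdas, lookahead=2, max_depth=5):
    """Train one model per lambda and collect one results row per model."""
    rows, trees = [], []
    for lam in lambdas:
        model, tree, data = train_fn(x_train, y_train, lookahead, max_depth, lam)
        train_acc = predict_fn(model, x_train, y_train)[1]
        val_acc = predict_fn(model, x_val, y_val)[1]
        trees.append(tree)
        rows.append({
            "lambda": data["lambda"],
            "leaves": data["leaves"],
            "training runtime (s)": format(data["runtime"], ".2f"),
            "training accuracy": format(train_acc, ".2%"),
            "validation accuracy": format(val_acc, ".2%"),
        })
    return rows, trees


# Stand-ins so the sketch is runnable by itself (assumption: in the notebook
# these would be dx.train_split and dx.prediction_split).
def fake_train(x, y, lookahead, max_depth, lam):
    return "model", "tree", {"lambda": lam, "leaves": 6, "runtime": 5.0}


def fake_predict(model, x, y):
    return [], 0.88  # (predictions, accuracy)


rows, trees = sweep_lambdas(fake_train, fake_predict, None, None, None, None,
                            [0.005, 0.006, 0.007])
print(rows[0])
```

In the notebook, the call would pass `dx.train_split`, `dx.prediction_split`, and the real datasets instead of the stand-ins.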


In [3]:
print("lambda",models_data[0]["lambda"], "tree:", tree_0, "\n")
print("lambda",models_data[1]["lambda"], "tree:", tree_1, "\n")
print("lambda",models_data[2]["lambda"], "tree:", tree_2)

lambda 0.005 tree: { feature: 0 [ left child: { prediction: 1, loss: 0.002058142563328147 }, right child: { feature: 8 [ left child: { feature: 1 [ left child: { feature: 3 [ left child: { prediction: 0, loss: 0.03891390189528465 }, right child: { prediction: 1, loss: 0.02518787421286106 }] }, right child: { prediction: 1, loss: 0.037398576736450195 }] }, right child: { feature: 2 [ left child: { prediction: 0, loss: 0.02133789099752903 }, right child: { prediction: 1, loss: 0.0 }] }] }] } 

lambda 0.006 tree: { feature: 0 [ left child: { prediction: 1, loss: 0.002058142563328147 }, right child: { feature: 8 [ left child: { feature: 1 [ left child: { feature: 3 [ left child: { prediction: 0, loss: 0.03891390189528465 }, right child: { prediction: 1, loss: 0.02518787421286106 }] }, right child: { prediction: 1, loss: 0.037398576736450195 }] }, right child: { feature: 2 [ left child: { prediction: 0, loss: 0.02133789099752903 }, right child: { prediction: 1, loss: 0.0 }] }] }] } 

lambda

Despite the different lambdas, the three trees are identical: same number of leaves, same loss, same accuracy. I'm picking lambda = 0.007 because it's the largest value that keeps accuracy and tree size unchanged, which gives the most regularization headroom against overfitting and small shifts in the data or pipeline.
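
The tie-break I used can be written down explicitly. This is a minimal sketch, assuming the rule is "best validation accuracy, then fewest leaves, then largest lambda". The rows are copied from the results table in step 7, and `pick_lambda` is a hypothetical helper, not part of `dimex`.

```python
# (lambda, leaves, validation accuracy in %) copied from the step 7 results table
rows = [
    (0.005, 6, 89.06),
    (0.006, 6, 89.06),
    (0.007, 6, 89.06),
    (0.008, 5, 88.38),
    (0.009, 5, 88.38),
    (0.010, 5, 88.38),
]


def pick_lambda(rows):
    """Assumed rule: best validation accuracy, then fewest leaves, then largest lambda."""
    best = max(rows, key=lambda r: (r[2], -r[1], r[0]))
    return best[0]


chosen = pick_lambda(rows)
print(chosen)
```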

## 8 - Run XGBoost -> baseline iteration, cumulative gain = 80%-99%

In [4]:
# baseline
xgb_baseline, size_baseline, runtime_baseline = dx.train_xgb(x_train_balanced, y_train_balanced, random_state=RANDOM_SEED)
y_train_pred_baseline, acc_baseline = dx.prediction_xgb(xgb_baseline, x_train_balanced, y_train_balanced)
y_val_pred_baseline, acc_val_baseline = dx.prediction_xgb(xgb_baseline, x_val, y_val)

gain_sorted, total_gain = dx.sort_by_gain(xgb_baseline)

# cumulative gain = 80%
xgb_0, size_0, runtime_0, features_0 = dx.cumulative_gain(x_train_balanced, y_train_balanced, gain_sorted, total_gain, .8, random_state=RANDOM_SEED)
y_train_pred_0, acc_0 = dx.prediction_xgb(xgb_0, x_train_balanced[features_0], y_train_balanced)
y_val_pred_0, acc_val_0 = dx.prediction_xgb(xgb_0, x_val[features_0], y_val)

# cumulative gain = 90%
xgb_1, size_1, runtime_1, features_1 = dx.cumulative_gain(x_train_balanced, y_train_balanced, gain_sorted, total_gain, .9, random_state=RANDOM_SEED)
y_train_pred_1, acc_1 = dx.prediction_xgb(xgb_1, x_train_balanced[features_1], y_train_balanced)
y_val_pred_1, acc_val_1 = dx.prediction_xgb(xgb_1, x_val[features_1], y_val)

# cumulative gain = 95%
xgb_2, size_2, runtime_2, features_2 = dx.cumulative_gain(x_train_balanced, y_train_balanced, gain_sorted, total_gain, .95, random_state=RANDOM_SEED)
y_train_pred_2, acc_2 = dx.prediction_xgb(xgb_2, x_train_balanced[features_2], y_train_balanced)
y_val_pred_2, acc_val_2 = dx.prediction_xgb(xgb_2, x_val[features_2], y_val)

# cumulative gain = 97.5%
xgb_3, size_3, runtime_3, features_3 = dx.cumulative_gain(x_train_balanced, y_train_balanced, gain_sorted, total_gain, .975, random_state=RANDOM_SEED)
y_train_pred_3, acc_3 = dx.prediction_xgb(xgb_3, x_train_balanced[features_3], y_train_balanced)
y_val_pred_3, acc_val_3 = dx.prediction_xgb(xgb_3, x_val[features_3], y_val)

# cumulative gain = 99%
xgb_4, size_4, runtime_4, features_4 = dx.cumulative_gain(x_train_balanced, y_train_balanced, gain_sorted, total_gain, .99, random_state=RANDOM_SEED)
y_train_pred_4, acc_4 = dx.prediction_xgb(xgb_4, x_train_balanced[features_4], y_train_balanced)
y_val_pred_4, acc_val_4 = dx.prediction_xgb(xgb_4, x_val[features_4], y_val)

xgb_iterations = ["baseline", "cumulative gain = 80%", "cumulative gain = 90%",
                  "cumulative gain = 95%", "cumulative gain = 97.5%",  "cumulative gain = 99%"]
xgb_train_acc = [acc_baseline, acc_0, acc_1, acc_2, acc_3, acc_4]
xgb_val_acc = [acc_val_baseline, acc_val_0, acc_val_1, acc_val_2, acc_val_3, acc_val_4]
# The baseline uses every column of the training set
xgb_n_feat = [x_train_balanced.shape[1], len(features_0), len(features_1), len(features_2), len(features_3), len(features_4)]

xgb_results = []
for name, train_acc, val_acc, n_feat in zip(xgb_iterations, xgb_train_acc, xgb_val_acc, xgb_n_feat):
    xgb_results.append({"iteration": name,
                        "training accuracy": format(train_acc, ".2%"),
                        "validation accuracy": format(val_acc, ".2%"),
                        "number of features": n_feat})

display(pd.DataFrame(xgb_results))

| | iteration | training accuracy | validation accuracy | number of features |
|---|---|---|---|---|
| 0 | baseline | 92.53% | 93.99% | 23 |
| 1 | cumulative gain = 80% | 91.43% | 92.72% | 9 |
| 2 | cumulative gain = 90% | 92.26% | 93.65% | 13 |
| 3 | cumulative gain = 95% | 92.39% | 93.76% | 16 |
| 4 | cumulative gain = 97.5% | 92.42% | 93.80% | 17 |
| 5 | cumulative gain = 99% | 92.49% | 93.92% | 19 |


Now, this is the trickiest part. Unlike with `RANDOM_SEED=42`, here there's a clear direct relationship between the number of features and accuracy: the most accurate model is the one with the most features, while the least accurate is the one with fewer than half of the original model's features. For that reason, I'll pick both extremes of the feature selection (9 and 19 features) to compare against SPLIT.
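
For context, here's a rough sketch of the cumulative-gain cut-off idea: rank features by gain and keep the shortest prefix whose share of the total gain reaches the threshold. This is my assumption about the selection step behind a helper like `dx.cumulative_gain`, not its actual implementation. `select_by_cumulative_gain` is a hypothetical name and the gain values are made up for illustration.

```python
def select_by_cumulative_gain(gains, threshold):
    """Keep the top-gain features until their share of total gain reaches `threshold` (0-1)."""
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(gains.values())
    selected, running = [], 0.0
    for name, gain in ranked:
        selected.append(name)
        running += gain
        if running / total >= threshold:
            break
    return selected


# Made-up gains, for illustration only (not taken from the trained model)
example_gains = {"a": 50.0, "b": 32.0, "c": 14.0, "d": 4.0}

top_80 = select_by_cumulative_gain(example_gains, 0.80)
top_95 = select_by_cumulative_gain(example_gains, 0.95)
print(top_80, top_95)
```

A higher threshold can only keep the same features or add more, which matches the feature counts growing from 9 to 19 in the table above.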

## 9 - Run SPLIT (lambda = 0.007) with the selected features

In [5]:
# 9 features, 80% cumulative gain
model_6, tree_6, model_data_6 = dx.train_split(x_train_balanced[features_0], y_train_balanced, 2, 5, 0.007)
train_prediction_6 = dx.prediction_split(model_6, x_train_balanced[features_0], y_train_balanced)
val_prediction_6 = dx.prediction_split(model_6, x_val[features_0], y_val)
test_prediction_6 = dx.prediction_split(model_6, x_test[features_0], y_test)

y_test_pred_0, acc_pred_0 = dx.prediction_xgb(xgb_0, x_test[features_0], y_test)

# 19 features, 99% cumulative gain
model_7, tree_7, model_data_7 = dx.train_split(x_train_balanced[features_4], y_train_balanced, 2, 5, 0.007)
train_prediction_7 = dx.prediction_split(model_7, x_train_balanced[features_4], y_train_balanced)
val_prediction_7 = dx.prediction_split(model_7, x_val[features_4], y_val)
test_prediction_7 = dx.prediction_split(model_7, x_test[features_4], y_test)

y_test_pred_4, acc_pred_4 = dx.prediction_xgb(xgb_4, x_test[features_4], y_test)

split_xgb_results = []
split_xgb_results.extend([{"model": "XGBoost (9 features)",
                           "size": str(size_0["trees"]) + " trees, " + str(size_0["leaves"]) + " leaves",
                           "training runtime (s)": format(runtime_0, ".2f"),
                           "training accuracy": format(acc_0, ".2%"),                           
                           "validation accuracy": format(acc_val_0, ".2%"),
                           "testing accuracy": format(acc_pred_0, ".2%"),},
                          {"model": "SPLIT (9 features)",
                           "size": str(model_data_6["leaves"]) + " leaves",
                           "training runtime (s)": format(model_data_6["runtime"], ".2f"),
                           "training accuracy": format(train_prediction_6[1], ".2%"),
                           "validation accuracy": format(val_prediction_6[1], ".2%"),
                           "testing accuracy": format(test_prediction_6[1], ".2%"),},
                         {"model": "XGBoost (19 features)",
                           "size": str(size_4["trees"]) + " trees, " + str(size_4["leaves"]) + " leaves",
                           "training runtime (s)": format(runtime_4, ".2f"),
                           "training accuracy": format(acc_4, ".2%"),                           
                           "validation accuracy": format(acc_val_4, ".2%"),
                           "testing accuracy": format(acc_pred_4, ".2%"),},
                          {"model": "SPLIT (19 features)",
                           "size": str(model_data_7["leaves"]) + " leaves",
                           "training runtime (s)": format(model_data_7["runtime"], ".2f"),
                           "training accuracy": format(train_prediction_7[1], ".2%"),
                           "validation accuracy": format(val_prediction_7[1], ".2%"),
                           "testing accuracy": format(test_prediction_7[1], ".2%"),},])

display(pd.DataFrame(split_xgb_results))

| | model | size | training runtime (s) | training accuracy | validation accuracy | testing accuracy |
|---|---|---|---|---|---|---|
| 0 | XGBoost (9 features) | 100 trees, 758 leaves | 0.14 | 91.43% | 92.72% | 92.49% |
| 1 | SPLIT (9 features) | 6 leaves | 4.05 | 88.12% | 89.06% | 88.80% |
| 2 | XGBoost (19 features) | 100 trees, 758 leaves | 0.18 | 92.49% | 93.92% | 93.38% |
| 3 | SPLIT (19 features) | 6 leaves | 5.06 | 88.12% | 89.06% | 88.80% |


Now, this is interesting. While feature selection affected XGBoost's results, it didn't affect SPLIT's: accuracy and tree size were exactly the same with 9 and 19 features, so I'll proceed with the smaller feature set.

**Features selected**

In [6]:
print(features_0)

['Online_boarding', 'Type_of_Travel_Personal Travel', 'Class_Eco', 'Inflight_wifi_service', 'On_board_service', 'Customer_Type_disloyal Customer', 'Flight_Distance', 'Inflight_entertainment', 'Leg_room_service']


## 9.1 - Run SPLIT (lookahead depth = 2-4, max depth = 6, lambda = 0.005) with selected features

Here I'm comparing XGBoost and SPLIT after feature selection against SPLIT trained with a different set of parameters. I decided to extend SPLIT's maximum depth to 6, try different values of the lookahead depth again, and use a smaller lambda (0.005), which penalizes extra leaves less, while keeping the tree small enough to stay interpretable.

In [7]:
# Keep only the XGBoost (9 features) row from step 9
del split_xgb_results[1:]

# Lookahead depth = 2, max depth = 6, lambda = 0.005
model_8, tree_8, model_data_8 = dx.train_split(x_train_balanced[features_0], y_train_balanced, 2, 6, 0.005)
train_prediction_8 = dx.prediction_split(model_8, x_train_balanced[features_0], y_train_balanced)
val_prediction_8 = dx.prediction_split(model_8, x_val[features_0], y_val)
test_prediction_8 = dx.prediction_split(model_8, x_test[features_0], y_test)

# Lookahead depth = 3, max depth = 6, lambda = 0.005
model_9, tree_9, model_data_9 = dx.train_split(x_train_balanced[features_0], y_train_balanced, 3, 6, 0.005)
train_prediction_9 = dx.prediction_split(model_9, x_train_balanced[features_0], y_train_balanced)
val_prediction_9 = dx.prediction_split(model_9, x_val[features_0], y_val)
test_prediction_9 = dx.prediction_split(model_9, x_test[features_0], y_test)

# Lookahead depth = 4, max depth = 6, lambda = 0.005
model_10, tree_10, model_data_10 = dx.train_split(x_train_balanced[features_0], y_train_balanced, 4, 6, 0.005)
train_prediction_10 = dx.prediction_split(model_10, x_train_balanced[features_0], y_train_balanced)
val_prediction_10 = dx.prediction_split(model_10, x_val[features_0], y_val)
test_prediction_10 = dx.prediction_split(model_10, x_test[features_0], y_test)

split_xgb_results.extend([{"model": "SPLIT (lookahead depth=2, max depth=5, lambda=0.007)",
                           "size": str(model_data_6["leaves"]) + " leaves",
                           "training runtime (s)": format(model_data_6["runtime"], ".2f"),
                           "training accuracy": format(train_prediction_6[1], ".2%"),
                           "validation accuracy": format(val_prediction_6[1], ".2%"),
                           "testing accuracy": format(test_prediction_6[1], ".2%"),},
                          {"model": "SPLIT (lookahead depth=2, max depth=6, lambda=0.005)",
                           "size": str(model_data_8["leaves"]) + " leaves",
                           "training runtime (s)": format(model_data_8["runtime"], ".2f"),
                           "training accuracy": format(train_prediction_8[1], ".2%"),
                           "validation accuracy": format(val_prediction_8[1], ".2%"),
                           "testing accuracy": format(test_prediction_8[1], ".2%"),},
                            {"model": "SPLIT (lookahead depth=3, max depth=6, lambda=0.005)",
                           "size": str(model_data_9["leaves"]) + " leaves",
                           "training runtime (s)": format(model_data_9["runtime"], ".2f"),
                           "training accuracy": format(train_prediction_9[1], ".2%"),
                           "validation accuracy": format(val_prediction_9[1], ".2%"),
                           "testing accuracy": format(test_prediction_9[1], ".2%"),},
                            {"model": "SPLIT (lookahead depth=4, max depth=6, lambda=0.005)",
                           "size": str(model_data_10["leaves"]) + " leaves",
                           "training runtime (s)": format(model_data_10["runtime"], ".2f"),
                           "training accuracy": format(train_prediction_10[1], ".2%"),
                           "validation accuracy": format(val_prediction_10[1], ".2%"),
                           "testing accuracy": format(test_prediction_10[1], ".2%"),},])

pd.set_option("display.max_colwidth", None)
display(pd.DataFrame(split_xgb_results))

| | model | size | training runtime (s) | training accuracy | validation accuracy | testing accuracy |
|---|---|---|---|---|---|---|
| 0 | XGBoost (9 features) | 100 trees, 758 leaves | 0.14 | 91.43% | 92.72% | 92.49% |
| 1 | SPLIT (lookahead depth=2, max depth=5, lambda=0.007) | 6 leaves | 4.05 | 88.12% | 89.06% | 88.80% |
| 2 | SPLIT (lookahead depth=2, max depth=6, lambda=0.005) | 8 leaves | 4.89 | 90.36% | 92.27% | 91.73% |
| 3 | SPLIT (lookahead depth=3, max depth=6, lambda=0.005) | 8 leaves | 5.13 | 89.83% | 90.40% | 90.35% |
| 4 | SPLIT (lookahead depth=4, max depth=6, lambda=0.005) | 8 leaves | 11.51 | 89.67% | 90.40% | 90.37% |


Like with `RANDOM_SEED=42`, the choice is **model[2]**. It differs from the baseline (model[1]) in the extra level of maximum depth and the smaller lambda, with a minimal increase in runtime (4.05 s to 4.89 s) and a tree of 8 leaves that remains reasonably interpretable.
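
To put the trade-off in numbers, the gap to XGBoost can be computed straight from the table above. The figures below are copied from it; the dictionary layout and variable names are only for illustration.

```python
# (leaves, testing accuracy in %) copied from the comparison table
xgb_leaves, xgb_test_acc = 758, 92.49
split_models = {
    "lookahead=2, max depth=5, lambda=0.007": (6, 88.80),
    "lookahead=2, max depth=6, lambda=0.005": (8, 91.73),
    "lookahead=3, max depth=6, lambda=0.005": (8, 90.35),
    "lookahead=4, max depth=6, lambda=0.005": (8, 90.37),
}

# Gap to XGBoost in percentage points, rounded to two decimals
gaps = {name: round(xgb_test_acc - acc, 2) for name, (_, acc) in split_models.items()}

for name, (leaves, _) in split_models.items():
    print(f"SPLIT ({name}): {leaves} leaves vs {xgb_leaves}, "
          f"{gaps[name]:.2f} points below XGBoost on the test set")
```

On these numbers, model[2] has the smallest gap to XGBoost of the four SPLIT variants.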

**Model[2] tree**

In [8]:
print(tree_8)

{ feature: 4 [ left child: { prediction: 1, loss: 0.002058142563328147 }, right child: { feature: 2 [ left child: { feature: 5 [ left child: { feature: 8 [ left child: { feature: 1 [ left child: { prediction: 0, loss: 0.033169761300086975 }, right child: { prediction: 1, loss: 0.0007312324596568942 }] }, right child: { feature: 7 [ left child: { prediction: 1, loss: 0.004202384036034346 }, right child: { prediction: 0, loss: 0.002792779356241226 }] }] }, right child: { prediction: 1, loss: 0.037398576736450195 }] }, right child: { feature: 6 [ left child: { prediction: 0, loss: 0.02133789099752903 }, right child: { prediction: 1, loss: 0.0 }] }] }] }
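
The printed tree is hard to read in one line, so here's a small sketch that summarizes it from the text alone. It assumes the printed format shown above (`feature: N [ ... ]` for splits, `prediction: N` for leaves). `summarize_tree` is a hypothetical helper, and I'm not mapping the feature indices back to column names here.

```python
import re

# Tree text copied from the output above, split across lines for readability
TREE_8_TEXT = (
    "{ feature: 4 [ left child: { prediction: 1, loss: 0.002058142563328147 }, "
    "right child: { feature: 2 [ left child: { feature: 5 [ left child: { feature: 8 [ "
    "left child: { feature: 1 [ left child: { prediction: 0, loss: 0.033169761300086975 }, "
    "right child: { prediction: 1, loss: 0.0007312324596568942 }] }, "
    "right child: { feature: 7 [ left child: { prediction: 1, loss: 0.004202384036034346 }, "
    "right child: { prediction: 0, loss: 0.002792779356241226 }] }] }, "
    "right child: { prediction: 1, loss: 0.037398576736450195 }] }, "
    "right child: { feature: 6 [ left child: { prediction: 0, loss: 0.02133789099752903 }, "
    "right child: { prediction: 1, loss: 0.0 }] }] }] }"
)


def summarize_tree(tree_text):
    """Count leaves, list split features, and measure the deepest chain of splits."""
    leaves = len(re.findall(r"prediction:\s*\d+", tree_text))
    features = sorted({int(m) for m in re.findall(r"feature:\s*(\d+)", tree_text)})
    depth = deepest = 0
    for ch in tree_text:
        if ch == "[":      # each split opens one bracket
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "]":    # and closes it after both children
            depth -= 1
    return {"leaves": leaves, "split_features": features, "deepest_split_chain": deepest}


summary = summarize_tree(TREE_8_TEXT)
print(summary)
```

The leaf count from the text agrees with the 8 leaves reported for model[2] in the table.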
