# Ceteris Paribus (CP) profiles for explaining predictions on the Titanic dataset

author: Witalis Domitrz <witekdomitrz@gmail.com>

## 4. Comparison of CP profiles for different observations

Real value:
              Survived
PassengerId          
1000              0.0


Real value:
              Survived
PassengerId          
1006              1.0


For the two selected observations the model responds very differently to the "Age" variable. For the first observation (a male third-class passenger who paid a fare of £8.7 and embarked in Southampton) there is a sharp gap between children and adults: children are very likely to survive, while adults will most likely die. There are some minor fluctuations, but they are almost negligible compared with the huge (more than 80 percentage points) "adulthood gap". For the second observation (a female first-class passenger who paid a fare of £221.8 and embarked in the same city) age has a positive effect: those older than 30 almost certainly survive, while younger passengers with otherwise identical values are roughly 10 to 20 percentage points more likely to die. I find it strange that the model predicts that very young children are more likely to die with all other variables set as in the first observation than as in the second one. The two observations have almost the same CP profile (up to a vertical shift) for the "SibSp" variable, the number of siblings and spouses aboard. The "Fare" profiles also differ, but they share one property: any fare above £250 has the same effect as a fare of exactly £250. The differences in the "Fare" and "Age" CP profiles might be caused by the different classes of the two observations.
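The claim that two profiles are the same up to a vertical shift (translation) can be checked numerically. The sketch below is only an illustration: the profile values are made up, not read from the plots, and `shift_residual` is a hypothetical helper. It removes the mean offset between two profiles on a shared grid and reports the largest deviation that remains.

```python
from statistics import mean


def shift_residual(profile_a, profile_b):
    """Return the mean vertical offset and the largest deviation left after removing it."""
    diffs = [a - b for a, b in zip(profile_a, profile_b)]
    offset = mean(diffs)
    return offset, max(abs(d - offset) for d in diffs)


# Made-up profile values on a shared grid, for illustration only.
sibsp_obs_a = [0.12, 0.10, 0.07, 0.05, 0.05]
sibsp_obs_b = [0.95, 0.93, 0.90, 0.88, 0.88]
age_obs_a = [0.90, 0.85, 0.10, 0.08, 0.07]
age_obs_b = [0.80, 0.85, 0.90, 0.99, 0.99]

for name, a, b in [("SibSp", sibsp_obs_a, sibsp_obs_b), ("Age", age_obs_a, age_obs_b)]:
    offset, residual = shift_residual(a, b)
    print(f"{name}: offset={offset:+.2f}, residual after shift={residual:.2f}")
```

A residual close to zero means the two profiles differ only by a constant, while a large residual means their shapes genuinely differ.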

## 5. Differences between CP profiles of the same observation for two models

Real value:
              Survived
PassengerId          
895               0.0


Real value:
              Survived
PassengerId          
895               0.0


The two models give very different predictions for the selected observation. The first model predicts correctly and the second one does not. The "Age" profile is clearly much more reasonable for the first model; the second model may have overfitted, since its profile has several strange, sharp edges. Note that the selected observation comes from the test set (not the training set), so overfitting would hurt the accuracy of this prediction. The second model is also convinced that having no siblings raises this person's probability of survival. That may be the case, but such a large difference between 0 and any other number of siblings still looks alarming (the model might of course have found a strong dependency that the first model missed, but its lower accuracy suggests otherwise). The fluctuations in the second model's "Fare" profile may also indicate some overfitting, but the "Age" profile is still the strangest one.
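One rough way to put a number on "sharp edges" is to compare the total variation of a profile (the sum of absolute changes between neighbouring grid points) with its net change from the first to the last grid point. This is only a heuristic sketch with made-up profile values, not a test for overfitting; `total_variation` and `net_change` are hypothetical helpers.

```python
def total_variation(profile):
    """Sum of absolute changes between neighbouring grid points."""
    return sum(abs(b - a) for a, b in zip(profile, profile[1:]))


def net_change(profile):
    """Absolute difference between the last and the first grid point."""
    return abs(profile[-1] - profile[0])


# Made-up profiles with the same start and end values, for illustration only.
smooth = [0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
jagged = [0.60, 0.20, 0.70, 0.15, 0.55, 0.10]

for name, profile in [("smooth", smooth), ("jagged", jagged)]:
    print(f"{name}: total variation={total_variation(profile):.2f}, "
          f"net change={net_change(profile):.2f}")
```

For a monotone profile the two quantities coincide; a total variation several times larger than the net change indicates a profile that oscillates.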

# Appendix

## Preparation

### Install modules

In [3]:
!pip3 install pyCeterisParibus

Defaulting to user installation because normal site-packages is not writeable


### Download the data

In [4]:
!wget http://students.mimuw.edu.pl/~wd393711/iml/titanic.zip
!unzip -o titanic.zip
!rm titanic.zip

--2020-04-16 18:27:06--  http://students.mimuw.edu.pl/~wd393711/iml/titanic.zip
Resolving students.mimuw.edu.pl (students.mimuw.edu.pl)... 2001:6a0:5001:1::3, 193.0.96.129
Connecting to students.mimuw.edu.pl (students.mimuw.edu.pl)|2001:6a0:5001:1::3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34877 (34K) [application/zip]
Saving to: ‘titanic.zip’


2020-04-16 18:27:07 (330 KB/s) - ‘titanic.zip’ saved [34877/34877]

Archive:  titanic.zip
  inflating: gender_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


### Prepare the data

In [5]:
import numpy as np
import pandas as pd

used_columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Survived']
X_columns = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
Y_columns = ['Survived']

def load_data(fn):
    return pd.read_csv(fn).set_index('PassengerId')

def to_array(data):
    return pd.get_dummies(data).astype(dtype='float32')

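def data_preprocessing():
    global train, test
    train = load_data('./train.csv')
    # The test labels come from gender_submission.csv. NOTE: if this is Kaggle's
    # sample submission file, they are a sex-based baseline, not ground truth.
    test = load_data('./gender_submission.csv').join(load_data('./test.csv'))

    train['is_train'] = True
    test['is_train'] = False
    data = pd.concat([train, test])

    # Replace missing numeric values with the column mean
    # (numeric_only=True is needed in recent pandas versions when text columns are present)
    data.fillna(data.mean(numeric_only=True), inplace=True)

    # Split back into train and test
    train, test = data[data['is_train']], data[~data['is_train']]

    # Keep only the columns that are used
    train, test = train[used_columns], test[used_columns]

    # One-hot encode the categorical columns and cast everything to float32
    train, test = to_array(train), to_array(test)


data_preprocessing()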


def split_to_x_y(data):
    return to_array(data[X_columns]), to_array(data[Y_columns])

def get_data():
    return split_to_x_y(train), split_to_x_y(test)

def unlabel_data(*sets):
    return ((part.values for part in subset) for subset in sets)

In [6]:
(train_x, train_y), (test_x, test_y) = get_data()
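The two preparation steps used above, mean imputation and one-hot encoding, can be illustrated without pandas on a few made-up rows. This is a minimal sketch: the rows and the helpers `impute_mean` and `one_hot` are hypothetical. It mimics one detail of `pd.get_dummies` with default settings: a row with a missing category gets zeros in all indicator columns, so a missing "Embarked" value is not imputed by the mean-based fill.

```python
from statistics import mean

# Made-up rows; None marks a missing value.
rows = [
    {"Age": 22.0, "Sex": "male", "Embarked": "S"},
    {"Age": None, "Sex": "female", "Embarked": "C"},
    {"Age": 38.0, "Sex": "female", "Embarked": None},
]


def impute_mean(rows, column):
    """Replace missing values in a numeric column with the mean of the known ones."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows if r[column] is not None})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1.0 if r[column] == category else 0.0
        encoded.append(new_row)
    return encoded


prepared = one_hot(one_hot(impute_mean(rows, "Age"), "Sex"), "Embarked")
for row in prepared:
    print(row)
```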

## 1.
For the selected data set, train at least one tree-based ensemble model (random forest, gbm, catboost or any other boosting)

In [7]:
from xgboost import XGBClassifier as Model
from sklearn.pipeline import Pipeline

### Create a model

In [8]:
model = Model(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

### Train a model

In [9]:
model.fit(train_x, train_y.values[::,0])

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=1, nthread=1, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=0, silent=None,
              subsample=1, tree_method=None, validate_parameters=False,
              verbosity=1)

### Check the first model accuracy

In [10]:
from sklearn.metrics import accuracy_score
test_pred = model.predict(test_x)
test_accuracy1 = accuracy_score(test_y, [round(value) for value in test_pred])
print("Test accuracy of the first model: {}".format(test_accuracy1))

Test accuracy of the first model: 0.8827751196172249


## 2.
For a selected observation from this dataset, calculate the predictions of model (1).

### Select an observation

In [11]:
def select_an_observation(passenger_id):
    # Index with a list so that the result stays a one-row DataFrame
    obs = test.loc[[passenger_id]]
    x, y = split_to_x_y(obs)
    return (x, y), obs

In [12]:
i2 = 894
(x2, y2), obs2 = select_an_observation(i2)

In [13]:
obs2

| PassengerId | Pclass | Age | SibSp | Parch | Fare | Survived | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 894 | 2.0 | 62.0 | 0.0 | 0.0 | 9.6875 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |


### Calculate the model predictions

In [14]:
pred2 = model.predict_proba(x2)
print("The model predicted that this person would survive with probability:", pred2[0][1], "and the truth is:", y2.values[0,0])

The model predicted that this person would survive with probability: 0.10084572 and the truth is: 0.0


## 3.
For the observation selected in (2), calculate the decomposition of the model prediction using Ceteris Paribus / ICE profiles (packages for R: DALEX, ALEPlot, ingredients; packages for Python: pyCeterisParibus).
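To make the idea of a Ceteris Paribus profile concrete, the following minimal sketch computes one by hand: a single variable is varied over a grid while all other variables keep the values of the selected observation. The function `toy_predict`, its coefficients, and the grid are assumptions made for illustration; in the notebook the role of `toy_predict` is played by the trained model's `predict_proba`.

```python
import math


def toy_predict(obs):
    """Hypothetical stand-in for the positive-class output of model.predict_proba."""
    # Made-up logistic score: survival odds fall with age and rise with fare.
    score = 1.5 - 0.05 * obs["Age"] + 0.01 * obs["Fare"]
    return 1.0 / (1.0 + math.exp(-score))


def cp_profile(predict, obs, variable, grid):
    """Vary one variable over `grid`, keeping all other variables as in `obs`."""
    profile = []
    for value in grid:
        modified = dict(obs)  # copy, so the original observation stays intact
        modified[variable] = value
        profile.append((value, predict(modified)))
    return profile


# Age and Fare of passenger 894 from the notebook; the model itself is a toy.
observation = {"Age": 62.0, "Fare": 9.6875}
age_grid = [0, 20, 40, 60, 80]

for age, prob in cp_profile(toy_predict, observation, "Age", age_grid):
    print(f"Age={age:>2}: predicted survival = {prob:.3f}")
```

Plotting the resulting pairs against the grid gives the kind of curve that `plot_notebook` draws for each selected variable.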

### Use pyCeterisParibus

In [15]:
from ceteris_paribus.explainer import explain as Explainer

### Create an explainer function using Ceteris Paribus

In [16]:
from ceteris_paribus.profiles import individual_variable_profile
from ceteris_paribus.plots.plots import plot_notebook
from collections.abc import Iterable
# I have selected the variables based on the LIME explanations from the previous week
DEFAULT_SELECTED_VARIABLES=['Fare', 'Age', 'SibSp']
def explain(ids, selected_variables=DEFAULT_SELECTED_VARIABLES, model=model):
    # Create an explainer; `y` holds the test labels (not the column names)
    explainer = Explainer(model, variable_names=test_x.columns, data=test_x, y=test_y.values[:, 0], predict_function=lambda X: model.predict_proba(X)[:, 1], label=str(model))
    def helper(i):
        ((x, y), _) = select_an_observation(i)
        print("Real value:\n", y)
        # Ceteris Paribus profile for the single observation
        cp = individual_variable_profile(explainer, x.values[0], y.values[0])
        plot_notebook(cp, selected_variables=selected_variables)

    # Support a single value as well as a list
    ids = ids if isinstance(ids, Iterable) else [ids]

    for i in ids:
        helper(i)

### Explain using Ceteris Paribus

In [17]:
i3 = i2
explain(i3)

Real value:
              Survived
PassengerId          
894               0.0


## 4.
Find two observations in the data set that have different CP profiles (e.g. the model response grows with age for one observation and falls with age for the other). Note that the model needs to include interactions for such differences to appear.
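The remark about interactions can be illustrated with two toy models. In an additive model the effect of age is the same for every observation, so the CP profiles of two observations are parallel and the gap between them is constant. With an interaction the gap changes along the grid. Both models below and their coefficients are made up for illustration.

```python
def additive_model(age, is_female):
    # No interaction: the age effect is identical for both sexes.
    return 0.5 - 0.004 * age + 0.3 * is_female


def interaction_model(age, is_female):
    # Interaction: the sign of the age effect depends on sex.
    slope = 0.004 if is_female else -0.004
    return 0.5 + slope * age + 0.1 * is_female


def age_profile(model, is_female, ages):
    """CP profile over age for an observation with the given sex."""
    return [model(age, is_female) for age in ages]


def profile_gap_range(model, ages):
    """Spread of the pointwise gap between the female and the male profile."""
    male = age_profile(model, 0, ages)
    female = age_profile(model, 1, ages)
    gaps = [f - m for f, m in zip(female, male)]
    return max(gaps) - min(gaps)


ages = list(range(0, 81, 10))
print("additive model, gap range:   ", round(profile_gap_range(additive_model, ages), 6))
print("interaction model, gap range:", round(profile_gap_range(interaction_model, ages), 6))
```

A gap range of (numerically) zero means the profiles are parallel; only the interaction model lets one profile rise with age while the other falls.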

In [18]:
explain([1000,1006])

Real value:
              Survived
PassengerId          
1000              0.0


Real value:
              Survived
PassengerId          
1006              1.0


## 5.
Train a second model (of any class: neural nets, linear, other boosting) and find an observation for which the CP profiles differ between the models.
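One possible way to search for such an observation, instead of inspecting plots one by one, is to compute the CP profile of each candidate under both models and rank the candidates by the mean absolute difference between the two profiles. The sketch below assumes two made-up models (`model_a`, `model_b`) and three made-up observations; it is not the procedure used in this notebook, where the observation was picked by hand.

```python
import math


def model_a(obs):
    """Hypothetical smooth model: survival falls steadily with age and family size."""
    z = 1.0 - 0.04 * obs["Age"] - 0.3 * obs["SibSp"]
    return 1.0 / (1.0 + math.exp(-z))


def model_b(obs):
    """Hypothetical tree-like model: a single step that helps children in small families."""
    if obs["Age"] < 16 and obs["SibSp"] <= 1:
        return 0.85
    return 0.15


def cp_profile(predict, obs, variable, grid):
    """Predictions when `variable` runs over `grid` and everything else is fixed."""
    return [predict({**obs, variable: value}) for value in grid]


def profile_distance(obs, variable, grid):
    """Mean absolute difference between the two models' CP profiles."""
    a = cp_profile(model_a, obs, variable, grid)
    b = cp_profile(model_b, obs, variable, grid)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(grid)


# Made-up observations, keyed by hypothetical labels.
observations = {
    "A": {"Age": 30.0, "SibSp": 0},
    "B": {"Age": 8.0, "SibSp": 4},
    "C": {"Age": 55.0, "SibSp": 1},
}
age_grid = list(range(0, 81, 5))

distances = {key: profile_distance(obs, "Age", age_grid) for key, obs in observations.items()}
most_different = max(distances, key=distances.get)

for key, distance in distances.items():
    print(f"observation {key}: mean |difference| = {distance:.3f}")
print("largest disagreement:", most_different)
```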

### Use scikit-learn random forest classifier

In [19]:
from sklearn.ensemble import RandomForestClassifier as Model2

### Create a model

In [20]:
model2 = Model2()

### Train a model

In [21]:
model2.fit(train_x, train_y.values[::,0])

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

### Check the second model accuracy

In [22]:
from sklearn.metrics import accuracy_score
test_pred = model2.predict(test_x)
test_accuracy2 = accuracy_score(test_y, [round(value) for value in test_pred])
print("Test accuracy of the second model: {}".format(test_accuracy2))

Test accuracy of the second model: 0.8157894736842105


### Find an observation for which CP profiles are different between the models

In [23]:
explainer2 = Explainer(model2, variable_names=test_x.columns, data=test_x, y=test_y.values[:, 0], predict_function=lambda X: model2.predict_proba(X)[:, 1], label="RandomForestClassifier")

In [24]:
explain([895], model=model)
explain([895], model=model2)

Real value:
              Survived
PassengerId          
895               0.0


Real value:
              Survived
PassengerId          
895               0.0


## 6.
Comment on the results for points (4) and (5)

4. Comparison of CP profiles for different observations

    For the two selected observations the model responds very differently to the "Age" variable. For the first observation (a male third-class passenger who paid a fare of £8.7 and embarked in Southampton) there is a sharp gap between children and adults: children are very likely to survive, while adults will most likely die. There are some minor fluctuations, but they are almost negligible compared with the huge (more than 80 percentage points) "adulthood gap". For the second observation (a female first-class passenger who paid a fare of £221.8 and embarked in the same city) age has a positive effect: those older than 30 almost certainly survive, while younger passengers with otherwise identical values are roughly 10 to 20 percentage points more likely to die. I find it strange that the model predicts that very young children are more likely to die with all other variables set as in the first observation than as in the second one. The two observations have almost the same CP profile (up to a vertical shift) for the "SibSp" variable, the number of siblings and spouses aboard. The "Fare" profiles also differ, but they share one property: any fare above £250 has the same effect as a fare of exactly £250. The differences in the "Fare" and "Age" CP profiles might be caused by the different classes of the two observations.

5. Differences between CP profiles of the same observation for two models

    The two models give very different predictions for the selected observation. The first model predicts correctly and the second one does not. The "Age" profile is clearly much more reasonable for the first model; the second model may have overfitted, since its profile has several strange, sharp edges. Note that the selected observation comes from the test set (not the training set), so overfitting would hurt the accuracy of this prediction. The second model is also convinced that having no siblings raises this person's probability of survival. That may be the case, but such a large difference between 0 and any other number of siblings still looks alarming (the model might of course have found a strong dependency that the first model missed, but its lower accuracy suggests otherwise). The fluctuations in the second model's "Fare" profile may also indicate some overfitting, but the "Age" profile is still the strangest one.