## Homework 5
### Paweł Fijałkowski
#### XAI WB 2022L

In [1]:
import pandas as pd
import numpy as np
import dalex as dx
from math import pi 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

## Data fetch and feature engineering 
#### Similiar to previous homeworks

In [2]:
data = pd.read_csv("EPL_2021.csv")

In [3]:
data['result'] = (data['result'] == 'Goal')
data['distance'] = ((105 - (data['X'] * 105)) ** 2 + (32.5 - (data['Y'] * 68)) ** 2) ** 0.5
data["angle"] = np.abs(np.arctan((7.32 * (105 - (data['X'] * 105))) / ((105 - (data['X'] * 105))**2 + (32.5 - (data['Y'] * 68)) ** 2 - (7.32 / 2) ** 2)) * 180 / pi)
data = data[['result', 'h_a', 'situation', 'shotType', 'lastAction', 'minute', 'distance', 'angle']]

In [4]:
data.head()

Unnamed: 0,result,h_a,situation,shotType,lastAction,minute,distance,angle
0,False,h,OpenPlay,Head,Aerial,10,10.034305,37.453252
1,False,h,OpenPlay,RightFoot,Throughball,11,14.699726,19.232346
2,True,h,OpenPlay,RightFoot,BallRecovery,21,19.973838,14.099715
3,False,h,OpenPlay,RightFoot,Pass,27,19.740004,21.007894
4,False,h,OpenPlay,RightFoot,Chipped,29,14.008206,24.418589


In [5]:
categorical_features = (data.dtypes == object)

In [6]:
data.loc[:,categorical_features] = data.loc[:,categorical_features].apply(LabelEncoder().fit_transform)

In [7]:
data.head()

Unnamed: 0,result,h_a,situation,shotType,lastAction,minute,distance,angle
0,False,1,2,0,0,10,10.034305,37.453252
1,False,1,2,3,27,11,14.699726,19.232346
2,True,1,2,3,1,21,19.973838,14.099715
3,False,1,2,3,20,27,19.740004,21.007894
4,False,1,2,3,6,29,14.008206,24.418589


#### Models & predictions

In [8]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('result', axis=1), data['result'], test_size=0.15)

In [9]:
logistic_regression, random_forest, xgboost = LogisticRegression(), RandomForestClassifier(), GradientBoostingClassifier()
models = [logistic_regression, random_forest, xgboost] 

In [10]:
for model in models:
    model.fit(X_train, y_train)


In [11]:
for model in models:
    print(f"Score: {model.score(X_test, y_test)}")

Score: 0.8972046889089269
Score: 0.9008115419296664
Score: 0.9008115419296664


## Actual homework

#### Selecting observation, predictions

In [13]:
observation = 144
any_observation = X_test.iloc[[observation]]
[model.predict_proba(any_observation)[0][1] for model in models ]

[0.16790393400561074, 0.08, 0.10058551377848636]

#### Explainers

In [14]:
explainers = [dx.Explainer(model, X_test, y_test) for model in models]

Preparation of a new explainer is initiated

  -> data              : 1109 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1109 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x11a5727a0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.77e-06, mean = 0.107, max = 0.725
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.725, mean = -0.0058, max = 0.986
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 1109 rows 7 cols
  -> target variable   : 



Now we calculate Partial Dependence Profiles (PDP)

In [15]:
partial_dependent_profiles = [explainer.model_profile(type="partial") for explainer in explainers] # default - logloss function

Calculating ceteris paribus: 100%|██████████| 7/7 [00:00<00:00, 97.04it/s]
Calculating ceteris paribus: 100%|██████████| 7/7 [00:01<00:00,  6.54it/s]
Calculating ceteris paribus: 100%|██████████| 7/7 [00:00<00:00, 46.82it/s]


In [16]:
partial_dependent_profiles

[<dalex.model_explanations._aggregated_profiles.object.AggregatedProfiles at 0x12d2e2aa0>,
 <dalex.model_explanations._aggregated_profiles.object.AggregatedProfiles at 0x12d2e2500>,
 <dalex.model_explanations._aggregated_profiles.object.AggregatedProfiles at 0x12cfcabc0>]

In [17]:
partial_dependent_profiles[0].plot()

In [18]:
partial_dependent_profiles[1].plot()

In [19]:
partial_dependent_profiles[2].plot()

Now let's create accumulated local explanations for each of the models with respect to selected observation

In [20]:
accumulated_local_explanations = [explainer.model_profile(type="accumulated") for explainer in explainers]

Calculating ceteris paribus: 100%|██████████| 7/7 [00:00<00:00, 117.39it/s]
Calculating accumulated dependency: 100%|██████████| 7/7 [00:00<00:00, 12.25it/s]
Calculating ceteris paribus: 100%|██████████| 7/7 [00:01<00:00,  6.23it/s]
Calculating accumulated dependency: 100%|██████████| 7/7 [00:00<00:00, 11.59it/s]
Calculating ceteris paribus: 100%|██████████| 7/7 [00:00<00:00, 44.64it/s]
Calculating accumulated dependency: 100%|██████████| 7/7 [00:00<00:00, 12.67it/s]


In [21]:
accumulated_local_explanations[0].plot()

In [22]:
accumulated_local_explanations[1].plot()

In [23]:
accumulated_local_explanations[2].plot()

### PDP conclusions
Similarly to permutional importance charts, Partial Dependence Profiles (`PDP`) suggest highest values of distance decrase probability of scoring a goal. Naturally, the lower the angle is, the harder is it to score and `PDP` reflects that in for all of the models. This kind of interpretation gives us confidence that methods that we are using actually reflect reality.

### ALE conclusions
Accumulated Local Explanation for selected observation `144` and each model seem to follow those from Partial Dependence Profiles. The diffrence that we can see between those explanations is that `ALE` seems to be more stable and increase/decrease is more quiescent.  