## Homework 4
### Paweł Fijałkowski
#### XAI WB 2022L

In [1]:
import pandas as pd
import numpy as np
import dalex as dx
from math import pi 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

## Data fetch and feature engineering 
#### Similar to previous homeworks

In [2]:
data = pd.read_csv("EPL_2021.csv")

In [3]:
data['result'] = (data['result'] == 'Goal')
data['distance'] = ((105 - (data['X'] * 105)) ** 2 + (32.5 - (data['Y'] * 68)) ** 2) ** 0.5
data["angle"] = np.abs(np.arctan((7.32 * (105 - (data['X'] * 105))) / ((105 - (data['X'] * 105))**2 + (32.5 - (data['Y'] * 68)) ** 2 - (7.32 / 2) ** 2)) * 180 / pi)
data = data[['result', 'h_a', 'situation', 'shotType', 'lastAction', 'minute', 'distance', 'angle']]
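
The two engineered features can be checked on a single shot. Below is a minimal sketch that wraps the same formulas in a function; the helper name `shot_geometry` and the constant names are ours, the units are presumably metres (a 105 x 68 pitch with a 7.32 wide goal), and the `32.5` offset is copied from the cell above as-is. The `arctan` expression is the tangent double-angle form of the angle under which the goal mouth is seen, so for a shot straight in front of goal it should agree with `2 * atan((7.32 / 2) / distance)`.

```python
from math import atan, degrees, pi

import numpy as np

PITCH_LENGTH = 105.0   # same scaling constants as in the notebook cell
PITCH_WIDTH = 68.0
GOAL_WIDTH = 7.32
GOAL_CENTRE_Y = 32.5   # copied from the notebook unchanged


def shot_geometry(x_rel, y_rel):
    """Return (distance, angle in degrees) for relative coordinates in [0, 1]."""
    dx = PITCH_LENGTH - x_rel * PITCH_LENGTH   # distance to the goal line
    dy = GOAL_CENTRE_Y - y_rel * PITCH_WIDTH   # lateral offset from the goal centre
    distance = (dx ** 2 + dy ** 2) ** 0.5
    angle = np.abs(
        np.arctan(GOAL_WIDTH * dx / (dx ** 2 + dy ** 2 - (GOAL_WIDTH / 2) ** 2))
        * 180 / pi
    )
    return distance, angle


# A shot taken 11 units from the goal line, level with the goal centre.
penalty_distance, penalty_angle = shot_geometry(1 - 11 / 105, 32.5 / 68)

# Cross-check against the direct geometric formula for a central shot.
direct_angle = 2 * degrees(atan((GOAL_WIDTH / 2) / 11))
```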

In [4]:
data.head()

|   | result | h_a | situation | shotType | lastAction | minute | distance | angle |
|---|---|---|---|---|---|---|---|---|
| 0 | False | h | OpenPlay | Head | Aerial | 10 | 10.034305 | 37.453252 |
| 1 | False | h | OpenPlay | RightFoot | Throughball | 11 | 14.699726 | 19.232346 |
| 2 | True | h | OpenPlay | RightFoot | BallRecovery | 21 | 19.973838 | 14.099715 |
| 3 | False | h | OpenPlay | RightFoot | Pass | 27 | 19.740004 | 21.007894 |
| 4 | False | h | OpenPlay | RightFoot | Chipped | 29 | 14.008206 | 24.418589 |


In [5]:
categorical_features = (data.dtypes == object)

In [6]:
data.loc[:,categorical_features] = data.loc[:,categorical_features].apply(LabelEncoder().fit_transform)
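
As a small illustration of what `LabelEncoder` does to a single column: it sorts the distinct labels and replaces each one with its position in that order. The toy labels below are ours. The resulting integers carry an arbitrary ordering, which tree-based models can usually work around but a linear model such as logistic regression will treat as numeric.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Toy column with the two labels that appear in `h_a`.
codes = encoder.fit_transform(["h", "a", "h", "a"])

# classes_ holds the sorted labels; a label's index is its integer code.
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
```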

In [7]:
data.head()

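|   | result | h_a | situation | shotType | lastAction | minute | distance | angle |
|---|---|---|---|---|---|---|---|---|
| 0 | False | 1 | 2 | 0 | 0 | 10 | 10.034305 | 37.453252 |
| 1 | False | 1 | 2 | 3 | 27 | 11 | 14.699726 | 19.232346 |
| 2 | True | 1 | 2 | 3 | 1 | 21 | 19.973838 | 14.099715 |
| 3 | False | 1 | 2 | 3 | 20 | 27 | 19.740004 | 21.007894 |
| 4 | False | 1 | 2 | 3 | 6 | 29 | 14.008206 | 24.418589 |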


#### Models & predictions

In [8]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('result', axis=1), data['result'], test_size=0.15)

In [9]:
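# Note: `xgboost` holds scikit-learn's GradientBoostingClassifier, not a model from the XGBoost library
logistic_regression, random_forest, xgboost = LogisticRegression(), RandomForestClassifier(), GradientBoostingClassifier()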
models = [logistic_regression, random_forest, xgboost] 

In [10]:
for model in models:
    model.fit(X_train, y_train)


In [11]:
for model in models:
    print(f"Score: {model.score(X_test, y_test)}")

Score: 0.890892696122633
Score: 0.8954012623985572
Score: 0.8981064021641119


## Actual homework

#### Selecting observation, predictions

In [14]:
observation = 123
any_observation = X_test.iloc[[observation]]
[model.predict_proba(any_observation)[0][1] for model in models ]

[0.20546591676214518, 0.26, 0.22543906406612108]

#### Creating explainers, permutational variable importance

In [15]:
explainers = [dx.Explainer(model, X_test, y_test) for model in models]

Preparation of a new explainer is initiated

  -> data              : 1109 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1109 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x1256827a0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 4.74e-06, mean = 0.104, max = 0.746
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.71, mean = 0.00866, max = 0.987
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 1109 rows 7 cols
  -> target variable   : P



Having created the explainers, we calculate the permutational importance of variables for each model.
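
The idea behind permutational importance can be sketched by hand: shuffle one column at a time, re-score the model, and record how much the loss grows compared with the unshuffled data. The sketch below is not dalex's implementation. The function name `permutation_importance_sketch` and the synthetic data are ours, and `1 - AUC` is used as the loss on the assumption that this matches the default dalex applies to classification models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def permutation_importance_sketch(model, X, y, n_repeats=10, seed=0):
    """Mean increase in 1 - AUC after shuffling each column of X."""
    rng = np.random.default_rng(seed)
    baseline = 1 - roc_auc_score(y, model.predict_proba(X)[:, 1])
    importances = []
    for col in range(X.shape[1]):
        losses = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between this column and the target.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            losses.append(1 - roc_auc_score(y, model.predict_proba(X_perm)[:, 1]))
        importances.append(np.mean(losses) - baseline)
    return np.array(importances)


# Synthetic data in which only the first column carries signal.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)
demo_importance = permutation_importance_sketch(demo_model, X_demo, y_demo)
```

On this toy data the first entry of `demo_importance` should dominate, while the two noise columns should stay close to zero.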

In [16]:
perm_importance = [explainer.model_parts() for explainer in explainers] # default loss for classification models: 1 - AUC

  tpr = pd.Series([0]).append(tpr_temp.y_true.cumsum()) / _df.y_true.sum()
  fpr = pd.Series([0]).append(fpr_temp.y_true.cumsum()) / (1 - _df.y_true).sum()

In [17]:
perm_importance

[<dalex.model_explanations._variable_importance.object.VariableImportance at 0x13832a8c0>,
 <dalex.model_explanations._variable_importance.object.VariableImportance at 0x111c09f30>,
 <dalex.model_explanations._variable_importance.object.VariableImportance at 0x111c456c0>]

In [18]:
perm_importance[0].plot()

In [19]:
perm_importance[1].plot()

In [20]:
perm_importance[2].plot()

## Conclusions
The permutational importance charts for all three models agree that `angle` is one of the most important variables (XGB, i.e. the gradient boosting model, and RF, the random forest: rank 1; LR, the logistic regression: rank 2). `distance` also ranks high (XGB: 1, RF: 3, LR: 2). For each model, the type of shot (`shotType`) and the last action before it (`lastAction`) come next. It is inconclusive which of the two matters more globally: XGB: lastAction > shotType, RF: shotType > lastAction, LR: lastAction > shotType. The time of the shot (`minute`) and `h_a` appear to be irrelevant for all of the models.
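
Comparing orderings across models by eye gets awkward quickly, so it can help to turn the importances into ranks. The sketch below uses placeholder numbers that are not taken from this notebook, and the model and variable names are only for illustration. With dalex, the real values would presumably come from each `VariableImportance` object's `result` table.

```python
import pandas as pd

# Placeholder dropout losses, NOT the values obtained in this notebook.
toy_importance = pd.DataFrame(
    {
        "model_a": {"angle": 0.09, "distance": 0.07, "minute": 0.001},
        "model_b": {"angle": 0.05, "distance": 0.08, "minute": 0.002},
    }
)

# Rank 1 = largest loss increase within each model (column).
toy_ranks = toy_importance.rank(ascending=False).astype(int)

# The average rank across models gives a rough global ordering.
mean_rank = toy_ranks.mean(axis=1).sort_values()
```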

This matches the natural intuition that the exact location of the shot (distance and angle) is the most important factor in the quality of an attempt. Location is also closely related to `shotType` and `lastAction`: for example, close-range shots (low distance, angle close to 0) are often headers following a cross.