## Homework 3
### Paweł Fijałkowski
#### XAI WB 2022L

In [40]:
import pandas as pd
import numpy as np
import dalex as dx
from math import pi
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

#### Data fetch and feature engineering

In [32]:
data = pd.read_csv("EPL_2021.csv")

In [33]:
data['result'] = (data['result'] == 'Goal')
data['distance'] = ((105 - (data['X'] * 105)) ** 2 + (32.5 - (data['Y'] * 68)) ** 2) ** 0.5
data["angle"] = np.abs(np.arctan((7.32 * (105 - (data['X'] * 105))) / ((105 - (data['X'] * 105))**2 + (32.5 - (data['Y'] * 68)) ** 2 - (7.32 / 2) ** 2)) * 180 / pi)
data = data[['result', 'h_a', 'situation', 'shotType', 'lastAction', 'minute', 'distance', 'angle']]

In [6]:
data.head()

Unnamed: 0,result,h_a,situation,shotType,lastAction,minute,distance,angle
0,False,h,OpenPlay,Head,Aerial,10,10.034305,37.453252
1,False,h,OpenPlay,RightFoot,Throughball,11,14.699726,19.232346
2,True,h,OpenPlay,RightFoot,BallRecovery,21,19.973838,14.099715
3,False,h,OpenPlay,RightFoot,Pass,27,19.740004,21.007894
4,False,h,OpenPlay,RightFoot,Chipped,29,14.008206,24.418589


In [7]:
categorical_features = (data.dtypes == object)

In [8]:
data.loc[:,categorical_features] = data.loc[:,categorical_features].apply(LabelEncoder().fit_transform)

In [9]:
data.head()

Unnamed: 0,result,h_a,situation,shotType,lastAction,minute,distance,angle
0,False,1,2,0,0,10,10.034305,37.453252
1,False,1,2,3,27,11,14.699726,19.232346
2,True,1,2,3,1,21,19.973838,14.099715
3,False,1,2,3,20,27,19.740004,21.007894
4,False,1,2,3,6,29,14.008206,24.418589


#### Models & predictions

In [10]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('result', axis=1), data['result'], test_size=0.15)

In [11]:
logistic_regression, random_forest, xgboost = LogisticRegression(), RandomForestClassifier(), GradientBoostingClassifier()
models = [logistic_regression, random_forest, xgboost] 

In [12]:
for model in models:
    model.fit(X_train, y_train)


In [13]:
for model in models:
    print(f"Score: {model.score(X_test, y_test)}")

Score: 0.8972046889089269
Score: 0.9008115419296664
Score: 0.9026149684400361


#### Selecting observation, predictions

In [34]:
any_observation = X_train.iloc[[99]]
[model.predict_proba(any_observation)[0][1] for model in models]

[0.39756797223035245, 0.02, 0.21179511646378305]

#### Creating explainers, Ceteris-Paribus profiles

In [35]:
explainers = [dx.Explainer(model, X_train, y_train) for model in models]

Preparation of a new explainer is initiated

  -> data              : 6280 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 6280 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x120711750> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 2.53e-06, mean = 0.104, max = 0.73
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.72, mean = 6.35e-06, max = 0.989
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 6280 rows 7 cols
  -> target variable   : P


X does not have valid feature names, but LogisticRegression was fitted with feature names


X does not have valid feature names, but RandomForestClassifier was fitted with feature names


X does not have valid feature names, but GradientBoostingClassifier was fitted with feature names



  -> residuals         : min = -0.677, mean = -0.000888, max = 0.48
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 6280 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 6280 values
  -> model_class       : sklearn.ensemble._gb.GradientBoostingClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x120711750> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0121, mean = 0.105, max = 0.954
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.787, mean = -7.37e-05, max = 0.98
  -> model_info        : package sklearn

A new

In [38]:
explainers[0].predict_profile(any_observation).plot([explainer.predict_profile(any_observation) for explainer in explainers[1:]])

Calculating ceteris paribus: 100%|██████████| 7/7 [00:00<00:00, 288.28it/s]
Calculating ceteris paribus: 100%|██████████| 7/7 [00:00<00:00, 74.33it/s]
Calculating ceteris paribus: 100%|██████████| 7/7 [00:00<00:00, 316.36it/s]


#### Conclusions

As we can see on the above figure, `Ceteris Paribus` profiles are very different for each tested model (at least for the selected observation `99`). This can be a result of significant diffrence between predictions of models (LogisticRegression: 0.39756797223035245, RandomForest: 0.02, XGBoost: 0.21179511646378305). 

Results are quite intuitive, shots from bigger angle are harder and less often result in the goal. Similarly with distance to net, bigger values indicate harder, less goalish shots. Averging out over models `minute` feature profile, we conclude that scoring chances of this particular observation would not improve or decrease if attempt would be taken at a different moment of the game. Interpretation of other values need domain specific knowledge. Generally, model predicting lowest probability (RandomForest) is the most "pesimistic" on C-P profile. We need to remember that C-P does not take into account the correlation between variables.