### Calculating xT (action-based)
Calculate action-based Expected Threat (xT) using the previously created possession chains. This section loads a prepared dataset, but the dataframe created in the prior section could be used instead.

### Opening the dataset
Open the data. It is the file created in the Possession Chain segment, saved and reloaded. The files are available in the GitHub repository. They were prepared using the script from the previous section.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json

import os
import pathlib
import warnings
from joblib import load
from mplsoccer import Pitch

pd.options.mode.chained_assignment = None
warnings.filterwarnings('ignore')

In [2]:
from itertools import combinations_with_replacement
from sklearn.linear_model import LinearRegression

In [3]:
df = pd.DataFrame()
for i in range(11):
    file_name = 'possession_chains_England' + str(i+1) + '.json'
    path = os.path.join(str(pathlib.Path().resolve().parents[0]), 'footy_analytics', 'data', file_name)
    with open(path) as f:
        data = json.load(f)
    df = pd.concat([df, pd.DataFrame(data)], ignore_index = True)
df = df.reset_index()
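The loading loop can also be written as a small helper that collects the frames in a list and concatenates once. The sketch below is only an illustration: `load_chains` is a hypothetical name, and the demo runs on two tiny invented files in a temporary directory rather than on the real data.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_chains(data_dir, n_files, prefix="possession_chains_England"):
    """Read <prefix>1.json ... <prefix>N.json from data_dir into one DataFrame."""
    frames = []
    for i in range(1, n_files + 1):
        with open(Path(data_dir) / f"{prefix}{i}.json") as f:
            frames.append(pd.DataFrame(json.load(f)))
    # One concat at the end avoids re-copying the growing frame on every iteration
    return pd.concat(frames, ignore_index=True)


# Demo on two invented files (not the real possession chain data)
with tempfile.TemporaryDirectory() as tmp:
    toy_files = [[{"x0": 1.0}], [{"x0": 2.0}, {"x0": 3.0}]]
    for i, rows in enumerate(toy_files, start=1):
        (Path(tmp) / f"possession_chains_England{i}.json").write_text(json.dumps(rows))
    demo = load_chains(tmp, n_files=2)
```

For the real files this would be called with the data directory used above and `n_files=11`.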

In [4]:
df.head(3)

Unnamed: 0,level_0,index,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,...,possesion_chain,possesion_chain_team,xG,shot_end,x0,c0,x1,c1,y0,y1
0,0,0,8,Simple pass,[{'id': 1801}],25413,"[{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]",2499719,Pass,1609,...,0,1609,0.0,0,51.45,0.68,32.55,19.04,34.68,14.96
1,1,1,8,High pass,[{'id': 1801}],370224,"[{'y': 78, 'x': 31}, {'y': 75, 'x': 51}]",2499719,Pass,1609,...,0,1609,0.0,0,32.55,19.04,53.55,17.0,14.96,17.0
2,2,2,8,Head pass,[{'id': 1801}],3319,"[{'y': 75, 'x': 51}, {'y': 71, 'x': 35}]",2499719,Pass,1609,...,0,1609,0.0,0,53.55,17.0,36.75,14.28,17.0,19.72


### Preparing variables for models
The models use non-linear combinations of the starting and ending x coordinate and c (distance from the middle of the pitch). Create combinations with replacement of these variables, up to degree three, to get their non-linear transformations. Then multiply the columns in each combination to create a new feature column for the models.

In [15]:
def create_model_combo_cols(df=df):
    '''Create column combinations from x0,x1, c0, and c1
     ---
     x0 = starting x position
     x1 = finishing x position on field
     c0 = starting distance from center of field
     c1 = finishing distance from center of field
    '''
    var = ["x0", "x1", "c0", "c1"] # column name placeholders
    inputs = []

    for v in range(1,4):
        inputs.extend(combinations_with_replacement(var, v))
    
    for i in inputs:
        if len(i) > 1:
            column = ''
            x = 1
            for c in i:
                column += c
                x = x*df[c]
            df[column] = x
            var.append(column)
    return var,inputs

In [16]:
v,i = create_model_combo_cols(df)

In [20]:
# investigate original and created column combinations
df[v[-3:]].head(3)

Unnamed: 0,c0c0c1,c0c1c1,c1c1c1
0,8.804096,246.514688,6902.411264
1,6162.8672,5502.56,4913.0
2,4126.92,3466.6128,2911.954752
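As a quick check on what the function produces, the feature names can be rebuilt without touching the dataframe. This is a minimal sketch using only `itertools`; the names `base`, `combos`, `names` and `counts` are introduced here for illustration.

```python
from itertools import combinations_with_replacement

base = ["x0", "x1", "c0", "c1"]

# All combinations with replacement of degree 1, 2 and 3
combos = [
    combo
    for degree in range(1, 4)
    for combo in combinations_with_replacement(base, degree)
]

# Column names are the concatenated variable names, as in create_model_combo_cols
names = ["".join(combo) for combo in combos]

# Number of features per degree
counts = {d: sum(len(combo) == d for combo in combos) for d in (1, 2, 3)}
```

There are 4 original variables, 10 pairwise products and 20 triple products, so the models see 34 features in total. The last three names match the columns displayed above.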


### Calculating action-based Expected Threat values for passes
Predicting whether a chain ends with a shot requires training a model (XGB classifier) on the Bundesliga dataset. The trained model is stored in a file and loaded in the code below. It was trained using the xgboost library version 1.6.2. Training steps are provided, but commented out.

The model predicts the probability of a chain ending with a shot. Then, on chains ending with a shot, fit a linear regression to estimate the probability that the shot ends with a goal. The product of these two values is the action-based Expected Threat statistic.
1. Use the XGBoost classifier to predict the likelihood of a shot
2. Use linear regression to predict the likelihood of a goal
3. Multiply them (p_shot x p_goal) to yield action-based Expected Threat
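The three steps can be sketched on toy numbers. Everything below is invented for illustration: the single feature `x1`, the `shot_prob` array (a stand-in for the classifier output) and the xG values. The linear fit uses NumPy least squares instead of scikit-learn so that the sketch needs only NumPy.

```python
import numpy as np

# Toy data, invented for illustration: one feature per pass
x1 = np.array([30.0, 50.0, 70.0, 85.0, 95.0])
shot_end = np.array([0, 1, 0, 1, 1])           # 1 if the chain ended with a shot
xg = np.array([0.0, 0.05, 0.0, 0.15, 0.25])    # xG of the chain's shot (0 if no shot)

# Step 1: stand-in for the classifier's predicted shot probabilities
shot_prob = np.array([0.02, 0.05, 0.10, 0.20, 0.30])

# Step 2: ordinary least squares fitted only on chains that ended with a shot
mask = shot_end == 1
A = np.column_stack([np.ones(mask.sum()), x1[mask]])   # intercept + slope
coef, *_ = np.linalg.lstsq(A, xg[mask], rcond=None)

# Predict xG for every pass, including chains without a shot
xg_pred = coef[0] + coef[1] * x1

# Step 3: xT is the product of the two predictions
xt = shot_prob * xg_pred
```

Note that the regression is fitted on shot-ending chains but applied to all passes. Because it is linear, it can return values below 0 for passes far from goal, as it does for the first pass in this toy example.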

In [None]:
### TRAINING: not a perfect ML procedure, but results in AUC 0.2 higher than logistic regression ###
#from sklearn.model_selection import train_test_split
#from xgboost import XGBRegressor
#passes = df.loc[ df["eventName"].isin(["Pass"])]
#X = passes[v].values - note that this is a different X, with data from BL
#y = passes["shot_end"].values
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 123, stratify = y)
#xgb = XGBRegressor(n_estimators = 100, ccp_alpha=0, max_depth=4, min_samples_leaf=10,
#                       random_state=123)
#from sklearn.model_selection import cross_val_score
#scores = cross_val_score(estimator = xgb, X = X_train, y = y_train, cv = 10, n_jobs = -1)
#print(np.mean(scores), np.std(scores))
#xgb.fit(X_train, y_train)
#print(xgb.score(X_train, y_train))
#y_pred = xgb.predict(X_test)
#print(xgb.score(X_test, y_test))

In [None]:
# predict whether the chain ended with a shot
passes = df.loc[df["eventName"].isin(["Pass"])].copy()
# v holds the feature column names returned by create_model_combo_cols
X = passes[v].values
y = passes["shot_end"].values
# path to saved model
path_model = os.path.join(str(pathlib.Path().resolve().parents[0]), 'possession_chain', 'finalized_model.sav')
model = load(path_model)
# predicted probability that the chain ends with a shot
y_pred_proba = model.predict_proba(X)[:, 1]

passes["shot_prob"] = y_pred_proba
# OLS fitted on chains that ended with a shot
shot_ended = passes.loc[passes["shot_end"] == 1]
X2 = shot_ended[v].values
y2 = shot_ended["xG"].values
lr = LinearRegression()
lr.fit(X2, y2)
# predict xG for all passes
y_pred = lr.predict(X)
passes["xG_pred"] = y_pred
# calculate xT
passes["xT"] = passes["xG_pred"]*passes["shot_prob"]

passes[["xG_pred", "shot_prob", "xT"]].head(5)