# What is xG?


Football outcomes are arguably shaped by chance and "luck" more than those of any other sport. Narrow misses, deflected shots, goalkeeping errors, and disputed refereeing decisions can all determine the final result. It is often said that in football, small margins make a big difference.

Because many football matches are decided by narrow margins, randomness can have an outsized effect, which makes it hard to assess how well a team actually played. A team might win 1-0 through sheer determination or simply through a lucky break, and it is not always easy to tell the difference just by watching the game. Our aim is a more objective evaluation of team performance: quantify the relevant factors in a match and reduce the influence of randomness.

Expected goals (xG) is a metric that estimates the probability that a shot results in a goal, based on several factors. These include the shot's distance from goal, its angle, the scoreline at the time, whether it was a header, whether it came from a counter-attack, and other variables. In this project we examine some of these factors. Summing the xG of every shot a team takes in a match gives an estimate of how many goals that team should have scored according to the model. The same approach extends to a series of games, an entire season, or a manager's tenure.
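
As a minimal sketch of the aggregation step, assume each shot has already been assigned a scoring probability. The values below are made up for illustration and do not come from the Wyscout data, and `expected_goals` is a hypothetical helper name. A team's xG for a match is then just the sum of those probabilities:

```python
# Hypothetical per-shot goal probabilities for one team in one match.
shot_probabilities = [0.05, 0.12, 0.31, 0.76, 0.03]


def expected_goals(probabilities):
    """Return the expected number of goals: the sum of per-shot probabilities."""
    return sum(probabilities)


team_xg = expected_goals(shot_probabilities)
print(f"xG: {team_xg:.2f}")
```

The same function applies unchanged to the shots of a whole season; only the list passed in grows.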

### Setting Up the Data

There are two main types of data used in football analysis: event data and tracking data. Event data records all significant actions that occur during a match, such as shots, passes, tackles, and dribbles, along with their location on the field. On the other hand, tracking data records the exact positions of players and the ball at regular intervals throughout the game.

For this project, we will use event data provided by Wyscout. The dataset covers every event recorded in the top five European domestic leagues (English Premier League, Ligue 1, Bundesliga, La Liga, and Serie A) during the 2017/2018 season, plus the 2018 FIFA World Cup and the 2016 UEFA European Championship, both of which are loaded below.





In [None]:
import pandas as pd
import json
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import matplotlib.transforms as mtransforms
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
%matplotlib inline


from sklearn.model_selection import train_test_split, KFold, cross_val_score, RandomizedSearchCV
from sklearn.metrics import log_loss
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

In [3]:
datasets=['England', 'European_Championship', 'France', 'Germany', 'Spain', 'Italy', 'World_Cup']

In [4]:
event_dataframes = []
match_dataframes =  []
for dataset in datasets:
    with open('Wyscout/events_'+dataset+'.json') as f:
        json_data = json.load(f)
        pandas_data = pd.DataFrame(json_data)
        event_dataframes.append(pandas_data)
    with open('Wyscout/matches_'+dataset+'.json') as f:
        json_data = json.load(f)
        pandas_data = pd.DataFrame(json_data)
        match_dataframes.append(pandas_data)
        
all_events_df = pd.concat(event_dataframes, axis=0).reset_index(drop=True)
matches_df = pd.concat(match_dataframes, axis=0).reset_index(drop=True)
with open('Wyscout/players.json') as f:
    player_json=json.load(f)
    player_df = pd.DataFrame(player_json)

In [5]:
all_events_df

Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,subEventId,id
0,8,Simple pass,[{'id': 1801}],25413,"[{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]",2499719,Pass,1609,1H,2.758649,85,177959171
1,8,High pass,[{'id': 1801}],370224,"[{'y': 78, 'x': 31}, {'y': 75, 'x': 51}]",2499719,Pass,1609,1H,4.946850,83,177959172
2,8,Head pass,[{'id': 1801}],3319,"[{'y': 75, 'x': 51}, {'y': 71, 'x': 35}]",2499719,Pass,1609,1H,6.542188,82,177959173
3,8,Head pass,[{'id': 1801}],120339,"[{'y': 71, 'x': 35}, {'y': 95, 'x': 41}]",2499719,Pass,1609,1H,8.143395,82,177959174
4,8,Simple pass,[{'id': 1801}],167145,"[{'y': 95, 'x': 41}, {'y': 88, 'x': 72}]",2499719,Pass,1609,1H,10.302366,85,177959175
...,...,...,...,...,...,...,...,...,...,...,...,...
3251289,8,Simple pass,[{'id': 1801}],3476,"[{'y': 20, 'x': 46}, {'y': 6, 'x': 64}]",2058017,Pass,9598,2H,2978.301867,85,263885652
3251290,7,Touch,[],14812,"[{'y': 6, 'x': 64}, {'y': 2, 'x': 82}]",2058017,Others on the ball,9598,2H,2979.084611,72,263885653
3251291,8,Cross,"[{'id': 401}, {'id': 801}, {'id': 1802}]",14812,"[{'y': 2, 'x': 82}, {'y': 100, 'x': 100}]",2058017,Pass,9598,2H,2983.448628,80,263885654
3251292,4,Goalkeeper leaving line,[],25381,"[{'y': 0, 'x': 0}, {'y': 98, 'x': 18}]",2058017,Goalkeeper leaving line,4418,2H,2985.869275,40,263885613
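
The `positions` and `tags` columns hold nested lists of dictionaries, so they need unpacking before they can be used as features. The sketch below shows one way to do that on a single made-up row shaped like the events above. The column names `start_x`, `start_y` and `tag_ids` are assumptions introduced for illustration.

```python
import pandas as pd

# One made-up row in the same shape as the Wyscout events shown above.
events = pd.DataFrame([{
    'subEventName': 'Shot',
    'tags': [{'id': 101}, {'id': 402}, {'id': 1801}],
    'positions': [{'y': 40, 'x': 90}, {'y': 0, 'x': 0}],
}])

# The first entry of `positions` is where the event started.
events['start_x'] = events['positions'].apply(lambda p: p[0]['x'])
events['start_y'] = events['positions'].apply(lambda p: p[0]['y'])

# Flatten the list of {'id': ...} dicts into a plain list of tag ids.
events['tag_ids'] = events['tags'].apply(lambda tags: [t['id'] for t in tags])

print(events[['subEventName', 'start_x', 'start_y', 'tag_ids']])
```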


In [6]:
all_events_df=pd.merge(all_events_df, player_df[['wyId', 'foot', 'firstName', 'lastName']], 
                       left_on='playerId', right_on='wyId')
all_events_df=pd.merge(all_events_df, matches_df[['wyId', 'label', 'venue', 'date']], left_on='matchId', right_on='wyId')
all_events_df=all_events_df.drop(columns=['wyId_x', 'wyId_y'])
all_events_df=all_events_df.sort_values(['matchId', 'matchPeriod', 'eventSec'])
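
`pd.merge` defaults to an inner join, so any event whose `playerId` has no match in the player table is silently dropped. Whether that happens in the real data is not shown here. The sketch below uses small hypothetical frames and the `how`, `validate` and `indicator` parameters of `pd.merge` to make such cases visible:

```python
import pandas as pd

# Hypothetical miniature versions of the events and players tables.
events = pd.DataFrame({'playerId': [1, 2, 0],
                       'subEventName': ['Shot', 'Cross', 'Touch']})
players = pd.DataFrame({'wyId': [1, 2], 'foot': ['left', 'right']})

# how='left' keeps every event; validate raises if wyId is not unique;
# indicator adds a `_merge` column saying where each row came from.
merged = pd.merge(events, players, left_on='playerId', right_on='wyId',
                  how='left', validate='many_to_one', indicator=True)

unmatched = int((merged['_merge'] == 'left_only').sum())
print(f"{unmatched} event(s) without a matching player")
```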

In [7]:
all_events_df

Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,subEventId,id,foot,firstName,lastName,label,venue,date
256825,8,Simple pass,[{'id': 1801}],26010,"[{'y': 48, 'x': 50}, {'y': 50, 'x': 47}]",1694390,Pass,4418,1H,1.255990,85,88178642,left,Olivier,Giroud,"France - Romania, 2 - 1",Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2"
257300,8,Simple pass,[{'id': 1801}],3682,"[{'y': 50, 'x': 47}, {'y': 48, 'x': 41}]",1694390,Pass,4418,1H,2.351908,85,88178643,left,Antoine,Griezmann,"France - Romania, 2 - 1",Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2"
256905,8,Simple pass,[{'id': 1801}],31528,"[{'y': 48, 'x': 41}, {'y': 35, 'x': 32}]",1694390,Pass,4418,1H,3.241028,85,88178644,right,N'Golo,Kanté,"France - Romania, 2 - 1",Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2"
257166,8,High pass,[{'id': 1802}],7855,"[{'y': 35, 'x': 32}, {'y': 6, 'x': 89}]",1694390,Pass,4418,1H,6.033681,83,88178645,right,Laurent,Koscielny,"France - Romania, 2 - 1",Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2"
257338,1,Ground defending duel,"[{'id': 702}, {'id': 1801}]",25437,"[{'y': 6, 'x': 89}, {'y': 0, 'x': 85}]",1694390,Duel,4418,1H,13.143591,12,88178646,left,Blaise,Matuidi,"France - Romania, 2 - 1",Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2988073,2,Foul,[],21234,"[{'y': 87, 'x': 24}, {'y': 82, 'x': 26}]",2576338,Foul,3185,2H,2824.741855,20,253567159,right,Andrea,Belotti,"Genoa - Torino, 1 - 2",,"May 20, 2018 at 3:00:00 PM GMT+2"
2989272,3,Free kick cross,"[{'id': 801}, {'id': 1801}]",70974,"[{'y': 23, 'x': 75}, {'y': 65, 'x': 95}]",2576338,Free Kick,3193,2H,2870.982660,32,253567160,left,Iuri José,Picanço Medeiros,"Genoa - Torino, 1 - 2",,"May 20, 2018 at 3:00:00 PM GMT+2"
2988021,1,Ground loose ball duel,"[{'id': 702}, {'id': 1801}]",14745,"[{'y': 35, 'x': 5}, {'y': 36, 'x': 3}]",2576338,Duel,3185,2H,2872.101142,13,253567161,left,Cristian,Molinaro,"Genoa - Torino, 1 - 2",,"May 20, 2018 at 3:00:00 PM GMT+2"
2989401,1,Ground loose ball duel,"[{'id': 702}, {'id': 1801}]",413041,"[{'y': 65, 'x': 95}, {'y': 64, 'x': 97}]",2576338,Duel,3193,2H,2872.990437,13,253567163,right,Jawad,El Yamiq,"Genoa - Torino, 1 - 2",,"May 20, 2018 at 3:00:00 PM GMT+2"


In [8]:
all_events_df['previous_event'] = all_events_df['subEventName'].shift(1)
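
A plain `shift(1)` runs across match and period boundaries, so the first event of a match picks up the last event of the previous one. This rarely matters for shots, since a match normally starts with a pass, but a grouped shift avoids the issue entirely. The toy frame below is hypothetical and only illustrates the difference:

```python
import pandas as pd

# Two tiny made-up matches, already in chronological order.
events = pd.DataFrame({
    'matchId': [1, 1, 2, 2],
    'matchPeriod': ['1H', '1H', '1H', '1H'],
    'eventSec': [10.0, 12.5, 1.0, 3.0],
    'subEventName': ['Cross', 'Shot', 'Simple pass', 'Shot'],
})
events = events.sort_values(['matchId', 'matchPeriod', 'eventSec'])

# Plain shift: the first event of match 2 inherits the last event of match 1.
events['prev_plain'] = events['subEventName'].shift(1)

# Grouped shift: the first event of each match period has no predecessor (NaN).
events['prev_grouped'] = (
    events.groupby(['matchId', 'matchPeriod'])['subEventName'].shift(1)
)

print(events[['matchId', 'subEventName', 'prev_plain', 'prev_grouped']])
```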

In [9]:
all_shots = all_events_df[(all_events_df['subEventName'] == 'Shot') |(all_events_df['subEventName'] =='Free kick shot')].copy()

all_shots['free_kick'] = 1*(all_shots['subEventName'] == 'Free kick shot')

In [10]:
all_shots['rebound']= 1*(all_shots['previous_event'] == 'Penalty') \
                    + 1*(all_shots['previous_event'] == 'Free kick shot') \
                    + 1*(all_shots['previous_event'] == 'Shot') \
                    + 1*(all_shots['previous_event'] == 'Save attempt') 

all_shots['prev_cross'] = 1*(all_shots['previous_event'] == 'Corner') \
                        + 1*(all_shots['previous_event'] == 'Free kick cross') \
                        + 1*(all_shots['previous_event'] == 'Cross') 

all_shots['prev_touch'] = 1*(all_shots['previous_event'] == 'Touch')

all_shots['prev_pass'] = 1*(all_shots['previous_event'] == 'Simple pass') \
                       + 1*(all_shots['previous_event'] == 'Head pass') \
                       + 1*(all_shots['previous_event'] == 'Goal kick')

all_shots['prev_smart_pass'] = 1*(all_shots['previous_event'] == 'Smart pass')

all_shots['prev_duel'] = 1*(all_shots['previous_event'] == 'Air duel') \
                       + 1*(all_shots['previous_event'] == 'Ground defending duel')  \
                       + 1*(all_shots['previous_event'] == 'Ground attacking duel') \
                       + 1*(all_shots['previous_event'] == 'Ground loose ball duel')
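
The same indicator columns can be written more compactly with `Series.isin`, which keeps each category's labels in one list and makes misspelled labels easier to spot. This is an alternative sketch on a made-up frame, not the notebook's own code. The constant names `CROSS_EVENTS` and `DUEL_EVENTS` are assumptions.

```python
import pandas as pd

# Made-up shots with the sub-event that preceded each one.
shots = pd.DataFrame({'previous_event': [
    'Cross', 'Corner', 'Simple pass', 'Ground loose ball duel', None,
]})

CROSS_EVENTS = ['Corner', 'Free kick cross', 'Cross']
DUEL_EVENTS = ['Air duel', 'Ground defending duel',
               'Ground attacking duel', 'Ground loose ball duel']

# isin returns booleans (False for missing values); cast to 0/1 indicators.
shots['prev_cross'] = shots['previous_event'].isin(CROSS_EVENTS).astype(int)
shots['prev_duel'] = shots['previous_event'].isin(DUEL_EVENTS).astype(int)

print(shots)
```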

In [11]:
shots_model=pd.DataFrame(columns=['Goal','X','Y', 'side_of_field', 'left_foot', 
                                  'right_foot', 'header', 'counter_attack', 'strong_foot'])

In [12]:
all_shots

Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,...,venue,date,previous_event,free_kick,rebound,prev_cross,prev_touch,prev_pass,prev_smart_pass,prev_duel
257339,10,Shot,"[{'id': 402}, {'id': 1401}, {'id': 1203}, {'id...",25437,"[{'y': 29, 'x': 91}, {'y': 0, 'x': 0}]",1694390,Shot,4418,1H,31.226217,...,Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2",Head pass,0,0,0,0,1,0,0
257923,10,Shot,"[{'id': 402}, {'id': 201}, {'id': 1216}, {'id'...",83824,"[{'y': 29, 'x': 71}, {'y': 100, 'x': 100}]",1694390,Shot,11944,1H,143.119551,...,Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2",Simple pass,0,0,0,0,1,0,0
258172,10,Shot,"[{'id': 402}, {'id': 201}, {'id': 1201}, {'id'...",33235,"[{'y': 57, 'x': 96}, {'y': 100, 'x': 100}]",1694390,Shot,11944,1H,219.576026,...,Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2",Air duel,0,0,0,0,0,0,1
257684,10,Shot,"[{'id': 403}, {'id': 201}, {'id': 1215}, {'id'...",6165,"[{'y': 61, 'x': 96}, {'y': 100, 'x': 100}]",1694390,Shot,11944,1H,247.532561,...,Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2",Air duel,0,0,0,0,0,0,1
257303,10,Shot,"[{'id': 401}, {'id': 2101}, {'id': 1802}]",3682,"[{'y': 33, 'x': 75}, {'y': 0, 'x': 0}]",1694390,Shot,4418,1H,557.319065,...,Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2",Simple pass,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2988673,10,Shot,"[{'id': 401}, {'id': 201}, {'id': 1215}, {'id'...",116269,"[{'y': 45, 'x': 95}, {'y': 0, 'x': 0}]",2576338,Shot,3193,2H,1152.032980,...,,"May 20, 2018 at 3:00:00 PM GMT+2",Touch,0,0,0,1,0,0,0
2989067,10,Shot,"[{'id': 401}, {'id': 201}, {'id': 1212}, {'id'...",3548,"[{'y': 38, 'x': 93}, {'y': 0, 'x': 0}]",2576338,Shot,3193,2H,1251.730517,...,,"May 20, 2018 at 3:00:00 PM GMT+2",Smart pass,0,0,0,0,0,1,0
2988599,10,Shot,"[{'id': 101}, {'id': 401}, {'id': 201}, {'id':...",21177,"[{'y': 46, 'x': 90}, {'y': 0, 'x': 0}]",2576338,Shot,3193,2H,2065.034482,...,,"May 20, 2018 at 3:00:00 PM GMT+2",Ground defending duel,0,0,0,0,0,0,1
2988762,10,Shot,"[{'id': 402}, {'id': 1212}, {'id': 1802}]",349102,"[{'y': 32, 'x': 79}, {'y': 0, 'x': 0}]",2576338,Shot,3193,2H,2367.252041,...,,"May 20, 2018 at 3:00:00 PM GMT+2",Simple pass,0,0,0,0,1,0,0


In [13]:
for i,shot in all_shots.iterrows():
    shots_model.at[i,'X']=100-shot['positions'][0]['x']
    shots_model.at[i,'Y']=shot['positions'][0]['y']
    shots_model.at[i,'side_of_field']= 1*(shot['positions'][0]['y'] <  50)
    shots_model.at[i,'C']=abs(shot['positions'][0]['y']-50)
    
    #Distance in metres and shot angle in radians.
    x=shots_model.at[i,'X']*105/100
    y=shots_model.at[i,'C']*68/100
    shots_model.at[i,'Distance']=np.sqrt(x**2 + y**2)
    a = np.arctan(7.32 *x /(x**2 + y**2 - (7.32/2)**2))
    if a<0:
        a=np.pi+a
    shots_model.at[i,'Angle'] =a
    shottags=[tag['id'] for tag in shot['tags']]
    if 101 in shottags:
        shots_model.at[i,'Goal']=1
    if 401 in shottags:
        shots_model.at[i, 'left_foot']=1
        if shot.loc['foot']=='left':
            shots_model.at[i, 'strong_foot'] = 1
    if 402 in shottags:
        shots_model.at[i, 'right_foot']=1
        if shot.loc['foot']=='right':
            shots_model.at[i, 'strong_foot'] = 1
    if 403 in shottags:
        shots_model.at[i, 'header']=1
    if 1901 in shottags:
        shots_model.at[i, 'counter_attack'] = 1
shots_model = shots_model.fillna(0)
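
The distance and angle can also be computed for all shots at once with NumPy instead of row by row. The sketch below assumes the same 105 m by 68 m pitch and 7.32 m goal width as the loop above, and `shot_geometry` is a hypothetical helper name. It uses `np.arctan2`, which returns an angle between 0 and pi directly, so no correction for negative values is needed. One difference from the loop: for a shot exactly on the goal line between the posts, `arctan2` returns pi rather than 0.

```python
import numpy as np

GOAL_WIDTH = 7.32                    # metres
PITCH_LENGTH, PITCH_WIDTH = 105, 68  # metres, same assumption as the loop


def shot_geometry(pos_x, pos_y):
    """Return (distance in metres, angle in radians) for Wyscout coordinates.

    pos_x and pos_y are percentages of the pitch (0-100), as in `positions`.
    """
    # Distance to the goal line and lateral offset from the pitch centre line.
    x = (100 - np.asarray(pos_x, dtype=float)) * PITCH_LENGTH / 100
    y = np.abs(np.asarray(pos_y, dtype=float) - 50) * PITCH_WIDTH / 100

    distance = np.sqrt(x**2 + y**2)
    # Angle subtended by the two goalposts as seen from the shot location.
    angle = np.arctan2(GOAL_WIDTH * x, x**2 + y**2 - (GOAL_WIDTH / 2)**2)
    return distance, angle


distance, angle = shot_geometry([90, 71], [50, 29])
print(distance, np.degrees(angle))
```

For a central shot (`pos_y = 50`) the result reduces to `2 * arctan((GOAL_WIDTH / 2) / x)`, which is a convenient sanity check.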

In [14]:
shots_model['out_swinging'] = 1*(shots_model['side_of_field'] == 0)*(shots_model['right_foot'] == 1) \
                           + 1*(shots_model['side_of_field'] == 1)*(shots_model['left_foot'] == 1)
shots_model['in_swinging'] = 1*(shots_model['side_of_field'] == 0)*(shots_model['left_foot'] == 1) \
                            + 1*(shots_model['side_of_field'] == 1)*(shots_model['right_foot'] == 1)

In [15]:
shots_model = pd.merge(shots_model, all_shots[['rebound', 'prev_cross', 'prev_touch',  'prev_pass',
                                               'prev_smart_pass',  'free_kick', 'prev_duel',  'firstName', 
                                               'lastName', 'label', 'venue', 'date','eventSec', 'matchPeriod']], 
                       left_index=True, right_index=True, how='left')
shots_model = shots_model.reset_index(drop=True)

In [17]:
# Convert the shot angle from radians to degrees.
shots_model['Angle'] = np.degrees(shots_model['Angle'].astype(float))


In [18]:
shots_model

Unnamed: 0,Goal,X,Y,side_of_field,left_foot,right_foot,header,counter_attack,strong_foot,C,...,prev_smart_pass,free_kick,prev_duel,firstName,lastName,label,venue,date,eventSec,matchPeriod
0,0,9,29,1,0,1,0,0,0,21.0,...,0,0,0,Blaise,Matuidi,"France - Romania, 2 - 1",Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2",31.226217,1H
1,0,29,29,1,0,1,0,0,1,21.0,...,0,0,0,Mihai Doru,Pintilii,"France - Romania, 2 - 1",Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2",143.119551,1H
2,0,4,57,0,0,1,0,0,1,7.0,...,0,0,1,Bogdan Sorin,Stancu,"France - Romania, 2 - 1",Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2",219.576026,1H
3,0,4,61,0,0,0,1,0,0,11.0,...,0,0,1,Florin,Andone,"France - Romania, 2 - 1",Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2",247.532561,1H
4,0,25,33,1,1,0,0,0,1,17.0,...,0,0,0,Antoine,Griezmann,"France - Romania, 2 - 1",Stade de France,"June 10, 2016 at 9:00:00 PM GMT+2",557.319065,1H
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45279,0,5,45,1,1,0,0,0,1,5.0,...,0,0,0,Diego Sebastián,Laxalt Suárez,"Genoa - Torino, 1 - 2",,"May 20, 2018 at 3:00:00 PM GMT+2",1152.032980,2H
45280,0,7,38,1,1,0,0,0,1,12.0,...,1,0,0,Giuseppe,Rossi,"Genoa - Torino, 1 - 2",,"May 20, 2018 at 3:00:00 PM GMT+2",1251.730517,2H
45281,1,10,46,1,1,0,0,0,1,4.0,...,0,0,1,Goran,Pandev,"Genoa - Torino, 1 - 2",,"May 20, 2018 at 3:00:00 PM GMT+2",2065.034482,2H
45282,0,21,32,1,0,1,0,0,1,18.0,...,0,0,0,Stephane,Omeonga,"Genoa - Torino, 1 - 2",,"May 20, 2018 at 3:00:00 PM GMT+2",2367.252041,2H


In [16]:
shots_model.columns

Index(['Goal', 'X', 'Y', 'side_of_field', 'left_foot', 'right_foot', 'header',
       'counter_attack', 'strong_foot', 'C', 'Distance', 'Angle',
       'out_swinging', 'in_swinging', 'rebound', 'prev_cross', 'prev_touch',
       'prev_pass', 'prev_smart_pass', 'free_kick', 'prev_duel', 'firstName',
       'lastName', 'label', 'venue', 'date', 'eventSec', 'matchPeriod'],
      dtype='object')

In [21]:
shots_model.to_csv('shots_matrix.csv')
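
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. If that column is not wanted, passing `index=False` is one option. The sketch below shows this on a small made-up frame, writing to an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# Small made-up frame standing in for the shots table.
df = pd.DataFrame({'Goal': [0, 1], 'Distance': [11.2, 6.4]})

# Write without the index, then read the same text back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
```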

### References

1. https://figshare.com/collections/Soccer_match_event_dataset/4415000/2
2. https://www.nature.com/articles/s41597-019-0247-7
3. https://soccermatics.readthedocs.io/en/latest/
4. https://github.com/eddwebster/football_analytics
5. https://github.com/devinpleuler/analytics-handbook
6. https://github.com/iandragulet/xG_Model_Workflow
7. https://github.com/KubaMichalczyk/Expected-Goals-Model
8. https://github.com/andrewsimplebet/expected_goals_deep_dive
9. https://github.com/Dato-Futbol/xg-model
10. https://github.com/andrewRowlinson/expected-goals-thesis
11. http://www.statsandsnakeoil.com/2021/06/09/does-xg-really-tell-all/
12. https://pena.lt/y/2014/02/12/expected-goals-for-all
13. https://mplsoccer.readthedocs.io/en/latest/index.html
14. https://www.youtube.com/watch?v=Xc6IG9-Dt18
15. https://www.datofutbol.cl/xg-model/
16. https://www.kaggle.com/gabrielmanfredi/expected-goals-player-analysis
17. https://differentgame.wordpress.com/2017/04/29/an-xg-model-for-everyone-in-20-minutes-ish/
18. https://web.archive.org/web/20200301071559/http://petermckeever.com/2019/01/building-an-expected-goals-model-in-python/
19. https://www.statsperform.com/resource/introducing-expected-goals-on-target-xgot/
