Let's start by importing statsbombpy and getting the data. You don't need to run the functions that retrieve the data, because I already provide the dataset here. It has around 50k rows and over a hundred columns, and each row is a shot from the free StatsBomb men's football data. This project focuses on training a model to predict the Expected Goals (xG) of each shot. My first idea is to use a logistic regression that I build myself, so I can understand how each feature affects the final output; then we will try a neural network and see which one is better. To measure which one is better, I will check which model's xG was closer to what actually happened (1 for goal, 0 for no goal). The neural network should be better in terms of accuracy, but I would like to see the logistic regression too because I want to understand each coefficient. So let's go.
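One common way to make "closer to what happened" concrete is the Brier score: the mean squared difference between each predicted xG and the 0/1 outcome. The sketch below is only an assumption about how such a comparison could be done, not necessarily the metric used later in this project. The function name `brier_score` and the toy predictions are hypothetical.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

# Toy example: four shots, only the first one was a goal
outcomes = [1, 0, 0, 0]
model_a = [0.7, 0.1, 0.2, 0.1]  # hypothetical predictions from one model
model_b = [0.3, 0.3, 0.3, 0.3]  # hypothetical predictions from another model

# Lower is better: the model whose xG sits closer to the outcomes wins
print(brier_score(outcomes, model_a))
print(brier_score(outcomes, model_b))
```

A lower score means the predicted probabilities were, on average, closer to the real outcomes.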

In [2]:
# pip install statsbombpy

In [3]:
from statsbombpy import sb
import pandas as pd
import numpy as np
import time
import warnings

warnings.filterwarnings('ignore')

If you want the code that retrieves all the data, here it is. Be warned that it can take more than an hour to run, because I added pauses between requests so that GitHub would not block me for making too many of them.

In [4]:
comps = sb.competitions()
all_comps = comps[comps['competition_gender'] == 'male'].copy()  # copy so the later column assignment does not act on a view

In [5]:
def filter_year(season):

    if int(season[:4]) > 2010:
        return season
    else:
        return 0

all_comps['season_name'] = all_comps['season_name'].apply(filter_year)
all_comps = all_comps[all_comps['season_name'] != 0]

In [None]:
def get_comp_id(all_comps_df: pd.DataFrame, comp_id_col: str, season_id_col: str) -> list:

    comp_season_ids = []

    for comp_id, season_id in zip(all_comps_df[comp_id_col].values, all_comps_df[season_id_col].values):
        comp_season_ids.append([comp_id, season_id])

    return comp_season_ids


def get_matches_id(all_comps_df: pd.DataFrame, comp_id_col: str, season_id_col: str) -> list:

    comp_season_ids = get_comp_id(all_comps_df, comp_id_col, season_id_col)

    all_matches_ids = []

    for comp_id, season_id in comp_season_ids:
        all_matches_ids.extend(sb.matches(comp_id, season_id)['match_id'].values.tolist())
        time.sleep(2)

    return all_matches_ids

def get_shots(all_comps_df: pd.DataFrame, comp_id_col: str, season_id_col: str) -> pd.DataFrame:

    matches_id = get_matches_id(all_comps_df, comp_id_col, season_id_col)

    all_data = pd.DataFrame()

    for match_id in matches_id:

        # Handle errors per match so one failed request does not stop the whole download
        try:

            match_data = sb.events(match_id=match_id)
            match_data = match_data[match_data['type'] == 'Shot']
            all_data = pd.concat([all_data, match_data], ignore_index=True)
            time.sleep(2)

        except Exception as e:

            print(f'Error on match {match_id}\n {e}')
            time.sleep(10)

    return all_data

# data = get_shots(all_comps, 'competition_id', 'season_id')

StatsBomb data is great. For each shot it has information about the minute, the player, the type of shot, the type of situation, and the position of the shot on the pitch as X and Y values. One of the most important pieces is the location of every player at the time of the shot: when the shot is taken, StatsBomb captures a snapshot (the freeze frame) of all the players on the pitch, including their exact locations. Below I list all the variables I want to build with feature engineering to feed my model.

# Variables for the xG Model

### Geometric Characteristics:

1. Euclidean Distance to Goal (Use the Pythagorean Theorem)
2. Shooting Angle: Calculate the angle formed by the ball and the two posts (computed in radians below)
3. Distance to the Center: abs(40 - y)

### Contextual Characteristics:

1. Body Part
2. Play Type
3. Shooting Technique
4. First-Time Shot
5. One-on-One

### Advanced Characteristics

1. Defenders between the Ball and the Goal: Calculate a triangle between the ball and two posts and see how many defenders there are.
2. Distance to the nearest defender
3. Goalkeeper distance and angle to the ball

### Characteristics of the Preceding Play:

1. Type of previous assist or pass
2. Angle of the previous pass


Now let's clean the dataset and keep just the features we need.

In [323]:
df = pd.read_csv('Shots_df.csv')

In [324]:
df = df[['match_id', 'location', 'minute', 'period', 'play_pattern', 'position', 'shot_aerial_won', 'shot_body_part', 
         'shot_first_time', 'shot_freeze_frame', 'shot_one_on_one', 'shot_outcome', 'shot_statsbomb_xg',
         'shot_technique', 'shot_type', 'under_pressure', 'shot_open_goal', 'shot_follows_dribble']]

## 1.1 Euclidean Distance

Perfect, now we have the columns we need, so the next step is to build each of our new features. Let's start with the geometric characteristics. The first one is the distance from the shot to the goal. It is not as easy as subtracting two numbers: StatsBomb gives the position of the shot as an X and Y position, where X runs along the length of the pitch, from the shooting team's own goal to the opponent's, and Y runs across the width of the pitch. Length goes from 0 to 120 and width goes from 0 to 80. So to calculate the distance between the shot and the goal we use the Euclidean distance to the center of the goal, which is at (120, 40); 40 because it is the middle of the width.

In [325]:
df = df.dropna(subset=['location'])

In [326]:
import ast

df['location'] = [ast.literal_eval(val) for val in df['location']]

In [327]:
def get_euclidean_d(row):

    x = row[0]
    y = row[1]

    ed = np.sqrt((120-x)**2 + (40-y)**2)
    return ed

df['shot_distance'] = df['location'].apply(get_euclidean_d)

In [328]:
df[['location', 'shot_distance']].head(3)

        location  shot_distance
0  [100.4, 35.1]      20.203218
1  [114.6, 33.5]       8.450444
2  [106.2, 55.8]      20.978084
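As an alternative to the row-by-row `apply`, the same distance can be computed for all shots at once with NumPy. This is only a sketch on a toy frame holding the three locations shown above. The names `toy`, `GOAL_X` and `GOAL_Y` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

GOAL_X, GOAL_Y = 120.0, 40.0  # center of the goal in StatsBomb coordinates

# Toy frame with the three shot locations from the output above
toy = pd.DataFrame({'location': [[100.4, 35.1], [114.6, 33.5], [106.2, 55.8]]})

# Stack the [x, y] lists into an (n, 2) array and compute every distance in one call
xy = np.array(toy['location'].tolist(), dtype=float)
toy['shot_distance'] = np.hypot(GOAL_X - xy[:, 0], GOAL_Y - xy[:, 1])

print(toy)
```

On a dataset of this size both versions are fast enough, so the choice is mostly a matter of taste.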


## 1.2 Shot Angle

Next we calculate the shot angle, meaning the angle at the ball in the triangle formed by the ball and the two posts. To get it, we compute the absolute angle from the ball to each post with NumPy's arctan2 function and subtract one from the other. The result is in radians.

In [329]:
def get_left_angle(row):

    x = row[0]
    y = row[1]

    left_a = np.arctan2(44 - y, 120 - x)

    return left_a

def get_right_angle(row):

    x = row[0]
    y = row[1]

    right_a = np.arctan2(36 - y, 120 - x)

    return right_a


df['left_angle'] = df['location'].apply(get_left_angle)
df['right_angle'] = df['location'].apply(get_right_angle)
df['shot_angle'] = abs(df['left_angle'] - df['right_angle'])

In [330]:
df['shot_angle'].head(3)

0    0.380357
1    0.662204
2    0.254675
Name: shot_angle, dtype: float64
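A quick sanity check of the angle calculation, written as a single standalone function. It assumes the penalty spot sits at (108, 40), that is, 12 units from the goal line on the center line. From there the posts are 4 units to each side, so the angle should be 2·arctan(4/12). The function name `shot_angle` and its parameters are hypothetical.

```python
import numpy as np

def shot_angle(x, y, post_low=36.0, post_high=44.0, goal_x=120.0):
    """Angle in radians subtended by the two posts at the point (x, y)."""
    a_high = np.arctan2(post_high - y, goal_x - x)
    a_low = np.arctan2(post_low - y, goal_x - x)
    return abs(a_high - a_low)

# Assumed penalty spot: 12 units from the goal line, on the center line
angle_rad = shot_angle(108.0, 40.0)
print(angle_rad, np.degrees(angle_rad))  # radians, then the same angle in degrees

# First shot from the output above
print(shot_angle(100.4, 35.1))
```

`np.degrees` is handy if you prefer to read the angle in degrees, although the model does not care which unit is used as long as it is consistent.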

## 1.3. Distance to the Center

It is important to know how central the shot was: a shot taken close to the center line of the pitch tends to have a higher xG, while a shot taken far out wide, close to the edge of the pitch, tends to have a lower one.

In [331]:
def get_distance_center(row):

    y = row[1]
    dc = abs(40 - y)

    return dc

df['shot_distance_to_center'] = df['location'].apply(get_distance_center)

In [332]:
df[['location', 'shot_distance_to_center']].head(3)

        location  shot_distance_to_center
0  [100.4, 35.1]                      4.9
1  [114.6, 33.5]                      6.5
2  [106.2, 55.8]                     15.8


## 2.1. Body Part

Already in the data as "shot_body_part".

## 2.2. Play Type

Already in the data as "play_pattern".

## 2.3. Shooting Technique

Already in the data as "shot_technique".

## 2.4. First-Time Shot

Already in the data as "shot_first_time".

## 2.5. One-on-One

Already in the data as "shot_one_on_one".

In [333]:
df['shot_first_time'] = df['shot_first_time'].replace(np.nan, False)
df['shot_one_on_one'] = df['shot_one_on_one'].replace(np.nan, False)

## 3.1. Defenders between the Ball and the Goal: 

We are going to build a triangle between the ball and the two posts and then count how many opposing outfield players are inside that area, using a barycentric-style point-in-triangle test. The key column here is shot_freeze_frame, so first let's make sure there are no missing values in it.

In [334]:
len(np.where(df['shot_freeze_frame'].isna())[0])

722

In [335]:
df.dropna(subset='shot_freeze_frame', inplace=True)

In [336]:
len(np.where(df['shot_freeze_frame'].isna())[0])

0

In [337]:
df['shot_freeze_frame'] = [ast.literal_eval(val) for val in df['shot_freeze_frame']]

In [338]:
def point_in_triangle(pt, v1, v2, v3):
    """
    Check if point 'pt' is inside the triangle formed by v1, v2, and v3.
    We use the sign method (simplified barycentric technique).
    """
    def sign(p1, p2, p3):
        return (p1[0] - p3[0]) * (p2[1] - p3[1]) - (p2[0] - p3[0]) * (p1[1] - p3[1])

    d1 = sign(pt, v1, v2)
    d2 = sign(pt, v2, v3)
    d3 = sign(pt, v3, v1)

    has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
    has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)

    return not (has_neg and has_pos)

def calculate_angle_density(row):

    # 1. Get coordinates of the ball
    ball = row['location']
    
    # 2. Define the posts
    post_l = [120, 36]
    post_r = [120, 44]
    
    # 3. Get the Freeze Frame
    freeze_frame = row['shot_freeze_frame']
    
    # If there are no freeze frames, return 0
    if not isinstance(freeze_frame, list):
        return 0
    
    defenders = 0
    
    # 4. Iterate over every player
    for player in freeze_frame:
        
        # We just need players from the opposite team
        if not player['teammate'] and player['position']['name'] != 'Goalkeeper': 
            loc_player = player['location']
            
            # 5. Check if it is inside the triangle
            if point_in_triangle(loc_player, ball, post_l, post_r):
                defenders += 1
                
    return defenders


# Apply the function row by row
df['shot_angle_density'] = df.apply(calculate_angle_density, axis=1)

# Check the results
print(df[['location', 'shot_angle_density']].head())

        location  shot_angle_density
0  [100.4, 35.1]                   1
1  [114.6, 33.5]                   0
2  [106.2, 55.8]                   1
3  [113.9, 47.4]                   0
4   [89.2, 42.5]                   2
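To convince ourselves that the sign method behaves as expected, here is a small standalone check on a hand-made triangle. The function is repeated so the snippet runs on its own. The ball position and the three test points are hypothetical.

```python
def point_in_triangle(pt, v1, v2, v3):
    """Return True if pt lies inside (or on an edge of) the triangle v1-v2-v3."""
    def sign(p1, p2, p3):
        return (p1[0] - p3[0]) * (p2[1] - p3[1]) - (p2[0] - p3[0]) * (p1[1] - p3[1])

    d1 = sign(pt, v1, v2)
    d2 = sign(pt, v2, v3)
    d3 = sign(pt, v3, v1)

    # The point is inside when the three signs do not disagree
    has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
    has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)
    return not (has_neg and has_pos)

ball = [100.0, 40.0]
post_l, post_r = [120, 36], [120, 44]

in_cone = point_in_triangle([110.0, 40.0], ball, post_l, post_r)  # on the line to the goal center
wide = point_in_triangle([110.0, 50.0], ball, post_l, post_r)     # well outside the cone
behind = point_in_triangle([95.0, 40.0], ball, post_l, post_r)    # behind the shooter

print(in_cone, wide, behind)
```

Only the first point sits between the ball and the goal, so only that one should count as a blocking defender.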


## 3.2. Distance to the Nearest Defender

This feature tells us how close the nearest defender is. We measure the distance from the ball to every opposing outfield player and keep the smallest one. If no such player appears in the freeze frame, the function returns infinity.

In [339]:
def get_closest_distance(row):

    shorter_dist = float('inf')
    ball_pos = row['location']
    ball_pos_x = ball_pos[0]
    ball_pos_y = ball_pos[1]

    for player in row['shot_freeze_frame']:

        if not player['teammate'] and player['position']['name'] != 'Goalkeeper':
            pos = player['location']
            pos_x = pos[0]
            pos_y = pos[1]
            dist = np.sqrt((ball_pos_x - pos_x)**2 + (ball_pos_y - pos_y)**2)

            if dist < shorter_dist:
                shorter_dist = dist

    return shorter_dist


df['closest_defender_distance'] = df.apply(get_closest_distance, axis=1)
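Here is a small standalone sketch of the same idea on a hand-made freeze frame, so the logic can be checked by eye. It differs from the function above in one assumption: when no outfield opponent is in the frame it returns NaN instead of infinity, which is just one possible way to mark the value as missing. The function name and the toy frame are hypothetical.

```python
import numpy as np

def closest_defender_distance(ball, freeze_frame):
    """Distance from the ball to the nearest outfield opponent, or NaN if there is none."""
    defenders = [
        p['location'] for p in freeze_frame
        if not p['teammate'] and p['position']['name'] != 'Goalkeeper'
    ]
    if not defenders:
        return np.nan  # assumption: treat "no defender in frame" as missing, not as inf

    xy = np.array(defenders, dtype=float)
    return float(np.hypot(xy[:, 0] - ball[0], xy[:, 1] - ball[1]).min())

# Hypothetical freeze frame with the same structure as shot_freeze_frame
frame = [
    {'location': [103.0, 44.0], 'teammate': False, 'position': {'name': 'Center Back'}},
    {'location': [110.0, 40.0], 'teammate': False, 'position': {'name': 'Right Back'}},
    {'location': [119.0, 40.0], 'teammate': False, 'position': {'name': 'Goalkeeper'}},
    {'location': [101.0, 41.0], 'teammate': True, 'position': {'name': 'Center Forward'}},
]

nearest = closest_defender_distance([100.0, 40.0], frame)
no_defender = closest_defender_distance([100.0, 40.0], frame[2:])  # only goalkeeper and teammate left

print(nearest, no_defender)
```

Whichever marker is used, infinite or missing values need to be handled before the feature goes into a model.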

## 3.3 Goalkeeper distance to the ball

This one is pretty simple: we just have to calculate the Euclidean distance between the opposing goalkeeper and the ball.

In [343]:
def get_goalkeeper_dist(row):

    ball_pos = row['location']
    ball_pos_x = ball_pos[0]
    ball_pos_y = ball_pos[1]
    dist = False

    for player in row['shot_freeze_frame']:

        if not player['teammate'] and player['position']['name'] == 'Goalkeeper':
            pos = player['location']
            pos_x = pos[0]
            pos_y = pos[1]
            dist = np.sqrt((ball_pos_x - pos_x)**2 + (ball_pos_y - pos_y)**2)

    if not dist:
        dist = 0

    return dist


df['goalkeeper_distance_ball'] = df.apply(get_goalkeeper_dist, axis=1)

In [348]:
(df['goalkeeper_distance_ball'] == 0).value_counts()

goalkeeper_distance_ball
False    49245
True        29
Name: count, dtype: int64

In 29 shots the distance is 0, which means no opposing goalkeeper was found in the freeze frame, so there is no data for them. Let's drop them.

In [350]:
df = df[df['goalkeeper_distance_ball'] != 0]

In [352]:
(df['goalkeeper_distance_ball'] == 0).value_counts()

goalkeeper_distance_ball
False    49245
Name: count, dtype: int64

## 3.4. Goalkeeper Angle to the Ball

