## EDA of Bundesliga Data Shootout 2022

In [1]:
import pandas as pd
import numpy as np

In [2]:
train = pd.read_csv("../train.csv")

In [3]:
train.head()

Unnamed: 0,video_id,time,event,event_attributes
0,1606b0e6_0,200.265822,start,
1,1606b0e6_0,201.15,challenge,['ball_action_forced']
2,1606b0e6_0,202.765822,end,
3,1606b0e6_0,210.124111,start,
4,1606b0e6_0,210.87,challenge,['opponent_dispossessed']


Our training data consists of 12 halves of football matches. Four matches contribute both of their halves (8 files), and the remaining 4 halves each come from a different match.

In [4]:
train["video_id"].unique()

array(['1606b0e6_0', '1606b0e6_1', '35bd9041_0', '35bd9041_1',
       '3c993bd2_0', '3c993bd2_1', '407c5a9e_1', '4ffd5986_0',
       '9a97dae4_1', 'cfbe2e94_0', 'cfbe2e94_1', 'ecf251d4_0'],
      dtype=object)

In [5]:
train["video_id"].nunique()

12
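The split between matches with both halves and matches with a single half can be checked from the identifiers alone. The sketch below assumes that the part of `video_id` before the underscore identifies the match and the suffix (`0` or `1`) identifies the half; that naming convention is inferred from the values above, not confirmed by any documentation quoted here.

```python
from collections import Counter

# The 12 video ids listed in the output above.
video_ids = [
    "1606b0e6_0", "1606b0e6_1", "35bd9041_0", "35bd9041_1",
    "3c993bd2_0", "3c993bd2_1", "407c5a9e_1", "4ffd5986_0",
    "9a97dae4_1", "cfbe2e94_0", "cfbe2e94_1", "ecf251d4_0",
]

# Assumption: ids follow "<match>_<half>", so the prefix identifies the match.
halves_per_match = Counter(vid.rsplit("_", 1)[0] for vid in video_ids)

both_halves = sorted(m for m, n in halves_per_match.items() if n == 2)
single_half = sorted(m for m, n in halves_per_match.items() if n == 1)

print(len(both_halves), "matches with both halves:", both_halves)
print(len(single_half), "matches with one half:", single_half)
```

Under that assumption, 8 distinct matches are represented in the 12 files.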

This table appears to list every annotated event in those files, along with the time at which each one occurs.

In [6]:
train["event"].unique()

array(['start', 'challenge', 'end', 'throwin', 'play'], dtype=object)

There are 3 types of events:
* challenge
* throwin
* play

The output above also shows two other values in the "event" column:
* start
* end

These rows represent the scoring intervals. They mark where each interval starts and ends, which is what the evaluation process relies on.
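One way to make the scoring intervals explicit is to give every row an interval number. The following is a minimal sketch on the five rows shown by `train.head()`; the column name `interval_id` is introduced here for illustration, and the approach assumes rows are sorted by time within each video and that every event sits after the "start" row of its interval.

```python
import pandas as pd

# The five rows shown by train.head() above.
sample = pd.DataFrame({
    "video_id": ["1606b0e6_0"] * 5,
    "time": [200.265822, 201.15, 202.765822, 210.124111, 210.87],
    "event": ["start", "challenge", "end", "start", "challenge"],
})

# Each "start" row opens a new interval, so a running count of "start"
# rows within a video numbers the intervals 1, 2, 3, ...
is_start = sample["event"].eq("start").astype(int)
sample["interval_id"] = is_start.groupby(sample["video_id"]).cumsum()

# Keep only the real events, now labelled with their interval.
events = sample[~sample["event"].isin(["start", "end"])]
print(events[["video_id", "time", "event", "interval_id"]])
```

With the labels in place, per-interval statistics (for example, the number of events per interval) become a plain `groupby` on `["video_id", "interval_id"]`.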

In [7]:
train["time"].describe()

count    11218.000000
mean      1787.796418
std        860.845970
min        175.025822
25%       1050.635250
50%       1769.089449
75%       2527.932750
max       3575.000727
Name: time, dtype: float64

Each file contains one half of a football match, and these run for around 50 minutes. A half nominally lasts 45 minutes, but stoppage time is usually added.
As we can see, our time variable ranges from around 175 to 3575 seconds, or roughly 3 to 60 minutes.
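The conversion from seconds to minutes can be spelled out with the minimum and maximum reported by `describe()`. The helper name `to_min_sec` is made up for this sketch.

```python
# Minimum and maximum of the "time" column, copied from describe() above.
time_min_s = 175.025822
time_max_s = 3575.000727

def to_min_sec(seconds):
    """Split a duration in seconds into whole minutes and leftover seconds."""
    minutes, rest = divmod(seconds, 60)
    return int(minutes), round(rest, 1)

print(to_min_sec(time_min_s))  # (2, 55.0)
print(to_min_sec(time_max_s))  # (59, 35.0)
```

So the earliest annotated timestamp is just under 3 minutes into a video and the latest is just under 60 minutes.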

In [8]:
train["event_attributes"].unique()

array([nan, "['ball_action_forced']", "['opponent_dispossessed']",
       "['pass']", "['pass', 'openplay']", "['cross', 'openplay']",
       "['possession_retained']", "['pass', 'freekick']", "['cross']",
       "['fouled']", "['opponent_rounded']", "['cross', 'corner']",
       "['challenge_during_ball_transfer']", "['cross', 'freekick']",
       "['pass', 'corner']"], dtype=object)

Event attributes appear to provide extra information about the events.
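The attribute values are stored as strings that look like Python lists, so they need parsing before they can be used as lists. A minimal sketch using the standard library's `ast.literal_eval`; the helper name `parse_attributes` is hypothetical, and mapping missing values to an empty list is a choice made here, not something the data prescribes.

```python
import ast

def parse_attributes(raw):
    """Turn "['pass', 'openplay']" into ['pass', 'openplay'].

    Missing values (NaN is a float, not a str) become an empty list.
    """
    if not isinstance(raw, str):
        return []
    return ast.literal_eval(raw)

# A few of the values shown in the output above.
examples = [float("nan"), "['ball_action_forced']", "['pass', 'openplay']"]
parsed = [parse_attributes(x) for x in examples]
print(parsed)
```

On the real table this could be applied with `train["event_attributes"].apply(parse_attributes)`.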

In [40]:
# Keep only real events; "start" and "end" rows just mark scoring intervals.
events = train[~train["event"].isin(["start", "end"])]
# Share of each event type within each video, in percent.
event_counts = events.groupby(["video_id", "event"])["time"].count()
video_totals = events.groupby(["video_id"])["time"].count()
event_summary = event_counts / video_totals * 100
event_summary_df = pd.DataFrame(event_summary).reset_index()
event_summary_df.columns = ["video_id", "event", "%_total"]
event_summary_df["%_total"] = event_summary_df["%_total"].round(2)
pd.pivot_table(event_summary_df, values="%_total", index="event", columns="video_id")

event,1606b0e6_0,1606b0e6_1,35bd9041_0,35bd9041_1,3c993bd2_0,3c993bd2_1,407c5a9e_1,4ffd5986_0,9a97dae4_1,cfbe2e94_0,cfbe2e94_1,ecf251d4_0
challenge,14.14,11.83,11.68,15.88,12.8,18.62,16.0,18.87,16.49,11.48,13.33,11.92
play,80.56,85.6,86.37,80.78,83.82,76.86,80.86,77.81,81.44,79.02,81.4,83.94
throwin,5.3,2.56,1.95,3.34,3.38,4.52,3.14,3.31,2.06,9.51,5.26,4.15


The previous table shows each event type's share of the events in every file, in percent. As expected, "play" events account for the large majority in all of the training files (roughly 77% to 86%), since most of a football match is spent in open play.
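A percentage table of this kind can also be produced in one step with `pd.crosstab`. This is a sketch of an alternative, not the method used in this notebook, and it runs on toy data invented for illustration (the ids `a_0` and `b_0` and their rows are not from train.csv).

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT from train.csv.
toy = pd.DataFrame({
    "video_id": ["a_0"] * 4 + ["b_0"] * 4,
    "event": ["play", "play", "play", "challenge",
              "play", "play", "throwin", "challenge"],
})

# normalize="columns" divides each count by its column (video) total,
# so every column sums to 1 before scaling to percent.
share = pd.crosstab(toy["event"], toy["video_id"], normalize="columns") * 100
share = share.round(2)
print(share)
```

One difference from the groupby-and-pivot approach: `crosstab` reports 0 for an event type that never occurs in a video, where the pivot table would show a missing value.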

In [43]:
train[~train["event"].isin(["start", "end"])]["video_id"].value_counts()

1606b0e6_1    507
3c993bd2_0    414
35bd9041_0    411
1606b0e6_0    396
ecf251d4_0    386
3c993bd2_1    376
35bd9041_1    359
407c5a9e_1    350
cfbe2e94_0    305
4ffd5986_0    302
9a97dae4_1    291
cfbe2e94_1    285
Name: video_id, dtype: int64

The number of events per file ranges from 285 to 507.

Opening the file with the most events shows that the warm-up exercises are included in the video, which explains the unusually long running time of around 60 minutes. This is worth noting.
