Set Piece Strategy Analysis w/Data Science

At the end of the day, any sport-related Data Science project should be focused (either directly or indirectly) on the bottom line of competition: maximizing your chances of winning. This project strives to do this in the context of soccer by focusing its scope on set piece sequences, the set of plays that immediately follow and are directly related to a set piece.

Set pieces are the events in soccer that resume play after a stoppage caused by, e.g., the ball going out of bounds or a foul. They are crucial to winning because teams can execute pre-planned strategies: player movement and repositioning are allowed as the teams get ready to restart play. Since the team in possession knows what it is about to do while the other team is in a more reactive, defensive mindset, many goals are scored off of set pieces that occur near the goal. In fact, 50% of the goals scored in the Men's 2018 World Cup were off of set pieces. That figure was 25% for the Men's 2014 World Cup (the competition occurs every 4 years). Thus, if one can gain a deeper understanding of set piece strategy using Machine Learning, they could direct their team towards more successful strategies for scoring goals and for defending against those strategies when the other team is executing a set piece. The end result would be your team having a better chance of scoring more goals than the opposition and thus a better chance of winning the match.

Of course, since this is a Data Science project, we need a data set to solve this business problem of increasing a team's chances of winning by scoring more goals after gaining insight on set pieces. Luckily, there is a free and publicly available data set that is (almost; more on that later) perfect for this task: the "spatio-temporal" event tracking data set collected and provided by Wyscout (downloadable files of this data can be found here). Data is collected on an event-by-event basis for all matches in the 2017 and 2018 seasons of La Liga, the Premier League, the Bundesliga, Serie A, Ligue 1, the Champions League, and the World Cup (events can simply be thought of as occurrences in a soccer match such as passes, shots, fouls, etc.). For each event, you are given key information such as the coordinates of the ball on the field at the beginning and at the end of the event, the player that initiated the event (e.g., who made the pass), the team that player is on, what kind of event it was (e.g., pass, duel, etc.), and when in the match the event occurred, among other descriptive pieces of information.
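To make the shape of a single event concrete, here is a minimal sketch that flattens one raw event record into a flat row. The JSON field names (`matchId`, `teamId`, `playerId`, `eventId`, `eventName`, `matchPeriod`, `eventSec`, `positions`, `tags`) are assumptions based on the layout of the public Wyscout files, the function name `flatten_event` is hypothetical, and the sample values are made up for illustration.

```python
from typing import Any, Dict


def flatten_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one raw event record into a flat dictionary (one table row)."""
    # Assumed layout: "positions" holds the start point and, when available,
    # the end point of the ball for this event.
    positions = event.get("positions", [])
    start = positions[0] if len(positions) > 0 else {}
    end = positions[1] if len(positions) > 1 else {}
    return {
        "match_id": event.get("matchId"),
        "team_id": event.get("teamId"),
        "player_id": event.get("playerId"),
        "event_id": event.get("eventId"),
        "event_name": event.get("eventName"),
        "period": event.get("matchPeriod"),
        "seconds": event.get("eventSec"),
        "x_start": start.get("x"),
        "y_start": start.get("y"),
        "x_end": end.get("x"),
        "y_end": end.get("y"),
        # Assumed layout: tags are stored as a list of {"id": <int>} objects.
        "tags": [tag["id"] for tag in event.get("tags", [])],
    }


# Made-up record, for illustration only.
sample_event = {
    "matchId": 1,
    "teamId": 10,
    "playerId": 7,
    "eventId": 8,
    "eventName": "Pass",
    "matchPeriod": "1H",
    "eventSec": 12.5,
    "positions": [{"x": 49, "y": 50}, {"x": 61, "y": 38}],
    "tags": [{"id": 1801}],
}
row = flatten_event(sample_event)
print(row)
```

A flat row like this is convenient because every later step (score tracking, sequence extraction, feature engineering) only needs simple column lookups.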

TL;DR Conclusion

This project set out to begin the process of creating a system that can draw informed conclusions about the characteristics of different set piece strategies. It has accomplished that goal by taking the free version of the Wyscout event-tracking data set (where the only positional information you are given is that of the ball) and using a processed version of it to train clustering algorithms. The results and predictions of these models reveal that ML algorithms can extract important characteristics of soccer events and strategies. Knowing this, one can continue building on this project to reveal more nuanced information about soccer events and strategies through a more detailed analysis and through access to the paid version of this data set, where positional information is also given for all of the players on the field.

Results

This link takes one to a shared Google Slides file that was used to present this project.

Data Preprocessing and Feature Engineering

The following image displays all of the steps taken to turn the raw event-tracking Wyscout data into a training set for the clustering models.

By computing "cumulative scores", we mean that code was executed to determine the score of the match at the point at which each event takes place. This was done for two reasons. First, Wyscout only reports the number of goals scored by each team (slightly different from the score because of own goals) at the end of each of the two halves and during penalties (if there was a penalty shootout). Second, the score differential provides a lot of game context for each event (information that we use later on). Luckily, this computation requires only the raw event-tracking data, since the "101" and "102" tag values indicate when a goal and an own goal have been scored. We are very confident in these cumulative score calculations because they are verified along the way against the aforementioned end-of-half scores.
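The cumulative score idea can be sketched as below. This is a minimal illustration, not the project's actual code: the function name `cumulative_scores` and the flat event layout (`team_id`, `tags`) are hypothetical, and the sketch assumes each goal is tagged on exactly one event. If the real data carries the goal tag on more than one event for the same goal, those events would need to be de-duplicated first.

```python
from typing import Any, Dict, List, Tuple

GOAL_TAG = 101      # tag value that marks a goal (from the text above)
OWN_GOAL_TAG = 102  # tag value that marks an own goal (from the text above)


def cumulative_scores(
    events: List[Dict[str, Any]], home_id: int, away_id: int
) -> List[Tuple[int, int]]:
    """Return the (home, away) score right after each event, in match order."""
    home, away = 0, 0
    scores = []
    for event in events:
        tags = set(event["tags"])
        scoring_team = None
        if GOAL_TAG in tags:
            scoring_team = event["team_id"]
        elif OWN_GOAL_TAG in tags:
            # An own goal counts for the opponent of the team that made the event.
            scoring_team = away_id if event["team_id"] == home_id else home_id
        if scoring_team == home_id:
            home += 1
        elif scoring_team == away_id:
            away += 1
        scores.append((home, away))
    return scores


# Made-up events: team 1 is home, team 2 is away.
toy_events = [
    {"team_id": 1, "tags": [1801]},
    {"team_id": 1, "tags": [101]},   # home goal
    {"team_id": 1, "tags": [102]},   # home own goal, counts for away
    {"team_id": 2, "tags": [101]},   # away goal
]
running = cumulative_scores(toy_events, home_id=1, away_id=2)
print(running)

# Consistency check against a known end-of-half score, as described above.
assert running[-1] == (1, 2)
```

The final assertion mirrors the verification step: the last running score of a half should match the end-of-half score reported for the match.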

Perhaps the most important part of this project is the next step in the data pre-processing pipeline. The raw data set gives us information on all of the tracked events of a match. However, we are only interested in set pieces and the related events that are part of the sequence of attacking plays immediately following them. This is what we refer to as set piece sequences. The set of events that make up these sequences is the data set that we ultimately want to use in model training. To extract these special event sequences, we took advantage of the fact that the attack following a set piece is over after any of the following things occur:

The team playing defense during the set piece has comfortably taken back possession and is now dictating the flow of the match (that is, they have made several passes without the other team touching the ball).
The team that initiated the set piece decides to completely reset whatever attack it started with that set piece. This could be seen by them passing the ball all the way back to their side of the field (either to their defenders or to their goalie) and repositioning the rest of their players.
The defending goalie saves any on-target shot attempt that was made, forcing the original attacking team to retreat in preparation for the opposition to begin their possession of the ball.
The attacking team is successful in their set piece attack and scores a goal. Thus, possession will now be turned over.
The defending team commits a hard foul, meaning that a new set piece sequence is about to start.
A player on the attacking team is called offside, meaning that possession will change hands.
The attacking team accidentally kicks the ball out of bounds.
The referees end the half, which stops all play.
The defending team is able to make an effective clearance that either results in the defending team regaining possession or the attacking team having to reset its attack.
Another set piece sequence begins for some reason.

We also benefit from the fact that an event ID of "3" indicates that the event is a set piece. Thus, code was written to analyze a large chunk of plays (about 50) that immediately follow each of the set pieces in our data set to determine when and how the corresponding set piece sequences ended. Below you will find two examples of set piece sequences that were extracted from the May 13, 2018 match between Leicester City and Tottenham:

We can see each event identified as part of the sequence, as well as how the score of the match in the last columns of the displayed tables is updated when a goal is scored off the sequence.
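The extraction logic described in this section can be sketched as follows. This is a deliberately simplified illustration: it only implements four of the stopping rules (a new set piece begins, a goal is scored, the half ends, or the defending team strings together several consecutive events), and the flat event layout, the function names, and the `OPPONENT_STREAK_LIMIT` threshold are all hypothetical choices rather than the project's actual values.

```python
from typing import Any, Dict, List

SET_PIECE_EVENT_ID = 3     # event ID that marks a set piece (from the text above)
GOAL_TAG = 101             # tag value that marks a goal
MAX_LOOKAHEAD = 50         # roughly how many following events are examined
OPPONENT_STREAK_LIMIT = 3  # hypothetical: consecutive opponent events = possession lost

Event = Dict[str, Any]


def extract_sequence(events: List[Event], start: int) -> List[Event]:
    """Collect the set piece at index `start` plus the events of its attack."""
    first = events[start]
    attacking_team = first["team_id"]
    sequence = [first]
    opponent_streak = 0
    for event in events[start + 1 : start + 1 + MAX_LOOKAHEAD]:
        if event["event_id"] == SET_PIECE_EVENT_ID:
            break  # another set piece sequence begins
        if event["period"] != first["period"]:
            break  # the half ended
        sequence.append(event)
        if GOAL_TAG in event["tags"]:
            break  # the attack ended with a goal
        if event["team_id"] == attacking_team:
            opponent_streak = 0
        else:
            opponent_streak += 1
            if opponent_streak >= OPPONENT_STREAK_LIMIT:
                # The defending team has taken over: drop its streak of events.
                sequence = sequence[:-OPPONENT_STREAK_LIMIT]
                break
    return sequence


def extract_all_sequences(events: List[Event]) -> List[List[Event]]:
    """Extract one sequence per set piece found in a match's ordered events."""
    return [
        extract_sequence(events, i)
        for i, event in enumerate(events)
        if event["event_id"] == SET_PIECE_EVENT_ID
    ]


def make_event(idx: int, event_id: int, team_id: int, tags=()) -> Event:
    """Build a made-up event for the toy example below."""
    return {"id": idx, "event_id": event_id, "team_id": team_id,
            "tags": list(tags), "period": "1H"}


toy_match = [
    make_event(0, 8, 1),
    make_event(1, 3, 1),           # set piece by team 1
    make_event(2, 8, 1),
    make_event(3, 10, 1, [101]),   # goal ends the sequence
    make_event(4, 8, 2),
    make_event(5, 3, 2),           # set piece by team 2
    make_event(6, 8, 1),
    make_event(7, 8, 1),
    make_event(8, 8, 1),           # third straight opponent event
    make_event(9, 8, 2),
]
sequences = extract_all_sequences(toy_match)
print([[e["id"] for e in seq] for seq in sequences])
```

The remaining stopping rules (saves, fouls, offside calls, clearances, resets) would slot in as further `break` conditions keyed on the relevant event types and tags.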

With all of the set piece sequences extracted from the data set, the next steps involve feature engineering at both the event-by-event level and the sequence-by-sequence level. For the former, the positional, player, time, and score information is used to compute features such as the distance between where the event starts and ends, a set of indicator variables that tell which position the initiating player plays (goalie, defender, midfield, or forward), and the proportion of the game that has been played (e.g., 0.5 would be halftime and 1 would be 90 minutes), to name a few. After doing this, we arrive at a set of sequence-wide features by aggregating the information of all of the events in a sequence into one feature vector. For most features, this is simply done by taking the mean of all of its event values. For others, we take the maximum value or the first value across the event features. Now that we have sequence-wide features, the only remaining step is to ensure that the scales of the different features are comparable. We do this by applying a z-score transformation to the features with a wide scale.
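A minimal, dependency-free sketch of these three steps (event-level features, sequence-level aggregation, z-scoring) is shown below. The event layout, the feature names, and the choice of which features use `first` or `max` instead of the mean are illustrative assumptions, not the project's actual configuration. For a feature with value $x$, mean $\mu$, and standard deviation $\sigma$ across sequences, the z-score is $(x - \mu) / \sigma$.

```python
import math
from statistics import mean, pstdev
from typing import Any, Callable, Dict, List

ROLES = ("goalie", "defender", "midfield", "forward")


def event_features(event: Dict[str, Any]) -> Dict[str, float]:
    """Compute event-level features from ball positions, player role and time."""
    (x0, y0), (x1, y1) = event["start"], event["end"]
    features = {
        "distance": math.hypot(x1 - x0, y1 - y0),  # start-to-end distance
        "match_progress": event["minute"] / 90.0,   # 0.5 = halftime, 1 = 90 min
        "x_end": float(x1),
    }
    for role in ROLES:  # indicator variables for the initiating player's role
        features[f"is_{role}"] = 1.0 if event["role"] == role else 0.0
    return features


# Hypothetical choice: every feature not listed here is aggregated with the mean.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "match_progress": lambda values: values[0],  # first value in the sequence
    "x_end": max,                                # maximum value in the sequence
}


def aggregate_sequence(events: List[Dict[str, Any]]) -> Dict[str, float]:
    """Collapse a sequence of events into one sequence-wide feature vector."""
    rows = [event_features(e) for e in events]
    return {
        name: AGGREGATORS.get(name, mean)([row[name] for row in rows])
        for name in rows[0]
    }


def z_score(rows: List[Dict[str, float]], columns: List[str]) -> List[Dict[str, float]]:
    """Standardize the chosen columns to zero mean and unit (population) std."""
    scaled = [dict(row) for row in rows]
    for column in columns:
        values = [row[column] for row in rows]
        mu, sigma = mean(values), pstdev(values)
        for row in scaled:
            row[column] = (row[column] - mu) / sigma if sigma > 0 else 0.0
    return scaled


# Two made-up sequences.
seq_a = [
    {"start": (50, 50), "end": (80, 50), "minute": 10, "role": "defender"},
    {"start": (80, 50), "end": (90, 40), "minute": 10, "role": "midfield"},
    {"start": (90, 40), "end": (100, 50), "minute": 11, "role": "forward"},
]
seq_b = [
    {"start": (5, 50), "end": (60, 50), "minute": 45, "role": "goalie"},
    {"start": (60, 50), "end": (60, 80), "minute": 45, "role": "defender"},
]
sequence_rows = [aggregate_sequence(seq_a), aggregate_sequence(seq_b)]
scaled_rows = z_score(sequence_rows, columns=["distance", "x_end"])
print(scaled_rows)
```

Only the wide-scale columns are passed to `z_score`; indicator means and `match_progress` already live in the range 0 to 1 and are left untouched.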

Clustering and Cluster Characteristics

Once the raw tracking data has undergone all of the steps in the data pre-processing pipeline, it is ready to be used as a training set for a clustering model. The two such models that we implemented were K-Means and Mean-Shift. Both methods yielded similar results, so we will focus our discussion on the results of the method that is the most computationally efficient and scalable, K-Means. The image below provides a snapshot of the cluster performance as viewed in the feature space:

We see an encouraging level of clean segmentation across the clusters in this space. We can further determine how different the identified clusters are from each other by identifying the data instances that are closest to the centroids of each cluster. Doing so yields the following:

After validating that the model was able to identify clusters that were different from each other, the next step involved determining what each cluster meant, since that was key to relating it back to the topic at hand: winning games in soccer. We did this by generating the following plots for each cluster. Notice above that the optimal number of clusters we identified for the K-Means model was 6. The following is a breakdown of each:

1. "Completely Down and Out"

Initiating team is losing.
Closest data point to cluster shows no goalie involvement.
The initiating team struggles a bit to hold on to possession.
The attack makes little progress towards the goal.
The event types are dominated by simple passes.

2. "2nd Half Out the Gates"

Closest data point shows the match is tied.
Time in match seems to favor being right after half-time.
Cluster with highest rate of shot attempts.
Cluster with highest rate of ball going out of bounds.
High distribution in attacking half.

3. "Lethargic Beginning"

Closest data point shows that we are early on in the match.
The match is tied.
High rate of goal kicks.
High rate of long passes.
Despite the long passes, there is not much advancement towards the goal. Perhaps goal kicks are not successful in leading to effective attacks.

4. "1st Half Out the Gates"

Closest data point shows that we are early on in the match.
The match is tied.
Cluster with most involvement of forwards.
Cluster with best advancement towards goal.
Main difference with the second cluster is that these events occur early on in the match.

5. "Passive and Dominating"

Closest data point shows that the initiating team has a big lead.
Events occur late in the match.
Closest data point to cluster shows no goalie involvement.
Cluster with highest possession rate.
Most events occur in own half of field.

6. "Coasting Towards Half Time"

The match is tied.
Closest data point shows that we are about to reach halftime for the match.
We mainly see passes and duels.
Cluster in which play is very fluid.
Minimal advancement towards the goal.
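The two operations this section relies on, fitting K-Means and finding the data instance closest to each centroid, can be sketched without any dependencies as below. This is an illustration of Lloyd's algorithm (the standard procedure behind K-Means) on made-up 2D points with `k=2`; it is not the project's training code, which may well have used a library implementation, and the function names are hypothetical. The project itself settled on 6 clusters over the sequence-wide feature vectors.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def nearest(point: Point, centroids: List[List[float]]) -> int:
    """Index of the centroid closest (Euclidean distance) to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(
    points: List[Point], k: int, iterations: int = 100, seed: int = 0
) -> Tuple[List[List[float]], List[int]]:
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(list(points), k)]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def closest_to_centroids(points: List[Point], centroids: List[List[float]]) -> List[int]:
    """For each centroid, the index of the data point nearest to it."""
    return [
        min(range(len(points)), key=lambda j: math.dist(points[j], c))
        for c in centroids
    ]


# Made-up, well-separated points standing in for sequence feature vectors.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(toy_points, k=2)
representatives = closest_to_centroids(toy_points, centroids)
print(labels, representatives)
```

The representative points are what the "closest data point" bullets in the cluster breakdown refer to: one real sequence per cluster that can be inspected by hand.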

Tying it All Back to the Business Problem

Recall that the goal of this project was to gain a deeper understanding of set pieces in order to help a team either maximize its chances of scoring goals off of set pieces or minimize the chances of the other team scoring set piece goals, since either will improve a team's chance of winning. The results that we have so far make significant progress towards that goal. As presently constructed, one can use the trained models of this project to analyze a specific team's set piece sequences by seeing how their assigned clusters are distributed. They can see whether the team has executed many sequences that fall in the "Passive and Dominating" cluster, for example, which of course would be very encouraging. Conversely, one can analyze the set piece sequences of an upcoming opponent to quickly see how good they are at scoring off of set pieces; with that information, the coaches can make an informed decision on how much to prepare for such attacks ahead of the match against that opponent.
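The per-team analysis described above amounts to counting how a team's sequences are spread across the clusters. A minimal sketch, assuming one team identifier and one predicted cluster label per sequence (the function name `cluster_distribution` and the toy inputs are hypothetical):

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable


def cluster_distribution(
    team_ids: Iterable[Hashable], cluster_labels: Iterable[int]
) -> Dict[Hashable, Dict[int, float]]:
    """Share of each team's set piece sequences that falls in each cluster."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for team, label in zip(team_ids, cluster_labels):
        counts[team][label] += 1
    return {
        team: {label: n / sum(counter.values()) for label, n in counter.items()}
        for team, counter in counts.items()
    }


# Made-up example: one entry per set piece sequence.
teams = ["A", "A", "A", "B"]
predicted = [4, 4, 1, 0]
distribution = cluster_distribution(teams, predicted)
print(distribution)
```

Comparing these shares between a team and its upcoming opponent gives a quick, cluster-level picture of how each side tends to play its set pieces.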

At the very least, this project gives us a Minimum Viable Product (MVP) that shows that it is possible to obtain information about set piece strategy from position tracking data for the ball combined with descriptive event data.

Repository Contents and Organization

(Use drop-down menus to see more information about each directory)

1. data: Stores all of the data that is used during the project.

raw/: Directory where the data downloaded from Wyscout (see link above) is stored.
interim/: Directory where the data obtained after pre-processing is stored.
final/: Directory where the data obtained after feature engineering is stored. This data is used to train the clustering models of this project.

2. models: Stores all of the saved models trained for this project.

event_by_event/: Directory where the models trained on non-sequence aggregated data are stored.
sequence_aggregation/: Directory where the models trained on sequence aggregated data are stored.

3. notebooks: Stores all of the Jupyter notebooks that run the code written during this project.

1_Obtaining_Set_Piece_Data.ipynb: Jupyter notebook that runs all of the source code that puts together the set of set piece sequences.
2_Clustering_Investigation.ipynb: Jupyter notebook that takes the set piece sequences and performs feature engineering, sequence aggregation, model training, and model evaluation.

4. reports: Stores all of the files created to summarize the findings and conclusions of this project.

Set_Piece_Sequence_Investigation_Slides.pdf: PDF document that contains all of the slides of a Google Slides document used to present this project.

5. src: Stores all of the code written during this project that performs the necessary tasks for data loading, data pre-processing, model training, and model evaluation.

data/: Directory that contains all of the source code dedicated to loading in and manipulating data.
models/: Directory that contains all of the source code dedicated to training and saving clustering models.
test/: Directory that contains all of the source code dedicated to validating function input data.
visualizations/: Directory that contains all of the source code dedicated to creating the visualizations that help perform model evaluation.

6. visualizations: Stores all of the images that were saved while exploring the predictions of trained models.

cluster_scatter/: Directory that contains all of the scatter plots that were used to first see how well the clustering model was able to segment between the different identified clusters in the feature space.
clusters_investigation/: Directory that contains all of the visualizations that help with the in-depth cluster exploration performed after training the clustering models.
Data_Preprocessing_Pipeline.png: Image that visually displays the sequence of steps that were taken to prepare the raw event-tracking data for clustering modeling.
example_sps_1.gif: GIF that shows the first example of a set piece (displayed above).
example_sps_2.gif: GIF that shows the second example of a set piece (displayed above).
match_2500097_boxscore.png: Image that displays the box score of the match for which we are displaying set piece sequence examples.
match_2500097_spp_1.gif: GIF of an example set piece sequence that was identified and extracted by the source code that compiles all of the set piece sequences in our data set.
example_sps_1.png: Image that displays the extracted out sequences of events that comprise the set piece sequence displayed in the corresponding GIF.
match_2500097_spp_2.gif: Another GIF of an example set piece sequence that was identified and extracted by the source code that compiles all of the set piece sequences in our data set.
match_2500097_spp_2.png: Image that displays the extracted out sequences of events that comprise the set piece sequence displayed in the corresponding GIF.

7. .gitignore: Text file that specifies all of the files and directories that were not written to this repository for reasons such as security and memory/size limitations.

8. README.md: Markdown file that contains all of the information used to generate this summary view.

9. requirements.txt: Text file that contains all of the Python packages and their versions that are needed to successfully run the code in this project.

10. setup.py: Python script that allows the user to create a new local module that is comprised of all of the code found in the `src/` directory.

Future Improvements

Immediate next steps that can be taken to improve the quality of this project are:

Investigate model behavior on subsets of the data partitioned on set piece type.
Analyze model predictions on a team-by-team basis as a consistency check.
Build code infrastructure to handle full event tracking data. By the full event tracking data, we mean the paid version of the data set used in this project that tells you not only the starting and ending locations of the ball for each event, but also those locations for every player on the field. Perhaps the most important aspect to this is determining new features to "engineer" using the additional information of the paid data.

gosebastian12/Set_Piece_Strategy
