# Nashville SC Technical Analytics Internship Data Project

This is my submission for the data project required as part of the Nashville SC 2021 Spring Internship selection process.

### 1. Problem Statement

For the project, you will be assuming the role of Data Engineer at an MLS club. You've been tasked with sending your supervisor a CSV containing a timestamped list of on-ball pressures (defending team player within 5 yards of the player in possession of the ball and moving towards the ball) from your last match.

You have been provided the following files from the match:

* Tracking Data (contains player and ball coordinates)
* Metadata (contains player names and their playerID)
* Documentation for the Tracking Data and Metadata files 
* Full match video of the game

### 2. Import the required libraries and modules

In [1]:
import json
import pandas as pd
import csv
import math

### 3. Load and explore the metadata and the tracking data

In [2]:
# loading the metadata
with open('metadata.json') as f:
    metadata = json.load(f)

In [3]:
print(json.dumps(metadata, indent=2))

{
  "description": "NSH - ATL : 2020-9-13",
  "startTime": 1599957484770,
  "year": 2020,
  "month": 9,
  "day": 13,
  "pitchLength": 105.7656021118164,
  "pitchWidth": 68.58000183105469,
  "fps": 25.0,
  "periods": [
    {
      "number": 1,
      "startFrameClock": 1599957484770,
      "endFrameClock": 1599960188450,
      "startFrameIdx": 0,
      "endFrameIdx": 67592,
      "homeAttPositive": false
    },
    {
      "number": 2,
      "startFrameClock": 1599961165490,
      "endFrameClock": 1599964110050,
      "startFrameIdx": 67593,
      "endFrameIdx": 141207,
      "homeAttPositive": true
    }
  ],
  "homePlayers": [
    {
      "name": "A. Danladi",
      "number": 7,
      "position": "RW",
      "ssiId": "0811e82b-9274-4a21-8071-83647b4c4bc9",
      "optaId": "202195",
      "optaUuid": "1wfz5avhq6h41axvxggwa6vvt"
    },
    {
      "name": "A. Johnston",
      "number": 12,
      "position": "RB",
      "ssiId": "0c0f65ce-18f4-4f84-862c-29b5d1f9b144",
      "optaId": "479

In [4]:
# loading and reading the tracking data into a dataframe
tracking_data = pd.read_json('tracking_data.jsonl', lines=True)

In [5]:
tracking_data.head()

| | period | frameIdx | gameClock | wallClock | homePlayers | awayPlayers | ball | live | lastTouch |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.0 | 1599957484770 | [{'playerId': '220753', 'number': 29, 'xyz': [... | [{'playerId': '41705', 'number': 1, 'xyz': [-4... | {'xyz': [-0.06, 0.04, 0.22], 'speed': 17.17} | False | home |
| 1 | 1 | 1 | 0.04 | 1599957484810 | [{'playerId': '220753', 'number': 29, 'xyz': [... | [{'playerId': '41705', 'number': 1, 'xyz': [-4... | {'xyz': [-0.75, 0.04, 0.29], 'speed': 17.0} | True | away |
| 2 | 1 | 2 | 0.08 | 1599957484850 | [{'playerId': '220753', 'number': 29, 'xyz': [... | [{'playerId': '41705', 'number': 1, 'xyz': [-4... | {'xyz': [-1.49, 0.02, 0.34], 'speed': 16.83} | True | away |
| 3 | 1 | 3 | 0.12 | 1599957484890 | [{'playerId': '220753', 'number': 29, 'xyz': [... | [{'playerId': '41705', 'number': 1, 'xyz': [-4... | {'xyz': [-2.21, 0.02, 0.35000000000000003], 's... | True | away |
| 4 | 1 | 4 | 0.16 | 1599957484930 | [{'playerId': '220753', 'number': 29, 'xyz': [... | [{'playerId': '41705', 'number': 1, 'xyz': [-4... | {'xyz': [-2.91, 0.02, 0.34], 'speed': 16.46} | True | away |


For more information on the metadata and the tracking data, please read the documentation [here](data_documentation.pdf). 

### 4. Create the functions to be used

In [6]:
# function to convert values in yard(s) to meter(s)
# 1 yard is exactly 0.9144 meters
def yard_to_meter(yard):
    return yard * 0.9144

In [7]:
# function to calculate the Euclidean distance between two location points
def euclidean_distance(a, b):
    x = pow(a[0]-b[0], 2)
    y = pow(a[1]-b[1], 2)
#     z = pow(a[2]-b[2], 2)  # to be used when z-axis coordinate values are to be considered for calculating distances
    dist = math.sqrt(x+y)
    return dist

In [8]:
# function to find the name and number of a home team player from the metadata using the player's Opta ID
def find_player(player_id):
    for player in metadata['homePlayers']:
        if player['optaId'] == player_id:
            return player['name'], player['number']
    return None, None  # no home team player with this Opta ID

In [9]:
# function to help convert game clock value (expressed in seconds) to minutes:seconds format
def game_clock_minutes(game_clock_seconds):
    minutes = int(game_clock_seconds//60)
    seconds = str(int(game_clock_seconds%60)).zfill(2)
    return minutes, seconds

In [10]:
# function to find out whether the on-ball player is under pressure from opposition team players
# criterion used here: at least one opposition team player is within 5 yards of the on-ball player
# note: only distance is checked; the "moving towards the ball" part of the definition is not checked
# the function also counts the number of opposition team players within 5 yards of the on-ball player
def on_ball_pressure(player_loc, frame):
    row = tracking_data[tracking_data['frameIdx'] == frame]
    pressure_radius = yard_to_meter(5)  # 5 yards expressed in meters
    count = 0
    for opp_player in row['awayPlayers'].iloc[0]:
        opp_player_loc = opp_player['xyz']
        if euclidean_distance(player_loc, opp_player_loc) < pressure_radius:
            count += 1
    return count > 0, count
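
The problem statement also asks for the defender to be moving towards the ball, which a pure distance check does not capture. Below is a minimal sketch of one way this could be tested, assuming the defender's position in the previous frame is available. The function name `is_moving_towards` is hypothetical and not part of this notebook; it treats the displacement between two frames as the velocity and checks whether it points towards the ball.

```python
def is_moving_towards(prev_loc, curr_loc, ball_loc):
    """Return True if the player's movement between two frames points towards the ball.

    prev_loc, curr_loc, ball_loc: [x, y, ...] coordinates (only x and y are used).
    """
    # displacement of the player between the two frames (a proxy for velocity)
    vx = curr_loc[0] - prev_loc[0]
    vy = curr_loc[1] - prev_loc[1]
    # direction from the player's previous position to the ball
    dx = ball_loc[0] - prev_loc[0]
    dy = ball_loc[1] - prev_loc[1]
    # a positive dot product means the movement has a component towards the ball
    return vx * dx + vy * dy > 0


print(is_moving_towards([0, 0, 0], [1, 0, 0], [5, 0, 0]))   # closing in on the ball
print(is_moving_towards([0, 0, 0], [-1, 0, 0], [5, 0, 0]))  # backing away from the ball
```

A stationary player gives a dot product of 0 and is therefore not counted as moving towards the ball. Whether that is the desired behaviour, and whether a minimum speed should be required, would need to be agreed with the team staff.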

Before moving on, there are a few key details I'd like you to consider:

* The next function, which is also the last one, is the main function. It combines all the functions defined so far to find on-ball players who are under pressure from the opposition team players.


* The ball location is null if the ball is not visible in the tracking area. I have considered only those frames in which the ball location is not null, as I have used the ball location to find the home team players who are "on the ball".


* A player is "on the ball" only when he has made the last touch, and with 25 frames captured per second, it's very unlikely that a player's first touch on receiving the ball is missed. Therefore, another condition that I have used to find an on-ball home team player is that the "last touch" is set to "home".


* The tracking data contains a column titled "live". It is set to True if the ball is "in-play" and to False otherwise. I have considered only those frames in which the ball is "in-play".


* To decide whether a home team player is "on the ball", I have used Euclidean distance as an estimate: a home team player is "on the ball" if the Euclidean distance between the ball location and his location is less than 1 yard. The z-axis coordinate value for all the players in all the frames is set to 0, whereas it is not 0 for the ball. If the z-axis coordinate were included in the distance, a player controlling the ball with his head or upper body might not be considered "on the ball", as the distance might exceed 1 yard. Therefore, I have used only the x-axis and y-axis coordinates while measuring the distance between the player and the ball. This can be discussed with the team staff, and it can be decided if we need to include the z-axis coordinate values as well.
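
To illustrate the last point, here is a small sketch comparing the 2D and 3D distances for a made-up situation. The coordinates are hypothetical and not taken from the tracking data; the helper names are used only for this illustration.

```python
import math

YARD_IN_METERS = 0.9144  # 1 yard expressed in meters


def distance_2d(a, b):
    # distance using only the x and y coordinates
    return math.hypot(a[0] - b[0], a[1] - b[1])


def distance_3d(a, b):
    # distance using the x, y and z coordinates
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(3)))


player_loc = [0.0, 0.0, 0.0]  # players always have z = 0 in the tracking data
ball_loc = [0.3, 0.2, 1.0]    # hypothetical ball about 1 m above the ground

on_ball_2d = distance_2d(player_loc, ball_loc) < YARD_IN_METERS
on_ball_3d = distance_3d(player_loc, ball_loc) < YARD_IN_METERS
print(on_ball_2d, on_ball_3d)  # True False
```

With the z-axis ignored, the player counts as "on the ball" (about 0.36 m away); with the z-axis included, the distance grows to about 1.06 m and the 1 yard threshold is missed.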

In [11]:
# function to create and return a list of on-ball pressure instances with important details
def get_csv_rows(tracking_data):
    csv_rows = []
    for index, row in tracking_data.iterrows():
        frame = row['frameIdx']
        ball_loc = row['ball']['xyz']
        if (ball_loc is not None) and (row['lastTouch'] == 'home') and row['live']:
            for player in row['homePlayers']:
                player_loc = player['xyz']
                if euclidean_distance(player_loc, ball_loc) < yard_to_meter(1):
                    player_on_ball_pressure, opp_pressure_count = on_ball_pressure(player_loc, frame)
                    if player_on_ball_pressure:
                        game_clock_seconds = row['gameClock']
                        minutes, seconds = game_clock_minutes(game_clock_seconds)
                        half = row['period']
                        if half == 2:
                            minutes += 45  # to make the game clock values start from the 45th minute in the second half
                        game_clock = f'{minutes}:{seconds}'
                        unix_timestamp = row['wallClock']
                        player_name, player_number = find_player(player['playerId'])
                        csv_row = [frame, game_clock, half, game_clock_seconds, unix_timestamp, player_number, player_name, 
                                   player_loc, ball_loc, opp_pressure_count]
                        csv_rows.append(csv_row)
    return csv_rows

### 5. Create a CSV file containing a timestamped list of on-ball pressures

In [12]:
with open('on_ball_pressures.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Frame Index', 'Game Clock Time (min:sec)', 'Half', 'Game Clock (in seconds since start of half)', 
                         'Unix Timestamp', 'On-ball Player Number', 'On-ball Player Name', 'On-ball Player Coordinates', 
                         'Ball Coordinates', 'Number of Opposition Players within 5 yd'])
    csv_rows = get_csv_rows(tracking_data)
    csv_writer.writerows(csv_rows)

### 6. Analysis of the resulting CSV file

In [13]:
# reading the CSV file into a dataframe
df = pd.read_csv('on_ball_pressures.csv')

In [14]:
df.head(10)

| | Frame Index | Game Clock Time (min:sec) | Half | Game Clock (in seconds since start of half) | Unix Timestamp | On-ball Player Number | On-ball Player Name | On-ball Player Coordinates | Ball Coordinates | Number of Opposition Players within 5 yd |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 397 | 0:15 | 1 | 15.88 | 1599957500650 | 6 | D. McCarty | [-11.97, 21.42, 0.0] | [-12.38, 21.42, 0.5] | 2 |
| 1 | 398 | 0:15 | 1 | 15.92 | 1599957500690 | 6 | D. McCarty | [-11.95, 21.36, 0.0] | [-12.79, 21.42, 0.6000000000000001] | 2 |
| 2 | 441 | 0:17 | 1 | 17.64 | 1599957502410 | 9 | D. Badji | [-25.44, 23.47, 0.0] | [-25.16, 22.7, 0.55] | 1 |
| 3 | 442 | 0:17 | 1 | 17.68 | 1599957502450 | 9 | D. Badji | [-25.32, 23.27, 0.0] | [-25.3, 22.54, 0.6900000000000001] | 1 |
| 4 | 443 | 0:17 | 1 | 17.72 | 1599957502490 | 9 | D. Badji | [-25.2, 23.07, 0.0] | [-25.44, 22.38, 0.84] | 1 |
| 5 | 444 | 0:17 | 1 | 17.76 | 1599957502530 | 9 | D. Badji | [-25.07, 22.88, 0.0] | [-25.59, 22.21, 1.0] | 1 |
| 6 | 495 | 0:19 | 1 | 19.8 | 1599957504570 | 10 | H. Mukhtar | [-36.45, 18.77, 0.0] | [-37.12, 19.35, 0.14] | 1 |
| 7 | 496 | 0:19 | 1 | 19.84 | 1599957504610 | 10 | H. Mukhtar | [-36.66, 18.87, 0.0] | [-37.32, 19.42, 0.13] | 1 |
| 8 | 497 | 0:19 | 1 | 19.88 | 1599957504650 | 10 | H. Mukhtar | [-36.87, 18.97, 0.0] | [-37.48, 19.48, 0.12] | 1 |
| 9 | 498 | 0:19 | 1 | 19.92 | 1599957504690 | 10 | H. Mukhtar | [-37.09, 19.07, 0.0] | [-37.69, 19.54, 0.11] | 1 |
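
The preview contains one row per frame, so a single pressure situation appears as a run of consecutive frames (397-398, 441-444, 495-498). If one row per situation were preferred, the frame-level rows could be collapsed as in the sketch below. The function `group_into_events` and its `max_gap` parameter are hypothetical and not part of this notebook; the input is the frame index and player name copied from the preview.

```python
def group_into_events(rows, max_gap=1):
    """Collapse (frame_index, player_name) rows into runs of consecutive frames.

    A new event starts when the player changes or when the frame index jumps
    by more than max_gap frames.
    """
    events = []
    for frame, player in rows:
        last = events[-1] if events else None
        if last and last['player'] == player and frame - last['end_frame'] <= max_gap:
            last['end_frame'] = frame  # extend the current event
        else:
            events.append({'player': player, 'start_frame': frame, 'end_frame': frame})
    return events


# (Frame Index, On-ball Player Name) pairs from the first 10 rows of the CSV
preview_rows = [
    (397, 'D. McCarty'), (398, 'D. McCarty'),
    (441, 'D. Badji'), (442, 'D. Badji'), (443, 'D. Badji'), (444, 'D. Badji'),
    (495, 'H. Mukhtar'), (496, 'H. Mukhtar'), (497, 'H. Mukhtar'), (498, 'H. Mukhtar'),
]

for event in group_into_events(preview_rows):
    print(event)
```

For these 10 rows the sketch yields 3 events, one each for McCarty, Badji and Mukhtar. A larger `max_gap` would merge runs that are interrupted by a few missing frames.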


I will verify the results using the match video. Let's check a few instances (on-ball pressure situations are encircled in red; game clock time is encircled in blue):

__1. Frame Index: 397, Game Clock Time (min:sec): 0:15__

<img src="./images/image1.png" />

In this image, you can see that McCarty is being pressed by 2 opposition players within 5 yards of him.

__2. Frame Index: 441, Game Clock Time (min:sec): 0:17__

<img src="./images/image2.png" />

In this image, you can see that Badji is being pressed by 1 opposition player within 5 yards of him.

__3. Frame Index: 495, Game Clock Time (min:sec): 0:19__

<img src="./images/image3.png" />

In this image, you can see that Mukhtar is being pressed by 1 opposition player within 5 yards of him.

Though I have shown only 3 instances here, I have verified multiple instances of on-ball pressures, and the resulting CSV file seems ready to be sent to my supervisor. Only one issue is bothering me: as you may have noticed, there's a time lag of 1-2 seconds between the game clock time in the match video and the game clock value (in min:sec format) in the resulting CSV.

As of now, I have not been able to figure out the reason for this. There might be an actual time lag between the match video game clock and the game clock values in the tracking data, or the game clock values (in min:sec format) that I calculated might not be accurate enough. I have used the game clock (in seconds since start of half) value to calculate its equivalent in the minutes:seconds format. I will try to resolve this issue soon.
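
One quick spot check is to compare the game clock with the time elapsed on the wall clock since the start of the half. The sketch below does this for three rows copied from the CSV preview above, using the `startFrameClock` of period 1 from the metadata. The function name `clock_offset_seconds` is hypothetical.

```python
PERIOD_1_START_MS = 1599957484770  # "startFrameClock" of period 1 in the metadata

# (frame index, game clock in seconds, Unix timestamp in ms) from the CSV preview
samples = [
    (397, 15.88, 1599957500650),
    (441, 17.64, 1599957502410),
    (495, 19.80, 1599957504570),
]


def clock_offset_seconds(game_clock, wall_clock_ms, period_start_ms):
    # seconds elapsed on the wall clock since the start of the half
    elapsed = (wall_clock_ms - period_start_ms) / 1000.0
    # difference between wall-clock elapsed time and the game clock
    return round(elapsed - game_clock, 3)


offsets = [clock_offset_seconds(gc, wc, PERIOD_1_START_MS) for _, gc, wc in samples]
print(offsets)  # [0.0, 0.0, 0.0]
```

For these three frames the two clocks agree exactly, which suggests the lag may lie between the clock shown in the match video and the tracking data rather than in the min:sec conversion. This is only a spot check on a handful of frames, not a conclusion about the whole match.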

For the time being, I have decided to send the resulting CSV file to my supervisor. Since the CSV file I have created also contains game clock (in seconds since start of half) and Unix timestamps of all the on-ball pressure frames taken directly from the tracking data provided, I think that will fulfill my supervisor's requirements and that my supervisor will be able to use those timestamps without any issue.

Nashville SC won the match 4-2. Let's have a look at the goals:

__1. Goal #1 (goalscorer: Badji)__

<img src="./images/image6.png" />

This instance didn't make it to the CSV file. Possible reasons include the ball not being visible in the tracking area, or the instance not being captured by any frame (which is highly unlikely at 25 frames per second).

__2. Goal #2 (goalscorer: Mukhtar)__

<img src="./images/image4.png" />

*Frame Index: 42000-42008, Game Clock Time (min:sec): 28:00 (notice the 2-second time lag in the match video image).*

In this image, you can see that Mukhtar is being pressed by 3 opposition players within 5 yards of him.

__3. Goal #3 (goalscorer: McCarty)__

<img src="./images/image7.png" />

This instance didn't make it to the CSV file. Possible reasons include no opposition player being within 5 yards of McCarty (which seems unlikely to the naked eye, as 2 opposition players are really close to him), or the instance not being captured by any frame (which is highly unlikely at 25 frames per second).

__4. Goal #4 (goalscorer: Danladi)__

<img src="./images/image5.png" />

*Frame Index: 84546-84567, Game Clock Time (min:sec): 56:18 (notice the 1-second time lag in the match video image).*

In this image, you can see that Danladi is being pressed by 1 opposition player within 5 yards of him.
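
The frame indices and clock times quoted for goals #2 and #4 can be cross-checked against the metadata. The sketch below assumes that the game clock restarts at 0 at each period's `startFrameIdx` and advances by 1/`fps` seconds per frame, which matches the first rows of the tracking data but is an assumption for the second half. The function name `frame_to_clock` is hypothetical.

```python
def frame_to_clock(frame_idx, period_start_idx, fps, period):
    """Convert a frame index to a min:sec match clock string."""
    seconds_in_half = (frame_idx - period_start_idx) / fps
    minutes = int(seconds_in_half // 60)
    seconds = int(seconds_in_half % 60)
    if period == 2:
        minutes += 45  # same second-half offset as used for the CSV
    return f'{minutes}:{seconds:02d}'


FPS = 25.0  # "fps" in the metadata

print(frame_to_clock(42000, 0, FPS, 1))      # goal #2, period 1 starts at frame 0
print(frame_to_clock(84546, 67593, FPS, 2))  # goal #4, period 2 starts at frame 67593
```

Under that assumption, the two calls print 28:00 and 56:18, the same clock times as listed for the two goals.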

### 7. Challenges

Here are a few issues and challenges that I faced while working on this data project:

* I have used Euclidean distance to figure out which player, if any, is "on the ball" in each frame. I believe there are better ways to do this. Information identifying the "on the ball" player in each frame would have helped me make better decisions.


* I am yet to find a way to synchronize the match video game clock values with the game clock values (in min:sec format) that I have calculated. Either my calculations are not accurate enough, or the tracking data is not in sync with the match video game clock.


* Some instances that I think are on-ball pressure instances didn't make it to the CSV file. For example, Nashville SC's goal #1 and goal #3 were not detected as on-ball pressure instances. There can be multiple reasons for this, as I have mentioned earlier.
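
To narrow down why a particular frame is missing from the CSV, each filter condition can be checked separately. The sketch below is a hypothetical helper (`explain_frame` is not part of this notebook) that takes one frame in the same shape as a row of the tracking data and lists the conditions it fails. The example frame is made up for illustration.

```python
import math

YARD_IN_METERS = 0.9144  # 1 yard expressed in meters


def planar_distance(a, b):
    # distance using only the x and y coordinates
    return math.hypot(a[0] - b[0], a[1] - b[1])


def explain_frame(frame):
    """Return the list of filter conditions that a tracking frame fails."""
    ball_loc = frame['ball']['xyz']
    if ball_loc is None:
        return ['ball location is null']
    reasons = []
    if frame['lastTouch'] != 'home':
        reasons.append('lastTouch is not home')
    if not frame['live']:
        reasons.append('ball is not live')
    # home team player closest to the ball
    on_ball = min(frame['homePlayers'], key=lambda p: planar_distance(p['xyz'], ball_loc))
    nearest_home = planar_distance(on_ball['xyz'], ball_loc)
    if nearest_home >= YARD_IN_METERS:
        reasons.append(f'nearest home player is {nearest_home:.2f} m from the ball')
    else:
        nearest_opp = min(planar_distance(p['xyz'], on_ball['xyz']) for p in frame['awayPlayers'])
        if nearest_opp >= 5 * YARD_IN_METERS:
            reasons.append(f'nearest opponent is {nearest_opp:.2f} m from the on-ball player')
    return reasons


# made-up frame: the ball is 1.5 m from the only home player
example_frame = {
    'ball': {'xyz': [11.5, 10.0, 0.2]},
    'lastTouch': 'home',
    'live': True,
    'homePlayers': [{'playerId': 'h1', 'xyz': [10.0, 10.0, 0.0]}],
    'awayPlayers': [{'playerId': 'a1', 'xyz': [12.0, 10.0, 0.0]}],
}
print(explain_frame(example_frame))
```

Running this over the frames around goal #1 and goal #3 would show whether they were dropped because of the ball location, the "last touch" value, the 1 yard on-ball threshold, or the 5 yard pressure threshold.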

### 8. Result

* If you have run all the cells in the correct order from the beginning of this notebook, you can find the resulting CSV file [here](on_ball_pressures.csv).

* You can also find the Python script for this notebook [here](on_ball_pressures.py). This script generates the same CSV file on successful execution.