## Parker Dunn
# Exploring Rugby player tracking data from Catapult

### Started on 7 March 2022
### Updates
* March 7 -> getting things set up; mostly working in `datapreparation.py`
* March 8 -> continuing to work on *eda*
    * Working with time to understand the structure of the data relative to the match
    * Working through bugs with DataFrames:
        1. Trying to keep the column information readily available in the GameData class
        2. Trying to separate the full player data into two halves

In [1]:
# IMPORTS
import pandas as pd
import os
from datapreparation import *
from time import ctime
from matplotlib import pyplot

# OTHER SETUP

dir_list = os.listdir("./raw_data")

# Used for debugging ... this is not the problem
# print(dir_list, type(dir_list))
# for file in dir_list:
#     print(file)



# CREATING GAME DATA OBJECT

# ALL MANUALLY ENTERED TIMES ARE IN SECONDS
h1_start = 1582988376
h1_end = 1582990850
h1_length = h1_end - h1_start       # <- this is in seconds, although the print below labels it "cs"
h1_clock = centiseconds_to_clock(h1_length*100)

print(f"Length of 1st Half:\n cs: {h1_length}\n Clock: {h1_clock[0]}:{h1_clock[1]}.{h1_clock[2]}")

h2_start = 1582991486
h2_end = 1582993988
h2_length = h2_end - h2_start       # <- this is in seconds
h2_clock = centiseconds_to_clock(h2_length*100)

game1 = GameData(h1_start, h1_end, h2_start, h2_end)

# EXPLORING THE TIMING INFO
print(f"Length of 2nd Half:\n cs: {h2_length}\n Clock: {h2_clock[0]}:{h2_clock[1]}.{h2_clock[2]}")

# LOOKING AT DATE & TIME OF MATCH BASED ON UNIX TIME
print("\nFIRST HALF")
print(f"Start: {ctime(h1_start)}")
print(f"End: {ctime(h1_end)}")
# print("--> No idea why the date above is returning 1970 <--") <-- FIXED THIS

Length of 1st Half:
 cs: 2474
 Clock: 41:14.0
Length of 2nd Half:
 cs: 2502
 Clock: 41:42.0

FIRST HALF
Start: Sat Feb 29 09:59:36 2020
End: Sat Feb 29 10:40:50 2020
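
The clock conversion and the date check above can be reproduced without `datapreparation.py`. The sketch below is only a guess at what `centiseconds_to_clock` does (its source is not shown here), so it uses a hypothetical name, `cs_to_clock_sketch`, and returns `(minutes, seconds, centiseconds)` to match the printed `41:14.0`. It also converts the kickoff timestamp with an explicit UTC timezone, because `ctime` reports the local time of whatever machine runs the notebook.

```python
from datetime import datetime, timezone


def cs_to_clock_sketch(centiseconds):
    """Hypothetical stand-in for centiseconds_to_clock.

    Returns (minutes, seconds, centiseconds).
    """
    total_seconds, cs = divmod(int(centiseconds), 100)
    minutes, seconds = divmod(total_seconds, 60)
    return minutes, seconds, cs


# Unix times in seconds, copied from the cell above
h1_start = 1582988376
h1_end = 1582990850

h1_clock = cs_to_clock_sketch((h1_end - h1_start) * 100)
print(f"{h1_clock[0]}:{h1_clock[1]:02d}.{h1_clock[2]}")

# Timezone-aware version of the ctime() check
kickoff_utc = datetime.fromtimestamp(h1_start, tz=timezone.utc)
print(kickoff_utc.isoformat())
```

The kickoff works out to 14:59:36 UTC, so the `09:59:36` printed by `ctime` above implies the notebook ran on a machine five hours behind UTC.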


__Want to double check my work on the timing stuff__
25 minutes seems like a short half, so I decided to check the tape on the game

1st Half in Video
* Start: 2:45
* End: approx. 31:27

Notes about 1st Half:
* There seems to be a long break at the end of the half where nothing is happening
* Both teams are waiting for a final kick at the end of the half that takes a few minutes
* WAIT NEVERMIND -> *full game play is happening until at least 30:34*
* Actual Game Time for First Half --> 30:34 - 2:45 = 27:49 (approx 28:00)
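
The video arithmetic above is easy to get wrong by hand, so here is a small check. The helper names (`clock_to_seconds`, `seconds_to_clock`) are made up for this note, and the tracked length is taken from the manually entered unix times in the first cell.

```python
def clock_to_seconds(clock):
    """Convert an 'MM:SS' video timestamp to whole seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


def seconds_to_clock(total_seconds):
    """Convert whole seconds back to an 'M:SS' string."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


video_start = clock_to_seconds("2:45")
video_play_end = clock_to_seconds("30:34")   # last confirmed game play
video_half_end = clock_to_seconds("31:27")   # approximate end of half

play_length = video_play_end - video_start
full_length = video_half_end - video_start
tracked_length = 1582990850 - 1582988376     # h1_end - h1_start, in seconds

print("Game play in video:  ", seconds_to_clock(play_length))
print("Full half in video:  ", seconds_to_clock(full_length))
print("Tracked minus video: ", seconds_to_clock(tracked_length - play_length))
```

The gap between the tracked half and the video half is large enough that the manually entered start and end times are worth rechecking against the footage.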

2nd Half in Video
* Start:
* End:

In [2]:
# Loading the Data

full_game_player_dds = []
for file in dir_list:
    if file.endswith(".csv"):
        player_dd = load_to_dask_df(file)
        # print(type(player_dd))
        full_game_player_dds.append(player_dd)

# debugging above

# print(len(full_game_player_dds)) # <- returned 15 ... also not the issue
# for player in full_game_player_dds:
#     # print(type(player)) # <-- 'dask.dataframe.core.DataFrame'
#     print(str(player.athlete_id.unique().compute())) # <-- does return athlete_id object
#     # print(player.columns) # <--- aaahhhh this returns an Index Object ... not a
#     print(list(player.columns.array))
#     # print(player.columns.data) # <-- this doesn't work

# WRITING THE INFORMATION ABOUT THE PLAYER DataFrame to GameData object
for player in full_game_player_dds:
    key = str(player.device_id.unique().compute())
    cols = list(player.columns.array)
    game1.store_df_features(key, cols)

game1.print_df_features()

df name: 0    16710
Name: device_id, dtype: int64 -> ['athlete_id', 'device_id', 'first_name', 'last_name', 'jersey', 'cs_time_unix', 'X', 'Y', 'accel', 'vel', 'smoothed_load', 'period', 'position_name']
df name: 0    11354
Name: device_id, dtype: int64 -> ['athlete_id', 'device_id', 'first_name', 'last_name', 'jersey', 'cs_time_unix', 'X', 'Y', 'accel', 'vel', 'smoothed_load', 'period', 'position_name']
df name: 0    11616
Name: device_id, dtype: int64 -> ['athlete_id', 'device_id', 'first_name', 'last_name', 'jersey', 'cs_time_unix', 'X', 'Y', 'accel', 'vel', 'smoothed_load', 'period', 'position_name']
df name: 0    14114
Name: device_id, dtype: int64 -> ['athlete_id', 'device_id', 'first_name', 'last_name', 'jersey', 'cs_time_unix', 'X', 'Y', 'accel', 'vel', 'smoothed_load', 'period', 'position_name']
df name: 0    15952
Name: device_id, dtype: int64 -> ['athlete_id', 'device_id', 'first_name', 'last_name', 'jersey', 'cs_time_unix', 'X', 'Y', 'accel', 'vel', 'smoothed_load', 'period

__Not sure yet whether it is a problem that these are not "neat" data types (ABOVE)__
* Feels like it would be better if the "name" (a.k.a. the dictionary key) was just a simple object type
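
One way to get a simple key, sketched with plain pandas: the result of `.unique().compute()` is a pandas Series, so calling `str()` on it captures the whole printed representation (index, name and dtype). Pulling out the first element gives a plain integer instead. The Series below is a stand-in built by hand, and this assumes each player file contains exactly one `device_id`.

```python
import pandas as pd

# Stand-in for player.device_id.unique().compute()
unique_ids = pd.Series([16710], name="device_id")

# What the loop above stores as the key: the printed Series
messy_key = str(unique_ids)

# A plain Python int taken from the first (assumed only) element
clean_key = int(unique_ids.iloc[0])

features = {clean_key: ["athlete_id", "device_id", "cs_time_unix"]}
print(repr(messy_key))
print(features)
```

If a file could ever hold more than one device, checking `len(unique_ids) == 1` before taking the first element would catch it.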

In [4]:
# Splitting the player data into two halves

h1_player_dds = []
h2_player_dds = []
for full_game_player in full_game_player_dds:
    h1_player, h2_player = split_into_halves(game1, full_game_player)
    h1_player_dds.append(h1_player)
    h2_player_dds.append(h2_player)

# Double checking that above is working
print(f"1st Half player data: {len(h1_player_dds)}")
print(f"2nd Half player data: {len(h2_player_dds)}")

1st Half player data: 15
2nd Half player data: 15
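
For reference, a minimal sketch of what a time-based split could look like. This is not the code in `split_into_halves` (its source is not shown here); the function name `split_by_time_sketch` is hypothetical, and it assumes `cs_time_unix` holds unix time in centiseconds while the half boundaries are in seconds. It is written against pandas with a toy DataFrame; dask DataFrames generally accept the same boolean-mask syntax.

```python
import pandas as pd


def split_by_time_sketch(df, h1_start, h1_end, h2_start, h2_end,
                         time_col="cs_time_unix"):
    """Hypothetical stand-in for split_into_halves.

    Half boundaries are unix seconds; time_col is assumed to be centiseconds.
    """
    t = df[time_col]
    first = df[(t >= h1_start * 100) & (t <= h1_end * 100)]
    second = df[(t >= h2_start * 100) & (t <= h2_end * 100)]
    return first, second


# Toy rows: before kickoff, inside each half, halftime, after full time
toy = pd.DataFrame({
    "cs_time_unix": [158298837500, 158298837600, 158299085000,
                     158299100000, 158299148600, 158299398900],
    "vel": [0.0, 1.0, 2.0, 0.0, 3.0, 0.0],
})

h1_toy, h2_toy = split_by_time_sketch(
    toy, 1582988376, 1582990850, 1582991486, 1582993988
)
print(len(h1_toy), len(h2_toy))
```

Filtering on the `period` column might be an alternative, depending on how Catapult populates it.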


### NEXT UP:
* I think I will jump to another script to do some data visualization
    * Should create a function to add a new "game_time" feature that converts the unix times to time elapsed since the start of the half
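
A rough sketch of the planned "game_time" function, under the assumption that `cs_time_unix` is unix time in centiseconds and the half start is in seconds (as with the manually entered times). The name `add_game_time_sketch` is a placeholder. It uses `assign`, which is available on both pandas and dask DataFrames.

```python
import pandas as pd


def add_game_time_sketch(df, half_start_s, time_col="cs_time_unix"):
    """Add 'game_time': seconds elapsed since the start of the half.

    half_start_s is unix seconds; time_col is assumed to be centiseconds.
    """
    elapsed_cs = df[time_col] - half_start_s * 100
    return df.assign(game_time=elapsed_cs / 100.0)


# Toy data: kickoff, 0.1 s later, and one minute later
toy = pd.DataFrame({"cs_time_unix": [158298837600, 158298837610, 158298843600]})
with_time = add_game_time_sketch(toy, 1582988376)
print(with_time)
```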


* When I come back...
    1. Add a new time feature to DataFrames - plan to create a function to take care of it
    2. Consider merging all of the DataFrames by half ... would make it easier to manipulate the player data within time frames
        * Simply extract a time frame, then manipulate all players in a single DataFrame
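
The merge idea in item 2 could look something like the sketch below. It uses pandas and two hand-made player frames; the `game_time` column is the planned feature from item 1 and does not exist in the data yet, and the `vel` values are invented. For the dask DataFrames used in this notebook, `dask.dataframe.concat` should play the same role as `pd.concat`.

```python
import pandas as pd

# Two toy player frames for one half (values are made up)
p1 = pd.DataFrame({"device_id": [16710] * 3,
                   "game_time": [0.0, 30.0, 90.0],
                   "vel": [1.2, 3.4, 2.0]})
p2 = pd.DataFrame({"device_id": [11354] * 3,
                   "game_time": [0.0, 30.0, 90.0],
                   "vel": [0.5, 4.1, 2.2]})

# Stack all players into one long DataFrame for the half
h1_all = pd.concat([p1, p2], ignore_index=True)

# Extract one time frame, then work on every player at once
in_window = (h1_all["game_time"] >= 20.0) & (h1_all["game_time"] < 60.0)
window = h1_all[in_window]
mean_vel = window.groupby("device_id")["vel"].mean()
print(mean_vel)
```

Keeping `device_id` as a column is what lets a single `groupby` replace the loop over per-player DataFrames.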
