# SailGP Data Analyst Challenge

The aim is to test your Python abilities. The challenge is to analyze the data provided and answer the questions below. You can use any library you want to help you with the analysis. The data is from the SailGP event in Auckland 2025. The data is in the 'DATA' folder.

There are various sources available.

The Boat Logs are in the 'Boat_Logs' folder. The data is in csv format and the columns are described in the "Boat_Logs/Boat_Logs_Columns.csv" file.
The "Course_Marks_2025-01-19.csv" file contains the mark positions and wind reading on the course for the whole day.

The Race_XML folder contains the xml files for each race that contains information on where the boundaries of the course are, the theoretical position of the marks and the target racecourse axis.

The 2025-01-19_man_summary.csv file contains the metrics from the manoeuvre summary for the day.
The 2025-01-19_straight_lines.csv file contains the metrics from the straight line summary for the day.

Both are derived from the boat logs.

The 2502 m8_APW_HSB2_HSRW.kph.csv file contains the polar data for the boats in that config.

## Requirements
- Choose at least 3 questions from the list below to answer.
- Python 3.8 or higher
- Notebook should be able to run without any errors from start to finish.
- Specify the libraries (imports) used in the notebook.
- Any comments to make the notebook self-explanatory and easy to follow would be appreciated.
- If you can't get to the end of a question, we would appreciate the code you have written so far and an explanation of what you were trying to do.

## Further information:
- We usually use bokeh for visualizations. So any showcase of bokeh would be appreciated.

## Submitting the results.
It would be great if you could provide a jupyter notebook with the code and the results of the analysis. You can submit the results by sharing a link to a git repository.


### Imports and re-used functions
Free section to initialize the notebook with the necessary imports and functions that will be used in the notebook.



In [1]:
#ytube video of day racing
#https://www.youtube.com/watch?v=pDuPn3_cDJo

#Q1 imports
import numpy as np
import pandas as pd

from bokeh.plotting import figure, show
from bokeh.io import output_notebook, curdoc
from bokeh.models import ColumnDataSource, Slider, CustomJS
from bokeh.layouts import column

from bokeh.themes import Theme
import yaml
# Load YAML file
with open("custom_theme.yaml", "r") as f:
    theme_yaml = yaml.safe_load(f)

# Apply the theme
custom_theme = Theme(json=theme_yaml)
curdoc().theme = custom_theme

output_notebook()  # Display in Jupyter Notebook

import helpers.ingest

#Q2 imports
import helpers.maths
from bokeh.models import Arrow, OpenHead


#Q5 imports
import helpers.pathing
import folium

#Q7 imports
from sklearn.preprocessing import StandardScaler
from bokeh.models import ColumnDataSource, HoverTool
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


#Q8 imports
from bokeh.models import Span
import holoviews as hv
from holoviews import opts
from bokeh.layouts import gridplot
from bokeh.models import LabelSet

hv.extension('bokeh')


## Question 1: Write a Python function that can take a compass direction (ie. TWD or Heading) and calculate an accurate mean value across a downsampled frequency. Eg. If TWD is at 1Hz, give me a 10s average.

## **STRATEGY**
This is a relatively straightforward operation, but care must be taken because compass directions wrap around within the interval 0-359: the mean of 002 and 358 is 000, not 180. To avoid this, decompose the bearings into unit vectors, average the components, and convert the result back into a compass direction.
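
As a quick sanity check of the wrap-around problem, here is a minimal standalone sketch. The function name `circular_mean_deg` is my own for this illustration and is not part of the helpers used below.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean of compass bearings in degrees, returned in the interval [0, 360)."""
    radians = np.deg2rad(np.asarray(angles_deg, dtype=float))
    # average the unit-vector components rather than the raw angles
    mean_x = np.cos(radians).mean()
    mean_y = np.sin(radians).mean()
    mean_deg = np.rad2deg(np.arctan2(mean_y, mean_x))
    # round first, then wrap, so a value such as 359.999 cannot round up to 360.0
    return float(np.round(mean_deg, 2) % 360)

naive = float(np.mean([2, 358]))        # arithmetic mean of the raw angles
circular = circular_mean_deg([2, 358])  # vector mean of the same bearings
print(naive, circular)
```

The arithmetic mean points due south, whereas the vector mean points north, which is what a sailor would expect from two readings either side of 000.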

In [None]:


BOAT = "GBR"

def heading_downsample_avg(heading_data:pd.DataFrame,period:float):
    # takes a datetime indexed dataframe column and returns a time-averaged mean
    # period is in seconds
    # first converts heading angle values to orthogonal vectors which are time averaged
    # and then converted back into heading values
    x_values = np.cos(np.deg2rad(heading_data))
    y_values = np.sin(np.deg2rad(heading_data))
    
    rolling_x = x_values.resample(f"{period}s").mean()
    rolling_y = y_values.resample(f"{period}s").mean()
    
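    # convert the arctan2 result (degrees in -180..180) back into the 0-360 interval
    # np.arctan2 is used because np.atan2 only exists as an alias from NumPy 2.0 onwards
    rolling_heading = np.rad2deg(np.arctan2(rolling_y,rolling_x))
    # round back to original precision before wrapping, so 359.996 becomes 0.0 rather than 360.0
    rolling_heading = np.round(rolling_heading,2) % 360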
    
    return rolling_heading


df_boat = helpers.ingest.boat_ingest(BOAT)
result = heading_downsample_avg(df_boat["HEADING_deg"],10)
result.to_csv("Q1_ans.csv")

## add an averaging graph demo here?

## Question 2: Given a course XML and a timeseries of boat Lat/Lon values, calculate a VMC column for the same timeseries.


## **STRATEGY**
VMC - Velocity Made good on Course: the component of velocity in the direction of the destination (here, the next mark)  
#### Steps:
1. Convert positions into displacements  
2. Convert displacements into velocities  
3. Calculate unit vectors from displacements to next mark  

4. VMC speed is the dot product of the destination unit vector and current velocity.  
5. VMC vector is the unit vector to the next mark multiplied by the VMC speed.

A plot has been included which displays the track from the timeseries, with the VMC vector overlaid.
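
The steps above can be sketched on a single pair of GPS fixes. This is a minimal illustration, not the notebook's implementation. It assumes a flat-earth (equirectangular) approximation over the short distances involved. The names `displacement_m` and `vmc_m_s` and the example coordinates are hypothetical, standing in for whatever `helpers.maths.coords_to_displacement` does internally.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

def displacement_m(lat1, lon1, lat2, lon2):
    """Approximate (east, north) displacement in metres between two nearby points."""
    mean_lat = np.deg2rad((lat1 + lat2) / 2)
    east = np.deg2rad(lon2 - lon1) * np.cos(mean_lat) * EARTH_RADIUS_M
    north = np.deg2rad(lat2 - lat1) * EARTH_RADIUS_M
    return np.array([east, north])

def vmc_m_s(prev_fix, fix, dt_s, mark):
    """Velocity made good towards `mark` in m/s; every point is a (lat, lon) tuple."""
    velocity = displacement_m(*prev_fix, *fix) / dt_s   # steps 1 and 2
    to_mark = displacement_m(*fix, *mark)               # step 3
    unit_to_mark = to_mark / np.linalg.norm(to_mark)
    return float(np.dot(velocity, unit_to_mark))        # step 4

# Made-up fixes: the boat moves due north for one second
prev_fix, fix = (-36.8400, 174.7600), (-36.8399, 174.7600)
speed = float(np.linalg.norm(displacement_m(*prev_fix, *fix)))
vmc_to_north_mark = vmc_m_s(prev_fix, fix, 1.0, (-36.8300, 174.7600))
vmc_to_east_mark = vmc_m_s(prev_fix, fix, 1.0, (-36.8399, 174.7700))
```

With the mark dead ahead the VMC equals the boat speed, and with the mark abeam it drops to zero, which is the behaviour the dot product is meant to capture.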

#### Extras (not implemented)
Given the boat data includes the leg and race information, one could look up which race data and import dynamically based on the time of the data provided.
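
A rough sketch of how that lookup could work is shown below. The start times and file names are placeholders that I have made up for illustration. In practice the start times would be read from each race XML (for example with `helpers.ingest.get_start_time`).

```python
from bisect import bisect_right
from datetime import datetime

# Placeholder (start time, course file) pairs, sorted by start time.
# These values are invented for the example and are NOT the real race times.
RACE_STARTS = [
    (datetime(2025, 1, 19, 16, 5), "race_A.xml"),
    (datetime(2025, 1, 19, 16, 30), "race_B.xml"),
    (datetime(2025, 1, 19, 16, 55), "race_C.xml"),
]

def course_for_time(timestamp, race_starts=RACE_STARTS):
    """Return the course file of the latest race started at or before `timestamp`."""
    starts = [start for start, _ in race_starts]
    idx = bisect_right(starts, timestamp) - 1
    if idx < 0:
        return None  # timestamp is before the first race of the day
    return race_starts[idx][1]

selected = course_for_time(datetime(2025, 1, 19, 16, 40))
```

This only picks the most recent start, so data logged between races would still be assigned to the previous race unless finish times are also taken into account.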

In [4]:
# define boat
BOAT = "GBR" 
# define start and end times of time series
data_starttime = "2025-01-19 16:07:25"
data_endtime = "2025-01-19 16:20:00"
# define course
COURSE = "Data/Race_XMLs/25011905_03-13-55.xml"

boat_data = helpers.ingest.boat_ingest(BOAT)
course = helpers.ingest.ingest_course(xml_path=COURSE,map_course=False)

pd.options.mode.chained_assignment = None  # default='warn'

def calculate_VMC(boat_data,course,start="2025-01-19 16:07:25",end="2025-01-19 16:20:00"):
    """
    Calculates and adds a VMC column in both km/h vector and scalar format.
    :param boat_data: Timeseries data for a boat
    :param course: Course data output from ingest_course()
    :param start: start time
    :param end: end time
    """
    # slice using the function arguments rather than the notebook-level globals,
    # and copy so that the new columns are not written onto a view of the input
    boat_data = boat_data.loc[start:end].copy()
    # initialise calculation vectors
    displacements = [np.array(0)]*(len(boat_data))
    V_vec = [np.array(0)]*(len(boat_data))
    VMC_vec = [np.array(0)]*(len(boat_data))
    VMC_km_h = [0]*(len(boat_data))

    boat_data['time_td'] = boat_data.index.to_series().diff()#.astype('timedelta64[ms]')

    # iterate through boat data and add a column for displacements 
    for i in range(len(boat_data)-1):
        displacements[i+1] = helpers.maths.coords_to_displacement(boat_data["LATITUDE_GPS_unk"].iloc[i],
                                                            boat_data["LONGITUDE_GPS_unk"].iloc[i],
                                                            boat_data["LATITUDE_GPS_unk"].iloc[i+1],
                                                            boat_data["LONGITUDE_GPS_unk"].iloc[i+1])
        # this calculation of the VMC vector is quite rough, and would probably need some filtering
        V_vec[i+1] = displacements[i+1]/(boat_data['time_td'].iloc[i+1].total_seconds())

        # need to understand which mark to be aiming for
        next_mark_idx = int(boat_data["TRK_LEG_NUM_unk"].iloc[i+1])

        next_mark = course[next_mark_idx]["marks"][0] # again it would be nice to have a 'preferred mark' here but not implemented
        mark_vector = helpers.maths.coords_to_displacement(boat_data["LATITUDE_GPS_unk"].iloc[i+1],
                                                            boat_data["LONGITUDE_GPS_unk"].iloc[i+1],
                                                            next_mark["lat"],
                                                            next_mark["lon"],
                                                            unit=True)

        VMC_km_h[i+1] = float(np.dot(V_vec[i+1],mark_vector))
        VMC_vec[i+1] = float(np.dot(V_vec[i+1],mark_vector)) * mark_vector

    boat_data["VMC_km_h"] = VMC_km_h
    boat_data["VMC_vec"] = VMC_vec
    
    return boat_data

boat_data_vmc = calculate_VMC(boat_data=boat_data,course=course,start=data_starttime,end=data_endtime)



Ingested course with 8 points.


In [None]:
###### PLOTTING #######

course_marks = []
course_names = []

for gate in course:
    for mark in gate['marks']:
        course_marks.append(mark)
        course_names.append(gate['name'])

# Creating a ColumnDataSource for the course marks
course_source = ColumnDataSource(data={'lat': [mark['lat'] for mark in course_marks],
                                       'lon': [mark['lon'] for mark in course_marks],
                                       'name': course_names})
boat_data = boat_data_vmc.iloc[1:]

# Bokeh Data Sources
source_full = ColumnDataSource(data=boat_data)  # Dataset without the first row, which has no velocity or VMC vector
source_vectors = ColumnDataSource(data={"lat": [boat_data_vmc["LATITUDE_GPS_unk"].iloc[0]], "lon": [boat_data_vmc["LONGITUDE_GPS_unk"].iloc[0]],
                                        "vec_x": [boat_data_vmc["LONGITUDE_GPS_unk"].iloc[0]+0.1], "vec_y": [boat_data_vmc["LATITUDE_GPS_unk"].iloc[0]+0.1]})

# Bokeh Figure
p = figure(title="Interactive GPS Path & Vectors", match_aspect=True,
           x_axis_label="Longitude", y_axis_label="Latitude",
           tools="pan,wheel_zoom,reset", width=800, height=500)

p.scatter(x='lon', y='lat', source=course_source, size=8, color="blue", alpha=0.7, legend_label="Course Marks")

# Scatter plot for full timeseries path
p.scatter(x="LONGITUDE_GPS_unk", y="LATITUDE_GPS_unk", source=source_full, size=5, color="gray", alpha=0.5, legend_label="Full Path")

# Vector arrows for the selected time index
arrow = Arrow(end=OpenHead(size=10), 
              x_start="lon", y_start="lat", 
              x_end="vec_x", y_end="vec_y",
              source=source_vectors, line_width=2, line_color="red")
p.add_layout(arrow)


# Slider to select time
# Would be nicer if the slider had the datetime displayed but MVP achieved
slider = Slider(start=0, end=len(boat_data) - 1, value=0, step=1, title="Time Index")



# JS callback to update vector and position
callback = CustomJS(args=dict(source=source_full, vec_source=source_vectors), code="""
    var data = source.data;
    var vec_data = vec_source.data;
    var idx = cb_obj.value;  // Get the slider value (timestamp index)

    // Update position (latitude, longitude)
    vec_data['lat'][0] = data['LATITUDE_GPS_unk'][idx];
    vec_data['lon'][0] = data['LONGITUDE_GPS_unk'][idx];
    
    // Extract vector components (VMC_vec is expected to be a 2D array)
    var vec = data['VMC_vec'][idx];
    vec_data['vec_x'][0] = vec_data['lon'][0] + vec[0]/(8000);  // X-component of the vector
    vec_data['vec_y'][0] = vec_data['lat'][0] + vec[1]/(8000);  // Y-component of the vector
    
    vec_source.change.emit();  // Trigger update
""")

slider.js_on_change("value", callback)  # Attach JS callback to slider


# Layout & Display
layout = column(p, slider)
show(layout)  # Works in standalone mode

## Question 3: Verify and comment on the boats calibration. If possible propose a post-calibrated set of wind numbers and a potential calibration table.


## Question 4: Given a timeseries of Lat/Lon positions and a course XML, in a Python notebook, calculate a Distance to Leader metric for each boat.

## Question 5: Given a course XML, along with a wind speed and direction and a polar, calculate the minimum number of tacks or gybes for each leg of the course and each gate mark on the leg.

## STRATEGY
 I am choosing to ignore the course boundaries for the initial implementation.
 I am also taking as inputs the lowest and highest TWA that the boat can sail (its highest upwind mode and lowest downwind mode).
 I can't see how a polar should be involved in these calculations beyond setting a reasonable limit on the best possible upwind or downwind angle.

 With these assumptions, the answer for each leg is either 0 or 1.
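
On the role of the polar: one way it could supply those two TWA limits is by picking the angles that maximise upwind and downwind VMG at a given wind speed. The sketch below uses a made-up polar slice (the numbers are illustrative only and are not taken from the supplied polar file), with VMG taken as boat speed multiplied by the cosine of TWA.

```python
import numpy as np

# Hypothetical polar slice at a single wind speed: illustrative values only
twa_deg = np.array([30, 40, 50, 60, 90, 120, 140, 150, 160])
bsp_kmh = np.array([20, 35, 45, 50, 70, 75, 70, 60, 45])

# VMG towards the wind: positive when sailing upwind, negative downwind
vmg_kmh = bsp_kmh * np.cos(np.deg2rad(twa_deg))

best_upwind_twa = int(twa_deg[np.argmax(vmg_kmh)])    # largest VMG towards the wind
best_downwind_twa = int(twa_deg[np.argmin(vmg_kmh)])  # largest VMG away from the wind
print(best_upwind_twa, best_downwind_twa)
```

Angles chosen this way describe the fastest modes rather than the hard limits of what the boat can sail, so they would be a starting point for `MIN_TWA` and `MAX_TWA` rather than a replacement for them.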

**Zero manoeuvre case**

 The destination vector is between the UPWIND_MAX_HEADING and DOWNWIND_MAX_HEADING vectors for that tack

**One manoeuvre case**

 The destination vector is not between the UPWIND_MAX_HEADING or DOWNWIND_MAX_HEADING vectors for that tack

 
 So I need to work out the bearing from gate to gate, and check whether it fits between the two limiting headings once the allowable TWA limits have
 been rotated by the TWD. 

**Mark Manoeuvres**

If the mark at the end of the leg is not a gate and has a specific rounding direction (P vs S), then that needs to be accounted for.
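
The zero and one manoeuvre cases can be sketched as a standalone check. This is a simplified illustration and not the code in `helpers.pathing.manv_required`. The function names are my own, and I am assuming the convention that a positive signed TWA means the wind comes over the starboard side (starboard tack).

```python
def signed_twa(bearing_deg, twd_deg):
    """Signed true wind angle in [-180, 180); positive = wind over the starboard side (assumed)."""
    return (twd_deg - bearing_deg + 180) % 360 - 180

def tack_for_bearing(bearing_deg, twd_deg, min_twa=45, max_twa=150):
    """Tack on which the bearing can be sailed directly, or None if it is out of range."""
    twa = signed_twa(bearing_deg, twd_deg)
    if not (min_twa <= abs(twa) <= max_twa):
        return None  # too close to the wind, or too deep downwind
    return "Starboard" if twa > 0 else "Port"

def leg_manoeuvres(bearing_deg, twd_deg, current_tack, min_twa=45, max_twa=150):
    """0 if the leg can be sailed on the current tack, otherwise 1 (boundaries ignored)."""
    sailable_tack = tack_for_bearing(bearing_deg, twd_deg, min_twa, max_twa)
    return 0 if sailable_tack == current_tack else 1

# With the wind from 060, a bearing of 330 is a beam reach on starboard
reach_same_tack = leg_manoeuvres(330, 60, "Starboard")
reach_other_tack = leg_manoeuvres(330, 60, "Port")
dead_upwind = leg_manoeuvres(60, 60, "Starboard")
```

A bearing that is dead upwind or dead downwind is not sailable on either tack, so it always costs one manoeuvre under these assumptions.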
 

In [None]:

# TODOS
# ideally one would sail from the most advantageous end of each gate, not just a 'random' end (mark[0])
# calculation of tacking before the boundary 
# - I can't think of a way besides projecting the heading until it hits the boundary OR the layline.
# - This feels awkward and inelegant 
#
#############################
# USER PARAMETERS
# MIN_TWA - highest upwind mode
# MAX_TWA - lowest downwind mode
# TWD - True wind direction for course


MIN_TWA = 45
MAX_TWA = 150
TWD = 60

COURSE = "Data/Race_XMLs/25011906_03-35-13.xml"
# POLAR = "Data/2502 m8_APW_HSB2_HSRW.kph.csv"

#############################
# rotate best upwind and downwind modes by TWD
MIN_TWA = (MIN_TWA + TWD) % 360
MAX_TWA = (MAX_TWA + TWD) % 360


course = helpers.ingest.ingest_course(xml_path=COURSE,map_course=False)

# initialise results lists
waypoints_sailed = []
leg_mans = []
mark_mans = []
approaches = []

#assume race starts on starboard
starboard = True
stbd_dict = {True:"Starboard",False:"Port"} # used for output formatting


for i in range(len(course)-1):
    waypoints_sailed.append(course[i]["marks"][0]) # improve by adding preferential starting end
    preferred_mark = 0 # TODO - configure a function to add this consideration
    next_gate_bearing = helpers.maths.coords_to_bearing(waypoints_sailed[-1]["lat"],
                                                        waypoints_sailed[-1]["lon"],
                                                        course[i+1]["marks"][preferred_mark]["lat"],
                                                        course[i+1]["marks"][preferred_mark]["lon"])
    
    leg_mans.append(0)
    mark_mans.append(0)

    # Each boundary intersection would require an additional leg manoeuvre, not calculated
    if helpers.pathing.manv_required(twa_min=MIN_TWA,
                                     twa_max=MAX_TWA,
                                     starboard=starboard,
                                     dest_bearing=next_gate_bearing,
                                     TWD=TWD):
        leg_mans[-1] += 1
        starboard = not starboard
    # record what the next mark rounding tack will be
    approaches.append(starboard)
    preferred_mark = 0 # again this is missing

    # There's a decision tree where if a certain rounding direction is required, 
    # an additional manoeuvre may be required in order to round the mark correctly.
    # This is also contingent on whether the next mark is upwind or downwind of 
    # the immediate one. 
    # But if the next mark is the finish line, we don't need to know.
    if i == len(course)-2: continue
    mark_exit_bearing = helpers.maths.coords_to_bearing(course[i+1]["marks"][preferred_mark]["lat"],
                                                        course[i+1]["marks"][preferred_mark]["lon"],
                                                        course[i+2]["marks"][preferred_mark]["lat"],
                                                        course[i+2]["marks"][preferred_mark]["lon"])
    
    # Calculate if next mark is upwind or downwind
    if np.dot(helpers.maths.compass_to_vector(TWD+180),helpers.maths.compass_to_vector(mark_exit_bearing)) > 0:
        bearaway = True
    else:
        bearaway = False
    
    # This logic tree makes sense and I'm sticking to it
    if bearaway:
        if course[i+1]["rounding"] == "Starboard":
            if starboard:
                mark_mans[-1] += 1
                starboard = not starboard
        elif course[i+1]["rounding"] == "Port":
            if not starboard:
                mark_mans[-1] += 1
                starboard = not starboard
    else:
        if course[i+1]["rounding"] == "Starboard":
            if not starboard:
                mark_mans[-1] += 1
                starboard = not starboard
        elif course[i+1]["rounding"] == "Port":
            if starboard:
                mark_mans[-1] += 1
                starboard = not starboard

# unfortunately in the given courses, no mark manoeuvres are needed! But I am pretty sure this code works, having moved the TWD around.
for i in range(len(course)-1):
    # single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"From {course[i]['name']} to {course[i+1]['name']}, {leg_mans[i]} leg manoeuvres, approaching on {stbd_dict[approaches[i]]}, with {mark_mans[i]} mark mans")



Ingested course with 8 points.
From SL1 to M1, 1 leg manoeuvres, approaching on Port, with 1 mark mans
From M1 to LG1, 1 leg manoeuvres, approaching on Port, with 0 mark mans
From LG1 to WG1, 1 leg manoeuvres, approaching on Starboard, with 0 mark mans
From WG1 to LG1, 1 leg manoeuvres, approaching on Port, with 0 mark mans
From LG1 to WG1, 1 leg manoeuvres, approaching on Starboard, with 0 mark mans
From WG1 to LG1, 1 leg manoeuvres, approaching on Port, with 0 mark mans
From LG1 to FL1, 1 leg manoeuvres, approaching on Starboard, with 0 mark mans


## Question 6: Calculate a “tacked” set of variables depending on the tack of the boat, so that sailors don’t need to think about what tack they’re on when looking at measurements. And show the results in a visualisation.


## Question 7: Given a set of tacks (in CSV), train a model to explain the key features of these tacks when optimizing for VMG. Show appropriate visualisations to explain your conclusions.

In [None]:
df_mans = pd.read_csv("Data/2025-01-19_man_summary.csv")
df_tacks = df_mans[df_mans['type'] == 'tack']
# Prepare data
source = ColumnDataSource(df_tacks)


x = df_tacks[['entry_bsp', 'exit_bsp', 'min_bsp', 'bsp_loss', #'entry_twa', 'exit_twa',
    #'orig_entry_twa', 'orig_exit_twa', 'entry_rh', 'exit_rh', 'entry_rh_stability', 
    'max_yaw_rate', 'db_down', 'two_DB_time', 'two_DB_Broadcast', 'flying', 'tws', 
    'pop_time', 'turn_min_rh', 't_invert', 't_to_lock', 'max_lat_gforce', 'max_fwd_gforce', 
    'max_gforce', 'drop_time_P', 'drop_time_S', 'unstow_time_P', 'unstow_time_S', 'stow_time_P', 
    'stow_time_S', 'boards_up_time_S', 'boards_up_time_P', 'press_sys_acc_start', 'press_sys_acc_end', 
    'press_sys_acc_delta', 'press_rake_acc_start', 'press_rake_acc_end', 'press_rake_acc_delta', 'pump_press_avg', 
    'pump_press_max', 'press_wing_acc_start', 'press_wing_acc_end', 'db_ud_ret_press_s_avg', 'db_ud_ret_press_s_max', 
    'db_ud_ret_press_p_avg', 'db_ud_ret_press_p_max', 'db_ud_ext_press_s_avg', 'db_ud_ext_press_s_max', 
    'db_ud_ext_press_p_avg', 'db_ud_ext_press_p_max', 'entry_heel', 'entry_pitch', 'exit_heel', 'exit_pitch', 
    'turning_time', 't_swap', 'bsp_at_drop', 'heel_at_drop', 'pitch_at_drop', 'winward_rh_at_drop', 
    'db_cant_ret_press_p_avg', 'db_cant_ret_press_p_max', 'db_cant_ret_press_s_avg', 'db_cant_ret_press_s_max',
    'db_cant_ext_press_p_avg', 'db_cant_ext_press_p_max', 'db_cant_ext_press_s_avg', 'db_cant_ext_press_s_max',
    'entry_jib_lead', 'exit_jib_lead', 'entry_jib_sheet', 'exit_jib_sheet', 'entry_jib_sheet_pct', 'exit_jib_sheet_pct', 
    'max_rudder_angle', 'vmg_distance', 'b_diff', 'b_diff_1', 'b_diff_2', 'avg_TWD', 'distance', 'dist_2', 'bearing', 
    'bearing_2', 
    #'theoretical_vmg', 'theoretical_target_vmg', 'theoretical_distance', 'theoretical_targ_distance', 
    #'loss_vs_vmg', 'loss_vs_targ_vmg', 'drop_offset', 'drop_to_wind_axis', 'htw_bsp', 'entry_cant', 'exit_cant', 
    #'cant_drop_target', 'cant_stow_target', 
    # 'dashboard', 'entry_tack'
    ]]

y = df_tacks["theoretical_vmg"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")


# Create figure
p = figure(title="Entry BSP vs Theoretical VMG",
           x_axis_label="Entry BSP (kmh)",
           y_axis_label="VMG (knots)",
           tools="pan,box_zoom,reset,hover",
           outer_width=700, outer_height=500)

# Add scatter points
p.scatter(x="entry_bsp", y="theoretical_vmg", source=source, size=8, color="blue", alpha=0.6)

# Add hover tool
hover = p.select(dict(type=HoverTool))
hover.tooltips = [("Entry BSP", "@entry_bsp"), ("theoretical_vmg", "@theoretical_vmg")]
show(p)
# Extract feature importances from the model
importances = model.feature_importances_
features = x.columns

# Create data source
source = ColumnDataSource(data=dict(features=features, importance=importances))

# Create figure
p = figure(y_range=list(features), x_axis_label="Feature Importance", title="Feature Importance",
           outer_width=300, outer_height=1000, tools="pan,box_zoom,reset")

# Add bars
p.hbar(y="features", right="importance", source=source, height=0.5)#, color=Spectral10[:len(features)])

# Show plot
show(p)

Mean Absolute Error: 4.96


## Question 8: Give insights on the racing on what made a team win or underperform in the race.

An obvious place to start is by watching the racing in real time, which enables qualitative insights into the general trends of the racing. For example, in Race 5 (first race of the day), AUS take a flying start which puts them in clear water for the entire remainder of the race. The AUS boat was both the fastest over the line and had the lowest TTK of all the boats at the start time (Figure 1). The same plot is repeated for the remaining races (Figures 2-4), and shows a reasonable correlation between a faster start (both in timing and in boat speed) and a higher finishing place, since the fast F50 boats lose so much potential VMG when forced into unnecessary manoeuvres in traffic.  

Below these plots I have created VMG violin plots for both upwind and downwind sailing, for all boats, for all races. Although these plots could be described as stating the obvious (i.e. the boat that was fastest upwind and downwind won), I think they highlight some key points about the racing that day. Median upwind VMG is the killer here. Downwind, the boats are generally better matched and there is less difference in what the crews are able to extract, but upwind the winners always have a high VMG. 
You can also see where boats have made mistakes, characterised by a wide blob in the lowest quartile of their VMGs.   
**Race 5**  
Both DEN and GER get stuck down at the LG on the first upwind leg, with DEN making a tacking error and tangling both boats in traffic. DEN manage to avoid mistakes in later races, which enables their wins in R6 and R7.
ITA take an early penalty at M1, which is unfortunate considering the general shape of the VMG is quite good. They manage to recover to third which is impressive.  

**Race 6**  
Unlike the first race, DEN make a fantastic LG mark rounding on the first leg, and although they come in behind AUS and ITA, both boats ahead make big mistakes rounding up at LG1, which ITA fail to recover from. AUS come in level with ESP to the final mark but on Port, and giving way costs them 2nd place.  

**Race 7**  
DEN take the win here by staying out of trouble, their extra speed over the line (Figure 3) helps them keep clear air into M1, which enables them to control the racing from there.   

**Race 8**  
ESP get the worst start here, but GBR make a mistake on their first gybe coming downwind, which puts them into third. Neither of the other two boats make mistakes this race, and it comes down to a straight-up VMG drag race. ESP's start puts them definitively behind AUS, but with all the boats sailing so well (GBR's mistake aside), there is no room to make up the distance. N.B. The downwind VMG data has been poorly masked, with all boats having wide tails on the downwind VMG data. With more time, I would filter this out. 


In [None]:
# generic list of teams competing - CAN ignored due to total DNS
teams = ['GBR', #0
             'USA', #1
             'BRA', #2
             'ITA', #3
             'GER', #4
            #  'CAN', #5
             'DEN', #6
             'ESP', #7
             'AUS', #8
             'SUI', #9
             'NZL', #10
             ]
# results list (derived from Wikipedia data)
results = {'GBR' :  [2,4,6,3],
             'USA': [9,8,9],
             'BRA' : [7,5,7],
             'ITA' : [3,9,8],
             'GER' : [6,7,5],
             'DEN' : [8,1,1],
             'ESP' : [5,3,3,2],
             'AUS' : [1,2,'DNF',1],
             'SUI' : [10,10,4],
             'NZL' : [4,6,2]
              }
# race data files
races = ["Data/Race_XMLs/25011905_03-13-55.xml",
         "Data/Race_XMLs/25011906_03-35-13.xml",
         "Data/Race_XMLs/25011907_03-56-34.xml",
         "Data/Race_XMLs/25011908_04-23-32.xml"]

country_colours = {
    'GBR': '#00247D',  # Royal Blue (Union Jack)
    'USA': '#B22234',  # Old Glory Red (Flag)
    'BRA': '#009C3B',  # Green (Flag)
    'ITA': '#008C45',  # Verde (Tricolor Flag)
    'GER': '#000000',  # Black (Flag)
    'DEN': '#C60C30',  # Danish Red (Flag)
    'ESP': '#AA151B',  # Spanish Red (Flag)
    'AUS': '#012169',  # Blue (Australian Flag)
    'SUI': '#D52B1E',  # Swiss Red (Flag)
    'NZL': '#00247D',  # New Zealand Blue (Flag)
}

# use a high level 'data_dict' which contains all of the boat data,
# results data, upwind and downwind filtered data, and results
# start times are decided by the race, but each boat has an individual
# finishing time which will also be stored in the dictionary
data_dict = {}
for t in teams:
    data_dict[t] = {"full_data" : {},
                      "results"   : results[t],
                      "full_upwind"    : None,
                      "full_downwind"  : None,
                      "finishes"  : [],
                      "colour" : country_colours[t]
                      }
    data_dict[t]["full_data"] = helpers.ingest.boat_ingest(t)


race_starttimes = [] # race start times are global variables, not related to any crew
race_teams = [] # race_teams is a useful list of lists, with each race (top level) having a list of all teams who finished that race
results_sources = []
ttk_kmh_dict = {}
ttk_plots = []

for i, race in enumerate(races):
    race_starttimes.append(helpers.ingest.get_start_time(xml_path=race).tz_localize(None))
    race_teams.append([])
    start_idx = {}  # per-team timestamp of the last log row at or before the gun
    for team in teams:
        # catch teams who did not compete in the final
        try:
            results[team][i]
        except IndexError:
            continue
        race_teams[-1].append(team)

        start_idx[team] = data_dict[team]["full_data"].index.asof(race_starttimes[-1])
    # put the plotting data into a CDS for Bokeh to plot
    # in this case, it's a scatter plot of TTK and BS at gun time
    # each team is looked up with its own start index rather than the last team's
    results_sources.append(ColumnDataSource({
                               'startline_ttk':[data_dict[t]["full_data"]["PC_TTK_s"][start_idx[t]] for t in race_teams[-1]],
                               'startline_kmh': [data_dict[t]["full_data"]["BOAT_SPEED_km_h_1"][start_idx[t]] for t in race_teams[-1]],
                               'teams' : race_teams[-1],
                               'results': [results[t][i] for t in race_teams[-1]],
                               'str_results' : [str(results[t][i]) +' '+ t for t in race_teams[-1]],
                               'colour' : [data_dict[t]["colour"] for t in race_teams[-1]]
                               }))
    
    ttk_plots.append(figure(title=f"Figure {i+1} - TTK vs Startline BS - Race {i+5}",
                            x_axis_label="Startline TTK (s)", 
                            y_axis_label="Startline km/h",
                            x_range=(-5, 0.5),
                            y_range=(0,90),
                            height=400,
                            width=800
                            )
                    )
    
    ttk_plots[-1].scatter(x='startline_ttk', y='startline_kmh', source=results_sources[-1], size=15, color="colour", alpha=0.9)
    labels = LabelSet(x='startline_ttk', y='startline_kmh', text='str_results', x_offset=8, y_offset=8, source=results_sources[-1])

    # Create a HoverTool
    hover = HoverTool(
        tooltips=[
            ("Team", "@teams"),
            ("Startline TTK (s)", "@startline_ttk{0,0.0}s"),
            ("Startline (km/h)", "@startline_kmh{0,0.0}km/h"),
            ("Result", "@results"),
        ],
        mode="mouse"  # Ensures it follows the mouse
    )

    # Create a vertical span at x = 0 representing the start line.
    startline = Span(location=0,  # x position
                 dimension='height',  # Vertical line
                 line_color='black', 
                 line_width=3, 
                 line_dash='dashed')  # Make it dashed for better visibility


    ttk_plots[-1].add_layout(startline)
    ttk_plots[-1].add_layout(labels)
    ttk_plots[-1].add_tools(hover)

# Arrange the plots in a 2x2 grid
grid_ttk = gridplot([ttk_plots[:2], ttk_plots[2:]])
# Show the grid of Figures 1-4
show(grid_ttk)

# get upwind and downwind dataset TWA limits
PORT_U_TWA_RANGE = [35,70]
STBD_U_TWA_RANGE = [290,325]

PORT_D_TWA_RANGE = [100,160]
STBD_D_TWA_RANGE = [200,260]


for team in teams:
    # this one line finds the approximate race finish times by 
    # finding the index of the rows where the 'LEG_NUM' data goes down
    # i.e. when it resets only. However - the data doesn't reset at the end so 
    # the last index value is taken. This isn't exact but good enough
    data_dict[team]["finish_times"] = list(data_dict[team]["full_data"].loc[data_dict[team]["full_data"]["TRK_LEG_NUM_unk"].diff() < 0].index)
    data_dict[team]["finish_times"].append(data_dict[team]["full_data"].index[-1])
    TWA_SGP_deg = data_dict[team]["full_data"]["TWA_SGP_deg"] 
    # create masks for when boats have an upwind TWA or downwind TWA - on either tack
    # element-wise OR (|) is required here: Python's `or` raises a ValueError on a pandas Series
    upwind_mask = ((TWA_SGP_deg > PORT_U_TWA_RANGE[0]) & (TWA_SGP_deg < PORT_U_TWA_RANGE[1])) | \
        ((TWA_SGP_deg > STBD_U_TWA_RANGE[0]) & (TWA_SGP_deg < STBD_U_TWA_RANGE[1]))
    downwind_mask = ((TWA_SGP_deg > PORT_D_TWA_RANGE[0]) & (TWA_SGP_deg < PORT_D_TWA_RANGE[1])) | \
        ((TWA_SGP_deg > STBD_D_TWA_RANGE[0]) & (TWA_SGP_deg < STBD_D_TWA_RANGE[1]))
    
    data_dict[team]["upwind"] = data_dict[team]["full_data"].loc[upwind_mask]
    data_dict[team]["downwind"] = data_dict[team]["full_data"].loc[downwind_mask]
    


for i, race in enumerate(race_teams):
    # This code should probably be put in a list + mega dict if I was doing any more than two different graphs. 
    # For now it's readable.
    upwind_vmg_values, downwind_vmg_values = {}, {}
    for team in race:
        upwind_race_mask = (race_starttimes[i] < data_dict[team]["upwind"].index) & (data_dict[team]["upwind"].index < data_dict[team]["finish_times"][i])
        downwind_race_mask = (race_starttimes[i] < data_dict[team]["downwind"].index) & (data_dict[team]["downwind"].index < data_dict[team]["finish_times"][i])
        upwind_vmg_values[team] = data_dict[team]["upwind"]["VMG_km_h_1"].loc[upwind_race_mask]
        downwind_vmg_values[team] = data_dict[team]["downwind"]["VMG_km_h_1"].loc[downwind_race_mask]


    # Convert to list format for HoloViews
    # create a list of the teams in descending race position
    ordered_teams = [0] * len(race) 

    for team in race:
        if not isinstance(results[team][i], int): # handle DNF
            del ordered_teams[-1]
            continue
        ordered_teams[data_dict[team]["results"][i]-1] = team
    # format for violins
    upwind_vmg_data = [(team, value) for team in ordered_teams for value in upwind_vmg_values[team]]
    downwind_vmg_data = [(team, -value) for team in ordered_teams for value in downwind_vmg_values[team]]

    # Create violin plot
    downwind_vmg_violin = hv.Violin(downwind_vmg_data, ['Team'], 'Value')
    downwind_vmg_violin.opts(opts.Violin(inner='quartiles',
                            violin_color='Team',
                            cmap=country_colours,
                            ylabel="VMG (km/h)",
                            width=max(800,len(race) * 60),
                            height=400,
                            violin_fill_alpha=0.8,
                            ylim=(0,100)))
    downwind_vmg_violin.opts(title=f"Figure {2*i + 6} - Race {i+5} - Downwind VMG")
    
    upwind_vmg_violin = hv.Violin(upwind_vmg_data, ['Team'], 'Value')
    upwind_vmg_violin.opts(opts.Violin(inner='quartiles',
                            violin_color='Team',
                            cmap=country_colours,
                            ylabel="VMG (km/h)",
                            width=max(800,len(race) * 60),
                            height=400,
                            violin_fill_alpha=0.8,
                            ylim=(0,None)))
    upwind_vmg_violin.opts(title=f"Figure {2*i+5} - Race {i+5} - Upwind VMG")

    layout = upwind_vmg_violin + downwind_vmg_violin  # This automatically creates a side-by-side layout

    layout.opts(shared_axes=False)  # Prevents y-axis scaling issues

    show(hv.render(layout))
