# Predicting athlete results in snowboarding

In this example we will try an approach to predict an athletes' performance based on past results of the athlete and on information about the current event

First we need to import some tools that will help us fetch the data from the WST website. If we have a database with the necessary data, we can use the data from there but for this example, we will just scrape everything that is necessary

In [111]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from datetime import datetime

Next, we define the athlete we want to predict the result for. We will use **Clemens Millauer** as an example

In [112]:
current_athlete = "http://www.worldsnowboarding.org/riders/clemens-millauer/"

To make things easier, we will take a past event and try to see how our predictions will differ from the actual result that Clemens achieved. This will be our "current event"

In [113]:
current_event = "http://www.worldsnowboarding.org/events/fis-world-cup-2020-8"

## Getting information about the current event

As a first step, we will get some data about the current event. The most basic information we need is the discipline and the date. Then also, factors such as the number of competitors and their standing in the world points list could be important factors for our prediction

Using the link to the current event, we fetch the HTML from the event page

In [114]:
res = requests.get(current_event)
soup = BeautifulSoup(res.content, 'html.parser')

### Basic Event Details

In the code below, we scrape the HTML to get some basic information about the event

In [115]:
header = soup.find(class_="detailed-header")
ce_name = header.find(class_="event-label").get_text().strip()

details = header.find(class_="icon-group")
ce_disciplines = details.find_all(class_="icon-discipline-large")
ce_disciplines = [d.get_text().strip() for d in ce_disciplines]
ce_genders = details.find(class_="icon-type-large").get_text().strip()
ce_genders = [event_genders[i:i+1] for i in range(0, len(event_genders), 1)]

ce_start_date = None
ce_info = header.find(class_="plain-list")
items = ce_info.find_all("li")

for item in items:
    item_text = item.get_text()
    if "Date:" in item_text:
        date = item_text.strip().replace("Date: ", "")
        if " - " in date:
            date_range = date.split(" - ")
            ce_start_date = date_range[0]
            ce_start_date = datetime.strptime(ce_start_date, "%d.%m.%y")
        else:
            ce_start_date = datetime.strptime(date, "%d.%m.%y")

That gives us the following information:

In [116]:
print(f"Event Name: {ce_name}")
print(f"Event Disciplines: {ce_disciplines}")
print(f"Event Competitions: {ce_genders}" )
print(f"Event Start Date: {ce_start_date}")

Event Name: FIS World Cup
Event Disciplines: ['HP', 'SS']
Event Competitions: [['M'], ['W']]
Event Start Date: 13.02.20


As we can see, the event we chose for this example has multiple disciplines for both Men and Women. As we are going with the example of Clemens, we will only consider the Mens Slopestyle event for this example. If you visit the link yourself, you can see that this competition is already selected by default so we don't need to do any extra work here

### Number of competitors

In order to make our prediction, we need the number of competitors for the current event.

In [117]:
ranking_table = soup.find("table", class_="rank-results")
ranks = ranking_table.find_all(class_="rank")
ce_number_competitors = len(ranks)
print(ce_number_competitors)

51


### Average WSPL points

Also we will consider the average number of WSPL points that the athlete field has. For this we need to get every competitors current WSPL points from his profile and calculate the average

In [118]:
points_sum = 0

for result in ranks:
    cells = result.find_all("td")
    athlete_profile = cells[1].find("a")['href']
    res = requests.get("http://worldsnowboarding.org" + athlete_profile)
    profile = BeautifulSoup(res.content, 'html.parser')
    ss_details = profile.find(id="result-table-points-list-ss").find_all("li")

    for i in ss_details:
        if "Current Points" in i.get_text().strip():
            profile_ss_points = float(i.find("strong").get_text())
            points_sum += profile_ss_points

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


In [121]:
ce_points_average = round((points_sum / ce_number_competitors),2)
print(ce_points_average)

405.62


### Current Event Summary

In [122]:
print(f"Event Name: {ce_name}")
print(f"Event Disciplines: {ce_disciplines}")
print(f"Event Competitions: {ce_genders}" )
print(f"Event Start Date: {ce_start_date}")
print(f"Number of Competitors: {ce_number_competitors}")
print(f"Average WSPL Points: {ce_points_average}")

Event Name: FIS World Cup
Event Disciplines: ['HP', 'SS']
Event Competitions: [['M'], ['W']]
Event Start Date: 13.02.20
Number of Competitors: 51
Average WSPL Points: 405.62


## Getting Information about the athlete

Next up, we need some past data about the athlete so we can then make a prediction. Primarily, we want details about the past events that he participated in and about his performance in those

### Collect past events

Below, we first define a helper function which will return the number of months lying between today and a given date. We will use it to determine how long ago an event happened. This is important, as we will not consider results that are older than 4 years (48 months). Later we will use it again to give more recent results more weight in the calculation.

In [52]:
def diff_month(date):
    now = datetime.now()
    return (now.year - date.year) * 12 + now.month - date.month

First, we need a list of all of the events of the current events' discipline that our athlete has previously particpated in. To get that, we will visit his profile link and collect all of the URLs and some other basic information about the relevant events

In [68]:
res = requests.get(current_athlete)
soup = BeautifulSoup(res.content, 'html.parser')
results_table = soup.find(id="result-table-all-results-all-results").find(class_="rank-results")
results = results_table.find_all(class_="rank")

rider_results = []
for result in results:
    cells = result.find_all("td")
    result_discipline = cells[-1].find(class_="icon-discipline-medium").get_text().strip()
    if result_discipline == "SS":
        event_date = cells[0].get_text().strip()
        event_date = datetime.strptime(event_date, "%d.%m.%y")
        
        if diff_month(event_date) < 48 and event_date < ce_start_date:
            rank = cells[1].get_text().strip().replace("st", "").replace("nd", "").replace("rd", "").replace("th", "")
            rank = int(rank)
            event_name = cells[3].find("a").get_text().strip()
            event_link = cells[3].find("a")['href']
            event_link = "http://worldsnowboarding.org" + event_link
            
            # Make sure we don't consider the event we want to predict
            if event_link != current_event:
                rider_results.append({
                    "event_name": event_name,
                    "event_date": event_date.date(),
                    "event_link": event_link,
                    "rider_rank": rank
                })

The output of the above code is a list of the events where each object contains information about the events that the athlete participated in: Event Name, Event Date, Event Link & the athletes result in that event.

As a next step, we want to complement these objects with additional information about the event. Specifically, we want to know how many athletes participated in each event and again what their average WSPL points were.

Once we have this information, we can calculate the **average percentile finish** of our current athlete across all of the events we have just looked at as an indicator for future performance

### Analyze Past Events

Lets quickly have a look at how many results we are working with now. The following number represents the amount of Slopestyle events that Clemens participated in within the past 4 years.

In [123]:
len(rider_results)

17

Below we create an object to store the points for athletes whose profile we already visited because otherwise we need to go to the same profiles over and over again which will take a long time to run

In [55]:
points_cache = {}

The following code iterates over the list of events that we just created, filters for events not older than 4 years before the current event and adds information about number of competitors and their WSPL points to each

In [70]:
for event in rider_results:
    res = requests.get(event['event_link'])
    soup = BeautifulSoup(res.content, 'html.parser')
    ranking_table = soup.find("table", class_="rank-results")
    ranks = ranking_table.find_all(class_="rank")
    number_competitors = len(ranks)

    points_sum = 0
    missed_riders = 0
    for result in ranks:
        cells = result.find_all("td")
        athlete_profile = cells[1].find("a")['href']

        if athlete_profile in points_cache.keys():
            points_sum += points_cache[athlete_profile]
        else:
            res = requests.get("http://worldsnowboarding.org" + athlete_profile)
            profile = BeautifulSoup(res.content, 'html.parser')

            try:
                ss_details = profile.find(id="result-table-points-list-ss").find_all("li")
            except AttributeError:
                ss_details = None
                missed_riders += 1

            if ss_details:
                for i in ss_details:
                    if "Current Points" in i.get_text().strip():
                        profile_ss_points = float(i.find("strong").get_text())
                        points_sum += profile_ss_points
                        points_cache[athlete_profile] = profile_ss_points

    counted_riders = number_competitors - missed_riders
    points_average = round((points_sum / counted_riders),2)

    event['event_competitors'] = number_competitors
    event['points_average'] = points_average

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


### Analyze past athlete performance

Now that we have the data about the athletes activity in the past 4 years, we can go ahead and analyze it.

As mentioned before, we will first calculate the percentile in which the athlete finished for each event and add it to the information in our athlete results collection

In [72]:
for event in rider_results:
    event['rank_percentile'] = event['rider_rank'] / event['event_competitors']

### Data Overview

Now lets actually have a look at what our data collection currently looks like

In [144]:
df = pd.DataFrame(rider_results)
df = df[df['event_link'] != current_event]
df.drop(['event_link'], axis=1)

Unnamed: 0,event_name,event_date,rider_rank,event_competitors,points_average,rank_percentile
0,FIS World Cup,2020-01-22,18,49,338.36,0.367347
1,LAAX Open - FIS World Cup,2020-01-13,23,58,398.55,0.396552
2,Toyota U.S. Grand Prix,2019-03-03,41,58,385.43,0.706897
3,Burton US Open Snowboarding Championships,2019-03-02,13,29,528.3,0.448276
4,FIS World Snowboard Championships,2019-01-31,13,58,362.89,0.224138
5,FIS World Cup - Day 1,2019-01-11,48,50,303.95,0.96
6,Spring Battle,2018-03-17,1,31,400.68,0.032258
7,FIS World Cup,2018-03-15,9,55,303.48,0.163636
8,Burton US Open Snowboarding Championships,2018-03-05,26,32,490.45,0.8125
9,XXIII Olympic Winter Games 2018,2018-02-11,13,35,401.17,0.371429


### Calculate Averages

Now we could simply calculate an average percentile and translate that to our current event. 

However, it would make more sense to calculate a weighted average by giving more recent results more weight than those further in the past.

To do this, we can assume the following weights based on the time passed since the event

**X = percentile**

**W = weighting multiplier**

**T = time passed since event in months**

T <= 6:		    W = 1

6 < T <= 12:     W = 0.8

12 < T <= 24: 	W = 0.5

24 < T <= 36:   W = 0.3

T > 36:          W = 0.1


We can now go ahead and add a column with the time weighted average

In [145]:
for event in rider_results:
    months_passed = diff_month(event['event_date'])
    if months_passed <= 6:
        event['time_multiplier'] = 1
    elif months_passed <= 12:
        event['time_multiplier'] = 0.8
    elif months_passed <= 24:
        event['time_multiplier'] = 0.5
    elif months_passed <= 36:
        event['time_multiplier'] = 0.3
    else:
        event['time_multiplier'] = 0.1

Now we can iterate over the list once again and calculate the weighted average of the placement percentile based on [this explanation of the calculation](https://www.indeed.com/career-advice/career-development/how-to-calculate-weighted-average#3)

In [151]:
weight_sum = 0
weighted_percentile_sum = 0

for event in rider_results:
    weighted_percentile_sum += (event['rank_percentile'] * event['time_multiplier'])
    weight_sum += event['time_multiplier']

time_weighted_average = weighted_percentile_sum / weight_sum
print(time_weighted_average)

0.46478360532397217


The result of this calculation shows that Clemens' weighted average finish over the last 4 years is in the **46th percentile** of the competitors field.

## Prediction 1 - Time Weighted Average of past results

Given the weighted average result percentile, we can multiply the number of competitors of our current event with this number to get a first prediction of Clemens' result:

In [147]:
round(ce_number_competitors*time_weighted_average)

24

According to this calculation, Clemens should have finished in **24th place**. This is quite a bad prediction, as Clemens actually finished 6th in our current event.

Most likely, Clemens outperformed his previous results but we can still try to improve our algorithm to come closer to the actual result.

In the next step, we will try to also consider the level of the events that he particpated in.

## Factor in Event Level

We already have information about the average WSPL score of the competing athletes for both our current event and the athletes' past event. We can try to adapt our calculation to give more weight to the past events that have a similar level as our current event.

To do this, we can identify past events of our athlete that are within a certain threshold of the current events' average WSPL score. Lets say + / - 10%

Let's give the results within this threshold a weight of 1.0 and the remaining ones 0.5

In [148]:
for event in rider_results:
    average_wspl = event['points_average']
    deviation = abs((average_wspl / ce_points_average) * 100 - 100)
    if deviation <= 10:
        event['level_multiplier'] = 1
    else:
        event['level_multiplier'] = 0.5

Lets have another look at what our data looks at this point

In [149]:
df = pd.DataFrame(rider_results)
df.drop(['event_link'], axis=1)

Unnamed: 0,event_name,event_date,rider_rank,event_competitors,points_average,rank_percentile,time_multiplier,level_multiplier
0,FIS World Cup,2020-01-22,18,49,338.36,0.367347,0.8,0.5
1,LAAX Open - FIS World Cup,2020-01-13,23,58,398.55,0.396552,0.8,1.0
2,Toyota U.S. Grand Prix,2019-03-03,41,58,385.43,0.706897,0.5,1.0
3,Burton US Open Snowboarding Championships,2019-03-02,13,29,528.3,0.448276,0.5,0.5
4,FIS World Snowboard Championships,2019-01-31,13,58,362.89,0.224138,0.5,0.5
5,FIS World Cup - Day 1,2019-01-11,48,50,303.95,0.96,0.5,0.5
6,Spring Battle,2018-03-17,1,31,400.68,0.032258,0.3,1.0
7,FIS World Cup,2018-03-15,9,55,303.48,0.163636,0.3,0.5
8,Burton US Open Snowboarding Championships,2018-03-05,26,32,490.45,0.8125,0.3,0.5
9,XXIII Olympic Winter Games 2018,2018-02-11,13,35,401.17,0.371429,0.3,1.0


Now we can apply the same calculation as before with the level multiplier to see what the result looks like

In [152]:
weight_sum = 0
weighted_percentile_sum = 0

for event in rider_results:
    weighted_percentile_sum += (event['rank_percentile'] * event['level_multiplier'])
    weight_sum += event['level_multiplier']

level_weighted_average = weighted_percentile_sum / weight_sum
print(level_weighted_average)

0.4694671549082217


It seems like this change did not have a big effect, there is only a minimal change between the two weighted averages:

In [154]:
abs(time_weighted_average - level_weighted_average)

0.004683549584249547

## Prediction 2 - Level weighted average of past results

In [155]:
round(level_weighted_average * ce_number_competitors)

24

As expected, there is no change in the result - based on the second prediction, Clemens would still finish in **24th place**.

## Combination of both weights

At last, we can try to see what happens if we consider both weights in our calculation. For this, we will simply add the values of time_average and level_average for each event and perform the calculation like that.

In [156]:
weight_sum = 0
weighted_percentile_sum = 0

for event in rider_results:
    combined_multiplier = event['level_multiplier'] + event['time_multiplier']
    weighted_percentile_sum += (event['rank_percentile'] * combined_multiplier)
    weight_sum += combined_multiplier

combined_weighted_average = weighted_percentile_sum / weight_sum
print(combined_weighted_average)

0.46790597171347165


Again, we can see that the change of our multiplier is very minimal. Still, we can use it again to predict Clemens' result

## Prediction 3 - Combined weighted average of level & time

In [157]:
round(combined_weighted_average * ce_number_competitors)

24

The prediction still remains the same, with Clemens' supposedly finishing in **24th place**.

## Additional Tests

As the original test was not too successful, a few more cases have been tested for predictions:

**Mark McMorris at XGames Norway Slopestyle 2020**

Prediction 1: 2nd place

Prediction 2: 2nd place

Prediction 3: 2nd place

Actual Result: 2nd place

**Stale Sandbech at US Open Slopestyle 2020**

Prediction 1: 9th place

Prediction 2: 9th place

Prediction 3: 9th place

Actual Result: 7th place

**Jonas Boesiger at FIS Slopestyle World Cup Calgary 2020**

Prediction 1: 12th place

Prediction 2: 18th place

Prediction 3: 16th place

Actual Result: 15th place

**Ryan Stassel at FIS Slopestyle World Cup China 2020**

Prediction 1: 17th place

Prediction 2: 20th place

Prediction 3: 19th place

Actual Result: 13th place

As we can see, the results of the additional tests have been a lot more accurate than the example with Clemens - probably because Clemens' result was an outlier.

Based on these tests, it is not entirely clear, which one of the predictions would be the most accurate. However, prediction 3 takes the most factors into account and could thus be seen as more promising than the others.

## Further ideas

Needless to say, the calculations we used to not consider a whole lot of factors. For further development of the algorithm, these are some of the factors that could be considered for more accurate predictions:

- Wheather situation at current and past events
- Competition breaks by the athlete because of injury or similar
- Discipline focus of the athlete
- Who is and has been judging the athlete