# **Quick Overview of the 24/25 Ski Jumping Season Dataset**
### Let's begin by importing pandas and initializing our DataFrames

In [3]:
import pandas as pd
df_events = pd.read_csv("data/events.csv",delimiter=",")
df_results = pd.read_csv("data/results.csv",delimiter=",")

In [43]:
df_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Event_id  30 non-null     int64 
 1   Date      30 non-null     object
 2   Country   30 non-null     object
 3   City      30 non-null     object
 4   HS_Point  30 non-null     int64 
 5   K_Point   30 non-null     int64 
dtypes: int64(3), object(3)
memory usage: 1.5+ KB


In [5]:
df_results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1130 entries, 0 to 1129
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Place     1130 non-null   object 
 1   Jumper    1130 non-null   object 
 2   Country   1130 non-null   object 
 3   Jump1     1130 non-null   float64
 4   Jump2     657 non-null    float64
 5   Points    1130 non-null   float64
 6   Event_id  1130 non-null   int64  
dtypes: float64(3), int64(1), object(3)
memory usage: 61.9+ KB


In [6]:
# A disqualified jumper (DSQ) has no numeric place; we treat him as 50th place
df_results["Place"] = pd.to_numeric(df_results["Place"], errors="coerce")
df_results["Place"] = df_results["Place"].fillna(50).astype(int)
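
The coercion step above can be checked on a toy Series. This is only a sketch: the values below are made up, and the extra `was_non_numeric` flag is a hypothetical addition that is not part of the notebook. It shows one way to keep track of which places were filled in.

```python
import pandas as pd

# Hypothetical raw "Place" values; "DSQ" stands in for a non-numeric entry.
raw_place = pd.Series(["1", "2", "DSQ", "17"])

# errors="coerce" turns every value that cannot be parsed into NaN.
numeric_place = pd.to_numeric(raw_place, errors="coerce")

# Remember which rows were non-numeric before they are overwritten with 50.
was_non_numeric = numeric_place.isna()

place = numeric_place.fillna(50).astype(int)
print(place.tolist())
print(was_non_numeric.tolist())
```

Keeping such a flag makes it possible to exclude disqualifications later, since a filled-in 50 otherwise counts toward every average of "Place".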


**Both datasets look clean: the only column with missing values is "Jump2" in the results dataset.
"Jump2" is NaN when a jumper did not qualify for the second round (finished outside the top 30 after "Jump1"), or when the second round was canceled or not held.
Let's take a look at our datasets.**
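
The two reasons for a missing "Jump2" can be told apart by counting values per event. Below is a minimal sketch on an invented toy frame that reuses the column names of `df_results`; the numbers are not taken from the dataset.

```python
import pandas as pd

# Toy frame with the same column names as df_results; values are invented.
toy = pd.DataFrame({
    "Event_id": [1, 1, 1, 2, 2],
    "Jump1": [131.5, 120.0, 118.5, 140.0, 139.5],
    "Jump2": [138.5, None, None, None, None],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Events in which nobody has a Jump2 value, i.e. no second round was held.
jump2_counts = toy.groupby("Event_id")["Jump2"].count()
no_second_round = jump2_counts[jump2_counts == 0].index.tolist()

print(missing_per_column["Jump2"])
print(no_second_round)
```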

In [10]:
df_events.head()
# Lists every ski jumping event: its unique id, date, location, and hill parameters (HS_Point, K_Point)

Unnamed: 0,Event_id,Date,Country,City,HS_Point,K_Point
0,1,23-11-2024,Norwegia,Lillehammer,140,123
1,2,24-11-2024,Norwegia,Lillehammer,140,123
2,3,30-11-2024,Finlandia,Ruka,142,120
3,4,01-12-2024,Finlandia,Ruka,142,120
4,5,07-12-2024,Polska,Wisła,134,120
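
`df_events.info()` reports "Date" as `object`, so the dates are plain strings in day-month-year order. If real dates are needed later, they could be parsed as sketched below. The two sample strings are copied from the preview above; applying the same call to the whole column is an assumption about the rest of the file.

```python
import pandas as pd

# Two dates copied from the preview, stored as day-month-year strings.
dates = pd.Series(["23-11-2024", "01-12-2024"])

# An explicit format avoids day/month ambiguity (01-12 is 1 December, not 12 January).
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```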


In [11]:
df_results.head()
# Final results of every event held up to 25-02-2025. Jump1 and Jump2 are the jump lengths,
# and Points is the total score for both jumps (including wind/gate points)

Unnamed: 0,Place,Jumper,Country,Jump1,Jump2,Points,Event_id
0,1,PASCHKE Pius,Niemcy,131.5,138.5,317.1,1
1,2,TSCHOFENIG Daniel,Austria,132.5,132.5,309.2,1
2,3,ORTNER Maximilian,Austria,132.0,131.5,307.1,1
3,4,KRAFT Stefan,Austria,133.0,130.0,306.0,1
4,5,HOERL Jan,Austria,128.5,130.0,300.9,1
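
The two tables share the "Event_id" column, so hill information can be attached to each result with a merge. The sketch below uses a few rows copied from the previews above; the derived column `Jump1_minus_K` is a hypothetical name used only for illustration.

```python
import pandas as pd

# Rows copied from the two previews above (subset of columns).
events = pd.DataFrame({
    "Event_id": [1, 3],
    "City": ["Lillehammer", "Ruka"],
    "HS_Point": [140, 142],
    "K_Point": [123, 120],
})
results = pd.DataFrame({
    "Jumper": ["PASCHKE Pius", "TSCHOFENIG Daniel"],
    "Jump1": [131.5, 132.5],
    "Event_id": [1, 1],
})

# Left join keeps every result row; validate raises if Event_id repeats in events.
merged = results.merge(events, on="Event_id", how="left", validate="many_to_one")

# Hypothetical derived column: first jump length relative to the K-point.
merged["Jump1_minus_K"] = merged["Jump1"] - merged["K_Point"]
print(merged[["Jumper", "City", "Jump1_minus_K"]])
```

Both full tables contain a "Country" column (the jumper's country in one, the host country in the other), so a merge of the complete DataFrames would need the `suffixes` argument or a rename to keep them apart.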


## Winners and podium finishers

Winner = 1st place 

Podium = 1st, 2nd or 3rd place

In [20]:
dfx = (
    df_results.groupby(by="Jumper")
    .agg(
        winner=("Place", lambda x: (x == 1).sum()),
        podium=("Place", lambda x: x.isin([1, 2, 3]).sum()),
    )
    .sort_values(by=["winner", "podium"], ascending=[False, False])
    .loc[lambda x: (x["winner"] > 0) | (x["podium"] > 0)]
)
dfx

Jumper,winner,podium
TSCHOFENIG Daniel,8,15
PASCHKE Pius,5,7
HOERL Jan,2,13
KRAFT Stefan,2,7
KOBAYASHI Ryoyu,2,2
FORFANG Johann Andre,1,6
PREVC Domen,1,3
WELLINGER Andreas,1,1
ZAJC Timi,1,1
DESCHWANDEN Gregor,0,4


## Top 5 average places 

In [35]:
dfx = (df_results[["Jumper", "Place"]].groupby(by="Jumper").mean()
       .rename(columns={"Place": "Avg Place"})
       .sort_values(by="Avg Place", ascending=True).head(5))
dfx

Jumper,Avg Place
TSCHOFENIG Daniel,3.217391
HOERL Jan,4.347826
KRAFT Stefan,7.347826
DESCHWANDEN Gregor,9.304348
FORFANG Johann Andre,10.521739
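
A plain mean does not show how many starts it is based on, and a jumper with a single good result would rank above everyone. One possible refinement, sketched on invented data, reports the number of starts next to the mean and applies a minimum. The threshold `MIN_STARTS` and the column names `avg_place` and `starts` are assumptions, not something the notebook defines.

```python
import pandas as pd

# Invented results for three hypothetical jumpers.
toy = pd.DataFrame({
    "Jumper": ["A", "A", "A", "B", "C", "C"],
    "Place": [2, 4, 6, 1, 10, 20],
})

MIN_STARTS = 2  # hypothetical threshold, chosen only for this example

# Named aggregation: mean place and number of starts per jumper.
stats = toy.groupby("Jumper")["Place"].agg(avg_place="mean", starts="count")

# Drop jumpers with too few starts, then rank by average place.
ranked = stats[stats["starts"] >= MIN_STARTS].sort_values("avg_place")
print(ranked)
```

Jumper "B" has the best mean but only one start, so it is filtered out here.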


## Average places of Polish jumpers

In [38]:
dfx = (df_results.loc[df_results["Country"] == "Polska", ["Jumper", "Place"]]
       .groupby(by="Jumper").mean()
       .rename(columns={"Place": "Avg Place"})
       .sort_values(by="Avg Place"))
dfx

Jumper,Avg Place
WĄSEK Paweł,16.565217
ZNISZCZOŁ Aleksander,25.238095
STOCH Kamil,25.714286
KUBACKI Dawid,27.125
WOLNY Jakub,29.529412
ŻYŁA Piotr,31.571429
KOT Maciej,35.166667
JUROSZEK Kacper,41.5


## Average jump length by country

In [39]:
dfx = (df_results[["Country", "Jump1", "Jump2"]]
       .groupby(by="Country").mean()
       .sort_values(by="Jump1", ascending=False))
dfx

Country,Jump1,Jump2
Austria,135.898734,138.908088
Ukraina,133.0,156.416667
Bułgaria,132.526316,134.038462
Norwegia,131.887218,136.03271
Niemcy,130.166667,136.404494
Francja,129.104167,131.346154
Polska,127.619469,134.070423
Japonia,127.123932,137.516949
Szwajcaria,125.727941,135.0
Słowenia,125.331683,138.316667


The numbers in this table are correct, but they should not be read as a ranking of national teams.
Generally, countries that are strong according to [this data](https://www.skijumping.pl/wyniki/klasyfikacja/ps/puchar_narodow/2024-2025), such as Austria, are at the top, and weaker ones, such as China or Kazakhstan, are at the bottom.
However, there are some surprises: Ukraine and Bulgaria rank near the top, while the USA, despite having good jumpers, ranks lower.
The reason is that Ukraine and Bulgaria are represented only by their 1-2 best jumpers, whereas the USA fields several solid or average jumpers along with some weaker ones who drag the country's average down.
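
The effect can be reproduced on a toy example. Countries "X" and "Y" and all jump lengths below are invented: "X" sends a single good jumper, while "Y" has the best jumper overall plus two weaker ones.

```python
import pandas as pd

# Invented data: X has one jumper, Y has three of mixed strength.
toy = pd.DataFrame({
    "Country": ["X", "X", "Y", "Y", "Y", "Y"],
    "Jumper": ["x1", "x1", "y1", "y1", "y2", "y3"],
    "Jump1": [133.0, 133.0, 135.0, 135.0, 118.0, 112.0],
})

# Mean over all jumps of a country, as in the table above.
by_country = toy.groupby("Country")["Jump1"].mean()

# Mean of the single best jumper of each country.
per_jumper = toy.groupby(["Country", "Jumper"])["Jump1"].mean()
best_per_country = per_jumper.groupby("Country").max()

print(by_country)
print(best_per_country)
```

The country-wide mean puts "X" first (133.0 against 125.0), although the best jumper of "Y" jumps farther (135.0 against 133.0).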

In [60]:
dfx = df_results.groupby(by="Country")
dfx = dfx["Jumper"].nunique().sort_values(ascending=False)
dfx

Country
Austria       11
Niemcy        10
Japonia       10
Norwegia       9
Słowenia       8
Polska         8
Finlandia      7
USA            7
Szwajcaria     6
Chiny          2
Ukraina        2
Turcja         2
Kazachstan     2
Francja        2
Włochy         2
Bułgaria       1
Czechy         1
Estonia        1
Name: Jumper, dtype: int64

As we can see, the number of distinct jumpers differs a lot between countries. Let's repeat the previous calculation, but this time using only the top 4 jumpers from each country and skipping countries with only 1-2 jumpers. First, let's find out who the top 4 jumpers are in every country.
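
One possible way to pick those jumpers is sketched below. It is not the notebook's own solution: it assumes that "top" means the lowest average "Place", and the function name `top_jumpers_per_country` and its parameters are hypothetical. The toy data is invented.

```python
import pandas as pd

def top_jumpers_per_country(results, top_n=4, min_jumpers=3):
    """Return the top_n jumpers per country, ranked by mean Place (assumption)."""
    avg = (
        results.groupby(["Country", "Jumper"], as_index=False)["Place"]
        .mean()
        .rename(columns={"Place": "Avg Place"})
    )
    # Skip countries with only 1-2 jumpers (fewer than min_jumpers).
    n_jumpers = avg.groupby("Country")["Jumper"].transform("nunique")
    avg = avg[n_jumpers >= min_jumpers]
    # Best (lowest) average place first, then keep the first top_n rows per country.
    return (
        avg.sort_values(["Country", "Avg Place"])
        .groupby("Country")
        .head(top_n)
        .reset_index(drop=True)
    )

# Invented data: country P has five jumpers, country Q only two.
toy = pd.DataFrame({
    "Country": ["P"] * 5 + ["Q"] * 2,
    "Jumper": ["a", "b", "c", "d", "e", "q1", "q2"],
    "Place": [5, 1, 3, 2, 4, 1, 2],
})
top = top_jumpers_per_country(toy)
print(top)
```

The result can then be used to filter the results table, for example with `isin` on the "Jumper" column, before the country averages are recomputed.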

In [None]:
tbc.