Forecast:

1st place - Qinwen Zheng \
2nd place - Aryna Sabalenka \
3rd/4th place - Cori Gauff, Linda Noskova \
\
Method used:
The project uses the AutoML method from the H2O library to predict the outcomes of tennis matches. AutoML automatically explores a range of machine learning algorithms and tunes their hyperparameters, which makes it practical to build classification models without manual model selection. The match data were preprocessed, and feature engineering was then applied using various statistical parameters of the players. The candidate models were evaluated and compared on the H2O AutoML leaderboard, which allowed the best model for predicting results to be selected.

Steps:
Data analysis and modelling began with preparing a set of women's tennis matches from 2021-2023, restricted to Grand Slam tournaments played on hard courts. New features were created, such as the ratio of service points won and return-related point counts.

Rolling averages were computed for each player from the matches played before each tournament. Player order within each match was randomised so that player 1 wins roughly half of the matches. The differences between the two players' features were computed and used as the model inputs. The data were split into a training set and a validation set.

H2O AutoML, which trains classification models automatically, was used for modelling. Results of the Australian Open 2024 were then predicted for a selected group of players by analysing each player's mean win probability.


In [1]:
import pandas as pd
import numpy as np
import h2o
from h2o.automl import H2OAutoML
import itertools

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
df_matches_2023 = pd.read_csv("data/wta_matches_2023.csv")
df_matches_2022 = pd.read_csv("data/wta_matches_2022.csv")
df_matches_2021 = pd.read_csv("data/wta_matches_2021.csv")

In [4]:
# Combine the data from the last three years and drop duplicates
df_matches = pd.concat([df_matches_2023, df_matches_2022, df_matches_2021]).drop_duplicates().reset_index(drop=True)

### Preprocessing

In [5]:
# Keep Grand Slam tournaments only
grand_slam = ["Australian Open", "Wimbledon", "Roland Garros", "Us Open"]
df_matches = df_matches[df_matches['tourney_name'].isin(grand_slam)]

In [6]:
# Keep hard-court matches only
#df_matches.groupby('surface').count().head(20)
df_matches = df_matches[df_matches['surface'] == "Hard"]

In [7]:
# Rename the w_/l_ column prefixes to winner_/loser_
df_matches = df_matches.rename(columns=lambda x: x.replace('w_', 'winner_') if x.startswith('w_') else x)
df_matches = df_matches.rename(columns=lambda x: x.replace('l_', 'loser_') if x.startswith('l_') else x)

In [8]:
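# Keep an independent copy so that columns added later do not write into a slice of df_matches
df_matches_base = df_matches.copy()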

In [9]:
# Drop unneeded columns
drop_cols = ["winner_name","loser_name","winner_seed","winner_entry","loser_seed","loser_entry"] # not sure whether "winner_seed","winner_entry","loser_seed","loser_entry" are needed
df_matches = df_matches.drop(drop_cols, axis=1)

In [10]:
# Collect the stat column names for the winner and the loser
winner_cols = [col for col in df_matches.columns if col.startswith('winner')]
loser_cols = [col for col in df_matches.columns if col.startswith('loser')]

# Data frames for winners and losers
common_cols = ['tourney_name', 'tourney_date', 'best_of', 'surface', 'minutes', 'draw_size', 'tourney_level', 'round', 'score', 'tourney_id', 'match_num']

# .copy() avoids pandas' SettingWithCopyWarning when the "won" column is added below
df_winner = df_matches[winner_cols + common_cols].copy()
df_loser = df_matches[loser_cols + common_cols].copy()

# Add a column indicating whether the player won (1) or lost (0)
df_winner["won"] = 1
df_loser["won"] = 0


In [11]:
# Rename the winner and loser columns to a common player_ prefix
new_column_names = [col.replace('winner','player') for col in winner_cols]
df_winner.columns = new_column_names + common_cols + ['won']

df_loser.columns  = df_winner.columns

# Stack the winner and loser frames into one frame with one row per player per match
df_whole = pd.concat([df_winner, df_loser])

### Feature Engineering

In [12]:
# Ratio of service points won
df_whole['player_serve_win_ratio'] = (df_whole['player_1stWon'] + df_whole['player_2ndWon']) / df_whole['player_svpt']

# Service points lost by the player (i.e. return points won by the opponent);
# the column name player_return_won is kept because later cells rely on it
df_whole['player_return_won'] = df_whole['player_svpt'] - (df_whole['player_1stWon'] + df_whole['player_2ndWon'])

# Share of service points lost (equals 1 - player_serve_win_ratio)
df_whole['player_return_win_ratio'] = df_whole['player_return_won'] / df_whole['player_svpt']

# Break points saved / break points faced (a save rate, despite the column name)
df_whole['player_bpConversionRate'] = df_whole['player_bpSaved'] / df_whole['player_bpFaced']

# Share of first-serve points won (1stWon / 1stIn)
df_whole['player_firstServeSuccessRate'] = df_whole['player_1stWon'] / df_whole['player_1stIn']

# Average aces per service game
df_whole['player_avgAcesPerGame'] = df_whole['player_ace'] / df_whole['player_SvGms']

# Average double faults per service game
df_whole['player_avgDoubleFaultsPerGame'] = df_whole['player_df'] / df_whole['player_SvGms']
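
Several of these ratios divide by counts that can be zero (for example a match in which no break points were faced), and pandas then returns NaN or inf rather than raising an error. The following is a minimal sketch of one way to make that case explicit. The helper `safe_ratio` and the toy values are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise ratio that is NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

# Toy rows (illustrative values only): the second match has no break points faced.
toy = pd.DataFrame({"player_bpSaved": [2.0, 0.0], "player_bpFaced": [3.0, 0.0]})
toy["bp_save_rate"] = safe_ratio(toy["player_bpSaved"], toy["player_bpFaced"])
print(toy)
```

Whether such rows should stay NaN or be filled with a neutral value is a modelling choice that the notebook does not settle.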


In [13]:
df_whole.head()

Unnamed: 0,player_id,player_hand,player_ht,player_ioc,player_age,player_ace,player_df,player_svpt,player_1stIn,player_1stWon,player_2ndWon,player_SvGms,player_bpSaved,player_bpFaced,player_rank,player_rank_points,tourney_name,tourney_date,best_of,surface,minutes,draw_size,tourney_level,round,score,tourney_id,match_num,won,player_serve_win_ratio,player_return_won,player_return_win_ratio,player_bpConversionRate,player_firstServeSuccessRate,player_avgAcesPerGame,player_avgDoubleFaultsPerGame
167,216347,R,176.0,POL,21.6,0.0,3.0,78.0,38.0,29.0,22.0,11.0,2.0,3.0,1.0,11025.0,Australian Open,20230116,3,Hard,119.0,128,G,R128,6-4 7-5,2023-580,100,1,0.653846,27.0,0.346154,0.666667,0.763158,0.0,0.272727
168,215785,R,162.0,COL,21.0,0.0,3.0,62.0,44.0,31.0,7.0,9.0,3.0,5.0,84.0,744.0,Australian Open,20230116,3,Hard,79.0,128,G,R128,6-4 6-1,2023-580,101,1,0.612903,24.0,0.387097,0.6,0.704545,0.0,0.333333
169,213710,U,,ESP,25.0,4.0,0.0,80.0,49.0,30.0,16.0,11.0,5.0,8.0,100.0,611.0,Australian Open,20230116,3,Hard,106.0,128,G,R128,2-6 6-0 6-2,2023-580,102,1,0.575,34.0,0.425,0.625,0.612245,0.363636,0.0
170,215370,R,170.0,CAN,22.5,0.0,0.0,56.0,35.0,26.0,14.0,9.0,3.0,3.0,43.0,1121.0,Australian Open,20230116,3,Hard,101.0,128,G,R128,6-2 6-4,2023-580,103,1,0.714286,16.0,0.285714,1.0,0.742857,0.0,0.0
171,214981,R,184.0,KAZ,23.5,5.0,4.0,55.0,22.0,17.0,19.0,11.0,1.0,4.0,25.0,1585.0,Australian Open,20230116,3,Hard,83.0,128,G,R128,7-5 6-3,2023-580,104,1,0.654545,19.0,0.345455,0.25,0.772727,0.454545,0.363636


In [14]:
# Separate text columns from numeric columns
text_columns = df_matches.select_dtypes(include=['object']).columns
numeric_columns = df_matches.select_dtypes(include=['int64','float64']).columns

# Convert object columns to the pandas string dtype
for text_column in text_columns:
    df_matches[text_column] = df_matches[text_column].astype("string")

In [15]:
# Rolling averages

In [16]:
tournaments = ["Australian Open", "Wimbledon", "Roland Garros", "Us Open"]
tournament_dates = df_whole.loc[df_whole.tourney_name.isin(tournaments)].groupby(['tourney_name','tourney_date']) \
.size().reset_index()[['tourney_name','tourney_date']]


# Add one extra date (Australian Open 2024) for the final prediction
tournament_dates.loc[-1] = ['Australian Open', ('20240115')]

In [17]:
# Convert 'tourney_date' in 'tournament_dates' to an integer type
tournament_dates['tourney_date'] = tournament_dates['tourney_date'].astype(int)

In [18]:
tournament_dates

Unnamed: 0,tourney_name,tourney_date
0,Australian Open,20210208
1,Australian Open,20220117
2,Australian Open,20230116
3,Us Open,20210830
4,Us Open,20220829
5,Us Open,20230828
-1,Australian Open,20240115


In [19]:
# Function that computes pre-tournament rolling averages
def get_rolling_features(df, date_df=None, rolling_cols=None, last_cols=None):

    # Sort by player and date
    df = df.sort_values(['player_id','tourney_date'], ascending=True)

    # For each tournament, average each player's matches played before the tournament start date
    for index, tourney_date in enumerate(date_df.tourney_date):

        df_temp = df.loc[df.tourney_date < tourney_date]

        # Most recent value of the "last" columns (e.g. ranking) per player
        df_temp_last = df_temp.groupby('player_id')[last_cols].last().reset_index()

        # Rolling mean over up to the 15 most recent matches; keep each player's latest window
        df_temp = df_temp.groupby('player_id')[rolling_cols].rolling(15, min_periods=1).mean().reset_index()
        df_temp = df_temp.groupby('player_id').tail(1)

        df_temp = df_temp.merge(df_temp_last, on='player_id', how='left')

        if index == 0:
            df_result = df_temp
            df_result['tourney_date_index'] = tourney_date
        else:
            df_temp['tourney_date_index'] = tourney_date
            df_result = pd.concat([df_result, df_temp])

    # 'level_1' is the original row index left over from reset_index()
    df_result.drop('level_1', axis=1, inplace=True)

    return df_result
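
To make the windowing concrete, here is a minimal sketch on toy data (player IDs, dates and values are made up). For a single cutoff date it keeps only earlier matches, averages up to the last 3 of them per player and takes the final window. The window of 3 stands in for the 15 used in the function, and the column name `serve_ratio` is a placeholder.

```python
import pandas as pd

toy = pd.DataFrame({
    "player_id":    [1, 1, 1, 1, 2, 2],
    "tourney_date": [20220101, 20220201, 20220301, 20220401, 20220101, 20220501],
    "serve_ratio":  [0.50, 0.60, 0.70, 0.80, 0.40, 0.90],
})

cutoff = 20220415  # hypothetical tournament start date
history = toy[toy["tourney_date"] < cutoff].sort_values(["player_id", "tourney_date"])

pre_tournament = (
    history.groupby("player_id")["serve_ratio"]
    .rolling(3, min_periods=1).mean()      # rolling mean within each player
    .groupby(level="player_id").tail(1)    # keep the most recent window only
    .droplevel(1)                          # drop the original row index
)
print(pre_tournament)
```

Player 1 ends up with the mean of the last three matches before the cutoff, while player 2 has a single earlier match, so `min_periods=1` returns that one value instead of NaN.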

In [20]:
# Columns to which the rolling average is applied
rolling_cols = ['player_serve_win_ratio', 'player_return_won',
               'player_return_win_ratio', 'player_bpConversionRate', 'player_firstServeSuccessRate', 
               'player_avgAcesPerGame', 'player_avgDoubleFaultsPerGame']

# Columns for which the last value is taken
# For the ranking we simply use the most recent ranking before the tournament starts,
# because it should already reflect the player's latest results
last_cols = ['player_rank']

df_rolling = get_rolling_features (df_whole, tournament_dates, rolling_cols, last_cols= last_cols)

In [21]:
df_rolling.head()

Unnamed: 0,player_serve_win_ratio,player_return_won,player_return_win_ratio,player_bpConversionRate,player_firstServeSuccessRate,player_avgAcesPerGame,player_avgDoubleFaultsPerGame,player_id,player_rank,tourney_date_index
0,0.630214,22.5,0.369786,0.439394,0.77147,0.607011,0.243585,200033,11.0,20220117
1,0.528704,26.5,0.471296,0.269231,0.623366,0.316667,0.433333,200748,81.0,20220117
2,0.611911,32.0,0.388089,0.807692,0.690399,0.166667,0.379167,201320,37.0,20220117
3,0.46559,30.333333,0.53441,0.494444,0.557284,0.265079,0.526984,201325,186.0,20220117
4,0.558248,31.5,0.441752,0.2875,0.632653,0.1,0.266667,201329,101.0,20220117


In [22]:
# Randomise the player order within each match
np.random.seed(123)

# Draw 0 or 1 with a 50% chance each
df_matches_base['random_number'] = np.random.randint(2, size=len(df_matches_base))

# If 0, player 1 is the winner; if 1, player 1 is the loser
df_matches_base['randomised_player_1'] = np.where(df_matches_base['random_number']==0,df_matches_base['winner_id'],df_matches_base['loser_id'])
df_matches_base['randomised_player_2'] = np.where(df_matches_base['random_number']==0,df_matches_base['loser_id'],df_matches_base['winner_id'])

# Flag whether player 1 won the match (1 = win, 0 = loss)
df_matches_base['player_1_win'] = np.where(df_matches_base['random_number']==0,1,0)

print("After shuffling, the win rate for player 1 in the women's matches is {}%".format(df_matches_base['player_1_win'].mean()*100))

After shuffling, the win rate for player 1 in the women's matches is 48.031496062992126%


In [23]:
# Close to 50% is fine
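
The reason for the shuffle is that the raw data always list the winner first, so without it the target would be constant. Below is a minimal sketch of the same idea on toy IDs, using NumPy's `Generator` API instead of the legacy global seed. The seed value, the frame and the column names `player_1`/`player_2` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the example
toy = pd.DataFrame({"winner_id": [10, 11, 12, 13], "loser_id": [20, 21, 22, 23]})

# True means the loser is placed in the player_1 slot
swap = rng.integers(0, 2, size=len(toy)).astype(bool)
toy["player_1"] = np.where(swap, toy["loser_id"], toy["winner_id"])
toy["player_2"] = np.where(swap, toy["winner_id"], toy["loser_id"])

# The label follows directly from the swap flag
toy["player_1_win"] = (~swap).astype(int)
print(toy)
```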

In [24]:
df_matches_base

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,winner_ace,winner_df,winner_svpt,winner_1stIn,winner_1stWon,winner_2ndWon,winner_SvGms,winner_bpSaved,winner_bpFaced,loser_ace,loser_df,loser_svpt,loser_1stIn,loser_1stWon,loser_2ndWon,loser_SvGms,loser_bpSaved,loser_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,random_number,randomised_player_1,randomised_player_2,player_1_win
167,2023-580,Australian Open,Hard,128,G,20230116,100,216347,1.0,,Iga Swiatek,R,176.0,POL,21.6,215898,,,Jule Niemeier,R,178.0,GER,23.4,6-4 7-5,3,R128,119.0,0.0,3.0,78.0,38.0,29.0,22.0,11.0,2.0,3.0,3.0,1.0,71.0,45.0,32.0,9.0,11.0,2.0,5.0,1.0,11025.0,69.0,845.0,0,216347,215898,1
168,2023-580,Australian Open,Hard,128,G,20230116,101,215785,,,Camila Osorio,R,162.0,COL,21.0,215910,,,Panna Udvardy,R,170.0,HUN,24.3,6-4 6-1,3,R128,79.0,0.0,3.0,62.0,44.0,31.0,7.0,9.0,3.0,5.0,4.0,5.0,63.0,38.0,22.0,7.0,8.0,6.0,11.0,84.0,744.0,89.0,695.0,1,215910,215785,0
169,2023-580,Australian Open,Hard,128,G,20230116,102,213710,,Q,Cristina Bucsa,U,,ESP,25.0,220332,,Q,Eva Lys,R,165.0,GER,21.0,2-6 6-0 6-2,3,R128,106.0,4.0,0.0,80.0,49.0,30.0,16.0,11.0,5.0,8.0,1.0,5.0,73.0,48.0,24.0,11.0,11.0,9.0,15.0,100.0,611.0,126.0,507.0,0,213710,220332,1
170,2023-580,Australian Open,Hard,128,G,20230116,103,215370,,,Bianca Andreescu,R,170.0,CAN,22.5,213631,25.0,,Marie Bouzkova,R,180.0,CZE,24.4,6-2 6-4,3,R128,101.0,0.0,0.0,56.0,35.0,26.0,14.0,9.0,3.0,3.0,2.0,2.0,60.0,50.0,30.0,4.0,9.0,2.0,5.0,43.0,1121.0,26.0,1581.0,0,215370,213631,1
171,2023-580,Australian Open,Hard,128,G,20230116,104,214981,22.0,,Elena Rybakina,R,184.0,KAZ,23.5,220714,,,Elisabetta Cocciaretto,R,,ITA,21.9,7-5 6-3,3,R128,83.0,5.0,4.0,55.0,22.0,17.0,19.0,11.0,1.0,4.0,2.0,1.0,60.0,33.0,19.0,12.0,10.0,7.0,12.0,25.0,1585.0,48.0,1015.0,0,214981,220714,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7507,2021-560,Us Open,Hard,128,G,20210830,222,220367,,,Leylah Fernandez,L,168.0,CAN,18.9,202494,5.0,,Elina Svitolina,R,174.0,UKR,26.9,6-3 3-6 7-6(5),3,QF,144.0,1.0,5.0,98.0,54.0,38.0,21.0,16.0,2.0,6.0,8.0,3.0,98.0,64.0,46.0,14.0,15.0,6.0,10.0,73.0,1024.0,5.0,5210.0,1,202494,220367,0
7508,2021-560,Us Open,Hard,128,G,20210830,223,214544,2.0,,Aryna Sabalenka,R,182.0,BLR,23.3,206252,8.0,,Barbora Krejcikova,R,178.0,CZE,25.7,6-1 6-4,3,QF,86.0,6.0,7.0,67.0,38.0,29.0,14.0,9.0,5.0,6.0,4.0,5.0,55.0,21.0,12.0,15.0,8.0,6.0,10.0,2.0,7010.0,9.0,4273.0,0,214544,206252,1
7509,2021-560,Us Open,Hard,128,G,20210830,224,221054,,Q,Emma Raducanu,R,175.0,GBR,18.7,206289,17.0,,Maria Sakkari,R,172.0,GRE,26.0,6-1 6-4,3,SF,84.0,4.0,2.0,56.0,40.0,29.0,11.0,9.0,7.0,7.0,4.0,5.0,60.0,31.0,22.0,10.0,8.0,8.0,11.0,150.0,531.0,18.0,3210.0,0,221054,206289,1
7510,2021-560,Us Open,Hard,128,G,20210830,225,220367,,,Leylah Fernandez,L,168.0,CAN,18.9,214544,2.0,,Aryna Sabalenka,R,182.0,BLR,23.3,7-6(3) 4-6 6-4,3,SF,141.0,6.0,2.0,102.0,59.0,40.0,22.0,17.0,7.0,11.0,10.0,8.0,94.0,64.0,45.0,12.0,16.0,3.0,7.0,73.0,1024.0,2.0,7010.0,0,220367,214544,1


In [25]:
# Keep only the columns needed from here on
cols_to_keep = ['winner_name','loser_name','tourney_name','tourney_date',
                    'player_1_win','randomised_player_1',
                    'randomised_player_2']

df_matches_base = df_matches_base[cols_to_keep]


# Join the rolling-average features onto the individual matches.

# Rolling features for player 1
df_matches_base = df_matches_base.merge(df_rolling, how='left',
                      left_on = ['randomised_player_1','tourney_date'],
                      right_on = ['player_id','tourney_date_index'],
                      validate ='m:1')

# Rolling features for player 2
df_matches_base = df_matches_base.merge(df_rolling, how='left',
                      left_on = ['randomised_player_2','tourney_date'],
                      right_on = ['player_id','tourney_date_index'],
                      validate ='m:1',
                      suffixes=('_p1','_p2'))

In [26]:
# Number of players without any match history before the tournament
print('{} player_1s have no match history before the tournament'.format(df_matches_base.loc[df_matches_base.player_id_p1.isna(),'randomised_player_1'].nunique()))
print('{} player_2s have no match history before the tournament'.format(df_matches_base.loc[df_matches_base.player_id_p2.isna(),'randomised_player_2'].nunique()))

146 player_1s have no match history before the tournament
146 player_2s have no match history before the tournament


In [27]:
df_matches_base.loc[df_matches_base.player_id_p1.isna(),'tourney_date'].value_counts()

20210208    127
20210830     26
20220117     17
20220829     12
20230828     11
20230116      9
Name: tourney_date, dtype: int64

In [28]:
# Most of the missing data falls in the earliest tournaments, which makes sense because there is not yet enough history for those players.
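
An alternative way to find the unmatched rows is the `indicator` option of `DataFrame.merge`, which labels each row by where it came from. This is a minimal sketch on toy frames; the frame names and values are made up, and only the join keys mirror the notebook.

```python
import pandas as pd

matches = pd.DataFrame({"player_1": [1, 2, 3], "tourney_date": [20230116] * 3})
features = pd.DataFrame({"player_id": [1, 3],
                         "tourney_date_index": [20230116] * 2,
                         "player_rank": [5.0, 40.0]})

merged = matches.merge(features, how="left",
                       left_on=["player_1", "tourney_date"],
                       right_on=["player_id", "tourney_date_index"],
                       indicator=True)

# "left_only" marks matches whose player has no pre-tournament features
missing = merged.loc[merged["_merge"] == "left_only", "player_1"]
print(missing.tolist())  # [2]
```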

In [29]:
def get_player_difference(df, diff_cols = None):

    p1_cols = [i + '_p1' for i in diff_cols]
    p2_cols = [i + '_p2' for i in diff_cols]

    # Fill missing values with zeros, except for the ranking, where 999 is used
    df['player_rank_p1'] = df['player_rank_p1'].fillna(999)
    df[p1_cols] = df[p1_cols].fillna(0)

    df['player_rank_p2'] = df['player_rank_p2'].fillna(999)
    df[p2_cols] = df[p2_cols].fillna(0)

    new_column_name = [i + '_diff' for i in diff_cols]

    # Compute the difference (player 1 minus player 2)
    df_p1 = df[p1_cols]
    df_p2 = df[p2_cols]

    df_p1.columns=new_column_name
    df_p2.columns=new_column_name

    df_diff = df_p1 - df_p2
    df_diff.columns = new_column_name

    # Drop the p1 and p2 columns, since the differences replace them
    df.drop(p1_cols + p2_cols, axis=1, inplace=True)

    # Concatenate df and df_diff
    df = pd.concat([df, df_diff], axis=1)

    return df,new_column_name
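
The core of the function is a column-wise subtraction of the `_p2` features from the `_p1` features. A minimal sketch of just that step, written so that the input frame is left untouched, could look as follows. The helper name `player_difference` and the toy rankings are hypothetical.

```python
import pandas as pd

def player_difference(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Return p1 - p2 for each base column, leaving `df` unchanged."""
    p1 = df[[c + "_p1" for c in cols]].to_numpy()
    p2 = df[[c + "_p2" for c in cols]].to_numpy()
    return pd.DataFrame(p1 - p2, columns=[c + "_diff" for c in cols], index=df.index)

# Toy rankings (illustrative only)
toy = pd.DataFrame({"player_rank_p1": [1.0, 84.0], "player_rank_p2": [69.0, 89.0]})
diff = player_difference(toy, ["player_rank"])
print(diff)
```

With this sign convention a negative `player_rank_diff` means that player 1 holds the better (numerically lower) ranking.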

In [30]:
diff_cols = ['player_serve_win_ratio', 'player_return_won',
               'player_return_win_ratio', 'player_bpConversionRate', 'player_firstServeSuccessRate', 
               'player_avgAcesPerGame', 'player_avgDoubleFaultsPerGame', 'player_rank']

# Apply the function
df_matches_base,_ = get_player_difference(df_matches_base,diff_cols=diff_cols)

# Create a copy
df_final = df_matches_base.copy()

In [31]:
df_final.head()

Unnamed: 0,winner_name,loser_name,tourney_name,tourney_date,player_1_win,randomised_player_1,randomised_player_2,player_id_p1,tourney_date_index_p1,player_id_p2,tourney_date_index_p2,player_serve_win_ratio_diff,player_return_won_diff,player_return_win_ratio_diff,player_bpConversionRate_diff,player_firstServeSuccessRate_diff,player_avgAcesPerGame_diff,player_avgDoubleFaultsPerGame_diff,player_rank_diff
0,Iga Swiatek,Jule Niemeier,Australian Open,20230116,1,216347,215898,216347.0,20230116.0,215898.0,20230116.0,-0.035571,-0.516667,0.035571,-0.066859,-0.072582,-0.326733,-0.470795,-107.0
1,Camila Osorio,Panna Udvardy,Australian Open,20230116,0,215910,215785,215910.0,20230116.0,215785.0,20230116.0,-0.137167,-8.0,0.137167,-0.288571,-0.111969,-0.165165,-0.303909,25.0
2,Cristina Bucsa,Eva Lys,Australian Open,20230116,1,213710,220332,213710.0,20230116.0,,,0.550179,28.75,0.449821,0.564042,0.566181,0.027778,0.158333,-881.0
3,Bianca Andreescu,Marie Bouzkova,Australian Open,20230116,1,215370,213631,215370.0,20230116.0,213631.0,20230116.0,0.004655,3.555556,-0.004655,-0.051052,0.029651,-0.004358,-0.020351,7.0
4,Elena Rybakina,Elisabetta Cocciaretto,Australian Open,20230116,1,214981,220714,214981.0,20230116.0,220714.0,20230116.0,0.074957,-6.625,-0.074957,0.006944,0.107264,0.588418,0.198363,-74.0


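### Model building

Training uses the tournaments that start after 17 January 2022, excluding US Open 2023. With the date filter below this means US Open 2022 and Australian Open 2023; the first year is left out because too many players have no match history yet.

US Open 2023 is passed as the validation frame. H2O AutoML keeps cross-validation enabled, so the leaderboard is based on cross-validated metrics and the validation frame only provides additional, informative metrics.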

In [32]:
# Convert 'tourney_date' in 'df_final' to datetime
df_final['tourney_date'] = pd.to_datetime(df_final['tourney_date'].astype(str), format='%Y%m%d')

In [33]:
df_train = df_final.loc[(df_final.tourney_date != '20230828') # exclude US Open 2023
                                & (df_final.tourney_date > '20220117')] # exclude 2021; the strict > also drops Australian Open 2022
df_valid = df_final.loc[df_final.tourney_date == '20230828'] # US Open 2023
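
To see which tournaments such a filter keeps, here is a minimal sketch that applies the same two conditions to the tournament start dates alone. The dates are the ones listed in `tournament_dates`; the frame `toy` and the variable names are assumptions made for the example.

```python
import pandas as pd

toy = pd.DataFrame({"tourney_date": pd.to_datetime(
    ["2021-02-08", "2021-08-30", "2022-01-17", "2022-08-29", "2023-01-16", "2023-08-28"])})

valid_date = pd.Timestamp("2023-08-28")   # US Open 2023
train_start = pd.Timestamp("2022-01-17")  # Australian Open 2022 start date

train = toy[(toy["tourney_date"] > train_start) & (toy["tourney_date"] != valid_date)]
valid = toy[toy["tourney_date"] == valid_date]

print(train["tourney_date"].dt.strftime("%Y-%m-%d").tolist())
# ['2022-08-29', '2023-01-16']
```

Replacing `>` with `>=` would bring the tournament starting on 2022-01-17 into the training set; which of the two was intended is not stated in the notebook.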

In [34]:
df_train.head()

Unnamed: 0,winner_name,loser_name,tourney_name,tourney_date,player_1_win,randomised_player_1,randomised_player_2,player_id_p1,tourney_date_index_p1,player_id_p2,tourney_date_index_p2,player_serve_win_ratio_diff,player_return_won_diff,player_return_win_ratio_diff,player_bpConversionRate_diff,player_firstServeSuccessRate_diff,player_avgAcesPerGame_diff,player_avgDoubleFaultsPerGame_diff,player_rank_diff
0,Iga Swiatek,Jule Niemeier,Australian Open,2023-01-16,1,216347,215898,216347.0,20230116.0,215898.0,20230116.0,-0.035571,-0.516667,0.035571,-0.066859,-0.072582,-0.326733,-0.470795,-107.0
1,Camila Osorio,Panna Udvardy,Australian Open,2023-01-16,0,215910,215785,215910.0,20230116.0,215785.0,20230116.0,-0.137167,-8.0,0.137167,-0.288571,-0.111969,-0.165165,-0.303909,25.0
2,Cristina Bucsa,Eva Lys,Australian Open,2023-01-16,1,213710,220332,213710.0,20230116.0,,,0.550179,28.75,0.449821,0.564042,0.566181,0.027778,0.158333,-881.0
3,Bianca Andreescu,Marie Bouzkova,Australian Open,2023-01-16,1,215370,213631,215370.0,20230116.0,213631.0,20230116.0,0.004655,3.555556,-0.004655,-0.051052,0.029651,-0.004358,-0.020351,7.0
4,Elena Rybakina,Elisabetta Cocciaretto,Australian Open,2023-01-16,1,214981,220714,214981.0,20230116.0,220714.0,20230116.0,0.074957,-6.625,-0.074957,0.006944,0.107264,0.588418,0.198363,-74.0


In [35]:
# Target variable
target= 'player_1_win'

# Features selected for the model
feats = ['player_serve_win_ratio_diff', 'player_return_won_diff',
        'player_return_win_ratio_diff', 'player_bpConversionRate_diff', 'player_firstServeSuccessRate_diff', 
        'player_avgAcesPerGame_diff', 'player_avgDoubleFaultsPerGame_diff', 'player_rank_diff']

print(feats)

['player_serve_win_ratio_diff', 'player_return_won_diff', 'player_return_win_ratio_diff', 'player_bpConversionRate_diff', 'player_firstServeSuccessRate_diff', 'player_avgAcesPerGame_diff', 'player_avgDoubleFaultsPerGame_diff', 'player_rank_diff']


### H2O model

In [36]:
h2o.init()

# Convert to H2O frames
df_train_h2o = h2o.H2OFrame(df_train)
df_valid_h2o = h2o.H2OFrame(df_valid)


# For binary classification the response variable should be a factor
df_train_h2o[target] = df_train_h2o[target].asfactor()
df_valid_h2o[target] = df_valid_h2o[target].asfactor()

# Run AutoML
aml = h2o.automl.H2OAutoML(max_runtime_secs=300,
                           max_models=100,
                           stopping_metric='logloss',
                           sort_metric='logloss',
                           balance_classes=True,
                           seed=183
                          )
aml.train(x=feats, y=target, training_frame=df_train_h2o,validation_frame=df_valid_h2o)

# Show the leaderboard
lb = aml.leaderboard
lb.head()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "17.0.1" 2021-10-19; OpenJDK Runtime Environment (build 17.0.1+12-39); OpenJDK 64-Bit Server VM (build 17.0.1+12-39, mixed mode, sharing)
  Starting server from /Users/patrycjakubas/opt/anaconda3/lib/python3.9/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/y5/2z6ljk6d3nbgf1yjf30sgxc80000gn/T/tmpiq_rsfq9
  JVM stdout: /var/folders/y5/2z6ljk6d3nbgf1yjf30sgxc80000gn/T/tmpiq_rsfq9/h2o_patrycjakubas_started_from_python.out
  JVM stderr: /var/folders/y5/2z6ljk6d3nbgf1yjf30sgxc80000gn/T/tmpiq_rsfq9/h2o_patrycjakubas_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html


0,1
H2O_cluster_uptime:,05 secs
H2O_cluster_timezone:,Europe/Warsaw
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.44.0.3
H2O_cluster_version_age:,4 months and 10 days
H2O_cluster_name:,H2O_from_python_patrycjakubas_hvenkd
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,4 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |
01:45:33.495: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.

███████████████████████████████████████████████████████████████| (done) 100%


model_id,logloss,auc,aucpr,mean_per_class_error,rmse,mse
GBM_grid_1_AutoML_1_20240501_14533_model_11,0.629349,0.7039,0.650999,0.356771,0.467681,0.218726
GLM_1_AutoML_1_20240501_14533,0.635078,0.698909,0.660745,0.333829,0.47083,0.221681
GBM_grid_1_AutoML_1_20240501_14533_model_4,0.642509,0.679501,0.643142,0.368056,0.475459,0.226061
GBM_grid_1_AutoML_1_20240501_14533_model_5,0.642543,0.686725,0.608425,0.379402,0.474731,0.225369
GBM_grid_1_AutoML_1_20240501_14533_model_13,0.643424,0.683005,0.650143,0.37996,0.474922,0.22555
GBM_grid_1_AutoML_1_20240501_14533_model_12,0.644311,0.682509,0.631149,0.345424,0.475246,0.225859
XGBoost_grid_1_AutoML_1_20240501_14533_model_5,0.647131,0.674386,0.624868,0.414373,0.476875,0.227409
GBM_grid_1_AutoML_1_20240501_14533_model_14,0.647916,0.681858,0.644769,0.37965,0.477384,0.227896
XGBoost_grid_1_AutoML_1_20240501_14533_model_12,0.649881,0.671472,0.599046,0.353051,0.477495,0.228001
XGBoost_grid_1_AutoML_1_20240501_14533_model_6,0.650439,0.667504,0.605069,0.349392,0.478443,0.228908
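
For context on the `logloss` column, a predictor that always outputs 50% has a log loss of ln 2, about 0.693, regardless of the outcomes, so the leading model's 0.629 is a modest improvement over a coin flip. The sketch below computes that baseline; the function `log_loss` and the toy outcomes are written for this example and are not taken from H2O.

```python
import math

def log_loss(y_true, p_pred):
    """Mean binary cross-entropy; y_true in {0, 1}, p_pred = P(y = 1)."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred))
    return -total / len(y_true)

y = [1, 0, 1, 1, 0]                      # toy outcomes, illustrative only
baseline = log_loss(y, [0.5] * len(y))   # always predict 50%
print(round(baseline, 6))                # 0.693147
```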


### Prediction

In [37]:
# Prepare the prediction data: all ordered pairs of the eight Australian Open 2024 quarter-finalists

players = ['216146', '221103', '206252', '214544', '222328', '215035', '221012', '214939']
players_df = pd.DataFrame(players)
player_permutations = list(itertools.permutations(players, 2))
df_predict = pd.DataFrame(player_permutations, columns=['player_1','player_2']).astype(int)
df_predict.loc[:,'player_1_win_probability'] = 0.5
df_predict

#216146 Marta Kostyuk 
#221103 Cori Gauff
#206252 Barbora Krejcikova
#214544 Aryna Sabalenka
#222328 Linda Noskova
#215035 Dayana Yastremska
#221012 Qinwen Zheng
#214939 Anna Kalinskaya

Unnamed: 0,player_1,player_2,player_1_win_probability
0,216146,221103,0.5
1,216146,206252,0.5
2,216146,214544,0.5
3,216146,222328,0.5
4,216146,215035,0.5
5,216146,221012,0.5
6,216146,214939,0.5
7,221103,216146,0.5
8,221103,206252,0.5
9,221103,214544,0.5


In [39]:
df_predict['tourney_date'] = '20240115'
df_predict['tourney_date'] = df_predict['tourney_date'].astype(int)

In [40]:
# Fetch the features for player 1
df_predict = df_predict.merge(df_rolling, how='left',
                     left_on = ['player_1','tourney_date'],
                     right_on = ['player_id','tourney_date_index'],validate ='m:1')


# Fetch the features for player 2
df_predict = df_predict.merge(df_rolling, how='left',
                     left_on = ['player_2','tourney_date'],
                     right_on = ['player_id','tourney_date_index'],validate ='m:1',suffixes=('_p1','_p2'))

In [41]:
df_predict.head()

Unnamed: 0,player_1,player_2,player_1_win_probability,tourney_date,player_serve_win_ratio_p1,player_return_won_p1,player_return_win_ratio_p1,player_bpConversionRate_p1,player_firstServeSuccessRate_p1,player_avgAcesPerGame_p1,player_avgDoubleFaultsPerGame_p1,player_id_p1,player_rank_p1,tourney_date_index_p1,player_serve_win_ratio_p2,player_return_won_p2,player_return_win_ratio_p2,player_bpConversionRate_p2,player_firstServeSuccessRate_p2,player_avgAcesPerGame_p2,player_avgDoubleFaultsPerGame_p2,player_id_p2,player_rank_p2,tourney_date_index_p2
0,216146,221103,0.5,20240115,0.570616,27.090909,0.429384,0.470873,0.63412,0.338351,0.451741,216146,39.0,20240115,0.615552,27.133333,0.384448,0.58246,0.69047,0.391198,0.384393,221103,6.0,20240115
1,216146,206252,0.5,20240115,0.570616,27.090909,0.429384,0.470873,0.63412,0.338351,0.451741,216146,39.0,20240115,0.593987,26.333333,0.406013,0.577698,0.691551,0.413497,0.394293,206252,12.0,20240115
2,216146,214544,0.5,20240115,0.570616,27.090909,0.429384,0.470873,0.63412,0.338351,0.451741,216146,39.0,20240115,0.646946,23.8,0.353054,0.613363,0.745903,0.525799,0.422298,214544,2.0,20240115
3,216146,222328,0.5,20240115,0.570616,27.090909,0.429384,0.470873,0.63412,0.338351,0.451741,216146,39.0,20240115,0.59268,32.666667,0.40732,0.459596,0.667677,0.25,0.215278,222328,41.0,20240115
4,216146,215035,0.5,20240115,0.570616,27.090909,0.429384,0.470873,0.63412,0.338351,0.451741,216146,39.0,20240115,0.533177,34.0,0.466823,0.370726,0.558554,0.221078,0.329902,215035,101.0,20240115


In [42]:
df_predict,_ = get_player_difference(df_predict,diff_cols=diff_cols)

In [43]:
df_predict_h2o = h2o.H2OFrame(df_predict[feats])

preds = aml.predict(df_predict_h2o)['p1'].as_data_frame()

df_predict['player_1_win_probability'] = preds

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%




In [44]:
AO2023 = df_predict[['player_1','player_2','player_1_win_probability']]
AO2023

Unnamed: 0,player_1,player_2,player_1_win_probability
0,216146,221103,0.436682
1,216146,206252,0.460748
2,216146,214544,0.383292
3,216146,222328,0.519146
4,216146,215035,0.567476
5,216146,221012,0.32651
6,216146,214939,0.543947
7,221103,216146,0.642226
8,221103,206252,0.625298
9,221103,214544,0.498717


Given the known quarter-final pairings of the Australian Open 2024:
1. #216146 Marta Kostyuk vs #221103 Cori Gauff
2. #206252 Barbora Krejcikova vs #214544 Aryna Sabalenka
3. #214939 Anna Kalinskaya vs #221012 Qinwen Zheng
4. #222328 Linda Noskova vs #215035 Dayana Yastremska \

The matches are analysed in exactly these pairings, and the analysis is then repeated for the four predicted winners.

In [45]:
# Dictionary mapping player IDs to player names
players = {
    221012: "Qinwen Zheng",
    214544: "Aryna Sabalenka",
    221103: "Cori Gauff",
    206252: "Barbora Krejcikova",
    216146: "Marta Kostyuk",
    214939: "Anna Kalinskaya",
    222328: "Linda Noskova",
    215035: "Dayana Yastremska"
}

# Matches to analyse
matches = [
    (216146, 221103),
    (206252, 214544),
    (214939, 221012),
    (222328, 215035)
]

# Function that picks the predicted winner of each match
def find_winners(df, matches, player_dict):
    winners = {}
    for player_1, player_2 in matches:
        match_data = df[(df['player_1'] == player_1) & (df['player_2'] == player_2)]
        if not match_data.empty:
            winner_id = player_1 if match_data.iloc[0]['player_1_win_probability'] > 0.5 else player_2
        else:
            # Fall back to the reversed ordering; there the probability refers to player_2 winning
            match_data = df[(df['player_1'] == player_2) & (df['player_2'] == player_1)]
            winner_id = player_2 if match_data.iloc[0]['player_1_win_probability'] > 0.5 else player_1
        winners[(player_1, player_2)] = player_dict[winner_id]
    return winners

# Find the predicted winners
winners = find_winners(AO2023, matches, players)

# Print the results
for match, winner in winners.items():
    print(f"{players[match[0]]} vs {players[match[1]]}: {winner} wins")

Marta Kostyuk vs Cori Gauff: Cori Gauff wins
Barbora Krejcikova vs Aryna Sabalenka: Aryna Sabalenka wins
Anna Kalinskaya vs Qinwen Zheng: Qinwen Zheng wins
Linda Noskova vs Dayana Yastremska: Linda Noskova wins


In [46]:
players = ['221103', '214544', '221012', '222328']
players_df = pd.DataFrame(players)
player_permutations = list(itertools.permutations(players, 2))
df_predict = pd.DataFrame(player_permutations, columns=['player_1','player_2']).astype(int)
df_predict.loc[:,'player_1_win_probability'] = 0.5

df_predict['tourney_date'] = '20240115'
df_predict['tourney_date'] = df_predict['tourney_date'].astype(int)

# Fetch the features for player 1
df_predict = df_predict.merge(df_rolling, how='left',
                     left_on = ['player_1','tourney_date'],
                     right_on = ['player_id','tourney_date_index'],validate ='m:1')


# Fetch the features for player 2
df_predict = df_predict.merge(df_rolling, how='left',
                     left_on = ['player_2','tourney_date'],
                     right_on = ['player_id','tourney_date_index'],validate ='m:1',suffixes=('_p1','_p2'))

df_predict,_ = get_player_difference(df_predict,diff_cols=diff_cols)

df_predict_h2o = h2o.H2OFrame(df_predict[feats])

preds = aml.predict(df_predict_h2o)['p1'].as_data_frame()

df_predict['player_1_win_probability'] = preds

AO2023 = df_predict[['player_1','player_2','player_1_win_probability']]
AO2023

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%




Unnamed: 0,player_1,player_2,player_1_win_probability
0,221103,214544,0.498717
1,221103,221012,0.473752
2,221103,222328,0.585742
3,214544,221103,0.550931
4,214544,221012,0.547897
5,214544,222328,0.736177
6,221012,221103,0.597622
7,221012,214544,0.551138
8,221012,222328,0.716103
9,222328,221103,0.421904
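
The two orderings of a pair do not give complementary probabilities here: Gauff vs Sabalenka is 0.498717, while Sabalenka vs Gauff is 0.550931, which implies 0.449069 for Gauff. One common way to reconcile them, not used in the original notebook, is to average the two estimates. The sketch below does this with a hypothetical helper `symmetrise`.

```python
def symmetrise(p_ab: float, p_ba: float) -> float:
    """Average the two estimates of P(a beats b) obtained from both orderings."""
    return (p_ab + (1.0 - p_ba)) / 2.0

# Values copied from the table above: Gauff (221103) and Sabalenka (214544)
p_gauff_listed_first = 0.498717      # P(Gauff wins) with Gauff as player 1
p_sabalenka_listed_first = 0.550931  # P(Sabalenka wins) with Sabalenka as player 1

p_gauff = symmetrise(p_gauff_listed_first, p_sabalenka_listed_first)
p_sabalenka = symmetrise(p_sabalenka_listed_first, p_gauff_listed_first)
print(p_gauff, p_sabalenka)
```

After averaging, the two probabilities for a pair sum to 1, so the predicted winner no longer depends on which player happens to be listed first.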


In [47]:
AO2023.groupby('player_1')['player_1_win_probability'].mean() \
.reset_index().sort_values('player_1_win_probability',ascending=False)

Unnamed: 0,player_1,player_1_win_probability
1,221012,0.621621
0,214544,0.611668
2,221103,0.519404
3,222328,0.267833


#1. Qinwen Zheng
#2. Aryna Sabalenka
#3. and #4. Cori Gauff and Linda Noskova