# Analysis 500

## Purpose
In this notebook we begin our analysis for research question 3: 'Can we predict the next young Grand Slam winners?'. We focus on young players whose averages for 6 match statistics fall within a range derived from the mean and standard deviation of previous Grand Slam winners when they were under 23 y/o.

## Datasets
* _Input_: GrandSlamWinners.csv, Under23_14_17.csv
* _Output_: stats_RQ3

In [1]:
#importing relevant libraries
import os
import sys
import hashlib
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
 
%matplotlib inline

In [2]:
GrandSlamWinners = pd.read_csv("../data/GrandSlamWinners", low_memory = False)

In [3]:
Under23_14_17 = pd.read_csv("../data/Under23_14_17", low_memory = False)

### Average match statistics of Grand Slam winners, 1991 - present

In [4]:
GrandSlamWinners.head()

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
0,Albert Portas,1.180556,1.277778,69.930556,47.680556,30.763889,12.958333,10.027778,4.5,6.25
1,Andreas Vinciguerra,4.610427,2.844558,84.466325,51.211937,38.706068,17.272821,13.397806,4.318319,6.430142
2,Aqeel Khan,,,,,,,,,
3,Bruno Abdel Nour,,,,,,,,,
4,Chris Guccione,15.82381,2.935714,79.882143,46.804762,39.197619,18.913095,13.282143,2.857143,3.684524


Finding the mean and standard deviation of each statistic to use as a comparison for future young players

In [5]:
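# Mean of each match statistic across all previous Grand Slam winners
# (numeric_only=True skips the winner_name column, which recent pandas versions require)
GrandSlamWinners.mean(numeric_only=True)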

w_ace         6.254965
w_df          3.328873
w_svpt       79.315793
w_1stIn      48.015195
w_1stWon     36.223670
w_2ndWon     16.739106
w_SvGms      12.226077
w_bpSaved     4.278576
w_bpFaced     6.007836
dtype: float64

In [6]:
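# Standard deviation of each match statistic across all previous Grand Slam winners
# (numeric_only=True skips the winner_name column, which recent pandas versions require)
GrandSlamWinners.std(numeric_only=True)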

w_ace         5.521484
w_df          2.458519
w_svpt       19.404386
w_1stIn       9.181520
w_1stWon      8.387960
w_2ndWon      4.629210
w_SvGms       2.693367
w_bpSaved     3.006659
w_bpFaced     3.270651
dtype: float64
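
The cells that follow filter on literal lower and upper bounds. As a minimal sketch of how such bounds could be computed instead of typed by hand, the hypothetical helper `stat_bounds` below builds one interval per statistic as mean ± k·std. This formula is an assumption: with k = 1 it does not reproduce the literal bounds used later in the notebook, so those were presumably derived differently. The sample rows are the non-empty rows from the `GrandSlamWinners.head()` output above, restricted to two columns.

```python
import pandas as pd

# Non-empty rows from the GrandSlamWinners.head() output above (two columns only).
winners = pd.DataFrame({
    "winner_name": ["Albert Portas", "Andreas Vinciguerra", "Chris Guccione"],
    "w_ace": [1.180556, 4.610427, 15.82381],
    "w_df": [1.277778, 2.844558, 2.935714],
})


def stat_bounds(df, k=1.0):
    """Hypothetical helper: per-statistic interval mean -/+ k * std.

    The mean +/- k*std rule is an assumption, not the notebook's confirmed method.
    """
    mean = df.mean(numeric_only=True)  # skips the winner_name column
    std = df.std(numeric_only=True)
    return pd.DataFrame({"lower": mean - k * std, "upper": mean + k * std})


bounds = stat_bounds(winners)
print(bounds)
```

Each row of `bounds` is indexed by the statistic name, so `bounds.loc["w_ace", "lower"]` and `bounds.loc["w_ace", "upper"]` could replace a pair of hard-coded numbers in a filter.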

## Analysis of under-23 players, 2014 - 2017

In [7]:
Under23_14_17.head()

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
0,Adam Pavlasek,5.333333,2.666667,92.0,54.666667,38.333333,20.0,13.333333,6.333333,8.666667
1,Albano Olivetti,21.0,3.5,100.0,58.5,51.0,25.0,17.0,2.5,4.0
2,Alberto Lim,2.0,5.0,70.0,46.0,29.0,10.0,11.0,6.0,9.0
3,Alex De Minaur,7.0,5.5,124.5,72.0,50.0,27.0,19.5,7.0,12.0
4,Alex Diaz,,,,,,,,,


Finding young players whose average aces fall within the range derived from the mean and standard deviation of previous Grand Slam winners

In [8]:
RQ3_sorted_w_ace = Under23_14_17[(Under23_14_17['w_ace'] >= 3.265495) & (Under23_14_17['w_ace'] <= 8.599819)]
RQ3_sorted_w_ace.head()

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
0,Adam Pavlasek,5.333333,2.666667,92.0,54.666667,38.333333,20.0,13.333333,6.333333,8.666667
3,Alex De Minaur,7.0,5.5,124.5,72.0,50.0,27.0,19.5,7.0,12.0
5,Alexandar Lazarov,4.0,5.0,106.0,54.0,41.0,25.0,15.0,10.0,13.0
6,Alexander Bublik,7.666667,6.333333,80.333333,44.666667,34.333333,15.0,13.0,5.666667,8.666667
7,Alexander Zverev,7.138462,4.107692,84.892308,52.6,38.969231,17.353846,13.107692,4.323077,5.969231
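
The same range filter is repeated for each statistic in the cells below. A minimal sketch of one way to express it once, using `Series.between` (inclusive on both ends by default), follows. The helper name `filter_by_range` is hypothetical; the sample rows and the bounds are taken from the `Under23_14_17.head()` output and cell 8 above.

```python
import numpy as np
import pandas as pd

# First five rows of Under23_14_17 as shown above (w_ace column only).
under23 = pd.DataFrame({
    "winner_name": ["Adam Pavlasek", "Albano Olivetti", "Alberto Lim",
                    "Alex De Minaur", "Alex Diaz"],
    "w_ace": [5.333333, 21.0, 2.0, 7.0, np.nan],
})


def filter_by_range(df, column, lower, upper):
    """Hypothetical helper: keep rows with lower <= df[column] <= upper.

    Rows where the value is NaN compare as False and are dropped.
    """
    return df[df[column].between(lower, upper)]


selected = filter_by_range(under23, "w_ace", 3.265495, 8.599819)
print(selected)
```

On these five rows the helper keeps Adam Pavlasek and Alex De Minaur, matching the first two rows of the output above.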


Finding young players whose average double faults fall within the range derived from the mean and standard deviation of previous Grand Slam winners

In [9]:
RQ3_sorted_w_df = Under23_14_17[(Under23_14_17['w_df'] >= 1.871434) & (Under23_14_17['w_df'] <= 3.180153)]
RQ3_sorted_w_df.head()

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
0,Adam Pavlasek,5.333333,2.666667,92.0,54.666667,38.333333,20.0,13.333333,6.333333,8.666667
10,Andrey Rublev,5.9,2.8,81.4,49.9,35.9,18.0,12.8,5.9,8.0
13,Ayed Zatar,6.0,2.0,105.0,59.0,43.0,24.0,16.0,4.0,7.0
14,Benjamin Hannestad,1.0,2.0,26.0,16.0,14.0,7.0,5.0,0.0,0.0
17,Borna Coric,6.288462,2.076923,79.980769,48.134615,35.807692,18.326923,12.807692,3.807692,5.480769


Finding young players whose average serve points (`w_svpt`) fall within the range derived from the mean and standard deviation of previous Grand Slam winners

In [10]:
RQ3_sorted_w_svpt = Under23_14_17[(Under23_14_17['w_svpt'] >= 71.818407) & (Under23_14_17['w_svpt'] <= 79.648735)]
RQ3_sorted_w_svpt.head()

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
31,Denis Kudla,4.2,2.6,78.6,44.2,33.0,20.2,12.4,4.6,6.0
37,Dominic Thiem,5.435897,2.166667,75.679487,45.666667,34.769231,17.012821,12.410256,2.974359,4.602564
59,Jared Donaldson,7.083333,3.833333,78.25,37.666667,30.333333,22.083333,12.833333,2.916667,4.833333
73,Karen Khachanov,9.333333,2.866667,78.8,44.333333,34.466667,19.066667,12.8,2.866667,4.4
89,Marcelo Tomas Barrios Vera,2.0,4.0,77.0,50.0,40.0,19.0,12.0,2.0,3.0


Finding young players whose average service games fall within the range derived from the mean and standard deviation of previous Grand Slam winners

In [11]:
RQ3_sorted_w_SvGms = Under23_14_17[(Under23_14_17['w_SvGms'] >= 11.391894) & (Under23_14_17['w_SvGms'] <= 12.65957)]
RQ3_sorted_w_SvGms.head()

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
21,Casper Ruud,3.0,1.5,68.5,46.0,33.0,15.0,11.5,5.0,5.5
31,Denis Kudla,4.2,2.6,78.6,44.2,33.0,20.2,12.4,4.6,6.0
37,Dominic Thiem,5.435897,2.166667,75.679487,45.666667,34.769231,17.012821,12.410256,2.974359,4.602564
50,Henri Laaksonen,1.0,10.0,95.0,65.0,45.0,11.0,12.0,6.0,9.0
55,Hyeon Chung,5.15,2.75,70.25,45.45,34.0,14.55,11.95,2.3,3.8


Finding young players whose average break points saved fall within the range derived from the mean and standard deviation of previous Grand Slam winners

In [12]:
RQ3_sorted_w_bpSaved = Under23_14_17[(Under23_14_17['w_bpSaved'] >= 2.962277) & (Under23_14_17['w_bpSaved'] <= 3.73373)]
RQ3_sorted_w_bpSaved.head()

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
16,Bernard Tomic,9.694444,1.194444,80.75,54.666667,41.805556,14.25,12.944444,3.416667,4.694444
18,Bruno Britez,5.0,3.0,53.0,41.0,32.0,7.0,9.0,3.0,3.0
19,Calvin Hemery,6.0,3.0,84.0,46.0,30.0,18.0,14.0,3.0,7.0
29,Daniil Medvedev,6.0,2.111111,68.777778,41.666667,30.444444,14.0,10.222222,3.111111,4.666667
37,Dominic Thiem,5.435897,2.166667,75.679487,45.666667,34.769231,17.012821,12.410256,2.974359,4.602564


Finding young players whose average break points faced fall within the range derived from the mean and standard deviation of previous Grand Slam winners

In [13]:
RQ3_sorted_w_bpFaced = Under23_14_17[(Under23_14_17['w_bpFaced'] >= 4.243891) & (Under23_14_17['w_bpFaced'] <= 5.637147)]
RQ3_sorted_w_bpFaced.head()

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
16,Bernard Tomic,9.694444,1.194444,80.75,54.666667,41.805556,14.25,12.944444,3.416667,4.694444
17,Borna Coric,6.288462,2.076923,79.980769,48.134615,35.807692,18.326923,12.807692,3.807692,5.480769
21,Casper Ruud,3.0,1.5,68.5,46.0,33.0,15.0,11.5,5.0,5.5
27,Damir Dzumhur,5.5,4.5,81.0,42.0,33.5,22.5,14.0,2.5,4.5
29,Daniil Medvedev,6.0,2.111111,68.777778,41.666667,30.444444,14.0,10.222222,3.111111,4.666667


In [14]:
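# Stack the six filtered frames; a player who passes several filters appears once per filter.
# reset_index(drop=True) discards the old index (the positional axis argument to drop() was removed in pandas 2.0).
stats = pd.concat([RQ3_sorted_w_ace, RQ3_sorted_w_df, RQ3_sorted_w_svpt, RQ3_sorted_w_SvGms, RQ3_sorted_w_bpSaved,
                   RQ3_sorted_w_bpFaced]).reset_index(drop=True)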
stats.head()

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
0,Adam Pavlasek,5.333333,2.666667,92.0,54.666667,38.333333,20.0,13.333333,6.333333,8.666667
1,Alex De Minaur,7.0,5.5,124.5,72.0,50.0,27.0,19.5,7.0,12.0
2,Alexandar Lazarov,4.0,5.0,106.0,54.0,41.0,25.0,15.0,10.0,13.0
3,Alexander Bublik,7.666667,6.333333,80.333333,44.666667,34.333333,15.0,13.0,5.666667,8.666667
4,Alexander Zverev,7.138462,4.107692,84.892308,52.6,38.969231,17.353846,13.107692,4.323077,5.969231
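
Because `pd.concat` stacks the six filtered frames, a player who passes several filters appears in `stats` once per filter. One possible follow-up, sketched below as an assumption rather than as part of the original analysis, is to count how many of the six ranges each player satisfies. The ranges are the literal bounds from cells 8 to 13, the four sample rows are copied from the outputs above, and the column name `n_stats_in_range` is hypothetical.

```python
import pandas as pd

# Literal (lower, upper) bounds used in cells 8 to 13.
ranges = {
    "w_ace": (3.265495, 8.599819),
    "w_df": (1.871434, 3.180153),
    "w_svpt": (71.818407, 79.648735),
    "w_SvGms": (11.391894, 12.65957),
    "w_bpSaved": (2.962277, 3.73373),
    "w_bpFaced": (4.243891, 5.637147),
}

# Four players copied from the outputs above.
players = pd.DataFrame(
    [
        ["Adam Pavlasek", 5.333333, 2.666667, 92.0, 13.333333, 6.333333, 8.666667],
        ["Borna Coric", 6.288462, 2.076923, 79.980769, 12.807692, 3.807692, 5.480769],
        ["Denis Kudla", 4.2, 2.6, 78.6, 12.4, 4.6, 6.0],
        ["Dominic Thiem", 5.435897, 2.166667, 75.679487, 12.410256, 2.974359, 4.602564],
    ],
    columns=["winner_name", "w_ace", "w_df", "w_svpt", "w_SvGms", "w_bpSaved", "w_bpFaced"],
)

# One boolean column per statistic: True where the value lies inside its range.
in_range = pd.DataFrame(
    {col: players[col].between(lo, hi) for col, (lo, hi) in ranges.items()}
)

# Number of statistics (0 to 6) for which each player is inside the range.
players["n_stats_in_range"] = in_range.sum(axis=1)

ranked = players.sort_values("n_stats_in_range", ascending=False)[
    ["winner_name", "n_stats_in_range"]
]
print(ranked)
```

If every player has a single row in `Under23_14_17`, `stats["winner_name"].value_counts()` on the concatenated frame should give the same counts for players who pass at least one filter.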


In [15]:
stats.to_csv('../data/stats_RQ3', index = False, encoding='utf-8')