In [1]:
library(tidyverse)   # attaches dplyr, readr, ggplot2, tidyr and friends
library(tidymodels)
library(repr)
options(repr.matrix.max.rows = 6)  # print at most 6 rows of each table



── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ rsample

# Preliminary exploratory data analysis:


### Data reading

In [2]:
url <- "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2020.csv"
player_stats <- read_csv(url)
player_stats

Rows: 1462 Columns: 49
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): tourney_id, tourney_name, surface, tourney_level, winner_entry, wi...
dbl (35): draw_size, tourney_date, match_num, winner_id, winner_seed, winner...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,⋯,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2020-8888,Atp Cup,Hard,24,A,20200106,300,104925,,,⋯,51,39,6,10,6,8,2,9055,1,9985
2020-8888,Atp Cup,Hard,24,A,20200106,299,105138,,,⋯,35,21,6,9,5,10,10,2335,34,1251
2020-8888,Atp Cup,Hard,24,A,20200106,298,104925,,,⋯,57,35,25,14,6,11,2,9055,5,5705
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
2020-7485,Antwerp,Hard,32,A,20201019,128,126203,7,,⋯,37,30,5,9,3,5,28,1670,33,1402
2020-7485,Antwerp,Hard,32,A,20201019,129,144750,,Q,⋯,45,29,5,10,7,11,90,748,74,838
2020-7485,Antwerp,Hard,32,A,20201019,130,200005,,,⋯,32,26,7,9,2,4,38,1306,172,353


### Data cleaning
We follow these steps when cleaning the data:
1. Keep only the matches won by right-handed players, the group our research question concerns.
2. Extract only the predictors (height and age) and the ranking from the original table, dropping rows with missing values.

In [10]:
# Keep only matches whose winner is right-handed
player_stats_righthanded <- filter(player_stats, winner_hand == "R")
player_stats_righthanded

tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,⋯,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2020-8888,Atp Cup,Hard,24,A,20200106,300,104925,,,⋯,51,39,6,10,6,8,2,9055,1,9985
2020-8888,Atp Cup,Hard,24,A,20200106,299,105138,,,⋯,35,21,6,9,5,10,10,2335,34,1251
2020-8888,Atp Cup,Hard,24,A,20200106,298,104925,,,⋯,57,35,25,14,6,11,2,9055,5,5705
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
2020-7485,Antwerp,Hard,32,A,20201019,126,200267,,WC,⋯,45,27,10,12,4,9,528,58,45,1165
2020-7485,Antwerp,Hard,32,A,20201019,128,126203,7,,⋯,37,30,5,9,3,5,28,1670,33,1402
2020-7485,Antwerp,Hard,32,A,20201019,129,144750,,Q,⋯,45,29,5,10,7,11,90,748,74,838


In [11]:
# Keep height and ranking, dropping rows where either is missing
player_stats_with_height <- player_stats_righthanded |>
  select(winner_ht, winner_rank) |>
  drop_na()
player_stats_with_height

winner_ht,winner_rank
<dbl>,<dbl>
188,2
183,10
188,2
⋮,⋮
185,528
193,28
193,90


In [16]:
height_with_category <- player_stats_with_height |>
  # Bin players by ranking; the cut-offs 50, 150 and 300 apply to winner_rank
  mutate(rank = case_when(
    winner_rank < 50   ~ "excellent",
    winner_rank <= 150 ~ "good",
    winner_rank <= 300 ~ "upper-middle",
    TRUE               ~ "normal"
  )) |>
  mutate(rank = factor(rank, levels = c("excellent", "good", "upper-middle", "normal")))
height_with_category

winner_ht,winner_rank,rank
<dbl>,<dbl>,<fct>
188,2,excellent
183,10,excellent
188,2,excellent
⋮,⋮,⋮
185,528,normal
193,28,excellent
193,90,good


In [17]:
# Keep age and ranking, dropping rows where either is missing
player_stats_with_age <- player_stats_righthanded |>
  select(winner_age, winner_rank) |>
  drop_na()
player_stats_with_age

winner_age,winner_rank
<dbl>,<dbl>
32.6,2
31.7,10
32.6,2
⋮,⋮
21.3,528
22.9,28
23.6,90


In [19]:
age_with_category <- player_stats_with_age |>
  # Bin players by ranking; the cut-offs 50, 150 and 300 apply to winner_rank
  mutate(rank = case_when(
    winner_rank < 50   ~ "excellent",
    winner_rank <= 150 ~ "good",
    winner_rank <= 300 ~ "upper-middle",
    TRUE               ~ "normal"
  )) |>
  mutate(rank = factor(rank, levels = c("excellent", "good", "upper-middle", "normal")))
age_with_category

winner_age,winner_rank,rank
<dbl>,<dbl>,<fct>
32.6,2,excellent
31.7,10,excellent
32.6,2,excellent
⋮,⋮,⋮
21.3,528,normal
22.9,28,excellent
23.6,90,good


# Expected outcomes and significance:

### What do you expect to find?
We expect to find a relationship between the height and age of a right-handed tennis player and their ranking. After examining some preliminary data and forming a training set, we can make scatter plots of ranking against the height and age of right-handed players. Once we have seen whether a relationship exists, we can separate the player rankings into several classes (excellent, good, upper-middle, normal) and use this training set to predict which ranking class an unknown right-handed tennis player would fall into based on his height and age.
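
The training-set and scatter-plot steps described above could be sketched roughly as follows. This is a minimal illustration, not our final analysis: `toy_players` holds made-up values standing in for the cleaned table, the 75/25 split proportion and the seed are assumptions, and the rank cut-offs (50, 150, 300) are one reading of the classes defined earlier.

```r
library(tidyverse)
library(tidymodels)

set.seed(2020)  # arbitrary seed so the split is reproducible

# Hypothetical toy data standing in for the cleaned match table;
# every value here is made up for illustration only.
toy_players <- tibble(
  winner_ht   = c(188, 183, 185, 193, 198, 178, 190, 185, 180, 196, 188, 183),
  winner_age  = c(32.6, 31.7, 21.3, 22.9, 23.6, 28.1, 26.4, 30.2, 24.8, 27.5, 29.9, 25.3),
  winner_rank = c(2, 10, 528, 28, 90, 140, 45, 210, 75, 330, 15, 120)
) |>
  # Assumed cut-offs on the ranking for the four classes
  mutate(rank = case_when(
    winner_rank < 50   ~ "excellent",
    winner_rank <= 150 ~ "good",
    winner_rank <= 300 ~ "upper-middle",
    TRUE               ~ "normal"
  ))

# Hold out 25% of the rows for testing (the proportion is an assumption)
player_split <- initial_split(toy_players, prop = 0.75)
player_train <- training(player_split)
player_test  <- testing(player_split)

# Scatter plot of ranking against height, using the training rows only
height_plot <- ggplot(player_train,
                      aes(x = winner_ht, y = winner_rank, colour = rank)) +
  geom_point() +
  labs(x = "Height (cm)", y = "ATP ranking", colour = "Ranking class")
height_plot
```

Swapping `winner_ht` for `winner_age` in `aes()` would give the matching plot for age. Plotting only `player_train` keeps the test rows unseen until a classifier is evaluated.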

### What impact could such findings have?
These findings can provide insight into whether success in tennis depends on height and age. In many sports, height and age strongly influence how well a participant performs. With this data, we want to see whether the same holds among right-handed tennis players (who are the majority), that is, whether height and age are predictors of a player's ranking. The results could help players judge whether their height and age give them an advantage in a professional tennis career.

### What future questions could this lead to?
- Can we use height and age to predict rankings of left-handed players?
- If height is a predictor, does weight also affect a player's ranking (since height and weight together determine BMI)?
- Depending on the relationship seen, will these results affect a new player's interest in pursuing a professional career?