# **Predicting Rankings of Tennis Players: The 2018 ATP Tour**

<br>

# **Introduction**

Using a dataset of matches won by tennis players, we aim to predict a player's future ranking from several attributes of that player. This leads to our question: how well do variables relating to player status, age, current rank, playing hand, and game statistics predict the ranking of a player in later tennis seasons? To answer it, we use the game results for the top 500 players during the 2018 tennis season. We chose this season to avoid interruptions caused by the COVID-19 pandemic and to ensure that the data are static, with no new records incoming. The dataset contains data on the winners of each round of national and international tennis tournaments hosted by the Association of Tennis Professionals (ATP).

<br>

# **Preliminary Exploratory Data Analysis**

### **Importing Libraries**

In [2]:
library(tidyverse)
library(tidymodels)
library(repr)
library(dplyr)

options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

### **Download the Tennis Dataset**

In [3]:
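url <- "https://drive.google.com/uc?export=download&id=1fOQ8sy_qMkQiQEAO6uFdRX4tLI8EpSTn"
# download.file() fails if the target folder is missing, so create it first
dir.create("data", showWarnings = FALSE)
download.file(url, "data/data.csv")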

### **Load the Tennis Dataset**

In [4]:
tennis <- read_csv("data/data.csv")
head(tennis)

“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  .default = col_double(),
  tourney_id = col_character(),
  tourney_name = col_character(),
  surface = col_character(),
  tourney_level = col_character(),
  winner_seed = col_character(),
  winner_entry = col_character(),
  winner_name = col_character(),
  winner_hand = col_character(),
  winner_ioc = col_character(),
  loser_seed = col_character(),
  loser_entry = col_character(),
  loser_name = col_character(),
  loser_hand = col_character(),
  loser_ioc = col_character(),
  score = col_character(),
  round = col_character()
)

See spec(...) for full column specifications.



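| X1 | tourney_id | tourney_name | surface | draw_size | tourney_level | tourney_date | match_num | winner_id | winner_seed | ⋯ | l_1stIn | l_1stWon | l_2ndWon | l_SvGms | l_bpSaved | l_bpFaced | winner_rank | winner_rank_points | loser_rank | loser_rank_points |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 300 | 105453 | 2.0 | ⋯ | 54 | 34 | 20 | 14 | 10 | 15 | 9 | 3590 | 16 | 1977 |
| 1 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 299 | 106421 | 4.0 | ⋯ | 52 | 36 | 7 | 10 | 10 | 13 | 16 | 1977 | 239 | 200 |
| 2 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 298 | 105453 | 2.0 | ⋯ | 27 | 15 | 6 | 8 | 1 | 5 | 9 | 3590 | 40 | 1050 |
| 3 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 297 | 104542 |  | ⋯ | 60 | 38 | 9 | 11 | 4 | 6 | 239 | 200 | 31 | 1298 |
| 4 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 296 | 106421 | 4.0 | ⋯ | 56 | 46 | 19 | 15 | 2 | 4 | 16 | 1977 | 18 | 1855 |
| 5 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 295 | 104871 |  | ⋯ | 54 | 40 | 18 | 15 | 6 | 9 | 40 | 1050 | 185 | 275 |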


### **Filtering the Dataset**

In [5]:
tennis <- subset(tennis, select = c(tourney_name, draw_size, tourney_level, tourney_date, winner_name, winner_hand, winner_ht, winner_ioc, winner_age, score, best_of, w_ace, w_df, w_svpt, w_1stIn, w_1stWon, w_2ndWon, w_SvGms, w_bpSaved, w_bpFaced, winner_rank, winner_rank_points))

In [8]:
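# Keep matches whose tournament date falls in 2018 (R strings are 1-indexed)
tennis <- filter(tennis, substr(tourney_date, 1, 4) == "2018")
# Encode playing hand as an integer; the stray na.rm argument only created an extra column
tennis <- mutate(tennis, winner_hand = as.integer(as.factor(winner_hand)))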
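
To make the encoding step concrete, the short sketch below shows how `as.integer(as.factor(...))` turns hand codes into integers. The example vector is made up for illustration, and the code `"U"` (unknown hand) is an assumption about what the dataset contains. Factor levels are sorted alphabetically, so the resulting integers are labels rather than a meaningful scale.

```r
# Toy vector of hand codes (illustrative values, not taken from the dataset).
hand <- c("R", "L", "R", "U")

# as.factor() sorts the distinct values alphabetically to form the levels.
hand_factor <- as.factor(hand)
hand_levels <- levels(hand_factor)

# as.integer() returns each value's position among those levels.
hand_codes <- as.integer(hand_factor)

print(hand_levels)  # "L" "R" "U"
print(hand_codes)   # 2 1 2 3
```

Because a player's hand should be the same in every match, taking the per-player mean of this code later on would simply return that player's code.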

#### **Group By and Summarize Player Name, Tidying the Data**

In [10]:
# One row per player: mean of each statistic (median for age) across matches won in 2018
player_stats <- group_by(tennis, winner_name) %>% 
    summarize(winner_rank = mean(winner_rank), winner_age = median(winner_age), winner_hand = mean(winner_hand),
              w_ace = mean(w_ace), w_df = mean(w_df), w_svpt = mean(w_svpt),
              w_1stIn = mean(w_1stIn), w_1stWon = mean(w_1stWon), w_2ndWon = mean(w_2ndWon),
              w_SvGms = mean(w_SvGms), w_bpSaved = mean(w_bpSaved), w_bpFaced = mean(w_bpFaced)) %>% 
    na.omit()  # drop players with any missing summary value

player_stats

`summarise()` ungrouping output (override with `.groups` argument)



| winner_name | winner_rank | winner_age | winner_hand | w_ace | w_df | w_svpt | w_1stIn | w_1stWon | w_2ndWon | w_SvGms | w_bpSaved | w_bpFaced |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Adam Pavlasek | 196.50 | 23.55921 | 2 | 10.00 | 3.00 | 101.0 | 72.50 | 52.50 | 17.50 | 16.50 | 4.00 | 5.00 |
| Adrian Mannarino | 30.24 | 29.98768 | 1 | 6.76 | 2.56 | 86.4 | 53.96 | 40.12 | 18.68 | 14.08 | 3.68 | 5.52 |
| Adrian Menendez Maceiras | 128.00 | 32.29295 | 2 | 7.50 | 1.50 | 89.0 | 63.00 | 41.00 | 15.50 | 13.00 | 6.00 | 8.50 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Yuichi Sugita | 46.71429 | 29.72758 | 2 | 5.571429 | 3 | 82.71429 | 51.42857 | 40 | 16.57143 | 13.28571 | 4.142857 | 5.714286 |
| Ze Zhang | 247.00000 | 27.74538 | 2 | 4.000000 | 4 | 54.00000 | 35.00000 | 25 | 10.00000 | 9.00000 | 4.000000 | 6.000000 |
| Zsombor Piros | 419.00000 | 18.91034 | 3 | 12.000000 | 2 | 159.00000 | 102.00000 | 73 | 30.00000 | 27.00000 | 7.000000 | 12.000000 |


### **Splitting the Data into Training and Testing Set**

In [11]:
set.seed(2000)
player_split <- initial_split(player_stats, prop = 3/4, strata = winner_rank)
player_training <- training(player_split)
player_testing <- testing(player_split)

In [None]:
#ggplot(player_training, aes(x = , y = ))

<br>

# **Methods**

From the dataset, we will use only the columns for rank, age, playing hand, and several other game statistics to predict the future ranks of tennis players. Each variable was summarized per player (the mean for most variables, the median for age) so that data from throughout the ATP circuit are included in the analysis. We will use regression analysis, which models how the value of the response variable, ranking, changes as one or more predictor variables change. By understanding how each variable relates to ranking and how these relationships developed in the past, we can anticipate possible outcomes and make better predictions of future ranking. Many players in the original dataset were excluded from this analysis because their data were missing or insufficient; they were filtered out during the initial phase of exploring our data. Furthermore, only the data on the winners of each tournament round were considered, because the points that determine a player's ranking are awarded only when matches are won and are never deducted for losing. A loss therefore takes no points away from the ranking of the player, our predicted variable.
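
The regression analysis described above could take several forms. The sketch below assumes ordinary linear regression fitted with tidymodels, which is one possible choice and not necessarily the final method. The tibble `example_stats` is a hypothetical stand-in built from six rows of the `player_stats` output shown earlier so that the code runs on its own, and the two predictors are picked only for illustration.

```r
library(tidyverse)
library(tidymodels)

# Six rows copied from the player_stats output above (stand-in for the training set).
example_stats <- tibble(
    winner_rank = c(196.50, 30.24, 128.00, 46.71429, 247.00, 419.00),
    winner_age  = c(23.55921, 29.98768, 32.29295, 29.72758, 27.74538, 18.91034),
    w_ace       = c(10.00, 6.76, 7.50, 5.571429, 4.00, 12.00)
)

# Model specification: ordinary least squares (an assumed choice).
lm_spec <- linear_reg() %>%
    set_engine("lm") %>%
    set_mode("regression")

# Recipe: predict rank from two of the summarized predictors.
rank_recipe <- recipe(winner_rank ~ winner_age + w_ace, data = example_stats)

# Bundle recipe and model, then fit.
rank_fit <- workflow() %>%
    add_recipe(rank_recipe) %>%
    add_model(lm_spec) %>%
    fit(data = example_stats)

# Predictions come back in a .pred column, one row per input row.
rank_predictions <- predict(rank_fit, new_data = example_stats) %>%
    bind_cols(example_stats)

# Error metrics, computed here on the same rows the model was fit to.
metrics(rank_predictions, truth = winner_rank, estimate = .pred)
```

In the notebook itself, `player_training` would replace `example_stats` when fitting and `player_testing` would be passed to `predict()`, so that the metrics reflect data the model has not seen.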

We will visualize the results with graphs to examine the relationships in the data and to judge how well rankings can be predicted from them. We will use scatterplots, built with functions such as `ggplot` and `ggpairs` (the latter from the GGally package), with future ranking on the y-axis and the predictor variables on the x-axis.
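
A scatterplot of the kind described above might be built as follows. This is only a sketch: `example_stats` is a hypothetical tibble holding three rows copied from the `player_stats` output shown earlier, `w_ace` is an arbitrary choice of predictor, and the 2018 `winner_rank` stands in for future ranking, which this dataset does not contain.

```r
library(tidyverse)

# Three rows copied from the player_stats output above.
example_stats <- tibble(
    winner_name = c("Adam Pavlasek", "Adrian Mannarino", "Adrian Menendez Maceiras"),
    winner_rank = c(196.50, 30.24, 128.00),
    w_ace       = c(10.00, 6.76, 7.50)
)

# Ranking on the y-axis, one predictor on the x-axis.
rank_vs_aces <- ggplot(example_stats, aes(x = w_ace, y = winner_rank)) +
    geom_point() +
    labs(x = "Mean aces per match won", y = "Mean ranking during 2018")

rank_vs_aces
```

With the real data, `player_training` would take the place of `example_stats`, and the same plot could be repeated for each predictor of interest.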

<br>

# **Expected Outcomes and Significance**

We expect that a tennis player's statistics should, to a certain degree, be able to predict both their current and future rankings. In general, most top players remain near the top of the rankings for most of their careers. It will be interesting to see whether any predictors can reliably identify which players will rise or fall in the rankings. We expect age, current ranking, and a player's rise in the rankings over preceding years to be among the most informative. The effect of other factors on world rankings, such as physical attributes, will also be interesting to explore.
   
We will explore whether, and to what extent, the future ranking of tennis players can be forecast from current statistics. Although our findings are unlikely to have any significant real-world impact, they should make an interesting read for sports fans and data junkies to expand upon. Based on our findings, it will also be worth exploring whether any missing factors could improve our prediction model, and what happens when the model is applied to other sports.