In [1]:
Top 10 athletes with the most potential to succeed in their careers
Group: 005-26 
Group members: Rinky Manivannan, Prajith Ravisankar, Zhifan Wang

Introduction 

"Who's the best?" is always the question.

Picking the best tennis players with absolute fairness is difficult. Because players come from different eras, everything from age, fitness, diet, and medical conditions to equipment, court surface, and even the pressure players face changes constantly with the times.

Under these many and complex conditions, we should pay more attention to the individual talents of players, whose efforts bear fruit through long-term practice, persistence, and success. We therefore decided to infer the top 10 athletes with the most career potential by looking at the physical attributes of the top 101 players in the data set (an odd number, used as the training set), including height, weight, and the number of medals won over their careers.

We will use player statistics for the top 500 players, taken from an Excel database that contains player profiles, historical standings, results and match statistics, favorite surface, age, GOAT standing, and more.

Preliminary exploratory data analysis 

The data set is loaded into the Jupyter notebook from the file player_stats.csv using the read_csv() function. The resulting tibble contains 500 observations × 38 variables.

Since we aim to provide useful information about player standing based on age and height, we intend to use these two variables as predictors. Only one observation is missing a value for age, but only 115 values are available for the height variable.
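
The missing-value counts described above can be checked with a short summary. The sketch below is only an illustration: `toy_players` and all of its values are hypothetical stand-ins, not rows from the data set, and the string formats for Age and Height are assumptions. With the real data, the same `summarise()` call would be applied to the loaded data frame.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the real player table
toy_players <- tibble(
  Name   = c("Player A", "Player B", "Player C", "Player D"),
  Age    = c("32 (01-01-1988)", NA, "24 (01-01-1996)", "27 (01-01-1993)"),
  Height = c("185 cm", NA, NA, "170 cm")
)

# Count missing values in each predictor column
missing_counts <- toy_players %>%
  summarise(across(c(Age, Height), ~ sum(is.na(.x))))

# Count non-missing values, i.e. how many observations are usable
available_counts <- toy_players %>%
  summarise(across(c(Age, Height), ~ sum(!is.na(.x))))

missing_counts
available_counts
```

Counting the available values per column before filtering shows how much of the data will survive once rows with missing predictors are dropped.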

Plotting the distributions of age and height, we found that ages are scattered across the whole range in the data, but a significant share of players are in their late 20s and early 30s. Most players are taller than 185 cm, a smaller portion are about 180 cm, and very few outliers are shorter than 180 cm.
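
A minimal sketch of how such distribution plots could be produced with ggplot2 is shown here. The data frame `toy_players` and its values are hypothetical, and the bin widths are arbitrary choices for illustration, not settings taken from our analysis.

```r
library(ggplot2)

# Hypothetical values for illustration only
toy_players <- data.frame(
  Age    = c(22, 24, 26, 27, 27, 29, 31, 33),
  Height = c(198, 193, 185, 170, 185, 188, 191, 180)
)

# Histogram of player ages, in 2-year bins
age_hist <- ggplot(toy_players, aes(x = Age)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Age (years)", y = "Number of players")

# Histogram of player heights, in 5 cm bins
height_hist <- ggplot(toy_players, aes(x = Height)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Height (cm)", y = "Number of players")

age_hist
height_hist
```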

Methods 

In our project, the main predictor variables are Age and Height. All observations with values for these variables will be used to train a classifier based on the k-nearest neighbors (K-NN) model. Since we are investigating the effect of two variables on a new observation, we base our analysis on scatter plots. The predictors will be centred and scaled, and the scaled values will be used for plotting and summary analysis. We will also set color palettes and legend placement to make the visualizations easier to read.
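
To make the idea concrete, the sketch below implements centring, scaling, and a K-NN majority vote by hand in base R. It is an illustration of the technique only, not the tidymodels workflow used in the notebook; the training values, the class labels "top" and "lower", the new observation, and the choice k = 3 are all hypothetical.

```r
# Hypothetical training data: two numeric predictors and a class label
train <- data.frame(
  Age    = c(22, 24, 26, 27, 31, 33, 34, 36),
  Height = c(198, 193, 185, 170, 188, 185, 191, 183),
  label  = c("lower", "lower", "lower", "lower", "top", "top", "top", "top")
)
new_obs <- data.frame(Age = 32, Height = 187)

# Centre and scale using the training means and standard deviations
predictors <- c("Age", "Height")
centres <- colMeans(train[, predictors])
scales  <- apply(train[, predictors], 2, sd)
train_scaled <- scale(train[, predictors], center = centres, scale = scales)
new_scaled   <- scale(new_obs[, predictors], center = centres, scale = scales)

# Euclidean distance from the new observation to every training point
distances <- sqrt(rowSums(sweep(train_scaled, 2, as.numeric(new_scaled))^2))

# Majority vote among the k nearest neighbours
k <- 3
nearest <- order(distances)[1:k]
votes <- table(train$label[nearest])
prediction <- names(votes)[which.max(votes)]
prediction
```

Scaling matters here because height in centimetres spans a wider numeric range than age in years; without it, height differences would dominate the distance calculation.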

Expected outcomes and significance 

Since we expect to finalize the top 10 players from past data, this method should produce a useful prediction for players based on age and height. Because we are using the k-nearest neighbors model, we expect to produce a relevant finding, given that the data set contains a relatively large number of observations. Even though the heights of many players are not included in the data set, we would still be able to consider about half of the remaining observations and classify players accordingly.


ERROR: Error in parse(text = x, srcfile = src): <text>:1:5: unexpected numeric constant
1: Top 10
        ^


In [2]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [3]:
tennis_data <- read_csv("player_stats.csv")

“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  .default = col_character(),
  X1 = [32mcol_double()[39m,
  `Turned Pro` = [32mcol_double()[39m,
  Seasons = [32mcol_double()[39m,
  Titles = [32mcol_double()[39m,
  `Best Season` = [32mcol_double()[39m,
  Retired = [32mcol_double()[39m,
  Masters = [32mcol_double()[39m,
  `Grand Slams` = [32mcol_double()[39m,
  `Davis Cups` = [32mcol_double()[39m,
  `Team Cups` = [32mcol_double()[39m,
  Olympics = [32mcol_double()[39m,
  `Weeks at No. 1` = [32mcol_double()[39m,
  `Tour Finals` = [32mcol_double()[39m
)

See spec(...) for full column specifications.



In [8]:
names(tennis_data)

In [11]:
# Column names containing spaces need backticks; a quoted string such as 'GOAT Rank' is never NA
filtered_data <- filter(tennis_data, !is.na(Age), !is.na(Height), !is.na(`GOAT Rank`), !is.na(`Prize Money`))

In [10]:
# to predict if the new observation is in the goat standing based on their height, age (classification)
# to predict how much a new observation will earn based on attributes such as age, height (regression)

In [69]:

library(tidymodels)


selected_data <- filtered_data %>%
                    select(Name, Plays, Age, Height, 'GOAT Rank', 'Turned Pro', 'Best Rank') %>%
                    rename(best_rank = 'Best Rank', 
                          goat_rank = 'GOAT Rank')


new_data_0 <- na.omit(selected_data)

new_data_1 <- new_data_0 %>%
            mutate(Height = as.numeric(gsub("cm", "", Height)), 
                    best_rank = as.numeric(gsub("\\(.*\\)", "", best_rank)), 
                    goat_rank = as.numeric(gsub("\\(.*\\)", "", goat_rank)), 
                    Age = as.numeric(gsub("\\(.*\\)", "", Age)))

#new_data_1

new_data_2 <- arrange(new_data_1, Age, best_rank)

new_data_2

plot1 <- ggplot(new_data_2, aes(x = best_rank, y = goat_rank, color = Plays)) +
                geom_point() +
                labs(x = "Best rank of the player", y = "All-time GOAT rank", color = "Left-handed vs. right-handed play style") + 
                # one colour per level of Plays (assumes exactly two levels)
                scale_color_manual(values = c("orange2", "steelblue2")) +
                theme(text = element_text(size = 12))

#data_plot <- new_data %>%
#                ggplot(aes(x = Age, y = Height, color = goat_rank)) + 
#                geom_point() + 
#                labs(x = "Age", y = "height (cm)", color = "")



# new_data

# For the classification problem, Age and Height are the predictors and goat_rank is the class to predict.
# First we visualize how the data are spread, then we use a K-NN model for classification and prediction.

#unscaled <- new_data %>%
#                mutate(goat_rank = as_factor(goat_rank)) %>%
#                select(Age, Height, goat_rank)

#unscaled



# the knn model creation 

#knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
#                 set_engine("kknn") %>%
#                  set_mode("classification")

# centering and scaling the predictors to improve the prediction

#scaled_centered_data <- recipe(goat_rank ~ Age + Height, data = unscaled) %>%
#                  step_scale(all_predictors()) %>%
#                  step_center(all_predictors())

#scaled_fit <- workflow() %>%
#                  add_recipe(scaled_centered_data) %>%
#                  add_model(knn_spec) %>%
#                  fit(data = unscaled)
#scaled_fit




Name,Plays,Age,Height,goat_rank,Turned Pro,best_rank
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Alexander Zverev,Right-handed,22,198,65,2013,3
Nick Kyrgios,Right-handed,24,193,165,2013,13
Lucas Pouille,Right-handed,25,185,169,2012,10
Dominic Thiem,Right-handed,26,185,58,2011,4
Diego Sebastian Schwartzman,Right-handed,27,170,222,2010,11
Nikoloz Basilashvili,Right-handed,27,185,264,2008,16
Bernard Tomic,Right-handed,27,193,264,2008,17
Damir Dzumhur,Right-handed,27,172,357,2011,23
Filip Krajinovic,Right-handed,27,185,400,2008,26
Ryan Harrison,Right-handed,27,183,489,2008,40
