Media outlets are constantly on the lookout for new and upcoming players, regardless of sport. Young talent, valiant comebacks, a rise after injury, or any combination of these generates consistent media coverage, followed by support for that player and love for the sport. For instance, Serena Williams has gained both fame and worldwide support through her distinguished and transformative career as a tennis icon. She has changed the game through her triumphs and the unprecedented adversity she has overcome. Throughout her career, she has won tens of millions of dollars in prize money and has been breaking tennis records for decades.

In our project, we will be analysing statistics from current tennis players, including their prize sums and the number of seasons they have been playing. We will be creating a K-nearest neighbours regression model to predict a player's global ranking based on how many seasons they have played and their total prize money. This in turn could help officials gauge where a player who is new to the tour stands in comparison to the veterans. The dataset that we have obtained was retrieved from "Canvas" and consists of several columns; however, our focus will be placed upon "Prize Money" and "Seasons" to predict current rank.

First, we load the packages needed for our plots and for the functions used in our regression model. These include tidyverse (which provides ggplot2) and tidymodels.

In [24]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 6)

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2



To read our data, we tidied most of it outside of Jupyter, as this was more efficient. We have titled our raw data "player_data_initial". Below, we had to ensure that the column names were representative of the information, so those were renamed accordingly. We then selected each player's current rank, total seasons, and prize money, and filtered out rows with missing values.

After filtering, we placed 75% of our data into a training set and 25% into a testing set, with "current rank" as the response variable.

In [52]:
player_data_initial <- read_csv("https://raw.githubusercontent.com/pangus3/DSCI-Group-Project/main/data/player_stats.csv")
player_data_initial

New names:
• `` -> `...1`
Rows: 500 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (25): Age, Country, Plays, Wikipedia, Current Rank, Best Rank, Name, Bac...
dbl (13): ...1, Turned Pro, Seasons, Titles, Best Season, Retired, Masters, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


...1,Age,Country,Plays,Wikipedia,Current Rank,Best Rank,Name,Backhand,Prize Money,⋯,Facebook,Twitter,Nicknames,Grand Slams,Davis Cups,Web Site,Team Cups,Olympics,Weeks at No. 1,Tour Finals
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
0,26 (25-04-1993),Brazil,Right-handed,Wikipedia,378,363 (04-11-2019),Oscar Jose Gutierrez,,,⋯,,,,,,,,,,
1,18 (22-12-2001),United Kingdom,Left-handed,Wikipedia,326,316 (14-10-2019),Jack Draper,Two-handed,59040.00,⋯,,,,,,,,,,
2,32 (03-11-1987),Slovakia,Right-handed,Wikipedia,178,44 (14-01-2013),Lukas Lacko,Two-handed,3261567.00,⋯,,,,,,,,,,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
497,23 (14-03-1996),Netherlands,Left-handed,Wikipedia,495,342 (05-08-2019),Gijs Brouwer,,,⋯,,,,,,,,,,
498,24 (17-05-1995),Ukraine,,Wikipedia,419,419 (20-01-2020),Vladyslav Orlov,,,⋯,,,,,,,,,,
499,22 (26-03-1997),Tunisia,Left-handed,Wikipedia,451,408 (24-12-2018),Aziz Dougaz,Two-handed,61984.00,⋯,,,,,,,,,,


In [53]:
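#Converting the column names to syntactically valid R names (e.g. "Current Rank" becomes "Current.Rank")
colnames(player_data_initial) <- make.names(colnames(player_data_initial), unique = TRUE)

#Converting rank and prize money from character to integer (entries that cannot be parsed become NA)
#and keeping only the three columns of interest
player_data_edited <- player_data_initial |>
    mutate(current_rank = as.integer(Current.Rank)) |>
    mutate(prize_money = as.integer(Prize.Money)) |>
    select(Seasons, current_rank, prize_money)

#Dropping rows with a missing value in any of the three columns
player_data <- player_data_edited |>
    filter(!is.na(Seasons), !is.na(current_rank), !is.na(prize_money))
player_data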

“NAs introduced by coercion”
“NAs introduced by coercion”


Seasons,current_rank,prize_money
<dbl>,<int>,<int>
14,178,3261567
2,236,374093
11,183,6091971
⋮,⋮,⋮
1,397,40724
10,5,22132368
2,451,61984
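The "NAs introduced by coercion" warnings above arise because as.integer() returns NA for any character value it cannot parse as a number. The following is a minimal base R sketch of that behaviour; the input strings are hypothetical and are not taken from the player dataset, so the actual unparseable entries there may look different.

```r
# Hypothetical character values (not from the dataset): two clean numbers,
# one string with extra symbols, and one genuinely missing entry
raw_prize <- c("59040.00", "3261567.00", "US$ 1,234", NA)

# as.integer() yields NA for strings it cannot parse; the warning is
# suppressed here only to keep the sketch quiet
parsed <- suppressWarnings(as.integer(raw_prize))
parsed

# Number of non-missing inputs that were turned into NA by the conversion
n_failed <- sum(is.na(parsed) & !is.na(raw_prize))
n_failed
```

Counting the failed conversions in this way is one option for checking how many rows the later filter on missing values will remove.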


In [54]:
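player_data
#Splitting the data into training and testing sets: 75% of the data goes to training and 25% to testing, stratified by current_rank.
#No seed is set in the cells shown, so the split and every downstream metric will change between runs; calling set.seed() beforehand would make the results reproducible.
player_split <- initial_split(player_data, prop = 0.75, strata = current_rank)
player_train <- training(player_split)
player_test <- testing(player_split)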


Seasons,current_rank,prize_money
<dbl>,<int>,<int>
14,178,3261567
2,236,374093
11,183,6091971
⋮,⋮,⋮
1,397,40724
10,5,22132368
2,451,61984
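Because the training/testing split is drawn at random, the rows in each set can differ from run to run. The base R sketch below illustrates how fixing a seed makes a random shuffle repeatable; the seed value 2024 is arbitrary and chosen only for illustration. Since initial_split() draws on R's random number generator, calling set.seed() before it should have the same effect.

```r
# Arbitrary seed value chosen for illustration
set.seed(2024)
first_draw <- sample(1:10)

# Resetting the same seed reproduces exactly the same shuffle
set.seed(2024)
second_draw <- sample(1:10)

# Check that the two shuffles are the same
identical(first_draw, second_draw)
```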


In [55]:
#Building a model specification (player_spec) to specify the model and training algorithm.
player_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")
player_spec

K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = tune()
  weight_func = rectangular

Computational engine: kknn 
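With weight_func = "rectangular", every one of the k nearest neighbours counts equally, so a K-nearest neighbours regression prediction is simply the mean response of those neighbours. The sketch below works through that idea by hand in base R. The data frame, the new point, and the helper knn_predict() are all hypothetical and exist only for illustration; this is not how kknn is implemented internally.

```r
# Hypothetical, already-standardized training data (not from the player dataset)
train <- data.frame(
  seasons = c(-1.0, -0.5, 0.0, 0.5, 1.0),
  prize   = c(-0.8, -0.6, 0.1, 0.4, 1.2),
  rank    = c(400, 350, 200, 120, 10)
)

# Hypothetical new observation to predict
new_point <- c(seasons = -0.4, prize = -0.5)

# Illustrative helper: unweighted K-nearest neighbours regression
knn_predict <- function(train, new_point, k) {
  # Euclidean distance from the new point to every training row
  d <- sqrt((train$seasons - new_point[["seasons"]])^2 +
            (train$prize - new_point[["prize"]])^2)
  # Row indices of the k closest training rows
  nearest <- order(d)[seq_len(k)]
  # Unweighted mean of the neighbours' responses
  mean(train$rank[nearest])
}

prediction <- knn_predict(train, new_point, k = 3)
prediction
```

Here the three closest rows are the first three, so the prediction is the mean of 400, 350 and 200.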


In [56]:
#The recipe specifies the predictors and the response, and preprocesses the data so that the predictors are scaled and centered.
player_recipe <- recipe(current_rank ~ Seasons + prize_money, data = player_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
player_recipe


Recipe

Inputs:

      role #variables
   outcome          1
 predictor          2

Operations:

Scaling for all_predictors()
Centering for all_predictors()
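Scaling and centering matter for K-nearest neighbours because the method relies on distances: prize money is measured in dollars and can run into the millions, while seasons rarely exceed a couple of dozen, so unscaled prize money would dominate every distance. The base R sketch below mimics the two recipe steps on hypothetical values (not taken from the dataset); standardize() is an illustrative helper, not a recipes function.

```r
# Hypothetical players (not from the dataset): seasons played and prize money
seasons <- c(1, 2, 10, 14)
prize_money <- c(40000, 375000, 22000000, 3250000)

# Illustrative helper mimicking step_scale() followed by step_center():
# divide by the standard deviation, then subtract the mean
standardize <- function(x) {
  scaled <- x / sd(x)
  scaled - mean(scaled)
}

seasons_std <- standardize(seasons)
prize_std <- standardize(prize_money)

# After standardizing, each predictor has mean 0 and standard deviation 1
c(mean(seasons_std), sd(seasons_std))
c(mean(prize_std), sd(prize_std))
```

In the actual recipe, the means and standard deviations are estimated from the training data and then reused when new observations are preprocessed.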

In [61]:
#Creating the folds for cross-validation. Cross-validation splits the training data into 5 folds so that each observation is used in a validation set exactly once.
player_vfold <- vfold_cv(player_train, v = 5, strata = current_rank)

#The workflow combines the model and recipe
player_workflow <- workflow() |>
    add_recipe(player_recipe) |>
    add_model(player_spec)
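The idea behind 5-fold cross-validation can be sketched in base R as follows. This is a simplified illustration that ignores the stratification vfold_cv() applies; the row count and the seed are hypothetical values chosen only to keep the example small and repeatable.

```r
set.seed(1)   # arbitrary seed so the sketch is repeatable
n_rows <- 20  # hypothetical number of training rows
v <- 5        # number of folds

# Shuffle the fold labels 1..v so that every row belongs to exactly one fold
fold_id <- sample(rep(seq_len(v), length.out = n_rows))

# For fold i, rows with fold_id == i are held out for validation and the
# remaining rows are used to fit the model; here we only count the held-out rows
fold_sizes <- vapply(seq_len(v), function(i) sum(fold_id == i), integer(1))
fold_sizes
```

Each row appears in exactly one validation fold, so averaging the error over the five folds uses every training observation once for evaluation.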


In [70]:
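#Trying every value of k from 1 to 100 and collecting the cross-validated RMSE and R-squared for each
gridvals <- tibble(neighbors = 1:100)
player_results <- player_workflow |>
    tune_grid(resamples = player_vfold, grid = gridvals) |>
    collect_metrics()
player_results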

neighbors,.metric,.estimator,mean,n,std_err,.config
<int>,<chr>,<chr>,<dbl>,<int>,<dbl>,<chr>
1,rmse,standard,128.4353758,5,4.72243847,Preprocessor1_Model001
1,rsq,standard,0.2933101,5,0.02793417,Preprocessor1_Model001
2,rmse,standard,110.9333303,5,6.18897791,Preprocessor1_Model002
⋮,⋮,⋮,⋮,⋮,⋮,⋮
99,rsq,standard,0.2899283,5,0.04348656,Preprocessor1_Model099
100,rmse,standard,117.5764279,5,3.81944249,Preprocessor1_Model100
100,rsq,standard,0.2872302,5,0.04392069,Preprocessor1_Model100


In [71]:
#Keeping the value of k with the lowest cross-validated RMSE
player_min <- player_results |>
    filter(.metric == "rmse") |>
    arrange(mean) |>
    slice(1)

player_min

neighbors,.metric,.estimator,mean,n,std_err,.config
<int>,<chr>,<chr>,<dbl>,<int>,<dbl>,<chr>
9,rmse,standard,101.0704,5,4.63592,Preprocessor1_Model009


In [72]:
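#Extracting the best k and rebuilding the model specification with it
k_min <- player_min |>
    pull(neighbors)

player_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
    set_engine("kknn") |>
    set_mode("regression")

#Refitting on the full training set with the best k
player_best_fit <- workflow() |>
    add_recipe(player_recipe) |>
    add_model(player_best_spec) |>
    fit(data = player_train)

#Evaluating the predictions on the testing set
player_summary <- player_best_fit |>
    predict(player_test) |>
    bind_cols(player_test) |>
    metrics(truth = current_rank, estimate = .pred)

player_summary

#The table below reports the error on the testing set. rmse is the root mean squared prediction error (about 106 ranking positions here),
#rsq is the squared correlation between predicted and actual rank (about 0.33), and mae is the mean absolute error (about 86 positions).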

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
rmse,standard,105.7700408
rsq,standard,0.3338534
mae,standard,86.1775362
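To make the three metrics concrete, the sketch below computes them by hand in base R on a handful of hypothetical true and predicted ranks (these numbers are invented for illustration and are not model output). The rsq line assumes the squared-correlation definition of R-squared.

```r
# Hypothetical true and predicted ranks (not model output)
truth <- c(100, 250, 400, 50)
pred <- c(150, 200, 300, 90)

# Prediction errors
err <- truth - pred

# Root mean squared error: penalizes large misses more heavily
rmse <- sqrt(mean(err^2))

# Mean absolute error: the typical size of a miss, in ranking positions
mae <- mean(abs(err))

# R-squared, taken here as the squared correlation between truth and prediction
rsq <- cor(truth, pred)^2

c(rmse = rmse, mae = mae, rsq = rsq)
```

Both rmse and mae are in the same units as the response, so they can be read directly as a number of ranking positions.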


In [79]:
#Predicting a new player's rank given their prize money and seasons of tennis played.
#Jack Black is an up-and-coming tennis player from planet Mars. He has played tennis in the US for the past 3 seasons and has received $145,563 in prize money.
#With this information, we want to know roughly what Jack Black's ranking would be so that we can decide whether he is someone we want to bet on or sponsor.

new_player_stats <- tibble(Seasons = 3, 
                            prize_money=145563)

#This recipe is built from all of the player data rather than only the training set, but the model still uses the same specification (player_best_spec) with the best k found above.
player_recipe_all_data <- recipe(current_rank ~ Seasons + prize_money, data = player_data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

player_fit_all_data <- workflow() |>
    add_recipe(player_recipe_all_data) |>
    add_model(player_best_spec) |>
    fit(data = player_data)

new_player_predicted <- predict(player_fit_all_data, new_player_stats)

new_player_predicted

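#The model predicts a current_rank of about 295 for this player.
#With k = 9, this is the average rank of the 9 players in the data whose seasons and prize money are most similar to his.
#Given the test-set RMSE of roughly 106 positions, this estimate should be treated as a rough guide rather than a precise ranking.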

.pred
<dbl>
294.6667
