<h1> DSCI 100 Project Proposal </h1>

Introduction

For this project, we will use Association of Tennis Professionals (ATP) results data to predict which characteristics make the 'ultimate player'. The dataset has been compiled from Jeff Sackmann's GitHub page (https://github.com/JeffSackmann/tennis_atp). It contains information about the players, the match, and the tournament in which the match was played. We aim to determine which player characteristics best predict the outcome of a tennis match. More precisely, if a player with a given set of characteristics appeared, how likely is it that they would win a match against another player with a different set of characteristics? Examples of player characteristics include age, handedness, height, and ATP rank.

Match characteristics, though included in the dataset, cannot be used in this project, as they would "give away" the outcome of the match to the classifier.

We are asking a regression question, as we are attempting to discern whether one or more variables can be used to predict a numerical variable of interest. In this case, that variable is how often a player of a given height, age, rank, or handedness wins, which in turn points to the optimal values of those characteristics for a player.
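
As a rough illustration of "how often does a player of x height/age/rank win?", the sketch below reshapes match rows into one row per player appearance and computes a win rate per group. The `matches` table and its values are made up for illustration; only the column names follow the ATP dataset, and this is one possible framing rather than the method the project commits to.

```r
library(dplyr)

# Toy matches with hypothetical values; column names follow the ATP dataset.
matches <- tibble(
  winner_hand = c("R", "L", "R", "R"),
  winner_ht   = c(185, 188, 180, 198),
  loser_hand  = c("R", "R", "L", "L"),
  loser_ht    = c(183, 185, 188, 180)
)

# One row per player appearance: winners get won = 1, losers get won = 0.
winners <- matches %>%
  transmute(hand = winner_hand, ht = winner_ht, won = 1)
losers <- matches %>%
  transmute(hand = loser_hand, ht = loser_ht, won = 0)
players <- bind_rows(winners, losers)

# Win rate by handedness: the share of appearances that ended in a win.
win_rate_by_hand <- players %>%
  group_by(hand) %>%
  summarise(appearances = n(), win_rate = mean(won))
win_rate_by_hand
```

The same grouping could be applied to height, age, or rank (possibly after binning) to obtain a numerical win rate for each value.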

In [None]:
#Loading the dataset 

library(tidyverse)
library(tidyr)
library(tidymodels)
library(dplyr)
options(repr.matrix.max.rows = 6)

url <- "https://drive.google.com/uc?export=download&id=1fOQ8sy_qMkQiQEAO6uFdRX4tLI8EpSTn"

ATP_data <- read_csv(url)
ATP_data

There are several columns that we can reasonably anticipate will not affect the results, or that are outside the scope of our question, and should therefore be removed:

all columns from draw_size to winner_id
winner_entry, winner_name and winner_ioc, as with the same categories for the loser
score (we are only concerned with the final outcome)
best_of
all columns relating to match information

In [None]:
#select the columns necessary from the dataset

ATP_data_trimmed <- select(
  ATP_data,
  tourney_name, surface,
  winner_hand, winner_ht, winner_age,
  loser_hand, loser_ht, loser_age,
  winner_rank, winner_rank_points,
  loser_rank, loser_rank_points
)
ATP_data_trimmed

After removing those extra columns, this dataset is relatively clean, in that it meets these criteria:

Each column is a single variable
Each value is a single cell
Each row is a unique observation

However, some observations contain 'NA' values. We treat these as making the data untidy, so all rows with an NA value should be removed.

In [None]:
#drop_na function is an easy way to drop rows with NA values, but it requires the tidyr library

library(tidyr)
ATP_data_tidy <- drop_na(ATP_data_trimmed)
ATP_data_tidy
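
To get a sense of how much data `drop_na` removes, it may help to count the missing values in each column and compare row counts before and after. The sketch below does this on a small made-up table (`toy` and its values are hypothetical); the same calls could be pointed at `ATP_data_trimmed`.

```r
library(dplyr)
library(tidyr)

# Hypothetical table with a few missing heights.
toy <- tibble(
  winner_ht  = c(185, NA, 180, 198),
  loser_ht   = c(183, 185, NA, NA),
  winner_age = c(24.1, 30.5, 27.2, 22.8)
)

# Number of NA values in each column.
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Rows before and after removing every row that has at least one NA.
rows_before <- nrow(toy)
rows_after <- nrow(drop_na(toy))
c(before = rows_before, after = rows_after)
```

If one column accounts for most of the missing values, the row counts show how costly it is to keep that column in the analysis.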

Now the data is tidy and ready to work with!

Next, we will create some preliminary visualizations. First off, it will be helpful to see how certain player characteristics, like height and player rank, are distributed amongst winning and losing players.

We first split the data into training and testing sets, then use the training set to build tables and plots that give a better sense of what we are working with.

In [None]:
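# Seed value is an arbitrary choice; it makes the random split reproducible
set.seed(1)
# Hold out 25% of the matches for testing, stratified on winner_rank
ATP_split <- initial_split(ATP_data_tidy, prop = 0.75, strata = winner_rank)
ATP_training <- training(ATP_split)
ATP_testing <- testing(ATP_split)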
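
A quick sanity check on a split is to confirm that every row lands in exactly one of the two sets and that the training share is close to the requested proportion. The sketch below uses a made-up 100-row table (`toy`, its values, and the seed are all assumptions) and a plain, unstratified split for simplicity.

```r
library(dplyr)
library(rsample)

# Arbitrary seed so the toy split is reproducible.
set.seed(1)

# Hypothetical data: 100 players with a rank and a height.
toy <- tibble(
  winner_rank = 1:100,
  winner_ht   = round(rnorm(100, mean = 186, sd = 7))
)

toy_split <- initial_split(toy, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row should be in exactly one set; the training share should be near 0.75.
total_rows  <- nrow(toy_train) + nrow(toy_test)
train_share <- nrow(toy_train) / total_rows
c(total = total_rows, train_share = train_share)
```

With `strata` set, the shares are balanced within bins of the stratifying variable, so the overall training share can differ slightly from the requested value.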

We now have training and testing sets to work with. We will use the training set to create basic preliminary tables and visualizations, which we can examine before going any further.

In [None]:
# Count the number of training rows for each value of every retained column

number_of_rows_tourney_name <- ATP_training %>%
  group_by(tourney_name) %>%
  summarise(Count = n())
number_of_rows_tourney_name

number_of_rows_surface <- ATP_training %>%
  group_by(surface) %>%
  summarise(Count = n())
number_of_rows_surface

number_of_rows_winner_hand <- ATP_training %>%
  group_by(winner_hand) %>%
  summarise(Count = n())
number_of_rows_winner_hand

number_of_rows_winner_ht <- ATP_training %>%
  group_by(winner_ht) %>%
  summarise(Count = n())
number_of_rows_winner_ht

number_of_rows_winner_age <- ATP_training %>%
  group_by(winner_age) %>%
  summarise(Count = n())
number_of_rows_winner_age

number_of_rows_loser_hand <- ATP_training %>%
  group_by(loser_hand) %>%
  summarise(Count = n())
number_of_rows_loser_hand

number_of_rows_loser_ht <- ATP_training %>%
  group_by(loser_ht) %>%
  summarise(Count = n())
number_of_rows_loser_ht

number_of_rows_loser_age <- ATP_training %>%
  group_by(loser_age) %>%
  summarise(Count = n())
number_of_rows_loser_age

number_of_rows_winner_rank <- ATP_training %>%
  group_by(winner_rank) %>%
  summarise(Count = n())
number_of_rows_winner_rank

number_of_rows_winner_rank_points <- ATP_training %>%
  group_by(winner_rank_points) %>%
  summarise(Count = n())
number_of_rows_winner_rank_points

number_of_rows_loser_rank <- ATP_training %>%
  group_by(loser_rank) %>%
  summarise(Count = n())
number_of_rows_loser_rank

number_of_rows_loser_rank_points <- ATP_training %>%
  group_by(loser_rank_points) %>%
  summarise(Count = n())
number_of_rows_loser_rank_points
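
The per-column tables above all follow the same group-and-count pattern, which `dplyr::count()` expresses in one call. For numeric columns such as height, a short summary (mean, minimum, maximum) may be easier to read than one row per distinct value. The sketch below shows both on a made-up table (`toy_training` and its values are hypothetical).

```r
library(dplyr)

# Hypothetical stand-in for a few rows of the training set.
toy_training <- tibble(
  surface   = c("Hard", "Clay", "Hard", "Grass", "Hard"),
  winner_ht = c(185, 188, 185, 198, 180)
)

# Same result as group_by(surface) %>% summarise(Count = n()).
surface_counts <- toy_training %>%
  count(surface, name = "Count")
surface_counts

# Summary statistics for a numeric column.
ht_summary <- toy_training %>%
  summarise(
    mean_ht = mean(winner_ht),
    min_ht  = min(winner_ht),
    max_ht  = max(winner_ht)
  )
ht_summary
```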

Plotting our data:

Since each match records the same characteristics for both the winner and the loser, we can plot the same statistics for each group and compare the graphs side by side.

In [None]:
options(repr.plot.height = 8, repr.plot.width = 7)

ATP_plot <- ggplot(ATP_training, aes(x = winner_ht, y = winner_rank)) +
  geom_point() +
  facet_wrap(~ factor(winner_hand, levels = c("R", "L"))) +
  xlab("Winner Height (in cm)") +
  ylab("Winner Rank") +
  ggtitle("Rank Over Height Graph") +
  theme(text = element_text(size = 20))
ATP_plot


ATP_plot2 <- ggplot(ATP_training, aes(x = loser_ht, y = loser_rank)) +
  geom_point() +
  facet_wrap(~ factor(loser_hand, levels = c("R", "L"))) +
  xlab("Loser Height (in cm)") +
  ylab("Loser Rank") +
  ggtitle("Rank Over Height Graph") +
  theme(text = element_text(size = 20))
ATP_plot2

In [None]:
options(repr.plot.height = 8, repr.plot.width = 7)

ATP_plot3 <- ggplot(ATP_training, aes(x = winner_age, y = winner_rank)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ factor(winner_hand, levels = c("R", "L"))) +
  xlab("Winner Age (years)") +
  ylab("Winner Rank") +
  ggtitle("Rank Over Age Graph") +
  theme(text = element_text(size = 20))
ATP_plot3


ATP_plot4 <- ggplot(ATP_training, aes(x = loser_age, y = loser_rank)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ factor(loser_hand, levels = c("R", "L"))) +
  xlab("Loser Age (years)") +
  ylab("Loser Rank") +
  ggtitle("Rank Over Age Graph") +
  theme(text = element_text(size = 20))
ATP_plot4
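
The plots above show winners and losers in separate figures. To compare how a single characteristic is distributed among winning and losing players in one figure, the winner and loser columns can be stacked into a long table and faceted by outcome. This is only a sketch on made-up heights (`toy_training`, the values, and the bin width are assumptions).

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the height columns of the training set.
toy_training <- tibble(
  winner_ht = c(185, 188, 180, 198, 191),
  loser_ht  = c(183, 185, 188, 180, 178)
)

# Stack winner and loser heights into one column, labelled by outcome.
heights_long <- bind_rows(
  transmute(toy_training, outcome = "Winner", ht = winner_ht),
  transmute(toy_training, outcome = "Loser", ht = loser_ht)
)

# One histogram panel per outcome, sharing the same x axis.
height_plot <- ggplot(heights_long, aes(x = ht)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ outcome) +
  xlab("Height (in cm)") +
  ylab("Number of players") +
  ggtitle("Height Distribution by Match Outcome")
height_plot
```

Because both panels share an axis, differences between winners and losers are easier to spot than when flipping between two separate plots.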