# Using age and racket hand to predict the winners of tennis games

Question: Can we predict the winner of a tennis match based on the players' ages and the hands they use to hold the racket?

### Introduction
This project uses the tennis match dataset compiled by Jeff Sackmann, which records match statistics along with attributes of each match's winner and loser. We will use matches played between 2017 and 2019 and focus on the columns titled “winner_hand”, “winner_age”, “loser_hand”, and “loser_age”. “Hand” refers to the hand the athlete used to hold their racket during the match, and “age” refers to the athlete’s recorded age at the time of the match. Using these data, we will attempt to answer the question: can we predict the winner of a tennis match from the players' ages and the hands they use to hold the racket?

### Preliminary exploratory data analysis
We read the dataset with “read_csv()”. To tidy the data, we use “select()” to keep only the columns we need, including “winner_age”, “winner_hand”, “loser_age”, and “loser_hand”.

In [2]:
library(tidyverse)
library(repr)
library(dplyr)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [None]:
tennis_data <- read_csv("https://raw.githubusercontent.com/keelbeier/dsci100-group69/main/atp2017-2019.csv")
tennis_data

New names:
• `` -> `...1`
Rows: 6866 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): tourney_id, tourney_name, surface, tourney_level, winner_seed, win...
dbl (34): ...1, draw_size, tourney_date, match_num, winner_id, winner_ht, wi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


**Above:** our csv file read into R

Now, we will wrangle our data into a tidy format. We will remove all columns that we do not need.

In [1]:
# Keep only the columns needed for the analysis.
# Run the library(tidyverse) cell first; otherwise select() is not found.
tennis_data <- tennis_data |>
    select(tourney_date, winner_hand,
           winner_age, loser_hand, loser_age)

# Round ages to whole years so they can be counted in the bar graph
winner_age_rounded <- round(tennis_data$winner_age)
loser_age_rounded <- round(tennis_data$loser_age)


In [None]:
sum_hand_win <- table(tennis_data$winner_hand)
sum_hand_win
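
The `table()` call above counts match wins per racket hand. As a rough sketch, a tidyverse version that also reports mean ages could look like the following. The helper name `summarise_winner_hand` and the `toy_matches` values are made up for illustration; only the four column names come from the dataset.

```r
library(tidyverse)

# Hypothetical helper (not part of the original analysis): counts match wins
# and averages winner and loser ages for each winner racket hand.
summarise_winner_hand <- function(matches) {
  matches |>
    group_by(winner_hand) |>
    summarise(
      n_wins = n(),
      mean_winner_age = mean(winner_age, na.rm = TRUE),
      mean_loser_age = mean(loser_age, na.rm = TRUE),
      .groups = "drop"
    )
}

# Toy stand-in for tennis_data; these values are invented, not real results
toy_matches <- tibble(
  winner_hand = c("R", "R", "L", "R", "L"),
  winner_age  = c(24.1, 30.5, 27.3, 22.8, 33.0),
  loser_hand  = c("R", "L", "R", "R", "R"),
  loser_age   = c(28.4, 25.0, 31.2, 29.9, 21.7)
)

hand_summary <- summarise_winner_hand(toy_matches)
hand_summary
```

Calling `summarise_winner_hand(tennis_data)` after the `select()` step should give the same kind of table for the real matches.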

To see how the data are distributed, we create a bar graph that counts match winners at each age (rounded to the nearest year), with bars colored by the winner's racket hand.

In [None]:
options(repr.plot.width=8, repr.plot.height=7)
tennis_plot <- tennis_data |> 
    ggplot(aes(x = winner_age_rounded, fill = winner_hand)) +
    geom_bar(width = 0.2) +
    labs(x = "Winner age (years, rounded)",
        y = "Number of match wins") +
    theme(text = element_text(size = 20)) + 
    scale_fill_discrete(name = "Winner Racket Hand", labels = c("Left", "Right", "Undefined"))
tennis_plot

### Methods

We will use classification to predict the winner of a match from the players' ages and racket hands. The analysis will use the K-nearest neighbors algorithm from the tidymodels package, with the “winner_age”, “winner_hand”, “loser_age”, and “loser_hand” columns forming our training data set.
We will visualize our results with the ggplot2 package from the tidyverse. The visualization will use different shapes to represent the players' racket hands and two colors to distinguish winners from losers, with winners’ ages on the x-axis and losers’ ages on the y-axis.
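
One possible way to set up the K-nearest neighbors step is sketched below. It rests on an assumption that is not stated in the proposal: because every row of the dataset already lists the winner first, the rows are reshaped so that the winner is randomly placed as "player 1" or "player 2", giving the classifier a label (`p1_won`) to predict. The helper `to_player_rows`, the `p1_*`/`p2_*` column names, the choice of 5 neighbors, and the randomly generated `toy_matches` are all illustrative, and the kknn package is assumed to be installed.

```r
library(tidyverse)
library(tidymodels)

set.seed(2022)

# Hypothetical helper: randomly assigns the winner to slot 1 or slot 2 so the
# outcome is not constant. Hands are encoded as 1 = left, 0 = anything else.
to_player_rows <- function(matches) {
  matches |>
    drop_na(winner_age, loser_age, winner_hand, loser_hand) |>
    mutate(
      swap    = runif(n()) < 0.5,
      p1_age  = if_else(swap, loser_age, winner_age),
      p2_age  = if_else(swap, winner_age, loser_age),
      p1_left = as.numeric(if_else(swap, loser_hand, winner_hand) == "L"),
      p2_left = as.numeric(if_else(swap, winner_hand, loser_hand) == "L"),
      p1_won  = factor(if_else(swap, "no", "yes"), levels = c("no", "yes"))
    ) |>
    select(p1_age, p2_age, p1_left, p2_left, p1_won)
}

# Toy stand-in for tennis_data (random values, not real matches)
toy_matches <- tibble(
  winner_age  = runif(200, 18, 38),
  winner_hand = sample(c("R", "L"), 200, replace = TRUE, prob = c(0.85, 0.15)),
  loser_age   = runif(200, 18, 38),
  loser_hand  = sample(c("R", "L"), 200, replace = TRUE, prob = c(0.85, 0.15))
)

match_data <- to_player_rows(toy_matches)

# Hold out 25% of matches for testing
match_split <- initial_split(match_data, prop = 0.75, strata = p1_won)
match_train <- training(match_split)
match_test  <- testing(match_split)

# Standardize predictors so age does not dominate the distance calculation
knn_recipe <- recipe(p1_won ~ ., data = match_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K = 5 is an arbitrary illustrative choice, not a tuned value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = match_train)

# Predict on the held-out matches and report accuracy
match_predictions <- predict(knn_fit, match_test) |>
  bind_cols(match_test)

match_predictions |>
  metrics(truth = p1_won, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```

Since the toy ages and hands are random, accuracy here should hover around chance; the sketch only shows the shape of the workflow. With the real data, `to_player_rows(tennis_data)` would replace the toy input.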

### Expected Outcomes

By applying the methods described in the Methods section, we expect to produce a scatter plot of winner age versus loser age, with point shape and color set by the winner’s racket hand. We expect the plot to show the age range in which winners are most concentrated. Our guess is that winners will tend to be younger than losers, so the points will cluster toward the low end of the winner-age axis. Scanning the first rows of the data set, we can see that the majority of players are right-handed, so we expect the shape and color for right-handed players to appear most often.
The results of the study could be used to predict the winner of a tennis match before it starts, which could be useful for sports betting.
Some future questions that the results could raise include: Is the energy of tennis players affected by age? Does the experience gained with age help in winning a match?
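
The scatter plot anticipated in this section could be drawn along the following lines. This is only a sketch: the helper name `plot_age_scatter`, the dashed reference line, and the randomly generated `toy_matches` are illustrative additions, and only the column names come from the dataset.

```r
library(tidyverse)

# Hypothetical helper: winner age against loser age, with color and shape
# set by the winner's racket hand.
plot_age_scatter <- function(matches) {
  matches |>
    ggplot(aes(x = winner_age, y = loser_age,
               color = winner_hand, shape = winner_hand)) +
    geom_point(alpha = 0.5) +
    # Points above the dashed line are matches where the winner was younger
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
    labs(x = "Winner age (years)",
         y = "Loser age (years)",
         color = "Winner racket hand",
         shape = "Winner racket hand") +
    theme(text = element_text(size = 20))
}

set.seed(1)

# Toy stand-in for tennis_data (random values, not real matches)
toy_matches <- tibble(
  winner_age  = runif(100, 18, 38),
  loser_age   = runif(100, 18, 38),
  winner_hand = sample(c("L", "R"), 100, replace = TRUE, prob = c(0.15, 0.85))
)

age_plot <- plot_age_scatter(toy_matches)
age_plot
```

If winners really do tend to be younger, most points for the real data would fall above the dashed line when `plot_age_scatter(tennis_data)` is called.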