### Capstone Project: ATP Tennis Match

Predict the winner of a professional men's tennis match using ATP match data from 2018. A reliable prediction could help someone betting on a tennis match pick the winner more often.

I will use the data curated by Jeff Sackmann, derived from Association of Tennis Professionals (ATP) data.
See https://github.com/JeffSackmann/tennis_atp/blob/master/atp_matches_2018.csv

The data consists of 2889 observations. Some of the more interesting columns are listed below:

| Variable| Description |
|---------|----------------|
|winner_seed| winner's seeding within tournament |
|winner_entry| 'WC' = wild card, 'Q' = qualifier, 'LL' = lucky loser, 'PR' = protected ranking, 'ITF' = ITF entry, and there are a few others that are occasionally used |
|winner_name| winner's name |
|winner_hand| R = right, L = left, U = unknown. For ambidextrous players, this is their serving hand |
|winner_ht| height in centimeters, where available |
|winner_ioc| three-character country code |
|winner_age| age, in years, as of the tourney_date |
|w_ace| winner's number of aces |
|w_df| winner's number of double faults|
|w_svpt| winner's number of serve points|
|w_1stIn| winner's number of first serves made|
|w_1stWon| winner's number of first-serve points won|
|w_2ndWon| winner's number of second-serve points won|
|w_SvGms| winner's number of serve games|
|w_bpSaved| winner's number of break points saved|
|w_bpFaced| winner's number of break points faced|
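
The serve counts above are easier to compare across matches once they are turned into rates. The following is a minimal sketch of that idea, not part of the original analysis: the helper `add_serve_rates`, the new column names, and the sample numbers are all hypothetical, and second-serve points are assumed to be `w_svpt - w_1stIn`.

```python
import pandas as pd

# Tiny illustrative frame with made-up numbers; the real values would come
# from atp_matches_2018.csv.
sample = pd.DataFrame({
    "w_svpt": [80, 60],
    "w_1stIn": [48, 45],
    "w_1stWon": [36, 30],
    "w_2ndWon": [16, 9],
    "w_bpSaved": [3, 0],
    "w_bpFaced": [4, 0],
})


def add_serve_rates(df, prefix="w_"):
    """Return a copy of df with serve-rate columns (hypothetical helper)."""
    out = df.copy()
    # Assumption: every serve point that was not a first serve in
    # counts as a second-serve point.
    second_serve_pts = out[f"{prefix}svpt"] - out[f"{prefix}1stIn"]
    out[f"{prefix}1st_in_pct"] = out[f"{prefix}1stIn"] / out[f"{prefix}svpt"]
    out[f"{prefix}1st_won_pct"] = out[f"{prefix}1stWon"] / out[f"{prefix}1stIn"]
    out[f"{prefix}2nd_won_pct"] = out[f"{prefix}2ndWon"] / second_serve_pts
    # A match with no break points faced yields 0/0, which pandas stores as NaN.
    out[f"{prefix}bp_saved_pct"] = out[f"{prefix}bpSaved"] / out[f"{prefix}bpFaced"]
    return out


rates = add_serve_rates(sample)
print(rates.filter(like="pct"))
```

Rows where a denominator is zero come out as missing values, so they would need to be handled before modeling.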

This will be a supervised classification problem. I will decide which algorithm to use later in the course.
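
Because every row already names the winner, a classifier needs a target that is not constant. One possible framing, sketched below under assumptions, is to assign the two players to "player 1" and "player 2" slots at random and predict whether player 1 won. The helper `to_player_frame`, the `p1_`/`p2_` column names, and the sample values are hypothetical; the sketch also assumes the data has `loser_` columns that mirror the `winner_` columns listed above.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; assumes loser_* columns mirror winner_* ones.
matches = pd.DataFrame({
    "winner_ht": [188, 183, 198, 175],
    "loser_ht": [185, 190, 180, 193],
    "winner_age": [31.2, 22.5, 28.0, 25.1],
    "loser_age": [24.7, 29.9, 33.4, 21.3],
})


def to_player_frame(df, features=("ht", "age"), seed=0):
    """Randomly place winner/loser into p1/p2 slots and build a binary target."""
    rng = np.random.default_rng(seed)
    # True means the winner is placed in the player 2 slot.
    swap = rng.random(len(df)) < 0.5
    out = pd.DataFrame(index=df.index)
    for feat in features:
        winner_vals = df[f"winner_{feat}"]
        loser_vals = df[f"loser_{feat}"]
        out[f"p1_{feat}"] = np.where(swap, loser_vals, winner_vals)
        out[f"p2_{feat}"] = np.where(swap, winner_vals, loser_vals)
    # Target: 1 if player 1 is the match winner, 0 otherwise.
    out["p1_won"] = (~swap).astype(int)
    return out


framed = to_player_frame(matches)
print(framed)
```

The sketch uses only height and age because they are known before the match starts. In-match statistics such as `w_ace` are only available afterwards, so using them directly as predictors would need care.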

### Exploring ATP Matches Data
---

In [None]:
# Standard imports
import pandas as pd
import numpy as np

# Visualization imports
import seaborn as sns
import matplotlib.pyplot as plt

# Specific imports
# Use the 'from' approach to import only the classes and functions we need.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Statistics imports
from scipy import stats
import statsmodels.api as sm

# magic and parameters
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14
plt.style.use("fivethirtyeight")

<a id="read-in-the--capital-bikeshare-data"></a>
### Read In the ATP Matches Data

In [None]:
# Read the 2018 ATP match data from a local CSV file.
url = './data/atp_matches_2018.csv'
df = pd.read_csv(url)
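
The `tourney_date` column is stored as an integer in `YYYYMMDD` form (for example `20180101`), so `read_csv` will not parse it as a date on its own. A minimal sketch of one way to convert it is shown here on a small stand-in frame; the third date value is made up for illustration.

```python
import pandas as pd

# Stand-in for the loaded data; the last value is a made-up example.
sample = pd.DataFrame({"tourney_date": [20180101, 20180101, 20180115]})

# Convert the YYYYMMDD integers to strings, then parse with an explicit format.
sample["tourney_date"] = pd.to_datetime(
    sample["tourney_date"].astype(str), format="%Y%m%d"
)

print(sample.dtypes)
```

The same two steps could be applied to `df["tourney_date"]` after loading, if date-based features or sorting turn out to be useful.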

In [9]:
df.head().T

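| | 0 | 1 | 2 | 3 | 4 |
|---------|----------------|----------------|----------------|----------------|----------------|
| tourney_id | 2018-M020 | 2018-M020 | 2018-M020 | 2018-M020 | 2018-M020 |
| tourney_name | Brisbane | Brisbane | Brisbane | Brisbane | Brisbane |
| surface | Hard | Hard | Hard | Hard | Hard |
| draw_size | 32 | 32 | 32 | 32 | 32 |
| tourney_level | A | A | A | A | A |
| tourney_date | 20180101 | 20180101 | 20180101 | 20180101 | 20180101 |
| match_num | 271 | 272 | 273 | 275 | 276 |
| winner_id | 105992 | 111577 | 104797 | 200282 | 111581 |
| winner_seed | NaN | NaN | NaN | NaN | NaN |
| winner_entry | NaN | NaN | NaN | WC | Q |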
