*If you're unfamiliar with Jupyter Notebooks, this one is pre-loaded such that all you need to do is scroll, read, and enjoy.*

*To interact with the file, click on cells with `[ ]:` to their left and run them one by one by pressing ▶️ ("Run") in the menu above or `Shift`+`Enter` on your keyboard. Go from top to bottom to keep things working properly.*

# `tennis_abs_api`, A Python API for *Tennis Abstract*

In short, `tennis_abs_api` is intended as a flexible Python package that **handles the grunt work of downloading and scrubbing historical match data** from Jeff Sackmann's [*Tennis Abstract*](http://www.tennisabstract.com) website **so users can get to the fun parts of tennis data analysis more quickly**.

It supports several levels of use, so **read on to learn more about levels I, II, and (especially) III of interaction** with the website.

We start by importing some other helpful packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.ticker  # ensures mpl.ticker is available for the bar plot below
import re

___
## I. Generate *Tennis Abstract* player page URLs. 🎾

If you're used to browsing *Tennis Abstract* and want a way to get links to the website's player pages in Python, you can **pass a player's name and tour to `ConstructURL`** and **get back the URL of their player page**.

Take Bianca Andreescu from the WTA, for example:

In [None]:
from construct_query import ConstructURL, DownloadStats
from IPython.display import Image

name = 'bianca andreescu' # capitalization doesn't matter
name_obj = ConstructURL(name, tour='WTA')
print(name_obj.URL)

If you follow the link, the top corner of the page should look like the following, with the data table just below:

In [None]:
Image('bianca.png')
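
For intuition only, here is a minimal sketch of how a typed name could be normalized into the kind of `p=` value seen in the *Tennis Abstract* URL used in section II (`p=RogerFederer`). The helper name `sketch_player_slug` is hypothetical, and this is not necessarily how `ConstructURL` works internally; real player pages may need handling for accents, hyphens, or tour-specific paths that this sketch ignores.

```python
def sketch_player_slug(name):
    """Hypothetical helper: collapse a free-form name into a CamelCase slug."""
    # split on any whitespace, capitalize each part, and join without spaces
    return ''.join(part.capitalize() for part in name.split())


# capitalization and extra spaces in the input don't change the result
print(sketch_player_slug('bianca andreescu'))
print(sketch_player_slug('ROGER   federer'))
```

Because each part is passed through `str.capitalize`, names with internal capitals (like "McEnroe") would be flattened, which is one reason to treat this as an illustration rather than a drop-in replacement.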

___
## II. Download match data from *Tennis Abstract* player page URLs. 🎾

If you're more interested in **downloading data for local use** in Python, you can convert *Tennis Abstract* URLs into `pandas` DataFrames that contain those webpages' match data.

For example, say you have a link to a page on *Tennis Abstract* that lists **stats for every quarterfinal, semifinal, and final Roger Federer played in 2015**.

In [None]:
stat_url = 'http://www.tennisabstract.com/cgi-bin/player-classic.cgi?p=RogerFederer&f=A2015qqE0i1i2'
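
If you're curious what such a link carries, the standard library can split it into its parts. The sketch below only separates the query string's `p` (player) and `f` (filter) parameters; how the site encodes individual filters inside `f` is not decoded here, and none of this is required to use `DownloadStats`.

```python
from urllib.parse import urlparse, parse_qs

stat_url = ('http://www.tennisabstract.com/cgi-bin/player-classic.cgi'
            '?p=RogerFederer&f=A2015qqE0i1i2')

parts = urlparse(stat_url)
params = parse_qs(parts.query)  # maps each key to a list of values

player = params['p'][0]   # the player identifier
filters = params['f'][0]  # the site's compact filter string, left as-is

print(parts.path)
print(player, filters)
```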

You can **pass the link to the `url` argument of `DownloadStats`** and wait for the download to finish _(speed depends on the table's size)_.

In [None]:
cp_stats = DownloadStats(url=stat_url)

Once `DownloadStats` compiles and scrubs the data, it's saved as a DataFrame in the resulting object's `match_data` attribute.

In [None]:
print(f"{cp_stats.name}, {cp_stats.title}")
cp_stats.match_data.head()

From here, the only limit for how to use the data is your creativity.

As an example, let's **visualize the relationship between Federer's double fault percentage ('DF%') and ace percentage ('A%') in quarterfinals and semifinals in 2015**. We'll fit lines to the data for both rounds and see if Federer's serve trended consistently at both stages.

_(Since this isn't a `pandas` or `matplotlib` tutorial, the visualization code is presented without much commentary.)_

In [None]:
# make smaller, round-specific DataFrames
qf_stats = cp_stats.match_data[cp_stats.match_data['Rd'] == 'QF'].copy()
sf_stats = cp_stats.match_data[cp_stats.match_data['Rd'] == 'SF'].copy()

# plot serve percentages for each round
ft_sz = 14
ax = qf_stats.plot(x='DF%', y='A%', figsize=(8, 6), kind='scatter', marker='o', lw=2,
                   fontsize=ft_sz, c='w', edgecolor='#008ca8', alpha=.7)
sf_stats.plot(x='DF%', y='A%', ax=ax, kind='scatter', marker='o', lw=2,
              fontsize=ft_sz, c='w', edgecolor='#1d1160', alpha=.7)

ax.xaxis.get_label().set_fontsize(ft_sz)
ax.yaxis.get_label().set_fontsize(ft_sz)

# fit lines to serve percentages
qf_fit = np.polyfit(qf_stats['DF%'], qf_stats['A%'], deg=1)
sf_fit = np.polyfit(sf_stats['DF%'], sf_stats['A%'], deg=1)

# plot fitted lines
qf_x = np.linspace(0, qf_stats['DF%'].max())
ax.plot(qf_x, qf_fit[0] * qf_x + qf_fit[1],
        linestyle='--', lw=2.5, c='#008ca8', label='in QF')

sf_x = np.linspace(0, sf_stats['DF%'].max())
ax.plot(sf_x, sf_fit[0] * sf_x + sf_fit[1],
        linestyle='--', lw=2.5, c='#1d1160', label='in SF')

# highlight an extreme point
extr = qf_stats[qf_stats['A%'] == qf_stats['A%'].max()].iloc[0].copy()

opp = re.findall(r'\s\w*\s\[', extr['Result'])[0].split()[0]
ax.text(extr['DF%'], extr['A%'] - 3,
        f"({extr['Tournament']}, vs. {opp})",
        fontsize=ft_sz - 2, style='italic', horizontalalignment='center')

# set other plot styling options
ax.set_title("Federer's 2015 serve stats by round", fontsize=ft_sz + 2)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(fontsize=ft_sz)
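
The line fits in the cell above come from `np.polyfit` with `deg=1`, which returns the slope first and the intercept second. As a small sanity check of that ordering, here is a sketch on made-up points that lie exactly on a known line (these are not real match stats):

```python
import numpy as np

# made-up points that lie exactly on y = 2x + 1 (not real match data)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# coefficients come back highest power first: [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# evaluate the fitted line on a grid, as the plotting code does
grid = np.linspace(0, x.max(), 5)
fitted = slope * grid + intercept

print(round(slope, 6), round(intercept, 6))
```

With real data, rows containing `NaN` in either column would need to be dropped before fitting, since `np.polyfit` does not skip missing values.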

___
## III. Use *custom queries* to download match data. 🎾

That's right, **you can download match data even if you don't have the matching _Tennis Abstract_ URL**. Just pass a dictionary to `DownloadStats`' `attrs` argument to filter data just as you would on the site. ([See this page](https://github.com/ojustino/tennis-abs-api/blob/master/attrs_docs.md) for more on how to build `attrs`.)

Let's **create a query for a subset of the data on Martina Navratilova's player page**.

**in English**:
<br>
_Martina Navratilova matches between 1977-01-01 and 1991-01-01._

**as a `DownloadStats` query:**

In [None]:
qu_stats = DownloadStats('Martina Navratilova', tour='WTA',
                         attrs={'start date': pd.Timestamp(1977, 1, 1),
                                'end date': pd.Timestamp(1991, 1, 1)})

In [None]:
print(f"{qu_stats.name} {qu_stats.title}")
qu_stats.match_data.head()

_Note that older data is more likely to be incomplete, hence the `NaN`s in the resulting DataFrame._
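
One way to gauge how incomplete a download is would be to measure the share of missing values per column before analyzing it. The sketch below uses a small placeholder DataFrame (the values are invented, and only the column names `Surface`, `A%`, and `DF%` are borrowed from this notebook); with a real download, the same calls would be applied to `match_data`.

```python
import numpy as np
import pandas as pd

# placeholder rows, not real Tennis Abstract data
toy = pd.DataFrame({'Surface': ['Hard', 'Clay', 'Grass', 'Carpet'],
                    'A%': [np.nan, 4.2, np.nan, 6.1],
                    'DF%': [np.nan, 3.0, 2.5, 1.9]})

# fraction of missing values in each column
missing_frac = toy.isna().mean()
print(missing_frac)

# keep only rows where both serve stats are present
complete = toy.dropna(subset=['A%', 'DF%'])
print(len(complete))
```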

As before, we get a class instance with a `pandas` DataFrame that's ready for use in analyses and visualizations. Here, we'll **plot the percentage of tournaments Navratilova entered by surface** for each year in the set.

In [None]:
# add a 'Year' column to a copy of the downloaded DataFrame
qu_df = qu_stats.match_data.copy()
qu_df['Year'] = qu_df['Date'].apply(lambda d: d.year)

# create a new DataFrame with columns relevant to our plot...
# 1) a column for the number of tournaments entered per surface per year
by_yr_sf = (qu_df.groupby(['Year', 'Surface'])[['Tournament']].nunique()
            .rename(columns={'Tournament': 'Tns'}))

# 2) a column for the fraction of tournaments entered by surface,
# normalized by each year's total number of tournaments
by_yr_sf['Frac_Tns'] = by_yr_sf['Tns'] / by_yr_sf.groupby('Year')['Tns'].sum()
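
To see the per-year normalization on its own, here is a sketch with invented counts. It uses `transform('sum')` as an alternative way to get each year's total repeated on every row, which sidesteps any reliance on index alignment between a two-level index and a one-level one; the result should match the division used above.

```python
import pandas as pd

# placeholder counts, not real data
toy = pd.DataFrame({'Year': [1980, 1980, 1981, 1981, 1981],
                    'Surface': ['Grass', 'Hard', 'Clay', 'Grass', 'Hard'],
                    'Tns': [1, 3, 2, 2, 4]}).set_index(['Year', 'Surface'])

# transform('sum') returns each year's total, repeated on every row
year_totals = toy.groupby(level='Year')['Tns'].transform('sum')
toy['Frac_Tns'] = toy['Tns'] / year_totals

# each year's fractions should add up to 1
yearly_sum = toy.groupby(level='Year')['Frac_Tns'].sum()
print(toy)
print(yearly_sum)
```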

In [None]:
# make pivot table to help with plot
pv0 = (by_yr_sf.reset_index()
       .pivot(index='Year', columns='Surface', values='Frac_Tns').fillna(0))

# and make the plot, setting specific colors for each surface
clr_dict = {'Carpet': '#6b4f7f', 'Clay': '#c9480e',
            'Grass': '#9ab558', 'Hard': '#1b91cf'}
bar_clrs = [clr_dict[cl] for cl in pv0.columns]

ft_sz = 14
ax = pv0.plot.bar(stacked=True, figsize=(6, 4), fontsize=ft_sz,
                  color=bar_clrs, edgecolor='k', width=1, rot=45)
ax.get_yaxis().set_major_formatter(mpl.ticker.PercentFormatter(1))

# make other style modifications
ax.xaxis.get_label().set_fontsize(ft_sz)
ax.yaxis.get_label().set_fontsize(ft_sz)

ax.set_title("Navratilova's tournaments entered by surface",
             fontsize=ft_sz + 2)
ax.spines['left'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(fontsize=ft_sz, bbox_to_anchor=(1.05, 1))
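
The pivot step is what turns one row per (year, surface) pair into one row per year with a column per surface, which is the shape a stacked bar plot expects. A minimal sketch with invented fractions shows why `fillna(0)` is needed: a surface that is absent in a given year would otherwise appear as `NaN`.

```python
import pandas as pd

# placeholder long-format table, not real data
long_df = pd.DataFrame({'Year': [1980, 1980, 1981],
                        'Surface': ['Grass', 'Hard', 'Hard'],
                        'Frac_Tns': [0.25, 0.75, 1.0]})

# one row per year, one column per surface; missing pairs become 0
wide = long_df.pivot(index='Year', columns='Surface',
                     values='Frac_Tns').fillna(0)
print(wide)
```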

We can use similar steps to **make a plot for total matches played per surface each year**.

In [None]:
# 3) a column for total matches by surface per year
by_yr_sf['Mts'] = qu_df.groupby(['Year', 'Surface'])['Date'].count()

# 4) a column for fraction of matches by surface,
# normalized by each year's total number of matches
by_yr_sf['Frac_Mts'] = by_yr_sf['Mts'] / by_yr_sf.groupby('Year')['Mts'].sum()

In [None]:
# make pivot table to help with plot
pv1 = (by_yr_sf.reset_index()
       .pivot(index='Year', columns='Surface', values='Mts').fillna(0))

# and make the plot, setting specific colors for each surface
clr_dict = {'Carpet': '#6b4f7f', 'Clay': '#c9480e',
            'Grass': '#9ab558', 'Hard': '#1b91cf'}
bar_clrs = [clr_dict[cl] for cl in pv1.columns]

ft_sz = 14
ax = pv1.plot.bar(stacked=True, figsize=(6, 4), fontsize=ft_sz,
                  color=bar_clrs, edgecolor='w', width=1, rot=45)

# make other style modifications
ax.xaxis.get_label().set_fontsize(ft_sz)
ax.yaxis.get_label().set_fontsize(ft_sz)

ax.set_title("Navratilova's matches played by surface",
             fontsize=ft_sz + 2)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(fontsize=ft_sz, bbox_to_anchor=(1.05, 1))

**Getting back to queries, let's try another example with `DownloadStats`.**

**in English**:
<br>
_Serena Williams matches in the Round of 32 or Round of 16 of Premier-level tournaments on clay or carpet between 1999-08-07 and 2010-04-14 where she was ranked in the top 5._

**as a `DownloadStats` query:**

In [None]:
qu_stats1 = DownloadStats('Serena Williams', tour='WTA',
                          attrs={'surface': ['carpet', 'clay'],
                                 'level': 'premier', 'as rank': 'Top 5',
                                 'start date': pd.Timestamp(1999, 8, 7),
                                 'end date': pd.Timestamp(2010, 4, 14),
                                 'round': ['R16', 'Round of 32']})

In [None]:
print(f"{qu_stats1.name}, {qu_stats1.title}")
qu_stats1.match_data.head()

Finally, **we'll end with a convoluted and specific query** to help reassure the author that things are working as intended.

**in English**:
<br>
_Andy Murray matches where he was seeded -- and the opponent (who is not Arnaud Clement or Guillermo Canas) was right-handed, shorter than Murray, ranked between 14 and 112 at the time of the match, either seeded or given a wild card, and currently inactive -- that ended in straight sets or 4 sets and featured at least 1 tiebreak._

**as a `DownloadStats` query:**

In [None]:
qu_stats2 = DownloadStats('Andy Murray', tour='ATP',
                          attrs={'vs height': 'Shorter', 'vs hand': 'right',
                                 'vs entry': ['wild card', 'seeded'],
                                 'as entry': 'seeded',
                                 'vs current rank': 'inactive',
                                 'vs rank': (14, 112), 'score': 'all 7-6',
                                 'sets': ['straights', '4 of 5 sets'],
                                 'exclude opp': ['Arnaud Clement',
                                                 'Guillermo Canas']})

In [None]:
print(f"{qu_stats2.name} {qu_stats2.title}")
qu_stats2.match_data.head()
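
To make a range-style filter such as `'vs rank': (14, 112)` more concrete, here is how the same idea looks when applied locally to a DataFrame. The column name `opp_rank` and its values are hypothetical, and treating both endpoints as inclusive is an assumption of this sketch rather than a documented behavior of `DownloadStats`.

```python
import pandas as pd

# hypothetical opponent ranks, not real data
toy = pd.DataFrame({'opp_rank': [5, 14, 60, 112, 200]})

low, high = (14, 112)

# Series.between includes both endpoints by default
in_range = toy[toy['opp_rank'].between(low, high)]
print(in_range)
```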

___
## As I hope you've stayed to see...

...`tennis_abs_api` allows for **multiple levels of interaction** with *Tennis Abstract* and can act as **a gateway to deeper match data analysis**. It's under active development [on GitHub](https://github.com/ojustino/tennis-abs-api), so any constructive critiques (or compliments) are welcome there. **Thank you for reading.**