# Analysis 100

## Purpose
* This is the first analysis notebook of our project; here we begin running commands on our cleaned dataframes.
* We have three research questions to answer for our project, and these analysis notebooks are where we begin to answer them.
* These questions are:
    * How has the game of tennis developed since 1968?
    * What factors impact the success of a top player?
    * Can we predict the next young Grand Slam winner?
* In this notebook we will focus on research question 1 'How has the game of tennis developed since 1968?'

## Datasets
* These dataframes were created in the Prep notebooks of this project.
* atp_main; a dataframe which holds each match (as a row) of the Men's Singles game from 1968-2017.
* atp_small; a smaller dataframe which holds each Men's Singles match from 2003-2014. This was created in order to find a corresponding dataframe to wta_dataset which is explained below.

In [1]:
# Importing relevant libraries
import os
import sys
import hashlib
import numpy as np
import pandas as pd
    
%matplotlib inline

In [2]:
# Checking if path exists; report success only when the file is actually there
if not os.path.exists("../data/atp_main"):
    print("Missing project folder")
else:
    print("Success!")

Success!


## Reading the files

In [3]:
# Open and read dataframes
# Additionally for this notebook, we need a time series index for our analysis
# As we are comparing data from particular dates, a time series index makes this easier
atp_main = pd.read_csv("../data/atp_main", low_memory = False, index_col = 'tourney_date')

In [4]:
# Index by date
atp_main.index = pd.to_datetime(atp_main.index, format="%Y-%m-%d", errors='coerce')
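
A small sketch of what `errors='coerce'` does, using made-up date strings rather than the project data: values that do not match the format become `NaT` instead of raising an error, so it can be worth counting them before relying on the index. The sample values and the names `raw_dates`, `parsed` and `n_bad` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical raw index values; the last two are deliberately malformed
raw_dates = pd.Index(["2010-06-21", "2011-06-20", "tourney_date", "21 June 2010"])

# Entries that do not match the format are turned into NaT rather than raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Count how many rows lost their date so they can be inspected or dropped
n_bad = int(parsed.isna().sum())
print(parsed)
print("Rows without a valid date:", n_bad)
```

If the count is not zero, those rows will be silently excluded from any date-based slice of the dataframe.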

In [5]:
atp_small = pd.read_csv("../data/atp_small", low_memory = False)

## RQ1: How has the game of tennis developed since 1968?
* We have chosen numerous areas to focus on in answering this question, namely;
    * Length of matches.
    * Each of the 4 tennis surfaces where matches can be played on and their prevalence over the years.
    * The Ace - an ace is made in tennis when the server hits his/her serve in such a way that the opponent cannot touch it i.e. it is too fast or at too much of an angle.
    * The countries producing the tennis talent.

#### A view of the length of tennis matches since 1990, on average.
<br> 
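We have chosen to investigate from 1990 onwards, as match minutes only began to be recorded around then in our dataset. The cell below filters on `match_year > 1990`, so the yearly table starts at 1991.

We can see how the game has progressed: average match length dips in the mid 90s and then rises fairly steadily.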

In [6]:
atp_main[atp_main.match_year>1990].groupby(['match_year'])['minutes'].mean()

match_year
1991.0    103.081248
1992.0    106.141569
1993.0    100.063582
1994.0     93.702773
1995.0     92.700000
1996.0     92.550410
1997.0     95.753036
1998.0     93.967700
1999.0     98.295516
2000.0     99.180150
2001.0    100.915460
2002.0    101.871731
2003.0    100.446547
2004.0    100.525182
2005.0    101.984885
2006.0    104.451702
2007.0    103.163578
2008.0    105.386257
2009.0    107.541621
2010.0    106.954713
2011.0    108.318943
2012.0    110.297277
2013.0    103.839648
2014.0    104.377234
2015.0    105.039970
2016.0    108.828708
2017.0    125.935829
Name: minutes, dtype: float64

In the 90s (1990 to 2000 inclusive) the average match length is 97.6 minutes.

In [7]:
avg_min_90s = atp_main['1990':'2000']['minutes'].mean()
avg_min_90s

97.60476479417494

In [8]:
avg_min_00s = atp_main['2005':'2015']['minutes'].mean()
avg_min_00s

105.56791927627

* Comparing 1990-2000 with 2005-2015, the average match length has risen from 97.6 to 105.6 minutes.
* A rise of about 8 minutes on average! Why is this?
    * Introduction of Hawk-Eye?

#### Hawk-Eye
* This technology was introduced in 2005, allowing players to challenge a line call. Traditionally, players would accept the umpire's call on whether the ball was in or out and get on with the game.
* With Hawk-Eye, a challenge follows this process: the player thinks about challenging, the player challenges the call, the umpire announces the challenge to the crowd, the crowd waits for the big screen to show the replay of the ball, the big screen shows whether the ball was in or out, the crowd cheers, and finally the players are ready to get back to the match.

*From 2006 onwards, the yearly average match length in the Men's game has never fallen below 103 minutes (2005 itself averaged about 102 minutes). Before Hawk-Eye (1991-2004), 12 of the 14 yearly averages, roughly 85%, were below 103 minutes.*
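
The yearly averages shown earlier can be checked against a 103-minute threshold. This is a minimal sketch, with the values retyped and rounded from the `In [6]` output; `yearly_avg`, `THRESHOLD`, `before` and `after` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# Yearly mean match length in minutes, rounded from the In [6] output
yearly_avg = pd.Series({
    1991: 103.08, 1992: 106.14, 1993: 100.06, 1994: 93.70, 1995: 92.70,
    1996: 92.55, 1997: 95.75, 1998: 93.97, 1999: 98.30, 2000: 99.18,
    2001: 100.92, 2002: 101.87, 2003: 100.45, 2004: 100.53, 2005: 101.98,
    2006: 104.45, 2007: 103.16, 2008: 105.39, 2009: 107.54, 2010: 106.95,
    2011: 108.32, 2012: 110.30, 2013: 103.84, 2014: 104.38, 2015: 105.04,
    2016: 108.83, 2017: 125.94,
})

THRESHOLD = 103  # minutes

# Label-based slices on the integer year index are inclusive at both ends
before = yearly_avg.loc[1991:2004]
after = yearly_avg.loc[2005:2017]

# Share of pre-2005 seasons below the threshold
share_below_before = (before < THRESHOLD).mean()

# Seasons from 2005 onwards that still fall below the threshold
years_below_after = after[after < THRESHOLD].index.tolist()

print(f"Share of 1991-2004 seasons below {THRESHOLD} min: {share_below_before:.1%}")
print("Seasons from 2005 below the threshold:", years_below_after)
```

On these rounded values, 12 of the 14 seasons before 2005 fall below the threshold, and 2005 is the only later season that does.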

##### Average tennis match times by surface

In [9]:
# Group this by surface type
avg_mins_df = atp_main['1990':'2000'].groupby('surface', as_index=False)['minutes'].mean()
avg_mins_df

| | surface | minutes |
|---|---|---|
| 0 | Carpet | 91.649648 |
| 1 | Clay | 99.009918 |
| 2 | Grass | 104.926525 |
| 3 | Hard | 96.983812 |


In [10]:
# Group this by surface type
avg_mins_df = atp_main['2005':'2015'].groupby('surface', as_index=False)['minutes'].mean()
avg_mins_df

| | surface | minutes |
|---|---|---|
| 0 | Carpet | 93.486865 |
| 1 | Clay | 106.45045 |
| 2 | Grass | 110.954738 |
| 3 | Hard | 104.403425 |


* Average match length increased on every surface between the two periods.
* Grass has the longest matches on average, followed by clay. Interestingly, grass is actually the faster surface: the ball bounces low, so players must reach it sooner. Clay, on the other hand, is slower for the opposite reason, as the ball bounces higher.
* It is a surprising result that grass has the longest matches. Match format may be part of the explanation, since these averages mix best of 3 and best of 5 matches; format is examined in the Set formats section.
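
One way to test whether the surface ranking is driven by match format is to group by surface and format together. The sketch below uses made-up numbers purely to show the mechanics; the `toy` dataframe and its values are assumptions and say nothing about what `atp_main` actually contains.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from atp_main
toy = pd.DataFrame({
    "surface": ["Grass", "Grass", "Grass", "Clay", "Clay", "Clay"],
    "best_of": [5, 5, 3, 3, 3, 5],
    "minutes": [150.0, 140.0, 80.0, 95.0, 105.0, 160.0],
})

# The overall mean per surface mixes the two formats together
overall = toy.groupby("surface")["minutes"].mean()

# Splitting by format as well compares like with like
by_format = (
    toy.groupby(["surface", "best_of"])["minutes"]
    .mean()
    .unstack("best_of")
)

print(overall)
print(by_format)
```

In this toy data grass has the higher overall mean, yet clay is longer within each format, because grass has proportionally more best of 5 rows. Running the same grouping on `atp_main` would show whether something similar happens in the real data.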

#### Longest tennis match in history

In [11]:
# Finding the longest tennis match in history in our dataframe
atp_main.loc[(atp_main['winner_name'] == 'John Isner') & (atp_main['loser_name'] == 'Nicolas Mahut')]

| tourney_date | tourney_id | tourney_name | surface | draw_size | tourney_level | match_num | winner_id | winner_seed | winner_entry | winner_name | ... | l_ace | l_df | l_svpt | l_1stIn | l_1stWon | l_2ndWon | l_SvGms | l_bpSaved | l_bpFaced | match_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2010-06-21 | 2010-540 | Wimbledon | Grass | 128 | G | 60 | 104545 | 23.0 | | John Isner | ... | 103 | 21 | 489 | 328 | 284 | 101 | 91 | 12 | 14 | 2010.0 |
| 2011-06-20 | 2011-540 | Wimbledon | Grass | 128 | G | 42 | 104545 | | | John Isner | ... | 8 | 6 | 104 | 68 | 48 | 21 | 16 | 2 | 5 | 2011.0 |
| 2012-07-09 | 2012-315 | Newport | Grass | 32 | A | 17 | 104545 | 1.0 | | John Isner | ... | 5 | 7 | 70 | 31 | 22 | 20 | 10 | 6 | 8 | 2012.0 |


* The table above shows that the players who contested the longest match in history were drawn against each other at Wimbledon again the very next year. John Isner won again, but much more quickly; he also beat Mahut at Newport in 2012.
* Interestingly, the longest match ever was played on grass.

In [12]:
atp_main['score'].loc[(atp_main['winner_name'] == 'John Isner') & (atp_main['loser_name'] == 'Nicolas Mahut')]

tourney_date
2010-06-21    6-4 3-6 6-7(7) 7-6(3) 70-68
2011-06-20              7-6(4) 6-2 7-6(6)
2012-07-09                     6-2 7-6(2)
Name: score, dtype: object

In [13]:
atp_main['minutes'].loc[(atp_main['winner_name'] == 'John Isner') & (atp_main['loser_name'] == 'Nicolas Mahut')]

tourney_date
2010-06-21    665.0
2011-06-20    123.0
2012-07-09     78.0
Name: minutes, dtype: float64
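
To put the 2010 figure in more familiar units, the recorded minutes can be converted to hours and minutes. This is a small worked example; `longest_minutes` is an illustrative name holding the value from the output above.

```python
# Duration of the 2010 Isner v Mahut match as recorded in the `minutes` column
longest_minutes = 665

# divmod returns the whole hours and the remaining minutes in one step
hours, minutes = divmod(longest_minutes, 60)
print(f"{hours} h {minutes:02d} min")  # 11 h 05 min
```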

## Set formats
* Each tournament uses either a best of 3 or a best of 5 set match format. Best of 5 matches can take much longer, so this section factors that in.
* Below we look at matches between 1990 and 2015 and compare match length by format.
* We do this using the 'best_of' column in our table, which indicates the format of a match.

In [14]:
best_of_df = atp_main['1990':'2000'].groupby('best_of', as_index=False)['minutes'].mean()
best_of_df

| | best_of | minutes |
|---|---|---|
| 0 | 3 | 92.249637 |
| 1 | 5 | 136.740249 |


* From the table above we can see that, on average, 'best_of' 5 matches are about 44 minutes longer than 'best_of' 3 matches. This is to be expected, but we must now focus on how this has changed over the past two decades.

In [15]:
best_of_df = atp_main['2000':'2015'].groupby('best_of', as_index=False)['minutes'].mean()
best_of_df

| | best_of | minutes |
|---|---|---|
| 0 | 3 | 94.615799 |
| 1 | 5 | 145.097671 |


* We see a small increase of about 2.4 minutes in best of 3 matches.
* A larger increase of about 8.4 minutes in best of 5 matches.
* Note that the two periods compared here overlap in 2000.
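
The change per format can be worked out directly from the two tables above. This is a sketch with the means copied from those outputs; the Series and column names are illustrative assumptions.

```python
import pandas as pd

# Mean minutes by match format, copied from the two outputs above
period_1990_2000 = pd.Series({3: 92.249637, 5: 136.740249})
period_2000_2015 = pd.Series({3: 94.615799, 5: 145.097671})

# Absolute and relative change between the two periods, per format
change = pd.DataFrame({
    "minutes_change": period_2000_2015 - period_1990_2000,
    "percent_change": (period_2000_2015 / period_1990_2000 - 1) * 100,
})
change.index.name = "best_of"

print(change.round(2))
```

Showing the relative change alongside the absolute one helps when comparing formats whose baseline lengths differ this much.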

* These results are interesting as they show a slight dip in the mid 90s in terms of average match length, then a steady increase through the 2000s before a big jump in 2017.
* In the mid 80s, the tennis scoring system started to change for a variety of reasons. These changes resulted in shorter matches and were eventually phased in throughout the sport by the late 80s. If a final, deciding set was required, that set would be played in a tie-break format, which is analogous to winning quick-fire points.
* A startling trend is the early 2000s versus the late 2010s. Match lengths have increased dramatically here, and the following is our reasoning for this.

## Hard Court Matches

In [16]:
# Hard court matches make up a large share of all matches played
atp_main['surface'].value_counts()

Hard       63821
Clay       59565
Carpet     19833
Grass      18179
surface        1
Name: surface, dtype: int64
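
The counts above include a single stray `surface` entry, which looks like a header line that ended up in the data; that reading is an assumption. A minimal sketch of filtering such rows out before counting, using a made-up column rather than `atp_main`:

```python
import pandas as pd

# Made-up column for illustration; the last value mimics a repeated header row
surface = pd.Series(["Hard", "Clay", "Grass", "Carpet", "Hard", "surface"])

VALID_SURFACES = ["Hard", "Clay", "Grass", "Carpet"]

# Keep only rows whose surface is one of the four real court types
clean_surface = surface[surface.isin(VALID_SURFACES)]
counts = clean_surface.value_counts()

print(counts)
```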

Why hard court? One hypothesis is that faster courts produce more aces and therefore better entertainment. Aces are examined in analysis_200.

* Has this always been the case?

In [17]:
df_1968 = atp_main['1968':'1978']
df_1968  = df_1968['surface'].value_counts()
df_1968

Clay      13640
Hard       7337
Carpet     5373
Grass      4842
Name: surface, dtype: int64

In [18]:
df_1978 = atp_main['1978':'1988']
df_1978  = df_1978['surface'].value_counts()
df_1978

Clay      15948
Hard      12330
Carpet     6958
Grass      4992
Name: surface, dtype: int64

In [19]:
df_1988 = atp_main['1988':'1998']
df_1988  = df_1988['surface'].value_counts()
df_1988

Hard      16631
Clay      14195
Carpet     6507
Grass      3372
Name: surface, dtype: int64

In [20]:
df_1998 = atp_main['1998':'2008']
df_1998  = df_1998['surface'].value_counts()
df_1998

Hard      17465
Clay      12420
Grass      3504
Carpet     2701
Name: surface, dtype: int64

In [21]:
df_2008 = atp_main['2008':'2016']
df_2008  = df_2008['surface'].value_counts()
df_2008

Hard      15158
Clay       8677
Grass      2864
Carpet      146
Name: surface, dtype: int64