# Analysis 300

## Purpose
This notebook continues our analysis for research question 1: 'How has the game of tennis developed since 1968?'.
It focuses on the nationalities of players and how they have changed from 1968 to now.

## Datasets
* For this notebook we will just use the atp_main dataset.
* We will create 2 resulting dataframes, one for each of two 11-year periods (1968-1978 and 2006-2016), containing the top 20 countries that produced the most players in that period, the corresponding average population of each country, and a calculated normalised value for player production.
* We will also create a small dataframe for a case study on the decline of players from Australia.

In [1]:
import os
import sys
import hashlib
import numpy as np
import pandas as pd
from datetime import datetime
    
%matplotlib inline

In [2]:
atp_main = pd.read_csv("../data/atp_main", low_memory = False, index_col = 'tourney_date')

Converting the index of the atp_main dataframe to datetimes so that it can be sliced as a time series.

In [3]:
atp_main.index = pd.to_datetime(atp_main.index, format="%Y-%m-%d", errors='coerce')
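
As a small illustration of what this conversion enables, the sketch below builds a toy frame (the names and dates are invented for the example, and one date is deliberately malformed), converts its index in the same way, and then selects rows by year. It uses `.loc` for the selection, which is the form recent pandas versions expect; dropping `NaT` rows and sorting the index are assumptions added here to keep the slicing well defined, not steps taken in this notebook.

```python
import pandas as pd

# Toy data, invented for illustration; the last date is deliberately malformed.
toy = pd.DataFrame(
    {"winner_name": ["A", "B", "C", "D"]},
    index=["1975-05-12", "1975-06-02", "1976-01-05", "not-a-date"],
)

# errors='coerce' turns unparseable dates into NaT instead of raising.
toy.index = pd.to_datetime(toy.index, format="%Y-%m-%d", errors="coerce")

# Drop NaT rows and sort so that partial-string slicing is well defined.
toy = toy[toy.index.notna()].sort_index()

matches_1975 = toy.loc["1975"]           # all rows from 1975
matches_75_76 = toy.loc["1975":"1976"]   # inclusive range of years
print(len(matches_1975), len(matches_75_76))
```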

In [4]:
atp_main.loc['1975'].head(5)

tourney_date,tourney_id,tourney_name,surface,draw_size,tourney_level,match_num,winner_id,winner_seed,winner_entry,winner_name,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,match_year
1975-05-12,1975-347,Bournemouth,Clay,64,A,1,100282,,,Guillermo Vilas,...,,,,,,,,,,1975.0
1975-05-12,1975-347,Bournemouth,Clay,64,A,2,100199,,,Patrice Dominguez,...,,,,,,,,,,1975.0
1975-05-12,1975-347,Bournemouth,Clay,64,A,3,100062,,,Francois Jauffret,...,,,,,,,,,,1975.0
1975-05-12,1975-347,Bournemouth,Clay,64,A,4,100359,,,Richard Lewis,...,,,,,,,,,,1975.0
1975-05-12,1975-347,Bournemouth,Clay,64,A,5,109787,,,Cm Robinson,...,,,,,,,,,,1975.0


## Nationalities of players 1968 - 1978

In [5]:
#getting dataframe with years 1968 to 1978
no_players_68_78 = atp_main['1968':'1978']

#grouping by winner_ioc (the country the player is from) and counting unique winner names
#so we don't count players more than once
no_players_68_78 = no_players_68_78.winner_name.groupby(no_players_68_78['winner_ioc']).nunique().sort_values()

#creating a dataframe and printing out the countries with the most players
no_players_68_78 = pd.DataFrame(no_players_68_78).reset_index().tail(20)
no_players_68_78.tail()

Unnamed: 0,winner_ioc,winner_name
68,ESP,31
69,GBR,39
70,FRA,44
71,AUS,97
72,USA,198
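
The counting step above can be sketched on a toy table. The player names below are invented for illustration; only the column names `winner_name` and `winner_ioc` come from the dataset. Note that this approach counts match winners, so a player who never won a match in the period would not appear.

```python
import pandas as pd

# Toy match records, invented for illustration: one row per match won.
matches = pd.DataFrame({
    "winner_name": ["Ann", "Ann", "Bob", "Eve", "Cy", "Cy", "Dee", "Flo"],
    "winner_ioc":  ["USA", "USA", "USA", "USA", "AUS", "AUS", "AUS", "FRA"],
})

# Number of distinct winners per country, smallest first.
players_per_country = (
    matches.groupby("winner_ioc")["winner_name"]
    .nunique()
    .sort_values()
    .rename("num_players")
    .reset_index()
)
print(players_per_country)
```

Ann and Cy each win twice but are counted once, which is the point of using `nunique()` rather than a plain row count.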


In [6]:
no_players_68_78.columns = ['winner_ioc', 'num_players']

### Normalizing the player numbers by country population

In [7]:
rel_68_78 = atp_main['1968':'1978']

#counting unique winner names per winner_ioc and keeping the 21 countries with the most players
rel_68_78 = pd.DataFrame({'player_num' : rel_68_78.winner_name.groupby(rel_68_78['winner_ioc']).nunique().sort_values().tail(21)}).reset_index()

To get the average population of each country, we manually calculated an average over the 11-year period 1968-1978, using the yearly populations published at https://countryeconomy.com/.
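
The averaging was done by hand here, but the same calculation could be sketched in pandas as follows. The country codes `AAA` and `BBB` and all population figures are placeholders invented for the example, not real data, and only three years are shown to keep it short.

```python
import pandas as pd

# Hypothetical yearly populations in millions (placeholder values, not real data).
yearly_pop = pd.DataFrame(
    {"AAA": [10.0, 10.2, 10.4], "BBB": [50.0, 50.5, 51.0]},
    index=[1968, 1969, 1970],
)

# Mean over the years (rows) gives one average population per country (column).
average_pop = yearly_pop.mean()
print(average_pop)
```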

In [8]:
#average population (in millions) of each country in the table between the years 1968 - 1978
#the 0 at position 7 is a placeholder; that row is dropped in a later cell
rel_68_78['average_pop'] = (20.978222, 132.788000, 6.258093, 57.344959, 2.955318, 883.675818, 10.470653, 0, 25.231640, 22.403424,
                            59.4896107, 54.887776, 99.379630, 8.143423, 61.438594, 24.437233, 42.502060, 56.010238, 53.292658, 13.315630,
                            210.913956)

In [9]:
#calculating the normalised number of players per country (players per million inhabitants)
rel_68_78['pop_rel'] = rel_68_78['player_num'] / rel_68_78['average_pop'] 
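
Assigning populations by position works only as long as the tuple is in exactly the same order as the rows. A minimal alternative sketch, assuming the populations are kept in a dictionary keyed by IOC code (a structure not used in this notebook), is shown below with the ROU, RUS and SUI values from the 1968-1978 table.

```python
import pandas as pd

# Three countries from the 1968-1978 table in this notebook.
rel = pd.DataFrame({
    "winner_ioc": ["ROU", "RUS", "SUI"],
    "player_num": [10, 10, 11],
})

# Average population in millions, keyed by IOC code instead of by row position.
average_pop = {"ROU": 20.978222, "RUS": 132.788000, "SUI": 6.258093}

# map() looks up each country code, so the row order no longer matters.
rel["average_pop"] = rel["winner_ioc"].map(average_pop)

# Players per million inhabitants.
rel["pop_rel"] = rel["player_num"] / rel["average_pop"]
print(rel.round(6))
```

A country missing from the dictionary would get `NaN` rather than a silently misaligned value, which makes gaps easier to spot.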

In [10]:
#dropping the row at position 7, whose population is 0, and renumbering the index
rel_68_78 = rel_68_78.drop(rel_68_78.index[7]).reset_index(drop=True)
rel_68_78.head()

Unnamed: 0,winner_ioc,player_num,average_pop,pop_rel
0,ROU,10,20.978222,0.476685
1,RUS,10,132.788,0.075308
2,SUI,11,6.258093,1.757724
3,MEX,11,57.344959,0.191822
4,NZL,14,2.955318,4.737223


In [11]:
rel_68_78.to_csv('../data/rel_68_78', index = False, encoding='utf-8')

## Nationalities of players 2006 - 2016

In [12]:
#getting dataframe with years 2006 to 2016
no_players_06_16 = atp_main['2006':'2016']

#grouping by winner_ioc (the country the player is from) and counting unique winner names
#so we don't count players more than once
no_players_06_16 = no_players_06_16.winner_name.groupby(no_players_06_16['winner_ioc']).nunique().sort_values()

#creating a dataframe and printing out the countries with the most players
no_players_06_16 = pd.DataFrame(no_players_06_16).reset_index().tail(20)
no_players_06_16.tail()

Unnamed: 0,winner_ioc,winner_name
94,GER,35
95,ARG,35
96,ESP,43
97,FRA,50
98,USA,62


In [13]:
no_players_06_16.columns = ['winner_ioc', 'num_players']

### Normalizing the player numbers by country population

In [14]:
rel_06_16 = atp_main['2006':'2016']

#counting unique winner names per winner_ioc and keeping the 20 countries with the most players
rel_06_16 = pd.DataFrame({'player_num' : rel_06_16.winner_name.groupby(rel_06_16['winner_ioc']).nunique().sort_values().tail(20)}).reset_index()

Again, to get the average population of each country, we manually calculated an average over the 11-year period 2006-2016, using the same website as before.

In [15]:
#average population (in millions) of each country in the table between the years 2006 - 2016
rel_06_16['average_pop']=(127.464909, 7.961611, 5.402596, 11.012713, 115.999584, 4.262103, 8.461710, 9.567811, 49.873882, 
                          197.121091, 63.458004, 10.470653, 143.438740, 59.667704, 22.551720, 81.461219, 41.235810, 
                          42.502060, 65.333179, 311.298094)

In [16]:
#calculating the normalised number of players per country (players per million inhabitants)
rel_06_16['pop_rel'] = rel_06_16['player_num'] / rel_06_16['average_pop'] 

In [17]:
rel_06_16.to_csv('../data/rel_06_16', index = False, encoding='utf-8')

## Australia's dip in players
- A possible cause is the introduction of the Australian Institute of Sport (AIS) in 1981.
- The AIS provides funding for elite sports programs.
- Other sports became more popular and were pushed by the AIS, while tennis may have taken a back seat.

In [18]:
Aus = atp_main[atp_main['winner_ioc'] == 'AUS']

1984 seems to be where the noticeable decrease in players begins. The yearly count does not go above 20 afterwards.

In [19]:
Aus1 = Aus.winner_name.groupby(Aus['match_year']).nunique()
Aus1 = pd.DataFrame(Aus1).reset_index()
Aus1.head(20)

Unnamed: 0,match_year,winner_name
0,1968.0,41
1,1969.0,30
2,1970.0,31
3,1971.0,34
4,1972.0,40
5,1973.0,42
6,1974.0,43
7,1975.0,39
8,1976.0,38
9,1977.0,47


In [20]:
len(Aus.loc['1984', 'winner_name'].unique())

18
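
The claim that the yearly count stays at or below 20 from 1984 onwards could be checked programmatically along these lines. In this sketch the series of yearly counts is hypothetical: only the 1984 value of 18 comes from this notebook, and the other values are placeholders chosen for illustration.

```python
import pandas as pd

# Yearly counts of distinct Australian winners. Placeholder values for
# illustration; only the 1984 figure (18) is taken from the notebook.
counts = pd.Series({1981: 30, 1982: 26, 1983: 22, 1984: 18, 1985: 19, 1986: 15})

threshold = 20

# Last year in which the count exceeds the threshold.
last_year_above = counts[counts > threshold].index.max()

# True if every later year stays at or below the threshold.
never_recovers = bool((counts.loc[last_year_above + 1:] <= threshold).all())
print(last_year_above, never_recovers)
```

With the real data, `counts` would be built from `Aus1` by setting `match_year` as the index of the `winner_name` column.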

In [21]:
#Saving dataframe to csv for use in Results notebook
Aus1.to_csv('../data/Aus', index = False, encoding='utf-8')