# Prep 300

## Purpose
In this notebook we begin preparing for research question 2: "What factors impact the success of a top player?". The goal is to create 2 master dataframes: one containing the top winners with their average match statistics, and one containing the same for the top losers. This notebook prepares the winners dataframe.

## Datasets
* The data in this notebook:
    * Men's Singles Matches from 1968 to 2017.
    * Women's Singles Matches from 2000 to 2016.
    * Men's Singles Matches from 2003 to 2014.
* These datasets have been cleaned and loaded into appropriate dataframes. They are used in this notebook to create the question-specific dataframes for Research Question 2: "What factors impact the success of a top player?".

In [1]:
#Importing relevant libraries
import os
import sys
import hashlib
import numpy as np
import pandas as pd
from datetime import datetime
    
%matplotlib inline

In [2]:
atp_main = pd.read_csv("../data/atp_main", low_memory = False, index_col = 'tourney_date')

In [3]:
#converting atp_main to time series
atp_main.index = pd.to_datetime(atp_main.index, format="%Y-%m-%d", errors='coerce')
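Because `errors='coerce'` silently turns any unparseable date into `NaT`, it can be worth counting how many index values were lost in the conversion. The following is a minimal sketch on a hypothetical three-value index (`raw_dates` is made up for illustration and is not taken from `atp_main`):

```python
import pandas as pd

# Hypothetical sample standing in for the raw tourney_date index values.
raw_dates = pd.Index(["2003-01-06", "2003-01-13", "not a date"])

# Same call as in the notebook: bad values become NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Count how many entries could not be parsed.
n_unparsed = int(parsed.isna().sum())
print(n_unparsed)
```

A non-zero count on the real data would suggest inspecting the offending rows before relying on the time series index.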

In [4]:
atp_small = pd.read_csv("../data/atp_small", low_memory = False)

In [5]:
wta_dataset = pd.read_csv("../data/wta_dataset", low_memory = False)

In [6]:
men_women = pd.read_csv("../data/men_women", low_memory = False)

## Creating the master winners table

Counting the players who have won the most matches in the men's and women's dataframes respectively.
We use atp_small and wta_dataset because both dataframes cover 2003-2016, which allows for fair and comparable results.

In [7]:
mens10 = atp_small['winner_name'].value_counts().head(10)
womens10 = wta_dataset['winner_name'].value_counts().head(10)

In [8]:
#Joining the 2 Series of match counts
both = [mens10, womens10]
rq2 = pd.concat(both)

In [9]:
#converting the combined Series to a dataframe
df = pd.DataFrame({'col':rq2}).reset_index()
df

Unnamed: 0,index,col
0,Roger Federer,847
1,Rafael Nadal,712
2,Novak Djokovic,612
3,David Ferrer,597
4,Andy Roddick,512
5,Andy Murray,487
6,Tomas Berdych,485
7,Nikolay Davydenko,463
8,Tommy Robredo,429
9,Fernando Verdasco,404


In [10]:
#renaming columns
df.columns = ['winner_name', 'matches_won']

The following line of code was used to count the number of matches lost by each player in the table (shown here for Agnieszka Radwanska).

In [11]:
men_women[(men_women['loser_name'] == 'Agnieszka Radwanska')]['tourney_id'].count()

174

In [12]:
#adding a matches lost column (values entered in the same row order as the players in df)
df['matches_lost']=(137, 141, 141, 289, 171, 154, 263, 298, 270, 282, 
                   135, 278, 88, 221, 167, 255, 200, 263, 233, 174)
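The loss counts above were typed in by hand after running the lookup once per player. As a hedged alternative, the same column could be derived in one step with `value_counts` and `map`. The sketch below uses small hypothetical frames (`men_women_demo` and `df_demo` are stand-ins, not the real data); it counts rows by `loser_name` rather than by non-null `tourney_id`, so the two approaches agree only if `tourney_id` is never missing.

```python
import pandas as pd

# Hypothetical miniature stand-ins for men_women and df.
men_women_demo = pd.DataFrame({
    "winner_name": ["A", "A", "B", "C"],
    "loser_name":  ["B", "C", "A", "B"],
})
df_demo = pd.DataFrame({"winner_name": ["A", "B"], "matches_won": [2, 1]})

# One pass over loser_name gives every player's loss count.
losses = men_women_demo["loser_name"].value_counts()

# Look up each top player's losses; players with no recorded loss get 0.
df_demo["matches_lost"] = (
    df_demo["winner_name"].map(losses).fillna(0).astype(int)
)
print(df_demo)
```

This keeps the counts tied to player names instead of row positions, so a reordered table cannot pair a player with the wrong figure.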

In [13]:
#creating a total matches column for each player by summing their matches won and lost
df['total_matches'] = df['matches_lost'] + df['matches_won']

In [14]:
df.head(5)

Unnamed: 0,winner_name,matches_won,matches_lost,total_matches
0,Roger Federer,847,137,984
1,Rafael Nadal,712,141,853
2,Novak Djokovic,612,141,753
3,David Ferrer,597,289,886
4,Andy Roddick,512,171,683


Calculating a winning percentage for each player (stored as a fraction between 0 and 1) to give a fairer list of top winners

In [15]:
df['winning_perc'] = df['matches_won'] / df['total_matches'] 

In [16]:
df.head(5)

Unnamed: 0,winner_name,matches_won,matches_lost,total_matches,winning_perc
0,Roger Federer,847,137,984,0.860772
1,Rafael Nadal,712,141,853,0.834701
2,Novak Djokovic,612,141,753,0.812749
3,David Ferrer,597,289,886,0.673815
4,Andy Roddick,512,171,683,0.749634
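To see how the winning percentage changes the ordering, the table can be sorted by `winning_perc` instead of `matches_won`. This is a small sketch that rebuilds the five rows shown above (the frame name `demo` is an assumption for illustration):

```python
import pandas as pd

# The five rows displayed above, re-entered by hand for illustration.
demo = pd.DataFrame({
    "winner_name": ["Roger Federer", "Rafael Nadal", "Novak Djokovic",
                    "David Ferrer", "Andy Roddick"],
    "matches_won": [847, 712, 612, 597, 512],
    "matches_lost": [137, 141, 141, 289, 171],
})
demo["total_matches"] = demo["matches_won"] + demo["matches_lost"]
demo["winning_perc"] = demo["matches_won"] / demo["total_matches"]

# Rank by share of matches won rather than by raw win count.
ranked = demo.sort_values("winning_perc", ascending=False).reset_index(drop=True)
print(ranked[["winner_name", "winning_perc"]])
```

In this subset Andy Roddick moves ahead of David Ferrer, because Ferrer has more wins but also considerably more losses.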


### Calculating the average match statistics for each winning player
* In the "df" dataframe, we have 5 columns filled with values that we have calculated. 
* In addition to this we need to include their average statistics per match played.

In [17]:
#getting mean of columns 31-39 (winners match statistics)
a = atp_small[(atp_small['winner_name'] == 'Roger Federer')].iloc[:,31:40].mean()
#converting to a dataframe, resetting the index and transposing so that each statistic becomes a column
a = pd.DataFrame(a).reset_index().set_index('index').T

In [18]:
b = atp_small[(atp_small['winner_name'] == 'Rafael Nadal')].iloc[:,31:40].mean()
b = pd.DataFrame(b).reset_index().set_index('index').T

In [19]:
c = atp_small[(atp_small['winner_name'] == 'Novak Djokovic')].iloc[:,31:40].mean()
c = pd.DataFrame(c).reset_index().set_index('index').T

In [20]:
d = atp_small[(atp_small['winner_name'] == 'David Ferrer')].iloc[:,31:40].mean()
d = pd.DataFrame(d).reset_index().set_index('index').T

In [21]:
e = atp_small[(atp_small['winner_name'] == 'Andy Roddick')].iloc[:,31:40].mean()
e = pd.DataFrame(e).reset_index().set_index('index').T

In [22]:
f = atp_small[(atp_small['winner_name'] == 'Andy Murray')].iloc[:,31:40].mean()
f = pd.DataFrame(f).reset_index().set_index('index').T

In [23]:
g = atp_small[(atp_small['winner_name'] == 'Tomas Berdych')].iloc[:,31:40].mean()
g = pd.DataFrame(g).reset_index().set_index('index').T

In [24]:
h = atp_small[(atp_small['winner_name'] == 'Nikolay Davydenko')].iloc[:,31:40].mean()
h = pd.DataFrame(h).reset_index().set_index('index').T

In [25]:
i = atp_small[(atp_small['winner_name'] == 'Tommy Robredo')].iloc[:,31:40].mean()
i = pd.DataFrame(i).reset_index().set_index('index').T

In [26]:
j = atp_small[(atp_small['winner_name'] == 'Fernando Verdasco')].iloc[:,31:40].mean()
j = pd.DataFrame(j).reset_index().set_index('index').T

In [27]:
k = wta_dataset[(wta_dataset['winner_name'] == 'Maria Sharapova')].iloc[:,31:40].mean()
k = pd.DataFrame(k).reset_index().set_index('index').T

In [28]:
l = wta_dataset[(wta_dataset['winner_name'] == 'Jelena Jankovic')].iloc[:,31:40].mean()
l = pd.DataFrame(l).reset_index().set_index('index').T

In [29]:
m = wta_dataset[(wta_dataset['winner_name'] == 'Serena Williams')].iloc[:,31:40].mean()
m = pd.DataFrame(m).reset_index().set_index('index').T

In [30]:
n = wta_dataset[(wta_dataset['winner_name'] == 'Svetlana Kuznetsova')].iloc[:,31:40].mean()
n = pd.DataFrame(n).reset_index().set_index('index').T

In [31]:
o = wta_dataset[(wta_dataset['winner_name'] == 'Caroline Wozniacki')].iloc[:,31:40].mean()
o = pd.DataFrame(o).reset_index().set_index('index').T

In [32]:
p = wta_dataset[(wta_dataset['winner_name'] == 'Flavia Pennetta')].iloc[:,31:40].mean()
p = pd.DataFrame(p).reset_index().set_index('index').T

In [33]:
q = wta_dataset[(wta_dataset['winner_name'] == 'Vera Zvonareva')].iloc[:,31:40].mean()
q = pd.DataFrame(q).reset_index().set_index('index').T

In [34]:
r = wta_dataset[(wta_dataset['winner_name'] == 'Marion Bartoli')].iloc[:,31:40].mean()
r = pd.DataFrame(r).reset_index().set_index('index').T

In [35]:
s = wta_dataset[(wta_dataset['winner_name'] == 'Nadia Petrova')].iloc[:,31:40].mean()
s = pd.DataFrame(s).reset_index().set_index('index').T

In [36]:
t = wta_dataset[(wta_dataset['winner_name'] == 'Agnieszka Radwanska')].iloc[:,31:40].mean()
t = pd.DataFrame(t).reset_index().set_index('index').T
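The twenty cells above repeat one pattern: filter by `winner_name`, then average the winner statistics columns. A possible shorter form, sketched here on a hypothetical miniature table (`matches`, `top_players` and `stat_cols` are illustrative names, and the real frames select the statistics by position with `iloc[:,31:40]` rather than by name), is a single `groupby`:

```python
import pandas as pd

# Hypothetical miniature match table with two of the w_* statistics.
matches = pd.DataFrame({
    "winner_name": ["A", "A", "B", "C"],
    "w_ace": [10, 6, 3, 1],
    "w_df": [2, 0, 1, 5],
})
top_players = ["B", "A"]          # desired row order
stat_cols = ["w_ace", "w_df"]     # columns to average

stats_demo = (
    matches[matches["winner_name"].isin(top_players)]
    .groupby("winner_name")[stat_cols]
    .mean()
    .reindex(top_players)         # restore the order of top_players
    .reset_index(drop=True)
)
print(stats_demo)
```

Under this approach the men's and women's frames would each be processed once and the two results concatenated, with `reindex` keeping the rows in the same order as the players table.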

In [37]:
#joining all the average statistics to create a master dataframe
#reset_index(drop=True) renumbers the rows 0-19 and discards the old index in one step
stats = pd.concat([a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t]).reset_index(drop=True)
stats.head()

index,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
0,7.775434,1.496278,73.7134,46.308933,36.566998,16.404467,12.477667,2.470223,3.42928
1,3.057018,1.394737,70.70614,48.704678,35.869883,13.017544,11.69883,3.038012,4.30848
2,5.762565,2.124783,75.376083,48.790295,36.60312,15.126516,12.343154,3.15078,4.592721
3,3.095238,2.142857,74.359788,47.696649,34.405644,14.977072,11.869489,3.518519,5.250441
4,12.588358,1.659044,72.862786,48.399168,39.432432,14.56341,12.503119,2.087318,2.794179


## Dataframes
We have 2 dataframes which we will now join:
<br>df -> top 10 Men and Women players where we added the "matches_won", "matches_lost", "total_matches" and "winning_perc" columns.
<br>stats -> the average statistics for each player in the df dataframe.

In [38]:
#Joining the 2 dataframes side by side (rows are matched by position, so both must list the players in the same order)
#The result contains the player names, match counts and average match statistics
RQ2_winners = pd.concat([df, stats], axis = 1)
RQ2_winners

Unnamed: 0,winner_name,matches_won,matches_lost,total_matches,winning_perc,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
0,Roger Federer,847,137,984,0.860772,7.775434,1.496278,73.7134,46.308933,36.566998,16.404467,12.477667,2.470223,3.42928
1,Rafael Nadal,712,141,853,0.834701,3.057018,1.394737,70.70614,48.704678,35.869883,13.017544,11.69883,3.038012,4.30848
2,Novak Djokovic,612,141,753,0.812749,5.762565,2.124783,75.376083,48.790295,36.60312,15.126516,12.343154,3.15078,4.592721
3,David Ferrer,597,289,886,0.673815,3.095238,2.142857,74.359788,47.696649,34.405644,14.977072,11.869489,3.518519,5.250441
4,Andy Roddick,512,171,683,0.749634,12.588358,1.659044,72.862786,48.399168,39.432432,14.56341,12.503119,2.087318,2.794179
5,Andy Murray,487,154,641,0.75975,7.134199,2.313853,75.510823,43.852814,33.679654,17.199134,12.287879,3.229437,4.911255
6,Tomas Berdych,485,263,748,0.648396,8.473451,2.20354,73.090708,42.988938,34.495575,16.893805,12.050885,2.942478,4.017699
7,Nikolay Davydenko,463,298,761,0.60841,3.550336,2.322148,71.626398,48.44519,34.888143,12.747204,11.458613,3.579418,5.263982
8,Tommy Robredo,429,270,699,0.613734,4.353081,2.13981,78.277251,51.094787,37.260664,15.116114,12.492891,3.64218,5.414692
9,Fernando Verdasco,404,282,686,0.588921,5.44557,3.564557,75.903797,52.473418,38.675949,12.929114,12.303797,3.592405,5.174684
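The positional join above is correct only while `df` and `stats` list the players in exactly the same order. If the statistics table also carried the player name, a key-based merge would remove that dependency. The sketch below assumes a hypothetical `stats_named` frame with a `winner_name` column, which the notebook's `stats` does not have:

```python
import pandas as pd

# Hypothetical stand-ins; stats_named is deliberately in a different row order.
df_demo = pd.DataFrame({"winner_name": ["A", "B"], "matches_won": [2, 1]})
stats_named = pd.DataFrame({"winner_name": ["B", "A"], "w_ace": [3.0, 8.0]})

# validate="one_to_one" raises if a name appears more than once on either side.
merged = df_demo.merge(
    stats_named, on="winner_name", how="left", validate="one_to_one"
)
print(merged)
```

Each player is paired with their own statistics regardless of row order, and duplicate names are caught immediately rather than producing extra rows.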


## Saving the new dataframe as a csv file

In [39]:
RQ2_winners.to_csv('../data/winners_df', index = False, encoding='utf-8')