# Prep 400

## Purpose
In this notebook we begin preparing for research question 2: 'What factors impact the success of a top player?'. The goal is two master dataframes: one with the top winners and their average match statistics, and one with the same for the top losers. This notebook prepares the losers dataframe.

## Datasets
* The data used in this notebook:
    * Men's Singles Matches from 1968 to 2017.
    * Women's Singles matches from 2000 to 2016.
    * Men's Singles Matches from 2003 to 2014.
* These datasets have been cleaned and loaded into dataframes. This notebook uses them to build the question-specific dataframes for Research Question 2: "What factors impact the success of a top player?".

In [1]:
#Importing relevant libraries
import os
import sys
import hashlib
import numpy as np
import pandas as pd
from datetime import datetime
    
%matplotlib inline

In [2]:
atp_main = pd.read_csv("../data/atp_main", low_memory = False, index_col = 'tourney_date')

In [3]:
#converting atp_main to time series
atp_main.index = pd.to_datetime(atp_main.index, format="%Y-%m-%d", errors='coerce')
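
Because `errors='coerce'` turns any date that does not match the format into `NaT` instead of raising an error, it can be worth counting how many rows were affected. A minimal sketch on made-up strings (the `raw` index below is illustrative and not taken from atp_main):

```python
import pandas as pd

# Made-up date strings; the middle one cannot be parsed.
raw = pd.Index(["2015-01-05", "not a date", "2016-07-11"])

# Same call as in the cell above: unparseable values become NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Count how many entries were coerced to NaT.
n_bad = int(parsed.isna().sum())
print(n_bad)
```

A non-zero count would suggest inspecting those rows before using the index as a time series.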

In [4]:
atp_small = pd.read_csv("../data/atp_small", low_memory = False)

In [5]:
wta_dataset = pd.read_csv("../data/wta_dataset", low_memory = False)

In [6]:
men_women = pd.read_csv("../data/men_women", low_memory = False)

## Creating the master losers table

We count the players who have lost the most matches in the men's and women's dataframes.
We use atp_small and wta_dataset because both cover 2003-2016, which keeps the results fair and comparable.

Losing the most matches does not necessarily make someone a bad player, since players who play the most matches also accumulate the most losses. To account for this, we take the 30 players with the most losses from each of the men's and women's datasets, calculate each player's losing percentage, and keep the players with the highest losing percentages.

In [7]:
mens01 = atp_small['loser_name'].value_counts().head(30)
womens01 = wta_dataset['loser_name'].value_counts().head(30)

In [8]:
#Joining the two lists to create one list of 60 names
both2 = [mens01, womens01]
rq21 = pd.concat(both2)

In [9]:
#converting to a dataframe
loser1 = pd.DataFrame({'col':rq21}).reset_index()
loser1.head()

Unnamed: 0,index,col
0,Feliciano Lopez,311
1,Jarkko Nieminen,301
2,Jurgen Melzer,300
3,Nikolay Davydenko,298
4,David Ferrer,289
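
The concatenated list no longer records which tour each name came from. If that label is needed later, one option is to pass `keys` to `pd.concat`. The sketch below uses made-up names and counts; the `tour` column and the `atp`/`wta` labels are assumptions and are not part of the notebook's dataframes.

```python
import pandas as pd

# Made-up loss counts standing in for mens01 and womens01.
mens = pd.Series([311, 301], index=["Player A", "Player B"])
womens = pd.Series([290, 280], index=["Player C", "Player D"])

# keys= adds an outer index level naming the source of each row.
labelled = (
    pd.concat([mens, womens], keys=["atp", "wta"], names=["tour", "loser_name"])
    .rename("matches_lost")
    .reset_index()
)
print(labelled)
```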


In [10]:
#renaming columns
loser1.columns = ['loser_name', 'matches_lost']

We used the following line to count the number of matches won by each player in the table, one player at a time (shown here for Agnieszka Radwanska).

In [11]:
men_women[(men_women['winner_name'] == 'Agnieszka Radwanska')]['tourney_id'].count()

401

In [12]:
#adding a matches won column
loser1['matches_won']=(348, 359, 326, 463, 597, 399, 404, 264, 429, 330, 485, 347, 218, 243, 223, 197, 305, 210, 260, 333,
                       236, 200, 380, 349, 263, 180, 341, 252, 320, 217,   
                   391, 380, 506, 295, 402, 257, 412, 365, 402, 288, 174, 135, 447, 294, 249, 275, 247, 314, 230, 407,
                      257, 189, 147, 393, 177, 275, 204, 159, 189, 401)
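
The tuple above was assembled from one lookup per player. As an alternative sketch, the same kind of count could be derived in one step with `value_counts` and `map`. The data below is a toy stand-in for men_women and loser1, and `wins_per_player` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for men_women: one row per match (names are made up).
matches = pd.DataFrame({
    "winner_name": ["A", "A", "B", "C", "A"],
    "loser_name":  ["B", "C", "C", "B", "D"],
})

# Toy stand-in for loser1 before the matches_won column is added.
losers = pd.DataFrame({
    "loser_name": ["B", "C", "D"],
    "matches_lost": [2, 2, 1],
})

# Count wins for every player once, then look each loser up by name.
wins_per_player = matches["winner_name"].value_counts()

# Players who never won are missing from wins_per_player, so fill with 0.
losers["matches_won"] = (
    losers["loser_name"].map(wins_per_player).fillna(0).astype(int)
)
print(losers)
```

Keying the counts on the player name avoids relying on the order of a hand-typed tuple.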

In [13]:
#adding a total matches column by summing the matches lost and the matches won
loser1['total_matches'] = loser1['matches_lost'] + loser1['matches_won']

In [14]:
#calculating each player's losing percentage (stored as a fraction between 0 and 1)
loser1['losing_perc'] = loser1['matches_lost'] / loser1['total_matches'] 

In [15]:
#only taking players with a losing percentage over 47.2%
loser1 = loser1[(loser1['losing_perc'] > .472)]

In [16]:
#dropping 'Jurgen Melzer' (lowest losing percentage) so there are 10 men and 10 women in the table
loser1 = loser1[loser1.loser_name != 'Jurgen Melzer'].reset_index().drop('index', axis=1)
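
The 0.472 cutoff and the named player were chosen by hand to leave 10 men and 10 women. A possible alternative, assuming each row carried a tour label (a hypothetical `tour` column that loser1 does not have), is to rank by losing percentage and keep the top n within each tour:

```python
import pandas as pd

# Made-up players; "tour" is a hypothetical column that loser1 does not have.
players = pd.DataFrame({
    "loser_name": ["M1", "M2", "M3", "W1", "W2", "W3"],
    "tour": ["atp", "atp", "atp", "wta", "wta", "wta"],
    "matches_lost": [50, 40, 30, 45, 20, 35],
    "matches_won": [50, 60, 20, 55, 30, 35],
})
players["losing_perc"] = players["matches_lost"] / (
    players["matches_lost"] + players["matches_won"]
)

n_per_tour = 2  # the notebook keeps 10 per tour

# Sort by losing percentage, then keep the first n rows of each tour.
selected = (
    players.sort_values("losing_perc", ascending=False)
    .groupby("tour", sort=False)
    .head(n_per_tour)
    .reset_index(drop=True)
)
print(selected[["loser_name", "tour", "losing_perc"]])
```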

In [17]:
loser1.head()

Unnamed: 0,loser_name,matches_lost,matches_won,total_matches,losing_perc
0,Fernando Verdasco,282,264,546,0.516484
1,Guillermo Garcia Lopez,250,218,468,0.534188
2,Julien Benneteau,249,243,492,0.506098
3,Paul Henri Mathieu,244,223,467,0.522484
4,Victor Hanescu,241,197,438,0.550228


In [18]:
#getting the mean of columns 40-48 (the end index 49 is excluded): the loser's nine match statistics
a1 = atp_small[(atp_small['loser_name'] == 'Andreas Seppi')].iloc[:,40:49].mean()
#converting to a dataframe, resetting the index and transposing so each statistic becomes a column
a1 = pd.DataFrame(a1).reset_index().set_index('index').T

In [19]:
b1 = atp_small[(atp_small['loser_name'] == 'Guillermo Garcia Lopez')].iloc[:,40:49].mean()
b1 = pd.DataFrame(b1).reset_index().set_index('index').T

In [20]:
c1 = atp_small[(atp_small['loser_name'] == 'Julien Benneteau')].iloc[:,40:49].mean()
c1 = pd.DataFrame(c1).reset_index().set_index('index').T

In [21]:
d1 = atp_small[(atp_small['loser_name'] == 'Paul Henri Mathieu')].iloc[:,40:49].mean()
d1 = pd.DataFrame(d1).reset_index().set_index('index').T

In [22]:
e1 = atp_small[(atp_small['loser_name'] == 'Victor Hanescu')].iloc[:,40:49].mean()
e1 = pd.DataFrame(e1).reset_index().set_index('index').T

In [23]:
f1 = atp_small[(atp_small['loser_name'] == 'Albert Montanes')].iloc[:,40:49].mean()
f1 = pd.DataFrame(f1).reset_index().set_index('index').T

In [24]:
g1 = atp_small[(atp_small['loser_name'] == 'Olivier Rochus')].iloc[:,40:49].mean()
g1 = pd.DataFrame(g1).reset_index().set_index('index').T

In [25]:
h1 = atp_small[(atp_small['loser_name'] == 'Gilles Simon')].iloc[:,40:49].mean()
h1 = pd.DataFrame(h1).reset_index().set_index('index').T

In [26]:
i1 = atp_small[(atp_small['loser_name'] == 'Janko Tipsarevic')].iloc[:,40:49].mean()
i1 = pd.DataFrame(i1).reset_index().set_index('index').T

In [27]:
j1 = atp_small[(atp_small['loser_name'] == 'Florian Mayer')].iloc[:,40:49].mean()
j1 = pd.DataFrame(j1).reset_index().set_index('index').T

In [28]:
k1 = wta_dataset[(wta_dataset['loser_name'] == 'Anabel Medina Garrigues')].iloc[:,40:49].mean()
k1 = pd.DataFrame(k1).reset_index().set_index('index').T

In [29]:
l1 = wta_dataset[(wta_dataset['loser_name'] == 'Klara Koukalova')].iloc[:,40:49].mean()
l1 = pd.DataFrame(l1).reset_index().set_index('index').T

In [30]:
m1 = wta_dataset[(wta_dataset['loser_name'] == 'Iveta Benesova')].iloc[:,40:49].mean()
m1 = pd.DataFrame(m1).reset_index().set_index('index').T

In [31]:
n1 = wta_dataset[(wta_dataset['loser_name'] == 'Svetlana Kuznetsova')].iloc[:,40:49].mean()
n1 = pd.DataFrame(n1).reset_index().set_index('index').T

In [32]:
o1 = wta_dataset[(wta_dataset['loser_name'] == 'Elena Vesnina')].iloc[:,40:49].mean()
o1 = pd.DataFrame(o1).reset_index().set_index('index').T

In [33]:
p1 = wta_dataset[(wta_dataset['loser_name'] == 'Tsvetana Pironkova')].iloc[:,40:49].mean()
p1 = pd.DataFrame(p1).reset_index().set_index('index').T

In [34]:
q1 = wta_dataset[(wta_dataset['loser_name'] == 'Virginie Razzano')].iloc[:,40:49].mean()
q1 = pd.DataFrame(q1).reset_index().set_index('index').T

In [35]:
r1 = wta_dataset[(wta_dataset['loser_name'] == 'Gisela Dulko')].iloc[:,40:49].mean()
r1 = pd.DataFrame(r1).reset_index().set_index('index').T

In [36]:
s1 = wta_dataset[(wta_dataset['loser_name'] == 'Eleni Daniilidou')].iloc[:,40:49].mean()
s1 = pd.DataFrame(s1).reset_index().set_index('index').T

In [37]:
t1 = wta_dataset[(wta_dataset['loser_name'] == 'Alize Cornet')].iloc[:,40:49].mean()
t1 = pd.DataFrame(t1).reset_index().set_index('index').T
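
Cells 18 to 37 repeat the same steps for each player. A more compact sketch, assuming the statistics columns are selected by name rather than by position, is a single `groupby`. The table below is toy data with only two of the nine `l_*` columns; in the notebook this would be run once on atp_small and once on wta_dataset.

```python
import pandas as pd

# Toy match table; only two of the notebook's nine l_* columns are included.
matches = pd.DataFrame({
    "loser_name": ["P1", "P2", "P1", "P3", "P2"],
    "l_ace": [2, 4, 6, 1, 8],
    "l_df": [1, 3, 3, 0, 5],
})

players = ["P2", "P1"]  # hypothetical list of selected players
stat_cols = ["l_ace", "l_df"]

# Filter to the selected players, average each statistic per player,
# then reindex so the rows follow the order of `players`.
stats = (
    matches[matches["loser_name"].isin(players)]
    .groupby("loser_name")[stat_cols]
    .mean()
    .reindex(players)
)
print(stats)
```

Keeping `loser_name` as the index means each row of averages stays attached to the player it was computed for.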

In [38]:
#Joining all the individual match statistics dataframes
stats1 = pd.concat([a1,b1,c1,d1,e1,f1,g1,h1,i1,j1,k1,l1,m1,n1,o1,p1,q1,r1,s1,t1]).reset_index().drop('index', axis=1)
stats1.head()

index,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,3.847015,3.052239,78.940299,45.085821,29.843284,15.873134,11.94403,4.757463,8.593284
1,2.955823,4.188755,78.634538,46.698795,30.064257,14.293173,11.839357,4.891566,9.056225
2,4.788618,3.552846,79.853659,51.215447,33.353659,12.51626,12.170732,4.605691,8.638211
3,4.021097,3.362869,78.341772,42.696203,28.78481,17.168776,11.881857,4.472574,8.067511
4,4.276018,1.80543,81.642534,56.402715,37.022624,11.638009,12.425339,4.628959,8.244344


In [39]:
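#Joining the 2 dataframes side by side; rows are matched by position, so stats1 must list players in the same order as loser1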
RQ2_losers = pd.concat([loser1, stats1], axis = 1)
RQ2_losers#.head()

Unnamed: 0,loser_name,matches_lost,matches_won,total_matches,losing_perc,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,Fernando Verdasco,282,264,546,0.516484,3.847015,3.052239,78.940299,45.085821,29.843284,15.873134,11.94403,4.757463,8.593284
1,Guillermo Garcia Lopez,250,218,468,0.534188,2.955823,4.188755,78.634538,46.698795,30.064257,14.293173,11.839357,4.891566,9.056225
2,Julien Benneteau,249,243,492,0.506098,4.788618,3.552846,79.853659,51.215447,33.353659,12.51626,12.170732,4.605691,8.638211
3,Paul Henri Mathieu,244,223,467,0.522484,4.021097,3.362869,78.341772,42.696203,28.78481,17.168776,11.881857,4.472574,8.067511
4,Victor Hanescu,241,197,438,0.550228,4.276018,1.80543,81.642534,56.402715,37.022624,11.638009,12.425339,4.628959,8.244344
5,Albert Montanes,237,210,447,0.530201,2.729958,3.679325,74.177215,43.85654,28.122363,13.603376,11.270042,4.57384,8.658228
6,Igor Andreev,229,236,465,0.492473,1.411215,2.724299,80.042056,48.962617,30.200935,14.691589,11.85514,5.158879,9.317757
7,Gilles Simon,229,200,429,0.5338,3.841629,2.303167,78.701357,43.9819,28.438914,16.39819,11.936652,4.904977,9.067873
8,Janko Tipsarevic,218,180,398,0.547739,5.916256,2.812808,79.871921,44.758621,31.064039,16.35468,12.054187,4.596059,8.108374
9,Florian Mayer,211,217,428,0.492991,3.291457,2.045226,72.61809,43.929648,28.020101,12.79397,11.115578,4.050251,8.050251
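
`pd.concat(..., axis=1)` lines the two tables up by row index, so each statistics row lands beside whichever player sits at the same position. The names typed into cells 18 and 24 (Andreas Seppi and Olivier Rochus) differ from rows 0 and 6 shown above (Fernando Verdasco and Igor Andreev), so this pairing is worth checking. A sketch of a name-keyed join, assuming the statistics table keeps a `loser_name` column (it does not in the notebook as written):

```python
import pandas as pd

# Made-up summary table (stand-in for loser1).
summary = pd.DataFrame({
    "loser_name": ["P1", "P2", "P3"],
    "losing_perc": [0.52, 0.53, 0.51],
})

# Made-up statistics table that keeps the player name, in a different order.
stats = pd.DataFrame({
    "loser_name": ["P3", "P1", "P2"],
    "l_ace": [1.0, 4.0, 6.0],
})

# Join on the name; validate="one_to_one" raises if a name is duplicated.
combined = summary.merge(stats, on="loser_name", how="left", validate="one_to_one")
print(combined)
```

With a left join, a player missing from the statistics table shows up as a row of NaN values rather than silently borrowing another player's numbers.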


## Saving the new dataframe as a CSV file

In [40]:
RQ2_losers.to_csv('../data/losers_df', index = False, encoding='utf-8')