# Prep 100

## Purpose
* In this notebook we begin cleaning our Men's Singles 1968-2017 dataset.
* We will:
    * load the data
    * fix data types
    * add a column
    * save the cleaned dataframe as a CSV file in a specified folder

## Datasets
* This is the first notebook in which we prepare the raw datasets; the end result is a cleaned dataframe.
* The dataset used is:
    * Men's Singles matches from 1968 to 2017. This dataset is a CSV file named ATP.csv, and its title is "ATP Matches, 1968 to 2017".
    * This dataset was downloaded from Kaggle.com.

In [1]:
# Importing relevant libraries
import os
import sys
import hashlib
import numpy as np
import pandas as pd

In [2]:
# Check that the dataset file exists before trying to read it
if not os.path.exists("../data/ATP.csv"):
    print("Missing dataset file")
else:
    print("Success!")

Success!


## ATP Matches, 1968 to 2017

In [3]:
# Reading in the ATP.csv file
atp_main = pd.read_csv("../data/ATP.csv", low_memory = False)
atp_main.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,1968-580,Australian Chps.,Grass,64,G,19680119,1,110023,,,...,,,,,,,,,,
1,1968-580,Australian Chps.,Grass,64,G,19680119,2,109803,,,...,,,,,,,,,,
2,1968-580,Australian Chps.,Grass,64,G,19680119,3,100257,,,...,,,,,,,,,,
3,1968-580,Australian Chps.,Grass,64,G,19680119,4,100105,5.0,,...,,,,,,,,,,
4,1968-580,Australian Chps.,Grass,64,G,19680119,5,109966,,,...,,,,,,,,,,


In [4]:
# get idea of the size of data we are dealing with
atp_main.shape

(164029, 49)

* This dataframe contains every match from 1968-2017.

### A 'year' column is added
* This column makes it easier to select all matches within a given year or range of years.
* Years are easier to work with than months, particularly in tennis.
* This is because the tennis calendar runs from January to early December, so each season falls within a single calendar year.

In [5]:
atp_main['year'] = atp_main['tourney_id'].str.split('-').str[0]
atp_main.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,year
0,1968-580,Australian Chps.,Grass,64,G,19680119,1,110023,,,...,,,,,,,,,,1968
1,1968-580,Australian Chps.,Grass,64,G,19680119,2,109803,,,...,,,,,,,,,,1968
2,1968-580,Australian Chps.,Grass,64,G,19680119,3,100257,,,...,,,,,,,,,,1968
3,1968-580,Australian Chps.,Grass,64,G,19680119,4,100105,5.0,,...,,,,,,,,,,1968
4,1968-580,Australian Chps.,Grass,64,G,19680119,5,109966,,,...,,,,,,,,,,1968


* The original dataset has a 'tourney_id' column that contains the year and the ID of the specific tournament. I split this column on '-' and used the first part, i.e. the year (still a string at this point), to create the new column.
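As a minimal sketch of the split described above, the snippet below applies the same expression to two `tourney_id` values copied from the outputs in this notebook. The `sample` dataframe is a hypothetical stand-in for `atp_main`, used only for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for atp_main; the IDs are taken from the outputs above
sample = pd.DataFrame({
    "tourney_id": ["1968-580", "2017-M-DC-2017-WG-M-SUI-USA-01"],
})

# Split each ID on '-' and keep the first piece, which is the four-digit year
sample["year"] = sample["tourney_id"].str.split("-").str[0]

# The result is still text, not a number, until it is converted explicitly
print(sample["year"].tolist())
```

Because only the first piece is kept, the longer Davis Cup IDs, which contain several hyphens, are handled the same way as the short ones.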

In [6]:
# Rename 'year' to 'match_year' for consistency with the other column names
atp_main = atp_main.rename(columns={'year': 'match_year'})
atp_main.tail(1)

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,match_year
164028,2017-M-DC-2017-WG-M-SUI-USA-01,Davis Cup WG R1: SUI vs USA,Hard,4,D,20170203,5,105449,,,...,1,2,54,27,15,15,9,2,6,2017


In [7]:
# Converting the match_year column from string to a numeric dtype
atp_main['match_year'] = pd.to_numeric(atp_main['match_year'], errors='coerce')

### Age columns corrected

In [8]:
# The data types of all these columns can be seen below
atp_main.dtypes

tourney_id             object
tourney_name           object
surface                object
draw_size              object
tourney_level          object
tourney_date           object
match_num              object
winner_id              object
winner_seed            object
winner_entry           object
winner_name            object
winner_hand            object
winner_ht              object
winner_ioc             object
winner_age             object
winner_rank            object
winner_rank_points     object
loser_id               object
loser_seed             object
loser_entry            object
loser_name             object
loser_hand             object
loser_ht               object
loser_ioc              object
loser_age              object
loser_rank             object
loser_rank_points      object
score                  object
best_of                object
round                  object
minutes                object
w_ace                  object
w_df                   object
w_svpt    

* After exploring these columns it became clear that some of these data types needed to change.
* This is necessary because we need to perform calculations on certain columns, which requires numeric types such as float or int rather than object.
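The conversions in the following cells use `pd.to_numeric` with `errors='coerce'`. The sketch below, on a hypothetical `raw` series, shows what that option does: values that cannot be parsed become NaN instead of raising an error, so it can be worth counting how many values were lost that way.

```python
import pandas as pd

# Hypothetical object-typed values, similar in spirit to an age column read as text
raw = pd.Series(["23.7", "31", "unknown", None])

# errors='coerce' turns anything that cannot be parsed into NaN
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but became NaN during conversion
newly_missing = converted.isna() & raw.notna()

print(converted.dtype)
print(int(newly_missing.sum()))
```

Checking `newly_missing` is optional, but it makes silent data loss visible before any calculations are run.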

In [9]:
atp_main['winner_age'] = pd.to_numeric(atp_main['winner_age'], errors='coerce')
atp_main.winner_age.dtypes

dtype('float64')

In [10]:
# Same action with 'loser_age' column
atp_main['loser_age'] = pd.to_numeric(atp_main['loser_age'], errors='coerce')
atp_main.loser_age.dtypes

dtype('float64')

* The age columns have been successfully converted to the correct data type, but they are not displayed in a convenient way. It makes much more sense for an age to be shown as a whole number, e.g. 23, 24, 31, rather than 23.735473648, 24.745231224, 31.12908787 as they currently are.
* We first looked at flooring these long decimals, but the best solution was to adjust the display formatting. This changes only how floats are printed (rounded to the nearest whole number); the stored values are unchanged.
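To illustrate the difference between the two approaches mentioned above, here is a small sketch on a hypothetical `ages` series. It uses `pd.option_context` so the display format applies only inside the `with` block, which is an alternative to setting `pd.options.display.float_format` globally as this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, using the example values from the text above
ages = pd.Series([23.735473648, 24.745231224, np.nan], name="winner_age")

# Display formatting: affects printing only, and rounds to the nearest integer
with pd.option_context("display.float_format", "{:,.0f}".format):
    print(ages)

# Flooring: changes the stored values themselves (age in completed years)
floored = np.floor(ages)
print(floored)
```

Note that the two approaches disagree on the first value: the display format rounds it up, while flooring keeps the number of completed years.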

In [11]:
pd.options.display.float_format = '{:,.0f}'.format
atp_main['winner_age'].head(10)

0   nan
1    28
2    16
3    22
4    30
5   nan
6    23
7    31
8   nan
9    34
Name: winner_age, dtype: float64

In [12]:
# Same display format applied to 'loser_age' (the option is global, so setting it again is optional)
pd.options.display.float_format = '{:,.0f}'.format
atp_main['loser_age'].tail()

164024   41
164025   35
164026   25
164027   30
164028   20
Name: loser_age, dtype: float64

### Adjusting other dtypes from object to float

It became apparent while carrying out the analysis that we would have to change other columns from object to float.
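The cells below convert one column at a time. As a possible alternative, several columns can be converted in one step. The sketch uses a hypothetical `df` with the same column names; `numeric_cols` is an illustrative name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for atp_main with a few object-typed columns
df = pd.DataFrame({
    "minutes": ["54", "abc"],
    "w_ace": ["1", "2"],
    "winner_ht": ["185", None],
    "score": ["6-4 6-3", "7-6 6-2"],
})

# Columns that should hold numbers; 'score' is left as text
numeric_cols = ["minutes", "w_ace", "winner_ht"]

# Apply pd.to_numeric column by column; unparseable values become NaN
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```

Keeping the column names in one list also makes it easy to extend the conversion later.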

In [13]:
# Convert column minutes to float
atp_main['minutes'] = pd.to_numeric(atp_main['minutes'], errors='coerce')

In [14]:
# Converting column w_ace to float
atp_main['w_ace'] = pd.to_numeric(atp_main['w_ace'], errors='coerce')

In [15]:
# Converting column winner_ht to float
atp_main['winner_ht'] = pd.to_numeric(atp_main['winner_ht'], errors='coerce')

### Saving this dataframe

In [16]:
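# From this point on, the 1968-2017 dataframe is referred to as atp_main
# Note: float_format='%.f' writes floats with no decimal places, so ages are rounded in the saved file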
atp_main.to_csv('../data/atp_main', index=False, float_format='%.f', encoding='utf-8')
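The effect of `float_format='%.f'` on the saved file can be checked with a small round trip. This sketch writes a hypothetical two-row dataframe to an in-memory buffer rather than to `../data`, then reads it back.

```python
import io
import pandas as pd

# Hypothetical rows with fractional ages, standing in for atp_main
df = pd.DataFrame({
    "winner_age": [23.735473648, 30.2],
    "minutes": [54.0, 120.0],
})

# Write to an in-memory buffer with the same float_format used above
buffer = io.StringIO()
df.to_csv(buffer, index=False, float_format="%.f")

# Read the text back in, as a later notebook would
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

The restored ages are whole numbers, so the fractional part is lost in the saved file. If the decimals are needed later, one option would be to omit `float_format` when saving.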