### Karan Vombatkere
CSC 440: Data Mining - Final Project

This notebook processes a dataset of detailed tennis match information from the ATP World Tour.
It contains scripts that import, clean, and process the match data to extract the features used in the ML models. The pandas library is used throughout to store, clean, and process the data.

First, unique tennis player information was extracted into a pandas dataframe. The following attributes were extracted from the match data: 'PlayerID', 'PName', 'Age', 'Height', 'MaxRank', 'Hand', 'Country'
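The extraction step can be sketched on a toy table as follows. This is a minimal sketch, not the notebook's own loop: it stacks the winner and loser columns and aggregates per player ID. The helper `build_player_table`, the `FIELDS` list, and all data values are hypothetical; only the column names come from the match file. Taking the maximum age and the last seen name, height, hand, and country are assumptions made for illustration.

```python
import pandas as pd

# Made-up match rows that reuse the column names of the ATP match file.
matches = pd.DataFrame({
    "winner_id":   [1, 2, 1],
    "winner_name": ["Player A", "Player B", "Player A"],
    "winner_age":  [20.0, 25.0, 21.0],
    "winner_ht":   [185.0, 180.0, 185.0],
    "winner_rank": [50.0, 10.0, 30.0],
    "winner_hand": ["R", "L", "R"],
    "winner_ioc":  ["AAA", "BBB", "AAA"],
    "loser_id":    [2, 3, 3],
    "loser_name":  ["Player B", "Player C", "Player C"],
    "loser_age":   [24.0, 22.0, 23.0],
    "loser_ht":    [180.0, 190.0, 190.0],
    "loser_rank":  [12.0, 200.0, 150.0],
    "loser_hand":  ["L", "R", "R"],
    "loser_ioc":   ["BBB", "CCC", "CCC"],
})

# Suffixes shared by the winner_* and loser_* columns (hypothetical helper list).
FIELDS = ["id", "name", "age", "ht", "rank", "hand", "ioc"]

def build_player_table(matches):
    """Return one row per player, with the best (lowest) rank seen as MaxRank."""
    sides = []
    for side in ("winner", "loser"):
        rename = {f"{side}_{field}": field for field in FIELDS}
        sides.append(matches[list(rename)].rename(columns=rename))
    # One row per player appearance, regardless of match outcome.
    appearances = pd.concat(sides, ignore_index=True)

    players = appearances.groupby("id").agg(
        PName=("name", "last"),
        Age=("age", "max"),        # assumption: keep the oldest recorded age
        Height=("ht", "last"),
        MaxRank=("rank", "min"),   # best rank is the numerically smallest
        Hand=("hand", "last"),
        Country=("ioc", "last"),
    )
    players.insert(0, "PlayerID", players.index)
    return players

players = build_player_table(matches)
print(players)
```

Because `min` skips missing values, a player whose first recorded rank is missing still receives a best rank once any ranked match appears.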

Then a function, computePlayerStats(), was written to calculate the following cumulative (average) statistics for each player: 'PlayerID', 'PName', 'Height(cm)', 'Matches Played', 'Overall Win%', 'Top 100 Win %', 'Top 30 Win %', 'First Serve %', 'First Serve Win %', 'BPSave %', 'BPConv %', 'Aces/Match', 'DF/Match', 'Future Top 30'
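To make the statistic definitions concrete, here is a small sketch that computes several of them for one player from made-up numbers. The `summarize` helper and the column names of `player_matches` are hypothetical, and the rows are assumed to be already oriented from the player's perspective. As in computePlayerStats(), serve percentages are averaged per match, while break point rates are ratios of totals.

```python
import pandas as pd

# Made-up statistics for one player across three matches.
player_matches = pd.DataFrame({
    "won":          [True, False, True],
    "ace":          [6, 2, 4],
    "df":           [3, 5, 1],
    "svpt":         [100, 80, 50],   # service points played
    "first_in":     [60, 40, 30],    # first serves in
    "first_won":    [45, 20, 24],    # points won on first serve
    "bp_saved":     [4, 2, 3],
    "bp_faced":     [5, 6, 4],
    "opp_bp_saved": [6, 3, 1],       # break points the opponent saved
    "opp_bp_faced": [10, 4, 6],      # break points the opponent faced
})

def summarize(m):
    """Aggregate per-match rows into cumulative features for one player."""
    n = len(m)
    bp_won = (m["opp_bp_faced"] - m["opp_bp_saved"]).sum()
    return {
        "Matches Played": n,
        "Overall Win%": m["won"].mean(),
        "First Serve %": (m["first_in"] / m["svpt"]).mean(),
        "First Serve Win %": (m["first_won"] / m["first_in"]).mean(),
        "BPSave %": m["bp_saved"].sum() / m["bp_faced"].sum(),
        "BPConv %": bp_won / m["opp_bp_faced"].sum(),
        "Aces/Match": m["ace"].sum() / n,
        "DF/Match": m["df"].sum() / n,
    }

stats = summarize(player_matches)
print(stats)
```

For these numbers the break point save rate is 9/15 = 0.6 and the conversion rate is 10/20 = 0.5.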

The required features were extracted for valid players (those who played the required number of initial matches while under the age limit; the run below uses 40 matches under age 27), and these tuples were returned as a new dataframe. Finally, this dataframe was exported as a .csv file to be used as training and test data for the Machine Learning models (Logistic Regression and Neural Network).
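The validity rule can be sketched with a pandas group-by. This is an illustration under stated assumptions rather than the notebook's method: `first_matches_under_age`, the `appearances` frame, and its values are hypothetical, rows are assumed to be in chronological order, and the check for complete serve statistics that computePlayerStats() also applies is omitted.

```python
import pandas as pd

# Made-up player appearances, one row per match, in chronological order.
appearances = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2, 1, 3],
    "age":       [24.0, 25.0, 26.5, 22.0, 28.0, 27.5, 21.0],
})

def first_matches_under_age(appearances, init_matches, age_limit):
    """Keep each player's first init_matches rows played under age_limit,
    and only for players who have that many such rows."""
    young = appearances[appearances["age"] < age_limit]
    first_n = young.groupby("player_id").head(init_matches)
    counts = first_n.groupby("player_id")["age"].transform("count")
    return first_n[counts == init_matches]

valid = first_matches_under_age(appearances, init_matches=2, age_limit=27)
print(valid)
```

Here only player 1 has two matches under the age limit, so players 2 and 3 are dropped.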

In [1]:
#Karan Vombatkere
#CSC 440: Data Mining Final Project
#November - December 2017

#Imports needed
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import time

import itertools as itools
from statistics import *


In [2]:
#Load ATPTennis.csv Dataset into a dataframe

#on_bad_lines="warn" (pandas 1.3+) replaces error_bad_lines=False, which was removed in pandas 2.0
ATPData = pd.read_csv("atp_matches_combined.csv", on_bad_lines="warn")
ATPData.name = "ATP Match Data"

b'Skipping line 50181: expected 49 fields, saw 69\nSkipping line 50182: expected 49 fields, saw 69\nSkipping line 50183: expected 49 fields, saw 69\nSkipping line 50184: expected 49 fields, saw 69\nSkipping line 50185: expected 49 fields, saw 69\nSkipping line 50186: expected 49 fields, saw 69\nSkipping line 50187: expected 49 fields, saw 69\nSkipping line 50188: expected 49 fields, saw 69\nSkipping line 50189: expected 49 fields, saw 69\nSkipping line 50190: expected 49 fields, saw 69\nSkipping line 50191: expected 49 fields, saw 69\nSkipping line 50192: expected 49 fields, saw 69\nSkipping line 50193: expected 49 fields, saw 69\nSkipping line 50194: expected 49 fields, saw 69\nSkipping line 50195: expected 49 fields, saw 69\nSkipping line 50196: expected 49 fields, saw 69\nSkipping line 50197: expected 49 fields, saw 69\nSkipping line 50198: expected 49 fields, saw 69\nSkipping line 50199: expected 49 fields, saw 69\nSkipping line 50200: expected 49 fields, saw 69\nSkipping line 5020
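The messages above report rows that contain more fields than the header declares. A minimal sketch of the same situation on a made-up in-memory CSV, assuming pandas 1.3 or later where `on_bad_lines` is available; the data and the name `raw` are hypothetical.

```python
import io
import pandas as pd

# The third line has three fields although the header declares two.
raw = io.StringIO("a,b\n1,2\n3,4,5\n6,7\n")

# on_bad_lines="skip" drops malformed rows silently; "warn" also reports them.
df = pd.read_csv(raw, on_bad_lines="skip")
print(df)
```

Skipped rows are lost to the later processing, so it is worth checking how many were dropped and why.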

In [3]:
#View the data
ATPData.head(10)

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,2000-717,Orlando,Clay,32.0,A,20000501.0,1.0,102179.0,,,...,15.0,13.0,4.0,110.0,59.0,49.0,31.0,17.0,4.0,4.0
1,2000-717,Orlando,Clay,32.0,A,20000501.0,2.0,103602.0,,Q,...,6.0,0.0,0.0,57.0,24.0,13.0,17.0,10.0,4.0,9.0
2,2000-717,Orlando,Clay,32.0,A,20000501.0,3.0,103387.0,,,...,0.0,2.0,2.0,65.0,39.0,22.0,10.0,8.0,6.0,10.0
3,2000-717,Orlando,Clay,32.0,A,20000501.0,4.0,101733.0,,,...,12.0,4.0,6.0,104.0,57.0,35.0,24.0,15.0,6.0,11.0
4,2000-717,Orlando,Clay,32.0,A,20000501.0,5.0,101727.0,4.0,,...,1.0,0.0,3.0,47.0,28.0,17.0,10.0,8.0,3.0,6.0
5,2000-717,Orlando,Clay,32.0,A,20000501.0,6.0,103181.0,,,...,6.0,11.0,8.0,94.0,48.0,31.0,29.0,15.0,6.0,9.0
6,2000-717,Orlando,Clay,32.0,A,20000501.0,7.0,101675.0,,,...,12.0,5.0,5.0,126.0,70.0,45.0,36.0,18.0,3.0,6.0
7,2000-717,Orlando,Clay,32.0,A,20000501.0,8.0,102834.0,5.0,,...,0.0,0.0,0.0,42.0,25.0,9.0,10.0,7.0,3.0,7.0
8,2000-717,Orlando,Clay,32.0,A,20000501.0,9.0,103454.0,6.0,,...,4.0,2.0,3.0,47.0,25.0,13.0,10.0,8.0,3.0,7.0
9,2000-717,Orlando,Clay,32.0,A,20000501.0,10.0,102466.0,,,...,6.0,3.0,2.0,102.0,62.0,43.0,14.0,14.0,9.0,13.0


In [4]:
#Create dataframe for Player Data and label columns
PlayerColumns = ['PlayerID', 'PName', 'Age', 'Height', 'MaxRank', 'Hand', 'Country']

PlayerData = pd.DataFrame(columns = PlayerColumns)
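#Rows are keyed by player ID through .loc below; PlayerID is also kept as a regular column
#(the original call PlayerData.set_index('PlayerID') discarded its result, so it had no effect)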


#Add data to the PlayerData dataframe
for index, row in ATPData.head(50180).iterrows():
    #If player was the winner
    p_id = int(row.winner_id)
    
    #Check if value exists or not
    if(p_id not in PlayerData.index):
        PlayerData.loc[p_id] = [p_id, row.winner_name, row.winner_age, row.winner_ht, row.winner_rank, row.winner_hand, row.winner_ioc]

    #Update rank with best rank value
    if(row.winner_rank < PlayerData.loc[p_id].MaxRank and not np.isnan(PlayerData.loc[p_id].MaxRank)):
        PlayerData.loc[p_id] = [p_id, row.winner_name, row.winner_age, row.winner_ht, row.winner_rank, row.winner_hand, row.winner_ioc]
    
    else:
        PlayerData.loc[p_id] = [p_id, row.winner_name, row.winner_age, row.winner_ht, PlayerData.loc[p_id].MaxRank, row.winner_hand, row.winner_ioc]
   

    #If the player lost
    p_id2 = int(row.loser_id)
    
    #Check if value exists or not
    if(p_id2 not in PlayerData.index):
        PlayerData.loc[p_id2] = [p_id2, row.loser_name, row.loser_age, row.loser_ht, row.loser_rank, row.loser_hand, row.loser_ioc]

    if(row.loser_rank < PlayerData.loc[p_id2].MaxRank and not np.isnan(PlayerData.loc[p_id2].MaxRank)):
        PlayerData.loc[p_id2] = [p_id2, row.loser_name, row.loser_age, row.loser_ht, row.loser_rank, row.loser_hand, row.loser_ioc]

    else:
        PlayerData.loc[p_id2] = [p_id2, row.loser_name, row.loser_age, row.loser_ht, PlayerData.loc[p_id2].MaxRank, row.loser_hand, row.loser_ioc]


In [6]:
#View the Player Data
PlayerData.head(100)

Unnamed: 0,PlayerID,PName,Age,Height,MaxRank,Hand,Country
102179,102179,Antony Dupuis,34.368241,185.0,57.0,R,FRA
102776,102776,Andrew Ilie,26.737851,180.0,38.0,R,AUS
103602,103602,Fernando Gonzalez,31.561944,183.0,5.0,R,CHI
102821,102821,Cecil Mamiit,35.208761,173.0,95.0,R,PHI
103387,103387,Paradorn Srichaphan,27.761807,185.0,9.0,R,THA
102205,102205,Sebastien Lareau,27.931554,183.0,94.0,R,CAN
101733,101733,Jan Siemerink,32.251882,183.0,82.0,L,NED
102925,102925,Justin Gimelstob,30.198494,196.0,73.0,R,USA
101727,101727,Jason Stoltenberg,31.225188,185.0,63.0,R,AUS
101826,101826,Alex Lopez Moron,29.270363,175.0,111.0,R,ESP


In [7]:
PlayerData.loc[103819]

PlayerID           103819
PName       Roger Federer
Age               34.0999
Height                185
MaxRank                 1
Hand                    R
Country               SUI
Name: 103819, dtype: object

In [8]:
ATPData.columns

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'winner_rank', 'winner_rank_points', 'loser_id', 'loser_seed',
       'loser_entry', 'loser_name', 'loser_hand', 'loser_ht', 'loser_ioc',
       'loser_age', 'loser_rank', 'loser_rank_points', 'score', 'best_of',
       'round', 'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon',
       'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df',
       'l_svpt', 'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved',
       'l_bpFaced'],
      dtype='object')

In [5]:
#Create Data Frame for calculating Player Statistics
#Specify Column Names
PlayerStatsColumns = ['PlayerID', 'PName', 'Height(cm)', 'Matches Played', 'Overall Win%', 'Top 100 Win %', 'Top 30 Win %', 'First Serve %', 'First Serve Win %', 'BPSave %', 'BPConv %', 'Aces/Match', 'DF/Match', 'Future Top 30']

PlayerStats = pd.DataFrame(index = PlayerData.index.copy(), columns = PlayerStatsColumns)

In [6]:
#Function to compute statistics and Populate the PlayerStats Table
#Note that we reject all tuples that have invalid/incomplete statistics data

#Inputs: 
#playerCount = number of players to calculate data for
#matchRows = number of rows of match data to consider
#initMatches = initial number of matches played to calculate features

#Output: Updates the PlayerStats table in place using the pandas .loc indexer and returns the rows with complete statistics

def computePlayerStats(playerCount, matchRows, initMatches):
    playerAge = 27 #Age of player to be considered for initial matches
    
    start = time.time() #compute the start time
    validPlayers = 0 #valid player/data count
    
    #Iterate over each player
    for p_indx, player in PlayerData.head(playerCount).iterrows(): #Search stats for each player in PlayerData Table

        #Initialize cumulative stat data for each player
        FirstServe = 0
        FirstServeWin = 0
        FirstServeCounter = 0

        BPSaved = 0
        BPFaced = 0

        BPChances = 0
        BPWon = 0

        Aces = 0
        DF = 0
        numMatches = 0
        Wins = 0

        top30 = False #Boolean to keep track if the player ever breaks into the top 30
        #This is computationally expensive since we must check every match
        
        #Keep track of win record against top 30 opponents
        top30matches = 0
        top30wins = 0
        
        #Keep track of win record against top 100 opponents
        top100matches = 0
        top100wins = 0

        #Iterate over the match data for each player
        for index, row in ATPData.head(matchRows).iterrows():

            #If player won the match and is under age playerAge
            if(p_indx == row.winner_id and row.winner_age < playerAge):
                #Only compute statistics for the first initMatches matches of the player
                if(numMatches < initMatches):
                    
                    #if(p_indx == 103819): #Federer matches check!
                        #print(player.PName, row.score, row.loser_name)
                        #print(row.WAces, row.WDF, FirstServe, BPSaved, BPFaced)
                        #print(index)
                        
                    #Check whether or not the match was a walkover and has valid stats
                    if(not np.isnan(row.w_ace) and not np.isnan(row.w_df) and not np.isnan(row.w_svpt) and not np.isnan(row.w_bpFaced) and not np.isnan(row.w_bpSaved) and row.w_svpt != 0):
                        Aces += row.w_ace
                        DF += row.w_df
                        FirstServe += (row.w_1stIn / row.w_svpt) #calculate first serve percentage
                        FirstServeWin += (row.w_1stWon/ row.w_1stIn) #calculate first serve win rate
                        FirstServeCounter += 1

                        BPSaved += row.w_bpSaved
                        BPFaced += row.w_bpFaced

                        BPChances += row.l_bpFaced
                        BPWon += (row.l_bpFaced - row.l_bpSaved)
                        numMatches += 1
                        Wins += 1
                        
                        #Increment top30 wins if player beat a top 30 opponent
                        if(row.loser_rank <= 30):
                            top30wins += 1
                            top30matches += 1
                            
                        #Increment top100 wins if player beat a top 100 opponent
                        if(row.loser_rank <= 100):
                            top100wins += 1
                            top100matches += 1
                            
                #Break out of loop once the first initMatches matches have been calculated
                else:
                    break

            #If player lost the match and is under age playerAge
            if(p_indx == row.loser_id and row.loser_age < playerAge):
                #Only compute statistics for the first initMatches matches of the player
                if(numMatches < initMatches):
                    
                    #if(p_indx == 103819):
                        #print(row.winner_name, row.score, player.PName)
                        #print(Aces, DF, FirstServe, BPSaved, BPFaced)

                        
                    #Check whether or not the match was a walkover and has valid stats
                    if(not np.isnan(row.l_ace) and not np.isnan(row.l_df) and not np.isnan(row.l_svpt) and not np.isnan(row.l_bpFaced) and not np.isnan(row.l_bpSaved) and row.l_svpt != 0):
                        Aces += row.l_ace
                        DF += row.l_df
                        FirstServe += (row.l_1stIn / row.l_svpt) #calculate first serve percentage
                        FirstServeWin += (row.l_1stWon/ row.l_1stIn) #calculate first serve win rate
                        FirstServeCounter += 1

                        BPSaved += row.l_bpSaved
                        BPFaced += row.l_bpFaced

                        BPChances += row.w_bpFaced
                        BPWon += (row.w_bpFaced - row.w_bpSaved)
                        numMatches += 1
                        
                        #Increment top30 matches if player lost to a top 30 opponent
                        if(row.winner_rank <= 30):
                            top30matches += 1
                        
                        #Increment top100 matches if player lost to a top 100 opponent
                        if(row.winner_rank <= 100):
                            top100matches += 1
                    
                #Break out of loop once the first initMatches matches have been calculated
                else:
                    break

        
        #Update the cumulative statistics for the Player
        winPercent = 0
        servePercentage = 0
        firstservewinRate = 0
        BPsaveRate = 0
        BPconvRate = 0
        
        top30winPercent = 0
        top100winPercent = 0
        
        #Mark true if the player ever broke into the top 30 in the dataset
        if(player.MaxRank <= 30):
            top30 = True

        #Calculate first serve percentage, win percentage and BP percentages
        if(numMatches != 0 and FirstServeCounter != 0):
            winPercent = Wins/numMatches
            #compute aces and double faults per match
            Aces = Aces/numMatches
            DF = DF/numMatches
            
            servePercentage = FirstServe/FirstServeCounter #average first serve percentage
            firstservewinRate = FirstServeWin/FirstServeCounter #average first serve win rate

            if(BPFaced != 0 and BPChances != 0):
                BPsaveRate = BPSaved/BPFaced
                BPconvRate = BPWon/BPChances
            
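            if(top30matches != 0):
                top30winPercent = top30wins/top30matches

            #Guard the top 100 rate separately: a player may have faced top 100 opponents but no top 30 opponents
            if(top100matches != 0):
                top100winPercent = top100wins/top100matches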
                
            
            #Check that this is a valid entry: the player has played initMatches matches below age playerAge
            if(numMatches == initMatches):
                validPlayers += 1
                
                #Add the cumulative statistics to the database
                print("Updating Cumulative Player Statistics for Player:", player.PName)
                PlayerStats.loc[p_indx] = [p_indx, player.PName, player.Height, numMatches, winPercent, top100winPercent, top30winPercent, servePercentage, firstservewinRate, BPsaveRate, BPconvRate, Aces, DF, top30]
    
    
    #compute runtime
    end = time.time()
    runTime = end - start
    
    print("Player Statistics Generation Runtime =", runTime, 'seconds')
    print("Number of valid players with", initMatches, "matches of initial match data =", validPlayers)
    
    #Drop all rows that have invalid statistics (insufficient matches, stats etc) and return final feature set
    #Copy the valid data with final extracted features to dataframe to be returned
    extractedFeatures = PlayerStats.dropna() 
    
    #Return the cleaned and processed player information
    return extractedFeatures

In [7]:
#Run the statistic generator function
#Specify inputs (total players = 1981, totalRows = 50180) -> Takes about 2.5 hours to run

numPlayers = 1981
numRows = 50180
initMatches = 40

#Call the statistic generator with inputs specified
ATPplayerFeatures = computePlayerStats(numPlayers, numRows, initMatches)

Updating Cumulative Player Statistics for Player: Andrew Ilie
Updating Cumulative Player Statistics for Player: Fernando Gonzalez
Updating Cumulative Player Statistics for Player: Cecil Mamiit
Updating Cumulative Player Statistics for Player: Paradorn Srichaphan
Updating Cumulative Player Statistics for Player: Justin Gimelstob
Updating Cumulative Player Statistics for Player: Jiri Vanek
Updating Cumulative Player Statistics for Player: Paul Goldstein
Updating Cumulative Player Statistics for Player: Nicolas Massu
Updating Cumulative Player Statistics for Player: Michael Russell
Updating Cumulative Player Statistics for Player: Alexander Popp
Updating Cumulative Player Statistics for Player: Andre Sa
Updating Cumulative Player Statistics for Player: Markus Hantschk
Updating Cumulative Player Statistics for Player: Xavier Malisse
Updating Cumulative Player Statistics for Player: Ramon Delgado
Updating Cumulative Player Statistics for Player: Jan Michael Gambill
Updating Cumulative Playe

Updating Cumulative Player Statistics for Player: Radek Stepanek
Updating Cumulative Player Statistics for Player: Flavio Saretta
Updating Cumulative Player Statistics for Player: Robin Soderling
Updating Cumulative Player Statistics for Player: Julien Benneteau
Updating Cumulative Player Statistics for Player: Ricardo Mello
Updating Cumulative Player Statistics for Player: Janko Tipsarevic
Updating Cumulative Player Statistics for Player: Yen Hsun Lu
Updating Cumulative Player Statistics for Player: Jimmy Wang
Updating Cumulative Player Statistics for Player: Alejandro Falla
Updating Cumulative Player Statistics for Player: Thierry Ascione
Updating Cumulative Player Statistics for Player: Rajeev Ram
Updating Cumulative Player Statistics for Player: Karol Beck
Updating Cumulative Player Statistics for Player: Brian Vahaly
Updating Cumulative Player Statistics for Player: Rafael Nadal
Updating Cumulative Player Statistics for Player: David Ferrer
Updating Cumulative Player Statistics fo

Updating Cumulative Player Statistics for Player: Guillaume Rufin
Updating Cumulative Player Statistics for Player: Milos Raonic
Updating Cumulative Player Statistics for Player: Dusan Lajovic
Updating Cumulative Player Statistics for Player: Benoit Paire
Updating Cumulative Player Statistics for Player: Albert Ramos
Updating Cumulative Player Statistics for Player: Jack Sock
Updating Cumulative Player Statistics for Player: Marinko Matosevic
Updating Cumulative Player Statistics for Player: Matthew Ebden
Updating Cumulative Player Statistics for Player: Joao Souza
Updating Cumulative Player Statistics for Player: Steve Johnson
Updating Cumulative Player Statistics for Player: Denis Kudla
Updating Cumulative Player Statistics for Player: Aljaz Bedene
Updating Cumulative Player Statistics for Player: David Goffin
Updating Cumulative Player Statistics for Player: Kenny De Schepper
Updating Cumulative Player Statistics for Player: Dominic Thiem
Updating Cumulative Player Statistics for Pl

In [8]:
#Add Heights(cm) for the missing players
PlayerStats.loc[105311, 'Height(cm)'] = 185
PlayerStats.loc[105138, 'Height(cm)'] = 183
PlayerStats.loc[105807, 'Height(cm)'] = 188
PlayerStats.loc[106233, 'Height(cm)'] = 185
PlayerStats.loc[105526, 'Height(cm)'] = 196
PlayerStats.loc[106210, 'Height(cm)'] = 198
PlayerStats.loc[106401, 'Height(cm)'] = 193
PlayerStats.loc[106432, 'Height(cm)'] = 185

In [9]:
#Use boolean indexing (a mask) to return smaller subsets
q1 = PlayerStats[(PlayerStats['Overall Win%'] >= 0) & (PlayerStats['Height(cm)'] != 200)]

print(len(q1.index))
print(len(PlayerStats.index))
q1

278
1994


Unnamed: 0,PlayerID,PName,Height(cm),Matches Played,Overall Win%,Top 100 Win %,Top 30 Win %,First Serve %,First Serve Win %,BPSave %,BPConv %,Aces/Match,DF/Match,Future Top 30
102776,102776,Andrew Ilie,180,40,0.45,0.37931,0.272727,0.519141,0.714681,0.61442,0.346154,6.15,3.525,False
103602,103602,Fernando Gonzalez,183,40,0.5,0.36,0.285714,0.585753,0.711104,0.608541,0.421642,6.4,7.35,True
102821,102821,Cecil Mamiit,173,40,0.35,0.285714,0.0833333,0.56159,0.671882,0.588889,0.374582,3.65,2.85,False
103387,103387,Paradorn Srichaphan,185,40,0.325,0.16,0.166667,0.547278,0.681467,0.581538,0.383399,5.3,4.525,True
102925,102925,Justin Gimelstob,196,40,0.425,0.391304,0.272727,0.583462,0.735052,0.585761,0.349481,7.9,4.8,False
103181,103181,Jiri Vanek,185,40,0.375,0.28,0.25,0.547442,0.697113,0.599469,0.392523,6.1,3.15,False
102834,102834,Paul Goldstein,178,40,0.475,0.36,0.111111,0.6318,0.644716,0.590769,0.395137,2.325,2.275,False
103454,103454,Nicolas Massu,183,40,0.525,0.4,0.125,0.547837,0.723821,0.581197,0.452915,4.4,2.975,True
103188,103188,Michael Russell,173,40,0.25,0.137931,0,0.683744,0.625503,0.554622,0.384615,2.475,3.3,False
102880,102880,Alexander Popp,201,40,0.425,0.357143,0.222222,0.605206,0.679493,0.600559,0.402516,4.35,5.175,False


In [10]:
ATPplayerFeatures = PlayerStats.dropna()
print(len(ATPplayerFeatures.index))

275


In [11]:
ATPplayerFeatures.head(20)

Unnamed: 0,PlayerID,PName,Height(cm),Matches Played,Overall Win%,Top 100 Win %,Top 30 Win %,First Serve %,First Serve Win %,BPSave %,BPConv %,Aces/Match,DF/Match,Future Top 30
102776,102776,Andrew Ilie,180,40,0.45,0.37931,0.272727,0.519141,0.714681,0.61442,0.346154,6.15,3.525,False
103602,103602,Fernando Gonzalez,183,40,0.5,0.36,0.285714,0.585753,0.711104,0.608541,0.421642,6.4,7.35,True
102821,102821,Cecil Mamiit,173,40,0.35,0.285714,0.0833333,0.56159,0.671882,0.588889,0.374582,3.65,2.85,False
103387,103387,Paradorn Srichaphan,185,40,0.325,0.16,0.166667,0.547278,0.681467,0.581538,0.383399,5.3,4.525,True
102925,102925,Justin Gimelstob,196,40,0.425,0.391304,0.272727,0.583462,0.735052,0.585761,0.349481,7.9,4.8,False
103181,103181,Jiri Vanek,185,40,0.375,0.28,0.25,0.547442,0.697113,0.599469,0.392523,6.1,3.15,False
102834,102834,Paul Goldstein,178,40,0.475,0.36,0.111111,0.6318,0.644716,0.590769,0.395137,2.325,2.275,False
103454,103454,Nicolas Massu,183,40,0.525,0.4,0.125,0.547837,0.723821,0.581197,0.452915,4.4,2.975,True
103188,103188,Michael Russell,173,40,0.25,0.137931,0.0,0.683744,0.625503,0.554622,0.384615,2.475,3.3,False
102880,102880,Alexander Popp,201,40,0.425,0.357143,0.222222,0.605206,0.679493,0.600559,0.402516,4.35,5.175,False


In [12]:
#Export the extracted feature set to a csv file
ATPplayerFeatures.to_csv("ATPPlayerFeatures40.csv", index = False)