## Classification

As mentioned in the project introduction, one of the best-known ways in which tennis players differ from one another is their proficiency on a given court surface, with Federer being the best on grass courts and Nadal the best on clay courts.

Naturally, knowing a player's preferred surface gives us considerable insight into how likely that player is to win a given match.

## Intro

First, let us create the Spark session and define a helper that loads the match data into a DataFrame.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

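def get_df():
    # Load every match, treating the first CSV row as the header.
    return spark.read \
        .csv("./dataset/all_matches.csv", header=True)

# The case studies below query `df`, so load it here.
df = get_df()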

## Case study 1: Roger Federer

We start by analyzing Roger Federer's matches.

In [20]:
rdf = df.where(df['player_id'] == "roger-federer")

In [25]:
total_matches = rdf.count()
print(f'Total matches: {total_matches}')

matches_won = rdf.where(rdf['player_victory'] == 't').count()
print(f'Matches won: {matches_won}')

print(f'Win ratio: {round(matches_won/total_matches, 2)}')

Total matches: 1470
Matches won: 1194
Win ratio: 0.81


In [30]:
clay_matches = rdf.where(rdf['court_surface'] == 'Clay').count()
print(f'Total clay matches: {clay_matches}')

clay_matches_won = rdf.where(rdf['court_surface'] == 'Clay') \
    .where(rdf['player_victory'] == 't') \
    .count()
print(f'Clay matches won: {clay_matches_won}')

print(f'Win ratio: {round(clay_matches_won/clay_matches, 2)}')

Total clay matches: 350
Clay matches won: 263
Win ratio: 0.75


In [32]:
grass_matches = rdf.where(rdf['court_surface'] == 'Grass').count()
print(f'Total grass matches: {grass_matches}')

grass_matches_won = rdf.where(rdf['court_surface'] == 'Grass') \
    .where(rdf['player_victory'] == 't') \
    .count()
print(f'Grass matches won: {grass_matches_won}')

print(f'Win ratio: {round(grass_matches_won/grass_matches, 2)}')

Total grass matches: 211
Grass matches won: 184
Win ratio: 0.87


In [35]:
hard_matches = rdf.where(rdf['court_surface'] == 'Hard').count()
print(f'Total hard matches: {hard_matches}')

hard_matches_won = rdf.where(rdf['court_surface'] == 'Hard') \
    .where(rdf['player_victory'] == 't') \
    .count()
print(f'Hard matches won: {hard_matches_won}')

print(f'Win ratio: {round(hard_matches_won/hard_matches, 2)}')

Total hard matches: 871
Hard matches won: 720
Win ratio: 0.83
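
The three cells above repeat the same two counts for each surface. As a sketch of the same logic in a single pass, the plain-Python helper below tallies matches and wins per surface. It is not Spark code: `win_ratio_by_surface` and the sample rows are hypothetical, and only the column names `court_surface` and `player_victory` (with `'t'` marking a win) come from the dataset used above.

```python
from collections import defaultdict


def win_ratio_by_surface(rows):
    """Return {surface: (played, won, ratio)} for an iterable of match rows."""
    tally = defaultdict(lambda: [0, 0])  # surface -> [played, won]
    for row in rows:
        counts = tally[row["court_surface"]]
        counts[0] += 1
        # Any value other than 't' is treated as a loss (an assumption).
        if row["player_victory"] == "t":
            counts[1] += 1
    return {
        surface: (played, won, round(won / played, 2))
        for surface, (played, won) in tally.items()
    }


# Hypothetical rows, shaped like the dataset's columns.
sample_rows = [
    {"court_surface": "Grass", "player_victory": "t"},
    {"court_surface": "Grass", "player_victory": "t"},
    {"court_surface": "Clay", "player_victory": "t"},
    {"court_surface": "Clay", "player_victory": "f"},
    {"court_surface": "Hard", "player_victory": "f"},
]

for surface, (played, won, ratio) in sorted(win_ratio_by_surface(sample_rows).items()):
    print(f"{surface}: {won}/{played} won, ratio {ratio}")
```

In the notebook the rows could come from something like `rdf.collect()`, which is only reasonable here because a single player's matches number in the low thousands.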


## Case study 2: Rafael Nadal

In [36]:
ndf = df.where(df['player_id'] == "rafael-nadal")

In [38]:
total_matches = ndf.count()
print(f'Total matches: {total_matches}')

matches_won = ndf.where(ndf['player_victory'] == 't').count()
print(f'Matches won: {matches_won}')

print(f'Win ratio: {round(matches_won/total_matches, 2)}')

Total matches: 1188
Matches won: 977
Win ratio: 0.82


In [42]:
clay_matches = ndf.where(ndf['court_surface'] == 'Clay').count()
print(f'Total clay matches: {clay_matches}')

clay_matches_won = ndf.where(ndf['court_surface'] == 'Clay') \
    .where(ndf['player_victory'] == 't') \
    .count()
print(f'Clay matches won: {clay_matches_won}')

print(f'Win ratio: {round(clay_matches_won/clay_matches, 2)}')

Total clay matches: 526
Clay matches won: 459
Win ratio: 0.87


In [46]:
grass_matches = ndf.where(ndf['court_surface'] == 'Grass').count()
print(f'Total grass matches: {grass_matches}')

grass_matches_won = ndf.where(ndf['court_surface'] == 'Grass') \
    .where(ndf['player_victory'] == 't') \
    .count()
print(f'Grass matches won: {grass_matches_won}')

print(f'Win ratio: {round(grass_matches_won/grass_matches, 2)}')

Total grass matches: 105
Grass matches won: 83
Win ratio: 0.79


In [45]:
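# Count Nadal's hard-court matches from ndf, his own DataFrame.
hard_matches = ndf.where(ndf['court_surface'] == 'Hard').count()
print(f'Total hard matches: {hard_matches}')

hard_matches_won = ndf.where(ndf['court_surface'] == 'Hard') \
    .where(ndf['player_victory'] == 't') \
    .count()
print(f'Hard matches won: {hard_matches_won}')

print(f'Win ratio: {round(hard_matches_won/hard_matches, 2)}')

Total hard matches: 871
Hard matches won: 428
Win ratio: 0.49

The output above was recorded with the total computed from `rdf`, so 871 is Federer's hard-court total and the 0.49 ratio is not Nadal's. Nadal's 1188 matches minus 526 on clay and 105 on grass leave at most 557 on hard courts, so his hard-court win ratio is at least 428/557 ≈ 0.77.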
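
To tie the two case studies back to the opening claim, the sketch below recomputes the win ratios from the counts printed above and picks each player's strongest surface. The dictionary layout and the `best_surface` helper are illustrative assumptions. Nadal's hard-court figures are left out because the recorded total of 871 is identical to Federer's and so cannot be trusted as his own.

```python
# (matches won, matches played), copied from the outputs printed above.
recorded = {
    "roger-federer": {"Clay": (263, 350), "Grass": (184, 211), "Hard": (720, 871)},
    # Hard is omitted for Nadal: its recorded total matches Federer's 871.
    "rafael-nadal": {"Clay": (459, 526), "Grass": (83, 105)},
}


def best_surface(counts):
    """Return (surface, ratio) for the surface with the highest win ratio."""
    ratios = {surface: won / played for surface, (won, played) in counts.items()}
    surface = max(ratios, key=ratios.get)
    return surface, round(ratios[surface], 2)


for player, counts in recorded.items():
    surface, ratio = best_surface(counts)
    print(f"{player}: best on {surface} ({ratio})")
```

On these counts Federer's highest ratio is on grass and Nadal's is on clay, which agrees with the claim in the introduction, at least for the surfaces with usable totals.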


# Matchup Matrix

A naïve way to predict the outcome of a match is to build a matrix containing each player's win rate against every other player.

In [4]:
# Matchups
players = ['roger-federer', 'rafael-nadal', 'novak-djokovic', 'andy-murray']
players.sort()

df = get_df()
rdd = (df.select(["player_id", "opponent_id", "player_victory"])
    .where(df['player_id'].isin(players))
    .where(df['opponent_id'].isin(players))
    .rdd
    .map(lambda x: ((x['player_id'], x['opponent_id']), (1, 1 if x['player_victory'] == 't' else 0)))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) # Count total matches played and matches won
    .map(lambda x: (x[0][0], {x[0][1]: x[1][1] / x[1][0]})) # Win rate against this opponent
    .reduceByKey(lambda a, b: { **a, **b }) # Merge the per-opponent win rates into one dict per player
    .sortByKey())

# Matchup matrix
df2 = spark.createDataFrame(rdd, ["player", "results"])
for player in sorted(players):
    df2 = df2.withColumn(player, df2['results'][player])

df2 = df2.drop('results')
df2.show()

+--------------+------------------+-------------------+-------------------+------------------+
|        player|       andy-murray|     novak-djokovic|       rafael-nadal|     roger-federer|
+--------------+------------------+-------------------+-------------------+------------------+
|   andy-murray|              null|0.29411764705882354|0.34782608695652173|               0.5|
|novak-djokovic|0.7058823529411765|               null| 0.5217391304347826|             0.525|
|  rafael-nadal|0.6521739130434783| 0.4782608695652174|               null|0.6666666666666666|
| roger-federer|               0.5|              0.475| 0.3333333333333333|              null|
+--------------+------------------+-------------------+-------------------+------------------+
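
The `map`/`reduceByKey` pipeline above can be hard to follow on a first read. As a rough plain-Python sketch of the same steps, the function below counts matches and wins per (player, opponent) pair and then nests the win rates by player. `matchup_matrix`, `head_to_head`, and the sample rows are hypothetical. Looking up a pair that never met returns `None`, mirroring the `null` cells in the table.

```python
from collections import defaultdict


def matchup_matrix(rows, players):
    """Return {player: {opponent: win rate}} restricted to `players`."""
    tally = defaultdict(lambda: [0, 0])  # (player, opponent) -> [played, won]
    for row in rows:
        pair = (row["player_id"], row["opponent_id"])
        if pair[0] in players and pair[1] in players:
            tally[pair][0] += 1
            tally[pair][1] += 1 if row["player_victory"] == "t" else 0
    matrix = defaultdict(dict)
    for (player, opponent), (played, won) in tally.items():
        matrix[player][opponent] = won / played
    return dict(matrix)


def head_to_head(matrix, player, opponent):
    """Win rate of `player` against `opponent`, or None if they never met."""
    return matrix.get(player, {}).get(opponent)


# Hypothetical rows; each match is assumed to appear once from each
# player's perspective, as the symmetric table above suggests.
rows = [
    {"player_id": "a", "opponent_id": "b", "player_victory": "t"},
    {"player_id": "b", "opponent_id": "a", "player_victory": "f"},
    {"player_id": "a", "opponent_id": "b", "player_victory": "f"},
    {"player_id": "b", "opponent_id": "a", "player_victory": "t"},
    {"player_id": "a", "opponent_id": "b", "player_victory": "t"},
    {"player_id": "b", "opponent_id": "a", "player_victory": "f"},
]
matrix = matchup_matrix(rows, players={"a", "b", "c"})
print(head_to_head(matrix, "a", "b"))  # a won 2 of 3 against b
print(head_to_head(matrix, "a", "c"))  # a and c never met
```

The `None` result is exactly the case in which a head-to-head matrix has nothing to say about a match.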



This approach is unfortunately very limited: it ignores important factors such as the court surface, and it only works if the two players have already faced each other.

Ideally, we would like to predict the outcome of a match even when the two players have never played each other before.