# Introduction

The simplest way to predict the outcome of a match is to build a matrix holding the win percentage of each player against every other player.
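To make the idea concrete, here is a minimal sketch of such a matrix in plain Python. The player names and records are hypothetical toy data, and the sketch assumes each match is recorded once from each player's perspective, with `'t'`/`'f'` marking a win or a loss. It is an illustration of the idea, not the code used in the rest of this notebook.

```python
from collections import defaultdict

# Hypothetical toy records: (player_id, opponent_id, player_victory).
# Assumption: every match appears twice, once per player's perspective.
matches = [
    ("player-a", "player-b", "t"),
    ("player-b", "player-a", "f"),
    ("player-a", "player-b", "f"),
    ("player-b", "player-a", "t"),
    ("player-a", "player-b", "t"),
    ("player-b", "player-a", "f"),
    ("player-a", "player-c", "t"),
    ("player-c", "player-a", "f"),
]

def winrate_matrix(records):
    """Return {player: {opponent: win rate}} for every pair that has met."""
    wins = defaultdict(int)
    played = defaultdict(int)
    for player, opponent, victory in records:
        played[(player, opponent)] += 1
        if victory == "t":
            wins[(player, opponent)] += 1

    matrix = defaultdict(dict)
    for (player, opponent), n in played.items():
        matrix[player][opponent] = wins[(player, opponent)] / n
    return dict(matrix)

matrix = winrate_matrix(matches)
print(matrix["player-a"]["player-b"])      # win rate of a against b
print(matrix["player-b"].get("player-c"))  # None: b and c never met
```

A nested dictionary is used here instead of a dense array because most pairs of players never meet, so most cells of the matrix would be empty.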

We can try that method using Federer and Nadal.

In [1]:
# Init
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, when, lit

spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

def init_df():
    # Load the full match history; without a schema, every column is read as a string.
    return spark.read \
        .csv("./dataset/all_matches.csv", header=True)

In [2]:
# Winrate
df = init_df()

df = df.select(["player_id", "opponent_id", "player_victory"]) \
    .where(df['player_id'] == 'roger-federer') \
    .where(df['opponent_id'] == 'rafael-nadal') \
    .withColumn("player_victory", \
              when(df["player_victory"] == 't', 1).otherwise(0)) \
    .withColumn('matches', lit(1)) \
    .groupBy(['player_id', 'opponent_id']) \
    .sum()
df = df.withColumn('winrate', df['sum(player_victory)'] / df['sum(matches)'])
    
df.show()

+-------------+------------+-------------------+------------+------------------+
|    player_id| opponent_id|sum(player_victory)|sum(matches)|           winrate|
+-------------+------------+-------------------+------------+------------------+
|roger-federer|rafael-nadal|                 11|          33|0.3333333333333333|
+-------------+------------+-------------------+------------+------------------+



That result tells us that, based on their head-to-head record, **Federer wins 1 out of every 3 matches** against Nadal, so Nadal would be the safer bet.

This is good, but we can do a lot better.

In [3]:
# Winrate per surface
from pyspark.sql.functions import col, expr, when, lit

df = init_df()

df = df.select(["player_id", "opponent_id", "court_surface", "player_victory"]) \
    .where(df['player_id'] == 'roger-federer') \
    .where(df['opponent_id'] == 'rafael-nadal') \
    .withColumn("player_victory", \
              when(df["player_victory"] == 't', 1).otherwise(0)) \
    .withColumn('matches', lit(1)) \
    .groupBy(['player_id', 'opponent_id', 'court_surface']) \
    .sum()
df = df.withColumn('winrate', df['sum(player_victory)'] / df['sum(matches)'])
    
df.show()

+-------------+------------+-------------+-------------------+------------+-------------------+
|    player_id| opponent_id|court_surface|sum(player_victory)|sum(matches)|            winrate|
+-------------+------------+-------------+-------------------+------------+-------------------+
|roger-federer|rafael-nadal|         Clay|                  2|          15|0.13333333333333333|
|roger-federer|rafael-nadal|        Grass|                  2|           3| 0.6666666666666666|
|roger-federer|rafael-nadal|         Hard|                  7|          15| 0.4666666666666667|
+-------------+------------+-------------+-------------------+------------+-------------------+



Here, we see a very different picture. While betting on Nadal is still the better choice overall, **Nadal almost always wins on Clay (13 of 15), the two are about even on Hard, and Federer has the edge on Grass (2 of 3)**.

Those results are much more informative, but they still suffer from a major flaw: a lack of data. We only have 33 matches to work with (and these are two of the most active tennis players of all time), and we have even fewer once we split those matches by court surface.
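One way to see how little these counts tell us is to put a confidence interval around each win rate. The sketch below uses the Wilson score interval, a standard approximation for proportions that this notebook does not otherwise rely on; the helper name `wilson_interval` and the 95% level (`z = 1.96`) are choices made for this illustration only.

```python
import math

def wilson_interval(wins, matches, z=1.96):
    """Approximate Wilson score interval for a win rate (z=1.96 is about 95%)."""
    if matches == 0:
        return (0.0, 1.0)  # no data: the win rate could be anything
    p = wins / matches
    denom = 1 + z**2 / matches
    center = (p + z**2 / (2 * matches)) / denom
    half = z * math.sqrt(p * (1 - p) / matches + z**2 / (4 * matches**2)) / denom
    return (center - half, center + half)

# Win and match counts taken from the two result tables above.
records = [("Overall", 11, 33), ("Clay", 2, 15), ("Grass", 2, 3), ("Hard", 7, 15)]
for label, wins, n in records:
    low, high = wilson_interval(wins, n)
    print(f"{label:8s} {wins:2d}/{n:2d}  [{low:.2f}, {high:.2f}]")
```

With only 3 matches on Grass, the interval is noticeably wider than the overall one and extends well below 0.5, so Federer's apparent advantage on that surface could easily be noise.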

Even worse, what if we want to predict the outcome of a match between two players who have never played each other? Our current approach simply cannot do that. We need to find something better.