# Markov Chain

> Klaassen and Magnus [12] show that points in tennis are approximately independent and identically distributed (iid). This finding allows us to assume that for any point played during the match, the point outcome does not depend on any of the previous points. Let's further assume that we know the probability of each player winning a point on their serve. Namely, let $p$ be the probability that player $A$ wins a point on their serve, and $q$ the probability that player $B$ wins a point on their serve. Using the iid assumption and the point-winning probabilities, we can formulate a Markov chain describing the probability of a player winning a game.

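To make the game-level chain concrete, here is a minimal sketch in plain Python. It is not taken from the quoted paper: the names `game_win_probability` and `win_from` are hypothetical, and it assumes standard scoring (first to four points, win by two). Each state is a score `(a, b)` in points won by server and receiver; deuce is handled with the usual closed form $p^2 / (p^2 + (1-p)^2)$ rather than by iterating the chain.

```python
from functools import lru_cache


def game_win_probability(p: float) -> float:
    """Probability that the server wins a game, given point-win probability p.

    Assumes iid points and standard scoring (first to 4 points, win by 2).
    """
    r = 1.0 - p  # probability that the receiver wins a point

    # From deuce the server must win two points in a row before the
    # receiver does, which gives p^2 / (p^2 + r^2).
    deuce = p * p / (p * p + r * r)

    @lru_cache(maxsize=None)
    def win_from(a: int, b: int) -> float:
        """Server's probability of winning the game from score (a, b)."""
        if a == 3 and b == 3:
            return deuce
        if a == 4:
            return 1.0
        if b == 4:
            return 0.0
        # One step of the chain: the server wins or loses the next point.
        return p * win_from(a + 1, b) + r * win_from(a, b + 1)

    return win_from(0, 0)
```

For example, `game_win_probability(0.6)` evaluates to roughly 0.736, which shows how a modest edge per point is amplified at the game level.
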
## Estimating Serve Winning Probabilities

The paper by Barnett and Clarke describes how to estimate the serve-winning probabilities for matches that have not yet been played, using historical player statistics:

$$
\begin{align}
    f_i &= a_i b_i + (1 - a_i)c_i \\
    g_i &= a_{av} d_i + (1-a_{av}) e_i
\end{align}
$$

where:

$$
\begin{align}
    f_i &= \text{percentage of points won on serve for player }i \\
    g_i &= \text{percentage of points won on return for player }i \\
    a_i &= \text{first serve percentage of player }i \\
    a_{av} &= \text{average first serve percentage (across all players)} \\
    b_i &= \text{first serve win percentage of player }i \\
    c_i &= \text{second serve win percentage of player }i \\
    d_i &= \text{first service return points win percentage of player }i \\
    e_i &= \text{second service return points win percentage of player }i
\end{align}
$$

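As a quick check of the two formulas, the sketch below evaluates them in plain Python. The function names `serve_win_fraction` and `return_win_fraction` are hypothetical; the inputs are the rounded values that this notebook prints for `rafael-nadal` and for the average first serve percentage, so the results are approximate.

```python
def serve_win_fraction(a_i: float, b_i: float, c_i: float) -> float:
    """f_i = a_i * b_i + (1 - a_i) * c_i"""
    return a_i * b_i + (1.0 - a_i) * c_i


def return_win_fraction(a_av: float, d_i: float, e_i: float) -> float:
    """g_i = a_av * d_i + (1 - a_av) * e_i"""
    return a_av * d_i + (1.0 - a_av) * e_i


# Rounded values copied from this notebook's printed output for rafael-nadal.
f_nadal = serve_win_fraction(a_i=0.686934, b_i=0.719071, c_i=0.574091)
g_nadal = return_win_fraction(a_av=0.5806432124494019, d_i=0.343812, e_i=0.554774)

print(f_nadal)  # about 0.674, close to the directly measured point_on_serve_percentage
print(g_nadal)  # about 0.432
```
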
In [1]:
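# Init
from pyspark.sql import SparkSession
# Note: importing sum and max here shadows Python's built-ins of the same name.
from pyspark.sql.functions import col, expr, when, lit, sum, avg, max

# Create (or reuse) the Spark session used by the cells below.
spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

def init_df():
    """Load the match statistics; with no schema given, every column is read as a string."""
    return spark.read \
        .csv("./dataset/all_matches.csv", header=True)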

In [21]:
def average_first_serve_percentage():
    """Return a_av: the unweighted mean, across players, of each player's first serve percentage."""
    df = init_df()
    # Total first serves made and attempted per player.
    df = df.select(["player_id", "first_serve_made", "first_serve_attempted"]) \
        .dropna() \
        .groupBy(['player_id']) \
        .agg(sum('first_serve_made'),
             sum('first_serve_attempted'))
    # Per-player first serve percentage, then the average over all players.
    df = df.withColumn('first_serve_percentage', df['sum(first_serve_made)'] / df['sum(first_serve_attempted)']) \
        .groupBy() \
        .agg(avg('first_serve_percentage').alias('average_first_serve_percentage'))

    return df.collect()[0]['average_first_serve_percentage']

a_av = average_first_serve_percentage()
print(a_av)

0.5806432124494019


In [8]:
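def find_players_statistics(players):
    """Aggregate per-player totals and derive the rates a_i, b_i, c_i, d_i, e_i plus the observed serve rate."""
    df = init_df()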
    df = df.select([
            "player_id", 
            "service_points_won", 
            "service_points_attempted", 
            "return_points_won", 
            "return_points_attempted",
            "first_serve_made", 
            "first_serve_attempted",
            "first_serve_points_made",
            "second_serve_points_made",
            "second_serve_points_attempted",
            "first_serve_return_points_made",
            "first_serve_return_points_attempted",
            "second_serve_return_points_made",
            "second_serve_return_points_attempted"
        ]) \
        .where(df['player_id'].isin(players)) \
        .dropna() \
        .groupBy(['player_id']) \
        .agg(
            sum("service_points_won"), 
            sum("service_points_attempted"), 
            sum("return_points_won"),
            sum("return_points_attempted"),
            sum('first_serve_made'),
            sum('first_serve_attempted'),
            sum('first_serve_points_made'),
            sum('second_serve_points_made'),
            sum('second_serve_points_attempted'),
            sum("first_serve_return_points_made"),
            sum("first_serve_return_points_attempted"),
            sum("second_serve_return_points_made"),
            sum("second_serve_return_points_attempted")
        )
    df = df.withColumn('point_on_serve_percentage', df['sum(service_points_won)'] / df['sum(service_points_attempted)']) \
        .withColumn('point_on_return_percentage', df['sum(return_points_won)'] / df['sum(return_points_attempted)']) \
        .withColumn('first_serve_percentage', df['sum(first_serve_made)'] / df['sum(first_serve_attempted)']) \
        .withColumn('first_serve_win_percentage', df['sum(first_serve_points_made)'] / df['sum(first_serve_made)']) \
        .withColumn('second_serve_win_percentage', df['sum(second_serve_points_made)'] / df['sum(second_serve_points_attempted)']) \
        .withColumn('first_service_return_points_win_percentage', df['sum(first_serve_return_points_made)'] / df['sum(first_serve_return_points_attempted)']) \
        .withColumn('second_service_return_points_win_percentage', df['sum(second_serve_return_points_made)'] / df['sum(second_serve_return_points_attempted)']) \
        .select([
            'player_id',
            # point_on_return_percentage is computed above but not selected here
            'point_on_serve_percentage',
            'first_serve_percentage',
            'first_serve_win_percentage',
            'second_serve_win_percentage',
            'first_service_return_points_win_percentage',
            'second_service_return_points_win_percentage'
        ])

    return df

find_players_statistics(['roger-federer', 'rafael-nadal']).limit(5).toPandas().head()


| | player_id | point_on_serve_percentage | first_serve_percentage | first_serve_win_percentage | second_serve_win_percentage | first_service_return_points_win_percentage | second_service_return_points_win_percentage |
|---|---|---|---|---|---|---|---|
| 0 | rafael-nadal | 0.673683 | 0.686934 | 0.719071 | 0.574091 | 0.343812 | 0.554774 |
| 1 | roger-federer | 0.695963 | 0.620184 | 0.773856 | 0.569101 | 0.325188 | 0.510581 |
