## We are going to attempt to answer the question of whether the hot-hands effect exists in tennis, using first and second serves. We will look at the question backwards, though, by checking whether a cold-hands effect exists. We want to answer the following questions:

### 1. What is a player's chance of faulting their first serve, if their last first serve was a fault? 
### 2. What is a player's chance of double faulting, if they double faulted the last point?
### 3. What is a player's chance of an ace, if they had an ace the last point?

### This amounts to calculating three conditional probabilities. For question 1 it is P(F|LF), where F is a fault on the current first serve and LF is a fault on the last first serve (LF for last-fault).
### The "cold hands" effect exists if P(F|LF) > P(F|NF) by a statistically significant amount (say 5 Gaussian sigma), where P(F|NF) is the probability of a fault on the first serve given that the last first serve went in (NF for no-fault).
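As a minimal sketch of the comparison just described, the two conditional probabilities can be estimated by counting consecutive pairs of first serves. The function name `conditional_fault_rates`, the 1 = in / 0 = fault encoding, and the simple two-proportion z-score used as the significance measure are all assumptions made for illustration, not the final analysis.

```python
import math

def conditional_fault_rates(serves):
    """Estimate P(F|LF) and P(F|NF) from a sequence of first serves.

    serves: sequence of 0/1 values, where 1 = first serve in, 0 = fault
    (an assumed encoding for this sketch).
    """
    faults_after_fault = pairs_after_fault = 0
    faults_after_in = pairs_after_in = 0
    # Walk over consecutive (previous, current) pairs of serves.
    for prev, curr in zip(serves[:-1], serves[1:]):
        if prev == 0:
            pairs_after_fault += 1
            faults_after_fault += (curr == 0)
        else:
            pairs_after_in += 1
            faults_after_in += (curr == 0)
    p_f_lf = faults_after_fault / pairs_after_fault
    p_f_nf = faults_after_in / pairs_after_in
    # Binomial standard error of each proportion, combined in quadrature.
    sigma = math.sqrt(p_f_lf * (1 - p_f_lf) / pairs_after_fault
                      + p_f_nf * (1 - p_f_nf) / pairs_after_in)
    return p_f_lf, p_f_nf, sigma

toy_serves = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1]  # made-up sequence
p_f_lf, p_f_nf, sigma = conditional_fault_rates(toy_serves)
# Number of sigma separating the two conditional probabilities.
z_score = (p_f_lf - p_f_nf) / sigma if sigma > 0 else float("nan")
```

This sketch does not guard against a sequence with no faults (or no successful serves), in which case one of the denominators would be zero.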

## An important point: We are going to use mock data to set up and verify the analysis, then we will dig up a real dataset that is well suited to the question.
### A couple of notes on why I made this choice. First, it will allow me to weed out any bugs by working with a dataset where I know the answer. Second, it will remove the possibility of biasing the analysis by tuning it on real data. Third, although these datasets definitely exist, which one to use is not immediately apparent.
### At the pro level, the probability of an ace (and of a double fault) is low (a typical number of aces for a player in a given year is only 500, https://www.atptour.com/en/stats/aces/2019/all/all/ ). So questions 2 and 3 might be exceedingly difficult to investigate with confidence, because we will be dealing with small-number statistics, i.e., our constraints on the conditional probabilities will be weak, and making any kind of definitive conclusion will be hard. Looking at datasets at the college or high school level would probably be more informative (simply because the hot and cold hands effects, if they exist, are probably accentuated there), but such datasets will be harder to come by.
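To make the small-number problem concrete, here is a rough sketch of how the binomial uncertainty on an estimated probability scales with the number of qualifying points. The probability 0.1 and the sample sizes are illustrative placeholders, not measured values, and `proportion_sigma` is a hypothetical helper name.

```python
import math

def proportion_sigma(p, n):
    """Binomial standard error of a proportion p estimated from n trials."""
    return math.sqrt(p * (1 - p) / n)

# Illustrative values only: a probability of 0.1 estimated from
# a few hundred versus a few thousand qualifying points.
for n in (500, 5000):
    print(n, proportion_sigma(0.1, n))
```

Because the error shrinks only as 1/sqrt(n), ten times more data tightens the constraint by a factor of only about 3.2.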

### So, without further ado, let's create this mock dataset, focusing on question 1 because:
### Question 1 is likely to have the largest number of data points in the real dataset, and naively, the cold hands effect here should be just as present as it is in 2 and 3.

## Section 1, generating a dataset that will be replaceable by real data at the end.

Generally, top pro players make about 65 percent of their first serves and 90 percent of their second serves. So we will generate, say, 50 players whose aggregate first serve percentages are random values between 50 and 70%. This is pretty representative of the top 50 players. https://www.atptour.com/en/stats/1st-serve/2019/all/all/

We are including the complicating factor here of multiple players because we will certainly need to aggregate over players in the real data in order to build up the sample size required.

We want to be intelligent about how we structure this dataset. As with all questions, how you structure your data can be extremely hindering or extremely enlightening. I personally prefer hefty data structures that are easy to navigate. One could do this with a SQL table or something similar, but I will use astropy tables because of their many grouping and sorting abilities.

In [5]:
from astropy.table import Table
import numpy as np

num_players = 50
first_serve_chance_range = (0.5, 0.7)
# Use num_players (not a hard-coded 50) so both columns always have the same length.
player_table = {'player_id': np.arange(num_players),
                'first_serve_chance': np.round(np.random.uniform(low=min(first_serve_chance_range),
                                                                 high=max(first_serve_chance_range),
                                                                 size=num_players), 3)}
player_table = Table(player_table)

In [7]:
player_table[:5]

player_id,first_serve_chance
int64,float64
0,0.607
1,0.581
2,0.625
3,0.648
4,0.542


Great, now we have some random players with identification integers 0, 1, ..., 49 (in the real dataset we would replace the players' names with these integers), and each has some aggregate first serve percentage. In the real data, this would be fetched from something like the ATP website https://www.atptour.com/en/stats/1st-serve/2019/all/all/ .

Now, each player will play the same number of matches in a year to within an order of magnitude (e.g., somewhere between 40 and 80). The exact number differs from player to player, though, and we will need to take this into account when we assess the statistics, so we want to add this complication to the mock data as well. This will be the last complication we add.

In [8]:
serves_per_match = 150
num_serves_range = (serves_per_match * 40, serves_per_match * 80)

player_table['num_serves_on_record'] = np.random.randint(low=min(num_serves_range), high=max(num_serves_range), size=len(player_table))

In [9]:
player_table[:5]

player_id,first_serve_chance,num_serves_on_record
int64,float64,int64
0,0.607,11075
1,0.581,8946
2,0.625,8815
3,0.648,6772
4,0.542,10927
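Since players differ both in serve percentage and in the number of serves on record, one hedged way to account for this later is to estimate the cold-hands difference per player and combine the results with inverse-variance weights, rather than pooling all serves into one long sequence. (Pooling players with different fault rates can by itself make P(F|LF) look larger than P(F|NF), because a fault is more likely to have come from a player who faults often.) The function name `combine_inverse_variance` and the input numbers below are hypothetical.

```python
import math

def combine_inverse_variance(estimates, sigmas):
    """Combine independent estimates using inverse-variance weights.

    Returns the weighted mean and its standard error.
    """
    weights = [1.0 / s ** 2 for s in sigmas]
    total_weight = sum(weights)
    mean = sum(w * x for w, x in zip(weights, estimates)) / total_weight
    return mean, math.sqrt(1.0 / total_weight)

# Hypothetical per-player differences P(F|LF) - P(F|NF) and their uncertainties.
# Players with more serves on record have smaller sigmas and so carry more weight.
player_diffs = [0.12, 0.08, 0.10]
player_sigmas = [0.02, 0.02, 0.01]
combined_diff, combined_sigma = combine_inverse_variance(player_diffs, player_sigmas)
```

The combined uncertainty is smaller than any single player's, which is the reason for aggregating over players in the first place.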


## Now we have all the aggregate data in hand to generate a mock dataset. We will make a dataset class to generate these fake data, so that we can run the analysis for two different magnitudes of the cold-hands effect:
Recall that we define P(F|LF) as the probability of faulting a first serve given that the last first serve was also a fault, and P(F|NF) as the probability of faulting a first serve given that the last first serve was not a fault.

## A cold-hands effect of 10% -- meaning P(F|LF) - P(F|NF) = 0.1
## And a cold-hands effect of 0% -- meaning P(F|LF) - P(F|NF) = 0.0

### The goal is that, through our analysis later on, we can recover both instances of the cold-hands effect.
### Remember that this is all statistical. If we succeed, we won't recover exactly P(F|LF) - P(F|NF) = 0.1 for case 1; we will recover some posterior estimate that is consistent with (i.e., within about 1 sigma of) 0.1.
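One possible way to turn a chosen magnitude into the two conditional probabilities is to require that the long-run fault rate still equals the player's aggregate fault rate. This constraint is an assumption; nothing above fixes it. Writing f = 1 - first_serve_chance for the aggregate fault rate and m for the cold-hands magnitude (both symbols introduced here for illustration):

$$
f = f\,P(F|LF) + (1-f)\,P(F|NF), \qquad P(F|LF) - P(F|NF) = m
$$

$$
\Rightarrow\quad P(F|NF) = f\,(1-m), \qquad P(F|LF) = f\,(1-m) + m
$$

For example, f = 0.35 and m = 0.1 give P(F|NF) = 0.315 and P(F|LF) = 0.415, while m = 0 reduces both to f.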

### We will make these mock data such that the data are already split by set. We wouldn't expect as large a cold/hot hands effect to carry over from one service set to the next (which could be separated by tens of minutes), so we want our data and analysis to reflect that.
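A minimal sketch of what splitting by set means for the counting: consecutive-serve pairs are only formed inside a set, so the last serve of one set is never paired with the first serve of the next. The helper name `count_pairs_within_sets` and the list-of-lists layout are assumptions for illustration.

```python
def count_pairs_within_sets(sets_of_serves):
    """Count (previous, current) first-serve pairs without crossing set boundaries.

    sets_of_serves: list of lists of 0/1 values (1 = in, 0 = fault; assumed layout).
    Returns a dict keyed by (previous, current) outcome.
    """
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for serves in sets_of_serves:
        # zip stops at the end of each set, so no pair spans two sets.
        for prev, curr in zip(serves[:-1], serves[1:]):
            counts[(prev, curr)] += 1
    return counts

toy_sets = [[1, 0, 0], [0, 1], [1]]  # made-up data: three short sets
pair_counts = count_pairs_within_sets(toy_sets)
```

With these toy sets only three pairs are counted, whereas treating the same six serves as one continuous sequence would give five.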

We are going to simulate a set of serves using a Monte Carlo simulation. This is not the fastest way to do this in terms of computation time, but it is reliable and very simple to code up. Simple = less chance for bugs, so we are going to go with this method.

After we code up the monte carlo method, we will write a class for the MockData to keep everything organized and accessible in a nice object-oriented way.



In [None]:
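def simulate_serves(num_serves, first_serve_chance, cold_hands_magnitude):
    """
    Generate a long list of first serve statistics. 1 means the first serve went in, 0 means a fault.
    """
    # The very first serve has no history, so draw it from the aggregate rate.
    serve_chance = [1 - first_serve_chance, first_serve_chance]
    serve_data = [np.random.choice([0, 1], p=serve_chance)]
    # Assumption: choose P(F|NF) and P(F|LF) so that their difference equals
    # cold_hands_magnitude while the long-run fault rate stays at 1 - first_serve_chance.
    fault_after_in = (1 - first_serve_chance) * (1 - cold_hands_magnitude)
    fault_after_fault = fault_after_in + cold_hands_magnitude
    for _ in range(int(num_serves) - 1):
        # The fault probability depends only on the outcome of the previous serve.
        fault_chance = fault_after_fault if serve_data[-1] == 0 else fault_after_in
        serve_data.append(int(np.random.random() >= fault_chance))
    return np.array(serve_data)


def split_serves_into_sets(serve_data, num_serves_per_set):
    """
    Tennis is played in discrete sets. This function will split the long 1d list of service statistics into discrete sets.
    """
    # Assumption: a trailing set with fewer than num_serves_per_set serves is kept.
    return [serve_data[i:i + num_serves_per_set]
            for i in range(0, len(serve_data), num_serves_per_set)]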

In [None]:
class MockData(object):
    def __init__(self, player_table, serves_per_set=5, cold_hands_magnitude=0.1, match_data=None):
        self.player_table = player_table
        self.serves_per_set = serves_per_set
        self.cold_hands_magnitude = cold_hands_magnitude
        if match_data is None:
            self.serve_data = self.init_serve_data()
        else:
            # Assumption: supplied data uses the same {player_id: list of sets} layout.
            self.serve_data = match_data

    def init_serve_data(self):
        """Simulate per-set first serve data for every player, keyed by player_id."""
        serve_data = {}
        for row in self.player_table:
            # Convert to plain Python scalars before simulating.
            serves = simulate_serves(int(row['num_serves_on_record']),
                                     float(row['first_serve_chance']),
                                     self.cold_hands_magnitude)
            serve_data[int(row['player_id'])] = split_serves_into_sets(serves, self.serves_per_set)
        return serve_data

    def __call__(self, player_id):
        return self.serve_data[player_id]

        
        