
In [72]:
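import cmdstanpy
import pandas as pd
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
import hashlib
import os
import pickle
from cmdstanpy import CmdStanModel, cmdstan_path


def get_stan_model(code):
    """Compile a Stan program with PyStan 2, caching the compiled model on disk.

    Legacy helper: the cells below use CmdStanPy instead.
    """
    import pystan  # PyStan 2 API; imported lazily so the notebook runs without it

    code_hash = hashlib.sha1(code.encode('utf-8')).hexdigest()
    cache_path = './cache/' + code_hash + '.pkl'
    try:
        # Reuse a previously compiled model if one is cached for this code
        with open(cache_path, 'rb') as file:
            model = pickle.load(file)
    except Exception:
        model = pystan.StanModel(model_code=code)  # translate the Stan code to C++ and compile it
        os.makedirs('./cache', exist_ok=True)
        with open(cache_path, 'wb') as file:
            pickle.dump(model, file)
    return model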

In [73]:
df = pd.read_csv('data/golf_data.txt')
new_df = pd.read_csv('data/golf_newdata.txt')

## Testing the fitted model on new data

Recently a local business school professor and golfer, Mark Broadie, came by my office with tons of new data. For simplicity we’ll just look here at the summary data, probabilities of the ball going into the hole for shots up to 75 feet from the hole. The graph below shows these new data (in red), along with our earlier dataset (in blue) and the already-fit geometry-based model from before, extending to the range of the new data.



![title](data/img/checking_alreadyfit.png)

Comparing the two datasets in the range 0-20 feet, the success rate is similar for longer putts but is much higher than before for the short putts. This could be a measurement issue, if the distances to the hole are only approximate for the old data, and it could also be that golfers are better than they used to be.

Beyond 20 feet, the empirical success rates become lower than would be predicted by the old model. These are much more difficult attempts, even after accounting for the increased angular precision required as distance goes up.

## A new model accounting for how hard the ball is hit
To get the ball in the hole, the angle isn’t the only thing you need to control; you also need to hit the ball just hard enough.

Mark Broadie added this to our model by introducing another parameter corresponding to the golfer’s control over distance. Supposing $u$ is the distance that the golfer’s shot would travel if there were no hole, Broadie assumes that the putt will go in if (a) the angle allows the ball to go over the hole, and (b) $u$ is in the range $[x, x+3]$. That is, the ball must be hit hard enough to reach the hole but not go too far. Factor (a) is what we have considered earlier; we must now add factor (b).

The following sketch, which is not to scale, illustrates the need for the distance as well as the angle of the shot to be in some range, in this case the gray zone which represents the trajectories for which the ball would reach the hole and stay in it.

![title](data/img/golf_physical_model.png)

Broadie supposes that a golfer will aim to hit the ball one foot past the hole but with a multiplicative error in the shot’s potential distance, so that $u = (x+1) \cdot (1 + \text{error})$, where the error has a normal distribution with mean 0 and standard deviation $\sigma_{\text{distance}}$. This new parameter $\sigma_{\text{distance}}$ represents the uncertainty in the shot’s relative distance. In statistics notation, this model is,

$$u \sim \text{normal}\left(x+1, (x+1)\,\sigma_{\text{distance}}\right),$$
and the distance is acceptable if $u \in [x, x+3]$, an event that has probability
$$\Phi\left(\frac{2}{(x+1)\,\sigma_{\text{distance}}}\right)-\Phi\left(\frac{-1}{(x+1)\,\sigma_{\text{distance}}}\right)$$
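
The distance term can be evaluated numerically to see how it behaves as the putt gets longer. The following is a minimal sketch rather than part of the original analysis: the helper names `Phi` and `p_distance` are hypothetical, and `sigma_distance = 0.1` is an illustrative value, not a fitted estimate.

```python
import math


def Phi(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_distance(x, sigma_distance):
    """P(x <= u <= x + 3) when u ~ normal(x + 1, (x + 1) * sigma_distance)."""
    sd = (x + 1.0) * sigma_distance
    return Phi(2.0 / sd) - Phi(-1.0 / sd)


# sigma_distance = 0.1 is an assumed value used only for illustration.
for x in (2, 10, 30, 75):
    print(x, round(p_distance(x, 0.1), 3))
```

Because the standard deviation of $u$ grows in proportion to $x+1$ while the acceptable window stays 3 feet wide, this probability shrinks for long putts.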

Putting these together, the probability a shot goes in becomes, 

$$\left(2 \Phi\left(\frac{\sin^{-1}((R-r)/x)}{\sigma_{\text{angle}}}\right)-1\right)\left(\Phi\left(\frac{2}{(x+1)\,\sigma_{\text{distance}}}\right)-\Phi\left(\frac{-1}{(x+1)\,\sigma_{\text{distance}}}\right)\right)$$

where we have renamed the parameter $\sigma$ from our earlier model to $\sigma_{\text{angle}}$ to distinguish it from the new $\sigma_{\text{distance}}$ parameter. We write the new model in Stan, giving it the name `golf_angle_distance_2.stan` to convey that it is the second model in a series, and that it accounts both for angle and distance:

In [74]:

```
data {
  int J;
  int n[J];
  vector[J] x;
  int y[J];
  real r;
  real R;
  real overshot;
  real distance_tolerance;
}
transformed data {
  vector[J] threshold_angle = asin((R-r) ./ x);
}
parameters {
  real<lower=0> sigma_angle;
  real<lower=0> sigma_distance;
}
model {
  vector[J] p_angle = 2*Phi(threshold_angle / sigma_angle) - 1;
  vector[J] p_distance = Phi((distance_tolerance - overshot) ./ ((x + overshot)*sigma_distance)) -
               Phi((- overshot) ./ ((x + overshot)*sigma_distance));
  vector[J] p = p_angle .* p_distance;
  y ~ binomial(n, p);
  [sigma_angle, sigma_distance] ~ normal(0, 1);
}
generated quantities {
  real sigma_degrees = sigma_angle * 180 / pi();
}
```

Here we have defined `overshot` and `distance_tolerance` as data, which Broadie has specified as 1 and 3 feet, respectively. We might wonder why, if the distance range is 3 feet, the overshot is not 1.5 feet. One reason could be that it is riskier to hit the ball too hard than too soft. In addition we assigned weakly informative half-normal(0,1) priors on the scale parameters, $\sigma_{\text{angle}}$ and $\sigma_{\text{distance}}$, which are required in this case to keep the computations stable.
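
To make the model block easier to follow, here is a rough plain-Python mirror of the success probability it computes for a single distance. This is a sketch under stated assumptions: the function name `p_success` and the parameter values in the loop are hypothetical, while `r` and `R` use the same ball and hole radii (in feet) that this notebook passes to Stan.

```python
import math


def Phi(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Ball and hole radii in feet, as in the data dictionary passed to Stan.
r = (1.68 / 2) / 12
R = (4.25 / 2) / 12


def p_success(x, sigma_angle, sigma_distance, overshot=1.0, distance_tolerance=3.0):
    """Probability that a putt from x feet goes in (requires x >= R - r)."""
    # Angle term: the shot angle must fall within +/- threshold_angle.
    threshold_angle = math.asin((R - r) / x)
    p_angle = 2.0 * Phi(threshold_angle / sigma_angle) - 1.0
    # Distance term: the shot must travel between x and x + distance_tolerance.
    sd = (x + overshot) * sigma_distance
    p_dist = Phi((distance_tolerance - overshot) / sd) - Phi(-overshot / sd)
    return p_angle * p_dist


# sigma_angle = 0.02 and sigma_distance = 0.1 are assumed values, not estimates.
for x in (2, 5, 20, 50):
    print(x, round(p_success(x, 0.02, 0.1), 3))
```

Keeping `overshot` and `distance_tolerance` as arguments, as the Stan program does by declaring them as data, makes it easy to try alternatives such as an overshot of 1.5 feet.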

In [79]:
# specify Stan file, create, compile CmdStanModel object
golf_angle_distance_2_path = os.path.join('./stanfile','golf_angle_distance_2.stan')
golf_angle_distance_2_model = CmdStanModel(stan_file=golf_angle_distance_2_path)

INFO:cmdstanpy:compiling stan program, exe file: /Users/hyunjimoon/Dropbox/stan/stan/stanfile/golf_angle_distance_2
INFO:cmdstanpy:compiler options: stanc_options=None, cpp_options=None
INFO:cmdstanpy:compiled model file: /Users/hyunjimoon/Dropbox/stan/stan/stanfile/golf_angle_distance_2


In [84]:
golf_angle_distance_2_data = {
    "J": df.shape[0],
    "n": list(df.loc[:,'n']),
    "x": list(df.loc[:,'x']),
    "y": list(df.loc[:,'y']),
    "r": (1.68/2)/12,
    "R": (4.25/2)/12,
    "overshot": 1,
    "distance_tolerance": 3,
}
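# Run 5 chains, at most 3 at a time (`parallel_chains` replaces the `cores` argument of older CmdStanPy releases)
golf_angle_distance_2_fit = golf_angle_distance_2_model.sample(data=golf_angle_distance_2_data, chains=5, parallel_chains=3)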

INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:start chain 5
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:finish chain 4
INFO:cmdstanpy:finish chain 5


## Fitting the new model to data
We fit the model to the new dataset.

To understand what is happening, we graph the new data and the fitted model, accepting that this “fit,” based as it is on poorly-mixing chains, is only provisional:

In [89]:
new_df.iloc[:5,]

| Unnamed: 0 | x | n | y |
|---|---|---|---|
| 0 | 0.28 | 45198 | 45183 |
| 1 | 0.97 | 183020 | 182899 |
| 2 | 1.93 | 169503 | 168594 |
| 3 | 2.92 | 113094 | 108953 |
| 4 | 3.93 | 73855 | 64740 |


With such large values of $n_j$, the binomial likelihood enforces an extremely close fit at these first few points, and that drives the entire fit of the model.
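
To see how tight these constraints are, one can compute the binomial standard error of each observed proportion for the five rows displayed above. This is an illustrative sketch; the helper name `binomial_se` is hypothetical.

```python
import math

# First five rows of the new dataset as displayed above: (x, n, y).
rows = [
    (0.28, 45198, 45183),
    (0.97, 183020, 182899),
    (1.93, 169503, 168594),
    (2.92, 113094, 108953),
    (3.93, 73855, 64740),
]


def binomial_se(n, y):
    """Standard error of the observed success rate y / n under a binomial model."""
    p_hat = y / n
    return math.sqrt(p_hat * (1.0 - p_hat) / n)


for x, n, y in rows:
    print(f"x={x:5.2f}  p_hat={y / n:.4f}  se={binomial_se(n, y):.5f}")
```

Each standard error is well under one percentage point, so even a small systematic discrepancy between model and data at these distances is heavily penalized by the likelihood.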

To fix this problem we took the data model, $y_j \sim \text{binomial}(n_j, p_j)$, and added an independent error term to each observation. There is no easy way to add error directly to the binomial distribution. We could replace it with its overdispersed generalization, the beta-binomial, but this would not be appropriate here because the variance for each data point $j$ would still be roughly proportional to the sample size $n_j$, and our whole point here is to get away from this assumption and allow for model misspecification. So instead we first approximate the binomial data distribution by a normal and then add independent variance; thus:

$$
y_{j} / n_{j} \sim \text { normal }\left(p_{j}, \sqrt{p_{j}\left(1-p_{j}\right) / n_{j}+\sigma_{y}^{2}}\right)
$$
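
The effect of the extra term can be checked with a few lines of arithmetic. In this sketch the function names and the value `sigma_y = 0.003` are assumptions made for illustration only; they are not estimates from the fitted model.

```python
import math


def sd_binomial_only(p, n):
    """Standard deviation of y / n under the normal approximation to the binomial."""
    return math.sqrt(p * (1.0 - p) / n)


def sd_with_extra_error(p, n, sigma_y):
    """Same standard deviation with an independent error of scale sigma_y added."""
    return math.sqrt(p * (1.0 - p) / n + sigma_y ** 2)


# sigma_y = 0.003 is an assumed value chosen only to illustrate the formula.
sigma_y = 0.003
for n in (100, 10_000, 1_000_000):
    print(n, sd_binomial_only(0.9, n), sd_with_extra_error(0.9, n, sigma_y))
```

As $n_j$ grows, the binomial part of the variance goes to zero but the total standard deviation levels off at $\sigma_y$, so no single data point can be fit more tightly than that.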

To write this in Stan there are some complications:

- `y` and `n` are integer variables, which we convert to vectors so that we can multiply and divide them.
- To perform componentwise multiplication or division using vectors, you need to use `.*` or `./` so that Stan knows not to try to perform vector/matrix multiplication and division. Stan is opposite from R in this way: Stan defaults to vector/matrix operations and has to be told otherwise, whereas R defaults to componentwise operations, and vector/matrix multiplication in R is indicated using the `%*%` operator.

We implement these via the following new code in the transformed data block: