# 7. Constructing data: waiting_time_chi_sq.csv

**date**
: 2021-04-11

**desc**
: Categorise the waiting times between goals into bandings of 10 minutes.
All waiting times over 90 are combined into a single category of `90+`.

**in**
: "data/out/waiting_times_goals.csv"

**out**
: "data/out/waiting_time_chi_sq.csv"

In [8]:
import pandas as pd
import numpy as np
from scipy.stats import expon
from setup.data import Data
from setup.support import add_bandings_poisson_proc

In [9]:
# read data in
df_in = Data.load_waiting_times_goals()

We calculate the **mean** waiting time and the sample size, which are used later to generate the probability (**Pr**) and expected frequency of each time banding.

In [10]:
mean = df_in["waiting_times"].mean()

In [11]:
size = df_in["waiting_times"].count()

We begin to aggregate the data by first making a copy and then adding a new column `time_banding` to the `DataFrame`.
This represents the category of each value in `waiting_times`.

In [12]:
# construct the out dataframe
df_out = df_in.copy(deep=True)

In [13]:
help(add_bandings_poisson_proc)

Help on function add_bandings_poisson_proc in module setup.support:

add_bandings_poisson_proc(x: int) -> str
    Returns banding of X, defined as the nearest multiple of 10.
    
    -  If x > 90, then set x to 90.
    -  Else if x is 0, then set x to 10.
    -  Otherwise, while the remainder of x/10 is not 0, increment
    x by 1.
    
    @param, x, int
        an integer representing the waiting time before a goal is
        scored.
    
    @return, int
        the nearest greater multiple of 10
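
The docstring describes the rule but not the code behind it. Below is a minimal sketch of one way such a banding function could be written. `banding_sketch` is a hypothetical name, not the function in `setup.support`, and the `90+` label for values over 90 is an assumption taken from the output table rather than from the docstring.

```python
def banding_sketch(x: int) -> str:
    """Hypothetical stand-in for add_bandings_poisson_proc."""
    if x > 90:
        # assumption: values over 90 share one open-ended category
        return "90+"
    if x == 0:
        # a wait of 0 minutes falls in the first banding
        return "10"
    # round up to the next multiple of 10 using integer arithmetic
    return str(((x + 9) // 10) * 10)


for minutes in (0, 7, 10, 11, 90, 91):
    print(minutes, "->", banding_sketch(minutes))
```

Rounding up with integer arithmetic gives the same result as the increment-until-divisible loop the docstring describes, without iterating.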



In [14]:
# code the values, banding them in groups of 10
df_out["time_banding"] = df_out["waiting_times"].apply(add_bandings_poisson_proc)

In [15]:
# aggregate the time_bandings to calculate the observed frequencies
df_chi_sq = df_out.groupby(["time_banding"])[["time_banding"]].count()

We need to transform the `DataFrame` so it can be used further.

In [16]:
# rename the count column
df_chi_sq.rename(columns={"time_banding": "Observed"}, inplace=True)

In [17]:
# ungroup the data, preserving the count
df_chi_sq.reset_index(inplace=True)
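
Taken together, the group, rename and reset steps amount to tallying how many waiting times fall in each banding. A library-free sketch of the same tally is shown here, using made-up banding labels rather than the notebook's data.

```python
from collections import Counter

# made-up banding labels, one per waiting time
bandings = ["10", "20", "10", "90+", "20", "10"]

# count how often each banding occurs (the "Observed" column)
observed = Counter(bandings)

# one (time_banding, Observed) row per category, in banding order
rows = sorted(observed.items())
for time_banding, count in rows:
    print(time_banding, count)
```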

In [18]:
df_chi_sq

Unnamed: 0,time_banding,Observed
0,10,277
1,20,213
2,30,169
3,40,119
4,50,76
5,60,65
6,70,48
7,80,27
8,90,17
9,90+,61


Finally, we calculate the probability of a goal occurring in each time banding, so **Pr**(0 $\leq$ X $\leq$ 10), **Pr**(10 $<$ X $\leq$ 20), etc., ending with **Pr**(X $>$ 90).

*This is an awkward algorithm that I think would benefit from making less abstract! Ed.*

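We first declare $M(\lambda)$, the exponential model with rate $\lambda$.
Note that the `scale` argument of `expon` expects $s = 1/\lambda$, which for an exponential distribution is both the mean and the **standard deviation**.
Since $\lambda$ is estimated by $1/\>\overline{x}$, it follows that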

$$s = \frac{1}{1/\>\overline{x}} = \overline{x}.$$

In [19]:
# declare m(rate)
m = expon(scale=mean)

In [20]:
# Generate W (10, 20, 30, etc.)
W = np.arange(start=10, stop=91, step=10)

In [21]:
# instantiate the list
Pr = list()

In [22]:
# Need to add F(10) directly, as the first banding has no preceding F(W) to subtract
Pr.append(m.cdf(10))

In [23]:
# next subtract from each F(W) the preceding F(W) to calculate the Pr a goal
# is scored in that interval
for k in range(1, W.size):
    Pr.append(m.cdf(W[k]) - m.cdf(W[k-1]))

In [24]:
# add the final Pr(X > 90)
Pr.append(1 - m.cdf(90))

In [25]:
df_chi_sq["Pr"] = Pr
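
To make the steps above less abstract, the same banding probabilities can be reproduced from the closed-form exponential CDF, $F(x) = 1 - e^{-x/\overline{x}}$. This is a sketch using only the standard library. The mean of 30 minutes is an illustrative value, not the one computed from the data, and `exp_cdf` and `banding_probabilities` are hypothetical helper names.

```python
import math


def exp_cdf(x, mean):
    """CDF of an exponential distribution with the given mean (scale)."""
    return 1.0 - math.exp(-x / mean)


def banding_probabilities(mean, edges):
    """Pr for (0, e1], (e1, e2], ..., plus the open-ended tail beyond e_last."""
    cdf = [0.0] + [exp_cdf(e, mean) for e in edges]
    # the difference of consecutive CDF values is each interval's probability
    probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
    # whatever probability remains lies beyond the last edge
    probs.append(1.0 - cdf[-1])
    return probs


edges = list(range(10, 91, 10))  # 10, 20, ..., 90
probs = banding_probabilities(mean=30.0, edges=edges)  # illustrative mean
print(len(probs))  # one probability per banding, including 90+
print(sum(probs))  # should be 1 up to floating-point error
```

Because the bandings partition the whole range of waiting times, the probabilities should sum to 1, which is a useful check before computing expected frequencies.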

Add the expected frequencies, `Pr * size`.

In [26]:
# calculate the expected observations
df_chi_sq["Expected"] = df_chi_sq["Pr"] * size

Calculate the chi-squared contributions

In [27]:
df_chi_sq["chi-sq contribution"] = (
    ((df_chi_sq["Observed"] - df_chi_sq["Expected"])**2)/df_chi_sq["Expected"]
)
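
Each contribution is $(O_i - E_i)^2 / E_i$, and their sum is the chi-squared test statistic. A small sketch with made-up observed counts and probabilities (not the values from this notebook) shows the arithmetic. `chi_sq_contributions` is a hypothetical helper.

```python
def chi_sq_contributions(observed, probabilities):
    """Return (expected, contributions) for a goodness-of-fit table."""
    size = sum(observed)
    # expected frequency of each category under the model
    expected = [p * size for p in probabilities]
    # squared deviation from the expectation, scaled by the expectation
    contributions = [
        (o - e) ** 2 / e for o, e in zip(observed, expected)
    ]
    return expected, contributions


# made-up example: 3 categories, 100 observations
observed = [50, 30, 20]
probabilities = [0.5, 0.25, 0.25]
expected, contributions = chi_sq_contributions(observed, probabilities)
print(expected)            # [50.0, 25.0, 25.0]
print(contributions)       # [0.0, 1.0, 1.0]
print(sum(contributions))  # 2.0
```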

In [28]:
# output file
df_chi_sq.to_csv(Data.PATH_OUT + "waiting_time_chi_sq.csv", index=False)