# 1.7 Constructing data: uniform_goals_chi_sq.csv

**date**
: 2021-04-12

**desc**
: Categorise goals by the minute in which they were scored, using 10-minute bandings.

**in**
: "data/out/epl_1819.csv.csv"

**out**
: "data/out/uniform_goals_chi_sq.csv"

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import expon
from src.data import Data
from src.uniformgoals import UniformGoals
from src.list_comprehension import ListComprehension

In [2]:
# read data in
df_in = Data.load_epl_1819_game_id()

In [3]:
df_in.head()

Unnamed: 0,total_goal_count,home_team_goal_timings,away_team_goal_timings,game_id
0,3,383,90'2,0
1,3,11,818,1
2,2,"24,90'1",,2
3,2,,4179,3
4,3,,344580,4


We begin preparing the data by replacing the `NaN` values with a placeholder string, and then extracting `home_team_goal_timings` and `away_team_goal_timings` as lists.
We then combine the two lists.

In [4]:
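# replace NaN values with the placeholder string "NaN"
# (assign back rather than using inplace=True on a column selection)
df_in["home_team_goal_timings"] = df_in["home_team_goal_timings"].fillna("NaN")
df_in["away_team_goal_timings"] = df_in["away_team_goal_timings"].fillna("NaN")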

In [5]:
home = df_in["home_team_goal_timings"].to_list()

In [6]:
away = df_in["away_team_goal_timings"].to_list()

In [7]:
# combine the home and away timings into a single list
goals: list = home + away

Transform: `list(str) -> list(list(str))`

In [8]:
goal_times: list = ListComprehension.comma_separated_str_to_lst_uniform_dist(
    goals
)
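
`ListComprehension.comma_separated_str_to_lst_uniform_dist` is a project helper whose source is not shown here. The following is only a minimal sketch of the transformation described, assuming each entry is split on commas and the `"NaN"` placeholder becomes an empty list. The name `split_goal_timings` and the treatment of `"NaN"` are assumptions for illustration.

```python
def split_goal_timings(entries: list[str]) -> list[list[str]]:
    """Split each comma-separated timing string into a list of strings.

    Hypothetical stand-in for the project's helper. The "NaN" placeholder
    (games without goals for that side) is assumed to map to an empty list.
    """
    result: list[list[str]] = []
    for entry in entries:
        if entry == "NaN":
            result.append([])
        else:
            # strip stray whitespace around each timing
            result.append([part.strip() for part in entry.split(",")])
    return result


example = split_goal_timings(["24,90'1", "NaN", "11"])
```

On this toy input, the injury-time entry `90'1` is kept as a string at this stage, since it is only filtered out in a later step.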

Remove goals scored in injury time, which are the goal times containing `'` (for example `90'1`).

In [9]:
cleaned_goal_times: list = list()

for a_game_goal_times in goal_times:
    cleaned_game_goal_times: list = list()
    
    for a_goal_time in a_game_goal_times:
        if "'" not in a_goal_time:
            cleaned_game_goal_times.append(a_goal_time)

    cleaned_goal_times.append(cleaned_game_goal_times)

Transform: `list(list(str)) -> list(list(int))`

In [10]:
int_goal_times = ListComprehension.list_str_to_list_int(
    cleaned_goal_times
)

Transform: `list(list(int)) -> list(int)`

In [11]:
final_goal_times: list = list()

# flatten the per-game lists into a single list of goal times
for lst_game_goal_times in int_goal_times:
    for a_goal_time in lst_game_goal_times:
        final_goal_times.append(a_goal_time)
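
The filter, convert and flatten steps can also be sketched as one nested comprehension. This is an illustrative equivalent run on toy input, not the project's `ListComprehension` helper. The function name `flatten_regular_time_goals` is hypothetical, and the sketch assumes every entry without a `'` is a plain digit string.

```python
def flatten_regular_time_goals(goal_times: list[list[str]]) -> list[int]:
    """Drop injury-time entries (containing "'"), convert to int, and flatten."""
    return [
        int(goal_time)
        for game in goal_times          # one inner list per game and side
        for goal_time in game           # each timing within that list
        if "'" not in goal_time         # skip injury-time goals such as "90'1"
    ]


example = flatten_regular_time_goals([["24", "90'1"], [], ["11", "81"]])
```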

We now construct a new `DataFrame` from `final_goal_times`, ready for aggregation.

In [12]:
# construct the out dataframe
df_chi_sq = pd.DataFrame(data=final_goal_times, columns=["goal_times"])

We now encode the banding and then aggregate the counts in each banding.

In [13]:
# code the values, banding them in groups of 10 minutes
df_chi_sq["time_banding"] = df_chi_sq["goal_times"].apply(UniformGoals.get_banding)
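
The source of `UniformGoals.get_banding` is not shown. The `time_banding` values in this section run from 10 to 90, which suggests each band is labelled by its upper bound. The sketch below assumes that convention (minutes 1 to 10 map to 10, 11 to 20 map to 20, and so on). The boundary handling and the range check are assumptions, not the project's confirmed behaviour.

```python
import math


def get_banding(minute: int) -> int:
    """Return the upper bound of the 10-minute band containing `minute`.

    Assumed convention: 1-10 -> 10, 11-20 -> 20, ..., 81-90 -> 90.
    """
    if not 1 <= minute <= 90:
        # assumption: regular-time goals fall in minutes 1 to 90
        raise ValueError(f"minute out of range: {minute}")
    return math.ceil(minute / 10) * 10


bands = [get_banding(m) for m in (1, 10, 11, 45, 90)]
```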

In [15]:
# aggregate the time_bandings to calculate the observed frequencies
df_chi_sq = df_chi_sq.groupby(["time_banding"])[["time_banding"]].count()

We tidy the grouped `DataFrame` by renaming the count column and restoring `time_banding` as an ordinary column.

In [16]:
# rename the count column
df_chi_sq.rename(columns={"time_banding": "Observed"}, inplace=True)

In [17]:
# ungroup the data, preserving the count
df_chi_sq.reset_index(inplace=True)
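
The count, rename and reset steps can also be chained into a single expression. The sketch below uses toy data and `size()` instead of `count()`. The two agree here because the grouping key itself is being counted, but in general `size()` counts rows while `count()` skips missing values. The variable names `toy` and `observed` are illustrative.

```python
import pandas as pd

# toy banded goal times standing in for df_chi_sq["time_banding"]
toy = pd.DataFrame({"time_banding": [10, 20, 10, 90, 20, 10]})

observed = (
    toy.groupby("time_banding")
    .size()                 # number of rows in each banding
    .rename("Observed")     # name the resulting Series
    .reset_index()          # turn the banding index back into a column
)
```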

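The frame displayed below was captured after the remaining cells had run, so it already includes the `Pr`, `Expected` and `chi-sq contribution` columns added in the steps that follow.

In [27]: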
df_chi_sq

Unnamed: 0,time_banding,Observed,Pr,Expected,chi-sq contribution
0,10,82,0.111111,108.666667,6.543967
1,20,116,0.111111,108.666667,0.494888
2,30,108,0.111111,108.666667,0.00409
3,40,98,0.111111,108.666667,1.047035
4,50,95,0.111111,108.666667,1.718814
5,60,112,0.111111,108.666667,0.102249
6,70,133,0.111111,108.666667,5.448875
7,80,119,0.111111,108.666667,0.982618
8,90,115,0.111111,108.666667,0.369121


Add the probability (`Pr`) of a goal falling in each banding.
As we are modelling with the discrete uniform distribution over nine bandings, this is 1/9.

In [19]:
df_chi_sq["Pr"] = 1/9

Add the expected frequencies: `Pr` multiplied by the total number of observed goals.

In [24]:
# calculate the expected observations
df_chi_sq["Expected"] = df_chi_sq["Pr"] * df_chi_sq["Observed"].sum()
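
As a worked check, the `Observed` column in the table displayed in this section sums to 978 goals, so under the uniform model every banding has the same expected frequency:

```latex
E_i = n \cdot \Pr(\text{banding } i) = 978 \times \frac{1}{9} \approx 108.67,
\qquad i = 1, \dots, 9
```

This agrees with the `Expected` value of 108.666667 shown for each row.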

Calculate the chi-squared contributions: `(Observed - Expected)**2 / Expected` for each banding.

In [26]:
df_chi_sq["chi-sq contribution"] = (
    ((df_chi_sq["Observed"] - df_chi_sq["Expected"])**2)/df_chi_sq["Expected"]
)
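
As a cross-check that does not depend on pandas, the contributions can be recomputed from the observed counts in the table above. The sketch also sums them into the chi-squared test statistic, a step this notebook does not perform itself. The variable names are illustrative.

```python
# observed frequencies per 10-minute banding, copied from the table above
observed = [82, 116, 108, 98, 95, 112, 133, 119, 115]

n_bands = len(observed)
expected = sum(observed) / n_bands  # uniform model: total goals / 9

# per-banding contribution: (O - E)^2 / E
contributions = [(o - expected) ** 2 / expected for o in observed]

# the test statistic is the sum of the contributions
chi_sq_statistic = sum(contributions)

# no parameters are estimated, so degrees of freedom = bandings - 1
degrees_of_freedom = n_bands - 1
```

The first contribution matches the 6.543967 shown in the table, and the statistic would be compared against a chi-squared distribution with 8 degrees of freedom.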

In [28]:
# output file
df_chi_sq.to_csv(Data.PATH_OUT + "uniform_goal_chi_sq.csv", index=False)