# 1.6 Constructing data: goals_per_game_chi_sq

**date**
: 2021-04-11

**desc**
: Categorise the total goals per game, giving games with 7+ goals the combined category `7+`.
This is done because the expected frequency of every category must be at least 5.

**in**
: "data/out/goal_times.csv"

**out**
: "data/out/goals_per_game_chi_sq.csv"

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import poisson
from src.data import Data

In [2]:
# read data in
df_in = Data.load_goals_per_game()

We calculate the **mean** number of goals scored and the **number of observations** for use in generating the probability mass function.

In [3]:
mean = df_in["total_goal_count"].mean()

In [4]:
size = df_in["total_goal_count"].size

We begin aggregating the data by making a copy and then recoding total goal counts of 7 and 8 as `7+`.

In [5]:
df_out = df_in.copy(deep=True)

In [6]:
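# recode 7 and 8 as "7+"; assign the result back, because an in-place replace
# on a selected column may not update df_out under pandas copy-on-write
df_out["total_goal_count"] = df_out["total_goal_count"].replace({7: "7+", 8: "7+"})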
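
The recoding above lists 7 and 8 explicitly, so a game with 9 or more goals would stay in its own category. A minimal sketch of a threshold-based alternative is shown below; the `toy` frame is made-up data standing in for `df_in`, and using `mask` is an assumption of this sketch rather than the method used in this notebook.

```python
import pandas as pd

# Toy data standing in for df_in (hypothetical values, not the real data set).
toy = pd.DataFrame({"total_goal_count": [0, 1, 2, 7, 8, 9, 3]})

counts = toy["total_goal_count"]

# Cast to object so integers and the "7+" label can share one column,
# then overwrite every count of 7 or more with the pooled label.
recoded = counts.astype(object).mask(counts >= 7, "7+")

print(recoded.tolist())
```

The threshold version gives the same result as the dictionary whenever the largest observed count is 8.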

In [7]:
# aggregate the games by total_goal_count to calculate the observed frequencies
df_chi_sq = df_out.groupby(["total_goal_count"])[["total_goal_count"]].count()

We need to reshape the `DataFrame` so that each category sits in a column `X` next to its observed frequency in a column `Observed`.

In [8]:
# rename the count column
df_chi_sq.rename(columns={"total_goal_count": "Observed"}, inplace=True)

In [9]:
# ungroup the data, preserving the count
df_chi_sq.reset_index(inplace=True)

In [10]:
# rename the goal-count category column to X
df_chi_sq.rename(columns={"total_goal_count": "X"}, inplace=True)
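
For comparison, the group, rename, and reset steps can be sketched as one chained expression. This is an illustrative equivalent on made-up data (`toy` is hypothetical), not a replacement for the cells above.

```python
import pandas as pd

# Toy recoded data: integer counts plus the pooled "7+" label (hypothetical values).
toy = pd.DataFrame({"total_goal_count": [0, 1, 1, 2, 2, 2, "7+"]})

observed = (
    toy.groupby("total_goal_count")
    .size()                                      # number of games per category
    .reset_index(name="Observed")                # category back to a column, count named "Observed"
    .rename(columns={"total_goal_count": "X"})   # category column renamed to X
)

print(observed)
```

The result has the same two columns, `X` and `Observed`, as `df_chi_sq` at this point in the notebook.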

Calculate the probability mass function using **Poisson(mean)**.

Because `7+` is not a number, we calculate its probability as **Pr**(X $\geq$ 7) = 1 - **Pr**(X $\leq$ 6).

In [11]:
# generate range of X
X = np.arange(start=0, stop=7)

In [12]:
# generate the pmf of Poisson(mean)
Pr = list(poisson(mean).pmf(X))

In [13]:
# add the final Pr: Pr(X>=7)
Pr.append(1 - poisson(mean).cdf(6))

In [14]:
# append the Pr to the df
df_chi_sq["Pr"] = Pr
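
Assigning the list to the column relies on `df_chi_sq` having exactly eight rows, ordered 0 to 6 and then `7+`. The sketch below checks the properties the list should have, using a hypothetical mean of 2.7 in place of the value computed from the data. It also uses `sf`, SciPy's survival function, which is equal to `1 - cdf`.

```python
import numpy as np
from scipy.stats import poisson

mean = 2.7  # hypothetical value; the notebook computes this from the data
dist = poisson(mean)

# Pr(X = 0), ..., Pr(X = 6), followed by the pooled tail Pr(X >= 7).
X = np.arange(start=0, stop=7)
Pr = list(dist.pmf(X))
Pr.append(dist.sf(6))  # sf(6) = Pr(X > 6) = Pr(X >= 7)

print("number of categories:", len(Pr))
print("probabilities sum to 1:", bool(np.isclose(sum(Pr), 1.0)))
print("sf(6) matches 1 - cdf(6):", bool(np.isclose(dist.sf(6), 1 - dist.cdf(6))))
```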

Add the expected frequencies: Pr(X=x) * size

In [15]:
# calculate the expected frequencies
df_chi_sq["Expected"] = df_chi_sq["Pr"] * size
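
The description at the top of this notebook requires every expected frequency to be at least 5, which is why the tail is pooled. A minimal sketch of how the pooling point could be checked follows. The values `mean = 2.7` and `size = 380` are hypothetical, and `expected_frequencies` is a helper invented for this illustration.

```python
import numpy as np
from scipy.stats import poisson

mean, size = 2.7, 380  # hypothetical values, not taken from the data


def expected_frequencies(mean, size, last):
    """Expected counts for categories 0..last-1 plus one pooled 'last+' category."""
    dist = poisson(mean)
    pr = np.append(dist.pmf(np.arange(last)), dist.sf(last - 1))
    return pr * size


pooled_at_7 = expected_frequencies(mean, size, last=7)  # 0, ..., 6, 7+
pooled_at_8 = expected_frequencies(mean, size, last=8)  # 0, ..., 7, 8+

print("pooled at 7+, all expected >= 5:", bool((pooled_at_7 >= 5).all()))
print("pooled at 8+, all expected >= 5:", bool((pooled_at_8 >= 5).all()))
```

With the real data, the same check can be run directly on `df_chi_sq["Expected"]`.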

Calculate the chi-squared contributions

In [16]:
df_chi_sq["chi-sq contribution"] = (
    ((df_chi_sq["Observed"] - df_chi_sq["Expected"])**2)/df_chi_sq["Expected"]
)
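
As a hedged sketch of how these contributions might be used later, the code below sums them into a test statistic and computes a p-value. The observed and expected arrays are made up. The degrees of freedom follow the usual goodness-of-fit convention of subtracting one for each estimated parameter (here the Poisson mean); this is an assumption about the later analysis, not something this notebook states.

```python
import numpy as np
from scipy.stats import chi2

# Made-up observed and expected frequencies for three categories.
observed = np.array([10.0, 20.0, 30.0])
expected = np.array([12.0, 18.0, 30.0])

# Per-category contribution: (O - E)^2 / E.
contributions = (observed - expected) ** 2 / expected
statistic = contributions.sum()

# Degrees of freedom: categories - 1 - number of estimated parameters (1, the mean).
dof = len(observed) - 1 - 1
p_value = chi2.sf(statistic, dof)

print(f"chi-squared = {statistic:.4f}, dof = {dof}, p = {p_value:.4f}")
```

For the eight categories in this notebook, the same convention would give 6 degrees of freedom.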

Output the file for later analysis.

In [17]:
# output file
df_chi_sq.to_csv(Data.PATH_OUT + "goals_per_game_chi_sq.csv", index=False)