# Euroleague Foul Drawn Analysis & Application of Hypothesis Testing

## Project Description

<div style="text-align: justify;">
The goal of this project is to examine historical Euroleague data on fouls drawn per season, offer some insights, and investigate whether the data show any specific patterns. Bad refereeing calls cannot be proven from these data, and claiming otherwise would undermine the scientific approach of this project. However, we can investigate patterns related to the referees' general decisions in each season. Additionally, the project gives the opportunity to study some unusual cases that have rarely occurred over the years.
</div>

## Imported Datasets Description

<div style="text-align: justify;">
Each Excel dataset includes the 50 most fouled players of the corresponding Euroleague season. For instance, "fouls_23_24.xlsx" includes the 50 most fouled players of the Euroleague 2023-2024 season. The datasets contain raw foul totals; no normalization was applied prior to this project. The columns "Player", "Team", "Fouled (Total)", "Games" were acquired from https://basketnews.com/leagues/25-euroleague/statistics.html. The remaining columns, i.e., "Minutes per Game" and "Position", were filled in manually from Euroleague's official site https://www.euroleaguebasketball.net/euroleague/?geoip=disabled.
</div>

## Ensure the Correct Environment - Import Libraries - Read Excel Datafiles

In [None]:
# Check the environment the jupyter server is running in:
import sys
sys.executable

In [None]:
# Import Libraries
from scipy.stats import t
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
# Create the list of Excel files:
excel_datafiles = ["fouls_23_24.xlsx", "fouls_22_23.xlsx", "fouls_21_22.xlsx", "fouls_20_21.xlsx", "fouls_19_20.xlsx"]

# Read each Excel file into a Pandas DataFrame (df):
df_list = [pd.read_excel(file) for file in excel_datafiles]

# Create the df with most fouled players of all seasons combined:
combined_df = pd.concat(df_list, ignore_index=True)

# Create the 2023-2024 season df:
df_2324 = pd.read_excel(excel_datafiles[0])

# Create the 2022-2023 season df:
df_2223 = pd.read_excel(excel_datafiles[1])

# Create the 2021-2022 season df:
df_2122 = pd.read_excel(excel_datafiles[2])

# Create the 2020-2021 season df:
df_2021 = pd.read_excel(excel_datafiles[3])

# Create the 2019-2020 season df:
df_1920 = pd.read_excel(excel_datafiles[4])
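Note that `combined_df` does not record which season each row came from. A minimal sketch of one way to keep that information follows; the helper `combine_seasons` and the "Season" column are hypothetical additions, not part of the original notebook, and the demo rows are made up.

```python
import pandas as pd


def combine_seasons(frames_by_season):
    """Concatenate per-season DataFrames, tagging every row with its season.

    `frames_by_season` maps a season label (e.g. "2023-2024") to that
    season's DataFrame. The "Season" column is a hypothetical addition.
    """
    tagged = [df.assign(Season=season) for season, df in frames_by_season.items()]
    return pd.concat(tagged, ignore_index=True)


# Tiny in-memory stand-ins for two of the Excel files (illustrative values only):
demo = combine_seasons({
    "2023-2024": pd.DataFrame({"Player": ["A", "B"], "Games": [30, 28]}),
    "2022-2023": pd.DataFrame({"Player": ["C"], "Games": [34]}),
})
print(demo)
```

With the real files, the dictionary values would be the DataFrames returned by `pd.read_excel`.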

## Euroleague 2023-2024 Season

### Data Cleaning and Feature Engineering

In [None]:
# Split the "minutes_per_game" column to two columns:
df_2324[["minutes", "seconds"]] = df_2324["Minutes per Game"].str.split("k", expand=True)

# Change the dtypes of the newly created columns:
df_2324["minutes"] = df_2324["minutes"].astype(int)
df_2324["seconds"] = df_2324["seconds"].astype(int)

# Create the final column showing the average player time per game in seconds:
df_2324["Seconds per Game"] = (df_2324["minutes"]*60 + df_2324["seconds"]).astype("int64")

# Drop the unnecessary columns:
df_2324 = df_2324.drop(columns=["Minutes per Game", "minutes", "seconds"])

# Normalize fouls per game by creating the "fouls per game" column:
df_2324["Fouled per Game"] = df_2324["Fouled (Total)"] / df_2324["Games"]

# Normalize fouls per 40 minutes (2400 seconds) by creating the "fouls per 40 mins" column:
df_2324["Fouled per 40 Minutes"] = round(2400 * df_2324["Fouled per Game"] / df_2324["Seconds per Game"], 2)

In [None]:
# Check the above cell's code:
print(df_2324.info())
df_2324.head()

The code is correct. The same code will be used for the other data frames as well.
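Since the same steps are repeated for every season, they could be wrapped in a single function. The sketch below is one possible way to do that, not the notebook's own code: the name `add_normalized_columns` and the `sep` parameter are assumptions, and the default separator simply mirrors the split used in the cell above.

```python
import pandas as pd


def add_normalized_columns(df, sep="k"):
    """Return a copy of `df` with the game- and time-normalized foul columns.

    `sep` is the separator between minutes and seconds in "Minutes per Game";
    the default mirrors the split used in the notebook cell.
    """
    out = df.copy()
    # Split "MM<sep>SS" into two integer columns (0 = minutes, 1 = seconds).
    parts = out["Minutes per Game"].str.split(sep, expand=True).astype(int)
    out["Seconds per Game"] = parts[0] * 60 + parts[1]
    out = out.drop(columns=["Minutes per Game"])
    out["Fouled per Game"] = out["Fouled (Total)"] / out["Games"]
    # 40 minutes = 2400 seconds.
    out["Fouled per 40 Minutes"] = (
        2400 * out["Fouled per Game"] / out["Seconds per Game"]
    ).round(2)
    return out


# Illustrative row only (not taken from the datasets):
sample = pd.DataFrame({
    "Player": ["Player X"],
    "Fouled (Total)": [120],
    "Games": [30],
    "Minutes per Game": ["20k00"],
})
result = add_normalized_columns(sample)
print(result)
```

Because the function works on a copy, the input DataFrame is left unchanged, so the same raw frame can be processed more than once without errors about missing columns.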

### Exploratory Data Analysis and Descriptive Statistics

In [None]:
# Present some basic statistics:
df_2324.describe().transpose()

In [None]:
# Select the numerical columns:
num_cols_2324 = df_2324.select_dtypes(include=["int64", "float64"])

# Display a correlation matrix for the numerical columns:
num_cols_2324.corr()

#### Correlation Matrix Observations


<div style="text-align: justify">
It is worth noting that there is a weak negative linear correlation between the time-normalized fouls per 40 minutes and both the number of games and the seconds per game. This indicates that as a player participates in more games throughout the season and spends more time on the court, his fouls drawn per 40 minutes tend to decrease. In a basketball context, this could be due to factors such as fatigue, opponent adaptation, and changes in referees' decisions over the season or over the minutes a player spends on the court.
</div>

<div style="text-align: justify">
Another interesting observation is a weak negative correlation indicating that as a player's height increases (as reflected by the "Position" column), his time on the court decreases. For anyone who watched the Euroleague 2023-2024 season, this is not surprising: coaches made a notable effort to rotate their tall players, especially the centers. A notable exception was Panathinaikos AKTOR Athens, where Mathias Lessort carried the center position on his shoulders throughout the season.
</div>
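Because the correlations discussed above are weak, it may help to check whether they are distinguishable from zero. The sketch below uses `scipy.stats.pearsonr`, which returns the Pearson coefficient together with a two-sided p-value. The helper name `correlation_with_pvalue` and the toy values are illustrative assumptions, not results from the datasets.

```python
import pandas as pd
from scipy.stats import pearsonr


def correlation_with_pvalue(df, x_col, y_col):
    """Return (Pearson r, two-sided p-value) for two numeric columns."""
    r, p_value = pearsonr(df[x_col], df[y_col])
    return float(r), float(p_value)


# Toy data only: foul rate falling as playing time rises.
toy = pd.DataFrame({
    "Seconds per Game": [900, 1100, 1300, 1500, 1700, 1900],
    "Fouled per 40 Minutes": [8.1, 7.4, 6.9, 6.0, 5.8, 4.9],
})
r, p_value = correlation_with_pvalue(toy, "Seconds per Game", "Fouled per 40 Minutes")
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
```

Applied to the notebook, the first argument would be a season DataFrame such as `df_2324`. A small p-value would suggest that the correlation is unlikely to be zero, although with only 50 players per season a weak coefficient may well fail this test.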

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=130)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_2324["Fouled (Total)"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(80, 281, 20))
plt.title(
    "Boxplot 1: Distribution of players' total fouls received during the season 2023-2024.\n"
    "The green dashed line represents the mean, whereas the whiskers of this boxplot\n"
    "range from 86 to 149 total fouls.")
plt.xlabel("Total Fouls Received");

plt.savefig('boxplot.png', dpi=300, bbox_inches='tight')

<div style="text-align: justify">
Boxplot 1 indicates that 6 players are identified as outliers based on the total fouls drawn during the season. One of these players exhibits an extreme deviation from the overall distribution. Since these observations represent raw data, further analysis is required before drawing definitive conclusions. The players identified as outliers include Mathias Lessort, Mike James, Tornike Shengelia, Facundo Campazzo, and Wade Baldwin IV. Additionally, the distribution displays clear right skewness. In basketball, where exceptional players may be very effective at drawing fouls, such right skewness and a wide spread toward the higher end of the figure are not unexpected.
</div>

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=130)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_2324["Fouled per Game"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(2, 7, 1))
plt.title(
    "Boxplot 2: Distribution of players' fouls received per game during the season 2023-2024.\n"
    "The whiskers of this boxplot range from 2.15 to 5.51 fouls per game.")

plt.xlabel("Fouls Received per Game");

plt.savefig('boxplot2.png', dpi=300, bbox_inches='tight')

<div style="text-align: justify">
The importance of normalization is evident in Boxplot 2. Boxplot 1, which used raw data, identified six outliers, one of which deviated extremely from the rest of the data. Boxplot 2, which represents game-normalized fouls drawn, shows only one outlier, and it is not as extreme as before. The player still identified as an outlier is Mathias Lessort.
</div>

#### Critical Boxplot: Time-Normalized Fouls (Fouls Drawn per 40 Minutes) 2023-2024 Season

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=130)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_2324["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 10, 1))
plt.title(
    "Boxplot 3: Distribution of players' fouls received per 40 minutes. The whiskers of this\n"
    "boxplot range from 3.23 to 8.72 fouls per 40 minutes.")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot3.png', dpi=300, bbox_inches='tight')

<div style="text-align: justify">
Boxplot 3 represents the distribution of time-normalized fouls drawn per 40 minutes. It shows that while the outlier remains, it is even less extreme than before. Additionally, the further normalization brings the distribution closer to a normal distribution. The player still identified as an outlier is Mathias Lessort.
</div>

<div style="text-align: justify">
An interesting indicator of the quality of the referees' general approach in each Euroleague season may be the range between the maximum and minimum values of the "Fouled per 40 Minutes" column, excluding outliers. A small range may indicate a consistent approach to foul calls: the smaller the range, the less the most fouled players of the season are treated differently from one another. Additionally, the presence of outliers in the above boxplot may reflect negatively on the referees' approach during the season, especially when the outlier has spent as much time on the court and played as many games as Mathias Lessort.
</div>

For the 2023-2024 season, the range is $5.49$ fouls per 40 minutes and there is an identified outlier.
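The whisker range can also be computed numerically instead of being read off the plot. The sketch below assumes the usual boxplot convention (the default `whis=1.5` in matplotlib and seaborn): whiskers extend to the most extreme observations within 1.5 × IQR of the quartiles, and anything beyond them is an outlier. The function name `whisker_range` and the toy sample are illustrative.

```python
import numpy as np


def whisker_range(values, whis=1.5):
    """Return (low whisker, high whisker, range, outliers) for a 1-D sample."""
    data = np.sort(np.asarray(values, dtype=float))
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    outliers = data[(data < lower_fence) | (data > upper_fence)]
    low, high = inside.min(), inside.max()
    return low, high, high - low, outliers.tolist()


# Toy sample: nine ordinary values and one extreme value.
low, high, spread, outliers = whisker_range([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(low, high, spread, outliers)
```

Applied to the notebook, the input would be a column such as `df_2324["Fouled per 40 Minutes"]`.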

In [None]:
# Create a new df without the "Lessort" observation (assumes he is the first row, index 0):
df_no_lessort = df_2324.drop(index=0)

# Create a table with means per position excluding Lessort from the dataset:
df_no_lessort.groupby("Position").mean(numeric_only=True).transpose()

<div style="text-align: justify">
The data in this table support the possibility of an outlier originating from position 5, given its higher mean of fouls drawn per 40 minutes. However, as stated before, the high mean for position 5 players could be due to increased rotation and hence decreased time on the court. Beyond the weak linear correlations of fouls drawn per 40 minutes with playing time and with the number of games, the next sections show how limited time on the court or a small number of games played affects the fouls drawn per 40 minutes, which in some cases can become very high.
</div>

## Euroleague 2022-2023 Season

### Data Cleaning and Feature Engineering

In [None]:
# Split the "minutes_per_game" column to two columns:
df_2223[["minutes", "seconds"]] = df_2223["Minutes per Game"].str.split("k", expand=True)

# Change the dtypes of the newly created columns:
df_2223["minutes"] = df_2223["minutes"].astype(int)
df_2223["seconds"] = df_2223["seconds"].astype(int)

# Create the final column showing the average player time per game in seconds:
df_2223["Seconds per Game"] = (df_2223["minutes"]*60 + df_2223["seconds"]).astype("int64")

# Drop the unnecessary columns:
df_2223 = df_2223.drop(columns=["Minutes per Game", "minutes", "seconds"])

# Normalize fouls per game by creating the "fouls per game" column:
df_2223["Fouled per Game"] = df_2223["Fouled (Total)"] / df_2223["Games"]

# Normalize fouls per 40 minutes (2400 seconds) by creating the "fouls per 40 mins" column:
df_2223["Fouled per 40 Minutes"] = round(2400 * df_2223["Fouled per Game"] / df_2223["Seconds per Game"], 2)

### Exploratory Data Analysis and Descriptive Statistics

In [None]:
# Present some basic statistics of the df:
df_2223.describe().transpose()

In [None]:
# Select the numerical columns:
num_cols_2223 = df_2223.select_dtypes(include=["int64", "float64"])

# Display a correlation matrix for the numerical columns:
num_cols_2223.corr()

<div style="text-align: justify">
The same patterns are observed in this correlation matrix as well. Fouls drawn per 40 minutes tend to decrease the more minutes a player spends on the court and the more games he plays. There is a decent amount of rotation among tall players as well. However, the correlation matrix shows that tall players tend to participate in more games than shorter players. This may be due to injuries affecting the shorter, most fouled players, since a player's position would not normally be expected to influence the number of games he plays.
</div>

#### Critical Boxplot: Time-Normalized Fouls (Fouls Drawn per 40 Minutes) 2022-2023 Season

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=130)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_2223["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 10, 1))
plt.title(
    "Boxplot 4: Distribution of players' fouls received per 40 minutes. The whiskers of this\n"
    "boxplot range from 3.83 to 8.37 fouls per 40 minutes.")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot4.png', dpi=300, bbox_inches='tight')

For the 2022-2023 season, the range is $4.54$ fouls per 40 minutes and there is no identified outlier.

## Euroleague 2021-2022 Season

### Data Cleaning and Feature Engineering

In [None]:
# Split the "minutes_per_game" column to two columns:
df_2122[["minutes", "seconds"]] = df_2122["Minutes per Game"].str.split("k", expand=True)

# Change the dtypes of the newly created columns:
df_2122["minutes"] = df_2122["minutes"].astype(int)
df_2122["seconds"] = df_2122["seconds"].astype(int)

# Create the final column showing the average player time per game in seconds:
df_2122["Seconds per Game"] = (df_2122["minutes"]*60 + df_2122["seconds"]).astype("int64")

# Drop the unnecessary columns:
df_2122 = df_2122.drop(columns=["Minutes per Game", "minutes", "seconds"])

# Normalize fouls per game by creating the "fouls per game" column:
df_2122["Fouled per Game"] = df_2122["Fouled (Total)"] / df_2122["Games"]

# Normalize fouls per 40 minutes (2400 seconds) by creating the "fouls per 40 mins" column:
df_2122["Fouled per 40 Minutes"] = round(2400 * df_2122["Fouled per Game"] / df_2122["Seconds per Game"], 2)

### Exploratory Data Analysis and Descriptive Statistics

In [None]:
# Present some basic statistics of the df:
df_2122.describe().transpose()

In [None]:
# Select the numerical columns:
num_cols_2122 = df_2122.select_dtypes(include=["int64", "float64"])

# Display a correlation matrix for the numerical columns:
num_cols_2122.corr()

The same observations as before can be made here.

#### Critical Boxplot: Time-Normalized Fouls (Fouls Drawn per 40 Minutes) 2021-2022 Season

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=130)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_2122["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 10, 1))
plt.title(
    "Boxplot 5: Distribution of players' fouls received per 40 minutes. The whiskers of this\n"
    "boxplot range from 3.14 to 6.76 fouls per 40 minutes.")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot5.png', dpi=300, bbox_inches='tight')

<div style="text-align: justify">
Nikola Ivanovic has been identified as an outlier in the analysis. Ranked 30th among the most fouled players and having averaged only 15:21 minutes per game, Ivanovic's value deserves a closer look. Specifically, he very often drew approximately 10 or more fouls per 40 minutes while staying between 9 and 17 minutes on the court. His high foul rate appears to be influenced by his limited time on the court, so his fouls-per-40-minutes value would probably not be the same if he spent more minutes on the court. It is therefore likely that Ivanovic's high foul rate results from his adaptation to the game (after approximately 8-9 minutes on the court) combined with his limited playing time (< 17 minutes). He may be highly skilled at drawing fouls when playing less than 20 minutes. However, being such an extreme outlier even on time-normalized fouls raises questions about the referees' decisions, even if Ivanovic is highly skilled at drawing fouls.
</div>

For the 2021-2022 season, the range is $3.62$ fouls per 40 minutes and there is an identified outlier.

## Euroleague 2020-2021 Season

### Data Cleaning and Feature Engineering

In [None]:
# Split the "minutes_per_game" column to two columns:
df_2021[["minutes", "seconds"]] = df_2021["Minutes per Game"].str.split("k", expand=True)

# Change the dtypes of the newly created columns:
df_2021["minutes"] = df_2021["minutes"].astype(int)
df_2021["seconds"] = df_2021["seconds"].astype(int)

# Create the final column showing the average player time per game in seconds:
df_2021["Seconds per Game"] = (df_2021["minutes"]*60 + df_2021["seconds"]).astype("int64")

# Drop the unnecessary columns:
df_2021 = df_2021.drop(columns=["Minutes per Game", "minutes", "seconds"])

# Normalize fouls per game by creating the "fouls per game" column:
df_2021["Fouled per Game"] = df_2021["Fouled (Total)"] / df_2021["Games"]

# Normalize fouls per 40 minutes (2400 seconds) by creating the "fouls per 40 mins" column:
df_2021["Fouled per 40 Minutes"] = round(2400 * df_2021["Fouled per Game"] / df_2021["Seconds per Game"], 2)

### Exploratory Data Analysis and Descriptive Statistics

In [None]:
# Present some basic statistics of the df:
df_2021.describe().transpose()

In [None]:
# Select the numerical columns:
num_cols_2021 = df_2021.select_dtypes(include=["int64", "float64"])

# Display a correlation matrix for the numerical columns:
num_cols_2021.corr()

The same observations as before can be made here.

#### Critical Boxplot: Time-Normalized Fouls (Fouls Drawn per 40 Minutes) 2020-2021 Season

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=130)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_2021["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 10, 1))
plt.title(
    "Boxplot 6: Distribution of players' fouls received per 40 minutes. The whiskers of this\n"
    "boxplot range from 3.39 to 8.77 fouls per 40 minutes.")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot6.png', dpi=300, bbox_inches='tight')

For the 2020-2021 season, the range is $5.38$ fouls per 40 minutes and there is no identified outlier.

## Euroleague 2019-2020 Season

### Data Cleaning and Feature Engineering

In [None]:
# Split the "minutes_per_game" column to two columns:
df_1920[["minutes", "seconds"]] = df_1920["Minutes per Game"].str.split("k", expand=True)

# Change the dtypes of the newly created columns:
df_1920["minutes"] = df_1920["minutes"].astype(int)
df_1920["seconds"] = df_1920["seconds"].astype(int)

# Create the final column showing the average player time per game in seconds:
df_1920["Seconds per Game"] = (df_1920["minutes"]*60 + df_1920["seconds"]).astype("int64")

# Drop the unnecessary columns:
df_1920 = df_1920.drop(columns=["Minutes per Game", "minutes", "seconds"])

# Normalize fouls per game by creating the "fouls per game" column:
df_1920["Fouled per Game"] = df_1920["Fouled (Total)"] / df_1920["Games"]

# Normalize fouls per 40 minutes (2400 seconds) by creating the "fouls per 40 mins" column:
df_1920["Fouled per 40 Minutes"] = round(2400 * df_1920["Fouled per Game"] / df_1920["Seconds per Game"], 2)

### Exploratory Data Analysis and Descriptive Statistics

In [None]:
# Present some basic statistics of the df:
df_1920.describe().transpose()

In [None]:
# Select the numerical columns:
num_cols_1920 = df_1920.select_dtypes(include=["int64", "float64"])

# Display a correlation matrix for the numerical columns:
num_cols_1920.corr()

The same observations as before can be made here.

#### Critical Boxplot: Time-Normalized Fouls (Fouls Drawn per 40 Minutes) 2019-2020 Season

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=130)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_1920["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 10, 1))
plt.title(
    "Boxplot 7: Distribution of players' fouls received per 40 minutes. The whiskers of this\n"
    "boxplot range from 3.44 to 9.1 fouls per 40 minutes.")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot7.png', dpi=300, bbox_inches='tight')

<div style="text-align: justify">
Arturas Gudaitis has a "fouls per 40 minutes" rate of 9.1. Although he isn't classified as an outlier, this value is the highest observed so far and needs further investigation. The case is probably similar to Ivanovic's: Gudaitis ranked 30th among the 50 most fouled players of the season, playing in only 19 games with an average of only 16:53 minutes per game. His stats reveal that he drew 33 of his 73 total fouls in just 5 games, in which he played approximately 17.5 minutes per game. These 5 games significantly affect his overall "fouls per 40 minutes" rate, pushing it to such a high level. Gudaitis's high rate would probably be lower if he had played more games or spent more minutes on the court.
</div>
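The figures quoted above can be checked with the notebook's own normalization formula. The sketch below is pure arithmetic: the function name `fouls_per_40` is illustrative, and the 17.5 minutes used for the five high-foul games is the approximate value given in the text.

```python
def fouls_per_40(total_fouls, games, seconds_per_game):
    """Fouls drawn per 40 minutes (2400 seconds) of playing time."""
    return 2400 * (total_fouls / games) / seconds_per_game


# Season figures quoted above: 73 fouls, 19 games, 16:53 per game.
season_rate = fouls_per_40(73, 19, 16 * 60 + 53)

# The five high-foul games: 33 fouls at roughly 17.5 minutes per game.
five_game_rate = fouls_per_40(33, 5, 17.5 * 60)

print(round(season_rate, 2), round(five_game_rate, 2))
```

The season rate comes out at about 9.10, matching the value in the boxplot, while the five games alone correspond to roughly 15.09 fouls per 40 minutes, which illustrates how strongly a handful of games can pull up the rate of a player with few games and limited minutes.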

<div style="text-align: justify">
For the 2019-2020 season, the range is $5.66$ fouls per 40 minutes and there are no identified outliers. However, Gudaitis' extreme values might be a headache for the referees.
</div>

# Part II: Isolate the Players Who Averaged More than 25 Minutes on the Court

<div style="text-align: justify">
Considering the effects observed with Ivanovic and Gudaitis, we will exclude the players who did not spend sufficient time on the court. Consequently, the datasets presented below include only players who averaged more than 25 minutes per game among the 50 most fouled players of each season.
</div>
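The filter itself is a single boolean mask on the "Seconds per Game" column. As a minimal sketch, it could be wrapped in a small helper so that the threshold is stated in minutes; the name `filter_over_minutes` and the demo rows are illustrative, not part of the notebook.

```python
import pandas as pd


def filter_over_minutes(df, minutes=25):
    """Keep players whose average time per game exceeds `minutes`."""
    return df[df["Seconds per Game"] > minutes * 60]


# Illustrative rows only (not taken from the datasets):
demo = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Seconds per Game": [1499, 1500, 1501],
})
over25 = filter_over_minutes(demo)
print(over25)
```

The comparison is strict, so a player averaging exactly 25:00 is excluded, which matches the `> 1500` condition used in the cells of this part.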

## Euroleague 2023-2024 Season: Players Over 25 Minutes on the Court

In [None]:
# Create a new df with players averaged over 25 minutes:
df_2324_over25 = df_2324[df_2324["Seconds per Game"] > 1500]

In [None]:
# Create the figure and set its size and dpi:
plt.figure(figsize=(10, 6), dpi=140)

# Create the scatter plot:
sns.scatterplot(x=df_2324_over25["Seconds per Game"], 
                y=df_2324_over25["Fouled per 40 Minutes"], 
                hue=df_2324_over25["Position"],
                style=df_2324_over25["Position"],
                s=50,
               palette="Set1")

# Set title and legend:
plt.title("Scatter Plot 1: Seconds per Game vs Fouls per 40 Minutes for Players Averaged Over 25 Minutes (2023-2024)")
plt.legend(title='Position', bbox_to_anchor=(1.02, 1.02), loc='upper left');

plt.savefig('scatterplot.png', dpi=300, bbox_inches='tight')

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=140)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_2324_over25["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 10, 1))
plt.title(
    "Boxplot 8: Distribution of players' fouls received per 40 minutes. The whiskers of this\n"
    "boxplot range from 3.23 to 7.99 fouls per 40 minutes. (Players over 25 minutes)")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot8.png', dpi=300, bbox_inches='tight')

The range is $4.76$ fouls per 40 minutes and there is still an identified outlier.

## Euroleague 2022-2023 Season: Players Over 25 Minutes on the Court

In [None]:
# Create a new df with players averaged over 25 minutes:
df_2223_over25 = df_2223[df_2223["Seconds per Game"] > 1500]

In [None]:
# Create the figure and set its size and dpi:
plt.figure(figsize=(10, 6), dpi=140)

# Create the scatter plot:
sns.scatterplot(x=df_2223_over25["Seconds per Game"], 
                y=df_2223_over25["Fouled per 40 Minutes"], 
                hue=df_2223_over25["Position"],
                style=df_2223_over25["Position"],
                s=50,
               palette="Set1")

# Set title and legend:
plt.title("Scatter Plot 2: Seconds per Game vs Fouls per 40 Minutes for Players Averaged Over 25 Minutes (2022-2023)")
plt.legend(title='Position', bbox_to_anchor=(1.02, 1.02), loc='upper left');

plt.savefig('scatterplot2.png', dpi=300, bbox_inches='tight')

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=140)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_2223_over25["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 8, 1))
plt.title(
    "Boxplot 9: Distribution of players' fouls received per 40 minutes. The whiskers of this\n"
    "boxplot range from 3.83 to 6.69 fouls per 40 minutes. (Players over 25 minutes)")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot9.png', dpi=300, bbox_inches='tight')

The range is $2.86$ fouls per 40 minutes and there is no identified outlier.

## Euroleague 2021-2022 Season: Players Over 25 Minutes on the Court

In [None]:
# Create a new df with players averaged over 25 minutes:
df_2122_over25 = df_2122[df_2122["Seconds per Game"] > 1500]

In [None]:
# Create the figure and set its size and dpi:
plt.figure(figsize=(10, 6), dpi=140)

# Create the scatter plot:
sns.scatterplot(x=df_2122_over25["Seconds per Game"], 
                y=df_2122_over25["Fouled per 40 Minutes"], 
                hue=df_2122_over25["Position"],
                style=df_2122_over25["Position"],
                s=50,
               palette="Set1")

# Set title and legend:
plt.title("Scatter Plot 3: Seconds per Game vs Fouls per 40 Minutes for Players Averaged Over 25 Minutes (2021-2022)")
plt.legend(title='Position', bbox_to_anchor=(1.02, 1.02), loc='upper left');

plt.savefig('scatterplot3.png', dpi=300, bbox_inches='tight')

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=140)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_2122_over25["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 8, 1))
plt.title(
    "Boxplot 10: Distribution of players' fouls received per 40 minutes. The whiskers of this\n"
    "boxplot range from 3.14 to 6.48 fouls per 40 minutes. (Players over 25 minutes)")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot10.png', dpi=300, bbox_inches='tight')

The range is $3.34$ fouls per 40 minutes and there is no identified outlier.

## Euroleague 2020-2021 Season: Players Over 25 Minutes on the Court

In [None]:
# Create a new df with players averaged over 25 minutes:
df_2021_over25 = df_2021[df_2021["Seconds per Game"] > 1500]

In [None]:
# Create the figure and set its size and dpi:
plt.figure(figsize=(10, 6), dpi=140)

# Create the scatter plot:
sns.scatterplot(x=df_2021_over25["Seconds per Game"], 
                y=df_2021_over25["Fouled per 40 Minutes"], 
                hue=df_2021_over25["Position"],
                style=df_2021_over25["Position"],
                s=50,
               palette="Set1")

# Set title and legend:
plt.title("Scatter Plot 4: Seconds per Game vs Fouls per 40 Minutes for Players Averaged Over 25 Minutes (2020-2021)")
plt.legend(title='Position', bbox_to_anchor=(1.02, 1.02), loc='upper left');

plt.savefig('scatterplot4.png', dpi=300, bbox_inches='tight')

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=140)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_2021_over25["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 8, 1))
plt.title(
    "Boxplot 11: Distribution of players' fouls received per 40 minutes. The whiskers of this\n"
    "boxplot range from 3.39 to 7.73 fouls per 40 minutes. (Players over 25 minutes)")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot11.png', dpi=300, bbox_inches='tight')

The range is $4.34$ fouls per 40 minutes and there is no identified outlier.

## Euroleague 2019-2020 Season: Players Over 25 Minutes on the Court

In [None]:
# Create a new df with players averaged over 25 minutes:
df_1920_over25 = df_1920[df_1920["Seconds per Game"] > 1500]

In [None]:
# Create the figure and set its size and dpi:
plt.figure(figsize=(10, 6), dpi=140)

# Create the scatter plot:
sns.scatterplot(x=df_1920_over25["Seconds per Game"], 
                y=df_1920_over25["Fouled per 40 Minutes"], 
                hue=df_1920_over25["Position"],
                style=df_1920_over25["Position"],
                s=50,
               palette="Set1")

# Set title and legend:
plt.title("Scatter Plot 5: Seconds per Game vs Fouls per 40 Minutes for Players Averaged Over 25 Minutes (2019-2020)")
plt.legend(title='Position', bbox_to_anchor=(1.02, 1.02), loc='upper left');

plt.savefig('scatterplot5.png', dpi=300, bbox_inches='tight')

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=140)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=df_1920_over25["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 9, 1))
plt.title(
    "Boxplot 12: Distribution of players' fouls received per 40 minutes. The whiskers of this\n"
    "boxplot range from 3.44 to 8.28 fouls per 40 minutes. (Players over 25 minutes)")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot12.png', dpi=300, bbox_inches='tight')

The range is $4.84$ fouls per 40 minutes and there is no identified outlier.
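The ten whisker ranges reported so far can be gathered in one table to compare the full top-50 lists with the over-25-minute subsets. The sketch below only re-types the whisker limits printed in the titles of Boxplots 3 to 12; the column names are illustrative.

```python
import pandas as pd

# Whisker limits as reported in the titles of Boxplots 3-12.
whiskers = pd.DataFrame(
    {
        "Season": ["2023-2024", "2022-2023", "2021-2022", "2020-2021", "2019-2020"],
        "Low (all)": [3.23, 3.83, 3.14, 3.39, 3.44],
        "High (all)": [8.72, 8.37, 6.76, 8.77, 9.10],
        "Low (>25 min)": [3.23, 3.83, 3.14, 3.39, 3.44],
        "High (>25 min)": [7.99, 6.69, 6.48, 7.73, 8.28],
    }
).set_index("Season")

# Range = high whisker - low whisker, for both versions of each season.
summary = pd.DataFrame(
    {
        "Range (all)": (whiskers["High (all)"] - whiskers["Low (all)"]).round(2),
        "Range (>25 min)": (
            whiskers["High (>25 min)"] - whiskers["Low (>25 min)"]
        ).round(2),
    }
)
summary["Reduction"] = (summary["Range (all)"] - summary["Range (>25 min)"]).round(2)
print(summary)
```

In every season the range is smaller for the over-25-minute subset, with the largest reduction in 2022-2023.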

# Euroleague All Seasons

In [None]:
# Split the "minutes_per_game" column to two columns:
combined_df[["minutes", "seconds"]] = combined_df["Minutes per Game"].str.split("k", expand=True)

# Change the dtypes of the newly created columns:
combined_df["minutes"] = combined_df["minutes"].astype(int)
combined_df["seconds"] = combined_df["seconds"].astype(int)

# Create the final column showing the average player time per game in seconds:
combined_df["Seconds per Game"] = (combined_df["minutes"]*60 + combined_df["seconds"]).astype("int64")

# Drop the unnecessary columns:
combined_df = combined_df.drop(columns=["Minutes per Game", "minutes", "seconds"])

# Normalize fouls per game by creating the "fouls per game" column:
combined_df["Fouled per Game"] = combined_df["Fouled (Total)"] / combined_df["Games"]

# Normalize fouls per 40 minutes (2400 seconds) by creating the "fouls per 40 mins" column:
combined_df["Fouled per 40 Minutes"] = round(2400 * combined_df["Fouled per Game"] / combined_df["Seconds per Game"], 2)

In [None]:
# Present some basic statistics of the df:
combined_df.describe().transpose()

In [None]:
# Select the numerical columns:
num_cols_comb = combined_df.select_dtypes(include=["int64", "float64"])

# Display a correlation matrix for the numerical columns:
num_cols_comb.corr()

In [None]:
# Create the figure and set its size and dpi:
plt.figure(figsize=(10, 6), dpi=140)

# Create the scatter plot:
sns.scatterplot(x=combined_df["Seconds per Game"],
                y=combined_df["Fouled per 40 Minutes"],
                hue=combined_df["Position"],
                style=combined_df["Position"],
                s=50,
                palette="Set1")

# Set title and legend:
plt.title("Scatter Plot 6: Seconds per Game vs Fouls per 40 Minutes for All Euroleague Seasons")
plt.legend(title='Position', bbox_to_anchor=(1.02, 1.02), loc='upper left');

plt.savefig('scatterplot6.png', dpi=300, bbox_inches='tight')

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=140)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=combined_df["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 10, 1))
plt.title("Boxplot 13: Distribution of players' fouls received per 40 minutes for all seasons.")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot13.png', dpi=300, bbox_inches='tight')

## Euroleague All Seasons: Players Over 25 Minutes on the Court

In [None]:
# Create a new df with players averaging over 25 minutes (1500 seconds) per game:
combined_df_over25 = combined_df[combined_df["Seconds per Game"] > 1500]

In [None]:
# Create the figure and set its size and dpi:
plt.figure(figsize=(10, 6), dpi=140)

# Create the scatter plot:
sns.scatterplot(x=combined_df_over25["Seconds per Game"],
                y=combined_df_over25["Fouled per 40 Minutes"],
                hue=combined_df_over25["Position"],
                style=combined_df_over25["Position"],
                s=50,
                palette="Set1")

# Set title and legend:
plt.title("Scatter Plot 7: Seconds per Game vs Fouled per 40 Minutes for all seasons\n"
          "(Players over 25 minutes on the court)")
plt.legend(title='Position', bbox_to_anchor=(1.02, 1.02), loc='upper left');

plt.savefig('scatterplot7.png', dpi=300, bbox_inches='tight')

In [None]:
# Create the boxplot's grid:
sns.set(style="whitegrid")

# Create the figure and set its size and dpi:
plt.figure(figsize=(8, 6), dpi=140)

# Create the boxplot showing the mean of the distribution:
sns.boxplot(x=combined_df_over25["Fouled per 40 Minutes"], showmeans=True, meanline=True)

# Set the x-axis ticks, boxplot's title and labels:
plt.xticks(range(3, 10, 1))
plt.title("Boxplot 14: Distribution of players' fouls received per 40 minutes for all seasons.\n"
          "(Players over 25 minutes on the court)")

plt.xlabel("Fouls Received per 40 Minutes");

plt.savefig('boxplot14.png', dpi=300, bbox_inches='tight')

## One-Sample T-Hypothesis Test

Assumptions:  
$1)$ The population's variance is unknown. The data for the 2016-2017, 2017-2018 and 2018-2019 seasons have not been prepared yet.  
$2)$ The sample size is $n = 4$ (4 seasons), hence $n<30$.  
$3)$ There is a single population.

In [None]:
# Create a list with historical observations:
historical_dfs = [df_1920, df_2021, df_2122, df_2223]

# Create a list with seasons of the observations:
seasons = ["2019-2020", "2020-2021", "2021-2022", "2022-2023"]

# Iterate through historical_dfs and calculate the mean for each season:
list_of_seasons = []
for df in historical_dfs:
    season_mean = df["Fouled per 40 Minutes"].mean()
    list_of_seasons.append(season_mean)

# Create the hypothesis df format:
df_hypothesis = pd.DataFrame({"Season": seasons,
                              "Fouled per 40 Minutes": list_of_seasons})

# Assign the sample size and the degrees of freedom to variables:
n = 4
deg_fr = n - 1

# Calculate the mean and the standard deviation of the historical season means:
x, s = df_hypothesis["Fouled per 40 Minutes"].aggregate(["mean", "std"])

# Assign the mean of the 2023-2024 season to μ_0:
μ_0 = df_2324["Fouled per 40 Minutes"].mean()

# Calculate the standard error:
standard_error = s / (n**0.5)

### Null and Alternative Hypothesis
The null hypothesis is that the historical mean of the other seasons combined, $μ$, is equal to the mean of the 2023-2024 season, $μ_0$, i.e., the 2023-2024 mean is not smaller than the historical mean. This means:
$$H_0: μ - μ_0 = 0$$  
The alternative hypothesis is that the mean of the 2023-2024 season, $μ_0$, is smaller than the historical mean of the other seasons combined, $μ$. This means:
$$H_1: μ - μ_0 > 0$$  
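
For reference, the test statistic implied by the quantities computed earlier (the mean $\bar{x}$ and standard deviation $s$ of the $n = 4$ historical season means, stored as `x` and `s` in the code) can be written as follows. This is a restatement of the standard one-sample t-statistic under the assumptions listed above:

$$t = \frac{\bar{x} - μ_0}{s / \sqrt{n}}, \qquad \text{degrees of freedom} = n - 1 = 3$$

For the right-tailed alternative $H_1$, the p-value is the upper-tail probability of the t-distribution with $n - 1$ degrees of freedom:

$$p = P(T_{n-1} \geq t) = 1 - F_{T_{n-1}}(t)$$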

In [None]:
# Calculate the t-score (rounded only for display):
t_score = (x - μ_0) / standard_error
print("T-Score:", round(t_score, 2))

# Calculate the p-value for a one-tailed test, using the unrounded t-score:
p = 1 - t.cdf(t_score, deg_fr)
print("p-Value:", round(p, 2))

<div style="text-align: justify">
Since the p-value is greater than any of the common significance levels α, we fail to reject the null hypothesis. Although the mean of the 2023-2024 season is smaller than the historical mean, the test does not provide sufficient evidence that this difference is statistically significant.
</div>
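
The manual calculation can be cross-checked against SciPy's built-in one-sample t-test. The sketch below assumes SciPy 1.6 or newer, where `ttest_1samp` accepts the `alternative` argument. The season means and `mu_0` are hypothetical placeholders, not the values computed in this notebook.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Hypothetical historical season means and 2023-2024 mean (illustrative values only):
season_means = np.array([5.60, 5.45, 5.52, 5.38])
mu_0 = 5.40

# Manual right-tailed one-sample t-test, mirroring the steps used above:
n = len(season_means)
x_bar = season_means.mean()
s = season_means.std(ddof=1)  # sample standard deviation, as pandas computes by default
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))
p_manual = t.sf(t_manual, df=n - 1)  # equivalent to 1 - t.cdf(...)

# Same test with SciPy's helper; "greater" corresponds to H1: μ - μ_0 > 0
result = ttest_1samp(season_means, popmean=mu_0, alternative="greater")

print("Manual:", round(t_manual, 4), round(p_manual, 4))
print("SciPy: ", round(result.statistic, 4), round(result.pvalue, 4))
```

Both approaches should print the same t-score and p-value. In the notebook, `df_hypothesis["Fouled per 40 Minutes"]` and `μ_0` could be passed in place of the placeholders.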