# Hypothesis Testing with Men's and Women's Soccer Matches

In this project, I apply my statistical testing skills to historical data of men's and women's international soccer matches. Specifically, I perform a hypothesis test to answer the question:

> Are more goals scored in women's international soccer matches than in men's?

For this analysis, I assume a **10% significance level** and use the following null and alternative hypotheses:

$H_0$ : The mean number of goals scored in women's international soccer matches is the same as men's.

$H_A$ : The mean number of goals scored in women's international soccer matches is greater than men's.

Because the sport has changed considerably over the years, and performance likely varies by tournament, I limit the analysis to `FIFA World Cup` matches (excluding qualifiers) played since `2002-01-01`. I also assume that each match is fully independent, i.e. team form is ignored.

The idea and dataset for this project are from [this DataCamp project](https://app.datacamp.com/learn/projects/hypothesis_testing_with_mens_and_womens_soccer_matches/guided/Python).

## Load and inspect data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import mannwhitneyu

In [11]:
# Load data
women_results = pd.read_csv("data/women_results.csv")
men_results = pd.read_csv("data/men_results.csv")

In [12]:
# Inspect women data
print(women_results.info())
women_results.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4884 entries, 0 to 4883
Data columns (total 7 columns):
Unnamed: 0    4884 non-null int64
date          4884 non-null object
home_team     4884 non-null object
away_team     4884 non-null object
home_score    4884 non-null int64
away_score    4884 non-null int64
tournament    4884 non-null object
dtypes: int64(3), object(4)
memory usage: 267.2+ KB
None


|   | Unnamed: 0 | date | home_team | away_team | home_score | away_score | tournament |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1969-11-01 | Italy | France | 1 | 0 | Euro |
| 1 | 1 | 1969-11-01 | Denmark | England | 4 | 3 | Euro |
| 2 | 2 | 1969-11-02 | England | France | 2 | 0 | Euro |
| 3 | 3 | 1969-11-02 | Italy | Denmark | 3 | 1 | Euro |
| 4 | 4 | 1975-08-25 | Thailand | Australia | 3 | 2 | AFC Championship |


In [13]:
# Inspect men data
print(men_results.info())
men_results.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44353 entries, 0 to 44352
Data columns (total 7 columns):
Unnamed: 0    44353 non-null int64
date          44353 non-null object
home_team     44353 non-null object
away_team     44353 non-null object
home_score    44353 non-null int64
away_score    44353 non-null int64
tournament    44353 non-null object
dtypes: int64(3), object(4)
memory usage: 2.4+ MB
None


|   | Unnamed: 0 | date | home_team | away_team | home_score | away_score | tournament |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1872-11-30 | Scotland | England | 0 | 0 | Friendly |
| 1 | 1 | 1873-03-08 | England | Scotland | 4 | 2 | Friendly |
| 2 | 2 | 1874-03-07 | Scotland | England | 2 | 1 | Friendly |
| 3 | 3 | 1875-03-06 | England | Scotland | 2 | 2 | Friendly |
| 4 | 4 | 1876-03-04 | Scotland | England | 3 | 0 | Friendly |


## Exploratory data analysis (EDA)

The `info()` output above shows that the `date` column is stored as strings (`object` dtype). To filter by date later, it needs to be converted to a `datetime` type.

In [14]:
# Convert `date` columns to datetime
# (`infer_datetime_format` is deprecated in pandas 2.x, so it is omitted here)
women_results["date"] = pd.to_datetime(women_results["date"])
men_results["date"] = pd.to_datetime(men_results["date"])

Let's also inspect the `tournament` column to see if there are any FIFA World Cup matches.

In [15]:
print(women_results["tournament"].value_counts())
men_results["tournament"].value_counts()

UEFA Euro qualification                 1445
Algarve Cup                              551
FIFA World Cup                           284
AFC Championship                         268
Cyprus Cup                               258
African Championship qualification       226
UEFA Euro                                184
African Championship                     173
FIFA World Cup qualification             172
CONCACAF Gold Cup qualification          143
AFC Asian Cup qualification              141
Copa América                             131
Olympic Games                            130
CONCACAF Gold Cup                        126
AFC Asian Cup                            111
Friendly                                 111
Four Nations Tournament                  106
OFC Championship                          78
African Cup of Nations qualification      58
CONCACAF Championship                     42
SheBelieves Cup                           39
Euro                                      20
African Cu

Friendly                                         17519
FIFA World Cup qualification                      7878
UEFA Euro qualification                           2585
African Cup of Nations qualification              1932
FIFA World Cup                                     964
Copa América                                       841
AFC Asian Cup qualification                        764
African Cup of Nations                             741
CECAFA Cup                                         620
CFU Caribbean Cup qualification                    606
Merdeka Tournament                                 595
British Home Championship                          517
UEFA Nations League                                468
Gulf Cup                                           380
AFC Asian Cup                                      370
Gold Cup                                           358
Island Games                                       350
UEFA Euro                                          337
COSAFA Cup

## Get only FIFA World Cup matches since 2002

From the results of the EDA, we'll use the string `FIFA World Cup` to filter our data (in addition to using only records since `2002-01-01`).

In [16]:
# Get only FIFA World Cup matches since 2002
women_wc_since_02 = women_results[
    (women_results["date"] >= "2002-01-01") & (women_results["tournament"] == "FIFA World Cup")
]
men_wc_since_02 = men_results[
    (men_results["date"] >= "2002-01-01") & (men_results["tournament"] == "FIFA World Cup")
]
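As a quick sanity check, the same boolean mask can be tried on a tiny hand-made frame. This is only an illustrative sketch: `toy` and `filter_world_cup` are hypothetical names, and the rows are invented for the demonstration rather than taken from the dataset.

```python
import pandas as pd

# Invented rows that mimic the columns of the real CSV files (not real data)
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["1999-06-19", "2003-09-20", "2004-03-01", "2019-07-07"]
        ),
        "home_score": [3, 2, 1, 2],
        "away_score": [0, 0, 1, 0],
        "tournament": [
            "FIFA World Cup",                # too early: dropped by the date condition
            "FIFA World Cup",                # kept
            "FIFA World Cup qualification",  # qualifier: dropped by the exact match
            "FIFA World Cup",                # kept
        ],
    }
)


def filter_world_cup(df, start="2002-01-01"):
    """Keep rows on or after `start` whose tournament is exactly 'FIFA World Cup'."""
    mask = (df["date"] >= start) & (df["tournament"] == "FIFA World Cup")
    return df[mask]


toy_wc = filter_world_cup(toy)
print(toy_wc[["date", "tournament"]])
```

Because the comparison uses `==` rather than a substring match, rows labelled `FIFA World Cup qualification` are excluded, which matches the scope stated in the introduction.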

## Determine the type of hypothesis test to use

Because there are two independent groups, men's and women's, this scenario requires an unpaired two-sample test.
The unpaired t-test and the Wilcoxon-Mann-Whitney test are two common choices, where the Wilcoxon-Mann-Whitney test is a non-parametric alternative to the unpaired t-test.
To decide whether a parametric or non-parametric test is appropriate, we need to verify the underlying assumptions of the parametric test, including the sample size in each group and the normality of each distribution.
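One way these checks and a one-sided test might look is sketched below. It is a sketch under stated assumptions, not the analysis itself: the two samples are synthetic Poisson draws standing in for goals per match, the sample sizes and rates are arbitrary placeholders, and the Shapiro-Wilk test (`scipy.stats.shapiro`) is just one possible normality check alongside a histogram.

```python
import numpy as np
from scipy.stats import mannwhitneyu, shapiro

# Synthetic stand-ins for goals per match (placeholder values, not real data)
rng = np.random.default_rng(42)
women_goals = rng.poisson(lam=3.0, size=200)
men_goals = rng.poisson(lam=2.5, size=380)

# 1. Sample size in each group
print("n women:", len(women_goals), "| n men:", len(men_goals))

# 2. Normality of each distribution (a small p-value suggests non-normal data)
for label, sample in [("women", women_goals), ("men", men_goals)]:
    stat, p_norm = shapiro(sample)
    print(f"Shapiro-Wilk for {label}: W={stat:.3f}, p={p_norm:.4f}")

# 3. One-sided Wilcoxon-Mann-Whitney test.
#    alternative="greater" tests whether the first sample tends to be larger,
#    which corresponds to the right-tailed alternative hypothesis H_A.
u_stat, p_value = mannwhitneyu(women_goals, men_goals, alternative="greater")

alpha = 0.10  # 10% significance level, as stated in the introduction
decision = "reject" if p_value <= alpha else "fail to reject"
print(f"U={u_stat:.1f}, p={p_value:.4f} -> {decision} H_0")
```

With the real data, `women_goals` and `men_goals` would be replaced by the total goals per match in each filtered frame, for example `home_score + away_score`. Goal counts are discrete and typically right-skewed, so the non-parametric test is often the safer choice when the normality check fails.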