# **Hypothesis Testing with Men's and Women's Soccer Matches**

You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!

While scoping this project, you acknowledge that the sport has changed considerably over the years and that performances likely vary by tournament, so you decide to limit the analysis to official FIFA World Cup matches (not including qualifiers) played since 2002-01-01.

You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.

The question you want to answer is:

Are more goals scored in women's international soccer matches than men's?

You assume a 10% significance level, and use the following null and alternative hypotheses:

 H0: The mean number of goals scored in women's international soccer matches is the same as men's.

 Ha: The mean number of goals scored in women's international soccer matches is greater than men's.

#### Project Instructions:

Perform an appropriate hypothesis test to obtain the p-value, then use it to decide whether to reject or fail to reject the null hypothesis that the mean number of goals scored in women's international soccer matches is the same as men's. Use a 10% significance level.

For this analysis, you'll use official FIFA World Cup matches since 2002-01-01, and you'll also assume that each match is fully independent, i.e., team form is ignored.

The p-value and the result of the test must be stored in a dictionary called result_dict in the form:

result_dict = {"p_val": p_val, "result": result}

where p_val is the p-value and result is either the string "fail to reject" or "reject", depending on the result of the test.
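
A minimal sketch of how the decision rule maps onto the required dictionary. The helper name `build_result_dict` is hypothetical, and the p-values passed in below are placeholders, not results from this analysis.

```python
def build_result_dict(p_val, alpha=0.10):
    """Package a p-value and the test decision in the required format.

    Hypothetical helper: the null hypothesis is rejected when the
    p-value does not exceed the significance level alpha.
    """
    result = "reject" if p_val <= alpha else "fail to reject"
    return {"p_val": p_val, "result": result}


# Placeholder p-values, chosen only to exercise both branches
print(build_result_dict(0.03))
print(build_result_dict(0.25))
```
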

In [36]:
# Loading in required libraries

import os
import pandas as pd
import seaborn as sns
import numpy as np
import pingouin

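# Reading in the soccer match results
# Helper that reads one CSV file from the datasets folder
folder_name = 'datasets'
# NOTE: `dir` shadows the Python builtin; the name is kept because later cells pass it in
dir = r'C:\Users\mcaba\OneDrive\Escritorio\Data Science\Datacamp_Projects\DataCamp_Projects\{}'.format(folder_name)

def read_csv_fun(folder_name, file_name, path):
    # Build the full file path from `path` instead of changing the working directory
    # (`folder_name` is unused but kept so existing calls still work)
    file_path = os.path.join(path, '{}.csv'.format(file_name))
    df = pd.read_csv(file_path, sep=',', low_memory=False, on_bad_lines='skip')
    return df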

women_results = read_csv_fun('datasets','women_results', dir)

# Taking a look at the first two rows
women_results.head(2)

Unnamed: 0.1,Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament
0,0,1969-11-01,Italy,France,1,0,Euro
1,1,1969-11-01,Denmark,England,4,3,Euro


In [37]:
men_results = read_csv_fun('datasets','men_results', dir)

# Taking a look at the first two rows
men_results.head(2)

Unnamed: 0.1,Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament
0,0,1872-11-30,Scotland,England,0,0,Friendly
1,1,1873-03-08,England,Scotland,4,2,Friendly


In [38]:
display(women_results.info())

display(women_results.describe())

display(men_results.info())

display(men_results.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4884 entries, 0 to 4883
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  4884 non-null   int64 
 1   date        4884 non-null   object
 2   home_team   4884 non-null   object
 3   away_team   4884 non-null   object
 4   home_score  4884 non-null   int64 
 5   away_score  4884 non-null   int64 
 6   tournament  4884 non-null   object
dtypes: int64(3), object(4)
memory usage: 267.2+ KB


None

Unnamed: 0.1,Unnamed: 0,home_score,away_score
count,4884.0,4884.0,4884.0
mean,2441.5,2.272727,1.431409
std,1410.033688,2.736377,1.974651
min,0.0,0.0,0.0
25%,1220.75,0.0,0.0
50%,2441.5,1.0,1.0
75%,3662.25,3.0,2.0
max,4883.0,24.0,24.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44353 entries, 0 to 44352
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  44353 non-null  int64 
 1   date        44353 non-null  object
 2   home_team   44353 non-null  object
 3   away_team   44353 non-null  object
 4   home_score  44353 non-null  int64 
 5   away_score  44353 non-null  int64 
 6   tournament  44353 non-null  object
dtypes: int64(3), object(4)
memory usage: 2.4+ MB


None

Unnamed: 0.1,Unnamed: 0,home_score,away_score
count,44353.0,44353.0,44353.0
mean,22176.0,1.740559,1.178793
std,12803.752581,1.748722,1.39458
min,0.0,0.0,0.0
25%,11088.0,1.0,0.0
50%,22176.0,1.0,1.0
75%,33264.0,2.0,2.0
max,44352.0,31.0,21.0


In [39]:
# Converting the date columns to the datetime dtype

women_results["date"] = pd.to_datetime(women_results["date"])

men_results["date"] = pd.to_datetime(men_results["date"])

In [40]:
# year column for validation

women_results["year"] = women_results["date"].dt.year 

men_results["year"] =  men_results["date"].dt.year 

In [41]:
display(women_results["year"].value_counts())

display(men_results["year"].value_counts())

year
2018    346
2012    242
2017    228
2016    228
2019    204
2015    202
2006    195
2010    190
2008    186
2003    179
2011    178
2014    161
2004    150
2000    145
2020    137
1999    136
2007    130
1995    124
2013    115
2002    112
1996    110
1998    101
2022     94
1991     92
2021     91
2009     90
1994     89
2001     61
2005     61
1983     54
1993     53
1997     50
1988     44
1989     42
1990     41
1992     40
1986     38
1985     24
1987     20
1982     18
1980     18
1981     16
1979     16
1977     10
1975     10
1984      9
1969      4
Name: count, dtype: int64

year
2019    1161
2008    1102
2011    1090
2021    1089
2004    1077
        ... 
1876       2
1873       1
1875       1
1874       1
1872       1
Name: count, Length: 151, dtype: int64

In [42]:
display(men_results["tournament"].value_counts())
display(men_results[men_results["tournament"].str.contains("FIFA", na=False)]["tournament"].value_counts())

tournament
Friendly                                17519
FIFA World Cup qualification             7878
UEFA Euro qualification                  2585
African Cup of Nations qualification     1932
FIFA World Cup                            964
                                        ...  
Real Madrid 75th Anniversary Cup            1
Évence Coppée Trophy                        1
Copa Confraternidad                         1
TIFOCO Tournament                           1
FIFA 75th Anniversary Cup                   1
Name: count, Length: 141, dtype: int64

tournament
FIFA World Cup qualification    7878
FIFA World Cup                   964
FIFA 75th Anniversary Cup          1
Name: count, dtype: int64

In [43]:
display(women_results["tournament"].value_counts())
display(women_results[women_results["tournament"].str.contains("FIFA", na=False)]["tournament"].value_counts())

tournament
UEFA Euro qualification                 1445
Algarve Cup                              551
FIFA World Cup                           284
AFC Championship                         268
Cyprus Cup                               258
African Championship qualification       226
UEFA Euro                                184
African Championship                     173
FIFA World Cup qualification             172
CONCACAF Gold Cup qualification          143
AFC Asian Cup qualification              141
Copa América                             131
Olympic Games                            130
CONCACAF Gold Cup                        126
Friendly                                 111
AFC Asian Cup                            111
Four Nations Tournament                  106
OFC Championship                          78
African Cup of Nations qualification      58
CONCACAF Championship                     42
SheBelieves Cup                           39
Euro                                      20

tournament
FIFA World Cup                  284
FIFA World Cup qualification    172
Name: count, dtype: int64

In [44]:
# Subset women_results and men_results to matches played since 2002-01-01
women_results_2002 = women_results[women_results['date'] >= '2002-01-01']

men_results_2002 = men_results[men_results['date'] >= '2002-01-01']

# Keep only official FIFA World Cup matches (qualifiers excluded)

df_women_results_FIFA2002 = women_results_2002[women_results_2002['tournament'] == 'FIFA World Cup']

df_men_results_FIFA2002 = men_results_2002[men_results_2002['tournament'] == 'FIFA World Cup']
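
The two filtering steps can also be combined into one boolean mask. The sketch below assumes the same column names as the real data but runs on a tiny made-up DataFrame; the helper name `world_cup_since` and the toy rows are illustrative only. Taking an explicit `.copy()` makes the subset independent of the parent frame, so adding columns to it later does not trigger pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

# Toy rows, made up for illustration only (not read from the real CSV files)
toy = pd.DataFrame({
    "date": pd.to_datetime(["1998-07-12", "2002-06-30", "2005-10-08", "2014-07-13"]),
    "home_score": [3, 2, 1, 1],
    "away_score": [0, 0, 0, 0],
    "tournament": ["FIFA World Cup", "FIFA World Cup",
                   "FIFA World Cup qualification", "FIFA World Cup"],
})


def world_cup_since(df, start="2002-01-01"):
    """Return World Cup finals matches on or after `start` as an independent copy."""
    mask = (df["date"] >= start) & (df["tournament"] == "FIFA World Cup")
    subset = df.loc[mask].copy()
    # Safe to add a column here because `subset` is its own DataFrame
    subset["total_goals"] = subset["home_score"] + subset["away_score"]
    return subset


toy_subset = world_cup_since(toy)
print(toy_subset[["date", "tournament", "total_goals"]])
```
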


In [45]:
display(df_women_results_FIFA2002["tournament"].value_counts())

display(df_men_results_FIFA2002["tournament"].value_counts())



tournament
FIFA World Cup    200
Name: count, dtype: int64

tournament
FIFA World Cup    384
Name: count, dtype: int64

In [None]:
display(df_women_results_FIFA2002["year"].value_counts())

display(df_men_results_FIFA2002["year"].value_counts())

year
2015    52
2019    52
2003    32
2007    32
2011    32
Name: count, dtype: int64

year
2002    64
2006    64
2010    64
2014    64
2018    64
2022    64
Name: count, dtype: int64

In [47]:
# Total goals column for both DataFrames
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on filtered subsets

df_women_results_FIFA2002 = df_women_results_FIFA2002.assign(
    total_goals=df_women_results_FIFA2002["home_score"] + df_women_results_FIFA2002["away_score"]
)

df_men_results_FIFA2002 = df_men_results_FIFA2002.assign(
    total_goals=df_men_results_FIFA2002["home_score"] + df_men_results_FIFA2002["away_score"]
)


In [48]:
display(df_women_results_FIFA2002.info())

display(df_women_results_FIFA2002.describe())

display(df_men_results_FIFA2002.info())

display(df_men_results_FIFA2002.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 1600 to 4469
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Unnamed: 0   200 non-null    int64         
 1   date         200 non-null    datetime64[ns]
 2   home_team    200 non-null    object        
 3   away_team    200 non-null    object        
 4   home_score   200 non-null    int64         
 5   away_score   200 non-null    int64         
 6   tournament   200 non-null    object        
 7   year         200 non-null    int32         
 8   total_goals  200 non-null    int64         
dtypes: datetime64[ns](1), int32(1), int64(4), object(3)
memory usage: 14.8+ KB


None

Unnamed: 0.1,Unnamed: 0,date,home_score,away_score,year,total_goals
count,200.0,200,200.0,200.0,200.0,200.0
mean,3094.485,2012-10-01 04:04:48,1.805,1.175,2012.2,2.98
min,1600.0,2003-09-20 00:00:00,0.0,0.0,2003.0,0.0
25%,2155.75,2007-09-17 18:00:00,1.0,0.0,2007.0,2.0
50%,3429.5,2015-06-07 12:00:00,1.0,1.0,2015.0,3.0
75%,4418.25,2019-06-08 00:00:00,2.0,2.0,2019.0,4.0
max,4469.0,2019-07-07 00:00:00,13.0,7.0,2019.0,13.0
std,1010.682192,,1.937977,1.289453,5.68521,2.022387


<class 'pandas.core.frame.DataFrame'>
Index: 384 entries, 25164 to 44352
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Unnamed: 0   384 non-null    int64         
 1   date         384 non-null    datetime64[ns]
 2   home_team    384 non-null    object        
 3   away_team    384 non-null    object        
 4   home_score   384 non-null    int64         
 5   away_score   384 non-null    int64         
 6   tournament   384 non-null    object        
 7   year         384 non-null    int32         
 8   total_goals  384 non-null    int64         
dtypes: datetime64[ns](1), int32(1), int64(4), object(3)
memory usage: 28.5+ KB


None

Unnamed: 0.1,Unnamed: 0,date,home_score,away_score,year,total_goals
count,384.0,384,384.0,384.0,384.0,384.0
mean,34629.875,2012-07-16 16:52:30,1.375,1.138021,2012.0,2.513021
min,25164.0,2002-05-31 00:00:00,0.0,0.0,2002.0,0.0
25%,28769.75,2006-06-19 18:00:00,0.0,0.0,2006.0,1.0
50%,34557.0,2012-06-26 00:00:00,1.0,1.0,2012.0,2.0
75%,40385.25,2018-06-24 06:00:00,2.0,2.0,2018.0,3.0
max,44352.0,2022-12-18 00:00:00,8.0,7.0,2022.0,8.0
std,6566.923215,,1.328538,1.107398,6.840213,1.652544


In [49]:
mean_women_goals = df_women_results_FIFA2002['total_goals'].mean()
print(mean_women_goals)

mean_men_goals = df_men_results_FIFA2002['total_goals'].mean()
print(mean_men_goals)

2.98
2.5130208333333335
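
To put the gap between the two means on a standardized scale, the summary statistics printed by `describe()` above are enough to compute Welch's t statistic by hand. This is only a rough cross-check under the assumption that a comparison of means is of interest; it is not the test this notebook uses for its final result, and no p-value is derived from it here.

```python
import math

# Summary statistics copied from the describe() output above
n_w, mean_w, sd_w = 200, 2.98, 2.022387      # women's World Cup matches
n_m, mean_m, sd_m = 384, 2.513021, 1.652544  # men's World Cup matches

# Per-group variance of the sample mean
var_w = sd_w ** 2 / n_w
var_m = sd_m ** 2 / n_m

# Welch's t statistic: difference in means over its unpooled standard error
t_stat = (mean_w - mean_m) / math.sqrt(var_w + var_m)

# Welch-Satterthwaite approximation to the degrees of freedom
dof = (var_w + var_m) ** 2 / (var_w ** 2 / (n_w - 1) + var_m ** 2 / (n_m - 1))

print(f"t = {t_stat:.3f}, approx. degrees of freedom = {dof:.1f}")
```
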


In [None]:
alpha = 0.1  # 10% significance level

# Goal totals are discrete counts, so a non-parametric test
# (Wilcoxon-Mann-Whitney) is used here rather than an unpaired t-test

# H0: The mean number of goals scored in women's international 
# soccer matches is the same as men's.

# mean_women_goals = mean_men_goals
# H0: mean_women_goals - mean_men_goals = 0


# Ha: The mean number of goals scored in women's international
#  soccer matches is greater than men's.

# mean_women_goals > mean_men_goals
# Ha: mean_women_goals - mean_men_goals > 0
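
For intuition about what a Wilcoxon-Mann-Whitney test measures, the U statistic can be computed by brute force: compare every value in one sample with every value in the other, count a win as 1 and a tie as 0.5. The function name and the toy goal totals below are made up for illustration; a library implementation adds the p-value calculation on top of this count.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: pairs where x beats y, with ties counted as one half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Toy goal totals, made up for illustration only
toy_women = [3, 4, 2, 5]
toy_men = [2, 1, 3]

u_toy = mann_whitney_u(toy_women, toy_men)
n_pairs = len(toy_women) * len(toy_men)
print(f"U = {u_toy} out of {n_pairs} pairs")
```

A U close to the total number of pairs means values in the first sample tend to exceed those in the second; a U near half the pairs means neither sample tends to be larger.
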

In [79]:
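# Right-tailed Wilcoxon-Mann-Whitney test
# (the variable keeps the name `ttest`, but this is not a t-test)
ttest = pingouin.mwu(x=df_women_results_FIFA2002['total_goals'],
            y=df_men_results_FIFA2002['total_goals'],
            alternative='greater')
print(ttest)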

       U-val alternative     p-val       RBC      CLES
MWU  43273.0     greater  0.005107  0.126901  0.563451
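
The two effect-size columns in this output can be sanity-checked from the U value and the sample sizes (200 women's and 384 men's matches). The sketch below assumes the common definitions: CLES as the share of women/men match pairs in which the women's match has more goals (ties counted as half), and the rank-biserial correlation (RBC) as that share rescaled to the range -1 to 1. The printed values are consistent with those definitions, although the exact formulas inside the installed pingouin version were not checked.

```python
# Values taken from the test output and the filtered sample sizes above
u_val = 43273.0
n_women, n_men = 200, 384

# Common-language effect size: U as a fraction of all cross-sample pairs
cles = u_val / (n_women * n_men)

# Rank-biserial correlation: the same share rescaled to [-1, 1]
rbc = 2 * cles - 1

print(round(cles, 6), round(rbc, 6))
```

Read this way, a CLES of about 0.56 suggests that a randomly chosen women's World Cup match outscores a randomly chosen men's match somewhat more often than not, which is a modest effect even though the p-value is small.
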


In [80]:
p_val = ttest.loc['MWU', 'p-val']

result_dict = {
    "p_val": p_val,
    "result": "fail to reject" if p_val > alpha else "reject"
}

print(result_dict)

{'p_val': 0.005106609825443641, 'result': 'reject'}
