# Hypothesis Testing using ANOVA 

Hypothesis tests examine relationships between different kinds of variables (quantitative and categorical). 
Analysis of Variance (ANOVA) is used when we have one quantitative response variable and one categorical explanatory variable.

'data.csv' contains FIFA 19 player attributes such as Age, Nationality, Overall, Potential, Club, Value, Wage, 
Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, 
Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, 
RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy,
LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, 
LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, 
GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

URL for the dataset: https://www.kaggle.com/karangadiya/fifa19

In [1]:
# import necessary libraries

import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

In [2]:
# Read the Dataset

df = pd.read_csv('data.csv')

In [3]:
# Examine the Dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 89 columns):
Unnamed: 0                  18207 non-null int64
ID                          18207 non-null int64
Name                        18207 non-null object
Age                         18207 non-null int64
Photo                       18207 non-null object
Nationality                 18207 non-null object
Flag                        18207 non-null object
Overall                     18207 non-null int64
Potential                   18207 non-null int64
Club                        17966 non-null object
Club Logo                   18207 non-null object
Value                       18207 non-null object
Wage                        18207 non-null object
Special                     18207 non-null int64
Preferred Foot              18159 non-null object
International Reputation    18159 non-null float64
Weak Foot                   18159 non-null float64
Skill Moves                 18159 non-null fl

Case 1: Test whether mean Wage differs significantly among players of different nationalities.

Null Hypothesis: There is no difference in mean Wage among players of different nationalities.
Alternate Hypothesis: There is a significant difference in mean Wage among players of different nationalities.


In [4]:
# Subset the data to the variables we want to test

df_sub1 = df[['Nationality', 'Wage']].copy()  # .copy() avoids SettingWithCopyWarning on the assignment below

In [5]:
# Clean the Wage column: strip the Euro (€) prefix and the K (thousands) suffix, so Wage is in thousands of euros

df_sub1['Wage'] = df_sub1['Wage'].replace(r'[€, K]', '', regex=True).astype(int)
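The cleaning step above strips every "K" and treats what remains as thousands of euros. A minimal sketch of a stricter parser follows, using only the standard library. The name parse_wage_thousands is hypothetical, and the handling of an "M" suffix or of values with no suffix is an assumption about what the column might contain, not something the notebook confirms.

```python
def parse_wage_thousands(raw):
    """Convert a wage string such as '€565K' to thousands of euros.

    Hypothetical helper, not part of the original notebook.
    """
    cleaned = raw.strip().lstrip('€')
    if cleaned.endswith('K'):
        # Already expressed in thousands
        return float(cleaned[:-1])
    if cleaned.endswith('M'):
        # Assumption: an 'M' suffix would mean millions
        return float(cleaned[:-1]) * 1000
    # Assumption: no suffix means plain euros
    return float(cleaned) / 1000


examples = ['€565K', '€94K', '€0']
wages = [parse_wage_thousands(w) for w in examples]
print(wages)
```

With pandas, such a helper could be applied through df_sub1['Wage'].apply(parse_wage_thousands) in place of the regex replacement.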


F-Statistic = (Variation Among Sample Means) / (Variation Within Groups)
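As a rough illustration of the ratio above, the sketch below computes a one-way ANOVA F-statistic by hand on a small made-up data set (three groups of three values, not taken from 'data.csv'). The variable names are illustrative. The between-group mean square is divided by the within-group mean square.

```python
from statistics import mean

# Toy data: three groups with three observations each (made up for illustration)
groups = {
    'A': [1, 2, 3],
    'B': [2, 3, 4],
    'C': [5, 6, 7],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = mean(all_values)
k = len(groups)       # number of groups
n = len(all_values)   # total number of observations

# Variation among sample means (between-group sum of squares)
ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2
                 for vals in groups.values())

# Variation within groups (within-group sum of squares)
ss_within = sum((v - mean(vals)) ** 2
                for vals in groups.values() for v in vals)

ms_between = ss_between / (k - 1)   # degrees of freedom: k - 1
ms_within = ss_within / (n - k)     # degrees of freedom: n - k
f_stat = ms_between / ms_within
print(round(f_stat, 4))
```

A large F means the group means are spread out relative to the scatter inside each group, which is the evidence against the null hypothesis.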


In [6]:
# using ols function for calculating the F-statistic and associated p value for Case1

model1 = smf.ols(formula='Wage ~ C(Nationality)', data=df_sub1)
results1 = model1.fit()
print (results1.summary())

                            OLS Regression Results                            
Dep. Variable:                   Wage   R-squared:                       0.045
Model:                            OLS   Adj. R-squared:                  0.037
Method:                 Least Squares   F-statistic:                     5.233
Date:                Tue, 18 Feb 2020   Prob (F-statistic):           3.60e-91
Time:                        12:05:26   Log-Likelihood:                -81692.
No. Observations:               18207   AIC:                         1.637e+05
Df Residuals:                   18043   BIC:                         1.650e+05
Df Model:                         163                                         
Covariance Type:            nonrobust                                         
                                             coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------

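Result: Since the p-value (3.60e-91) is less than the significance level of 0.05, we reject the Null Hypothesis. 
 We conclude that the mean Wage paid to football players differs for at least one nationality. The R-squared of 0.045 indicates that Nationality explains only about 4.5% of the variance in Wage.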

In [7]:
# Running posthoc test for analysis

mc1 = multi.MultiComparison(df_sub1['Wage'], df_sub1['Nationality'])
res1 = mc1.tukeyhsd()
print(res1.summary())

             Multiple Comparison of Means - Tukey HSD,FWER=0.05             
       group1               group2        meandiff   lower    upper   reject
----------------------------------------------------------------------------
    Afghanistan            Albania          5.85    -45.3035 57.0035  False 
    Afghanistan            Algeria        16.8333   -33.5391 67.2058  False 
    Afghanistan            Andorra         -0.25   -109.3096 108.8096 False 
    Afghanistan             Angola         6.4167   -48.4755 61.3088  False 
    Afghanistan       Antigua & Barbuda     1.25    -67.7254 70.2254  False 
    Afghanistan           Argentina       10.2527   -38.6243 59.1296  False 
    Afghanistan            Armenia         20.75    -36.9589 78.4589  False 
    Afghanistan           Australia        2.1017   -47.0828 51.2862  False 
    Afghanistan            Austria         6.4178   -42.6814  55.517  False 
    Afghanistan           Azerbaijan        0.95    -64.4858 66.3858  False 
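The Tukey table for Case 1 is very long because every pair of nationalities is compared. A short sketch of the count, assuming the number of groups equals Df Model + 1 from the OLS summary above (163 + 1):

```python
import math

# Df Model = number of groups - 1, so the OLS summary implies 164 nationalities
n_groups = 163 + 1

# Tukey HSD compares every unordered pair of groups
n_pairs = math.comb(n_groups, 2)
print(n_groups, n_pairs)
```

With this many pairwise comparisons the table is hard to read, which is one motivation for grouping nationalities into a few regions.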

Case 2: To develop this further, check whether there is a significant difference in Wage between European players and Latin American players.

Null Hypothesis: There is no difference in mean Wage between European players and Latin American players.
Alternate Hypothesis: There is a significant difference in mean Wage between European players and Latin American players.

Remapping Countries to Continents

The lists of countries in Europe and Latin America are extracted from the World Atlas website.

Latin American countries list: https://www.worldatlas.com/articles/which-countries-make-up-latin-america.html
European countries list: https://www.worldatlas.com/articles/the-four-european-regions-as-defined-by-the-united-nations-geoscheme-for-europe.html

In [8]:
# List of latin countries

Latin_coun = ['Brazil','Mexico','Colombia','Argentina','Peru','Venezuela', 'Chile','Guatemala','Ecuador','Cuba', 'Bolivia',
             'Haiti','Dominican Republic','Honduras','Paraguay','El Salvador','Nicaragua','Costa Rica','Panama','Puerto Rico',
              'Uruguay','Guadeloupe','Martinique','French Guiana','Saint Martin','Saint Barthélemy']


In [9]:
# List of European countries

Euro_coun = ['Belarus','Åland Islands','Albania','Austria','Bulgaria','Denmark','Andorra','Belgium','Czech Republic','Estonia','Bosnia and Herzegovina',
             'France','Republic of Moldova','Faroe Islands','Croatia','Germany','Poland','Finland','Gibraltar','Liechtenstein','Hungary',
             'Guernsey','Greece','Luxembourg','Romania','Iceland','Italy','Monaco','Russian Federation','Republic of Ireland','Republic of Macedonia',
             'Netherlands','Slovakia','Jersey','Malta','Switzerland','Ukraine','Isle of Man','Montenegro','Latvia','Portugal','Lithuania',
             'San Marino','Norway','Serbia','Sark','Slovenia','Svalbard and Jan Mayen','Spain','Sweden','Vatican City','United Kingdom']

In [10]:
# Subset data for Case2 Study

df_sub2 = df_sub1.copy()

In [11]:
# Mapper function that assigns each country to its region (Europe, Latin, or other)

def GetConti(x):
    if x in Euro_coun:
        return "Europe"
    elif x in Latin_coun:
        return "Latin"
    else:
        return "other"

# Apply the mapper to the Nationality column of the Case 2 subset
df_sub2['Continent'] = df_sub2['Nationality'].apply(GetConti)

# Print to recheck
print(df_sub2)

               Nationality  Wage Continent
0                Argentina   565     Latin
1                 Portugal   405    Europe
2                   Brazil   290     Latin
3                    Spain   260    Europe
4                  Belgium   355    Europe
5                  Belgium   340    Europe
6                  Croatia   420    Europe
7                  Uruguay   455     Latin
8                    Spain   380    Europe
9                 Slovenia    94    Europe
10                  Poland   205    Europe
11                 Germany   355    Europe
12                 Uruguay   125     Latin
13                   Spain   285    Europe
14                  France   225    Europe
15               Argentina   205     Latin
16                 England   205     other
17                  France   145    Europe
18                 Germany   240    Europe
19                 Belgium   240    Europe
20                   Spain   315    Europe
21                 Uruguay   200     Latin
22         
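In the printed output, England (row 16) is labelled "other" because the European list contains 'United Kingdom' rather than 'England'. The mapper only matches exact spellings, so it can help to list which names fall through to "other". The sketch below uses short excerpts of the two lists and a made-up sample of nationalities; get_continent is a hypothetical stand-in for GetConti.

```python
# Short excerpts of the country lists (illustrative, not the full lists)
european = {'Spain', 'France', 'United Kingdom'}
latin = {'Argentina', 'Brazil'}


def get_continent(nationality):
    """Hypothetical stand-in for GetConti: exact-match lookup."""
    if nationality in european:
        return 'Europe'
    if nationality in latin:
        return 'Latin'
    return 'other'


# Made-up sample of nationality values
sample = ['Argentina', 'Spain', 'England', 'Japan']
labels = {country: get_continent(country) for country in sample}

# Names that did not match either list; review these for spelling mismatches
unmatched = sorted(c for c, label in labels.items() if label == 'other')
print(unmatched)
```

On the real data, a comparable check would be to inspect the unique Nationality values where Continent is "other" and decide whether any of them belong in one of the two lists.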

In [14]:
# Find the mean Wage (in thousands of euros) of each continent's players
# Select the Wage column so the non-numeric Nationality column is excluded

print('Mean Wage estimate by Continent')
mean_con = df_sub2.groupby('Continent')[['Wage']].mean()
print(mean_con)


# Find the standard deviation of Wage for each continent's players

print('Standard deviations for Wage by Continent')
sd_con = df_sub2.groupby('Continent')[['Wage']].std()
print(sd_con)

Mean Wage estimate by Continent
                Wage
Continent           
Europe     10.760039
Latin      10.822430
other       7.775154
Standard deviations for Wage by Continent
                Wage
Continent           
Europe     24.148462
Latin      25.341513
other      16.102149


In [15]:
# Use the ols function to calculate the F-statistic and associated p-value for Case 2
# Fit and summarize model2 (not model1); with three Continent groups the summary reports Df Model = 2

model2 = smf.ols(formula='Wage ~ C(Continent)', data=df_sub2)
results2 = model2.fit()
print(results2.summary())

To determine which groups differ from the others, we need to perform a post hoc test.
A post hoc test such as Tukey's HSD compares every pair of group means while controlling the family-wise error rate (FWER).

In [16]:
# Run the post hoc test for Case 2

mc2 = multi.MultiComparison(df_sub2['Wage'], df_sub2['Continent'])
res2 = mc2.tukeyhsd()
print(res2.summary())

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
Europe Latin   0.0624  -0.9623  1.0871 False 
Europe other  -2.9849  -3.8446 -2.1252  True 
Latin  other  -3.0473  -4.1172 -1.9773  True 
---------------------------------------------
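The reject column can be read directly from the confidence intervals: a pair is flagged when the interval for the mean difference excludes zero. A small sketch using the three rows printed above (the helper name excludes_zero is illustrative):

```python
# Rows copied from the Tukey HSD table above:
# (group1, group2, meandiff, lower, upper)
rows = [
    ('Europe', 'Latin', 0.0624, -0.9623, 1.0871),
    ('Europe', 'other', -2.9849, -3.8446, -2.1252),
    ('Latin', 'other', -3.0473, -4.1172, -1.9773),
]


def excludes_zero(lower, upper):
    """Return True when the interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


decisions = {(g1, g2): excludes_zero(lower, upper)
             for g1, g2, _, lower, upper in rows}

for pair, reject in decisions.items():
    print(pair, reject)
```

The Europe and Latin interval spans zero, so that pair is not rejected, while both intervals involving "other" lie entirely below zero.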


Summary and Conclusion for Case 2:
    
    The post hoc test shows no significant difference in mean Wage between Latin and European players. There is a 
    significant difference between players from other regions and both European and Latin players: the mean Wage of 
    the "other" group is about 3 (thousand euros) lower than that of either group.
   

Citation:
    Based on course @ https://www.coursera.org/learn/data-analysis-tools/home/week/1