### ANOVA Test Comparing Viewership of Big 3 Tennis Players


Do Novak Djokovic, Rafael Nadal, and Roger Federer have a similar impact on the viewership of tennis matches?


If not, we will take a deeper look at which player has the largest statistical impact.

We will only examine US Open finals played against non-Big 3 players.


##### Null Hypothesis:

The three players have the same mean impact on viewership.


##### Alternative Hypothesis:

At least one player's mean impact on viewership differs from the others.
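
Stated more formally, a one-way ANOVA compares group means. Assuming $\mu_N$, $\mu_D$, and $\mu_F$ denote the mean final viewership for Nadal, Djokovic, and Federer (notation introduced here purely for illustration), the two hypotheses can be sketched as:

$$
\begin{aligned}
H_0 &: \mu_N = \mu_D = \mu_F \\
H_1 &: \text{at least one of } \mu_N,\ \mu_D,\ \mu_F \text{ differs from the others}
\end{aligned}
$$

Note that rejecting $H_0$ would only indicate that some difference exists; it would not by itself say which player differs.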

In [172]:
import scipy.stats as stats
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [173]:
# Collect data: US Open final viewership (in millions) for finals against non-Big 3 opponents
df = pd.DataFrame({
    'Viewership': [2.75, 1.85, 2.07, 1.7, 2.34, 4.94],
    'Player': ['Nadal', 'Nadal', 'Djokovic', 'Djokovic', 'Federer', 'Federer'],
    'Year': [2019, 2017, 2018, 2016, 2009, 2005]
})

# Show the most recent finals first, indexed by year
df.sort_values(by='Year', inplace=True, ascending=False)
df.set_index('Year', inplace=True)
df

| Year | Viewership | Player |
|------|------------|--------|
| 2019 | 2.75 | Nadal |
| 2018 | 2.07 | Djokovic |
| 2017 | 1.85 | Nadal |
| 2016 | 1.70 | Djokovic |
| 2009 | 2.34 | Federer |
| 2005 | 4.94 | Federer |


In [174]:
# Perform ANOVA
anova_result = stats.f_oneway(df[df['Player'] == 'Nadal']['Viewership'],
                              df[df['Player'] == 'Djokovic']['Viewership'],
                              df[df['Player'] == 'Federer']['Viewership'])

print(f"F-Statistic: {anova_result.statistic:.4f}")

print(f"\nP-Value: {anova_result.pvalue:.4f}")

if anova_result.pvalue < 0.05:
    print("\nReject the null hypothesis: There is a significant difference in viewership based on players.")
else:
    print("\nFail to reject the null hypothesis: \nNo significant difference in viewership based on players.")


F-Statistic: 1.3100

P-Value: 0.3900

Fail to reject the null hypothesis: 
No significant difference in viewership based on players.
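
To make the reported F-statistic less of a black box, here is a minimal sketch that recomputes it by hand from the six US Open figures in the table above. It uses plain Python rather than `scipy`, and the variable names (`groups`, `ss_between`, `ss_within`) are illustrative choices, not part of the original notebook.

```python
# Hand computation of the one-way ANOVA F-statistic for the US Open finals.
# Values are copied from the DataFrame defined earlier.
groups = {
    "Nadal": [2.75, 1.85],
    "Djokovic": [2.07, 1.7],
    "Federer": [2.34, 4.94],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares: group size times the squared distance
# of each group mean from the grand mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group sum of squares: squared distance of each value from its own group mean.
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in groups.values()
    for v in values
)

df_between = len(groups) - 1               # k - 1 = 2
df_within = len(all_values) - len(groups)  # N - k = 3

# F is the ratio of the between-group to the within-group mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F-Statistic: {f_stat:.4f}")  # F-Statistic: 1.3100
```

With only three within-group degrees of freedom, the F-statistic has to be very large before the p-value drops below 0.05, which helps explain the result above.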


#### Adding More Data

To further validate the results, we add the Wimbledon, Roland Garros, and Australian Open finals played under the same conditions to the data table.

Again, we only use finals featuring exactly one Big 3 player, since a final between two of them would likely inflate viewership and confound the comparison.

In [175]:
# Label the US Open rows with their event and restore Year as a regular column

df['Event'] = 'US Open'
df.reset_index(inplace=True)

In [176]:
# Prepare new data for other slams
uso = df  # alias for the US Open data; changes made through uso also apply to df

wimbledon = pd.DataFrame({
    'Year': [2003, 2004, 2005,2009,2010,
             2012,2013,2017,2018,2021,
            2022,2023],
    'Viewership': [6.5, 6.7,7,9,7.4,
                   10,11,7.6,6.3,6.5,
                  6.8,7.9],
    'Player': ['Federer', 'Federer','Federer','Federer','Nadal',
               'Federer','Djokovic','Federer','Djokovic','Djokovic',
              'Djokovic','Djokovic']
})

french = pd.DataFrame({
    'Year': [2005,2009,2010,2013,2015,
            2016,2017,2018,2019,2021,
             2022,2023],
    'Viewership': [5,6.8,7,7.5,7.9,
                  8.6,8.2,8.5,8.7,8.9,
                   8.5,8.3],
    'Player': ['Nadal','Federer','Nadal','Nadal','Djokovic',
              'Djokovic','Nadal','Nadal','Nadal','Djokovic',
              'Nadal','Djokovic']
})

australian = pd.DataFrame({
    'Year': [2004,2006,2007,2008,2010,
            2011,2013,2014,2015,2016,
            2018,2020,2021,2022,2023],
    'Viewership': [1.7,1.9,2,2.4,2.6,
                  2,2.9,3.1,2.5,2.7,
                  3.5,3.4,2.8,3.6,3.1],
    'Player': ['Federer','Federer','Federer','Djokovic','Federer',
              'Djokovic','Djokovic','Nadal','Djokovic','Djokovic',
             'Federer','Djokovic','Djokovic','Nadal','Djokovic']
})

wimbledon['Event']=['Wimbledon']*len(wimbledon)
french['Event']=['Roland Garros']*len(french)
australian['Event']=['Australian Open']*len(australian)


In [177]:
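# Standardize each slam's national viewership figures for comparison
# (z-scores computed within each event, so audience sizes become comparable)

def standardizer(views, dataframe):
    """Add a 'Standardized Viewership' column of z-scores to dataframe, in place."""
    # Fit a fresh scaler per call so each event is scaled independently
    scaler = StandardScaler()
    views_standard = scaler.fit_transform(views.to_numpy().reshape(-1, 1))
    dataframe['Standardized Viewership'] = views_standard.ravel()

standardizer(uso['Viewership'], uso)
standardizer(french['Viewership'], french)
standardizer(wimbledon['Viewership'], wimbledon)
standardizer(australian['Viewership'], australian)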
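
To show what the standardization step does to a single event, the sketch below recomputes the Wimbledon z-scores with the standard library only. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default; the dictionary `wimbledon_views` is a hypothetical helper built from the Wimbledon figures defined earlier.

```python
from statistics import mean, pstdev

# Wimbledon viewership figures copied from the data above, keyed by year.
wimbledon_views = {
    2003: 6.5, 2004: 6.7, 2005: 7, 2009: 9, 2010: 7.4, 2012: 10,
    2013: 11, 2017: 7.6, 2018: 6.3, 2021: 6.5, 2022: 6.8, 2023: 7.9,
}

values = list(wimbledon_views.values())
mu = mean(values)
sigma = pstdev(values)  # population standard deviation (ddof=0)

# z-score: distance from the event mean, measured in event standard deviations.
z_scores = {year: (v - mu) / sigma for year, v in wimbledon_views.items()}

print(f"2013 z-score: {z_scores[2013]:.4f}")
```

Because each event is scaled against its own mean and spread, a standardized value describes how unusual a final was for that tournament, not its absolute audience size.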


In [178]:
# Concatenate original DataFrame with the new data
four_slams = pd.concat([uso, wimbledon,french,australian])

# Organize the Table
four_slams.sort_values(by='Viewership',inplace=True,ascending=False)
four_slams.index=four_slams[['Event','Year']]
four_slams.drop(columns=['Event','Year'],inplace=True)
four_slams

| (Event, Year) | Viewership | Player | Standardized Viewership |
|---------------|------------|--------|-------------------------|
| (Wimbledon, 2013) | 11.0 | Djokovic | 2.262547 |
| (Wimbledon, 2012) | 10.0 | Federer | 1.571693 |
| (Wimbledon, 2009) | 9.0 | Federer | 0.880839 |
| (Roland Garros, 2021) | 8.9 | Djokovic | 1.008952 |
| (Roland Garros, 2019) | 8.7 | Nadal | 0.82124 |
| (Roland Garros, 2016) | 8.6 | Djokovic | 0.727384 |
| (Roland Garros, 2018) | 8.5 | Nadal | 0.633528 |
| (Roland Garros, 2022) | 8.5 | Nadal | 0.633528 |
| (Roland Garros, 2023) | 8.3 | Djokovic | 0.445816 |
| (Roland Garros, 2017) | 8.2 | Nadal | 0.35196 |


In [180]:
# Perform ANOVA again with more data
anova_result = stats.f_oneway(four_slams[four_slams['Player'] == 'Nadal']['Standardized Viewership'],
                              four_slams[four_slams['Player'] == 'Djokovic']['Standardized Viewership'],
                              four_slams[four_slams['Player'] == 'Federer']['Standardized Viewership'])

print(f"F-Statistic: {anova_result.statistic:.4f}")

print(f"\nP-Value: {anova_result.pvalue:.4f}")

if anova_result.pvalue < 0.05:
    print("\nReject the null hypothesis: There is a significant difference in viewership based on players.")
else:
    print("\nFail to reject the null hypothesis: \nNo significant difference in viewership based on players.")


F-Statistic: 0.2170

P-Value: 0.8058

Fail to reject the null hypothesis: 
No significant difference in viewership based on players.


#### Conclusion:

After adding data from the other three majors, we still fail to reject the null hypothesis.

Viewership is highly variable (roughly 1.7 to 11 million in the raw figures, depending mainly on the tournament and year).

Although viewership of the Big 3's individual finals fluctuates, their overall influence on attracting an audience appears similar: the test finds no statistically significant difference between them.

However, failing to reject the null hypothesis does not prove that the players' impacts are equal. The variability within each player's Grand Slam finals, together with the small number of finals per player, may limit the test's ability to detect a real difference between the players' influence on viewership.
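
One way to illustrate this limitation is an effect size such as eta squared, the share of total variance that lies between groups. This is not part of the original analysis; the sketch below is an assumption-laden illustration that uses only the six US Open figures from the first table and plain Python.

```python
# Eta squared for the US Open finals: between-player variance as a share of total variance.
# Values are copied from the first DataFrame; the names here are illustrative.
groups = {
    "Nadal": [2.75, 1.85],
    "Djokovic": [2.07, 1.7],
    "Federer": [2.34, 4.94],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Total sum of squares: squared distance of every value from the grand mean.
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

# Between-group sum of squares: group size times squared distance
# of each group mean from the grand mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

eta_squared = ss_between / ss_total
print(f"Eta squared: {eta_squared:.3f}")  # Eta squared: 0.466
```

In this small sample, roughly 47% of the variance lies between players, yet with only six finals the F-test is far from significant. That pattern is consistent with a test that has too little data to separate a real difference from noise.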