In [1]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns  

%matplotlib inline

import scipy.stats as stats 
import random

# **F-test for Equality of Variances**

**Business Problem**

*A sports blogger published a post discussing how the distance (km) covered varies between two groups of runners, one on the TCS London Marathon route and one on the TCS New York City Marathon route.*

*The data (runners_data.csv) contains the distance (km) covered by each runner in the two groups, one group per marathon. The distances covered by both groups of runners are assumed to follow a normal distribution.*

*Do we have enough statistical evidence at a 5% significance level to conclude that there is a significant difference between the variances of the running distance (km) for the two groups?*

Let $\sigma_1^2, \sigma_2^2$ be the population variances of the distance (km) covered by the runners in the two groups.

We will test the null hypothesis

>$H_0:\sigma_1^2 = \sigma_2^2$

against the alternative hypothesis

>$H_a:\sigma_1^2 \neq \sigma_2^2$
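The statistic behind this test can be written out explicitly. In the sketch below, $s_1^2, s_2^2$ (sample variances) and $n_1, n_2$ (sample sizes) are symbols introduced here for illustration; they do not appear in the original text. Under $H_0$ and the normality assumption, the ratio of sample variances follows an F distribution:

>$F = \dfrac{s_1^2}{s_2^2} \sim F_{n_1-1,\; n_2-1}$

Because $H_a$ is two-sided, both very small and very large ratios count as evidence against $H_0$. One common convention is to double the smaller tail probability of the observed ratio $F_{obs}$:

>$p = 2\,\min\{\,P(F_{n_1-1,\,n_2-1} \le F_{obs}),\; P(F_{n_1-1,\,n_2-1} \ge F_{obs})\,\}$

With this definition the p-value always lies between 0 and 1.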

In [2]:
run_data = pd.read_csv('runners_data.csv')
run_data.drop(columns = ['Unnamed: 0'], inplace=True)
run_data.head(3)

,marthon1_distance,marthon2_distance
0,21.46,18.03
1,21.73,18.15
2,20.94,18.42


### Are the assumptions of the F-test satisfied?

- Continuous data - Yes, the distance (km) is measured on a continuous scale.
- Normally distributed populations - Yes, it is assumed that the populations are normally distributed.
- Independent populations - Yes, the two sets of runners come from two different marathons, so the populations are independent.
- Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.
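The F-test is known to be sensitive to departures from normality, so it can be worth checking that assumption rather than only taking it as given. The sketch below applies the Shapiro-Wilk test (`scipy.stats.shapiro`) to each group. It uses synthetic data, because the contents of runners_data.csv are not reproduced here; the names `group_1`, `group_2` and the distribution parameters are purely hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two marathon groups (hypothetical values,
# NOT the real runners_data.csv).
rng = np.random.default_rng(42)
group_1 = rng.normal(loc=21.0, scale=1.0, size=50)
group_2 = rng.normal(loc=18.0, scale=1.0, size=50)

normality_results = {}
for name, sample in [("group_1", group_1), ("group_2", group_2)]:
    # Shapiro-Wilk: H0 is that the sample comes from a normal distribution.
    w_stat, p_value = stats.shapiro(sample)
    normality_results[name] = (w_stat, p_value)
    print(f"{name}: W = {w_stat:.4f}, p-value = {p_value:.4f}")
```

A small p-value here would cast doubt on the normality assumption. With the real data, the two columns of `run_data` (after dropping missing values) would take the place of the synthetic arrays.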

### Let's find the F-test statistic and the p-value

In [3]:
from scipy.stats import f

def f_test(x, y):
    """Two-sided F-test for equality of variances of two samples."""
    x = np.array(x)
    y = np.array(y)
    # ratio of the sample variances (ddof=1 gives the unbiased estimate)
    test_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
    dfn = x.size - 1  # numerator degrees of freedom
    dfd = y.size - 1  # denominator degrees of freedom
    # two-sided p-value: double the smaller tail so it can never exceed 1
    cdf = f.cdf(test_stat, dfn, dfd)
    p_value = 2 * min(cdf, 1 - cdf)
    print("The p_value is {}".format(round(p_value, 8)))

In [4]:
f_test(run_data.dropna()['marthon1_distance'], run_data.dropna()['marthon2_distance'])

The p_value is 0.57954819


A p-value of 0.580 is greater than the 0.05 level of significance, so we fail to reject the null hypothesis. We do not have enough statistical evidence to say that the variances of the distance (km) covered by the two groups differ.
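The same decision can be reached by comparing the test statistic with the critical values of the F distribution. The sketch below is a minimal illustration on synthetic data (the function name `two_sided_f_test` and the arrays `sample_a`, `sample_b` are hypothetical, not part of the original notebook). It also shows one way to compute a two-sided p-value that stays within [0, 1], by doubling the smaller of the two tail probabilities.

```python
import numpy as np
from scipy.stats import f

def two_sided_f_test(x, y, alpha=0.05):
    """Sketch of a two-sided F-test: returns statistic, p-value,
    critical values and the reject/fail-to-reject decision."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    stat = np.var(x, ddof=1) / np.var(y, ddof=1)
    dfn, dfd = x.size - 1, y.size - 1
    # Double the smaller tail; f.sf is the upper-tail probability (1 - cdf).
    p_value = 2 * min(f.cdf(stat, dfn, dfd), f.sf(stat, dfn, dfd))
    # Rejection region: stat below the lower or above the upper critical value.
    lower = f.ppf(alpha / 2, dfn, dfd)
    upper = f.ppf(1 - alpha / 2, dfn, dfd)
    reject = bool(stat < lower or stat > upper)
    return stat, p_value, (lower, upper), reject

# Synthetic samples with equal spread (hypothetical, not the real data).
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=21.0, scale=1.0, size=40)
sample_b = rng.normal(loc=18.0, scale=1.0, size=40)

stat, p_value, (lower, upper), reject = two_sided_f_test(sample_a, sample_b)
print(f"F = {stat:.4f}, p-value = {p_value:.4f}")
print(f"critical values: ({lower:.4f}, {upper:.4f}), reject H0: {reject}")
```

Rejecting when the statistic falls outside the two critical values is equivalent to rejecting when this two-sided p-value is below `alpha`, so both routes give the same conclusion.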

### **Conclusion**

- At the 0.05 significance level, we do not have enough evidence to conclude that the variances of the distance (km) covered by the two groups are different. Note that failing to reject the null hypothesis does not prove the variances are equal.