
An ***F-test*** is a statistical test used in hypothesis testing to check whether the variances of two populations are equal.

General steps for an F-Test:

1. State the Null Hypothesis ($H_0$) and the Alternate Hypothesis ($H_1$).
 * $H_0 : \sigma_1^2 = \sigma_2^2$ (the variances are equal)
 * $H_1 : \sigma_1^2 \neq \sigma_2^2$ (the variances are not equal)

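2. Calculate the **F value** (the test statistic). When comparing two nested regression models, it is $F = \dfrac{(sse_1 - sse_2)/m}{sse_2/(n-k)}$, where $sse_1$ and $sse_2$ are the residual sums of squares of the restricted and unrestricted models, $m$ = number of restrictions, $n$ = number of observations and $k$ = number of independent variables. When comparing two sample variances directly, as the code in this notebook does, the statistic is simply the ratio of the sample variances, $F = s_1^2 / s_2^2$.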

3. Find the **critical value** for this test in the **F-Table**, using the chosen significance level $\alpha$ and the numerator and denominator degrees of freedom.
(In the ANOVA setting, the F statistic itself is the variance of the group means divided by the mean of the within-group variances.)

4. Reject or fail to reject the Null Hypothesis: reject $H_0$ if the F value exceeds the critical value.
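The steps above can be sketched in a few lines for the two-variance case. This is a minimal illustration, not a definitive implementation: the two samples and the significance level `alpha = 0.05` are assumptions made for the example, and `scipy.stats.f.ppf` stands in for looking up the critical value in an F-Table.

```python
import numpy as np
from scipy import stats

# Small made-up samples, used only for illustration
sample_a = [0.28, 0.2, 0.26, 0.28, 0.5]
sample_b = [0.2, 0.23, 0.26, 0.21, 0.23]

alpha = 0.05  # assumed significance level

# Step 2: F value = ratio of the sample variances (ddof=1 gives the unbiased estimate)
f_value = np.var(sample_a, ddof=1) / np.var(sample_b, ddof=1)

# Step 3: right-tailed critical value, the number an F-Table would give
dfn = len(sample_a) - 1  # numerator degrees of freedom
dfd = len(sample_b) - 1  # denominator degrees of freedom
f_critical = stats.f.ppf(1 - alpha, dfn, dfd)

# Step 4: reject H0 when the F value exceeds the critical value
reject_h0 = f_value > f_critical
print(f_value, f_critical, reject_h0)
```

Here the sample with the larger variance is placed in the numerator, so a single right-tailed comparison is enough.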

##### Assumptions:

The test is only reliable when the following assumptions and conventions hold:

1. Populations must be approximately *normally distributed*.
2. The samples must be *independent* of each other.
3. The larger variance should always go in the *numerator* to force the test into a **right-tailed test**.
4. For two-tailed tests, divide $\alpha$ by 2 before finding the right critical value.
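Points 3 and 4 can be combined into one small function. The sketch below is an assumption-laden illustration: `f_test_two_tailed` is a hypothetical helper name, `alpha = 0.05` is an assumed default, and the two-tailed p-value is taken as twice the right-tail area once the larger variance sits in the numerator.

```python
import numpy as np
from scipy import stats

def f_test_two_tailed(x, y, alpha=0.05):
    """Hypothetical helper: two-tailed F-test for equal variances."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)

    # Put the larger variance in the numerator so that F >= 1
    if var_x >= var_y:
        f, dfn, dfd = var_x / var_y, x.size - 1, y.size - 1
    else:
        f, dfn, dfd = var_y / var_x, y.size - 1, x.size - 1

    # Right critical value at alpha / 2 for a two-tailed test
    f_critical = stats.f.ppf(1 - alpha / 2, dfn, dfd)

    # Two-tailed p-value: double the right-tail area, capped at 1
    p_two_tailed = min(1.0, 2 * stats.f.sf(f, dfn, dfd))
    return f, f_critical, p_two_tailed

# The argument order does not matter: the larger variance is moved to the numerator
f, crit, p = f_test_two_tailed([0.2, 0.23, 0.26, 0.21, 0.23],
                               [0.28, 0.2, 0.26, 0.28, 0.5])
print(f, crit, p)
```

Because the function reorders the samples itself, the caller does not have to know in advance which group is more variable.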

**When do we use the F-test?**

The F-test is typically used to answer one of the following questions:

1. Do two samples come from populations with equal variances?

2. Does a new treatment or process reduce the variability of some current treatment or process?

Running the test on artificially generated data and on real-world data [22] ~

In [None]:
# import libraries
import numpy as np
import scipy.stats

# define the F-test function
def f_test(x, y):
    x = np.array(x)
    y = np.array(y)
    f = np.var(x, ddof=1) / np.var(y, ddof=1)  # F test statistic: ratio of sample variances
    dfn = x.size - 1  # numerator degrees of freedom
    dfd = y.size - 1  # denominator degrees of freedom
    # right-tailed p-value, so pass the sample with the larger variance as x
    p = 1 - scipy.stats.f.cdf(f, dfn, dfd)
    return f, p

In [None]:
# sample data
group1 = [0.28, 0.2, 0.26, 0.28, 0.5]
group2 = [0.2, 0.23, 0.26, 0.21, 0.23]

In [None]:
f_test(group1, group2)

(24.679245283018858, 0.004431318383760985)

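The ***p-value (0.0044)*** is well below the usual significance level of 0.05, so we reject the null hypothesis that the variances are equal and conclude that the two population variances differ. Since this is a right-tailed p-value, a two-tailed test would compare it against $\alpha/2 = 0.025$, and the conclusion is the same.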

Let's implement this test using real-world data ~

In [None]:
import pandas as pd

#dataset [21]
df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   patient    120 non-null    int64 
 1   sex        120 non-null    object
 2   agegrp     120 non-null    object
 3   bp_before  120 non-null    int64 
 4   bp_after   120 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 4.8+ KB


In [None]:
#descriptive statistics
df[df['sex'] == 'Male'].describe()

| | patient | bp_before | bp_after |
|---|---|---|---|
| count | 60.0 | 60.0 | 60.0 |
| mean | 30.5 | 159.266667 | 155.516667 |
| std | 17.464249 | 11.413442 | 15.243217 |
| min | 1.0 | 140.0 | 125.0 |
| 25% | 15.75 | 150.75 | 145.0 |
| 50% | 30.5 | 158.0 | 158.5 |
| 75% | 45.25 | 170.0 | 164.75 |
| max | 60.0 | 185.0 | 185.0 |


In [None]:
#the groups that we'll be comparing
group1 = df['bp_after'][df['sex'] == 'Male']
group2 = df['bp_after'][df['sex'] == 'Female']

In [None]:
f_test(group1, group2)

(1.6850611305046137, 0.02360324462983243)

Since the p-value (0.024) is below 0.05, **the null hypothesis can be rejected**: there is enough evidence to conclude that the variance of `bp_after` differs between the male and female groups.

----

A ***Z-test*** is used to determine whether a population mean differs from a hypothesized value, or whether two population means differ, when the variances are known and the sample size is large.

* It is closely related to the t-test, but t-tests are better suited to experiments with a small sample size. Z-tests are best used with more than 30 observations because, under the central limit theorem, the sampling distribution of the sample mean becomes approximately normal as the sample size grows.

* Z-tests assume the population standard deviation is known, while t-tests assume it is unknown.
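To make the mechanics concrete, here is a minimal sketch of a one-sample z-test computed by hand. `one_sample_z` is a hypothetical helper, the data are synthetic, and the sample standard deviation is used as a large-sample stand-in for the population value, which is an assumption rather than part of the textbook definition.

```python
import numpy as np
from scipy import stats

def one_sample_z(sample, mu0):
    """Hypothetical helper: z statistic and two-sided p-value for H0: mean == mu0."""
    sample = np.asarray(sample, dtype=float)
    # Standard error of the mean, using the sample standard deviation (ddof=1)
    se = sample.std(ddof=1) / np.sqrt(sample.size)
    z = (sample.mean() - mu0) / se
    # Two-sided p-value from the standard normal distribution
    p = 2 * stats.norm.sf(abs(z))
    return z, p

# Synthetic data, generated only for illustration
rng = np.random.default_rng(0)
synthetic = rng.normal(loc=155, scale=15, size=60)
print(one_sample_z(synthetic, mu0=155))
```

The z statistic measures how many standard errors the sample mean lies from the hypothesized value, and the p-value is the probability of a value at least that extreme under the null hypothesis.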

> One sample z test

In [None]:
from statsmodels.stats.weightstats import ztest

# group1 is the blood pressure of the male patients after the medicine (mean about 155.5);
# each call tests whether the population mean equals the given reference value

# p-value of about 0.97: we can't reject the null hypothesis, the mean is statistically indistinguishable from 155.6
print(ztest(group1, value=155.6))

# very low p-value, understandably, because 190 is very far from the group mean of about 155.5
print(ztest(group1, value=190))

# significant p-value: we reject the null hypothesis; if 163 were the original bp, the change would be statistically significant
print(ztest(group1, value=163))

(-0.04234652336929103, 0.9662224582407137)
(-17.522991370216605, 9.566068717867366e-69)
(-3.8027177985631924, 0.00014311735027259506)


> Two sample z test

In [None]:
#similarly we can compare two groups and see if the means are statistically significantly different
ztest(group1, group2)

(3.3479506182111387, 0.000814115163437746)
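The two-sample statistic can also be computed by hand. The sketch below is illustrative only: `two_sample_z` is a hypothetical helper and it uses the unpooled standard error, whereas statsmodels' `ztest` pools the two variances by default. For groups of equal size, as here, the two versions give the same value.

```python
import numpy as np
from scipy import stats

def two_sample_z(x, y):
    """Hypothetical helper: z statistic and two-sided p-value for H0: mean(x) == mean(y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Unpooled standard error of the difference in means
    se = np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)
    z = (x.mean() - y.mean()) / se
    # Two-sided p-value from the standard normal distribution
    p = 2 * stats.norm.sf(abs(z))
    return z, p

# Tiny made-up samples, used only to show the call
print(two_sample_z([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

Applied to the notebook's `group1` and `group2`, a p-value below 0.05 means the null hypothesis of equal means is rejected.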