<img src="https://miro.medium.com/max/930/0*uj57wvrqEnVe9ijg" width=2000>

# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Table of Contents</p></div>

1. [What is Hypothesis Testing](#1)
2. [Null Hypothesis vs Alternative Hypothesis](#2)
    * [Null Hypothesis](#3)
    * [Alternative Hypothesis](#4)
        * [Two-sided Hypothesis](#5)
        * [Left-sided Hypothesis](#6)
        * [Right-sided Hypothesis](#7)
3. [Null Hypothesis vs Alternative Hypothesis: Which Statistical Test to Choose?](#8)
4. [Dataset to demonstrate the use of each type of statistical test](#9)
5. [Most Popular Types of Statistical Tests in Data Science](#10)
    * [Z-test for Population Mean](#11)
    * [One-Sample t-test for Population Mean](#12)
    * [Paired t-Test](#13)
    * [Two-Sample t-Test](#14)
    * [ANOVA](#15)
    * [Chi-Square Goodness-of-Fit (GoF)](#16)
    * [Chi-Square Independence Test](#17)
    
   
# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='1' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">1 | What is Hypothesis Testing</p></div>

When we draw conclusions from data, we usually work with a sample rather than the entire population. The natural question is: can we trust the results from our sample to make a broader claim about the population? Answering that question is the primary objective of hypothesis testing.

There are several steps we should follow to properly conduct a hypothesis test:
* First, form our null hypothesis and alternative hypothesis.
* Set our significance level. The significance level varies depending on our use case, but a common default is 0.05.
* Perform a statistical test that suits our data.
* Check the resulting p-Value. If the p-Value is smaller than our significance level, we reject the null hypothesis in favor of our alternative hypothesis. If the p-Value is larger than our significance level, we fail to reject the null hypothesis.
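
The last step can be sketched as a tiny helper. This is only an illustration: the function name `decide` and its wording are our own assumptions, not part of any library.

```python
def decide(p_value, alpha=0.05):
    """Return the decision of a hypothesis test at significance level alpha.

    `decide` is a hypothetical helper written for this article.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value < alpha:
        return "reject the null hypothesis"
    # A large p-Value does not prove the null hypothesis; we only fail to reject it.
    return "fail to reject the null hypothesis"


print(decide(0.003))              # small p-Value
print(decide(0.27))               # large p-Value
print(decide(0.03, alpha=0.01))   # stricter significance level
```

Keeping the significance level as a parameter makes it explicit that the same p-Value can lead to different decisions under different thresholds.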

You've seen the broad methodology for conducting hypothesis testing up to this point. But everything up to this point may have seemed a little abstract to you. How can we correctly create a null hypothesis and an alternative hypothesis given our data? What kind of statistical analysis should we run?

# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='2' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">2 | Null Hypothesis vs Alternative Hypothesis</p></div>

<h4 id='3'>Null Hypothesis</h4>

The null hypothesis is the accepted status quo, the default position. It states that nothing happened: no association exists, or there is no significant difference between the mean or proportion of our sample and that of the population.

<h4 id='4'>Alternative Hypothesis</h4>

The alternative hypothesis is the opposite of the null hypothesis. It states that something is going on: there is a significant difference between the mean or proportion of our sample and that of the population.

<h4 id='5'>Two-sided Hypothesis</h4>

When we only want to determine whether the mean or proportion of our sample differs significantly from that of the population, in either direction, we can use a two-sided hypothesis.

<center><img src='https://cdn.sanity.io/images/oaglaatp/production/b7ede65157deaf288e6a13a977757f550c0a1ced-700x434.png?w=700&h=434&auto=format'></center>

<h4 id='6'>Left-sided Hypothesis</h4>

A left-sided hypothesis can be used when we want to know whether the mean or proportion is smaller than the value stated in the null hypothesis. The rejection region then sits in the left tail of the distribution.

<center><img src='https://cdn.sanity.io/images/oaglaatp/production/09db4e87fa4e7ed3f18151e4fdc8959409d4cd9a-597x377.png?w=597&h=377&auto=format'></center>

<h4 id='7'>Right-sided Hypothesis</h4>

A right-sided hypothesis can be used when we want to know whether the mean or proportion is larger than the value stated in the null hypothesis. The rejection region then sits in the right tail of the distribution.

<center><img src='https://cdn.sanity.io/images/oaglaatp/production/97f40fce6e4ff626e66c2e5dae077133ed6dd7e2-643x410.png?w=643&h=410&auto=format'></center>
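
To make the three kinds of alternative hypothesis concrete, here is a small sketch that runs the same one-sample t-test three times with Scipy's `alternative` argument. The data is synthetic and the reference value of 50 is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic sample (an assumption for illustration): true mean 52, tested against 50.
rng = np.random.default_rng(0)
sample = rng.normal(loc=52.0, scale=5.0, size=200)
reference = 50.0

# Same data, three different alternative hypotheses.
p_two = ttest_1samp(sample, popmean=reference, alternative='two-sided').pvalue
p_less = ttest_1samp(sample, popmean=reference, alternative='less').pvalue
p_greater = ttest_1samp(sample, popmean=reference, alternative='greater').pvalue

print(f"two-sided:   {p_two:.3g}")
print(f"left-sided:  {p_less:.3g}")
print(f"right-sided: {p_greater:.3g}")
```

The two one-sided p-Values add up to 1, and the two-sided p-Value is twice the smaller of them, which is why the direction of the alternative hypothesis has to be chosen before looking at the data.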


# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='8' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">3 | Null Hypothesis vs Alternative Hypothesis: Which Statistical Test to Choose?</p></div>

Since the null hypothesis is our default position, a test never proves it, so we do not ‘accept’ the null hypothesis. We either reject the null hypothesis in favor of the alternative hypothesis, or we fail to reject it.

Now, to know whether or not we should reject the null hypothesis, it depends on two factors:

*  The significance level
*  The p-Value
    
We set the significance level in advance, for example 0.05. To find the p-Value, we need to run a statistical test.

The general idea is that if the resulting p-Value is less than our significance level, we reject the null hypothesis. If the p-Value is larger than our significance level, we fail to reject the null hypothesis.

The problem is that there are many statistical tests out there, and which one we should apply depends entirely on our use case and data. So the natural next question is: which statistical test should we choose for the problem and data that we have?

To answer this question, we’re going to show you different types of statistical tests available out there and when you’re going to need each of them with one example dataset as our use case. So, let’s take a look at the dataset first.


# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='9' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">4 | Dataset to demonstrate the use of each type of statistical test</p></div>

To demonstrate the use of each type of statistical test, of course we need data for it. In this article, we’re going to use a student dataset. Below is the snapshot of what the dataset looks like:

In [1]:
import pandas as pd

df = pd.read_csv('../input/student/students.csv')

def basic_summary(df):
    summary = pd.DataFrame(df.dtypes, columns=['Data Type']).reset_index().rename(columns={'index':'Feature'})
    summary['Num of Nulls'] = df.isnull().sum().values
    summary['Num of Unique'] = df.nunique().values
    return summary

display(basic_summary(df))

Unnamed: 0,Feature,Data Type,Num of Nulls,Num of Unique
0,stud.id,int64,0,8239
1,name,object,0,8174
2,gender,object,0,2
3,age,int64,0,47
4,height,int64,0,67
5,weight,float64,0,464
6,religion,object,0,5
7,nc.score,float64,0,301
8,semester,object,0,7
9,major,object,0,6


# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='10' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">5 | Most Popular Types of Statistical Tests in Data Science</p></div>

Now that we know the data that we will work with in this article, let’s start with the first statistical test type, which is the Z-test for population mean.

# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='11' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">Z-Test for population mean</p></div>

The Z-test for a population mean is one of the simplest statistical tests, which makes it a good starting point for learning about hypothesis testing. It compares the mean of our sample against a known or hypothesized population mean.

To properly conduct this test, we need to make sure that our data fulfills the prerequisites as follows:

* The variable in our data is continuous and either follows a normal distribution or comes from a large sample
* The sample is randomly selected from its population
* The population standard deviation is known
    
The last point there, which is the population standard deviation must be known, is almost never fulfilled in real-life because normally we don’t know the standard deviation of the population.
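
To show what the Z-test computes, here is a minimal sketch using only the Python standard library. It assumes the population standard deviation `sigma` is known. The function name `z_test` and the weights and `sigma` value are made up for illustration and do not come from the student dataset.

```python
import math
from statistics import NormalDist, mean


def z_test(sample, mu0, sigma):
    """Two-sided Z-test for a population mean with known population std `sigma`.

    Hypothetical helper: z = (sample mean - mu0) / (sigma / sqrt(n)).
    """
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / math.sqrt(n))
    # Two-sided p-Value: probability of a |Z| at least this large under the null.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


weights = [72.1, 69.4, 75.0, 71.3, 68.8, 74.2, 70.9, 73.5]  # made-up values
z, p = z_test(weights, mu0=70.8, sigma=2.5)                 # sigma is assumed known
print(f"z = {z:.3f}, p-Value = {p:.3f}")
```

The standard error `sigma / sqrt(n)` shrinks as the sample grows, so the same difference in means produces a larger Z-score with more data.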

Suppose we want to know whether the average weight of the students in our dataset differs from the average weight of European adults, which the code below takes to be 70.8. This use case suits the Z-test because we fulfill the following conditions:

* We have gathered a large sample
* Weight is a continuous variable

Let’s say that our significance level is 0.05. Next, we can compute the Z-score and p-Value by using a statistical library in Python or R. In this article, we’re going to use the statsmodels library in Python to conduct the Z-test and compute the p-Value.

In [2]:
from statsmodels.stats.weightstats import ztest

test_stats, p_value = ztest(x1=df['weight'], value=70.8)

print(p_value)

if p_value < 0.05:
    print("\nWe reject Null hypothesis")
else:
    print("\nWe fail to reject Null hypothesis")

4.0517857849264745e-118

We reject Null hypothesis


As you can see from the code snippet above, the p-Value that we got is 4.05e-118, which is far smaller than our significance level. Hence, we conclude that our data provides strong evidence to reject the null hypothesis at a significance level of 0.05.

# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='12' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">One-sample t-test for population mean</p></div>

One-sample t-test is the more general version of the Z-test. As mentioned previously, what makes the Z-test difficult to implement in real-life is its assumption about population standard deviation that we need to fulfill. In real-life, oftentimes we don’t know the population standard deviation. And this is where we can conduct one sample t-test.

To properly conduct this test, we need to make sure that our data fulfills the following conditions:

* The variable in our data is continuous and either follows a normal distribution or comes from a large sample
* The sample is randomly selected from its population
    
We revisit the same question as before, whether the average weight of the students differs from 70.8, and again set the significance level to 0.05. Next, we compute the test statistic and p-Value with the Scipy library in Python.

In [3]:
from scipy.stats import ttest_1samp

test_statistic, p_value = ttest_1samp(df['weight'], popmean=70.8, alternative='two-sided')

print(p_value)

if p_value < 0.05:
    print("\nWe Reject Null Hypothesis")
else:
    print("\nWe Fail to Reject Null Hypothesis")

1.6709961011966602e-114

We Reject Null Hypothesis


As you can see, the p-Value that we got is extremely small: 1.67e-114. This means that at the 0.05 significance level, our data provides very strong evidence to reject the null hypothesis, i.e. the average weight of European students is indeed different from the average weight of European adults.
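
For readers who want to see what `ttest_1samp` computes, the sketch below writes out the one-sample t-statistic by hand and checks it against Scipy. The helper name `one_sample_t` and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from scipy import stats


def one_sample_t(sample, popmean):
    """Two-sided one-sample t-test written out by hand (hypothetical helper)."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    # The unknown population std is replaced by the sample std (ddof=1).
    standard_error = x.std(ddof=1) / np.sqrt(n)
    t_stat = (x.mean() - popmean) / standard_error
    # The statistic follows a t-distribution with n - 1 degrees of freedom.
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value


rng = np.random.default_rng(42)
demo = rng.normal(loc=73.0, scale=8.0, size=50)  # synthetic weights, not the dataset

t_manual, p_manual = one_sample_t(demo, popmean=70.8)
t_scipy, p_scipy = stats.ttest_1samp(demo, popmean=70.8)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The only difference from the Z-test is that the standard deviation is estimated from the sample, which is why the t-distribution replaces the normal distribution.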

# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='13' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">Paired t-tests</p></div>

In the previous section, we have seen how we can conduct a statistical test when we want to compare the means of our sample with the general population.

With paired t-tests, the goal is different compared to one-sample t-test. Instead of comparing our sample with the population, we want to compare two different conditions on the same variable and then check whether there is any significant difference between the two conditions.

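To properly conduct this test, we need to make sure that our data fulfills the following conditions:

* The variable in our data is continuous and either follows a normal distribution or comes from a large sample
* The sample is randomly selected from its population

As our use case, we look at students who took an online tutorial between two exams and compare their grades on the first exam (score1) with their grades on the second exam (score2).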

In [4]:
df_score = df[
    (df['online.tutorial'] == 1) & 
    (df['score1'].notnull()) & 
    (df['score2'].notnull())
][['name','score1','score2','online.tutorial']]
df_score

Unnamed: 0,name,score1,score2,online.tutorial
11,"Lang, Mackenzie",62.0,61.0,1
13,"Covar Orendain, Christopher",71.0,76.0,1
14,"Lopez, Monique",66.0,70.0,1
16,"Adams, Jose",87.0,91.0,1
19,"Roybal, Ebony",69.0,46.0,1
...,...,...,...,...
8230,"Robinson, Clayton",88.0,89.0,1
8234,"Marler, Jalen",71.0,77.0,1
8237,"Villa, Raechelle",77.0,75.0,1
8238,"Ngo, Preston",50.0,46.0,1


First, as usual, we set our significance level, in this case 0.05. Next, we form our null hypothesis and alternative hypothesis as follows:

* Null hypothesis: the average grade before and after taking an online learning tutorial is the same.
* Alternative hypothesis: the average grade after taking an online learning tutorial (score2) is higher than before (score1).

Notice that with the way we formulate the alternative hypothesis, we’re conducting a left-sided test: we pass score1 first, so the alternative states that the mean of score1 - score2 is less than zero. Now let’s compute the test statistic and p-Value with Scipy:

In [5]:
from scipy.stats import ttest_rel

test_statistic, p_value = ttest_rel(df_score['score1'], df_score['score2'],alternative='less')

print(p_value)

if p_value < 0.05:
    print("\nWe Reject Null Hypothesis")
else:
    print("\nWe Fail to Reject Null Hypothesis")

8.946942058314536e-77

We Reject Null Hypothesis


As you can see, the p-Value is very small, so at a significance level of 0.05 we reject the null hypothesis in favor of the alternative hypothesis: the average student’s grade after taking the online tutorial is indeed higher than before.

But there is a competing explanation: perhaps the second exam was simply easier than the first, so students performed better regardless of the tutorial. To check this, we take the data from students who didn’t take the online tutorial and compare the grades of their first and second exams.

In [6]:
df_score_no = df[(df['online.tutorial'] == 0) & 
              (df['score1'].notnull()) & 
              (df['score2'].notnull())][['name','score1','score2','online.tutorial']]
df_score_no

Unnamed: 0,name,score1,score2,online.tutorial
3,"Williams, Hanh",45.0,46.0,0
8,"Allen, Rebecca Marie",58.0,62.0,0
9,"Tracy, Robert",57.0,67.0,0
12,"Rodriguez, Brianna",76.0,82.0,0
17,"Hines, Haileigh",57.0,54.0,0
...,...,...,...,...
8214,"Tealer, Ashley",77.0,77.0,0
8215,"Smith, Isaiah",76.0,84.0,0
8219,"Woody, Kin-Lino",65.0,68.0,0
8229,"Hill, Anissa",87.0,86.0,0


These are our null and alternative hypotheses:

* Null hypothesis: among students who didn’t take an online tutorial, the average grade of the first exam is the same as that of the second exam.
* Alternative hypothesis: among students who didn't take an online tutorial, the average grade of the second exam is higher than that of the first exam.

From the way we formulate our hypotheses, we’re again conducting a left-sided test. With Scipy, we can compute the p-Value as you can see below:

In [7]:
from scipy.stats import ttest_rel

test_statistic, p_value = ttest_rel(df_score_no['score1'], df_score_no['score2'],alternative='less')

print(p_value)

if p_value < 0.05:
    print("\nWe Reject Null Hypothesis")
else:
    print("\nWe Fail to Reject Null Hypothesis")

0.7232359247303264

We Fail to Reject Null Hypothesis


As you can see, the p-Value for this case is 0.723, which is higher than our significance level. Hence, we fail to reject the null hypothesis: the data gives no evidence that the second exam was easier, so easier exams do not appear to explain the improvement among students who took the tutorial.
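
A useful way to understand the paired t-test is that it is a one-sample t-test on the per-student differences. The sketch below checks this equivalence with Scipy. The `before` and `after` arrays are synthetic stand-ins for score1 and score2, not the real data.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Synthetic paired scores (an assumption for illustration).
rng = np.random.default_rng(1)
before = rng.normal(loc=70.0, scale=10.0, size=100)
after = before + rng.normal(loc=2.0, scale=5.0, size=100)

# Left-sided paired test: the alternative is mean(before - after) < 0.
paired = ttest_rel(before, after, alternative='less')

# The same test expressed as a one-sample test on the differences.
on_differences = ttest_1samp(before - after, popmean=0.0, alternative='less')

print(paired.statistic, on_differences.statistic)
print(paired.pvalue, on_differences.pvalue)
```

Because only the differences matter, the normality condition for a paired test really concerns the differences rather than each score column on its own.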

# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='14' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">Two-Sample t-Test</p></div>

So far, we have covered cases where we make inferences about a single sample. In some cases, we instead want to compare two independent groups and check whether there is a significant difference between their means. For this purpose, we can use a two-sample t-test.

To properly conduct this test, we need to make sure that our data fulfills the following conditions:

* The two samples are independent
* The two samples are randomly selected from their populations
* The variable is continuous and normally distributed in both groups
* The result of the statistical test will be more robust if the two samples have the same size

The t-statistic measures the difference between the two sample means relative to the standard error of that difference. As our use case, we ask whether male graduates earn a higher average salary than female graduates. One sample contains the salaries of male graduates and the other contains the salaries of female graduates.

In [8]:
df_male = df[
    (df['gender'] == 'Male') & 
    (df['salary'].notnull())
].sample(n=500)[['name', 'gender','salary']]

df_female = df[
    (df['gender'] == 'Female') & 
    (df['salary'].notnull())
].sample(n=500)[['name', 'gender','salary']]

display(df_male.head())
display(df_female.head())

Unnamed: 0,name,gender,salary
1476,"Salazar, Rojan",Male,61384.549346
1026,"Revello, Kevin",Male,54631.11424
7794,"Watanabe, Maurice",Male,40894.898232
3366,"Lopez, Tyler",Male,40882.835353
3764,"Salmon, James",Male,49704.767288


Unnamed: 0,name,gender,salary
1155,"Salgado, Tanya",Female,22710.2407
2209,"Jara, Ayrika",Female,49391.375704
2143,"Dominguez, Megan",Female,28453.064409
4455,"Guzman, Shauntece",Female,34830.770997
6437,"Ramos, Nicole",Female,31731.983116


Next we can form our null and alternative hypotheses as follows:

* Null hypothesis: the mean salary of male graduates is equal to the mean salary of female graduates.
* Alternative hypothesis: the mean salary of male graduates is higher than the mean salary of female graduates.

Notice that because of the way we formulate the alternative hypothesis, we conduct a right-sided test.

Let’s set the significance level for this use case to 0.01. Same as before, we compute the test statistic and p-Value with a statistical library. We’re going to use Scipy for this test.

In [9]:
from scipy.stats import ttest_ind

test_statistic, p_value = ttest_ind(df_male['salary'], df_female['salary'], alternative='greater')

print(p_value)

if p_value < 0.01:
    print("\nWe Reject Null Hypothesis")
else:
    print("\nWe Fail to Reject Null Hypothesis")

9.173410720914889e-80

We Reject Null Hypothesis


As you can see from the result above, the resulting p-Value is very small. Even at a significance level of 0.01, our data provides very strong evidence that the mean salary of male graduates is higher than the mean salary of female graduates, so we reject the null hypothesis.
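
The statistic that `ttest_ind` reports with its default `equal_var=True` setting can be written out by hand as a pooled-variance t-statistic. The sketch below does so on synthetic salaries and compares the result with Scipy. The helper name `pooled_t` and the data are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def pooled_t(a, b):
    """Right-sided two-sample t-test with pooled variance (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Right-sided p-Value: probability of a statistic at least this large.
    p_value = stats.t.sf(t_stat, df=n1 + n2 - 2)
    return t_stat, p_value


rng = np.random.default_rng(7)
group_a = rng.normal(loc=48000, scale=9000, size=500)  # synthetic salaries
group_b = rng.normal(loc=45000, scale=9000, size=500)  # synthetic salaries

t_manual, p_manual = pooled_t(group_a, group_b)
reference = stats.ttest_ind(group_a, group_b, alternative='greater')
print(t_manual, reference.statistic)
print(p_manual, reference.pvalue)
```

If the two groups have clearly different variances, passing `equal_var=False` to `ttest_ind` switches to Welch's t-test, which does not pool the variances.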

# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='15' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">ANOVA</p></div>

You’ve seen previously that with a two-sample t-test, we can compare the means of two groups. Now the question is, what if we want to compare the means of more than two groups? In this case, we’ll use a different type of statistical test. We can use ANOVA.

There are 3 different types of ANOVA:

* One-way ANOVA, if we have just one independent variable.
* Two-way ANOVA, if we have two independent variables.
* N-way ANOVA, if we have n independent variables.

To properly conduct ANOVA, we need to fulfill the same requirements as for the two-sample t-test:

* The groups are independent
* The samples are randomly selected
* The variable is normally distributed within each group
* The result of the statistical test will be more robust if the group sizes are similar

As our use case, we ask whether the average salary of university graduates differs between study majors. This is very similar to the two-sample t-test use case. However, there we compared salary across gender, which consists of only two groups: male and female. Here we compare salary across study major, which has 6 groups.

This use case is an example of one-way ANOVA, since we have only one independent variable (study major). If we instead wanted to test whether the average salary differs in relation to both study major and gender, we would use two-way ANOVA, because we would then have two independent variables (study major and gender).

Let’s prepare our data first. We’re just going to take 200 samples from each group of study majors.

In [10]:
df_biology = df[
    (df['major'] == 'Biology') & 
    (df['salary'].notnull())
].sample(n=200)[['name','major','salary']]
display(df_biology.head())

df_economics = df[
    (df['major'] == 'Economics and Finance') & 
    (df['salary'].notnull())
].sample(n=200)[['name','major','salary']]
display(df_economics.head())

df_environmental = df[
    (df['major'] == 'Environmental Sciences') & 
    (df['salary'].notnull())
].sample(n=200)[['name','major','salary']]
display(df_environmental.head())

df_mathematics = df[
    (df['major'] == 'Mathematics and Statistics') & 
    (df['salary'].notnull())
].sample(n=200)[['name','major','salary']]
display(df_mathematics.head())

df_politics = df[
    (df['major'] == 'Political Science') & 
    (df['salary'].notnull())
].sample(n=200)[['name','major','salary']]
display(df_politics.head())

df_social = df[
    (df['major'] == 'Social Sciences') & 
    (df['salary'].notnull())
].sample(n=200)[['name','major','salary']]
display(df_social.head())

Unnamed: 0,name,major,salary
1621,"Jacket, Yihang",Biology,51828.657952
1677,"Sandoval, Joseph",Biology,46238.410946
3485,"Rockwell, Brandon",Biology,45942.998393
5349,"Nack, Reece",Biology,55656.780315
5943,"Ingram, Aubrey",Biology,38820.483955


Unnamed: 0,name,major,salary
2663,"Yellowhawk, Ezequiel",Economics and Finance,69348.606874
4154,"Blanco, Brandon",Economics and Finance,51866.925753
6689,"Tran, Andy",Economics and Finance,49528.290473
3226,"Bear, Jethro-Eli",Economics and Finance,50206.141872
849,"Johnson, Loagyn",Economics and Finance,41672.009748


Unnamed: 0,name,major,salary
81,"Munoz, Derrick",Environmental Sciences,33068.848645
3281,"Flores, Paris",Environmental Sciences,34077.585525
7112,"Reeves, Uriah",Environmental Sciences,40498.299634
2046,"Simon, Eric",Environmental Sciences,36094.973529
2670,"Starcer, Ji",Environmental Sciences,41598.149858


Unnamed: 0,name,major,salary
4064,"Mullin, Luis",Mathematics and Statistics,61432.208399
5239,"Kim, Joseph",Mathematics and Statistics,45861.410697
572,"Rojas Duarte, Gladys",Mathematics and Statistics,48508.072646
8142,"Nguyen, Eric",Mathematics and Statistics,44367.343815
2948,"Hernandez, Yannell",Mathematics and Statistics,45286.132451


Unnamed: 0,name,major,salary
5768,"Ramirez, Ashley",Political Science,34444.639948
7562,"Jackson, Colton",Political Science,58614.719538
140,"Brown, Dmitriy",Political Science,44027.967877
1007,"Gear, Kalah",Political Science,37782.864706
7624,"Demmer-White, Gilbert",Political Science,14081.098715


Unnamed: 0,name,major,salary
5366,"Montgomery, Justice",Social Sciences,32462.419758
3923,"Tat, Drew",Social Sciences,40944.508945
3448,"Proctor, Sophearath",Social Sciences,28125.031337
6245,"Gutierrez, Phanath",Social Sciences,36187.22169
357,"Miller, Jordan",Social Sciences,33108.645237


Now we can form our null and alternative hypotheses as follows:

* Null hypothesis: the average salary of university graduates is the same across all study majors.
* Alternative hypothesis: the average salary of university graduates differs for at least one study major.

Next, let’s set the significance level. For this use case, let’s use 0.05. Same as before, we’re going to use the Scipy library to conduct the ANOVA test and compute the resulting p-Value.

In [11]:
from scipy.stats import f_oneway

f_stats, p_value = f_oneway(
    df_biology['salary'], df_economics['salary'], df_environmental['salary'] , 
    df_mathematics['salary'], df_politics['salary'], df_social['salary']
)

print(p_value)

if p_value < 0.05:
    print("\nWe Reject Null Hypothesis")
else:
    print("\nWe Fail to Reject Null Hypothesis")

4.2388625448202354e-139

We Reject Null Hypothesis


As you can see, the resulting p-Value is very small in comparison with our significance level. Thus, our data provides strong evidence that there is a significant difference of average salary between university graduates in relation to study majors. In other words, we reject our null hypothesis.
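
To see where the F-statistic of one-way ANOVA comes from, the sketch below computes it by hand as the ratio of between-group to within-group variability and compares it with `f_oneway`. The helper name `one_way_f` and the three synthetic groups are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def one_way_f(*groups):
    """One-way ANOVA F-statistic and p-Value written out by hand (hypothetical helper)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_total = sum(g.size for g in groups)
    grand_mean = np.concatenate(groups).mean()
    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    p_value = stats.f.sf(f_stat, k - 1, n_total - k)
    return f_stat, p_value


rng = np.random.default_rng(3)
g1 = rng.normal(loc=45000, scale=8000, size=200)  # synthetic salaries
g2 = rng.normal(loc=47000, scale=8000, size=200)
g3 = rng.normal(loc=52000, scale=8000, size=200)

f_manual, p_manual = one_way_f(g1, g2, g3)
f_scipy, p_scipy = stats.f_oneway(g1, g2, g3)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

A significant result only tells us that at least one group mean differs. It does not say which one, so a follow-up comparison between pairs of groups is usually needed.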

# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='16' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">Chi-Square Goodness-of-Fit (GoF)</p></div>

So far, we have seen statistical tests that are relevant for continuous data. You might ask: what if my variable is discrete or categorical? In that case we need a different kind of statistical test, and the Chi-Square Goodness-of-Fit (GoF) test is one of them.

The basic idea of Chi-Square GoF is to compare the observed frequencies of a sample with its expected frequencies. In order to conduct this test, we need to make sure that our data fulfills the following prerequisites:

* The data should be categorical and not continuous
* The sample is large enough in each category. As a rule of thumb, the expected frequency in each category should be at least 5
* The sample is randomly selected

The general steps for conducting a Chi-Square GoF test are similar to what we’ve seen previously: compute the test statistic, then use the resulting p-Value to decide whether or not to reject the null hypothesis.

To conduct a Chi-Square GoF test, we need three additional quantities: observed frequency, relative frequency, and expected frequency.

Observed frequency is just the number of samples in each group, as you can see below:

In [12]:
df_sample = df.sample(n=500)

df_obs_freq = pd.DataFrame({'observed_freq' : df_sample.groupby(['religion']).size()}).reset_index()

display(df_obs_freq.head())

Unnamed: 0,religion,observed_freq
0,Catholic,175
1,Muslim,10
2,Orthodox,36
3,Other,168
4,Protestant,111


Meanwhile, relative frequency is the proportion of each group in the population. For the sake of this article, let’s say that the population of Europe is 48% Catholic, 2% Muslim, 8% Orthodox, 12% Protestant, and 30% other religions.

Based on the information above, we can set the relative frequency as below:

In [13]:
religion = ['Catholic', 'Muslim', 'Orthodox', 'Other', 'Protestant']
rel_freq = [0.48, 0.02, 0.08, 0.30, 0.12]

df_rel_freq = pd.DataFrame({'religion' : religion, 'rel_freq': rel_freq})

display(df_rel_freq.head())

Unnamed: 0,religion,rel_freq
0,Catholic,0.48
1,Muslim,0.02
2,Orthodox,0.08
3,Other,0.3
4,Protestant,0.12


The expected frequency is the relative frequency multiplied by the total number of samples, which is 500 here:

In [14]:
df_exp_freq = df_obs_freq.merge(df_rel_freq, on='religion')
df_exp_freq['expected_freq'] = df_exp_freq['rel_freq'] * 500

display(df_exp_freq.head())

Unnamed: 0,religion,observed_freq,rel_freq,expected_freq
0,Catholic,175,0.48,240.0
1,Muslim,10,0.02,10.0
2,Orthodox,36,0.08,40.0
3,Other,168,0.3,150.0
4,Protestant,111,0.12,60.0


As you can see, now we have observed frequency and expected frequency in each group and we can use these values to conduct the Chi-Square GoF test.

But before that, as usual, we need to set our significance level, which will be 0.01 for this case, and then form our hypotheses as follows:

* Null hypothesis: The religion distribution among students is the same as the religion distribution among European adults.
* Alternative hypothesis: The religion distribution among students is different from the religion distribution among European adults.

Now let’s use the Scipy library in Python to conduct this test.

In [15]:
from scipy.stats import chisquare

test_statistic, p_value = chisquare(df_exp_freq['observed_freq'], df_exp_freq['expected_freq'])

print(p_value)

if p_value < 0.01:
    print("\nWe Reject Null Hypothesis")
else:
    print("\nWe Fail to Reject Null Hypothesis")

5.289068266556468e-13

We Reject Null Hypothesis


The resulting p-Value is very small, indicating that even at significance level of 0.01, our data provides strong evidence that the religion distribution among students is different compared to the religion distribution among European adults. Hence, we reject the null hypothesis.
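
The statistic behind this p-Value is the sum of (observed - expected)² / expected over all categories. As a rough check, the sketch below recomputes it from the observed and expected frequencies shown in the table above. The variable names are our own.

```python
from scipy.stats import chi2

# Frequencies copied from the table above, in the order:
# Catholic, Muslim, Orthodox, Other, Protestant.
observed = [175, 10, 36, 168, 111]
expected = [240.0, 10.0, 40.0, 150.0, 60.0]

# Chi-Square statistic: squared deviations scaled by the expected counts.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of categories minus one.
dof = len(observed) - 1
p_value = chi2.sf(chi_sq, dof)

print(f"chi-square = {chi_sq:.2f}, dof = {dof}, p-Value = {p_value:.3g}")
```

Most of the statistic comes from the Catholic and Protestant categories, where the observed counts deviate most from the expected counts.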

# <div style="color:white;display:fill;border-radius:5px;background-color:#00B1D2FF;letter-spacing:0.1px;overflow:hidden"><p id='17' style="padding:15px;color:white;overflow:hidden;margin:0;font-size:100%;">Chi-Square Independence Test</p></div>

Although the name of this statistical test is similar to the previous one, the Chi-Square independence test has a different purpose compared to Chi-Square GoF. We use the independence test when we want to check whether there is an association between two discrete or categorical variables.

To properly conduct this statistical test, we need to make sure that our data fulfills the following conditions:

* The two variables should be discrete or categorical variables
* The sample size in each combination of categories should be large enough. As a rule of thumb, the expected frequency in each cell of the contingency table should be at least 5
* The samples from both categories are randomly selected.

The most common practice when we’re dealing with Chi-Square independence test is creating a contingency table of our categorical variables, as you can see below:

In [16]:
df_major = pd.crosstab(df['major'], df['gender'])
df_major

gender,Female,Male
major,Unnamed: 1_level_1,Unnamed: 2_level_1
Biology,959,638
Economics and Finance,461,863
Environmental Sciences,745,881
Mathematics and Statistics,276,949
Political Science,978,477
Social Sciences,691,321


Now that we have a contingency table as above, we are ready to conduct the Chi-Square independence test.

But before that, we should set our significance level, which in our case will be 0.05. Next, we formulate our null hypothesis and alternative hypothesis as follows:

* Null hypothesis: there is no association between gender and study major of the students.
* Alternative hypothesis: there is an association between gender and study major of the students.

Now we can compute the resulting p-Value with Scipy library from Python as follows:

In [17]:
from scipy.stats import chi2_contingency

# chi2_contingency also returns the degrees of freedom and the expected frequencies
test_statistic, p_value, dof, expected = chi2_contingency(df_major)

print(p_value)

if p_value < 0.05:
    print("\nWe Reject Null Hypothesis")
else:
    print("\nWe Fail to Reject Null Hypothesis")

5.501737149286338e-187

We Reject Null Hypothesis


As you can see, the resulting p-Value is very small. This means that at 5% significance level, our data provides strong evidence that there is an association between gender and study major. Hence, we reject the null hypothesis.
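
To see what the independence test does under the hood, the sketch below rebuilds the expected frequencies from the row and column totals of the contingency table shown above and compares the result with `chi2_contingency`. The variable names are our own, and the table is typed in by hand from the output above.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Contingency table from above: rows are study majors, columns are Female, Male.
table = np.array([
    [959, 638],   # Biology
    [461, 863],   # Economics and Finance
    [745, 881],   # Environmental Sciences
    [276, 949],   # Mathematics and Statistics
    [978, 477],   # Political Science
    [691, 321],   # Social Sciences
])

# Under independence, expected count = row total * column total / grand total.
row_totals = table.sum(axis=1, keepdims=True)
col_totals = table.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / table.sum()

chi_sq = ((table - expected) ** 2 / expected).sum()
dof = (table.shape[0] - 1) * (table.shape[1] - 1)  # (6 - 1) * (2 - 1) = 5
p_value = chi2.sf(chi_sq, dof)

stat_ref, p_ref, dof_ref, expected_ref = chi2_contingency(table)
print(chi_sq, stat_ref)
print(p_value, p_ref)
```

Comparing `table` with `expected` cell by cell also shows where the association comes from, for example which majors have more female or male students than independence would predict.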