<font color = 'orange'>

# Hypothesis Testing (Chi-square test of independence)

</font>

- to determine whether `Category` and `Platform` are independent of one another
- if `Category` and `Platform` are not independent, evaluate the strength of association
- further explore their relationship with residuals to find out whether students prefer to enrol in certain `Category` values on certain `Platform` values
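
The notebook refers to a null hypothesis further down without writing it out. In the usual form for a test of independence it could be stated as follows, where the symbols $p_{ij}$, $p_{i\cdot}$ and $p_{\cdot j}$ are notation introduced here for illustration: $p_{ij}$ is the probability that an enrolment falls in `Category` $i$ and `Platform` $j$, and $p_{i\cdot}$, $p_{\cdot j}$ are the corresponding marginal probabilities.

$$
H_0:\; p_{ij} = p_{i\cdot}\, p_{\cdot j} \ \text{ for all } i, j
\qquad
H_1:\; p_{ij} \neq p_{i\cdot}\, p_{\cdot j} \ \text{ for at least one pair } (i, j)
$$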

In [16]:
# import libraries
import pandas as pd
import numpy as np
import scipy.stats as stats

In [3]:
# load dataset into df
courses = pd.read_csv('data/online_courses_usage.csv')
courses.head()

Unnamed: 0,Course_ID,Course_Name,Category,Duration (hours),Enrolled_Students,Completion_Rate (%),Platform,Price ($),Rating (out of 5)
0,1,Course_1,Office Tools,21,4217,50.646827,Coursera,38.797425,4.811252
1,2,Course_2,Office Tools,57,4238,82.24024,edX,160.650991,3.829329
2,3,Course_3,Technology,52,2700,55.729028,LinkedIn Learning,123.503781,4.85195
3,4,Course_4,Office Tools,69,4308,58.664729,LinkedIn Learning,116.775704,3.913732
4,5,Course_5,Technology,43,4792,62.598147,Udemy,96.246696,4.921968


In [7]:
# retrieve shape of dataset
courses.shape

(10000, 9)

In [4]:
# retrieve descriptive stats
courses.describe(include='all')

Unnamed: 0,Course_ID,Course_Name,Category,Duration (hours),Enrolled_Students,Completion_Rate (%),Platform,Price ($),Rating (out of 5)
count,10000.0,10000,10000,10000.0,10000.0,10000.0,10000,10000.0,10000.0
unique,,10000,9,,,,4,,
top,,Course_1,Business,,,,Udemy,,
freq,,1,1148,,,,2554,,
mean,5000.5,,,55.144,2530.653,75.119729,,106.391332,3.994154
std,2886.89568,,,26.199242,1423.808243,14.462138,,55.100685,0.575502
min,1.0,,,10.0,101.0,50.008183,,10.037145,3.000026
25%,2500.75,,,32.0,1289.0,62.629516,,58.613731,3.49025
50%,5000.5,,,55.0,2532.0,75.156568,,108.042392,4.002789
75%,7500.25,,,78.0,3764.0,87.595268,,153.945558,4.483662


In [5]:
# check data types
courses.dtypes

Course_ID                int64
Course_Name             object
Category                object
Duration (hours)         int64
Enrolled_Students        int64
Completion_Rate (%)    float64
Platform                object
Price ($)              float64
Rating (out of 5)      float64
dtype: object

In [6]:
# check for null values
courses.isnull().sum()

Course_ID              0
Course_Name            0
Category               0
Duration (hours)       0
Enrolled_Students      0
Completion_Rate (%)    0
Platform               0
Price ($)              0
Rating (out of 5)      0
dtype: int64

<br><br>

## Create contingency table

In [8]:
# create the contingency table
contingency_table = pd.crosstab(
    courses['Category'], courses['Platform'],
    values=courses['Enrolled_Students'], aggfunc='sum', margins=False)
contingency_table

Category,Coursera,LinkedIn Learning,Udemy,edX
AI,678408,773871,679692,677898
Business,694513,711277,797389,645399
Data Science,620874,713685,722922,677325
Design,677732,641435,771699,692631
Finance,730345,651555,741540,768997
Marketing,705040,795666,748235,700797
Office Tools,764264,749713,746012,622724
Programming,685468,617923,641491,731717
Technology,655861,668475,689297,714660


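- created a contingency table in which each cell is the total number of enrolled students (sum of `Enrolled_Students`) for one `Category` on one `Platform`
- the cells are therefore enrolment totals rather than course counts, so the test below treats each enrolment as one observation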
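
To make the difference between summed enrolments and course counts concrete, here is a minimal sketch on a hypothetical toy frame (the five rows below are made up for illustration and are not taken from the dataset). It only assumes the same `pd.crosstab` arguments used in the cell above.

```python
import pandas as pd

# Hypothetical toy data, not rows from the real dataset
toy = pd.DataFrame({
    "Category": ["AI", "AI", "Design", "Design", "AI"],
    "Platform": ["Udemy", "edX", "Udemy", "edX", "Udemy"],
    "Enrolled_Students": [100, 50, 30, 70, 20],
})

# With values/aggfunc, each cell is the SUM of Enrolled_Students
# for that (Category, Platform) pair
summed = pd.crosstab(toy["Category"], toy["Platform"],
                     values=toy["Enrolled_Students"], aggfunc="sum")

# Without values/aggfunc, each cell is the NUMBER OF COURSES instead
counted = pd.crosstab(toy["Category"], toy["Platform"])

print(summed)
print(counted)
```

Which of the two tables is appropriate depends on what is taken as the unit of observation: an enrolment (as in this notebook) or a course.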

<br><br>

## Perform chi-square test of independence

In [37]:
# perform chi2 test of independence
result = stats.chi2_contingency(contingency_table)
print("chi2 statistic: ", round(result[0], 2))
print("p-value: ", result[1])

chi2 statistic:  89337.44
p-value:  0.0


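- chi2 statistic is large, meaning the observed frequencies differ substantially from the frequencies expected if the null hypothesis of independence were true
- p-value is printed as 0.0 because it is too small to represent as a float, so p-value < 0.05
- we reject the null hypothesis as there is strong evidence that the variables are not independent
- note that the table sums roughly 25 million enrolments, and with a sample this large even small departures from independence become statistically significant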
<br><br>
- we have established that `Platform` and `Category` are not independent
- next, we will measure the strength of their association with Cramer's V
- we will also explore their relationship further with standardised residuals to glean insights on enrolment preferences
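
As a rough illustration of what `stats.chi2_contingency` does with the table, the sketch below rebuilds the expected frequencies and the statistic by hand and compares them with scipy. The cell values are copied from the contingency table output above; the variable names (`observed`, `expected_manual`, and so on) are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np
import scipy.stats as stats

# Observed enrolment totals copied from the contingency table above
# rows: AI, Business, Data Science, Design, Finance, Marketing,
#       Office Tools, Programming, Technology
# columns: Coursera, LinkedIn Learning, Udemy, edX
observed = np.array([
    [678408, 773871, 679692, 677898],
    [694513, 711277, 797389, 645399],
    [620874, 713685, 722922, 677325],
    [677732, 641435, 771699, 692631],
    [730345, 651555, 741540, 768997],
    [705040, 795666, 748235, 700797],
    [764264, 749713, 746012, 622724],
    [685468, 617923, 641491, 731717],
    [655861, 668475, 689297, 714660],
], dtype=np.float64)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (9, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 4)
n = observed.sum()                                # grand total

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / n

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cross-check against scipy
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)
print(chi2_manual, chi2_scipy)
print(dof_manual, dof_scipy)
```

The fourth element returned by `stats.chi2_contingency` is the same expected-frequency array, which is what the residuals section later relies on.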

<br><br>

## Check strength of association with Cramer's V

In [33]:
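# define cramer's v function
def cramers_v(dataset):
    # compute chi2 from the table passed in, not from the global `result`
    x2 = stats.chi2_contingency(dataset)[0]
    # total number of observations in the table
    n = dataset.sum().sum()
    r, k = dataset.shape
    min_dim = min(k - 1, r - 1)
    return np.sqrt((x2 / n) / min_dim)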

In [34]:
# perform cramer's v on contingency_table
cramers_v(contingency_table)

0.034303610607190754

- Cramer's V is very low (about 0.03 on a scale from 0 to 1), indicating that the association between `Category` and `Platform` is very weak
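
The value can be checked by hand from the numbers already shown in the notebook. Here $\chi^2 = 89337.44$ is the reported statistic, $n = 25{,}306{,}530$ is the sum of all cells in the contingency table, and $r = 9$, $k = 4$ are its row and column counts, so $\min(r-1, k-1) = 3$:

$$
V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1,\, k-1)}}
  = \sqrt{\frac{89337.44}{25{,}306{,}530 \times 3}}
  \approx 0.0343
$$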

<br><br>

## Explore relationship using standardised residuals

In [35]:
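# get expected frequencies under independence from the chi2 result
expected = result[3]
# get standardised (Pearson) residuals: (observed - expected) / sqrt(expected)
residuals = (contingency_table - expected) / np.sqrt(expected)
residuals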

Category,Coursera,LinkedIn Learning,Udemy,edX
AI,-13.710721,85.616056,-54.310853,-16.924084
Business,-5.722004,-0.623637,71.595983,-66.992159
Data Science,-61.625429,36.666898,19.449293,4.672036
Design,-6.761437,-64.875655,61.962152,8.634966
Finance,24.066921,-83.75848,-6.662371,67.165865
Marketing,-22.435572,68.238361,-15.886696,-30.064821
Office Tools,67.265584,34.616883,1.419379,-103.483262
Programming,35.022266,-62.24601,-60.178741,89.373066
Technology,-16.994477,-16.07321,-18.57316,52.182242


- notable examples of students' preferences (large positive residuals, i.e. more enrolments than expected under independence)
    - LinkedIn Learning preferred for AI and Marketing
    - Udemy preferred for Business and Design
    - Coursera preferred for Office Tools and Programming
    - edX preferred for Programming and Finance
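
Reading the largest residuals off the table by eye is error-prone, so one option is to rank them programmatically. The sketch below is a minimal, self-contained version that rebuilds the residuals from the contingency table values shown earlier; the names `observed` and `top_two` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

categories = ["AI", "Business", "Data Science", "Design", "Finance",
              "Marketing", "Office Tools", "Programming", "Technology"]
platforms = ["Coursera", "LinkedIn Learning", "Udemy", "edX"]

# Observed enrolment totals copied from the contingency table above
observed = pd.DataFrame([
    [678408, 773871, 679692, 677898],
    [694513, 711277, 797389, 645399],
    [620874, 713685, 722922, 677325],
    [677732, 641435, 771699, 692631],
    [730345, 651555, 741540, 768997],
    [705040, 795666, 748235, 700797],
    [764264, 749713, 746012, 622724],
    [685468, 617923, 641491, 731717],
    [655861, 668475, 689297, 714660],
], index=categories, columns=platforms, dtype=float)

# Expected counts under independence: row total * column total / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.to_numpy().sum()

# Residuals: (observed - expected) / sqrt(expected)
residuals = (observed - expected) / np.sqrt(expected)

# For each platform, the two categories with the largest positive residuals
top_two = {p: residuals[p].nlargest(2).index.tolist() for p in platforms}
for platform, cats in top_two.items():
    print(platform, cats)
```

Swapping `nlargest` for `nsmallest` would list the categories that are most under-represented on each platform instead.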

<br><br>

<font color = 'orange'>

## Summary

</font>

- although the overall association between `Category` and `Platform` is weak, the statistically significant chi-square test and residual analysis highlight specific student preferences
- for example, students show a notable preference for AI and Marketing courses on LinkedIn Learning
- as another example, students show a notable preference for Programming and Finance courses on edX

<br><br><br><br>

# Acknowledgements
- data provided via the [Online Courses Usage and History Dataset](https://www.kaggle.com/datasets/mitul1999/online-courses-usage-and-history-dataset) on Kaggle