In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import plotly.express as px
%matplotlib inline

We use a two-sample t-test for equality of means when the population standard deviations are unknown and not assumed to be equal. This test is also known as Welch's t-test. It is useful for comparing the means of two independent samples whose population variances are unknown and may differ.
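For reference, Welch's test statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula) can be written as follows. The symbols $\bar{x}_i$, $s_i^2$ and $n_i$ (sample mean, sample variance and size of group $i$) and $\nu$ (degrees of freedom) are notation introduced here for illustration; they are not used elsewhere in the notebook.

>$t=\dfrac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}}$

>$\nu\approx\dfrac{\left(\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1}+\dfrac{(s_2^2/n_2)^2}{n_2-1}}$

The value of $\nu$ always lies between $\min(n_1-1,\,n_2-1)$ and $n_1+n_2-2$.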

**Assumptions for the two-sample t-test**

* The data should be continuous.
* Each sample should be approximately normally distributed.
* Each sample should be drawn at random, and the two samples should be independent of each other.

**Example 1:** A researcher wants to compare the average heights of male and female participants in a study.



Testing the null hypothesis

>$H_0:\mu_{male}=\mu_{female}$

against the alternate hypothesis

>$H_1:\mu_{male}\neq\mu_{female}$

In [2]:
#check for the assumptions
male_heights = [175, 180, 178, 182, 176, 179, 181]
female_heights = [160, 162, 165, 161, 158, 159, 164, 163]
#check for normality
norm_male=stats.shapiro(male_heights)
norm_female=stats.shapiro(female_heights)
print(norm_male)
print(norm_female)

ShapiroResult(statistic=0.9621087312698364, pvalue=0.836609959602356)
ShapiroResult(statistic=0.974858283996582, pvalue=0.933165431022644)


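Both Shapiro-Wilk p-values are greater than 0.05, so we do not reject normality for either group. With samples this small the test has little power, so this is weak evidence rather than proof that the data are normal.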
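Because a normality test on seven or eight observations is not very sensitive, a second look can help. The sketch below uses `scipy.stats.probplot` to compute the correlation between the ordered data and the normal quantiles (a value close to 1 is consistent with normality). The dictionary name `qq_r` is made up for this illustration, and the data are copied from the cell above.

```python
from scipy import stats

# Data copied from Example 1
male_heights = [175, 180, 178, 182, 176, 179, 181]
female_heights = [160, 162, 165, 161, 158, 159, 164, 163]

# probplot returns the Q-Q points and, with fit=True (the default),
# the least-squares line through them as (slope, intercept, r)
qq_r = {}
for name, sample in [("male", male_heights), ("female", female_heights)]:
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sample, dist="norm")
    qq_r[name] = r
    print(name, "Q-Q correlation:", round(r, 3))
```

Passing `plot=plt` to `probplot` would also draw the Q-Q plot with the matplotlib import from the first cell.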

In [3]:
#testing the hypothesis using the p-value approach (equal_var=False selects Welch's test)
t_stat,p_value=stats.ttest_ind(male_heights,female_heights,equal_var=False)
p_value

9.550691407835501e-09

Since the p-value is less than alpha = 0.05, we reject the null hypothesis.
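To see what `ttest_ind(..., equal_var=False)` computes, here is a minimal sketch that reproduces Welch's statistic using only the standard library. The function name `welch_t` is hypothetical and the data are copied from Example 1; the degrees of freedom use the Welch-Satterthwaite approximation.

```python
import math
import statistics

# Data copied from Example 1
male_heights = [175, 180, 178, 182, 176, 179, 181]
female_heights = [160, 162, 165, 161, 158, 159, 164, 163]

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # statistics.variance uses the sample (n - 1) denominator
    v_a = statistics.variance(a) / n_a
    v_b = statistics.variance(b) / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(v_a + v_b)
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df

t_manual, df_welch = welch_t(male_heights, female_heights)
print(t_manual, df_welch)
```

The statistic should agree with the `t_stat` returned by SciPy, while the degrees of freedom come out as a non-integer value between 12 and 13.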

In [5]:
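#checking the hypothesis using critical value approach
#note: n1 + n2 - 2 is the pooled (equal-variance) df; the Welch-Satterthwaite df
#is never larger, so this critical value is a lower bound on the Welch one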
alpha=0.05
df=len(male_heights)+len(female_heights)-2
critical_value=stats.t.ppf(1-alpha/2,df)
critical_value

2.1603686564610127

In [6]:
t_stat

13.246558009296443

Since the test statistic (13.25) is greater than the critical value (2.16), we reject the null hypothesis.
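The critical value above is based on the pooled degrees of freedom. As a hedged sketch, the comparison below computes the Welch-Satterthwaite degrees of freedom and the matching critical value next to the pooled one. The helper `welch_df` is a hypothetical name, and the data are copied from Example 1.

```python
import numpy as np
from scipy import stats

# Data copied from Example 1
male_heights = [175, 180, 178, 182, 176, 179, 181]
female_heights = [160, 162, 165, 161, 158, 159, 164, 163]

def welch_df(a, b):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    return (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

alpha = 0.05
df_pooled = len(male_heights) + len(female_heights) - 2
df_w = welch_df(male_heights, female_heights)

# stats.t.ppf accepts non-integer degrees of freedom
crit_pooled = stats.t.ppf(1 - alpha / 2, df_pooled)
crit_welch = stats.t.ppf(1 - alpha / 2, df_w)
print("pooled:", df_pooled, crit_pooled)
print("welch: ", df_w, crit_welch)
```

Here the two sample variances are similar, so the two critical values are close and the decision does not change; with very unequal variances or sample sizes the gap can be larger.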

**Example 2:** A school district wants to compare the average test scores of students from two different schools to see if they are significantly different.

Testing the null hypothesis

>$H_0:\mu_{schoolA}=\mu_{schoolB}$

against the alternate hypothesis

>$H_1:\mu_{schoolA}\neq\mu_{schoolB}$

In [8]:
#checking for normality
school_A = [78, 85, 82, 80, 77, 79, 83]
school_B = [70, 75, 72, 73, 74, 71]
norm_A = stats.shapiro(school_A)
norm_B = stats.shapiro(school_B)
print(norm_A)
print(norm_B)

ShapiroResult(statistic=0.9641615748405457, pvalue=0.8535423278808594)
ShapiroResult(statistic=0.9818894863128662, pvalue=0.9605551362037659)


Both Shapiro-Wilk p-values are greater than 0.05, so we do not reject normality for either group.

In [9]:
#checking the hypothesis using p value approach
t_stat, p_value = stats.ttest_ind(school_A, school_B, equal_var=False)
p_value

0.00010385282032136216

Since the p-value is less than alpha = 0.05, we reject the null hypothesis.

In [10]:
#checking the hypothesis using critical value approach
#note: df below is the pooled n1 + n2 - 2; the Welch-Satterthwaite df is never larger
alpha = 0.05
df = (len(school_A) - 1) + (len(school_B) - 1)
critical = stats.t.ppf(1 - alpha / 2, df)
critical

2.200985160082949

In [11]:
t_stat

6.071993487197258

Since the test statistic (6.07) is greater than the critical value (2.20), we reject the null hypothesis.

**Example 3:** A financial analyst wants to determine if there is a significant difference in average monthly expenses between two cities.

Testing the null hypothesis 

>$H_0:\mu_{cityA}=\mu_{cityB}$

against the alternate hypothesis

>$H_1:\mu_{cityA}\neq\mu_{cityB}$

In [12]:
#check for normality
city_A= [1200, 1150, 1300, 1250, 1190]
city_B= [1500, 1550, 1480, 1600, 1570, 1550]
norm_A = stats.shapiro(city_A)
norm_B = stats.shapiro(city_B)
print(norm_A)
print(norm_B)

ShapiroResult(statistic=0.9670993089675903, pvalue=0.8563304543495178)
ShapiroResult(statistic=0.9482108354568481, pvalue=0.7257717847824097)


Both Shapiro-Wilk p-values are greater than 0.05, so we do not reject normality for either group.

In [13]:
#checking the hypothesis using p value approach
t_stat, p_value = stats.ttest_ind(city_A, city_B, equal_var=False)
p_value


1.2117628977092355e-05

Since the p-value is less than alpha = 0.05, we reject the null hypothesis.

In [14]:
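#testing the hypothesis using critical value approach (alpha = 0.05 from the earlier cell)
#note: df below is the pooled n1 + n2 - 2; the Welch-Satterthwaite df is never larger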
df = (len(city_A) - 1) + (len(city_B) - 1)
critical = stats.t.ppf(1 - alpha / 2, df)
critical

2.2621571627409915

In [15]:
t_stat

-10.217656744902936

Since the absolute value of the test statistic (10.22) is greater than the critical value (2.26), we reject the null hypothesis.

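**Example 4:** A marketing firm wants to analyze if there is a difference in the average time spent on social media between two age groups.

Testing the null hypothesis

>$H_0:\mu_{AgeGroup1}=\mu_{AgeGroup2}$

against the alternate hypothesis

>$H_1:\mu_{AgeGroup1}\neq\mu_{AgeGroup2}$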

In [16]:
# Sample time spent on social media (in hours per week)
age_group_1_time = [10, 12, 11, 13, 14, 15]
age_group_2_time = [8, 9, 7, 10, 9, 11, 8]
norm_1 = stats.shapiro(age_group_1_time)
norm_2 = stats.shapiro(age_group_2_time)
print(norm_1)
print(norm_2)

ShapiroResult(statistic=0.9818894863128662, pvalue=0.9605551362037659)
ShapiroResult(statistic=0.9666421413421631, pvalue=0.8732700943946838)


Both Shapiro-Wilk p-values are greater than 0.05, so we do not reject normality for either group.

In [17]:
#checking the hypothesis using p value approach
t_stat, p_value = stats.ttest_ind(age_group_1_time, age_group_2_time, equal_var=False)
p_value

0.0032899923786475173

Since the p-value is less than alpha = 0.05, we reject the null hypothesis.

In [18]:
#checking the hypothesis using critical value approach
#note: df below is the pooled n1 + n2 - 2; the Welch-Satterthwaite df is never larger
alpha=0.05
df = (len(age_group_1_time) - 1) + (len(age_group_2_time) - 1)
critical = stats.t.ppf(1 - alpha / 2, df)
critical

2.200985160082949

In [19]:
t_stat

3.9703446152237665

Since the test statistic (3.97) is greater than the critical value (2.20), we reject the null hypothesis.

**Example 5:** A nutritionist wants to compare the average daily caloric intake of two different dietary groups.

Testing the null hypothesis

>$H_0:\mu_{DietA}=\mu_{DietB}$

against the alternate hypothesis

>$H_1:\mu_{DietA}>\mu_{DietB}$

In [22]:
#check for normality
diet_A=[2200, 2300, 2150, 2400, 2250]
diet_B=[2000, 2100, 1900, 2050, 1950, 1980]
p_norm_A = stats.shapiro(diet_A)
p_norm_B = stats.shapiro(diet_B)
print(p_norm_A)
print(p_norm_B)

ShapiroResult(statistic=0.9787160754203796, pvalue=0.9276362061500549)
ShapiroResult(statistic=0.9907985925674438, pvalue=0.9910317063331604)


Both Shapiro-Wilk p-values are greater than 0.05, so we do not reject normality for either group.

In [23]:
#testing the hypothesis using the p-value approach
#note: without the alternative argument, ttest_ind returns a two-sided p-value
t_stat, p_value = stats.ttest_ind(diet_A, diet_B, equal_var=False)
p_value

0.0012849270239915403

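The p-value shown is two-sided. The alternative here is one-sided ($\mu_{DietA}>\mu_{DietB}$) and the sample mean of diet A is the larger one, so the one-sided p-value is half of it (about 0.00064). This is less than alpha = 0.05, so we reject the null hypothesis.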
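For a one-sided alternative, `scipy.stats.ttest_ind` can report the one-sided p-value directly through its `alternative` argument (available in SciPy 1.6 and later). The sketch below is an illustration on the same data, with `two_sided` and `one_sided` as made-up variable names.

```python
from scipy import stats

# Data copied from Example 5
diet_A = [2200, 2300, 2150, 2400, 2250]
diet_B = [2000, 2100, 1900, 2050, 1950, 1980]

# Default: two-sided alternative (mean of A != mean of B)
two_sided = stats.ttest_ind(diet_A, diet_B, equal_var=False)

# Right-tailed alternative: mean of the first sample > mean of the second
one_sided = stats.ttest_ind(diet_A, diet_B, equal_var=False, alternative="greater")

print(two_sided.pvalue, one_sided.pvalue)
```

When the test statistic points in the direction of the alternative, as it does here, the one-sided p-value is half of the two-sided one.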

In [24]:
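#checking the hypothesis using critical value approach
#right-tailed test, so the whole alpha (0.05, set earlier) goes in the upper tail
#note: df below is the pooled n1 + n2 - 2; the Welch-Satterthwaite df is never larger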
df = (len(diet_A) - 1) + (len(diet_B) - 1)
critical = stats.t.ppf(1 - alpha, df) 
critical

1.8331129326536333

In [25]:
t_stat

5.073074262636689

Since the test statistic (5.07) is greater than the right-tail critical value (1.83), we reject the null hypothesis.

**Example 6:** A sleep researcher wants to know if students who study late at night get less sleep than those who study during the day.

Testing the null hypothesis

>$H_0:\mu_{Nightstudy}=\mu_{Daystudy}$

against the alternate hypothesis

>$H_1:\mu_{Nightstudy}<\mu_{Daystudy}$

In [26]:
#checking for normality
night_study_sleep=[5, 6, 5.5, 4.5, 5]
day_study_sleep =[8, 7.5, 9, 8.5, 7, 9, 8]
norm_night = stats.shapiro(night_study_sleep)
norm_day = stats.shapiro(day_study_sleep)
print(norm_night)
print(norm_day)

ShapiroResult(statistic=0.9608590006828308, pvalue=0.8139519691467285)
ShapiroResult(statistic=0.9345840215682983, pvalue=0.5905229449272156)


Both Shapiro-Wilk p-values are greater than 0.05, so we do not reject normality for either group.

In [27]:
#checking the hypothesis using p value approach
#note: without the alternative argument, ttest_ind returns a two-sided p-value
t_stat, p_value = stats.ttest_ind(night_study_sleep, day_study_sleep, equal_var=False)
p_value

1.6811208604192438e-05

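The p-value shown is two-sided. The alternative here is one-sided ($\mu_{Nightstudy}<\mu_{Daystudy}$) and the night-study sample mean is the smaller one, so the one-sided p-value is half of it (about 8.4e-06). This is less than alpha = 0.05, so we reject the null hypothesis.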

In [28]:
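#checking the hypothesis using critical value
#left-tailed test: the rejection region is t_stat < -critical
#note: df below is the pooled n1 + n2 - 2; the Welch-Satterthwaite df is never larger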
df = (len(night_study_sleep) - 1) + (len(day_study_sleep) - 1)
critical = stats.t.ppf(1 - alpha, df)
critical

1.8124611228107335

In [29]:
t_stat

-7.730134998318713

For a left-tailed test we reject when the test statistic falls below the negative of the critical value. Since -7.73 is less than -1.81, we reject the null hypothesis.
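The left-tailed decision rule can also be written without taking absolute values, by asking `stats.t.ppf` for the lower-tail cutoff directly. This is a minimal sketch on the Example 6 data; the variable names `lower_critical` and `reject` are made up for the illustration, and the degrees of freedom follow the notebook's pooled formula.

```python
from scipy import stats

# Data copied from Example 6
night_study_sleep = [5, 6, 5.5, 4.5, 5]
day_study_sleep = [8, 7.5, 9, 8.5, 7, 9, 8]
alpha = 0.05

# One-sided p-value for H1: mean(night) < mean(day)
res = stats.ttest_ind(night_study_sleep, day_study_sleep,
                      equal_var=False, alternative="less")

# Same pooled df as the notebook cell; ppf(alpha) is the (negative) lower-tail cutoff
df = (len(night_study_sleep) - 1) + (len(day_study_sleep) - 1)
lower_critical = stats.t.ppf(alpha, df)

reject = bool(res.statistic < lower_critical)
print(res.statistic, lower_critical, res.pvalue, reject)
```

By symmetry of the t distribution, `stats.t.ppf(alpha, df)` equals `-stats.t.ppf(1 - alpha, df)`, so this matches the rule stated above.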

**Example 7:** A fitness expert wants to compare the average weight loss after following two different diet plans for a month.

Testing the null hypothesis

>$H_0:\mu_{Dietplan1}=\mu_{Dietplan2}$

against the alternate hypothesis

>$H_1:\mu_{Dietplan1}<\mu_{Dietplan2}$

In [30]:
#check for normality
diet_plan_1_weight_loss =[2, 3, 1.5, 2.5, 3]
diet_plan_2_weight_loss=[5, 4.5, 6, 5.5, 5, 6.5]
norm_1 = stats.shapiro(diet_plan_1_weight_loss)
norm_2 = stats.shapiro(diet_plan_2_weight_loss)
print(norm_1)
print(norm_2)

ShapiroResult(statistic=0.9020196199417114, pvalue=0.4211485683917999)
ShapiroResult(statistic=0.958012044429779, pvalue=0.8042958378791809)


Both Shapiro-Wilk p-values are greater than 0.05, so we do not reject normality for either group.

In [31]:
#checking the hypothesis using p value approach
#note: without the alternative argument, ttest_ind returns a two-sided p-value
t_stat, p_value = stats.ttest_ind(diet_plan_1_weight_loss, diet_plan_2_weight_loss, equal_var=False)
p_value

5.228919042146084e-05

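The p-value shown is two-sided. The alternative here is one-sided ($\mu_{Dietplan1}<\mu_{Dietplan2}$) and the sample mean for diet plan 1 is the smaller one, so the one-sided p-value is half of it (about 2.6e-05). This is less than alpha = 0.05, so we reject the null hypothesis.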

In [33]:
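#checking the hypothesis using critical value approach
#left-tailed test: the rejection region is t_stat < -critical
#note: df below is the pooled n1 + n2 - 2; the Welch-Satterthwaite df is never larger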
df = (len(diet_plan_1_weight_loss) - 1) + (len(diet_plan_2_weight_loss) - 1)
critical = stats.t.ppf(1 - alpha, df)
critical

1.8331129326536333

In [34]:
t_stat

-7.2054962293552

For a left-tailed test we reject when the test statistic falls below the negative of the critical value. Since -7.21 is less than -1.83, we reject the null hypothesis.

**Example 8:** A city planner wants to compare the average travel times between taking the bus and riding a bike.

Testing the null hypothesis

>$H_0:\mu_{Bus}=\mu_{Bike}$

against the alternate hypothesis

>$H_1:\mu_{Bus}>\mu_{Bike}$

In [35]:
#checking for normality
bus_travel_times =[30, 35, 32, 40, 28]
bike_travel_times = [20, 22, 25, 23, 24, 19]
norm_bus = stats.shapiro(bus_travel_times)
norm_bike = stats.shapiro(bike_travel_times)
print(norm_bus)
print(norm_bike)

ShapiroResult(statistic=0.9581565260887146, pvalue=0.795091450214386)
ShapiroResult(statistic=0.9575423002243042, pvalue=0.800613284111023)


Both Shapiro-Wilk p-values are greater than 0.05, so we do not reject normality for either group.

In [36]:
#checking the hypothesis using p value approach
#note: without the alternative argument, ttest_ind returns a two-sided p-value
t_stat, p_value = stats.ttest_ind(bus_travel_times, bike_travel_times, equal_var=False)
p_value

0.003941342920251392

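The p-value shown is two-sided. The alternative here is one-sided ($\mu_{Bus}>\mu_{Bike}$) and the bus sample mean is the larger one, so the one-sided p-value is half of it (about 0.0020). This is less than alpha = 0.05, so we reject the null hypothesis.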

In [37]:
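#checking the hypothesis using critical value
#right-tailed test, so the whole alpha goes in the upper tail
#note: df below is the pooled n1 + n2 - 2; the Welch-Satterthwaite df is never larger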
df = (len(bus_travel_times) - 1) + (len(bike_travel_times) - 1)
critical = stats.t.ppf(1 - alpha, df)
critical

1.8331129326536333

In [38]:
t_stat

4.708167536631482

Since the test statistic (4.71) is greater than the right-tail critical value (1.83), we reject the null hypothesis.
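The same steps repeat in every example, so they can be collected in one function. The sketch below is one possible way to do that, not part of the original notebook: `welch_summary` and its dictionary keys are hypothetical names, and it uses the Welch-Satterthwaite degrees of freedom for the critical value so that the p-value and critical-value decisions are based on the same distribution.

```python
import numpy as np
from scipy import stats

def welch_summary(a, b, alpha=0.05, alternative="two-sided"):
    """Hypothetical helper: Welch's test with p-value and critical-value decisions."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    res = stats.ttest_ind(a, b, equal_var=False, alternative=alternative)

    # Welch-Satterthwaite degrees of freedom
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

    t = res.statistic
    if alternative == "two-sided":
        reject_crit = abs(t) > stats.t.ppf(1 - alpha / 2, df)
    elif alternative == "greater":
        reject_crit = t > stats.t.ppf(1 - alpha, df)
    else:  # "less"
        reject_crit = t < stats.t.ppf(alpha, df)

    return {
        "t": float(t),
        "df": float(df),
        "p": float(res.pvalue),
        "reject_by_p": bool(res.pvalue < alpha),
        "reject_by_critical": bool(reject_crit),
    }

# Data copied from Example 8
bus_travel_times = [30, 35, 32, 40, 28]
bike_travel_times = [20, 22, 25, 23, 24, 19]
summary = welch_summary(bus_travel_times, bike_travel_times, alternative="greater")
print(summary)
```

Calling it with `alternative="two-sided"` or `alternative="less"` covers the other examples in the same way.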