In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st

1. An F&B manager wants to determine whether there is any significant difference in the diameter of the cutlet between two units. A randomly selected sample of cutlets was collected from both units and measured. Analyze the data and draw inferences at the 5% significance level. Please state the assumptions and the tests that you carried out to check the validity of the assumptions.

In [2]:
## Based on the problem statement above, there are two independent samples to compare.
## We use a two-sample t-test to check whether the difference between the two sample means is statistically significant.

## Read the file once and pull out both columns as NumPy arrays
cutlets = pd.read_csv('../Cutlets.csv')
unit_a = cutlets['Unit A'].to_numpy()
unit_b = cutlets['Unit B'].to_numpy()

## First we use Shapiro-Wilk test to check normality 
_ , p_value_a = st.shapiro(unit_a)
_ , p_value_b = st.shapiro(unit_b)

if p_value_a > 0.05 and p_value_b > 0.05:
    print('Normality assumption is met.')
else:
    print('Normality assumption is not met.')

## Second, we use Levene's test to check homogeneity of variance

_ , p_value_levene = st.levene(unit_a, unit_b)

## alpha is 0.05 

if p_value_levene > 0.05:
    print('Homogeneity of variance assumption is met.')
else:
    print('Homogeneity of variance assumption is not met.')

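## Third, we run the two-sample t-test (equal variances assumed, which is the scipy default).
## H0: the mean diameters of Unit A and Unit B are equal. H1: the mean diameters differ.
t_stat, p_value = st.ttest_ind(unit_a, unit_b)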
print(f't statistic value: {t_stat}')
print(f'p value: {p_value}')
if p_value > 0.05:
    print("Fail to reject the null hypothesis. There is no significant difference in cutlet diameters.")
else:
    print("Reject the null hypothesis. There is a significant difference in cutlet diameters.")


Normality assumption is met.
Homogeneity of variance assumption is met.
t statistic value: 0.7228688704678061
p value: 0.47223947245995
Fail to reject the null hypothesis. There is no significant difference in cutlet diameters.
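
The cell above runs the pooled-variance t-test regardless of what the assumption checks print. As a rough sketch, the checks could instead drive the choice of test: Welch's t-test (`equal_var=False`) when Levene's test suggests unequal variances, and the Mann-Whitney U test when Shapiro-Wilk casts doubt on normality. The helper name `two_sample_test` and the synthetic arrays are assumptions made for illustration; they are not part of the original analysis or of the Cutlets data.

```python
import numpy as np
import scipy.stats as st


def two_sample_test(a, b, alpha=0.05):
    """Choose a two-sample test from the assumption checks (hypothetical helper)."""
    _, p_norm_a = st.shapiro(a)
    _, p_norm_b = st.shapiro(b)
    _, p_var = st.levene(a, b)

    if p_norm_a <= alpha or p_norm_b <= alpha:
        # Normality is doubtful for at least one sample: use a rank-based test.
        name = "Mann-Whitney U"
        stat, p = st.mannwhitneyu(a, b, alternative="two-sided")
    elif p_var <= alpha:
        # Normal but unequal variances: Welch's t-test does not pool variances.
        name = "Welch t-test"
        stat, p = st.ttest_ind(a, b, equal_var=False)
    else:
        # Both assumptions look reasonable: pooled-variance t-test.
        name = "Student t-test"
        stat, p = st.ttest_ind(a, b, equal_var=True)
    return name, float(stat), float(p)


# Synthetic, strongly skewed samples (NOT the cutlet measurements).
rng = np.random.default_rng(0)
skewed_a = rng.exponential(scale=1.0, size=200)
skewed_b = rng.exponential(scale=1.0, size=200)

name, stat, p = two_sample_test(skewed_a, skewed_b)
print(f"{name}: statistic={stat:.3f}, p={p:.3f}")
```

For the cutlet data both assumption checks passed, so a helper like this would fall through to the same pooled-variance t-test used above.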


2. A hospital wants to determine whether there is any difference in the average Turn Around Time (TAT) of reports of the laboratories on their preferred list. They collected a random sample and recorded TAT for reports of 4 laboratories. TAT is defined as sample collected to report dispatch.
   Analyze the data and determine whether there is any difference in average TAT among the different laboratories at 5% significance level.


In [21]:
## From the problem statement above, we have to compare 4 different samples (one from each of the 4 laboratories).
## One-way ANOVA is a parametric test used to compare the means of three or more groups.

## Read the file once and pull out each laboratory's column as a NumPy array
lab_tat = pd.read_csv('../LabTAT.csv')
lab_1 = lab_tat['Laboratory 1'].to_numpy()
lab_2 = lab_tat['Laboratory 2'].to_numpy()
lab_3 = lab_tat['Laboratory 3'].to_numpy()
lab_4 = lab_tat['Laboratory 4'].to_numpy()

# Shapiro-Wilk Test for Normality
_, p1 = st.shapiro(lab_1)
_, p2 = st.shapiro(lab_2)
_, p3 = st.shapiro(lab_3)
_, p4 = st.shapiro(lab_4)

# Levene's Test for Homogeneity of Variances
_, p_homogeneity = st.levene(lab_1, lab_2, lab_3, lab_4)

# Check the assumptions
if all(p > 0.05 for p in [p1, p2, p3, p4]):
    print("\nNormality assumption is met for all groups.")
else:
    print("\nNormality assumption is violated for one or more groups.")

if p_homogeneity > 0.05:
    print("Homogeneity of variances assumption is met.")
else:
    print("Homogeneity of variances assumption is violated.")

## Performing One-Way ANOVA test 
f_statistic, p_value = st.f_oneway(lab_1, lab_2, lab_3, lab_4)

print(f'f statistics: {f_statistic}')
print(f'p value: {p_value}')

# Significance level
alpha = 0.05

# Make a decision
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in average TAT among laboratories.")
else:
    print("Fail to reject the null hypothesis. No significant difference in average TAT among laboratories.")

## If the null hypothesis is rejected, post hoc tests can identify the specific differences.
## The omnibus test only tells us that at least two group means differ significantly.
## It does not tell us which groups differ. Post hoc tests answer that question.

print('\n')
print("Perform post hoc test using Tukey's")
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine the observations from all four laboratories into a single list
data = list(lab_1) + list(lab_2) + list(lab_3) + list(lab_4)

# Group labels (laboratory names)
groups = ['Lab1'] * len(lab_1) + ['Lab2'] * len(lab_2) + ['Lab3'] * len(lab_3) + ['Lab4'] * len(lab_4)

# p_value and alpha are reused from the one-way ANOVA above.
# If the ANOVA test is significant (p < alpha), perform post hoc Tukey's HSD
if p_value < alpha:
    posthoc = pairwise_tukeyhsd(data, groups)
    print(posthoc.summary())



Normality assumption is met for all groups.
Homogeneity of variances assumption is met.
f statistics: 118.70421654401437
p value: 2.1156708949992414e-57
Reject the null hypothesis. There is a significant difference in average TAT among laboratories.


Perform post hoc test using Tukey's
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
  Lab1   Lab2   0.5413 0.9923  -4.4466   5.5293  False
  Lab1   Lab3  21.5517    0.0  16.5637  26.5396   True
  Lab1   Lab4 -14.6788    0.0 -19.6668  -9.6909   True
  Lab2   Lab3  21.0103    0.0  16.0224  25.9983   True
  Lab2   Lab4 -15.2202    0.0 -20.2081 -10.2322   True
  Lab3   Lab4 -36.2305    0.0 -41.2185 -31.2425   True
------------------------------------------------------
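
To read the Tukey table, its rows can be transcribed and filtered. The sketch below copies the `meandiff` and `reject` columns from the output above. It assumes `meandiff` is the mean of `group2` minus the mean of `group1`, which agrees with how the rows add up (for example, 21.5517 - 0.5413 is about 21.0104, close to the reported 21.0103 for Lab2 vs Lab3). The variable names are illustrative only.

```python
# Rows transcribed from the Tukey HSD output above:
# (group1, group2, meandiff, reject)
tukey_rows = [
    ("Lab1", "Lab2", 0.5413, False),
    ("Lab1", "Lab3", 21.5517, True),
    ("Lab1", "Lab4", -14.6788, True),
    ("Lab2", "Lab3", 21.0103, True),
    ("Lab2", "Lab4", -15.2202, True),
    ("Lab3", "Lab4", -36.2305, True),
]

# Split the pairs by whether the null hypothesis of equal means was rejected.
different = [(g1, g2) for g1, g2, _, reject in tukey_rows if reject]
similar = [(g1, g2) for g1, g2, _, reject in tukey_rows if not reject]

# Express each lab's mean TAT relative to Lab1,
# assuming meandiff = mean(group2) - mean(group1).
relative_to_lab1 = {"Lab1": 0.0}
for g1, g2, diff, _ in tukey_rows:
    if g1 == "Lab1":
        relative_to_lab1[g2] = diff

# Sort labs from lowest to highest mean TAT.
ranking = sorted(relative_to_lab1, key=relative_to_lab1.get)

print("Pairs with different mean TAT:", different)
print("Pairs not shown to differ:", similar)
print("Labs from lowest to highest mean TAT:", ranking)
```

Under that sign assumption, Laboratory 4 has the lowest mean TAT and Laboratory 3 the highest, while Laboratories 1 and 2 cannot be distinguished at the 5% level.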
