## Statistics - Assignment
### T-Distribution Hypothesis Testing - Questions 15 - 20

In [5]:
import numpy as np
from scipy.stats import t

In [1]:
# Function to calculate t-stats for two samples
def calc_t_statistics_two_samples(sample1, sample2):
    n1 = len(sample1)
    n2 = len(sample2)

    dof = n1+n2-2

    print(f"Sample size: n1 = {n1} and sample size: n2 = {n2}")

    x1_bar = np.mean(sample1)
    x2_bar = np.mean(sample2)

    print(f"Sample Mean: x1_bar = {round(x1_bar, 2)} and x2_bar = {round(x2_bar, 2)}, diff_x1x2 = {round(abs(x1_bar - x2_bar), 2)}")

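    # Note: np.std defaults to ddof=0 (population SD); the outputs shown in this
    # notebook were produced with that default. Pass ddof=1 for the sample SD.
    s1 = np.std(sample1)
    s2 = np.std(sample2)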

    print(f"Sample Standard Deviation: s1 = {round(s1, 2)} and s2 = {round(s2, 2)}")

    if (n1 == n2):
        print("Sample size is same")
        pooled_std = np.sqrt((s1 ** 2 + s2 ** 2)/2)
    else:
        print("Sample size is different")
        pooled_std = np.sqrt((((n1-1)*(s1 ** 2)) + ((n2-1)*(s2 ** 2)))/ dof)

    print("Pooled Standard Deviation: sp = ", round(pooled_std, 2))

    standard_error = pooled_std * np.sqrt(1/n1+1/n2)

    print("Standard Error : se = ", round(standard_error, 2))
    
    t_statistic = abs(x1_bar - x2_bar)/ standard_error
    print(f"t_statistic = {round(t_statistic, 2)}, dof = {dof}")

    return (t_statistic, dof)
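
The pooled-variance formula in the `else` branch reduces to the simple average of the two variances when `n1 == n2`, so the two branches agree for equal sample sizes. The choice of `ddof` is a separate matter: the textbook pooled estimate uses sample variances (`ddof=1`), whereas `np.std` defaults to `ddof=0`. The sketch below illustrates both points with the layout data from question 15; the helper name `pooled_std` is hypothetical and not part of the assignment code.

```python
import numpy as np

def pooled_std(sample1, sample2, ddof=1):
    """Pooled standard deviation from the general (unequal-n) formula."""
    n1, n2 = len(sample1), len(sample2)
    v1 = np.var(sample1, ddof=ddof)
    v2 = np.var(sample2, ddof=ddof)
    return np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))

a = [28, 32, 33, 29, 31, 34, 30, 35, 36, 37]
b = [40, 41, 38, 42, 39, 44, 43, 41, 45, 47]

sp_population = pooled_std(a, b, ddof=0)  # same variances as np.std's default
sp_sample = pooled_std(a, b, ddof=1)      # sample variances

# With equal sample sizes the general formula equals the simple average
sp_simple = np.sqrt((np.var(a) + np.var(b)) / 2)

print(round(sp_population, 2), round(sp_simple, 2), round(sp_sample, 2))
```

With only ten observations per group the two conventions give noticeably different pooled values, which carries through to the t-statistic.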


In [2]:
# Function to calculate p-value
def calculate_p_value(t_score, df):
    p_value = 2 * (1 - t.cdf(np.abs(t_score), df))
    return p_value
 
# Function to interpret p-value
def interpret_p_value(p_value, alpha=0.05):
    if p_value < alpha:
        return "Reject the null hypothesis. There is a statistically significant difference."
    else:
        return "Fail to reject the null hypothesis. There is no statistically significant difference."
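
For very large t-scores, `1 - t.cdf(...)` can lose precision because the CDF is so close to 1. A possible alternative, sketched here under the assumption that a two-tailed p-value is wanted, is the survival function `t.sf`, which evaluates the upper tail directly. The function name `p_value_two_tailed` is hypothetical.

```python
import numpy as np
from scipy.stats import t

def p_value_two_tailed(t_score, df):
    # t.sf(x, df) is the upper-tail probability P(T > x),
    # evaluated without subtracting the CDF from 1
    return 2 * t.sf(np.abs(t_score), df)

# For moderate t-scores both approaches agree
p_cdf = 2 * (1 - t.cdf(7.69, 18))
p_sf = p_value_two_tailed(7.69, 18)
print(p_cdf, p_sf)

# For an extreme t-score the survival function still returns a positive value
print(p_value_two_tailed(40, 48))
```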
 

In [3]:
# Function to calculate t-stats for paired sample
def calc_t_statistics_paired_sample(sample_pre, sample_post):
    
    diff_sample = [] # To store difference between observations from sample
    diff_sum_squares = 0

    sample_size_pre = len(sample_pre)
    sample_size_post = len(sample_post)

    # A paired test needs one pre and one post observation per subject,
    # so both samples must have the same length.
    if (sample_size_pre != sample_size_post):
        print(f"Sample size: n1 = {sample_size_pre} and sample size: n2 = {sample_size_post}")
        # Raise instead of returning a sentinel: callers unpack (t_statistic, dof)
        raise ValueError("For a paired-sample t-test, both samples must be the same size since this is a repeated-measures test")
    
    sample_size = sample_size_pre
    dof = sample_size-1

    print(f"Sample size: n = {sample_size}, dof = {dof}")

    # Calculate difference of pre_i-post_i
    for i in range(sample_size):
        diff_sample.append(sample_pre[i]-sample_post[i])

    print("diff sample :", diff_sample)

    diff_mean = sum(diff_sample)/sample_size
    print(f"Mean difference : xd_bar = {round(diff_mean, 2)}")

    for i in range(sample_size):
        diff_sum_squares += (diff_sample[i] - diff_mean) ** 2

    #print(f"Sum of square difference from Mean: diff_sum_squares = {round(diff_sum_squares, 2)} and sample_size = {sample_size}")    

    # Calculate sample standard deviation
    diff_sample_std = np.sqrt(diff_sum_squares/dof)
    print(f"Sample Standard Deviation: sample_sd = {round(diff_sample_std, 2)}")

    # Calculate test statistic (t-statistic)
    std_error = diff_sample_std/np.sqrt(sample_size)
    print("Standard Error =", round(std_error, 2))

    t_statistics = diff_mean/std_error
    print(f"t_statistics = {round(t_statistics, 2)}, dof = {dof}")
    
    return (round(t_statistics, 2), dof)
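
The explicit loops above can also be written with NumPy array operations. The following is a minimal sketch of the same paired calculation (mean difference divided by its standard error, with the sample standard deviation computed using `ddof=1`); the function name `paired_t_statistic` is hypothetical, and the data are the scores from question 17.

```python
import numpy as np

def paired_t_statistic(sample_pre, sample_post):
    pre = np.asarray(sample_pre, dtype=float)
    post = np.asarray(sample_post, dtype=float)
    if pre.shape != post.shape:
        raise ValueError("paired samples must have the same length")

    diff = pre - post                          # pre_i - post_i
    n = diff.size
    std_error = diff.std(ddof=1) / np.sqrt(n)  # sample SD of differences / sqrt(n)
    return diff.mean() / std_error, n - 1

t_stat, dof = paired_t_statistic(
    [80, 85, 90, 75, 88, 82, 92, 78, 85, 87],
    [90, 92, 88, 92, 95, 91, 96, 93, 89, 93],
)
print(round(t_stat, 2), dof)
```

Unlike the function above, this sketch returns the unrounded statistic, so a p-value computed from it is not affected by rounding.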

#### 15. A company is testing two different website layouts to see which one leads to higher click-through rates. Write a Python function to perform an A/B test analysis, including calculating the t-statistic, degrees of freedom, and p-value.

Use the following data:

```python
layout_a_clicks = [28, 32, 33, 29, 31, 34, 30, 35, 36, 37]
layout_b_clicks = [40, 41, 38, 42, 39, 44, 43, 41, 45, 47]
```

In [6]:
layout_a_clicks = [28, 32, 33, 29, 31, 34, 30, 35, 36, 37]
layout_b_clicks = [40, 41, 38, 42, 39, 44, 43, 41, 45, 47]

t_statistics, df = calc_t_statistics_two_samples(layout_a_clicks, layout_b_clicks)

p_value = calculate_p_value(t_statistics, df)

result = interpret_p_value(p_value, alpha=0.05)

# Step 5: Interpret the p-value
print("p-value:", p_value)
print(result)
 

Sample size: n1 = 10 and sample size: n2 = 10
Sample Mean: x1_bar = 32.5 and x2_bar = 42.0, diff_x1x2 = 9.5
Sample Standard Deviation: s1 = 2.87 and s2 = 2.65
Sample size is same
Pooled Standard Deviation: sp =  2.76
Standard Error : se =  1.23
t_statistic = 7.69, dof = 18
p-value: 4.260288652968569e-07
Reject the null hypothesis. There is a statistically significant difference.


 #### Solution
 
 - There are two samples : layout_a_clicks and layout_b_clicks
 - Hence this problem can be solved using two sample t-test

#### Steps to perform t test:

#### Step1:  (Framing Hypothesis)

  $H_0: \mu_1 = \mu_2$ \
  $H_A: \mu_1 \not= \mu_2$ (two tailed test)

#### Step2: (Significance Level)

$\alpha = 0.05$

#### Step3: (Calculate t statistics)

|Branch         |Sample Size (n)|Average ($\bar{x}$) |Standard deviation (s)|
|---------------|---------------|----------------|----------------------|
|layout_a_clicks|       10      |       32.5     |      2.87            |
|layout_b_clicks|       10      |       42.0     |      2.65            |

Since the sample sizes are equal, we use the pooled standard deviation $s_{p} = 2.76$

$t_{statistic} = (\bar x_1 - \bar x_2) / (s_{p} \sqrt{1/n_1 + 1/n_2}) = 9.5 / (2.76 \times 0.447) = 9.5/1.235 = 7.69$

#### Step4: (t critical)

$dof = 10+10-2 = 18$

$t_{critical} ^{0.05} = 2.101$

p-value: 4.260288652968569e-07
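
The critical value does not have to be read from a printed table. As a minimal sketch, assuming a two-tailed test at significance level $\alpha$, it can be obtained from the inverse CDF `t.ppf`; the helper name `two_tailed_critical` is hypothetical, and the loop covers the degrees of freedom that occur in questions 15 to 20.

```python
from scipy.stats import t

def two_tailed_critical(alpha, dof):
    # Upper critical value: P(|T| > t_critical) = alpha
    return t.ppf(1 - alpha / 2, dof)

alpha = 0.05
for dof in (9, 18, 38, 48, 60):
    print(f"dof = {dof}: t_critical = {two_tailed_critical(alpha, dof):.3f}")
```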

#### Step5: Conclusion

$t_{statistics} > t_{critical}$ So, we reject the $H_0$ hypothesis.

Also, p-value $< \alpha$ So, we reject the $H_0$ hypothesis. i.e., There is a statistically significant difference in clicks between the two layouts.
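
As a cross-check, SciPy provides a pooled-variance two-sample test, `scipy.stats.ttest_ind`. The sketch below is not the assignment's own method. Two differences from the hand-written function are worth noting: SciPy uses sample variances (`ddof=1`) while the function above uses the `np.std` default (`ddof=0`), and SciPy keeps the sign of the difference instead of taking the absolute value. The statistic therefore comes out negative and somewhat smaller in magnitude than 7.69, though the conclusion is unchanged.

```python
from scipy import stats

layout_a_clicks = [28, 32, 33, 29, 31, 34, 30, 35, 36, 37]
layout_b_clicks = [40, 41, 38, 42, 39, 44, 43, 41, 45, 47]

# equal_var=True requests the pooled-variance (Student) t-test
result = stats.ttest_ind(layout_a_clicks, layout_b_clicks, equal_var=True)

print("t-statistic:", result.statistic)
print("p-value:", result.pvalue)
```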


#### 16. A pharmaceutical company wants to determine if a new drug is more effective than an existing drug in reducing cholesterol levels. Create a program to analyze the clinical trial data and calculate the t-statistic and p-value for the treatment effect.

Use the following data of cholesterol level:

```python
existing_drug_levels = [180, 182, 175, 185, 178, 176, 172, 184, 179, 183]
new_drug_levels = [170, 172, 165, 168, 175, 173, 170, 178, 172, 176]
```

In [7]:
existing_drug_levels = [180, 182, 175, 185, 178, 176, 172, 184, 179, 183]
new_drug_levels = [170, 172, 165, 168, 175, 173, 170, 178, 172, 176]

t_statistics, df = calc_t_statistics_paired_sample(existing_drug_levels, new_drug_levels)

p_value = calculate_p_value(t_statistics, df)

result = interpret_p_value(p_value, alpha=0.05)

# Step 5: Interpret the p-value
print("p-value:", p_value)
print(result)


Sample size: n = 10, dof = 9
diff sample : [10, 10, 10, 17, 3, 3, 2, 6, 7, 7]
Mean difference : xd_bar = 7.5
Sample Standard Deviation: sample_sd = 4.5
Standard Error = 1.42
t_statistics = 5.27, dof = 9
p-value: 0.0005138697113511448
Reject the null hypothesis. There is a statistically significant difference.


 #### Solution
 
 - The two sets of readings, existing_drug_levels and new_drug_levels, are treated as paired observations
 - Hence this problem can be solved using a paired-sample t-test (a one-sample t-test on the pairwise differences)

#### Steps to perform t test:

#### Step1:  (Framing Hypothesis)

  $H_0$: $\mu_d = 0$ (where $\mu_d$ is the mean of the paired differences) \
  $H_A$: $\mu_d \not= 0$ (two tailed test)

#### Step2: (Significance Level)

$\alpha = 0.05$

#### Step3: (Calculate t statistics)

|existing_drug_levels |new_drug_levels |Difference|
|---------------|--------------|----------------------|
|       180      |       170     |      10           |
|       182      |       172     |      10           |
|       175      |       165     |      10           |
|       185      |       168     |      17           |
|       178      |       175     |       3           |
|       176      |       173     |       3           |
|       172      |       170     |       2           |
|       184      |       178     |       6           |
|       179      |       172     |       7           |
|       183      |       176     |       7           |


$\bar{x}_d = 7.5$ \
Standard deviation $s_d = 4.5$ \
Standard error $= s_d/\sqrt{n} = 4.5/3.16 \approx 1.42$

$t_{statistic} = \text{mean difference} / \text{standard error} = \bar{x}_d / (s_d/\sqrt{n}) \approx 5.27$ (computed from the unrounded values)

#### Step4: (t critical)

$dof = 10-1 = 9$ \
$t_{critical} ^{0.05} = 2.262$

p-value: 0.0005138697113511448

#### Step5: Conclusion

$t_{statistics} > t_{critical}$ So, we reject the $H_0$ hypothesis. i.e., There is a statistically significant difference between existing and new drug levels.

Also, p-value $< \alpha$ So, we reject the $H_0$ hypothesis. i.e., There is a statistically significant difference between existing and new drug levels.
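
The paired calculation can be cross-checked with `scipy.stats.ttest_rel`. This is a sketch for verification, not the assignment's own method. Because the question asks whether the new drug is *more* effective, a one-sided alternative may also be of interest; the `alternative` keyword used for that assumes SciPy 1.6 or later.

```python
from scipy import stats

existing_drug_levels = [180, 182, 175, 185, 178, 176, 172, 184, 179, 183]
new_drug_levels = [170, 172, 165, 168, 175, 173, 170, 178, 172, 176]

# Two-sided paired test: H_A is that the mean difference is not zero
two_sided = stats.ttest_rel(existing_drug_levels, new_drug_levels)

# One-sided paired test: H_A is that existing levels exceed new levels on average
# (assumes SciPy >= 1.6 for the `alternative` keyword)
one_sided = stats.ttest_rel(existing_drug_levels, new_drug_levels,
                            alternative="greater")

print("t-statistic:", two_sided.statistic)
print("two-sided p-value:", two_sided.pvalue)
print("one-sided p-value:", one_sided.pvalue)
```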


#### 17. A school district introduces an educational intervention program to improve math scores. Write a Python function to analyze pre- and post-intervention test scores, calculating the t-statistic and p-value to determine if the intervention had a significant impact.

Use the following data of test score:

```python
pre_intervention_scores = [80, 85, 90, 75, 88, 82, 92, 78, 85, 87]
post_intervention_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]
```

In [8]:
pre_intervention_scores = [80, 85, 90, 75, 88, 82, 92, 78, 85, 87]
post_intervention_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]

t_statistics, df = calc_t_statistics_paired_sample(pre_intervention_scores, post_intervention_scores)

p_value = calculate_p_value(t_statistics, df)

result = interpret_p_value(p_value, alpha=0.05)

print("p-value:", p_value)
print(result)
 

Sample size: n = 10, dof = 9
diff sample : [-10, -7, 2, -17, -7, -9, -4, -15, -4, -6]
Mean difference : xd_bar = -7.7
Sample Standard Deviation: sample_sd = 5.5
Standard Error = 1.74
t_statistics = -4.43, dof = 9
p-value: 0.0016471560531239327
Reject the null hypothesis. There is a statistically significant difference.


 #### Solution
 
 - The pre_intervention_scores and post_intervention_scores are paired observations of the same students
 - Hence this problem can be solved using a paired-sample t-test (a one-sample t-test on the pairwise differences)

#### Steps to perform t test:

#### Step1:  (Framing Hypothesis)

  $H_0$: $\mu_d = 0$ (where $\mu_d$ is the mean of the paired differences) \
  $H_A$: $\mu_d \not= 0$ (two tailed test)

#### Step2: (Significance Level)

$\alpha = 0.05$

#### Step3: (Calculate t statistics)

|pre_intervention_scores|post_intervention_scores |Difference|
|---------------|--------------|----------------------|
|       80      |       90     |      -10           |
|       85      |       92     |       -7           |
|       90      |       88     |        2           |
|       75      |       92     |      -17           |
|       88      |       95     |       -7           |
|       82      |       91     |       -9           |
|       92      |       96     |       -4           |
|       78      |       93     |      -15           |
|       85      |       89     |       -4           |
|       87      |       93     |       -6           |



$\bar{x}_d = -7.7$ \
Standard deviation $s_d = 5.5$ \
Standard error $= s_d/\sqrt{n} = 5.5/3.16 = 1.74$

$t_{statistic} = \text{mean difference} / \text{standard error} = \bar{x}_d / (s_d/\sqrt{n}) = -7.7/1.74 = -4.43$

#### Step4: (t critical)

$dof = 10-1 = 9$ \
$t_{critical} ^{0.05} = 2.262$

p-value: 0.0016471560531239327

#### Step5: Conclusion

$|t_{statistics}| > t_{critical}$ So, we reject the $H_0$ hypothesis. i.e., There is a statistically significant difference between the pre and post intervention score.

Also, p-value $< \alpha$ So, we reject the $H_0$ hypothesis. i.e., There is a statistically significant difference between the pre and post intervention score.


#### 18. An HR department wants to investigate if there's a gender-based salary gap within the company. Develop a program to analyze salary data, calculate the t-statistic, and determine if there's a statistically significant difference between the average salaries of male and female employees.

Use the below code to generate synthetic data:

```python
# Generate synthetic salary data for male and female employees
np.random.seed(0)  # For reproducibility

male_salaries = np.random.normal(loc=50000, scale=10000, size=20)
female_salaries = np.random.normal(loc=55000, scale=9000, size=20)
```

In [9]:
# Generate synthetic salary data for male and female employees
np.random.seed(0)  # For reproducibility

male_salaries = np.random.normal(loc=50000, scale=10000, size=20)
female_salaries = np.random.normal(loc=55000, scale=9000, size=20)

In [10]:
t_statistics, df = calc_t_statistics_two_samples(male_salaries, female_salaries)

p_value = calculate_p_value(t_statistics, df)

result = interpret_p_value(p_value, alpha=0.05)

print("p-value:", p_value)
print(result)
 

Sample size: n1 = 20 and sample size: n2 = 20
Sample Mean: x1_bar = 55693.35 and x2_bar = 55501.75, diff_x1x2 = 191.59
Sample Standard Deviation: s1 = 8501.83 and s2 = 10690.39
Sample size is same
Pooled Standard Deviation: sp =  9658.3
Standard Error : se =  3054.22
t_statistic = 0.06, dof = 38
p-value: 0.950309951000718
Fail to reject the null hypothesis. There is no statistically significant difference.


 #### Solution
 
 - There are two samples : male_salaries and female_salaries
 - Hence this problem can be solved using two sample t-test

#### Steps to perform t test:

#### Step1:  (Framing Hypothesis)

  $H_0$: Mean salary of male and female employees $\mu_1 = \mu_2$ \
  $H_A$: Mean salary of male and female employees $\mu_1 \not= \mu_2$ (two tailed test)

#### Step2: (Significance Level)

$\alpha = 0.05$

#### Step3: (Calculate t statistics)

|Group          |Sample Size (n)|Average ($\bar{x}$) |Standard deviation (s)|
|---------------|---------------|----------------|----------------------|
|male_salaries  |       20      |     55693.35   |      8501.83         |
|female_salaries|       20      |     55501.75   |     10690.39         |

Since the sample sizes are equal, we use the pooled standard deviation $s_{p} = 9658.3$

$t_{statistic} = (\bar x_1 - \bar x_2) / (s_{p} \sqrt{1/n_1 + 1/n_2}) = 191.59/3054.22 = 0.06$

#### Step4: (t critical)

$dof = 20+20-2 = 38$ \
$t_{critical} ^{0.05} = 2.024$

#### Step5: Conclusion

$t_{statistics} < t_{critical}$ So, we fail to reject the $H_0$ hypothesis. i.e., There is no statistically significant difference between the average salaries of male and female employees.
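
The two salary samples have visibly different standard deviations, so one might also consider Welch's t-test, which does not assume equal variances. The sketch below is a swapped-in variant rather than the pooled test used above. It relies on `scipy.stats.ttest_ind` with `equal_var=False`. With equal sample sizes the Welch and pooled statistics coincide, and only the degrees of freedom (and hence the p-value) differ.

```python
import numpy as np
from scipy import stats

# Same synthetic data as in the question
np.random.seed(0)  # For reproducibility
male_salaries = np.random.normal(loc=50000, scale=10000, size=20)
female_salaries = np.random.normal(loc=55000, scale=9000, size=20)

pooled = stats.ttest_ind(male_salaries, female_salaries, equal_var=True)
welch = stats.ttest_ind(male_salaries, female_salaries, equal_var=False)

print("pooled: t =", pooled.statistic, "p =", pooled.pvalue)
print("welch:  t =", welch.statistic, "p =", welch.pvalue)
```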


#### 19. A manufacturer produces two different versions of a product and wants to compare their quality scores. Create a Python function to analyze quality assessment data, calculate the t-statistic, and decide whether there's a significant difference in quality between the two versions.

Use the following data:

```python
version1_scores = [85, 88, 82, 89, 87, 84, 90, 88, 85, 86, 91, 83, 87, 84, 89, 86, 84, 88, 85, 86, 89, 90, 87, 88, 85]
version2_scores = [80, 78, 83, 81, 79, 82, 76, 80, 78, 81, 77, 82, 80, 79, 82, 79, 80, 81, 79, 82, 79, 78, 80, 81, 82]
```

In [11]:
version1_scores = [85, 88, 82, 89, 87, 84, 90, 88, 85, 86, 91, 83, 87, 84, 89, 86, 84, 88, 85, 86, 89, 90, 87, 88, 85]
version2_scores = [80, 78, 83, 81, 79, 82, 76, 80, 78, 81, 77, 82, 80, 79, 82, 79, 80, 81, 79, 82, 79, 78, 80, 81, 82]

t_statistics, df = calc_t_statistics_two_samples(version1_scores, version2_scores)

p_value = calculate_p_value(t_statistics, df)

result = interpret_p_value(p_value, alpha=0.05)

# Step 5: Interpret the p-value
print("p-value:", p_value)
print(result)
 

Sample size: n1 = 25 and sample size: n2 = 25
Sample Mean: x1_bar = 86.64 and x2_bar = 79.96, diff_x1x2 = 6.68
Sample Standard Deviation: s1 = 2.31 and s2 = 1.73
Sample size is same
Pooled Standard Deviation: sp =  2.04
Standard Error : se =  0.58
t_statistic = 11.56, dof = 48
p-value: 1.7763568394002505e-15
Reject the null hypothesis. There is a statistically significant difference.


 #### Solution
 
 - There are two samples : version1_scores and version2_scores
 - Hence this problem can be solved using two sample t-test

#### Steps to perform t test:

#### Step1:  (Framing Hypothesis)

  $H_0$: Mean quality score of the two versions $\mu_1 = \mu_2$ \
  $H_A$: Mean quality score of the two versions $\mu_1 \not= \mu_2$ (two tailed test)

#### Step2: (Significance Level)

$\alpha = 0.05$

#### Step3: (Calculate t statistics)

|Version        |Sample Size (n)|Average ($\bar{x}$) |Standard deviation (s)|
|---------------|---------------|----------------|----------------------|
|version1_scores|       25      |       86.64    |      2.31            |
|version2_scores|       25      |       79.96    |      1.73            |

Since the sample sizes are equal, we use the pooled standard deviation $s_{p} = 2.04$

$t_{statistic} = (\bar x_1 - \bar x_2) / (s_{p} \sqrt{1/n_1 + 1/n_2}) = 6.68/0.58 = 11.56$

#### Step4: (t critical)

$dof = 25+25-2 = 48$ \
$t_{critical} ^{0.05} = 2.011$

#### Step5: Conclusion

$t_{statistics} > t_{critical}$ So, we reject the $H_0$ hypothesis. i.e., There is a statistically significant difference in quality scores between the two versions.


#### 20. A restaurant chain collects customer satisfaction scores for two different branches. Write a program to analyze the scores, calculate the t-statistic, and determine if there's a statistically significant difference in customer satisfaction between the branches.

Use the below data of scores:

```python
branch_a_scores = [4, 5, 3, 4, 5, 4, 5, 3, 4, 4, 5, 4, 4, 3, 4, 5, 5, 4, 3, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4]
branch_b_scores = [3, 4, 2, 3, 4, 3, 4, 2, 3, 3, 4, 3, 3, 2, 3, 4, 4, 3, 2, 3, 4, 3, 2, 4, 3, 3, 4, 2, 3, 4, 3]
```

In [12]:
branch_a_scores = [4, 5, 3, 4, 5, 4, 5, 3, 4, 4, 5, 4, 4, 3, 4, 5, 5, 4, 3, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4]
branch_b_scores = [3, 4, 2, 3, 4, 3, 4, 2, 3, 3, 4, 3, 3, 2, 3, 4, 4, 3, 2, 3, 4, 3, 2, 4, 3, 3, 4, 2, 3, 4, 3]

t_statistics, df = calc_t_statistics_two_samples(branch_a_scores, branch_b_scores)

p_value = calculate_p_value(t_statistics, df)

result = interpret_p_value(p_value, alpha=0.05)

print("p-value:", p_value)
print(result)
 

Sample size: n1 = 31 and sample size: n2 = 31
Sample Mean: x1_bar = 4.13 and x2_bar = 3.13, diff_x1x2 = 1.0
Sample Standard Deviation: s1 = 0.71 and s2 = 0.71
Sample size is same
Pooled Standard Deviation: sp =  0.71
Standard Error : se =  0.18
t_statistic = 5.57, dof = 60
p-value: 6.320764851519556e-07
Reject the null hypothesis. There is a statistically significant difference.


 #### Solution
 
 - There are two samples : branch_a_scores and branch_b_scores
 - Hence this problem can be solved using two sample t-test

#### Steps to perform t test:

#### Step1:  (Framing Hypothesis)

  $H_0$: Customer Satisfaction score for $\mu_1 = \mu_2$ \
  $H_A$: Customer Satisfaction score for $\mu_1 \not= \mu_2$ (two tailed test)

#### Step2: (Significance Level)

$\alpha = 0.05$

#### Step3: (Calculate t statistics)

|Branch         |Sample Size (n)|Average ($\bar{x}$) |Standard deviation (s)|
|---------------|---------------|----------------|----------------------|
|branch_a_scores|       31      |       4.13     |      0.71            |
|branch_b_scores|       31      |       3.13     |      0.71            |

Since the sample sizes are equal, we use the pooled standard deviation $s_{p} = 0.71$

$t_{statistic} = (\bar x_1 - \bar x_2) / (s_{p} \sqrt{1/n_1 + 1/n_2}) = 1.0/0.18 = 5.57$

#### Step4: (t critical)

$dof = 31+31-2 = 60$ \
$t_{critical} ^{0.05} = 2.000$

#### Step5: Conclusion

$t_{statistics} > t_{critical}$ So, we reject the $H_0$ hypothesis. i.e., The mean customer satisfaction scores of the two branches are not equal.
