In [1]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns  

%matplotlib inline

import scipy.stats as stats 
import random

# 1. **Two Independent Sample Z-test for Equality of Means**

**Business Problem 1**

*To compare customer satisfaction levels of two competing fine dining restaurants, 120 customers of restaurant A and 250 customers of restaurant B were randomly selected and asked to rate their experience on a scale of 1 to 5, with 1 being least satisfied and 5 being most satisfied (the survey results are summarized below). Suppose we know that $\sigma_1$ = 0.15 and $\sigma_2$ = 0.89.*

*Test at a 0.05 level of significance whether the data provide sufficient evidence to conclude that restaurant A has a higher mean satisfaction rating than restaurant B.*


Let $\mu_A, \mu_B$ be the mean customer ratings of restaurant A and restaurant B respectively.

We will test the null hypothesis

>$H_0:\mu_A=\mu_B$

against the alternate hypothesis

>$H_a:\mu_A>\mu_B$

In [2]:
rest_A = pd.read_csv('restaurant_A_ratings.csv')
rest_A.drop(columns = ['Unnamed: 0'], inplace=True)
rest_A.head(4)

Unnamed: 0,restaurant_A
0,4.529049
1,4.28947
2,4.532923
3,4.402421


In [3]:
rest_B = pd.read_csv('restaurant_B_ratings.csv')
rest_B.drop(columns = ['Unnamed: 0'], inplace=True)
rest_B.head(4)

Unnamed: 0,restaurant_B
0,1.405078
1,2.139428
2,1.950545
3,3.146699


In [4]:
rest_A['rating_A'] = rest_A ['restaurant_A'].round(1)
rest_B['rating_B'] = rest_B ['restaurant_B'].round(1)

In [5]:
rest_A.drop(columns = ['restaurant_A'], inplace=True)
rest_A.head(2)

Unnamed: 0,rating_A
0,4.5
1,4.3


In [6]:
rest_B.drop(columns = ['restaurant_B'], inplace=True)
rest_B.head(2)

Unnamed: 0,rating_B
0,1.4
1,2.1


In [7]:
ratings = pd.concat([rest_A, rest_B], axis=1)
ratings

Unnamed: 0,rating_A,rating_B
0,4.5,1.4
1,4.3,2.1
2,4.5,2.0
3,4.4,3.1
4,4.2,1.9
...,...,...
245,,1.2
246,,0.8
247,,2.1
248,,2.0


In [8]:
ratings['rating_A'].fillna(0, inplace=True)
ratings

Unnamed: 0,rating_A,rating_B
0,4.5,1.4
1,4.3,2.1
2,4.5,2.0
3,4.4,3.1
4,4.2,1.9
...,...,...
245,0.0,1.2
246,0.0,0.8
247,0.0,2.1
248,0.0,2.0


In [9]:
mean_1 = f'The mean rating for restaurant A is : {round(ratings.rating_A.mean(), 1)}'
mean_2 = f'The mean rating for restaurant B is : {round(ratings.rating_B.mean(), 1)}'
print(mean_1, mean_2)

The mean rating for restaurant A is : 2.1 The mean rating for restaurant B is : 1.9
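One point to keep in mind when reading these means: filling the missing `rating_A` entries with 0 makes every padded row count as a rating of 0, which pulls the mean for restaurant A down. pandas' `mean()` skips `NaN` by default, so leaving the gaps as `NaN` averages only the real responses. A minimal sketch with made-up numbers (the values below are hypothetical, not the survey data):

```python
import numpy as np
import pandas as pd

# Toy column: three real ratings followed by two padding rows (hypothetical values).
toy = pd.Series([4.5, 4.3, 4.4, np.nan, np.nan])

mean_skipping_nan = toy.mean()          # NaN rows are ignored by default
mean_with_zeros = toy.fillna(0).mean()  # padded rows are treated as ratings of 0

print(round(mean_skipping_nan, 2), round(mean_with_zeros, 2))
```

The same effect carries over to any statistic built on the column mean, including the Z-test statistic computed later.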


### Are the Z-test assumptions satisfied?

- Continuous data - Yes, the ratings are measured on a continuous scale.
- Normally distributed populations or sample sizes > 30 - Since both sample sizes (120 and 250) are greater than 30, the Central Limit Theorem implies that the distribution of the sample means is approximately normal.
- Independent populations - As the samples come from customers of two different restaurants, the two samples are from two independent populations.
- Known population standard deviations $\sigma_1$ and $\sigma_2$ - Yes, we are given the standard deviations of both populations ($\sigma_1$ = 0.15 and $\sigma_2$ = 0.89).
- Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.
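The Central Limit Theorem argument in the list above can be illustrated with a small simulation. This is only a sketch: the skewed rating distribution (`levels`, `probs`) and the number of repeats are assumptions chosen for illustration, not properties of the survey data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed population of ratings on the 1-5 scale (not the survey data).
levels = np.array([1, 2, 3, 4, 5])
probs = np.array([0.05, 0.05, 0.10, 0.30, 0.50])
pop_mean = float(np.sum(levels * probs))
pop_sd = float(np.sqrt(np.sum(probs * (levels - pop_mean) ** 2)))

n = 120            # sample size, matching restaurant A
n_repeats = 5000   # number of simulated samples

# Draw many samples of size n and record each sample mean.
samples = rng.choice(levels, size=(n_repeats, n), p=probs)
sample_means = samples.mean(axis=1)

# CLT: the sample means centre on pop_mean with spread close to pop_sd / sqrt(n).
print(pop_mean, sample_means.mean())
print(pop_sd / np.sqrt(n), sample_means.std(ddof=1))
```

Even though the individual ratings are far from normal, the simulated sample means cluster tightly around the population mean, which is what the Z-test relies on.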

### Find the p-value

In [10]:
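def ztest_2samp(X1, X2, pop_sd1, pop_sd2, n1, n2):
    '''
    One-sided (upper-tail) two-sample Z-test for Ha: mean of population 1 > mean of population 2.

    X1 - first of the two independent samples (sample 1)
    X2 - second of the two independent samples (sample 2)
    pop_sd1 - standard deviation of population 1
    pop_sd2 - standard deviation of population 2
    n1 - the size of sample 1
    n2 - the size of sample 2

    Returns the p-value.
    '''
    from numpy import sqrt
    from scipy.stats import norm
    # standard error of the difference between the two sample means
    se = sqrt(pop_sd1**2/n1 + pop_sd2**2/n2)
    # test statistic under H0: mu_1 - mu_2 = 0
    test_stat = ((X1.mean() - X2.mean()) - 0)/ se
    # upper-tail probability; no abs(), so a negative statistic is not mistaken for evidence in favour of Ha
    pval = 1 - norm.cdf(test_stat)
    return pval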

In [11]:
p_val = ztest_2samp(ratings['rating_A'].dropna(), ratings['rating_B'].dropna(), 0.15, 0.89, 120, 250)
p_val

0.00016942771113326316

As the p-value is less than the level of significance of 0.05, we reject the null hypothesis. Thus, we have enough statistical evidence to conclude that restaurant A has a higher mean satisfaction rating than restaurant B.
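The same decision can be reached by comparing the test statistic with a critical value instead of comparing the p-value with the significance level. The sketch below recovers the z statistic implied by the p-value reported above and compares it with the upper-tail critical value at $\alpha$ = 0.05; it assumes the reported p-value is an upper-tail probability from the standard normal distribution.

```python
from scipy.stats import norm

alpha = 0.05
p_val = 0.00016942771113326316  # p-value reported above

# Upper-tail critical value for a one-sided test at level alpha.
z_crit = norm.ppf(1 - alpha)

# z statistic implied by the reported upper-tail p-value.
z_stat = norm.isf(p_val)

# Reject H0 when the statistic falls beyond the critical value.
reject_h0 = z_stat > z_crit
print(z_crit, z_stat, reject_h0)
```

Both routes are equivalent: the statistic exceeds the critical value exactly when the p-value is below $\alpha$.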

### **Conclusion**

- At the 0.05 level of significance, the data provide sufficient evidence that restaurant A has a higher mean customer satisfaction rating than restaurant B.

# 2. **Two Independent Sample T-test for Equality of Means - Equal Std Dev**

**Business Problem 2**

*The sodium content of N1=20 Pepsi Diet cans of 335ml and N2=30 Pepsi Max cans of 320ml is measured. Is the mean sodium content of the Pepsi Diet cans different from that of the Pepsi Max cans?*

Note:
- The sample sizes are not equal, but we will assume that the two population variances are equal so that the assumptions of the test are satisfied.
- Since the significance level is not provided, we assume it to be 0.05.

**Formulate null hypothesis and alternate hypothesis.**

**H0**: $\mu_1 = \mu_2$, i.e. $\mu_1 - \mu_2 = 0$ - There is no difference between the means.

**Ha**: $\mu_1 \neq \mu_2$, i.e. $\mu_1 - \mu_2 \neq 0$ - There is a difference between the means.

In [12]:
sodium_data = pd.read_csv('sodium_data.csv')
sodium_data.drop(columns = ['Unnamed: 0'], inplace=True)
sodium_data.head()

Unnamed: 0,sodium1_mg,sodium2_mg
0,24.18,17.84
1,23.73,17.59
2,23.69,17.31
3,23.98,16.04
4,23.57,17.92


### Are the T-test assumptions satisfied?

- Continuous data - Yes, the sodium content is measured on a continuous scale.
- Normally distributed populations - Yes, we are informed that the populations are assumed to be normal.
- Independent populations - As we are taking random samples of two different types of cans, the two samples are from two independent populations.
- Equal population standard deviations - As the sample standard deviations are almost equal, the population standard deviations can be assumed to be equal.
- Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.
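The equal-standard-deviation assumption in the list above can be checked rather than only eyeballed. A minimal sketch, assuming Levene's test (`scipy.stats.levene`) is an acceptable check; the two arrays below are randomly generated placeholders, not the sodium measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Placeholder samples standing in for the two sodium columns (hypothetical values).
sample_1 = rng.normal(loc=20.0, scale=3.0, size=20)
sample_2 = rng.normal(loc=20.0, scale=3.0, size=30)

# Compare sample standard deviations (ddof=1 gives the sample estimate).
sd_1, sd_2 = sample_1.std(ddof=1), sample_2.std(ddof=1)

# Levene's test: H0 is that the two populations have equal variances.
levene_stat, levene_p = stats.levene(sample_1, sample_2)

print(sd_1, sd_2, levene_p)
```

On the notebook's data, the inputs would be `sodium_data['sodium1_mg'].dropna()` and `sodium_data['sodium2_mg']`; a large p-value from Levene's test is consistent with treating the variances as equal.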

### Find the t-test statistic and p-value

In [13]:
t, p_value = stats.ttest_ind(sodium_data['sodium2_mg'], sodium_data['sodium1_mg'].dropna())

print('t-stats = ',t, ', p_value = ', p_value)

t-stats =  0.48786077817746965 , p_value =  0.6274876408382779


As the p-value (0.627) is greater than the significance level of 0.05 (assumed, since none is given), we fail to reject the null hypothesis. Thus, we do not have statistical evidence that the two mean sodium contents differ.
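For reference, with its default `equal_var=True`, `stats.ttest_ind` computes the pooled-variance t statistic with $n_1 + n_2 - 2$ degrees of freedom. The sketch below reproduces that calculation by hand and cross-checks it against SciPy; the samples `x1` and `x2` are randomly generated placeholders, not the sodium data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Placeholder samples (hypothetical values), sized like the two groups of cans.
x1 = rng.normal(loc=20.0, scale=3.0, size=20)
x2 = rng.normal(loc=20.5, scale=3.0, size=30)
n1, n2 = len(x1), len(x2)

# Pooled variance: weighted average of the two sample variances.
sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)

# t statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom.
t_manual = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

# Cross-check against scipy's equal-variance t-test.
t_scipy, p_scipy = stats.ttest_ind(x1, x2)
print(t_manual, t_scipy, p_manual, p_scipy)
```

Swapping the order of the two samples flips the sign of the t statistic but leaves the two-sided p-value unchanged.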

### **Conclusion**

- At the 0.05 level of significance, we don't have enough statistical evidence to support the alternative hypothesis, so we cannot conclude that the mean sodium contents of the two Pepsi products differ.

# 3. **Two Independent Sample T-test for Equality of Means - Unequal Std Dev**

**Business Problem 3**

*Suppose the Hershey Company wants to know whether dark chocolate is more popular in the US than in India, so it randomly selected a sample of score ratings from each country.*

*The score rating system is:*
- 4.0 - 5.0 = Outstanding
- 3.5 - 3.9 = Highly Recommended
- 3.0 - 3.4 = Recommended
- 2.0 - 2.9 = Disappointing
- 1.0 - 1.9 = Unpleasant 

*Assuming the chocolate score ratings for the two populations are normally distributed, do we have enough statistical evidence at a 5% significance level to conclude that dark chocolate is rated higher in the US than in India?*

Let $\mu_1, \mu_2$ be the mean dark chocolate score ratings in the US and India respectively.

We will test the null hypothesis

>$H_0:\mu_1=\mu_2$

against the alternate hypothesis

>$H_a:\mu_1>\mu_2$

In [14]:
chocolate_data = pd.read_csv('chocolate_scores.csv')
chocolate_data.drop(columns = ['Unnamed: 0'], inplace=True)
chocolate_data['US_scores'] = chocolate_data['US_scores'].round(1)
chocolate_data['India_scores'] = chocolate_data['India_scores'].round(1)
chocolate_data.head(2)

Unnamed: 0,US_scores,India_scores
0,4.2,3.5
1,4.4,3.4


In [15]:
print('The AVG ratings of US is ' + str(round(chocolate_data['US_scores'].mean(), 2)))
print('The AVG ratings of India is ' + str(round(chocolate_data['India_scores'].mean(), 2)))
print('The standard deviation of the ratings in US is ' + str(round(chocolate_data['US_scores'].std(),2)))
print('The standard deviation of ratings in India is ' + str(round(chocolate_data['India_scores'].std(),2)))

The AVG ratings of US is 4.28
The AVG ratings of India is 2.98
The standard deviation of the ratings in US is 0.14
The standard deviation of ratings in India is 0.58


### Are the T-test assumptions satisfied?

- Continuous data - Yes, the chocolate score rating is measured on a continuous scale.
- Normally distributed populations - Yes, we are informed that the populations are assumed to be normal.
- Independent populations - As we are taking random samples for two different groups, the two samples are from two independent populations.
- Unequal population standard deviations - As the sample standard deviations are clearly different (0.14 for the US and 0.58 for India), the population standard deviations can be assumed to be different.
- Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.

### Find the t-test statistic and p-value

In [16]:
from scipy.stats import ttest_ind

test_stat, p_value = ttest_ind(chocolate_data['US_scores'], chocolate_data['India_scores'], equal_var = False, alternative = 'greater')
print('The p-value is : {}, the t-test statistic is : {}'.format(str(p_value),str(test_stat)))

The p-value is : 1.6446400020384743e-09, the t-test statistic is : 9.686795920003261


As the p-value is much smaller than the level of significance of 0.05, we reject the null hypothesis. Thus, we have enough statistical evidence to conclude that dark chocolate has a higher mean score rating in the US than in India.
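With `equal_var = False`, `ttest_ind` performs Welch's t-test, which does not pool the variances and uses the Welch-Satterthwaite approximation for the degrees of freedom. The sketch below works through that calculation by hand and cross-checks it against SciPy. The samples `us` and `india`, including their sizes, are randomly generated placeholders, not the chocolate scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Placeholder samples (hypothetical values) with clearly different spreads.
us = rng.normal(loc=3.3, scale=0.15, size=20)
india = rng.normal(loc=3.0, scale=0.6, size=20)

# Estimated variance of each sample mean.
v1 = us.var(ddof=1) / len(us)
v2 = india.var(ddof=1) / len(india)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_manual = (us.mean() - india.mean()) / np.sqrt(v1 + v2)
df = (v1 + v2) ** 2 / (v1 ** 2 / (len(us) - 1) + v2 ** 2 / (len(india) - 1))

# One-sided (upper-tail) p-value for Ha: mu_1 > mu_2.
p_manual = stats.t.sf(t_manual, df)

# Cross-check against scipy (the `alternative` argument needs SciPy 1.6 or later).
t_scipy, p_scipy = stats.ttest_ind(us, india, equal_var=False, alternative='greater')
print(t_manual, t_scipy, p_manual, p_scipy)
```

The order of the arguments matters for a one-sided test: the first sample plays the role of $\mu_1$ in $H_a:\mu_1>\mu_2$.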

### **Conclusion**

- At the 0.05 level of significance, the test result supports the conclusion that dark chocolate has a higher mean score rating in the US than in India.