In [1]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns  

%matplotlib inline

import scipy.stats as stats 
import random

# **Paired Sample T-test for Equality of Means**

**Business Problem 1**

*Suppose a nutrition app wants to introduce a new diet program and is interested in whether the new program helps its clients lose weight. They therefore randomly selected a sample of 15 clients and recorded their weights (kg) before and after the program.*

*Assuming the clients' weights are normally distributed, do we have enough statistical evidence to say that there is a decrease in body weight at a 0.05 significance level?*

We will test the null hypothesis

>$H_0:\mu_{new} = \mu_{old}$

against the alternate hypothesis

>$H_a:\mu_{new} < \mu_{old}$
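
For reference, a paired t-test is a one-sample t-test on the per-client differences. The notation $d_i = new_i - old_i$ is introduced here for illustration and does not appear in the dataset. With that convention, the test statistic is

$$
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}
$$

Under $H_0$ it follows a t-distribution with $n - 1$ degrees of freedom, which is 14 for the sample of 15 clients. A strongly negative $t$ supports $H_a:\mu_{new} < \mu_{old}$.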

In [2]:
nutrition_data = pd.read_csv('nutrition_data.csv')
nutrition_data.drop(columns = ['Unnamed: 0'], inplace=True)
nutrition_data.head(5)

Unnamed: 0,id,old_weight(kg),new_weight(kg)
0,1235,112.98,86.21
1,1616,115.76,84.71
2,1998,115.32,84.28
3,2379,113.89,85.37
4,2761,115.01,83.96


In [3]:
diff = np.mean(nutrition_data['old_weight(kg)'] - nutrition_data['new_weight(kg)'])
print('The mean difference between the clients\' weights before and after the program :', round(diff, 2))

The mean difference between the clients' weights before and after the program : 29.74


### Are the paired t-test assumptions satisfied?

- Continuous data - Yes, the weight (kg) is measured on a continuous scale.
- Normally distributed populations - Yes, we are told to assume that the weights are normal. Strictly, the paired t-test needs the before/after differences to be approximately normal.
- Independent observations - As the clients are sampled randomly, the pairs are independent of one another. The two measurements within a pair are dependent by design.
- Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.
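
The normality assumption can also be checked empirically instead of being taken as given. The sketch below is not part of the original analysis. It runs a Shapiro-Wilk test on the differences, using only the five rows displayed by `nutrition_data.head(5)` as stand-in data. With so few points the test has little power, so treat this as an illustration of the call, not as evidence.

```python
import numpy as np
from scipy import stats

# First five rows shown by nutrition_data.head(5); illustration only.
old_weight = np.array([112.98, 115.76, 115.32, 113.89, 115.01])
new_weight = np.array([86.21, 84.71, 84.28, 85.37, 83.96])

# The paired t-test assumes the differences (not each column) are roughly normal.
differences = new_weight - old_weight

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution.
shapiro_stat, shapiro_p = stats.shapiro(differences)
print(f"Shapiro-Wilk statistic: {shapiro_stat:.3f}, p-value: {shapiro_p:.3f}")
```

On the full dataset, the same call would be applied to `nutrition_data['new_weight(kg)'] - nutrition_data['old_weight(kg)']`. A p-value above 0.05 would give no reason to doubt normality of the differences.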

### Find the t-test statistic and p-value

In [4]:
from scipy.stats import ttest_rel

test_stat, p_value = ttest_rel(nutrition_data['new_weight(kg)'], nutrition_data['old_weight(kg)'], alternative = 'less')
print('The p-value is : {}, the t-test statistic is : {}'.format(str(p_value),str(test_stat)))

The p-value is : 9.650620343805655e-21, the t-test statistic is : -85.5769483670576


The p-value is far smaller than the significance level of 0.05, so we reject the null hypothesis at the 5% significance level.
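
To make clear what `ttest_rel` computes, the statistic can be reproduced by hand as a one-sample t-test on the differences. This is a minimal sketch using only the five rows displayed by `nutrition_data.head(5)`, so its numbers will differ from the full 15-client result above. The variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

# First five rows shown by nutrition_data.head(5); illustration only.
old_weight = np.array([112.98, 115.76, 115.32, 113.89, 115.01])
new_weight = np.array([86.21, 84.71, 84.28, 85.37, 83.96])

# Paired test = one-sample t-test on the differences (new - old).
d = new_weight - old_weight
n = d.size
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))

# alternative='less' uses the lower tail of the t distribution with n - 1 df.
p_manual = stats.t.cdf(t_manual, df=n - 1)

# Same computation through SciPy, with the same argument order as the cell above.
t_scipy, p_scipy = stats.ttest_rel(new_weight, old_weight, alternative='less')

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.3e}")
print(f"scipy : t = {t_scipy:.4f}, p = {p_scipy:.3e}")
```

The argument order matters for a one-sided test: passing `new` first and `old` second tests whether the mean of `new - old` is below zero, which matches $H_a$.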

### **Conclusion**

- Practically speaking, an average weight loss of 29.74 kg is a large observed difference, which supports the nutrition app adopting the new diet program.
- In addition, based on the test result, we have enough statistical evidence to say that the clients' mean weight decreased after the new diet program.
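
A confidence interval for the mean weight loss complements the p-value by showing the plausible size of the effect. The sketch below is an addition, not part of the original analysis. It builds a 95% t-interval from the five rows displayed by `nutrition_data.head(5)`, so the interval is only illustrative.

```python
import numpy as np
from scipy import stats

# First five rows shown by nutrition_data.head(5); illustration only.
old_weight = np.array([112.98, 115.76, 115.32, 113.89, 115.01])
new_weight = np.array([86.21, 84.71, 84.28, 85.37, 83.96])

# Positive values mean weight was lost.
loss = old_weight - new_weight
n = loss.size
mean_loss = loss.mean()
std_err = stats.sem(loss)  # standard error of the mean (ddof=1 by default)

# Two-sided 95% confidence interval based on the t distribution with n - 1 df.
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=mean_loss, scale=std_err)
print(f"mean loss = {mean_loss:.2f} kg, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An interval that lies entirely above zero agrees with the rejection of the null hypothesis, and its width indicates how precisely the mean loss is estimated.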


**Business Problem 2**

*Suppose we are interested in investing in the New York Stock Exchange (NYSE). We randomly selected 30 companies and recorded their stock prices in Apr 2023 and in May 2023. Before investing, we want to know whether there is a significant change in market prices between these two months, so that we can decide whether to go ahead.*

*Assuming the stock prices are normally distributed, do we have enough statistical evidence to say that market prices changed between the two months at a 0.05 significance level?*

**Formulate null hypothesis and alternate hypothesis.**

- H<sub>0</sub>: μ<sub>May</sub> - μ<sub>Apr</sub> = 0 - The mean difference between the May and April prices is equal to zero.
- H<sub>a</sub>: μ<sub>May</sub> - μ<sub>Apr</sub> != 0 - The mean difference between the May and April prices is not equal to zero.

In [5]:
nyse_data = pd.read_csv('NYSE_stock_prices.csv')
nyse_data.drop(columns = ['Unnamed: 0'], inplace=True)
nyse_data.head(3)

Unnamed: 0,symbol,stock_price_april_2023,stock_price_may_2023
0,GOEV,240.57,611.22
1,CSCO,165.76,654.08
2,MARA,174.39,598.98


In [6]:
diff = np.mean(nyse_data['stock_price_may_2023'] - nyse_data['stock_price_april_2023'])
print('The mean of the differences between the stock prices in Apr 2023 and May 2023 :', round(diff, 2))

The mean of the differences between the stock prices in Apr 2023 and May 2023 : 393.39


### Are the paired t-test assumptions satisfied?

- Continuous data - Yes, the stock price is measured on a continuous scale.
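- Normally distributed populations - Yes, we are told to assume that the prices are normal. Strictly, the paired t-test needs the May-minus-April differences to be approximately normal.
- Independent observations - As the companies are sampled randomly, the pairs are independent of one another. The two prices within a pair are dependent by design.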
- Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.

In [7]:
from scipy.stats import ttest_rel

test_stat, p_value = ttest_rel(nyse_data['stock_price_april_2023'],nyse_data['stock_price_may_2023'] , alternative = 'two-sided')

result= f'the test statistic : {test_stat}, the p-value : { p_value}'
print(result)

the test statistic : -36.405982942945826, the p-value : 9.243844049014443e-26


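Since the p-value is much smaller than 0.05 (alpha), we reject the null hypothesis and conclude that the mean difference between the April and May prices is not equal to zero, i.e. there is a statistically significant difference between them. The test statistic is negative only because April was passed as the first argument, so the differences are computed as April minus May.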
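
If the interest is specifically in an increase and not just any change, a one-sided version of the same test can be used. The sketch below is an addition to the original analysis. It uses only the three rows displayed by `nyse_data.head(3)`, and passes May first so that a positive statistic corresponds to a price increase.

```python
import numpy as np
from scipy import stats

# First three rows shown by nyse_data.head(3); illustration only.
april = np.array([240.57, 165.76, 174.39])
may = np.array([611.22, 654.08, 598.98])

# Two-sided test: H_a is that the mean of (may - april) differs from zero.
t_stat, p_two_sided = stats.ttest_rel(may, april, alternative='two-sided')

# One-sided test: H_a is that the mean of (may - april) is greater than zero.
_, p_greater = stats.ttest_rel(may, april, alternative='greater')

print(f"t = {t_stat:.3f}")
print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_greater:.4f}")
```

When the statistic points in the direction of the alternative, the one-sided p-value is half of the two-sided one, so a two-sided rejection with a positive mean difference also implies a one-sided rejection at the same level.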

### **Conclusion**

- At the 5% significance level, the mean stock price differs between Apr 2023 and May 2023.
- In addition, the observed mean difference is positive (393.39), so the evidence points to an increase in stock prices from Apr 2023 to May 2023.