I. T-Test

1. Objective
Compare the mean closing price (close) between two groups:  
Group 1: Closing prices before 2022-01-01.  
Group 2: Closing prices on or after 2022-01-01.  

2.  Hypotheses  
Null hypothesis (H0): There is no difference in the mean closing price between the two groups.   
Alternative hypothesis (H1): There is a significant difference in the mean closing price between the two groups.

3. Prepare data


In [1]:
import pandas as pd
from scipy.stats import ttest_ind

# Load the data
file_path = 'E:/Code/STAT3013.-P12_Nhom4/Dataset/Gold_data_filtered.xlsx'
df = pd.read_excel(file_path)

# Convert the 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])

# Split the data into two groups: before 2022-01-01, and on or after 2022-01-01
group1 = df[df['date'] < '2022-01-01']['close']  # Before 2022-01-01
group2 = df[df['date'] >= '2022-01-01']['close']  # On or after 2022-01-01



4. Perform T-Test

In [2]:
# Run the t-test
t_stat, p_value = ttest_ind(group1, group2, equal_var=False)  # equal_var=False runs Welch's t-test, which does not assume equal variances

# Print the results
print("T-statistic:", t_stat)
print("P-value:", p_value)


T-statistic: -25.616514009947352
P-value: 4.444073906586767e-118
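
To make the statistic less of a black box, here is a minimal sketch of what Welch's t-test computes when `equal_var=False`: the difference in sample means divided by the combined standard error, plus the Welch-Satterthwaite degrees of freedom. The helper name `welch_t` and the two small samples are hypothetical and chosen only for illustration; they are not taken from the gold dataset, and the sketch uses only the standard library rather than scipy.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variance (n - 1 denominator) of each group, divided by its size
    var_a = statistics.variance(sample_a) / n_a
    var_b = statistics.variance(sample_b) / n_b
    # Difference in means over the combined standard error
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(var_a + var_b)
    # Approximate degrees of freedom for unequal variances
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, dof


# Made-up numbers for illustration only (not from the gold dataset)
before = [1.0, 2.0, 3.0, 4.0, 5.0]
after = [2.0, 4.0, 6.0, 8.0, 10.0]

t_value, dof_value = welch_t(before, after)
print("t:", t_value)
print("dof:", dof_value)
```

The sign of the statistic follows the argument order: a negative value means the first sample has the smaller mean.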


5. Conclusion


p = 4.44e-118 < 0.05, so we reject H0: the difference in mean closing price between the two groups is statistically significant.

The negative t-statistic shows that the mean closing price before 2022 was lower than the mean from 2022 onward. A difference this large is very unlikely to arise by chance alone, which points to a real shift in the gold price level between the two periods.

II. Chi-square test

1. Objective  
Examine the relationship between two categorical variables:  
close_category: High if the closing price is at or above the mean closing price, otherwise Low.  
volume_category: High if the trading volume is at or above the mean trading volume, otherwise Low.

2. Hypotheses  
Null hypothesis (H0): There is no relationship between closing price category and trading volume category.  
Alternative hypothesis (H1): There is a significant relationship between closing price category and trading volume category.

In [3]:
# Classify closing price (close) and trading volume (volume) as "High" or "Low"
mean_close = df['close'].mean()
mean_volume = df['volume'].mean()

df['close_category'] = ['High' if close >= mean_close else 'Low' for close in df['close']]
df['volume_category'] = ['High' if volume >= mean_volume else 'Low' for volume in df['volume']]

In [4]:
# Build the contingency table (cross-tabulation)
contingency_table = pd.crosstab(df['close_category'], df['volume_category'])

# Display the contingency table
print(contingency_table)


volume_category  High  Low
close_category            
High               34  714
Low                33  558


In [5]:
from scipy.stats import chi2_contingency

# Run the chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

# Print the results
print("Chi-square statistic:", chi2_stat)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
print("Expected frequencies:")
print(expected)


Chi-square statistic: 0.5462764778893154
P-value: 0.4598431272282141
Degrees of freedom: 1
Expected frequencies:
[[ 37.42793129 710.57206871]
 [ 29.57206871 561.42793129]]
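
As a cross-check, the expected frequencies and the statistic can be reproduced by hand from the observed counts printed above. This is a sketch using only the standard library. It assumes that `chi2_contingency` applied Yates' continuity correction, which is its default for a 2x2 table, and the variable names (`observed`, `expected_manual`, `chi2_manual`, `p_manual`) are introduced here for illustration.

```python
import math

# Observed counts copied from the contingency table printed above
observed = [[34, 714],
            [33, 558]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected_manual = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic with Yates' continuity correction (subtract 0.5 from |O - E|)
chi2_manual = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected_manual)
    for o, e in zip(obs_row, exp_row)
)

# With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2))
p_manual = math.erfc(math.sqrt(chi2_manual / 2))

print("Expected:", expected_manual)
print("Chi-square:", chi2_manual)
print("P-value:", p_manual)
```

The observed counts sit within about 3.4 of the expected counts in every cell, which is why the statistic is so small.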


Conclusion

P-value = 0.46 > 0.05, so we fail to reject H0: there is no statistical evidence of a relationship between close_category and volume_category. In this dataset, whether the closing price is high or low does not appear to be related to whether the trading volume is high or low.

III. Build a confidence interval

In [6]:
import numpy as np
import pandas as pd
from scipy.stats import norm


# Select the closing price column (close)
data = df['close']


In [7]:
# Compute the mean, standard deviation, and sample size
mean = np.mean(data)
std_dev = np.std(data, ddof=1)  # ddof=1 gives the sample standard deviation
n = len(data)


In [8]:
# Z value for a 95% confidence level (about 1.96)
z = norm.ppf(0.975)  # norm.ppf(0.975) is about 1.96 for the standard normal distribution

# Compute the confidence interval
margin_of_error = z * (std_dev / np.sqrt(n))
confidence_interval = (mean - margin_of_error, mean + margin_of_error)

# Print the results
print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Sample Size:", n)
print("Confidence Interval:", confidence_interval)


Mean: 1795.5178466067553
Standard Deviation: 228.52761006277385
Sample Size: 1339
Confidence Interval: (np.float64(1783.2774220827976), np.float64(1807.758271130713))
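
The interval can be re-derived from the printed summary statistics alone, which makes the formula explicit: mean plus or minus z times the standard error, where the standard error is the sample standard deviation divided by the square root of n. The sketch below uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`, and the variable names are introduced here for illustration.

```python
import math
from statistics import NormalDist

# Summary statistics copied from the printed output above
sample_mean = 1795.5178466067553
sample_std = 228.52761006277385
sample_size = 1339

confidence = 0.95
# Two-sided critical value: the (1 + confidence) / 2 quantile of the standard normal
z_crit = NormalDist().inv_cdf((1 + confidence) / 2)

# Standard error of the mean, then the margin of error
standard_error = sample_std / math.sqrt(sample_size)
margin = z_crit * standard_error

lower, upper = sample_mean - margin, sample_mean + margin
print("z:", z_crit)
print("Standard error:", standard_error)
print("Confidence interval:", (lower, upper))
```

Changing `confidence` (for example to 0.99) widens or narrows the interval without touching the rest of the calculation.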


Conclusion

The 95% confidence interval for the mean closing price is [1783.277, 1807.758].