# Hypothesis Testing with Two Samples

In [1]:
import numpy as np
import pandas as pd
import scipy.stats
from IPython.display import display, Markdown
from statsmodels.stats.weightstats import DescrStatsW, CompareMeans

df = pd.read_csv('../data/wine.csv')


## a. The 'fixed acidity' column is divided evenly into two parts: the first half and the second half. Is it true that the means of these two parts are the same?

$Let:$\
$\mu_1:$ the mean of the population from which the first sample is drawn\
$\mu_2:$ the mean of the population from which the second sample is drawn\
$\sigma_1^2:$ the variance of the population from which the first sample is drawn\
$\sigma_2^2:$ the variance of the population from which the second sample is drawn\
$\bar x_1:$ the mean of the first sample\
$\bar x_2:$ the mean of the second sample\
$n_1:$ the size of the first sample\
$n_2:$ the size of the second sample\
$\alpha:$ the level of significance\
$d_0:$ the hypothesized difference between the two population means, $\mu_1 - \mu_2$

1. $H_0: \mu_1 - \mu_2 = 0$
2. $H_1: \mu_1 - \mu_2 \neq 0$
3. $\alpha = 0.05$
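4. We shall use a two-tailed z-test on two means. The population variances are not actually known, so the computation below substitutes the sample variances for $\sigma_1^2$ and $\sigma_2^2$, which is a reasonable approximation for large samples.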
   $$z = \frac{(\bar x_1 - \bar x_2) - d_0} {\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
   with
   $$d_0 = \mu_1 - \mu_2 = 0$$

   Since $H_1: \mu_1 - \mu_2 \neq 0$, the critical region of the test is
   $$z < -z_{\alpha/2} \; or \; z > z_{\alpha/2}$$
   $$z < -z_{0.025} \; or \; z > z_{0.025}$$
   $$z < -1.96 \; or \; z > 1.96$$
5. Do the computations

In [2]:
# Determine the length of the column
column_length = len(df['fixed acidity'])

# Divide the column into two equal-sized parts
first_half = df['fixed acidity'][:column_length // 2]
second_half = df['fixed acidity'][column_length // 2:]

# Create two DescrStatsW instances
d1 = DescrStatsW(first_half)
d2 = DescrStatsW(second_half)

# Create a CompareMeans instance from d1 and d2
compare = CompareMeans(d1, d2)

# Calculate the z-statistic and p-value
z, p = compare.ztest_ind(alternative='two-sided', usevar='unequal', value=0)

# Determine whether to reject or fail to reject the null hypothesis (H0)
critical_value = 1.96
if abs(z) > critical_value:
    decision = 'We reject the null hypothesis ($H_0$).'
else:
    decision = 'We fail to reject the null hypothesis ($H_0$).'

display(Markdown(
    f'Result:\n'
    f'$$z = {z}$$'
    f'$$p = {p}$$'
    f'6. {decision}'
    )
)

Result:
$$z = 0.02604106999908715$$$$p = 0.9792245804253911$$6. We fail to reject the null hypothesis ($H_0$).
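
As a cross-check, the statistic can be reproduced directly from the formula in step 4. The following is a minimal sketch using only the Python standard library. The helper name `two_sample_z` and the two short data lists are hypothetical and are not taken from `wine.csv`. It assumes the sample variances (with $n - 1$ in the denominator) stand in for $\sigma_1^2$ and $\sigma_2^2$, which should correspond to what `usevar='unequal'` does.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z(sample_1, sample_2, d0=0.0):
    """Return (z, two-sided p-value) for H0: mu_1 - mu_2 = d0."""
    n1, n2 = len(sample_1), len(sample_2)
    # statistics.variance uses the n - 1 denominator (sample variance)
    standard_error = sqrt(variance(sample_1) / n1 + variance(sample_2) / n2)
    z = (mean(sample_1) - mean(sample_2) - d0) / standard_error
    # Two-sided p-value: probability of a statistic at least as extreme as |z|
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical data, for illustration only
a = [7.4, 7.8, 7.8, 11.2, 7.4, 7.9]
b = [7.3, 7.8, 7.5, 6.7, 7.5, 5.6]
z, p = two_sample_z(a, b)
print(f'z = {z:.4f}, p = {p:.4f}')
```

Swapping the two samples flips the sign of $z$ but leaves the two-sided p-value unchanged.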


## b. The 'chlorides' column is divided evenly into two parts: the first half and the second half. Is it true that the mean of the first half is greater than the mean of the second half by 0.001?

$Let:$\
$\mu_1:$ the mean of the population from which the first sample is drawn\
$\mu_2:$ the mean of the population from which the second sample is drawn\
$\sigma_1^2:$ the variance of the population from which the first sample is drawn\
$\sigma_2^2:$ the variance of the population from which the second sample is drawn\
$\bar x_1:$ the mean of the first sample\
$\bar x_2:$ the mean of the second sample\
$n_1:$ the size of the first sample\
$n_2:$ the size of the second sample\
$\alpha:$ the level of significance\
$d_0:$ the hypothesized difference between the two population means, $\mu_1 - \mu_2$

1. $H_0: \mu_1 - \mu_2 = 0.001$
2. $H_1: \mu_1 - \mu_2 \neq 0.001$
3. $\alpha = 0.05$
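4. We shall use a two-tailed z-test on two means. The population variances are not actually known, so the computation below substitutes the sample variances for $\sigma_1^2$ and $\sigma_2^2$, which is a reasonable approximation for large samples.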
   $$z = \frac{(\bar x_1 - \bar x_2) - d_0} {\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
   with
   $$d_0 = \mu_1 - \mu_2 = 0.001$$

   Since $H_1: \mu_1 - \mu_2 \neq 0.001$, the critical region of the test is
   $$z < -z_{\alpha/2} \; or \; z > z_{\alpha/2}$$
   $$z < -z_{0.025} \; or \; z > z_{0.025}$$
   $$z < -1.96 \; or \; z > 1.96$$
5. Do the computations

In [3]:
# Determine the length of the column
column_length = len(df['chlorides'])

# Divide the column into two equal-sized parts
first_half = df['chlorides'][:column_length // 2]
second_half = df['chlorides'][column_length // 2:]

# Create two DescrStatsW instances
d1 = DescrStatsW(first_half)
d2 = DescrStatsW(second_half)

# Create a CompareMeans instance from d1 and d2
compare = CompareMeans(d1, d2)

# Calculate the z-statistic and p-value
z, p = compare.ztest_ind(alternative='two-sided', usevar='unequal', value=0.001)

# Determine whether to reject or fail to reject the null hypothesis (H0)
critical_value = 1.96
if abs(z) > critical_value:
    decision = 'We reject the null hypothesis ($H_0$).'
else:
    decision = 'We fail to reject the null hypothesis ($H_0$).'

display(Markdown(
    f'Result:\n'
    f'$$z = {z}$$'
    f'$$p = {p}$$'
    f'6. {decision}'
    )
)

Result:
$$z = -0.4673171228521429$$$$p = 0.6402730075810992$$6. We fail to reject the null hypothesis ($H_0$).
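
The hard-coded critical value 1.96 and the reported p-value are two views of the same decision rule: for a two-tailed test, $|z| > z_{\alpha/2}$ holds exactly when $p < \alpha$. The sketch below illustrates this with the standard library. The helper `decide` is hypothetical, and the value 2.5 is an arbitrary z-statistic added only to show a rejection.

```python
from statistics import NormalDist

alpha = 0.05
# Upper alpha/2 quantile of the standard normal distribution (about 1.96)
critical_value = NormalDist().inv_cdf(1 - alpha / 2)


def decide(z, alpha=0.05):
    """Apply both decision rules to a z-statistic and return them as a pair."""
    threshold = NormalDist().inv_cdf(1 - alpha / 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    reject_by_region = abs(z) > threshold
    reject_by_p_value = p < alpha
    return reject_by_region, reject_by_p_value


# Rounded z-statistics from parts a and b, plus one arbitrary value
for z_value in (0.0260, -0.4673, 2.5):
    print(z_value, decide(z_value))
```

Computing the critical value from `alpha` instead of typing 1.96 keeps the test consistent if the significance level is changed later.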


## c. Is it true that the mean of the first 25 rows of the 'volatile acidity' column is equal to the mean of the first 25 rows of the 'sulphates' column?

$Let:$\
$\mu_1:$ the mean of the population from which the first sample is drawn\
$\mu_2:$ the mean of the population from which the second sample is drawn\
$\sigma_1^2:$ the variance of the population from which the first sample is drawn\
$\sigma_2^2:$ the variance of the population from which the second sample is drawn\
$\bar x_1:$ the mean of the first sample\
$\bar x_2:$ the mean of the second sample\
$n_1:$ the size of the first sample\
$n_2:$ the size of the second sample\
$\alpha:$ the level of significance\
$d_0:$ the hypothesized difference between the two population means, $\mu_1 - \mu_2$

1. $H_0: \mu_1 - \mu_2 = 0$
2. $H_1: \mu_1 - \mu_2 \neq 0$
3. $\alpha = 0.05$
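4. We shall use a two-tailed z-test on two means. The population variances are not actually known, so the computation below substitutes the sample variances for $\sigma_1^2$ and $\sigma_2^2$; with only 25 observations per sample this is a rough approximation.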
   $$z = \frac{(\bar x_1 - \bar x_2) - d_0} {\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
   with
   $$d_0 = \mu_1 - \mu_2 = 0$$

   Since $H_1: \mu_1 - \mu_2 \neq 0$, the critical region of the test is
   $$z < -z_{\alpha/2} \; or \; z > z_{\alpha/2}$$
   $$z < -z_{0.025} \; or \; z > z_{0.025}$$
   $$z < -1.96 \; or \; z > 1.96$$
5. Do the computations

In [4]:
# Get the samples
volatile_acidity = df['volatile acidity'][:25]
sulphates = df['sulphates'][:25]

# Create two DescrStatsW instances
d1 = DescrStatsW(volatile_acidity)
d2 = DescrStatsW(sulphates)

# Create a CompareMeans instance from d1 and d2
compare = CompareMeans(d1, d2)

# Calculate the z-statistic and p-value
z, p = compare.ztest_ind(alternative='two-sided', usevar='unequal', value=0)

# Determine whether to reject or fail to reject the null hypothesis (H0)
critical_value = 1.96
if abs(z) > critical_value:
    decision = 'We reject the null hypothesis ($H_0$).'
else:
    decision = 'We fail to reject the null hypothesis ($H_0$).'

display(Markdown(
    f'Result:\n'
    f'$$z = {z}$$'
    f'$$p = {p}$$'
    f'6. {decision}'
    )
)

Result:
$$z = -2.6374821676748703$$$$p = 0.008352401685453743$$6. We reject the null hypothesis ($H_0$).
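
With 25 observations per sample, the normal reference distribution is less well justified than in parts a and b. One common alternative, which is not part of the analysis above, is Welch's t-test, available as `scipy.stats.ttest_ind(..., equal_var=False)`. The sketch below uses synthetic data rather than the wine columns; the seed and the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import scipy.stats

# Synthetic stand-ins for two samples of 25 rows each (not the wine data)
rng = np.random.default_rng(0)
sample_1 = rng.normal(loc=0.50, scale=0.15, size=25)
sample_2 = rng.normal(loc=0.65, scale=0.20, size=25)

# Welch's t-test: does not assume equal population variances
result = scipy.stats.ttest_ind(sample_1, sample_2, equal_var=False)

# The same statistic computed from the formula in step 4,
# with sample variances in place of the population variances
standard_error = np.sqrt(sample_1.var(ddof=1) / 25 + sample_2.var(ddof=1) / 25)
t_manual = (sample_1.mean() - sample_2.mean()) / standard_error

print(f't = {result.statistic:.4f}, p = {result.pvalue:.4f}')
```

The statistic has the same form as $z$; only the reference distribution changes, from the standard normal to a t-distribution with Welch-Satterthwaite degrees of freedom.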


## d. The 'residual sugar' column is divided evenly into two parts: the first half and the second half. Is it true that the variance of the first half is equal to the variance of the second half?

$Let:$\
$\sigma_1^2:$ the variance of the population from which the first sample is drawn\
$\sigma_2^2:$ the variance of the population from which the second sample is drawn\
$s_1^2:$ the variance of the first sample\
$s_2^2:$ the variance of the second sample\
$n_1:$ the size of the first sample\
$n_2:$ the size of the second sample\
$\alpha:$ the level of significance\
$v_1, v_2:$ the degrees of freedom of the F-distribution

1. $H_0: \sigma_1^2 = \sigma_2^2$
2. $H_1: \sigma_1^2 \neq \sigma_2^2$
3. $\alpha = 0.05$
4. We shall use a two-tailed hypothesis test concerning variances.
   $$f = \frac{s_1^2}{s_2^2}$$

   Since $H_1: \sigma_1^2 \neq \sigma_2^2$, the critical region of the test is
   $$f < f_{1-\alpha/2}(v_1,v_2) \; or \; f > f_{\alpha/2}(v_1,v_2)$$
   with $v_1 = n_1 - 1$ and $v_2 = n_2 - 1$

In [5]:
# Determine the length of the column
column_length = len(df['residual sugar'])

# Divide the column into two equal-sized parts
first_half = df['residual sugar'][:column_length // 2]
second_half = df['residual sugar'][column_length // 2:]

# Determine the degrees of freedom
dfn = len(first_half) - 1
dfd = len(second_half) - 1

# Define significance level
alpha = 0.05

# Calculate critical values of F-distribution
lower_critical_value = scipy.stats.f.ppf(alpha / 2, dfn, dfd)
upper_critical_value = scipy.stats.f.ppf(1 - (alpha / 2), dfn, dfd)

display(Markdown(
    f'Therefore, the critical region is\n'
    rf'$$f < {lower_critical_value} \; or \; f > {upper_critical_value}$$'
    )
)

Therefore, the critical region is
$$f < 0.8388857772763105 \; or \; f > 1.1920574017201653$$

5. Do the computations

In [6]:
# Determine the length of the column
column_length = len(df['residual sugar'])

# Divide the column into two equal-sized parts
first_half = df['residual sugar'][:column_length // 2]
second_half = df['residual sugar'][column_length // 2:]

# Determine the degrees of freedom
dfn = len(first_half) - 1
dfd = len(second_half) - 1

# Compute f
f = first_half.var() / second_half.var()

# Calculate critical values of F-distribution
lower_critical_value = scipy.stats.f.ppf(alpha / 2, dfn, dfd)
upper_critical_value = scipy.stats.f.ppf(1 - (alpha / 2), dfn, dfd)

# Calculate the two-sided p-value: twice the smaller tail probability of the F-distribution
cdf_value = scipy.stats.f.cdf(f, dfn, dfd)
p_value = 2 * min(cdf_value, 1 - cdf_value)

# Determine whether to reject or fail to reject the null hypothesis (H0)
if f < lower_critical_value or f > upper_critical_value:
    decision = 'We reject the null hypothesis ($H_0$).'
else:
    decision = 'We fail to reject the null hypothesis ($H_0$).'

display(Markdown(
    f'Result:\n'
    f'$$f = {f}$$'
    f'$$p = {p_value:.4f}$$'
    f'6. {decision}'
    )
)

Result:
$$f = 0.9420041066941615$$$$p = 0.5048$$6. We fail to reject the null hypothesis ($H_0$).
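
The two critical values computed for this test multiply to roughly 1 (0.8389 × 1.1921 ≈ 1). This is consistent with the identity $f_{1-\alpha/2}(v_1, v_2) = 1 / f_{\alpha/2}(v_2, v_1)$ when both halves have the same degrees of freedom. The sketch below checks the identity numerically; the degrees of freedom 12 and 20 are hypothetical and were chosen to be unequal so that the swap is visible.

```python
import scipy.stats

alpha = 0.05
v1, v2 = 12, 20  # hypothetical degrees of freedom, unequal on purpose

# Lower and upper critical values of the F-distribution with (v1, v2)
lower = scipy.stats.f.ppf(alpha / 2, v1, v2)
upper = scipy.stats.f.ppf(1 - alpha / 2, v1, v2)

# Upper critical value with the degrees of freedom swapped
upper_swapped = scipy.stats.f.ppf(1 - alpha / 2, v2, v1)

# The lower critical value equals the reciprocal of the swapped upper one
print(lower, 1 / upper_swapped)
```

The identity is why printed F tables only list upper-tail values: the lower critical value can always be recovered from them.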


## e. Is the proportion of values greater than 7 higher in the first half compared to the second half of the 'alcohol' column?

$Let:$\
$p_1:$ the true proportion of values greater than 7 in the first half of the 'alcohol' column\
$p_2:$ the true proportion of values greater than 7 in the second half of the 'alcohol' column\
$n_1:$ the size of the first sample\
$n_2:$ the size of the second sample

1. $H_0: p_1 = p_2$
2. $H_1: p_1 > p_2$
3. $\alpha = 0.05$
4. We shall use a one-tailed hypothesis test concerning two proportions.
   $$z = \frac{\hat{p}_1 - \hat{p}_2} {\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}$$
   $$\hat{p}_1 = \frac{x_1} {n_1}$$
   $$\hat{p}_2 = \frac{x_2} {n_2}$$
   $$\hat{p} = \frac{x_1 + x_2} {n_1 + n_2}$$
   $$\hat{q} = 1 - \hat{p}$$

   Since $H_1: p_1 > p_2$, the critical region of the test is
   $$z > z_{\alpha}$$
   $$z > z_{0.05}$$

In [7]:
# Desired level of significance
alpha = 0.05

# Compute the critical value
critical_value = scipy.stats.norm.ppf(1 - alpha)

display(Markdown(
    f'Therefore, the critical region is\n'
    f'$$z > {critical_value}$$'
    )
)

Therefore, the critical region is
$$z > 1.6448536269514722$$

5. Do the computations

In [8]:
# Determine the length of the column
column_length = len(df['alcohol'])

# Divide the column into two parts
first_half = df['alcohol'][:column_length // 2]
second_half = df['alcohol'][column_length // 2:]

# Determine the size of each half (they differ by one if the column length is odd)
n1 = len(first_half)
n2 = len(second_half)

# Get the rows with the values greater than 7 from the first and the second half
gts_first_half = first_half[first_half > 7]
gts_second_half = second_half[second_half > 7]

# Get the number of rows greater than 7 from the first and the second half
x1 = len(gts_first_half)
x2 = len(gts_second_half)

# Compute z
p_hat = (x1 + x2)/(n1 + n2)
q_hat = 1 - p_hat
z = (x1/n1 - x2/n2)/(np.sqrt(p_hat * q_hat * (1/n1 + 1/n2)))

# Calculate the p-value using the cumulative distribution function (CDF) of the standard normal distribution
p_value = 1 - scipy.stats.norm.cdf(z)

# Desired level of significance
alpha = 0.05

# Compute the critical value
critical_value = scipy.stats.norm.ppf(1 - alpha)

if z > critical_value:
    decision = 'We reject the null hypothesis ($H_0$).'
else:
    decision = 'We fail to reject the null hypothesis ($H_0$).'

display(Markdown(
    f'Result:\n'
    f'$$z = {z}$$'
    f'$$p = {p_value}$$'
    f'6. {decision}'
    )
)

Result:
$$z = 0.0$$$$p = 0.5$$6. We fail to reject the null hypothesis ($H_0$).
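
The pooled two-proportion statistic from step 4 can be wrapped in a small function that works from counts alone. This is a minimal sketch using only the standard library. The name `two_proportion_z` and the counts in the usage example are hypothetical. It also guards against the degenerate case where the pooled proportion is 0 or 1, in which the denominator of $z$ is zero.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Return (z, upper-tail p-value) for H0: p1 = p2 against H1: p1 > p2."""
    p_hat = (x1 + x2) / (n1 + n2)  # pooled proportion
    q_hat = 1 - p_hat
    standard_error = sqrt(p_hat * q_hat * (1 / n1 + 1 / n2))
    if standard_error == 0:
        # Every value (or no value) exceeds the threshold, so z is undefined
        raise ValueError('pooled proportion is 0 or 1; z is undefined')
    z = (x1 / n1 - x2 / n2) / standard_error
    # One-tailed test: only large positive z counts as evidence for H1
    return z, 1 - NormalDist().cdf(z)


# Hypothetical counts, for illustration only
z, p = two_proportion_z(x1=60, n1=100, x2=45, n2=100)
print(f'z = {z:.4f}, p = {p:.4f}')
```

When both halves contain the same share of qualifying values, the numerator is zero, so $z = 0$ and the upper-tail p-value is 0.5, which is the pattern seen in the result above.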