In [1]:
import numpy as np
import matplotlib.pyplot as plt

# 1. One-sample tests

In [2]:
from scipy.stats import norm # Used to compute the CDF and quantiles of the Gaussian distribution
from scipy.stats import t # Used to compute the CDF and quantiles of the Student distribution

## 1.1 $Z$-test for the mean

A network of $n=10$ sensors measures the concentration of ozone at various locations in Paris. Due to local variability and measurement noise, on a given day, the ozone concentration recorded by the $i$-th sensor is represented by a random variable $X_i = \mu + \varepsilon_i$, where $\mu$ is the mean ozone concentration on this day and the variables $(\varepsilon_i)_{1 \leq i \leq n}$ are independent $\mathcal{N}(0,\sigma^2)$ variables, with known standard deviation $\sigma = 10~\mathrm{g}/\mathrm{m}^3$. In the block below, we give the measurements corresponding to three different days.



In [None]:
ozone_day_1 = np.array([24.5, 48.0, 32.0, 45.8, 32.1, 35.7, 41.1, 30.4, 30.8, 27.5])
ozone_day_2 = np.array([48.0, 57.3, 56.1, 43.2, 37.9,  39.9, 40.3,  28.7, 45.7, 57.2])
ozone_day_3 = np.array([56.2, 54.1, 53.4, 46.2, 53.5, 48.0, 57.5, 65.2, 49.3, 58.5])

print("Empirical mean for day 1:", np.mean(ozone_day_1))
print("Empirical mean for day 2:", np.mean(ozone_day_2))
print("Empirical mean for day 3:", np.mean(ozone_day_3))

The limit value of the mean ozone concentration set by the WHO is $\mu_0 = 40~\mathrm{g}/\mathrm{m}^3$. At the significance level $5\%$, for which days was the mean ozone concentration above this threshold? You may also indicate the $p$-value for each day.
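Remark: one possible way to organise this computation is sketched below, assuming the one-sided test of $H_0 = \{\mu \leq \mu_0\}$ against $H_1 = \{\mu > \mu_0\}$ with statistic $Z = \sqrt{n}(\overline{X}_n - \mu_0)/\sigma$, which is $\mathcal{N}(0,1)$ when $\mu = \mu_0$. The helper name `z_test_greater` and the dictionary `ozone` are introduced here for illustration only and are not part of the original notebook.

```python
import numpy as np
from scipy.stats import norm

# Same measurements as in the notebook, grouped in a dictionary for convenience
ozone = {
    "day 1": np.array([24.5, 48.0, 32.0, 45.8, 32.1, 35.7, 41.1, 30.4, 30.8, 27.5]),
    "day 2": np.array([48.0, 57.3, 56.1, 43.2, 37.9, 39.9, 40.3, 28.7, 45.7, 57.2]),
    "day 3": np.array([56.2, 54.1, 53.4, 46.2, 53.5, 48.0, 57.5, 65.2, 49.3, 58.5]),
}
sigma = 10    # known standard deviation of the measurements
mu_0 = 40     # limit value to be tested
alpha = 0.05  # significance level


def z_test_greater(x, mu_0, sigma):
    """One-sided Z-test of H_0: mu <= mu_0 against H_1: mu > mu_0 (illustrative helper)."""
    z = np.sqrt(len(x)) * (np.mean(x) - mu_0) / sigma
    p_value = norm.sf(z)  # P(N(0,1) > z), i.e. 1 - CDF(z)
    return z, p_value


z_results = {day: z_test_greater(x, mu_0, sigma) for day, x in ozone.items()}
for day, (z, p_value) in z_results.items():
    decision = "reject H_0" if p_value < alpha else "do not reject H_0"
    print(f"{day}: z = {z:.3f}, p-value = {p_value:.4f} -> {decision}")
```

Comparing the $p$-value with $\alpha$ is equivalent to comparing $Z$ with the quantile `norm.ppf(1 - alpha)`.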

In [16]:
sigma = 10 # Known standard deviation of the measurements
mu_0 = 40 # Limit value to be tested

In [None]:
# Your code here

## 1.2 $\mathrm{t}$-test for the mean

With the same data as above, we no longer assume that $\sigma$ is known. How should the test be adapted?

Remark: to compute $\sqrt{S^2_n}$ for a sample `x`, you can use `np.std(x, ddof=1)`.
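Remark: a minimal sketch of one way to adapt the test, assuming the same one-sided alternative $H_1 = \{\mu > \mu_0\}$ as before. The unknown $\sigma$ is replaced by $\sqrt{S^2_n}$, and the Gaussian distribution is replaced by the Student distribution with $n-1$ degrees of freedom. The helper name `t_test_greater` is hypothetical.

```python
import numpy as np
from scipy.stats import t

ozone_day_1 = np.array([24.5, 48.0, 32.0, 45.8, 32.1, 35.7, 41.1, 30.4, 30.8, 27.5])
ozone_day_2 = np.array([48.0, 57.3, 56.1, 43.2, 37.9, 39.9, 40.3, 28.7, 45.7, 57.2])
ozone_day_3 = np.array([56.2, 54.1, 53.4, 46.2, 53.5, 48.0, 57.5, 65.2, 49.3, 58.5])
mu_0 = 40     # limit value to be tested
alpha = 0.05  # significance level


def t_test_greater(x, mu_0):
    """One-sided t-test of H_0: mu <= mu_0 against H_1: mu > mu_0 (illustrative helper)."""
    n = len(x)
    s_n = np.std(x, ddof=1)  # square root of the unbiased variance estimator
    t_score = np.sqrt(n) * (np.mean(x) - mu_0) / s_n
    p_value = t.sf(t_score, df=n - 1)  # upper tail of the Student distribution
    return t_score, p_value


for name, x in [("day 1", ozone_day_1), ("day 2", ozone_day_2), ("day 3", ozone_day_3)]:
    t_score, p_value = t_test_greater(x, mu_0)
    decision = "reject H_0" if p_value < alpha else "do not reject H_0"
    print(f"{name}: t = {t_score:.3f}, p-value = {p_value:.4f} -> {decision}")
```

Since the Student quantile `t.ppf(1 - alpha, df=n - 1)` is larger than the Gaussian quantile `norm.ppf(1 - alpha)`, a sample close to the rejection threshold may lead to a different decision than with the $Z$-test.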

In [None]:
print("Standard deviation for day 1:", np.std(ozone_day_1, ddof=1))
print("Standard deviation for day 2:", np.std(ozone_day_2, ddof=1))
print("Standard deviation for day 3:", np.std(ozone_day_3, ddof=1))

In [None]:
# Your code here

What is your conclusion?

Remark: one-sample $\mathrm{t}$-tests can also be performed using the `ttest_1samp()` function of `scipy.stats`, as follows:

In [None]:
from scipy.stats import ttest_1samp

# Application on the sample ozone_day_1
# popmean is the value of µ_0
# alternative='greater' means that H_1 is {µ > µ_0}
t_score, p_value = ttest_1samp(ozone_day_1, popmean=mu_0, alternative='greater')
print("t-score:", t_score)
print("p-value:", p_value)

# 2. Two-sample tests

In this section, we work with the grades of IMI and SEGF students at the 2023 exam.

In [None]:
# Data for grades_imi and grades_segf
grades_imi = np.array([8.5, 10, 10, 10, 11, 11.5, 11.5, 12, 12, 12,
                       12, 12.5, 12.5, 12.5, 12.5, 13, 13, 13, 13, 13,
                       13, 13, 13, 13.5, 14, 14, 14, 14, 14, 14,
                       14.5, 14.5, 14.5, 15, 15, 15.5, 15.5, 15.5, 15.5, 15.5,
                       16, 16, 16, 16, 16, 16, 16.5, 16.5, 16.5, 16.5,
                       18.5, 18.5, 19, 19])

grades_segf = np.array([3.5, 10, 10, 10.5, 10.5, 11, 11, 11, 11.5, 11.5,
                        12.5, 12.5, 13.5, 13.5, 13.5, 14, 14.5, 15, 15.5, 15.5,
                        15.5, 16, 16.5, 17, 17, 17.5])

# Plot histogram of the two samples
plt.figure(figsize=(10, 6))
bins = np.linspace(0, 20, 21)  # 21 edges, i.e. 20 unit-width bins from 0 to 20

plt.hist(grades_imi, bins=bins, alpha=0.7, label='IMI', color='blue', edgecolor='black')
plt.hist(grades_segf, bins=bins, alpha=0.7, label='SEGF', color='orange', edgecolor='black')

# Add labels, legend, and title
plt.xlabel('Grades')
plt.ylabel('Frequency')
plt.title('Histogram of Grades: IMI vs SEGF')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show plot
plt.tight_layout()
plt.show()

# Calculate means and standard deviations
n_imi = len(grades_imi)
n_segf = len(grades_segf)

mean_imi = np.mean(grades_imi)
mean_segf = np.mean(grades_segf)

sd_imi = np.std(grades_imi, ddof=1)
sd_segf = np.std(grades_segf, ddof=1)

# Print results
print("Mean of grades_imi:", mean_imi)
print("Mean of grades_segf:", mean_segf)

print("Standard deviation of grades_imi:", sd_imi)
print("Standard deviation of grades_segf:", sd_segf)

## 2.1 Equality of variance

At the significance level $5\%$, can you reject the hypothesis that the two samples have the same variance?
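Remark: a possible sketch of a two-sided Fisher test for this question, assuming both samples are Gaussian and taking equal variances as the null hypothesis. Under this hypothesis, the ratio of the unbiased variance estimators follows a Fisher distribution with $(n_1 - 1, n_2 - 1)$ degrees of freedom. The helper name `f_test_two_sided` is introduced here for illustration.

```python
import numpy as np
from scipy.stats import f

grades_imi = np.array([8.5, 10, 10, 10, 11, 11.5, 11.5, 12, 12, 12,
                       12, 12.5, 12.5, 12.5, 12.5, 13, 13, 13, 13, 13,
                       13, 13, 13, 13.5, 14, 14, 14, 14, 14, 14,
                       14.5, 14.5, 14.5, 15, 15, 15.5, 15.5, 15.5, 15.5, 15.5,
                       16, 16, 16, 16, 16, 16, 16.5, 16.5, 16.5, 16.5,
                       18.5, 18.5, 19, 19])
grades_segf = np.array([3.5, 10, 10, 10.5, 10.5, 11, 11, 11, 11.5, 11.5,
                        12.5, 12.5, 13.5, 13.5, 13.5, 14, 14.5, 15, 15.5, 15.5,
                        15.5, 16, 16.5, 17, 17, 17.5])
alpha = 0.05  # significance level


def f_test_two_sided(x, y):
    """Two-sided Fisher test of H_0: var(x) = var(y) (illustrative helper)."""
    f_score = np.var(x, ddof=1) / np.var(y, ddof=1)
    dfn, dfd = len(x) - 1, len(y) - 1
    # Two-sided p-value: twice the smaller of the two tail probabilities
    p_value = 2 * min(f.cdf(f_score, dfn, dfd), f.sf(f_score, dfn, dfd))
    return f_score, p_value


f_score, p_value = f_test_two_sided(grades_imi, grades_segf)
decision = "reject H_0" if p_value < alpha else "do not reject H_0"
print(f"F = {f_score:.3f}, p-value = {p_value:.4f} -> {decision}")
```

The Fisher test is known to be sensitive to departures from normality, so its conclusion should be read with the histogram above in mind.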

In [31]:
from scipy.stats import f # Fisher distribution

In [None]:
# Your code here

## 2.2 Tests on the mean

Using a $\mathrm{t}$-test, compute the $p$-value associated with the test of each of the following assertions:


1.   The grades of IMI students and SEGF students do not have the same mean.
2.   The grades of IMI students have a larger mean than the grades of SEGF students.
3.   The grades of IMI students have a smaller mean than the grades of SEGF students.

You can either implement the $\mathrm{t}$-test or use the `scipy.stats` function `ttest_ind()`.
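Remark: the sketch below shows how the three alternatives could be passed to `ttest_ind()`, each assertion being taken as the alternative hypothesis $H_1$. It assumes equal variances (`equal_var=True`, Student's pooled test); if this assumption does not seem reasonable, `equal_var=False` gives Welch's test instead. The dictionary `p_values` is introduced here for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

grades_imi = np.array([8.5, 10, 10, 10, 11, 11.5, 11.5, 12, 12, 12,
                       12, 12.5, 12.5, 12.5, 12.5, 13, 13, 13, 13, 13,
                       13, 13, 13, 13.5, 14, 14, 14, 14, 14, 14,
                       14.5, 14.5, 14.5, 15, 15, 15.5, 15.5, 15.5, 15.5, 15.5,
                       16, 16, 16, 16, 16, 16, 16.5, 16.5, 16.5, 16.5,
                       18.5, 18.5, 19, 19])
grades_segf = np.array([3.5, 10, 10, 10.5, 10.5, 11, 11, 11, 11.5, 11.5,
                        12.5, 12.5, 13.5, 13.5, 13.5, 14, 14.5, 15, 15.5, 15.5,
                        15.5, 16, 16.5, 17, 17, 17.5])

# 'two-sided': means differ; 'greater': mean(IMI) > mean(SEGF); 'less': mean(IMI) < mean(SEGF)
p_values = {}
for alternative in ("two-sided", "greater", "less"):
    res = ttest_ind(grades_imi, grades_segf, equal_var=True, alternative=alternative)
    p_values[alternative] = res.pvalue
    print(f"H_1 = {alternative}: t = {res.statistic:.3f}, p-value = {res.pvalue:.4f}")
```

The three tests share the same $t$-score; only the tail used to compute the $p$-value changes, so the `'greater'` and `'less'` $p$-values sum to 1.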

In [None]:
# Your code here!