This notebook runs pairwise two-sample t-tests on the results of each sampling method, collected over five random seeds of the initial training data, to assess whether the differences between sampling methods are statistically significant. The tests use Welch's variant (`equal_var=False`), which does not assume that the two groups have equal variance. A YOLOv8n model is trained for four iterations, with 50 epochs per iteration. Before each iteration, 5% of the total training data is added to the labeled pool according to the selected sampling method. All methods start from the same initial labeled set; only the samples selected after the first iteration differ.

In [None]:
import numpy as np
from scipy import stats

The array values below represent the mAP50-95 of the model after four training iterations for each random seed.

In [None]:
random_sampling = np.array([0.565, 0.559, 0.561, 0.568, 0.568])
lc_avg_sampling = np.array([0.51, 0.53, 0.539, 0.514, 0.515])
lc_min_sampling = np.array([0.575, 0.568, 0.586, 0.573, 0.591])
ent_avg_sampling = np.array([0.566, 0.563, 0.549, 0.566, 0.571])
ent_max_sampling = np.array([0.583, 0.575, 0.574, 0.582, 0.577])
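
Before running any tests, it can help to look at the mean and sample standard deviation of each method across the five seeds. The cell below is a minimal sketch of that summary; the `results` dictionary and the `summary` name are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np

# mAP50-95 after four iterations, one value per random seed (copied from the arrays above)
results = {
    "random": np.array([0.565, 0.559, 0.561, 0.568, 0.568]),
    "lc_avg": np.array([0.51, 0.53, 0.539, 0.514, 0.515]),
    "lc_min": np.array([0.575, 0.568, 0.586, 0.573, 0.591]),
    "ent_avg": np.array([0.566, 0.563, 0.549, 0.566, 0.571]),
    "ent_max": np.array([0.583, 0.575, 0.574, 0.582, 0.577]),
}

# Mean and sample standard deviation (ddof=1) per sampling method
summary = {
    name: (scores.mean(), scores.std(ddof=1)) for name, scores in results.items()
}

for name, (mean, std) in summary.items():
    print(f"{name:>8}: mean={mean:.4f}  std={std:.4f}")
```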

The baseline sampling method is random sampling. The next few cells compare each sampling method against random sampling and check whether the difference is statistically significant at the 0.05 level.
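
The comparisons in this notebook call `stats.ttest_ind` with `equal_var=False`, which selects Welch's t-test. As a rough sketch of what that call computes, the cell below reproduces the statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand. The helper name `welch_t_test` is hypothetical and exists only for this illustration.

```python
import numpy as np
from scipy import stats

random_sampling = np.array([0.565, 0.559, 0.561, 0.568, 0.568])
lc_avg_sampling = np.array([0.51, 0.53, 0.539, 0.514, 0.515])


def welch_t_test(a, b):
    """Hand-rolled Welch's t-test (illustrative helper, not a SciPy function)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each group mean
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )
    # Two-sided p-value from the Student t distribution
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p


t_manual, df_manual, p_manual = welch_t_test(lc_avg_sampling, random_sampling)
reference = stats.ttest_ind(lc_avg_sampling, random_sampling, equal_var=False)
print(np.isclose(t_manual, reference.statistic), np.isclose(p_manual, reference.pvalue))
```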

In [None]:
t_stat, p_value = stats.ttest_ind(lc_avg_sampling, random_sampling, equal_var=False)
print(p_value)
p_value.item() < 0.05

Since the p-value is below 0.05, we reject the null hypothesis that Least Confidence Sampling with mean confidence-score aggregation performs no differently from random sampling: the difference from the random sampling baseline is statistically significant. Specifically, the mAP50-95 for fish detection on the DeepFish test dataset is lower under Least Confidence Sampling with mean confidence-score aggregation than under random sampling.
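
The two-sided p-value only says that the two methods differ; the direction comes from the sign of the t-statistic, or equivalently from the difference in means. A minimal sketch of that check, where `mean_diff` is a name introduced here for illustration:

```python
import numpy as np
from scipy import stats

random_sampling = np.array([0.565, 0.559, 0.561, 0.568, 0.568])
lc_avg_sampling = np.array([0.51, 0.53, 0.539, 0.514, 0.515])

t_stat, p_value = stats.ttest_ind(lc_avg_sampling, random_sampling, equal_var=False)

# Negative values mean the first argument (lc_avg) scores lower than the baseline
mean_diff = lc_avg_sampling.mean() - random_sampling.mean()
print(f"t = {t_stat:.3f}, mean difference = {mean_diff:.4f}")
```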

In [None]:
t_stat, p_value = stats.ttest_ind(lc_min_sampling, random_sampling, equal_var=False)
print(p_value)
p_value.item() < 0.05

Since the p-value is below 0.05, we reject the null hypothesis that Least Confidence Sampling with minimum confidence-score aggregation performs no differently from random sampling: the difference from the random sampling baseline is statistically significant. Specifically, the mAP50-95 for fish detection on the DeepFish test dataset is higher under Least Confidence Sampling with minimum confidence-score aggregation than under random sampling.

In [None]:
t_stat, p_value = stats.ttest_ind(ent_avg_sampling, random_sampling, equal_var=False)
print(p_value)
p_value.item() < 0.05

Since the p-value here is not less than 0.05, we fail to reject the null hypothesis that entropy-based sampling with mean entropy aggregation performs no differently from random sampling.

In [None]:
t_stat, p_value = stats.ttest_ind(ent_max_sampling, random_sampling, equal_var=False)
print(p_value)
p_value.item() < 0.05

Since the p-value is below 0.05, we reject the null hypothesis that entropy-based sampling with maximum entropy aggregation performs no differently from random sampling: the difference from the random sampling baseline is statistically significant. Specifically, the mAP50-95 for fish detection on the DeepFish test dataset is higher under entropy-based sampling with maximum entropy aggregation than under random sampling.

In conclusion, Least Confidence Sampling with mean confidence-score aggregation performs significantly worse than random sampling, while entropy-based sampling with mean entropy aggregation shows no statistically significant difference from it. Both entropy-based sampling with maximum entropy aggregation and Least Confidence Sampling with minimum confidence-score aggregation perform significantly better than random sampling.
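
The four baseline comparisons above can also be run in a single loop, which keeps the decision rule in one place. This is a sketch under the same 0.05 threshold used throughout; the `candidates` dictionary, the `verdicts` name, and the verdict strings are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

random_sampling = np.array([0.565, 0.559, 0.561, 0.568, 0.568])
candidates = {
    "lc_avg": np.array([0.51, 0.53, 0.539, 0.514, 0.515]),
    "lc_min": np.array([0.575, 0.568, 0.586, 0.573, 0.591]),
    "ent_avg": np.array([0.566, 0.563, 0.549, 0.566, 0.571]),
    "ent_max": np.array([0.583, 0.575, 0.574, 0.582, 0.577]),
}

ALPHA = 0.05  # significance level used in this notebook
verdicts = {}
for name, scores in candidates.items():
    t_stat, p_value = stats.ttest_ind(scores, random_sampling, equal_var=False)
    if p_value >= ALPHA:
        verdicts[name] = "no significant difference"
    elif t_stat > 0:
        verdicts[name] = "better than random"
    else:
        verdicts[name] = "worse than random"
    print(f"{name:>8}: p={p_value:.4f} -> {verdicts[name]}")
```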

The next step is to compare the two methods that beat the baseline, Least Confidence Sampling with minimum confidence-score aggregation and entropy-based sampling with maximum entropy aggregation, and determine whether one performs better than the other with statistical significance or whether there is no statistically significant difference between them.

In [None]:
t_stat, p_value = stats.ttest_ind(lc_min_sampling, ent_max_sampling, equal_var=False)
print(p_value)
p_value.item() < 0.05

Since the p-value here is not less than 0.05, we fail to reject the null hypothesis that entropy-based sampling with maximum entropy aggregation performs no differently from Least Confidence Sampling with minimum confidence-score aggregation. Both methods would need to be tested in a deployment environment to determine whether one of them outperforms the other with statistical significance.
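
A confidence interval for the difference in means gives a sense of how much uncertainty remains between these two methods with only five seeds each. The cell below is a minimal sketch that builds a 95% Welch interval by hand; the helper `welch_confidence_interval` is hypothetical and not part of the original analysis.

```python
import numpy as np
from scipy import stats

lc_min_sampling = np.array([0.575, 0.568, 0.586, 0.573, 0.591])
ent_max_sampling = np.array([0.583, 0.575, 0.574, 0.582, 0.577])


def welch_confidence_interval(a, b, confidence=0.95):
    """Confidence interval for mean(a) - mean(b) without assuming equal variances."""
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    se = np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )
    t_crit = stats.t.ppf(0.5 + confidence / 2, df)
    diff = a.mean() - b.mean()
    return diff - t_crit * se, diff + t_crit * se


ci_low, ci_high = welch_confidence_interval(lc_min_sampling, ent_max_sampling)
print(f"95% CI for mean difference: [{ci_low:.4f}, {ci_high:.4f}]")
```

An interval that contains zero is consistent with the test above failing to reject the null hypothesis.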