# Minitest Comparisons With Ability Expectations 

This series of snippets checks whether the results for the eight questions used in the experiment, taken as a whole and for all students as a group, fell within the range of expectations (p ≥ 0.05). The following datasets were used:
1. Diagnostic Answers (DA). This 36-row, 8-column dataset was built by selecting the eight questions out of the 21 that form the whole diagnostic test, and 36 students out of the 87 who eventually participated in the study.
2. Preliminary Answers (PA). This 40-row, 8-column dataset holds the results of the same eight questions for the 40 students.
3. Final Answers (FA). This dataset is likewise 40 rows by 8 columns, built in the same fashion.
4. Correctness Probabilities (CP). This dataset contains, for each student, the probability of correctly answering each of the eight selected questions.

The verifications are:
1. Verifying if DA are within expectations predicted by CP
2. Verifying if PA are within expectations predicted by CP
3. Verifying if FA are within expectations predicted by CP.
    * Verifying if FA (Non-Deception questions 1,2,4,5,6,8) are within expectations predicted by CP
    * Verifying if FA (Deception questions 3 and 7) are within expectations predicted by CP


This analysis concluded that verifications 1 and 2 are suitable, but verification 3 is not. The problem with this approach is that it is not an optimal measure for comparison with the other two cases. Whereas for questions 1, 2, 4, 5, 6 and 8 students were influenced towards correctness, for questions 3 and 7 they were swayed towards incorrectness. The optimal measure is one that combines how much the students deviated from expectations in both directions, instead of how much they deviated towards correctness despite the results in Q3 and Q7, which is the current case.



# Diagnostic minitest (all questions)

Here is the algorithm used to show that the results in the minitest are within the expected range.
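
The quantities computed in the cells of this section can be summarised as follows. The notation is introduced here for illustration only: $p_{ij}$ is the CP entry for student $i$ and question $j$, $x_{ij}\in\{0,1\}$ is the corresponding answer, $v_j$ is the question value, $n$ is the number of students, and $\Phi$ is the standard normal CDF. The variance formula assumes that each answer is an independent Bernoulli trial, which is an assumption and not something stated in the data.

$$
\begin{aligned}
E[S_i] &= \sum_{j=1}^{8} v_j\, p_{ij}, &
\operatorname{Var}[S_i] &= \sum_{j=1}^{8} v_j^2\, p_{ij}(1 - p_{ij}), \\
\mu &= \frac{1}{n}\sum_{i=1}^{n} E[S_i], &
\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n} \operatorname{Var}[S_i], \\
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{8} v_j\, x_{ij}, &
z &= \frac{\bar{x} - \mu}{\sqrt{\sigma^2}}, \qquad p = 1 - \Phi(z).
\end{aligned}
$$

The reported interval is $\mu \pm 1.96\sqrt{\sigma^2}$, and the p-value is one-sided (upper tail).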

In [89]:
import pandas as pd
import numpy as np
from scipy.stats import norm

# Load CSV files
prob_df = pd.read_csv('DiagnosticProbabilities_8q.csv', index_col=0)
results_df = pd.read_csv('DiagnosticResults_36s_8q.csv', index_col=0)

# Question values
#values = np.array([0.110, 0.106, 0.095, 0.174, 0.114, 0.108, 0.096, 0.196])
#values = np.array([0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125])
values = np.array([1, 1, 1, 1, 1, 1, 1, 1])

# Calculate the expected score for each student
expected_scores = (prob_df * values).sum(axis=1)

# Calculate the expected variance for each student
expected_variances = ((prob_df * (1 - prob_df)) * (values ** 2)).sum(axis=1)

# Calculate the observed score for each student
observed_scores = (results_df * values).sum(axis=1)

# Calculate the mean expected score and mean variance
mean_expected_score = expected_scores.mean()
mean_variance = expected_variances.sum() / len(expected_scores)

# Calculate the mean observed score
mean_observed_score = observed_scores.mean()

# Calculate the Z-score for the mean comparison
z_score_mean = (mean_observed_score - mean_expected_score) / np.sqrt(mean_variance)

# Calculate the p-value for the mean comparison
p_value_mean = 1 - norm.cdf(z_score_mean)

# Calculate the standard error
standard_error = np.sqrt(mean_variance)

# Calculate the 95% confidence interval for the expected mean
ci_lower = mean_expected_score - 1.96 * standard_error
ci_upper = mean_expected_score + 1.96 * standard_error

print(f"Mean expected score: {mean_expected_score:.4f}")
print(f"Mean observed score: {mean_observed_score:.4f}")
print(f"Mean variance: {mean_variance:.4f}")
print(f"Z-score (mean comparison): {z_score_mean:.4f}")
print(f"p-value (mean comparison): {p_value_mean:.4f}")
print(f"95% Confidence Interval for Expected Mean: [{ci_lower:.4f}, {ci_upper:.4f}]")

# Conclusion
if p_value_mean < 0.05:
    print("There is a significant difference between the mean of the observed and expected results (p < 0.05).")
else:
    print("There is no significant difference between the mean of the observed and expected results (p >= 0.05).")


Mean expected score: 2.7675
Mean observed score: 3.6667
Mean variance: 1.7594
Z-score (mean comparison): 0.6779
p-value (mean comparison): 0.2489
95% Confidence Interval for Expected Mean: [0.1678, 5.3673]
There is no significant difference between the mean of the observed and expected results (p >= 0.05).
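
Since the same computation is repeated in every cell of this notebook, it could be wrapped in a single function. The following is a minimal sketch only: the name `compare_to_expectations` and the toy tables are hypothetical and stand in for the CSV files, and the arithmetic simply mirrors the cell above.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm


def compare_to_expectations(prob_df, results_df, values):
    """Hypothetical helper that mirrors the computation in the cell above."""
    values = np.asarray(values, dtype=float)

    # Expected score, expected variance and observed score for each student
    expected_scores = (prob_df * values).sum(axis=1)
    expected_variances = (prob_df * (1 - prob_df) * values ** 2).sum(axis=1)
    observed_scores = (results_df * values).sum(axis=1)

    # Group-level summaries
    mean_expected = expected_scores.mean()
    mean_variance = expected_variances.mean()
    mean_observed = observed_scores.mean()

    # Z-score, one-sided (upper-tail) p-value and 95% interval
    spread = np.sqrt(mean_variance)
    z_score = (mean_observed - mean_expected) / spread
    return {
        "mean_expected": mean_expected,
        "mean_observed": mean_observed,
        "mean_variance": mean_variance,
        "z_score": z_score,
        "p_value": 1 - norm.cdf(z_score),
        "ci": (mean_expected - 1.96 * spread, mean_expected + 1.96 * spread),
    }


# Toy data (made up): 4 students, 8 questions, every probability equal to 0.5,
# and every student answering the first five questions correctly.
questions = [f"Q{i}" for i in range(1, 9)]
students = ["s1", "s2", "s3", "s4"]
toy_prob = pd.DataFrame(0.5, index=students, columns=questions)
toy_results = pd.DataFrame([[1, 1, 1, 1, 1, 0, 0, 0]] * 4,
                           index=students, columns=questions)

summary = compare_to_expectations(toy_prob, toy_results, np.ones(8))
print(summary)
```

With the real files, the call would take the two DataFrames loaded by `pd.read_csv` in place of the toy tables.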


In [88]:
import pandas as pd
import numpy as np
from scipy.stats import norm

# Load CSV files
prob_df = pd.read_csv('DiagnosticProbabilities_8q.csv', index_col=0)
results_df = pd.read_csv('DiagnosticResults_36s_8q.csv', index_col=0)

# Question values
#values = np.array([0.110, 0.106, 0.095, 0.174, 0.114, 0.108, 0.096, 0.196])
values = np.array([0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125])

# Calculate the expected score for each student
expected_scores = (prob_df * values).sum(axis=1)

# Calculate the expected variance for each student
expected_variances = ((prob_df * (1 - prob_df)) * (values ** 2)).sum(axis=1)

# Calculate the observed score for each student
observed_scores = (results_df * values).sum(axis=1)

# Calculate the mean expected score and mean variance
mean_expected_score = expected_scores.mean()
mean_variance = expected_variances.sum() / len(expected_scores)

# Calculate the mean observed score
mean_observed_score = observed_scores.mean()

# Calculate the Z-score for the mean comparison
z_score_mean = (mean_observed_score - mean_expected_score) / np.sqrt(mean_variance)

# Calculate the p-value for the mean comparison
p_value_mean = 1 - norm.cdf(z_score_mean)

# Calculate the standard error
standard_error = np.sqrt(mean_variance)

# Calculate the 95% confidence interval for the expected mean
ci_lower = mean_expected_score - 1.96 * standard_error
ci_upper = mean_expected_score + 1.96 * standard_error

print(f"Mean expected score: {mean_expected_score:.4f}")
print(f"Mean observed score: {mean_observed_score:.4f}")
print(f"Mean variance: {mean_variance:.4f}")
print(f"Z-score (mean comparison): {z_score_mean:.4f}")
print(f"p-value (mean comparison): {p_value_mean:.4f}")
print(f"95% Confidence Interval for Expected Mean: [{ci_lower:.4f}, {ci_upper:.4f}]")

# Conclusion
if p_value_mean < 0.05:
    print("There is a significant difference between the mean of the observed and expected results (p < 0.05).")
else:
    print("There is no significant difference between the mean of the observed and expected results (p >= 0.05).")


Mean expected score: 0.3459
Mean observed score: 0.4583
Mean variance: 0.0275
Z-score (mean comparison): 0.6779
p-value (mean comparison): 0.2489
95% Confidence Interval for Expected Mean: [0.0210, 0.6709]
There is no significant difference between the mean of the observed and expected results (p >= 0.05).
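
The two cells above report the same Z-score (0.6779) because multiplying every question value by the same constant scales the mean difference and the square root of the variance by that constant, so the ratio is unchanged. The sketch below illustrates this with randomly generated stand-in data (not the study's files); the function name `z_score` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up stand-ins for the CP table and the 0/1 results (36 students, 8 questions)
probs = rng.uniform(0.1, 0.9, size=(36, 8))
answers = (rng.random((36, 8)) < probs).astype(int)


def z_score(values):
    """Z-score of the mean comparison for a given vector of question values."""
    values = np.asarray(values, dtype=float)
    expected = (probs * values).sum(axis=1).mean()
    observed = (answers * values).sum(axis=1).mean()
    variance = (probs * (1 - probs) * values ** 2).sum(axis=1).mean()
    return (observed - expected) / np.sqrt(variance)


z_unit = z_score(np.ones(8))           # every question worth 1
z_eighth = z_score(np.full(8, 0.125))  # every question worth 0.125

print(np.isclose(z_unit, z_eighth))
```

This holds only for a uniform rescaling. Unequal weights, such as the commented-out vector starting with 0.110, change the relative contribution of each question and therefore can change the Z-score.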


In [92]:
import pandas as pd
import numpy as np
from scipy.stats import norm, shapiro, wilcoxon, kstest

# Load CSV files
# prob_df = pd.read_csv('DiagnosticProbabilities_8q.csv', index_col=0)
# results_df = pd.read_csv('DiagnosticResults_36s_8q.csv', index_col=0)
prob_df = pd.read_csv('PreliminaryProbabilities_8q.csv', index_col=0)
results_df = pd.read_csv('PreliminaryResults_8q.csv', index_col=0)

# Question values
# values = np.array([0.110, 0.106, 0.095, 0.174, 0.114, 0.108, 0.096, 0.196])
# values = np.array([0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125])
values = np.array([1, 1, 1, 1, 1, 1, 1, 1])

# Calculate the expected score for each student
expected_scores = (prob_df * values).sum(axis=1)

# Calculate the expected variance for each student
expected_variances = ((prob_df * (1 - prob_df)) * (values ** 2)).sum(axis=1)

# Calculate the observed score for each student
observed_scores = (results_df * values).sum(axis=1)

# Calculate the mean expected score and mean variance
mean_expected_score = expected_scores.mean()
mean_variance = expected_variances.sum() / len(expected_scores)

# Calculate the mean observed score
mean_observed_score = observed_scores.mean()

# Calculate the Z-score for the mean comparison
z_score_mean = (mean_observed_score - mean_expected_score) / np.sqrt(mean_variance)

# Calculate the p-value for the mean comparison
p_value_mean = 1 - norm.cdf(z_score_mean)

# Calculate the standard error
standard_error = np.sqrt(mean_variance)

# Calculate the 95% confidence interval for the expected mean
ci_lower = mean_expected_score - 1.96 * standard_error
ci_upper = mean_expected_score + 1.96 * standard_error

# Perform the Shapiro-Wilk normality test for expected scores
shapiro_stat, shapiro_p_value = shapiro(expected_scores)

# Perform the Wilcoxon signed-rank test for non-parametric comparison
wilcoxon_stat, wilcoxon_p_value = wilcoxon(observed_scores, expected_scores)

# Perform the Kolmogorov-Smirnov test for normality
ks_stat, ks_p_value = kstest(expected_scores, 'norm', args=(mean_expected_score, np.sqrt(mean_variance)))

print(f"Mean expected score: {mean_expected_score:.4f}")
print(f"Mean observed score: {mean_observed_score:.4f}")
print(f"Mean variance: {mean_variance:.4f}")
print(f"Z-score (mean comparison): {z_score_mean:.4f}")
print(f"p-value (mean comparison): {p_value_mean:.4f}")
print(f"95% Confidence Interval for Expected Mean: [{ci_lower:.4f}, {ci_upper:.4f}]")

# Shapiro-Wilk test result
print(f"Shapiro-Wilk test statistic: {shapiro_stat:.4f}")
print(f"Shapiro-Wilk p-value: {shapiro_p_value:.4f}")

# Wilcoxon signed-rank test result
print(f"Wilcoxon test statistic: {wilcoxon_stat:.4f}")
print(f"Wilcoxon p-value: {wilcoxon_p_value:.4f}")

# Kolmogorov-Smirnov test result
print(f"Kolmogorov-Smirnov test statistic: {ks_stat:.4f}")
print(f"Kolmogorov-Smirnov p-value: {ks_p_value:.4f}")

# Conclusion
if p_value_mean < 0.05:
    print("There is a significant difference between the mean of the observed and expected results (p < 0.05).")
else:
    print("There is no significant difference between the mean of the observed and expected results (p >= 0.05).")

if shapiro_p_value < 0.05:
    print("The data does not follow a normal distribution according to Shapiro-Wilk test (p < 0.05). The Z-score method may not be appropriate.")
else:
    print("The data follows a normal distribution according to Shapiro-Wilk test (p >= 0.05). The Z-score method is appropriate.")

if wilcoxon_p_value < 0.05:
    print("There is a significant difference between the observed and expected results using the Wilcoxon signed-rank test (p < 0.05).")
else:
    print("There is no significant difference between the observed and expected results using the Wilcoxon signed-rank test (p >= 0.05).")

if ks_p_value < 0.05:
    print("The data does not follow a normal distribution according to the Kolmogorov-Smirnov test (p < 0.05). The Z-score method may not be appropriate.")
else:
    print("The data follows a normal distribution according to the Kolmogorov-Smirnov test (p >= 0.05). The Z-score method is appropriate.")


Mean expected score: 3.6338
Mean observed score: 4.7750
Mean variance: 1.9486
Z-score (mean comparison): 0.8175
p-value (mean comparison): 0.2068
95% Confidence Interval for Expected Mean: [0.8978, 6.3698]
Shapiro-Wilk test statistic: 0.9352
Shapiro-Wilk p-value: 0.0238
Wilcoxon test statistic: 127.0000
Wilcoxon p-value: 0.0001
Kolmogorov-Smirnov test statistic: 0.4290
Kolmogorov-Smirnov p-value: 0.0000
There is no significant difference between the mean of the observed and expected results (p >= 0.05).
The data does not follow a normal distribution according to Shapiro-Wilk test (p < 0.05). The Z-score method may not be appropriate.
There is a significant difference between the observed and expected results using the Wilcoxon signed-rank test (p < 0.05).
The data does not follow a normal distribution according to the Kolmogorov-Smirnov test (p < 0.05). The Z-score method may not be appropriate.


# Conclusion
### The results in the diagnostic minitest are within the range of expected values.


# Preliminary minitest (all questions)


Here is the algorithm used to show that the results in the minitest are within the expected range.

In [82]:
import pandas as pd
import numpy as np
from scipy.stats import norm

# Load CSV files
prob_df = pd.read_csv('PreliminaryProbabilities_8q.csv', index_col=0)
results_df = pd.read_csv('PreliminaryResults_8q.csv', index_col=0)

# Question values
values = np.array([0.110, 0.106, 0.095, 0.174, 0.114, 0.108, 0.096, 0.196])

# Calculate the expected score for each student
expected_scores = (prob_df * values).sum(axis=1)

# Calculate the expected variance for each student
expected_variances = ((prob_df * (1 - prob_df)) * (values ** 2)).sum(axis=1)

# Calculate the observed score for each student
observed_scores = (results_df * values).sum(axis=1)

# Calculate the mean expected score and mean variance
mean_expected_score = expected_scores.mean()
mean_variance = expected_variances.sum() / len(expected_scores)

# Calculate the mean observed score
mean_observed_score = observed_scores.mean()

# Calculate the Z-score for the mean comparison
z_score_mean = (mean_observed_score - mean_expected_score) / np.sqrt(mean_variance)

# Calculate the p-value for the mean comparison
p_value_mean = 1 - norm.cdf(z_score_mean)

# Calculate the standard error
standard_error = np.sqrt(mean_variance)

# Calculate the 95% confidence interval for the expected mean
ci_lower = mean_expected_score - 1.96 * standard_error
ci_upper = mean_expected_score + 1.96 * standard_error

print(f"Mean expected score: {mean_expected_score:.4f}")
print(f"Mean observed score: {mean_observed_score:.4f}")
print(f"Mean variance: {mean_variance:.4f}")
print(f"Z-score (mean comparison): {z_score_mean:.4f}")
print(f"p-value (mean comparison): {p_value_mean:.4f}")
print(f"95% Confidence Interval for Expected Mean: [{ci_lower:.4f}, {ci_upper:.4f}]")

# Conclusion
if p_value_mean < 0.05:
    print("There is a significant difference between the mean of the observed and expected results (p < 0.05).")
else:
    print("There is no significant difference between the mean of the observed and expected results (p >= 0.05).")


Mean expected score: 0.4426
Mean observed score: 0.5867
Mean variance: 0.0327
Z-score (mean comparison): 0.7974
p-value (mean comparison): 0.2126
95% Confidence Interval for Expected Mean: [0.0883, 0.7969]
There is no significant difference between the mean of the observed and expected results (p >= 0.05).


# Conclusion
### The results in the preliminary minitest are within the range of expected values.


# Final minitest (all questions)

Here is the algorithm used to show whether the results in the minitest are within the expected range.

In [83]:
import pandas as pd
import numpy as np
from scipy.stats import norm

# Load CSV files
prob_df = pd.read_csv('PreliminaryProbabilities_8q.csv', index_col=0)
results_df = pd.read_csv('FinalResults_8q.csv', index_col=0)

# Question values
values = np.array([0.110, 0.106, 0.095, 0.174, 0.114, 0.108, 0.096, 0.196])

# Calculate the expected score for each student
expected_scores = (prob_df * values).sum(axis=1)

# Calculate the expected variance for each student
expected_variances = ((prob_df * (1 - prob_df)) * (values ** 2)).sum(axis=1)

# Calculate the observed score for each student
observed_scores = (results_df * values).sum(axis=1)

# Calculate the mean expected score and mean variance
mean_expected_score = expected_scores.mean()
mean_variance = expected_variances.sum() / len(expected_scores)

# Calculate the mean observed score
mean_observed_score = observed_scores.mean()

# Calculate the Z-score for the mean comparison
z_score_mean = (mean_observed_score - mean_expected_score) / np.sqrt(mean_variance)

# Calculate the p-value for the mean comparison
p_value_mean = 1 - norm.cdf(z_score_mean)

# Calculate the standard error
standard_error = np.sqrt(mean_variance)

# Calculate the 95% confidence interval for the expected mean
ci_lower = mean_expected_score - 1.96 * standard_error
ci_upper = mean_expected_score + 1.96 * standard_error

print(f"Mean expected score: {mean_expected_score:.4f}")
print(f"Mean observed score: {mean_observed_score:.4f}")
print(f"Mean variance: {mean_variance:.4f}")
print(f"Z-score (mean comparison): {z_score_mean:.4f}")
print(f"p-value (mean comparison): {p_value_mean:.4f}")
print(f"95% Confidence Interval for Expected Mean: [{ci_lower:.4f}, {ci_upper:.4f}]")

# Conclusion
if p_value_mean < 0.05:
    print("There is a significant difference between the mean of the observed and expected results (p < 0.05).")
else:
    print("There is no significant difference between the mean of the observed and expected results (p >= 0.05).")


Mean expected score: 0.4426
Mean observed score: 0.7977
Mean variance: 0.0327
Z-score (mean comparison): 1.9642
p-value (mean comparison): 0.0248
95% Confidence Interval for Expected Mean: [0.0883, 0.7969]
There is a significant difference between the mean of the observed and expected results (p < 0.05).


# Conclusion
### The results in the final minitest are not within the range of expected values. The results are too high, even though questions 3 and 7 were only answered correctly 6 and 5 times respectively out of 40 attempts each.


The problem with this approach is that it is not an optimal measure for comparison with the other two cases. Whereas for questions 1, 2, 4, 5, 6 and 8 students were influenced towards correctness, for questions 3 and 7 they were swayed towards incorrectness. The optimal measure is one that combines how much the students deviated from expectations in both directions, instead of how much they deviated towards correctness despite the results in Q3 and Q7, which is the current case.
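
One possible reading of such a combined measure is to count each deviation in the direction in which the students were influenced, so that an incorrect answer on Q3 or Q7 contributes positively, just as a correct answer on the other six questions does. The sketch below is an illustration of that idea only, not a method taken from this study. The `direction` vector, the toy tables and the use of unit question values are all assumptions, and answers are treated as independent.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Made-up stand-ins for the CP and FA tables (3 students, 8 questions)
questions = [f"Q{i}" for i in range(1, 9)]
students = ["s1", "s2", "s3"]
prob_df = pd.DataFrame(0.5, index=students, columns=questions)
results_df = pd.DataFrame([[1, 1, 0, 1, 1, 1, 0, 1]] * 3,
                          index=students, columns=questions)

# Assumed direction of influence per question:
# +1 = swayed towards correctness, -1 = swayed towards incorrectness (Q3, Q7)
direction = np.array([1, 1, -1, 1, 1, 1, -1, 1])

# Deviation of each answer from its expected probability, signed so that
# positive always means "moved in the direction of the influence"
signed_deviation = (results_df - prob_df) * direction
student_deviation = signed_deviation.sum(axis=1)

# The sign flip does not change the Bernoulli variance p * (1 - p)
student_variance = (prob_df * (1 - prob_df)).sum(axis=1)

# Same Z-score construction as in the cells above, applied to the signed deviation
z_combined = student_deviation.mean() / np.sqrt(student_variance.mean())
p_combined = 1 - norm.cdf(z_combined)

# For comparison: the unsigned score-based Z-score on the same toy data
z_unsigned = ((results_df.sum(axis=1) - prob_df.sum(axis=1)).mean()
              / np.sqrt(student_variance.mean()))

print(f"Combined Z-score: {z_combined:.4f}, p-value: {p_combined:.4f}")
print(f"Unsigned Z-score: {z_unsigned:.4f}")
```

In the toy data, the low results on Q3 and Q7 reduce the unsigned Z-score but increase the combined one, which is the behaviour the paragraph above asks for.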





# Final minitest (non-critical questions 1,2,4,5,6,8)

In [84]:
import pandas as pd
import numpy as np
from scipy.stats import norm

# Load CSV files
prob_df = pd.read_csv('PreliminaryProbabilities_8q.csv', index_col=0)
results_df = pd.read_csv('FinalResults_8q.csv', index_col=0)

# Question values, excluding questions 3 and 7 (indexes 2 and 6)
values = np.array([0.110, 0.106, 0.174, 0.114, 0.108, 0.196])  # Excluding Q3 and Q7

# Drop columns for questions 3 and 7 (indexes 2 and 6)
prob_subset = prob_df.drop(columns=[prob_df.columns[2], prob_df.columns[6]])
results_subset = results_df.drop(columns=[results_df.columns[2], results_df.columns[6]])

# Calculate the expected score for each student
expected_scores = (prob_subset * values).sum(axis=1)

# Calculate the expected variance for each student
expected_variances = ((prob_subset * (1 - prob_subset)) * (values ** 2)).sum(axis=1)

# Calculate the observed score for each student
observed_scores = (results_subset * values).sum(axis=1)

# Calculate the mean expected score and mean variance
mean_expected_score = expected_scores.mean()
mean_variance = expected_variances.sum() / len(expected_scores)

# Calculate the mean observed score
mean_observed_score = observed_scores.mean()

# Calculate the Z-score for the mean comparison
z_score_mean = (mean_observed_score - mean_expected_score) / np.sqrt(mean_variance)

# Calculate the p-value for the mean comparison
p_value_mean = 1 - norm.cdf(z_score_mean)

# Calculate the standard error
standard_error = np.sqrt(mean_variance)

# Calculate the 95% confidence interval for the expected mean
ci_lower = mean_expected_score - 1.96 * standard_error
ci_upper = mean_expected_score + 1.96 * standard_error


print(f"Mean expected score (Excluding Q3 and Q7): {mean_expected_score:.4f}")
print(f"Mean observed score (Excluding Q3 and Q7): {mean_observed_score:.4f}")
print(f"Mean variance (Excluding Q3 and Q7): {mean_variance:.4f}")
print(f"Z-score (mean comparison, Excluding Q3 and Q7): {z_score_mean:.4f}")
print(f"p-value (mean comparison, Excluding Q3 and Q7): {p_value_mean:.4f}")
print(f"95% Confidence Interval for Expected Mean (Excluding Q3 and Q7): [{ci_lower:.4f}, {ci_upper:.4f}]")

# Conclusion
if p_value_mean < 0.05:
    print("There is a significant difference between the mean of the observed and expected results (excluding Q3 and Q7) (p < 0.05).")
else:
    print("There is no significant difference between the mean of the observed and expected results (excluding Q3 and Q7) (p >= 0.05).")


Mean expected score (Excluding Q3 and Q7): 0.3368
Mean observed score (Excluding Q3 and Q7): 0.7714
Mean variance (Excluding Q3 and Q7): 0.0282
Z-score (mean comparison, Excluding Q3 and Q7): 2.5879
p-value (mean comparison, Excluding Q3 and Q7): 0.0048
95% Confidence Interval for Expected Mean (Excluding Q3 and Q7): [0.0077, 0.6660]
There is a significant difference between the mean of the observed and expected results (excluding Q3 and Q7) (p < 0.05).


# Conclusion
### The results in the non-critical questions of the final minitest are not within the range of expected values. The results are too high.

# Final minitest (critical questions 3 & 7)

In [85]:
import pandas as pd
import numpy as np
from scipy.stats import norm

# Load CSV files
prob_df = pd.read_csv('PreliminaryProbabilities_8q.csv', index_col=0)
results_df = pd.read_csv('FinalResults_8q.csv', index_col=0)

# Question values, including only questions 3 and 7 (indexes 2 and 6)
values = np.array([0.095, 0.096])  # Q3 and Q7

# Keep only the columns for questions 3 and 7 (indexes 2 and 6)
prob_subset = prob_df.iloc[:, [2, 6]]
results_subset = results_df.iloc[:, [2, 6]]

# Calculate the expected score for each student
expected_scores = (prob_subset * values).sum(axis=1)

# Calculate the expected variance for each student
expected_variances = ((prob_subset * (1 - prob_subset)) * (values ** 2)).sum(axis=1)

# Calculate the observed score for each student
observed_scores = (results_subset * values).sum(axis=1)

# Calculate the mean expected score and mean variance
mean_expected_score = expected_scores.mean()
mean_variance = expected_variances.sum() / len(expected_scores)

# Calculate the mean observed score
mean_observed_score = observed_scores.mean()

# Calculate the Z-score for the mean comparison
z_score_mean = (mean_observed_score - mean_expected_score) / np.sqrt(mean_variance)

# Calculate the p-value for the mean comparison
p_value_mean = 1 - norm.cdf(z_score_mean)

# Calculate the standard error
standard_error = np.sqrt(mean_variance)

# Calculate the 95% confidence interval for the expected mean
ci_lower = mean_expected_score - 1.96 * standard_error
ci_upper = mean_expected_score + 1.96 * standard_error


print(f"Mean expected score (Only Q3 and Q7): {mean_expected_score:.4f}")
print(f"Mean observed score (Only Q3 and Q7): {mean_observed_score:.4f}")
print(f"Mean variance (Only Q3 and Q7): {mean_variance:.4f}")
print(f"Z-score (mean comparison, Only Q3 and Q7): {z_score_mean:.4f}")
print(f"p-value (mean comparison, Only Q3 and Q7): {p_value_mean:.4f}")
print(f"95% Confidence Interval for Expected Mean (Only Q3 and Q7): [{ci_lower:.4f}, {ci_upper:.4f}]")

# Conclusion
if p_value_mean < 0.05:
    print("There is a significant difference between the mean of the observed and expected results (Only Q3 and Q7) (p < 0.05).")
else:
    print("There is no significant difference between the mean of the observed and expected results (Only Q3 and Q7) (p >= 0.05).")


Mean expected score (Only Q3 and Q7): 0.1058
Mean observed score (Only Q3 and Q7): 0.0263
Mean variance (Only Q3 and Q7): 0.0045
Z-score (mean comparison, Only Q3 and Q7): -1.1892
p-value (mean comparison, Only Q3 and Q7): 0.8828
95% Confidence Interval for Expected Mean (Only Q3 and Q7): [-0.0253, 0.2369]
There is no significant difference between the mean of the observed and expected results (Only Q3 and Q7) (p >= 0.05).


# Conclusion

### This approach is suitable for the overall minitest or the six-question extract, but not for questions 3 and 7

Use of Z-Score for the Entire Minitest:
For the entire minitest (8 questions and 40 students) or the extract of 6 questions and 40 students, the dataset is larger (320 and 240 data points respectively). These sample sizes are sufficient to assume a normal distribution by the Central Limit Theorem, which makes the Z-score and confidence interval approach a valid way to check whether the overall results are within the expected range.
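
As a rough check on the normal approximation, the group mean can also be simulated directly from the correctness probabilities. The sketch below uses made-up probabilities (not the study's CP file) and assumes independent answers. It compares the simulated centre and spread of the group mean with their closed-form values; under the independence assumption, the variance of a mean over n students is the mean per-student variance divided by n.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_questions, n_sims = 40, 8, 5000

# Made-up correctness probabilities standing in for the CP table
probs = rng.uniform(0.2, 0.8, size=(n_students, n_questions))

# Simulate 0/1 answers for every student and question, n_sims times
answers = rng.random((n_sims, n_students, n_questions)) < probs

# Group mean score (unit question values) in each simulated experiment
simulated_means = answers.sum(axis=2).mean(axis=1)

# Closed-form values under the independence assumption
expected_mean = probs.sum(axis=1).mean()
mean_variance = (probs * (1 - probs)).sum(axis=1).mean()
sd_of_group_mean = np.sqrt(mean_variance / n_students)

print(f"Simulated mean of group mean: {simulated_means.mean():.4f}")
print(f"Closed-form expected mean:    {expected_mean:.4f}")
print(f"Simulated SD of group mean:   {simulated_means.std():.4f}")
print(f"Closed-form SD of group mean: {sd_of_group_mean:.4f}")
```

An observed group mean can then be placed against `simulated_means` (for example with `np.mean(simulated_means >= observed)`) as an empirical one-sided p-value that does not rely on normality.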

Not Using Z-Score for Questions 3 and 7:
For questions 3 and 7, the dataset is much smaller (only 2 questions and 40 students, resulting in 80 data points). With such a small dataset and limited variation, the normality assumption is not reliable, and the Z-score approach is not sensitive enough to detect the expected anomaly.
In such cases, a different statistical test, like the binomial test, is more appropriate because it directly assesses whether the number of successes (correct answers) significantly deviates from the expected probability for a binomial distribution.

In [86]:
import pandas as pd
import numpy as np
from scipy.stats import binomtest

# Load CSV files
prob_df = pd.read_csv('PreliminaryProbabilities_8q.csv', index_col=0)
results_df = pd.read_csv('FinalResults_8q.csv', index_col=0)

# Select only columns for questions 3 and 7 (indexes 2 and 6)
prob_q3_q7 = prob_df.iloc[:, [2, 6]]
results_q3_q7 = results_df.iloc[:, [2, 6]]

# Calculate the total number of correct answers for each question
correct_answers_q3 = results_q3_q7.iloc[:, 0].sum()  # Total correct answers for Q3
correct_answers_q7 = results_q3_q7.iloc[:, 1].sum()  # Total correct answers for Q7

# Number of students (total trials)
n_students = len(results_q3_q7)

# Calculate the mean expected probability of correctness for each question
p_q3 = prob_q3_q7.iloc[:, 0].mean()
p_q7 = prob_q3_q7.iloc[:, 1].mean()

# Perform the binomial test for each question
p_value_q3 = binomtest(correct_answers_q3, n_students, p_q3, alternative='less').pvalue
p_value_q7 = binomtest(correct_answers_q7, n_students, p_q7, alternative='less').pvalue

print(f"Total correct answers for Q3: {correct_answers_q3}")
print(f"Expected probability for Q3: {p_q3:.4f}")
print(f"p-value for Q3 (binomial test): {p_value_q3:.4f}")

print(f"Total correct answers for Q7: {correct_answers_q7}")
print(f"Expected probability for Q7: {p_q7:.4f}")
print(f"p-value for Q7 (binomial test): {p_value_q7:.4f}")

# Conclusion
if p_value_q3 < 0.05:
    print("There is a significant difference for Q3 (p < 0.05). The observed number of correct answers is significantly lower than expected.")
else:
    print("There is no significant difference for Q3 (p >= 0.05).")

if p_value_q7 < 0.05:
    print("There is a significant difference for Q7 (p < 0.05). The observed number of correct answers is significantly lower than expected.")
else:
    print("There is no significant difference for Q7 (p >= 0.05).")


Total correct answers for Q3: 6
Expected probability for Q3: 0.5333
p-value for Q3 (binomial test): 0.0000
Total correct answers for Q7: 5
Expected probability for Q7: 0.5743
p-value for Q7 (binomial test): 0.0000
There is a significant difference for Q3 (p < 0.05). The observed number of correct answers is significantly lower than expected.
There is a significant difference for Q7 (p < 0.05). The observed number of correct answers is significantly lower than expected.
