In [None]:
# 1. Explain the properties of the F-distribution.

### Properties of the F-Distribution:

1. **Shape**:
   - The F-distribution is **right-skewed** and takes only **non-negative values** (i.e., \( F \ge 0 \)).

2. **Degrees of Freedom**:
   - It is defined by two sets of degrees of freedom:
     - **Numerator degrees of freedom** \( df_1 \): Associated with the variance of the first sample.
     - **Denominator degrees of freedom** \( df_2 \): Associated with the variance of the second sample.

3. **Mean**:
   - The mean of the F-distribution is given by:

     \[
     \text{Mean} = \frac{df_2}{df_2 - 2}, \quad \text{for } df_2 > 2
     \]

4. **Variance**:
   - The variance of the F-distribution is:

     \[
     \text{Variance} = \frac{2 \, df_2^2 \, (df_1 + df_2 - 2)}{df_1 (df_2 - 2)^2 (df_2 - 4)}, \quad \text{for } df_2 > 4
     \]


5. **Asymptotic Behavior**:
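   - As both \( df_1 \) and \( df_2 \) grow large, the F-distribution concentrates around 1 and becomes approximately **normal**. If only \( df_2 \to \infty \), then \( df_1 \cdot F \) approaches a **chi-square distribution** with \( df_1 \) degrees of freedom.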

6. **Used in Hypothesis Testing**:
   - Commonly used in **ANOVA (Analysis of Variance)**, **regression analysis**, and **variance comparison** tests.

7. **Non-Negative**:
   - The values of the F-distribution are never **negative**, since it is a ratio of variance estimates, which are built from squared terms.

8. **Skewness**:
   - The distribution is **right-skewed**, especially when degrees of freedom are small. As the degrees of freedom increase, the distribution becomes more symmetric.
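
The closed-form mean and variance of the F-distribution can be checked numerically. The sketch below is only an illustration: the degrees of freedom `df1 = 5` and `df2 = 10` are arbitrary values chosen for the example (not taken from the questions), and the comparison uses `scipy.stats.f`.

```python
from scipy import stats

# Illustrative degrees of freedom (assumed values; df2 > 4 so the variance exists)
df1, df2 = 5, 10

# Closed-form moments of the F-distribution
mean_formula = df2 / (df2 - 2)
var_formula = 2 * df2**2 * (df1 + df2 - 2) / (df1 * (df2 - 2) ** 2 * (df2 - 4))

# The same moments as reported by SciPy
mean_scipy = stats.f.mean(df1, df2)
var_scipy = stats.f.var(df1, df2)

print("mean:", mean_formula, mean_scipy)
print("variance:", var_formula, var_scipy)
```

Note that the mean depends only on \( df_2 \), while the variance depends on both degrees of freedom.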


In [None]:
# 2. In which types of statistical tests is the F-distribution used, and why is it appropriate for these tests?

The **F-distribution** is used in the following statistical tests:

1. **ANOVA (Analysis of Variance)**: Compares the means of three or more groups by comparing the variance between groups to the variance within groups.

2. **F-test for Comparing Variances**: Compares the variances of two populations to check if they are significantly different.

3. **Regression Analysis (F-test for Model Significance)**: Tests whether a regression model explains a significant amount of variance in the dependent variable.

4. **MANOVA (Multivariate Analysis of Variance)**: Compares means of multiple dependent variables across groups. Its test statistics (such as Wilks' lambda) are commonly converted to approximate F-statistics.

**Why F-distribution?**
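- Each of these tests is built on a ratio of two variance estimates. When the null hypothesis holds and the data are normally distributed, that ratio follows an F-distribution.
- It is non-negative and right-skewed, which matches the behavior of a ratio of variances: the ratio can never be negative but can occasionally be very large.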
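
A small simulation can illustrate why the ratio of two sample variances is compared against the F-distribution. This is a sketch under assumed settings: the sample sizes, the random seed, and the number of simulated pairs are arbitrary choices for the example. Both samples are drawn from normal populations with equal variance, so the null hypothesis is true by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n1, n2 = 6, 11                   # hypothetical sample sizes
n_sim = 20_000                   # number of simulated sample pairs

# Draw pairs of samples from normal populations with equal variance
x = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n1))
y = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n2))

# Ratio of sample variances for each simulated pair
ratios = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)

# Upper 5% critical value of F with (n1 - 1, n2 - 1) degrees of freedom
critical = stats.f.ppf(0.95, n1 - 1, n2 - 1)

# Fraction of simulated ratios beyond the critical value (should be near 0.05)
exceed_rate = np.mean(ratios > critical)
print(exceed_rate)
```

If the simulated ratios follow the F-distribution, roughly 5% of them should exceed the 95th-percentile critical value.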

In [None]:
# 3. What are the key assumptions required for conducting an F-test to compare the variances of two populations?

The key assumptions for conducting an **F-test** to compare the variances of two populations are:

1. **Independence**: The two samples must be independent of each other.
2. **Normality**: Both populations should follow a **normal distribution**.
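3. **Ratio of Variances**: Strictly a consequence rather than an assumption: when the first two conditions hold and the population variances are equal, the ratio of the two sample variances follows an F-distribution.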

These assumptions ensure the validity of the F-test for comparing variances.
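
The normality assumption can be screened before running the F-test. The sketch below applies the Shapiro-Wilk test (`scipy.stats.shapiro`) to the two profession samples from question 8; using Shapiro-Wilk here is a choice made for illustration, not something the assignment prescribes, and with only five observations per sample the test has little power to detect non-normality.

```python
import numpy as np
from scipy import stats

# Income samples from question 8
samples = {
    "Profession A": np.array([48, 52, 55, 60, 62]),
    "Profession B": np.array([45, 50, 55, 52, 47]),
}

# Shapiro-Wilk test: a small p-value suggests a departure from normality
normality_p_values = {}
for name, sample in samples.items():
    w_stat, p_norm = stats.shapiro(sample)
    normality_p_values[name] = p_norm
    print(f"{name}: W = {w_stat:.3f}, p = {p_norm:.3f}")
```

Independence, by contrast, cannot be verified from the numbers alone; it depends on how the data were collected.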

In [None]:
# 4. What is the purpose of ANOVA, and how does it differ from a t-test?


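Purpose of ANOVA:

ANOVA tests whether the means of three or more groups are all equal by comparing the variance between groups to the variance within groups.

Difference from a t-test:

- A t-test compares the means of two groups.
- ANOVA compares the means of three or more groups simultaneously and tests for an overall difference rather than individual pairwise differences.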


In [None]:
# 5. Explain when and why you would use a one-way ANOVA instead of multiple t-tests when comparing more than two groups.

There are two main reasons to use a single one-way ANOVA instead of multiple t-tests when comparing more than two groups:

- Prevents Type I Errors: Multiple t-tests increase the risk of committing a Type I error (false positive) due to multiple comparisons. ANOVA controls this risk by testing all groups simultaneously.

- Efficient: ANOVA tests if there is any significant difference between the means of more than two groups with a single test, whereas multiple t-tests require more tests and complex adjustments.
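
The inflation of the Type I error rate can be made concrete with a short calculation. The sketch below assumes the pairwise tests are independent, which is only an approximation for t-tests that share groups, and uses the helper name `familywise_error_rate`, which is introduced here purely for illustration.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Chance of at least one false positive across all pairwise tests,
    assuming (as an approximation) that the tests are independent."""
    n_tests = comb(n_groups, 2)          # number of pairwise comparisons
    return 1 - (1 - alpha) ** n_tests

for k in (3, 4, 5):
    print(f"{k} groups -> {comb(k, 2)} t-tests, "
          f"familywise error rate = {familywise_error_rate(k):.3f}")
```

Even with three groups, the chance of at least one false positive is well above the nominal 5%, and it grows quickly as groups are added.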

In [None]:
#  6. Explain how variance is partitioned in ANOVA into between-group variance and within-group variance. How does this partitioning contribute to the calculation of the F-statistic?

In **ANOVA**, variance is partitioned into two components:

1. **Between-group variance**: Measures the variability of **group means** around the **overall mean**. It reflects how different the groups are from each other.
   
2. **Within-group variance**: Measures the variability **within each group**. It reflects how much individual data points vary from their respective group means.

### Contribution to F-statistic:
The **F-statistic** is calculated as the ratio of **between-group variance** to **within-group variance**:

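\[
F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
\]

Here \( SS_{\text{between}} \) and \( SS_{\text{within}} \) are the between-group and within-group sums of squares, \( k \) is the number of groups, and \( N \) is the total number of observations.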


- A **large F-statistic** suggests that the between-group variance is much larger than the within-group variance, indicating significant differences between groups.
- A **small F-statistic** suggests that the within-group variance is large relative to the between-group variance, so there is no evidence of a difference between the group means.
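
The partitioning can be worked through by hand and compared with `scipy.stats.f_oneway`. The sketch below uses the regional height data from question 9; the variable names are chosen for this example only.

```python
import numpy as np
from scipy import stats

# Height data for the three regions (from question 9)
groups = [
    np.array([160, 162, 165, 158, 164]),
    np.array([172, 175, 170, 168, 174]),
    np.array([180, 182, 179, 185, 183]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size

# Between-group sum of squares: group means around the overall mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares (variance estimates) and their ratio
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

The two sums of squares add up to the total sum of squares around the overall mean, which is what "partitioning the variance" refers to.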

In [None]:
# 7. Compare the classical (frequentist) approach to ANOVA with the Bayesian approach. What are the key differences in terms of how they handle uncertainty, parameter estimation, and hypothesis testing?

| **Aspect**               | **Classical (Frequentist) Approach**                                                                                      | **Bayesian Approach**                                                                                                                     |
|--------------------------|--------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
| **Uncertainty Handling**  | Handled through point estimates (e.g., sample means) and sampling distributions (e.g., confidence intervals).             | Handled through probability distributions (prior and posterior), reflecting the degree of belief about parameters.                        |
| **Parameter Estimation**  | Parameters are estimated as fixed values (e.g., sample mean) using Maximum Likelihood Estimation (MLE).                    | Parameters are estimated as probability distributions (posterior) updated with observed data and prior beliefs.                           |
| **Hypothesis Testing**    | Uses p-values to test null hypothesis, rejecting it if p-value is below a threshold (e.g., 0.05).                          | Uses Bayes factors or posterior probabilities to compare hypotheses without a strict null hypothesis.                                      |
| **Interpretation of Results** | Results are based on the probability of observed data given the null hypothesis (p-value).                              | Results are based on the probability of a hypothesis given the data (posterior probability).                                              |
| **Model Assumptions**    | Assumes a fixed model with constant parameters across experiments.                                                        | Assumes prior beliefs about parameters, which are updated with data to form posterior distributions.                                      |
| **Flexibility in Hypotheses** | Rigid testing of pre-specified null and alternative hypotheses.                                                      | More flexible, allowing continuous updating of beliefs and comparison of multiple hypotheses.                                             |
| **Focus on Estimation**   | Focus on obtaining point estimates and hypothesis testing.                                                              | Focus on updating beliefs and estimating distributions, considering uncertainty in all parameters.                                        |
| **Reproducibility**       | Results are reproducible with the same data as the model is fixed.                                                        | Results can vary with different priors, but they converge as more data is gathered.                                                       |


In [None]:
# 8. Question: You have two sets of data representing the incomes of two different professions:
# - Profession A: [48, 52, 55, 60, 62]
# - Profession B: [45, 50, 55, 52, 47]
# Perform an F-test to determine if the variances of the two professions'
# incomes are equal. What are your conclusions based on the F-test?

# Task: Use Python to calculate the F-statistic and p-value for the given data.

# Objective: Gain experience in performing F-tests and interpreting the results in terms of variance comparison.

In [5]:
import numpy as np
from scipy import stats

# Data for Profession A and Profession B
profession_a = np.array([48, 52, 55, 60, 62])
profession_b = np.array([45, 50, 55, 52, 47])

# Calculate sample variances
var_a = np.var(profession_a, ddof=1)  # sample variance for A
var_b = np.var(profession_b, ddof=1)  # sample variance for B

# Calculate the F-statistic with the larger variance in the numerator,
# keeping each variance paired with its own degrees of freedom
if var_a >= var_b:
    F_statistic = var_a / var_b
    df_num, df_den = len(profession_a) - 1, len(profession_b) - 1
else:
    F_statistic = var_b / var_a
    df_num, df_den = len(profession_b) - 1, len(profession_a) - 1

# Two-sided p-value from the F-distribution
cdf_value = stats.f.cdf(F_statistic, df_num, df_den)
p_value = 2 * min(cdf_value, 1 - cdf_value)

F_statistic, p_value


(2.089171974522293, 0.49304859900533904)

Conclusion:
- Since the p-value (0.493) is greater than the common significance level of 0.05, we fail to reject the null hypothesis.

- The null hypothesis for this test states that the variances of the two professions' incomes are equal. Since the p-value is large, we do not have enough evidence to conclude that the variances of the incomes for Profession A and Profession B are significantly different.

Final Conclusion:
Based on the F-test, there is no significant difference in the variances of the two professions' incomes.
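
The same decision can be reached by comparing the F-statistic with a critical value instead of using the p-value. The sketch below assumes a two-sided test at a significance level of 0.05, so the upper critical value is taken at the 97.5th percentile of the F-distribution with (4, 4) degrees of freedom.

```python
import numpy as np
from scipy import stats

profession_a = np.array([48, 52, 55, 60, 62])
profession_b = np.array([45, 50, 55, 52, 47])

# Sample variances and F-statistic (larger variance in the numerator)
var_a = np.var(profession_a, ddof=1)
var_b = np.var(profession_b, ddof=1)
F_observed = max(var_a, var_b) / min(var_a, var_b)

# Both samples have 5 observations, so both degrees of freedom equal 4
df_num = df_den = len(profession_a) - 1

# Upper critical value for a two-sided test at alpha = 0.05 (assumed level)
alpha = 0.05
critical_value = stats.f.ppf(1 - alpha / 2, df_num, df_den)

reject_null = F_observed > critical_value
print(F_observed, critical_value, reject_null)
```

Because the observed F-statistic falls below the critical value, this approach also fails to reject the null hypothesis of equal variances.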

In [None]:
# 9. Question: Conduct a one-way ANOVA to test whether there are any statistically significant differences in
# average heights between three different regions with the following data:
# - Region A: [160, 162, 165, 158, 164]
# - Region B: [172, 175, 170, 168, 174]
# - Region C: [180, 182, 179, 185, 183]
# Task: Write Python code to perform the one-way ANOVA and interpret the results.
# Objective: Learn how to perform one-way ANOVA using Python and interpret F-statistic and p-value.

In [6]:

import numpy as np
from scipy import stats

# Data for the three regions
region_a = np.array([160, 162, 165, 158, 164])
region_b = np.array([172, 175, 170, 168, 174])
region_c = np.array([180, 182, 179, 185, 183])

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(region_a, region_b, region_c)

# Display the results
f_statistic, p_value


(67.87330316742101, 2.870664187937026e-07)

Conclusion:

- Since the p-value is much smaller than 0.05, we reject the null hypothesis.
- The null hypothesis states that all group means (average heights of the regions) are equal.

- By rejecting the null hypothesis, we conclude that at least one region's average height differs from the others. The ANOVA alone does not identify which regions differ.

Final Interpretation:
There is strong evidence to suggest that the average heights differ significantly between Region A, Region B, and Region C.