### Q1. Explain the properties of the F-distribution. 

Ans- The F-distribution is also known as the Fisher-Snedecor distribution. It is a probability distribution that is useful for comparing the variances of two or more samples. It has several properties:

**a) Asymmetrical -** The F-distribution is skewed to the right and is not symmetrical. 

**b) Degrees of freedom -** The F-distribution is defined by two parameters: the degrees of freedom of the numerator (d1) and the degrees of freedom of the denominator (d2).

**c) Positive values -** The F-distribution takes only non-negative values, like the chi-square distribution, because it is the ratio of two scaled chi-square variables.

**d) F-statistic -** The F-statistic is a ratio of two variance estimates, so it is always greater than or equal to zero.

**e) Curve -** The shape of the F-distribution depends on both the numerator and denominator degrees of freedom. As both increase, the curve becomes more symmetric, concentrates around 1, and approaches the shape of a normal distribution.
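The properties above can be checked numerically. The following is a minimal sketch using `scipy.stats.f`; the degrees of freedom (5 and 10, then 10, 50 and 500) are arbitrary values chosen only for illustration.

```python
from scipy import stats

# Hypothetical degrees of freedom, chosen only for illustration.
d1, d2 = 5, 10
dist = stats.f(dfn=d1, dfd=d2)

# Non-negative support: there is no probability mass below zero.
prob_below_zero = dist.cdf(0.0)

# Right skew: the mean lies to the right of the median.
# For d2 > 2 the mean is d2 / (d2 - 2).
mean_value = dist.mean()
median_value = dist.median()

print(f"P(F < 0) = {prob_below_zero}")
print(f"mean = {mean_value:.4f}, median = {median_value:.4f}")

# As both degrees of freedom grow, the skewness shrinks toward zero,
# so the curve becomes more symmetric.
skewness_by_df = {}
for df in (10, 50, 500):
    skewness_by_df[df] = float(stats.f(dfn=df, dfd=df).stats(moments="s"))
    print(f"d1 = d2 = {df}: skewness = {skewness_by_df[df]:.3f}")
```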

### Q2. In which types of statistical tests is the F-distribution used, and why is it appropriate for these tests?

Ans- The F-distribution is used in statistical tests such as the F-test and analysis of variance (ANOVA), because both are based on comparing variances:

**F-test -** It is used to compare the variances of two samples. The test statistic is the ratio of the two sample variances, and under the null hypothesis of equal population variances (with normally distributed data) this ratio follows an F-distribution. It is also known as the variance ratio test.

**Analysis of variance (ANOVA) -** It is a statistical method used to test whether the means of two or more groups (typically three or more) are equal. It does so by comparing the variance between the group means with the variance within the groups.

The F-distribution is appropriate for these tests because each test statistic is a ratio of two variance estimates, and under the null hypothesis such a ratio follows an F-distribution. Variance is a measure of how far data are scattered from the mean.
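A small simulation can illustrate why the F-distribution fits these tests. This is a sketch under stated assumptions: the sample sizes, the seed and the number of simulations are arbitrary, and both samples are drawn from the same normal population so that the null hypothesis of equal variances holds.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
n1, n2 = 5, 5                   # hypothetical sample sizes
n_sims = 100_000

# Both samples come from the same normal population, so H0 is true.
x = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n1))
y = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n2))

# Ratio of unbiased sample variances for each simulated pair of samples.
ratios = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)

# Theoretical upper 5% point of the F-distribution with (n1 - 1, n2 - 1) df.
critical_value = stats.f.ppf(0.95, dfn=n1 - 1, dfd=n2 - 1)

# If the ratio follows that F-distribution, about 5% of ratios exceed it.
exceed_rate = np.mean(ratios > critical_value)
print(f"critical value = {critical_value:.3f}")
print(f"simulated exceedance rate = {exceed_rate:.4f}")
```

The simulated exceedance rate should land close to 0.05, which is what makes the F-distribution a usable reference for the variance ratio.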

### Q3. What are the key assumptions required for conducting an F-test to compare the variances of two populations?

Ans - The key assumptions required for conducting an F-test to compare the variances of two populations are as follows:

a) The population from which samples are drawn should be normally distributed.

b) The samples should be random and independent of each other.

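c) Equality of the two population variances is the null hypothesis being tested rather than an assumption. Homogeneity of variances is an assumption of ANOVA and of the pooled t-test. Under the null hypothesis, the ratio of the two sample variances follows an F-distribution with (n1 - 1) and (n2 - 1) degrees of freedom.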
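The normality assumption can be checked informally before running the F-test. The sketch below uses the Shapiro-Wilk test from `scipy.stats`; the sample values are illustrative (the same income figures that appear in Q8), and choosing Shapiro-Wilk is an assumption made here, not something the assignment prescribes. With only five observations per sample the check has very little power, so a large p-value is weak reassurance at best.

```python
from scipy import stats

# Illustrative samples (the income figures used in Q8).
samples = {
    "A": [48, 52, 55, 60, 62],
    "B": [45, 50, 55, 52, 47],
}

# Shapiro-Wilk test: a small p-value is evidence against normality.
normality_p_values = {}
for name, values in samples.items():
    statistic, p_value = stats.shapiro(values)
    normality_p_values[name] = p_value
    print(f"Sample {name}: W = {statistic:.3f}, p = {p_value:.3f}")
```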

### Q4. What is the purpose of ANOVA, and how does it differ from a t-test?

Ans - The purpose of ANOVA is to test whether the means of two or more groups (typically three or more) are equal, using a single overall test. A t-test compares the means of only two groups.

In their standard forms, both tests assume that 1) the samples are independent of each other, 2) the data are (approximately) normally distributed or the sample size is large (e.g., > 30 per group), and 3) the group variances are approximately equal.
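With exactly two groups the two tests agree: the ANOVA F-statistic equals the square of the pooled (equal-variance) t-statistic, and the p-values match. A minimal sketch with illustrative values:

```python
from scipy import stats

# Illustrative values for two groups.
group_1 = [160, 162, 165, 158, 164]
group_2 = [172, 175, 170, 168, 174]

# Pooled two-sample t-test (equal_var=True is the default).
t_stat, t_p = stats.ttest_ind(group_1, group_2)

# One-way ANOVA on the same two groups.
f_stat, f_p = stats.f_oneway(group_1, group_2)

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {t_p:.6g}, ANOVA p = {f_p:.6g}")
```

The difference between the two methods only matters once there are three or more groups, where a single t-test no longer covers the comparison.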


### Q5. Explain when and why you would use a one-way ANOVA instead of multiple t-tests when comparing more than two groups.

Ans - When comparing more than two groups, we should use one-way analysis of variance (ANOVA), because running multiple t-tests increases the chance of committing a Type I error.

A single t-test at the 5% significance level has a 5% chance of committing a Type I error. When we run multiple t-tests on the same data, the chance of at least one Type I error increases. Running two t-tests raises it to about 10% (9.75%). The new error rate is not exactly 5% multiplied by the number of tests, but for a small number of comparisons the two figures are very similar.

Three t-tests, which is what comparing three groups pairwise requires, give about 15% (actually 14.3%), and so on. These error rates are unacceptable.

A one-way ANOVA tests all the group means in a single overall test, so the Type I error rate remains at 5%.

That's why we use a one-way ANOVA instead of multiple t-tests when comparing more than two groups.
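The error rates quoted in this answer follow from the formula 1 - (1 - alpha)^k for k tests. The sketch below assumes the tests are independent, which is only an approximation for pairwise t-tests on the same data; the function name is a hypothetical helper.

```python
from math import comb

def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one Type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

alpha = 0.05
for n_groups in (2, 3, 4, 5):
    # Comparing every pair of groups needs C(n_groups, 2) t-tests.
    n_tests = comb(n_groups, 2)
    naive = n_tests * alpha
    actual = familywise_error_rate(n_tests, alpha)
    print(f"{n_groups} groups -> {n_tests} tests: naive {naive:.3f}, formula {actual:.4f}")
```

The number of pairwise tests grows quickly with the number of groups, which is why the inflation becomes severe beyond a handful of groups.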

### Q6. Explain how variance is partitioned in ANOVA into between-group variance and within-group variance. How does this partitioning contribute to the calculation of the F-statistic?

Ans - The main objective of ANOVA is to partition the total variability observed in the data into components, in order to determine the relative contributions of different sources of variation. These sources of variation fall into two main types:

**a) Between-group variation** - The between-group variation represents the variability between the different groups or treatments being compared. It measures how different the group means are from each other.      

**b) Within-group variation** - The within-group variation reflects the variability within each group and quantifies the spread of the observations within each group.

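Then the F-statistic is calculated as the ratio of the mean square between groups to the mean square within groups. Each mean square is the corresponding sum of squares divided by its degrees of freedom: k - 1 between groups and N - k within groups, for k groups and N observations in total. The F-statistic is used to determine whether the group means differ.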

If the null hypothesis is true, the variance between groups would be roughly the same as the variance within groups, and the F-statistic would be close to 1.0.

If the group assignment has an effect on the measurements, the between group variance would be greater than the within group variance, and the F-statistic would be greater than 1.0.
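The partitioning can be carried out by hand and compared with `scipy.stats.f_oneway`. The sketch below uses the regional height data from Q9; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Height data from Q9, one array per region.
groups = [
    np.array([160, 162, 165, 158, 164]),
    np.array([172, 175, 170, 168, 174]),
    np.array([180, 182, 179, 185, 183]),
]
all_values = np.concatenate(groups)
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations
grand_mean = all_values.mean()

# Between-group sum of squares: spread of group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total sum of squares, which the two components add up to.
ss_total = ((all_values - grand_mean) ** 2).sum()

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against SciPy's one-way ANOVA.
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"SS between = {ss_between:.1f}, SS within = {ss_within:.1f}, SS total = {ss_total:.1f}")
print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}")
```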

### Q7. Compare the classical (frequentist) approach to ANOVA with the Bayesian approach. What are the key differences in terms of how they handle uncertainty, parameter estimation, and hypothesis testing?

Ans - There are 2 main approaches to statistical inference, frequentist and Bayesian, differing in their interpretation of uncertainty, parameter estimation and hypothesis testing.
	
**a) Uncertainty-** The Bayesian approach treats parameters as random and deals with the probability of a hypothesis given a particular data set, whereas the frequentist approach treats parameters as fixed and deals with long-run probabilities (i.e., how probable data like these are, given the null hypothesis).

**b) Parameter estimation-** Frequentists use point estimates of unknown parameters, usually with confidence intervals, to predict new data points, while Bayesians use a full posterior distribution over the possible parameter values.

**c) Hypothesis testing-** The Bayesian approach can calculate the probability that a particular hypothesis is true, whereas the frequentist approach calculates the probability of obtaining a data set at least as extreme as the one collected, assuming the null hypothesis is true (giving the p-value). In classical ANOVA, this is the p-value of the F-statistic.

**d) Prior Information-** Bayesian analysis incorporates prior information into the analysis, whereas a frequentist analysis is purely driven by the data. 
	
Interpretation of results is more intuitive with a Bayesian approach compared with the frequentist approach, which can often be misinterpreted.

### Q8. You have two sets of data representing the incomes of two different professions:

**Profession A : [48, 52, 55, 60, 62]**

**Profession B : [45, 50, 55, 52, 47]**

**Perform an F-test to determine if the variances of the two professions' incomes are equal. What are your conclusions based on the F-test?**

**Task : Use Python to calculate the F-statistic and p-value for the given data.**

**Objective : Gain experience in performing F-test and interpreting the results in terms of variance comparison.** 

In [15]:
import numpy as np
import scipy.stats as stats
from scipy.stats import f

Profession_A = [48, 52, 55, 60, 62]
Profession_B = [45, 50, 55, 52, 47]

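# Calculate the F-statistic as the ratio of the two sample variances.
# ddof=1 gives the unbiased sample variance (np.var defaults to ddof=0).
# Both samples have n = 5, so the ratio matches the ddof=0 ratio up to rounding.
fstats = np.var(Profession_A, ddof=1)/np.var(Profession_B, ddof=1)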

df1 = len(Profession_A) - 1
df2 = len(Profession_B) - 1
alpha = 0.05

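# Upper-tail critical value at level alpha (a one-sided cutoff);
# a two-sided test at the same alpha would use q = 1 - alpha/2.
critical_value = stats.f.ppf(q = 1-alpha, dfn = df1, dfd = df2)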

In [17]:
fstats

2.089171974522293

In [19]:
critical_value

6.388232908695868

In [21]:
#comparing f-statistics and critical value
if fstats > critical_value:
    print('Reject the Null Hypothesis')
else:
    print('Fail to reject the Null Hypothesis')

Fail to reject the Null Hypothesis


In [9]:
# Two-tailed p-value: doubling the upper tail is valid here because fstats > 1
p_value = f.sf(fstats, df1, df2) * 2
p_value

0.4930485990053393

In [13]:
if p_value <= 0.05:
    print('Reject the Null Hypothesis')
else:
    print('Fail to Reject the Null Hypothesis')

Fail to Reject the Null Hypothesis


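***Conclusion - The F-statistic (2.09) is below the critical value (6.39) and the p-value (0.49) is greater than 0.05, so we fail to reject the null hypothesis. There is no statistically significant evidence that the variances of the two professions' incomes differ.***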
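The steps in this answer can also be wrapped into one reusable function. This is a sketch only: `variance_f_test` is a hypothetical helper, not a SciPy function, and it assumes normally distributed, independent samples. It doubles the smaller tail so that the two-sided p-value does not depend on which sample is placed in the numerator.

```python
import numpy as np
from scipy import stats

def variance_f_test(sample_1, sample_2):
    """Two-sided F-test for equal variances (hypothetical helper)."""
    # Unbiased sample variances (ddof=1).
    var_1 = np.var(sample_1, ddof=1)
    var_2 = np.var(sample_2, ddof=1)
    f_stat = var_1 / var_2
    df1, df2 = len(sample_1) - 1, len(sample_2) - 1
    # Two-sided p-value: double the smaller tail, capped at 1.
    smaller_tail = min(stats.f.cdf(f_stat, df1, df2), stats.f.sf(f_stat, df1, df2))
    return f_stat, min(1.0, 2 * smaller_tail)

profession_a = [48, 52, 55, 60, 62]
profession_b = [45, 50, 55, 52, 47]

f_stat, p_value = variance_f_test(profession_a, profession_b)
# Swapping the samples inverts the statistic but leaves the p-value unchanged.
f_swapped, p_swapped = variance_f_test(profession_b, profession_a)

print(f"F = {f_stat:.4f}, p = {p_value:.4f}")
print(f"F (swapped) = {f_swapped:.4f}, p (swapped) = {p_swapped:.4f}")
```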

### Q9. Conduct a one-way ANOVA to test whether there are any statistically significant differences in average heights between three different regions with the following data:

**Region A: [160, 162, 165, 158, 164]**

**Region B: [172, 175, 170, 168, 174]**

**Region C: [180, 182, 179, 185, 183]**

**Task : Write Python code to perform the one-way ANOVA and interpret the results.**

**Objective : Learn how to perform one-way ANOVA using Python and interpret F-statistic and p-value.**

In [25]:
Region_A =  [160, 162, 165, 158, 164]
Region_B =  [172, 175, 170, 168, 174]
Region_C =  [180, 182, 179, 185, 183]

#calculating fstats,pvalue
fstats, pvalue = stats.f_oneway(Region_A,Region_B,Region_C)

In [29]:
fstats

67.87330316742101

In [31]:
pvalue

2.870664187937026e-07

In [27]:
if pvalue <= 0.05:
    print('Reject the Null Hypothesis')
else:
    print('Fail to Reject the Null Hypothesis')

Reject the Null Hypothesis


***Conclusion - Since the p-value is extremely small (significantly less than 0.05), we reject the null hypothesis. This indicates that there are statistically significant differences in average heights among the three regions.***

# Assignment Completed