## 13th March Assignment

## 1:ans:-

In [None]:
ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups to 
determine if there are significant differences among them. To use ANOVA, certain assumptions must be met for
the results to be valid. These assumptions include:

Independence: The observations are independent of one another, both within each group and across groups.
In other words, the value of one observation should not be influenced by or related to the value of any other.

Normality: The dependent variable (the variable being measured) should be approximately normally distributed
within each group. The p-value of the F-test is derived under this assumption, although ANOVA is fairly robust
to moderate departures from normality when group sizes are large and similar.

Homogeneity of variance: The variance of the dependent variable should be approximately equal across all groups.
Homogeneity of variance ensures that the groups being compared have similar levels of variability.

Homogeneity of regression slopes (if using ANCOVA): If ANCOVA (Analysis of Covariance) is used, which involves 
adding one or more covariates to the analysis, the relationship between the covariate(s) and the dependent variable 
should be consistent across all groups. This assumption ensures that the effect of the covariate(s) is similar for all groups.

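Violations of these assumptions can impact the validity of the ANOVA results. Some examples of violations and their 
impacts are:

Violation of independence: If observations are not independent, such as when there are repeated measures 
or clustered data, the assumption is violated. This can push the actual Type I error rate above or below the
nominal significance level and give unreliable estimates of the treatment effects.

Violation of normality: Strongly skewed data or extreme outliers, especially in small samples, can make the
p-value of the F-test inaccurate.

Violation of homogeneity of variance: If one group is far more variable than the others, particularly when group
sizes are unequal, the F-test can become too liberal or too conservative. Welch's ANOVA is a common alternative.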
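
The normality and equal-variance assumptions above can be checked before running ANOVA. The following is a minimal sketch, assuming three hypothetical groups generated at random purely for illustration; it uses the Shapiro-Wilk test for normality within each group and Levene's test for homogeneity of variance. Independence usually has to be judged from the study design rather than tested.

```python
import numpy as np
from scipy import stats

# Hypothetical data (an assumption for illustration): three groups of 30 observations
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (5.0, 5.5, 6.0)]

# Normality within each group: Shapiro-Wilk test (H0: the data are normally distributed)
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variance: Levene's test (H0: all group variances are equal)
levene_stat, levene_p = stats.levene(*groups)

for i, p in enumerate(shapiro_p, start=1):
    print(f"Group {i} Shapiro-Wilk p-value: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold, in which case a transformation, Welch's ANOVA, or a non-parametric test such as Kruskal-Wallis may be worth considering.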



## 2:ans:-

In [None]:
Analysis of Variance (ANOVA) is a statistical technique used to compare the means of two or more groups to determine
if there are significant differences among them. There are three main types of ANOVA:

One-Way ANOVA:
One-Way ANOVA is used when there is a single independent variable (factor) with three or more levels/groups.
It is employed to examine whether there are significant differences in the means of the dependent variable across
the different groups. For example, if you want to compare the average test scores of students from three different
schools, you can use One-Way ANOVA to determine if there are significant differences in the scores among the schools.

Two-Way ANOVA:
Two-Way ANOVA is used when there are two independent variables (factors), each with two or more levels/groups.
It allows you to examine the main effect of each factor as well as the effect of their interaction
on the dependent variable. For instance, if you want to investigate the effects
of both gender and treatment type on the recovery time of patients, you can use Two-Way ANOVA to analyze the
data and determine if there are significant differences among the groups based on these factors.

Repeated Measures ANOVA:
Repeated Measures ANOVA is used when you have a single group of participants and you measure the same variable
multiple times under different conditions or time points. It is useful for studying within-subject changes over
time or across conditions. For example, if you are conducting a study on the effectiveness of a new teaching
method and you measure the test scores of students before the method is implemented, immediately after, and one
month later, you can use Repeated Measures ANOVA to examine if there are significant differences in the scores
across the time points.
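
To make the repeated measures case concrete, here is a minimal sketch that computes a one-factor repeated measures ANOVA by hand with NumPy. The test scores are hypothetical values invented for illustration (five students measured before, immediately after, and one month after the teaching method). The key idea is that variation between subjects is removed from the error term, and the sketch assumes sphericity holds.

```python
import numpy as np
from scipy.stats import f

# Hypothetical scores: rows are students, columns are the three time points
scores = np.array([
    [60.0, 68.0, 66.0],
    [72.0, 75.0, 78.0],
    [55.0, 63.0, 60.0],
    [80.0, 84.0, 85.0],
    [65.0, 70.0, 72.0],
])
n_subjects, n_conditions = scores.shape
grand_mean = scores.mean()

# Partition the total variation into conditions, subjects, and residual error
ss_total = np.sum((scores - grand_mean) ** 2)
ss_conditions = n_subjects * np.sum((scores.mean(axis=0) - grand_mean) ** 2)
ss_subjects = n_conditions * np.sum((scores.mean(axis=1) - grand_mean) ** 2)
ss_error = ss_total - ss_conditions - ss_subjects

df_conditions = n_conditions - 1
df_error = (n_subjects - 1) * (n_conditions - 1)

# F-statistic for the within-subject factor and its upper-tail p-value
f_value = (ss_conditions / df_conditions) / (ss_error / df_error)
p_value = f.sf(f_value, df_conditions, df_error)

print(f"F({df_conditions}, {df_error}) = {f_value:.2f}, p = {p_value:.4f}")
```

In a one-way ANOVA on the same numbers, the between-student variation would stay in the error term, which is why the repeated measures design is usually more sensitive to changes over time.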



## 3:ans:-

In [1]:
import numpy as np
from scipy.stats import f

def f_test(sample1, sample2):
    """Two-tailed F-test for equal variances (assumes normally distributed samples)."""
    n1 = len(sample1)
    n2 = len(sample2)
    
    dof1 = n1 - 1
    dof2 = n2 - 1
    
    # Use unbiased sample variances (ddof=1) so the ratio follows F(dof1, dof2) under H0
    f_value = np.var(sample1, ddof=1) / np.var(sample2, ddof=1)
    
    # Two-tailed p-value: double the smaller tail probability, capped at 1
    cdf = f.cdf(f_value, dof1, dof2)
    p_value = min(2 * min(cdf, 1 - cdf), 1.0)
    
    return f_value, (dof1, dof2), p_value

# Generate random samples from two normal distributions
np.random.seed(42)  # For reproducibility

# Population standard deviations are 1 and 2, so the population variances are 1 and 4
sample1 = np.random.normal(0, 1, size=100)
sample2 = np.random.normal(0, 2, size=100)

# Perform the F-test
f_value, degrees_of_freedom, p_value = f_test(sample1, sample2)

# Print the results
print(f"F-value: {f_value:.4f}")
print("Degrees of freedom:", degrees_of_freedom)
print(f"P-value: {p_value:.3e}")


F-value: 0.2267
Degrees of freedom: (99, 99)
P-value: 1.621e-12


## 4:ans:-

In [3]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
group1 = [1, 2, 3, 4, 5]
group2 = [2, 4, 6, 8, 10]
group3 = [3, 6, 9, 12, 15]

# Combine the data
data = np.concatenate([group1, group2, group3])

# Create the group labels
labels = ['Group 1'] * len(group1) + ['Group 2'] * len(group2) + ['Group 3'] * len(group3)

# Create a dataframe
df = pd.DataFrame({'Data': data, 'Group': labels})

# Perform one-way ANOVA
model = ols('Data ~ C(Group)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

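# Calculate the sums of squares:
#   SST = total, SSE = error (within groups, from the residuals),
#   SSR = regression (between groups, explained by the model)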
SST = np.sum((df['Data'] - np.mean(df['Data']))**2)
SSE = np.sum(model.resid**2)
SSR = np.sum((model.fittedvalues - np.mean(df['Data']))**2)

print("SST:", SST)
print("SSE:", SSE)
print("SSR:", SSR)


SST: 230.0
SSE: 140.0
SSR: 90.00000000000003
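
As a cross-check on the statsmodels result, the same decomposition can be sketched directly with NumPy using the three example groups from the cell above. The variable names (`sst`, `ssr`, `sse`, `f_manual`) are chosen here for illustration; `ssr` follows the cell's convention of meaning the between-group sum of squares.

```python
import numpy as np
from scipy.stats import f_oneway

# Same example data as in the cell above
groups = [
    np.array([1, 2, 3, 4, 5]),
    np.array([2, 4, 6, 8, 10]),
    np.array([3, 6, 9, 12, 15]),
]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Total, between-group, and within-group sums of squares
sst = np.sum((all_data - grand_mean) ** 2)
ssr = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sse = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F-statistic from the mean squares, compared against scipy's one-way ANOVA
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)
f_manual = (ssr / df_between) / (sse / df_within)
f_scipy, p_scipy = f_oneway(*groups)

print("SST:", sst, "SSR:", ssr, "SSE:", sse)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
```

The identity SST = SSR + SSE holds here as 230 = 90 + 140, and the F-statistic is the ratio of the two mean squares, (90 / 2) / (140 / 12).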


## 5:ans:-

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe with your data
data = {'A': [1, 1, 2, 2, 3, 3],
        'B': [1, 2, 1, 2, 1, 2],
        'Y': [2, 4, 6, 8, 10, 12]}
df = pd.DataFrame(data)

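# Fit the two-way ANOVA model.
# Note: A and B are entered as numeric predictors here. Writing C(A) and C(B) would treat
# them as categorical factors, but with only one observation per cell a categorical model
# with interaction would leave no residual degrees of freedom for the F-tests.
model = ols('Y ~ A + B + A:B', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract the sum of squares attributed to each main effect and to the interaction
main_effect_A = anova_table.loc['A', 'sum_sq']
main_effect_B = anova_table.loc['B', 'sum_sq']
interaction_effect = anova_table.loc['A:B', 'sum_sq']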

print("Main Effect A:", main_effect_A)
print("Main Effect B:", main_effect_B)
print("Interaction Effect:", interaction_effect)


Main Effect A: 64.00000000000009
Main Effect B: 6.000000000000015
Interaction Effect: 1.5974433330725486e-29


## 6:ans:-

In [None]:
Given a one-way ANOVA with an F-statistic of 5.23 and a p-value of 0.02,
we can draw the following conclusions:

Differences between the groups: At the conventional 0.05 significance level, the p-value of 0.02 leads us to
reject the null hypothesis that all group means are equal. In other words, at least one group mean differs
from the others, although the F-test alone does not say which one.

Interpretation of the results: The p-value of 0.02 means that, if all group means were truly equal, an
F-statistic of 5.23 or larger would be expected only about 2% of the time. Because this is below the
predetermined significance level (e.g., 0.05), the result is considered statistically significant.
Statistical significance does not indicate how large the differences are, so an effect size such as
eta squared should be reported alongside the test.

Post-hoc tests: After observing a significant result in the ANOVA, it is common to conduct post-hoc tests
to determine which specific groups differ from each other. Post-hoc tests, such as Tukey's Honestly Significant
Difference (HSD) test or the Bonferroni correction, can help identify pairwise differences between groups.
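
The link between the F-statistic and its p-value can be sketched with `scipy.stats.f`. The degrees of freedom are not stated in the question, so the values below (2 and 14) are assumptions chosen only because they give a p-value close to the reported 0.02; with the real group count and sample size the numbers would differ.

```python
from scipy.stats import f

f_statistic = 5.23

# Assumed degrees of freedom (not given in the question): 3 groups, 17 observations
df_between, df_within = 2, 14
alpha = 0.05

# Upper-tail probability of the F distribution: P(F >= 5.23) if all means are equal
p_value = f.sf(f_statistic, df_between, df_within)

# Critical value that the F-statistic must exceed to be significant at alpha
f_critical = f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha={alpha}: {f_critical:.2f}")
print("Reject H0:", reject_null)
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value are two views of the same decision rule.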


## 7:ans:-

In [None]:
In a repeated measures ANOVA, missing data can pose challenges because the analysis requires complete data
for all participants at all time points. There are several methods to handle missing data in this context,
each with its own potential consequences. Here are a few common approaches:

Complete Case Analysis (Listwise deletion): This method involves excluding participants with missing data 
from the analysis. It only uses cases with complete data, leading to a reduction in sample size. This approach
can introduce bias if the missingness is related to the variables being analyzed, potentially impacting the
representativeness of the results.

Pairwise Deletion: With this approach, participants with missing data on specific variables are excluded only 
from the analyses involving those variables. It uses all available data, but can lead to biased results if the
missingness is not random. The estimation of standard errors and statistical power may also be affected.

Mean Imputation: This method replaces missing values with the mean of the available data for the respective
variable. Even when the data are missing completely at random (MCAR), it artificially reduces variability, so
standard errors are underestimated. If the missingness is related to the outcome variable or other covariates,
the estimates themselves are biased.

Last Observation Carried Forward (LOCF): This approach replaces each missing value with the participant's last
observed value. It assumes that the outcome stays constant after the last observation. LOCF can introduce bias
if that assumption is violated, for example when scores are still changing over time.

Multiple Imputation: This technique involves creating multiple plausible imputations for the missing data based
on statistical models. The imputed datasets are then analyzed separately, and the results are combined using
Rubin's rules. Multiple imputation accounts for uncertainty due to missing data and produces unbiased estimates
when the data are missing at random (MAR) and the imputation model is correctly specified. However, it can be
computationally intensive and requires careful implementation.

The choice of method to handle missing data should be based on the assumptions about the missingness mechanism
and the potential consequences of each method. It is important to consider the potential bias, loss of power,
and generalizability of the results associated with each approach. Furthermore, sensitivity analyses can be conducted
to examine the robustness of the findings under different missing data assumptions and handling methods.
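
Three of the simpler approaches above can be sketched with pandas. The scores below are hypothetical values in wide format (one row per participant, one column per time point), made up for illustration; multiple imputation is omitted because it needs a dedicated modelling step.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are participants, columns are time points
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 13.0],
    "t2": [12.0, np.nan, 13.0, 15.0],
    "t3": [14.0, 15.0, np.nan, 17.0],
})

# Listwise deletion: keep only participants with complete data
complete_cases = scores.dropna()

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each participant's last observed value forward across time points
locf = scores.ffill(axis=1)

print("Participants kept after listwise deletion:", len(complete_cases))
print(mean_imputed)
print(locf)
```

Listwise deletion halves the sample here, while the two imputation methods keep all four participants at the cost of the assumptions described above.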




## 8:ans:-

In [None]:
After conducting an Analysis of Variance (ANOVA) and obtaining a significant result, post-hoc tests are often
used to determine which specific groups differ from each other. Some common post-hoc tests used after ANOVA include:

Tukey's Honestly Significant Difference (HSD): This test compares all possible pairs of group means while
controlling the family-wise error rate. It assumes equal variances; the Tukey-Kramer variant handles unequal
sample sizes.

Bonferroni correction: This method divides the significance level by the number of comparisons (or, equivalently,
multiplies each p-value by that number) to maintain an overall family-wise error rate. It is more conservative
than Tukey's HSD when all pairwise comparisons are made and is best suited to a small number of planned comparisons.

Scheffé's test: This test controls the family-wise error rate for all possible contrasts among the group means,
not only pairwise comparisons. It is more conservative than Tukey's HSD for pairwise comparisons and is best
suited to complex or unplanned contrasts.

Dunnett's test: This test is used when comparing multiple treatment groups against a control group. It adjusts for 
multiple comparisons while maintaining the overall error rate.

Fisher's Least Significant Difference (LSD): This test performs pairwise t-tests using the pooled error variance
from the ANOVA, and only after a significant overall F-test. It compares all possible pairs of means, but it does
not control the family-wise error rate when there are more than three groups, so it is more liberal than Tukey's HSD.

Games-Howell test: This test is used when the assumption of equal variances is violated, including when sample
sizes are unequal. It performs pairwise comparisons with Welch-type standard errors and degrees of freedom, so it
does not pool variances across groups.

An example situation where a post-hoc test might be necessary is in a study examining the effectiveness of different
treatments for a medical condition. Suppose researchers conducted an ANOVA with four treatment groups (A, B, C, and D)
and found a significant overall difference. To determine which specific treatments differ from each other, they would
conduct post-hoc tests. They could use Tukey's HSD, which would allow them to compare all possible pairs of means and
identify which treatments are significantly different from each other. This would provide valuable information for 
clinicians in choosing the most effective treatment option.
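
As a rough illustration of pairwise post-hoc testing for the four-treatment example, the sketch below uses Bonferroni-corrected pairwise t-tests rather than Tukey's HSD, because they need only `scipy.stats.ttest_ind`. The outcome values for treatments A to D are hypothetical numbers invented for this example.

```python
from itertools import combinations
from scipy import stats

# Hypothetical outcome measurements for four treatments (illustration only)
treatments = {
    "A": [12.1, 11.8, 12.5, 12.0, 11.9],
    "B": [12.3, 12.0, 12.6, 12.2, 12.4],
    "C": [14.0, 13.7, 14.2, 13.9, 14.1],
    "D": [12.2, 11.9, 12.4, 12.1, 12.3],
}
alpha = 0.05

# All pairwise comparisons: 4 groups give 6 pairs
pairs = list(combinations(treatments, 2))

# Bonferroni adjustment: multiply each raw p-value by the number of comparisons, capped at 1
adjusted = {}
for g1, g2 in pairs:
    t_stat, p_raw = stats.ttest_ind(treatments[g1], treatments[g2])
    adjusted[(g1, g2)] = min(p_raw * len(pairs), 1.0)

for (g1, g2), p_adj in adjusted.items():
    verdict = "significant" if p_adj < alpha else "not significant"
    print(f"{g1} vs {g2}: adjusted p = {p_adj:.4f} ({verdict})")
```

Tukey's HSD would typically be somewhat less conservative for the same set of all pairwise comparisons.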


## 9:ans:-

In [5]:
import scipy.stats as stats

# Define the weight loss data for each diet.
# Only the four placeholder values shown per diet are used here; the rest of the data is
# elided, so the results below are illustrative and must be recomputed on the full data.
diet_a = [2.5, 3.1, 1.8, 2.9]  # ... replace with actual data for Diet A
diet_b = [1.7, 2.2, 2.5, 2.1]  # ... replace with actual data for Diet B
diet_c = [1.9, 1.5, 2.8, 2.4]  # ... replace with actual data for Diet C

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

# Print the results
print(f"F-Statistic: {f_statistic:.3f}")
print(f"p-value: {p_value:.3f}")


F-Statistic: 1.008
p-value: 0.403