**Note: Please create a local copy of this notebook on your Google account before using it**

# *Data Science for Energy and Buildings*

# *ANalysis Of VAriance*

Authors: Tim Diller and Gregor Henze

Created: August 14th, 2023



In this chapter, we will explore the statistical technique known as Analysis of Variance (ANOVA), a tool that assists in evaluating differences among multiple groups. ANOVA is a systematic approach to determine whether observed differences between group means are statistically significant or if they could be attributed to random chance.

## Functional Explanation:
Consider a scenario with several groups, each subjected to a different treatment or condition. ANOVA partitions the total variability in the data into two parts: the variability between the group means and the variability within each group. The ratio of these two variance estimates is the F-statistic. A large ratio indicates that the group means differ by more than the within-group scatter alone would explain. The primary question ANOVA addresses is: "Are the observed differences in means large enough to be statistically significant, or could they have occurred by chance?"
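
The comparison of between-group and within-group variance can be sketched by computing the one-way F-statistic by hand and checking it against `scipy.stats.f_oneway`. This is a minimal illustration, not part of the original notebook: the three groups, their means, and the helper name `one_way_f` are assumptions made for the example.

```python
import numpy as np
from scipy.stats import f, f_oneway

rng = np.random.default_rng(0)

# Hypothetical scores for three treatment groups (illustrative data only)
groups = [rng.normal(loc=mu, scale=20, size=30) for mu in (100, 105, 115)]

def one_way_f(groups):
    """Return the one-way ANOVA F-statistic and p-value (hypothetical helper)."""
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = np.concatenate(groups).mean()

    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)        # between-group mean square
    ms_within = ss_within / (n_total - k)    # within-group mean square
    f_stat = ms_between / ms_within
    p_value = f.sf(f_stat, k - 1, n_total - k)  # upper tail of the F distribution
    return f_stat, p_value

f_manual, p_manual = one_way_f(groups)
f_scipy, p_scipy = f_oneway(*groups)

print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4g}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4g}")
```

Both routes should report the same F-statistic and p-value, which shows that the test is nothing more than a ratio of two variance estimates.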

## Statistical Significance:
ANOVA tests all groups in a single procedure and can account for the effects of multiple factors simultaneously, which makes it a versatile analytical tool. It helps researchers judge whether the observed differences are likely to reflect a real effect or are simply due to inherent variability, and so guides decisions based on experimental findings. Keep in mind that statistical significance is not the same as practical importance: with a large enough sample, even a very small effect can yield a small p-value.

Beyond its core functionality, ANOVA extends its usefulness to more complex experimental designs. It allows for the exploration of interactions among factors, which is valuable for revealing dependencies and uncovering nuanced relationships that might not be apparent through simpler analyses.
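
An interaction between two factors can be tested with a two-way ANOVA. The sketch below uses the same `statsmodels` calls as the rest of this notebook (`ols` and `anova_lm`), but the data are invented for illustration: the second factor `Site`, its levels, and the size of the interaction effect are all assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)

# Hypothetical two-factor design: Treatment (A/B) crossed with Site (X/Y).
# By construction, treatment 'A' raises the score only at site 'Y',
# so the effect of one factor depends on the level of the other.
rows = []
for treatment in ('A', 'B'):
    for site in ('X', 'Y'):
        mean = 100 + (20 if (treatment == 'A' and site == 'Y') else 0)
        for score in rng.normal(loc=mean, scale=10, size=40):
            rows.append({'Treatment': treatment, 'Site': site, 'Score': score})

df = pd.DataFrame(rows)

# '*' expands to both main effects plus their interaction term
model = ols('Score ~ C(Treatment) * C(Site)', data=df).fit()
table = anova_lm(model, typ=2)

print(table)
print("Interaction p-value:", table.loc['C(Treatment):C(Site)', 'PR(>F)'])
```

The row labeled `C(Treatment):C(Site)` holds the interaction test. A small p-value there suggests that the treatment effect differs between sites, something a pair of separate one-way analyses would not reveal.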

## Practical Considerations:
ANOVA results are only as reliable as the assumptions behind them. Ideally, the residuals (each observation minus its group mean) should be approximately normally distributed, the groups should have similar variances (homogeneity of variances), and the observations should be independent of one another. When these conditions hold reasonably well, the p-values reported by ANOVA can be trusted.
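
Two of these assumptions can be checked with standard tests. The sketch below uses `scipy.stats.shapiro` for normality of the residuals and `scipy.stats.levene` for equality of variances. The two groups are simulated for illustration and are an assumption of the example. Independence cannot be tested this way, since it depends on how the data were collected.

```python
import numpy as np
from scipy.stats import shapiro, levene

rng = np.random.default_rng(2)

# Hypothetical scores for two treatment groups (illustrative data only)
group_a = rng.normal(loc=110, scale=20, size=50)
group_b = rng.normal(loc=100, scale=20, size=50)

# Normality is assessed on the residuals: each value minus its own group mean
residuals = np.concatenate([group_a - group_a.mean(), group_b - group_b.mean()])
shapiro_stat, shapiro_p = shapiro(residuals)

# Levene's test compares the spread of the groups
levene_stat, levene_p = levene(group_a, group_b)

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene:       W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

In both tests the null hypothesis is that the assumption holds, so a small p-value is a warning sign rather than a confirmation. With large samples these tests flag even minor deviations, so residual plots are a useful complement.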

To summarize, ANOVA evaluates differences among groups and helps distinguish genuine disparities from random fluctuations, which makes it a core tool for drawing conclusions from experimental data. The code below simulates two treatments over a range of effect sizes and sample sizes and plots the resulting ANOVA p-values.

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import f_oneway
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import plotly.graph_objects as go
import matplotlib.pyplot as plt

np.random.seed(0)

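# Generate a dataset in which treatment 'A' shifts the mean score by
# `treatment_effect` relative to treatment 'B'
def generate_dataset(treatment_effect, n_samples, baseline=100, stdev=20):
    """Return a DataFrame of randomly assigned 'Treatment' labels and normally distributed 'Score' values."""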
    treatments = ['A', 'B']
    data = []

    for _ in range(n_samples):
        treatment = np.random.choice(treatments)

        if treatment == 'A':
            score = np.random.normal(loc=baseline + treatment_effect, scale=stdev)
        else:
            score = np.random.normal(loc=baseline, scale=stdev)

        data.append({'Treatment': treatment,'Score': score})

    return pd.DataFrame(data)

# Create one dataset per effect size, then analyze each at several sample sizes
datasets = []

sample_sizes = [10, 20, 50, 80, 100, 200, 300, 1000, 10000]
effects = [0, 1, 2, 4, 5, 10, 20, 50, 100]
stdev = 20

max_sample = max(sample_sizes)

for effect in effects:
    datasets.append(generate_dataset(effect, max_sample, baseline=100, stdev=stdev))

p_values = []

for index, dataset in enumerate(datasets):
    print(index, effects[index])
    p_value_list = []
    for sample_size in sample_sizes:
        model = ols('Score ~ Treatment', data=dataset.head(sample_size)).fit()
        anova_table = anova_lm(model, typ=2)

        p_value_list.append(anova_table['PR(>F)']['Treatment'])  # p-value for the Treatment effect

    p_values.append(p_value_list)


# Create a figure
fig = go.Figure()

# Add traces for each p_value_list
for index, p_value_list in enumerate(p_values):
    name = 'effect strength: ' + str(round(effects[index]/stdev, 3))
    fig.add_trace(go.Scatter(x=sample_sizes, y=p_value_list, mode='lines+markers', name=name))

# Set plot layout
fig.update_layout(
    title="P-Value Analysis for different effects and sample sizes",
    xaxis_title="Sample Sizes (log scale)",
    yaxis_title="P-Values",
    xaxis_type='log',  # Set x-axis to log scale
    showlegend=True
)

# Show the plot
fig.show()


0 0
1 1
2 2
3 4
4 5
5 10
6 20
7 50
8 100
