**Note: Please create a local copy of this notebook on your Google account before using it**

# *Data Science for Energy and Buildings*

### *The central limit theorem*

Authors: Tim Diller and Gregor Henze

Created: August 4th, 2023


## Introduction

The Central Limit Theorem (CLT) is a cornerstone of statistics and probability theory. It describes the behavior of sample averages drawn from different populations: whatever the shape of the original population (provided it has a finite variance), the distribution of the sample mean approaches a normal distribution as the sample size grows. The theorem's importance lies in showing how randomness evens out when we average data points, which lets us better understand and work with the variability in datasets.
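
For reference, the classical form of the theorem can be written as follows. This is the textbook statement, not something specific to this notebook, and it assumes independent, identically distributed draws $X_1, \dots, X_n$ with mean $\mu$ and finite standard deviation $\sigma$:

$$
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty
$$

Put informally, for large $n$ the sample mean is approximately normal with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$.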

### Relevance of the Central Limit Theorem:

The CLT has broad relevance across scientific and practical domains. It serves as a foundational concept in fields ranging from physics to social sciences. This theorem is especially handy for making sense of data when we can't collect information from entire populations and have to work with smaller samples. It helps us make educated guesses about bigger groups based on what we've observed in smaller groups. By doing so, the CLT aids researchers in drawing accurate conclusions and building reliable models even in the face of uncertainty and randomness.

## Coding example:

We will illustrate the CLT with various distributions, observing how the distribution of the mean of a sample changes with the sample size.

First (as always in Python), we import the relevant packages:

In [None]:
import numpy as np
import plotly.subplots as sp
import plotly.graph_objs as go
from scipy.stats import norm

Second, we define the underlying distributions and build functions to draw samples of arbitrary size from them:

In [None]:
# this defines a class, holding our sampling functions:
class SampleGenerator:
    # these are the values we need to initialize our generator
    def __init__(self, normal_mean, normal_std, uniform_low, uniform_high, exp_scale):

        # parameters of normal distribution
        self.normal_mean = normal_mean
        self.normal_std = normal_std

        # parameters of uniform distribution
        self.uniform_low = uniform_low
        self.uniform_high = uniform_high

        # parameters of exponential distribution
        self.exp_scale = exp_scale

    # this is a normal distribution
    def get_normal_sample(self, amount):
        return np.random.normal(self.normal_mean, self.normal_std, amount)

    # this is a uniform distribution
    def get_uniform_sample(self, amount):
        return np.random.uniform(self.uniform_low, self.uniform_high, amount)

    # this is an exponential distribution
    def get_nonuniform_sample(self, amount):
        return np.random.exponential(self.exp_scale, amount)


Now we can generate an instance of our SampleGenerator, and define the parameters of each distribution:

In [None]:
Generator = SampleGenerator(normal_mean=10,
                            normal_std = 7,
                            uniform_low=0,
                            uniform_high=20,
                            exp_scale=10)

We can test to see if it works:

In [None]:
Generator.get_normal_sample(5)

array([ 9.95758928,  6.93113818,  2.07510783,  3.88587497, -0.27131118])

To get a feeling for the kind of distributions our SampleGenerator produces, we can draw really large samples and plot each of them in a histogram:

In [None]:


# we define the number of samples we want to take
num_samples = 100000

# we get a sample for each distribution
normal_sample = Generator.get_normal_sample(num_samples)
uniform_sample = Generator.get_uniform_sample(num_samples)
nonuniform_sample = Generator.get_nonuniform_sample(num_samples)

# we make a plot
fig = sp.make_subplots(rows=1, cols=3)

# we add the distributions to the plot
fig.add_trace(go.Histogram(x=normal_sample, nbinsx=100, name='Normal distribution'),
              row=1, col= 1)

fig.add_trace(go.Histogram(x=uniform_sample, nbinsx=100, name='Uniform distribution'),
              row=1, col=2)

fig.add_trace(go.Histogram(x=nonuniform_sample, nbinsx=100, name='Exponential distribution'),
              row=1, col=3)

fig.show()
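
As a numerical complement to the histograms, the sketch below compares the empirical mean and standard deviation of large draws against the textbook values for each distribution. It uses the same parameters as the Generator above, but draws from a seeded `np.random.default_rng` instead of the SampleGenerator class so that it runs on its own; the seed and the names `rng`, `draws`, and `theory` are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical seeded generator so the check is repeatable (not part of the original notebook)
rng = np.random.default_rng(42)
n_draws = 100_000

# Same distribution parameters as passed to SampleGenerator above
draws = {
    "normal": rng.normal(10, 7, n_draws),
    "uniform": rng.uniform(0, 20, n_draws),
    "exponential": rng.exponential(10, n_draws),
}

# Textbook (mean, standard deviation) of each distribution:
#   normal(mu, sigma)  -> (mu, sigma)
#   uniform(a, b)      -> ((a + b) / 2, (b - a) / sqrt(12))
#   exponential(scale) -> (scale, scale)
theory = {
    "normal": (10.0, 7.0),
    "uniform": (10.0, 20 / np.sqrt(12)),
    "exponential": (10.0, 10.0),
}

for name, x in draws.items():
    mean_t, std_t = theory[name]
    print(f"{name:12s} mean {x.mean():6.2f} (theory {mean_t:5.2f})  "
          f"std {x.std():6.2f} (theory {std_t:5.2f})")
```

All three distributions share a mean of 10 with these parameters, which is why the histograms of the means further down all center on the same value.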


In the next step, we want to see how the means of the samples behave. So we draw 1000 samples of sample size 1 and calculate the mean of each sample (which is simply the sample value).

In [None]:
# we define the amount of samples and the sample size
amount_samples = 1000
sample_size = 1

# we use the Generator to get our samples
normal_sample = [np.mean(Generator.get_normal_sample(sample_size)) for _ in range(amount_samples)]
uniform_sample = [np.mean(Generator.get_uniform_sample(sample_size)) for _ in range(amount_samples)]
nonuniform_sample = [np.mean(Generator.get_nonuniform_sample(sample_size)) for _ in range(amount_samples)]


Next we plot a histogram of the sampled values:


In [None]:
# we make a plot
fig = sp.make_subplots(rows=1, cols=3)

# we add the distributions to the plot
fig.add_trace(go.Histogram(x=normal_sample, nbinsx=100, name='Normal distribution'),
              row=1, col= 1)

fig.add_trace(go.Histogram(x=uniform_sample, nbinsx=100, name='Uniform distribution'),
              row=1, col=2)

fig.add_trace(go.Histogram(x=nonuniform_sample, nbinsx=100, name='Exponential distribution'),
              row=1, col=3)

fig.show()

The results are as expected: with a sample size of 1, the distribution of the sample means mirrors the underlying distribution. Next, we draw samples of size 3. We only change sample_size; the rest of the code stays the same:

In [None]:
# we define the amount of samples we want to take, and the size of each sample
amount_samples = 1000
sample_size = 3

# get the samples
normal_sample = [np.mean(Generator.get_normal_sample(sample_size)) for _ in range(amount_samples)]
uniform_sample = [np.mean(Generator.get_uniform_sample(sample_size)) for _ in range(amount_samples)]
nonuniform_sample = [np.mean(Generator.get_nonuniform_sample(sample_size)) for _ in range(amount_samples)]

# we make a plot
fig = sp.make_subplots(rows=1, cols=3)

# we add the distributions to the plot
fig.add_trace(go.Histogram(x=normal_sample, nbinsx=100, name='Normal distribution'),
              row=1, col= 1)

fig.add_trace(go.Histogram(x=uniform_sample, nbinsx=100, name='Uniform distribution'),
              row=1, col=2)

fig.add_trace(go.Histogram(x=nonuniform_sample, nbinsx=100, name='Exponential distribution'),
              row=1, col=3)

fig.show()
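
One way to put a number on the narrowing of the histograms is to compare the spread of the sample means with the standard error $\sigma / \sqrt{n}$ predicted by theory. The following is a minimal sketch under the same distribution parameters as above; the seeded `rng`, the dictionaries `population_std`, `raw`, and `spread`, and the vectorized drawing (one row per sample) are illustrative choices, not the notebook's own approach.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, only for repeatability
amount_samples = 1000
sample_size = 3

# Population standard deviations for the parameters used in this notebook
population_std = {"normal": 7.0, "uniform": 20 / np.sqrt(12), "exponential": 10.0}

# One row per sample, one column per draw within that sample
raw = {
    "normal": rng.normal(10, 7, size=(amount_samples, sample_size)),
    "uniform": rng.uniform(0, 20, size=(amount_samples, sample_size)),
    "exponential": rng.exponential(10, size=(amount_samples, sample_size)),
}

spread = {}
for name, x in raw.items():
    means = x.mean(axis=1)  # mean of each sample
    expected = population_std[name] / np.sqrt(sample_size)  # sigma / sqrt(n)
    spread[name] = (means.std(), expected)
    print(f"{name:12s} std of means {means.std():5.2f}  expected {expected:5.2f}")
```

The observed and expected values should agree to within sampling noise, which is a few percent for 1000 samples.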

We can see how the distributions of the means change compared to the plots above. Now we can write a for loop over a list of different sample sizes and observe how the distribution of the means changes:

In [None]:


amount_samples = 3000
sample_sizes = [1, 2, 4, 5, 10, 15, 25, 50, 100]

# Create a plot with one row for each sample_size
fig = sp.make_subplots(rows=len(sample_sizes), cols=3, shared_xaxes=True)

# Define colors for each column
column_colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

# Add headers
fig.add_annotation(text="Normal Distribution", xref="paper", yref="paper", x=0, y=1, showarrow=False)
fig.add_annotation(text="Uniform Distribution", xref="paper", yref="paper", x=0.5, y=1, showarrow=False)
fig.add_annotation(text="Exponential Distribution", xref="paper", yref="paper", x=1, y=1, showarrow=False)

for index, sample_size in enumerate(sample_sizes):
    normal_sample = [np.mean(Generator.get_normal_sample(sample_size)) for _ in range(amount_samples)]
    uniform_sample = [np.mean(Generator.get_uniform_sample(sample_size)) for _ in range(amount_samples)]
    nonuniform_sample = [np.mean(Generator.get_nonuniform_sample(sample_size)) for _ in range(amount_samples)]

    # Add the distributions to the plot
    fig.add_trace(go.Histogram(x=normal_sample, nbinsx=100, showlegend=False, marker_color=column_colors[0]),
                  row=index + 1, col=1)

    fig.add_trace(go.Histogram(x=uniform_sample, nbinsx=100, showlegend=False, marker_color=column_colors[1]),
                  row=index + 1, col=2)

    fig.add_trace(go.Histogram(x=nonuniform_sample, nbinsx=100, showlegend=False, marker_color=column_colors[2]),
                  row=index + 1, col=3)

    # Calculate the center height of the subplot
    center_height = 1 - ((index + 0.5) / len(sample_sizes))

    # Add sample_size annotation at the center of each subplot
    fig.add_annotation(text=f"Sample Size: {sample_size}", xref="paper", yref="paper", x=0, y=center_height,
                       showarrow=False, font=dict(size=10))

# Update layout
fig.update_layout(height=len(sample_sizes) * 250, width=1500, title_text="Central limit theorem illustration",
                  showlegend=False,)  # Remove legend

# Show the plot
fig.show()
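
The visual impression that the exponential column becomes more symmetric as the sample size grows can be backed up with a rough skewness calculation. The sketch below is an illustration under stated assumptions: it uses a seeded `np.random.default_rng` rather than the SampleGenerator class, computes the sample skewness by hand with NumPy, and compares it with the theoretical value $2 / \sqrt{n}$ for the mean of $n$ exponential draws (a normal distribution has skewness 0).

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical seed, only for repeatability
amount_samples = 3000
sample_sizes = [1, 2, 4, 5, 10, 15, 25, 50, 100]


def sample_skewness(values):
    """Third standardized moment of a 1-D array (population form, no bias correction)."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3


skews = {}
for n in sample_sizes:
    # one row per sample of size n, then the mean of each row
    means = rng.exponential(10, size=(amount_samples, n)).mean(axis=1)
    skews[n] = sample_skewness(means)
    print(f"n={n:3d}  skewness {skews[n]:5.2f}  (theory {2 / np.sqrt(n):4.2f})")
```

The skewness shrinks toward zero as the sample size increases, in line with the histograms approaching a normal shape.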
