To open this notebook in Google Colab and start coding, click on the Colab icon below.

<table style="border:2px solid orange" align="left">
  <td style="border:2px solid orange ">
    <a target="_blank" href="https://colab.research.google.com/github/neuefische/ds-welcome-package/blob/main/statistics/4-Law-of-Large-Numbers.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

---
# Introduction to the Law of large numbers

     
This notebook covers the statistics basics that we require for the bootcamp.
If any of the terms are still unclear to you, we strongly recommend spending a bit more time on them.

<img src="https://i.redd.it/e23ufi7yv7361.jpg"
     alt="Alice through the looking glass"
     style="float: left; margin-right: 10px; height: 300px" />


## Law of large numbers

     
In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.

The LLN is important because it guarantees stable long-term results for the averages of some random events. For example, while a casino may lose money in a single spin of the roulette wheel, its earnings will tend towards a predictable percentage over a large number of spins. Any winning streak by a player will eventually be overcome by the parameters of the game. It is important to remember that the law only applies (as the name indicates) when a large number of observations is considered. There is no principle that a small number of observations will coincide with the expected value or that a streak of one value will immediately be "balanced" by the others.
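The casino example can be illustrated with a small simulation. This is only a sketch under assumptions that the text above does not specify: a European wheel (37 pockets, 18 of them red) and a one-unit even-money bet on red. The function name `casino_edge` is illustrative.

```python
import numpy as np

def casino_edge(n_spins, seed=0):
    """Average casino earnings per unit bet over n_spins even-money bets on red.

    Assumes a European wheel: 18 red pockets out of 37.
    """
    rng = np.random.default_rng(seed)
    # True where the player wins (the ball lands on red)
    player_wins = rng.random(n_spins) < 18 / 37
    # The casino earns +1 when the player loses and -1 when the player wins
    earnings = np.where(player_wins, -1.0, 1.0)
    return earnings.mean()

# Expected value per spin: (19/37) * 1 + (18/37) * (-1) = 1/37
expected_edge = 1 / 37

for n in (10, 1_000, 1_000_000):
    print(f"{n:>9} spins: average earnings per bet = {casino_edge(n):+.4f}")
print(f"expected value: {expected_edge:+.4f}")
```

With only a few spins the average can land far from the expected value, and can even be negative for the casino. With many spins it should settle close to 1/37.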

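Formally, for a sequence of independent and identically distributed random variables $X_{1}, X_{2}, \ldots$ with expected value $\mu$, the sample average

$$ \overline X_{n}= \frac {1}{n} (X_{1}+\cdots +X_{n})$$
converges to the expected value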

$$ \overline X_{n}\,\to \,\mu \qquad \textrm {for} \qquad n\to \infty .$$
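
The weak law of large numbers makes this statement precise as convergence in probability. As a sketch, under the additional assumption (not stated above) that the $X_{i}$ are independent and identically distributed with finite variance $\sigma ^{2}$, Chebyshev's inequality gives for every $\varepsilon > 0$

$$ P\left(\left|\overline X_{n}-\mu \right|\geq \varepsilon \right)\,\leq \,\frac {\operatorname {Var} (\overline X_{n})}{\varepsilon ^{2}}=\frac {\sigma ^{2}}{n\,\varepsilon ^{2}}\,\to \,0 \qquad \textrm {for} \qquad n\to \infty .$$

The bound shrinks like $1/n$, so large deviations of the sample average from $\mu$ become less and less likely as more trials are added.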

## Central limit theorem

In probability theory, the central limit theorem (CLT) establishes that, in many situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

If $X_{1},X_{2},\ldots ,X_{n}$ are $n$ independent random samples drawn from a population with overall mean $\mu$ and finite variance $\sigma ^{2}$, and if $\bar {X}_{n}$ is the sample mean, then the limiting form of the distribution of $Z=\frac {\bar {X}_{n}-\mu }{\sigma /\sqrt {n}}$ as $n\to \infty$ is the standard normal distribution.
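
The standardization in this statement can be tried out numerically. The sketch below assumes an exponential population with scale 1 (so $\mu = 1$ and $\sigma = 1$); this choice, the sample size, and the seed are arbitrary and not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed population: exponential with scale 1, so mu = 1 and sigma = 1
mu, sigma = 1.0, 1.0
n = 200             # observations per sample
n_samples = 10_000  # number of independent samples

# Each row is one sample of size n; take the mean of every row
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

# Standardize: Z = (X_bar_n - mu) / (sigma / sqrt(n))
z = (sample_means - mu) / (sigma / np.sqrt(n))

print(f"mean of Z: {z.mean():.3f}")
print(f"std of Z: {z.std():.3f}")
print(f"share of |Z| < 1.96: {np.mean(np.abs(z) < 1.96):.3f}")
```

For a standard normal distribution the mean is 0, the standard deviation is 1, and about 95% of the values lie within ±1.96. The simulated values should be close to these even though the exponential population is clearly skewed.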

For example, suppose that a sample is obtained containing many observations, each observation being randomly generated in a way that does not depend on the values of the other observations, and that the arithmetic mean of the observed values is computed. If this procedure is performed many times, the central limit theorem says that the probability distribution of the average will closely approximate a normal distribution. A simple example of this is that if one flips a fair coin many times, the distribution of the number of heads will be approximately normal, with the mean equal to half the total number of flips. In the limit of an infinite number of flips, the standardized number of heads follows a normal distribution exactly.
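
The coin-flip example can be checked with a short simulation. This is a sketch: the number of flips, the number of repetitions, and the seed are arbitrary choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(7)

n_flips = 1_000         # flips per experiment
n_experiments = 20_000  # how many times the experiment is repeated

# Number of heads in each experiment (fair coin, p = 0.5)
heads = rng.binomial(n=n_flips, p=0.5, size=n_experiments)

# Normal approximation: mean n/2, standard deviation sqrt(n)/2
approx_mean = n_flips / 2
approx_std = np.sqrt(n_flips) / 2

print(f"simulated mean: {heads.mean():.2f} (normal approximation: {approx_mean:.2f})")
print(f"simulated std: {heads.std():.2f} (normal approximation: {approx_std:.2f})")

# Under a normal distribution about 68% of values lie within one standard deviation
within_one_std = np.mean(np.abs(heads - approx_mean) <= approx_std)
print(f"share within one std: {within_one_std:.3f}")
```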

The central limit theorem has several variants. In its common form, the random variables must be identically distributed. In variants, convergence of the mean to the normal distribution also occurs for non-identical distributions or for non-independent observations, if they comply with certain conditions.

In [None]:
import numpy as np
import random
import matplotlib.pyplot as plt

# Weak law of large numbers
# Step 1
# create population with a gamma distribution
shape, scale = 2., 2.  # mean=4, std=2*sqrt(2)
s = np.random.gamma(shape, scale, 1000000)

# Step 2
samplemeanlist = [] # list of lists of sample means, one inner list per sample size
l = [] # list of sample sizes, for x-axis of box plots
numberofsample = 50 # number of samples drawn for each sample size
    
# sample size (i) runs from 100 to 8100 in steps of 500
for i in range(100,8101,500):
    # set x-axis
    l.append(i)
    # list of mean of each sample
    ml = []
    # draw 50 samples
    for n in range(0,numberofsample):
        # randomly pick i values from the population (with replacement)
        rs = np.random.choice(s, size=i)
        # calculate the mean of each sample and save it in the list of means
        ml.append(rs.mean())
    
    # save the 50 sample means in samplemeanlist for box plots
    samplemeanlist.append(ml)
   
# Step 3
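# set figure size
boxplots = plt.figure(figsize=(20,10))
# one box plot of the 50 sample means per sample size
plt.boxplot(samplemeanlist)
# label each box with its sample size (boxes sit at positions 1..len(l))
plt.xticks(range(1, len(l) + 1), l)
plt.xlabel("sample size")
plt.ylabel("sample mean")
# show plot
plt.show()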

print(f"sample size 100, mean: {np.mean(samplemeanlist[0])}, "
      f"standard deviation: {np.std(samplemeanlist[0])}")
print(f"sample size 8100, mean: {np.mean(samplemeanlist[16])}, "
      f"standard deviation: {np.std(samplemeanlist[16])}")

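# histograms of the sample means for the smallest and the largest sample size
histplot = plt.figure(figsize=(20,10))
plt.hist(samplemeanlist[0], 10, density=True, alpha=0.5, label="sample size 100")
plt.hist(samplemeanlist[16], 10, density=True, alpha=0.5, label="sample size 8100")
plt.legend()
plt.show()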
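
As a rough cross-check of the printed standard deviations, the standard error $\sigma /\sqrt{n}$ from the central limit theorem can be evaluated for the gamma population used above (shape 2, scale 2). This is a sketch; `theoretical_se` is an illustrative helper name and not part of the notebook.

```python
import numpy as np

def theoretical_se(shape, scale, n):
    """Standard deviation of the mean of n i.i.d. gamma(shape, scale) draws."""
    sigma = np.sqrt(shape) * scale  # population std of a gamma distribution
    return sigma / np.sqrt(n)

for n in (100, 8100):
    se = theoretical_se(2.0, 2.0, n)
    print(f"n = {n}: expected std of the sample mean ~ {se:.4f}")
```

The simulated standard deviations are based on only 50 sample means per sample size, so they should be near these values but will not match them exactly.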