# Unit 7: Simulation of sampling with replacement

## 1. Introduction

This notebook simulates the activity from the previous class: 

Each team was given an envelope with an **unknown population of tiles with numbers**

### {1,2,3,4,5,6}

We randomly selected one tile from the population, took note of the number on the tile, and returned the tile to the envelope. This type of repeated sampling from the population is known as "*sampling with replacement*".

The experiment was designed so that one of the six events in the sample space has a lower probability than the remaining five events, which all share the same probability. This gives two population models to compare:

1. All numbers are evenly distributed: the probability of the event '*tile shows number x*' is the same for all possible numbers x.
2. One number has half the probability of each of the other numbers.
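
The probabilities implied by the second case follow from the requirement that all six probabilities add up to 1. The short calculation below is only a worked illustration; the variable names `q` and `p_low` are made up for this example.

```python
from fractions import Fraction

# Case 2: five numbers share the probability q, one number has q/2.
# All probabilities must add up to 1:  5*q + q/2 = 1
q = Fraction(1) / (5 + Fraction(1, 2))   # probability of each "normal" number
p_low = q / 2                            # probability of the rare number

print("normal numbers:", q)
print("rare number   :", p_low)
print("sum           :", 5 * q + p_low)
```

So in the second case the rare number is expected in 1 out of 11 draws and every other number in 2 out of 11 draws, compared with 1 out of 6 in the first case.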

The question of interest: **What sample size is large enough to distinguish the low-probability event from the others? In other words, after how many trials do we have enough observations to detect a difference in the frequency of the events?**

Surprisingly, it takes quite a few samples before the random fluctuations in the relative frequency of each event are reduced. The more samples we have in our data set, the more accurate the estimated relative frequency of each event becomes.
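
To get a feeling for how slowly the fluctuations shrink, the sketch below draws samples of increasing size from an evenly distributed population and reports the largest deviation of any relative frequency from 1/6. The helper `max_deviation` is a hypothetical name introduced only for this illustration, and the printed numbers change from run to run.

```python
import numpy as np

def max_deviation(n):
    """Draw n tiles with replacement from {1,...,6} (all equally likely)
    and return the largest distance of a relative frequency from 1/6."""
    draws = np.random.randint(1, 7, size=n)        # upper bound 7 is exclusive
    counts = np.bincount(draws, minlength=7)[1:]   # counts for tiles 1..6
    rel_freq = counts / n
    return np.abs(rel_freq - 1/6).max()

for n in (30, 300, 3000, 30000):
    print(n, round(max_deviation(n), 4))
```

The deviation typically becomes smaller as n grows, but only slowly: as a rule of thumb, the spread of a relative frequency shrinks in proportion to 1/sqrt(n).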



## 2. How can we simulate a random process like sampling from a bag of tiles?

First of all, we have to be able to generate random numbers. That's possible with NumPy, which can generate integers with equal probability within a defined range (e.g. the integers 1,2,3,4,5,6).

The function that produces evenly (uniformly) distributed integers is *np.random.randint*.



In [None]:
import numpy as np
import matplotlib.pyplot as plt

help(np.random.randint)
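
One detail from the help text is easy to miss: the lower bound of *np.random.randint* is inclusive, while the upper bound is exclusive. The short check below illustrates this with a large sample; the sample size of 10000 is an arbitrary choice.

```python
import numpy as np

# low=1 is included, high=7 is excluded, so only 1,...,6 can occur
draws = np.random.randint(1, 7, size=10_000)

print("smallest value:", draws.min())
print("largest value :", draws.max())
print("values seen   :", np.unique(draws))
```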

## 2.1 The basics of working with random numbers

In order to simulate the experiment of sampling from the bag of tiles we need a few parameters that control the simulation:

* the 'event space' {1,2,3,4,5,6}, which we define here with the integer variables i1 and i2 (the upper bound i2 is exclusive, hence 6+1)
* a list with the event numbers for plotting the results
* a variable that controls the sample size


In [None]:
n = 30                # sample size
i1, i2 = 1, 6 + 1     # lowest tile number (inclusive), upper bound (exclusive)
sample = np.random.randint(i1, i2, size=n)
print(sample)
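
Every run of the cell above produces a different sample. If a repeatable result is needed, for example to compare notes with a teammate, one option is to fix the seed of the random number generator first. This is a minimal sketch; the seed value 2024 is arbitrary.

```python
import numpy as np

np.random.seed(2024)                        # arbitrary seed value
first = np.random.randint(1, 7, size=30)

np.random.seed(2024)                        # reset to the same seed
second = np.random.randint(1, 7, size=30)

# identical seeds give identical "random" samples
print(np.array_equal(first, second))
```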

## 2.2 Calculate relative frequencies

The function *np.histogram* counts how many events fall into each value range (bin).
Each bin is defined by its lower and upper boundary.


In [None]:
# for the histograms we need the control of the bins
event=      [    1,    2,    3,    4,    5,    6    ]
# Note the borders for our bins (event ranges)
bin_borders=[0.5,  1.5,  2.5,  3.5,  4.5,  5.5,  6.5]
# we can get the numerical results
result=np.histogram(sample,bins=bin_borders)
count=result[0] # first item in the returned tuple holds the counts for each bin
rel_freq=count/sum(count)
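
As a sanity check, the counts from *np.histogram* can be compared with a direct count of each tile number. The sketch below is self-contained and draws its own sample; the names `edges` and `check` are introduced only for this illustration.

```python
import numpy as np

sample = np.random.randint(1, 7, size=30)
bin_borders = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]

# np.histogram returns two arrays: the counts and the bin borders
count, edges = np.histogram(sample, bins=bin_borders)

# independent cross-check: count every tile number directly
check = np.array([(sample == tile).sum() for tile in range(1, 7)])

rel_freq = count / count.sum()
print("histogram counts:", count)
print("direct counts   :", check)
print("sum of rel_freq :", rel_freq.sum())
```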

## 2.3 Text-based summary of the results 
(e.g. for tables in research reports)

In [None]:
print("counts and relative frequency of the events")
for tile, c, f in zip(event, count, rel_freq):
    print("tile=" + str(tile) + ", " + str(c) + ", " + str(f))
print("----------------")
print("checksum : " + str(sum(rel_freq)))

## 2.4 Histogram plot 

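Aside from the long parameter list that allows us to adjust the colors of the plot, there is one important optional parameter that switches the plot to relative frequencies: *density=True*. Strictly speaking, this option scales the bars so that their total area is 1; because our bins have a width of 1, the bar heights are equal to the relative frequencies.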

In [None]:
plt.hist(sample,bins=bin_borders,rwidth=0.8,lw=2,\
         edgecolor='purple',facecolor='gold',alpha=0.7,density=True)
plt.ylabel("relative frequency")
plt.xlabel("event 'tile shows number #'")
plt.text(0.2,0.175, 'expected value P=1/6',fontsize=16)
plt.plot([0,7],[1/6, 1/6],':',color='black')
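
The bar heights drawn with *density=True* match the relative frequencies only because every bin has a width of 1. The sketch below checks this with *np.histogram*, which accepts the same *density* option, and contrasts it with bins of width 2. The variable names are chosen only for this example.

```python
import numpy as np

sample = np.random.randint(1, 7, size=30)

# bins of width 1: density and relative frequency agree
bin_borders = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
count, _ = np.histogram(sample, bins=bin_borders)
density, _ = np.histogram(sample, bins=bin_borders, density=True)
print(np.allclose(density, count / count.sum()))

# bins of width 2: density is the relative frequency divided by the width
wide_borders = [0.5, 2.5, 4.5, 6.5]
wide_count, _ = np.histogram(sample, bins=wide_borders)
wide_density, _ = np.histogram(sample, bins=wide_borders, density=True)
print(np.allclose(wide_density * 2, wide_count / wide_count.sum()))
```

Both checks print `True`. With bins wider or narrower than 1, the y-axis label "relative frequency" would no longer be correct.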

## 3. Summary

* NumPy offers random number generators that produce discrete events (integer numbers) with equal probability (*np.random.randint*). We use the function to simulate the class experiment: 'sampling tiles from an envelope with even distribution of all numbers'.
* NumPy has a function (*np.histogram*) that calculates the frequency of events (integer or real numbers) falling inside a certain bin (range of values).


## 4. Next steps

So far, we cannot simulate the experiment with a population that has a non-uniform distribution. One way to do this with the same NumPy function will be shown in the next notebook.
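
As a preview, here is one possible approach, which is not necessarily the one used in the next notebook: mimic the envelope directly by listing every tile it contains and drawing a random position in that list with *np.random.randint*. The tile population below is an assumption made for illustration: every number appears twice except one, and the choice of 6 as the rare number is arbitrary.

```python
import numpy as np

# Assumed envelope content: 11 tiles, the number 6 appears only once
tiles = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6])

n = 11_000                                          # sample size
index = np.random.randint(0, len(tiles), size=n)    # uniform random positions
sample = tiles[index]                               # numbers on the drawn tiles

# relative frequency of each number 1..6
rel_freq = np.array([(sample == t).mean() for t in range(1, 7)])
print(rel_freq)
```

Each position is equally likely, but the rare number occupies only one of the 11 positions, so its relative frequency approaches 1/11 while the others approach 2/11.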