# Sample Size
In this investigation, our population is an array of numbers, and we will take samples from it. Since we have all of the data, why take samples at all?

There will be situations where having the complete data is not practical. We will only be able to sample the population and use our measurements of that _sample_ to talk about the _whole_ _population_. How should sample size affect our confidence when we use only the sample to reach conclusions? For instance: how accurate would the estimate of the mean be? 

(Don't worry too much about what the mean is. It's the 'everyday average'. We'll look into it in more detail later in the course. For now: just think of it as something we can use to describe a list of numbers.)
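
To make the idea concrete before the main experiment, here is a minimal sketch of taking a single sample and comparing its mean with the population's. The population size of 1,000, the 10% sample and the fixed seeds are assumptions chosen only for illustration; they are not part of the investigation itself.

```python
import numpy as np
import pandas as pd

# Assumed values for illustration: 1,000 numbers drawn from a normal
# distribution with mean 100 and standard deviation 15.
rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
population = pd.Series(rng.normal(loc=100, scale=15, size=1000))

# Take one random sample of 10% of the population (no value is picked twice).
sample = population.sample(frac=0.1, random_state=0)

population_mean = population.mean()
sample_mean = sample.mean()

# How far off is the sample mean, as a percentage of the real mean?
percent_error = 100 * abs(population_mean - sample_mean) / population_mean

print("population mean:", round(population_mean, 2))
print("sample mean:    ", round(sample_mean, 2))
print("percent error:  ", round(percent_error, 2))
```

Changing `random_state` gives a different sample and a different error, which is why the investigation looks at many samples rather than just one.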

For our investigation: 
- start off by keeping the population size the same
- vary the sample size (as a percentage of the population) and compare the sample mean to the real mean of the population. The program actually picks 100 samples and reports on the biggest error.
- find out how big the sample needs to be so that the percentage error is about 2%


In [1]:
import pandas as pd
from numpy import random
from ipywidgets import interact, FloatLogSlider, FloatSlider
from math import floor

sd = 15


@interact(
    pop_size=FloatLogSlider(base=10, min=2, value=100, max=6,step=1, continuous_update=False, readout_format=".8"),
    sample_size=FloatSlider(min=0.1, value=0.2, max=1, step=0.1, continuous_update=False, readout_format=".0%")
)
def take_samples(pop_size, sample_size):
    # the computer will pick numbers from a normal distribution to put into a data series (a list)
    pop_size = floor(pop_size)
    population = pd.Series(random.normal(loc=100, scale=sd, size=pop_size))
    mean_of_population = population.mean()
    print("mean of the population:", mean_of_population)

    max_diff_in_means = 0
    sample_size = floor(sample_size * pop_size)
    print("sample size is", sample_size, "out of population of", pop_size)
    for i in range(100):
        sample = population.sample(sample_size)
        mean_of_sample = sample.mean()
        diff = abs(mean_of_population - mean_of_sample)
        if diff > max_diff_in_means:
            max_diff_in_means = diff

    print("maximum error as percent of real value",
            100*max_diff_in_means/mean_of_population)


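
If the sliders are not available (for example, when reading a static copy of this notebook), the same experiment can be sketched as a plain loop over a few sample sizes. This is only an illustrative version: the helper name `max_percent_error`, the population of 10,000, the chosen fractions and the seeds are all assumptions, and it rounds the sample size where the cell above uses `floor`.

```python
import numpy as np
import pandas as pd


def max_percent_error(population, fraction, n_samples=100, seed=0):
    """Largest percent error in the sample mean over n_samples random samples."""
    population_mean = population.mean()
    # Turn the fraction into a whole number of items (at least one).
    sample_size = max(1, round(fraction * len(population)))
    worst = 0.0
    for k in range(n_samples):
        # A different random_state per draw keeps runs repeatable but distinct.
        sample = population.sample(sample_size, random_state=seed + k)
        worst = max(worst, abs(population_mean - sample.mean()))
    return 100 * worst / population_mean


# Assumed population for illustration: same distribution as the cell above.
rng = np.random.default_rng(seed=1)
population = pd.Series(rng.normal(loc=100, scale=15, size=10_000))

results = {}
for fraction in (0.1, 0.2, 0.5, 1.0):
    results[fraction] = max_percent_error(population, fraction)
    print(f"{fraction:.0%} sample: max error {results[fraction]:.3f}%")
```

Reading down the printed lines shows how the worst-case error changes as the sample grows. A 100% sample is the whole population, so its error is zero apart from tiny floating-point rounding.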

How does the effect of sample size compare with the results of the [previous investigation](./01_population_size_flat.ipynb)?