# Sample size calculation: Cochran’s Formula

> Big data problems [are] actually small data problems, once you have the right subset/sample/summary. Inventing numbers on the spot, I’d say 90% of big data problems fall into this category.

Even without huge data sets, you will often find that even a fast computer processes data in memory too slowly (especially if you’re using R or Python). Processing data remotely is slower still.

In the cases where pulling down data takes longer than running regressions on it, you’ll need to sample.

But how big a sample is big enough? As I’ve been working through a couple of rounds of sampling lately, I’ve found that there’s no standard rule of thumb, either in the data science community or in specific industries like healthcare and finance. The answer is, as always, “It depends.”

A sample is a part of the population chosen for a survey or experiment, and the sample size is the number of units in it. For example, you might survey dog owners’ brand preferences. You won’t want to survey all the millions of dog owners in the country (because it’s too expensive or time consuming), so you take a sample, perhaps several thousand owners. The sample stands in for all dog owners’ brand preferences. If you choose your sample wisely, it will be a good representation.
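One common way to choose a sample is a simple random sample, where every member of the population has the same chance of being picked. The sketch below illustrates the idea with the standard library only; the population of one million owner IDs, the sample of 2,000, and the seed are made-up values for illustration.

```python
import random

population_size = 1_000_000  # hypothetical number of dog owners
n = 2_000                    # hypothetical sample size

rng = random.Random(42)      # fixed seed so the draw is repeatable

# Draw owner IDs without replacement; every ID is equally likely.
sample_ids = rng.sample(range(population_size), k=n)

print(len(sample_ids), len(set(sample_ids)))
```

Because the draw is without replacement, no owner appears twice in the sample.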

## Cochran’s Sample Size Formula

The Cochran formula allows you to calculate an ideal sample size given a desired level of precision, desired confidence level, and the estimated proportion of the attribute present in the population.

Cochran’s formula is considered especially **appropriate in situations with large populations**. A sample of any given size provides more information about a smaller population than a larger one, so there’s a **‘correction’** through which the number given by Cochran’s formula can be reduced if the whole population is relatively small.

The Cochran formula is:

![Cochran's formula: n0 = z^2 p q / e^2](https://a8h2w5y7.rocketcdn.me/wp-content/uploads/2018/01/cochran-1.jpeg)

Where:

- e is the desired level of precision (i.e. the margin of error),
- p is the (estimated) proportion of the population which has the attribute in question,
- q is 1 – p.

The z-value corresponds to the desired confidence level (about 1.96 for 95% confidence) and can be found in a [Z table](https://www.statisticshowto.com/tables/z-table/).
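As a rough illustration of the formula, the sketch below computes n0 with the standard library only. The function name `cochran_n0` and the example inputs (95% confidence, p = 0.5, a 5% margin of error) are assumptions chosen for illustration, not values from a particular study.

```python
import math
from statistics import NormalDist


def cochran_n0(confidence: float, p: float, e: float) -> float:
    """Unrounded Cochran sample size for a large population.

    confidence -- confidence level as a fraction, e.g. 0.95
    p          -- estimated proportion with the attribute
    e          -- margin of error as a fraction, e.g. 0.05
    """
    # Two-sided z-value: leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z ** 2) * p * (1 - p) / (e ** 2)


n0 = cochran_n0(confidence=0.95, p=0.5, e=0.05)
print(round(n0, 2))   # about 384.15
print(math.ceil(n0))  # rounded up to whole respondents: 385
```

Rounding up rather than to the nearest integer keeps the margin of error at or below the target.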

## Modification of the Cochran Formula for Sample Size Calculation in Smaller Populations

If the population we’re studying is small, we can modify the sample size we calculated in the above formula by using this equation:
    
![Finite population correction: n = n0 / (1 + (n0 - 1) / N)](https://a8h2w5y7.rocketcdn.me/wp-content/uploads/2018/01/cochran-2.jpeg)

Here n0 is Cochran’s sample size recommendation, N is the population size, and n is the new, adjusted sample size. 
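To get a feel for how much the correction matters, here is a minimal sketch that applies it to a fixed n0 across several population sizes. The helper name `adjust_for_population` and the population sizes are illustrative assumptions; the n0 value is the unrounded result for 95% confidence, p = 0.5, and a 5% margin of error.

```python
import math


def adjust_for_population(n0: float, population: int) -> int:
    """Apply the finite population correction to Cochran's n0."""
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)  # round up to whole respondents


n0 = 384.15  # unrounded n0 for 95% confidence, p = 0.5, e = 0.05

for population in (500, 1_000, 10_000, 1_000_000):
    print(population, adjust_for_population(n0, population))
```

For a population of 1,000 the adjusted size drops to 278, while for a million the correction makes almost no difference and the result stays at 385.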

## Calculating sample size with Python

There are a couple of ways to run a quick calculation:

- The fastest is probably this [online sample size calculator](https://www.surveysystem.com/sscalc.htm).
- There is also a really nice Python script that does the same thing:

```python
import math
import scipy.stats as st


# CALCULATE Z VALUE
def get_z(confidence_level: float) -> float:
    """
    Calculate the two-sided Z value for a given confidence level.

    confidence_level -- confidence level in percent (e.g. 95.0).
    return -- z value.
    """
    return st.norm.ppf(1 - (1 - confidence_level / 100.0) / 2)


# CALCULATE THE SAMPLE SIZE
def sample_size(population_size: int, confidence_level: float, confidence_interval: float) -> int:
    """
    Calculate the sample size using Cochran’s Sample Size Formula.

    population_size -- the total population size.
    confidence_level -- the selected confidence level in percent.
    confidence_interval -- the selected margin of error in percent.
    return -- sample size, with the finite population correction applied.
    """
    p = 0.5  # ASSUME MAXIMUM VARIABILITY
    e = confidence_interval / 100.0
    N = population_size

    # FIND THE NUM STD DEVIATIONS FOR THAT CONFIDENCE LEVEL
    Z = get_z(confidence_level)

    # Z IS 0 ONLY FOR A 0% CONFIDENCE LEVEL, WHICH IS NOT MEANINGFUL
    if Z == 0.0:
        return -1

    # CALC SAMPLE SIZE
    n_0 = ((Z**2) * p * (1 - p)) / (e**2)

    # ADJUST SAMPLE SIZE FOR FINITE POPULATION
    n = n_0 / (1 + ((n_0 - 1) / float(N)))

    return int(math.ceil(n))  # THE SAMPLE SIZE
```

```python
population_sz = 100000
confidence_level = 95.0
confidence_interval = 2.0

sample_sz = sample_size(population_sz, confidence_level, confidence_interval)

print("SAMPLE SIZE: %d" % sample_sz)
```

Output:

```
SAMPLE SIZE: 2345
```
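The script fixes p at 0.5. That is the most conservative choice, because the product p(1 – p) is largest at p = 0.5, so any other estimate of the proportion gives a smaller n0. The sketch below shows this for a 95% confidence level and a 2% margin of error, before any population correction; the list of proportions and the `sizes` dictionary are illustrative assumptions.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # z-value for 95% confidence
e = 0.02                         # 2% margin of error

# Uncorrected Cochran sample size for several estimated proportions.
sizes = {}
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    sizes[p] = math.ceil(n0)
    print(f"p = {p:.1f} -> n0 = {sizes[p]}")
```

If you have a credible prior estimate of the proportion, plugging it in can cut the required sample noticeably; if not, 0.5 is the safe default.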


### References

- ["How large should your sample size be?" by Vicki Boykis](https://veekaybee.github.io/2015/08/04/how-big-of-a-sample-size-do-you-need/)
- [Sample Size Calculator](https://www.surveysystem.com/sscalc.htm)
- [Cochran’s Sample Size Formula](https://www.statisticshowto.com/probability-and-statistics/find-sample-size/)
- [Determination of appropriate Sample Size](https://shodhganga.inflibnet.ac.in/bitstream/10603/23539/7/07_chapter%202.pdf)