# Sampling

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns

## Population vs Sample

The **population** is the complete dataset. It doesn't have to refer to people. Typically, we don't know what the whole population is.

The **sample** is the subset of data you calculate on.

In [None]:
# The following dataset corresponds to a set of professional evaluations of coffees
coffee = pd.read_feather('../data/coffee_ratings_full.feather')

In [None]:
coffee.head()

In [None]:
coffee.shape

In [None]:
coffee.dtypes

In [None]:
coffee['variety'] = coffee.variety.astype('category')

The 1338 observations of the coffee dataset are themselves a sample, not the population of all existing coffee varieties. In this notebook, however, let's treat this dataset as our population.

We can take a sample of this *population* using the *.sample()* method.

In [None]:
coffee_samp = coffee.sample(n=10)
coffee_samp

## Population parameters and point estimates

A **population parameter** is a calculation made on the population dataset.




In [None]:
np.mean(coffee.aftertaste)

In [None]:
coffee.aftertaste.mean()

A **point estimate**, or sample statistic, is a calculation made on the sample dataset.

In [None]:
np.mean(coffee_samp.aftertaste)

In [None]:
coffee_samp.aftertaste.mean()

## Convenience sample

**Sample bias** is a problem caused when the sample is not representative of the population.
Collecting data by the easiest method is called *convenience sampling* and often causes sample bias.

Plotting histograms of the sample and of the population helps identify sample bias.

In [None]:
coffee_bad_samp = coffee.head(10)

In [None]:
coffee.total_cup_points.hist(bins=np.arange(0,101, 1))

In [None]:
coffee_samp.total_cup_points.hist(bins=np.arange(0,101, 1))

In [None]:
coffee_bad_samp.total_cup_points.hist(bins=np.arange(0,101, 1))

The random sample looks more representative of the population than the one taken with *.head()*.
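The same effect can be reproduced numerically. The following is a minimal sketch on synthetic data, assuming a hypothetical `scores` table that is stored sorted from best to worst (the names and values are illustrative, not taken from the coffee dataset). Taking the first rows is a convenience sample, and its mean drifts far from the population mean.

```python
import numpy as np
import pandas as pd

# Hypothetical population: 1000 scores, stored sorted from best to worst
rng = np.random.default_rng(0)
sorted_scores = np.sort(rng.normal(loc=80, scale=5, size=1000))[::-1]
scores = pd.DataFrame({"score": sorted_scores})

pop_mean = scores["score"].mean()

# Convenience sample: the first 10 rows, which are the 10 highest scores
head_mean = scores.head(10)["score"].mean()

# Simple random sample of the same size
rand_mean = scores.sample(n=10, random_state=0)["score"].mean()

print(f"population: {pop_mean:.2f}")
print(f"head(10):   {head_mean:.2f}")
print(f"sample(10): {rand_mean:.2f}")
```

Because the rows are ordered, `head(10)` only ever sees the top of the distribution, so its mean overestimates the population mean regardless of the seed.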

## Pseudo-random number generation

Truly random numbers cannot be known beforehand, but true randomness is expensive to generate. Pseudo-randomness is a good workaround.

Pseudo-random number generation is cheap and fast.
Each random number is calculated from the previous one.
The first one is calculated from a *seed*.
For a given seed, the sequence of values is always the same.

NumPy has many random number generators for different statistical distributions under numpy.random.

In [None]:
import numpy.random as random

betas = random.beta(a=2, b=2, size=5000)
betas
                   

In [None]:
sns.histplot(data=betas)

NumPy allows us to set the seed so that our code is reproducible.
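As a quick check of what reproducible means here, the sketch below resets the seed and draws again. The seed value 123 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# Setting the same seed restarts the same pseudo-random sequence
np.random.seed(123)
first = np.random.normal(size=3)

np.random.seed(123)
second = np.random.normal(size=3)

print(np.array_equal(first, second))  # True

# Without resetting the seed, the generator simply keeps advancing
third = np.random.normal(size=3)
print(first)
print(third)
```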

In [None]:
random.seed(42)

In [None]:
normals = random.normal(loc=2, scale=1.5, size=2000)
sns.histplot(normals)

## Simple Random and Systematic Sampling 

### Simple Random Sampling

It's like a raffle: we take n random examples, one at a time. The pandas *.sample()* method does this, for instance.

In [None]:
sample = coffee.sample(n=10)

### Systematic Random Sampling 

Picks samples at a fixed interval. There is no dedicated pandas method for this, but .iloc[::interval] works.
Systematic sampling is only safe when there is no pattern in the row order. Shuffling the whole dataset first removes any such pattern, although systematic sampling on shuffled data is then equivalent to simple random sampling.
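To see why a pattern in the row order matters, here is a minimal sketch on synthetic data. It assumes a hypothetical column `value` in which every 7th row is high, and an interval that happens to be a multiple of that period.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a period-7 pattern: every 7th row holds a high value
values = np.where(np.arange(700) % 7 == 0, 100.0, 10.0)
df = pd.DataFrame({"value": values})

interval = 70  # a multiple of the period, so every pick lands on a high row

# Systematic sample on the original row order
sys_sample = df.iloc[::interval]
sys_mean = sys_sample["value"].mean()

# Shuffle the rows first, then sample systematically
shuffled_df = df.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled_sys_mean = shuffled_df.iloc[::interval]["value"].mean()

print("population mean:       ", df["value"].mean())
print("systematic mean:       ", sys_mean)
print("shuffled + systematic: ", shuffled_sys_mean)
```

On the ordered data, every selected row is one of the high values, so the estimate is far above the population mean. After shuffling, the interval no longer lines up with the pattern.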

In [None]:
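size = len(coffee)
sample_size = 10
interval = size // sample_size
# Take every interval-th row by position; this can yield one extra row
# when size is not an exact multiple of sample_size
sample_sys = coffee.iloc[::interval]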

Shuffling the whole dataset:

In [None]:
shuffled = coffee.sample(frac=1)

In [None]:
shuffled

## Stratified and weighted random sampling


In [None]:
coffee.country_of_origin.value_counts(normalize=True)

In [None]:
coffee_sample = coffee.sample(frac=0.1, random_state=42)
coffee_sample.country_of_origin.value_counts(normalize=True)

If we want the proportions of each category in the sample to be closer to those of the original population, we can group by the category before sampling:

In [None]:
coffee_strat = coffee.groupby("country_of_origin").sample(frac=0.1, random_state=42)
coffee_strat.country_of_origin.value_counts(normalize=True)

If we want the same number of observations per category:

In [None]:
coffee_strat_eq = coffee.groupby("country_of_origin").sample(n=1, random_state=42)
coffee_strat_eq.country_of_origin.value_counts(normalize=True)

In this dataset, there are countries with only one observation, so we cannot take more than 1 per group unless we sample with replacement.
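A minimal sketch of sampling with replacement per group, using a small hypothetical dataframe in which group "b" has a single row. The `replace=True` argument lets the same row be drawn more than once.

```python
import pandas as pd

# Hypothetical data: group "b" has only one observation
df = pd.DataFrame({"group": ["a", "a", "a", "b"], "x": [1, 2, 3, 4]})

# Without replacement, asking for 2 rows per group fails for group "b"
try:
    df.groupby("group").sample(n=2, random_state=42)
except ValueError as err:
    print("without replacement:", err)

# With replacement, rows may repeat, so every group can supply 2 rows
with_repl = df.groupby("group").sample(n=2, replace=True, random_state=42)
print(with_repl)
```

The trade-off is that small groups contribute duplicated rows, which should be kept in mind when interpreting statistics computed on the sample.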

Another way of sampling is to take weights into account: add a column of weights to the dataframe and pass its name to the `weights` argument of *.sample()*.
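A minimal sketch of weighted sampling on synthetic data. The dataframe, the `kind` and `weight` columns, and the weight values are hypothetical and chosen only so the effect is easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "rare" rows make up 10% of the population
df = pd.DataFrame({"kind": ["common"] * 90 + ["rare"] * 10})

# Give each "rare" row a weight of 9 and every other row a weight of 1.
# pandas normalizes the weights, so they do not need to sum to 1.
df["weight"] = np.where(df["kind"] == "rare", 9, 1)

weighted = df.sample(n=1000, weights="weight", replace=True, random_state=42)
proportions = weighted["kind"].value_counts(normalize=True)
print(proportions)
```

The total weight is 90 for the "common" rows and 90 for the "rare" rows, so each kind is expected to make up roughly half of the weighted sample even though "rare" is only 10% of the population.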



## Cluster Sampling

The problem with stratified sampling is that we need to collect data from every group. This can be costly in time and/or money.

When collecting data is expensive, we can use **cluster sampling**.

Cluster sampling uses simple random sampling to pick some subgroups, and then uses simple random sampling within only those subgroups.

Cluster sampling is an example of multistage sampling.

In [None]:
varieties_pop = list(coffee.variety.unique())
varieties_pop

In [None]:
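# Step 1:
# Import the standard-library module under an alias so it does not
# overwrite the numpy.random alias named random used earlier
import random as py_random

varieties_samp = py_random.sample(varieties_pop, k=3)
varieties_samp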

In [None]:
# Step 2:
variety_condition = coffee.variety.isin(varieties_samp)
# .copy() avoids a SettingWithCopyWarning when modifying the subset below
coffee_cluster = coffee[variety_condition].copy()

coffee_cluster['variety'] = coffee_cluster['variety'].cat.remove_unused_categories()

In [None]:
coffee_cluster.groupby('variety').sample(n=5, random_state=42)