# Sampling

1. Bernoulli Sampling: Select each elephant independently with probability $p$ (for example, $p = 0.3$). This amounts to flipping a biased coin for each elephant, so the sample size itself is random.
2. Poisson Sampling: Selection depends on the weight of each elephant. For each elephant, we draw a Poisson random variable whose mean is proportional to its weight (scaled by a factor $\lambda$) and select the elephant if the draw is greater than zero, so heavier elephants are more likely to be chosen.
3. Reservoir Sampling: Selects a fixed-size sample in a single pass, which is useful when you’re working with streaming data or when you don’t know the size of the population upfront.
4. Fixed Proportion Sampling: Guarantees that exactly a set proportion of the total population is selected.


In [1]:
import numpy as np

elephant_weights = [3000, 3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]

def bernoulli_sampling(elephants, p=0.3):
    # Generate a Bernoulli random variable (0 or 1) for each elephant
    selected = np.random.binomial(1, p, len(elephants))
    # Keep the elephants whose draw came up 1
    sampled_elephants = [elephant for elephant, flag in zip(elephants, selected) if flag == 1]
    return sampled_elephants

def poisson_sampling(elephants, lambda_factor=1.0):
    # Draw a Poisson count for each elephant with mean lambda_factor * weight / 1000;
    # an elephant is selected when its count is at least 1
    sampled_elephants = []
    for weight in elephants:
        count = np.random.poisson(lambda_factor * weight / 1000)  # Scaled by weight
        if count > 0:
            sampled_elephants.append(weight)
    return sampled_elephants

np.random.seed(42)  # For reproducibility

bernoulli_sampled = bernoulli_sampling(elephant_weights, p=0.3)
print(f"Sampled elephants using Bernoulli sampling: {bernoulli_sampled}")

poisson_sampled = poisson_sampling(elephant_weights, lambda_factor=1.0)
print(f"Sampled elephants using Poisson sampling: {poisson_sampled}")

Sampled elephants using Bernoulli sampling: [3200, 2800, 3300, 3600]
Sampled elephants using Poisson sampling: [3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]
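
The Poisson result above contains almost every elephant, and a short calculation suggests why. A Poisson count with mean $\mu$ is zero with probability $e^{-\mu}$, so the rule `count > 0` selects an elephant with probability $1 - e^{-\mu}$. The sketch below evaluates this for the same weights and scaling; `poisson_inclusion_probability` is a hypothetical helper name, not part of the original notebook.

```python
import math

elephant_weights = [3000, 3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]

def poisson_inclusion_probability(weight, lambda_factor=1.0):
    # Same mean as in poisson_sampling: lambda_factor * weight / 1000
    mean = lambda_factor * weight / 1000
    # P(count > 0) = 1 - P(count = 0) = 1 - exp(-mean) for a Poisson count
    return 1.0 - math.exp(-mean)

for weight in elephant_weights:
    p_default = poisson_inclusion_probability(weight, lambda_factor=1.0)
    p_small = poisson_inclusion_probability(weight, lambda_factor=0.1)
    print(f"weight {weight}: lambda_factor=1.0 -> {p_default:.3f}, lambda_factor=0.1 -> {p_small:.3f}")
```

With `lambda_factor=1.0` the inclusion probabilities range from roughly 0.93 to 0.97, so nearly the whole population is expected in the sample. A smaller factor such as 0.1 brings them down to roughly 0.24 to 0.30 while still favoring heavier elephants.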


Reservoir sampling randomly selects a fixed-size sample, in a single pass, from a population that is large or of unknown size, without storing the entire dataset in memory. Here, we’ll apply reservoir sampling to randomly select 3 of the 10 elephants.

At a high level, reservoir sampling works as follows:

- Step 1: Initialize a reservoir with the first $k$ elements (in our case, $k = 3$).
- Step 2: For each subsequent element (from position $k+1$ to the end), randomly decide whether to include the element in the reservoir.
  - With probability $k/i$ (where $i$ is the 1-based position of the element), replace a uniformly chosen element of the reservoir with it; otherwise, discard it.


In [2]:
import random

elephant_weights = [3000, 3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]

def reservoir_sampling(elephants, k=3):
    # Step 1: Fill the reservoir with the first k elephants
    reservoir = elephants[:k]
    
    # Step 2: Process the remaining elephants
    for i in range(k, len(elephants)):
        # Generate a random index from 0 to i (inclusive)
        j = random.randint(0, i)
        # If the random index is within the reservoir range, replace the element
        if j < k:
            reservoir[j] = elephants[i]
    
    return reservoir

random.seed(42)  # For reproducibility

reservoir_sampled = reservoir_sampling(elephant_weights, k=3)
print(f"Sampled elephants using reservoir sampling: {reservoir_sampled}")

Sampled elephants using reservoir sampling: [3500, 3200, 3600]
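
The function above slices a list and calls `len()`, so it still needs the whole population in memory. A minimal sketch of a single-pass variant follows, assuming the data arrives as an iterator of unknown length. The names `reservoir_sample_stream` and `elephant_stream` are hypothetical, and the repeated trials are only an empirical check that each elephant ends up in the sample about $k/n = 3/10$ of the time.

```python
import random

def reservoir_sample_stream(stream, k=3, rng=random):
    reservoir = []
    # Enumerate from 1 so that i is the 1-based position of the element
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Step 1: the first k items fill the reservoir
            reservoir.append(item)
        else:
            # Step 2: j is uniform on 0..i-1, so j < k happens with probability k/i
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir

def elephant_stream():
    # Hypothetical generator standing in for a data source of unknown length
    for weight in [3000, 3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]:
        yield weight

rng = random.Random(42)  # Local generator, so the global random state is untouched
trials = 20000
counts = {}
for _ in range(trials):
    for weight in reservoir_sample_stream(elephant_stream(), k=3, rng=rng):
        counts[weight] = counts.get(weight, 0) + 1

for weight, count in sorted(counts.items()):
    print(f"{weight}: in the sample in {count / trials:.3f} of runs")
```

Each printed frequency should be close to 0.3, whatever the elephant's position in the stream.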


Fixed proportion sampling refers to selecting a specific proportion of elements from a dataset. For example, if you have 10 elephants and want to sample a fixed proportion (e.g., 30%) of them, you would select 3 elephants. The key here is that the number of elements to be sampled is determined by a fixed ratio relative to the total population size.

Unlike Bernoulli sampling, where each element is independently selected with some probability, fixed proportion sampling guarantees that exactly $n \times p$ elements are selected (rounded down to a whole number when $n \times p$ is not an integer), where $n$ is the population size and $p$ is the desired proportion.


In [3]:
import random

elephant_weights = [3000, 3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]

def fixed_proportion_sampling(elephants, proportion=0.3):
    # Calculate the number of elements to sample based on the given proportion;
    # int() truncates, so a fractional count is rounded down
    num_to_sample = int(len(elephants) * proportion)
    
    # Randomly sample the exact number of elements
    sampled_elephants = random.sample(elephants, num_to_sample)
    
    return sampled_elephants

random.seed(42)  # For reproducibility

fixed_proportion_sampled = fixed_proportion_sampling(elephant_weights, proportion=0.3)
print(f"Sampled elephants using fixed proportion sampling: {fixed_proportion_sampled}")

Sampled elephants using fixed proportion sampling: [3200, 3000, 3500]
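
The difference from Bernoulli sampling shows up in the sample sizes. The sketch below repeats both methods many times with $p = 0.3$ and records how many elephants each run selects. It uses only the standard `random` module, and `bernoulli_sample` and `fixed_proportion_sample` are hypothetical re-implementations written for this comparison rather than the functions defined earlier.

```python
import random

elephant_weights = [3000, 3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]
rng = random.Random(42)  # Local generator for reproducibility

def bernoulli_sample(elephants, p, rng):
    # Each elephant is kept independently with probability p
    return [elephant for elephant in elephants if rng.random() < p]

def fixed_proportion_sample(elephants, proportion, rng):
    # The sample size is fixed in advance, then that many elephants are drawn
    num_to_sample = int(len(elephants) * proportion)
    return rng.sample(elephants, num_to_sample)

trials = 1000
bernoulli_sizes = [len(bernoulli_sample(elephant_weights, 0.3, rng)) for _ in range(trials)]
fixed_sizes = [len(fixed_proportion_sample(elephant_weights, 0.3, rng)) for _ in range(trials)]

mean_size = sum(bernoulli_sizes) / trials
print(f"Bernoulli sample sizes: min={min(bernoulli_sizes)}, max={max(bernoulli_sizes)}, mean={mean_size:.2f}")
print(f"Fixed proportion sample sizes seen: {sorted(set(fixed_sizes))}")
```

The Bernoulli sizes vary from run to run and only average about 3, while the fixed proportion sizes are always exactly 3.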


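Hash-based sampling hashes each item into one of a fixed number of buckets and keeps only the items that fall into a chosen bucket.

Why Use Hash-Based Sampling?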

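- Reproducibility: Given the same input and bucket configuration, the sampling always yields the same result, provided the hash function is deterministic. Python’s built-in hash() is stable for integers, but for strings and bytes it is salted per process unless PYTHONHASHSEED is set, so a stable hash (for example, one from hashlib) is safer for such keys.
- Efficient for Streaming: Each item is accepted or rejected on its own, so you can sample from a continuous stream without storing all the items in memory or looking at all the data at once.
- Scalable: Hashing a fixed-size key takes constant time, $O(1)$, and the number of buckets controls the sample size: with a hash that spreads items evenly, roughly one item in every num_buckets lands in the selected bucket.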


1. Hashing:
	- We use Python’s built-in hash() function to hash each elephant (or item in the stream). For small non-negative integers such as these weights, hash() returns the integer itself.
	- The hash value is then reduced modulo the number of buckets (num_buckets) with the % operator, which assigns the item to a bucket.
2. Sampling:
	- In both the non-streaming and streaming versions, we only select items that belong to the specified selected_bucket. This ensures that a deterministic subset of the population is selected based on the hash values.
3. Non-Streaming (Batch):
	- The function processes the entire list of elephants at once, hashes each item, and selects those in the designated bucket.
4. Streaming:
	- The streaming version processes the elephants one by one (as if they were arriving over time), making it well suited to situations where the full dataset can’t fit in memory or you’re handling a continuous stream of data.

In [4]:
# List of elephant weights
elephant_weights = [3000, 3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]

# Sampling using hash function (batch version)
def hash_bucket_sampling(elephants, num_buckets=5, selected_bucket=0):
    sampled_elephants = []
    
    # Hash each item and assign to buckets
    for elephant in elephants:
        # Use hash() to assign to a bucket
        bucket = hash(elephant) % num_buckets
        if bucket == selected_bucket:
            sampled_elephants.append(elephant)
    
    return sampled_elephants

# Example of using hash bucket sampling (non-streaming)
# No random seed is needed: the selection depends only on the hash values

# Let's say we want to divide into 5 buckets and select bucket 0
sampled_elephants = hash_bucket_sampling(elephant_weights, num_buckets=5, selected_bucket=0)
print(f"Sampled elephants using hash bucket sampling (non-streaming): {sampled_elephants}")

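Sampled elephants using hash bucket sampling (non-streaming): [3000, 3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]

Every elephant is selected in this run. Because hash() returns these integers unchanged and every weight is a multiple of 5, each weight modulo 5 is 0, so all ten elephants fall into bucket 0. With keys that spread evenly across the buckets, about one fifth of the population would be selected.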


In [5]:
# Streaming version of hash bucket sampling
def hash_bucket_sampling_streaming(elephants_stream, num_buckets=5, selected_bucket=0):
    sampled_elephants = []
    
    # Process each item one at a time
    for elephant in elephants_stream:
        # Hash the item and check if it belongs to the selected bucket
        bucket = hash(elephant) % num_buckets
        if bucket == selected_bucket:
            sampled_elephants.append(elephant)
    
    return sampled_elephants

# Example of using hash bucket sampling (streaming)
elephant_stream = iter(elephant_weights)  # Simulate a stream using an iterator

# In streaming mode, you process the stream as it arrives
streaming_sampled_elephants = hash_bucket_sampling_streaming(elephant_stream, num_buckets=5, selected_bucket=0)
print(f"Sampled elephants using hash bucket sampling (streaming): {streaming_sampled_elephants}")

Sampled elephants using hash bucket sampling (streaming): [3000, 3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]
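
When the keys are not well spread by hash(), as with integer weights that are all multiples of the bucket count, or when the keys are strings whose built-in hash changes between processes, a stable digest can stand in for hash(). The sketch below is one possible variant, not the method used in the cells above. It assumes SHA-256 from `hashlib` applied to the decimal string of each weight, and `stable_bucket` and `hash_bucket_sample_stable` are hypothetical helper names.

```python
import hashlib

elephant_weights = [3000, 3200, 2800, 3100, 3500, 2900, 2700, 3300, 3400, 3600]

def stable_bucket(item, num_buckets=5):
    # Assumption: an item is identified by its string form. SHA-256 gives the same
    # digest in every Python process, unlike the salted built-in hash() of str.
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce it to a bucket
    return int.from_bytes(digest[:8], "big") % num_buckets

def hash_bucket_sample_stable(stream, num_buckets=5, selected_bucket=0):
    # Generator: yields selected items one at a time, so the stream is never stored
    for item in stream:
        if stable_bucket(item, num_buckets) == selected_bucket:
            yield item

buckets = {weight: stable_bucket(weight) for weight in elephant_weights}
sample = list(hash_bucket_sample_stable(iter(elephant_weights)))
print(f"Bucket of each elephant: {buckets}")
print(f"Sampled elephants using SHA-256 bucket sampling: {sample}")
```

With only ten items the selected bucket can still hold more or fewer than two elephants; the one-in-five proportion only emerges on larger populations.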
