# <font color='red'>Chapter 6: Cluster Sampling</font>

## <font color='green'>6.1 Introduction to Cluster Sampling</font>

Cluster sampling is a method in which the population is divided into groups, known as **clusters**, and a random sample of these clusters is selected to represent the population. It is commonly used when a complete list of individual population members is unavailable but natural groups exist and can be listed. Examples of such clusters include schools in a district, neighborhoods in a city, and branches of a company.

### Key Characteristics:
- The population is divided into **non-overlapping groups** (clusters).
- Clusters are selected randomly, and either all elements within selected clusters are surveyed (**one-stage cluster sampling**) or a random sample is taken from within the selected clusters (**two-stage cluster sampling**).
- The method works best when each cluster is roughly representative of the population as a whole, so that clusters resemble one another.
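
The two selection schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population (20 clusters of 30 element IDs each), the function names `one_stage_sample` and `two_stage_sample`, and the sample sizes are all hypothetical and chosen for readability.

```python
import random

def one_stage_sample(clusters, num_clusters, rng):
    """Select clusters at random and keep every element in them."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    return {cid: list(clusters[cid]) for cid in chosen}

def two_stage_sample(clusters, num_clusters, per_cluster, rng):
    """Select clusters at random, then subsample elements within each one."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    return {
        cid: rng.sample(clusters[cid], min(per_cluster, len(clusters[cid])))
        for cid in chosen
    }

# Hypothetical population: 20 clusters, each holding 30 element IDs.
rng = random.Random(0)
population = {c: [f"c{c}-e{e}" for e in range(30)] for c in range(20)}

one_stage = one_stage_sample(population, num_clusters=4, rng=rng)
two_stage = two_stage_sample(population, num_clusters=4, per_cluster=5, rng=rng)

# Number of elements surveyed per selected cluster under each scheme.
print({cid: len(elems) for cid, elems in one_stage.items()})
print({cid: len(elems) for cid, elems in two_stage.items()})
```

In the one-stage result every selected cluster contributes all 30 of its elements, while in the two-stage result each selected cluster contributes only 5.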


### <font color='green'>One-Stage Cluster Sampling</font>

In one-stage cluster sampling, **all elements** within the randomly selected clusters are included in the sample. This method is suitable when clusters are small enough that the cost or effort of surveying entire clusters is reasonable.  

### Key Points:
- **Advantage**: Simplicity, as it avoids additional sampling within clusters.  
- **Limitation**: Estimates can be imprecise if the selected clusters do not represent the population accurately, especially when only a few clusters are sampled.


### <font color='green'>Two-Stage Cluster Sampling</font>

In two-stage cluster sampling, a **random sample** of elements is taken from within each randomly selected cluster. This method is used when clusters are large and it is impractical to survey all their elements, or when elements within a cluster are similar enough that a subsample captures most of the cluster's information.  

### Key Points:
- **Advantage**: Greater control over sample size and cost.  
- **Limitation**: Increased complexity, plus an additional source of sampling error from subsampling within clusters.


### <font color='blue'>Example 1: Surveying Students in a District</font>

A researcher wants to estimate the average test score for all students in a district. Instead of surveying every student, the researcher randomly selects 10 schools (clusters) and surveys all students within those schools. Because every student in the selected schools is surveyed, this is one-stage cluster sampling.


### <font color='blue'>Example 2: Analyzing Customer Satisfaction</font>

A retail company with 100 stores (clusters) wants to assess customer satisfaction. They randomly select 15 stores and survey either:  
- **One-stage sampling**: All customers visiting those stores.  
- **Two-stage sampling**: A random sample of customers visiting those stores.


## <font color='green'>6.2 Advantages and Limitations</font>

### Advantages:
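1. **Cost and Time Efficient**: Fieldwork is concentrated in the selected clusters, which reduces travel and administration when the population is spread out geographically.  
2. **Practical**: Only a list of clusters is needed to draw the sample, so it often requires fewer resources than methods that need a list of every element, such as stratified sampling.  
3. **Simplifies Data Collection**: Naturally occurring groups can serve directly as clusters.  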

### Limitations:
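1. **Risk of High Variability**: Elements within a cluster tend to resemble one another, so for the same number of surveyed elements a cluster sample is usually less precise than a simple random sample.  
2. **Sensitivity to Cluster Differences**: If clusters differ substantially from one another, a small sample of clusters may not represent the population well, and estimates become less precise.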


## <font color='green'>6.4 Estimation in Cluster Sampling</font>

### Estimator for the Mean:
$$
\bar{y} = \frac{\sum_{i=1}^m \bar{y}_i}{m}
$$
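Where:
- $ \bar{y}_i $: Sample mean for the $ i^{th} $ cluster.
- $ m $: Number of clusters sampled.

This estimator weights every sampled cluster equally, so it estimates the population mean per element when clusters are of equal (or nearly equal) size.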

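### Estimator for the Total:
$$
\hat{T} = N_c \cdot \bar{y}
$$
Where:
- $ N_c $: Total number of clusters.
- $ \bar{y} $: Here, the mean of the sampled cluster **totals** (the sum of the values in each cluster) rather than of the cluster means, so that multiplying by $ N_c $ gives a population total.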
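
As a small worked illustration of the total estimator, the sketch below assumes that $ \bar{y} $ is computed from per-cluster totals, so that scaling by $ N_c $ yields a population total. The cluster totals and the value of `N_c` are hypothetical numbers chosen to keep the arithmetic easy to follow.

```python
from statistics import mean

# Hypothetical inputs (not taken from the examples in this chapter).
N_c = 50                                                      # total number of clusters
sampled_cluster_totals = [120.0, 95.0, 143.0, 110.0, 132.0]   # one total per sampled cluster

# Mean of the sampled cluster totals.
mean_cluster_total = mean(sampled_cluster_totals)

# Scale up to all N_c clusters to estimate the population total.
estimated_total = N_c * mean_cluster_total

print(f"Mean cluster total: {mean_cluster_total:.1f}")   # 120.0
print(f"Estimated population total: {estimated_total:.1f}")  # 6000.0
```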

### Variance of the Mean:
$$
\text{Var}(\bar{y}) = \frac{S_c^2}{m}
$$
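Where:
- $ S_c^2 $: Variance between clusters, computed from the sampled cluster means as $ S_c^2 = \frac{1}{m-1} \sum_{i=1}^m (\bar{y}_i - \bar{y})^2 $.

This simplified formula omits the finite population correction $ \left(1 - \frac{m}{N_c}\right) $, so it slightly overstates the variance when a large share of the clusters is sampled.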
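
To show how the mean and its variance might be computed from raw cluster means, here is a minimal sketch. The cluster means are hypothetical, and the code assumes that $ S_c^2 $ is the usual sample variance (divisor $ m - 1 $), which is what `statistics.variance` returns.

```python
from statistics import mean, variance

# Hypothetical sample means, one per sampled cluster.
cluster_means = [7.9, 8.4, 8.1, 8.6, 7.5, 8.8]

m = len(cluster_means)           # number of sampled clusters
y_bar = mean(cluster_means)      # estimator for the mean
s_c2 = variance(cluster_means)   # between-cluster sample variance (divisor m - 1)
var_y_bar = s_c2 / m             # variance of the mean, S_c^2 / m

print(f"m = {m}")
print(f"y_bar = {y_bar:.3f}")
print(f"S_c^2 = {s_c2:.3f}")
print(f"Var(y_bar) = {var_y_bar:.4f}")
```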


## <font color='green'>6.5 Confidence Intervals</font>

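### Confidence Interval for the Mean:
$$
CI = \bar{y} \pm Z \cdot \sqrt{\frac{S_c^2}{m}}
$$
Where $ Z $ is the standard normal critical value for the chosen confidence level (for example, $ Z = 1.96 $ for 95%).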

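### Confidence Interval for the Total:
$$
CI = \hat{T} \pm Z \cdot N_c \cdot \sqrt{\frac{S_c^2}{m}}
$$
The factor $ N_c $ appears because $ \hat{T} = N_c \cdot \bar{y} $, so the standard error of $ \hat{T} $ is $ N_c $ times the standard error of $ \bar{y} $. Here $ S_c^2 $ is the variance between the same per-cluster values that were averaged to obtain $ \bar{y} $.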
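
Both interval formulas share the same structure, so they can be wrapped in one small helper. The sketch below is an illustration only: the function name `normal_ci` and its `scale` parameter are hypothetical, the inputs are made-up numbers, and the standard library's `statistics.NormalDist` is used here to obtain $ Z $ in place of a lookup table.

```python
import math
from statistics import NormalDist

def normal_ci(estimate, between_var, m, confidence=0.95, scale=1.0):
    """Return (lower, upper) for estimate +/- Z * scale * sqrt(between_var / m).

    Use scale=1 for the mean and scale=N_c for the total.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided critical value
    half_width = z * scale * math.sqrt(between_var / m)
    return estimate - half_width, estimate + half_width

# Hypothetical inputs: mean 50, between-cluster variance 9, 9 sampled clusters.
low, high = normal_ci(50.0, 9.0, 9)
print(f"95% CI for the mean: [{low:.2f}, {high:.2f}]")   # [48.04, 51.96]

# Same standard error scaled by a hypothetical N_c = 50 for a total of 2500.
t_low, t_high = normal_ci(2500.0, 9.0, 9, scale=50)
print(f"95% CI for the total: [{t_low:.1f}, {t_high:.1f}]")
```

Passing `confidence=0.99` gives the wider interval based on $ Z \approx 2.576 $.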


## <font color='green'>6.6 Example: One-Stage Cluster Sampling</font>

### Problem:
A company wants to survey employee satisfaction across 50 branches (clusters). They randomly select 10 branches and survey all employees within those branches. The average satisfaction score for the surveyed branches is 8.2, and the variance between branch scores is 1.5.  

### Solution:
1. **Estimator for the Mean**:  
   Given $ \bar{y} = 8.2 $, the population mean estimate is also 8.2.  

2. **95% Confidence Interval** (using $ Z = 1.96 $):  
   - Variance of the mean: $ \text{Var}(\bar{y}) = \frac{1.5}{10} = 0.15 $  
   - Standard error: $ \sqrt{0.15} \approx 0.387 $  
   - Margin of error: $ 1.96 \cdot 0.387 \approx 0.759 $  

   Confidence Interval:  
   $$
   CI = 8.2 \pm 0.759 \implies [7.44, 8.96]
   $$


In [17]:
import math
from scipy.stats import norm

# Problem: One-Stage Cluster Sampling
# Given data
mean_score = 8.2                # Sample mean (𝑦̄)
variance_between_clusters = 1.5 # Variance between cluster means (S²)
num_sampled_clusters = 10       # Number of sampled clusters
z_score = norm.ppf(0.975)       # Z-score for 95% confidence level (two-tailed)

# By hand:
# Step 1: Calculate the variance of the mean
variance_mean = variance_between_clusters / num_sampled_clusters

# Step 2: Calculate the standard error
standard_error = math.sqrt(variance_mean)

# Step 3: Calculate the margin of error
margin_of_error = z_score * standard_error

# Step 4: Compute the confidence interval
lower_bound = mean_score - margin_of_error
upper_bound = mean_score + margin_of_error

print("One-Stage Cluster Sampling (By Hand):")
print(f"Confidence Interval: [{lower_bound:.2f}, {upper_bound:.2f}]")

# Using scipy.stats
lower_bound_scipy, upper_bound_scipy = norm.interval(
    confidence=0.95, loc=mean_score, scale=standard_error
)

print("\nOne-Stage Cluster Sampling (scipy.stats):")
print(f"Confidence Interval: [{lower_bound_scipy:.2f}, {upper_bound_scipy:.2f}]")


One-Stage Cluster Sampling (By Hand):
Confidence Interval: [7.44, 8.96]

One-Stage Cluster Sampling (scipy.stats):
Confidence Interval: [7.44, 8.96]


## <font color='green'>6.7 Example: Two-Stage Cluster Sampling</font>

### Problem:
In the same company example, instead of surveying all employees in selected branches, they randomly survey 5 employees per branch across the 10 selected branches. The sample mean satisfaction score is 8.5, and the variance between branch means is 2.  

### Solution:
1. **Estimator for the Mean**:  
   Given $ \bar{y} = 8.5 $, the population mean estimate is also 8.5.  

2. **95% Confidence Interval** (using $ Z = 1.96 $):  
   - Variance of the mean: $ \text{Var}(\bar{y}) = \frac{2}{10} = 0.2 $  
   - Standard error: $ \sqrt{0.2} \approx 0.447 $  
   - Margin of error: $ 1.96 \cdot 0.447 \approx 0.876 $  

   Confidence Interval:  
   $$
   CI = 8.5 \pm 0.876 \implies [7.62, 9.38]
   $$


In [19]:
# Problem: Two-Stage Cluster Sampling
# Imports repeated so this cell also runs on its own
import math
from scipy.stats import norm

# Given data
mean_score = 8.5                # Sample mean (𝑦̄)
variance_between_clusters = 2   # Variance between cluster means (S²)
num_sampled_clusters = 10       # Number of sampled clusters
z_score = norm.ppf(0.975)       # Z-score for 95% confidence level (two-tailed)

# By hand:
# Step 1: Calculate the variance of the mean
variance_mean = variance_between_clusters / num_sampled_clusters

# Step 2: Calculate the standard error
standard_error = math.sqrt(variance_mean)

# Step 3: Calculate the margin of error
margin_of_error = z_score * standard_error

# Step 4: Compute the confidence interval
lower_bound = mean_score - margin_of_error
upper_bound = mean_score + margin_of_error

print("Two-Stage Cluster Sampling (By Hand):")
print(f"Confidence Interval: [{lower_bound:.2f}, {upper_bound:.2f}]")

# Using scipy.stats
lower_bound_scipy, upper_bound_scipy = norm.interval(
    confidence=0.95, loc=mean_score, scale=standard_error
)

print("\nTwo-Stage Cluster Sampling (scipy.stats):")
print(f"Confidence Interval: [{lower_bound_scipy:.2f}, {upper_bound_scipy:.2f}]")


Two-Stage Cluster Sampling (By Hand):
Confidence Interval: [7.62, 9.38]

Two-Stage Cluster Sampling (scipy.stats):
Confidence Interval: [7.62, 9.38]


# <font color='red'>Exercises</font> 

## <font color='green'>Exercise 1: One-Stage Cluster Sampling</font>

## <font color='blue'>Problem Statement:</font>
A survey is conducted to measure the average commute time of employees across 40 offices (clusters). The survey randomly selects 8 offices, and all employees within these offices report their commute times. The mean commute time for the sampled offices is 45 minutes, and the variance between office means is 16 minutes².

**Tasks:**
1. Estimate the population mean commute time.  
2. Calculate the 95% confidence interval for the population mean.  

---

### **Hints:**
- Use the formula for variance of the mean:
$$
\text{Var}(\bar{y}) = \frac{S_c^2}{m}
$$
- Confidence interval for the mean:
$$
CI = \bar{y} \pm Z \cdot \sqrt{\frac{S_c^2}{m}}
$$
Where:
- $ \bar{y} $: Sample mean (45 minutes).  
- $ S_c^2 $: Variance between cluster means (16 minutes²).  
- $ m $: Number of sampled clusters (8).  
- $ Z $: Z-score for 95% confidence level (1.96).


In [26]:
# Solution 

## <font color='green'>Exercise 2: Two-Stage Cluster Sampling</font>

## <font color='blue'>Problem Statement:</font>
A company with 60 branches wants to evaluate customer satisfaction. They randomly select 12 branches and survey 10 customers per branch. The average satisfaction score for the sampled branches is 8.4, and the variance between branch means is 2.5.

**Tasks:**
1. Estimate the population mean satisfaction score.  
2. Calculate the 99% confidence interval for the population mean.  

---

### **Hints:**
- Variance of the mean:
$$
\text{Var}(\bar{y}) = \frac{S_c^2}{m}
$$
- Confidence interval for the mean:
$$
CI = \bar{y} \pm Z \cdot \sqrt{\frac{S_c^2}{m}}
$$
Where:
- $ \bar{y} $: Sample mean (8.4).  
- $ S_c^2 $: Variance between cluster means (2.5).  
- $ m $: Number of sampled clusters (12).  
- $ Z $: Z-score for 99% confidence level (use $ Z = 2.576 $).


In [29]:
# Solution 