# <font color='red'>Chapter 4: Stratified Sampling</font>

## <font color='green'>4.1 Introduction to Stratified Sampling</font>

Stratified sampling divides the population into distinct, non-overlapping subgroups (**strata**) based on a specific characteristic such as age, income, or location. A random sample is then drawn independently from each stratum.  

### Key Characteristics:
- The population is divided into non-overlapping groups (strata) that are internally **homogeneous**.
- Samples are taken **randomly** from each stratum.
- Every stratum is represented in the sample, which typically improves precision.

### Why Use Stratified Sampling?
1. **Improved Accuracy:** Sampling within homogeneous groups reduces the variance of the estimate.  
2. **Representation:** Guarantees that small groups are included.  
3. **Efficiency:** When strata are internally homogeneous, fewer units are needed for the same level of precision.

### Example:
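Imagine a company wants to study employee satisfaction across three departments: HR, IT, and Sales.

By stratifying the population by department and sampling within each one, we ensure that every department is represented in the sample. The representation is proportional only if the sample sizes are chosen in proportion to department size.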
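
The department example can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the employee roster, the department head counts, the per-department sample sizes, and the helper name `stratified_sample` are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, sizes, seed=0):
    """Draw a simple random sample of sizes[s] units from each stratum s."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    # Sample without replacement, independently within each stratum
    return {s: rng.sample(members, sizes[s]) for s, members in strata.items()}


# Hypothetical roster of (employee_id, department) pairs
employees = (
    [(f"HR-{i}", "HR") for i in range(20)]
    + [(f"IT-{i}", "IT") for i in range(50)]
    + [(f"SA-{i}", "Sales") for i in range(30)]
)

sample = stratified_sample(
    employees,
    stratum_of=lambda e: e[1],
    sizes={"HR": 4, "IT": 10, "Sales": 6},  # hypothetical sample sizes
)
for dept, chosen in sample.items():
    print(dept, len(chosen))
```

Each department contributes exactly the requested number of employees, which a single simple random sample of 20 from the whole roster would not guarantee.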


## <font color='green'>4.2.1 Proportional Stratified Sampling</font>

### Definition:
The sample size for each stratum is proportional to the stratum's size in the population.

### Formula:
For a given stratum $h$, the sample size $n_h$ is:  
$$
n_h = n \cdot \frac{N_h}{N}
$$  
Where:  
- $n_h$: Sample size for stratum $h$  
- $N_h$: Population size for stratum $h$  
- $N$: Total population size  
- $n$: Total sample size  

### Example:
Suppose the population has 3 strata:  
- Stratum 1: 500 people  
- Stratum 2: 300 people  
- Stratum 3: 200 people  

For a total sample of $n = 100$:  
$$
n_1 = 100 \cdot \frac{500}{1000} = 50, \quad n_2 = 100 \cdot \frac{300}{1000} = 30, \quad n_3 = 100 \cdot \frac{200}{1000} = 20
$$

This ensures proportional representation across the strata.
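
The allocation formula can be checked with a short Python sketch. The function name `proportional_allocation` is an illustrative choice; the stratum sizes and total sample size are those of the example above.

```python
def proportional_allocation(stratum_sizes, n):
    """Return n_h = n * N_h / N for each stratum (unrounded)."""
    N = sum(stratum_sizes)
    return [n * N_h / N for N_h in stratum_sizes]


allocation = proportional_allocation([500, 300, 200], n=100)
print(allocation)  # [50.0, 30.0, 20.0]
```

When the results are not whole numbers they have to be rounded, and the rounded sizes may need a small adjustment so that they still add up to $n$.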


## <font color='green'>4.2.2 Disproportional Stratified Sampling</font>

### Definition:
The sample size for each stratum is **not proportional** to the stratum's size. Instead, it is determined based on:  
1. Variability within the stratum.  
2. Importance of the stratum for the study.  

### When to Use:
- Small but important groups (e.g., minority populations).  
- Strata with higher variability that require greater precision.

### Example:
If a minority group makes up 5% of the population but is critical to the study, we may oversample that group to ensure accurate analysis.

This prioritizes precision for specific strata.
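
One common way to let variability drive the allocation is Neyman allocation, which sets $n_h$ proportional to $N_h \cdot S_h$, where $S_h$ is the standard deviation within stratum $h$. The text above does not name a specific rule, so the sketch below is only one possible choice, and the standard deviations in it are hypothetical.

```python
def neyman_allocation(stratum_sizes, stratum_sds, n):
    """Allocate n in proportion to N_h * S_h (unrounded)."""
    products = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    return [n * p / total for p in products]


sizes = [500, 300, 200]   # stratum population sizes N_h
sds = [5.0, 10.0, 20.0]   # hypothetical within-stratum standard deviations S_h
allocation = neyman_allocation(sizes, sds, n=100)
print([round(x, 1) for x in allocation])  # [26.3, 31.6, 42.1]
```

The smallest stratum receives the largest share of the sample because it is assumed to be the most variable, which is the opposite of what proportional allocation would give.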


## <font color='green'>4.3.1 Estimator for the Population Mean</font>

### Formula:
The population mean is estimated as a weighted average of the stratum means:  
$$
\bar{y} = \sum_{h=1}^L W_h \cdot \bar{y}_h
$$  
Where:  
- $W_h = \frac{N_h}{N}$: Weight of stratum $h$  
- $\bar{y}_h$: Mean of stratum $h$  
- $L$: Number of strata  

### Example:
For 3 strata:  
- Stratum 1: $W_1 = 0.4, \bar{y}_1 = 50$  
- Stratum 2: $W_2 = 0.3, \bar{y}_2 = 70$  
- Stratum 3: $W_3 = 0.3, \bar{y}_3 = 60$  

The estimated population mean is:  
$$
\bar{y} = (0.4 \cdot 50) + (0.3 \cdot 70) + (0.3 \cdot 60) = 20 + 21 + 18 = 59
$$
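
The weighted average can be reproduced with a few lines of Python. The function name `stratified_mean` and the check that the weights sum to 1 are illustrative additions.

```python
def stratified_mean(weights, means):
    """Weighted average of stratum means; weights W_h must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("stratum weights must sum to 1")
    return sum(w * m for w, m in zip(weights, means))


y_bar = stratified_mean([0.4, 0.3, 0.3], [50, 70, 60])
print(round(y_bar, 6))  # 59.0
```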


## <font color='green'>4.3.2 Estimator for the Population Total</font>

### Formula:
The population total is estimated as:  
$$
\hat{T} = N \cdot \bar{y}
$$  
Where:  
- $N$: Population size  
- $\bar{y}$: Weighted mean of the strata  

### Example:
Suppose $N = 1{,}000$ and $\bar{y} = 59$ (from the previous example):  
$$
\hat{T} = 1000 \cdot 59 = 59{,}000
$$

The estimated total for the population is **59,000**.
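
Because $W_h = N_h / N$, the same total can be written as $\hat{T} = \sum_h N_h \cdot \bar{y}_h$. The sketch below uses that form; the stratum sizes are an assumption, obtained by applying the weights 0.4, 0.3, and 0.3 to $N = 1000$.

```python
def stratified_total(stratum_sizes, means):
    """Estimated total as the sum of N_h * y_bar_h over strata."""
    return sum(N_h * m for N_h, m in zip(stratum_sizes, means))


stratum_sizes = [400, 300, 300]  # assumed N_h = W_h * N with N = 1000
means = [50, 70, 60]             # stratum sample means from the earlier example
total = stratified_total(stratum_sizes, means)
print(total)  # 59000
```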


## <font color='green'>4.3.3 Variance of the Estimator for the Mean</font>

### Formula:
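The estimated variance of the stratified mean is:  
$$
\text{Var}(\bar{y}) = \sum_{h=1}^L W_h^2 \cdot \frac{s_h^2}{n_h}
$$  
Where:  
- $W_h$: Weight of stratum $h$ ($W_h = N_h / N$)  
- $s_h^2$: Sample variance in stratum $h$  
- $n_h$: Sample size for stratum $h$  

This form omits the finite population correction $(1 - n_h / N_h)$, which is a reasonable simplification when each stratum's sample is a small fraction of the stratum.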

### Example:
Given:  
- $W_1 = 0.4, s_1^2 = 25, n_1 = 20$  
- $W_2 = 0.3, s_2^2 = 30, n_2 = 25$  
- $W_3 = 0.3, s_3^2 = 40, n_3 = 15$  

The estimated variance is:  
$$
\text{Var}(\bar{y}) = (0.4)^2 \cdot \frac{25}{20} + (0.3)^2 \cdot \frac{30}{25} + (0.3)^2 \cdot \frac{40}{15} = 0.20 + 0.108 + 0.24 = 0.548
$$
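
The sum can be evaluated numerically with the following sketch. The function name `stratified_mean_variance` is illustrative, and the code uses the formula exactly as stated in this section.

```python
def stratified_mean_variance(weights, variances, sample_sizes):
    """Sum of W_h^2 * s_h^2 / n_h over strata."""
    return sum(
        w ** 2 * s2 / n
        for w, s2, n in zip(weights, variances, sample_sizes)
    )


var = stratified_mean_variance(
    weights=[0.4, 0.3, 0.3],
    variances=[25, 30, 40],
    sample_sizes=[20, 25, 15],
)
print(round(var, 3))  # 0.548
```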


## <font color='green'>4.5 Confidence Intervals</font>

Confidence intervals in stratified sampling give a **range** within which the population mean or proportion is likely to fall at a chosen confidence level. Their width depends on the variability **within each stratum**; differences **across strata** are removed by the stratification itself and do not contribute to the variance.

---

### **Confidence Interval for the Population Mean**

1. **Formula**  
The confidence interval for the population mean is calculated as:  
$$
CI = \bar{y} \pm Z \cdot \sqrt{\text{Var}(\bar{y})}
$$  

Where:  
- $ \bar{y} $: Weighted sample mean.  
- $ Z $: Z-score corresponding to the desired confidence level (e.g., 1.96 for 95%).  
- $ \text{Var}(\bar{y}) $: Variance of the stratified mean, calculated as:  
$$
\text{Var}(\bar{y}) = \sum_{h=1}^L W_h^2 \cdot \frac{s_h^2}{n_h}
$$  

2. **Explanation of Terms**  
- $ W_h = \frac{N_h}{N} $: Weight of stratum $ h $ based on its proportion in the population.  
- $ s_h^2 $: Sample variance within stratum $ h $.  
- $ n_h $: Sample size for stratum $ h $.  
- $ L $: Total number of strata.

---

### **Confidence Interval for the Population Proportion**

1. **Formula**  
For categorical data, the confidence interval for the population proportion is:  
$$
CI = \hat{P} \pm Z \cdot \sqrt{\frac{\hat{P} (1 - \hat{P})}{n}}
$$  

Where:  
- $ \hat{P} $: Weighted sample proportion.  
- $ Z $: Z-score for the desired confidence level.  
- $ n $: Total sample size.

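2. **Key Notes**  
- Calculate the overall $ \hat{P} $ as a weighted average of the stratum proportions:  
$$
\hat{P} = \sum_{h=1}^L W_h \cdot \hat{P}_h
$$  
Where $ \hat{P}_h $ is the sample proportion in stratum $ h $.
- The standard error in the formula above pools all strata as if the data were a single simple random sample of size $ n $, so it is an approximation. The stratum-by-stratum analogue of the variance of the mean is $ \sum_{h=1}^L W_h^2 \cdot \hat{P}_h (1 - \hat{P}_h) / n_h $.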

---

### **Example 1: Confidence Interval for the Mean**

**Problem:**  
A population is divided into 3 strata. The following information is provided:  

| Stratum | Weight ($ W_h $) | Sample Mean ($ \bar{y}_h $) | Sample Variance ($ s_h^2 $) | Sample Size ($ n_h $) |
|---------|------------------|----------------------------|----------------------------|-----------------------|
| 1       | 0.4              | 50                         | 25                         | 20                    |
| 2       | 0.3              | 70                         | 30                         | 25                    |
| 3       | 0.3              | 60                         | 40                         | 15                    |

**Solution:**  

1. **Calculate the Weighted Mean ($ \bar{y} $):**  
$$
\bar{y} = (0.4 \cdot 50) + (0.3 \cdot 70) + (0.3 \cdot 60)
$$  
$$
\bar{y} = 20 + 21 + 18 = 59
$$  

2. **Calculate the Variance of the Mean:**  
$$
\text{Var}(\bar{y}) = (0.4^2 \cdot \frac{25}{20}) + (0.3^2 \cdot \frac{30}{25}) + (0.3^2 \cdot \frac{40}{15})
$$  
Simplify each term:  
- For Stratum 1: $ 0.4^2 \cdot \frac{25}{20} = 0.2 $  
- For Stratum 2: $ 0.3^2 \cdot \frac{30}{25} = 0.108 $  
- For Stratum 3: $ 0.3^2 \cdot \frac{40}{15} = 0.24 $  

Add the variances:  
$$
\text{Var}(\bar{y}) = 0.2 + 0.108 + 0.24 = 0.548
$$  

3. **Standard Error (SE):**  
$$
SE = \sqrt{\text{Var}(\bar{y})} = \sqrt{0.548} \approx 0.74
$$  

4. **Construct the Confidence Interval:**  
For 95% confidence, $ Z = 1.96 $:  
$$
CI = 59 \pm 1.96 \cdot 0.74
$$  
Simplify:  
$$
CI = 59 \pm 1.45
$$  
Final Confidence Interval:  
$$
[57.55, 60.45]
$$  
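
The four steps of Example 1 can be combined into one Python sketch. The function name `stratified_mean_ci` is illustrative, and the Z-score is taken from the standard normal distribution rather than hard-coded, so it is about 1.95996 instead of the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist


def stratified_mean_ci(weights, means, variances, sample_sizes, confidence=0.95):
    """Normal-approximation confidence interval for the stratified mean."""
    y_bar = sum(w * m for w, m in zip(weights, means))
    var = sum(
        w ** 2 * s2 / n
        for w, s2, n in zip(weights, variances, sample_sizes)
    )
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    margin = z * sqrt(var)
    return y_bar - margin, y_bar + margin


low, high = stratified_mean_ci(
    weights=[0.4, 0.3, 0.3],
    means=[50, 70, 60],
    variances=[25, 30, 40],
    sample_sizes=[20, 25, 15],
)
print(round(low, 2), round(high, 2))  # 57.55 60.45
```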

---

### **Example 2: Confidence Interval for Proportions**

**Problem:**  
A survey divides a population into 2 strata:  
- Stratum 1: $ W_1 = 0.6 $, $ \hat{P}_1 = 0.4 $, $ n_1 = 100 $  
- Stratum 2: $ W_2 = 0.4 $, $ \hat{P}_2 = 0.6 $, $ n_2 = 50 $  

Find the 95% confidence interval for the overall proportion.

**Solution:**  

1. **Calculate the Weighted Proportion ($ \hat{P} $):**  
$$
\hat{P} = (0.6 \cdot 0.4) + (0.4 \cdot 0.6)
$$  
$$
\hat{P} = 0.24 + 0.24 = 0.48
$$  

2. **Calculate the Standard Error (SE):**  
The total sample size $ n = n_1 + n_2 = 100 + 50 = 150 $.  
$$
SE = \sqrt{\frac{\hat{P} (1 - \hat{P})}{n}} = \sqrt{\frac{0.48 \cdot (1 - 0.48)}{150}}
$$  
Simplify:  
$$
SE = \sqrt{\frac{0.48 \cdot 0.52}{150}} = \sqrt{0.001664} \approx 0.041
$$  

3. **Construct the Confidence Interval:**  
For 95% confidence, $ Z = 1.96 $:  
$$
CI = 0.48 \pm 1.96 \cdot 0.041
$$  
Simplify:  
$$
CI = 0.48 \pm 0.08
$$  
Final Confidence Interval:  
$$
[0.40, 0.56]
$$  

---
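
Example 2 can be reproduced in Python as sketched below. For comparison, the code also computes a stratum-by-stratum standard error, $\sqrt{\sum_h W_h^2 \cdot \hat{P}_h (1 - \hat{P}_h) / n_h}$. That second formula is not used in the worked example; it is included here as an assumption about how the variance formula for the mean carries over to proportions.

```python
from math import sqrt

weights = [0.6, 0.4]   # W_h
props = [0.4, 0.6]     # stratum sample proportions
sizes = [100, 50]      # n_h
z = 1.96               # 95% confidence

# Weighted overall proportion
p_hat = sum(w * p for w, p in zip(weights, props))

# Pooled standard error, as in the worked example
se_pooled = sqrt(p_hat * (1 - p_hat) / sum(sizes))

# Stratum-by-stratum standard error (assumed analogue, see lead-in)
se_strat = sqrt(
    sum(w ** 2 * p * (1 - p) / n for w, p, n in zip(weights, props, sizes))
)

low, high = p_hat - z * se_pooled, p_hat + z * se_pooled
print(round(p_hat, 2), round(se_pooled, 4), round(se_strat, 4))
print(round(low, 2), round(high, 2))
```

With these numbers the two standard errors differ only in the fourth decimal place, so the interval is the same to two decimals either way.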



# <font color='blue'>Solved exercises</font> 

# <font color='red'>Exercise 2: Weighted Mean</font>

## <font color='green'>Problem:</font>
Estimate the overall population mean from the following strata data:

| Stratum | Weight ($ W_h $) | Sample Mean ($ \bar{y}_h $) |
|---------|------------------|----------------------------|
| A       | 0.5              | 80                         |
| B       | 0.3              | 60                         |
| C       | 0.2              | 50                         |

## <font color='green'>Solution:</font>

The formula for the weighted mean is:  
$$
\bar{y} = \sum_{h=1}^L W_h \cdot \bar{y}_h
$$  

### **Step-by-Step Calculation**

1. For Stratum A:  
$$
0.5 \cdot 80 = 40
$$  

2. For Stratum B:  
$$
0.3 \cdot 60 = 18
$$  

3. For Stratum C:  
$$
0.2 \cdot 50 = 10
$$  

Add the contributions from all strata:  
$$
\bar{y} = 40 + 18 + 10 = 68
$$  

## <font color='green'>Final Answer:</font>  
The overall population mean is **68**.


# <font color='red'>Exercise 3: Variance of the Mean</font>

## <font color='green'>Problem:</font>
Calculate the variance of the mean for the following strata data:

| Stratum | Weight ($ W_h $) | Sample Variance ($ s_h^2 $) | Sample Size ($ n_h $) |
|---------|------------------|----------------------------|-----------------------|
| A       | 0.4              | 25                         | 20                    |
| B       | 0.3              | 30                         | 25                    |
| C       | 0.3              | 40                         | 15                    |

## <font color='green'>Solution:</font>

The formula for the variance of the mean is:  
$$
\text{Var}(\bar{y}) = \sum_{h=1}^L W_h^2 \cdot \frac{s_h^2}{n_h}
$$  

### **Step-by-Step Calculation**

1. For Stratum A:  
$$
\text{Var}_A = (0.4)^2 \cdot \frac{25}{20} = 0.16 \cdot 1.25 = 0.20
$$  

2. For Stratum B:  
$$
\text{Var}_B = (0.3)^2 \cdot \frac{30}{25} = 0.09 \cdot 1.2 = 0.108
$$  

3. For Stratum C:  
$$
\text{Var}_C = (0.3)^2 \cdot \frac{40}{15} \approx 0.09 \cdot 2.67 \approx 0.24
$$  

### Total Variance:  
Add the variances from all strata:  
$$
\text{Var}(\bar{y}) = 0.20 + 0.108 + 0.24 = 0.548
$$  

## <font color='green'>Final Answer:</font>  
The variance of the mean is **0.548**.


# <font color='red'>Exercise 4: Confidence Interval for the Mean</font>

## <font color='green'>Problem:</font>
Using the variance from the previous exercise ($ \text{Var}(\bar{y}) = 0.548 $), calculate the 95% confidence interval for the mean when $ \bar{y} = 68 $.

## <font color='green'>Solution:</font>

The formula for the confidence interval is:  
$$
CI = \bar{y} \pm Z \cdot \sqrt{\text{Var}(\bar{y})}
$$  

1. **Standard Error (SE):**  
$$
SE = \sqrt{0.548} \approx 0.74
$$  

2. **Margin of Error (ME):**  
For a 95% confidence level, $ Z = 1.96 $:  
$$
ME = 1.96 \cdot 0.74 = 1.45
$$  

3. **Confidence Interval:**  
$$
CI = 68 \pm 1.45
$$  
$$
CI = [66.55, 69.45]
$$  

## <font color='green'>Final Answer:</font>  
The 95% confidence interval is **[66.55, 69.45]**.


# <font color='red'>Exercise 5: Confidence Interval for Proportions</font>

## <font color='green'>Problem:</font>
A survey finds 60% of 150 respondents favor a policy. Construct the 95% confidence interval for the proportion.

## <font color='green'>Solution:</font>

The formula for proportions is:  
$$
CI = \hat{P} \pm Z \cdot \sqrt{\frac{\hat{P}(1 - \hat{P})}{n}}
$$  

1. **Values:**  
- $ \hat{P} = 0.6 $  
- $ n = 150 $, $ Z = 1.96 $  

2. **Standard Error (SE):**  
$$
SE = \sqrt{\frac{0.6 \cdot (1 - 0.6)}{150}} = \sqrt{\frac{0.6 \cdot 0.4}{150}} = 0.04
$$  

3. **Margin of Error (ME):**  
$$
ME = 1.96 \cdot 0.04 = 0.0784 \approx 0.08
$$  

4. **Confidence Interval:**  
$$
CI = 0.6 \pm 0.08
$$  
$$
CI = [0.52, 0.68]
$$  

## <font color='green'>Final Answer:</font>  
The 95% confidence interval is **[0.52, 0.68]**.


# <font color='blue'>Exercises</font> 

# <font color='red'>Exercise 1: Proportional Sample Allocation</font>

## <font color='green'>Problem:</font>
A population of 2,000 people is divided into 4 strata as follows:

| Stratum | Population Size ($ N_h $) |
|---------|---------------------------|
| A       | 800                       |
| B       | 600                       |
| C       | 400                       |
| D       | 200                       |

If the total sample size is **200**, determine the sample size for each stratum using **proportional allocation**.


# Solution

# <font color='red'>Exercise 2: Weighted Mean</font>

## <font color='green'>Problem:</font>
Estimate the overall population mean given the following strata data:

| Stratum | Weight ($ W_h $) | Sample Mean ($ \bar{y}_h $) |
|---------|------------------|----------------------------|
| A       | 0.3              | 70                         |
| B       | 0.4              | 60                         |
| C       | 0.3              | 80                         |

Use the formula for the weighted mean:
$$
\bar{y} = \sum_{h=1}^L W_h \cdot \bar{y}_h
$$


# Solution

# <font color='red'>Exercise 3: Variance of the Mean</font>

## <font color='green'>Problem:</font>
Calculate the variance of the mean for the following strata data:

| Stratum | Weight ($ W_h $) | Sample Variance ($ s_h^2 $) | Sample Size ($ n_h $) |
|---------|------------------|----------------------------|-----------------------|
| A       | 0.5              | 20                         | 25                    |
| B       | 0.3              | 25                         | 30                    |
| C       | 0.2              | 15                         | 20                    |

Use the formula for variance:
$$
\text{Var}(\bar{y}) = \sum_{h=1}^L W_h^2 \cdot \frac{s_h^2}{n_h}
$$


# <font color='red'>Exercise 4: Confidence Interval for the Mean</font>

## <font color='green'>Problem:</font>
Given the following information:

- Weighted mean ($ \bar{y} $): 75
- Variance of the mean ($ \text{Var}(\bar{y}) $): 1.5
- Confidence level: 95% ($ Z = 1.96 $)

Construct the **95% confidence interval** for the population mean using the formula:
$$
CI = \bar{y} \pm Z \cdot \sqrt{\text{Var}(\bar{y})}
$$


# Solution

# <font color='red'>Exercise 5: Confidence Interval for Proportions</font>

## <font color='green'>Problem:</font>
A survey is conducted in 2 strata:  

| Stratum | Weight ($ W_h $) | Proportion ($ \hat{P}_h $) | Sample Size ($ n_h $) |
|---------|------------------|---------------------------|-----------------------|
| A       | 0.6              | 0.5                       | 100                   |
| B       | 0.4              | 0.7                       | 80                    |

1. Compute the weighted overall proportion ($ \hat{P} $):  
$$
\hat{P} = \sum_{h=1}^L W_h \cdot \hat{P}_h
$$  

2. Calculate the **95% confidence interval** for the proportion using:  
$$
CI = \hat{P} \pm Z \cdot \sqrt{\frac{\hat{P} (1 - \hat{P})}{n}}
$$  
Where:  
- $ Z = 1.96 $ (for 95% confidence).  
- $ n = n_1 + n_2 $ (total sample size).


# Solution