# <font color='red'>Chapter 5: Systematic Sampling</font>

## <font color='green'>5.1 Introduction to Systematic Sampling</font>

Systematic sampling is a method where elements are selected from the population at regular intervals, known as the **sampling interval**. This technique is especially useful when the population is large and drawing every element independently at random is impractical.

### Key Characteristics:
- The population is arranged in an **ordered list**.
- A **random starting point** is selected.
- Every $ k^{th} $ element is included in the sample, where $ k $ is the sampling interval.


## <font color='green'>5.2 Sampling Interval ($ k $)</font>

### Formula:
The sampling interval, $ k $, is calculated as:
$$
k = \frac{N}{n}
$$
Where:
- $ N $: Total population size.
- $ n $: Desired sample size.
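
When $ N $ is not an exact multiple of $ n $, the ratio $ N / n $ is not a whole number. The sketch below assumes the convention of rounding down (the `N // n` used in this chapter's code) and counts how many elements each possible starting point would yield. The values `N = 105` and `n = 10` are made up for illustration.

```python
# Hypothetical sizes chosen so that N is not a multiple of n
N, n = 105, 10

# Rounding down keeps k a whole number (assumed convention)
k = N // n

# Number of selected elements for each starting point r = 1, ..., k
sizes = {r: len(range(r - 1, N, k)) for r in range(1, k + 1)}

print(f"k = {k}")
print(sizes)
```

Under this convention some starting points return more than $ n $ elements, so the sample is often truncated to the first $ n $ selections.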


In [9]:
import random

import pandas as pd

# Example population: List of 100 employees
population = [f"Employee_{i}" for i in range(1, 101)]

# Step 1: Define population size (N) and desired sample size (n)
N = len(population)  # Total population size
n = 10               # Desired sample size

# Step 2: Calculate sampling interval (k); integer division keeps k a whole number
k = N // n

# Step 3: Select a random starting point (r) between 1 and k, inclusive
random.seed(42)  # For reproducibility
r = random.randint(1, k)

# Step 4: Generate the systematic sample
systematic_sample = [population[i] for i in range(r - 1, N, k)]

# Step 5: Output the selected sample
sample_df = pd.DataFrame(systematic_sample, columns=["Selected Employees"])
print(sample_df)

  Selected Employees
0         Employee_2
1        Employee_12
2        Employee_22
3        Employee_32
4        Employee_42
5        Employee_52
6        Employee_62
7        Employee_72
8        Employee_82
9        Employee_92


## <font color='green'>5.3 Steps in Systematic Sampling</font>

1. **Determine the Population Size** ($ N $): Count the total number of elements in the population.  
2. **Decide the Sample Size** ($ n $): Determine how many elements you need in your sample.  
3. **Calculate the Sampling Interval** ($ k $): Use the formula $ k = \frac{N}{n} $, rounding down to a whole number if necessary.  
4. **Select a Random Starting Point** ($ r $): Choose a random integer between 1 and $ k $.  
5. **Select Every $ k^{th} $ Element**: Start from $ r $ and include every $ k^{th} $ element until the sample size is met.
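
The five steps above can be wrapped in a small helper. This is a minimal sketch rather than a definitive implementation: the function name `systematic_sample`, its `seed` parameter, and the `Item_` labels are all assumptions introduced for illustration.

```python
import random


def systematic_sample(population, n, seed=None):
    """Draw a systematic sample of n items from an ordered list.

    Returns the starting point r (1-based), the interval k, and the sample.
    """
    N = len(population)                      # Step 1: population size
    if not 0 < n <= N:                       # Step 2: validate sample size
        raise ValueError("n must be between 1 and the population size")
    k = N // n                               # Step 3: interval, rounded down
    rng = random.Random(seed)                # local generator, optional seed
    r = rng.randint(1, k)                    # Step 4: start between 1 and k
    sample = population[r - 1::k][:n]        # Step 5: every k-th item, stop at n
    return r, k, sample


# Usage with a hypothetical list of 60 items
items = [f"Item_{i}" for i in range(1, 61)]
r, k, sample = systematic_sample(items, 6, seed=1)
print(r, k, sample)
```

Truncating with `[:n]` implements "until the sample size is met", which matters only when $ N $ is not a multiple of $ n $.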


## <font color='green'>5.4 Advantages and Limitations</font>

### Advantages:
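- **Simple and Quick**: Only one random number (the starting point) is needed, so it is easier to carry out than simple random sampling.  
- **Even Coverage**: The sample is spread evenly across the ordered list, which tends to give a representative sample when the list has no periodic pattern.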

### Limitations:
- **Risk of Bias**: If the population has a periodic pattern that aligns with the sampling interval, the sample may not be representative.  
- **Requires a Complete List**: The population must be fully listed and accessible.
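
The periodicity risk can be illustrated with a small, made-up population. The sketch below assumes 100 daily values in which every 10th day is a peak, then draws every possible systematic sample with $ k = 10 $. The numbers are hypothetical and chosen only to make the effect obvious.

```python
# Hypothetical population: 100 daily values with a peak on every 10th day
population = [200 if day % 10 == 0 else 100 for day in range(1, 101)]
true_mean = sum(population) / len(population)

k = 10  # sampling interval equal to the period of the pattern

# Sample mean for each possible starting point r = 1, ..., k
means_by_start = {}
for r in range(1, k + 1):
    sample = population[r - 1::k]
    means_by_start[r] = sum(sample) / len(sample)

print(f"True mean: {true_mean}")
print(means_by_start)
```

Because the interval matches the cycle length, each sample contains either only peak days or no peak days, so no starting point reproduces the true mean.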


## <font color='green'>5.5 Estimation in Systematic Sampling</font>

Systematic sampling allows for the estimation of the population **mean** and **total**, similar to simple random sampling.

### Estimator for the Mean:
$$
\bar{y} = \frac{\sum_{i=1}^n y_i}{n}
$$
Where:
- $ \bar{y} $: Sample mean.
- $ y_i $: Value of the $ i^{th} $ element in the sample.
- $ n $: Sample size.

### Estimator for the Total:
$$
\hat{T} = N \cdot \bar{y}
$$
Where:
- $ \hat{T} $: Estimated population total.
- $ N $: Population size.

### Variance of the Mean:
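When the order of the list is unrelated to the variable being measured, the variance of the systematic-sample mean is commonly approximated by the simple random sampling formula (ignoring the finite population correction):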
$$
\text{Var}(\bar{y}) = \frac{S^2}{n}
$$
Where:
- $ S^2 $: Population variance.
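
The formula above treats a systematic sample like a simple random sample. One way to see how well that works is to enumerate all $ k $ possible systematic samples and compute the variance of their means directly. The sketch below does this for the values 1 to 100, once in sorted order and once shuffled. The helper name `systematic_mean_variance` is an assumption, and $ S^2 $ is computed with the $ N - 1 $ divisor.

```python
import random
import statistics


def systematic_mean_variance(values, k):
    """Exact variance of the sample mean over all k possible starting points.

    Assumes len(values) is a multiple of k, so every sample has the same size.
    """
    mu = statistics.fmean(values)
    means = [statistics.fmean(values[start::k]) for start in range(k)]
    return sum((m - mu) ** 2 for m in means) / k


values = list(range(1, 101))   # illustrative population
n, k = 10, 10

# Approximation from the formula: S^2 / n
srs_approx = statistics.variance(values) / n

# Sorted list: the order is strongly related to the values
sorted_var = systematic_mean_variance(values, k)

# Shuffled list: the order is unrelated to the values
shuffled = values[:]
random.Random(0).shuffle(shuffled)
shuffled_var = systematic_mean_variance(shuffled, k)

print(f"S^2 / n approximation: {srs_approx:.2f}")
print(f"Exact, sorted list:    {sorted_var:.2f}")
print(f"Exact, shuffled list:  {shuffled_var:.2f}")
```

For the sorted list the exact variance is 8.25, far below the $ S^2 / n $ value of about 84.17, which shows that the formula is an approximation whose quality depends on how the list is ordered.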


## <font color='green'>5.6 Confidence Intervals</font>

### Confidence Interval for the Mean:
$$
CI = \bar{y} \pm Z \cdot \sqrt{\frac{S^2}{n}}
$$
Where:
- $ Z $: Z-score for the desired confidence level (e.g., 1.96 for 95%).

### Confidence Interval for the Total:
$$
CI = \hat{T} \pm Z \cdot N \cdot \sqrt{\frac{S^2}{n}}
$$
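
The interval for the total can be computed the same way as the interval for the mean, scaled by $ N $. The following is a minimal sketch: the function name `total_confidence_interval` and the value `N = 100` are assumptions, and the Z-score comes from the standard library's `statistics.NormalDist` rather than `scipy`.

```python
import math
from statistics import NormalDist


def total_confidence_interval(sample_mean, population_variance, n, N, confidence=0.95):
    """Confidence interval for the population total, T_hat ± Z * N * sqrt(S^2 / n)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)   # e.g. about 1.96 for 95%
    se_mean = math.sqrt(population_variance / n)     # standard error of the mean
    total_hat = N * sample_mean                      # estimated total
    margin = z * N * se_mean                         # margin of error for the total
    return total_hat - margin, total_hat + margin


# Illustrative inputs; N = 100 is a hypothetical population size
low, high = total_confidence_interval(sample_mean=75, population_variance=1.5, n=10, N=100)
print(f"Confidence Interval for the Total: [{low:.2f}, {high:.2f}]")
```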


In [26]:
import math

# Step 1: Define the variables
sample_mean = 75           # Sample mean (𝑦̄)
population_variance = 1.5  # Population variance (S²)
sample_size = 10           # Sample size (n)
confidence_level = 0.95    # Confidence level (e.g., 95%)

# Step 2: Calculate the standard error (SE)
standard_error = math.sqrt(population_variance / sample_size)

# Step 3: Determine the Z-score for the confidence level
# For 95% confidence level, Z ≈ 1.96
from scipy.stats import norm
z_score = norm.ppf((1 + confidence_level) / 2)

# Step 4: Calculate the margin of error (ME)
margin_of_error = z_score * standard_error

# Step 5: Compute the confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

# Step 6: Display the results
print(f"Sample Mean: {sample_mean}")
print(f"Confidence Interval: [{lower_bound:.2f}, {upper_bound:.2f}]")

Sample Mean: 75
Confidence Interval: [74.24, 75.76]


## <font color='green'>Alternative Method for Confidence Intervals</font>

Instead of calculating the confidence interval manually with the formula
$$
CI = \bar{y} \pm Z \cdot \text{SE}
$$
you can use the `norm.interval` function from the **`scipy.stats`** module for a more concise approach.

---

### **How to Use:**

```python
import math

from scipy.stats import norm

# Parameters
confidence_level = 0.95    # Confidence level (e.g., 95%)
population_variance = 1.5  # Population variance (S²)
sample_size = 10           # Sample size (n)
sample_mean = 75           # Sample mean (𝑦̄)
standard_error = math.sqrt(population_variance / sample_size)  # Standard error (SE)

# Calculate confidence interval
lower_bound, upper_bound = norm.interval(confidence=confidence_level, loc=sample_mean, scale=standard_error)

print(f"Confidence Interval: [{lower_bound:.2f}, {upper_bound:.2f}]")
```


In [13]:
from scipy.stats import norm

In [23]:
lower_bound, upper_bound = norm.interval(confidence=confidence_level, loc=sample_mean, scale=standard_error)
print(f"Confidence Interval: [{lower_bound:.2f}, {upper_bound:.2f}]")

Confidence Interval: [74.24, 75.76]


# <font color='red'>Solved problems</font>  

## <font color='green'>Problem 1: Selecting a Systematic Sample</font>

### <font color='green'>Problem Statement:</font>
A company has 120 employees, and a survey needs to select 12 employees using systematic sampling.

- Population size ($ N $): 120  
- Sample size ($ n $): 12  

**Steps:**
1. Calculate the sampling interval ($ k $).  
2. Select a random starting point ($ r $).  
3. Select every $ k^{th} $ employee to form the sample.


In [35]:
# Step 1: Define the population
population = [f"Employee_{i}" for i in range(1, 121)]  # Employees 1 to 120

# Step 2: Define parameters
N = len(population)  # Total population size
n = 12               # Desired sample size
k = N // n           # Sampling interval
random.seed(42)      # For reproducibility
r = random.randint(1, k)  # Random starting point

# Step 3: Generate the systematic sample
systematic_sample = [population[i] for i in range(r - 1, N, k)]

# Output the sample
sample_df = pd.DataFrame(systematic_sample, columns=["Selected Employees"])
print(sample_df)


   Selected Employees
0          Employee_2
1         Employee_12
2         Employee_22
3         Employee_32
4         Employee_42
5         Employee_52
6         Employee_62
7         Employee_72
8         Employee_82
9         Employee_92
10       Employee_102
11       Employee_112


## <font color='green'>Problem 2: Estimating the Population Mean</font>

### <font color='green'>Problem Statement:</font>
A systematic sample of size $ n = 10 $ is drawn from a population of size $ N = 100 $. The sample consists of the following values:

$$
[45, 50, 55, 40, 60, 70, 65, 55, 50, 60]
$$

Estimate:
1. The population mean ($ \bar{y} $).  
2. The population total ($ \hat{T} $).  


In [38]:
# Step 1: Define the sample data
sample_data = [45, 50, 55, 40, 60, 70, 65, 55, 50, 60]
sample_size = len(sample_data)
population_size = 100

# Step 2: Calculate the sample mean (ȳ)
sample_mean = sum(sample_data) / sample_size

# Step 3: Estimate the population total (T̂)
population_total = population_size * sample_mean

# Output results
print(f"Sample Mean (ȳ): {sample_mean:.2f}")
print(f"Estimated Population Total (T̂): {population_total:.2f}")


Sample Mean (ȳ): 55.00
Estimated Population Total (T̂): 5500.00


## <font color='green'>Problem 3: Confidence Interval for the Mean</font>

### <font color='green'>Problem Statement:</font>
Using the sample data from **Problem 2**, construct a 95% confidence interval for the population mean. Assume the population variance is $ S^2 = 200 $.

**Steps:**
1. Calculate the standard error (SE).  
2. Use the Z-score for 95% confidence ($ Z = 1.96 $).  
3. Construct the confidence interval.


In [42]:
# Step 1: Define the known parameters
population_variance = 200  # S²
confidence_level = 0.95
z_score = norm.ppf((1 + confidence_level) / 2)

# Step 2: Calculate the standard error (SE)
standard_error = math.sqrt(population_variance / sample_size)

# Step 3: Calculate the confidence interval
lower_bound, upper_bound = norm.interval(
    confidence=confidence_level, loc=sample_mean, scale=standard_error
)

# Output the results
print(f"Sample Mean (ȳ): {sample_mean:.2f}")
print(f"Confidence Interval: [{lower_bound:.2f}, {upper_bound:.2f}]")


Sample Mean (ȳ): 55.00
Confidence Interval: [46.23, 63.77]
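
Problem 3 takes $ S^2 $ as given. When it is unknown, one common approach (an assumption here, not part of the problem statement) is to plug in the sample variance and treat the systematic sample like a simple random sample. The sketch below applies that to the Problem 2 data, using the standard library's `statistics` module in place of `scipy`.

```python
import math
import statistics
from statistics import NormalDist

sample_data = [45, 50, 55, 40, 60, 70, 65, 55, 50, 60]
n = len(sample_data)

sample_mean = statistics.fmean(sample_data)
s2 = statistics.variance(sample_data)      # sample variance, n - 1 divisor
se = math.sqrt(s2 / n)                     # estimated standard error

z = NormalDist().inv_cdf(0.975)            # Z-score for 95% confidence
low, high = sample_mean - z * se, sample_mean + z * se

print(f"Sample variance: {s2:.2f}")
print(f"Confidence Interval: [{low:.2f}, {high:.2f}]")
```

With only 10 observations, a t-based quantile would usually be preferred over the Z-score; the Z-score is kept here to match the formulas in this chapter.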


# <font color='red'>Exercises</font>

### Practice what you've learned with the exercises below 

## <font color='green'>Exercise 1: Systematic Sampling</font>

### <font color='green'>Problem Statement:</font>
A population of 500 households is being surveyed to assess energy consumption patterns. You need to select a systematic sample of 50 households.

**Tasks:**
1. Calculate the sampling interval ($ k $).  
2. Select a random starting point ($ r $).  
3. Identify the households included in the sample.  

---

### **Hints:**
- Use the formula $ k = \frac{N}{n} $ to find the sampling interval.  
- Choose a random starting point $ r $ between 1 and $ k $.  
- Select every $ k^{th} $ household starting from $ r $.  


In [49]:
# Solution 

## <font color='green'>Exercise 2: Population Mean and Total</font>

### <font color='green'>Problem Statement:</font>
A systematic sample of size $ n = 15 $ is selected from a population of size $ N = 150 $. The observed sample values (in dollars) are:

$$
[100, 120, 110, 130, 140, 150, 125, 135, 145, 155, 115, 105, 95, 125, 135]
$$

**Tasks:**
1. Calculate the sample mean ($ \bar{y} $).  
2. Estimate the total population value ($ \hat{T} $).  

---

### **Hints:**
- Use the formula for the sample mean:
$$
\bar{y} = \frac{\sum y_i}{n}
$$
- Use the formula for the population total:
$$
\hat{T} = N \cdot \bar{y}
$$


In [None]:
# Solution

## <font color='green'>Exercise 3: Confidence Interval for the Mean</font>

### <font color='green'>Problem Statement:</font>
You have the following data:
- Sample size: $ n = 20 $  
- Sample mean: $ \bar{y} = 70 $  
- Population variance: $ S^2 = 250 $  
- Confidence level: 95% ($ Z = 1.96 $)  

**Tasks:**
1. Calculate the standard error ($ SE $).  
2. Construct the 95% confidence interval for the population mean.  

---

### **Hints:**
- Use the formula for standard error:
$$
SE = \sqrt{\frac{S^2}{n}}
$$
- Use the formula for confidence interval:
$$
CI = \bar{y} \pm Z \cdot SE
$$


In [53]:
# Solution 