## Lab 4: Point Estimation and Confidence Intervals
#### MA 189 Data Dive Into Birmingham (with R)
##### _Blazer Core: City as Classroom_

Course Website: [Github.com/kerenli/statbirmingham/](https://github.com/kerenli/statbirmingham/) 


#### Levels:
<div class="alert-success"> Concepts and general information</div>
<div class="alert-warning"> Important methods and technique details </div>
<div class="alert-info"> Extended reading </div>
<div class="alert-danger"> (Local) Examples, assignments, and <b>Practice in Birmingham</b> </div>

<div class="alert alert-block alert-danger">
<b>Local Application</b>: Alabama Power Company Customer Loads
   
</div>

In this lab, we will work with the Alabama Power Company Customer Loads dataset. The dataset records the system marginal cost (system lambda), in dollars per megawatt-hour, of generating electricity in Alabama over a specified period.

#### Steps:
1. Load the data into R from a provided dataset (or simulated data for illustration).
2. Familiarize yourself with the variables, focusing on the system lambda values.

In [None]:
# Load necessary libraries
library(tidyverse)

# readxl reads the "Lambdas and RTP Customer Loads" workbook; install it if it is missing
if (!require(readxl)) {
  install.packages("readxl")
  library(readxl)
}

# Define the path to the Excel file
file_path <- "data/Lambdas and RTP Customer Loads.xlsx"

# Read the Excel file from the "data" subfolder
lambda_data <- read_excel(file_path, skip = 2)

# Illustrate the content: Display the first few rows of the data
head(lambda_data)
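
If the Excel file is not available, Step 1 allows simulated data for illustration. The block below is a minimal sketch of such a stand-in. It assumes the layout suggested by the code in this lab (one row per day, columns `hour01` to `hour24`); the `date` column, the name `sim_lambda_data`, and the gamma parameters are made up for illustration and do not reflect real Alabama Power values.

```r
# Simulated stand-in for the Excel data (illustration only, not real values).
# Assumed layout: one row per day, a date column, and hour01..hour24 columns.
set.seed(189)
n_days <- 365
hour_names <- sprintf("hour%02d", 1:24)

# Draw positive, right-skewed values; the gamma parameters are arbitrary.
sim_values <- matrix(rgamma(n_days * 24, shape = 9, rate = 0.3),
                     nrow = n_days, dimnames = list(NULL, hour_names))

# Combine a daily date column with the 24 hourly columns.
sim_lambda_data <- data.frame(
  date = seq(as.Date("2018-01-01"), by = "day", length.out = n_days),
  sim_values
)

dim(sim_lambda_data)
head(sim_lambda_data[, 1:4])
```

The simulated table is stored under a separate name so that it does not overwrite `lambda_data`.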

<div class="alert alert-block alert-success">Point Estimation</div>
We will calculate point estimates for the system lambda, focusing on the mean and proportion for specific time intervals.

#### Steps:
1. **Mean Estimate:** Calculate the point estimate of the mean system lambda during specific hours (e.g., 12am-1am and 6am-7am).
2. **Proportion Estimate:** Estimate the proportion of hours in a day where the system lambda exceeds a certain threshold.

In [None]:
# Calculate the mean system lambda for 12am-1am and 6am-7am
# (hour01 covers 12am-1am, so 6am-7am is hour07; hour06 would be 5am-6am)
mean_lambda_12am <- mean(lambda_data$hour01, na.rm = TRUE)
mean_lambda_6am <- mean(lambda_data$hour07, na.rm = TRUE)

# Calculate the proportion of hours where lambda exceeds a threshold (e.g., 50)
prop_lambda_above_50 <- mean(lambda_data %>% select(starts_with("hour")) > 50, na.rm = TRUE)

mean_lambda_12am
mean_lambda_6am
prop_lambda_above_50

<div class="alert alert-block alert-success">Confidence Intervals for the Mean</div>
Next, we will construct confidence intervals for the mean system lambda during specific hours. A confidence interval gives a range of plausible values for the true mean system lambda, based on the sample mean and its standard error.

#### Steps:
1. **Calculate the Confidence Interval:** Using the t-distribution, calculate a 95% confidence interval for the mean system lambda at 12am-1am and 6am-7am.
2. **Interpret the Confidence Interval:** Discuss what these intervals mean in the context of the data.
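
For reference, the t-based interval in Step 1 has the standard form below. Here $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the number of non-missing observations, and $t^{*}_{n-1}$ is the critical value of the t-distribution with $n-1$ degrees of freedom (for a 95% interval, the 0.975 quantile).

$$\bar{x} \pm t^{*}_{n-1} \cdot \frac{s}{\sqrt{n}}$$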

In [None]:
# Critical value of the t-distribution
confidence_level <- 0.95  # 95% confidence level
df <- 365 - 1  # Degrees of freedom (n - 1, assuming 365 daily observations)

# Calculate the critical t-value for the given confidence level and df
t_star <- qt((1 + confidence_level) / 2, df)

# Create a sequence of x values to plot the t-distribution
x_values <- seq(-4, 4, length = 1000)

# Calculate the t-distribution (density) for each x value
y_values <- dt(x_values, df)

# Plot the t-distribution
plot(x_values, y_values, type = "l", lwd = 2, col = "blue",
     xlab = "t values", ylab = "Density",
     main = paste("t-Distribution with", df, "df"))

# Shade the area under the curve between -t* and t*
x_shade <- seq(-t_star, t_star, length = 500)
y_shade <- dt(x_shade, df)
polygon(c(-t_star, x_shade, t_star), c(0, y_shade, 0), col = rgb(0.1, 0.9, 0.1, 0.3), border = NA)

# Annotate the upper tail (0.025) and the shaded central area (0.95)
text(t_star, 0.01, "Area = 0.025", pos = 4, col = "red")
text(0, 0.1, "Area = 0.95", col = "black", cex = 1.2)

In [None]:
# Calculate the sample mean and standard deviation for 12am-1am
sample_mean_12am <- mean(lambda_data$hour01, na.rm = TRUE)
sample_sd_12am <- sd(lambda_data$hour01, na.rm = TRUE)
n <- sum(!is.na(lambda_data$hour01))
n
sample_mean_12am
sample_sd_12am

# Calculate the margin of error
error_margin_12am <- qt(0.975, df = n-1) * (sample_sd_12am / sqrt(n))
error_margin_12am

# Calculate the confidence interval
ci_lower_12am <- sample_mean_12am - error_margin_12am
ci_upper_12am <- sample_mean_12am + error_margin_12am

c(ci_lower_12am, ci_upper_12am)
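
As a cross-check, the manual calculation can be compared with R's built-in `t.test()`, which reports the same t-based interval. This is a sketch on simulated values: the vector `x` and its mean and standard deviation are hypothetical stand-ins for `lambda_data$hour01`.

```r
# Hypothetical data standing in for lambda_data$hour01
set.seed(1)
x <- rnorm(365, mean = 30, sd = 8)

# Manual 95% t-interval, following the same steps as the lab code
n_x <- sum(!is.na(x))
margin <- qt(0.975, df = n_x - 1) * sd(x) / sqrt(n_x)
manual_ci <- c(mean(x) - margin, mean(x) + margin)

# Built-in equivalent: t.test() returns the interval in $conf.int
builtin_ci <- t.test(x, conf.level = 0.95)$conf.int

manual_ci
builtin_ci

# The two intervals should agree up to floating-point error
all.equal(manual_ci, as.numeric(builtin_ci))
```

With the real data, replacing `x` by `lambda_data$hour01` should reproduce the interval computed above, since `t.test()` drops missing values by default.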

<div class="alert alert-block alert-success">Confidence Intervals for Proportions</div>
We'll also construct a confidence interval for the proportion of hours where the system lambda exceeds a certain threshold (e.g., 50).

#### Steps:
1. **Calculate the Confidence Interval for Proportions:** Using the standard normal (z) distribution, calculate a 95% confidence interval for the proportion of hours where lambda exceeds 50.
2. **Interpret the Confidence Interval:** Discuss the implications of this confidence interval in the context of the dataset.
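
For reference, the normal-approximation interval in Step 1 has the standard form below. Here $\hat{p}$ is the sample proportion, $z^{*}$ is the standard normal critical value (the 0.975 quantile for a 95% interval), and $n$ is the number of observations used to compute $\hat{p}$.

$$\hat{p} \pm z^{*} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$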

In [None]:
# Calculate the sample proportion
sample_proportion <- prop_lambda_above_50
sample_proportion

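# Count the hourly observations behind the proportion
# (all non-missing hourly values, not the number of days stored in n)
n_hours <- sum(!is.na(lambda_data %>% select(starts_with("hour"))))

# Calculate the margin of error
error_margin_prop <- qnorm(0.975) * sqrt((sample_proportion * (1 - sample_proportion)) / n_hours)
error_margin_prop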

# Calculate the confidence interval
ci_lower_prop <- sample_proportion - error_margin_prop
ci_upper_prop <- sample_proportion + error_margin_prop

c(ci_lower_prop, ci_upper_prop)
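
The interval above is the normal-approximation (Wald) interval. The sketch below compares it with the Wilson score interval that base R's `prop.test()` reports when the continuity correction is turned off. The counts `x_above` and `n_obs` are hypothetical and are not taken from the dataset.

```r
# Hypothetical counts: 1200 of 8760 hourly values above the threshold
x_above <- 1200
n_obs <- 8760
p_hat <- x_above / n_obs

# Wald interval (the formula used in this lab)
z_star <- qnorm(0.975)
wald_ci <- p_hat + c(-1, 1) * z_star * sqrt(p_hat * (1 - p_hat) / n_obs)

# Wilson score interval from prop.test(), without continuity correction
wilson_ci <- as.numeric(prop.test(x_above, n_obs, correct = FALSE)$conf.int)

rbind(wald = wald_ci, wilson = wilson_ci)
```

With a sample this large the two intervals are nearly identical; they tend to differ more when the sample is small or the proportion is close to 0 or 1.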


<div class="alert alert-block alert-success">Sample Size Calculation</div>
We will determine the sample size required to estimate the mean system lambda within a specific margin of error.

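Use the formula for the sample size needed to estimate a mean to within a margin of error $E$, where $Z$ is the standard normal critical value for the chosen confidence level and $\sigma$ is the population standard deviation (in practice, replaced by an estimate such as the sample standard deviation). Round the result up to the next whole number.
   $$n = \left(\frac{Z \cdot \sigma}{E}\right)^2$$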
   
#### Steps:
1. **Determine the Required Sample Size:** Calculate the sample size needed to estimate the mean system lambda with a margin of error of ±2 using a 95% confidence level.
2. **Discuss the Implications:** How does increasing the sample size affect the precision of our estimates?

In [None]:
# Define the desired margin of error
desired_margin_error <- 2

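# Calculate the required sample size for the 12am-1am lambda values,
# using the z critical value from the formula and the sample SD as an estimate of sigma;
# round up because a sample size must be a whole number
required_sample_size_12am <- ceiling((qnorm(0.975) * sample_sd_12am / desired_margin_error)^2)
required_sample_size_12am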
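
To explore how the margin of error drives the required sample size, the formula can be wrapped in a small helper. This is a sketch: the function name `required_n` and the planning value `sigma = 10` are hypothetical choices for illustration, not values from the dataset.

```r
# n = (z * sigma / E)^2, rounded up to the next whole observation
required_n <- function(sigma, margin, conf_level = 0.95) {
  z_star <- qnorm((1 + conf_level) / 2)
  ceiling((z_star * sigma / margin)^2)
}

# sigma = 10 is a made-up planning value, not estimated from data
margins <- c(4, 2, 1, 0.5)
data.frame(margin = margins, n = required_n(sigma = 10, margin = margins))
```

Because the margin of error enters the formula squared, halving it roughly quadruples the required sample size.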

To calculate the average cost (in dollars per megawatt-hour) of generating electricity during specific time periods in 2018 (e.g., 12am-6am and 6am-12pm), first reshape the hourly columns into a single long column, then average the values for the hours in each period.

In [None]:
head(lambda_data)

# Reshape the dataset: Merging hourly columns into a single column
data_long <- lambda_data %>%
  pivot_longer(cols = starts_with("hour"), # Select columns to merge
               names_to = "hour",          # Name for the new column with hour identifiers
               values_to = "lambda")       # Name for the new column with lambda values

# View the reshaped dataset
head(data_long)

In [None]:
# Calculate the average cost from 12am-6am (hour01 to hour06)
avg_cost_12am_6am <- data_long %>%
  filter(hour %in% paste0("hour0",1:6)) %>%
  summarise(avg_lambda_12am_6am = mean(lambda, na.rm = TRUE))

avg_cost_12am_6am
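
The 6am-12pm average follows the same pattern. The sketch below uses a tiny made-up long-format table (`toy_long`) in place of `data_long`, and assumes the columns `hour07` to `hour12` cover 6am-12pm, in line with the `hour01` to `hour06` convention used above.

```r
# Tiny made-up long-format table standing in for data_long (two days, 12 hours each)
toy_long <- data.frame(
  hour   = rep(sprintf("hour%02d", 1:12), times = 2),
  lambda = c(20:31, 22:33)
)

# Assumed naming: hour07..hour12 cover 6am-12pm.
# sprintf() zero-pads correctly; paste0("hour0", 7:12) would produce "hour010".
late_morning <- sprintf("hour%02d", 7:12)

avg_cost_6am_12pm <- mean(toy_long$lambda[toy_long$hour %in% late_morning],
                          na.rm = TRUE)
avg_cost_6am_12pm
```

In the lab's dplyr pipeline, the equivalent filter would be `filter(hour %in% sprintf("hour%02d", 7:12))`.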

### <div class="alert alert-block alert-danger"><b>Practice in Birmingham: Estimating Actuary Salaries in Birmingham</b> </div>

You are interested in becoming an actuary ([source](https://www.bls.gov/ooh/math/actuaries.htm)) and want to estimate the average income of an actuary in the Birmingham area specifically. You want to determine, with 90% confidence, what the average income is and wish to be accurate within $3,000. 

You estimate the standard deviation of actuary salaries to be around $9,000, based on national data.

#### Steps:
**Question 1: Calculate the Required Sample Size**

Using the given confidence level and margin of error, calculate the required sample size for your study.

##### Your answer:

**Question 2: Simulate Data**

Once you have determined the sample size, simulate a dataset of actuary salaries in Birmingham using a mean of your choice (for example, a figure from the source linked above) and the estimated standard deviation of $9,000.


##### Your answer:

**Question 3: Calculate the Point Estimate**

From your simulated data, calculate the point estimate (sample mean) for actuary salaries in Birmingham.

##### Your answer:

**Question 4: Construct a Confidence Interval**

Using the simulated data, construct a 90% confidence interval for the average salary.

##### Your answer:

**Question 5: Interpret the Confidence Interval**

Explain what the confidence interval means in the context of the problem.

##### Your answer: