## Lab 3: R in Probability and Distributions
#### MA 189 Data Dive Into Birmingham (with R)
##### _Blazer Core: City as Classroom_

Course Website: [Github.com/kerenli/statbirmingham/](https://github.com/kerenli/statbirmingham/) (to be published)


#### Levels:
<div class="alert-success"> Concepts and general information</div>
<div class="alert-warning"> Important methods and technique details </div>
<div class="alert-info"> Extended reading </div>
<div class="alert-danger"> (Local) Examples, assignments, and <b>Practice in Birmingham</b> </div>

### <div class="alert alert-block alert-success"> Lab 3: R in Probability and Distributions </div>

In this lab, we will work with probability and distribution concepts using R. We will cover calculating probabilities for different events, explore basic probability distributions, and visualize them using R.

---

### <div class="alert alert-block alert-danger"><b>Example</b>: Alabama Home Values</div>

In April 2024, the average home value in Alabama was \\$228,241, with a standard deviation of \\$20,000. Assume home values are normally distributed. Let’s calculate probabilities and visualize this data.

**Question 1:** What percentage of homes are worth more than \\$250,000?

**Step 1: Calculate the probability of a home being worth more than \\$250,000**


In [2]:
# Define the mean and standard deviation for home values
mean_home_value <- 228241
sd_home_value <- 20000

# Calculate the probability of a home being worth more than $250,000
P_more_than_250k <- 1 - pnorm(250000, mean = mean_home_value, sd = sd_home_value)
P_more_than_250k  # Proportion of homes worth more than $250,000 (multiply by 100 for a percentage)
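
The introduction to this lab mentions visualization, so here is a minimal base-R sketch that draws the assumed normal curve and shades the region above \\$250,000. The names `cutoff`, `p_upper`, and `shade` are illustrative choices, not part of the original lab code, and `lower.tail = FALSE` is simply an alternative to writing `1 - pnorm(...)`.

```r
# Parameters from the example above (mean and SD of Alabama home values)
mean_home_value <- 228241
sd_home_value <- 20000
cutoff <- 250000  # illustrative name for the threshold in Question 1

# Upper-tail probability, equivalent to 1 - pnorm(cutoff, ...)
p_upper <- pnorm(cutoff, mean = mean_home_value, sd = sd_home_value,
                 lower.tail = FALSE)

# Evaluate the normal density over mean +/- 4 standard deviations
x <- seq(mean_home_value - 4 * sd_home_value,
         mean_home_value + 4 * sd_home_value, length.out = 400)
y <- dnorm(x, mean = mean_home_value, sd = sd_home_value)

plot(x, y, type = "l", xlab = "Home value ($)", ylab = "Density",
     main = "Homes worth more than $250,000")

# Shade the area under the curve to the right of the cutoff
shade <- x >= cutoff
polygon(c(cutoff, x[shade], max(x)), c(0, y[shade], 0),
        col = "lightblue", border = NA)
abline(v = cutoff, lty = 2)

round(100 * p_upper, 1)  # the same probability expressed as a percentage
```

The shaded area is the probability computed in Step 1; the total area under the curve is 1.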


**Question 2:** What is the probability of a home value being between \\$200,000 and \\$250,000?

**Step 2: Calculate the probability of a home value being in a certain range**


In [3]:
# Calculate the probability of a home value between $200,000 and $250,000
P_between_200k_250k <- pnorm(250000, mean = mean_home_value, sd = sd_home_value) - pnorm(200000, mean = mean_home_value, sd = sd_home_value)
P_between_200k_250k  # This gives the probability of home value falling in this range
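
The same range probability can be checked by standardizing the endpoints first. The sketch below converts each endpoint to a z-score, $z = (x - \mu)/\sigma$, and uses the standard normal distribution; the variable names `z_low`, `z_high`, `p_via_z`, and `p_direct` are illustrative and not part of the original lab.

```r
# Parameters from the home value example
mean_home_value <- 228241
sd_home_value <- 20000

# Convert both endpoints to z-scores: z = (x - mean) / sd
z_low  <- (200000 - mean_home_value) / sd_home_value
z_high <- (250000 - mean_home_value) / sd_home_value

# P(200,000 < X < 250,000) = Phi(z_high) - Phi(z_low) on the standard normal
p_via_z <- pnorm(z_high) - pnorm(z_low)

# Same quantity computed on the original scale, using diff() on the two CDF values
p_direct <- diff(pnorm(c(200000, 250000),
                       mean = mean_home_value, sd = sd_home_value))

c(z_low = z_low, z_high = z_high)
all.equal(p_via_z, p_direct)  # the two approaches should agree
```

Working with z-scores is handy when you want to compare values from distributions that have different means and standard deviations.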

**Question 3:** Find the minimum home value for the top 10% of homes.

**Step 3: Calculate the value corresponding to the 90th percentile of home values**


In [4]:
# Calculate the 90th percentile value (top 10% of home values)
top_10_percent_value <- qnorm(0.90, mean = mean_home_value, sd = sd_home_value)
top_10_percent_value  # This is the minimum value for the top 10% of homes
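
As a sanity check, `qnorm()` and `pnorm()` are inverses of each other, so feeding the 90th percentile value back into `pnorm()` should return 0.90. The sketch below also shows two equivalent ways to get the same cutoff; the names `cutoff_a`, `cutoff_b`, and `cutoff_c` are illustrative only.

```r
# Parameters from the home value example
mean_home_value <- 228241
sd_home_value <- 20000

# 90th percentile: 90% of homes fall below this value
cutoff_a <- qnorm(0.90, mean = mean_home_value, sd = sd_home_value)

# Equivalent: the value with 10% of homes above it
cutoff_b <- qnorm(0.10, mean = mean_home_value, sd = sd_home_value,
                  lower.tail = FALSE)

# Equivalent by hand: mean + z * sd, where z is the standard normal quantile
cutoff_c <- mean_home_value + qnorm(0.90) * sd_home_value

# Round trip: pnorm() of the cutoff should recover 0.90
pnorm(cutoff_a, mean = mean_home_value, sd = sd_home_value)

c(cutoff_a, cutoff_b, cutoff_c)  # all three should match
```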

---
### <div class="alert alert-block alert-danger"><b>Example</b>: ACT Scores for Incoming Students at UAB</div>

Recall the information from the lecture regarding ACT scores being normally distributed. Assume an average national ACT score of 20.8 with a standard deviation of 5.8.

**Question 4:** A student hoping to improve their chances of a UAB scholarship earns an ACT score of 26.5. What percentile are they in?


In [5]:
# Define the mean and standard deviation for ACT scores
mean_ACT <- 20.8
sd_ACT <- 5.8

# Calculate the percentile for a score of 26.5
percentile_26_5 <- pnorm(26.5, mean = mean_ACT, sd = sd_ACT)
percentile_26_5  # Proportion of scores below 26.5 (multiply by 100 for the percentile rank)

**Question 5:** What percentile would a student earning the average Alabama ACT score of 18 be in?

In [6]:
# Calculate the percentile for a score of 18 (Alabama's average ACT score)
percentile_18 <- pnorm(18, mean = mean_ACT, sd = sd_ACT)
percentile_18  # Proportion of scores below 18 (multiply by 100 for the percentile rank)

**Question 6:** What ACT scores make up the middle 68% of the normal distribution?

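**Step 4: Calculate the range for the middle 68% using 1 standard deviation from the mean (the empirical rule: about 68% of a normal distribution lies within 1 standard deviation of the mean)**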

In [7]:
# Calculate the lower and upper bounds of the middle 68% (within 1 standard deviation)
lower_bound <- mean_ACT - sd_ACT
upper_bound <- mean_ACT + sd_ACT
c(lower_bound, upper_bound)  # This gives the range of ACT scores for the middle 68%
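
The one-standard-deviation range is an approximation based on the empirical rule. The sketch below, with illustrative names `coverage_1sd`, `exact_bounds`, and `rule_bounds`, compares it with the exact middle 68% obtained by cutting 16% from each tail with `qnorm()`.

```r
# Parameters from the ACT example
mean_ACT <- 20.8
sd_ACT <- 5.8

# Exact share of a normal distribution within 1 SD of the mean (about 68.3%)
coverage_1sd <- pnorm(1) - pnorm(-1)

# Exact middle 68%: leave 16% in each tail
exact_bounds <- qnorm(c(0.16, 0.84), mean = mean_ACT, sd = sd_ACT)

# Empirical-rule bounds used in Step 4
rule_bounds <- c(mean_ACT - sd_ACT, mean_ACT + sd_ACT)

coverage_1sd
rbind(exact = exact_bounds, rule = rule_bounds)
```

The two sets of bounds differ only slightly, which is why the empirical rule is a convenient shortcut.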

**Question 7:** A student scores 29 on their ACT. What percentile are they in?

**Question 8:** In April 2024, the average teacher salary in Alabama was \\$53,572 with a standard deviation of \\$10,000. Assuming salaries follow a normal distribution:
- What percentage of teachers earn more than \\$60,000?
- What is the probability that a teacher’s salary is between \\$50,000 and \\$60,000?

*Hints:*
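- Use `pnorm()` to calculate probabilities and percentile ranks (the proportion of values below a given value) for normal distributions.
- Use `qnorm()` to go the other way: find the value that corresponds to a given percentile.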

---

### <div class="alert alert-block alert-danger"><b>Practice in Birmingham</b></div>

Consider the data set provided by Alabama Power Company. Alabama Power’s incremental cost of generating electricity is monitored using the **system marginal cost**, also known as **system lambda**. Lambda represents the incremental cost of generating one more unit (megawatt-hour) of electricity. In a typical year (non-leap year), there are **8760 hours**; thus, this industry often refers to the “**8760 lambdas**.” As a general rule, generating units that run to help meet peak energy usage on the system are incrementally more expensive to run than baseload plants (those plants that run in both peak and off-peak times).

**Data**: Use the **Lambdas and RTP Customer Loads** Excel data set for analysis.

#### Questions:
**Question 1**: Calculate what the average cost was (in dollars per megawatt-hour) for generating electricity from **12am-1am** in the year 2020. What about from **6am-7am** in 2020? What conclusion might you make from this comparison?

*Hints:*
- Use the lambda values in the Excel data set for the given times.
- Use R functions such as `mean()` to calculate the averages for the specific time ranges.
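
The hints above can be sketched on a small stand-in data frame. Everything in `toy_lambda` is made up for illustration; only the column names (`date_val`, `hour01`, `hour07`) follow the layout of the real data set. The sketch also assumes that `hour01` covers 12am-1am and `hour07` covers 6am-7am, which you should confirm against the spreadsheet's documentation.

```r
# Toy stand-in for lambda_data (values are invented, NOT Alabama Power data)
toy_lambda <- data.frame(
  date_val = as.Date(c("2019-12-31", "2020-01-01", "2020-01-02", "2020-01-03")),
  hour01   = c(18, 20, 22, 24),  # assumed to mean 12am-1am
  hour07   = c(25, 30, 32, 34)   # assumed to mean 6am-7am
)

# Keep only the rows dated in 2020
in_2020 <- format(toy_lambda$date_val, "%Y") == "2020"
toy_2020 <- toy_lambda[in_2020, ]

# Average lambda for each hour of interest
mean_hour01 <- mean(toy_2020$hour01, na.rm = TRUE)
mean_hour07 <- mean(toy_2020$hour07, na.rm = TRUE)

c(hour01 = mean_hour01, hour07 = mean_hour07)
```

With the real data, the same pattern applies after replacing `toy_lambda` with the imported data frame.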

**Question 2**: Calculate the **average lambda value** for all of 2020. Then, determine the **quartiles** for the lambda data in 2020.

*Hints:*
- Use `summary()` to get the summary statistics, including the quartiles of the data.
- Visualize the quartiles using box plots to understand the distribution of the lambda values.
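
One possible way to follow these hints is to flatten all hourly columns into a single numeric vector before summarizing. The sketch below uses an invented `toy_2020` data frame (not real lambda values) with the same `hourNN` naming pattern; `hour_cols` and `all_lambdas` are illustrative names.

```r
# Toy stand-in for the 2020 rows of lambda_data (values are invented)
toy_2020 <- data.frame(
  date_val = as.Date(c("2020-01-01", "2020-01-02")),
  hour01   = c(20, 22),
  hour02   = c(19, 21),
  hour03   = c(25, 40)
)

# Select the hourly columns by name, then flatten them into one vector
hour_cols <- grep("^hour", names(toy_2020), value = TRUE)
all_lambdas <- unlist(toy_2020[hour_cols], use.names = FALSE)

# Overall average and quartiles
mean(all_lambdas, na.rm = TRUE)
summary(all_lambdas)
quantile(all_lambdas, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)

# Box plot of the pooled lambda values
boxplot(all_lambdas, horizontal = TRUE, xlab = "Lambda ($/MWh)",
        main = "Distribution of hourly lambdas (toy data)")
```

Note that 2020 was a leap year, so the full data set should contain 366 days, or 8784 hourly values rather than 8760, if no rows are missing.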

---
### Example R Code for Data Import and Analysis:

In [10]:
# Install and load the readxl package if not already installed
if (!require(readxl)) {
  install.packages("readxl")
  library(readxl)
}

In [14]:
# Load the data from the file
lambda_data <- read_excel("data/Lambdas and RTP Customer Loads.xlsx", skip = 2)

In [15]:
lambda_data

| date_val | timezone | hour01 | hour02 | hour03 | hour04 | hour05 | hour06 | hour07 | hour08 | ⋯ | hour15 | hour16 | hour17 | hour18 | hour19 | hour20 | hour21 | hour22 | hour23 | hour24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| &lt;dttm&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | ⋯ | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| 2018-01-01 | CPT | 26.935 | 28.190 | 33.558 | 34.005 | 36.075 | 35.430 | 41.828 | 35.3470 | ⋯ | 36.296 | 33.8530 | 39.0320 | 205.5950 | 294.030 | 294.030 | 294.0300 | 294.0300 | 203.640 | 203.640 |
| 2018-01-02 | CPT | 203.640 | 203.640 | 203.640 | 203.640 | 294.030 | 294.030 | 339.740 | 339.7400 | ⋯ | 37.798 | 34.4520 | 294.0300 | 294.0300 | 294.030 | 294.030 | 294.0300 | 339.7400 | 339.740 | 339.740 |
| 2018-01-03 | CPT | 339.740 | 255.060 | 255.060 | 255.060 | 294.030 | 294.030 | 294.030 | 294.0300 | ⋯ | 36.962 | 38.4570 | 34.4070 | 228.9300 | 228.930 | 228.930 | 228.9300 | 228.9300 | 228.930 | 37.042 |
| 2018-01-04 | CPT | 33.399 | 36.591 | 41.662 | 313.310 | 329.260 | 329.260 | 329.260 | 380.3357 | ⋯ | 37.872 | 41.0610 | 331.5950 | 419.8940 | 424.110 | 454.860 | 454.8600 | 529.5180 | 519.966 | 454.860 |
| 2018-01-05 | CPT | 520.001 | 454.860 | 454.860 | 520.057 | 530.605 | 530.729 | 530.729 | 454.8600 | ⋯ | 33.930 | 34.3460 | 46.2920 | 168.1890 | 41.057 | 43.341 | 46.6370 | 418.6100 | 418.610 | 203.980 |
| 2018-01-06 | CPT | 203.980 | 203.980 | 203.980 | 208.710 | 418.610 | 418.610 | 418.610 | 418.6100 | ⋯ | 36.791 | 252.9413 | 159.0917 | 159.6169 | 205.650 | 205.650 | 205.6500 | 205.6500 | 205.650 | 244.040 |
| 2018-01-07 | CPT | 244.040 | 244.040 | 201.930 | 257.140 | 201.930 | 201.930 | 201.930 | 350.7882 | ⋯ | 37.583 | 39.3140 | 365.3110 | 45.6890 | 201.930 | 201.930 | 176.6292 | 166.0164 | 33.969 | 32.644 |
| 2018-01-08 | CPT | 27.660 | 29.015 | 28.143 | 28.739 | 32.343 | 48.091 | 204.220 | 205.6500 | ⋯ | 33.140 | 32.8330 | 31.5260 | 33.4950 | 33.601 | 31.680 | 29.9410 | 27.9280 | 25.998 | 21.320 |
| 2018-01-09 | CPT | 21.029 | 19.629 | 20.035 | 20.111 | 22.414 | 27.702 | 31.594 | 27.4020 | ⋯ | 19.993 | 19.9950 | 20.1380 | 20.1780 | 20.182 | 20.058 | 19.7810 | 19.1870 | 18.936 | 19.211 |
| 2018-01-10 | CPT | 18.536 | 18.248 | 18.816 | 19.403 | 19.766 | 22.206 | 24.995 | 20.8340 | ⋯ | 19.887 | 19.9040 | 20.3640 | 23.0510 | 20.847 | 20.489 | 21.0850 | 20.0220 | 19.777 | 19.859 |
