## Example to show least squares minimization

This short example shows that least squares estimation really does minimize the criterion $ \sum \left( Y - \hat{Y} \right)^2 $.

In this example, we will generate some data so that we know what the truth is. After generating the data, we will compute a range of candidate values for the y-intercept and the linear slope. For each combination of y-intercept and slope, we will compute the sum of squares error shown above.
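
Written out for a candidate intercept $b_0$ and slope $b_1$ (symbols introduced here for illustration only), the criterion for a sample of $n$ observations is:

$$
SSE(b_0, b_1) = \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2
$$

For a single predictor, the standard closed-form values that minimize this quantity are:

$$
\hat{b}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}
$$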


### Simulate some data

The following example simulates data based on the following linear regression formula:

$$
Y = 5 + 0.5 X + \epsilon
$$

More explicitly, the simulation lets us specify what the intercept and slope are in the population. These are set below in the `reg_weights` simulation argument, as the intercept (5) followed by the slope (0.5).

In [None]:
library(tidyverse)
library(simglm)

theme_set(theme_bw(base_size = 18))

set.seed(2023)

sim_arguments <- list(
    formula = y ~ x,
    fixed = list(x = list(var_type = 'continuous', mean = 100, sd = 20)),
    error = list(variance = 100),
    sample_size = 1000,
    reg_weights = c(5, .5)
)

sim_data <- simulate_fixed(data = NULL, sim_arguments) |>
  simulate_error(sim_arguments) |>
  generate_response(sim_arguments)

head(sim_data)
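
To make the data-generating model concrete, here is a minimal base R sketch of the same formula. It is not how `simglm` works internally, and it will not reproduce the exact numbers above. It assumes normally distributed errors with variance 100 (standard deviation 10), and the `toy_*` names are hypothetical, chosen so that they do not overwrite `sim_data`.

```r
# Base R sketch of the model Y = 5 + 0.5 X + error (not simglm's internals).
# Assumption: errors are normal with variance 100, i.e. sd = sqrt(100) = 10.
set.seed(2023)

n <- 1000
toy_x <- rnorm(n, mean = 100, sd = 20)            # predictor
toy_error <- rnorm(n, mean = 0, sd = sqrt(100))   # random error
toy_y <- 5 + 0.5 * toy_x + toy_error              # response

toy_data <- data.frame(x = toy_x, y = toy_y)
head(toy_data)
```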

### Visualize the Simulated Data

The following code visualizes the simulated data from above. What would you estimate the correlation to be? 

In [None]:
library(ggformula)

gf_point(y ~ x, data = sim_data, size = 4) |>
  gf_smooth(method = 'lm')

### Estimate Regression Coefficients

Even though we know the truth, the simulation involves random error, so the estimated regression coefficients will not exactly equal the population values specified above. Below, we estimate those regression coefficients.

In [None]:
sim_lm <- lm(y ~ x, data = sim_data)
coef(sim_lm)
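
As a cross-check on what `lm()` returns, the least squares estimates for a single predictor can also be computed by hand: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below uses a small base R stand-in data set with hypothetical `toy_*` names rather than `sim_data`.

```r
# Toy data (hypothetical names; a base R stand-in for the simulated data)
set.seed(2023)
toy_x <- rnorm(200, mean = 100, sd = 20)
toy_y <- 5 + 0.5 * toy_x + rnorm(200, sd = 10)

# Closed-form least squares estimates for one predictor
b1_hat <- cov(toy_x, toy_y) / var(toy_x)
b0_hat <- mean(toy_y) - b1_hat * mean(toy_x)

toy_lm <- lm(toy_y ~ toy_x)

# The two sets of estimates should agree up to floating point error
rbind(by_hand = c(b0_hat, b1_hat), lm = unname(coef(toy_lm)))
```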

### Create different combinations of intercept and slope coefficients

The following code generates a grid of candidate intercept and slope values and appends the estimated regression coefficients as the final row. For each of these intercept and slope combinations, we will compute the sum of squares error shown at the top of the notes, to show that the regression estimates are the ones that minimize it.

In [None]:
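# Candidate values for the intercept and slope
y_intercept <- seq(0, 15, by = .25)
slope <- seq(0, 1.5, by = .01)

# Every intercept/slope combination, with the lm() estimates appended as the last row
conditions <- rbind(
  expand.grid(y_intercept = y_intercept, slope = slope),
  coef(sim_lm)
)

tail(conditions)
dim(conditions)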
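
Because later cells pick rows of the grid by number, it helps to know how `expand.grid()` orders its rows: the first variable changes fastest. The sketch below rebuilds the same grid under hypothetical `toy_*` names (without the appended `lm()` row) and recovers the intercept and slope for row 855, which works out to an intercept of 0 and a slope of 0.14.

```r
# Same candidate values as in the notes, under names that do not clash
toy_intercepts <- seq(0, 15, by = .25)
toy_slopes <- seq(0, 1.5, by = .01)
toy_grid <- expand.grid(y_intercept = toy_intercepts, slope = toy_slopes)

# expand.grid() varies the first variable fastest, so row i maps to:
i <- 855
int_index <- (i - 1) %% length(toy_intercepts) + 1
slope_index <- (i - 1) %/% length(toy_intercepts) + 1

toy_grid[i, ]
c(y_intercept = toy_intercepts[int_index], slope = toy_slopes[slope_index])
```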

### Showing Two Combinations

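Here we visualize two candidate lines from the grid, rows 1 and 855 of `conditions`, drawn as dashed lines alongside the fitted regression line. Both have an intercept of 0; their slopes are 0 and 0.14. Which one seems better for the data?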

In [None]:
gf_point(y ~ x, data = sim_data, size = 4) |>
  gf_smooth(method = 'lm') |>
  gf_abline(slope = ~slope, intercept = ~y_intercept, data = slice(conditions, 1), linetype = 2, size = 2) |>
  gf_abline(slope = ~slope, intercept = ~y_intercept, data = slice(conditions, 855), linetype = 2, color = 'lightgreen', size = 2) |>
  gf_refine(coord_cartesian(xlim = c(0, 160), ylim = c(0, 120)))



### Compute Sum of Squares Error

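The following code creates a function, `sum_square_error()`, that summarizes the squared error for one candidate line. The function takes two arguments: one combination of intercept and slope values, and the simulated data. Despite its name, the function does not return the raw sum of squares error. It returns sigma, the residual standard error $\sqrt{SSE / (n - 2)}$, which can be read as the typical distance of the observations from the line. Because sigma increases whenever the sum of squares error increases, the line that minimizes one also minimizes the other. The first code chunk below performs the computation for a single condition and prints `summary(sim_lm)$sigma` for comparison. The second code chunk does it for all of the conditions.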

In [None]:
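sum_square_error <- function(conditions, sim_data) {
    # Predicted values for this intercept/slope combination
    fitted <- conditions[['y_intercept']] + conditions[['slope']] * sim_data[['x']]

    # Observed minus predicted
    deviation <- sim_data[['y']] - fitted

    # Residual standard error: sqrt(SSE / (n - 2))
    sqrt((sum(deviation^2) / (nrow(sim_data) - 2)))
}

# One candidate line from the grid
sum_square_error(conditions[1892, ], sim_data)
# The appended lm() estimates (last row) should reproduce sigma from lm()
sum_square_error(conditions[nrow(conditions), ], sim_data)
summary(sim_lm)$sigma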
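
The link between the raw sum of squares error and sigma can be checked directly on a fitted model. This is a minimal sketch on a small stand-in data set with hypothetical `toy_*` names. For an `lm` fit, `deviance()` returns the residual sum of squares, and dividing by the residual degrees of freedom (n - 2 here) before taking the square root gives sigma.

```r
# Toy data and fit (hypothetical names; base R stand-in for sim_data)
set.seed(2023)
toy_data <- data.frame(x = rnorm(200, mean = 100, sd = 20))
toy_data$y <- 5 + 0.5 * toy_data$x + rnorm(200, sd = 10)
toy_lm <- lm(y ~ x, data = toy_data)

# Raw sum of squares error at the fitted line
toy_sse <- sum(residuals(toy_lm)^2)

# sigma is the square root of the SSE divided by n - 2
toy_sigma <- sqrt(toy_sse / (nrow(toy_data) - 2))

c(by_hand = toy_sse, deviance = deviance(toy_lm))
c(by_hand = toy_sigma, lm = summary(toy_lm)$sigma)
```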

In [None]:
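library(future)
library(future.apply)

# plan() only affects future-aware functions such as future_lapply();
# base lapply() would still run sequentially.
plan(multicore)

conditions$sse <- unlist(future_lapply(seq_len(nrow(conditions)), function(xx) sum_square_error(conditions[xx, ], sim_data)))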

head(conditions)
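
With an error value attached to every condition, the remaining step is to find the row with the smallest error. The sketch below runs that check on a coarser, self-contained stand-in grid with hypothetical `toy_*` names. Unlike the `sse` column above, it stores the raw sum of squares error rather than sigma; the minimizing row is the same either way. The expectation is that the appended `lm()` row comes out on top.

```r
# Toy data and fit (hypothetical names; base R stand-in for sim_data)
set.seed(2023)
toy_data <- data.frame(x = rnorm(200, mean = 100, sd = 20))
toy_data$y <- 5 + 0.5 * toy_data$x + rnorm(200, sd = 10)
toy_lm <- lm(y ~ x, data = toy_data)

# Coarse candidate grid plus the lm() estimates as the last row
toy_grid <- expand.grid(y_intercept = seq(0, 15, by = 1),
                        slope = seq(0, 1.5, by = .05))
toy_grid <- rbind(toy_grid,
                  data.frame(y_intercept = coef(toy_lm)[[1]],
                             slope = coef(toy_lm)[[2]]))

# Raw sum of squares error for every row
toy_grid$sse <- mapply(
  function(b0, b1) sum((toy_data$y - b0 - b1 * toy_data$x)^2),
  toy_grid$y_intercept, toy_grid$slope
)

# The row with the smallest error should be the appended lm() row
best <- which.min(toy_grid$sse)
toy_grid[best, ]
best == nrow(toy_grid)
```

On the full notebook data, the analogous call would be `conditions[which.min(conditions$sse), ]`.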