-
Notifications
You must be signed in to change notification settings - Fork 3
/
max-lik.qmd
521 lines (401 loc) · 20.2 KB
/
max-lik.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
# Maximum Likelihood {#sec-max-lik}
## Generate some data {#sec-data}
First, let's generate some data for the case of a simple linear regression.
```{r}
### PARAMS
beta0 = 1.5
beta1 = 0.5
sigma = 0.4
n = 30
### GENERATE DATA
set.seed(5)
x = runif(n, -1.5, 1.5)
exp_y = beta0 + beta1*x
y = exp_y + rnorm(n, mean=0, sd=sigma)
# Create a data frame
my_df = data.frame(y = y, x = x)
plot(my_df$y ~ my_df$x, pch = 19,
xlab = "x", ylab = "y")
```
## Calculate a likelihood
Remember that, for simple linear regression, the likelihood of a single data point is as follows:
$$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$$
$$P(y_i | X_i B, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \text{e}^{-\frac{1}{2}\frac{(y_i - X_i B)^2}{\sigma^2}}$$
Then, the full likelihood of the data set $Y$ is computed as:
$$ P(Y | X B, \sigma^2) = \prod^n_{i=1} P(y_i | X_i B, \sigma^2)$$
Or, on the natural logarithmic scale:
$$ ln\left(P(Y | X B, \sigma^2)\right) = \sum^n_{i=1} ln\left(P(y_i | X_i B, \sigma^2)\right)$$
```{r}
# How to calculate the likelihood of a single data point:
dnorm(y[1],
mean = beta0 + beta1*x[1],
sd = sigma,
log = FALSE)
# Calculate the full likelihood of the data, using the product
## Vectorized:
LH_notlog=
prod(
dnorm(y,
mean = beta0 + beta1*x,
sd = sigma,
log = FALSE)
)
# Log-likelihood, vectorized
LH_log =
sum(
dnorm(y,
mean = beta0 + beta1*x,
sd = sigma,
log = TRUE)
)
# Make sure the output makes sense
LH_notlog
# Indeed the log-likelihood is the log of the
# likelihood on the raw probability scale.
LH_log; log(LH_notlog)
```
## Using `optim()` to minimize the negative log-likelihood
As we discussed in lecture, it is more computationally convenient to minimize functions, rather than to maximize. Therefore, to conduct linear regression analysis with maximum likelihood methods, we will find the values of $\hat{B}$ that minimize the negative log-likelihood of the data: $-ln\left(P(Y | X B, \sigma^2)\right)$.
### Function to calculate the negative log-likelihood
First, we need to construct a function that calculates the negative log-likelihood and that specifies the parameters of the model that eventually need to be estimated by the `optim()` function.
```{r}
### NEG LOG-LIK MINIMIZATION
neg_log_lik = function(p, data_df){
beta0=p[1]
beta1=p[2]
sigma=p[3]
mu = beta0 + beta1*data_df$x
nll = -sum(dnorm(data_df$y, mean=mu, sd=sigma, log = TRUE))
return(nll)
}
```
Here you can see the inputs to the function: `p` is a vector of parameters to be estimated (i.e., optimized), and `data_df` is a `data.frame` that holds the values of outcome variable $y$ and associated input variables, in this case, just one $x$.
Then, we can use the `optim()` function, which implements a gradient descent algorithm to estimate the values of the parameters that minimize the provided function, `neg_log_lik()`. We will learn more about gradient descent later, because this is a very important method used widely across machine learning and neural networks.
```{r}
m_nll =
optim(
par = c(0.1,0,0.1),
fn = neg_log_lik,
data_df = my_df,
method = "L-BFGS-B",
lower=c(-5,-5,0.001),
upper=c(5,5,5),
hessian = TRUE
)
par_tab_nll = rbind(m_nll$par)
colnames(par_tab_nll) = c("int", "slope", "sigma")
par_tab_nll
```
Note the `optim()` options. `par` specifies the initial guesses of the three parameters, whereas `lower` and `upper` specify the bounds across which to search for the best values of the parameters. With `hessian = TRUE` we are asking the function to output an estimate of the Hessian matrix of the function, which helps us to estimate the standard errors of the parameter estimates (see [Footnotes @sec-hessian]). The `method` specifies the algorithm used to minimize the function, which in this case is a modified quasi-Newton method, `L-BFGS-B`, which is a type of gradient descent algorithm, to be discussed later.
We can see that the function outputs three point-estimates, which are $\hat{B}$ (i.e., slope and intercept), as well as the residual standard deviation, $\hat{\sigma}$.
### Compare `optim()` results to the OLS output
```{r}
### COMPARE TO LM()
m_ols = lm(y ~ 1 + x, data = my_df)
m_ols_summary = summary(m_ols)
m_ols_summary # Notice p-value
par_tab_ols = c(coef(m_ols), m_ols_summary$sigma)
names(par_tab_ols) = c("int", "slope", "sigma")
par_tab_ols; par_tab_nll
plot(my_df$y ~ my_df$x, pch = 19,
xlab = "x", ylab = "y")
# Line from OLS
abline(coef = coef(m_ols), col = "black", lwd = 2)
# Line from MaxLikelihood
abline(coef = m_nll$par[1:2], col = "red", lty = 2, lwd = 2)
```
As we learned in lecture, the estimates of $\hat{B}$ from least squares and maximum likelihood are equivalent. And indeed, we see the same estimates produced from `lm()` and `optim()`. Also if you look at `Residual standard error` in the `lm()` output, you see the equivalent estimate for $\hat{\sigma}$ compared to the `optim()` output.
## Hypothesis-testing for maximum likelihood
As explained in lecture, we will use the likelihood ratio test to test:
$$H_0: \beta_i = 0$$
$$H_A: \beta_i \ne 0$$
In the least squares framework, we used a $t$-test. But, for maximum likelihood, we are going to base our test on the *likelihood* of a model that does or does not include the slope, similar to the $F$-test we learned before.
For the case of simple linear regression, we're testing whether there is a significant difference between these models:
$$H_0: y_i = \beta_0 + \epsilon_i$$
$$H_A: y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
Notice that in $H_0$, the slope $\beta_1$ is assumed to be zero.
### Likelihood ratio and the $\chi^2$ test
Our goal is to understand if the likelihood of the null model, $P(Y | \beta_0, \sigma^2)$, is sufficiently low compared to the likelihood of the full model, $P(Y | \beta_0, \beta_1, x, \sigma^2)$, that we can reliably reject the null hypothesis.
We therefore construct a ratio of the likelihoods of the full and null model, very similar to the $F$-test framework. The log-likelihood ratio ($LHR$) becomes our test statistic:
$$LHR_{\text{test}} = -2 ln \left(\frac{LH_{\text{null}}}{LH_{\text{full}}} \right)$$
Then, folks smarter than I have done the math to prove that this test statistic is equivalent to a $\chi^2$ test statistic, such that:
$$LHR_{\text{test}} \sim \chi^2_k$$
where $\chi^2_k$ is a $\chi^2$ probability distribution with $k$ degrees of freedom. $k$ is equal to $p_{\text{full}} - p_{\text{null}}$, where $p$ is the number of model coefficients. In the case of simple linear regression, where we are removing just one model coefficient from the full model (i.e., set slope equal to zero), then $k = 2-1 = 1$.
Finally, we can determine $P(\chi^2 > LHR_{\text{test}})$, which gives us our $p$-value. This statistical test is known as the "likelihood ratio test," and it is equivalently referred to as the "$\chi^2$" test, which we'll see in the code below.
### Manual calculation of the likelihood ratio test
To begin, we need to use maximum likelihood to estimate the likelihood of the "null" model. We need to adjust our function that will be used by `optim()` to only include two parameters: the intercept, and the residual standard deviation.
```{r}
# Need a null model:
nll_null = function(p, data_df){
beta0=p[1]
sigma=p[2]
mu = beta0
nll = -sum(dnorm(data_df$y, mean=mu, sd=sigma, log = TRUE))
return(nll)
}
m_nll_null =
optim(
par = c(0.1,0.1),
fn = nll_null,
data_df = my_df,
method = "L-BFGS-B",
lower=c(-5,0.001),
upper=c(5,5)
)
par_tab_nll_null = rbind(m_nll_null$par)
colnames(par_tab_nll_null) = c("int", "sigma")
par_tab_nll_null
plot(my_df$y ~ my_df$x, pch = 19,
xlab = "x", ylab = "y")
abline(coef = m_nll$par[1:2], col = "red", lty = 1)
abline(h = m_nll_null$par[1], col = "red", lty = 2)
```
The flat dashed line represents the null (intercept-only) model. Now, we calculate the likelihood ratio test statistic, and compare to the $\chi^2$ probability distribution to determine our $p$-value of the test. Note that within the `optim()` function's output list, there is a numeric object called `value`. This `value` is the negative log-likelihood of the model with the estimated coefficients. We can use this to calculate our test statistic.
```{r}
# use exp() to convert the negative log likelihood to
# raw probability scale
log_lh_full = -m_nll$value
log_lh_null = -m_nll_null$value
lh_full = exp(log_lh_full)
lh_null = exp(log_lh_null)
# Now calculate LHR
lhr = -2 * log(lh_null / lh_full)
lhr
# Of course, using rules of natural logs, this is equivalent:
-2 * (log_lh_null - log_lh_full)
```
Now that we have our value of $LHR_{\text{test}}$, we use the $\chi^2$-distribution to find $P(\chi^2 > LHR_{\text{test}})$, which is the $p$-value of the test.
```{r}
# How many parameters being "removed" (i.e., set to zero) in test:
df_chi = 2 - 1
# Prob null is true
p_val = 1 - pchisq(lhr, df = df_chi)
p_val
```
Based on this low $p$-value, we would say there is sufficient evidence to reject the null hypothesis and that the slope $\beta_1$ is significantly different than zero.
We can compare this outcome to a built-in `R` function called `drop1()`.
```{r}
drop1(m_ols, test = "Chisq")
```
In the function, we specified `Chisq` test, which implements the $\chi^2$ test using the likelihood ratio. What we see in this summary output is `Pr(>Chi)` which is equivalent to our manually computed value of $P(\chi^2 > LHR_{\text{test}})$. This output from `drop1()` does not provide a whole lot of detail, but if you look at the `help()`, it says that if you specify `test = "Chisq"`, it conducts a likelihood-ratio test. It doesn't specifically output the likelihood ratio, but we can see the $p$-value is equivalent to our manual calculation above.
## Gradient descent algorithm
How does the `optim()` function work? There are various optimization algorithms that can be implemented by `optim()` that are specified by the user in the `method` option. Several of these are gradient-based algorithms, which is a family of algorithms that are very common in engineering problems (e.g., optimal control of drones) and machine learning (e.g., neural networks, reinforcement learning). These algorithms generally minimize an "objective function": $\min_{{x\in {\mathbb R}^{n}}}\;f(x)$.
A gradient descent algorithm is the most common algorithm for minimizing a function. There are many varieties of these algorithms that employ various "tricks" to make the algorithms more efficient (e.g., find the solution in a smaller number of iterations) or more reliable. In lecture we outlined the foundational gradient descent algorithm, upon which many more advanced algorithms are based. For a function $f(x)$, find the value of $x$ that solves the problem: $\min _{{x\in {\mathbb R}^{n}}}\;f(x)$
1. Choose a starting value of $x$
2. Evaluate $\nabla f(x)$. With $h=10^{-4}$:
$$\text{grad} = \frac{f(x+h) - f(x)}{h}$$
3. Set the search direction as $\text{direct} = -\nabla f(x)$
4. Set the step size as a fraction of $\nabla f(x)$: $\text{alpha} = 0.005$
5. Update $x$ for next iteration: $$x = x + \text{alpha}* \text{direct}$$
6. Repeat Steps 2-5 until $\nabla f(x) \approx 0$ (i.e., $||\text{grad}|| \le 10^{-4}$)
Let's implement this algorithm for the maximum likelihood solution of simple linear regression, using the data set above. To make this easier, we're going to assume that we know the intercept and the residual standard deviation perfectly, so we are only trying to estimate the slope. See [Footnotes @sec-grad-multi] for the case in which we estimate all three model parameters simultaneously using gradient descent.
```{r}
# Set up some storage arrays:
slope_guess = NULL
nll_guess = NULL
# initial guess
slope = 0.1
# set h for approx of gradient
h = 1e-4
# set step size
alpha = 0.005
# Set gradient to arbitrary high number
## This makes the while() loop work
grad = 100
# initialize counter
i = 1
# While gradient is not yet \approx zero
while (norm(grad, "2") > 10^-4) {
# Store the current value of slope:
slope_guess[i] = slope
#############
# Approximate the gradient of the nll
#############
## Adjust the slope by a small number, h
slope_adj = slope + h
## Calculate f(slope)
f_slope = neg_log_lik(p = c(beta0, slope, sigma),
data_df = my_df)
## Store the current nll
nll_guess[i] = f_slope
## Calculate f(slope + h)
f_slope_adj = neg_log_lik(p = c(beta0, slope_adj, sigma),
data_df = my_df)
## Calculate gradient
grad = (f_slope_adj - f_slope) / h
# search direction
direct = -grad
# update slope
slope = slope + alpha * direct
# counter for x
i = i + 1
}
# Output the optimal slope and associated nll
n_iter = i-1
grad_descent_tab = cbind(slope_guess[n_iter], nll_guess[n_iter])
colnames(grad_descent_tab) = c("slope", "neg_log_lik")
# Compare to optim() outpu
optim_nll_tab = cbind(m_nll$par[2], m_nll$value)
colnames(optim_nll_tab) = c("slope", "neg_log_lik")
print("Gradient Descent:");grad_descent_tab
print("optim() output:");optim_nll_tab
```
Note that the estimates are not exactly the same, especially because when we used `optim()` we were estimating all three parameters simultaneously: intercept, slope, residual standard error.
Now, let's visualize the gradient descent.
```{r}
plot(nll_guess ~ slope_guess,
pch = 19, type = "b",
ylab = "Neg. Log. Likelihood",
xlab = expression(hat(beta)[1]),
xlim = c(0.1, 0.6), ylim = c(15, 40))
```
## Footnotes
### Hessian matrix {#sec-hessian}
The `optim()` function provides point estimates for the maximum likelihood-derived model coefficients. Just like in least squares regression, however, we want to quantify the uncertainty in these estimates. We therefore want the standard error in the model coefficient estimates.
In the case of least squares, we showed how we can calculate a variance-covariance matrix for the model coefficients, and then the square-root of the diagonal of this matrix equals the standard error. For maximum likelihood we can estimate this same variance-covariance matrix, but it comes from a different matrix called the Hessian. We do not need to go into detail, but the Hessian is the matrix of second derivatives of the likelihood with respect to the parameters (I will not ask you to recall this information). Then the variance-covariance matrix of the estimated model coefficients is calculated as the inverse of the Hessian matrix that corresponds to the negative log-likelihood. If the Hessian matrix of the negative log-likelihood is $H$, then
$$SE(\hat{\beta_i}) = \sqrt{\text{diag}\left( H^{-1}\right)_i}$$
I understand that's complicated, but it's easy enough to extract these values computationally from `optim()` output, assuming you use the option `hessian = TRUE`.
```{r}
# Extract the Hessian from the optim() output
hessian = m_nll$hessian
# Calculate the var-cov matrix from the inverse Hessian
# Remember solve(X) gives X^-1
params_varcov = solve(hessian)
# Then extract the diagonal and take the square root
# This gives a vector of SE(\param_i)
se_params = sqrt(diag(params_varcov))
params_tab = cbind(m_nll$par, se_params)
colnames(params_tab) = c("Estimate", "Std. Error")
rownames(params_tab) = c("Intercept", "slope", "sigma")
params_tab
# Same as OLS?
m_ols_summary$coefficients[c(1:2), 1:2] # Pretty close!
```
We can see that the standard errors for the maximum likelihood estimators are the same as the OLS estimators.
### `optim()` using least squares {#sec-least-sq}
Remember that `optim()` is not specific to maximum likelihood, but rather it implements one of several optional minimization algorithms. Therefore, we can use it to minimize any quantity. To emphasize this point, remember that in least squares regression, we are finding the values of the model coefficients $\hat{B}$ that minimize the sum of squared errors, $\sum_i^n \epsilon_i^2 = \epsilon^T\epsilon$. Let's minimize this quantity using `optim()`.
```{r}
### LEAST SQUARES MINIMIZATION
# We need a function to calculate the sum of squared errors:
least_sq = function(p, data_df){
beta0=p[1]
beta1=p[2]
y = data_df$y
n = length(y)
expected_y = beta0 + beta1*data_df$x
sse = 0
for(i in 1:n){
epsilon_i = y[i] - expected_y[i]
sse = sse + (epsilon_i)^2
}
return(sse)
}
### OPTIMIZE LEAST SQUARES
fit_least_sq =
optim(
par = c(0,0),
fn = least_sq,
data_df = my_df,
method = "L-BFGS-B",
lower=c(-5,-5),
upper=c(5,5),
hessian = TRUE
)
# Create a table of estimates:
par_tab_least_sq = rbind(fit_least_sq$par)
colnames(par_tab_least_sq) = c("int", "slope")
par_tab_least_sq
# Compare to original OLS estimates:
coef(m_ols)
```
You could also use the Hessian output to calculate the standard errors of the model coefficients, but I will leave that up to you.
### Gradient descent, multiple parameters {#sec-grad-multi}
Now let's suppose we want to estimate all of our model parameters (intercept, slope, residual standard deviation) simultaneously using gradient descent. Recall from lecture that we need to estimate three components of the gradient (the partial derivates of the function with respect to each model parameter).
```{r}
# How many params to estimate?
n_param = 3
# Set up some storage arrays:
## Guess that the number of iterations will be 100 or less...
param_guess = array(0, dim=c(100,n_param))
nll_guess = vector("numeric", length = 100)
# initial guesses
param = vector("numeric", length = n_param)
param[1] = 0.0 # intercept
param[2] = 1.0
param[3] = 2.5
# set h for approx of gradient
h = 1e-4
# set step size
alpha = 0.005
# Set gradient components to arbitrary high numbers
## This makes the while() loop work
grad = rep(100, times = n_param)
# initialize counter
i = 1
# While gradient is not yet \approx zero
while (norm(grad, "2") > 10^-4) {
# Store the current value of slope:
param_guess[i, ] = param
#############
# Approximate the gradient of the nll
#############
## Calculate nll with all current params
f_x = neg_log_lik(p = param,
data_df = my_df)
## Store the current nll
nll_guess[i] = f_x
## One param at a time, approximate gradient component
## (i.e., partial derivative)
for(j in 1:n_param){
## Keep all but one params the same
param_adj = NULL
param_adj = param
param_adj[j] = param[j] + h
f_x_adj = neg_log_lik(p = param_adj,
data_df = my_df)
## Calculate gradient component
grad[j] = (f_x_adj - f_x) / h
}
# search direction
## Note that 'direct' is an array of size n_param
direct = -grad
# update params
for(j in 1:n_param){
param[j] = param[j] + alpha * direct[j]
}
# counter for x
i = i + 1
}
# Output the optimal slope and associated nll
n_iter = i-1
grad_descent_tab = cbind(rbind(param_guess[n_iter,]),
nll_guess[n_iter])
colnames(grad_descent_tab) =
c("int", "slope", "sigma", "neg_log_lik")
# Compare to optim() output
optim_nll_tab = cbind(rbind(m_nll$par),
m_nll$value)
colnames(optim_nll_tab) = colnames(grad_descent_tab)
print("Gradient Descent:");grad_descent_tab
print("optim() output:");optim_nll_tab
```
Now, plot the gradient descent:
```{r}
par(mfrow=c(1,3))
plot(nll_guess[1:n_iter]~param_guess[1:n_iter, 1],
pch=19, type = "b",
col = "blue",
ylab = "Neg. Log Likelihood",
xlab = expression("Intercept, "~beta[0]))
plot(nll_guess[1:n_iter]~param_guess[1:n_iter, 2],
pch=19, type = "b",
col = "orange",
ylab = "Neg. Log Likelihood",
xlab = expression("Slope, "~beta[1]))
plot(nll_guess[1:n_iter]~param_guess[1:n_iter, 3],
pch=19, type = "b",
col = "red",
ylab = "Neg. Log Likelihood",
xlab = expression("Residual Std.Dev., "~sigma))
```
We can see for the intercept, our first guess was too low, so we descended the gradient by addition (i.e., search direction was positive), whereas for the slope and the residual standard deviation, our first guess was too large, so we descended the gradients by subtraction (i.e., search direction was negative).