-
Notifications
You must be signed in to change notification settings - Fork 4
/
05-regression-b.Rmd
353 lines (211 loc) · 19.3 KB
/
05-regression-b.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
# Simple Linear Regression
Simple Linear Regression is an analytical method that looks to model the relationship between an outcome variable and **one** explanatory predictor variables. For example, thinking about the data that we used for the Pearson Correlation analysis in this book, say instead of asking is HeadSize and IQ related, we could ask can you reliably predict IQ scores (our outcome or dependent variable) from HeadSize measurements (our predictor or independent variable), with the hypothesis of, "We predict that a linear model based on head size, as measured in cms, will significantly predict IQ scores as measured on a standard IQ test". We are going to look at this example to walk through some of the analysis but lets first remind ourselves of the data.
```{r we, message=FALSE, warning=FALSE, echo=FALSE}
dat_o <- tibble(Participants = 1:10,
HeadSize = c(50.8, 63.5, 45.7, 25.4, 29.2, 49.5, 38.1, 30.5, 35.6, 58.4),
IQ = c(107, 121, 106, 72, 85, 105, 93, 88, 97, 123))
m_x <- mean(dat_o$HeadSize) %>% round2(2)
m_y <- mean(dat_o$IQ) %>% round2(2)
sd_x <- sd(dat_o$HeadSize) %>% round2(2)
sd_y <- sd(dat_o$IQ) %>% round2(2)
cov_xy <- cov(dat_o$HeadSize, dat_o$IQ) %>% round2(2)
r_xy <- cor(dat_o$HeadSize, dat_o$IQ) %>% round2(3)
slope <- (r_xy * (sd_y/sd_x)) %>% round2(3)
intercept <- (m_y - slope * m_x) %>% round2(2)
HSp <- 60.1
IQp <- (intercept + slope * HSp) %>% round2(2)
HSp2 <- 59.4
IQp2 <- (intercept + slope * HSp2) %>% round2(2)
R2 <- (r_xy^2) %>% round2(3)
mod <- lm(IQ ~ HeadSize, data = dat_o) %>% anova() %>% tidy()
mod_df1 <- mod %>% filter(term == "HeadSize") %>% pull(df)
mod_df2 <- mod %>% filter(term == "Residuals") %>% pull(df)
mod_F <- mod %>% filter(term == "HeadSize") %>% pull(statistic) %>% round2(2)
mod_p <- ifelse(mod %>% filter(term == "HeadSize") %>% pull(p.value) < .001,
"< .001",
paste0("= ", mod %>% filter(term == "HeadSize") %>% pull(p.value) %>% round2(3)))
knitr::kable(dat_o, align = "c")
```
Now in order to determine how well Head Size predicts IQ we need to know the underlying formula for this type of analysis and use that to calculate elements such as the slope of the relationship and the intercept. You might remember from previous years that a line in a two-dimensional coordinate system (i.e. two dimensional space made of variables X and Y) can be described by a linear equation (model) of the form:
$$Y = b_{0} + b_{1}X + error$$
Where:
* $Y$ is the value on the outcome variable Y
* $X$ is the value on the predictor variable X
* $b_{0}$ is the intercept (or regression constant) – pronounced beta-zero - and stated as the value of $Y$ when $X = 0$ when X = 0
* $b_{1}$ is the slope (or regression coefficient) – pronounced beta-one - and stated as the change in $Y$ associated with a one-unit increase in $X$
* $error$ is a measure of your residuals which are the difference between an actual value and a predicted value
Before getting into the full simple linear regression however we will look at making predictions from this linear model formula using the slope and the intercept, based on the following equations.
**the slope:**
$$b_{1} = \frac{cov_{(x, y)}}{s^2_{x}}$$
Or alternatively it can be written as,
$$b_{1} = r_{(x,y)} \times \frac{s_{y}}{s_{x}}$$
**the intercept**
$$b_{0} = \overline{y} - b_{1} \times \overline{x}$$
So lets spend some time looking at those two equations in turn.
## The slope:
As above, the formula for the slope is:
$$b_{1} = \frac{cov_{(x, y)}}{s^2_{x}}$$
which is stated as the covariance of x and y ($cov_{(x,y)}$) divided by the variance of x ($s^2_{x}$). If we translate that into outcome and predictors then, given our Head Size is the predictor ($x$) and the IQ is the outcome ($y$), that becomes:
$$b_{1} = \frac{cov_{(HeadSize, IQ)}}{s^2_{HeadSize}}$$
But this can also be written in terms of the correlation between the two variables as opposed to the covariance, and this would look like:
$$b_{1} = r_{(x,y)} \times \frac{s_{y}}{s_{x}}$$
Which is stated as the correlation between x and y ($r_{(x,y)}$) multiplied by the standard deviation of y ($s_{y}$) divided by the standard deviation of x ($s_{x}$). [Remember that $s$ is the standard deviation of a variable, and $s^2$ is the variance of the variable.]
And again we can translate that back into terms of HeadSize ($x$) and IQ ($y$) as:
$$b_{1} = r \times \frac{s_{IQ}}{s_{HeadSize}}$$
Now if we think back to the correlation chapter we actually do know all the values we need, or at least can calculate them. Here is what we know from the correlation chapter:
* $cov_{(x,y)}$ = `r cov_xy`
* $s_{HeadSize}$ = `r sd_x`
* $s_{IQ}$ = `r sd_y`
* $r_{(x,y)}$ = `r r_xy`
And from that we can calculate the variance of HeadSize ($s^2_{HeadSize}$) and IQ ($s^2_{IQ}$), based on $s^2 = s \times s$, meaning:
* $s^2_{HeadSize}$ = $`r sd_x` \times `r sd_x`$ = `r sd_x^2`
* $s^2_{IQ}$ = $`r sd_y` \times `r sd_y`$ = `r sd_y^2`
So we could really use either formula for the slope as we have all the information, but for now let's use:
$$b_{1} = r \times \frac{s_{IQ}}{s_{HeadSize}}$$
And if we start to feed in the values, remembering that the IQ is ($y$ - the outcome) and HeadSize is ($x$ - the predictor) we get:
$$b_{1} = `r r_xy` \times \frac{`r sd_y`}{`r sd_x`}$$
And then if we sort out the fraction first, that becomes:
$$b_{1} = `r r_xy` \times `r sd_y/sd_x`$$
Which leads down to:
$$b_{1} = `r r_xy * (sd_y/sd_x)`$$
Giving a slope of $b_{1}$ = `r slope`, to three decimal places, meaning that for a 1 unit change in $x$ we get a `r slope` unit change in $y$. Or in other words, for a 1 unit change in HeadSize we get a `r slope` unit change in IQ.
## The intercept
Great. So now we know the slope of the regression line between HeadSize and IQ, what we need next is the intercept. The formula for the intercept, as above, is:
$$b_{0} = \overline{y} - b_{1} \times \overline{x}$$
Which is stated as the mean of y ($\overline{y}$ - in this case IQ, our outcome) minus the slope ($b_{1}$) multiplied by the mean of x ($\overline{x}$ - in this case HeadSize, our predictor).
Looking back at the correlation chapter, and above, we know:
* $b_{1}$ = `r slope`
* $\overline{x}$ = `r m_x`
* $\overline{y}$ = `r m_y`
And if we substitute those values into the formula we get:
$$b_{0} = `r m_y` - `r slope` \times `r m_x`$$
Remembering BODMAS we need to do the multiplication first, which gives us:
$$b_{0} = `r m_y` - `r slope * m_x`$$
And we then complete the formula by doing the subtraction giving us:
$$b_{0} = `r m_y - slope * m_x`$$
Meaning that we have an intercept of $b_{0}$ = `r intercept`, to two decimal places. And if we remember that the intercept is the value of $y$ when $x$ is zero, then this tells us that when $x$ = 0, the value of $y$ is $y$ = `r intercept`. Or stated in terms of HeadSize and IQ, when Headsize is 0, then IQ is `r intercept`.
## Making a Prediction
Now based on the information we have gained above we can actually use that to make predictions about a person we have not measured based on the formula:
$$\hat{Y} = b_{0} + b_{1}X$$
Well technically it is
$$\hat{Y} = b_{0} + b_{1}X + error$$
But we will disregard the error for now. Also you might have noticed the small hat about the $Y$ making it $\hat{Y}$ (pronounced Y-hat). This means we are making a prediction of Y ($\hat{Y}$) as opposed to an actually measured (i.e. observed) value ($Y$).
Now let's say we want to make a prediction about someone who has a HeadSize of 60.1. Then, using the information above, we know:
* the slope, $b_{1}$ = `r slope`
* the intercept, $b_{0}$ = `r intercept`
* and the HeadSize, $X$ = `r HSp`
If we put all that into the formula we would get:
$$\hat{Y} = `r intercept` + `r slope` \times `r HSp`$$
And if we start to work that through, dealing with the multiplication first, we see:
$$\hat{Y} = `r intercept` + `r slope * HSp`$$
Which then becomes:
$$\hat{Y} = `r intercept + slope * HSp`$$
Giving a predicted value of $\hat{Y}$ = `r IQp`, to two decimal places. Meaning that for a participant with a HeadSize of `r HSp` we would predict an IQ of `r IQp`. And you can use the same approach for any number of predictions with this model. Say you measured someone with a HeadSize of `r HSp2` cm, then they would be:
$$\hat{Y} = `r intercept` + `r slope` \times `r HSp2` = `r intercept + slope * HSp2`$$
Meaning that for a participant with a HeadSize of `r HSp2` we would predict an IQ of `r IQp2`.
## The R-squared
Obviously the question now is how good is our model at making these predictions, and one measure of that is $R^2$ (pronounced R-squared). This relates to the $error$ term that we saw in the original formula, i.e. $\hat{Y} = b_{0} + b_{1}X + error$, and as we said at the start it relates to the residuals. The residuals you might know are the difference between predicted values and observed values. For example, if we look at Participant 1 in our data we see they had a HeadSize of `r dat_o %>% slice(1) %>% pull(HeadSize)` and an IQ of `r dat_o %>% slice(1) %>% pull(IQ)`. Those are our observed values of $X$ and $Y$ for that participant, but what if we predicted a value for that participant instead ($\hat{Y}$). Using the above formula we would see:
$$\hat{Y} = `r intercept` + `r slope` \times `r dat_o %>% slice(1) %>% pull(HeadSize)` = `r intercept + slope * (dat_o %>% slice(1) %>% pull(HeadSize))`$$
So we would predict for Participant 1, with a HeadSize of `r dat_o %>% slice(1) %>% pull(HeadSize)`, to have an IQ of `r intercept + slope * (dat_o %>% slice(1) %>% pull(HeadSize))`. And if we compare that to the actually observed value for that participant ($Y = `r dat_o %>% slice(1) %>% pull(IQ)`$) we see a difference of `r abs((intercept + slope * (dat_o %>% slice(1) %>% pull(HeadSize))) - (dat_o %>% slice(1) %>% pull(IQ)))` IQ points (i.e. $\hat{Y} - Y = `r abs((intercept + slope * (dat_o %>% slice(1) %>% pull(HeadSize))) - (dat_o %>% slice(1) %>% pull(IQ)))`$). And that difference is our residuals for that person - the left over part between the observed and the predicted. If we were then to check that for all the observed participants we would get an overall value that summarises how big or small our residuals are, and the smaller the residuals are, the better the model is at predicting values of $Y$ from $X$, because there is little difference between our predicted values and our observed values. And that is what our $R^2$ tells us. It is a measure of how good our model is at prediction, and a measure of how small our residuals are. $R^2$ can take values between 0 and 1 and simply put, the closer $R^2$ is to 1, the smaller our residuals and the better our model is at prediction.
Now there is a full approach to calculating the $R^2$ which we will show you in a little bit, but when you are dealing with simple linear regression as we are here, there is a quick approach that gives the same answer, and it is:
$$R^2 = r_{(x,y)} \times r_{(x,y)}$$
where $r_{(x,y)}$ is the correlation between the two variables, in this case Headsize ($x$) and IQ ($y$). And if we look above we see we already know the correlation between HeadSize and IQ, as we calculated it in the correlation chapter. The correlation between HeadSize and IQ is $r_{(HeadSize, IQ)} = `r r_xy`$. And if we put that into our formula for $R^2$ we see:
$$R^2 = `r r_xy` \times `r r_xy` = `r r_xy * r_xy`$$
Meaning that we have an $R^2$ of $R^2 = `r R2`$, to three decimal places. And given that that is quite close to an $R^2$ of 1, it would suggest that our model is very good at predicting IQ from HeadSize (Y from X), that our residuals are small, and that there is a strong positive relationship between the two variables.
## The Write-up
Ok great, so we have now got a lot of information on our model including the slope, the intercept, and the $R^2$. The one thing we don't actually know is where our model is significant. We will go into what that means and how to determine that in a later part, but in the meantime we will just tell you that the model is significant, (F(`r mod_df1`, `r mod_df2`) = `r mod_F`, p `r mod_p`). So now we have everything we need for the write up:
* The model is significant - F(`r mod_df1`, `r mod_df2`) = `r mod_F`, p `r mod_p`
* The $R^2$ = `r R2` is very high
The only other element we need is the coefficent - the relationship between the two variables - which can take two versions; unstandardised and standardised. But, we actually do already know these. We just called them something else:
* what we call **the slope**, above, is the **unstandardised coefficient** and is written as $b$ = `r slope`
* what we called **the correlation** between the variables is actually the **standardised coefficient** and is written as $\beta$ = `r r_xy`
And it is totally acceptable to write-up either the standardised coefficient ($\beta$) or the unstandardised coefficient ($b$). The main thing is to make sure you use the right symbol with the right value and don't mix them up as that will mislead a reader.
Now if we take all the information we have, including some information from the correlation chapter, we can put it into a blurb as such:
**A team of researchers were interested in the relationship between Head Size and IQ, and specifically whether they could predict IQ from Head Size. The researchers set out the hypothesis that a linear model based on head size, as measured in cms, will significantly predict IQ scores as measured on a standard IQ test. Descriptive analysis suggested a strong positive relationship between Head Size (M = `r m_x`, SD = `r sd_x`) and IQ (M = `r m_y`, SD = `r sd_y`), (r(`r mod_df2`) = `r r_xy`). On analysis, a linear regression model revealed that head size (measured in cm) significantly predicted participant scores on an IQ test ($\beta$ = `r r_xy`, F(`r mod_df1`, `r mod_df2`) = `r mod_F`, p `r mod_p`, $R^2$ = `r R2`), with head size explaining `r R2*100`% of the variance in IQ scores. As such the alternative hypothesis was accepted suggesting that...**
But I guess it might look a bit odd as we give the standardised coeffiencient twice, just in different ways, so we could write it with the unstandardised coefficient as:
**A team of researchers were interested in the relationship between Head Size and IQ, and specifically whether they could predict IQ from Head Size. The researchers set out the hypothesis that a linear model based on head size, as measured in cms, will significantly predict IQ scores as measured on a standard IQ test. Descriptive analysis suggested a strong positive relationship between Head Size (M = `r m_x`, SD = `r sd_x`) and IQ (M = `r m_y`, SD = `r sd_y`), (r(`r mod_df2`) = `r r_xy`), with head size explaining `r R2*100`% of the variance in IQ scores. On analysis, a linear regression model revealed that head size (measured in cm) significantly predicted participant scores on an IQ test ($b$ = `r slope`, F(`r mod_df1`, `r mod_df2`) = `r mod_F`, p `r mod_p`, $R^2$ = `r R2`). As such the alternative hypothesis was accepted suggesting that...**
## Test Yourself
Here we are going to try out a few examples based on the above. First we will use the above numbers we have calculated to make some predictions to check our understanding. So, assuming the values of `r slope` for the slope, and `r intercept` for the intercept, try to answer the following questions:
```{r ty1, message=FALSE, warning=FALSE, echo=FALSE}
dat_o <- tibble(Participants = 1:10,
HeadSize = c(50.8, 63.5, 45.7, 25.4, 29.2, 49.5, 38.1, 30.5, 35.6, 58.4),
IQ = c(107, 121, 106, 72, 85, 105, 93, 88, 97, 123))
m_x <- mean(dat_o$HeadSize) %>% round2(2)
m_y <- mean(dat_o$IQ) %>% round2(2)
sd_x <- sd(dat_o$HeadSize) %>% round2(2)
sd_y <- sd(dat_o$IQ) %>% round2(2)
cov_xy <- cov(dat_o$HeadSize, dat_o$IQ) %>% round2(2)
r_xy <- cor(dat_o$HeadSize, dat_o$IQ) %>% round2(3)
slope <- (r_xy * (sd_y/sd_x)) %>% round2(3)
intercept <- (m_y - slope * m_x) %>% round2(2)
HSp <- 60.1
IQp <- (intercept + slope * HSp) %>% round2(2)
HSp2 <- 59.4
IQp2 <- (intercept + slope * HSp2) %>% round2(2)
HSp3 <- 37.8
IQp3 <- (intercept + slope * HSp3) %>% round2(2)
R2 <- (r_xy^2) %>% round2(3)
mod <- lm(IQ ~ HeadSize, data = dat_o) %>% anova() %>% tidy()
mod_df1 <- mod %>% filter(term == "HeadSize") %>% pull(df)
mod_df2 <- mod %>% filter(term == "Residuals") %>% pull(df)
mod_F <- mod %>% filter(term == "HeadSize") %>% pull(statistic) %>% round2(2)
mod_p <- ifelse(mod %>% filter(term == "HeadSize") %>% pull(p.value) < .001,
"< .001",
paste0("= ", mod %>% filter(term == "HeadSize") %>% pull(p.value) %>% round2(3)))
```
* To two decimal places, what would be the predicted value of IQ if the HeadSize was `r HSp3`? `r longmcq(sample(c(answer = IQp3, IQp2, IQp)))`
`r hide("Show the working")`
If the HeadSize is `r HSp3` and the slope is `r slope` and intercept is `r intercept`, then using the formula:
$$\hat{Y} = b_{0} + b_{1}X + error$$
If we fill in the details:
$$\hat{Y} = `r intercept` + `r slope` \times `r HSp3`$$
And if we start to work that through, dealing with the multiplication first, we see:
$$\hat{Y} = `r intercept` + `r slope * HSp3`$$
Which then becomes:
$$\hat{Y} = `r intercept + slope * HSp3`$$
Giving a predicted value of $\hat{Y}$ = `r IQp3`, to two decimal places.
`r unhide()`
```{r ty2, message=FALSE, warning=FALSE, echo=FALSE}
HSp4 <- 24.2
IQp4 <- (intercept + slope * HSp4) %>% round2(2)
HSp5 <- 52.9
IQp5 <- (intercept + slope * HSp5) %>% round2(2)
HSp6 <- 48.4
IQp6 <- (intercept + slope * HSp6) %>% round2(2)
```
* To two decimal places, what would be the predicted value of IQ if the HeadSize was `r HSp4`? `r longmcq(sample(c(answer = IQp4, IQp2, IQp3)))`
`r hide("Show the working")`
If the HeadSize is `r HSp4` and the slope is `r slope` and intercept is `r intercept`, then using the formula:
$$\hat{Y} = b_{0} + b_{1}X + error$$
If we fill in the details:
$$\hat{Y} = `r intercept` + `r slope` \times `r HSp4`$$
And if we start to work that through, dealing with the multiplication first, we see:
$$\hat{Y} = `r intercept` + `r slope * HSp4`$$
Which then becomes:
$$\hat{Y} = `r intercept + slope * HSp4`$$
Giving a predicted value of $\hat{Y}$ = `r IQp4`, to two decimal places.
`r unhide()`
* To two decimal places, what would be the predicted value of IQ if the HeadSize was `r HSp5`? `r longmcq(sample(c(answer = IQp5, IQp3, IQp2)))`
`r hide("Show the working")`
If the HeadSize is `r HSp5` and the slope is `r slope` and intercept is `r intercept`, then using the formula:
$$\hat{Y} = b_{0} + b_{1}X + error$$
If we fill in the details:
$$\hat{Y} = `r intercept` + `r slope` \times `r HSp5`$$
And if we start to work that through, dealing with the multiplication first, we see:
$$\hat{Y} = `r intercept` + `r slope * HSp5`$$
Which then becomes:
$$\hat{Y} = `r intercept + slope * HSp5`$$
Giving a predicted value of $\hat{Y}$ = `r IQp5`, to two decimal places.
`r unhide()`
* To two decimal places, what would be the predicted value of IQ if the HeadSize was `r HSp6`? `r longmcq(sample(c(answer = IQp6, IQp2, IQp4)))`
`r hide("Show the working")`
If the HeadSize is `r HSp6` and the slope is `r slope` and intercept is `r intercept`, then using the formula:
$$\hat{Y} = b_{0} + b_{1}X + error$$
If we fill in the details:
$$\hat{Y} = `r intercept` + `r slope` \times `r HSp6`$$
And if we start to work that through, dealing with the multiplication first, we see:
$$\hat{Y} = `r intercept` + `r slope * HSp6`$$
Which then becomes:
$$\hat{Y} = `r intercept + slope * HSp6`$$
Giving a predicted value of $\hat{Y}$ = `r IQp6`, to two decimal places.
`r unhide()`