
# Linear Regression for Prediction

Outcome to be predicted: $Y_i$

> *example:* a worker's log wage

Characteristics (aka **features**): $X_i=\left(X_{1i},\ldots,X_{pi}\right)'$

> *example:* education, age, state of birth, parents' education, cognitive ability, family background
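
In symbols, and as a sketch only (the names $\hat{Y}_i$, $\hat{\beta}$, and $b$ are notation assumed here, not taken from the text above), a linear regression predicts the outcome with a linear combination of the features and picks the coefficients by least squares over the $n$ observations in the sample, assuming a constant is included among the features:

$$
\hat{Y}_i = X_i'\hat{\beta}, \qquad \hat{\beta} = \arg\min_{b} \sum_{i=1}^{n} \left(Y_i - X_i'b\right)^2
$$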


In [None]:
# Load packages (uncomment the next line if pacman is not installed)
# install.packages("pacman")
require("pacman")
p_load("tidyverse", "stargazer")

In [None]:
# Read the NLSY97 extract and drop observations with missing education
nlsy <- read_csv('https://raw.githubusercontent.com/ignaciomsarmiento/datasets/main/nlsy97.csv')
nlsy <- nlsy %>% drop_na(educ)
head(nlsy)

In [None]:
# Polynomial terms in education, up to degree 10
nlsy <- nlsy %>% mutate(educ2 = educ^2,
                        educ3 = educ^3,
                        educ4 = educ^4,
                        educ5 = educ^5,
                        educ6 = educ^6,
                        educ7 = educ^7,
                        educ8 = educ^8,
                        educ9 = educ^9,
                        educ10 = educ^10)

In [None]:
reg<- lm(lnw_2016~educ+ educ2 + educ3 + educ4 + educ5 + educ6 + educ7 + 
    educ8 + educ9 + educ10,   data = nlsy)
#reg <- lm(lnw_2016 ~ educ +I(educ^2)  +I(educ^3)  +I(educ^4)  +I(educ^5)  +I(educ^6)  +I(educ^7) 
#  +I(educ^8)  +I(educ^9)  +I(educ^10),   data = nlsy) # alternative that avoids creating the variables
summary(reg)

In [None]:
stargazer(reg,type="text")

In [None]:
nlsy$yhat= predict(reg)

In [None]:
# plot predicted values
summ = nlsy %>%  
  group_by(
    educ, educ2, educ3, educ4, educ5, 
    educ6, educ7, educ8, educ9, educ10
  ) %>%  
  summarize(
    mean_y = mean(lnw_2016),
    yhat_reg = mean(yhat), .groups="drop"
  ) 

ggplot(summ) + 
  geom_point(
    aes(x = educ, y = mean_y),
    color = "blue", size = 2
  ) + 
  geom_line(
    aes(x = educ, y = yhat_reg), 
    color = "green", linewidth = 1.5  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  ) + 
  labs(
    title = "ln Wages by Education in the NLSY",
    x = "Years of Schooling",
    y = "ln Wages"
  ) +
  theme_bw()


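As the plot shows, a linear regression with enough polynomial terms can track a flexible, nonlinear relationship, so least squares can certainly be used for prediction. Include a rich enough set of transformations of the features, and OLS predictions will yield approximately unbiased estimates of the ideal predictor, the conditional expectation function $E[Y_i \mid X_i]$. The price is variance: every added term is another coefficient to estimate from the same data, so these estimates will be quite noisy.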
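
The trade-off between flexibility and noise can be illustrated with a small simulation. The sketch below is not based on the NLSY data: the function `cef`, the variable `x`, the helper `predict_at_x_new`, and all parameter values are hypothetical choices made for illustration. It repeatedly draws a sample, fits a degree-1 and a degree-10 polynomial, and compares how much the prediction at a fixed point varies across samples.

```r
# Hypothetical simulation, not the NLSY data: x plays the role of schooling.
set.seed(123)

n_rep <- 200                  # number of simulated samples
n     <- 200                  # observations per sample
x_new <- data.frame(x = 13)   # point at which we predict

# Assumed "true" conditional expectation function (illustrative only)
cef <- function(x) 1 + 0.1 * x

# Draw one sample, fit a polynomial of the given degree, predict at x_new
predict_at_x_new <- function(degree) {
  x <- sample(8:20, n, replace = TRUE)
  y <- cef(x) + rnorm(n, sd = 1)
  fit <- lm(y ~ poly(x, degree), data = data.frame(x = x, y = y))
  unname(predict(fit, newdata = x_new))
}

pred_deg1  <- replicate(n_rep, predict_at_x_new(1))
pred_deg10 <- replicate(n_rep, predict_at_x_new(10))

# Average prediction across samples versus the assumed truth
c(mean_deg1 = mean(pred_deg1), mean_deg10 = mean(pred_deg10), truth = cef(13))

# Variability of the prediction across samples
c(var_deg1 = var(pred_deg1), var_deg10 = var(pred_deg10))
```

Under these assumptions both specifications are correct on average, since the assumed truth is linear, but the degree-10 fit spends the same data on many more coefficients, so its prediction at the same point should vary more from sample to sample.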

# Example 2

In [None]:
p_load("fabricatr")

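# For reproducibility
set.seed(101010)

# Simulated data: ability is drawn independently of schooling
# and does not enter the wage equation
db1 <- fabricate(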
  N = 100000,
  ability=rnorm(N,mean=.5,sd=2),
  schooling = round(runif(N, 2, 14)),
  logwage =rnorm(N, mean=7+.15*schooling, sd=20)
)
head(db1)

In [None]:
reg1<-lm(logwage~schooling,db1)
reg2<-lm(logwage~schooling+ability,db1)

stargazer(reg1,reg2,type="text")

In [None]:
db1<- db1 %>% mutate(yhat_reg1=predict(reg1),
                     yhat_reg2=predict(reg2))


In [None]:
var(db1$yhat_reg1)
var(db1$yhat_reg2)

In [None]:
# Simulated data: schooling now depends on ability,
# and ability enters the wage equation
db2 <- fabricate(
  N = 100000,
  ability=rnorm(N,mean=.5,sd=2),
  schooling = round(runif(N, 2, 14)),
  schooling = round(ceiling(schooling+1*ability)),
  logwage =rnorm(N, mean=7+.15*schooling+.25*ability, sd=20)
)
head(db2)

In [None]:
reg3<-lm(logwage~schooling,db2)
reg4<-lm(logwage~schooling+ability,db2)
stargazer(reg3,reg4,type="text")

In [None]:
db2$yhat_reg3<-predict(reg3)
db2$yhat_reg4<-predict(reg4)


In [None]:
var(db2$yhat_reg3)
var(db2$yhat_reg4)
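
Reading the two simulation designs above, one way to anticipate the regression tables is the standard omitted-variable formula. As a sketch (the symbols $\beta_s$, $\beta_a$, $s_i$, and $a_i$ are notation assumed here for the schooling coefficient, the ability coefficient, schooling, and ability), the slope on schooling in the regression that leaves out ability converges to:

$$
\beta_s + \beta_a \, \frac{\operatorname{Cov}(s_i, a_i)}{\operatorname{Var}(s_i)}
$$

In `db1`, $\beta_a = 0$ and ability is drawn independently of schooling, so the second term vanishes and `reg1` and `reg2` target the same schooling coefficient, 0.15. In `db2`, $\beta_a = 0.25$ and schooling is constructed from ability, so the covariance is positive and `reg3` should overstate the schooling coefficient relative to `reg4`, up to sampling noise, which is not negligible here because the error standard deviation is 20.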