### HSML 6295 Session 2 (Regression) - ANSWERS

#### I. Load the Data Set "HSML 6295 s2 Data Set Wealth and Health.csv"

Load the data set "Wealth and Health.csv" by dragging and dropping it into the canvas.


In [None]:
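# Read the data; the first column holds the country names
d = read.csv(file = "HSML 6295 s2 Data Set Wealth and Health.csv", header=TRUE, sep=",")
# head(d)
names(d)
# Use the country names as row names and drop that column
le = d[,-1]
rownames(le) = d[,1]
# Shorten the variable names
colnames(le)[colnames(le) == 'Life.Expectancy.at.Birth..Years.'] = 'Life.Expectancy'
colnames(le)[colnames(le) == 'Health.Spending.per.Capita...000.US..'] = 'Health.Spending'
colnames(le)[colnames(le) == 'GDP.per.Capita...000.US..'] = 'GDP'
colnames(le)[colnames(le) == 'Countries.A.I'] = 'AI'
# GDP per capita in US$ (GDP itself is measured in '000 US$)
le$GDPT = 1000*le$GDP
# head(le)
# Make the columns available by name (GDP, Life.Expectancy, ...)
attach(le)
le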


#### II. Measures of Dispersion

The *coefficient of variation* (CV) is a measure of relative variability. It is the ratio of the standard deviation to the mean.
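As a minimal sketch of the definition, the CV can be computed on a small made-up vector. The data and the helper name `cv` are hypothetical and are not part of the course data set.

```r
# Hypothetical toy data, not taken from the course data set
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Coefficient of variation: sample standard deviation divided by the mean
cv <- function(v) sd(v) / mean(v)

cv(x)

# The CV is unit-free: rescaling the data (e.g. from '000 US$ to US$)
# multiplies both sd and mean by the same factor, so the ratio is unchanged
all.equal(cv(x), cv(1000 * x))
```

Because the CV is unit-free, it makes no difference whether `GDP per capita` is measured in dollars or in thousands of dollars.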

1. Go to the `Summary Statistics` tab and compute the coefficient of variation for `GDP per capita` and `Life Expectancy`.

The CV of `GDP per capita` is 


In [None]:
round(sd(GDP)/mean(GDP), 3)




The CV of `Life Expectancy` is


In [None]:
round(sd(Life.Expectancy)/mean(Life.Expectancy), 3)



The *range rule of thumb* says that the range is approximately four times the standard deviation. 
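The ratio that the rule of thumb refers to can be sketched on a small made-up sample. The data and the helper name `range_to_sd` are hypothetical; how close the ratio comes to 4 depends on the sample size and the shape of the distribution.

```r
# Hypothetical toy data, not taken from the course data set
x <- c(61, 64, 66, 67, 68, 69, 70, 70, 71, 72, 73, 74, 76, 79)

# Range (max minus min) expressed in units of the sample standard deviation
range_to_sd <- function(v) (max(v) - min(v)) / sd(v)

range_to_sd(x)

# diff(range(v)) is an equivalent way to write max(v) - min(v)
all.equal(range_to_sd(x), diff(range(x)) / sd(x))
```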

2. What is the range of `Life Expectancy` divided by its standard deviation? What is the range of `GDP per capita` divided by its standard deviation? Which ratio is larger?

The range of `Life Expectancy` divided by its standard deviation is


In [None]:
round((max(Life.Expectancy)-min(Life.Expectancy))/sd(Life.Expectancy),2)




The range of `GDP per capita` divided by its standard deviation is


In [None]:
round((max(GDP)-min(GDP))/sd(GDP),2)



#### III. Outliers

*Outliers* are data points that "break away" from the majority of the data points in the scatter. One way to spot outliers is to measure how many standard deviations a data point is away from the mean.
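One way to apply this idea is to convert each value to a *z-score*, i.e. its distance from the mean measured in standard deviations, and flag values whose absolute z-score is at least 2. The sketch below uses made-up values with hypothetical labels `A` to `H`.

```r
# Hypothetical toy data with one value far from the rest
x <- c(A = 10, B = 11, C = 12, D = 11, E = 10, F = 12, G = 11, H = 30)

# z-score: number of standard deviations each value lies from the mean
z <- (x - mean(x)) / sd(x)
round(z, 2)

# Labels of the values that lie at least 2 standard deviations from the mean
names(x)[abs(z) >= 2]
```

Working with z-scores checks both tails at once, instead of comparing against the lower and upper bounds separately.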

3. Which countries can you find whose `Life Expectancy` is at least 2 standard deviations away from the mean? Which countries can you find whose `GDP per capita` is at least 2 standard deviations away from the mean?

We are looking for countries whose `Life Expectancy` falls outside the interval


In [None]:
round(mean(Life.Expectancy) - 2*sd(Life.Expectancy),2)



In [None]:
round(mean(Life.Expectancy) + 2*sd(Life.Expectancy),2)



These countries include 



In [None]:
rownames(le)[Life.Expectancy < round(mean(Life.Expectancy) - 2*sd(Life.Expectancy),2)]




and


In [None]:
rownames(le)[Life.Expectancy > round(mean(Life.Expectancy) + 2*sd(Life.Expectancy),2)]




Similarly, we are looking for countries whose `GDP per capita` falls outside the interval


In [None]:
round(mean(GDPT) - 2*sd(GDPT),0)



In [None]:
round(mean(GDPT) + 2*sd(GDPT),0)



These countries include 



In [None]:
# Use GDPT (US$) so the comparison matches the interval displayed above
rownames(le)[GDPT < round(mean(GDPT) - 2*sd(GDPT),0)]



In [None]:
# Use GDPT (US$) so the comparison matches the interval displayed above
rownames(le)[GDPT > round(mean(GDPT) + 2*sd(GDPT),0)]



By construction the *median* is a measure of central tendency that is more robust to outliers than the *mean.*
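A small made-up example illustrates the difference in robustness. The values and labels are hypothetical; the point is only that dropping one extreme value moves the mean much more than the median.

```r
# Hypothetical toy data with one extreme value (H)
x <- c(A = 10, B = 11, C = 12, D = 11, E = 10, F = 12, G = 11, H = 30)
keep <- names(x) != "H"

# Change in the mean and in the median when the extreme value is dropped
mean_shift   <- abs(mean(x)   - mean(x[keep]))
median_shift <- abs(median(x) - median(x[keep]))

c(mean_shift = mean_shift, median_shift = median_shift)
```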

4. Remove the country whose `GDP per capita` is at least 2 standard deviations from the mean. You can remove a data point from the sample by right-clicking on it. Do the mean and median values increase or decrease? Does the mean change more or less than the median when you remove outlier countries from the sample?

Mean `GDP per capita` decreases from 


In [None]:
round(mean(GDPT),0)




to 


In [None]:
round(mean(GDPT[-20]),0)




, i.e. by 


In [None]:
round(abs(mean(GDPT)-mean(GDPT[-20])),0)




By comparison, median `GDP per capita` decreases from 


In [None]:
round(median(GDPT),0)




to 


In [None]:
round(median(GDPT[-20]),0)




, i.e. by 


In [None]:
round(abs(median(GDPT)-median(GDPT[-20])),0)
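The cells above drop the outlier by its row position (`[-20]`), which silently removes a different country if the row order ever changes. A sketch of dropping an observation by row name instead is shown below; the toy data frame and its row names are hypothetical and only assume that countries are stored as row names, as in `le`.

```r
# Hypothetical toy data frame with country names as row names
toy <- data.frame(GDP = c(30, 35, 90, 40),
                  row.names = c("A", "B", "Luxembourg", "C"))

# Logical index that keeps every row except the named one
keep <- rownames(toy) != "Luxembourg"
mean(toy$GDP[keep])

# The positional index can also be looked up rather than hard-coded
idx <- which(rownames(toy) == "Luxembourg")
identical(toy$GDP[keep], toy$GDP[-idx])
```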



#### IV. Correlation Coefficient

The `Correlation` tab shows you the *matrix of correlation coefficients* between all variables in your model. A correlation between two variables is always a number between -1 and 1. The closer the absolute value of the correlation is to 1, the stronger the linear relationship between the two variables. A positive (negative) correlation means that an increase in the value of one variable is typically matched by an increase (a decrease) in the value of the other variable. 

In the graph, positively correlated data display an upward trend: they run from the lower left to the upper right. Negatively correlated data display a downward trend: they run from the upper left to the lower right. Data that are perfectly correlated lie on a straight line.
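The two extreme cases can be checked on made-up data: points that lie exactly on an upward-sloping line have correlation 1, and points on a downward-sloping line have correlation -1. The vectors below are hypothetical.

```r
# Hypothetical toy data
x <- 1:10

# Points exactly on an upward-sloping line
cor_up <- cor(x, 3 + 2 * x)

# Points exactly on a downward-sloping line
cor_down <- cor(x, 3 - 2 * x)

c(up = cor_up, down = cor_down)

# Correlation is unaffected by rescaling a variable (e.g. '000 US$ to US$)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)
all.equal(cor(x, y), cor(1000 * x, y))
```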


In [None]:
plot(GDP, Life.Expectancy)
with(le, text(Life.Expectancy ~ GDP, labels = row.names(le), pos = 1))


5. In the "Wealth and Health" data set, what is the correlation between `Life Expectancy` and `GDP per capita`? Is it closer to 1 in absolute value or 0?

The correlation between `GDP per capita` and `Life Expectancy` is 


In [None]:
round(cor(GDP,Life.Expectancy), 2)



6. How does the correlation between `Life Expectancy` and `GDP per capita` change when you delete the data point representing Luxembourg?

When Luxembourg is omitted from the calculation, the correlation between `GDP per capita` and `Life Expectancy` is 


In [None]:
round(cor(GDP[-20],Life.Expectancy[-20]), 2)



Undo the deletion of the data point representing `Luxembourg`.

#### V. Intercept and Slope

Now go to the `Equation` tab. If `Show Linear Fit` is checked in the `Graph Options`, the figure shows a solid red line. Broadly, this line is chosen to be as close as possible to all the data points in the sample. Specifically, this line minimizes the average squared difference between the observed response (represented by a data point) and the predicted (or "fitted") response, which lies on the red line.

A line is always completely defined by two parameters: 

* the value of the response at which the fitted line crosses ("intercepts") the vertical axis (i.e. where the value of the predictor is 0). This value is commonly called the *intercept* (or *constant term*).
* the increase in the response for each one-unit increase in the predictor, commonly called the *slope*.
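For a simple linear regression, the least-squares slope and intercept have closed forms: the slope is the covariance of predictor and response divided by the variance of the predictor, and the intercept is the mean response minus the slope times the mean predictor. The sketch below checks this against `lm()` on made-up data; the vectors and the name `fit` are hypothetical.

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)
coef(fit)

all.equal(unname(coef(fit)), c(intercept, slope))
```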

7. What are the values of the intercept and slope in the simple linear regression of `Life Expectancy` on `GDP per capita`?


In [None]:
m = lm(Life.Expectancy ~ GDP, data = le)




The values of the intercept and slope are 


In [None]:
round(coef(summary(m))["(Intercept)", "Estimate"], 3)



years and 


In [None]:
round(coef(summary(m))["GDP", "Estimate"], 3)



years/\$1,000, respectively.

[Load the R package "stargazer" for displaying the regression results:]


In [None]:
library(stargazer)



In [None]:
stargazer(m,
          type="text", 
          dep.var.labels=c("Life Expectancy at Birth (Years)"), 
          covariate.labels=c("Constant", "GDP per Capita ('000 US$)"),
          report = "vcsp",
          intercept.bottom = FALSE,
          df = FALSE)


8. Interpret the value of the intercept. If the simple linear regression model were true, under what circumstances would we expect to observe a country whose life expectancy at birth matched the intercept? Given the shape of the scatter of data points, do you think this is a realistic assumption?

9. Now consider the United States, whose `Life Expectancy` was 78.7 years and whose `GDP per capita` was \$49,782. Given the simple linear regression of `Life Expectancy` on `GDP per capita` that you just estimated, by how many years would you expect `Life Expectancy` in the United States to change if its `GDP per capita` increased by \$5,000?

`Life Expectancy` would increase by 


In [None]:
# GDP is measured in '000 US$, so a $5,000 increase corresponds to 5 units
round(5*coef(summary(m))["GDP", "Estimate"], 3)



10. Perform the same thought experiment for Mexico, the country with the lowest `Life Expectancy` at 74.2 years and the lowest `GDP per capita` at \$17,125 in this sample. Given the simple linear regression of `Life Expectancy` on `GDP per capita` that you just estimated, by how many years would you expect `Life Expectancy` in Mexico to change if its `GDP per capita` increased by \$5,000? Would you say the simple linear regression underpredicts or overpredicts the likely gains in `Life Expectancy` for Mexico?

The simple linear regression predicts that `Life Expectancy` would increase by 


In [None]:
# GDP is measured in '000 US$, so a $5,000 increase corresponds to 5 units
round(5*coef(summary(m))["GDP", "Estimate"], 3)



years, the same gain in years predicted for the United States.


11. Now delete the data point representing `Luxembourg`. What are the new estimates of the intercept and slope parameters? How do they differ from the estimates you obtained for the full sample?


In [None]:
m = lm(Life.Expectancy ~ GDP, data = le[-20,])




The new estimates of the intercept and slope parameters are 


In [None]:
round(coef(summary(m))["(Intercept)", "Estimate"], 3)




and


In [None]:
round(coef(summary(m))["GDP", "Estimate"], 3)



Undo the deletion of the data point representing `Luxembourg`.

#### VI. Prediction


In [None]:
m = lm(Life.Expectancy ~ GDP, data = le)




12. What is the predicted value of `Life Expectancy` at the mean value of `GDP per capita`,


In [None]:
 round(mean(GDPT),0)



How does the predicted value compare to the mean value of `Life Expectancy` in the sample?

The predicted value of `Life Expectancy` at the mean value of `GDP per capita` is 


In [None]:
 coef(summary(m))["(Intercept)", "Estimate"] + coef(summary(m))["GDP", "Estimate"] * mean(GDP)



Note that this equals the mean value of `Life Expectancy` in the sample: a least-squares line fitted with an intercept always passes through the point of means.
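The equality between the prediction at the mean of the predictor and the mean of the response can be verified on made-up data. The vectors and the name `fit` below are hypothetical.

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

# Prediction at the sample mean of the predictor
pred_at_mean <- predict(fit, newdata = data.frame(x = mean(x)))

# It coincides with the sample mean of the response
all.equal(unname(pred_at_mean), mean(y))
```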

13. Now consider Argentina, whose variable values in 2011 were:


Variable | Value
--- | ---
`Life Expectancy at Birth` | 75.7 years
`Health spending per Capita` | \$1,227
`GDP per Capita` | \$19,629


Given the simple linear regression of `Life Expectancy` on `GDP per capita`, what is the predicted value of `Life Expectancy` for Argentina? What is the (absolute value of the) prediction error?

The predicted value of `Life Expectancy` for Argentina is 


In [None]:
round(coef(summary(m))["(Intercept)", "Estimate"] + coef(summary(m))["GDP", "Estimate"] * 19.629, 2)




The (absolute value of the) prediction error is 


In [None]:
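# Parenthesize the fitted value so that the whole prediction is subtracted from the observed 75.7
round(abs(75.7 - (coef(summary(m))["(Intercept)", "Estimate"] + coef(summary(m))["GDP", "Estimate"] * 19.629)), 2)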



#### VII. R-Squared

The R-squared statistic is the proportion of *total sum of squares* that is accounted for by the *explained sum of squares*, i.e. the sum of squared differences between the predicted values of the response and their mean:
$$R^2 = \frac{\Sigma(\hat{y}_i - \bar{y})^2}{\Sigma(y_i - \bar{y})^2} = \frac{\mathrm{Var}[\hat{y}]}{\mathrm{Var}[y]}$$
where $\bar{y} = \bar{\hat{y}}$, i.e. the mean of the observed values of the response is the same as the mean of the predicted values of the response, whenever we include an intercept in the model (see Knowledge Check 12 above).

The $R^2$ statistic is often used to gauge a model’s “goodness of fit”: it measures how well the fitted line captures the scatter of data points.
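The different expressions for $R^2$ can be compared on made-up data. In a simple linear regression with an intercept, the ratio of sums of squares, the ratio of variances, the squared correlation between predictor and response, and the value reported by `summary()` all coincide. The vectors and variable names below are hypothetical.

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.3, 3.1, 5.2, 5.8, 8.1, 8.4)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

# Explained sum of squares over total sum of squares
r2_ss  <- sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)
# Variance of the fitted values over variance of the observed values
r2_var <- var(y_hat) / var(y)
# Squared correlation between predictor and response
r2_cor <- cor(x, y)^2
# Value reported by summary()
r2_lm  <- summary(fit)$r.squared

c(r2_ss, r2_var, r2_cor, r2_lm)
```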

14. What is the correlation coefficient between `Health Spending per capita` and `GDP per capita`? What is its square?

The correlation coefficient between `Health Spending per capita` and `GDP per capita` is 


In [None]:
 round(cor(Health.Spending,GDP),3)




and its square is 


In [None]:
 round(cor(Health.Spending,GDP)^2,3)



15. Regress `Health Spending per capita` on `GDP per capita`. What is the $R^2$? How does the $R^2$ compare to the squared correlation coefficient between `Health Spending per capita` and `GDP per capita`?



In [None]:
m = lm(Health.Spending ~ GDP, data = le)
prediction = predict(m, le)



The $R^2$ is 


In [None]:
 round(summary(m)$r.squared, 3)



16. Now save the data set as "Health and Wealth FITTED.csv" to your computer. This data set contains two additional variables: 

    * `Prediction`: The predicted values of `Health Spending`
    * `Residual`: The residuals of the regression of `Health Spending` on `GDP per capita`
  
    Load this data set by dragging and dropping it into the canvas. What is the variance of `Prediction`, i.e. the predicted values of `Health Spending`? What is the variance of the observed values of `Health Spending`? What is their ratio?
    
The variance of `Prediction` is 


In [None]:
 round(var(prediction),3)




The variance of `Health Spending` is 


In [None]:
 round(var(Health.Spending),3)




The variance of `Prediction` divided by the variance of `Health Spending` is


In [None]:
 round(var(prediction)/var(Health.Spending),3)



17. What is the correlation between the observed and predicted values of `Health Spending`? What is its square?

The correlation between the observed and predicted values of `Health Spending` is 


In [None]:
 round(cor(prediction,Health.Spending),3)




and its square is 


In [None]:
 round(cor(prediction,Health.Spending)^2,3)



18. Regress the *predicted* values of `Health Spending`, i.e. the variable `Prediction`, on `GDP per capita`. Note how all the fitted values lie on the regression line. What is the $R^2$?



In [None]:
m = lm(prediction ~ GDP, data = le)




The $R^2$ is 


In [None]:
 round(summary(m)$r.squared, 3)



#### VIII. Quadratic and Cubic Fits

Now drag and drop the original data set "Wealth and Health.csv" into the canvas. 
In the `Load Data` screen add the quadratic and cubic transformations of `GDP per capita`.

19. Regress `Life Expectancy` on `GDP per capita`. This is called a *linear fit*. What is the value of the $R^2$ statistic? 


In [None]:
m = lm(Life.Expectancy ~ GDP, data = le)




The $R^2$ is 


In [None]:
 round(summary(m)$r.squared, 3)



20. Regress `Life Expectancy` on `GDP per capita` and (`GDP per capita`)$^2$. This is called a *quadratic fit*. What is the value of the $R^2$ statistic? How does it compare to the value of $R^2$ you found for the linear fit?



In [None]:
m = lm(Life.Expectancy ~ poly(GDP,2), data = le)




The $R^2$ is 


In [None]:
 round(summary(m)$r.squared, 3)
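By default, `poly(GDP, 2)` builds *orthogonal* polynomial terms, so the coefficients of the model above are not the coefficients on `GDP` and `GDP`$^2$ themselves. The fitted values and the $R^2$ are nevertheless the same as for a fit on the raw powers. The sketch below checks this on made-up data; the data frame `toy` and the model names are hypothetical.

```r
# Hypothetical toy data
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                  y = c(1.0, 2.7, 3.9, 4.6, 5.2, 5.5, 5.9, 6.0))

# Quadratic fit with orthogonal polynomial terms (the default of poly())
fit_orth <- lm(y ~ poly(x, 2), data = toy)
# Quadratic fit on the raw powers x and x^2
fit_raw  <- lm(y ~ x + I(x^2), data = toy)

# Same R-squared ...
all.equal(summary(fit_orth)$r.squared, summary(fit_raw)$r.squared)

# ... and the same prediction at a new value of the predictor
new <- data.frame(x = 2.5)
all.equal(unname(predict(fit_orth, new)), unname(predict(fit_raw, new)))
```

If coefficients on the raw powers are needed, for example to compute a prediction by hand from the squared value, `poly(x, 2, raw = TRUE)` or the `I(x^2)` form gives them directly.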



21. Now consider again Argentina, whose variable values in 2011 were:


Variable | Value
--- | ---
`Life Expectancy at Birth` | 75.7 years
`Health spending per Capita` | \$1,227
`GDP per Capita` | \$19,629


Given the quadratic fit of `Life Expectancy` to `GDP per capita`, what is the predicted value of `Life Expectancy` for Argentina? Hint: The square of 19.629 is 


In [None]:
round(19.629^2,2)



What is the (absolute value of the) prediction error? How does it compare to the prediction error you found for the linear fit (Question 13)?

The predicted value of `Life Expectancy` for Argentina is 


In [None]:
 round(predict(m,data.frame(GDP=c(19.629))),2)



 
The prediction error is 


In [None]:
round(abs(75.7 - predict(m,data.frame(GDP=c(19.629)))),2)



22. Regress `Life Expectancy` on `GDP per capita`, (`GDP per capita`)$^2$, and (`GDP per capita`)$^3$. This is called a *cubic fit*. What is the value of the $R^2$ statistic? How does it compare to the value of $R^2$ you found for the quadratic fit?



In [None]:
m = lm(Life.Expectancy ~ poly(GDP,3), data = le)




The $R^2$ is 


In [None]:
 round(summary(m)$r.squared, 3)
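Because the linear, quadratic, and cubic fits are nested, the $R^2$ cannot decrease as higher-order terms are added. A sketch on made-up data follows; the data frame `toy` and the vector `r2` are hypothetical.

```r
# Hypothetical toy data
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                  y = c(1.0, 2.7, 3.9, 4.6, 5.2, 5.5, 5.9, 6.0))

# R-squared of the polynomial fits of degree 1, 2, and 3
r2 <- sapply(1:3, function(k) summary(lm(y ~ poly(x, k), data = toy))$r.squared)
names(r2) <- c("linear", "quadratic", "cubic")

round(r2, 3)
```

A higher $R^2$ therefore does not by itself show that the more flexible fit predicts better for a country outside the sample.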




23. Given the cubic fit of `Life Expectancy` to `GDP per capita`, what is the predicted value of `Life Expectancy` for Argentina? Hint: The cube of 19.629 is 


In [None]:
round(19.629^3,1)



What is the (absolute value of the) prediction error? How does it compare to the prediction error you found for the linear and quadratic fits?

The predicted value of `Life Expectancy` for Argentina is 


In [None]:
round(predict(m,data.frame(GDP=c(19.629))),2)




and the prediction error is


In [None]:
round(abs(75.7 - predict(m,data.frame(GDP=c(19.629)))),2)

