# Simple Linear, Multiple Linear, and Cubic Regression Models

## Exercise 4

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = β0 + β1X + β2X^2 + β3X^3 + ϵ.

### Exercise 4a:

Suppose that the true relationship between X and Y is linear, i.e. Y = β0 + β1X + ϵ. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

<span style="color: blue;">If the true relationship between X and Y is linear, the cubic regression's training RSS will still be less than or equal to the linear regression's training RSS, and in practice slightly lower. The linear model is a special case of the cubic model (set β2 = β3 = 0), so least squares on the cubic model can always match the linear fit, and its extra terms let it follow noise in the training data.</span>

### Exercise 4b:

Answer (a) using test rather than training RSS.

<span style="color: blue;">If the relationship is actually linear, we would expect the linear regression's test RSS to be lower. The cubic model's extra terms aren't needed, so they tend to fit noise in the training data (overfitting), which means the model finds patterns that don't actually exist and predicts new data less accurately.</span>
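The answers to (a) and (b) can be checked with a small simulation. The sketch below is illustrative only: the true model Y = 2 + 3X + ϵ, the noise level, the seed, the sample sizes, and the helper names `rss` and `simulate_once` are all assumptions, not part of the exercise.

```r
# Simulation sketch: the true relationship is linear, Y = 2 + 3X + noise.
# Coefficients, noise level, seed and sample sizes are illustrative assumptions.
set.seed(1)

# Residual sum of squares of a fitted model on a given data set.
rss <- function(fit, data) sum((data$y - predict(fit, newdata = data))^2)

simulate_once <- function(n_train = 100, n_test = 1000) {
  make_data <- function(n) {
    x <- runif(n, -2, 2)
    data.frame(x = x, y = 2 + 3 * x + rnorm(n))
  }
  train <- make_data(n_train)
  test  <- make_data(n_test)

  fit_linear <- lm(y ~ x, data = train)
  fit_cubic  <- lm(y ~ poly(x, 3, raw = TRUE), data = train)

  c(train_linear = rss(fit_linear, train),
    train_cubic  = rss(fit_cubic, train),
    test_linear  = rss(fit_linear, test),
    test_cubic   = rss(fit_cubic, test))
}

# One row per simulated data set, one column per RSS value.
results <- t(replicate(200, simulate_once()))

# Training RSS: the cubic fit is never worse than the linear fit.
all(results[, "train_cubic"] <= results[, "train_linear"] + 1e-8)

# Test RSS: compare the averages across the 200 simulated data sets.
colMeans(results[, c("test_linear", "test_cubic")])
```

The training comparison holds for every simulated data set because the models are nested. The test comparison is only expected to favor the linear model on average, so it is summarized with means rather than checked run by run.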

### Exercise 4c:

Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

<span style="color: blue;">If the relationship between X and Y isn't actually linear, the training RSS for the cubic regression will still be less than or equal to the linear regression's training RSS, and typically strictly lower. The linear model is nested within the cubic model, so the cubic fit can always do at least as well on the training set, and its extra terms let it bend to follow curvature in the data. This holds regardless of how far the true relationship is from linear.</span>

### Exercise 4d:

Answer (c) using test rather than training RSS.

<span style="color: blue;">If the true relationship isn't linear, there is not enough information to tell which model will have the lower test RSS. If the true relationship is strongly curved, the cubic model will likely have the lower test RSS, because the linear model is badly biased. However, if the relationship is only slightly curved or if there isn't much data, the cubic model's extra variance can outweigh its reduction in bias, and it can end up doing worse than the linear model.</span>
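The dependence on the amount of curvature can be explored with a sketch like the one below. It assumes a true model of the form Y = 2 + 3X + c·X² + ϵ, where the curvature values c, the seed, the sample sizes, and the helper `mean_test_rss` are all hypothetical choices made for illustration.

```r
# Sketch: true relationship Y = 2 + 3X + curvature * X^2 + noise (an assumed form).
set.seed(2)

# Residual sum of squares of a fitted model on a given data set.
rss <- function(fit, data) sum((data$y - predict(fit, newdata = data))^2)

# Average test RSS of the linear and cubic fits over repeated simulations.
mean_test_rss <- function(curvature, reps = 200, n_train = 100, n_test = 1000) {
  make_data <- function(n) {
    x <- runif(n, -2, 2)
    data.frame(x = x, y = 2 + 3 * x + curvature * x^2 + rnorm(n))
  }
  one_run <- function() {
    train <- make_data(n_train)
    test  <- make_data(n_test)
    c(linear = rss(lm(y ~ x, data = train), test),
      cubic  = rss(lm(y ~ poly(x, 3, raw = TRUE), data = train), test))
  }
  rowMeans(replicate(reps, one_run()))
}

curvatures <- c(0.01, 0.1, 0.5, 1)
comparison <- t(sapply(curvatures, mean_test_rss))
rownames(comparison) <- paste0("curvature = ", curvatures)
comparison
```

Under these assumptions one would expect the cubic model to win clearly at the larger curvature values, while at the smallest value the two average test RSS figures should be close, with the linear model at no disadvantage.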

<br><br><br><br><br><br><br><br>

## Exercise 7

It is claimed in the text that in the case of simple linear regression of Y onto X, the $R^2$ statistic (3.17) is equal to the square of the correlation between X and Y (3.18). Prove that this is the case. For simplicity, you may assume that x̄ = ȳ = 0.

<span style="color: blue;">
With x̄ = ȳ = 0, the least squares intercept is zero and the slope of the regression line is equal to:
    
$$ b_1 = \frac{\sum x_i y_i}{\sum x_i^2} $$

The total variation in Y is the total sum of squares:

$$ TSS = \sum y_i^2 $$

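The residual sum of squares is the variation in Y that the regression line leaves unexplained, where $\hat{y}_i = b_1 x_i$ are the predicted values from the line:

$$ RSS = \sum (y_i - b_1 x_i)^2 = \sum y_i^2 - 2 b_1 \sum x_i y_i + b_1^2 \sum x_i^2 = \sum y_i^2 - b_1^2 \sum x_i^2 $$

The last step uses $\sum x_i y_i = b_1 \sum x_i^2$, which follows from the formula for $b_1$. The variation in Y that the line does explain is therefore:

$$ TSS - RSS = b_1^2 \sum x_i^2 = \sum \hat{y}_i^2 $$

The $R^2$ statistic is the fraction of the total variation explained by the line:

$$
R^2 = \frac{TSS - RSS}{TSS} = \frac{b_1^2 \sum x_i^2}{\sum y_i^2}
$$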

Now, we substitute the expression for $b_1$:

$$
R^2 = \frac{\left(\sum x_i y_i\right)^2}{\sum x_i^2 \cdot \sum y_i^2}
$$

With x̄ = ȳ = 0, the correlation between X and Y, which measures how closely the points follow a straight line, is:

$$
r = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2} \cdot \sqrt{\sum y_i^2}}
$$

If we square the correlation, we get:

$$
r^2 = \frac{\left(\sum x_i y_i\right)^2}{\sum x_i^2 \cdot \sum y_i^2}
$$

This is the same as $R^2$. So, we can conclude that the $R^2$ statistic is equal to the square of the correlation between X and Y.
</span>
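The identity can also be checked numerically. The sketch below uses simulated data (the generating values and seed are arbitrary assumptions), centers both variables so that the zero-mean assumption holds, and compares $R^2$ computed from its definition $(TSS - RSS)/TSS$, from `lm`, and from the squared correlation.

```r
# Numerical check on simulated data; the generating values are arbitrary assumptions.
set.seed(3)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

# Center both variables so that their means are zero, as assumed in the proof.
x <- x - mean(x)
y <- y - mean(y)

b1  <- sum(x * y) / sum(x^2)   # least squares slope for centered data
tss <- sum(y^2)                # total sum of squares
rss <- sum((y - b1 * x)^2)     # residual sum of squares

r_squared_manual <- (tss - rss) / tss
r_squared_lm     <- summary(lm(y ~ x))$r.squared
r_squared_cor    <- cor(x, y)^2

# All three values should agree up to floating point error.
c(manual = r_squared_manual, lm = r_squared_lm, cor_squared = r_squared_cor)
```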

<br><br><br><br><br><br><br><br>

## Exercise 10

This question should be answered using the Carseats data set.

### Exercise 10a:

Fit a multiple regression model to predict **Sales** using **Price**, **Urban**, and **US**.

In [1]:
library(ISLR)
data(Carseats)

fit_full = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit_full)


Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936    
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,	Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16


### Exercise 10b:

Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

<span style="color: blue;">
Intercept (13.043469): predicted Sales (in thousands of units) when Price = 0, Urban = No, and US = No. The intercept has no direct practical meaning here, since Price = 0 is outside the range of the data.

Price (−0.054459): holding Urban and US fixed, increasing the price by $1 is associated with a decrease in sales of about 0.0545 thousand units (about 54.5 units). This effect is highly statistically significant (p < 2e−16).

UrbanYes (−0.021916): stores located in urban areas have predicted sales about 0.0219 thousand units (about 22 units) lower than non-urban stores, holding Price and US fixed. This estimate is not statistically significant (p ≈ 0.936), so there is no evidence of an association between Urban and Sales.

USYes (1.200573): stores in the US have predicted sales about 1.2006 thousand units (about 1,200 units) higher than non-US stores, holding Price and Urban fixed. This is statistically significant (p ≈ 4.86e−06).
</span>

### Exercise 10c:

Write out the model in equation form, being careful to handle the qualitative variables properly.

<span style="color: blue;">

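$$
\widehat{\text{Sales}} = 13.043469 - 0.054459 \times \text{Price} - 0.021916 \times \text{UrbanYes} + 1.200573 \times \text{USYes}
$$

Here UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise.

</span>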
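To make the handling of the qualitative variables concrete, the equation can be written as a small function in which each Yes/No variable becomes a 0/1 indicator. This is a sketch only: `predict_sales` is a hypothetical helper, the coefficients are copied from the `summary(fit_full)` output, and the example store (Price = 100) is an arbitrary choice.

```r
# Coefficients copied from the summary(fit_full) output.
coefs <- c(intercept = 13.043469, price = -0.054459,
           urban_yes = -0.021916, us_yes = 1.200573)

# predict_sales() is a hypothetical helper, not part of any package.
predict_sales <- function(price, urban, us) {
  urban_yes <- as.numeric(urban == "Yes")  # 1 if Urban = "Yes", otherwise 0
  us_yes    <- as.numeric(us == "Yes")     # 1 if US = "Yes", otherwise 0
  unname(coefs["intercept"] + coefs["price"] * price +
         coefs["urban_yes"] * urban_yes + coefs["us_yes"] * us_yes)
}

# Predicted sales in thousands of units for a store charging 100.
predict_sales(price = 100, urban = "Yes", us = "Yes")  # about 8.776
predict_sales(price = 100, urban = "No",  us = "No")   # about 7.598
```

The difference between the two predictions is the sum of the UrbanYes and USYes coefficients, since Price is the same in both calls.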

### Exercise 10d:

For which of the predictors can you reject the null hypothesis H0: βj = 0?

<span style="color: blue;">
At the 5% significance level, reject H0 for Price (p < 2e−16) and US (p ≈ 4.86e−06).

Fail to reject H0 for Urban (p ≈ 0.936).
</span>

### Exercise 10e:

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

In [2]:
fit_red = lm(Sales ~ Price + US, data = Carseats)
summary(fit_red)


Call:
lm(formula = Sales ~ Price + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9269 -1.6286 -0.0574  1.5766  7.0515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,	Adjusted R-squared:  0.2354 
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16


### Exercise 10f:

How well do the models in (a) and (e) fit the data?

<span style="color: blue;">
Full model (Price + Urban + US): Multiple R² ≈ 0.2393, Adjusted R² ≈ 0.2335, residual standard error ≈ 2.472.

Reduced model (Price + US): Multiple R² ≈ 0.2393, Adjusted R² ≈ 0.2354, residual standard error ≈ 2.469.

Both models fit the data only modestly, explaining about 24% of the variance in Sales. Dropping Urban leaves R² unchanged to four decimal places because Urban contributed almost nothing, while the adjusted R² rises slightly and the residual standard error falls slightly. The reduced model is therefore preferred.
</span>

### Exercise 10g:

Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

In [3]:
confint(fit_red, level = 0.95)

                  2.5 %      97.5 %
(Intercept)  11.7903202 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632
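These intervals follow the usual form estimate ± t × standard error. As a rough cross-check, the sketch below rebuilds them from the rounded estimates and standard errors printed by `summary(fit_red)`, so small differences in the last digits are expected. The variable names are illustrative.

```r
# Estimates and standard errors copied from the summary(fit_red) output.
estimates  <- c("(Intercept)" = 13.03079, Price = -0.05448, USYes = 1.19964)
std_errors <- c("(Intercept)" = 0.63098,  Price = 0.00523,  USYes = 0.25846)
df_resid   <- 397  # residual degrees of freedom from the summary

# Two-sided 95% critical value of the t distribution.
t_crit <- qt(0.975, df = df_resid)

ci <- cbind(lower = estimates - t_crit * std_errors,
            upper = estimates + t_crit * std_errors)
round(ci, 4)
```

None of the three intervals contains zero, which agrees with the small p-values for these coefficients.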


### Exercise 10h:

Is there evidence of outliers or high leverage observations in the model from (e)?

In [10]:
resid_vals = rstandard(fit_red)
top5_resid = order(abs(resid_vals), decreasing = TRUE)[1:5]

cat("Top 5 standardized residuals (possible outliers):\n")
for (i in top5_resid) {
  cat("Observation", i, ":", round(resid_vals[i], 3), "\n")
}


lev_vals = hatvalues(fit_red)
top5_lev = order(lev_vals, decreasing = TRUE)[1:5]

cat("\nTop 5 leverage values (possible high-leverage points):\n")
for (i in top5_lev) {
  cat("Observation", i, ":", round(lev_vals[i], 3), "\n")
}


cook_vals = cooks.distance(fit_red)
top5_cook = order(cook_vals, decreasing = TRUE)[1:5]

cat("\nTop 5 Cook's distance values (influential points):\n")
for (i in top5_cook) {
  cat("Observation", i, ":", round(cook_vals[i], 4), "\n")
}


Top 5 standardized residuals (possible outliers):
Observation 377 : 2.865 
Observation 51 : -2.811 
Observation 69 : 2.623 
Observation 26 : 2.581 
Observation 210 : -2.563 

Top 5 leverage values (possible high-leverage points):
Observation 43 : 0.043 
Observation 175 : 0.03 
Observation 166 : 0.029 
Observation 126 : 0.026 
Observation 368 : 0.024 

Top 5 Cook's distance values (influential points):
Observation 26 : 0.0261 
Observation 368 : 0.0243 
Observation 50 : 0.0228 
Observation 317 : 0.0205 
Observation 166 : 0.0198 
