In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.outliers_influence as st_inf
from IPython.display import display, Markdown
%matplotlib inline

In [None]:
def printm(input_str):
    display(Markdown(input_str))

# Conceptual

## #1 
Describe the null hypotheses to which the p-values given in Table 3.4
correspond. Explain what conclusions you can draw based on these
p-values. Your explanation should be phrased in terms of sales, TV,
radio, and newspaper, rather than in terms of the coefficients of the
linear model.

In [None]:
df = pd.read_csv("../data/Advertising.csv", index_col=0)
print(sm.OLS(df["Sales"], sm.add_constant(df[["TV", "Radio", "Newspaper"]])).fit().summary())

For each coefficient the null hypothesis is that the true population parameter is zero. That is, holding the other two media fixed, a change in the amount of advertising spent on TV, radio, or newspaper is not associated with any change in the expected value of sales.

The coefficient on the intercept is positive, that is we expect that there will be some sales even in the absence of advertising. The p value is very close to 0, which means that based on the observed data it is very unlikely we would observe that coefficient from a sample, if the true average sales in absence of advertising was zero.

For both TV and radio, the coefficient is positive and the p-value is very close to zero. Based on this, if we observe a higher level of TV or radio advertising, we should expect a higher level of sales.

For newspaper advertising the coefficient is slightly negative, but the p value is large. Based on this we cannot reject the null hypothesis that observing a change in newspaper advertising provides no meaningful information on the expected level of sales.

## #2
Carefully explain the differences between the KNN classifier and KNN
regression methods.

In general, classification models are concerned with associating an observation with a discrete category, whereas regression uses an observation to predict a continuous value.

KNN models use known values of nearby observations to predict an unknown value, where "nearby" is determined by the other characteristics of the observation in the model. 

For classification this leads to regions of the predictor space where a majority of the K nearest observations belong to one class; any new observation in that region is predicted to be in that class. For regression the prediction is the average of the responses of the K nearest neighbors.
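The distinction can be sketched in a few lines of plain Python. This is a toy illustration with one predictor and made-up data, not a reference implementation; the helper names `k_nearest`, `knn_classify`, and `knn_regress` are hypothetical.

```python
from collections import Counter

def k_nearest(train_x, query, k):
    """Indices of the k training points closest to query (one predictor)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return order[:k]

def knn_classify(train_x, train_labels, query, k=3):
    """Classifier: majority vote over the labels of the k nearest neighbors."""
    votes = Counter(train_labels[i] for i in k_nearest(train_x, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_x, train_y, query, k=3):
    """Regression: plain average of the responses of the k nearest neighbors."""
    idx = k_nearest(train_x, query, k)
    return sum(train_y[i] for i in idx) / k

# Made-up training data: same predictor, one qualitative and one quantitative response
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["low", "low", "low", "high", "high", "high"]
ys = [1.1, 1.9, 3.2, 9.8, 11.1, 12.3]

label_pred = knn_classify(xs, labels, query=2.5)
value_pred = knn_regress(xs, ys, query=2.5)
print(label_pred, round(value_pred, 3))
```

Both functions share the same neighbor search; only the final step differs (a vote versus an average).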

## #3
Suppose we have a data set with five predictors, $X_1$ = GPA, $X_2$ = IQ, $X_3$ = Gender (1 for Female and 0 for Male), $X_4$ = Interaction between GPA and IQ, and $X_5$ = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get $\beta_0$ = 50, $\beta_1$ = 20, $\beta_2$ = 0.07, $\beta_3$ = 35, $\beta_4$ = 0.01, $\beta_5$ = −10.


(a) Which answer is correct, and why?

i. For a fixed value of IQ and GPA, males earn more on average than females.

ii. For a fixed value of IQ and GPA, females earn more on average than males.

iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.

iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.

(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

(a)
$X_3 = 1$ for females, so relative to a male with the same IQ and GPA, a female's predicted salary differs by $\beta_3 + \beta_5 \cdot \text{GPA} = 35 - 10 \cdot \text{GPA}$.
For GPA < 3.5 this difference is positive, for GPA = 3.5 there is no gender differential predicted, and for GPA > 3.5 it is negative.
Therefore iii. is correct: males earn more on average provided that the GPA is high enough (above 3.5).
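As a quick check of the sign change, the gender differential can be tabulated for a few GPA values. This is only a sketch using the two coefficients quoted in the question; the function name `female_minus_male` is made up for illustration.

```python
BETA_GENDER = 35        # beta_3: coefficient on Gender (1 = female)
BETA_GPA_GENDER = -10   # beta_5: coefficient on the GPA x Gender interaction

def female_minus_male(gpa):
    """Predicted female salary minus male salary (thousands of dollars) at fixed IQ and GPA."""
    return BETA_GENDER + BETA_GPA_GENDER * gpa

for gpa in (2.0, 3.0, 3.5, 4.0):
    print(f"GPA {gpa}: female - male = {female_minus_male(gpa):+.1f}")
```

The IQ terms cancel in the difference, which is why only $\beta_3$ and $\beta_5$ appear.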

(b)
$\text{salary} = \beta_0 + \beta_1 \text{GPA} + \beta_2 \text{IQ} + \beta_3 \text{Gender} + \beta_4 \text{GPAxIQ} + \beta_5 \text{GPAxGender}$

$\text{salary} = 50 + (20 * 4) + (0.07 * 110) + 35 + (0.01 * 4 * 110) + (-10 * 4)$

In [None]:
# beta_2 = 0.07 multiplies IQ; the result is in thousands of dollars (about 137.1)
50 + (20 * 4) + (0.07 * 110) + 35 + (0.01 * 4 * 110) + (-10 * 4)

(c)
False. A small coefficient may indicate that the effect is not of large practical significance, although it could be if the variable it multiplies is typically large, as GPA times IQ is here. To determine statistical significance it is necessary to know the standard error of the coefficient. The point estimate alone does not give enough information to evaluate statistical significance.

## #4

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$.

(a) Suppose that the true relationship between X and Y is linear, i.e. $Y = \beta_0 + \beta_1 X + \epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

In [None]:
printm("(a) Let's find out: ")
intercept = np.ones(100) * 10
# Just make beta 1 = 1 for simplicity
X_train = np.random.rand(100) * 100
noise_train = np.random.normal(0, 5, 100)
X_train_sq = X_train**2
X_train_cube = X_train**3
y_train = intercept + X_train + noise_train
ols_linear = sm.OLS(y_train, sm.add_constant(X_train)).fit()
ols_poly = sm.OLS(y_train, sm.add_constant(np.vstack([X_train, X_train_sq, X_train_cube]).T)).fit()
printm(f"Linear RSS: {ols_linear.ssr:0.3}")
printm(f"Polynomial RSS: {ols_poly.ssr:0.3}")

The polynomial terms are fitting to some of the noise, so on the data it's trained on the model has a slightly better fit.

(b) Answer (a) using test rather than training RSS.

In [None]:
printm("(b)")
X_test = np.random.rand(100) * 100
noise_test = np.random.normal(0, 5, 100)
X_test_sq = X_test**2
X_test_cube = X_test**3
y_test = intercept + X_test + noise_test
y_pred_lin = ols_linear.predict(sm.add_constant(X_test))
resid_lin = y_test - y_pred_lin
rss_lin = (resid_lin**2).sum()
y_pred_poly = ols_poly.predict(sm.add_constant(np.vstack([X_test, X_test_sq, X_test_cube]).T))
resid_poly = y_test - y_pred_poly
rss_poly = (resid_poly**2).sum()
printm(f"Linear test RSS: {rss_lin:0.3}\nPolynomial RSS: {rss_poly:0.3}")

Since the estimated relationships on the polynomial terms are spurious, we expect the model to perform worse on a new set of data with them included.

(c) Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not
enough information to tell? Justify your answer.

(d) Answer (c) using test rather than training RSS.

We saw above that even if there's no non-linear component to the model, adding extra terms will get you a better fit on your training set. Given that we are now assuming there is at least a small bit of information captured in the non-linear component of the relationship, the training RSS for the cubic regression will clearly be lower.

On the test data it will depend on the degree of non-linearity. If it's a particularly weak relationship you could see the same sort of overfitting pattern as in part b dominate the weak additional information in the non-linear component. If it's a strong non-linearity then the polynomial model should perform better even on test data. Here "strong" means the degree of non-linearity relative to the amount of noise in the data.
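The argument for (c) and (d) can be explored with a small simulation in the same spirit as the one for (a) and (b). This is a sketch under assumptions that are not in the exercise: the true model is $y = 1 + x + c\,x^2 + \epsilon$ with a hypothetical curvature $c$, the seed is arbitrary, and `np.polyfit` is used in place of statsmodels for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed so the run is repeatable

def simulate(n, curvature, noise_sd=1.0):
    """Draw (x, y) from y = 1 + x + curvature * x^2 + noise (assumed model)."""
    x = rng.uniform(-2, 2, n)
    y = 1 + x + curvature * x**2 + rng.normal(0, noise_sd, n)
    return x, y

def rss(coefs, x, y):
    """Residual sum of squares for polynomial coefficients from np.polyfit."""
    return float(np.sum((y - np.polyval(coefs, x)) ** 2))

results = []
for curvature in (0.05, 1.0):  # weak vs strong non-linearity relative to the noise
    x_tr, y_tr = simulate(100, curvature)
    x_te, y_te = simulate(100, curvature)
    linear = np.polyfit(x_tr, y_tr, deg=1)
    cubic = np.polyfit(x_tr, y_tr, deg=3)
    row = {
        "curvature": curvature,
        "train_linear": rss(linear, x_tr, y_tr),
        "train_cubic": rss(cubic, x_tr, y_tr),
        "test_linear": rss(linear, x_te, y_te),
        "test_cubic": rss(cubic, x_te, y_te),
    }
    results.append(row)
    print({key: round(value, 2) for key, value in row.items()})
```

The training RSS of the cubic fit can never exceed that of the linear fit because the linear model is nested inside it. The test comparison is the one that depends on the curvature and on the particular draw.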

## #5

Consider the fitted values that result from performing linear regression without an intercept. In this setting, the $i$th fitted value takes the form
$\hat{y}_i = x_i\hat{\beta}$

Where:
$$\hat{\beta} = \frac{\sum_{i=1}^{n}x_i y_i}{\sum_{j=1}^{n}x^2_{j}}$$
Show that we can write:
$$\hat{y_i}=\sum_{j=1}^{n}a_{j}y_{j}$$

What is $a_{j}$?

Note: We interpret this result by saying that the fitted values from linear regression are linear combinations of the response values.

$$\hat{y}_i = x_i\hat{\beta}$$

$$\hat{\beta} = \frac{\sum_{i=1}^{n}x_i y_i}{\sum_{j=1}^{n}x^2_{j}}$$

$$\hat{y}_i = x_i \frac{\sum_{j=1}^{n}x_j y_j}{\sum_{k=1}^{n}x^2_{k}}$$

$$\hat{y}_i = x_i \sum_{j=1}^{n}\frac{x_j y_j}{\sum_{k=1}^{n}x^2_{k}}$$

$$\hat{y}_i =  \sum_{j=1}^{n}\frac{x_i x_j}{\sum_{k=1}^{n}x^2_{k}} y_j$$

$$a_j = \frac{x_i x_j}{\sum_{k=1}^{n}x^2_{k}}$$
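A numerical check of this result, as a sketch on made-up data: it builds the weights $a_j$ for every $i$ (stored as `a[i][j]`, a name chosen here for illustration) and confirms that the weighted sums of the responses reproduce $x_i\hat{\beta}$.

```python
# Made-up data for a regression through the origin
x_vals = [1.0, 2.0, 3.0, 4.0]
y_vals = [1.2, 1.9, 3.2, 3.9]
n_obs = len(x_vals)

sum_x_sq = sum(v * v for v in x_vals)
beta_hat = sum(xi * yi for xi, yi in zip(x_vals, y_vals)) / sum_x_sq

# Fitted values computed directly as x_i * beta_hat
fitted_direct = [xi * beta_hat for xi in x_vals]

# Weights a[i][j] = x_i * x_j / sum_k x_k^2; they depend on x only, not on y
a = [[x_vals[i] * x_vals[j] / sum_x_sq for j in range(n_obs)] for i in range(n_obs)]

# Fitted values as linear combinations of the responses
fitted_weighted = [sum(a[i][j] * y_vals[j] for j in range(n_obs)) for i in range(n_obs)]

print(fitted_direct)
print(fitted_weighted)
```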

## #6

Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point $(\bar{x}, \bar{y})$.

3.4:

$$\hat{\beta_1} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

$$\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}$$

Least squares line:
$$\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i$$

Need to show that for $x_i = \bar{x}$ we get $\hat{y_i} = \bar{y}$

Let $x_i = \bar{x}$

$$\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}\bar{x}$$

$$\hat{y_i} = \bar{y} - \hat{\beta_1}\bar{x} + \hat{\beta_1}\bar{x} = \bar{y}$$

## #7

It is claimed in the text that in the case of simple linear regression of Y onto X, the $R^2$ statistic (3.17) is equal to the square of the correlation between X and Y (3.18). Prove that this is the case. For simplicity, you may assume that $\bar{x} = \bar{y} = 0$.

3.17:

$$ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$$

$$ \text{RSS} = \sum_{i=1}^n(y_i - \hat{y_i})^2 = \sum_{i=1}^n(y_i - \hat{\beta_0} - \hat{\beta_1}x_i)^2 = \sum_{i=1}^n(y_i - \hat{\beta_0} - \frac{\sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^n (x_j - \bar{x})^2} x_i)^2 $$

$$ \text{TSS} = \sum_{i=1}^n(y_i - \bar{y})^2$$

3.18:

$$\text{CORR}(X, Y) = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}}$$

Simplifying with $\bar{x} = \bar{y} = 0$:

$$\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} = 0$$

$$\hat{\beta_1} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$


$$R^2 = \frac{\sum_{i=1}^n y_i^2 - \sum_{i=1}^n\left(y_i - \hat{\beta_1} x_i\right)^2}{\sum_{i=1}^n y_i^2}$$

Writing latex to do math is terrible. Worked the rest out on paper
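For reference, the remaining step is short. With $\bar{x} = \bar{y} = 0$ the intercept is zero and $\hat{\beta_1} = \sum x_i y_i / \sum x_i^2$, so expanding the square gives $\text{RSS} = \sum y_i^2 - (\sum x_i y_i)^2 / \sum x_i^2$. Therefore $R^2 = (\sum x_i y_i)^2 / (\sum x_i^2 \sum y_i^2)$, which is the square of (3.18). A sketch of a numerical confirmation on simulated data (the data-generating model and seed are arbitrary choices):

```python
import random

random.seed(0)  # arbitrary seed
n_obs = 200
raw_x = [random.gauss(0, 1) for _ in range(n_obs)]
raw_y = [1.5 * xi + random.gauss(0, 1) for xi in raw_x]

# Center both variables so that x_bar = y_bar = 0, as the exercise allows
x_bar = sum(raw_x) / n_obs
y_bar = sum(raw_y) / n_obs
xc = [xi - x_bar for xi in raw_x]
yc = [yi - y_bar for yi in raw_y]

s_xy = sum(xi * yi for xi, yi in zip(xc, yc))
s_xx = sum(xi * xi for xi in xc)
s_yy = sum(yi * yi for yi in yc)

beta1_hat = s_xy / s_xx
rss = sum((yi - beta1_hat * xi) ** 2 for xi, yi in zip(xc, yc))

r_squared = 1 - rss / s_yy            # (3.17), with TSS = sum of y_i^2
corr_squared = s_xy**2 / (s_xx * s_yy)  # square of (3.18)
print(r_squared, corr_squared)
```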

# Applied

## #8

This question involves the use of simple linear regression on the Auto data set.

(a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.

For example:

i. Is there a relationship between the predictor and the response?

ii. How strong is the relationship between the predictor and the response?

iii. Is the relationship between the predictor and the response positive or negative?

iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?

(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

In [None]:
df = sm.datasets.get_rdataset("Auto", "ISLR", cache=True).data

In [None]:
df.head()

In [None]:
y_train = df["mpg"]
X_train = sm.add_constant(df["horsepower"])
lm = sm.OLS(y_train, X_train).fit()
print(lm.summary())

i) The model does find a relationship between mpg and horsepower. Looking at the standard error and p-value it is statistically significant.

ii) The p-value is very close to 0, so the relationship is strong from a statistical perspective. How strong it is from a practical perspective is hard to judge without knowing the range of values for mpg and horsepower, which I'll see when I do the plots in a later part of this question.

iii) There's a negative relationship, higher horsepower is associated with lower miles per gallon.

iv) See code below

In [None]:
lm.get_prediction((1, 98)).summary_frame(alpha=0.05)

b)

In [None]:
sns.regplot(x="horsepower", y="mpg", data=df);

In [None]:
result_df = df[["horsepower", "mpg"]].copy()
result_df["fitted"] = lm.fittedvalues
result_df["resid"] = lm.resid
result_df.plot(x="horsepower", y="resid", kind="scatter");

There's a pretty clear non-linear pattern in the fit. Residuals are high at low and high horsepower. Adding in horsepower squared might help.

## #9

This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

In [None]:
sns.pairplot(df);

(b) Compute the matrix of correlations between the variables using the function cor() . You will need to exclude the name variable, which is qualitative.

In [None]:
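# Drop the qualitative name column first; recent pandas raises on non-numeric columns in corr()
df.drop(columns="name").corr().style.background_gradient(cmap='viridis')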

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.

In [None]:
y_train = df["mpg"]
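# dtype=float keeps the dummy columns numeric; on recent pandas the default boolean dummies make statsmodels reject the design matrix
X_train = sm.add_constant(pd.concat([df.drop(columns=["mpg", "name", "origin"]), pd.get_dummies(df["origin"], drop_first=True, prefix="origin", dtype=float)], axis="columns"))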
lm = sm.OLS(y_train, X_train).fit()
print(lm.summary())

Comment on the output. For instance:

i. Is there a relationship between the predictors and the response?

ii. Which predictors appear to have a statistically significant relationship to the response?

iii. What does the coefficient for the year variable suggest?

There is a relationship between the independent variables and the dependent variable. Of them, displacement, weight, year, and origin are statistically significant. The positive coefficient on year suggests that mileage is improving over time, holding the other observable characteristics constant.

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

In [None]:
y_true = df["mpg"]
resid = lm.resid
y_pred = lm.predict(X_train)
sns.scatterplot(x=y_true, y=resid);

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(lm, ax=ax, criterion="cooks")

The residual plot shows there's still some non linearity, with larger residuals concentrated at low, and particularly at high mpg. The influence plot shows that observation 14 is a clear outlier in terms of leverage.

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

In [None]:
y_train = df["mpg"]
X_train = (
    pd.concat([df.drop(columns=["mpg", "name", "origin"]), pd.get_dummies(df["origin"], drop_first=True, prefix="origin", dtype=float)], axis="columns")
    .assign(cylinders_x_weight=lambda df: df["cylinders"] * df["weight"])
    .assign(cylinders_x_displacement=lambda df: df["cylinders"] * df["displacement"])
    .assign(horsepower_x_weight=lambda df: df["horsepower"] * df["weight"])
    .assign(horsepower_x_acceleration=lambda df: df["horsepower"] * df["acceleration"])
    .pipe(sm.add_constant)
)
lm = sm.OLS(y_train, X_train).fit()
print(lm.summary())

Cylinders * weight is significant at the 10% level, but not at 5%. Cylinders * displacement is not significant. Horsepower * weight and horsepower * acceleration are both significant. Adjusted $R^2$ has increased from about 0.82 to about 0.86, so model fit overall has improved.

(f) Try a few different transformations of the variables, such as $\log{X}$, $\sqrt{X}$, $X^2$. Comment on your findings.

In [None]:
y_train = df["mpg"]
X_train = (
    pd.concat([df.drop(columns=["mpg", "name", "origin"]), pd.get_dummies(df["origin"], drop_first=True, prefix="origin", dtype=float)], axis="columns")
    .assign(weight_sq=lambda df: df["weight"]**2)
    .pipe(sm.add_constant)
)
lm = sm.OLS(y_train, X_train).fit()
print(lm.summary())

In [None]:
y_train = df["mpg"]
X_train = (
    pd.concat([df.drop(columns=["mpg", "name", "origin"]), pd.get_dummies(df["origin"], drop_first=True, prefix="origin", dtype=float)], axis="columns")
    .assign(log_weight=lambda df: np.log(df["weight"]))
    .drop(columns="weight")
    .pipe(sm.add_constant)
)
lm = sm.OLS(y_train, X_train).fit()
print(lm.summary())

In [None]:
y_train = df["mpg"]
X_train = (
    pd.concat([df.drop(columns=["mpg", "name", "origin"]), pd.get_dummies(df["origin"], drop_first=True, prefix="origin", dtype=float)], axis="columns")
    .assign(weight_sq=lambda df: df["weight"]**2)
    .assign(horsepower_sq=lambda df: df["horsepower"]**2)
    .pipe(sm.add_constant)
)
lm = sm.OLS(y_train, X_train).fit()
print(lm.summary())

Adding in non-linearities for weight and horsepower definitely helps. I'm a little surprised the log transformation fits a bit worse than the linear plus squared transformation, but in general it's what I'd expect from having looked at the residuals.

## #10

This question should be answered using the Carseats data set.

In [None]:
text_to_bool = {"Yes": 1, "No": 0}
df = (
    sm.datasets.get_rdataset("Carseats", "ISLR", cache=True).data
    .assign(Urban=lambda df: df["Urban"].map(text_to_bool))
    .assign(US=lambda df: df["US"].map(text_to_bool))
    .pipe(pd.get_dummies, drop_first=True)
    .pipe(sm.add_constant)
)
df.head()

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

In [None]:
y_train = df["Sales"]
X_train = df[["const", "Price", "Urban", "US"]]
lm = sm.OLS(y_train, X_train).fit()
print(lm.summary())

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

The intercept is interpreted as the predicted sales in a rural, non-US location with a price of zero. Don't take that too literally.
Price has a negative coefficient, indicating that for every unit increase in price we expect sales to fall by about 0.05.
Urban has a negative coefficient, which would suggest that urban locations have slightly lower sales, all else equal. However, the coefficient on urban is not significant, so the more accurate interpretation is that there is no evidence that urban vs rural is associated with a change in sales.
US has a positive coefficient, suggesting for a given price a US location will have higher sales by 1.2 than a non-US location.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

$$\text{Sales} = 13.0 - 0.05 * \text{Price} -0.02 * \text{is_Urban} + 1.2 * \text{in_US}$$

Not sure what this question was asking for exactly

(d) For which of the predictors can you reject the null hypothesis $H_0: \beta_j = 0$?

As discussed in b, I can reject the null for all but Urban.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

In [None]:
y_train = df["Sales"]
X_train = df[["const", "Price", "US"]]
lm = sm.OLS(y_train, X_train).fit()
print(lm.summary())

(f) How well do the models in (a) and (e) fit the data?

Both fit pretty poorly. $R^2$ is about 0.24 in each of them, meaning the model is explaining about a quarter of the observed variation in sales. The adjusted $R^2$ in the second model is slightly higher, supporting the idea that removing Urban was the right decision.

(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

In [None]:
lm.conf_int(alpha=0.05)

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

In [None]:
y_true = df["Sales"]
resid = lm.resid
y_pred = lm.predict(X_train)
sns.scatterplot(x=y_true, y=resid);

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(lm, ax=ax, criterion="cooks")

Residuals increase with sales, suggesting we want a log transformation on sales to predict the percent rather than unit change in sales based on the model factors. Observation 42 has fairly high leverage.

## #11

In this problem we will investigate the t-statistic for the null hypothesis $H_0 : \beta = 0$ in simple linear regression without an intercept. To begin, we generate a predictor x and a response y as follows.

In [None]:
x = np.random.normal(size=100)
y = 2*x + np.random.normal(size=100)

a) Perform a simple linear regression of y onto x , without an intercept. Report the coefficient estimate $\hat{\beta}$, the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis $H_0 : \beta = 0$. Comment on these results.

In [None]:
lm = sm.OLS(y, x).fit()
print(lm.summary())

The coefficient estimate is 1.94, which is close to the true population parameter 2. The standard error is 0.086 and the associated t-stat and p-value give strong evidence to reject the null hypothesis

b) Now perform a simple linear regression of x onto y without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-values associated with the null hypothesis $H_0 : \beta = 0$. Comment on these results.
c) What is the relationship between the results obtained in (a) and (b)?

In [None]:
lm = sm.OLS(x, y).fit()
print(lm.summary())

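The coefficient for y is 0.4317. This is not an estimate of $\frac{1}{2}$, even though $y = 2x + \epsilon$ implies $x = \frac{1}{2}y - \frac{1}{2}\epsilon$: in that rearrangement the error term is correlated with $y$, so least squares does not recover $\frac{1}{2}$. With $x$ and $\epsilon$ independent standard normals, the no-intercept coefficient converges to $\frac{E[xy]}{E[y^2]} = \frac{2}{4 + 1} = 0.4$, which is close to the estimate. The t-statistic and p-value are identical to the previous model.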

d) For the regression of Y onto X without an intercept, the t-statistic for $H_0 : \beta = 0$ takes the form $\frac{\hat{\beta}}{SE(\hat{\beta})}$, where $\hat{\beta}$ is given by (3.38), and where $SE(\hat{\beta})$ is given by a big equation that I don't feel like copying out.

Show algebraically and confirm numerically in python that the t-statistic can be written as


$$\frac{\sqrt{n-1}\sum_i^nx_iy_i}{\sqrt{(\sum_i^nx_i^2)(\sum_i^ny_i^2)-(\sum_i^nx_iy_i)^2}}$$
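
A sketch of the algebra, taking the standard error for the no-intercept model from the text as $SE(\hat{\beta}) = \sqrt{\frac{\sum_i (y_i - x_i\hat{\beta})^2}{(n-1)\sum_i x_i^2}}$, with every sum running over $i = 1, \dots, n$. First expand the residual sum of squares using $\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2$:

$$\sum_i (y_i - x_i\hat{\beta})^2 = \sum_i y_i^2 - 2\hat{\beta}\sum_i x_i y_i + \hat{\beta}^2\sum_i x_i^2 = \sum_i y_i^2 - \frac{(\sum_i x_i y_i)^2}{\sum_i x_i^2}$$

Then substitute into the t-statistic and multiply the two square roots in the denominator together:

$$\frac{\hat{\beta}}{SE(\hat{\beta})} = \frac{\sum_i x_i y_i}{\sum_i x_i^2} \cdot \frac{\sqrt{(n-1)\sum_i x_i^2}}{\sqrt{\sum_i y_i^2 - \frac{(\sum_i x_i y_i)^2}{\sum_i x_i^2}}} = \frac{\sqrt{n-1}\sum_i x_i y_i}{\sqrt{(\sum_i x_i^2)(\sum_i y_i^2) - (\sum_i x_i y_i)^2}}$$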

In [None]:
statsmodels_t = lm.tvalues[0]

root_n_min1 = np.sqrt(len(x) -1)
sum_xy = (x * y).sum()
sum_x_sq = (x**2).sum()
sum_y_sq = (y**2).sum()
t_calc = (root_n_min1 * sum_xy)/(np.sqrt((sum_x_sq * sum_y_sq) - sum_xy**2))
print(f"t-stat from statsmodels: {statsmodels_t:0.3} t-stat from calc: {t_calc:0.3}")

e) Using the results from (d), argue that the t-statistic for the regression of y onto x is the same as the t-statistic for the regression of x onto y.

The equation is symmetric. If you swap x and y there's no change to the formula. 

f) In R , show that when regression is performed with an intercept, the t-statistic for $H_0: \beta_1 = 0$ is the same for the regression of y
onto x as it is for the regression of x onto y.

In [None]:
y_x_t = sm.OLS(y, sm.add_constant(x)).fit().tvalues[1]
x_y_t = sm.OLS(x, sm.add_constant(y)).fit().tvalues[1]
print(f"y on x: {y_x_t:0.3}, x on y: {x_y_t:0.3}")

## #12

This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate $\hat{\beta}$ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

a)
$$\hat{\beta} = \frac{\sum_i^nx_iy_i}{\sum_i^nx_i^2}$$
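Swapping the roles of x and y leaves the numerator unchanged, so the two estimates coincide when the denominators match, that is when $\sum_i^n x_i^2 = \sum_i^n y_i^2$. $y = x$ and $y = -x$ are special cases, as is any $y$ that is a permutation of $x$.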

b) I just did that above

c)

In [None]:
x = np.arange(start=0, stop=100, step=1)
y = -x + np.random.normal(size=100)
lmx = sm.OLS(x, y).fit()
lmy = sm.OLS(y, x).fit()
print(lmx.summary())
print(lmy.summary())
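
The two coefficients above agree only approximately, because the added noise makes $\sum x_i^2$ and $\sum y_i^2$ differ slightly. A minimal sketch of an exact match in plain Python, assuming y is a random permutation of x so that both sums of squares are equal (the helper name `no_intercept_slope` and the seed are made up for illustration):

```python
import random

random.seed(0)  # arbitrary seed
x_perm = [random.gauss(0, 1) for _ in range(100)]
y_perm = x_perm[:]          # same values...
random.shuffle(y_perm)      # ...in a different order, so sum(y^2) == sum(x^2)

def no_intercept_slope(response, predictor):
    """Least squares slope for a regression through the origin, as in (3.38)."""
    numerator = sum(r * p for r, p in zip(response, predictor))
    denominator = sum(p * p for p in predictor)
    return numerator / denominator

beta_y_on_x = no_intercept_slope(y_perm, x_perm)
beta_x_on_y = no_intercept_slope(x_perm, y_perm)
print(beta_y_on_x, beta_x_on_y)
```

Unlike $y = x$ or $y = -x$, the permutation does not give a perfect fit, which shows the condition is about the sums of squares and not about an exact linear relationship.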

## #13
In this exercise you will create some simulated data and will fit simple linear regression models to it.

(a) Using the rnorm() function, create a vector, x , containing 100 observations drawn from a N (0, 1) distribution. This represents
a feature, X.

(b) Using the rnorm() function, create a vector, eps , containing 100 observations drawn from a N (0, 0.25) distribution i.e. a normal
distribution with mean zero and variance 0.25.

(c) Using x and eps , generate a vector y according to the model $Y = -1 + 0.5X + \epsilon$

What is the length of the vector y ? What are the values of $\beta_0$ and $\beta_1$ in this linear model?

(d) Create a scatterplot displaying the relationship between x and y . Comment on what you observe.

(e) Fit a least squares linear model to predict y using x . Comment on the model obtained. How do $\hat{\beta_0}$ and $\hat{\beta_1}$ compare to $\beta_0$ and $\beta_1$?

(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different
color. Use the legend() command to create an appropriate legend.

(g) Now fit a polynomial regression model that predicts $y$ using $x$ and $x^2$. Is there evidence that the quadratic term improves the
model fit? Explain your answer.

(h) Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39)
should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term $\epsilon$ in (b). Describe your results.

(i) Repeat (a)–(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term $\epsilon$ in (b). Describe your results.

(j) What are the confidence intervals for $\beta_0$ and $\beta_1$ based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.