

# <font color=red>**General Commands for getting information about the data set**</font>

In [None]:
#mtcars is a dataset already loaded in R

head(mtcars) #gives a preview of the data
nrow(mtcars) #gives number of rows
ncol(mtcars) #gives number of columns
fit = lm(mpg ~ wt, data = mtcars) #mtcars has no columns named y or x; mpg and wt are real columns
summary(fit) # gives us the regression summary with fitted parameters

**reading in a csv file**

In [None]:
data = read.csv("filename.csv")
var1 = data$var1 #how to extract variables
var2 = data$var2
hist(var1) #making a histogram
fit = lm (var1 ~ var2, data = data)
summary(fit) #gives regression summary

**plotting two variables**


In [None]:
#sample of plotting two variables
x = data$reading.score
xlab = "Reading Score"
y= data$math.score
ylab = 'Math Score'
plot(x,y, xlab = xlab, ylab = ylab, main = "Plot of Math Scores Vs Reading Scores")
abline(lm(y~x), col = "red") #adds the trend line

# <font color=red>**Prerequisite Material**</font>

- Expectation and variance under a linear transformation:
  - $E(aX+b)=aE[X]+b$,
  - $\mathrm{Var}(aX+b)=a^2\mathrm{Var}(X)$.
- Central Limit Theorem: $\bar{X}$ is approximately normal for moderate/large $n$.
- One-sample mean, $\sigma$ unknown (t-CI):
  $$
  \bar{x} \ \pm\ t_{1-\alpha/2,\ n-1}\ \frac{s}{\sqrt{n}}
  $$
- One-sample mean, $\sigma$ unknown (t-test):
  $$
  t \;=\; \frac{\bar{x}-\mu_0}{s/\sqrt{n}},\quad \text{df}=n-1
  $$




- **Population mean ($\mu$) vs Sample mean ($\bar{x}$):**
  - $\mu$ is the true mean (unknown), $\bar{x}$ estimates $\mu$.
  - $\sigma$ **unknown** (typical) $\Rightarrow$ **t** methods with $\text{df}=n-1$; $\sigma$ known $\Rightarrow$ z (rare).
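
The two transformation rules from the prerequisite list can be checked numerically. The sketch below uses the sample analogues (`mean` and `var`) on `mtcars$mpg`, which is only a stand-in variable; the constants `a` and `b` are arbitrary choices for illustration.

```r
# Sample analogue of E(aX+b) = aE[X] + b and Var(aX+b) = a^2 Var(X).
# mtcars$mpg is a stand-in variable; a and b are arbitrary constants.
x = mtcars$mpg
a = 3
b = -2

# all.equal() compares up to floating point tolerance
lin_mean_ok = isTRUE(all.equal(mean(a * x + b), a * mean(x) + b))
lin_var_ok  = isTRUE(all.equal(var(a * x + b), a^2 * var(x)))
print(c(lin_mean_ok, lin_var_ok))
```

Note that the shift `b` drops out of the variance, which is why only `a^2` appears on the right-hand side.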

by hand confidence interval for true population average

In [None]:
#replace math_scores with the variable we are concerned with
n = length(math_scores)
x_bar = mean(math_scores)
s = sd(math_scores)
t_value = qt(1-.05/2,n-1) #critical value is always qt(1 - alpha/2, df), here alpha = .05 and df = n-1
upper = x_bar + t_value * (s/sqrt(n))
lower = x_bar - t_value * (s/sqrt(n))
print('95% confidence interval')
print(c(lower,upper))

built in confidence interval commands for true population mean

In [None]:
t_test = t.test(math_scores,conf.level = 0.95) #default is 95%
print(t_test) #gives t statistic, df, p-value, C.I, average
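
The t-test formula from the prerequisite section can also be computed by hand and compared with `t.test`. This is a minimal sketch: `scores` stands in for `math_scores` (it uses `mtcars$mpg` so the cell runs on its own), and the null value `mu0 = 20` is a hypothetical choice.

```r
# One-sample t-test by hand: t = (x_bar - mu0) / (s / sqrt(n)), df = n - 1
scores = mtcars$mpg   # stand-in for math_scores
mu0 = 20              # hypothetical null value, change as needed

n = length(scores)
t_stat = (mean(scores) - mu0) / (sd(scores) / sqrt(n))
p_value = 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE) # two-sided

# built-in version for comparison
built_in = t.test(scores, mu = mu0)
print(c(t_stat, p_value))
print(c(built_in$statistic, built_in$p.value))
```

Both printed rows should agree; if `mu` is left out, `t.test` tests against 0.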

# <font color=red>**Ordinary Least Squares- Simple Linear Regression**</font>

- **Model:**
  $$
  Y_i=\beta_0+\beta_1 X_i+\varepsilon_i,\quad E[\varepsilon_i]=0,\ \mathrm{Var}(\varepsilon_i)=\sigma^2,\ \text{iid}
  $$
  
- **OLS estimators:**
  $$
  \hat{\beta}_1=\frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2},
  \qquad
  \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}
  $$
- **Fitted values & residuals:** $$\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_i, \qquad e_i=y_i-\hat{y}_i$$
- **RSS & variance estimate:**
  $$
  \mathrm{RSS}=\sum_i (y_i-\hat{y}_i)^2,\qquad
  s^2=\frac{\mathrm{RSS}}{n-2}
  $$
- **One predictor variable**

**$\hat\beta_{1}$ calculation**
$$
\hat{\beta}_1=\frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}
$$

In [None]:
xbar = mean(data$reading.score)
ybar = mean(data$math.score)
num = sum((data$reading.score - xbar) * (data$math.score - ybar))
denom = sum((data$reading.score - xbar)^2)
beta1hat = num / denom

**$\hat\beta_{0}$ calculation**
$$
\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}
$$

In [None]:
beta0hat = ybar - beta1hat*xbar

**regression summary, gives $\hat\beta_{0}$ (intercept), $\hat\beta_{1}$ (slope)**

In [None]:
fit = lm(y ~ x, data = data)
summary(fit)

**Fitted values & residuals:** $$\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_i, \qquad e_i=y_i-\hat{y}_i$$

In [None]:
yhat = beta0hat + beta1hat*x
check = sum((yhat - fit$fitted.values)^2) #make sure it's close to zero

resid = y - yhat #error terms/residuals
check = sum((resid - fit$residuals)^2) #make sure it's close to zero


**RSS & variance estimate:**
  $$
  \mathrm{RSS}=\sum_i (y_i-\hat{y}_i)^2,\qquad
  s^2=\frac{\mathrm{RSS}}{n-2}
  $$

In [None]:
num = sum((resid)^2)
n = length(y) #data has no column named y, so length(data$y) would return 0
denom = n-2 #df
s2 = num / denom
s = sqrt(s2)
fit = lm(y~x, data = data) #to check
summary(fit)

# <font color=red>**Normal Error Regression & Maximum Likelihood**</font>

- **Assume** $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)\ \Rightarrow\ Y_i\mid X_i\sim\mathcal{N}(\beta_0+\beta_1 x_i,\sigma^2)$.
- **Log-likelihood (up to constants):**
  $$
  \ell(\beta_0,\beta_1,\sigma)= -n\log\sigma -\frac{n}{2}\log(2\pi)
  -\frac{1}{2\sigma^2}\sum_i (y_i-\beta_0-\beta_1 x_i)^2
  $$
- **MLE vs OLS:**
  - $\hat{\beta}_0,\hat{\beta}_1$ are the **same** as OLS.
  - $\hat{\sigma}^2_{\text{ML}}=\mathrm{RSS}/n$ (biased); unbiased uses $n-2$.
- no R assignment, just the proof
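
A short sketch of the proof behind the two MLE bullets, using only the log-likelihood stated above (the intermediate steps are filled in here as a reminder, not copied from the course notes). For any fixed $\sigma$, only the last term of $\ell$ depends on $\beta_0,\beta_1$, so maximizing $\ell$ is the same as minimizing $\sum_i (y_i-\beta_0-\beta_1 x_i)^2$, which gives the OLS estimators. Plugging those in and differentiating with respect to $\sigma$:

$$
\frac{\partial \ell}{\partial \sigma}
= -\frac{n}{\sigma}+\frac{1}{\sigma^{3}}\sum_i (y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)^2 = 0
\quad\Rightarrow\quad
\hat{\sigma}^2_{\text{ML}}=\frac{1}{n}\sum_i (y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)^2=\frac{\mathrm{RSS}}{n}
$$

Since $E[\mathrm{RSS}]=(n-2)\sigma^2$ under the model,

$$
E\big[\hat{\sigma}^2_{\text{ML}}\big]=\frac{n-2}{n}\,\sigma^2<\sigma^2,
\qquad
E\big[s^2\big]=E\!\left[\frac{\mathrm{RSS}}{n-2}\right]=\sigma^2
$$

so the ML estimator is biased downward and dividing by $n-2$ removes the bias.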



# <font color=red>**Inference on Parameters**</font>

Let $S_{xx}=\sum_i (x_i-\bar{x})^2$.

- **Standard errors:**
  $$
  \mathrm{SE}(\hat{\beta}_1)=\frac{s}{\sqrt{S_{xx}}},\qquad
  \mathrm{SE}(\hat{\beta}_0)=s\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}}
  $$
- **t-statistics (df $=n-2$):**
  $$
  t_{\beta_1}=\frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)},\qquad
  t_{\beta_0}=\frac{\hat{\beta}_0}{\mathrm{SE}(\hat{\beta}_0)}
  $$
- **$(1-\alpha)$ CIs:**
  $$
  \hat{\beta}_j \ \pm\ t_{1-\alpha/2,\ n-2}\ \mathrm{SE}(\hat{\beta}_j),\quad j\in\{0,1\}
  $$
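
As a sketch of how the standard error and t-statistic formulas above line up with R's output, the cell below computes them by hand for both coefficients and compares them with the `summary()` table. It assumes a stand-in regression of `mpg` on `wt` from `mtcars` so that it runs without any csv file; swap in your own variables.

```r
# Standard errors and t-statistics by hand vs the summary() table
fit_cars = lm(mpg ~ wt, data = mtcars)   # stand-in model
x = mtcars$wt
n = length(x)

s = summary(fit_cars)$sigma              # s = sqrt(RSS / (n - 2))
Sxx = sum((x - mean(x))^2)

se_b1 = s / sqrt(Sxx)                    # SE of slope
se_b0 = s * sqrt(1/n + mean(x)^2 / Sxx)  # SE of intercept
t_b1 = coef(fit_cars)[2] / se_b1         # tests H0: beta1 = 0
t_b0 = coef(fit_cars)[1] / se_b0         # tests H0: beta0 = 0

coef_table = summary(fit_cars)$coefficients
print(rbind(by_hand = c(se_b0, se_b1), from_lm = coef_table[, "Std. Error"]))
print(rbind(by_hand = c(t_b0, t_b1), from_lm = coef_table[, "t value"]))
```

The `by_hand` and `from_lm` rows should match in both printed tables.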

**extracting necessary information from regression summary**

<center> $s^2 = \frac{1}{n-2} \sum_i \left(Y_i - \hat{Y_i}\right)^2$
</center>
$\qquad$

<center> $\widehat{Var(\hat{\beta}_1)} = \frac{s^2}{\sum_i (X_i - \bar{X})^2}$
</center>

In [None]:
fit = lm(y~x, data = data)
summary(fit) #gives you the coefficients
beta1hat = fit$coefficients[2] #extract
beta0hat = fit$coefficients[1] #extract
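s = summary(fit)$sigma #extract; the residual standard error lives in the summary object, not in fit
s2 = s^2 #estimate of sigma^2, used below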

#beta hat 1 variance
num = s2
denom = sum((x - mean(x))^2)
beta1hat_var = num / denom
beta1hat_var
sqrt(beta1hat_var)


**using information to do a $(1-\alpha)$ Confidence Interval on the SLOPE**

In [None]:
se = sqrt(beta1hat_var)
alpha = 0.05 #change depending on level
degrees_freedom = nrow(data) - 2 #samples minus 2
critical_value = qt(1 - alpha/2, degrees_freedom)
lower = beta1hat - critical_value * se
upper = beta1hat + critical_value * se
print(c(lower,upper))

fit = lm(y~x, data = data)
confint(fit, level = 0.95) #gives you the confidence interval as well



**Hypothesis Testing on the SLOPE (1)**
  - $H_{0}$ : $\beta_{1} = 0$
  - $H_{a}$ : $\beta_{1} \neq 0$
  - test to see if the slope is significantly different from zero, i.e., is there a linear relationship between x and y?

In [None]:
fit = lm(y~x, data = data)
summary(fit) #will give you a p-value or t value to compare t_stat to

#t-stat
t_stat = (beta1hat - 0) / se #parentheses needed: subtract the null value first, then divide by se
#p-value by hand or just use summary(fit) table
p_value = 2 * pt(abs(t_stat), degrees_freedom, lower.tail = FALSE)





**Hypothesis Testing on the SLOPE (2)**
  - $H_{0}$ : $\beta_{1} = .5$
  - $H_{a}$ : $\beta_{1} \neq .5$
  - testing to see if the slope is significantly different from *.5*


In [None]:
fit = lm(y~x, data = data)
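summary(fit) #its t value and p-value test beta1 = 0, so the test against .5 is done by hand

t_stat = (beta1hat - .5) / se #parentheses needed: subtract the null value first, then divide by se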
p_value = 2 * pt(abs(t_stat), degrees_freedom, lower.tail = FALSE)



# <font color=red>**Inference on Fitted Values & Predictions at $x_h$**</font>

Let $S_{xx}=\sum_i (x_i-\bar{x})^2$ and $\hat{y}_h=\hat{\beta}_0+\hat{\beta}_1 x_h$.

- **SE of fitted mean response:**
  $$
  \mathrm{SE}(\hat{y}_h)=
  s\sqrt{ \frac{1}{n}+\frac{(x_h-\bar{x})^2}{S_{xx}} }
  $$
- **CI for mean response $E[Y\mid X=x_h]$:**
  $$
  \hat{y}_h \ \pm\ t_{1-\alpha/2,\ n-2}\ \mathrm{SE}(\hat{y}_h)
  $$
- **Prediction SE (adds new-error term):**
  $$
  s\sqrt{ 1+\frac{1}{n}+\frac{(x_h-\bar{x})^2}{S_{xx}} }
  $$
- **PI for new $Y_{\text{new}}$ at $x_h$:**
  $$
  \hat{y}_h \ \pm\ t_{1-\alpha/2,\ n-2}\ s\sqrt{ 1+\frac{1}{n}+\frac{(x_h-\bar{x})^2}{S_{xx}} }
  $$
- **Rules of thumb:** CIs/PIs are narrowest near $\bar{x}$; larger $n$ $\Rightarrow$ all intervals shrink, but a PI can never shrink below roughly $\pm z_{1-\alpha/2}\,\sigma$ because of the new-error term.
- **Confidence Band:** gives us the range of $E(Y_{h})$ given $X_{h}$. A single confidence interval only shows the range at one point; the band draws the confidence interval at every $X_{h}$, so we can see the intervals across the entire line.
- **Prediction Band:** gives us the range of a new observation $Y_{\text{new}}$ at each value of $X_{h}$.
- **C.B. VS P.B.** prediction bands are always wider because they account for variability in the sample AND in the new observation, whereas the confidence band only accounts for variability in the sample. More variability = wider range.


- **CI vs PI (at $x_h$):**
  - **CI** is for the **mean** response $E[Y\mid X=x_h]$.
  - **PI** is for a **new single observation** $Y_{\text{new}}$ at $x_h$ (always wider).

- **Fitted value $\hat{y}_h$ vs Prediction $y_{\text{new}}$:**
  - CI $\to$ centered at $\hat{y}_h$, covers the mean $E[Y\mid X=x_h]$.
  - PI $\to$ also centered at $\hat{y}_h$, covers $y_{\text{new}}$ (future draw).
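
To see the two bands described above, one option is to evaluate `predict()` on a grid of x values and draw the limits. This is a sketch under assumptions: it uses `mpg ~ wt` from `mtcars` as a stand-in model, the grid size of 100 is arbitrary, and the bands are the pointwise intervals joined together (not a simultaneous band).

```r
# Pointwise confidence and prediction bands across the range of x
fit_cars = lm(mpg ~ wt, data = mtcars)   # stand-in model
grid = data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))

ci_band = predict(fit_cars, newdata = grid, interval = "confidence", level = 0.95)
pi_band = predict(fit_cars, newdata = grid, interval = "prediction", level = 0.95)

# widths, to compare the two bands numerically
ci_width = ci_band[, "upr"] - ci_band[, "lwr"]
pi_width = pi_band[, "upr"] - pi_band[, "lwr"]

plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg",
     ylim = range(c(pi_band, mtcars$mpg)), main = "Confidence and Prediction Bands")
lines(grid$wt, ci_band[, "fit"], col = "red")                               # fitted line
matlines(grid$wt, ci_band[, c("lwr", "upr")], col = "blue", lty = 2)        # confidence band
matlines(grid$wt, pi_band[, c("lwr", "upr")], col = "darkgreen", lty = 3)   # prediction band
```

The green (prediction) band should sit outside the blue (confidence) band everywhere, and both should pinch in near the mean of `wt`.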

In [None]:
fit = lm(y~x, data = data) #fit the data
summary(fit) #get the summary, gives almost all variables needed

**CI and PI**

In [None]:
#extract values from summary(fit)
beta0hat = fit$coefficients[1]
beta1hat = fit$coefficients[2]
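s = summary(fit)$sigma #residual standard error lives in the summary object, not in fit

confint(fit, level = 0.95) #gives you the confidence intervals for the coefficients
newdata = data.frame(x = 200) #use a new name so the original data is not overwritten
predict(fit, newdata = newdata, interval = "confidence", level = 0.95) # CI for given value of X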

#say we have a new observation at X = 200
new_obs = data.frame(x = 200)
predict(fit, newdata = new_obs, interval = "prediction", level = 0.95) # PI for a future data point
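
The CI and PI formulas at $x_h$ can be reproduced by hand and compared with `predict()`. The cell below is a sketch that assumes a stand-in model (`mpg ~ wt` from `mtcars`) and a hypothetical point `x_h = 3`; replace both with your own.

```r
# CI for the mean response and PI for a new observation at x_h, by hand
fit_cars = lm(mpg ~ wt, data = mtcars)   # stand-in model
x = mtcars$wt
n = length(x)
x_h = 3                                  # hypothetical x value of interest

s = summary(fit_cars)$sigma
Sxx = sum((x - mean(x))^2)
y_h = unname(coef(fit_cars)[1] + coef(fit_cars)[2] * x_h)  # fitted value at x_h
t_crit = qt(1 - 0.05/2, n - 2)

se_mean = s * sqrt(1/n + (x_h - mean(x))^2 / Sxx)          # SE of mean response
se_pred = s * sqrt(1 + 1/n + (x_h - mean(x))^2 / Sxx)      # adds the new-error term

ci_hand = c(y_h - t_crit * se_mean, y_h + t_crit * se_mean)
pi_hand = c(y_h - t_crit * se_pred, y_h + t_crit * se_pred)

# built-in versions for comparison
new_pt = data.frame(wt = x_h)
ci_r = predict(fit_cars, newdata = new_pt, interval = "confidence", level = 0.95)
pi_r = predict(fit_cars, newdata = new_pt, interval = "prediction", level = 0.95)
print(rbind(ci_hand, ci_r[1, c("lwr", "upr")]))
print(rbind(pi_hand, pi_r[1, c("lwr", "upr")]))
```

The column name in `new_pt` has to match the predictor name in the model formula, otherwise `predict()` cannot find the new value.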

# <font color=red>**Matrix Representation of Regression**</font>

- **Model:** $Y=X\beta+\varepsilon$, where $X=[\mathbf{1},\ x]$ (first column ones, second is $x$).
- **OLS solution:**
  $$
  \hat{\beta}=(X^\top X)^{-1}X^\top Y
  $$
- **Variances:**
  $$
  \mathrm{Var}(\hat{\beta})=\sigma^2 (X^\top X)^{-1},
  \qquad
  \mathrm{Var}(\hat{Y})=\sigma^2\, X (X^\top X)^{-1} X^\top
  $$
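
The matrix formulas above can be checked directly in R. This is a minimal sketch assuming the same stand-in variables as elsewhere (`mpg` as $Y$ and `wt` as $x$ from `mtcars`); $\sigma^2$ is replaced by its estimate $s^2=\mathrm{RSS}/(n-2)$, so the result is the estimated variance matrix.

```r
# OLS via the matrix formula beta_hat = (X'X)^{-1} X'Y
x = mtcars$wt
Y = mtcars$mpg
X = cbind(1, x)                      # design matrix: column of ones, then x

XtX_inv = solve(t(X) %*% X)
beta_hat = XtX_inv %*% t(X) %*% Y    # 2 x 1 matrix: intercept, slope

resid = Y - X %*% beta_hat
s2 = sum(resid^2) / (nrow(X) - 2)    # estimate of sigma^2
var_beta = s2 * XtX_inv              # estimated Var(beta_hat)

# compare with lm()
fit_cars = lm(mpg ~ wt, data = mtcars)
print(cbind(matrix_formula = as.numeric(beta_hat), lm = coef(fit_cars)))
print(var_beta)
print(vcov(fit_cars))
```

The square roots of the diagonal of `var_beta` are the standard errors reported in `summary(fit_cars)`.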


# <font color=red>**Summary**</font>
- One-mean t-CI: $$\bar{x} \pm t_{(1-\alpha/2,n-1)}\, \frac{s}{\sqrt{n}}$$
- One-mean t-test: $$t= \frac{(\bar{x}-\mu_0)}{(s/\sqrt{n})}$$

- OLS: $$\hat{\beta}_1=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}$$ $$\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$$ $$s^2=\frac{RSS}{(n-2)}$$  

- Matrix Representation: $$\hat{\beta}=(X^\top X)^{-1}X^\top Y$$ $$\ \mathrm{Var}(\hat{\beta})=\sigma^2(X^\top X)^{-1} $$

- MLE (normal errors): same $\hat{\beta}$ as OLS; $$\hat{\sigma}^2_{\text{ML}}=\mathrm{RSS}/n \qquad\text{(biased)}$$

- SEs: $$\mathrm{SE}(\hat{\beta}_1)=\frac{s}{\sqrt{S_{xx}}}$$ $$\ \mathrm{SE}(\hat{\beta}_0)=s\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}}$$

- CI/PI at $x_h$:
  - CI: $\hat{y}_h \pm t\, s\sqrt{\frac{1}{n}+\frac{(x_h-\bar{x})^2}{S_{xx}}}$  
  - PI: $\hat{y}_h \pm t\, s\sqrt{1+\frac{1}{n}+\frac{(x_h-\bar{x})^2}{S_{xx}}}$
