# Notations for Statistics

# ISL: An Introduction to Statistical Learning

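* $n$: the number of observations (distinct data points) in the sample
* $p$: the number of variables available for making predictions
* $\bold{X}$: an $n \times p$ matrix whose $(i,j)$th element is $x_{ij}$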

$$
\bold{X} = \begin{bmatrix}
   x_{11} & x_{12} & \cdots & x_{1p} \\
   x_{21} & x_{22} & \cdots & x_{2p} \\
   \vdots & \vdots & \ddots & \vdots \\
   x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
$$

* Row view: $x_{1}, x_{2}, \cdots, x_{n}$, where $x_{i}$ is a vector of length $p$

$$
x_{i} = \begin{bmatrix}
   x_{i1} \\
   x_{i2} \\
   \vdots \\
   x_{ip}
\end{bmatrix}
$$

$$
\bold{X} = \begin{bmatrix}
x_{1}^{T} \\
x_{2}^{T} \\
\vdots \\
x_{n}^{T}
\end{bmatrix}
$$

* Column view: $\bold{x_{1}}, \bold{x_{2}}, \cdots, \bold{x_{p}}$, where $\bold{x_{j}}$ is a vector of length $n$

$$
\bold{x_{j}} = \begin{bmatrix}
   x_{1j} \\
   x_{2j} \\
   \vdots \\
   x_{nj}
\end{bmatrix}
$$

$$
\bold{X} = ( \bold{x_{1}} \; \bold{x_{2}} \; \cdots \; \bold{x_{p}} )
$$
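
The row and column views can be sketched in plain Python, assuming the data matrix is stored as a list of rows. The helper names `row` and `column` and the toy values are illustrative assumptions, not part of the ISL notation.

```python
# A small n x p data matrix stored as a list of rows (here n = 3, p = 2).
# The numbers are made up for illustration.
X = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
n = len(X)      # number of observations
p = len(X[0])   # number of predictors


def row(X, i):
    """Row view: the i-th observation (1-indexed), a vector of length p."""
    return list(X[i - 1])


def column(X, j):
    """Column view: the j-th predictor (1-indexed), a vector of length n."""
    return [r[j - 1] for r in X]


second_observation = row(X, 2)     # all p measurements for observation 2
second_predictor = column(X, 2)    # all n values of predictor 2
```

The indices are 1-based here only to mirror the subscripts in the formulas; idiomatic Python code would normally use 0-based indices.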

* $y_{i}$: the $i$th observation of the variable we wish to predict (the response)

$$
\bold{y} = \begin{bmatrix}
   y_{1} \\
   y_{2} \\
   \vdots \\
   y_{n}
\end{bmatrix}
$$


* Observed data: $\{(x_{1}, y_{1}), (x_{2}, y_{2}), \cdots, (x_{n}, y_{n}) \}$, where each $x_{i}$ is a vector of length $p$.

* Vector of length $n$: $\bold{a} \in \reals^{n}$
* Vector of length $p$: $a \in \reals^{p}$
* Matrices: $\bold{A} \in \reals^{r \times d}$, $\bold{B} \in \reals^{d \times s}$
* Random variable: $A$
* Scalar: $a \in \reals$
* Matrix multiplication: $(\bold{AB})_{ij} = \sum_{k=1}^{d}a_{ik}b_{kj}$
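
The matrix multiplication rule can be checked with a minimal sketch in plain Python, assuming matrices are stored as lists of rows. The function name `matmul` and the example matrices are illustrative choices.

```python
def matmul(A, B):
    """Multiply an r x d matrix A by a d x s matrix B.

    Entry (i, j) of the result is sum over k of a_ik * b_kj.
    """
    r, d, s = len(A), len(B), len(B[0])
    if any(len(a_row) != d for a_row in A):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(A[i][k] * B[k][j] for k in range(d)) for j in range(s)]
        for i in range(r)
    ]


A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]           # 3 x 2
AB = matmul(A, B)        # 2 x 2
```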

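A quantitative response $Y$ and $p$ different predictors $X = ( X_{1}, X_{2}, \cdots, X_{p} )$.

The assumed relationship between $Y$ and $X$ (2.1), where $f$ is a fixed but unknown function and $\epsilon$ is a random error term, independent of $X$, with mean zero: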

$$
Y = f(X) + \epsilon
$$

Prediction:

$$
\hat{Y} = \hat{f}(X)
$$

$$
\begin{equation}
\begin{split}
E(Y - \hat{Y})^{2} &= E[f(X) + \epsilon - \hat{f}(X)]^{2} \\
&= [f(X) - \hat{f}(X)]^{2} + Var(\epsilon)
\end{split}
\end{equation}
$$
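
The step between the two lines of this decomposition can be filled in as follows. This is a sketch under the usual ISL assumptions that $\hat{f}$ and $X$ are held fixed, so that only $\epsilon$ is random, and that $E(\epsilon) = 0$:

$$
\begin{equation}
\begin{split}
E[f(X) + \epsilon - \hat{f}(X)]^{2} &= [f(X) - \hat{f}(X)]^{2} + 2 [f(X) - \hat{f}(X)] E(\epsilon) + E(\epsilon^{2}) \\
&= [f(X) - \hat{f}(X)]^{2} + Var(\epsilon)
\end{split}
\end{equation}
$$

The cross term vanishes because $E(\epsilon) = 0$, and for the same reason $E(\epsilon^{2}) = Var(\epsilon)$. The first term is the reducible error and the second is the irreducible error.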



Inference:

* Which predictors are associated with the response? 
* What is the relationship between the response and each predictor?
* Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

## Statistical Learning

### Assessing Model Accuracy

MSE: mean squared error (2.5)

$$
\texttt{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{f}(x_{i}))^{2}
$$
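
A minimal sketch of the MSE computation in plain Python. The function name `mse` and the toy numbers are assumptions made for illustration; `y_hat` stands for the fitted values $\hat{f}(x_{i})$.

```python
def mse(y, y_hat):
    """Mean squared error between observed responses and predictions."""
    if len(y) != len(y_hat) or not y:
        raise ValueError("y and y_hat must be non-empty and of equal length")
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)


y = [3.0, 5.0, 7.0]        # observed responses (made up)
y_hat = [2.0, 5.0, 9.0]    # predictions from some fitted model (made up)
error = mse(y, y_hat)      # squared errors are 1, 0 and 4, averaged over n = 3
```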

## Linear Regression

### Simple Linear Regression

Simple linear Regression (3.1)

$$
Y \approx \beta_{0} + \beta_{1} X
$$

* $\beta_{0}$: intercept
* $\beta_{1}$: slope

prediction (3.2) - least squares line

$$
\hat{y} = \hat{\beta_{0}} + \hat{\beta_{1}} x
$$

RSS: Residual Sum of Squares (3.3) (3.16)

$$
\hat{y_{i}} = \hat{\beta_{0}} + \hat{\beta_{1}} x_{i} \\
e_{i} = y_{i} - \hat{y_{i}}
$$

$$
\begin{align}
\texttt{RSS}
&= e_{1}^{2} + e_{2}^{2} + \cdots + e_{n}^{2} \\
&= \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^{2} \\
&= (y_{1} - \hat{\beta_{0}} - \hat{\beta_{1}} x_{1})^{2} + (y_{2} - \hat{\beta_{0}} - \hat{\beta_{1}} x_{2})^{2} + \cdots + (y_{n} - \hat{\beta_{0}} - \hat{\beta_{1}} x_{n})^{2}
\end{align}
$$



least squares approach (3.4): the coefficient estimates that minimize the RSS

$$
\begin{align}
   \hat{\beta_{1}} &= \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}} \\
   \hat{\beta_{0}} &= \bar{y} - \hat{\beta_{1}} \bar{x}
\end{align}
$$

$$
\bar{y} \equiv \frac{1}{n} \textstyle\sum_{i=1}^{n} y_{i} \\
\bar{x} \equiv \frac{1}{n} \textstyle\sum_{i=1}^{n} x_{i}
$$
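
The least squares estimates can be computed directly from these formulas. The following is a minimal sketch in plain Python; the function name `fit_simple_linear_regression` and the toy data are illustrative assumptions.

```python
def fit_simple_linear_regression(x, y):
    """Return (beta0_hat, beta1_hat), the least squares estimates of (3.4)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope estimate.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Made-up toy data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0_hat, beta1_hat = fit_simple_linear_regression(x, y)
```

For this toy data the sums work out to $\sum (x_{i} - \bar{x})(y_{i} - \bar{y}) = 19.9$ and $\sum (x_{i} - \bar{x})^{2} = 10$, so the slope estimate is $1.99$ and the intercept estimate is $6.02 - 1.99 \cdot 3 = 0.05$.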

$Y = f(X) + \epsilon$, with $f$ approximated by a linear function (3.5) - population regression line

$$
Y = \beta_{0} + \beta_{1} X + \epsilon
$$

* population mean of the random variable $Y$: $\mu$
* sample mean: $\hat{\mu} = \bar{y} = \frac{1}{n} \textstyle\sum_{i=1}^{n} y_{i}$, computed from $n$ observations $y_{1}, \cdots, y_{n}$ of $Y$

standard error of $\hat{\mu}$, $\texttt{SE}(\hat{\mu})$ (3.7)

$$
\texttt{Var}(\hat{\mu}) = \texttt{SE}(\hat{\mu})^{2} = \frac{\sigma^{2}}{n}
$$



standard error of $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ (3.8)

$$
\begin{align}
\texttt{SE}(\hat{\beta_{0}})^{2} &= \sigma^{2} \left [  \frac{1}{n} + \frac{\bar{x}^{2}}{\sum_{i=1}^{n}(x_{i} - \bar{x})^2} \right] \\
\texttt{SE}(\hat{\beta_{1}})^{2} &= \frac{\sigma^{2}}{\sum_{i=1}^{n}(x_{i} - \bar{x})^2}
\end{align}
$$

$\sigma^{2} = \texttt{Var}(\epsilon)$

RSE: residual standard error, the estimate of $\sigma$ (3.15)

$$
\texttt{RSE} = \sqrt{\frac{\texttt{RSS}}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^{2}}{n - 2}}
$$
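
A minimal sketch of how RSS, RSE and the two standard errors fit together, assuming $\sigma$ is replaced by its estimate RSE in (3.8). The function name `simple_regression_errors`, the dictionary keys, and the toy data are illustrative assumptions.

```python
import math


def simple_regression_errors(x, y):
    """Fit y on x by least squares and return RSS, RSE and standard errors.

    The unknown sigma in (3.8) is replaced by its estimate, the RSE.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    # Residual sum of squares and residual standard error.
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))
    # Standard errors of the coefficient estimates, with sigma estimated by RSE.
    se_beta0 = rse * math.sqrt(1.0 / n + x_bar ** 2 / s_xx)
    se_beta1 = rse / math.sqrt(s_xx)
    return {
        "beta0": beta0,
        "beta1": beta1,
        "rss": rss,
        "rse": rse,
        "se_beta0": se_beta0,
        "se_beta1": se_beta1,
    }


# Made-up toy data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
errors = simple_regression_errors(x, y)
```

The divisor $n - 2$ reflects the two coefficients estimated from the data, which is why the sketch requires at least three observations.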

Approximate 95% confidence interval for $\beta_{0}$: $\hat{\beta_{0}} \pm 2 \cdot \texttt{SE}(\hat{\beta_{0}})$

Approximate 95% confidence interval for $\beta_{1}$: $\hat{\beta_{1}} \pm 2 \cdot \texttt{SE}(\hat{\beta_{1}})$

t-statistic (3.14)

measures the number of standard deviations that $\hat{\beta_{1}}$ is away from 0

$$
t = \frac{\hat{\beta_{1}} - 0}{\texttt{SE}(\hat{\beta_{1}})}
$$

p-value: the probability of observing any number equal to $|t|$ or larger in absolute value, assuming $\beta_{1} = 0$
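
The t-statistic and the rough $\pm 2 \cdot \texttt{SE}$ interval for the slope can be sketched in plain Python. The function name `slope_inference` and the toy data are illustrative assumptions. The p-value is deliberately not computed here, since it requires the CDF of a t-distribution with $n - 2$ degrees of freedom, which the Python standard library does not provide.

```python
import math


def slope_inference(x, y):
    """Return (t, (lower, upper)) for the slope of a simple linear regression.

    t tests the null hypothesis beta1 = 0; the interval is the rough
    beta1_hat +/- 2 * SE(beta1_hat) approximation to a 95% interval.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))          # estimate of sigma
    se_beta1 = rse / math.sqrt(s_xx)
    t = (beta1 - 0.0) / se_beta1
    interval = (beta1 - 2.0 * se_beta1, beta1 + 2.0 * se_beta1)
    return t, interval


# Made-up toy data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
t, (lower, upper) = slope_inference(x, y)
```

A large $|t|$, as in this toy example, means the estimated slope is many standard errors away from zero, which corresponds to a small p-value.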

$R^{2}$ statistic (3.17)

$$
R^{2} = \frac{\texttt{TSS} - \texttt{RSS}}{\texttt{TSS}} = 1 - \frac{\texttt{RSS}}{\texttt{TSS}}
$$

TSS: total sum of squares

$$
\texttt{TSS} = \sum_{i=1}^{n} (y_{i} - \bar{y})^{2}
$$

correlation

$$
r = \texttt{Cor} (X, Y) = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}
{\sqrt{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}} \sqrt{\sum_{i=1}^{n}(y_{i} - \bar{y})^{2}}}
$$
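
In simple linear regression, $R^{2}$ equals the squared sample correlation $r^{2}$. The sketch below checks this numerically in plain Python, using the usual sample correlation whose denominator is the square root of the product of the two sums of squares. The function name `r_squared_and_correlation` and the toy data are illustrative assumptions.

```python
import math


def r_squared_and_correlation(x, y):
    """Return (R^2, r) for a simple linear regression of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    tss = sum((yi - y_bar) ** 2 for yi in y)    # total sum of squares
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - rss / tss
    r = s_xy / math.sqrt(s_xx * tss)
    return r_squared, r


# Made-up toy data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r_squared, r = r_squared_and_correlation(x, y)
```

The identity $R^{2} = r^{2}$ holds only with a single predictor; with several predictors the correlation between one predictor and the response no longer summarizes the fit.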

### Multiple Linear Regression

multiple linear regression model (3.19)

$p$ distinct predictors

$$
Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{p}X_{p} + \epsilon
$$

correlation matrix

interaction term (3.31)

$$
Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{1}X_{2} + \epsilon
$$

quadratic (3.36)

$$
\texttt{mpg} = \beta_{0} + \beta_{1} \times \texttt{horsepower} + \beta_{2} \times \texttt{horsepower}^2 + \epsilon
$$
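
Both the interaction model and the quadratic model remain linear in the coefficients: the extra terms are simply new columns computed from the original predictors. The sketch below illustrates this in plain Python. The helper names and the coefficient values are hypothetical and are not estimates from any real data set.

```python
def interaction_row(x1, x2):
    """Design row [1, X1, X2, X1*X2] for the interaction model (3.31)."""
    return [1.0, x1, x2, x1 * x2]


def quadratic_row(horsepower):
    """Design row [1, horsepower, horsepower^2] for the quadratic model (3.36)."""
    return [1.0, horsepower, horsepower ** 2]


def predict(beta, design_row):
    """Linear prediction: the dot product of coefficients and a design row."""
    if len(beta) != len(design_row):
        raise ValueError("beta and design_row must have the same length")
    return sum(b * v for b, v in zip(beta, design_row))


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
beta_quadratic = [1.0, 2.0, 3.0]
beta_interaction = [1.0, 1.0, 1.0, 1.0]

y_hat_quadratic = predict(beta_quadratic, quadratic_row(2.0))
y_hat_interaction = predict(beta_interaction, interaction_row(2.0, 3.0))
```

Because the prediction is still a dot product, the same least squares machinery used for multiple linear regression applies once the transformed columns have been built.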

## Classification

## Resampling Methods

## Linear Model Selection and Regularization

## Moving Beyond Linearity

## Tree-Based Methods

## Support Vector Machines

## Deep Learning

## Survival Analysis and Censored Data

## Unsupervised Learning

## Multiple Testing