# Markdown Mathematical Examples

$\displaystyle x_{scale} = \frac{x}{X_{max}-X_{min}}$

$\displaystyle x_{minscale} = \frac{x-X_{min}}{X_{max}-X_{min}}$

$\displaystyle x_{norm} = \frac{x-\mu}{X_{max}-X_{min}}$

$\displaystyle x_{stand} = \frac{x - \mu}{\sigma}$
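
The scaling formulas above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, the population standard deviation is assumed for $\sigma$, and the input is assumed to contain at least two distinct values so no denominator is zero.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def mean_normalize(values):
    """Center on the mean, then divide by the range: (x - mu) / (max - min)."""
    lo, hi, mu = min(values), max(values), mean(values)
    return [(x - mu) / (hi - lo) for x in values]


def standardize(values):
    """Z-score: (x - mu) / sigma, with sigma as the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


data = [2.0, 4.0, 6.0, 8.0]
print(min_max_scale(data))
print(mean_normalize(data))
print(standardize(data))
```

Min-max scaling maps the smallest value to 0 and the largest to 1, while standardized values have mean 0 and standard deviation 1.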

## Regression scoring equations

- $y_{i}$ is an observed value
- $\hat{y_{i}}$ is a predicted value
- $\bar{y_{i}}$ is the mean value
- $\mu$ is also the mean value
- $n$ is the number of observations
- $\sum$ means to sum things together
- $e = y_{i} - \hat{y_{i}}$
- $absolute\text{ }error = | y_{i} - \hat{y_{i}} |$
- $squared\text{ }error = (y_{i} - \hat{y_{i}})^2$

Mean Absolute Error (MAE):    $MAE = \frac{\sum | y_{i} - \hat{y_{i}} |}{n} = \frac{1}{n} \sum | y_{i} - \hat{y_{i}} |$

Sum of Squared Errors (SSE):    $SSE = \sum (y_{i} - \hat{y_{i}})^2$

Root Mean Squared Error (RMSE):    $RMSE = \sqrt{\frac{1}{n} \sum (y_{i} - \hat{y_{i}})^2}$

$R^2$ or coefficient of determination:    $R^2 = 1 - \frac{\sum{(y_{i} - \hat{y_{i}})^2}}{\sum{(y_{i} - \bar{y_{i}})^2}} = 1 - \frac{MSE}{Var_{yactual}}$

Max error:    $\text{max error} = max(| y_{i} - \hat{y_{i}} |)$

Median absolute error:    $median\text{ }absolute\text{ }error = median(| y_{i} - \hat{y_{i}} |)$

Mean squared logarithmic error:    $MSLE = \frac{1}{n} \sum{(log_{e}(1+y_{i}) - log_{e}(1+\hat{y_{i}}))}^2$
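
As a rough illustration of how several of these regression scores relate to one another, here is a small Python sketch that computes them from the same list of errors. The function name `regression_scores` and the sample values are assumptions made for this example only.

```python
import math
from statistics import mean, median


def regression_scores(y_true, y_pred):
    """Compute a handful of regression scores from observed and predicted values."""
    errors = [y - p for y, p in zip(y_true, y_pred)]
    n = len(errors)
    sse = sum(e ** 2 for e in errors)
    y_bar = mean(y_true)
    # Total sum of squares: squared distance of each observation from the mean
    ss_total = sum((y - y_bar) ** 2 for y in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "sse": sse,
        "mse": sse / n,
        "rmse": math.sqrt(sse / n),
        "r2": 1 - sse / ss_total,
        "max_error": max(abs(e) for e in errors),
        "median_abs_error": median(abs(e) for e in errors),
    }


scores = regression_scores([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 8.0, 9.5])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Every score here is built from the same per-observation error $y_{i} - \hat{y_{i}}$; they differ only in how the errors are transformed (absolute value or square) and aggregated (sum, mean, median, or max).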

### Mean Squared Error (MSE):

$MSE = \frac{\sum (y_{i} - \hat{y_{i}})^2}{n} = \frac{1}{n} \sum (y_{i} - \hat{y_{i}})^2$

$MSE(X, h_{\theta}) = \frac{1}{m} \sum (\theta^{T}x^{(i)} - y^{(i)})^2$

Where:

- $X$: matrix of features
- $h_{\theta}$: prediction function, also called a *hypothesis*; $h_{\theta} = \theta^{T}x^{(i)}$
- $\theta$: array of weights
- $x^{(i)}$: array of features for a specific observation
- $y^{(i)}$: observed output for a specific observation

### Explained variance

$explained\text{ }variance = 1 - \frac{Var(y - \hat{y})}{Var(y)}$

Where $Var(y-\hat{y}) = \frac{\sum{(error - mean(error))^2}}{n} = \frac{\sum{(error^2)}}{n} - mean(error)^2$
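
A short Python sketch of explained variance, assuming the population variance (dividing by $n$) is intended. The function name and sample data are illustrative. The example uses predictions that are all off by the same constant, which shows that explained variance ignores a systematic offset in the errors.

```python
from statistics import pvariance


def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y), using population variances."""
    errors = [y - p for y, p in zip(y_true, y_pred)]
    return 1 - pvariance(errors) / pvariance(y_true)


y_true = [3.0, 5.0, 7.0, 9.0]
biased = [y + 1.0 for y in y_true]  # every prediction is too high by exactly 1

print(explained_variance(y_true, biased))
```

Because the errors are all identical, their variance is 0 and the explained variance is 1.0, even though none of the predictions is exactly right.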

### Mean poisson deviance and mean gamma deviance

$mean\text{ }poisson\text{ }deviance = \frac{1}{n} \sum 2(y_{i} log(\frac{y_{i}}{\hat{y_{i}}}) + \hat{y_{i}} - y_{i})$

$mean\text{ }gamma\text{ }deviance = \frac{1}{n} \sum 2(log(\frac{\hat{y_{i}}}{y_{i}}) + \frac{y_{i}}{\hat{y_{i}}} - 1)$

## Classification scoring equations

Prevalence: $prevalence = \frac{TP+FN}{TP+TN+FP+FN}$

Accuracy: $ACC = \frac{TP+TN}{TP+TN+FP+FN}$

Balanced accuracy: $BA = \frac{TPR+TNR}{2}$

Jaccard index: $J(y_{i}, \hat{y_{i}}) = \frac{| y_{i} \bigcap \hat{y_{i}} |}{| y_{i} \bigcup \hat{y_{i}} |}$

Zero one loss count: $zero \text{ } one \text{ } loss \text{ } count = FP + FN$

Zero one loss ratio: $zero \text{ } one \text{ } loss \text{ } ratio = \frac{FP + FN}{TP+TN+FP+FN}$

Precision / Positive predictive value (PPV): $precision = \frac{TP}{TP+FP} = 1 - FDR$

Markedness (MK): $MK = \frac{TP}{TP+FP} - \frac{FN}{FN+TN} = PPV + NPV - 1$

Negative predictive value (NPV): $NPV = \frac{TN}{FN + TN} = 1 - FOR$

False discovery rate (FDR): $FDR = \frac{FP}{TP + FP} = 1 - PPV$

False omission rate (FOR): $FOR = \frac{FN}{FN + TN} = 1 - NPV$

Recall / Sensitivity / True positive rate (TPR): $recall = \frac{TP}{TP + FN} = 1-FNR$

Informedness / Bookmaker Informedness (BM): $BM = \frac{TP}{TP+FN} - \frac{FP}{TN+FP} = TPR + TNR - 1$

Youden's J index: $J = sensitivity + specificity - 1$

Specificity (SPC) / True negative rate (TNR): $specificity = \frac{TN}{TN + FP} = 1 - FPR$

False positive rate (FPR), fall-out: $FPR = 1 - \frac{TN}{TN + FP} = 1 - TNR$

False negative rate (FNR) / Miss rate: $FNR = \frac{FN}{TP + FN} = 1 - TPR$

F1-score: $F_{1} = 2 \times (\frac{precision \times recall}{precision + recall}) = \frac{2TP}{2TP+FP+FN}$

Matthews correlation coefficient: $MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

Threat score (TS) / Critical success index (CSI): $TS = \frac{TP}{TP+FN+FP}$

Log loss: $L_{log}(y,p) = -\log \Pr(y | p) = -(y\log(p) + (1-y)\log(1-p))$

F-beta score: $F_{\beta} = (1+\beta^2) \frac{precision \times recall}{\beta^2 precision + recall}$
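
Most of the scores above are simple ratios of the four confusion matrix counts. The following Python sketch computes several of them from raw counts. The function name and the counts are hypothetical, and the sketch assumes no denominator is zero.

```python
import math


def classification_scores(tp, tn, fp, fn):
    """Derive common classification scores from confusion matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)      # PPV
    recall = tp / (tp + fn)         # TPR, sensitivity
    specificity = tn / (tn + fp)    # TNR
    npv = tn / (tn + fn)
    return {
        "prevalence": (tp + fn) / total,
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "npv": npv,
        "informedness": recall + specificity - 1,
        "markedness": precision + npv - 1,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "threat_score": tp / (tp + fn + fp),
    }


scores = classification_scores(tp=40, tn=45, fp=5, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Computing precision and recall once and reusing them makes relationships such as $F_{1} = \frac{2TP}{2TP+FP+FN}$ easy to check against the precision/recall form.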

### No information rate

*no information rate* $= \frac{1}{C}$

To account for the frequency of the classes:

*no information rate* $= max(\frac{count(c)}{n})$

### Cohen's Kappa / Kappa statistic

$kappa = \frac{O - E}{1 - E}$

Where

- $O$ = observed accuracy
- $E$ = expected accuracy based on the confusion matrix's marginal totals
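
For a binary confusion matrix, the expected accuracy can be computed from the row and column totals. This Python sketch assumes the usual definition of $E$ (the agreement expected if predictions and actual labels were independent with the same marginal totals); the function name and counts are illustrative.

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Chance agreement: (predicted positive * actual positive
    #                    + predicted negative * actual negative) / n^2
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected)


print(cohens_kappa(tp=40, tn=45, fp=5, fn=10))
```

With these counts the observed accuracy is 0.85 and the expected accuracy is 0.5, so kappa is approximately 0.7.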

### Brier score

*brier score* $= \frac{1}{n} \sum{(f_{t} - o_{t})}^2$

Where:
- *n* = the total number of predictions
- $f_{t}$ = the predicted probability
- $o_{t}$ = the actual outcome
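
A minimal Python sketch of the Brier score for binary outcomes, assuming the actual outcomes are encoded as 0 or 1. The function name and sample values are made up for illustration.

```python
def brier_score(predicted, actual):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    n = len(predicted)
    return sum((f - o) ** 2 for f, o in zip(predicted, actual)) / n


print(brier_score([0.9, 0.2, 0.7], [1, 0, 0]))
```

A perfect forecaster scores 0, and larger values indicate predicted probabilities that sit further from the actual outcomes.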

### Hinge loss

$L_{Hinge}(y, w) = max(1 - wy, 0) = | 1 - wy |_{+}$

Where:
- *y* = true label, encoded as +1 or -1
- *w* = predicted decision value (the raw classifier score, not a probability)

### Hamming loss

$L_{hamming}(y, \hat{y}) = \frac{1}{n_{labels}} \sum{1(\hat{y_{i}} \neq y_{i})}$

Where
- $n_{labels}$ is the number of classes (or labels)

## Profit and cost equations

Profit: $profit = xTP - yFP - zFN$

### Probability cost function (PCF)

$PCF = \frac{P \times C(fn)}{P \times C(fn) + (1 - P) \times C(fp)}$

Where:

- *P* is the (prior) probability of the event (all positives)
  + I.E., *P* is the proportion of positives in the data
  + As such, 1 - *P* is the probability of a non-event, or the proportion of all negatives in the data
- *C(fn)* is the cost of a false negative (positive observation predicted as a negative)
- *C(fp)* is the cost of a false positive (negative observation predicted as a positive)

### Normalized expected cost (NEC)

$NEC = PCF \times (1-TP) + (1-PCF) \times FP$

$\frac{TP_{1} - TP_{2}}{FP_{1} - FP_{2}} = \frac{p(-)C(+|-)}{p(+)C(-|+)}$ = $\frac{p(a)C(b|a)}{p(b)C(a|b)}$

Where:
- $p(a)$: the probability of a given example being in class $a$
- $C(a|b)$: the cost incurred if an example in class $b$ is misclassified as being in class $a$
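
The PCF and NEC formulas can be sketched in Python as follows. Two assumptions are made here: PCF is written so that the false-negative term $P \times C(fn)$ appears in both the numerator and the denominator (which keeps the value between 0 and 1), and *TP* and *FP* in the NEC formula are treated as rates (true positive rate and false positive rate). The function names and numbers are hypothetical.

```python
def probability_cost(p_pos, cost_fn, cost_fp):
    """PCF: share of the total expected misclassification cost due to positives."""
    fn_term = p_pos * cost_fn          # P * C(fn)
    fp_term = (1 - p_pos) * cost_fp    # (1 - P) * C(fp)
    return fn_term / (fn_term + fp_term)


def normalized_expected_cost(pcf, tpr, fpr):
    """NEC = PCF * (1 - TPR) + (1 - PCF) * FPR."""
    return pcf * (1 - tpr) + (1 - pcf) * fpr


# 20% positives, and a false negative costs four times as much as a false positive
pcf = probability_cost(p_pos=0.2, cost_fn=4.0, cost_fp=1.0)
nec = normalized_expected_cost(pcf, tpr=0.8, fpr=0.1)
print(pcf, nec)
```

In this example the rarity of positives and their higher cost balance out, so PCF is approximately 0.5 and the two error rates are weighted equally in the NEC.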

## Regression

- $\hat{y} = w_{0} x_{0} + w_{1} x_{1} + \dotsb + w_{p} x_{p} + b$
- $\hat{y} = \theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \dotsb + \theta_{n} x_{n}$
- $Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dotsb + \beta_{p}X_{p} + \epsilon$

### Normal Equation

$\hat{\theta} = (X^{T} X)^{-1} X^{T} y$

Where:

- $\hat{\theta}$: theta array, the hypothesized weights
- $X$: input feature matrix
- $X^{T}$: the transpose of X
- $y$: array of target values
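
To make the Normal Equation concrete, here is a small Python sketch for the special case of one feature plus a bias column of 1s, where $X^{T}X$ is a 2x2 matrix that can be inverted by hand. The function name and data are assumptions for this example; a real implementation would typically use a linear algebra library instead.

```python
def normal_equation_1d(xs, ys):
    """Solve theta = (X^T X)^{-1} X^T y for X = [1, x] (bias column plus one feature)."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[m, sx], [sx, sxx]] and X^T y = [sy, sxy]
    det = m * sxx - sx * sx
    theta0 = (sxx * sy - sx * sxy) / det   # intercept
    theta1 = (m * sxy - sx * sy) / det     # slope
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 1 + 2x
print(normal_equation_1d(xs, ys))  # (1.0, 2.0)
```

Because the data lie exactly on a line, the closed-form solution recovers the intercept and slope directly, with no iteration or learning rate.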

### Gradient Descent

$\theta_{j} := \theta_{j} - \alpha \frac{1}{m}\displaystyle\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}$

Where:
- $\theta_{j}$: the weight for the specific feature being updated
- $:=$ is assignment (in Python it's like writing `=`, not `==`)
- $\alpha$: learning rate
- $m$: number of training examples
- $x^{(i)}$: the feature array of the *i*th observation
- $h_{\theta}(x^{(i)})$: returns the predicted value for the *i*th observation
- $y^{(i)}$: the observed value for the *i*th observation

A vectorized version as written in [Hands-On Machine Learning with Scikit-Learn](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646) is:

$\theta^{\text{next step}} = \theta - \eta \frac{2}{m}X^{T}(X \theta -y)$

Where
- $\theta$: the theta array
- $\eta$: the learning rate (previously notated as $\alpha$); the symbol is eta
- $X$: input feature matrix
- $X^{T}$: the transpose of $X$
- $y$: the array of target values

Gradient Descent = $\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)$

- The theta array ($\theta$)
- The learning rate ($\alpha$)
- The partial derivative ($\frac{\partial}{\partial \theta_{j}}$)
- The cost function ($J(\theta)$)

### Expanded Gradient Descent equation

This variation spells out all the different parts of the equation. When all the terms are explicitly spelled out, it becomes easier to see underlying similarities, like the one between MSE and the cost function portion of the update: $\frac{1}{m} \sum(\hat{y}_{i} - y_{i})^2 = \frac{1}{m} \sum(h_{\theta}(x_{i}) - y_{i})^2 \approx \frac{1}{m} \sum(h_{\theta}(x^{(i)}) - y^{(i)})$

$\theta_{j} := \theta_{j} - \alpha \frac{1}{m}\displaystyle\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}$

Where:

- $\theta_{j}$: the weight for the specific feature being updated
- $:=$ is assignment (in Python it's like writing `=` instead of `==`)
- $\alpha$: alpha, the learning rate or step size
- $m$: number of training examples
- $x^{(i)}$: the feature array of the *i*th observation
- $h_{\theta}(x^{(i)})$: returns the predicted value for the *i*th observation
  - $h_{\theta}(x)$: a function that outputs a predicted value
  - $h_{\theta}(x) = \hat{y} = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \dotsb + \theta_{n}x_{n} = \theta^{T} x$
  - For the line example in the Purpose section, the function would be $h_{\theta}(x) = \hat{y} = 0.88X + 0.03$
  - In Linear Regression $h_{\theta}(x)$ is assumed to be a linear function
- $y^{(i)}$: the observed value for the *i*th observation
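
The expanded update rule can be sketched as a short batch gradient descent loop in plain Python. This is an illustrative sketch under a few assumptions: each row of `X` starts with a constant 1 for the bias weight $\theta_{0}$, and the learning rate, iteration count, and data are arbitrary choices for this example.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x[0] is the constant 1 for the bias."""
    return sum(t * xj for t, xj in zip(theta, x))


def gradient_step(theta, X, y, alpha):
    """Apply theta_j := theta_j - alpha * (1/m) * sum((h(x_i) - y_i) * x_ij) to every j."""
    m = len(X)
    errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
    # Compute all partial derivatives first so every theta_j is updated simultaneously
    grads = [
        sum(e * xi[j] for e, xi in zip(errors, X)) / m
        for j in range(len(theta))
    ]
    return [t - alpha * g for t, g in zip(theta, grads)]


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]   # generated from y = 1 + 2x
theta = [0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)
print(theta)
```

After enough iterations, `theta` approaches the intercept 1 and slope 2 used to generate the data.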

### Simplified Gradient Descent equation

The simplified version just condenses some of the terms. This makes it easier to see how changes to the cost function affect the overall equation.

$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)$

Where:

- $\theta_{j}$: a specific value in the theta array
- $:=$ is assignment (in Python it's like writing `=`, not `==`)
- $\alpha$: alpha, the learning rate or step size
- $\frac{\partial}{\partial \theta_{j}}$: partial derivative, i.e. direction of change for $\theta$
- $J(\theta)$: cost function (also written as: $J(\theta_{0}, \theta_{1}, \dotsb, \theta_{n})$)
  - $\frac{1}{2m}\displaystyle\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$
    - Where
      - *m* is the number of training examples
      - $x^{(i)}$ is the feature array of the *i*th observation
      - $h_{\theta}(x^{(i)})$ returns the predicted value for the *i*th observation
      - $y^{(i)}$ is the observed value for the *i*th observation
    - So, if we're finding the *mean* squared error, why are we multiplying by $\frac{1}{2m}$ instead of $\frac{1}{m}$? According to Andrew Ng in the [Machine Learning course](https://www.coursera.org/learn/machine-learning), multiplying by $\frac{1}{2}$ makes the math easier
    - For more than one feature the partial derivative for the cost function changes slightly
        - $\frac{\partial}{\partial \theta_{j}} J(\theta) = \frac{1}{m}\displaystyle\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}$
          - Where the j in $\theta_{j}$ is the weight for a specific feature being updated
          - Note that the "squared" portion of the term has disappeared from the $J(\theta)$ portion of the equation and instead is multiplied by $x_{j}^{(i)}$
          - This is why the cost function portion of the full equation looks different from the straight cost function
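
One way to see why the $\frac{1}{2}$ is convenient: differentiating the squared term produces a factor of 2 that cancels it, leaving $\frac{1}{m}$ in the partial derivative. The Python sketch below checks this numerically by comparing the analytic partial derivative with a finite-difference estimate of $J(\theta)$. The function names, data, and step size `eps` are assumptions for this example.

```python
def h(theta, x):
    """Linear hypothesis theta^T x, where x[0] is the constant 1 for the bias."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(X)
    return sum((h(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def analytic_partial(theta, X, y, j):
    """(1/m) * sum((h(x_i) - y_i) * x_ij): the 2 from the square cancels the 1/2."""
    m = len(X)
    return sum((h(theta, xi) - yi) * xi[j] for xi, yi in zip(X, y)) / m


def numeric_partial(theta, X, y, j, eps=1e-6):
    """Central finite-difference estimate of dJ/dtheta_j."""
    up, down = list(theta), list(theta)
    up[j] += eps
    down[j] -= eps
    return (cost(up, X, y) - cost(down, X, y)) / (2 * eps)


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = [0.5, -1.0]
for j in range(len(theta)):
    print(j, analytic_partial(theta, X, y, j), numeric_partial(theta, X, y, j))
```

The two columns should agree to several decimal places, which confirms that the update term is the derivative of the $\frac{1}{2m}$ cost rather than the cost itself.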

### Vectorized Gradient Descent equation

The vectorized version can be applied to the full dataset at once instead of row by row. This variation is from *Hands-On Machine Learning with Scikit-Learn* instead of the *Machine Learning* course, so the notation is slightly different.

$\theta^{\text{next step}} = \theta - \eta \frac{2}{m}X^{T}(X \theta -y)$

Where

- $\theta$: the theta array
- $\eta$: eta, the learning rate (previously notated as $\alpha$)
- $X$: input feature matrix
- $X^{T}$: the transpose of $X$
- $y$: the array of target values

Making the notation consistent with the other two variations, the equation would look like this:

$\theta := \theta - \alpha \frac{2}{m}X^{T}(X \theta - y)$

Breaking out the pieces of the equation helps us see where the four main components are:

Gradient descent: $\theta^{\text{next step}} = \theta - \eta \nabla_{\theta} MSE(\theta)$
- Theta array
  - Still written as $\theta$
- Learning rate
  - Previously we had notated the learning rate as $\alpha$, whereas in this equation it's written as $\eta$
- Partial derivative of the cost function
  - $\frac{\partial}{\partial \theta_{j}} MSE(\theta) = \frac{2}{m} \displaystyle\sum_{i=1}^m (\theta^{T}x^{(i)} - {y}^{(i)})x_{j}^{(i)}$
  - Previously we had notated the cost function as $J(\theta)$, whereas in this equation it's written as $MSE(\theta)$
  - The vectorized version is notated as: $\nabla_{\theta} MSE(\theta) = \frac{2}{m} X^{T}(X \theta - y)$
- Cost function
  - $MSE(\theta) = \frac{1}{m} \displaystyle\sum_{i=1}^m (\theta^{T}x^{(i)} - {y}^{(i)})^{2}$
  - Previously we had notated the cost function as $J(\theta)$, whereas in this equation it's written as $MSE(\theta)$

### OLS linear regression - LASSO

- Cost function: $J(\theta) = MSE(\theta) + \alpha \displaystyle\sum_{i=1}^n |\theta_{i}|$
  - Where
    - MSE: mean squared error
    - $\alpha$: regularization term
    - $\theta$: the theta array
- Lasso regression subgradient vector: $g(\theta, J) = \nabla_{\theta} \mathrm{MSE}(\theta) + \alpha
  \begin{pmatrix}
    \mathrm{sign}(\theta_{1}) \\
    \mathrm{sign}(\theta_{2}) \\
    \vdots \\
    \mathrm{sign}(\theta_{n})
  \end{pmatrix}$
  - Where
    - $g(\theta, J)$: subgradient vector
    - $\nabla_{\theta}$: gradient operator with respect to $\theta$ (the symbol is called "nabla")
    - $\mathrm{MSE}$: mean squared error
    - $\alpha$: the regularization term
    - $\mathrm{sign}(\theta_{i}) = 
  \begin{cases}
    -1   & \mathrm{if } \; \theta_{i} < 0 \\
    \;\,\,0    & \mathrm{if } \; \theta_{i} = 0 \\
    +1   & \mathrm{if } \; \theta_{i} > 0
  \end{cases}$
  
### OLS linear regression - ridge

- Cost function: $J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \displaystyle\sum_{i=1}^n \theta_{i}^{2}$
  - Note: The bias term $\theta_{0}$ is not regularized as indicated by the start of $i$ at 1 instead of 0
  - Where
    - MSE: mean squared error
    - $\alpha$: regularization term
    - $\theta$: the theta array
- Vectorized ridge regression: $\hat{\theta} = (X^{T} X + \alpha A)^{-1} X^{T}y$
  - Where
    - $\theta$: the theta array
    - $X$: the feature matrix
    - $X^{T}$: the transpose of the feature matrix
    - $\alpha$: the regularization term
    - $A$: the identity matrix, except with the top-left cell being 0 (i.e., the cell for the bias term)
    - $y$: the target array
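
The closed-form ridge solution can be illustrated for one feature plus a bias column, where every matrix is 2x2. In this Python sketch, $A$ is taken to be `[[0, 0], [0, 1]]` so only the slope is penalized; the function name, data, and the value of `alpha` are assumptions for the example.

```python
def ridge_closed_form_1d(xs, ys, alpha):
    """Solve theta = (X^T X + alpha * A)^{-1} X^T y for X = [1, x], A = diag(0, 1)."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X + alpha * A = [[m, sx], [sx, sxx + alpha]]; the bias cell gets no penalty
    a, b, c, d = m, sx, sx, sxx + alpha
    det = a * d - b * c
    theta0 = (d * sy - b * sxy) / det
    theta1 = (a * sxy - c * sy) / det
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 1 + 2x
print(ridge_closed_form_1d(xs, ys, alpha=0.0))  # (1.0, 2.0)
print(ridge_closed_form_1d(xs, ys, alpha=5.0))  # (2.5, 1.0)
```

With `alpha=0.0` the result matches ordinary least squares. With `alpha=5.0` the slope shrinks from 2 to 1 and the unpenalized intercept moves to compensate.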
    
### Elastic Net

$J(\theta) = MSE(\theta) + r \alpha \displaystyle\sum_{i=1}^n |\theta_{i}| + \frac{1 - r}{2} \alpha \displaystyle\sum_{i=1}^n \theta_{i}^{2}$

Where

- $J(\theta)$: cost function
- MSE: mean squared error
- $r$: mix ratio
  - When r = 0, Elastic Net is the same as Ridge Regression
    - $r \alpha \displaystyle\sum_{i=1}^n |\theta_{i}|$ cancels to 0
  - When r = 1, Elastic Net is the same as Lasso Regression
    - $\frac{1 - r}{2} \alpha \displaystyle\sum_{i=1}^n \theta_{i}^{2}$ cancels to 0
- $\alpha$: regularization term
- $\theta$: the theta array

## L1 and L2 Regularization

- $\text{L1 norm} = ||w||_{1} = |w_{1}| + |w_{2}| + \dotsb + |w_{n}|$
- $\text{L2 norm} = ||w||_{2} = \sqrt{|w_{1}|^2 + |w_{2}|^2 + \dotsb + |w_{n}|^2}$
- $\text{Lp norm} = ||w||_{p} = \sqrt[p]{|w_{1}|^p + |w_{2}|^p + \dotsb + |w_{n}|^p}$
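
All three norms follow the same pattern, which a single Python function can express. The function name `lp_norm` and the sample vector are illustrative.

```python
def lp_norm(w, p):
    """(|w_1|^p + |w_2|^p + ... + |w_n|^p) ** (1/p)."""
    return sum(abs(wi) ** p for wi in w) ** (1 / p)


w = [3.0, -4.0]
print(lp_norm(w, 1))  # L1 norm: |3| + |-4|
print(lp_norm(w, 2))  # L2 norm: sqrt(9 + 16)
```

For this vector the L1 norm is 7 and the L2 norm is 5.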

#### Cost functions

- Base cost function:
  - $J(\theta) = MSE(\theta) = \frac{1}{m} \displaystyle\sum_{i=1}^m (\theta^{T}x^{(i)} - y^{(i)})^{2}$
- Cost function with L1 regularization:
  - $J(\theta) = MSE(\theta) + \alpha \displaystyle\sum_{i=1}^n |\theta_{i}|$
- Cost function with L2 regularization:
  - $J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \displaystyle\sum_{i=1}^n \theta_{i}^{2}$
  
For a model that outputs an estimated probability $\hat{p}$, such as logistic regression, it's easy to turn that probability into a classification prediction:

$\hat{y} = 
\begin{cases}
    0 \text{ if } \hat{p} < 0.5\\
    1 \text{ if } \hat{p} \geq 0.5
  \end{cases}$
  


## Logistic Regression

$\hat{y} = h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \dotsb + \theta_{n}x_{n} = \theta^{T} x$

Where

- $h_{\theta}$: prediction function, also called a *hypothesis*; $h_{\theta} = \theta^{T}x^{(i)}$
- $\theta$: vector (I.E., list) of weights, with the first value being the y-intercept type value (represented by *b* in *Introduction to Machine Learning*)
- $x$: the matrix of feature values (I.E., the DataFrame) with the first column ($x_{0}$, not listed in the equation) being all 1s so that $\theta_{0}$ is always evaluated as the same value

To make the linear regression equation into a logistic regression equation, a sigmoid function is added:

$h_{\theta}(x) = g(\theta^{T}x) = g(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-\theta^{T}x}}$

Where:
- $z = X \theta = \theta^{T}x$
  - (m x n) x (n x 1 ) = (m x 1)
- e = Euler's number (i.e., the base of natural logarithm)
  - The exact value is in the `math` library: `math.e`
  - To use it in this equation, `math.exp(-z)`
- g = sigmoid function
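
The sigmoid step can be sketched in Python with `math.exp`, followed by a 0.5 threshold to turn the probability into a class. The weights, inputs, and function names below are hypothetical, and each input `x` is assumed to start with a constant 1 for the bias weight.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))


def predict_proba(theta, x):
    """Estimated probability: sigmoid of theta^T x."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return sigmoid(z)


def predict_class(theta, x, threshold=0.5):
    """Return 1 if the estimated probability reaches the threshold, else 0."""
    return 1 if predict_proba(theta, x) >= threshold else 0


theta = [-1.0, 2.0]                         # hypothetical weights: bias, one feature
print(predict_proba(theta, [1.0, 0.5]))     # z = 0, so the probability is 0.5
print(predict_class(theta, [1.0, 0.5]))     # 1
print(predict_class(theta, [1.0, 0.0]))     # 0
```

The sigmoid maps any real-valued $z$ into the interval (0, 1), with $z = 0$ landing exactly on 0.5.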

## Decision Trees

#### `DecisionTreeClassifier`

- **Default**: gini impurity
  - From page 234 of *Machine Learning with Python Cookbook*
    - $G(t) = 1 - \displaystyle\sum_{i=1}^{c} P_{i}^2$
    - Where
      - $G(t)$: gini impurity at node $t$
      - $t$: a specific node
      - $c$: number of classes
      - $P_{i}$: proportion of observations of class $i$ at node $t$
  - From page 177 of *Hands-On Machine Learning*
    - $G_{i} = 1 - \displaystyle\sum_{k=1}^{n} P_{i,k}^2$
    - Where
      - $G_{i}$: gini impurity at node $i$
      - $i$: a specific node
      - $k$: class
      - $P_{i, k}$: the ratio of class $k$ instances among the training instances of node $i$
- **Alternate**: entropy
    - $H_{i} = - \displaystyle\sum_{\substack{k=1\\ P_{i, k} \neq 0}}^{n} P_{i, k} log_{2}(P_{i, k})$
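
Both impurity measures depend only on the class proportions at a node, which the following Python sketch makes explicit. The function names and labels are illustrative. Because `Counter` only reports classes that are actually present, the zero-proportion classes are skipped automatically, matching the $P_{i, k} \neq 0$ condition in the entropy sum.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of each class present among the labels at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Entropy: -sum(p_k * log2(p_k)) over classes present at the node."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


mixed = ["a", "a", "b", "b"]
pure = ["a", "a", "a", "a"]
print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

An evenly split two-class node has gini impurity 0.5 and entropy 1.0, while a pure node scores 0 on both measures.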

## Other

- density = $\frac{M}{v}$
- gravity = $\frac{GM}{r^2}$
- escape velocity = $\sqrt{\frac{2GM}{r}}$

Where:

- M = mass
- v = volume
- G = gravitational constant
- r = radius

## References / Examples

- [Entry 8: Centering and Scaling](https://github.com/julielinx/datascience_diaries/blob/master/01_ml_process/08_center_scale_and_latex.ipynb)
- [Entry 21: Scoring Regression Models - Theory](https://github.com/julielinx/datascience_diaries/blob/master/02_model_eval/21_reg_score_theory_and_latex.ipynb)
- [Entry 22: Scoring Regression models - Implementation](https://github.com/julielinx/datascience_diaries/blob/master/02_model_eval/22_reg_score_implement.ipynb)
- [Entry 23: Scoring Classification Models - Theory](https://github.com/julielinx/datascience_diaries/blob/master/02_model_eval/23_class_score_theory.ipynb)
- [Entry 24: Scoring Classification Models - Implementation](https://github.com/julielinx/datascience_diaries/blob/master/02_model_eval/24_class_score_implement.ipynb)
- [Entry 29: Thresholds - Profit and cost](https://github.com/julielinx/datascience_diaries/blob/master/02_model_eval/29_thresholds_profit_cost.ipynb)
- [Entry 35: Regression](https://github.com/julielinx/datascience_diaries/blob/master/03_supervised_learning/01_regression/35_regression.ipynb)
- [Entry 36: Ordinary Least Squares (OLS)](https://github.com/julielinx/datascience_diaries/blob/master/03_supervised_learning/01_regression/36_regression_OLS.ipynb)
- [Entry 37a: Normal Equation](https://github.com/julielinx/datascience_diaries/blob/master/03_supervised_learning/01_regression/37a_regression_normal_equation.ipynb)
- [Entry 37b: Gradient Descent](https://github.com/julielinx/datascience_diaries/blob/master/03_supervised_learning/01_regression/37b_regression_gradient_descent.ipynb)
- [Entry 38: Regularization](https://github.com/julielinx/datascience_diaries/blob/master/03_supervised_learning/01_regression/38_l1_l2_regularization.ipynb)
- [Entry 39: Lasso Regression](https://github.com/julielinx/datascience_diaries/blob/master/03_supervised_learning/01_regression/39_regression_lasso.ipynb)
- [Entry 40: Ridge Regression](https://github.com/julielinx/datascience_diaries/blob/master/03_supervised_learning/01_regression/40_regression_ridge.ipynb)
- [Entry 41: Elastic Net](https://github.com/julielinx/datascience_diaries/blob/master/03_supervised_learning/01_regression/41_regression_elasticnet.ipynb)
- [Entry 42: Logistic Regression](https://github.com/julielinx/datascience_diaries/blob/master/03_supervised_learning/01_regression/42_regression_logistic.ipynb)
- [Entry 48: Decision Tree Impurity Measures](https://github.com/julielinx/datascience_diaries/blob/master/03_supervised_learning/02_tree_based/48_trees_impurity.ipynb)