# ML

## Example Tasks $T$: 
- Regression
- Classification
- The above, but with missing values in the inputs
- Transcription
    - like classification, but the structures on which to perform classification must first be identified
- Translation
    - inputs are already structured, and need to be reformatted into alternative structures and vocabularies
- Structured output
    - outputs are multiple and tightly correlated
    - Transcription and Translation are cases of this class
- Anomaly detection
    - a little bit like classification
    - a little bit like regression
    - a lot like distribution theory and p-values
- Synthesis/Sampling
    - creation of new data
    - often structured output; but, we want lots of examples of outputs, not just a single best output
- Imputing missing values
    - often a regression or classification problem... a prediction problem, anyway
    - often involves *special* considerations
- Denoising
    - $Pr(x|\tilde x)$
- Density/PMF estimation
    - other tasks are theoretically driven by this
    - though they often do this only implicitly, via a sort of shortcut, or avoid it altogether, so they don't actually tackle it directly

## Performance Measures $P$: 
- Classification (with missingness) and translation/transcription: accuracy/error rate (0-1 loss)
    - scoring gets complicated when there's a lot happening, like in translation/transcription
- Regression: "should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes?" 
- distribution estimation: average log probability of examples (better models will overall assign higher log probabilities since they will more closely fit the example data)
    - sometimes it's computationally intractable to compute this, though, so an alternative is needed
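
As a minimal sketch of the measures above in plain Python: the helper names (`zero_one_error`, `mse`, `mae`, `avg_log_prob`) and the toy numbers are illustrative assumptions, not from any library. The two toy predictors have the same total absolute error, so the choice between absolute and squared error is exactly the "medium-sized versus very large mistakes" question.

```python
import math

def zero_one_error(y_true, y_pred):
    """Error rate: the average 0-1 loss over examples."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: penalizes rare, very large mistakes heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: penalizes mistakes in proportion to their size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def avg_log_prob(probs):
    """Average log probability a model assigns to the observed examples."""
    return sum(math.log(p) for p in probs) / len(probs)

y = [0.0, 0.0, 0.0, 0.0]
many_medium = [2.0, 2.0, 2.0, 2.0]   # frequent medium-sized mistakes
one_large = [0.0, 0.0, 0.0, 8.0]     # one rare, very large mistake

print(mae(y, many_medium), mae(y, one_large))   # 2.0 2.0
print(mse(y, many_medium), mse(y, one_large))   # 4.0 16.0
```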


## Experience $E$:

- unsupervised: learn $p(x)$ from $x$
    - density estimation
        - directly
    - implicitly:
        - sampling/synthesis, denoising
    - classification like with clustering
- supervised: usually interested in $p(y|x)$
    - label/target focussed
    - E.g., regression, but $\beta$'s are now called *weights*, so we write $\hat y=w^T x + b$ (for "bias" term $b$ which is not included via an "intercept" feature in the $x$ vector)

- They're not so different from a probability distributions perspective

    - semi-supervised: some $y$ missing
    - multi-instance learning: label *sets* of data points with a single $y$ label
        - e.g., present or not in the set?
- Reinforcement learning: adds an environment to interact with
    - we don't cover this
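
The supervised example $\hat y=w^T x + b$ can be sketched for a scalar $x$ with the closed-form least-squares fit. The function name `fit_linear` and the toy data are assumptions made for illustration only.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y_hat = w * x + b for scalar x."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sxy / sxx            # the "weight" (a beta in regression language)
    b = y_bar - w * x_bar    # the "bias" (the intercept)
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]    # generated by y = 2x + 1, no noise
w, b = fit_linear(xs, ys)
print(w, b)                  # 2.0 1.0
```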


## Capacity, Under/Overfitting

####  Generalization: ability to perform well on new, previously unseen data

- training error versus generalization/test error
    - generalization error: E(error on new data)
    
- $p_{data}$ and i.i.d. assumptions are implied

#### Model fitting process
- entails E[training error] $\le$ E[test error]
- task is then to both
    1. optimize Fit: E[training error] -> _underfitting is when this is poorly done_
        - struggling to fit well
    2. minimize Gap: E[test error]-E[training error] -> _overfitting is when this is poorly done_
        - memorizing
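
A small sketch of Fit versus Gap, under assumed toy conditions: $p_{data}$ is taken to be $y = 2x + 1$ plus Gaussian noise, and three models of increasing capacity are compared (a constant, a least-squares line, and 1-nearest-neighbor). All names here are hypothetical.

```python
import random

def mse(model, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def sample(m, rng):
    """Draw m i.i.d. points from a toy p_data: y = 2x + 1 + noise."""
    points = []
    for _ in range(m):
        x = rng.uniform(0.0, 1.0)
        points.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.5)))
    return points

rng = random.Random(0)
train, test = sample(50, rng), sample(2000, rng)

# Low capacity: always predict the training mean of y (underfits).
y_bar = sum(y for _, y in train) / len(train)
def constant(x):
    return y_bar

# Matching capacity: least-squares line through the training data.
x_bar = sum(x for x, _ in train) / len(train)
w = (sum((x - x_bar) * (y - y_bar) for x, y in train)
     / sum((x - x_bar) ** 2 for x, _ in train))
b = y_bar - w * x_bar
def linear(x):
    return w * x + b

# High capacity: 1-nearest-neighbor memorizes every training point.
def nearest(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

for name, model in [("constant", constant), ("linear", linear), ("1-NN", nearest)]:
    fit = mse(model, train)
    gap = mse(model, test) - fit
    print(f"{name:8s} train={fit:.3f} gap={gap:.3f}")
```

The constant model has the worst training error (poor Fit), while 1-NN has zero training error but the largest Gap.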

#### Capacity 
- means flexibility, complexity, expressiveness
- often these are hyperparameters
- if the model complexity -- i.e., capacity -- is appropriate for the actual complexity of the data at hand then they generally perform fairly well
    - representational capacity and effective capacity can differ due to
        - local optima
        - numerical optimization limitations
        - optimization algorithm imperfections, generally
    - regularization
        - penalty term to favor [_whatever_]
        - attempts/intends to reduce generalization error but not training error
        - regularization is rivaled in importance within ML only by optimization
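
One concrete penalty term is weight decay, $\lambda w^2$. A minimal sketch for a scalar weight with no bias term (an assumption chosen so the minimizer has a one-line closed form); `ridge_weight` and `train_sse` are hypothetical helper names.

```python
def ridge_weight(xs, ys, lam):
    """argmin_w sum_i (w * x_i - y_i)^2 + lam * w^2, for scalar x and no bias."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def train_sse(w, xs, ys):
    """Unpenalized training error (sum of squared errors)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
for lam in (0.0, 1.0, 10.0):
    w = ridge_weight(xs, ys, lam)
    # Larger lam shrinks w toward 0 and raises the training error.
    print(lam, round(w, 3), round(train_sse(w, xs, ys), 3))
```

The penalty never lowers training error; the hope is that the preference for small weights lowers generalization error instead.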

#### Estimation

- $Bias(\hat \theta_m) = E[\hat \theta_m - \theta]$, asymptotically unbiased if this approaches 0 as $m$ grows
- e.g., the sample variance with denominator $m$ is biased:
\begin{align*}
E\left[\frac{\sum_{i=1}^m (x_i - \bar x)^2}{m} \right] &=
E\left[\frac{\sum_{i=1}^m ((x_i - \mu) - (\bar x - \mu))^2}{m} \right] \\
&= E\left[\frac{\sum_{i=1}^m \left( (x_i - \mu)^2 - 2(x_i - \mu)(\bar x - \mu) + (\bar x - \mu)^2 \right) }{m} \right] \\
&= E\left[\frac{\sum_{i=1}^m (x_i - \mu)^2 - 2m(\bar x - \mu)(\bar x - \mu) + m(\bar x - \mu)^2}{m} \right] \\
&= E\left[\frac{\sum_{i=1}^m (x_i - \mu)^2 - m(\bar x - \mu)^2}{m} \right] \\
&= \frac{\sum_{i=1}^m E[(x_i - \mu)^2] - m E[(\bar x - \mu)^2]}{m} \\
&=  E[(x_i - \mu)^2] - E[(\bar x - \mu)^2] \\
&= \sigma^2 - \frac{\sigma^2}{m} \\
&= \frac{m-1}{m} \sigma^2
\end{align*}

So multiplying everything through by $\frac{m}{m-1}$ gives the unbiased estimate: 
- the denominator becomes $m-1$, i.e., $\frac{\sum_{i=1}^m (x_i - \bar x)^2}{m-1}$.
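
The $\frac{m-1}{m}\sigma^2$ result can be checked exactly, without simulation, by enumerating every equally likely sample. This sketch assumes fair-coin data ($\mu = 0.5$, $\sigma^2 = 0.25$) and $m = 3$; both choices are arbitrary.

```python
from itertools import product

def biased_var(xs):
    """Sample variance with denominator m."""
    m = len(xs)
    x_bar = sum(xs) / m
    return sum((x - x_bar) ** 2 for x in xs) / m

m = 3
sigma2 = 0.25                                   # variance of a fair coin
samples = list(product((0.0, 1.0), repeat=m))   # all 2^m equally likely samples

expected_biased = sum(biased_var(s) for s in samples) / len(samples)
expected_unbiased = expected_biased * m / (m - 1)

print(expected_biased, (m - 1) / m * sigma2)    # both equal 1/6 up to rounding
print(expected_unbiased, sigma2)                # both equal 1/4 up to rounding
```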

#### MSE

\begin{align*}
MSE &= E_{\hat \theta}\left[(\hat \theta - \theta)^2\right]\\
&= E_{\hat \theta}\left[\left((\hat \theta - E[\hat \theta]) + (E[\hat \theta] - \theta)\right)^2\right]\\
&= E_{\hat \theta}\left[(\hat \theta - E[\hat \theta])^2\right] + 2 E_{\hat \theta}\left[\hat \theta - E[\hat \theta]\right](E[\hat \theta]-\theta) + (E[\hat \theta]-\theta)^2\\
&= E_{\hat \theta}\left[(\hat \theta - E[\hat \theta])^2\right] + 0 + (E[\hat \theta]-\theta)^2\\
&= Var[\hat \theta ] + Bias(\hat \theta)^2
\end{align*}
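
The decomposition can be verified exactly on a small case. This sketch assumes fair-coin data with true mean $\theta = 0.5$ and uses a deliberately biased, hypothetical estimator $\hat\theta = \sum_i x_i / (m+1)$.

```python
from itertools import product

m = 4
theta = 0.5                                      # true mean of a fair coin
samples = list(product((0.0, 1.0), repeat=m))    # all equally likely samples

def estimator(xs):
    """A deliberately biased estimator of the mean (shrinks toward 0)."""
    return sum(xs) / (len(xs) + 1)

estimates = [estimator(s) for s in samples]
mean_est = sum(estimates) / len(estimates)       # E[theta_hat]
variance = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
bias = mean_est - theta
mse = sum((e - theta) ** 2 for e in estimates) / len(estimates)

print(mse, variance + bias ** 2)                 # equal up to rounding
```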

#### (Weak) Consistency

- For any $\epsilon>0$, $Pr(|\hat \theta_m - \theta| > \epsilon) \rightarrow 0$ as $m \rightarrow \infty$
    - Almost surely (strong consistency) if $Pr\left(\lim_{m \rightarrow \infty} \hat \theta_{m} = \theta\right) = 1$
- both imply that the bias diminishes as the amount of data $m$ grows
- but an unbiased estimator is not necessarily getting closer and closer to the true parameter
    - consistency means its spread around the true parameter keeps shrinking as the amount of data $m$ increases. 
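
As a sketch of weak consistency, the probability $Pr(|\hat \theta_m - \theta| > \epsilon)$ can be computed exactly when $\hat \theta_m$ is the sample mean of $m$ fair-coin flips. The coin, the value $\epsilon = 0.1$, and the name `prob_far` are assumptions for illustration.

```python
from fractions import Fraction
from math import comb

def prob_far(m, eps=Fraction(1, 10)):
    """Exact Pr(|x_bar - 1/2| > eps) for the mean of m fair-coin flips."""
    half = Fraction(1, 2)
    # Count the outcomes whose sample mean k/m lies more than eps from 1/2.
    far = sum(comb(m, k) for k in range(m + 1)
              if abs(Fraction(k, m) - half) > eps)
    return far / 2 ** m

for m in (10, 100, 1000):
    print(m, prob_far(m))   # shrinks toward 0 as m grows
```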
    
#### MLE

- best estimator asymptotically (in terms of convergence rate) as $m$ increases
    - has (weak) consistency if (a) $p_{data}$ is in the $p_{model}$ model family and (b) $p_{data}$ corresponds to exactly one member (one value of $\theta$) of the $p_{model}$ model family
    - no other consistent estimator is more efficient (smaller MSE) for large $m$ (by the  Cramér-Rao lower bound)
- $p_{data}(\mathbf{x})$ and its approximate model $\prod p_{model}(\mathbf{x},\theta)$
- $\underset{\theta}{argmax} \prod_i p_{model}(\mathbf{x_i},\theta) = \underset{\theta}{argmax} \sum_i \log p_{model}(\mathbf{x_i},\theta)$ (for underflow avoidance purposes)
- minimizes $KL[\hat p_{data}|| p_{model}] = E_{\mathbf{x} \sim \hat p_{data}}[\log \hat p_{data}(\mathbf{x}) - \log p_{model}(\mathbf{x},\theta)]$
    - i.e., minimizes $- E_{\mathbf{x} \sim \hat p_{data}}[\log p_{model}(\mathbf{x},\theta)]$ (the $\log \hat p_{data}$ term does not depend on $\theta$), i.e., maximizes $E_{\mathbf{x} \sim \hat p_{data}}[\log p_{model}(\mathbf{x},\theta)]$

        
- _cross-entropy_: any negative log-likelihood loss
    - but usually used to refer to the negative log-likelihood of a Bernoulli or softmax distribution
    - nonetheless, MSE *IS* the cross-entropy between the empirical distribution and a Gaussian model, i.e., the negative log-likelihood of a Gaussian model (up to scaling and constants that do not depend on the parameters)
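
Two of the points above can be sketched in plain Python: why the log of the likelihood is used (underflow), and why minimizing a Gaussian negative log-likelihood is the same as minimizing squared error. The data, the grid of candidate means, and the fixed $\sigma = 1$ are assumptions for illustration.

```python
import math

# Underflow: a product of many small probabilities collapses to 0.0,
# while the sum of their logs stays well within floating-point range.
probs = [1e-3] * 200
print(math.prod(probs))                    # 0.0
print(sum(math.log(p) for p in probs))     # about -1381.55

def gaussian_nll(xs, mu, sigma=1.0):
    """Negative log-likelihood of i.i.d. data under N(mu, sigma^2)."""
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def sse(xs, mu):
    """Sum of squared errors around a candidate mean."""
    return sum((x - mu) ** 2 for x in xs)

xs = [1.0, 2.0, 4.0, 5.0]
grid = [i / 10 for i in range(0, 61)]      # candidate values of mu
mu_nll = min(grid, key=lambda mu: gaussian_nll(xs, mu))
mu_sse = min(grid, key=lambda mu: sse(xs, mu))
print(mu_nll, mu_sse, sum(xs) / len(xs))   # 3.0 3.0 3.0
```

With $\sigma$ fixed, the two objectives differ only by a positive scale factor and an additive constant, so they share a minimizer (here the sample mean).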
    
   

#### Bayes 

- averages over uncertainty... so doesn't chase optima... and instead chases generalizability
- adds a prior
    - can regularize
    - can add in information




Bayesian estimation tends to generalize better than MLE because
1. it averages over all possible models based on their uncertainty
2. MLE estimates tend to suffer from statistical efficiency problems
3. consistency is more challenging in MLE relative to Bayesian analysis
4. a prior provides regularization which helps improve generalization

Which of the following are not regularization specifications
1. $\underset{\theta}{argmax} \log p(\mathbf{x}|\theta) + \log p(\theta)$
2. $\underset{w}{argmin} \sum_i(\mathbf{w}^T\mathbf{x}_i-y_i)^2 + \lambda \mathbf{w}^T\mathbf{w}$
3. $\underset{w}{argmin} ||\mathbf{X}\mathbf{w}-\mathbf{y}||^2_2 + \lambda ||\mathbf{w}||^2_2$
4. $\hat y = \mathbf{w}^T\mathbf{x}_i + b$ 

A "broad", "wide", or "uninformative" prior
1. has high entropy
2. has low entropy
3. expresses no preference about parameters

The most appropriate definition of _cross-entropy_:

1. any loss involving the negative log likelihood of the empirical data distribution
    - i.e., the cross-entropy between the model and empirical distribution
2. negative log-likelihood of a Bernoulli or softmax distribution
3. the Kullback-Leibler divergence $KL[p_{data}|| p_{model}]$
4. Maximum likelihood estimation $\underset{\theta}{argmax} \sum_i \log p_{model}(\mathbf{x_i},\theta)$ 


Both $- \sum_i  \log p_{model}(\mathbf{x_i},\theta)$ and $KL[\hat p_{data}|| p_{model}]$ ***are equivalently minimized***, and have bounds

1. $(-\infty,\infty)$
2. $(-\infty,\infty)$ and $[0,\infty)$
3. $(-\infty,\infty)$ and $(-\infty,0]$
4. $[0,\infty)$ and $(-\infty,0]$
5. $(-\infty,0]$
6. $[0,\infty)$


$KL[\hat p_{data}|| p_{model}], \; \prod_i p_{model}(\mathbf{x_i},\theta), \;$ and _cross-entropy_

1. address different classes of problems
2. refer to the same objective function 
3. have equivalent optimization solutions



The estimation in Bayesian statistics is
1. a probability distribution
2. a point estimate
3. better than an MLE estimate
    - not according to the Cramér-Rao lower bound
    - not according to computational tractability (in many "large $m$" contexts)
    - but *does* generally generalize better
        - because Bayesian estimation incorporates uncertainty directly into its calculation...
        - so if it doesn't know something well, it doesn't act as if it does just because that value is the single best choice among the alternatives
4. more subjective than MLE estimation
    - some say the prior is subjective
    - but so is the choice of likelihood
    - and hence the choice of your Maximum Likelihood estimator
5. has an expected value that must be biased if the alternative MLE estimator is unbiased
    - no: an improper prior could be used

Maximum Likelihood Estimation 

1. does not lend itself to cost function minimization
2. uses neither an objective function nor a cost function
3. is formal statistical methodology with extremely close ties to ML
4. is _never_ used by any self respecting "Bayesian" statistician

Bayesian analysis does not
1. allow information outside of the data to be included into the analysis
2. provide a mechanism to construct parameter regularization strategies 
3. attempt to remove bias from the parameter estimation process, generally
4. use probability as a language to make belief statements on parameters

Which concluding statement is false?  Logistic and Linear Regression
1. are both based on probability models
2. can both be optimized analytically
3. make their predictions using linear models
4. predict different kinds of response variables






The *support vector machine* (SVM)
1. has the form $\mathbf{w}^T\mathbf{x}+b$
2. which predicts based on $sign(\mathbf{w}^T\mathbf{x}+b)$
3. where $\mathbf{w}=\sum_i \alpha_i \mathbf{x}_i$
4. so $\mathbf{w}^T\mathbf{x}+b = b + \sum_i \alpha_i \mathbf{x}^T\mathbf{x}_i$
5. or, $b + \sum_i \alpha_i \phi(\mathbf{x})\cdot\phi(\mathbf{x}_i)$ for any space transform $\phi$ and inner product therein
  1. e.g., vector inner product $\phi(\mathbf{x})^T\phi(\mathbf{x}_i)$
6. or, $b + \sum_i \alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$ for any kernel $\kappa$ which calculates the inner product under the $\phi$ transform
  1. $\phi$ could map into an infinite dimensional space...
 
- the *kernel trick* 
    - provides highly non-linear models of x
        - (which *are*, of course, linear in the $\phi$ space)
        - (but the $\phi$ space can be a highly non-linear transform of $x$)
    - is often extremely computationally efficient, and optimization for $\boldsymbol{\alpha}$ turns out to be a quite straightforward convex optimization problem, as, given the kernel, the model is linear in the $\boldsymbol{\alpha}$
    
    
- Radial Basis Function (RBF) 
    - $\kappa(\mathbf{u},\mathbf{v}) = N(\mathbf{u}-\mathbf{v};\mathbf{0}, \sigma^2I)$
    - functions like template matching 
        - the better the match, the more weight the matching training example's $y$ value is given

- training examples $\mathbf{x}_i$ with non-zero $\alpha_i$ are called support vectors.  If most $\alpha_i$ are zero, the evaluation of $b + \sum_i \alpha_i \mathbf{x}^T\mathbf{x}_i$ is cheaper
- computational cost of optimization is also quite steep for SVMs...
- kernel methods/machines use the kernel trick
    - but generic kernels seem to have trouble generalizing super well
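
A sketch of the kernel trick and the kernelized decision function, under stated assumptions: a degree-2 polynomial kernel in 2-D (chosen because its feature map $\phi$ is small enough to write out), an RBF kernel with the Gaussian normalizing constant dropped, and hand-picked hypothetical $\alpha_i$ rather than values obtained by actually training an SVM.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(u):
    """Explicit feature map whose inner product equals (u . v)^2 in 2-D."""
    return (u[0] ** 2, math.sqrt(2.0) * u[0] * u[1], u[1] ** 2)

def poly_kernel(u, v):
    """Same inner product as dot(phi(u), phi(v)), without building phi."""
    return dot(u, v) ** 2

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian RBF; the normalizing constant is dropped (an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision(x, train_x, alpha, b, kernel):
    """f(x) = b + sum_i alpha_i * kernel(x, x_i); the class is sign(f(x))."""
    # Terms with alpha_i == 0 are skipped: only support vectors cost anything.
    return b + sum(a * kernel(x, xi) for a, xi in zip(alpha, train_x) if a != 0.0)

u, v = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(u, v), dot(phi(u), phi(v)))   # both 121 up to rounding

# Hypothetical alphas: two support vectors, one training point with zero weight.
train_x = [(0.0, 0.0), (2.0, 2.0), (5.0, 5.0)]
alpha = [-1.0, 1.0, 0.0]
print(decision((1.9, 2.1), train_x, alpha, 0.0, rbf_kernel) > 0)   # True
```

The query point sits almost on top of the template $(2, 2)$, so that template's weight dominates the sum, which is the template-matching behavior of the RBF kernel.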

#### others
- KNN: cannot distinguish when one feature is more important
    - at most $m$ "smoothness" areas
- DT: struggles when, e.g., $x_1>x_2$ means class 1 (the decision boundary is not axis-aligned)
    - at most $m$ "smoothness" areas


### Unsupervised -- no labeling effort
- density estimation
- learning to sample from a distribution
- learning to denoise
- finding manifolds where data resides
- grouping data into clusters
  
- "find 'best' representation of data"
    - lower dimensional representation
    - sparse representation
    - independent representations (dis-entangle contributing factors)
    
- PCA
    - lower dimensional representation and uncorrelated (so, *partially* independent)
- K-means
    - sparse one-hot-encoding representation and dimensionality reduction
    - what level of grouping hierarchy is appropriate?
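
Both representations can be sketched with the standard library alone, under simplifying assumptions: PCA is done in 2-D, where the principal axis has a closed-form angle, and K-means is done in 1-D with fixed, hypothetical initial centroids. The toy data are made up for illustration.

```python
import math

# --- PCA in 2-D: rotate centered data onto its principal axes ---
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
m = len(points)
mx = sum(x for x, _ in points) / m
my = sum(y for _, y in points) / m
centered = [(x - mx, y - my) for x, y in points]

def cov(a, b):
    """Covariance of two already-centered coordinate lists."""
    return sum(p * q for p, q in zip(a, b)) / m

xs = [x for x, _ in centered]
ys = [y for _, y in centered]
# Angle of the leading principal axis (closed form for a 2x2 covariance).
theta = 0.5 * math.atan2(2.0 * cov(xs, ys), cov(xs, xs) - cov(ys, ys))
z1 = [math.cos(theta) * x + math.sin(theta) * y for x, y in centered]
z2 = [-math.sin(theta) * x + math.cos(theta) * y for x, y in centered]
print(cov(z1, z2))                 # ~0: the new coordinates are uncorrelated
print(cov(z1, z1), cov(z2, z2))    # most variance is in z1, so z2 can be dropped

# --- K-means in 1-D (Lloyd's algorithm) with a one-hot code per point ---
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids = [0.0, 6.0]             # hypothetical initial guesses
for _ in range(10):
    # Assign each point to its nearest centroid, then recompute the centroids.
    labels = [min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
              for x in data]
    centroids = [sum(x for x, l in zip(data, labels) if l == k) / labels.count(k)
                 for k in range(len(centroids))]
one_hot = [[int(l == k) for k in range(len(centroids))] for l in labels]
print(one_hot)
```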

## Deep Learning
1. provides better generalization based on medium-sized data sets
2. provides a scalable approach to nonlinear models on large data sets

VIA:

### SGD (Stochastic Gradient Descent)
- GD was traditionally regarded as foolhardy or unprincipled for non-convex optimization
- but it works really quite well for ML models
- even when it doesn't arrive at a local minimum, it usually reaches a quite low value of the cost function fairly quickly
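
A minimal minibatch SGD sketch, assuming a convex toy problem (noise-free data from $y = 2x + 1$, a linear model, and a squared-error cost) so the result is easy to check. The learning rate, batch size, and step count are arbitrary choices, not recommendations.

```python
import random

rng = random.Random(0)
# Data: 200 noise-free points from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (rng.uniform(0.0, 1.0) for _ in range(200))]

w, b = 0.0, 0.0                 # model: y_hat = w * x + b
lr, batch_size = 0.1, 8
for step in range(3000):
    batch = rng.sample(data, batch_size)
    # Gradient of the minibatch mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))   # close to 2 and 1
```

Each step only looks at 8 examples, so the cost per update does not grow with the size of the data set.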


#### ML Algorithms
1. data
2. model
3. cost function
4. optimization algorithm

- smoothness prior *is not enough*!
- need to be able to generalize nonlocally...
    - composition of factors is one such assumption/approach
    - they provide exponential gains that can counteract the curse of dimensionality
    - manifold learning algorithms also reduce dimensions under consideration
        1. concentrated probability densities are a prerequisite
        2. and if local transformations seem plausible then maybe so!
        - representing in manifold coordinates facilitates the use of ML tools on manifolds

