# Overview

## Basic Concepts

### What is Machine Learning
- **Machine Learning:** a field of study that gives computers the ability to learn from data without being explicitly programmed (the patterns in the data are not specified by hand)
- Three pillars of ML:
    - Supervised Learning
    - Unsupervised Learning
    - Reinforcement Learning

### Supervised Learning
- Learning input-output mapping based on a dataset with features and labelled output
- Example: using a linear regression model to predict outcomes
- Difference between ML and econometrics:
    - Econometrics:
        - Builds models from mathematics and explicit assumptions
        - This results in good interpretability and a precise expectation of what the model can do, but weaker prediction
    - Machine learning:
        - Builds models from an engineering perspective
        - Finds the best model through trial and error
        - Often without an exact understanding of why the model performs well on a specific task
        - Weak model interpretability
- Regression and classification:
    - **Regression:** the outcome of prediction is continuous
    - **Classification:** the outcome of prediction is discrete
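
As a minimal sketch of the supervised setting, the example below fits a one-feature linear regression by ordinary least squares and then thresholds the same prediction to produce a discrete outcome. The dataset, the helper name `fit_line`, and the threshold of 5.0 are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical labelled dataset: features xs, labels ys (here y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Regression: the outcome is continuous."""
    return slope * x + intercept


def classify(x, threshold=5.0):
    """Classification: the outcome is discrete (0 or 1)."""
    return int(predict(x) > threshold)


print(predict(5.0), classify(5.0), classify(1.0))
```

The learned mapping is the pair (`slope`, `intercept`); the labels are only needed while fitting, not when predicting.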

### Unsupervised Learning
- Unsupervised learning applies to datasets with input features but no labelled outputs
- This means that unsupervised learning can only learn the distribution of the input data (for tasks such as clustering)
- The learned distribution can then be used for generation tasks (generative unsupervised models)
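
To make the clustering case concrete, here is a small sketch of k-means on one-dimensional data. k-means is not named in these notes; it is used only as one familiar clustering method, and the data points and initial centers are invented for the example.

```python
def kmeans_1d(points, centers, n_iter=10):
    """Plain k-means on 1-D data; `centers` is the initial guess."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical unlabelled inputs: two visible groups, no labels given.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])
print(centers, clusters)
```

No labels appear anywhere: the grouping comes only from where the inputs sit.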

### Reinforcement Learning
- The model interacts with an environment
- It observes the state of the environment (as input)
- It learns the optimal action policy through interaction with the environment:
    - Observe the state of the environment
    - Choose an action according to a policy
    - Execute the action
    - Observe the reward and the change in state
    - Update the policy based on the reward signal
    - Repeat the steps above
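
The loop above can be sketched with tabular Q-learning on a toy problem. Both choices are assumptions made for illustration: the notes do not name a specific algorithm, and the corridor environment (`step`), the hyperparameters, and the seed are hypothetical.

```python
import random

N_STATES = 4        # corridor cells 0..3; cell 3 is the terminal goal
ACTIONS = [0, 1]    # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False                  # observe the initial state
        while not done:
            # Choose an action with an epsilon-greedy policy.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            # Execute it, then observe the reward and the new state.
            next_state, reward, done = step(state, action)
            # Update the value estimates, and with them the greedy policy.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = train()
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

Here the policy is implicit: it is the action with the highest estimated value in each state, so updating `q` is what updates the policy.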

## Probability Fundamentals

### Joint and Marginal Distributions
- Joint distribution:
    - A function that maps a vector of values to the probability (density) of that vector being realized
    - $Pr(x, y)$
- Marginal distribution:
    - Captures the probability of one variable taking a value regardless of the value the other one takes
    - Marginalization:
        - If we know the joint distribution $Pr(x, y)$ over two variables, we can recover the marginal distributions by integrating out the other variable
        - $Pr(y) = \int Pr(x, y)\, dx$
        - $Pr(x) = \int Pr(x, y)\, dy$
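
A discrete analogue makes marginalization easy to check by hand: the integrals become sums over rows or columns of a table. The joint table below is a made-up example.

```python
# Hypothetical joint distribution Pr(x, y) with x in {0, 1} and y in {0, 1, 2},
# stored as joint[x][y]. The entries sum to one.
joint = [
    [0.10, 0.20, 0.10],
    [0.30, 0.05, 0.25],
]

# Pr(x): sum over y, i.e. over each row.
pr_x = [sum(row) for row in joint]

# Pr(y): sum over x, i.e. over each column.
pr_y = [sum(col) for col in zip(*joint)]

print(pr_x, pr_y)
```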

### Conditional Distribution
- Conditional probability $Pr(x|y)$ is the probability of variable x taking a certain value, given the value of y
- $$Pr(x|y) = \frac{Pr(x, y)}{Pr(y)} = \frac{Pr(x, y)}{\int Pr(x, y)\, dx}$$
- Intuition:
    - The conditional distribution $Pr(x|y)$ can be found by taking a slice through the joint distribution $Pr(x, y)$ for a fixed y
    - This slice is then divided by the probability of that value y occurring (the total area under the slice)
    - This normalization ensures that the conditional distribution integrates to one
- Bayes' rule:
    - $$ Pr(x|y) = \frac{Pr(x, y)}{Pr(y)} = \frac{Pr(y|x)Pr(x)}{Pr(y)}$$
    - $$ Pr(x|y)Pr(y) = Pr(y|x)Pr(x)$$
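
The slice-and-normalize picture and Bayes' rule can both be checked on a small discrete table, where sums replace the integrals. The table and the helper names below are hypothetical.

```python
# Hypothetical joint distribution Pr(x, y), stored as joint[x][y].
joint = [
    [0.10, 0.20, 0.10],
    [0.30, 0.05, 0.25],
]
pr_x = [sum(row) for row in joint]
pr_y = [sum(col) for col in zip(*joint)]


def cond_x_given_y(y):
    """Pr(x | y): take the slice of the joint at y, then divide by Pr(y)."""
    return [joint[x][y] / pr_y[y] for x in range(len(joint))]


def cond_y_given_x(x):
    """Pr(y | x): take the slice of the joint at x, then divide by Pr(x)."""
    return [joint[x][y] / pr_x[x] for y in range(len(joint[0]))]


def bayes_x_given_y(x, y):
    """Bayes' rule: Pr(x | y) = Pr(y | x) Pr(x) / Pr(y)."""
    return cond_y_given_x(x)[y] * pr_x[x] / pr_y[y]


print(cond_x_given_y(0), bayes_x_given_y(1, 0))
```

Both routes give the same number because each is just the joint entry divided by `pr_y[y]`.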

### Bivariate Gaussian
- The joint distribution function: $$
P(x; \mu, \Sigma) \propto 
\exp\!\left[
-\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu)
\right] $$
- The distribution is over $x$ (the vector of quantities being modelled) and is characterized by $\mu$ (mean vector) and $\Sigma$ (covariance matrix)
- The covariance matrix can take spherical (a multiple of the identity matrix), diagonal, or full (non-diagonal) form:
    - Spherical: isocontours (contours) of the distribution are circles
    - Diagonal: isocontours (contours) of the distribution are axis-aligned ellipses
    - Full: isocontours (contours) are general ellipses
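
The effect of the three covariance forms can be seen by evaluating the density at a few points. This is a sketch under stated assumptions: the example matrices are invented, and the code includes the standard two-dimensional normalizing constant $1 / (2\pi\sqrt{|\Sigma|})$, which the proportional form above leaves out.

```python
import math


def gaussian_2d_pdf(x, mu, sigma):
    """Density of a bivariate Gaussian; sigma is a 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


mu = [0.0, 0.0]
spherical = [[1.0, 0.0], [0.0, 1.0]]   # multiple of the identity
diagonal = [[4.0, 0.0], [0.0, 1.0]]    # different variance per axis
full = [[2.0, 1.0], [1.0, 2.0]]        # non-zero covariance

# Spherical: points at the same distance from the mean have the same density.
print(gaussian_2d_pdf([1.0, 0.0], mu, spherical),
      gaussian_2d_pdf([0.0, 1.0], mu, spherical))
# Diagonal: the contour is stretched along the axis with larger variance.
print(gaussian_2d_pdf([2.0, 0.0], mu, diagonal),
      gaussian_2d_pdf([0.0, 1.0], mu, diagonal))
# Full: the density depends on direction relative to the tilted ellipse.
print(gaussian_2d_pdf([1.0, 1.0], mu, full),
      gaussian_2d_pdf([1.0, -1.0], mu, full))
```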

## Information Fundamentals

### Kullback-Leibler Divergence
- Often referred to as a "distance" between probability distributions $p(x)$ and $q(x)$
- It measures how similar two distributions are; however, it is not a true distance because it is not symmetric (the KL divergence of Q from P is generally not the same as that of P from Q)
- KL divergence of Q from P (or relative entropy of P with respect to Q):
$$
D_{\mathrm{KL}}(P \,\|\, Q)
=
\int_{-\infty}^{\infty}
p(x)\,
\log\!\left(
\frac{p(x)}{q(x)}
\right)
\, dx
$$
- Properties:
    - If Q is exactly the same as P: the likelihood ratio is 1 everywhere, the log-likelihood ratio is 0, so the integral, and therefore the KL divergence, is 0
    - If Q differs from P: the log-likelihood ratio is positive in some regions and negative in others, but its average under P is strictly positive, so the divergence is > 0
- In Bayesian inference, for instance, it can measure the information gained in moving from a prior Q to a posterior P (if Q is far from P, the information gain is high)
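
The properties above can be checked numerically on discrete distributions, where the integral becomes a sum. The two distributions below are arbitrary examples, and the result is in nats because the natural logarithm is used.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as lists.

    Terms with p(x) = 0 contribute 0. Assumes q(x) > 0 wherever p(x) > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

print(kl_divergence(p, p))   # identical distributions
print(kl_divergence(p, q))   # Q from P
print(kl_divergence(q, p))   # P from Q: a different value, so not symmetric
```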

### Shannon Entropy
- Shannon entropy is a measure of the uncertainty of a distribution (related to, but not the same as, its variance)
- Intuition:
    - How many bits, on average, are needed to remove uncertainty from a distribution?
    - Equivalently, what is the minimum number of yes/no questions needed to determine what a person is thinking (x)?
    - Suppose x is drawn from a known distribution; the entropy measures how many questions we need to ask, on average, to determine x
    - For example, if x is drawn from a Bernoulli distribution, we only need to ask one question: is it 1?
- Definition:
    - In the discrete case:
    $$ 
    H(X)
    = - \sum_x p(x)\,\log p(x)
    $$
    - In the continuous case (differential entropy):
    $$ 
    h(X)
    = - \int_{-\infty}^{\infty} p(x)\,\log p(x)\, dx
    $$
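
The discrete definition translates directly into code. The sketch below uses base-2 logarithms so the result reads as bits, matching the yes/no question intuition; the example distributions are chosen for illustration.

```python
import math


def entropy_bits(p):
    """Shannon entropy H(X) in bits; outcomes with p(x) = 0 contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


fair_coin = [0.5, 0.5]       # one yes/no question settles it
biased_coin = [0.9, 0.1]     # less uncertain than the fair coin
certain = [1.0, 0.0]         # nothing left to ask
uniform_8 = [1 / 8] * 8      # three yes/no questions (halve the options each time)

for name, dist in [("fair", fair_coin), ("biased", biased_coin),
                   ("certain", certain), ("uniform over 8", uniform_8)]:
    print(name, entropy_bits(dist))
```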

### Mutual Information
- For two random variables X and Y, mutual information is the KL divergence of the product of the marginals of X and Y from the joint distribution of (X, Y)
- It measures how much one random variable tells us about another, i.e. how informative X is about Y
- It is related to correlation, but correlation captures only linear dependence, while mutual information captures any kind of dependence (it is 0 if and only if X and Y are independent)
- Intuition:
    - It is the expected reduction in the entropy of X once the value of Y is known (and, by symmetry, of Y once X is known)
    - If the value of Y is given, how much uncertainty is removed from the distribution of X
    - If the value of Y is given, how many yes/no questions can be saved, on average, when guessing x
- Definition: 
$$
I(X; Y)
=
\iint
p_{X,Y}(x,y)
\log\!\left(
\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}
\right)
\, dx\, dy
=
H(Y) - H(Y \mid X)
$$
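
Both forms of the definition can be compared on small discrete tables, with sums in place of the integrals. The joint tables and helper names are hypothetical, and base-2 logarithms are used so the values are in bits.

```python
import math


def entropy_bits(p):
    """Shannon entropy in bits; zero-probability outcomes contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def mutual_information(joint):
    """I(X; Y) as the KL divergence of the product of marginals from the joint."""
    pr_x = [sum(row) for row in joint]
    pr_y = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (pr_x[i] * pr_y[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)


def conditional_entropy_y_given_x(joint):
    """H(Y | X) = sum over x of Pr(x) * H(Y | X = x)."""
    total = 0.0
    for row in joint:
        px = sum(row)
        if px > 0:
            total += px * entropy_bits([pxy / px for pxy in row])
    return total


# Hypothetical joint tables, stored as joint[x][y].
independent = [[0.25, 0.25], [0.25, 0.25]]   # X tells us nothing about Y
copy = [[0.5, 0.0], [0.0, 0.5]]              # Y is always equal to X
dependent = [[0.4, 0.1], [0.1, 0.4]]         # Y usually equals X

for name, joint in [("independent", independent), ("copy", copy),
                    ("dependent", dependent)]:
    h_y = entropy_bits([sum(col) for col in zip(*joint)])
    print(name, mutual_information(joint),
          h_y - conditional_entropy_y_given_x(joint))
```

The two printed numbers on each line agree, which is the identity $I(X; Y) = H(Y) - H(Y \mid X)$ evaluated on each table.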