# Lecture 1
- Intro to Deep Learning

## The perceptron - single neuron model
 - key component in deep learning modeling.  Comes up over and over as the base model.  

### Forward Propagation
- $x_i$ - inputs
- $w_i$ - weights
- $w_0$ - bias
- $g$ - non-linear activation function like 'relu', 'sigmoid'
- $m$ - number of inputs
- $\hat{y}=g\left(w_0 + \sum_{i=1}^m{x_iw_i}\right)$
- See "How sum vectors and matrices work" for proof
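
The forward pass above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a sigmoid for $g$; the input, weight, and bias values are made up for the example and do not come from the lecture.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here as the activation g."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_forward(x, w, w0, g=sigmoid):
    """Compute y_hat = g(w0 + sum_i x_i * w_i) for a single neuron."""
    z = w0 + sum(xi * wi for xi, wi in zip(x, w))
    return g(z)

# Illustrative values (assumed, not from the lecture)
x = [1.0, 2.0]
w = [3.0, -2.0]
w0 = 1.0

# z = 1 + 1*3 + 2*(-2) = 0, and sigmoid(0) = 0.5
y_hat = perceptron_forward(x, w, w0)
print(y_hat)
```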

### Common activation functions:
- Purpose of activation functions is to introduce non-linearities into the system.
 - Linear activation produces linear decisions no matter the network size

- Sigmoid:
 - $\sigma(x) = \frac{1}{1 + e^{-x}}$
 - $\frac{d\sigma}{dx} = \sigma(x) \cdot (1 - \sigma(x))$
- ReLU:
 - $\text{ReLU}(x) = \max(0, x)$
 - $\frac{d\text{ReLU}}{dx} = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases}$
- Hyperbolic tangent (tanh)
 - $\text{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
 - $\frac{d\text{tanh}}{dx} = 1 - \text{tanh}^2(x)$
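
As a sanity check on the derivative formulas above, the sketch below compares each analytic derivative with a central finite difference. The test point `x0` and step size `h` are arbitrary choices for the illustration, and the point is kept away from the ReLU kink at 0.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return max(0.0, x)

def d_relu(x):
    # Convention used in these notes: derivative is 1 for x >= 0
    return 0.0 if x < 0 else 1.0

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.7  # arbitrary test point, away from the ReLU kink at 0
checks = {
    "sigmoid": (d_sigmoid(x0), numeric_derivative(sigmoid, x0)),
    "relu": (d_relu(x0), numeric_derivative(relu, x0)),
    "tanh": (d_tanh(x0), numeric_derivative(math.tanh, x0)),
}
for name, (analytic, numeric) in checks.items():
    print(name, analytic, numeric)
```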

### Perceptron simplified
- single perceptron
 - $z=w_0 + \sum_{j=1}^m{x_jw_j}$

- Multi-output perceptron
 - $z_i=w_{0,i} + \sum_{j=1}^m{x_jw_{j,i}}$
 - because all inputs are connected to all outputs, a MOP layer is called "Dense"
 - **Tensorflow**
 ```
 import tensorflow as tf
 layer = tf.keras.layers.Dense(units=2)  # units = number of output nodes (a positive integer); 2 is illustrative
 ```

## Dense Layer
A dense layer is one in which all inputs are connected to all outputs.  This is the multi-output perceptron model.

- in the slideshow they use an 'X' inside of a box to simplify the connections between inputs and nodes.

## Multiple layers - "Deep Neural Network"
**Sequential Model** - you can use the outputs from one perceptron layer as the input for another layer
- $z_{k,i} = w_{0,i}^k+\sum_{j=1}^{n_{k-1}}g(z_{k-1,j})w_{j,i}^k $
- See TF documentation [HERE](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense?version=stable)
 ```
 import tensorflow as tf
 model = tf.keras.Sequential([tf.keras.layers.Dense(n1),
                              tf.keras.layers.Dense(n2),
                              # ...
                              tf.keras.layers.Dense(2)])
 ```

## Loss
Did my neural network make a mistake?  If so, how big was the mistake?  The loss measures this.
- "Loss Function" = "Objective function" = "Empirical risk" = "Cost function"
- $J(W) = \frac{1}{n}\sum_{i=1}^n L(f(x^{(i)};W),y^{(i)})$
- Training minimizes the average mistake over the whole data set

### Loss Functions

Descriptions, formulas, and TensorFlow API calls for five common loss functions:

1. **Mean Squared Error (MSE) Loss:**
   - Used for regression models that output continuous real numbers
   - **Formula:**
   - $$  L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
   - **TensorFlow API Call:**
     ```python
     mse_loss = tf.keras.losses.mean_squared_error(y_true, y_pred)
     print(f'Mean Squared Error Loss: {mse_loss.numpy()}')
     ```

2. **Binary Cross-Entropy Loss:**
   - Commonly used as the loss function for binary classification problems
   - can be used with models that output a probability between 0 and 1
   - **Formula:**
    $$L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right]$$
    
   - **TensorFlow API Call:**
     ```python
     # Calculate binary cross-entropy loss
     bce_loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)

     print(f'Binary Cross-Entropy Loss: {bce_loss.numpy()}')
     ```

3. **Categorical Cross-Entropy Loss:**
   - The categorical cross-entropy loss function is commonly used in machine learning tasks where the goal is to classify instances into multiple classes.
   - **Formula:**
     $$L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \cdot \log(\hat{y}_{ij})
     $$
   - **TensorFlow API Call:**
     ```python
     # Calculate categorical cross-entropy loss
     cce_loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)

     print(f'Categorical Cross-Entropy Loss: {cce_loss.numpy()}')
     ```

4. **Hinge Loss (SVM):**
   - Used for maximum-margin classification (as in SVMs), with labels $y_i \in \{-1, +1\}$
   - **Formula:**
     $$
     L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \cdot \hat{y}_i)
     $$
   - **TensorFlow API Call:**
     ```python
     # Calculate hinge loss
     hinge_loss = tf.keras.losses.hinge(y_true, y_pred)

     print(f'Hinge Loss: {hinge_loss.numpy()}')
     ```

5. **Huber Loss:**
   - **Formula:**

     $$ L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} \frac{1}{2} (\hat{y}_i - y_i)^2 & \text{if } | \hat{y}_i - y_i | \leq \delta \\ \delta \cdot | \hat{y}_i - y_i | - \frac{1}{2} \delta^2 & \text{otherwise} \end{cases}
     $$
   - **TensorFlow API Call:**
     ```python
     # Set the delta parameter (threshold for the absolute difference)
     delta = 1.0

     # Calculate Huber loss
     huber_loss = tf.keras.losses.huber(y_true, y_pred, delta)

     print(f'Huber Loss: {huber_loss.numpy()}')
     ```
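
To connect the formulas above to numbers, here is a rough plain-Python sketch of three of the losses (MSE, binary cross-entropy, Huber). The helper names, the `eps` clipping value, and the toy `y_true`/`y_pred` lists are assumptions made for the illustration; this is not how TensorFlow implements them internally.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over paired lists."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy; predictions are clipped so log() never sees 0 or 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = abs(p - y)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * err - 0.5 * delta ** 2
    return total / len(y_true)

# Toy labels and predicted probabilities (assumed values)
y_true = [1.0, 0.0, 1.0]
y_pred = [0.9, 0.2, 0.4]

print(mse(y_true, y_pred))
print(binary_cross_entropy(y_true, y_pred))
print(huber(y_true, y_pred, delta=1.0))
```

Because every error in the toy data is below `delta`, the Huber value here is exactly half the MSE.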

## Gradient Descent
### Algorithm
A step-by-step description of the gradient descent algorithm:

1. **Initialize Parameters:**
   - Start with initial values for the parameters (weights and biases) of the model. This could be set randomly or using some predefined strategy.

2. **Set Hyperparameters:**
   - Choose hyperparameters such as the learning rate ($\alpha$), which determines the size of the steps taken during each iteration.

3. **Compute Predictions:**
   - Use the current parameters to make predictions on the training data.

4. **Compute Loss:**
   - Calculate the loss, which is a measure of how well the model is performing. The loss function depends on the specific task (e.g., mean squared error for regression, cross-entropy for classification).

5. **Compute Gradients:**
   - Calculate the gradients of the loss with respect to each parameter. This involves computing the partial derivatives of the loss function with respect to each parameter.

6. **Update Parameters:**
   - Adjust the parameters in the opposite direction of the gradients to minimize the loss. This is done using the update rule:
     $$ \text{parameter} = \text{parameter} - \alpha \times \text{gradient} $$
   - Repeat this process for all parameters.

7. **Iterate:**
   - Repeat steps 3-6 for a specified number of iterations (epochs) or until convergence. Convergence can be determined by observing changes in the loss or the gradients falling below a certain threshold.

8. **Evaluate Convergence:**
   - Check for convergence by monitoring the changes in the loss or gradients. If the algorithm has converged, the process can be stopped.

9. **Make Predictions:**
   - Use the trained model with the optimized parameters to make predictions on new, unseen data.

This list captures the essence of the basic gradient descent algorithm. There are variations, such as stochastic gradient descent (SGD) and mini-batch gradient descent, which involve updating parameters based on a subset of the training data rather than the entire dataset at each iteration. These variations are often used to speed up the convergence process, especially for large datasets.
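
The steps above can be sketched on a toy problem. The quadratic loss, its minimum at `w = 3`, the learning rate, and the tolerance are all assumptions picked for the illustration; a real model would compute the loss and gradient from training data.

```python
def loss(w):
    """Toy quadratic loss with its minimum at w = 3 (assumed for illustration)."""
    return (w - 3.0) ** 2

def grad(w):
    """Analytic gradient of the toy loss."""
    return 2.0 * (w - 3.0)

def gradient_descent(w_init, alpha=0.1, max_iters=1000, tol=1e-8):
    """Run plain gradient descent; return the final parameter and step index."""
    w = w_init
    for step in range(max_iters):
        g = grad(w)          # compute the gradient
        if abs(g) < tol:     # convergence check on the gradient size
            break
        w = w - alpha * g    # update rule: parameter -= alpha * gradient
    return w, step

w_final, n_steps = gradient_descent(w_init=0.0)
print(w_final, n_steps, loss(w_final))
```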

### Computing gradients: Backpropagation
- In theory very simple: it is just repeated application of the chain rule

General formula:
$$ \frac{\partial{J(W)}}{\partial{w_2}} = \frac{\partial{J(W)}}{\partial{\hat{y}}} \cdot \frac{\partial{\hat{y}}}{\partial{w_2}} $$

Example:
$$\frac{\partial{J(W)}}{\partial{w_1}} = \frac{\partial{J(W)}}{\partial{\hat{y}}} \cdot \frac{\partial{\hat{y}}}{\partial{z_1}} \cdot \frac{\partial{z_1}}{\partial{w_1}} $$
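
A small worked example of the chain rule, assuming a hypothetical two-weight network $x \rightarrow z_1 = w_1 x \rightarrow a_1 = \sigma(z_1) \rightarrow \hat{y} = w_2 a_1$ with squared-error loss $J = (\hat{y} - y)^2$. The network shape and the numbers are assumptions for the sketch; the analytic gradients are checked against finite differences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x, y):
    """Forward pass; returns intermediate values and the loss J."""
    z1 = w1 * x
    a1 = sigmoid(z1)
    y_hat = w2 * a1
    J = (y_hat - y) ** 2
    return z1, a1, y_hat, J

def backward(w1, w2, x, y):
    """Backward pass: multiply local derivatives along the chain."""
    z1, a1, y_hat, _ = forward(w1, w2, x, y)
    dJ_dyhat = 2.0 * (y_hat - y)
    dyhat_dw2 = a1
    dyhat_dz1 = w2 * a1 * (1.0 - a1)   # w2 * sigmoid'(z1)
    dz1_dw1 = x
    dJ_dw2 = dJ_dyhat * dyhat_dw2
    dJ_dw1 = dJ_dyhat * dyhat_dz1 * dz1_dw1
    return dJ_dw1, dJ_dw2

# Illustrative values (assumed)
w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
dJ_dw1, dJ_dw2 = backward(w1, w2, x, y)

# Finite-difference check of both gradients
h = 1e-6
num_dw1 = (forward(w1 + h, w2, x, y)[3] - forward(w1 - h, w2, x, y)[3]) / (2 * h)
num_dw2 = (forward(w1, w2 + h, x, y)[3] - forward(w1, w2 - h, x, y)[3]) / (2 * h)
print(dJ_dw1, num_dw1)
print(dJ_dw2, num_dw2)
```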

### Learning Rate
- How do you pick a good value?
 - If it's too low: learning converges slowly and can get stuck in a local minimum
 - If it's too high: training overshoots and can diverge
 - Often picked by exhaustively trying different learning rates and comparing the results.  This is BAD

- Adaptive Learning Rate Algorithms
 - SGD - `tf.keras.optimizers.SGD`
 - Adam - `tf.keras.optimizers.Adam`
 - Adadelta - `tf.keras.optimizers.Adadelta`
 - Adagrad - `tf.keras.optimizers.Adagrad`
 - RMSProp - `tf.keras.optimizers.RMSprop`

### Problem with gradient descent
- Too much computation: each step computes the gradient over the whole data set
- Alternative - use a single example from the data set per update: stochastic gradient descent.  Very fast, but the gradient estimate can be very noisy
- Mini-batch gradient descent
 - Computationally efficient
 - reduces stochasticity
 - batch sizes on the order of 10s or 100s of data points
 - smoother convergence
 - allows for parallelization


### Overfitting
- When the model is too complex and does not generalize well
- **Regularization** - Technique that constrains our optimization to discourage complex models
- Neural networks are prone to overfitting
- **Dropout** - during training randomly set some activations to 0.  
  - forces the network to not rely on any 1 node
  - **Tensorflow** - `tf.keras.layers.Dropout(rate=0.5) #50% dropout`
- **Early Stopping** - stop training before overfitting occurs




## Todo List
- [ ] Read more on Common loss functions
- [ ] Compare and contrast Adaptive Learning Algorithms



## Dense Layer from scratch

```python
### Dense Layer from Scratch ###
import tensorflow as tf

# n_output_nodes: number of output nodes
# input_shape: shape of the input
# x: input to the layer

class OurDenseLayer(tf.keras.layers.Layer):
  def __init__(self, n_output_nodes):
    super(OurDenseLayer, self).__init__()
    self.n_output_nodes = n_output_nodes

  def build(self, input_shape):
    d = int(input_shape[-1])
    # Define and initialize parameters: a weight matrix W and bias b
    # Note that parameter initialization is random!
    self.W = self.add_weight(name="weight", shape=[d, self.n_output_nodes]) # note the dimensionality
    self.b = self.add_weight(name="bias", shape=[1, self.n_output_nodes]) # note the dimensionality

  def call(self, x):
    # Linear step: z = xW + b
    z = tf.matmul(x, self.W) + self.b
    # Non-linear step: sigmoid activation
    y = tf.sigmoid(z)
    return y
```

## Gradient Descent Loop

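```python
import tensorflow as tf

lr = 0.01  # learning rate (illustrative value)
weights = tf.Variable(tf.random.normal([1]))  # random initial weights

while True:  # loops forever as written; add a stopping criterion in practice
  with tf.GradientTape() as g:
    loss = compute_loss(weights)  # compute_loss: placeholder for a user-defined loss
  gradient = g.gradient(loss, weights)

  weights.assign_sub(lr * gradient)  # weights = weights - lr * gradient
```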

# Lecture 2 - Deep Sequence Modeling
Example sequence modeling applications:
- Many-to-one - Sentiment Classification
- One-to-many - Image Captioning
- Many-to-many - Machine Translation

## Recurrent Neural Networks (RNNs)
Recurrence relation - what the network computes at a particular time step is passed on to later, state-dependent time steps
- $h_t$ - Internal state. Neuron memory
- This means that the network output now depends on both the input $x_t$ and the previous state $h_{t-1}$
 - $\hat{y}_t = f(x_t,h_{t-1})$
 - $h_t$ is iteratively updated over time.
- recurrence relation - determines how $h_t$ updates.
 - $h_t=f_w(x_t,h_{t-1})$, where
 - $h_t$ - cell state
 - $f_w$ - function with weights W
 - $x_t$ - input
 - $h_{t-1}$ - old state
- Because we are now tracking the state $h_t$, we also need a weight matrix corresponding to the state.
 1. Output Vector - $\hat{y}_t=W_{hy}^Th_t$
 2. Update hidden state - $h_t = \tanh(W_{hh}^Th_{t-1}+W_{xh}^Tx_t)$
 3. Input vector $x_t$

Another way to think about RNNs is as a "computational graph across time".  For each time step there is an input $x_t$; a weight matrix $W_{xh}$ from the input to the RNN state; a weight matrix $W_{hy}$ from the RNN state to the output; and a weight matrix $W_{hh}$ from the RNN state to the next state.  These weight matrices are shared across time steps (iteration independent). See RNN from scratch

**Implementation in TensorFlow** - `tf.keras.layers.SimpleRNN(rnn_units)`

## Sequence Modeling design criteria
Need to:
1. Handle variable length sequences
2. Track long dependencies
3. Maintain information about order
4. Share parameters across the sequence

RNNs meet these criteria

## RNNs Backpropagation through time
Computationally very costly because it requires calculating the gradient with respect to all inputs and the weight matrices at every time step.  

- Many gradient factors $> 1$ - leads to "exploding gradients"
 - fix with gradient clipping
- Many gradient factors $< 1$ - leads to vanishing gradients
 - fix with activation function, weight initialization, or network architecture


### Dealing with Vanishing gradient problem
1. ReLU activation function - ReLU has a derivative of 1 for $x > 0$, so the gradient is not shrunk when it passes through active units
2. Parameter initialization - Initialize weights to the identity matrix
 - Initialize biases to zero.  This helps prevent the weights from shrinking to zero
3. Gated Cells - use gates to selectively add or remove information within each recurrent unit with a pointwise multiplication and an activation function

#### Long Short Term Memory (LSTMs)
Key concepts
1. Maintain a cell state
2. Use gates to control the flow of information
 - **Forget** gate gets rid of irrelevant info
 - **Store** relevant info from current input
 - **Update** cell state
 - **Output** gate returns a filtered version of the cell state
3. Backpropagation through time with partially uninterrupted gradient flow

- Flow
 1. Forget --> 2. Store --> 3. Update --> 4. Output

- LSTM cells can track information through many timesteps
- **LSTMs in TensorFlow** - `tf.keras.layers.LSTM(num_units)`



## RNN from scratch

```python
import tensorflow as tf

class MyRNNCell(tf.keras.layers.Layer):
  def __init__(self, rnn_units, input_dim, output_dim):
    super(MyRNNCell, self).__init__()

    # initialize weight matrices
    self.W_xh = self.add_weight(shape=[rnn_units, input_dim])
    self.W_hh = self.add_weight(shape=[rnn_units, rnn_units])
    self.W_hy = self.add_weight(shape=[output_dim, rnn_units])

    # initialize the hidden state to zeros
    self.h = tf.zeros([rnn_units, 1])

  def call(self, x):
    # update the hidden state: h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    # x is assumed to be a column vector of shape [input_dim, 1]
    self.h = tf.math.tanh(tf.matmul(self.W_hh, self.h) + tf.matmul(self.W_xh, x))

    # compute the output
    output = tf.matmul(self.W_hy, self.h)

    # return the current output and hidden state
    return output, self.h
```

# Lecture 3

# Lecture 4: Deep Generative Modeling
## Supervised vs. Unsupervised ML
Supervised
- Data : (x,y)
- Goal - Learn function to map $x \rightarrow y$
- Example: classification, regression, object detection, semantic segmentation

Unsupervised
- Data: x
- Learn hidden/underlying structure of data
- Examples: Clustering, feature or dimensionality reduction

## Generative Modeling
Take as input training samples from some distribution and learn a model that represents that distribution

**Debiasing**  - Process of mitigating or reducing biases present in the training data or model.  Some common debiasing approaches are
 - Diverse and representative training data
 - Bias detection and measurement
 - Data Augmentation and balancing
 - Adversarial Training
 - Regularization Techniques
 - Explainability and Interpretability
 - Post Processing Techniques
 - Continuous monitoring

## Latent Variable models
### Autoencoders and Variational Autoencoders (VAEs)
**Latent Variable** - a variable that is not directly observed but is inferred through the observation of other variables.

**Autoencoders**
- Encoder - takes input data and transforms it into a compressed representation, typically of lower dimensionality than the original input
- Decoder - takes the encoded representation and reconstructs the input data from it.  The goal is to generate an output that is as close to the input as possible.
- Loss Function - the autoencoder is optimized to minimize the difference between the input and the reconstructed output.  A loss function measures the dissimilarity between the input and the output.  Common loss functions include MSE and BCE

**Variational Autoencoders**
A type of autoencoder that uses a probabilistic approach to represent input data.  The key innovation of VAEs lies in their ability to generate new data points by sampling from a learned probability distribution in the latent space.
 - Encoder
  - Similar to a regular autoencoder, a VAE consists of an encoder neural network that maps the input data to a probabilistic distribution in the latent space.
  - Instead of directly outputting a fixed encoding, the encoder outputs parameters of a probability distribution (usually Gaussian or normal distribution) in the latent space. These parameters are the mean ($\mu$) and the standard deviation ($\sigma$).
  - computes $q_{\phi}(Z|X) $
 - Sampling
  - During the training phase, the VAE introduces a stochastic element by sampling from the latent distribution represented by $\mu$ and $\sigma$. This step is crucial for generating diverse data samples
 - Decoder:
  - The sampled latent variable is then passed through the decoder, which reconstructs the input data. Like a regular autoencoder, the decoder attempts to generate an output that closely matches the original input.
  - computes $p_{\theta}(X|Z)$
 - Loss Function:
  - The loss function for a VAE consists of two components: a reconstruction loss (measuring how well the reconstructed data matches the input) and a regularization term that encourages the learned latent distribution to be close to a standard normal distribution. This regularization term is typically the Kullback-Leibler (KL) divergence.
   - The KL divergence term penalizes the model if the learned latent distribution deviates significantly from a standard normal distribution (Gaussian). This regularization encourages the VAE to learn a well-behaved and smooth latent space, making it more suitable for generating diverse and meaningful samples during the generation phase.
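   - $$D(q_{\phi}(Z|X) \,||\, p(z)) = \frac{1}{2}\sum_{j=0}^{k-1}\left(\sigma_j^2+\mu_j^2 - 1 - \log \sigma_j^2\right)$$
   - where the sum runs over the $k$ latent dimensions and the right-hand side is the closed-form KL divergence between the encoder's Gaussian $N(\mu, \sigma^2)$ and the standard normal prior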
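
A minimal sketch of the KL regularization term, assuming the standard closed form for a diagonal Gaussian against a standard normal prior, $\frac{1}{2}\sum_j(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2)$, with $\sigma_j$ taken as the standard deviation. The function name and the sample values are made up for the illustration.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.

    mu and sigma are lists of per-dimension means and standard deviations.
    """
    return 0.5 * sum(s ** 2 + m ** 2 - 1.0 - math.log(s ** 2)
                     for m, s in zip(mu, sigma))

# The divergence is zero when the encoder already outputs the prior
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# Shifting one mean by 1 costs 0.5 * 1^2 = 0.5
kl_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])

# Collapsing the variance is also penalized
kl_narrow = kl_to_standard_normal([0.0], [0.1])

print(kl_zero, kl_shifted, kl_narrow)
```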

#### VAE optimization
$L(\phi,\theta)$ = Reconstruction Loss + regularization term

**Reconstruction Loss** e.g. $||x-\hat{x}||^2$.  When normalizing use

**Regularization term**: $D(q_{\phi} (z|x) || p(z))$ where
 - $q_{\phi} (z|x)$ - is the inferred latent distribution
 - $p(z)$ - Fixed prior on latent distribution
  - Common choice of prior - standard normal (Gaussian) $$p(z) = N(\mu=0,\sigma^2=1)$$
  - encourages encodings to be distributed normally
  - penalizes the network when it "tries to cheat" by clustering points to specific regions
 -  What are we trying to achieve with regularization?
  1. Continuity - points that are close in the latent space produce similar content after decoding
  2. Completeness - sampling from the latent space leads to meaningful content after decoding
  3. Regularization with a normal prior helps enforce an information gradient in the latent space

#### VAE computation
- Problem: Cannot backpropagate gradients through sampling layers because sampling is stochastic.
- **Reparameterization** - redefine how a latent variable is sampled: as fixed $\mu$ and $\sigma$ vectors, with $\sigma$ scaled by random constants drawn from the prior distribution
 $$\Rightarrow z = \mu +\sigma\odot\epsilon$$
- $\epsilon$ is drawn from a standard normal distribution, $\epsilon \sim N(0, 1)$
- Review slide 35 for diagram on process
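
The reparameterization $z = \mu + \sigma\odot\epsilon$ can be sketched with the standard library alone. The values of `mu` and `sigma`, the seed, and the sample count are assumptions for the illustration; in a real VAE, `mu` and `sigma` would be encoder outputs and gradients would flow through them while `eps` is treated as a constant.

```python
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
mu = [2.0, -1.0]                # hypothetical encoder means
sigma = [0.5, 0.1]              # hypothetical encoder standard deviations

samples = [reparameterize(mu, sigma, rng) for _ in range(20000)]
mean_z0 = sum(z[0] for z in samples) / len(samples)
mean_z1 = sum(z[1] for z in samples) / len(samples)

# The sample means should sit close to mu
print(mean_z0, mean_z1)
```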

#### VAEs Latent perturbation
- slowly increase or decrease a single latent variable and keep all other variables constant
- Ideally we want latent variables that are uncorrelated.
 - enforce diagonal prior on the latent variables to encourage independence.
- $\beta$-VAE - introduce a weight $\beta$ on the regularization term of the loss that controls the level of disentanglement.  $\beta=1$ is the standard VAE (no extra separation).  $\beta>1 \rightarrow$ constrains latent bottleneck and encourages disentanglement

### Generative Adversarial Network (GANs)
- What if we only focus on the quality of the generated sample?  
 - Goal to generate new instances that are close to input
 - It can be difficult to learn the distribution directly.  Instead sample from simple random noise and "learn" the transformation

**GANs**
- have 2 neural networks.  
 - The **Generator** turns noise into an imitation of the data to try and trick the discriminator.  
 - The **discriminator** tries to determine the real vs. fake data

**ChatGPT** notes for a step-by-step description of the training process:
#### Training Process:
- During each iteration, the generator generates synthetic data, and the discriminator evaluates both real and generated data.
- The gradients from the discriminator's loss are used to update the discriminator's parameters, making it better at distinguishing real from fake data.
- The gradients from the generator's loss are used to update the generator's parameters, making it better at generating data that fools the discriminator.
- This process continues in a feedback loop, with the generator and discriminator getting better at their respective tasks over time.

#### Convergence:
- Ideally, the GAN reaches a point where the generator produces data that is indistinguishable from real data, and the discriminator cannot reliably tell the difference.
- At this point, the GAN is said to have reached convergence.

#### GAN loss function
- **D** tries to identify the synthesized images (in this form $D$ outputs the estimated probability that its input is fake) via $$ \arg\max_D \mathbb{E}_{z,x}[\log D(G(z))+\log(1-D(x))] $$
- **G** tries to synthesize fake images that fool **D** $$ \arg\min_G \mathbb{E}_{z,x}[\log D(G(z))+\log(1-D(x))] $$
- Combine the two processes - $$\arg\min_G\max_D \mathbb{E}_{z,x}[\log D(G(z))+\log(1-D(x))]$$

#### Conditional GANS
- add a conditioning factor $c$ applied to both the noise input and the Discriminator $D$

### CycleGAN
ChatGPT description:
CycleGAN, short for Cycle-Consistent Generative Adversarial Network, is a type of generative model designed for image-to-image translation without paired training data. It was introduced by Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros in 2017. CycleGAN is particularly powerful for tasks where there's a lack of paired data, such as transforming images from one domain to another without explicit correspondences.

- In audio: take the spectrum of words from speech and synthesize it in a different domain, e.g. the instructor's voice to Obama's

# Lecture 5: Robust and Trustworthy Deep Learning
**BIAS** - What happens when models are skewed by sensitive feature inputs? i.e. skin color

**UNCERTAINTY** - Can we teach a model to recognize when it doesn't know an answer?

## Bias present in the AI lifecycle
1. Sampling bias - oversampling from one area of data, undersampling from another.  i.e. healthy patients vs. diseased.  
2. Lack of uncertainty benchmarks and metrics
3. Deployment - what happens when a model deployed in 2023 is used in 2035?  Probably won't work
4. Evaluation - Bulk metrics don't apply to subgroups.  

### Class imbalance
What happens when some cases are overrepresented?
 - Fix: sample reweighting - sample more data from underrepresented classes
 - Fix: Loss reweighting - mistakes on underrepresented classes contribute more heavily to the loss
 - Fix: Batch selection - Choose randomly from classes so that every batch has an equal number of points per class

### What about Latent Features
How do we know which features to label?
The amount of labeling would be work intensive

Use Variational autoencoders (see lecture 4)
1. learn latent structure.  i.e. angle of face image
2. estimate distribution.  We can undersample on dense areas of set and oversample on sparse areas
3. Adaptively guide learning as described in step 2

Estimated joint distribution:
$$\hat{Q}(z|X) \approx\Pi_i\hat{Q}_i(z_i|X)$$
where:
  
the product $\Pi_i$ assumes the latent variables are independent, which approximates the joint distribution,

and $\hat{Q}_i(z_i|X)$ is the histogram for every latent variable $z_i$.  Thus we define the probability of selecting a datapoint as

$$W(z(x)|X) \approx \Pi_i\frac{1}{\hat{Q}_i(z_i|X) + \alpha}$$

where $\alpha$ is a debiasing parameter.  As $\alpha$ increases this probability goes to the uniform distribution.  As it decreases we debias more strongly.

The weight of the sample $W(z(x)|X)$ is used to adaptively resample.  See the saved slide image (missing here) for the effect of the size of $\alpha$.
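
A rough sketch of the resampling weights, assuming one histogram per latent dimension and weights proportional to $\prod_i 1/(\hat{Q}_i(z_i|X) + \alpha)$, normalized to sum to 1. The function names, the bin count, and the toy latent values are assumptions for the illustration; in the lecture's setting the latents would come from a trained VAE encoder.

```python
def histogram_density(values, n_bins=5):
    """Return a function mapping a value to its normalized histogram frequency."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values identical: a single effective bin
    counts = [0] * n_bins

    def bin_index(v):
        return min(int((v - lo) / width), n_bins - 1)

    for v in values:
        counts[bin_index(v)] += 1
    total = len(values)
    return lambda v: counts[bin_index(v)] / total

def sampling_probabilities(latents, alpha=0.01, n_bins=5):
    """latents: list of latent vectors z(x), one per datapoint."""
    n_dims = len(latents[0])
    densities = [histogram_density([z[i] for z in latents], n_bins)
                 for i in range(n_dims)]
    weights = []
    for z in latents:
        w = 1.0
        for i in range(n_dims):
            w *= 1.0 / (densities[i](z[i]) + alpha)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

# Toy 1-D latent: nine points clustered near 0 and one rare point at 1
latents = [[0.01 * k] for k in range(9)] + [[1.0]]
probs = sampling_probabilities(latents, alpha=0.01)
probs_large_alpha = sampling_probabilities(latents, alpha=1000.0)

print(probs[0], probs[-1])        # the rare point is sampled far more often
print(probs_large_alpha[0], probs_large_alpha[-1])  # nearly uniform
```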


## Uncertainty
### Type comparison

| Feature | Aleatoric Uncertainty | Epistemic Uncertainty |
| --- | --- | --- |
| Nature of Uncertainty | Inherent, irreducible variability in data | Model-related, reducible with more information |
| Causes | Inherent stochastic factors (e.g., noise) | Lack of knowledge about the underlying model |
| Quantifiability | Quantifiable, estimated directly from data | Often indirect estimation (e.g., Bayesian methods) |
| Reducibility | Irreducible, remains present with ideal knowledge/model | Reducible with more information, better models |
| Handling in Models | Incorporated through probabilistic components, modeling data variability | Addressed through Bayesian methods, ensembling, dropout-based uncertainty |
| Practical Implications | Crucial for robust decision-making under inherent variability | Important for model selection, improvement, and decision-making under uncertainty due to lack of knowledge |

### Estimating Aleatoric uncertainty/Loss
- Learn a set of variances corresponding to the input.
- Aleatoric uncertainty: $f_\theta\rightarrow\hat{y},\sigma^2$
 - $\hat{y}$ : prediction
 - $\sigma^2$ : variance.  It depends on the input

#### Loss function
Negative Log Likelihood (NLL) is a generalization of MSE to non-constant variance

$$L=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{(\hat{y}_i-y_i)^2}{2\sigma_i^2}+\frac{1}{2}\ln(\sigma_i^2)\right]$$

- This allows us to understand how the $\sigma$ and $y$ influence our uncertainty
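
A small sketch of the Gaussian negative log likelihood with a per-sample variance, using the form $\frac{(\hat{y}_i-y_i)^2}{2\sigma_i^2} + \frac{1}{2}\ln\sigma_i^2$ averaged over samples (the constant term is dropped). The function names and the toy data are assumptions for the illustration. With all variances equal to 1 the loss reduces to half the MSE.

```python
import math

def gaussian_nll(y_true, y_pred, variances):
    """Mean Gaussian NLL (up to a constant) with one variance per sample."""
    total = 0.0
    for y, y_hat, var in zip(y_true, y_pred, variances):
        total += (y_hat - y) ** 2 / (2.0 * var) + 0.5 * math.log(var)
    return total / len(y_true)

def mse(y_true, y_pred):
    return sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Toy targets and predictions (assumed values)
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]

# Constant unit variance: NLL equals MSE / 2
nll_unit = gaussian_nll(y_true, y_pred, [1.0, 1.0, 1.0])

# Matching the first variance to its squared error (0.5^2) lowers the loss
nll_fitted = gaussian_nll(y_true, y_pred, [0.25, 1.0, 1.0])

print(nll_unit, 0.5 * mse(y_true, y_pred), nll_fitted)
```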

#### Example
In RGB images, the highest aleatoric uncertainty occurs at the edges/boundaries of objects within the image.  This is irreducible

### Epistemic Uncertainty
- Model-dependent uncertainty.  Does the model have confidence in the prediction?

- If you were to train an 'ensemble' of models with the same hyperparameters but with different initial weights, we would see different outputs.  The variance across these outputs can be used to estimate the uncertainty.
 - The problem with this is that it is cost intensive to create and compare multiple models.
 - Fix this with dropout layers
 - **KEEP** dropout layers at test time


## CAPSA - Themis AI project
- Model agnostic framework for risk estimation
- Neural network wrapper
- Added to training workflow.
- Calculates biases, uncertainty, and label noise
- Simplifies uncertainty calculation

```python
 train, test = load_data()
 model = capsa.HistogramWrapper(model,...)
 model.train(train)
 preds,bias = model.predict(test)
```





# Lecture 6: Deep Reinforcement Learning
- Data given in State:Action pairs
## Key Terms
- **Agent** - takes actions.  Algorithm is the agent
- **Environment** - the world in which the agent exists
- **Action** - $a_t$ A move the agent can make in the environment
- **Observation** - how the environment responds back with a change of state
- **State change** - $s_{t+1}$
- **Reward** - $r_t$ - feedback that measures the success or failure of the agent's action
 - **Discounted total reward** : $R_t=\sum_{i=t}^\infty \gamma^ir_i$
 - $\gamma$ - discount factor; a dampening term that makes future rewards worth less. $0<\gamma<1$
- **Policy** $\pi(s)$ - best action to take at state $s$
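
The discounted total reward can be computed directly for a finite episode. This sketch assumes the indexing $R_t=\sum_{i \ge t} \gamma^i r_i$ truncated at the end of the reward list; the reward values and $\gamma$ are made up for the illustration.

```python
def discounted_return(rewards, gamma, t=0):
    """R_t = sum over i >= t of gamma^i * r_i, for a finite list of rewards."""
    return sum(gamma ** i * r for i, r in enumerate(rewards) if i >= t)

rewards = [1.0, 1.0, 1.0]   # hypothetical rewards r_0, r_1, r_2

R0 = discounted_return(rewards, gamma=0.5)        # 1 + 0.5 + 0.25
R1 = discounted_return(rewards, gamma=0.5, t=1)   # 0.5 + 0.25

print(R0, R1)
```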

## Q-Function
$$Q(s_t,a_t) = \mathbb{E}[R_t|s_t,a_t]$$
- Captures the expected total future reward an agent in state $s$ can receive by executing a certain action $a$.
- The Q-function gives us the action to take at the current state to maximize the reward
- Strategy
$$\pi(s)=\arg\max_a Q(s,a)$$
- Learned with Deep learning

## Value/Q-learning
Find $Q(s,a)$, then choose the action that maximizes it:
$a=\arg\max_a Q(s,a)$, i.e. optimize the Q-function

- **Target** - $r+\gamma \max_{a'}Q(s',a')$
- **Predicted** - $Q(s,a)$

- **Q-loss** - Mean Squared Error loss between target and predicted value, defined by:
$$L = \mathbb{E}\left[\left\|r+\gamma \max_{a'}Q(s',a') - Q(s,a)\right\|^2\right] $$
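
A minimal sketch of the target, the greedy action, and the squared-error Q-loss for a single transition. The Q-values are hypothetical numbers standing in for a network's outputs over three discrete actions, and the reward and $\gamma$ are likewise assumed.

```python
def td_target(reward, next_q_values, gamma):
    """Target = r + gamma * max over a' of Q(s', a')."""
    return reward + gamma * max(next_q_values)

def q_loss(predicted_q, target):
    """Squared error between the target and the predicted Q(s, a)."""
    return (target - predicted_q) ** 2

def greedy_action(q_values):
    """pi(s) = argmax over a of Q(s, a), returned as an action index."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for three discrete actions
q_s = [0.2, 1.0, 0.5]        # Q(s, a) for the current state
q_s_next = [0.0, 2.0, 1.0]   # Q(s', a') for the next state

action = greedy_action(q_s)                          # index of the largest Q(s, a)
target = td_target(1.0, q_s_next, gamma=0.9)         # 1 + 0.9 * 2
loss = q_loss(q_s[action], target)                   # (target - Q(s, a))^2

print(action, target, loss)
```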

### Downsides of Q-learning
- Complexity
 - can model scenarios where the action space is discrete and small
 - Cannot handle continuous action spaces
- Flexibility
 - Policy is deterministically computed from the Q function by maximizing the reward --> cannot learn stochastic policies

## Policy Learning / Policy gradient algorithms
Find $\pi(s)$ to optimize reward
Sample $a \sim \pi(s)$
- The policy outputs a probability for each action; actions with higher probability are more likely to be sampled.  Because this is a distribution:
$$\pi(s) \sim P(a|s), \qquad \sum_{a_i\in A}P(a_i|s)=1$$

### Loss function
$$\text{loss} = -\log P(a_t|s_t)R_t$$
- If we get a lot of reward for an action that has a high probability, we will continue to sample that action in the future
- If we sample an action with low reward, we won't want to repeat that action
- The negative sign in the **loss** function means that minimizing the loss drives us toward maximum reward
- Apply gradient descent
$$w'=w+\nabla \log P(a_t|s_t)R_t$$
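
The update can be sketched for a softmax policy whose parameters are the action logits themselves. That parameterization, the learning rate `lr` (which does not appear in the formula above), and the reward value are assumptions for the illustration. After one step on a rewarded action, its probability should go up and the loss should go down.

```python
import math

def softmax(logits):
    """Convert logits to action probabilities P(a|s)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_loss(logits, action, reward):
    """loss = -log P(a_t | s_t) * R_t"""
    return -math.log(softmax(logits)[action]) * reward

def policy_gradient_step(logits, action, reward, lr=0.1):
    """One update w' = w + lr * grad(log P(a|s)) * R for a softmax policy."""
    probs = softmax(logits)
    # For a softmax: d log P(a) / d logit_k = 1[k == a] - P(k)
    grad_log_p = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]
    return [w + lr * g * reward for w, g in zip(logits, grad_log_p)]

logits = [0.0, 0.0, 0.0]   # uniform initial policy (assumed)
new_logits = policy_gradient_step(logits, action=2, reward=5.0)

p_before = softmax(logits)[2]
p_after = softmax(new_logits)[2]
print(p_before, p_after)
```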
