## Lecture 2.2: Datasets and Losses

### Model Fitting

Goal: find parameters $\theta$ &ensp;&ensp;&ensp;&ensp;$f_{\theta}(\text{x})\ =\ \text{Wx}\ +\ \text{b}$

Components:
* dataset
* loss function

### Dataset

Recap:
* samples from a data generating distribution
* we use **labels** to train the model

$\mathcal{D}\ =\ \{(\text{x}_{i},\ \text{y}_{i})\}_{i\ =\ 1}^{N} \text{where}\ (\text{x}_{i},\ \text{y}_{i}) \sim P(\text{X},\ \text{Y})$

### Loss Function

A **loss function** measures the quality of the model

Loss Function:
&ensp;&ensp;&ensp;&ensp;$l(\theta\ |\ \text{x}_{i},\ \text{y}_{i})$

Expected Loss:
&ensp;&ensp;&ensp;&ensp;$L(\theta\ |\ \mathcal{D})\ =\ \mathbb{E}_{(\text{x},\ \text{y})\ \sim\ \mathcal{D}}[l(\theta\ |\ \text{x},\ \text{y})]$

* $l(\theta\ |\ \text{x}_{i},\ \text{y}_{i})$ measures how good the model's prediction is on a single data item
* $L(\theta\ |\ \mathcal{D})\ =\ \mathbb{E}_{(\text{x},\ \text{y})\ \sim\ \mathcal{D}}[l(\theta\ |\ \text{x},\ \text{y})]$ takes the entire data set and averages the per-item loss over all data elements (a.k.a. expected loss)
* In practice, always optimize the expected loss over the entire data set
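
The averaging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is a scalar linear model, and the names `expected_loss`, `squared_error`, and `dataset` are hypothetical, not taken from the lecture.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, float]  # one (x, y) pair; scalars for simplicity


def expected_loss(
    per_sample_loss: Callable[[float, float, float, float], float],
    w: float,
    b: float,
    dataset: Sequence[Sample],
) -> float:
    """Average the per-sample loss l(theta | x, y) over the dataset D."""
    return sum(per_sample_loss(w, b, x, y) for x, y in dataset) / len(dataset)


def squared_error(w: float, b: float, x: float, y: float) -> float:
    """Hypothetical per-sample loss: 0.5 * (w*x + b - y)^2."""
    return 0.5 * (w * x + b - y) ** 2


# Toy dataset (an assumption for illustration), generated by y = 2x + 1.
dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

print(expected_loss(squared_error, 2.0, 1.0, dataset))  # parameters that fit exactly
print(expected_loss(squared_error, 0.0, 0.0, dataset))  # parameters that fit poorly
```

Once the dataset is fixed, the result depends only on the parameters `w` and `b`, which is why the dataset-level loss can be treated as a function of $\theta$ alone.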

### Properties of a Loss Function

* **Low Loss** - good
* **High Loss** - bad
* Loss function over the full dataset $\mathcal{D}$ </br>
&ensp;&ensp;&ensp;&ensp;$L(\theta)\ =\ L(\theta\ |\ \mathcal{D})\ =\ \mathbb{E}_{(\text{x},\ \text{y})\ \sim\ \mathcal{D}}[l(\theta\ |\ \text{x},\ \text{y})]$

### Loss Function: Examples

Linear Regression: </br>
&ensp;&ensp;&ensp;&ensp;$\hat{\text{y}}\ =\ \text{Wx}\ +\ \text{b}$

L2 Loss: </br>
&ensp;&ensp;&ensp;&ensp;$l(\theta\ |\ \text{x},\ \text{y})\ =\ \frac{1}{2}||\hat{\text{y}}\ -\ \text{y}||_{2}^{2}$

* Measures half the squared Euclidean (L2) distance between the model prediction and the ground truth label
* Can think of as just the distance between two points in space
* One is the prediction of the model and the other is the ground truth label
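
As a small sketch of the L2 loss formula, assuming predictions and labels are plain Python sequences of floats (the function name `l2_loss` is illustrative, not from the lecture):

```python
from typing import Sequence


def l2_loss(y_hat: Sequence[float], y: Sequence[float]) -> float:
    """Half the squared Euclidean distance between prediction and label."""
    if len(y_hat) != len(y):
        raise ValueError("prediction and label must have the same length")
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))


print(l2_loss([1.0, 2.0], [1.0, 2.0]))  # identical points: zero loss
print(l2_loss([3.0, 4.0], [0.0, 0.0]))  # points at distance 5: 0.5 * 25
```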

Binary Classification: </br>
&ensp;&ensp;&ensp;&ensp;$\hat{\text{y}}\ =\ \sigma(\text{Wx}\ +\ \text{b})$

Sigmoid + Binary Cross Entropy Loss: </br>
&ensp;&ensp;&ensp;&ensp;$l(\theta\ |\ \text{x},\ \text{y})\ =\ -\text{y log}(\hat{\text{y}})\ -\ (1\ -\ \text{y})\text{log}(1\ -\ \hat{\text{y}})$

* Output of the model is a transformation of the linear model through a sigmoid
* The sigmoid gives the probability that an input belongs to class one rather than class zero
* Binary Cross Entropy is the negative log likelihood of the correct class under the sigmoid output
* If the label of a data point is one, the second term $-(1\ -\ \text{y})\text{log}(1\ -\ \hat{\text{y}})$ vanishes and we only look at the first term $-\text{y log}(\hat{\text{y}})$
* Minimizing this term maximizes $\text{log}(\hat{\text{y}})$, the log likelihood the model assigns to class one
* If the label is zero, the first term $-\text{y log}(\hat{\text{y}})$ vanishes, we only look at the second term $-(1\ -\ \text{y})\text{log}(1\ -\ \hat{\text{y}})$, and we maximize $\text{log}(1\ -\ \hat{\text{y}})$, the log probability the model assigns to class zero
* All this means is that we always want to maximize the log probability of the correct label
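
The two cases of the binary cross entropy can be checked numerically. The following is a minimal sketch, assuming a single scalar score in place of $\text{Wx}\ +\ \text{b}$; the function names are illustrative, and the code is not hardened against extreme scores or probabilities of exactly 0 or 1.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_hat: float, y: int) -> float:
    """-y*log(y_hat) - (1 - y)*log(1 - y_hat) for a label y in {0, 1}."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)


y_hat = sigmoid(2.0)  # the model leans towards class one

loss_if_one = binary_cross_entropy(y_hat, 1)   # only -log(y_hat) remains
loss_if_zero = binary_cross_entropy(y_hat, 0)  # only -log(1 - y_hat) remains
print(loss_if_one, loss_if_zero)
```

Because the model leans towards class one, the loss is small when the label is one and large when the label is zero.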

Multi-Class Classification: </br>
&ensp;&ensp;&ensp;&ensp;$\hat{\text{y}}\ =\ \text{softmax}(\text{Wx}\ +\ \text{b})$

Softmax + Cross Entropy Loss: </br>
&ensp;&ensp;&ensp;&ensp;$l(\theta\ |\ \text{x},\ \text{y})\ =\ -\sum\limits_{c\ =\ 1}^{C}1_{[y\ =\ c]}\text{log}(\hat{\text{y}}_{c})$

* Softmax has multiple outputs; it produces a vector of probabilities over the $C$ classes
* The indicator $1_{[y\ =\ c]}$ says that exactly one of the classes is correct
* Minimizing the Cross Entropy Loss maximizes the log likelihood that the softmax assigns to the correct class
* $\text{log}(\hat{\text{y}}_{c})$ is the log probability of class $c$
* Because the indicator is zero for every incorrect class, the sum collapses to $-\text{log}(\hat{\text{y}}_{y})$, which is the form used for optimization
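
A short sketch of softmax followed by cross entropy, assuming the raw scores $\text{Wx}\ +\ \text{b}$ are already given as a list of floats. The function names are illustrative, and classes are indexed from 0 here, unlike the sum from $1$ to $C$ in the formula.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn C raw scores into C probabilities that sum to one."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_hat: Sequence[float], y: int) -> float:
    """-log(y_hat[y]): the indicator sum keeps only the correct class y."""
    return -math.log(y_hat[y])


probs = softmax([2.0, 1.0, 0.1])  # class 0 has the highest score
print(sum(probs))                 # probabilities sum to one (up to rounding)
print(cross_entropy(probs, 0), cross_entropy(probs, 2))
```

The loss is lower when the correct class is the one the model scores highest (class 0 here) than when it is the lowest-scoring class.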

### Expected Loss

Converts a sample-based loss $l$ into dataset-based loss $L$ </br>
&ensp;&ensp;&ensp;&ensp;$L(\theta\ |\ \mathcal{D})\ =\ \mathbb{E}_{(\text{x},\ \text{y})\sim\mathcal{D}}[l(\theta\ |\ \text{x},\ \text{y})]$
* $l(\theta\ |\ \text{x},\ \text{y})$ is a function of $\theta,\ \text{x},\ \text{y}$
* $L(\theta\ |\ \mathcal{D})$ is a function of $\theta$ only, since the dataset $\mathcal{D}$ is fixed

### Next - Model Fitting

Goal: find parameters $\theta$ </br>
&ensp;&ensp;&ensp;&ensp;$\theta^{*}\ =\ \underset{\theta}{\text{arg min}}L(\theta\ |\ \mathcal{D})$

Deep learning uses **gradient descent**:
* requires gradient $\nabla_{\theta}L(\theta\ |\ \mathcal{D})$
* requires optimizers to update $\theta$
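
To make the two requirements concrete, here is a minimal sketch of plain gradient descent on a scalar linear regression with the L2 loss. It assumes a fixed learning rate and a hand-derived gradient; the names `gradient` and `fit`, the toy data, and the values of `lr` and `steps` are all illustrative choices, and the optimizers used in deep learning practice are more elaborate than this update rule.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]  # (x, y) pairs with scalar x and y


def gradient(w: float, b: float, data: Data) -> Tuple[float, float]:
    """Gradient of L = mean(0.5 * (w*x + b - y)^2) with respect to (w, b)."""
    n = len(data)
    grad_w = sum((w * x + b - y) * x for x, y in data) / n
    grad_b = sum((w * x + b - y) for x, y in data) / n
    return grad_w, grad_b


def fit(data: Data, lr: float = 0.1, steps: int = 2000) -> Tuple[float, float]:
    """Plain gradient descent: theta <- theta - lr * grad L(theta)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = gradient(w, b, data)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy dataset (an assumption for illustration), generated by y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = fit(data)
print(w, b)  # should approach w = 2, b = 1
```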