“A learning model that summarizes data with a **set of parameters** of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at the parametric model, it won’t change its mind about how many parameters it needs.” – Russell and Norvig (2009)

Simply put, **parameterized classification algorithms** learn patterns from the training data.

### Four Components of Parameterized Learning

**Parameterization** is the process of defining the necessary parameters of a given model. In the task of machine learning, parameterization involves defining a problem in terms of four key components: *data*, *a scoring function*, *a loss function*, and *weights and biases*.

#### Input Data
This includes both the data points (e.g., raw pixel intensities, extracted features, etc.) and the label associated with each data point. We represent the data as a **design matrix**. Each **row** of the design matrix represents a **data point** and each **column** corresponds to a different **feature**. We also define a vector $y$ where $y_{i}$ provides the class label of the $i^{th}$ data point.
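
As a minimal sketch, the design matrix and label vector for a tiny, made-up dataset could be laid out as follows. The feature values, the variable names `X` and `y`, and the use of plain Python lists instead of an array library are assumptions made purely for illustration.

```python
# Hypothetical toy dataset: N = 3 data points, D = 4 features each.
# Each row of X is one data point; each column is one feature.
X = [
    [0.2, 0.7, 0.1, 0.9],  # data point 0
    [0.5, 0.1, 0.8, 0.3],  # data point 1
    [0.9, 0.4, 0.6, 0.2],  # data point 2
]

# y[i] is the class label of data point i (encoded here as an integer).
y = [0, 1, 0]

N = len(X)     # number of data points (rows)
D = len(X[0])  # number of features (columns)
print(N, D)    # 3 4
```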

#### Scoring Function
INPUT_IMAGES $\rightarrow$ F(INPUT_IMAGES) $\rightarrow$ OUTPUT_CLASS_LABELS

Here **F** is the scoring function.

#### Loss Function
A loss function quantifies how well our *predicted class labels* agree with our *ground-truth labels*. The higher the level of agreement between these two sets of labels, the lower our loss (and the higher our classification accuracy, at least on the training set).
Simply put, the loss function tells us **how well our predictions match the ground truth**.

#### Weights and Biases
The weight matrix, typically denoted as $W$, and the bias vector $b$ are called the **weights** or
parameters of our classifier that we’ll actually be optimizing. Based on the output of our scoring
function and loss function, we’ll be tweaking and fiddling with the values of the weights and biases
to increase classification accuracy. Our main aim is to **reduce our loss**.

### Linear Classification

Let's start with data. 
**Training set**: $(x_i, y_i)$ for $i = 1, ..., N$, where $x_i$ is a data point and $y_i$ is the label associated with $x_i$. $N$ is the number of data points, each of dimensionality $D$.

$y_i \in \{1, ..., K\}$. In words, $K$ is the number of output categories. For example, for a dog and cat dataset, we have $K=2$.

Now, we must define a **scoring function** $f$ that maps each image to its class label scores. One way to accomplish this is a linear mapping, with a $K \times D$ weight matrix $W$ and a bias vector $b$ of length $K$:

$$f(x_i, W, b)=Wx_i+b$$
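
The linear mapping can be sketched in a few lines of plain Python. The function name `score` and all numeric values are hypothetical; in practice an array library would normally do the matrix-vector product, but the arithmetic is the same.

```python
def score(x, W, b):
    """Return one score per class: f(x, W, b) = W x + b."""
    # Each row of W holds the D weights for one class, so the dot product
    # of that row with x, plus the class bias, gives the class score.
    return [sum(w * v for w, v in zip(row, x)) + b_k
            for row, b_k in zip(W, b)]

# Hypothetical values: D = 3 features, K = 2 classes.
x_i = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 0.25],  # weights for class 0
     [1.0, 0.0, -0.5]]   # weights for class 1
b = [0.1, -0.2]

scores = score(x_i, W, b)
print(scores)  # one score per class, K = 2 values
```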

### Advantages of Parameterized Learning
- Once we are done training our model, we can discard the training data $x_i$ and **keep only the weight matrix $W$ and bias $b$**. This reduces the size of our model.
- Classifying new test data is fast. To perform classification, all we need to do is take the dot product of $W$ and $x_i$ and then add the bias $b$.
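
To illustrate both points, here is a small sketch of prediction with a trained model. The helper `predict`, the parameter values, and the choice of returning the index of the highest score as the predicted class are assumptions for illustration. Note that no training data appears anywhere in the code.

```python
def predict(x, W, b):
    """Return the index of the highest-scoring class for one data point."""
    scores = [sum(w * v for w, v in zip(row, x)) + b_k
              for row, b_k in zip(W, b)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical trained parameters (K = 2 classes, D = 3 features).
# These K*D + K numbers are all that must be stored after training.
W = [[0.5, -1.0, 0.25],
     [1.0, 0.0, -0.5]]
b = [0.1, -0.2]

x_new = [2.0, 0.5, 1.0]       # an unseen data point
print(predict(x_new, W, b))   # index of the predicted class
```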

### What does "learning" actually mean?
In order to actually "learn" the mapping from input data to the labels via our scoring function, we need to discuss two important concepts:
- Loss Functions [76], [77], [78], [79]
- Optimization methods

### Loss Functions
A loss function quantifies how "good" or "bad" a given predictor is at classifying the input data points in a dataset.
Ideally, our loss should decrease over time as we tune our model parameters.

### Multi-class SVM Loss
Multi-class SVM loss is inspired by SVMs, which use a scoring function $f$ to map our **data points** to **numerical scores for each class label**.

$$f(x_i, W, b) = Wx_i+b$$
Now that we have our scoring function, we need to determine how "good" or "bad" it is (given $W$ and $b$) at making predictions. To make this determination, we need a **loss function**.


<img src="files/5.png">

We have access to both $x_i$ and its associated label $y_i$.
Let's abbreviate our scoring function as $s=f(x_i, W)$. The predicted score of the $j$-th class for the $i$-th data point is then:

$$s_j=f(x_i,W)_j$$

Using this notation, we can write the **hinge loss function**:

$$L_i=\sum_{j\neq y_i}\max(0,s_j-s_{y_i}+1)$$

$s_j-s_{y_i}$ is the **score of an incorrect class $j$ minus the score of the correct class $y_i$**.

The **max** operation clamps each term at 0, so a class whose score is at least 1 below the score of the correct class contributes nothing to the loss.

To obtain the loss across the entire training set, we take the mean of the individual $L_i$:

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i$$
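
The two formulas above translate almost directly into code. The following is a minimal sketch in plain Python; the function names, the `margin` parameter (fixed to 1 in the formula), and the example scores are assumptions for illustration.

```python
def hinge_loss(scores, correct_class, margin=1.0):
    """Multi-class SVM (hinge) loss L_i for one data point."""
    s_correct = scores[correct_class]
    # Sum over every class except the correct one, clamping each term at 0.
    return sum(max(0.0, s_j - s_correct + margin)
               for j, s_j in enumerate(scores)
               if j != correct_class)

def mean_hinge_loss(all_scores, labels):
    """Average L_i over the whole dataset: L = (1/N) * sum_i L_i."""
    losses = [hinge_loss(s, y) for s, y in zip(all_scores, labels)]
    return sum(losses) / len(losses)

# Hypothetical scores for N = 2 data points and K = 3 classes.
all_scores = [[3.0, 1.0, -2.0],  # correct class 0 wins by at least the margin
              [0.5, 2.0, 1.5]]   # correct class 2 is outscored by class 1
labels = [0, 2]

print(mean_hinge_loss(all_scores, labels))  # 0.75
```

The first data point contributes a loss of 0 and the second a loss of 1.5, so the mean is 0.75.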

Another related loss function is the **squared hinge loss**:

$$L_i=\sum_{j\neq y_i}\max(0,s_j-s_{y_i}+1)^2$$

The squared term penalizes large margin violations more heavily, since the loss grows quadratically rather than linearly. Choosing between hinge and squared hinge loss is a hyperparameter we need to tune.
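
To make the difference concrete, the sketch below evaluates both variants on the same scores. The function names and the scores are hypothetical.

```python
def hinge_terms(scores, correct_class):
    """Per-class terms max(0, s_j - s_{y_i} + 1) for all j != y_i."""
    s_correct = scores[correct_class]
    return [max(0.0, s_j - s_correct + 1.0)
            for j, s_j in enumerate(scores) if j != correct_class]

def hinge(scores, correct_class):
    return sum(hinge_terms(scores, correct_class))

def squared_hinge(scores, correct_class):
    return sum(t ** 2 for t in hinge_terms(scores, correct_class))

# Hypothetical scores; class 0 is correct but is outscored by class 1.
scores = [1.0, 3.0, -1.0]

print(hinge(scores, 0))          # 3.0
print(squared_hinge(scores, 0))  # 9.0
```

A violation of 3 costs 3 under hinge loss but 9 under squared hinge loss. A violation smaller than 1 would instead be shrunk by squaring.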

#### A Multi-Class SVM loss example
Suppose our scoring function $f(x, W) = Wx+b$ produced the following arbitrary output scores.
<img src="files/6.png">

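Let's look at Image #1.
The sum has two terms: the first is $s_{cat} - s_{actual} + 1$ and the second is $s_{panda} - s_{actual} + 1$, where $s_{actual}$ is the score of the correct class and each term is clamped at 0.
The loss comes out to 0, which means the correct class outscored both other classes by at least the margin of 1.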

In [4]:
max(0, 1.33 - 4.26 + 1) + max(0, -1.01 - 4.26 + 1)

0

Image #2: Here the loss comes out to 5.96, which reflects our wrong prediction: an incorrect class received a higher score than the correct one.

In [6]:
max(0, 3.76 - (-1.20) + 1) + max(0, -3.81 - (-1.20) + 1)

5.96

Image #3

In [7]:
max(0, -2.37 - (-2.27) + 1) + max(0, 1.03 - (-2.27) + 1)

5.199999999999999

Total loss

In [9]:
(0.0 + 5.96 + 5.2) / 3.0

3.72

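Note that our loss was **zero** for only one image. For the other two, the correct class did not beat every other class by the required margin, and in this example both were in fact misclassified, as the scores in the image above show.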
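
The three calculations can also be reproduced in one loop, together with a check of whether each image was classified correctly. This is only a sketch: the scores are the ones used in the cells above, while the class order (dog, cat, panda) and the helper name `hinge_loss` are assumptions. Only the position of the correct class matters for the result.

```python
def hinge_loss(scores, correct_class):
    """Multi-class SVM loss for one data point (margin of 1)."""
    s_correct = scores[correct_class]
    return sum(max(0.0, s_j - s_correct + 1.0)
               for j, s_j in enumerate(scores) if j != correct_class)

# (scores, index of the correct class); assumed order: dog, cat, panda.
examples = [
    ([4.26, 1.33, -1.01], 0),   # Image #1
    ([3.76, -1.20, -3.81], 1),  # Image #2
    ([-2.37, 1.03, -2.27], 2),  # Image #3
]

for number, (scores, correct_class) in enumerate(examples, start=1):
    predicted = scores.index(max(scores))  # highest-scoring class
    loss = hinge_loss(scores, correct_class)
    print(f"Image #{number}: loss={loss:.2f}, "
          f"correct={predicted == correct_class}")

mean_loss = sum(hinge_loss(s, c) for s, c in examples) / len(examples)
print(f"Mean loss: {mean_loss:.2f}")
```

Rounded to two decimals, this agrees with the values 0, 5.96, 5.2 and 3.72 computed above, and it reports only Image #1 as correctly classified.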

### Cross Entropy Loss and Softmax Classifiers
Softmax classifiers give probabilities for each class label, while hinge loss gives a margin.

The scoring function is defined in the same way as for hinge loss: $f(x_i, W)=Wx_i$

Unlike hinge loss, we interpret these scores as **unnormalized log probabilities** for each class label, which amounts to swapping out the hinge loss function for the cross-entropy loss:

$$L_{i}=-\log\left(\frac{e^{s_{y_i}}}{\sum_{j}e^{s_{j}}}\right)$$
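
A minimal sketch of this loss for a single data point, using only the standard library, might look as follows. The function name and the example scores are hypothetical. Subtracting the maximum score before exponentiating is a common numerical-stability step that is not part of the formula itself; it does not change the result.

```python
import math

def cross_entropy_loss(scores, correct_class):
    """L_i = -log(e^{s_{y_i}} / sum_j e^{s_j}) for one data point."""
    # Shifting all scores by the same constant leaves the softmax
    # probabilities unchanged but avoids overflow in exp().
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    prob_correct = exps[correct_class] / sum(exps)
    return -math.log(prob_correct)

# Hypothetical scores for K = 3 classes; class 0 is the correct one.
scores = [2.0, 1.0, 0.1]
print(cross_entropy_loss(scores, 0))  # roughly 0.417
```

The loss approaches 0 as the probability assigned to the correct class approaches 1, and it grows without bound as that probability approaches 0.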
