# Linear Classification
In the last section we introduced image classification using the K-NN algorithm. K-NN has two main disadvantages:
 - The classifier must remember all of the training data and store it for future comparisons with the test data. This is space inefficient because datasets may easily be gigabytes in size.
 - Classifying a test image is expensive since it requires a comparison to all training images.


 **Overview**:
  We are now going to develop a more powerful approach to image classification that we will eventually naturally extend to entire Neural Networks and Convolutional Neural Networks. The approach will have two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels. We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.


# Parameterized mapping from images to label scores
- **score** function
- **loss** function

Let's assume a training dataset of images $x_i \in R^D$, each associated with a label $y_i$. Here $i = 1 \dots N$ and $y_i \in 1 \dots K$. That is, we have N examples (each with a dimensionality D) and K distinct categories. For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixels, and K = 10, since there are 10 distinct classes (dog, cat, car, etc). We will now define the score function $f: R^D \mapsto R^K$ that maps the raw image pixels to class scores.
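As a concrete illustration, one simple choice of score function is a linear map, $f(x_i, W, b) = W x_i + b$. The sketch below uses random values with CIFAR-10 shapes; the function name `linear_scores` and the variables `W` and `b` are assumptions made for illustration, not definitions taken from the text above.

```python
import numpy as np

def linear_scores(x, W, b):
    """Return K class scores for one flattened image x of length D."""
    return W.dot(x) + b

rng = np.random.default_rng(0)
D, K = 3072, 10                          # CIFAR-10: 32*32*3 pixels, 10 classes
x = rng.random(D)                        # stand-in for one flattened image
W = 0.01 * rng.standard_normal((K, D))   # hypothetical weights, one row per class
b = np.zeros(K)                          # hypothetical biases, one per class

scores = linear_scores(x, W, b)
print(scores.shape)  # (10,)
```

Each row of `W` acts as a template for one class, and the resulting vector holds one score per class.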

# Loss and Optimization
## Loss
In the previous lecture we learned to work with models that are configured by a set of hyperparameters. A linear classifier also has values to choose, but its weights are parameters rather than hyperparameters: they are learned from the training data instead of being set by hand. Different parameter values give different predictions, so the question is which parameters we should use in our model. This is where the "loss" comes into the picture. In machine learning we want our model to predict as accurately as possible. Say we want a model that decides whether an image shows a cat: every time it encounters an image, it should correctly predict cat or not cat. When the model fails to predict correctly, we count that as loss. We can therefore define the loss as a measure of the disagreement between the predicted value and the actual value for a single example. The answer is then to select the parameters that result in the lowest loss.

##### Types of loss function
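There are different types of loss functions, and the choice depends on the problem and the algorithm we use. Mean Squared Error (MSE) is a common choice for regression; for linear classifiers, hinge loss (used by the multiclass SVM) and cross-entropy loss are typical choices.
In general, we are given a dataset of examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an image and $y_i$ is an integer label. The loss over the dataset is the average of the losses over the examples.

Loss for example $i$: $L_i = L_i(f(x_i, W), y_i)$

Loss over the dataset: $L = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i)$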
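The averaging step can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: `dataset_loss` and `zero_one_loss` are hypothetical names, the zero-one loss is only a placeholder for whichever per-example loss is chosen, and examples are stored here as rows of `X` for convenience.

```python
import numpy as np

def zero_one_loss(scores, label):
    """Hypothetical per-example loss: 0 if the top score is the true class, else 1."""
    return 0.0 if int(np.argmax(scores)) == label else 1.0

def dataset_loss(per_example_loss, X, y, W):
    """Average a per-example loss over all N examples (rows of X)."""
    losses = [per_example_loss(W.dot(x_i), y_i) for x_i, y_i in zip(X, y)]
    return sum(losses) / len(losses)

W = np.eye(2)                        # toy weights: the scores equal the inputs
X = np.array([[2.0, 1.0],
              [0.0, 3.0],
              [1.0, 0.0]])           # three toy examples with D = 2
y = [0, 1, 1]                        # toy labels; the third example is misclassified

mean_loss = dataset_loss(zero_one_loss, X, y, W)
print(mean_loss)                     # one error out of three examples
```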


Now we will discuss Multiclass SVM loss.
##### Multiclass SVM loss:
![alt text](lossfunc.png "loss function")

For a single training example, we compare the score of the correct class with the score of every other class. If the correct class score exceeds the other class's score by at least a margin of 1, that class contributes 0 to the loss; otherwise it contributes the other score minus the correct score, plus 1. Summing these contributions over all incorrect classes gives $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$, where $s = f(x_i, W)$ is the vector of class scores. This type of loss is called **hinge** loss.

Suppose we have 3 training examples and 3 classes, and that for some W the scores f(x,W)=Wx are as shown below.
For demonstration we store them in a 2D array. Each inner array contains the scores for cat, car, and frog, in that order. The first, second, and third inner arrays are the scores for the cat, car, and frog examples:

cat, car, frog = [[3.2, 5.1, -1.7], [1.3, 4.9, 2.0], [2.2, 2.5, -3.1]]
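The hinge loss for these scores can be worked out directly. The sketch below assumes a margin of 1 and assumes that the correct classes of the three examples are cat, car, and frog in that order (indices 0, 1, 2); the variable names are chosen for illustration.

```python
import numpy as np

# Rows are examples; columns are the scores for [cat, car, frog].
scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9,  2.0],
                   [2.2, 2.5, -3.1]])
labels = np.array([0, 1, 2])   # ASSUMPTION: correct classes are cat, car, frog

rows = np.arange(scores.shape[0])
correct = scores[rows, labels][:, np.newaxis]     # correct-class score per example
margins = np.maximum(0, scores - correct + 1.0)   # hinge term for every class
margins[rows, labels] = 0                         # the correct class adds no loss

per_example = margins.sum(axis=1)
mean_loss = per_example.mean()
print(np.round(per_example, 2))   # approximately 2.9, 0.0, 12.9
print(round(mean_loss, 2))        # approximately 5.27
```

The first example is penalized because the car score (5.1) is higher than the cat score (3.2). The second has zero loss because the car score beats both other scores by more than 1. The third is penalized heavily because the frog score is the lowest of the three.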












In [None]:
def L_i(x, y, W):
  """
  unvectorized version. Compute the multiclass svm loss for a single example (x,y)
  - x is a column vector representing an image (e.g. 3073 x 1 in CIFAR-10)
    with an appended bias dimension in the 3073-rd position (i.e. bias trick)
  - y is an integer giving index of correct class (e.g. between 0 and 9 in CIFAR-10)
  - W is the weight matrix (e.g. 10 x 3073 in CIFAR-10)
  """
  delta = 1.0 # see notes about delta later in this section
  scores = W.dot(x) # scores becomes of size 10 x 1, the scores for each class
  correct_class_score = scores[y]
  num_classes = W.shape[0] # number of classes, e.g. 10
  loss_i = 0.0
  for j in range(num_classes): # iterate over all classes
    if j == y:
      # skip for the true class to only loop over incorrect classes
      continue
    # accumulate loss for the i-th example
    loss_i += max(0, scores[j] - correct_class_score + delta)
  return loss_i

In [1]:
import numpy as np

def L_i_vectorized(x, y, W):
  """
  A faster half-vectorized implementation. half-vectorized
  refers to the fact that for a single example the implementation contains
  no for loops, but there is still one loop over the examples (outside this function)
  """
  delta = 1.0
  scores = W.dot(x)
  # compute the margins for all classes in one vector operation
  margins = np.maximum(0, scores - scores[y] + delta)
  # on the y-th position scores[y] - scores[y] cancels and leaves delta. We want
  # to ignore the y-th position and only sum the margins of the wrong classes
  margins[y] = 0
  loss_i = np.sum(margins)
  return loss_i


In [2]:
import numpy as np

## Fully vectorized implementation
def L(X, y, W):
    """
    Fully-vectorized implementation of the loss function.
    - X: Holds all the training examples as columns (e.g., 3073 x 50,000 in CIFAR-10)
    - y: Array of integers specifying the correct class (e.g., 50,000-D array)
    - W: Weights (e.g., 10 x 3073)
    """
    delta = 1.0
    num_examples = X.shape[1]
    example_idx = np.arange(num_examples)

    # compute scores for all classes and all examples (e.g., 10 x 50,000)
    scores = np.dot(W, X)

    # select the score of the correct class for each example
    correct_class_scores = scores[y, example_idx]

    # compute the hinge margins; the correct-class scores broadcast across rows
    margins = np.maximum(0, scores - correct_class_scores + delta)
    margins[y, example_idx] = 0  # the correct class contributes no loss

    # sum the margins over classes for each example, then average over examples
    loss = np.sum(margins) / num_examples
    return loss
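
One way to gain confidence in a vectorized loss is to compare it with a plain-loop reference on small random inputs. The sketch below is self-contained and stores examples as columns of `X`; the names `svm_loss_loop` and `svm_loss_vectorized` are hypothetical, and both functions sum the hinge terms over the wrong classes before averaging over examples.

```python
import numpy as np

def svm_loss_loop(X, y, W, delta=1.0):
    """Reference version: explicit loops over examples and classes (X is D x N)."""
    num_classes, num_examples = W.shape[0], X.shape[1]
    total = 0.0
    for i in range(num_examples):
        scores = W.dot(X[:, i])
        for j in range(num_classes):
            if j != y[i]:
                total += max(0.0, scores[j] - scores[y[i]] + delta)
    return total / num_examples

def svm_loss_vectorized(X, y, W, delta=1.0):
    """Same loss with no Python loops."""
    num_examples = X.shape[1]
    cols = np.arange(num_examples)
    scores = W.dot(X)                                  # K x N score matrix
    margins = np.maximum(0, scores - scores[y, cols] + delta)
    margins[y, cols] = 0                               # skip the correct class
    return margins.sum() / num_examples

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))      # D = 5 features, N = 8 examples
y = rng.integers(0, 3, size=8)       # labels in {0, 1, 2}
W = rng.standard_normal((3, 5))      # K = 3 classes

loop_loss = svm_loss_loop(X, y, W)
vec_loss = svm_loss_vectorized(X, y, W)
print(np.isclose(loop_loss, vec_loss))
```

If the two values disagree, a common cause is averaging over every entry of the margin matrix instead of summing over classes first.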
