# Linear Classification
In the last section we introduced image classification using K-NN algorithm. K-NN has disadvantages.
 - The classifier must remember all of the training data and store it for future comparisons with the test data. This is space inefficient because datasets may easily be gigabytes in size.
 - Classifying a test image is expensive since it requires a comparison to all training images.


 **Overview**:
  We are now going to develop a more powerful approach to image classification that we will eventually naturally extend to entire Neural Networks and Convolutional Neural Networks. The approach will have two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels. We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.


# Parameterized mapping from images to label scores
- **score** function
- **loss** function

let’s assume a training dataset of images $x_i \in R^D$
, each associated with a label yi
. Here $ i=1…N $
 and $ yi∈1…K $
. That is, we have N examples (each with a dimensionality D) and K distinct categories. For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixels, and K = 10, since there are 10 distinct classes (dog, cat, car, etc). We will now define the score function $ f:R^D\mapsto R^K$
 that maps the raw image pixels to class scores.

# Linear Classifier
 $ f(x_i, W, b) =  W x_i + b$


 In the above equation, we are assuming that the image xi
 has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix W (of size [K x D]), and the vector b (of size [K x 1]) are the parameters of the function. In CIFAR-10, xi
 contains all pixels in the i-th image flattened into a single [3072 x 1] column, W is [10 x 3072] and b is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores). The parameters in W are often called the weights, and b is called the bias vector because it influences the output scores, but without interacting with the actual data xi
. However, you will often hear people use the terms weights and parameters interchangeably.

# Interpreting a linear classifier as template matching
- each row of weight vector is responsible for a specified class. Therefore each row can be said to be a template.

- To label an image we need to comparee the image with all the template and see which template matches with the image.

**Problem**
- ex1: suppose a dataset has different color of image and having maximum colors with red. Then the template of car class would mostly recognize red car but not the cars of other color.

**solution**
- introducing neural network.a neural network will be able to develop intermediate neurons in its hidden layers that could detect specific car types (e.g. green car facing left, blue car facing front, etc.), and neurons on the next layer could combine these into a more accurate car score through a weighted sum of the individual car detectors.

# Bias trick
 it is cumbersome to keep track of two sets of parameters. Therefore we combine these two into one. For this-

 -by extending the vector xi
 with one additional dimension that always holds the constant 1 and a default bias dimension.

 Now the score function looks like:  $f(x_i,W)=W x_i$

 **New Dimensions**
 $x_i$ =[3073 X 1] , $W$=[10 X 3073]

# Image data processing
Its always recommended to normalize data before training.

**Concept**: Centering the dataset by subtracting the mean from every feature. 
- range of the values of dataset:

        - [-1,1] :zero mean centering
        - [-127,127] we will do in each image processing
      

# Loss and Optimization
## Loss
In the previous lecture we learned to work with data having a set of hyperpaarameters. For linear classifier weights are hyperparameters. For different set of hyperparameters we get different predictions. The questions is , which hyperparameters should we use in our model. In other words, which parameter set is best to use in our model. To get intuition about this question "loss" comes to the picture. In machine learning we want our model to predict as much as perfect. Say we want our a model to classify an image as cat. We want our model that every time it encounters an image it correctly predict as cat or not a cat. When model fails to predict correctly we consider it as loss. Therefore we can define loss as the difference between the predicted value and the actual value for a single example. Now the answer is, we will select that hyperparameter set that results the least loss.

##### Types of loss function
There are different types of loss function. Depending on the problem set and algorithm we use. In image classification for multiclass problem we can use two types of loss function.
  - Multiclass Support Vector Machine Loss
  - Cross Entropy Loss
**Multiclass SVM loss(hinge loss)**:
    - **Goal**:  The SVM loss is set up so that the SVM “wants” the correct class for each image to a have a score higher than the incorrect classes by some fixed margin Δ

    $L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)$
    
  Since we are working with linear score function we can write it as: $L_i = \sum_{j\neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$

# Implementation of loss function

In [1]:
def L_i_unvectorized(x, y, W):
  """
  unvectorized version. Compute the multiclass svm loss for a single example (x,y)
  - x is a column vector representing an image (e.g. 3073 x 1 in CIFAR-10)
    with an appended bias dimension in the 3073-rd position (i.e. bias trick)
  - y is an integer giving index of correct class (e.g. between 0 and 9 in CIFAR-10)
  - W is the weight matrix (e.g. 10 x 3073 in CIFAR-10)
  """
  delta = 1.0 # see notes about delta later in this section
  scores = W.dot(x) # scores becomes of size 10 x 1, the scores for each class
  correct_class_score = scores[y]
  D = W.shape[0] # number of classes, e.g. 10
  loss_i = 0.0
  for j in range(D): # iterate over all wrong classes
    if j == y: #here y is an integer
      # skip for the true class to only loop over incorrect classes
      continue
    # accumulate loss for the i-th example
    loss_i += max(0, scores[j] - correct_class_score + delta)
  return loss_i

**Explanation**: takes input only a single imae as a (3073 X 1) vector ,a single integer value y{y=0-9 : number of classes} and a weight matrix of shape(num of levels, num of features). calculates inner product wx and returns a vector of size (num of lavels,1).

In [2]:
import numpy as np

def L_i_half_vectorized(x, y, W):
  """
  A faster half-vectorized implementation. half-vectorized
  refers to the fact that for a single example the implementation contains
  no for loops, but there is still one loop over the examples (outside this function)
  """
  delta = 1.0
  scores = W.dot(x)
  # compute the margins for all classes in one vector operation
  margins = np.maximum(0, scores - scores[y] + delta)
  # on y-th position scores[y] - scores[y] canceled and gave delta. We want
  # to ignore the y-th position and only consider margin on max wrong class
  margins[y] = 0
  loss_i = np.sum(margins)
  return loss_i


**Note** vecotr- single scalar value =vector

In [3]:
v=np.ones((4,1))
print(v)
print("after minus the new val is ",v-.5)


[[1.]
 [1.]
 [1.]
 [1.]]
after minus the new val is  [[0.5]
 [0.5]
 [0.5]
 [0.5]]


In [4]:
import numpy as np
## Fully vectorized implementation
def L(X,y,W):
     
     """
    Fully-vectorized implementation of the loss function.
    - X: Holds all the training examples as columns (e.g., 3073 x 50,000 in CIFAR-10)
    - y: Array of integers specifying the correct class (e.g., 50,000-D array)
    - W: Weights (e.g., 10 x 3073)
    """
     #compute scores for all classes
     scores=np.dot(W,X)

     #select only the scores for the correct class for each example
     correct_class_scores=scores[y,np.arange(X.shape[1])] # y is an array indicating row indices and np.arrange(X.shape[1] is indicationg col indices 1-49999). correct classs score dim(50000,1)
     #compute the hinge loss for all classes
     margins=np.maximum(0,scores-correct_class_scores+1)
     margins[y,np.arange(X.shape[1])]=0  #set loss for correct class 0
     #compute the overall loss as the average of all hinge losses
     loss=np.mean(margins)
     return loss


**dimensions**:
- X- (3073 X 50000) represents the dataset. each colum represents an example and each row represents number of features. this dim is for CIFAR-10 dataset 
- y -(50000,1) is a 1-D array holding the values of index of the correct class for each training example
- W-(num of class levels=10,num of features=3073) #actually 3072, we have added intercept term using bias trick
-scores-(num of class levesl=10,number of training examples=500000)
- correct_class_scores - (50000,) /1-D array having 50000 values: Unlike y it doesn't contain the index label rather it contains the score values of the correct classes.
- margins- (shape of scores(10,50000)).This can be understood from below numpy operational code.

In [5]:
# if num of columns of two numpy array equal then they can be subtracted elementwise
a=np.ones((5,3))
b=[1,2,3]
print(a)
print(a-b)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[[ 0. -1. -2.]
 [ 0. -1. -2.]
 [ 0. -1. -2.]
 [ 0. -1. -2.]
 [ 0. -1. -2.]]


# Q&A(Multiclass SVM loss function)

**Q1**: Suppose we get L_i=0 for a specified class. What happens if we change a littele bit score corresponding to this class?

 - **Answer**: The loss will still be same(0 in this case). Because its score value is already greter than the other scores of other classes.

**Q2**:Whata is the min max possibe loss?

 - **Answer**: Min=0 and max = infinity. 
      - Min: Because if the differences between the scores is negative then the function returns max value zero.
      - Max: Because from the equation we can say that if the correct scores gets very very negative score. Then we accumulate infinity loss.

**Q3**: At initialization W is small so all s ≈ 0.What is the loss? 
 - **Answer**: Number of classes minus one (C-1). The predicted score and the original class score will be same .therfore we will get error 1 for the delta term. And remember we don't calculate for when  $j=y_i$. Thats why minus one.

**Q4**:What if the sum was over all classes?(including $j = y_i$)?
 - **Answer**: Loss increases by 1. Because for this correct class loss=0 but we get the delta value 1.

**Q5**:What if we used mean instead ofsum? ?
 - **Answer**: Doesn't change. because the number of classes remains constant all the time. It only rescale the loss.

**Q6**:What if we use suared of the loss function ?
 - **Answer**: would be a different loss function

# Regularization

Why do we use regularization?
- To avoid overfitting 
- To generalize
- model should be simple.So that works better on test data

**Concept**:
When the number of features is high then the training data fits the model perfectly. But doesn't show a good result in test case. This phenomenon is known as overfitting(High Variance). We know that the trainig examples ($x_i,y_i$) is fixed. We can not change them. Thus some features may have values much more higher than other features. So these exteme features can have great impact on the score function. To overcome this situation we can only penalize thsese values only by penalizing the wieghts. Because by penalizing weights we can control features. There are many regularized loss term availabe. But we will use squared L2 regularized term. $R(W) = \sum_k\sum_l W_{k,l}^2$. Byadding this term to loss function we have now two parts in loss function. One is **data loss** and the **regularization loss**

$L =  \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \lambda R(W) }_\text{regularization loss} \\\\$

Expanding this out in its full form:
$L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k\sum_l W_{k,l}^2$

**Example** Suppose we have $x=[1,1,1,1]$ ,$w1=[1,0,0,0]$, $w2=[0.25,0.25,0.25,0.25]$. 
Here,$w_1^Tx = w_2^Tx = 1$ But according to L2 regularization w2 is preferable. Because the sum of the suqred values of w2(.5)< w1(1). This demonstrates that regularized loss for w2 is less than w1.

**Lambda**: Hyperparameters which plays role for trade off between data loss and regualrization loss.



# Implemetation of loss with regularized term


In [6]:
import numpy as np
## Fully vectorized implementation
#so far this is the final svm loss and we will use it in future when to need
def svm_loss(X,y,W,Lambda):
     
     """
    Fully-vectorized implementation of the loss function.
    - X: Holds all the training examples as columns (e.g., 3073 x 50,000 in CIFAR-10)
    - y: Array of integers specifying the correct class (e.g., 50,000-D array)
    - W: Weights (e.g., 10 x 3073)
    """
     #compute scores for all classes
     scores=np.dot(W,X)
     #select only the scores for the correct class for each example
     correct_class_scores=scores[y,np.arange(X.shape[1])] # y is an array indicating row indices and np.arrange(X.shape[1] is indicationg col indices 1-49999). correct classs score dim(50000,1)
     #compute the hinge loss for all classes
     margins=np.maximum(0,scores-correct_class_scores+1)
     margins[y,np.arange(X.shape[1])]=0  #set loss for correct class 0
     #compute the overall loss as the average of all hinge losses
     loss=np.mean(margins)

     # calclute the regularized term :squared sum of all the weights
     reg_term=Lambda*(np.sum(W * W))  #we can also multiply .5 for convenience
     reg_loss=loss+reg_term
     return reg_loss


#### **Setting delta**
we can set delta always 1. The hyperparameters Δ
 and λ
 seem like two different hyperparameters, but in fact they both control the same tradeoff: The tradeoff between the data loss and the regularization loss in the objective. The key to understanding this is that the magnitude of the weights W
 has direct effect on the scores (and hence also their differences): As we shrink all values inside W
 the score differences will become lower, and as we scale up the weights the score differences will all become higher. Therefore, the exact value of the margin between the scores (e.g. Δ=1
, or Δ=100
) is in some sense meaningless because the weights can shrink or stretch the differences arbitrarily. Hence, the only real tradeoff is how large we allow the weights to grow (through the regularization strength λ
).

**other Multiclass SVM formulations:**
- OneVsAll
  -  trains an independent binary SVM for each class vs. all other classes.
   

# Softmax Classifier (Multinomial Logistic Rregression)
the Softmax classifier is generalization of binary logistic regression classifier to multiple classes.
scores=unnormalized log probabilities of the classse  ; $s=f(x_i,W)$
**Steps**
  - we take the score from $x=f(x_i,W)$
  - we exponentiate them so that they become positive
  - we normalize them by sum of the exponents

The last two steps is done by a function named softmax function. This function ends up to a vector of probabilities corresponding to each classes. The sum of all the elements is 1.

$P(Y=k | X=x_i)= \frac{e^{s_k}}{\sum_k e^{s_j}}$ ; outputs the pobability of class k (k=1 to K)

Here, $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$ is the softmax function


##### Softmax loss Function(Cross Entropy Loss):
  **Target** : Want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class. In other words the loss fucntion tries to make the correct class probability near to 1 and the other class probability near to 0. Thus the loss functin is Negative log of the probability of the correct class.
  $L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L_i = -f_{y_i} + \log\sum_j e^{f_j}$
  
  The equation says, minimize the negative log likelihood of the correct class which can be interpreted as performing Maximum Likelihood Estimation (MLE).




##### Practical Issues:
The intermediate terms $e^{f_{y_i}}$ and $ \sum_j e^{f_j}$  may be very large due to the exponentials. Dividing large numbers can be numerically unstable. for stability we multiply neumeratkor and denominator by a constant C and we get,
$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}
= \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}}
= \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$

**Value of C**: we can take any value. A common choice is to set $logC= -max_j f_j$

Other forms of cross-entropy loss
- **binary Classificaion**: $L(y_{\text{pred}}, y_{\text{true}}) = - (y_{\text{true}} \cdot \log(y_{\text{pred}}) + (1 - y_{\text{true}}) \cdot \log(1 - y_{\text{pred}}))
$

- **Multiclass classification**: $L(y_{\text{pred}}, y_{\text{true}}) = - \sum_{i} y_{\text{true}} \cdot \log(y_{\text{pred}})
$

In [7]:
def softmax(X):
    """takes a input matrix and applies exponent to each element and then find the probility of each 
    element corresponding a class
    """
    #here X is score fuction
    X=np.exp(X)
    den=np.sum(X,axis=0) # dim ()
    prob_scores=X/den  # now X is a matrix of shape(10,50000) which contains probability of the each class for each value
    return prob_scores

Now we will implement softmax function considering practical issues with the updated equation .
Updated equation: $\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}
= \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}}
= \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$

In [8]:
# stable softmax function
def s_softmax(X):
    X-=np.max(X,axis=0)
    X=np.exp(X)
    den=np.sum(X,axis=0)
    prob_scores=X/den
    return prob_scores


In [9]:
#without numeric stability and regularized terms
#implementing cross entropy loss for multinomial logistic regression(softmax calssifier
def cross_entropy_loss(X,y,W):
    raw_scores=np.dot(W,X) #score funcion
    #get each class probability value by applying softmax function
    prob_scores=softmax(raw_scores)
    # Now we will write code for first two forms described above
    ce_loss_1=np.mean(-np.log(prob_scores[y,np.arange(prob_scores.shape[1])])) #first form of loss function
    #ce_loss_2=np.mean(-np.sum(raw_scores[y,np.arange(raw_scores.shape[0])]) + np.sum(np.log(np.sum(prob_scores,axis=0))))
    return ce_loss_1 #only one is enough .but computed  to see which perfoms better

here we used unstable softmax function. we can also use stable one and check which performs better in our model. will apply in future....

#### Naming convention:
 Multiclass SVM loss/ hinge loss is sometimes called max-margin loss. Technically softmax function is a squashing function

### SVM vs Softmax:
**SVM**
 - through score function wants that the correct class score value is higher than the other class scores by a margin.
 -The SVM does not care about the details of the individual scores
**Softmax**
 - The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently the negative of it to be low)
 - the Softmax classifier allows us to compute “probabilities” for all labels. 

# Optimization

### Introduction:

**linear score function** : $f(x_i, W) =  W x_i$

**SVM loss**: $L=1N∑i∑j≠yi[max(0,f(xi;W)j−f(xi;W)yi+1)]+αR(W)$

We saw that a setting of the parameters Wthat produced predictions for examples xiconsistent with their ground truth labels yiwould also have a very low loss L. We are now going to introduce the third and last key component: optimization. Optimization is the process of finding the set of parameters W that minimize the loss function.

**Goal**: understanding the interaction among **score** function, **loss** fucntion and **optimization** these three components.

# Visualising the loss function

$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + 1) \right]$

It is clear from the equation that the data loss for each example is a sum of (zero-thresholded due to the max(0,−) function) linear functions of W

**Sign in front of W**:
- +ve sign: it means that the value of weight is positively corelated with the likelihood of the corresponding class. In other words increasing the value of w will increase the chances of being classified as the correct class.
- -ve sign: weights are negatively correlated.

The above equation involves data loss and thresholding, the sign of the weight determines whether the contribution of that weight is considered in the calculation of the loss. Positive weights associated with incorrect class labels (where the model predicted the wrong class) contribute positively to the data loss. On the other hand, negative weights associated with correct class labels contribute negatively to the data loss, reducing it.

**Shape of loss function**:

SVM loss is convex function. Once we extend our score functions f to Neural Networks our objective functions will become non-convex, and the visualizations of those non-convex function will not feature bowls but complex, bumpy terrains.


### Optimization

**Goal** : The goal of optimization is to find W that minimizes the loss function. We will now motivate and slowly develop an approach to optimizing the loss function. There are different optimization function based on the type of loss function. For example, for convex function we can use gradient optimization. Our final goal is to optimize neural networks where we can't easily find any of the tools developed in the convex optimization literature.

First we will talk for convex function. In this case for hinge loss.
Approaches we will try-
 - Random Search
 - Random Local Search
 - Following the gradient

#### Strategy #1: A first very bad idea solution: Random search
Since it is so simple to check how good a given set of parameters W is, the first (very bad) idea that may come to mind is to simply try out many different random weights and keep track of what works best. This procedure might look as follows:

In [10]:
# First we need to import our data and split it into train test
import numpy as np
import matplotlib.pyplot as plt
import pickle

In [11]:
def unpickle(file):
    """load the cifar-10 data"""

    with open(file, 'rb') as fo:
        data = pickle.load(fo, encoding='bytes')
    return data

In [12]:
def load_cifar_10_data(data_dir, negatives=False):
    """
    Return train_data, train_filenames, train_labels, test_data, test_filenames, test_labels
    """

    # get the meta_data_dict
    # num_cases_per_batch: 1000
    # label_names: ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
    # num_vis: :3072

    meta_data_dict = unpickle(data_dir + "/batches.meta")
    cifar_label_names = meta_data_dict[b'label_names']
    cifar_label_names = np.array(cifar_label_names)

    # training data
    cifar_train_data = None
    cifar_train_filenames = []
    cifar_train_labels = []

    # cifar_train_data_dict
    # 'batch_label': 'training batch 5 of 5'
    # 'data': ndarray
    # 'filenames': list
    # 'labels': list

    for i in range(1, 6):
        cifar_train_data_dict = unpickle(data_dir + "/data_batch_{}".format(i))
        if i == 1:
            cifar_train_data = cifar_train_data_dict[b'data']
        else:
            cifar_train_data = np.vstack((cifar_train_data, cifar_train_data_dict[b'data']))
        cifar_train_filenames += cifar_train_data_dict[b'filenames']
        cifar_train_labels += cifar_train_data_dict[b'labels']

    cifar_train_data = cifar_train_data.reshape((len(cifar_train_data), 3, 32, 32))
    if negatives:
        cifar_train_data = cifar_train_data.transpose(0, 2, 3, 1).astype(np.float32)
    else:
        cifar_train_data = np.rollaxis(cifar_train_data, 1, 4)
    cifar_train_filenames = np.array(cifar_train_filenames)
    cifar_train_labels = np.array(cifar_train_labels)

    # test data
    # cifar_test_data_dict
    # 'batch_label': 'testing batch 1 of 1'
    # 'data': ndarray
    # 'filenames': list
    # 'labels': list

    cifar_test_data_dict = unpickle(data_dir + "/test_batch")
    cifar_test_data = cifar_test_data_dict[b'data']
    cifar_test_filenames = cifar_test_data_dict[b'filenames']
    cifar_test_labels = cifar_test_data_dict[b'labels']

    cifar_test_data = cifar_test_data.reshape((len(cifar_test_data), 3, 32, 32))
    if negatives:
        cifar_test_data = cifar_test_data.transpose(0, 2, 3, 1).astype(np.float32)
    else:
        cifar_test_data = np.rollaxis(cifar_test_data, 1, 4)
    cifar_test_filenames = np.array(cifar_test_filenames)
    cifar_test_labels = np.array(cifar_test_labels)

    return cifar_train_data, cifar_train_filenames, cifar_train_labels, \
        cifar_test_data, cifar_test_filenames, cifar_test_labels, cifar_label_names

In [13]:
cifar_10_dir = 'dataset\cifar10'
train_data, train_filenames, train_labels, test_data, test_filenames, test_labels, label_names = load_cifar_10_data(cifar_10_dir)
print("Train data: ", train_data.shape)
print("Train filenames: ", train_filenames.shape)
print("Train labels: ", train_labels.shape)
print("Test data: ", test_data.shape)
print("Test filenames: ", test_filenames.shape)
print("Test labels: ", test_labels.shape)
print("Label names: ", label_names.shape)

Train data:  (50000, 32, 32, 3)
Train filenames:  (50000,)
Train labels:  (50000,)
Test data:  (10000, 32, 32, 3)
Test filenames:  (10000,)
Test labels:  (10000,)
Label names:  (10,)


In [14]:
#Flatten out all the images to be one dimensional
Xtr_rows = train_data.reshape(train_data.shape[0], 32 * 32 * 3) # Xtr_rows becomes 50000 x 3072
Xte_rows = test_data.reshape(test_data.shape[0], 32 * 32 * 3) # Xte_rows becomes 10000 x 3072

In [15]:
print(Xtr_rows.shape)  
print(Xte_rows.shape)

(50000, 3072)
(10000, 3072)


In [16]:
#adding bias term
bias_term=np.ones((50000,1))
#np.expand_dims(bias_term,axis=0)
Xtr_rows=np.hstack((Xtr_rows,bias_term))

In [17]:
Xtr_rows.shape

(50000, 3073)

Earlier we implemented two types of loss funciton. One is multiclass svm loss and another is cross entropy loss. Now we will claculate loss using both funcion and compare.

In [18]:
Lambda=.001
best_loss_svm=float("inf")
num_labels=10  #for CIFAR-10 data there are 10 classes
num_features=Xtr_rows.shape[1]
for num in range(1000):
    W_svm=np.random.randn(num_labels,num_features)*.0001
    cost_svm=svm_loss(Xtr_rows.T,train_labels,W_svm,Lambda)
    if cost_svm< best_loss_svm:
        best_loss_svm=cost_svm
        best_W_svm=W_svm
    print("in attempt %d the loss is %f, best %f"%(num+1,cost_svm,best_loss_svm))

in attempt 1 the loss is 0.918190, best 0.918190
in attempt 2 the loss is 1.036480, best 0.918190
in attempt 3 the loss is 0.968574, best 0.918190
in attempt 4 the loss is 1.009611, best 0.918190
in attempt 5 the loss is 0.949365, best 0.918190
in attempt 6 the loss is 1.085429, best 0.918190
in attempt 7 the loss is 0.995048, best 0.918190
in attempt 8 the loss is 0.936893, best 0.918190
in attempt 9 the loss is 0.942530, best 0.918190
in attempt 10 the loss is 0.957298, best 0.918190
in attempt 11 the loss is 0.951091, best 0.918190
in attempt 12 the loss is 0.917681, best 0.917681
in attempt 13 the loss is 0.991102, best 0.917681
in attempt 14 the loss is 0.989071, best 0.917681
in attempt 15 the loss is 1.163146, best 0.917681
in attempt 16 the loss is 0.948098, best 0.917681
in attempt 17 the loss is 0.961646, best 0.917681
in attempt 18 the loss is 1.020111, best 0.917681
in attempt 19 the loss is 0.977138, best 0.917681
in attempt 20 the loss is 0.950579, best 0.917681
in attemp

Now the best W we get from the above iterations, we take it and try it out on the test set.

In [19]:
print(Xte_rows.shape)

(10000, 3072)


In [20]:
#adding intercept term to the test set
Xte_rows=np.hstack((Xte_rows,np.ones((10000,1))))
print(Xte_rows.shape)

(10000, 3073)


In [21]:
#calculating scores on the optimized W and see its performance
scores_svm=np.dot(best_W_svm,Xte_rows.T)
#find the index with max score in each column(the predicted class)
y_pred_svm=np.argmax(scores_svm,axis=0)
# calculating the accuracy(fractions of predictions that are correct)
acc_svm=np.mean(y_pred_svm==test_labels) 
print(f"By random search optimization the accuracy is : {acc_svm*100:.2f} %")


By random search optimization the accuracy is : 15.12 %


Lets calculate accuray using cross-entropy loss and random search optimization

In [22]:
# cross_entropy_loss(X,y,W):
best_loss_softmax=float("inf")
for i in range(1000):
    W_softmax=np.random.randn(num_labels,num_features)*.0001
    cost_softmax=cross_entropy_loss(Xtr_rows.T,train_labels,W_softmax)
    if cost_softmax<best_loss_softmax:
        best_loss_softmax=cost_softmax
        best_W_softmax=W_softmax
    print(f"in attepmt {i+1}: loss={cost_softmax},best:{best_loss_softmax}")

in attepmt 1: loss=2.534771973556108,best:2.534771973556108
in attepmt 2: loss=2.496803823326509,best:2.496803823326509
in attepmt 3: loss=2.574885307941612,best:2.496803823326509
in attepmt 4: loss=2.5512826396513484,best:2.496803823326509
in attepmt 5: loss=2.632487635401605,best:2.496803823326509
in attepmt 6: loss=2.547244826827777,best:2.496803823326509
in attepmt 7: loss=2.74719446598005,best:2.496803823326509
in attepmt 8: loss=2.5353122401987265,best:2.496803823326509
in attepmt 9: loss=2.442049545465741,best:2.442049545465741
in attepmt 10: loss=2.498904823721318,best:2.442049545465741
in attepmt 11: loss=2.496090133917194,best:2.442049545465741
in attepmt 12: loss=2.605557169530106,best:2.442049545465741
in attepmt 13: loss=2.4243168511891473,best:2.4243168511891473
in attepmt 14: loss=2.473293055082412,best:2.4243168511891473
in attepmt 15: loss=2.5664818384372383,best:2.4243168511891473
in attepmt 16: loss=2.431324371116032,best:2.4243168511891473
in attepmt 17: loss=2.8285

Now let's calculate accuracy on test set using the optimized W we get from above 1000 iterations

In [23]:
#calcualte scores
scores_softmax=np.dot(best_W_softmax,Xte_rows.T)
# find the index with max score in each column (the predicted class) 
y_pred_softmax=np.argmax(scores_softmax.shape[0])
#calculate accuracy (fractions of predictions that are correct)
acc_softmax=np.mean(y_pred_softmax==test_labels)
print(f"softmax accuracy using random optimization : {acc_softmax*100:.2f}%")


softmax accuracy using random optimization : 10.00%


This score is higher than the svm score. This is because cross-entropy loss was higher than svm loss.

#### Core idea: iterative refinement. 
- The core idea is that finding the best set of weights W is a very difficult or even impossible problem (especially once W contains weights for entire complex neural networks), 
- our approach will be to start with a random W and then iteratively refine it, making it slightly better each time.

### Strategy #2: Random Local Search
**algorithm**:
- start out with a random W
- generate random perturbations δW to it and if the loss at the perturbed W+δW is lower, we will perform an update. The code for this procedure is as follows:

In [24]:
# rl = random local in the following code
W_rl=np.random.randn(num_labels,num_features)*.0001
best_loss_rl=float("inf")
for i in range(1000):
    W_try_rl=W_rl+np.random.randn(num_labels,num_features)*.0001
    cost_rl=svm_loss(Xtr_rows.T,train_labels,W_try_rl,Lambda)
    if cost_rl<best_loss_rl:
        best_loss_rl=cost_rl
        W_rl=W_try_rl   #updating w
    print(f"in attempt {i+1} loss : {cost_rl}, best: {best_loss_rl}")

in attempt 1 loss : 1.1649808937479262, best: 1.1649808937479262
in attempt 2 loss : 1.1039552175624388, best: 1.1039552175624388
in attempt 3 loss : 1.3894803375182812, best: 1.1039552175624388
in attempt 4 loss : 1.1856643763894312, best: 1.1039552175624388
in attempt 5 loss : 1.1085587419545186, best: 1.1039552175624388
in attempt 6 loss : 1.2780012400006688, best: 1.1039552175624388
in attempt 7 loss : 1.1886816106473403, best: 1.1039552175624388
in attempt 8 loss : 1.1767642400766118, best: 1.1039552175624388
in attempt 9 loss : 1.1801137959929822, best: 1.1039552175624388
in attempt 10 loss : 1.257298014237021, best: 1.1039552175624388
in attempt 11 loss : 1.202887729604207, best: 1.1039552175624388
in attempt 12 loss : 1.1904178517855764, best: 1.1039552175624388
in attempt 13 loss : 1.1338280844121553, best: 1.1039552175624388
in attempt 14 loss : 1.3352020500034252, best: 1.1039552175624388
in attempt 15 loss : 1.1901710723770733, best: 1.1039552175624388
in attempt 16 loss : 

In [25]:
# accuracy on test scores
#score
scores_rl=np.dot(W_rl,Xte_rows.T)
# find the index with max scores with each column
y_pred_rl=np.argmax(scores_rl.shape[0])
# find the accuracy
acc_rl=np.mean(y_pred_rl==test_labels)
print(f"accuracy using random local search is {acc_rl*100:.2f}")

accuracy using random local search is 10.00


A little bit accuracy score is increased. If we increase the number of iteratin we would get W that would give more accuracy score on test data.Try 1000 iterations instead of 10

### Strategy #3: Following the Gradient
**Concept** In the previous section we tried to find a direction in the weight-space that would improve our weight vector. We can use gradient to choose the best direction along which we would get W that minimizes the loss mathematically. This direction is related to the gradient of the loss function.

The gradient is a generalization of slope for functions that don’t take a single number but a vector of numbers. Mathematical expression for the derivative of a 1-D function,
 $\frac{df(x)}{dx} = \lim_{h\ \to 0} \frac{f(x + h) - f(x)}{h}$



#### Computing the gradient
Two ways-
 - **Numerical Gradient**: fast,easy,approximate
 - **Analytic Gradient**:fast,exact,most error prone

#### Computing the gradient numerically with **finite** differences


In [26]:
# def eval_numerical_gradient(f, x):
#   """
#   a naive implementation of numerical gradient of f at x
#   - f should be a function that takes a single argument
#   - x is the point (numpy array) to evaluate the gradient at
#   """

#   fx = f(x) # evaluate function value at original point
#   grad = np.zeros(x.shape)
#   h = 0.00001

#   # iterate over all indexes in x
#   it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
#   while not it.finished:

#     # evaluate function at x+h
#     ix = it.multi_index
#     old_value = x[ix]
#     x[ix] = old_value + h # increment by h
#     fxh = f(x) # evalute f(x + h)
#     x[ix] = old_value # restore to previous value (very important!)

#     # compute the partial derivative
#     grad[ix] = (fxh - fx) / h # the slope
#     it.iternext() # step to next dimension

#   return grad

**practical consideration**
- In formula h tnds to zero, Practically a sufficient smaller value works fine. In fact very smaaller value results time consuming in the gradient descent algorithm which we will see later
- it often works better to compute the numeric gradient using the centered difference formula:
$[f(x+h) - f(x-h)] / 2 h$

In [27]:
# # to use the generic code above we want a function that takes a single argument
# # (the weights in our case) so we close over X_train and Y_train
# def CIFAR10_loss_fun(W):
#   return L(Xtr_rows.T,train_labels, W)

# W = np.random.rand(10, 3073) * 0.001 # random weight vector
# df = eval_numerical_gradient(CIFAR10_loss_fun, W) # get the gradient

In [28]:
# #update
# loss_original = CIFAR10_loss_fun(W) # the original loss
# print 'original loss: %f' % (loss_original, )

# # lets see the effect of multiple step sizes
# for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]:
#   step_size = 10 ** step_size_log
#   W_new = W - step_size * df # new position in the weight space
#   loss_new = CIFAR10_loss_fun(W_new)
#   print 'for step size %f new loss: %f' % (step_size, loss_new)

***NB***: I am leaving executing the above three block of code considering numerical gradient. Because this will require tremendous calculations :( . Later I will try

**update in negative direction**: We update W in the negative direction of the gradient since we wish our loss functions to decrease,not increase.

##### Effect of step size(learning rate):
- if step_size/learning_rate/alpha is too small it takes long time to train
- if too large algorithm may diverge rather than converge

**Problem of efficiency**: 
- complexity of numerical gradient is the number of parameters.
-  In our example we had 30730 parameters in total and therefore had to perform 30,731 evaluations of the loss function to evaluate the gradient and to perform only a single parameter update

In real life we work with data having millions of parameters. In that case its good to avoid this algorithm look for new way.

#### Computing the gradient analytically with Calculus
**properties**
- fast, no approximation, error prone

**Gradient Check**: Since analytic approach with calculas is more error prone , in practice it is good to use both approach and see which is better.This is called gradient check.


Recall SVM loss for a single data point: $L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$

We can differentiate the function with respect to the weights. For example, taking the gradient with respect to wyi
 we obtain: $\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$
 where 1
 is the indicator function that is one if the condition inside is true or zero otherwise. While the expression may look scary when it is written out, when you’re implementing this in code you’d simply **count the number of classes that didn’t meet the desired margin** (and hence contributed to the loss function) and then the data vector xi
 scaled by this number is the gradient. Notice that this is the gradient only with respect to the row of W
 that corresponds to the correct class. For the other rows where j≠yi
 the gradient is:
$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$

***Implementation**:
 **algorithm**: 
 - count the #classes that don't meet the margin then multiply by that training example ($x_i$). -  - Iterate over all the training example through $x_i  \text{to}  x_n$ 
 - sum the all classes that don't meet margin from $x_i \text{to} x_n$


In [29]:
import numpy as np

def calculate_gradient(X, y, W, delta):
    num_classes = W.shape[0]
    num_samples = X.shape[1]

    scores = np.dot(W, X)

    margins = np.maximum(0, scores - scores[y, np.arange(num_samples)] + delta)
    margins[y, np.arange(num_samples)] = 0  # Ignore the correct class

    incorrect_mask = margins > 0

    gradient = np.zeros_like(W)

    # Compute the gradient for the correct class weight
    num_incorrect = np.sum(incorrect_mask, axis=0)
    gradient[y, :] = -num_incorrect[:, np.newaxis] * X.T

    # Compute the gradient for the incorrect class weights
    gradient += np.dot(incorrect_mask, X.T)

    return gradient

#Example usage:
# X = np.array([[1, 2], [2, 3], [4, 5], [5, 6], [7, 8], [9, 10]])  # Example input matrix with shape (num_features, num_samples)
# y = np.array([0, 1, 2, 0, 1, 2])  # Example target classes
# W = np.random.randn(3, 2) * 0.001  # Example weight matrix
W_gd=np.random.randn(10,3073)*.0001
delta = .1  # Example margin value
gradient = calculate_gradient(Xtr_rows.T, train_labels, W_gd, delta)
print(gradient)

[[3541061. 3612699. 3338375. ... 3555359. 3101920.   29134.]
 [3977291. 4127158. 3979775. ... 3758346. 3372210.   30971.]
 [2410115. 2493877. 2399428. ... 2640141. 2411648.   19640.]
 ...
 [4933966. 5111511. 4896145. ... 4916059. 4455733.   37630.]
 [5494391. 5669747. 5430035. ... 5450866. 4885407.   42662.]
 [1799199. 1853190. 1674856. ... 2123101. 1844119.   17556.]]


In [30]:
print(gradient.shape)

(10, 3073)


### Gradient Descent: 
 the procedure of repeatedly evaluating the gradient and then performing a parameter update is called Gradient Descent.

In [31]:
#initializing parameters
def gradient_descent(X,y,learning_rate,num_iters):
    #initializing
    Lamda=.4
    delta=.1
    W = np.random.randn(10, 3073) * 0.001
    for i in range(num_iters):
        grad=calculate_gradient(X,y,W,delta)
        #update
        W-=learning_rate*grad
        # print loss and accuracy after certain periods
        if (i+1)%25==0:

            loss=svm_loss(X,y,W,Lambda)
            scores=np.dot(W,X)
            y_pred=np.argmax(scores.shape[0])
            accuracy=np.mean(y_pred==y)
            print(f"iteration {i+1}, loss={loss:.2f}, accuracy={accuracy*100:.2f}")
    return W


In [32]:
#perform gradient descent
num_iters=400
learning_rate=.01
final_weights=gradient_descent(Xtr_rows.T,train_labels,learning_rate,num_iters)

iteration 25, loss=13748422096467.68, accuracy=10.00
iteration 50, loss=54065991983157.92, accuracy=10.00
iteration 75, loss=119745471045495.55, accuracy=10.00
iteration 100, loss=209444247544725.00, accuracy=10.00
iteration 125, loss=326660017893833.81, accuracy=10.00
iteration 150, loss=468356311736201.00, accuracy=10.00
iteration 175, loss=632268561542259.88, accuracy=10.00
iteration 200, loss=818700764426756.38, accuracy=10.00
iteration 225, loss=1029215694419907.88, accuracy=10.00
iteration 250, loss=1261234849333874.50, accuracy=10.00
iteration 275, loss=1517458459084222.75, accuracy=10.00
iteration 300, loss=1799240348638933.75, accuracy=10.00
iteration 325, loss=2102419854795631.00, accuracy=10.00
iteration 350, loss=2431292024157660.50, accuracy=10.00
iteration 375, loss=2785858628337446.00, accuracy=10.00
iteration 400, loss=3159771956180126.00, accuracy=10.00


performance on the test data using the final weights

In [35]:
sc=np.dot(final_weights,Xte_rows.T)
y_pred=np.argmax(sc.shape[0])
accuracy=np.mean(y_pred==test_labels)
print(accuracy)

0.1


we need to work further to increase accuracy

### **mini-batch gradient descent:
we can split the dataset inot sets. and can apply gradinet descent algorithm on each batch. The extereme case when batch size =1. This condition is known as stochastic gradient descent(SGD).
This is relatively less common to see because in practice due to vectorized code optimizations it can be computationally much more efficient to evaluate the gradient for 100 examples, than the gradient for one example 100 times. 