# Linear Classification
In the last section we introduced image classification using the k-Nearest Neighbor (K-NN) algorithm. K-NN has two main disadvantages:
 - The classifier must remember all of the training data and store it for future comparisons with the test data. This is space inefficient because datasets may easily be gigabytes in size.
 - Classifying a test image is expensive since it requires a comparison to all training images.


 **Overview**:
  We are now going to develop a more powerful approach to image classification that we will eventually naturally extend to entire Neural Networks and Convolutional Neural Networks. The approach will have two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels. We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.


# Parameterized mapping from images to label scores
- **score** function
- **loss** function

Let’s assume a training dataset of images $x_i \in R^D$, each associated with a label $y_i$. Here $i = 1 \dots N$ and $y_i \in 1 \dots K$. That is, we have N examples (each with a dimensionality D) and K distinct categories. For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixels, and K = 10, since there are 10 distinct classes (dog, cat, car, etc). We will now define the score function $f: R^D \mapsto R^K$ that maps the raw image pixels to class scores.

# Linear Classifier
 $ f(x_i, W, b) =  W x_i + b$


 In the above equation, we are assuming that the image $x_i$ has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix W (of size [K x D]) and the vector b (of size [K x 1]) are the parameters of the function. In CIFAR-10, $x_i$ contains all pixels in the i-th image flattened into a single [3072 x 1] column, W is [10 x 3072] and b is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores). The parameters in W are often called the weights, and b is called the bias vector because it influences the output scores without interacting with the actual data $x_i$. However, you will often hear people use the terms weights and parameters interchangeably.
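As a minimal sketch of the score function with the CIFAR-10 shapes above, assuming NumPy and randomly generated stand-in values (the helper name `score`, the seed, and the random data are illustrative and not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

D, K = 3072, 10                           # pixels per image, number of classes
x_i = rng.random((D, 1))                  # stand-in for one flattened image, [D x 1]
W = rng.standard_normal((K, D)) * 0.01    # weights, [K x D]
b = np.zeros((K, 1))                      # bias vector, [K x 1]

def score(x, W, b):
    """Linear score function f(x, W, b) = W x + b."""
    return W.dot(x) + b

scores = score(x_i, W, b)                 # one score per class, [K x 1]
predicted_class = int(np.argmax(scores))  # index of the highest-scoring class
print(scores.shape, predicted_class)
```

With random weights the predicted class is meaningless; the point is only that 3072 numbers go in and 10 class scores come out.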

# Interpreting a linear classifier as template matching
- Each row of the weight matrix W is responsible for one class. Therefore each row can be interpreted as a template for that class.

- To label an image, we compare the image with every template (using an inner product) and see which template matches the image best.

**Problem**
- ex1: Suppose a dataset contains cars of different colors, and most of them are red. Then the template of the car class would mostly recognize red cars, but not cars of other colors.

**Solution**
- Introducing a neural network. A neural network will be able to develop intermediate neurons in its hidden layers that could detect specific car types (e.g. green car facing left, blue car facing front, etc.), and neurons on the next layer could combine these into a more accurate car score through a weighted sum of the individual car detectors.

# Bias trick
 It is cumbersome to keep track of two sets of parameters (the weights W and the biases b). Therefore we combine the two into a single matrix. We do this

 - by extending the vector $x_i$ with one additional dimension that always holds the constant 1 (a default bias dimension), and extending W with one additional column that holds the bias b.

 Now the score function looks like:  $f(x_i,W)=W x_i$

 **New Dimensions**
 $x_i$ is [3073 x 1], $W$ is [10 x 3073]
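The bias trick can be checked numerically. The following is a small sketch, assuming NumPy and random stand-in values (all variable names are illustrative), that builds the extended $x_i$ and $W$ and confirms that both formulations give the same scores:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 3072, 10
x = rng.random((D, 1))             # stand-in image, [3072 x 1]
W = rng.standard_normal((K, D))    # weights, [10 x 3072]
b = rng.standard_normal((K, 1))    # biases, [10 x 1]

# Append the constant 1 to x and the bias column to W.
x_ext = np.vstack([x, np.ones((1, 1))])   # [3073 x 1]
W_ext = np.hstack([W, b])                 # [10 x 3073]

scores_two_params = W.dot(x) + b          # f(x, W, b) = W x + b
scores_one_param = W_ext.dot(x_ext)       # f(x, W) = W x with the bias folded in
print(np.allclose(scores_two_params, scores_one_param))
```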

# Image data processing
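It's always recommended to normalize the data before training.

**Concept**: Center the dataset by subtracting the mean from every feature. For images, this means computing a mean image over the training set and subtracting it from every image.
- Range of the values of the dataset:

        - [-127, 127]: approximate range after subtracting the mean image from pixels in [0, 255]; this is what we will do for images
        - [-1, 1]: range after additionally scaling each zero-centered feature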
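A minimal sketch of this preprocessing, assuming NumPy and random stand-in pixel data (the array names and the choice of dividing by the largest absolute training value are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for flattened images with raw pixel values in [0, 255].
X_train = rng.integers(0, 256, size=(100, 3072)).astype(np.float64)
X_test = rng.integers(0, 256, size=(20, 3072)).astype(np.float64)

# Compute the mean image on the training set only, then subtract it everywhere.
mean_image = X_train.mean(axis=0)
X_train_centered = X_train - mean_image
X_test_centered = X_test - mean_image

# Optional further step: scale so the training values fall in [-1, 1].
# Test values use the same scale and may fall slightly outside that range.
scale = np.abs(X_train_centered).max()
X_train_scaled = X_train_centered / scale
X_test_scaled = X_test_centered / scale

print(X_train_centered.min(), X_train_centered.max())
print(X_train_scaled.min(), X_train_scaled.max())
```

The mean is computed from the training data alone and reused for the test data, so no information from the test set leaks into preprocessing.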
      

# Loss and Optimization
## Loss
In the previous lecture we worked with a model that had a set of hyperparameters. For a linear classifier, the weights W are not hyperparameters but parameters that are learned from the training data. Different settings of the parameters give different predictions, so the question is: which set of parameters should we use in our model? This is where the "loss" comes into the picture. In machine learning we want the model to predict as accurately as possible. Say we want a model that classifies images as cat or not cat; every time it encounters an image, we want it to predict correctly. When the model fails to predict correctly, we count that as loss. We can therefore define the loss as a measure of the disagreement between the predicted value and the actual value for a single example. The answer to the question is then: we select the set of parameters that results in the lowest loss.

##### Types of loss function
There are different types of loss functions, and the choice depends on the problem and the algorithm. For multiclass image classification we can use two types of loss function:
  - Multiclass Support Vector Machine Loss
  - Cross Entropy Loss

**Multiclass SVM loss (hinge loss)**:
    - **Goal**: The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin Δ.

    $L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)$
    
  Here $s_j$ is the score of class $j$ for the $i$-th example. Since we are working with a linear score function, we can write it as: $L_i = \sum_{j\neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$, where $w_j$ is the $j$-th row of W reshaped as a column.
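As a worked example with made-up numbers (the scores are illustrative, not taken from a trained model), suppose there are three classes with scores $s = [13, -7, 11]$, the first class is the correct one ($y_i = 0$), and $\Delta = 10$:

$L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10) = \max(0, -10) + \max(0, 8) = 0 + 8 = 8$

The second class scores lower than the correct class by more than the margin, so it contributes zero. The third class scores only 2 lower than the correct class, which falls short of the margin of 10 by 8, and that shortfall is the loss.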

# Implementation of loss function

In [1]:
def L_i_unvectorized(x, y, W):
  """
  unvectorized version. Compute the multiclass svm loss for a single example (x,y)
  - x is a column vector representing an image (e.g. 3073 x 1 in CIFAR-10)
    with an appended bias dimension in the 3073-rd position (i.e. bias trick)
  - y is an integer giving index of correct class (e.g. between 0 and 9 in CIFAR-10)
  - W is the weight matrix (e.g. 10 x 3073 in CIFAR-10)
  """
  delta = 1.0 # see notes about delta later in this section
  scores = W.dot(x) # scores becomes of size 10 x 1, the scores for each class
  correct_class_score = scores[y]
  D = W.shape[0] # number of classes, e.g. 10
  loss_i = 0.0
  for j in range(D): # iterate over all wrong classes
    if j == y:
      # skip for the true class to only loop over incorrect classes
      continue
    # accumulate loss for the i-th example
    loss_i += max(0, scores[j] - correct_class_score + delta)
  return loss_i

In [2]:
import numpy as np  # needed here: np is used below but otherwise first imported in a later cell

def L_i_vectorized(x, y, W):
  """
  A faster half-vectorized implementation. half-vectorized
  refers to the fact that for a single example the implementation contains
  no for loops, but there is still one loop over the examples (outside this function)
  """
  delta = 1.0
  scores = W.dot(x)
  # compute the margins for all classes in one vector operation
  margins = np.maximum(0, scores - scores[y] + delta)
  # on y-th position scores[y] - scores[y] canceled and gave delta. We want
  # to ignore the y-th position and only consider margin on max wrong class
  margins[y] = 0
  loss_i = np.sum(margins)
  return loss_i


In [3]:
import numpy as np

## Fully vectorized implementation
def L(X, y, W):
    """
    Fully-vectorized implementation of the loss function.
    - X: Holds all the training examples as columns (e.g., 3073 x 50,000 in CIFAR-10)
    - y: Array of integers specifying the correct class (e.g., 50,000-D array)
    - W: Weights (e.g., 10 x 3073)
    """
    num_examples = X.shape[1]

    # compute scores for all classes and all examples (e.g., 10 x 50,000)
    scores = np.dot(W, X)

    # select only the score of the correct class for each example
    correct_class_scores = scores[y, np.arange(num_examples)]

    # compute the hinge margins for all classes (delta = 1)
    margins = np.maximum(0, scores - correct_class_scores + 1)
    margins[y, np.arange(num_examples)] = 0  # the correct class contributes no loss

    # sum the margins over classes and average over the examples,
    # matching L = (1/N) * sum_i L_i
    loss = np.sum(margins) / num_examples
    return loss


# Q&A(Multiclass SVM loss function)

**Q1**: Suppose we get $L_i = 0$ for an example. What happens if we change the score of its correct class a little bit?

 - **Answer**: The loss stays the same (0 in this case), because the correct class score is already greater than the scores of the other classes by at least the margin.

**Q2**: What are the minimum and maximum possible loss?

 - **Answer**: Min = 0 and max = infinity.
      - Min: If every difference $s_j - s_{y_i} + \Delta$ is negative, each max term returns zero.
      - Max: If the correct class gets a very negative score, the loss grows without bound.

**Q3**: At initialization W is small, so all s ≈ 0. What is the loss?
 - **Answer**: The number of classes minus one (C-1). All predicted scores are roughly equal to the correct class score, so each term contributes only the delta term, 1. Remember that we skip the term $j=y_i$; that is why we subtract one.

**Q4**: What if the sum was over all classes (including $j = y_i$)?
 - **Answer**: The loss increases by 1. For the correct class the score difference is 0, so the term contributes the delta value 1.

**Q5**: What if we used the mean instead of the sum?
 - **Answer**: Nothing essential changes. The number of classes is constant, so taking the mean only rescales the loss.

**Q6**: What if we used the square of the loss terms?
 - **Answer**: It would be a different loss function, one that penalizes violated margins quadratically rather than linearly.
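Some of these answers can be checked numerically. The sketch below assumes NumPy and a hypothetical helper `svm_loss_single` that takes precomputed class scores; the scores are made up for illustration:

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0):
    """Multiclass SVM loss for one example, given its class scores."""
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0  # the correct class is excluded from the sum
    return np.sum(margins)

K = 10
zero_scores = np.zeros(K)

# Q3: with all scores equal to 0, each of the K - 1 wrong classes contributes delta.
loss_at_init = svm_loss_single(zero_scores, y=3)

# Q4: summing over all classes adds one extra delta for the j == y term.
loss_all_classes = np.sum(np.maximum(0, zero_scores - zero_scores[3] + 1.0))

# Q1: the correct class already wins by more than the margin,
# so nudging its score leaves the loss at 0.
scores = np.array([5.0, 1.0, 2.0])
loss_before = svm_loss_single(scores, y=0)
loss_after = svm_loss_single(scores + np.array([0.1, 0.0, 0.0]), y=0)

print(loss_at_init, loss_all_classes, loss_before, loss_after)
```

Checking that the loss at initialization is close to C-1 is a useful sanity test before training.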

# Regularization

Why do we use regularization?
- To avoid overfitting
- To generalize better
- To keep the model simple, so that it works better on test data

**Concept**:
When a model has many features (and therefore many weights), it can fit the training data almost perfectly and still perform poorly on test data. This phenomenon is known as overfitting (high variance). The training examples ($x_i, y_i$) are fixed and we cannot change them, and some features may have much larger values than others, so these extreme features can have a large impact on the score function. The only thing we can control is the weights, so we penalize large weights to limit the influence of any single feature. There are many regularization terms available, but we will use the squared L2 penalty: $R(W) = \sum_k\sum_l W_{k,l}^2$. By adding this term, the loss function now has two parts: the **data loss** and the **regularization loss**.

$L =  \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \lambda R(W) }_\text{regularization loss}$

Expanding this out in its full form:
$L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k\sum_l W_{k,l}^2$

**Example**: Suppose we have $x=[1,1,1,1]$, $w_1=[1,0,0,0]$, $w_2=[0.25,0.25,0.25,0.25]$.
Here $w_1^Tx = w_2^Tx = 1$, but L2 regularization prefers $w_2$, because the sum of the squared values of $w_2$ (0.25) is smaller than that of $w_1$ (1). This demonstrates that the regularization loss for $w_2$ is less than for $w_1$.
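The numbers in this example can be reproduced with a few lines of NumPy (a sketch; the variable names are chosen for illustration):

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

# Both weight vectors give the same score for x ...
score_w1, score_w2 = w1.dot(x), w2.dot(x)

# ... but their squared L2 penalties differ.
penalty_w1, penalty_w2 = np.sum(w1 ** 2), np.sum(w2 ** 2)

print(score_w1, score_w2)      # 1.0 1.0
print(penalty_w1, penalty_w2)  # 1.0 0.25
```

The L2 penalty favors weights that are spread out over all input dimensions rather than concentrated on a few of them.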

**Lambda**: A hyperparameter that controls the trade-off between the data loss and the regularization loss.



# Implementation of loss with regularization term


In [4]:
#unvectorized
#same as before; only the regularization term is added to the loss
Lambda = 0.4

def L_regularized(x, y, W, Lambda):
  """
  unvectorized version. Compute the multiclass svm loss for a single example (x,y)
  plus the L2 regularization term
  - x is a column vector representing an image (e.g. 3073 x 1 in CIFAR-10)
    with an appended bias dimension in the 3073-rd position (i.e. bias trick)
  - y is an integer giving index of correct class (e.g. between 0 and 9 in CIFAR-10)
  - W is the weight matrix (e.g. 10 x 3073 in CIFAR-10)
  - Lambda is the regularization strength
  """
  delta = 1.0 # see notes about delta later in this section
  scores = W.dot(x) # scores becomes of size 10 x 1, the scores for each class
  correct_class_score = scores[y]
  num_classes = W.shape[0] # number of classes, e.g. 10
  loss_i = 0.0
  for j in range(num_classes): # iterate over all wrong classes
    if j == y:
      # skip for the true class to only loop over incorrect classes
      continue
    # accumulate loss for the i-th example
    loss_i += max(0, scores[j] - correct_class_score + delta)
  # 0.5 is multiplied for convenience of calculation later (it cancels when differentiating W*W)
  regularized_loss = loss_i + 0.5 * Lambda * np.sum(W * W)
  return regularized_loss

**other Multiclass SVM formulations:**
- OneVsAll
  -  trains an independent binary SVM for each class vs. all other classes.
   

# Softmax Classifier (Multinomial Logistic Regression)
The Softmax classifier is a generalization of the binary logistic regression classifier to multiple classes.

The scores are interpreted as the unnormalized log probabilities of the classes: $s=f(x_i,W)$

**Steps**
  - we take the scores $s=f(x_i,W)$
  - we exponentiate them so that they become positive
  - we normalize them by the sum of the exponentials

These steps are performed by the softmax function. It produces a vector of probabilities, one for each class, whose elements sum to 1.

$P(Y=k \mid X=x_i)= \frac{e^{s_k}}{\sum_j e^{s_j}}$ ; outputs the probability of class k (k=1 to K)

Here, $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$ is the softmax function


##### Softmax loss Function(Cross Entropy Loss):
  **Target** : We want to maximize the log likelihood or, phrased as a loss function, to minimize the negative log likelihood of the correct class. In other words, the loss function tries to push the correct class probability towards 1 and the other class probabilities towards 0. Thus the loss function is the negative log of the probability of the correct class.
  $L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L_i = -f_{y_i} + \log\sum_j e^{f_j}$
  
  Minimizing the negative log likelihood of the correct class can be interpreted as performing Maximum Likelihood Estimation (MLE).
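As a small sketch of the two equivalent forms of the loss, assuming NumPy and made-up scores that are small enough for the exponentials to be safe (the helper names `softmax` and `cross_entropy_loss` are illustrative):

```python
import numpy as np

def softmax(f):
    """Exponentiate the scores and normalize them so they sum to 1."""
    e = np.exp(f)
    return e / np.sum(e)

def cross_entropy_loss(f, y):
    """Negative log probability of the correct class y."""
    return -np.log(softmax(f)[y])

f = np.array([1.0, 2.0, 3.0])  # made-up scores for 3 classes
y = 2                          # index of the correct class

p = softmax(f)
loss_a = cross_entropy_loss(f, y)             # first form: -log(softmax probability)
loss_b = -f[y] + np.log(np.sum(np.exp(f)))    # second form: -f_y + log sum_j e^{f_j}
print(p, loss_a, loss_b)
```

Both forms give the same value; the loss approaches 0 as the probability of the correct class approaches 1.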




##### Practical Issues:
The intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large due to the exponentials. Dividing large numbers can be numerically unstable. For stability we multiply the numerator and denominator by a constant C and get:
$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}
= \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}}
= \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$

**Value of C**: We are free to choose any positive value; it does not change the result. A common choice is to set $\log C = -\max_j f_j$, which shifts the scores so that the highest value is zero.

In [5]:
f = np.array([123, 456, 789]) # example with 3 classes and each having large scores
p = np.exp(f) / np.sum(np.exp(f)) # Bad: Numeric problem, potential blowup

# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f) # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer



#### Naming convention:
 The multiclass SVM loss (hinge loss) is sometimes called max-margin loss. Technically, the softmax function is a squashing function: it squashes the scores into values between zero and one that sum to one.

### SVM vs Softmax:
**SVM**
 - Through the score function, it wants the correct class score to be higher than the other class scores by a margin.
 - The SVM does not care about the details of the individual scores once the margin is satisfied.

**Softmax**
 - The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently the negative of it to be low).
 - The Softmax classifier allows us to compute “probabilities” for all labels.
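The difference can be seen on two made-up score vectors for which the correct class (index 0) already wins by more than the margin. This is a sketch assuming NumPy; the helper names and scores are illustrative, and the softmax loss uses the max-shift from the Practical Issues section:

```python
import numpy as np

def svm_loss(scores, y, delta=1.0):
    """Multiclass SVM (hinge) loss for one example."""
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0
    return np.sum(margins)

def softmax_loss(scores, y):
    """Cross-entropy loss for one example, shifted for numerical stability."""
    shifted = scores - np.max(scores)
    return -shifted[y] + np.log(np.sum(np.exp(shifted)))

y = 0
scores_a = np.array([10.0, -2.0, 3.0])
scores_b = np.array([10.0, -100.0, -100.0])

svm_a, svm_b = svm_loss(scores_a, y), svm_loss(scores_b, y)
sm_a, sm_b = softmax_loss(scores_a, y), softmax_loss(scores_b, y)
print(svm_a, svm_b)  # both zero: the margin is satisfied in both cases
print(sm_a, sm_b)    # the softmax loss is lower for scores_b
```

The SVM is satisfied with either score vector, while the Softmax classifier still prefers the second one, where the incorrect classes score much lower.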

# Optimization

### Introduction:

**linear score function** : $f(x_i, W) =  W x_i$

**SVM loss**: $L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) \right] + \lambda R(W)$

We saw that a setting of the parameters W that produced predictions for examples $x_i$ consistent with their ground truth labels $y_i$ would also have a very low loss L. We are now going to introduce the third and last key component: optimization. Optimization is the process of finding the set of parameters W that minimize the loss function.

**Goal**: Understand how the three components interact: the **score** function, the **loss** function, and **optimization**.

# Visualising the loss function

$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + 1) \right]$

It is clear from the equation that the data loss for each example is a sum of linear functions of W, each thresholded at zero by the max(0,−) function.

**Sign in front of W**:
- +ve sign: a row $w_j$ that belongs to an incorrect class appears with a positive sign. Increasing the score of an incorrect class can only increase the loss.
- -ve sign: the row $w_{y_i}$ that belongs to the correct class appears with a negative sign. Increasing the score of the correct class can only decrease the loss.

Because of the thresholding, a term contributes to the data loss only when its margin is violated, that is, when $w_j^Tx_i - w_{y_i}^Tx_i + 1 > 0$. Terms whose margin is already satisfied contribute zero, and a small change in the weights does not alter that.

**Shape of loss function**:

The SVM loss is a convex function of W, so its visualizations look like a bowl. Once we extend our score functions f to Neural Networks, our objective functions will become non-convex, and the visualizations of those non-convex functions will not feature bowls but complex, bumpy terrains.
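One way to look at this shape without plotting is to evaluate the loss along a one-dimensional slice $L(W_0 + a W_1)$ for a random direction $W_1$. The sketch below assumes NumPy and a small synthetic dataset (the sizes, names, and random data are all illustrative) and checks the discrete convexity condition on that slice:

```python
import numpy as np

def svm_data_loss(W, X, y, delta=1.0):
    """Mean multiclass SVM loss; X holds one example per column."""
    n = X.shape[1]
    scores = W.dot(X)                            # (K, N)
    correct = scores[y, np.arange(n)]            # (N,)
    margins = np.maximum(0, scores - correct + delta)
    margins[y, np.arange(n)] = 0
    return np.sum(margins) / n

rng = np.random.default_rng(0)
K, D, N = 3, 5, 20
X = rng.standard_normal((D, N))
y = rng.integers(0, K, size=N)
W0 = rng.standard_normal((K, D))   # starting point
W1 = rng.standard_normal((K, D))   # random direction

# Evaluate the loss at equally spaced points along the slice.
a_values = np.linspace(-2.0, 2.0, 41)
losses = np.array([svm_data_loss(W0 + a * W1, X, y) for a in a_values])

# For a convex function, each interior value is at most the average of its
# two neighbours, so the second differences are non-negative.
second_diff = losses[:-2] - 2 * losses[1:-1] + losses[2:]
print(np.all(second_diff >= -1e-9))
```

The small tolerance only absorbs floating-point rounding; the slice itself is piecewise linear and bowl-shaped.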


### Optimization

**Goal** : The goal of optimization is to find W that minimizes the loss function. We will now motivate and slowly develop an approach to optimizing the loss function. Different optimization methods suit different types of loss function. For example, for a convex function we can use gradient-based optimization. Our final goal is to optimize neural networks, where we can't easily use the tools developed in the convex optimization literature.

First we will look at a convex function, in this case the hinge loss.
Approaches we will try:
 - Random Search
 - Random Local Search
 - Following the gradient

#### Strategy #1: A first very bad idea solution: Random search
Since it is so simple to check how good a given set of parameters W is, the first (very bad) idea that may come to mind is to simply try out many different random weights and keep track of what works best. This procedure might look as follows:

In [6]:
# First we need to import our data and split it into train test
import numpy as np
import matplotlib.pyplot as plt
import pickle

In [7]:
def unpickle(file):
    """load the cifar-10 data"""

    with open(file, 'rb') as fo:
        data = pickle.load(fo, encoding='bytes')
    return data

In [8]:
def load_cifar_10_data(data_dir, negatives=False):
    """
    Return train_data, train_filenames, train_labels, test_data, test_filenames, test_labels, label_names
    """

    # get the meta_data_dict
    # num_cases_per_batch: 10000
    # label_names: ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
    # num_vis: 3072

    meta_data_dict = unpickle(data_dir + "/batches.meta")
    cifar_label_names = meta_data_dict[b'label_names']
    cifar_label_names = np.array(cifar_label_names)

    # training data
    cifar_train_data = None
    cifar_train_filenames = []
    cifar_train_labels = []

    # cifar_train_data_dict
    # 'batch_label': 'training batch 5 of 5'
    # 'data': ndarray
    # 'filenames': list
    # 'labels': list

    for i in range(1, 6):
        cifar_train_data_dict = unpickle(data_dir + "/data_batch_{}".format(i))
        if i == 1:
            cifar_train_data = cifar_train_data_dict[b'data']
        else:
            cifar_train_data = np.vstack((cifar_train_data, cifar_train_data_dict[b'data']))
        cifar_train_filenames += cifar_train_data_dict[b'filenames']
        cifar_train_labels += cifar_train_data_dict[b'labels']

    cifar_train_data = cifar_train_data.reshape((len(cifar_train_data), 3, 32, 32))
    if negatives:
        cifar_train_data = cifar_train_data.transpose(0, 2, 3, 1).astype(np.float32)
    else:
        cifar_train_data = np.rollaxis(cifar_train_data, 1, 4)
    cifar_train_filenames = np.array(cifar_train_filenames)
    cifar_train_labels = np.array(cifar_train_labels)

    # test data
    # cifar_test_data_dict
    # 'batch_label': 'testing batch 1 of 1'
    # 'data': ndarray
    # 'filenames': list
    # 'labels': list

    cifar_test_data_dict = unpickle(data_dir + "/test_batch")
    cifar_test_data = cifar_test_data_dict[b'data']
    cifar_test_filenames = cifar_test_data_dict[b'filenames']
    cifar_test_labels = cifar_test_data_dict[b'labels']

    cifar_test_data = cifar_test_data.reshape((len(cifar_test_data), 3, 32, 32))
    if negatives:
        cifar_test_data = cifar_test_data.transpose(0, 2, 3, 1).astype(np.float32)
    else:
        cifar_test_data = np.rollaxis(cifar_test_data, 1, 4)
    cifar_test_filenames = np.array(cifar_test_filenames)
    cifar_test_labels = np.array(cifar_test_labels)

    return cifar_train_data, cifar_train_filenames, cifar_train_labels, \
        cifar_test_data, cifar_test_filenames, cifar_test_labels, cifar_label_names

In [9]:
cifar_10_dir = 'dataset/cifar10'  # forward slash avoids the invalid '\c' escape and works on all platforms
train_data, train_filenames, train_labels, test_data, test_filenames, test_labels, label_names = load_cifar_10_data(cifar_10_dir)
print("Train data: ", train_data.shape)
print("Train filenames: ", train_filenames.shape)
print("Train labels: ", train_labels.shape)
print("Test data: ", test_data.shape)
print("Test filenames: ", test_filenames.shape)
print("Test labels: ", test_labels.shape)
print("Label names: ", label_names.shape)

Train data:  (50000, 32, 32, 3)
Train filenames:  (50000,)
Train labels:  (50000,)
Test data:  (10000, 32, 32, 3)
Test filenames:  (10000,)
Test labels:  (10000,)
Label names:  (10,)


In [10]:
#Flatten out all the images to be one dimensional
Xtr_rows = train_data.reshape(train_data.shape[0], 32 * 32 * 3) # Xtr_rows becomes 50000 x 3072
Xte_rows = test_data.reshape(test_data.shape[0], 32 * 32 * 3) # Xte_rows becomes 10000 x 3072

In [11]:
print(Xtr_rows.shape)
print(Xte_rows.shape)

(50000, 3072)
(10000, 3072)


In [12]:
#adding bias term: a column of ones, one per training example
bias_term=np.ones((Xtr_rows.shape[0],1))
Xtr_rows=np.hstack((Xtr_rows,bias_term))

In [13]:
Xtr_rows.shape

(50000, 3073)

In [14]:
def hinge_loss(x, y, W, reg_strength):
    scores = np.dot(W, x)  # Compute scores
    correct_class_score = scores[y]
    margins = np.maximum(0, scores - correct_class_score + 1)  # Compute margins
    margins[y] = 0  # Ignore the margin for the correct class
    loss = np.sum(margins)  # Compute the loss

    # Compute the gradient
    num_classes = W.shape[0]
    num_features = x.shape[0]
    # dW = np.zeros_like(W)
    # for c in range(num_classes):
    #     if c == y:
    #         continue
    #     margin = margins[c]
    #     if margin > 0:
    #         dW[c] += x
    #         dW[y] -= x

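    # Note: this divides by the number of classes, not the number of examples,
    # so the data loss here is the mean (rather than the sum) of the margins.
    # This only rescales L_i; averaging over examples happens in the caller.
    loss /= num_classes
    # dW /= num_classes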

    # Add regularization to the loss and gradient
    loss += 0.5 * reg_strength * np.sum(W * W)
    # dW += reg_strength * W
    return loss


In [15]:
Lambda = 0.4
bestloss = float("inf")  # start at infinity so any real loss is an improvement
num_classes = 10
num_features = Xtr_rows.shape[1]

for num in range(10):
    W = np.random.randn(num_classes, num_features) * 0.0001  # Generate random parameters
    loss = 0.0
    # dW = np.zeros_like(W)

    for i in range(Xtr_rows.shape[0]):
        # loss_i, dW_i = hinge_loss(Xtr_rows[i], train_labels[i], W, Lambda)
        loss_i= hinge_loss(Xtr_rows[i], train_labels[i], W, Lambda)
        loss += loss_i
        # dW += dW_i

    loss /= Xtr_rows.shape[0]  # Compute average loss over the training set
    # dW /= Xtr_rows.shape[0]  # Compute average gradient over the training set

    if loss < bestloss:  # Keep track of the best solution
        bestloss = loss
        bestW = W

    print('In attempt %d, the loss was %f, best %f' % (num, loss, bestloss))

In attempt 0, the loss was 1.000024, best 1.000024
In attempt 1, the loss was 0.983465, best 0.983465
In attempt 2, the loss was 1.074962, best 0.983465
In attempt 3, the loss was 1.004174, best 0.983465
In attempt 4, the loss was 0.985748, best 0.983465
In attempt 5, the loss was 0.929322, best 0.929322
In attempt 6, the loss was 1.012900, best 0.929322
In attempt 7, the loss was 0.989676, best 0.929322
In attempt 8, the loss was 1.007881, best 0.929322
In attempt 9, the loss was 0.987415, best 0.929322


To save time we only tried 10 random weight matrices; we should try many more. But let's evaluate the best weights on our test data and measure the accuracy.

In [16]:
def predict_accuracy(X, y, W):
    """
    Predict the labels for the given data and calculate the accuracy.
    
    Arguments:
    X -- input data, shape (N, D)
    y -- true labels, shape (N, )
    W -- weight matrix, shape (C, D)
    Returns:
    accuracy -- prediction accuracy
    """
    scores = np.dot(W, X.T)  # Compute scores
    predicted_labels = np.argmax(scores, axis=0)  # Predict labels
    accuracy = np.mean(predicted_labels == y) * 100.0  # Calculate accuracy
    
    return accuracy

In [17]:
# Flatten out the test images and add bias term
bias_term = np.ones((Xte_rows.shape[0], 1))
Xtest_rows = np.hstack((Xte_rows, bias_term))

# Predict accuracy
accuracy = predict_accuracy(Xtest_rows, test_labels, bestW)
print("Prediction accuracy: %.2f%%" % accuracy)

Prediction accuracy: 10.05%


This score is very poor: with 10 balanced classes, about 10% is what random guessing achieves. We could do better by trying more random weight matrices; for example, if we tried 1000 of them we might find a W that gives a somewhat higher accuracy than 10.05%.

#### Core idea: iterative refinement. 
- The core idea is that finding the best set of weights W is a very difficult or even impossible problem (especially once W contains weights for entire complex neural networks), but refining a specific set of weights W to be slightly better is significantly less difficult.
- Our approach will be to start with a random W and then iteratively refine it, making it slightly better each time.

### Strategy #2: Random Local Search
**algorithm**:
- start out with a random W
- generate random perturbations δW to it and if the loss at the perturbed W+δW is lower, we will perform an update. The code for this procedure is as follows:

In [18]:
Lambda = 0.4
bestloss_rls = float("inf")  # start at infinity so any real loss is an improvement
num_classes = 10
num_features = Xtr_rows.shape[1]
num_train = Xtr_rows.shape[0]
step_size = 0.0001
W_rls = np.random.randn(num_classes, num_features) * 0.001 # generate random starting W

for it in range(10):
  loss_rls = 0.0
  # perturb the current best W by a small random amount
  Wtry = W_rls + np.random.randn(num_classes, num_features) * step_size
  for i in range(num_train):
     # pass the single label train_labels[i]; passing the whole label array
     # raises a broadcasting ValueError inside hinge_loss
     loss_rls_i = hinge_loss(Xtr_rows[i], train_labels[i], Wtry, Lambda)
     loss_rls += loss_rls_i
  loss_rls /= num_train  # average over the training examples
  if loss_rls < bestloss_rls:
    W_rls = Wtry
    bestloss_rls = loss_rls
  print('iter %d loss is %f' % (it, bestloss_rls))
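Looping over 50,000 examples in Python for every candidate W is slow, so random local search can also be tried on a small synthetic dataset with a vectorized loss. The following is only a sketch under stated assumptions: the data is random, examples are stored as rows, and the function name `svm_loss`, the sizes, the step size, and the number of iterations are all chosen for illustration.

```python
import numpy as np

def svm_loss(W, X, y, reg, delta=1.0):
    """Regularized multiclass SVM loss; X holds one example per row."""
    n = X.shape[0]
    scores = X.dot(W.T)                             # (N, K)
    correct = scores[np.arange(n), y][:, None]      # (N, 1)
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(n), y] = 0                    # skip the correct class
    return np.sum(margins) / n + 0.5 * reg * np.sum(W * W)

rng = np.random.default_rng(0)
N, D, K = 200, 20, 3
X = rng.standard_normal((N, D))    # synthetic stand-in for image features
y = rng.integers(0, K, size=N)     # synthetic labels

W = rng.standard_normal((K, D)) * 0.001   # random starting point
best_loss = svm_loss(W, X, y, reg=0.4)
initial_loss = best_loss
step_size = 0.01

for it in range(200):
    # try a small random perturbation and keep it only if the loss improves
    W_try = W + rng.standard_normal(W.shape) * step_size
    loss_try = svm_loss(W_try, X, y, reg=0.4)
    if loss_try < best_loss:
        W, best_loss = W_try, loss_try

print(initial_loss, best_loss)
```

Because a perturbation is accepted only when it lowers the loss, the best loss can never get worse from one iteration to the next, unlike in plain random search where each attempt is independent.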