In [1]:
import numpy as np

## Sigmoid Function
Implement the `compute_sigmoid()` function, which computes the sigmoid of each element of `z`.

**Arguments:**

* **`z`** : 
  * A 2D numpy array of floats

**Returns:**

* Sigmoid value of **z** with the same shape as **z**
<br><br>$\hspace{20mm}{\sigma(z)} = \frac{1}{1+e^{-z}}$


In [2]:
def compute_sigmoid(z):
    #ADD YOUR CODE HERE
    sigmoid_value = 1/(1+np.exp(-z))
    return sigmoid_value

In [3]:
# SAMPLE TEST CASE
z = np.array([[2, 1, -3.213, 9.4], [1, 0, 6.1, -3.01]])
sigmoid_value = compute_sigmoid(z)
print(np.round(sigmoid_value, 4))

[[0.8808 0.7311 0.0387 0.9999]
 [0.7311 0.5    0.9978 0.047 ]]


**Expected Output:**
```
[[0.8808 0.7311 0.0387 0.9999]
 [0.7311 0.5    0.9978 0.047 ]]
```
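
For strongly negative inputs, `np.exp(-z)` can overflow and raise a runtime warning, even though the final value is still correct. The following is a minimal sketch of a numerically stable variant, not part of the exercise itself. The name `stable_sigmoid` is hypothetical; it evaluates the algebraically equivalent form $\frac{e^{z}}{1+e^{z}}$ wherever $z < 0$.

```python
import numpy as np

def stable_sigmoid(z):
    """Element-wise sigmoid that never exponentiates a large positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use exp(z) / (1 + exp(z)), where exp(z) < 1
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

z = np.array([[2, 1, -3.213, 9.4], [1, 0, 6.1, -3.01]])
print(np.round(stable_sigmoid(z), 4))
```

On the sample input this agrees with the direct formula; the two versions differ only in how they behave at extreme values of `z`.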

## Hypothesis Function in Logistic Regression
Implement the `compute_hypothesis()` function, which computes
the hypothesis value using vectorization.

**Arguments:**

* **`X`** : Design Matrix
  * A 2D numpy array of shape (num of instances, num of features)

* **`w`** : Parameters corresponding to each feature
  * A 2D numpy array of shape (num of features, 1)

* **`b`** :  Intercept value
  * A float value


**Returns:**

* Hypothesis value for the given data
 * A 2D numpy array of shape (num of instances, 1) <br><br>$\hspace{20mm}H = \sigma(Xw+b)\\[0.1pt]$
<br>$\hspace{2cm}$(where $\sigma$ represents the sigmoid function) 

In [4]:
def compute_hypothesis(X, w, b):
    #ADD YOUR CODE HERE
    z = np.dot(X, w) + b
    H = compute_sigmoid(z)
    return H

In [5]:
# SAMPLE TEST CASE
X = np.array([[-5, 2.34, 7, 6], [6, 1.2, 0, 4]])
b = 0.1
w = np.array([[0.3], [-0.5], [-0.2], [0.4]])
H = (np.round(compute_hypothesis(X, w, b),3)).squeeze()
print(*H)

0.172 0.948


**Expected Output:**
```
0.172 0.948
```
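
To illustrate what "using vectorization" buys here, the sketch below computes the same hypothesis twice on the sample data: once with a single matrix product and once with explicit Python loops. The variable names `H_vec` and `H_loop` are illustrative only. The scalar `b` is broadcast across all rows of `X @ w`.

```python
import numpy as np

X = np.array([[-5, 2.34, 7, 6], [6, 1.2, 0, 4]])
w = np.array([[0.3], [-0.5], [-0.2], [0.4]])
b = 0.1

# Vectorized: one matrix product, then an element-wise sigmoid
H_vec = 1 / (1 + np.exp(-(X @ w + b)))

# Loop version: accumulate w_j * x_ij for every instance i
H_loop = np.zeros((X.shape[0], 1))
for i in range(X.shape[0]):
    z_i = b
    for j in range(X.shape[1]):
        z_i += X[i, j] * w[j, 0]
    H_loop[i, 0] = 1 / (1 + np.exp(-z_i))

print(np.allclose(H_vec, H_loop))  # True
print(H_vec.shape)                 # (2, 1)
```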

## $L_2$ Regularized Cost Function
Implement the `compute_L2_cost()` function, which computes
the $L_2$ Regularized cost value in Logistic Regression.

**Arguments:**

* **`X`** : Design Matrix
  * A 2D numpy array of shape (num of instances, num of features)

* **`Y`** : Target values corresponding to each training instance in $X$
  * A 2D numpy array of shape (num of instances, 1)

* **`w`** : Parameters corresponding to each feature
  * A 2D numpy array of shape (num of features, 1)

* **`b`** :  Intercept value
  * A float value

* **`Lambda`** :  Regularization parameter($\lambda$)
  * A float value


**Returns:**

* $L_2$ Regularized cost value for the given data <br><br>$\hspace{20mm}J_{w,b}(X)=\frac{-1}{m}\left [Y^T\log(H)+(1-Y)^T\log(1-H) \right ]+ \frac{\lambda}{2m}w^Tw\\[0.1pt]$
<br>$\hspace{2cm}$(where $H$ is the Hypothesis value of $X$, and $m$ is the number of instances) 

In [6]:
def compute_L2_cost(X, Y, w, b, Lambda):
    #ADD YOUR CODE HERE
    m = len(X)
    H = compute_hypothesis(X, w, b)
    regression_value = np.dot(Y.T, np.log(H)) + np.dot((1-Y).T, np.log(1-H))
    regularization_value = (Lambda/(2*m)) * np.dot(w.T, w)
    J = (-1/m * regression_value) + regularization_value
    return np.squeeze(J)

In [7]:
# SAMPLE TEST CASE
X = np.array([[-5, 2.34, 7, 6], [6, 1.2, 0, 4]])
Y = np.array([[0], [1]])
w = np.array([[0.3], [-0.5], [-0.2], [0.4]])
b = 0.1
Lambda = 0.1
cost_value = np.round(compute_L2_cost(X, Y, w, b, Lambda),3)
print(cost_value)

0.135


**Expected Output:**
```
0.135
```
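
As a sanity check on the formula, the sketch below splits the cost for the sample data into its two parts: the cross-entropy term and the $L_2$ penalty. The variable names are illustrative. For this data $w^Tw = 0.54$, so the penalty is $\frac{0.1}{2 \cdot 2} \cdot 0.54 = 0.0135$.

```python
import numpy as np

X = np.array([[-5, 2.34, 7, 6], [6, 1.2, 0, 4]])
Y = np.array([[0], [1]])
w = np.array([[0.3], [-0.5], [-0.2], [0.4]])
b = 0.1
Lambda = 0.1
m = len(X)

H = 1 / (1 + np.exp(-(X @ w + b)))

# Cross-entropy term: -(1/m) * [Y^T log(H) + (1 - Y)^T log(1 - H)]
cross_entropy = float(np.squeeze(-(Y.T @ np.log(H) + (1 - Y).T @ np.log(1 - H)) / m))

# L2 penalty: (lambda / 2m) * w^T w; the intercept b is not penalized
penalty = float(np.squeeze(Lambda / (2 * m) * (w.T @ w)))

total = cross_entropy + penalty
print(round(cross_entropy, 4), round(penalty, 4), round(total, 3))
```

The total rounds to the sample value of 0.135 shown above.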

## Gradients of $L_2$ Regularized Cost Function
Implement the `gradient_of_L2_cost()` function, which computes
the gradients of the $L_2$ regularized cost function in Logistic Regression.


**Arguments:**

* **`X`** : Design Matrix
  * A 2D numpy array of shape (num of instances, num of features)

* **`Y`** : Target values corresponding to each training instance in $X$
  * A 2D numpy array of shape (num of instances, 1)

* **`w`** : Parameters corresponding to each feature
  * A 2D numpy array of shape (num of features, 1)

* **`b`** :  Intercept value
  * A float value

* **`Lambda`** :  Regularization parameter($\lambda$)
  * A float value



**Returns:**
* Gradient of the cost function corresponding to the parameters of each feature.
 * A 2D numpy array of shape (num of features, 1)<br><br>$\hspace{20mm}\frac{dJ}{dw} = \frac{1}{m}\left [ X^T(H-Y) + \lambda w\right ]$<br><br>
* Gradient corresponding to the intercept value
 * A float value<br><br>$\hspace{20mm}\frac{dJ}{db} = \frac{1}{m}\sum (H-Y) \\[0.1pt]  \\[0.1pt]$
<br>$\hspace{2cm}$(where $H$ is the Hypothesis value of $X$, and $m$ is the number of instances) 


In [8]:
def gradient_of_L2_cost(X, Y, w, b, Lambda):
    #ADD YOUR CODE HERE
    m = len(X)
    H = compute_hypothesis(X, w, b)
    dw = 1/m * (np.dot(X.T,(H-Y)) + (Lambda * w))
    db = 1/m * (np.sum(H-Y))
    return dw, db


In [9]:
# SAMPLE TEST CASE
X = np.array([[-5, 2.34, 7, 6], [6, 1.2, 0, 4]])
Y = np.array([[0], [1]])
w = np.array([[0.3], [-0.5], [-0.2], [0.4]])
b = 0.1
Lambda = 0.1
dw, db = gradient_of_L2_cost(X, Y, w, b, Lambda)
dw = np.round(dw.ravel(),3)
print(*dw)
print(np.round(db,3))

-0.572 0.145 0.593 0.432
0.06


**Expected Output:**
```
-0.572 0.145 0.593 0.432
0.06
```
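
One common way to validate analytic gradients is to compare them against central finite differences of the cost. The sketch below does this on the sample data. It is self-contained, so the helper names (`sigmoid`, `l2_cost`, `analytic_gradients`, `numeric_gradients`) are hypothetical stand-ins for the notebook's functions, and the step size `eps=1e-6` is an assumed value.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def l2_cost(X, Y, w, b, lam):
    m = len(X)
    H = sigmoid(X @ w + b)
    data_term = Y.T @ np.log(H) + (1 - Y).T @ np.log(1 - H)
    return float(np.squeeze(-data_term / m + lam / (2 * m) * (w.T @ w)))

def analytic_gradients(X, Y, w, b, lam):
    m = len(X)
    H = sigmoid(X @ w + b)
    return (X.T @ (H - Y) + lam * w) / m, float(np.sum(H - Y) / m)

def numeric_gradients(X, Y, w, b, lam, eps=1e-6):
    # Central difference: (J(p + eps) - J(p - eps)) / (2 * eps) per parameter
    dw = np.zeros_like(w, dtype=float)
    for j in range(w.shape[0]):
        step = np.zeros_like(w, dtype=float)
        step[j, 0] = eps
        dw[j, 0] = (l2_cost(X, Y, w + step, b, lam)
                    - l2_cost(X, Y, w - step, b, lam)) / (2 * eps)
    db = (l2_cost(X, Y, w, b + eps, lam) - l2_cost(X, Y, w, b - eps, lam)) / (2 * eps)
    return dw, db

X = np.array([[-5, 2.34, 7, 6], [6, 1.2, 0, 4]])
Y = np.array([[0], [1]])
w = np.array([[0.3], [-0.5], [-0.2], [0.4]])
b, lam = 0.1, 0.1

dw, db = analytic_gradients(X, Y, w, b, lam)
num_dw, num_db = numeric_gradients(X, Y, w, b, lam)
print(np.allclose(dw, num_dw, atol=1e-6), abs(db - num_db) < 1e-6)
```

If the two disagree by more than a small tolerance, the usual suspects are a missing $\frac{1}{m}$ factor or a regularization term applied to `b`.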

## Gradient Descent in $L_2$ Regularized Logistic Regression
Implement the `gradient_descent()` function, which computes the optimal parameter values using gradient descent.

**Arguments:**

* **`X`** : Design Matrix
  * A 2D numpy array of shape (num of instances, num of features)

* **`Y`** : Target values corresponding to each training instance in $X$
  * A 2D numpy array of shape (num of instances, 1)

* **`w`** : Initial parameters corresponding to each feature
  * A 2D numpy array of shape (num of features, 1)

* **`b`** :  Initial intercept value
  * A float value

* **`cost_diff_threshold`** : Threshold on the absolute cost difference between consecutive iterations, below which gradient descent stops (*Convergence Criterion*)
  * A float value

* **`learning_rate`** :  Learning rate($\alpha$)
  * A float value

* **`Lambda`** :  Regularization parameter($\lambda$)
  * A float value

**Returns:**
* `w`: Optimal parameters of features ($w$'s)
 * A 2D numpy array with the same shape as the argument $w$, updated in each iteration as<br>
$\hspace{20mm}w = w - \alpha \frac{dJ}{dw}$<br><br>
* `b`: Optimal intercept value (float), updated in each iteration as<br>
$\hspace{20mm}b = b - \alpha \frac{dJ}{db}$<br><br>

**NOTE:**
* Gradient descent is considered converged when the absolute difference between the costs of two consecutive iterations is less than the given threshold.
* Stop iterating if gradient descent starts to diverge, that is, if the cost increases from one iteration to the next.
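
The two stopping rules can be illustrated on a toy problem. The sketch below minimizes $f(x) = x^2$ (gradient $2x$) rather than the logistic cost, purely to keep the behaviour easy to follow. The function `descend`, its status strings, and the `max_iters` safeguard are all assumptions made for this illustration. Note one design choice: this sketch discards the step that increased the cost and returns the previous value.

```python
def descend(x, learning_rate, cost_diff_threshold, max_iters=10_000):
    """Gradient descent on f(x) = x**2 with convergence and divergence checks."""
    cost = x ** 2
    for _ in range(max_iters):
        x_new = x - learning_rate * 2 * x   # gradient of x**2 is 2x
        new_cost = x_new ** 2
        cost_diff = new_cost - cost
        if cost_diff > 0:
            # Cost went up: treat as divergence and keep the previous x
            return x, "diverged"
        x, cost = x_new, new_cost
        if abs(cost_diff) < cost_diff_threshold:
            return x, "converged"
    return x, "max_iters"

print(descend(5.0, learning_rate=0.1, cost_diff_threshold=1e-10)[1])  # converged
print(descend(5.0, learning_rate=1.1, cost_diff_threshold=1e-10))     # (5.0, 'diverged')
```

With a learning rate of 1.1 each update overshoots the minimum and lands farther away than it started, so the cost rises on the very first step.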

In [10]:
def gradient_descent(X, Y, w, b, cost_diff_threshold, learning_rate, Lambda):
    #ADD YOUR CODE HERE
    i = 0
    prev_cost = compute_L2_cost(X, Y, w, b, Lambda)
    cost_diff = cost_diff_threshold + 1  # force at least one iteration
    while abs(cost_diff) > cost_diff_threshold:
        dw, db = gradient_of_L2_cost(X, Y, w, b, Lambda)
        w = w - (learning_rate * dw)
        b = b - (learning_rate * db)
        cost = compute_L2_cost(X, Y, w, b, Lambda)
        cost_diff = cost - prev_cost
        if cost_diff > 0:
            # The cost went up: gradient descent has started to diverge
            print(f"Divergent at {i}")
            break
        prev_cost = cost
        i = i + 1
    return w, b


In [11]:
# SAMPLE TEST CASE
X = np.array([[-5, 2.34, 7, 6], [6, 1.2, 0, 4]])
Y = np.array([[0], [1]])
w = np.array([[0.3], [-0.5], [-0.2], [0.4]])
b = 0.1
Lambda = 0.1
cost_diff_threshold = 5e-10
learning_rate = 0.01 
w, b = gradient_descent(X, Y, w, b, cost_diff_threshold, learning_rate, Lambda)
w = np.round(w.ravel(),3)
print(*w)
print(np.round(b,3))

0.654 -0.06 -0.399 -0.096
1.616


**Expected Output:**
```
0.654 -0.06 -0.399 -0.096
1.616
```

## One-vs-Rest for Multi-Class Classification
Implement the `one_vs_rest()` function, which uses the One-vs-Rest approach to compute the optimal parameters for each class in a Multi-Class Classification Problem using Logistic Regression.


**Arguments:**

* **`X`** : Design Matrix
  * A 2D numpy array of shape (num of instances, num of features)

* **`Y`** : Target values corresponding to each training instance in $X$
  * A 2D numpy array of shape (num of instances, 1)

* **`w`** : Initial parameters corresponding to each feature
  * A 2D numpy array of shape (num of features, 1)

* **`b`** :  Initial intercept value
  * A float value

* **`cost_diff_threshold`** : Threshold on the absolute cost difference between consecutive iterations, below which gradient descent stops (*Convergence Criterion*)
  * A float value

* **`learning_rate`** :  Learning rate($\alpha$)
  * A float value

* **`Lambda`** :  Regularization parameter($\lambda$)
  * A float value



**Returns:**
* `classwise_params_dict`
 * A dict where the keys are the class labels and the values are the respective optimal parameters [$w$, $b$]<br>
   * where $w$ is a 2D numpy array with the same shape as the argument $w$<br>
   * $b$ is the optimal value of the intercept (float)


In [16]:
def one_vs_rest(X, Y, w, b, cost_diff_threshold, learning_rate, Lambda):
    #ADD YOUR CODE HERE
    classes = np.unique(Y)
    classes.sort()
    classwise_params_dict = {}
    for class_label in classes:
        classes_y = np.where(class_label == Y,1,0)
        optimal_w,optimal_b = gradient_descent(X, classes_y, w, b, cost_diff_threshold, learning_rate, Lambda)
        classwise_params_dict[class_label] = [optimal_w, optimal_b]
    return classwise_params_dict
    

In [17]:
# SAMPLE TEST CASE
X = np.array([[4.9, 3.0, 1.4, 0.2], [4.6, 3.4, 1.4, 0.3], [5.6, 3.0, 4.5, 1.5], [6.1, 3.0, 4.6, 1.4], [7.7, 2.6, 6.9, 2.3]])
Y = np.array([[0], [0], [1], [1], [2]])
w = np.array([[0.1], [0.1], [-0.1], [-0.1]])
b = 0.1
cost_diff_threshold = 1e-5
learning_rate = 0.1
Lambda = 0.1

classwise_params_dict = one_vs_rest(X, Y, w, b, cost_diff_threshold, learning_rate, Lambda)

classes = sorted(classwise_params_dict.keys())
for class_label in classes:
    print("class_label:",class_label)
    [w, b] = classwise_params_dict[class_label]
    print("w:", *np.round(w.ravel(),3))
    print("b:", np.round(b,3))

class_label: 0
w: 0.437 0.931 -1.732 -0.761
b: 0.432
class_label: 1
w: -1.839 1.44 1.256 0.429
b: 0.524
class_label: 2
w: -0.199 -2.117 1.287 0.558
b: -1.374


**Expected Output:**
```
class_label: 0
w: 0.437 0.931 -1.732 -0.761
b: 0.432
class_label: 1
w: -1.839 1.44 1.256 0.429
b: 0.524
class_label: 2
w: -0.199 -2.117 1.287 0.558
b: -1.374
```
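
The core of One-vs-Rest is relabeling: for each class, the multi-class targets are turned into a binary vector that is 1 for that class and 0 for every other class, and one binary logistic regression is trained per vector. A minimal sketch of just that relabeling step on the sample targets (the name `binary_targets` is illustrative):

```python
import numpy as np

Y = np.array([[0], [0], [1], [1], [2]])

# One binary target vector per class: 1 where Y equals the class, else 0
binary_targets = {int(c): np.where(Y == c, 1, 0) for c in np.unique(Y)}

for class_label, target in binary_targets.items():
    print(class_label, target.ravel())
# 0 [1 1 0 0 0]
# 1 [0 0 1 1 0]
# 2 [0 0 0 0 1]
```

Each binary vector keeps the shape of `Y`, so it can be passed to a binary trainer unchanged.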

## Prediction in One-vs-Rest Approach 
Implement the `predict_labels_in_one_vs_rest()` function, 
which predicts the class labels based on the optimal parameter values of each class learned using the One-vs-Rest approach.


**Arguments:**

* **`X`** : Design Matrix to predict the class labels for
  * A 2D numpy array of shape (num of instances, num of features)

* **`classwise_params_dict`**: A dict where the keys are the class labels(ints) and the values are the respective optimal parameters [$w$, $b$]<br>
   * where $w$ is a 2D numpy array with the shape (num of features, 1)<br>
   * $b$ is the optimal value of the intercept (float)



**Returns:**

* Predicted class labels of $X$
 * A 2D numpy array of shape (num of instances, 1)
 * If there is a tie among the class probabilities, predict the class label with the smallest value among the tied classes.

In [32]:
def predict_labels_in_one_vs_rest(X, classwise_params_dict):
    #ADD YOUR CODE HERE
    # Sort the labels so that np.argmax breaks ties in favour of the smallest label
    classes = np.array(sorted(classwise_params_dict.keys()))
    hypothesis_values = np.zeros((len(classes), len(X)))

    # Row idx holds the probability of class classes[idx] for every instance
    for idx, class_label in enumerate(classes):
        w, b = classwise_params_dict[class_label]
        hypothesis_values[idx] = compute_hypothesis(X, w, b).ravel()

    # np.argmax returns the index of the first maximum along axis 0
    predicted_idx = np.argmax(hypothesis_values, axis=0)
    # Return shape (num of instances, 1), as specified above
    return classes[predicted_idx].reshape(-1, 1)

In [33]:
# SAMPLE TEST CASE
X = np.array([[4.9, 3.0, 1.4, 0.2], [4.6, 3.4, 1.4, 0.3], [5.6, 3.0, 4.5, 1.5], [6.1, 3.0, 4.6, 1.4], [7.7, 2.6, 6.9, 2.3]])

classwise_params_dict = {1: [np.array([[ 0.43],[ 0.93],[-1.73 ],[-0.76]]), 0.43], 
                         2: [np.array([[-1.84 ],[ 1.42],[ 1.25],[ 0.43]]), 0.63],
                         3: [np.array([[-0.22],[-2.07],[ 1.32],[ 0.56]]), -1.54]}

predictions = predict_labels_in_one_vs_rest(X, classwise_params_dict)
print(*predictions.ravel())

1 1 2 2 3


**Expected Output:**
```
1 1 2 2 3
```
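
The tie-breaking rule follows from how `np.argmax` works: when several entries share the maximum, it returns the index of the first one. If the class labels are sorted in ascending order, the first maximum belongs to the smallest tied label. A small sketch with made-up probabilities (the array `probs` is invented for illustration, with one row per class and one column per instance):

```python
import numpy as np

classes = np.array([1, 2, 3])  # sorted ascending

# Rows: classes 1, 2, 3. Columns: three instances.
probs = np.array([[0.2, 0.7, 0.5],
                  [0.9, 0.7, 0.5],
                  [0.9, 0.1, 0.5]])

# Instance 0: classes 2 and 3 tie at 0.9 -> 2
# Instance 1: classes 1 and 2 tie at 0.7 -> 1
# Instance 2: all three tie at 0.5      -> 1
predicted = classes[np.argmax(probs, axis=0)].reshape(-1, 1)
print(*predicted.ravel())  # 2 1 1
```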