# Exercise 1
Which linear regression training algorithm can you use if you have a training set
with millions of features?

![image.png](attachment:ba1bc8ec-e067-436e-b018-17df2dfdc1aa.png) </br>
*n is the number of features.* </br>
It is advised to use gradient descent.

# Exercise 2
Suppose the features in your training set have very different scales. Which algo‐
rithms might suffer from this, and how? What can you do about it?

![image.png](attachment:cc7d52de-3b44-4ecd-842e-0a62cb37816d.png) </br>
Gradient descent may suffer from it, so you should apply scaling.

# Exercise 3
Can gradient descent get stuck in a local minimum when training a logistic
regression model?

![image.png](attachment:5a693e33-4e93-4ea3-b46d-e202179b756c.png) </br>
No it can't because the cost function (log loss) is convex.


# Exercise 4
Do all gradient descent algorithms lead to the same model, provided you let them
run long enough?

Pretty much. They can converge to the solution if you use a good learning schedule.

# Exercise 5
Suppose you use batch gradient descent and you plot the validation error at every
epoch. If you notice that the validation error consistently goes up, what is likely
going on? How can you fix this?

You should implement early stopping. The model is overfitting the training data. You can also try to use a better model or add features.

# Exercise 6
Is it a good idea to stop mini-batch gradient descent immediately when the
validation error goes up?

No, because when you are using the mini-batch sgd, the whole process is not so smooth. You should wait for some epochs and when you are confident that the error went up only then should you stop the training.

# Exercise 7
Which gradient descent algorithm (among those we discussed) will reach the
vicinity of the optimal solution the fastest? Which will actually converge? How
can you make the others converge as well?

The model that will reach the vicinity of the optimal solution the fastest will be the stochastic gradient descent. But it will not converge, only the batch gradient descent will converge. To make them converge you need to use a learning schedule.

# Exercise 8
Suppose you are using polynomial regression. You plot the learning curves and
you notice that there is a large gap between the training error and the validation
error. What is happening? What are three ways to solve this?

The model is overfitting. You can use a lasso regularization to try to bring the unnecessary parameters to zero, use a simpler model or try to come up with some new features.

# Exercise 9
Suppose you are using ridge regression and you notice that the training error
and the validation error are almost equal and fairly high. Would you say that
the model suffers from high bias or high variance? Should you increase the
regularization hyperparameter α or reduce it?

The model suffers from high bias. It found the best compromise given the circumstances. It can't get any closer to the data. </br> ![image.png](attachment:a4bf4969-5102-4eff-b14a-96ade0668401.png) </br> You should reduce the α hyperparameter to give a model more flexibility.


# Exercise 10
Why would you want to use:
- a) Ridge regression instead of plain linear regression (i.e., without any
regularization)?


The Ridge regularization would keep weights smaller, thus constraining the model and preventing overfitting. It smoothens out the predictions.

- b. Lasso instead of ridge regression?

The Lasso regularization can be used when you suspect that only few features are usefull as it tends to bring the useless features weights to zero.

- c. Elastic net instead of lasso regression?

In general elastic net is preferred over lasso because lasso may behave eratically when the number of features gets greater than number of training instances or when several factors are strongly correlated.

# Exercise 11
Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime.
Should you implement two logistic regression classifiers or one softmax regres‐
sion classifier?

You shuold implement two logistic regression classifiers because softmax reg class can predict only one class. </br>
![image.png](attachment:5137bcbf-49fc-4676-9c4f-ba631b659aed.png)

# Exercise 12
Implement batch gradient descent with early stopping for softmax regression
without using Scikit-Learn, only NumPy. Use it on a classification task such as
the iris dataset.

In [1]:
import numpy as np

np.random.seed(42)
X = np.random.randn(1000, 3)
X[:,0] = 1
y = np.random.randint(0, 3, 1000)
classes = ['class1', 'class2', 'class3']
m = len(X)
num_features = len(X[0,:])
num_classes = len(classes)



X, y

(array([[ 1.        , -0.1382643 ,  0.64768854],
        [ 1.        , -0.23415337, -0.23413696],
        [ 1.        ,  0.76743473, -0.46947439],
        ...,
        [ 1.        ,  0.07203686, -0.21220897],
        [ 1.        ,  0.07748052,  0.25775254],
        [ 1.        ,  0.33417642, -0.15525905]]),
 array([1, 1, 1, 0, 2, 2, 1, 0, 2, 1, 2, 2, 0, 0, 2, 1, 0, 2, 0, 0, 0, 2,
        1, 1, 1, 0, 2, 0, 1, 1, 2, 0, 1, 2, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
        1, 2, 1, 1, 2, 0, 2, 1, 0, 2, 0, 0, 2, 0, 2, 1, 0, 1, 1, 0, 2, 0,
        1, 1, 2, 0, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 0, 2, 2, 2, 2, 2, 2, 2,
        2, 0, 2, 2, 2, 1, 0, 0, 0, 2, 2, 1, 0, 2, 0, 1, 1, 1, 0, 2, 2, 1,
        2, 1, 0, 1, 0, 1, 2, 2, 1, 2, 2, 2, 0, 2, 0, 1, 2, 2, 2, 1, 0, 0,
        2, 1, 1, 1, 1, 1, 0, 0, 0, 2, 1, 2, 1, 0, 2, 0, 1, 1, 1, 0, 1, 0,
        2, 1, 1, 1, 2, 1, 0, 0, 1, 1, 2, 2, 2, 1, 2, 2, 0, 2, 0, 0, 2, 0,
        1, 1, 2, 1, 0, 1, 2, 2, 0, 2, 1, 2, 0, 2, 1, 0, 1, 0, 2, 0, 0, 2,
        1, 2, 0, 2, 0, 2,

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=False, return_X_y=True)
X, X_test, y, y_test = train_test_split(X,y,train_size=0.8, random_state=42)
m = len(X)
m_test = len(X_test)
bias_term = np.ones((m,1))
X = np.append(bias_term, X, axis=1)
bias_term = np.ones((m_test,1))
X_test = np.append(bias_term, X_test, axis=1)

num_features = len(X[0,:])
num_classes = len(np.unique(y))

X[:5], y[:5], num_classes 

(array([[1. , 4.6, 3.6, 1. , 0.2],
        [1. , 5.7, 4.4, 1.5, 0.4],
        [1. , 6.7, 3.1, 4.4, 1.4],
        [1. , 4.8, 3.4, 1.6, 0.2],
        [1. , 4.4, 3.2, 1.3, 0.2]]),
 array([0, 0, 1, 0, 0]),
 3)

In [3]:
np.random.seed(42)
theta = np.random.randn(num_classes, num_features)
theta

array([[ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986, -0.23415337],
       [-0.23413696,  1.57921282,  0.76743473, -0.46947439,  0.54256004],
       [-0.46341769, -0.46572975,  0.24196227, -1.91328024, -1.72491783]])

In [4]:
def softmax_score_k(theta_k, x):
    return theta_k.T @ x

def softmax_score_k_matrix(theta_k, X):
    scores = []
    for x in X:
        scores.append(softmax_score_k(theta[k], x))
    return scores
    
def softmax_scores(theta, x):
    scores = []
    for k in range(num_classes):
        score = softmax_score_k(theta[k], x)
        scores.append(score)
    return scores

def softmax_scores_matrix(theta, X):
    scores = []
    for x in X:
        scores.append(softmax_scores(theta, x))
    return scores

def softmax_probabilities(scores):
    probabilities = []
    sum_of_scores = np.exp(scores).sum()
    for k, score in enumerate(scores):
        probability = np.exp(score) / sum_of_scores
        # print(f"probability for class {k}: {probability}")
        probabilities.append(probability)
    return probabilities

def softmax_probabilities_matrix(scores):
    probabilities = []
    for score in scores:
        probabilities.append(softmax_probabilities(score))
    return probabilities
    
def predict(probabilities):
    return np.argmax(probabilities)
    
def predict_matrix(probabilities):
    predictions = []
    for prob in probabilities:
        predictions.append(np.argmax(prob))
    return predictions

def calculate_gradients_for_class(theta, X, y, k):
    p_k = softmax_score_k_matrix(theta[:,k], X)
    y_k = []
    p_k = np.array(p_k)
    m = len(X)
    for y_i in y:
        if y_i == k:
            y_k.append(1)
        else:
            y_k.append(0)

    y_k = np.array(y_k)

    return (1 / m) * np.dot(X.T, p_k - y_k)

In [5]:
scores = softmax_scores_matrix(theta, X)
probabilities = softmax_probabilities_matrix(scores)
probabilities
predictions = predict_matrix(probabilities)
predictions
correct_preds = predictions == y
accuracy_train = correct_preds.sum() / len(correct_preds)
print(f"accuracy on train: {accuracy_train:.2f}")
scores = softmax_scores_matrix(theta, X_test)
probabilities = softmax_probabilities_matrix(scores)
probabilities
predictions = predict_matrix(probabilities)
predictions
correct_preds = predictions == y_test
accuracy_test = correct_preds.sum() / len(correct_preds)
print(f"accuracy on test: {accuracy_test:.2f}")

accuracy on train: 0.34
accuracy on test: 0.30


In [8]:
epochs = 3000
lr = 0.01

print(theta)

for epoch in range(epochs):
    for k in range(num_classes):
        theta[k] = theta[k] - lr * calculate_gradients_for_class(theta, X, y, k)        

print(theta)

[[ 0.43177717 -0.20162901  0.44044242  0.07935747 -0.48093251]
 [-0.07271745  0.42283447 -0.48506946 -0.1475715  -0.01624339]
 [-0.56123166 -0.17917494  0.28123904  0.25462096  0.09958667]]
[[ 0.3695782  -0.08850938  0.33930206 -0.0605967  -0.28159783]
 [ 0.25504856  0.30484636 -0.45344524 -0.00459905 -0.24379988]
 [-0.57527104 -0.1479774   0.26804056  0.15466331  0.30693658]]


In [9]:
scores = softmax_scores_matrix(theta, X)
probabilities = softmax_probabilities_matrix(scores)
probabilities
predictions = predict_matrix(probabilities)
predictions
correct_preds = predictions == y
accuracy_train = correct_preds.sum() / len(correct_preds)
print(f"accuracy on train: {accuracy_train:.2f}")
scores = softmax_scores_matrix(theta, X_test)
probabilities = softmax_probabilities_matrix(scores)
probabilities
predictions = predict_matrix(probabilities)
predictions
correct_preds = predictions == y_test
accuracy_test = correct_preds.sum() / len(correct_preds)
print(f"accuracy on test: {accuracy_test:.2f}")

accuracy on train: 0.82
accuracy on test: 0.80
