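### 1. Which Linear Regression training algorithm can you use if you have a training set with millions of features?
Stochastic Gradient Descent or Mini-batch Gradient Descent, and possibly Batch Gradient Descent if the training set fits in memory. The Normal Equation and the SVD approach are not suitable, because their computational complexity grows quickly (more than quadratically) with the number of features.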
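
As a rough illustration, Mini-batch Gradient Descent for Linear Regression can be sketched in plain NumPy. The synthetic data, learning rate, and batch size below are assumptions chosen for the example, not values from the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 4 + 3*x + noise
m = 1000
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + 0.1 * rng.standard_normal(m)
X_b = np.c_[np.ones((m, 1)), X]  # prepend the bias column

theta = np.zeros(2)
eta = 0.1         # learning rate (assumed value)
batch_size = 32   # assumed value
n_epochs = 50

for epoch in range(n_epochs):
    indices = rng.permutation(m)  # reshuffle the training set every epoch
    for start in range(0, m, batch_size):
        batch = indices[start:start + batch_size]
        X_batch, y_batch = X_b[batch], y[batch]
        # Gradient of the MSE computed on the mini-batch only
        gradients = 2 / len(batch) * X_batch.T @ (X_batch @ theta - y_batch)
        theta -= eta * gradients

print(theta)  # should land close to [4, 3]
```

Each step only touches `batch_size` rows, and no matrix of size features x features is ever inverted, which is why this family of algorithms scales to very wide training sets.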

### 2. Suppose the features in your training set have very different scales. Which algorithm might suffer from this, and how? What can you do about it?
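Gradient Descent, because the cost function takes the shape of an elongated bowl, so the algorithm takes much longer to converge. Scaling the features before training, for example with MinMaxScaler or StandardScaler, fixes this. The Normal Equation and the SVD approach work fine without scaling. Regularized models may also end up with a suboptimal solution when the features are not scaled: since regularization penalizes large weights, features with smaller values tend to be ignored compared to features with larger values.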
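
A minimal sketch of standardization in plain NumPy, which mirrors what StandardScaler computes (zero mean and unit variance per column). The feature values are made up for the example.

```python
import numpy as np

# Hypothetical features on very different scales: age in years, income in dollars
X = np.array([[25.0, 40_000.0],
              [32.0, 85_000.0],
              [47.0, 120_000.0],
              [51.0, 62_000.0]])

# The statistics come from the training set only and are reused on new data
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```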

### 3. Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?
No, because the Logistic Regression cost function (log loss) is convex, so it has no local minimum other than the global one.

### 4. Do all Gradient Descent algorithms lead to the same model, provided you let them run long enough?
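Nearly, yes. If the optimization problem is convex (as for Linear Regression or Logistic Regression) and the learning rate is not too high, all Gradient Descent algorithms approach the global optimum and produce fairly similar models. However, unless the learning rate is gradually reduced, Stochastic and Mini-batch Gradient Descent never truly converge: they keep jumping around the global optimum, so the resulting models differ slightly even after a very long run.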

### 5. Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?
If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.
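
The decision rule above can be written down as a small helper. This is a deliberately crude sketch: `diagnose` is a hypothetical function that only compares the first and last recorded errors, and the curves are made-up numbers.

```python
def diagnose(train_errors, valid_errors):
    """Rough diagnosis from per-epoch error curves (hypothetical helper)."""
    train_rising = train_errors[-1] > train_errors[0]
    valid_rising = valid_errors[-1] > valid_errors[0]
    if valid_rising and train_rising:
        return "diverging: reduce the learning rate"
    if valid_rising:
        return "overfitting: stop training"
    return "no problem detected"

# Both errors rise: the learning rate is too high
print(diagnose([1.0, 1.5, 2.3], [1.1, 1.7, 2.6]))
# Only the validation error rises: the model is overfitting
print(diagnose([1.0, 0.6, 0.4], [1.1, 1.3, 1.6]))
```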

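### 6. Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?
Not really. Because of its random nature, Mini-batch Gradient Descent is not guaranteed to make progress at every single step, so the validation error commonly goes up and down while the model is in fact still converging. A better option is to save the model at regular intervals and, when it has not improved for a long time, revert to the best saved model.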
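
One common way to tolerate such bumps is a patience counter. The sketch below uses a hypothetical helper, `early_stopping_epoch`, and a made-up validation curve to show how stopping at the first increase gives up far too early.

```python
def early_stopping_epoch(valid_errors, patience=5):
    """Return the index of the best epoch seen, stopping once `patience`
    consecutive epochs fail to improve on the best validation error."""
    best_error = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, error in enumerate(valid_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch

# A noisy but improving validation curve (made-up numbers)
noisy_curve = [1.0, 0.8, 0.85, 0.7, 0.72, 0.6, 0.65, 0.66, 0.67, 0.68, 0.69, 0.5]
print(early_stopping_epoch(noisy_curve, patience=1))  # → 1 (stops at the first bump)
print(early_stopping_epoch(noisy_curve, patience=3))  # → 5 (survives the small bumps)
```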

### 7. Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?
Stochastic Gradient Descent has the fastest training iterations, because it uses a single training instance at each step, so it is generally the first to reach the vicinity of the global optimum (Mini-batch Gradient Descent with a very small batch size is comparable). Only Batch Gradient Descent actually converges, given enough training time. Stochastic and Mini-batch Gradient Descent keep bouncing around the optimum unless you gradually reduce the learning rate.
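
Gradually reducing the learning rate is usually done with a learning schedule. Below is a sketch of Stochastic Gradient Descent with a simple schedule of the form t0 / (t + t1). The data and the hyperparameters `t0` and `t1` are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): y = 4 + 3*x + noise
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + 0.5 * rng.standard_normal(m)
X_b = np.c_[np.ones((m, 1)), X]

t0, t1 = 5, 50  # schedule hyperparameters (assumed values)

def learning_schedule(t):
    """Learning rate that decays as the step count t grows."""
    return t0 / (t + t1)

theta = rng.standard_normal(2)
n_epochs = 50
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)                  # pick one random instance
        xi, yi = X_b[idx], y[idx]
        gradient = 2 * xi * (xi @ theta - yi)  # MSE gradient on that instance
        eta = learning_schedule(epoch * m + i)
        theta -= eta * gradient

print(theta)  # should settle near [4, 3] instead of bouncing around
```

The large early steps move quickly toward the optimum, and the shrinking steps let the parameters settle instead of oscillating.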

### 8. Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?
The model is overfitting the training data: it performs much better on the training set than on the validation set. Solutions: gather more training data, reduce the polynomial degree, or regularize the model.

### 9. Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?
The model is underfitting the training set, which means it suffers from high bias. To fix this, reduce the regularization hyperparameter α.
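
To see why a large α pushes the model toward high bias, the sketch below fits Ridge Regression in closed form for several values of α and reports the training error. The data is synthetic, and the model has no bias term to keep the formula short; both are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
X = rng.standard_normal((m, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(m)

def ridge_fit(X, y, alpha):
    """Closed-form Ridge solution without a bias term:
    theta = (X^T X + alpha * I)^-1 X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

def mse(X, y, theta):
    return np.mean((X @ theta - y) ** 2)

alphas = [0.0, 10.0, 1000.0]
errors = [mse(X, y, ridge_fit(X, y, alpha)) for alpha in alphas]
for alpha, error in zip(alphas, errors):
    print("alpha:", alpha, "training MSE:", error)
```

The training error grows with α, so when both training and validation errors are high, lowering α gives the model more freedom to fit the data.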

### 10. Why would you want to use:
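a. Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?
A model with some regularization typically performs better than one without any, so Ridge Regression is generally preferable to plain Linear Regression.
b. Lasso instead of Ridge Regression?
When you suspect that only a few features are useful. Lasso uses an ℓ1 penalty, which tends to push the weights of the least important features down to exactly zero, so it performs automatic feature selection. When unsure, prefer Ridge Regression.
c. Elastic Net instead of Lasso?
Elastic Net is generally preferred over Lasso, because Lasso may behave erratically when several features are strongly correlated or when there are more features than training instances.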
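
To illustrate point b, consider the simplified case of orthonormal features (XᵀX = I), which is an assumption made here only because both solutions then have a closed form. With the objectives ½‖y − Xw‖² + (α/2)‖w‖² for Ridge and ½‖y − Xw‖² + α‖w‖₁ for Lasso, Ridge rescales the least-squares weights while Lasso soft-thresholds them. The weights below are made up.

```python
import numpy as np

def ridge_shrink(w_ols, alpha):
    # Ridge rescales every weight by the same factor; none becomes exactly zero
    return w_ols / (1 + alpha)

def lasso_shrink(w_ols, alpha):
    # Lasso soft-thresholds: weights with |w| <= alpha are set exactly to zero
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - alpha, 0.0)

w_ols = np.array([3.0, -2.0, 0.3, -0.1])  # hypothetical least-squares weights
alpha = 0.5

print(ridge_shrink(w_ols, alpha))  # every weight shrinks, all stay nonzero
print(lasso_shrink(w_ols, alpha))  # the two small weights become zero
```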

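### 11. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?
Two Logistic Regression classifiers, because the classes are not mutually exclusive: all four combinations (such as outdoor and nighttime) are possible, whereas Softmax Regression predicts exactly one class per instance.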
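
The difference shows up directly in the output functions. In this small sketch, the scores are hypothetical values for a single picture.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Hypothetical scores for one picture: [outdoor, daytime]
scores = np.array([2.0, 1.5])

print(sigmoid(scores))  # independent probabilities: both can be high at once
print(softmax(scores))  # forced to sum to 1, so the two labels compete
```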


### 12. Implement Batch Gradient Descent with early stopping for Softmax Regression (without using Scikit-Learn)

In [131]:
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = iris['target']

In [132]:
def with_bias(X):
    return np.c_[np.ones((X.shape[0], 1)), X]

X_with_bias = with_bias(X)

In [133]:
np.random.seed(2042)

In [134]:
test_ratio = 0.2
validation_ratio = 0.2
total_size = len(X_with_bias)

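test_size = int(test_ratio * total_size)
# Note: this takes 20% of test_size (6 instances), not 20% of total_size;
# use int(validation_ratio * total_size) for a full 20% validation split
validation_size = int(validation_ratio * test_size)
train_size = total_size - test_size - validation_size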

rnd_indices = np.random.permutation(total_size)

X_train = X_with_bias[rnd_indices[:train_size]]
y_train = y[rnd_indices[:train_size]]
X_valid = X_with_bias[rnd_indices[train_size:-test_size]]
y_valid = y[rnd_indices[train_size:-test_size]]
X_test = X_with_bias[rnd_indices[-test_size:]]
y_test = y[rnd_indices[-test_size:]]

In [135]:
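def to_one_hot(y):
    # Use the largest label rather than the number of distinct labels,
    # so a subset that happens to miss a class still gets a column for it
    n_classes = y.max() + 1
    m = len(y)
    y_one_hot = np.zeros((m, n_classes))
    y_one_hot[np.arange(m), y] = 1  # set a single 1 per row, at the label's column
    return y_one_hot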

In [136]:
y_train[:10]

array([0, 1, 2, 1, 1, 0, 1, 1, 1, 0])

In [137]:
to_one_hot(y_train[:10])

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [138]:
import sys

y_one_hot = to_one_hot(y)

eta = 0.01
epochs = 100

class SoftmaxRegression:
    def __init__(self, eta=0.01, max_iterations=100, patience=3):
        self.eta = eta                        # learning rate
        self.max_iterations = max_iterations  # maximum number of epochs
        self.patience = patience              # epochs without improvement tolerated

    def fit(self, X, y):
        classes = np.unique(y)
        y_one_hot = to_one_hot(y)
        m = y.size

        # One row of weights per class, one column per feature (bias included)
        theta = np.random.randn(classes.shape[0], X.shape[1])

        best_J = np.inf
        best_theta = theta
        num_of_iterations_without_improve = 0
        epsilon = 1e-7  # avoids log(0)

        for epoch in range(self.max_iterations):
            # Softmax probabilities for the current parameters
            sk = np.dot(X, np.transpose(theta))
            exp_sk = np.exp(sk)
            exp_sum = np.sum(exp_sk, axis=1, keepdims=True)
            pk = exp_sk / exp_sum

            # Cross-entropy loss of the current parameters.
            # Note: this is the training loss; early stopping usually monitors a validation loss.
            J = - np.sum(y_one_hot * np.log(pk + epsilon)) / m
            if J < best_J:
                num_of_iterations_without_improve = 0
                best_theta = theta
                best_J = J
            else:
                num_of_iterations_without_improve += 1
                if num_of_iterations_without_improve >= self.patience:
                    print(self.patience, "iterations without loss improvement")
                    print("Returning", best_theta, "as final model parameters, which results in loss equal to", best_J)
                    break

            if epoch % 500 == 0:
                print("Epoch:", epoch, "loss:", J)

            # Batch Gradient Descent step
            delta_J = np.dot(np.transpose(pk - y_one_hot), X) / m
            theta = theta - self.eta * delta_J

        self.theta = best_theta

    def predict(self, X):
        print("Theta shape:", self.theta.shape)
        print("X shape:", X.shape)
        sk = np.dot(X, np.transpose(self.theta))
        print("SK shape:", sk.shape)
        exp_sk = np.exp(sk)
        exp_sum = np.sum(exp_sk, axis=1, keepdims=True)
        print("Sum shape:", exp_sum.shape)
        pk = exp_sk / exp_sum
        return np.argmax(pk, 1)
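
One caveat about the softmax computation used in `fit` and `predict`: `np.exp` overflows for large scores. A common remedy, shown here as a standalone sketch rather than as part of the class above, is to subtract each row's maximum before exponentiating, which leaves the probabilities unchanged.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0],
                   [1.0, 2.0, 3.0]])
probabilities = softmax(logits)
print(probabilities)  # both rows are identical, since softmax is shift-invariant
```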



In [139]:
soft_reg = SoftmaxRegression(eta=0.1, max_iterations=50001)
soft_reg.fit(X_train, y_train)

# Predict the class of a new flower: petal length 5 cm, petal width 2 cm
X_new = np.array([[5, 2]])
soft_reg.predict(with_bias(X_new))

Epoch: 0 loss: 2.362342730571486
Epoch: 500 loss: 0.3901456732957007
Epoch: 1000 loss: 0.3006303713791963
Epoch: 1500 loss: 0.2557342432944118
Epoch: 2000 loss: 0.2268602426614136
Epoch: 2500 loss: 0.20622628704901602
Epoch: 3000 loss: 0.19056359354600286
Epoch: 3500 loss: 0.17818617431018688
Epoch: 4000 loss: 0.16811391527990094
Epoch: 4500 loss: 0.15973039860713542
Epoch: 5000 loss: 0.15262549740322198
Epoch: 5500 loss: 0.14651442988232588
Epoch: 6000 loss: 0.1411926471237409
Epoch: 6500 loss: 0.13650911119871476
Epoch: 7000 loss: 0.13234968767190888
Epoch: 7500 loss: 0.12862641311230055
Epoch: 8000 loss: 0.12527033001867477
Epoch: 8500 loss: 0.12222656943078447
Epoch: 9000 loss: 0.11945089477253538
Epoch: 9500 loss: 0.11690722163282337
Epoch: 10000 loss: 0.11456580489790852
Epoch: 10500 loss: 0.11240189180671357
Epoch: 11000 loss: 0.11039470637518366
Epoch: 11500 loss: 0.10852667344114651
Epoch: 12000 loss: 0.10678281860424053
Epoch: 12500 loss: 0.10515029905569177
Epoch: 13000 loss

array([2])

In [140]:
y_valid_predict = soft_reg.predict(X_valid)

np.mean(y_valid_predict == y_valid)

Theta shape: (3, 3)
X shape: (6, 3)
SK shape: (6, 3)
Sum shape: (6, 1)


0.8333333333333334

In [141]:
y_test_predict = soft_reg.predict(X_test)

np.mean(y_test_predict == y_test)

Theta shape: (3, 3)
X shape: (30, 3)
SK shape: (30, 3)
Sum shape: (30, 1)


0.9333333333333333

In [142]:
y_valid_one_hot = to_one_hot(y_valid)

# Shape: number of classes x number of features (bias included)
theta = np.zeros((y_one_hot.shape[1], X_valid.shape[1]))

print(y.shape)
print(np.unique(y))

# Average over the validation instances, not over the full dataset
m = len(y_valid)

# Shape: number of examples x number of classes
sk = np.dot(X_valid, np.transpose(theta))
exp_sk = np.exp(sk)

# Shape: number of examples x 1
sum_exp_sk = np.sum(exp_sk, axis=1, keepdims=True)

# Shape: number of examples x number of classes
pk = exp_sk / sum_exp_sk
J = - np.sum(y_valid_one_hot * np.log(pk)) / m
delta_J = np.dot(np.transpose(pk - y_valid_one_hot), X_valid) / m

(150,)
[0 1 2]
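
Early stopping is usually driven by the validation loss rather than the training loss. The sketch below shows one way this could look. It is not the implementation used above: `fit_softmax`, its default hyperparameters, and the synthetic three-class data are all assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum(axis=1, keepdims=True)

def cross_entropy(X, y_one_hot, theta, epsilon=1e-7):
    proba = softmax(X @ theta.T)
    return -np.mean(np.sum(y_one_hot * np.log(proba + epsilon), axis=1))

def fit_softmax(X_train, Y_train, X_valid, Y_valid,
                eta=0.1, max_epochs=5000, patience=20):
    """Batch GD that stops once the validation loss has not improved
    for `patience` consecutive epochs, and returns the best parameters."""
    rng = np.random.default_rng(0)
    theta = rng.standard_normal((Y_train.shape[1], X_train.shape[1]))
    best_loss, best_theta, stale_epochs = np.inf, theta.copy(), 0
    for epoch in range(max_epochs):
        proba = softmax(X_train @ theta.T)
        gradients = (proba - Y_train).T @ X_train / len(X_train)
        theta = theta - eta * gradients
        valid_loss = cross_entropy(X_valid, Y_valid, theta)
        if valid_loss < best_loss:
            best_loss, best_theta, stale_epochs = valid_loss, theta.copy(), 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return best_theta, best_loss

# Synthetic three-class data (an assumption): Gaussian blobs around three centers
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])
labels = rng.integers(0, 3, size=300)
features = centers[labels] + rng.standard_normal((300, 2))
X_all = np.c_[np.ones(300), features]  # add the bias column
Y_all = np.eye(3)[labels]              # one-hot targets

theta, valid_loss = fit_softmax(X_all[:200], Y_all[:200], X_all[200:], Y_all[200:])
accuracy = np.mean(np.argmax(X_all[200:] @ theta.T, axis=1) == labels[200:])
print("validation loss:", valid_loss, "validation accuracy:", accuracy)
```

Keeping a copy of the best parameters means the function can return the model from the best epoch even if later epochs made the validation loss worse.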
