### Training Model Exercises pg. 145

<i>converge - tends to a limit</i>
    
<i>regularize a model - to constrain a model by reducing its degrees of freedom and decrease variance</i>

1. If a linear regression problem has millions of features, use a gradient descent algorithm: stochastic GD, mini-batch GD, or batch GD if the training set fits in memory. The normal equation would be too slow because its computational cost grows quickly with the number of features (it requires inverting a matrix whose size depends on the number of features). By contrast, batch gradient descent scales poorly with the number of training instances, since it uses the whole training set at every step.
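
As a rough illustration of the two approaches, the sketch below fits the same small linear model with the normal equation and with batch GD. The synthetic data, `eta`, and the iteration count are assumptions chosen for the example; with only one feature both methods are cheap, so this shows agreement of the results rather than the scaling behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 4 + 3x + noise
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, m)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias column

# Normal equation: solve (X^T X) theta = X^T y in closed form
theta_normal = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Batch gradient descent: every step uses the full training set
eta = 0.1
theta_gd = np.zeros(2)
for _ in range(2000):
    gradients = 2 / m * X_b.T @ (X_b @ theta_gd - y)
    theta_gd = theta_gd - eta * gradients

print(theta_normal, theta_gd)
```
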

2. Feature scaling is helpful when the features have very different scales. Gradient descent suffers from this because the cost function then has the shape of an elongated bowl (pg 115), so reaching the global minimum takes longer and follows a rougher path. As a solution, scale the data before training! Regularized models are affected too: because regularization penalizes large weights, features with smaller values tend to be ignored when features have different scales.

    **The normal equation does not need feature scaling**
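
A minimal sketch of standardization with plain NumPy, assuming two hypothetical features whose scales differ by a factor of 1000. The condition number of `X.T @ X` is used here only as a rough proxy for how elongated the bowl is.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical features on very different scales
X = np.c_[rng.normal(0, 1, 100), rng.normal(0, 1000, 100)]

# Standardize: zero mean and unit variance per feature
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

# A large condition number corresponds to an elongated cost surface
cond_raw = np.linalg.cond(X.T @ X)
cond_scaled = np.linalg.cond(X_scaled.T @ X_scaled)
print(cond_raw, cond_scaled)
```

In practice the mean and standard deviation would be computed on the training set only and then reused for the validation and test sets.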

3. No, because the cost function is convex, so there is no local minimum for gradient descent to get stuck in.

4. Depends. If the cost function is convex (as with linear or logistic regression) and the learning rate isn't too high, then all the GD algorithms will approach the global minimum and result in similar models. However, unless the learning rate is gradually decreased, stochastic and mini-batch GD will never fully converge to the global minimum. Even if they are run for a very long time, they will produce slightly different models.

5. If the validation error keeps increasing, one possibility is that the learning rate is too high and the GD algorithm is diverging. If the training set error is also increasing, that is a clear sign that the learning rate is too **high**. If the training set error is not increasing, then the model is **overfitting** the training set and training should be halted.

6. No, because mini-batch GD is erratic: a rise in the validation error may be temporary while the algorithm is still converging to the global minimum or escaping a local minimum. Rather, save the model at regular intervals, and if the validation error has not improved for a while, stop training and revert to the best saved model.
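
The idea of waiting before stopping can be sketched with a patience counter. The helper name `early_stopping_index`, the `patience` parameter, and the loss values are all hypothetical; a real training loop would save the model wherever the best loss is updated.

```python
def early_stopping_index(val_losses, patience=3):
    """Return (best_step, best_loss), stopping once the validation loss
    has failed to improve for `patience` consecutive steps."""
    best_loss = float("inf")
    best_step = -1
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is where the model would be saved
            best_loss, best_step = loss, step
        elif step - best_step >= patience:
            break  # no improvement for `patience` steps, stop here
    return best_step, best_loss


# Made-up noisy validation losses, as mini-batch GD might produce
losses = [1.0, 0.8, 0.85, 0.7, 0.72, 0.71, 0.75, 0.6]
best_step, best_loss = early_stopping_index(losses, patience=3)
print(best_step, best_loss)
```

Stopping at the first increase (step 2) would have kept the model with loss 0.8, while the patience-based rule keeps the better one from step 3.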

7. Stochastic GD (since it considers one training instance at a time) and mini-batch GD with small mini-batch sizes reach the vicinity of the global minimum the fastest. However, they won't actually converge to it, only get very close, unless the learning rate is gradually decreased with a learning schedule. Batch GD actually does converge, given enough training time.
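
A minimal sketch of stochastic GD with a decaying learning rate, assuming a simple schedule of the form `t0 / (t + t1)`. The data, `t0`, `t1`, and the number of epochs are hypothetical values picked for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y = 4 + 3x + noise
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, m)
X_b = np.c_[np.ones((m, 1)), X]

t0, t1 = 5, 50  # assumed schedule hyperparameters


def learning_schedule(t):
    # The learning rate shrinks as the step counter t grows
    return t0 / (t + t1)


theta = rng.normal(size=2)
n_epochs = 50
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)  # pick one training instance at random
        xi, yi = X_b[idx], y[idx]
        gradient = 2 * xi * (xi @ theta - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradient

# Least-squares solution for comparison
theta_best = np.linalg.lstsq(X_b, y, rcond=None)[0]
print(theta, theta_best)
```
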

8. Overfitting, if the validation error is much higher than the training error. To solve:
    1. decrease the polynomial order (polynomial degree). For underfitting, do the opposite and increase the polynomial degree
    2. regularize the model by using a penalized cost function (ridge or lasso)
    3. get more data! increase the size of the training set

9. <i>high bias - higher error on training and test data (underfitting). high variance - low error on training data but high error on test data (overfitting) </i>.

    The model suffers from high bias because it is underfitting both the training and validation sets. Thus, the regularization hyperparameter, alpha, should be decreased.

10. When to use:
    1. Linear regression vs. ridge regression:
        - regularized models typically perform better than plain linear regression
    2. Lasso regression vs. ridge regression:
        - Lasso regression sets the weights of the "unimportant" features to zero (automatic feature selection)
        - Choose Lasso when it is known that only a few features matter
        - Choose ridge when it is not known
    3. Lasso regression vs. Elastic net regression
        - Prefer elastic net (a mix of ridge and lasso) because lasso may behave erratically when the number of features is greater than the number of instances or when several features are strongly correlated
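
To see the difference between the three regularized models, the sketch below fits scikit-learn's `Ridge`, `Lasso`, and `ElasticNet` on hypothetical data where only two of ten features matter. The data and the `alpha` and `l1_ratio` values are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)

# Hypothetical data: 10 features, only the first two influence the target
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Ridge shrinks weights but keeps them non-zero; the L1 penalty can zero them out
for name, model in [("ridge", ridge), ("lasso", lasso), ("elastic net", elastic)]:
    print(name, "non-zero weights:", np.count_nonzero(model.coef_))
```
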
   

11. Batch GD with early stopping for Softmax regression:

In [210]:
from sklearn.datasets import load_iris
import numpy as np

data = load_iris()
X = data["data"][:, (2, 3)] #only petal length and width
y = data["target"]

In [211]:
X_with_bias = np.c_[np.ones([len(X), 1]), X]

In [212]:
np.random.seed(2042)

In [213]:
# split dataset into training, validation, and test sets from scratch

test_ratio = 0.2
validation_ratio = 0.2
total_size = len(X_with_bias)

test_size = int(test_ratio * total_size)
validation_size = int(validation_ratio * total_size)
train_size = total_size - test_size - validation_size

rnd_indices = np.random.permutation(total_size)

X_train, y_train = X_with_bias[rnd_indices[:train_size]], y[rnd_indices[:train_size]]
X_validation, y_validation = X_with_bias[rnd_indices[train_size:-test_size]], y[rnd_indices[train_size:-test_size]] # an index of -test_size is shorthand for len(X_with_bias) - test_size
X_test, y_test = X_with_bias[rnd_indices[-test_size:]], y[rnd_indices[-test_size:]]

In [214]:
# softmax regression needs the targets as one-hot class probabilities

def one_hot(y):
    n_classes = y.max() + 1
    y_one_hot = np.zeros((len(y), n_classes))
    y_one_hot[np.arange(len(y)), y] = 1 
    return y_one_hot

In [215]:
y_train_one_hot = one_hot(y_train)
y_validation_one_hot = one_hot(y_validation)
y_test_one_hot = one_hot(y_test)
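
A small usage sketch of `one_hot` on made-up labels (the function is repeated so the block runs on its own). One thing to keep in mind: the number of columns is inferred from `y.max()`, so a split that happens to miss the highest class would produce a narrower matrix.

```python
import numpy as np


def one_hot(y):
    n_classes = y.max() + 1
    y_one_hot = np.zeros((len(y), n_classes))
    y_one_hot[np.arange(len(y)), y] = 1
    return y_one_hot


labels = np.array([0, 2, 1, 0])  # hypothetical class labels
encoded = one_hot(labels)
print(encoded)

# Equivalent one-liner: pick rows of the identity matrix
encoded_eye = np.eye(labels.max() + 1)[labels]
```
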

In [216]:
def softmax(softmax_scores):
    exps = np.exp(softmax_scores)
    return exps / np.sum(exps,axis=1, keepdims=True)
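
The `softmax` above works fine for the small scores in this notebook, but `np.exp` overflows for large scores. A common variant, sketched here under the hypothetical name `stable_softmax`, subtracts each row's maximum first; this leaves the probabilities unchanged because softmax is invariant to adding a constant to every score in a row.

```python
import numpy as np


def stable_softmax(scores):
    # Shift each row so its largest score is 0, which avoids overflow in exp
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)


# Scores this large would overflow in a naive exp-based softmax
scores = np.array([[1000.0, 1001.0, 1002.0]])
probs = stable_softmax(scores)
print(probs)
```
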

In [217]:
n_inputs = X_train.shape[1]
n_outputs = len(np.unique(y_train))

In [218]:
eta = 0.01 #define learning rate
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7 # added to Pk^i inside log(Pk^i) to prevent a log(0) error

Theta = np.random.randn(n_inputs, n_outputs) #random initializations for feature weights. shape = n_features x n_classes. Each class has their own dedicated parameter vector

for iteration in range(n_iterations):
    softmax_scores = X_train.dot(Theta)
    y_proba = softmax(softmax_scores)
    cost = -np.mean(np.sum(y_train_one_hot*np.log(y_proba + epsilon), axis=1))
    error = y_proba - y_train_one_hot
    if iteration % 500 == 0:
        print(iteration, cost)
    gradients = 1/m * X_train.T.dot(error)
    Theta = Theta - eta * gradients

0 5.446205811872683
500 0.8350062641405651
1000 0.6878801447192402
1500 0.6012379137693314
2000 0.5444496861981872
2500 0.5038530181431525
3000 0.47292289721922487
3500 0.44824244188957774
4000 0.4278651093928793
4500 0.41060071429187134
5000 0.3956780375390374


In [219]:
Theta

array([[ 3.32094157, -0.6501102 , -2.99979416],
       [-1.1718465 ,  0.11706172,  0.10507543],
       [-0.70224261, -0.09527802,  1.4786383 ]])

In [220]:
softmax_score_valid = X_validation.dot(Theta)
Y_proba = softmax(softmax_score_valid)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_validation)
accuracy_score

0.9666666666666667

In [227]:
alpha = 0.1
eta = 0.1 #define learning rate
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7 # added to Pk^i inside log(Pk^i) to prevent a log(0) error

Theta = np.random.randn(n_inputs, n_outputs) #random initializations for feature weights. shape = n_features x n_classes. Each class has their own dedicated parameter vector

for iteration in range(n_iterations):
    softmax_scores = X_train.dot(Theta)
    y_proba = softmax(softmax_scores)
    l2_loss = (alpha/2) * np.sum(np.square(Theta[1:])) #skip the first element
    cost = -np.mean(np.sum(y_train_one_hot*np.log(y_proba + epsilon), axis=1)) + l2_loss
    error = y_proba - y_train_one_hot
    if iteration % 500 == 0:
        print(iteration, cost)
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - eta * gradients

0 3.31466679236949
500 0.5378947790755032
1000 0.5046573947978396
1500 0.49505389686784407
2000 0.4914430657465934
2500 0.4899610932392464
3000 0.48932601099063844
3500 0.48904708136462205
4000 0.4889227267817381
4500 0.4888667580914424
5000 0.48884141287504024
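
For reference, the regularized cost and gradient that the loop above computes can be written as follows. The notation is an assumption made for this note: $m$ is the number of training instances, $K$ the number of classes, $\hat{p}_k^{(i)}$ the predicted probability of class $k$ for instance $i$, $\hat{P}$ and $Y$ the matrices of predicted and one-hot target probabilities, and $\Theta_{1:}$ all rows of $\Theta$ except the bias row, which is not regularized. The small `epsilon` inside the log is omitted.

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log \hat{p}_k^{(i)}
            + \frac{\alpha}{2} \sum_{j \ge 1} \sum_{k=1}^{K} \Theta_{j,k}^{2}

\nabla_{\Theta} J = \frac{1}{m} X^{\top} (\hat{P} - Y)
            + \alpha \begin{bmatrix} \mathbf{0}^{\top} \\ \Theta_{1:} \end{bmatrix}
```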


In [228]:
softmax_score_valid = X_validation.dot(Theta)
Y_proba = softmax(softmax_score_valid)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_validation)
accuracy_score

1.0

In [229]:
alpha = 0.1
eta = 0.1 #define learning rate
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7 # added to Pk^i inside log(Pk^i) to prevent a log(0) error
best_loss = np.inf # np.infty was removed in NumPy 2.0

Theta = np.random.randn(n_inputs, n_outputs) #random initializations for feature weights. shape = n_features x n_classes. Each class has their own dedicated parameter vector

for iteration in range(n_iterations):
    softmax_scores = X_train.dot(Theta)
    y_proba = softmax(softmax_scores)
    l2_loss = (alpha/2) * np.sum(np.square(Theta[1:])) #skip the first element
    cost = -np.mean(np.sum(y_train_one_hot*np.log(y_proba + epsilon), axis=1)) + l2_loss
    error = y_proba - y_train_one_hot
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - eta * gradients
    
    # Re-evaluate the cost after the update. Note that this is still the
    # training cost; early stopping is normally driven by the validation loss.
    softmax_scores = X_train.dot(Theta)
    y_proba = softmax(softmax_scores)
    l2_loss = (alpha/2) * np.sum(np.square(Theta[1:])) #skip the bias row
    cost = -np.mean(np.sum(y_train_one_hot*np.log(y_proba + epsilon), axis=1)) + l2_loss
    
    if iteration % 500 == 0:
        print(iteration, cost)
    if cost < best_loss:
        best_loss = cost
    else:
        print(iteration-1, best_loss)
        print(iteration, cost, 'loss increased! early stopping!') # was `loss`, an undefined name
        break

0 1.4810766085956093
500 0.5273833023235621
1000 0.5017773436353987
1500 0.4940082947584887
2000 0.4910219013615579
2500 0.4897825069610757
3000 0.48924807180697094
3500 0.4890124718111627
4000 0.4889071897601139
4500 0.4888597339906442
5000 0.48883822270267346
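
The run above never stops early because it monitors the training cost, which keeps decreasing. Below is a minimal sketch of the same batch GD loop driven by the validation loss instead. To keep it self-contained it uses synthetic Gaussian blobs rather than the Iris arrays, starts from zero weights, and adds a `patience` counter; all of these, along with the helper names, are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: three 2-D Gaussian blobs, one per class
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + rng.normal(0, 1.0, (50, 2)) for c in centers])
y = np.repeat(np.arange(3), 50)
X_b = np.c_[np.ones(len(X)), X]  # add the bias column

# Shuffle, then split into training and validation sets
idx = rng.permutation(len(X_b))
train_idx, val_idx = idx[:100], idx[100:]
X_tr, y_tr = X_b[train_idx], y[train_idx]
X_val, y_val = X_b[val_idx], y[val_idx]
Y_tr, Y_val = np.eye(3)[y_tr], np.eye(3)[y_val]  # one-hot targets


def softmax(scores):
    exps = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)


def loss(X, Y, theta, alpha, eps=1e-7):
    # Cross entropy plus an L2 penalty that skips the bias row
    xent = -np.mean(np.sum(Y * np.log(softmax(X @ theta) + eps), axis=1))
    return xent + alpha / 2 * np.sum(theta[1:] ** 2)


eta, alpha, patience = 0.1, 0.1, 20
theta = np.zeros((3, 3))
best_loss, best_theta, best_iter = np.inf, theta.copy(), 0

for iteration in range(5001):
    error = softmax(X_tr @ theta) - Y_tr
    grad = X_tr.T @ error / len(X_tr)
    grad[1:] += alpha * theta[1:]
    theta = theta - eta * grad

    # Early stopping is decided on the validation set, not the training set
    val_loss = loss(X_val, Y_val, theta, alpha)
    if val_loss < best_loss:
        best_loss, best_theta, best_iter = val_loss, theta.copy(), iteration
    elif iteration - best_iter >= patience:
        print(iteration, "validation loss stopped improving, early stopping")
        break

val_accuracy = np.mean(np.argmax(X_val @ best_theta, axis=1) == y_val)
print(best_iter, best_loss, val_accuracy)
```

Keeping a copy of `best_theta` plays the role of reverting to the best saved model once training stops.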


In [230]:
softmax_score_valid = X_validation.dot(Theta)
Y_proba = softmax(softmax_score_valid)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_validation)
accuracy_score

1.0

In [235]:
softmax_score_valid = X_test.dot(Theta)
Y_proba = softmax(softmax_score_valid)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_test)
accuracy_score

0.9333333333333333