<h1 style="text-align: center; font-weight: bold; font-size: 36px;">Character Level Bigram Model - Gradient Descent</h1>

# Introduction

Let's build a character-level **bigram** model trained by **gradient descent**: a single linear layer followed by a softmax, i.e. a pseudo neural network with no hidden layers.

Inspired by Karpathy's [Neural Networks: Zero-to-Hero](https://github.com/karpathy/nn-zero-to-hero).
We use the same [names.txt](https://github.com/karpathy/makemore/blob/master/names.txt) as Zero to Hero so that the results can be compared.

# Definitions

Let's define:

Vector of scores (logits) for one input character, i.e. one row of the weight matrix $W$:

$$
\mathbf{z} = (z_1, \dots, z_K)
$$

Model predictions, using softmax:

$$
\mathbf{\hat y} = (\hat y_1, \dots, \hat y_K),
\qquad
\hat y_i = \operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K}{e^{z_j}}}
$$

One-hot target labels:

$$
\mathbf{y} = (y_1, \dots, y_K ),
\qquad
y_k =
\begin{cases}
1 & \text{for the correct class } k, \\
0 & \text{otherwise}
\end{cases}
$$

Cross-entropy loss, in its general form and in the simplified form for the correct class $c$:

$$
\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{k=1}^{K} y_k \log{\hat y_k},
\qquad
\mathcal{L} = -\log{\hat y_c}
$$
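
As a small worked example of the definitions above, the sketch below evaluates the softmax and both forms of the cross-entropy for a single made-up logit vector. The values of `z` and the choice of correct class are illustrative assumptions, not taken from the dataset.

```python
import numpy as np

z = np.array([2.0, 1.0, 0.0])   # logits for K = 3 classes (made-up values)
y = np.array([0.0, 1.0, 0.0])   # one-hot target: class index 1 is the correct class

y_hat = np.exp(z) / np.exp(z).sum()          # softmax
loss_general = -(y * np.log(y_hat)).sum()    # -sum_k y_k * log(y_hat_k)
loss_simple = -np.log(y_hat[1])              # -log(y_hat_c) with c = 1

print(y_hat.round(3))
print(loss_general, loss_simple)
```

Because the target is one-hot, every term of the sum except the one for the correct class vanishes, so the two forms give the same number.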

Gradient of the loss w.r.t. the logits:

$$
\frac{\partial \mathcal{L}}{\partial z_i} = \hat{y}_i - y_i
$$
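
For completeness, here is a short sketch of where this gradient comes from, using only the definitions above. Substituting the softmax into the simplified loss gives

$$
\mathcal{L} = -\log{\hat y_c} = -z_c + \log{\sum_{j=1}^{K} e^{z_j}}
$$

Differentiating with respect to $z_i$, the first term contributes $-1$ only when $i = c$, which is exactly $-y_i$, and the second term yields the softmax:

$$
\frac{\partial \mathcal{L}}{\partial z_i} = -y_i + \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} = \hat{y}_i - y_i
$$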

# Imports

In [1]:
import time
import numpy as np
np.set_printoptions(linewidth=200)

# Explore the Data

Load the data and show some examples

In [2]:
with open('../data/names.txt', 'r') as f:
    names = f.read().splitlines()
print("Num names:", len(names))
print("Example names:", names[:10])
print("Min length:", min(len(name) for name in names))
print("Max length:", max(len(name) for name in names))

Num names: 32033
Example names: ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia', 'harper', 'evelyn']
Min length: 2
Max length: 15


Build the vocabulary, including a special start/stop token

In [3]:
# Collect the unique characters; the printed list shows the vocabulary is lowercase ASCII only
letters = sorted(list(set(''.join(names))))

# Add start/stop tokens - same token for both
letters = ['.'] + letters
print(letters)

['.', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [4]:
# Indices for all characters, including start/stop tokens
stoi = {ch: i for i, ch in enumerate(letters)}
itos = {i: ch for i, ch in enumerate(letters)}
# Print first 10 entries to verify
print(list(stoi.items())[:10])
print(list(itos.items())[:10])

[('.', 0), ('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5), ('f', 6), ('g', 7), ('h', 8), ('i', 9)]
[(0, '.'), (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e'), (6, 'f'), (7, 'g'), (8, 'h'), (9, 'i')]


Prepare the dataset

In [5]:
X, Y = [], []  # inputs and targets

for name in names:
    name = '.' + name + '.'  # add start/stop tokens
    for i in range(len(name) - 1):
        first_char = name[i]
        second_char = name[i + 1]
        X.append(first_char)
        Y.append(second_char)

X = np.array([stoi[c] for c in X])
Y = np.array([stoi[c] for c in Y])

print("Num examples:", len(X))
print(f"X {X.shape},{X.dtype}:")
print(X[:10])

Num examples: 228146
X (228146,),int64:
[ 0  5 13 13  1  0 15 12  9 22]


In [6]:
print(X[:10])
print([itos[i] for i in X[:10]])
print(Y[:10])
print([itos[i] for i in Y[:10]])

[ 0  5 13 13  1  0 15 12  9 22]
['.', 'e', 'm', 'm', 'a', '.', 'o', 'l', 'i', 'v']
[ 5 13 13  1  0 15 12  9 22  9]
['e', 'm', 'm', 'a', '.', 'o', 'l', 'i', 'v', 'i']
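
The pair extraction in the loop above can also be written with `zip`, which may make the bigram structure easier to see. This is a minimal sketch for a single name, not a replacement for the dataset code.

```python
name = "emma"
padded = "." + name + "."                # add start/stop tokens
pairs = list(zip(padded, padded[1:]))    # (current char, next char)
print(pairs)
# [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]
```

A name of length n contributes n + 1 pairs, which is why the number of examples is larger than the total number of letters.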


# Optimize - Single Step

Reduce data size for easier printing.

In [7]:
x = X[:6]
y = Y[:6]

print(f"x {x.shape} {x.dtype}:")
print(x)
print()

print(f"y {y.shape} {y.dtype}:")
print(y)
print()

x_one_hot = np.zeros((len(x), len(letters)), dtype=np.float32)
x_one_hot[np.arange(len(x)), x] = 1
print(f"x_one_hot {x_one_hot.shape} {x_one_hot.dtype}")
print(x_one_hot[:5])
print()

y_one_hot = np.zeros((len(y), len(letters)), dtype=np.float32)
y_one_hot[np.arange(len(y)), y] = 1
print(f"y_one_hot {y_one_hot.shape} {y_one_hot.dtype}")
print(y_one_hot[:5])

x (6,) int64:
[ 0  5 13 13  1  0]

y (6,) int64:
[ 5 13 13  1  0 15]

x_one_hot (6, 27) float32
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

y_one_hot (6, 27) float32
[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]


In [8]:
np.random.seed(22)

# Init Weights
W = np.random.randn(len(letters), len(letters)).astype(np.float32)
print(f"W {W.shape},{W.dtype}:")
print(W.round(2))

W (27, 27),float32:
[[-0.09 -1.46  1.08 -0.24 -0.49 -1.    0.92 -1.1   0.63 -0.56  0.03 -0.23  0.59  0.75 -1.06  1.06  0.75  1.06  1.52 -1.49  1.86 -1.6  -0.65  0.34  1.05  0.63  0.36]
 [ 0.56 -1.09  0.02  2.5  -2.49 -0.23 -0.1  -0.89 -0.14  0.1  -0.25 -0.08 -1.09  0.59 -0.64 -1.11  2.11 -0.57 -0.48 -1.92  0.4  -1.05 -0.69  0.75  0.54 -0.73  0.56]
 [ 0.43 -0.14 -0.94  0.48 -1.53  0.4   0.01 -1.23 -1.05  2.52 -2.04  0.09 -0.31  0.49  0.35  0.95  0.76  0.01 -1.38 -0.27  0.54  0.54  1.16 -0.17 -1.18 -0.55  0.27]
 [ 0.98  1.01  0.78 -1.25 -0.42  0.55  0.33  0.86 -1.23  0.62 -2.77 -0.49  0.07 -0.35  0.87 -0.22  0.02  0.69 -0.88  1.5  -0.65  0.6   0.21 -0.42  0.1   0.31 -0.47]
 [ 2.2  -1.01  0.    1.13  0.51  1.08 -1.53 -0.23  0.04  1.26 -0.82 -0.06  1.88  0.38  0.43 -0.06  1.19 -1.87 -0.82 -0.57  0.13  0.   -2.13  0.11 -0.85  2.83 -0.27]
 [ 1.06 -0.55 -0.24 -0.85  0.85  1.33 -1.27  0.52  0.78  1.39 -0.47  1.59 -1.09  0.36 -0.33 -0.13  1.26  0.7   3.08  0.09 -0.58  1.35  0.42 -0.28  1.49  1.

In [9]:
# Calculate Logits
logits = x_one_hot @ W  # n_batch, n_vocab

# Equivalently
logits_2 = W[x,:]    # n_batch, n_vocab
assert np.allclose(logits, logits_2)
del logits_2

print(f"logits {logits.shape},{logits.dtype}:")
print(logits.round(2))

logits (6, 27),float32:
[[-0.09 -1.46  1.08 -0.24 -0.49 -1.    0.92 -1.1   0.63 -0.56  0.03 -0.23  0.59  0.75 -1.06  1.06  0.75  1.06  1.52 -1.49  1.86 -1.6  -0.65  0.34  1.05  0.63  0.36]
 [ 1.06 -0.55 -0.24 -0.85  0.85  1.33 -1.27  0.52  0.78  1.39 -0.47  1.59 -1.09  0.36 -0.33 -0.13  1.26  0.7   3.08  0.09 -0.58  1.35  0.42 -0.28  1.49  1.63 -1.31]
 [-0.78  0.96 -0.04  0.51  1.05 -0.74  1.7   1.07 -0.96 -1.22  0.23  0.44 -0.63 -0.15 -0.12 -0.25  0.5   1.69  2.05 -0.69  0.08 -0.48  0.06 -0.28  1.52  1.1  -0.46]
 [-0.78  0.96 -0.04  0.51  1.05 -0.74  1.7   1.07 -0.96 -1.22  0.23  0.44 -0.63 -0.15 -0.12 -0.25  0.5   1.69  2.05 -0.69  0.08 -0.48  0.06 -0.28  1.52  1.1  -0.46]
 [ 0.56 -1.09  0.02  2.5  -2.49 -0.23 -0.1  -0.89 -0.14  0.1  -0.25 -0.08 -1.09  0.59 -0.64 -1.11  2.11 -0.57 -0.48 -1.92  0.4  -1.05 -0.69  0.75  0.54 -0.73  0.56]
 [-0.09 -1.46  1.08 -0.24 -0.49 -1.    0.92 -1.1   0.63 -0.56  0.03 -0.23  0.59  0.75 -1.06  1.06  0.75  1.06  1.52 -1.49  1.86 -1.6  -0.65  0.34  1.05

In [10]:
def softmax(logits):
    """Numerically stable softmax"""
    max_ = np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(logits - max_)
    exp_sum = np.sum(exp, axis=-1, keepdims=True)
    return exp / exp_sum
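
To illustrate why the maximum is subtracted before exponentiating, the sketch below compares a naive softmax with a shifted one on deliberately large logits. The function names `softmax_naive` and `softmax_stable` are hypothetical and exist only for this comparison.

```python
import numpy as np

def softmax_naive(logits):
    """Softmax without the max shift; overflows for large logits."""
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

def softmax_stable(logits):
    """Softmax with the max shift, same idea as the cell above."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

big = np.array([1000.0, 1001.0, 1002.0])

# exp(1000) overflows to inf in float64, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(big)

stable = softmax_stable(big)
small = softmax_stable(np.array([0.0, 1.0, 2.0]))

print(naive)    # all nan
print(stable)   # same probabilities as for the logits [0, 1, 2]
```

Softmax is invariant to adding a constant to all logits, so the shift changes nothing mathematically and only keeps the exponentials in a safe range.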

In [11]:
y_hat = softmax(logits)
print(f"y_hat {y_hat.shape},{y_hat.dtype}:")
print(y_hat.round(2))

y_hat (6, 27),float32:
[[0.02 0.01 0.07 0.02 0.01 0.01 0.06 0.01 0.04 0.01 0.02 0.02 0.04 0.05 0.01 0.06 0.05 0.06 0.1  0.01 0.14 0.   0.01 0.03 0.06 0.04 0.03]
 [0.04 0.01 0.01 0.01 0.03 0.05 0.   0.02 0.03 0.06 0.01 0.07 0.   0.02 0.01 0.01 0.05 0.03 0.3  0.02 0.01 0.05 0.02 0.01 0.06 0.07 0.  ]
 [0.01 0.05 0.02 0.03 0.06 0.01 0.11 0.06 0.01 0.01 0.02 0.03 0.01 0.02 0.02 0.02 0.03 0.11 0.15 0.01 0.02 0.01 0.02 0.01 0.09 0.06 0.01]
 [0.01 0.05 0.02 0.03 0.06 0.01 0.11 0.06 0.01 0.01 0.02 0.03 0.01 0.02 0.02 0.02 0.03 0.11 0.15 0.01 0.02 0.01 0.02 0.01 0.09 0.06 0.01]
 [0.04 0.01 0.02 0.29 0.   0.02 0.02 0.01 0.02 0.03 0.02 0.02 0.01 0.04 0.01 0.01 0.2  0.01 0.01 0.   0.04 0.01 0.01 0.05 0.04 0.01 0.04]
 [0.02 0.01 0.07 0.02 0.01 0.01 0.06 0.01 0.04 0.01 0.02 0.02 0.04 0.05 0.01 0.06 0.05 0.06 0.1  0.01 0.14 0.   0.01 0.03 0.06 0.04 0.03]]


In [12]:
# Sanity check
print(y_hat.sum(axis=-1, keepdims=True).round(2))

[[1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]]


In [13]:
def cross_entropy(y_hat, correct_target_idx):
    """Compute the cross-entropy loss. Equivalent to neg log likelihood."""
    target_class_prob = y_hat[np.arange(len(y_hat)), correct_target_idx]    # n_batch
    ce_loss = -1 * np.log(target_class_prob)
    return ce_loss
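
A useful reference point when reading the loss values: a model that assigns the same probability to all 27 tokens has a cross-entropy of log(27). The sketch below computes that baseline; it assumes nothing beyond the vocabulary size.

```python
import numpy as np

n_vocab = 27                                  # 26 letters plus the '.' start/stop token
uniform = np.full(n_vocab, 1.0 / n_vocab)     # same probability for every token
baseline_loss = -np.log(uniform[0])           # identical for any target class
print(baseline_loss)                          # log(27), about 3.2958
```

A randomly initialised `W` with unit-scale entries will typically start above this value, while near-zero weights start very close to it.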

In [14]:
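# weight_decay is only used later (gradient check and training run);
# the loss printed here is the plain mean cross-entropy
weight_decay = 0.01
loss = cross_entropy(y_hat, y).mean()
loss = loss.item()
print(loss)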

3.618154525756836


In [15]:
def sample_name(BP):
    """Sample a name using the probabilities table."""
    chars = ["."]   # start token
    while True:
        current_char = chars[-1]
        current_int = stoi[current_char]
        row_probs = BP[current_int]
        next_int = np.random.choice(len(row_probs), p=row_probs)
        next_char = itos[next_int]
        chars.append(next_char)
        if next_char == ".":   # end token
            break
    return "".join(chars)

In [16]:
np.random.seed(22)

# Sample from model
P = softmax(W)
for i in range(15):
    name = sample_name(BP=P)
    print(name)

.hjluacerht.
.qyua.
.tytiupgpgnqdyznq.
.fzqqdlpc.
.obaptcajhtotznfcniu.
.lgpstht.
.myxapbmfcnnjpdyewrwapziuehtu.
.tluehtovaposhdytvitlovawmq.
.bwgpgpkulacsovrsuotiuicbvaj.
.qxbwmyjslerzineanjyjhzt.
.lcsjisuqkqkndywnf.
.qqbmywqxxkjvehcztlf.
.tlyqdrbksotwtobrmxrrwflxnlgpcyerwttobpybuermqninqxjckfqy.
.bcig.
.rwmrerzuudzg.


In [17]:
# Backward pass into logits
logits_grad = y_hat - y_one_hot

# Equivalently
logits_grad_2 = y_hat.copy()
logits_grad_2[np.arange(len(y)), y] -= 1
assert np.allclose(logits_grad, logits_grad_2)
del logits_grad_2

print(f"logits_grad {logits_grad.shape},{logits_grad.dtype}:")
print(logits_grad.round(2))
print()

logits_grad (6, 27),float32:
[[ 0.02  0.01  0.07  0.02  0.01 -0.99  0.06  0.01  0.04  0.01  0.02  0.02  0.04  0.05  0.01  0.06  0.05  0.06  0.1   0.01  0.14  0.    0.01  0.03  0.06  0.04  0.03]
 [ 0.04  0.01  0.01  0.01  0.03  0.05  0.    0.02  0.03  0.06  0.01  0.07  0.   -0.98  0.01  0.01  0.05  0.03  0.3   0.02  0.01  0.05  0.02  0.01  0.06  0.07  0.  ]
 [ 0.01  0.05  0.02  0.03  0.06  0.01  0.11  0.06  0.01  0.01  0.02  0.03  0.01 -0.98  0.02  0.02  0.03  0.11  0.15  0.01  0.02  0.01  0.02  0.01  0.09  0.06  0.01]
 [ 0.01 -0.95  0.02  0.03  0.06  0.01  0.11  0.06  0.01  0.01  0.02  0.03  0.01  0.02  0.02  0.02  0.03  0.11  0.15  0.01  0.02  0.01  0.02  0.01  0.09  0.06  0.01]
 [-0.96  0.01  0.02  0.29  0.    0.02  0.02  0.01  0.02  0.03  0.02  0.02  0.01  0.04  0.01  0.01  0.2   0.01  0.01  0.    0.04  0.01  0.01  0.05  0.04  0.01  0.04]
 [ 0.02  0.01  0.07  0.02  0.01  0.01  0.06  0.01  0.04  0.01  0.02  0.02  0.04  0.05  0.01 -0.94  0.05  0.06  0.1   0.01  0.14  0.    0.01  0.03 

In [18]:
# Backward pass into W
W_grad = x_one_hot.T @ logits_grad / len(x)

# Equivalently, accumulate by indexing
# Note: the one-hot version is actually faster on this tiny dataset because
# matmuls use highly optimised BLAS routines, while np.add.at is sequential
W_grad_2 = np.zeros_like(W)
np.add.at(W_grad_2, x, logits_grad)
W_grad_2 /= len(x)
assert np.allclose(W_grad, W_grad_2)
del W_grad_2

print(f"W_grad {W_grad.shape},{W_grad.dtype}:")
print(W_grad.round(2))

W_grad (27, 27),float32:
[[ 0.01  0.    0.02  0.01  0.   -0.16  0.02  0.    0.01  0.    0.01  0.01  0.01  0.02  0.   -0.15  0.02  0.02  0.03  0.    0.05  0.    0.    0.01  0.02  0.01  0.01]
 [-0.16  0.    0.    0.05  0.    0.    0.    0.    0.    0.    0.    0.    0.    0.01  0.    0.    0.03  0.    0.    0.    0.01  0.    0.    0.01  0.01  0.    0.01]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.01  0.    0.    0.    0.01  0.01  0.    0.    0.01  0.01  0.    0.01  0.   -0.16  0.    0.    0.01  0.    0.05  0.    0.    0.01  0.    0.    0.0
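
The reason the indexed version uses `np.add.at` rather than `W_grad[x] += logits_grad` is that `x` contains repeated row indices (for example 0 and 13 in the six-example batch). The toy sketch below, with made-up indices and updates, shows the difference.

```python
import numpy as np

idx = np.array([0, 2, 2, 0])      # repeated row indices, as in x
updates = np.ones((4, 3))

buffered = np.zeros((3, 3))
buffered[idx] += updates          # buffered: each repeated row is written only once

accumulated = np.zeros((3, 3))
np.add.at(accumulated, idx, updates)   # unbuffered: repeated rows accumulate

print(buffered[:, 0])      # [1. 0. 1.]
print(accumulated[:, 0])   # [2. 0. 2.]
```

Only the accumulated result matches what `x_one_hot.T @ logits_grad` computes, since the matmul sums the gradient rows of all examples that share an input character.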

In [19]:
def numerical_gradient_check():
    """Compare the analytic gradient of W against central finite differences."""

    np.random.seed(22)

    # Init weights in extended precision to reduce finite-difference round-off
    # (np.float128 is not available on every platform)
    W = np.random.randn(len(letters), len(letters)).astype(np.float128)
    print(W.dtype)

    eps = 1e-5

    def loss_fn(W_):
        """Mean cross-entropy plus weight decay, matching the analytic gradient below."""
        logits = W_[x,:]    # n_batch, n_vocab
        y_hat = softmax(logits)
        return cross_entropy(y_hat, y).mean() + weight_decay*(W_**2).mean()

    # Forward pass
    logits = W[x,:]    # n_batch, n_vocab
    y_hat = softmax(logits)

    # Backward pass
    # Equivalent to: logits_grad = y_hat - y_one_hot
    logits_grad = y_hat.copy()
    logits_grad[np.arange(len(y)), y] -= 1

    # Equivalent to: W_grad = x_one_hot.T @ logits_grad / len(x)
    W_grad = np.zeros_like(W)
    np.add.at(W_grad, x, logits_grad)
    W_grad /= len(x)
    W_grad += 2*weight_decay*W / W.size  # Gradient from weight_decay

    # Grad check: perturb one weight at a time and take a central difference.
    # The weight decay term is included in both perturbed losses so that the
    # numerical gradient measures the same loss as the analytic one.
    W_grad_num = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_cpy_minus = W.copy()
            W_cpy_minus[i, j] -= eps

            W_cpy_plus = W.copy()
            W_cpy_plus[i, j] += eps

            W_grad_num[i, j] = (loss_fn(W_cpy_plus) - loss_fn(W_cpy_minus)) / (2*eps)

    return np.allclose(W_grad, W_grad_num, atol=1e-4, rtol=1e-3)

if numerical_gradient_check():
    print("Numerical grad check ok!")

float128
Numerical grad check ok!
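
The check above uses a central difference, (f(w + eps) - f(w - eps)) / (2 eps), rather than a one-sided one. The sketch below illustrates the accuracy difference on a toy scalar function whose derivative is known exactly; `softplus` is an illustrative choice and not part of the model.

```python
import numpy as np

def softplus(w):
    """Toy scalar function; its exact derivative is the sigmoid."""
    return np.log1p(np.exp(w))

w0, eps = 0.5, 1e-5
exact = 1.0 / (1.0 + np.exp(-w0))    # sigmoid(w0)

forward = (softplus(w0 + eps) - softplus(w0)) / eps
central = (softplus(w0 + eps) - softplus(w0 - eps)) / (2 * eps)

forward_err = abs(forward - exact)
central_err = abs(central - exact)
print(forward_err, central_err)
```

The truncation error of the forward difference shrinks linearly with `eps`, while that of the central difference shrinks quadratically, so the central estimate is usually several orders of magnitude closer for the same step size.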


In [20]:
print("loss before the update:", loss)

# Update weights
learning_rate = 0.1
W += -learning_rate * W_grad

loss = cross_entropy(softmax(W[x,:]), y).mean().item()
print("loss after one update:", loss)

loss before the update: 3.618154525756836
loss after one update: 3.6011502742767334


# Train Run

In [21]:
def train_model(X, Y, learning_rate=10, weight_init_scale=1.0, weight_decay=0.01, early_stop=2.49, print_every=100, max_epochs=500):

    print(f"Training model with: "
          f"lr={learning_rate}, "
          f"weight_init_scale={weight_init_scale}, "
          f"max_epochs={max_epochs}")
    
    start_time = time.time()

    losses = []

    # Copy not technically necessary
    x = X.copy()
    y = Y.copy()

    np.random.seed(22)
    W = np.random.randn(len(letters), len(letters)).astype(np.float32) * weight_init_scale

    for epoch in range(max_epochs):
        # Forward pass
        logits = W[x,:]    # n_batch, n_vocab
        y_hat = softmax(logits)
        loss = cross_entropy(y_hat, y).mean() + weight_decay*(W**2).mean()
        loss = loss.item()

        # Print and store loss
        if print_every is not None and epoch % print_every == 0:
            print(f"epoch: {epoch}, loss: {loss}")
        losses.append(loss)

        # Backward pass
        # Equivalent to: logits_grad = y_hat - y_one_hot
        logits_grad = y_hat.copy()
        logits_grad[np.arange(len(y)), y] -= 1

        # Equivalent to: W_grad = x_one_hot.T @ logits_grad / len(x)
        W_grad = np.zeros_like(W)
        np.add.at(W_grad, x, logits_grad)
        W_grad /= len(x)
        W_grad += 2*weight_decay*W / W.size  # Gradient from weight_decay

        # Update weights
        W += -learning_rate * W_grad

        if loss < early_stop:
            break

    elapsed_time = time.time() - start_time
    print(f"Training completed: "
          f"elapsed_time={elapsed_time:.2f}s, "
          f"final_epoch={epoch}, loss={loss}")
    
    result = {
        'name': f'w={weight_init_scale}_lr={learning_rate}',
        'W': W,
        'losses': losses,
    }

    return result


In [22]:
result = train_model(X, Y, learning_rate=25.0, weight_init_scale=0.01, weight_decay=0.01, early_stop=2.49, print_every=50, max_epochs=500)
W = result['W']
losses = result['losses']

Training model with: lr=25.0, weight_init_scale=0.01, max_epochs=500
epoch: 0, loss: 3.2961506843566895
epoch: 50, loss: 2.5306262969970703
epoch: 100, loss: 2.500636339187622
epoch: 150, loss: 2.4911513328552246
Training completed: elapsed_time=31.99s, final_epoch=161, loss=2.4899325370788574
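
For comparison, a bigram model can also be fitted without gradient descent by counting pairs and normalising each row; that table is the maximum likelihood solution, so the cross-entropy part of the training loss cannot go below its value. The sketch below shows the idea on a tiny hypothetical list of names (`toy_names` is a stand-in, not the real `names.txt`), and all variable names are illustrative.

```python
import numpy as np

toy_names = ["emma", "olivia", "ava"]    # hypothetical stand-in for names.txt
chars = ["."] + sorted(set("".join(toy_names)))
stoi_toy = {ch: i for i, ch in enumerate(chars)}

# Count every (current char, next char) pair
counts = np.zeros((len(chars), len(chars)))
pairs = []
for name in toy_names:
    padded = "." + name + "."
    for a, b in zip(padded, padded[1:]):
        counts[stoi_toy[a], stoi_toy[b]] += 1
        pairs.append((stoi_toy[a], stoi_toy[b]))

# Normalise each row into next-character probabilities
P_counts = counts / counts.sum(axis=1, keepdims=True)

# Mean negative log likelihood of the training pairs under the count table
rows = np.array([p[0] for p in pairs])
cols = np.array([p[1] for p in pairs])
nll = -np.log(P_counts[rows, cols]).mean()
print(nll)
```

Running the same counting on the full `X` and `Y` arrays would give the floor that the gradient-descent loss approaches as training continues.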


In [23]:
# Init numpy random seed
np.random.seed(22)

P = softmax(W)
for i in range(5):
    name = sample_name(BP=P)
    print(name)

# Expected output:
# .chasah.
# .mar.
# .kora.
# .ryn.
# .quliemi.

.chasah.
.mar.
.kora.
.ryn.
.quliemi.


# Experiments

In [None]:
train_runs = {}

# Sweep the learning rate with a fixed, small weight initialisation
for lr in [150.0, 50.0, 25.0, 10.0, 1.0]:
    result = train_model(X, Y, learning_rate=lr, weight_init_scale=0.01, weight_decay=0.01, early_stop=2.49, max_epochs=500, print_every=None)
    train_runs[result['name']] = result['losses']

In [None]:
# Plot the loss curve of each run
import matplotlib.pyplot as plt
fig, ax = plt.subplots()

for name, losses in train_runs.items():
    ax.plot(losses, label=name)

ax.set_xlabel("epoch")
ax.set_ylabel("loss")
ax.legend()
plt.show()

In [None]:
train_runs = {}

# Sweep the weight initialisation scale with a fixed learning rate
for scale in [2.0, 1.0, 0.1, 0.01, 0.0]:
    result = train_model(X, Y, learning_rate=50.0, weight_init_scale=scale, weight_decay=0.01, early_stop=2.49, max_epochs=500, print_every=None)
    train_runs[result['name']] = result['losses']

In [None]:
# Plot the loss curve of each run
import matplotlib.pyplot as plt
fig, ax = plt.subplots()

for name, losses in train_runs.items():
    ax.plot(losses, label=name)

ax.set_xlabel("epoch")
ax.set_ylabel("loss")
ax.legend()
plt.show()