## Simple example of TuRBO-1

In [1]:
from turbo.turbo_1_grad import Turbo1Grad
import numpy as np
import torch
import math
import matplotlib
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

## Set up an optimization problem class

In this example of TuRBO with derivatives, we set up the objective function as follows.

First, we'll create a true linear function in 3 dimensions with weights $w^{*} = [1.1, 0.8, 0.1]$ and a bias of $b^{*} = 2.0$. Then we'll sample 1000 data points $x_{i} \sim \mathcal{N}(0, I)$ in 3 dimensions, and set the labels to be $y_{i} = w^{*\top}x_{i} + b^{*} + \epsilon_{i}$, where $\epsilon_{i} \sim \mathcal{N}(0, \sigma^{2}_{n})$. We set $\sigma_{n}$ to 10% of the standard deviation of the noiseless labels, so there is roughly a 10:1 signal-to-noise ratio (in standard deviation) in the data.

Given `data_x` and `data_y`, the objective we seek to minimize is the mean squared error of a linear model fit to this data, without assuming we know the optimal parameters $w^{*}, b^{*}$. In other words, our "blackbox" objective function with gradients is $$f(w, b)=\frac{1}{N}\sum_{i=1}^{N}\left(y_{i} - [w^{\top}x_{i} + b]\right)^{2}$$

Because the noise is small relative to the signal, the global minimum of this function should lie very close to $w^{*}, b^{*}$.
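
Since $f$ is an ordinary least-squares objective, its minimizer can also be computed in closed form, which gives a useful reference point for any optimizer run on it. The following is a minimal NumPy-only sketch on freshly generated data; the seed, the `rng` generator, and the names `design`, `w_hat`, and `b_hat` are illustrative assumptions rather than part of the notebook's cells.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility (assumption)

true_weights = np.array([1.1, 0.8, 0.1])
true_bias = 2.0

# Same construction as described above: noise std is 10% of the signal std.
data_x = rng.standard_normal((1000, 3))
signal = data_x @ true_weights
data_y = signal + 0.1 * signal.std() * rng.standard_normal(1000) + true_bias

# Append a column of ones so the bias is estimated alongside the weights.
design = np.hstack([data_x, np.ones((1000, 1))])
solution, *_ = np.linalg.lstsq(design, data_y, rcond=None)
w_hat, b_hat = solution[:3], solution[3]

print("least-squares weights:", w_hat)
print("least-squares bias:", b_hat)
```

The estimates should differ from $w^{*}, b^{*}$ only by a small sampling error, which is why the true parameters are a good proxy for the global minimum.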

In [2]:
# Step 1: Create data_x and data_y as described above, with true_weights being w* and true_bias being b*

true_weights = torch.tensor([1.1, 0.8, 0.1], dtype=torch.float32).view(3)
true_bias = 2.0

data_x = torch.randn(1000, 3)
data_y = data_x.matmul(true_weights)
data_y = data_y + torch.randn_like(data_y) * (0.1 * data_y.std()) + true_bias
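
To make the noise level concrete, the sketch below repeats the construction in NumPy and measures the signal-to-noise ratio directly. It is an illustration under assumed names (`signal`, `noise`, `rng`) and an arbitrary seed, not a cell from the notebook. Note that a 10:1 ratio in standard deviation corresponds to roughly 100:1 in variance, so the best achievable mean squared error is approximately the noise variance, about 1% of the signal variance.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed (assumption)
true_weights = np.array([1.1, 0.8, 0.1])

data_x = rng.standard_normal((1000, 3))
signal = data_x @ true_weights                          # noiseless labels, without the bias
noise = 0.1 * signal.std() * rng.standard_normal(1000)  # noise std = 10% of signal std

std_ratio = signal.std() / noise.std()
var_ratio = signal.var() / noise.var()
print(f"signal/noise std ratio:      {std_ratio:.1f}")
print(f"signal/noise variance ratio: {var_ratio:.1f}")
print(f"noise variance (approximate loss floor): {noise.var():.4f}")
```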

In [3]:
class LinearModel(torch.nn.Module):
    """
    Represents a simple linear model in PyTorch, where forward(x) returns w'x + b.
    
    The weights and bias parameters are learnable, which will let us get derivatives for them easily.
    """
    def __init__(self, num_dims):
        super().__init__()
        self.weights = torch.nn.Parameter(torch.zeros(num_dims))
        self.bias = torch.nn.Parameter(torch.zeros(1))
    
    def forward(self, x):
        return x.matmul(self.weights) + self.bias

## Baseline / Sanity Check

In the next cell, we run a simple baseline / sanity check by training the linear model with a gradient-based optimizer (Adam). Since the objective is convex, we should have no trouble recovering the true weights and bias almost exactly.

In [4]:
lm = LinearModel(3)

from torch.optim import Adam

optimizer = Adam(lm.parameters(), lr=0.01)

for i in range(500):
    lm.zero_grad()
    output = lm(data_x)
    loss = (data_y - output).pow(2).mean()
    loss.backward()
    if i % 50 == 0:
        print(f'Iteration {i} - Loss = {loss.item():.2f}')
    optimizer.step()
    
print(f'Learned weights: {lm.weights.data} - Learned bias: {lm.bias.data.item():.2f}')
print(f'True weights: {true_weights} - True bias: {true_bias}')

Iteration 0 - Loss = 5.91
Iteration 50 - Loss = 2.85
Iteration 100 - Loss = 1.32
Iteration 150 - Loss = 0.60
Iteration 200 - Loss = 0.26
Iteration 250 - Loss = 0.11
Iteration 300 - Loss = 0.05
Iteration 350 - Loss = 0.03
Iteration 400 - Loss = 0.02
Iteration 450 - Loss = 0.02
Learned weights: tensor([1.0978, 0.7969, 0.0988]) - Learned bias: 1.98
True weights: tensor([1.1000, 0.8000, 0.1000]) - True bias: 2.0


## Define blackbox objective class

Next, we define a class similar to what is normally used for TuRBO. 

The `__call__` method here instantiates a `LinearModel` object using the class above, filling the weights with the first three entries of `x` and the bias with the last entry of `x`. We then compute the loss as usual, backpropagate to get the derivatives with respect to the weights and the bias, and concatenate them into a single `grad` vector.

The call returns a vector of length `dim + 1` of the form `[f(x), [df/dx]_1, [df/dx]_2, [df/dx]_3, [df/dx]_4]`.

In [5]:
class LinearModelFunction(object):
    def __init__(self, data_x, data_y, dim=4):
        self.dim = dim
        self.lb = -2 * np.ones(dim)
        self.ub = 2 * np.ones(dim)
        
        self.data_x = data_x
        self.data_y = data_y
    
    def __call__(self, x):
        x = torch.from_numpy(x)
        lb_ = torch.from_numpy(self.lb)
        ub_ = torch.from_numpy(self.ub)
        assert len(x) == self.dim
        assert torch.all(x >= lb_)
        assert torch.all(x <= ub_)
        
        lm = LinearModel(num_dims=self.dim - 1)
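        lm.weights.data.copy_(x[:self.dim - 1])  # All but the last entry are the weights
        lm.bias.data.copy_(x[-1])  # The last entry is the bias
        
        # Use the data stored on this object rather than the notebook-level globals
        output = lm(self.data_x)
        loss = (self.data_y - output).pow(2).mean()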
        loss.backward()
        
        grad = torch.cat([lm.weights.grad, lm.bias.grad])
        return np.hstack([loss.item(), grad.cpu().detach().numpy()])

In [6]:
lmf = LinearModelFunction(data_x, data_y)

In [7]:
lmf(np.array([1.1, 0.8, 0.1, 2.0]))  # Optimal parameters

array([0.02016393, 0.00554729, 0.00578184, 0.00085408, 0.01044734])
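
A quick way to gain confidence in an objective that returns `[f(x), df/dx]` is to compare the returned gradient against central finite differences. The sketch below does this for a NumPy re-implementation of the same mean squared error; `mse_with_grad` and `finite_difference_grad` are hypothetical helper names, and the analytic gradient used is $\partial f/\partial w = -\frac{2}{N}\sum_{i}(y_{i} - w^{\top}x_{i} - b)\,x_{i}$ and $\partial f/\partial b = -\frac{2}{N}\sum_{i}(y_{i} - w^{\top}x_{i} - b)$.

```python
import numpy as np

def mse_with_grad(params, data_x, data_y):
    """Return [f, df/dw_1, ..., df/dw_d, df/db] for the mean squared error."""
    weights, bias = params[:-1], params[-1]
    residual = data_y - (data_x @ weights + bias)
    loss = np.mean(residual ** 2)
    grad_w = -2.0 * data_x.T @ residual / len(data_y)
    grad_b = -2.0 * residual.mean()
    return np.hstack([loss, grad_w, grad_b])

def finite_difference_grad(params, data_x, data_y, eps=1e-6):
    """Central-difference approximation of the gradient of the loss."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        f_plus = mse_with_grad(params + step, data_x, data_y)[0]
        f_minus = mse_with_grad(params - step, data_x, data_y)[0]
        grad[i] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)  # arbitrary seed (assumption)
data_x = rng.standard_normal((1000, 3))
data_y = data_x @ np.array([1.1, 0.8, 0.1]) + 2.0 + 0.1 * rng.standard_normal(1000)

params = np.array([0.5, -0.3, 1.2, 0.7])  # an arbitrary point inside [-2, 2]^4
analytic = mse_with_grad(params, data_x, data_y)
numeric = finite_difference_grad(params, data_x, data_y)
print("analytic gradient:", analytic[1:])
print("numeric gradient: ", numeric)
```

The same check could be applied to `lmf` itself by replacing the call to `mse_with_grad` with `lmf(...)`, provided the perturbed points stay inside the bounds.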

## Create a Turbo optimizer instance

Everything from this point on is identical to the standard TuRBO setting.

In [21]:
turbo1 = Turbo1Grad(
    f=lmf,  # Handle to objective function
    lb=lmf.lb,  # Numpy array specifying lower bounds
    ub=lmf.ub,  # Numpy array specifying upper bounds
    n_init=20,  # Number of initial points from a Latin hypercube design
    max_evals=1000,  # Maximum number of evaluations
    batch_size=10,  # How large batch size TuRBO uses
    verbose=True,  # Print information from each batch
    use_ard=True,  # Set to true if you want to use ARD for the GP kernel
    max_cholesky_size=2000,  # When we switch from Cholesky to Lanczos
    n_training_steps=50,  # Number of steps of Adam to learn the hypers
    min_cuda=1024,  # Run on the CPU for small datasets
    device="cpu",  # "cpu" or "cuda"
    dtype="float64",  # float64 or float32
)

Using dtype = torch.float64 
Using device = cpu
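
The `n_init` argument refers to a Latin hypercube design, which spreads the initial points so that each dimension is covered evenly. The following is a generic sketch of that sampling scheme, not necessarily the routine TuRBO uses internally; the function name `latin_hypercube` and the seed are assumptions made for illustration.

```python
import numpy as np

def latin_hypercube(n_points, dim, rng):
    """Sample n_points in [0, 1]^dim with exactly one point per bin in every dimension."""
    # Centers of n_points equal-width bins in [0, 1].
    centers = (np.arange(n_points) + 0.5) / n_points
    samples = np.empty((n_points, dim))
    for d in range(dim):
        # Shuffle the bin order independently for each dimension.
        samples[:, d] = rng.permutation(centers)
    # Move each point to a random position within its bin.
    jitter = rng.uniform(-0.5, 0.5, size=(n_points, dim)) / n_points
    return samples + jitter

rng = np.random.default_rng(0)  # arbitrary seed (assumption)
lb, ub = -2 * np.ones(4), 2 * np.ones(4)  # same bounds as lmf.lb and lmf.ub
unit_points = latin_hypercube(20, 4, rng)
init_points = lb + (ub - lb) * unit_points  # rescale from [0, 1] to [lb, ub]
print(init_points.shape)
```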


## Run the optimization process

In [22]:
turbo1.optimize()

Starting from fbest = 1.963
30) New best: 1.244
40) New best: 0.8119
50) New best: 0.3613
60) New best: 0.2452
70) New best: 0.1934
80) New best: 0.1626
90) New best: 0.1485
90) Restarting with fbest = 0.1485
Starting from fbest = 1.869
150) New best: 0.1458
160) New best: 0.1033
170) New best: 0.07771
180) New best: 0.06606
180) Restarting with fbest = 0.06606
Starting from fbest = 1.444
270) Restarting with fbest = 0.141
Starting from fbest = 1.024
320) New best: 0.03518
340) New best: 0.02846
350) New best: 0.02442
360) New best: 0.02178
360) Restarting with fbest = 0.02178
Starting from fbest = 2.293
450) Restarting with fbest = 0.02311
Starting from fbest = 2.933
540) Restarting with fbest = 0.02561
Starting from fbest = 2.021
630) Restarting with fbest = 0.1422
Starting from fbest = 0.4744
720) Restarting with fbest = 0.02496
Starting from fbest = 1.621
810) Restarting with fbest = 0.1009
Starting from fbest = 1.143
900) Restarting with fbest = 0.08303
Starting from fbest = 2.391

## Extract all evaluations from Turbo and print the best

In [23]:
X = turbo1.X  # Evaluated points
fX = turbo1.fX  # Observed values
ind_best = np.argmin(fX[:, 0])  # The first column is the actual function value, so argmin over that.
f_best, x_best = fX[ind_best], X[ind_best, :]
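
Because each row of `fX` holds the function value followed by the gradient, the `argmin` has to be restricted to the first column. The toy example below uses made-up stand-ins for `turbo1.X` and `turbo1.fX` (all numbers are hypothetical) to show the layout and the pitfall of taking `argmin` over the whole array.

```python
import numpy as np

# Hypothetical stand-ins for turbo1.X (n x dim) and turbo1.fX (n x (dim + 1)).
X = np.array([[0.0, 0.0, 0.0, 0.0],
              [1.0, 0.8, 0.1, 1.9],
              [1.5, 0.2, -0.4, 1.0]])
fX = np.array([[5.90, -3.0, -2.1, -0.4, -4.0],
               [0.04, -0.2, 0.0, 0.0, -0.2],
               [1.20, 0.9, -1.1, -0.9, -1.8]])

ind_best = np.argmin(fX[:, 0])   # row with the smallest function value
f_best = fX[ind_best, 0]         # function value at the best point
grad_best = fX[ind_best, 1:]     # gradient observed at the best point
x_best = X[ind_best]

# Without the column slice, argmin runs over the flattened array and can
# land on a gradient entry instead of a function value.
flat_index = np.argmin(fX)

print(ind_best, f_best, x_best)
print(flat_index)
```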

In [27]:
print(f'Best value (and gradient) was {torch.from_numpy(f_best)}')
print(f'Best weights were {torch.from_numpy(x_best[:3])} and the bias was {x_best[-1]:.2f}')

Best value (and gradient) was tensor([ 0.0209, -0.0322,  0.0264,  0.0200, -0.0318], dtype=torch.float64)
Best weights were tensor([1.0807, 0.8103, 0.1086], dtype=torch.float64) and the bias was 1.98
