# Compiling inference with neural networks

The inference algorithms we gave for goal inference in the previous notebook don't seem very intelligent---they basically boil down to randomly guessing scenarios and hoping that one happens to match the dataset. There are a number of approaches for creating more efficient inference algorithms. We will focus on one approach based on using neural networks to speed up inference.

In [None]:
addprocs(4);

In [None]:
import Gen
@everywhere using Gen;

First, we load the relevant code from the previous notebook.

In [None]:
@everywhere include("resources/goals/scene.jl")
@everywhere include("resources/goals/path_planner.jl")
@everywhere include("resources/goals/uniform_2d.jl");

In [None]:
@everywhere include("resources/goals/agent_waypoint_model.jl")

In [None]:
include("resources/goals/rendering.jl")

In [None]:
detour_dataset = [
    Point(9.59825,8.92063)
    Point(21.8936,9.54817)
    Point(30.9534,10.8819)
    Point(43.1137,9.75395)
    Point(48.8929,10.4189)
    Point(46.0282,21.7662)
    Point(35.0281,25.9994)
    Point(27.2084,33.5729)
    Point(20.1662,39.9398)
    Point(18.7309,50.0026)
];

First, let's understand in a bit more detail why the importance sampling inference algorithm was slow when applied to the new model with the waypoint.
Suppose we knew the right waypoint:

In [None]:
trace = ProgramTrace()
constrain!(trace, "use-waypoint", true)
constrain!(trace, "waypoint", Point(50, 10));

Then the baseline importance sampling algorithm gives reasonable inferences with far fewer samples. Recall that, without knowledge of the waypoint, the importance sampling algorithm needed 1024 samples to give reasonable inferences.

In [None]:
num_samples_list = [1, 4, 32]
figure = Figure(num_rows=1, num_cols=length(num_samples_list),
                width=900, height=300, trace_width=100, trace_height=100,
                margin_top=20, titles=map((n)-> "Importance sampling ($n samples)", num_samples_list))
here(figure)

In [None]:
CSS("""
    #$(id(figure)) .path.recorded { visibility: hidden; }
    #$(id(figure)) .path.constrained { visibility: visible; }
    #$(id(figure)) .path_segments { visibility: hidden; }
    #$(id(figure)) .destination { fill-opacity: 0.5; }
""")

In [None]:
constrain!(trace, "start", Point(10, 10))
for (i, point) in enumerate(detour_dataset)
    constrain!(trace, "x$i", point.x)
    constrain!(trace, "y$i", point.y)
end
renderer = JupyterInlineRenderer("agent_model_renderer", Dict("destination" => "overlay"))
num_approximate_samples = 50
for (i, num_samples) in enumerate(num_samples_list)
    attach(renderer, id(figure => i))
    for j=1:num_approximate_samples
        output_sample = agent_waypoint_model_importance_sampling(trace, num_samples)
        render(renderer, output_sample)
    end
end

With knowledge of the waypoint, we get reasonable inferences from importance sampling with just 32 samples. We use this idea by training a neural network to make informed guesses about the waypoint, given the observed data as its input. We train the neural network on many independent unconstrained executions of the program, each of which generates a trace containing both latent variables and a dataset. The neural network will be trained to predict the waypoint given the dataset. This neural network will then be used to accelerate inference for any observed data we encounter.

In [None]:
include("resources/goals/neural.jl");

In [None]:
@everywhere @program waypoint_predictor_network(features::Vector{Float64}, num_hidden_units::Int) begin

    # parameters of the network, with their initial values prior to training
    W_hidden = @e(randn(num_hidden_units, length(features)), "W-hidden")
    b_hidden = @e(randn(num_hidden_units), "b-hidden")
    W_output_x_mu = @e(randn(num_hidden_units), "W-output-x-mu")
    b_output_x_mu = @e(randn(), "b-output-x-mu")
    W_output_x_log_std = @e(randn(num_hidden_units), "W-output-x-log-std")
    b_output_x_log_std = @e(randn(), "b-output-x-log-std")
    W_output_y_mu = @e(randn(num_hidden_units), "W-output-y-mu")
    b_output_y_mu = @e(randn(), "b-output-y-mu")
    W_output_y_log_std =  @e(randn(num_hidden_units), "W-output-y-log-std")
    b_output_y_log_std = @e(randn(), "b-output-y-log-std")

    # compute the hidden layer values
    hidden = sigmoid(W_hidden * features + b_hidden)

    # sample the x-coordinate of waypoint prediction
    x_mu = W_output_x_mu' * hidden + b_output_x_mu
    x_std = exp(W_output_x_log_std' * hidden + b_output_x_log_std)
    output_x = @g(normal(x_mu, x_std), "output-x")

    # sample the y-coordinate of waypoint prediction
    y_mu = W_output_y_mu' * hidden + b_output_y_mu
    y_std = exp(W_output_y_log_std' * hidden + b_output_y_log_std)
    output_y = @g(normal(y_mu, y_std), "output-y")
end

# we rescale the output of the neural network to make training easier
@everywhere function scale_coordinate{T}(x::T)
    (x - 50.) / 100.
end

@everywhere function unscale_coordinate{T}(x::T)
    x * 100. + 50.
end
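
Stripped of the tracing annotations, the network above is a single hidden layer followed by two linear read-outs per coordinate: one for the mean and one for the log standard deviation. The following plain-Julia sketch shows that computation for one coordinate. The names `logistic` and `predict_gaussian` are hypothetical, and it assumes that the `sigmoid` used in the network is the elementwise logistic function.

```julia
# Illustrative sketch only: plain Julia, no Gen tracing.
# `logistic` and `predict_gaussian` are hypothetical names, not part of neural.jl.
logistic(z) = 1.0 / (1.0 + exp(-z))

function predict_gaussian(features, W_hidden, b_hidden, w_mu, b_mu, w_log_std, b_log_std)
    # hidden layer: affine map followed by an elementwise nonlinearity
    hidden = logistic.(W_hidden * features .+ b_hidden)

    # mean of the predictive Gaussian: linear read-out of the hidden layer
    mu = sum(w_mu .* hidden) + b_mu

    # the read-out for the standard deviation is exponentiated so that it is positive
    sigma = exp(sum(w_log_std .* hidden) + b_log_std)

    return (mu, sigma)
end
```

Calling this once with the x-coordinate weights and once with the y-coordinate weights would give the parameters of the two normal distributions from which `output-x` and `output-y` are sampled.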

We are going to predict the waypoint using the measured locations at the first 10 time points, and our neural network predictor will use 50 hidden units:

In [None]:
@everywhere num_time_steps = 10
@everywhere num_hidden_units = 50

We obtain initial values for the neural network parameters by running the neural network probabilistic program once and extracting the parameter values. This procedure also serves to identify which named values in the trace are the parameters that we wish to optimize over.

In [None]:
function make_initial_parameter_values(num_time_steps::Int, num_hidden_units::Int)
    
    # construct example features for the neural network (to get the
    # dimensionality of the features)
    trace = ProgramTrace()
    @generate!(agent_waypoint_model(), trace)
    example_features = construct_features(trace, num_time_steps)

    # run the neural network once, sampling parameters, and extract their values
    inference_trace = ProgramTrace()
    @generate!(waypoint_predictor_network(example_features, num_hidden_units), inference_trace)
    parameters = Dict{String,Any}()
    parameters["W-hidden"] = inference_trace["W-hidden"]
    parameters["b-hidden"] = inference_trace["b-hidden"]
    parameters["W-output-x-mu"] = inference_trace["W-output-x-mu"]
    parameters["b-output-x-mu"] = inference_trace["b-output-x-mu"]
    parameters["W-output-x-log-std"] = inference_trace["W-output-x-log-std"]
    parameters["b-output-x-log-std"] = inference_trace["b-output-x-log-std"]
    parameters["W-output-y-mu"] = inference_trace["W-output-y-mu"]
    parameters["b-output-y-mu"] = inference_trace["b-output-y-mu"]
    parameters["W-output-y-log-std"] = inference_trace["W-output-y-log-std"]
    parameters["b-output-y-log-std"] = inference_trace["b-output-y-log-std"]
    return parameters
end;

The following function defines the training distribution for our amortized inference neural network. Each execution of this function returns a trace of the `agent_waypoint_model` probabilistic program, which contains the locations of the agent, the waypoint, and all of the other elements of an internally coherent simulated scenario. Recall that each trace is a training datum that contains the input to the neural network (the measured location of the simulated drone over time) and the ground-truth value of the simulated drone's waypoint, which the neural network should predict. The loss associated with each training trace is the negative log likelihood of the true waypoint under the neural network's predictive distribution. Minimizing the expected value of this loss function over the training distribution is equivalent to minimizing the expected Kullback-Leibler divergence between the true posterior over the waypoint and the network's predictive distribution (with the true posterior as the first argument of the divergence).

In [None]:
@everywhere function model_trace_generator()

    # we are only compiling for a fixed start position
    model_trace = ProgramTrace()
    constrain!(model_trace, "start", Point(10., 10.))
   
    # reject samples until path planning succeeded and use-waypoint = true
    # we are assuming that the agent does use the waypoint
    while true
        @generate!(agent_waypoint_model(), model_trace)
        !model_trace["planning-failed"] && model_trace["use-waypoint"] && break
    end
    @assert model_trace["use-waypoint"]
    return model_trace
end
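
The link between the training loss and the Kullback-Leibler divergence mentioned above can be written out in one line. The notation is introduced here only for illustration: $x$ stands for the simulated measurements, $w$ for the waypoint, $p$ for the training distribution defined by `model_trace_generator`, and $q_\theta(w \mid x)$ for the network's predictive distribution with parameters $\theta$.

$$
\mathbb{E}_{p(x, w)}\left[-\log q_\theta(w \mid x)\right]
= \mathbb{E}_{p(x)}\left[\mathrm{KL}\left(p(w \mid x) \,\|\, q_\theta(w \mid x)\right)\right]
+ \mathbb{E}_{p(x)}\left[\mathrm{H}\left(p(w \mid x)\right)\right]
$$

The entropy term on the right does not depend on $\theta$, so minimizing the expected loss on the left over $\theta$ minimizes the expected divergence.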


Next, we need to match up the random choices representing the waypoint in the neural network trace with the random choices representing the waypoint in the model trace. Specifically, we need to indicate how to constrain the neural network trace with the ground-truth value of the waypoint obtained from a training trace:

In [None]:
@everywhere function constrain_waypoint_network_outputs(model_trace::ProgramTrace, network_trace::ProgramTrace)
    @assert model_trace["use-waypoint"]
    delete!(network_trace, "output-x")
    delete!(network_trace, "output-y")
    waypoint = model_trace["waypoint"]
    constrain!(network_trace, "output-x", scale_coordinate(waypoint.x))
    constrain!(network_trace, "output-y", scale_coordinate(waypoint.y))
end
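
Once the network outputs are constrained to the scaled ground-truth waypoint, the loss for that training trace is the negative log density of those two values under the network's two normal distributions. A minimal sketch of that quantity in plain Julia follows; `normal_logpdf` and `waypoint_loss` are hypothetical names written out here for illustration, not functions from `neural.jl`.

```julia
# Illustrative sketch only; these helper names are hypothetical.

# log density of a normal distribution with mean `mu` and standard deviation `sigma`
normal_logpdf(x, mu, sigma) = -0.5 * log(2 * pi) - log(sigma) - 0.5 * ((x - mu) / sigma)^2

# negative log likelihood of a scaled waypoint (x, y) under independent
# normal predictions for each coordinate
function waypoint_loss(x, y, x_mu, x_sigma, y_mu, y_sigma)
    return -(normal_logpdf(x, x_mu, x_sigma) + normal_logpdf(y, y_mu, y_sigma))
end
```

The loss is small when the predicted means are close to the true waypoint and the predicted standard deviations are neither too wide nor too narrow.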

Next, we indicate how to construct the input to the neural network from the ground-truth model trace. We scale the coordinates from the range [0, 100] to the range [-0.5, 0.5] to make training faster.

In [None]:
@everywhere function construct_features(model_trace::Trace, num_time_steps::Int)
    xs = map((j) -> model_trace["x$j"], 1:num_time_steps)
    ys = map((j) -> model_trace["y$j"], 1:num_time_steps)
    scale_coordinate(vcat(xs, ys))
end

@everywhere function inference_input_constructor(model_trace::Trace)
    features = construct_features(model_trace, num_time_steps)
    (features, num_hidden_units)    
end
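
To make the layout of the feature vector concrete, here is a small standalone sketch that applies the same scaling to the first two points of `detour_dataset`. The names `scale` and `unscale` are local stand-ins for `scale_coordinate` and `unscale_coordinate`, defined here so that the snippet runs on its own.

```julia
# Illustrative sketch only: local stand-ins for scale_coordinate / unscale_coordinate.
scale(x) = (x - 50.0) / 100.0
unscale(x) = x * 100.0 + 50.0

# first two measured locations from detour_dataset
xs = [9.59825, 21.8936]
ys = [8.92063, 9.54817]

# all x-coordinates come first, followed by all y-coordinates, each scaled
features = map(scale, vcat(xs, ys))
```

With `num_time_steps = 10` the real feature vector has 20 entries, laid out in the same order.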

We then put all of the components together into the `AmortizedInferenceScheme`, which is defined in `neural.jl`:

In [None]:
amortized_inference_scheme = AmortizedInferenceScheme(
    
        # generates model traces which serve as training data 
        model_trace_generator,
    
        # the inference program being optimized
        waypoint_predictor_network,
    
        # procedure for constructing input to inference program from a model trace
        inference_input_constructor,
    
        # procedure for constraining output of inference program using a model trace
        constrain_waypoint_network_outputs
    );

Finally, we train the network. We only do a few gradient steps here for illustration.

In [None]:
training_params = TrainingParams(
        32, # minibatch size
        10, # maximum number of ADAM SGD iterations
        32, # number of test samples to use for evaluation at each SGD step
        ADAMParameters(0.001, 0.9, 0.999, 1e-8) # optimization parameters
)

# initialize parameters
inference_parameters = make_initial_parameter_values(num_time_steps, num_hidden_units)

# train is defined in neural.jl. It is a generic procedure for training in an amortized inference scheme.
inference_parameters = train(amortized_inference_scheme, inference_parameters, training_params);
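
For readers unfamiliar with ADAM, the sketch below shows one update step of the standard algorithm. It assumes that the four arguments of `ADAMParameters` above are, in order, the step size, the two moment decay rates, and the numerical-stability constant. That ordering is a guess based on the usual ADAM defaults, and `adam_step` is a hypothetical name rather than the routine used by `train`.

```julia
# Illustrative sketch of one standard ADAM step; not the code used in neural.jl.
# `t` is the 1-based iteration count; `m` and `v` are running moment estimates.
function adam_step(theta, grad, m, v, t; alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)
    # exponential moving averages of the gradient and of its elementwise square
    m = beta1 .* m .+ (1 - beta1) .* grad
    v = beta2 .* v .+ (1 - beta2) .* grad .^ 2

    # correct the bias caused by initializing m and v at zero
    m_hat = m ./ (1 - beta1^t)
    v_hat = v ./ (1 - beta2^t)

    # step against the gradient, scaled per parameter
    theta = theta .- alpha .* m_hat ./ (sqrt.(v_hat) .+ epsilon)
    return (theta, m, v)
end
```

On the first iteration each parameter moves by roughly `alpha` in the direction opposite to the sign of its gradient, regardless of the gradient's magnitude.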

We have already trained the network for 20,000 iterations (about two hours), and we load those parameters here:

In [None]:
inference_parameters = load_neural_network("resources/goals/neural_waypoint_predictor_params.json");

Let's visualize the waypoints proposed by the trained neural network for a few different datasets.

In [None]:
num_datasets = 9
figure = Figure(num_rows=3, num_cols=3, width=900, height=900, trace_width=100, trace_height=100)
here(figure)

In [None]:
CSS("""
    #$(id(figure)) .path.recorded { visibility: visible; }
    #$(id(figure)) .path.constrained { visibility: visible; }
    #$(id(figure)) .path_segments { visibility: visible; }
    #$(id(figure)) .destination { visibility: visible; }
    #$(id(figure)) .waypoint { stroke-opacity: 0.5; }
""")

In [None]:
renderer = JupyterInlineRenderer("agent_model_renderer", Dict("waypoint" => "overlay"))
for dataset_index=1:num_datasets    
    model_trace = model_trace_generator()
    for i=1:50
        network_trace = ProgramTrace()
        for key in keys(inference_parameters)
            intervene!(network_trace, key, inference_parameters[key])
        end
        network_input = inference_input_constructor(model_trace)
        @generate!(waypoint_predictor_network(network_input...), network_trace)
        
        delete!(model_trace, "waypoint")
        constrain!(model_trace, "waypoint", Point(
                unscale_coordinate(network_trace["output-x"]),
                unscale_coordinate(network_trace["output-y"])))

        attach(renderer, id(figure => dataset_index))
        render(renderer, model_trace)
        sleep(0.01)
    end
end

Let's also see the predictions for our particular original detour dataset, and compare these to the predictions made from the uniform distribution:

In [None]:
figure = Figure(num_rows=1, num_cols=2,
                width=900, height=300, trace_width=100, trace_height=100,
                margin_top=20, titles=["uniform proposal", "trained neural proposal"])
here(figure)

In [None]:
CSS("""
    #$(id(figure)) .path.recorded { visibility: hidden; }
    #$(id(figure)) .path.constrained { visibility: visible; }
    #$(id(figure)) .path_segments { visibility: hidden; }
    #$(id(figure)) .destination { visibility: hidden; }
    #$(id(figure)) .waypoint { stroke-opacity: 0.5; }
""")

In [None]:
renderer = JupyterInlineRenderer("agent_model_renderer", Dict("waypoint" => "overlay"))

model_trace = ProgramTrace()
constrain!(model_trace, "start", Point(10, 10))
constrain!(model_trace, "use-waypoint", true)
for (i, point) in enumerate(detour_dataset)
    constrain!(model_trace, "x$i", point.x)
    constrain!(model_trace, "y$i", point.y)
end
@generate!(agent_waypoint_model(), model_trace)

# show uniform proposals
attach(renderer, id(figure => 1))
for i=1:50
    @generate!(agent_waypoint_model(), model_trace)
    render(renderer, model_trace)
end

# show neural proposals
attach(renderer, id(figure => 2))
for i=1:50
    network_trace = ProgramTrace()
    for key in keys(inference_parameters)
        intervene!(network_trace, key, inference_parameters[key])
    end
    network_input = inference_input_constructor(model_trace)
    @generate!(waypoint_predictor_network(network_input...), network_trace)
            
    delete!(model_trace, "waypoint")
    constrain!(model_trace, "waypoint", Point(
        unscale_coordinate(network_trace["output-x"]),
        unscale_coordinate(network_trace["output-y"])))
    
    @generate!(agent_waypoint_model(), model_trace)
    render(renderer, model_trace)
end

The network is able to make better-than-random predictions of the waypoint location for many of the sampled datasets, including our particular `detour_dataset` of interest. We spent the up-front computational cost of training the network, but now that we have trained it, we can apply it to any data we encounter. The idea of amortizing the cost of inference across many problem instances is called **amortized inference**.

Let's use this trained neural network to speed up inference for our dataset. We use the network to make intelligent proposals for the waypoint within an importance sampling algorithm:

In [None]:
import Distributions 

@everywhere function agent_waypoint_model_neural_importance_sampling(trace::ProgramTrace, num_samples::Int,
                                                                     network_parameters::Dict{String,Any})
    # compute the input for this data
    network_input = inference_input_constructor(trace)

    # set parameters in the network
    network_trace = ProgramTrace()
    for key in keys(network_parameters)
        intervene!(network_trace, key, network_parameters[key])
    end
    propose!(network_trace, "output-x", Float64)
    propose!(network_trace, "output-y", Float64)

    traces = Vector{ProgramTrace}(num_samples)
    scores = Vector{Float64}(num_samples)
    for k=1:num_samples
        
        # fork a copy of the model trace
        t = deepcopy(trace)
        
        # predict the waypoint from the neural network
        (network_score, _) = @generate!(waypoint_predictor_network(network_input...), network_trace)
        delete!(t, "waypoint")
        waypoint = Point(
                unscale_coordinate(network_trace["output-x"]),
                unscale_coordinate(network_trace["output-y"]))
        constrain!(t, "waypoint", waypoint)
        
        # propose the rest of the random choices in the model by executing the model
        # program, conditioned on the predicted waypoint
        (model_score, _) = @generate!(agent_waypoint_model(), t)

        # the score is the model score minus the neural proposal score
        scores[k] = model_score - network_score
        traces[k] = t
    end
    weights = exp.(scores - logsumexp(scores))
    weights = weights / sum(weights)
    
    if !Distributions.isprobvec(weights)
        return ProgramTrace() # no sample produced
    end
    
    chosen = rand(Distributions.Categorical(weights))
    return traces[chosen]
end
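
The last few lines of the function above turn log importance weights into a normalized probability vector before resampling. The standalone sketch below shows one numerically stable way to do that. The name `normalize_log_weights` is hypothetical, and the sketch assumes that `logsumexp` computes the log of the sum of exponentials of its argument.

```julia
# Illustrative sketch only; `normalize_log_weights` is a hypothetical helper.
function normalize_log_weights(log_weights)
    m = maximum(log_weights)

    # if every log weight is -Inf, no sample has nonzero weight
    if m == -Inf
        return nothing
    end

    # subtracting the maximum before exponentiating avoids underflow to all zeros
    w = exp.(log_weights .- m)
    return w ./ sum(w)
end
```

When every score is `-Inf`, subtracting the log-sum-exp directly would give `NaN` weights. That is presumably the situation the `isprobvec` check in the function above guards against.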

In [None]:
num_samples_list = [1, 8, 64]
figure = Figure(num_rows=1, num_cols=length(num_samples_list),
                width=900, height=300, trace_width=100, trace_height=100,
                margin_top=20, titles=map((n)-> "SIR ($n samples)", num_samples_list))

here(figure)

In [None]:
CSS("""
    #$(id(figure)) .path.recorded { visibility: hidden; }
    #$(id(figure)) .path.constrained { visibility: visible; }
    #$(id(figure)) .path_segments { visibility: hidden; }
    #$(id(figure)) .destination { fill-opacity: 0.5; }
    #$(id(figure)) .waypoint { stroke-opacity: 0.5; }
""")

In [None]:
renderer = JupyterInlineRenderer("agent_model_renderer",  Dict("destination" => "overlay", "waypoint" => "overlay"))

trace = ProgramTrace()
constrain!(trace, "start", Point(10., 10.))
constrain!(trace, "use-waypoint", true)
for (i, point) in enumerate(detour_dataset)
    constrain!(trace, "x$i", point.x)
    constrain!(trace, "y$i", point.y)
end
   
num_approximate_samples = 50
for (i, num_samples) in enumerate(num_samples_list)
    attach(renderer, id(figure => i))
    for j=1:num_approximate_samples
        output_sample = agent_waypoint_model_neural_importance_sampling(trace, num_samples, inference_parameters)
        render(renderer, output_sample)
    end
end

Having spent the computational resources up front to compile the neural network to predict the waypoint, we are now able to obtain reasonable inferences with just 64 samples within importance sampling, compared with the 1024 samples the baseline needed without knowledge of the waypoint. Running the neural network adds negligible computational cost compared with running the probabilistic program.