Add solver summary info to diagnostics #208

Conversation

efernandez (Collaborator):

Diagnostics look as follows with this change:

  • When the solver is running (I've used a try_lock() so the diagnostics updater doesn't block while waiting for the optimization loop; see the sketch after these examples):
[  OK   ] /fuse:  fuse - Optimization converged
    timestamp:   2020-11-18 18:45:58.728069+01:00
    hardware_id: fuse
    - Sensor Models: wheel_odometry unicycle_2d_ignition laser_odometry laser_localization 
    - Motion Models: unicycle_2d 
    - Publishers: odometry_publisher 
    - Started: True
    - Pending Transactions: 0
    - Optimization: Running
    - Time Since Last Optimization Request [s]: 0.00599473
  • When the solver converges to a solution:
[  OK   ] /fuse:  fuse - Optimization converged
    timestamp:   2020-11-18 18:45:59.720956+01:00
    hardware_id: fuse
    - Sensor Models: wheel_odometry unicycle_2d_ignition laser_odometry laser_localization 
    - Motion Models: unicycle_2d 
    - Publishers: odometry_publisher
    - Started: True
    - Pending Transactions: 0
    - Optimization Termination Type: CONVERGENCE
    - Optimization Total Time [s]: 0.00631603
    - Optimization Iterations: 2
    - Initial Cost: 26.8172
    - Final Cost: 26.6269
    - Time Since Last Optimization Request [s]: 0.0287197
  • When the solver doesn't converge to a solution:
[ WARN  ] fuse: /fuse - Optimization didn't converged
    timestamp:   2020-11-18 18:46:55.795951+01:00
    hardware_id: fuse
    - Sensor Models: wheel_odometry unicycle_2d_ignition laser_odometry laser_localization 
    - Motion Models: unicycle_2d 
    - Publishers: odometry_publisher 
    - Started: True
    - Pending Transactions: 1
    - Optimization Termination Type: NO_CONVERGENCE
    - Optimization Total Time [s]: 0.0103187
    - Optimization Iterations: 6
    - Initial Cost: 74.8047
    - Final Cost: 8.12362
    - Time Since Last Optimization Request [s]: 0.03309
  • When the solver fails (it's never happened to me, so this is the expected output based on the code):
[ ERROR  ] fuse: /fuse - Optimization failed
    timestamp:   2020-11-18 18:46:55.795951+01:00
    hardware_id: fuse
    - Sensor Models: wheel_odometry unicycle_2d_ignition laser_odometry laser_localization 
    - Motion Models: unicycle_2d 
    - Publishers: odometry_publisher 
    - Started: True
    - Pending Transactions: 1
    - Optimization Termination Type: FAILURE
    - Optimization Total Time [s]: 0.0103187
    - Optimization Iterations: 6
    - Initial Cost: 74.8047
    - Final Cost: 8.12362
    - Time Since Last Optimization Request [s]: 0.03309
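
A minimal sketch of the try_lock() behaviour from the first bullet above. The member names (optimization_mutex_, summary_) and the exact status keys are illustrative, not necessarily what this PR uses:

void FixedLagSmoother::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& status)
{
  Optimizer::setDiagnostics(status);

  // Only report the solver summary if the optimization loop isn't holding the
  // mutex; otherwise report that the optimization is running, without blocking.
  std::unique_lock<std::mutex> lock(optimization_mutex_, std::try_to_lock);
  if (!lock)
  {
    status.add("Optimization", "Running");
    return;
  }

  status.add("Optimization Termination Type", ceres::TerminationTypeToString(summary_.termination_type));
  status.add("Optimization Total Time [s]", summary_.total_time_in_seconds);
  status.add("Optimization Iterations", summary_.iterations.size());
  status.add("Initial Cost", summary_.initial_cost);
  status.add("Final Cost", summary_.final_cost);
}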

We can also plot some of those values over time 😃:
[Screenshot: add-solver-summary-diagnostics, plot of the solver summary values over time, 2020-11-24 11-54-56]

This also changes the diagnostic updater name, making things cleaner in the children classes.

@efernandez efernandez self-assigned this Nov 24, 2020
@efernandez efernandez added the enhancement New feature or request label Nov 24, 2020
@@ -202,7 +202,8 @@ void FixedLagSmoother::optimizationLoop()
   // Update the graph
   graph_->update(*new_transaction);
   // Optimize the entire graph
-  graph_->optimize(params_.solver_options);
+  summary_ = graph_->optimize(params_.solver_options);
efernandez (Collaborator, Author):

I wonder if we should do something when the solver doesn't converge. I'd like to discuss the options and hear your thoughts; maybe we can address this in another PR.

One option is to check summary_.IsSolutionUsable(): http://ceres-solver.org/nnls_solving.html#_CPPv4NK5ceres6Solver7Summary16IsSolutionUsableEv

If it's not usable we could do one of the following:

  • Abort
  • Don't notify, but still do the rest, i.e. prepare the marginalization for the next cycle
  • Roll back the graph to the state before adding the transaction, but this isn't trivial

Note that IsSolutionUsable() also returns true if the solver doesn't converge but finishes because it reached the maximum time or iteration limit. In that case, the diagnostic warning should be enough, and we can simply continue as normal.
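
As a rough sketch of that check inside optimizationLoop(), assuming we went with the abort option (the log text and the exact reaction are placeholders, not what this PR implements):

summary_ = graph_->optimize(params_.solver_options);
if (!summary_.IsSolutionUsable())
{
  // FAILURE (e.g. a cost function produced NaNs): make this as loud as possible.
  ROS_FATAL_STREAM("Optimization failed to produce a usable solution:\n" << summary_.FullReport());
  throw std::runtime_error("Optimization failed");
}
// NO_CONVERGENCE (max time/iterations reached) still counts as usable, so it
// only triggers the diagnostic warning and we continue as normal.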

Contributor:

I agree we should be checking if the output is usable before using it. Though I'm not sure what should be done when it is not...which is probably why I didn't bother checking in the first place. 🙂

We could try "rolling back" the transaction. For the graph itself, it would not actually be that hard. We clone the graph anyway a few lines down (https://github.com/locusrobotics/fuse/blob/devel/fuse_optimizers/src/fixed_lag_smoother.cpp#L207). With a small modification, we could clone the graph before merging in the transaction, optimize the clone, and swap it in for the "real" graph if the optimization succeeds, or drop the clone if the optimization fails.
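
Something like this, as a sketch only (reusing graph_, new_transaction and params_ from the diff above, and assuming graph_ is held through the usual fuse_core::Graph::UniquePtr):

auto candidate = graph_->clone();                           // copy before merging the transaction
candidate->update(*new_transaction);                        // merge into the copy only
auto summary = candidate->optimize(params_.solver_options);
if (summary.IsSolutionUsable())
{
  graph_ = std::move(candidate);                            // swap the optimized copy in
}
else
{
  // Drop the copy; the "real" graph is untouched. But, as discussed below, the
  // motion and sensor models would still believe the transaction was accepted.
  ROS_WARN_STREAM("Dropping transaction, optimization failed:\n" << summary.BriefReport());
}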

However, that doesn't fix any of the supporting infrastructure. The motion models generally track which constraint segments have been sent to the optimizer, and only send new or modified constraints in the future. If the optimizer throws away a transaction without somehow telling the motion models about it, then the two could get out of sync. That would lead to missing constraints and potentially a disconnected graph and future failures.

It would definitely be possible to modify the motion model interface to require some sort of ACK from the optimizer for each motion model segment that was successfully incorporated...but that will be a big change.

And then there are things like visual landmark observations, where the sensor model may create special constraints when a visual landmark is first added to the system (e.g. a prior on the landmark position, or some sort of 3-view constraint to ensure the landmark position is solvable). If such a constraint was thrown out without also informing the sensor models, we again get out of sync and set ourselves up for future optimization failures.

Continuing on is basically what we do now. But based on earlier errors, once the graph cannot be optimized correctly, I'm not sure anything in the future will ever fix that. I suspect you will get a constant stream of failures moving forward.

I'm inclined to go with the "abort" option. Log a descriptive message of the failure and exit(), throw an exception, or similar. If the optimization fails, we should make this as noticeable as possible. This is likely caused by a configuration or usage error, and nothing we can do in code will fix it.

But I'm not sure how you feel about it. I can be convinced otherwise.

efernandez (Collaborator, Author) on Dec 1, 2020:

The sensor and motion models going out of sync if we roll the graph back to its state before the last transaction is a very good point. I admit I didn't think much about that, although I already anticipated that rolling back the graph wouldn't be trivial.

I agree aborting is the best (and simplest) thing we can do. In such a case, I could print the last transaction, blaming it.
Indeed, if the optimization failed it should be because of one or more constraints in the transaction, the initial values provided for the variables, or the implementation of the constraints' cost functions.
We would still have to narrow that down, but we should be able to do so afterwards if we record the graph and transactions.
Actually, my first thought was that we'd need to print the whole transaction, because the last transaction hasn't been published yet. But it sounds like we should still notify the publishers, so the bad graph and transaction do get published; then, in the optimization loop, we only need to print the transaction timestamp when aborting.

I've updated this PR with a commit that does that.

Does this sound like a good plan to you?

Contributor:

Just thinking about this from a "robot in the real world" stance, if the node in question goes down due to the exception or exit call, then even if the node respawns, won't it be completely lost? Just wondering if logging (loudly) and rolling back the graph is a good idea. Won't it be the case that, depending on how things are configured, the "bad" constraint will eventually not be included in the optimization? I guess I'm just wondering if crashing is better/worse than attempting to get back on track and yelling. Obviously the inability to tell the models which constraint/transaction is faulty makes this difficult.

efernandez (Collaborator, Author):

I thought about that, but I believe a failure here only happens if a cost function yields a bad result, like NaN. TBH I've not been able to test a failure mode because my current configuration just works. 😄

I think when a failure happens, the solver itself prints some diagnostic messages, and it usually crashes afterwards if NaNs pollute the state. This is speculation though. It'd be great if I could test a failure mode. Then, maybe what we can do is be more informative when it happens, in terms of the transaction and graph. Ideally, rolling back the graph would be great, but as already pointed out, that would be quite difficult to do at the moment.

Any suggestions for forcing a failure? Maybe adding a division by zero in a cost function, or something like that. Or just always returning false, so the cost function always fails to evaluate. 🤔

Contributor:

If this is just a one-off test with throwaway code, then maybe throw a quick static counter variable into one of the cost functions, and after it reaches N iterations, we do what you suggested and divide by 0 or something?
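
Something along these lines, as throwaway test code only (the functor, counter, and threshold are all made up for the experiment):

#include <atomic>

// A residual functor that deliberately starts failing after a fixed number of
// evaluations, so the solver should eventually report a FAILURE termination.
struct FailAfterNCost
{
  template <typename T>
  bool operator()(const T* const x, T* residual) const
  {
    static std::atomic<int> calls{0};
    if (++calls > 1000)
    {
      return false;  // tell Ceres the cost could not be evaluated
    }
    residual[0] = T(10.0) - x[0];
    return true;
  }
};

// Wrapped as usual, e.g. new ceres::AutoDiffCostFunction<FailAfterNCost, 1, 1>(new FailAfterNCost)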

Contributor:

Mind if I move this conversation to an "issue", so we can discuss outside of this PR?

efernandez (Collaborator, Author):

No, go ahead. I'll try to see what happens on failure soon, but I think this PR shouldn't wait on that. More elaborate error handling can be discussed in that issue and implemented in another PR.

Contributor:

Issue #213

@@ -495,20 +496,100 @@ void FixedLagSmoother::transactionCallback(
   }
 }
 
+void FixedLagSmoother::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& status)
efernandez (Collaborator, Author):

The diff is a bit weird here. I simply added some helper functions that are used below.

@@ -482,7 +482,7 @@ void Optimizer::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& stat
     return;
   }
 
-  status.summary(diagnostic_msgs::DiagnosticStatus::OK, "Optimizer");
+  status.summary(diagnostic_msgs::DiagnosticStatus::OK, "Optimization converged");
efernandez (Collaborator, Author) on Nov 24, 2020:

This message matches the one we get now under normal conditions. It's maybe not too elegant, but we can't overwrite it with mergeSummary.

Another option would be to change status.message directly in the children's setDiagnostics() methods.
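
For that second option, the child override could just assign the message after calling the base implementation, since DiagnosticStatusWrapper exposes the underlying diagnostic_msgs::DiagnosticStatus fields. A sketch, with a hypothetical condition:

void FixedLagSmoother::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& status)
{
  Optimizer::setDiagnostics(status);
  if (optimization_running_)  // hypothetical flag
  {
    status.message = "Optimization running";
  }
}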

@@ -99,7 +99,7 @@ Optimizer::Optimizer(
   private_node_handle_.createTimer(ros::Duration(diagnostic_updater_timer_period_),
                                    boost::bind(&diagnostic_updater::Updater::update, &diagnostic_updater_));
 
-  diagnostic_updater_.add("Optimizer", this, &Optimizer::setDiagnostics);
+  diagnostic_updater_.add(private_node_handle_.getNamespace(), this, &Optimizer::setDiagnostics);
efernandez (Collaborator, Author):

This changes the name and simplifies things by removing code from the children's setDiagnostics() methods; that's the part the diff doesn't show well.

svwilliams (Contributor) left a comment:

If you don't mind switching over to the unique_locks...

@efernandez efernandez force-pushed the add-solver-summary-diagnostics branch from df09ed0 to 4599f76 Compare December 1, 2020 08:05
@efernandez efernandez force-pushed the add-solver-summary-diagnostics branch 2 times, most recently from 6a245bf to ad9464d Compare December 1, 2020 08:37
This also changes the diagnostic updater name, making things cleaner in
the children classes.
@efernandez efernandez force-pushed the add-solver-summary-diagnostics branch from ad9464d to ec9649e Compare December 1, 2020 09:51
efernandez (Collaborator, Author):

Any idea why checks are still running for ec9649e?

svwilliams (Contributor):

Having strange buildfarm issues. I needed to trigger a rebuild.

@svwilliams svwilliams merged commit 86d111c into locusrobotics:devel Dec 1, 2020
@efernandez efernandez deleted the add-solver-summary-diagnostics branch December 10, 2020 07:10
Labels: enhancement (New feature or request)