Add solver summary info to diagnostics #208

Conversation

efernandez (Collaborator):

Diagnostics look as follows with this change:

  • When the solver is running (I've used a try_lock() so the diagnostics updater doesn't block while waiting for the optimization loop; see the sketch after these examples):
[  OK   ] /fuse:  fuse - Optimization converged
    timestamp:   2020-11-18 18:45:58.728069+01:00
    hardware_id: fuse
    - Sensor Models: wheel_odometry unicycle_2d_ignition laser_odometry laser_localization 
    - Motion Models: unicycle_2d 
    - Publishers: odometry_publisher 
    - Started: True
    - Pending Transactions: 0
    - Optimization: Running
    - Time Since Last Optimization Request [s]: 0.00599473
  • When the solver converges to a solution:
[  OK   ] /fuse:  fuse - Optimization converged
    timestamp:   2020-11-18 18:45:59.720956+01:00
    hardware_id: fuse
    - Sensor Models: wheel_odometry unicycle_2d_ignition laser_odometry laser_localization 
    - Motion Models: unicycle_2d 
    - Publishers: odometry_publisher
    - Started: True
    - Pending Transactions: 0
    - Optimization Termination Type: CONVERGENCE
    - Optimization Total Time [s]: 0.00631603
    - Optimization Iterations: 2
    - Initial Cost: 26.8172
    - Final Cost: 26.6269
    - Time Since Last Optimization Request [s]: 0.0287197
  • When the solver doesn't converge to a solution:
[ WARN  ] fuse: /fuse - Optimization didn't converged
    timestamp:   2020-11-18 18:46:55.795951+01:00
    hardware_id: fuse
    - Sensor Models: wheel_odometry unicycle_2d_ignition laser_odometry laser_localization 
    - Motion Models: unicycle_2d 
    - Publishers: odometry_publisher 
    - Started: True
    - Pending Transactions: 1
    - Optimization Termination Type: NO_CONVERGENCE
    - Optimization Total Time [s]: 0.0103187
    - Optimization Iterations: 6
    - Initial Cost: 74.8047
    - Final Cost: 8.12362
    - Time Since Last Optimization Request [s]: 0.03309
  • When the solver fails (it's never happened to me, so this is the expected output based on the code):
[ ERROR  ] fuse: /fuse - Optimization failed
    timestamp:   2020-11-18 18:46:55.795951+01:00
    hardware_id: fuse
    - Sensor Models: wheel_odometry unicycle_2d_ignition laser_odometry laser_localization 
    - Motion Models: unicycle_2d 
    - Publishers: odometry_publisher 
    - Started: True
    - Pending Transactions: 1
    - Optimization Termination Type: FAILURE
    - Optimization Total Time [s]: 0.0103187
    - Optimization Iterations: 6
    - Initial Cost: 74.8047
    - Final Cost: 8.12362
    - Time Since Last Optimization Request [s]: 0.03309
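
A minimal sketch of the try_lock() behaviour from the first bullet above. The member names (optimization_mutex_, summary_) and the exact status keys are illustrative, not necessarily what this PR uses:

void FixedLagSmoother::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& status)
{
  Optimizer::setDiagnostics(status);

  // Only report the solver summary if the optimization loop isn't holding the
  // mutex; otherwise report that the optimization is running, without blocking.
  std::unique_lock<std::mutex> lock(optimization_mutex_, std::try_to_lock);
  if (!lock)
  {
    status.add("Optimization", "Running");
    return;
  }

  status.add("Optimization Termination Type", ceres::TerminationTypeToString(summary_.termination_type));
  status.add("Optimization Total Time [s]", summary_.total_time_in_seconds);
  status.add("Optimization Iterations", summary_.iterations.size());
  status.add("Initial Cost", summary_.initial_cost);
  status.add("Final Cost", summary_.final_cost);
}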

We can also plot some of those values over time 😃:
[Screenshot: add-solver-summary-diagnostics, plot of the solver summary values over time, 2020-11-24 11-54-56]

This also changes the diagnostic updater name, making things cleaner in the children classes.

@efernandez efernandez self-assigned this Nov 24, 2020
@efernandez efernandez added the enhancement New feature or request label Nov 24, 2020
@@ -202,7 +202,8 @@ void FixedLagSmoother::optimizationLoop()
   // Update the graph
   graph_->update(*new_transaction);
   // Optimize the entire graph
-  graph_->optimize(params_.solver_options);
+  summary_ = graph_->optimize(params_.solver_options);
efernandez (Collaborator, Author):

I wonder if we should do something when the solver doesn't converge. I'd like to discuss the options and hear your thoughts; maybe we can address this in another PR.

One option is to check summary_.IsSolutionUsable(): http://ceres-solver.org/nnls_solving.html#_CPPv4NK5ceres6Solver7Summary16IsSolutionUsableEv

If it's not usable we could do one of the following:

  • Abort
  • Don't notify, but still do the rest, i.e. prepare the marginalization for the next cycle
  • Roll back the graph to the state before adding the transaction, but this isn't trivial

Note that IsSolutionUsable() also returns true if the solver doesn't converge but finishes because it reached the maximum time or iteration limit. In that case, the diagnostic warning should be enough, and we can simply continue as normal.
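
As a rough sketch of that check inside optimizationLoop(), assuming we went with the abort option (the log text and the exact reaction are placeholders, not what this PR implements):

summary_ = graph_->optimize(params_.solver_options);
if (!summary_.IsSolutionUsable())
{
  // FAILURE (e.g. a cost function produced NaNs): make this as loud as possible.
  ROS_FATAL_STREAM("Optimization failed to produce a usable solution:\n" << summary_.FullReport());
  throw std::runtime_error("Optimization failed");
}
// NO_CONVERGENCE (max time/iterations reached) still counts as usable, so it
// only triggers the diagnostic warning and we continue as normal.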

Contributor:

I agree we should be checking if the output is usable before using it. Though I'm not sure what should be done when it is not...which is probably why I didn't bother checking in the first place. 🙂

We could try "rolling back" the transaction. For the graph itself, it would not actually be that hard. We clone the graph anyway a few lines down (https://github.com/locusrobotics/fuse/blob/devel/fuse_optimizers/src/fixed_lag_smoother.cpp#L207). With a small modification, we could clone the graph before merging in the transaction, optimize the clone, and swap it in for the "real" graph if the optimization succeeds, or drop the clone if the optimization fails.
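
Something like this, as a sketch only (reusing graph_, new_transaction and params_ from the diff above, and assuming graph_ is held through the usual fuse_core::Graph::UniquePtr):

auto candidate = graph_->clone();                           // copy before merging the transaction
candidate->update(*new_transaction);                        // merge into the copy only
auto summary = candidate->optimize(params_.solver_options);
if (summary.IsSolutionUsable())
{
  graph_ = std::move(candidate);                            // swap the optimized copy in
}
else
{
  // Drop the copy; the "real" graph is untouched. But, as discussed below, the
  // motion and sensor models would still believe the transaction was accepted.
  ROS_WARN_STREAM("Dropping transaction, optimization failed:\n" << summary.BriefReport());
}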

However, that doesn't fix any of the supporting infrastructure. The motion models generally track which constraint segments have been sent to the optimizer, and only send new or modified constraints in the future. If the optimizer throws away a transaction without somehow telling the motion models about it, then the two could get out of sync. That would lead to missing constraints and potentially a disconnected graph and future failures.

It would definitely be possible to modify the motion model interface to require some sort of ACK from the optimizer for each motion model segment that was successfully incorporated...but that will be a big change.

And then there are things like visual landmark observations, where the sensor model may create special constraints when a visual landmark is first added to the system (e.g. a prior on the landmark position, or some sort of 3-view constraint to ensure the landmark position is solvable). If such a constraint was thrown out without also informing the sensor models, we again get out of sync and set ourselves up for future optimization failures.

Continuing on is basically what we do now. But based on earlier errors, once the graph cannot be optimized correctly, I'm not sure anything in the future will ever fix that. I suspect you will get a constant stream of failures moving forward.

I'm inclined to go with the "abort" option. Log a descriptive message of the failure and exit(), throw an exception, or similar. If the optimization fails, we should make this as noticeable as possible. This is likely caused by a configuration or usage error, and nothing we can do in code will fix it.

But I'm not sure how you feel about it. I can be convinced otherwise.

efernandez (Collaborator, Author) on Dec 1, 2020:

The sensor and motion models going out of sync if we roll the graph back to its state before the last transaction is a very good point. I admit I didn't think much about that, although I already anticipated that rolling back the graph wouldn't be trivial.

I agree aborting is the best (and simplest) thing we can do. In such a case, I could print the last transaction, blaming it.
Indeed, if the optimization failed it should be because of one or more constraints in the transaction, the initial values provided for the variables, or the implementation of the constraints' cost functions.
We would still have to narrow that down, but we should be able to do so afterwards if we record the graph and transactions.
Actually, my first thought was that we'd need to print the whole transaction, because the last transaction hasn't been published yet. But it sounds like we should still notify the publishers, so the bad graph and transaction do get published; then, in the optimization loop, we only need to print the transaction timestamp when aborting.

I've updated this PR with a commit that does that.

Does this sound like a good plan to you?

Contributor:

Just thinking about this from a "robot in the real world" stance, if the node in question goes down due to the exception or exit call, then even if the node respawns, won't it be completely lost? Just wondering if logging (loudly) and rolling back the graph is a good idea. Won't it be the case that, depending on how things are configured, the "bad" constraint will eventually not be included in the optimization? I guess I'm just wondering if crashing is better/worse than attempting to get back on track and yelling. Obviously the inability to tell the models which constraint/transaction is faulty makes this difficult.

efernandez (Collaborator, Author):

I thought about that, but I believe a failure here only happens if a cost function yields a bad result, like NaN. TBH I've not been able to test a failure mode because my current configuration just works. 😄

I think when a failure happens, the solver itself prints some diagnostic messages, and it usually crashes afterwards if NaNs pollute the state. This is speculation though. It'd be great if I could test a failure mode. Then, maybe what we can do is be more informative when it happens, in terms of the transaction and graph. Ideally, rolling back the graph would be great, but as already pointed out, that would be quite difficult to do at the moment.

Any suggestions for forcing a failure? Maybe adding a division by zero in a cost function, or something like that. Or just always returning false, so the cost function always fails to evaluate. 🤔

Contributor:

If this is just a one-off test with throwaway code, then maybe throw a quick static counter variable into one of the cost functions, and after it reaches N iterations, we do what you suggested and divide by 0 or something?
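
Something along these lines, as throwaway test code only (the functor, counter, and threshold are all made up for the experiment):

#include <atomic>

// A residual functor that deliberately starts failing after a fixed number of
// evaluations, so the solver should eventually report a FAILURE termination.
struct FailAfterNCost
{
  template <typename T>
  bool operator()(const T* const x, T* residual) const
  {
    static std::atomic<int> calls{0};
    if (++calls > 1000)
    {
      return false;  // tell Ceres the cost could not be evaluated
    }
    residual[0] = T(10.0) - x[0];
    return true;
  }
};

// Wrapped as usual, e.g. new ceres::AutoDiffCostFunction<FailAfterNCost, 1, 1>(new FailAfterNCost)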

Contributor:

Mind if I move this conversation to an "issue", so we can discuss outside of this PR?

efernandez (Collaborator, Author):

No, go ahead. I'll try to see what happens on failure soon, but I think this PR shouldn't wait on that. More elaborate error handling can be discussed in that issue and implemented in another PR.

Contributor:

Issue #213

@@ -495,20 +496,100 @@ void FixedLagSmoother::transactionCallback(
   }
 }
 
+void FixedLagSmoother::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& status)
efernandez (Collaborator, Author):

The diff is a bit weird here. I simply added some helper functions that are used below.

@@ -482,7 +482,7 @@ void Optimizer::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& stat
     return;
   }
 
-  status.summary(diagnostic_msgs::DiagnosticStatus::OK, "Optimizer");
+  status.summary(diagnostic_msgs::DiagnosticStatus::OK, "Optimization converged");
efernandez (Collaborator, Author) on Nov 24, 2020:

This message matches the one we get now under normal conditions. It's maybe not too elegant, but we can't overwrite it with mergeSummary.

Another option would be to change status.message directly in the children's setDiagnostics() methods.
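
For that second option, the child override could just assign the message after calling the base implementation, since DiagnosticStatusWrapper exposes the underlying diagnostic_msgs::DiagnosticStatus fields. A sketch, with a hypothetical condition:

void FixedLagSmoother::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& status)
{
  Optimizer::setDiagnostics(status);
  if (optimization_running_)  // hypothetical flag
  {
    status.message = "Optimization running";
  }
}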

@@ -99,7 +99,7 @@ Optimizer::Optimizer(
   private_node_handle_.createTimer(ros::Duration(diagnostic_updater_timer_period_),
                                    boost::bind(&diagnostic_updater::Updater::update, &diagnostic_updater_));
 
-  diagnostic_updater_.add("Optimizer", this, &Optimizer::setDiagnostics);
+  diagnostic_updater_.add(private_node_handle_.getNamespace(), this, &Optimizer::setDiagnostics);
efernandez (Collaborator, Author):

This changes the name and simplifies things by removing code from the children's setDiagnostics() methods; that's the part the diff doesn't show well.

svwilliams (Contributor) left a comment:

If you don't mind switching over to the unique_locks...

@efernandez efernandez force-pushed the add-solver-summary-diagnostics branch from df09ed0 to 4599f76 Compare December 1, 2020 08:05
@efernandez efernandez force-pushed the add-solver-summary-diagnostics branch 2 times, most recently from 6a245bf to ad9464d Compare December 1, 2020 08:37
This also changes the diagnostic updater name, making things cleaner in
the children classes.
@efernandez efernandez force-pushed the add-solver-summary-diagnostics branch from ad9464d to ec9649e Compare December 1, 2020 09:51
efernandez (Collaborator, Author):

Any idea why checks are still running for ec9649e?

svwilliams (Contributor):

Having strange buildfarm issues. I needed to trigger a rebuild.

@svwilliams svwilliams merged commit 86d111c into locusrobotics:devel Dec 1, 2020
@efernandez efernandez deleted the add-solver-summary-diagnostics branch December 10, 2020 07:10
Labels: enhancement (New feature or request)