Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a fix to have expected skipafterfailedadjoint behavior #45

Merged
merged 4 commits into from
Jun 4, 2020

Conversation

yqliaohk
Copy link
Contributor

Purpose

Added a minor fix to have an expected behavior if we set the skipafterfaildadjoint to False. Previously, when it's set to False, though the adjoints after a failed one get solved, they still get reset later. Changes have been made not to reset if we don't want to skip.

Type of change

What types of change is it?

  • Bugfix (non-breaking change which fixes an issue)

Testing

Explain the steps needed to test the new code to verify that it does indeed address the issue and produce the expected behavior.

Checklist

Put an x in the boxes that apply.

  • I have run unit and regression tests which pass locally with my changes
  • I have added new tests that prove my fix is effective or that my feature works
  • I have added necessary documentation

@yqliaohk yqliaohk requested a review from a team as a code owner May 26, 2020 21:44
@yqliaohk yqliaohk requested review from joanibal and anilyil May 26, 2020 21:44
Copy link
Collaborator

@joanibal joanibal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by the root cause of the problem. Could you help me understand. Please see that in-code comments.

python/pyADflow.py Show resolved Hide resolved
python/pyADflow.py Show resolved Hide resolved
@anilyil
Copy link
Contributor

anilyil commented Jun 3, 2020

I think this looks good for now. @joanibal , I think what was happening is this: If the current adjoint solution cannot reach the tolerance in the specified number of iterations, the solution is reset to 0 regardless of what is set for 'skipafterfailedadjoint'. However, for many cases, adjoint simply stalls but the solution does provide a useful result for the gradient. Therefore, we would like to use this value for the gradient computation and keep going. This PR addresses this issue.

ADflow currently does not give you any warning or raise fail flags when adjoints fail to converge. This is the worst possible outcome in my opinion since it just keep going with completely wrong gradients, which can be difficult to debug if you are not expecting this.

Like I said, this PR as it stands is okay for me. However, I want to add a few warning outputs and update the adjoint printouts a bit so that it becomes easier to diagnose what is happening. I can do these changes in a separate PR, or push to @yqliaohk 's branch and update this PR. What do you all prefer? New PR, or update this PR?

@joanibal
Copy link
Collaborator

joanibal commented Jun 3, 2020

Perhaps I misunderstood, but based on the comments in the code I understood the option skipafterfailedadjoint to be used to prevent additional adjoints of an aeroproblem from being solved after one has failed to save computational time.

See the comment before the option is used
# Check for any previous adjoint failure. If any adjoint has failed
# on this AP, there is no point in solving the reset, so continue
# with psi set as zero

I agree that it would be useful to have an option to allow the use of partially converged adjoints, but I think we should just add another option to do that. Perhaps something like usepartiallyconvergedadjoint

then the option could be checked when checking for a fail. like this ...
if self.adflow.killsignals.adjointfailed and not self.adflow.options['usepartiallyconvergedadjoint]:

@anilyil
Copy link
Contributor

anilyil commented Jun 3, 2020

I think this option name is good as is. It tells you only about what it will do with the remaining adjoints. I think the behavior after this PR will be the expected behavior. Then we can discuss adding a new option that picks wether to use or not use the partially converged adjoint. I cannot think of many scenarios where a partially converged adjoint would not be useful, and I think most people would want to keep going with the partially converged answer.

We can add an option that switches between using the partially converged adjoints, or raises an analysis fail flag. I am not sure if we have the capability to tell snopt that gradient evaluation failed. if we can, then that may be useful for some cases.

But again, I think this PR fixes the issue with this option and would give the expected behavior after this is merged.

@joanibal
Copy link
Collaborator

joanibal commented Jun 3, 2020

Sorry, but I guess I'm still confused about how the current behavior for skipafterfailedadjoint is wrong/unexpected. It skips the the other adjoint solves after a failed solve.

@anilyil
Copy link
Contributor

anilyil commented Jun 3, 2020

Before this PR, if skipafterfailedadjoint is set to False, if an adjoint solution partially converges, adflow still resets the solution vector, instead of using the partially converged solution. This results in wrong gradients for the function with partially converged adjoint.

@joanibal
Copy link
Collaborator

joanibal commented Jun 3, 2020

Right, I get that.

But, based on my understanding, the option was never intended to address that issue. It looks like it should only be used to skip other adjoint not to specify what happens to partially converged solutions.

@anilyil
Copy link
Contributor

anilyil commented Jun 3, 2020

right, it does not specify what should be done with the partially converged solutions. by default, we want to use partially converged solutions. hence the PR. If you can think of a case where a partially converged solution is not useful, then we can create another PR with that option.

This partially converged solution issue comes up very frequently with meshes where we have zippers. After some convergence, the solution bottoms out at the same place, so it is common to rely on the partially converged solution. The stall happens after 6-7 orders of magnitude convergence relatively (depending on where you start with the initial residual), so its not too bad.

@joanibal
Copy link
Collaborator

joanibal commented Jun 3, 2020

"by default, we want to use partially converged solutions. hence the PR. If you can think of a case where a partially converged solution is not useful, then we can create another PR with that option."

Ok then why not just change the behavior after an adjoint fails from
self.curAP.adflowData.adjoints[objective][:] = 0.0
to
self.curAP.adflowData.adjoints[objective][:] = psi

Why involve the skipafterAdjointFail option at all?

@joanibal
Copy link
Collaborator

joanibal commented Jun 3, 2020

I'm going to request some more reviewer for additional perspectives.

@yqliaohk
Copy link
Contributor Author

yqliaohk commented Jun 3, 2020

Ok, I see how it can be confusing. From the option name, we expect to skip after any failed adjoint if we set it to True. So I guess I was expecting that if I set it to false, the adjoint won't be skipped and it can still be used and get a descent gradient evaluation with partially converged adjoint. I agree it's not clearly reflected in the name or docstring. Maybe I can improve the docstring or use a different option to control the behavior? But I'm not sure about if we should by default always use partially converged adjoint or not.

@anilyil
Copy link
Contributor

anilyil commented Jun 3, 2020

We may want to zero-out the adjoint solution if the adjoint solution failed in some cases. For example, if you have skipafterfailedadjoint set to True, and your adjoints normally converge, except for a single point, where it fails (may be due to stall, or may be some other reason due to the local state). Then, you want the adjoint vector to be cleared because if you just keep this, it may cause a problem in the next iteration. This is because we start the next adjoint solution with the solution from the previous iteration.

So currently, the main thing I think we need is a fail flag when you have a partially converged adjoint and you have skipafterfailedadjoint set to True, so that SNOPT just takes another step and tries again. For this scenario, resetting the adjoint solution clears out possible errors in the adjoint vector. When the state changes considerably, restarting the adjoint does not have a huge benefit anyways.

I think I agree with you on that skipafterfailedadjoint option is kinda useless, and maybe we should switch to usepartiallyconvergedadjoint like you suggested. But I think we should do this on a separate PR. Like I said, I want to add some warnings with this and improve the adjoint solution output a bit.

I suggest we merge this PR as is, and then I can create another PR where I add the changes I propose.

@joanibal
Copy link
Collaborator

joanibal commented Jun 3, 2020

"I think I agree with you on that skipafterfailedadjoint option is kinda useless"
I'm not trying to say that this option is useless, I just think it wasn't intended to solve your problem.

The idea of merging incomplete features to the master makes me nervous. I'd prefer you submitted your changes as a PR on @yqliaohk's fork, which when accepted will automatically appear here.
But, I think the admin's should have the final say.

@nwu63 or @eirikurj how do you think we should handle this?

@ewu63
Copy link
Collaborator

ewu63 commented Jun 3, 2020

@anilyil and I found the function checkAdjointSolutionFailure here, so that should be used along with checkSolutionFailure after calling ADflow to solve the adjoint.

To summarize, this is what I think the behaviour should be:

  • skipafterfailedadjoint=True: if one adjoint failed, we want to skip solving more adjoints. We will also do a hard reset to the adjoint solution and send a fail flag to pyOptSparse.
  • skipafterfailedadjoint=False: assume the adjoint solution was close and still useful even though it did not meet the prescribed tolerance. Pretend it did not fail, and continue as normal. We do not reset the adjoint solution but I think we should still send a fail flag, otherwise there is no way to let the user know about this. It should be up to the user to determine what to do with this information, knowing ADflow will just chug along regardless of if it's garbage or not. With a call to checkAdjointSolutionFailure, they can then manually override the value in funcsSens if the want to.

We need to check and figure out what is passed to checkAdjointSolutionFailure first, for both cases

In the future, I would like to see a secondary tolerance, both for primal and adjoint, similar to the semi-converged option for aerostructural problems. That way, we aim for the full convergence, but we will set fail=False if it reached say 2 orders above the target. This would be for a future PR.

As for the name, I agree the current name is not perfect, and usepartiallyconvergedadjoint is a better name. Again, this would be for a future PR, since this PR is just for fixing a specific bug. We should also properly provide deprecation warnings, and retain deprecated options for one minor release before removing them.

As for the default option, I would want to keep it as True. IMO, as a user without extensive knowledge of ADflow, I think I would expect failure to be presented as failure, regardless of how close it was to converging. It is up to the user to set an appropriate tolerance for the particular problem. Issues with zipper meshes and so on are for advanced users who should then use options such as this to fix those issues.

To summarize the summary:

  • Check the fail flag to make sure it is behaving as we expect
  • If so, I am happy with the PR as is
  • I do not think we should change the default value of the option
  • I do think we should change the name of the option, but that should be done properly via a second PR

@joanibal joanibal requested review from joanibal and removed request for eirikurj, sseraj and Xiaosong2105 June 3, 2020 22:36
@joanibal
Copy link
Collaborator

joanibal commented Jun 3, 2020

Thanks @nwu63.

I interpreted the skipafteradjointfail in a different way, so i didn't see the current behavior as unexpected. But seeing that I'm in the minority here it was obviously unexpected to most.

I no longer oppose pulling in this PR after checking checkAdjointSolutionFailure as Neil said. As a result, I'm going to remove myself as reviewer and let @nwu63 make the merge after the appropriate checking.

@joanibal joanibal removed their request for review June 3, 2020 22:47
@ewu63
Copy link
Collaborator

ewu63 commented Jun 3, 2020

After further discussions with @anilyil, this is what we decided:

  1. Check that the checkAdjointSolutionFailure works as expected for both cases, i.e. regardless of skipafterfailedadjoint value, if an adjoint did not meet the specified tolerance, a fail flag should be passed.
  2. Do not change the default, but add a warning whenever adjoints are skipped. @anilyil will do this and push to the PR branch.
  3. Delay changing the option name, since it describes its purpose sufficiently. However, it is not explained in the ADflow docs, so add it into the options table. This should be done now rather than waiting for Document and update adflow options #43. @anilyil will also do this.

@anilyil
Copy link
Contributor

anilyil commented Jun 4, 2020

I have pushed a few changes (and possibly messed up while merging the changes from master but we can fix that). I added the warnings when the skipafterfailedadjoint has an effect on the adjoint solution, regardless of what the option is set to.

I also realized once the adjointFailed gets set to True, it is never set back to Fasle, so SNOPT would not be able to backtrack. However, we want this to be reset once the design changes. I am not sure the best approach here. For now, I just added this to the evalFunctionsSens call, but any feedback is welcome.

Finally, I added the option to the options list. I just added it to the bottom, but we can reorder it. I may have messed up the whitespace due to my editor settings.

@nwu63 to address the items you listed:

  1. checkAdjointFailure works as expected for both cases. A fail flag appears if any of the adjoints didn't converge, regardless of the skipafterfailedadjoint option.
  2. Did not change the default, and added the warning.
  3. Added the explanation to the docs.

@ewu63 ewu63 merged commit 0013334 into mdolab:master Jun 4, 2020
Copy link
Contributor

@eirikurj eirikurj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I seem to be a little late for the discussion here. For what its worth though, I do agree that partial solution may be useful in some cases and the previous discussion here is good.
However, I would have liked @anilyil changes to be reviewed/approved by reviewers before being merged in.

# a fail flag remaining from a previous call. Either way, we
# want to at least try solving the adjoints here, regardless
# of what happened in the previous iterations.
self.curAP.adjointFailed = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For resetting between design iterations this location is probably the best.
However, this is explicitly creating and setting this attribute. This renders the following block useless in solveAdjoint as the option always exists.

# Initialize the fail flag in this AP if it doesn't exist
if not hasattr(self.curAP, 'adjointFailed'):
    self.curAP.adjointFailed = False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I know that, but my idea was to just keep that for the cases where solveadjoint is called from some other routine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should remove that alltogether and have it throw an attribute error, since you should not go into that part of the code without the attribute set properly anyways.

@ewu63
Copy link
Collaborator

ewu63 commented Jun 4, 2020

Yeah I apologize, I tried to update the PR with master, but GitHub complained it wasn't possible, and I accidentally clicked the next button before I could read it. Next thing I knew the PR has been merged.. Sorry about that.

Please continue discussions here, and if there are any changes to be made, let's make a new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants