[s2s test_finetune_trainer] failing multigpu test #8400

Merged 1 commit on Nov 8, 2020

Commits on Nov 8, 2020

  1. [s2s test_finetune_trainer] failing test

    Sam,
    ```
    RUN_SLOW=1 pytest examples/seq2seq/test_finetune_trainer.py::TestFinetuneTrainer::test_finetune_trainer_slow
    ```
    fails for me: the model is not learning anything.
    ```
    >       assert first_step_stats["eval_bleu"] < last_step_stats["eval_bleu"]  # model learned nothing
    E       AssertionError: assert 0.0 < 0.0
    ```
    Looking at the logs, the model gains some knowledge in the first half of the epochs, and then the score drops back to 0.00 in the last ones.
    
    Changing the learning rate to 3e-3 (this PR) seems to make it more stable, but it could be card-specific: this is on an RTX 3090.
    
    Alternatively, instead of comparing only the first and last metrics, perhaps the test should use something more flexible?
    
    Either way, it feels too dependent on the card/config. Perhaps a long-term way to make it more resilient is to feed it more than 8 records.
    
    @sshleifer
    stas00 committed Nov 8, 2020
    a59f217
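One way the "something more flexible" idea from the commit message might look is sketched below. This is not the check used in the test suite: `model_learned` and `eval_history` are hypothetical names, and the sketch assumes the per-evaluation metrics are available as a list of dicts that each contain `eval_bleu`. Instead of requiring the last score to beat the first, it only requires that some later evaluation does.

```python
from typing import Dict, List


def model_learned(eval_history: List[Dict[str, float]], key: str = "eval_bleu") -> bool:
    """Return True if any evaluation after the first one beats the first on `key`.

    Hypothetical helper: assumes each entry in `eval_history` holds the metrics
    of one evaluation step, in chronological order.
    """
    if len(eval_history) < 2:
        # Nothing to compare against, so we cannot claim the model learned.
        return False
    first = eval_history[0][key]
    best_later = max(stats[key] for stats in eval_history[1:])
    return best_later > first


# Made-up numbers for illustration only (not taken from the real logs):
# the score rises mid-run and then collapses back to 0.0.
history = [
    {"eval_bleu": 0.0},
    {"eval_bleu": 0.4},
    {"eval_bleu": 1.2},
    {"eval_bleu": 0.0},
]
assert model_learned(history)
```

A check like this would pass on a run that learns and then collapses, which may or may not be desirable: it makes the test less sensitive to the card/config, but it also stops the test from catching late-epoch instability.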