Save checkpoint if train_steps is smaller than batcher's steps_per_epoch #2298

Merged
merged 1 commit into from
Jul 21, 2022

Contributor

@dantreiman dantreiman commented Jul 21, 2022

If you train for a total number of steps smaller than steps_per_epoch, the model weights are never saved and Ludwig throws an exception when it tries to load them at the end of training.
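To make the triggering condition concrete, here is a minimal sketch. It assumes steps_per_epoch is the dataset size divided by the batch size, rounded up; the function names and the example sizes are hypothetical and not taken from Ludwig.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Batches needed to see every example once (assumed definition)."""
    return math.ceil(num_examples / batch_size)


def ends_before_first_epoch(train_steps: int, num_examples: int, batch_size: int) -> bool:
    """True when training stops before a full epoch has completed."""
    return train_steps < steps_per_epoch(num_examples, batch_size)


# Hypothetical sizes: 10,000 examples with batch size 128 gives 79 steps per epoch,
# so a run with train_steps set to 50 stops before the first epoch ends.
print(steps_per_epoch(10_000, 128))
print(ends_before_first_epoch(50, 10_000, 128))
```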

This fix is useful for quick tests that train for a small number of steps just to generate a saved model, e.g.

trainer:
  train_steps: 50

Output:

Training: 100%|██████████| 50/50 [02:35<00:00,  3.11s/it]
Traceback (most recent call last):
  File "/Users/daniel/mambaforge/envs/ludwig39-dev/bin/ludwig", line 33, in <module>
    sys.exit(load_entry_point('ludwig', 'console_scripts', 'ludwig')())
  File "/Users/daniel/Desktop/github/dantreiman-ludwig/ludwig/cli.py", line 166, in main
    CLI()
  File "/Users/daniel/Desktop/github/dantreiman-ludwig/ludwig/cli.py", line 66, in __init__
    getattr(self, args.command)()
  File "/Users/daniel/Desktop/github/dantreiman-ludwig/ludwig/cli.py", line 71, in train
    train.cli(sys.argv[2:])
  File "/Users/daniel/Desktop/github/dantreiman-ludwig/ludwig/train.py", line 387, in cli
    train_cli(**vars(args))
  File "/Users/daniel/Desktop/github/dantreiman-ludwig/ludwig/train.py", line 181, in train_cli
    model.train(
  File "/Users/daniel/Desktop/github/dantreiman-ludwig/ludwig/api.py", line 552, in train
    train_stats = trainer.train(
  File "/Users/daniel/Desktop/github/dantreiman-ludwig/ludwig/trainers/trainer.py", line 862, in train
    self.model.load(save_path)
  File "/Users/daniel/Desktop/github/dantreiman-ludwig/ludwig/models/ecd.py", line 160, in load
    self.load_state_dict(torch.load(weights_save_path, map_location=device))
  File "/Users/daniel/mambaforge/envs/ludwig39-dev/lib/python3.9/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/Users/daniel/mambaforge/envs/ludwig39-dev/lib/python3.9/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/Users/daniel/mambaforge/envs/ludwig39-dev/lib/python3.9/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/Users/daniel/Desktop/github/dantreiman-ludwig/examples/rotten_tomatoes/results/experiment_run/model/model_weights'
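
The traceback is consistent with a training loop that writes weights only at epoch boundaries and then reloads them once training finishes. The sketch below reproduces that pattern with a toy loop and shows one way to close the gap: save once more if no checkpoint was written. It is an illustration only, not the code from this PR; `run_training`, `save_weights`, and the `save_if_missing` flag are hypothetical names.

```python
import tempfile
from pathlib import Path


def save_weights(path: Path, step: int) -> None:
    """Stand-in for writing model weights to disk."""
    path.write_text(f"weights at step {step}")


def run_training(train_steps: int, steps_per_epoch: int, save_path: Path,
                 save_if_missing: bool) -> str:
    """Toy loop that checkpoints at epoch boundaries, then reloads the weights."""
    saved = False
    for step in range(1, train_steps + 1):
        # ... one optimizer step would happen here ...
        if step % steps_per_epoch == 0:
            save_weights(save_path, step)
            saved = True
    # Without this guard, a run shorter than one epoch never writes a file.
    if save_if_missing and not saved:
        save_weights(save_path, train_steps)
    # Reload after training; raises FileNotFoundError if nothing was saved.
    return save_path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "model_weights"
    try:
        run_training(50, 79, weights, save_if_missing=False)
    except FileNotFoundError:
        print("no checkpoint written")
    print(run_training(50, 79, weights, save_if_missing=True))
```

With the guard disabled, the 50-step run hits the same kind of FileNotFoundError as above; with it enabled, the final weights are written before the reload.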

Contributor

@abidwael abidwael left a comment

LGTM once checks pass

@github-actions

Unit Test Results

4 files  -2    4 suites  -2    1h 47m 48s ⏱️  -1h 4m 14s
2 936 tests  ±0    2 887 ✔️  ±0    49 💤  ±0    0  ±0
5 872 runs  -2 936    5 754 ✔️  -2 871    118 💤  -65    0  ±0

Results for commit be107b3. ± Comparison against base commit f02e92c.

@dantreiman dantreiman merged commit 7f39a3c into ludwig-ai:master Jul 21, 2022
@dantreiman dantreiman deleted the daniel/train_steps_save branch July 21, 2022 22:23