Now that you have the core elements for a robust computer vision training environment, how can you use the `TrainHandGestureClassifier` flow to iteratively find the best models for your use case?

For the rest of this tutorial, we will demonstrate two important elements of iterative model development: checkpointing model state and tracking experiment results.

To follow along with this page, you can access this [Jupyter notebook](https://github.com/outerbounds/tutorials/tree/main/cv-2/cv-S2E5).

Checkpointing in model development means saving the state of a model so that you can resume from it later. This way you can make sure you do not lose results, such as your trained model. It also ensures you have a process to load an already trained model in future training and production scenarios, while avoiding duplication of costly computation.

In the PyTorch example used in the `TrainHandGestureClassifier` flow, a "checkpoint" refers to the dictionary saved by this code:

```python
checkpoint_dict = {
    'state_dict': model.cpu().state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'epoch': epoch,
    'config': config_dict
}
torch.save(checkpoint_dict, checkpoint_path)
```

You can save any artifact of the training work you have done in a `checkpoint_dict` like this.
Then, you can resume the model state from the checkpoint.

```python
from models.mobilenetv3 import MobileNetV3
model = MobileNetV3(num_classes=num_classes, size='small', pretrained=pretrained, freezed=freezed)
...
checkpoint = torch.load(checkpoint, map_location=torch.device(device))["state_dict"]
model.load_state_dict(checkpoint, strict=False)
...
model.to(device)
return model
```

[Here](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training) are more general resources from the PyTorch documentation on checkpointing.
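To resume training rather than only load weights, the optimizer state and epoch stored in the checkpoint can be restored as well. The following is a minimal sketch, not the flow's code: the tiny `nn.Linear` model, the SGD settings, and the hard-coded epoch are assumptions chosen to keep the example small, while the checkpoint keys match the `checkpoint_dict` shown earlier.

```python
import os
import tempfile

import torch
from torch import nn

# Tiny stand-in model and optimizer (assumption: the flow uses a real CNN).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One training step so the optimizer has momentum state worth saving.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "best_model.pth")
    torch.save(
        {
            "state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "epoch": 3,  # illustrative value
            "config": {"lr": 0.1},
        },
        checkpoint_path,
    )

    # Rebuild fresh objects, then restore everything from the checkpoint.
    resumed_model = nn.Linear(4, 2)
    resumed_optimizer = torch.optim.SGD(
        resumed_model.parameters(), lr=0.1, momentum=0.9
    )
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    resumed_model.load_state_dict(checkpoint["state_dict"])
    resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1  # continue with the next epoch

print(start_epoch)
```

Restoring the optimizer state matters for optimizers that keep running statistics, such as momentum buffers. Without it, training restarts those statistics from zero even though the weights are restored.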

Model checkpoints in this example are written to the `best_model.pth` location. 
But if we are running on a remote compute instance, how do we move this checkpoint to a cloud resource that will persist beyond the lifecycle of the compute task? Again, Metaflow's S3 client makes this easy!

Each time the model performs better than the previous best one, it is checkpointed and the result is uploaded to the cloud using this snippet:

```python
import os
from metaflow import S3

path_to_best_model = os.path.join(experiment_path, 'best_model.pth')
# Each tuple is (S3 key relative to s3root, local file path).
with S3(s3root=experiment_cloud_storage_path) as s3:
    s3.put_files([(path_to_best_model, path_to_best_model)])
```
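The save-on-improvement logic described above can be sketched as follows. This is an illustration under stated assumptions, not the flow's implementation: `keep_best_checkpoints` is a hypothetical helper, the validation scores are made-up numbers where higher is assumed to be better, and `pickle` stands in for `torch.save` so the sketch runs without extra dependencies.

```python
import os
import pickle
import tempfile


def keep_best_checkpoints(val_scores, checkpoint_path):
    """Write a checkpoint each time the validation score beats the best so far.

    val_scores stands in for per-epoch validation metrics (higher is better).
    Returns the list of epochs at which a checkpoint was written.
    """
    best_score = float("-inf")
    saved_epochs = []
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score = score
            # In the flow this would be torch.save(checkpoint_dict, ...),
            # followed by the S3 upload; pickle is only a stand-in here.
            with open(checkpoint_path, "wb") as f:
                pickle.dump({"epoch": epoch, "score": score}, f)
            saved_epochs.append(epoch)
    return saved_epochs


with tempfile.TemporaryDirectory() as experiment_path:
    path_to_best_model = os.path.join(experiment_path, "best_model.pth")
    saved_epochs = keep_best_checkpoints(
        [0.61, 0.58, 0.70, 0.69, 0.74], path_to_best_model
    )
    with open(path_to_best_model, "rb") as f:
        best = pickle.load(f)

print(saved_epochs)  # [0, 2, 4]
print(best["epoch"], best["score"])
```

Because each improvement overwrites the same file, only the best checkpoint so far is kept on disk, which is why a single `best_model.pth` path is enough.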

The payoff of checkpointing in this way is that now you can easily resume the model from this state. 

In a notebook you can now evaluate the model, train it further, or iterate on the model architecture (PyTorch allows you to build [dynamic graphs](https://cs230.stanford.edu/section/5/)). 

```python
from hagrid.classifier.run import _initialize_model
from omegaconf import OmegaConf

model_path = 'best_model.pth'

try:
    model = _initialize_model(
        conf=OmegaConf.load('hagrid/classifier/config/default.yaml'),
        model_name='ResNet18',
        checkpoint_path=model_path,  # can be a local path or an S3 URI
        device='cpu'
    )
except FileNotFoundError:
    print("Are you sure you trained a model and saved the file to {}".format(model_path))
```

```
Building ResNet18
Building model from local checkpoint at: best_model.pth
```

Because the `TrainHandGestureClassifier` flow uses Metaflow's built-in versioning capabilities, you can also resume model training by passing the `--checkpoint` parameter when you run the flow defined in `classifier_flow.py`. This parameter can point either to a local `.pth` file or to the file's location in an S3 bucket.
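One way a flow might tell the two forms of `--checkpoint` apart is sketched below. The helper `describe_checkpoint` and the bucket name are hypothetical, and this is not how the flow is confirmed to do it. The actual download of an S3 checkpoint would be left to Metaflow's S3 client.

```python
from urllib.parse import urlparse


def describe_checkpoint(checkpoint):
    """Classify a --checkpoint value as a local path or an S3 URI.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(checkpoint)
    if parsed.scheme == "s3":
        return {
            "kind": "s3",
            "bucket": parsed.netloc,
            "key": parsed.path.lstrip("/"),
        }
    return {"kind": "local", "path": checkpoint}


local_info = describe_checkpoint("best_model.pth")
# "my-bucket" is a made-up bucket name.
s3_info = describe_checkpoint("s3://my-bucket/experiments/best_model.pth")

print(local_info)
print(s3_info)
```

Either form can then be passed on the command line through `--checkpoint`.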

```bash
python classifier_flow.py --package-suffixes .yaml run --epochs 1 --model ResNet18 --checkpoint best_model.pth
```

```
Metaflow 2.7.14 executing TrainHandGestureClassifier for user:eddie
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2022-12-03 23:02:03.277 Workflow starting (run-id 187934):
2022-12-03 23:02:05.019 [187934/start/1013676 (pid 5254)] Task is starting.
2022-12-03 23:02:07.425 [187934/start/1013676 (pid 5254)] Training ResNet18 in flow TrainHandGestureClassifier
2022-12-03 23:02:11.071 [187934/start/1013676 (pid 5254)] Task finished successfully.
2022-12-03 23:02:12.674 [187934/train/1013677 (pid 5264)] Task is starting.
2022-12-03 23:02:13.953 [187934/train/1013677 (pid 5264)]
```

In this lesson, you saw how to ensure you don't lose progress as you iterate on your model using checkpoints. 
You learned how to store model checkpoints and resume that state from a notebook or as the starting point in a subsequent flow. 
In the next lesson, we will complete the tutorial by demonstrating the use of [TensorBoard](https://www.tensorflow.org/tensorboard)'s experiment tracking solution with Metaflow.