So far this tutorial has focused on data pipelines. This lesson introduces model development considerations into the workflow. You can see the source code for the flow [here](https://github.com/outerbounds/tutorials/tree/main/cv-2/classifier_flow.py).


In cases where we have a lot of labeled images, deep learning can be an extremely effective modeling technique. This, however, assumes you have access to sufficient resources to train deep learning models. In this lesson, you will learn about concepts at the intersection of applied machine learning and cloud development, such as how to use transfer learning models to bootstrap your deep learning workflows and how to iteratively develop models. More specifically, you will use PyTorch and Metaflow to access data and train transfer learning models on hardware accelerators in the cloud.

![](/assets/stack-model.png)

Transfer learning helps you approach state-of-the-art results without spending too much on resources. It is a common pattern where a machine learning developer reuses the expensive work already done to train models (by Google, Meta, Microsoft, etc.) instead of starting from scratch. In this example, we take a general image processing model and fine-tune it for our hand gesture classification use case. To do this, we use the `build_model` function defined in the `hagrid` package.

In [59]:
from hagrid.classifier.utils import build_model, get_device
model = build_model(model_name = 'MobileNetV3_small', num_classes = 19, device = get_device())

Building MobileNetV3_small


Under the hood, this function instantiates one of the classes defined in the `models` directory, which you can inspect [here](https://github.com/hukenovs/hagrid/tree/master/classifier/models). The key lines that load the model use a custom class like `MobileNetV3` that subclasses [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html).

```python
from models.mobilenetv3 import MobileNetV3
model = MobileNetV3(num_classes=num_classes, size='small', pretrained=pretrained, freezed=freezed)
```

The custom classes, such as `MobileNetV3` in this case, each instantiate a torchvision model.

```python
weights = torchvision.models.MobileNet_V3_Small_Weights if pretrained else None
torchvision_model = torchvision.models.mobilenet_v3_small(weights=weights)
```

Each custom class has its own constructor and `forward` method, like all neural networks in PyTorch.
An important part of working with transfer learning models is knowing how to stitch a new output layer onto the end of an existing network.
For example, the next code snippet shows part of the constructor of the `MobileNetV3` class and how it uses torchvision to load the models above. You can also see how the model is assembled from three parts: the "transferred" `backbone` taken from the loaded `torchvision_model`, the `gesture_classifier`, and the `leading_hand_classifier`.

```python
if size == "small":
    weights = torchvision.models.MobileNet_V3_Small_Weights if pretrained else None
    torchvision_model = torchvision.models.mobilenet_v3_small(weights=weights)
    in_features = 576
    out_features = 1024
else:
    weights = torchvision.models.MobileNet_V3_Large_Weights if pretrained else None
    torchvision_model = torchvision.models.mobilenet_v3_large(weights=weights)
    in_features = 960
    out_features = 1280

if freezed:
    for param in torchvision_model.parameters():
        param.requires_grad = False

self.backbone = nn.Sequential(
    torchvision_model.features,
    torchvision_model.avgpool
)

self.gesture_classifier = nn.Sequential(
    nn.Linear(in_features=in_features, out_features=out_features),
    nn.Hardswish(),
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(in_features=out_features, out_features=num_classes),
)
self.leading_hand_classifier = nn.Sequential(
    nn.Linear(in_features=in_features, out_features=out_features),
    nn.Hardswish(),
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(in_features=out_features, out_features=2),
)

```

The `forward` method, which is called when the neural net produces a prediction, then feeds the output of the `backbone` part of the network into both classifiers.

```python
def forward(self, x: Tensor) -> Dict:
    x = self.backbone(x)
    x = x.view(x.size(0), -1)
    gesture = self.gesture_classifier(x)
    leading_hand = self.leading_hand_classifier(x)
    return {"gesture": gesture, "leading_hand": leading_hand}
```

This is an example of multi-task learning, a pattern that larger, more general models often use. As the model trains, its weights will be updated to improve both the `gesture_classifier` and `leading_hand_classifier`. You can view the model architecture with the `torchsummary` package:

In [60]:
from torchsummary import summary
summary(model)

Layer (type:depth-idx)                        Param #
├─Sequential: 1-1                             --
|    └─Sequential: 2-1                        --
|    |    └─Conv2dNormActivation: 3-1         464
|    |    └─InvertedResidual: 3-2             744
|    |    └─InvertedResidual: 3-3             3,864
|    |    └─InvertedResidual: 3-4             5,416
|    |    └─InvertedResidual: 3-5             13,736
|    |    └─InvertedResidual: 3-6             57,264
|    |    └─InvertedResidual: 3-7             57,264
|    |    └─InvertedResidual: 3-8             21,968
|    |    └─InvertedResidual: 3-9             29,800
|    |    └─InvertedResidual: 3-10            91,848
|    |    └─InvertedResidual: 3-11            294,096
|    |    └─InvertedResidual: 3-12            294,096
|    |    └─Conv2dNormActivation: 3-13        56,448
|    └─AdaptiveAvgPool2d: 2-2                 --
├─Sequential: 1-2                             --
|    └─Linear: 2-3                            590,848
|    └─Hardsw
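
To make the shared-backbone pattern concrete, here is a minimal sketch of a two-headed network with a frozen backbone. The `TwoHeadClassifier` class and its tiny convolutional backbone are hypothetical stand-ins, not the hagrid implementation; the sketch only mirrors the structure shown above (a `backbone`, a `gesture_classifier`, and a `leading_hand_classifier`).

```python
import torch
from torch import nn


class TwoHeadClassifier(nn.Module):
    """Toy stand-in for the hagrid model: one shared backbone, two heads."""

    def __init__(self, num_gestures: int = 19, freeze_backbone: bool = True):
        super().__init__()
        # Hypothetical tiny backbone; the tutorial uses MobileNetV3 features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.Hardswish(),
            nn.AdaptiveAvgPool2d(1),
        )
        if freeze_backbone:
            # Frozen parameters receive no gradient updates during training.
            for param in self.backbone.parameters():
                param.requires_grad = False
        self.gesture_classifier = nn.Linear(8, num_gestures)
        self.leading_hand_classifier = nn.Linear(8, 2)

    def forward(self, x):
        x = self.backbone(x)
        x = x.view(x.size(0), -1)  # (batch, 8, 1, 1) -> (batch, 8)
        return {
            "gesture": self.gesture_classifier(x),
            "leading_hand": self.leading_hand_classifier(x),
        }


toy_model = TwoHeadClassifier()
toy_output = toy_model(torch.randn(4, 3, 32, 32))
# Only the two heads remain trainable: (8*19 + 19) + (8*2 + 2) parameters.
trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
print(toy_output["gesture"].shape, toy_output["leading_hand"].shape, trainable)
```

With the backbone frozen, an optimizer built from the parameters where `requires_grad` is true only updates the two heads, which is what keeps fine-tuning cheap.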

We looked at a single example in this section.
You can load one of the following models with the same process: [ResNet18](https://pytorch.org/vision/stable/models/generated/torchvision.models.resnet18.html#torchvision.models.resnet18), [ResNext50](https://pytorch.org/vision/stable/models/generated/torchvision.models.resnext50_32x4d.html#torchvision.models.resnext50_32x4d), [ResNet152](https://pytorch.org/vision/stable/models/generated/torchvision.models.resnet152.html#torchvision.models.resnet152), [MobileNetV3_small](https://pytorch.org/vision/stable/models/generated/torchvision.models.mobilenet_v3_small.html#torchvision.models.mobilenet_v3_small), [MobileNetV3_large](https://pytorch.org/vision/stable/models/generated/torchvision.models.mobilenet_v3_large.html#torchvision.models.mobilenet_v3_large), or [Vitb32](https://pytorch.org/vision/stable/models/generated/torchvision.models.vit_b_32.html#torchvision.models.vit_b_32).

Now we have access to data pipelines and powerful models for transfer learning. How do we use them together?
The next script shows how to tie together what you have built so far in the following steps:
* Load data using your custom PyTorch Dataset and DataLoader.
* Build a model from `MobileNetV3_small`.
* Run a single epoch of model training.

In [None]:
# imports and configuration
from hagrid.classifier.run import _initialize_model
from hagrid.classifier.preprocess import get_transform
from hagrid.classifier.dataset import GestureDataset
from hagrid.classifier.utils import build_model, collate_fn, get_device
from hagrid.classifier.train import TrainClassifier
import math
import torch
from omegaconf import OmegaConf
conf = OmegaConf.load('hagrid/classifier/config/default.yaml')
device = 'cpu'

# load data
train_dataset = GestureDataset(is_train=True, conf=conf, transform=get_transform())
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=64,
    num_workers=8,
    collate_fn=collate_fn,
    persistent_workers = True,
    shuffle=True
)

# build model first, then create the optimizer over its trainable parameters
model = build_model(model_name = 'MobileNetV3_small', num_classes = 19, device = device)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(
    params,
    lr=conf.optimizer.lr,
    momentum=conf.optimizer.momentum,
    weight_decay=conf.optimizer.weight_decay
)
criterion = torch.nn.CrossEntropyLoss()

# run single epoch
for i, (images, labels) in enumerate(train_loader):
    images = torch.stack(list(image.to(device) for image in images))
    output = model(images)
    loss = []
    for target in list(labels)[0].keys():
        target_labels = [label[target] for label in labels]
        target_labels = torch.as_tensor(target_labels).to(device)
        predicted_labels = output[target]
        loss.append(criterion(predicted_labels, target_labels))
    loss = sum(loss)
    loss_value = loss.item()
    if not math.isfinite(loss_value):
        print("Loss is {}, stopping training".format(loss_value))
        exit(1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print("Step: {} - Loss {}".format(i, loss_value))

Building MobileNetV3_small
Step: 0 - Loss 3.648484230041504
Step: 1 - Loss 3.644343614578247
Step: 2 - Loss 3.658726692199707
Step: 3 - Loss 3.626072645187378
Step: 4 - Loss 3.6349782943725586
Step: 5 - Loss 3.6447670459747314
Step: 6 - Loss 3.6444149017333984
Step: 7 - Loss 3.639073133468628
Step: 8 - Loss 3.6401968002319336
Step: 9 - Loss 3.6476778984069824
Step: 10 - Loss 3.667464017868042
Step: 11 - Loss 3.6293206214904785
Step: 12 - Loss 3.673339605331421
Step: 13 - Loss 3.6308274269104004
Step: 14 - Loss 3.6213526725769043
Step: 15 - Loss 3.6380038261413574
Step: 16 - Loss 3.6302177906036377
Step: 17 - Loss 3.648437023162842
Step: 18 - Loss 3.639798641204834
Step: 19 - Loss 3.6680898666381836
Step: 20 - Loss 3.6555967330932617
Step: 21 - Loss 3.6286935806274414
Step: 22 - Loss 3.587489604949951
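
The logged loss hovers around 3.64 for a reason. The loop sums one cross-entropy term per head, and a head that predicts roughly uniformly over `K` classes has a cross-entropy of about `ln(K)`. The sketch below assumes the freshly initialized heads start close to uniform, which is an approximation rather than something the source states; `uniform_cross_entropy` is a hypothetical helper.

```python
import math


def uniform_cross_entropy(num_classes: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over num_classes."""
    return math.log(num_classes)


# 19 gesture classes plus 2 leading-hand classes, summed as in the loop above.
baseline = uniform_cross_entropy(19) + uniform_cross_entropy(2)
print(round(baseline, 3))  # 3.638
```

A loss that stays near this baseline means the heads have barely moved from chance level, which is expected after only a couple dozen steps.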


If you ran the code in the previous section, it probably took a little bit of time to finish that single training epoch. 
It is no secret that hardware accelerators like Nvidia GPUs can help you increase your rate of experimentation. 
But there is no shortage of difficulties when accessing these machines in a [full stack machine learning](/docs/infra-stack/) environment.

With a bit of restructuring of the code you've seen so far in this tutorial, Metaflow makes [GPU access straightforward](https://docs.metaflow.org/scaling/remote-tasks/introduction) for data scientists, data analysts, and operations researchers.
Moreover, the flow in the next section demonstrates controls you can use to address the following challenges that emerge when using remote GPU instances:
1. How do you guarantee dependencies on the remote compute instance if you don't have a local GPU?
    * You can create a Docker image and access it with a Metaflow decorator like `@batch(image = X)` or `@kubernetes(image = X)`.
2. How do you efficiently load data batches so you don't pay for idle GPU instances?
    * With Metaflow's `@batch(memory = X)` or `@kubernetes(memory = X)` decorators you can control the amount of memory your compute instance needs. This works nicely with PyTorch DataLoader features like tuning `num_workers`, the number of separate processes that load mini-batches to feed the model.
3. How do you quickly iterate with different transfer learning model heads?
    * In this section, you will see a flow where you can type the model name into the command as a string, `--model <MODEL_NAME> --checkpoint <S3_PATH_TO_MODEL_NAME>.pth`, with the rest of the configuration remaining the same.
4. How do you include local files, such as the configuration at `hagrid/classifier/config/default.yaml`, on the GPU instance?
    * Metaflow flows can be run with the top-level `--package-suffixes <FILE EXTENSION>` argument, which packages files with that extension alongside your code. When you only want to include a single file, you can use Metaflow's [IncludeFile](/docs/load-local-data-with-include/) parameter.
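
Because the model name in the third point arrives as a free-form string, it can help to validate it before any cloud resources are requested. The sketch below is one way to do that; `validate_model_name` is a hypothetical helper that is not part of hagrid or Metaflow, and the tuple simply repeats the model names listed earlier in this lesson.

```python
SUPPORTED_MODELS = (
    "ResNet18", "ResNext50", "ResNet152",
    "MobileNetV3_small", "MobileNetV3_large", "Vitb32",
)


def validate_model_name(name: str) -> str:
    """Hypothetical helper: return the name unchanged or raise a readable error."""
    if name not in SUPPORTED_MODELS:
        options = ", ".join(SUPPORTED_MODELS)
        raise ValueError(f"Unknown model {name!r}; choose one of: {options}")
    return name


print(validate_model_name("MobileNetV3_small"))
```

Failing fast on a typo is cheaper in the first local step of a flow than after a GPU instance has already been provisioned.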

Now we have all the ingredients we need to encapsulate data loading and model training logic in a flow. 
The `TrainHandGestureClassifier` flow shows how you can use Metaflow to:
- Request GPU resources when using AWS Batch via Metaflow's `@batch` decorator.
- Configure the persistent S3 location to track results.
- Download the data onto the compute instance using Metaflow's S3 client.
- Save model checkpoints to S3.

![](../../../../static/assets/cv-tutorial-2-TrainHandGestureClassifier.png)

In [78]:
%%writefile classifier_flow.py
from metaflow import FlowSpec, Parameter, step, batch, environment, S3, metaflow_config, current

class TrainHandGestureClassifier(FlowSpec):

    S3_URI = Parameter(
        's3', type=str, 
        default='s3://outerbounds-tutorials/computer-vision/hand-gesture-recognition',
        help = 'The s3 uri to the root of the model objects.'
    )

    DATA_ROOT = Parameter(
        'data', type=str, default='data/',
        help = 'The relative location of the training data.'
    )

    IMAGES = Parameter(
        'images', type=str,
        default = 'subsample.zip',
        help = 'The path to the images.'
    )

    ANNOTATIONS = Parameter(
        'annotations', type=str,
        default = 'subsample-annotations.zip'
    )

    PATH_TO_CONFIG = Parameter(
        'config', type=str, 
        default = 'hagrid/classifier/config/default.yaml',
        help = 'The path to classifier training config.'
    )
    
    NUMBER_OF_EPOCHS = Parameter(
        'epochs', type=int, default=100,
        help = 'The number of epochs to train the model for.'
    )

    MODEL_NAME = Parameter(
        'model', type=str,
        default = 'MobileNetV3_small',
        help = '''Pick a model from:
            - [ResNet18, ResNext50, ResNet152, MobileNetV3_small, MobileNetV3_large, Vitb32]
        '''
    )
    
    CHECKPOINT_PATH = Parameter(
        'checkpoint', type=str, default = None, 
        help = 'Path to the model state you want to resume. Eithe'
    )
    
    # # If you do not plan to checkpoint models in S3, then you may want
    # # to use Metaflow's IncludeFile here, instead of this parameter to 
    # # the path. Make sure to import IncludeFile :)
    # CHECKPOINT_PATH = IncludeFile(
    #    'best_model.pth',
    #    is_text=False,
    #    help='The path to your local best_model.pth checkpoint',
    #    default='./best_model.pth'
    # )

    @step
    def start(self):
        # Configure the (remote) experiment tracking location.
        # In this tutorial, experiment tracking means
            # 1: Storing the best model state checkpoints to S3.
            # 2: Storing parameters as Metaflow artifacts.
            # 3: Storing metrics/logs with Tensorboard. 
        import os
        print("Training {} in flow {}".format(self.MODEL_NAME, current.flow_name))
        self.datastore = metaflow_config.METAFLOW_CONFIG['METAFLOW_DATASTORE_SYSROOT_S3']
        self.experiment_storage_prefix = os.path.join(self.datastore, current.flow_name, current.run_id)
        self.next(self.train)

    def _download_data_from_s3(self, file, sample : bool = True):
        import zipfile
        import os
        with S3(s3root = self.S3_URI) as s3:
            if sample:
                path = os.path.join(self.DATA_ROOT, file)
                result = s3.get(path)
                with zipfile.ZipFile(result.path, 'r') as zip_ref:
                    zip_ref.extractall(path.split('.zip')[0])
            else: # Full dataset takes too long for the purpose of this tutorial.
                raise NotImplementedError()

    # 🚨🚨🚨 Do you want to ▶️ on ☁️☁️☁️?
    # You need to be configured with a Metaflow AWS deployment to use this decorator.
    # If you want to run locally, you can comment the `@batch` decorator out.
    @batch(
        gpu=1,
        memory=32000,
        image='eddieob/cv-tutorial:gpu-latest',
        shared_memory=8000,
    )
    @step
    def train(self):
        from hagrid.classifier.run import run_train
        from hagrid.classifier.utils import get_device
        import os
        
        # Download the dataset onto the compute instance.
        if not os.path.exists(self.DATA_ROOT):
            os.mkdir(self.DATA_ROOT)
        print("Downloading images...")
        self._download_data_from_s3(self.IMAGES, sample=True)
        print("Done!")
        print("Downloading annotations...")
        self._download_data_from_s3(self.ANNOTATIONS, sample=True)
        print("Done!")

        # Train a model from available MODEL_NAME options from a checkpoint.
        # There will be errors that happen if CHECKPOINT_PATH doesn't match MODEL_NAME.
        # The user should know which checkpoint paths came from which models.
        self.train_args = dict(
            path_to_config = self.PATH_TO_CONFIG,
            number_of_epochs = self.NUMBER_OF_EPOCHS,
            device = get_device(),
            checkpoint_path = self.CHECKPOINT_PATH,
            model_name = self.MODEL_NAME,
            tensorboard_s3_prefix = self.experiment_storage_prefix,
            always_upload_best_model = True
        )
        _ = run_train(**self.train_args)

        # Move the best model checkpoint to S3 if METAFLOW_DATASTORE_SYSROOT_S3 is available. 
        # See the comment in the start step about setting self.experiment_storage_prefix.
        experiment_path = os.path.join("experiments", self.MODEL_NAME)
        path_to_best_model = os.path.join(experiment_path, 'best_model.pth')
        self.best_model_location = os.path.join(self.experiment_storage_prefix, path_to_best_model)
        if self.best_model_location.startswith('s3://'):
            with S3(s3root = self.experiment_storage_prefix) as s3:
                s3.put_files([(path_to_best_model, path_to_best_model)])
                print("Best model checkpoint saved at {}".format(self.best_model_location))
        self.next(self.end)
        
    @step
    def end(self):
        pass # You could do some fancy analytics, post-processing, or write a nice message here too! 

if __name__ == '__main__':
    TrainHandGestureClassifier()

Writing classifier_flow.py
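
To see where the best checkpoint ends up, the following sketch replays the path arithmetic from the `start` and `train` steps with made-up values. The bucket name is hypothetical; in the flow these values come from the Metaflow configuration and the `current` object. The flow itself uses `os.path.join`, while this sketch uses `posixpath.join` so that it prints forward slashes on any operating system.

```python
import posixpath

# Hypothetical values; the flow reads these from the Metaflow config and `current`.
datastore = "s3://my-bucket/metaflow"
flow_name = "TrainHandGestureClassifier"
run_id = "187928"
model_name = "ResNet18"

# Mirrors self.experiment_storage_prefix in the start step.
experiment_storage_prefix = posixpath.join(datastore, flow_name, run_id)

# Mirrors path_to_best_model and self.best_model_location in the train step.
path_to_best_model = posixpath.join("experiments", model_name, "best_model.pth")
best_model_location = posixpath.join(experiment_storage_prefix, path_to_best_model)
print(best_model_location)
```

Keying the prefix on the flow name and run ID means each run writes its checkpoint to a distinct location, so earlier results are not overwritten.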


Now it is time to run the `TrainHandGestureClassifier` flow.
You can run it with a command structured like this:

```bash
python classifier_flow.py --package-suffixes '.yaml' run --model 'ResNet18' --checkpoint 'best_model.pth'
```

- The configuration file ends in `.yaml`, and it needs to be available on the remote instance that runs the `train` step. Metaflow makes it accessible in remote compute steps when you pass `--package-suffixes .yaml` before the `run` command.
- The model name is specified using the `--model` argument. This is defined in the flow parameters. 
- The state of the `--model` to resume from is specified using the `--checkpoint` argument. This is also defined in the flow parameters.
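
The checkpoint has to come from the same architecture that `--model` names. The sketch below illustrates the kind of failure to expect otherwise, using two small `nn.Linear` layers as hypothetical stand-ins for different architectures. It assumes the checkpoint is a plain PyTorch `state_dict`, which may differ from the format hagrid actually stores.

```python
import os
import tempfile

import torch
from torch import nn

# Hypothetical stand-ins for two different architectures.
small_head = nn.Linear(4, 3)
large_head = nn.Linear(4, 5)

# Save a checkpoint from the first architecture and read it back.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint_path = os.path.join(tmp, "best_model.pth")
    torch.save(small_head.state_dict(), checkpoint_path)
    state = torch.load(checkpoint_path, map_location="cpu")

# Same architecture: the state loads cleanly.
nn.Linear(4, 3).load_state_dict(state)

# Different architecture: PyTorch raises a RuntimeError about mismatched shapes.
try:
    large_head.load_state_dict(state)
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True
print(mismatch_detected)
```

Keeping the model name in the checkpoint path, as the flow does with `experiments/<MODEL_NAME>/best_model.pth`, is a simple way to remember which checkpoint belongs to which model.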

You can explore more runtime options to add to the run command by investigating the `Parameter` definitions in the flow definition in the `classifier_flow.py` file.

```bash
python classifier_flow.py --package-suffixes '.yaml' run --epochs 50 --model 'ResNet18'
```

In [66]:
! python classifier_flow.py --package-suffixes '.yaml' run --epochs 50 --model 'ResNet18'

[35m[1mMetaflow 2.7.14[0m[35m[22m executing [0m[31m[1mTrainHandGestureClassifier[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:eddie[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-12-03 17:32:06.592 [0m[1mWorkflow starting (run-id 187928):[0m
[35m2022-12-03 17:32:08.306 [0m[32m[187928/start/1013657 (pid 97697)] [0m[1mTask is starting.[0m
[35m2022-12-03 17:32:11.837 [0m[32m[187928/start/1013657 (pid 97697)] [0m[22mTraining ResNet18 in flow TrainHandGestureClassifier[0m
[35m2022-12-03 17:32:15.585 [0m[32m[187928/start/1013657 (pid 97697)] [0m[1mTask finished successfully.[0m
[35m2022-12-03 17:32:17.160 [0m[32m[187928/train/1013658 (pid 97705)] [0m[1mTask is starting.[0m
[35m2022-12-03 17:32:18.390 [0m[32m[187928/train/1013658 (pid 97705)] [0m

In this lesson, you learned how to do transfer learning with PyTorch to access and iterate on state-of-the-art models. You then saw how Metaflow helps you access GPUs in the cloud while addressing common issues when setting up your workflows for remote computing tasks.
Metaflow provides access to compute, takes care of the networking between the data and compute environments, and versions all results so they can be accessed later.

In the next lesson, we will level up the ability to iterate on this model effectively. Specifically, you will learn how to save a model checkpoint in the cloud and then resume training from that point. See you there!