# Run PyTorchJob From Function

In this Notebook, we are going to create a [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/components/training/pytorch/).

The PyTorchJob will run distributed training using the [DistributedDataParallel strategy](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html).

## Install Kubeflow Python SDKs

You need to install PyTorch packages and Kubeflow SDKs to run this Notebook.

In [None]:
!pip install torch==1.12.1
!pip install torchvision==0.13.1

# TODO (andreyvelich): Change to release version when SDK with the new APIs is published.
!pip install git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python

## Create Train Script for CNN Model

This is a simple **Convolutional Neural Network (CNN)** model for recognizing different pictures of clothing using the [Fashion MNIST Dataset](https://github.com/zalandoresearch/fashion-mnist).

In [2]:
def train_pytorch_model():
    import logging
    import os
    from torchvision import transforms, datasets
    import torch
    from torch import nn
    import torch.nn.functional as F
    import torch.distributed as dist

    logging.basicConfig(
        format="%(asctime)s %(levelname)-8s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
        level=logging.DEBUG,
    )

    # Create PyTorch CNN Model.
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(1, 20, 5, 1)
            self.conv2 = nn.Conv2d(20, 50, 5, 1)
            self.fc1 = nn.Linear(4 * 4 * 50, 500)
            self.fc2 = nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # Get dist parameters.
    # Kubeflow Training Operator automatically sets the appropriate RANK and WORLD_SIZE based on the configuration.
    RANK = int(os.environ["RANK"])
    WORLD_SIZE = int(os.environ["WORLD_SIZE"])
    
    model = Net()
    # Attach model to DistributedDataParallel strategy.
    dist.init_process_group(backend="gloo", rank=RANK, world_size=WORLD_SIZE)
    Distributor = nn.parallel.DistributedDataParallel
    model = Distributor(model)

    # Split batch size for each worker.
    batch_size = int(128 / WORLD_SIZE)

    # Get Fashion MNIST DataSet.
    train_loader = torch.utils.data.DataLoader(
        datasets.FashionMNIST(
            "./data",
            train=True,
            download=True,
            transform=transforms.Compose([transforms.ToTensor()]),
        ),
        batch_size=batch_size,
    )

    # Start Training.
    logging.info(f"Start training for RANK: {RANK}. WORLD_SIZE: {WORLD_SIZE}")
    # Create the optimizer once so the momentum state persists across epochs.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
    for epoch in range(1):
        model.train()

        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0:
                logging.info(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )
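
The `4 * 4 * 50` input size of `fc1` follows from the layer shapes in `Net`. As a minimal sketch, the arithmetic can be checked in plain Python. The helper `conv_out` is a hypothetical name introduced here for illustration; it applies the standard output-size formula for convolution and pooling layers without dilation.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                      # Fashion MNIST images are 28x28 with 1 channel
size = conv_out(size, 5, 1)    # conv1: 5x5 kernel, stride 1 -> 24
size = conv_out(size, 2, 2)    # max_pool2d(x, 2, 2)         -> 12
size = conv_out(size, 5, 1)    # conv2: 5x5 kernel, stride 1 -> 8
size = conv_out(size, 2, 2)    # max_pool2d(x, 2, 2)         -> 4

flattened = size * size * 50   # conv2 produces 50 output channels
print(size, flattened)         # prints: 4 800
```

The result, 800 features per image, matches the first argument of `nn.Linear(4 * 4 * 50, 500)` and the `x.view(-1, 4 * 4 * 50)` call in `forward`.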

## Run Training Locally in the Notebook

We are going to download the Fashion MNIST Dataset and start local training with a single process (`WORLD_SIZE` of 1).

In [3]:
# Set dist env variables to run the above training locally on the Notebook.
import os
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "1234"

# Train Model locally in the Notebook.
train_pytorch_model()

2022-09-12T18:21:28Z INFO     Added key: store_based_barrier_key:1 to store for rank: 0


Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0/26421880 [00:00<?, ?it/s]

Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0/29515 [00:00<?, ?it/s]

Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0/4422102 [00:00<?, ?it/s]

Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


  0%|          | 0/5148 [00:00<?, ?it/s]

Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

Processing...


  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
2022-09-12T18:31:05Z INFO     Start training for RANK: 0. WORLD_SIZE: 1


Done!


2022-09-12T18:31:05Z INFO     Reducer buckets have been rebuilt in this iteration.
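
The cell above writes the distributed settings into `os.environ`, where they stay for the rest of the Notebook session. If that is not desired, one option is a small context manager that restores the previous values afterwards. This is only a sketch: `dist_env` is a hypothetical helper, not part of PyTorch or the Kubeflow SDK, and its defaults simply mirror the values used above.

```python
import os
from contextlib import contextmanager


@contextmanager
def dist_env(rank=0, world_size=1, master_addr="localhost", master_port=1234):
    """Temporarily set the env variables that the training function reads."""
    new_values = {
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }
    # Remember the previous values (None means the variable was unset).
    old_values = {key: os.environ.get(key) for key in new_values}
    os.environ.update(new_values)
    try:
        yield
    finally:
        # Restore the environment exactly as it was before.
        for key, value in old_values.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value


# Hypothetical usage in this Notebook:
# with dist_env():
#     train_pytorch_model()
```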


## Start Distributed Training with PyTorchJob

Before creating a PyTorchJob, you have to create a `TrainingClient()`. It uses the [Kubernetes Python client](https://github.com/kubernetes-client/python) to communicate with the Kubernetes API server. You can set the path and context for [the kubeconfig file](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/). The default location for the kubeconfig is `~/.kube/config`.

Kubeflow Training Operator automatically sets the appropriate env variables (`MASTER_PORT`, `MASTER_ADDR`, `WORLD_SIZE`, `RANK`) for each PyTorchJob container.

In [5]:
from kubeflow.training import TrainingClient

# Start PyTorchJob Training.
pytorchjob_name = "train-pytorch"
training_client = TrainingClient()

training_client.create_pytorchjob_from_func(
    name=pytorchjob_name,
    func=train_pytorch_model,
    num_worker_replicas=3, # How many PyTorch Workers will be run.
)

PyTorchJob kubeflow-user-example-com/train-pytorch has been created
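
With `num_worker_replicas=3`, the pod names and logs shown later in this Notebook list one master and three workers, that is 4 nodes, so the training function splits its batch of 128 four ways. Note also that the `DataLoader` in the training function has no sampler, so every process appears to iterate over the full dataset. The sketch below works through the batch arithmetic and mimics, in plain Python, the rank-strided partitioning that PyTorch's `DistributedSampler` performs when shuffling is off. The helpers `per_worker_batch_size` and `shard_indices` are hypothetical names used only for illustration.

```python
import math


def per_worker_batch_size(global_batch: int, world_size: int) -> int:
    """Same arithmetic as int(128 / WORLD_SIZE) in the training function."""
    return global_batch // world_size


def shard_indices(dataset_len: int, rank: int, world_size: int) -> list:
    """Indices one rank would read: pad to an even split, then stride by rank."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]  # repeat leading samples as padding
    return indices[rank:total:world_size]


world_size = 1 + 3  # one master plus three workers
print(per_worker_batch_size(128, world_size))  # prints: 32

# Toy dataset of 10 samples split across the 4 processes.
shards = [shard_indices(10, rank, world_size) for rank in range(world_size)]
print(shards)  # prints: [[0, 4, 8], [1, 5, 9], [2, 6, 0], [3, 7, 1]]

# The 60,000 Fashion MNIST training images would give 15,000 per process.
print(len(shard_indices(60000, 0, world_size)))  # prints: 15000
```

In a real training function, passing a `torch.utils.data.distributed.DistributedSampler` to the `DataLoader` through its `sampler` argument would be the usual way to get this behavior.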


### Check PyTorchJob Status

Use the `TrainingClient` APIs to get information about the created PyTorchJob.

In [18]:
print(f"PyTorchJob Status: {training_client.is_job_running(name=pytorchjob_name, job_kind='PyTorchJob')}")

PyTorchJob Status: True
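
The check above is a one-off snapshot. To block until the job stops running, one option is a small polling loop around the same call. The sketch below assumes nothing beyond `is_job_running` as used above; `wait_until` is a hypothetical helper, and the timeout and interval values are arbitrary.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll predicate() until it returns True or the timeout (in seconds) expires."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage in this Notebook:
# stopped = wait_until(
#     lambda: not training_client.is_job_running(
#         name=pytorchjob_name, job_kind="PyTorchJob"
#     )
# )
```

A job that is no longer running has not necessarily succeeded, so the logs are still worth checking afterwards.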


### Get PyTorchJob Pod Names

In [19]:
training_client.get_job_pod_names(pytorchjob_name)

['train-pytorch-master-0',
 'train-pytorch-worker-0',
 'train-pytorch-worker-1',
 'train-pytorch-worker-2']

### Get PyTorchJob Training Logs

In [27]:
training_client.get_job_logs(pytorchjob_name, container="pytorch")

The logs of pod train-pytorch-master-0:
 2023-01-12T18:55:33Z INFO     Added key: store_based_barrier_key:1 to store for rank: 0
2023-01-12T18:55:33Z INFO     Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 26421880/26421880 [00:02<00:00, 12562567.98it/s]
Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

Downloadi

## Delete PyTorchJob

When the PyTorchJob is finished, you can delete the resource.

In [28]:
training_client.delete_pytorchjob(pytorchjob_name)

PyTorchJob kubeflow-user-example-com/train-pytorch has been deleted
