# From notebook to Kubernetes pipeline

![](assets/nb.png)

This tutorial shows how to automatically convert a Jupyter notebook into a Kubernetes pipeline.

Let's download a sample notebook:

In [1]:
# conda activate {env} doesn't work well here
# so we manually modify the path
PATH=$CONDA_PREFIX/envs/soopervisor/bin:$PATH

In [2]:
mkdir pipeline
cd pipeline

In [3]:
curl -O https://raw.githubusercontent.com/ploomber/soorgeon/main/examples/machine-learning/nb.ipynb

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5782  100  5782    0     0  23892      0 --:--:-- --:--:-- --:--:-- 23794


```{note}
The sample notebook is a typical Machine
Learning pipeline. You can see it
[here](https://github.com/ploomber/soorgeon/blob/main/examples/machine-learning/nb.ipynb).
```

## Automatic refactoring

Let's now use [soorgeon](https://github.com/ploomber/soorgeon) to refactor the notebook:

In [4]:
pip install soorgeon --quiet
soorgeon refactor nb.ipynb -p /mnt/shared-folder -d parquet

Added README.md
Finished refactoring 'nb.ipynb', use Ploomber to continue.

Install dependencies (this will install ploomber):
    $ pip install -r requirements.txt

List tasks:
    $ ploomber status

Execute pipeline:
    $ ploomber build

Plot pipeline:
    $ ploomber plot

* Documentation: https://docs.ploomber.io
* Jupyter integration: https://ploomber.io/s/jupyter
* Other editors: https://ploomber.io/s/editors

```{note}
Soorgeon uses static analysis to split notebooks into
several files. The output is a [Ploomber](https://github.com/ploomber/ploomber)
pipeline that we can then export to Kubernetes.

The `-p` option tells Soorgeon to store all the pipeline
outputs in the `/mnt/shared-folder` directory, and the `-d`
option tells it to use `.parquet` files for the outputs.
```
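The output listing at the end of this tutorial suggests that each product is stored as `<task>-<variable>.<extension>` under the `-p` prefix. The sketch below only illustrates that inferred naming pattern; `product_path` is a hypothetical helper and is not part of Soorgeon.

```shell
# Hypothetical helper (not part of Soorgeon): build the path where a task's
# output variable is expected to be stored, assuming the
# <prefix>/<task>-<variable>.<extension> pattern seen in the outputs listing.
product_path() {
    prefix=$1     # value passed to -p, e.g. /mnt/shared-folder
    task=$2       # task name, e.g. load
    variable=$3   # variable produced by the task, e.g. df
    extension=$4  # file extension, e.g. parquet or pkl
    printf '%s/%s-%s.%s\n' "$prefix" "$task" "$variable" "$extension"
}

product_path /mnt/shared-folder load df parquet
```

The call on the last line prints `/mnt/shared-folder/load-df.parquet`.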

Before configuring the Argo Workflows backend, we create the lock file that `soopervisor add` requires:

In [5]:
# soopervisor add requires a requirements.lock.txt file
cp requirements.txt requirements.lock.txt
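Copying `requirements.txt` is enough for this tutorial, but a lock file normally pins exact versions. As a rough check, the sketch below lists requirement lines that lack an `==` pin. `unpinned_requirements` is a hypothetical helper, not a Soopervisor command, and it only skips blank lines and comments.

```shell
unpinned_requirements() {
    # $1: path to a requirements file
    # Drop blank lines and comments, then keep lines without an exact "==" pin.
    # grep exits non-zero when nothing matches, so "|| true" keeps the
    # function's exit status at zero for fully pinned files.
    grep -v -E '^[[:space:]]*(#|$)' "$1" | grep -v '==' || true
}
```

For example, `unpinned_requirements requirements.lock.txt` prints nothing when every dependency is pinned. One common way to produce a pinned file is `pip freeze > requirements.lock.txt`, run inside the environment that executes the pipeline.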

## Configuring Argo Workflows

In [6]:
# add the target environment
soopervisor add training --backend argo-workflows

No pipeline.training.yaml found, looking for pipeline.yaml instead
Found /Users/Edu/dev/soopervisor/kind/doc/pipeline/pipeline.yaml. Loading...
== Adding /Users/Edu/dev/soopervisor/kind/doc/pipeline/training/Dockerfile... ==
Environment added, to export it:
	 $ soopervisor export training
To force execution of all tasks:
	 $ soopervisor export training --mode force

Soopervisor uses a `soopervisor.yaml` file to configure your project. We'll download a pre-configured one:

In [7]:
curl https://raw.githubusercontent.com/ploomber/soopervisor/master/tutorials/workflow/soopervisor-workflow.yaml -o soopervisor.yaml

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   152  100   152    0     0    569      0 --:--:-- --:--:-- --:--:--   567


In [8]:
cat soopervisor.yaml

training:
  backend: argo-workflows
  repository: null
  mounted_volumes:
    - name: shared-folder
      spec:
        hostPath:
          path: /host


## Exporting Argo YAML Spec

The `soopervisor export` command will create the Docker image and the Argo YAML spec:

In [None]:
soopervisor export training --skip-tests --ignore-git --mode force

Here's the generated Argo YAML spec:

In [7]:
cat training/argo.yaml

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pipeline-
spec:
  entrypoint: dag
  templates:
  - inputs:
      parameters:
      - name: task_name
    name: run-task
    script:
      command:
      - bash
      image: pipeline:latest-default
      imagePullPolicy: Never
      source: |-
        ploomber task {{inputs.parameters.task_name}} --entry-point pipeline.yaml --force
      volumeMounts:
      - mountPath: /mnt/shared-folder
        name: shared-folder
        subPath: ''
      workingDir: null
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: task_name
            value: load
        dependencies: []
        name: load
        template: run-task
      - arguments:
          parameters:
          - name: task_name
            value: clean
        dependencies:
        - load
        name: clean
        template: run-task
      - arguments:
          parameters:
          - name: task_name
            value: train-test-s

## Running on Kubernetes

In [8]:
kind delete cluster

Deleting cluster "kind" ...


In [11]:
kind create cluster --config ../kind-config.yaml

Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.24.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂


In [13]:
cat ../kind-config.yaml

apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: outputs
        containerPath: /host


In [14]:
kubectl get nodes

NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   43s   v1.24.0


In [15]:
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.3.9/install.yaml

namespace/argo created
customresourcedefinition.apiextensions.k8s.io/clusterworkflowtemplates.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/cronworkflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workfloweventbindings.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtaskresults.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtasksets.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtemplates.argoproj.io created
serviceaccount/argo created
serviceaccount/argo-server created
role.rbac.authorization.k8s.io/argo-role created
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-admin created
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-edit created
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-view created
clusterrole.rbac.authorization.k8s.io/argo-cluster-role created
clust

In [16]:
kubectl patch deployment \
  argo-server \
  --namespace argo \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": [
  "server",
  "--auth-mode=server"
]}]'

deployment.apps/argo-server patched


In [None]:
sleep 5
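A fixed `sleep` may not be long enough for the patched `argo-server` to become ready. A polling loop is one alternative. The sketch below defines a hypothetical `retry_until` helper that reruns a command until it succeeds or the attempts run out; the attempt count and delay in the commented example are arbitrary.

```shell
retry_until() {
    # $1: maximum number of attempts
    # $2: seconds to sleep between attempts
    # remaining arguments: the command to run
    attempts=$1
    delay=$2
    shift 2
    count=1
    while [ "$count" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only sleep if another attempt is coming.
        if [ "$count" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        count=$((count + 1))
    done
    return 1
}

# Example (requires a running cluster, so it is left commented out):
# retry_until 30 2 kubectl rollout status deployment/argo-server -n argo --timeout=5s
```

`kubectl rollout status deployment/argo-server -n argo --timeout=120s` on its own also blocks until the rollout finishes, which may be all that is needed here.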

In [22]:
kubectl get pods -n argo

NAME                                   READY   STATUS    RESTARTS   AGE
argo-server-57cf87c886-54hgn           0/1     Running   0          9s
argo-server-65566599f8-7zrz8           0/1     Running   0          12s
workflow-controller-77c44779bf-vl65l   1/1     Running   0          12s


In [23]:
kind load docker-image pipeline:latest-default

Image: "pipeline:latest-default" with ID "sha256:192bb0cda7e554ee28ed80abded89a46d53cfd3f4a60e0bd447ddffad18407c4" not yet present on node "kind-control-plane", loading...


In [24]:
argo submit -n argo training/argo.yaml

Name:                pipeline-dw79l
Namespace:           argo
ServiceAccount:      default
Status:              Pending
Created:             Sun Aug 28 01:59:18 -0500 (now)
Progress:            


```{note}
To access Argo's UI, open a terminal and execute:

`kubectl -n argo port-forward deployment/argo-server 2746:2746`

Then, open: https://localhost:2746/
```

In [25]:
argo wait @latest -n argo

@latest Succeeded at 2022-08-28 02:01:35 -0500 CDT


In [26]:
argo get @latest -n argo

Name:                pipeline-dw79l
Namespace:           argo
ServiceAccount:      default
Status:              Succeeded
Conditions:          
 PodRunning          False
 Completed           True
Created:             Sun Aug 28 01:59:18 -0500 (2 minutes ago)
Started:             Sun Aug 28 01:59:18 -0500 (2 minutes ago)
Finished:            Sun Aug 28 02:01:35 -0500 (now)
Duration:            2 minutes 17 seconds
Progress:            5/5
ResourcesDuration:   3m1s*(1 cpu),3m1s*(100Mi memory)

STEP                          TEMPLATE  PODNAME                    DURATION  MESSAGE
 ✔ pipeline-dw79l             dag
 ├─✔ load                     run-task  pipeline-dw79l-3455339821  42s
 ├─✔ clean                    run-task  pipeline-dw79l-4113816922  21s
 ├─✔ train-test-split         run-task  pipeline-dw79l-112166065   15s
 ├─✔ linear-regression        run-task  pip

In [28]:
ls outputs/

clean-df.parquet               train-test-split-X_test.pkl
clean.ipynb                    train-test-split-X_train.pkl
linear-regression.ipynb        train-test-split-y_test.pkl
load-df.parquet                train-test-split-y_train.pkl
load.ipynb                     train-test-split.ipynb
random-forest-regressor.ipynb
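
To confirm the run programmatically rather than by eye, one option is to check that the expected files exist. The sketch below uses a hypothetical `missing_outputs` helper that reports any expected file not found in a directory.

```shell
missing_outputs() {
    # $1: directory with the pipeline outputs
    # remaining arguments: file names expected inside that directory
    dir=$1
    shift
    status=0
    for name in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            printf 'missing: %s\n' "$name"
            status=1
        fi
    done
    # Non-zero exit status when at least one file is missing.
    return "$status"
}
```

For example, `missing_outputs outputs load-df.parquet clean-df.parquet` prints nothing and returns 0 when both files from the listing above are present.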
