<a href="https://colab.research.google.com/github/probml/pyprobml/blob/master/book1/intro/caliban.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running parallel jobs on Google Cloud using Caliban 

[Caliban](https://github.com/google/caliban) is a package that makes it easy to run embarrassingly parallel jobs on Google Cloud Platform (GCP) from your laptop. Caliban bundles your code into a Docker image and then runs it on [Cloud AI Platform](https://cloud.google.com/ai-platform), a managed service that runs on GCP virtual machines.


In [42]:
import json
import pandas as pd
import glob
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt

# Installation

The details on how to install and run Caliban can be found [here](https://github.com/google/caliban). Below we give a very brief summary. Do these steps on your laptop, **outside of this colab**.

- [install docker](https://github.com/google/caliban#docker) and test using ```docker run hello-world```

- ```pip install caliban```

- [set up GCP](https://caliban.readthedocs.io/en/latest/getting_started/cloud.html)


# Launch jobs on GCP 

Do these steps on your laptop, **outside of this colab**.



- create a requirements.txt file listing the packages that need to be installed in the GCP Docker image. Example:

```
numpy
scipy
#sympy
matplotlib
#torch # 776MB  slow
#torchvision
tensorflow_datasets
jupyter
ipywidgets
seaborn
pandas
keras
scikit-learn  # PyPI name of the package imported as sklearn
#ipympl 
jax
flax
 
# below is jaxlib with GPU support
 
# CUDA 10.0
#tensorflow-gpu==2.0
#https://storage.googleapis.com/jax-releases/cuda100/jaxlib-0.1.47-cp36-none-linux_x86_64.whl
#https://storage.googleapis.com/jax-releases/cuda100/jaxlib-0.1.47-cp37-none-linux_x86_64.whl
 
# CUDA 10.1
#tensorflow-gpu==2.1
#https://storage.googleapis.com/jax-releases/cuda101/jaxlib-0.1.47-cp37-none-linux_x86_64.whl
 
tensorflow==2.1  # 421MB slow
https://storage.googleapis.com/jax-releases/cuda101/jaxlib-0.1.60+cuda101-cp37-none-manylinux2010_x86_64.whl
 
# jaxlib with CPU support
#tensorflow
#jaxlib
```

- create the script that you want to run in parallel, e.g. [caliban_test.py](https://github.com/probml/pyprobml/blob/master/scripts/caliban_test.py)


- create a config.json file with the list of flag combinations you want to pass to the script. For example, the following file runs 2 versions of the script, one with flags ```--ndims 10 --prefix "***"``` and one with ```--ndims 100 --prefix "***"```. (The prefix flag is for pretty printing.)
```
{"ndims": [10, 100],
"prefix": "***" }
```
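
To make the expansion concrete, here is a minimal sketch of how a config like the one above maps to flag combinations. It is an illustration only, not Caliban's actual implementation: the helper name `expand_config` is hypothetical, and it assumes that list values are swept while scalar values are held fixed.

```python
import itertools
import json


def expand_config(config):
    """Expand a dict of flag values into one flag list per combination.

    Hypothetical helper for illustration; not part of Caliban.
    """
    keys = list(config)
    # Treat scalars as single-element lists so every key is handled uniformly.
    value_lists = [v if isinstance(v, list) else [v] for v in config.values()]
    combos = []
    for values in itertools.product(*value_lists):
        flags = []
        for key, value in zip(keys, values):
            flags += [f"--{key}", str(value)]
        combos.append(flags)
    return combos


config = json.loads('{"ndims": [10, 100], "prefix": "***"}')
for flags in expand_config(config):
    print(flags)
```

Adding a second list-valued key would multiply the number of jobs, since the sketch takes the Cartesian product of all lists.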

- launch jobs on GCP, giving them a common name using the xgroup flag. 
```
cp ~/github/pyprobml/scripts/caliban_test.py .
caliban cloud --experiment_config config.json --xgroup mygroup --gpu_spec 2xV100  caliban_test.py
```
You can specify the kind of machines you want to use as explained [here](https://caliban.readthedocs.io/en/latest/cloud/gpu_specs.html). If you omit ```--gpu_spec```, it defaults to n1-standard-8 with a single P100 GPU.
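
If you launch the same experiment repeatedly, it can help to assemble the command in Python rather than retyping it. The sketch below only builds and prints the command line; it uses the flags shown above and nothing else, and the helper name `caliban_cloud_command` is hypothetical.

```python
import shlex


def caliban_cloud_command(script, experiment_config, xgroup, gpu_spec=None):
    """Assemble a `caliban cloud` command as a list of arguments.

    Hypothetical helper; it only uses the flags that appear in this tutorial.
    """
    cmd = ["caliban", "cloud",
           "--experiment_config", experiment_config,
           "--xgroup", xgroup]
    # Leave the flag out entirely to fall back on Caliban's default machine.
    if gpu_spec is not None:
        cmd += ["--gpu_spec", gpu_spec]
    cmd.append(script)
    return cmd


cmd = caliban_cloud_command("caliban_test.py", "config.json", "mygroup",
                            gpu_spec="2xV100")
print(shlex.join(cmd))
```

On your laptop, the resulting list could be passed to `subprocess.run(cmd, check=True)` to launch the jobs.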


- open the URL that it prints to monitor progress. Example:
```
Visit https://console.cloud.google.com/ai-platform/jobs/?projectId=probml to see the status of all jobs.
 ```
You should see something like this:
<img src="https://github.com/probml/pyprobml/blob/master/book1/intro/figures/GCP-jobs.png?raw=true">

- Monitor your jobs by clicking on 'view logs'.   You should see something like this:
<img src="https://github.com/probml/pyprobml/blob/master/book1/intro/figures/GCP-logs-GPU.png?raw=true">

- When the jobs are done, download the log files using [caliban_save_logs.py](https://github.com/probml/pyprobml/blob/master/scripts/caliban_save_logs.py). Example:
```
python ~/github/pyprobml/scripts/caliban_save_logs.py --xgroup mygroup 
```

- Upload the log files to Google Drive and parse them inside Colab using the Python code below.


# Parse the log files

In [137]:
!rm -rf pyprobml # Remove any old local directory to ensure fresh install
!git clone https://github.com/probml/pyprobml


Cloning into 'pyprobml'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 6409 (delta 8), reused 13 (delta 3), pack-reused 6385
Receiving objects: 100% (6409/6409), 249.32 MiB | 29.15 MiB/s, done.
Resolving deltas: 100% (3571/3571), done.
Checking out files: 100% (738/738), done.


In [138]:
import pyprobml.scripts.probml_tools as pml
pml.test()


welcome to python probabilistic ML library


In [None]:
import pyprobml.scripts.caliban_logs_parse as parse

In [146]:
import glob
logdir = 'https://github.com/probml/pyprobml/tree/master/data/Logs'
fnames = glob.glob(f'{logdir}/*.config')
print(fnames) # empty: glob only searches the local filesystem, not URLs

[]
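
The empty result above is expected, because `glob` matches paths on the local filesystem and does not fetch anything over HTTP. A small self-contained sketch of the difference, using a temporary directory and made-up file names (`job_1.config` and so on are placeholders, not real Caliban output):

```python
import glob
import os
import tempfile

# Create a throwaway directory with a few empty files to search.
with tempfile.TemporaryDirectory() as logdir:
    for name in ["job_1.config", "job_2.config", "job_1.log"]:
        open(os.path.join(logdir, name), "w").close()
    local_matches = sorted(
        os.path.basename(f) for f in glob.glob(f"{logdir}/*.config")
    )

# A URL is treated as a (non-existent) relative path, so nothing matches.
url_matches = glob.glob(
    "https://github.com/probml/pyprobml/tree/master/data/Logs/*.config"
)

print(local_matches)
print(url_matches)
```

This is why the log files need to be on a mounted drive (or otherwise copied to local disk) before they can be parsed.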


In [147]:
from google.colab import drive
drive.mount('/content/gdrive')

logdir = '/content/gdrive/MyDrive/Logs'
fnames = glob.glob(f'{logdir}/*.config')
print(fnames)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
['/content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172547_1.config', '/content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172548_2.config']
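
The file names appear to follow the pattern `caliban_<user>_<date>_<time>_<job number>`. Assuming that pattern holds (it is inferred from the two names above, not from Caliban's documentation), a sketch of pulling the fields out of a path could look like this. The helper `parse_job_id` is hypothetical.

```python
import os
import re

# Pattern inferred from the file names above; treat it as an assumption.
JOB_ID_PATTERN = re.compile(
    r"^caliban_(?P<user>.+)_(?P<date>\d{8})_(?P<time>\d{6})_(?P<job_num>\d+)$"
)


def parse_job_id(path):
    """Split a log or config path into user, date, time, and job number."""
    stem = os.path.splitext(os.path.basename(path))[0]
    match = JOB_ID_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    fields = match.groupdict()
    fields["job_num"] = int(fields["job_num"])
    return fields


path = "/content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172547_1.config"
print(parse_job_id(path))
```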


In [148]:
configs_df = parse.parse_configs(logdir)
display(configs_df)

for n in [1,2]:
  # Assumes get_args lives in the parse module alongside parse_configs.
  print(parse.get_args(configs_df, n))

reading  /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172547_1.config
reading  /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172548_2.config


| | createTime | endTime | etag | jobId | labels | startTime | state | trainingInput | trainingOutput | job_num |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-02-10T01:25:48Z | 2021-02-10T01:31:53Z | 6IkX86ZWV7A= | caliban_kpmurphy_20210209_172547_1 | {'docker_image': 'gcr_ioprobmlf07f9a7112celate... | 2021-02-10T01:29:22Z | SUCCEEDED | {'args': ['--ndims', '10'], 'masterConfig': {'... | {'consumedMLUnits': 1.82} | 1 |
| 0 | 2021-02-10T01:25:49Z | 2021-02-10T01:37:42Z | GqLt4PZ0ttw= | caliban_kpmurphy_20210209_172548_2 | {'docker_image': 'gcr_ioprobmlf07f9a7112celate... | 2021-02-10T01:35:10Z | SUCCEEDED | {'args': ['--ndims', '100'], 'masterConfig': {... | {'consumedMLUnits': 1.82} | 2 |


['--ndims', '10']
['--ndims', '100']
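
The `createTime`, `startTime`, and `endTime` columns also show how long each job waited in the queue versus how long it ran. Here is a minimal sketch using only the timestamps displayed above; the helper `job_timings` is hypothetical and not part of `caliban_logs_parse`.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def job_timings(job):
    """Return queue and run durations in seconds for one job record."""
    create = datetime.strptime(job["createTime"], TIME_FORMAT)
    start = datetime.strptime(job["startTime"], TIME_FORMAT)
    end = datetime.strptime(job["endTime"], TIME_FORMAT)
    return {
        "queue_seconds": (start - create).total_seconds(),
        "run_seconds": (end - start).total_seconds(),
    }


# Timestamps copied from the two rows of the config table.
jobs = [
    {"job_num": 1, "createTime": "2021-02-10T01:25:48Z",
     "startTime": "2021-02-10T01:29:22Z", "endTime": "2021-02-10T01:31:53Z"},
    {"job_num": 2, "createTime": "2021-02-10T01:25:49Z",
     "startTime": "2021-02-10T01:35:10Z", "endTime": "2021-02-10T01:37:42Z"},
]

for job in jobs:
    print(job["job_num"], job_timings(job))
```

For these two jobs the run times are nearly identical (about two and a half minutes each), while the second job spent several minutes longer waiting to start.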


In [140]:
logdir = '/content/gdrive/MyDrive/Logs'
#df1 = log_file_to_pandas('/content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210208_194505_1.log')
logs_df = parse.parse_logs(logdir)
display(logs_df.sample(n=5))


reading  /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172547_1.log
reading  /content/gdrive/MyDrive/Logs/caliban_kpmurphy_20210209_172548_2.log


| | insertId | labels | logName | receiveTimestamp | resource | severity | textPayload | timestamp | levelname | message | job_num |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 1nntk2eb1c | {'ml.googleapis.com/endpoint': ''} | projects/probml/logs/ml.googleapis.com%2Fcalib... | 2021-02-10T01:31:55.013085729Z | {'labels': {'job_id': 'caliban_kpmurphy_202102... | INFO | Waiting for job to be provisioned. | 2021-02-10T01:31:54.456090168Z | | | 2 |
| 26 | vd5m7bg2f7mpm1 | {'compute.googleapis.com/resource_id': '395485... | projects/probml/logs/master-replica-0 | 2021-02-10T01:29:20.975762879Z | {'labels': {'job_id': 'caliban_kpmurphy_202102... | ERROR | | 2021-02-10T01:29:02.428182549Z | ERROR | 2021-02-10 01:29:02.428102: I tensorflow/core/... | 1 |
| 21 | vd5m7bg2f7mpm6 | {'compute.googleapis.com/resource_id': '395485... | projects/probml/logs/master-replica-0 | 2021-02-10T01:29:20.975762879Z | {'labels': {'job_id': 'caliban_kpmurphy_202102... | ERROR | | 2021-02-10T01:29:02.428854113Z | ERROR | 2021-02-10 01:29:02.428771: W tensorflow/strea... | 1 |
| 7 | sl3titg23clsu2 | {'compute.googleapis.com/resource_id': '668635... | projects/probml/logs/master-replica-0 | 2021-02-10T01:35:26.468383099Z | {'labels': {'job_id': 'caliban_kpmurphy_202102... | INFO | | 2021-02-10T01:35:06.347586078Z | INFO | TF backend\n | 2 |
| 5 | sl3titg23clsu4 | {'compute.googleapis.com/resource_id': '668635... | projects/probml/logs/master-replica-0 | 2021-02-10T01:35:26.468383099Z | {'labels': {'job_id': 'caliban_kpmurphy_202102... | INFO | | 2021-02-10T01:35:06.347591706Z | INFO | *** flax version 0.3.0\n | 2 |


In [141]:
print(parse.get_log_messages(logs_df, 1))

[['Validating job requirements...']
 ['Job creation request has been successfully validated.']
 ['Job caliban_kpmurphy_20210209_172547_1 is queued.']
 ['INFO:root:python caliban_test.py --ndims 10\n']
 ["2021-02-10 01:28:58.986485: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n"]
 ['2021-02-10 01:28:58.986516: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n']
 ['2021-02-10 01:29:02.426093: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set\n']
 ['2021-02-10 01:29:02.426243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library li
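
Note that the `severity` column can be misleading for TensorFlow messages: in the sampled rows above, a line starting with `I tensorflow/...` is recorded with severity ERROR, probably because TensorFlow writes its C++ logs to stderr. Each such line carries its own level letter (I, W, E, or F) after the timestamp, so the original level can be recovered from the message text. A minimal sketch, with a hypothetical helper `tf_level` and messages copied from the output above:

```python
import re

# TensorFlow C++ log lines look like "<date> <time>: <LEVEL LETTER> <file>:<line>] ...".
TF_PREFIX = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: (?P<level>[IWEF]) "
)
LEVEL_NAMES = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def tf_level(message):
    """Return the TensorFlow log level of a message, or None if it is not a TF line."""
    match = TF_PREFIX.match(message)
    return LEVEL_NAMES[match.group("level")] if match else None


messages = [
    "INFO:root:python caliban_test.py --ndims 10\n",
    "2021-02-10 01:28:58.986485: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64\n",
    "2021-02-10 01:28:58.986516: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n",
]

levels = [tf_level(m) for m in messages]
print(levels)
```

Filtering on the recovered level rather than on `severity` makes it easier to separate genuine warnings from routine informational output.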

In [142]:
print(parse.get_log_messages(logs_df, 2))

[['Validating job requirements...']
 ['Job creation request has been successfully validated.']
 ['Job caliban_kpmurphy_20210209_172548_2 is queued.']
 ['This job is number 1 in the queue and requires 8.0 N1/E2 CPUs, 2 V100 accelerators, 100Gb standard disks and 0Gb ssd disks. The project is using 8.0 N1/E2 CPUs out of 450.0 N1/E2, 8.0 C2, 8.0 N2, 800.0 preemptible allowed, 2 V100 accelerators out of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 V100, 30 K80, 30 P100, 4 P4, 6 T4 allowed, 100Gb standard disks out of 180000 allowed and 0Gb ssd disks out of 75000 allowed across all regions.The project is using 8.0 N1/E2 CPUs out of 450.0 N1/E2, 8.0 C2, 8.0 N2, 800.0 preemptible allowed, 2 V100 accelerators out of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 30 K80, 30 P100, 6 T4 allowed, 100Gb standard disks out of 180000 allowed and 0Gb ssd disks out of 75000 allowed in the region us-central1.']
 ['INFO:root:python caliban_test.py --ndims 100\n']
 