CFlow

The framework orchestrates a simple workflow to dispatch and manage a large number of VMs (standard or preemptible), on which any user-provided function or script can be executed as a module.

For example, to train an ML model and run predictions on many independent datasets (125 files, each containing 10M records), there should be an efficient way to parallelize the work, so that the user can focus on a single core algorithm file, plug it in, and kick off a run. The output is then combined. When preparing test data for Kaggle challenge submissions with 125 parallel VMs (no GPUs), the run time dropped from 14 hours to 15 minutes.
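As a minimal illustration of what such a pluggable core-algorithm file could look like (the function, file names, and bucket below are hypothetical, not part of this repository):

    # Hypothetical pluggable task: process one of the independent input files.
    from google.cloud import storage

    def run(bucket_name, input_name, output_name):
        """Download one dataset shard, apply the core algorithm, upload the result."""
        bucket = storage.Client().bucket(bucket_name)
        bucket.blob(input_name).download_to_filename("/tmp/input.csv")

        # Stand-in for the user's training/prediction step on this shard.
        with open("/tmp/input.csv") as src, open("/tmp/output.csv", "w") as dst:
            for line in src:
                dst.write(line)  # replace with the real per-record processing

        bucket.blob(output_name).upload_from_filename("/tmp/output.csv")

Each worker VM runs the same file against a different shard, so all 125 files are processed in parallel and the outputs combined afterwards.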

Design consideration

Keep things simple!

Currently, there is no easy way to execute complex functions or code with custom packages (ML workloads, for example) on GCE, GKE, or ML Engine. Cloud Functions can only run simple scripts (JavaScript).

If you are using TensorFlow and need a hybrid cloud architecture, look into Kubeflow.

The framework consists of two groups of scripts:

  • 4 scripts on the worker VM:
    • init.sh - boot script called when the VM starts; built into the VM image
    • startup-signal - boot script downloaded from storage; the main logic happens here (a minimal sketch of this flow follows the list)
    • shutdown-signal - called upon VM shutdown; records unfinished work
    • task.sh - entry point of the user-provided script
  • scheduler.py - runs on a different VM or from the terminal console to create, monitor, and delete worker VMs
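For orientation, here is a minimal sketch of the worker flow that startup-signal implements, written against the google-cloud-pubsub library; the project and subscription names below are assumptions, and the real script in this repository differs in detail:

    # Sketch of the worker loop (names are assumptions): pull one task from
    # Pub/Sub, run the user's task.sh on it, and ack only on success.
    import subprocess
    from google.cloud import pubsub_v1

    PROJECT_ID = "my-project"        # assumption: your GCP project
    SUBSCRIPTION = "cflow-tasks"     # assumption: the task queue subscription

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)

    # Pull-based delivery (see Limitations): fetch one message synchronously.
    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
    for msg in response.received_messages:
        task_arg = msg.message.data.decode("utf-8")  # e.g. which input file to process
        # Run the user-provided entry point; a failure leaves the message unacked,
        # so another worker can pick the task up again.
        result = subprocess.run(["bash", "task.sh", task_arg])
        if result.returncode == 0:
            subscriber.acknowledge(
                request={"subscription": sub_path, "ack_ids": [msg.ack_id]}
            )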

Simple Design Diagram

How to start

  1. Download the code from GitHub

  2. Place your code inside the "task" folder; task.sh is a sample script, and you can use any other code

  3. Review and update const.py

  4. Upload the whole package to Cloud Storage

  5. Build a VM image for the worker VMs

    1. Create a standard VM
    2. Install the packages below
    sudo pip install --upgrade google-cloud-storage
    sudo pip install --upgrade google-cloud-pubsub
    sudo pip install --upgrade google-api-python-client
    3. Download the init.sh script and place it under root /
    4. Modify the script
      • where to download and where to stage the Python script from Cloud Storage
      • the credential JSON file name
    5. Shut down the VM
    6. Go to the Compute Image console page and create a new image based on the disk used for this VM. The name of the image needs to be updated in const.py
  6. Create a service account with the following permissions, then download the JSON key and place it under the "credential" folder

    • Compute Instance Admin
    • Service Account User
    • Pub/Sub Publisher
    • Pub/Sub Subscriber
    • Storage Object Admin
  7. Use another standard VM as the controller to kick off the script, or use the terminal console directly. Make sure the user executing the script has the right permissions. A sketch of what the scheduler does follows the command below.

python3 scheduler.py
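For a rough idea of what the scheduler does when it creates workers (the constants below are assumptions and do not reflect the repository's actual const.py), a batch of preemptible instances can be created with the Compute Engine API like this:

    # Sketch of the scheduler's create step (names are assumptions); the real
    # scheduler.py also monitors the fleet and deletes finished workers.
    import googleapiclient.discovery

    PROJECT, ZONE = "my-project", "us-central1-a"  # assumptions
    IMAGE_NAME = "cflow-worker-image"              # must match the image name in const.py
    NUM_WORKERS = 125

    compute = googleapiclient.discovery.build("compute", "v1")

    for i in range(NUM_WORKERS):
        body = {
            "name": f"cflow-worker-{i}",
            "machineType": f"zones/{ZONE}/machineTypes/n1-standard-1",
            # Preemptible VMs are far cheaper but can be reclaimed at any time;
            # the shutdown-signal script records unfinished work when that happens.
            "scheduling": {"preemptible": True},
            "disks": [{
                "boot": True,
                "autoDelete": True,
                "initializeParams": {
                    "sourceImage": f"projects/{PROJECT}/global/images/{IMAGE_NAME}",
                },
            }],
            "networkInterfaces": [{
                "network": "global/networks/default",
                "accessConfigs": [{"type": "ONE_TO_ONE_NAT", "name": "External NAT"}],
            }],
        }
        compute.instances().insert(project=PROJECT, zone=ZONE, body=body).execute()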

Limitations

  • Only a pull-based subscriber is used, because push-based delivery requires a valid SSL endpoint
  • Logs on worker nodes are stored at /tmp/cflow/cflow.log; currently logs are not collected from the nodes, but this should be easy to achieve by modifying the sample task.sh (see the sketch below)
  • Pub/Sub content is not cleared before each run
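As a minimal sketch of that log-collection change (the bucket name is an assumption), task.sh could run a few lines of Python at the end to ship the worker log to Cloud Storage:

    # Hypothetical log upload: ship /tmp/cflow/cflow.log to a GCS bucket.
    import socket
    from google.cloud import storage

    BUCKET = "my-cflow-bucket"  # assumption: your staging bucket

    client = storage.Client()
    # Prefix the object with the hostname so logs from parallel workers don't collide.
    blob = client.bucket(BUCKET).blob(f"logs/{socket.gethostname()}/cflow.log")
    blob.upload_from_filename("/tmp/cflow/cflow.log")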
