Research item: Long running batch-jobs / workflow #657

Open
alexellis opened this issue Apr 18, 2018 · 40 comments

Comments

@alexellis
Member

alexellis commented Apr 18, 2018

Should we support long-running batch-jobs?

Are these in scope for OpenFaaS functions, which are typically several seconds in duration?

Are there primitives in Kubernetes such as jobs which we can leverage?

Kubernetes jobs example

Edit 29 Sept 2019

Since long-running jobs and workflows are related, I've added workflows to the title. If you're looking for workflow support, please feel free to comment with your use-case and whether it's for business purposes or for fun.

@alexellis
Member Author

I spoke with @iyovcheva about this yesterday. My initial thoughts are that jobs are orthogonal to functions, but they are a common ask from the community around longer-running batch processing and CI jobs. Is there an opportunity to add value here?

@cpitkin

cpitkin commented May 23, 2018

Nomad and Kubernetes both have these primitives, but I don't believe Swarm does. Judging by the issue, Swarm may get the primitive in the near future, but it's not clear how long that will take.

I can see this being a larger ask from the community going forward. I have been thinking a lot about this myself for personal and work-related tasks. It would be nice to have something that can be scheduled to run at a specific time but not be locked into time constraints. I also feel there's a slippery slope here towards turning the project into something like Rundeck. It may be worth waiting to see what Swarm does before building anything. Leveraging the primitives of something already built is probably going to be a smaller lift.

@alexellis
Member Author

alexellis commented May 23, 2018

> Judging by the issue, Swarm may get the primitive in the near future, but it's not clear how long that will take.

Having read and followed the thread, I think this is unlikely to happen in the near future.

Kubernetes can support this use-case natively; Swarm may need a separate controller to be written to make this possible. I made a start on that with JaaS, a CLI tool which has a few users.

The Rundeck project looks interesting btw.

@CuZn13

CuZn13 commented May 25, 2018

There are many scenarios like this in real business, and containers suit this kind of work well: spin a container up when it's needed, and destroy it once the work is complete. We are doing this based on openfaas.
When a function is invoked, we put the request in a queue and deploy the function, recording the number of call requests and completions. If there are no new requests within a certain period and all operations have completed, we delete the function and wait for the next invocation.

@lukasheinrich

Hello -- just commenting here as well to share our science use-case. "Black box" functions that are expensive to evaluate are a common setup in optimization problems. Often a machine-learning based strategy defines at which parameters a function is evaluated, in order to reduce the number of evaluations: see e.g. https://scikit-optimize.github.io/ or https://github.com/diana-hep/excursion. Function evaluations can easily take multiple hours. My ideal user-interface would be

> faas submit --parameters '{"a": 1, "b": "Hello World"}'
http://some.url/to/future
> faas ready http://some.url/to/future
false
> faas ready http://some.url/to/future
true
> faas retrieve http://some.url/to/future
{"value": 1.23}

@alexellis
Member Author

Thanks for adding your use-case.

What if instead of checking for the result in a stateful way, you specified a callback URL in an event-driven way?

Is "job checking" a hard requirement too?

I think you could do this basic flow with the asynchronous processing using a long enough max timeout. Each async function call returns a call ID which you get back on the callback.
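
A minimal sketch of that flow, assuming a local gateway and a hypothetical function named "long-task":

# Invoke asynchronously: the gateway replies "202 Accepted" straight away
# and returns the invocation's ID in the X-Call-Id response header.
curl -i http://127.0.0.1:8080/async-function/long-task \
  -d '{"a": 1}' \
  -H "X-Callback-Url: http://receiver.example.com/done"

# When the function finishes, the queue-worker POSTs the result to the
# callback URL, echoing the same X-Call-Id header so that the caller can
# correlate the result with the original request.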

Alex

@lukasheinrich

lukasheinrich commented Dec 8, 2018

Hi -- I think a callback would work equally well; definitely not a hard requirement. Are you thinking one callback for all function calls (where the callback specifies some invocation ID), or a unique callback URL per call?

@sheryever

Is there any update?

I am using C# functions. If a function receives the request on a single thread, then I can start a new thread with some unique key and return that key in the response. On a later request I can send the key and check the status of the thread.

I haven't tried to implement this yet because I am new to OpenFaaS, still evaluating it for our needs, and don't know how it will behave with threading; I'm also searching for a solution that already exists.

I was planning to replace our Windows services with OpenFaaS. These services run scheduled tasks which usually take 3 to 10 minutes, but now we also need to run those tasks on demand.

@alexellis
Member Author

Hi @sheryever you can run for 3-5 mins, no problems. Threads are also fair game. 👍

I don't think you need what I'm calling long-running jobs for that.
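
For runs of that length the main thing is to raise the timeout settings. A sketch with illustrative values, assuming the watchdog's timeout environment variables (the gateway's own timeouts, e.g. upstream_timeout, must be at least as large):

cat > stack.yml <<EOF
provider:
  name: openfaas
  gateway: http://127.0.0.1:8080
functions:
  scheduled-task:
    image: example/scheduled-task:latest
    environment:
      read_timeout: "10m"
      write_timeout: "10m"
      exec_timeout: "10m"
EOF

faas-cli deploy -f stack.yml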

Alex

@alexellis
Member Author

alexellis commented Jun 28, 2019

Some of the requirements/constraints I'm hearing from users:

  • run for up to several hours
  • status can be checked
  • can be cancelled
  • new version can be scheduled without cancelling the existing versions executing

Assumed, but need users to confirm:

  • uses or looks like a function
  • invoked or scheduled in a similar way to a function

A thin wrapper around a Kubernetes Job, or clear documentation on this use-case using Kubernetes jobs may be enough for a large % of the people asking for the above, but this is unclear. I think it would be worth looking into with 1-2 of the people who need this.
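
For reference, the raw primitive already covers several of the bullets above; a minimal, illustrative Job (name and image are placeholders):

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-task
spec:
  backoffLimit: 2                # retried on failure, up to 2 times
  activeDeadlineSeconds: 21600   # allowed to run for up to 6 hours
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example/batch-task:latest
EOF

kubectl get job batch-task      # status can be checked
kubectl delete job batch-task   # a running job can be cancelled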

@lihaiswu

I'd like to add my use-case. We have many teams using different automation frameworks in different languages.

  1. To schedule the tests, we'd like to treat each framework as a function. When there's a new available build, we can just invoke all the functions, passing related parameters. Each test may run for several hours.

  2. When a code change to the framework is merged, it should update the function without breaking in-flight requests: new requests route to the latest function, while old in-flight requests are allowed to complete before the function is upgraded.

  3. Async functions could support auto-scaling based on per-function limits on the number of requests.

Thanks @alexellis and thanks @openfaas.

@burtonr
Contributor

burtonr commented Jun 28, 2019

This is an intriguing problem... more than a couple of use cases are being described here, but as Alex pointed out, there are some high-level similarities/requirements.

Just thinking out loud: I can't think of how we would "know" that a function is still executing. Perhaps a "status" function in the OF core could receive messages from functions. I'm thinking of setting a variable on a function marking it as a long-running job (i.e. functionType: job), which would then start a routine pushing updates to the main status function, so that the status can be queried/reported on. Just something simple, with an in-memory map of [functionName]status updated on POST from the function's watchdog.

To summarize, the list of things in my head to accomplish this would be:

  • New function to report status and cancel "jobs" (as part of the OpenFaaS core functions)
  • Update watchdog to accept new parameter to mark a function as a long-running job (ie functionType: job)
  • That will then enable a new status reporting routine (background task) to push data to the main "status function" on an interval
    • "running" | "duration: 15m30s" | "completed" ...something like that
  • GET openfaas/system/status?function=big-data-process
    • Where big-data-process is the function name as defined in the yml file

Some questions I haven't thought of a way to answer yet:

  • What about scaling?
    • How would we report status on 2+ long running functions of the same name?
    • Would you just get status on the latest one?
    • Then the user would only be notified when all the jobs were complete...

@LucasRoesler
Member

What about a batch mode in the of-watchdog? This would run the function method to completion and then stop, which would make it easy to use the same image both as a function and as the image in an Argo workflow / k8s job / etc.

By default the watchdog could just send an "empty" request. But it could also accept a file path and parse each of the files as a request, one at a time, passed to the method. The response would either be dropped or saved (one per file) to a file path. This would be very Argo-friendly. Alternatively, it could accept an S3-compatible server, bucket, and path, and read/write from there.

I don't think openfaas needs to be the batch job runner, but if we make it really easy to use a function in a batch job system, that would go a long way.

Technically this is all possible without any changes from us: you can write a template that contains the watchdog and some init script, and just use a different command in your container when you run it as a batch job (see the sketch below). But documenting this and making it an approved workflow instead of a workaround would probably make people happy.
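
As a rough sketch of that workaround, an init script along these lines could drive the watchdog for exactly one request (paths and port are illustrative, not an approved interface):

#!/bin/sh
# Start the watchdog in the background, as the function's template would.
fwatchdog &
WD_PID=$!

# Wait for the watchdog's health endpoint before sending the request.
until curl -s -o /dev/null http://127.0.0.1:8080/_/health; do
  sleep 1
done

# Run the function method once: request body in, response body out.
curl -s --data-binary @/input/request.json \
  http://127.0.0.1:8080/ > /output/response.json

# Stop the watchdog so the container exits and the job completes.
kill $WD_PID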

@LucasRoesler
Member

@alexellis is the goal that "functions" should be compatible with batch jobs, e.g. take an openfaas function and run it in Argo/pure k8s jobs/kubeflow/etc.? Or is it that we want to create another job system, in which we take any docker image that exposes the "function interface" of a server on port 8080 and run it as a one-time job?

@alexellis
Copy link
Member Author

That is a good question. Maybe it will be both?

What do our users need?

@srisco

srisco commented Jul 1, 2019

Hi, I've been working on different use cases involving long-running batch jobs and these are my thoughts.

After trying different scaling configurations with asynchronous invocations, mainly based on CPU consumption and increasing the number of replicas of the queue-worker, my colleagues and I came to the conclusion that it was more convenient to use Kubernetes jobs. However, we wanted to take advantage of openfaas' ability to invoke functions through the gateway, so we decided to create oscar-worker as a substitute for nats-queue-worker. Its goal is to convert invocations that reach NATS through the /async-function/ route into k8s jobs. It is not a very elegant solution, since the NATS queue wouldn't be necessary for this purpose, but it does its job.

My idea of a better integration for long-running jobs in openfaas would be to add some tag to functions indicating that they are long-running functions/jobs. These functions would have a new route in the gateway, for example /job/. When a request is sent to this route, the gateway would convert the request into a Kubernetes job. The result could be displayed in the logs or sent via callback using a sidecar (or init-container + container) in the job.

I think this approach wouldn't be so hard to implement and would cover the needs of a huge number of users.

@alexellis
Member Author

Also potentially useful / interesting relating to workflows - https://github.com/s8sg/faas-flow

@zhl146

zhl146 commented Jul 30, 2019

Hi everyone! I've been looking for an ideal solution for long-running ETL jobs that take a few hours to run. They mostly involve database-to-database data transfer. Currently we're just running on some VMs with an in-house scheduler, but we'd like to do better than this. OpenFaas seemed like a possible solution, as we could run "functions" on demand. However, I am unsure of the suitability of something like OpenFaas for running things that take that long. There has been some great discussion here about what is currently missing from OpenFaas in this domain. Namely:

  1. No way to check on the status of a running async function
  2. No ability to retry on failure
  3. No way to cancel a running job

Something I was not clear about is whether you can run something like this on OpenFaas at all. I know that AWS Lambda has a hard timeout of 15 minutes per invocation. Does OF have something like this, or is it just that it may not be reliable to run a function for that long?

If someone could give me a high-level idea of what Kubernetes Jobs offers that OF does not currently, and what putting an OF layer on top of Jobs would gain us, that would help me immensely.

Thanks!

@valorl

valorl commented Aug 9, 2019

For us, even for things that only run let's say 2-3 minutes, I think we would quite appreciate having the 3 features @zhl146 mentioned: status, retries and cancelling.

You get all of that by simply using a Kubernetes Job (@zhl146), which for us is pretty viable, since most of our long-running jobs don't have the exact semantics of a function - e.g. they are usually pure side-effects and don't need to return anything.

However, one of the reasons OpenFaas is attractive to us is that we can deploy each job as "a piece of code that can be triggered by an HTTP request", which helps decouple the job itself from the means of running it. For example, you can have a CronJob calling the function every 15min, while at the same time being able to call it manually/reactively, without deploying the business logic of the job twice, or creating separate container images.

Therefore, I think it would be beneficial to be able to use the OpenFaas API to reuse the code it already has in the container to spawn the K8S job, instead of having a separate flow for it.

As an alternative, I also see value in making it possible to re-use the OF-built image and just run the function without the watchdog. That way we could use the K8S API separately to run the jobs while still reusing the same container image. But I think we would prefer the OF-integrated option.
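
To illustrate the CronJob half of that pattern, something like the following could invoke the deployed function on a schedule, while the same function stays callable manually (names are placeholders; older clusters may need apiVersion batch/v1beta1):

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: trigger-my-job
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl:latest
              # In-cluster DNS name of the OpenFaaS gateway service.
              args: ["-s", "http://gateway.openfaas:8080/function/my-job"]
EOF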

@ameier38

I am very interested in this and would love to help push this forward. For my use case, our team uses Airflow for our ETL processes and OpenFaas functions for the actual processing of files. We have found this to be a really nice combo as we can more easily test each of the different processes without having to bloat our Airflow code. Airflow then simply wires up the different functions and handles the retries and failures.

Right now we have an OpenFaas function called record-function which records when another function has started/completed/failed by storing the status in Redis. We use it by first calling function/record-function/{unique-id}/start, where unique-id is just a UUID we use to identify the function. We then call our actual function, async-function/my-long-running-function, and pass function/record-function/{unique-id}/stop as the callback. Finally we use an Airflow sensor to poll the Redis database to see when the function has completed, then continue with the rest of the workflow.

It would be great if we could instead kick off a long-running function as a k8s job and then poll its status by making a call to the OpenFaaS gateway, something like system/function/my-long-running-function, and get back the k8s job status in the response.
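
A sketch of the pattern described above, using the same names (the gateway address and payload details are guessed for illustration):

UID=$(uuidgen)

# 1. Mark the run as started; record-function stores the status in Redis.
curl -s http://gateway:8080/function/record-function/$UID/start

# 2. Kick off the long-running function asynchronously; on completion the
#    queue-worker calls the /stop path via the callback URL.
curl -s http://gateway:8080/async-function/my-long-running-function \
  -d @payload.json \
  -H "X-Callback-Url: http://gateway:8080/function/record-function/$UID/stop"

# 3. An Airflow sensor then polls Redis for $UID until the run is recorded
#    as stopped.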

alexellis changed the title from "Research item: Long running batch-jobs" to "Research item: Long running batch-jobs / workflow" on Sep 29, 2019
@sguruswa

I have started using OF for my project and ended up at this issue (#657). I need long-running functions invoked async, but is there any other way to:

  1. Check the status of a running async function
  2. Retry on failure
  3. Stop/cancel a running job
  4. Avoid the job being called twice when invoked async

Can anyone help me... please?

@koladilip

Is it possible to use OpenFaaS with Argo Workflows?
This would give users much more flexibility to build complex flow-processing capabilities.

@alexellis
Member Author

A few requests have come up on Slack recently:

  • "run one container per function"
  • "to check the progress of an invocation"
  • "to cancel an inflight invocation"

These all seem like job semantics that would fit in with the discussion on this issue.

An approach which may work with the existing primitives, without changing OpenFaaS, is:

  • For each request, create a $RANDOM_UID, then run faas-cli deploy --image function/image $RANDOM_UID with an async callback to a "done" function.
  • Set the function not to scale to zero, and give it one replica.
  • Have the "done" function delete the function: faas-cli delete $RANDOM_UID

Failed invocations still come back to the "done" function.

For status checking, the done function could write to some storage such as a database table, which would allow for in-progress detection, fetching the result, and cancellation.

None of this would require Kubernetes Jobs or limit the feature to working only on K8s; however, there will be some edge cases. If anyone here is still interested in "jobs for openfaas", I'd suggest prototyping the above and seeing how well it works for you.
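
A rough sketch of that flow with the current CLI (the image, labels, and "done" function are placeholders; the scale-to-zero opt-out via labels is an assumption):

RANDOM_UID=$(uuidgen | tr '[:upper:]' '[:lower:]')

# One function deployment per request, pinned to a single replica and
# opted out of scale-to-zero.
faas-cli deploy --name "job-$RANDOM_UID" \
  --image function/image:latest \
  --label com.openfaas.scale.min=1 \
  --label com.openfaas.scale.max=1 \
  --label com.openfaas.scale.zero=false

# Invoke it once, asynchronously, with the "done" function as the callback.
curl -s http://gateway:8080/async-function/job-$RANDOM_UID \
  -d @request.json \
  -H "X-Callback-Url: http://gateway:8080/function/done"

# The "done" function then runs the equivalent of:
#   faas-cli remove "job-$RANDOM_UID"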

There are some other areas that may need further probing, like identity and request signing, so that Mallory cannot simply invoke the "done" function with custom function names and use that to abuse the system.

@alexellis
Member Author

@koladilip sure, go ahead. You can invoke a function endpoint via HTTP from an Argo workload, or run it as a container and use a sidecar to invoke it (I created an example of this for @csakshaug last year, but cannot find it right now). How far did you get with what you were trying?

@alexellis
Member Author

The CD project Tekton has also been popularised since this thread was created. Whilst it's aimed at Continuous Deployment, it has a "Pipeline" mechanism that may be interesting to some users -> https://github.com/tektoncd/pipeline

@alexellis
Member Author

cc @aledbf @tmiklas

@alexellis
Member Author

I would welcome use-cases and examples of current job workflows, how you would see them working differently in openfaas, and what would make things easier for you.

@sergiotm87

Hi! For long-running and complex workflows I am learning about https://temporal.io/ from the creators of Uber's Cadence, and wrote a getting-started tutorial with a golang function: https://sergiotm87.github.io/blog/post/temporalio-workflows-with-openfaas-functions/

Mitchell Hashimoto recently said they are running Temporal to orchestrate the HashiCorp Cloud Platform.

@alexellis
Member Author

alexellis commented Oct 28, 2020

A couple of resources people might find interesting:

Quick PoC to run a Kubernetes job and print out the logs -> https://github.com/alexellis/lavoro
An openfaas template to make puppeteer on Kubernetes easy -> https://github.com/alexellis/openfaas-puppeteer-template

Whilst working on "lavoro", I kept coming back to the question of how jobs in openfaas would differ from our current functions vision, and whether they are the same thing.

Jobs such as processing a video will need a file injected as input and collected as output, unless the code itself manages that.

Jobs may not have an HTTP server, since they only process one request; they may just be a container with a "CMD" that runs to completion.

Jobs won't necessarily have an API in the same way our current functions do, so it's hard to interface with them. What is the lowest common denominator? It's no longer an HTTP request/response exchange.

@srisco

srisco commented Oct 28, 2020

Hi, we are currently researching file processing using Kubernetes jobs. As you pointed out, this kind of processing must manage input and output files, so there must be a component that takes charge of obtaining/saving files from a data storage provider.

In our case we have developed OSCAR2, which depends on a MinIO deployment in the same cluster and is in charge of invoking the functions/jobs. Our tool is able to create and configure MinIO's bucket notifications from the job spec. The component in charge of file input and output is FaaS Supervisor, a binary that is automatically mounted via a volume in the jobs. To support synchronous invocations we have integrated it with OpenFaaS (redirecting the requests to the gateway) and, in addition, we have added a log-recovery service to report the status of the jobs. Workflows can be achieved by linking input/output buckets of different functions.

If anyone is interested in using it, do not hesitate to contact us. We are currently updating the documentation, but we already have a helm chart ready to install on any Kubernetes cluster.

@alexellis
Member Author

Hi @srisco I am aware of your project and have taken a look at the approach of replacing the asynchronous NATS worker.

Since we added multiple-queue support, you no longer need to replace the regular asynchronous invocations; you can additively provide your OSCAR queue-worker on another "queue name".

See also: Multiple named queue support
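
A sketch of what that could look like, assuming the com.openfaas.queue annotation from the named-queues feature (names are placeholders):

# Deploy a function bound to a dedicated queue, consumed by the custom
# queue-worker rather than the default one.
faas-cli deploy --name oscar-task \
  --image example/oscar-task:latest \
  --annotation com.openfaas.queue=oscar

# Async invocations are then routed to the "oscar" queue, while other
# functions keep using the default queue-worker.
curl -s http://gateway:8080/async-function/oscar-task -d @input.json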

What limitations have you found with the use of Kubernetes jobs? And if you took your learnings and wanted to see them applied upstream in the original project, how would you go about that now? What would it look like to suit your needs?

I also saw that you've written your own OpenFaaS UI which looks very similar to ours in some respects. We're also considering rebuilding a new UI with React or Angular. Have you thought about what it would take to release a version of your UI that could be used with the upstream project?

Feel free to chat with us on OpenFaaS Slack

We welcome contributions from users of the project, and also have an open call for sponsors. If you can think of a way to support the upstream project in some way, that would be appreciated.

Glad you have found value in OpenFaaS for your solution. I hope that we can collaborate in some way going forward.

Alex

@alexellis
Member Author

@sergiotm87 thanks for pointing us at Temporal. Is that product open-source, or paid-for only?

I noticed on your blog that the code examples are collapsed - I visited twice and skipped over them both times. Is there a way you can stop them from collapsing? I think you'll be missing out on people having an "aha" moment because they can't see the code.

@kevin-lindsay-1

If anyone in the community cares, I've been pushing for a number of relatively small items that work together in such a way that, on my end, I'm very close to being able to support arbitrarily long-running jobs handled in a gracefully autoscaled fashion.

@seb-835

seb-835 commented Dec 10, 2021

Hi all,
I'll be happy to share my use-case with you.

We have a library with a lot of scientific functions. Each function can be called through a CLI, and each is CPU/memory intensive and long-running.

We want to give our data scientists access to these functions in a k8s cluster, so the aim was to convert the library into an openfaas image, to be able to call each function (with parameters) through HTTP and get the result back through the async callback... this is pretty simple and "openfaas" easy...

But for security reasons we need to run only one function per container, like a batch or job. Unfortunately I have not found a way to do that with openfaas, which is why I am pretty interested in this thread.

@srisco thanks, I will take a look at OSCAR.
And @kevin-lindsay-1, I am pretty interested in your work - please give us some information.

@alexellis
Member Author

I wrote up the changes we made for Surge (where @kevin-lindsay-1 works) here:

Improving long-running jobs for OpenFaaS users

Commercial users can get in touch with us immediately via https://openfaas.com/support instead of waiting for this to come up on the roadmap or in a triage call.

@seb-835

seb-835 commented Dec 10, 2021

@alexellis thanks, this is very interesting - I was playing with a pre-stop hook to avoid my downscaled pod getting trashed while still computing.

Now I will have to wait for your change; we're not a commercial user there :(

Do you have anything in mind for running 1 container per request?

@alexellis
Member Author

There are ways to do this already, but why do you want that?

@seb-835

seb-835 commented Dec 10, 2021

We do scientific processing for different projects, and in one container we are not authorized to process for 2 different projects, i.e. we must not have data from different projects in the same container.

@alexellis
Member Author

Who came up with the boundary of a container? Why not a VM? Why not a process?

The scientific processing isn't commercial? Someone must fund it in some way. You're welcome to speak to that person and suggest that they book a call with us. You'll find a link on that page I shared.

Happy to walk you through how this would work that way. Of course you have the docs and all the readme files on GitHub that can be read freely too.

@seb-835

seb-835 commented Dec 10, 2021

@alexellis let's take this to email so as not to pollute this thread.
