Research item: Long running batch-jobs / workflow #657
I spoke with @iyovcheva about this yesterday. My initial thoughts are that jobs are orthogonal to functions, but they are a common ask from the community around longer-running batch processing and CI jobs. Is there an opportunity to add value here?
Nomad and Kubernetes both have these primitives, but I don't believe Swarm does. From the issue it looks like Swarm may get the primitive in the near future, but I'm not sure how long that will be. I can see this being a larger ask from the community going forward. I have been thinking a lot about this myself for personal and work-related tasks. It would be nice to have something that can be scheduled to run at a specific time without being locked into time constraints. I also feel that it is a slippery slope to turn the project into something like Rundeck. It may be worth waiting to see what Swarm does before building anything. Leveraging the primitives of something already built is probably going to be a smaller lift.
Having read and followed the thread, I think this is unlikely to happen in the near future. Kubernetes can support this use-case natively; Swarm may need a separate controller written to make this possible. I made a start with a CLI tool called JaaS, which has a few users. The Rundeck project looks interesting btw.
There are many such scenarios in real-world business, and containers can do this kind of work well: spin up a container when it's needed, and destroy it once the work completes. We are doing this based on OpenFaaS.
Hello -- just commenting here as well to share our science use-case. "Black box" functions that are expensive to evaluate are a common setup in optimization problems. Often a machine-learning based strategy decides at which parameters the function is evaluated, in order to reduce the number of evaluations: see e.g. https://scikit-optimize.github.io/ or https://github.com/diana-hep/excursion. Function evaluations can easily take multiple hours. My ideal user-interface would be
Thanks for adding your use-case. What if, instead of checking for the result in a stateful way, you specified a callback URL in an event-driven way? Is "job checking" a hard requirement too? I think you could do this basic flow with the asynchronous processing, using a long enough max timeout. Each async function call returns a call ID, which you get back on the callback. Alex
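For readers unfamiliar with the async flow being described: it maps onto the gateway's `/async-function` route and the `X-Callback-Url` header. A hedged sketch follows; the gateway address, function name `long-task`, and callback receiver are placeholders, and since the real call needs a live gateway, the command is only printed here rather than executed.

```shell
#!/bin/sh
# Placeholders -- adjust for your own deployment.
GATEWAY="http://127.0.0.1:8080"
CALLBACK_URL="http://my-service:5000/done"

# An async invocation is a POST to /async-function/<name>. The gateway
# queues it and replies 202 Accepted with an X-Call-Id header. When the
# function finishes, the result is POSTed to X-Callback-Url carrying the
# same X-Call-Id, so submission and result can be correlated.
build_async_call() {
  printf 'curl -i %s/async-function/long-task \\\n' "$GATEWAY"
  printf '  -H "X-Callback-Url: %s" \\\n' "$CALLBACK_URL"
  printf '  --data-binary @input.json\n'
}

CMD=$(build_async_call)
printf '%s\n' "$CMD"
```

Correlating on the call ID is what removes the need for stateful polling: the caller records the ID at submit time and matches it when the callback arrives.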
Hi
Is there any update? I am using C# functions. If a function receives the request on a single thread, then I can start a new thread with some unique key and return that key in the response. Then on a later request I can send the key and check the status of the thread. I haven't tried to implement this yet because I am new, am still testing OpenFaaS for our needs, don't know how it will behave with threading, and am searching for a solution that already exists. I was planning to replace our Windows services with OpenFaaS. These services run scheduled tasks which usually take 3 to 10 minutes, but now we also need to run those tasks on demand.
Hi @sheryever you can run for 3-5 mins, no problem. Threads are also fair game. 👍 I don't think you need what I'm calling long-running jobs for that. Alex
Some of the requirements/constraints I'm hearing from users:
Assumed, but need users to confirm:
A thin wrapper around a Kubernetes Job, or clear documentation on this use-case using Kubernetes Jobs, may be enough for a large percentage of the people asking for the above, but this is unclear. I think it would be worth looking into with 1-2 of the people who need this.
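For reference, the kind of Kubernetes Job a thin wrapper might emit can be very small. A hedged sketch follows; the image, command, and limit values are illustrative only, and `kubectl` usage is shown in comments since applying it requires a cluster.

```shell
#!/bin/sh
# Write a minimal batch/v1 Job spec. backoffLimit gives retries and
# activeDeadlineSeconds bounds total runtime -- two of the semantics
# people are asking for in this thread.
cat > job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-task
spec:
  backoffLimit: 3              # retry up to 3 times on failure
  activeDeadlineSeconds: 7200  # kill the job after 2 hours
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: task
        image: example/batch-task:latest   # placeholder image
        command: ["./run-task"]            # placeholder command
EOF

# To submit:  kubectl apply -f job.yaml
# Status:     kubectl get job batch-task
# Cancel:     kubectl delete job batch-task
echo "wrote job.yaml"
```

Status, retries, and cancellation all come for free from the Job controller here; what a wrapper would add is the HTTP-facing workflow OpenFaaS users already know.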
I'd like to add my use case. We have many teams using different automation frameworks in different languages.
Thanks @alexellis and thanks @openfaas.
This is an intriguing problem... more than a couple of use cases are being described here, but as Alex pointed out, there are some high-level similarities/requirements. Just thinking out loud about this, I can't think of how we would "know" that a function is still executing. Perhaps a "status" function could be available in the OF core to receive messages from functions. I'm thinking of setting a variable on a function indicating it as a long-running job. To summarize, the list of things in my head to accomplish this would be:
Some questions I haven't thought of a way to answer yet:
What about a batch-oriented watchdog? By default the watchdog could just send an "empty" request. But it could also accept a file path and parse each of the files as a request, one at a time, to the method. The response would either be dropped or saved (one per file) to a file path. This would be very Argo-friendly. Alternatively, it could accept an S3-compatible server, bucket, and path and read/write from there. I don't think OpenFaaS needs to be the batch-job runner, but if we make it really easy to use a function in a batch-job system, that would go a long way. Technically this would all be possible without any changes from us: you can write a template that contains the watchdog and some other init script, and just use a different command in your container when you run it in a batch job. But documenting this and making it an approved workflow instead of a workaround would probably make people happy.
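The file-driven mode proposed above can be sketched end-to-end without OpenFaaS at all. In this sketch `handler` is a stand-in command; the real watchdog would forward each file as an HTTP request to the function process instead of piping it through a shell function.

```shell
#!/bin/sh
# One request per input file; one response file per request -- the
# contract the proposed mode would give a batch system like Argo.
mkdir -p input output
printf 'hello' > input/a.txt
printf 'world' > input/b.txt

# Placeholder for the function: uppercase its stdin.
handler() { tr 'a-z' 'A-Z'; }

for f in input/*; do
  handler < "$f" > "output/$(basename "$f")"
done

cat output/a.txt   # prints: HELLO
```

Swapping `input/` and `output/` for S3 bucket paths is the only conceptual change needed for the object-storage variant.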
@alexellis is the goal that "functions" should be compatible with batch jobs, e.g. take an OpenFaaS function and run it in Argo/pure K8s Jobs/Kubeflow/etc., or is it that we want to create another job system which will take any Docker image that exposes the "function interface" of a server on 8080 and then run it as a one-time job?
That is a good question. Maybe it will be both? What do our users need?
Hi, I've been working on different use cases involving long-running batch jobs and these are my thoughts. After trying different scaling configurations with asynchronous invocations, mainly based on CPU consumption and increasing the number of replicas of the queue-worker, my colleagues and I came to the conclusion that it was more convenient to use Kubernetes Jobs. However, we wanted to take advantage of OpenFaaS' ability to invoke functions through the gateway, so we decided to create oscar-worker as a substitute for nats-queue-worker. Its goal is to convert invocations that reach NATS into Kubernetes Jobs. My idea of a better integration for long-running jobs in OpenFaaS would be adding some tag to the functions in order to indicate that they are long-running functions/jobs. These functions would have a new route in the gateway. I think this approach wouldn't be so hard to implement and would cover the needs of a huge number of users.
Also potentially useful / interesting relating to workflows - https://github.com/s8sg/faas-flow
Hi everyone! I've been looking for an ideal solution to run long-running ETL jobs that take a few hours. They are mostly database-to-database data transfers. Currently we're just running them on some VMs with an in-house scheduler, but we'd like to do better than this. OpenFaaS seemed like a possible solution, as we could run "functions" on demand. However, I am unsure of the suitability of something like OpenFaaS for running jobs that take that long. There has been some great discussion here about what is currently missing from OpenFaaS in this domain. Namely:
Something I was not clear about was whether you could run something like this on OpenFaaS at all. I know that AWS Lambda has a hard timeout of 15 minutes per invocation. Does OF have something like this, or is it just that it may not be reliable to run a function for that long? If someone could give me a high-level idea of what Kubernetes Jobs offer that OF does not currently, and what putting an OF layer on top of Jobs would gain us, that would help me immensely. Thanks!
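On the timeout question: unlike Lambda's fixed 15-minute cap, OpenFaaS timeouts are configuration, set on the function's watchdog and matched on the gateway and queue-worker. A sketch with illustrative values follows; the 2h figure is an assumption for the ETL scenario above, not a recommendation.

```shell
#!/bin/sh
# Function-side settings: the watchdog reads exec_timeout, read_timeout
# and write_timeout as Go-style duration strings from the environment.
cat > timeouts.env <<'EOF'
exec_timeout=2h
read_timeout=2h
write_timeout=2h
EOF

# Note: the gateway and queue-worker have their own read/write/upstream
# timeouts, which must be raised to at least the same values, otherwise
# the shortest timeout in the chain wins.
grep -c timeout timeouts.env   # prints: 3
```

All three limits (plus the gateway's) need to agree, which is one reason several commenters here lean towards Kubernetes Jobs for multi-hour work.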
For us, even for things that only run let's say 2-3 minutes, I think we would quite appreciate having the three features @zhl146 mentioned: status, retries and cancelling. You get all of that simply by using a Kubernetes Job. However, one of the reasons OpenFaaS is attractive to us is that we can deploy each job as "a piece of code that can be triggered by an HTTP request", which helps decouple the job itself from the means of running it. Therefore, I think it would be beneficial to be able to use the OpenFaaS API to reuse the code it already has in the container to spawn the K8s Job, instead of having a separate flow for it. As an alternative, I also see value in making it possible to reuse the OF-built image and just run the function without the watchdog. That way we could use the K8s API separately to run the jobs while still reusing the same container image. But I think we would prefer the OF-integrated option.
I am very interested in this and would love to help push this forward. For my use case, our team uses Airflow for our ETL processes and OpenFaaS functions for the actual processing of files. We have found this to be a really nice combo, as we can more easily test each of the different processes without having to bloat our Airflow code. Airflow then simply wires up the different functions and handles the retries and failures. Right now we have an OpenFaaS function for this. It would be great if we could instead kick off a long-running function as a K8s Job and then poll the status of the function by making a call to the OpenFaaS gateway.
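The polling pattern described above can be sketched as a loop an Airflow sensor (or any scheduler) might run. To be clear: the status endpoint named in the comment below is hypothetical, OpenFaaS has no such route today, so `poll_status` is a stub standing in for the HTTP call so that only the control flow is shown.

```shell
#!/bin/sh
# Stub for a gateway status check. A real version might look like:
#   curl -s "$GATEWAY/system/job-status/$1"     # hypothetical endpoint
poll_status() {
  echo "Succeeded"   # stubbed result so the loop terminates
}

JOB="transform-files"   # placeholder job/function name
STATUS=""
for attempt in 1 2 3; do
  STATUS=$(poll_status "$JOB")
  [ "$STATUS" = "Succeeded" ] && break
  sleep 5   # back off between polls
done
echo "job $JOB finished with status: $STATUS"
```

A scheduler task would typically fail the run if the loop exits without reaching a terminal status, leaving retries to the scheduler itself.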
I have started using OF for my project and ended up at this issue (#657). I have a requirement for long-running functions using async, but is there any other way to do this?
Is it possible to use OpenFaaS with Argo Workflows?
A few requests have come up on Slack recently:
These all seem like job semantics that would fit in with the discussion on this issue. An approach which may work with the existing primitives, without changing OpenFaaS, is: for each request, create a $RANDOM_UID and run the function under that unique name. Failed invocations still come back to the "done" function. For status checking, the done function could write to some storage like a database table, which would allow for in-progress detection, fetching the result, and cancellation. None of this would require Kubernetes Jobs or limit us to only working on K8s, although there will be some edge cases. If anyone here is still interested in "jobs for openfaas", I'd suggest prototyping the above and seeing how well it works for you. There are some other areas that may need further probing, like identity and request signing, so that Mallory cannot simply invoke the "done" function with custom function names and use that to abuse the system.
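The per-request flow above can be sketched as follows. The `faas-cli`/`curl` commands are printed rather than executed (they need a live gateway), and the image name and "done" function are placeholders.

```shell
#!/bin/sh
# Generate a unique suffix for this request (16 hex chars).
RANDOM_UID=$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')
FN="task-${RANDOM_UID}"

# 1. Deploy a one-off copy of the function under the unique name:
echo "faas-cli deploy --name ${FN} --image example/task:latest"

# 2. Invoke it asynchronously; the callback targets a shared "done"
#    function, which records the result keyed by the unique name and
#    can then remove the one-off deployment:
echo "curl -d @input.json http://gateway:8080/async-function/${FN} \\"
echo "  -H 'X-Callback-Url: http://gateway:8080/function/done'"
```

The database row the "done" function writes (keyed by the unique name) is what enables in-progress checks, result fetching, and cancellation without any new primitive.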
@koladilip sure, go ahead. You can invoke a function endpoint via HTTP from an Argo workload, or run it as a container and use a sidecar to invoke it (I created an example for @csakshaug for this last year, but cannot find it right now). How far did you get with what you were trying?
The CD project Tekton has also been popularised since this thread was created; whilst it's aimed at continuous deployment, it has a "Pipeline" mechanism that may be interesting to some users -> https://github.com/tektoncd/pipeline
I would welcome use-cases and examples of current job workflows, how you would see them working differently in OpenFaaS, and what would make things easier for you.
Hi! For long-running and complex workflows I am learning about https://temporal.io/ from the creators of Uber's Cadence, and wrote a starting tutorial with a Golang function: https://sergiotm87.github.io/blog/post/temporalio-workflows-with-openfaas-functions/ Mitchell Hashimoto recently said they are running Temporal to orchestrate HashiCorp Cloud Platform.
A couple of resources people might find interesting: a quick PoC to run a Kubernetes Job and print out the logs -> https://github.com/alexellis/lavoro Whilst working on "lavoro", I had a question about how jobs in OpenFaaS would differ from our current functions vision, and whether they are the same thing. Jobs such as processing a video will need a file injected as input and collected as output, unless the code itself manages that. Jobs may not have an HTTP server since they only process one request; they may just be a container with a "CMD" that runs to completion. Jobs won't necessarily have an API in the same way as our current functions do, so it's hard to interface with them. What is the lowest common denominator? It's no longer an HTTP request/response exchange.
Hi, we are currently researching file processing using Kubernetes Jobs. As you pointed out, this kind of processing must manage input and output files, so there must be a component that takes charge of obtaining/saving files from a data-storage provider. In our case we have developed OSCAR2, which depends on a MinIO deployment in the same cluster and is in charge of invoking the functions/jobs. Our tool is able to create and configure the bucket notifications of MinIO from the job spec. The component in charge of the input and output of files is FaaS Supervisor, a binary that is automatically mounted via a volume in the jobs. To support synchronous invocations we have integrated it with OpenFaaS (redirecting the requests to the gateway) and, in addition, we have added a log-recovery service to know the status of the jobs. Workflows can be achieved by linking input/output buckets of different functions. If anyone is interested in using it, do not hesitate to contact us. We are currently updating the documentation, but we already have a Helm chart ready to install on any Kubernetes cluster.
Hi @srisco, I am aware of your project and have taken a look at the approach of replacing the asynchronous NATS worker. Since we added multiple-queue support, you no longer need to take away the regular asynchronous invocations, but can additively provide your OSCAR queue-worker on another "queue name". See also: Multiple named queue support. What limitations have you found with the use of Kubernetes Jobs? And if you took your learnings and wanted to see them applied upstream in the original project, how would you go about that now? What would it look like to suit your needs? I also saw that you've written your own OpenFaaS UI, which looks very similar to ours in some respects. We're also considering building a new UI with React or Angular. Have you thought about what it would take to release a version of your UI that could be used with the upstream project? Feel free to chat with us on OpenFaaS Slack. We welcome contributions from users of the project, and also have an open call for sponsors. If you can think of a way to support the upstream project, that would be appreciated. Glad you have found value in OpenFaaS for your solution; I hope that we can collaborate in some way going forward? Alex
@sergiotm87 thanks for pointing us at Temporal. Is that product open-source, or paid-for only? I noticed on your blog that the code examples are collapsed; I visited it twice and ignored them both times. Is there a way you can stop them from collapsing? I think you'll be missing out on people having an "aha" moment because they can't see the code.
If anyone in the community cares, I've been pushing for a number of relatively small items that work together such that, on my end, I'm very close to being able to support arbitrarily long-running jobs handled in a gracefully autoscaled fashion.
Hi all, we have a library with a lot of scientific functions, and we want to give our data scientists access to these functions in a K8s cluster. But for security reasons, we need to run only one function per container, like a batch job. Unfortunately I have not found a way to do this with OpenFaaS, which is why I am pretty interested in this thread. @srisco thanks, I will take a look at OSCAR.
I wrote up the changes we made for Surge (where @kevin-lindsay-1 works) here: Improving long-running jobs for OpenFaaS users. Commercial users can get in touch with us immediately via https://openfaas.com/support instead of waiting for this to come up on the roadmap or in a triage call.
@alexellis thanks, this is very interesting. I was playing with a pre-stop hook to avoid my downscaled pod being killed while still computing. Now I will have to wait for your change; not a commercial user here :( Do you have anything in mind for running one container per request?
There are ways to do this already, but why do you want that?
We do scientific processing for different projects, and we are not authorized to process two different projects in one container, i.e. we must not have data from different projects in the same container.
Who came up with the boundary of a container? Why not a VM? Why not a process? The scientific processing isn't commercial? Someone must fund it in some way. You're welcome to speak to that person and suggest that they book a call with us; you'll find a link on the page I shared. Happy to walk you through how this would work that way. Of course you have the docs and all the README files on GitHub that can be read freely too.
@alexellis let's take this to email so we don't pollute this thread.
Should we support long-running batch-jobs?
Are these in the scope of OpenFaaS functions which are typically several seconds in duration?
Are there primitives in Kubernetes such as jobs which we can leverage?
Kubernetes jobs example
Edit 29 Sept 2019
Since long-running jobs and workflows are related, I've added workflows to the title, so if you're looking for workflows, please feel free to comment with your use-case and whether it's for business purposes or for fun.