ADD specialist worker nodes #3741
Conversation
Hi @berthob98, thanks for your PR submission. In the next few days I will try your PR in a cluster that has one worker with a GPU and another without. Before this PR can be merged you need to sign the ICLA (see the bot comment), and because it is a new feature you need to add a txt file for your submission.
(Resolved review comments on ...eregistry/src/main/java/org/opencastproject/serviceregistry/impl/ServiceRegistryJpaImpl.java, now outdated)
Hi @mliradelc, thank you for testing. I will sign the ICLA and make the txt file.
@berthob98 Today is the release cut, I need you to sign the ICLA so we can merge it. I will start testing right now.
@mliradelc Ok I just signed the ICLA |
Thanks for signing!
I just realized that your PR is based on r/11.x instead of the develop branch; can you change your base branch?
(force-pushed from e61e789 to cbd04c8)
@mliradelc Ok, I rebased the branch.
A short notice that Apereo has received @berthob98's ICLA. As usual, it might take a bit of time for them to process it, meaning that you might still see complaining tests if you file new pull requests. Ignore that and point to this comment if necessary. There is now no reason not to accept patches from you based on the ICLA. Also, thanks for filing it!
I tried the code, but there is one thing I don't quite understand. When I select a worker using the new option in the service registry file, it is always chosen as the first option. But when I have more than one encoding job, those jobs should queue on the GPU worker, and that doesn't happen. Is that the correct behavior? Looking at the worker loads, the workers are still selected based on the job load of each encoding profile, and in the end the non-GPU workers also start encoding.
Thanks for signing up and for sorting out the base branch.
@mliradelc I am sorry for the late answer; I missed your comment.
Sending every encoding job to a single GPU worker could lead to encoding jobs queuing up on that worker while other workers idle, producing unnecessary congestion. In my tests, modern GPUs were about 4 times faster than modern CPUs at encoding. It should also be considered that modern GPUs only have around 2-3 encoding units. I am still testing how to improve this mechanism and am interested in different ideas.
Hi @berthob98, I completely missed your answer. I read it when you published it and then completely forgot about it; my apologies. Here is how I think a specialized worker should work, in my opinion:
Because of the nature of this feature, where the encoding hardware can differ a lot between machines, I think it would be handy to have a benchmark tool inside Opencast: a set of standardized tests to measure, for now, the raw power of the machine. With that, you could easily configure the best load percentages for each machine, using the slowest worker as the baseline. Another thing that needs to be clarified is the machine's capability for parallel encoding. I have seen with CUDA that even with one encoding engine you can run 4 jobs with only a small decrease in performance. This is a huge topic, but if it is solved it could make encoding tasks up to 10x faster for large organizations.
I'd be quite happy to see a benchmarking tool, for a number of reasons, but I suspect that's outside the bounds of this PR - unless @berthob98 wants to do it :) What would improve the PR is allowing the user to adjust the limits where the special logic kicks in/out. |
(force-pushed from 0d33ded to cbd04c8)
@gregorydlogan I made the limit adjustable via the cfg file.
I also removed this function, since encoding doesn't scale up like that on all GPUs. The number of parallel encodings is limited on GPUs; some have a pretty low (artificial) limit.
@mliradelc What would be interesting measurements?
I would keep it very simple at the start: for example, use the frames per second of the encoding task. As for the videos to encode and the FFmpeg command, you can leave those configurable for the user: which encoding profile to use, with the video to encode placed in a set folder. About the GPU limit problem, that is very true, and you can get failed jobs because the GPU refuses to encode more than the driver allows. That can be limited by configuration.
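The fps-based measurement suggested above could be sketched roughly as follows. This is illustrative only and not part of the PR; the class and method names are made up, and it only shows the parsing step (extracting the fps figure that FFmpeg prints in its progress lines), not the full benchmark.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: extract the frames-per-second figure from an
 *  FFmpeg progress line such as "frame= 1234 fps=251 q=28.0 size= ...". */
public class FpsParseSketch {

    // FFmpeg reports integer or fractional fps values, e.g. "fps=251" or "fps=24.5".
    private static final Pattern FPS =
            Pattern.compile("fps=\\s*([0-9]+(?:\\.[0-9]+)?)");

    /** Returns the fps value from an FFmpeg progress line, or -1 if absent. */
    public static double parseFps(String progressLine) {
        Matcher m = FPS.matcher(progressLine);
        return m.find() ? Double.parseDouble(m.group(1)) : -1.0;
    }
}
```

A benchmark run would feed the last such progress line of a standardized encode through this parser and compare the resulting fps across workers, using the slowest worker as the baseline as suggested above.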
This pull request has conflicts ☹
(Resolved review comments on ...eregistry/src/main/java/org/opencastproject/serviceregistry/impl/ServiceRegistryJpaImpl.java and etc/org.opencastproject.serviceregistry.impl.ServiceRegistryJpaImpl.cfg, now outdated)
@berthob98 Do you want to do this benchmarking as part of this PR, or as a follow-on? Happy either way, but this PR is starting to get pretty old.
I will do the benchmarking as another PR. Is there something I can do to help merge this PR? |
For me it's OK; it works, and together with the other PR you made to check the CPU usage it is a good solution. Now you need to resolve the conflict that GitHub is flagging, and then it is good to go for me.
Very minor changes, and you'll probably want to rebase this anyway :) Looks really good otherwise!
(Resolved review comments on ...eregistry/src/main/java/org/opencastproject/serviceregistry/impl/ServiceRegistryJpaImpl.java, now outdated)
(force-pushed from af60f39 to 64ea69a)
@gregorydlogan Good points. I applied both changes and also made some small changes to the parsing of the config file.
(force-pushed from 64ea69a to 3cc4e73)
Thanks for the changes. A squash is not necessary (it used to be, but these days GitHub squashes on merge), but it is welcome because you can improve the commit description.
PR Description
This PR adds the ability to define a list of worker nodes that is preferred when dispatching composer jobs (job type: org.opencastproject.composer). This could, for example, be useful in a setup with one or more GPU-accelerated nodes.
This feature is disabled by default and only activated when a list of specialized worker nodes is set. The comma-separated list of worker nodes is defined in the configuration file (etc/org.opencastproject.serviceregistry.impl.ServiceRegistryJpaImpl.cfg).
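A minimal sketch of what that configuration could look like. The property key shown here is an assumption for illustration; check the shipped cfg file for the actual key name.

```properties
# etc/org.opencastproject.serviceregistry.impl.ServiceRegistryJpaImpl.cfg
# (hypothetical key name -- the real key may differ)
# Comma-separated list of worker hosts preferred for composer jobs:
specialized.workers=https://gpu-worker1.example.org:8443,https://gpu-worker2.example.org:8443
```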
To ensure backwards compatibility, not defining a list of specialized worker nodes is safe and leaves the behavior of the system unchanged.
When dispatching composer jobs, the defined worker nodes are chosen over every non-specialized worker node if the load factor of the specialized worker node is 0.4 or smaller, and smaller than 4 times the load factor of the non-specialized worker node. This mechanism is meant to prevent overloading the specialized nodes while still dispatching encoding jobs to them when there is capacity. Since this is a (working) draft, feedback on this idea is appreciated. It would also be possible to make the threshold for preferring specialized nodes configurable via the configuration file, if that is desirable.
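The preference rule described above can be sketched as follows. This is an illustrative sketch only, assuming the 0.4 absolute threshold and the 4x ratio mentioned in the description; the class, method, and constant names are made up and do not match the actual Opencast code.

```java
/** Hypothetical sketch of the dispatch preference rule described above. */
public class SpecializedDispatchSketch {

    // Assumed defaults from the PR description: 0.4 absolute load
    // threshold and a 4x load ratio relative to regular workers.
    static final double MAX_SPECIALIZED_LOAD = 0.4;
    static final double LOAD_RATIO = 4.0;

    /**
     * Decide whether a specialized (e.g. GPU) node should be preferred over
     * a regular node for a composer job, given their current load factors.
     */
    public static boolean preferSpecialized(double specializedLoad, double regularLoad) {
        return specializedLoad <= MAX_SPECIALIZED_LOAD
                && specializedLoad < LOAD_RATIO * regularLoad;
    }
}
```

For example, a specialized node at load 0.3 would be preferred over a regular node at load 0.5, but not over a regular node at load 0.05, since the specialized node's load then exceeds 4 times the regular node's.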
If desired, I could also add documentation for this feature to the admin configuration documentation.
Information for easier code review:
This PR closes #3740.