ADD specialist worker nodes #3741

Merged: 5 commits into opencast:develop on Oct 6, 2022

Conversation

berthob98
Contributor

PR Description

This PR adds the ability to define a list of worker nodes that is preferred when dispatching encoding jobs (job type: org.opencastproject.composer). This could, for example, be useful in a setup with one or more GPU-accelerated nodes.
This feature is disabled by default and is only activated when a list of specialized worker nodes is set. The comma-separated list of worker nodes is defined in the configuration file (etc/org.opencastproject.serviceregistry.impl.ServiceRegistryJpaImpl.cfg).
To ensure backwards compatibility, not defining a list of specialized worker nodes is safe and leaves the behavior of the system unchanged.
When dispatching composer jobs, the defined worker nodes are chosen over every non-specialized worker node if the LoadFactor of the specialized worker node is 0.4 or smaller and if the LoadFactor of the specialized worker node is smaller than 4 times the LoadFactor of the non-specialized worker node. This mechanism is meant to prevent overloading the specialized nodes while still dispatching encoding jobs to the defined nodes whenever there is capacity. Since this is a (working) draft, feedback on this idea is appreciated. It would also be possible to make the threshold for preferring specialized nodes configurable via the configuration file, if that is desirable.
If wanted, I could also add documentation of this feature to the admin configuration guide.
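For illustration, a hedged sketch of what such a configuration entry could look like (the property key is the one named in the review notes below; the host URLs are placeholders):

```properties
# etc/org.opencastproject.serviceregistry.impl.ServiceRegistryJpaImpl.cfg
# Comma-separated list of worker nodes to prefer for composer (encoding) jobs.
# Leaving this unset keeps the current dispatching behavior.
org.opencastproject.encoding.workers=https://gpu-worker1.example.org,https://gpu-worker2.example.org
```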

Information for easier code review:

  • The property org.opencastproject.encoding.workers is retrieved from the etc/org.opencastproject.serviceregistry.impl.ServiceRegistryJpaImpl.cfg configuration file.
  • When dispatching jobs, a list of candidate nodes is sorted by a comparator class; the dispatching attempts follow the order of that list.
  • I made a copy of the existing comparator that is only used when dealing with a job of type composer. This comparator prefers the worker nodes retrieved from the configuration file (see the sketch below).
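For readers skimming the diff, here is a minimal, simplified sketch of that idea (not the actual patch: the HostCandidate type and the class name are made up for illustration, and the load thresholds described above are left out):

```java
import java.util.Comparator;
import java.util.Set;

// Hypothetical stand-in for the registry's per-host data (not the real Opencast type).
record HostCandidate(String baseUrl, float loadFactor) {}

// Sketch: sort dispatch candidates for composer jobs so that hosts listed in
// org.opencastproject.encoding.workers come first; otherwise order by load factor.
class SpecializedWorkerComparator implements Comparator<HostCandidate> {
  private final Set<String> preferredHosts;

  SpecializedWorkerComparator(Set<String> preferredHosts) {
    this.preferredHosts = preferredHosts;
  }

  @Override
  public int compare(HostCandidate a, HostCandidate b) {
    boolean aPreferred = preferredHosts.contains(a.baseUrl());
    boolean bPreferred = preferredHosts.contains(b.baseUrl());
    if (aPreferred != bPreferred) {
      return aPreferred ? -1 : 1;                         // specialized hosts first
    }
    return Float.compare(a.loadFactor(), b.loadFactor()); // then least-loaded first
  }
}
```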

This PR closes #3740.

@github-actions
Contributor

github-actions bot commented May 5, 2022

Hi @berthob98
Thank you for contributing to Opencast.
We noticed that you have not yet filed an Individual Contributor License Agreement. Doing that (once) helps us to ensure that Opencast stays free for all. If you make your contribution on behalf of an institution, you might also want to file a Corporate Contributor License Agreement (giving you as individual contributor a bit more security as well). It can take a while for this bot to find out about new filings, so if you just filed one or both of the above do not worry about this message!
Please let us know if you have any questions regarding the CLA.

@gregorydlogan gregorydlogan self-assigned this May 5, 2022
@mliradelc
Contributor

Hi @berthob98, thanks for your PR submission. In the next few days I will try your PR in a cluster that has one worker with a GPU and one without.

Before merging this PR, you need to sign the ICLA (see the bot comment above).

Also, because this is a new feature, you need to add a txt file describing your submission.

@mliradelc mliradelc self-requested a review May 9, 2022 13:15
@berthob98
Contributor Author

berthob98 commented May 9, 2022

> Hi @berthob98, thanks for your PR submission. In the next few days I will try your PR in a cluster that has one worker with a GPU and one without.
>
> Before merging this PR, you need to sign the ICLA (see the bot comment above).
>
> Also, because this is a new feature, you need to add a txt file describing your submission.

Hi @mliradelc, thank you for testing. I will sign the ICLA and create the txt file.

@mliradelc
Contributor

@berthob98 Today is the release cut; I need you to sign the ICLA so we can merge it. I will start testing right now.

@berthob98
Contributor Author

@mliradelc Ok I just signed the ICLA

@mliradelc mliradelc (Contributor) left a comment

Thanks for signing the ICLA!

I just realized that your PR is based on r/11.x instead of the develop branch; can you change your base branch?

@berthob98
Contributor Author

berthob98 commented May 13, 2022

> Thanks for signing the ICLA!
>
> I just realized that your PR is based on r/11.x instead of the develop branch; can you change your base branch?

@mliradelc OK, I rebased the branch.

@lkiesow
Member

lkiesow commented May 16, 2022

A short notice that Apereo has received @berthob98's ICLA. As usual, it might take a bit of time for them to process it, meaning that you might still see complaining tests if you file new pull requests. Ignore that and point to this comment if necessary. There is now no reason not to accept patches from you based on the ICLA. Also, thanks for filing it!

@mliradelc
Contributor

mliradelc commented May 16, 2022

I tried the code, but there is one thing I don't quite understand.

When I select the worker using the new option in the service registry file, it is always chosen as the first option. But when I have more than one encoding job, I would expect those jobs to be queued on the GPU worker, and that doesn't happen. Is that the correct behavior?

Looking at the worker loads, the dispatcher still selects which worker to use based on the profile's job load (inside each profile), and in the end the non-GPU workers also start encoding.

@mliradelc
Contributor

> Thanks for signing the ICLA!
> I just realized that your PR is based on r/11.x instead of the develop branch; can you change your base branch?

> @mliradelc OK, I rebased the branch.

Thanks for signing and for taking care of the base branch.

@mliradelc mliradelc self-requested a review May 17, 2022 12:27
@berthob98
Contributor Author

berthob98 commented Jun 3, 2022

> I tried the code, but there is one thing I don't quite understand.
>
> When I select the worker using the new option in the service registry file, it is always chosen as the first option. But when I have more than one encoding job, I would expect those jobs to be queued on the GPU worker, and that doesn't happen. Is that the correct behavior?
>
> Looking at the worker loads, the dispatcher still selects which worker to use based on the profile's job load (inside each profile), and in the end the non-GPU workers also start encoding.

@mliradelc I am sorry for the late answer; I missed your comment.
The behavior you described is expected. The code only prefers GPU workers in either of two cases (sketched in code right after this list):

  • if the GPU worker's NodeLoad is under 40%
  • if there is no normal node with a NodeLoad of 25% or less of the GPU worker's NodeLoad
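In code form, that rule could look roughly like this (an illustrative sketch only, not the actual implementation; the method and parameter names are made up):

```java
// Illustrative only: pairwise version of the rule described above.
// Prefer the GPU host if its load is under 40%, or if the regular host is not
// dramatically idler (i.e. its load is more than a quarter of the GPU host's load).
static boolean preferGpuHost(float gpuLoad, float regularLoad) {
  boolean gpuHasCapacity = gpuLoad < 0.4f;
  boolean regularMuchIdler = regularLoad <= 0.25f * gpuLoad;
  return gpuHasCapacity || !regularMuchIdler;
}
```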

Sending every encoding job to a single GPU worker could lead to encoding jobs queuing up on it while other workers idle, producing unnecessary congestion.

In my tests, modern GPUs were about 4 times faster than modern CPUs when encoding. It should also be considered that modern GPUs only have around 2-3 encoding units.

I am still testing how to improve this mechanism and am interested in different ideas.

@mliradelc
Contributor

mliradelc commented Jul 29, 2022

Hi @berthob98, I completely missed your answer. I read it when you published it and then forgot all about it; my apologies.

Here is how I think a specialized worker should work (in my opinion; a small code sketch of the ETA rule follows the list):

  • The specialized worker should take priority for all encoding tasks.
  • If there is another specialized worker, they should balance their loads.
  • A non-specialized worker should only take an encoding task if the ETA on the specialized workers is later than the ETA on the non-specialized worker.
    • Example: a specialized worker has a queue that will take 15 minutes to finish, and a new video arrives that will take 5 minutes to encode there, so its ETA is in 20 minutes. In the same cluster there is a free non-specialized worker that would take 15 minutes to encode that video. The admin should choose the non-specialized worker in that case.
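A minimal sketch of that ETA rule, using the numbers from the example (the names and minute-based units are just for illustration):

```java
// Illustrative sketch of the proposed rule: send the job to the non-specialized
// worker only if it would finish the video sooner than waiting for the
// specialized worker's queue to drain.
static String chooseWorker(double specializedQueueMin, double specializedEncodeMin,
                           double regularQueueMin, double regularEncodeMin) {
  double specializedEta = specializedQueueMin + specializedEncodeMin; // e.g. 15 + 5 = 20
  double regularEta = regularQueueMin + regularEncodeMin;             // e.g. 0 + 15 = 15
  return regularEta < specializedEta ? "non-specialized" : "specialized";
}
```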

Because of the nature of this feature, where the encoding hardware can differ greatly between nodes, I think it would be handy to have a benchmark tool inside Opencast: a set of standardized tests to measure the raw power of each machine. With that, you could easily configure the best load percentages for each machine, using the slowest worker as the baseline.

Another thing that needs to be clarified is the machine's capability for parallel encoding. I've seen with CUDA that, even with one encoding engine, you can run four jobs with only a small decrease in performance.

This is a huge topic, but if it is solved, it could make encoding tasks up to 10x faster for large organizations.

@gregorydlogan
Member

I'd be quite happy to see a benchmarking tool, for a number of reasons, but I suspect that's outside the bounds of this PR - unless @berthob98 wants to do it :)

What would improve the PR is allowing the user to adjust the limits where the special logic kicks in/out.

@berthob98
Contributor Author

berthob98 commented Aug 10, 2022

> What would improve the PR is allowing the user to adjust the limits where the special logic kicks in/out.

@gregorydlogan I made the limit adjustable via the cfg file.
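Purely as an illustration (the property name here is hypothetical, not necessarily the one used in the patch), such a setting could look something like this in the same .cfg file:

```properties
# Hypothetical example only; the actual property name is defined in the patch.
# Load threshold below which the specialized (GPU) workers are preferred.
org.opencastproject.encoding.workers.loadthreshold=0.4
```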

@berthob98
Contributor Author

> • if there is no normal node with a NodeLoad of 25% or less of the GPU worker's NodeLoad

I also removed this condition, since encoding doesn't scale up like that for all GPUs. The number of parallel encodings is limited on GPUs, and some have a pretty low (artificial) limit.

@berthob98
Contributor Author

> Because of the nature of this feature, where the encoding hardware can differ greatly between nodes, I think it would be handy to have a benchmark tool inside Opencast: a set of standardized tests to measure the raw power of each machine. With that, you could easily configure the best load percentages for each machine, using the slowest worker as the baseline.

@mliradelc What would be interesting measurements?

  • Encoding time of a standard video, compared between CPU and GPU?
  • Encoding time of multiple (2, 5, 10) videos compared to the encoding time of a single video?

@mliradelc
Contributor

I would keep it very simple at the start: for example, use the frames per second of the encoding task.

As for the videos to encode and the FFmpeg command, you can leave that configurable for the user: which encoding profile to use, with the video to encode placed in a set folder.
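For example (just one possible approach, with a placeholder input file and encoder), FFmpeg can report the achieved speed directly:

```sh
# Encode a reference clip to the null muxer and read the reported fps/speed.
# h264_nvenc exercises the GPU encoder; swap in libx264 for a CPU baseline.
ffmpeg -benchmark -i reference.mp4 -c:v h264_nvenc -b:v 4M -f null -
```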

About the GPU limit problem, that is very true: you can get failed jobs because the GPU refuses to encode more than the driver allows. That can be limited by configuration.

@github-actions
Contributor

github-actions bot commented Sep 1, 2022

This pull request has conflicts ☹
Please resolve those so we can review the pull request.
Thanks.

@gregorydlogan
Member

@berthob98 do you want to do this benchmarking as part of this PR, or a follow-on? Happy either way, but this PR is starting to get pretty old.

@berthob98
Contributor Author

> @berthob98 do you want to do this benchmarking as part of this PR, or a follow-on? Happy either way, but this PR is starting to get pretty old.

I will do the benchmarking as a separate PR. Is there anything I can do to help get this PR merged?

@mliradelc
Contributor

For me it's OK: it works, and together with the other PR you made to check the CPU usage, it is a good solution.

Now you need to resolve the conflict that GitHub is flagging, and then for me it is good to go.

@gregorydlogan gregorydlogan (Member) left a comment

Very minor changes, and you'll probably want to rebase this anyway :) Looks really good otherwise!

@berthob98
Contributor Author

@gregorydlogan good points. I applied both suggestions and also made some small changes to the parsing of the config file.
Should I squash my commits?

@mliradelc
Contributor

Thanks for the changes. A squash is not necessary (it used to be, but nowadays GitHub squashes and merges), but it is welcome because you can improve the commit description.

@berthob98 berthob98 requested review from gregorydlogan and mliradelc and removed request for mliradelc and gregorydlogan October 6, 2022 15:30
@gregorydlogan gregorydlogan merged commit b37be49 into opencast:develop Oct 6, 2022
Linked issue: Prefer GPU accelerated nodes when encoding (#3740)