Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MH-13277 improve scheduler performance #614

Merged

Conversation

krinnewitz
Copy link

@krinnewitz krinnewitz commented Dec 7, 2018

This patchset improves the scheduler performance as part of the opencast crowdfunding 2018. As a benchmark, we scheduled 50k events and then added 6 events on top. Previously, adding the 6 events took between 19 and 20 seconds. Of these 20 seconds, the conflict check took about 80% of the time. After our changes, it takes about 2 seconds to add the 6 events on top. Of these 2 seconds, the conflict check now only takes a very small percentage of time. The benchmark was done using Opencast with MariaDB on an AMD FX(tm)-8350 Eight-Core Processor with 32GB RAM and an SSD.

The following changes were done to achieve the goal of faster conflict checks and higher throughput when inserting events:

Remove Unused YAGNI Code

  • The "transactional" feature of the scheduler was only used for the migration from opencast 3.x to 4.x and introduced a lot of complexity. In order to improve the maintainability of opencast, we decided to remove this code. If anyone wants to utilize it someday, it can be restored from the git history.
  • The "blacklist" feature of the scheduler wasn't fully implemented. We removed it as well, again for the sake of maintainability.

Store Properties of Scheduled Events in a Dedicated Table

Properties such as start date, end date, capture device id, of scheduled events were stored as asset manager properties which are basically key-value pairs which cannot be stored efficiently using a relational DBMS. For this reason, the conflict check was done in-memory, which was a workaround for the slow database queries. However, this still was slow since all events had to be loaded into memory before performing the conflict check. The scheduler now uses a proper "SQL-friendly" table to store these properties. This drastically improves the conflict check performance (which can be done on the database again) as well as the insertion speed when adding new events.

Reduce Number of ActiveMQ Messages When Adding Events

When scheduling multiple events, one message per event was sent over ActiveMQ. This seemed to cause some buffers filling up and blocking the message sender which in turn slowed down event insertion. We changed the behavior to send just one message for a whole bunch of events, which solves this problem.

Parallelize Adding Multiple Events

When scheduling multiple events (using an RRULE), these events cannot have conflicts with each other. This allows us to parallelize event insertion in this case, which also improves the throughput.

Other Changes

Additionally, we improved the test coverage of the scheduler and added a migration which is necessary because of the database schema changes,

@pmiddend pmiddend force-pushed the f/MH-13277-improve-scheduler-performance branch from 2c0a9a1 to f02f34b Compare December 7, 2018 10:22
@JamesUoM
Copy link
Contributor

JamesUoM commented Dec 7, 2018

Is there any way to store custom properties, in scheduled events? We had planned to tighter integrate our Participation Manager (opt-out) with the using Asset managers properties to store addition choices like editing, camera and or screen recording, without having to parse the workflow and capture agent configs.

@krinnewitz
Copy link
Author

krinnewitz commented Dec 7, 2018

It depends.If you know the keys and types of your properties in advance, you can just add one column per property to oc_scheduled_extended_event and store them there.

If the properties you want to store are arbitrary and don't have to be queried individually, you can add another string column, serialize the properties using gson and store them in this column. For reading those properties, you just read the string from the database and de-serialize them again. Just like we did it for the capture agent and workflow properties.

The problem with the workflow and capture agent properties was, that they can consist of arbitrary key-value pairs, so we couldn't create one column per property. But since these properties weren't queried individually, we could just serialize them and store them as a whole.

And of course, you can still save and load asset manager properties. However, using the asset manager properties is expensive.

@staubesv
Copy link
Contributor

staubesv commented Dec 7, 2018

@JamesUoM & @krinnewitz Opencast 6.x introduces a new processing settings persistent layer that uses the asset manager to store workflow properties. At some point in the future, it likely makes sense to remove the workflow properties from the scheduler table and make the scheduler use the processing settings persistent layer of the asset manager. At the end of the day, a scheduled event is a event and it would be perfectly fine to store the workflow properties just for the event (vs. scheduled event).

@lkiesow lkiesow self-requested a review December 7, 2018 12:21
@lkiesow lkiesow self-assigned this Dec 7, 2018
@staubesv
Copy link
Contributor

staubesv commented Dec 7, 2018

@krinnewitz Please also adopt the page https://docs.opencast.org/develop/developer/scheduler/

@krinnewitz krinnewitz force-pushed the f/MH-13277-improve-scheduler-performance branch 2 times, most recently from e2c172a to b09a926 Compare December 8, 2018 08:51
@krinnewitz
Copy link
Author

Thanks @staubesv. I updated the documentation.

Regarding the workflow properties: Maybe we could change that in a follow-up PR if there are not too many review findings with this one. However, the problem with storing anything in the asset manager properties is, that even just insertion is quite slow. So when scheduling multiple events at once, that could already be a problem.

@JamesUoM
Copy link
Contributor

JamesUoM commented Dec 10, 2018

Thanks @staubesv. I updated the documentation.

Regarding the workflow properties: Maybe we could change that in a follow-up PR if there are not too many review findings with this one. However, the problem with storing anything in the asset manager properties is, that even just insertion is quite slow. So when scheduling multiple events at once, that could already be a problem.

Appologies for the diversion:

We schedule events for two weeks ahead which can be 2000+ new events being added on a Monday. If we moved away from our intermediate table of via recordings and user options is then 25,000 events at the start of the semester and a busy 72 hours when the majority opt-out decisions occur. I think we're likely to keep our user options in our own Participation Management (PM) tables if the Asset Manager was not able to cope with this, and I would be very concerned if the creation and modification performance of CA and WF configs was negatively impacted.

Very simply we ingest the whole time table and create potential recordings against which users opt-out, this is synchronized daily as the timetable changes often. We then created (delete and modify) scheduled events for a rolling two week window synchronized on a daily basis.

Timetable --1 semester--> PM --2 weeks--> OC
                          ^^
                     User options

@krinnewitz krinnewitz force-pushed the f/MH-13277-improve-scheduler-performance branch from b09a926 to f3c8e49 Compare December 11, 2018 08:27
@gregorydlogan
Copy link
Member

I've tagged this as a feature, even though it's not really a feature. It does however warrant the extra testing that comes with being a feature IMO.

@lkiesow
Copy link
Member

lkiesow commented Jan 13, 2019

Sorry for the late start of the review.

When scheduling multiple events (using an RRULE), these events cannot have conflicts with each
other. This allows us to parallelize event insertion in this case, which also improves the throughput.

Does Opencast's internal processing of RRULES differ from what's possible in iCalendar? Because I think this would be a completely valid schedule with overlapping events using an RRULE (12 events á 2h scheduled hourly):

BEGIN:VEVENT
DTSTAMP:20190113T185826Z
UID:20190113T185826Z-1469192696
DTSTART;TZID=Europe/Berlin:20190201T120000
DTEND;TZID=Europe/Berlin:20190201T140000
RRULE:FREQ=HOURLY;INTERVAL=1;COUNT=12
SUMMARY:Test
DESCRIPTION:Some overlapping test events
END:VEVENT

@pmiddend pmiddend force-pushed the f/MH-13277-improve-scheduler-performance branch from f3c8e49 to ab439b1 Compare January 14, 2019 17:02
@pmiddend
Copy link

@lkiesow You're right. I hope the admin UI actually prevents this from happening, but surely there are other ways to "inject" such an RRULE into the system. Either way, I've added some code to check those overlaps.

@pmiddend pmiddend force-pushed the f/MH-13277-improve-scheduler-performance branch from ab439b1 to 79a9478 Compare January 14, 2019 17:51
Copy link
Member

@lkiesow lkiesow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a number of questions, comments and suggestions.
In general, this looks good though.
I will do some actual testing later today/tomorrow.

@lkiesow
Copy link
Member

lkiesow commented Jan 27, 2019

I've tested this in a production-like environment and automated scheduling of ~400 events worked very well. The only problem I really noticed was the weird naming I already found in the code earlier when manually scheduling multiple events. Apart from that, everything seems to work.

@lkiesow
Copy link
Member

lkiesow commented Jan 28, 2019

Btb, some review findings are just comments and may not necessarily require any change.
Just take a look at them and figure out if it makes sense to change anything ;-P

@karendolan
Copy link
Member

Hi, since this pull has a lot of them, can someone explain why it is good for Opencast to make new iterations of NotFoundException, etc, that do not extend plain old Java NotFoundExceptions, etc? I do not understand the gained benefits over the complexity added.

@krinnewitz
Copy link
Author

krinnewitz commented Jan 28, 2019

Hi, since this pull has a lot of them, can someone explain why it is good for Opencast to make new iterations of NotFoundException, etc, that do not extend plain old Java NotFoundExceptions, etc? I do not understand the gained benefits over the complexity added.

Hi @karendolan I think in most cases it makes sense to not throw exceptions which are part of the java API since you want to handle exceptions thrown by the java class library differently from the ones thrown by your own code. And in most cases you want to express a different situation than the "standard" java exception stands for.

For example here is a snippet from the javadoc of javax.ws.rs.NotFoundException:

A runtime exception indicating a resource requested by a client was not found on the server.

That may not make sense in all places of your own code.

@krinnewitz
Copy link
Author

Thanks @lkiesow and @karendolan.

I have fixed most of your comments.

@smarquard
Copy link
Contributor

Is there anything still outstanding on this review?

@lkiesow
Copy link
Member

lkiesow commented Feb 10, 2019

Essentially just my final review. Unfortunately, I have been very busy last week.

One thing I like to take another look at is the separation of workflow parameters for which we now have two separate stores. I'd like to make sure that we do not get any inconsistencies there.

@smarquard
Copy link
Contributor

I'm getting test failures with this branch on a local build. The expected and actual times are out by one hour, which makes me think it's timezone related. Building in SAST, which is GMT+2 without DST adjustment.

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.opencastproject.scheduler.impl.UtilTests
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.218 s - in org.opencastproject.scheduler.impl.UtilTests
[INFO] Running org.opencastproject.scheduler.impl.RecordingTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 s - in org.opencastproject.scheduler.impl.RecordingTest
[INFO] Running org.opencastproject.scheduler.impl.SchedulerServiceImplTest
16:43:51  WARN (SchedulerServiceImpl:1379) - Invalid recording state: bad_state.
16:44:11  WARN (SchedulerServiceImpl:443) - Mediapackage with id '1' already exists!
[ERROR] Tests run: 34, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 25.53 s <<< FAILURE! - in org.opencastproject.scheduler.impl.SchedulerServiceImplTest
[ERROR] testEventManagement(org.opencastproject.scheduler.impl.SchedulerServiceImplTest)  Time elapsed: 0.24 s  <<< ERROR!
org.opencastproject.scheduler.api.SchedulerException: Event ID 5822c12b-5941-43d6-b8b9-1c22ba293d13 has already ended at 2019-02-12T13:45:11Z and now is 2019-02-12T14:44:11Z
	at org.opencastproject.scheduler.impl.SchedulerServiceImplTest.testEventManagement(SchedulerServiceImplTest.java:1442)

[INFO] Running org.opencastproject.scheduler.impl.persistence.SchedulerServiceDatabaseImplTest
[ERROR] Tests run: 11, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 0.885 s <<< FAILURE! - in org.opencastproject.scheduler.impl.persistence.SchedulerServiceDatabaseImplTest
[ERROR] testLastModifed(org.opencastproject.scheduler.impl.persistence.SchedulerServiceDatabaseImplTest)  Time elapsed: 0.077 s  <<< FAILURE!
java.lang.AssertionError
	at org.opencastproject.scheduler.impl.persistence.SchedulerServiceDatabaseImplTest.testLastModifed(SchedulerServiceDatabaseImplTest.java:85)

[ERROR] testUpdateEvent(org.opencastproject.scheduler.impl.persistence.SchedulerServiceDatabaseImplTest)  Time elapsed: 0.078 s  <<< FAILURE!
java.lang.AssertionError: expected:<Mon Jan 07 09:00:00 SAST 2019> but was:<Mon Jan 07 08:00:00 SAST 2019>
	at org.opencastproject.scheduler.impl.persistence.SchedulerServiceDatabaseImplTest.testUpdateEvent(SchedulerServiceDatabaseImplTest.java:225)

[ERROR] testCreateEvent(org.opencastproject.scheduler.impl.persistence.SchedulerServiceDatabaseImplTest)  Time elapsed: 0.065 s  <<< FAILURE!
java.lang.AssertionError: expected:<Mon Jan 07 09:00:00 SAST 2019> but was:<Mon Jan 07 08:00:00 SAST 2019>
	at org.opencastproject.scheduler.impl.persistence.SchedulerServiceDatabaseImplTest.testCreateEvent(SchedulerServiceDatabaseImplTest.java:139)

[ERROR] testResetRecordingState(org.opencastproject.scheduler.impl.persistence.SchedulerServiceDatabaseImplTest)  Time elapsed: 0.064 s  <<< FAILURE!
java.lang.AssertionError: expected:<Mon Jan 07 09:00:00 SAST 2019> but was:<Mon Jan 07 08:00:00 SAST 2019>
	at org.opencastproject.scheduler.impl.persistence.SchedulerServiceDatabaseImplTest.testResetRecordingState(SchedulerServiceDatabaseImplTest.java:803)

[INFO] Running org.opencastproject.scheduler.impl.SchedulerUtilTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.068 s - in org.opencastproject.scheduler.impl.SchedulerUtilTest
[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   SchedulerServiceDatabaseImplTest.testCreateEvent:139 expected:<Mon Jan 07 09:00:00 SAST 2019> but was:<Mon Jan 07 08:00:00 SAST 2019>
[ERROR]   SchedulerServiceDatabaseImplTest.testLastModifed:85
[ERROR]   SchedulerServiceDatabaseImplTest.testResetRecordingState:803 expected:<Mon Jan 07 09:00:00 SAST 2019> but was:<Mon Jan 07 08:00:00 SAST 2019>
[ERROR]   SchedulerServiceDatabaseImplTest.testUpdateEvent:225 expected:<Mon Jan 07 09:00:00 SAST 2019> but was:<Mon Jan 07 08:00:00 SAST 2019>
[ERROR] Errors: 
[ERROR]   SchedulerServiceImplTest.testEventManagement:1442 » Scheduler Event ID 5822c12...
[INFO] 
[ERROR] Tests run: 51, Failures: 4, Errors: 1, Skipped: 0

Copy link
Member

@lkiesow lkiesow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a final in-depth review and this patch looks good except for one line of SQL update code which needs to be removed. Scheduling stuff works, capture agents work with the scheduler, migration works (I did upgrade from 6 to 7). @krinnewitz, can once the SQL line is removed, I'm going to merge this.

docs/upgrade/6_to_7/mysql5.sql Outdated Show resolved Hide resolved
@lkiesow
Copy link
Member

lkiesow commented Feb 12, 2019

Ups, missed @smarquard's comment. Is this issue new? Did the tests work in the same timezone without this patch?

@lkiesow
Copy link
Member

lkiesow commented Feb 12, 2019

Btb, to properly test this on develop, I applied the following patches beforehand to remove some critical issues which would have otherwise made me testing this impossible:

@krinnewitz
Copy link
Author

I can reproduce test failures when building scheduler-impl with mvn clean install -Duser.timezone=Africa/Johannesburg

@krinnewitz
Copy link
Author

@lkiesow @smarquard I added yet another commit which should fix the test failures.

@smarquard
Copy link
Contributor

That solution to fixing the timezone issues is concerning, because at the beginning of our "timezone journey", we found that the default timezone could be set incorrectly when multiple threads were changing the timezone default at the same time. Thread 1 changes the default, thread 2 gets the default (which is now not the actual user/system default, but what thread 1 has changed it to), thread 1 changes it back to the user/system default but thread 2 has now hung on to an incorrect default and sets the default incorrectly. So we made this change:

76dfc35

which your commit essentially reverses. IMO it's always unsafe to call setDefault(...) because it's changing a value which is JVM-wide, and you don't know what other threads are doing at that time.

@krinnewitz krinnewitz force-pushed the f/MH-13277-improve-scheduler-performance branch from 299856b to 171cea0 Compare February 13, 2019 07:14
@krinnewitz
Copy link
Author

Thanks for the hint. I removed the last commit. Since the problem with failing tests already exists in develop and is not introduced by this PR, I suggest to merge it.

@smarquard
Copy link
Contributor

I agree with merging this, despite the test failures which I'm hoping will be taken care of in #604 that @karendolan is working on. That PR (as it stands) removes all the timezone setDefault calls.

Kim Rinnewitz and others added 11 commits February 13, 2019 12:30
- Remove unused blacklist code
- Remove superfluous migration and transaction code
This speeds up the conflict check as well as the insertion of new
events.
Since these events never have conflicts, we can add them in parallel.
- Add more SchedulerServiceImpl Tests
- Add more tests for SchedulerServiceDatabaseImpl
@krinnewitz krinnewitz force-pushed the f/MH-13277-improve-scheduler-performance branch from 171cea0 to 85e6430 Compare February 13, 2019 11:42
@krinnewitz
Copy link
Author

Just rebased onto develop to resolve merge conflict

@smarquard
Copy link
Contributor

The failing tests pass on my local build if I use:

mvn -Duser.timezone=Europe/Zurich

@lkiesow
Copy link
Member

lkiesow commented Feb 13, 2019

I'll then merge it like this.

Memo to myself: Mid-term, add -Duser.timezone=America/St_Johns to the Travis builds.

@lkiesow lkiesow merged commit 85e6430 into opencast:develop Feb 14, 2019
lkiesow added a commit that referenced this pull request Feb 14, 2019
…pencast into develop

Pull request #614
  MH-13277 improve scheduler performance
@lkiesow
Copy link
Member

lkiesow commented Feb 14, 2019

It's merged. Thanks for the patch.
I think this does improve Opencast quite a bit.

@lkiesow
Copy link
Member

lkiesow commented Feb 19, 2019

Seems like this has some issues after all. I'm now getting test failures on a regular basis (not every time!) on develop. This I have seen on multiple systems:

[INFO] Running org.opencastproject.scheduler.impl.SchedulerServiceImplTest
11:43:41  WARN (SchedulerServiceImpl:1377) - Invalid recording state: bad_state.
11:43:53  WARN (SchedulerServiceImpl:444) - Mediapackage with id '1' already exists!
[ERROR] Tests run: 34, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 13.256 s <<< FAILURE! - in org.opencastproject.scheduler.impl.SchedulerServiceImplTest
[ERROR] testAddMultipleEvents(org.opencastproject.scheduler.impl.SchedulerServiceImplTest)  Time elapsed: 2.657 s  <<< FAILURE!
java.lang.AssertionError: expected:<200> but was:<199>
        at org.opencastproject.scheduler.impl.SchedulerServiceImplTest.testAddMultipleEvents(SchedulerServiceImplTest.java:996)

It also happened on the Travis build here:
https://api.travis-ci.org/v3/job/495192840/log.txt

@smarquard
Copy link
Contributor

Curious that the test failure shows up unpredictably. I haven't seen any failures like this, either on develop or our 6.x custom branch with this code merged in.

These WARNs show up for successful builds, so are not an issue:

16:11:58  WARN (SchedulerServiceImpl:1377) - Invalid recording state: bad_state.
16:12:19  WARN (SchedulerServiceImpl:444) - Mediapackage with id '1' already exists!

@pmiddend
Copy link

pmiddend commented Feb 19, 2019

@lkiesow As I am now noticing, we indeed write into a Map in a thread-unsafe way (see scheduledEvents in addMultipleEventInternal). This might lead to sporadic problems like you mentioned (and is definitely a bug). I'll file another PR for that.
See #729.

@lkiesow
Copy link
Member

lkiesow commented Feb 19, 2019

Thanks for taking a look at this.
I will review the patch tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants