
Possible solutions to optimize cpu load for HLS #3040

Closed
zhanghuicuc opened this issue Jul 10, 2017 · 27 comments
@zhanghuicuc commented Jul 10, 2017

Hi,
As mentioned in #1227 and #1412, the Loader:HLS thread consumes a lot of CPU resources. We have used ARM Streamline to analyze the CPU activity and got the following result:
[screenshot: CPU activity with the default settings]
We can see that cores 1 and 2 can be more than 60% busy at times. On some low-performance CPUs like the Mstar6A938, this leads to a large number of dropped frames, especially when playing 4K high-bitrate streams. The same issue occurs on both 1.5.x and 2.x.

Possible solutions to optimize CPU load for HLS may be: a) lower the thread priority of Loader:HLS, and b) load and extract TS packets concurrently.

a)

changeThreadPrio.diff

In this way, OMX may have more resources to do decode-related work. The number of dropped frames was reduced by 60% according to our 3-hour test results, and we have not found any side effects so far.
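
The diff itself is not inlined in this thread, so the following is only a rough sketch of the idea (the LoadTask and doLoad names are illustrative, not ExoPlayer's actual Loader internals): lower the loading thread's priority at the start of each load so decoder threads win when they compete for a core.

import android.os.Process;

final class LoadTask implements Runnable {
  private final Runnable doLoad; // the actual segment loading and extraction work

  LoadTask(Runnable doLoad) {
    this.doLoad = doLoad;
  }

  @Override
  public void run() {
    // Lower this thread's priority so OMX/decoder threads are scheduled first.
    Process.setThreadPriority(Process.THREAD_PRIORITY_BACKGROUND);
    doLoad.run();
  }
}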

b)
loadAndparseConcurrently.diff

Loading several TS packets each time can reduce the number of I/O operations. If we do this, we need to load and extract TS packets concurrently to prevent the player from getting stuck in the buffering state. The CPU activity now looks as follows:
[screenshot: CPU activity, loading 50 packets each time]
[screenshot: CPU activity, loading 100 packets each time]
We can see that CPU load has been reduced a lot and the number of dropped frames is also reduced.
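
The patch is likewise not inlined here; as a rough sketch of the producer/consumer idea only (all names are hypothetical, and ExoPlayer's real extractor path is more involved), one thread could read batches of 188-byte TS packets while another parses them:

import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class ConcurrentTsPipelineSketch {
  private static final int TS_PACKET_SIZE = 188;
  private static final int PACKETS_PER_READ = 50; // 50 or 100 in the tests above

  private final BlockingQueue<byte[]> chunks = new ArrayBlockingQueue<>(4);

  // Producer: performs the (blocking) I/O in large batches.
  void loadLoop(java.io.InputStream in) throws Exception {
    byte[] buffer = new byte[TS_PACKET_SIZE * PACKETS_PER_READ];
    int read;
    while ((read = in.read(buffer)) != -1) {
      chunks.put(Arrays.copyOf(buffer, read)); // blocks if the parser falls behind
    }
  }

  // Consumer: parses whole packets without waiting on the network.
  // (A real implementation must carry partial packets over to the next chunk.)
  void parseLoop(TsPacketParser parser) throws InterruptedException {
    while (true) {
      byte[] chunk = chunks.take();
      for (int offset = 0; offset + TS_PACKET_SIZE <= chunk.length; offset += TS_PACKET_SIZE) {
        parser.parsePacket(chunk, offset);
      }
    }
  }

  interface TsPacketParser {
    void parsePacket(byte[] data, int offset);
  }
}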

Maybe switching to fragmented MP4 or DASH is the best solution, but there still exist a lot of HLS-TS streams we need to deal with. I'd appreciate any thoughts on these possible solutions, and I'm happy to prepare a PR for any potential fix.

Thanks!

@AquilesCanta (Contributor):

Would you kindly provide us with the streams you are using to test resource consumption? It would be good for us to know the characteristics of the streams.

@zhanghuicuc (Author) commented Jul 10, 2017

I simply used Bento4 mp42hls to convert tears_h264_high_uhd_30000.mp4 in https://storage.googleapis.com/wvmedia/clear/h264/tears/tears_uhd.mpd to an HLS-TS stream. The m3u8s are as follows:
master.m3u8
stream.m3u8

The stream is on our internal server right now and I'm afraid it can't be accessed from outside.
I've uploaded some of the media segments to Google Drive:
https://drive.google.com/file/d/0B2c0s1M2EOS7aXFRRDFxTG42QTg/view?usp=sharing

@dbedev commented Jul 11, 2017

I think these are some valid points by zhanghuicuc on ExoPlayer's HLS capability. The proposed enhancements to the way ExoPlayer handles HLS also look very interesting.

I do think that at the root of these HLS CPU peaks there is a more generic choice in the way the default ExoPlayer data loading operates. It works in a bursty pattern that tries to fetch data as quickly as possible at relatively large intervals (15 s), and thus you get these kinds of CPU utilization peaks at regular intervals for TS-based HLS.

see #2083

As far as I understand, this loading approach is geared towards energy efficiency for battery-operated devices (also taking into account the energy cost of the network transfer). For a non-battery-powered TV / media-box solution like the Mstar6A938, one is probably more concerned with keeping the CPU clock low and having as few frame drops as possible. As a result, one would instead want to spread the loading over time to get better visual performance.

The CPU utilization side effect of the default loader choice grows with the bitrate, as the loader works in the time domain only. So for a 30 Mbit/s stream it is by default trying to load and parse 15 seconds of data (~56 MB) as quickly as possible. Ironically, this also means that the higher your connection speed to the server, the more peak utilization the player will put on one of the CPUs, which in turn can cause scheduling issues on the system the player runs on.

I think that adapting the loader priority and doing concurrent TS loading would definitely improve the experience on an externally powered device, as both methods effectively counter the default loader's CPU utilization side effects as the bitrate increases. I'm not sure, though, how splitting off the extraction and lowering the priority will affect energy consumption. Maybe the ExoPlayer team can do some tests for this on a Nexus device. I'm also not sure how much the splitting will affect adaptive selection / bandwidth measurement.

Given the increasing use of ExoPlayer in Android TV style devices, maybe it is an idea to develop different load controls for non-battery-operated devices, or to make the default one adapt to the higher bitrates that are more common for (high-quality) TV-oriented delivery.

I do think that the current BUFFER_PACKET_COUNT setting (5) in TsExtractor.java is definitely too small. This seems like a relatively simple improvement; is there any reasoning behind why it is set to 5?
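
For reference, roughly speaking this constant controls how many 188-byte TS packets the extractor asks the underlying DataSource for per read call, so a larger value means fewer, larger I/O requests. A simplified sketch of what the setting implies (not the actual ExoPlayer code):

final class TsReadBatchingSketch {
  static final int TS_PACKET_SIZE = 188;
  static final int BUFFER_PACKET_COUNT = 5; // the value under discussion (20? 100? 250?)
  static final int BUFFER_SIZE = TS_PACKET_SIZE * BUFFER_PACKET_COUNT;

  // Only whole packets can be parsed from whatever the DataSource actually returned.
  static int parsablePackets(int bytesReturnedByDataSource) {
    return bytesReturnedByDataSource / TS_PACKET_SIZE;
  }
}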

@ojw28 (Contributor) commented Jul 11, 2017

Agreed there are some valid points of investigation, and thanks a lot for the analyses. A few thoughts/comments:

  • It's worth noting that the rendering thread already has a higher priority than the loading threads (aside: this was broken until 2.4.0, where it was fixed in ecb62cc, so if you're using an older version then updating may be an easy way to see some improvement). Although I agree there may be a valid argument for also lowering the priority of the loading threads.
  • If we can better batch up work in order to reduce load then that seems like an easy win. I'm not sure about the proposed patch, however, which would appear to generate a lot of garbage that will be really bad for devices running Dalvik. It also adds an additional memory copy. It's unclear what the cost trade-off is there, and we already have more copy steps through the TS extractor than in other extractors. Does simply increasing BUFFER_PACKET_COUNT (and not doing anything else) yield a performance increase? If so then that looks like something we should just go and do now. If someone could measure the improvement of making just that change, that would be great. @zhanghuicuc - Is that something you could easily get data for, given your existing setup?
  • I think fragmented MP4 is the proper long term solution to this problem (whether sticking with HLS or moving to DASH).
  • The initial post mentions that there's a lot of existing TS content, that there's existing 4K content, and that there exist low-powered devices. Whilst all are definitely true, I'm not sure the combination (which is when this is most problematic) is particularly common as a percentage of ExoPlayer usage, and it's unclear whether it ever will be as major content providers shift to using fragmented MP4 instead. This weighs somewhat on what we can reasonably do here (e.g. it doesn't make sense for us to make changes to improve this case that would negatively impact other more predominant uses).

Thanks!

@ojw28 (Contributor) commented Jul 11, 2017

It would also be a good idea to profile the extractor to see if there are any hotspots that we can optimize. If we're able to make it cheaper, that's going to be a better solution than spreading the cost (e.g. by lowering thread priority) :).

@ojw28 (Contributor) commented Jul 11, 2017

For looking at increasing BUFFER_PACKET_COUNT, it would be good to measure performance with the current value of 5 compared to a proposed new value of 100, with no other changes. It looks like reads stop being satisfied in full when you get over about 80, at least using the default network stack, so there's little to be gained from going beyond 100 I think. If we get confirmation it makes a significant improvement, we'll make that change.

@zhanghuicuc (Author):

a)
Simply increasing BUFFER_PACKET_COUNT to 100 (and not doing anything else) does not make a significant improvement. The test result is as follows (the first image in my initial post is the test result when BUFFER_PACKET_COUNT = 5):
[screenshot: CPU activity, BUFFER_PACKET_COUNT = 100, no other changes]

However, if we set BUFFER_PACKET_COUNT to a much larger value, like 250, the test result is as follows:
[screenshot: CPU activity, BUFFER_PACKET_COUNT = 250, no other changes]
And the side effect is that the player enters the buffering state more often, as I've mentioned before.

Regarding "It looks like reads stop being satisfied in full when you get over about 80, at least using the default network stack, so there's little to be gained from going beyond 100 I think": I don't understand the meaning of "reads stop being satisfied in full", and where does this "80" come from?

b)
I will try to remove the additional memory copy in my proposed patch, maybe using a ring buffer. And I'll try to profile the extractor to see if there are any hotspots.

Thanks for your valuable comments!

@ojw28 (Contributor) commented Jul 12, 2017

  • Could you explain how we're supposed to interpret the images you've attached? I'm not really sure what I'm looking at :). Ideally, if you could provide numbers instead (e.g. dropped frames per minute), that would be a lot easier to interpret/compare.
  • To answer the question about "reads being satisfied in full": The input.read calls are not guaranteed to read the full amount of data requested. The return value is the amount actually read, which may be less (but is at least 1 byte), just like how InputStreams work. If the idea of increasing BUFFER_PACKET_COUNT is to make fewer, larger reads from the input (and hence from the underlying network stack), this will only have a positive effect up to the point where we're requesting as much data as the underlying network stack can actually fulfill in a single call. Increasing beyond that point means we request even larger reads, but the amount actually read will stop increasing, and so there's no actual benefit to doing this. For DefaultHttpDataSource the point where it stops making a difference is around 80, at least on a test device I have. (See the sketch below.)
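
As a small illustration of the read contract described above (a hypothetical helper written against InputStream, which has the same semantics; not ExoPlayer source), this is why a single large request does not guarantee a single large read:

import java.io.IOException;
import java.io.InputStream;

final class PartialReadExample {
  // read() returns how many bytes were actually delivered, which may be fewer
  // than requested; looping like this is the only way to guarantee a full read.
  static int readFully(InputStream in, byte[] target, int length) throws IOException {
    int bytesRead = 0;
    while (bytesRead < length) {
      int result = in.read(target, bytesRead, length - bytesRead);
      if (result == -1) {
        break; // end of input
      }
      bytesRead += result;
    }
    return bytesRead;
  }
}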

@dbedev commented Jul 12, 2017

Not sure if I managed to get the message across properly before.

The current ExoPlayer implementation requires a single CPU core to parse up to 15 seconds of container data at roughly 15 s intervals. It will do this at the "maximum" speed at which the data source can provide it, and it assumes this will not cause a task scheduling conflict with other related tasks on the system.

If there is a clash with another task that happens to be the audio/video OMX decoder (spawned by the driver/OS layer), one will eventually end up with a frame drop. So the issue seems to be about peak CPU load, not CPU load in general.

I would argue that this issue is not just related to a device's limited CPU processing power, but more about the assumed relation between CPU processing power and the device's network I/O throughput. With the current loader approach, the faster the network I/O, the more computation power is required from the device to avoid a potential scheduling issue. So if not dealt with, this issue will only worsen for existing devices over time, as connection speeds to servers will probably improve but device CPU processing power will not.

TS streams are the obvious candidate to hit it first as they are the most expensive to parse, but encrypted containers or streaming over HTTPS (with a heavy cipher suite) in other formats could eventually hit similar behavior as bitrates increase.

  1. Setting a higher priority on the playback task should avoid most of the scheduling clashes between the ExoPlayer playback task and the loading task. However, I have encountered multiple devices where some of the OMX decoding work is actually done in a separate OS task, so the OMX tasks would still be in conflict with the loader. So the proposed solution of reducing the loader priority looks like a more effective way to counter this.

  2. If possible, improving the TS parsing speed will improve things in general (it makes TS more like the other formats). Effectively, however, it will just move the maximum bitrate at which this issue occurs. Looking at the graphs, there is no real overall CPU load issue, as there is a large amount of free time on all of the CPUs.

  3. A BUFFER_PACKET_COUNT setting of 5 looks rather inefficient based on basic network/CPU operating knowledge. Increasing it might not actually improve much in this scenario, or at some point even worsen the situation by increasing the effective peak I/O throughput.

  4. The plots also seem to show a peak in CPU usage at the start of loading an assumed 15 s segment. Could this be due to previously unprocessed but already loaded data in some cache causing a peak in I/O throughput? Or is there some computationally heavy part at the start of loading?

In general, for our project we did not find a more effective approach than tuning the ExoPlayer parameters to our intended use until we reached reasonable performance, which of course goes directly against ExoPlayer's purpose of providing abstraction.

We came to a value of 20 for BUFFER_PACKET_COUNT in our ExoPlayer 1.5.x based experiment, together with avoiding the Android 4.x OS CipherInputStream mess. However, in our experiment we use HLS streams that typically do not go beyond 15 Mbit/s (with optional AES envelope encryption) and we have a fairly powerful CPU at hand.

We did not need completely drop-free playback at that moment; near-fluent was good enough. Drop-free playback would become more important once we have something like dynamic HDMI output rate switching. Is there any form of support for this in ExoPlayer v2 in combination with recent Android versions?

ojw28 added a commit that referenced this issue Jul 12, 2017
Really low hanging fruit optimization for TS extraction.

ParsableBitArray is quite expensive. In particular readBits
contains at least 2 if blocks and a for loop, and was being
called 5 times per 188 byte packet (4 times via readBit). A
separate change will follow that optimizes readBit, but for
this particular case there's no real value to using a
ParsableBitArray anyway; use of ParsableBitArray IMO only
really becomes useful when you need to parse a bitstream more
than 4 bytes long, or where parsing the bitstream requires
some control flow (if/for) to parse.

There are probably other places where we're using
ParsableBitArray over-zealously. I'll roll that into a
tracking bug for looking in more detail at all extractors.

Issue: #3040

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=161650940
ojw28 added a commit that referenced this issue Jul 12, 2017
ParsableBitArray.readBit in particular was doing an excessive
amount of work. The new implementation is ~20% faster on desktop.

Issue: #3040

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=161666420
ojw28 added a commit that referenced this issue Jul 12, 2017
Apply the same learnings as in ParsableBitArray.

Issue: #3040

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=161674119
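As a rough illustration of what these commits describe: the fields of the 4-byte TS packet header (layout per ISO/IEC 13818-1) can be read with plain shifts and masks instead of a ParsableBitArray. The sketch below assumes headerWord holds the four header bytes with the sync byte (0x47) in the most significant byte; it is not the actual ExoPlayer code.

final class TsHeaderSketch {
  static int pidOf(int headerWord) {
    boolean transportError = (headerWord & 0x800000) != 0;   // transport_error_indicator
    boolean payloadUnitStart = (headerWord & 0x400000) != 0; // payload_unit_start_indicator
    boolean hasAdaptationField = (headerWord & 0x20) != 0;   // adaptation_field_control, high bit
    boolean hasPayload = (headerWord & 0x10) != 0;           // adaptation_field_control, low bit
    int continuityCounter = headerWord & 0xF;
    return (headerWord >> 8) & 0x1FFF;                       // 13-bit PID
  }
}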
@ojw28 (Contributor) commented Jul 12, 2017

  • We're on the same page; I understand your points :). And yes, we'll take a look at the thread priorities.
  • The point of my first bullet above is that it's really hard for us to look at screenshots of two graphs and derive meaningful data from them. So it would be much better if you could provide numerical data instead. For example "we see 5 dropped frames per minute with priority X vs 50 dropped frames per minute with priority Y" directly measures user impact and the numbers are trivial to compare :).
  • I think there's a lot of low hanging optimization fruit in TsExtractor (and possibly in other extractors too). I just pushed some changes that tackle some of the immediately obvious bits. It would be interesting to know whether you see a measurable difference.

@zhanghuicuc (Author):

OK, I'll do some tests and let you know once I have any results.

@zhanghuicuc (Author):

Hi,
I've done some tests on a Mstar6A938 device; the network bandwidth was set to 10 MB/s. The same 4K video (duration = 1 hour) was played 3 times per configuration. The test results are as follows:

  1. Using r2.2.0, we see 354 dropped frames per hour.
  2. Using the latest dev-v2 with ojw28's commits, we see 204 dropped frames per hour.
  3. Using the latest dev-v2 with ojw28's commits and setting the Loader thread priority to lowest (and not doing anything else), we see 146 dropped frames per hour.
  4. Using the latest dev-v2 with ojw28's commits and setting BUFFER_PACKET_COUNT to 100 (and not doing anything else), we see 179 dropped frames per hour.
  5. Using the latest dev-v2 with ojw28's commits, setting BUFFER_PACKET_COUNT to 100, and using my proposed patch to load and parse TS packets concurrently (and not doing anything else), we see 114 dropped frames per hour.

It seems that ojw28's commits are really effective, and lowering the Loader thread priority makes some further improvement. Please let me know if you need more information; I'm glad to do more tests if you want.

Thanks.

@ojw28 (Contributor) commented Jul 17, 2017

Thanks; this is excellent data! One additional data point I'd be interested in is what happens in case (3) if you drop the thread priority to THREAD_PRIORITY_LESS_FAVORABLE rather than THREAD_PRIORITY_LOWEST.

@dbedev commented Jul 17, 2017

Nice to see some good progress on this; these already look like promising improvements.

@zhanghuicuc, are you using a different 4K video than the uploaded one, or are you using a 100 Mbit link?

The BUFFER_PACKET_COUNT result when using 100 is more or less what we experienced on our ARMv7 Android 4.3 based platform. Unfortunately we do not have an Mstar-based system to test on; if one could be provided, we would be willing to help with some additional analysis for this issue ;).

Though our platform is very different from the Mstar6A938, it would still be interesting to see if the value of 20 for BUFFER_PACKET_COUNT that we arrived at also works better for the Mstar platform. For us, 20 was the point after which things did not improve anymore, and further out (above 40) things started to perform worse again.

@zhanghuicuc (Author) commented Jul 17, 2017

@dbedev I'm using a different 4K video with a bitrate of 15 Mbps. Actually, it was just transcoded from the uploaded one using ffmpeg and converted to an HLS stream using Bento4. The network link is about 10 MB/s.

I've just done some tests with the Loader thread priority set to THREAD_PRIORITY_LESS_FAVORABLE. The video (duration = 1 hour) was played twice. We saw 225 dropped frames per hour.

It seems that THREAD_PRIORITY_LESS_FAVORABLE is not low enough. I'll do some tests with THREAD_PRIORITY_BACKGROUND and see what happens. And I'll also do some tests with different BUFFER_PACKET_COUNT values.

ojw28 added a commit that referenced this issue Jul 18, 2017
We currently read at most 5 packets at a time from the
extractor input. Whether this is inefficient depends on
how efficiently the underlying DataSource handles lots
of small reads. It seems likely, however, that DataSource
implementations will in general more efficiently handle
fewer larger reads, and in the case of this extractor
it's trivial to do this.

Notes:
- The change appears to make little difference in my
  testing with DefaultHttpDataSource, although analysis
  in #3040 suggests that it does help.
- This change shouldn't have any negative implications
  (i.e. at worst it should be neutral wrt performance). In
  particular it should not make buffering any more likely,
  because the underlying DataSource should return fewer
  bytes than are being requested in the case that it
  cannot fully satisfy the requested amount.

Issue: #3040

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=162206761
@zhanghuicuc (Author):

Hi,
Sorry for the delayed response. I've just done some tests with THREAD_PRIORITY_BACKGROUND (without ojw28's commit 009369b). The same video was played 3 times. We saw 175 dropped frames per hour. The result is acceptable to me, considering that THREAD_PRIORITY_LOWEST may be too aggressive.
BTW, I saw the following code in C.java
public static final int PRIORITY_DOWNLOAD = PRIORITY_PLAYBACK - 1000;
But PRIORITY_DOWNLOAD is never used. What is the original intention of PRIORITY_DOWNLOAD?

I'll also do some tests with different BUFFER_PACKET_COUNT.

Thanks

@zhanghuicuc (Author):

Hi
Similar to 4cb5b34, we can avoid using ParsableBitArray to parse the PAT and PMT, as the following diff file shows:
dontUseParsableBitArrayForPATandPMT.txt

@AquilesCanta (Contributor):

@zhanghuicuc, PRIORITY_DOWNLOAD is used in a component that is under development, and will be released soon.

Have you profiled the provided patch? I wouldn't expect a great difference there, considering the sparsity of PSI data in comparison with 4k PES data.

@zhanghuicuc (Author):

I agree with you, @AquilesCanta; it doesn't make a great improvement.

@ojw28 (Contributor) commented Jul 31, 2017

Agreed. I briefly considered this also, but profiling suggested it wasn't really worth it.

@zhanghuicuc (Author) commented Aug 1, 2017

I've noticed that in MediaCodecVideoRenderer.shouldDropOutputBuffer, the frame drop threshold is set to 30 ms. Is this just an arbitrary value? Maybe this threshold is too strict for 24 fps and 25 fps streams.
In NuPlayer, the frame drop threshold is set to 40 ms. In mstplayer (from Mstar) and ffplay, the frame drop threshold is set to 1/fps. IMO, this threshold should be set according to the frame rate.
I've done some tests with a threshold of 40 ms and a 25 fps stream; the number of dropped frames is reduced by 30% compared with a threshold of 30 ms, and I have not found any side effects so far.
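As an illustration only (the real check lives in MediaCodecVideoRenderer.shouldDropOutputBuffer and uses a fixed threshold; the class below is hypothetical), a frame-rate-relative policy could look like this, where earlyUs is how early the buffer is relative to the playback position and negative values mean the frame is already late:

final class DropPolicySketch {
  static boolean shouldDropOutputBuffer(long earlyUs, float frameRate) {
    long frameDurationUs = (long) (1_000_000f / frameRate); // 40_000 us for 25 fps
    return earlyUs < -frameDurationUs;
  }
}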

@ojw28 (Contributor) commented Aug 1, 2017

It's arbitrary. I'm not convinced allowing frames to be "more late" is necessarily better than dropping them though, from a visual point of view. Does the result actually look smoother when you increase the value, or does it just decrease the dropped frame count and look just as janky? In any case, it's clear that "minimizing dropped frames" isn't in isolation the goal to be aiming for. It's trivial to get to 0 simply by never dropping and showing frames arbitrarily late instead, but that doesn't make it a good idea :).

@dbedev commented Aug 7, 2017

Just my 2 cents:

A −30 ms .. 11 ms window for frame dropping is rather tight for 24/25 fps content, as every drop causes a 41.6/40 ms jump in time. I'm not sure, though, whether increasing this would have a noticeable negative effect on things like A/V sync.

Besides that, another thing we have observed with this control loop is that some MediaCodec implementations use quite a lot of CPU resources (we use ByteBuffer output instead of direct surface coupling). This sometimes causes the MediaCodec to fall behind to the point where our renderer (which uses the same control loop as the default renderer) decides to simply drop everything.

As there is no option to drop individual frames at the decoder level, it would probably be better to have a minimum service level that only drops X frames consecutively, so a user still sees something instead of just reporting drop counters. I'm not sure, though, whether this case ever occurs when decoding to a surface directly.

Another thing to consider might be to move away from the deprecated getOutputBuffers() and use the getOutputBuffer(index) approach on API 21+ devices, as this should allow the codec to run in a more optimized way according to the documentation.

Is there an explicit reason that ExoPlayer still uses this "deprecated" method on new platforms?
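
For reference, the two MediaCodec access patterns being compared look roughly like this (error handling and the INFO_* return codes omitted; illustration only):

import android.media.MediaCodec;
import java.nio.ByteBuffer;

final class OutputBufferAccess {
  static ByteBuffer dequeue(MediaCodec codec, MediaCodec.BufferInfo info) {
    int index = codec.dequeueOutputBuffer(info, 10_000 /* timeoutUs */);
    if (index < 0) {
      return null; // INFO_TRY_AGAIN_LATER, INFO_OUTPUT_FORMAT_CHANGED, ...
    }
    // API 21+: per-index accessor, no cached buffer array to invalidate.
    return codec.getOutputBuffer(index);
    // Pre-21 (deprecated on 21+): ByteBuffer[] buffers = codec.getOutputBuffers();
    //                             return buffers[index];
  }
}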

@ojw28 (Contributor) commented Aug 7, 2017

  • Yeah, it might be a little tight. I think it would be fine to increase it to 40 ms to match NuPlayer.
  • Having a minimum service level (i.e. not dropping everything) is tracked by #2777 ("Fast play will cause the HD video frozen"), phrased there specifically in the context of variable-speed playback, where the issue is much more likely to occur.
  • We still use getOutputBuffers() because it's simpler to have one code path rather than two. We'll switch to using the newer methods once there is an actual performance benefit. We've confirmed with the platform team that there's no performance benefit currently, and there won't be one until at least the P release of Android.

@ojw28 (Contributor) commented Aug 7, 2017

@zhanghuicuc - Could you clarify the measured frame drops when, all other things kept equal and using the latest dev-v2, loader thread priority is set to (a) what it is now, (b) THREAD_PRIORITY_LESS_FAVORABLE, (c) THREAD_PRIORITY_BACKGROUND, (d) THREAD_PRIORITY_LOWEST. I'm having a somewhat hard time figuring it out from the thread above. It's unclear which values are directly comparable. Thanks!

@zhanghuicuc (Author):

(a) what it is now : 204 dropped frames per hour
(b) THREAD_PRIORITY_LESS_FAVORABLE : 225 dropped frames per hour
(c) THREAD_PRIORITY_BACKGROUND : 175 dropped frames per hour
(d) THREAD_PRIORITY_LOWEST : 146 dropped frames per hour

For cases (a), (c), and (d), the same video was played 3 times. For case (b), it was played 2 times, so there may be some deviation here.
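
For context, the numeric values behind the compared settings are fixed by the Android framework (higher number = lower scheduling priority); note that THREAD_PRIORITY_LESS_FAVORABLE is a relative offset rather than an absolute level:

import android.os.Process;

final class PriorityValues {
  static final int DEFAULT = Process.THREAD_PRIORITY_DEFAULT;               // 0
  static final int LESS_FAVORABLE = Process.THREAD_PRIORITY_LESS_FAVORABLE; // +1, relative to current
  static final int BACKGROUND = Process.THREAD_PRIORITY_BACKGROUND;         // 10
  static final int LOWEST = Process.THREAD_PRIORITY_LOWEST;                 // 19
}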

@ojw28 (Contributor) commented Jul 5, 2022

There's little evidence that this is still a problem, given modern devices and a continued shift toward more efficient container formats (i.e. FMP4). Closing as obsolete.

ojw28 closed this as completed Jul 5, 2022.
google locked and limited conversation to collaborators Sep 4, 2022.