
Extremely slow performance #50

Open
DanielBiegler opened this Issue Mar 17, 2017 · 31 comments

@DanielBiegler

DanielBiegler commented Mar 17, 2017

How long does it take for you to compress a couple of images?

I tried compressing a 7.8 MB JPG with --quality 84 and it took nearly 20 minutes.

I also tried a 1.4 MB JPG with --quality 85 and it took nearly 10 minutes.

I have to assume this is not normal - is something wrong with my binary?

I am on Ubuntu 16.04 LTS with an Intel Core i7-4790K CPU @ 4.00GHz.
I installed gflags via sudo apt-get install libgflags-dev and got libpng via sudo apt-get install libpng16-dev. After that, make completed with no errors.

convert -quality 85 src.jpg dst.jpg runs in under 1 second, if that is any help.

Anyone else experience this?

@robryk

robryk (Collaborator) commented Mar 17, 2017

It's true that Guetzli is really slow to compress. As a rough order-of-magnitude estimate, it takes ~1 minute per MPixel on a 2.6 GHz Xeon. The time consumed increases a bit faster than linearly with image size.

How large in pixel count are your images?

@iron-udjin

iron-udjin commented Mar 17, 2017

Are you going to implement multi-threaded image processing? It would be really helpful, since guetzli doesn't have CPU/RAM usage optimizations yet.

@jan-wassenberg

jan-wassenberg (Member) commented Mar 17, 2017

Would it be an option to invoke the binary multiple times in parallel? That's much easier than adding threading everywhere inside Guetzli+butteraugli, and should work unless you need to compress really large images.
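
[Editor's note: a minimal sketch of that approach, assuming GNU xargs and a directory of PNG inputs; the file names, output naming, and job count are illustrative.]

# Run up to 8 guetzli processes at once, one image per process.
# -P sets the parallelism; size it to your core count and RAM budget.
# Output naming is crude: foo.png becomes foo.png.jpg.
ls *.png | xargs -P 8 -I {} guetzli --quality 84 {} {}.jpg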

@iron-udjin

iron-udjin commented Mar 17, 2017

Yes, I use this option. But it would be great to have multi-threaded processing for large images in the future. That's why I'm asking about plans for implementing this feature.

@DanielBiegler

DanielBiegler commented Mar 17, 2017

@robryk the big one is around 16 MP (5300x3000), so your estimate is in the right ballpark.

I was just testing guetzli and haven't dug into its specifics yet - will it be possible to multithread its workload?

For example, I wanted to compress around 200 pictures, each roughly 7-8 MB in size. This would take forever: even running 8 instances in parallel, at roughly 20 minutes per picture, that's 200 / 8 = 25 rounds x 20 minutes, so the compression would take over 8 hours to complete.

@robryk

robryk (Collaborator) commented Mar 17, 2017

@DanielBiegler If you have 200 pictures, I'd echo @jan-wassenberg's suggestion: run multiple instances of Guetzli and thus process multiple pictures in parallel. This will be more effective parallelization than anything that can be done inside Guetzli.

@kornelski

kornelski commented Mar 17, 2017

@robryk I presume a large part of the slowness and memory use is because this is the first release and Guetzli hasn't been optimized yet. How much of the slowness is inherent to the algorithm and unavoidable, and how much can be done to improve the speed?

@robryk

robryk (Collaborator) commented Mar 17, 2017

@pornel We didn't try to optimize Guetzli in ways that could make it harder to modify. That means there's likely some speedup available just from optimizing single routines and, more significantly, speedup available from restructuring parts of Guetzli (e.g. attempting to reuse more computation results between iterations).

That said, I believe much more can be done about memory consumption, which we hardly optimized at all.

@DanielBiegler

DanielBiegler commented Mar 17, 2017

@robryk my estimate already took that into account.

Later I'll try the lower end, around --quality 30, and if the results are good, maybe running guetzli in the background for a couple of days will be worth it - we'll see.

Thanks for the quick answers.

@slatted

slatted commented Mar 17, 2017

I've been surprised overall by the performance. I have a 13 MB image of beef souvlaki, and it's been running for 25 minutes (using between 3 and 5 GB of RAM) and still isn't finished.

I was able to process a ~350 KB image relatively quickly (<2 min).

@robryk

robryk (Collaborator) commented Mar 17, 2017

@slatted The runtime grows slightly faster than linearly with image size. If you want some insight into what's happening, you might wish to pass the --verbose command-line flag.

@slatted

slatted commented Mar 17, 2017

@robryk thanks, the 13 MB image just finished (right around 30 min) and comes in at 2.9 MB (at quality 84). Visually I can't notice a difference. Very cool.

@clouless

clouless commented Mar 18, 2017

Is it normal that a 4 MB file needs more than 8 GB of RAM inside Docker?

I have guetzli running inside Docker and the Docker process has 8 GB of RAM.
The guetzli process gets killed.

I also tried giving the Docker daemon 12 GB, but guetzli still got killed.

You can test it with:

docker run -i -t codeclou/kartoffelstampf:1.1.1 bash
# docker container bash opens

wget -O hires.jpg https://codeclou.github.io/kartoffelstampf/test-images/test-affinity-photo-600dpi.jpg

guetzli hires.jpg hires_comp.jpg

Any hints on what I can do to lower RAM consumption?
Is guetzli not suited to running dockerized?

Docker Hub: https://hub.docker.com/r/codeclou/kartoffelstampf/

@robryk

robryk (Collaborator) commented Mar 18, 2017

@clouless The image you're trying to compress is ~70 MPix (its dimensions are 8333x8333, and 8333 x 8333 ≈ 69.4 million pixels). According to the readme, Guetzli uses ~300MB per MPix, so you should expect it to use ~21GB on that image.

This is obviously far from ideal (both the high constant factor and the inability to process the image in tiles), but it is within current expectations. #11 is the issue tracking memory consumption.

@clouless

clouless commented Mar 18, 2017

OK, thanks for checking. Then my test image might be too large. I will try the convert-to-png workaround with my actual DSLR photos for testing. Is there a way to check the number of MPix before starting the conversion? How do you determine MPix? Is it safe to rely on the EXIF megapixels entry?

@robryk

robryk (Collaborator) commented Mar 18, 2017

@clouless I use ImageMagick's identify or an image viewer to find the dimensions of the image and multiply them together to get the total number of pixels. I'd expect the EXIF entry to usually be correct when present.

Note that the convert-to-png workaround addresses a problem where Guetzli erroneously claims that an image is invalid. It doesn't affect memory usage or time spent in any appreciable fashion.
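
[Editor's note: that check fits in one shell line. A minimal sketch, assuming ImageMagick is installed; the file name is illustrative.]

# Print width, height, and megapixel count for an image.
identify -format "%w %h\n" hires.jpg | awk '{ printf "%dx%d = %.1f MPix\n", $1, $2, $1*$2/1e6 }'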

@clouless

clouless commented Mar 18, 2017

ok thx. You helped me a lot :) keep up the great work 👍

@DanielBiegler

DanielBiegler commented Mar 18, 2017

@clouless (img-width * img-height) / 1000000 = X megapixels

@SuicSoft

SuicSoft commented Mar 18, 2017

The speed of Guetzli could probably be improved by using OpenCL (on the GPU).

@bitbank2

bitbank2 commented Mar 18, 2017

I just profiled Guetzli and most of the time is spent in the butteraugli Convolution() and ButteraugliBlockDiff() methods. One of the big issues hurting performance is the use of double-precision floating-point values to calculate pixel errors. In this case, a 64-bit integer would provide the same accuracy for the error and increase the speed quite a bit, since the original pixels could be left as-is. In certain cases, using doubles for pixels makes sense (e.g. some filter, scaling or transparency operations), but not for error calculations. The rest of the code has some efficiency problems, but they won't affect performance nearly as much.
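
[Editor's note: for anyone wanting to reproduce this kind of profile, a minimal sketch using Linux perf; the binary path assumes the default Makefile build and the file names are illustrative.]

# Sample call stacks while guetzli runs, then inspect the hottest functions.
perf record -g ./bin/Release/guetzli --quality 84 input.jpg output.jpg
perf report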

@erikng

erikng commented Mar 18, 2017

When using the --verbose option, it would be great if estimated time/memory consumption could be calculated and presented to the user - perhaps by taking the megapixel count and applying the current time/memory estimates.

@clouless

clouless commented Mar 18, 2017

@erikng that, in combination with a --dry-run option, would be great - so that it just estimates how long it would take and how much memory it would need. JSON-formatted output would also be a huge plus.
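
[Editor's note: until something like that exists, a rough stand-in can be scripted from the shell using the ~1 min/MPix and ~300 MB/MPix figures quoted in this thread. A sketch assuming ImageMagick; the constants are ballpark estimates, not guetzli output.]

# Estimate guetzli runtime and peak RAM from the pixel count.
identify -format "%w %h\n" input.jpg | awk '{ mp = $1*$2/1e6; printf "%.1f MPix: roughly %.0f min, %.1f GB RAM\n", mp, mp, mp*300/1024 }'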

@graysky2

graysky2 commented Mar 18, 2017

> Would it be an option to invoke the binary multiple times in parallel?

Functionally, you can do this with GNU parallel by invoking it like this:

parallel 'guetzli --quality 84 {} {.}.jpg' ::: *.png

Test it yourself:

wget https://github.com/google/guetzli/releases/download/v0/bees.png
for i in 1 2 3 4 5 6 7; do cp bees.png $i.png; done
time parallel 'guetzli --quality 84 {} {.}.jpg' ::: *.png
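
[Editor's note: by default GNU parallel starts one job per CPU core; given guetzli's memory appetite (~300 MB per MPix, as discussed above), it can be worth capping the job count. A hedged variant:]

# Limit to 4 concurrent guetzli jobs to keep RAM usage bounded.
parallel -j 4 'guetzli --quality 84 {} {.}.jpg' ::: *.png
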
@bdkjones

bdkjones commented Mar 20, 2017

I'd love to implement this in my app, but the current performance figures are definitely a roadblock. Taking 13 minutes for a reasonably sized JPEG of a couple MB is simply too long to be practical in many applications.

From my perspective, after reading all the current issues, there are three roadblocks to wide adoption, and they should be prioritized like this:

  1. Faster performance.
  2. Lower memory consumption.
  3. Failures on certain "non-standard" JPEGs, like those produced by certain cameras (you said you know what the problem is here).

I think a good rough goal would be to get to a point where a JPEG that's a couple MB in size takes no more than 10-12 seconds to optimize. That would make the algorithm practical in my use case, which is an app that optimizes hundreds of images at once as part of building websites.

@jayniz

jayniz commented Mar 28, 2017

Another alternative to what @graysky2 said is https://github.com/fd0/machma
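
[Editor's note: usage looks something like this - a sketch based on machma's README at the time; the {} placeholder and flags are assumptions, so check the project docs.]

# Feed file names to machma, which runs guetzli on them in parallel.
find . -iname '*.png' | machma -- guetzli --quality 84 {} {}.jpg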

@luiseps

luiseps commented Apr 23, 2017

Hi, I want to create a multithreaded guetzli version, but I don't understand the workflow. Can anybody explain how, or point me to a document where I can find that?

@jdluzen

jdluzen commented Jul 6, 2017

I've attempted something like @bitbank2 suggested and changed some of the doubles to floats in the two methods he mentioned, without much net speedup, if any. I do want to attempt the conversion to 64-bit ints and try again, but that seems like a more in-depth undertaking than my current haphazard edits.

However, to try out some quick-and-dirty parallelization, I also profiled it with a random JPG I had lying around. Adding OpenMP's #pragma omp parallel for to all the for loops in butteraugli.cc:Mask seems to have improved performance by ~10%. I first attempted to add them to the loops in the methods that @bitbank2 mentioned, but with mixed results: I got crashes, inconsistent iteration counts, different JPG file sizes, and high CPU with no progress (threads blocked? cache-line misses?). I'll continue to poke at this as well.

On a related note, the version I compiled has drastically improved memory usage compared to the binary on the Releases page, sometimes saving up to 75%.

(screenshots: guetzli with OpenMP vs. stock guetzli)

@leafjungle

leafjungle commented Aug 24, 2017

I printed the time cost for each step. For a 100 KB image, the result was:

total: 86 seconds
ApplyGlobalQuantization: 20 seconds
SelectFrequencyMasking: 60 seconds

@banghn

banghn commented Oct 25, 2017

Hi @leafjungle, what is the ApplyGlobalQuantization step you measured? I would also like to improve compression with guetzli.

@luiseps

luiseps commented Oct 25, 2017

@jdluzen I added #pragma omp parallel for in butteraugli.cc:Convolution(), but the process was slower than the original version. I believe it was due to data races around the cache. Can you help me with how to improve guetzli? Thanks.

@rogierlommers

rogierlommers commented Oct 25, 2017

Not sure if related, but at our company we chose to apply the Guetzli algorithm to all our rendered images. Because it's relatively slow, we decided to distribute the load in a special way. You can read all about it here: https://techlab.bol.com/from-a-crazy-hackathon-idea-to-an-empty-queue/
