lbfgs failing with "function value changing less than tolX" when in GPU mode #52

Open
raptorecki opened this Issue Sep 26, 2015 · 2 comments


This code is some marvelous work! I'm stunned by the amazing results it can give.

I happened to encounter a few small issues; maybe someone can help me with them.

I noticed some strange interruptions when rendering an image with the lbfgs optimizer. It shows up like this:

(...)
Iteration 740 / 1000
Content 1 loss: 2133260.000000
Style 1 loss: 33612.725830
Style 2 loss: 1111955.078125
Style 3 loss: 456756.542969
Style 4 loss: 19963398.437500
Style 5 loss: 875.515652
Total loss: 23699858.300076
<optim.lbfgs> function value changing less than tolX

The moment it happens seems completely random. I played with weights and other parameters, but oddly enough, it doesn't matter: I can render the same style and source with the same settings several times in a row and it will eventually fail like this, or, if it keeps failing, it will eventually go through. When I render a JPG sequence using a simple bash loop (with the same style and settings), it usually fails once or twice every ten frames and moves on. I can then render the failed frames again with the same settings and they finish fine.
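For reference, a minimal sketch of the kind of bash loop described above. The neural_style.lua flags exist in jcjohnson/neural-style; the file names, seed, and image size here are hypothetical, and the command is only echoed (a dry run):

```shell
#!/usr/bin/env bash
# Sketch of a per-frame render loop. render_cmd only prints the command
# (dry run); remove the "echo" to actually invoke torch.
set -u

render_cmd() {  # $1 = input frame, $2 = output image
  echo th neural_style.lua \
    -content_image "$1" -output_image "$2" \
    -style_image style.jpg -image_size 640 \
    -optimizer lbfgs -backend cudnn -gpu 0 \
    -style_weight 1000 -content_weight 500 -seed 123
}

for frame in frames/frame_0001.jpg frames/frame_0002.jpg; do
  render_cmd "$frame" "out/$(basename "$frame")"
done
```

Piping each frame's output through tee into a per-frame log makes it possible to spot the failed frames afterwards.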

Exploring this issue a bit, I tried modifying the local tolX value in torch/install/share/lua/5.1/optim/lbfgs.lua, and even commenting out the entire "if abs(f-f_old) < tolX then break" check. But then, if the loss value stops changing during rendering, I end up with black frames.

(...)
Iteration 990 / 1000
Content 1 loss: 4483204.062500
Style 1 loss: 30816.199951
Style 2 loss: 3080145.585938
Style 3 loss: 2145852.421875
Style 4 loss: 69358680.000000
Style 5 loss: 6758.217773
Total loss: 79105456.488037
Iteration 995 / 1000
Content 1 loss: 4483204.062500
Style 1 loss: 30816.199951
Style 2 loss: 3080145.585938
Style 3 loss: 2145852.421875
Style 4 loss: 69358680.000000
Style 5 loss: 6758.217773
Total loss: 79105456.488037
Iteration 1000 / 1000
Content 1 loss: 4483204.062500
Style 1 loss: 30816.199951
Style 2 loss: 3080145.585938
Style 3 loss: 2145852.421875
Style 4 loss: 69358680.000000
Style 5 loss: 6758.217773
Total loss: 79105456.488037
<optim.lbfgs> reached max number of iterations

What is worth mentioning: this happens only in GPU mode, for both the nn and cudnn backends. If only I could understand why it is happening, I would love to investigate it a bit more.

I'm also curious about the memory limitations. For example, with -image_size 640, nvidia-smi reports 1221MiB/2046MiB used, so it seems there is plenty left. But when I try -image_size 641 it fails with the familiar cuda runtime error (2) : out of memory. Of course the exact values vary with different styles and sources. The idle state uses 45MiB/2046MiB. Could anyone explain what is preventing the library from using the remaining memory? In CPU mode I can use my entire 16G with no problem.

I'm rendering on a GTX 770 (2G of GPU RAM) in GPU mode and an i7 4790k (16G of RAM) in CPU mode, using Ubuntu 14.04.2, Nvidia driver 352.39, CUDA 7.5.18-19867135 and CUDNN 7.0.

Again, the results of this code are just mind-blowing. Thank you for sharing this, jcjohnson!

raptorecki Nov 1, 2015

So... Despite the issues mentioned above I managed to push out some visuals for my music.

You can see the results here

First of all, big thanks to @jcjohnson for this code and to @hughperkins for his contributions (-seed saved me a lot of trouble).

I would like to share some of my experiences doing those tests:

  • The issue described in the original post was present on my GTX 770, and it caused about 4 out of 10 frames to be damaged in every image sequence. The very same sources and styles on EC2 did not cause any problems. Since the issue was present only on my GTX 770, maybe it's the drivers (352.39)?
  • I found out that I could mitigate the issue if, instead of calling -style_weight 1000 -content_weight 500, I used -style_weight $(shuf -i 990-1010 -n 1) -content_weight $(shuf -i 490-510 -n 1). The frames stayed more or less consistent and the number of invalid frames dropped significantly.
  • As the damaged frames leave a lot of zeroes in the style losses, you can catch them with grep and push them through rendering again. They eventually finish fine.
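The grep-based check and the randomized weights from the points above could be sketched like this (the log paths are hypothetical; the patterns match the messages quoted in this thread):

```shell
#!/usr/bin/env bash
# A frame is considered damaged if its render log shows the tolX abort
# or a zeroed style loss.
frame_failed() {  # $1 = log file for one frame
  grep -q 'function value changing less than tolX' "$1" ||
    grep -q 'Style [0-9]* loss: 0\.000000' "$1"
}

# Jittered weights around 1000/500, as in the mitigation above.
random_style_weight()   { shuf -i 990-1010 -n 1; }
random_content_weight() { shuf -i 490-510 -n 1; }

# Usage idea: list every frame whose log fails the check, e.g.
#   for log in logs/*.log; do
#     frame_failed "$log" && echo "re-render: $log"
#   done
```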

Some processing info:

  • On an Amazon EC2 g2.2xlarge instance you get an Nvidia GRID GPU with 4GB of vRAM. You can go up to around 920px, but it's quite time consuming. With a more reasonable size like 620px, a single frame takes about 2'30" to render (more or less 24 frames per hour).
  • I have a cage of HP Blades, each with 128GB of RAM, so I ran some tests there. I managed to render frames up to 2000px on the CPU. Everything above 2000px failed after the second iteration like this (regardless of weights; the sources and styles were 4Mpx+):

Iteration 1 / 600
Content 1 loss: 12688274.218750
Style 1 loss: 1513260.009766
Style 2 loss: 261467156.250000
Style 3 loss: 76750781.250000
Style 4 loss: 2047350875.000000
Style 5 loss: 28623.922348
Total loss: 2399798970.650864
<optim.lbfgs> creating recyclable direction/step/history buffers
Iteration 2 / 600
Content 1 loss: 12688274.218750
Style 1 loss: 1513260.009766
Style 2 loss: 261467156.250000
Style 3 loss: 76750781.250000
Style 4 loss: 2047350875.000000
Style 5 loss: 28623.922348
Total loss: 2399798970.650864
<optim.lbfgs> function value changing less than tolX

  • A single 1920px frame took about 58GB of RAM and 5 hours on 12 cores of an Intel Xeon E5-2650 to process. I deemed it unfeasible for video, but it was interesting to see how far I could push it.
  • On my home GTX 770 with 2GB of vRAM I could process frames of about 600-620px at a rate of 1 frame per 3 minutes (~20 frames/hour).

Processing and post production thoughts:

  • My fps workflow was as follows:
    • I wanted a 12fps-18fps "animated" look in a 24fps H264 file,
    • I prepared image sequences from 24fps videos and various-fps timelapses and processed them,
    • I interpreted the processed frames as a 12fps sequence,
    • I applied frame blending, which gave me a blended intermediate frame between each pair of original frames,
    • I interpreted the result of the above as 24fps.
      That way I had smoother transitions between frames, a style closer to what I wanted, a standard 24fps output file, and I only had to render half of the total 7936 frames required for a video of that length.
  • The 600-620px frames were upscaled to 720p 2.39:1 (1280x536). Denoising and perpendicular vector blur helped smooth things out a bit. Use the denoiser very carefully, as it will degrade the texture of the style. This was most apparent with the 'Starry Night' style, where denoising of only 2% caused heavy degradation of texture and fine detail.
  • If you want to preserve face details on people, use -init image.
  • If the frame has a single-color, consistent background and you want to make it more interesting, use -init random.
  • Always use consistent -seed (again, thank you @hughperkins !).
  • In most situations -normalize_gradients helps significantly with noise issues.
  • YouTube compression messed it up big time...
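The post doesn't say which tools were used for the fps workflow; as one hedged example, the 12fps-to-24fps blending step could be reproduced with ffmpeg's minterpolate filter in blend mode (the file names and CRF value are made up, and the command is only echoed as a dry run):

```shell
#!/usr/bin/env bash
# Interpret the processed frames as 12 fps, generate blended intermediate
# frames up to 24 fps, and encode a standard H.264 file.
# "echo" makes this a dry run; drop it to actually encode.
in_fps=12
out_fps=24
vf="minterpolate=fps=${out_fps}:mi_mode=blend"

echo ffmpeg -framerate "$in_fps" -i out/frame_%04d.png \
  -vf "$vf" -c:v libx264 -crf 18 -pix_fmt yuv420p "final_${out_fps}fps.mp4"
```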

Thanks guys, all the best, have fun!


hughperkins Nov 2, 2015

Contributor

What is worth mentioning, this happens only in GPU mode for both nn and cudnn backends.

Would be interesting to know whether it's also present with clnn. If it is, then it points to something in the code base; if not, then it could be something in the driver. There's nothing in particular about GPUs that should make the numbers radically different from the CPU, other than that GPUs use 32-bit floats.

Hmmm... did you try the CPU with 32-bit floats? i.e., cast everything to :float()?

