Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Image.WriteToFile crashes with NetVips.Native 8.9.0-rc4 #53

Closed
Gigas002 opened this issue Jan 11, 2020 · 16 comments
Closed

Image.WriteToFile crashes with NetVips.Native 8.9.0-rc4 #53

Gigas002 opened this issue Jan 11, 2020 · 16 comments
Labels
bug Something isn't working
Milestone

Comments

@Gigas002
Copy link

Gigas002 commented Jan 11, 2020

Image.WriteToFile method with NetVips.Native 8.9.0-rc4 sometimes crashes unpredictably. I have an application, that writes tiles by cropping GeoTIFF. I've run some tests on CI and local machine (Win10). Each test writes approximately 7000 tiles. I've got following results:
8 tests on linux (CI) - passes without errors
12 tests on windows (CI) - 1/12 crashed
20 tests on windows (local) - 5/12 crashed

At first I tried to downgrade to NetVips 1.1.0 instead of 1.2.0-rc2 but that didn't solve the issue. Then I tried to downgrade to NetVips.Native 8.8.4 and run through unit tests again:
20 tests on windows (local) - all passes

I also looked into tiles, that was written wrong and their correct versions, they look like this:

Broken:
27304

Correct:
27304_2

The exact fragment of code, throwing the exception is following:

outputImage = NetVips.Image.Black(Enums.Image.Image.TileSize, Enums.Image.Image.TileSize).NewFromImage(0, 0, 0, 0);

// Insert tile into output image
outputImage = outputImage.Insert(tileImage, writePosX, writePosY);

//This string throws an exception sometimes
outputImage.WriteToFile(outputTileFileInfo.FullName);

Vips's exception message:

unable to call VipsForeignSavePngFile
    TIFFFillTile: Seek error at row 3584, col 0, tile 194
    vips2png: unable to write to target
@kleisauke kleisauke added the triage This issue is being investigated label Jan 13, 2020
@kleisauke
Copy link
Owner

Hi @Gigas002,

Thanks for testing the NetVips.Native v8.9.0-rc4 (pre-release) package! I couldn't reproduce this issue on Windows x64 with this test program:

Test program
namespace RandomCropTest
{
    using System;
    using NetVips;

    class RandomCropTest
    {
        static void Main(string[] args)
        {
            NetVips.CacheSetMax(0);

            // source: https://github.com/Gigas002/GTiff2Tiles/blob/master/Examples/Input/Input3785.tif
            var huge = Image.NewFromFile("huge.tif");

            const int tileSize = 256;
            const double scale = 0.7;

            var rnd = new Random();

            // make a black background image to insert our resized tiles into
            var background = Image.Black(10000, 10000);

            for (var j = 0; j < 100; j++)
            {
                // fetch source pixels from here ... tiles along the right and bottom will be
                // cropped against the image edges
                var oldX = rnd.Next(0, huge.Width);
                var oldY = rnd.Next(0, huge.Height);

                var width = Math.Min(tileSize, huge.Width - oldX);
                var height = Math.Min(tileSize, huge.Height - oldY);

                // write resized pixels here
                var newX = rnd.Next(0, background.Width);
                var newY = rnd.Next(0, background.Height);

                // fetch cropped tile from input ... because tiles have been cropped,
                // non-tile areas will appear transparent in the output
                using var tile = huge.Crop(oldX, oldY, width, height);
                using var resized = tile.Resize(scale);

                // write into background
                background = background.Insert(resized, newX, newY);

                Console.WriteLine($"[{oldX}, {oldY}, {width}, {height}] -> [{newX}, {newY}]");
            }

            background.WriteToFile("x.png");
            Console.WriteLine("Done!");
            Console.ReadKey();
        }
    }
}

(tested with both pre-compiled libvips v8.9.0-rc4 and v8.8.4 binaries)

It could be a issue with libvips v8.9.0 (since you couldn't reproduce it with libvips 8.8.4). I've a few of questions:

  1. Are you testing on Windows 64-bit or 32-bit? The AnyCPU target might load 32-bit DLL files.
  2. Are you able to provide a sample image and a standalone (without other dependencies) code sample that demonstrates this problem?

This issue might be bitness problem because the exception did not occur on Linux (all pre-compiled binaries for Linux are 64-bit).

@jcupitt Do you have any idea what could go wrong? I noticed that toff_t which is used for seeking and determining the length of a TIFF source is an unsigned 64 bit integer. It could break our error mechanism.

@Gigas002
Copy link
Author

Hello.
Here’s a quick standalone simulation of what my app does. It runs 10 identical tests. It also contains GeoTIFF I use to test (4326.tif). App takes two args: the path to input file and the path to output directory (run it like this: ./TestNetVips890 "D:/test/input/4326.tif" "D:/test/output").

And here’s the results I’ve got from them on Windows 10 x64 (binaries are also built targeting x64 only, not AnyCPU):
8.9.0-rc4 - 9/10 tests failed
8.8.4 - 10/10 passed

There’s some math in code, you don’t need to look on all of these, I marked it as math region in code. The exception is thrown in the WriteTile method. The line, that throws an exception is marked with TODO comment.

@jcupitt
Copy link
Contributor

jcupitt commented Jan 13, 2020

I think the seek horribleness is OK -- libtiff does the same revolting mess:

https://gitlab.com/libtiff/libtiff/blob/master/libtiff/tif_unix.c#L123

ie. they cast -1 to unint64 to signal error. We could make the cast explicit.

@jcupitt
Copy link
Contributor

jcupitt commented Jan 13, 2020

I found this related libtiff issue:

http://bugzilla.maptools.org/show_bug.cgi?id=2726

@kleisauke
Copy link
Owner

kleisauke commented Jan 13, 2020

Thanks for you standalone sample! I could reproduce this issue on Windows x64.

This slightly simpler program fails reliably for me (with sequential and random access):

namespace RandomCropStressTest
{
    using System;
    using System.Threading.Tasks;
    using NetVips;

    public static class RandomCropStressTest
    {
        static readonly Random Rnd = new Random();

        static Image RandomCrop(this Image image, int tileSize)
        {
            var x = Rnd.Next(0, image.Width);
            var y = Rnd.Next(0, image.Height);

            var width = Math.Min(tileSize, image.Width - x);
            var height = Math.Min(tileSize, image.Height - y);

            return image.Crop(x, y, width, height);
        }

        static void Main()
        {
            NetVips.CacheSetMax(0);

            const string access = "sequential"/*"random"*/;
            const int tileSize = 256;
            const int threads = 6;

            // dimensions: 6569 x 3073
            var huge = Image.NewFromFile("4326.tif", access: access);

            Parallel.For(0, 1000, new ParallelOptions { MaxDegreeOfParallelism = threads },
                i => huge.RandomCrop(tileSize).WriteToFile($"x_{i}.png"));

            Console.WriteLine("Done!");
            Console.ReadKey();
        }
    }
}

(i.e. random cropping with 6 threads on the same input image)

I'll have a look.

@jcupitt
Copy link
Contributor

jcupitt commented Jan 14, 2020

You probably saw, but this is how libtiff tests for seek error:

https://gitlab.com/libtiff/libtiff/blob/master/libtiff/tif_aux.c#L409

    return off <= (~(uint64)0)/2 && TIFFSeekFile(tif,off,SEEK_SET)==off;

So I think we're OK (I think).

jcupitt added a commit to libvips/libvips that referenced this issue Jan 14, 2020
To make our meaning clearer.

libtiff uses uint64 for toff_t, with -1 cast to uint64 for an error
indication, see:

https://gitlab.com/libtiff/libtiff/blob/master/libtiff/tif_unix.c#L123

See also:

kleisauke/net-vips#53
@kleisauke
Copy link
Owner

Thanks for making the casting explicit. This issue looks like a concurrency problem, I was able to reproduce it with pyvips (on Linux and Windows):

pyvips reproduce script
#!/usr/bin/env python
import random
import sys
from threading import Thread

import pyvips

# import logging
# logging.basicConfig(level = logging.DEBUG)
# pyvips.cache_set_trace(True)

pyvips.cache_set_max(0)

access = 'sequential'  # 'random'
tile_size = 256
threads = 24  # 6

im = pyvips.Image.new_from_file(sys.argv[1], access=access)


def parallel_crop(image):
    for i in range(1000):
        x = random.randint(0, image.width - 1)
        y = random.randint(0, image.height - 1)

        width = min(tile_size, image.width - x)
        height = min(tile_size, image.height - y)

        crop = image.crop(x, y, width, height)

        try:
            avg = crop.avg()
        except pyvips.Error as e:
            print(e.message)
            print(e.detail)
            sys.exit(1)

        # print(pyvips.ffi.cast('GObject *', image.pointer).ref_count)


if __name__ == '__main__':
    # don't try this at home, I guess
    for i in range(threads):
        t = Thread(target=parallel_crop, args=(im,))
        t.start()

On Windows this issue occurs with 6 threads. On Linux it needs 24 threads(!) to reproduce it sometimes (you'll get an unable to call avg error).

The weird thing is that I couldn't reproduce it with the multiprocessing package:

multiprocessing script
#!/usr/bin/env python
import multiprocessing as mp
import random
import sys

import pyvips

# import logging
# logging.basicConfig(level = logging.DEBUG)
# pyvips.cache_set_trace(True)

pyvips.cache_set_max(0)

access = 'sequential'  # 'random'
tile_size = 256
threads = 6

image = pyvips.Image.new_from_file(sys.argv[1], access=access)


def parallel_crop(i):
    x = random.randint(0, image.width - 1)
    y = random.randint(0, image.height - 1)

    width = min(tile_size, image.width - x)
    height = min(tile_size, image.height - y)

    crop = image.crop(x, y, width, height)

    try:
        avg = crop.avg()
    except pyvips.Error as e:
        print(e.message)
        print(e.detail)
        sys.exit(1)

    print(pyvips.ffi.cast('GObject *', image.pointer).ref_count)


if __name__ == '__main__':
    with mp.Pool(processes=threads) as p:
        p.map(parallel_crop, range(0, 1000))

But perhaps it's using a mutex (the ref_count doesn't increase, fwiw).

@jcupitt
Copy link
Contributor

jcupitt commented Jan 14, 2020

Huh I'm confused now I look at the Python code. You shouldn't be able to do random crops from a TIFF in sequential mode.

Is this a tiled tiff? If so, access won't have any effect.

@jcupitt
Copy link
Contributor

jcupitt commented Jan 14, 2020

Ah I see it here too, thanks!

Yes, it's a tiled tiff, so access doesn't matter. I can reproduce with:

#!/usr/bin/python3

import random
import sys
from threading import Thread
import pyvips

def parallel_crop(image):
    for i in range(1000):
        x = random.randint(0, image.width - 256)
        y = random.randint(0, image.height - 256)
        crop = image.crop(x, y, 256, 256)
        avg = crop.avg()

im = pyvips.Image.new_from_file(sys.argv[1])
for i in range(24):
    t = Thread(target=parallel_crop, args=(im,))
    t.start()

Some kind of horrible race condition. I'll have a look.

@kleisauke
Copy link
Owner

I did a quick test, it appears that this issue only occurs with tiled-TIFF images. For example:

vips copy tiled.tif non-tiled.tif
vips copy non-tiled.tif tiled2.tif[tile]

tiled.tif and tiled2.tif will fail with that test script, while non-tiled.tif seems to work properly.

@jcupitt
Copy link
Contributor

jcupitt commented Jan 14, 2020

I tried with 8.8 and that seems to work, so I suppose it must be a race somewhere in the new IO system.

@jcupitt
Copy link
Contributor

jcupitt commented Jan 14, 2020

Ah, found it! It's not in read at all, it's minimise.

The tiff loader closes the file descriptor after reading the header, and again after each computation. When a computation starts up, it opens the file descriptor again and restores the seek point. There's a lock inside the cache of recent tiles which prevents races on reopen, but there's no lock on the minimise action.

Probably the best solution is just not to minimise on end of computation for tiled images.

jcupitt added a commit to libvips/libvips that referenced this issue Jan 14, 2020
We weer minimising sources in the ::minimise handler, but this is called
outside the lock that protects _generate. This patch removes minimise in
this case.

See kleisauke/net-vips#53
@jcupitt
Copy link
Contributor

jcupitt commented Jan 14, 2020

That seems to fix it for me.

Thanks for reporting this @Gigas002 !

kleisauke added a commit that referenced this issue Jan 14, 2020
@kleisauke kleisauke added blocked-upstream-dependency Upstream dependency needs to be updated bug Something isn't working and removed triage This issue is being investigated labels Jan 14, 2020
@kleisauke
Copy link
Owner

kleisauke commented Jan 14, 2020

Thanks! I confirm that it's working fine now. As a side note, you might be interested in this commit: 40cbfcd. I found it while debugging this issue.

Example code:

NetVips.CacheSetMax(0);

var image = Image.NewFromFile("lichtenstein.jpg");

for (var i = 0; i < 100; i++)
{
    using (var crop = image.Crop(0, 0, 256, 256))
    {
        var _ = crop.Avg();
    }

    // RefCount should not increase (i.e. operation should be freed)
    Console.WriteLine("reference count: " + image.RefCount);
}

The reference count of the main image could easily climb to 100 as the VipsOperation pointers (returned by vips_operation_new and vips_cache_operation_build) were released during GC.

I also tested ruby-vips (on Windows) with this:

ruby-vips test
#!/usr/bin/ruby

require 'vips'

Vips::cache_set_max 0

im = Vips::Image.new_from_file ARGV[0]

100.times do |_|
  crop = im.crop 0, 0, 256, 256
  avg = crop.avg

  # This can easily climb to 20
  puts "reference count: #{im.ref_count}"
end

lib/vips/image.rb

def ref_count
  return @struct[:parent][:parent][:ref_count]
end

And noticed a similar behavior, but it looks like the GC is kicking-in around ref_count >= 20. pyvips doesn't seem to exhibit this behavior:

pyvips test
#!/usr/bin/env python
import sys

import pyvips

pyvips.cache_set_max(0)

im = pyvips.Image.new_from_file(sys.argv[1])

for i in range(100):
    crop = im.crop(0, 0, 256, 256)
    avg = crop.avg()

    # This prints always 2
    print(pyvips.ffi.cast('GObject *', im.pointer).ref_count)

The reference count remains steady to 2. Replacing the crop with:

pyvips change
 im = pyvips.Image.new_from_file(sys.argv[1])
 
 for i in range(100):
-    crop = im.crop(0, 0, 256, 256)
-    avg = crop.avg()
+    avg = im.crop(0, 0, 256, 256).avg()
 
     print(pyvips.ffi.cast('GObject *', im.pointer).ref_count)

Changes the reference count to 1.

@jcupitt
Copy link
Contributor

jcupitt commented Jan 14, 2020

Great! Thanks for testing it.

That's an interesting commit, I wonder if ruby could free more aggressively. Though it now has a very nice generational GC, so things like this should be found quickly.

Python isn't GCd, it's refcounted -- things are freed as soon as they go out of scope. There's a mark-sweep GC as well, but it's only run occasionally to break any ref cycles. It's why py has such good memory behaviour.

@kleisauke kleisauke added this to the 1.2.0 milestone Jan 28, 2020
@kleisauke
Copy link
Owner

Ah I see, Python does indeed have (pretty) good memory behavior. Let me know if I should open a separate issue on the ruby-vips repo (the reference count might increase if running that loop parallelized, but I didn't test that).

@Gigas002 NetVips v1.2.0 and NetVips.Native v8.9.1 is now available. Thanks for reporting this!

@kleisauke kleisauke removed the blocked-upstream-dependency Upstream dependency needs to be updated label Jan 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants