Value should be copied #5

Open
arjenpdevries opened this issue Jun 25, 2020 · 3 comments

Comments

@arjenpdevries

Using this code, I got many copies of the same item... it turns out the issue is that getCurrentValue() keeps returning the same reused WarcWritable instance.

I have now replaced it with:

        @Override
        public WarcWritable getCurrentValue() throws IOException, InterruptedException {
            // Hand out the current value and allocate a fresh one, so each caller
            // keeps a distinct object rather than a reference to the reused instance.
            WarcWritable retValue = this.value;
            this.value = new WarcWritable();
            return retValue;
        }

Is this a good idea? It seems to work correctly in my tests, but those have only been on small data so far.
I can make a pull request if that is still useful; the repo has been inactive...

@helgeho
Owner

helgeho commented Jun 25, 2020

Hey Arjen, thanks for looking into this!

As you noticed correctly, the repo has been pretty much inactive; I've not been using or maintaining it for quite some time. I've almost entirely switched to Spark for web archive data processing at the Archive, and although this input format could very well be used with Spark too, I've been using different mechanisms there to split up WARCs in slightly more lightweight ways. It's been on my list for a while to turn that new code into a new input format that can be used with Hadoop as a replacement for this one, but I just haven't gotten to it yet.

Actually, it's considered good practice to reuse writables in input formats so as not to stress the GC too much. That shouldn't cause any issues if implemented correctly, because it's generally presumed that only one value is read at a time and not kept in memory for later. So, please check the code of yours that uses the input format to see whether this is something that can be fixed on your end, by copying the values you need out of WarcWritable before retrieving the next one.
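
To make that contract concrete, here is a minimal sketch of the "copy before the next record" pattern on the consumer side. It is not code from this repo; it uses Hadoop's Text writable as a stand-in for WarcWritable so it compiles on its own:

    import org.apache.hadoop.io.Text
    import scala.collection.mutable.ArrayBuffer

    // The reader hands back the *same* object on every call, so the consumer
    // must copy the data out before asking for the next record. Storing the
    // writable itself would leave N references to one mutating object.
    def consume(readInto: Text => Boolean): Seq[String] = {
      val value = new Text()                // single reused instance
      val out = ArrayBuffer.empty[String]
      while (readInto(value)) {
        out += value.toString               // copy now; `value` is overwritten on the next read
      }
      out.toSeq
    }

The bug pattern is the same whether the consumer is a plain MapReduce job or Spark: only the copies survive past the next call to the reader.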

If that's not the case and this turns out to be in fact a problem with this input format, I'm sure the bug is somewhere deeper, and that's really what should be fixed. So, if you get a chance to investigate and potentially fix that, I would highly appreciate it and be happy to take a PR. Otherwise, please let me know and we might just take a PR with your suggestion above as a workaround for now.

@arjenpdevries
Author

Ah, I used this for reading WARCs in Spark - it seemed more self-contained than anything else I could find that actually works.

What happens now in Spark, when you use a reader based on your code via newAPIHadoopFile, is that it creates an RDD whose entries all point to the exact same value. I used an initial map { identity } before switching libraries, and I think I tried that here too; maybe I should revisit that alternative. I will try this with the extra copying map then, but given my experience with the classes from SURFsara - which always worked fine and no longer do - I suspect something crucial may have changed between Hadoop v2 and v3 such that the expected behaviour is different (not sure though).
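
For illustration, a rough sketch of what that extra copying map could look like right after newAPIHadoopFile. It is written with Hadoop's TextInputFormat so it stands alone; the same shape would apply with the WARC input format from this repo:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkContext

    // map { identity } only keeps references to the reused writables; the fix
    // is to pull the data out of each value before anything caches or collects.
    def loadLines(sc: SparkContext, path: String) =
      sc.newAPIHadoopFile(path,
          classOf[TextInputFormat],
          classOf[LongWritable],
          classOf[Text])
        .map { case (k, v) => (k.get, v.toString) } // materialize a copy per record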

It would be very useful if we could develop your lightweight WARC reading code into a Spark extension that works well - there is no obvious alternative as far as I can see! I can help make it happen, with a bit of guidance from your end. Let's take that discussion offline.

@arjenpdevries
Author

(Maybe the reader should fill a partition with records, handed over by Spark, one at a time.)
