Value should be copied #5

Open
arjenpdevries opened this issue Jun 25, 2020 · 3 comments

Comments

@arjenpdevries

Using this code, I got many copies of the same item... it turns out the issue is that getCurrentValue() keeps returning the same reused WarcWritable instance.

I have now replaced it with:

        @Override
        public WarcWritable getCurrentValue() throws IOException, InterruptedException {
            // Hand out the current value and allocate a fresh one, so each caller
            // keeps a distinct object rather than a reference to the reused instance.
            WarcWritable retValue = this.value;
            this.value = new WarcWritable();
            return retValue;
        }

Is this a good idea? It seems to work correctly in my tests, but those have only been on small data so far.
I can make a pull request if that is still useful; the repo has been inactive...

@helgeho
Owner

helgeho commented Jun 25, 2020

Hey Arjen, thanks for looking into this!

As you noticed correctly, the repo has been pretty much inactive; I've not been using or maintaining it for quite some time. I've almost entirely switched to Spark for web archive data processing at the Archive, and although this input format could very well be used with Spark too, I've been using different mechanisms there to split up WARCs in slightly more lightweight ways. It's been on my list for a while to turn that new code into a new input format that can be used with Hadoop as a replacement for this one, but I just haven't gotten to it yet.

Actually, it's considered good practice to reuse writables in input formats so as not to stress the GC too much. That shouldn't cause any issues if implemented correctly, because it's generally presumed that only one value is read at a time and not kept in memory for later. So, please check the code of yours that uses the input format to see whether this is something that can be fixed on your end, by copying the values you need out of WarcWritable before retrieving the next one.
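
To make that contract concrete, here is a minimal sketch of the "copy before the next record" pattern on the consumer side. It is not code from this repo; it uses Hadoop's Text writable as a stand-in for WarcWritable so it compiles on its own:

    import org.apache.hadoop.io.Text
    import scala.collection.mutable.ArrayBuffer

    // The reader hands back the *same* object on every call, so the consumer
    // must copy the data out before asking for the next record. Storing the
    // writable itself would leave N references to one mutating object.
    def consume(readInto: Text => Boolean): Seq[String] = {
      val value = new Text()                // single reused instance
      val out = ArrayBuffer.empty[String]
      while (readInto(value)) {
        out += value.toString               // copy now; `value` is overwritten on the next read
      }
      out.toSeq
    }

The bug pattern is the same whether the consumer is a plain MapReduce job or Spark: only the copies survive past the next call to the reader.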

If that's not the case and this turns out to be in fact a problem with this input format, I'm sure the bug is somewhere deeper, and that's really what should be fixed. So, if you get a chance to investigate and potentially fix that, I would highly appreciate it and be happy to take a PR. Otherwise, please let me know and we might just take a PR with your suggestion above as a workaround for now.

@arjenpdevries
Author

Ah, I used this for reading WARCs in Spark - it seemed more self-contained than anything else I could find that actually works.

What happens now in Spark, when you use a reader based on your code via newAPIHadoopFile, is that it creates an RDD whose entries all point to the exact same value. I used an initial map { identity } before switching libraries, and I think I tried that here too; maybe I should revisit that alternative. I will try this with the extra copying map then, but given my experience with the classes from SURFsara - which always worked fine and no longer do - I suspect something crucial may have changed between Hadoop v2 and v3 such that the expected behaviour is different (not sure though).
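
For illustration, a rough sketch of what that extra copying map could look like right after newAPIHadoopFile. It is written with Hadoop's TextInputFormat so it stands alone; the same shape would apply with the WARC input format from this repo:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkContext

    // map { identity } only keeps references to the reused writables; the fix
    // is to pull the data out of each value before anything caches or collects.
    def loadLines(sc: SparkContext, path: String) =
      sc.newAPIHadoopFile(path,
          classOf[TextInputFormat],
          classOf[LongWritable],
          classOf[Text])
        .map { case (k, v) => (k.get, v.toString) } // materialize a copy per record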

It would be very useful if we could develop your lightweight WARC reading code into a Spark extension that works well - there is no obvious alternative as far as I can see! I can help make it happen, with a bit of guidance from your end. Let's take that discussion offline.

@arjenpdevries
Author

(Maybe the reader should fill a partition with records, handed over by Spark, one at a time.)
