Value should be copied #5
Hey Arjen, thanks for looking into this! As you noticed correctly, the repo has been pretty much inactive; I have not been using or maintaining it for quite some time. I've almost entirely switched to Spark for web archive data processing at the Archive and, although this input format could very well be used with Spark as well, I've been using different mechanisms there to split up WARCs in slightly more lightweight ways. It's been on my list for a while to turn that new code into a new input format that can be used with Hadoop as a replacement for this one, but I just haven't gotten to it yet.

Actually, it's considered good practice to reuse the value object in Hadoop input formats, so the caller is expected to copy it if it needs to keep it around. If that's not the case and this turns out to be in fact a problem of this input format, I'm sure the bug is somewhere deeper, and that is really what should be fixed. So, if you get a chance to investigate and potentially fix that, I would highly appreciate it and be happy to take a PR. Otherwise, please let me know and we might just take a PR of your above suggestion as a workaround for now.
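The reuse convention mentioned above can be illustrated with a minimal, self-contained sketch. `ReusingReader` below is a hypothetical stand-in for a Hadoop-style `RecordReader`, not code from this repo; it shows why a caller that stores the returned value object without copying ends up with many copies of the last record:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a Hadoop-style RecordReader that reuses one
// mutable value object across calls to nextKeyValue()/getCurrentValue().
class ReusingReader {
    private final String[] records = {"rec-a", "rec-b", "rec-c"};
    private final StringBuilder value = new StringBuilder(); // reused buffer
    private int pos = -1;

    boolean nextKeyValue() {
        if (++pos >= records.length) return false;
        value.setLength(0);          // overwrite in place, as Hadoop readers do
        value.append(records[pos]);
        return true;
    }

    StringBuilder getCurrentValue() { return value; }
}

public class ReusePitfall {
    public static void main(String[] args) {
        // Naive caller: stores the reused object, so every list entry ends up
        // pointing at the same buffer, which finally holds the last record.
        ReusingReader r1 = new ReusingReader();
        List<StringBuilder> naive = new ArrayList<>();
        while (r1.nextKeyValue()) naive.add(r1.getCurrentValue());

        // Correct caller: copies the value before storing it.
        ReusingReader r2 = new ReusingReader();
        List<String> copied = new ArrayList<>();
        while (r2.nextKeyValue()) copied.add(r2.getCurrentValue().toString());

        System.out.println(naive);   // [rec-c, rec-c, rec-c]
        System.out.println(copied);  // [rec-a, rec-b, rec-c]
    }
}
```

This is the same effect you see in Spark when buffering records from a Hadoop input format without copying them first; the copy can live either in the job code or inside the reader.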
Ah, I used this for reading WARCs in Spark; it seemed like more isolated code than anything else I could find that actually works. What happens now in Spark when you use the reader based on your code? It would be very useful if we could develop your WARC reading extension for Spark so that it is lightweight and works well; there is no obvious alternative solution as far as I could see! I can help to make it happen, with a bit of guidance from your end. Let's take that discussion offline.
(Maybe the reader should fill a partition with records, handed over by Spark, one at a time.)
Using this code, I got many copies of the same item... it turns out this is the issue:
HadoopConcatGz/src/main/java/de/l3s/concatgz/io/warc/WarcGzInputFormat.java (line 72 in commit 3098893)
I now replaced it by:
Good idea? Seems to work correctly in my tests, but they were only on small data so far.
I can make a pull request if that is still useful; the repo has been inactive...
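As a hedged illustration of the kind of workaround being proposed here (handing out a fresh value object per record instead of mutating one shared instance), here is a self-contained sketch; `FixedReader` and `WarcValue` are hypothetical names for illustration, not the actual HadoopConcatGz code or the exact line-72 change:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Tiny stand-in for a mutable Writable-like value, e.g. a WARC record buffer.
class WarcValue {
    final byte[] bytes;
    WarcValue(byte[] bytes) { this.bytes = bytes; }
    @Override public String toString() { return new String(bytes); }
}

// Illustrative reader that allocates a fresh value object per record,
// so references retained by the caller (e.g. Spark) stay valid.
class FixedReader {
    private final byte[][] records = {
        "warc-1".getBytes(), "warc-2".getBytes(), "warc-3".getBytes()
    };
    private int pos = -1;
    private WarcValue current;

    boolean nextKeyValue() {
        if (++pos >= records.length) return false;
        // Copy the underlying bytes into a new value instead of reusing one.
        current = new WarcValue(Arrays.copyOf(records[pos], records[pos].length));
        return true;
    }

    WarcValue getCurrentValue() { return current; }
}

public class CopyOnReadFix {
    public static void main(String[] args) {
        FixedReader reader = new FixedReader();
        List<WarcValue> values = new ArrayList<>();
        while (reader.nextKeyValue()) values.add(reader.getCurrentValue());
        System.out.println(values); // prints [warc-1, warc-2, warc-3]
    }
}
```

The trade-off is one extra allocation and copy per record, which is exactly why Hadoop readers traditionally reuse the value object; whether the copy belongs inside the input format or in the calling job is the question discussed above.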