I've been working with ARC files a fair bit recently, so I've become quite familiar with the error message that appears each time a file is loaded: a warning from `copyStream` that it read fewer bytes from a record than the record's declared length.
I noticed that for the files I was working with, the size delta was always the same (76 bytes). I looked into it and realized that this is the combined length of the second and third lines of the version block at the top of every file.
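For reference, this is the shape of the block (the `filedesc` line here is the illustrative one from the ARC format description; real filenames, dates, and lengths differ):

```
filedesc://IA-001102.arc 0.0.0.0 19960923142103 text/plain 76
1 1 InternetArchive
URL IP-address Archive-date Content-type Archive-length
```

Counting their trailing newlines, the second and third lines add up to exactly 76 bytes.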
These two lines are counted in the total record size but are skipped over when the record content is read, so when `copyStream` encounters these records it always expects more bytes than it gets. The second line carries the file format version (e.g., 1.1) and the origin of the file, and the third line is currently hard-coded in `org.archive.io.arc.ARCRecord`.
In branch `arc-tobytes` I modified `ArcRecordUtils` to fix this, but the fix is incomplete because there is no simple way to get the origin of the archive file. For the time being I've hard-coded `"InternetArchive"`, which works for all the ARC files we have. There is also no way, using the `ARCRecord`-related classes, to recover the third line, although this is probably even less likely to actually cause an issue. (Are there non-WAC tools producing ARCs out there?)
My code changes `toBytes()` and `getContent()`. The former (note the `if (meta.getOffset() == 0)` block):
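Roughly, it looks like this (a sketch, not the exact branch contents; it lives in `ArcRecordUtils` and assumes that class's existing `copyStream(InputStream, int, boolean, DataOutputStream)` helper, and the `"1 1"` version string is hard-coded here for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.archive.io.arc.ARCRecord;
import org.archive.io.arc.ARCRecordMetaData;

public static byte[] toBytes(ARCRecord record) throws IOException {
  ARCRecordMetaData meta = record.getMetaData();

  String metaline = meta.getUrl() + " " + meta.getIp() + " " + meta.getDate() + " "
      + meta.getMimetype() + " " + (int) meta.getLength();

  ByteArrayOutputStream baos = new ByteArrayOutputStream();
  DataOutputStream dout = new DataOutputStream(baos);
  dout.write(metaline.getBytes());
  dout.write("\n".getBytes());

  if (meta.getOffset() == 0) {
    // The version block is the first record in the file. Its second and
    // third lines are counted in the record length but skipped by the
    // reader, so write them back explicitly. The origin ("InternetArchive")
    // and the field-description line are hard-coded because neither is
    // recoverable through the ARCRecord-related classes.
    dout.write("1 1 InternetArchive\n".getBytes());
    dout.write("URL IP-address Archive-date Content-type Archive-length\n".getBytes());
  }

  copyStream(record, (int) meta.getLength(), true, dout);

  return baos.toByteArray();
}
```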
I can't think of any cases where `meta.getOffset()` would be 0 other than at the beginning of a file, which should always start with the version block. I suppose we could also check that the URL begins with `filedesc://` to be extra safe, e.g.:
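A tightened guard would look like this (same assumptions as the sketch above):

```java
// The version block is both the first record in the file and a
// filedesc:// record, so require both before re-emitting its lines.
if (meta.getOffset() == 0 && meta.getUrl().startsWith("filedesc://")) {
  dout.write("1 1 InternetArchive\n".getBytes());
  dout.write("URL IP-address Archive-date Content-type Archive-length\n".getBytes());
}
```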
@lintool, what do you think about this? Should I make a request to the @iipc folks to store the info from lines 2 and 3 of the block in `ARCRecord`? I suppose we could always extend their classes. Or drop the error checking. I think it would be nice to get this fixed one way or another, so that users don't see error messages suggesting data loss.