This repository has been archived by the owner on Apr 27, 2018. It is now read-only.

Loading ARC files produces record size error #199

Open
jrwiebe opened this issue Feb 19, 2016 · 2 comments

Comments

@jrwiebe
Collaborator

jrwiebe commented Feb 19, 2016

I've been working with ARC files a fair bit recently, so I've become quite familiar with the error message that appears each time a file is loaded. It looks something like this:

ERROR ArcRecordUtils - Read 1222 bytes but expected 1298 bytes. Continuing...

I noticed that for the files I was working with, the size delta was always the same (76 bytes). I looked into this and realized this was the length of the second and third lines of the version block at the top of every file. E.g.:

1 1 InternetArchive
URL IP-address Archive-date Content-type Archive-length

These two lines are counted in the total record size but they are skipped over when the record content is read, so when copyStream encounters these records it always expects more. The second line represents the file format version (e.g., 1.1) and the origin of the file, and the third line is currently hard-coded in org.archive.io.arc.ARCRecord.
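The 76-byte figure can be checked by hand. The following is a minimal sketch, assuming each of the two lines ends in a single "\n" and contains only ASCII; the class and method names are hypothetical and are not part of ArcRecordUtils or webarchive-commons.

```java
import java.nio.charset.StandardCharsets;

class VersionBlockSize {
    // Third line of the version block, as quoted above.
    static final String FIELD_NAMES =
        "URL IP-address Archive-date Content-type Archive-length";

    // Byte length of lines 2 and 3 of the version block, newlines included.
    // Assumes a single "\n" terminator per line (an assumption, not confirmed here).
    static int skippedBytes(String versionLine) {
        String skipped = versionLine + "\n" + FIELD_NAMES + "\n";
        return skipped.getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        int expected = 1298; // "expected" value from the log line above
        int read = 1222;     // "Read" value from the log line above
        System.out.println("delta   = " + (expected - read));
        System.out.println("skipped = " + skippedBytes("1 1 InternetArchive"));
    }
}
```

Both values come out to 76 (20 bytes for the version line plus 56 for the field-name line), which matches the delta in the error message.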

In branch arc-tobytes I modified ArcRecordUtils to fix this, but the fix is incomplete because there is no simple way to get the origin of the archive file. For the time being I've hard-coded "InternetArchive", which works for all the ARC files we have. There is also no way, using the ARCRecord-related classes, to recover the third line, although this is probably even less likely to actually cause an issue. (Are there non-WAC tools producing ARCs out there?)

My code changes toBytes() and getContent(). The former (note the if (meta.getOffset() == 0) block):

public static byte[] toBytes(ARCRecord record) throws IOException {
    ARCRecordMetaData meta = record.getMetaData();

    // Rebuild the record's header line from its metadata.
    String metaline = meta.getUrl() + " " + meta.getIp() + " " + meta.getDate() + " "
        + meta.getMimetype() + " " + (int) meta.getLength() + "\n";
    String versionEtc = "";

    // The record at offset 0 is the version block; restore the two lines
    // that are skipped when the record content is read.
    if (meta.getOffset() == 0) {
        versionEtc = meta.getVersion().replace(".", " ")
            + " InternetArchive\n" // Should have meta.getOrigin()
            + "URL IP-address Archive-date Content-type Archive-length\n";
        metaline += versionEtc;
    }

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dout = new DataOutputStream(baos);
    dout.write(metaline.getBytes());
    // Those lines count toward the declared length, so read that much less.
    copyStream(record, (int) meta.getLength() - versionEtc.length(), true, dout);

    return baos.toByteArray();
}

I can't think of any cases where meta.getOffset() would be 0 other than at the beginning of a file, which should always start with the version block. I suppose we could also check that the URL begins with filedesc:// to be extra safe.
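The stricter check could look something like the sketch below. It is a standalone illustration with hypothetical names, not code from the arc-tobytes branch; in toBytes() the two arguments would presumably come from meta.getOffset() and meta.getUrl().

```java
class VersionBlockCheck {
    // True only for a record at offset 0 whose URL uses the filedesc:// scheme.
    // Hypothetical helper: the name and signature are assumptions for illustration.
    static boolean isVersionBlock(long offset, String url) {
        return offset == 0 && url != null && url.startsWith("filedesc://");
    }

    public static void main(String[] args) {
        System.out.println(isVersionBlock(0, "filedesc://example.arc"));
        System.out.println(isVersionBlock(0, "http://example.com/"));
        System.out.println(isVersionBlock(4096, "filedesc://example.arc"));
    }
}
```

With this guard, the version lines would only be re-added when both conditions hold, so an ordinary record that somehow reported offset 0 would be left alone.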

@lintool, what do you think about this? Should I make a request to the @iipc folks to store the info from lines 2 and 3 of the block in ARCRecord? I suppose we could always extend their classes. Or drop the error checking. I think it would be nice to get this fixed one way or another, so that users don't see error messages suggesting data loss.

@jrwiebe
Collaborator Author

jrwiebe commented May 13, 2016

When the next version of webarchive-commons (1.1.7) is released it will fix this issue.

@anjackson
Contributor

BTW, 1.1.7 is out now.


2 participants