Loading ARC files produces record size error #199

jrwiebe · 2016-02-19T01:56:36Z

I've been working with ARC files a fair bit recently, so I've become quite familiar with the error message that appears each time a file is loaded. It looks something like this:

ERROR ArcRecordUtils - Read 1222 bytes but expected 1298 bytes. Continuing...

I noticed that for the files I was working with, the size delta was always the same (76 bytes). I looked into this and realized this was the length of the second and third lines of the version block at the top of every file. E.g.:

1 1 InternetArchive
URL IP-address Archive-date Content-type Archive-length

These two lines are counted in the total record size but they are skipped over when the record content is read, so when copyStream encounters these records it always expects more. The second line represents the file format version (e.g., 1.1) and the origin of the file, and the third line is currently hard-coded in org.archive.io.arc.ARCRecord.
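As a sanity check on the 76-byte figure, here is a minimal sketch that adds up the byte length of those two lines. The class and method names (`VersionBlockDelta`, `skippedBytes`) are made up for illustration and are not part of ArcRecordUtils or webarchive-commons; it assumes each line ends in a single `\n` and that the content is plain ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of any existing library.
class VersionBlockDelta {
    // Third line of the version block, as quoted above.
    static final String FIELD_LINE =
        "URL IP-address Archive-date Content-type Archive-length";

    // Byte length of lines 2 and 3 of the version block, each ending in '\n'.
    static int skippedBytes(String version, String origin) {
        String line2 = version.replace(".", " ") + " " + origin + "\n";
        String line3 = FIELD_LINE + "\n";
        return (line2 + line3).getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        int skipped = skippedBytes("1.1", "InternetArchive");
        System.out.println("skipped bytes: " + skipped);                  // 76
        System.out.println("matches log delta: " + (skipped == 1298 - 1222)); // true
    }
}
```

The second line contributes 20 bytes and the third 56, which accounts for the difference between the 1298 bytes expected and the 1222 bytes read.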

In branch arc-tobytes I modified ArcRecordUtils to fix this, but the fix is incomplete because there is no simple way to get the origin of the archive file. For the time being I've hard-coded "InternetArchive", which works for all the ARC files we have. There is also no way, using the ARCRecord-related classes, to recover the third line, although this is probably even less likely to actually cause an issue. (Are there non-WAC tools producing ARCs out there?)

My code changes toBytes() and getContent(). The former (note the if (meta.getOffset() == 0) block):

public static byte[] toBytes(ARCRecord record) throws IOException {
  ARCRecordMetaData meta = record.getMetaData();

  // Rebuild the record's header line from its metadata.
  String metaline = meta.getUrl() + " " + meta.getIp() + " " + meta.getDate() + " "
      + meta.getMimetype() + " " + (int) meta.getLength() + "\n";
  String versionEtc = "";

  // The record at offset 0 is the version block. Lines 2 and 3 of that block
  // are counted in the record length but are not returned when the content is read.
  if (meta.getOffset() == 0) {
    versionEtc = meta.getVersion().replace(".", " ") +
        " InternetArchive\n" + // Should have meta.getOrigin()
        "URL IP-address Archive-date Content-type Archive-length\n";
    metaline += versionEtc;
  }

  ByteArrayOutputStream baos = new ByteArrayOutputStream();
  DataOutputStream dout = new DataOutputStream(baos);
  dout.write(metaline.getBytes());
  // Expect versionEtc.length() fewer bytes from the stream (0 for ordinary records).
  copyStream(record, (int) meta.getLength() - versionEtc.length(), true, dout);

  return baos.toByteArray();
}

I can't think of any cases where meta.getOffset() would be 0 other than at the beginning of a file, which should always start with the version block. I suppose we could also check that the URL begins with filedesc:// to be extra safe.
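A rough sketch of what that stricter check might look like, written against plain values so it runs on its own. `VersionBlockCheck` and `isVersionBlock` are hypothetical names; in toBytes() the arguments would presumably come from meta.getOffset() and meta.getUrl().

```java
// Hypothetical helper for illustration only; not part of ArcRecordUtils.
class VersionBlockCheck {
    // Treat a record as the version block only if it sits at offset 0
    // and its URL uses the filedesc:// scheme.
    static boolean isVersionBlock(long offset, String url) {
        return offset == 0 && url != null && url.startsWith("filedesc://");
    }

    public static void main(String[] args) {
        // Example values are made up.
        System.out.println(isVersionBlock(0, "filedesc://example.arc"));    // true
        System.out.println(isVersionBlock(0, "http://example.com/"));       // false
        System.out.println(isVersionBlock(1298, "filedesc://example.arc")); // false
    }
}
```

Requiring both conditions means a record that somehow reports offset 0 without being a filedesc record would still be copied with its full declared length.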

@lintool, what do you think about this? Should I make a request to the @iipc folks to store the info from lines 2 and 3 of the block in ARCRecord? I suppose we could always extend their classes. Or drop the error checking. I think it would be nice to get this fixed one way or another, so that users don't see error messages suggesting data loss.

jrwiebe · 2016-05-13T16:07:21Z

When the next version of webarchive-commons (1.1.7) is released it will fix this issue.

anjackson · 2016-06-17T11:36:24Z

BTW, 1.1.7 is out now.

jrwiebe mentioned this issue Mar 8, 2016

Store origin-code in ARCRecord header iipc/webarchive-commons#52

Merged

jrwiebe self-assigned this May 13, 2016

ianmilligan1 mentioned this issue Nov 30, 2017

ERROR ArcRecordUtils - Read 1224 bytes but expected 1300 bytes archivesunleashed/aut#128

Closed
