Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid format exception in scanJobLog #239

Closed
ato opened this issue Mar 11, 2019 · 7 comments
Closed

Invalid format exception in scanJobLog #239

ato opened this issue Mar 11, 2019 · 7 comments
Labels

Comments

@ato
Copy link
Collaborator

ato commented Mar 11, 2019

@ruebot encountered the following exception after checkpointing and restarting Heritrix:

2019-03-11 14:08:53.599 SEVERE thread-1 org.archive.crawler.framework.Engine.addJobDirectory() bad cxml: /data/heritrix-jobs/academic-calendars/crawler-beans.cxml
java.lang.IllegalArgumentException: Invalid format: "539Z" is malformed at "Z"
    at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:634)
    at org.joda.time.convert.StringConverter.getInstantMillis(StringConverter.java:65)
    at org.joda.time.base.BaseDateTime.<init>(BaseDateTime.java:171)
    at org.joda.time.DateTime.<init>(DateTime.java:168)
    at org.archive.crawler.framework.CrawlJob.scanJobLog(CrawlJob.java:179)
    at org.archive.crawler.framework.CrawlJob.<init>(CrawlJob.java:101)
    at org.archive.crawler.framework.Engine.addJobDirectory(Engine.java:153)
    at org.archive.crawler.framework.Engine.findJobConfigs(Engine.java:109)
    at org.archive.crawler.framework.Engine.<init>(Engine.java:72)
    at org.archive.crawler.Heritrix.instanceMain(Heritrix.java:335)
    at org.archive.crawler.Heritrix.main(Heritrix.java:188)

Looks like a bug here:

startPosition = jobLog.length()-(FileUtils.ONE_KB * 100);

If the job log is larger than 100KB, startPosition is set to 100KB from the end which might be in the middle of a line. If that partial line still happens to match Pattern.compile("(\\S+) (\\S+) Job launched") then an incomplete timestamp may be parsed causing the exception.

@anjackson suggests the following fix:

it should read a line and discard it if the startPosition is not zero.

@ato ato added the bug label Mar 11, 2019
anjackson added a commit to ukwa/heritrix3 that referenced this issue Mar 11, 2019
@anjackson
Copy link
Collaborator

I have a patch but haven't tested it yet.

@ruebot
Copy link
Collaborator

ruebot commented Mar 11, 2019

@anjackson happy to test with that jobs.log file on another machine if you need me to.

@anjackson
Copy link
Collaborator

@ruebot Awesome, thanks. I've popped a SNAPSHOT build here:

https://oss.sonatype.org/content/repositories/snapshots/org/archive/heritrix/heritrix/3.4.0-SNAPSHOT/heritrix-3.4.0-20190311.214753-2-dist.tar.gz

Can you give it a spin?

@ruebot
Copy link
Collaborator

ruebot commented Mar 12, 2019

Good to go on my end:

$ tail -f job.log
2019-03-01T12:01:32.130Z INFO Job instantiated
2019-03-01T12:01:42.293Z INFO Job launched
2019-03-01T12:01:43.923Z INFO PREPARING 20190301120143
2019-03-01T12:01:43.955Z INFO PAUSED 20190301120143
2019-03-01T12:01:54.079Z INFO RUNNING 20190301120143
2019-03-01T12:02:01.454Z WARNING nowhere to log added seed: http://calendars.students.yorku.ca/ (in thread 'ToeThread #21: http://calendars.registrar.yorku.ca/'; in processor 'candidates')
2019-03-01T12:02:07.490Z WARNING nowhere to log added seed: https://calendars.students.yorku.ca/ (in thread 'ToeThread #21: http://calendars.students.yorku.ca/'; in processor 'candidates')
2019-03-11T12:26:03.810Z INFO CHECKPOINTED cp00001-20190311122554
2019-03-11T12:26:06.453Z INFO PAUSING 20190301120143
2019-03-11T12:26:07.702Z INFO PAUSED 20190301120143
$ cat heritrix_out.log
Tue Mar 12 08:32:54 EDT 2019 Starting heritrix
Linux wombat 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
JAVA_OPTS= -Xmx256m
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256586
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256586
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Oracle Corporation OpenJDK Runtime Environment 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore adhoc.keystore -destkeystore adhoc.keystore -deststoretype pkcs12".
Using ad-hoc HTTPS certificate with fingerprint...
SHA1:E7:65:F2:C6:5A:90:A0:79:61:FF:28:1F:24:81:A6:74:D0:38:2C:0B
Verify in browser before accepting exception.
2019-03-12 12:32:55.523 INFO thread-1 org.archive.crawler.framework.Engine.addJobDirectory() added crawl job: academic-calendars
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
engine listening at port 9191
operator login set per command-line
NOTE: We recommend a longer, stronger password, especially if your web
interface will be internet-accessible.
Heritrix version: 3.4.0-SNAPSHOT-2019-03-11T21:36:31Z
log4j:WARN No appenders could be found for logger (freemarker.cache).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

@ato
Copy link
Collaborator Author

ato commented Mar 13, 2019

Patch looks correct to me.

@anjackson
Copy link
Collaborator

Thanks both!

@anjackson
Copy link
Collaborator

Pull request created: #240

@ato ato closed this as completed Mar 13, 2019
nlevitt added a commit that referenced this issue Apr 1, 2019
* master:
  use constant from rethinkdb lib for default port
  Update README.md
  Handle missing closing paren in srcset descriptor
  Teach jericho extractor srcset
  Don't run srcset test against jericho, it doesn't handle it
  Handle commas more compliantly when parsing srcset
  Ensure we start parsing full lines, for #239.
nlevitt added a commit to nlevitt/heritrix3 that referenced this issue Apr 10, 2019
* master:
  replace System.err.println with logger.info
  Revert "Upgrade httpclient to 4.5.7 and handle cookies more compliantly"
  Removing outdated test.
  Disable questionalbe test.
  Avoid deprecated flag.
  Supply an iterator, for internetarchive#245
  Updated POM to use latest version.
  Update README.md
  Handle missing closing paren in srcset descriptor
  Teach jericho extractor srcset
  Don't run srcset test against jericho, it doesn't handle it
  Handle commas more compliantly when parsing srcset
  Ensure we start parsing full lines, for internetarchive#239.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants