WARCs with datetime before year 1900 cause error in indexer #603

Closed
machawk1 opened this issue Jan 26, 2019 · 8 comments

Comments

@machawk1
Member

ipwb index /Path/to/fb_fab_dates.warc
Traceback (most recent call last):
  File "/Users/machawk1/Library/Python/2.7/bin/ipwb", line 11, in <module>
    sys.exit(main())
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/__main__.py", line 18, in main
    args = checkArgs(sys.argv)
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/__main__.py", line 165, in checkArgs
    results.func(results)
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/__main__.py", line 33, in checkArgs_index
    debug=args.debug)
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/indexer.py", line 170, in indexFileAt
    warcFileFullPath, **encryptionAndCompressionSetting)
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/indexer.py", line 287, in getCDXJLinesFromFile
    record.rec_headers.get_header('WARC-Date'))
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/util.py", line 149, in iso8601ToDigits14
    return d.strftime('%Y%m%d%H%M%S')
ValueError: year=2 is before 1900; the datetime strftime() methods require year >= 1900

fb_fab_dates 2.warc.txt

@machawk1
Member Author

This seems to be a limitation of Python's strftime; potential workarounds are provided here.
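One possible workaround, sketched here rather than taken from ipwb's code, is to skip strftime() entirely and format the fields by hand. The function name `iso8601_to_digits14`, the constant `WARC_DATE_FORMAT`, and the sample date are hypothetical, and the sketch assumes WARC-Date values of the form `YYYY-MM-DDThh:mm:ssZ` with no fractional seconds:

```python
from datetime import datetime

# Assumed layout of a WARC-Date value (UTC with a "Z" suffix, no fractional seconds).
WARC_DATE_FORMAT = '%Y-%m-%dT%H:%M:%SZ'


def iso8601_to_digits14(warc_date):
    """Convert e.g. '2019-01-26T12:34:56Z' to '20190126123456' without strftime()."""
    d = datetime.strptime(warc_date, WARC_DATE_FORMAT)
    # Format each field by hand so the year is always zero-padded to four
    # digits and no strftime() year restriction can apply.
    return '{:04d}{:02d}{:02d}{:02d}{:02d}{:02d}'.format(
        d.year, d.month, d.day, d.hour, d.minute, d.second)


# Hypothetical pre-1900 date, similar in spirit to the fabricated WARC.
print(iso8601_to_digits14('0002-01-01T00:00:00Z'))  # 00020101000000
```

Because strftime() is never called, the year check that raises the ValueError above is never reached.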

@shawnmjones
Member

When would a WARC have a datetime prior to the year 1900?

@machawk1
Member Author

@shawnmjones A WARC generated through conventional means should not, since 1900 predates the creation of both the WARC spec and the Web. The WARC spec cites the W3C profile of ISO 8601:1988 as the basis for WARC-Date. Dates prior to 1900 are legal under that profile, so they should not cause an exception.

However, a date prior to 1900 in this field is likely the result of a misinterpretation, a misconfiguration, or a fabricated example such as the one attached above.

@machawk1
Member Author

I tried this again with 0b9bb3e and the above WARC and was unable to replicate the error. I would like to check whether a fix in Py3's strftime remedied this issue from the runtime side. I am currently running Python 3.7.2 on macOS 10.14.4.

@machawk1
Member Author

machawk1 commented May 27, 2019

On second look, the above is using Python 2.7. Perhaps this was never an issue with Py3. I can replicate the above with Py2:

ipwb index ~/Downloads/fb_fab_dates.2.warc
Traceback (most recent call last):
  File "/usr/local/bin/ipwb", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 18, in main
    args = checkArgs(sys.argv)
  File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 165, in checkArgs
    results.func(results)
  File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 33, in checkArgs_index
    debug=args.debug)
  File "/usr/local/lib/python2.7/site-packages/ipwb/indexer.py", line 170, in indexFileAt
    warcFileFullPath, **encryptionAndCompressionSetting)
  File "/usr/local/lib/python2.7/site-packages/ipwb/indexer.py", line 287, in getCDXJLinesFromFile
    record.rec_headers.get_header('WARC-Date'))
  File "/usr/local/lib/python2.7/site-packages/ipwb/util.py", line 149, in iso8601ToDigits14
    return d.strftime('%Y%m%d%H%M%S')
ValueError: year=2 is before 1900; the datetime strftime() methods require year >= 1900

...which should be moot once we drop Python 2 support per #51.

@ibnesayeed
Member

Yes, this should have been fixed in Py3 as per https://bugs.python.org/issue1777412
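A quick way to confirm this on a given interpreter is a small probe like the one below. It is only a sketch: the helper name `strftime_accepts_year` is made up for illustration, and it checks only that no exception is raised, not how the year is padded (the width of `%Y` for years below 1000 may vary by platform).

```python
import sys
from datetime import datetime


def strftime_accepts_year(year):
    """Return True if datetime.strftime() formats the given year without raising."""
    try:
        datetime(year, 1, 1).strftime('%Y%m%d%H%M%S')
    except ValueError:
        # Python 2.7 raises ValueError here for years before 1900.
        return False
    return True


# Report the interpreter version alongside the result for year 2,
# the year shown in the traceback above.
print(sys.version_info[:3], strftime_accepts_year(2))
```

Running it under both Py2 and Py3 should show whether the difference comes from the runtime rather than from ipwb itself.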

@machawk1
Member Author

Thanks for finding this reference, @ibnesayeed. With the move to Py3 in #609, we should be able to close this issue, #608, and finally #51.

@machawk1
Member Author

Closed in #609.
