Add enron failures #3

Open
wants to merge 1 commit into
base: master
from

jarib commented Oct 8, 2018

This branch adds a few documents that fail when indexing the Enron dataset.

  1. Tika returns HTTP 415 (Unsupported Media Type) for a few files, probably because our MIME type detection gets them wrong. Not a big deal.

  2. The file Enclosed.txt gives the following error:

 Traceback (most recent call last):
   File "/opt/hoover/snoop/snoop/data/tasks.py", line 146, in run_task
     result = func(*args, **depends_on)
   File "/opt/hoover/snoop/snoop/data/digests.py", line 104, in index
     content = _get_document_content(digest)
   File "/opt/hoover/snoop/snoop/data/digests.py", line 277, in _get_document_content
     content.update(email_meta(digest_data))
   File "/opt/hoover/snoop/snoop/data/digests.py", line 193, in email_meta
     message_date = zulu(email.parse_date(message_raw_date))
   File "/opt/hoover/snoop/snoop/data/analyzers/email.py", line 153, in parse_date
     return email.utils.parsedate_to_datetime(raw_date)
   File "/usr/local/lib/python3.6/email/utils.py", line 210, in parsedate_to_datetime
     *dtuple, tz = _parsedate_tz(data)
 TypeError: 'NoneType' object is not iterable
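
The traceback shows `email.utils.parsedate_to_datetime` failing on a date header it cannot parse: on Python 3.6 the internal `_parsedate_tz` returns `None`, and unpacking that raises `TypeError` (newer Python versions raise `ValueError` instead). One possible way to tolerate such headers is sketched below. The helper name `safe_parse_date` and the choice to return `None` are assumptions for illustration, not the actual snoop code.

```python
import email.utils
from datetime import datetime, timezone


def safe_parse_date(raw_date):
    """Hypothetical helper: parse an RFC 2822 date header, or return None.

    parsedate_to_datetime raises TypeError on older Python versions
    (such as 3.6) and ValueError on newer ones when the header cannot
    be parsed, so both are caught here.
    """
    try:
        return email.utils.parsedate_to_datetime(raw_date)
    except (TypeError, ValueError):
        # Unparseable date header: report "no date" instead of crashing.
        return None


if __name__ == "__main__":
    print(safe_parse_date("Mon, 08 Oct 2018 10:00:00 +0000"))
    print(safe_parse_date("not a date"))
```

A caller such as `email_meta` would then need to skip the date field when the result is `None`, rather than passing it on for conversion.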

@wafflebot wafflebot bot assigned jarib Oct 8, 2018

@wafflebot wafflebot bot added the in progress label Oct 8, 2018

jarib commented Oct 8, 2018

If this is merged, the integration test in the snoop2 test suite will fail on these documents.
