
Invalid dates on a "_date" ending field will crash the indexing #267

Merged
merged 4 commits into master on Mar 13, 2013

Conversation

tobes
Contributor

@tobes tobes commented Mar 11, 2013

Commit 783cf82 introduced a new dynamic field on the Solr schema that automatically indexes as a date every field ending in "_date". Unfortunately Solr is quite picky about the format and will return an error even for values like "2012-01-02":

SearchIndexError: HTTP code=400, reason=Invalid Date String:'2012-01-02' 

We need to check each date before indexing and convert it to a format Solr accepts. It may be worth adding dateutil as a dependency for this.
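For context, Solr only accepts full ISO 8601 timestamps with a trailing Z for date fields, so a bare date like "2012-01-02" is rejected. A minimal sketch of the kind of normalisation being proposed, using the `parse()` function from python-dateutil (the helper name here is hypothetical, not CKAN code):

```python
from dateutil.parser import parse


def normalise_solr_date(value):
    """Convert a loosely formatted date string into the full ISO 8601
    form Solr expects (e.g. '2012-01-02T00:00:00Z').

    Returns None if the value cannot be parsed as a date.
    """
    try:
        # dateutil fills in missing time components with midnight
        return parse(value).isoformat() + 'Z'
    except (ValueError, TypeError):
        # ParserError subclasses ValueError; None for unparseable input
        return None
```

For example, `normalise_solr_date('2012-01-02')` yields `'2012-01-02T00:00:00Z'`, which Solr will accept.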

@ghost ghost assigned amercader Feb 5, 2013
@tobes
Contributor

tobes commented Feb 20, 2013

I've just been hit by this and had to create the following patch to fix things.

It is horrible on many levels.

I'd really like to only 'repair' the dict if Solr throws an error, as going through all the keys seems like wasted effort/time. Also, I think my None may still kill Solr.

It would be nice to get somewhere on this even if we just skip indexing the 'broken' record; currently the whole indexing run breaks, which is not so nice.

Using dateutil seems sensible.

diff --git a/ckan/lib/search/index.py b/ckan/lib/search/index.py
index 87b5606..98a00dc 100644
--- a/ckan/lib/search/index.py
+++ b/ckan/lib/search/index.py
@@ -6,6 +6,7 @@ import json

 import re

+from dateutil.parser import parse
 from pylons import config
 from paste.deploy.converters import asbool

@@ -227,6 +228,13 @@ class PackageSearchIndex(SearchIndex):

         assert pkg_dict, 'Plugin must return non empty package dict on index'

+        for key in pkg_dict:
+            if key.endswith('_date'):
+                try:
+                    pkg_dict[key] = parse(pkg_dict[key]).isoformat() + 'Z'
+                except ValueError:
+                    pkg_dict[key] = None
+
         # send to solr:
         try:
             conn = make_connection()

tobes added a commit that referenced this pull request Feb 20, 2013
@tobes
Contributor

tobes commented Mar 5, 2013

@amercader it would be good to make some progress on this.

I'd be happy to try to fix this correctly (i.e. repair only when Solr chokes) if you have no time.

@amercader
Member Author

@tobes I'm un-assigning myself from this as I don't really have time now. If you can fix it at some point that would be great, and I'm happy to review it.
I'm afraid walking through the whole dict to look for _date fields is the safest option, because virtually all _date fields will fail on Solr, which wants full ISO dates with a Z at the end.
Also, if there are several _date fields, would we need to keep re-sending the dict to Solr until all of them are fixed?
Note that we are already iterating over the dict items on this line (195), so maybe it could be refactored to also check for _date fields:

pkg_dict = dict([(k.encode('ascii', 'ignore'), v) for (k, v) in pkg_dict.items()])
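The single-pass refactor being suggested might look something like this sketch (not the code actually merged in the PR; written with Python 3 string semantics, where an encode/decode round-trip drops the non-ASCII characters, whereas the original CKAN code ran under Python 2):

```python
from dateutil.parser import parse


def clean_pkg_dict(pkg_dict):
    """One pass over the dict: strip non-ASCII characters from keys
    (what the existing comprehension on line 195 does) and normalise
    any '*_date' value into the ISO form Solr accepts."""
    cleaned = {}
    for key, value in pkg_dict.items():
        # drop characters that can't be encoded as ASCII
        key = key.encode('ascii', 'ignore').decode('ascii')
        if key.endswith('_date'):
            try:
                value = parse(value).isoformat() + 'Z'
            except (ValueError, TypeError):
                # unparseable date: blank the field rather than
                # crash the whole indexing run
                value = None
        cleaned[key] = value
    return cleaned
```

This keeps the date handling in the same loop as the key cleanup, so the dict is only walked once.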

@tobes
Contributor

tobes commented Mar 11, 2013

@amercader
Thanks, I'll look at your suggested fix point as that seems better than iterating twice.

@tobes
Contributor

tobes commented Mar 11, 2013

@amercader this is now done. Seems like it should be in 2.0.

The owner_org -> organization change fixed an issue that affected me, but it will be changed again in a different branch that is not yet merged.

@ghost ghost assigned amercader Mar 12, 2013
@@ -29,3 +29,4 @@ Jinja2==2.6
 fanstatic==0.12
 requests==1.1.0
 WebTest==1.4.3
+dateutils==0.6.5
Contributor

Are you sure you want dateutils and not the plain python-dateutil>=1.5.0,<2.0.0?

Contributor Author

What's the difference? All I want is parse().

Contributor

As I understand it, dateutils is a library that uses python-dateutil. If you only want parse(), use python-dateutil.

Contributor Author

OK, I'm lost here. What would you do to make this how you want?

Contributor

:-) Just replace line 32 with python-dateutil>=1.5.0,<2.0.0. It will reduce the number of dependencies, since only python-dateutil is required, not dateutils.

Contributor Author

Cool, let's see what Travis thinks.

# FIXME where are we getting these dirty keys from? can we not just
# fix them in the correct place or is this something that always will
# be needed? For my data not changing the keys seems to not cause a
# problem.
Member

@tobes let's not pollute the code with long FIXME comments like this one. Ping the list, open an issue, or better yet comment on the source on GitHub. 👮
I'm not sure where we would get the dirty keys from, to be honest.

Contributor Author

I sort of think that adding FIXMEs is a good thing. Adding this as an issue on GitHub seems silly, as it is unlikely ever to get fixed unless someone is actually cleaning up the code around it.

Maybe we should decide on this sort of thing as a team. In this particular instance it looks like we are doing unneeded work: key = key.encode('ascii', 'ignore') does nothing at all on the data I have. Maybe it is needed, but I cannot see where that would happen. It just looks like a dirty fix that was made at some point, with no comment explaining why, and it adds complexity and slowness to indexing.

Personally I think adding FIXMEs is good, as I hope it will eventually lead to better code.
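For reference, this is all that key.encode('ascii', 'ignore') does: it silently drops any characters that cannot be encoded as ASCII (Python 3 semantics shown below, where the result is bytes; under Python 2, which the original CKAN code targeted, it returned a plain str):

```python
# 'ignore' means undecodable characters are dropped, not replaced
dirty = 'café_date'
clean = dirty.encode('ascii', 'ignore')
print(clean)  # b'caf_date'
```

So for purely ASCII keys (which tobes's data apparently has), the call is indeed a no-op apart from the type change.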

@tobes
Contributor

tobes commented Mar 13, 2013

@amercader
I've just tested rebuilding the index with these commits on top of release-v2.0 and all is good. The data I have includes a date that fails without these fixes.

amercader added a commit that referenced this pull request Mar 13, 2013
@amercader amercader merged commit b928adc into master Mar 13, 2013