
Invalid dates on a "_date" ending field will crash the indexing #267

Merged
merged 4 commits into master on Mar 13, 2013

Conversation

tobes
Contributor

@tobes tobes commented Mar 11, 2013

Commit 783cf82 introduced a new dynamic field on the Solr schema that automatically indexes as a date every field ending in "_date". Unfortunately Solr is quite picky about the format and will return an error even for values like "2012-01-02":

SearchIndexError: HTTP code=400, reason=Invalid Date String:'2012-01-02' 

We need to check each date before indexing and convert it to a format Solr accepts. It may be worth adding dateutil as a dependency for this.
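For context, Solr only accepts full ISO 8601 timestamps with a trailing Z for date fields, so a bare date like "2012-01-02" is rejected. A minimal sketch of the kind of normalisation being proposed, using the `parse()` function from python-dateutil (the helper name here is hypothetical, not CKAN code):

```python
from dateutil.parser import parse


def normalise_solr_date(value):
    """Convert a loosely formatted date string into the full ISO 8601
    form Solr expects (e.g. '2012-01-02T00:00:00Z').

    Returns None if the value cannot be parsed as a date.
    """
    try:
        # dateutil fills in missing time components with midnight
        return parse(value).isoformat() + 'Z'
    except (ValueError, TypeError):
        # ParserError subclasses ValueError; None for unparseable input
        return None
```

For example, `normalise_solr_date('2012-01-02')` yields `'2012-01-02T00:00:00Z'`, which Solr will accept.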

@ghost ghost assigned amercader Feb 5, 2013
@tobes
Contributor

tobes commented Feb 20, 2013

I've just been hit by this and had to create the following patch to fix things.

It is horrible on many levels.

I'd really like to only 'repair' the dict if Solr throws an error, as going through all the keys seems like wasted effort/time. Also, I think my None may still kill Solr.

It would be nice to get somewhere on this even if we just skip indexing the 'broken' record; currently the whole indexing run breaks, which is not so nice.

Using dateutil seems sensible.

diff --git a/ckan/lib/search/index.py b/ckan/lib/search/index.py
index 87b5606..98a00dc 100644
--- a/ckan/lib/search/index.py
+++ b/ckan/lib/search/index.py
@@ -6,6 +6,7 @@ import json

 import re

+from dateutil.parser import parse
 from pylons import config
 from paste.deploy.converters import asbool

@@ -227,6 +228,13 @@ class PackageSearchIndex(SearchIndex):

         assert pkg_dict, 'Plugin must return non empty package dict on index'

+        for key in pkg_dict:
+            if key.endswith('_date'):
+                try:
+                    pkg_dict[key] = parse(pkg_dict[key]).isoformat() + 'Z'
+                except ValueError:
+                    pkg_dict[key] = None
+
         # send to solr:
         try:
             conn = make_connection()

tobes added a commit that referenced this pull request Feb 20, 2013
@tobes
Contributor

tobes commented Mar 5, 2013

@amercader it would be good to make some progress on this.

I'd be happy to try to fix this correctly (i.e. repair only when Solr chokes) if you have no time.

@amercader
Member Author

@tobes I'm un-assigning myself from this as I don't really have time now. If you can fix it at some point that would be great, and I'm happy to review it.
I'm afraid walking through the whole dict to look for _date fields is the safest option, because virtually all _date fields will fail on Solr, which wants full ISO dates with a Z at the end.
Also, if there are several _date fields, would we need to keep re-sending the dict to Solr until all of them are fixed?
Note that we are already iterating over the dict items on this line (195), so maybe it could be refactored to also check for _date fields:

pkg_dict = dict([(k.encode('ascii', 'ignore'), v) for (k, v) in pkg_dict.items()])
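The single-pass refactor being suggested might look something like this sketch (not the code actually merged in the PR; written with Python 3 string semantics, where an encode/decode round-trip drops the non-ASCII characters, whereas the original CKAN code ran under Python 2):

```python
from dateutil.parser import parse


def clean_pkg_dict(pkg_dict):
    """One pass over the dict: strip non-ASCII characters from keys
    (what the existing comprehension on line 195 does) and normalise
    any '*_date' value into the ISO form Solr accepts."""
    cleaned = {}
    for key, value in pkg_dict.items():
        # drop characters that can't be encoded as ASCII
        key = key.encode('ascii', 'ignore').decode('ascii')
        if key.endswith('_date'):
            try:
                value = parse(value).isoformat() + 'Z'
            except (ValueError, TypeError):
                # unparseable date: blank the field rather than
                # crash the whole indexing run
                value = None
        cleaned[key] = value
    return cleaned
```

This keeps the date handling in the same loop as the key cleanup, so the dict is only walked once.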

@tobes
Contributor

tobes commented Mar 11, 2013

@amercader
Thanks, I'll look at your suggested fix point as that seems better than iterating twice.

@tobes
Contributor

tobes commented Mar 11, 2013

@amercader this is now done. Seems like it should be in 2.0.

The owner_org -> organization change fixed an issue that affected me, but it will be changed again in a different branch that is not yet merged.

@ghost ghost assigned amercader Mar 12, 2013
@@ -29,3 +29,4 @@ Jinja2==2.6
 fanstatic==0.12
 requests==1.1.0
 WebTest==1.4.3
+dateutils==0.6.5
Contributor

Are you sure you want dateutils and not the plain python-dateutil>=1.5.0,<2.0.0?

Contributor Author

What's the difference? All I want is parse().

Contributor

As I understand it, dateutils is a library that uses python-dateutil. If you only want parse(), use python-dateutil.

Contributor Author

OK, I'm lost here. What would you do to make this how you want?

Contributor

:-) Just replace line 32 with python-dateutil>=1.5.0,<2.0.0. It will reduce the number of dependencies, since only python-dateutil is required, not dateutils.

Contributor Author

Cool, let's see what Travis thinks.

# FIXME where are we getting these dirty keys from? can we not just
# fix them in the correct place or is this something that always will
# be needed? For my data not changing the keys seems to not cause a
# problem.
Member

@tobes let's not pollute the code with long FIXME comments like this one. Ping the list, open an issue, or better yet comment on the source on GitHub. 👮
I'm not sure where we would get the dirty keys from, to be honest.

Contributor Author

I sort of think that adding FIXMEs is a good thing. Adding this as an issue on GitHub seems silly, as it is unlikely ever to get fixed unless someone is actually cleaning up the code around it.

Maybe we should decide on this sort of thing as a team. In this particular instance it looks like we are doing unneeded work: key = key.encode('ascii', 'ignore') does nothing at all on the data I have. Maybe it is needed, but I cannot see where that would happen. It just looks like a dirty fix that was made at some point, with no comment explaining why, and it adds complexity and slowness to indexing.

Personally I think adding FIXMEs is good, as I hope it will eventually lead to better code.
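For reference, this is all that key.encode('ascii', 'ignore') does: it silently drops any characters that cannot be encoded as ASCII (Python 3 semantics shown below, where the result is bytes; under Python 2, which the original CKAN code targeted, it returned a plain str):

```python
# 'ignore' means undecodable characters are dropped, not replaced
dirty = 'café_date'
clean = dirty.encode('ascii', 'ignore')
print(clean)  # b'caf_date'
```

So for purely ASCII keys (which tobes's data apparently has), the call is indeed a no-op apart from the type change.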

@tobes
Contributor

tobes commented Mar 13, 2013

@amercader
I've just tested rebuilding the index with these commits on top of release-v2.0 and all is good. The data I have includes a date that fails without these fixes.

amercader added a commit that referenced this pull request Mar 13, 2013
@amercader amercader merged commit b928adc into master Mar 13, 2013