Invalid dates on a "_date" ending field will crash the indexing #267
diff --git a/ckan/lib/search/index.py b/ckan/lib/search/index.py
It is horrible on many levels. I'd really like to only 'repair' the dict if Solr throws an error, as going through all the keys seems like a waste of effort and time.
It would be nice to get somewhere on this. Even just not indexing the 'broken' record would be an improvement; currently the whole indexing breaks, which is not so nice.
Using dateutil seems sensible.
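The "only repair when Solr throws an error" idea above could look roughly like the sketch below. Everything here is hypothetical: `SolrRejected` stands in for whatever exception the Solr client actually raises, and `send` and `repair` are placeholder callables, not CKAN functions.

```python
class SolrRejected(Exception):
    """Hypothetical stand-in for the error raised when Solr rejects a document."""


def index_with_repair(pkg_dict, send, repair):
    """Index pkg_dict as-is; repair and retry once only if Solr rejects it.

    Returns True if the record was indexed, False if it was skipped.
    """
    try:
        send(pkg_dict)
        return True
    except SolrRejected:
        pass  # fall through to the slower repair path
    try:
        send(repair(pkg_dict))
        return True
    except SolrRejected:
        # Skip the broken record instead of aborting the whole indexing run
        return False
```

The common case pays nothing for the repair step, and a record that still fails after repair is skipped rather than breaking the entire index run.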
@amercader it would be good to make some progress on this. I'd be happy to try to fix this correctly, i.e. only when Solr chokes, if you have no time.
@tobes I'm un-assigning myself from this as I don't really have time now. If you can fix it at some point that would be great; I'm happy to review it.
pkg_dict = dict([(k.encode('ascii', 'ignore'), v) for (k, v) in pkg_dict.items()])
@amercader this is now done. Seems like it should be in 2.0. The owner_org -> organization change fixed an issue that affected me, but it will be changed in a different branch that is not yet merged.
@@ -29,3 +29,4 @@ Jinja2==2.6
 fanstatic==0.12
 requests==1.1.0
 WebTest==1.4.3
+dateutils==0.6.5
Are you sure you want dateutils and not the plain python-dateutil>=1.5.0,<2.0.0?
what's the difference? All I want is parse()
As I understand it, dateutils is a library that uses python-dateutil. If you only want parse(), use python-dateutil.
OK, I'm lost here. What would you change to make this how you want it?
:-) Just replace line 32 with python-dateutil>=1.5.0,<2.0.0. It will reduce the number of dependencies, since only python-dateutil is required, not dateutils.
Cool, let's see what Travis thinks.
# FIXME where are we getting these dirty keys from? can we not just
# fix them in the correct place or is this something that always will
# be needed? For my data not changing the keys seems to not cause a
# problem.
@tobes let's not pollute the code with long FIXME comments like this one. Ping the list, open an issue or better yet comment on source on GitHub. 👮
I'm not sure where we would get the dirty keys from, to be honest.
I sort of think that adding FIXMEs is a good thing to do. Adding this as an issue on GitHub seems silly, as it is unlikely to ever get fixed unless someone is actually cleaning up the code around it.
Maybe we should decide on this sort of thing as a team. In this particular instance it looks like we are doing unneeded work: key = key.encode('ascii', 'ignore') does nothing at all on the data I have. Maybe it is needed, but I cannot see where this happens. It just looks like a dirty fix that was made at some point, with no comment explaining why we are doing it, and it adds complexity and slowness to indexing.
Personally I think adding FIXMEs is good, as I hope it will eventually lead to better code.
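To make the point about the key encoding concrete, here is a small sketch of what that step does. The thread's code is Python 2; this version is written for Python 3, so it decodes back to `str`, and the helper names `ascii_key` and `clean_keys` are made up for illustration.

```python
def ascii_key(key):
    """Drop non-ASCII characters from a key, as encode('ascii', 'ignore') does."""
    return key.encode('ascii', 'ignore').decode('ascii')


def clean_keys(pkg_dict):
    """Return a copy of pkg_dict with every key passed through ascii_key()."""
    return {ascii_key(k): v for k, v in pkg_dict.items()}


# Keys that are already plain ASCII come back unchanged, so the pass is a no-op
print(clean_keys({'title': 'x', 'start_date': '2012-01-02'}))
# A non-ASCII character is silently dropped from the key
print(ascii_key('caf\u00e9_date'))
```

For data whose keys are already ASCII the extra pass changes nothing, which matches the observation above. It only matters if keys can contain non-ASCII characters, and then two distinct keys could collapse into one.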
Commit 783cf82 introduced a new dynamic field in the Solr schema that automatically indexes every field ending in "_date" as a date. Unfortunately, Solr is quite picky about the date format and will return an error even for values like "2012-01-02".
There needs to be a check before indexing that converts the date to a format suitable for Solr. It may be worth adding dateutil as a dependency for this.
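A minimal sketch of such a check, assuming python-dateutil's `parse()` is used and that Solr expects the `YYYY-MM-DDThh:mm:ssZ` form. The function names `to_solr_date` and `fix_date_fields` are hypothetical, not the ones in the actual patch, and the `strptime` fallback exists only so the sketch runs without python-dateutil installed.

```python
from datetime import datetime

try:
    # python-dateutil, the dependency discussed in this thread
    from dateutil.parser import parse as _parse
except ImportError:
    _parse = None

# Used only when python-dateutil is not installed
_FALLBACK_FORMATS = ('%Y-%m-%dT%H:%M:%S', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d')


def to_solr_date(value):
    """Return value as 'YYYY-MM-DDThh:mm:ssZ', or None if it cannot be parsed."""
    parsed = None
    if _parse is not None:
        try:
            parsed = _parse(value)
        except (ValueError, OverflowError, TypeError):
            return None
    else:
        for fmt in _FALLBACK_FORMATS:
            try:
                parsed = datetime.strptime(value, fmt)
                break
            except (ValueError, TypeError):
                continue
    if parsed is None:
        return None
    # Assumption: any timezone info is simply dropped; a real fix might
    # convert to UTC first.
    return parsed.replace(tzinfo=None, microsecond=0).isoformat() + 'Z'


def fix_date_fields(pkg_dict):
    """Return a copy with '_date' fields normalised; unparseable ones are dropped."""
    fixed = {}
    for key, value in pkg_dict.items():
        if key.endswith('_date'):
            value = to_solr_date(value)
            if value is None:
                continue  # leave the bad value out instead of crashing the index
        fixed[key] = value
    return fixed
```

Dropping an unparseable `_date` value keeps the rest of the record indexable; whether to drop it, log it, or skip the whole record is a design choice this sketch does not settle.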