Skip to content
This repository has been archived by the owner. It is now read-only.

Port Django management command from Kitsune and Kuma. #168

Closed
wants to merge 13 commits into from

Conversation

@jezdez
Copy link
Member

@jezdez jezdez commented Sep 10, 2013

This is work in progress since I need to port over some tests, too ;)

@willkg
Copy link
Member

@willkg willkg commented Sep 10, 2013

Wow! I'm pretty psyched about this. I skimmed it and it looks good so far. I'm pretty hard up for free time, but after you land some tests and docs, I'll make a point of making time to look through it more carefully.

jezdez added 2 commits Sep 10, 2013
refresh_interval = index_settings.get('index.refresh_interval', '1s')

# Disable automatic refreshing
es.update_settings(write_index, {'index': {'refresh_interval': '-1'}})

This comment has been minimized.

@robhudson

robhudson Sep 10, 2013
Member

Another optimization could be to set num_replicas to zero while re-indexing, which reduces copying data between nodes. Once you're done you can set it back to whatever it was and ES will bulk copy to the replication nodes.

This comment has been minimized.

@jezdez

jezdez Sep 11, 2013
Author Member

Nice! Will incorporate this :)

This comment has been minimized.

@jezdez

jezdez Sep 11, 2013
Author Member

@robhudson Do you happen to have a reference somewhere that this is supposed to be used like this?

This comment has been minimized.

@robhudson

robhudson Sep 11, 2013
Member

I've tried to find docs on it but couldn't. If you have enough data locally you could maybe test it both ways and see which is faster. I forget where I learned of this... either the Elasticsearch training or maybe from Hanno while I was writing the reindexing jobs for marketplace?

This comment has been minimized.

@mythmon

mythmon Oct 14, 2013
Member

I remember hearing about this during ES training, so that's probably where you heard it too.

# variable, then run a script that takes timings for
# --criticalmass, runs overnight and returns a more "optimal"
# number.
for ids in chunked(id_list, 80):

This comment has been minimized.

@robhudson

robhudson Sep 10, 2013
Member

Shouldn't this be using the passed in size parameter?

This comment has been minimized.

@jezdez

jezdez Sep 11, 2013
Author Member

Err, yes. D'oh!

try:
documents.append(cls.extract_document(id_))

except UnindexDocument:

This comment has been minimized.

@robhudson

robhudson Sep 10, 2013
Member

Ooh, I'm going to steal this. On Marketplace we have a bug about indexing a deleted object and I've wondered about a good way to handle this. Thanks! :)

This comment has been minimized.

@jezdez

jezdez Sep 11, 2013
Author Member

I only stole that from Kitsune myself, but you're right, this is pretty amazing and will solve all kinds of problems for MDN.

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

For the record, I was never satisfied with this. It's kind of a weird thing for extract_document to kick up an UnindexMeBro exception. Plus if there's a significant number of things you don't want indexed, then this spends a lot of time not indexing them. I'm game for other ideas. Maybe passing the id list through a should_index filter first?

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

Er, that's a terrible idea. Then we're pulling all the data twice.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

I think an exception is a perfectly fine tool to inject a different state, the alternative would be to separate indexing and unindexing, which rubs me wrong. I suggest to record the objects which need to be unindexed and bulk unindex them after the indexing has happened.

try:
cls.registry[cls.get_mapping_type_name()] = cls
except (NotImplementedError, NoModelError):
pass

This comment has been minimized.

@robhudson

robhudson Sep 10, 2013
Member

This is pretty slick.

The words in the comment "branch" and "mount point" confuse me a little but it's still understandable.

This comment has been minimized.

@jezdez

jezdez Sep 11, 2013
Author Member

Right, this is just me copy/pasting things around 💃

@@ -95,6 +95,11 @@ file:
multiple indexes, but we have no tests for that and I haven't
tested it, either.

.. data:: ES_WRITE_INDEXES

Similar to :data:`ES_INDEXES` this specifies the indexes to write

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

More specifically, it's a mapping of doctypes -> indexes to write to.

def get_index_settings(index, es=None):
"""Returns ES settings for this index"""
if es is None:
es = get_es()

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

I think this doesn't make sense here. The default get_es() is almost certainly not what a developer intends to use.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

agreed, I'm going to remove the function, actually. it's only needed in the management command after all

@@ -1866,6 +1900,48 @@ def __contains__(self, item):
return self._results_dict.__contains__(item)


def get_mapping_types(mapping_types=None):

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

The argument here should probably be "mapping_type_names" and the docstring should clarify that this should match what's in get_mapping_type_name from the MappingTypes.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

done.

return values


def get_mappings(mapping_types=None):

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

The argument here should probably be "mapping_type_names".

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

done.

return mappings


def get_document_stats(index, mapping_types=None):

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

The argument here should probably be "mapping_type_names".

We could abbreviate that to "mt_names", but if someone says that aloud in a conference call, it'd get confusing.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

for the record, both "mapping_type_names" and "mt_names" are horrible.

@@ -7,14 +7,13 @@
from django.shortcuts import render
from django.utils.decorators import decorator_from_middleware_with_args

from elasticutils import F, InvalidFieldActionError, MLT, NoModelError # noqa

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

These were here so when someone is writing ElasticUtils code in Django, they could import everything from this module rather than have to remember what's overridden in the django stuff and what should get pulled from the regular stuff.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

I find that pretty distasteful and against best practices, to be honest. But you're the boss.

This comment has been minimized.

@willkg

willkg Sep 25, 2013
Member

That's a pretty strong opinion. You do understand the ease of use issues, right? I'm not sure this case is against best practices because what we really have here is ElasticUtils the framework agnostic library and the Django layer on top of it, but we package both in the same repository. If I had more time to spend, I'd split them into two separate libraries. But that's a pain in the ass maintenance-wise. Regardless, that's how I'm treating them.



log = logging.getLogger('elasticutils')
log = logging.getLogger('elasticutils.contrib.django')

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

This is fine, but it means that a developer who wants to watch the logs now has to set up two loggers. I'm not entirely sure it's important to know whether something came from elasticutils or elasticutils.contrib.django. If you think this is a good idea, then we should at least document it in the section on debugging.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

Well, I understand that the logger 'elasticutils' would automatically inherit the calls from the 'elasticutils.contrib.django' logger if no handlers for the latter are setup. No?

This comment has been minimized.

@mythmon

mythmon Sep 25, 2013
Member

@jezdez you've got it right here. The events will bubble up by default, so I think this change is a good one.

@@ -29,6 +28,33 @@
)


def get_indexes(alias='default', setting='ES_INDEXES'):

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

The keys to ES_INDEXES and ES_WRITE_INDEXES aren't aliases. They're mapping type names.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

Ok.

prefixed_indexes = []
for index in indexes:
prefixed_indexes.append(u'%s_%s' % (prefix, index))
return prefixed_indexes

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

Why not something shorter like?:

return [u'%s_%s' % (prefix, index) for index in indexes]

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

Just a side issue of refactoring and not having another "performance" pass over the code.

return indexes


def get_write_index(alias='default'):

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

The keys to ES_INDEXES and ES_WRITE_INDEXES are mapping type names--not aliases.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

Right, if I ever have to explain this long term to someone in person I'm going to remember this and tell her to complain to you about it :)

This comment has been minimized.

@willkg

willkg Sep 25, 2013
Member

The word "alias" has a very specific meaning in Elasticsearch. That's why I want to stay away from it unless we're actually talking about Elasticsearch aliases.

"""Calculate the write index name."""
try:
indexes = get_indexes(alias, setting='ES_WRITE_INDEXES')
except InvalidIndexError:

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

I don't understand this part... If you're just calculating the write index name, then why should this kick up an InvalidIndexError?

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

get_indexes may raise a InvalidIndexError exception if the provided "mapping_type_name" can't be found (if the ES_WRITE_INDEXES setting isn't defined or the wanted index isn't defined in it). it then falls back to using the ES_INDEXES setting.

es = get_es()

delete_index(index, es=es)
es.create_index(index, settings={'mappings': get_mappings()})

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

We've got a bunch of MappingTypes registered and they may be in different indexes. I think this get_mappings() should only be returning the mappings for the index being created. I think that means you need to:

  1. figure out which mapping types should be in this index
  2. figure out the mappings for those mapping types

Maybe go through ES_WRITE_INDEXES.items() (or ES_INDEXES.items()) looking for the mapping type names for the index being recreated. Then figure out which mapping type names aren't in ES_WRITE_INDEXES (or ES_INDEXES) and thus use whatever default points to.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

I assumed that ES is capable of ignoring the unneeded mappings when creating an index. This blows this feature completely out of proportion, frankly.

@@ -238,7 +275,7 @@ def get_index(cls):
"""Gets the index for this model.
The index for this model is specified in `settings.ES_INDEXES`
which is a dict of mapping type -> index name.
which is a dict of mapping type -> index name(s).

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

I don't think we need to make this change. This method explicitly only returns the first index listed in the mapping. So while this docstring is more correct about what ES_INDEXES has, it suggests things that we're not dealing with well which will mess up the developer.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

:rollseyes:


# We (ab)use override_settings to force ES_LIVE_INDEXING for the
# duration of this command so that it actually indexes stuff.
@override_settings(ES_LIVE_INDEXING=True)

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

For the record, the LIVE_INDEXING stuff is total crap. In Kitsune we conflate "live indexing" (which is when we reindex a document on post_save/post_delete) with "ES is enabled". This command shouldn't have to deal with live indexing at all.

Sorting out the live indexing vs. ES enabled thing is something I keep meaning to do for Kitsune, but haven't had a chance to, yet.

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

Er, let me put that in context. I'm not saying you did anything wrong, but rather that the source material you're cribbing from here is asstastic and it's my fault and I haven't had a chance to fix it, yet.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

This was just a left over, I've since removed it in a WIP version of this pull request locally. Thanks for the clarification though.


def __len__(self):
return len(_model_cache)

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

I'm glad you cleaned this up. This whole thing was so icky. Thank you!

@@ -1,6 +1,7 @@
from nose.tools import eq_

from elasticutils.contrib.django import S, F, InvalidFieldActionError
from elasticutils import F, InvalidFieldActionError
from elasticutils.contrib.django import S

This comment has been minimized.

@willkg

willkg Sep 20, 2013
Member

This is the sort of thing I was getting at earlier. It's easier on a dev to not have to remember which things come from where. Plus if we ever change it and Django-ify F (for example), then it's less likely to create subtle errors if the devs are importing everything from elasticutils.contrib.django.

This comment has been minimized.

@jezdez

jezdez Sep 25, 2013
Author Member

I disagree, if a developer can't remember what module to use, well, maybe he shouldn't be developing. If we're truly willing to provide a Django-like API, how can we assume that it matches the behavior of a maybe to be developed Django specific implementation? Imagine the horrible situation of having to explain why the elasticutils.contrib.django.F has changed its behavior?

The only solution to explain to the users that there is an F class, it lives in elasticutils. End of story.

@willkg
Copy link
Member

@willkg willkg commented Sep 20, 2013

I'm really really excited you're taking this on. I'm really really sorry it took me a couple of weeks to get really work through this.

I think there's a lot of good stuff in here. Clearly it needs some documentation changes and probably some tests, too. There are some API decisions in here that I want to play with in an example project (or two or three) before I know how I feel about them.

I think it's worth doing another round of fixes in this PR and if that's good, then we'll land it. We have a couple of options:

  1. throw caution to the wind and land it in master and fix issues as they come up
  2. toss it in a new branch and continue to flesh it out

Option 2 lets us do a 0.8.2 release if we need to. Plus it lets us tinker with things in a more leisurely pace. I think I'm inclined to go with that.

@jezdez What's your timeline for this? Is this something you need to land asap or is this something we can take the next month or so to work on?

@jezdez
Copy link
Member Author

@jezdez jezdez commented Sep 25, 2013

@willkg My timeline is basically ASAP. I'm a bit tired of discussing basic Python library design though, so don't expect me to spend (or wait) weeks for this to land. I thought I could make a quick leap for the project but the feedback you've given me so far shows that that was a mistake.

@willkg
Copy link
Member

@willkg willkg commented Sep 25, 2013

I'm just trying to help make sure the code is good. I didn't mean to piss you off.

@robhudson Can you take over reviewing from here? It's clear I'm sucking at it.

@kevinastone
Copy link
Contributor

@kevinastone kevinastone commented Nov 10, 2013

What's remaining to get the management command merged? I think have a out-of-the-box indexing management is a essential feature for django integration. ElasticUtils has been great with everything else (celery tasks, etc), but this was an obvious omission.

@willkg willkg added this to the 0.10 milestone Mar 1, 2014
@willkg
Copy link
Member

@willkg willkg commented May 19, 2014

After talking with @jezdez and @robhudson at PyCon, I'm going to close this out. We're going to go in a different direction.

@willkg willkg closed this May 19, 2014
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.