Optimize reindex #1233

noirbizarre · 2017-10-23T07:58:42Z

This PR brings a large performance improvement to reindexing, along with improved error handling and a single, more flexible index command.

A single command

Both commands (search init and search reindex) have been merged into a single search index command.
This command is more flexible (models as variable arguments, plural or singular, obsolete index deleted by default...) and more powerful (multiple models can be reindexed in a single pass).

The previous commands still exist but show a deprecation warning when used.

udata search init -df
WARNING: reindex command will be removed in udata 1.4, use index command instead
-> Initiliazing index "udata-prod-2017-10-21-15-52"

According to the deprecation policy, removal should happen in udata 1.4 (an issue should be created).
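To illustrate how an old command can keep working while warning about its removal, here is a minimal sketch. The `deprecated` decorator and the stand-in `index` and `reindex` functions are hypothetical and are not taken from the udata code base; only the warning wording follows the output shown above.

```python
import functools
import logging

log = logging.getLogger(__name__)


def deprecated(old, new, removal):
    """Log a deprecation warning, then run the wrapped command unchanged."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.warning('%s command will be removed in udata %s, '
                        'use %s command instead', old, removal, new)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def index(models=(), force=False, keep=False):
    # Hypothetical stand-in for the new command body:
    # it only echoes its arguments for illustration.
    return {'models': list(models), 'force': force, 'keep': keep}


@deprecated('reindex', 'index', '1.4')
def reindex(doc_type=None):
    # The old -t option maps onto the new positional model arguments.
    return index(models=[doc_type] if doc_type else [])
```

Delegating to the new command keeps a single code path, so the old entry points can be dropped later without touching the indexing logic.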

Full reindexing

Simply execute the command without arguments:

# Index deletion with user confirmation prompt
udata search index
# No confirmation prompt on deletion
udata search index -f
# Previous index kept (or unfinished one on error)
udata search index -k

Partial reindex

Models are now simple arguments instead of the -t option from the previous search reindex command.

This means you can reindex only reuses and organizations in a single pass with:

udata search index reuse organization

instead of the two passes needed before:

udata search reindex -t reuse
udata search reindex -t organization

The command also accepts plural forms, as using them is a common mistake:

udata search index datasets
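Accepting both singular and plural names can be sketched as below. This is an illustration only: the `MODELS` tuple, `normalize` and `resolve` are hypothetical names, and the model list is inferred from the log output in this description rather than from the actual registry.

```python
# Hypothetical registry: names inferred from the "Indexing ... objects" log lines.
MODELS = ('dataset', 'geozone', 'organization', 'reuse', 'user')


def normalize(name):
    """Map a user-supplied model name (singular or plural) to a known model."""
    key = name.lower()
    if key in MODELS:
        return key
    # Tolerate a trailing "s" since plural forms are a common mistake.
    if key.endswith('s') and key[:-1] in MODELS:
        return key[:-1]
    raise ValueError('Unknown model: {0}'.format(name))


def resolve(names):
    """Return the models to index; no names means a full reindex."""
    if not names:
        return list(MODELS)
    return [normalize(name) for name in names]
```

With this sketch, `resolve(['datasets', 'reuse'])` and `resolve(['dataset', 'reuses'])` select the same two models, and an empty argument list selects all of them.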

Performances

The performance improvements come from following the Tune for indexing speed guide and the bulk indexing section of the Update settings documentation.

Consequences:

indexing is much faster (about 6 minutes instead of about 23 in the runs below)
resource consumption on the Elasticsearch side (both CPU and RAM) remains low (and much lower than before) during the whole reindexing
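The two techniques named above (turning off periodic refresh while loading, and sending documents in batches) can be sketched as follows. This is not the code from this PR: `bulk_mode`, `chunks` and the `put_settings` callable are hypothetical, with `put_settings` standing in for whatever wraps the Elasticsearch update-settings call. The `'1s'` default and the batch size of 500 are assumptions.

```python
from contextlib import contextmanager


def disabled_refresh_settings():
    # Per the "Tune for indexing speed" guide, -1 turns periodic refresh off.
    return {'index': {'refresh_interval': '-1'}}


def enabled_refresh_settings(refresh_interval='1s'):
    # '1s' is the Elasticsearch default; making it a parameter is an assumption.
    return {'index': {'refresh_interval': refresh_interval}}


@contextmanager
def bulk_mode(put_settings, index_name, refresh_interval='1s'):
    """Disable refresh for the duration of the block, then restore it."""
    put_settings(index_name, disabled_refresh_settings())
    try:
        yield
    finally:
        # Restore refresh even if indexing fails part-way through.
        put_settings(index_name, enabled_refresh_settings(refresh_interval))


def chunks(iterable, size=500):
    """Group documents into batches so each bulk request carries many of them."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch
```

Passing the settings call in as a function keeps the sketch independent of a specific client version and makes it easy to exercise without a running cluster.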

Before

time udata search init -df
-> Initiliazing index "udata-prod-2017-10-21-15-59"
-> Indexing Dataset objects
-> Indexing GeoZone objects
-> Indexing Organization objects
-> Indexing Reuse objects
-> Indexing User objects
-> Creating alias "udata-prod" on index "udata-prod-2017-10-21-15-59"

real	23m28,467s
user	11m39,063s
sys	0m24,022s

After

time udata search init
-> Initiliazing index "udata-prod-2017-10-21-13-41"
-> Indexing Dataset objects
-> Indexing GeoZone objects
-> Indexing Organization objects
-> Indexing Reuse objects
-> Indexing User objects
-> Creating alias "udata-prod" on index "udata-prod-2017-10-21-13-41"

real	6m20,919s
user	4m56,989s
sys	0m6,892s

Error handling

This PR improves error handling during indexing. No more ugly stack traces, only the error details (which were not displayed before, by the way).
Both commands now also properly handle kill signals and keyboard interrupts.

In case of error, the unfinished index is now properly deleted, which avoids accumulating lots of unfinished indexes (consuming ES memory).

-> Initiliazing index "udata-prod-2017-10-21-15-52"
-> Indexing Dataset objects
^C
WARNING: Interrupted by signal
-> Removing index udata-prod-2017-10-21-15-52

For debugging purposes, you can keep the unfinished index with the -k/--keep parameter.
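The cleanup behaviour described in this section can be sketched as a context manager. This is an illustration under assumptions, not the implementation in this PR: `safe_indexing`, `install_signal_handler`, `Interrupted` and the `delete_index` callable are all hypothetical names.

```python
import signal
from contextlib import contextmanager


class Interrupted(Exception):
    """Raised when a termination signal is received."""


def _raise_interrupted(signum, frame):
    raise Interrupted('Interrupted by signal')


def install_signal_handler():
    # Turn SIGTERM into an exception so the cleanup below also runs on kill.
    # Ctrl-C already surfaces as KeyboardInterrupt without extra setup.
    return signal.signal(signal.SIGTERM, _raise_interrupted)


@contextmanager
def safe_indexing(index_name, delete_index, keep=False):
    """Delete the unfinished index if the block fails, unless keep is set."""
    try:
        yield
    except (Exception, KeyboardInterrupt):
        if not keep:
            delete_index(index_name)
        # Re-raise so the caller can report the error and exit non-zero.
        raise
```

A command would call `install_signal_handler()` once at startup and wrap the indexing loop in `with safe_indexing(name, delete_index, keep=keep):`, so an error or interrupt removes the half-built index unless `keep` is set.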

Documentation

Finally, the search index command is now documented in the "Administrative tasks" section.

abulte

👌

abulte · 2017-10-23T10:16:01Z

docs/administrative-tasks.md

+
+## Reindexing data
+
+Sometimes, you need to reindex data (in case model breaking changes, defect of workers...).


s/in case/in case of/

s/defect of worker/worker defect/

abulte · 2017-10-23T10:16:34Z

docs/administrative-tasks.md

+## Reindexing data
+
+Sometimes, you need to reindex data (in case model breaking changes, defect of workers...).
+You can use the `udata search index command` to do so.


s/`udata search index command`/`udata search index` command/

abulte · 2017-10-23T10:17:19Z

docs/administrative-tasks.md

+Sometimes, you need to reindex data (in case model breaking changes, defect of workers...).
+You can use the `udata search index command` to do so.
+
+This command both support full reindex without arguments or partial with model names as arguments:


supports both full reindex without arguments and partial

abulte · 2017-10-23T10:17:51Z

docs/administrative-tasks.md

+udata search index reuses organizations
+```
+
+By default the command does delete the previous index in case of success or the new unfinished index in case of error but you can ask to keep indexes with the `-k/--keep` parameter


s/does delete/deletes/

abulte · 2017-10-23T10:19:44Z

docs/administrative-tasks.md

+udata search index -f
+```
+
+It's possible to do a partial reindex by providing models (support both singular or plural) as arguments:


both singular and plural are supported

abulte · 2017-10-23T10:22:06Z

udata/search/commands.py

+
+
+def iter_for_index(docs, index_name):
+    '''Iter over ES documents ensuring a given index'''


s/Iter/Iterate/
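For readers without the diff at hand, here is a guess at what a helper with this signature and docstring might do. The body is not shown in this thread, so the version below is an assumption that works on plain dictionaries and forces the `_index` key of each bulk action.

```python
def iter_for_index(docs, index_name):
    """Iterate over ES documents, ensuring each one targets the given index."""
    for doc in docs:
        # Copy so the caller's document is left untouched (assumed behaviour).
        action = dict(doc)
        action['_index'] = index_name
        yield action
```

Being a generator, it never holds the whole document set in memory, which fits feeding a bulk indexing call.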

abulte · 2017-10-23T10:24:48Z

udata/search/commands.py

+    })
+
+
+def enable_refresh(index_name):


Maybe put refresh_interval as a parameter with a default value of 1s? In case this needs to be changed/configured.

abulte · 2017-10-23T10:25:29Z

udata/search/commands.py

+
+def enable_refresh(index_name):
+    '''
+    Enable refresh after indexing and force merge


Enable refresh and force merge. Used after indexing. ?

abulte · 2017-10-23T10:27:28Z

udata/search/commands.py

-        if force or prompt_bool(('Index {0} will be deleted, are you sure ?'
-                                 .format(index_name))):
+        if IS_INTERACTIVE and not force:
+            msg = 'Index {0} will be deleted, are you sure ?'


noirbizarre added documentation enhancement performance refactoring labels Oct 23, 2017

noirbizarre added this to the 1.2.1 milestone Oct 23, 2017

noirbizarre requested a review from a team October 23, 2017 07:58

noirbizarre added the in progress label Oct 23, 2017

abulte approved these changes Oct 23, 2017


noirbizarre added 5 commits October 23, 2017 13:12

Disable ES refresh while indexing

3a9e1a4

Bulk indexing

3b62b6f

Improve errors handling and reindex command signature

b3f42f4

Merge both search commands into a single index command

f385fe2

Added search index documentation and changelog

9e57c11

noirbizarre merged commit e73006a into opendatateam:master Oct 23, 2017

noirbizarre removed the in progress label Oct 23, 2017

noirbizarre deleted the optimize-reindex branch October 23, 2017 12:04
