Optimize reindex #1233

Merged
merged 5 commits into from Oct 23, 2017

@noirbizarre noirbizarre commented Oct 23, 2017

This PR brings a large performance improvement to reindexing, along with improved error handling and a single, more flexible index command.

A single command

Both commands (search init and search reindex) have been merged into a single search index command.
This command is more flexible (models as variable arguments, singular or plural, obsolete index deleted by default...) and more powerful (it can reindex multiple models in a single pass).

The previous commands still exist but show a deprecation warning when used.

udata search init -df
WARNING: reindex command will be removed in udata 1.4, use index command instead
-> Initiliazing index "udata-prod-2017-10-21-15-52"

According to the deprecation policy, removal should happen in udata 1.4 (an issue should be created to track it).

Full reindexing

Simply execute the command without arguments:

# Index deletion with user confirmation prompt
udata search index
# No confirmation prompt on deletion
udata search index -f
# Previous index kept on success (or unfinished one kept on error)
udata search index -k

Partial reindex

Models are now simple arguments instead of the -t option from the previous search reindex command.

This means that reindexing only reuses and organizations takes a single pass:

udata search index reuse organization

instead of the two passes required before:

udata search reindex -t reuse
udata search reindex -t organization

The command also accepts plural forms, as using them is a common mistake:

udata search index datasets
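
For illustration only, accepting both forms can come down to normalizing each argument before looking it up. The sketch below is not the code from this PR: `KNOWN_MODELS` and `normalize_model_names` are hypothetical names, and the naive trailing-`s` rule is an assumption that only works because none of the model names shown in the command output ends with an `s`.

```python
# Hypothetical list, taken from the model names visible in the command output.
KNOWN_MODELS = ('dataset', 'geozone', 'organization', 'reuse', 'user')


def normalize_model_names(names):
    '''Map singular or plural names to known model names, without duplicates.'''
    normalized = []
    for name in names:
        key = name
        if key not in KNOWN_MODELS and key.endswith('s'):
            key = key[:-1]  # naive plural handling: "datasets" -> "dataset"
        if key not in KNOWN_MODELS:
            raise ValueError('Unknown model: {0}'.format(name))
        if key not in normalized:  # avoid indexing the same model twice
            normalized.append(key)
    return normalized
```

With this, `normalize_model_names(['datasets', 'reuse'])` yields `['dataset', 'reuse']`, and an unknown name raises a `ValueError` instead of being silently ignored.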

Performance

The performance improvements come from following the Tune for indexing speed guide and the bulk indexing section of the Update settings documentation.

Consequences:

  • indexing is much faster (about 6m21s instead of 23m28s in the runs below)
  • resource consumption on the Elasticsearch side (both CPU and RAM) remains low (and much lower than before) during the whole reindexing

Before

time udata search init -df
-> Initiliazing index "udata-prod-2017-10-21-15-59"
-> Indexing Dataset objects
-> Indexing GeoZone objects
-> Indexing Organization objects
-> Indexing Reuse objects
-> Indexing User objects
-> Creating alias "udata-prod" on index "udata-prod-2017-10-21-15-59"

real	23m28,467s
user	11m39,063s
sys	0m24,022s

After

time udata search init
-> Initiliazing index "udata-prod-2017-10-21-13-41"
-> Indexing Dataset objects
-> Indexing GeoZone objects
-> Indexing Organization objects
-> Indexing Reuse objects
-> Indexing User objects
-> Creating alias "udata-prod" on index "udata-prod-2017-10-21-13-41"

real	6m20,919s
user	4m56,989s
sys	0m6,892s
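
As a rough sketch of the bulk indexing advice from the Update settings documentation: periodic refresh is turned off (`refresh_interval` set to `-1`) while documents are pushed, then restored and followed by a force merge. This is not the code from this PR. It assumes a client whose `indices` attribute exposes `put_settings(index=..., body=...)` and `forcemerge(index=...)` in the style of elasticsearch-py (exact signatures vary between client versions), and `RecordingClient` is a hypothetical stand-in so the example runs without a cluster.

```python
def disable_refresh(client, index_name):
    '''Turn off periodic refresh before a bulk indexing run.'''
    client.indices.put_settings(
        index=index_name,
        body={'index': {'refresh_interval': '-1'}},
    )


def enable_refresh(client, index_name, refresh_interval='1s'):
    '''Restore periodic refresh and force merge. Meant to run after indexing.'''
    client.indices.put_settings(
        index=index_name,
        body={'index': {'refresh_interval': refresh_interval}},
    )
    client.indices.forcemerge(index=index_name)


class RecordingIndices(object):
    '''Hypothetical stand-in for `client.indices`: records calls only.'''

    def __init__(self):
        self.calls = []

    def put_settings(self, index=None, body=None):
        self.calls.append(('put_settings', index, body))

    def forcemerge(self, index=None):
        self.calls.append(('forcemerge', index))


class RecordingClient(object):
    def __init__(self):
        self.indices = RecordingIndices()


client = RecordingClient()
disable_refresh(client, 'udata-test')
# ... bulk indexing of all documents would happen here ...
enable_refresh(client, 'udata-test')
```

Keeping `refresh_interval` as a parameter leaves room for deployments that do not use the default value of `1s`.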

Error handling

This PR improves error handling during indexing. No more ugly stack traces, only the error details (which were not displayed before, by the way).
Both commands now also properly handle kill signals and keyboard interrupts.

In case of error, the unfinished index is now properly deleted, which avoids accumulating lots of unfinished indexes (consuming ES memory).

-> Initiliazing index "udata-prod-2017-10-21-15-52"
-> Indexing Dataset objects
^C
WARNING: Interrupted by signal
-> Removing index udata-prod-2017-10-21-15-52

For debugging purposes, you can keep the unfinished index with the -k/--keep parameter.
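
The cleanup behavior described above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `run_indexing`, `populate` and `delete_index` are hypothetical names, only Ctrl-C and regular exceptions are covered (kill signals would need a dedicated signal handler), and a real command would also log the error details before returning.

```python
def run_indexing(index_name, populate, delete_index, keep=False):
    '''
    Call `populate(index_name)` and report success as a boolean.

    On error or keyboard interrupt, the unfinished index is removed
    with `delete_index(index_name)` unless `keep` is set.
    '''
    try:
        populate(index_name)
    except (Exception, KeyboardInterrupt):
        # KeyboardInterrupt is not an Exception subclass, hence the tuple
        if not keep:
            delete_index(index_name)
        return False
    return True
```

Passing `keep=True` mirrors the -k/--keep parameter: the failure is still reported, but the unfinished index is left in place for inspection.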

Documentation

Finally, the search index command is now documented in the "Administrative tasks" section.

@abulte abulte left a comment

👌


## Reindexing data

Sometimes, you need to reindex data (in case model breaking changes, defect of workers...).

s/in case/in case of/

s/defect of worker/worker defect/

## Reindexing data

Sometimes, you need to reindex data (in case model breaking changes, defect of workers...).
You can use the `udata search index command` to do so.

s/`udata search index command`/`udata search index` command/

Sometimes, you need to reindex data (in case model breaking changes, defect of workers...).
You can use the `udata search index command` to do so.

This command both support full reindex without arguments or partial with model names as arguments:

supports both full reindex without arguments and partial

udata search index reuses organizations
```

By default the command does delete the previous index in case of success or the new unfinished index in case of error but you can ask to keep indexes with the `-k/--keep` parameter

s/does delete/deletes/

udata search index -f
```

It's possible to do a partial reindex by providing models (support both singular or plural) as arguments:

both singular and plural are supported



def iter_for_index(docs, index_name):
    '''Iter over ES documents ensuring a given index'''

s/Iter/Iterate/

})


def enable_refresh(index_name):

Maybe put refresh_interval as a parameter with a default value of 1s? In case this needs to be changed/configured.

Contributor Author

Done 👍


def enable_refresh(index_name):
    '''
    Enable refresh after indexing and force merge

Enable refresh and force merge. Used after indexing. ?

-    if force or prompt_bool(('Index {0} will be deleted, are you sure ?'
-                             .format(index_name))):
+    if IS_INTERACTIVE and not force:
+        msg = 'Index {0} will be deleted, are you sure ?'

s/ ?/?/

@noirbizarre noirbizarre merged commit e73006a into opendatateam:master Oct 23, 2017
@noirbizarre noirbizarre deleted the optimize-reindex branch October 23, 2017 12:04