
Rolling index rebuild #1134

Closed
njakobsen opened this issue May 3, 2019 · 21 comments


@njakobsen

It would be useful to rebuild indices in a rolling manner, only dropping the index right before it's rebuilt. This is an issue specifically using realtime indices, where some indices take a long time to rebuild, forcing later indices to wait empty until their turn to rebuild. Though there would be a period of time where the index would return an empty or partial list of results as the index is built, it could continue serving stale results while other indices are being built.

@pat
Owner

pat commented May 15, 2019

Hi Nicholas - thanks for this suggestion. It's definitely a fair request, but I'm not sure it's possible.

A rebuild is (or at least, should be) mostly used when at least one index's structure has changed. Given the configuration is all in a single file, it's not really possible to keep legacy data files around: all of the configuration is updated at once, rather than on a per-index basis. If index A was kept the same and index B was changed, Sphinx isn't going to run reliably with old B data while processing A's records, because the configuration will have the new B schema.

I'll keep thinking this through in my head, but right now I don't see a clear path forward. Also: how often are you rebuilding? If it's more than just when the structure of your indices is changing, I'm curious as to why that's the case.

@njakobsen
Author

njakobsen commented May 15, 2019

Hi Pat, thanks for the quick reply. I wasn't using the proper jargon when I posted. I want to initiate a rebuild that updates the config once, then clears one index at a time as it re-indexes it. That way we can continue to serve up results from the untouched indices while we repopulate the current index, avoiding empty result sets for the majority of queries.

The use case is that we add a new RT index and deploy; our config is updated on deploy, but that has the effect of breaking document_id calculations and also requires an index run before we can write documents to the new index. A reindex of all indices would cause duplicates to appear, because each record's document_id changes once a new index is added (it's calculated as `key * config.indices.count + offset`), so we are forced to rebuild to clear out the documents in each index. Reducing the amount of time the indices are empty is a priority.
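
To illustrate with made-up numbers (nothing from our real data), the arithmetic behind that duplication:

# Illustrative values only: a record with key 7 in an index at offset 1.
record_key = 7
offset     = 1

old_document_id = record_key * 2 + offset # two indices defined  => 15
new_document_id = record_key * 3 + offset # a third index added  => 22
# Nothing ever replaces the document at id 15, so it lingers as a duplicate.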

As a sidenote, I have implemented two monkeypatches to improve this situation, perhaps you can weigh in on them.

  1. Implemented index priority to ensure we reindex the most important (visitor-facing) indices first when rebuilding.
  2. Overloaded the transcriber and delete callbacks to delete old documents in the current index whose sphinx_internal_id matches the record's id. This eliminates the need to clear out old documents after creating/deleting a Sphinx index, since they are cleaned up automatically.

@pat
Owner

pat commented May 19, 2019

I want to initiate a rebuild that updates the config once, then clears one index at a time as it re-indexes it.

This isn't going to work, I'm afraid. As you've noted, the offsets change: the existing index files will have the old offsets, but any updates that happen will expect the new offsets, so you're going to find the wrong records get updated (via callbacks) in Sphinx.

I'm definitely interested in seeing the code for your monkey-patches, and I'm pondering how reindexing could become a multi-process thing, so all indices are processed in parallel (for a specified number of threads), which could hopefully speed things up.

As a manual workaround in the meantime, per-index processing (ts:index) could be a way to update a bunch of indices at the same time:

bundle exec rake ts:rebuild
# And then, once that starts on adding data to indices:
bundle exec rake ts:index INDEX_FILTER=user_core
bundle exec rake ts:index INDEX_FILTER=purchase_core
# etc

@njakobsen
Author

Automatic removal of stale Sphinx documents during Realtime Indexing
https://gist.github.com/njakobsen/2135706c35139d484a56d660b089cc5d

The index priority selection is not as interesting: I simply sort the list of indices as desired at the end of ThinkingSphinx::Index#define.

@pat
Owner

pat commented May 22, 2019

I was going to suggest you submit this as a PR - but frustratingly, Sphinx 2.1.x doesn't support anything but id in the WHERE clause (2.2.x onwards are more flexible).

Given Sphinx 2.2.x has been around for over five years, I'm thinking I should probably drop 2.1 support soon. I'll keep this patch in mind for then.

@njakobsen
Author

njakobsen commented May 22, 2019

Sounds good.

FWIW, while modifying the Transcriber I ended up deleting by ActiveRecord id before replacing, because the replace occurs deeper down in the Riddle gem (https://github.com/pat/riddle/blob/develop/lib/riddle/query/insert.rb) and I didn't want to change the behaviour too broadly while I was experimenting. Ideally, the replace would simply be keyed on the ActiveRecord id instead, negating the need for a replace based on the Sphinx document id, but that would require better knowledge of the gems involved, their desired compatibility with other ORMs, and of Sphinx itself. That's why I didn't make a PR and instead opted for the gist, to at least showcase the idea.

pat added a commit that referenced this issue Aug 3, 2019
In particular, this opens the door to parallel processing of indices. Prompted by discussions in #1134.

Use your own Processor class, which can either inherit from `ThinkingSphinx::RealTime::Processor` or be whatever you want, as long as it responds to `call`, accepting an array of indices, and a block that gets called after each index is processed.

https://gist.github.com/pat/53dbe982a2b311f5f7294809109419d2
@pat
Owner

pat commented Aug 3, 2019

I've made a few attempts at parallel processing over the past couple of months, but have found a reasonably simple solution just by refactoring the current code and allowing custom processors. This leaves the door open to others building their own implementations. All wrapped up in f0a7c12.

And as noted in that commit, I've put together a quick example using the parallel gem:
https://gist.github.com/pat/53dbe982a2b311f5f7294809109419d2

In a test app locally, I'm finding this shaves almost 50% off indexing times, but I'm dealing with very simple indices (and only three of them). Certainly, I'd be keen to know whether such an approach is helpful in your app!
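
To make that interface concrete, a minimal serial processor would look something like this (just a sketch based on the description above, not taken from the gist):

class SerialProcessor < ThinkingSphinx::RealTime::Processor
  def call(&block)
    # Populate each index in turn, invoking the supplied block after
    # each index has been processed.
    indices.each do |index|
      ThinkingSphinx::RealTime::Populator.populate index
      block.call
    end
  end
end

ThinkingSphinx::RealTime.processor = SerialProcessor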

@njakobsen
Author

This worked well, though I ran into an issue that could potentially be addressed in a similar way: towards the end of the list, unless indexing time is well balanced, several CPUs sit idle while any long-running indices finish. This led me to try parallelizing within the Populator. That worked okay, though I noticed that the memory footprint seemed to grow with the number of records in the table being indexed (not scientifically tested), so I'll need to investigate why I wasn't able to keep it constant even though I was using find_in_batches to load records 100 at a time to keep each process fed.

The performance difference for my project on a MacBook Pro with 4 cores + hyperthreading was:

  • No parallelization: Usually around 30 to 40 minutes
  • Parallelizing per index: 11 minutes
  • Parallelizing within an index: 8 minutes

@pat
Owner

pat commented Aug 6, 2019

Ah, nice to know the per-index option at least made things a decent bit faster on your machine. Fair point about the balancing of indices by size though, that's a bit tricky.

Can you share your changes to the Populator code? I'm curious to see your approach to shift the parallelisation into there.

@njakobsen
Author

njakobsen commented Aug 6, 2019

Be warned, it's a super janky proof of concept; I just threw this into my config/initializers to test it. The `true` at the end of the Parallel block was added while I was poking around to see if accumulated results were the cause of memory seeming to balloon when I indexed a large table. One downside to this approach is that if memory usage does grow with table size for some reason, all your processes balloon at the same time (unlike your example with per-index parallelization), which is why I started looking into it.

ThinkingSphinx::RealTime::Populator.class_eval do
  def populate(&block)
    instrument 'start_populating'

    Parallel.each(scope.find_in_batches(batch_size: 100)) do |instances|
      transcriber.copy *instances
      instrument 'populated', :instances => instances
      true # Don't emit any large object because results are accumulated
    end
    ActiveRecord::Base.connection.reconnect! # As recommended by the Parallel gem

    instrument 'finish_populating'
  end
end

@pat
Owner

pat commented Aug 9, 2019

I've just added a commit that allows similar customisation in a slightly more native approach, using your code:

class ParallelPopulator < ThinkingSphinx::RealTime::Populator
  def populate
    instrument 'start_populating'

    Parallel.each(scope.find_in_batches(batch_size: batch_size)) do |instances|
      transcriber.copy *instances

      instrument 'populated', :instances => instances

      true # Don't emit any large object because results are accumulated
    end
    ActiveRecord::Base.connection.reconnect!

    instrument 'finish_populating'
  end
end

ThinkingSphinx::RealTime.populator = ParallelPopulator

That said, when I gave it a spin, I didn't find any significant speed improvements over the standard non-parallel approach, let alone the Processor parallelisation. So maybe it's far more dependent on the index definitions and the Rails app?

@njakobsen
Author

I think it's going to be very dependent on how the indexes in one's app are balanced. We have some very large and some very small indexes, so I ended up with the imbalance. However, combined with the reindexing-priority monkey-patch I described in #1134 (comment), I ended up prioritizing the large index so it gets a head start while the shorter indices are being processed in parallel. So I probably won't pursue within-index parallelization, since the gain is minor compared to the simpler approach and can be achieved by manually tweaking the indexing order. I'm also a little worried about the ActiveRecord::Base.connection.reconnect! that the Parallel gem suggests, since it doesn't really explain why we lose the connection or how exactly that line solves it; see https://github.com/grosser/parallel#activerecord.

FWIW here is a simplification of how I'm implementing index prioritization

module IndexExtensions
  def define(*, &block)
    super
    sort_indices
  end

  private

  def sort_indices
    ThinkingSphinx::Configuration.instance.indices.sort_by! do |index|
      index.options[:reindex_priority]
    end
  end
end

ThinkingSphinx::Index.singleton_class.prepend(IndexExtensions)

ThinkingSphinx::Index.define :page, reindex_priority: 5 do
   # your index here
end

@pat
Owner

pat commented Aug 10, 2019

Yeah, avoiding the reconnecting is not a bad idea. For what it's worth, you could remove that monkey patch and shift the sorting into the custom processor instead? e.g.

class ParallelProcessor < ThinkingSphinx::RealTime::Processor
  def call(&block)
    Parallel.map(indices) do |index|
      puts "Populating index #{index.name}"
      ThinkingSphinx::RealTime::Populator.populate index
      puts "Populated index #{index.name}"

      block.call
    end
  end

  private

  def indices
    super.sort_by { |index| index.options[:reindex_priority] }
  end
end

ThinkingSphinx::RealTime.processor = ParallelProcessor

@njakobsen
Author

I guess I could sort on the fly each time, since there's no interface for extending ThinkingSphinx::Index. However, `indices` is a private method, so overriding it in my own subclass would still feel like a monkey patch. Would you be interested in a PR to make custom indexing order a standard feature? Or do you not see it as fundamental to the gem itself? It could even be something like adding a `<=>` method to ThinkingSphinx::Index, defaulting to the current behaviour of sorting by name, which could be overridden as desired.
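
Just to sketch the idea (untested):

class ThinkingSphinx::Index
  include Comparable

  # Default to the current behaviour of sorting indices by name;
  # apps could override this to prioritise particular indices.
  def <=>(other)
    name <=> other.name
  end
end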

@pat
Owner

pat commented Aug 12, 2019

I'm a little wary of adding the new feature, just because this hasn't been requested before in TS' long history.

Fair point about the above solution… here's a simpler approach though that avoids the reliance on a private method and the superclass:

ThinkingSphinx::RealTime.processor = Proc.new do |indices, &block|
  Parallel.map(indices) do |index|
    puts "Populating index #{index.name}"
    ThinkingSphinx::RealTime.populator.populate index
    puts "Populated index #{index.name}"

    block.call
  end
end

@pat
Owner

pat commented Aug 12, 2019

Ah, but I forgot to add the sorting in there. Still, easy enough to add :)

The processor needs to be something that responds to call and accepts two arguments: an array of indices, and a block that should be invoked (without any arguments) after each index is processed. The default is a class, but a proc does the job just as neatly.
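
With the sorting added in, it'd be something like this (untested, and assuming the reindex_priority option from your monkey patch):

ThinkingSphinx::RealTime.processor = Proc.new do |indices, &block|
  # Sort by the app-defined :reindex_priority option so the most important
  # (or slowest) indices get a head start.
  sorted = indices.sort_by { |index| index.options[:reindex_priority] || 0 }

  Parallel.map(sorted) do |index|
    puts "Populating index #{index.name}"
    ThinkingSphinx::RealTime.populator.populate index
    puts "Populated index #{index.name}"

    block.call
  end
end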

@njakobsen
Author

Yeah, makes sense, thanks for all your help!

@pat
Owner

pat commented Aug 13, 2019

No worries, great to get some solutions sorted that help you :)

@pat
Owner

pat commented Aug 21, 2019

These changes are now available in the freshly-released v4.4.0 :)

pat added a commit that referenced this issue Jun 23, 2020
Sphinx 2.1.x only allowed WHERE clauses for deletions to be on the `id` attribute, but 2.2.x onwards is more flexible, and saves us on calculations.

This change was suggested by @njakobsen in #1134.
pat added a commit that referenced this issue Jun 23, 2020
This was suggested by @njakobsen in #1134, and as he says: “This avoids the need to rebuild the index after adding or removing an index. Without this, the old indexed record would not be deleted since Sphinx's calculation of that record's id would not match now that the number/order of indices has changed.”

It does mean a touch more overhead when transcribing records, but I think it’s worth it.
@pat
Owner

pat commented Jul 20, 2020

Hi Nicholas - just wanted to let you know that the newly released v5.0.0 of Thinking Sphinx includes two changes you'd flagged in this discussion: real-time data is deleted before being inserted again (so old data on different offset values is avoided), and the SphinxQL deletion calls are by ActiveRecord id rather than Sphinx's primary keys (as Sphinx 2.2.11 or newer is now required).

There's also a significant change with needing to specify callbacks - though this is more of a big deal for those not using real-time indices: https://github.com/pat/thinking-sphinx/releases/tag/v5.0.0

@njakobsen
Author

I saw that! Thanks for the update @pat. I'll update our code as soon as I'm able and let you know how it goes.
