Make several indexing optimizations #4350

ManyTheFish · 2024-01-22T15:56:59Z

Summary

Implement several enhancements to reduce the indexing time.

Steps

Compute the indexing chunk size dynamically based on the available threads and the data size
Remove the merging step before the writing step and merge at the writing time
Remove append function
Make Facet search indexing incremental
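
The first step above can be pictured with a small sketch. It is not the formula used in this PR: the helper name `chunk_size_for`, the floor and ceiling constants, and the "roughly one chunk per thread" policy are all assumptions made for illustration.

```rust
/// Hypothetical helper: pick a per-chunk byte budget so that the input is
/// split into roughly one chunk per indexing thread.
/// The bounds below are illustrative assumptions, not values from the PR.
fn chunk_size_for(data_size: u64, threads: usize) -> u64 {
    const MIN_CHUNK: u64 = 1024 * 1024; // 1 MiB floor, avoids many tiny files
    const MAX_CHUNK: u64 = 1024 * 1024 * 1024; // 1 GiB ceiling, bounds memory per chunk

    // Treat "0 threads" as 1 so the division below is always defined.
    let threads = threads.max(1) as u64;
    // Ceiling division so that `threads` chunks always cover the whole input.
    let per_thread = data_size / threads + u64::from(data_size % threads != 0);
    per_thread.clamp(MIN_CHUNK, MAX_CHUNK)
}

fn main() {
    let threads = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let size = chunk_size_for(8 * 1024 * 1024 * 1024, threads);
    println!("chunk size: {size} bytes for {threads} threads");
}
```

Larger chunks mean fewer intermediate files to keep open and to merge later, at the cost of more memory per chunk, which is why the sketch clamps the result on both sides.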

Running Indexing process

`main`

Each type of data is written after a merging phase:

The highlighted parts are the write phases.

`remove-merging-phase-from-indexing`

When the extraction of a chunk is finished, the data is written:

The highlighted parts are the write phases.

+                .inspect(|result| {
+                    if proximity_precision == ProximityPrecision::ByWord {
+                        if let Ok((docid_word_positions_chunk, _)) = result {
+                            run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
+                                docid_word_positions_chunk.clone(),
+                                indexer,
+                                lmdb_writer_sx.clone(),
+                                extract_word_pair_proximity_docids,
+                                TypedChunk::WordPairProximityDocids,
+                                "word-pair-proximity-docids",
+                            );
+                        }
+                    }
+                })
+                .inspect(|result| {
+                    if let Ok((docid_word_positions_chunk, _)) = result {
+                        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
+                            docid_word_positions_chunk.clone(),
+                            indexer,
+                            lmdb_writer_sx.clone(),
+                            extract_fid_word_count_docids,
+                            TypedChunk::FieldIdWordCountDocids,
+                            "field-id-wordcount-docids",
+                        );
+                    }
+                })
+                .inspect(|result| {
+                    if let Ok((docid_word_positions_chunk, _)) = result {
+                        let exact_attributes = exact_attributes.clone();
+                        run_extraction_task::<
+                            _,
+                            _,
+                            (
+                                grenad::Reader<BufReader<File>>,
+                                grenad::Reader<BufReader<File>>,
+                                grenad::Reader<BufReader<File>>,
+                            ),
+                        >(
+                            docid_word_positions_chunk.clone(),
+                            indexer,
+                            lmdb_writer_sx.clone(),
+                            move |doc_word_pos, indexer| {
+                                extract_word_docids(doc_word_pos, indexer, &exact_attributes)
+                            },
+                            |(
+                                word_docids_reader,
+                                exact_word_docids_reader,
+                                word_fid_docids_reader,
+                            )| {
+                                TypedChunk::WordDocids {
+                                    word_docids_reader,
+                                    exact_word_docids_reader,
+                                    word_fid_docids_reader,
+                                }
+                            },
+                            "word-docids",
+                        );
+                    }
+                })
+                .inspect(|result| {
+                    if let Ok((docid_word_positions_chunk, _)) = result {
+                        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
+                            docid_word_positions_chunk.clone(),
+                            indexer,
+                            lmdb_writer_sx.clone(),
+                            extract_word_position_docids,
+                            TypedChunk::WordPositionDocids,
+                            "word-position-docids",
+                        );
+                    }
+                })
+                .inspect(|result| {
+                    if let Ok((_, (_, fid_docid_facet_strings_chunk))) = result {
+                        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
+                            fid_docid_facet_strings_chunk.clone(),
+                            indexer,
+                            lmdb_writer_sx.clone(),
+                            extract_facet_string_docids,
+                            TypedChunk::FieldIdFacetStringDocids,
+                            "field-id-facet-string-docids",
+                        );
+                    }
+                })
+                .inspect(|result| {
+                    if let Ok((_, (fid_docid_facet_numbers_chunk, _))) = result {
+                        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
+                            fid_docid_facet_numbers_chunk.clone(),
+                            indexer,
+                            lmdb_writer_sx.clone(),
+                            extract_facet_number_docids,
+                            TypedChunk::FieldIdFacetNumberDocids,
+                            "field-id-facet-number-docids",
+                        );
+                    }
+                })
+                .map(|r| r.map(|_| ()))


Could you replace all those `inspect` calls and this `map` with a single `map` instead? `inspect` is meant more for debugging than for doing actual work while traversing. Also, can't these `inspect` tasks run in parallel? They seem to run sequentially here.

Maybe you can use rayon's `for_each` method over the list of functions. Looping over the functions to run and calling `run_extraction_task` for each could be enough to run those extractions in parallel!
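
A minimal sketch of the idea in the review comment above: loop over a list of extraction functions and run each one on the same chunk in parallel. To stay dependency-free it uses `std::thread::scope` and an `mpsc` channel rather than rayon, and the extractors (`count_words`, `count_bytes`, `count_lines`), the `Extractor` alias, and `run_extractions` are hypothetical stand-ins, not functions from milli.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-ins for the real extractors: each reads the shared
// chunk and returns a (name, value) pair.
fn count_words(chunk: &str) -> (&'static str, usize) {
    ("word-count", chunk.split_whitespace().count())
}

fn count_bytes(chunk: &str) -> (&'static str, usize) {
    ("byte-count", chunk.len())
}

fn count_lines(chunk: &str) -> (&'static str, usize) {
    ("line-count", chunk.lines().count())
}

type Extractor = fn(&str) -> (&'static str, usize);

/// Runs every extractor on the same chunk, one scoped thread per extractor,
/// and sends each result to a single writer channel.
fn run_extractions(chunk: &str, extractors: &[Extractor]) -> Vec<(&'static str, usize)> {
    let (sender, receiver) = mpsc::channel();
    thread::scope(|scope| {
        for &extract in extractors {
            let sender = sender.clone();
            scope.spawn(move || {
                // A send error only means the receiver is gone; ignore it here.
                let _ = sender.send(extract(chunk));
            });
        }
    });
    // Drop the original sender so the receiver iterator terminates.
    drop(sender);
    let mut results: Vec<_> = receiver.into_iter().collect();
    results.sort(); // threads finish in arbitrary order
    results
}

fn main() {
    let extractors: [Extractor; 3] = [count_words, count_bytes, count_lines];
    for (name, value) in run_extractions("hello big world\nsecond line", &extractors) {
        println!("{name}: {value}");
    }
}
```

With rayon, the explicit thread scope would presumably be replaced by a parallel iterator over the same list of functions; the single-channel shape stays the same.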

Kerollmops · 2024-01-23T11:00:59Z

milli/src/update/index_documents/helpers/grenad_helpers.rs

@@ -82,90 +82,6 @@ pub unsafe fn as_cloneable_grenad(
    Ok(reader)
 }

-pub trait MergeableReader


milli/src/update/index_documents/mod.rs

milli/src/update/index_documents/typed_chunk.rs

meili-bot · 2024-01-23T19:06:18Z

Here are your indexing benchmarks diff 👊

group                                                                     indexing_main_8e016fbf                  indexing_remove-merging-phase-from-indexing_b6fc1819
-----                                                                     ----------------------                  ----------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00       7.2±0.55s        ? ?/sec     1.09       7.8±0.36s        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.04  1567.8±177.85ms        ? ?/sec    1.00  1505.7±142.25ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00       7.2±0.23s        ? ?/sec     2.30      16.6±0.49s        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00      11.2±0.88s        ? ?/sec     1.09      12.2±0.81s        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.44      52.1±3.18s        ? ?/sec     1.00      36.3±3.69s        ? ?/sec
indexing/Indexing geo_point                                               1.00      49.5±1.48s        ? ?/sec     1.47      72.8±1.79s        ? ?/sec
indexing/Indexing movies in three batches                                 1.09       4.0±0.13s        ? ?/sec     1.00       3.7±0.19s        ? ?/sec
indexing/Indexing movies with default settings                            1.00       3.9±0.33s        ? ?/sec     1.15       4.5±0.44s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.00       4.9±0.36s        ? ?/sec     2.90      14.1±0.34s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.00       4.2±0.20s        ? ?/sec     1.31       5.6±0.31s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.00      29.0±1.21s        ? ?/sec     2.54      73.7±2.05s        ? ?/sec
indexing/Indexing songs with default settings                             1.00      33.6±1.03s        ? ?/sec     1.95      65.5±1.85s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      31.0±1.55s        ? ?/sec     1.32      41.0±1.00s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.00      32.8±1.71s        ? ?/sec     1.83      59.8±1.54s        ? ?/sec
indexing/Indexing wiki                                                    1.00     297.3±8.79s        ? ?/sec     1.18     350.2±6.62s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.00     232.0±3.66s        ? ?/sec     1.47    342.1±11.13s        ? ?/sec
indexing/Reindexing geo_point                                             1.00      12.5±0.12s        ? ?/sec     1.02      12.7±0.17s        ? ?/sec
indexing/Reindexing movies with default settings                          1.05   224.0±32.46ms        ? ?/sec     1.00   214.1±35.45ms        ? ?/sec
indexing/Reindexing songs with default settings                           1.00       3.6±0.04s        ? ?/sec     1.02       3.6±0.04s        ? ?/sec
indexing/Reindexing wiki                                                  1.00    399.1±12.94s        ? ?/sec     1.14     455.3±3.41s        ? ?/sec

ManyTheFish · 2024-01-25T17:24:26Z

/benchmark indexing

meili-bot · 2024-01-26T01:10:38Z

Here are your indexing benchmarks diff 👊

group                                                                     indexing_main_8e016fbf                  indexing_remove-merging-phase-from-indexing_b6fc1819
-----                                                                     ----------------------                  ----------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00       7.2±0.55s        ? ?/sec     1.09       7.8±0.36s        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.04  1567.8±177.85ms        ? ?/sec    1.00  1505.7±142.25ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00       7.2±0.23s        ? ?/sec     2.30      16.6±0.49s        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00      11.2±0.88s        ? ?/sec     1.09      12.2±0.81s        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.44      52.1±3.18s        ? ?/sec     1.00      36.3±3.69s        ? ?/sec
indexing/Indexing geo_point                                               1.00      49.5±1.48s        ? ?/sec     1.47      72.8±1.79s        ? ?/sec
indexing/Indexing movies in three batches                                 1.09       4.0±0.13s        ? ?/sec     1.00       3.7±0.19s        ? ?/sec
indexing/Indexing movies with default settings                            1.00       3.9±0.33s        ? ?/sec     1.15       4.5±0.44s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.00       4.9±0.36s        ? ?/sec     2.90      14.1±0.34s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.00       4.2±0.20s        ? ?/sec     1.31       5.6±0.31s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.00      29.0±1.21s        ? ?/sec     2.54      73.7±2.05s        ? ?/sec
indexing/Indexing songs with default settings                             1.00      33.6±1.03s        ? ?/sec     1.95      65.5±1.85s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      31.0±1.55s        ? ?/sec     1.32      41.0±1.00s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.00      32.8±1.71s        ? ?/sec     1.83      59.8±1.54s        ? ?/sec
indexing/Indexing wiki                                                    1.00     297.3±8.79s        ? ?/sec     1.18     350.2±6.62s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.00     232.0±3.66s        ? ?/sec     1.47    342.1±11.13s        ? ?/sec
indexing/Reindexing geo_point                                             1.00      12.5±0.12s        ? ?/sec     1.02      12.7±0.17s        ? ?/sec
indexing/Reindexing movies with default settings                          1.05   224.0±32.46ms        ? ?/sec     1.00   214.1±35.45ms        ? ?/sec
indexing/Reindexing songs with default settings                           1.00       3.6±0.04s        ? ?/sec     1.02       3.6±0.04s        ? ?/sec
indexing/Reindexing wiki                                                  1.00    399.1±12.94s        ? ?/sec     1.14     455.3±3.41s        ? ?/sec

Kerollmops · 2024-02-05T13:21:16Z

I measured this PR by loading a dump of a big index of about 150M documents. Writing along the way, without merging at the end of the extraction, brought no benefit and slowed down the whole process. However, keeping the chunk size formula this way was very effective: it reduces the number of files open in parallel and the number of files to be merged before writing them.

This screenshot shows a run with:

Document chunk size changed (so chunks are merged before being written to LMDB)
Facet search disabled

The extraction functions take a long time to run, not just tens of milliseconds.

ManyTheFish · 2024-02-08T08:26:03Z

/benchmark indexing

meili-bot · 2024-02-08T15:07:50Z

Here are your indexing benchmarks diff 👊

group                                                                     indexing_main_8e016fbf                  indexing_remove-merging-phase-from-indexing_3e120619
-----                                                                     ----------------------                  ----------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00       7.2±0.55s        ? ?/sec     1.09       7.8±0.36s        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.01  1567.8±177.85ms        ? ?/sec    1.00  1557.4±152.49ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.81       7.2±0.23s        ? ?/sec     1.00       4.0±0.18s        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.52      11.2±0.88s        ? ?/sec     1.00       7.3±0.87s        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.12      52.1±3.18s        ? ?/sec     1.00     46.7±12.25s        ? ?/sec
indexing/Indexing geo_point                                               1.00      49.5±1.48s        ? ?/sec     1.17      57.8±1.11s        ? ?/sec
indexing/Indexing movies in three batches                                 1.06       4.0±0.13s        ? ?/sec     1.00       3.7±0.24s        ? ?/sec
indexing/Indexing movies with default settings                            1.00       3.9±0.33s        ? ?/sec     1.08       4.3±0.30s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.00       4.9±0.36s        ? ?/sec     1.41       6.9±0.25s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.00       4.2±0.20s        ? ?/sec     1.30       5.5±0.27s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.00      29.0±1.21s        ? ?/sec     1.18      34.1±0.80s        ? ?/sec
indexing/Indexing songs with default settings                             1.00      33.6±1.03s        ? ?/sec     1.35      45.3±2.07s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      31.0±1.55s        ? ?/sec     1.17      36.4±0.88s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.00      32.8±1.71s        ? ?/sec     1.30      42.7±0.79s        ? ?/sec
indexing/Indexing wiki                                                    1.04     297.3±8.79s        ? ?/sec     1.00     286.7±8.13s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.03     232.0±3.66s        ? ?/sec     1.00    224.6±14.01s        ? ?/sec
indexing/Reindexing geo_point                                             1.00      12.5±0.12s        ? ?/sec     1.02      12.8±0.32s        ? ?/sec
indexing/Reindexing movies with default settings                          1.01   224.0±32.46ms        ? ?/sec     1.00   221.4±37.04ms        ? ?/sec
indexing/Reindexing songs with default settings                           1.00       3.6±0.04s        ? ?/sec     1.00       3.6±0.05s        ? ?/sec
indexing/Reindexing wiki                                                  1.14    399.1±12.94s        ? ?/sec     1.00     348.7±8.30s        ? ?/sec

Kerollmops · 2024-02-12T13:15:56Z

milli/src/update/facet/mod.rs

-                        .map(|(_, c)| c)
-                        .collect();
-                    normalized_facet = normalized_truncated_facet.into();
+        if let Some(normalized_delta_data) = self.normalized_delta_data {


Can you extract this into another function, please?

milli/src/update/facet/mod.rs

milli/src/update/index_documents/extract/mod.rs

Kerollmops · 2024-02-12T13:29:59Z

milli/src/update/index_documents/helpers/merge_functions.rs

+pub fn merge_btreeset_string<'a>(_key: &[u8], values: &[Cow<'a, [u8]>]) -> Result<Cow<'a, [u8]>> {
+    if values.len() == 1 {
+        Ok(values[0].clone())
+    } else {
+        // TODO improve the perf by using a `#[borrow] Cow<str>`.
+        let strings: BTreeSet<String> = values
+            .iter()
+            .map(AsRef::as_ref)
+            .map(serde_json::from_slice::<BTreeSet<String>>)
+            .map(StdResult::unwrap)
+            .reduce(|mut current, new| {
+                for x in new {
+                    current.insert(x);
+                }
+                current
+            })
+            .unwrap();
+        Ok(Cow::Owned(serde_json::to_vec(&strings).unwrap()))
+    }
+}


This may no longer be used.

milli/src/update/index_documents/typed_chunk.rs

ManyTheFish · 2024-02-13T09:27:51Z

milli/src/update/index_documents/typed_chunk.rs

+impl PartialEq for TypedChunk {
+    fn eq(&self, other: &Self) -> bool {
+        use TypedChunk::*;
+        match (self, other) {
+            (FieldIdDocidFacetStrings(_), FieldIdDocidFacetStrings(_))
+            | (FieldIdDocidFacetNumbers(_), FieldIdDocidFacetNumbers(_))
+            | (Documents(_), Documents(_))
+            | (FieldIdWordCountDocids(_), FieldIdWordCountDocids(_))
+            | (WordDocids { .. }, WordDocids { .. })
+            | (WordPositionDocids(_), WordPositionDocids(_))
+            | (WordPairProximityDocids(_), WordPairProximityDocids(_))
+            | (FieldIdFacetStringDocids(_), FieldIdFacetStringDocids(_))
+            | (FieldIdFacetNumberDocids(_), FieldIdFacetNumberDocids(_))
+            | (FieldIdFacetExistsDocids(_), FieldIdFacetExistsDocids(_))
+            | (FieldIdFacetIsNullDocids(_), FieldIdFacetIsNullDocids(_))
+            | (FieldIdFacetIsEmptyDocids(_), FieldIdFacetIsEmptyDocids(_))
+            | (GeoPoints(_), GeoPoints(_))
+            | (ScriptLanguageDocids(_), ScriptLanguageDocids(_)) => true,
+            (
+                VectorPoints { embedder_name: left, expected_dimension: left_dim, .. },
+                VectorPoints { embedder_name: right, expected_dimension: right_dim, .. },
+            ) => left == right && left_dim == right_dim,
+            _ => false,
+        }
+    }
+}
+impl Eq for TypedChunk {}


can_accumulate_with / is_batchable_with

can_be_merged_with?
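
The naming suggestions above point toward a dedicated method instead of a `PartialEq` implementation that ignores payloads. A minimal sketch of that shape, assuming a reduced `Chunk` enum with plain `Vec<u8>` payloads (hypothetical, not the real `TypedChunk`) and using the `mergeable_with` name from the later commit:

```rust
use std::mem::discriminant;

// Reduced, hypothetical version of `TypedChunk`: payloads are plain vectors
// instead of grenad readers.
#[allow(dead_code)]
#[derive(Debug)]
enum Chunk {
    WordDocids(Vec<u8>),
    GeoPoints(Vec<u8>),
    VectorPoints { embedder_name: String, expected_dimension: usize, data: Vec<u8> },
}

impl Chunk {
    /// True when both chunks are of the same kind and can be accumulated
    /// together. Payloads are deliberately ignored.
    fn mergeable_with(&self, other: &Self) -> bool {
        match (self, other) {
            // Vector chunks must also agree on embedder and dimension.
            (
                Chunk::VectorPoints { embedder_name: l, expected_dimension: ld, .. },
                Chunk::VectorPoints { embedder_name: r, expected_dimension: rd, .. },
            ) => l == r && ld == rd,
            // Every other variant only needs to be the same kind.
            _ => discriminant(self) == discriminant(other),
        }
    }
}

fn main() {
    let a = Chunk::WordDocids(vec![1]);
    let b = Chunk::WordDocids(vec![2, 3]);
    let g = Chunk::GeoPoints(vec![]);
    println!("{} {}", a.mergeable_with(&b), a.mergeable_with(&g));
}
```

Using `std::mem::discriminant` avoids listing every variant pair by hand; whether that trade-off suits the real enum is a judgment call, since an explicit list forces a decision each time a variant is added.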

Co-authored-by: Clément Renault <clement@meilisearch.com>

Kerollmops

bors merge

meili-bors · 2024-02-14T15:06:50Z

Build succeeded:

ManyTheFish added this to the v1.7.0 milestone Jan 22, 2024

ManyTheFish requested a review from Kerollmops January 23, 2024 09:23

Kerollmops requested changes Jan 23, 2024

ManyTheFish marked this pull request as draft January 25, 2024 17:24

This was referenced Feb 8, 2024

Cap the maximum memory of the grenad sorters #4388

Merged

Importing larger dump fails with No file descriptors available #4396

Closed

ManyTheFish force-pushed the remove-merging-phase-from-indexing branch from e8ed27b to 169f27f Compare February 8, 2024 11:11

ManyTheFish changed the title ~~Remove merging phase from indexing~~ Make several indexing optimizations Feb 8, 2024

ManyTheFish linked an issue Feb 8, 2024 that may be closed by this pull request

Make the Facet Search Indexing process incremental #4354

Closed

ManyTheFish force-pushed the remove-merging-phase-from-indexing branch 2 times, most recently from f8632f4 to c659cc2 Compare February 8, 2024 15:49

Compute chunk size based on the input data size and the number of indexing threads

be1b054

dureuill force-pushed the remove-merging-phase-from-indexing branch from c659cc2 to 747d4fb Compare February 8, 2024 16:37

dureuill and others added 3 commits February 12, 2024 09:12

fix logs

7877788

yield in loop when the channel is not disconnected

7efb1ca

fix clippy

39c83cb

ManyTheFish force-pushed the remove-merging-phase-from-indexing branch from d65368a to 39c83cb Compare February 12, 2024 08:13

ManyTheFish marked this pull request as ready for review February 12, 2024 08:15

ManyTheFish requested a review from Kerollmops February 12, 2024 08:16

Kerollmops requested changes Feb 12, 2024

ManyTheFish changed the base branch from main to release-v1.7.0 February 12, 2024 15:04

ManyTheFish linked an issue Feb 12, 2024 that may be closed by this pull request

max-indexing-threads CLI parameter consume an additional thread to write in database #4406

Closed

ManyTheFish commented Feb 13, 2024

ManyTheFish and others added 2 commits February 13, 2024 14:22

Update milli/src/update/facet/mod.rs

55de96f

Co-authored-by: Clément Renault <clement@meilisearch.com>

Update milli/src/update/index_documents/extract/mod.rs

e5e811e

Co-authored-by: Clément Renault <clement@meilisearch.com>

ManyTheFish requested a review from Kerollmops February 13, 2024 14:14

fix PR comments

48026aa

ManyTheFish force-pushed the remove-merging-phase-from-indexing branch from a14bf6f to 48026aa Compare February 13, 2024 14:19

ManyTheFish added 2 commits February 14, 2024 11:46

Fix and add logs

3beda88

Change is_batchable_with by mergeable_with

03bb637

Kerollmops approved these changes Feb 14, 2024

meili-bors bot merged commit 72c1674 into release-v1.7.0 Feb 14, 2024
10 checks passed

meili-bors bot deleted the remove-merging-phase-from-indexing branch February 14, 2024 15:06

This was referenced Feb 14, 2024

Make the Facet Search Indexing process incremental #4354

Closed

max-indexing-threads CLI parameter consume an additional thread to write in database #4406

Closed

meili-bot added the v1.7.0 PRs/issues solved in v1.7.0 released on 2024-03-11 label Mar 12, 2024
