Make several indexing optimizations #4350

Merged
merged 9 commits on Feb 14, 2024

Conversation

ManyTheFish
Member

@ManyTheFish ManyTheFish commented Jan 22, 2024

Summary

Implement several enhancements to reduce the indexing time.

Steps

  • Compute the indexing chunk size dynamically based on the available threads and the data size
  • Remove the merging step before the writing step and merge at the writing time
  • Remove append function
  • Make Facet search indexing incremental
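As an illustration of the first step, a dynamic chunk size could be derived as in the sketch below. The helper name `chunk_size_for`, the one-chunk-per-thread rule, and the 512 KiB floor are assumptions made for this example; they are not taken from the PR.

```rust
/// Hypothetical helper: pick a chunk size (in bytes) so that the input is
/// split into roughly one chunk per worker thread.
fn chunk_size_for(total_bytes: usize, threads: usize) -> usize {
    // Assumed floor, so that tiny inputs do not produce many small chunks.
    const MIN_CHUNK_SIZE: usize = 512 * 1024;
    // Guard against a zero thread count.
    let chunk_count = threads.max(1);
    (total_bytes / chunk_count).max(MIN_CHUNK_SIZE)
}
```

With these assumptions, a 64 MiB payload on 8 threads gives 8 MiB chunks, while a 1 KiB payload falls back to the 512 KiB floor. Larger chunks mean fewer temporary files open at once and fewer files to merge.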

Running Indexing process

main

Each type of data is written after a merging phase:
[Screenshot, 2024-01-23 at 10:18:08]

The highlighted parts are the writes.

remove-merging-phase-from-indexing

When the extraction of a chunk is finished, the data is written:
[Screenshot, 2024-01-23 at 10:18:18]

The highlighted parts are the writes.
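The "write as soon as a chunk is extracted" flow can be pictured with the minimal sketch below. It assumes an in-memory `BTreeMap` as a stand-in for LMDB and plain vectors as a stand-in for grenad readers; the names `Chunk`, `Store`, `write_chunk`, and `index` are hypothetical. The key point is that the merge happens at write time: an incoming entry is unioned with whatever is already stored under that key.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::sync::mpsc;
use std::thread;

/// A chunk of extracted entries: word -> document ids (stand-in for a grenad reader).
type Chunk = Vec<(String, Vec<u32>)>;
/// In-memory stand-in for the LMDB database.
type Store = BTreeMap<String, BTreeSet<u32>>;

/// Merge one chunk into the store: existing entries are unioned, not overwritten.
fn write_chunk(store: &mut Store, chunk: Chunk) {
    for (word, docids) in chunk {
        store.entry(word).or_default().extend(docids);
    }
}

/// Send chunks from worker threads and write each one as soon as it arrives.
fn index(chunks: Vec<Chunk>) -> Store {
    let (sender, receiver) = mpsc::channel::<Chunk>();
    let workers: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            let sender = sender.clone();
            // The "extraction" here is a no-op that just forwards the chunk.
            thread::spawn(move || sender.send(chunk).expect("writer hung up"))
        })
        .collect();
    // Drop the original sender so the receive loop ends once all workers finish.
    drop(sender);

    let mut store = Store::new();
    for chunk in receiver {
        write_chunk(&mut store, chunk);
    }
    for worker in workers {
        worker.join().expect("worker panicked");
    }
    store
}
```

Because the writer merges on arrival, the final content does not depend on the order in which chunks reach it.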

Related

This PR removes the appending writes from several parts of the indexing process, which may fix #4300. However, not all of the appending writes are removed. Two remaining calls could still trigger this bug:

@ManyTheFish ManyTheFish added this to the v1.7.0 milestone Jan 22, 2024
@ManyTheFish
Member Author

/benchmark indexing

Comment on lines 93 to 188
.inspect(|result| {
    if proximity_precision == ProximityPrecision::ByWord {
        if let Ok((docid_word_positions_chunk, _)) = result {
            run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
                docid_word_positions_chunk.clone(),
                indexer,
                lmdb_writer_sx.clone(),
                extract_word_pair_proximity_docids,
                TypedChunk::WordPairProximityDocids,
                "word-pair-proximity-docids",
            );
        }
    }
})
.inspect(|result| {
    if let Ok((docid_word_positions_chunk, _)) = result {
        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
            docid_word_positions_chunk.clone(),
            indexer,
            lmdb_writer_sx.clone(),
            extract_fid_word_count_docids,
            TypedChunk::FieldIdWordCountDocids,
            "field-id-wordcount-docids",
        );
    }
})
.inspect(|result| {
    if let Ok((docid_word_positions_chunk, _)) = result {
        let exact_attributes = exact_attributes.clone();
        run_extraction_task::<
            _,
            _,
            (
                grenad::Reader<BufReader<File>>,
                grenad::Reader<BufReader<File>>,
                grenad::Reader<BufReader<File>>,
            ),
        >(
            docid_word_positions_chunk.clone(),
            indexer,
            lmdb_writer_sx.clone(),
            move |doc_word_pos, indexer| {
                extract_word_docids(doc_word_pos, indexer, &exact_attributes)
            },
            |(
                word_docids_reader,
                exact_word_docids_reader,
                word_fid_docids_reader,
            )| {
                TypedChunk::WordDocids {
                    word_docids_reader,
                    exact_word_docids_reader,
                    word_fid_docids_reader,
                }
            },
            "word-docids",
        );
    }
})
.inspect(|result| {
    if let Ok((docid_word_positions_chunk, _)) = result {
        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
            docid_word_positions_chunk.clone(),
            indexer,
            lmdb_writer_sx.clone(),
            extract_word_position_docids,
            TypedChunk::WordPositionDocids,
            "word-position-docids",
        );
    }
})
.inspect(|result| {
    if let Ok((_, (_, fid_docid_facet_strings_chunk))) = result {
        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
            fid_docid_facet_strings_chunk.clone(),
            indexer,
            lmdb_writer_sx.clone(),
            extract_facet_string_docids,
            TypedChunk::FieldIdFacetStringDocids,
            "field-id-facet-string-docids",
        );
    }
})
.inspect(|result| {
    if let Ok((_, (fid_docid_facet_numbers_chunk, _))) = result {
        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
            fid_docid_facet_numbers_chunk.clone(),
            indexer,
            lmdb_writer_sx.clone(),
            extract_facet_number_docids,
            TypedChunk::FieldIdFacetNumberDocids,
            "field-id-facet-number-docids",
        );
    }
})
.map(|r| r.map(|_| ()))
Member

Could you replace all those inspect calls and this map with a single map instead? inspect is meant for debugging rather than for doing actual work while traversing. Also, can't these inspect tasks run in parallel? They seem to run sequentially here.

Member

Maybe you can use rayon's for_each trait method on a list of functions. Looping over the functions and calling run_extraction_task on each one could do the trick and run those extractions in parallel!
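A rough sketch of that idea follows. To stay dependency-free it uses `std::thread::scope` rather than rayon's `for_each`, so it shows the shape of the suggestion, not the API the reviewer names. The `Extractor` type and the three toy extractors are hypothetical stand-ins for the real extraction functions.

```rust
use std::thread;

/// Hypothetical extractor signature: takes the shared chunk and returns a
/// label plus a computed value.
type Extractor = fn(&[u32]) -> (&'static str, u64);

fn sum(chunk: &[u32]) -> (&'static str, u64) {
    ("sum", chunk.iter().map(|&x| u64::from(x)).sum())
}

fn count(chunk: &[u32]) -> (&'static str, u64) {
    ("count", chunk.len() as u64)
}

fn largest(chunk: &[u32]) -> (&'static str, u64) {
    ("max", chunk.iter().copied().max().map_or(0, u64::from))
}

/// Run every extractor on the same chunk, one scoped thread per extractor.
/// Results come back in the same order as the extractor list.
fn run_extractions(chunk: &[u32], extractors: &[Extractor]) -> Vec<(&'static str, u64)> {
    thread::scope(|scope| {
        let handles: Vec<_> = extractors
            .iter()
            .map(|&extract| scope.spawn(move || extract(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("extractor panicked"))
            .collect()
    })
}
```

Keeping the extractors in one list replaces the chain of inspect calls with a single loop, and adding a new extraction becomes a one-line change to that list.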

@@ -82,90 +82,6 @@ pub unsafe fn as_cloneable_grenad(
Ok(reader)
}

pub trait MergeableReader
Member

🎉

milli/src/update/index_documents/mod.rs Outdated (resolved)
@meili-bot
Contributor

Here are your indexing benchmarks diff 👊

group                                                                     indexing_main_8e016fbf                  indexing_remove-merging-phase-from-indexing_b6fc1819
-----                                                                     ----------------------                  ----------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00       7.2±0.55s        ? ?/sec     1.09       7.8±0.36s        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.04  1567.8±177.85ms        ? ?/sec    1.00  1505.7±142.25ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00       7.2±0.23s        ? ?/sec     2.30      16.6±0.49s        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00      11.2±0.88s        ? ?/sec     1.09      12.2±0.81s        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.44      52.1±3.18s        ? ?/sec     1.00      36.3±3.69s        ? ?/sec
indexing/Indexing geo_point                                               1.00      49.5±1.48s        ? ?/sec     1.47      72.8±1.79s        ? ?/sec
indexing/Indexing movies in three batches                                 1.09       4.0±0.13s        ? ?/sec     1.00       3.7±0.19s        ? ?/sec
indexing/Indexing movies with default settings                            1.00       3.9±0.33s        ? ?/sec     1.15       4.5±0.44s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.00       4.9±0.36s        ? ?/sec     2.90      14.1±0.34s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.00       4.2±0.20s        ? ?/sec     1.31       5.6±0.31s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.00      29.0±1.21s        ? ?/sec     2.54      73.7±2.05s        ? ?/sec
indexing/Indexing songs with default settings                             1.00      33.6±1.03s        ? ?/sec     1.95      65.5±1.85s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      31.0±1.55s        ? ?/sec     1.32      41.0±1.00s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.00      32.8±1.71s        ? ?/sec     1.83      59.8±1.54s        ? ?/sec
indexing/Indexing wiki                                                    1.00     297.3±8.79s        ? ?/sec     1.18     350.2±6.62s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.00     232.0±3.66s        ? ?/sec     1.47    342.1±11.13s        ? ?/sec
indexing/Reindexing geo_point                                             1.00      12.5±0.12s        ? ?/sec     1.02      12.7±0.17s        ? ?/sec
indexing/Reindexing movies with default settings                          1.05   224.0±32.46ms        ? ?/sec     1.00   214.1±35.45ms        ? ?/sec
indexing/Reindexing songs with default settings                           1.00       3.6±0.04s        ? ?/sec     1.02       3.6±0.04s        ? ?/sec
indexing/Reindexing wiki                                                  1.00    399.1±12.94s        ? ?/sec     1.14     455.3±3.41s        ? ?/sec

@ManyTheFish ManyTheFish marked this pull request as draft January 25, 2024 17:24
@ManyTheFish
Member Author

/benchmark indexing

@Kerollmops
Member

Kerollmops commented Feb 5, 2024

I measured this PR by loading a dump of a big index of about 150M documents. Writing along the way, without merging at the end of the extraction, did not pay off and slowed down the whole process. However, keeping the chunk size formula this way worked very well: it reduced the number of files open in parallel and the number of files to merge before writing them.

This screenshot shows a run with:

  • The size of the document chunks changed (so chunks are merged before being written to LMDB)
  • Facet search disabled
[Screenshot, 2024-01-25 at 18:33:36]

The extraction functions take a long time to run, not just tens of milliseconds.

@ManyTheFish
Member Author

/benchmark indexing

@ManyTheFish ManyTheFish force-pushed the remove-merging-phase-from-indexing branch from e8ed27b to 169f27f Compare February 8, 2024 11:11
@ManyTheFish ManyTheFish changed the title Remove merging phase from indexing Make several indexing optimizations Feb 8, 2024
@ManyTheFish ManyTheFish linked an issue Feb 8, 2024 that may be closed by this pull request
@meili-bot
Contributor

Here are your indexing benchmarks diff 👊

group                                                                     indexing_main_8e016fbf                  indexing_remove-merging-phase-from-indexing_3e120619
-----                                                                     ----------------------                  ----------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00       7.2±0.55s        ? ?/sec     1.09       7.8±0.36s        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.01  1567.8±177.85ms        ? ?/sec    1.00  1557.4±152.49ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.81       7.2±0.23s        ? ?/sec     1.00       4.0±0.18s        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.52      11.2±0.88s        ? ?/sec     1.00       7.3±0.87s        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.12      52.1±3.18s        ? ?/sec     1.00     46.7±12.25s        ? ?/sec
indexing/Indexing geo_point                                               1.00      49.5±1.48s        ? ?/sec     1.17      57.8±1.11s        ? ?/sec
indexing/Indexing movies in three batches                                 1.06       4.0±0.13s        ? ?/sec     1.00       3.7±0.24s        ? ?/sec
indexing/Indexing movies with default settings                            1.00       3.9±0.33s        ? ?/sec     1.08       4.3±0.30s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.00       4.9±0.36s        ? ?/sec     1.41       6.9±0.25s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.00       4.2±0.20s        ? ?/sec     1.30       5.5±0.27s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.00      29.0±1.21s        ? ?/sec     1.18      34.1±0.80s        ? ?/sec
indexing/Indexing songs with default settings                             1.00      33.6±1.03s        ? ?/sec     1.35      45.3±2.07s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      31.0±1.55s        ? ?/sec     1.17      36.4±0.88s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.00      32.8±1.71s        ? ?/sec     1.30      42.7±0.79s        ? ?/sec
indexing/Indexing wiki                                                    1.04     297.3±8.79s        ? ?/sec     1.00     286.7±8.13s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.03     232.0±3.66s        ? ?/sec     1.00    224.6±14.01s        ? ?/sec
indexing/Reindexing geo_point                                             1.00      12.5±0.12s        ? ?/sec     1.02      12.8±0.32s        ? ?/sec
indexing/Reindexing movies with default settings                          1.01   224.0±32.46ms        ? ?/sec     1.00   221.4±37.04ms        ? ?/sec
indexing/Reindexing songs with default settings                           1.00       3.6±0.04s        ? ?/sec     1.00       3.6±0.05s        ? ?/sec
indexing/Reindexing wiki                                                  1.14    399.1±12.94s        ? ?/sec     1.00     348.7±8.30s        ? ?/sec

@ManyTheFish ManyTheFish force-pushed the remove-merging-phase-from-indexing branch 2 times, most recently from f8632f4 to c659cc2 Compare February 8, 2024 15:49
@dureuill dureuill force-pushed the remove-merging-phase-from-indexing branch from c659cc2 to 747d4fb Compare February 8, 2024 16:37
@ManyTheFish ManyTheFish force-pushed the remove-merging-phase-from-indexing branch from d65368a to 39c83cb Compare February 12, 2024 08:13
@ManyTheFish ManyTheFish marked this pull request as ready for review February 12, 2024 08:15
.map(|(_, c)| c)
.collect();
normalized_facet = normalized_truncated_facet.into();
if let Some(normalized_delta_data) = self.normalized_delta_data {
Member

Can you extract this into another function, please?

milli/src/update/facet/mod.rs Outdated (resolved)
milli/src/update/index_documents/extract/mod.rs Outdated (resolved)
Comment on lines 226 to 245
pub fn merge_btreeset_string<'a>(_key: &[u8], values: &[Cow<'a, [u8]>]) -> Result<Cow<'a, [u8]>> {
    if values.len() == 1 {
        Ok(values[0].clone())
    } else {
        // TODO improve the perf by using a `#[borrow] Cow<str>`.
        let strings: BTreeSet<String> = values
            .iter()
            .map(AsRef::as_ref)
            .map(serde_json::from_slice::<BTreeSet<String>>)
            .map(StdResult::unwrap)
            .reduce(|mut current, new| {
                for x in new {
                    current.insert(x);
                }
                current
            })
            .unwrap();
        Ok(Cow::Owned(serde_json::to_vec(&strings).unwrap()))
    }
}
Member

This may no longer be used.

@ManyTheFish ManyTheFish changed the base branch from main to release-v1.7.0 February 12, 2024 15:04
Comment on lines 90 to 116
impl PartialEq for TypedChunk {
    fn eq(&self, other: &Self) -> bool {
        use TypedChunk::*;
        match (self, other) {
            (FieldIdDocidFacetStrings(_), FieldIdDocidFacetStrings(_))
            | (FieldIdDocidFacetNumbers(_), FieldIdDocidFacetNumbers(_))
            | (Documents(_), Documents(_))
            | (FieldIdWordCountDocids(_), FieldIdWordCountDocids(_))
            | (WordDocids { .. }, WordDocids { .. })
            | (WordPositionDocids(_), WordPositionDocids(_))
            | (WordPairProximityDocids(_), WordPairProximityDocids(_))
            | (FieldIdFacetStringDocids(_), FieldIdFacetStringDocids(_))
            | (FieldIdFacetNumberDocids(_), FieldIdFacetNumberDocids(_))
            | (FieldIdFacetExistsDocids(_), FieldIdFacetExistsDocids(_))
            | (FieldIdFacetIsNullDocids(_), FieldIdFacetIsNullDocids(_))
            | (FieldIdFacetIsEmptyDocids(_), FieldIdFacetIsEmptyDocids(_))
            | (GeoPoints(_), GeoPoints(_))
            | (ScriptLanguageDocids(_), ScriptLanguageDocids(_)) => true,
            (
                VectorPoints { embedder_name: left, expected_dimension: left_dim, .. },
                VectorPoints { embedder_name: right, expected_dimension: right_dim, .. },
            ) => left == right && left_dim == right_dim,
            _ => false,
        }
    }
}
impl Eq for TypedChunk {}
Member Author

can_accumulate_with / is_batchable_with

Member

can_be_merged_with?
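The naming discussion above suggests replacing the `PartialEq` impl with an explicitly named method, since the comparison ignores payloads and does not express real equality. A minimal sketch of that shape follows, using a simplified stand-in enum. The name `mergeable_with` and the use of `std::mem::discriminant` are illustrative choices for this example, not necessarily what the PR ended up with.

```rust
use std::mem::discriminant;

/// Simplified stand-in for `TypedChunk`; payloads are plain vectors here.
enum Chunk {
    WordDocids(Vec<u32>),
    GeoPoints(Vec<u32>),
    VectorPoints { embedder_name: String, expected_dimension: usize, points: Vec<f32> },
}

impl Chunk {
    /// Two chunks can be merged when they are the same variant and, for
    /// vector points, target the same embedder with the same dimension.
    fn mergeable_with(&self, other: &Self) -> bool {
        match (self, other) {
            (
                Chunk::VectorPoints { embedder_name: left, expected_dimension: left_dim, .. },
                Chunk::VectorPoints { embedder_name: right, expected_dimension: right_dim, .. },
            ) => left == right && left_dim == right_dim,
            // Every other variant only needs to match by kind.
            _ => discriminant(self) == discriminant(other),
        }
    }
}
```

A named method makes call sites read as a batching decision rather than an equality test, and `discriminant` avoids listing every variant pair by hand.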

ManyTheFish and others added 2 commits February 13, 2024 14:22
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
@ManyTheFish ManyTheFish force-pushed the remove-merging-phase-from-indexing branch from a14bf6f to 48026aa Compare February 13, 2024 14:19
Member

@Kerollmops Kerollmops left a comment

bors merge

Contributor

meili-bors bot commented Feb 14, 2024

@meili-bors meili-bors bot merged commit 72c1674 into release-v1.7.0 Feb 14, 2024
10 checks passed
@meili-bors meili-bors bot deleted the remove-merging-phase-from-indexing branch February 14, 2024 15:06
@meili-bot meili-bot added the v1.7.0 PRs/issues solved in v1.7.0 released on 2024-03-11 label Mar 12, 2024