Make several indexing optimizations #4350
Conversation
/benchmark indexing
```rust
.inspect(|result| {
    if proximity_precision == ProximityPrecision::ByWord {
        if let Ok((docid_word_positions_chunk, _)) = result {
            run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
                docid_word_positions_chunk.clone(),
                indexer,
                lmdb_writer_sx.clone(),
                extract_word_pair_proximity_docids,
                TypedChunk::WordPairProximityDocids,
                "word-pair-proximity-docids",
            );
        }
    }
})
.inspect(|result| {
    if let Ok((docid_word_positions_chunk, _)) = result {
        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
            docid_word_positions_chunk.clone(),
            indexer,
            lmdb_writer_sx.clone(),
            extract_fid_word_count_docids,
            TypedChunk::FieldIdWordCountDocids,
            "field-id-wordcount-docids",
        );
    }
})
.inspect(|result| {
    if let Ok((docid_word_positions_chunk, _)) = result {
        let exact_attributes = exact_attributes.clone();
        run_extraction_task::<
            _,
            _,
            (
                grenad::Reader<BufReader<File>>,
                grenad::Reader<BufReader<File>>,
                grenad::Reader<BufReader<File>>,
            ),
        >(
            docid_word_positions_chunk.clone(),
            indexer,
            lmdb_writer_sx.clone(),
            move |doc_word_pos, indexer| {
                extract_word_docids(doc_word_pos, indexer, &exact_attributes)
            },
            |(
                word_docids_reader,
                exact_word_docids_reader,
                word_fid_docids_reader,
            )| {
                TypedChunk::WordDocids {
                    word_docids_reader,
                    exact_word_docids_reader,
                    word_fid_docids_reader,
                }
            },
            "word-docids",
        );
    }
})
.inspect(|result| {
    if let Ok((docid_word_positions_chunk, _)) = result {
        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
            docid_word_positions_chunk.clone(),
            indexer,
            lmdb_writer_sx.clone(),
            extract_word_position_docids,
            TypedChunk::WordPositionDocids,
            "word-position-docids",
        );
    }
})
.inspect(|result| {
    if let Ok((_, (_, fid_docid_facet_strings_chunk))) = result {
        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
            fid_docid_facet_strings_chunk.clone(),
            indexer,
            lmdb_writer_sx.clone(),
            extract_facet_string_docids,
            TypedChunk::FieldIdFacetStringDocids,
            "field-id-facet-string-docids",
        );
    }
})
.inspect(|result| {
    if let Ok((_, (fid_docid_facet_numbers_chunk, _))) = result {
        run_extraction_task::<_, _, grenad::Reader<BufReader<File>>>(
            fid_docid_facet_numbers_chunk.clone(),
            indexer,
            lmdb_writer_sx.clone(),
            extract_facet_number_docids,
            TypedChunk::FieldIdFacetNumberDocids,
            "field-id-facet-number-docids",
        );
    }
})
.map(|r| r.map(|_| ()))
```
Could you replace all of these `inspect` calls and the final `map` with a single `map`? `inspect` is meant for debugging rather than doing actual work during traversal. Also, can't these extraction tasks run in parallel? They appear to run sequentially here.
Maybe you can use rayon's `for_each` trait method here: looping over the list of extraction functions and calling `run_extraction_task` on each could do the trick to run those extractions in parallel!
```diff
@@ -82,90 +82,6 @@ pub unsafe fn as_cloneable_grenad(
     Ok(reader)
 }

 pub trait MergeableReader
```
🎉
Here are your indexing benchmarks diff 👊
/benchmark indexing
Here are your indexing benchmarks diff 👊
/benchmark indexing
Force-pushed from e8ed27b to 169f27f
Here are your indexing benchmarks diff 👊
Force-pushed from f8632f4 to c659cc2
Force-pushed from c659cc2 to 747d4fb
Force-pushed from d65368a to 39c83cb
milli/src/update/facet/mod.rs (Outdated)
```rust
    .map(|(_, c)| c)
    .collect();
normalized_facet = normalized_truncated_facet.into();
if let Some(normalized_delta_data) = self.normalized_delta_data {
```
Can you extract this into another function, please?
```rust
pub fn merge_btreeset_string<'a>(_key: &[u8], values: &[Cow<'a, [u8]>]) -> Result<Cow<'a, [u8]>> {
    if values.len() == 1 {
        Ok(values[0].clone())
    } else {
        // TODO improve the perf by using a `#[borrow] Cow<str>`.
        let strings: BTreeSet<String> = values
            .iter()
            .map(AsRef::as_ref)
            .map(serde_json::from_slice::<BTreeSet<String>>)
            .map(StdResult::unwrap)
            .reduce(|mut current, new| {
                for x in new {
                    current.insert(x);
                }
                current
            })
            .unwrap();
        Ok(Cow::Owned(serde_json::to_vec(&strings).unwrap()))
    }
}
```
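Stripped of the `serde_json` (de)serialization framing, the merge above is a plain fold over the decoded sets. A std-only sketch of that reduction (the function name here is hypothetical, for illustration):

```rust
use std::collections::BTreeSet;

// Fold a list of decoded sets into one, mirroring the reduce step in
// merge_btreeset_string (deserialization and re-serialization omitted).
fn merge_string_sets(values: Vec<BTreeSet<String>>) -> BTreeSet<String> {
    values
        .into_iter()
        .reduce(|mut current, new| {
            // Same effect as the inner loop inserting every element of `new`.
            current.extend(new);
            current
        })
        .unwrap_or_default()
}
```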
This may no longer be used.
```rust
impl PartialEq for TypedChunk {
    fn eq(&self, other: &Self) -> bool {
        use TypedChunk::*;
        match (self, other) {
            (FieldIdDocidFacetStrings(_), FieldIdDocidFacetStrings(_))
            | (FieldIdDocidFacetNumbers(_), FieldIdDocidFacetNumbers(_))
            | (Documents(_), Documents(_))
            | (FieldIdWordCountDocids(_), FieldIdWordCountDocids(_))
            | (WordDocids { .. }, WordDocids { .. })
            | (WordPositionDocids(_), WordPositionDocids(_))
            | (WordPairProximityDocids(_), WordPairProximityDocids(_))
            | (FieldIdFacetStringDocids(_), FieldIdFacetStringDocids(_))
            | (FieldIdFacetNumberDocids(_), FieldIdFacetNumberDocids(_))
            | (FieldIdFacetExistsDocids(_), FieldIdFacetExistsDocids(_))
            | (FieldIdFacetIsNullDocids(_), FieldIdFacetIsNullDocids(_))
            | (FieldIdFacetIsEmptyDocids(_), FieldIdFacetIsEmptyDocids(_))
            | (GeoPoints(_), GeoPoints(_))
            | (ScriptLanguageDocids(_), ScriptLanguageDocids(_)) => true,
            (
                VectorPoints { embedder_name: left, expected_dimension: left_dim, .. },
                VectorPoints { embedder_name: right, expected_dimension: right_dim, .. },
            ) => left == right && left_dim == right_dim,
            _ => false,
        }
    }
}

impl Eq for TypedChunk {}
```
How about `can_accumulate_with` or `is_batchable_with`?
`can_be_merged_with`?
Co-authored-by: Clément Renault <clement@meilisearch.com>
Force-pushed from a14bf6f to 48026aa
bors merge
Summary
Implement several enhancements to reduce the indexing time.
Steps
Running the indexing process:
On main, each type of data is written only after a merging phase:
On remove-merging-phase-from-indexing, the data is written as soon as the extraction of a chunk is finished:
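The difference between the two write strategies can be sketched as follows; this is a deliberately simplified illustration (toy `Vec<u32>`-as-store types, hypothetical function names), not the actual grenad/LMDB code:

```rust
// main: collect every chunk of a given type, merge them in a dedicated
// phase, then perform a single write at the end.
fn write_after_merge(chunks: Vec<Vec<u32>>, store: &mut Vec<u32>) {
    let mut merged: Vec<u32> = chunks.into_iter().flatten().collect();
    merged.sort_unstable();
    merged.dedup();
    store.extend(merged); // one write, after the merging phase
}

// remove-merging-phase-from-indexing: write each chunk into the store as
// soon as its extraction finishes, letting the store resolve duplicates.
fn write_per_chunk(chunks: Vec<Vec<u32>>, store: &mut Vec<u32>) {
    for chunk in chunks {
        store.extend(chunk); // one write per finished chunk
        store.sort_unstable();
        store.dedup();
    }
}
```

Both strategies end with the same store contents; the second simply starts writing earlier instead of waiting for a global merge.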
Related
This PR removes the appending writes in several parts of the indexing process, which may fix #4300. However, not all of the appending writes are removed: there are 2 remaining calls that could still trigger this bug: