Optimize latin languages detection #108
Conversation
Use an intermediate character score instead of scoring languages directly, to reduce the quadratic effect of the nested loops.
src/alphabets/latin.rs (Outdated)

```rust
// score of each character.
let mut max_raw_score = 0;
let mut scores: Vec<_> = chars.iter().map(|_| 0).collect();
```

Suggested change:

```diff
-let mut scores: Vec<_> = chars.iter().map(|_| 0).collect();
+let mut scores = vec![0; chars.count()];
```
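A minimal sketch of the suggestion above, with a hypothetical `chars` collection: both forms produce the same zero-filled score buffer, but the `vec![0; n]` macro allocates it directly instead of going through a map/collect. Note that if `chars` were a plain iterator rather than a slice, `.count()` would consume it; here we assume it is a `Vec` and use `.len()`.

```rust
fn main() {
    // Hypothetical stand-in for the `chars` collection in latin.rs.
    let chars: Vec<char> = "abc".chars().collect();

    // Original form: build the zeroed score buffer by mapping over the chars.
    let scores_mapped: Vec<u32> = chars.iter().map(|_| 0).collect();

    // Suggested form: allocate the zeroed buffer directly.
    let scores_macro: Vec<u32> = vec![0; chars.len()];

    assert_eq!(scores_mapped, scores_macro);
    println!("{}", scores_macro.len()); // prints "3"
}
```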
Force-pushed the branch from e46ff8f to e8a83b0.
```diff
@@ -43,7 +43,7 @@ pub fn alphabet_calculate_scores(text: &str) -> Outcome {
     }
 }

-raw_scores.sort_by(|a, b| b.1.cmp(&a.1));
+raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));
```

Suggested change:

```diff
-raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));
+raw_scores.sort_unstable_by_key(|(lang, _)| lang);
```
it's `Reverse`d, no?

```diff
-raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));
+raw_scores.sort_unstable_by_key(|(_, score)| Reverse(score));
```
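A runnable sketch of the corrected suggestion, with hypothetical `(lang, score)` pairs: wrapping the key in `std::cmp::Reverse` sorts from highest to lowest score, matching what the original `b.1.cmp(&a.1)` comparator does.

```rust
use std::cmp::Reverse;

fn main() {
    // Hypothetical (lang, score) pairs standing in for `raw_scores`.
    let mut raw_scores = vec![("eng", 3), ("fra", 7), ("deu", 5)];

    // Reverse(score) inverts the ordering of the key, so the sort is
    // descending by score — equivalent to `|a, b| b.1.cmp(&a.1)`.
    raw_scores.sort_unstable_by_key(|&(_, score)| Reverse(score));

    assert_eq!(raw_scores, vec![("fra", 7), ("deu", 5), ("eng", 3)]);
}
```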
misc/alphabets/latin_gen.rb (Outdated)

```diff
@@ -26,7 +26,7 @@ def load_alphabets
   alphabets[code] = normalize_alphabet(alphabet)
 end

-alphabets.sort_by {|k, _| k }.to_h
+alphabets.sort_unstable_by {|k, _| k }.to_h
```

Here, this is a Ruby script, not Rust.
Ouch, ouch, ouch.
src/alphabets/cyrillic.rs (Outdated)

```diff
@@ -35,7 +35,7 @@ pub fn alphabet_calculate_scores(text: &LowercaseText, filter_list: &FilterList)
     }
 }

-raw_scores.sort_by(|a, b| b.1.cmp(&a.1));
+raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));
```

`by_key`?

Suggested change:

```diff
-raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));
+raw_scores.sort_unstable_by_key(|(_, score)| Reverse(score));
```
```rust
/// Inverted map binding a character to a set of languages.
pub static ALPHABET_LANG_MAP: Lazy<(Vec<char>, Vec<Vec<Lang>>)> = Lazy::new(|| {
```

Why not keep only one vec here? Is it slower?

Suggested change:

```diff
 /// Inverted map binding a character to a set of languages.
-pub static ALPHABET_LANG_MAP: Lazy<(Vec<char>, Vec<Vec<Lang>>)> = Lazy::new(|| {
+pub static ALPHABET_LANG_MAP: Lazy<(Vec<(char, Vec<Lang>)>)> = Lazy::new(|| {
```
I don't use them at the same time: `chars` is used in the first loop to compute char scores based on their occurrences in the text, and `langs` is used in the second loop to sum char scores into lang scores. Keeping them merged would just add noisy tuples with anonymous variables everywhere. In terms of computing time, there is no significant change.
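A self-contained sketch of the two-pass scoring described above. The names, the tiny alphabet, and the language codes are hypothetical, not whatlang's actual data; the point is the positional binding between the two parallel vectors and the two separate loops.

```rust
use std::collections::HashMap;

fn main() {
    // Parallel vectors: `langs[i]` lists the languages whose alphabet
    // contains `chars[i]` (bound by position, as in ALPHABET_LANG_MAP).
    // `chars` must be sorted for the binary search below.
    let chars: Vec<char> = vec!['a', 'ß', 'é'];
    let langs: Vec<Vec<&str>> = vec![
        vec!["eng", "fra", "deu"], // 'a'
        vec!["deu"],               // 'ß'
        vec!["fra"],               // 'é'
    ];

    let text = "état";

    // First loop: score each character by its occurrences in the text.
    let mut char_scores = vec![0u32; chars.len()];
    for ch in text.chars() {
        if let Ok(i) = chars.binary_search(&ch) {
            char_scores[i] += 1;
        }
    }

    // Second loop: sum char scores into language scores.
    let mut lang_scores: HashMap<&str, u32> = HashMap::new();
    for (i, score) in char_scores.iter().enumerate() {
        for &lang in &langs[i] {
            *lang_scores.entry(lang).or_insert(0) += score;
        }
    }

    // "état" contains one 'é' and one 'a', both in the French alphabet.
    assert_eq!(lang_scores.get("fra"), Some(&2));
}
```

Because the text loop and the language loop are separate, the cost is O(text × log(chars)) + O(chars × langs) instead of the quadratic O(text × langs) of scoring every language inside the text loop.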
It took me quite a while to understand why this data structure is the way it is. The binding through the position in the vectors is not very obvious. Would you mind adding a comment on that? (In a separate PR; this one, I guess, will be merged today.)
@ManyTheFish @Kerollmops Thank you guys! I do not promise to review this soon, because there is a lot of shit going on with my relatives and friends in Ukraine, and helping them and Ukraine is a much higher priority for me at the moment.
Force-pushed the branch from e8a83b0 to 7cd4ccb.
src/trigrams/detection.rs (Outdated)

```diff
@@ -77,7 +77,7 @@ fn calculate_scores_in_profiles(
 }

 // Sort languages by distance
-lang_distances.sort_by_key(|key| key.1);
+lang_distances.sort_unstable_by_key(|key| key.1);
```

`by_key`?

yes! `by_key` — you want me to remove it? 😝
```diff
@@ -33,7 +33,7 @@ fn trigram_occurances_to_positions(
     .into_iter()
     .map(|(trigram, count)| (count, trigram))
     .collect();
-count_vec.sort_by(|a, b| b.cmp(a));
+count_vec.sort_unstable_by(|a, b| b.cmp(a));
```

`by_key`?

This one I don't know. @greyblake, could this sort be equivalent to the code below?

```diff
-count_vec.sort_unstable_by(|a, b| b.cmp(a));
+count_vec.sort_unstable_by_key(|(count, _trigram)| Reverse(*count));
```
Or is the trigram ordering important? We could rewrite the function a bit, like:
```diff
 #[allow(clippy::unnecessary_sort_by)]
 fn trigram_occurances_to_positions(
     trigram_occurances: HashMap<Trigram, u32>,
 ) -> HashMap<Trigram, u32> {
-    // Sort in descending order by number of occurrences and trigrams
+    // Sort in ascending order by number of occurrences and trigrams
     let mut count_vec: Vec<_> = trigram_occurances
         .into_iter()
         .map(|(trigram, count)| (count, trigram))
         .collect();
-    count_vec.sort_unstable_by(|a, b| b.cmp(a));
+    count_vec.sort_unstable();
     count_vec
         .into_iter()
+        .rev() // take starting from the highest count
         .take(TEXT_TRIGRAMS_SIZE) // we're interested only in the first 600 (2 * MAX_TRIGRAM_DISTANCE)
         .enumerate()
         .map(|(i, (_, trigram))| (trigram, i as u32))
         .collect()
 }
```
I'll probably try, in another PR, to collect into a `BinaryHeap` instead of a `Vec` to sort the `HashMap` 🤔 — we may just keep the current implementation here.
It will surely be faster to use a `Vec` and sort it than to use a binary heap here; the number of `cmp` calls will be much lower. I like the function rewrite you propose ✅
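To make the trade-off discussed above concrete, here is a hedged sketch comparing the two ways of taking the top-k elements, with hypothetical `(count, trigram)` pairs: sorting a `Vec` descending and taking the first k, versus popping k times from a `std::collections::BinaryHeap` (a max-heap, so `pop` yields elements from the largest down).

```rust
use std::collections::BinaryHeap;

fn main() {
    // Hypothetical (count, trigram) pairs standing in for `count_vec`.
    let pairs = vec![(3u32, "abc"), (7, "def"), (5, "ghi"), (1, "jkl")];

    // Vec approach: sort descending, then take the top 2.
    let mut by_vec = pairs.clone();
    by_vec.sort_unstable_by(|a, b| b.cmp(a));
    let top2_vec: Vec<_> = by_vec.into_iter().take(2).collect();

    // BinaryHeap approach: build a max-heap and pop the 2 largest.
    let mut heap = BinaryHeap::from(pairs);
    let mut top2_heap = Vec::new();
    for _ in 0..2 {
        if let Some(item) = heap.pop() {
            top2_heap.push(item);
        }
    }

    assert_eq!(top2_vec, top2_heap);
    assert_eq!(top2_vec, vec![(7, "def"), (5, "ghi")]);
}
```

For small inputs, sorting the `Vec` is typically faster in practice thanks to cache-friendly memory access, which matches the reviewer's expectation.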
> could this sort be equivalent to the code below?

I'd say rather no than yes. Back in the day, there were some really hard-to-reproduce tests that would fail non-deterministically. It was fixed in 30d142d, but now, trying to recall what exactly was there... I blame myself for not documenting that enough :D
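A small sketch of why the two sorts may differ (hypothetical trigrams, not whatlang's data): sorting by the full `(count, trigram)` tuple breaks ties on the trigram itself, so the result is deterministic regardless of `HashMap` iteration order, whereas sorting only by `Reverse(count)` leaves equal-count trigrams in input order, which can change from run to run.

```rust
fn main() {
    // Two equal-count trigrams; HashMap iteration could yield either order.
    let mut a = vec![(2u32, "abc"), (2, "xyz"), (5, "the")];
    let mut b = vec![(2u32, "xyz"), (2, "abc"), (5, "the")];

    // Sorting the full (count, trigram) tuple descending breaks the
    // count tie on the trigram, so both inputs converge to one order...
    a.sort_unstable_by(|x, y| y.cmp(x));
    b.sort_unstable_by(|x, y| y.cmp(x));
    assert_eq!(a, b);
    assert_eq!(a, vec![(5, "the"), (2, "xyz"), (2, "abc")]);

    // ...whereas `sort_unstable_by_key(|(count, _)| Reverse(*count))`
    // would leave "abc" and "xyz" in whatever order they arrived, so the
    // positions assigned afterwards could differ between runs.
}
```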
src/alphabets/latin.rs (Outdated)

```rust
(
    l,
    (lang_scores[l as usize] + common_score).saturating_sub(max_raw_score),
)
```

I find it clearer this way, what do you think?

```diff
-(
-    l,
-    (lang_scores[l as usize] + common_score).saturating_sub(max_raw_score),
-)
+let score = (lang_scores[l as usize] + common_score).saturating_sub(max_raw_score);
+(l, score)
```
yup
Force-pushed the branch from 7cd4ccb to 0918974.
@ManyTheFish Just to let you know, I haven't forgotten about this PR.
@ManyTheFish @Kerollmops
FYI: the optimization is released in 0.14.0.
Hey @greyblake! I'm pleased to see this PR merged. 😊 I'll probably come back with a new PR between the 2nd and the 5th of May if I have a bit of time to work on it. 🙂
Summary
Optimize the `alphabet_calculate_scores` function used during Latin language detection. Compute the score in two steps: first score each character by its occurrences in the text, then sum the character scores into language scores.
This avoids the nested loops that made the computational complexity quadratic.
For now, I didn't do anything on the trigrams part; its behavior is more complicated to understand. 😅
But I will probably try to optimize it in another PR.
Whatlang benchmarks
main branch
Commits
Replace sort_by
Use inverted mapping between char and Lang
Clamp score in normalization loop instead of creating intermediate vec
Increment a common score when a common char is found
Use binary search instead of iter find
Fix returned raw score
Use intermediate char score
Count Max score in char score Loop
Make lang score access O(1) when iterating over char scores