Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize latin languages detection #108

Merged

Conversation

ManyTheFish
Copy link
Contributor

@ManyTheFish ManyTheFish commented Mar 12, 2022

Summary

Optimize alphabet_calculate_scores function used during Latin Language detection.

Compute the score in two steps:

  • iterate over the text and scores character based on their frequency
  • then sum the character's scores in Language's scores

This avoids imbricated loops that make the compute complexity quadratic.

For now, I didn't do anything on the trigrams part, the behavior is more complicated to understand. 😅
But, I will probably try to optimize it in another PR.

Whatlang benchmarks

main branch

test bench_detect        ... bench:   8,120,987 ns/iter (+/- 119,807)
test bench_detect_script ... bench:     102,829 ns/iter (+/- 1,600)

test latin alphabet only ... bench:   3,533,305 ns/iter (+/- 102,068)

Commits

Replace sort_by

test bench_detect        ... bench:   7,984,651 ns/iter (+/- 82,686)
test bench_detect_script ... bench:      93,662 ns/iter (+/- 2,249)

test latin alphabet only ... bench:   3,542,116 ns/iter (+/- 45,714)

Use inverted mapping between char and Lang

test bench_detect        ... bench:   7,642,256 ns/iter (+/- 91,700)
test bench_detect_script ... bench:      94,344 ns/iter (+/- 2,501)

test latin alphabet only ... bench:   3,221,397 ns/iter (+/- 32,611)

Clamp score in normalization loop instead of creating intermediate vec

test bench_detect        ... bench:   7,594,489 ns/iter (+/- 95,791)
test bench_detect_script ... bench:      98,816 ns/iter (+/- 2,039)

test latin alphabet only ... bench:   3,333,266 ns/iter (+/- 29,036)

Increment a common score when a common char is found

test bench_detect        ... bench:   5,190,589 ns/iter (+/- 73,054)
test bench_detect_script ... bench:      98,731 ns/iter (+/- 1,395)

test latin alphabet only ... bench:     846,625 ns/iter (+/- 13,893)

Use binary search instead of iter find

test bench_detect        ... bench:   4,992,537 ns/iter (+/- 77,899)
test bench_detect_script ... bench:      99,535 ns/iter (+/- 5,190)

test latin alphabet only ... bench:     575,347 ns/iter (+/- 7,259)

Fix returned raw score

test bench_detect        ... bench:   4,999,493 ns/iter (+/- 69,303)
test bench_detect_script ... bench:      93,847 ns/iter (+/- 1,339)

test latin alphabet only ... bench:     580,022 ns/iter (+/- 7,469)

Use intermediate char score

test bench_detect        ... bench:   4,802,885 ns/iter (+/- 73,186)
test bench_detect_script ... bench:      93,860 ns/iter (+/- 2,912)

test latin alphabet only ... bench:     388,278 ns/iter (+/- 4,427)

Count Max score in char score Loop

test bench_detect        ... bench:   4,837,791 ns/iter (+/- 71,111)
test bench_detect_script ... bench:      94,130 ns/iter (+/- 715)

test latin alphabet only ... bench:     354,250 ns/iter (+/- 11,778)

Make lang score access O(1) when iterating over char scores

test bench_detect        ... bench:   4,769,341 ns/iter (+/- 78,060)
test bench_detect_script ... bench:      93,748 ns/iter (+/- 706)

test latin alphabet only ... bench:     308,811 ns/iter (+/- 10,769)


// score of each character.
let mut max_raw_score = 0;
let mut scores: Vec<_> = chars.iter().map(|_| 0).collect();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut scores: Vec<_> = chars.iter().map(|_| 0).collect();
let mut scores = vec![0; chars.count()];

@ManyTheFish ManyTheFish force-pushed the optimize-latin-language-detection branch 3 times, most recently from e46ff8f to e8a83b0 Compare March 13, 2022 06:45
@@ -43,7 +43,7 @@ pub fn alphabet_calculate_scores(text: &str) -> Outcome {
}
}

raw_scores.sort_by(|a, b| b.1.cmp(&a.1));
raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));
raw_scores.sort_unstable_by_key(|(lang, _)| lang);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's Reversed, no?

Suggested change
raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));
raw_scores.sort_unstable_by_key(|(_, score)| Reverse(score));

@@ -26,7 +26,7 @@ def load_alphabets
alphabets[code] = normalize_alphabet(alphabet)
end

alphabets.sort_by {|k, _| k }.to_h
alphabets.sort_unstable_by {|k, _| k }.to_h
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, this is a ruby script, not Rust.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aïe aïe aïe

@@ -35,7 +35,7 @@ pub fn alphabet_calculate_scores(text: &LowercaseText, filter_list: &FilterList)
}
}

raw_scores.sort_by(|a, b| b.1.cmp(&a.1));
raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by_key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));
raw_scores.sort_unstable_by_key(|(_, score)| Reverse(score));

Comment on lines +86 to +89
/// Inverted map binding a character to a set of languages.
pub static ALPHABET_LANG_MAP: Lazy<(Vec<char>, Vec<Vec<Lang>>)> = Lazy::new(|| {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep only one vec here? Is it slower?

Suggested change
/// Inverted map binding a character to a set of languages.
pub static ALPHABET_LANG_MAP: Lazy<(Vec<char>, Vec<Vec<Lang>>)> = Lazy::new(|| {
/// Inverted map binding a character to a set of languages.
pub static ALPHABET_LANG_MAP: Lazy<(Vec<(char, Vec<Lang>)>)> = Lazy::new(|| {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't use them at the same time.
chars is used in the first loop to compute char scores based on their occurrence in the text.
langs is used in the second loop to sum char scores into lang scores.
keeping them merged would just add noisy tuples with anonymous variables everywhere.

In terms of computing time, there are no significant changes.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a quite a while to understand why this data structure the way it is..
The binding through the position in the vectors is not very obvious. Would you mind adding a comment on that? (in a separate PR, this one I guess will be merged today)

@greyblake
Copy link
Owner

@ManyTheFish @Kerollmops Thank you guys! I do not promise to review this soon, because there is a lot of shit going on with my relatives and friends in Ukraine, and helping them and Ukraine is much higher priority for me at the moment.

@ManyTheFish ManyTheFish force-pushed the optimize-latin-language-detection branch from e8a83b0 to 7cd4ccb Compare March 14, 2022 09:20
src/scripts/detect.rs Outdated Show resolved Hide resolved
@@ -77,7 +77,7 @@ fn calculate_scores_in_profiles(
}

// Sort languages by distance
lang_distances.sort_by_key(|key| key.1);
lang_distances.sort_unstable_by_key(|key| key.1);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by_key ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! by_key, you want me to remove it? 😝

@@ -33,7 +33,7 @@ fn trigram_occurances_to_positions(
.into_iter()
.map(|(trigram, count)| (count, trigram))
.collect();
count_vec.sort_by(|a, b| b.cmp(a));
count_vec.sort_unstable_by(|a, b| b.cmp(a));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by_key ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one I don't know.
@greyblake does this sort could be equivalent to below code?

Suggested change
count_vec.sort_unstable_by(|a, b| b.cmp(a));
count_vec.sort_unstable_by_key(|(count, _trigram)| Reverse(*count));

or is the trigram sorting important?

we could rewrite a bit the function like

#[allow(clippy::unnecessary_sort_by)]
fn trigram_occurances_to_positions(
    trigram_occurances: HashMap<Trigram, u32>,
) -> HashMap<Trigram, u32> {
-   // Sort in descending order by number of occurrences and trigrams
+   // Sort in ascending order by number of occurrences and trigrams
    let mut count_vec: Vec<_> = trigram_occurances
        .into_iter()
        .map(|(trigram, count)| (count, trigram))
        .collect();
-   count_vec.sort_unstable_by(|a, b| b.cmp(a));
+   count_vec.sort_unstable();

    count_vec
        .into_iter()
+       .rev() // take starting from the highest count
        .take(TEXT_TRIGRAMS_SIZE) // we're interested only in the first 600 (2 * MAX_TRIGRAM_DISTANCE)
        .enumerate()
        .map(|(i, (_, trigram))| (trigram, i as u32))
        .collect()
}

I'll probably try, in another PR, to collect in a BinaryHeap instead of in a Vec to sort the HashMap 🤔

we may just keep the current implementation here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will surely be faster to use a vec and sort it than using a binary heap here, the amount of cmp will be highly reduced. I like the function rewrite you propose ✅

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greyblake does this sort could be equivalent to below code?

I'd say rather no than yes.
Back in the days, there were some really hard to reproduce tests that would non-deterministically fail.
It was fixed in 30d142d
but now, try to recall what was exactly there I blame myself for not documenting that enough :D

Comment on lines 159 to 162
(
l,
(lang_scores[l as usize] + common_score).saturating_sub(max_raw_score),
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it more clear this way, what do you think?

Suggested change
(
l,
(lang_scores[l as usize] + common_score).saturating_sub(max_raw_score),
)
let score = (lang_scores[l as usize] + common_score).saturating_sub(max_raw_score);
(l, score)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup

@ManyTheFish ManyTheFish force-pushed the optimize-latin-language-detection branch from 7cd4ccb to 0918974 Compare March 14, 2022 15:25
@greyblake
Copy link
Owner

greyblake commented Apr 5, 2022

@ManyTheFish Just to let you know, I haven't forgotten about this PR.

@greyblake
Copy link
Owner

@ManyTheFish @Kerollmops
Than you guys! This is very outstanding PR! I've learned something new today :)
Would you like to add further improvements? If you want, we can try to arrange a call, I can explain how trigrams work :)

@greyblake greyblake merged commit 5a7c7d2 into greyblake:master Apr 15, 2022
@greyblake
Copy link
Owner

FYI: the optimization is released in 0.14.0.

@ManyTheFish
Copy link
Contributor Author

ManyTheFish commented Apr 19, 2022

Hey @greyblake! I'm pleased to see this PR merged. 😊

I'll probably come back with a new PR between the 2 and the 5 of May if I have a bit of time to work on it. 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants