Optimize latin languages detection #108

ManyTheFish · 2022-03-12T16:17:12Z

Summary

Optimize alphabet_calculate_scores function used during Latin Language detection.

Compute the score in two steps:

iterate over the text and scores character based on their frequency
then sum the character's scores in Language's scores

This avoids imbricated loops that make the compute complexity quadratic.

For now, I didn't do anything on the trigrams part, the behavior is more complicated to understand. 😅
But, I will probably try to optimize it in another PR.

Whatlang benchmarks

main branch

test bench_detect        ... bench:   8,120,987 ns/iter (+/- 119,807)
test bench_detect_script ... bench:     102,829 ns/iter (+/- 1,600)

test latin alphabet only ... bench:   3,533,305 ns/iter (+/- 102,068)

Commits

Replace sort_by

test bench_detect        ... bench:   7,984,651 ns/iter (+/- 82,686)
test bench_detect_script ... bench:      93,662 ns/iter (+/- 2,249)

test latin alphabet only ... bench:   3,542,116 ns/iter (+/- 45,714)

Use inverted mapping between char and Lang

test bench_detect        ... bench:   7,642,256 ns/iter (+/- 91,700)
test bench_detect_script ... bench:      94,344 ns/iter (+/- 2,501)

test latin alphabet only ... bench:   3,221,397 ns/iter (+/- 32,611)

Clamp score in normalization loop instead of creating intermediate vec

test bench_detect        ... bench:   7,594,489 ns/iter (+/- 95,791)
test bench_detect_script ... bench:      98,816 ns/iter (+/- 2,039)

test latin alphabet only ... bench:   3,333,266 ns/iter (+/- 29,036)

Increment a common score when a common char is found

test bench_detect        ... bench:   5,190,589 ns/iter (+/- 73,054)
test bench_detect_script ... bench:      98,731 ns/iter (+/- 1,395)

test latin alphabet only ... bench:     846,625 ns/iter (+/- 13,893)

Use binary search instead of iter find

test bench_detect        ... bench:   4,992,537 ns/iter (+/- 77,899)
test bench_detect_script ... bench:      99,535 ns/iter (+/- 5,190)

test latin alphabet only ... bench:     575,347 ns/iter (+/- 7,259)

Fix returned raw score

test bench_detect        ... bench:   4,999,493 ns/iter (+/- 69,303)
test bench_detect_script ... bench:      93,847 ns/iter (+/- 1,339)

test latin alphabet only ... bench:     580,022 ns/iter (+/- 7,469)

Use intermediate char score

test bench_detect        ... bench:   4,802,885 ns/iter (+/- 73,186)
test bench_detect_script ... bench:      93,860 ns/iter (+/- 2,912)

test latin alphabet only ... bench:     388,278 ns/iter (+/- 4,427)

Count Max score in char score Loop

test bench_detect        ... bench:   4,837,791 ns/iter (+/- 71,111)
test bench_detect_script ... bench:      94,130 ns/iter (+/- 715)

test latin alphabet only ... bench:     354,250 ns/iter (+/- 11,778)

Make lang score access O(1) when iterating over char scores

test bench_detect        ... bench:   4,769,341 ns/iter (+/- 78,060)
test bench_detect_script ... bench:      93,748 ns/iter (+/- 706)

test latin alphabet only ... bench:     308,811 ns/iter (+/- 10,769)

Use intermediate char score instead of directly score Languages in order to reduce imbricated loop quadratic effect

Kerollmops · 2022-03-12T20:10:17Z

src/alphabets/latin.rs

+
+    // score of each character.
+    let mut max_raw_score = 0;
+    let mut scores: Vec<_> = chars.iter().map(|_| 0).collect();


Suggested change

let mut scores: Vec<_> = chars.iter().map(|_| 0).collect();

let mut scores = vec![0; chars.count()];

Kerollmops · 2022-03-13T11:30:07Z

misc/alphabets/calculate_scores.rs.erb

@@ -43,7 +43,7 @@ pub fn alphabet_calculate_scores(text: &str) -> Outcome {
        }
    }

-    raw_scores.sort_by(|a, b| b.1.cmp(&a.1));
+    raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));


Suggested change

raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));

raw_scores.sort_unstable_by_key(|(lang, _)| lang);

it's Reversed, no?

Suggested change

raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));

raw_scores.sort_unstable_by_key(|(_, score)| Reverse(score));

Kerollmops · 2022-03-13T11:30:56Z

misc/alphabets/latin_gen.rb

@@ -26,7 +26,7 @@ def load_alphabets
    alphabets[code] = normalize_alphabet(alphabet)
  end

-  alphabets.sort_by {|k, _| k }.to_h
+  alphabets.sort_unstable_by {|k, _| k }.to_h


Here, this is a ruby script, not Rust.

aïe aïe aïe

Kerollmops · 2022-03-13T11:31:10Z

src/alphabets/cyrillic.rs

@@ -35,7 +35,7 @@ pub fn alphabet_calculate_scores(text: &LowercaseText, filter_list: &FilterList)
        }
    }

-    raw_scores.sort_by(|a, b| b.1.cmp(&a.1));
+    raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));


Suggested change

raw_scores.sort_unstable_by(|a, b| b.1.cmp(&a.1));

raw_scores.sort_unstable_by_key(|(_, score)| Reverse(score));

Kerollmops · 2022-03-13T11:34:24Z

src/alphabets/latin.rs

+/// Inverted map binding a character to a set of languages.
+pub static ALPHABET_LANG_MAP: Lazy<(Vec<char>, Vec<Vec<Lang>>)> = Lazy::new(|| {


Why not keep only one vec here? Is it slower?

Suggested change

/// Inverted map binding a character to a set of languages.

pub static ALPHABET_LANG_MAP: Lazy<(Vec<char>, Vec<Vec<Lang>>)> = Lazy::new(|| {

/// Inverted map binding a character to a set of languages.

pub static ALPHABET_LANG_MAP: Lazy<(Vec<(char, Vec<Lang>)>)> = Lazy::new(|| {

I don't use them at the same time.
chars is used in the first loop to compute char scores based on their occurrence in the text.
langs is used in the second loop to sum char scores into lang scores.
keeping them merged would just add noisy tuples with anonymous variables everywhere.

In terms of computing time, there are no significant changes.

It took me a quite a while to understand why this data structure the way it is..
The binding through the position in the vectors is not very obvious. Would you mind adding a comment on that? (in a separate PR, this one I guess will be merged today)

greyblake · 2022-03-13T14:41:02Z

@ManyTheFish @Kerollmops Thank you guys! I do not promise to review this soon, because there is a lot of shit going on with my relatives and friends in Ukraine, and helping them and Ukraine is much higher priority for me at the moment.

src/scripts/detect.rs

Kerollmops · 2022-03-14T12:49:12Z

src/trigrams/detection.rs

@@ -77,7 +77,7 @@ fn calculate_scores_in_profiles(
    }

    // Sort languages by distance
-    lang_distances.sort_by_key(|key| key.1);
+    lang_distances.sort_unstable_by_key(|key| key.1);


yes! by_key, you want me to remove it? 😝

Kerollmops · 2022-03-14T12:49:20Z

src/trigrams/utils.rs

@@ -33,7 +33,7 @@ fn trigram_occurances_to_positions(
        .into_iter()
        .map(|(trigram, count)| (count, trigram))
        .collect();
-    count_vec.sort_by(|a, b| b.cmp(a));
+    count_vec.sort_unstable_by(|a, b| b.cmp(a));


this one I don't know.
@greyblake does this sort could be equivalent to below code?

Suggested change

count_vec.sort_unstable_by(|a, b| b.cmp(a));

count_vec.sort_unstable_by_key(|(count, _trigram)| Reverse(*count));

or is the trigram sorting important?

we could rewrite a bit the function like

#[allow(clippy::unnecessary_sort_by)] fn trigram_occurances_to_positions( trigram_occurances: HashMap<Trigram, u32>, ) -> HashMap<Trigram, u32> { - // Sort in descending order by number of occurrences and trigrams + // Sort in ascending order by number of occurrences and trigrams let mut count_vec: Vec<_> = trigram_occurances .into_iter() .map(|(trigram, count)| (count, trigram)) .collect(); - count_vec.sort_unstable_by(|a, b| b.cmp(a)); + count_vec.sort_unstable(); count_vec .into_iter() + .rev() // take starting from the highest count .take(TEXT_TRIGRAMS_SIZE) // we're interested only in the first 600 (2 * MAX_TRIGRAM_DISTANCE) .enumerate() .map(|(i, (_, trigram))| (trigram, i as u32)) .collect() }

I'll probably try, in another PR, to collect in a BinaryHeap instead of in a Vec to sort the HashMap 🤔

we may just keep the current implementation here.

It will surely be faster to use a vec and sort it than using a binary heap here, the amount of cmp will be highly reduced. I like the function rewrite you propose ✅

@greyblake does this sort could be equivalent to below code?

I'd say rather no than yes.
Back in the days, there were some really hard to reproduce tests that would non-deterministically fail.
It was fixed in 30d142d
but now, try to recall what was exactly there I blame myself for not documenting that enough :D

Kerollmops · 2022-03-14T12:53:10Z

src/alphabets/latin.rs

+            (
+                l,
+                (lang_scores[l as usize] + common_score).saturating_sub(max_raw_score),
+            )


I find it more clear this way, what do you think?

Suggested change

(

l,

(lang_scores[l as usize] + common_score).saturating_sub(max_raw_score),

)

let score = (lang_scores[l as usize] + common_score).saturating_sub(max_raw_score);

(l, score)

greyblake · 2022-04-05T10:31:35Z

@ManyTheFish Just to let you know, I haven't forgotten about this PR.

greyblake · 2022-04-15T15:34:29Z

@ManyTheFish @Kerollmops
Than you guys! This is very outstanding PR! I've learned something new today :)
Would you like to add further improvements? If you want, we can try to arrange a call, I can explain how trigrams work :)

greyblake · 2022-04-15T16:19:04Z

FYI: the optimization is released in 0.14.0.

ManyTheFish · 2022-04-19T08:25:53Z

Hey @greyblake! I'm pleased to see this PR merged. 😊

I'll probably come back with a new PR between the 2 and the 5 of May if I have a bit of time to work on it. 🙂

ManyTheFish added 9 commits March 10, 2022 15:36

Replace sort_by calls by sort_unstable_by

354b5bd

Use inverted mapping between char and Lang

b013058

Clamp score in normalization loop instead of creating intermediate vec

ab4b519

Increment a common score when a common char is found

cb92fa6

Use binary search instead of iter find

287dcb0

Fix returned raw score

3cfe1b2

Use intermediate char score

e88b302

Use intermediate char score instead of directly score Languages in order to reduce imbricated loop quadratic effect

Count Max score in char score loop

9be0014

Make lang score access O(1) when iterating over char scores

61503a2

Kerollmops suggested changes Mar 12, 2022

View reviewed changes

ManyTheFish force-pushed the optimize-latin-language-detection branch 3 times, most recently from e46ff8f to e8a83b0 Compare March 13, 2022 06:45

Kerollmops reviewed Mar 13, 2022

View reviewed changes

ManyTheFish force-pushed the optimize-latin-language-detection branch from e8a83b0 to 7cd4ccb Compare March 14, 2022 09:20

Kerollmops suggested changes Mar 14, 2022

View reviewed changes

Make some cleaning

0918974

ManyTheFish force-pushed the optimize-latin-language-detection branch from 7cd4ccb to 0918974 Compare March 14, 2022 15:25

Merge branch 'master' into optimize-latin-language-detection

3f4270e

greyblake merged commit 5a7c7d2 into greyblake:master Apr 15, 2022

This was referenced Apr 15, 2022

simplify alphabet_calculate_scores #109

Closed

Make Alphabet Score calculation generic over Script #111

Closed

greyblake mentioned this pull request May 1, 2022

Using generic function to compute alphabet score for Cyrillic and latin #115

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize latin languages detection #108

Optimize latin languages detection #108

ManyTheFish commented Mar 12, 2022 •

edited

Kerollmops Mar 12, 2022

Kerollmops Mar 13, 2022

ManyTheFish Mar 14, 2022

Kerollmops Mar 13, 2022 •

edited

ManyTheFish Mar 14, 2022

Kerollmops Mar 13, 2022

ManyTheFish Mar 14, 2022

Kerollmops Mar 13, 2022

ManyTheFish Mar 14, 2022 •

edited

greyblake Apr 15, 2022

greyblake commented Mar 13, 2022

Kerollmops Mar 14, 2022

ManyTheFish Mar 14, 2022

Kerollmops Mar 14, 2022

ManyTheFish Mar 14, 2022 •

edited

Kerollmops Mar 14, 2022

greyblake Apr 15, 2022

Kerollmops Mar 14, 2022

ManyTheFish Mar 14, 2022

greyblake commented Apr 5, 2022 •

edited

greyblake commented Apr 15, 2022

greyblake commented Apr 15, 2022

ManyTheFish commented Apr 19, 2022 •

edited

	let mut scores: Vec<_> = chars.iter().map(\|_\| 0).collect();
	let mut scores = vec![0; chars.count()];

	raw_scores.sort_unstable_by(\|a, b\| b.1.cmp(&a.1));
	raw_scores.sort_unstable_by_key(\|(lang, _)\| lang);

	raw_scores.sort_unstable_by(\|a, b\| b.1.cmp(&a.1));
	raw_scores.sort_unstable_by_key(\|(_, score)\| Reverse(score));

		/// Inverted map binding a character to a set of languages.
		pub static ALPHABET_LANG_MAP: Lazy<(Vec<char>, Vec<Vec<Lang>>)> = Lazy::new(\|\| {

	count_vec.sort_unstable_by(\|a, b\| b.cmp(a));
	count_vec.sort_unstable_by_key(\|(count, _trigram)\| Reverse(*count));

-            (
-                l,
-                (lang_scores[l as usize] + common_score).saturating_sub(max_raw_score),
-            )
+            let score = (lang_scores[l as usize] + common_score).saturating_sub(max_raw_score);
+            (l, score)

Optimize latin languages detection #108

Optimize latin languages detection #108

Conversation

ManyTheFish commented Mar 12, 2022 • edited

Summary

Whatlang benchmarks

main branch

Commits

Replace sort_by

Use inverted mapping between char and Lang

Clamp score in normalization loop instead of creating intermediate vec

Increment a common score when a common char is found

Use binary search instead of iter find

Fix returned raw score

Use intermediate char score

Count Max score in char score Loop

Make lang score access O(1) when iterating over char scores

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Kerollmops Mar 13, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ManyTheFish Mar 14, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

greyblake commented Mar 13, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ManyTheFish Mar 14, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

greyblake commented Apr 5, 2022 • edited

greyblake commented Apr 15, 2022

greyblake commented Apr 15, 2022

ManyTheFish commented Apr 19, 2022 • edited

ManyTheFish commented Mar 12, 2022 •

edited

Kerollmops Mar 13, 2022 •

edited

ManyTheFish Mar 14, 2022 •

edited

ManyTheFish Mar 14, 2022 •

edited

greyblake commented Apr 5, 2022 •

edited

ManyTheFish commented Apr 19, 2022 •

edited