New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize latin languages detection #108
Changes from all commits
354b5bd
b013058
ab4b519
cb92fa6
287dcb0
3cfe1b2
e88b302
9be0014
61503a2
0918974
3f4270e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -35,7 +35,7 @@ fn trigram_occurances_to_positions( | |||||
.into_iter() | ||||||
.map(|(trigram, count)| (count, trigram)) | ||||||
.collect(); | ||||||
count_vec.sort_by(|a, b| b.cmp(a)); | ||||||
count_vec.sort_unstable_by(|a, b| b.cmp(a)); | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this one I don't know.
Suggested change
or is the trigram sorting important? we could rewrite a bit the function like #[allow(clippy::unnecessary_sort_by)]
fn trigram_occurances_to_positions(
trigram_occurances: HashMap<Trigram, u32>,
) -> HashMap<Trigram, u32> {
- // Sort in descending order by number of occurrences and trigrams
+ // Sort in ascending order by number of occurrences and trigrams
let mut count_vec: Vec<_> = trigram_occurances
.into_iter()
.map(|(trigram, count)| (count, trigram))
.collect();
- count_vec.sort_unstable_by(|a, b| b.cmp(a));
+ count_vec.sort_unstable();
count_vec
.into_iter()
+ .rev() // take starting from the highest count
.take(TEXT_TRIGRAMS_SIZE) // we're interested only in the first 600 (2 * MAX_TRIGRAM_DISTANCE)
.enumerate()
.map(|(i, (_, trigram))| (trigram, i as u32))
.collect()
} I'll probably try, in another PR, to collect in a we may just keep the current implementation here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It will surely be faster to use a vec and sort it than using a binary heap here, the amount of cmp will be highly reduced. I like the function rewrite you propose ✅ There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I'd say rather no than yes. |
||||||
|
||||||
count_vec | ||||||
.into_iter() | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not keep only one vec here? Is it slower?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't use them at the same time.
chars
is used in the first loop to compute char scores based on their occurrence in the text.langs
is used in the second loop to sum char scores into lang scores.keeping them merged would just add noisy tuples with anonymous variables everywhere.
In terms of computing time, there are no significant changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a quite a while to understand why this data structure the way it is..
The binding through the position in the vectors is not very obvious. Would you mind adding a comment on that? (in a separate PR, this one I guess will be merged today)