-
-
Notifications
You must be signed in to change notification settings - Fork 357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Match synonyms and xx: entries when computing taxonomy suggestions #8190
Conversation
@@ -2,6 +2,7 @@ | |||
"errors" : [], | |||
"status" : "success", | |||
"suggestions" : [ | |||
"Mint-flavoured syrup with sugar diluted in water", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entry is because we have a weird CIQUAL category that applies to both strawberry and mint flavoured syrups... To be fixed in the taxonomy.
@@ -2,6 +2,7 @@ | |||
"errors" : [], | |||
"status" : "success", | |||
"suggestions" : [ | |||
"Mint-flavoured syrup with sugar diluted in water", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entry is because we have a weird CIQUAL category that applies to both strawberry and mint flavoured syrups... To be fixed in the taxonomy.
@@ -1,4 +1,5 @@ | |||
[ | |||
"Mint-flavoured syrup with sugar diluted in water", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entry is because we have a weird CIQUAL category that applies to both strawberry and mint flavoured syrups... To be fixed in the taxonomy.
@@ -8,7 +8,8 @@ | |||
"Recycle in paper bin", | |||
"Recycle with drink cartons", | |||
"Recycle with plastics", | |||
"Recycle with plastics - metal and bricks" | |||
"Recycle with plastics - metal and bricks", | |||
"Discard" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extra "Discard" entry is added because the input "recy" partially matches the synonym "non-recyclable".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool ! That will help a lot.
I added some remarks.
lib/ProductOpener/Ecoscore.pm
Outdated
my $target_material | ||
= canonicalize_taxonomy_tag("en", "packaging_materials", $assignment_ref->{target_material}); | ||
my $source_material | ||
= canonicalize_taxonomy_tag("en", "packaging_materials", $assignment_ref->{source_material}); | ||
|
||
if (not exists_taxonomy_tag("packaging_materials", $target_material)) { | ||
die( "target_material " | ||
. $assignment_ref->{target_material} | ||
. " does not exist in the packaging_materials taxonomy"); | ||
} | ||
if (not exists_taxonomy_tag("packaging_materials", $source_material)) { | ||
die( "source_material " | ||
. $assignment_ref->{source_material} | ||
. " does not exist in the packaging_materials taxonomy"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a small sub to canonicalize and verify existence, would improve readability a lot. (usable also below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored canonicalize_taxonomy_tag so that it can indicate if the returned entry exists in the taxonomy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it still to be pushed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I'm still addressing the other suggestions
lib/ProductOpener/Ecoscore.pm
Outdated
$ecoscore_data{packaging_materials}{$target} = $ecoscore_data{packaging_materials}{$source}; | ||
$properties{packaging_materials}{$target}{"ecoscore_score:en"} | ||
= $ecoscore_data{packaging_materials}{$source}{"score"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shan't we verify that $ecoscore_data{packaging_materials}{$source}
is defined ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I can add that.
# fuzzy match | ||
elsif ($best_match eq "fuzzy") { | ||
push @suggestions_f, $tag; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are at it, don't you think it would be a good opportunity to add a memoize on this function ? (and with perf problem we have at this moment, it could help) :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not. In the worse case scenario (huge taxonomy like categories, and a text query that does not match anything so we have to go through everything, it takes about 0.5 seconds in prod). I'm not sure we will gain much on smaller taxonomies, but we only have a bit of space to lose.
Another option is to use memcached (so that the cache is shared between processes), but then it might be tricky when we update things and get old cache results even after restarting Apache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, I didn't remember we where using multiple processes (and quite a lot) so memcached would really be needed. Only thing is that I didn't see ready to use memoize using memcached with a RLU strategy (MRU would be even better).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I chose to use memcached, I'm refactoring some stuff around it.
lc => "en", | ||
string => "yog", | ||
expected => ['Yogurts', 'Banana yogurts'], | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add some more tests ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few more tests
Co-authored-by: Alex Garel <alex@garel.org>
Codecov Report
@@ Coverage Diff @@
## main #8190 +/- ##
==========================================
- Coverage 47.06% 46.09% -0.97%
==========================================
Files 104 105 +1
Lines 20449 20523 +74
Branches 4650 4668 +18
==========================================
- Hits 9624 9460 -164
- Misses 9678 9936 +258
+ Partials 1147 1127 -20
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@alexgarel the PR is ready for review again, I added some caching, did a bit of refactoring and added more tests |
@@ -1797,7 +1797,7 @@ Build all taxonomies | |||
=cut | |||
|
|||
sub build_all_taxonomies ($publish) { | |||
foreach my $taxonomy (@taxonomy_fields) { | |||
foreach my $taxonomy (@taxonomy_fields, "test") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -97,17 +98,56 @@ Hash of fields that can be taken into account to generate relevant suggestions | |||
- categories: comma separated list of categories (tags ids or strings in the $search_lc language) | |||
- shape: packaging shape (tag id or string in the $search_lc language) | |||
|
|||
=head3 Note | |||
|
|||
The results of this function are cached using memcached. Restart memcached if you want fresh results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we also add a time to live to our cached entries ? 1 day seems enough.
So right now as we use it for mongodb we might have old entries served ?
Memcached has built-in support for that in set you can add an expiration time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really cool.
Two comments though:
- I think it would be better to integrate the key prefix as a parameter to the cache key function (more explicit for future usage)
- For refactor I did propose, I was seeing it at a higher level (but it's ok if you keep things as is)
lib/ProductOpener/Cache.pm
Outdated
my $key = $server_domain . "/" . $json->encode($context_ref); | ||
my $md5_key = md5_hex($key); | ||
$log->debug("generate_cache_key", {context_ref => $context_ref, key => $key, md5_key => $md5_key}) | ||
if $log->is_debug(); | ||
return $md5_key; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a bit dangerous. Collision may happens.
They are not maybe a big issue if we at least prefix the key by - a domain + a function.
So I would associate each domain with a single character, same for each function, and then encode the rest of $context_ref with md5.
I'm not sure for the domain prefix for there might be too much case to handler, don't we have a variable saying the "product-family" ?
SERVER_DOMAINS_PREFIX = {"openfoodfacts.org" => 'a', …}
FN_PREFIX = {"suggestion_cache" => 'a', "mongo_query" => 'b'}
sub generace_cache_key($fn, $context_ref) {
$key = SERVER_DOMAINS_PREFIX->{$server_domain} . FN_PREFIX->{$fn};
...
$key .= $md5_key
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see below that you did it indeed, by adding a prefix… but in this case, let's add the prefix as a parameter to this function, it will be more clear.
And ok then for the domain in the hash (the most important thing is to have a "result type" prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I made the key server domain + result type + hash of context, directly in the generate_cache_key function
lib/ProductOpener/Ecoscore.pm
Outdated
my $target_material_id_exists_in_taxonomy; | ||
my $target_material = canonicalize_taxonomy_tag( | ||
"en", "packaging_materials", | ||
$assignment_ref->{target_material}, | ||
\$target_material_id_exists_in_taxonomy | ||
); | ||
|
||
my $source_material_id_exists_in_taxonomy; | ||
my $source_material = canonicalize_taxonomy_tag( | ||
"en", "packaging_materials", | ||
$assignment_ref->{source_material}, | ||
\$source_material_id_exists_in_taxonomy | ||
); | ||
|
||
if (not $target_material_id_exists_in_taxonomy) { | ||
die( "target_material " | ||
. $assignment_ref->{target_material} | ||
. " does not exist in the packaging_materials taxonomy"); | ||
} | ||
if (not $source_material_id_exists_in_taxonomy) { | ||
die( "source_material " | ||
. $assignment_ref->{source_material} | ||
. " does not exist in the packaging_materials taxonomy"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the exists in taxonomy parameter make sense, I don't think you get my point with the refactoring. For me it was more that you are doing 4 times the same thing…
I would have reduced that to:
my $target_material = get_canonical_or_die($assignement_ref, "target_material", "packaging_material");
my $source_material = get_canonical_or_die($assignement_ref, "source_material", "packaging_material");
my $target_shape = get_canonical_or_die($assignement_ref, "target_shape", "packaging_shapes");
my $source_shape = get_canonical_or_die($assignement_ref, "source_shape", "packaging_shapes");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I added a function like that. I did the other because there are many other places in the code where we canonicalize a tag and then check its existence in the taxonomy (e.g. for all the parsing of ingredients lists etc.)
Kudos, SonarCloud Quality Gate passed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect !
Previously, taxonomy suggestions were only returned when the canonical taxonomy entry matched the input string from the user. Now we also try to match the input string to synonyms of the entry, including in the xx: wildcard language.
Also added a unit test to ease debugging.