feat: Match synonyms and xx: entries when computing taxonomy suggestions #8190

Merged: 13 commits, Mar 23, 2023

stephanegigandet (Contributor):

Previously, taxonomy suggestions were only returned when the canonical taxonomy entry matched the input string from the user. Now we also try to match the input string to synonyms of the entry, including in the xx: wildcard language.

Also added a unit test to ease debugging.
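
To illustrate the change described above, here is a minimal sketch of matching an input string against both canonical names and synonyms, in the requested language and in the xx: wildcard language. The %taxonomy data and the suggest_tags function are hypothetical stand-ins for illustration, not the code of this PR.

```perl
use strict;
use warnings;

# Hypothetical miniature taxonomy: each entry maps a language code to a list of
# names; the first name is the canonical one, the rest are synonyms.
my %taxonomy = (
    "en:yogurts" => {en => ["Yogurts", "Yoghurts"], fr => ["Yaourts"]},
    "en:discard" => {en => ["Discard", "Non-recyclable"]},
    "en:pet-1"   => {xx => ["PET 1", "Polyethylene terephthalate"]},
);

# Return the tag ids whose canonical name or synonyms (in $lc or in xx)
# contain $input, case-insensitively.
sub suggest_tags {
    my ($lc, $input) = @_;
    my $needle = lc($input);
    my @matches;
    foreach my $tagid (sort keys %taxonomy) {
        # Collect the names for the requested language, then the xx names
        my @names = map { @{$taxonomy{$tagid}{$_} // []} } ($lc, "xx");
        if (grep { index(lc($_), $needle) >= 0 } @names) {
            push @matches, $tagid;
        }
    }
    return @matches;
}

print join(", ", suggest_tags("en", "recy")), "\n";    # en:discard
print join(", ", suggest_tags("fr", "pet")),  "\n";    # en:pet-1
```

In this sketch, "recy" finds en:discard only through the "Non-recyclable" synonym, and a French query finds en:pet-1 only through its xx: names.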

stephanegigandet requested a review from a team as a code owner (March 13, 2023 14:13)
github-actions bot added the ✏️ Editing - Auto Suggest, 📦 Packaging, 🧬 Taxonomies and 🧪 tests labels (Mar 13, 2023)
@@ -2,6 +2,7 @@
"errors" : [],
"status" : "success",
"suggestions" : [
"Mint-flavoured syrup with sugar diluted in water",
Contributor Author:

This entry is because we have a weird CIQUAL category that applies to both strawberry and mint flavoured syrups... To be fixed in the taxonomy.

@@ -1,4 +1,5 @@
[
"Mint-flavoured syrup with sugar diluted in water",
Contributor Author:

This entry is because we have a weird CIQUAL category that applies to both strawberry and mint flavoured syrups... To be fixed in the taxonomy.

@@ -8,7 +8,8 @@
 "Recycle in paper bin",
 "Recycle with drink cartons",
 "Recycle with plastics",
-"Recycle with plastics - metal and bricks"
+"Recycle with plastics - metal and bricks",
+"Discard"
Contributor Author:

The extra "Discard" entry is added because the input "recy" partially matches the synonym "non-recyclable".

github-actions bot added the 🌱 Eco-Score label (Mar 13, 2023)
alexgarel (Member) left a comment:

Cool ! That will help a lot.

I added some remarks.

Comment on lines 506 to 520
my $target_material
    = canonicalize_taxonomy_tag("en", "packaging_materials", $assignment_ref->{target_material});
my $source_material
    = canonicalize_taxonomy_tag("en", "packaging_materials", $assignment_ref->{source_material});

if (not exists_taxonomy_tag("packaging_materials", $target_material)) {
    die("target_material "
          . $assignment_ref->{target_material}
          . " does not exist in the packaging_materials taxonomy");
}
if (not exists_taxonomy_tag("packaging_materials", $source_material)) {
    die("source_material "
          . $assignment_ref->{source_material}
          . " does not exist in the packaging_materials taxonomy");
}
Member:

A small sub that canonicalizes the tag and verifies its existence would improve readability a lot (it could also be used below).

Contributor Author:

I refactored canonicalize_taxonomy_tag so that it can indicate if the returned entry exists in the taxonomy.

Member:

cool :-)

Member:

Is it still to be pushed?

Contributor Author:

yes, I'm still addressing the other suggestions

lib/ProductOpener/Ecoscore.pm
Comment on lines 544 to 546
$ecoscore_data{packaging_materials}{$target} = $ecoscore_data{packaging_materials}{$source};
$properties{packaging_materials}{$target}{"ecoscore_score:en"}
= $ecoscore_data{packaging_materials}{$source}{"score"};
Member:

Shouldn't we verify that $ecoscore_data{packaging_materials}{$source} is defined?

Contributor Author:

Sure I can add that.
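
The check requested above could look like the following minimal sketch. The data, the copy_material_score helper and the error message are hypothetical; only the two hash structures and the "score" / "ecoscore_score:en" keys come from the diff.

```perl
use strict;
use warnings;

# Hypothetical data standing in for the real %ecoscore_data and %properties.
my %ecoscore_data = (packaging_materials => {"en:plastic" => {score => 0}});
my %properties;

# Copy the Eco-Score data of $source to $target, refusing a missing source.
sub copy_material_score {
    my ($source, $target) = @_;
    my $source_ref = $ecoscore_data{packaging_materials}{$source};
    if (not defined $source_ref) {
        die("source_material $source has no Eco-Score data");
    }
    # Same two assignments as in the diff, now guarded by the check above
    $ecoscore_data{packaging_materials}{$target} = $source_ref;
    $properties{packaging_materials}{$target}{"ecoscore_score:en"} = $source_ref->{score};
    return;
}

copy_material_score("en:plastic", "en:pet-1");
eval { copy_material_score("en:unknown", "en:pet-1"); 1 } or print "error: $@";
```

Without the guard, a missing source would silently assign undef to the target's score.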

# fuzzy match
elsif ($best_match eq "fuzzy") {
push @suggestions_f, $tag;
}
}
}
Member:

While we are at it, don't you think this would be a good opportunity to memoize this function? (Given the performance problems we have at the moment, it could help.) :-)

Contributor Author:

Why not. In the worst-case scenario (a huge taxonomy like categories, and a text query that does not match anything, so we have to go through everything), it takes about 0.5 seconds in prod. I'm not sure we will gain much on smaller taxonomies, but we only have a bit of space to lose.

Another option is to use memcached (so that the cache is shared between processes), but then it might be tricky when we update things and get old cache results even after restarting Apache.

Member:

Oh yes, I had forgotten that we were using multiple processes (and quite a lot of them), so memcached would really be needed. The only thing is that I didn't see a ready-to-use memoize backed by memcached with an LRU strategy (MRU would be even better).

Contributor Author:

Right, I chose to use memcached, I'm refactoring some stuff around it.
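
For comparison, the per-process memoization first suggested in this thread can be sketched with the core Memoize module. The slow_suggestions function is a hypothetical stand-in for the real suggestion search. As noted above, such a cache lives inside a single process, which is why a shared memcached cache was preferred here.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;

# Hypothetical stand-in for the expensive suggestion search.
sub slow_suggestions {
    my ($tagtype, $lc, $string) = @_;
    $calls++;
    return join(",", map { "$tagtype:$lc:$string$_" } 1 .. 3);
}

# Cache results in this process only: each worker process would keep its own copy.
memoize('slow_suggestions');

for (1 .. 3) {
    my $result = slow_suggestions("categories", "en", "yog");
}
print "function body ran $calls time(s)\n";
```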

lc => "en",
string => "yog",
expected => ['Yogurts', 'Banana yogurts'],
},
Member:

Can we add some more tests?

Contributor Author:

I added a few more tests

Co-authored-by: Alex Garel <alex@garel.org>
codecov-commenter commented Mar 21, 2023

Codecov Report

Merging #8190 (37b690f) into main (b989be3) will decrease coverage by 0.97%.
The diff coverage is 75.37%.

@@            Coverage Diff             @@
##             main    #8190      +/-   ##
==========================================
- Coverage   47.06%   46.09%   -0.97%     
==========================================
  Files         104      105       +1     
  Lines       20449    20523      +74     
  Branches     4650     4668      +18     
==========================================
- Hits         9624     9460     -164     
- Misses       9678     9936     +258     
+ Partials     1147     1127      -20     
Impacted Files                              Coverage Δ
lib/ProductOpener/Index.pm                  50.00% <ø> (-1.29%) ⬇️
lib/ProductOpener/Orgs.pm                   14.56% <ø> (-0.83%) ⬇️
lib/ProductOpener/Users.pm                  5.22% <ø> (-0.24%) ⬇️
lib/ProductOpener/Display.pm                4.67% <20.00%> (-0.02%) ⬇️
lib/ProductOpener/Tags.pm                   37.87% <72.72%> (-16.21%) ⬇️
lib/ProductOpener/Ecoscore.pm               75.17% <73.91%> (-0.76%) ⬇️
lib/ProductOpener/TaxonomySuggestions.pm    54.92% <77.50%> (+42.32%) ⬆️
tests/unit/taxonomy_suggestions.t           78.94% <78.94%> (ø)
lib/ProductOpener/Cache.pm                  94.11% <87.50%> (-5.89%) ⬇️
tests/unit/tags.t                           88.97% <100.00%> (+0.22%) ⬆️


github-actions bot added the Display, Orgs, 🏭 Producers Platform, Tags, 🧪 unit tests and 👥 Users labels (Mar 22, 2023)

stephanegigandet (Contributor Author):

@alexgarel the PR is ready for review again, I added some caching, did a bit of refactoring and added more tests

@@ -1797,7 +1797,7 @@ Build all taxonomies
=cut

sub build_all_taxonomies ($publish) {
-	foreach my $taxonomy (@taxonomy_fields) {
+	foreach my $taxonomy (@taxonomy_fields, "test") {
Member:

👍

@@ -97,17 +98,56 @@ Hash of fields that can be taken into account to generate relevant suggestions
- categories: comma separated list of categories (tags ids or strings in the $search_lc language)
- shape: packaging shape (tag id or string in the $search_lc language)

=head3 Note

The results of this function are cached using memcached. Restart memcached if you want fresh results
Member:

Shouldn't we also add a time to live to our cached entries? 1 day seems enough.

So right now, as we use the cache for MongoDB queries, we might have old entries served?

Memcached has built-in support for that: in set, you can add an expiration time.
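
The time-to-live idea can be sketched as follows. To stay runnable without a server, this uses a hypothetical in-process hash instead of memcached; cache_set and cache_get are illustrative names, and the $now parameter exists only to make expiry testable. With the Cache::Memcached client, the same effect is obtained by passing an expiration time as the optional third argument of set.

```perl
use strict;
use warnings;

# In-process stand-in for memcached, so that the sketch runs without a server.
my %cache;

# Store $value under $key for $ttl seconds. $now can be injected for testing.
sub cache_set {
    my ($key, $value, $ttl, $now) = @_;
    $now //= time();
    $cache{$key} = {value => $value, expires => $now + $ttl};
    return;
}

# Return the cached value, or undef if it is missing or has expired.
sub cache_get {
    my ($key, $now) = @_;
    $now //= time();
    my $entry = $cache{$key} or return;
    if ($now >= $entry->{expires}) {
        delete $cache{$key};
        return;
    }
    return $entry->{value};
}

my $one_day = 24 * 60 * 60;
cache_set("suggestions/yog", ["Yogurts"], $one_day, 1000);
print defined(cache_get("suggestions/yog", 1000 + 60))       ? "hit\n" : "miss\n";
print defined(cache_get("suggestions/yog", 1000 + $one_day)) ? "hit\n" : "miss\n";
```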

alexgarel (Member) left a comment:

Really cool.

Two comments though:

  • I think it would be better to pass the key prefix as a parameter to the cache key function (more explicit for future usage).
  • For the refactoring I proposed, I had something higher-level in mind (but it's OK if you keep things as they are).

Comment on lines 77 to 81
my $key = $server_domain . "/" . $json->encode($context_ref);
my $md5_key = md5_hex($key);
$log->debug("generate_cache_key", {context_ref => $context_ref, key => $key, md5_key => $md5_key})
if $log->is_debug();
return $md5_key;
Member:

I find it a bit dangerous: collisions may happen.
They may not be a big issue if we at least prefix the key with a domain and a function.

So I would associate each domain with a single character, same for each function, and then encode the rest of $context_ref with md5.

I'm not sure about the domain prefix, as there might be too many cases to handle. Don't we have a variable giving the "product family"?

use constant SERVER_DOMAINS_PREFIX => {"openfoodfacts.org" => 'a', …};

use constant FN_PREFIX => {"suggestion_cache" => 'a', "mongo_query" => 'b'};

sub generate_cache_key ($fn, $context_ref) {
   my $key = SERVER_DOMAINS_PREFIX->{$server_domain} . FN_PREFIX->{$fn};
   ...
   $key .= $md5_key;
   return $key;
}

Member:

I see below that you did indeed do it, by adding a prefix… but in that case, let's add the prefix as a parameter to this function; it will be clearer.

And OK then for the domain in the hash (the most important thing is to have a "result type" prefix).

Contributor Author:

ok, I made the key server domain + result type + hash of context, directly in the generate_cache_key function
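
A key of the shape described above (server domain + result type + hash of context) could be built as in this minimal sketch. The separator characters, the fixed $server_domain value and the result type names are assumptions for illustration; the merged generate_cache_key may differ.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use JSON::PP;

# Assumption: a fixed domain for the sketch; the real code takes it from config.
my $server_domain = "openfoodfacts.org";

# canonical() sorts hash keys, so equal contexts always serialize identically.
my $json = JSON::PP->new->canonical->utf8;

# Build a key of the form "<domain>:<result type>/<md5 of the context>".
sub generate_cache_key {
    my ($name, $context_ref) = @_;
    my $context_json = $json->encode($context_ref);
    return $server_domain . ":" . $name . "/" . md5_hex($context_json);
}

my $key_a = generate_cache_key("taxonomy_suggestions", {lc => "en", string => "yog"});
my $key_b = generate_cache_key("mongo_query", {lc => "en", string => "yog"});
print "$key_a\n$key_b\n";
```

Keeping the result type outside the hash means two functions that happen to receive the same context can never read each other's cached values.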

Comment on lines 511 to 534
my $target_material_id_exists_in_taxonomy;
my $target_material = canonicalize_taxonomy_tag(
    "en", "packaging_materials",
    $assignment_ref->{target_material},
    \$target_material_id_exists_in_taxonomy
);

my $source_material_id_exists_in_taxonomy;
my $source_material = canonicalize_taxonomy_tag(
    "en", "packaging_materials",
    $assignment_ref->{source_material},
    \$source_material_id_exists_in_taxonomy
);

if (not $target_material_id_exists_in_taxonomy) {
    die("target_material "
          . $assignment_ref->{target_material}
          . " does not exist in the packaging_materials taxonomy");
}
if (not $source_material_id_exists_in_taxonomy) {
    die("source_material "
          . $assignment_ref->{source_material}
          . " does not exist in the packaging_materials taxonomy");
}
Member:

Although the "exists in taxonomy" parameter makes sense, I don't think you got my point about the refactoring. For me, it was more that you are doing the same thing 4 times…

I would have reduced that to:

  my $target_material = get_canonical_or_die($assignment_ref, "target_material", "packaging_materials");
  my $source_material = get_canonical_or_die($assignment_ref, "source_material", "packaging_materials");
  my $target_shape = get_canonical_or_die($assignment_ref, "target_shape", "packaging_shapes");
  my $source_shape = get_canonical_or_die($assignment_ref, "source_shape", "packaging_shapes");

Contributor Author:

Sounds good, I added a function like that. I made the other change because there are many other places in the code where we canonicalize a tag and then check its existence in the taxonomy (e.g. for all the parsing of ingredients lists etc.).
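
A helper along the lines suggested in this thread might look like the sketch below. The %known_tags data and the body of canonicalize_taxonomy_tag are hypothetical stubs that only mimic the four-argument form shown in the diff; get_canonical_or_die is written from the reviewer's call sites and is not the merged implementation.

```perl
use strict;
use warnings;

# Hypothetical stub data: the real check goes through the built taxonomies.
my %known_tags = (
    packaging_materials => {"en:plastic" => 1, "en:glass" => 1},
    packaging_shapes    => {"en:bottle"  => 1},
);

# Stub with the four-argument form shown in the diff: returns the canonical id
# and reports through $exists_ref whether that id is in the taxonomy.
sub canonicalize_taxonomy_tag {
    my ($lc, $tagtype, $tag, $exists_ref) = @_;
    (my $slug = lc($tag)) =~ s/[^a-z0-9]+/-/g;
    my $tagid = "$lc:$slug";
    $$exists_ref = (exists $known_tags{$tagtype}{$tagid}) ? 1 : 0;
    return $tagid;
}

# Canonicalize $assignment_ref->{$field} in the $tagtype taxonomy, or die.
sub get_canonical_or_die {
    my ($assignment_ref, $field, $tagtype) = @_;
    my $exists;
    my $tagid = canonicalize_taxonomy_tag("en", $tagtype, $assignment_ref->{$field}, \$exists);
    if (not $exists) {
        die("$field $assignment_ref->{$field} does not exist in the $tagtype taxonomy");
    }
    return $tagid;
}

my $assignment_ref = {target_material => "Plastic", target_shape => "Bottle"};
my $target_material = get_canonical_or_die($assignment_ref, "target_material", "packaging_materials");
my $target_shape = get_canonical_or_die($assignment_ref, "target_shape", "packaging_shapes");
print "$target_material $target_shape\n";
```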

sonarcloud bot commented Mar 23, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No Coverage information
No Duplication information

alexgarel (Member) left a comment:

Perfect !

alexgarel merged commit e1304de into main on Mar 23, 2023
alexgarel deleted the packaging_materials_suggestions branch on March 23, 2023 13:39
Labels
Display, 🌱 Eco-Score, ✏️ Editing - Auto Suggest, Orgs, 📦 Packaging, 🏭 Producers Platform, Tags, 🧬 Taxonomies, 🧪 tests, 🧪 unit tests, 👥 Users
3 participants