ingredients_text is truncated in mongodb dump #7244

Open
alexgarel opened this issue Aug 24, 2022 · 11 comments
Labels
🐛 bug This is a bug, not a feature request. Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data MongoDB We have 2 mongodb collections: one for current products, and one for obsolete products ⏰ Stale This issue hasn't seen activity in a while. You can try documenting more to unblock it.

Comments

@alexgarel
Member

alexgarel commented Aug 24, 2022

Describe the bug

As reported by Kristina on Slack.

Some items in the jsonl and bson exports of the MongoDB database have a truncated "ingredients_text" field, whereas the API returns the full text, as does MongoDB itself.

Those are not new items.

To Reproduce

On mongo:

 db.products.find({"_id": "1340951640901"}, {"ingredients_text":1})
{ "_id" : "1340951640901", "ingredients_text" : "Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and/or canola), paprika, xanthan gum, propylene glycol alginate, citric acid." }

whereas in the json:

{"_id":"01340951640901", ..., "ingredients_text":"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and"

Additional examples:

00637480006835
00016571950293
00016571910303
07203671232

Expected behavior

"ingredients_text" should be complete.

@alexgarel alexgarel added the 🐛 bug This is a bug, not a feature request. label Aug 24, 2022
@alexgarel
Member Author

I took time to investigate this morning, but searching the internet did not turn up any clue.

I read the mongodump and mongoexport docs looking for a relevant config parameter… nothing.

The only limit I can find is on document size, which must stay under 16 MB, and:

> Object.bsonsize(db.products.find({"_id": "1340951640901"}))
78782

Our object is about 78 kB, so we are far from this limit.

Another limit applies to indexed fields (1024 bytes), but it is normally removed in our version of MongoDB, and our strings are truncated well before that (they are plain ASCII, so the UTF-8 encoding is no longer than the character count).
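The ASCII remark can be checked directly: for pure ASCII text, the UTF-8 byte count equals the character count, so a string cut after a few hundred characters cannot have reached a 1024-byte limit. Below is a minimal Python sketch. `utf8_len` and `could_hit_index_limit` are hypothetical helpers written for illustration, and the 1024 figure is simply the one quoted above.

```python
INDEX_KEY_LIMIT_BYTES = 1024  # figure quoted above, used here as an assumption


def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def could_hit_index_limit(text: str) -> bool:
    """True only if the UTF-8 encoding reaches the quoted index limit."""
    return utf8_len(text) >= INDEX_KEY_LIMIT_BYTES


# Tail of the truncated value from the bug report: pure ASCII, so the
# character count and the byte count are identical.
sample = "vegetable oil (soybean and"
print(sample.isascii(), len(sample) == utf8_len(sample))
```

A non-ASCII character such as "é" takes two bytes in UTF-8, which is the only case where the byte length would exceed the character count.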

@alexgarel
Member Author

I ran a test retrieving 1340951640901 from a test mongo instance where @CharlesNepote imported data from the dump some months ago… and the string is not truncated!

@alexgarel
Member Author

alexgarel commented Aug 24, 2022

I am continuing the investigation to find differences.

Here are the values I get for each product, listed in this order:

  • in the bson, decoded with: bsondump products.bson |grep -A 3 -e '\(1340951640901\|00637480006835\|00016571950293\|00016571910303\|07203671232\)' > truncated.json
  • in the jsonl: zcat ../../openfoodfacts-products.jsonl.gz |grep -A 3 -e '\(1340951640901\|00637480006835\|00016571950293\|00016571910303\|07203671232\)' > truncated-json.json
  • in mongosh: db.products.find({"_id": "xxxxx"}, {"ingredients_text":1})
  • search api: https://world.openfoodfacts.org/api/v2/search/?codes_tags=xxxxxxx&fields=ingredients_text,last_modified_t,last_editor,code
  • direct access: https://world.openfoodfacts.org/api/v2/product/xxxxxx&fields=ingredients_text
1340951640901
"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and"
"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and"
"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and/or canola), paprika, xanthan gum, propylene glycol alginate, citric acid."
"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and/or canola), paprika, xanthan gum, propylene glycol alginate, citric acid."
"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and/or canola), paprika, xanthan gum, propylene glycol alginate, citric acid."

00637480006835
"Water, dairy protein blend (milk protein concentrate,calcium ca seinate, whey protein concentrate), sunflower oil, pasteurized cream, isolated soy protein, cellulose gel, cellulose gum, magnesium phosphate, potassium citrste, sodium hexametaphosphate, nat"
"Water, dairy protein blend (milk protein concentrate,calcium ca seinate, whey protein concentrate), sunflower oil, pasteurized cream, isolated soy protein, cellulose gel, cellulose gum, magnesium phosphate, potassium citrste, sodium hexametaphosphate, nat"
"Water, dairy protein blend (milk protein concentrate,calcium ca seinate, whey protein concentrate), sunflower oil, pasteurized cream, isolated soy protein, cellulose gel, cellulose gum, magnesium phosphate, potassium citrste, sodium hexametaphosphate, nat"
"Water, dairy protein blend (milk protein concentrate, calcium caseinate, whey protein concentrate), sunflower oil, pasteurized cream, isolated soy protein, cellulose gel, cellulose gum, magnesium phosphate, potassium citrate, sodium hexametaphosphate, natural and artificial flavors, calcium phosphate, salt, acesulfame potassium, carrageenan, soy lecithin, sucralose. vitamin mineral blend: sodium ascorbate (vitamin c), zinc gluconate, dl-apha-tocopheryl acetate (vitamin e), niacinamide (vitamin b3), manganese gluconate, d-calcium pantothenate (vitamin b5), pyridoxine hydrochloride (vitamin b6), thiamin hydrochloride (vitamin b1), riboflavin (vitamin b2), chromium chloride, folic acid (vitamin b9), biotin (vitamin b7), potassium iodide, sodium molybdate, sodium selenite, phylloquinone (vitamin k1), cyanocobalamin (vitamin b12), cholecalciferol (vitamin d3)."
"Water, dairy protein blend (milk protein concentrate, calcium caseinate, whey protein concentrate), sunflower oil, pasteurized cream, isolated soy protein, cellulose gel, cellulose gum, magnesium phosphate, potassium citrate, sodium hexametaphosphate, natural and artificial flavors, calcium phosphate, salt, acesulfame potassium, carrageenan, soy lecithin, sucralose. vitamin mineral blend: sodium ascorbate (vitamin c), zinc gluconate, dl-apha-tocopheryl acetate (vitamin e), niacinamide (vitamin b3), manganese gluconate, d-calcium pantothenate (vitamin b5), pyridoxine hydrochloride (vitamin b6), thiamin hydrochloride (vitamin b1), riboflavin (vitamin b2), chromium chloride, folic acid (vitamin b9), biotin (vitamin b7), potassium iodide, sodium molybdate, sodium selenite, phylloquinone (vitamin k1), cyanocobalamin (vitamin b12), cholecalciferol (vitamin d3)."


00016571950293
"Carbonated mountain spring water, natural flavors, citric acid, apple juice concentrate, lemon juice concentrate, potassium benzoate (to ensure freshness), sucralose, ester gum, green tea extract, calcium disodium edta (to protect flavor), red #40, yellow"
"Carbonated mountain spring water, natural flavors, citric acid, apple juice concentrate, lemon juice concentrate, potassium benzoate (to ensure freshness), sucralose, ester gum, green tea extract, calcium disodium edta (to protect flavor), red #40, yellow"
"Carbonated mountain spring water, natural flavors, citric acid, apple juice concentrate, lemon juice concentrate, potassium benzoate (to ensure freshness), sucralose, ester gum, green tea extract, calcium disodium edta (to protect flavor), red #40, yellow"
"Carbonated mountain spring water, natural flavors, citric acid, apple juice concentrate, lemon juice concentrate, potassium benzoate (to ensure freshness), sucralose, ester gum, green tea extract, calcium disodium edta (to protect flavor), red #40, yellow"
"Carbonated water, citric acid, natural flavors, lemon juice concentrate, potassium benzoate (to ensure freshness), fruit and vegetable juice (for color), sucralose, beta carotene (for color), green tea extract, ester gum, calcium disodium edta (to protect flavor), biotin, niacinamide (vitamin b3), calcium pantothenate (vitamin b5), vitamin a, vitamin b12, vitamin d3, pyridoxine hydrochloride (vitamin b6)."

00016571910303
"Carbonated mountain spring water, natural flavors, blackberry juice concentrate, malic acid, potassium benzoate (to ensure freshness), sucralose, green tea extract, red #40, biotin 1% trit. (maltodextrin), niacinamide (b3), d-calcium pantothenate (b5), vi"
"Carbonated mountain spring water, natural flavors, blackberry juice concentrate, malic acid, potassium benzoate (to ensure freshness), sucralose, green tea extract, red #40, biotin 1% trit. (maltodextrin), niacinamide (b3), d-calcium pantothenate (b5), vi"
"Carbonated mountain spring water, natural flavors, blackberry juice concentrate, malic acid, potassium benzoate (to ensure freshness), sucralose, green tea extract, red #40, biotin 1% trit. (maltodextrin), niacinamide (b3), d-calcium pantothenate (b5), vi"
"carbonated water, natural flavors, malic acid, vegetable juice (for color), blackberry juice concentrate, potassium benzoate (to ensure freshness), sucralose, gum arabic, green tea extract, biotin, niacinamide (vitamin b3), vitamin a, calcium pantothenate (vitamin b5), vitamin b12, vitamin d3, pyridoxine hydrochloride (vitamin b6),"
"carbonated water, natural flavors, malic acid, vegetable juice (for color), blackberry juice concentrate, potassium benzoate (to ensure freshness), sucralose, gum arabic, green tea extract, biotin, niacinamide (vitamin b3), vitamin a, calcium pantothenate (vitamin b5), vitamin b12, vitamin d3, pyridoxine hydrochloride (vitamin b6),"

07203671232
"Sea salt, spices and herbs, natural bleu cheese flavor (maltodextrin, whey solids, natural bleu cheese), dehydrated garlic, dehydrated onion, whey powder, cane sugar, ground mustard, worcestershire sauce powder ([distilled vinegar, molasses, corn syrup, s"
"Sea salt, spices and herbs, natural bleu cheese flavor (maltodextrin, whey solids, natural bleu cheese), dehydrated garlic, dehydrated onion, whey powder, cane sugar, ground mustard, worcestershire sauce powder ([distilled vinegar, molasses, corn syrup, s"
"Sea salt, spices and herbs, natural bleu cheese flavor (maltodextrin, whey solids, natural bleu cheese), dehydrated garlic, dehydrated onion, whey powder, cane sugar, ground mustard, worcestershire sauce powder ([distilled vinegar, molasses, corn syrup, s"
"Sea salt, spices and herbs, natural bleu cheese flavor (maltodextrin, whey solids, natural bleu cheese), dehydrated garlic, dehydrated onion, whey powder, cane sugar, ground mustard, worcestershire sauce powder ([distilled vinegar, molasses, corn syrup, s"
"Sea salt, spices and herbs, natural bleu cheese flavor (maltodextrin, whey solids, natural bleu cheese), dehydrated garlic, dehydrated onion, whey powder, cane sugar, ground mustard, worcestershire sauce powder ([distilled vinegar, molasses, corn syrup, s"

My conclusions so far:

  • 07203671232 is OK: it is the primary data that is truncated
  • there is no consistent pattern between the data in the dump, mongosh, the search API and the product API (strange!)
  • when a string is truncated, it is always truncated at 255 characters (this does not seem coincidental…)
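Given the 255-character observation, one way to look for other affected products is to scan the JSONL dump for records whose "ingredients_text" is exactly 255 characters long. This is a rough Python sketch, not the project's tooling: `find_suspect_records` and `scan_dump` are hypothetical names, and it assumes one JSON object per line with "_id" and "ingredients_text" fields, as in the excerpts above.

```python
import gzip
import json
from typing import Iterable, Iterator, Tuple

SUSPECT_LENGTH = 255  # truncation length observed in this thread


def find_suspect_records(
    lines: Iterable[str], length: int = SUSPECT_LENGTH
) -> Iterator[Tuple[str, str]]:
    """Yield (_id, ingredients_text) when the text is exactly `length` characters."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        product = json.loads(line)
        text = product.get("ingredients_text")
        if isinstance(text, str) and len(text) == length:
            yield str(product.get("_id")), text


def scan_dump(path: str) -> Iterator[Tuple[str, str]]:
    """Stream a gzipped JSONL dump line by line instead of loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from find_suspect_records(handle)
```

Calling `scan_dump("openfoodfacts-products.jsonl.gz")` would then list candidates. An ingredient list that genuinely happens to be 255 characters long would be a false positive, so each hit still needs checking against the product's history.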

@alexgarel
Member Author

For 0016571950293, it seems it is the rev1 data that is in the index.
This comes from an old USDA import.

For 00016571910303, I do not see a revision with these ingredients, but strangely, in the json / index we have the same ingredients as 0016571950293.

@stephanegigandet
Contributor

The first USDA dump import had ingredients truncated at 255 characters; that was a bug in the USDA data.

@CharlesNepote CharlesNepote added MongoDB We have 2 mongodb collections: one for current products, and one for obsolete products Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data labels Aug 24, 2022
@alexgarel
Member Author

alexgarel commented Aug 24, 2022

Hmm, for 00016571950293 and 00016571910303, I think I've got the trick: as they have 14 digits, our API removes the first 0.

If you look carefully at https://world.openfoodfacts.org/api/v2/product/00016571950293&fields=code,ingredients_text,codes_tags you see the leading 0 is removed in the returned object.

But it seems we still have old 14-digit references in MongoDB, which are not up to date…

On staging:

> db.products.find({_id: "00016571950293"}, {ingredients_text:1})
{ "_id" : "00016571950293", "ingredients_text" : "Carbonated mountain spring water, natural flavors, citric acid, apple juice concentrate, lemon juice concentrate, potassium benzoate (to ensure freshness), sucralose, ester gum, green tea extract, calcium disodium edta (to protect flavor), red #40, yellow" }
> db.products.find({_id: "0016571950293"}, {ingredients_text:1})
{ "_id" : "0016571950293", "ingredients_text" : "Carbonated water, citric acid, natural flavors, lemon juice concentrate, potassium benzoate (to ensure freshness), fruit and vegetable juice (for color), sucralose, beta carotene (for color), green tea extract, ester gum, calcium disodium edta (to protect flavor), biotin, niacinamide (vitamin b3), calcium pantothenate (vitamin b5), vitamin a, vitamin b12, vitamin d3, pyridoxine hydrochloride (vitamin b6)." }

There seems to be only 147 of them. See https://world.openfoodfacts.org/api/v2/search?codes_tags=0xxxxxxxxxxxxx&fields=code

@alexgarel
Member Author

Note that the product exists on the filesystem (here on staging):

/mnt/podata/products$ ls 000/165/719/10303
1.sto  changes.sto  product.sto

@alexgarel
Member Author

@stephanegigandet should we remove those products and their references in MongoDB?

Or should Kristina handle those 14-digit codes herself and remove the first digit when it is a 0?

That would leave us with only the first problematic case.

@stephanegigandet
Contributor

The ideal solution would be to move them to the code without the leading 0 (if no product exists under that code yet), or delete them (if the product already exists).

It would be good to also dump all codes from MongoDB and run them through normalize_code() to see if we have other instances of products that are now stored differently.
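The move-or-delete rule above can be sketched over a list of codes dumped from MongoDB. This is only an illustration in Python: `strip_leading_zero` is a hypothetical stand-in that covers just the 14-digit case described in this thread, not the real normalize_code(), and `plan_cleanup` is likewise an invented helper.

```python
from typing import Dict, Iterable, List


def strip_leading_zero(code: str) -> str:
    """Hypothetical stand-in for normalize_code(): drop the first 0 of a 14-digit code."""
    if len(code) == 14 and code.isdigit() and code.startswith("0"):
        return code[1:]
    return code


def plan_cleanup(codes: Iterable[str]) -> Dict[str, List[str]]:
    """Sort stale codes into 'move' (target missing) and 'delete' (target exists)."""
    known = set(codes)
    plan: Dict[str, List[str]] = {"move": [], "delete": []}
    for code in sorted(known):
        target = strip_leading_zero(code)
        if target == code:
            continue  # already stored under its normalized code
        plan["delete" if target in known else "move"].append(code)
    return plan


# The two _id values shown on staging earlier in this thread.
print(plan_cleanup(["00016571950293", "0016571950293"]))
# → {'move': [], 'delete': ['00016571950293']}
```

Running the real normalize_code() over every dumped code in the same way would also surface codes that are stored differently for reasons other than a leading 0.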

@alexgarel
Member Author

I opened #7249 and #7248

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity.

@github-actions github-actions bot added the ⏰ Stale This issue hasn't seen activity in a while. You can try documenting more to unblock it. label Nov 24, 2022
Projects
Status: To discuss and validate
Development

No branches or pull requests

3 participants