feat: USDA API import to a list of products missing in OFF #9083

Open
wants to merge 9 commits into
base: main
from

Conversation

rkiddy

@rkiddy rkiddy commented Sep 27, 2023

I am not sure of how to start so I am just starting something and, if it is completely wrong, I can re-start.

I can see correspondences between many of the keys in the USDA structure and the OFF structure. How much am I supposed to add from the USDA data? How much is necessary to have in the OFF data? If some of the OFF data is created, are there processes that will create the other values that can be derived from it?

I cannot find the answers to these questions. If my Python seems gross, apologies. I spent many years doing Java and at times I am not pythonic.

@rkiddy rkiddy requested a review from a team as a code owner September 27, 2023 23:05
@rkiddy
Author

rkiddy commented Sep 27, 2023

You do not need to run this yourself. :--) Here is the output at this point:

 $ python3 usda_to_off.py
 -------------------------
 status: 200
 -------------------------
 USDA data:
 {'allHighlightFields': '<b>GTIN/UPC</b>: <em>619128673216</em>',
  'brandOwner': "NATURE'S HARVEST",
  'dataSource': 'LI',
  'dataType': 'Branded',
  'description': 'SOUTHWESTERN HOT BAR MIX, HOT',
  'fdcId': 1115329,
  'finalFoodInputFoods': [],
  'foodAttributeTypes': [{'description': 'Changes that were made to this food',
                          'foodAttributes': [{'id': 1019459,
                                              'name': 'Description',
                                              'value': '2'}],
                          'id': 998,
                          'name': 'Update Log'}],
  'foodAttributes': [],
  'foodCategory': 'Other Snacks',
  'foodMeasures': [],
  'foodNutrients': [{'derivationCode': 'LCCS',
                     'derivationDescription': 'Calculated from value per '
                                              'serving size measure',
                     'derivationId': 70,
                     'foodNutrientId': 13797640,
                     'foodNutrientSourceCode': '12',
                     'foodNutrientSourceDescription': "Manufacturer's "
                                                      'analytical; partial '
                                                      'documentation',
                     'foodNutrientSourceId': 9,
                     'indentLevel': 1,
                     'nutrientId': 1003,
                     'nutrientName': 'Protein',
                     'nutrientNumber': '203',
                     'rank': 600,
                     'unitName': 'G',
                     'value': 25.0},
                    {'derivationCode': 'LCCS',
                     'derivationDescription': 'Calculated from value per '
                                              'serving size measure',
                     'derivationId': 70,
                     'foodNutrientId': 13797641,
                     'foodNutrientSourceCode': '12',
                     'foodNutrientSourceDescription': "Manufacturer's "
                                                      'analytical; partial '
                                                      'documentation',
                     'foodNutrientSourceId': 9,
                     'indentLevel': 1,
                     'nutrientId': 1004,
                     'nutrientName': 'Total lipid (fat)',
                     'nutrientNumber': '204',
                     'percentDailyValue': 21,
                     'rank': 800,
                     'unitName': 'G',
                     'value': 50.0},
                    ...(SNIP)....
                    {'derivationCode': 'LCCS',
                     'derivationDescription': 'Calculated from value per '
                                              'serving size measure',
                     'derivationId': 70,
                     'foodNutrientId': 13797652,
                     'foodNutrientSourceCode': '12',
                     'foodNutrientSourceDescription': "Manufacturer's "
                                                      'analytical; partial '
                                                      'documentation',
                     'foodNutrientSourceId': 9,
                     'indentLevel': 1,
                     'nutrientId': 1257,
                     'nutrientName': 'Fatty acids, total trans',
                     'nutrientNumber': '605',
                     'rank': 15400,
                     'unitName': 'G',
                     'value': 0.0},
                    {'derivationCode': 'LCCS',
                     'derivationDescription': 'Calculated from value per '
                                              'serving size measure',
                     'derivationId': 70,
                     'foodNutrientId': 13797653,
                     'foodNutrientSourceCode': '12',
                     'foodNutrientSourceDescription': "Manufacturer's "
                                                      'analytical; partial '
                                                      'documentation',
                     'foodNutrientSourceId': 9,
                     'indentLevel': 1,
                     'nutrientId': 1258,
                     'nutrientName': 'Fatty acids, total saturated',
                     'nutrientNumber': '606',
                     'percentDailyValue': 10,
                     'rank': 9700,
                     'unitName': 'G',
                     'value': 7.14}],
  'foodVersionIds': [],
  'gtinUpc': '619128673216',
  'ingredients': 'CHILE CHICKPEAS, CHILE PEANUTS, PEPITAS, SMOKEHOUSE HICKORY '
                 'SMOKED ALMONDS, CHILE POWDER, CITRIC ACID, VEGETABLE OIL & '
                 'SALT.',
  'marketCountry': 'United States',
  'microbes': [],
  'modifiedDate': '2020-09-12',
  'publishedDate': '2020-11-13',
  'score': -405.18314,
  'servingSize': 28.0,
  'servingSizeUnit': 'g',
  'tradeChannels': ['NO_TRADE_CHANNEL']}
 -------------------------
 OFF data:
 {'categories_tags': ['en:Snacks'],
  'code': '0619128673216',
  'code_tags': ['code-13',
                '0619128673XXX',
                '061912867XXXX',
                '06191286XXXXX',
                '0619128XXXXXX',
                '061912XXXXXXX',
                '06191XXXXXXXX',
                '0619XXXXXXXXX',
                '061XXXXXXXXXX',
                '06XXXXXXXXXXX',
                '0XXXXXXXXXXXX'],
  'ingredients_text': 'CHILE CHICKPEAS, CHILE PEANUTS, PEPITAS, SMOKEHOUSE '
                      'HICKORY SMOKED ALMONDS, CHILE POWDER, CITRIC ACID, '
                      'VEGETABLE OIL & SALT.',
  'ingredients_text_en': 'CHILE CHICKPEAS, CHILE PEANUTS, PEPITAS, SMOKEHOUSE '
                         'HICKORY SMOKED ALMONDS, CHILE POWDER, CITRIC ACID, '
                         'VEGETABLE OIL & SALT.',
  'nutriments': {'calcium_100g': 14.3,
                 'calcium_serving': 143,
                 'calcium_unit': 'MG',
                 'calcium_value': 143,
                 'carbohydrates_100g': 0.679,
                 'carbohydrates_serving': 67.9,
                 'carbohydrates_unit': 'G',
                 'carbohydrates_value': 67.9,
                 'cholesterol_100g': 0.0,
                 'cholesterol_serving': 0.0,
                 'cholesterol_unit': 'MG',
                 'cholesterol_value': 0.0,
                 'energy_serving': 571,
                 'energy_unit': 'KCAL',
                 'energy_value': 571,
                 'fat_100g': 0.5,
                 'fat_serving': 50.0,
                 'fat_unit': 'G',
                 'fat_value': 50.0,
                 'fiber_100g': 0.071,
                 'fiber_serving': 7.1,
                 'fiber_unit': 'G',
                 'fiber_value': 7.1,
                 'iron_100g': 0.257,
                 'iron_serving': 2.57,
                 'iron_unit': 'MG',
                 'iron_value': 2.57,
                 'proteins_100g': 0.25,
                 'proteins_serving': 25.0,
                 'proteins_unit': 'G',
                 'proteins_value': 25.0,
                 'saturated-fat_100g': 0.07139999999999999,
                 'saturated-fat_serving': 7.14,
                 'saturated-fat_unit': 'G',
                 'saturated-fat_value': 7.14,
                 'sodium_100g': 67.9,
                 'sodium_serving': 679,
                 'sodium_unit': 'MG',
                 'sodium_value': 679,
                 'sugars_100g': 0.035699999999999996,
                 'sugars_serving': 3.57,
                 'sugars_unit': 'G',
                 'sugars_value': 3.57,
                 'trans-fat_100g': 0.0,
                 'trans-fat_serving': 0.0,
                 'trans-fat_unit': 'G',
                 'trans-fat_value': 0.0,
                 'vitamin-a_serving': 0.0,
                 'vitamin-a_unit': 'IU',
                 'vitamin-a_value': 0.0,
                 'vitamin-c_100g': 0.43,
                 'vitamin-c_serving': 4.3,
                 'vitamin-c_unit': 'MG',
                 'vitamin-c_value': 4.3},
  'serving_quantity': 28.0,
  'serving_size': '28.0 g',
  'sources_fields': {'org-database-usda': {'available_date': None,
                                           'fdc_category': 'Other Snacks',
                                           'fdc_data_source': 'LI',
                                           'fdc_id': 1115329,
                                           'modified_date': '2020-09-12',
                                           'published_date': '2020-11-13'}}}
 -------------------------

@raphael0202
Contributor

raphael0202 commented Sep 28, 2023

Hello!
We already import USDA data, see https://github.com/openfoodfacts/openfoodfacts-server/tree/main/scripts/usda-import

@rkiddy
Author

rkiddy commented Sep 29, 2023

> Hello! We already import USDA data, see https://github.com/openfoodfacts/openfoodfacts-server/tree/main/scripts/usda-import

Yes. We are trying to use the USDA API and not the downloaded CSV file.

Having the code to do this in something more modern than Perl would not be a terrible thing either.

see https://docs.google.com/spreadsheets/d/1EoguFCEF3ZOxyhoJikNX2a6mjGBxV0jqt5A9gV8wmPo/edit#gid=1527328796
and
#4943

@rkiddy rkiddy changed the title from "Start something. From a list of products missing in OFF, start to create them" to "USDA API to a list of products missing in OFF, start to create them" Sep 29, 2023
@rkiddy
Author

rkiddy commented Oct 13, 2023

I am not seeing something. The USDA gives me nutrients per 100g. I can see some things in the USDA's labelNutrients values. If something is in the per-100g data, I can calculate the label value.

That works if I get the serving size in grams. But what if I get the serving size in mL, with no way to determine the mass of a serving?

For example, see:

 python3 usda_check.py --upc 041190406913
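
A minimal sketch of the calculation in question, assuming the nutrient value is per 100 g and the serving is described by `servingSize` and `servingSizeUnit` as in the USDA output earlier in this thread. The function name `per_serving` is hypothetical. When the serving is given in mL there is no density to convert with, so the sketch returns `None` rather than guessing a mass.

```python
from typing import Optional


def per_serving(value_100g: float, serving_size: float, serving_unit: str) -> Optional[float]:
    """Scale a per-100 g nutrient value to a single serving.

    Returns None when the serving is not expressed in grams, because a
    volume (for example mL) cannot be converted to mass without a density.
    """
    if serving_unit.strip().lower() != "g":
        return None
    return value_100g * serving_size / 100.0


# 25 g of protein per 100 g, with a 28 g serving
print(per_serving(25.0, 28.0, "g"))
# A serving given in mL cannot be scaled this way
print(per_serving(25.0, 240.0, "ml"))
```

Returning `None` keeps the missing value visible to the caller, which can then skip the per-serving field or flag the product for manual review.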

Member

@alexgarel alexgarel left a comment


Hi @rkiddy your approach seems good to me.

I think originally @stephanegigandet did this through a CSV so it can be imported in the producers platform and checked there. But I think it's OK to have JSON objects; we could then submit them to the API or transform them to CSV.

However, you don't need to create "_tags" fields; they will be created by ProductOpener on product import.

You can look at the example CSV file on the pro platform to get an idea of the fields that need to be supplied (or simply look at the product form on the Open Food Facts website).

I will commit a file of the field correspondences as set up by @stephanegigandet at the time.

So you could also lean on this file to transform field names in the script (with a for loop) and then do specific transformations on the fields that need them.
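
A rough sketch of that two-step approach: a generic rename driven by a mapping, followed by a field-specific transformation. The `FIELD_MAP` entries and the `rename_fields` helper are hypothetical and only illustrate the shape; the real correspondences would come from the mapping file mentioned above. No "_tags" fields are produced, since ProductOpener creates them on import.

```python
def rename_fields(usda_record, field_map):
    """Return a new dict with USDA keys renamed to OFF field names.

    USDA keys that are absent from field_map are dropped.
    """
    off_record = {}
    for usda_key, off_key in field_map.items():
        if usda_key in usda_record:
            off_record[off_key] = usda_record[usda_key]
    return off_record


# Hypothetical excerpt of the mapping (assumption, not the committed file).
FIELD_MAP = {
    "gtinUpc": "code",
    "ingredients": "ingredients_text_en",
    "servingSize": "serving_quantity",
}

usda = {
    "gtinUpc": "619128673216",
    "ingredients": "CHILE CHICKPEAS, CHILE PEANUTS, PEPITAS",
    "servingSize": 28.0,
    "servingSizeUnit": "g",
    "score": -405.18314,
}

product = rename_fields(usda, FIELD_MAP)

# Field-specific transformation: build the human-readable serving size.
product["serving_size"] = f"{usda['servingSize']} {usda['servingSizeUnit']}"
```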

def category(name):
    global categories
    if len(categories) == 0:
        with open('/home/ray/Projects/OFF/usda/USDA_fdc_categories.csv', newline='') as csvfile:
Member


beware the absolute path ;-)
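
One possible way to avoid the absolute path, sketched under the assumption that `USDA_fdc_categories.csv` sits next to the script. The helper names and the reading of the CSV columns (category name first, OFF tag third) are assumptions based only on the rows visible in this review.

```python
import csv
from pathlib import Path

# Resolve the CSV relative to this script instead of a home directory.
try:
    BASE_DIR = Path(__file__).resolve().parent
except NameError:  # __file__ is undefined in an interactive session
    BASE_DIR = Path.cwd()

CATEGORIES_CSV = BASE_DIR / "USDA_fdc_categories.csv"


def parse_categories(lines):
    """Map each USDA category name (column 1) to its OFF tag (column 3).

    Categories without an OFF tag map to an empty string.
    """
    return {row[0]: row[2] for row in csv.reader(lines) if len(row) >= 3}


def load_categories(path=CATEGORIES_CSV):
    """Read the category mapping from disk."""
    with open(path, newline="", encoding="utf-8") as csvfile:
        return parse_categories(csvfile)
```

Loading the mapping once and passing it around would also remove the need for the `global categories` cache.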

Author


So, Stephane includes absolute paths for files on his computers but I cannot do it here? Humph. :--)

Beer,1,en:Beers,
Amino Acid Supplements,1,,
Processed Cheese & Cheese Novelties,1,,
Sauces - Cooking (Shelf Stable),1,en:Sauces,
Member


Author


That may have been where I got it. Stephane pointed it out to me in a slack thread. I am, by the way, not understanding that part of it yet. So, I am open to any suggestions.

@alexgarel
Member

alexgarel commented Oct 25, 2023

Fields potentially provided by producers: openfoodfacts_import.xlsx

@teolemon teolemon added the "🇺🇸 United States" (Project to improve support in the United States), "Data import" and "🐍 Python" labels Oct 25, 2023
@rkiddy rkiddy changed the title from "USDA API to a list of products missing in OFF, start to create them" to "USDA API import to a list of products missing in OFF" Oct 25, 2023
@rkiddy
Author

rkiddy commented Oct 25, 2023

@alexgarel A few things about usda_to_off_fields.json:

I assume that this list is not complete, yes? For example, I see in a product that there are columns for "omega-3-fat", "omega-6-fat", "omega-9-fat" and others. I will go ahead and add them.

Are these the same columns as in the API? For example, there I do not see a "modified-date", but I do see a "last_modified_t" that is a unix timestamp. So there may be a separate list for the API?

Would it be objectionable to add a "path" string? Perhaps this is how to get around the API difference issue. Perhaps "fiber-g-100g:paths" should be ["nutriment:fiber", "nutriment:fiber_100g", "nutriment:fiber_unit", "nutriment:fiber_value"].

See, for example, https://world.openfoodfacts.org/api/v3/product/3760232740033

Also, I am tempted to add a "datatype" field but will resist for now. For example, "code" is an arbitrary-length integer and "category" is a string, or perhaps a list of strings. And "to-create" is a boolean? But "modified-date" would seem to be a date, and I will (for now) add a "format" value here that might look like "YYYY-mm-dd HH:MM:ss" in UTC.
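
To illustrate the gap between a formatted "modified-date" and a Unix timestamp such as "last_modified_t", here is a small sketch that converts one to the other. The `date_to_unix` helper is hypothetical, the input is assumed to be in UTC, and whether the import should set "last_modified_t" at all is not settled in this thread.

```python
from datetime import datetime, timezone


def date_to_unix(date_str, fmt="%Y-%m-%d"):
    """Parse a date string as UTC and return a Unix timestamp in seconds."""
    parsed = datetime.strptime(date_str, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# USDA 'modifiedDate' style value
stamp = date_to_unix("2020-09-12")

# The longer format suggested above maps to this strptime pattern
stamp_with_time = date_to_unix("2020-09-12 00:00:00", "%Y-%m-%d %H:%M:%S")
```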

@rkiddy
Author

rkiddy commented Oct 25, 2023

FYI, I was asked to look at 10 barcodes, but:

Not in USDA and not in OFF:
'20200129783'
'44400176002'
'41570094754'
'72745804113'
'15800050117'

Not in USDA but in OFF:
'0053000006329'

In USDA and OFF:
'4099100028829'
'850229005207'
'856481003043'
'4099100099157'
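
A sketch of how such a grouping could be produced. It assumes barcodes are compared after left-padding to 13 digits, mirroring the zero-padded OFF "code" in the output earlier in this thread; `in_usda` and `in_off` are hypothetical sets of codes already found by earlier lookups, and the helper names are illustrative only.

```python
def normalize_code(barcode):
    """Left-pad a barcode with zeros to 13 digits."""
    return barcode.strip().zfill(13)


def classify(barcodes, in_usda, in_off):
    """Group barcodes by the databases in which they were found."""
    groups = {"neither": [], "off_only": [], "usda_only": [], "both": []}
    for raw in barcodes:
        code = normalize_code(raw)
        found_usda = code in in_usda
        found_off = code in in_off
        if found_usda and found_off:
            groups["both"].append(raw)
        elif found_usda:
            groups["usda_only"].append(raw)
        elif found_off:
            groups["off_only"].append(raw)
        else:
            groups["neither"].append(raw)
    return groups
```

Only products in the "usda_only" group would be candidates for creation in OFF.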

@sonarcloud

sonarcloud bot commented Oct 25, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
11 Code Smells (rating A)

No coverage information
0.0% duplication

@teolemon teolemon changed the title from "USDA API import to a list of products missing in OFF" to "feat: USDA API import to a list of products missing in OFF" Oct 26, 2023
@codecov-commenter

Codecov Report

Merging #9083 (0ee12e4) into main (bd6b3da) will not change coverage.
Report is 3 commits behind head on main.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #9083   +/-   ##
=======================================
  Coverage   47.95%   47.95%           
=======================================
  Files          65       65           
  Lines       20223    20223           
  Branches     4914     4914           
=======================================
  Hits         9697     9697           
  Misses       9271     9271           
  Partials     1255     1255           


@alexgarel
Member

> Are these the same columns as in the API? For example, there I do not see a "modified-date", but I do see a "last_modified_t" that is a unix timestamp. So there may be a separate list for the API?

@rkiddy you are right. The fact is this is a mapping to obtain a csv file that can be imported through the import function.

So we have two options:

  • either you stick to your approach of creating products through the API (which seems fine to me)
  • or, from the USDA API, you create a CSV/Excel file which is imported to the producers platform

@stephanegigandet any thoughts on that ?

@rkiddy
Author

rkiddy commented Nov 8, 2023

@alexgarel I have learned some things.

The first thing is that no API coming out of the USDA has per-serving nutrition information. So, no matter what, we will have to calculate things.

The second thing is a bit of a surprise. Comparing the information coming out of the API with the exported CSV file, the values in the CSV have more digits. The API values are rounded. This suggests that the CSV file, not the API, should be the canonical source.

But there is obviously a problem with processing updates to the information. So, what might you think of these suggestions?

  1. code can run periodically and automatically (somehow) to make note of new files in the USDA download page.
  2. code for doing the CSV import can be moved from ... that other project where it is, to inside the openfoodfacts-server project
  3. code can be put into the openfoodfacts-server project that will check the API for updated or added products, to be run periodically and automatically (again, somehow). This has a chance of staying under the rate limit for the USDA API.
  4. anything that I am forgetting that will better integrate the import process into the app.
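
The update check in point 3 could, for instance, compare the USDA "modifiedDate" with the "modified_date" already stored under "sources_fields" in the OFF product, both of which appear in the output earlier in this thread. This is only a sketch under that assumption; `needs_update` is a hypothetical helper, and how the two records are fetched is left out.

```python
from datetime import date


def needs_update(usda_food, off_product):
    """Return True if the USDA record is newer than the last imported one."""
    stored = (
        off_product.get("sources_fields", {})
        .get("org-database-usda", {})
        .get("modified_date")
    )
    if stored is None:
        return True  # nothing was imported from USDA for this product yet
    return date.fromisoformat(usda_food["modifiedDate"]) > date.fromisoformat(stored)
```

Re-importing only the products for which this returns `True` is one way to keep the number of follow-up API calls low.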

What think you?

@stephanegigandet
Contributor

I think it's best to keep things simple. We can use the CSV only, and update it every 6 months when there is a new CSV export.

6 participants