Use LLMs to extract ingredient lists from raw text #309

Open
Tracked by #289
raphael0202 opened this issue May 14, 2023 · 3 comments

raphael0202 commented May 14, 2023

Successful test using ChatGPT (GPT-3.5):

Extract ingredient lists from the following texts. The ingredient list should start with the first ingredient and include allergy, label or origin information if this information is provided just after the ingredient list.
The output format must be a JSON list containing one element per ingredient list. If there are ingredients in several languages, the JSON should contains as many elements as detected ingredient languages. Each JSON element should have two fields: 
- a "text" field containing the detected ingredient list. The text should be a substring of the original text, you must not alter the original text.
- a  "lang" field containing the detected language of the ingredient list.
Don't output anything else than the expected JSON.

(FI) Hapatettua vaniljakaurajäätelöä Ainesosat Kaurapohja (vesi, kaura), maltodekstriini, dekstroosi, kasviöljyt (kookos, rypsi), sokeri, tärkkelyssiirappi, vanilja (1,5 %), perunaproteiini, emulgointiaine (E471), stabilointiaineet (johanneksenleipäpuujauhe, guarkumi), suola, luontainen vanilja-aromi, hapate. Sisältää 5% kauraa (SV/DA/NO) Syrad vaniljhavreglass/Symmet vaniljehavreis Ingredienser: Havrebas (vatten/vand/vann, havre), maltodextrin, dextros, vegetabi olier (kokos, rybs), socker/sukker, stärkelsesirap/stivelsesirup, vanilj (1,5%), potas-kat potetprotein, emulgeringsmedel/emulgator (E471), stabiliseringsmedel/stabilisatore (fruktkärnmjöl/johannesbrødkernemel, guarkärnmjöl/guargummi, salt, naturligania symingskultur. Innehåller/Indeholder 5 % havre.  Ravintoarvot/Näringsvärde/ Næringsindhold/Næringsinnhold Energia/Energi Rasva/Fett/Fedt josta tyydyttynyttä / varav mättat fett/heraf Hiilihydraatit / Kolhydrat / Kulhydrat/Karbohydrat maettede fedtsyrer / hvorav mettede fettsyrer josta sokereita/varav sockerarter/ heraf sukkerarter/hvorav sukkerarter  Proteiini/Protein Suola/Salt  836 kJ/200  SR **S

Output:

[
    {
        "text": "Kaurapohja, maltodekstriini, dekstroosi, kasviöljyt (kookos, rypsi), sokeri, tärkkelyssiirappi, vanilja (1,5 %), perunaproteiini, emulgointiaine (E471), stabilointiaineet (johanneksenleipäpuujauhe, guarkumi), suola, luontainen vanilja-aromi, hapate",
        "lang": "fi"
    },
    {
        "text": "Havrebas, maltodextrin, dextros, vegetabi olier (kokos, rybs), socker, stärkelsesirap, vanilj (1,5%), potas-kat potetprotein, emulgeringsmedel (E471), stabiliseringsmedel (fruktkärnmjöl, guarkärnmjöl, salt, naturligania symingskultur",
        "lang": "sv/da/no"
    }
]
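The prompt requires each "text" value to be an unaltered substring of the input, which can be checked mechanically. For instance, the Finnish element above omits "(vesi, kaura)" after "Kaurapohja", so it would not pass a strict substring test. Below is a minimal sketch of such a check; the function name `check_extraction` and the wording of the problem messages are illustrative assumptions, not part of any existing code in this project.

```python
import json


def check_extraction(source_text: str, raw_output: str) -> list[str]:
    """Return a list of problems found in a model's JSON output (empty if none).

    Hypothetical helper: it only checks the constraints stated in the prompt
    (JSON list, "text" and "lang" fields, "text" is a substring of the source).
    """
    try:
        items = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc}"]
    if not isinstance(items, list):
        return ["top-level JSON value is not a list"]

    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"element {i}: not a JSON object")
            continue
        text = item.get("text")
        lang = item.get("lang")
        if not isinstance(text, str) or not text:
            problems.append(f"element {i}: missing or empty 'text'")
        elif text not in source_text:
            problems.append(f"element {i}: 'text' is not a substring of the source")
        if not isinstance(lang, str) or not lang:
            problems.append(f"element {i}: missing or empty 'lang'")
    return problems


# Shortened, made-up input in the style of the Finnish example above.
source = "Ainesosat Kaurapohja (vesi, kaura), maltodekstriini, suola."
faithful = '[{"text": "Kaurapohja (vesi, kaura), maltodekstriini, suola", "lang": "fi"}]'
altered = '[{"text": "Kaurapohja, maltodekstriini, suola", "lang": "fi"}]'

print(check_extraction(source, faithful))  # no problems expected
print(check_extraction(source, altered))   # one substring problem expected
```

A check like this only flags violations; deciding whether a near-miss output is still usable would remain a separate judgment call.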


raphael0202 commented May 14, 2023

Another one:

Extract ingredient lists from the following texts. The ingredient list should start with the first ingredient and include allergy, label or origin information if this information is provided just after the ingredient list.
The output format must be a JSON list containing one element per ingredient list. If there are ingredients in several languages, the JSON should contains as many elements as detected ingredient languages. Each JSON element should have two fields: 
- a "text" field containing the detected ingredient list. The text should be a substring of the original text, you must not alter the original text.
- a  "lang" field containing the detected language of the ingredient list.
Don't output anything else than the expected JSON.

P  bir  Depuis 1976, notre chocolaterie liégeoise crée, produit en toute indépendance... de manière responsable et transparente.  Onze Luikse chocolaterie creëert en produceert sinds 1976 in alle onafhankelijkheid, op een verantwoorde en transparante manier.  Our chocolate factory, proudly based in Liège/has been producing and creating independently, responsibly and transparently since 1976.  FR-Chocolat noir 85% de cacao et éclats de caramel. Ingrédients: chocolat noir 85% de cacao (pâte de cacao, sucre, beurre de cacao, émulsifiant : lécithine de soja), éclat de caramel 19% (sucre, beurre de cacao, amidon de riz/sel de Guérande, acidifiant: bicarbonate de sodium), sel de Guérande. Cacao: 85% min. Peut contenir : des fruits à coque, lait, oeufs et céréales contenant du gluten.  NL - Pure chocolade 85% cacao en karamelstukjes. Ingrediënten: pure chocolade 85% cacao (cacaopasta, suiker, cacaoboter, emulgator: sojalecithine), karamel stukjes 19% (suiker, cacaoboter, rijstzetmeel, Guérande zout, voedingszuur: natriumbicarbonaat), Guérande zout. Cacao: 85% min. Kan bevatten: schaalvruchten, melk, eieren en gluten bevattende granen.  EN-Dark chocolate 85% cocoa and caramel pieces. Ingredients: dark chocolate 85% cocoa (cocoa paste, sugar, cocoa butter, emulsifier: soy lecithin), caramel pieces 19% (sugar, cocoa butter, rice starch, Guérande salt, acidifier: sodium bicarbonate), Guérande salt. Cocoa: 85% min. May contain: nuts, milk, eggs and cereals containing gluten.  DE-Bitterschokolade 85% Kakao und Karamell stückchen. Zutaten: Bitterschokolade 85% Kakao (Kakaomasse, Zucker, Kakaobutter, emulgator: Sojalecithin), Karamellblättchen 19% (Zucker, Kakaobutter, Reisstärke, Guérande-Salz, Säuerungsmittel: Natriumbicarbonat), Guérande-Salz. Kakao: 85% mindestens. Kann enthalten: Schalenfruchten, Milch, Eiern und Glutenhaltiges Getreide.  ES- Chocolate negro 85% de cacao y copos de caramelo. 
Ingredientes: chocolate negro 85% cacao (pasta de cacao, azúcar, manteca de cacao, emulsionante: lecitina de soja), copos de caramelo 19% (azúcar, manteca de cacao, almidón de arroz, sal de Guérande, acidificante: bicarbonato de sodio), sal de Guérande. Cacao: 85% min. Puede contener: frutos con cáscara, leche, huevos y cereales que contengan gluten.  يحسن استهلاكه قبل  IT-Cioccolato fondente 85% cacao e scoppio di caramello. Ingredienti: cioccolato fondente 85% cacao (pasta di cacao, zucchero, burro di cacao, emulsionante: lecitina di soia), schegge di caramello 19% (zucchero, burro di cacao, amido di riso, sale di Guérande, acidificante: bicarbonato di sodio), sale de Guérande. Cacao: 85% min. Può contenere: frutta a guscio, latte, uova e cereali contenenti glutine.  المكونات : شوكولا سوداء 85 % على الأقل من الكاكاو ( عجينة كاكاو, سكر, زبدة كاكاو, مستحلب : ليسيتين الصويا ), كاراميل مدقوق 19 % ( سكر, زبدة كاكاو, نشأ إرز, ملح, محمض : بيكربونات الصوديوم ), ملح. كاكاو : 85 % على الأقل. يمكن ان يحتوي على : فواكه جافة وحليب وبيض وحبوب محتوات على غولتين.  290CT2021  Consumare preferibilmente entro il: Consumir preferentemente antes del: Ten minste houdbaar tot: / Best before: A consommer de préférence avant le: Mindestens haltbar bis: 15 \"412038129373  FAIRTRADE  Max 16°C / 60.8°F  Made in Belgium Galler Chocolatiers S.A. chocolaterie Galler Rue de la Station 39 indépendante  COCOA  80g € 2.80Z  Cacao: le bilan de masse est utilisé pour correspondre au volume acheté aux conditions Fairtrade. 
www.info.fairtrade.net/sourcing  4051 Vaux-sous-Chèvremont - Belgium Phone: +32 (0)4/367.22.11 www.galler.com - infos@galler.com  Analyses nutritionnelles pour 100g Voedingswaarden per 100g / Nutrition facts per 100g/ Nährwerte pro 100g / Información nutricional por 100g Valori nutrizionali per 100g/100  Valeur énergétique Energetische waarde / Energy value / Brennwert / Valor energético / Valore energetico /  Matières Vetten/Fat/  grasses  Fett / Grasas /Grassi /  في  Dont acides gras saturés / Waarvan verzadigde vetzuren / Of which saturated fat / Davon gesättigte Fettsäuren / De los cuales ácidos grasos saturados / Di cui acidi grassi saturi /sl  Glucides / Koolhydraten / Carbohydrates / Kohlenhydrate/ Hidratos de carbono /  الـكـريـوهـيـدرات / Carboidrati  Dont sucres / Waarvan suikers/Of which sugars / Davon Zucker / De los cuales azúcares /  منها سكر/ Di cui zuccheri  Fibres/Vezels/Fibers/ Fasern/Fibras / Fibre/  صنع ببلجيكا  Protéines / Eiwitten / Protein / Eiweiß / Proteínas / Proteine / Sel/Zout/Salt / Salz/Sal / Sale/  2292 kJ 548 kcal  41,1g  24,7g  29,6g  25,3g  12,7g  8,0g  1,3g  TON  ww 

ChatGPT:

[
    {
        "text": "chocolat noir 85% de cacao (pâte de cacao, sucre, beurre de cacao, émulsifiant : lécithine de soja), éclat de caramel 19% (sucre, beurre de cacao, amidon de riz/sel de Guérande, acidifiant: bicarbonate de sodium), sel de Guérande",
        "lang": "fr"
    },
    {
        "text": "Pure chocolade 85% cacao (cacaopasta, suiker, cacaoboter, emulgator: sojalecithine), karamel stukjes 19% (suiker, cacaoboter, rijstzetmeel, Guérande zout, voedingszuur: natriumbicarbonaat), Guérande zout",
        "lang": "nl"
    },
    {
        "text": "dark chocolate 85% cocoa (cocoa paste, sugar, cocoa butter, emulsifier: soy lecithin), caramel pieces 19% (sugar, cocoa butter, rice starch, Guérande salt, acidifier: sodium bicarbonate), Guérande salt",
        "lang": "en"
    },
    {
        "text": "Bitterschokolade 85% Kakao (Kakaomasse, Zucker, Kakaobutter, emulgator: Sojalecithin), Karamellblättchen 19% (Zucker, Kakaobutter, Reisstärke, Guérande-Salz, Säuerungsmittel: Natriumbicarbonat), Guérande-Salz",
        "lang": "de"
    },
    {
        "text": "Chocolate negro 85% cacao (pasta de cacao, azúcar, manteca de cacao, emulsionante: lecitina de soja), copos de caramelo 19% (azúcar, manteca de cacao, almidón de arroz, sal de Guérande, acidificante: bicarbonato de sodio), sal de Guérande",
        "lang": "es"
    },
    {
        "text": "cioccolato fondente 85% cacao (pasta di cacao, zucchero, burro di cacao, emulsionante: lecitina di soia), schegge di caramello 19% (zucchero, burro di cacao, amido di riso, sale di Guérande, acidificante: bicarbonato di sodio), sale de Guérande",
        "lang": "it"
    },
    {
        "text": "شوكولا سوداء 85 % على الأقل من الكاكاو ( عجينة كاكاو, سكر, زبدة كاكاو, مستحلب : ليسيتين الصويا ), كاراميل مدقوق 19 % ( سكر, زبدة كاكاو, نشأ إرز, ملح",
        "lang": "ar"
    }
]

Vicuna-13B:

[
    {
        "text": "FR-Chocolat noir 85% de cacao et éclats de caramel. Ingrédients: chocolat noir 85% de cacao (pâte de cacao, sucre, beurre de cacao, émulsifiant : lécithine de soja), éclat de caramel 19% (sucre, beurre de cacao, amidon de riz/sel de Guérande, acidifiant: bicarbonate de sodium), sel de Guérande.",
        "lang": "FR"
    }
]
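The "lang" values in the outputs above are not uniform: ChatGPT returned a combined "sv/da/no" for the mixed Scandinavian list, and Vicuna-13B returned an uppercase "FR". If these outputs were to be compared or stored, one option would be to normalize the field first. The sketch below assumes "/" is the only separator a model uses and that lowercase codes are wanted; `normalize_langs` is a hypothetical helper name.

```python
def normalize_langs(lang_field: str) -> list[str]:
    """Split a model-provided "lang" value into lowercase codes, without duplicates.

    Assumption: multiple languages are joined with "/", as in "sv/da/no".
    """
    codes = []
    for part in lang_field.split("/"):
        code = part.strip().lower()
        if code and code not in codes:
            codes.append(code)
    return codes


print(normalize_langs("FR"))
print(normalize_langs("sv/da/no"))
```

This does not validate that the codes are real language codes; it only makes the values comparable across models.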

@teolemon

@raphael0202 Was this ChatGPT 3.5 or 4?

@raphael0202

@teolemon 3.5
