# Analyze Categories Taxonomy

Goal: Create a curated set of food categories that are suitable for search auto-suggest.

Situation:
- auto-suggest is typically based on the most popular searches, but we don't have search data
- as an alternative, suggest the best-matching categories from the taxonomy

Constraints:
- only categories that have many products
- exclude root nodes (too generic) and leaf nodes (too specific) from categories taxonomy
- exclude categories that express food processing (e.g. cooked rice)
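
The constraints above could be combined into a single filter along these lines. This is a minimal sketch on toy data: the `edges` list, the `product_counts` dictionary, the `MIN_PRODUCTS` threshold, the `PROCESSING_WORDS` set and the `select_suggestions` helper are all illustrative assumptions, not values taken from the taxonomy or from the Open Food Facts statistics.

```python
# Toy parent -> child edges (hypothetical, not read from the real taxonomy).
edges = [
    ("en:plant-based-foods", "en:cereals"),
    ("en:cereals", "en:rices"),
    ("en:cereals", "en:cooked-rices"),
    ("en:rices", "en:basmati-rices"),
    ("en:cooked-rices", "en:cooked-basmati-rices"),
]
# Hypothetical number of products per category.
product_counts = {
    "en:plant-based-foods": 500_000,
    "en:cereals": 60_000,
    "en:rices": 9_000,
    "en:cooked-rices": 4_000,
    "en:basmati-rices": 2_000,
    "en:cooked-basmati-rices": 100,
}
MIN_PRODUCTS = 1_000  # assumed threshold for "many products"
PROCESSING_WORDS = {"cooked", "fried", "frozen", "dried", "canned"}  # assumed list


def select_suggestions(edges, product_counts):
    parents = {parent for parent, _ in edges}
    children = {child for _, child in edges}
    # Intermediate nodes have at least one parent and at least one child,
    # so they are neither root nodes nor leaf nodes.
    intermediate = parents & children
    selected = []
    for cat in sorted(intermediate):
        words = set(cat.split(":", 1)[1].split("-"))
        if words & PROCESSING_WORDS:
            continue  # the name expresses food processing
        if product_counts.get(cat, 0) < MIN_PRODUCTS:
            continue  # too few products
        selected.append(cat)
    return selected


suggestions = select_suggestions(edges, product_counts)
print(suggestions)
```

Matching processing terms on the hyphen-separated words of the ID is only a heuristic; a curated exclusion list would probably be needed in practice.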

Category Sources:
- Category Taxonomy (https://github.com/openfoodfacts/openfoodfacts-server/blob/main/taxonomies/food/categories.txt)
- Statistics about Products in Categories (https://world.openfoodfacts.org/categories)

In [1]:
from openfoodfacts.taxonomy import get_taxonomy, Taxonomy
import networkx as nx
import pandas as pd

## Load Category Taxonomy

In [2]:
categories: Taxonomy = get_taxonomy("category")
print(f"categories: {len(categories):,}")

categories: 13,404


## Exploratory Data Analysis

### Category Languages

In [3]:
tax_nodes = categories.nodes.values()
lang_prefix = [node.id[:3] for node in tax_nodes]
pd.Series(lang_prefix).value_counts().head(10)

en:    7936
fr:    2829
it:    1097
es:     395
de:     247
pt:     168
el:     166
bg:      74
hu:      65
ro:      64
Name: count, dtype: int64
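
The slice `node.id[:3]` assumes a two-letter language code followed by a colon. A minimal sketch of a variant that splits on the first colon instead, shown on a few IDs copied from this notebook (the `ids` list stands in for the `node.id` values, and `lang_of` is a hypothetical helper):

```python
from collections import Counter

# A handful of category IDs standing in for [node.id for node in tax_nodes].
ids = ["en:pastas", "fr:medoc", "it:mafaldine", "en:red-wines", "de:obstbrand"]


def lang_of(node_id: str) -> str:
    # Everything before the first colon is treated as the language code.
    return node_id.split(":", 1)[0]


lang_counts = Counter(lang_of(node_id) for node_id in ids)
print(lang_counts.most_common())
```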

### Italian Categories

In [4]:
it_nodes = [node for node in tax_nodes if node.id[:2] == 'it']
it_nodes[:10]

[<TaxonomyNode it:rubicone>,
 <TaxonomyNode it:lago-di-corbara>,
 <TaxonomyNode it:mafaldine>,
 <TaxonomyNode it:vino-doc-molisano>,
 <TaxonomyNode it:colli-di-rimini>,
 <TaxonomyNode it:grottino-di-roccanova>,
 <TaxonomyNode it:salaparuta>,
 <TaxonomyNode it:colli-di-parma-malvasia-spumante-secco>,
 <TaxonomyNode it:casteller>,
 <TaxonomyNode it:oltrepo-pavese-pinot-grigio>]

### Closest English Parents of Italian Categories

In [5]:
# recurse until an English parent is found
def get_en_ancestors(node):
    if node.id[:2] == 'en':
        return [node]
    else:
        return [ena for parent in node.parents for ena in get_en_ancestors(parent)]

it_parents = [parent for node in it_nodes for parent in get_en_ancestors(node)]
pd.Series(it_parents).value_counts()

<TaxonomyNode en:wines-from-italy>         809
<TaxonomyNode en:pastas>                    55
<TaxonomyNode en:italian-meat-products>     41
<TaxonomyNode en:italian-cheeses>           36
<TaxonomyNode en:italian-fruit-spirit>      15
                                          ... 
<TaxonomyNode en:clementines>                1
<TaxonomyNode en:italian-liqueurs>           1
<TaxonomyNode en:specks>                     1
<TaxonomyNode en:brandys>                    1
<TaxonomyNode en:ricciarelli>                1
Name: count, Length: 63, dtype: int64
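
Because a node can have several parents, the recursion in `get_en_ancestors` can reach the same English ancestor along more than one path and then returns it more than once. If each ancestor should count only once per node, the result can be deduplicated. A minimal sketch, using a small hypothetical `Node` dataclass as a stand-in for `TaxonomyNode` (only `id` and `parents` are modelled, and the `it:` IDs are made up):

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for TaxonomyNode with just the attributes used here."""
    id: str
    parents: list = field(default_factory=list)


def get_unique_en_ancestors(node):
    """Return the closest English ancestors of `node`, each listed once."""
    found = {}
    stack = [node]
    while stack:
        current = stack.pop()
        if current.id.startswith("en:"):
            found.setdefault(current.id, current)  # keep the first occurrence
        else:
            stack.extend(current.parents)  # keep climbing through non-English nodes
    return list(found.values())


wines = Node("en:wines-from-italy")
doc = Node("it:example-doc", [wines])
docg = Node("it:example-docg", [wines])
wine = Node("it:example-wine", [doc, docg])  # two paths to the same English ancestor

ancestors = get_unique_en_ancestors(wine)
print([a.id for a in ancestors])
```

Whether duplicates should be kept depends on the question being asked: keeping them weights an ancestor by the number of paths, while deduplicating counts each category once.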

In [6]:
print(categories["en:ricciarelli"].names)
print(categories["en:ricciarelli"].parents)

{'it': 'Ricciarelli', 'es': 'Ricciarelli', 'hr': 'Ricciarelli', 'xx': 'Ricciarelli', 'fr': 'Ricciarelli', 'en': 'Ricciarelli', 'de': 'Ricciarelli'}
[<TaxonomyNode en:biscuits>]


### French Categories

In [7]:
fr_nodes = [node for node in tax_nodes if node.id[:2] == 'fr']
fr_nodes[:10]

[<TaxonomyNode fr:givry-premier-cru-servoisine>,
 <TaxonomyNode fr:cotes-du-roussillon>,
 <TaxonomyNode fr:rhum-de-sucrerie-de-la-baie-du-galion>,
 <TaxonomyNode fr:piment-kabyle>,
 <TaxonomyNode fr:jambon-du-kintoa>,
 <TaxonomyNode fr:medoc>,
 <TaxonomyNode fr:chablis-premier-cru-les-lys>,
 <TaxonomyNode fr:jambons-de-mayence>,
 <TaxonomyNode fr:le-pepper-mont-jolien>,
 <TaxonomyNode fr:beaune-premier-cru-les-tuvilains-rouge>]

In [8]:
# find the closest English parents
fr_parents = [parent for node in fr_nodes for parent in get_en_ancestors(node)]
pd.Series(fr_parents).value_counts()

<TaxonomyNode en:burgundy-wines>            1390
<TaxonomyNode en:red-wines>                 1138
<TaxonomyNode en:white-wines>               1005
<TaxonomyNode en:wines-from-france>          372
<TaxonomyNode en:rose-wines>                  66
                                            ... 
<TaxonomyNode en:absinthium>                   1
<TaxonomyNode en:french-sparkling-wines>       1
<TaxonomyNode en:mediterannean-honeys>         1
<TaxonomyNode en:pine-honeys>                  1
<TaxonomyNode en:strawberry-compotes>          1
Name: count, Length: 449, dtype: int64

### Multi-parent Categories

Taxonomy entries can have multiple parents. This can indicate that one of the parents is a generic type.

In [9]:
mpar = [parent for node in tax_nodes for parent in node.parents if len(node.parents) > 1]
pd.Series(mpar).value_counts()

<TaxonomyNode en:white-wines>                113
<TaxonomyNode en:red-wines>                   93
<TaxonomyNode en:cow-cheeses>                 75
<TaxonomyNode en:burgundy-wines>              68
<TaxonomyNode en:french-cheeses>              55
                                            ... 
<TaxonomyNode en:marzipan>                     1
<TaxonomyNode en:dark-chocolate-biscuits>      1
<TaxonomyNode en:lentil-sprouts>               1
<TaxonomyNode en:spiced-flavoured-oils>        1
<TaxonomyNode en:vegetarian-nuggets>           1
Name: count, Length: 1378, dtype: int64

### Root Nodes

In [10]:
root_nodes = [node for node in tax_nodes if len(node.parents) == 0 and len(node.children) != 0]
print(f"{len(root_nodes)} root nodes with #children")
child_counts = pd.Series({node.id: len(node.children) for node in root_nodes})
child_counts.sort_values(ascending=False)

52 root nodes with #children


en:meals                                   95
en:desserts                                48
en:sandwiches                              40
en:meats-and-their-products                35
en:seafood                                 35
en:frozen-foods                            34
en:sweet-pies                              26
en:variety-packs                           23
en:condiments                              20
en:terrines                                18
en:breaded-products                        17
en:fats                                    16
en:dairies                                 16
en:baby-foods                              16
en:meat-alternatives                       14
en:food-additives                          12
en:cooking-helpers                         11
en:specific-products                        9
en:canned-foods                             8
en:broths                                   8
en:cocoa-and-its-products                   8
en:spreads                        

## Data Pre-Processing

In [11]:
categories["de:obstbrand"].names

{'hr': 'Voćna rakija',
 'de': 'Obstbrand',
 'fi': 'Hedelmäviina',
 'no': 'Obstler',
 'hu': 'Gyümölcspárlat',
 'en': 'Fruit brandy',
 'sv': 'Obstler'}

## Data Inspection

In [12]:
node = list(tax_nodes)[0]
print(node)

<TaxonomyNode sv:svensk-vodka>


In [13]:
node.get_parents_hierarchy()

[<TaxonomyNode en:vodka>,
 <TaxonomyNode en:eaux-de-vie>,
 <TaxonomyNode en:hard-liquors>,
 <TaxonomyNode en:distilled-beverages>,
 <TaxonomyNode en:alcoholic-beverages>,
 <TaxonomyNode en:beverages>,
 <TaxonomyNode en:beverages-and-beverages-preparations>]

In [14]:
node.names

{'sv': 'Svensk Vodka'}

In [15]:
node.parents[0].names

{'de': 'Wodkas',
 'lt': 'Vodka',
 'ro': 'Vodcă',
 'nl': 'Wodkas',
 'ja': 'ウォッカ',
 'pl': 'Wódka',
 'zh': '伏特加',
 'bg': 'Водка',
 'en': 'Vodka',
 'fr': 'Vodkas',
 'ru': 'Водка',
 'it': 'Vodka',
 'es': 'Vodkas',
 'hr': 'Votka',
 'fi': 'Votka'}

In [16]:
node.parents[0].parents

[<TaxonomyNode en:eaux-de-vie>]

In [17]:
node.parents[0].parents[0].names

{'de': 'Eau de vie',
 'hr': 'Eaux de vie',
 'it': 'Distillati aromatizzati alla frutta',
 'en': 'Eaux de vie',
 'fr': 'Eaux-de-vie',
 'nl': 'Vruchtendistillaten'}

In [18]:
node.parents[0].parents[0].parents

[<TaxonomyNode en:hard-liquors>]

## Create Taxonomy Graph

In [19]:
def get_node_label(node):
    if 'en' in node.names.keys():
        return f"en:{node.names['en']}"
    else:
        return f"{node.id[:3]}{node.names[node.id[:2]]}"

def build_taxonomy_graph(nodes):
    # Directed graph with edges pointing from the parent id to the child id.
    # Nodes without any parent or child never appear in an edge and are not added.
    G = nx.DiGraph()
    for node in nodes:
        for parent in node.parents:
            G.add_edge(parent.id, node.id)
    return G

In [20]:
G = build_taxonomy_graph(tax_nodes)

In [21]:
print(f"#Nodes: {len(G.nodes):,}")
print(f"#Edges: {len(G.edges):,}")

#Nodes: 13,384
#Edges: 15,689
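
The graph has 13,384 nodes, while the taxonomy reported 13,404 categories. Since `build_taxonomy_graph` only adds edges, one likely explanation is that categories with neither parents nor children never enter the graph. A minimal sketch of how such isolated categories could be listed, using a toy `parents_of` mapping (hypothetical IDs) in place of the real taxonomy nodes:

```python
# Toy stand-in: category id -> list of parent ids.
parents_of = {
    "en:snacks": [],
    "en:sweet-snacks": ["en:snacks"],
    "en:standalone-category": [],  # no parents and no children
}

# Collect every id that occurs in at least one parent -> child edge.
ids_in_edges = set()
for child, parents in parents_of.items():
    for parent in parents:
        ids_in_edges.update((parent, child))

# Ids that never occur in an edge would be missing from an edge-only graph.
isolated = sorted(set(parents_of) - ids_in_edges)
print(isolated)
```

With networkx, calling `G.add_node(node.id)` for every node before adding the edges would keep such categories in the graph.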


In [22]:
n = list(G.nodes)[2]
print(n)

en:chocolate-sprinkles


In [23]:
nx.descendants(G, n)

{'en:dark-chocolate-sprinkles', 'en:milk-chocolate-sprinkles'}

In [24]:
list(G.successors("de:obstbrand"))

['en:french-fruit-spirit',
 'en:somerset-cider-brandy',
 'en:croatian-fruit-spirit',
 'en:hungarian-fruit-spirit',
 'en:austrian-fruit-spirit',
 'en:german-fruit-spirit',
 'en:palinka',
 'en:bulgarian-fruit-spirit',
 'en:slovenian-fruit-spirit',
 'en:romanian-fruit-spirit',
 'en:italian-fruit-spirit']

## Root Nodes and Leaf Nodes

In [25]:
root_nodes = [node for node in G.nodes if G.in_degree(node) == 0]
leaf_nodes = [node for node in G.nodes if G.out_degree(node) == 0]

In [26]:
print("#root nodes:", len(root_nodes))
print("#leaf nodes:", len(leaf_nodes))

#root nodes: 52
#leaf nodes: 10715


### Inspect Root Nodes

Hypothesis: root nodes are too generic

In [27]:
def child_info(id: str):
    children_ids = [child.id for child in categories[id].children]
    return nodes_info(G, children_ids)

def nodes_info(G, nodes):
    counts = {node: [len(nx.descendants(G, node)), len(list(G.successors(node)))] for node in nodes}
    display(sorted(counts.items(), key=lambda x: x[1][0], reverse=True))    

nodes_info(G, root_nodes)

[('en:beverages-and-beverages-preparations', [4640, 2]),
 ('en:plant-based-foods-and-beverages', [3568, 4]),
 ('en:meats-and-their-products', [1249, 35]),
 ('en:dairies', [1024, 16]),
 ('en:fermented-foods', [826, 3]),
 ('en:snacks', [754, 5]),
 ('en:meals', [702, 95]),
 ('en:seafood', [617, 35]),
 ('en:condiments', [607, 20]),
 ('en:desserts', [483, 48]),
 ('en:spreads', [425, 7]),
 ('en:breakfasts', [353, 4]),
 ('en:fats', [333, 16]),
 ('en:frozen-foods', [328, 34]),
 ('en:canned-foods', [195, 8]),
 ('en:sweeteners', [183, 6]),
 ('en:dried-products', [166, 5]),
 ('en:cocoa-and-its-products', [162, 8]),
 ('en:farming-products', [129, 2]),
 ('en:bee-products', [115, 6]),
 ('en:sandwiches', [107, 40]),
 ('en:fresh-foods', [87, 3]),
 ('en:fish-and-meat-and-eggs', [83, 4]),
 ('en:baby-foods', [82, 16]),
 ('en:syrups', [61, 6]),
 ('en:festive-foods', [60, 4]),
 ('en:meat-alternatives', [59, 14]),
 ('en:food-additives', [47, 12]),
 ('en:sweet-pies', [47, 26]),
 ('en:breaded-products', [45, 

In [28]:
# 'en:fried-foods', 'en:caviar-substitutes'

child_info('en:two-crust-pies')

[('en:squid-and-spicy-tomato-sauce-pie', [0, 0]), ('en:mushroom-pies', [0, 0])]

In [29]:
def show_child_count(G, nodes):
    counts = {node: len(list(G.successors(node))) for node in nodes}
    display(list(sort_by_count(counts).items()))

def show_desc_count(G, nodes):
    counts = {node: len(nx.descendants(G, node)) for node in nodes}
    display(list(sort_by_count(counts).items()))

def sort_by_count(d: dict):
    return dict(sorted(d.items(), key=lambda x: x[1], reverse=True))

In [30]:
show_child_count(G, root_nodes)

[('en:meals', 95),
 ('en:desserts', 48),
 ('en:sandwiches', 40),
 ('en:meats-and-their-products', 35),
 ('en:seafood', 35),
 ('en:frozen-foods', 34),
 ('en:sweet-pies', 26),
 ('en:variety-packs', 23),
 ('en:condiments', 20),
 ('en:terrines', 18),
 ('en:breaded-products', 17),
 ('en:fats', 16),
 ('en:dairies', 16),
 ('en:baby-foods', 16),
 ('en:meat-alternatives', 14),
 ('en:food-additives', 12),
 ('en:cooking-helpers', 11),
 ('en:specific-products', 9),
 ('en:broths', 8),
 ('en:canned-foods', 8),
 ('en:cocoa-and-its-products', 8),
 ('en:spreads', 7),
 ('en:bee-products', 6),
 ('en:dietary-supplements', 6),
 ('en:sweeteners', 6),
 ('en:artisan-products', 6),
 ('en:syrups', 6),
 ('en:meal-kits', 5),
 ('en:snacks', 5),
 ('en:dried-products', 5),
 ('en:chips-and-fries', 5),
 ('en:breakfasts', 4),
 ('en:mountain-products', 4),
 ('en:crepes-and-galettes', 4),
 ('en:skewers', 4),
 ('en:fish-and-meat-and-eggs', 4),
 ('en:capsules', 4),
 ('en:non-food-products', 4),
 ('en:plant-based-foods-and-

In [31]:
root_counts = {node: len(nx.descendants(G, node)) for node in root_nodes}
list(sort_by_count(root_counts).items())

[('en:beverages-and-beverages-preparations', 4640),
 ('en:plant-based-foods-and-beverages', 3568),
 ('en:meats-and-their-products', 1249),
 ('en:dairies', 1024),
 ('en:fermented-foods', 826),
 ('en:snacks', 754),
 ('en:meals', 702),
 ('en:seafood', 617),
 ('en:condiments', 607),
 ('en:desserts', 483),
 ('en:spreads', 425),
 ('en:breakfasts', 353),
 ('en:fats', 333),
 ('en:frozen-foods', 328),
 ('en:canned-foods', 195),
 ('en:sweeteners', 183),
 ('en:dried-products', 166),
 ('en:cocoa-and-its-products', 162),
 ('en:farming-products', 129),
 ('en:bee-products', 115),
 ('en:sandwiches', 107),
 ('en:fresh-foods', 87),
 ('en:fish-and-meat-and-eggs', 83),
 ('en:baby-foods', 82),
 ('en:syrups', 61),
 ('en:festive-foods', 60),
 ('en:meat-alternatives', 59),
 ('en:food-additives', 47),
 ('en:sweet-pies', 47),
 ('en:breaded-products', 45),
 ('en:chips-and-fries', 45),
 ('en:specific-products', 38),
 ('en:cooking-helpers', 38),
 ('en:terrines', 29),
 ('en:broths', 29),
 ('en:non-food-products', 2

In [32]:
categories['en:two-crust-pies'].children

[<TaxonomyNode en:squid-and-spicy-tomato-sauce-pie>,
 <TaxonomyNode en:mushroom-pies>]

In [33]:
print(leaf_nodes[:20])

['sv:svensk-vodka', 'en:milk-chocolate-sprinkles', 'en:haitian-rums', 'en:nougat-ice-cream-tubs', 'en:smoked-chicken-breast', 'en:sorbets-on-stick', 'en:soft-ripened-round-cheese-with-bloomy-rind-5-to-11-fat', 'en:still-soft-drink-with-tea-extract-with-sugar-and-artificial-sweetener-s', 'en:tatin-tart', 'en:baker-s-yeast', 'en:cantaloupe-melon-pulp', 'de:pfälzer', 'en:rabbit-fresh-meat', 'en:poultry-ham-in-cube', 'fr:rhum-de-sucrerie-de-la-baie-du-galion', 'en:filled-fritter-garnished-with-shrimps-and-vegetables-and-poultry-and-meat', 'en:little-millet', 'de:badischer', 'bg:любимец', 'nl:groentemengsels-voor-spaghetti-en-macaroni']


## Nodes with most descendants

In [34]:
def count_descendants(G: nx.DiGraph):
    all_counts = {}
    for node in G.nodes:
        all_counts[node] = len(nx.descendants(G, node))
    return all_counts


In [35]:
counts = count_descendants(G)

In [36]:
sorted_counts = sort_by_count(counts)

In [37]:
list(sorted_counts.items())[:20]

[('en:beverages-and-beverages-preparations', 4640),
 ('en:beverages', 4508),
 ('en:alcoholic-beverages', 4092),
 ('en:plant-based-foods-and-beverages', 3568),
 ('en:wines', 3567),
 ('en:plant-based-foods', 3325),
 ('en:wines-from-france', 1871),
 ('en:burgundy-wines', 1436),
 ('en:meats-and-their-products', 1249),
 ('en:red-wines', 1124),
 ('en:fruits-and-vegetables-based-foods', 1080),
 ('en:white-wines', 1037),
 ('en:dairies', 1024),
 ('en:cereals-and-potatoes', 861),
 ('en:fermented-foods', 826),
 ('en:wines-from-italy', 820),
 ('en:fermented-milk-products', 800),
 ('en:snacks', 754),
 ('en:meals', 702),
 ('en:sweet-snacks', 652)]
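
Calling `nx.descendants` once per node repeats a traversal for every node. On a directed acyclic graph, the descendant sets can instead be built recursively and cached. A minimal dependency-free sketch on a toy hierarchy (the `children_of` mapping, its IDs and the `descendant_counts` helper are illustrative, and the approach assumes the taxonomy contains no cycles). Sets are used rather than summed counts because a node with several parents must be counted only once per ancestor.

```python
from functools import lru_cache

# Toy stand-in: category id -> list of child ids.
children_of = {
    "en:beverages": ["en:wines", "en:red-beverages"],
    "en:wines": ["en:red-wines", "en:white-wines"],
    "en:red-beverages": ["en:red-wines"],  # en:red-wines has two parents
    "en:red-wines": [],
    "en:white-wines": [],
}


@lru_cache(maxsize=None)
def descendants(node: str) -> frozenset:
    """All nodes reachable from `node`, computed once per node and cached."""
    result = set()
    for child in children_of.get(node, []):
        result.add(child)
        result |= descendants(child)
    return frozenset(result)


def descendant_counts(nodes):
    return {node: len(descendants(node)) for node in nodes}


counts = descendant_counts(children_of)
print(sorted(counts.items(), key=lambda x: x[1], reverse=True))
```

This trades memory for time, since one descendant set is kept per node; for a taxonomy of this size either approach should be workable.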

## Nodes with most children

In [38]:
child_count = {node: G.out_degree(node) for node in G.nodes}

In [39]:
sorted_child_count = sort_by_count(child_count)

In [40]:
list(sorted_child_count.items())[:30]

[('en:wines-from-italy', 563),
 ('en:wines-from-france', 279),
 ('en:wines-from-spain', 146),
 ('fr:chassagne-montrachet', 144),
 ('en:wines-from-greece', 129),
 ('fr:nuits-saint-georges', 129),
 ('fr:beaune', 121),
 ('en:white-wines', 115),
 ('en:red-wines', 100),
 ('en:fishes', 96),
 ('en:meals', 95),
 ('en:burgundy-wines', 95),
 ('en:pastas', 94),
 ('en:sauces', 91),
 ('en:cow-cheeses', 90),
 ('en:cheeses', 85),
 ('en:wines-from-portugal', 75),
 ('fr:saint-aubin', 72),
 ('fr:savigny-les-beaune', 71),
 ('en:wines-from-germany', 68),
 ('en:french-cheeses', 60),
 ('fr:morey-saint-denis', 57),
 ('en:wines-from-bulgaria', 56),
 ('en:poultries', 53),
 ('en:italian-cheeses', 53),
 ('en:plant-based-foods', 52),
 ('fr:puligny-montrachet', 52),
 ('en:vegetables', 52),
 ('fr:alsace-grand-cru', 51),
 ('en:chablis', 50)]