# Course 2: Project - Data cleaning

This notebook is concerned with cleaning the [Open Food Facts](https://www.kaggle.com/openfoodfacts/world-food-facts) dataset (version 5), downloaded from Kaggle. The dataset originates from https://world.openfoodfacts.org/data. A description of the fields is available at https://static.openfoodfacts.org/data/data-fields.txt.

**Contents:**
* [Imports](#imports)
* [Preparatives](#preparatives)
* [A. Importing, cleaning and numerical summaries](#task-a)
  * [A.1. Column group: General information](#task-a1-general-information)
  * [A.1. Column group: Tags](#task-a1-tags)
  * [A.1. Column group: Ingredients](#task-a1-ingredients)

## Imports<a name="imports"></a>
---

In [5]:
# Standard library:
import collections
import enum
import functools
import pathlib
import re
import typing as t

# 3rd party:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Project:
import cleanquantity
import ean

%matplotlib inline

## Preparatives<a name="preparatives"></a>
---

This section groups utility functions, etc. that we will use later in this notebook.

### Utilities

In [6]:
@functools.wraps(display)  # nicer for interactive use
def display_allcols(*args, **kwargs):
    """Behaves exactly like ``display`` but in a context where Pandas displays all columns."""
    with pd.option_context('display.max_columns', None):
        display(*args, **kwargs)
        
        
@functools.wraps(display)  # nicer for interactive use
def display_allcols_notrunc(*args, **kwargs):
    """Behaves exactly like ``display`` but in a context where Pandas displays all columns with no truncation."""
    with pd.option_context('display.max_columns', None, 'display.max_colwidth', -1):
        display(*args, **kwargs)


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarizes each column: Python types of the values, NA count and NA percentage."""

    def get_type_name(obj: t.Any) -> str:
        return type(obj).__name__
        
    types_ = [df[col].map(get_type_name).value_counts().to_dict() for col in df.columns]
        
    data = {
        'Types': types_,
        'NA Count': df.apply(lambda series: series.isna().sum()),
        'NA %': df.apply(lambda series: series.isna().mean() * 100.0)
    }
    return pd.DataFrame(data).transpose()
    

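def parse_t(series: pd.Series) -> pd.Series:
    """Parses Unix timestamps (seconds since the epoch) into UTC timestamps."""
    return pd.to_datetime(series, utc=True, unit='s')


def parse_datetime(series: pd.Series) -> pd.Series:
    """Parses strings such as ``2016-09-17T09:17:46Z`` into timestamps."""
    return pd.to_datetime(series, format='%Y-%m-%dT%H:%M:%S%z')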


def move_after(words: t.List[str], word: str, word_to_move: str) -> t.List[str]:
    """Utility function to re-order columns."""
    try:
        word_idx = words.index(word)
        word_to_move_idx = words.index(word_to_move)
    except ValueError:
        pass
    else:
        if word_idx < word_to_move_idx:
            words.pop(word_to_move_idx)
            words.insert(word_idx + 1, word_to_move)
        else:
            words.insert(word_idx + 1, word_to_move)
            words.pop(word_to_move_idx)
    return words

### EAN-13/EAN-8/UPC-A

As per the description of the field `code`: For products with a barcode, this is the barcode of the product (EAN-13 code or some internal code assigned by the store). For products without a barcode, Open Food Facts assigns a number starting with the 200 reserved prefix. We implement utility functions to check whether a given code is a valid EAN-13/EAN-8/UPC-A code.

**Note:** In order not to clutter the notebook the code is in a separate module - *ean.py*.
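
To give an idea of what such a check involves, here is a minimal sketch of the standard GS1 check-digit test that EAN-13, EAN-8 and UPC-A share. The function name `is_valid_gtin` is hypothetical and this is not the content of *ean.py*, which may well be implemented differently.

```python
def is_valid_gtin(code: str) -> bool:
    """Check the GS1 check digit of an EAN-13, EAN-8 or UPC-A code (sketch)."""
    # Only 8, 12 or 13 ASCII digits can form one of the three supported codes.
    if len(code) not in (8, 12, 13) or not (code.isascii() and code.isdigit()):
        return False
    # Walking from the right, the check digit has weight 1, its neighbour 3,
    # then 1, 3, ... A code is valid when the weighted sum is a multiple of 10.
    total = sum(int(digit) * (3 if i % 2 else 1)
                for i, digit in enumerate(reversed(code)))
    return total % 10 == 0


print(is_valid_gtin('4006381333931'))  # a commonly quoted EAN-13 example
print(is_valid_gtin('4006381333932'))  # same code with a wrong check digit
```

Right-aligning the weights is what lets a single function handle all three code lengths.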

## A. Importing, cleaning and numerical summaries<a name="task-a"></a>
---

Since we are not familiar with the dataset and have been warned that it is quite messy, we first let Pandas read the TSV file entirely into memory and guess the type of each column. At the end of this notebook, we will export a cleaned-up file that Pandas will be able to read more efficiently (e.g. by specifying the type of each column upfront).

In [8]:
data_filename = pathlib.Path.cwd().joinpath('en.openfoodfacts.org.products.tsv')

In [4]:
df = pd.read_csv(data_filename, sep='\t', low_memory=False)

We first get some general information about the `DataFrame`...

In [5]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} data points with {ncols} features')

the dataset contains 356027 data points with 163 features


**Note:** It turns out that reading the TSV file this way is problematic (at least on macOS) since 26 lines contain a carriage return. We noticed this by focusing on the first row where `code` was NaN and looking at the line corresponding to the previous row directly in the TSV file:
```bash
sed -n '193909 l' ./en.openfoodfacts.org.products.tsv
(...)fr-32-464-040-ec\t43.400279,0.199525\r\t\tvillecomtal-sur-arros-gers-france(...)
                                         ^^
```
26 data points are a negligible fraction of all data points, and we could have dropped the rows, but it turns out that there is an even simpler solution:

In [9]:
df = pd.read_csv(data_filename, sep='\t', lineterminator='\n', low_memory=False)

In [10]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} data points with {ncols} features')

the dataset contains 356001 data points with 163 features
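
The number of affected lines can also be checked independently of Pandas. The following is a minimal sketch, assuming the file uses a line feed as its line terminator; the helper name `count_lines_with_cr` is hypothetical.

```python
import pathlib


def count_lines_with_cr(path: pathlib.Path) -> int:
    """Count the LF-terminated lines that contain a carriage return."""
    count = 0
    # Binary mode disables newline translation, and iterating over a binary
    # file splits on b'\n' only, so a stray b'\r' stays inside its line.
    with open(path, 'rb') as f:
        for line in f:
            if b'\r' in line:
                count += 1
    return count
```

Calling `count_lines_with_cr(data_filename)` should then report the lines that the default parser splits in two.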


Having taken care of this, we look at the first 5 rows:

In [11]:
display_allcols(df.head())

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,serving_size,no_nutriments,additives_n,additives,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutrition_grade_uk,nutrition_grade_fr,pnns_groups_1,pnns_groups_2,states,states_tags,states_en,main_category,main_category_en,image_url,image_small_url,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,-butyric-acid_100g,-caproic-acid_100g,-caprylic-acid_100g,-capric-acid_100g,-lauric-acid_100g,-myristic-acid_100g,-palmitic-acid_100g,-stearic-acid_100g,-arachidic-acid_100g,-behenic-acid_100g,-lignoceric-acid_100g,-cerotic-acid_100g,-montanic-acid_100g,-melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,-alpha-linolenic-acid_100g,-eicosapentaenoic-acid_100g,-docosahexaenoic-acid_100g,omega-6-fat_100g,-linoleic-acid_100g,-arachidonic-acid_100g,-gamma-linolenic-acid_100g,-dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,-oleic-acid_100g,-elaidic-acid_100g,-gondoic-acid_100g,-mead-acid_100g,-erucic-acid_100g,-nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,-sucrose_100g,-glucose_100g,-fructose_100g,-lactose_100g,-maltose_100g,-maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
0,3087,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1474103866,2016-09-17T09:17:46Z,1474103893,2016-09-17T09:18:13Z,Farine de blé noir,,1kg,,,Ferme t'y R'nao,ferme-t-y-r-nao,,,,,,,,,,,,,,,,,,en:FR,en:france,France,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,4530,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Banana Chips Sweetened (Whole),,,,,,,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Bananas, vegetable oil (coconut oil, corn oil ...",,,,,,28 g (1 ONZ),,0.0,[ bananas -> en:bananas ] [ vegetable-oil -...,,,0.0,,,0.0,,,,d,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,2243.0,,28.57,28.57,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.018,64.29,14.29,,,,,,,,,3.6,3.57,,,,0.0,0.0,,0.0,,,,,0.0214,,,,,,,,,,,,,,0.0,,0.00129,,,,,,,,,,,,,,,,,,,14.0,14.0,,
2,4559,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Peanuts,,,,,Torn & Glasser,torn-glasser,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Peanuts, wheat flour, sugar, rice flour, tapio...",,,,,,28 g (0.25 cup),,0.0,[ peanuts -> en:peanuts ] [ wheat-flour -> ...,,,0.0,,,0.0,,,,b,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,1941.0,,17.86,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,60.71,17.86,,,,,,,,,7.1,17.86,,,,0.635,0.25,,0.0,,,,,0.0,,,,,,,,,,,,,,0.071,,0.00129,,,,,,,,,,,,,,,,,,,0.0,0.0,,
3,16087,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055731,2017-03-09T10:35:31Z,1489055731,2017-03-09T10:35:31Z,Organic Salted Nut Mix,,,,,Grizzlies,grizzlies,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Organic hazelnuts, organic cashews, organic wa...",,,,,,28 g (0.25 cup),,0.0,[ organic-hazelnuts -> en:organic-hazelnuts ...,,,0.0,,,0.0,,,,d,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,2540.0,,57.14,5.36,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,17.86,3.57,,,,,,,,,7.1,17.86,,,,1.22428,0.482,,,,,,,,,,,,,,,,,,,,,0.143,,0.00514,,,,,,,,,,,,,,,,,,,12.0,12.0,,
4,16094,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055653,2017-03-09T10:34:13Z,1489055653,2017-03-09T10:34:13Z,Organic Polenta,,,,,Bob's Red Mill,bob-s-red-mill,,,,,,,,,,,,,,,,,,US,en:united-states,United States,Organic polenta,,,,,,35 g (0.25 cup),,0.0,[ organic-polenta -> en:organic-polenta ] [...,,,0.0,,,0.0,,,,,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,1552.0,,1.43,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,77.14,,,,,,,,,,5.7,8.57,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


For each column, we briefly look at the type guessed by Pandas and the number of non-null values:

In [9]:
df.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356001 entries, 0 to 356000
Data columns (total 163 columns):
code                                          356001 non-null object
url                                           356001 non-null object
creator                                       355998 non-null object
created_t                                     356001 non-null int64
created_datetime                              356000 non-null object
last_modified_t                               356001 non-null int64
last_modified_datetime                        356001 non-null object
product_name                                  338489 non-null object
generic_name                                  57688 non-null object
quantity                                      119262 non-null object
packaging                                     89959 non-null object
packaging_tags                                89959 non-null object
brands                                        326977 non-null obj

**Comment:** We notice that 20+ columns contain only NA values.

**Decision:** Drop these columns.

In [10]:
df = df.dropna(how='all', axis=1)

In [11]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} data points with {ncols} features')

the dataset contains 356001 data points with 142 features


### A.1. Column group: General information<a name="task-a1-general-information"></a>

#### Column: `code`

In [12]:
profile(df[['code']])

Unnamed: 0,code
Types,{'str': 356001}
NA Count,0
NA %,0


**Comment:** The column `code` has type `object`, all values are of type `str` and there are no NA values.

**Decision:** Keep the column and all rows.

Out of curiosity, we check how many codes belong to the following categories: `e` - valid EAN-13/EAN-8/UPC-A code, `a` - code assigned by Open Food Facts (prefix 200) and `i` - internal code (store). (Mistyped EAN-13/EAN-8/UPC-A codes will be incorrectly classified as internal codes but we will not pursue this further.) We see that the majority of all codes (over 85 %) are valid EAN-13/EAN-8/UPC-A codes:

In [13]:
def categorize(code: str) -> str:
    return ('e' if ean.is_valid(code) else
            'a' if code.startswith('200') else
            'i')


df['code'].map(categorize).value_counts()

e    313911
i     40340
a      1750
Name: code, dtype: int64

#### Column: `url`

In [14]:
profile(df[['url']])

Unnamed: 0,url
Types,{'str': 356001}
NA Count,0
NA %,0


**Comment:** The column `url` has type `object`, all values are of type `str` and there are no NA values.

**Decision:** Keep the column and all rows.

#### Column: `creator`

In [15]:
profile(df[['creator']])

Unnamed: 0,creator
Types,"{'str': 355998, 'float': 3}"
NA Count,3
NA %,0.000842694


**Comment:** The column `creator` has type `object`, non-NA values are of type `str` and there is a negligible number of NA values.

**Decision:** Keep the column and all rows.

#### Columns: `created_(t,datetime)`, `last_modified_(t,datetime)`

In [16]:
profile(df[['created_t', 'created_datetime', 'last_modified_t', 'last_modified_datetime']])

Unnamed: 0,created_t,created_datetime,last_modified_t,last_modified_datetime
Types,{'int': 356001},"{'str': 356000, 'float': 1}",{'int': 356001},{'str': 356001}
NA Count,0,1,0,0
NA %,0,0.000280898,0,0


**Comment:** For the `created_t`/`created_datetime` pair: The column `created_t` has type `int64`, all values are of type `int` and there are no NA values. The column `created_datetime` has type `object`, non-NA values are of type `str` and there is 1 NA value. We confirm that these columns agree where both are not NA:

In [17]:
s1 = parse_t(df['created_t'])
s2 = parse_datetime(df['created_datetime'])
both_notna = s1.notna() & s2.notna()
(s1[both_notna] == s2[both_notna]).all()

True
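
The same agreement can be illustrated on a single record with the standard library only. This is a small sketch using the values of the first row displayed earlier; it is not part of the cleaning pipeline.

```python
from datetime import datetime, timezone

# Values copied from the first row of the dataset.
created_t = 1474103866
created_datetime = '2016-09-17T09:17:46Z'

# Interpret the integer as seconds since the Unix epoch, in UTC.
from_epoch = datetime.fromtimestamp(created_t, tz=timezone.utc)
# The '%z' directive accepts a literal 'Z' (UTC) since Python 3.7.
from_iso = datetime.strptime(created_datetime, '%Y-%m-%dT%H:%M:%S%z')

print(from_epoch == from_iso)  # True
```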

**Decision:** Use the column `created_t` to generate a column `created_on` (type `Timestamp`) and drop the columns `created_t` and `created_datetime`:

In [18]:
df['created_on'] = s1
df = df.drop(columns=['created_t', 'created_datetime'])

**Comment (cont.):** For the `last_modified_t`/`last_modified_datetime` pair: The column `last_modified_t` has type `int64`, all values are of type `int` and there are no NA values. The column `last_modified_datetime` has type `object`, all values are of type `str` and there are no NA values. We confirm that these columns agree where both are not NA:

In [19]:
s1 = parse_t(df['last_modified_t'])
s2 = parse_datetime(df['last_modified_datetime'])
both_notna = s1.notna() & s2.notna()
(s1[both_notna] == s2[both_notna]).all()

True

**Decision (cont.):** Use the column `last_modified_t` to generate a column `last_modified_on` (type `Timestamp`) and drop the columns `last_modified_t` and `last_modified_datetime`:

In [20]:
df['last_modified_on'] = s1
df = df.drop(columns=['last_modified_t', 'last_modified_datetime'])

#### Column: `product_name`

In [21]:
profile(df[['product_name']])

Unnamed: 0,product_name
Types,"{'str': 338489, 'float': 17512}"
NA Count,17512
NA %,4.91909


**Comment:** The column `product_name` has type `object`, non-NA values are of type `str` and just under 5 % of the values are NA.

**Decision:** Keep the column and all rows.

#### Column: `generic_name`

In [22]:
profile(df[['generic_name']])

Unnamed: 0,generic_name
Types,"{'float': 298313, 'str': 57688}"
NA Count,298313
NA %,83.7956


**Comment:** The column `generic_name` has type `object`, non-NA values are of type `str` and over 80 % of the values are NA. Inspecting a couple of records by hand, we notice that the language seems to vary quite a lot. Perhaps tellingly, this column is not even documented.

**Decision:** Drop the column.

In [23]:
df = df.drop(columns='generic_name')

#### Column: `quantity`

In [24]:
profile(df[['quantity']])

Unnamed: 0,quantity
Types,"{'float': 236739, 'str': 119262}"
NA Count,236739
NA %,66.4995


**Comment:** The column `quantity` has type `object`, non-NA values are of type `str` and over 65 % of the values are NA. Inspecting a couple of entries by hand, we notice that, in most cases, the column indicates the quantity sold at once and the unit of measurement used. We also notice at least the following issues: a) Incorrect/incomplete entries (e.g. price in euros, product name, unitless quantity, etc.). b) Multiple languages are used (e.g. _320 г_ seems to mean 320 g in Russian). c) Mixed metric and imperial units.

**Decision:** We try to salvage as many of the non-NA values as possible as it might be interesting to know the quantity of product sold at once.

In order to keep complexity under control, we make the following assumptions:
* An entry must be a "valid number" followed by a "valid unit". White spaces are allowed and ignored. Additional information at the end is allowed and ignored. Letter case is ignored.
* A "valid number" is any string that matches `r'\d+(?:[.,]\d*)?'`.
* A "valid unit" is any string in `VALID_UNITS` (see code).
* Since imperial units differ between UK, US and USC, we decide to *ignore* those (see e.g. [How US labelling requirements undermine honest labelling in the UK](http://metricviews.org.uk/2013/03/how-us-labelling-requirements-undermine-honest-labelling-in-the-uk/).)

**Desired output:** 2 columns: `quantity_number` (type `float`) and `quantity_unit` (type `str`, either `g` or `l`). In the process, we convert all weights to grams and all volumes to liters.

**Note:** In order not to clutter the notebook, most of the code is in a separate module - _cleanquantity.py_.
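
To make the assumptions above concrete, here is a minimal sketch of how a single entry could be parsed. It is not the content of _cleanquantity.py_: the unit table `UNIT_FACTORS`, the pattern `QUANTITY_RE` and the function `parse_quantity` are hypothetical names, the list of units is only an example, and the rule that a unit must not be followed by another letter is an extra assumption of this sketch.

```python
import re
import typing as t

# Hypothetical unit table (NOT the notebook's VALID_UNITS):
# unit -> (target unit, conversion factor).
UNIT_FACTORS: t.Dict[str, t.Tuple[str, float]] = {
    'mg': ('g', 0.001), 'g': ('g', 1.0), 'kg': ('g', 1000.0),
    'ml': ('l', 0.001), 'cl': ('l', 0.01), 'dl': ('l', 0.1), 'l': ('l', 1.0),
}

# A "valid number", optional whitespace, then a unit that is not followed by
# another letter (so that "1 lemon" is not read as 1 l).
_UNITS = '|'.join(sorted(UNIT_FACTORS, key=len, reverse=True))
QUANTITY_RE = re.compile(
    rf'\s*(\d+(?:[.,]\d*)?)\s*({_UNITS})(?![^\W\d_])', re.IGNORECASE)


def parse_quantity(text: str) -> t.Optional[t.Tuple[float, str]]:
    """Return (number, unit) in grams or liters, or None if the entry is rejected."""
    match = QUANTITY_RE.match(text)
    if match is None:
        return None
    number = float(match.group(1).replace(',', '.'))  # accept a decimal comma
    unit, factor = UNIT_FACTORS[match.group(2).lower()]
    return number * factor, unit


for entry in ('1kg', ' 1,5 L x 6', '320 г', '2 €'):
    print(repr(entry), '->', parse_quantity(entry))
```

Entries in an unsupported language or without a recognized unit simply yield `None`, which corresponds to an NA value in the cleaned columns.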

In [25]:
ninitial = df['quantity'].notna().sum()

df_qty = cleanquantity.clean(df['quantity'])
df['quantity_number'] = df_qty['number']
df['quantity_unit'] = df_qty['unit']
df = df.drop(columns='quantity')

nstandardized = df_qty['number'].notna().sum()
pstandardized = nstandardized / ninitial * 100

print(f"entries initial: {ninitial}")
print(f"entries standardized: {nstandardized} ({pstandardized:.2f} % of not-NA)")

entries initial: 119262
entries standardized: 106470 (89.27 % of not-NA)


We check the result for outliers:

In [26]:
(df[['quantity_number', 'quantity_unit']]
 .groupby(by='quantity_unit')
 .describe())

Unnamed: 0_level_0,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
quantity_unit,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
g,86658.0,403.239589,8037.160368,0.0,150.0,250.0,420.0,1390000.0
l,19812.0,0.809454,3.336295,0.0,0.35,0.75,1.0,450.0


**Comment (cont.):** Weight: Inspecting a couple of entries above 10 kg by hand (22 entries), we notice the following: a) At least one record describes food for *animals*, not for humans (code: 289259, categories: *aliment pour chevaux* (*food for horses*)). b) Some values do make sense (e.g. a 25 kg bag of flour for a baker) while others do not. Volume: Inspecting a couple of records above 12 liters by hand (12 entries), we notice that some values do make sense (e.g. a 20 l barrel of wine) while others do not.

**Decision (cont.):** We have a negligible number of outliers. Weight: Replace a weight of 0 by NA and drop records above 10 kg, except when `product_name` contains the word _farine_ (_flour_). Volume: Replace 0 by NA and drop records above 12 l, except when `product_name` contains the word _tonneau_ (_barrel_).

In [27]:
df.loc[df['quantity_number'] == 0, ['quantity_number', 'quantity_unit']] = np.nan

weight_cond = ((df['quantity_unit'] == 'g')
          & (df['quantity_number'] > 10_000)
          & ~df['product_name'].str.contains('farine', case=False, na=False, regex=False))

volume_cond = ((df['quantity_unit'] == 'l')
          & (df['quantity_number'] > 12)
          & ~df['product_name'].str.contains('tonneau', case=False, na=False, regex=False))

df = df.drop(df[weight_cond | volume_cond].index, axis=0)

In [28]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} data points with {ncols} features')

the dataset contains 355971 data points with 140 features


### A.1. Column group: Tags<a name="task-a1-tags"></a>

#### Column: `first_packaging_code_geo`

In [29]:
profile(df[['first_packaging_code_geo']])

Unnamed: 0,first_packaging_code_geo
Types,"{'float': 335103, 'str': 20868}"
NA Count,335103
NA %,94.1377


**Comment:** The column `first_packaging_code_geo` has type `object`, non-NA values are of type `str` and almost 95 % of the values are NA. We confirm that all non-NA entries are valid coordinates:

In [30]:
def are_valid_coordinates(text: str) -> bool:
    """\
    Checks whether a string contains valid geographic coordinates.
    """
    if pd.isna(text):
        return False  # NA is not a valid pair of coordinates; keeps the result a plain bool
    
    tokens = text.split(',')
    if len(tokens) != 2:
        return False
    lat_token, lon_token = tokens
    try:
        lat = float(lat_token)
        lon = float(lon_token)
    except ValueError:
        return False
    if not (-90.0 <= lat <= 90.0):
        return False
    if not (-180.0 <= lon <= 180.0):
        return False
    return True


is_valid = df['first_packaging_code_geo'].map(are_valid_coordinates)
df['first_packaging_code_geo'].notna().sum() == is_valid.sum()

True

**Decision:** Keep the column and all rows.

### A.1. Column group: Ingredients<a name="task-a1-ingredients"></a>

In order to keep complexity under control, we make the following assumptions about a "properly formatted" list of ingredients:
* It is a delimited list of ingredients.
* The class of delimiters is `[.,;•]` (we do not support abbreviated words ending with a period).
* An ingredient is a sequence of words that may contain white spaces before, between and after the words.
* A word is either a sequence of symbols that are neither delimiters nor parentheses, or a "number".
* Each ingredient is either a "simple" ingredient or a "composite" ingredient.
* A "composite" ingredient is followed by its own list of ingredients ("simple" or "composite") inside a pair of matching parentheses.
* Matching pairs of parentheses are `(` / `)` and `[` / `]`.

Recovery strategy:
* Use "panic mode" for the lexer.
```
# Words:
<space> ::= re('\s+') .
<delim> ::= '.' | ',' | ';' | '•' .
<l-par> ::= '(' | '[' .
<r-par> ::= ')' | ']' .

# Syntax:
<list> ::= <ingr> { <delim> <ingr> }* .
<ingr> ::= <field> [ <l-par> <list> <r-par> ] .
```
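
The grammar above can be turned into a small recursive-descent parser. The following is a simplified sketch, not the parser used in the rest of this section: the names `sketch_tokenize` and `parse_ingredients` are hypothetical, numbers get no special treatment, opening and closing parentheses are not checked to be of the same kind, and error recovery simply skips unexpected tokens.

```python
import re
import typing as t

# One alternative per token class; a field is any run of characters that are
# neither delimiters nor parentheses.
TOKEN_RE = re.compile(
    r'(?P<delim>[.,;•])|(?P<lpar>[(\[])|(?P<rpar>[)\]])|(?P<field>[^.,;•()\[\]]+)')


def sketch_tokenize(text: str) -> t.List[t.Tuple[str, str]]:
    """Split the text into (kind, value) pairs, dropping whitespace-only fields."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group().strip()
        if kind == 'field' and not value:
            continue
        tokens.append((kind, value))
    return tokens


def _parse_list(tokens, pos):
    """<list> ::= <ingr> { <delim> <ingr> }* ; returns (items, next position)."""
    items = []
    while pos < len(tokens):
        kind, value = tokens[pos]
        if kind == 'field':
            pos += 1
            children = []
            # <ingr> ::= <field> [ <l-par> <list> <r-par> ]
            if pos < len(tokens) and tokens[pos][0] == 'lpar':
                children, pos = _parse_list(tokens, pos + 1)
                if pos < len(tokens) and tokens[pos][0] == 'rpar':
                    pos += 1  # consume the closing parenthesis
            items.append((value, children))
        elif kind == 'rpar':
            break  # let the caller consume the closing parenthesis
        else:
            pos += 1  # delimiter or stray opening parenthesis: skip it
    return items, pos


def parse_ingredients(text: str) -> t.List[t.Tuple[str, list]]:
    """Return a nested list of (ingredient, sub-ingredients) pairs."""
    tokens = sketch_tokenize(text)
    items, pos = [], 0
    while pos < len(tokens):
        more, pos = _parse_list(tokens, pos)
        items.extend(more)
        pos += 1  # skip an unmatched closing parenthesis, if any
    return items


print(parse_ingredients('Sugar, chocolate (cocoa, milk), salt'))
```

A "composite" ingredient is then simply one whose list of sub-ingredients is not empty.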

In [13]:
display_allcols(df.loc[[355945, 355948, 355951, 355952, 355953, 355954], :])

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,serving_size,no_nutriments,additives_n,additives,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutrition_grade_uk,nutrition_grade_fr,pnns_groups_1,pnns_groups_2,states,states_tags,states_en,main_category,main_category_en,image_url,image_small_url,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,-butyric-acid_100g,-caproic-acid_100g,-caprylic-acid_100g,-capric-acid_100g,-lauric-acid_100g,-myristic-acid_100g,-palmitic-acid_100g,-stearic-acid_100g,-arachidic-acid_100g,-behenic-acid_100g,-lignoceric-acid_100g,-cerotic-acid_100g,-montanic-acid_100g,-melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,-alpha-linolenic-acid_100g,-eicosapentaenoic-acid_100g,-docosahexaenoic-acid_100g,omega-6-fat_100g,-linoleic-acid_100g,-arachidonic-acid_100g,-gamma-linolenic-acid_100g,-dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,-oleic-acid_100g,-elaidic-acid_100g,-gondoic-acid_100g,-mead-acid_100g,-erucic-acid_100g,-nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,-sucrose_100g,-glucose_100g,-fructose_100g,-lactose_100g,-maltose_100g,-maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
355945,9782091778181,http://world-en.openfoodfacts.org/product/9782...,kiliweb,1496600202,2017-06-04T18:16:42Z,1496600206,2017-06-04T18:16:46Z,La pratique du vocabulaire allemand,,,,,Nathan,nathan,,,,,,,,,,,,,,,,,,France,en:france,France,I • La grammajre françatse La communication or...,,,,,,,,0.0,[ i -> fr:i ] [ la-grammajre-francatse-la-c...,,,0.0,,,0.0,,,,e,unknown,unknown,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,96.0,,25.0,15.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,54.0,53.0,,,,,,,,,2.0,5.0,,,,12.0,4.724409,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,28.0,28.0,,
355948,9782226399311,http://world-en.openfoodfacts.org/product/9782...,kiliweb,1505392509,2017-09-14T12:35:09Z,1505392514,2017-09-14T12:35:14Z,Treize Raisons,,,,,netflix,netflix,en:milks,"en:dairies,en:milks","Dairies,Milks",,,,,en:organic,en:organic,Organic,,,,,,,,France,en:france,France,"(J'espère que vous êtes prêts, parce que je va...",,,,,,,,0.0,[ j-espere-que-vous-etes-prets -> fr:j-espere...,,,0.0,,,0.0,,,,d,Milk and dairy products,Milk and yogurt,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",en:dairies,Dairies,,,13.0,,5.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,58.0,38.0,,,,,,,,,55.0,2.0,,,,80.0,31.496063,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,16.0,16.0,,
355951,9782745932792,http://world-en.openfoodfacts.org/product/9782...,kiliweb,1504285120,2017-09-01T16:58:40Z,1504285128,2017-09-01T16:58:48Z,Ma bite,,,,,Juratay,juratay,en:beverages,"en:beverages,en:non-sugared-beverages","Beverages,Non-sugared beverages",,,,,en:organic,en:organic,Organic,,,,,,,,France,en:france,France,Du sperme de la viande du viagra,,,,,,,,0.0,[ du-sperme-de-la-viande-du-viagra -> fr:du-s...,,,0.0,,,0.0,,,,b,Beverages,Non-sugared beverages,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",en:beverages,Beverages,,,63.0,,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,,,,,,,,,10.0,10.0,,,,1.0,0.393701,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-2.0,-6.0,,
355952,9782803671168,http://world-en.openfoodfacts.org/product/9782...,kiliweb,1505503122,2017-09-15T19:18:42Z,1505503126,2017-09-15T19:18:46Z,Les schtroumpfs & le village des fille,,,,,Les Schtroumpfs,les-schtroumpfs,en:fats,en:fats,Fats,,,,,,,,,,,,,,,France,en:france,France,"LES SCHTROUMPFS t, LES SCHTROUMPFS NOIRS 2. LE...",ŒUF,,,,,,,0.0,[ les-schtroumpfs-t -> fr:les-schtroumpfs-t ...,,,0.0,,,0.0,,,,d,Fat and sauces,Fats,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",en:fats,Fats,,,33.0,,42.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0,5.0,,,,,,,,,,5.0,,,,4.0,1.574803,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,13.0,18.0,,
355953,9782811613488,http://world-en.openfoodfacts.org/product/9782...,kiliweb,1503745692,2017-08-26T11:08:12Z,1503745703,2017-08-26T11:08:23Z,Fairy tail,,,,,Pika Edition,pika-edition,,,,,,,,,,,,,,,,,,France,en:france,France,Le deuxième jour du Grand Tournoi de la magie ...,,,,,,,,0.0,[ le-deuxieme-jour-du-grand-tournoi-de-la-mag...,,,0.0,,,0.0,,,,d,unknown,unknown,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,50.0,,12.0,12.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,12.0,12.0,,,,,,,,,12.0,12.0,,,,12.0,4.724409,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,17.0,17.0,,
355954,9782811635848,http://world-en.openfoodfacts.org/product/9782...,kiliweb,1499617591,2017-07-09T16:26:31Z,1499617638,2017-07-09T16:27:18Z,Fairy Tail,,,,,Hiro Mashima,hiro-mashima,,,,,,,,,,,,,,,,,,France,en:france,France,"Eileen a fait usage d'Univers 0ne, un sort de ...",,,,,,,,0.0,[ eileen-a-fait-usage-d-univers-0ne -> fr:eil...,,,0.0,,,0.0,,,,b,unknown,unknown,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,0.0,0.0,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,


In [423]:
# df2 = df[df['countries_tags'].notna()][['countries', 'countries_tags']]
# df3 = df2['countries_tags'].str.extract(re.compile(r'^(?P<lang2>..)\:(?P<name>.*)$'))
# display(
#     df3['lang2'].value_counts(),
#     df3['name'].value_counts()
# )

en = ('united-states', 'united-kingdom', 'australia')  # leave out canada
fr = ('france', 'switzerland', 'belgium')  # leave out canada
pattern_en = re.compile('|'.join(re.escape(name) for name in en))
pattern_fr = re.compile('|'.join(re.escape(name) for name in fr))
has_en = df['countries_tags'].str.contains(pattern_en, na=False, regex=True)
has_fr = df['countries_tags'].str.contains(pattern_fr, na=False, regex=True)

display(
    df[has_en]['ingredients_text'].notna().sum(),
    df[has_fr]['ingredients_text'].notna().sum()
)

#df[has_en][['countries_tags', 'ingredients_text']]
#df[has_fr][['countries_tags', 'ingredients_text']]

175494

95590

In [376]:
def to_pairs(tokens: t.Iterator[Token]) -> t.List[t.Tuple[str, bool]]:
    """Returns a ``(text, is_composite)`` pair for each FIELD token.

    A field is flagged as composite when the token that follows it is an opening parenthesis.
    """
    result = []
    state = ParseState(next(tokens), tokens, [])
    cur = state.peek
    while cur.type != TokenType.END:
        state = read(state)
        nxt = state.peek

        if cur.type == TokenType.FIELD:
            result.append((cur.text, nxt.type == TokenType.LPAR))

        cur = nxt
    return result

In [380]:
import unicodedata


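def remove_accents(text: str) -> str:
    """\
    Return ``text`` with all accents removed.

    The text is decomposed (NFKD) and the combining characters are dropped.
    """
    normalized = unicodedata.normalize('NFKD', text)
    return "".join(c for c in normalized if not unicodedata.combining(c))


def replace_punctuation(text: str, by: str) -> str:
    """\
    Return ``text`` with every punctuation character (Unicode category ``P*``)
    replaced with ``by``.
    """
    return "".join(by if unicodedata.category(c).startswith('P') else c for c in text)


def normalize(text: str) -> str:
    """Strip and lower-case ``text``, then remove its punctuation and accents."""
    return remove_accents(replace_punctuation(text.strip().lower(), by=''))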

In [422]:
# import nltk
# nltk.download('punkt')

from nltk.tokenize import word_tokenize, sent_tokenize
#sent_tokenize(text)
#word_tokenize(text)

In [383]:
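def try_to_parse(series: pd.Series) -> t.Tuple[t.Any, t.Optional[str], t.Optional[int]]:
    """Return the raw text, the normalized pairs (as a string) and the number of parse errors."""
    field = series['ingredients_text']

    if pd.isna(field):
        return field, None, None

    tokens = tokenize(field)
    state = ParseState(next(tokens), tokens, [])
    state, result = parse_list(state)
    processed = flatten(merge(result))
    # Avoid ``t`` as a loop variable: it would shadow the ``typing`` alias.
    normalized = [(normalize(text), flag) for text, flag in processed]
    return field, str(normalized), len(state.errors)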


result = df.head(100000)[['ingredients_text']].apply(try_to_parse, axis='columns', result_type='expand')
display(
    result[2].isna().sum(),
    (result[2] > 0).sum()
)
# display_allcols_notrunc(result)
display_allcols_notrunc(result[result[2] > 0])

2882

813

Unnamed: 0,0,1,2
471,"Wheatflour contains Gluten (With Wheatflour, Calcium Carbonate, Iron, Niacin, nut) Thiamin) Dark Chocolate Chips (20%) (Sugar ? Cocoa Butter ? Cocoa Mass ? Emulsifier: Soya Lecithin ? Vanilla llavouri ng) ? Palm Oil ? Sugar ? Roasted Hav!nuts (3%) ? Dried }kimmed Milk ? Golden Syrup (Invert Sugar Syrup) Salt ? Raising Agent: Sodium Bicarbonate, E450, E503 ? Emulsifier: Soya Lecithin ? Vanilla Flavouring. Dark Chocolate contains Cocoa Solids 39% minimum.","[('wheatflour contains gluten', True), ('with wheatflour', False), ('calcium carbonate', False), ('iron', False), ('niacin', False), ('nut', False), ('thiamin', False), ('dark chocolate chips', True), ('20', False), ('sugar', False), ('cocoa butter', False), ('cocoa mass', False), ('emulsifier soya lecithin', False), ('vanilla llavouri ng', False), ('palm oil', False), ('sugar', False), ('roasted havnuts', True), ('3', False), ('dried kimmed milk', False), ('golden syrup', True), ('invert sugar syrup', False), ('salt', False), ('raising agent sodium bicarbonate', False), ('e450', False), ('e503', False), ('emulsifier soya lecithin', False), ('vanilla flavouring', False), ('dark chocolate contains cocoa solids 39 minimum', False)]",1.0
559,",tato crisps With mint and salt marsh lamb Seasoning GREDIENTS Potatoes • Sunflower Oil • Rice Flour • Igar • Salt • Dried Yeast Extract • Dried Onions . Dried der Vinegar • Dried Garlic • Dried Mint Flavouring • itish Salt Marsh Lamb Seasoning (Salt • Lamb Fat • )semary Extract) • Ground Black Pepper • Acidity Citric Acid • Colour: Paprika Extract. NFORMATION This product is sold by weight. }ettling of the contents may occur.","[('tato crisps with mint and salt marsh lamb seasoning gredients potatoes', False), ('sunflower oil', False), ('rice flour', False), ('igar', False), ('salt', False), ('dried yeast extract', False), ('dried onions', False), ('dried der vinegar', False), ('dried garlic', False), ('dried mint flavouring', False), ('itish salt marsh lamb seasoning', True), ('salt', False), ('lamb fat', False), ('semary extract', False), ('ground black pepper', False), ('acidity citric acid', False), ('colour paprika extract', False), ('nformation this product is sold by weight', False), ('ettling of the contents may occur', False)]",1.0
573,Apple Juice (50%) Coconut Water 15%) ? Cucumber Purée (15%) Apple Purée 14%) ? Spinach Purée (3%) ? Lemon Juice ? Spirulina Extract ? Antioxidant: Ascorbic Acid.,"[('apple juice', True), ('50', False), ('coconut water 15', False), ('cucumber puree', True), ('15', False), ('apple puree 14', False), ('spinach puree', True), ('3', False), ('lemon juice', False), ('spirulina extract', False), ('antioxidant ascorbic acid', False)]",2.0
698,"Jus de pomme pressée pasteurisé ? Purée de banane ? Purée de pomme ? Purée dei cassis (10%) ? Puree de framboise 1096) ? Purée de baie dlaçaï ? correcteur d'acidité : Acide citrique ? . Antioxydant : Acide ascorbique. :iNFORMATlON Ce produit peut se idissocier naturellement. INGREDIÈNTEN ) Gepasteuriseerd geperst àppelsap ? bananenpuree ? làppelpuree ? zwafte bessenpuree S) ? frambozenpuree (10%) ? bessenpuree ? zuufteregelaar: :titrcenzuur ? antioxidant: Èscorbinezuur. INFORMATE Dit product kan van nature bezinken. movennes/Gemiddelde waarden : : ter -00ml: Energie/Energie 3ò6KJ/87kcal ? Matières grasses/ l'etten 1.9g, dont acides gras saturés/ Waawan verzadigde vetzuren 1.0g ? G!uc des/Koolhydraten 14.8g, dont Sucres/waarvan suikers 9.4g ? Fibres à)mentairesNezels 0.6g ? Protéines/ 2.3g ? Sel/Zout 0.05g. 5)65. looml 25(31) per 500ml 125(1 : = Valeurs Nutritionnelles a Référence/ VRI Dagelijkse Referentie-lnnanne BEST BEFORE","[('jus de pomme pressee pasteurise', False), ('puree de banane', False), ('puree de pomme', False), ('puree dei cassis', True), ('10', False), ('puree de framboise 1096', False), ('puree de baie dlacai', False), ('correcteur dacidite acide citrique', False), ('antioxydant acide ascorbique', False), ('informatlon ce produit peut se idissocier naturellement', False), ('ingredienten', False), ('gepasteuriseerd geperst appelsap', False), ('bananenpuree', False), ('lappelpuree', False), ('zwafte bessenpuree s', False), ('frambozenpuree', True), ('10', False), ('bessenpuree', False), ('zuufteregelaar titrcenzuur', False), ('antioxidant escorbinezuur', False), ('informate dit product kan van nature bezinken', False), ('movennesgemiddelde waarden ter 00ml energieenergie 3o6kj87kcal', False), ('matieres grasses letten 19g', False), ('dont acides gras satures waawan verzadigde vetzuren 10g', False), ('guc deskoolhydraten 148g', False), ('dont sucreswaarvan suikers 94g', False), ('fibres a', False), ('mentairesnezels 06g', False), ('proteines 
23g', False), ('selzout 005g', False), ('5', False), ('65', False), ('looml 25', True), ('31', False), ('per 500ml 125', True), ('1 = valeurs nutritionnelles a reference vri dagelijkse referentielnnanne best before', False)]",5.0
737,"raisins, fraisest framboises, mures et myrtilles : INGREDIENTS Raisins (55%)• • Fraises (22%) Framboises : 10%) • Mûres (tilles 4%)","[('raisins', False), ('fraisest framboises', False), ('mures et myrtilles ingredients raisins', True), ('55', False), ('fraises', True), ('22', False), ('framboises 10', False), ('mures', True), ('tilles 4', False)]",1.0
881,"Durum wheat semolina, niacin, iron ferrous lactate), thiamin mononitrate, riboflavin, folic acid.","[('durum wheat semolina', False), ('niacin', False), ('iron ferrous lactate', False), ('thiamin mononitrate', False), ('riboflavin', False), ('folic acid', False)]",1.0
899,"Milk chocolate (sugar, whole milk, cocoa butter, chocolate liquor, soy lecithin [emulsifier], vanilla), dark chocolate (sugar, chocolate liquor, cocoa nutter, anhydrous milk fat, soy lecithin [emulsifier], vanila), soy lecithin [emulsifier], vanilla).","[('milk chocolate', True), ('sugar', False), ('whole milk', False), ('cocoa butter', False), ('chocolate liquor', False), ('soy lecithin', True), ('emulsifier', False), ('vanilla', False), ('dark chocolate', True), ('sugar', False), ('chocolate liquor', False), ('cocoa nutter', False), ('anhydrous milk fat', False), ('soy lecithin', True), ('emulsifier', False), ('vanila', False), ('soy lecithin', True), ('emulsifier', False), ('vanilla', False)]",1.0
917,"Milk chocolate (sugar, cocoa butter, milk, chocolate liquor, soy lecithin (an emulsfier), vanilla), and pretzels enriched wheat flour, flour, salt, corn syrup, vegetable oil, sodium bicarbonate, yeast).","[('milk chocolate', True), ('sugar', False), ('cocoa butter', False), ('milk', False), ('chocolate liquor', False), ('soy lecithin', True), ('an emulsfier', False), ('vanilla', False), ('and pretzels enriched wheat flour', False), ('flour', False), ('salt', False), ('corn syrup', False), ('vegetable oil', False), ('sodium bicarbonate', False), ('yeast', False)]",1.0
930,"NGREDIENTS Carrot Juice • Carrot Purée Acid: )itric Acid. Suitable for vegetarians and vegans SERVING Serve chilled. Shake well before serving. , This product may naturally separate. NUTRITION Serves/Portions/Porties: 5 Typical values/Valeurs moyenr)es/Gemidde!de , waarden per 100ml: Energy/Energie/Energie 90kJ/21 kcal • Fat/Matières grassesNetten g, of which saturates/dont acides gras saturés/ : waarvan verzadigde vetzuren &lt;O.I g • Carbohydrate/Glucides/Koolhydraten 4.4g, of which sugars/dont sucres/waarvan suikers 4.0g • FibtNFlbres alimentairesNezels &lt;0.1 g • Protein/ . Protéines/Eiwitten O.7g • Salt/SelZout &lt;0.1 g. Vitamin A ug (NRVO/0)Nitamine A ug UNR0/oy (%DRI) per 100ml 60.0(8) per 150ml 90.0(11) • Vitamin C mg (NRV%)Nitamine C mg (VNR%)/ (%DRI) Per looml 6.0(8) per 150ml 9.0(12) NRV Nutrient Reference Value VNR = Valeurs Nutritionnelles de Référence/ \ DRI Dagelijkse Referentie-lnname BEST BEFORE","[('ngredients carrot juice', False), ('carrot puree acid', False), ('itric acid', False), ('suitable for vegetarians and vegans serving serve chilled', False), ('shake well before serving', False), ('this product may naturally separate', False), ('nutrition servesportionsporties 5 typical valuesvaleurs moyenr', False), ('esgemiddede', False), ('waarden per 100ml energyenergieenergie 90kj21 kcal', False), ('fatmatieres grassesnetten g', False), ('of which saturatesdont acides gras satures waarvan verzadigde vetzuren lt', False), ('o', False), ('i g', False), ('carbohydrateglucideskoolhydraten 44g', False), ('of which sugarsdont sucreswaarvan suikers 40g', False), ('fibtnflbres alimentairesnezels lt', False), ('01 g', False), ('protein', False), ('proteineseiwitten o', False), ('7g', False), ('saltselzout lt', False), ('01 g', False), ('vitamin a ug', True), ('nrvo0', False), ('nitamine a ug unr0oy', True), ('dri', False), ('per 100ml 600', True), ('8', False), ('per 150ml 900', True), ('11', False), ('vitamin c mg', True), ('nrv', False), ('nitamine 
c mg', True), ('vnr', False), ('', True), ('dri', False), ('per looml 60', True), ('8', False), ('per 150ml 90', True), ('12', False), ('nrv nutrient reference value vnr = valeurs nutritionnelles de reference dri dagelijkse referentielnname best before', False)]",2.0
1084,"Mechanically separated chicken pork, water, bacon cured with; water, salt, sodium phosphate, sugar, sodium erythorbate, hydrolyzed soy and corn protein, monosodium glutamate, sodium nitrate). wheat flour, modified food starch, corn syrup, salt, flavoring,","[('mechanically separated chicken pork', False), ('water', False), ('bacon cured with', False), ('water', False), ('salt', False), ('sodium phosphate', False), ('sugar', False), ('sodium erythorbate', False), ('hydrolyzed soy and corn protein', False), ('monosodium glutamate', False), ('sodium nitrate', False), ('wheat flour', False), ('modified food starch', False), ('corn syrup', False), ('salt', False), ('flavoring', False)]",1.0


In [None]:
# Unbalanced parenthesis:
s = '''Peanut butter coating (evaporated cane juice, fractionated palm kernal oil, partially defatted peanut flour, whey powder [milk], soy lecithin [an emulsifier]), milk chocolate coating (dehydrated cane juice, cocoa butter, unsweetened chocolate, whole milk powder, soy lecithin [an emulsifier], natural vanilla), malt balls (glucose syrup [corn], whey powder, malted milk powder [malted barley, wheat flour, milk, bicarbonate of soda, mono and diglycerides), pure food glaze.'''
#test_parse(s)
#list(get_tokens(s.lower()))

In [None]:
display_allcols_notrunc(df[df.iloc[:, 0:30].notna().all(axis='columns')])

Specification:
* Lower-case
* Split on '.' -> groups
* Split group on ','
* If an item ends with `... (...)`, ignore it for the moment
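
The first three rules of this specification can be sketched as follows. This is only a minimal illustration: the helper name `split_ingredients` is hypothetical (it is not defined elsewhere in this notebook), and the rule about parenthesized items is deliberately left out.

```python
import typing as t


def split_ingredients(text: str) -> t.List[t.List[str]]:
    """Split an ingredients text into groups (on '.') and items (on ',').

    Hypothetical helper, for illustration only.
    """
    groups = []
    for group in text.lower().split('.'):
        # Split the group on commas and strip surrounding white-space:
        items = [item.strip() for item in group.split(',')]
        # Drop the empty items, e.g. the one produced by a trailing period:
        items = [item for item in items if item]
        if items:
            groups.append(items)
    return groups


example = 'Durum wheat semolina, niacin, riboflavin, folic acid. Contains wheat.'
print(split_ingredients(example))
```

Note that splitting on `'.'` also cuts decimal numbers such as `1.9g` in two, so this rule would probably need refining for texts that embed nutrition values.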

In [None]:
s_ingredients = df['ingredients_text']


def clean_ingredients_text(text: str) -> str:
    """Return the comma-separated ingredients as a lower-cased, '|'-separated string."""
    if pd.isna(text):
        return text
    
    # Strip leading/trailing white-spaces:
    text = text.strip()
    # Convert to lower case:
    text = text.lower()
    # Remove trailing period, if any:
    if text.endswith('.'):
        text = text[:-1]
    words = text.split(',')
    words = [word.strip() for word in words]
        
    return '|'.join(words)
    
    
s2 = s_ingredients.map(clean_ingredients_text)
df['tmp'] = s2
display_allcols_notrunc(s2)

# TAGS / WHAT TO DO?

In [None]:
BASE_COLUMNS = (
    'packaging',
    'brands',
    'categories',
    'origins',
    'manufacturing_places',
    'labels',
    'emb_codes',
    'cities',
    'countries'
)

# To investigate:
# - purchase_places
# - stores

def get_related_tags_columns(df: pd.DataFrame, base_column: str) -> t.List[str]:
    """Return ``base_column`` and its related ``_tags`` and 2-letter language columns."""
    # Use ``fullmatch`` so that only the exact suffixes ``_tags`` and ``_xx`` are accepted:
    pattern = re.compile(fr'{re.escape(base_column)}(_tags|_[a-z]{{2}})?')
    return [column for column in df.columns if pattern.fullmatch(column)]

In [None]:
#display_allcols(df[df['first_packaging_code_geo'].notna()])
#display_allcols(df[df['purchase_places'].notna()])


In [None]:
get_related_tags_columns(df, 'categories')

Plan for tags:
* Lower-case the text, split the text on white-spaces (remove leading/trailing/repeated white-spaces), remove accents for all words, join the words with dashes.
* In a 2-language column, filter out tags with a "foreign" prefix. E.g. filter out the `fr:Foies gras` tags in `categories_en`.

Example: (_code_: `452`, *product\_name* : `Foie gras canard Périgord`)
* _categories_: `Foie gras de canard`
* *categories\_tags*: `en:fish-and-meat-and-eggs,fr:foies-gras,fr:foies-gras-de-canard`
* *categories\_en*: `Fish and meat and eggs,fr:Foies gras,fr:Foies gras de canard`

Expected output:
* _categories_: `foie-gras-de-canard`
* *categories\_tags*: `en:fish-and-meat-and-eggs,fr:foies-gras,fr:foies-gras-de-canard` (no change)
* *categories\_en*: `fish-and-meat-and-eggs`
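
The plan and the expected output above could be sketched as follows. The names `to_tag`, `clean_keywords` and `drop_prefixed` are hypothetical, and the sketch assumes that a "foreign" prefix is always a 2-letter lower-case code followed by a colon.

```python
import re
import unicodedata

# Assumption: a language prefix is two lower-case letters followed by a colon.
PREFIX = re.compile(r'^[a-z]{2}:')


def to_tag(keyword: str) -> str:
    """Lower-case, remove accents, collapse white-space and join words with dashes."""
    decomposed = unicodedata.normalize('NFKD', keyword.lower())
    no_accents = ''.join(c for c in decomposed if not unicodedata.combining(c))
    # ``split()`` without argument drops leading/trailing/repeated white-space:
    return '-'.join(no_accents.split())


def clean_keywords(text: str, drop_prefixed: bool = False) -> str:
    """Convert a comma-separated list of keywords into a comma-separated list of tags."""
    keywords = [keyword.strip() for keyword in text.split(',')]
    if drop_prefixed:
        # Filter out the keywords that carry a language prefix:
        keywords = [keyword for keyword in keywords if not PREFIX.match(keyword)]
    return ','.join(to_tag(keyword) for keyword in keywords if keyword)


print(clean_keywords('Foie gras de canard'))
print(clean_keywords('Fish and meat and eggs,fr:Foies gras,fr:Foies gras de canard',
                     drop_prefixed=True))
```

With the example product, the two calls return `foie-gras-de-canard` and `fish-and-meat-and-eggs`, which matches the expected output for _categories_ and *categories\_en*.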

Hypothesis for tags:
* `<name>` is a comma-separated list of short descriptions in one of the languages of the country where the product was bought. These keywords may include white spaces and seem to start with an upper case letter.
* `<name>_en` is the translation of `<name>` to English.
* `<name>_tags` is a comma-separated list of tags, where tags are derived from the keywords by lower-casing all words and joining them with dashes. Each tag is optionally prefixed with a language 2-letter code followed by a colon.
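
To check this hypothesis on individual values, a tag can be split into its optional language code and its name. The following is a minimal sketch: `TAG` and `split_tag` are hypothetical names, and the pattern assumes that the language code is made of exactly two lower-case letters.

```python
import re
import typing as t

# Optional 2-letter language code followed by a colon, then the tag name:
TAG = re.compile(r'^(?:(?P<lang>[a-z]{2}):)?(?P<name>.*)$')


def split_tag(tag: str) -> t.Tuple[t.Optional[str], str]:
    """Return ``(lang, name)``, where ``lang`` is None if the tag has no prefix."""
    match = TAG.match(tag)
    return match.group('lang'), match.group('name')


tags = 'en:fish-and-meat-and-eggs,fr:foies-gras,fr:foies-gras-de-canard'.split(',')
print([split_tag(tag) for tag in tags])
```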

**Comment:** The numbers of items in the related columns do not always agree, e.g. for categories.

In [None]:
# Split 'categories' into a list of short descriptions:
s1 = df['categories'].map(lambda text: text.split(','), na_action='ignore')
# Count the number of short descriptions:
ns1 = s1.map(len, na_action='ignore')

# Split 'categories_en' into a list of short descriptions:
s2 = df['categories_en'].map(lambda text: text.split(','), na_action='ignore')
# Count the number of short descriptions:
ns2 = s2.map(len, na_action='ignore')

# Split 'categories_tags' into a list of tags:
s3 = df['categories_tags'].map(lambda text: text.split(','), na_action='ignore')
# Count the number of tags:
ns3 = s3.map(len, na_action='ignore')

df_tmp = pd.DataFrame(data={
    'categories': s1, 'categories_len': ns1,
    'categories_en': s2, 'categories_en_len': ns2,
    'categories_tags': s3, 'categories_tags_len': ns3
})

display(
    (df_tmp['categories_len'] < df_tmp['categories_en_len']).sum(),
    (df_tmp['categories_len'] > df_tmp['categories_en_len']).sum(),
    (df_tmp['categories_len'] == df_tmp['categories_en_len']).sum(),
    '---',
    (df_tmp['categories_len'] < df_tmp['categories_tags_len']).sum(),
    (df_tmp['categories_len'] > df_tmp['categories_tags_len']).sum(),
    (df_tmp['categories_len'] == df_tmp['categories_tags_len']).sum(),
    '---',
    (df_tmp['categories_en_len'] < df_tmp['categories_tags_len']).sum(),
    (df_tmp['categories_en_len'] > df_tmp['categories_tags_len']).sum(),
    (df_tmp['categories_en_len'] == df_tmp['categories_tags_len']).sum()
)

# display(
#     (df_tmp['categories_len'] != df_tmp['categories_en_len']).sum(),
#     (df_tmp['categories_len'] != df_tmp['categories_tags_len']).sum(),
#     (df_tmp['categories_en_len'] != df_tmp['categories_tags_len']).sum()
# )

# display_allcols_notrunc(df_tmp[(
#     df_tmp['categories'].notna() &
#     ~((df_tmp['categories_len'] == df_tmp['categories_en_len']) & (df_tmp['categories_len'] == df_tmp['categories_tags_len'])))
# ])

In [None]:
columns = [
    'packaging', 'packaging_tags',
    'brands', 'brands_tags',
    'categories', 'categories_tags', 'categories_en',
    'origins', 'origins_tags',
    'manufacturing_places', 'manufacturing_places_tags',
    'labels', 'labels_tags', 'labels_en',
    'emb_codes', 'emb_codes_tags'
]
profile(df[columns])

To investigate:
* _first_packaging_code_geo_
* *purchase\_places*
* _stores_

In [None]:
display_allcols_notrunc(df.loc[[392, 353516, 353517], :])

In [None]:
#display_allcols(df[df['first_packaging_code_geo'].notna()])
#display_allcols(df[df['purchase_places'].notna()])
#display_allcols(df[df['stores'].notna()])
#display_allcols(df[df['countries'].notna()])
display_allcols_notrunc(df[df['emb_codes'].notna()])