# Course 2: Project - Task A - Data cleaning

<a name="top"></a>
This notebook is concerned with task A. The goal is to clean the [Open Food Facts](https://www.kaggle.com/openfoodfacts/world-food-facts) dataset (version 5), downloaded from Kaggle. The dataset originates from https://world.openfoodfacts.org/data. A description of the fields is available at https://static.openfoodfacts.org/data/data-fields.txt.

**Contents:**
* [Imports](#task-a-imports)
* [Utilities](#task-a-utilities)
* [Data loading](#task-a-data-loading)
* [Cleaning](#task-a-cleaning)
  * [NA columns](#task-a-cleaning-na-columns)
  * [Column group: General information](#task-a-cleaning-general-information)
  * [Column group: Tags](#task-a-cleaning-tags)
  * [Column group: Ingredients](#task-a-cleaning-ingredients)
  * [Column group: Miscellaneous data](#task-a-cleaning-miscellaneous-data)
  * [Column group: Nutrition facts](#task-a-cleaning-nutrition-facts)
* [Result](#task-a-cleaning-result)

## Imports<a name="task-a-imports"></a> ([top](#top))
---

In [1]:
# Standard library:
import pathlib
import typing as t
import urllib.parse

# 3rd party:
import numpy as np
import pandas as pd
import pandas.io.formats.style

# Project:
import ean
import quantity
import tags
import utils

## Utilities<a name="task-a-utilities"></a> ([top](#top))
---

**Note:** Utilities used in multiple notebooks are in a separate module - see `utils.py`.

In [2]:
def style_percentages(df: pd.DataFrame) -> pd.io.formats.style.Styler:
    """\
    Returns a "styler" to render the columns with a name that contains '%' as percentages.
    
    Args:
        df: The data-frame to style.
        
    Returns:
        As described above.
    """
    formatter = { name: lambda p: f'{p:.2f} %' for name in df.columns if '%' in name }
    return df.style.format(formatter)


def reorder(items: t.List[str], item_to_move: str, item: str, after: bool = True) -> t.List[str]:
    """\
    Re-order strings by moving a given string after or before another target string.

    Args:
        items: A list of strings.
        item_to_move: The string to move.
        item: The target string.
        after: ``True`` to move the string after the target string; ``False`` to move it before.

    Returns:
        As described above.
    """
    try:
        item_idx = items.index(item)
        item_to_move_idx = items.index(item_to_move)
    except ValueError:
        pass
    else:
        remove_idx = item_to_move_idx
        # Inserting at the target's own index places the item immediately before the target:
        insert_idx = item_idx + (1 if after else 0)
        if insert_idx < remove_idx:
            items.pop(remove_idx)
            items.insert(insert_idx, item_to_move)
        else:
            items.insert(insert_idx, item_to_move)
            items.pop(remove_idx)
    return items


def is_valid_url(url: t.Union[float, str], na=True) -> bool:
    """\
    Tests whether a string is a valid URL.
    
    .. note:: We could perform a tighter test with 3rd party libraries such as
        `rfc3986 <https://pypi.org/project/rfc3986/>`_ but these are not included in the Conda setup
        for this course and we do not want to change the setup.
    
    Args:
        url: Either NaN or a string to test.
        na: What to return if the ``url`` is NaN. 
        
    Returns:
        As described above.
    """
    if pd.isna(url):
        return na
    result = urllib.parse.urlparse(url)
    return result.scheme != '' and result.netloc != ''
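
To illustrate which strings the test above accepts, here is a small sketch that calls `urllib.parse.urlparse` directly. The helper name `has_scheme_and_netloc` and the sample strings are made up for illustration. The check is purely syntactic: any string with both a scheme and a network location passes, whether or not the host exists.

```python
import urllib.parse


def has_scheme_and_netloc(url: str) -> bool:
    """Same criterion as `is_valid_url`, without the NaN handling."""
    parts = urllib.parse.urlparse(url)
    return parts.scheme != '' and parts.netloc != ''


# Hypothetical sample strings:
samples = [
    'http://example.org/product/1',  # scheme and netloc present
    'www.example.com',               # no scheme, hence parsed as a bare path
    'mailto:someone@example.org',    # scheme but no netloc
    'not a url',                     # plain text
]

results = {url: has_scheme_and_netloc(url) for url in samples}
for url, ok in results.items():
    print(f'{url!r}: {ok}')
```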

## Data loading<a name="task-a-data-loading"></a> ([top](#top))
---

Since we are not familiar with the dataset and have been warned that it is quite messy, we first let Pandas read the TSV file entirely into memory and guess the type of each column. At the end of this notebook, we will export a cleaned-up file that Pandas will be able to read more efficiently.

In [3]:
data_filename = pathlib.Path.cwd().joinpath('en.openfoodfacts.org.products.tsv')

In [4]:
df = pd.read_csv(data_filename, sep='\t', low_memory=False)

We first get some general information about the data-frame:

In [5]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} rows and {ncols} columns')

the dataset contains 356027 rows and 163 columns


**Note:** It turns out that reading the TSV file this way is problematic (at least on macOS) since 26 lines contain a carriage return. We noticed this by focusing on the first row where `code` was NA and looking at the line corresponding to the previous row directly in the TSV file:
```bash
sed -n '193909 l' ./en.openfoodfacts.org.products.tsv
(...)fr-32-464-040-ec\t43.400279,0.199525\r\t\tvillecomtal-sur-arros-gers-france(...)
                                         ^^
```
26 rows is a negligible fraction of all rows and we could have dropped them, but it turns out that there is an even simpler solution:

In [6]:
df = pd.read_csv(data_filename, sep='\t', lineterminator='\n', low_memory=False)

In [7]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} rows and {ncols} columns')

the dataset contains 356001 rows and 163 columns
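
The effect of `lineterminator` can be reproduced on a tiny in-memory file. This is only a sketch with made-up data (the column names and values are hypothetical): with the default settings, the C parser treats a lone carriage return as a line break and splits the affected record in two, whereas `lineterminator='\n'` keeps the carriage return inside the field.

```python
import io

import pandas as pd

# Made-up TSV content: the first record has a stray '\r' inside its second field.
raw = 'code\tplace\tname\n1\tA\rB\tfoo\n2\tC\tbar\n'

# Default behaviour: '\r' ends the line, so the first record is split in two.
df_default = pd.read_csv(io.StringIO(raw), sep='\t')

# Only '\n' ends a line: the '\r' stays inside the 'place' field.
df_fixed = pd.read_csv(io.StringIO(raw), sep='\t', lineterminator='\n')

print(len(df_default), len(df_fixed))
print(repr(df_fixed.loc[0, 'place']))
```

If the embedded carriage returns are unwanted, they could be removed afterwards, for example with `str.replace('\r', '')` on the affected columns.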


Having taken care of this, we look at the first few rows:

In [8]:
utils.display_with_options(utils.ALL_COLS)(df.head())

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,serving_size,no_nutriments,additives_n,additives,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutrition_grade_uk,nutrition_grade_fr,pnns_groups_1,pnns_groups_2,states,states_tags,states_en,main_category,main_category_en,image_url,image_small_url,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,-butyric-acid_100g,-caproic-acid_100g,-caprylic-acid_100g,-capric-acid_100g,-lauric-acid_100g,-myristic-acid_100g,-palmitic-acid_100g,-stearic-acid_100g,-arachidic-acid_100g,-behenic-acid_100g,-lignoceric-acid_100g,-cerotic-acid_100g,-montanic-acid_100g,-melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,-alpha-linolenic-acid_100g,-eicosapentaenoic-acid_100g,-docosahexaenoic-acid_100g,omega-6-fat_100g,-linoleic-acid_100g,-arachidonic-acid_100g,-gamma-linolenic-acid_100g,-dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,-oleic-acid_100g,-elaidic-acid_100g,-gondoic-acid_100g,-mead-acid_100g,-erucic-acid_100g,-nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,-sucrose_100g,-glucose_100g,-fructose_100g,-lactose_100g,-maltose_100g,-maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
0,3087,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1474103866,2016-09-17T09:17:46Z,1474103893,2016-09-17T09:18:13Z,Farine de blé noir,,1kg,,,Ferme t'y R'nao,ferme-t-y-r-nao,,,,,,,,,,,,,,,,,,en:FR,en:france,France,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,4530,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Banana Chips Sweetened (Whole),,,,,,,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Bananas, vegetable oil (coconut oil, corn oil ...",,,,,,28 g (1 ONZ),,0.0,[ bananas -> en:bananas ] [ vegetable-oil -...,,,0.0,,,0.0,,,,d,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,2243.0,,28.57,28.57,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.018,64.29,14.29,,,,,,,,,3.6,3.57,,,,0.0,0.0,,0.0,,,,,0.0214,,,,,,,,,,,,,,0.0,,0.00129,,,,,,,,,,,,,,,,,,,14.0,14.0,,
2,4559,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Peanuts,,,,,Torn & Glasser,torn-glasser,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Peanuts, wheat flour, sugar, rice flour, tapio...",,,,,,28 g (0.25 cup),,0.0,[ peanuts -> en:peanuts ] [ wheat-flour -> ...,,,0.0,,,0.0,,,,b,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,1941.0,,17.86,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,60.71,17.86,,,,,,,,,7.1,17.86,,,,0.635,0.25,,0.0,,,,,0.0,,,,,,,,,,,,,,0.071,,0.00129,,,,,,,,,,,,,,,,,,,0.0,0.0,,
3,16087,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055731,2017-03-09T10:35:31Z,1489055731,2017-03-09T10:35:31Z,Organic Salted Nut Mix,,,,,Grizzlies,grizzlies,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Organic hazelnuts, organic cashews, organic wa...",,,,,,28 g (0.25 cup),,0.0,[ organic-hazelnuts -> en:organic-hazelnuts ...,,,0.0,,,0.0,,,,d,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,2540.0,,57.14,5.36,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,17.86,3.57,,,,,,,,,7.1,17.86,,,,1.22428,0.482,,,,,,,,,,,,,,,,,,,,,0.143,,0.00514,,,,,,,,,,,,,,,,,,,12.0,12.0,,
4,16094,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055653,2017-03-09T10:34:13Z,1489055653,2017-03-09T10:34:13Z,Organic Polenta,,,,,Bob's Red Mill,bob-s-red-mill,,,,,,,,,,,,,,,,,,US,en:united-states,United States,Organic polenta,,,,,,35 g (0.25 cup),,0.0,[ organic-polenta -> en:organic-polenta ] [...,,,0.0,,,0.0,,,,,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,1552.0,,1.43,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,77.14,,,,,,,,,,5.7,8.57,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


We briefly look at the type guessed by Pandas and the number of non-NA values in each column:

In [9]:
df.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356001 entries, 0 to 356000
Data columns (total 163 columns):
code                                          356001 non-null object
url                                           356001 non-null object
creator                                       355998 non-null object
created_t                                     356001 non-null int64
created_datetime                              356000 non-null object
last_modified_t                               356001 non-null int64
last_modified_datetime                        356001 non-null object
product_name                                  338489 non-null object
generic_name                                  57688 non-null object
quantity                                      119262 non-null object
packaging                                     89959 non-null object
packaging_tags                                89959 non-null object
brands                                        326977 non-null obj

## Cleaning<a name="task-a-cleaning"></a> ([top](#top))
---

### Cleaning - NA columns<a name="task-a-cleaning-na-columns"></a> ([top](#top))
---

We noticed above that some columns contain only NA entries:

In [10]:
df.columns[df.isna().all(axis='rows')].tolist()

['cities',
 'allergens_en',
 'no_nutriments',
 'ingredients_from_palm_oil',
 'ingredients_that_may_be_from_palm_oil',
 'nutrition_grade_uk',
 '-butyric-acid_100g',
 '-caproic-acid_100g',
 '-behenic-acid_100g',
 '-lignoceric-acid_100g',
 '-cerotic-acid_100g',
 '-melissic-acid_100g',
 '-dihomo-gamma-linolenic-acid_100g',
 '-elaidic-acid_100g',
 '-gondoic-acid_100g',
 '-mead-acid_100g',
 '-erucic-acid_100g',
 '-nervonic-acid_100g',
 'chlorophyl_100g',
 'glycemic-index_100g',
 'water-hardness_100g']

**Decision:** We decide to drop those columns that are not part of a breakdown (i.e. those columns with a name that does not start with a *'-'*).

In [11]:
columns_to_drop = [
  'cities',
  'allergens_en',
  'no_nutriments',
  'ingredients_from_palm_oil',
  'ingredients_that_may_be_from_palm_oil',
  'nutrition_grade_uk',
  'chlorophyl_100g',
  'glycemic-index_100g',
  'water-hardness_100g'
]

df = df.drop(columns=columns_to_drop)

### Cleaning - Column group: General information<a name="task-a-cleaning-general-information"></a>  ([top](#top))
---

We first "profile" the columns in question:

In [12]:
columns = df.columns.to_series()['code': 'quantity'].to_list()
style_percentages(utils.profile(df[columns]))

Unnamed: 0,Types,NA,NA %,Non-NA,Non-NA %
code,{'str': 356001},0,0.00 %,356001,100.00 %
url,{'str': 356001},0,0.00 %,356001,100.00 %
creator,"{'str': 355998, 'float': 3}",3,0.00 %,355998,100.00 %
created_t,{'int': 356001},0,0.00 %,356001,100.00 %
created_datetime,"{'str': 356000, 'float': 1}",1,0.00 %,356000,100.00 %
last_modified_t,{'int': 356001},0,0.00 %,356001,100.00 %
last_modified_datetime,{'str': 356001},0,0.00 %,356001,100.00 %
product_name,"{'str': 338489, 'float': 17512}",17512,4.92 %,338489,95.08 %
generic_name,"{'float': 298313, 'str': 57688}",298313,83.80 %,57688,16.20 %
quantity,"{'float': 236739, 'str': 119262}",236739,66.50 %,119262,33.50 %


**Comment:** Columns in this group are quite diverse and we proceed to clean them one by one.
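
For readers without access to `utils.py`, the following is a hypothetical sketch of what a profiling helper producing the table above might look like. The name `profile_sketch` and its implementation are assumptions; the actual `utils.profile` may be written differently. It counts the Python type of each value (NA values in text columns show up as `float`, since NaN is a float) and reports NA and non-NA counts and percentages.

```python
import collections

import numpy as np
import pandas as pd


def profile_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for `utils.profile`: one summary row per column."""
    nrows = len(df)
    records = []
    for name in df.columns:
        column = df[name]
        # Iterating over a Series yields Python scalars, so type names are 'str', 'int', 'float', ...
        types = dict(collections.Counter(type(value).__name__ for value in column))
        nna = int(column.isna().sum())
        records.append({
            'Types': types,
            'NA': nna,
            'NA %': nna / nrows * 100 if nrows else np.nan,
            'Non-NA': nrows - nna,
            'Non-NA %': (nrows - nna) / nrows * 100 if nrows else np.nan,
        })
    return pd.DataFrame(records, index=df.columns)


# Made-up toy data:
toy = pd.DataFrame({'code': ['1', '2'], 'creator': ['someone', np.nan], 'created_t': [0, 1]})
summary = profile_sketch(toy)
print(summary)
```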

#### Column: `code`
---

From the description of the `code` column we know that:
* For products with a barcode, this is the barcode of the product (EAN-13 code or some internal code assigned by the store).
* For products without a barcode, Open Food Facts assigns a number starting with the 200 reserved prefix.

**Comment:** All values are of type `str` and there are no NA values.

**Decision:** We decide to keep the column and all rows. We also make sure that there are no leading/trailing white spaces.

In [13]:
df['code'] = df['code'].str.strip()

Out of curiosity, we check how many codes belong to the following categories: `e` - valid [EAN-13/EAN-8/UPC-A](https://en.wikipedia.org/wiki/International_Article_Number) code, `a` - code assigned by Open Food Facts (prefix 200) and `i` - internal code (store). (Mistyped EAN-13/EAN-8/UPC-A codes will be incorrectly classified as internal codes but we will not pursue this further.)

**Note:** In order not to clutter the notebook, most of the code is in a separate module - see `ean.py`.

Here is the result:

In [14]:
def categorize(code: str) -> str:
    return ('e' if ean.is_valid(code) else
            'a' if code.startswith('200') else
            'i')

In [15]:
df_categories = pd.DataFrame(df['code'].map(categorize).value_counts().rename('count'))
df_categories['count %'] = df_categories['count'] / df_categories['count'].sum() * 100
style_percentages(df_categories)

Unnamed: 0,count,count %
e,313911,88.18 %
i,40340,11.33 %
a,1750,0.49 %


We see that > 88 % of all codes are valid EAN-13/EAN-8/UPC-A codes.
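
The validity test itself lives in `ean.py` and is not reproduced here. As a rough idea of what such a test involves, the sketch below implements the standard check-digit rule shared by EAN-13, EAN-8 and UPC-A: counting from the rightmost digit, digits are weighted alternately by 1 and 3, and the weighted sum must be a multiple of 10. The function name `is_valid_gtin_sketch` is an assumption and the actual `ean.is_valid` may differ (e.g. in which lengths it accepts).

```python
def is_valid_gtin_sketch(code: str) -> bool:
    """Hypothetical check-digit test for EAN-13 (13 digits), UPC-A (12) and EAN-8 (8)."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13):
        return False
    # Weights 1, 3, 1, 3, ... starting from the rightmost digit (the check digit):
    total = sum(int(char) * (3 if position % 2 else 1)
                for position, char in enumerate(reversed(code)))
    return total % 10 == 0


print(is_valid_gtin_sketch('4006381333931'))  # well-known valid EAN-13 example
print(is_valid_gtin_sketch('4006381333932'))  # same code with a wrong check digit
```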

#### Column: `url`
---

**Comment:** All values are of type `str` and there are no NA values.

**Decision:** We decide to keep the column and to replace invalid URLs (if any) by NA. We also make sure that there are no leading/trailing white spaces.

In [16]:
df['url'] = df['url'].str.strip()

cond_is_invalid_url = ~df['url'].map(is_valid_url)
print(f'found {cond_is_invalid_url.sum()} invalid URL(s)')
df.loc[cond_is_invalid_url, ['url']] = np.nan

found 0 invalid URL(s)


#### Column: `creator`
---

**Comment:** Non-NA values are of type `str` and there are almost no NA values.

**Decision:** We decide to keep the column and all rows. We also make sure that there are no leading/trailing white spaces.

In [17]:
df['creator'] = df['creator'].str.strip()

#### Columns: `created_(t,datetime)`, `last_modified_(t,datetime)`
---

We first define utility functions to parse the strings into proper time-stamps:

In [18]:
def parse_t(series: pd.Series) -> pd.Series:
    return pd.to_datetime(series, utc=True, unit='s')


def parse_datetime(series: pd.Series) -> pd.Series:
    return pd.to_datetime(series, format='%Y-%m-%dT%H:%M:%S%z')

We first look at the `created_t` and `created_datetime` columns.

**Comment:** For the `created_t` column: All values are of type `int`. For the `created_datetime` column: Non-NA values are of type `str` and there is only 1 NA value. We confirm that these columns agree where both are not NA:

In [19]:
s1 = parse_t(df['created_t'])
s2 = parse_datetime(df['created_datetime'])
cond_both_notna = s1.notna() & s2.notna()
(s1[cond_both_notna] == s2[cond_both_notna]).all()

True
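
As a small illustration of this comparison, the sketch below parses two pairs of values taken from the first rows shown earlier. Passing `utc=True` when parsing the strings is an addition made here for robustness; it is not part of `parse_datetime` above. The last step shows how the time-zone information can be stripped to obtain naive UTC timestamps.

```python
import pandas as pd

# Values of `created_t` and `created_datetime` for the first two rows of the dataset:
seconds = pd.Series([1474103866, 1489069957])
strings = pd.Series(['2016-09-17T09:17:46Z', '2017-03-09T14:32:37Z'])

from_seconds = pd.to_datetime(seconds, utc=True, unit='s')
from_strings = pd.to_datetime(strings, format='%Y-%m-%dT%H:%M:%S%z', utc=True)

# Both series are time-zone aware (UTC) and can be compared element-wise:
print((from_seconds == from_strings).all())

# Dropping the time-zone yields naive timestamps that are implicitly UTC:
naive = from_seconds.dt.tz_convert(None)
print(naive)
```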

We take a closer look at the row where `created_datetime` is NA:

In [20]:
df.loc[df['created_datetime'].isna(), ['created_t', 'created_datetime', 'last_modified_t', 'last_modified_datetime']]

Unnamed: 0,created_t,created_datetime,last_modified_t,last_modified_datetime
192048,0,,1488992055,2017-03-08T16:54:15Z


Although `created_t` is not NA in this row, its value is 0 (the Unix "epoch"), which most likely means the same thing as NA. We will fix this further below.

**Decision:** We decide to use the `created_t` column to generate a new `created_on_utc` column (type `Timestamp`) and to drop the `created_t` and `created_datetime` columns.

In [21]:
# Remove the time-zone information as it proved cumbersome:
df['created_on_utc'] = s1.dt.tz_convert(None)
# Move the new column after 'created_datetime':
df = df[reorder(df.columns.to_list(), 'created_on_utc', 'created_datetime')]
# Drop the old columns:
df = df.drop(columns=['created_t', 'created_datetime'])

We now look at the `last_modified_t` and `last_modified_datetime` columns.

**Comment:** For the `last_modified_t` column: All values are of type `int`. For the `last_modified_datetime` column: All values are of type `str`. We confirm that these columns agree:

In [22]:
s1 = parse_t(df['last_modified_t'])
s2 = parse_datetime(df['last_modified_datetime'])
(s1 == s2).all()

True

**Decision:** We decide to use the `last_modified_t` column to generate a new `last_modified_on_utc` column (type `Timestamp`) and to drop the `last_modified_t` and `last_modified_datetime` columns.

In [23]:
# Remove the time-zone information as it proved cumbersome:
df['last_modified_on_utc'] = s1.dt.tz_convert(None)
# Move the new column after 'last_modified_datetime':
df = df[reorder(df.columns.to_list(), 'last_modified_on_utc', 'last_modified_datetime')]
# Drop the old columns:
df = df.drop(columns=['last_modified_t', 'last_modified_datetime'])

Finally, we take care of the "epoch" issue mentioned above by replacing `created_on_utc` by `last_modified_on_utc`:

In [24]:
EPOCH = pd.Timestamp(0, unit='s')
df['created_on_utc'] = np.where(df['created_on_utc'] != EPOCH, df['created_on_utc'], df['last_modified_on_utc'])
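
The same replacement can also be written with `Series.where`, which keeps the values where the condition holds and takes the fallback elsewhere. This is only an alternative sketch on made-up data, not the notebook's code; the variable names are assumptions.

```python
import pandas as pd

EPOCH = pd.Timestamp(0, unit='s')

# Made-up example: the first record has the epoch as its creation time.
created = pd.Series(pd.to_datetime([0, 1489069957], unit='s'))
modified = pd.Series(pd.to_datetime([1488992055, 1489069957], unit='s'))

# Keep `created` where it differs from the epoch, otherwise fall back to `modified`:
fixed = created.where(created != EPOCH, modified)
print(fixed)
```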

#### Column: `product_name`
---

**Comment:** Non-NA values are of type `str` and there are < 5 % of NA values.

**Decision:** We decide to keep the column and all rows. We also make sure that there are no leading/trailing white spaces.

In [25]:
df['product_name'] = df['product_name'].str.strip()

#### Column: `generic_name`
---

**Comment:** Non-NA values are of type `str` and there are > 83 % of NA values.

**Decision:** The language seems to vary quite a lot among the entries. Maybe tellingly, this column is not even documented. We decide to drop the column.

In [26]:
df = df.drop(columns='generic_name')

#### Column: `quantity`
---

In [27]:
df[df['quantity'].notna()][['product_name', 'quantity']].head()

Unnamed: 0,product_name,quantity
0,Farine de blé noir,1kg
46,Naturablue original,250ml
47,Filet de bœuf,2.46 kg
48,Marks % Spencer 2 Blueberry Muffins,230g
51,Naturakrill original,60 capsules


**Comment:** Non-NA values are of type `str` and there are > 65 % of NA values. In most cases, the column indicates the quantity sold at once and the unit of measurement used.

**Decision:** We decide to salvage as many of the non-NA values as possible, as this might be useful information.

##### Exploration

After manually inspecting a few entries (spot-checking), we noticed that:
* There are invalid and/or incomplete entries (price in euros, product name instead of quantity, unitless quantity, etc.).
* Multiple languages are used (e.g. _320 г_ seems to mean 320 g in Russian).
* Metric and imperial units are used.

##### Implementation

**Note:** In order not to clutter the notebook, most of the code is in a separate module - see `quantity.py`.

In order to keep complexity under control, we take the following decisions:
* An entry must be a "valid number" followed by a "valid unit". White spaces are allowed and ignored. Additional information at the end is allowed and ignored. Letter case is ignored.
* A "valid number" is any string that matches `r'\d+(?:[.,]\d*)?'`.
* A "valid unit" is any string in `VALID_UNITS` (see code).
* Since imperial units differ between UK, US and USC, we decide to *ignore* those (see e.g. [How US labelling requirements undermine honest labelling in the UK](http://metricviews.org.uk/2013/03/how-us-labelling-requirements-undermine-honest-labelling-in-the-uk/)).

**Desired output:** 2 columns: `quantity_number` (type `float`) and `quantity_unit` (type `category`, either `g` or `l`). In the process, we convert all weights to grams and all volumes to liters.
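
The rules above can be sketched as follows. This is a simplified, hypothetical version: the unit table `UNITS_SKETCH` covers only a handful of metric units and is not the `VALID_UNITS` of `quantity.py`, and the function name `parse_quantity_sketch` is an assumption. The number pattern is the one given above; the negative look-ahead after the unit is an extra assumption that prevents e.g. *gallon* from being read as *g*.

```python
import re
import typing as t

# Hypothetical unit table: unit -> (factor to base unit, base unit).
UNITS_SKETCH = {
    'kg': (1000.0, 'g'),
    'mg': (0.001, 'g'),
    'cl': (0.01, 'l'),
    'ml': (0.001, 'l'),
    'g': (1.0, 'g'),
    'l': (1.0, 'l'),
}

NUMBER = r'\d+(?:[.,]\d*)?'
# Longer units first so that 'kg' is tried before 'g':
UNIT = '|'.join(sorted(UNITS_SKETCH, key=len, reverse=True))
PATTERN = re.compile(rf'\s*({NUMBER})\s*({UNIT})(?![a-z])', re.IGNORECASE)


def parse_quantity_sketch(text: str) -> t.Optional[t.Tuple[float, str]]:
    """Return (number in base unit, base unit) or None if the entry is not understood."""
    match = PATTERN.match(text)
    if match is None:
        return None
    number = float(match.group(1).replace(',', '.'))
    factor, base_unit = UNITS_SKETCH[match.group(2).lower()]
    return number * factor, base_unit


for entry in ['1kg', '250ml', '2.46 kg', '2,5 L', '60 capsules', '320 г', '1 gallon']:
    print(f'{entry!r}: {parse_quantity_sketch(entry)}')
```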

##### Execution

In [28]:
ninitial = df['quantity'].notna().sum()

df_qty = quantity.clean(df['quantity'])
df['quantity_number'] = df_qty['number']
df['quantity_unit'] = df_qty['unit'].astype('category')
# Move the new columns after 'quantity':
df = df[reorder(df.columns.to_list(), 'quantity_number', 'quantity')]
df = df[reorder(df.columns.to_list(), 'quantity_unit', 'quantity_number')]
# Drop the old column:
df = df.drop(columns='quantity')

nstandardized = df_qty['number'].notna().sum()
pstandardized = nstandardized / ninitial * 100
print(f"entries (initial): {ninitial}")
print(f"entries (standardized): {nstandardized} ({pstandardized:.2f} % of initial)")

entries (initial): 119262
entries (standardized): 106470 (89.27 % of initial)


##### Result

We check the result:

In [29]:
(df[['quantity_number', 'quantity_unit']]
 .groupby(by='quantity_unit')
 .describe())

Unnamed: 0_level_0,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
quantity_unit,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
g,86658.0,403.239589,8037.160368,0.0,150.0,250.0,420.0,1390000.0
l,19812.0,0.809454,3.336295,0.0,0.35,0.75,1.0,450.0


After manually inspecting entries above 10 kg (< 30 entries) or above 12 ℓ (< 20 entries), we noticed that:
* **Weight:** Some entries seem to contain food for *animals*, not for humans (e.g. for `code`: *289259*, we have `categories`: *aliment pour chevaux* (*food for horses*)). Some values do make sense (e.g. a 25 kg bag of flour for a baker) while others clearly do not.
* **Volume:** Some values do make sense (e.g. a 20 ℓ barrel of wine) while others clearly do not.

**Decision:**
* **Weight:** We decide to replace a weight of 0 by NA and to drop records above 10 kg, except when `product_name` contains the word *farine* (*flour*).
* **Volume:** We decide to replace 0 by NA and to drop records above 12 ℓ except when `product_name` contains the word *tonneau* (*barrel*).

In [30]:
df.loc[df['quantity_number'] == 0, ['quantity_number', 'quantity_unit']] = np.nan

cond_weight = ((df['quantity_unit'] == 'g')
          & (df['quantity_number'] > 10e3)
          & ~df['product_name'].str.contains('farine', case=False, na=False, regex=False))

cond_volume = ((df['quantity_unit'] == 'l')
          & (df['quantity_number'] > 12)
          & ~df['product_name'].str.contains('tonneau', case=False, na=False, regex=False))

df = df.drop(df[cond_weight | cond_volume].index, axis=0)
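
To make the effect of these rules concrete, here is the same logic applied to a made-up toy data-frame (the product names and quantities are invented for illustration). Note that `na=False` makes a missing `product_name` count as "does not contain the keyword", so such rows are dropped when they exceed the threshold.

```python
import numpy as np
import pandas as pd

# Made-up records:
toy = pd.DataFrame({
    'product_name': ['Farine de blé', 'Horse feed', 'Juice', None],
    'quantity_number': [25000.0, 20000.0, 0.0, 15.0],
    'quantity_unit': ['g', 'g', 'l', 'l'],
})

# A quantity of 0 carries no information and is replaced by NA:
toy.loc[toy['quantity_number'] == 0, ['quantity_number', 'quantity_unit']] = np.nan

too_heavy = ((toy['quantity_unit'] == 'g')
             & (toy['quantity_number'] > 10e3)
             & ~toy['product_name'].str.contains('farine', case=False, na=False, regex=False))

too_large = ((toy['quantity_unit'] == 'l')
             & (toy['quantity_number'] > 12)
             & ~toy['product_name'].str.contains('tonneau', case=False, na=False, regex=False))

kept = toy.drop(toy[too_heavy | too_large].index, axis=0)
print(kept)
```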

### Cleaning - Column group: Tags<a name="task-a-cleaning-tags"></a>  ([top](#top))
---

We first "profile" the columns in question:

In [31]:
columns = df.columns.to_series()['packaging': 'countries_en'].to_list()
style_percentages(utils.profile(df[columns]))

Unnamed: 0,Types,NA,NA %,Non-NA,Non-NA %
packaging,"{'float': 266026, 'str': 89945}",266026,74.73 %,89945,25.27 %
packaging_tags,"{'float': 266026, 'str': 89945}",266026,74.73 %,89945,25.27 %
brands,"{'str': 326957, 'float': 29014}",29014,8.15 %,326957,91.85 %
brands_tags,"{'str': 326937, 'float': 29034}",29034,8.16 %,326937,91.84 %
categories,"{'float': 252712, 'str': 103259}",252712,70.99 %,103259,29.01 %
categories_tags,"{'float': 252712, 'str': 103259}",252712,70.99 %,103259,29.01 %
categories_en,"{'float': 252712, 'str': 103259}",252712,70.99 %,103259,29.01 %
origins,"{'float': 330955, 'str': 25016}",330955,92.97 %,25016,7.03 %
origins_tags,"{'float': 330993, 'str': 24978}",330993,92.98 %,24978,7.02 %
manufacturing_places,"{'float': 313994, 'str': 41977}",313994,88.21 %,41977,11.79 %


**Comment:** Non-NA values are of type `str`. After manually inspecting a few entries (spot-checking), we noticed that:
* The `first_packaging_code_geo` column contains geographic coordinates and will need to be treated separately.
* Some entries in a given `(column)` contain language prefixes. Example: for `code`: *190* we have `categories`: *en:beverages*.
* The number of entries in a given `(column)` and the number of entries in the corresponding `(column)_tags` do not always agree.
* The `(column)_en` column, which (according to the documentation) contains *"the set of tags in that language"* (i.e. English), sometimes contains tags labeled with another language (maybe because a tag in English is not readily available). Example: for `code`: *452*, we have `product_name`: *Foie gras canard Périgord* and:
  * `categories`: *Foie gras de canard*
  * `categories_tags`: *en:fish-and-meat-and-eggs,fr:foies-gras,fr:foies-gras-de-canard*
  * `categories_en`: *Fish and meat and eggs,fr:Foies gras,fr:Foies gras de canard*
* The `stores` and `purchase_places` columns do not have a corresponding `(column)_tags` column.

We first look at columns other than `first_packaging_code_geo`.

In [32]:
columns = df.columns.to_series()['packaging': 'countries_en'].to_list()
columns.remove('first_packaging_code_geo')

**Decision:** We keep all columns and all rows. We consider that all `(column)`, `(column)_tags` and `(column)_en` are comma-delimited lists with entries potentially prefixed with a 2-letter language code. We also make sure that there are no leading/trailing white spaces and that the prefix (if any) is indeed 2 letters long.

**Note:** In order not to clutter the notebook, most of the code is in a separate module - see `tags.py`.

In [33]:
df[columns] = df[columns].applymap(tags.clean_tags)
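
Since `tags.py` is not shown, here is a hypothetical sketch of what cleaning a single cell might involve under the decision above. The name `clean_tags_sketch` and every detail of its behaviour are assumptions: entries are split on commas and stripped, a 2-letter language prefix is lower-cased, empty entries are discarded and anything that is not a string (i.e. NA) is returned unchanged. How the real `tags.clean_tags` treats malformed prefixes is not known from this notebook.

```python
import re
import typing as t

# A 2-letter language prefix followed by a colon, e.g. 'en:' or 'fr:':
PREFIX = re.compile(r'^([a-z]{2}):(.*)$', re.IGNORECASE)


def clean_tags_sketch(value: t.Any) -> t.Any:
    """Hypothetical cleaner for one comma-delimited cell; NA values pass through."""
    if not isinstance(value, str):
        return value
    cleaned = []
    for entry in value.split(','):
        entry = entry.strip()
        if not entry:
            continue
        match = PREFIX.match(entry)
        if match:
            entry = f'{match.group(1).lower()}:{match.group(2).strip()}'
        cleaned.append(entry)
    return ','.join(cleaned) if cleaned else float('nan')


print(clean_tags_sketch(' en:beverages , FR:foies-gras '))
print(clean_tags_sketch('Fish and meat and eggs,fr:Foies gras'))
```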

We now look at the `first_packaging_code_geo` column.

**Comment:** Non-NA values are of type `str` and there are > 94 % of NA values.

**Decision:** We decide to use the `first_packaging_code_geo` column to generate 2 new `first_packaging_code_lat` (type `float`) and `first_packaging_code_lon` (type `float`) columns and to drop the `first_packaging_code_geo` column.

In [34]:
def to_coordinates(values: pd.Series) -> t.Tuple[float, float]:
    """\
    If a string contains valid geographic coordinates, return a pair (lat, lon).
    """
    na = np.nan, np.nan
    
    text = values['first_packaging_code_geo']
    if pd.isna(text):
        return na
    
    tokens = text.split(',')
    if len(tokens) != 2:
        return na
    lat_token, lon_token = tokens
    try:
        lat = float(lat_token)
        lon = float(lon_token)
    except ValueError:
        return na
    if not (-90.0 <= lat <= 90.0):
        return na
    if not (-180.0 <= lon <= 180.0):
        return na
    return lat, lon

In [35]:
ninitial = df['first_packaging_code_geo'].notna().sum()

df_coords = (
    df[['first_packaging_code_geo']]
    .apply(to_coordinates, axis=1, result_type='expand')
    .rename(columns={0: 'lat', 1: 'lon'})
)
df['first_packaging_code_lat'] = df_coords['lat']
df['first_packaging_code_lon'] = df_coords['lon']
# Move the new columns after 'first_packaging_code_geo':
df = df[reorder(df.columns.to_list(), 'first_packaging_code_lat', 'first_packaging_code_geo')]
df = df[reorder(df.columns.to_list(), 'first_packaging_code_lon', 'first_packaging_code_lat')]
# Drop the old column:
df = df.drop(columns='first_packaging_code_geo')

nstandardized = df_coords['lon'].notna().sum()
pstandardized = nstandardized / ninitial * 100
print(f"entries (initial): {ninitial}")
print(f"entries (standardized): {nstandardized} ({pstandardized:.2f} % of initial)")

entries (initial): 20868
entries (standardized): 20868 (100.00 % of initial)
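
For larger data-frames, a vectorized variant avoids calling a Python function once per row. The sketch below is one possible alternative, not the notebook's method, and it is slightly stricter than `to_coordinates`: the regular expression only accepts plain decimal numbers, whereas `float()` also accepts forms such as `1e3`. The sample values other than the first are made up.

```python
import pandas as pd

# The first value appears in the dataset; the others are made-up edge cases.
geo = pd.Series(['43.400279,0.199525', None, '43.4;0.2', '95.0,10.0', '1,2,3'])

# Exactly two decimal numbers separated by a comma; anything else yields NaN:
pattern = r'^\s*([-+]?\d+(?:\.\d+)?)\s*,\s*([-+]?\d+(?:\.\d+)?)\s*$'
parts = geo.str.extract(pattern).astype(float)
lat, lon = parts[0], parts[1]

# Out-of-range pairs are rejected as a whole (NaN compares as not in range):
valid = lat.between(-90.0, 90.0) & lon.between(-180.0, 180.0)
coords = pd.DataFrame({'lat': lat.where(valid), 'lon': lon.where(valid)})
print(coords)
```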


### Cleaning - Column group: Ingredients<a name="task-a-cleaning-ingredients"></a>  ([top](#top))
---

We first "profile" the columns in question:

In [36]:
columns = df.columns.to_series()['ingredients_text': 'traces_en'].to_list()
style_percentages(utils.profile(df[columns]))

Unnamed: 0,Types,NA,NA %,Non-NA,Non-NA %
ingredients_text,"{'str': 283884, 'float': 72087}",72087,20.25 %,283884,79.75 %
allergens,"{'float': 318793, 'str': 37178}",318793,89.56 %,37178,10.44 %
traces,"{'float': 327578, 'str': 28393}",327578,92.02 %,28393,7.98 %
traces_tags,"{'float': 327579, 'str': 28392}",327579,92.02 %,28392,7.98 %
traces_en,"{'float': 327579, 'str': 28392}",327579,92.02 %,28392,7.98 %


**Comment:** We can divide columns in this group into a few sub-groups.

#### Column: `ingredients_text`
---

**Comment:** Non-NA values are of type `str` and there are > 20 % of NA values.

**Decision:** We decide to keep the column and all rows as-is for the moment (task C). We also make sure that there are no leading/trailing white spaces.

In [37]:
df['ingredients_text'] = df['ingredients_text'].str.strip()

#### Column: `allergens`
---

**Comment:** Non-NA values are of type `str` and there are > 89 % of NA values.

In [38]:
df[df['allergens'].notna()][['product_name', 'allergens']].head()

Unnamed: 0,product_name,allergens
186,Biscuits sablés fourrage au cacao,"Blé, Beurre, Oeufs, Noisette"
199,Côtes du Rhône Villages 2014,sulfites
223,Belgische Pralinen,"Vollmilchpulver, Vollmilchpulver, Soja, Butter..."
227,Luxury Christmas Pudding,"Wheat Flour, Milk, Walnuts, Almonds, Nut, Whea..."
228,Luxury Christmas Pudding,"Wheat Flour, Milk, Walnuts, Almonds, Nut, Whea..."


**Decision:** This is very similar to `ingredients_text` and we decide to keep the column and all rows as-is. We also make sure that there are no leading/trailing white spaces.

In [39]:
df['allergens'] = df['allergens'].str.strip()

#### Columns: "tags" columns
---

In [40]:
columns = [
    'traces',
    'traces_tags',
    'traces_en'
]

**Comment**: This is very similar to the columns of the [Column group: Tags](#task-a-cleaning-tags) above.

**Decision:** Same as for the [Column group: Tags](#task-a-cleaning-tags) columns.

In [41]:
df[columns] = df[columns].applymap(tags.clean_tags)

### Cleaning - Column group: Miscellaneous data<a name="task-a-cleaning-miscellaneous-data"></a> ([top](#top))
---

We first "profile" the columns in question:

In [42]:
columns = df.columns.to_series()['serving_size': 'image_small_url'].to_list()
style_percentages(utils.profile(df[columns]))

Unnamed: 0,Types,NA,NA %,Non-NA,Non-NA %
serving_size,"{'str': 216616, 'float': 139355}",139355,39.15 %,216616,60.85 %
additives_n,{'float': 355971},72087,20.25 %,283884,79.75 %
additives,"{'str': 283842, 'float': 72129}",72129,20.26 %,283842,79.74 %
additives_tags,"{'float': 185759, 'str': 170212}",185759,52.18 %,170212,47.82 %
additives_en,"{'float': 185759, 'str': 170212}",185759,52.18 %,170212,47.82 %
ingredients_from_palm_oil_n,{'float': 355971},72087,20.25 %,283884,79.75 %
ingredients_from_palm_oil_tags,"{'float': 349399, 'str': 6572}",349399,98.15 %,6572,1.85 %
ingredients_that_may_be_from_palm_oil_n,{'float': 355971},72087,20.25 %,283884,79.75 %
ingredients_that_may_be_from_palm_oil_tags,"{'float': 341642, 'str': 14329}",341642,95.97 %,14329,4.03 %
nutrition_grade_fr,"{'str': 254877, 'float': 101094}",101094,28.40 %,254877,71.60 %


**Comment:** We can divide columns in this group into a few sub-groups.

#### Column: `serving_size`
---

In [43]:
df[df['serving_size'].notna()][['product_name', 'serving_size']].head()

Unnamed: 0,product_name,serving_size
1,Banana Chips Sweetened (Whole),28 g (1 ONZ)
2,Peanuts,28 g (0.25 cup)
3,Organic Salted Nut Mix,28 g (0.25 cup)
4,Organic Polenta,35 g (0.25 cup)
5,Breadshop Honey Gone Nuts Granola,52 g (0.5 cup)


**Comment:** Non-NA values are of type `str` and there are > 39 % of NA values. In most cases, the column indicates the quantity for one serving and the unit of measurement used. This is very similar to the `quantity` column above.

**Decision:** Same as for the `quantity` column.

In [44]:
ninitial = df['serving_size'].notna().sum()

df_qty = quantity.clean(df['serving_size'])
df['serving_number'] = df_qty['number']
df['serving_unit'] = df_qty['unit'].astype('category')
df = df.drop(columns='serving_size')

nstandardized = df_qty['number'].notna().sum()
pstandardized = nstandardized / ninitial * 100

print(f"entries (initial): {ninitial}")
print(f"entries (standardized): {nstandardized} ({pstandardized:.2f} % of initial)")

entries (initial): 216616
entries (standardized): 211682 (97.72 % of initial)
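The `quantity` module is not shown in this notebook. As a rough illustration of the idea only, the following sketch splits a serving string into a number and a unit. The regular expression and the helper `parse_quantity` are hypothetical and much simpler than what `quantity.clean` presumably does (for instance, no units are converted here).

```python
import re
import typing as t

import numpy as np
import pandas as pd

# Hypothetical pattern: a leading number (decimal point or comma) followed by a unit.
_QTY_PATTERN = re.compile(r'^\s*(\d+(?:[.,]\d+)?)\s*([a-zA-Z]+)')


def parse_quantity(text: t.Any) -> t.Tuple[float, t.Any]:
    """Split e.g. '28 g (1 ONZ)' into (28.0, 'g'); return (NaN, NaN) on failure."""
    if not isinstance(text, str):
        return (np.nan, np.nan)
    match = _QTY_PATTERN.match(text)
    if match is None:
        return (np.nan, np.nan)
    number = float(match.group(1).replace(',', '.'))
    return (number, match.group(2).lower())


servings = pd.Series(['28 g (1 ONZ)', '52 g (0.5 cup)', '0,5 L', 'one slice', np.nan])
parsed = pd.DataFrame(
    servings.map(parse_quantity).to_list(),
    columns=['number', 'unit'],
    index=servings.index,
)
print(parsed)
```

Entries that do not start with a number (such as `'one slice'`) end up as NA, which is one way in which some of the initial entries can fail to be standardized.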


#### Column: `nutrition_grade_fr`
---

In [45]:
df[df['nutrition_grade_fr'].notna()][['product_name', 'nutrition_grade_fr']].head()

Unnamed: 0,product_name,nutrition_grade_fr
1,Banana Chips Sweetened (Whole),d
2,Peanuts,b
3,Organic Salted Nut Mix,d
7,Organic Muesli,c
12,Zen Party Mix,d


**Comment:** Non-NA values are of type `str` and there are > 28 % of NA values. This column contains the [Nutri-Score](https://quoidansmonassiette.fr/comment-est-calcule-le-nutri-score-logo-nutritionnel/).

**Decision:** We decide to keep the column and to replace invalid scores (if any) by NA. (We could check that the mapping is consistent with the column *nutrition-score-fr_100g* but we will not pursue this any further.) We also convert to categorical data.

In [46]:
# Check:
cond_is_invalid = ~df['nutrition_grade_fr'].isin(frozenset([np.nan, 'a', 'b', 'c', 'd', 'e']))
print(f'found {cond_is_invalid.sum()} invalid entry(-ies)')
df.loc[cond_is_invalid, ['nutrition_grade_fr']] = np.nan
# Convert:
df['nutrition_grade_fr'] = df['nutrition_grade_fr'].astype('category')

found 0 invalid entry(-ies)
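Since the grades have a natural order, an ordered categorical type could be considered as an alternative. The sketch below runs on toy data and is not what the notebook does. Note that an ordering defined this way would probably not survive an export in which the data type is recorded as the plain string `'category'`.

```python
import numpy as np
import pandas as pd

# Ordered categories: 'a' is the best grade, 'e' the worst.
grade_dtype = pd.CategoricalDtype(categories=['a', 'b', 'c', 'd', 'e'], ordered=True)

grades = pd.Series(['d', 'b', np.nan, 'x', 'a'])
# Values that are not among the categories (here 'x') become NA during the conversion.
grades = grades.astype(grade_dtype)

n_na = grades.isna().sum()
best = grades.min()
n_at_most_b = (grades <= 'b').sum()
print(n_na, best, n_at_most_b)
```

With an ordered type, the replacement of invalid scores by NA and the conversion happen in a single step, and comparisons such as `grades <= 'b'` become possible.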


#### Columns: `pnns_groups_1`, `pnns_groups_2`
---

In [47]:
df[~df['pnns_groups_1'].isin([np.nan, 'unknown'])][[
    'product_name',
    'pnns_groups_1',
    'pnns_groups_2'
]].head()

Unnamed: 0,product_name,pnns_groups_1,pnns_groups_2
176,Salade Cesar,Fruits and vegetables,Vegetables
177,Danoises à la cannelle roulées,Sugary snacks,Biscuits and cakes
179,Flute,Cereals and potatoes,Bread
182,Chaussons tressés aux pommes,Sugary snacks,Biscuits and cakes
184,lentilles vertes,Cereals and potatoes,Legumes


**Comment:** Non-NA values are of type `str` and there are > 62 % and > 63 % of NA values, respectively. These columns refer to the "Programme National Nutrition Santé" (PNNS) food groups.

We inspect the groups:

In [48]:
df['pnns_groups_1'].value_counts()

unknown                    43600
Sugary snacks              14749
Beverages                  13473
Milk and dairy products    10757
Cereals and potatoes       10076
Fish Meat Eggs              9470
Composite foods             7972
Fat and sauces              7118
Fruits and vegetables       6763
Salty snacks                3299
fruits-and-vegetables       1097
sugary-snacks                619
cereals-and-potatoes          19
salty-snacks                   1
Name: pnns_groups_1, dtype: int64

In [49]:
df['pnns_groups_2'].value_counts()

unknown                             43600
Non-sugared beverages                7287
One-dish meals                       6495
Sweets                               5684
Biscuits and cakes                   5511
Cereals                              4685
Cheese                               4564
Dressings and sauces                 4521
Milk and yogurt                      4275
Processed meat                       3835
Alcoholic beverages                  3608
Chocolate products                   3554
Vegetables                           3159
Fish and seafood                     3069
Sweetened beverages                  2983
Fats                                 2597
Appetizers                           2464
Fruits                               2419
Fruit juices                         2349
Bread                                2336
Meat                                 1948
Breakfast cereals                    1760
Legumes                              1098
vegetables                        

**Decision:** There are many *unknown* entries and one could wonder whether this means the same as NA. For the time being we decide to keep both. We also notice that some entries are duplicated (different case, with/without dashes, etc.). We decide to normalize the entries. We also make sure that there are no leading/trailing white spaces.

In [50]:
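def normalize_pnns(group: str) -> str:
    """Lower-case a PNNS group name and join its words with dashes; NA is returned unchanged."""
    if pd.isna(group):
        return group
    # split() without arguments also discards leading/trailing white spaces.
    parts = group.lower().split()
    return '-'.join(parts)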

In [51]:
df['pnns_groups_1'] = df['pnns_groups_1'].apply(normalize_pnns)
df['pnns_groups_2'] = df['pnns_groups_2'].apply(normalize_pnns)
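To illustrate the effect of the normalization, here is a small self-contained sketch on toy values (the function is repeated so that the example runs on its own):

```python
import numpy as np
import pandas as pd


def normalize_pnns(group):
    if pd.isna(group):
        return group
    return '-'.join(group.lower().split())


groups = pd.Series(['Sugary snacks', 'sugary-snacks', ' Salty snacks ', 'unknown', np.nan])
normalized = groups.apply(normalize_pnns)
print(normalized.value_counts(dropna=False))
```

Both spellings of the same group collapse into a single dashed, lower-case form, while *unknown* and NA entries are left untouched.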

#### Columns: "numbers" columns
---

In [52]:
columns = [
    'additives_n',
    'ingredients_from_palm_oil_n',
    'ingredients_that_may_be_from_palm_oil_n'
]

**Comment:** Non-NA values are of type `float` and there are > 20 % of NA values.

**Decision:** We decide to keep all columns and to replace invalid values (if any) by NA.

In [53]:
df_is_invalid = (df[columns] < 0)
n_invalid_entries = df_is_invalid.sum(axis='columns').sum()
n_invalid_rows = df_is_invalid.any(axis='columns').sum()
print(f'found {n_invalid_entries} invalid entry(-ies) in {n_invalid_rows} invalid row(s)')
df[columns] = df[columns].mask(df_is_invalid, np.nan)

found 0 invalid entry(-ies) in 0 invalid row(s)


We check the result:

In [54]:
df[columns].describe()

Unnamed: 0,additives_n,ingredients_from_palm_oil_n,ingredients_that_may_be_from_palm_oil_n
count,283884.0,283884.0,283884.0
mean,1.877267,0.023429,0.059736
std,2.501347,0.153089,0.280657
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,1.0,0.0,0.0
75%,3.0,0.0,0.0
max,30.0,2.0,6.0


Values seem to be in a reasonable range.

#### Columns: "tags" columns
---

In [55]:
columns = [
    'additives',
    'additives_tags',
    'additives_en',
    'ingredients_from_palm_oil_tags',
    'ingredients_that_may_be_from_palm_oil_tags',
    'states',
    'states_tags',
    'states_en',
    'main_category',
    'main_category_en'
]

**Comment:** This is very similar to the columns of the [Column group: Tags](#task-a-cleaning-tags) above.

**Decision:** Same as for the [Column group: Tags](#task-a-cleaning-tags) columns.

In [56]:
df[columns] = df[columns].applymap(tags.clean_tags)

#### Columns: `image_url`, `image_small_url`
---

**Comment:** Non-NA values are of type `str` and there are > 78 % of NA values. This is very similar to the `url` column above.

**Decision:** Same as for the `url` column.

In [57]:
df['image_url'] = df['image_url'].str.strip()

cond_is_invalid_url = ~df['image_url'].map(is_valid_url)
print(f'found {cond_is_invalid_url.sum()} invalid URL(s)')
df.loc[cond_is_invalid_url, ['image_url']] = np.nan

found 0 invalid URL(s)


In [58]:
df['image_small_url'] = df['image_small_url'].str.strip()

cond_is_invalid_url = ~df['image_small_url'].map(is_valid_url)
print(f'found {cond_is_invalid_url.sum()} invalid URL(s)')
df.loc[cond_is_invalid_url, ['image_small_url']] = np.nan

found 0 invalid URL(s)
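The helper `is_valid_url` is defined earlier in the notebook. For readers who only see this part, the following is a hypothetical re-implementation of such a check (the name `is_valid_url_sketch` and the exact rules are assumptions). NA entries are treated as valid here, which appears consistent with the fact that no entry is flagged although most values are NA.

```python
import typing as t
import urllib.parse

import numpy as np


def is_valid_url_sketch(url: t.Any) -> bool:
    """Hypothetical check: strings need an http(s) scheme and a host; NA passes through."""
    if not isinstance(url, str):
        return True  # NA entries are not flagged as invalid in this sketch.
    try:
        parts = urllib.parse.urlparse(url)
    except ValueError:
        return False
    return parts.scheme in ('http', 'https') and bool(parts.netloc)


print(is_valid_url_sketch('https://static.openfoodfacts.org/images/x.jpg'))
print(is_valid_url_sketch('not a url'))
```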


### Cleaning - Column group: Nutrition facts<a name="task-a-cleaning-nutrition-facts"></a>  ([top](#top))
---

We first "profile" the columns in question:

In [59]:
columns = df.columns.to_series()['energy_100g': 'nutrition-score-uk_100g'].to_list()
style_percentages(utils.profile(df[columns]))

Unnamed: 0,Types,NA,NA %,Non-NA,Non-NA %
energy_100g,{'float': 355971},60585,17.02 %,295386,82.98 %
energy-from-fat_100g,{'float': 355971},355102,99.76 %,869,0.24 %
fat_100g,{'float': 355971},76455,21.48 %,279516,78.52 %
saturated-fat_100g,{'float': 355971},92127,25.88 %,263844,74.12 %
-butyric-acid_100g,{'float': 355971},355971,100.00 %,0,0.00 %
-caproic-acid_100g,{'float': 355971},355971,100.00 %,0,0.00 %
-caprylic-acid_100g,{'float': 355971},355970,100.00 %,1,0.00 %
-capric-acid_100g,{'float': 355971},355969,100.00 %,2,0.00 %
-lauric-acid_100g,{'float': 355971},355967,100.00 %,4,0.00 %
-myristic-acid_100g,{'float': 355971},355970,100.00 %,1,0.00 %


**Comment:** We can divide columns in this group into a few sub-groups.

#### Columns: `energy_100g`, `energy-from-fat_100g`
---

**Comment:** Non-NA values are of type `float` and there are > 17 % and > 99 % of NA values. We look at the distribution of values:

In [60]:
columns = ['energy_100g', 'energy-from-fat_100g']
df[columns].describe()

Unnamed: 0,energy_100g,energy-from-fat_100g
count,295386.0,869.0
mean,1125.381352,587.216617
std,936.808583,713.255708
min,0.0,0.0
25%,381.0,49.4
50%,1092.0,300.0
75%,1674.0,900.0
max,231199.0,3830.0


**Decision:** Turning to the Internet, we found out that there are ca. 9000 kcal in 1 kg of pure fat, i.e. 9000 / 10 × 4.184 ≈ 3766 kJ in 100 g of pure fat. So anything with more than 4000 kJ per 100 g or 100 ml is most likely an error. We decide to keep both columns and to drop the rows with more than 4000 kJ per 100 g or 100 ml.

In [61]:
cond_is_invalid = (df['energy_100g'] > 4e3)
print(f"'energy_100g': found {cond_is_invalid.sum()} invalid row(s)")
df = df.drop(df[cond_is_invalid].index, axis=0)

cond_is_invalid = (df['energy-from-fat_100g'] > 4e3)
print(f"'energy-from-fat_100g': found {cond_is_invalid.sum()} invalid row(s)")
df = df.drop(df[cond_is_invalid].index, axis=0)

'energy_100g': found 113 invalid row(s)
'energy-from-fat_100g': found 0 invalid row(s)
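The arithmetic behind the 4000 kJ threshold can be reproduced in a few lines. The figure of ca. 9 kcal per gram of fat is the approximation quoted in the decision above; the constant names are only illustrative.

```python
KJ_PER_KCAL = 4.184
KCAL_PER_G_FAT = 9.0  # approximate energy density of pure fat (assumption from the text)

# Energy of 100 g of pure fat, in kJ.
kj_per_100g_fat = KCAL_PER_G_FAT * 100 * KJ_PER_KCAL
print(f'{kj_per_100g_fat:.1f} kJ per 100 g')

# The threshold leaves some margin above the theoretical maximum.
threshold_kj = 4e3
print(kj_per_100g_fat < threshold_kj)
```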


#### Columns: Nutrients
---

**Comment:** All values are of type `float`.

In [62]:
columns = df.columns.to_series()['fat_100g': 'cocoa_100g'].to_list()
columns.remove('ph_100g')

**Decision:** We decide to keep all columns and to drop invalid rows, i.e. rows that contain entries outside the range *[0, 100]*. Note that this test is also suitable for columns that, despite their name, contain a percentage (`alcohol_100g` - % alcohol by volume, `fruits-vegetables-nuts_100g` - % of fruits, vegetables and nuts, `fruits-vegetables-nuts-estimate_100g` - % of fruits, vegetables and nuts (estimate), `collagen-meat-protein-ratio_100g` - % of collagen in meat protein, `cocoa_100g` - % cocoa).

In [63]:
df_is_invalid = (df[columns] < 0) | (df[columns] > 100)
n_invalid_entries = df_is_invalid.sum(axis='columns').sum()
n_invalid_rows = df_is_invalid.any(axis='columns').sum()
print(f'found {n_invalid_entries} invalid entry(-ies) in {n_invalid_rows} invalid row(s)')
df = df.drop(df[df_is_invalid.any(axis='columns')].index, axis=0)

found 315 invalid entry(-ies) in 259 invalid row(s)
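The number of invalid entries can exceed the number of invalid rows because a single row may contain several out-of-range values. A minimal sketch on toy data (the values are made up) shows the difference:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'fat_100g': [10.0, 120.0, np.nan, 3.0],
    'sugars_100g': [5.0, -1.0, 50.0, 101.0],
})

# Comparisons with NA evaluate to False, so NA entries are never flagged.
is_invalid = (toy < 0) | (toy > 100)
n_invalid_entries = int(is_invalid.sum().sum())
n_invalid_rows = int(is_invalid.any(axis='columns').sum())
cleaned = toy[~is_invalid.any(axis='columns')]

print(f'found {n_invalid_entries} invalid entry(-ies) in {n_invalid_rows} invalid row(s)')
```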


We also make sure that, when a nutrient is broken down into components, the sum of the components does not exceed the amount of nutrient itself.

In [64]:
nutrient = 'saturated-fat_100g'
components = df.columns.to_series()['-butyric-acid_100g': '-melissic-acid_100g'].to_list()
cond_is_invalid = (df[components].sum(axis='columns') > df[nutrient])
print(f'found {cond_is_invalid.sum()} invalid row(s)')
df = df.drop(df[cond_is_invalid].index, axis=0)

found 1 invalid row(s)


In [65]:
nutrient = 'omega-3-fat_100g'
components = df.columns.to_series()['-alpha-linolenic-acid_100g': '-docosahexaenoic-acid_100g'].to_list()
cond_is_invalid = (df[components].sum(axis='columns') > df[nutrient])
print(f'found {cond_is_invalid.sum()} invalid row(s)')
df = df.drop(df[cond_is_invalid].index, axis=0)

found 2 invalid row(s)


In [66]:
nutrient = 'omega-6-fat_100g'
components = df.columns.to_series()['-linoleic-acid_100g': '-dihomo-gamma-linolenic-acid_100g'].to_list()
cond_is_invalid = (df[components].sum(axis='columns') > df[nutrient])
print(f'found {cond_is_invalid.sum()} invalid row(s)')
df = df.drop(df[cond_is_invalid].index, axis=0)

found 0 invalid row(s)


In [67]:
nutrient = 'omega-9-fat_100g'
components = df.columns.to_series()['-oleic-acid_100g': '-nervonic-acid_100g'].to_list()
cond_is_invalid = (df[components].sum(axis='columns') > df[nutrient])
print(f'found {cond_is_invalid.sum()} invalid row(s)')
df = df.drop(df[cond_is_invalid].index, axis=0)

found 1 invalid row(s)


In [68]:
nutrient = 'sugars_100g'
components = df.columns.to_series()['-sucrose_100g': '-maltodextrins_100g'].to_list()
cond_is_invalid = (df[components].sum(axis='columns') > df[nutrient])
print(f'found {cond_is_invalid.sum()} invalid row(s)')
df = df.drop(df[cond_is_invalid].index, axis=0)

found 22 invalid row(s)
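The five checks above follow the same pattern and could be factored into a small helper. The sketch below is one possible refactoring on toy data, not the notebook's code; the function name and the optional tolerance `tol` (meant to absorb rounding noise) are assumptions.

```python
import typing as t

import numpy as np
import pandas as pd


def components_exceed_total(
    df: pd.DataFrame, total: str, components: t.List[str], tol: float = 0.0
) -> pd.Series:
    """Flag rows where the sum of the components exceeds the total (plus a tolerance)."""
    # sum() skips NA, so rows without any component sum to 0;
    # a comparison against an NA total is False, so such rows are never flagged.
    return df[components].sum(axis='columns') > df[total] + tol


toy = pd.DataFrame({
    'sugars_100g': [10.0, 5.0, np.nan, 8.0],
    '-sucrose_100g': [4.0, 4.0, 3.0, np.nan],
    '-glucose_100g': [5.0, 2.0, 1.0, np.nan],
})
flagged = components_exceed_total(toy, 'sugars_100g', ['-sucrose_100g', '-glucose_100g'])
cleaned = toy[~flagged]
print(f'found {flagged.sum()} invalid row(s)')
```

Only the second row is flagged: its components add up to 6 g while the reported total is 5 g. Rows with a missing total or without components are kept.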


#### Column: `ph_100g`
---

**Comment:** All values are of type `float`.

**Decision:** Turning to the Internet (e.g. [pH](https://en.wikipedia.org/wiki/PH) on Wikipedia), we found out that the usual pH range is *[0, 14]* but since we are dealing with food products, anything with a pH outside the range *[2, 12]* is most likely an error. We decide to keep the column and to drop the rows with a pH outside the range *[2, 12]*.

In [69]:
cond_is_invalid = (df['ph_100g'] < 2) | (df['ph_100g'] > 12)
n_invalid_entries = cond_is_invalid.sum()
print(f'found {n_invalid_entries} invalid entry(-ies)')
df = df.drop(df[cond_is_invalid].index, axis=0)

found 4 invalid entry(-ies)


#### Column: `carbon-footprint_100g`
---

**Comment:** All values are of type `float`. We look at the distribution of values:

In [70]:
df[['carbon-footprint_100g']].describe()

Unnamed: 0,carbon-footprint_100g
count,278.0
mean,335.790664
std,423.244817
min,0.0
25%,82.65
50%,190.95
75%,378.7
max,2842.0


**Decision:** Turning to the Internet, we found out that these values seem to make sense (see e.g. [Climate change food calculator: What's your diet's carbon footprint?](https://www.bbc.com/news/science-environment-46459714)). We decide to keep the column and all rows.

#### Column: `nutrition-score-fr_100g`
---

**Comment:** All values are of type `float` and there are < 30 % of NA entries. We look at the distribution of values:

In [71]:
df[['nutrition-score-fr_100g']].describe()

Unnamed: 0,nutrition-score-fr_100g
count,254651.0
mean,9.160537
std,8.997361
min,-15.0
25%,1.0
50%,10.0
75%,16.0
max,40.0


**Decision:** Turning to the Internet, we found out that the final score for the Nutri-Score lies in the range *[-15, 40]* (see e.g. [here](https://solidarites-sante.gouv.fr/prevention-en-sante/preserver-sa-sante/nutrition/article/articles-scientifiques-et-documents-publies-relatifs-au-nutri-score)). All non-NA entries seem to be valid in that regard. We decide to keep the column and all rows.

#### Column: `nutrition-score-uk_100g`
---

**Comment:** All values are of type `float` and there are < 30 % of NA entries. We look at the distribution of values:

In [72]:
df[['nutrition-score-uk_100g']].describe()

Unnamed: 0,nutrition-score-uk_100g
count,254651.0
mean,8.974765
std,9.149149
min,-15.0
25%,1.0
50%,9.0
75%,16.0
max,37.0


**Decision:** Turning to the Internet, we found out that the final score for the Nutrient Profiling Model lies in the range *[-15, 40]* (see e.g. [here](https://www.gov.uk/government/publications/the-nutrient-profiling-model)). All non-NA entries seem to be valid in that regard. We decide to keep the column and all rows.
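Rather than reading the bounds off the `describe()` output, the range could also be verified programmatically. A minimal sketch on toy data, using the range *[-15, 40]* quoted above:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    'nutrition-score-fr_100g': [-15.0, 10.0, np.nan, 40.0],
    'nutrition-score-uk_100g': [-15.0, 9.0, np.nan, 37.0],
})

# For each column, check that all non-NA entries lie in [-15, 40] (bounds included).
in_range = scores.apply(lambda s: s.dropna().between(-15, 40).all())
print(in_range)
```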

## Result<a name="task-a-cleaning-result"></a> ([top](#top))
---

Finally, we export the cleaned-up data-frame:

In [73]:
base_name = pathlib.Path.cwd().joinpath('en.openfoodfacts.org.products.clean')
data_types = df.dtypes.astype(str).to_dict()
utils.dump_dtypes(base_name, data_types)
df.to_csv(f'{base_name}.csv', index=False)
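A CSV file does not record data types, which is presumably why they are dumped separately. The helper `utils.dump_dtypes` lives in `utils.py` and is not shown here; the following sketch only illustrates the general round trip with a hypothetical JSON side file on toy data.

```python
import json
import pathlib
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'product_name': ['Peanuts', 'Organic Polenta'],
    'serving_number': [28.0, 35.0],
    'serving_unit': pd.Series(['g', 'g'], dtype='category'),
})

with tempfile.TemporaryDirectory() as tmp:
    base_name = pathlib.Path(tmp) / 'toy.clean'
    csv_path = pathlib.Path(f'{base_name}.csv')
    dtypes_path = pathlib.Path(f'{base_name}.dtypes.json')  # hypothetical file name

    # Export the data and, next to it, the data types as strings.
    toy.to_csv(csv_path, index=False)
    dtypes_path.write_text(json.dumps(toy.dtypes.astype(str).to_dict()))

    # Reload the data and restore the data types.
    dtypes = json.loads(dtypes_path.read_text())
    reloaded = pd.read_csv(csv_path, dtype=dtypes)

print(reloaded.dtypes)
```

Without the `dtype` argument, the categorical column would come back as plain strings.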