# Course 2: Project - Task A - Data cleaning

<a name="top"></a>
This notebook is concerned with task A. The goal is to clean the [Open Food Facts](https://www.kaggle.com/openfoodfacts/world-food-facts) dataset (version 5), downloaded from Kaggle. The dataset originates from https://world.openfoodfacts.org/data. A description of the fields is available at https://static.openfoodfacts.org/data/data-fields.txt.

**Contents:**
* [Imports](#task-a-imports)
* [Preparatives](#task-a-preparatives)
* [Data loading](#task-a-data-loading)
* [Cleaning](#task-a-cleaning)
  * [NA columns](#task-a-cleaning-na-columns)
  * [Column group: General information](#task-a-cleaning-general-information)
  * [Column group: Tags](#task-a-cleaning-tags)
  * [Column group: Ingredients](#task-a-cleaning-ingredients)
  * [Column group: Miscellaneous data](#task-a-cleaning-miscellaneous-data)
  * [Column group: Nutrition facts](#task-a-cleaning-nutrition-facts)
* [Result](#task-a-cleaning-result)

## Imports<a name="task-a-imports"></a> ([top](#top))
---

In [85]:
# Standard library:
import collections
import enum
import functools
import pathlib
import re
import typing as t
import urllib.parse

# 3rd party:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pandas.io.formats.style
import seaborn as sns

# Project:
import cleanquantity
import ean
import tags
import utils

%matplotlib inline

## Preparatives<a name="task-a-preparatives"></a> ([top](#top))
---

This section gathers the utility functions that we use later in this notebook.

### Utilities

In [2]:
@functools.wraps(display)  # nicer for interactive use
def display_allcols(*args, **kwargs):
    """Behaves exactly like ``display`` but in a context where Pandas displays all columns."""
    with pd.option_context('display.max_columns', None):
        display(*args, **kwargs)
        
        
@functools.wraps(display)  # nicer for interactive use
def display_allcols_notrunc(*args, **kwargs):
    """Behaves exactly like ``display`` but in a context where Pandas displays all columns with no truncation."""
    with pd.option_context('display.max_columns', None, 'display.max_colwidth', -1):
        display(*args, **kwargs)


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarizes each column: Python types of the values, NA/non-NA counts and percentages."""

    def get_type_name(obj: t.Any) -> str:
        return type(obj).__name__
        
    types_ = [df[col].map(get_type_name).value_counts().to_dict() for col in df.columns]
        
    data = {
        'Types': types_,
        'NA': df.apply(lambda series: series.isna().sum()),
        'NA %': df.apply(lambda series: series.isna().mean() * 100.0),
        'Non-NA': df.apply(lambda series: series.notna().sum()),
        'Non-NA %': df.apply(lambda series: series.notna().mean() * 100.0)
    }
    return pd.DataFrame(data)


def style_percentages(df: pd.DataFrame) -> pd.io.formats.style.Styler:
    formatter = { name: lambda n: f'{n:.2f} %' for name in df.columns if '%' in name }
    return df.style.format(formatter)


def parse_t(series: pd.Series) -> pd.Series:
    return pd.to_datetime(series, utc=True, unit='s')


def parse_datetime(series: pd.Series) -> pd.Series:
    return pd.to_datetime(series, format='%Y-%m-%dT%H:%M:%S%z')


def move_after(words: t.List[str], word: str, word_to_move: str) -> t.List[str]:
    """Utility function to re-order columns."""
    try:
        word_idx = words.index(word)
        word_to_move_idx = words.index(word_to_move)
    except ValueError:
        pass
    else:
        if word_idx < word_to_move_idx:
            words.pop(word_to_move_idx)
            words.insert(word_idx + 1, word_to_move)
        else:
            words.insert(word_idx + 1, word_to_move)
            words.pop(word_to_move_idx)
    return words


def is_invalid_url(url: str, if_na=False) -> bool:
    if pd.isna(url):
        return if_na
    result = urllib.parse.urlparse(url)
    return result.scheme == '' or result.netloc == ''

### EAN-13/EAN-8/UPC-A

As per the description of the field `code`: For products with a barcode, this is the barcode of the product (EAN-13 code or some internal code assigned by the store). For products without a barcode, Open Food Facts assigns a number starting with the 200 reserved prefix. We implement utility functions to check whether a given code is a valid EAN-13/EAN-8/UPC-A code.

**Note:** In order not to clutter the notebook the code is in a separate module - *ean.py*.
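
The module *ean.py* is not reproduced here. To give a rough idea of what such a check involves, the sketch below validates the check digit with the standard GS1 modulo-10 scheme. The function name `is_valid_gtin` is hypothetical, and the actual `ean.is_valid` may differ (for instance in how it treats leading zeros or the 200 prefix).

```python
import re

_DIGITS = re.compile(r'[0-9]+')


def is_valid_gtin(code: str) -> bool:
    """Hypothetical check-digit validation for EAN-13, EAN-8 and UPC-A codes."""
    # EAN-8 has 8 digits, UPC-A has 12 and EAN-13 has 13.
    if len(code) not in (8, 12, 13) or _DIGITS.fullmatch(code) is None:
        return False
    *body, check = (int(char) for char in code)
    # Walking from the right, the digit next to the check digit has weight 3,
    # the following one weight 1, then 3 again, and so on.
    total = sum(digit * (3 if position % 2 == 0 else 1)
                for position, digit in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check
```

With this sketch, `is_valid_gtin('4006381333931')` holds, whereas changing the last digit makes the check fail.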

## Data loading<a name="task-a-data-loading"></a> ([top](#top))
---

Since we are not familiar with the dataset and have been warned that it is quite messy, we first let Pandas read the TSV file entirely into memory and guess the type of each column. At the end of this notebook, we will export a cleaned-up file that Pandas will be able to read more efficiently.

In [3]:
data_filename = pathlib.Path.cwd().joinpath('en.openfoodfacts.org.products.tsv')

In [4]:
df = pd.read_csv(data_filename, sep='\t', low_memory=False)

We first get some general information about the data-frame:

In [5]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} rows and {ncols} columns')

the dataset contains 356027 rows and 163 columns


**Note:** It turns out that reading the TSV file this way is problematic (at least on macOS) since 26 lines contain a carriage return. We noticed this by focusing on the first row where _code_ was NA and looking at the line corresponding to the previous row directly in the TSV file:
```bash
sed -n '193909 l' ./en.openfoodfacts.org.products.tsv
(...)fr-32-464-040-ec\t43.400279,0.199525\r\t\tvillecomtal-sur-arros-gers-france(...)
                                         ^^
```
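26 rows are a negligible fraction of the dataset, so we could simply have dropped them. It turns out that there is an even simpler solution: we tell Pandas that only `\n` terminates a line, so that a stray carriage return no longer splits a record in two: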

In [6]:
df = pd.read_csv(data_filename, sep='\t', lineterminator='\n', low_memory=False)
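
The effect of `lineterminator='\n'` can be reproduced on a tiny in-memory file. This is only a sketch: the column names and values are made up, and the behaviour shown is that of the default C parser.

```python
import io

import pandas as pd

# Made-up TSV with a stray carriage return inside the second field of the first record.
raw = (
    "code\tgeo\tcities\tcities_tags\n"
    "1\t43.4,0.19\r\t\tvillecomtal\n"
    "2\t48.8,2.3\t\tparis\n"
)

# By default, a lone '\r' is also treated as a line break, so the first record is split in two.
naive = pd.read_csv(io.StringIO(raw), sep='\t')

# With an explicit terminator, only '\n' ends a record.
fixed = pd.read_csv(io.StringIO(raw), sep='\t', lineterminator='\n')

print(len(naive), len(fixed))
```

In the second call, the carriage return should simply remain part of the field, which is why later cells strip white spaces from string columns.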

In [7]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} rows and {ncols} columns')

the dataset contains 356001 rows and 163 columns


Having taken care of this, we look at the first few rows:

In [8]:
display_allcols(df.head())

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,serving_size,no_nutriments,additives_n,additives,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutrition_grade_uk,nutrition_grade_fr,pnns_groups_1,pnns_groups_2,states,states_tags,states_en,main_category,main_category_en,image_url,image_small_url,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,-butyric-acid_100g,-caproic-acid_100g,-caprylic-acid_100g,-capric-acid_100g,-lauric-acid_100g,-myristic-acid_100g,-palmitic-acid_100g,-stearic-acid_100g,-arachidic-acid_100g,-behenic-acid_100g,-lignoceric-acid_100g,-cerotic-acid_100g,-montanic-acid_100g,-melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,-alpha-linolenic-acid_100g,-eicosapentaenoic-acid_100g,-docosahexaenoic-acid_100g,omega-6-fat_100g,-linoleic-acid_100g,-arachidonic-acid_100g,-gamma-linolenic-acid_100g,-dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,-oleic-acid_100g,-elaidic-acid_100g,-gondoic-acid_100g,-mead-acid_100g,-erucic-acid_100g,-nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,-sucrose_100g,-glucose_100g,-fructose_100g,-lactose_100g,-maltose_100g,-maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
0,3087,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1474103866,2016-09-17T09:17:46Z,1474103893,2016-09-17T09:18:13Z,Farine de blé noir,,1kg,,,Ferme t'y R'nao,ferme-t-y-r-nao,,,,,,,,,,,,,,,,,,en:FR,en:france,France,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,4530,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Banana Chips Sweetened (Whole),,,,,,,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Bananas, vegetable oil (coconut oil, corn oil ...",,,,,,28 g (1 ONZ),,0.0,[ bananas -> en:bananas ] [ vegetable-oil -...,,,0.0,,,0.0,,,,d,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,2243.0,,28.57,28.57,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.018,64.29,14.29,,,,,,,,,3.6,3.57,,,,0.0,0.0,,0.0,,,,,0.0214,,,,,,,,,,,,,,0.0,,0.00129,,,,,,,,,,,,,,,,,,,14.0,14.0,,
2,4559,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Peanuts,,,,,Torn & Glasser,torn-glasser,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Peanuts, wheat flour, sugar, rice flour, tapio...",,,,,,28 g (0.25 cup),,0.0,[ peanuts -> en:peanuts ] [ wheat-flour -> ...,,,0.0,,,0.0,,,,b,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,1941.0,,17.86,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,60.71,17.86,,,,,,,,,7.1,17.86,,,,0.635,0.25,,0.0,,,,,0.0,,,,,,,,,,,,,,0.071,,0.00129,,,,,,,,,,,,,,,,,,,0.0,0.0,,
3,16087,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055731,2017-03-09T10:35:31Z,1489055731,2017-03-09T10:35:31Z,Organic Salted Nut Mix,,,,,Grizzlies,grizzlies,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Organic hazelnuts, organic cashews, organic wa...",,,,,,28 g (0.25 cup),,0.0,[ organic-hazelnuts -> en:organic-hazelnuts ...,,,0.0,,,0.0,,,,d,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,2540.0,,57.14,5.36,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,17.86,3.57,,,,,,,,,7.1,17.86,,,,1.22428,0.482,,,,,,,,,,,,,,,,,,,,,0.143,,0.00514,,,,,,,,,,,,,,,,,,,12.0,12.0,,
4,16094,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055653,2017-03-09T10:34:13Z,1489055653,2017-03-09T10:34:13Z,Organic Polenta,,,,,Bob's Red Mill,bob-s-red-mill,,,,,,,,,,,,,,,,,,US,en:united-states,United States,Organic polenta,,,,,,35 g (0.25 cup),,0.0,[ organic-polenta -> en:organic-polenta ] [...,,,0.0,,,0.0,,,,,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,1552.0,,1.43,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,77.14,,,,,,,,,,5.7,8.57,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


For each column, we briefly look at the type guessed by Pandas and the number of non-NA values:

In [9]:
df.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356001 entries, 0 to 356000
Data columns (total 163 columns):
code                                          356001 non-null object
url                                           356001 non-null object
creator                                       355998 non-null object
created_t                                     356001 non-null int64
created_datetime                              356000 non-null object
last_modified_t                               356001 non-null int64
last_modified_datetime                        356001 non-null object
product_name                                  338489 non-null object
generic_name                                  57688 non-null object
quantity                                      119262 non-null object
packaging                                     89959 non-null object
packaging_tags                                89959 non-null object
brands                                        326977 non-null obj

## Cleaning<a name="task-a-cleaning"></a> ([top](#top))
---

### Cleaning - NA columns<a name="task-a-cleaning-na-columns"></a> ([top](#top))
---

We list columns with only NA entries:

In [10]:
df.columns[df.isna().all()].tolist()

['cities',
 'allergens_en',
 'no_nutriments',
 'ingredients_from_palm_oil',
 'ingredients_that_may_be_from_palm_oil',
 'nutrition_grade_uk',
 '-butyric-acid_100g',
 '-caproic-acid_100g',
 '-behenic-acid_100g',
 '-lignoceric-acid_100g',
 '-cerotic-acid_100g',
 '-melissic-acid_100g',
 '-dihomo-gamma-linolenic-acid_100g',
 '-elaidic-acid_100g',
 '-gondoic-acid_100g',
 '-mead-acid_100g',
 '-erucic-acid_100g',
 '-nervonic-acid_100g',
 'chlorophyl_100g',
 'glycemic-index_100g',
 'water-hardness_100g']

**Comment:** We notice that 21 columns contain only NA values.

**Decision:** We drop those columns that are not part of the breakdown of a macro-nutrient (i.e. those columns with a name that does not start with a _'-'_):

In [11]:
columns_to_drop = [
  'cities',
  'allergens_en',
  'no_nutriments',
  'ingredients_from_palm_oil',
  'ingredients_that_may_be_from_palm_oil',
  'nutrition_grade_uk',
  'chlorophyl_100g',
  'glycemic-index_100g',
  'water-hardness_100g'
]

df = df.drop(columns=columns_to_drop)
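
The list above is written out by hand. As a possible alternative, the same selection rule (all-NA columns whose name does not start with `'-'`) can be expressed programmatically. The sketch below uses a made-up three-column frame, not the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame: two all-NA columns (one of them a nutrient breakdown) and one regular column.
toy = pd.DataFrame({
    'cities': [np.nan, np.nan],
    '-butyric-acid_100g': [np.nan, np.nan],
    'fat_100g': [28.57, np.nan],
})

# Columns where every entry is NA.
all_na = toy.columns[toy.isna().all()]

# Keep the breakdown columns (leading '-'), drop the other all-NA columns.
columns_to_drop = [name for name in all_na if not name.startswith('-')]
cleaned = toy.drop(columns=columns_to_drop)

print(columns_to_drop)
```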

### Cleaning - Column group: General information<a name="task-a-cleaning-general-information"></a>  ([top](#top))
---

We first "profile" the columns:

In [12]:
columns = df.columns.to_series()['code': 'quantity'].to_list()
style_percentages(profile(df[columns]))

| | Types | NA | NA % | Non-NA | Non-NA % |
|:--|:--|--:|--:|--:|--:|
| code | {'str': 356001} | 0 | 0.00 % | 356001 | 100.00 % |
| url | {'str': 356001} | 0 | 0.00 % | 356001 | 100.00 % |
| creator | {'str': 355998, 'float': 3} | 3 | 0.00 % | 355998 | 100.00 % |
| created_t | {'int': 356001} | 0 | 0.00 % | 356001 | 100.00 % |
| created_datetime | {'str': 356000, 'float': 1} | 1 | 0.00 % | 356000 | 100.00 % |
| last_modified_t | {'int': 356001} | 0 | 0.00 % | 356001 | 100.00 % |
| last_modified_datetime | {'str': 356001} | 0 | 0.00 % | 356001 | 100.00 % |
| product_name | {'str': 338489, 'float': 17512} | 17512 | 4.92 % | 338489 | 95.08 % |
| generic_name | {'float': 298313, 'str': 57688} | 298313 | 83.80 % | 57688 | 16.20 % |
| quantity | {'float': 236739, 'str': 119262} | 236739 | 66.50 % | 119262 | 33.50 % |


#### Column: *code*
---

**Comment:** The column _code_ has type `object`, all values are of type `str` and there are no NA values.

**Decision:** We keep the column and all rows. We ensure there are no leading/trailing white spaces:

In [13]:
df['code'] = df['code'].str.strip()

Out of curiosity, we check how many codes belong to the following categories: `e` - valid EAN-13/EAN-8/UPC-A code, `a` - code assigned by Open Food Facts (prefix 200) and `i` - internal code (store). (Mistyped EAN-13/EAN-8/UPC-A codes will be incorrectly classified as internal codes but we will not pursue this further.) Result:

In [14]:
def categorize(code: str) -> str:
    return ('e' if ean.is_valid(code) else
            'a' if code.startswith('200') else
            'i')


df_categories = pd.DataFrame(df['code'].map(categorize).value_counts())
df_categories['code %'] = df_categories['code'] / df_categories['code'].sum() * 100
style_percentages(df_categories)

| | code | code % |
|:--|--:|--:|
| e | 313911 | 88.18 % |
| i | 40340 | 11.33 % |
| a | 1750 | 0.49 % |


#### Column: *url*
---

**Comment:** The column _url_ has type `object`, all values are of type `str` and there are no NA values.

**Decision:** We keep the column. We ensure there are no leading/trailing white spaces. We replace invalid URLs (if any) by NA:

In [15]:
df['url'] = df['url'].str.strip()

cond_is_invalid_url = df['url'].map(is_invalid_url)
print(f'found {cond_is_invalid_url.sum()} invalid url(s)')
df.loc[cond_is_invalid_url, ['url']] = np.nan

found 0 invalid url(s)
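
To make the notion of "invalid URL" concrete: `is_invalid_url` relies on `urllib.parse.urlparse` and flags any string that lacks a scheme or a network location. The sketch below applies the same criterion to a few made-up strings; the helper name `has_scheme_and_host` is hypothetical.

```python
import urllib.parse


def has_scheme_and_host(url: str) -> bool:
    """Same criterion as in the notebook, expressed positively."""
    parts = urllib.parse.urlparse(url)
    return parts.scheme != '' and parts.netloc != ''


candidates = [
    'http://world-en.openfoodfacts.org/product/0000000003087',  # scheme and host
    'world-en.openfoodfacts.org/product/0000000003087',         # no scheme, so no host either
    'http:/product/0000000003087',                              # scheme but no host
]

for candidate in candidates:
    print(candidate, has_scheme_and_host(candidate))
```

Note that this is a purely syntactic check: it does not verify that the URL can actually be reached.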


#### Column: *creator*
---

**Comment:** The column _creator_ has type `object`, non-NA values are of type `str` and there is a negligible number of NA values (3).

**Decision:** We keep the column and all rows. We ensure there are no leading/trailing white spaces:

In [16]:
df['creator'] = df['creator'].str.strip()

#### Columns: *created_(t,datetime)*, *last_modified_(t,datetime)*
---

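**Comment:** For the *created_t*/*created_datetime* pair: The column *created_t* has type `int64`, all values are of type `int` and there are no NA values. The column *created_datetime* has type `object`, non-NA values are of type `str` and there is 1 NA value. We confirm that these columns agree where both are not NA: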

In [17]:
s1 = parse_t(df['created_t'])
s2 = parse_datetime(df['created_datetime'])
both_notna = s1.notna() & s2.notna()
(s1[both_notna] == s2[both_notna]).all()

True

**Decision:** We use the column *created_t* to generate a column *created_on* (type `Timestamp`) and drop the columns *created_t* and *created_datetime*:

In [18]:
df['created_on'] = s1
df = df.drop(columns=['created_t', 'created_datetime'])
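
The agreement between the two representations can be illustrated on a single record. The sketch below takes the *created_t* and *created_datetime* values of the first row shown earlier and compares them using only the standard library (instead of `pd.to_datetime`, which the notebook uses on whole columns):

```python
import datetime as dt

created_t = 1474103866                     # seconds since the Unix epoch
created_datetime = '2016-09-17T09:17:46Z'  # ISO 8601 string, 'Z' meaning UTC

from_epoch = dt.datetime.fromtimestamp(created_t, tz=dt.timezone.utc)
# '%z' accepts the 'Z' suffix from Python 3.7 onwards.
from_iso = dt.datetime.strptime(created_datetime, '%Y-%m-%dT%H:%M:%S%z')

agree = from_epoch == from_iso
print(from_epoch.isoformat(), agree)
```

Since both columns carry the same information, keeping only the one derived from the integer timestamp loses nothing.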

**Comment:** For the *last_modified_t*/*last_modified_datetime* pair: The column *last_modified_t* has type `int64`, all values are of type `int` and there are no NA values. The column *last_modified_datetime* has type `object`, all values are of type `str` and there are no NA values. We confirm that these columns agree where both are not NA:

In [19]:
s1 = parse_t(df['last_modified_t'])
s2 = parse_datetime(df['last_modified_datetime'])
both_notna = s1.notna() & s2.notna()
(s1[both_notna] == s2[both_notna]).all()

True

**Decision:** We use the column *last_modified_t* to generate a column *last_modified_on* (type `Timestamp`) and drop the columns *last_modified_t* and *last_modified_datetime*:

In [20]:
df['last_modified_on'] = s1
df = df.drop(columns=['last_modified_t', 'last_modified_datetime'])

#### Column: *product_name*
---

**Comment:** The column *product_name* has type `object`, non-NA values are of type `str` and there are < 5 % of NA values.

**Decision:** We keep the column and all rows. We ensure there are no leading/trailing white spaces:

In [21]:
df['product_name'] = df['product_name'].str.strip()

#### Column: *generic_name*
---

**Comment:** The column *generic_name* has type `object`, non-NA values are of type `str` and there are > 80 % of NA values. Inspecting a couple of records by hand, we notice that the language seems to vary quite a lot. Maybe tellingly, this column is not even documented.

**Decision:** Drop the column.

In [22]:
df = df.drop(columns='generic_name')

#### Column: *quantity*
---

**Comment:** The column *quantity* has type object, non-NA values are of type `str` and there are > 65 % of NA values. In most cases, the column indicates the quantity sold at once and the unit of measurement used.

**Decision:** We try to salvage as many of the non-NA values as possible as it might be interesting to know the quantity of product sold at once.

##### Exploration

After manually inspecting multiple records (spot-checking), we are able to make the following observations: a) There are invalid and/or incomplete entries (price in euros, product name instead of quantity, unitless quantity, etc.). b) Multiple languages are used (e.g. _320 г_ seems to mean 320 g in Russian). c) Metric and imperial units are used.

##### Implementation

In order to keep complexity under control, we take the following decisions:
* An entry must be a "valid number" followed by a "valid unit". White spaces are allowed and ignored. Additional information at the end is allowed and ignored. Letter case is ignored.
* A "valid number" is any string that matches `r'\d+(?:[.,]\d*)?'`.
* A "valid unit" is any string in `VALID_UNITS` (see code).
* Since imperial units differ between UK, US and USC, we decide to *ignore* those (see e.g. [How US labelling requirements undermine honest labelling in the UK](http://metricviews.org.uk/2013/03/how-us-labelling-requirements-undermine-honest-labelling-in-the-uk/)).

**Desired output:** 2 columns: `quantity_number` (type `float`) and `quantity_unit` (type `str`, either `g` or `l`). In the process, we convert all weights to grams and all volumes to liters.

**Note:** In order not to clutter the notebook, most of the code is in a separate module - _cleanquantity.py_.
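
Since *cleanquantity.py* is not shown, here is a minimal sketch of how the rules above could be implemented for a single entry. Everything in it is an assumption made for illustration: the unit table `UNIT_FACTORS` is a small hypothetical stand-in for `VALID_UNITS`, and `parse_quantity` is not the module's actual API.

```python
import re
import typing as t

# Hypothetical unit table: unit -> (conversion factor, base unit).
# The real VALID_UNITS is defined in cleanquantity.py and is likely larger.
UNIT_FACTORS = {
    'kg': (1000.0, 'g'),
    'g': (1.0, 'g'),
    'mg': (0.001, 'g'),
    'l': (1.0, 'l'),
    'cl': (0.01, 'l'),
    'ml': (0.001, 'l'),
}

# Longer unit names first so that 'kg' is tried before 'g'.
_UNITS = '|'.join(sorted(UNIT_FACTORS, key=len, reverse=True))
# Number, optional white space, unit; the unit must not run into further letters.
_PATTERN = re.compile(rf'\s*(\d+(?:[.,]\d*)?)\s*({_UNITS})(?![a-z])', re.IGNORECASE)


def parse_quantity(text: t.Any) -> t.Optional[t.Tuple[float, str]]:
    """Returns (number, unit) in grams or liters, or None if the entry is not usable."""
    if not isinstance(text, str):
        return None  # NA entries are floats (NaN)
    match = _PATTERN.match(text)
    if match is None:
        return None
    number = float(match.group(1).replace(',', '.'))
    factor, base_unit = UNIT_FACTORS[match.group(2).lower()]
    return number * factor, base_unit
```

For example, `parse_quantity('1kg')` gives `(1000.0, 'g')`, while an entry such as `'2,50 €'` is rejected because no valid unit follows the number.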

##### Execution

In [23]:
ninitial = df['quantity'].notna().sum()

df_qty = cleanquantity.clean(df['quantity'])
df['quantity_number'] = df_qty['number']
df['quantity_unit'] = df_qty['unit']
df = df.drop(columns='quantity')

nstandardized = df_qty['number'].notna().sum()
pstandardized = nstandardized / ninitial * 100

print(f"entries (initial): {ninitial}")
print(f"entries (standardized): {nstandardized} ({pstandardized:.2f} % of initial)")

entries (initial): 119262
entries (standardized): 106470 (89.27 % of initial)


##### Result

We check for outliers:

In [24]:
(df[['quantity_number', 'quantity_unit']]
 .groupby(by='quantity_unit')
 .describe())

Statistics of *quantity_number*, grouped by *quantity_unit*:

| quantity_unit | count | mean | std | min | 25% | 50% | 75% | max |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|
| g | 86658.0 | 403.239589 | 8037.160368 | 0.0 | 150.0 | 250.0 | 420.0 | 1390000.0 |
| l | 19812.0 | 0.809454 | 3.336295 | 0.0 | 0.35 | 0.75 | 1.0 | 450.0 |


**Observations:**
* **Weight:** After inspecting the entries above 10 kg by hand (22 entries): a) Some entries seem to contain food for *animals*, not for humans (e.g. for *code: 289259*, we have *categories: aliment pour chevaux* (English: *food for horses*)). b) Some values do make sense (e.g. a 25 kg bag of flour for a baker) while others do not.
* **Volume:** After inspecting the entries above 12 liters by hand (12 entries): Some values do make sense (e.g. a 20 l barrel of wine) while others do not.

**Decision:** We have a negligible number of outliers:
* **Weight:** We replace a weight of 0 by NA and drop records above 10 kg, except when *product_name* contains the word *farine* (English: *flour*).
* **Volume:** We replace 0 by NA and drop records above 12 l except when *product_name* contains the word *tonneau* (English: *barrel*).

In [25]:
df.loc[df['quantity_number'] == 0, ['quantity_number', 'quantity_unit']] = np.nan

weight_cond = ((df['quantity_unit'] == 'g')
          & (df['quantity_number'] > 10e3)
          & ~df['product_name'].str.contains('farine', case=False, na=False, regex=False))

volume_cond = ((df['quantity_unit'] == 'l')
          & (df['quantity_number'] > 12)
          & ~df['product_name'].str.contains('tonneau', case=False, na=False, regex=False))

df = df.drop(df[weight_cond | volume_cond].index, axis=0)

### Cleaning - Column group: Tags<a name="task-a-cleaning-tags"></a>  ([top](#top))
---

We first "profile" the columns:

In [66]:
columns = df.columns.to_series()['packaging': 'countries_en'].to_list()
style_percentages(profile(df[columns]))

| | Types | NA | NA % | Non-NA | Non-NA % |
|:--|:--|--:|--:|--:|--:|
| packaging | {'float': 265686, 'str': 89883} | 265686 | 74.72 % | 89883 | 25.28 % |
| packaging_tags | {'float': 265686, 'str': 89883} | 265686 | 74.72 % | 89883 | 25.28 % |
| brands | {'str': 326575, 'float': 28994} | 28994 | 8.15 % | 326575 | 91.85 % |
| brands_tags | {'str': 326555, 'float': 29014} | 29014 | 8.16 % | 326555 | 91.84 % |
| categories | {'float': 252389, 'str': 103180} | 252389 | 70.98 % | 103180 | 29.02 % |
| categories_tags | {'float': 252389, 'str': 103180} | 252389 | 70.98 % | 103180 | 29.02 % |
| categories_en | {'float': 252389, 'str': 103180} | 252389 | 70.98 % | 103180 | 29.02 % |
| origins | {'float': 330577, 'str': 24992} | 330577 | 92.97 % | 24992 | 7.03 % |
| origins_tags | {'float': 330615, 'str': 24954} | 330615 | 92.98 % | 24954 | 7.02 % |
| manufacturing_places | {'float': 313622, 'str': 41947} | 313622 | 88.20 % | 41947 | 11.80 % |


**Comment:** The columns have type `object`, with non-NA entries of type `str` and NA entries represented by NaN of type `float`. After checking a few entries by hand (spot-checking), we notice that:
* The *first_packaging_code_geo* column contains geographic coordinates and will need to be treated separately.
* Some entries in a given *(column)* contain language prefixes (e.g. in the *categories* column, for *code*: *190* we have *categories*: *en:beverages*).
* The number of entries in a given _(column)_ and the number of entries in the corresponding *(column)_tags* do not always agree.
* The *(column)_en* column, which (according to the documentation) contains *"the set of tags in that language"* (i.e. English), sometimes contains tags labeled with another language (maybe because a tag in English is not readily available). Example: for *code*: *452*, we have *product\_name*: *Foie gras canard Périgord* and:
  * *categories*: *Foie gras de canard*
  * *categories\_tags*: *en:fish-and-meat-and-eggs,fr:foies-gras,fr:foies-gras-de-canard*
  * *categories\_en*: *Fish and meat and eggs,fr:Foies gras,fr:Foies gras de canard*
* The *stores* and *purchase_places* columns do not have a corresponding *(column)_tags* column. 

**Decision:** We keep all columns and all rows. We treat every *(column)*, *(column)_tags* and *(column)_en* as a comma-delimited list whose entries may be prefixed with a 2-letter language code. We ensure that each entry has no leading/trailing white spaces and that the prefix (if any) is indeed two letters long:

#### Columns: Delimited lists

In [27]:
columns = df.columns.to_series()['packaging': 'countries_en'].to_list()
columns.remove('first_packaging_code_geo')

In [28]:
df[columns] = df[columns].applymap(tags.clean_tags)
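
The module *tags.py* is not shown in this notebook. As an illustration only, a cleaning function for one cell of a delimited-list column could look like the sketch below. The name `clean_tag_list`, the normalization of the prefix to lower case and the handling of empty lists are assumptions, not the behaviour of the real `tags.clean_tags`.

```python
import re
import typing as t

# A 2-letter language code, a colon, then the tag itself.
_PREFIX = re.compile(r'^([A-Za-z]{2})\s*:\s*(.*)$')


def clean_tag_list(text: t.Any) -> t.Any:
    """Hypothetical stand-in for tags.clean_tags, applied to a single cell."""
    if not isinstance(text, str):
        return text  # NA cells (NaN) are passed through unchanged
    cleaned = []
    for entry in text.split(','):
        entry = entry.strip()
        if not entry:
            continue  # skip empty entries such as in 'a,,b'
        match = _PREFIX.match(entry)
        if match:
            # Normalize 'FR: Foies gras' to 'fr:Foies gras'.
            entry = f'{match.group(1).lower()}:{match.group(2).strip()}'
        cleaned.append(entry)
    # Assumption: a list without any usable entry is treated as missing.
    return ','.join(cleaned) if cleaned else None
```

Applied to `'en:to-be-completed, en:nutrition-facts-completed'`, this sketch removes the space after the comma and leaves the tags themselves untouched.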

#### Column: *first\_packaging\_code\_geo*

**Comment:** The column *first\_packaging\_code\_geo* has type `object`, non-NA values are of type `str` and there are > 90 % of NA values.

**Decision:** We keep all rows and split the column into two numeric columns, *first\_packaging\_code\_lat* and *first\_packaging\_code\_lon*. In the process, we confirm that all non-NA entries are valid coordinates:

In [29]:
def to_coordinates(values: pd.Series) -> t.Tuple[float, float]:
    """\
    Parses the geographic coordinates of a row; returns (NaN, NaN) if they are missing or invalid.
    """
    na = np.nan, np.nan
    
    text = values['first_packaging_code_geo']
    
    if pd.isna(text):
        return na
    
    tokens = text.split(',')
    if len(tokens) != 2:
        return na
    lat_token, lon_token = tokens
    try:
        lat = float(lat_token)
        lon = float(lon_token)
    except ValueError:
        return na
    if not (-90.0 <= lat <= 90.0):
        return na
    if not (-180.0 <= lon <= 180.0):
        return na
    return lat, lon


ninitial = df['first_packaging_code_geo'].notna().sum()

df_coords = (
    df[['first_packaging_code_geo']]
    .apply(to_coordinates, axis=1, result_type='expand')
    .rename(columns={0: 'lat', 1: 'lon'})
)

df['first_packaging_code_lat'] = df_coords['lat']
df['first_packaging_code_lon'] = df_coords['lon']
df = df.drop(columns='first_packaging_code_geo')

nstandardized = df_coords['lon'].notna().sum()
pstandardized = nstandardized / ninitial * 100

print(f"entries (initial): {ninitial}")
print(f"entries (standardized): {nstandardized} ({pstandardized:.2f} % of initial)")

entries (initial): 20868
entries (standardized): 20868 (100.00 % of initial)


Re-order columns:

In [30]:
old_columns = df.columns.to_list()
new_columns = move_after(old_columns, 'emb_codes_tags', 'first_packaging_code_lat')
new_columns = move_after(new_columns, 'first_packaging_code_lat', 'first_packaging_code_lon')
df = df[new_columns]  # apply the new column order to the data-frame

### Cleaning - Column group: Ingredients<a name="task-a-cleaning-ingredients"></a>  ([top](#top))
---

We first "profile" the columns:

In [67]:
columns = df.columns.to_series()['ingredients_text': 'traces_en'].to_list()
style_percentages(profile(df[columns]))

| | Types | NA | NA % | Non-NA | Non-NA % |
|:--|:--|--:|--:|--:|--:|
| ingredients_text | {'str': 283533, 'float': 72036} | 72036 | 20.26 % | 283533 | 79.74 % |
| allergens | {'float': 318410, 'str': 37159} | 318410 | 89.55 % | 37159 | 10.45 % |
| traces | {'float': 327187, 'str': 28382} | 327187 | 92.02 % | 28382 | 7.98 % |
| traces_tags | {'float': 327188, 'str': 28381} | 327188 | 92.02 % | 28381 | 7.98 % |
| traces_en | {'float': 327188, 'str': 28381} | 327188 | 92.02 % | 28381 | 7.98 % |


### Cleaning - Column group: Miscellaneous data<a name="task-a-cleaning-miscellaneous-data"></a> ([top](#top))
---

We first "profile" the columns:

In [32]:
columns = df.columns.to_series()['serving_size': 'image_small_url'].to_list()
style_percentages(profile(df[columns]))

| | Types | NA | NA % | Non-NA | Non-NA % |
|:--|:--|--:|--:|--:|--:|
| serving_size | {'str': 216616, 'float': 139355} | 139355 | 39.15 % | 216616 | 60.85 % |
| additives_n | {'float': 355971} | 72087 | 20.25 % | 283884 | 79.75 % |
| additives | {'str': 283842, 'float': 72129} | 72129 | 20.26 % | 283842 | 79.74 % |
| additives_tags | {'float': 185759, 'str': 170212} | 185759 | 52.18 % | 170212 | 47.82 % |
| additives_en | {'float': 185759, 'str': 170212} | 185759 | 52.18 % | 170212 | 47.82 % |
| ingredients_from_palm_oil_n | {'float': 355971} | 72087 | 20.25 % | 283884 | 79.75 % |
| ingredients_from_palm_oil_tags | {'float': 349399, 'str': 6572} | 349399 | 98.15 % | 6572 | 1.85 % |
| ingredients_that_may_be_from_palm_oil_n | {'float': 355971} | 72087 | 20.25 % | 283884 | 79.75 % |
| ingredients_that_may_be_from_palm_oil_tags | {'float': 341642, 'str': 14329} | 341642 | 95.97 % | 14329 | 4.03 % |
| nutrition_grade_fr | {'str': 254877, 'float': 101094} | 101094 | 28.40 % | 254877 | 71.60 % |


#### Column: *serving_size*

In [33]:
df[df['serving_size'].notna()][['product_name', 'serving_size']].head()

| | product_name | serving_size |
|--:|:--|:--|
| 1 | Banana Chips Sweetened (Whole) | 28 g (1 ONZ) |
| 2 | Peanuts | 28 g (0.25 cup) |
| 3 | Organic Salted Nut Mix | 28 g (0.25 cup) |
| 4 | Organic Polenta | 35 g (0.25 cup) |
| 5 | Breadshop Honey Gone Nuts Granola | 52 g (0.5 cup) |


In [71]:
ninitial = df['serving_size'].notna().sum()

df_qty = cleanquantity.clean(df['serving_size'])
df['serving_number'] = df_qty['number']
df['serving_unit'] = df_qty['unit']
df = df.drop(columns='serving_size')

nstandardized = df_qty['number'].notna().sum()
pstandardized = nstandardized / ninitial * 100

print(f"entries (initial): {ninitial}")
print(f"entries (standardized): {nstandardized} ({pstandardized:.2f} % of initial)")

entries (initial): 216314
entries (standardized): 211382 (97.72 % of initial)


#### Columns: *nutrition_grade_fr*, *pnns_groups_1*, *pnns_groups_2*

**Comment:** The *nutrition_grade_fr* column contains the Nutri-Score grade:

In [34]:
df[df['nutrition_grade_fr'].notna()][['product_name', 'nutrition_grade_fr']].head()

| | product_name | nutrition_grade_fr |
|--:|:--|:--|
| 1 | Banana Chips Sweetened (Whole) | d |
| 2 | Peanuts | b |
| 3 | Organic Salted Nut Mix | d |
| 7 | Organic Muesli | c |
| 12 | Zen Party Mix | d |


**Decision:** We just check that non-NA entries are letters from *'a'* to *'e'*. (We could check that the mapping is consistent with the column *nutrition-score-fr_100g* but we will not pursue this any further.)

In [73]:
cond_is_invalid = ~df['nutrition_grade_fr'].isin(frozenset([np.nan, 'a', 'b', 'c', 'd', 'e']))
print(f'found {cond_is_invalid.sum()} invalid entry(-ies)')
df.loc[cond_is_invalid, ['nutrition_grade_fr']] = np.nan

found 0 invalid entry(-ies)


The *pnns_groups_1* and *pnns_groups_2* columns refer to the food categories of the "Programme National Nutrition Santé" (PNNS):

In [36]:
df[~df['pnns_groups_1'].isin([np.nan, 'unknown'])][['product_name', 'pnns_groups_1', 'pnns_groups_2']].head()

Unnamed: 0,product_name,pnns_groups_1,pnns_groups_2
176,Salade Cesar,Fruits and vegetables,Vegetables
177,Danoises à la cannelle roulées,Sugary snacks,Biscuits and cakes
179,Flute,Cereals and potatoes,Bread
182,Chaussons tressés aux pommes,Sugary snacks,Biscuits and cakes
184,lentilles vertes,Cereals and potatoes,Legumes


**Discussion:** There are many *unknown* entries, and one could wonder whether *unknown* means the same as NA. For the time being we keep both. We also strip leading and trailing whitespace:

In [82]:
df['pnns_groups_1'] = df['pnns_groups_1'].str.strip()
df['pnns_groups_2'] = df['pnns_groups_2'].str.strip()
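
To see how *unknown* and NA compare in size, one can count both after stripping. The following sketch uses a small made-up Series rather than the real column:

```python
import numpy as np
import pandas as pd

# Made-up sample with stray whitespace, 'unknown' entries and one NA.
groups = pd.Series([' Sugary snacks', 'unknown', np.nan, 'Sugary snacks ', 'unknown'])

stripped = groups.str.strip()
counts = stripped.value_counts(dropna=False)  # dropna=False also counts NA
n_unknown = int((stripped == 'unknown').sum())
n_na = int(stripped.isna().sum())
```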

## Columns: Numbers

We have a mixture of *(column)_n* "number" columns and *(column)\[\_tags,\_(lang2)\]* "tag" columns.

We check the "number" columns:

In [37]:
columns = [
    'additives_n',
    'ingredients_from_palm_oil_n',
    'ingredients_that_may_be_from_palm_oil_n'
]

df_is_invalid = (df[columns] < 0)
n_invalid_entries = df_is_invalid.sum(axis='columns').sum()
n_invalid_rows = df_is_invalid.any(axis='columns').sum()
print(f'found {n_invalid_entries} invalid entry(-ies) in {n_invalid_rows} invalid row(s)')

found 0 invalid entry(-ies) in 0 invalid row(s)


Furthermore, values seem to be in a reasonable range:

In [38]:
df[columns].describe()

Unnamed: 0,additives_n,ingredients_from_palm_oil_n,ingredients_that_may_be_from_palm_oil_n
count,283884.0,283884.0,283884.0
mean,1.877267,0.023429,0.059736
std,2.501347,0.153089,0.280657
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,1.0,0.0,0.0
75%,3.0,0.0,0.0
max,30.0,2.0,6.0


## Columns: Tags

We first clean the "tag" columns:

In [39]:
columns_to_clean = [
    'additives',
    'additives_tags',
    'additives_en',
    'ingredients_from_palm_oil_tags',
    'ingredients_that_may_be_from_palm_oil_tags',
    'states',
    'states_tags',
    'states_en',
    'main_category',
    'main_category_en'
]

df[columns_to_clean] = df[columns_to_clean].applymap(tags.clean_tags)
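
The `tags` module is project code that is not shown here. Assuming that tags are stored as comma-separated strings, a cleaner of this kind could look like the sketch below. The function `clean_tags_sketch` is hypothetical and only illustrates the idea (trim, drop empty tags, drop duplicates); the actual `tags.clean_tags` may behave differently.

```python
import pandas as pd


def clean_tags_sketch(value):
    """Hypothetical cleaner: trim tags, drop empty ones and duplicates."""
    if pd.isna(value):
        return value  # leave NA untouched
    seen = []
    for tag in str(value).split(','):
        tag = tag.strip()
        if tag and tag not in seen:
            seen.append(tag)
    # An entry with no usable tag becomes NA rather than an empty string.
    return ','.join(seen) if seen else float('nan')


raw = pd.Series([' en:e330 , en:e330,, en:e440', None, ' , '])
cleaned = raw.map(clean_tags_sketch)
```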

## Columns: *image_url* and *image_small_url*

**Comment:** The columns *image_url* and *image_small_url* have type `object`. More than 20 % of their values are non-NA, all of type `str`; NA values are NaN of type `float`.

**Decision:** We keep both columns. We ensure there is no leading/trailing whitespace and replace invalid URLs (if any) by NA:

In [40]:
df['image_url'] = df['image_url'].str.strip()

cond_is_invalid_url = df['image_url'].map(is_invalid_url)
print(f'found {cond_is_invalid_url.sum()} invalid url(s)')
df.loc[cond_is_invalid_url, ['image_url']] = np.nan

found 0 invalid url(s)


In [41]:
df['image_small_url'] = df['image_small_url'].str.strip()

cond_is_invalid_url = df['image_small_url'].map(is_invalid_url)
print(f'found {cond_is_invalid_url.sum()} invalid url(s)')
df.loc[cond_is_invalid_url, ['image_small_url']] = np.nan

found 0 invalid url(s)
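
The helper `is_invalid_url` is defined elsewhere in this notebook. For readers who do not have it at hand, here is a minimal sketch of such a test based on `urllib.parse`. The name `is_invalid_url_sketch` and the rule it applies (an `http`/`https` scheme and a non-empty host) are assumptions; the notebook's own helper may be stricter or more lenient.

```python
import urllib.parse

import pandas as pd


def is_invalid_url_sketch(value) -> bool:
    """Hypothetical check: NA is not invalid; a URL needs a scheme and a host."""
    if pd.isna(value):
        return False
    parts = urllib.parse.urlparse(str(value))
    return parts.scheme not in ('http', 'https') or not parts.netloc


urls = pd.Series(['https://example.org/img/front.jpg', 'not a url', None])
flags = urls.map(is_invalid_url_sketch)
```

Returning `False` for NA keeps missing values out of the count of invalid URLs.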


### Cleaning - Column group: Nutrition facts<a name="task-a-cleaning-nutrition-facts"></a>  ([top](#top))
---

We first "profile" the columns:

In [42]:
columns = df.columns.to_series()['energy_100g': 'nutrition-score-uk_100g'].to_list()
style_percentages(profile(df[columns]))

Unnamed: 0,Types,NA,NA %,Non-NA,Non-NA %
energy_100g,{'float': 355971},60585,17.02 %,295386,82.98 %
energy-from-fat_100g,{'float': 355971},355102,99.76 %,869,0.24 %
fat_100g,{'float': 355971},76455,21.48 %,279516,78.52 %
saturated-fat_100g,{'float': 355971},92127,25.88 %,263844,74.12 %
-butyric-acid_100g,{'float': 355971},355971,100.00 %,0,0.00 %
-caproic-acid_100g,{'float': 355971},355971,100.00 %,0,0.00 %
-caprylic-acid_100g,{'float': 355971},355970,100.00 %,1,0.00 %
-capric-acid_100g,{'float': 355971},355969,100.00 %,2,0.00 %
-lauric-acid_100g,{'float': 355971},355967,100.00 %,4,0.00 %
-myristic-acid_100g,{'float': 355971},355970,100.00 %,1,0.00 %


#### Columns: *energy_100g*, *energy-from-fat_100g*
---

**Comment:** Both columns have type `float`. We look at the distribution of values:

In [43]:
columns = ['energy_100g', 'energy-from-fat_100g']
df[columns].describe()

Unnamed: 0,energy_100g,energy-from-fat_100g
count,295386.0,869.0
mean,1125.381352,587.216617
std,936.808583,713.255708
min,0.0,0.0
25%,381.0,49.4
50%,1092.0,300.0
75%,1674.0,900.0
max,231199.0,3830.0


**Comment:** Pure fat provides ca. 9000 kcal per kg, i.e. 9000 / 10 × 4.184 ≈ 3766 kJ per 100 g. Since fat is the most energy-dense nutrient, anything with more than 4000 kJ per 100 g or 100 ml is most likely an error. There are 100+ such entries:

In [44]:
(df['energy_100g'] > 4e3).sum()

113
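
The upper bound on energy can be reproduced with a few lines of arithmetic. The constants below are the usual rounded values (9 kcal per gram of fat, 4.184 kJ per kcal), so the result is an approximation:

```python
KCAL_PER_G_FAT = 9.0   # rounded energy density of fat
KJ_PER_KCAL = 4.184    # thermochemical calorie

# Energy contained in 100 g of pure fat, in kJ.
kj_per_100g_fat = KCAL_PER_G_FAT * 100 * KJ_PER_KCAL
print(f'{kj_per_100g_fat:.1f} kJ per 100 g of pure fat')
```

The threshold of 4000 kJ used above leaves a small margin over this value.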

**Decision:** Column *energy_100g*: We keep the column and drop the rows with more than 4000 kJ per 100 g or 100 ml. Column *energy-from-fat_100g*: We keep the column and all rows.

In [45]:
df = df.drop(df[df['energy_100g'] > 4e3].index, axis=0)

#### Columns: Nutrients
---

We are interested in the following columns:

In [46]:
columns = df.columns.to_series()['fat_100g': 'cocoa_100g'].to_list()
columns.remove('ph_100g')

**Comment:** All columns have type `float`. We first make sure that all non-NA entries are in the range *[0, 100]*. Note that this test is also suitable for columns that, despite their name, contain a percentage (*alcohol_100g* - % alcohol by volume, *fruits-vegetables-nuts_100g* - % of fruits, vegetables and nuts, *fruits-vegetables-nuts-estimate_100g* - % of fruits, vegetables and nuts (estimate), *collagen-meat-protein-ratio_100g* - % of the collagen in meat protein, *cocoa_100g* - % cocoa).

**Decision:** Keep the columns and drop rows that fail the test.

In [47]:
df_is_outlier = ((df[columns] < 0) | (df[columns] > 100))

In [48]:
n_invalid_entries = df_is_outlier.sum(axis='columns').sum()
n_invalid_rows = df_is_outlier.any(axis='columns').sum()
print(f'found {n_invalid_entries} invalid entry(-ies) in {n_invalid_rows} invalid row(s)')

found 315 invalid entry(-ies) in 259 invalid row(s)


In [49]:
df = df.drop(df[df_is_outlier.any(axis='columns')].index, axis=0)
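
It is worth noting how this test treats NA: comparisons with NaN evaluate to `False`, so missing entries are never flagged and rows are dropped only because of values that are actually present. A small made-up example illustrates this:

```python
import numpy as np
import pandas as pd

# Made-up sample: one value above 100, one negative value, two NA.
sample = pd.DataFrame({
    'fat_100g': [10.0, np.nan, 120.0],
    'sugars_100g': [-1.0, 5.0, np.nan],
})

is_outlier = (sample < 0) | (sample > 100)   # NaN compares as False
n_entries = int(is_outlier.sum(axis='columns').sum())
n_rows = int(is_outlier.any(axis='columns').sum())
kept = sample[~is_outlier.any(axis='columns')]
```

Here the middle row survives even though one of its entries is NA.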

**Comment:** We can also make sure that, when a macro-nutrient is broken down into micro-nutrients, the sum of the amounts of micro-nutrients does not exceed the amount of macro-nutrient.

**Decision:** Keep the columns and drop rows that fail the test:

In [50]:
macro = 'saturated-fat_100g'
micros = df.columns.to_series()['-butyric-acid_100g': '-melissic-acid_100g'].to_list()
cond_is_invalid = (df[micros].sum(axis='columns') > df[macro])
print(f'found {cond_is_invalid.sum()} invalid row(s)')

found 1 invalid row(s)


In [51]:
df = df.drop(df[cond_is_invalid].index, axis=0)

In [52]:
macro = 'omega-3-fat_100g'
micros = df.columns.to_series()['-alpha-linolenic-acid_100g': '-docosahexaenoic-acid_100g'].to_list()
cond_is_invalid = (df[micros].sum(axis='columns') > df[macro])
print(f'found {cond_is_invalid.sum()} invalid row(s)')

found 2 invalid row(s)


In [53]:
df = df.drop(df[cond_is_invalid].index, axis=0)

In [54]:
macro = 'omega-6-fat_100g'
micros = df.columns.to_series()['-linoleic-acid_100g': '-dihomo-gamma-linolenic-acid_100g'].to_list()
cond_is_invalid = (df[micros].sum(axis='columns') > df[macro])
print(f'found {cond_is_invalid.sum()} invalid row(s)')

found 0 invalid row(s)


In [55]:
macro = 'omega-9-fat_100g'
micros = df.columns.to_series()['-oleic-acid_100g': '-nervonic-acid_100g'].to_list()
cond_is_invalid = (df[micros].sum(axis='columns') > df[macro])
print(f'found {cond_is_invalid.sum()} invalid row(s)')

found 1 invalid row(s)


In [56]:
df = df.drop(df[cond_is_invalid].index, axis=0)

In [57]:
macro = 'sugars_100g'
micros = df.columns.to_series()['-sucrose_100g': '-maltodextrins_100g'].to_list()
cond_is_invalid = (df[micros].sum(axis='columns') > df[macro])
print(f'found {cond_is_invalid.sum()} invalid row(s)')

found 22 invalid row(s)


In [58]:
df = df.drop(df[cond_is_invalid].index, axis=0)
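
The five checks above share the same structure and could be factored into a helper. The sketch below is one possible way to do so; the function name `breakdown_exceeds_total` and the tolerance parameter `tol` (meant to absorb floating-point rounding) are assumptions and are not part of the cells above.

```python
import numpy as np
import pandas as pd


def breakdown_exceeds_total(df, macro, micros, tol=1e-9):
    """Flag rows where the summed components exceed the reported total."""
    # sum() skips NaN; a NaN total compares as False, so it is never flagged.
    return df[micros].sum(axis='columns') > df[macro] + tol


# Made-up sample: only the second row has components exceeding the total.
sample = pd.DataFrame({
    'sugars_100g': [10.0, 5.0, np.nan],
    '-sucrose_100g': [4.0, 4.0, 3.0],
    '-maltodextrins_100g': [np.nan, 2.0, 1.0],
})
flags = breakdown_exceeds_total(
    sample, 'sugars_100g', ['-sucrose_100g', '-maltodextrins_100g'])
```

As the last row shows, a row whose total is NA passes the test whatever its components are.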

#### Column: *ph_100g*
---

**Comment:** The column has type `float`. After reading the page dedicated to [pH](https://en.wikipedia.org/wiki/PH) on Wikipedia, we decide to check that entries are in the range *[2, 12]* (the usual range seems to be *[0, 14]* but we are dealing with food):

In [59]:
cond_is_invalid = ((df['ph_100g'] < 2) | (df['ph_100g'] > 12))

In [60]:
n_invalid_entries = cond_is_invalid.sum()
print(f'found {n_invalid_entries} invalid entry(-ies)')

found 4 invalid entry(-ies)


**Decision:** Keep the column and drop rows that fail the test:

In [61]:
df = df.drop(df[cond_is_invalid].index, axis=0)
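
The same range test can also be written with `Series.between`, which is inclusive on both ends by default. Since `between` returns `False` for NA, the sketch below adds `notna()` so that missing values are not reported as invalid. The data is made up.

```python
import numpy as np
import pandas as pd

ph = pd.Series([3.5, np.nan, 0.0, 14.0, 7.0])

in_range = ph.between(2, 12)          # False for NaN and for out-of-range values
is_invalid = ph.notna() & ~in_range   # present but outside [2, 12]
```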

#### Column: *carbon-footprint_100g*
---

**Comment:** The column has type `float`. We look at the distribution of values:

In [62]:
df[['carbon-footprint_100g']].describe()

Unnamed: 0,carbon-footprint_100g
count,278.0
mean,335.790664
std,423.244817
min,0.0
25%,82.65
50%,190.95
75%,378.7
max,2842.0


**Comment:** We lack subject-matter knowledge to assess the above. Turning to the Internet, these values seem to make sense (see e.g. [Climate change food calculator: What's your diet's carbon footprint?](https://www.bbc.com/news/science-environment-46459714)).

**Decision:** We keep the column and all rows.

#### Column: *nutrition-score-fr_100g*
---

**Comment:** The column has type `float` and contains > 70 % of non-NA entries. After turning to the Internet to gain some knowledge, we found out that the final score for the Nutri-Score lies in range *[-15, 40]* (see e.g. [here](https://solidarites-sante.gouv.fr/prevention-en-sante/preserver-sa-sante/nutrition/article/articles-scientifiques-et-documents-publies-relatifs-au-nutri-score)). All non-NA entries seem to be valid in that regard:

In [63]:
df[['nutrition-score-fr_100g']].describe()

Unnamed: 0,nutrition-score-fr_100g
count,254651.0
mean,9.160537
std,8.997361
min,-15.0
25%,1.0
50%,10.0
75%,16.0
max,40.0


**Decision:** We keep the column and all rows.

#### Column: *nutrition-score-uk_100g*
---

**Comment:** The column has type `float` and contains > 70 % of non-NA entries. After turning to the Internet to gain some knowledge, we found out that the final score for the Nutrient Profiling Model lies in range *[-15, 40]* (see e.g. [here](https://www.gov.uk/government/publications/the-nutrient-profiling-model)). All non-NA entries seem to be valid in that regard:

In [64]:
df[['nutrition-score-uk_100g']].describe()

Unnamed: 0,nutrition-score-uk_100g
count,254651.0
mean,8.974765
std,9.149149
min,-15.0
25%,1.0
50%,9.0
75%,16.0
max,37.0


**Decision:** Keep the column and all rows. 
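
Rather than reading the minimum and maximum off `describe()`, the range *[-15, 40]* can be asserted directly for both score columns. The helper `all_within` below is an illustrative name, and the data is made up:

```python
import numpy as np
import pandas as pd


def all_within(values: pd.Series, low: float, high: float) -> bool:
    """True if every non-NA value lies in [low, high]."""
    present = values.dropna()
    return bool(present.between(low, high).all())


# Made-up sample standing in for the two score columns.
scores = pd.DataFrame({
    'nutrition-score-fr_100g': [-15.0, 10.0, np.nan, 40.0],
    'nutrition-score-uk_100g': [-15.0, 9.0, np.nan, 37.0],
})
checks = {col: all_within(scores[col], -15, 40) for col in scores.columns}
```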

## Result<a name="task-a-cleaning-result"></a> ([top](#top))
---

In [92]:
df.dtypes

code                                                 object
url                                                  object
creator                                              object
product_name                                         object
packaging                                            object
packaging_tags                                       object
brands                                               object
brands_tags                                          object
categories                                           object
categories_tags                                      object
categories_en                                        object
origins                                              object
origins_tags                                         object
manufacturing_places                                 object
manufacturing_places_tags                            object
labels                                               object
labels_tags                             

In [102]:
import importlib
importlib.reload(utils)

rootname = pathlib.Path.cwd().joinpath('en.openfoodfacts.org.products.clean')
utils.dump_dataframe(df, rootname)

In [103]:
import importlib
importlib.reload(utils)

df2 = utils.load_dataframe(rootname)
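
Comparing `dtypes` by eye is one way to check the round trip. A stricter option is `pd.testing.assert_frame_equal`, which compares values, dtypes and index. The sketch below uses pickle as a stand-in, because the format used by `utils.dump_dataframe` and `utils.load_dataframe` is not shown here; the file name and the sample data are made up.

```python
import pathlib
import tempfile

import numpy as np
import pandas as pd

original = pd.DataFrame({
    'code': ['0001', '0002'],          # codes kept as strings (leading zeros)
    'energy_100g': [381.0, np.nan],
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = pathlib.Path(tmpdir) / 'roundtrip.pkl'
    original.to_pickle(path)           # stand-in for utils.dump_dataframe
    restored = pd.read_pickle(path)    # stand-in for utils.load_dataframe

# Raises AssertionError if values, dtypes or index differ.
pd.testing.assert_frame_equal(original, restored)
```

In this notebook, the equivalent call would compare `df` and `df2`.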

In [104]:
df2.dtypes

code                                                 object
url                                                  object
creator                                              object
product_name                                         object
packaging                                            object
packaging_tags                                       object
brands                                               object
brands_tags                                          object
categories                                           object
categories_tags                                      object
categories_en                                        object
origins                                              object
origins_tags                                         object
manufacturing_places                                 object
manufacturing_places_tags                            object
labels                                               object
labels_tags                             