# Course 2: Project - Data cleaning

This notebook is concerned with cleaning the [Open Food Facts](https://www.kaggle.com/openfoodfacts/world-food-facts) dataset (version 5), downloaded from Kaggle. The dataset originates from https://world.openfoodfacts.org/data. A description of the fields is available at https://static.openfoodfacts.org/data/data-fields.txt.

**Contents:**
* [Imports](#imports)
* [Preparatives](#preparatives)
* [A. Importing, cleaning and numerical summaries](#task-a)
  * [A.1. Column group: General information](#task-a1-general-information)
  * [A.1. Column group: Tags](#task-a1-tags)

## Imports<a name="imports"></a>
---

In [37]:
# Standard library:
import functools
import pathlib
import re
import typing as t

# 3rd party:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Project:
import cleanquantity
import ean

%matplotlib inline

## Preparatives<a name="preparatives"></a>
---

This section gathers utility functions and other helpers that we use later in this notebook.

### Utilities

In [79]:
@functools.wraps(display)  # nicer for interactive use
def display_allcols(*args, **kwargs):
    """Behaves exactly like ``display`` but in a context where Pandas displays all columns."""
    with pd.option_context('display.max_columns', None):
        display(*args, **kwargs)
        
        
@functools.wraps(display)  # nicer for interactive use
def display_allcols_notrunc(*args, **kwargs):
    """Behaves exactly like ``display`` but in a context where Pandas displays all columns with no truncation."""
    # ``None`` disables truncation of cell contents (pandas >= 1.0); older versions used -1.
    with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
        display(*args, **kwargs)


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return, for each column, the Python types of its values, the NA count and the NA percentage."""

    def get_type_name(obj: t.Any) -> str:
        return type(obj).__name__
        
    types_ = [df[col].map(get_type_name).value_counts().to_dict() for col in df.columns]
        
    data = {
        'Types': types_,
        'NA Count': df.apply(lambda series: series.isna().sum()),
        'NA %': df.apply(lambda series: series.isna().mean() * 100.0)
    }
    return pd.DataFrame(data).transpose()
    

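def parse_t(series: pd.Series) -> pd.Series:
    """Parse Unix timestamps (seconds since the epoch) into UTC datetimes."""
    return pd.to_datetime(series, utc=True, unit='s')


def parse_datetime(series: pd.Series) -> pd.Series:
    """Parse ISO 8601 strings such as ``2016-09-17T09:17:46Z`` into timezone-aware datetimes."""
    return pd.to_datetime(series, format='%Y-%m-%dT%H:%M:%S%z')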


def move_after(words: t.List[str], word: str, word_to_move: str) -> t.List[str]:
    """Utility function to re-order columns."""
    try:
        word_idx = words.index(word)
        word_to_move_idx = words.index(word_to_move)
    except ValueError:
        pass
    else:
        if word_idx < word_to_move_idx:
            words.pop(word_to_move_idx)
            words.insert(word_idx + 1, word_to_move)
        else:
            words.insert(word_idx + 1, word_to_move)
            words.pop(word_to_move_idx)
    return words

### EAN-13/EAN-8/UPC-A

As per the description of the field `code`: For products with a barcode, this is the barcode of the product (EAN-13 code or some internal code assigned by the store). For products without a barcode, Open Food Facts assigns a number starting with the 200 reserved prefix. We implement utility functions to check whether a given code is a valid EAN-13/EAN-8/UPC-A code.

**Note:** In order not to clutter the notebook the code is in a separate module - *ean.py*.
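
The module itself is not reproduced here. As a rough sketch of what such a check might look like (the function name `is_valid_gtin` is hypothetical and not necessarily what *ean.py* exposes), the standard GS1 check-digit rule weights the digits alternately by 3 and 1, starting from the digit next to the check digit and moving left:

```python
def is_valid_gtin(code: str) -> bool:
    """Check the GS1 check digit of an EAN-13, EAN-8 or UPC-A code (hypothetical helper)."""
    # EAN-8 has 8 digits, UPC-A has 12 and EAN-13 has 13.
    if len(code) not in (8, 12, 13) or not (code.isascii() and code.isdigit()):
        return False
    *body, check = (int(ch) for ch in code)
    # Walking leftwards from the check digit, weights alternate 3, 1, 3, 1, ...
    total = sum(digit * (3 if i % 2 == 0 else 1)
                for i, digit in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


print(is_valid_gtin('4006381333931'))  # EAN-13 -> True
print(is_valid_gtin('036000291452'))   # UPC-A  -> True
print(is_valid_gtin('73513537'))       # EAN-8  -> True
print(is_valid_gtin('4006381333930'))  # wrong check digit -> False
```

A single function covers all three formats because they share the same weighting scheme once the digits are aligned on the right.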

## A. Importing, cleaning and numerical summaries<a name="task-a"></a>
---

Since we are not familiar with the dataset and have been warned that it is quite messy, we first let Pandas read the TSV file entirely into memory and guess the type of each column. At the end of this notebook, we will export a cleaned-up file that Pandas can read more efficiently (e.g. by specifying the type of each column upfront).

In [39]:
data_filename = pathlib.Path.cwd().joinpath('en.openfoodfacts.org.products.tsv')

In [40]:
df = pd.read_csv(data_filename, sep='\t', low_memory=False)
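
For comparison, once the column types are known they can be passed to `read_csv` upfront instead of being guessed. A minimal sketch on a made-up two-row TSV (the zero-padded codes and the choice of columns are for illustration only and do not describe the file exported later):

```python
import io

import pandas as pd

# Made-up sample; the real file has far more columns.
raw = (
    "code\tproduct_name\tenergy_100g\n"
    "0000000003087\tFarine de blé noir\t\n"
    "0000000004530\tBanana Chips Sweetened (Whole)\t2243\n"
)

# Declaring 'code' as a string keeps leading zeros, which type guessing would strip.
dtypes = {'code': str, 'product_name': str, 'energy_100g': float}
typed = pd.read_csv(io.StringIO(raw), sep='\t', dtype=dtypes)

print(typed.dtypes)
print(typed['code'].tolist())
```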

We first get some general information about the `DataFrame`...

In [41]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} data points with {ncols} features')

the dataset contains 356027 data points with 163 features


**Note:** It turns out that reading the TSV file this way is problematic (at least on macOS) since 26 lines contain a carriage return. We noticed this by focusing on the first row where `code` was NaN and looking at the line corresponding to the previous row directly in the TSV file:
```bash
sed -n '193909 l' ./en.openfoodfacts.org.products.tsv
(...)fr-32-464-040-ec\t43.400279,0.199525\r\t\tvillecomtal-sur-arros-gers-france(...)
                                         ^^
```
26 data points is a negligible fraction of all data points and we could have dropped the rows but it turns out that there is an even simpler solution:

In [42]:
df = pd.read_csv(data_filename, sep='\t', lineterminator='\n', low_memory=False)

In [43]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} data points with {ncols} features')

the dataset contains 356001 data points with 163 features
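
The effect can be reproduced on a tiny, made-up TSV string with a stray carriage return inside a field. With `lineterminator='\n'`, only the line feed ends a record, so the carriage return stays inside the field instead of starting a new row:

```python
import io

import pandas as pd

# Made-up sample: the first record has a stray '\r' at the end of its 'name' field.
raw = "code\tname\tcity\n1\tfoo\r\tparis\n2\tbar\tlyon\n"

# Default settings: the parser may treat the lone '\r' as the end of a record.
df_default = pd.read_csv(io.StringIO(raw), sep='\t')

# Explicit line terminator: only '\n' ends a record.
df_fixed = pd.read_csv(io.StringIO(raw), sep='\t', lineterminator='\n')

print(len(df_default), len(df_fixed))
print(df_fixed['city'].tolist())
```

Note that the carriage return is still present in the affected field after parsing; it can be removed later with `str.strip` if needed.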


Having taken care of this, we look at the first 5 rows:

In [44]:
display_allcols(df.head(5))

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,serving_size,no_nutriments,additives_n,additives,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutrition_grade_uk,nutrition_grade_fr,pnns_groups_1,pnns_groups_2,states,states_tags,states_en,main_category,main_category_en,image_url,image_small_url,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,-butyric-acid_100g,-caproic-acid_100g,-caprylic-acid_100g,-capric-acid_100g,-lauric-acid_100g,-myristic-acid_100g,-palmitic-acid_100g,-stearic-acid_100g,-arachidic-acid_100g,-behenic-acid_100g,-lignoceric-acid_100g,-cerotic-acid_100g,-montanic-acid_100g,-melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,-alpha-linolenic-acid_100g,-eicosapentaenoic-acid_100g,-docosahexaenoic-acid_100g,omega-6-fat_100g,-linoleic-acid_100g,-arachidonic-acid_100g,-gamma-linolenic-acid_100g,-dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,-oleic-acid_100g,-elaidic-acid_100g,-gondoic-acid_100g,-mead-acid_100g,-erucic-acid_100g,-nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,-sucrose_100g,-glucose_100g,-fructose_100g,-lactose_100g,-maltose_100g,-maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
0,3087,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1474103866,2016-09-17T09:17:46Z,1474103893,2016-09-17T09:18:13Z,Farine de blé noir,,1kg,,,Ferme t'y R'nao,ferme-t-y-r-nao,,,,,,,,,,,,,,,,,,en:FR,en:france,France,,,,,,,,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,4530,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Banana Chips Sweetened (Whole),,,,,,,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Bananas, vegetable oil (coconut oil, corn oil ...",,,,,,28 g (1 ONZ),,0.0,[ bananas -> en:bananas ] [ vegetable-oil -...,,,0.0,,,0.0,,,,d,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,2243.0,,28.57,28.57,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.018,64.29,14.29,,,,,,,,,3.6,3.57,,,,0.0,0.0,,0.0,,,,,0.0214,,,,,,,,,,,,,,0.0,,0.00129,,,,,,,,,,,,,,,,,,,14.0,14.0,,
2,4559,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Peanuts,,,,,Torn & Glasser,torn-glasser,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Peanuts, wheat flour, sugar, rice flour, tapio...",,,,,,28 g (0.25 cup),,0.0,[ peanuts -> en:peanuts ] [ wheat-flour -> ...,,,0.0,,,0.0,,,,b,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,1941.0,,17.86,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,60.71,17.86,,,,,,,,,7.1,17.86,,,,0.635,0.25,,0.0,,,,,0.0,,,,,,,,,,,,,,0.071,,0.00129,,,,,,,,,,,,,,,,,,,0.0,0.0,,
3,16087,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055731,2017-03-09T10:35:31Z,1489055731,2017-03-09T10:35:31Z,Organic Salted Nut Mix,,,,,Grizzlies,grizzlies,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Organic hazelnuts, organic cashews, organic wa...",,,,,,28 g (0.25 cup),,0.0,[ organic-hazelnuts -> en:organic-hazelnuts ...,,,0.0,,,0.0,,,,d,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,2540.0,,57.14,5.36,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,17.86,3.57,,,,,,,,,7.1,17.86,,,,1.22428,0.482,,,,,,,,,,,,,,,,,,,,,0.143,,0.00514,,,,,,,,,,,,,,,,,,,12.0,12.0,,
4,16094,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055653,2017-03-09T10:34:13Z,1489055653,2017-03-09T10:34:13Z,Organic Polenta,,,,,Bob's Red Mill,bob-s-red-mill,,,,,,,,,,,,,,,,,,US,en:united-states,United States,Organic polenta,,,,,,35 g (0.25 cup),,0.0,[ organic-polenta -> en:organic-polenta ] [...,,,0.0,,,0.0,,,,,,,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,1552.0,,1.43,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,77.14,,,,,,,,,,5.7,8.57,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


For each column, we briefly look at the type guessed by Pandas and the number of non-null values:

In [45]:
df.info(verbose=True, show_counts=True)  # 'show_counts' replaces the deprecated 'null_counts' (pandas >= 1.2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356001 entries, 0 to 356000
Data columns (total 163 columns):
code                                          356001 non-null object
url                                           356001 non-null object
creator                                       355998 non-null object
created_t                                     356001 non-null int64
created_datetime                              356000 non-null object
last_modified_t                               356001 non-null int64
last_modified_datetime                        356001 non-null object
product_name                                  338489 non-null object
generic_name                                  57688 non-null object
quantity                                      119262 non-null object
packaging                                     89959 non-null object
packaging_tags                                89959 non-null object
brands                                        326977 non-null obj

**Comment:** We notice that 20+ columns contain only NA values.

**Decision:** Drop these columns.

In [76]:
df = df.dropna(how='all', axis=1)

In [75]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} data points with {ncols} features')

the dataset contains 355971 data points with 142 features
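
As a minimal illustration of the drop above, the following sketch uses a made-up three-column frame (the values are invented) to show that `dropna(how='all', axis=1)` removes only the columns in which every value is NA, and how those columns can be listed beforehand:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'code': ['1', '2', '3'],
    'generic_name': [np.nan, 'Soda', np.nan],  # partly NA: kept
    'empty': [np.nan, np.nan, np.nan],         # only NA: dropped
})

# List the columns that contain nothing but NA values.
all_na = toy.columns[toy.isna().all()]
print(list(all_na))           # ['empty']

trimmed = toy.dropna(how='all', axis=1)
print(list(trimmed.columns))  # ['code', 'generic_name']
```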


### A.1. Column group: General information<a name="task-a1-general-information"></a>

#### Column: `code`

In [46]:
profile(df[['code']])

Unnamed: 0,code
Types,{'str': 356001}
NA Count,0
NA %,0


**Comment:** The column `code` has type `object`, all values are of type `str` and there are no NA values.

**Decision:** Keep the column and all rows.

Out of curiosity, we check how many codes belong to the following categories: `e` - valid EAN-13/EAN-8/UPC-A code, `a` - code assigned by Open Food Facts (prefix 200) and `i` - internal code (store). (Mistyped EAN-13/EAN-8/UPC-A codes get miscategorized as internal codes but we will not pursue this further.) We see that the majority of all codes (ca. 88 %) are valid EAN-13/EAN-8/UPC-A codes:

In [47]:
def categorize(code: str) -> str:
    return ('e' if ean.is_valid(code) else
            'a' if code.startswith('200') else
            'i')


df['code'].map(categorize).value_counts()

e    313911
i     40340
a      1750
Name: code, dtype: int64

#### Column: `url`

In [48]:
profile(df[['url']])

Unnamed: 0,url
Types,{'str': 356001}
NA Count,0
NA %,0


**Comment:** The column `url` has type `object`, all values are of type `str` and there are no NA values.

**Decision:** Keep the column and all rows.

#### Column: `creator`

In [49]:
profile(df[['creator']])

Unnamed: 0,creator
Types,"{'str': 355998, 'float': 3}"
NA Count,3
NA %,0.000842694


**Comment:** The column `creator` has type `object`, non-NA values are of type `str` and there is a negligible number of NA values.

**Decision:** Keep the column and all rows.

#### Columns: `created_(t,datetime)`, `last_modified_(t,datetime)`

In [50]:
profile(df[['created_t', 'created_datetime', 'last_modified_t', 'last_modified_datetime']])

Unnamed: 0,created_t,created_datetime,last_modified_t,last_modified_datetime
Types,{'int': 356001},"{'str': 356000, 'float': 1}",{'int': 356001},{'str': 356001}
NA Count,0,1,0,0
NA %,0,0.000280898,0,0


**Comment:** For the `created_t`/`created_datetime` pair: The column `created_t` has type `int64`, all values are of type `int` and there are no NA values. The column `created_datetime` has type `object`, non-NA values are of type `str` and there is 1 NA value. We confirm that the two columns agree wherever both are not NA:

In [51]:
s1 = parse_t(df['created_t'])
s2 = parse_datetime(df['created_datetime'])
both_notna = s1.notna() & s2.notna()
(s1[both_notna] == s2[both_notna]).all()

True
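
To make the comparison concrete, here is a small self-contained sketch using the values of the first two rows displayed earlier. It parses the epoch seconds and the ISO 8601 strings the same way `parse_t` and `parse_datetime` do and checks that both yield the same UTC timestamps:

```python
import pandas as pd

# Values taken from the first two rows displayed earlier in this notebook.
epoch_seconds = pd.Series([1474103866, 1489069957])
iso_strings = pd.Series(['2016-09-17T09:17:46Z', '2017-03-09T14:32:37Z'])

# Same calls as in parse_t and parse_datetime.
from_epoch = pd.to_datetime(epoch_seconds, unit='s', utc=True)
from_iso = pd.to_datetime(iso_strings, format='%Y-%m-%dT%H:%M:%S%z')

print(from_epoch.iloc[0])              # 2016-09-17 09:17:46+00:00
print((from_epoch == from_iso).all())  # True
```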

**Decision:** Use the column `created_t` to generate a column `created_on` (type `Timestamp`) and drop the columns `created_t` and `created_datetime`:

In [52]:
df['created_on'] = s1
df = df.drop(columns=['created_t', 'created_datetime'])

**Comment:** For the `last_modified_t`/`last_modified_datetime` pair: The column `last_modified_t` has type `int64`, all values are of type `int` and there are no NA values. The column `last_modified_datetime` has type `object`, all values are of type `str` and there are no NA values either. We confirm that the two columns agree wherever both are not NA:

In [53]:
s1 = parse_t(df['last_modified_t'])
s2 = parse_datetime(df['last_modified_datetime'])
both_notna = s1.notna() & s2.notna()
(s1[both_notna] == s2[both_notna]).all()

True

**Decision:** Use the column `last_modified_t` to generate a column `last_modified_on` (type `Timestamp`) and drop the columns `last_modified_t` and `last_modified_datetime`:

In [54]:
df['last_modified_on'] = s1
df = df.drop(columns=['last_modified_t', 'last_modified_datetime'])

#### Column: `product_name`

In [55]:
profile(df[['product_name']])

Unnamed: 0,product_name
Types,"{'str': 338489, 'float': 17512}"
NA Count,17512
NA %,4.91909


**Comment:** The column `product_name` has type `object`, non-NA values are of type `str` and just under 5 % of the values are NA.

**Decision:** Keep the column and all rows.

#### Column: `generic_name`

In [56]:
profile(df[['generic_name']])

Unnamed: 0,generic_name
Types,"{'float': 298313, 'str': 57688}"
NA Count,298313
NA %,83.7956


**Comment:** The column `generic_name` has type `object`, non-NA values are of type `str` and over 80 % of the values are NA. Inspecting a couple of records by hand, we notice that the language varies considerably. Perhaps tellingly, this column is not even documented.

**Decision:** Drop the column.

In [57]:
df = df.drop(columns='generic_name')

#### Column: `quantity`

In [58]:
profile(df[['quantity']])

Unnamed: 0,quantity
Types,"{'float': 236739, 'str': 119262}"
NA Count,236739
NA %,66.4995


**Comment:** The column `quantity` has type `object`, non-NA values are of type `str` and over 65 % of the values are NA. Inspecting a couple of entries by hand, we notice that in most cases, the column indicates the quantity sold at once and the unit of measurement used. We also notice at least the following issues: a) Incorrect/incomplete entries (e.g. price in euros, product name, unitless quantity, etc.). b) Multiple languages are used (e.g. _320 г_ seems to mean 320 g in Russian). c) Mixed metric and imperial units.

**Decision:** We try to salvage as many of the non-NA values as possible as it might be interesting to know the quantity of product sold at once.

In order to keep complexity under control, we set the following rules for this task:
* An entry must be a "valid number" followed by a "valid unit". White spaces are allowed and ignored. Additional information at the end is allowed and ignored. Letter case is ignored.
* A "valid number" is any string that matches `r'\d+(?:[.,]\d*)?'`.
* A "valid unit" is any string in `VALID_UNITS` (see code).
* Since imperial units differ between UK, US and USC, we decide to *ignore* those (see e.g. [How US labelling requirements undermine honest labelling in the UK](http://metricviews.org.uk/2013/03/how-us-labelling-requirements-undermine-honest-labelling-in-the-uk/).)

**Desired output:** 2 columns: `quantity_number` (type `float`) and `quantity_unit` (type `str`, either `g` or `l`). In the process, we convert all weights to grams and all volumes to liters.

**Note:** In order not to clutter the notebook, most of the code is in a separate module - _cleanquantity.py_.
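
The actual implementation lives in _cleanquantity.py_ and is not shown. The rules above could be sketched roughly as follows; `PATTERN`, `UNIT_FACTORS` and `parse_quantity` are hypothetical names, and the unit table is a small illustrative subset rather than the real `VALID_UNITS`:

```python
import re
import typing as t

# Hypothetical unit table: lower-cased unit -> (conversion factor, standard unit).
UNIT_FACTORS = {
    'mg': (0.001, 'g'), 'g': (1.0, 'g'), 'kg': (1000.0, 'g'),
    'ml': (0.001, 'l'), 'cl': (0.01, 'l'), 'dl': (0.1, 'l'), 'l': (1.0, 'l'),
}

_units = '|'.join(sorted(UNIT_FACTORS, key=len, reverse=True))

# A "valid number", optional white space, then a "valid unit" that is not
# immediately followed by another letter (so that '2 gallons' is not read as grams).
PATTERN = re.compile(
    rf'^\s*(?P<number>\d+(?:[.,]\d*)?)\s*(?P<unit>{_units})(?![a-z])',
    re.IGNORECASE,
)


def parse_quantity(text: str) -> t.Optional[t.Tuple[float, str]]:
    """Return (number, unit) in grams or liters, or None if the entry is rejected."""
    match = PATTERN.match(text)
    if match is None:
        return None
    number = float(match['number'].replace(',', '.'))
    factor, unit = UNIT_FACTORS[match['unit'].lower()]
    return number * factor, unit


print(parse_quantity('1kg'))        # (1000.0, 'g')
print(parse_quantity('1,5 L'))      # (1.5, 'l')
print(parse_quantity('320 г'))      # None (unit not in the table)
print(parse_quantity('2 gallons'))  # None (imperial units are ignored)
```

Anything after the unit is ignored, so an entry such as `500 ml extra` would still be accepted under these assumptions.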

In [59]:
nnotna = df['quantity'].notna().sum()

# Create a DataFrame with columns 'number' and 'unit':
df_qty = df['quantity'].str.extract(cleanquantity.PATTERN_QUANTITY)

# Convert numbers to 'float':
df_qty['number'] = (
    df_qty['number']
    .str.replace(',', '.', regex=False)
    .astype(float)
)

# Convert units to lower case:
df_qty['unit'] = (
    df_qty['unit']
    .str.lower()
)

# Standardize:
df_qty = (
    df_qty[['number', 'unit']]
    .apply(cleanquantity.standardize_quantity, axis=1, result_type='expand')
    .rename(columns={0: 'number', 1: 'unit'})
)

nstandardized = df_qty['number'].notna().sum()

df['quantity_number'] = df_qty['number']
df['quantity_unit'] = df_qty['unit']
df = df.drop(columns='quantity')

pstandardizednotna = nstandardized / nnotna * 100
print(f"entries not-NA: {nnotna}")
print(f"entries standardized: {nstandardized} ({pstandardizednotna:.2f} % of not-NA)")

entries not-NA: 119262
entries standardized: 106470 (89.27 % of not-NA)


We check the result for outliers:

In [60]:
(
    df[['quantity_number', 'quantity_unit']]
    .groupby(by='quantity_unit')
    .describe()
)

Unnamed: 0_level_0,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number,quantity_number
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
quantity_unit,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
g,86658.0,403.239589,8037.160368,0.0,150.0,250.0,420.0,1390000.0
l,19812.0,0.809454,3.336295,0.0,0.35,0.75,1.0,450.0


**Comment (cont.):** Weight: Inspecting a couple of entries above 10 kg by hand (22 entries), we notice the following: a) At least one record describes food for *animals*, not for humans (code: 289259, categories: *aliment pour chevaux* (*food for horses*)). b) Some values do make sense (e.g. a 25 kg bag of flour for a baker) while others do not. Volume: Inspecting a couple of records above 12 liters by hand (12 entries), we notice that some values do make sense (e.g. a 20 l barrel of wine) while others do not.

**Decision (cont.):** We have a negligible number of outliers. Weight: Replace a weight of 0 by NA and drop records above 10 kg, except when `product_name` contains the word _farine_ (_flour_). Volume: Replace 0 by NA and drop records above 12 l, except when `product_name` contains the word _tonneau_ (_barrel_).

In [61]:
df.loc[df['quantity_number'] == 0, ['quantity_number', 'quantity_unit']] = np.nan

weight_cond = ((df['quantity_unit'] == 'g')
          & (df['quantity_number'] > 10_000)
          & ~df['product_name'].str.contains('farine', case=False, na=False, regex=False))

volume_cond = ((df['quantity_unit'] == 'l')
          & (df['quantity_number'] > 12)
          & ~df['product_name'].str.contains('tonneau', case=False, na=False, regex=False))

df = df.drop(df[weight_cond | volume_cond].index, axis=0)

In [77]:
nrows, ncols = df.shape
print(f'the dataset contains {nrows} data points with {ncols} features')

the dataset contains 355971 data points with 142 features


### A.1. Column group: Tags<a name="task-a1-tags"></a>

Hypothesis for tags:
* `<name>` is a comma-separated list of short descriptions in one of the languages of the country where the product was bought. These keywords may include white spaces and seem to start with an upper case letter.
* `<name>_en` is the translation of `<name>` to English.
* `<name>_tags` is a comma-separated list of tags, where tags are derived from the keywords by lower-casing all words and joining them with dashes. Each tag is optionally prefixed with a two-letter language code followed by a colon.
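
As a rough illustration of the hypothesized derivation, the hypothetical helper `to_tag` below lower-cases a keyword and joins its words with dashes. It reproduces some of the keyword/tag pairs visible in the rows displayed earlier, but it is only a sketch: it does not attempt to reproduce how Open Food Facts handles accented characters, and it is not known whether the real tags are derived this way.

```python
import re
import typing as t


def to_tag(keyword: str, lang: t.Optional[str] = None) -> str:
    """Derive a tag from a keyword (hypothetical helper, ASCII keywords only)."""
    # Lower-case, then replace every run of other characters with a single dash.
    slug = re.sub(r'[^a-z0-9]+', '-', keyword.strip().lower()).strip('-')
    return f'{lang}:{slug}' if lang else slug


print(to_tag("Bob's Red Mill"))   # bob-s-red-mill
print(to_tag('Torn & Glasser'))   # torn-glasser
print(to_tag('Plant-based foods and beverages', lang='en'))
# en:plant-based-foods-and-beverages
```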

**Comment:** The number of entries does not always agree between the three columns, e.g. for `categories`:

In [86]:
# Split 'categories' into a list of short descriptions:
s1 = df['categories'].map(lambda text: text.split(','), na_action='ignore')
# Count the number of short descriptions:
ns1 = s1.map(lambda keywords: len(keywords), na_action='ignore')

# Split 'categories_en' into a list of short descriptions:
s2 = df['categories_en'].map(lambda text: text.split(','), na_action='ignore')
# Count the number of short descriptions:
ns2 = s2.map(lambda keywords: len(keywords), na_action='ignore')

# Split 'categories_tags' into a list of tags:
s3 = df['categories_tags'].map(lambda text: text.split(','), na_action='ignore')
# Count the number of tags:
ns3 = s3.map(lambda tags: len(tags), na_action='ignore')

df_tmp = pd.DataFrame(data={
    'categories': s1, 'categories_len': ns1,
    'categories_en': s2, 'categories_en_len': ns2,
    'categories_tags': s3, 'categories_tags_len': ns3
})

display(
    (df_tmp['categories_len'] != df_tmp['categories_en_len']).sum(),
    (df_tmp['categories_len'] != df_tmp['categories_tags_len']).sum(),
    (df_tmp['categories_en_len'] != df_tmp['categories_tags_len']).sum()
)

display_allcols_notrunc(df_tmp[(
    df_tmp['categories'].notna() &
    ~((df_tmp['categories_len'] == df_tmp['categories_en_len']) & (df_tmp['categories_len'] == df_tmp['categories_tags_len'])))
])

305529

305529

252712

Unnamed: 0,categories,categories_len,categories_en,categories_en_len,categories_tags,categories_tags_len
176,[Légumes-feuilles],1.0,"[Plant-based foods and beverages, Plant-based foods, Fresh foods, Fruits and vegetables based foods, Fresh plant-based foods, Vegetables based foods, Fresh vegetables, Leaf vegetables]",8.0,"[en:plant-based-foods-and-beverages, en:plant-based-foods, en:fresh-foods, en:fruits-and-vegetables-based-foods, en:fresh-plant-based-foods, en:vegetables-based-foods, en:fresh-vegetables, en:leaf-vegetables]",8.0
184,"[Aliments et boissons à base de végétaux, Aliments d'origine végétale, Légumineuses et dérivés, Céréales et pommes de terre, Légumineuses, Graines, Graines de légumineuses, Légumes secs, Lentilles vertes]",9.0,"[Plant-based foods and beverages, Plant-based foods, Cereals and potatoes, Legumes and their products, Legumes, Seeds, Legume seeds, Pulses, Lentils, Green lentils]",10.0,"[en:plant-based-foods-and-beverages, en:plant-based-foods, en:cereals-and-potatoes, en:legumes-and-their-products, en:legumes, en:seeds, en:legume-seeds, en:pulses, en:lentils, en:green-lentils]",10.0
187,[Quiches lorraines],1.0,"[Meals, Pizzas pies and quiches, Quiches, fr:Quiches lorraines]",4.0,"[en:meals, en:pizzas-pies-and-quiches, en:quiches, fr:quiches-lorraines]",4.0
190,[en:beverages],1.0,"[Beverages, Non-sugared beverages]",2.0,"[en:beverages, en:non-sugared-beverages]",2.0
191,"[Aliments et boissons à base de végétaux, Boissons, Aliments d'origine végétale, Boissons chaudes, Infusions, Thés, Thés noirs, Boissons non sucrées, Thés noirs aromatisés, Thés aromatisés]",10.0,"[Plant-based foods and beverages, Beverages, Plant-based foods, Hot beverages, Plant-based beverages, Herbal teas, Teas, Black teas, Flavored teas, Flavored black teas, Non-sugared beverages]",11.0,"[en:plant-based-foods-and-beverages, en:beverages, en:plant-based-foods, en:hot-beverages, en:plant-based-beverages, en:herbal-teas, en:teas, en:black-teas, en:flavored-teas, en:flavored-black-teas, en:non-sugared-beverages]",11.0
223,[Pralinen],1.0,"[Plant-based foods and beverages, Plant-based foods, Sugary snacks, Confectioneries, Nuts and their products, Nut confectioneries, de:Pralinen]",7.0,"[en:plant-based-foods-and-beverages, en:plant-based-foods, en:sugary-snacks, en:confectioneries, en:nuts-and-their-products, en:nut-confectioneries, de:pralinen]",7.0
231,[Sodas au cola],1.0,"[Beverages, Carbonated drinks, Sodas, Colas, Sugared beverages]",5.0,"[en:beverages, en:carbonated-drinks, en:sodas, en:colas, en:sugared-beverages]",5.0
232,[Sauces],1.0,"[Groceries, Sauces]",2.0,"[en:groceries, en:sauces]",2.0
238,[en:beverages],1.0,"[Beverages, Sugared beverages]",2.0,"[en:beverages, en:sugared-beverages]",2.0
244,"[Plant-based foods and beverages, Plant-based foods, Fruits and vegetables based foods, Vegetables based foods, Fresh foods, Fresh vegetables, Cauliflower]",7.0,"[Plant-based foods and beverages, Plant-based foods, Fresh foods, Fruits and vegetables based foods, Fresh plant-based foods, Vegetables based foods, Fresh vegetables, Leaf vegetables, Cauliflowers]",9.0,"[en:plant-based-foods-and-beverages, en:plant-based-foods, en:fresh-foods, en:fruits-and-vegetables-based-foods, en:fresh-plant-based-foods, en:vegetables-based-foods, en:fresh-vegetables, en:leaf-vegetables, en:cauliflowers]",9.0
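
One caveat when reading the three counts above: `NaN != NaN` evaluates to `True`, so rows where both columns are NA are counted as disagreements. The third count (252712) equals the NA count that `profile` reports for the `categories` columns further below, which suggests that `categories_en` and `categories_tags` may always have the same length where both are present. A minimal sketch on made-up data, restricting the comparison to rows where both values are not NA:

```python
import numpy as np
import pandas as pd

# Made-up sample, loosely modeled on the rows shown above.
toy = pd.DataFrame({
    'categories': ['Sauces', 'Sodas au cola', 'Snacks,Chips', np.nan],
    'categories_en': ['Groceries,Sauces',
                      'Beverages,Carbonated drinks,Sodas,Colas,Sugared beverages',
                      'Snacks,Chips',
                      np.nan],
})

# Number of comma-separated entries per row (NaN where the value is NA).
n_local = toy['categories'].str.split(',').str.len()
n_en = toy['categories_en'].str.split(',').str.len()

naive = (n_local != n_en).sum()  # also counts the NaN/NaN row
both_notna = n_local.notna() & n_en.notna()
strict = (n_local[both_notna] != n_en[both_notna]).sum()

print(naive, strict)  # 3 2
```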


#### Columns: `packaging`, `packaging_tags`

In [64]:
profile(df[['packaging', 'packaging_tags']])

Unnamed: 0,packaging,packaging_tags
Types,"{'float': 266026, 'str': 89945}","{'float': 266026, 'str': 89945}"
NA Count,266026,266026
NA %,74.7325,74.7325


In [66]:
profile(df[['brands', 'brands_tags']])

Unnamed: 0,brands,brands_tags
Types,"{'str': 326957, 'float': 29014}","{'str': 326937, 'float': 29034}"
NA Count,29014,29034
NA %,8.15066,8.15628


In [67]:
profile(df[['categories', 'categories_tags', 'categories_en']])

Unnamed: 0,categories,categories_tags,categories_en
Types,"{'float': 252712, 'str': 103259}","{'float': 252712, 'str': 103259}","{'float': 252712, 'str': 103259}"
NA Count,252712,252712,252712
NA %,70.9923,70.9923,70.9923


In [68]:
profile(df[['origins', 'origins_tags']])

Unnamed: 0,origins,origins_tags
Types,"{'float': 330955, 'str': 25016}","{'float': 330993, 'str': 24978}"
NA Count,330955,330993
NA %,92.9725,92.9831


In [69]:
profile(df[['manufacturing_places', 'manufacturing_places_tags']])

Unnamed: 0,manufacturing_places,manufacturing_places_tags
Types,"{'float': 313994, 'str': 41977}","{'float': 314001, 'str': 41970}"
NA Count,313994,314001
NA %,88.2077,88.2097


In [70]:
profile(df[['labels', 'labels_tags', 'labels_en']])

Unnamed: 0,labels,labels_tags,labels_en
Types,"{'float': 296880, 'str': 59091}","{'float': 296800, 'str': 59171}","{'float': 296800, 'str': 59171}"
NA Count,296880,296800,296800
NA %,83.4001,83.3776,83.3776


In [71]:
profile(df[['emb_codes', 'emb_codes_tags']])

Unnamed: 0,emb_codes,emb_codes_tags
Types,"{'float': 323486, 'str': 32485}","{'float': 323491, 'str': 32480}"
NA Count,323486,323491
NA %,90.8743,90.8757
