# Tabular Data Analysis 2: Basic operations in pandas

The previous notebook gave an overview of how the pandas library lets you load, manipulate and visualize a dataset.

This notebook introduces the basic operations for data manipulation in pandas, before we focus on data visualization in the next notebook.

## Preliminary: Access and load the data

We need to execute the same code as in the previous notebook to load the pandas library and the Open Food Facts dataset.

In [1]:
# (just execute this cell)

# enable Colab to access files (here shortcuts) on your Drive
# from google.colab import drive

# drive.mount("/content/drive")

In [2]:
# (just execute this cell)

# import pandas
import csv
import pandas as pd

# we need this data type for ordered categoricals
from pandas.api.types import CategoricalDtype

# lift some limitations in column width, so more cell values are displayed in full
pd.set_option("display.max_colwidth", 110)

# dataset and data type of the columns
FOLDER = "../data/processed"  # "drive/MyDrive"
OFF_FILE = f"{FOLDER}/off_products_subset.csv"
DTYPE_FILE = f"{FOLDER}/dtype.txt"


# custom function to load the Open Food Facts subset
def load_off():
    """Load the filtered subset of Open Food Facts.

    Returns
    -------
    df : pd.DataFrame
      (A filtered subset of the) Open Food Facts tabular dataset.
    """
    # load the data types for the columns
    with open(DTYPE_FILE) as f:
        dtype = eval(f.read())

    # load the dataset
    df = pd.read_csv(OFF_FILE, sep="\t", dtype=dtype, quoting=csv.QUOTE_NONE)
    # convert columns with datetimes
    for col_name in (
        "created_datetime",
        "last_modified_datetime",
        "last_updated_datetime",
        "last_image_datetime",
    ):
        # ISO 8601 dates
        df[col_name] = pd.to_datetime(df[col_name])
    #
    return df


# load the dataset (takes around 60 seconds)
df = load_off()

## Selecting subsets

One of the fundamental operations on DataFrames is to be able to filter the dataset on a certain condition, to keep only certain rows or columns.

The basic operators for selection are:
* square brackets `[]`,
* `loc`,
* `iloc`.

You can select rows or columns by their position or label, or with a conditional expression on values, see the [pandas intro tutorial 03](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html).
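
As a minimal sketch of the three operators, here is a toy DataFrame with made-up values (the row labels `p1`, `p2`, `p3` and all values are assumptions for illustration, not taken from Open Food Facts):

```python
import pandas as pd

# toy data, hypothetical values (not taken from Open Food Facts)
toy = pd.DataFrame(
    {
        "product_name": ["muesli", "spread", "tofu"],
        "nutriscore_grade": ["a", "e", "a"],
        "sugars_100g": [16.0, 50.0, 0.5],
    },
    index=["p1", "p2", "p3"],
)

# square brackets with a column label: one column, returned as a Series
sugars = toy["sugars_100g"]

# square brackets with a boolean condition: the rows where the condition holds
grade_a = toy[toy["nutriscore_grade"] == "a"]

# loc: rows and columns selected by label
by_label = toy.loc["p2", "product_name"]

# iloc: rows and columns selected by integer position
by_position = toy.iloc[0, 0]

print(grade_a)
print(by_label, by_position)
```

The exercises below only need square brackets, with either a condition on values or a list of column labels.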

Filter rows in `df` to keep only products with Nutri-Score 'a', and store the result in a variable called `df_nutri_a`.


In [3]:
# (just execute this cell)
df_nutri_a = df[df["nutriscore_grade"] == "a"]
df_nutri_a

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,last_modified_by,last_updated_datetime,product_name,generic_name,quantity,...,vitamin-a_100g,vitamin-d_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,potassium_100g,calcium_100g,iron_100g,fruits-vegetables-nuts-estimate-from-ingredients_100g,nutrition-score-fr_100g
4,0000182006180,http://world-en.openfoodfacts.org/product/0000182006180/knusper-musli-mango-gut-bio,tobiasseidel,2024-01-19 06:22:10+00:00,2024-01-19 06:32:48+00:00,tobiasseidel,2024-02-14 06:14:40+00:00,Knusper-Müsli Mango,,500g,...,,,,,,,,,10.500000,-3
11,0000335685101,http://world-en.openfoodfacts.org/product/0000335685101/tender-broad-beans-sainsbury-s,grimpeur,2024-01-02 15:42:55+00:00,2024-01-02 16:06:34+00:00,roboto-app,2024-02-14 06:02:11+00:00,tender broad beans,,300g,...,,,,,,,,,0.000000,-8
12,0000350034007,http://world-en.openfoodfacts.org/product/0000350034007/italian-tomato-puree-sainsbury-s,smoothie-app,2023-11-03 09:44:34+00:00,2024-03-10 09:02:28+00:00,mandyjacob123,2024-03-10 09:02:28+00:00,Italian tomato puree,,200g,...,,,,,,,,,75.000000,-3
13,0000440001018,http://world-en.openfoodfacts.org/product/0000440001018/leicht-cross-cereola,smoothie-app,2024-01-22 19:28:33+00:00,2024-06-05 23:44:11+00:00,moon-rabbit,2024-06-05 23:44:11+00:00,LEICHT&CROSS,,125g,...,,,,,,,,,0.000000,-1
14,0000448972280,http://world-en.openfoodfacts.org/product/0000448972280/mini-naan-dippers-asia-green-garden,flopin,2023-08-30 17:14:23+00:00,2023-09-01 19:45:29+00:00,teolemon,2024-02-14 04:42:20+00:00,Mini-Naan-Dippers,,180g,...,,,,,,,,,23.121995,-4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
778687,9950014911001,http://world-en.openfoodfacts.org/product/9950014911001/oignons-jaunes-40-60-ferme-de-l-artois,kiliweb,2018-02-04 13:32:50+00:00,2023-04-28 13:35:02+00:00,roboto-app,2024-02-10 21:06:03+00:00,Oignons jaunes 40/60,,2 kg,...,,,,,,,,,0.000000,-6
778700,99644444,http://world-en.openfoodfacts.org/product/99644444/betteraves-rouges-freshona,kiliweb,2021-10-17 16:22:33+00:00,2023-12-31 05:16:53+00:00,quentinbrd,2024-02-12 13:51:28+00:00,Betteraves rouges,,500 g,...,,,,,,,,,,-5
778714,9991111111154,http://world-en.openfoodfacts.org/product/9991111111154/compote-a-boire-pomme-poire-la-ferme-de-coutance,kiliweb,2018-07-13 09:23:55+00:00,2022-02-11 06:29:47+00:00,packbot,2024-02-10 21:06:05+00:00,Compote à Boire Pomme Poire,,,...,,,,,,,,,97.000000,-3
778720,9999941860884,http://world-en.openfoodfacts.org/product/9999941860884/tofu-nature-taifun,kiliweb,2023-01-29 09:29:23+00:00,2023-03-28 11:24:20+00:00,itsjustruby,2024-02-14 01:54:10+00:00,Tofu nature,,200g,...,,,,,,,,,55.000000,-3


You should have 111,371 entries.

Now, filter rows in `df` to keep products whose quantity of sugars per 100g is higher than 20g, and store the result in a variable called `df_sugar_gt20`.

In [4]:
df_sugar_gt20 = df[df["sugars_100g"] > 20]
df_sugar_gt20

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,last_modified_by,last_updated_datetime,product_name,generic_name,quantity,...,vitamin-a_100g,vitamin-d_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,potassium_100g,calcium_100g,iron_100g,fruits-vegetables-nuts-estimate-from-ingredients_100g,nutrition-score-fr_100g
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2023-04-28 23:59:01+00:00,roboto-app,2024-02-09 14:48:49+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,...,,,,,,,,,,23
3,0000159487776,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,usda-ndb-import,2017-03-09 16:01:56+00:00,2020-04-22 20:31:56+00:00,org-database-usda,2024-02-09 14:48:51+00:00,"Milkyway, magic stars chocolates",,,...,,,,,,,,,0.000000,
7,0000241013128,http://world-en.openfoodfacts.org/product/0000241013128/eclairs-intermarche,kiliweb,2018-07-07 11:29:41+00:00,2023-08-22 04:43:41+00:00,quentinbrd,2024-02-09 14:48:52+00:00,Eclairs,,180 g,...,,,,,,,,,,11
17,0000477414034,http://world-en.openfoodfacts.org/product/0000477414034/figs-marks-spencer,foodvisor,2022-07-17 21:12:37+00:00,2023-08-25 13:32:06+00:00,chevalstar,2024-02-13 22:56:17+00:00,Figs,,,...,,,,,,,,,100.000000,
38,0000790310013,http://world-en.openfoodfacts.org/product/0000790310013/sour-fruit-gummies-candy-crush,malikele,2014-01-02 17:03:07+00:00,2022-02-11 10:33:12+00:00,packbot,2024-02-09 14:49:02+00:00,Sour Fruit Gummies,,3.5 oz,...,,,,,,,,,11.363636,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
778690,99515874,http://world-en.openfoodfacts.org/product/99515874/mini-stollen-favorina,kiliweb,2017-12-17 13:58:05+00:00,2023-12-30 14:32:16+00:00,moon-rabbit,2024-02-10 21:06:04+00:00,Mini stollen,,,...,,,,,,,,,56.000000,21
778698,99604028,http://world-en.openfoodfacts.org/product/99604028/maple-syrup-trader-joe-s,kiliweb,2022-06-30 14:34:53+00:00,2023-03-30 07:24:34+00:00,wolfgang8741,2024-02-13 22:39:53+00:00,Maple Syrup,,,...,,,,,,,,,,14
778705,99760069,http://world-en.openfoodfacts.org/product/99760069/bio-ahorn-sirup-maribel,kiliweb,2022-03-28 13:49:56+00:00,2024-01-09 08:54:47+00:00,sebleouf,2024-02-13 20:56:24+00:00,Bio ahorn sirup,,,...,,,,,,,,,,14
778719,9999900002553,http://world-en.openfoodfacts.org/product/9999900002553/chocolat-de-couverture-noir-barry,kiliweb,2018-03-21 20:59:04+00:00,2018-09-16 20:23:38+00:00,moon-rabbit,2024-02-10 21:06:06+00:00,Chocolat de Couverture Noir,,100 g,...,,,,,,,,,,22


You should obtain 156,224 entries.

Filter the dataset `df` to keep only the columns corresponding to the:
* barcode,
* url,
* date of creation,
* product name,
* brands,
* categories,
* ingredients text,
* main category,
* Nutri-Score grade,
* Nutri-Score score,
* Nova group.

And store the result in a variable named `df_sel_cols`.

In [5]:
df_sel_cols = df[
    [
        "code",
        "url",
        "created_datetime",
        "product_name",
        "brands",
        "categories_en",
        "ingredients_text",
        "main_category_en",
        "nutriscore_grade",
        "nutriscore_score",
        "nova_group",
    ]
]
df_sel_cols

Unnamed: 0,code,url,created_datetime,product_name,brands,categories_en,ingredients_text,main_category_en,nutriscore_grade,nutriscore_score,nova_group
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,2018-02-22 10:56:57+00:00,Véritable pâte à tartiner noisettes chocolat noir,Bovetti,"Breakfasts,Spreads,Sweet spreads,fr:Pâtes à tartiner,Hazelnut spreads,Chocolate spreads,Cocoa and hazelnut...",,Cocoa and hazelnuts spreads,e,23,
1,0000131327786,http://world-en.openfoodfacts.org/product/0000131327786/lime-cordial-sainsbury-s,2024-06-01 20:48:44+00:00,Lime Cordial,Sainsbury's,Lime-cordial,"Water, Lime Juice from Concentrate (30%), Acid: Citric Acid; Preservatives: Potassium Sorbate, Sodium Meta...",Lime-cordial,,,4
2,0000155011159,http://world-en.openfoodfacts.org/product/0000155011159/mini-chaussons-a-la-compote-de-pomme-intermarche,2021-12-16 21:00:16+00:00,Mini chaussons à la compote de pomme,Intermarché,"Snacks,Desserts,Sweet snacks,Biscuits and cakes,Sweet pies,Pastries,Pies,Apple pies","pâte 66.7%. farine de BLE beurre (lait) 18,6%. eau sel levure désactivée. Garniture compote de pommes 30...",Apple pies,d,15,4
3,0000159487776,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,2017-03-09 16:01:56+00:00,"Milkyway, magic stars chocolates",Milkyway,"Snacks,Sweet snacks,Cocoa and its products,Confectioneries,Chocolate candies","Sugar, cocoa butter, skimmed milk powder, cocoa mass, whey powder (from milk), lactose, milk fat, emulsifi...",Chocolate candies,,,4
4,0000182006180,http://world-en.openfoodfacts.org/product/0000182006180/knusper-musli-mango-gut-bio,2024-01-19 06:22:10+00:00,Knusper-Müsli Mango,Gut bio,"Plant-based foods and beverages,Plant-based foods,Breakfasts,Cereals and potatoes,Cereals and their produc...","Hafervollkornflocken, Rohrzucker, extrudiertes Getreideerzeugnis (Reismehl, Weizenmehl, Rohrzucker, Gerste...",Mueslis with fruits,a,-3,3
...,...,...,...,...,...,...,...,...,...,...,...
778720,9999941860884,http://world-en.openfoodfacts.org/product/9999941860884/tofu-nature-taifun,2023-01-29 09:29:23+00:00,Tofu nature,Taifun,"Plant-based foods and beverages,Plant-based foods,Legumes and their products,Meat alternatives,Meat analog...","_Sojabohnen_* 55%, Wasser, Gerinnungsmittel: Magnesiumchlorid, Calciumsulfat.",Plain tofu,a,-3,3
778721,9999991042704,http://world-en.openfoodfacts.org/product/9999991042704/yaourt-vanille-patapain,2018-05-09 10:46:24+00:00,Yaourt vanille,Patapain,"Dairies,Fermented foods,Fermented milk products,Desserts,Dairy desserts,Fermented dairy desserts,Yogurts,V...","Lait entier 77%, crème, sucre 7,5%, ferments lactiques, infusion de gousses de vanille - fève tonka poudre...",Vanilla yogurt,c,7,3
778722,9999991149090,http://world-en.openfoodfacts.org/product/9999991149090/riz-parfume-king-elephant,2018-02-20 17:07:29+00:00,Riz parfumé,King Elephant,"Plant-based foods and beverages,Plant-based foods,Cereals and potatoes,Seeds,Cereals and their products,Ce...",,Aromatic rices,b,0,
778723,9999994666013,http://world-en.openfoodfacts.org/product/9999994666013/skimmed-milk-tesco,2024-06-10 07:11:28+00:00,Skimmed Milk,Tesco,"Beverages and beverages preparations,Beverages,Dairies,Dairy drinks",milk,Dairy drinks,a,-1,1


You should get a DataFrame of 778,725 rows (same as in the loaded dataset) and 11 columns (the ones we selected).

## Making a selection into a proper DataFrame

You can manipulate each of these selections as a DataFrame, but behind the scenes, they may be *views* of the original DataFrame `df` rather than independent copies.

The *view* mechanism avoids unnecessary copies of the dataset, but it is problematic when we really want to extract a subset and perform some operations only on this subset.
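
One common way to obtain an independent subset is to call `.copy()` on the selection. The cells of this notebook do not use it, so take the following as a minimal sketch on toy data (hypothetical values) rather than as the method followed here:

```python
import pandas as pd

# toy data, hypothetical values
toy = pd.DataFrame(
    {"sugars_100g": [5.0, 30.0, 45.0], "fat_100g": [1.0, 10.0, 20.0]}
)

# .copy() makes the selection an independent DataFrame
sweet = toy[toy["sugars_100g"] > 20].copy()

# adding a column to the copy leaves the original DataFrame untouched
sweet["sugarsfat_ratio"] = sweet["sugars_100g"] / sweet["fat_100g"]

print(sweet)
print(toy.columns.tolist())
```

The approach used in the rest of this section, `assign`, reaches the same goal differently: it returns a new DataFrame instead of modifying an existing one.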

For instance, let us select all products in `df` with sugars and fat per 100g greater than 0, and add a column with the sugars to fat ratio.

First, we need to define two filtering conditions and apply them jointly using the [boolean "and" `&`](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing). Store the result in a variable named `df_sugarsfat`.

In [6]:
df_sugarsfat = df[(df["sugars_100g"] > 0) & (df["fat_100g"] > 0)]
df_sugarsfat

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,last_modified_by,last_updated_datetime,product_name,generic_name,quantity,...,vitamin-a_100g,vitamin-d_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,potassium_100g,calcium_100g,iron_100g,fruits-vegetables-nuts-estimate-from-ingredients_100g,nutrition-score-fr_100g
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2023-04-28 23:59:01+00:00,roboto-app,2024-02-09 14:48:49+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,...,,,,,,,,,,23
2,0000155011159,http://world-en.openfoodfacts.org/product/0000155011159/mini-chaussons-a-la-compote-de-pomme-intermarche,kiliweb,2021-12-16 21:00:16+00:00,2021-12-17 08:10:31+00:00,ecoscore-impact-estimator,2024-02-12 13:34:36+00:00,Mini chaussons à la compote de pomme,,250 g,...,,,,,,,,,7.350000,15
3,0000159487776,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,usda-ndb-import,2017-03-09 16:01:56+00:00,2020-04-22 20:31:56+00:00,org-database-usda,2024-02-09 14:48:51+00:00,"Milkyway, magic stars chocolates",,,...,,,,,,,,,0.000000,
4,0000182006180,http://world-en.openfoodfacts.org/product/0000182006180/knusper-musli-mango-gut-bio,tobiasseidel,2024-01-19 06:22:10+00:00,2024-01-19 06:32:48+00:00,tobiasseidel,2024-02-14 06:14:40+00:00,Knusper-Müsli Mango,,500g,...,,,,,,,,,10.500000,-3
6,0000209773750,http://world-en.openfoodfacts.org/product/0000209773750/tortitas-de-trigo-roti-wraps-lidl,kiliweb,2021-06-12 08:56:14+00:00,2023-12-04 19:29:11+00:00,mariacastiel,2024-02-13 15:27:29+00:00,Tortitas de trigo- Roti wraps,,,...,,,,,,,,,,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
778719,9999900002553,http://world-en.openfoodfacts.org/product/9999900002553/chocolat-de-couverture-noir-barry,kiliweb,2018-03-21 20:59:04+00:00,2018-09-16 20:23:38+00:00,moon-rabbit,2024-02-10 21:06:06+00:00,Chocolat de Couverture Noir,,100 g,...,,,,,,,,,,22
778720,9999941860884,http://world-en.openfoodfacts.org/product/9999941860884/tofu-nature-taifun,kiliweb,2023-01-29 09:29:23+00:00,2023-03-28 11:24:20+00:00,itsjustruby,2024-02-14 01:54:10+00:00,Tofu nature,,200g,...,,,,,,,,,55.000000,-3
778721,9999991042704,http://world-en.openfoodfacts.org/product/9999991042704/yaourt-vanille-patapain,kiliweb,2018-05-09 10:46:24+00:00,2024-06-16 21:10:13+00:00,geodata,2024-06-16 21:10:13+00:00,Yaourt vanille,,120 g,...,,,,,,,,,0.000000,7
778723,9999994666013,http://world-en.openfoodfacts.org/product/9999994666013/skimmed-milk-tesco,jrg2024,2024-06-10 07:11:28+00:00,2024-06-10 07:24:04+00:00,roboto-app,2024-06-10 07:24:04+00:00,Skimmed Milk,,,...,,,,,,,,,0.000000,-1


The selected subset contains 560,323 rows (and 76 columns, same as `df`).

Now, let us assign to `df_sugarsfat` a new column named `"sugarsfat_ratio"` with the sugars to fat ratio.

This is done with the [assign](https://pandas.pydata.org/pandas-docs/version/2.0/reference/api/pandas.DataFrame.assign.html#pandas-dataframe-assign) method, which expects *keyword arguments* (`key=value`): the key is the name of the assigned column, and the value is a Series (or an expression that evaluates to a Series):

In [7]:
# (execute this cell)
df_sugarsfat = df_sugarsfat.assign(
    sugarsfat_ratio=df_sugarsfat["sugars_100g"] / df_sugarsfat["fat_100g"]
)
df_sugarsfat

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,last_modified_by,last_updated_datetime,product_name,generic_name,quantity,...,vitamin-d_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,potassium_100g,calcium_100g,iron_100g,fruits-vegetables-nuts-estimate-from-ingredients_100g,nutrition-score-fr_100g,sugarsfat_ratio
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2023-04-28 23:59:01+00:00,roboto-app,2024-02-09 14:48:49+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,...,,,,,,,,,23,0.666667
2,0000155011159,http://world-en.openfoodfacts.org/product/0000155011159/mini-chaussons-a-la-compote-de-pomme-intermarche,kiliweb,2021-12-16 21:00:16+00:00,2021-12-17 08:10:31+00:00,ecoscore-impact-estimator,2024-02-12 13:34:36+00:00,Mini chaussons à la compote de pomme,,250 g,...,,,,,,,,7.350000,15,1.153359
3,0000159487776,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,usda-ndb-import,2017-03-09 16:01:56+00:00,2020-04-22 20:31:56+00:00,org-database-usda,2024-02-09 14:48:51+00:00,"Milkyway, magic stars chocolates",,,...,,,,,,,,0.000000,,1.544669
4,0000182006180,http://world-en.openfoodfacts.org/product/0000182006180/knusper-musli-mango-gut-bio,tobiasseidel,2024-01-19 06:22:10+00:00,2024-01-19 06:32:48+00:00,tobiasseidel,2024-02-14 06:14:40+00:00,Knusper-Müsli Mango,,500g,...,,,,,,,,10.500000,-3,1.454545
6,0000209773750,http://world-en.openfoodfacts.org/product/0000209773750/tortitas-de-trigo-roti-wraps-lidl,kiliweb,2021-06-12 08:56:14+00:00,2023-12-04 19:29:11+00:00,mariacastiel,2024-02-13 15:27:29+00:00,Tortitas de trigo- Roti wraps,,,...,,,,,,,,,5,0.153846
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
778719,9999900002553,http://world-en.openfoodfacts.org/product/9999900002553/chocolat-de-couverture-noir-barry,kiliweb,2018-03-21 20:59:04+00:00,2018-09-16 20:23:38+00:00,moon-rabbit,2024-02-10 21:06:06+00:00,Chocolat de Couverture Noir,,100 g,...,,,,,,,,,22,0.573561
778720,9999941860884,http://world-en.openfoodfacts.org/product/9999941860884/tofu-nature-taifun,kiliweb,2023-01-29 09:29:23+00:00,2023-03-28 11:24:20+00:00,itsjustruby,2024-02-14 01:54:10+00:00,Tofu nature,,200g,...,,,,,,,,55.000000,-3,0.074627
778721,9999991042704,http://world-en.openfoodfacts.org/product/9999991042704/yaourt-vanille-patapain,kiliweb,2018-05-09 10:46:24+00:00,2024-06-16 21:10:13+00:00,geodata,2024-06-16 21:10:13+00:00,Yaourt vanille,,120 g,...,,,,,,,,0.000000,7,1.447368
778723,9999994666013,http://world-en.openfoodfacts.org/product/9999994666013/skimmed-milk-tesco,jrg2024,2024-06-10 07:11:28+00:00,2024-06-10 07:24:04+00:00,roboto-app,2024-06-10 07:24:04+00:00,Skimmed Milk,,,...,,,,,,,,0.000000,-1,50.000000


`df_sugarsfat` now contains one more column, `sugarsfat_ratio`, displayed last in the table above.

Display the new column to get a first impression of its content.

In [8]:
df_sugarsfat["sugarsfat_ratio"]

0          0.666667
2          1.153359
3          1.544669
4          1.454545
6          0.153846
            ...    
778719     0.573561
778720     0.074627
778721     1.447368
778723    50.000000
778724     3.157895
Name: sugarsfat_ratio, Length: 560323, dtype: float64

Several columns can be assigned simultaneously (i.e., within a single call to `assign()`), using one *keyword argument* per new column.

Assign, in one call, two new columns:
- `satfat_ratio`: the ratio between saturated fat and fat,
- `satsugars_ratio`: the ratio between saturated fat and sugars.

In [9]:
df_sugarsfat = df_sugarsfat.assign(
    satfat_ratio=df_sugarsfat["saturated-fat_100g"] / df_sugarsfat["fat_100g"],
    satsugars_ratio=df_sugarsfat["saturated-fat_100g"] / df_sugarsfat["sugars_100g"],
)

Display the new `satfat_ratio` column:

In [10]:
df_sugarsfat["satfat_ratio"]

0         0.208333
2         0.526616
3              NaN
4         0.127273
6         0.153846
            ...   
778719    0.601279
778720    0.179104
778721    0.671053
778723    1.000000
778724    0.631579
Name: satfat_ratio, Length: 560323, dtype: float64

Display the new `satsugars_ratio` column:

In [11]:
df_sugarsfat["satsugars_ratio"]

0         0.312500
2         0.456593
3              NaN
4         0.087500
6         1.000000
            ...   
778719    1.048327
778720    2.400000
778721    0.463636
778723    0.020000
778724    0.200000
Name: satsugars_ratio, Length: 560323, dtype: float64
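
Some ratios above are `NaN`. The most likely reason is a missing value in `saturated-fat_100g`, because any arithmetic operation involving a missing value yields a missing value. A minimal sketch on toy data (hypothetical values) shows this, and also that `assign` returns a new DataFrame without changing the one it is called on:

```python
import pandas as pd

# toy data, hypothetical values; the second product has no saturated fat value
toy = pd.DataFrame(
    {
        "fat_100g": [10.0, 8.0, 4.0],
        "saturated-fat_100g": [5.0, None, 1.0],
        "sugars_100g": [20.0, 16.0, 2.0],
    }
)

# two new columns in a single call, one keyword argument per column
with_ratios = toy.assign(
    satfat_ratio=toy["saturated-fat_100g"] / toy["fat_100g"],
    satsugars_ratio=toy["saturated-fat_100g"] / toy["sugars_100g"],
)

print(with_ratios)
# the original DataFrame still has its three columns
print(toy.columns.tolist())
```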

## Renaming columns

Column names are not always ideal, either because they are not transparent (it is hard for you or an external user to understand what they stand for) or because they would look bad if they were used directly to label the axes of a data visualization.

pandas provides means to rename columns, see the [pandas intro tutorial 05](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html).

Let us rename each of the columns whose name ends with `_en`.

First, we need to list such columns.

In [12]:
# (just execute this cell)
# list the column names that end with _en
cols_en = [x for x in df.columns if x.endswith("_en")]
cols_en

['packaging_en',
 'categories_en',
 'origins_en',
 'labels_en',
 'countries_en',
 'traces_en',
 'additives_en',
 'food_groups_en',
 'states_en',
 'main_category_en']

Now we can `rename` each of the columns ending with `_en`, so as to drop this suffix.
For instance, `main_category_en` should be renamed `main_category`.
Store the result in a variable named `df_ren_en`.

In [13]:
df_ren_en = df.rename(
    columns={
        "packaging_en": "packaging",
        "categories_en": "categories",
        "origins_en": "origins",
        "labels_en": "labels",
        "countries_en": "countries",
        "traces_en": "traces",
        "additives_en": "additives",
        "food_groups_en": "food_groups",
        "states_en": "states",
        "main_category_en": "main_category",
    }
)

To see if it worked, let us display the column names in `df_ren_en` and check that our `_en` columns, such as `main_category_en`, have been renamed as expected.

In [14]:
df_ren_en.columns

Index(['code', 'url', 'creator', 'created_datetime', 'last_modified_datetime',
       'last_modified_by', 'last_updated_datetime', 'product_name',
       'generic_name', 'quantity', 'packaging', 'packaging_text', 'brands',
       'categories', 'origins', 'manufacturing_places', 'labels', 'emb_codes',
       'purchase_places', 'stores', 'countries', 'ingredients_text',
       'ingredients_tags', 'ingredients_analysis_tags', 'allergens', 'traces',
       'serving_size', 'serving_quantity', 'no_nutrition_data', 'additives_n',
       'additives', 'nutriscore_score', 'nutriscore_grade', 'nova_group',
       'pnns_groups_1', 'pnns_groups_2', 'food_groups', 'states',
       'brand_owner', 'ecoscore_score', 'ecoscore_grade',
       'nutrient_levels_tags', 'product_quantity', 'data_quality_errors_tags',
       'unique_scans_n', 'popularity_tags', 'completeness',
       'last_image_datetime', 'main_category', 'image_small_url',
       'energy-kj_100g', 'energy-kcal_100g', 'energy_100g', 'fat_100

### (Advanced) Substituting in a more programmatic way

If you already learned and practiced Python before this course, there is a more concise and generic way to rename all columns whose name ends with "_en" by dropping the ending, without having to enumerate them:

In [15]:
# alternative, shorter but more advanced
df_ren_en_bis = df.rename(columns={x: x[:-3] for x in df.columns if x.endswith("_en")})
df_ren_en_bis.columns

Index(['code', 'url', 'creator', 'created_datetime', 'last_modified_datetime',
       'last_modified_by', 'last_updated_datetime', 'product_name',
       'generic_name', 'quantity', 'packaging', 'packaging_text', 'brands',
       'categories', 'origins', 'manufacturing_places', 'labels', 'emb_codes',
       'purchase_places', 'stores', 'countries', 'ingredients_text',
       'ingredients_tags', 'ingredients_analysis_tags', 'allergens', 'traces',
       'serving_size', 'serving_quantity', 'no_nutrition_data', 'additives_n',
       'additives', 'nutriscore_score', 'nutriscore_grade', 'nova_group',
       'pnns_groups_1', 'pnns_groups_2', 'food_groups', 'states',
       'brand_owner', 'ecoscore_score', 'ecoscore_grade',
       'nutrient_levels_tags', 'product_quantity', 'data_quality_errors_tags',
       'unique_scans_n', 'popularity_tags', 'completeness',
       'last_image_datetime', 'main_category', 'image_small_url',
       'energy-kj_100g', 'energy-kcal_100g', 'energy_100g', 'fat_100

This uses the advanced syntax of *dict comprehensions*, similar to [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).
Comprehensions are pervasive in advanced Python code.
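
To make the syntax less opaque, here is a sketch that builds the same kind of mapping twice, first with an explicit loop and then with a dict comprehension. The list of column names is a small hypothetical subset chosen for illustration:

```python
# hypothetical subset of column names, for illustration
columns = ["code", "categories_en", "brands", "main_category_en"]

# explicit loop: add one entry per column name ending with "_en"
mapping_loop = {}
for name in columns:
    if name.endswith("_en"):
        # name[:-3] drops the last three characters, i.e. the "_en" suffix
        mapping_loop[name] = name[:-3]

# dict comprehension: the same logic in a single expression
mapping = {name: name[:-3] for name in columns if name.endswith("_en")}

print(mapping)
```

Both variables hold the same dictionary, which can be passed as the `columns` argument of `rename`.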

## Summary statistics

You can compute various summary statistics that depend on the type of variable in each column, see the [pandas intro tutorial 06](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html).
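
As a minimal sketch on toy data (hypothetical values), a numeric column is typically summarized with statistics such as the mean or the median, whereas a categorical column is summarized by counting how often each value occurs:

```python
import pandas as pd

# toy data, hypothetical values
toy = pd.DataFrame(
    {
        "nutriscore_grade": ["a", "b", "a", "e"],
        "sugars_100g": [2.0, 8.0, 4.0, 50.0],
    }
)

# numeric column: central tendency
sugar_mean = toy["sugars_100g"].mean()
sugar_median = toy["sugars_100g"].median()

# categorical column: frequency of each value
grade_counts = toy["nutriscore_grade"].value_counts()

print(sugar_mean, sugar_median)
print(grade_counts)

# describe() bundles several statistics; by default it covers numeric columns
print(toy.describe())
```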

Compute summary statistics for several columns of different types, and for combinations of columns that could provide interesting insights.

For instance, compute the [means](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html#pandas-dataframe-mean) of the nutritional values in `df` for:
* fat,
* saturated fat,
* sugars,
* salt.

In [16]:
df[["fat_100g", "saturated-fat_100g", "sugars_100g", "salt_100g"]].mean()

fat_100g              2229.800848
saturated-fat_100g     850.173911
sugars_100g             12.508065
salt_100g                1.490976
dtype: float64

The means for "fat_100g" and "saturated-fat_100g" are outside of the expected range: How could there be more than 100g of fats per 100g of product?

Select all products that contain more than 100g of fat per 100g of product, and display only their barcode, URL and "fat_100g".

In [17]:
df[df["fat_100g"] > 100][["code", "url", "fat_100g"]]

Unnamed: 0,code,url,fat_100g
1308,0011110136619,http://world-en.openfoodfacts.org/product/0011110136619/egg-bites-simple-truth,1600.000000
6495,0021000604913,http://world-en.openfoodfacts.org/product/0021000604913/kraft-singles-72-slices,1605.263200
9667,0027400265051,http://world-en.openfoodfacts.org/product/0027400265051/country-crock-churn-style-40-vegetable-oil-spread-...,306.428600
10461,0028400688918,http://world-en.openfoodfacts.org/product/0028400688918/ruffles-flamin-hot-cheddar-sour-cream,128.000000
16372,0040600436755,http://world-en.openfoodfacts.org/product/0040600436755/tiramisu-sainsbury,106.000000
...,...,...,...
760597,8906035030055,http://world-en.openfoodfacts.org/product/8906035030055/refined-sunflower-oil-freedom,1000.000000
760754,8906112661332,http://world-en.openfoodfacts.org/product/8906112661332/chia-seeds-true-elements,126.000000
769486,9300633627942,http://world-en.openfoodfacts.org/product/9300633627942/white-rice-cups-woolworths,150.000000
775479,9354628000333,http://world-en.openfoodfacts.org/product/9354628000333/fatigue-eliminator-happy-mammoth,233.999996


A product in this list is <https://world.openfoodfacts.org/product/0040600436755/tiramisu-sainsbury>.
It has `fat_100g`=`106.0`.

On the product page, look for the picture that contains the nutritional values.

**Question.** What is the most likely explanation for this unexpected value?

Another product is <http://world-en.openfoodfacts.org/product/0011110136619/egg-bites-simple-truth>.
It has `fat_100g`=`1600.0`.

On the product page, look for the picture that contains the nutritional values.

**Question.** What is the most likely explanation for this unexpected value?

Extreme values can be input by mistake, or come from faulty sensors or computations.
They constitute outliers that heavily influence summary statistics, which in turn means that summary statistics are informative indicators of the presence of outliers.
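
A minimal sketch with made-up values illustrates how a single erroneous entry distorts the mean, while the median barely moves:

```python
import pandas as pd

# hypothetical fat values in g per 100 g; the last one mimics a data-entry error
fat = pd.Series([10.0, 12.0, 14.0, 16.0, 1600.0])

fat_mean = fat.mean()      # pulled far above every plausible value
fat_median = fat.median()  # still in the middle of the plausible values
fat_max = fat.max()        # an impossible value stands out immediately

print(fat_mean, fat_median, fat_max)
```

Comparing the mean with the median, or looking at the minimum and maximum, is therefore a cheap first check for outliers.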

Let us drop all products for which `fat_100g`, `saturated-fat_100g`, `sugars_100g` or `salt_100g` are higher than `100`: 

In [18]:
# (just execute this cell)
df = df.drop(
    index=df[
        (df["fat_100g"] > 100)
        | (df["saturated-fat_100g"] > 100)
        | (df["sugars_100g"] > 100)
        | (df["salt_100g"] > 100)
    ].index
)
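
The cell above selects the index labels of the offending rows and passes them to `drop`. An alternative formulation, sketched here on toy data (hypothetical values) and assuming unique index labels, negates the condition with `~` and keeps the remaining rows:

```python
import pandas as pd

# toy data, hypothetical values; None stands for a missing value
toy = pd.DataFrame(
    {
        "fat_100g": [10.0, 1600.0, None, 30.0],
        "salt_100g": [1.0, 0.5, 2.0, 120.0],
    }
)

# True for rows where at least one column exceeds 100
# (a comparison with a missing value evaluates to False)
too_high = (toy["fat_100g"] > 100) | (toy["salt_100g"] > 100)

# variant 1: drop the offending rows by index label
dropped = toy.drop(index=toy[too_high].index)

# variant 2: keep the rows where the condition does not hold
kept = toy[~too_high]

print(kept)
```

In both variants, rows with a missing value are kept unless another column is out of range.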

Now that we dropped from `df` all products that had an obvious outlier in any of these 4 columns, let us re-compute their summary statistics.

In [19]:
# (just execute this cell)
df[["fat_100g", "saturated-fat_100g", "sugars_100g", "salt_100g"]].mean()

fat_100g              13.851002
saturated-fat_100g     5.113861
sugars_100g           12.486285
salt_100g              1.167740
dtype: float64

Now the mean values seem much more likely:
- `fat_100g`: approx. 14 g,
- `saturated-fat_100g`: approx. 5 g,
- `sugars_100g`: approx. 12 g,
- `salt_100g`: approx. 1.17 g.

All datasets, even those from big companies or public institutions, can contain erroneous data.

Working on a dataset is an iterative process, in which it is recommended to:
- Devise assumptions about the nature and range of values in columns, and implement these assumptions as assertions or filters,
- Compute summary statistics,
- Generate various plots (data visualizations) that support a visual detection of outliers or unexpected trends,
- Implement reproducible procedures to remove outliers,
- (Repeat)
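
The first step of this list can be sketched as a small range check. The helper `out_of_range` below is hypothetical (it is not part of pandas or of this course), and the bounds 0 and 100 are the assumption that a quantity per 100 g cannot exceed 100 g:

```python
import pandas as pd


def out_of_range(df, columns, low=0, high=100):
    """Count, per column, the values outside [low, high].

    Missing values are not counted, because comparisons with them are False.
    """
    return {
        col: int(((df[col] < low) | (df[col] > high)).sum()) for col in columns
    }


# toy data, hypothetical values
toy = pd.DataFrame(
    {"fat_100g": [10.0, 1600.0, None], "sugars_100g": [5.0, 20.0, 99.0]}
)

report = out_of_range(toy, ["fat_100g", "sugars_100g"])
print(report)

# an assumption turned into an assertion: it fails loudly if the data violates it
assert report["sugars_100g"] == 0
```

Running such a check after each cleaning step makes the procedure reproducible and documents the assumptions made about the data.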

## Computing on columns

You can manipulate columns in various ways, including with operations that apply element-wise as we saw for NumPy arrays in the first notebook.

Subtract the mean value of the column "sugars_100g" from each value in that column.

>**HINT** You only need `mean()` and the subtraction operator (`-`).

In [20]:
df["sugars_100g"] - df["sugars_100g"].mean()

0         19.513715
1               NaN
2          5.713715
3         41.113715
4          3.513715
            ...    
778720   -11.986285
778721    -1.486285
778722   -12.486285
778723    -7.486285
778724    11.513715
Name: sugars_100g, Length: 778304, dtype: float64
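
The `NaN` in the output above comes from a product with no value for sugars. A minimal sketch on a toy Series (hypothetical values) shows the two behaviours at play: `mean()` skips missing values by default, and the element-wise subtraction propagates them:

```python
import pandas as pd

# toy data, hypothetical values; None stands for a missing value
sugars = pd.Series([10.0, None, 20.0])

# mean() ignores the missing value: (10 + 20) / 2
sugars_mean = sugars.mean()

# the scalar mean is subtracted from every element; the missing value stays missing
centered = sugars - sugars_mean

print(sugars_mean)
print(centered)
```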

## Sorting data

The entries are sorted by barcode.
We might find it easier to understand the dataset if we sort entries by another criterion.

Sort entries by brand, following the [pandas intro tutorial 07](https://pandas.pydata.org/docs/getting_started/intro_tutorials/07_reshape_table_layout.html), and store the result in a variable named `df_sort_brands`.

In [21]:
df_sort_brands = df.sort_values(by="brands")
df_sort_brands

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,last_modified_by,last_updated_datetime,product_name,generic_name,quantity,...,vitamin-a_100g,vitamin-d_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,potassium_100g,calcium_100g,iron_100g,fruits-vegetables-nuts-estimate-from-ingredients_100g,nutrition-score-fr_100g
402390,4104420054189,http://world-en.openfoodfacts.org/product/4104420054189/feuilles-gaufrettes-epeautre-alnatura,openfoodfacts-contributors,2017-07-08 16:46:48+00:00,2024-02-05 14:23:52+00:00,el-ka-91,2024-02-10 13:48:13+00:00,Feuilles gaufrettes épeautre,,125 g,...,,,,,,,,,,16
3435,0013562302239,http://world-en.openfoodfacts.org/product/0013562302239/annie-s-whole-wheat-bunnies-baked-snack-crackers-m...,bori,2015-07-02 01:42:07+00:00,2023-01-31 21:10:56+00:00,wolfgang8741,2024-02-09 15:34:53+00:00,"Annie's Whole Wheat Bunnies Baked Snack Crackers, Made with Organic Wheat",,7 servings,...,0.0,,0.0000,,,,0.000,0.00120,0.000000,10
83866,0856463002002,http://world-en.openfoodfacts.org/product/0856463002002/core-meal-hearty-oatmeal-to-go-almond-raisin,bori,2015-07-02 02:56:19+00:00,2024-02-21 16:05:12+00:00,5m4u9,2024-02-21 16:05:12+00:00,"Core, meal, hearty oatmeal to go, almond raisin",,1 serving,...,0.0,,0.0000,,,,0.118,0.00212,32.142857,-2
22873,0043182000703,http://world-en.openfoodfacts.org/product/0043182000703/organic-mashed-potatoes-edward-and-sons,bori,2015-07-07 02:49:56+00:00,2024-02-18 15:05:06+00:00,5m4u9,2024-02-18 15:05:06+00:00,Organic Mashed Potatoes,,4,...,0.0,,0.0720,,,,0.000,0.00000,0.000000,6
52293,0099482443436,http://world-en.openfoodfacts.org/product/0099482443436/engine-2-plant-strong-rip-s-big-bowl-triple-berry-...,bori,2015-07-02 03:07:41+00:00,2023-01-23 16:47:55+00:00,wolfgang8741,2024-02-09 20:25:12+00:00,"Engine 2, plant-strong, rip's big bowl triple berry walnut",,7,...,0.0,,0.0109,,,,0.036,0.00327,5.468750,-5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
753934,8805713304023,http://world-en.openfoodfacts.org/product/8805713304023/%ED%95%9C%EC%82%B4%EB%A6%BC-%ED%98%B8%EB%B0%95%EC%...,openfoodfacts-contributors,2020-05-21 02:10:46+00:00,2024-06-04 22:51:07+00:00,5m4u9,2024-06-04 22:51:07+00:00,한살림 호박쌀엿,,100g,...,,,,,,,,,,10
432110,4570018723070,http://world-en.openfoodfacts.org/product/4570018723070/%E5%9B%BD%E7%94%A3%E5%A4%A7%E8%B1%86100%EF%BC%85%E...,openfoodfacts-contributors,2021-05-08 10:40:58+00:00,2023-04-20 01:19:27+00:00,naruyoko,2024-02-13 14:44:04+00:00,国産大豆100％使用おからクッキー いちごプレーン,,8枚（48 g）,...,,,,,,,,,66.666667,
753319,87330991,http://world-en.openfoodfacts.org/product/87330991/sirop-de-framboise-%F0%9D%90%91%F0%9D%90%9A%F0%9D%90%9A...,openfoodfacts-contributors,2021-12-16 11:06:30+00:00,2023-07-26 15:43:51+00:00,kiliweb,2024-02-13 18:52:15+00:00,sirop de framboise,,0.75 𝐥𝐢𝐭𝐞𝐫,...,,,,,,,,,,-5
712474,8437014583076,http://world-en.openfoodfacts.org/product/8437014583076/tarrito-de-calabaza-y-calabacin-ecologico-sin-%F0%...,kiliweb,2018-03-01 21:32:54+00:00,2023-08-29 06:47:30+00:00,roboto-app,2024-02-10 19:48:29+00:00,Tarrito de calabaza y calabacín ecológico sin,,,...,,,,,,,,,60.967500,-1


Let us look at the brands for the first entries, sorted by brands.

In [22]:
df_sort_brands["brands"].head(20)

402390                           Alnatura
3435                              Annie's
83866              Core Meal, Core Method
22873      Edward and Sons, Edward & Sons
52293                            Engine 2
127598                     Les Crudettes 
69766                         Lotus Foods
490204                       St Feuillien
126059              le verger des fruits 
367719                              !NARA
61955                                  #5
409887                       #männerglück
411758                             #sinob
411759                             #sinob
411773                             #sinob
411771                             #sinob
411761                             #sinob
411770                             #sinob
411767                             #sinob
411766                             #sinob
Name: brands, dtype: string

Oddly, only the first few entries have brand names that start with a letter; the following brand names start with a special character (`!` or `#`).
This is unexpected, because special characters sort before letters and should therefore appear first.

What happened here? Let us have a closer look at the *values* in the `brands` column of our sorted dataframe `df_sort_brands`.

In [23]:
# (just execute this cell)
df_sort_brands["brands"].head(20).values

<StringArray>
[                      ' Alnatura',                        " Annie's",
         ' Core Meal, Core Method', ' Edward and Sons, Edward & Sons',
                       ' Engine 2',                 ' Les Crudettes ',
                    ' Lotus Foods',                   ' St Feuillien',
          ' le verger des fruits ',                           '!NARA',
                              '#5',                    '#männerglück',
                          '#sinob',                          '#sinob',
                          '#sinob',                          '#sinob',
                          '#sinob',                          '#sinob',
                          '#sinob',                          '#sinob']
Length: 20, dtype: string

In the first entries, the `brands` value starts with a space.
This explains why they were sorted before the entries whose `brands` start with a special character: the space character sorts before `!` and `#`.

Brand names rarely (if ever) start with a space, hence we can assume that whoever added these products made a typing error.
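
One possible way to neutralize such typing errors before sorting is to strip the surrounding whitespace with `str.strip()`. The sketch below runs on a small hypothetical DataFrame named `toy` (not on the Open Food Facts data), and the names `n_leading` and `sorted_clean` are made up for the illustration.

```python
import pandas as pd

# hypothetical toy data: one brand starts with a space, as in the dataset
toy = pd.DataFrame(
    {"brands": [" Alnatura", "#sinob", "!NARA", "Acqua Panna"]},
    dtype="string",
)

# count the entries whose brand starts with a space
n_leading = int(toy["brands"].str.startswith(" ").sum())

# strip leading and trailing whitespace, then sort on the cleaned values
sorted_clean = toy.assign(brands=toy["brands"].str.strip()).sort_values(by="brands")
print(n_leading)
print(sorted_clean)
```

After stripping, the special characters come first and `Alnatura` is sorted among the other brands that start with `A`.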

⚠ Datasets contain all sorts of errors and oddities. Datasets released by public agencies or big actors are usually cleaner than crowdsourced datasets, but you should always be cautious.

To confirm our hypothesis and check whether the entries are properly sorted, we can use `iloc` to retrieve entries at arbitrary positions in the DataFrame.

For instance, let us check the entries at positions 5851 to 5869 (the end of the slice, 5870, is excluded).

In [24]:
# (just execute this cell)
df_sort_brands["brands"].iloc[5851:5870]

652530                          Acqua Dolomia
652527                          Acqua Dolomia
637699           Acqua Minerale San Benedetto
583895    Acqua Minerale San Benedetto S.p.A.
583849                            Acqua Panna
583851                            Acqua Panna
583848                            Acqua Panna
20935                             Acqua Panna
595292    Acqua Panna, San Pellegrino, Nestlé
595293    Acqua Panna, San Pellegrino, Nestlé
595280    Acqua Panna, San Pellegrino, Nestlé
595291    Acqua Panna, San Pellegrino, Nestlé
579167                            Acquafarina
579188                            Acquafarina
579162                            Acquafarina
579155                  Acquafarina, Alimenta
579169                  Acquafarina, Alimenta
579159                  Acquafarina, Alimenta
579157                  Acquafarina, Alimenta
Name: brands, dtype: string

The sorted brands are `Acqua Dolomia`, `Acqua Minerale San Benedetto`, `Acqua Minerale San Benedetto S.p.A.`, then `Acqua Panna`, and so on. This ordering, and the number of entries per brand, are what we would expect in a dataset of food products of this size.
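
Note that `iloc` selects by position in the current row order, whereas `loc` selects by index label, and that sorting changes the positions but keeps the labels. A minimal sketch on a hypothetical four-row DataFrame (the names `toy`, `by_position` and `by_label` are illustrative only):

```python
import pandas as pd

# hypothetical toy data with explicit index labels
toy = pd.DataFrame({"brands": ["b", "a", "d", "c"]}, index=[10, 20, 30, 40])
toy_sorted = toy.sort_values(by="brands")

# iloc is positional: rows at positions 1 and 2 of the sorted frame (3 is excluded)
by_position = toy_sorted["brands"].iloc[1:3]

# loc uses the index label, which is unchanged by sorting
by_label = toy_sorted["brands"].loc[10]
print(by_position)
print(by_label)
```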

Sort entries by the Nutri-Score grade, and store the result in a variable named `df_sort_nsgrade`.

In [25]:
df_sort_nsgrade = df.sort_values(by="nutriscore_grade")
df_sort_nsgrade

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,last_modified_by,last_updated_datetime,product_name,generic_name,quantity,...,vitamin-a_100g,vitamin-d_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,potassium_100g,calcium_100g,iron_100g,fruits-vegetables-nuts-estimate-from-ingredients_100g,nutrition-score-fr_100g
536266,7500463823640,http://world-en.openfoodfacts.org/product/7500463823640/pals-6-pack-cafe-pals-snacks-barra-sabor-cafe,org-pals-snacks,2022-03-25 20:42:30+00:00,2022-03-25 20:42:32+00:00,org-pals-snacks,2024-02-13 20:53:29+00:00,Pals 6 pack cafe,Snack en barra,50 g,...,,,,,,,,,85.0,-10
522672,6408432203398,http://world-en.openfoodfacts.org/product/6408432203398/profeel-protein-mousse-valio,kiliweb,2022-04-08 16:53:45+00:00,2023-08-22 16:22:21+00:00,moon-rabbit,2024-02-13 21:10:44+00:00,Profeel protein mousse,,150g,...,,,,,,,,,0.0,-2
522673,6408432203657,http://world-en.openfoodfacts.org/product/6408432203657/mousse-chocolate-valio,kiliweb,2022-05-18 15:32:48+00:00,2023-05-10 18:54:24+00:00,moon-rabbit,2024-02-13 21:55:30+00:00,Mousse chocolate,,,...,,,,,,,,,,-1
691528,8424680100300,http://world-en.openfoodfacts.org/product/8424680100300/arroz-antonio-soros-gourmet,kiliweb,2021-03-13 13:07:05+00:00,2023-02-21 16:25:46+00:00,diego30,2024-02-13 13:09:38+00:00,Arroz,,500gr,...,,,,,,,,,0.0,-2
154797,3226980002002,http://world-en.openfoodfacts.org/product/3226980002002/pain-azyme-froment-paul-heumann-paul-hemann,serein,2012-06-18 11:25:44+00:00,2022-02-11 05:24:34+00:00,packbot,2024-02-10 03:09:26+00:00,Pain Azyme Froment Paul Heumann,Pain Azyme extra fin à la farine de froment,200g,...,,,,,,,,,0.0,-4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
778708,99863333,http://world-en.openfoodfacts.org/product/99863333/volvic-juicy-fraise-danone,kiliweb,2021-10-16 20:32:22+00:00,2023-02-15 20:13:13+00:00,itsjustruby,2024-02-13 17:47:21+00:00,Volvic Juicy Fraise,,,...,,,,,,,,,,
778710,99885434,http://world-en.openfoodfacts.org/product/99885434/raviolis-pekinois-surgeles-asia-food,openfoodfacts-contributors,2019-07-08 10:52:02+00:00,2023-06-06 12:57:53+00:00,chevalstar,2024-02-11 04:19:57+00:00,raviolis pékinois surgelés,,3800 g,...,,,,,,,,,0.0,
778711,9990000003556,http://world-en.openfoodfacts.org/product/9990000003556/fuze-tea-pesca-e-rosa-fuzetea,kiliweb,2022-09-14 12:00:46+00:00,2023-12-05 21:59:12+00:00,alex-off,2024-02-13 23:52:54+00:00,fuze tea pesca e rosa,,,...,,,,,,,,,,
778715,99911522,http://world-en.openfoodfacts.org/product/99911522/veritables-merguez-casino,alm1412,2021-07-04 10:30:44+00:00,2023-02-12 18:35:42+00:00,yogoff,2024-02-12 14:21:07+00:00,Veritables merguez,,6,...,,,,,,,,,0.0,


Let us check the first 20 entries.

In [26]:
df_sort_nsgrade["nutriscore_grade"].head(20)

536266    a
522672    a
522673    a
691528    a
154797    a
154798    a
154799    a
337346    a
154800    a
154801    a
337344    a
337343    a
92990     a
420374    a
337342    a
582334    a
92957     a
760425    a
691534    a
154804    a
Name: nutriscore_grade, dtype: category
Categories (5, object): ['a' < 'b' < 'c' < 'd' < 'e']

The entries with nutriscore grade 'a' are ranked first, as expected.
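
The sort follows the declared order of the categories (`'a' < 'b' < ... < 'e'`) because the column is an ordered categorical. As a sketch, assuming a hypothetical toy column built with the same kind of `CategoricalDtype`, we can also sort from worst to best grade with `ascending=False`; missing grades are placed last by default.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# ordered categorical type, mirroring the categories shown in the output above
grade_type = CategoricalDtype(categories=["a", "b", "c", "d", "e"], ordered=True)

# hypothetical toy data, with one missing grade
toy = pd.DataFrame(
    {"nutriscore_grade": pd.Series(["c", None, "a", "e"], dtype=grade_type)}
)

# worst grade first; rows with a missing grade stay at the end
worst_first = toy.sort_values(by="nutriscore_grade", ascending=False)
print(worst_first)
```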

Sort entries by the Nutri-Score grade and Nova group (together), and store the result in a variable named `df_sort_nsgrade_novagroup`.

In [27]:
df_sort_nsgrade_novagroup = df.sort_values(by=["nutriscore_grade", "nova_group"])
df_sort_nsgrade_novagroup

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,last_modified_by,last_updated_datetime,product_name,generic_name,quantity,...,vitamin-a_100g,vitamin-d_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,potassium_100g,calcium_100g,iron_100g,fruits-vegetables-nuts-estimate-from-ingredients_100g,nutrition-score-fr_100g
12,0000350034007,http://world-en.openfoodfacts.org/product/0000350034007/italian-tomato-puree-sainsbury-s,smoothie-app,2023-11-03 09:44:34+00:00,2024-03-10 09:02:28+00:00,mandyjacob123,2024-03-10 09:02:28+00:00,Italian tomato puree,,200g,...,,,,,,,,,75.0,-3
30,0000651511016,http://world-en.openfoodfacts.org/product/0000651511016/celery-ocean-mist,usda-ndb-import,2017-03-09 20:11:19+00:00,2023-01-28 03:05:16+00:00,kiliweb,2024-02-09 15:05:00+00:00,Celery,,,...,0.000055,,0.0082,,,,0.036,0.00033,100.0,-5
36,0000790001201,http://world-en.openfoodfacts.org/product/0000790001201/sonnenblumenhack-dm-bio,caline78,2024-04-23 16:53:12+00:00,2024-05-19 16:05:44+00:00,vectorofchange,2024-05-19 16:05:44+00:00,Sonnenblumenhack,,75 g,...,,,,,,,,,0.0,-5
68,0001494964588,http://world-en.openfoodfacts.org/product/0001494964588/apfelmark-gutbio,smoothie-app,2023-02-25 15:00:46+00:00,2023-06-25 12:51:01+00:00,worldtest,2024-02-14 02:12:40+00:00,Apfelmark,,355 g,...,,,,,,,,,99.9,-3
73,0002000000288,http://world-en.openfoodfacts.org/product/0002000000288/melange-rando-les-accents-du-soleil,kiliweb,2020-05-06 06:46:15+00:00,2022-02-11 00:56:21+00:00,packbot,2024-02-12 16:58:05+00:00,Melange Rando,Mélange Fruits Secs,125 g,...,,,,,,,,,100.0,-5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
778706,9977512576693,http://world-en.openfoodfacts.org/product/9977512576693/sables-natures-gerble,kiliweb,2022-02-05 15:49:28+00:00,2022-02-05 16:08:45+00:00,quentinbrd,2024-02-12 13:22:14+00:00,Sablés natures,,,...,,,,,,,,,,
778707,9984410997000,http://world-en.openfoodfacts.org/product/9984410997000/tomato-juice-waitrose,foodvisor,2022-12-05 17:12:45+00:00,2023-08-23 16:44:12+00:00,naruyoko,2024-02-14 01:05:50+00:00,Tomato juice,,,...,,,,,,,,,,
778708,99863333,http://world-en.openfoodfacts.org/product/99863333/volvic-juicy-fraise-danone,kiliweb,2021-10-16 20:32:22+00:00,2023-02-15 20:13:13+00:00,itsjustruby,2024-02-13 17:47:21+00:00,Volvic Juicy Fraise,,,...,,,,,,,,,,
778711,9990000003556,http://world-en.openfoodfacts.org/product/9990000003556/fuze-tea-pesca-e-rosa-fuzetea,kiliweb,2022-09-14 12:00:46+00:00,2023-12-05 21:59:12+00:00,alex-off,2024-02-13 23:52:54+00:00,fuze tea pesca e rosa,,,...,,,,,,,,,,


Let us check the first 20 entries.

In [28]:
df_sort_nsgrade_novagroup[["nutriscore_grade", "nova_group"]].head(20)

Unnamed: 0,nutriscore_grade,nova_group
12,a,1
30,a,1
36,a,1
68,a,1
73,a,1
88,a,1
106,a,1
204,a,1
461,a,1
565,a,1


Products with nutriscore_grade 'a' and nova_group '1' appear first.
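
When sorting on several columns, `ascending` also accepts one boolean per column. The following sketch uses hypothetical toy values (not taken from the dataset) to sort by best Nutri-Score grade first and, within each grade, by highest Nova group first.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame(
    {
        "nutriscore_grade": ["b", "a", "a", "b"],
        "nova_group": [1, 4, 1, 3],
    }
)

# grade ascending, Nova group descending within each grade
mixed = toy.sort_values(
    by=["nutriscore_grade", "nova_group"], ascending=[True, False]
)
print(mixed)
```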

## Working with dates

pandas has a specific data type for dates. You can explicitly ask pandas to use this type for specific columns, either during `read_csv` or after (as I did in `load_off`), see the [pandas intro tutorial 09](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html).

This specific data type makes it easy to filter entries by the month of their creation, to know what day of the week an entry was created, or to sort entries by their date of creation.
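
As an illustration of these operations, datetime columns expose a `.dt` accessor. The sketch below works on a hypothetical three-row DataFrame whose timestamps are copied from the dataset; the variable names `in_february` and `weekdays` are made up for the example.

```python
import pandas as pd

# hypothetical toy data with timezone-aware timestamps
toy = pd.DataFrame(
    {
        "created_datetime": pd.to_datetime(
            [
                "2012-01-31 14:43:58+00:00",
                "2012-02-09 10:34:56+00:00",
                "2024-07-19 02:05:18+00:00",
            ],
            utc=True,
        )
    }
)

# keep only the entries created in February (of any year)
in_february = toy[toy["created_datetime"].dt.month == 2]

# name of the day of the week on which each entry was created
weekdays = toy["created_datetime"].dt.day_name()
print(in_february)
print(weekdays)
```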

Sort entries by their date of creation, and store the result in a variable named `df_sort_created`.

In [29]:
df_sort_created = df.sort_values(by="created_datetime")
df_sort_created

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,last_modified_by,last_updated_datetime,product_name,generic_name,quantity,...,vitamin-a_100g,vitamin-d_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,potassium_100g,calcium_100g,iron_100g,fruits-vegetables-nuts-estimate-from-ingredients_100g,nutrition-score-fr_100g
310855,3760029248001,http://world-en.openfoodfacts.org/product/3760029248001/caramels-tendres-au-beurre-sale-au-sel-de-guerande...,stephane,2012-01-31 14:43:58+00:00,2023-08-22 06:01:47+00:00,naruyoko,2024-02-10 11:37:10+00:00,Caramels tendres au beurre salé au sel de Guérande,Caramels au beurre salé et à la fleur de sel de Guérande,100 g,...,,,,,,,,,0.000000,28
120954,3029330062806,http://world-en.openfoodfacts.org/product/3029330062806/jacquet-les-bouchees-creatives-a-garnir,stephane,2012-02-09 10:34:56+00:00,2022-02-11 07:44:48+00:00,packbot,2024-02-10 01:22:12+00:00,Jacquet Les bouchées créatives à garnir,Supports en pâte cuite prêts à garnir,54 g,...,,,,,,,,,0.000000,10
180478,3257980112590,http://world-en.openfoodfacts.org/product/3257980112590/boudoirs-cora,marianne,2012-02-11 14:51:07+00:00,2023-02-26 13:16:37+00:00,stephane,2024-02-10 04:30:36+00:00,Boudoirs,30 Boudoirs aux œufs frais,175 g,...,,,,,,,,,0.000000,14
117373,3017760038409,http://world-en.openfoodfacts.org/product/3017760038409/lulu-la-barquette-fraise-lu,marianne,2012-02-11 15:07:23+00:00,2024-06-11 12:45:40+00:00,fgouget,2024-06-11 12:45:40+00:00,Lulu La Barquette Fraise,Génoise garnie à la purée de fraise,120 g,...,,,,,,,,,27.800000,13
139195,3160181210524,http://world-en.openfoodfacts.org/product/3160181210524/cookies-tout-chocolat-biocoop,stephane,2012-02-11 18:51:58+00:00,2022-02-11 03:48:47+00:00,packbot,2024-02-10 02:14:09+00:00,Cookies tout chocolat Biocoop,Cookies au chocolat,200 g,...,,,,,,,,,0.000000,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214361,3292590864347,http://world-en.openfoodfacts.org/product/3292590864347/steak-hache-facon-bouchere-15-thiriet,foodvisor,2024-07-18 22:14:41+00:00,2024-07-19 02:41:03+00:00,quentinbrd,2024-07-19 02:41:03+00:00,Steak haché façon bouchère 15%,,,...,,,,,,,,,,2
62639,0604485070741,http://world-en.openfoodfacts.org/product/0604485070741/smoky-roasted-garlic-onion-seasoning-crate-branch,benjipoo,2024-07-18 22:52:40+00:00,2024-07-18 23:03:25+00:00,benjipoo,2024-07-18 23:03:25+00:00,Smoky Roasted Garlic & Onion Seasoning,,10OZ,...,,,,,,,,,55.555556,
572227,7771260011039,http://world-en.openfoodfacts.org/product/7771260011039/sante-sport-sabor-mango-zero-azucar,5m4u9,2024-07-19 01:48:45+00:00,2024-07-19 02:30:44+00:00,5m4u9,2024-07-19 02:30:44+00:00,Santé Sport sabor Mango Zero Azúcar,"bebida hidratate sin azúcar con 5 iones (sulfato de magnesio, cloruro de sodio, cloruro de calcio y clorur...",1 l,...,,,,,,0.015,0.0075,,0.000000,
575155,7798294150220,http://world-en.openfoodfacts.org/product/7798294150220/galletitas-de-naranja-banadas-con-chocolate-angola,cielo,2024-07-19 02:05:18+00:00,2024-07-19 02:19:16+00:00,roboto-app,2024-07-19 02:19:16+00:00,Galletitas de naranja bañadas con chocolate,,130g,...,,,,,,,,,0.000000,4


Let us check that the first and last entries are as expected.

Display the first entries.

In [30]:
df_sort_created["created_datetime"].head(10)

310855   2012-01-31 14:43:58+00:00
120954   2012-02-09 10:34:56+00:00
180478   2012-02-11 14:51:07+00:00
117373   2012-02-11 15:07:23+00:00
139195   2012-02-11 18:51:58+00:00
595275   2012-02-11 20:46:21+00:00
116079   2012-02-11 21:11:15+00:00
125771   2012-02-12 08:32:47+00:00
548943   2012-02-12 08:51:55+00:00
294527   2012-02-12 18:01:45+00:00
Name: created_datetime, dtype: datetime64[ns, UTC]

The oldest entries in our dataset date from 2012.

Display the last entries.

In [31]:
df_sort_created["created_datetime"].tail(20)

756648   2024-07-18 16:30:44+00:00
454885   2024-07-18 16:31:03+00:00
371233   2024-07-18 16:37:51+00:00
100035   2024-07-18 18:11:07+00:00
94707    2024-07-18 18:48:44+00:00
1305     2024-07-18 19:41:21+00:00
661920   2024-07-18 19:44:56+00:00
94708    2024-07-18 19:51:02+00:00
737884   2024-07-18 20:04:42+00:00
453714   2024-07-18 20:57:56+00:00
403631   2024-07-18 21:00:45+00:00
133242   2024-07-18 21:36:51+00:00
112979   2024-07-18 21:36:58+00:00
445006   2024-07-18 22:00:54+00:00
468234   2024-07-18 22:03:22+00:00
214361   2024-07-18 22:14:41+00:00
62639    2024-07-18 22:52:40+00:00
572227   2024-07-19 01:48:45+00:00
575155   2024-07-19 02:05:18+00:00
339491   2024-07-19 05:41:18+00:00
Name: created_datetime, dtype: datetime64[ns, UTC]

The latest entries in our dataset date from 2024-07-19 (when I downloaded the entire dataset).
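
If we only need the oldest and latest dates, sorting the whole DataFrame is not required: `min()` and `max()` work directly on a datetime column. A minimal sketch on a hypothetical Series of three timestamps:

```python
import pandas as pd

# hypothetical toy Series of creation dates
created = pd.Series(
    pd.to_datetime(
        [
            "2024-07-19 02:05:18+00:00",
            "2012-01-31 14:43:58+00:00",
            "2018-03-01 21:32:54+00:00",
        ],
        utc=True,
    )
)

oldest = created.min()
newest = created.max()
# subtracting two timestamps gives a Timedelta
span = newest - oldest
print(oldest, newest, span)
```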

## Working with textual data

pandas provides a number of functions to process text strings, see the [pandas intro tutorial 10](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html).

Use these functions to select all entries whose list of brands contains "Casino" (this operation is case-sensitive, so mind the initial capital letter!), and store the result in a variable named `df_casino`.

In [32]:
df_casino = df[df["brands"].str.contains("Casino")]
df_casino

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,last_modified_by,last_updated_datetime,product_name,generic_name,quantity,...,vitamin-a_100g,vitamin-d_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,potassium_100g,calcium_100g,iron_100g,fruits-vegetables-nuts-estimate-from-ingredients_100g,nutrition-score-fr_100g
54073,0200298019689,http://world-en.openfoodfacts.org/product/0200298019689/saucisses-de-toulouse-casino,kiliweb,2018-01-21 17:34:37+00:00,2019-01-11 10:43:03+00:00,teolemon,2024-02-09 20:32:34+00:00,Saucisses de toulouse,,,...,,,,,,,,,0.000000,18
54099,0200448052542,http://world-en.openfoodfacts.org/product/0200448052542/poulet-jaune-fermier-du-gers-casino,moon-rabbit,2017-09-30 09:42:14+00:00,2021-04-03 20:16:26+00:00,youplaboum,2024-02-09 20:32:39+00:00,Poulet jaune fermier du Gers,,"1,606 kg",...,,,,,,,,,,
54102,0200451050641,http://world-en.openfoodfacts.org/product/0200451050641/2-cuisses-de-poulet-blanc-casino,kiliweb,2023-04-28 14:35:47+00:00,2023-08-14 07:21:28+00:00,quentinbrd,2024-02-12 11:42:07+00:00,2 Cuisses de Poulet blanc,,,...,,,,,,,,,,
54421,0202152035750,http://world-en.openfoodfacts.org/product/0202152035750/saucisse-de-toulouse-geant-casino,kiliweb,2018-01-20 19:41:37+00:00,2018-12-27 19:57:11+00:00,teolemon,2024-02-09 20:34:06+00:00,Saucisse de toulouse,,,...,,,,,,,,,0.000000,18
54558,0202557031753,http://world-en.openfoodfacts.org/product/0202557031753/aiguillettes-de-poulet-marinees-casino,kiliweb,2019-05-27 13:09:22+00:00,2023-01-16 14:25:10+00:00,gabrielb31,2024-02-11 03:18:01+00:00,Aiguillettes de poulet marinées,,,...,,,,,,,,,,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
758331,88596433,http://world-en.openfoodfacts.org/product/88596433/sirop-d-orgeat-casino,kiliweb,2022-12-15 19:45:28+00:00,2022-12-16 04:34:16+00:00,quentinbrd,2024-02-12 12:13:24+00:00,Sirop d'orgeat,,,...,,,,,,,,,,
758362,88830100,http://world-en.openfoodfacts.org/product/88830100/assortiment-de-petits-cakes-casino,openfoodfacts-contributors,2020-01-02 20:05:22+00:00,2023-05-16 11:29:19+00:00,fix-serving-size-bot,2024-02-11 08:25:37+00:00,Assortiment de petits cakes,,450 g,...,,,,,,,,,21.833333,
768225,9272077113582,http://world-en.openfoodfacts.org/product/9272077113582/huile-olives-casino,nutrinet-sante,2022-05-19 03:47:32+00:00,2022-05-19 11:21:16+00:00,segundo,2024-02-12 12:53:35+00:00,Huile olives,,,...,,,,,,,,,,
778395,96963333,http://world-en.openfoodfacts.org/product/96963333/farine-de-ble-t65-casino,kiliweb,2018-04-02 09:06:01+00:00,2023-12-08 14:31:08+00:00,chevalstar,2024-02-10 21:05:54+00:00,Farine de blé T65,,,...,,,,,,,,,,


You should get 5495 products whose brands contain "Casino".
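
To also match other spellings such as "casino", `str.contains` accepts `case=False`; the `na=False` argument treats missing values as non-matching. The sketch below uses a hypothetical toy column, so the counts it produces say nothing about the real dataset.

```python
import pandas as pd

# hypothetical toy data, including a missing brand
toy = pd.DataFrame(
    {"brands": ["Casino", "Géant Casino, Casino", "casino bio", None, "Carrefour"]},
    dtype="string",
)

# case-sensitive match, as in the exercise
case_sensitive = toy[toy["brands"].str.contains("Casino", na=False)]

# case-insensitive match
case_insensitive = toy[toy["brands"].str.contains("casino", case=False, na=False)]
print(len(case_sensitive), len(case_insensitive))
```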

## Wrapping it all together

Select all the products that are in the category for spreads and store this subset in a variable `df_spreads`.

> **HINT** If you can't find the right pattern to look for, take a peek at the spelling of the categories: print the content of the column and browse through the values until you find a suitable value.

In [33]:
df_spreads = df[df["categories_en"].str.contains("Spreads")].copy()
df_spreads

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,last_modified_by,last_updated_datetime,product_name,generic_name,quantity,...,vitamin-a_100g,vitamin-d_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,potassium_100g,calcium_100g,iron_100g,fruits-vegetables-nuts-estimate-from-ingredients_100g,nutrition-score-fr_100g
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2023-04-28 23:59:01+00:00,roboto-app,2024-02-09 14:48:49+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,...,,,,,,,,,,23
18,0000500000050,http://world-en.openfoodfacts.org/product/0000500000050/smuckers-natural-peanut-butter,wizno,2024-07-05 16:01:27+00:00,2024-07-05 16:07:51+00:00,wizno,2024-07-05 16:07:51+00:00,Smuckers Natural Peanut Butter,,16 oz / 454 grams,...,,,,,,,,,75.0,
21,0000547404828,http://world-en.openfoodfacts.org/product/0000547404828/organic-hummus-trader-joe-s,foodvisor,2023-09-26 07:18:03+00:00,2024-02-14 04:01:17+00:00,rj2,2024-02-14 05:04:55+00:00,Organic Hummus,,8 oz,...,,,,,,,,,0.0,
43,0000800135001,http://world-en.openfoodfacts.org/product/0000800135001/nectar-pour-ngalax-torodo,ninehadi,2023-05-11 19:57:59+00:00,2023-05-12 11:10:21+00:00,ninehadi,2024-02-14 03:03:34+00:00,nectar pour ngalax,,1 litre,...,,,,,,,,,0.0,
170,0006491074766,http://world-en.openfoodfacts.org/product/0006491074766/peanut-butter-natural-crunchy-we-natural,smoothie-app,2022-08-25 09:11:52+00:00,2023-01-28 09:22:38+00:00,roboto-app,2024-02-13 23:34:01+00:00,Peanut butter natural crunchy,,475 g,...,,,,,,,,,0.0,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
778669,9923352004469,http://world-en.openfoodfacts.org/product/9923352004469/brunch-knoblauch-krauter,smoothie-app,2023-06-04 11:30:54+00:00,2023-10-09 07:44:39+00:00,geodata,2024-02-14 03:30:46+00:00,Brunch Knoblauch-Kräuter,,185g,...,,,,,,,,,,15
778675,9935010000003,http://world-en.openfoodfacts.org/product/9935010000003/rillette-d-oie-sans-marque,sebleouf,2015-10-31 12:07:09+00:00,2022-02-11 08:08:25+00:00,packbot,2024-02-10 21:06:03+00:00,Rillette d'oie,,180 g,...,,,,,,,,,0.0,
778677,99365516,http://world-en.openfoodfacts.org/product/99365516/organic-creamy-peanut-butter-salted-trader-joe-s,kiliweb,2022-08-22 18:21:21+00:00,2023-01-03 15:09:53+00:00,wolfgang8741,2024-02-13 23:31:05+00:00,Organic creamy peanut butter salted,,16 oz,...,,,,,,,,,,-3
778684,99440077,http://world-en.openfoodfacts.org/product/99440077/confiture-fraises-bio-les-comtes-de-provence,kiliweb,2021-05-31 06:39:35+00:00,2022-02-03 08:42:04+00:00,charlesnepote,2024-02-12 14:30:41+00:00,Confiture fraises bio,,,...,,,,,,,,,,


You should find 44120 spreads.

For these spreads, compute the means of the nutritional values for:
* fat,
* saturated fat,
* sugars,
* salt.

In [34]:
df_spreads[["fat_100g", "saturated-fat_100g", "sugars_100g", "salt_100g"]].mean()

fat_100g              22.496616
saturated-fat_100g     7.984325
sugars_100g           26.421148
salt_100g              0.655637
dtype: float64

You should find mean values of approximately (rounded to one decimal place):

* fat = 22.5 g,
* saturated-fat = 8.0 g,
* sugars = 26.4 g,
* salt = 0.7 g.
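
Keep in mind that `mean()` skips missing values by default, so each mean may be computed on a different number of products. A small sketch on hypothetical toy values shows how `count()` reveals the number of non-missing values behind each mean, and how `round(1)` gives the rounded figures.

```python
import pandas as pd

# hypothetical toy data with missing values
toy = pd.DataFrame(
    {
        "fat_100g": [10.0, None, 30.0],
        "salt_100g": [0.5, 1.5, None],
    }
)

# means ignore missing values, rounded to one decimal place
means = toy.mean().round(1)

# number of non-missing values used for each mean
counts = toy.count()
print(means)
print(counts)
```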


For each of these 4 nutritional values, compute the percentage difference between each product and the average of its category, and store the computed values as new columns in `df_spreads`, named `diff_fat`, `diff_sat`, `diff_sugars`, `diff_salt`.

Remember that you can find help in the [pandas intro tutorial 05](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html) and [pandas tutorial 06](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html#min-tut-06-stats).

In [35]:
# fmt: off
df_spreads = df_spreads.assign(
    diff_fat=100 * (df_spreads["fat_100g"] - df_spreads["fat_100g"].mean()) / df_spreads["fat_100g"].mean(),
    diff_sat=100 * (df_spreads["saturated-fat_100g"] - df_spreads["saturated-fat_100g"].mean()) / df_spreads["saturated-fat_100g"].mean(),
    diff_sugars=100 * (df_spreads["sugars_100g"] - df_spreads["sugars_100g"].mean()) / df_spreads["sugars_100g"].mean(),
    diff_salt=100 * (df_spreads["salt_100g"] - df_spreads["salt_100g"].mean()) / df_spreads["salt_100g"].mean(),
)

In [36]:
df_spreads[["diff_fat", "diff_sat", "diff_sugars", "diff_salt"]]

Unnamed: 0,diff_fat,diff_sat,diff_sugars,diff_salt
0,113.365421,25.245402,21.115100,-98.474766
18,,,,
21,-77.774435,-87.475460,,
43,-37.279456,,-70.099710,
170,127.145272,25.245402,-85.617582,-84.747658
...,...,...,...,...
778669,-2.207515,125.441724,-87.888490,83.028104
778675,,,,
778677,-49.487353,-76.279280,-97.132692,-81.223443
778684,-100.000000,-100.000000,58.963568,
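
The four expressions above only differ by the column name, so one possible refactoring is a small helper function. This is a sketch under assumptions: `pct_diff_from_mean` and the `columns` mapping are hypothetical names introduced here, and the data is a toy DataFrame rather than `df_spreads`.

```python
import pandas as pd


def pct_diff_from_mean(series):
    """Percentage difference between each value and the mean of the series."""
    mean = series.mean()
    return 100 * (series - mean) / mean


# hypothetical toy data
toy = pd.DataFrame(
    {
        "fat_100g": [10.0, 30.0, None],
        "sugars_100g": [5.0, 15.0, 40.0],
    }
)

# new column name -> source column name
columns = {"diff_fat": "fat_100g", "diff_sugars": "sugars_100g"}
toy = toy.assign(**{new: pct_diff_from_mean(toy[old]) for new, old in columns.items()})
print(toy)
```

Missing nutritional values stay missing in the new columns, as in the output above.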


Note that these values differ from what the Open Food Facts website displays when you look at the nutritional values of a product from this category, e.g. [Coconut Spread - premium Srikaya - Hey Boo - 227 g](https://world.openfoodfacts.org/product/0608938316165/coconut-spread-premium-srikaya-hey-boo).

In [37]:
# (execute this cell and check the output)
df_spreads[df_spreads['code'] == '0608938316165']['diff_fat']

62911   -6.652628
Name: diff_fat, dtype: float64

This product contains less fat (-6.7 %) than the average spread in our subset `df_spreads`, but more fat (+9 %) than the average spread in the entire Open Food Facts dataset (as displayed on the OFF website).


This is because the Open Food Facts website uses its entire dataset, whereas we are working on a filtered subset of "reasonably complete" product entries, prepared beforehand to keep only products with:

* a non-ambiguous barcode in the EAN-8 or EAN-13 format;
* a product name;
* brands;
* an image URL for the product;
* a category;
* basic nutritional values.

It seems that, in this "reasonably complete" subset, spreads contain more fat on average than in the whole Open Food Facts dataset.

Is the entire Open Food Facts dataset closer to the reality of what is on the shelves of supermarkets?
Is our subset more faithful globally? Is it more faithful to the consumer market in certain countries, e.g. France and Spain?

These questions raise the more general problem of [selection bias](https://en.wikipedia.org/wiki/Selection_bias), which lies behind every data analysis and every use of a dataset, e.g. to build artificial intelligence systems.

## Bonus exercise : Traffic light labelling

The [traffic light labelling system](https://www.nutrition.org.uk/healthyliving/helpingyoueatwell/324-labels.html?start=3) is used on the [Open Food Facts website (French)](https://fr.openfoodfacts.org/reperes-nutritionnels) to display, with a color code, information that is easier to grasp on 4 nutritional values:

* fat,
* saturated fat,
* sugars,
* salt.

The Open Food Facts dataset does not contain these indicators, but you can recompute them from the [reference table](https://www.nutrition.org.uk/media/er5n0c3s/capture.png).

Add 4 columns to the dataset, one for each of the 4 relevant nutritional values, that will contain the level (low, medium, high) or color (green, amber, red) of the traffic light.

> **HINT** We can simplify the exercise and express all conditions on the values per 100g (ignoring the rightmost column of the reference table where thresholds are expressed per portion).

We can use [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html?highlight=loc#pandas.DataFrame.loc), see "Setting values".

In [38]:
# (just execute this cell)
# for fat_100g
df["tl_fat"] = "unknown"
df.loc[df["fat_100g"] <= 3, "tl_fat"] = "green"
df.loc[(df["fat_100g"] > 3) & (df["fat_100g"] <= 17.5), "tl_fat"] = "amber"
df.loc[(df["fat_100g"] > 17.5), "tl_fat"] = "red"

Let us check that the traffic lights for fat are as wanted.

In [39]:
# (just execute this cell)
df[["fat_100g", "tl_fat"]].head(10)

Unnamed: 0,fat_100g,tl_fat
0,48.0,red
1,,unknown
2,15.78,amber
3,34.7,red
4,11.0,amber
5,1.0,green
6,7.8,amber
7,9.5,amber
8,1.0,green
9,3.3,amber


Now we can define the traffic lights for the 3 remaining nutritional values, in columns `"tl_sat"`, `"tl_sugars"`, `"tl_salt"`.

In [40]:
# for saturated-fat_100g
df["tl_sat"] = "unknown"
df.loc[df["saturated-fat_100g"] <= 1.5, "tl_sat"] = "green"
df.loc[(df["saturated-fat_100g"] > 1.5) & (df["saturated-fat_100g"] <= 5), "tl_sat"] = "amber"
df.loc[(df["saturated-fat_100g"] > 5), "tl_sat"] = "red"
# for sugars_100g
df["tl_sugars"] = "unknown"
df.loc[df["sugars_100g"] <= 5, "tl_sugars"] = "green"
df.loc[(df["sugars_100g"] > 5) & (df["sugars_100g"] <= 22.5), "tl_sugars"] = "amber"
df.loc[(df["sugars_100g"] > 22.5), "tl_sugars"] = "red"
# for salt_100g
df["tl_salt"] = "unknown"
df.loc[df["salt_100g"] <= 0.3, "tl_salt"] = "green"
df.loc[(df["salt_100g"] > 0.3) & (df["salt_100g"] <= 1.5), "tl_salt"] = "amber"
df.loc[(df["salt_100g"] > 1.5), "tl_salt"] = "red"
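
As an alternative to repeated `loc` assignments, the same binning can be sketched with `pd.cut`, whose intervals are closed on the right by default and therefore match the `<=` thresholds used above. The helper `traffic_light` is a hypothetical name introduced for this illustration, and the example runs on toy values rather than on `df`.

```python
import numpy as np
import pandas as pd


def traffic_light(values, low, high):
    """Map values to green (<= low), amber (<= high) or red; missing -> unknown."""
    lights = pd.cut(
        values,
        bins=[-np.inf, low, high, np.inf],
        labels=["green", "amber", "red"],
    )
    # missing values fall in no bin: give them an explicit "unknown" category
    return lights.cat.add_categories("unknown").fillna("unknown")


# hypothetical toy data, including a missing value and both threshold values
toy = pd.DataFrame({"fat_100g": [48.0, None, 15.78, 1.0, 3.0, 17.5]})
toy["tl_fat"] = traffic_light(toy["fat_100g"], low=3, high=17.5)
print(toy)
```

With this helper, the other three columns would only differ by their thresholds, e.g. `low=1.5, high=5` for saturated fat.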

We can display the traffic lights for the first 10 products, and compare with what the Open Food Facts website displays.

> **HINT** We can retrieve URLs from the column `url`.

In [41]:
df[["url", "tl_fat", "tl_sat", "tl_sugars", "tl_salt"]].head(10)

Unnamed: 0,url,tl_fat,tl_sat,tl_sugars,tl_salt
0,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,red,red,red,green
1,http://world-en.openfoodfacts.org/product/0000131327786/lime-cordial-sainsbury-s,unknown,unknown,unknown,unknown
2,http://world-en.openfoodfacts.org/product/0000155011159/mini-chaussons-a-la-compote-de-pomme-intermarche,amber,red,amber,green
3,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,red,unknown,red,unknown
4,http://world-en.openfoodfacts.org/product/0000182006180/knusper-musli-mango-gut-bio,amber,green,amber,green
5,http://world-en.openfoodfacts.org/product/0000204286484/mehrkomponeneten-protein-90-c6-haselnuss-allfitnes...,green,unknown,unknown,unknown
6,http://world-en.openfoodfacts.org/product/0000209773750/tortitas-de-trigo-roti-wraps-lidl,amber,green,green,amber
7,http://world-en.openfoodfacts.org/product/0000241013128/eclairs-intermarche,amber,amber,amber,amber
8,http://world-en.openfoodfacts.org/product/0000250632969/mehrkomponeneten-protein-90-c6-banane-allfitnessfa...,green,unknown,unknown,unknown
9,http://world-en.openfoodfacts.org/product/0000290153097/risto-piatti-fusilli-alla-sorrentina-senza-glutine...,amber,amber,green,amber


Open the webpages for a few products.

**Question.** Do your results match what is displayed on the page? If there are differences, are they systematic?

## To go further

### Python for data science

* [Programming in Python for Data Science](https://prog-learn.mds.ubc.ca/en/)
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)