<a href="https://colab.research.google.com/github/datactivist/scpo-data-science-bootcamp/blob/main/notebooks/2_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tabular data analysis 1: Loading Open Food Facts data with pandas

In this series of notebooks, we are going to explore the data contained in the OpenFoodFacts database.

## OpenFoodFacts

OpenFoodFacts is an open, crowdsourced database on food products from around the world.

It is produced and managed as a digital commons.

Everyone can contribute data on packaged food products: pictures, ingredients, nutritional values etc.

This database has served as the foundation for many mobile phone apps, especially scanning apps to help customers while grocery shopping.

### Notions

* It is [*open*](https://en.wikipedia.org/wiki/Open_data): anyone can freely access, use and modify it.
* It is [*crowdsourced*](https://en.wikipedia.org/wiki/Crowdsourcing): anyone can add new food products to the database, and complete or modify existing data.
* It is a [knowledge commons](https://en.wikipedia.org/wiki/Knowledge_commons), a type of [digital commons](https://en.wikipedia.org/wiki/Digital_commons_(economics)).

### Browsing through the dataset

The OpenFoodFacts database is [available online](https://world.openfoodfacts.org/).

Take a few minutes to explore the database through its online interface.

* How is each product described?
* What types of information are provided?

### Understanding the dataset

To really understand a dataset, you need to read its documentation so that you can answer a set of common, basic questions that will help guide your analysis, such as:

* Who created this dataset and for what purpose?
* How was the dataset created?
* What do the instances that comprise the dataset represent (e.g. people, companies, events, photos...)?
* What data does each instance consist of? Are they "raw" data or (computed) features?
* Are the instances related in some way? If so, are there specific fields that enable cross-referencing?

The documentation of a dataset is always written with some purpose, for an intended type of reader and in a certain context. It is therefore very likely that you will not find all the answers in the documentation.

Here, you can gather partial information on OpenFoodFacts (OFF) from:

* the [presentation of the project](https://world.openfoodfacts.org/discover)
* various pages of the [wiki](https://wiki.openfoodfacts.org/Main_Page), mostly:
  * [Data fields](https://wiki.openfoodfacts.org/Data_fields)
  * [Ingredients](https://wiki.openfoodfacts.org/Ingredients)
  * [Quality](https://wiki.openfoodfacts.org/Quality)

#### To go further

* [Datasheets for datasets](https://arxiv.org/pdf/1803.09010.pdf) is a standardized documentation process and format proposed by AI researchers. It aims to facilitate the proper (re-)use of datasets and to avoid common pitfalls in designing AI components (and the ensuing scandals when they exhibit problematic biases in deployment).

Equipped with this new knowledge about the OpenFoodFacts database, you can start the exploratory analysis of the data to gather the missing information to complete your answers, and ask questions of your own.


### OpenFoodFacts as a tabular dataset

The entire set of facts about all the products in the OpenFoodFacts database can be represented as a *tabular dataset*, that is, a table of data where:

* each row is a product,
* each column is a field (e.g. "brand", "barcode", "energy for 100g"...),
* each cell contains the value of a field for a product.


The simplest and most common format used for tabular datasets is the [CSV format](https://en.wikipedia.org/wiki/Comma-separated_values).
CSV files can be opened in spreadsheet software such as Microsoft Excel, Apple Numbers or LibreOffice Calc, or in any plain text editor.
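
To make the format concrete, here is a minimal sketch that parses a tiny delimited text with Python's standard `csv` module. The three products and their values are invented for illustration, and the tab delimiter is chosen to match the `sep='\t'` used later in this notebook.

```python
import csv
import io

# Invented three-product extract: one header line, then one line per product.
# Fields are separated by tabs, as in the file loaded later in this notebook.
raw = (
    "code\tproduct_name\tbrands\n"
    "00000017\tExample spread\tBrand A\n"
    "00000024\tExample rice\tBrand B\n"
    "00000031\tExample jam\tBrand C\n"
)

# DictReader uses the first line as column names
reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
rows = list(reader)

print(reader.fieldnames)         # the columns
print(len(rows))                 # the number of rows (products)
print(rows[1]["product_name"])   # one cell: a field value for one product
```

Each row comes back as a dictionary that maps column names to values, and every value is a plain string.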

The OpenFoodFacts database is [available for download in various formats](https://world.openfoodfacts.org/data), including the CSV format.
Because the whole dataset is too big (the uncompressed CSV export weighs more than 4 GB as of 2021-08-16), we will work on a filtered subset of the dataset where we only keep products with:

* a non-ambiguous barcode in the [EAN-8](https://en.wikipedia.org/wiki/EAN-8) or [EAN-13](https://en.wikipedia.org/wiki/International_Article_Number) formats;
* a product name;
* brands;
* an image URL for the product;
* a category;
* basic nutritional values.

## The pandas library for tabular data analysis

### Gaining functionalities with libraries

The Python standard library includes a module named [csv](https://docs.python.org/3/library/csv.html#module-csv) that provides very basic support to read and write CSV files.
This module enables you to read and write values, but nothing more.

It provides no ready-made function to:

* rename columns;
* filter columns, e.g. keep only the columns for nutritional values;
* filter rows, e.g. select all products that are categorized as "Sweet spreads";
* compute summary statistics on columns across rows, e.g. compute the min, max, mean and median of fiber content per 100g;
* compare columns, e.g. test whether they contain the same values;
* etc.
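
As a rough illustration of what doing this by hand looks like, the sketch below filters rows and computes summary statistics with only the standard library. The data are invented, and the column name `fiber_100g` is an assumption made for the example.

```python
import csv
import io
import statistics

# Invented data; `fiber_100g` is a hypothetical column name
raw = (
    "product_name\tmain_category_en\tfiber_100g\n"
    "Spread A\tSweet spreads\t3.5\n"
    "Spread B\tSweet spreads\t\n"
    "Rice C\tAromatic rices\t1.0\n"
    "Spread D\tSweet spreads\t4.5\n"
)
rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

# filtering rows requires an explicit loop (here a list comprehension)
spreads = [r for r in rows if r["main_category_en"] == "Sweet spreads"]

# every value is a string: we must convert numbers and skip missing values ourselves
fiber = [float(r["fiber_100g"]) for r in spreads if r["fiber_100g"] != ""]

print(min(fiber), max(fiber))
print(statistics.mean(fiber), statistics.median(fiber))
```

Type conversion, missing values and filtering all have to be handled explicitly here; a library for tabular data takes care of these steps.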

As we saw in the first notebook, this can be remedied by using an additional [software library](https://en.wikipedia.org/wiki/Library_(computing)), which is, roughly speaking, a collection of code that provides functionalities for a given task or domain.

The most widely used library in Python to work on tabular datasets is [pandas](https://pandas.pydata.org/).

We need to import pandas and, for technical reasons, a specific pandas data type to load [categorical variables with ordered values](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#controlling-behavior).

In [2]:
import pandas as pd
# we need this data type for ordered categoricals
from pandas.api.types import CategoricalDtype
# lift some limitations in column width, so more cell values are displayed in full
pd.set_option('display.max_colwidth', 110)
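
To give an idea of what an ordered categorical is, here is a small sketch. Treating Nutri-Score grades as ordered from "a" to "e" is an assumption made for the example; the values are invented.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumption for illustration: grades are ordered a < b < c < d < e
grade_dtype = CategoricalDtype(categories=["a", "b", "c", "d", "e"], ordered=True)
grades = pd.Series(["e", "a", "c", "b"], dtype=grade_dtype)

# because the categories are ordered, comparisons and sorting follow that order
print(grades.min())
print(grades < "c")
print(grades.sort_values().tolist())
```

With a plain string column these operations would rely on alphabetical order, which only coincidentally matches the intended order here.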

The OpenFoodFacts CSV file we will load comes with a text file that specifies the data type pandas should use for each column. Without it, pandas would guess the data types, and because its guesses are (rightfully) conservative, the result would be quite rough around the edges.
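
The effect of specifying data types can be sketched on a two-line toy file. The product names are invented; the two barcodes are taken from the dataset shown later in this notebook.

```python
import io
import pandas as pd

# Toy tab-separated content with two barcodes, one of them with leading zeros
raw = (
    "code\tproduct_name\n"
    "0000101209159\tExample product\n"
    "99994440\tAnother product\n"
)

# without dtype, pandas guesses that barcodes are integers and drops the leading zeros
guessed = pd.read_csv(io.StringIO(raw), sep="\t")
# with an explicit dtype, barcodes are kept as text
typed = pd.read_csv(io.StringIO(raw), sep="\t", dtype={"code": "string"})

print(guessed["code"].tolist())
print(typed["code"].tolist())
```

A barcode is an identifier, not a quantity, so keeping it as text preserves the information.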

The two files (CSV and txt) are on the Google Drive of my Sciences Po account:

* [CSV file](https://drive.google.com/file/d/14Pyz3Wb-FGs_9H-e7K-4Ug2X31N81Amv/view?usp=sharing)
* [dtype txt file](https://drive.google.com/file/d/1EUBD1btT8k4PS073WLUqGm_UucUl4n3P/view?usp=sharing)

0. Check that you have saved this notebook on your Google Drive (otherwise a "Save on your Drive" button appears in the Colab notebook menu bar).
1. Download the CSV and txt files.
2. Add them to the Google Drive of the account you used to open Colab. It should be your Sciences Po account or your personal account.
   * To check which account you are using, click on the circle with your initial, at the top right of the Colab menu bar (as in Gmail and other Google products).
   * Then go to <https://drive.google.com/>, check that you are logged in with the same account, and drop your files where you want.
3. To the very left of this Colab notebook, there is a small "folder" icon. Click on it: a menu bar will appear with three icons below "Files". Click on the icon on the right (dark folder icon, with the Drive icon). It will enable you to access files on your Drive from your Colab notebook.
4. In the next cell, change the path to the files to match the path on your Drive.

It should work, but let me know if you encounter any issue.

In [3]:
# this code probably appeared with the above procedure
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
# dataset and data type of the columns
# - google drive
# OFF_FILE = 'drive/MyDrive/data-science-bootcamp/off_products_subset.csv'
# DTYPE_FILE = 'drive/MyDrive/data-science-bootcamp/dtype.txt'
# - local
OFF_FILE = '../data/processed/off_products_subset.csv'
DTYPE_FILE = '../data/processed/dtype.txt'

We will use a custom utility function, `load_off`, to load the OpenFoodFacts dataset and convert its two datetime columns.

You do not need to understand or even look at its code, because it requires a few Python functions and technical notions (file input and output, evaluation) that we could not cover in the first notebook and that are beyond the objectives of this bootcamp.
However, feel free to ask Mathieu questions if you are curious!

In [5]:
def load_off():
  """Load the filtered subset of OpenFoodFacts.
  
  Returns
  -------
  df : pd.DataFrame
    (A filtered subset of the) OpenFoodFacts tabular dataset.
  """
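  # load the data types for the columns ;
  # `eval` runs the content of the file as Python code (this is why
  # `CategoricalDtype` was imported above), so only use it on a file you trust
  with open(DTYPE_FILE) as f:
    dtype = eval(f.read())

  # load the dataset (the file is tab-separated)
  df = pd.read_csv(OFF_FILE, sep='\t', dtype=dtype)
  # convert the columns that contain datetimes (ISO 8601 strings)
  for col_name in ('created_datetime', 'last_modified_datetime'):
    df[col_name] = pd.to_datetime(df[col_name])
  return df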

We load the dataset using the function above.

In [6]:
df = load_off()

If all went well, you do not see any output.
What have we read, really?
Remember that typing the name of a variable as the only (or last) line of a notebook cell displays its value.

In [7]:
# type the name of the variable containing the dataset
df

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2020-01-18 19:26:31+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,,Bovetti,...,0.004,,,,,,,,,23.0
1,0000159487776,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,usda-ndb-import,2017-03-09 16:01:56+00:00,2020-04-22 20:31:56+00:00,"Milkyway, magic stars chocolates",,,,Milkyway,...,,,,,,,,,,
2,0000204286484,http://world-en.openfoodfacts.org/product/0000204286484/mehrkomponeneten-protein-90-c6-haselnuss-allfitnes...,allfitnessfactory-de,2016-12-30 12:12:46+00:00,2017-03-24 16:39:27+00:00,Mehrkomponeneten Protein 90 C6 Haselnuß,Mehrkomponeneten Protein in Haselnuß Geschmack,"2,5 kg",bucket,allfitnessfactory.de,...,,,,,,,,,,
3,0000250632969,http://world-en.openfoodfacts.org/product/0000250632969/mehrkomponeneten-protein-90-c6-banane-allfitnessfa...,allfitnessfactory-de,2017-01-13 07:30:12+00:00,2017-03-24 16:42:57+00:00,Mehrkomponeneten Protein 90 C6 Banane,Mehrkomponeneten Protein in Bananen Geschmack,"2,5 kg",bucket,allfitnessfactory.de,...,,,,,,,,,,
4,0000460938714,http://world-en.openfoodfacts.org/product/0000460938714/100-soja-protein-haselnuss-allfitnessfactory-de,allfitnessfactory-de,2016-12-30 11:39:50+00:00,2017-03-24 16:47:58+00:00,100% Soja Protein Haselnuss,100% Soja Protein Haselnuss Geschmack,2 kg,bucket,allfitnessfactory.de,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416547,9999091865142,http://world-en.openfoodfacts.org/product/9999091865142/paprikas-kukorica-csemege-spar,hunsly,2018-10-21 15:10:04+00:00,2019-11-18 22:25:49+00:00,Paprikás Kukorica csemege,extrudált kukorica,100 g,"műanyag,zacskó",Spar,...,0.384,,,,,,,,,11.0
416548,99994440,http://world-en.openfoodfacts.org/product/99994440/veganes-muhlenhack-rugenwalder-muhle,kiliweb,2021-02-20 16:41:45+00:00,2021-02-21 09:22:46+00:00,Veganes Mühlenhack,,180 g,,Rügenwalder Mühle,...,0.600,,,,,,,,,-3.0
416549,9999900002553,http://world-en.openfoodfacts.org/product/9999900002553/chocolat-de-couverture-noir-barry,kiliweb,2018-03-21 20:59:04+00:00,2018-09-16 20:23:38+00:00,Chocolat de Couverture Noir,,100 g,,Barry,...,0.012,,,,,,,,,22.0
416550,9999991149090,http://world-en.openfoodfacts.org/product/9999991149090/riz-parfume-king-elephant,kiliweb,2018-02-20 17:07:29+00:00,2018-12-20 20:51:04+00:00,Riz parfumé,,,,King Elephant,...,0.000,,,,,,,,,0.0


The dataset is loaded in a pandas DataFrame, a type of object described in the [pandas intro tutorial 01](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html).

By default, pandas displays the column headers, the first five and last five rows with their row index, and the total number of rows and columns.

How many rows and columns does the table contain in total?
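
If you prefer to get these numbers programmatically, the `shape` attribute of a DataFrame holds them. A minimal sketch on an invented toy table:

```python
import pandas as pd

# Invented toy table with 3 products and 2 fields
toy = pd.DataFrame({
    "code": ["00000017", "00000024", "00000031"],
    "product_name": ["Example spread", "Example rice", "Example jam"],
})

# shape is a (number of rows, number of columns) tuple
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```

The same attribute works on `df`.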

You can display the first `n` entries of a DataFrame with the DataFrame method [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), and the last `n` entries with [tail](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail).

Remember that methods are attached to an object, and are called with the dot notation.

In [8]:
# display the first entry
df.head(1)

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2020-01-18 19:26:31+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,,Bovetti,...,0.004,,,,,,,,,23.0


Display the first 7 entries.

In [9]:
df.head(7)

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2020-01-18 19:26:31+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,,Bovetti,...,0.004,,,,,,,,,23.0
1,0000159487776,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,usda-ndb-import,2017-03-09 16:01:56+00:00,2020-04-22 20:31:56+00:00,"Milkyway, magic stars chocolates",,,,Milkyway,...,,,,,,,,,,
2,0000204286484,http://world-en.openfoodfacts.org/product/0000204286484/mehrkomponeneten-protein-90-c6-haselnuss-allfitnes...,allfitnessfactory-de,2016-12-30 12:12:46+00:00,2017-03-24 16:39:27+00:00,Mehrkomponeneten Protein 90 C6 Haselnuß,Mehrkomponeneten Protein in Haselnuß Geschmack,"2,5 kg",bucket,allfitnessfactory.de,...,,,,,,,,,,
3,0000250632969,http://world-en.openfoodfacts.org/product/0000250632969/mehrkomponeneten-protein-90-c6-banane-allfitnessfa...,allfitnessfactory-de,2017-01-13 07:30:12+00:00,2017-03-24 16:42:57+00:00,Mehrkomponeneten Protein 90 C6 Banane,Mehrkomponeneten Protein in Bananen Geschmack,"2,5 kg",bucket,allfitnessfactory.de,...,,,,,,,,,,
4,0000460938714,http://world-en.openfoodfacts.org/product/0000460938714/100-soja-protein-haselnuss-allfitnessfactory-de,allfitnessfactory-de,2016-12-30 11:39:50+00:00,2017-03-24 16:47:58+00:00,100% Soja Protein Haselnuss,100% Soja Protein Haselnuss Geschmack,2 kg,bucket,allfitnessfactory.de,...,,,,,,,,,,
5,0000470322800,http://world-en.openfoodfacts.org/product/0000470322800/whey-protein-aus-molke-vanilla-allfitnessfactory-de,allfitnessfactory-de,2017-01-13 10:22:12+00:00,2017-03-24 16:47:32+00:00,Whey Protein aus Molke Vanilla,Whey Protein aus Molke Vanille Geschmack,2000g,can,allfitnessfactory.de,...,0.484632,,,,,,,,,
6,0000501050603,http://world-en.openfoodfacts.org/product/0000501050603/whey-protein-aus-molke-1000-gramm-vanilla-allfitne...,allfitnessfactory-de,2017-01-13 10:12:31+00:00,2017-03-24 16:44:38+00:00,Whey Protein aus Molke 1000 Gramm Vanilla,Whey Protein aus Molke 1000 Gramm Vanille Geschmack,1000g,bag,allfitnessfactory.de,...,0.484632,,,,,,,,,


Display the last 3 entries.

In [10]:
df.tail(3)

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
416549,9999900002553,http://world-en.openfoodfacts.org/product/9999900002553/chocolat-de-couverture-noir-barry,kiliweb,2018-03-21 20:59:04+00:00,2018-09-16 20:23:38+00:00,Chocolat de Couverture Noir,,100 g,,Barry,...,0.012,,,,,,,,,22.0
416550,9999991149090,http://world-en.openfoodfacts.org/product/9999991149090/riz-parfume-king-elephant,kiliweb,2018-02-20 17:07:29+00:00,2018-12-20 20:51:04+00:00,Riz parfumé,,,,King Elephant,...,0.0,,,,,,,,,0.0
416551,9999999175305,http://world-en.openfoodfacts.org/product/9999999175305/erdbeerkuchen-1019g-tiefgefroren-coppenrath-wiese,sil,2019-12-22 08:13:01+00:00,2020-08-04 09:24:05+00:00,Erdbeerkuchen 1019g tiefgefroren,,"1,019 kg","Kunststoff,Styropor",Coppenrath & Wiese,...,0.112,,,,,,,,,12.0


**Hint** Some URLs are longer than the maximal displayed text length for a cell (50 characters by default in pandas, raised to 110 at the top of this notebook), so they are truncated with `...`. This makes it harder for you to consult the product page on the OFF website.
You can use the `values` attribute to get the complete array of values for a (subset of a) DataFrame, or of a column (Series).

In [11]:
# display the arrays of values of all fields for the first 2 products
# NB : each entry has 2 URLs : one for the product page, one for its (small-sized) image
df.head(2).values

array([['0000101209159',
        'http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-bovetti',
        'kiliweb', Timestamp('2018-02-22 10:56:57+0000', tz='UTC'),
        Timestamp('2020-01-18 19:26:31+0000', tz='UTC'),
        'Véritable pâte à tartiner noisettes chocolat noir', <NA>,
        '350 g', <NA>, 'Bovetti',
        'Spreads,Breakfasts,Sweet spreads,fr:Pâtes à tartiner,Hazelnut spreads,Chocolate spreads,Cocoa and hazelnuts spreads',
        <NA>, <NA>, 'No gluten,No palm oil', <NA>, <NA>, <NA>, <NA>,
        'France', <NA>, <NA>, <NA>, <NA>, nan, <NA>, <NA>, <NA>, <NA>,
        <NA>, <NA>, <NA>, 23, 'e', nan, 'Sugary snacks', 'Sweets',
        'To be completed,Nutrition facts completed,Ingredients to be completed,Expiration date to be completed,Packaging code to be completed,Characteristics to be completed,Categories completed,Brands completed,Packaging to be completed,Quantity completed,Product name completed,Photos val
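
When only one value is of interest, reading a single cell avoids printing the whole array. The sketch below uses a one-row toy DataFrame built from the first product shown above; `loc`, used here to read the cell, is covered in the section on selecting subsets.

```python
import pandas as pd

# One-row toy DataFrame built from the first product displayed above
toy = pd.DataFrame({
    "code": ["0000101209159"],
    "url": ["http://world-en.openfoodfacts.org/product/0000101209159/"
            "veritable-pate-a-tartiner-noisettes-chocolat-noir-bovetti"],
})

# a single cell, selected by row label and column name, is a plain Python string
full_url = toy.loc[0, "url"]
print(full_url)

# alternatively, lift the display limit temporarily for one display
with pd.option_context("display.max_colwidth", None):
    print(toy["url"])
```

The string printed by the first `print` is never truncated, whatever the display options are.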

### About the data table

We can display a summary of the DataFrame with `info`, including for each column its index, name, number of non-null values, and data type (`dtype`).
For more information, you can read the [pandas intro tutorial 02](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html).

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 416552 entries, 0 to 416551
Data columns (total 66 columns):
 #   Column                                      Non-Null Count   Dtype              
---  ------                                      --------------   -----              
 0   code                                        416552 non-null  string             
 1   url                                         416552 non-null  string             
 2   creator                                     416551 non-null  category           
 3   created_datetime                            416552 non-null  datetime64[ns, UTC]
 4   last_modified_datetime                      416552 non-null  datetime64[ns, UTC]
 5   product_name                                416552 non-null  string             
 6   generic_name                                99355 non-null   string             
 7   quantity                                    286295 non-null  string             
 8   packaging               

`info` also displays the memory usage of the DataFrame.
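
Each piece of this summary is also available separately, which is handy when you want to reuse the numbers. A minimal sketch on an invented toy table (column names mimic those of the dataset):

```python
import pandas as pd

# Invented toy table with some missing values
toy = pd.DataFrame({
    "code": pd.Series(["00000017", "00000024", "00000031"], dtype="string"),
    "generic_name": pd.Series(["Hazelnut spread", None, None], dtype="string"),
    "sugars_100g": [57.0, None, 0.5],
})

print(toy.dtypes)                    # data type of each column
print(toy.notna().sum())             # number of non-null values per column
print(toy.memory_usage(deep=True))   # memory usage in bytes, per column
```

Here `deep=True` asks pandas to measure the actual size of the text values rather than estimate it.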

### Selecting subsets

One of the fundamental operations on DataFrames is filtering the dataset on a certain condition, to keep only certain rows or columns.

The basic operators for selection are square brackets `[]`, `loc` and `iloc`. You can select rows or columns by their position or label, or with a conditional expression on values; see the [pandas intro tutorial 03](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html).
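
The three operators can be compared on a small invented table. The row labels 10, 20 and 30 are chosen on purpose so that labels and positions differ.

```python
import pandas as pd

# Invented toy table; the index labels differ from the row positions
toy = pd.DataFrame(
    {
        "product_name": ["Spread A", "Rice B", "Jam C"],
        "nutriscore_grade": ["e", "b", "d"],
        "sugars_100g": [57.0, 0.5, 45.0],
    },
    index=[10, 20, 30],
)

sugars = toy["sugars_100g"]              # [] with a column name: one column (Series)
by_label = toy.loc[20, "product_name"]   # loc: row label, column label
by_position = toy.iloc[0, 0]             # iloc: row position, column position

# loc also accepts a condition on rows and a list of columns
sweet = toy.loc[toy["sugars_100g"] > 20, ["product_name", "sugars_100g"]]
print(by_label, by_position)
print(sweet)
```

On `df`, whose row labels run from 0 upwards, labels and positions coincide, which can hide the difference between `loc` and `iloc`.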

Filter rows in `df` to keep only products with Nutri-Score 'a'.


In [13]:
df_nutri_a = df[df["nutriscore_grade"] == "a"]
df_nutri_a

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
17,0000870000001,http://world-en.openfoodfacts.org/product/0000870000001/jugo-de-hierba-de-trigo-en-polvo-saludviva,someonefromtheuniverse,2020-02-18 20:55:02+00:00,2020-05-19 15:27:51+00:00,Jugo de Hierba de Trigo en Polvo,,100g,,SaludViva,...,0.1120,,,,,,,,,-1.0
27,0001130000007,http://world-en.openfoodfacts.org/product/0001130000007/rosa-de-mosqueta-en-polvo-saludviva,kiliweb,2020-02-24 19:44:40+00:00,2021-01-01 19:46:34+00:00,Rosa de Mosqueta en polvo,,,,saludviva,...,0.0016,,,,,,,,,-2.0
32,0002000000288,http://world-en.openfoodfacts.org/product/0002000000288/melange-rando-les-accents-du-soleil,kiliweb,2020-05-06 06:46:15+00:00,2020-05-06 06:57:47+00:00,Melange Rando,Mélange Fruits Secs,125 g,Sachet,Les Accents du Soleil,...,0.0170,,,,,,,,,-5.0
34,0002000000714,http://world-en.openfoodfacts.org/product/0002000000714/yaourt-nature-brebis-la-bergerie,kiliweb,2018-02-12 12:49:20+00:00,2019-03-14 10:40:02+00:00,Yaourt nature brebis,,,,La Bergerie,...,0.0520,,,,,,,,,-3.0
94,0009300000383,http://world-en.openfoodfacts.org/product/0009300000383/hamburger-dill-chips-pickles-mt-olive,usda-ndb-import,2017-03-09 11:58:01+00:00,2021-03-06 09:32:42+00:00,Hamburger dill chips pickles,,453.6 g,Frasco de Cristal,Mt. Olive,...,0.2900,,,,,,,,,-2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416518,99003333,http://world-en.openfoodfacts.org/product/99003333/betteraves-rouges-saint-eloi,kiliweb,2018-08-30 18:00:12+00:00,2020-10-08 11:45:07+00:00,Betteraves rouges,,,,Saint Eloi,...,0.0560,,,,,,,,,-1.0
416535,99446666,http://world-en.openfoodfacts.org/product/99446666/cafe-cappuccino-dolce-gusto,kiliweb,2019-01-26 17:56:40+00:00,2020-10-20 15:37:09+00:00,Café cappuccino,,,,Dolce Gusto,...,0.0280,,,,,,,,,-1.0
416536,9950014911001,http://world-en.openfoodfacts.org/product/9950014911001/oignons-jaunes-40-60-ferme-de-l-artois,kiliweb,2018-02-04 13:32:50+00:00,2019-01-08 15:31:35+00:00,Oignons jaunes 40/60,,,,Ferme De L'artois,...,0.0800,,,,,,,,,-11.0
416544,9991111111154,http://world-en.openfoodfacts.org/product/9991111111154/compote-a-boire-pomme-poire-la-ferme-de-coutance,kiliweb,2018-07-13 09:23:55+00:00,2018-07-13 09:53:19+00:00,Compote à Boire Pomme Poire,,,"carton,plastique",La Ferme de Coutance,...,0.0000,,,,,,,,,-3.0


You should have 56260 entries.

Now filter rows in `df` to keep products whose quantity of sugars per 100g is higher than 20g.

In [14]:
df_sugar_gt20 = df[df["sugars_100g"] > 20]
df_sugar_gt20

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2020-01-18 19:26:31+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,,Bovetti,...,0.00400,,,,,,,,,23.0
1,0000159487776,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,usda-ndb-import,2017-03-09 16:01:56+00:00,2020-04-22 20:31:56+00:00,"Milkyway, magic stars chocolates",,,,Milkyway,...,,,,,,,,,,
12,0000790310013,http://world-en.openfoodfacts.org/product/0000790310013/sour-fruit-gummies-candy-crush,malikele,2014-01-02 17:03:07+00:00,2019-10-04 20:51:08+00:00,Sour Fruit Gummies,,3.5 oz,Plastik,Candy Crush,...,0.05080,,,,,,,,,14.0
13,0000790310020,http://world-en.openfoodfacts.org/product/0000790310020/jelly-fish-candy-crush,malikele,2014-01-02 11:21:51+00:00,2019-10-04 20:57:26+00:00,Jelly Fish,,3 oz,Plastik,Candy Crush,...,0.03048,,,,,,,,,6.0
15,0000800000002,http://world-en.openfoodfacts.org/product/0000800000002/epices-a-pain-d-epices-fortwenger,kiliweb,2018-01-03 12:54:46+00:00,2019-04-03 07:28:31+00:00,Épices à pain d'épices,,,,Fortwenger,...,0.04000,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416520,99018788,http://world-en.openfoodfacts.org/product/99018788/pain-d-epices-lea-nature-jardin-bio,openfoodfacts-contributors,2021-01-18 20:52:04+00:00,2021-05-23 09:05:43+00:00,Pain d'épices,,300 g,plastique,Léa Nature Jardin Bio,...,0.15600,,,,,,,,,
416524,9906705312602,http://world-en.openfoodfacts.org/product/9906705312602/digestive-go-manzana-y-avena-fontaneda,kiliweb,2019-11-15 09:42:59+00:00,2019-11-15 09:46:35+00:00,Digestive Go! - Manzana y avena,,,,Fontaneda,...,0.47600,,,,,,,,,20.0
416534,99440084,http://world-en.openfoodfacts.org/product/99440084/confiture-de-cerises-les-comtes-de-provence,kiliweb,2018-12-14 19:03:37+00:00,2021-05-24 07:22:54+00:00,Confiture de cerises,,,,Les Comtes de Provence,...,0.03200,,,,,,,,,10.0
416549,9999900002553,http://world-en.openfoodfacts.org/product/9999900002553/chocolat-de-couverture-noir-barry,kiliweb,2018-03-21 20:59:04+00:00,2018-09-16 20:23:38+00:00,Chocolat de Couverture Noir,,100 g,,Barry,...,0.01200,,,,,,,,,22.0


You should obtain 83891 entries.

Filter the dataset `df` to keep only the columns corresponding to the:
* barcode,
* url,
* date of creation,
* product name,
* brands,
* categories,
* ingredients text,
* main category,
* Nutri-Score grade,
* Nutri-Score score,
* Nova group.

In [15]:
df_sel_cols = df[["code", "url", "created_datetime", "product_name", "brands", "categories_en", "ingredients_text", "main_category_en", "nutriscore_grade", "nutriscore_score", "nova_group"]]
df_sel_cols

Unnamed: 0,code,url,created_datetime,product_name,brands,categories_en,ingredients_text,main_category_en,nutriscore_grade,nutriscore_score,nova_group
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,2018-02-22 10:56:57+00:00,Véritable pâte à tartiner noisettes chocolat noir,Bovetti,"Spreads,Breakfasts,Sweet spreads,fr:Pâtes à tartiner,Hazelnut spreads,Chocolate spreads,Cocoa and hazelnut...",,Cocoa and hazelnuts spreads,e,23,
1,0000159487776,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,2017-03-09 16:01:56+00:00,"Milkyway, magic stars chocolates",Milkyway,"Snacks,Sweet snacks,Cocoa and its products,Confectioneries,Chocolate candies","Sugar, cocoa butter, skimmed milk powder, cocoa mass, whey powder (from milk), lactose, milk fat, emulsifi...",Chocolate candies,,,4
2,0000204286484,http://world-en.openfoodfacts.org/product/0000204286484/mehrkomponeneten-protein-90-c6-haselnuss-allfitnes...,2016-12-30 12:12:46+00:00,Mehrkomponeneten Protein 90 C6 Haselnuß,allfitnessfactory.de,"Dietary supplements,Bodybuilding supplements,Protein powders","Proteinmischung (_Sojaprotein_, _Weizenprotein_, _Molkenprotein_, _Wheyprotein_), _Milchprotein_, _Hühnere...",Protein powders,,,4
3,0000250632969,http://world-en.openfoodfacts.org/product/0000250632969/mehrkomponeneten-protein-90-c6-banane-allfitnessfa...,2017-01-13 07:30:12+00:00,Mehrkomponeneten Protein 90 C6 Banane,allfitnessfactory.de,"Dietary supplements,Bodybuilding supplements,Protein powders","Proteinmischung (_Sojaprotein_, _Weizenprotein_, _Molkenprotein_, _Wheyprotein_), _Milchprotein_, _Hühnere...",Protein powders,,,4
4,0000460938714,http://world-en.openfoodfacts.org/product/0000460938714/100-soja-protein-haselnuss-allfitnessfactory-de,2016-12-30 11:39:50+00:00,100% Soja Protein Haselnuss,allfitnessfactory.de,"Dietary supplements,Bodybuilding supplements,Protein powders","100% Soja-Protein-Isolat (_Soja_), Aroma, Süßstoff Natrium-Saccharin.",Protein powders,,,4
...,...,...,...,...,...,...,...,...,...,...,...
416547,9999091865142,http://world-en.openfoodfacts.org/product/9999091865142/paprikas-kukorica-csemege-spar,2018-10-21 15:10:04+00:00,Paprikás Kukorica csemege,Spar,hu:extrudált-kukorica,"kukoricadara (79%), finomított napraforgó-étolaj, őrölt fűszerpaprika (1.2%), étkezési só, színezék (papri...",hu:extrudált-kukorica,d,11,4
416548,99994440,http://world-en.openfoodfacts.org/product/99994440/veganes-muhlenhack-rugenwalder-muhle,2021-02-20 16:41:45+00:00,Veganes Mühlenhack,Rügenwalder Mühle,"Plant-based foods and beverages,Plant-based foods,Meat analogues","Trinkwasser, 26% Sojaproteinkonzentrat. Branntweinessig, Rapsöl, Kochsalz, natūrliches Aroma, Gewürze, Kar...",Meat analogues,a,-3,3
416549,9999900002553,http://world-en.openfoodfacts.org/product/9999900002553/chocolat-de-couverture-noir-barry,2018-03-21 20:59:04+00:00,Chocolat de Couverture Noir,Barry,"Snacks,Sweet snacks,Cocoa and its products,Chocolates,Dark chocolates",,Dark chocolates,e,22,
416550,9999991149090,http://world-en.openfoodfacts.org/product/9999991149090/riz-parfume-king-elephant,2018-02-20 17:07:29+00:00,Riz parfumé,King Elephant,"Plant-based foods and beverages,Plant-based foods,Cereals and potatoes,Seeds,Cereals and their products,Ce...",,Aromatic rices,b,0,


### Making a selection into a proper DataFrame

You can manipulate each of these selections as a DataFrame, but behind the scenes, pandas does not guarantee whether a selection is a *view* of the original DataFrame `df` or an independent copy of its data.
The *view* mechanism avoids unnecessary copies of the dataset, but this ambiguity is problematic when we really want to extract a subset and perform some operations only on this subset.

For instance, let us select all products in `df` with sugars and fat per 100g greater than 0, and add a column with the sugars to fat ratio.

First, we need to define two filtering conditions and apply them jointly using the [boolean "and" `&`](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing).
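
Before applying it to `df`, the combination of conditions can be tried on a small invented table. Each condition yields a Series of booleans, and `&` combines them row by row.

```python
import pandas as pd

# Invented toy values, including zeros and one missing value
toy = pd.DataFrame({
    "sugars_100g": [57.0, 0.0, 12.0, None],
    "fat_100g": [30.0, 5.0, 0.0, 8.0],
})

has_sugars = toy["sugars_100g"] > 0   # a missing value compares as False
has_fat = toy["fat_100g"] > 0

both = has_sugars & has_fat     # True where both conditions hold
either = has_sugars | has_fat   # True where at least one condition holds

print(both.tolist())
print(toy[both])
```

When the conditions are written inline, each one needs its own parentheses because `&` binds more tightly than `>`; naming the conditions first, as above, avoids the issue.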

In [16]:
df_sugarsfat = df[(df["sugars_100g"] > 0) & (df["fat_100g"] > 0)]
df_sugarsfat

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2020-01-18 19:26:31+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,,Bovetti,...,0.004000,,,,,,,,,23.0
1,0000159487776,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,usda-ndb-import,2017-03-09 16:01:56+00:00,2020-04-22 20:31:56+00:00,"Milkyway, magic stars chocolates",,,,Milkyway,...,,,,,,,,,,
5,0000470322800,http://world-en.openfoodfacts.org/product/0000470322800/whey-protein-aus-molke-vanilla-allfitnessfactory-de,allfitnessfactory-de,2017-01-13 10:22:12+00:00,2017-03-24 16:47:32+00:00,Whey Protein aus Molke Vanilla,Whey Protein aus Molke Vanille Geschmack,2000g,can,allfitnessfactory.de,...,0.484632,,,,,,,,,
6,0000501050603,http://world-en.openfoodfacts.org/product/0000501050603/whey-protein-aus-molke-1000-gramm-vanilla-allfitne...,allfitnessfactory-de,2017-01-13 10:12:31+00:00,2017-03-24 16:44:38+00:00,Whey Protein aus Molke 1000 Gramm Vanilla,Whey Protein aus Molke 1000 Gramm Vanille Geschmack,1000g,bag,allfitnessfactory.de,...,0.484632,,,,,,,,,
7,0000526938306,http://world-en.openfoodfacts.org/product/0000526938306/whey-protein-aus-molke-500-gramm-vanilla-allfitnes...,allfitnessfactory-de,2017-01-13 10:03:47+00:00,2017-03-24 16:43:14+00:00,Whey Protein aus Molke 500 Gramm Vanilla,Whey Protein aus Molke 500 Gramm Vanille Geschmack,500g,bag,allfitnessfactory.de,...,0.484632,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416546,9996980313319,http://world-en.openfoodfacts.org/product/9996980313319/cup-noodles-nissin,openfoodfacts-contributors,2018-09-15 14:55:18+00:00,2019-11-03 09:45:23+00:00,cup noodles,,65g,plastique,nissin,...,0.400000,0.0,,,,,,,,
416547,9999091865142,http://world-en.openfoodfacts.org/product/9999091865142/paprikas-kukorica-csemege-spar,hunsly,2018-10-21 15:10:04+00:00,2019-11-18 22:25:49+00:00,Paprikás Kukorica csemege,extrudált kukorica,100 g,"műanyag,zacskó",Spar,...,0.384000,,,,,,,,,11.0
416548,99994440,http://world-en.openfoodfacts.org/product/99994440/veganes-muhlenhack-rugenwalder-muhle,kiliweb,2021-02-20 16:41:45+00:00,2021-02-21 09:22:46+00:00,Veganes Mühlenhack,,180 g,,Rügenwalder Mühle,...,0.600000,,,,,,,,,-3.0
416549,9999900002553,http://world-en.openfoodfacts.org/product/9999900002553/chocolat-de-couverture-noir-barry,kiliweb,2018-03-21 20:59:04+00:00,2018-09-16 20:23:38+00:00,Chocolat de Couverture Noir,,100 g,,Barry,...,0.012000,,,,,,,,,22.0


Now, let us try to add a new column with the sugars to fat ratio.

In [17]:
df_sugarsfat["sugarsfat_ratio"] = df_sugarsfat["sugars_100g"] / df_sugarsfat["fat_100g"]
df_sugarsfat["sugarsfat_ratio"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


0         0.666667
1         1.544669
5         1.304348
6         1.304348
7         1.304348
            ...   
416546    0.131579
416547    0.062500
416548    0.156250
416549    0.573561
416551    3.157895
Name: sugarsfat_ratio, Length: 300142, dtype: float64

The output seems fine, but it is preceded by a `SettingWithCopyWarning`: pandas cannot guarantee whether `df_sugarsfat` is a *view* on `df` or an independent copy, so it warns us that the assignment might not do what we intend.

To avoid this warning, we need to turn our selection into an independent dataframe, with the method `copy()`:

In [18]:
df_sugarsfat = df[(df["sugars_100g"] > 0) & (df["fat_100g"] > 0)].copy()
df_sugarsfat

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2020-01-18 19:26:31+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,,Bovetti,...,0.004000,,,,,,,,,23.0
1,0000159487776,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,usda-ndb-import,2017-03-09 16:01:56+00:00,2020-04-22 20:31:56+00:00,"Milkyway, magic stars chocolates",,,,Milkyway,...,,,,,,,,,,
5,0000470322800,http://world-en.openfoodfacts.org/product/0000470322800/whey-protein-aus-molke-vanilla-allfitnessfactory-de,allfitnessfactory-de,2017-01-13 10:22:12+00:00,2017-03-24 16:47:32+00:00,Whey Protein aus Molke Vanilla,Whey Protein aus Molke Vanille Geschmack,2000g,can,allfitnessfactory.de,...,0.484632,,,,,,,,,
6,0000501050603,http://world-en.openfoodfacts.org/product/0000501050603/whey-protein-aus-molke-1000-gramm-vanilla-allfitne...,allfitnessfactory-de,2017-01-13 10:12:31+00:00,2017-03-24 16:44:38+00:00,Whey Protein aus Molke 1000 Gramm Vanilla,Whey Protein aus Molke 1000 Gramm Vanille Geschmack,1000g,bag,allfitnessfactory.de,...,0.484632,,,,,,,,,
7,0000526938306,http://world-en.openfoodfacts.org/product/0000526938306/whey-protein-aus-molke-500-gramm-vanilla-allfitnes...,allfitnessfactory-de,2017-01-13 10:03:47+00:00,2017-03-24 16:43:14+00:00,Whey Protein aus Molke 500 Gramm Vanilla,Whey Protein aus Molke 500 Gramm Vanille Geschmack,500g,bag,allfitnessfactory.de,...,0.484632,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416546,9996980313319,http://world-en.openfoodfacts.org/product/9996980313319/cup-noodles-nissin,openfoodfacts-contributors,2018-09-15 14:55:18+00:00,2019-11-03 09:45:23+00:00,cup noodles,,65g,plastique,nissin,...,0.400000,0.0,,,,,,,,
416547,9999091865142,http://world-en.openfoodfacts.org/product/9999091865142/paprikas-kukorica-csemege-spar,hunsly,2018-10-21 15:10:04+00:00,2019-11-18 22:25:49+00:00,Paprikás Kukorica csemege,extrudált kukorica,100 g,"műanyag,zacskó",Spar,...,0.384000,,,,,,,,,11.0
416548,99994440,http://world-en.openfoodfacts.org/product/99994440/veganes-muhlenhack-rugenwalder-muhle,kiliweb,2021-02-20 16:41:45+00:00,2021-02-21 09:22:46+00:00,Veganes Mühlenhack,,180 g,,Rügenwalder Mühle,...,0.600000,,,,,,,,,-3.0
416549,9999900002553,http://world-en.openfoodfacts.org/product/9999900002553/chocolat-de-couverture-noir-barry,kiliweb,2018-03-21 20:59:04+00:00,2018-09-16 20:23:38+00:00,Chocolat de Couverture Noir,,100 g,,Barry,...,0.012000,,,,,,,,,22.0


The selection is the same as before. 

In [19]:
df_sugarsfat["sugarsfat_ratio"] = df_sugarsfat["sugars_100g"] / df_sugarsfat["fat_100g"]
df_sugarsfat["sugarsfat_ratio"]

0         0.666667
1         1.544669
5         1.304348
6         1.304348
7         1.304348
            ...   
416546    0.131579
416547    0.062500
416548    0.156250
416549    0.573561
416551    3.157895
Name: sugarsfat_ratio, Length: 300142, dtype: float64

The ratios are the same, but we got rid of the big warning, so we must be doing things the *right* way.

We will not go further and we certainly do not expect you to master the [difference between a view and a copy](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy), but at least now you know that if you encounter the big scary warning, you probably need to `copy()` your selection of rows.
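
To make the pattern easier to remember, here is a minimal sketch on a toy DataFrame. The values are made up and the names `toy`, `mask` and `toy_sugarsfat` are only for illustration. It combines two conditions with `&`, copies the selection, and checks that the original table is left untouched.

```python
import pandas as pd

# Toy data (made-up values, not taken from Open Food Facts)
toy = pd.DataFrame(
    {
        "product": ["spread", "water", "biscuit", "oil"],
        "sugars_100g": [50.0, 0.0, 30.0, 0.0],
        "fat_100g": [25.0, 0.0, 15.0, 100.0],
    }
)

# Each condition needs its own parentheses when combined with &
mask = (toy["sugars_100g"] > 0) & (toy["fat_100g"] > 0)

# copy() makes the selection an independent DataFrame
toy_sugarsfat = toy[mask].copy()

# Adding a column to the copy does not trigger the warning
toy_sugarsfat["sugarsfat_ratio"] = (
    toy_sugarsfat["sugars_100g"] / toy_sugarsfat["fat_100g"]
)
print(toy_sugarsfat)

# The original DataFrame did not gain the new column
print("sugarsfat_ratio" in toy.columns)  # False
```

Only the rows where both conditions hold ("spread" and "biscuit") end up in the copy.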

### Renaming columns

Column names are not always ideal, either because they are not transparent (it is hard for you or an external user to understand what they stand for) or because they would look bad if they were used directly to label the axes of a data visualization.

pandas provides means to rename columns, see the [pandas intro tutorial 05](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html).

Rename all the columns ending with `_en` (code for English) to drop this suffix, e.g. `main_category_en` to `main_category`.

In [20]:
# select the column names that end with _en
cols_en = [x for x in df.columns if x.endswith("_en")]
cols_en

['categories_en',
 'origins_en',
 'labels_en',
 'countries_en',
 'traces_en',
 'additives_en',
 'states_en',
 'main_category_en']

In [21]:
df_ren_en = df.rename(
    columns={
        "categories_en": "categories",
        "origins_en": "origins",
        "labels_en": "labels",
        "countries_en": "countries",
        "traces_en": "traces",
        "additives_en": "additives",
        "states_en": "states",
        "main_category_en": "main_category"
    }
)
df_ren_en.columns

Index(['code', 'url', 'creator', 'created_datetime', 'last_modified_datetime',
       'product_name', 'generic_name', 'quantity', 'packaging', 'brands',
       'categories', 'origins', 'manufacturing_places', 'labels', 'emb_codes',
       'emb_codes_tags', 'purchase_places', 'stores', 'countries',
       'ingredients_text', 'allergens', 'traces', 'serving_size',
       'serving_quantity', 'additives_n', 'additives_tags', 'additives',
       'ingredients_from_palm_oil_n', 'ingredients_from_palm_oil_tags',
       'ingredients_that_may_be_from_palm_oil_n',
       'ingredients_that_may_be_from_palm_oil_tags', 'nutriscore_score',
       'nutriscore_grade', 'nova_group', 'pnns_groups_1', 'pnns_groups_2',
       'states', 'brand_owner', 'ecoscore_score_fr', 'ecoscore_grade_fr',
       'main_category', 'image_small_url', 'energy-kj_100g',
       'energy-kcal_100g', 'energy_100g', 'fat_100g', 'saturated-fat_100g',
       'monounsaturated-fat_100g', 'polyunsaturated-fat_100g',
       'trans-fat_

In [22]:
# alternative, shorter but more advanced
df_ren_en_bis = df.rename(
    columns={x: x[:-3] for x in df.columns if x.endswith("_en")}
)
df_ren_en_bis.columns

Index(['code', 'url', 'creator', 'created_datetime', 'last_modified_datetime',
       'product_name', 'generic_name', 'quantity', 'packaging', 'brands',
       'categories', 'origins', 'manufacturing_places', 'labels', 'emb_codes',
       'emb_codes_tags', 'purchase_places', 'stores', 'countries',
       'ingredients_text', 'allergens', 'traces', 'serving_size',
       'serving_quantity', 'additives_n', 'additives_tags', 'additives',
       'ingredients_from_palm_oil_n', 'ingredients_from_palm_oil_tags',
       'ingredients_that_may_be_from_palm_oil_n',
       'ingredients_that_may_be_from_palm_oil_tags', 'nutriscore_score',
       'nutriscore_grade', 'nova_group', 'pnns_groups_1', 'pnns_groups_2',
       'states', 'brand_owner', 'ecoscore_score_fr', 'ecoscore_grade_fr',
       'main_category', 'image_small_url', 'energy-kj_100g',
       'energy-kcal_100g', 'energy_100g', 'fat_100g', 'saturated-fat_100g',
       'monounsaturated-fat_100g', 'polyunsaturated-fat_100g',
       'trans-fat_
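
A third option is to pass a function to `rename` instead of a dictionary, sketched here on a toy DataFrame with made-up values. The helper name `drop_en_suffix` is ours and not part of pandas. pandas calls the function once for each column name.

```python
import pandas as pd

# Toy DataFrame: column names borrowed from the dataset, values made up
toy = pd.DataFrame(
    {
        "code": ["001", "002"],
        "categories_en": ["Spreads", "Beverages"],
        "main_category_en": ["Spreads", "Beverages"],
    }
)


def drop_en_suffix(name):
    """Return the column name without its trailing '_en', if it has one."""
    return name[:-3] if name.endswith("_en") else name


# rename() returns a new DataFrame; toy keeps its original column names
toy_ren = toy.rename(columns=drop_en_suffix)
print(list(toy_ren.columns))  # ['code', 'categories', 'main_category']
```

Whichever variant you choose, `rename` leaves `df` unchanged unless you assign the result to a variable.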

### Summary statistics

You can compute various summary statistics that depend on the type of variable in each column, see the [pandas intro tutorial 06](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html).

Compute summary statistics for several columns of different types, and for combinations of columns that could provide interesting insights.

For instance, compute the means of the nutritional values for :
* fat,
* saturated fat,
* sugars,
* salt.

In [23]:
# compute the means
df[["fat_100g", "saturated-fat_100g", "sugars_100g", "salt_100g"]].mean()

fat_100g              14.184523
saturated-fat_100g     5.331667
sugars_100g           12.815801
salt_100g              1.500613
dtype: float64
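
Many nutritional values are missing in a crowdsourced dataset, so it helps to know how many values each mean is based on. The sketch below uses a toy DataFrame with made-up values. It shows that `mean()` skips missing values by default, that `count()` returns the number of non-missing values, and that `agg()` computes several statistics at once.

```python
import numpy as np
import pandas as pd

# Toy data with missing values (made-up numbers)
toy = pd.DataFrame(
    {
        "fat_100g": [10.0, 20.0, np.nan, 30.0],
        "sugars_100g": [5.0, np.nan, np.nan, 15.0],
    }
)

# mean() ignores missing values by default (skipna=True)
means = toy.mean()
print(means)

# count() gives the number of non-missing values per column
counts = toy.count()
print(counts)

# agg() computes several statistics in one call
print(toy.agg(["mean", "median", "min", "max"]))
```

Here the mean of `fat_100g` is computed over 3 values and the mean of `sugars_100g` over only 2.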

### Computing on columns

You can manipulate columns in various ways, including with operations that apply element-wise as we saw for NumPy arrays in the first notebook.

You can, for instance, subtract the mean value of a column from each value.

In [24]:
# subtract the mean value of fat in the dataset from each value of 'fat'
df['fat_100g'] - df['fat_100g'].mean()

0         33.815477
1         20.515477
2        -13.184523
3        -13.184523
4        -13.684523
            ...    
416547     9.815477
416548   -10.984523
416549    32.715477
416550   -13.684523
416551    -6.584523
Name: fat_100g, Length: 416552, dtype: float64
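
Element-wise operations can be chained. As an illustration on a toy Series with made-up values, the sketch below centers the values on their mean, then divides by the standard deviation to obtain standardized scores. This second step is our addition and not part of the exercise above.

```python
import pandas as pd

# Toy Series (made-up values)
fat = pd.Series([10.0, 20.0, 30.0], name="fat_100g")

# Subtract the mean from each value (element-wise)
centered = fat - fat.mean()
print(centered.tolist())  # [-10.0, 0.0, 10.0]

# Divide by the standard deviation; std() is the sample
# standard deviation (ddof=1) by default in pandas
zscores = centered / fat.std()
print(zscores.tolist())
```

The scalar returned by `mean()` or `std()` is applied to every element of the Series, just like with NumPy arrays.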

### Sorting data

The entries are sorted by barcode.
We might find it easier to understand the dataset if we sort entries by another criterion.

Sort entries by brand, following the [pandas intro tutorial 07](https://pandas.pydata.org/docs/getting_started/intro_tutorials/07_reshape_table_layout.html).

In [25]:
df_sort_brands = df.sort_values(by="brands")
df_sort_brands

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
902,0013562302239,http://world-en.openfoodfacts.org/product/0013562302239/annie-s-whole-wheat-bunnies-baked-snack-crackers-m...,bori,2015-07-02 01:42:07+00:00,2021-02-02 20:00:45+00:00,"Annie's Whole Wheat Bunnies Baked Snack Crackers, Made with Organic Wheat",,7 servings,,Annie's,...,0.83300,,0.0,0.0000,,,0.000,0.00120,,10.0
26587,0856463002002,http://world-en.openfoodfacts.org/product/0856463002002/core-meal-hearty-oatmeal-to-go-almond-raisin,bori,2015-07-02 02:56:19+00:00,2020-04-22 19:33:53+00:00,"Core, meal, hearty oatmeal to go, almond raisin",,1 serving,,"Core Meal, Core Method",...,0.02400,,0.0,0.0000,,,0.118,0.00212,,-2.0
6656,0043182000703,http://world-en.openfoodfacts.org/product/0043182000703/organic-mashed-potatoes-edward-and-sons,bori,2015-07-07 02:49:56+00:00,2020-04-22 16:59:09+00:00,Organic Mashed Potatoes,,4,,"Edward and Sons, Edward & Sons",...,0.72000,,0.0,0.0720,,,0.000,0.00000,,6.0
15481,0099482443436,http://world-en.openfoodfacts.org/product/0099482443436/engine-2-plant-strong-rip-s-big-bowl-triple-berry-...,bori,2015-07-02 03:07:41+00:00,2020-04-22 16:42:19+00:00,"Engine 2, plant-strong, rip's big bowl triple berry walnut",,7,,Engine 2,...,0.09100,,0.0,0.0109,,,0.036,0.00327,,-5.0
22953,0708953602011,http://world-en.openfoodfacts.org/product/0708953602011/organic-forbidden-rice-ramen-lotus-foods,bori,2015-07-07 02:52:46+00:00,2020-04-22 16:22:40+00:00,Organic Forbidden Rice Ramen,,8,,Lotus Foods,...,0.00000,,0.0,0.0000,,0.4,0.000,0.00103,,-4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406604,8803560000143,http://world-en.openfoodfacts.org/product/8803560000143/%EC%8C%80%EB%96%A1-%EC%86%A1%ED%95%99,woshilapin,2015-04-26 11:07:11+00:00,2018-05-29 08:30:13+00:00,쌀떡,,500 g,"Sachet,Plastique",송학,...,0.07112,,,,,,,,,-4.0
406573,8801117536916,http://world-en.openfoodfacts.org/product/8801117536916/%EC%98%A4%EB%9C%A8-%ED%94%84%EB%A1%9C%EB%A7%88%EC%...,chemy,2016-07-04 01:55:08+00:00,2017-11-17 06:00:01+00:00,오뜨 프로마즈,,240g,box,오리온,...,,,,,,,,,,
406609,8805713304023,http://world-en.openfoodfacts.org/product/8805713304023/%ED%95%9C%EC%82%B4%EB%A6%BC-%ED%98%B8%EB%B0%95%EC%...,openfoodfacts-contributors,2020-05-21 02:10:46+00:00,2020-05-21 02:15:13+00:00,한살림 호박쌀엿,,100g,plastic,한살림,...,2.33680,,,,,,,,,10.0
13996,0079200009373,http://world-en.openfoodfacts.org/product/0079200009373/fun-dip-%F0%9F%A4%A9%F0%9F%A4%A9%F0%9F%A4%A9,halal-app-chakib,2021-03-29 01:33:04+00:00,2021-04-25 20:18:29+00:00,Fun dip,,,ksiskek,🤩🤩🤩,...,0.00000,,,,,,,,,14.0


Let us look at the first entries, when sorted by brands.

In [26]:
df_sort_brands["brands"].head(20)

902                               Annie's
26587              Core Meal, Core Method
6656       Edward and Sons, Edward & Sons
15481                            Engine 2
22953                         Lotus Foods
294010                               ifri
51329               le verger des fruits 
227880                              !NARA
242806                       #männerglück
243671                             #sinob
243674                             #sinob
243675                             #sinob
243672                             #sinob
243679                             #sinob
243676                             #sinob
243680                             #sinob
243681          #sinob core,BlackLine 2.0
31415                        &all,アンド・オール
254153                &quot;Fruttis&quot;
210407      &quot;LB Bulgaricum&quot; PLC
Name: brands, dtype: string

Oddly, only the first few lines have brand names that start with a letter; after them, brand names start with a special character (`!` or `#`).
This is unexpected, because special characters sort before letters and should therefore appear first.

What happened here ? Let us have a better look at the *values* in `brands`.

In [27]:
df_sort_brands["brands"].head(20).values

<StringArray>
[                       " Annie's",         ' Core Meal, Core Method',
 ' Edward and Sons, Edward & Sons',                       ' Engine 2',
                    ' Lotus Foods',                           ' ifri',
          ' le verger des fruits ',                           '!NARA',
                    '#männerglück',                          '#sinob',
                          '#sinob',                          '#sinob',
                          '#sinob',                          '#sinob',
                          '#sinob',                          '#sinob',
       '#sinob core,BlackLine 2.0',                    '&all,アンド・オール',
             '&quot;Fruttis&quot;',   '&quot;LB Bulgaricum&quot; PLC']
Length: 20, dtype: string

In the first entries, the `brands` value starts with a whitespace.
This explains why they were sorted before the entries whose `brands` start with a special character.

Brand names rarely (if ever) start with a whitespace, hence we can assume that whoever added these products made a typing error.

⚠ Datasets contain all sorts of errors and oddities. Datasets released by public agencies or big actors are usually cleaner than crowdsourced datasets, but you should always be cautious.

To confirm our hypothesis and check whether the entries are properly sorted, we can use `iloc` to retrieve entries at arbitrary positions in the DataFrame.

For instance, let us check the entries at positions 4881 to 4899 (the end of the slice, 4900, is excluded).

In [28]:
df_sort_brands["brands"].iloc[4881:4900]

373175                   Alba
373169                   Alba
373170                   Alba
373171                   Alba
373172                   Alba
373163                   Alba
373173                   Alba
373166                   Alba
373158                   Alba
373162                   Alba
373155                   Alba
373161                   Alba
373160                   Alba
373157                   Alba
336666    Alba torri e sapori
6933                 Albacore
172373               Albacore
291490               Albalact
291497               Albalact
Name: brands, dtype: string

The sorted brands are `Alba`, `Alba torri e sapori`, `Albacore` then `Albalact`, which is what we were expecting.
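
If the leading whitespace gets in the way of an analysis, one possible fix is to strip it before sorting. The sketch below works on a toy Series that reuses a few brand values seen above and assumes nothing else about the dataset. The `key` argument of `sort_values` requires pandas 1.1 or later.

```python
import pandas as pd

# A few brand values seen above, two of them with a leading space
brands = pd.Series([" Annie's", "#sinob", "Alba", " ifri", "!NARA"])

# Plain sort: the space character sorts before '!' and '#'
plain = brands.sort_values()
print(plain.tolist())

# Strip leading and trailing whitespace, then sort
brands_clean = brands.str.strip()
cleaned = brands_clean.sort_values()
print(cleaned.tolist())

# Alternative: keep the values as they are, but sort on a cleaned,
# lowercased version of them (case-insensitive order)
by_key = brands.sort_values(key=lambda s: s.str.strip().str.lower())
print(by_key.tolist())
```

Stripping changes the stored values, whereas `key` only changes the order, so the choice depends on whether you want to clean the data or just display it in a sensible order.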

Sort entries by the Nutri-Score grade.

In [29]:
df_sort_nsgrade = df.sort_values(by="nutriscore_grade")
df_sort_nsgrade

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
160032,3560070446292,http://world-en.openfoodfacts.org/product/3560070446292/mais-doux-carrefour,carrefour,2019-01-27 18:33:45+00:00,2021-03-28 20:08:55+00:00,Mais doux,Mais doux en grains sous vide,600 g,,Carrefour,...,0.21200,,,,,,,,,-1.0
119262,3289131100095,http://world-en.openfoodfacts.org/product/3289131100095/ananas-en-morceaux-la-pulpe,kiliweb,2018-02-13 09:13:43+00:00,2021-03-14 08:33:39+00:00,Ananas en morceaux,,,,La Pulpe,...,0.00000,,,,,,,,,-3.0
49765,3038359001512,http://world-en.openfoodfacts.org/product/3038359001512/penne-rigate-offre-economique-panzani,date-limite-app,2015-02-28 17:49:42+00:00,2021-02-23 00:17:29+00:00,Penne Rigate (offre économique),Penne Rigate (offre économique),500 g,"sachet,plastique",Panzani,...,0.00520,,,,,,,,,-4.0
49766,3038359001567,http://world-en.openfoodfacts.org/product/3038359001567/le-risotto-a-poeler-champignons-lustucru,openfoodfacts-contributors,2015-03-03 17:11:18+00:00,2020-09-06 06:19:52+00:00,Le Risotto à Poêler Champignons,,350 g,Plastique,Lustucru,...,0.03200,,,,,,,,,-1.0
119264,3289131217120,http://world-en.openfoodfacts.org/product/3289131217120/segments-de-pamplemousse-au-sirop-leger-la-pulpe,kiliweb,2018-10-21 12:11:10+00:00,2019-01-09 11:03:28+00:00,Segments de pamplemousse au sirop léger,,,,La Pulpe,...,0.00400,,,,,,,,,-2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416537,9955100109478,http://world-en.openfoodfacts.org/product/9955100109478/cube-mendiants-maison-gaucher,serayet,2018-12-07 20:43:21+00:00,2018-12-07 20:55:16+00:00,cube mendiants,,200 g,cube plastique,maison Gaucher,...,,,,,,,,,,
416539,9961193410131,http://world-en.openfoodfacts.org/product/9961193410131/minced-pork-avatar-meat-avatar,openfoodfacts-contributors,2021-07-27 16:42:36+00:00,2021-07-27 17:14:13+00:00,Minced Pork Avatar,,240g,"Plastic,paper",Meat Avatar,...,,,,,,,,,,
416541,99885434,http://world-en.openfoodfacts.org/product/99885434/raviolis-pekinois-surgeles-asia-food,openfoodfacts-contributors,2019-07-08 10:52:02+00:00,2019-07-18 07:28:24+00:00,raviolis pekinois surgelés,,3800 g,,asia food,...,0.41656,,,,,,,,,
416545,99911522,http://world-en.openfoodfacts.org/product/99911522/chipolatas-casino,alm1412,2021-07-04 10:30:44+00:00,2021-07-04 10:41:08+00:00,Chipolatas,,6,barquette,Casino,...,,,,,,,,,,


Let us check the first 20 entries.

In [30]:
df_sort_nsgrade["nutriscore_grade"].head(20)

160032    a
119262    a
49765     a
49766     a
119264    a
119265    a
386381    a
325358    a
49772     a
49773     a
49774     a
49776     a
386380    a
49779     a
49780     a
49781     a
49782     a
49783     a
49784     a
49785     a
Name: nutriscore_grade, dtype: category
Categories (5, object): ['a' < 'b' < 'c' < 'd' < 'e']

The entries with Nutri-Score grade 'a' are ranked first, as expected.
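
The sort follows the declared order of the categories (`'a' < 'b' < 'c' < 'd' < 'e'`), and `sort_values` places missing values last by default. Here is a minimal sketch on a toy Series with made-up grades. The name `grade_type` is ours.

```python
import pandas as pd

# Toy grades (made up), including one missing value
grades = pd.Series(["c", None, "a", "e", "b"])

# Declare an ordered categorical type, as in the output above
grade_type = pd.CategoricalDtype(
    categories=["a", "b", "c", "d", "e"], ordered=True
)
grades = grades.astype(grade_type)

# Sorting follows the category order; missing values go last
# (the default is na_position="last")
sorted_grades = grades.sort_values()
print(sorted_grades)
```

Passing `na_position="first"` to `sort_values` would list the products without a grade first instead.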

Sort entries by the Nutri-Score grade and Nova group (together).

In [31]:
df_sort_nsgrade_novagroup = df.sort_values(by=["nutriscore_grade", "nova_group"])
df_sort_nsgrade_novagroup

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
32,0002000000288,http://world-en.openfoodfacts.org/product/0002000000288/melange-rando-les-accents-du-soleil,kiliweb,2020-05-06 06:46:15+00:00,2020-05-06 06:57:47+00:00,Melange Rando,Mélange Fruits Secs,125 g,Sachet,Les Accents du Soleil,...,0.01700,,,,,,,,,-5.0
185,0010200231005,http://world-en.openfoodfacts.org/product/0010200231005/old-fashioned-stone-ground-yellow-corn-meal-wilkin...,usda-ndb-import,2017-03-09 12:09:40+00:00,2020-07-13 03:01:35+00:00,Old Fashioned Stone Ground Yellow Corn Meal,,32 oz,,Wilkins Rogers Mills,...,0.00000,,,,0.000333,0.233,,0.00300,,-3.0
186,0010248765135,http://world-en.openfoodfacts.org/product/0010248765135/yemina-semilla-de-melon,openfoodfactsmx4,2018-11-09 19:59:29+00:00,2018-11-09 20:05:37+00:00,Yemina Semilla de melón,Pasta de sémola de trigo duro adicionada con vitaminas y hierro,200 g,Bolsa de plastico,Yemina,...,0.00508,,,,62.500000,,,0.00450,,-6.0
190,0010300343257,http://world-en.openfoodfacts.org/product/0010300343257/natural-almonds-emerald,usda-ndb-import,2017-03-09 12:44:57+00:00,2020-04-22 18:05:35+00:00,Natural Almonds,,,,Emerald,...,0.00000,,0.0,0.0,,0.743,0.286,0.00343,,-5.0
253,0011110043573,http://world-en.openfoodfacts.org/product/0011110043573/raw-almonds-simple-truth,kiliweb,2020-03-02 14:17:15+00:00,2020-04-23 21:07:06+00:00,Raw almonds,,8 oz,,Simple Truth,...,0.00000,,,,,0.733,0.267,0.00333,,-5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416520,99018788,http://world-en.openfoodfacts.org/product/99018788/pain-d-epices-lea-nature-jardin-bio,openfoodfacts-contributors,2021-01-18 20:52:04+00:00,2021-05-23 09:05:43+00:00,Pain d'épices,,300 g,plastique,Léa Nature Jardin Bio,...,0.15600,,,,,,,,,
416522,99020118,http://world-en.openfoodfacts.org/product/99020118/pelardon-delices-des-cevennes,openfoodfacts-contributors,2018-10-01 10:52:20+00:00,2019-03-09 16:55:13+00:00,pelardon,,6,4,delices des cevennes,...,0.40000,,,,,,,,,
416523,9906410000009,http://world-en.openfoodfacts.org/product/9906410000009/roussette-du-bugey-2011,agamitsudo,2013-07-10 18:20:08+00:00,2016-01-03 20:00:15+00:00,Roussette du Bugey (2011),Vins blanc du Bugey,750 ml,Bouteille en verre,Roussette du Bugey,...,,12.0,,,,,,,,
416528,99272104,http://world-en.openfoodfacts.org/product/99272104/casa-mayor-frutos-secos-damel,openfoodfacts-contributors,2021-05-03 05:28:52+00:00,2021-05-03 05:31:46+00:00,Casa Mayor Frutos Secos,,250 g,Sachet,Damel,...,,,,,,,,,,


Let us check the first 20 entries.

In [32]:
df_sort_nsgrade_novagroup[["nutriscore_grade", "nova_group"]].head(20)

Unnamed: 0,nutriscore_grade,nova_group
32,a,1
185,a,1
186,a,1
190,a,1
253,a,1
270,a,1
274,a,1
323,a,1
341,a,1
349,a,1


Products with nutriscore_grade 'a' and nova_group '1' appear first.
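
When sorting on several columns, the first column takes priority and the following ones only break ties. The sketch below, on a toy DataFrame with made-up values, also shows that `ascending` accepts one direction per column.

```python
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame(
    {
        "product": ["p1", "p2", "p3", "p4"],
        "nutriscore_grade": ["b", "a", "a", "b"],
        "nova_group": [1, 4, 1, 3],
    }
)

# Sort by grade first; nova_group only breaks ties within a grade
by_both = toy.sort_values(by=["nutriscore_grade", "nova_group"])
print(by_both["product"].tolist())  # ['p3', 'p2', 'p1', 'p4']

# One direction per column: grades ascending, Nova groups descending
mixed = toy.sort_values(
    by=["nutriscore_grade", "nova_group"], ascending=[True, False]
)
print(mixed["product"].tolist())  # ['p2', 'p3', 'p4', 'p1']
```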

### Working with dates

pandas has a specific data type for dates. You can explicitly ask pandas to use this type for specific columns, either during `read_csv` or after (as I did in `load_off`), see the [pandas intro tutorial 09](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html).

This specific data type makes it easy to filter entries by the month of their creation, to know what day of the week an entry was created, or to sort entries by their date of creation.

Sort entries by their date of creation.

In [33]:
df_sort_created = df.sort_values(by="created_datetime")
df_sort_created

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
193468,3760029248001,http://world-en.openfoodfacts.org/product/3760029248001/caramels-tendres-au-beurre-sale-au-sel-de-guerande...,stephane,2012-01-31 14:43:58+00:00,2018-08-30 21:14:54+00:00,Caramels tendres au beurre salé au sel de Guérande,Caramels au beurre salé et à la fleur de sel de Guérande,100 g,"Boite,Carton",Carabreizh,...,0.68000,,,,,,,,,28.0
47317,3029330062806,http://world-en.openfoodfacts.org/product/3029330062806/jacquet-les-bouchees-creatives-a-garnir,stephane,2012-02-09 10:34:56+00:00,2016-12-23 16:38:19+00:00,Jacquet Les bouchées créatives à garnir,Supports en pâte cuite prêts à garnir,54 g,Boite carton,Jacquet,...,0.70104,,,,,,,,,10.0
94007,3257980112590,http://world-en.openfoodfacts.org/product/3257980112590/boudoirs-aux-oeufs-frais-cora,marianne,2012-02-11 14:51:07+00:00,2021-02-22 22:39:03+00:00,Boudoirs aux œufs frais,30 Boudoirs aux œufs frais,175 g,"Boîte,Carton",Cora,...,0.03600,,,,,,,,,14.0
44679,3017760038409,http://world-en.openfoodfacts.org/product/3017760038409/lulu-la-barquette-fraise-lu,marianne,2012-02-11 15:07:23+00:00,2021-06-23 08:00:58+00:00,Lulu La Barquette Fraise,Génoise garnie à la purée de fraise,120 g,"Paquet, Carton, produkt","LU, Mondelez",...,0.03600,,,,,,,,,13.0
61192,3160181210524,http://world-en.openfoodfacts.org/product/3160181210524/cookies-tout-chocolat-biocoop,stephane,2012-02-11 18:51:58+00:00,2019-08-23 19:57:37+00:00,Cookies tout chocolat Biocoop,Cookies au chocolat,200 g,"Boîte,Carton",Biocoop,...,0.15240,,,,,,,,,19.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
303660,7501088102776,http://world-en.openfoodfacts.org/product/7501088102776/zataar-terana,ana-v-mancera,2021-08-15 23:53:03+00:00,2021-08-15 23:58:07+00:00,Zataar,,65 g,"glass jar,metal lid,plastic wrap",Terana,...,0.64100,,,,,,,,,
301860,7501003302069,http://world-en.openfoodfacts.org/product/7501003302069/hojas-de-laurel-mccormick,ana-v-mancera,2021-08-15 23:59:07+00:00,2021-08-16 00:03:26+00:00,Hojas de Laurel,,10 g,"plastic jar,plastic lid",McCormick,...,,,,,,,,,,
301857,7501003301895,http://world-en.openfoodfacts.org/product/7501003301895/ajo-en-polvo-mccormick,ana-v-mancera,2021-08-16 00:10:25+00:00,2021-08-16 00:11:20+00:00,Ajo en polvo,,77 g,"plastic jar,plastic lid",McCormick,...,,,,,,,,,,
301876,7501003311429,http://world-en.openfoodfacts.org/product/7501003311429/oregano-mccormick,ana-v-mancera,2021-08-16 00:11:58+00:00,2021-08-16 00:14:52+00:00,Orégano,,23 g,"plastic jar,plastic lid",McCormick,...,,,,,,,,,,


Let us check that the first and last entries are as expected.

In [34]:
df_sort_created["created_datetime"].head(10)

193468   2012-01-31 14:43:58+00:00
47317    2012-02-09 10:34:56+00:00
94007    2012-02-11 14:51:07+00:00
44679    2012-02-11 15:07:23+00:00
61192    2012-02-11 18:51:58+00:00
330177   2012-02-11 20:46:21+00:00
43673    2012-02-11 21:11:15+00:00
51065    2012-02-12 08:32:47+00:00
309599   2012-02-12 08:51:55+00:00
182209   2012-02-12 18:01:45+00:00
Name: created_datetime, dtype: datetime64[ns, UTC]

The oldest entries in our dataset date from 2012.

In [35]:
df_sort_created["created_datetime"].tail(20)

25990    2021-08-15 18:16:32+00:00
34973    2021-08-15 18:21:32+00:00
304802   2021-08-15 18:22:09+00:00
252695   2021-08-15 19:21:42+00:00
165203   2021-08-15 19:33:05+00:00
301566   2021-08-15 20:28:57+00:00
348023   2021-08-15 20:36:34+00:00
304758   2021-08-15 22:51:58+00:00
304860   2021-08-15 23:03:04+00:00
304185   2021-08-15 23:11:29+00:00
303659   2021-08-15 23:18:41+00:00
301859   2021-08-15 23:26:01+00:00
302076   2021-08-15 23:35:35+00:00
303657   2021-08-15 23:43:38+00:00
303658   2021-08-15 23:45:45+00:00
303660   2021-08-15 23:53:03+00:00
301860   2021-08-15 23:59:07+00:00
301857   2021-08-16 00:10:25+00:00
301876   2021-08-16 00:11:58+00:00
302339   2021-08-16 00:20:59+00:00
Name: created_datetime, dtype: datetime64[ns, UTC]

The newest entries in our dataset were created on 2021-08-15 and in the first minutes of 2021-08-16 (UTC), around the time I downloaded the entire dataset.
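
The datetime type also supports the filters mentioned at the start of this section, through the `.dt` accessor. The sketch below uses a toy Series built from three creation dates that appear in the outputs above; everything else is for illustration only.

```python
import pandas as pd

# Three creation dates copied from the outputs above
created = pd.to_datetime(
    pd.Series(
        [
            "2012-01-31 14:43:58+00:00",
            "2021-08-15 23:59:07+00:00",
            "2021-08-16 00:10:25+00:00",
        ]
    ),
    utc=True,
)

# The .dt accessor exposes the components of each date
print(created.dt.year.tolist())        # [2012, 2021, 2021]
print(created.dt.day_name().tolist())  # ['Tuesday', 'Sunday', 'Monday']

# Keep only the entries created in August 2021
august_2021 = created[(created.dt.year == 2021) & (created.dt.month == 8)]
print(len(august_2021))  # 2
```

On the full dataset, the same boolean mask can be applied to `df` to select the matching rows.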

### Working with textual data

pandas provides a number of functions to process text strings, see the [pandas intro tutorial 10](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html).

Use these functions to select all entries whose list of brands contains "Casino" (this operation is case-sensitive, so mind the initial capital letter!).

In [36]:
df_casino = df[df["brands"].str.contains("Casino")]
df_casino

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
16018,0200298019689,http://world-en.openfoodfacts.org/product/0200298019689/saucisses-de-toulouse-casino,kiliweb,2018-01-21 17:34:37+00:00,2019-01-11 10:43:03+00:00,Saucisses de toulouse,,,,Casino,...,0.600,,,,,,,,,18.0
16036,0200448052542,http://world-en.openfoodfacts.org/product/0200448052542/poulet-jaune-fermier-du-gers-casino,moon-rabbit,2017-09-30 09:42:14+00:00,2021-04-03 20:16:26+00:00,Poulet jaune fermier du Gers,,"1,606 kg",,Casino,...,,,,,,,,,,
16220,0202152035750,http://world-en.openfoodfacts.org/product/0202152035750/saucisse-de-toulouse-geant-casino,kiliweb,2018-01-20 19:41:37+00:00,2018-12-27 19:57:11+00:00,Saucisse de toulouse,,,,Geant Casino,...,0.600,,,,,,,,,18.0
16416,0203339029524,http://world-en.openfoodfacts.org/product/0203339029524/filets-de-poulet-casino,kiliweb,2019-12-11 10:00:28+00:00,2020-10-28 15:13:41+00:00,Filets de poulet,,,,Casino,...,0.000,,,,,,,,,
16417,0203339040086,http://world-en.openfoodfacts.org/product/0203339040086/filet-de-poulet-casino,kiliweb,2020-01-01 20:03:40+00:00,2020-01-18 11:06:56+00:00,Filet de poulet,,,,Casino,...,0.000,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380192,8436018994697,http://world-en.openfoodfacts.org/product/8436018994697/chorizo-fort-l-espagne-des-saveurs,kiliweb,2018-01-13 18:30:42+00:00,2020-04-29 16:22:00+00:00,Chorizo Fort,,170 g,,"L’espagne Des Saveurs,Casino",...,1.260,,,,,,,,,24.0
396266,85975934,http://world-en.openfoodfacts.org/product/85975934/truffe-fantaisie-noir-casino,kiliweb,2018-03-27 11:58:02+00:00,2019-01-13 19:38:41+00:00,Truffe fantaisie noir,,150 g,,Casino,...,0.120,,,,,,,,,27.0
409394,88830100,http://world-en.openfoodfacts.org/product/88830100/assortiment-de-petits-cakes-casino,openfoodfacts-contributors,2020-01-02 20:05:22+00:00,2020-10-11 14:40:48+00:00,Assortiment de petits cakes,,450 g,Busta di plastica,Casino,...,0.277,,,,,,,,,27.0
416434,96963333,http://world-en.openfoodfacts.org/product/96963333/farine-de-ble-t65-casino,kiliweb,2018-04-02 09:06:01+00:00,2021-05-01 11:39:21+00:00,Farine de blé T65,,,,Casino,...,,,,,,,,,,


You should get 4434 products whose brands contain "Casino".
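
`str.contains` has a few options worth knowing. The sketch below runs on a toy Series where "casino bio" is a made-up value. It shows `case=False` for a case-insensitive match and `na=False` to treat missing brands as non-matching.

```python
import pandas as pd

# Toy brands: "casino bio" is made up, None stands for a missing brand
brands = pd.Series(
    ["Casino", "Geant Casino", "casino bio", None, "Carrefour"],
    dtype="string",
)

# Case-sensitive match (the default); missing values count as False
exact = brands.str.contains("Casino", na=False)
print(exact.tolist())  # [True, True, False, False, False]

# Case-insensitive match
anycase = brands.str.contains("casino", case=False, na=False)
print(anycase.tolist())  # [True, True, True, False, False]

# The boolean result can be used directly to select entries
print(brands[anycase].tolist())
```

A case-insensitive search on the full dataset would probably return more products than the 4434 above, since contributors do not all capitalize brand names the same way.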

### Wrapping it all together

Select all the products that are in the category for spreads and store this subset in a variable `df_spreads`.

> **HINT** If you can't find the right pattern to look for, take a peek at the spelling of the categories: print the content of the column and browse through the values until you find a suitable one.
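
One way to browse the categories, as suggested in the hint, is to count how often each individual category occurs. The sketch below runs on a small toy DataFrame with made-up values, and assumes that `categories_en` holds comma-separated category names; on the real data you would replace `toy` with `df`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical values).
toy = pd.DataFrame({
    "categories_en": [
        "Spreads,Sweet spreads,Hazelnut spreads",
        "Spreads,Sweet spreads,Honeys",
        "Meats,Sausages",
        "Meats,Poultries",
    ]
})

# Split each cell into a list of categories, put each category on its own row,
# then count the occurrences of every distinct category.
category_counts = (
    toy["categories_en"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(category_counts.head(10))
```

Scanning the most frequent values this way is usually quicker than scrolling through the raw column.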

In [37]:
df_spreads = df[df["categories_en"].str.contains("Spreads")].copy()
df_spreads

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,...,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
0,0000101209159,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,kiliweb,2018-02-22 10:56:57+00:00,2020-01-18 19:26:31+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,,Bovetti,...,0.004,,,,,,,,,23.0
57,0007700026200,http://world-en.openfoodfacts.org/product/0007700026200/miel-les-ruchers-du-born,kiliweb,2019-12-08 10:17:00+00:00,2019-12-08 12:50:23+00:00,Miel,,,,Les ruchers du Born,...,0.000,,,,,,,,,
172,0009800800254,http://world-en.openfoodfacts.org/product/0009800800254/hazelnut-spread-with-cocoa-ferrero,usda-ndb-import,2017-03-09 14:51:12+00:00,2020-11-08 07:51:39+00:00,Hazelnut spread with cocoa,,,,"Ferrero, Ferrero U.S.A. Incorporated",...,0.041,,0.0,0.0,,,0.108,0.00195,,18.0
173,0009800801107,http://world-en.openfoodfacts.org/product/0009800801107/ferrero-nutella-hazelnut-spread-with-cocoa-mini-cups,bdwyer,2015-07-26 00:50:59+00:00,2020-04-22 18:33:03+00:00,"Ferrero, nutella, hazelnut spread with cocoa mini cups",Hazelnut Spread with Skim Milk & Cocoa,10 MINI CUPS - 5.2 OZ (150 g),Box,"Nutella,Ferrero",...,0.050,,0.0,0.0,,,0.133,0.00240,,17.0
174,0009800892204,http://world-en.openfoodfacts.org/product/0009800892204/ferrero-nutella-hazelnut-spread-with-cocoa,openfoodfacts-contributors,2013-07-02 14:14:24+00:00,2021-07-18 17:18:36+00:00,"Ferrero, nutella, hazelnut spread with cocoa",,1 kg,Pot,"Ferrero,Nutella",...,0.041,,0.0,0.0,,,0.108,0.00195,,18.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416493,97929352,http://world-en.openfoodfacts.org/product/97929352/nusco-dark-chocolate-spread-brinkers,kiliweb,2019-08-03 08:28:20+00:00,2020-05-31 10:31:02+00:00,Nusco Dark chocolate spread,,,,Brinkers,...,0.008,,,,,,,,,20.0
416515,9900010011557,http://world-en.openfoodfacts.org/product/9900010011557/miel-de-camargue-l-boulaire,jeanbono,2013-08-06 08:49:57+00:00,2017-09-06 11:54:00+00:00,Miel de Camargue,Miel,250 g,"Bocal,Verre",L. Boulaire,...,,,,,,,,,,
416519,9900634104376,http://world-en.openfoodfacts.org/product/9900634104376/ceba-caramel-litzada-ametller-origen,kiliweb,2020-08-25 10:53:57+00:00,2020-10-25 12:42:53+00:00,Ceba Caramel•litzada,,,,ametller origen,...,0.216,,,,,,,,,14.0
416531,9935010000003,http://world-en.openfoodfacts.org/product/9935010000003/rillette-d-oie-sans-marque,sebleouf,2015-10-31 12:07:09+00:00,2015-11-01 11:20:39+00:00,Rillette d'oie,,180 g,"Pot,Verre","Sans marque,D.Lambert",...,,,,,,,,,,


You should find 25565 spreads.
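
The filter above works here because every product in our subset has a category. On data where `categories_en` can be missing, it can help to state explicitly how missing values are treated. A minimal sketch on toy data (the product rows are made up), using the `na` parameter of `str.contains`:

```python
import numpy as np
import pandas as pd

# Toy data: the last product has no category at all (hypothetical values).
toy = pd.DataFrame({
    "product_name": ["Miel", "Nusco", "Mystery item"],
    "categories_en": ["Spreads,Honeys", "Spreads,Chocolate spreads", np.nan],
})

# na=False treats a missing category as "does not match",
# so the mask contains only True/False values.
mask = toy["categories_en"].str.contains("Spreads", na=False)
toy_spreads = toy[mask].copy()
print(len(toy_spreads))
```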

For these spreads, compute the means of the nutritional values for :
* fat,
* saturated fat,
* sugars,
* salt.

In [38]:
df_spreads[["fat_100g", "saturated-fat_100g", "sugars_100g", "salt_100g"]].mean()

fat_100g              20.955330
saturated-fat_100g     7.974919
sugars_100g           29.003056
salt_100g              0.765626
dtype: float64

You should find the following mean values :

* fat = 20.96 g,
* saturated fat = 7.97 g,
* sugars = 29.00 g,
* salt = 0.77 g.


For each of these 4 nutritional values, compute the percentage difference between each product and the average of its category, i.e. `100 * (value - mean) / mean`, and store the computed values as new columns in `df_spreads` (eg. `diff-fat_100g`, `diff-sugars_100g` etc).

Remember that you can find help in the [pandas intro tutorial 05](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html) and [pandas tutorial 06](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html#min-tut-06-stats).

In [39]:
df_spreads["diff-fat_100g"] = 100 * (df_spreads["fat_100g"] - df_spreads["fat_100g"].mean()) / df_spreads["fat_100g"].mean()
df_spreads["diff-fat_100g"]

0         129.058667
57               NaN
172        54.757762
173        43.161667
174        54.757762
             ...    
416493    105.198389
416515           NaN
416519    -97.613972
416531           NaN
416534   -100.000000
Name: diff-fat_100g, Length: 25565, dtype: float64

In [40]:
df_spreads["diff-saturated-fat_100g"] = 100 * (df_spreads["saturated-fat_100g"] - df_spreads["saturated-fat_100g"].mean()) / df_spreads["saturated-fat_100g"].mean()
df_spreads["diff-sugars_100g"] = 100 * (df_spreads["sugars_100g"] - df_spreads["sugars_100g"].mean()) / df_spreads["sugars_100g"].mean()
df_spreads["diff-salt_100g"] = 100 * (df_spreads["salt_100g"] - df_spreads["salt_100g"].mean()) / df_spreads["salt_100g"].mean()
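
Since the same formula is applied to each column, the four assignments can also be written as a loop. The following is a sketch on a toy DataFrame (values are made up); on the real data, `toy_spreads` would be `df_spreads`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_spreads (hypothetical values).
toy_spreads = pd.DataFrame({
    "fat_100g": [10.0, 30.0, np.nan],
    "saturated-fat_100g": [2.0, 6.0, 4.0],
    "sugars_100g": [50.0, 30.0, 10.0],
    "salt_100g": [0.1, 0.3, 0.2],
})

nutrients = ["fat_100g", "saturated-fat_100g", "sugars_100g", "salt_100g"]
for col in nutrients:
    # mean() skips missing values by default
    col_mean = toy_spreads[col].mean()
    # percentage difference to the mean; stays NaN where the value is missing
    toy_spreads[f"diff-{col}"] = 100 * (toy_spreads[col] - col_mean) / col_mean

print(toy_spreads[["fat_100g", "diff-fat_100g"]])
```

Computing the mean once per column also avoids evaluating it twice in the same expression.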

Note that these values differ from what the OpenFoodFacts website displays when you look at the nutritional values of a product from this category, eg. [Coconut Spread - premium Srikaya - Hey Boo - 227 g](https://world.openfoodfacts.org/product/0608938316165/coconut-spread-premium-srikaya-hey-boo).

In [41]:
df_spreads[df_spreads['code'] == '0608938316165']['diff-fat_100g']

20958    0.213167
Name: diff-fat_100g, dtype: float64

This product contains barely 0.2% more fat than the average spread in our subset, but 17% more than the average spread in the entire OpenFoodFacts dataset (as displayed on the OFF website).
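
To make the comparison concrete, we can work backwards from the numbers above. This is only a back-of-the-envelope sketch: the 17% figure is the rounded value quoted from the website, so the implied average is approximate.

```python
# Values taken from the outputs above (g per 100 g, and percentages).
subset_mean_fat = 20.955330   # mean fat of the spreads in our subset
diff_vs_subset = 0.213167     # diff-fat_100g of the product, in %
diff_vs_off = 17.0            # rounded difference quoted from the OFF website, in %

# Fat content of the product, recovered from the subset mean.
product_fat = subset_mean_fat * (1 + diff_vs_subset / 100)

# Average fat content of spreads implied for the entire OFF dataset.
implied_off_mean = product_fat / (1 + diff_vs_off / 100)

print(round(product_fat, 2))
print(round(implied_off_mean, 2))
```

The product contains about 21 g of fat per 100 g, which puts the average for spreads in the entire dataset at roughly 18 g, about 3 g below the average in our subset.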


This is because the OpenFoodFacts website uses its entire dataset, whereas we are working on a filtered subset of "reasonably complete" product entries, prepared beforehand to keep only products with :

* a non-ambiguous barcode in the EAN-8 or EAN-13 formats ;
* a product name ;
* brands ;
* an image URL for the product ;
* a category ;
* basic nutritional values.

It seems that, in this "reasonably complete" subset, spreads contain more fat on average than in the whole OpenFoodFacts dataset.

Is the entire OpenFoodFacts dataset closer to the reality of what is on the shelves of supermarkets ?
Is our subset more faithful globally ? Is it more faithful to the consumer market in certain countries, eg. France and Spain ?

These questions raise the more general problem of [selection bias](https://en.wikipedia.org/wiki/Selection_bias), which lies behind every data analysis and every use of a dataset, eg. to build artificial intelligence systems.

## Bonus exercise : Traffic light labelling

The [traffic light labelling system](https://www.nutrition.org.uk/healthyliving/helpingyoueatwell/324-labels.html?start=3) is used on the [OpenFoodFacts website (French)](https://fr.openfoodfacts.org/reperes-nutritionnels) to present 4 nutritional values with an easy-to-grasp color code :

* fat,
* saturated fat,
* sugars,
* salt.

The OpenFoodFacts dataset does not contain these indicators, but you can recompute them from the [reference table](https://www.nutrition.org.uk/media/er5n0c3s/capture.png).

Add 4 columns to the dataset, one for each of the 4 relevant nutritional values, that will contain the level (low, medium, high) or color (green, amber, red) of the traffic light.

> **HINT** You can simplify the exercise and express all conditions on the values per 100g (ignoring the rightmost column of the table where thresholds are expressed per portion).

In [42]:
# for fat_100g
df["tl_fat"] = "unknown"
df.loc[df["fat_100g"] <= 3, "tl_fat"] = "green"
df.loc[(df["fat_100g"] > 3) & (df["fat_100g"] <= 17.5), "tl_fat"] = "amber"
df.loc[(df["fat_100g"] > 17.5), "tl_fat"] = "red"

Let us check that the traffic lights for fat are as expected.

In [43]:
df[["fat_100g", "tl_fat"]].head(10)

Unnamed: 0,fat_100g,tl_fat
0,48.0,red
1,34.7,red
2,1.0,green
3,1.0,green
4,0.5,green
5,4.6,amber
6,4.6,amber
7,4.6,amber
8,6.3,amber
9,,unknown


Now you can define the traffic lights for the 3 remaining nutritional values.

In [44]:
# for saturated-fat_100g
df["tl_saturated-fat"] = "unknown"
df.loc[df["saturated-fat_100g"] <= 1.5, "tl_saturated-fat"] = "green"
df.loc[(df["saturated-fat_100g"] > 1.5) & (df["saturated-fat_100g"] <= 5), "tl_saturated-fat"] = "amber"
df.loc[(df["saturated-fat_100g"] > 5), "tl_saturated-fat"] = "red"
# for sugars_100g
df["tl_sugars"] = "unknown"
df.loc[df["sugars_100g"] <= 5, "tl_sugars"] = "green"
df.loc[(df["sugars_100g"] > 5) & (df["sugars_100g"] <= 22.5), "tl_sugars"] = "amber"
df.loc[(df["sugars_100g"] > 22.5), "tl_sugars"] = "red"
# for salt_100g
df["tl_salt"] = "unknown"
df.loc[df["salt_100g"] <= 0.3, "tl_salt"] = "green"
df.loc[(df["salt_100g"] > 0.3) & (df["salt_100g"] <= 1.5), "tl_salt"] = "amber"
df.loc[(df["salt_100g"] > 1.5), "tl_salt"] = "red"
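
The four blocks above follow the same pattern, so the binning can also be written once with `pd.cut`. The sketch below is an alternative, not the method used in this notebook: `traffic_light` is a hypothetical helper, the toy values are made up, and the thresholds are the ones used in the cells above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical values).
toy = pd.DataFrame({
    "fat_100g": [0.5, 3.0, 4.6, 17.5, 48.0, np.nan],
    "salt_100g": [0.3, 0.31, 1.5, 2.0, np.nan, 0.0],
})


def traffic_light(values, low, high):
    """Label each value green (<= low), amber (<= high), red (> high) or unknown (missing)."""
    lights = pd.cut(
        values,
        bins=[-np.inf, low, high, np.inf],  # intervals are closed on the right
        labels=["green", "amber", "red"],
    )
    # pd.cut leaves missing values as NaN: replace them with "unknown"
    return lights.astype(object).where(lights.notna(), "unknown")


# (low, high) thresholds per 100 g, as used in the cells above.
thresholds = {"fat_100g": (3, 17.5), "salt_100g": (0.3, 1.5)}
for col, (low, high) in thresholds.items():
    toy["tl_" + col.replace("_100g", "")] = traffic_light(toy[col], low, high)

print(toy)
```

Because the intervals are closed on the right, a value exactly equal to a threshold (eg. 3 g of fat) falls in the lower class, as with the `<=` comparisons above.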

We can display the traffic lights for the first 10 products, and compare with what the Open Food Facts website displays (remember: you can retrieve URLs from the column `url`).

In [45]:
df[["url", "tl_fat", "tl_saturated-fat", "tl_sugars", "tl_salt"]].head(10)

Unnamed: 0,url,tl_fat,tl_saturated-fat,tl_sugars,tl_salt
0,http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-...,red,red,red,green
1,http://world-en.openfoodfacts.org/product/0000159487776/milkyway-magic-stars-chocolates,red,unknown,red,unknown
2,http://world-en.openfoodfacts.org/product/0000204286484/mehrkomponeneten-protein-90-c6-haselnuss-allfitnes...,green,unknown,unknown,unknown
3,http://world-en.openfoodfacts.org/product/0000250632969/mehrkomponeneten-protein-90-c6-banane-allfitnessfa...,green,unknown,unknown,unknown
4,http://world-en.openfoodfacts.org/product/0000460938714/100-soja-protein-haselnuss-allfitnessfactory-de,green,unknown,unknown,unknown
5,http://world-en.openfoodfacts.org/product/0000470322800/whey-protein-aus-molke-vanilla-allfitnessfactory-de,amber,unknown,amber,amber
6,http://world-en.openfoodfacts.org/product/0000501050603/whey-protein-aus-molke-1000-gramm-vanilla-allfitne...,amber,unknown,amber,amber
7,http://world-en.openfoodfacts.org/product/0000526938306/whey-protein-aus-molke-500-gramm-vanilla-allfitnes...,amber,unknown,amber,amber
8,http://world-en.openfoodfacts.org/product/0000554004509/pain-de-mie-sans-gluten-genius,amber,green,green,amber
9,http://world-en.openfoodfacts.org/product/0000606009841/beignets-framboises-intermarche,unknown,amber,amber,amber


## To go further

### Python for data science

* [Programming in Python for Data Science](https://prog-learn.mds.ubc.ca/en/)
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)