<a href="https://colab.research.google.com/github/datactivist/scpo-data-science-bootcamp/blob/main/notebooks/2_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tabular data analysis 1: Loading Open Food Facts data with pandas

In this series of notebooks, we are going to explore the data contained in the OpenFoodFacts database.

## OpenFoodFacts

OpenFoodFacts is an open, crowdsourced database on food products from around the world.

It is produced and managed as a digital commons.

Everyone can contribute data on packaged food products: pictures, ingredients, nutritional values etc.

This database has served as the foundation for many mobile phone apps, especially scanning apps to help customers while grocery shopping.

### Notions

* It is [*open*](https://en.wikipedia.org/wiki/Open_data): anyone can freely access, use and modify it.
* It is [*crowdsourced*](https://en.wikipedia.org/wiki/Crowdsourcing): anyone can add new food products to the database, and complete or modify existing data.
* It is a [knowledge commons](https://en.wikipedia.org/wiki/Knowledge_commons), a type of [digital commons](https://en.wikipedia.org/wiki/Digital_commons_(economics)).

### Browsing through the dataset

The OpenFoodFacts database is [available online](https://world.openfoodfacts.org/).

Take a few minutes to explore the database through its online interface.

* How is each product described?
* What types of information are provided?

### Understanding the dataset

To really understand a dataset, you need to read its documentation so that you are able to answer a set of common, basic questions that will help guide your analysis, such as:

* Who created this dataset and for what purpose?
* How was the dataset created?
* What do the instances that comprise the dataset represent (e.g. people, companies, events, photos...)?
* What data does each instance consist of? Are they "raw" data or (computed) features?
* Are the instances related in some way? If so, are there specific fields that enable cross-referencing?

The documentation for a dataset is always written with some purpose, for an intended type of reader, and in a certain context. It is therefore very likely that you will not find all the answers in the documentation.

Here, you can gather partial information on OpenFoodFacts (OFF) from:

* the [presentation of the project](https://world.openfoodfacts.org/discover)
* various pages of the [wiki](https://wiki.openfoodfacts.org/Main_Page), mostly:
  * [Data fields](https://wiki.openfoodfacts.org/Data_fields)
  * [Ingredients](https://wiki.openfoodfacts.org/Ingredients)
  * [Quality](https://wiki.openfoodfacts.org/Quality)

#### To go further

* [Datasheets for datasets](https://arxiv.org/pdf/1803.09010.pdf) are a standardized documentation process and format proposed by AI researchers to facilitate the proper (re-)use of datasets and avoid common pitfalls in designing AI components (and ensuing scandals when they exhibit problematic biases in deployment)

Equipped with this new knowledge about the OpenFoodFacts database, you can start the exploratory analysis of the data to gather the missing information to complete your answers, and ask questions of your own.


### OpenFoodFacts as a tabular dataset

The entire set of facts about all the products in the OpenFoodFacts database can be represented as a *tabular dataset*, that is, a table of data where:

* each row is a product,
* each column is a field (e.g. "brand", "barcode", "energy for 100g"...),
* each cell contains the value of a field for a product.


The simplest and most common format used for tabular datasets is the [CSV format](https://en.wikipedia.org/wiki/Comma-separated_values).
CSV files can be opened in a spreadsheet software such as Microsoft Excel, Apple Numbers or LibreOffice Calc, or just any plain text editor.
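
To make the row / column / cell vocabulary concrete, here is a minimal sketch that reads a tiny CSV text with pandas. The three products are made up for the illustration and are not real OpenFoodFacts entries.

```python
import io

import pandas as pd

# Hypothetical miniature dataset: made-up products, one per line.
csv_text = """code,brand,energy_100g
0001,Brand A,2250
0002,Brand B,1480
0003,Brand A,310
"""

# read_csv accepts any file-like object; forcing the barcode to be read
# as text preserves its leading zeros.
toy = pd.read_csv(io.StringIO(csv_text), dtype={"code": "string"})

print(toy.shape)             # (number of rows, number of columns)
print(toy.columns.tolist())  # the fields
print(toy.loc[1, "brand"])   # one cell: the brand of the second product
```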

The OpenFoodFacts database is [available for download in various formats](https://world.openfoodfacts.org/data), including the CSV format.
Because the whole dataset is too big (the CSV export weighs more than 4 GB as of 2021-08-16), we will work on a filtered subset of the dataset where we only keep products with:

* a non-ambiguous barcode in the [EAN-8](https://en.wikipedia.org/wiki/EAN-8) or [EAN-13](https://en.wikipedia.org/wiki/International_Article_Number) formats;
* a product name;
* brands;
* an image URL for the product;
* a category;
* basic nutritional values.

## The pandas library for tabular data analysis

### Gaining functionalities with libraries

The Python standard library includes a module named [csv](https://docs.python.org/3/library/csv.html#module-csv) that provides very basic support to read and write CSV files.
This module enables you to read and write values, but nothing more.

It provides no built-in way to:

* rename columns;
* filter columns, e.g. keep only the columns for nutritional values;
* filter rows, e.g. select all products that are categorized as "Sweet spreads";
* compute summary statistics on columns across rows, e.g. compute the min, max, mean and median of fiber content per 100g;
* compare columns, e.g. test whether they contain the same values;
* etc.

As we saw in the first notebook, this can be remedied by using an additional [software library](https://en.wikipedia.org/wiki/Library_(computing)), which is, roughly speaking, a collection of code that provides functionalities to perform operations on a given task or domain.

The most widely used library in Python to work on tabular datasets is [pandas](https://pandas.pydata.org/).
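
As a rough illustration of the difference, the sketch below computes the same mean twice: first with the standard `csv` module, where conversion and arithmetic are up to you, then with pandas. The data are made up for the example.

```python
import csv
import io

import pandas as pd

# Hypothetical data: fiber content of three made-up products.
csv_text = "product,fiber_100g\nA,2.0\nB,4.5\nC,1.0\n"

# Standard library: every value is read as a string, so you convert
# the values and compute the statistic yourself.
rows = list(csv.DictReader(io.StringIO(csv_text)))
fiber = [float(row["fiber_100g"]) for row in rows]
mean_csv = sum(fiber) / len(fiber)

# pandas: column types are inferred and common statistics are built in.
toy = pd.read_csv(io.StringIO(csv_text))
mean_pandas = toy["fiber_100g"].mean()

print(mean_csv, mean_pandas)
```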

We need to import pandas and, for technical reasons, a specific pandas data type to load [categorical variables with ordered values](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#controlling-behavior).

In [2]:
import pandas as pd
# we need this data type for ordered categoricals
from pandas.api.types import CategoricalDtype
# lift some limitations in column width, so more cell values are displayed in full
pd.set_option('display.max_colwidth', 110)

The OpenFoodFacts CSV file we will load has an accompanying text file that specifies the data type that pandas should use for each column. Without it, pandas would do its best to guess data types, but its guesses are (rightfully) conservative, so the result is quite rough around the edges.
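
The effect of specifying data types can be sketched on a tiny example. The tab-separated text and the `grade_dtype` name below are assumptions made up for the illustration; the actual content of the dtype file is not shown here.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical tab-separated extract with made-up values.
tsv_text = "code\tnutriscore_grade\n0001\tb\n0002\te\n0003\ta\n"

# Ordered categorical: the categories are listed from the first to the last grade.
grade_dtype = CategoricalDtype(categories=["a", "b", "c", "d", "e"], ordered=True)

# Without dtype, pandas guesses: the barcode becomes an integer and loses its leading zeros.
guessed = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With dtype, each listed column gets exactly the type we ask for.
typed = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    dtype={"code": "string", "nutriscore_grade": grade_dtype},
)

print(guessed.dtypes)
print(typed.dtypes)
# Ordered categories support comparisons, min and max.
print(typed["nutriscore_grade"] < "c")
```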

The two files (csv and txt) are on the Google Drive of my Sciences Po account:

* [CSV file](https://drive.google.com/file/d/14Pyz3Wb-FGs_9H-e7K-4Ug2X31N81Amv/view?usp=sharing)
* [dtype txt file](https://drive.google.com/file/d/1EUBD1btT8k4PS073WLUqGm_UucUl4n3P/view?usp=sharing)

0. Check that you have saved this notebook on your Google Drive (otherwise a "Save on your Drive" button appears in the Colab notebook menu bar);
1. Download the CSV and txt files;
2. Add them to the Google Drive of the account you used to open Colab: it should be your Sciences Po account, or your personal account.
  * To check what account you are using, click on the circle with your initial, at the top right of the Colab menu bar (as in Gmail and other Google products).
  * Then go to <https://drive.google.com/>, check you are logged in with the same account, and drop your files where you want.
3. To the very left of this Colab notebook, there is a small "folder" icon. Click on it, a menu bar will appear with three icons below "Files". Click on the icon on the right (dark folder icon, with the Drive icon). It will enable you to access files on your Drive from your Colab notebook.
4. In the next cell, change the path to the files to match the path on your Drive.

It should work, but let me know if you encounter any issue.

In [6]:
# this code probably appeared with the above procedure
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
# dataset
OFF_FILE = 'drive/MyDrive/data-science-bootcamp/off_products_subset.csv'
# data type of the columns, useful for loading
DTYPE_FILE = 'drive/MyDrive/data-science-bootcamp/dtype.txt'

We will use a custom utility function, `load_off`, to load the OpenFoodFacts dataset and convert its two date columns (`created_datetime` and `last_modified_datetime`) to a datetime type.

You do not need to understand or even look at its code because this requires a few Python functions and technical notions (file input and output, evaluation) that we could not cover in the first notebook and are beyond the objectives of this bootcamp.
However, feel free to ask Mathieu questions if you are curious!

In [8]:
def load_off():
  """Load the filtered subset of OpenFoodFacts.
  
  Returns
  -------
  df : pd.DataFrame
    (A filtered subset of the) OpenFoodFacts tabular dataset.
  """
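  # load the data types for the columns
  # NB: eval executes the content of the file as Python code (which can
  # refer to CategoricalDtype, imported above), so only use it on a file you trust
  with open(DTYPE_FILE) as f:
    dtype = eval(f.read())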

  # load the dataset
  df = pd.read_csv(OFF_FILE, sep='\t', dtype=dtype)
  # convert columns with datetimes
  for col_name in ('created_datetime', 'last_modified_datetime'):
    # ISO 8601 dates
    df[col_name] = pd.to_datetime(df[col_name])
  return df

We load the dataset using the function above.

In [9]:
df = load_off()

If all went well, nothing is displayed.
So what have we read, really?
Remember that typing the name of a variable as the only (or last) line of a notebook cell displays its value.

In [None]:
# type the name of the variable containing the dataset


The dataset is loaded in a pandas DataFrame, a type of object described in the [pandas intro tutorial 01](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html).

By default pandas displays the column headers, the first and last five rows with their row index, the total number of rows and columns.

How many rows and columns does the table contain in total?

You can display the first `n` entries of a DataFrame with the DataFrame method [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), and the last `n` entries with [tail](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail).

Remember that methods are attached to an object, and are called with the dot notation.

In [None]:
# display the first entry
df.head(1)

Unnamed: 0,code,url,creator,created_datetime,last_modified_datetime,product_name,generic_name,quantity,packaging,brands,categories_en,origins_en,manufacturing_places,labels_en,emb_codes,emb_codes_tags,purchase_places,stores,countries_en,ingredients_text,allergens,traces_en,serving_size,serving_quantity,additives_n,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil_tags,nutriscore_score,nutriscore_grade,nova_group,pnns_groups_1,pnns_groups_2,states_en,brand_owner,ecoscore_score_fr,ecoscore_grade_fr,main_category_en,image_small_url,energy-kj_100g,energy-kcal_100g,energy_100g,fat_100g,saturated-fat_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,vitamin-c_100g,vitamin-b1_100g,potassium_100g,calcium_100g,iron_100g,magnesium_100g,nutrition-score-fr_100g
0,101209159,http://world-en.openfoodfacts.org/product/0000...,kiliweb,2018-02-22 10:56:57+00:00,2020-01-18 19:26:31+00:00,Véritable pâte à tartiner noisettes chocolat noir,,350 g,,Bovetti,"Spreads,Breakfasts,Sweet spreads,fr:Pâtes à ta...",,,"No gluten,No palm oil",,,,,France,,,,,,,,,,,,,23,e,,Sugary snacks,Sweets,"To be completed,Nutrition facts completed,Ingr...",,20,d,Cocoa and hazelnuts spreads,https://images.openfoodfacts.org/images/produc...,,617.0,2582.0,48.0,10.0,,,,,36.0,32.0,,8.0,0.01,0.004,,,,,,,,,23.0


Display the first 7 entries.

Display the last 3 entries.

**Hint** Some URLs are longer than the maximal displayed text length for a cell (by default 50 characters, previously raised here to 110). This will make it harder for you to consult the product page on the OFF website.
You can use the `values` attribute to get the complete array of values for a (subset of a) DataFrame, or of a column (Series).

In [None]:
# display the arrays of values of all fields for the first 2 products
# NB : each entry has 2 URLs : one for the product page, one for its (small-sized) image
df.head(2).values

array([['0000101209159',
        'http://world-en.openfoodfacts.org/product/0000101209159/veritable-pate-a-tartiner-noisettes-chocolat-noir-bovetti',
        'kiliweb', Timestamp('2018-02-22 10:56:57+0000', tz='UTC'),
        Timestamp('2020-01-18 19:26:31+0000', tz='UTC'),
        'Véritable pâte à tartiner noisettes chocolat noir', <NA>,
        '350 g', <NA>, 'Bovetti',
        'Spreads,Breakfasts,Sweet spreads,fr:Pâtes à tartiner,Hazelnut spreads,Chocolate spreads,Cocoa and hazelnuts spreads',
        <NA>, <NA>, 'No gluten,No palm oil', <NA>, <NA>, <NA>, <NA>,
        'France', <NA>, <NA>, <NA>, <NA>, nan, <NA>, <NA>, <NA>, <NA>,
        <NA>, <NA>, <NA>, 23, 'e', nan, 'Sugary snacks', 'Sweets',
        'To be completed,Nutrition facts completed,Ingredients to be completed,Expiration date to be completed,Packaging code to be completed,Characteristics to be completed,Categories completed,Brands completed,Packaging to be completed,Quantity completed,Product name completed,Photos val

### About the data table

We can display a summary of the DataFrame with `info`, including for each column its index, name, number of non-null values, and data type (`dtype`).
For more information, you can read the [pandas intro tutorial 02](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html).

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 416552 entries, 0 to 416551
Data columns (total 66 columns):
 #   Column                                      Non-Null Count   Dtype              
---  ------                                      --------------   -----              
 0   code                                        416552 non-null  string             
 1   url                                         416552 non-null  string             
 2   creator                                     416551 non-null  category           
 3   created_datetime                            416552 non-null  datetime64[ns, UTC]
 4   last_modified_datetime                      416552 non-null  datetime64[ns, UTC]
 5   product_name                                416552 non-null  string             
 6   generic_name                                99355 non-null   string             
 7   quantity                                    286295 non-null  string             
 8   packaging               

`info` also displays the memory usage of the DataFrame.

### Selecting subsets

One of the fundamental operations on DataFrames is to be able to filter the dataset on a certain condition, to keep only certain rows or columns.

The basic operators for selection are square brackets `[]`, `loc` and `iloc`, and you can select rows or columns by their position or label, or with a conditional expression on values, see the [pandas intro tutorial 03](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html).
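
Here is a minimal sketch of these selection operators on a toy table. The products and values are made up, and only the column names mimic the OpenFoodFacts dataset.

```python
import pandas as pd

# Hypothetical toy table with made-up values.
toy = pd.DataFrame(
    {
        "product_name": ["Oat flakes", "Choco spread", "Lentils", "Cola"],
        "nutriscore_grade": ["a", "e", "a", "e"],
        "sugars_100g": [1.2, 56.0, 1.1, 10.6],
    }
)

# Square brackets with one label select a column (a Series) ...
names = toy["product_name"]
# ... and with a list of labels, several columns (a DataFrame).
two_columns = toy[["product_name", "sugars_100g"]]

# A comparison yields a boolean Series; used inside brackets,
# it keeps the rows where the condition is True.
sweet = toy[toy["sugars_100g"] > 10]

# loc selects by label: here, rows matching a condition and one named column.
sweet_names = toy.loc[toy["sugars_100g"] > 10, "product_name"]

# iloc selects by integer position: first two rows, first two columns.
corner = toy.iloc[0:2, 0:2]
```

A boolean condition on rows works the same way in the bracket and `loc` forms; `loc` additionally lets you pick columns in the same call.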

Filter rows to keep only products with Nutri-Score 'a'.


You should have 56260 entries.

Now filter rows to keep products whose quantity of sugars per 100g is higher than 20g.

You should obtain 83891 entries.

Filter the dataset to keep only the columns corresponding to the:
* barcode,
* url,
* date of creation,
* product name,
* brands,
* categories,
* ingredients text,
* main category,
* Nutri-Score grade,
* Nutri-Score score,
* Nova group.

### Renaming columns

Column names are not always ideal, either because they are not transparent (it is hard for you or an external user to understand what they stand for) or because they would look bad if they were used directly to label the axes of a data visualization.

pandas provides means to rename columns, see the [pandas intro tutorial 05](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html).

Rename all the columns ending with `_en` (code for English) to drop this suffix, eg. `main_category_en` to `main_category`.
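
The two usual ways of calling `rename` can be sketched on a toy table. The column names and the `_fr` suffix below are made up for the illustration.

```python
import pandas as pd

# Hypothetical toy table with made-up column names.
toy = pd.DataFrame({"name_fr": ["pomme"], "colour_fr": ["rouge"], "weight": [150]})

# Rename with an explicit mapping {old name: new name}.
explicit = toy.rename(columns={"weight": "weight_g"})

# Rename with a function that is applied to every column name:
# here, drop the last 3 characters when the name ends with "_fr".
stripped = toy.rename(
    columns=lambda name: name[:-3] if name.endswith("_fr") else name
)

print(explicit.columns.tolist())
print(stripped.columns.tolist())
```

Note that `rename` returns a new DataFrame and leaves `toy` unchanged, so you need to assign the result to a variable to keep it.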

### Summary statistics

You can compute various summary statistics that depend on the type of variable in each column, see the [pandas intro tutorial 06](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html).

Compute summary statistics for several columns of different types, and for combinations of columns that could provide interesting insights.
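
As a minimal sketch of how the available statistics depend on the type of a column, consider this toy table with made-up values:

```python
import pandas as pd

# Hypothetical toy table; None becomes a missing value (NaN) in the numeric column.
toy = pd.DataFrame(
    {
        "nutriscore_grade": ["a", "e", "a", "c"],
        "fat_100g": [1.0, 30.0, 3.0, None],
    }
)

# Numeric column: missing values are skipped by default.
fat_mean = toy["fat_100g"].mean()
fat_median = toy["fat_100g"].median()
# describe() gathers count, mean, std, min, quartiles and max in one call.
fat_summary = toy["fat_100g"].describe()

# Categorical column: count the occurrences of each value.
grade_counts = toy["nutriscore_grade"].value_counts()

print(fat_summary)
print(grade_counts)
```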

For instance, compute the means of the nutritional values for:
* fat,
* saturated fat,
* sugars,
* salt.

In [None]:
# compute the means


### Computing on columns

You can manipulate columns in various ways, including with operations that apply element-wise as we saw for NumPy arrays in the first notebook.

You can for instance subtract the mean value of a column from each of its values.

In [None]:
# subtract the mean value of fat in the dataset from each value for 'fat'
df['fat_100g'] - df['fat_100g'].mean()

0         33.815477
1         20.515477
2        -13.184523
3        -13.184523
4        -13.684523
            ...    
416547     9.815477
416548   -10.984523
416549    32.715477
416550   -13.684523
416551    -6.584523
Name: fat_100g, Length: 416552, dtype: float64

### Sorting data

The entries are sorted by barcode.
We might find it easier to understand the dataset if we sort entries by another criterion.

Sort entries by brand, following the [pandas intro tutorial 07](https://pandas.pydata.org/docs/getting_started/intro_tutorials/07_reshape_table_layout.html).

Sort entries by the Nutri-Score grade.

Sort entries by the Nutri-Score grade and Nova group (together).
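
Sorting on one or several columns can be sketched on a toy table (made-up brands and values):

```python
import pandas as pd

# Hypothetical toy table with made-up values.
toy = pd.DataFrame(
    {
        "brand": ["Zeta", "Alpha", "Mid", "Alpha"],
        "nova_group": [4, 1, 3, 4],
        "energy_100g": [2100, 350, 900, 1500],
    }
)

# sort_values returns a new, sorted DataFrame; the original is left unchanged.
by_energy = toy.sort_values("energy_100g", ascending=False)

# Several keys: rows are sorted on the first column, ties are broken by the second.
by_brand_then_nova = toy.sort_values(["brand", "nova_group"])

print(by_energy)
print(by_brand_then_nova)
```

The row index travels with each row, which is why the index of a sorted DataFrame no longer reads 0, 1, 2...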

### Working with dates

pandas has a specific data type for dates. You can explicitly ask pandas to use this type for specific columns, either during `read_csv` or after (as I did in `load_off`), see the [pandas intro tutorial 09](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html).

This specific data type makes it easy to filter entries by the month of their creation, to know what day of the week an entry was created, or to sort entries by their date of creation.

Sort entries by their date of creation.
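
The date components mentioned above are available through the `.dt` accessor. Here is a minimal sketch on made-up entries:

```python
import pandas as pd

# Hypothetical toy table: dates are first read as plain ISO 8601 strings.
toy = pd.DataFrame(
    {
        "product_name": ["A", "B", "C"],
        "created_datetime": [
            "2018-02-22T10:56:57Z",
            "2020-01-18T19:26:31Z",
            "2018-07-01T08:00:00Z",
        ],
    }
)
# Convert the strings to the pandas datetime type.
toy["created_datetime"] = pd.to_datetime(toy["created_datetime"])

# The .dt accessor exposes the components of each date.
months = toy["created_datetime"].dt.month
weekdays = toy["created_datetime"].dt.day_name()

# Components can be used in conditions, e.g. to keep entries created in 2018.
in_2018 = toy[toy["created_datetime"].dt.year == 2018]

print(weekdays)
print(in_2018)
```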

### Working with textual data

pandas provides a number of functions to process text strings, see the [pandas intro tutorial 10](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html).

Use these functions to select all entries whose list of brands contains "Casino" (this operation is case-sensitive, so mind the initial capital letter!).

You should get 4434 products whose brands contain "Casino".
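
String functions are reached through the `.str` accessor. The sketch below shows a substring test on a toy column with made-up brand names, including a missing value:

```python
import pandas as pd

# Hypothetical toy column; None stands for a missing list of brands.
toy = pd.DataFrame({"brands": ["Alpha,Beta", "beta", None, "Gamma"]})

# str.contains tests each value; na=False counts missing values as non-matching,
# so that the result can be used directly as a row filter.
has_beta = toy["brands"].str.contains("Beta", na=False)

# case=False makes the test case-insensitive.
has_beta_any_case = toy["brands"].str.contains("beta", case=False, na=False)

print(toy[has_beta])
print(toy[has_beta_any_case])
```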

### Wrapping it all together

Select all the products that are in the category for spreads and store this subset in a variable `df_spreads`.

> **HINT** If you can't find the right pattern to look for, take a peek at the spelling of the categories: print the content of the column and browse through the values until you find a suitable value.

You should find 25565 spreads.

Compute the means of the nutritional values for :
* fat,
* saturated fat,
* sugars,
* salt.

You should find the mean values:

* fat = 20.96 g,
* saturated-fat = 7.97 g,
* sugars = 29 g,
* salt = 0.77 g.


For each of these 4 nutritional values, compute the percentage difference between each product and the average of its category, and store the computed values in new columns of `df_spreads` (e.g. `diff-fat_100g`, `diff-sugars_100g` etc.).

Remember that you can find help in the [pandas intro tutorial 05](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html) and [pandas tutorial 06](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html#min-tut-06-stats).
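
One possible reading of this computation, assuming that the difference is expressed relative to the mean, is `(value - mean) / mean * 100`. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical toy category with three made-up products.
toy = pd.DataFrame({"fat_100g": [10.0, 20.0, 30.0]})

# Mean of the column over the category.
mean_fat = toy["fat_100g"].mean()

# Element-wise computation: each value is compared to the same mean.
# Assumption: the difference is expressed as a percentage of the mean.
toy["diff-fat_100g"] = (toy["fat_100g"] - mean_fat) / mean_fat * 100

print(toy)
```

A negative value means that the product contains less fat than the average of the toy category, a positive value that it contains more.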

Note that these values differ from what the OpenFoodFacts website displays when you look at the nutritional values of a product from this category, eg. [Coconut Spread - premium Srikaya - Hey Boo - 227 g](https://world.openfoodfacts.org/product/0608938316165/coconut-spread-premium-srikaya-hey-boo).

In [None]:
df_spreads[df_spreads['code'] == '0608938316165']['diff-fat_100g']

20958    0.213167
Name: diff-fat_100g, dtype: float64

This product contains barely 0.2% more fat than the average spread in our subset, but 20% more than the average spread in the entire OpenFoodFacts dataset (as displayed on the OFF website).


This is because the OpenFoodFacts website uses its entire dataset, whereas we are working on a filtered subset of "reasonably complete" product entries prepared beforehand to keep only products with:

* a non-ambiguous barcode in the EAN-8 or EAN-13 formats;
* a product name;
* brands;
* an image URL for the product;
* a category;
* basic nutritional values.

It seems that, in this "reasonably complete" subset, spreads contain more fat on average than in the whole OpenFoodFacts dataset.

Is the entire OpenFoodFacts dataset closer to the reality of what is on the shelves of supermarkets?
Is our subset more faithful globally? Is it more faithful to the consumer market in certain countries, e.g. France and Spain?

These questions raise the more general problem of [selection bias](https://en.wikipedia.org/wiki/Selection_bias), which lies behind every data analysis and every use of a dataset, e.g. for artificial intelligence systems.

## Bonus exercise : Traffic light labelling

The [traffic light labelling system](https://www.nutrition.org.uk/healthyliving/helpingyoueatwell/324-labels.html?start=3) is used on the [OpenFoodFacts website (French)](https://fr.openfoodfacts.org/reperes-nutritionnels) to display colorful, easier-to-grasp information on 4 nutritional values with a color code:

* fat,
* saturated fat,
* sugars,
* salt.

The OpenFoodFacts dataset does not contain these indicators, but you can recompute them from the [reference table](https://www.nutrition.org.uk/images/cache/7246e7822e0a7588fae60fac0b6c8e7f_w664.png).

Add 4 columns to the dataset, one for each of the 4 relevant nutritional values, that will contain the level (low, medium, high) or color (green, yellow, red) of the traffic light.
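
One way to turn a numeric column into levels is `pd.cut`, which assigns each value to a bin. The sketch below handles a single nutrient on made-up values. The two thresholds (`LOW_MAX`, `MEDIUM_MAX`) are assumptions for the illustration, so check them against the reference table before reusing them.

```python
import pandas as pd

# Hypothetical toy table with made-up fat contents (g per 100 g).
toy = pd.DataFrame({"fat_100g": [0.5, 3.0, 10.0, 17.5, 48.0]})

# Assumed thresholds for fat per 100 g: to be checked against the reference table.
LOW_MAX, MEDIUM_MAX = 3.0, 17.5

# Bins are closed on the right by default: (-inf, 3.0], (3.0, 17.5], (17.5, inf).
toy["fat_level"] = pd.cut(
    toy["fat_100g"],
    bins=[-float("inf"), LOW_MAX, MEDIUM_MAX, float("inf")],
    labels=["low", "medium", "high"],
)

print(toy)
```

The resulting column is an ordered categorical, so it can be sorted and compared like the Nutri-Score grade.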

## To go further

### Python for data science

* [Programming in Python for Data Science](https://prog-learn.mds.ubc.ca/en/)
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)