<a href="https://colab.research.google.com/github/datactivist/scpo-data-science-bootcamp/blob/main/notebooks/2_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tabular data analysis 1 : Loading Open Food Facts data with pandas

In this series of notebooks, we are going to explore the data contained in the OpenFoodFacts database.

## OpenFoodFacts

OpenFoodFacts is an open, crowdsourced database on food products from around the world.

It is produced and managed as a digital commons.

Everyone can contribute data on packaged food products: pictures, ingredients, nutritional values etc.

This database has served as the foundation for many mobile phone apps, especially scanning apps to help customers while grocery shopping.

### Notions

* It is [*open*](https://en.wikipedia.org/wiki/Open_data): Anyone can freely use it, access it, modify it.
* It is [*crowdsourced*](https://en.wikipedia.org/wiki/Crowdsourcing) : Anyone can add new food products to the database, complete or modify existing data.
* It is a [knowledge commons](https://en.wikipedia.org/wiki/Knowledge_commons), a type of [digital commons](https://en.wikipedia.org/wiki/Digital_commons_(economics)).

### Browsing through the dataset

The OpenFoodFacts database is [available online](https://world.openfoodfacts.org/).

Take a few minutes to explore the database through its online interface.

* How is each product described ?
* What types of information are provided ?

### Understanding the dataset

To really understand a dataset, you need to read its documentation so that you are able to answer a set of common, basic questions that will help guide your analysis, such as :

* Who created this dataset and for what purpose ?
* How was the dataset created ?
* What do the instances that comprise the dataset represent (eg. people, companies, events, photos...) ?
* What data does each instance consist of ? Are they "raw" data or (computed) features ?
* Are the instances related in some way ? If so, are there specific fields that enable cross-reference ?

The documentation for a dataset is always written with some purpose, for an intended type of reader, in a certain context, hence it is very likely that you will not find all the answers in the documentation.

Here, you can gather partial information on OFF from :

* the [presentation of the project](https://world.openfoodfacts.org/discover)
* various pages of the [wiki](https://wiki.openfoodfacts.org/Main_Page), mostly :
  * [Data fields](https://wiki.openfoodfacts.org/Data_fields)
  * [Ingredients](https://wiki.openfoodfacts.org/Ingredients)
  * [Quality](https://wiki.openfoodfacts.org/Quality)

#### To go further

* [Datasheets for datasets](https://arxiv.org/pdf/1803.09010.pdf) are a standardized documentation process and format proposed by AI researchers to facilitate the proper (re-)use of datasets and avoid common pitfalls in designing AI components (and ensuing scandals when they exhibit problematic biases in deployment)

Equipped with this new knowledge about the OpenFoodFacts database, you can start the exploratory analysis of the data to gather the missing information to complete your answers, and ask questions of your own.


### OpenFoodFacts as a tabular dataset

The entire set of facts about all the products in the OpenFoodFacts database can be represented as a *tabular dataset*, that is a table of data where :

* each row is a product,
* each column is a field (eg. "brand", "barcode", "energy for 100g"...),
* each cell contains the value of a field for a product.


The simplest and most common format used for tabular datasets is the [CSV format](https://en.wikipedia.org/wiki/Comma-separated_values).
CSV files can be opened in a spreadsheet software such as Microsoft Excel, Apple Numbers or LibreOffice Calc, or just any plain text editor.

The OpenFoodFacts database is [available for download in various formats](https://world.openfoodfacts.org/data), including the CSV format.
Because the whole dataset is too big (the CSV export, uncompressed, weighs more than 4 GB as of 2021-08-16), we will work on a filtered subset of the dataset where we only keep products with :

* a non-ambiguous barcode in the [EAN-8](https://en.wikipedia.org/wiki/EAN-8) or [EAN-13](https://en.wikipedia.org/wiki/International_Article_Number) formats ;
* a product name ;
* brands ;
* an image URL for the product ;
* a category ;
* basic nutritional values.

### Accessing the data



You need two files (csv and txt) that are on the Google Drive of my Sciences Po account :

* [data file (csv)](https://drive.google.com/file/d/14Pyz3Wb-FGs_9H-e7K-4Ug2X31N81Amv/view?usp=sharing),
* [metadata file (txt)](https://drive.google.com/file/d/1EUBD1btT8k4PS073WLUqGm_UucUl4n3P/view?usp=sharing) (column types, so that pandas does not have to guess them).

For **each of these 2 files**:
1. Open the link
2. Click on the "Add shortcut to Drive" button
<center>
<img src="https://github.com/datactivist/scpo-data-science-bootcamp/raw/main/notebooks/img/drive-1.png" width=400>
</center>
3. In the menu, click on "My Drive"
<center>
<img src="https://github.com/datactivist/scpo-data-science-bootcamp/raw/main/notebooks/img/drive-2.png" width=400>
</center>
4. Click on "Add shortcut here"
<center>
<img src="https://github.com/datactivist/scpo-data-science-bootcamp/raw/main/notebooks/img/drive-3.png" width=400>
</center>

This will add shortcuts, in your Sciences Po (Google) drive, to the files stored on Mathieu's Sciences Po (Google) Drive.

Then you need to authorize Colab to access files (here shortcuts) on your Drive.

Execute the next cell: a pop-up will appear asking you to select your Sciences Po (Google) account, then to authorize access.

In [None]:
# enable Colab to access files (here shortcuts) on your Drive
from google.colab import drive
drive.mount('/content/drive')

The files can now be accessed from the shortcuts on your drive.

## The pandas library for tabular data analysis

### Gaining functionalities with libraries

The Python standard library includes a module named [csv](https://docs.python.org/3/library/csv.html#module-csv) that provides very basic support to read and write CSV files.
This module enables you to read and write values, but nothing more.

It gives you no way to :

* rename columns ;
* filter columns, eg. keep only the columns for nutritional values ;
* filter rows, eg. select all products that are categorized as "Sweet spreads" ;
* compute summary statistics on columns across rows, eg. compute the min, max, mean and median of fiber content per 100g ;
* compare columns, eg. test whether they contain the same values ;
* etc.
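To make the contrast concrete, here is a minimal sketch of what working with the `csv` module alone looks like. The tiny table and its values are made up for illustration (they are not taken from OpenFoodFacts), and it is read from a string rather than from a file.

```python
import csv
import io

# A tiny tab-separated table standing in for a real file (made-up values).
raw = "code\tproduct_name\tfat_100g\n0001\tHazelnut spread\t31.0\n0002\tJam\t0.1\n"

# csv.reader yields each row as a plain list of strings.
rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
header, records = rows[0], rows[1:]

# Every value is a string, so even a simple mean has to be written by hand:
# find the column position, convert to float, then sum and divide.
fat_idx = header.index("fat_100g")
fat_values = [float(record[fat_idx]) for record in records]
mean_fat = sum(fat_values) / len(fat_values)

print(header)
print(mean_fat)
```

Each of the operations listed above would need similar hand-written code, which is exactly the work a dedicated library takes over.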

As we saw in the 1st notebook, this can be remedied by using an additional [software library](https://en.wikipedia.org/wiki/Library_(computing)), which is, roughly speaking, a collection of code that provides functionalities to perform operations on a given task or domain.

The most widely used library in Python to work on tabular datasets is [pandas](https://pandas.pydata.org/).

We need to import pandas and, for technical reasons, a specific pandas data type to load [categorical variables with ordered values](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#controlling-behavior).

We will use a custom utility function, `load_off`, to load the OpenFoodFacts dataset and convert a column.

> You do not need to understand or even look at the next cell because this requires a few Python functions and technical notions (file input and output, evaluation) that we could not cover in the first notebook and are beyond the objectives of this bootcamp.
However, feel free to ask Mathieu questions if you are curious !

In [None]:
# (just execute this cell)

# import pandas
import pandas as pd
# we need this data type for ordered categoricals
from pandas.api.types import CategoricalDtype
# lift some limitations in column width, so more cell values are displayed in full
pd.set_option('display.max_colwidth', 110)

# dataset and data type of the columns
OFF_FILE = 'drive/MyDrive/off_products_subset.csv'
DTYPE_FILE = 'drive/MyDrive/dtype.txt'

# custom function to load the Open Food Facts subset
def load_off():
  """Load the filtered subset of OpenFoodFacts.
  
  Returns
  -------
  df : pd.DataFrame
    (A filtered subset of the) OpenFoodFacts tabular dataset.
  """
  # load the data types for the columns
  with open(DTYPE_FILE) as f:
    dtype = eval(f.read())

  # load the dataset
  df = pd.read_csv(OFF_FILE, sep='\t', dtype=dtype)
  # convert columns with datetimes
  for col_name in ('created_datetime', 'last_modified_datetime'):
    # ISO 8601 dates
    df[col_name] = pd.to_datetime(df[col_name])
  #
  return df

# load the dataset
df = load_off()

If all went fine, you do not see anything.
What have we read, really ?
You remember that typing the name of a variable, as the only (or last) line of a notebook cell, prints its value. 

Type the name of the variable containing the dataset, to display the value (content) of that variable.

The dataset is loaded in a pandas DataFrame, a type of object described in the [pandas intro tutorial 01](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html).

By default pandas displays the column headers, the first and last five rows with their row index, the total number of rows and columns.

**Question** How many rows and columns does the table contain in total ?

### First glance at the dataset

You can display the first `n` entries of a DataFrame with the DataFrame method [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), and the last `n` entries with [tail](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail).

> **HINT**: Remember that methods are attached to an object, and are called with the dot notation.

You can call [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) with no parameter.

In [None]:
# (just execute this cell)
df.head()

You can call `head` with a parameter `7` to display the first 7 entries.

You can call [tail](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) with no parameter.

You can call `tail` with a parameter `3` to display the last 3 entries.
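If you are unsure how the parameter behaves, the following sketch may help. It uses a small made-up table named `toy` (an assumption for illustration, not the OFF data), so you can run it independently of `df`.

```python
import pandas as pd

# Made-up table with 10 rows, only to illustrate head and tail.
toy = pd.DataFrame({
    "code": [f"{i:04d}" for i in range(10)],
    "fat_100g": [float(i) for i in range(10)],
})

first_7 = toy.head(7)  # first 7 rows
last_3 = toy.tail(3)   # last 3 rows

print(first_7)
print(last_3)
```

The same calls on `df` follow the same pattern.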

**Hint** Some URLs are longer than the maximal displayed text length for a cell (by default 50 characters, previously raised here to 110). This will make it harder for you to consult the product page on the OFF website.
You can use the `values` attribute to get the complete array of values for a (subset of a) DataFrame, or of a column (Series).

In [None]:
# (just execute this cell)
# display the arrays of values of all fields for the first 2 products
# NB : each entry has 2 URLs : one for the product page, one for its (small-sized) image
df.head(2).values

### About the data table

We can display a summary of the DataFrame with `info`, including for each column its index, name, number of non-null values, and data type (`dtype`).
For more information, you can read the [pandas intro tutorial 02](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html).

`info` also displays the memory usage of the DataFrame.
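As a small sketch on a made-up table (`toy` and its values are assumptions for illustration), this is how `info` can be called, together with two related tools, `shape` and `count`, that return the numbers of rows, columns and non-null values as objects you can reuse.

```python
import pandas as pd

# Made-up table with one missing product name.
toy = pd.DataFrame({
    "product_name": ["Jam", "Honey", None],
    "fat_100g": [0.1, 0.0, 12.5],
})

# Prints the summary: columns, non-null counts, dtypes, memory usage.
toy.info()

# shape is a (rows, columns) tuple; count gives non-null values per column.
n_rows, n_cols = toy.shape
non_null = toy.count()
print(n_rows, n_cols)
print(non_null)
```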

### Selecting subsets

One of the fundamental operations on DataFrames is to be able to filter the dataset on a certain condition, to keep only certain rows or columns.

The basic operators for selection are square brackets `[]`, `loc` and `iloc`, and you can select rows or columns by their position or label, or with a conditional expression on values, see the [pandas intro tutorial 03](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html).

Filter rows in `df` to keep only products with Nutri-Score 'a', and store the result in a variable called `df_nutri_a`.


In [None]:
# (just execute this cell)
df_nutri_a = df[df["nutriscore_grade"] == "a"]
df_nutri_a

You should have 56260 entries.

Now filter rows in `df` to keep products whose quantity of sugars per 100g is higher than 20g, and store the result in a variable called `df_sugar_gt20`.

You should obtain 83891 entries.
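A possible way to write this kind of numeric filter is sketched below on a made-up table. The column name `sugars_100g` is an assumption based on the naming of `fat_100g`; check `df.columns` for the exact name in the dataset.

```python
import pandas as pd

# Made-up products; the column name sugars_100g is assumed.
toy = pd.DataFrame({
    "product_name": ["Jam", "Plain yogurt", "Chocolate spread", "Crispbread"],
    "sugars_100g": [55.0, 4.5, 56.3, 20.0],
})

# The comparison builds a boolean Series (one True/False per row) ...
mask = toy["sugars_100g"] > 20
# ... which, placed inside the square brackets, keeps only the True rows.
toy_sugar_gt20 = toy[mask]
print(toy_sugar_gt20)
```

Note that the comparison is strict: a product with exactly 20 g of sugars is not selected.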

Filter the dataset `df` to keep only the columns corresponding to the :
* barcode,
* url,
* date of creation,
* product name,
* brands,
* categories,
* ingredients text,
* main category,
* Nutri-Score grade,
* Nutri-Score score,
* Nova group.

And store the result in a variable named `df_sel_cols`.
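Selecting several columns works by passing a list of column names inside the square brackets. The sketch below uses a made-up table and only a few column names; the full list for the exercise should be read from `df.columns`.

```python
import pandas as pd

# Made-up table; the real dataset has many more columns.
toy = pd.DataFrame({
    "code": ["0001", "0002"],
    "product_name": ["Jam", "Honey"],
    "brands": ["Brand A", "Brand B"],
    "fat_100g": [0.1, 0.0],
})

# A list of names inside [] keeps these columns, in this order.
cols = ["code", "product_name", "brands"]
toy_sel_cols = toy[cols]
print(toy_sel_cols)
```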

### Making a selection into a proper DataFrame

You can manipulate each of these selections as a DataFrame, but behind the scenes, they may be *views* of the original DataFrame `df` rather than independent copies, and pandas does not always tell you which one you got.
The *view* mechanism avoids unnecessary copies of the dataset, but it is problematic when we really want to extract a subset and perform some operations only on this subset.

For instance, let us select all products in `df` with sugars and fat per 100g greater than 0, and add a column with the sugars to fat ratio.

First, we need to define two filtering conditions and apply them jointly using the [boolean "and" `&`](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing). Store the result in a variable named `df_sugarsfat`.

Now, let us try to add to `df_sugarsfat` a new column named `"sugarsfat_ratio"` with the sugars to fat ratio.

The output seems fine, but it is preceded by a `SettingWithCopyWarning` that tells us we are working on a *view* when we should be working on an independent copy of the subset of the dataframe.

To avoid this warning, we need to turn our selection into an independent dataframe, with the method `copy()`, and store the result in our variable named `df_sugarsfat`:

Then let us add to (our new) `df_sugarsfat` a column named `"sugarsfat_ratio"` with the sugars to fat ratio.

The ratios are the same as before, except we got rid of the big warning, so we must be doing things the *right* way.

We will not go further and we certainly do not expect you to master the [difference between a view and a copy](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy), but at least now you know that if you encounter the big scary warning, you probably need to `copy()` your selection of rows.
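The whole sequence (two conditions joined with `&`, then `copy()`, then a new column) can be sketched as follows on a made-up table. The column name `sugars_100g` is an assumption; `toy` and `toy_sugarsfat` are illustrative names.

```python
import pandas as pd

# Made-up nutritional values; sugars_100g is an assumed column name.
toy = pd.DataFrame({
    "sugars_100g": [10.0, 0.0, 30.0, 5.0],
    "fat_100g": [5.0, 12.0, 0.0, 20.0],
})

# Each condition needs its own parentheses when combined with &.
both_positive = (toy["sugars_100g"] > 0) & (toy["fat_100g"] > 0)

# copy() makes the selection an independent DataFrame.
toy_sugarsfat = toy[both_positive].copy()

# Adding a column now only affects the copy, not the original table.
toy_sugarsfat["sugarsfat_ratio"] = (
    toy_sugarsfat["sugars_100g"] / toy_sugarsfat["fat_100g"]
)
print(toy_sugarsfat)
```

The parentheses matter: `&` binds more tightly than `>`, so omitting them changes the meaning of the expression.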

### Renaming columns

Column names are not always ideal, either because they are not transparent (it is hard for you or an external user to understand what they stand for) or because they would look bad if they were used directly to label the axes of a datavisualization.

pandas provides means to rename columns, see the [pandas intro tutorial 05](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html).

Let us rename each of the columns whose name ends with `_en`.

First, we need to list such columns.

In [None]:
# (just execute this cell)
# list the column names that end with _en
cols_en = [x for x in df.columns if x.endswith("_en")]
cols_en

Now we can `rename` each of the columns ending with `_en`, so as to drop this suffix.

For instance, `main_category_en` should be renamed `main_category`.

We can store the result in a variable named `df_ren_en`.

To see if it worked, let us display the column names in `df_ren_en` and check that our `_en` columns, such as `main_category_en`, have been renamed as expected.
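One way to do this, sketched on a made-up table, is to build a dictionary that maps each old name to its new name and pass it to `rename`. The column `countries_en` is an assumed example name.

```python
import pandas as pd

# Made-up table; countries_en is an assumed column name.
toy = pd.DataFrame({
    "code": ["0001"],
    "main_category_en": ["Spreads"],
    "countries_en": ["France"],
})

cols_en = [c for c in toy.columns if c.endswith("_en")]

# Map each old name to the same name without its "_en" suffix.
mapping = {c: c[: -len("_en")] for c in cols_en}

# rename returns a new DataFrame; the original is left unchanged.
toy_ren_en = toy.rename(columns=mapping)
print(list(toy_ren_en.columns))
```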

### Summary statistics

You can compute various summary statistics that depend on the type of variable in each column, see the [pandas intro tutorial 06](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html).

Compute summary statistics for several columns from different types, and combinations of columns that could provide interesting insights.

For instance, compute the means of the nutritional values in `df` for :
* fat,
* saturated fat,
* sugars,
* salt.
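A possible shape for this computation is sketched below on made-up values. The column names other than `fat_100g` are assumptions that follow the same naming pattern; check `df.columns` before reusing them.

```python
import pandas as pd

# Made-up values; one salt value is missing on purpose.
toy = pd.DataFrame({
    "fat_100g": [10.0, 20.0, 30.0],
    "saturated-fat_100g": [1.0, 2.0, 6.0],
    "sugars_100g": [0.0, 5.0, 10.0],
    "salt_100g": [0.5, 1.0, None],
})

nutri_cols = ["fat_100g", "saturated-fat_100g", "sugars_100g", "salt_100g"]

# mean() on several columns returns one mean per column;
# missing values are skipped by default.
means = toy[nutri_cols].mean()
print(means)
```

Because missing values are skipped, the mean for salt is computed on two products only.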

### Computing on columns

You can manipulate columns in various ways, including with operations that apply element-wise as we saw for NumPy arrays in the first notebook.

You can for instance subtract the mean value of a column from each value in the column.
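As a minimal sketch with made-up values, the subtraction is written once and pandas applies it to every element of the column.

```python
import pandas as pd

# Made-up fat contents.
fat = pd.Series([10.0, 20.0, 30.0], name="fat_100g")

# fat.mean() is a single number; it is subtracted from every value.
centered = fat - fat.mean()
print(centered)
```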

### Sorting data

The entries are sorted by barcode.
We might find it easier to understand the dataset if we sort entries by another criterion.

Sort entries by brand, following the [pandas intro tutorial 07](https://pandas.pydata.org/docs/getting_started/intro_tutorials/07_reshape_table_layout.html), and store the result in a variable named `df_sort_brands`.

Let us look at the brands for the first entries, sorted by brands.

Oddly, only the first few lines have brand names that start with a letter, then brand names start with a special character (`!` or `#`).
This is unexpected, because special characters should appear first.

What happened here ? Let us have a better look at the *values* in the `brands` column of our sorted dataframe `df_sort_brands`.

In [None]:
# (just execute this cell)
df_sort_brands["brands"].head(20).values

In the first entries, the `brands` value starts with a whitespace.
This explains why they were sorted before the entries whose `brands` start with a special character.

Brand names rarely (if ever) start with a whitespace, hence we can assume that whoever added these products made a typing error.
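The effect can be reproduced on a few made-up brand names. The sketch also shows `str.strip`, one possible way to remove leading and trailing whitespace before sorting; this cleaning step is a suggestion, not something the dataset loading does for you.

```python
import pandas as pd

# Made-up brand names; the first one starts with a whitespace.
brands = pd.Series([" Alba", "#Foodie", "!Wow", "Alba"])

# Strings are compared character by character, and the space character
# comes before "!" and "#", which themselves come before letters.
sorted_raw = brands.sort_values()

# Stripping surrounding whitespace first gives the expected order.
sorted_clean = brands.str.strip().sort_values()

print(sorted_raw.tolist())
print(sorted_clean.tolist())
```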

⚠ Datasets contain all sorts of errors and oddities. Datasets released by public agencies or big actors are usually cleaner than crowdsourced datasets, but you should always be cautious.

To confirm our hypothesis and check whether the entries are properly sorted, we can use `iloc` to retrieve entries at arbitrary positions in the DataFrame.

For instance, let us check the entries ranked 4881 to 4899 (or 4900 excluded).

In [None]:
# (just execute this cell)
df_sort_brands["brands"].iloc[4881:4900]

The sorted brands are `Alba`, `Alba torri e sapori`, `Albacore` then `Albalact`, which is what we were expecting.

Sort entries by the Nutri-Score grade, and store the result in a variable named `df_sort_nsgrade`.

Let us check the first 20 entries.

The entries with nutriscore grade 'a' are ranked first, as expected.

Sort entries by the Nutri-Score grade and Nova group (together), and store the result in a variable named `df_sort_nsgrade_novagroup`.

Let us check the first 20 entries.

Products with nutriscore_grade 'a' and nova_group '1' appear first.
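Sorting on several columns takes a list of column names. The sketch below uses a made-up table in which the grade is an ordered categorical, as suggested by the import of `CategoricalDtype` earlier; storing `nova_group` as integers is an assumption made for simplicity.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered categories: "a" < "b" < "c" < "d" < "e".
grade_dtype = CategoricalDtype(categories=["a", "b", "c", "d", "e"], ordered=True)

# Made-up products; nova_group as integers is an assumption.
toy = pd.DataFrame({
    "product_name": ["P1", "P2", "P3", "P4"],
    "nutriscore_grade": pd.Series(["b", "a", "a", "c"], dtype=grade_dtype),
    "nova_group": [1, 4, 1, 3],
})

# Rows are sorted by grade first; ties are broken by nova_group.
toy_sorted = toy.sort_values(by=["nutriscore_grade", "nova_group"])
print(toy_sorted)
```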

### Working with dates

pandas has a specific data type for dates. You can explicitly ask pandas to use this type for specific columns, either during `read_csv` or after (as I did in `load_off`), see the [pandas intro tutorial 09](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html).

This specific data type makes it easy to filter entries by the month of their creation, to know what day of the week an entry was created, or to sort entries by their date of creation.

Sort entries by their date of creation, and store the result in a variable named `df_sort_created`.

Let us check that the first and last entries are as expected.

Display the first entries.

The oldest entries in our dataset date from 2012.

Display the last entries.

The newest entries in our dataset date from 2021-08-15 (when I downloaded the entire dataset).
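The operations mentioned above can be sketched on a made-up table with a datetime column. The dates are invented for illustration; only the column name `created_datetime` comes from the dataset.

```python
import pandas as pd

# Made-up creation dates, converted to the pandas datetime type.
toy = pd.DataFrame({
    "code": ["0001", "0002", "0003"],
    "created_datetime": pd.to_datetime([
        "2021-08-15 10:00:00",
        "2012-03-01 08:30:00",
        "2018-12-24 18:00:00",
    ]),
})

# Sort from oldest to newest.
toy_sort_created = toy.sort_values(by="created_datetime")

# The .dt accessor exposes date components for the whole column.
years = toy_sort_created["created_datetime"].dt.year
created_in_august = toy[toy["created_datetime"].dt.month == 8]
weekdays = toy["created_datetime"].dt.day_name()

print(toy_sort_created)
print(weekdays)
```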

### Working with textual data

pandas provides a number of functions to process text strings, see the [pandas intro tutorial 10](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html).

Use these functions to select all entries whose list of brands contains "Casino" (this operation is case-sensitive, so mind the initial capital letter!), and store the result in a variable named `df_casino`.

You should get 4434 products whose brands contain "Casino".
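A minimal sketch of such a text filter, on made-up brand values, could use `str.contains`. The argument `na=False` is one possible choice to treat missing brands as non-matching.

```python
import pandas as pd

# Made-up brand lists, including a lowercase variant and a missing value.
brands = pd.Series(
    ["Casino", "Casino,Casino Bio", "casino", "Carrefour", None],
    name="brands",
)

# Case-sensitive search for the plain text "Casino";
# na=False makes missing values count as "no match".
mask = brands.str.contains("Casino", regex=False, na=False)
print(brands[mask])
```

The lowercase "casino" is not selected, which illustrates why the capital letter matters.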

### Wrapping it all together

Select all the products that are in the category for spreads and store this subset in a variable `df_spreads`.

> **HINT** If you can't find the right pattern to look for, take a peek at the spelling of the categories: print the content of the column and browse through the values until you find a suitable value.

You should find 25565 spreads.

For these spreads, compute the means of the nutritional values for :
* fat,
* saturated fat,
* sugars,
* salt.

You should find the mean values :

* fat = 20.96 g,
* saturated-fat = 7.97 g,
* sugars = 29 g,
* salt = 0.77 g.


For each of these 4 nutritional values, compute the percentage of difference between each product and the average of its category, and store the computed values as new columns to `df_spreads` (eg. `diff-fat_100g`, `diff-sugars_100g` etc).

Remember that you can find help in the [pandas intro tutorial 05](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html) and [pandas tutorial 06](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html#min-tut-06-stats).
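One possible reading of "percentage of difference" is the difference to the mean, divided by the mean, times 100. The sketch below applies this reading to made-up fat values; `toy_spreads` is an illustrative name.

```python
import pandas as pd

# Made-up spreads.
toy_spreads = pd.DataFrame({
    "code": ["0001", "0002", "0003"],
    "fat_100g": [10.0, 20.0, 30.0],
})

mean_fat = toy_spreads["fat_100g"].mean()

# Relative difference to the category mean, as a percentage.
toy_spreads["diff-fat_100g"] = (
    (toy_spreads["fat_100g"] - mean_fat) / mean_fat * 100
)
print(toy_spreads)
```

A negative value means the product contains less fat than the average of the category, a positive value means more.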

Note that these values differ from what the OpenFoodFacts website displays when you look at the nutritional values of a product from this category, eg. [Coconut Spread - premium Srikaya - Hey Boo - 227 g](https://world.openfoodfacts.org/product/0608938316165/coconut-spread-premium-srikaya-hey-boo).

In [None]:
# (uncomment this line and check the output)
# df_spreads[df_spreads['code'] == '0608938316165']['diff-fat_100g']

This product contains barely 0.2% more fat than the average spreads in our subset `df_spreads`, but 17% more than the average spreads in the entire OpenFoodFacts dataset (as displayed on the OFF website).


This is because the OpenFoodFacts website uses its entire dataset, whereas we are working on a filtered subset of "reasonably complete" product entries prepared beforehand to keep only products with :

* a non-ambiguous barcode in the EAN-8 or EAN-13 formats ;
* a product name ;
* brands ;
* an image URL for the product ;
* a category ;
* basic nutritional values.

It seems that, in this "reasonably complete" subset, spreads contain more fat on average than in the whole OpenFoodFacts dataset.

Is the entire OpenFoodFacts dataset closer to the reality of what is on the shelves of supermarkets ?
Is our subset more faithful globally ? Is it more faithful to the consumer market in certain countries, eg. France and Spain ?

These questions raise the more general problem of [selection bias](https://en.wikipedia.org/wiki/Selection_bias) that lies behind every data analysis and every use of a dataset, eg. to build artificial intelligence systems.

## Bonus exercise : Traffic light labelling

The [traffic light labelling system](https://www.nutrition.org.uk/healthyliving/helpingyoueatwell/324-labels.html?start=3) is used on the [OpenFoodFacts website (French)](https://fr.openfoodfacts.org/reperes-nutritionnels) to display colorful, easier to grasp information on 4 nutritional values with a color code :

* fat,
* saturated fat,
* sugars,
* salt.

The OpenFoodFacts dataset does not contain these indicators, but you can recompute them from the [reference table](https://www.nutrition.org.uk/media/er5n0c3s/capture.png).

Add 4 columns to the dataset, one for each of the 4 relevant nutritional values, that will contain the level (low, medium, high) or color (green, amber, red) of the traffic light.

> **HINT** You can simplify the exercise and express all conditions on the values per 100g (ignoring the rightmost column of the table where thresholds are expressed per portion).

We can use [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html?highlight=loc#pandas.DataFrame.loc), see "Setting values".

In [None]:
# (just execute this cell)
# for fat_100g
df["tl_fat"] = "unknown"
df.loc[df["fat_100g"] <= 3, "tl_fat"] = "green"
df.loc[(df["fat_100g"] > 3) & (df["fat_100g"] <= 17.5), "tl_fat"] = "amber"
df.loc[(df["fat_100g"] > 17.5), "tl_fat"] = "red"

Let us check that the traffic lights for fat are as wanted.

In [None]:
# (just execute this cell)
df[["fat_100g", "tl_fat"]].head(10)

Now you can define the traffic lights for the 3 remaining nutritional values, in columns `"tl_saturated-fat"`, `"tl_sugars"`, `"tl_salt"`. 
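To avoid repeating the same four lines for each nutrient, the logic can be wrapped in a small helper, sketched below. The function `traffic_light` is hypothetical (it is not part of pandas), and the thresholds used for sugars (5 and 22.5 g per 100g) are an assumption: check them, and those for saturated fat and salt, against the reference table linked above.

```python
import pandas as pd

def traffic_light(values, low, high):
    """Return 'green', 'amber', 'red' or 'unknown' for each value.

    Hypothetical helper: values <= low are green, values > high are red,
    values in between are amber, missing values stay 'unknown'.
    """
    lights = pd.Series("unknown", index=values.index)
    lights.loc[values <= low] = "green"
    lights.loc[(values > low) & (values <= high)] = "amber"
    lights.loc[values > high] = "red"
    return lights

# Made-up sugar contents, with one missing value.
toy = pd.DataFrame({"sugars_100g": [2.0, 10.0, 40.0, None]})

# Thresholds are assumptions to verify against the reference table.
toy["tl_sugars"] = traffic_light(toy["sugars_100g"], low=5, high=22.5)
print(toy)
```

Comparisons with a missing value are always false, which is why products without a value keep the label "unknown".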

We can display the traffic lights for the first 10 products, and compare with what the Open Food Facts website displays (remember: you can retrieve URLs from the column `url`).

## To go further

### Python for data science

* [Programming in Python for Data Science](https://prog-learn.mds.ubc.ca/en/)
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)