# Tabular data analysis 1 : Loading Open Food Facts data with pandas

In this series of notebooks, we are going to explore the data contained in the OpenFoodFacts database.

## OpenFoodFacts

OpenFoodFacts is an open, crowdsourced database on food products from around the world.

It is produced and managed as a digital commons.

Everyone can contribute data on packaged food products: pictures, ingredients, nutritional values etc.

This database has served as the foundation for many mobile phone apps, especially scanning apps to help customers while grocery shopping.

### Notions

* It is [*open*](https://en.wikipedia.org/wiki/Open_data): anyone can freely access, use and modify it.
* It is [*crowdsourced*](https://en.wikipedia.org/wiki/Crowdsourcing): anyone can add new food products to the database, or complete or modify existing data.
* It is a [knowledge commons](https://en.wikipedia.org/wiki/Knowledge_commons), a type of [digital commons](https://en.wikipedia.org/wiki/Digital_commons_(economics)).

The OpenFoodFacts database is [available online](https://world.openfoodfacts.org/).

Take a few minutes to explore the database through its online interface.

* How is each product described?
* What types of information are provided?
* How complete is the information?
* Is the information up to date?

### OpenFoodFacts as a tabular dataset

The entire set of facts about all the products in the OpenFoodFacts database can be represented as a *tabular dataset*, that is, a table of data where:

* each row is a product,
* each column is a field (e.g. "brand", "barcode", "energy per 100 g"...),
* each cell contains the value of a field for a product.

The simplest and most common format used for tabular datasets is the [CSV format](https://en.wikipedia.org/wiki/Comma-separated_values).
CSV files can be opened in spreadsheet software such as Microsoft Excel, Apple Numbers or LibreOffice Calc, or in any plain text editor.
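
To make this concrete, here is a minimal sketch of a tiny tabular dataset written as CSV text and parsed with Python's built-in `csv` module. The three products, their values and the column names are invented for illustration; they are not taken from the OpenFoodFacts database.

```python
import csv
import io

# Invented sample data: one header line, then one line per product.
csv_text = """barcode,brand,energy_100g
0000000000001,Brand A,2252
0000000000002,Brand B,1500
0000000000003,Brand C,180
"""

# DictReader uses the header line to name the fields of each row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(len(rows))             # number of rows (products)
print(list(rows[0].keys()))  # column names (fields)
print(rows[1]["brand"])      # one cell: the brand of the second product
```

Each line of the text is a row, each comma-separated position is a column, and each individual value is a cell.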

The OpenFoodFacts database is [available for download in various formats](https://world.openfoodfacts.org/data), including CSV. Because the whole database is very large (more than 4 GB as of 2021-08-16), we will work on a filtered version of this CSV file (more on that below).

## The pandas library for tabular data analysis

### Gaining functionalities with libraries

The Python standard library includes a module named `csv` that provides basic support for reading and writing CSV files.
This module lets you read and write values, but little more.

It gives you no built-in way to:

* rename columns;
* filter columns, e.g. keep only the columns for nutritional values;
* filter rows, e.g. select all products that are categorized as "Sweet spreads";
* compute summary statistics on columns across rows, e.g. compute the min, max, mean and median of fiber content per 100 g;
* compare columns, e.g. test whether they contain the same values;
* etc.
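
As a rough illustration of how much work is left to you, the sketch below computes the mean fiber content of the "Sweet spreads" products using only the `csv` module. The data and the column names (`category`, `fiber_100g`) are invented for this example.

```python
import csv
import io

# Invented sample data; the column names are hypothetical.
csv_text = """product,category,fiber_100g
Spread A,Sweet spreads,3.4
Spread B,Sweet spreads,1.6
Soda C,Beverages,0.0
"""

reader = csv.DictReader(io.StringIO(csv_text))

# The csv module returns every value as a string, so filtering rows
# and converting numbers both have to be done by hand.
fiber_values = [
    float(row["fiber_100g"])
    for row in reader
    if row["category"] == "Sweet spreads"
]

mean_fiber = sum(fiber_values) / len(fiber_values)
print(round(mean_fiber, 2))
```

Every additional operation (median, renaming, comparing columns) would need a similar hand-written loop.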

This can be remedied by using an additional [software library](https://en.wikipedia.org/wiki/Library_(computing)), which is, roughly speaking, a collection of code that provides functionality for a given task or domain.
You may have heard of, or will hear about, libraries dedicated to machine learning such as [scikit-learn](https://en.wikipedia.org/wiki/Scikit-learn), or Google's [TensorFlow](https://en.wikipedia.org/wiki/TensorFlow) for building neural networks.

The most widely used library in Python to work on tabular datasets is [pandas](https://pandas.pydata.org/).

### Importing a library

In order to use a library, you need to (1) install it and (2) import it.
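
If you are unsure whether a library is installed in your environment, one possible check (a sketch, not the only approach) uses `importlib.util.find_spec` from the standard library. The helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


print(is_installed("csv"))     # a standard library module
print(is_installed("pandas"))  # depends on your environment
```

If a library is missing, it is typically installed with a package manager such as `pip`, for example `pip install pandas`.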

Because the pandas library is already installed in Google Colaboratory, here you only need to import it by executing the following cell.

In [1]:
import pandas as pd

We imported the pandas library and assigned it to a new namespace, `pd`.

In order to use a pandas function such as [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), you therefore need to prefix it with `pd.`, for example

`df = pd.read_csv("myfile.csv")`.
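
As a small sketch of what this looks like in practice, the example below reads an invented three-row CSV from an in-memory string (instead of a file on disk) and performs a few of the operations listed earlier. The data and column names are hypothetical.

```python
import io

import pandas as pd

# Invented sample data; the column names are hypothetical.
csv_text = """product,category,fiber_100g
Spread A,Sweet spreads,3.4
Spread B,Sweet spreads,1.6
Soda C,Beverages,0.0
"""

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# Rename a column.
df = df.rename(columns={"fiber_100g": "fiber_per_100g"})

# Filter rows: keep only the products categorized as "Sweet spreads".
spreads = df[df["category"] == "Sweet spreads"]

# Compute summary statistics on a column.
stats = spreads["fiber_per_100g"].agg(["min", "max", "mean", "median"])
print(stats)
```

Note that every pandas function and class is reached through the `pd.` prefix, while methods such as `rename` or `agg` are called directly on the objects pandas returns.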


Namespaces help avoid name conflicts between functions in the pandas library and those in the Python standard library, e.g. `sum` or `mean`.
By default, the namespace of an imported library is named after its root module, here `pandas`.
The `import ... as ...` syntax lets you define a shorter name for the namespace, so that you can type `pd` rather than `pandas` every time.
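
The sketch below illustrates both points using the standard library `math` module as a stand-in, so that it runs without any extra installation: the built-in `pow` and `math.pow` share a name but behave differently, and the namespace keeps them apart. The alias `m` is arbitrary.

```python
import math
import math as m  # the same module, bound to a shorter name

# Two different functions named "pow": the namespace tells them apart.
print(pow(2, 3))       # built-in function: returns the integer 8
print(math.pow(2, 3))  # math module function: returns the float 8.0

# The alias refers to the very same module object.
print(m is math)
print(m.sqrt(16))
```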

If you read notebooks that use other libraries, you will notice recurring naming conventions for some of them:

* `import numpy as np`
* `import seaborn as sns`
* `import matplotlib.pyplot as plt`
* etc.

## To go further

* The [csv module of the Python standard library](https://docs.python.org/3.7/library/csv.html)
* [Python packages](https://docs.python.org/3.7/tutorial/modules.html#packages) and the dot notation