# Analyzing big datasets

In the previous notebook, we examined the OpenFoodFacts dataset to determine how we could load all of its entries and conduct a full-scale analysis.
We saw that selecting columns and specifying data types, especially for categorical variables, greatly improved the loading speed and reduced the memory usage of the DataFrame.
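As a reminder of why the categorical data type matters, here is a minimal sketch on synthetic data (the values and the column size are made up for illustration, they are not taken from the dataset). It compares the memory footprint of the same column stored as Python strings and as a `category`.

```python
import numpy as np
import pandas as pd

# Synthetic column: 100,000 rows drawn from only three distinct labels
# (labels are illustrative, not read from the OpenFoodFacts file)
rng = np.random.default_rng(0)
values = rng.choice(["Beverages", "Sugary snacks", "unknown"], size=100_000).tolist()

as_object = pd.Series(values, dtype=object)   # one Python string per row
as_category = as_object.astype("category")    # small integer codes + 3 labels

# deep=True also counts the memory used by the strings themselves
obj_bytes = as_object.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)
print(f"object:   {obj_bytes / 1e6:.2f} MB")
print(f"category: {cat_bytes / 1e6:.2f} MB")
```

The gain depends on the number of distinct values: a column where almost every value is unique, such as `product_name`, would not benefit in the same way.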

We can now start to work on the entries themselves.

## Loading the dataset

In [4]:
import pandas as pd
from pandas.api.types import CategoricalDtype

In [5]:
# change to the (absolute or relative) path to the CSV file on your computer
CSV_FILE = '/home/mathieu/datasets/openfoodfacts/2020-09-26/fr.openfoodfacts.org.products.csv'

In [8]:
# load the dataset with specific instructions:
# - subset of columns of interest
keep_cols = ["code", "url", "product_name", "brands", "categories",
             "countries_tags",
             "additives_tags",
             "nutriscore_score", "nutriscore_grade",
             "nova_group",
             "pnns_groups_1", "pnns_groups_2",
             "states",
             "energy-kcal_100g",
             "fat_100g", "saturated-fat_100g", "trans-fat_100g", "cholesterol_100g",
             "carbohydrates_100g", "sugars_100g",
             "fiber_100g",
             "proteins_100g",
             "salt_100g", "sodium_100g",
             "vitamin-a_100g",
             "vitamin-c_100g",
             "calcium_100g",
             "iron_100g",
             "nutrition-score-fr_100g"
]
# - specification of data type for these columns
dtype = {
    # 'code' should be read as a string
    'code': str,
    # ordered categoricals, with explicit list of values
    'nova_group': CategoricalDtype(categories=['1', '2', '3', '4'], ordered=True),
    'nutriscore_grade': CategoricalDtype(categories=['a', 'b', 'c', 'd', 'e'], ordered=True),
    # unordered categoricals, values will be inferred during reading
    'pnns_groups_1': 'category',
    'pnns_groups_2': 'category',
    # we usually don't need to cast the "_100g" columns to 'float16'
}
df = pd.read_csv(CSV_FILE, sep='\t', header=0, usecols=keep_cols, dtype=dtype)
df

Unnamed: 0,code,url,product_name,brands,categories,countries_tags,additives_tags,nutriscore_score,nutriscore_grade,nova_group,...,sugars_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,vitamin-a_100g,vitamin-c_100g,calcium_100g,iron_100g,nutrition-score-fr_100g
0,0000000000017,http://world-fr.openfoodfacts.org/produit/0000...,Vitória crackers,,,en:france,,,,,...,15.0,,7.8,1.40,0.560,,,,,
1,0000000000031,http://world-fr.openfoodfacts.org/produit/0000...,Cacao,,,en:france,,,,,...,,,,,,,,,,
2,000000000003327986,http://world-fr.openfoodfacts.org/produit/0000...,Filetes de pollo empanado,,,en:spain,,,,,...,,,,,,,,,,
3,0000000000100,http://world-fr.openfoodfacts.org/produit/0000...,moutarde au moût de raisin,courte paille,"Epicerie, Condiments, Sauces, Moutardes",en:france,,18.0,d,,...,22.0,0.0,5.1,4.60,1.840,,,,,18.0
4,00000000001111111111,http://world-fr.openfoodfacts.org/produit/0000...,Sfiudwx,Watt,Xsf,en:france,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1469562,9999999900686,http://world-fr.openfoodfacts.org/produit/9999...,Marrons glacés,,"Aliments et boissons à base de végétaux, Alime...",en:belgium,,,,,...,,,,,,,,,,
1469563,9999999901,http://world-fr.openfoodfacts.org/produit/9999...,Scs,,,en:united-kingdom,,,,,...,1.0,,1.0,1.00,0.400,,,,,
1469564,9999999910128,http://world-fr.openfoodfacts.org/produit/9999...,Sandwich club Rillette poisson combava,,,en:reunion,,,,,...,,,,,,,,,,
1469565,9999999990397,http://world-fr.openfoodfacts.org/produit/9999...,Fati,,,en:belgium,,,,,...,0.6,,1.6,0.64,0.256,,,,,
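To see what these `dtype` choices give us, the sketch below reads a tiny, made-up tab-separated table with the same kind of instructions. It assumes nothing about the real file beyond the two column names. Reading `code` as a string preserves leading zeros, and declaring `nutriscore_grade` as an ordered categorical makes comparisons such as "grade `b` or better" possible.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Tiny made-up TSV standing in for the real file (third row has no grade)
toy_tsv = (
    "code\tnutriscore_grade\n"
    "0001\td\n"
    "0002\ta\n"
    "0003\t\n"
    "0004\tb\n"
)
toy_dtype = {
    "code": str,
    "nutriscore_grade": CategoricalDtype(categories=["a", "b", "c", "d", "e"],
                                         ordered=True),
}
toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype=toy_dtype)

# Leading zeros survive because 'code' was read as a string
print(toy["code"].tolist())

# Ordered categoricals support comparisons; missing grades compare as False
b_or_better = toy["nutriscore_grade"] <= "b"
print(toy[b_or_better])

# min() and max() follow the declared order of the categories
best_grade = toy["nutriscore_grade"].min()
```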


Check the memory usage.

In [9]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1469567 entries, 0 to 1469566
Data columns (total 29 columns):
 #   Column                   Non-Null Count    Dtype   
---  ------                   --------------    -----   
 0   code                     1469567 non-null  object  
 1   url                      1469567 non-null  object  
 2   product_name             1402129 non-null  object  
 3   brands                   815139 non-null   object  
 4   categories               750396 non-null   object  
 5   countries_tags           1464327 non-null  object  
 6   additives_tags           375821 non-null   object  
 7   nutriscore_score         588623 non-null   float64 
 8   nutriscore_grade         588623 non-null   category
 9   nova_group               562355 non-null   category
 10  pnns_groups_1            1454256 non-null  category
 11  pnns_groups_2            1468293 non-null  category
 12  states                   1469567 non-null  object  
 13  energy-kcal_100g         11

## Filtering out incomplete entries

Almost all large databases suffer from quality issues: inconsistencies, missing values, errors...
Crowdsourced databases, being filled in by multiple contributors, are particularly prone to these issues.
It is good practice to filter out the entries that we suspect to be incomplete or of low quality, so that they do not act as outliers that distort the analysis or lead to erroneous conclusions.

OpenFoodFacts has an inventory of "states" that describe the level of completion and quality control of an entry.
We will only keep the products whose entries are marked "complete".

Filter the rows to keep only those whose `states` field contains the tag `en:complete`.

In [12]:
# TODO
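One possible approach, sketched here on a small made-up DataFrame rather than on `df`, is to build a boolean mask with `Series.str.contains`. The sketch assumes that `states` holds a comma-separated list of tags; the sample values are illustrative, and the last one (`en:completed`) is a hypothetical tag included only to show the difference between a substring test and an exact tag match.

```python
import pandas as pd

# Made-up entries; the 'states' values only mimic a comma-separated tag list
toy = pd.DataFrame({
    "product_name": ["A", "B", "C", "D", "E"],
    "states": [
        "en:to-be-completed, en:nutrition-facts-completed",
        "en:complete, en:nutrition-facts-completed",
        None,                              # missing value
        "en:to-be-checked, en:complete",
        "en:completed",                    # hypothetical tag, for illustration
    ],
})

# Plain substring test; na=False treats missing values as "not complete"
substring_mask = toy["states"].str.contains("en:complete", regex=False, na=False)

# Stricter variant: 'en:complete' must be a whole tag, delimited by commas
# or by the start/end of the string
strict_mask = toy["states"].str.contains(
    r"(?:^|,\s*)en:complete(?:\s*,|$)", regex=True, na=False
)

# .copy() gives an independent DataFrame that is safe to modify afterwards
complete = toy[strict_mask].copy()
print(complete)
```

On the real data, the same mask could be applied to `df["states"]`. Comparing `substring_mask.sum()` and `strict_mask.sum()` is a quick way to check whether the two variants disagree on any entry.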


## Analyzing the data through data visualizations

Your task is to produce an analysis of this dataset, using data visualizations.
You need to come up with visualizations that help you gain insight, show evidence, or confirm intuitions about single variables, pairs of variables, or pairs, groups, and families of products.

You have seen the basics of matplotlib in the first week of this course.
Matplotlib is very flexible and powerful, but it requires many instructions to create advanced yet relatively standard visualizations.
I recommend that you use [seaborn](https://seaborn.pydata.org/introduction.html), a library built on top of matplotlib that makes it easier to create data visualizations from pandas DataFrames.

If you have not already done so, you can install it from the terminal:
`conda install seaborn`.

You can find inspiration by looking at [seaborn's example gallery](https://seaborn.pydata.org/examples/index.html).
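As a starting point, here is a minimal sketch of a seaborn plot built directly from a DataFrame. The data are made up for illustration. Only the column names `nutriscore_grade` and `sugars_100g` come from the dataset, and a box plot is just one possible choice among many.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pandas.api.types import CategoricalDtype

# Made-up values, only to show the plotting call
grade_dtype = CategoricalDtype(categories=["a", "b", "c", "d", "e"], ordered=True)
toy = pd.DataFrame({
    "nutriscore_grade": pd.Series(
        ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"], dtype=grade_dtype
    ),
    "sugars_100g": [1.0, 3.5, 4.0, 8.0, 10.0, 15.0, 18.0, 30.0, 35.0, 50.0],
})

# One box per grade; seaborn reads the columns by name from the DataFrame
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=toy, x="nutriscore_grade", y="sugars_100g", ax=ax)
ax.set_title("Sugars per 100 g by Nutri-Score grade (made-up values)")
```

Because `nutriscore_grade` is an ordered categorical, the boxes should appear in the order of the declared categories rather than in order of first appearance.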

In [13]:
# TODO
