# Prepare data for a public health agency

## 0 - Project brief

The agency Santé publique France wants to improve its Open Food Facts database and has called on your company's services. This open-source database is made available to individuals and organizations so that they can find out the nutritional quality of products.

Today, adding a product to the Open Food Facts database requires filling in many text and numeric fields, which can lead to data-entry errors and missing values in the database.

Santé publique France has asked your company to build a suggestion or auto-completion system that helps users fill in the database more efficiently. As a first step, you need to focus on getting to grips with the data, starting with cleaning and exploring it.

To simplify your approach, I suggest you start by establishing whether it is feasible to suggest missing values for a variable in which more than 50% of the values are missing.

The steps for cleaning and exploring the data are as follows:

1) Process the dataset

- Identify the variables that are relevant for the upcoming processing and needed to suggest missing values.
- Clean the data by:
  - highlighting any missing values among the selected relevant variables, with at least 3 treatment methods suited to the variables concerned,
  - identifying and treating any outliers in each variable.
- Automate these steps so that the operations do not have to be repeated.
- Note that the client wants the program to keep working if the database is slightly modified (for example, when entries are added).

2) Throughout the analysis, produce visualizations to better understand the data. Carry out a univariate analysis of each variable of interest in order to summarize its behavior.

## 1 - Prepare the dataframe

In [4]:
import numpy as np

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import ScalarFormatter
from matplotlib.ticker import FuncFormatter
import scipy
from scipy import stats
import scipy.stats as st

import statsmodels
import statsmodels.api as sm
#!pip install missingno
import missingno as msno

print("numpy version", np.__version__)
print("pandas version", pd.__version__)
print("matplotlib version", matplotlib.__version__)
print("seaborn version", sns.__version__)
print("scipy version", scipy.__version__)
print("statsmodels version", statsmodels.__version__)
print("missingno version", msno.__version__)

pd.options.display.max_rows = 200
pd.options.display.max_columns = 100

numpy version 1.26.4
pandas version 2.1.4
matplotlib version 3.8.0
seaborn version 0.13.2
scipy version 1.11.4
statsmodels version 0.14.0
missingno version 0.5.2


In [5]:
# Load the data file
data = pd.read_csv("C:/Users/admin/Documents/Projets/Projet_3/openfoodfacts/fr.openfoodfacts.org.products.csv", sep='\t', low_memory=False)
data.head()

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_fr,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_fr,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_fr,ingredients_text,allergens,allergens_fr,traces,traces_tags,traces_fr,serving_size,no_nutriments,additives_n,additives,additives_tags,additives_fr,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,...,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
0,3087,http://world-fr.openfoodfacts.org/produit/0000...,openfoodfacts-contributors,1474103866,2016-09-17T09:17:46Z,1474103893,2016-09-17T09:18:13Z,Farine de blé noir,,1kg,,,Ferme t'y R'nao,ferme-t-y-r-nao,,,,,,,,,,,,,,,,,,en:FR,en:france,France,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,4530,http://world-fr.openfoodfacts.org/produit/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Banana Chips Sweetened (Whole),,,,,,,,,,,,,,,,,,,,,,,,US,en:united-states,États-Unis,"Bananas, vegetable oil (coconut oil, corn oil ...",,,,,,28 g (1 ONZ),,0.0,[ bananas -> en:bananas ] [ vegetable-oil -...,,,0.0,,,0.0,...,3.57,,,,0.0,0.0,,0.0,,,,,0.0214,,,,,,,,,,,,,,0.0,,0.00129,,,,,,,,,,,,,,,,,,14.0,14.0,,
2,4559,http://world-fr.openfoodfacts.org/produit/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Peanuts,,,,,Torn & Glasser,torn-glasser,,,,,,,,,,,,,,,,,,US,en:united-states,États-Unis,"Peanuts, wheat flour, sugar, rice flour, tapio...",,,,,,28 g (0.25 cup),,0.0,[ peanuts -> en:peanuts ] [ wheat-flour -> ...,,,0.0,,,0.0,...,17.86,,,,0.635,0.25,,0.0,,,,,0.0,,,,,,,,,,,,,,0.071,,0.00129,,,,,,,,,,,,,,,,,,0.0,0.0,,
3,16087,http://world-fr.openfoodfacts.org/produit/0000...,usda-ndb-import,1489055731,2017-03-09T10:35:31Z,1489055731,2017-03-09T10:35:31Z,Organic Salted Nut Mix,,,,,Grizzlies,grizzlies,,,,,,,,,,,,,,,,,,US,en:united-states,États-Unis,"Organic hazelnuts, organic cashews, organic wa...",,,,,,28 g (0.25 cup),,0.0,[ organic-hazelnuts -> en:organic-hazelnuts ...,,,0.0,,,0.0,...,17.86,,,,1.22428,0.482,,,,,,,,,,,,,,,,,,,,,0.143,,0.00514,,,,,,,,,,,,,,,,,,12.0,12.0,,
4,16094,http://world-fr.openfoodfacts.org/produit/0000...,usda-ndb-import,1489055653,2017-03-09T10:34:13Z,1489055653,2017-03-09T10:34:13Z,Organic Polenta,,,,,Bob's Red Mill,bob-s-red-mill,,,,,,,,,,,,,,,,,,US,en:united-states,États-Unis,Organic polenta,,,,,,35 g (0.25 cup),,0.0,[ organic-polenta -> en:organic-polenta ] [...,,,0.0,,,0.0,...,8.57,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [6]:
data.shape

(320772, 162)

In [7]:
data.dtypes

code                                           object
url                                            object
creator                                        object
created_t                                      object
created_datetime                               object
last_modified_t                                object
last_modified_datetime                         object
product_name                                   object
generic_name                                   object
quantity                                       object
packaging                                      object
packaging_tags                                 object
brands                                         object
brands_tags                                    object
categories                                     object
categories_tags                                object
categories_fr                                  object
origins                                        object
origins_tags                

In [8]:
data.describe()

Unnamed: 0,no_nutriments,additives_n,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,nutrition_grade_uk,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,butyric-acid_100g,caproic-acid_100g,caprylic-acid_100g,capric-acid_100g,lauric-acid_100g,myristic-acid_100g,palmitic-acid_100g,stearic-acid_100g,arachidic-acid_100g,behenic-acid_100g,lignoceric-acid_100g,cerotic-acid_100g,montanic-acid_100g,melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,alpha-linolenic-acid_100g,eicosapentaenoic-acid_100g,docosahexaenoic-acid_100g,omega-6-fat_100g,linoleic-acid_100g,arachidonic-acid_100g,gamma-linolenic-acid_100g,dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,oleic-acid_100g,elaidic-acid_100g,gondoic-acid_100g,mead-acid_100g,erucic-acid_100g,nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,sucrose_100g,glucose_100g,fructose_100g,...,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
count,0.0,248939.0,248939.0,0.0,248939.0,0.0,0.0,261113.0,857.0,243891.0,229554.0,0.0,0.0,1.0,2.0,4.0,1.0,1.0,1.0,24.0,23.0,0.0,0.0,1.0,0.0,22823.0,22859.0,841.0,186.0,38.0,78.0,188.0,149.0,8.0,24.0,23.0,21.0,13.0,0.0,14.0,0.0,0.0,0.0,143298.0,144090.0,243588.0,244971.0,72.0,26.0,38.0,...,259922.0,27.0,16.0,9.0,255510.0,255463.0,4133.0,137554.0,34.0,7057.0,1340.0,918.0,140867.0,11154.0,10815.0,11729.0,6784.0,5240.0,3042.0,5300.0,330.0,2483.0,38.0,81.0,24748.0,158.0,141050.0,5845.0,140462.0,6253.0,3929.0,2106.0,1620.0,79.0,1168.0,20.0,11.0,259.0,78.0,29.0,49.0,3036.0,165.0,948.0,0.0,268.0,221210.0,221210.0,0.0,0.0
mean,,1.936024,0.019659,,0.055246,,,1141.915,585.501214,12.730379,5.129932,,,7.4,6.04,36.136182,18.9,8.1,3.0,10.752667,10.673913,,,61.0,,10.425055,6.312493,3.182103,2.250285,3.186553,1.635462,16.229144,3.823819,0.057,0.153842,0.061567,40.192857,25.123077,,1.357143e-06,,,,0.073476,0.020071,32.073981,16.003484,11.841667,2.878846,25.897368,...,7.07594,4.658148,2.50625,0.021678,2.028624,0.798815,7.838105,0.000397,0.518715,8e-06,0.056705,0.034219,0.023367,0.325574,0.259007,0.020303,0.023378,0.006898,0.205856,8.938696e-05,0.12129,0.072138,0.013123,0.119052,0.424635,0.092638,0.125163,0.617282,0.003652,0.534143,0.00795,0.025794,0.003014,0.012161,0.003126,0.00169,0.000401,0.000427,1.594563,0.145762,6.425698,31.458587,15.412121,49.547785,,341.700764,9.165535,9.058049,,
std,,2.502019,0.140524,,0.269207,,,6447.154,712.809943,17.578747,8.014238,,,,0.226274,24.101433,,,,4.019993,3.379647,,,,,17.076167,10.832591,5.607988,7.971418,13.927752,1.978192,17.512632,6.494183,0.025534,0.02916,0.010597,25.175674,26.010496,,4.972452e-07,,,,1.540223,0.358062,29.731719,22.327284,13.993859,6.290341,30.015451,...,8.409054,2.97634,2.186769,0.003072,128.269454,50.504428,10.959671,0.073278,2.561144,0.00036,0.694039,1.031398,2.236451,2.474306,1.277026,0.339262,1.206822,0.335163,5.13225,0.005514738,0.737912,1.47512,0.04066,0.189486,12.528768,0.149725,3.318263,12.05809,0.214408,13.498653,0.080953,0.914247,0.028036,0.067952,0.104503,0.006697,0.001118,0.001285,6.475588,0.172312,2.047841,31.967918,3.753028,18.757932,,425.211439,9.055903,9.183589,,
min,,0.0,0.0,,0.0,,,0.0,0.0,0.0,0.0,,,7.4,5.88,0.04473,18.9,8.1,3.0,0.064,5.2,,,61.0,,0.0,0.0,0.0,0.0,0.05,0.041,0.05,0.09,0.007,0.095,0.043307,1.0,1.08,,1e-06,,,,-3.57,0.0,0.0,-17.86,0.0,0.0,0.0,...,-800.0,0.92,0.3,0.0155,0.0,0.0,0.0,-0.00034,0.0,0.0,0.0,0.0,-0.0021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.0,-2.0,8e-06,6e-06,0.0,2e-06,0.0,0.0,-0.00026,0.0,0.0,-6.896552,0.0,0.0,-2e-06,7e-06,5e-06,0.0,0.0,0.0018,0.0,0.0,8.0,6.0,,0.0,-15.0,-15.0,,
25%,,0.0,0.0,,0.0,,,377.0,49.4,0.0,0.0,,,7.4,5.96,34.661183,18.9,8.1,3.0,7.275,7.1,,,61.0,,0.0,0.0,0.6,0.06875,0.2,0.1715,1.85,0.437,0.04625,0.13,0.051181,27.0,6.9,,1.0625e-06,,,,0.0,0.0,6.0,1.3,2.1,0.2,0.95,...,0.7,2.15,0.3,0.0216,0.0635,0.025,0.0,0.0,0.001225,1e-06,0.0018,6e-06,0.0,0.004,0.00272,0.003077,0.000255,3e-05,4.2e-05,7.2e-07,6e-06,0.000417,0.0015,0.01732,0.107,0.001368,0.0,0.094,0.0,0.021,0.00115,0.000177,0.0,1.8e-05,5e-06,1.1e-05,2e-05,1.5e-05,0.0155,0.035,6.3,0.0,12.0,32.0,,98.75,1.0,1.0,,
50%,,1.0,0.0,,0.0,,,1100.0,300.0,5.0,1.79,,,7.4,6.04,47.6,18.9,8.1,3.0,12.85,12.6,,,61.0,,4.0,2.22,1.8,0.1175,0.5,0.8,10.05,0.647,0.061,0.155,0.062992,31.0,11.0,,1.25e-06,,,,0.0,0.0,20.6,5.71,8.1,0.25,13.0,...,4.76,4.2,1.95,0.022,0.58166,0.229,5.0,0.0,0.005261,1e-06,0.0052,2.5e-05,0.0,0.012,0.014167,0.005357,0.0007,5.2e-05,0.000114,1.92e-06,1.4e-05,0.001639,0.003185,0.036,0.18,0.0146,0.035,0.206,0.00101,0.075,0.0037,0.000417,0.001,6e-05,2.2e-05,2.3e-05,3.9e-05,3.4e-05,0.021,0.039,7.2,23.0,15.0,50.0,,195.75,10.0,9.0,,
75%,,3.0,0.0,,0.0,,,1674.0,898.0,20.0,7.14,,,7.4,6.12,49.075,18.9,8.1,3.0,13.375,13.05,,,61.0,,10.71,7.14,3.2,0.604,0.71575,3.3,23.0,3.6,0.0685,0.17,0.066929,68.0,40.5,,1.25e-06,,,,0.0,0.02,58.33,24.0,16.15,1.4,52.075,...,10.0,7.1,4.75,0.024,1.37414,0.541,12.0,0.000107,0.14,3e-06,0.012,0.0001,0.0037,0.402,0.3,0.009091,0.0015,7e-05,0.000214,4.69e-06,4.1e-05,0.004167,0.0071,0.143,0.341,0.07175,0.106,0.357,0.0024,0.141,0.0075,0.001,0.002,0.00045,6.1e-05,6.8e-05,7.4e-05,0.000103,0.043,0.4,7.4,51.0,15.0,64.25,,383.2,16.0,16.0,,
max,,31.0,2.0,,6.0,,,3251373.0,3830.0,714.29,550.0,,,7.4,6.2,49.3,18.9,8.1,3.0,15.4,14.6,,,61.0,,557.14,98.0,60.0,75.0,85.0,12.0,71.0,25.0,0.09,0.2032,0.08,75.0,76.0,,2.5e-06,,,,369.0,95.238,2916.67,3520.0,92.8,23.2,101.0,...,430.0,10.7,5.8,0.025,64312.8,25320.0,97.9,26.7,15.0,0.03,15.1,31.25,716.9811,161.0,42.5,21.428571,92.6,23.076923,178.571429,0.4,6.0,60.0,0.25,1.25,1870.37,0.589,694.737,559.459,50.0,657.143,4.0,37.6,0.7,0.56,3.571429,0.03,0.00376,0.0147,42.28,0.423,8.4,100.0,25.0,100.0,,2842.0,40.0,40.0,,


## 2 - Clean and filter features and products

- Identify my target according to the required criteria.
- Set up a clear, automated process for filtering the features (variables / columns) and the products (rows) that will be used to reach the project's goal.

**List all the features in the file, whether quantitative (numeric) or qualitative (categorical).**

In [11]:
features = data.columns
for feature in features:
    print(feature)

code
url
creator
created_t
created_datetime
last_modified_t
last_modified_datetime
product_name
generic_name
quantity
packaging
packaging_tags
brands
brands_tags
categories
categories_tags
categories_fr
origins
origins_tags
manufacturing_places
manufacturing_places_tags
labels
labels_tags
labels_fr
emb_codes
emb_codes_tags
first_packaging_code_geo
cities
cities_tags
purchase_places
stores
countries
countries_tags
countries_fr
ingredients_text
allergens
allergens_fr
traces
traces_tags
traces_fr
serving_size
no_nutriments
additives_n
additives
additives_tags
additives_fr
ingredients_from_palm_oil_n
ingredients_from_palm_oil
ingredients_from_palm_oil_tags
ingredients_that_may_be_from_palm_oil_n
ingredients_that_may_be_from_palm_oil
ingredients_that_may_be_from_palm_oil_tags
nutrition_grade_uk
nutrition_grade_fr
pnns_groups_1
pnns_groups_2
states
states_tags
states_fr
main_category
main_category_fr
image_url
image_small_url
energy_100g
energy-from-fat_100g
fat_100g
saturated-fat_100g
butyr
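
The listing above gives the feature names but does not separate quantitative from qualitative ones. A minimal sketch of one way to do that split with `select_dtypes` follows. The `toy` dataframe and its values are made up for illustration. Note that the column dtype is only a first approximation: in the `dtypes` output above, a column such as `created_t` is read as `object` even though it holds timestamps.

```python
import pandas as pd

# Toy frame standing in for the Open Food Facts data (values are illustrative only).
toy = pd.DataFrame({
    "product_name": ["Peanuts", "Organic Polenta", None],
    "pnns_groups_1": ["Salty snacks", None, "unknown"],
    "energy_100g": [2389.0, 1506.0, None],
    "additives_n": [0.0, 0.0, 1.0],
})

# Numeric dtypes are treated as quantitative, everything else as qualitative.
quantitative = toy.select_dtypes(include="number").columns.tolist()
qualitative = toy.select_dtypes(exclude="number").columns.tolist()

print("quantitative:", quantitative)
print("qualitative:", qualitative)
```

On the real data, the same two calls would be applied to `data` instead of `toy`.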

**Choose a target (a feature with fewer than 50% of its values present). A categorical target will very likely be easier to handle than a quantitative one in this project.**

In [13]:
list_sort = data.isna().mean().sort_values()
list_sort

last_modified_t                               0.000000
last_modified_datetime                        0.000000
creator                                       0.000006
created_t                                     0.000009
created_datetime                              0.000028
code                                          0.000072
url                                           0.000072
states                                        0.000143
states_tags                                   0.000143
states_fr                                     0.000143
countries_fr                                  0.000873
countries                                     0.000873
countries_tags                                0.000873
product_name                                  0.055373
brands                                        0.088574
brands_tags                                   0.088599
energy_100g                                   0.185986
proteins_100g                                 0.189699
salt_100g 

In [14]:
#msno.bar(data)
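
The missing-value rates can also be used to shortlist candidate targets automatically, instead of reading the sorted list by eye. The sketch below is one possible way to do it, assuming that "categorical" can be approximated by "non-numeric dtype". The helper name `candidate_targets`, its `min_missing` parameter and the `toy` data are hypothetical.

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "product_name": ["a", "b", "c", "d"],
    "pnns_groups_1": ["Beverages", None, None, None],
    "generic_name": [None, None, None, "x"],
    "energy_100g": [1.0, None, None, None],
})

def candidate_targets(df, min_missing=0.5):
    """Return the non-numeric columns whose missing rate exceeds min_missing."""
    missing_rate = df.isna().mean()
    categorical = df.select_dtypes(exclude="number").columns
    candidates = missing_rate[categorical]
    return candidates[candidates > min_missing].sort_values()

shortlist = candidate_targets(toy)
print(shortlist)
```

The shortlist still has to be reviewed by hand: a column can be mostly empty and categorical without being a meaningful target.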

In [15]:
# Keep the rows where 'pnns_groups_1' is not NaN
filtered_data = data[~data['pnns_groups_1'].isna()]

# Extract the distinct values of 'pnns_groups_1' from the filtered rows
list_pnns_groups_1 = filtered_data['pnns_groups_1'].unique()

print("List of values:", list_pnns_groups_1)
print("Number of distinct values:", list_pnns_groups_1.size)

List of values: ['unknown' 'Fruits and vegetables' 'Sugary snacks' 'Cereals and potatoes'
 'Composite foods' 'Fish Meat Eggs' 'Beverages' 'Fat and sauces'
 'fruits-and-vegetables' 'Milk and dairy products' 'Salty snacks'
 'sugary-snacks' 'cereals-and-potatoes' 'salty-snacks']
Number of distinct values: 14


The variable "pnns_groups_1" looks like a good target: it takes 14 distinct values, and 71% of its values are missing.

However, some categories appear twice under different spellings, for example 'Cereals and potatoes' and 'cereals-and-potatoes'; these need to be merged.

In [17]:
# Map the hyphenated lowercase variants to their canonical spelling
label_fixes = {
    'cereals-and-potatoes': 'Cereals and potatoes',
    'salty-snacks': 'Salty snacks',
    'fruits-and-vegetables': 'Fruits and vegetables',
    'sugary-snacks': 'Sugary snacks',
}
data['pnns_groups_1'] = data['pnns_groups_1'].replace(label_fixes)

data.shape

(320772, 162)
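
Hard-coding each variant works for the four duplicates seen here, but a new hyphenated variant in an updated file would slip through. A minimal sketch of a rule-based alternative follows. It assumes that every variant is a lowercase, hyphen-separated slug of a label written with spaces and an initial capital, which holds for the values printed above but is not guaranteed for future data. The helper name `normalize_label` is hypothetical.

```python
import pandas as pd

def normalize_label(value):
    """Turn 'sugary-snacks' style slugs into 'Sugary snacks' style labels."""
    # Only touch strings that look like slugs: hyphens and no spaces.
    if isinstance(value, str) and "-" in value and " " not in value:
        text = value.replace("-", " ")
        return text[:1].upper() + text[1:]
    # Everything else (missing values, 'unknown', already clean labels) is kept.
    return value

# The 14 distinct values observed in the notebook, plus one missing value.
labels = pd.Series([
    "unknown", "Fruits and vegetables", "Sugary snacks", "Cereals and potatoes",
    "Composite foods", "Fish Meat Eggs", "Beverages", "Fat and sauces",
    "fruits-and-vegetables", "Milk and dairy products", "Salty snacks",
    "sugary-snacks", "cereals-and-potatoes", "salty-snacks", None,
])

cleaned = labels.map(normalize_label)
print(sorted(cleaned.dropna().unique()))
```

The value 'unknown' is deliberately left unchanged so that it can still be detected and treated as missing in the next step.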

**Remove the rows (products) that have no target value.**

Set the target to NaN for products whose value is 'unknown'.

In [20]:
# `df` does not exist yet at this point, so the replacement is done on `data`
mask = data['pnns_groups_1'] == 'unknown'
print('number of products to drop with the value "unknown":', data.loc[mask].shape[0])
# set the target variable to NaN for products whose value is 'unknown'
data.loc[mask, 'pnns_groups_1'] = np.nan
data.shape

In [None]:
df = data[~data['pnns_groups_1'].isna()].copy()
df.shape

**Remove duplicate products.**

Two products are treated as duplicates when they have the same product name (`product_name`).

In [None]:
# add a column holding the number of missing values in each row
df['NB_NAN'] = df.isna().sum(axis=1)
# sort by number of missing values
df = df.sort_values('NB_NAN')
# among duplicates, keep the row with the fewest missing values
df = df.drop_duplicates('product_name', keep='first')
# drop the helper column, which is no longer needed
df = df.drop('NB_NAN', axis=1)

In [None]:
df.shape
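
One side effect of deduplicating on `product_name` is worth checking: pandas treats missing values as equal to one another in `drop_duplicates`, so all products without a name collapse into a single row. Whether unnamed products should be kept is a project decision. The sketch below assumes they should, and deduplicates only the rows that actually have a name. The `toy` data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "product_name": ["Peanuts", "Peanuts", None, None],
    "energy_100g": [None, 2389.0, 100.0, 200.0],
    "fat_100g": [None, 49.0, 1.0, 2.0],
})

# Rank rows by number of missing values so the most complete duplicate comes first.
order = toy.isna().sum(axis=1).sort_values(kind="stable").index
ranked = toy.loc[order]

# Plain deduplication: the two unnamed products are merged into one row.
plain = ranked.drop_duplicates("product_name", keep="first")

# Alternative: deduplicate named rows only, and keep every unnamed row.
named = ranked[ranked["product_name"].notna()].drop_duplicates("product_name", keep="first")
unnamed = ranked[ranked["product_name"].isna()]
deduplicated = pd.concat([named, unnamed]).sort_index()

print(len(plain), len(deduplicated))
```

If unnamed products are useless for the suggestion system, the simpler option is to drop them explicitly before deduplicating, so that the choice is visible in the code.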

**Display the fill rates of the dataset's features.**

In [None]:
list_sort = df.isna().mean().sort_values()
list_sort

**Select features that are sufficiently filled (more than 50%) and that seem relevant for predicting your target.**

In [None]:
features = ['energy_100g', 'proteins_100g', 'fat_100g', 'carbohydrates_100g', 'salt_100g', 'sodium_100g', 'sugars_100g', 'saturated-fat_100g', 'fiber_100g', 'nutrition_grade_fr']
X = df[features]
X.shape

These are the features that seem relevant to me for predicting the target:
- energy_100g
- proteins_100g
- fat_100g
- carbohydrates_100g
- salt_100g
- sodium_100g
- sugars_100g
- saturated-fat_100g
- fiber_100g
- nutrition_grade_fr
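
The "more than 50% filled" criterion can be checked programmatically, which helps if the database changes and a feature drops below the threshold. This is a minimal sketch: the helper name `filled_features` and its parameters are hypothetical, and the final choice of relevant features remains a manual judgment.

```python
import pandas as pd

def filled_features(df, min_fill=0.5, exclude=()):
    """Return the columns whose share of non-missing values exceeds min_fill."""
    fill_rate = df.notna().mean()
    keep = fill_rate[fill_rate > min_fill].index
    return [col for col in keep if col not in exclude]

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "energy_100g": [1.0, 2.0, None, 4.0],
    "fiber_100g": [None, None, None, 1.0],
    "pnns_groups_1": ["a", "b", "c", "d"],
})

selected = filled_features(toy, exclude=("pnns_groups_1",))
print(selected)
```

A hand-picked list such as the one above can then be intersected with this result to confirm that every chosen feature still meets the threshold.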

**Separate the target from the rest of the dataset.**

In [None]:
y = df['pnns_groups_1']
df = df.drop('pnns_groups_1', axis=1)
y.shape

In [None]:
df.shape

In [None]:
# msno.heatmap(df_features)

**Automate everything done so far using:**
- a function that takes your original dataframe as input
- the methods specific to pandas dataframes, with numpy as a fallback when what you need is not directly or easily available in pandas.

In [None]:
def clean_and_filter(df):
    # target variable
    cible = 'pnns_groups_1'

    # drop the rows (products) that have no target value;
    # the copy keeps the caller's dataframe untouched
    df = df[df[cible].notna()].copy()

    # merge the hyphenated variants of the target categories
    df[cible] = df[cible].replace({
        'cereals-and-potatoes': 'Cereals and potatoes',
        'salty-snacks': 'Salty snacks',
        'fruits-and-vegetables': 'Fruits and vegetables',
        'sugary-snacks': 'Sugary snacks',
    })

    # 'unknown' counts as a missing target value, so these rows are dropped too
    df = df[df[cible] != 'unknown'].copy()

    # remove duplicate products: two products are duplicates when they share the same product_name
    # add a column holding the number of missing values in each row
    df['NB_NAN'] = df.isna().sum(axis=1)
    # sort by number of missing values
    df = df.sort_values('NB_NAN')
    # among duplicates, keep the row with the fewest missing values
    df = df.drop_duplicates('product_name', keep='first')
    # drop the helper column, which is no longer needed
    df = df.drop('NB_NAN', axis=1)

    # select the features
    features = ['energy_100g', 'proteins_100g', 'fat_100g', 'carbohydrates_100g', 'salt_100g', 'sodium_100g', 'sugars_100g', 'saturated-fat_100g', 'fiber_100g', 'nutrition_grade_fr']
    X = df[features]

    # separate the target from the rest of the dataset
    y = df[cible]

    # return y (target) and X (features)
    return (y, X)

In [None]:
data = pd.read_csv("C:/Users/admin/Documents/Projets/Projet_3/openfoodfacts/fr.openfoodfacts.org.products.csv", sep='\t', low_memory=False)
y, X = clean_and_filter(data)

In [None]:
X.shape

In [None]:
y.unique()

## 2 - Identify and treat outliers

### 2.1 - Overview

In [None]:
X.describe()

Without going into detail, some outliers are already visible:
- The maximum values should not exceed 100 grams (except for "energy_100g"). Yet every feature in the summary has a maximum above 100 g (for example, the maximum of fat is 380 g).
- The minimum values cannot be negative. Yet the feature "sugars_100g" has at least one negative value.
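
The two rules above translate directly into a range check: a quantity expressed per 100 g must lie between 0 and 100, energy excepted. The sketch below replaces out-of-range values with NaN so that they can later be handled like any other missing value. The helper name `mask_out_of_range` and its parameters are hypothetical, and the `toy` values are illustrative (some are taken from the summaries shown in this notebook).

```python
import pandas as pd

toy = pd.DataFrame({
    "energy_100g": [2389.0, 3251373.0, 1506.0],
    "fat_100g": [49.0, 380.0, 1.2],
    "sugars_100g": [-17.86, 5.0, 101.0],
})

def mask_out_of_range(df, lower=0.0, upper=100.0, skip=("energy_100g",)):
    """Set values outside [lower, upper] to NaN in every '*_100g' column not in skip."""
    cleaned = df.copy()
    for col in cleaned.columns:
        if col.endswith("_100g") and col not in skip:
            # `between` is inclusive on both ends; `where` keeps values where it is True.
            cleaned[col] = cleaned[col].where(cleaned[col].between(lower, upper))
    return cleaned

cleaned = mask_out_of_range(toy)
print(cleaned)
```

Energy is skipped here because it is not a mass and needs its own rule, which is what the next section looks at.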


### 2.2 - Feature "energy_100g"
Visualize the distribution of the variable

In [None]:
plt.figure(figsize=(10,3))
sns.boxplot(x=X['energy_100g'])
plt.show()

The IQR-based method can also be used. The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data. Values that are:
- below Q1 - 1.5 * IQR
- or above Q3 + 1.5 * IQR

are considered outliers.

In [None]:
# Drop the NaN values (copy so that columns can be added later without warnings)
sample = X[X['energy_100g'].notnull()].copy()

# Compute the quartiles, then the IQR and the limits
q1, q3 = np.quantile(sample.energy_100g, q=[0.25, 0.75])
iqr = q3 - q1
limite_basse = q1 - 1.5 * iqr
limite_haute = q3 + 1.5 * iqr
print("lower limit:", limite_basse)
print("upper limit:", limite_haute)

In [None]:
mask = df['energy_100g'] > limite_haute
print(df.loc[mask, ['code', 'product_name', 'energy_100g']])
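
To avoid repeating the IQR computation for each feature, the same rule can be wrapped in a helper and applied to every numeric column at once. This is a sketch under the usual convention of a 1.5 multiplier. The names `iqr_bounds` and `iqr_outlier_summary` are hypothetical, and the `toy` data is made up.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR. NaN values are ignored."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outlier_summary(df, k=1.5):
    """Build one row per numeric column with its limits and its number of outliers."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        low, high = iqr_bounds(df[col], k)
        n_out = int(((df[col] < low) | (df[col] > high)).sum())
        rows.append({"feature": col, "low": low, "high": high, "n_outliers": n_out})
    return pd.DataFrame(rows).set_index("feature")

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 10.0, 10.0, 10.0, None],
})

bounds = iqr_bounds(toy["a"])
summary = iqr_outlier_summary(toy)
print(summary)
```

For strongly skewed nutritional variables, the upper limit can flag many legitimate products, so the summary is better read as a list of values to inspect than as a list of values to delete.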

Use the z-score method. The z-score measures how many standard deviations a value lies from the mean of the variable. A value whose absolute z-score exceeds 2 or 3 is generally considered an outlier.

In [None]:
# Compute the z-score (assign returns a new dataframe, which avoids a SettingWithCopyWarning)
sample = sample.assign(z_energy=stats.zscore(sample['energy_100g']))

In [None]:
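# z_energy lives in `sample`, so build the mask there and look the rows up in `data` by index
mask = sample['z_energy'].abs() > 3
print(data.loc[sample[mask].index, ['code', 'product_name', 'energy_100g']])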
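
A caveat with the z-score is that the mean and standard deviation are themselves inflated by extreme values, so a single huge entry can hide itself and others. As a point of comparison (this is a different technique from the one used above, not part of the original analysis), the sketch below computes the modified z-score based on the median and the median absolute deviation (MAD), with the commonly quoted threshold of 3.5. The helper name `modified_zscore` is hypothetical and the `toy` values are illustrative.

```python
import pandas as pd

def modified_zscore(series):
    """Modified z-score: 0.6745 * (x - median) / MAD. Undefined when MAD is 0."""
    median = series.median()
    mad = (series - median).abs().median()
    return 0.6745 * (series - median) / mad

# Four plausible energy values and one extreme entry.
energy = pd.Series([1100.0, 377.0, 1674.0, 1141.0, 3251373.0])

# Classical z-score (population standard deviation, as scipy.stats.zscore uses by default).
classical = (energy - energy.mean()) / energy.std(ddof=0)
robust = modified_zscore(energy)

flag_classical = classical.abs() > 3
flag_robust = robust.abs() > 3.5
print(flag_classical.tolist())
print(flag_robust.tolist())
```

With only five points the classical z-score cannot exceed 2 in absolute value, so the extreme entry goes unflagged, while the median-based score isolates it. On the full dataset the effect is less drastic, but the same masking can occur.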

## 3 - Identify and treat missing values

Outliers --> NaN

Check for product rows with more than 80% missing values, and drop them.
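
A minimal sketch of the row filter described above follows. The helper name `drop_sparse_rows` and its `max_missing` parameter are hypothetical, and the `toy` data is made up.

```python
import pandas as pd

def drop_sparse_rows(df, max_missing=0.8):
    """Keep only the rows whose share of missing values is at most max_missing."""
    row_missing = df.isna().mean(axis=1)
    return df[row_missing <= max_missing]

# Row 0 is complete, row 1 is empty, row 2 has 3 missing values out of 5.
toy = pd.DataFrame({
    "energy_100g": [2389.0, None, 1506.0],
    "fat_100g": [49.0, None, None],
    "sugars_100g": [4.0, None, None],
    "salt_100g": [0.6, None, None],
    "fiber_100g": [8.0, None, 2.0],
})

kept = drop_sparse_rows(toy)
print(kept.index.tolist())
```

Computing the share over the selected features only (rather than all columns of the file) keeps the threshold meaningful, since many columns of the raw file are almost empty for every product.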

Candidate imputation methods:

- 0
- median, possibly by group (e.g. by Nutri-Score grade)
- iterative imputer (for features that are correlated with one another)
- KNN
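
The first two methods in this list need nothing beyond pandas. The sketch below shows a constant fill and a median computed within each `nutrition_grade_fr` group, with the overall median as a fallback when the group gives no value. Which feature gets which method is an assumption made for illustration, and the `toy` values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "nutrition_grade_fr": ["a", "a", "a", "e", "e", "e"],
    "sugars_100g": [1.0, None, 3.0, 40.0, 60.0, None],
    "fiber_100g": [2.0, None, 4.0, None, 0.5, 0.1],
})

# Method 1: constant fill with 0 (assumes a missing value means "none declared").
fiber_filled = toy["fiber_100g"].fillna(0)

# Method 2: median of the product's own nutrition grade group.
group_median = toy.groupby("nutrition_grade_fr")["sugars_100g"].transform("median")
sugars_filled = toy["sugars_100g"].fillna(group_median)

# Fallback: rows whose grade is missing get no group median, so use the overall median.
sugars_filled = sugars_filled.fillna(toy["sugars_100g"].median())

print(fiber_filled.tolist())
print(sugars_filled.tolist())
```

For the last two methods, scikit-learn provides `KNNImputer` and `IterativeImputer` in `sklearn.impute` (the latter is experimental and has to be enabled through `sklearn.experimental.enable_iterative_imputer`). scikit-learn is not among the libraries imported in this notebook, so using them would add a dependency.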

## 4 - Carry out the univariate and bivariate analyses

## 5 - Carry out a multivariate analysis

## 6 - GDPR