# Food For Thought 
## Digging into the dataset

The first step of this project aims at understanding the dataset we have chosen ([Open Food Facts Database](https://world.openfoodfacts.org/)), to check whether it is suitable for the kind of analysis we want to develop.

As a reminder, we would like to focus our research on two main topics:
1. Impact of food on the environment
2. Impact of food on health

In [1]:
import re
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style
from mpl_toolkits.mplot3d import Axes3D
from difflib import get_close_matches

import pickle

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import findspark
findspark.init()

from pyspark.sql import *
import pyspark.sql.functions as F

from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
%matplotlib inline

In [2]:
from plotly.offline import download_plotlyjs, init_notebook_mode
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.tools as tools

# pytools.set_credentials_file(username='tomM', api_key='oS78cBzIZzmTHXpgY4Rq')
init_notebook_mode(connected=True)

## load the raw csv file

## Brand vs Nutrition vs Food Category in France 

## load and clean

In [3]:
# we can now read it every time we want to continue the analysis
bra_cat_nut_raw_df = pd.read_pickle('bra_cat_nut_raw_df')

In [5]:
bra_cat_nut_raw_df.head(3)

Unnamed: 0,brands,nutrition_grade_fr,pnns_groups_1
0,CROUS,,Unknown
1,"Crous Resto',Crous",d,Unknown
2,Ferme De La Frémondière,,Unknown


- We create a dictionary of parent brands to replace a brand with its "parent brand" when it belongs to a bigger one. For example, 'Bio Village' is a brand of Leclerc, which also owns 'Marque Repère', so we replace both 'Bio Village' and 'Marque Repère' with 'Leclerc'. This lets us analyse all the food sold by the company Leclerc.

- Moreover, we set the brand to None when it is "Sans Marque". "Sans Marque" means that the brand was not specified by the user: the product can actually belong to "Marque Repère", "Carrefour", "Délisse", etc. To understand the 'Sans Marque' products, we checked them using the product search tool on https://fr.openfoodfacts.org with the tag "Sans Marque".

In [None]:
# list of food brands present in France ('Coca Cola', 'Fleury Michon', 'Carrefour', etc.)
with open("brand_list.txt", "r") as list_brand_file:
    list_brand = list_brand_file.read().split('\n')

# dictionary of parent brands.
dict_parent_brand = {'Bio Village': 'Leclerc',
                     'Marque Repère': 'Leclerc',
                     'Sans Marque': None}

The problem is that the 'brands' tags are not always consistent. For example, we can find the brand tags 'carrefour', 'Carrefour', 'carrfour,bio carrefour', etc.
Hence we first clean the brands using a list of food brands present in France and a function that matches each 'brands' tag with the closest brand in the list.
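To illustrate the matching step, here is a minimal sketch of how `get_close_matches` from `difflib` behaves on a small, made-up brand list. The list `toy_brands` and the helper `closest_brand` are hypothetical names used only for this illustration; the real list comes from `brand_list.txt`.

```python
from difflib import get_close_matches

# hypothetical, lowercase brand list used only for this illustration
toy_brands = ['carrefour', 'auchan', 'leclerc']

def closest_brand(word, brands, cutoff=0.6):
    """Return the closest brand to 'word', or None if nothing scores at least 'cutoff'."""
    candidates = get_close_matches(word.strip().lower(), brands, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None

typo_match = closest_brand('carrfour', toy_brands)    # misspelled tag
case_match = closest_brand(' Carrefour', toy_brands)  # extra space and capital letter
no_match = closest_brand('zzzz', toy_brands)          # nothing similar in the list
print(typo_match, case_match, no_match)
```

A higher `cutoff` makes the matching stricter: fewer misspelled tags are recovered, but fewer unrelated brands are merged by mistake.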

In [None]:
# Function which finds, for each value (string) in the column 'brands', the closest brand (string) in 'list_brand'.
# Note: 'get_close_matches' comes from the 'difflib' module of the Python standard library.
def clean_brand(brand_name):

    # missing brands (None or NaN) have nothing to match
    if not isinstance(brand_name, str) or not brand_name:
        return None

    # a brand tag can be a single name ('Carrefour') or several names ('Carrefour, Bio Carrefour'),
    # hence we split the string on ',' and look for the best matching brand for each part.
    matches = []
    for word in brand_name.split(','):

        # 'cutoff' limits false positives: candidates in 'list_brand' scoring below it are ignored.
        # 'n' is the maximum number of candidates returned (among those scoring at least the cutoff).
        matches.append(get_close_matches(word.strip().lower(), list_brand, n=1, cutoff=0.6))

    # flatten the list of lists (empty sublists contribute nothing)
    matches = [brand for sublist in matches for brand in sublist]

    # output brand: the first match, if any
    output = matches[0] if matches else None

    # check for parent brand
    if output in dict_parent_brand:
        output = dict_parent_brand[output]

    return output

In [None]:
# we create a new column with the new "consistent" brand tag.
# takes a few minutes to run...
bra_cat_nut_raw_df['new_brands'] = bra_cat_nut_raw_df['brands'].apply(clean_brand)

In [None]:
# drop when both 'nutrition_grade_fr' and 'brands' are None
bra_cat_nut_cleaned_df = bra_cat_nut_raw_df.dropna(subset=['nutrition_grade_fr', 'brands'], how='all') 

# we can drop the old column 'brands' since we have the new one.
bra_cat_nut_cleaned_df = bra_cat_nut_cleaned_df.drop(columns='brands').rename(columns={'new_brands':'brands'})

bra_cat_nut_cleaned_df.head()

In [None]:
# here we count the number of products for each brand. 
count = bra_cat_nut_cleaned_df.groupby('brands').count().sort_values('nutrition_grade_fr', ascending=False)

In [None]:
# we consider that a brand can be analysed if more than 100 products in the dataset belong to it.
count['enough_products'] = count['nutrition_grade_fr'] > 100
count = count[count.enough_products]
count.head()

The most abundant brands are Carrefour, Auchan and U. These are the brands of the biggest distributors (supermarkets) in France.

In [None]:
# here we add the column 'enough_products' to the dataframe with the cleaned brands.
bra_cat_nut_cleaned_df = bra_cat_nut_cleaned_df.join(count[['enough_products']], on='brands')
# note: the flag is dropped again right away, so no rows are filtered at this point
bra_cat_nut_cleaned_df = bra_cat_nut_cleaned_df.drop(columns='enough_products')
bra_cat_nut_cleaned_df.head()

The nutrition score in 'nutrition_grade_fr' is the Nutri-Score, developed by the French government and based on the components present in the food (sugar, fiber, fat, etc.). You can find more information at https://fr.wikipedia.org/wiki/Nutri-score.

- 'a' (very good product for health)
- 'b' (good product)
- 'c' (neutral product)
- 'd' (not so good product)
- 'e' (bad product for health)
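The grades are later turned into dummy (one-hot) columns so that they can be summed per brand. A minimal sketch on a made-up table (the brands 'X' and 'Y' and their grades are invented for the illustration):

```python
import pandas as pd

# made-up products, only to show the shape of the transformation
toy = pd.DataFrame({'brands': ['X', 'X', 'Y'],
                    'nutrition_grade_fr': ['a', 'e', 'a']})

# one column per grade, with 1 where the product has that grade
grade_dummies = pd.get_dummies(toy['nutrition_grade_fr']).astype(int)

# summing the dummies per brand gives the number of products per grade
grade_counts = grade_dummies.groupby(toy['brands']).sum()
print(grade_counts)
```

Only the grades that actually occur get a column, so in this toy table there are columns 'a' and 'e' but none for 'b', 'c' or 'd'.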

In [None]:
# here we create dummy variables from the 'nutrition_grade_fr'.
bra_cat_nut_cleaned_expanded_df = pd.get_dummies(bra_cat_nut_cleaned_df.set_index('brands')).reset_index()
bra_cat_nut_cleaned_dummies_df = bra_cat_nut_cleaned_expanded_df.rename(columns=
                              {'nutrition_grade_fr_a':'a',
                               'nutrition_grade_fr_b':'b',
                               'nutrition_grade_fr_c':'c',
                               'nutrition_grade_fr_d':'d',
                               'nutrition_grade_fr_e':'e',
                               'pnns_groups_1_Beverages': 'Beverages',
                               'pnns_groups_1_Cereals And Potatoes':'Cereals And Potatoes',
                               'pnns_groups_1_Composite Foods': 'Composite Foods',
                               'pnns_groups_1_Fat And Sauces':'Fat And Sauces',
                               'pnns_groups_1_Fish Meat Eggs':'Fish Meat Eggs',
                               'pnns_groups_1_Fruits And Vegetables':'Fruits And Vegetables',
                               'pnns_groups_1_Milk And Dairy Products':'Milk And Dairy Products',
                               'pnns_groups_1_Salty Snacks':'Salty Snacks',
                               'pnns_groups_1_Sugary Snacks':'Sugary Snacks',
                               'pnns_groups_1_Unknown': 'Unknown'})
bra_cat_nut_cleaned_dummies_df.head(5)

## Load cleaned data

In [6]:
bra_cat_nut_df = pd.read_pickle('bra_cat_nut_cleaned_df')
bra_cat_nut_dummies_df = pd.read_pickle('bra_cat_nut_cleaned_dummies_df')

## Food quality in the whole dataset

In [9]:
# here we count the number of products inside each nutrition category (5 categories from 'a' to 'e')
count = bra_cat_nut_df[['nutrition_grade_fr','brands']].groupby('nutrition_grade_fr').count().rename(columns={'brands':'count'})
count.head(3)

nutrition_grade_fr,count
a,8951
b,8831
c,12866


In [10]:
# interactive plot
data = []
colors = sns.color_palette("RdBu_r", 5).as_hex()

for index, row in count.reset_index().iterrows():
    data.append(go.Bar(x=np.array(index),
                       y=np.array(row['count']),
                       name = row['nutrition_grade_fr'],
                       marker={'color': colors[index]}))

layout = go.Layout(
    title='Nutrition score occurrences in the original dataset.',
    xaxis=go.layout.XAxis(
        title='Nutriscore',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        ),
        tickvals=np.arange(5).astype(str),
        ticktext=count.index,
    ),
    yaxis=dict(
        title='Count',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
)

go.FigureWidget(data=data, layout=layout)

In [11]:
# interactive plot
labels = count.index
values = count['count'].values
colors = sns.color_palette("RdBu_r", 5).as_hex()

fig = {
    'data': [{'labels': labels,
              'values': values,
              'type': 'pie',
              'marker':{'colors':colors}
              }
            ],
    'layout': {'title': 'Nutrition score occurrences in the original dataset.'}
     }

py.iplot(fig)

## Food quality for each French food brand

In [12]:
# read the data
bra_nut_df = bra_cat_nut_df[['brands', 'nutrition_grade_fr']]
bra_nut_dummies_df = bra_cat_nut_dummies_df[['brands','a','b','c','d','e']]

In [18]:
# extract the total number of products for each brand
brand_count = bra_nut_df.groupby('brands').count().rename(columns={'nutrition_grade_fr':'total'})

In [20]:
# count the nutrition grade occurrences for each brand
brand_dummies_count = bra_nut_dummies_df.groupby('brands').sum()

In [21]:
# we add the column 'total' to further extract the ratio of each grade instead of the raw count.
brand_dummies_count = brand_dummies_count.join(brand_count)
brand_dummies_count.head()

brands,a,b,c,d,e,total
123 bio,121.0,75.0,54.0,91.0,80.0,421
7Up,0.0,5.0,1.0,4.0,24.0,34
A l'olivier,0.0,2.0,19.0,43.0,3.0,67
Ajax,1.0,0.0,0.0,0.0,1.0,2
Albert Menes,2.0,6.0,27.0,37.0,22.0,94


In [22]:
# we convert the counts into ratios. 
# ex: ratio for 'a' = (count for 'a') / (total number of products)
brand_nutri_ratio = brand_dummies_count.copy()
brand_nutri_ratio[['a','b','c','d','e']] = brand_dummies_count[['a','b','c','d','e']].div(brand_dummies_count['total'].values,axis=0)*100
brand_nutri_ratio = brand_nutri_ratio.sort_values('total',ascending=False)

In [23]:
# extract the 10 biggest brands
ratio_top10 = brand_nutri_ratio.reset_index().loc[:9].drop(columns='total')

# save the brand names for further analysis 
top10_brand_names = ratio_top10.brands.values

ratio_top10.head()

Unnamed: 0,brands,a,b,c,d,e
0,Carrefour,19.763931,15.344496,19.818831,26.269558,18.803184
1,Auchan,18.391764,15.303283,21.3133,26.794658,18.196995
2,Leclerc,13.64818,15.901213,22.010399,27.816291,20.623917
3,Casino,18.84984,15.609311,20.903697,25.604747,19.032405
4,Leader Price,17.482517,13.939394,22.750583,26.293706,19.5338


In [24]:
# interactive plot
data = []
colors = sns.color_palette("tab10", ratio_top10.brands.count()).as_hex()

for index, row in ratio_top10.iterrows():
    data.append(go.Bar(x=['a','b','c','d','e'],
                       y=np.array(row[1:]),
                       name = row['brands'],
                       marker={'color': colors[index]}))

layout = go.Layout(
    title='Nutrition score occurrences for the 10 biggest food brands in France.',
    xaxis=go.layout.XAxis(
        title='Nutriscore',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        ),
        tickvals=np.arange(5).astype(str),
        ticktext=count.index,
    ),
    yaxis=dict(
        title='Ratio (in %)',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
)

go.FigureWidget(data=data, layout=layout)

In [25]:
# interactive plot
data = []
colors = sns.color_palette("RdBu_r", 5).as_hex()

for index in range(len(['a','b','c','d','e'])):
    data.append(go.Bar(x=ratio_top10['brands'],
                       y=ratio_top10[ratio_top10.columns[index+1]],
                       name = ratio_top10.columns[index+1],
                       marker={'color': colors[index]}))

layout = go.Layout(
    title='Nutrition score distribution within the 10 biggest food brands in France.',
    barmode='stack',
    xaxis=go.layout.XAxis(
        title='Food Brand',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        ),
        tickvals=np.arange(10).astype(str),
        ticktext=ratio_top10['brands'],
    ),
    yaxis=dict(
        title='Ratio (in %)',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
)

go.FigureWidget(data=data, layout=layout)

We observe that, for almost all of the biggest food brands present in France, nearly half of the products are "junk" food (products graded 'd' or 'e'). We don't eat that well in France...

The exception is 'Picard'. It mainly sells frozen food, which could be expected to be healthier than the products sold by Netto (cheap, mostly industrial products).
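The "nearly half" figure can be checked directly from the ratios printed above for the five largest brands. The sketch below copies the 'd' and 'e' columns of those five rows by hand and adds them up; the variable names `top5` and `junk_share` are ours.

```python
import pandas as pd

# 'd' and 'e' shares (in %) copied from the table printed above
top5 = pd.DataFrame(
    {'d': [26.269558, 26.794658, 27.816291, 25.604747, 26.293706],
     'e': [18.803184, 18.196995, 20.623917, 19.032405, 19.533800]},
    index=['Carrefour', 'Auchan', 'Leclerc', 'Casino', 'Leader Price'])

# share of products graded 'd' or 'e' for each brand
junk_share = top5['d'] + top5['e']
print(junk_share.round(1))
```

For these five brands the share stays between roughly 44% and 49%.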

## Category Distribution in the 10 biggest food brands in France. 

In [32]:
# read the data
bra_cat_df = bra_cat_nut_df[['brands', 'pnns_groups_1']]
bra_cat_dummies_df = bra_cat_nut_dummies_df.drop(['a','b','c','d','e'], axis=1)

In [34]:
# here we count the category occurrences for each brand
brand_dummies_count = bra_cat_dummies_df.groupby('brands').sum()

# we add the column 'total' to further extract the ratio of each category instead of the raw count
brand_count = pd.DataFrame(data=brand_dummies_count.sum(axis=1), columns=['total'])
brand_dummies_count = brand_dummies_count.join(brand_count)

brand_dummies_count.head(3)

brands,Beverages,Cereals And Potatoes,Composite Foods,Fat And Sauces,Fish Meat Eggs,Fruits And Vegetables,Milk And Dairy Products,Salty Snacks,Sugary Snacks,Unknown,total
123 bio,38.0,60.0,28.0,38.0,28.0,64.0,43.0,15.0,72.0,222.0,608.0
7Up,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,37.0
A l'olivier,0.0,1.0,2.0,41.0,1.0,0.0,0.0,0.0,2.0,67.0,114.0


In [38]:
# we convert the counts into ratios
ratio = brand_dummies_count.copy().reset_index()
ratio = ratio[ratio['brands'].isin(top10_brand_names)].set_index('brands')
ratio.iloc[:, np.arange(9)] = brand_dummies_count.iloc[:,np.arange(9)].div(brand_dummies_count['total'].values,axis=0)*100

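We chose to remove the category 'Unknown' since it gathers products from multiple categories. It is not relevant to analyze the distribution of products tagged 'Unknown' because they do not represent any real category of products: the user simply omitted the category tag. Note that the ratios are still computed against a total that includes the 'Unknown' products, so the remaining percentages of a brand do not sum to 100.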
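Whether 'Unknown' products are counted in the denominator changes the numbers noticeably. As a small illustration, the sketch below uses the '7Up' row printed above (29 beverages, 8 unknown products, 37 in total) and compares the share of beverages with and without the 'Unknown' products in the total. The second variant is only shown for comparison; it is not what the notebook computes.

```python
# counts copied from the '7Up' row printed above
beverages = 29.0
unknown = 8.0
total = 37.0

# share computed against all products, 'Unknown' included (what the cell above does)
share_with_unknown = 100 * beverages / total

# alternative: share computed against categorized products only
share_without_unknown = 100 * beverages / (total - unknown)

print(round(share_with_unknown, 2), round(share_without_unknown, 2))
```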

In [39]:
ratio_to_plot = ratio.reset_index().drop(columns=['total', 'Unknown'])
ratio_to_plot.head(3)

Unnamed: 0,brands,Beverages,Cereals And Potatoes,Composite Foods,Fat And Sauces,Fish Meat Eggs,Fruits And Vegetables,Milk And Dairy Products,Salty Snacks,Sugary Snacks
0,Auchan,5.265011,6.374362,8.681106,4.824793,8.188061,6.691319,8.874802,2.201092,7.941539
1,Belle France,6.370795,7.229778,11.739442,5.941303,9.949893,9.305655,13.815319,4.509664,15.175376
2,Carrefour,4.881403,8.078377,8.525266,4.331385,8.765899,7.35648,7.751805,2.543829,10.24407


In [40]:
# interactive plot
data = []
colors = sns.color_palette("tab10", ratio_to_plot.brands.count()).as_hex()

for index, nutri in enumerate(ratio_to_plot.columns[1:]):
    data.append(go.Bar(x=ratio_to_plot['brands'],
                       y=ratio_to_plot[ratio_to_plot.columns[index+1]],
                       name = ratio_to_plot.columns[index+1],
                       marker={'color': colors[index]}))

layout = go.Layout(
    title='Food Category distribution within the 10 biggest food brands in France.',
    xaxis=go.layout.XAxis(
        title='Food Brand',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        ),
        tickvals=np.arange(10).astype(str),
        ticktext=ratio_to_plot['brands'],
    ),
    yaxis=dict(
        title='Ratio (in %)',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
)

go.FigureWidget(data=data, layout=layout)

In [42]:
# interactive plot
data = []
colors = sns.color_palette("tab10", len(ratio_to_plot.columns[1:])).as_hex()

for index, nutri in enumerate(ratio_to_plot.columns[1:]):
    data.append(go.Scatter(x=ratio_to_plot['brands'],
                       y=ratio_to_plot[ratio_to_plot.columns[index+1]],
                       name = ratio_to_plot.columns[index+1],
                       marker={'color': colors[index]}
                          )
               )
    
fig = tools.make_subplots(rows=3, cols=3, subplot_titles=tuple(ratio_to_plot.columns[1:]))

for index,trace in enumerate(data):
    (r, c) = divmod(index, 3)
    fig.append_trace(trace, r+1, c+1)

fig['layout']['yaxis1'].update(title='Ratio (in %)')
fig['layout']['yaxis4'].update(title='Ratio (in %)')
fig['layout']['yaxis7'].update(title='Ratio (in %)')

fig['layout'].update(title='Food Category distribution within the 10 biggest food brands in France.', height=900, width=1000)

py.iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]
[ (2,1) x4,y4 ]  [ (2,2) x5,y5 ]  [ (2,3) x6,y6 ]
[ (3,1) x7,y7 ]  [ (3,2) x8,y8 ]  [ (3,3) x9,y9 ]



## Nutrition grade distribution per category in France

In [94]:
# read the data 
cat_nut_df = bra_cat_nut_df[['pnns_groups_1', 'nutrition_grade_fr']]

In [95]:
# label of the last row, used as a proxy for the total number of products
cat_nut_df.index[-1]

223388

In [97]:
# we drop the null values
cat_nut_df = cat_nut_df.dropna()

In [100]:
# position of the last row after removing the nulls (number of products minus one)
cat_nut_df.reset_index(drop=True).index[-1]

103469

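Dropping the rows with missing values removes roughly half of the products (the last index goes from 223388 to 103469). With more than 100,000 products left, we assume the sample is still large enough for the analysis to be meaningful.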
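One caveat on the counts above: `index[-1]` returns the label of the last row, which equals the number of rows minus one only when the index is a gap-free range starting at 0. A minimal sketch on a made-up frame shows the difference with `len`:

```python
import numpy as np
import pandas as pd

# made-up frame with two missing grades
toy = pd.DataFrame({
    'pnns_groups_1': ['Beverages', 'Salty Snacks', 'Beverages', 'Unknown', 'Sugary Snacks'],
    'nutrition_grade_fr': ['e', np.nan, 'd', np.nan, 'c']})

kept = toy.dropna()

last_label = kept.index[-1]                             # label of the last row
last_position = kept.reset_index(drop=True).index[-1]   # position of the last row
n_rows = len(kept)                                      # actual number of rows
print(last_label, last_position, n_rows)
```

Here the last label is still 4 after dropping two rows, while only 3 rows remain, so `len` is the safer way to count products.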

In [101]:
# we get dummies from the nutrition grades
cat_nut_df = pd.get_dummies(cat_nut_df.set_index('pnns_groups_1')).rename(columns=
                              {'nutrition_grade_fr_a':'a',
                               'nutrition_grade_fr_b':'b',
                               'nutrition_grade_fr_c':'c',
                               'nutrition_grade_fr_d':'d',
                               'nutrition_grade_fr_e':'e'})

In [103]:
# we count the number of products with grade 'a', 'b', etc. for each food category
count = cat_nut_df.groupby('pnns_groups_1').sum()
count['total'] = count[['a','b','c','d','e']].sum(axis=1)
count.head(3)

pnns_groups_1,a,b,c,d,e,total
Beverages,71.0,854.0,1812.0,1624.0,3754.0,8115.0
Cereals And Potatoes,3839.0,1630.0,1616.0,1045.0,101.0,8231.0
Composite Foods,1798.0,3407.0,3404.0,2073.0,185.0,10867.0


In [104]:
# we extract ratio from the counts and the total number of products for each category
ratio = count.copy()
ratio[['a','b','c','d','e']] = count[['a','b','c','d','e']].div(count['total'].values,axis=0)*100
ratio.head(3)

pnns_groups_1,a,b,c,d,e,total
Beverages,0.874923,10.523722,22.32902,20.012323,46.260012,8115.0
Cereals And Potatoes,46.640748,19.803183,19.633094,12.695906,1.227068,8231.0
Composite Foods,16.545505,31.351799,31.324193,19.076102,1.702402,10867.0


In [105]:
# we remove the 'total' column and sort the categories by the ratio for the score 'a' (the healthiest)
ratio_to_plot = ratio.reset_index().drop(columns='total').sort_values(by='a', ascending=False).reset_index(drop=True)

In [106]:
# interactive plot
data = []
colors = sns.color_palette("RdBu_r", 5).as_hex()

for index, nutri in enumerate(ratio_to_plot.columns[1:]):
    data.append(go.Bar(x=ratio_to_plot['pnns_groups_1'],
                       y=ratio_to_plot[ratio_to_plot.columns[index+1]],
                       name = ratio_to_plot.columns[index+1],
                       marker={'color': colors[index]}))

layout = go.Layout(
    title='Health profile for each food category.',
    xaxis=go.layout.XAxis(
        title='Food Category',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        ),
        tickangle = 20,
        tickvals=np.arange(10).astype(str),
        ticktext=ratio_to_plot['pnns_groups_1'],
    ),
    yaxis=dict(
        title='Ratio (in %)',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
)

go.FigureWidget(data=data, layout=layout)

# Brand vs Palm Oil

## load and clean

## Load cleaned data

In [109]:
bra_palm_df = pd.read_pickle('bra_palm_cleaned_df')

## Palm oil presence in the whole dataset

In [116]:
bra_palm_bool_df = bra_palm_df.copy()
bra_palm_bool_df.iloc[:,0] = bra_palm_bool_df.iloc[:,0].apply(lambda x: x>0) # for 'ingredients_from_palm_oil_n'
bra_palm_bool_df.iloc[:,1] = bra_palm_bool_df.iloc[:,1].apply(lambda x: x>0) # for 'ingredients_that_may_be_from_palm_oil_n'

In [117]:
bra_palm_to_plot_df = bra_palm_df[['brands']].copy()
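# a product "contains" palm oil if it has at least one ingredient from palm oil (column 0),
# and "may contain" it if it only has ingredients that may be from palm oil (column 1)
bra_palm_to_plot_df['no_palm_oil'] = bra_palm_bool_df.apply(lambda x: 1 if not x.iloc[0] and not x.iloc[1] else 0, axis=1)
bra_palm_to_plot_df['may_contain_palm_oil'] = bra_palm_bool_df.apply(lambda x: 1 if x.iloc[1] and not x.iloc[0] else 0, axis=1)
bra_palm_to_plot_df['contain_palm_oil'] = bra_palm_bool_df.apply(lambda x: 1 if x.iloc[0] else 0, axis=1)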

In [119]:
bra_palm_to_plot_df.head(3)

Unnamed: 0,brands,no_palm_oil,may_contain_palm_oil,contain_palm_oil
0,Crous,1,0,0
1,,1,0,0
2,,1,0,0


In [120]:
# count the number of products for each palm category (the text column 'brands' is left out of the sum)
count = bra_palm_to_plot_df.drop(columns='brands').sum(axis=0)

In [121]:
# interactive plot
data = []
colors = sns.color_palette("Reds", 3).as_hex()
labels = ['no palm oil', 'may contain palm oil', 'contain palm oil']

for index, item in enumerate(count.items()):
    data.append(go.Bar(x=np.array(index),
                       y=np.array(item[1]),
                       name = labels[index],
                       marker={'color': colors[index]}))

layout = go.Layout(
    title='Palm oil presence in the original dataset.',
    showlegend=False,
    xaxis=go.layout.XAxis(
        title='Product feature',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        ),
        tickvals=np.arange(3).astype(str),
        ticktext= labels,
    ),
    yaxis=dict(
        title='Count',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
)

go.FigureWidget(data=data, layout=layout)

In [122]:
# interactive plot
fig = {
    'data': [{'labels': labels,
              'values': count.values,
              'type': 'pie',
              'marker':{'colors':colors}
              }
            ],
    'layout': {'title': 'Palm oil presence in the original dataset.'}
     }

py.iplot(fig)

## Palm oil presence for each French food brand

In [123]:
# read the data
bra_palm_df = pd.read_pickle('bra_palm_cleaned_df')

In [128]:
# count number of products per brand
brand_count = bra_palm_df[['ingredients_that_may_be_from_palm_oil_n', 'brands']] \
                                    .groupby('brands')\
                                    .count() \
                                    .rename(columns={'ingredients_that_may_be_from_palm_oil_n':'total'})
brand_count.head(3)

brands,total
123 bio,592
7Up,28
A l'olivier,103


In [130]:
# count the number of products for each palm category w.r.t the brands
bra_palm_grouped_df = bra_palm_df.groupby('brands').sum()

# add the number of products per brand
bra_palm_grouped_df = brand_count.join(bra_palm_grouped_df)

bra_palm_grouped_df.head(3)

brands,total,ingredients_from_palm_oil_n,ingredients_that_may_be_from_palm_oil_n
123 bio,592,2,1
7Up,28,0,0
A l'olivier,103,1,2


In [134]:
# get ratio from the counts
ratio_df = pd.DataFrame()
ratio_df['may contain palm oil'] = bra_palm_grouped_df.ingredients_that_may_be_from_palm_oil_n.div(bra_palm_grouped_df.total)
ratio_df['contain palm oil'] = bra_palm_grouped_df.ingredients_from_palm_oil_n.div(bra_palm_grouped_df.total)
ratio_df['no palm oil'] = 1 - ratio_df['may contain palm oil'] - ratio_df['contain palm oil']

# get percentage
ratio_df = ratio_df * 100

ratio_df.head(3)

brands,may contain palm oil,contain palm oil,no palm oil
123 bio,0.168919,0.337838,99.493243
7Up,0.0,0.0,100.0
A l'olivier,1.941748,0.970874,97.087379
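As a sanity check, the first row can be recomputed by hand from the counts printed earlier for '123 bio' (592 products, 2 ingredients from palm oil, 1 ingredient that may be from palm oil). Note that the numerators are sums of the `_n` columns; assuming these columns hold a number of ingredients per product (an assumption on our side, not checked here), a product with several palm-oil ingredients is counted several times.

```python
# counts copied from the '123 bio' row printed above
total = 592
from_palm_oil = 2
may_be_from_palm_oil = 1

# same arithmetic as the cell above, written out for a single brand
contain = 100 * from_palm_oil / total
may_contain = 100 * may_be_from_palm_oil / total
no_palm = 100 - contain - may_contain

print(round(may_contain, 6), round(contain, 6), round(no_palm, 6))
```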


In [135]:
# extract the values for the 10 biggest brands in France
ratio_top10_df = ratio_df.copy().reset_index()[ratio_df.index.isin(top10_brand_names)].reset_index(drop=True)

In [136]:
# sort them to have the most "environmentally friendly" brands first
ratio_top10_df.sort_values(by='no palm oil', ascending=False, inplace=True)

In [137]:
# interactive plot
data = []
colors = sns.color_palette("tab10", 10).as_hex()

for index, row in ratio_top10_df.iterrows():
    data.append(go.Bar(x=ratio_top10_df.columns[1:],
                       y=np.array(row[1:]),
                       name = row['brands'],
                       marker={'color': colors[index]}))

layout = go.Layout(
    title='Ratio of products with palm oil for the top 10 French brands.',
    xaxis=go.layout.XAxis(
        title='Product Feature',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        ),
        tickvals=np.arange(3).astype(str),
        ticktext=ratio_top10_df.columns[1:],
    ),
    yaxis=dict(
        title='Ratio (in %)',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
)

go.FigureWidget(data=data, layout=layout)

In [138]:
# interactive plot
data = []
colors = sns.color_palette("tab10", 10).as_hex()

for i in range(3):
    
    data.append(go.Bar(x=ratio_top10_df['brands'],
                       y=ratio_top10_df[ratio_top10_df.columns[i+1]],
                       name = ratio_top10_df.columns[i+1],
                       marker={'color': colors[i]}))

layout = go.Layout(
    title='Palm oil presence within the 10 biggest food brands in France.',
    barmode='stack',
    xaxis=go.layout.XAxis(
        title='Food Brand',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        ),
        tickvals=np.arange(10).astype(str),
        ticktext=ratio_top10_df['brands'],
    ),
    yaxis=dict(
        title='Ratio (in %)',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
)

go.FigureWidget(data=data, layout=layout)

# Score each brand from palm oil presence and nutrition score ratios

In [139]:
brand_nutri_ratio.head(3)

brands,a,b,c,d,e,total
Carrefour,19.763931,15.344496,19.818831,26.269558,18.803184,3643
Auchan,18.391764,15.303283,21.3133,26.794658,18.196995,3594
Leclerc,13.64818,15.901213,22.010399,27.816291,20.623917,2308


In [140]:
# keep only the brands which have more than 100 products
brand_nutri_ratio = brand_nutri_ratio[brand_nutri_ratio.total > 100]

In [147]:
# we will assign a score to each brand. The best score is 5 (for 'a') and the worst is 1 ('e'). 
scores = np.array([5,4,3,2,1])

# for each brand, the score is the average of the grade scores (5, 4, ...) weighted by the share of products with each grade ('a', 'b', ...)
brand_score_nutri = brand_nutri_ratio.drop('total', axis=1).multiply(1/100).multiply(scores).mean(axis=1).to_frame(name='nutrition score')
# .mean() divided the weighted sum by the 5 grades, so we multiply by 5 to recover the weighted average
brand_score_nutri['nutrition score'] *= 5
brand_score_nutri.sort_values(by='nutrition score', ascending=False, inplace=True)

A score of 5 for a brand means that 100% of the products sold by the brand are 'a' (very healthy); a score of 1 means that they are all 'e'.
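As a worked example, the sketch below recomputes the score of Carrefour from the ratios printed above, and checks that a hypothetical brand selling only 'a' products gets 5. The helper name `nutrition_score` is ours.

```python
import numpy as np

scores = np.array([5, 4, 3, 2, 1])  # points for grades 'a' to 'e'

def nutrition_score(ratios_in_percent):
    """Average of the grade points weighted by the share of products per grade."""
    shares = np.asarray(ratios_in_percent, dtype=float) / 100
    return float((shares * scores).sum())

# ratios copied from the 'Carrefour' row printed above
carrefour_score = nutrition_score([19.763931, 15.344496, 19.818831, 26.269558, 18.803184])

# hypothetical brand whose products are all graded 'a'
all_a_score = nutrition_score([100, 0, 0, 0, 0])

print(round(carrefour_score, 2), all_a_score)
```

Carrefour lands at about 2.9, slightly below the middle grade 'c' (3).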

In [149]:
# take the 10 best (highest score) and 10 worst brands (lowest score)
brand_10best_10worst = pd.concat([brand_score_nutri.head(10).reset_index(), brand_score_nutri.tail(10).reset_index()], axis=0).reset_index(drop=True)

In [150]:
# interactive plot
data = []
colors = sns.color_palette("RdBu_r", 20).as_hex()
y_values = ['a','b','c','d','e']

for index, row in brand_10best_10worst.iterrows():
    data.append(go.Bar(x=np.array(index),
                       y=np.array(row['nutrition score']),
                       name = row['brands'],
                       marker={'color': colors[index]}))

layout = go.Layout(
    showlegend=False,
    title='Health score of the 10 best and 10 worst food brands in France.',
    xaxis=go.layout.XAxis(
        title='Brand',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        ),
        tickvals=np.arange(20).astype(str),
        ticktext=brand_10best_10worst.brands.values,
    ),
    yaxis=dict(
        title='Global Nutrition Score',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        ),
        showgrid = True,
        gridcolor='#bdbdbd',
        range=[0, 6],
        tickwidth=3,
        tickvals=np.arange(1,6).astype(str),
        ticktext=y_values[::-1],
    ),
)

go.FigureWidget(data=data, layout=layout)