This script classifies foods as healthy or unhealthy and yummy or yucky based on what you think about a few foods of your choice.
The dataset is from https://www.kaggle.com/openfoodfacts/world-food-facts

In [1]:
import pandas as pd
import numpy as np

Load the dataset. This could take a while.

In [2]:
df = pd.read_csv('en.openfoodfacts.org.products.tsv', sep='\t')

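On a wide file like this one, read_csv can emit a DtypeWarning when a column holds mixed types. A minimal sketch of one way to make the load more predictable, assuming you want barcodes kept as text: the toy TSV below is invented for illustration and only stands in for the real file.

```python
import io
import pandas as pd

# Toy stand-in for the real TSV; these rows are made up for illustration.
toy_tsv = "code\tproduct_name\tenergy_100g\n0001\tOat bar\t1500\n0002\tCola\t180\n"

# dtype={"code": str} keeps barcodes as text, so leading zeros survive.
# low_memory=False lets pandas infer each column's type from the whole file
# instead of chunk by chunk.
toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype={"code": str}, low_memory=False)
print(toy.dtypes)
```

The same two keyword arguments could be passed when reading 'en.openfoodfacts.org.products.tsv', at the cost of higher memory use during the load.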


Examine the dataset.

In [3]:
print(df.head())
df.info()
print(df.columns)

    code                                                url  \
0   3087  http://world-en.openfoodfacts.org/product/0000...   
1   4530  http://world-en.openfoodfacts.org/product/0000...   
2   4559  http://world-en.openfoodfacts.org/product/0000...   
3  16087  http://world-en.openfoodfacts.org/product/0000...   
4  16094  http://world-en.openfoodfacts.org/product/0000...   

                      creator   created_t      created_datetime  \
0  openfoodfacts-contributors  1474103866  2016-09-17T09:17:46Z   
1             usda-ndb-import  1489069957  2017-03-09T14:32:37Z   
2             usda-ndb-import  1489069957  2017-03-09T14:32:37Z   
3             usda-ndb-import  1489055731  2017-03-09T10:35:31Z   
4             usda-ndb-import  1489055653  2017-03-09T10:34:13Z   

  last_modified_t last_modified_datetime                    product_name  \
0      1474103893   2016-09-17T09:18:13Z              Farine de blé noir   
1      1489069957   2017-03-09T14:32:37Z  Banana Chips Sweetened (

There are a lot of NaNs. Let's write a function that returns the proportion of NaN entries in a given column, then print that proportion for every column.

In [4]:
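def countNaNs(df, column):
    # Fraction of rows in which this column is NaN (0.0 = none, 1.0 = all).
    num_NaNs = df[column].isna().sum()
    return num_NaNs / len(df)

# Print the NaN fraction of every column.
for column in df.columns:
    print(column + ": {}".format(countNaNs(df, column)))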

code: 7.302816921188561e-05
url: 7.302816921188561e-05
creator: 8.426327216756033e-06
created_t: 8.426327216756033e-06
created_datetime: 2.8087757389186775e-05
last_modified_t: 0.0
last_modified_datetime: 0.0
product_name: 0.049187280739943884
generic_name: 0.8378943170040475
quantity: 0.6649551859830856
packaging: 0.7473253433026147
packaging_tags: 0.7473253433026147
brands: 0.08159493521558758
brands_tags: 0.08165111073036596
categories: 0.7098562749454396
categories_tags: 0.7099236855631735
categories_en: 0.7098506573939617
origins: 0.9296401677400872
origins_tags: 0.929746901218166
manufacturing_places: 0.8820089487595042
manufacturing_places_tags: 0.8820286101896766
labels: 0.834006971381384
labels_tags: 0.8337822693222705
labels_en: 0.8337092411530587
emb_codes: 0.9087428762425321
emb_codes_tags: 0.9087541113454879
first_packaging_code_geo: 0.9413752327772894
cities: 0.9999269718307882
cities_tags: 0.9367969283228519
purchase_places: 0.8130338429388783
stores: 0.8379308310886534
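The same per-column proportions can be computed without a loop. This is a sketch on a small invented DataFrame (the name toy and its values are assumptions for illustration): isna() gives a boolean table, and the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Small invented frame standing in for the food data.
toy = pd.DataFrame({
    "product_name": ["a", None, "c", "d"],
    "fiber_100g": [1.0, np.nan, np.nan, 2.0],
    "energy_100g": [10.0, 20.0, 30.0, 40.0],
})

# Fraction of NaN entries per column, as a Series indexed by column name.
nan_fraction = toy.isna().mean()

# Sorting puts the sparsest columns first, which is easier to scan than a long printout.
print(nan_fraction.sort_values(ascending=False))
```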


Let's trim down our dataset by removing every column that has more than 50% NaN values.

In [5]:
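def drop_column(df, column):
    # Drop the column in place if more than half of its entries are NaN.
    if countNaNs(df, column) > 0.5:
        df.drop(labels=column, axis=1, inplace=True)

# Iterate over a snapshot of the column names, since columns are dropped during the loop.
for column in list(df.columns):
    drop_column(df, column)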

print("Remaining columns of dataframe:")
print(df.columns)

Remaining columns of dataframe:
Index(['code', 'url', 'creator', 'created_t', 'created_datetime',
       'last_modified_t', 'last_modified_datetime', 'product_name', 'brands',
       'brands_tags', 'countries', 'countries_tags', 'countries_en',
       'ingredients_text', 'serving_size', 'additives_n', 'additives',
       'ingredients_from_palm_oil_n',
       'ingredients_that_may_be_from_palm_oil_n', 'nutrition_grade_fr',
       'states', 'states_tags', 'states_en', 'energy_100g', 'fat_100g',
       'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g',
       'proteins_100g', 'salt_100g', 'sodium_100g', 'nutrition-score-fr_100g',
       'nutrition-score-uk_100g'],
      dtype='object')
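As an alternative to dropping columns one at a time, the 50% rule can be written as a single boolean mask over the columns. A sketch on an invented frame (toy and its column names are illustrative assumptions); a column with exactly 50% NaNs is kept, matching the "more than 50%" rule above.

```python
import numpy as np
import pandas as pd

# Invented frame: one column mostly missing, one exactly half missing, one complete.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, np.nan, 2.0, 3.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

# True for columns whose NaN fraction is at most 0.5.
keep = toy.isna().mean() <= 0.5

# Select only the kept columns; this returns a new frame and leaves toy unchanged.
trimmed = toy.loc[:, keep]
print(trimmed.columns.tolist())
```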


Let's look at the NaN counts again to see how the remaining columns look.

In [6]:
for column in df.columns:
    print(column + ": {}".format(countNaNs(df, column)))

code: 7.302816921188561e-05
url: 7.302816921188561e-05
creator: 8.426327216756033e-06
created_t: 8.426327216756033e-06
created_datetime: 2.8087757389186775e-05
last_modified_t: 0.0
last_modified_datetime: 0.0
product_name: 0.049187280739943884
brands: 0.08159493521558758
brands_tags: 0.08165111073036596
countries: 0.0007724133282026363
countries_tags: 0.0007724133282026363
countries_en: 0.0007724133282026363
ingredients_text: 0.2026082291511599
serving_size: 0.39156019065969716
additives_n: 0.20268125732037176
additives: 0.20279922590140637
ingredients_from_palm_oil_n: 0.20268125732037176
ingredients_that_may_be_from_palm_oil_n: 0.20268125732037176
nutrition_grade_fr: 0.2841666502821415
states: 0.00014605633842377122
states_tags: 0.00014605633842377122
states_en: 0.00014605633842377122
energy_100g: 0.17038033632280697
fat_100g: 0.21495560729944638
saturated-fat_100g: 0.25898035823125776
carbohydrates_100g: 0.21573363817912686
sugars_100g: 0.2158291365542501
fiber_100g: 0.38015094360820

This looks pretty good. Some columns still have over 20% NaNs, but I want to keep several of them because they seem important: saturated fat, carbohydrates, sugars, fiber, and so on. In fact, I think they are important enough that any food missing a value for one of them is probably unusual and doesn't need to be in the dataset, so I will just drop those rows.

In [7]:
important_columns = ['energy_100g', 'fat_100g',
       'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g',
       'proteins_100g', 'salt_100g', 'sodium_100g']
for column in important_columns:
    df = df[df[column].notna()]
    
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 197856 entries, 1 to 356022
Data columns (total 34 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   code                                     197856 non-null  object 
 1   url                                      197856 non-null  object 
 2   creator                                  197856 non-null  object 
 3   created_t                                197856 non-null  object 
 4   created_datetime                         197855 non-null  object 
 5   last_modified_t                          197856 non-null  object 
 6   last_modified_datetime                   197856 non-null  object 
 7   product_name                             195837 non-null  object 
 8   brands                                   194100 non-null  object 
 9   brands_tags                              194093 non-null  object 
 10  countries                       
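The row filter can also be expressed with dropna and its subset argument, which removes rows that are NaN in any of the listed columns and ignores NaNs elsewhere. A sketch on an invented frame (toy, required, and the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Invented frame: only the last row has every required nutrient.
toy = pd.DataFrame({
    "product_name": ["a", "b", "c"],
    "fat_100g": [1.0, np.nan, 3.0],
    "sugars_100g": [np.nan, 2.0, 5.0],
    "serving_size": [None, "30 g", None],
})

required = ["fat_100g", "sugars_100g"]

# Drop rows missing any required column; NaNs in serving_size do not count here.
filtered = toy.dropna(subset=required)
print(filtered)
```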

There are a few more things we can get rid of. Several columns, such as code, url, and creator, are unrelated to nutrition and irrelevant to the project I want to do. I will remove those columns.

In [8]:
irrelevant_columns = ['code', 'url', 'creator', 'created_t', 'created_datetime',
       'last_modified_t', 'last_modified_datetime']
for column in irrelevant_columns:
    df.drop(labels=column, axis=1, inplace=True)

df.info()
print(df.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 197856 entries, 1 to 356022
Data columns (total 27 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   product_name                             195837 non-null  object 
 1   brands                                   194100 non-null  object 
 2   brands_tags                              194093 non-null  object 
 3   countries                                197832 non-null  object 
 4   countries_tags                           197832 non-null  object 
 5   countries_en                             197832 non-null  object 
 6   ingredients_text                         191434 non-null  object 
 7   serving_size                             162900 non-null  object 
 8   additives_n                              191434 non-null  float64
 9   additives                                191413 non-null  object 
 10  ingredients_from_palm_oil_n     
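Several columns can also be removed in one call by passing a list to drop(columns=...). A sketch on an invented frame (toy and its values are assumptions for illustration); this form returns a new DataFrame rather than modifying the original in place.

```python
import pandas as pd

# Invented frame with two bookkeeping columns and two content columns.
toy = pd.DataFrame({
    "code": ["0001", "0002"],
    "url": ["u1", "u2"],
    "product_name": ["Oat bar", "Cola"],
    "sugars_100g": [20.5, 10.6],
})

irrelevant = ["code", "url"]

# Drop all listed columns at once; toy itself keeps its four columns.
slim = toy.drop(columns=irrelevant)
print(slim.columns.tolist())
```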

Let's examine the NaN counts again.

In [9]:
for column in df.columns:
    print(column + ": {}".format(countNaNs(df, column)))

product_name: 0.010204391072295002
brands: 0.01898350315380883
brands_tags: 0.01901888241953744
countries: 0.00012130033964095099
countries_tags: 0.00012130033964095099
countries_en: 0.00012130033964095099
ingredients_text: 0.032457949215591135
serving_size: 0.17667394468704511
additives_n: 0.032457949215591135
additives: 0.03256408701277697
ingredients_from_palm_oil_n: 0.032457949215591135
ingredients_that_may_be_from_palm_oil_n: 0.032457949215591135
nutrition_grade_fr: 0.003517709849587579
states: 0.0
states_tags: 0.0
states_en: 0.0
energy_100g: 0.0
fat_100g: 0.0
saturated-fat_100g: 0.0
carbohydrates_100g: 0.0
sugars_100g: 0.0
fiber_100g: 0.0
proteins_100g: 0.0
salt_100g: 0.0
sodium_100g: 0.0
nutrition-score-fr_100g: 0.003517709849587579
nutrition-score-uk_100g: 0.003517709849587579


Most of the remaining columns are NaN in only a few percent of rows; the exception is serving_size, at about 18%. I think that for my purposes it is safe to drop every row that still has a NaN anywhere.

In [10]:
df = df.dropna()
for column in df.columns:
    print(column + ": {}".format(countNaNs(df, column)))

product_name: 0.0
brands: 0.0
brands_tags: 0.0
countries: 0.0
countries_tags: 0.0
countries_en: 0.0
ingredients_text: 0.0
serving_size: 0.0
additives_n: 0.0
additives: 0.0
ingredients_from_palm_oil_n: 0.0
ingredients_that_may_be_from_palm_oil_n: 0.0
nutrition_grade_fr: 0.0
states: 0.0
states_tags: 0.0
states_en: 0.0
energy_100g: 0.0
fat_100g: 0.0
saturated-fat_100g: 0.0
carbohydrates_100g: 0.0
sugars_100g: 0.0
fiber_100g: 0.0
proteins_100g: 0.0
salt_100g: 0.0
sodium_100g: 0.0
nutrition-score-fr_100g: 0.0
nutrition-score-uk_100g: 0.0
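Because dropna() with no arguments removes a row if any column is NaN, the share of rows lost can exceed the largest per-column proportion. A sketch of how to check that share before dropping, on an invented frame (toy and its values are assumptions for illustration):

```python
import pandas as pd

# Invented frame: rows 1 and 2 each contain at least one NaN.
toy = pd.DataFrame({
    "product_name": ["a", "b", None, "d"],
    "serving_size": ["30 g", None, None, "1 cup"],
    "energy_100g": [10.0, 20.0, 30.0, 40.0],
})

# True for every row that has a NaN in at least one column.
rows_with_nan = toy.isna().any(axis=1)

# Fraction of rows that a plain dropna() would remove.
lost_fraction = rows_with_nan.mean()
print("Fraction of rows that would be dropped: {}".format(lost_fraction))

cleaned = toy.dropna()
print("Rows kept: {}".format(len(cleaned)))
```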


Excellent! Now let's save this dataframe as a TSV file so that we have a slimmed-down dataset to work with for the classifier.

In [12]:
df.to_csv('nutrition_data.tsv', sep='\t', index=False)
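One way to confirm the saved file is usable is to read it back and compare it with the frame in memory. A sketch using a temporary directory and an invented frame (toy, the file name toy_nutrition.tsv, and the values are assumptions for illustration):

```python
import os
import tempfile
import pandas as pd

# Invented frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    "product_name": ["Oat bar", "Cola"],
    "sugars_100g": [20.5, 10.6],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_nutrition.tsv")
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    toy.to_csv(path, sep="\t", index=False)
    reloaded = pd.read_csv(path, sep="\t")

print(reloaded.shape == toy.shape)
```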