# Using the nutritional-facts dataset

In [41]:
import pandas as pd

nutritional_df = pd.read_csv(r'..\data\raw\nutritional-facts.csv')
nutritional_df.head()

Unnamed: 0,Food Name,Category Name,Calcium,Calories,Carbs,Cholesterol,Copper,Fats,Fiber,Folate,...,Vitamin D,Vitamin E,Vitamin K,Omega-3 - ALA,Omega-6 - Eicosadienoic acid,Omega-6 - Gamma-linoleic acid,Omega-3 - Eicosatrienoic acid,Omega-6 - Dihomo-gamma-linoleic acid,Omega-6 - Linoleic acid,Omega-6 - Arachidonic acid
0,Acerola,Fruits,0.012,32.0,7.7,0.0,9e-05,0.3,1.1,1.4e-05,...,,,,,,,,,,
1,Apple,Fruits,0.006,52.0,14.0,0.0,3e-05,0.17,2.4,3e-06,...,0.0,0.00018,2e-06,,,,,,,
2,Apricot,Fruits,0.013,48.0,11.0,0.0,8e-05,0.39,2.0,9e-06,...,0.0,0.00089,3e-06,,,,,,,
3,Dried fruit,Fruits,0.055,241.0,63.0,0.0,0.00034,0.51,7.3,1e-05,...,0.0,0.0043,3e-06,,,,,,,
4,Avocado,Fruits,0.012,160.0,8.5,0.0,0.00019,15.0,6.7,8.1e-05,...,0.0,0.0021,2.1e-05,0.11,0.0,0.02,,,,


This dataset has simple, intuitive food names (`Food Name`) and the category each food belongs to (`Category Name`). It covers many nutritional facts but lacks serving sizes, which could be added from another dataset (e.g. `nutri-facts-serving.csv`). Let's keep only the macronutrient features.

In [42]:
nutritional_df.columns

Index(['Food Name', 'Category Name', 'Calcium', 'Calories', 'Carbs',
       'Cholesterol', 'Copper', 'Fats', 'Fiber', 'Folate', 'Iron', 'Magnesium',
       'Monounsaturated Fat', 'Net carbs', 'Omega-3 - DHA', 'Omega-3 - DPA',
       'Omega-3 - EPA', 'Phosphorus', 'Polyunsaturated fat', 'Potassium',
       'Protein', 'Saturated Fat', 'Selenium', 'Sodium', 'Trans Fat',
       'Vitamin A (IU)', 'Vitamin A RAE', 'Vitamin B1', 'Vitamin B12',
       'Vitamin B2', 'Vitamin B3', 'Vitamin B5', 'Vitamin B6', 'Vitamin C',
       'Zinc', 'Choline', 'Fructose', 'Histidine', 'Isoleucine', 'Leucine',
       'Lysine', 'Manganese', 'Methionine', 'Phenylalanine', 'Starch', 'Sugar',
       'Threonine', 'Tryptophan', 'Valine', 'Vitamin D', 'Vitamin E',
       'Vitamin K', 'Omega-3 - ALA', 'Omega-6 - Eicosadienoic acid',
       'Omega-6 - Gamma-linoleic acid', 'Omega-3 - Eicosatrienoic acid',
       'Omega-6 - Dihomo-gamma-linoleic acid', 'Omega-6 - Linoleic acid',
       'Omega-6 - Arachidonic acid'],
   

In [43]:
cols = ['Food Name', 'Category Name', 'Calories', 'Carbs', 'Fats', 'Fiber', 'Protein']
nutritional_df = nutritional_df[cols]
nutritional_df.head()

Unnamed: 0,Food Name,Category Name,Calories,Carbs,Fats,Fiber,Protein
0,Acerola,Fruits,32.0,7.7,0.3,1.1,0.4
1,Apple,Fruits,52.0,14.0,0.17,2.4,0.26
2,Apricot,Fruits,48.0,11.0,0.39,2.0,1.4
3,Dried fruit,Fruits,241.0,63.0,0.51,7.3,3.4
4,Avocado,Fruits,160.0,8.5,15.0,6.7,2.0


Let's take a closer look at the distinct values of `Category Name`.

In [44]:
nutritional_df['Category Name'].unique()

array(['Fruits', 'Vegetables', 'Seafood', 'Dairy', 'Mushrooms', 'Grains',
       'Meat', 'Spices', 'Nuts', 'Greens', 'Sweets', 'Oils and Sauces',
       'Beverages', 'Soups', 'Baked Products', 'Fast Foods',
       'Meals, Entrees, and Side Dishes', 'Baby Foods'], dtype=object)

Some considerations about categories:

- `Fruits`, `Vegetables` and `Greens` should be picked according to their seasonality, using a complementary dataset (e.g. `nutri-facts-season.csv`)
- `Seafood`, `Meat` and `Dairy` should not be recommended to vegan users, while vegetarian users could still be offered `Dairy`
- `Dairy` should not be recommended to lactose-intolerant users
- `Grains` containing gluten should not be recommended to gluten-intolerant users, so this category requires a fine-grained subdivision
- `Nuts` should be filtered according to users' allergies, so this category requires a fine-grained selection
- `Sweets` is a special category that should be used according to the user's emotional state of the day
- `Oils and Sauces` may require a fine-grained selection for users with thyroid issues (e.g. soy)
- `Beverages` could be excluded from the dataset
- `Soups` needs further investigation
- `Fast Foods` is a special category that should be used according to the user's emotional state of the day
- `Meals, Entrees, and Side Dishes` needs further investigation and could be useful during recipe generation
- `Baby Foods` could be excluded from the dataset, as baby foods are not taken into account
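
The dietary rules in the list above could be encoded as a lookup from a user constraint to the categories it rules out. The following is only a minimal sketch: the constraint names (`vegan`, `vegetarian`, `lactose_intolerant`) and the helper `allowed_categories` are hypothetical, and only the category names come from the dataset. It uses plain Python so it does not depend on the DataFrame.

```python
# Hypothetical mapping from a dietary constraint to the categories it rules out.
# The constraint names are assumptions; the category names come from the dataset.
EXCLUDED_CATEGORIES = {
    "vegan": {"Seafood", "Meat", "Dairy"},
    "vegetarian": {"Seafood", "Meat"},
    "lactose_intolerant": {"Dairy"},
}


def allowed_categories(all_categories, constraints):
    """Return the categories left after applying every constraint, in original order."""
    excluded = set()
    for constraint in constraints:
        # Unknown constraints exclude nothing
        excluded |= EXCLUDED_CATEGORIES.get(constraint, set())
    return [category for category in all_categories if category not in excluded]


categories = ["Fruits", "Vegetables", "Seafood", "Dairy", "Meat", "Grains"]
print(allowed_categories(categories, ["vegetarian", "lactose_intolerant"]))
# ['Fruits', 'Vegetables', 'Grains']
```

The resulting list could then be used to filter rows, for instance with `nutritional_df['Category Name'].isin(...)`. Gluten, nut allergies and thyroid issues are left out on purpose, since they need a finer subdivision than the category level.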

Let's remove the `Baby Foods`, `Meals, Entrees, and Side Dishes` and `Fast Foods` categories and see how many rows are dropped.

In [45]:
# Categories dropped at this stage
excluded_categories = ['Baby Foods', 'Meals, Entrees, and Side Dishes', 'Fast Foods']

print("Before removing rows: ", nutritional_df.shape)
nutritional_df = nutritional_df[~nutritional_df['Category Name'].isin(excluded_categories)]
print("After removing rows: ", nutritional_df.shape)

Before removing rows:  (1174, 7)
After removing rows:  (1036, 7)


Now we can check whether every food in the seasonality dataset has a matching entry in the nutritional dataset.

In [46]:
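import json

# Load the seasonality data: country -> month -> list of food names
with open(r'..\data\raw\food-seasonality.json', 'r') as file:
    seasonality_data = json.load(file)

# Collect every seasonal food name for Italy, across all months
seasonal_food_names = []
for month, foods in seasonality_data['Italy'].items():
    seasonal_food_names.extend(foods)

# Compare names case-insensitively
seasonal_food_names = [food.lower() for food in seasonal_food_names]
nutritional_df = nutritional_df.copy()  # work on a copy, not on a view of the filtered frame
nutritional_df['Food Name'] = nutritional_df['Food Name'].str.lower()

# A set makes each membership test constant-time
known_food_names = set(nutritional_df['Food Name'])
missing_food_names = [food for food in seasonal_food_names if food not in known_food_names]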

if not missing_food_names:
    print("All seasonal food names are present in the nutritional_df.")
else:
    print("The following seasonal food names are missing in the nutritional_df:")
    print(missing_food_names)


All seasonal food names are present in the nutritional_df.
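
The check above only lower-cases the names, so a stray space in either file would show up as a missing food. A slightly more tolerant comparison could look like the sketch below. The helpers `normalize` and `missing_seasonal_foods` are hypothetical, the month-to-list structure is assumed from the loop over `seasonality_data['Italy']`, and the data is a toy example rather than the content of the real files.

```python
def normalize(name):
    """Case-fold a food name and collapse leading, trailing and repeated whitespace."""
    return " ".join(name.casefold().split())


def missing_seasonal_foods(seasonality_by_month, known_food_names):
    """Return the sorted seasonal names that have no match among known_food_names."""
    known = {normalize(name) for name in known_food_names}
    seasonal = {
        normalize(food)
        for foods in seasonality_by_month.values()
        for food in foods
    }
    return sorted(seasonal - known)


# Toy data, not taken from the real files
toy_seasonality = {"January": ["Apple", "Kiwi "], "June": ["Apricot", "apple"]}
toy_food_names = ["Apple", "Apricot", "Avocado"]
print(missing_seasonal_foods(toy_seasonality, toy_food_names))
# ['kiwi']
```

Using sets also removes duplicates, so a food that is in season for several months is reported only once.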
