# Finding the right ingredients to provide

The dataframe we obtained contains nutritional information about a wide variety of food products. For our analysis, the fields we are most interested in for each product are:
* **Food Group** - the general category the product belongs to
* **Food Name** - the name of the product itself
* **Protein (g)** - grams of protein in a 100g serving
* **Carbohydrates (g)** - grams of carbohydrates in a 100g serving
* **Fat (g)** - grams of fat in a 100g serving

From the National Agricultural Library (https://www.nal.usda.gov/fnic/how-many-calories-are-one-gram-fat-carbohydrate-or-protein), we know that 1 gram of protein, fat and carbohydrates provides 4, 9 and 4 kcal, respectively.
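
As a minimal sketch of how these factors turn gram columns into the kcal columns used in the rest of this notebook (the sample row is made up for illustration, and this is not necessarily the original preprocessing):

```python
import pandas as pd

# Energy per gram of each macronutrient (kcal/g), as quoted above.
KCAL_PER_GRAM = {"Protein": 4, "Carbohydrates": 4, "Fat": 9}

# Hypothetical sample row, in grams per 100 g serving.
grams = pd.DataFrame({
    "Food Name": ["Example food"],
    "Protein (g)": [10.0],
    "Carbohydrates (g)": [20.0],
    "Fat (g)": [5.0],
})

# Build the kcal/100g columns one macronutrient at a time.
kcal = grams[["Food Name"]].copy()
for macro, factor in KCAL_PER_GRAM.items():
    kcal[f"{macro} (kcal/100g)"] = grams[f"{macro} (g)"] * factor

print(kcal)
```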

We also assume that each person's diet should draw its calories in the following proportions (**REFERENCE**):
* 55% from proteins
* 25% from carbohydrates 
* 20% from fats.

To decide which products to provide to the countries in need, we apply a greedy ranking that looks for the products whose calorie split most closely matches these percentages.

In [37]:
import pandas as pd
test = pd.read_csv("../test_rank.csv").drop(columns="Unnamed: 0")
test

Unnamed: 0,Food Group,Food Name,Protein (kcal/100g),Carbohydrates (kcal/100g),Fat (kcal/100g)
0,Legumes and Legume Products,"Soy meal, defatted, raw",196.80,143.56,21.51
1,Legumes and Legume Products,"Soy flour, full-fat, raw",151.24,127.68,185.85
2,Legumes and Legume Products,"Soybeans, mature seeds, raw",145.96,120.64,179.46
3,Legumes and Legume Products,"Lupins, mature seeds, raw",144.68,161.48,87.66
4,Legumes and Legume Products,"Winged beans, mature seeds, raw",118.60,166.84,146.88
...,...,...,...,...,...
1322,Fruits and Fruit Juices,"Apples, raw, without skin",1.08,51.04,1.17
1323,Fruits and Fruit Juices,"Apples, raw, with skin",1.04,55.24,1.53
1324,Fruits and Fruit Juices,"Apples, raw, without skin, cooked, boiled",1.04,54.56,3.24
1325,Fruits and Fruit Juices,"Apples, raw, gala, with skin",1.00,54.72,1.08


In [38]:
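def rank_food(food):
    """Average deviation of a food's macronutrients from the target split (lower is better)."""
    prot = food['Protein (kcal/100g)']
    carb = food['Carbohydrates (kcal/100g)']
    fat = food['Fat (kcal/100g)']

    # foods with no macronutrient calories cannot be ranked
    if prot == 0 and carb == 0 and fat == 0:
        return -1

    tot = prot + carb + fat

    # deviation of each macronutrient from its target: 55% / 25% / 20% of the
    # total, divided by the kcal-per-gram factor, then scaled by 100
    err_prot = abs(tot*0.55/4 - prot) / 100
    err_carb = abs(tot*0.25/4 - carb) / 100
    err_fat = abs(tot*0.20/9 - fat) / 100

    avg_err = (err_prot + err_carb + err_fat)/3

    return avg_err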
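
Note that the columns are already expressed in kcal, whereas dividing the targets by 4 and 9 expresses them in grams. As a hedged sketch (not the function that produced the ranks shown in this notebook), a variant that compares calorie shares directly could look like this; the name `rank_food_by_share` is hypothetical:

```python
TARGET_SHARE = {
    "Protein (kcal/100g)": 0.55,
    "Carbohydrates (kcal/100g)": 0.25,
    "Fat (kcal/100g)": 0.20,
}

def rank_food_by_share(food):
    """Mean absolute gap between a food's calorie shares and the targets."""
    tot = sum(food[col] for col in TARGET_SHARE)
    if tot == 0:
        return -1  # same sentinel as rank_food for foods with no calories
    gaps = [abs(food[col] / tot - share) for col, share in TARGET_SHARE.items()]
    return sum(gaps) / len(gaps)

# A food whose calories follow the 55/25/20 split exactly has no gap.
perfect = {
    "Protein (kcal/100g)": 55.0,
    "Carbohydrates (kcal/100g)": 25.0,
    "Fat (kcal/100g)": 20.0,
}
print(rank_food_by_share(perfect))
```

Because the function only indexes by column name, it should work both on a plain dict and on a dataframe row, e.g. `test.apply(rank_food_by_share, axis=1)`.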

In [39]:
test['rank'] = test.apply(rank_food, axis=1)

In [40]:
test.sort_values(by="rank")

Unnamed: 0,Food Group,Food Name,Protein (kcal/100g),Carbohydrates (kcal/100g),Fat (kcal/100g),rank
1275,Vegetables and Vegetable Products,"Cucumber, peeled, raw",2.36,8.64,1.44,0.032252
1220,Vegetables and Vegetable Products,"Taro shoots, raw",3.68,9.28,0.81,0.035700
1082,Vegetables and Vegetable Products,"Watercress, raw",9.20,5.16,0.90,0.039563
1195,Vegetables and Vegetable Products,"Radishes, white icicle, raw",4.40,10.52,0.90,0.041015
1260,Vegetables and Vegetable Products,"Celery, raw",2.76,11.88,1.53,0.041922
...,...,...,...,...,...,...
1022,"Lamb, Veal, and Game Products","Lamb, New Zealand, imported, subcutaneous fat,...",15.48,2.20,685.44,2.642530
997,Pork Products,"Pork, cured, salt pork, raw",20.20,0.00,724.50,2.788970
1052,Pork Products,"Pork, fresh, backfat, raw",11.68,0.00,798.21,3.101701
1150,Beef Products,"Beef, variety meats and by-products, suet, raw",6.00,0.00,846.00,3.304889


In [41]:
test.sort_values(by="rank").drop_duplicates("Food Group")

Unnamed: 0,Food Group,Food Name,Protein (kcal/100g),Carbohydrates (kcal/100g),Fat (kcal/100g),rank
1275,Vegetables and Vegetable Products,"Cucumber, peeled, raw",2.36,8.64,1.44,0.032252
1225,Fruits and Fruit Juices,"Rhubarb, raw",3.6,18.16,1.8,0.061081
935,Dairy and Egg Products,"Egg, white, raw, frozen, pasteurized",40.8,4.16,0.0,0.123224
985,Finfish and Shellfish Products,"Mollusks, oyster, eastern, wild, raw",22.84,10.88,15.39,0.127322
922,"Lamb, Veal, and Game Products","Lamb, New Zealand, imported, testes, raw",45.6,0.56,21.42,0.199632
952,Legumes and Legume Products,"Tofu, raw, regular, prepared with calcium sulfate",32.32,7.48,43.02,0.214719
871,Beef Products,"Beef, New Zealand, imported, variety meats and...",59.44,0.0,17.82,0.232495
1017,Nut and Seed Products,"Seeds, lotus seeds, raw",16.52,69.12,4.77,0.234396
889,Pork Products,"Pork, fresh, variety meats and by-products, lu...",56.32,0.0,24.48,0.243148
725,Poultry Products,"Chicken, gizzard, all classes, raw",70.64,0.0,18.54,0.268366


In [42]:
# test.groupby(['Food Group'], as_index=False)['rank'].min().merge(test)
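
The commented-out merge above is one alternative way to keep the best-ranked product per group. Another option, sketched here on a small made-up frame, is `groupby(...).idxmin()` followed by `.loc`:

```python
import pandas as pd

# Made-up ranks, only to illustrate the selection.
df = pd.DataFrame({
    "Food Group": ["A", "A", "B", "B"],
    "Food Name": ["a1", "a2", "b1", "b2"],
    "rank": [0.30, 0.10, 0.25, 0.40],
})

# Index label of the lowest rank within each group, then select those rows.
best_idx = df.groupby("Food Group")["rank"].idxmin()
best_per_group = df.loc[best_idx].sort_values("rank")
print(best_per_group)
```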

In [43]:
prices = pd.read_csv("../data/raw/ProducerPricesEurope2018.csv", usecols=["Area", "Item", "Value"])
prices.head()

Unnamed: 0,Area,Item,Value
0,Albania,Apples,435.2
1,Albania,Apricots,777.8
2,Albania,Barley,302.2
3,Albania,"Beans, dry",2078.7
4,Albania,"Beans, green",976.7


**Note: prices in USD/tonne**

In [44]:
# filtering by considering only the best countries for surplus*population found before
best_countries = ['Spain', 'United Kingdom', 'Italy',
       'France', 'Germany']
prices = prices[prices.Area.isin(best_countries)]

In [45]:
# the products we are looking for must be available in all the selected countries
all_available = prices.groupby("Item").count().Area >= len(best_countries)

In [46]:
all_available[all_available==True]

Item
Apples                   True
Milk, whole fresh cow    True
Oats                     True
Potatoes                 True
Tomatoes                 True
Wheat                    True
Name: Area, dtype: bool

Requiring availability in all five countries is too restrictive, so we relax the threshold to the number of countries minus 2 (at least 3 of the 5).

In [47]:
all_available = prices.groupby("Item").count().Area >= len(best_countries) - 2
all_available[all_available==True]

Item
Apples                                True
Apricots                              True
Asparagus                             True
Barley                                True
Beans, dry                            True
Beans, green                          True
Cabbages and other brassicas          True
Carrots and turnips                   True
Cauliflowers and broccoli             True
Cherries                              True
Cucumbers and gherkins                True
Eggs, hen, in shell                   True
Hazelnuts, with shell                 True
Leeks, other alliaceous vegetables    True
Lettuce and chicory                   True
Maize                                 True
Meat live weight, cattle              True
Meat live weight, chicken             True
Meat live weight, pig                 True
Meat live weight, sheep               True
Meat, cattle                          True
Meat, chicken                         True
Meat, horse                           True
Meat, 

The situation has improved, so we use this list as our starting point.
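
The availability filter can be sketched with a configurable threshold. The toy data and the helper name `items_available_in` are assumptions for illustration; `nunique` is used here instead of `count` so that repeated rows for the same country would not be counted twice:

```python
import pandas as pd

def items_available_in(prices, min_countries):
    """Return the items priced in at least `min_countries` distinct areas."""
    n_areas = prices.groupby("Item")["Area"].nunique()
    return sorted(n_areas[n_areas >= min_countries].index)

# Made-up price rows: Apples in 3 areas, Oats in 2, Maize in 1.
toy = pd.DataFrame({
    "Area": ["Spain", "Italy", "France", "Spain", "Italy", "Spain"],
    "Item": ["Apples", "Apples", "Apples", "Oats", "Oats", "Maize"],
    "Value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

print(items_available_in(toy, min_countries=3))
print(items_available_in(toy, min_countries=2))
```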

In [48]:
foods = all_available[all_available==True].index.values

In [49]:
test[test["Food Name"].str.contains('|'.join(foods), case=False)]

Unnamed: 0,Food Group,Food Name,Protein (kcal/100g),Carbohydrates (kcal/100g),Fat (kcal/100g),rank
27,Legumes and Legume Products,"Peas, green, split, mature seeds, raw",95.28,254.96,10.44,0.935096
69,Poultry Products,"Chicken, broilers or fryers, light meat, meat ...",92.80,0.00,14.85,0.323947
110,Poultry Products,"Chicken, broiler or fryers, breast, skinless, ...",90.00,0.00,23.58,0.341792
169,Poultry Products,"Chicken, broilers or fryers, wing, meat only, raw",87.88,0.00,31.86,0.360329
260,"Lamb, Veal, and Game Products","Game meat, horse, raw",85.56,0.00,41.40,0.382056
...,...,...,...,...,...,...
1322,Fruits and Fruit Juices,"Apples, raw, without skin",1.08,51.04,1.17,0.179903
1323,Fruits and Fruit Juices,"Apples, raw, with skin",1.04,55.24,1.53,0.195937
1324,Fruits and Fruit Juices,"Apples, raw, without skin, cooked, boiled",1.04,54.56,3.24,0.199551
1325,Fruits and Fruit Juices,"Apples, raw, gala, with skin",1.00,54.72,1.08,0.193874


This gives 68 rows, but several of them match the same item. Let's look at the items that have no match at all:

In [50]:
total_foods = test["Food Name"].unique()
total_foods = [t.lower() for t in total_foods]
foods = [f.lower() for f in foods]

In [51]:
not_matching = [f for f in foods if not any(f in t for t in total_foods)]

In [52]:
len(not_matching)

33

In [53]:
not_matching[:10]

['beans, dry',
 'cabbages and other brassicas',
 'carrots and turnips',
 'cauliflowers and broccoli',
 'cucumbers and gherkins',
 'eggs, hen, in shell',
 'hazelnuts, with shell',
 'leeks, other alliaceous vegetables',
 'lettuce and chicory',
 'maize']

We can try removing all commas and parentheses, splitting each item into its individual words, and then dropping the duplicates:

In [54]:
foods = [f.replace(",","").replace("(", "").replace(")", "").split() for f in foods]
foods = [l for sublist in foods for l in sublist]
foods = list(set(foods))
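
To make the effect of this step concrete, here is the same transformation applied to two of the unmatched items listed earlier (a standalone illustration, not part of the pipeline):

```python
# Two of the unmatched items from the price data.
items = ["beans, dry", "leeks, other alliaceous vegetables"]

tokens = set()
for item in items:
    # Strip commas and parentheses, then split on whitespace.
    cleaned = item.replace(",", "").replace("(", "").replace(")", "")
    tokens.update(cleaned.split())

# Sorted only to get a stable display order; sets are unordered.
print(sorted(tokens))
# ['alliaceous', 'beans', 'dry', 'leeks', 'other', 'vegetables']
```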

The second step is to remove the stopwords this split generates and to add the singular form of each word:

In [55]:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words += ["dry", "weight", "live", "meat", "unmanufactured", "seed", "beans", "sugar", "whole", "peas", "green", "fresh"]
foods = [f for f in foods if f not in stop_words]
lem = WordNetLemmatizer()
singular = [lem.lemmatize(f) for f in foods]
foods += singular
foods = list(set(foods))
foods = [f for f in foods if f not in stop_words]

New matching check:

In [56]:
not_matching = [f for f in foods if not any(f in t for t in total_foods)]
len(not_matching)

30

In [57]:
test[test["Food Name"].str.contains('|'.join(foods+singular), case=False)].index.size

208

This coverage is enough for the next steps, so we stop here.
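
One caveat of plain substring matching is that short tokens can match inside unrelated words: a token such as `oat` is also a substring of `Goat`. A minimal sketch of a stricter pattern, using `re.escape` and `\b` word boundaries; the sample tokens and names are illustrative, and this is not the filter used in the following cells:

```python
import re

tokens = ["oat", "apple"]
names = ["Oat bran, raw", "Goat, raw", "Apples, raw, with skin"]

# Escape each token, allow an optional plural "s", and require word boundaries.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in tokens) + r")s?\b",
    flags=re.IGNORECASE,
)

matches = [name for name in names if pattern.search(name)]
print(matches)
# ['Oat bran, raw', 'Apples, raw, with skin']
```

With pandas, the same pattern string could be passed as `test["Food Name"].str.contains(pattern.pattern, case=False)`.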

In [58]:
poss_diet = test[test["Food Name"].str.contains('|'.join(foods), case=False)]

In [59]:
poss_diet["Food Group"].unique()

array(['Legumes and Legume Products', 'Poultry Products',
       'Lamb, Veal, and Game Products', 'Finfish and Shellfish Products',
       'Cereal Grains and Pasta', 'Dairy and Egg Products',
       'Vegetables and Vegetable Products', 'Baked Products',
       'Nut and Seed Products', 'Fruits and Fruit Juices'], dtype=object)

Some groups are certainly not useful for building the diet:
- Baked Products
- Spices and Herbs  
- Nut and Seed Products
- Dairy and Egg Products

Also, Finfish and Shellfish Products is barely represented in our prices database, so we drop all of these groups.

In [60]:
poss_diet = poss_diet[~poss_diet["Food Group"].isin(["Baked Products","Spices and Herbs", "Nut and Seed Products", "Dairy and Egg Products", "Finfish and Shellfish Products"])].reset_index(drop=True)

In [61]:
# taking the top 5 products of each group, separately for each macronutrient
poss_diet_proteins = poss_diet.groupby(["Food Group"]).apply(lambda x: x.sort_values(['Protein (kcal/100g)'], \
                                   ascending=False).reset_index(drop=True).groupby("Food Group").head(5)).reset_index(drop=True)
poss_diet_carbo = poss_diet.groupby(["Food Group"]).apply(lambda x: x.sort_values(['Carbohydrates (kcal/100g)'], \
                                   ascending=False).reset_index(drop=True).groupby("Food Group").head(5)).reset_index(drop=True)
poss_diet_fat = poss_diet.groupby(["Food Group"]).apply(lambda x: x.sort_values(['Fat (kcal/100g)'], \
                                   ascending=False).reset_index(drop=True).groupby("Food Group").head(5)).reset_index(drop=True)
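
The three near-identical statements above could also be expressed through one small helper. This is a sketch on made-up data, with the hypothetical name `top_n_per_group`; rows with tied values may come out in a different order than with the original expression:

```python
import pandas as pd

def top_n_per_group(df, column, n=5, group="Food Group"):
    """Keep the `n` rows with the highest `column` value within each group."""
    return (
        df.sort_values([group, column], ascending=[True, False])
          .groupby(group)
          .head(n)
          .reset_index(drop=True)
    )

# Made-up values, only to illustrate the selection.
toy = pd.DataFrame({
    "Food Group": ["B", "A", "A", "A", "B"],
    "Food Name": ["b1", "a1", "a2", "a3", "b2"],
    "Protein (kcal/100g)": [5.0, 1.0, 3.0, 2.0, 7.0],
})

top2 = top_n_per_group(toy, "Protein (kcal/100g)", n=2)
print(top2)
```

Called as `top_n_per_group(poss_diet, 'Fat (kcal/100g)')`, it would play the role of the `poss_diet_fat` statement.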

In [66]:
poss_diet_proteins.sort_values(['rank'])

Unnamed: 0,Food Group,Food Name,Protein (kcal/100g),Carbohydrates (kcal/100g),Fat (kcal/100g),rank
24,Vegetables and Vegetable Products,"Cowpeas, leafy tips, raw",16.4,19.28,2.25,0.098337
25,Vegetables and Vegetable Products,"Mushrooms, straw, canned, drained solids",15.32,18.56,6.12,0.103704
23,Vegetables and Vegetable Products,"Balsam-pear (bitter gourd), leafy tips, raw",21.2,13.16,6.21,0.105181
26,Vegetables and Vegetable Products,"Mushrooms, oyster, raw",13.24,24.36,3.69,0.107048
8,Fruits and Fruit Juices,"Apricots, raw",5.6,44.48,3.51,0.150728
6,Fruits and Fruit Juices,"Groundcherries, (cape-gooseberries or poha), raw",7.6,44.8,6.3,0.155327
9,Fruits and Fruit Juices,"Cherries, sweet, raw",4.24,64.04,1.8,0.217662
16,Legumes and Legume Products,SILK Banana-Strawberry soy yogurt,9.4,68.24,10.62,0.247061
15,Legumes and Legume Products,SILK Strawberry soy yogurt,9.4,72.96,10.62,0.263624
11,"Lamb, Veal, and Game Products","Goat, raw",82.4,0.0,20.79,0.310525


In [62]:
# sorting the selected products by protein content
poss_diet_proteins.sort_values(['Protein (kcal/100g)'], ascending=False)

Unnamed: 0,Food Group,Food Name,Protein (kcal/100g),Carbohydrates (kcal/100g),Fat (kcal/100g),rank
12,Legumes and Legume Products,"Cowpeas, catjang, mature seeds, raw",95.4,238.56,18.63,0.914122
13,Legumes and Legume Products,"Cowpeas, common (blackeyes, crowder, southern)...",94.08,240.12,11.34,0.895844
17,Poultry Products,"Chicken, broilers or fryers, light meat, meat ...",92.8,0.0,14.85,0.323947
18,Poultry Products,"Chicken, stewing, light meat, meat only, raw",92.4,0.0,37.89,0.392076
19,Poultry Products,"Chicken, broiler or fryers, breast, skinless, ...",90.0,0.0,23.58,0.341792
20,Poultry Products,"Chicken, roasting, light meat, meat only, raw",88.8,0.0,14.67,0.311368
21,Poultry Products,"Chicken, broilers or fryers, wing, meat only, raw",87.88,0.0,31.86,0.360329
14,Legumes and Legume Products,"Pigeon peas (red gram), mature seeds, raw",86.8,251.12,13.41,0.910856
10,"Lamb, Veal, and Game Products","Game meat, horse, raw",85.56,0.0,41.4,0.382056
11,"Lamb, Veal, and Game Products","Goat, raw",82.4,0.0,20.79,0.310525


In [63]:
# sorting the selected products by carbohydrate content
poss_diet_carbo.sort_values(['Carbohydrates (kcal/100g)'], ascending=False)

Unnamed: 0,Food Group,Food Name,Protein (kcal/100g),Carbohydrates (kcal/100g),Fat (kcal/100g),rank
0,Cereal Grains and Pasta,"Rice, white, long-grain, regular, raw, enriched",28.52,319.8,5.94,1.06594
1,Cereal Grains and Pasta,"Rice, white, long-grain, regular, raw, unenriched",28.52,319.8,5.94,1.06594
2,Cereal Grains and Pasta,"Rice, white, medium-grain, raw, enriched",26.44,317.36,5.22,1.065442
3,Cereal Grains and Pasta,"Rice, white, medium-grain, raw, unenriched",26.44,317.36,5.22,1.065442
4,Cereal Grains and Pasta,"Rice, white, short-grain, raw, unenriched",26.0,316.6,4.68,1.065611
12,Legumes and Legume Products,"Pigeon peas (red gram), mature seeds, raw",86.8,251.12,13.41,0.910856
13,Legumes and Legume Products,"Cowpeas, common (blackeyes, crowder, southern)...",94.08,240.12,11.34,0.895844
14,Legumes and Legume Products,"Cowpeas, catjang, mature seeds, raw",95.4,238.56,18.63,0.914122
5,Fruits and Fruit Juices,"Strawberries, frozen, sweetened, sliced",2.12,103.68,1.17,0.3693
6,Fruits and Fruit Juices,"Custard-apple, (bullock's-heart), raw",6.8,100.8,5.4,0.351213


In [64]:
# sorting the selected products by fat content
poss_diet_fat.sort_values(['Fat (kcal/100g)'], ascending=False)

Unnamed: 0,Food Group,Food Name,Protein (kcal/100g),Carbohydrates (kcal/100g),Fat (kcal/100g),rank
17,Poultry Products,"Chicken, broilers or fryers, separable fat, raw",14.92,0.0,611.55,2.360008
18,Poultry Products,"Chicken, skin (drumsticks and thighs), raw",38.32,3.16,398.07,1.449107
19,Poultry Products,"Chicken, skin (drumsticks and thighs), with ad...",44.44,0.04,341.1,1.217225
20,Poultry Products,"Chicken, broilers or fryers, skin only, raw",53.32,0.0,291.15,1.0366
21,Poultry Products,"Chicken, broilers or fryers, back, meat and sk...",56.2,0.0,258.66,0.947495
0,Cereal Grains and Pasta,"Oat bran, raw",69.2,264.88,63.27,1.030167
10,"Lamb, Veal, and Game Products","Game meat, horse, raw",85.56,0.0,41.4,0.382056
1,Cereal Grains and Pasta,"Rice, brown, long-grain, raw",30.16,305.0,28.8,1.076163
2,Cereal Grains and Pasta,"Rice, brown, medium-grain, raw",30.0,304.68,24.12,1.059122
11,"Lamb, Veal, and Game Products","Goat, raw",82.4,0.0,20.79,0.310525


Recall the target calorie split (**REFERENCE**):
* 55% from proteins
* 25% from carbohydrates
* 20% from fats.

Our values are expressed in **kcal/100g**. With the data gathered so far, we can try to build a diet that covers almost all of the remaining Food Group categories while using only products for which we have prices in the countries of interest.  
To do so, we consider a reference diet of **1000 kcal**; the grams of each product can then be scaled from this baseline.
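
To make the scaling concrete: with values in kcal/100g, the grams of a product needed to supply a given number of calories follow from a simple proportion, and a price in USD/tonne converts to a cost for that quantity. A sketch under stated assumptions: the helper names and the 250 kcal allocation are hypothetical, the rice figures are taken from the tables in this notebook, and pairing white rice with the paddy-rice price is for illustration only:

```python
def grams_for_kcal(target_kcal, kcal_per_100g):
    """Grams of product needed to supply `target_kcal`."""
    return target_kcal / kcal_per_100g * 100

def cost_usd(grams, usd_per_tonne):
    """Cost of `grams` of product at a producer price in USD/tonne."""
    return grams / 1_000_000 * usd_per_tonne  # 1 tonne = 1,000,000 g

# "Rice, white, long-grain, regular, raw, enriched": total kcal per 100 g
# (protein + carbohydrates + fat).
rice_kcal_per_100g = 28.52 + 319.8 + 5.94

# Hypothetical allocation: 250 of the 1000 kcal come from rice.
grams = grams_for_kcal(250, rice_kcal_per_100g)
print(round(grams, 1), "g of rice")
print(cost_usd(grams, usd_per_tonne=403.7), "USD at the Italian paddy-rice price")
```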

In [36]:
prices[prices.Item=="Rice, paddy"]

Unnamed: 0,Area,Item,Value
521,France,"Rice, paddy",372.0
779,Italy,"Rice, paddy",403.7
1527,Spain,"Rice, paddy",344.6
