# Content Dictionaries

This notebook details the processes used to construct three different content dictionary versions.

These dictionaries look up the molecular content of an ingredient by name. For example, content_dict_presence["Kiwi"] lists all compounds found in Kiwi by content ID. These content IDs can be tied back to the compound names used in FooDB with __source_id_lookup()__, as shown at the bottom of this notebook.

Each of the three dictionaries offers slightly different information.

1. __content_dict_weight.pkl__ lists all compounds found in each ingredient with their corresponding weights in mg/100g. It limits the scope of its output to compounds whose weight is included in FooDB's database.


2. __content_dict_presence.pkl__ lists all compounds present in each ingredient. This dictionary does not concern itself with weights.


3. __content_dict_complete.pkl__ lists all compounds found in each ingredient with their corresponding weights in mg/100g. Compounds whose weight is unknown are given a value of 1e-13 to signify their presence in an unknown amount.
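
To make the three layouts concrete, here is a minimal sketch with toy entries. The food name, content IDs, and weights are invented placeholders, not values from FooDB; only the shape of each dictionary follows the descriptions above.

```python
# Toy entries only: the IDs and weights below are invented for illustration.
content_dict_weight = {"Kiwi": [[1, 12.5], [2, 0.3]]}
content_dict_presence = {"Kiwi": [1, 2, 3]}
content_dict_complete = {"Kiwi": [[1, 12.5], [2, 0.3], [3, 1e-13]]}

# Presence: a flat list of content IDs.
print(content_dict_presence["Kiwi"])

# Weight and complete: [content ID, weight in mg/100g] pairs.
for source_id, weight in content_dict_complete["Kiwi"]:
    known = weight != 1e-13  # 1e-13 marks "present, amount unknown"
    print(source_id, weight if known else "unknown amount")
```

In this toy case, compound 3 appears in the presence and complete dictionaries but not in the weight dictionary, because its amount is unknown.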

## Imports and Data

Importing standard packages and loading in data.

The dataset comes from FooDB (https://foodb.ca/). It consists of 5.1 million rows describing the molecular compounds found in various common foods. This notebook outlines how to prepare dictionaries so that the compounds found in each ingredient can be looked up easily.

In [51]:
import pandas as pd
import time
import pickle

In [52]:
def time_print(sec):
    '''Used to time blocks and output result in a more readable way.'''
    if sec>3600:
        print("Segment took",round(sec/3600,2),"hour(s) to complete.")
    elif sec>60:
        print("Segment took",round(sec/60,2), "minute(s) to complete.")
    else:
        print("Segment took",round(sec,2), "second(s) to complete.")

In [54]:
s_import = time.time()

content = pd.read_csv("Content.csv")
content.shape

e_import = time.time()
t_import = e_import - s_import
time_print(t_import)

Segment took 23.87 second(s) to complete.


## 1. content_dict_weight.pkl

__content_dict_weight__ restricts its view to rows whose unit is a variant of mg/100 g, so that all recorded weights are directly comparable. This greatly decreases the size of the dataset, from 5.1M rows to about 800,000.

In [55]:
s_2d = time.time()

# Keep only rows whose original unit is a spelling variant of mg/100 g.
weight_units = ["mg/100g", "mg/100 g", "mg/100 g freshweight", "mg/100 g fresh weight"]
content_weight = content[content.orig_unit.isin(weight_units)]
print(content_weight.shape)  # Limiting to rows with known weights cuts the size of our data immensely

e_2d = time.time()
t_2d = e_2d-s_2d
time_print(t_2d)


(803930, 26)
Segment took 1.0 second(s) to complete.


In [64]:
s_2 = time.time()

content_dict_weight = {}
#for i in range(200):
for i in range(content_weight.shape[0]):
    row  = content_weight.iloc[i]
    if row.orig_food_common_name in content_dict_weight.keys():
        content_dict_weight[row.orig_food_common_name].append(
            [row.source_id, row.standard_content])
    else:
        content_dict_weight[row.orig_food_common_name] = [[row.source_id, row.standard_content]]
        
    if i in [1,100,1000,10000,20000,100000,400000,800000]:
        print(i, "loop(s)", time.time()-s_2)
e_2 = time.time()
t_2 = e_2-s_2
time_print(t_2)

1 loop(s) 0.015532970428466797
100 loop(s) 0.04726767539978027
1000 loop(s) 0.22136592864990234
10000 loop(s) 1.8898768424987793
20000 loop(s) 3.7163479328155518
100000 loop(s) 18.018797874450684
400000 loop(s) 68.46686100959778
800000 loop(s) 135.59464979171753
Segment took 2.27 minute(s) to complete.
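
The loop above calls `.iloc[i]` once per row, which builds a new pandas Series each time. As a hedged alternative, the sketch below builds the same kind of dictionary by zipping over the relevant columns. It runs on a small invented frame (`toy`, a stand-in for `content_weight`), so its speed on the real data is untested here.

```python
import pandas as pd

# Toy stand-in for content_weight; the values are invented for illustration.
toy = pd.DataFrame({
    "orig_food_common_name": ["Apple", "Apple", "Kiwi"],
    "source_id": [1, 2, 1],
    "standard_content": [1870.5, 733.5, 12.0],
})

toy_dict = {}
# Iterate over the three columns in parallel instead of materialising each row.
for name, sid, amount in zip(
    toy["orig_food_common_name"], toy["source_id"], toy["standard_content"]
):
    # setdefault creates the list on first sight of a food name.
    toy_dict.setdefault(name, []).append([sid, amount])

print(toy_dict)
```

The resulting structure matches the one produced by the original loop: each food name maps to a list of `[source_id, standard_content]` pairs in row order.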


In [66]:
with open('content_dict_weight.pkl', 'wb') as f:
    pickle.dump(content_dict_weight, f, protocol=pickle.HIGHEST_PROTOCOL)

## 2. content_dict_presence.pkl and 3. content_dict_complete.pkl  

These two dictionaries both use the entirety of Content.csv. To avoid looping through the data twice, we build both in a single pass.

* __content_dict_presence.pkl__ includes all compounds found in each ingredient but does not include their quantities.



* __content_dict_complete.pkl__ uses weights when available and fills unavailable weights with a dummy value, 1e-13.

In [59]:
s_3d = time.time()

# Note: fillna is applied to the whole frame, so missing values in every
# column (not only standard_content) are replaced with 1e-13.
content_complete = content.fillna(1e-13)
print(content_complete.shape)

e_3d = time.time()
t_3d = e_3d-s_3d
time_print(t_3d)

(5145532, 26)
Segment took 16.59 second(s) to complete.
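
Because `fillna` is called on the whole DataFrame, any missing value in any column receives the dummy weight, including missing names. Whether Content.csv actually contains missing food names is not checked in this notebook. The sketch below contrasts the two behaviours on an invented frame (`toy`), with a variant that fills only `standard_content`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing food name and one missing weight (invented data).
toy = pd.DataFrame({
    "orig_food_common_name": pd.Series(["Apple", None, "Kiwi"], dtype=object),
    "source_id": [1, 2, 3],
    "standard_content": [1870.5, 0.4, np.nan],
})

# Whole-frame fill: the missing name also becomes 1e-13.
filled_all = toy.fillna(1e-13)

# Column-restricted fill: only the weight column is touched.
filled_col = toy.copy()
filled_col["standard_content"] = filled_col["standard_content"].fillna(1e-13)

print(filled_all["orig_food_common_name"].tolist())
print(filled_col["orig_food_common_name"].tolist())
```

With the whole-frame fill, a row with a missing food name would end up under the dictionary key `1e-13`. The column-restricted variant leaves that decision explicit.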


In [69]:
s_3 = time.time()

content_dict_presence = {}
content_dict_complete = {}


#for i in range(20000):
for i in range(content_complete.shape[0]):
    row  = content_complete.iloc[i]
    if row.orig_food_common_name in content_dict_complete.keys():
        content_dict_presence[row.orig_food_common_name].append(row.source_id)
        content_dict_complete[row.orig_food_common_name].append(
            [row.source_id, row.standard_content])
    else:
        content_dict_presence[row.orig_food_common_name] = [row.source_id]
        content_dict_complete[row.orig_food_common_name] = [[row.source_id, row.standard_content]]

        
    if i in [1,100,1000,10000,50000, 100000, 500000, 1000000, 3000000, 5000000]:
        print(i, "loop(s)", time.time()-s_3)
e_3 = time.time()
t_3 = e_3-s_3
time_print(t_3)

1 loop(s) 0.10676980018615723
100 loop(s) 0.12758708000183105
1000 loop(s) 0.3014981746673584
10000 loop(s) 1.9556689262390137
50000 loop(s) 9.329163074493408
100000 loop(s) 18.587740182876587
500000 loop(s) 92.57575917243958
1000000 loop(s) 186.80687403678894
3000000 loop(s) 560.2382588386536
5000000 loop(s) 931.9627001285553
Segment took 15.98 minute(s) to complete.


In [71]:
with open('content_dict_presence.pkl', 'wb') as f:
    pickle.dump(content_dict_presence, f, protocol=pickle.HIGHEST_PROTOCOL)

In [72]:
with open('content_dict_complete.pkl', 'wb') as f:
    pickle.dump(content_dict_complete, f, protocol=pickle.HIGHEST_PROTOCOL)
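
To use the saved dictionaries elsewhere, they can be read back with `pickle.load`. The sketch below shows the round trip on a small invented dictionary written to a temporary directory; the file name `toy_presence.pkl` is a placeholder, not one of the files produced above.

```python
import os
import pickle
import tempfile

# Invented example dictionary, standing in for content_dict_presence.
toy_dict = {"Apple": [1, 2, 3]}

# Write to a temporary location so no real output file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "toy_presence.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

# Read the dictionary back; note the binary read mode "rb".
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_dict)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.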

## Examples

### Dictionary Comparison

Below we provide examples of the information contained in each of the three dictionaries.

In [87]:
def compare_dictionaries(ingredient):
    print("content_dict_weight:\n", 
          len(content_dict_weight[ingredient]),"elements\n",
          content_dict_weight[ingredient],"\n\n")
    print("content_dict_presence:\n", 
          len(content_dict_presence[ingredient]),"elements\n",
          content_dict_presence[ingredient],"\n\n")
    print("content_dict_complete:\n", 
          len(content_dict_complete[ingredient]),"elements\n",
          content_dict_complete[ingredient])
    

compare_dictionaries("Apple")

content_dict_weight:
 109 elements
 [[1, 1870.5], [1, 20500.0], [2, 733.5], [3, 55040.0], [446, 14.45], [465, 30.0], [474, 38.3], [484, 28.85], [556, 25.25], [565, 1.95], [570, 22.75], [574, 0.2], [787, 17.1], [1014, 0.4], [1131, 9500.0], [1161, 3.35], [1224, 21.1], [1946, 43.3], [2250, 7.2], [2257, 21.65], [2558, 67.75], [2593, 23.75], [2890, 7.2], [2942, 25.25], [3011, 3.65], [3243, 0.3], [3513, 2265.0], [3514, 30.65], [3516, 6.3], [3517, 0.01525], [3519, 26.3], [3521, 49.65], [3522, 662.5], [3524, 6.65], [3571, 6.47], [3572, 0.15], [3572, 0.17], [3576, 5.55], [3582, 0.0024], [3583, 0.212], [3599, 3.0], [3637, 1.45], [3654, 0.02535], [3714, 3.55], [3718, 1.2325], [3730, 1.75], [21594, 209.5], [3764, 0.02155], [3765, 0.441], [3767, 0.0014], [3771, 0.11], [3778, 3.2001], [3781, 0.001], [3784, 0.5135], [3787, 0.43825], [3790, 0.15275], [3909, 63.2], [3959, 0.45], [4005, 72.35], [4147, 134.0], [4182, 0.0108], [4189, 0.00485], [4190, 0.054000000000000006], [4687, 1.75], [5127, 3.3], [5368

### source_id Lookup

The dictionaries above identify molecular compounds in food by their source_id, as used in FooDB's Content.csv.

The function below can be used to look up the common names used for that source_id. Please note that it is very common to see multiple variations on the same name listed. This is why we chose to aggregate on source_id rather than on orig_source_name.

In [89]:
def source_id_lookup(source_id):
    return content[content.source_id==source_id].orig_source_name.unique()

In [95]:
source_id_lookup(16369)

array(['Apigenin 7-O-glucoside', 'APIGETRIN', '7-GLUCOSYL-APIGENIN',
       'APIGENIN-7-GLUCOSIDE|APIGETRIN|COSMOSIIN', 'COSMOSIIN',
       'APIGENIN-7-O-BETA-D-GLUCOSIDE', 'APIGENIN-7-GLUCOSIDE',
       'APIGENIN-7-O-BETA-D-GLUCOSIDE|COSMOSIIN',
       'APIGENIN-7-BETA-D-GLUCOPYRANOSIDE', 'APIGENIN-7-O-GLUCOSIDE',
       'Cosmosiin', nan], dtype=object)
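
The result above includes `nan` because some rows have no name recorded. A possible variant, sketched here on an invented frame (`toy_content`) with a hypothetical helper name `source_id_lookup_clean`, drops missing names and returns a plain list.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the content DataFrame.
toy_content = pd.DataFrame({
    "source_id": [7, 7, 7, 8],
    "orig_source_name": pd.Series(
        ["Cosmosiin", "COSMOSIIN", np.nan, "Other"], dtype=object
    ),
})

def source_id_lookup_clean(df, source_id):
    """Return the unique, non-missing names recorded for a source_id."""
    names = df.loc[df["source_id"] == source_id, "orig_source_name"]
    return names.dropna().unique().tolist()

print(source_id_lookup_clean(toy_content, 7))
```

Taking the DataFrame as a parameter, rather than reading the global `content`, is a design choice for this sketch so that it can be tried on small test frames.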