## Mapping Recipes to Compounds

This notebook is used to generate __all_recipes_presence.pkl and all_recipes_presence.csv__, a dataset of recipes transformed into their molecular components. Both file formats contain identical data. Each contains 176,286 recipes scraped from Food.com and converted to their molecular components using FooDB.

The notebook builds on work that is available elsewhere in this repository, namely __content_dict_generator.ipynb__ and __recipe_cleaner.ipynb__. Please consult those notebooks for a description of how the recipes were cleaned and how the content dictionary was created.

### Imports

In [190]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import ast
import time
import pickle
import zipfile

### Cleaned Recipes

I am importing recipes that I cleaned in an earlier notebook (processing_scraped_data). Ingredient names were converted to names recognized by FooDB using BERT pretrained embeddings, with a cosine-similarity cutoff of 0.65 for accepting a matched ingredient. I chose this cutoff because terms matched below 0.65 are very dissimilar. Some strange "matches" remain above 0.65, but given the size of our dataset I consider this noise inconsequential. The cutoff lets us preserve most of the data while filtering out recipes we do not have data on. A more thorough description of this process is available in __recipe_cleaner.ipynb__.
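To make the thresholding step concrete, here is a minimal sketch of accepting or rejecting a match by cosine similarity. It is not the code from __recipe_cleaner.ipynb__: the helper names `cosine_similarity` and `best_match`, the toy 3-dimensional vectors, and the choice to accept scores equal to the cutoff are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_vec, candidates, cutoff=0.65):
    """Return (name, score) for the most similar candidate, or None if below cutoff.

    `candidates` maps a FooDB-style name to an embedding vector (hypothetical data).
    """
    best_name, best_score = None, -1.0
    for name, vec in candidates.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < cutoff:
        return None  # no candidate is similar enough; the ingredient is unmatched
    return best_name, best_score

# Toy vectors standing in for real embeddings (assumption, not real BERT output).
toy_candidates = {"Ginger": [1.0, 0.0, 0.0], "Soy sauce": [0.0, 1.0, 0.0]}
match_close = best_match([0.9, 0.1, 0.0], toy_candidates)  # points mostly along "Ginger"
match_far = best_match([0.5, 0.5, 0.7], toy_candidates)    # not close to either name
```

Here `match_close` is accepted as "Ginger" while `match_far` falls below the cutoff and is returned as `None`.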

In [192]:
import pickle
with open('all_recipes_cleaned.pkl', 'rb') as f:
    all_recipes_cleaned = pickle.load(f)

In [194]:
print(len(all_recipes_cleaned),"\n", all_recipes_cleaned[0])

176286 
 [['Chicken spread', 226.0], ['Sauce, shaslik', 30.0], ['Soy sauce', 30.0], ['Couscous, cooked', 45.0], ['Ginger', 7.0], ['Sugar, turbinado', 36.0]]


There are 176,286 recipes that survived the 0.65 cosine similarity threshold. Each is represented as a list of ingredient-quantity pairs. For this simplified mapping we will disregard quantities.
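As a small illustration of this representation, the sketch below takes the first three pairs of the recipe printed above and keeps only the ingredient names. The variable names are illustrative and do not appear elsewhere in the notebook.

```python
# First three [ingredient name, quantity] pairs from the recipe shown above.
sample_recipe = [['Chicken spread', 226.0], ['Sauce, shaslik', 30.0], ['Soy sauce', 30.0]]

# Unpack each pair and keep only the name; the quantity is ignored.
ingredient_names = [name for name, _quantity in sample_recipe]
```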

### Content Dictionary

Here I load in a dictionary of the molecular compounds found in each ingredient.

The work to generate this content dictionary is located in this repository in the notebook named __content_dict_generator.ipynb__.


In [185]:
# Load the pickled content dictionary (ingredient name -> compound IDs found in it)
with open('content_dict_presence.pkl', 'rb') as f:
    content_dict_presence = pickle.load(f)

### Mapping to Compounds

In [195]:
all_recipes_presence = []
loopstart = time.time()
recipe_count = len(all_recipes_cleaned)
#recipe_count = 2 #For testing
for i in range(recipe_count):
    try:
        recipe = all_recipes_cleaned[i]
        d = []  # unique compound IDs for this recipe, in first-seen order
        for ingredient in recipe:
            # ingredient is an [ingredient name, quantity] pair; only the name is used
            for x in content_dict_presence[ingredient[0]]:
                if x not in d:
                    d.append(x)

        all_recipes_presence.append(d)

        if i in [1,10,100,1000,20000,50000,100000]:
            print(i, "loop(s)", time.time()-loopstart)
    except KeyError:
        # Skip a recipe if one of its ingredients is missing from the content dictionary
        pass

print("Segment took", round(time.time() - loopstart), "seconds.")

1 loop(s) 0.046811819076538086
10 loop(s) 0.14573192596435547
100 loop(s) 0.6762518882751465
1000 loop(s) 5.61813497543335
20000 loop(s) 98.17081809043884
50000 loop(s) 240.75132179260254
100000 loop(s) 476.28015208244324
Segment took 842 seconds.
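Much of the run time above likely comes from the membership test against a list, which scans the list for every compound. The following is a minimal sketch of the same first-seen-order de-duplication using a dict, whose key lookups are constant time on average. It has not been benchmarked against the full dataset; the function name `recipe_to_compounds` and the toy dictionary are assumptions for illustration, and the sketch assumes compound IDs are hashable.

```python
def recipe_to_compounds(recipe, content_dict):
    """Map one recipe to its unique compound IDs, preserving first-seen order."""
    seen = {}  # dicts keep insertion order (Python 3.7+), so keys act as an ordered set
    for name, _quantity in recipe:
        for compound_id in content_dict[name]:
            seen.setdefault(compound_id, None)
    return list(seen)

# Toy stand-ins for content_dict_presence and one cleaned recipe (hypothetical values).
toy_content_dict = {"Ginger": [2, 3, 11], "Soy sauce": [3, 5, 2, 20]}
toy_recipe = [["Ginger", 7.0], ["Soy sauce", 30.0]]
toy_compounds = recipe_to_compounds(toy_recipe, toy_content_dict)
```

On the toy data this yields `[2, 3, 11, 5, 20]`: the IDs 3 and 2 from "Soy sauce" are dropped because "Ginger" already contributed them.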


In [196]:
all_recipes_presence[0]

[2,
 3,
 4,
 5,
 11,
 15,
 20,
 21,
 23,
 24,
 29,
 34,
 36,
 38,
 3513,
 753,
 13393,
 2100,
 455,
 3716,
 3514,
 16258,
 3519,
 3521,
 3522,
 3524,
 3730,
 3583,
 3637,
 13403,
 13831,
 14616,
 13719,
 12022,
 14537,
 1224,
 8425,
 12163,
 1014,
 8323,
 574,
 14507,
 13267,
 2250,
 12002,
 12400,
 1946,
 474,
 12686,
 12566,
 14708,
 446,
 465,
 2257,
 11859,
 556,
 12570,
 12538,
 484,
 570,
 12742,
 13272,
 21594,
 12065,
 13900,
 3337,
 12030,
 3011,
 2890,
 11682,
 2942,
 3004,
 3103,
 6299,
 21595,
 39,
 31,
 1131,
 21596,
 12735,
 12814,
 565,
 12360,
 8417,
 14513,
 21468,
 3636,
 3517,
 13447,
 12531,
 12533,
 1145,
 1193,
 1128,
 10035,
 8052,
 2928,
 21981,
 12636,
 11678,
 21631,
 2953,
 3772,
 4288,
 21632,
 12763,
 12465,
 2943,
 11875,
 13401,
 1,
 4858,
 21950,
 2251,
 11820,
 23049,
 2520,
 21947,
 16140,
 10326,
 3307,
 4677,
 21973,
 12861,
 3006,
 6288,
 21989,
 12126,
 12890,
 24096,
 16357,
 2935,
 21939,
 2609,
 669,
 21182,
 670,
 2608,
 12228,
 11831,
 12224,


In [197]:
import pickle
with open('all_recipes_presence.pkl', 'wb') as f:
    pickle.dump(all_recipes_presence, f, protocol=pickle.HIGHEST_PROTOCOL)

In [198]:
import csv

with open("all_recipes_presence.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(all_recipes_presence)
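Because recipes contain different numbers of compounds, the CSV rows have different lengths. One way to read such a file back is row by row with `csv.reader`, converting each cell to an integer. The sketch below round-trips a toy dataset through a temporary file; the helper names and the toy rows are assumptions, and it presumes every cell is an integer ID as in the output above.

```python
import csv
import os
import tempfile

def write_rows(path, rows):
    """Write one recipe per line, mirroring the export cell above."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

def read_rows(path):
    """Read ragged rows of integer compound IDs back from a CSV file."""
    with open(path, newline="") as f:
        return [[int(cell) for cell in row] for row in csv.reader(f)]

toy_rows = [[2, 3, 11], [5], [20, 21, 23, 24]]  # hypothetical recipes
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_presence.csv")
    write_rows(toy_path, toy_rows)
    round_trip = read_rows(toy_path)
```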

### Frequency Counts

This section is included because users may want to limit dataset size. One way to do so is by source_id. For example, in order to shrink one's vocabulary for a machine learning task one could select only source_ids that appear at least N times. If this is not necessary for your use case disregard this section.

The outputs of this section are:

* __freq_dict_presence.pkl__, a dictionary of each source_id's frequency within the cleaned dataset


* __greaterthan10.txt, greaterthan20.txt, and greaterthan50.txt__, text files listing the source IDs that appear more than 10, 20, and 50 times, respectively, in the cleaned recipes.
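The vocabulary-limiting idea can be sketched with `collections.Counter`. This is an illustration under stated assumptions rather than the code used below: the function names and toy recipes are hypothetical, and the filter keeps IDs that appear at least `min_count` times, whereas the `greater_than_N` helper later in this section uses a strict "more than N" comparison.

```python
from collections import Counter

def count_compounds(recipes):
    """Count how many times each compound ID occurs across all recipes."""
    counts = Counter()
    for compounds in recipes:
        counts.update(compounds)
    return counts

def restrict_vocabulary(recipes, counts, min_count):
    """Drop compound IDs that occur fewer than min_count times."""
    keep = {cid for cid, n in counts.items() if n >= min_count}
    return [[cid for cid in compounds if cid in keep] for compounds in recipes]

toy_recipes = [[1, 2, 3], [1, 2], [1, 4]]  # hypothetical compound-ID lists
toy_counts = count_compounds(toy_recipes)
toy_filtered = restrict_vocabulary(toy_recipes, toy_counts, min_count=2)
```

Since each recipe's compound list is already de-duplicated, a compound's count equals the number of recipes that contain it.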

In [199]:
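# `content` is not defined in this notebook; it is assumed to be the Content.csv table
# loaded as a pandas DataFrame in an earlier session
source_ids = content.source_id.unique()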

In [200]:
print(len(all_recipes_presence))

176286


In [217]:
s = time.time()

freq_dict = {}

for s_id in source_ids:
    freq_dict[s_id] = 0

    
#loop through all recipes and count each compound's (source_id's) occurrences
for rec in all_recipes_presence:
    for compound_id in rec:
        freq_dict[compound_id] = freq_dict[compound_id] + 1
        
print(time.time() - s)

22.02142310142517


In [218]:
freq_dict

{1: 169169,
 2: 175988,
 3: 175981,
 4: 175020,
 5: 174930,
 6: 60520,
 7: 109299,
 8: 101350,
 9: 43319,
 10: 37456,
 11: 166135,
 12: 106922,
 13: 83073,
 14: 85695,
 15: 166365,
 16: 42609,
 17: 7292,
 18: 37491,
 19: 5856,
 20: 166368,
 21: 166316,
 22: 1106,
 23: 163400,
 24: 165014,
 25: 103503,
 26: 35712,
 27: 35788,
 28: 106161,
 29: 163950,
 30: 6090,
 31: 162987,
 32: 37622,
 33: 35712,
 34: 164081,
 35: 37795,
 36: 173690,
 37: 83045,
 38: 175009,
 21594: 175143,
 21595: 174358,
 1131: 172234,
 753: 173840,
 3513: 175973,
 13393: 175983,
 13831: 174939,
 14616: 175791,
 21596: 173358,
 12735: 161092,
 12814: 146725,
 565: 175014,
 12360: 174554,
 8417: 143711,
 12163: 175925,
 1014: 175922,
 574: 175282,
 8323: 175220,
 14513: 157251,
 14507: 175189,
 13267: 174142,
 1224: 175935,
 21468: 145567,
 3524: 175959,
 3522: 175957,
 3514: 175998,
 3519: 175871,
 3521: 175984,
 16258: 175990,
 3583: 175819,
 3730: 175863,
 3636: 159550,
 3637: 175708,
 3517: 163847,
 13403: 175230

In [219]:
#Save file to pkl

with open('freq_dict_presence.pkl', 'wb') as f:
    pickle.dump(freq_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

In [220]:
def less_than_N(N):
    '''
    Function that reports how many source_ids present in Content.csv
    appear at least once but less than N times in the food.com dataset.
    At the end we deduct the hard-coded 49,565 because that is the number of compounds that appear in 0 recipes.

    Return value is a string.
    '''
    ltN = 0
    for i in freq_dict.keys():
        if freq_dict[i] < N:
            ltN +=1
    return str(ltN - 49565) + " compounds appear less than " + str(N) + " times."

def greater_than_N_interface(N):
    '''
    Function that reports how many source_ids present in Content.csv
    appear more than N times in the food.com dataset.

    Return value is a string.
    '''
    gtN = 0
    for i in freq_dict.keys():
        if freq_dict[i] > N:
            gtN +=1
    return str(gtN) + " compounds appear greater than " + str(N) + " times."

In [221]:
def greater_than_N(N):
    '''
    Function that returns the source IDs that appear more than N times in the food.com dataset.
    
    Return value is a string with one source ID per line.
    '''
    gtN = ""
    #gtN = []
    for i in freq_dict.keys():
        if freq_dict[i] > N:
            gtN = gtN + str(i) + "\n"
            #gtN.append(i)
    return gtN 

In [222]:
print("",greater_than_N_interface(10), "\n",
      greater_than_N_interface(100), "\n",
      greater_than_N_interface(500), "\n",
      less_than_N(10), "\n",
      less_than_N(20), "\n",
      less_than_N(50))

 12822 compounds appear greater than 10 times. 
 9607 compounds appear greater than 100 times. 
 5810 compounds appear greater than 500 times. 
 2776 compounds appear less than 10 times. 
 4940 compounds appear less than 20 times. 
 5535 compounds appear less than 50 times.


In [223]:
#This output is used to limit scope of the model as a necessity of training. 
#It mitigates lag due to expansive vocabulary size.

print(greater_than_N(50))

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
21594
21595
1131
753
3513
13393
13831
14616
21596
12735
12814
565
12360
8417
12163
1014
574
8323
14513
14507
13267
1224
21468
3524
3522
3514
3519
3521
16258
3583
3730
3636
3637
3517
13403
13447
12531
12533
1145
1193
3716
1128
12065
13900
3337
12030
3011
2890
10035
11682
8052
2942
2928
21981
12636
11678
21631
2953
3772
4288
21632
12763
12465
2943
11875
3103
3004
13272
12400
1946
474
12686
12566
14708
446
12002
2250
465
2257
11859
556
12570
12538
484
570
12742
21633
21939
2100
455
4486
13719
12022
14537
12384
2431
2432
8425
710
9021
1936
12524
12365
13408
12706
29955
13514
2944
6299
2602
2597
1571
17191
19605
64
17158
17177
17155
17166
2603
17184
7050
17165
17172
2725
17180
17195
2614
17210
17211
17199
17209
17198
21879
17202
17208
40
17218
17216
55
17224
17217
17219
2714
12429
11787
17306
11788
2748
11795
21825
22001
11785
2753
1583
17311
17314
17315
17312
2756
1591
41
17321
17320
17

In [225]:
# Save files with IDs of compounds that appear more than 10, 20, and 50 times.
# Mode "w" overwrites the file, so re-running this cell does not append duplicate IDs.

for n in (10, 20, 50):
    with open("greaterthan" + str(n) + ".txt", "w") as f:
        f.write(greater_than_N(n))
