Common Keys:
1) 'split' - which dataset split the recipe belongs to
2) 'id' - a unique identifier for each recipe
3) 'verb' - a dictionary where each key is a step number in the recipe (e.g., '0' is the first step). The value is the list of verbs that occur in that step. If no verbs occur, the list should include "<NO_CHANGE>"
4) 'ingredient_list' - a list of plain-text ingredient names, one entry per ingredient
5) 'ingredients' - a dictionary where each key is a step number in the recipe. Steps without ingredients don't have a key. The value is a list of indices for the ingredients that occur in that step (the indices correspond to their positions in ingredient_list)
6) 'events' - a dictionary where each key is a step number in the recipe. Steps without an event don't have a key. The value for each key is a second dictionary, in which each key is a state change type (e.g., composition, cookedness) and each value is the end state for that state change type. If a state change type does not occur in a step, the second dictionary has no key for it.
7) 'text' - a dictionary where each key is a step number in the recipe. Each value is a list of tokens for this step.
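
To make the key layout concrete, here is a minimal sketch of one recipe record and of resolving the indices in 'ingredients' against 'ingredient_list'. Every value in the record is invented for illustration. The helper name ingredients_by_step is hypothetical, and the sketch assumes that step keys are strings, that ingredient indices are integers, and that end states are plain strings; the actual files may encode these differently.

```python
# Hypothetical record: all values are invented for illustration only.
recipe = {
    "split": "dev",
    "id": "example-0001",
    "verb": {"0": ["chop"], "1": ["add", "stir"], "2": ["<NO_CHANGE>"]},
    "ingredient_list": ["onion", "butter", "salt"],
    # Step "2" has no ingredients, so it has no key here.
    "ingredients": {"0": [0], "1": [0, 1, 2]},
    # Assumption: end states are stored as plain strings.
    "events": {"0": {"composition": "chopped"}, "1": {"cookedness": "cooked"}},
    "text": {
        "0": ["chop", "the", "onion", "."],
        "1": ["add", "butter", "and", "salt", ";", "stir", "."],
        "2": ["about", "10", "minutes", "."],
    },
}


def ingredients_by_step(recipe):
    """Map each step key to ingredient names, resolving indices via ingredient_list."""
    names = recipe["ingredient_list"]
    # Steps without ingredients have no key, so fall back to an empty list.
    return {
        step: [names[i] for i in recipe["ingredients"].get(step, [])]
        for step in recipe["text"]
    }


resolved = ingredients_by_step(recipe)
print(resolved)
```

Iterating over the keys of 'text' rather than 'ingredients' keeps every step in the result, including steps that mention no ingredient.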

Recipes in the "train" split have two additional keys, for use if needed:
1) 'ing_type' - a dictionary where each key is a step number in the recipe. The value is a number indicating the type of ingredient annotation for that step (1 = mention annotation, 2 = coref annotation, 3 = no ingredients to be selected)
2) 'ingredients_nocoref' - the ingredients at each step without the coref rules to augment training data
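
As a rough sketch of how the 'ing_type' codes could be read, the snippet below turns each code into its label from the description above. The helper name describe_ing_types and the sample record are hypothetical, and the snippet assumes the codes may arrive either as integers or as numeric strings.

```python
# Labels copied from the key description; everything else is illustrative.
ING_TYPE_LABELS = {
    1: "mention annotation",
    2: "coref annotation",
    3: "no ingredients to be selected",
}


def describe_ing_types(recipe):
    """Return {step: label} for a recipe, or {} when 'ing_type' is absent."""
    # Only "train" recipes carry this key, so use .get with an empty default.
    ing_type = recipe.get("ing_type", {})
    return {step: ING_TYPE_LABELS[int(code)] for step, code in ing_type.items()}


# Hypothetical records, invented for illustration.
train_recipe = {"split": "train", "ing_type": {"0": 1, "1": 2, "2": 3}}
dev_recipe = {"split": "dev"}

labels = describe_ing_types(train_recipe)
print(labels)
print(describe_ing_types(dev_recipe))
```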

In [1]:
import argparse
import glob
import json
import numpy as np
import os
import pandas as pd
from tqdm import tqdm

from collections import defaultdict
from datetime import datetime

In [2]:
path = '/nfs/research/regan/src/data/cooking_dataset/recipes/'

In [6]:
def analyze_recipes(path):
    """Print summary statistics for the 'test' and 'dev' recipe files under `path`."""

    cnt_files_test = 0
    cnt_files_dev = 0
    
    all_cnt_utts = []
    all_cnt_tokens = []
    
    length_coref_chains_per_file = {}

    all_recipes_noun_counts = []
    all_vocab_verbs = []


    # Match files by full path instead of changing the working directory.
    for file in tqdm(glob.glob(os.path.join(path, "*.json"))):

        length_utts = 0
        cnt_utts = 0

        # used as proxy for fd-entities
        cnt_vocab_nouns_from_ingredients = defaultdict(int)
        cnt_vocab_verbs_from_verbs = 0  

        with open(file, 'r') as f:

            data = json.load(f)

            # process length of all utts in 'text'
            
            if data['split'] in ['test', 'dev']:
                text = data['text']
                
                if data['split']=='test':
                    cnt_files_test += 1
                    
                elif data['split']=='dev':
                    cnt_files_dev += 1

                for step, utt in text.items():

                    length_utts += len(utt)
                    cnt_utts += 1

                all_cnt_tokens.append(length_utts)
                all_cnt_utts.append(cnt_utts)

                ingredients = data['ingredients']

                for step, arr in ingredients.items():

                    for ingred in arr:
                        cnt_vocab_nouns_from_ingredients[ingred] += 1

                all_recipes_noun_counts.append(cnt_vocab_nouns_from_ingredients)

                verbs = data['verb']

                for utt_idx, verb_arr in verbs.items():
                    cnt_vocab_verbs_from_verbs += len(verb_arr)

                all_vocab_verbs.append(cnt_vocab_verbs_from_verbs)


    print("Number of recipe files in test:", cnt_files_test)
    print("Number of recipe files in dev:", cnt_files_dev)
    print("Mean number of tokens per file:", np.mean(all_cnt_tokens))
    print("Mean number of utts per file:", np.mean(all_cnt_utts))

    length_coref_all_fd_entities = []
    cnt_number_unique_fd_entities_per_recipe = []

    cnt_number_total_fd_entities_per_recipe = []

    for recipe_noun_count in all_recipes_noun_counts:

        cnt_number_unique_fd_entities_per_recipe.append(len(recipe_noun_count))
        
        # each recipe is a defaultdict mapping an entity (ingredient index) to its mention count

        this_recipe_coref = []
        total_count_fd_entities = 0
        for ingred, count in recipe_noun_count.items():
            this_recipe_coref.append(count)

            total_count_fd_entities += count

        cnt_number_total_fd_entities_per_recipe.append(total_count_fd_entities)

        if len(this_recipe_coref) > 0:
            length_coref_all_fd_entities.append(np.mean(this_recipe_coref))

    print("Mean number of unique fd-entities per file:", np.mean(cnt_number_unique_fd_entities_per_recipe))
    print("Mean number of total fd-entities per file:", np.mean(cnt_number_total_fd_entities_per_recipe))
    print("Mean length coref chain per file:", np.mean(length_coref_all_fd_entities))
    print("Mean number of verbs per file:", np.mean(all_vocab_verbs))

    # print("Number of total nouns:", len(all_nouns))
    # print("Mean count of all entities per file:", np.mean(cnt_all_entities_per_file))

In [7]:
analyze_recipes(path)

100%|██████████| 111229/111229 [00:35<00:00, 3163.96it/s]

Number of recipe files in test: 629
Number of recipe files in dev: 162
Mean number of tokens per file: 92.69785082174462
Mean number of utts per file: 8.833122629582807
Mean number of unique fd-entities per file: 8.438685208596713
Mean number of total fd-entities per file: 38.78128950695322
Mean length coref chain per file: 4.435615955869103
Mean number of verbs per file: 11.663716814159292
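
Note that the reported mean chain length (about 4.44) is not the ratio of the two means above it (38.78 / 8.44 is about 4.60). One plausible reading is that the function averages each recipe's own mean chain length and skips recipes with no ingredients, which is a mean of ratios rather than a ratio of means. The toy example below illustrates the difference; the counts are invented, and statistics.mean stands in for np.mean so the sketch needs only the standard library.

```python
from statistics import mean

# Hypothetical per-recipe mention counts, keyed by ingredient index.
toy_recipes = [
    {0: 1, 1: 1, 2: 1, 3: 1},  # four ingredients, one mention each
    {0: 6, 1: 2},              # two ingredients, eight mentions in total
    {},                        # no ingredients: skipped by the chain-length mean
]

unique_per_recipe = [len(r) for r in toy_recipes]
total_per_recipe = [sum(r.values()) for r in toy_recipes]
# Per-recipe mean chain length, computed only for non-empty recipes.
chain_per_recipe = [mean(r.values()) for r in toy_recipes if r]

mean_chain = mean(chain_per_recipe)
ratio_of_means = mean(total_per_recipe) / mean(unique_per_recipe)

print("Mean of per-recipe chain lengths:", mean_chain)
print("Ratio of mean total to mean unique:", ratio_of_means)
```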



