In [1]:
from pathlib import Path
import pandas as pd
import itertools as its

Each allergen is found in exactly one ingredient. Each ingredient contains zero or one allergen. Allergens aren't always marked; when they're listed (as in `(contains nuts, shellfish)` after an ingredients list), the ingredient that contains each listed allergen will be somewhere in the corresponding ingredients list. However, even if an allergen isn't listed, the ingredient that contains that allergen could still be present: maybe they forgot to label it, or maybe it was labeled in a language you don't know.

In [2]:
problem_ingreds = Path("ingredients.txt").read_text()
test_ingreds = """mxmxvkd kfcds sqjhc nhms (contains dairy, fish)
trh fvjkl sbzzf mxmxvkd (contains dairy)
sqjhc fvjkl (contains soy)
sqjhc mxmxvkd sbzzf (contains fish)"""

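# Long-format table: one row per (food, allergen, ingredient) combination
items = dict(food_id=[], allergen=[], ingred=[])

# Swap in test_ingreds here to run against the worked example
load_these_ingreds = problem_ingreds
for food_id, line in enumerate(load_these_ingreds.splitlines()):
    # Assumes every line carries a "(contains ...)" suffix
    contents, warning = line.split('(contains')

    content_words = contents.split()
    # Drop the closing parenthesis and the commas, leaving bare allergen names
    allergy_words = warning.strip().rstrip(')').replace(",", "").split()

    # Pair every ingredient in this food with every allergen it lists
    for c, a in its.product(content_words, allergy_words):
        items['food_id'].append(food_id)
        items['allergen'].append(a)
        items['ingred'].append(c)
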
df = pd.DataFrame(items)
df.head()

|   | food_id | allergen | ingred |
|---|---------|----------|--------|
| 0 | 0 | dairy | cdblnb |
| 1 | 0 | sesame | cdblnb |
| 2 | 0 | dairy | txts |
| 3 | 0 | sesame | txts |
| 4 | 0 | dairy | scljtv |


# Problem 1

Test data
```
mxmxvkd kfcds sqjhc nhms (contains dairy, fish)
trh fvjkl sbzzf mxmxvkd (contains dairy)
sqjhc fvjkl (contains soy)
sqjhc mxmxvkd sbzzf (contains fish)
```

The first food in the list has four ingredients (written in a language you don't understand): mxmxvkd, kfcds, sqjhc, and nhms. While the food might contain other allergens, a few allergens the food definitely contains are listed afterward: dairy and fish.
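As a quick check of that reading, here is a minimal sketch that splits one label line into its ingredients and its listed allergens. The helper name `parse_food` is hypothetical and not part of the notebook; it assumes the `(contains ...)` suffix is optional and comma-separated.

```python
def parse_food(line):
    """Split one label line into (ingredients, allergens).

    Hypothetical helper: assumes allergens, if present, follow the
    ingredients inside a trailing "(contains a, b, ...)" group.
    """
    contents, _, warning = line.partition("(contains")
    ingredients = contents.split()
    # Strip the closing parenthesis, then split the comma-separated names
    allergens = [a.strip() for a in warning.rstrip(") ").split(",") if a.strip()]
    return ingredients, allergens


ingredients, allergens = parse_food("mxmxvkd kfcds sqjhc nhms (contains dairy, fish)")
print(ingredients)  # ['mxmxvkd', 'kfcds', 'sqjhc', 'nhms']
print(allergens)    # ['dairy', 'fish']
```

Using `str.partition` rather than `str.split` means a line with no allergen warning yields an empty allergen list instead of raising.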

The first step is to determine which ingredients can't possibly contain any of the allergens in any food in your list. In the above example, none of the ingredients kfcds, nhms, sbzzf, or trh can contain an allergen. Counting the number of times any of these ingredients appear in any ingredients list produces 5: they all appear once each except sbzzf, which appears twice.
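The count of 5 can be reproduced on the test data with plain Python sets. This is a sketch of one way to do it, not the DataFrame approach used in this notebook: for each allergen, intersect the ingredient sets of every food that lists it, then treat any ingredient that survives in no intersection as safe. The names `foods`, `candidates`, and `suspects` are illustrative.

```python
from collections import Counter

# The four test foods as (ingredients, listed allergens)
foods = [
    ({"mxmxvkd", "kfcds", "sqjhc", "nhms"}, {"dairy", "fish"}),
    ({"trh", "fvjkl", "sbzzf", "mxmxvkd"}, {"dairy"}),
    ({"sqjhc", "fvjkl"}, {"soy"}),
    ({"sqjhc", "mxmxvkd", "sbzzf"}, {"fish"}),
]

# An allergen's carrier must appear in every food that lists the allergen,
# so its candidates are the intersection of those foods' ingredient sets.
candidates = {}
for ingredients, allergens in foods:
    for allergen in allergens:
        if allergen in candidates:
            candidates[allergen] &= ingredients
        else:
            candidates[allergen] = set(ingredients)

# Any ingredient that is a candidate for at least one allergen
suspects = set().union(*candidates.values())

# Count how often each ingredient appears across all foods
appearances = Counter(i for ingredients, _ in foods for i in ingredients)
safe = sorted(set(appearances) - suspects)
safe_count = sum(appearances[i] for i in safe)

print(safe)        # ['kfcds', 'nhms', 'sbzzf', 'trh']
print(safe_count)  # 5
```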

In [3]:
allergy_match = {}

done = False
while not done:
    
    done = True
    # make lists of matched and unmatched allergens
    matched = list(allergy_match.keys())
    unmatched = [a for a in df.allergen.unique() if a not in matched ]
    
    # stop considering ingredients we've already matched
    matched_ingreds = list(allergy_match.values())
    search_df = df[~df.ingred.isin(matched_ingreds)]
    
    for allergen in unmatched:
        # grab the rows for the given allergen
        aldf = search_df[search_df.allergen == allergen]
        
        # count how many foods list this allergen
        how_many_foods = aldf.food_id.nunique()

        # for each unmatched ingredient, count the foods listing this
        # allergen in which the ingredient appears
        ingred_counts = aldf.groupby('ingred').food_id.nunique()

        # an ingredient present in every food that lists the allergen is a
        # candidate; if exactly one candidate remains, that's a match
        match = ingred_counts[ingred_counts == how_many_foods]
        if len(match) == 1:
            done = False
            allergy_match[allergen] = match.index[0]
        

In [4]:
# sanity check: every allergen was matched to exactly one ingredient
len(allergy_match) == df.allergen.nunique()

True

In [5]:
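# Problem 1 answer: count the distinct (food, ingredient) pairs whose
# ingredient was not matched to any allergen
safe = df[~df.ingred.isin(list(allergy_match.values()))]
safe[['food_id', 'ingred']].drop_duplicates().shape[0]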

1945

# Problem 2

Arrange the ingredients alphabetically by their allergen and separate them by commas to produce your canonical dangerous ingredient list.

In [6]:
",".join([allergy_match[key] for key in sorted(allergy_match.keys())])

'pgnpx,srmsh,ksdgk,dskjpq,nvbrx,khqsk,zbkbgp,xzb'
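For comparison, the same ordering rule can be checked on the test data. The sketch below starts from the candidate sets obtained by intersecting the foods that list each allergen (hard-coded here), resolves them by elimination, and joins the ingredients in alphabetical order of their allergens. It assumes that at every step at least one allergen has a single remaining candidate, which holds for the test data but is not guaranteed for arbitrary input.

```python
# Candidate ingredients per allergen for the test data, i.e. the
# intersection of the ingredient lists of the foods listing each allergen
candidates = {
    "dairy": {"mxmxvkd"},
    "fish": {"mxmxvkd", "sqjhc"},
    "soy": {"sqjhc", "fvjkl"},
}

resolved = {}
remaining = {allergen: set(c) for allergen, c in candidates.items()}
while remaining:
    # Pick an allergen with exactly one candidate left
    # (assumption: one always exists; next() raises StopIteration otherwise)
    allergen = next(a for a, c in remaining.items() if len(c) == 1)
    (ingredient,) = remaining.pop(allergen)
    resolved[allergen] = ingredient
    # That ingredient can no longer carry any other allergen
    for c in remaining.values():
        c.discard(ingredient)

# Sort by allergen name, then emit the matching ingredients
canonical = ",".join(resolved[a] for a in sorted(resolved))
print(canonical)  # mxmxvkd,sqjhc,fvjkl
```

Note that the sort key is the allergen, not the ingredient, which is why the resulting ingredient list is not itself in alphabetical order.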