# [Chefkoch.de](http://www.chefkoch.de/) Matura paper 2017/18
------

## Objective: 
### Brief analysis of the ingredients of 316,755 recipes using the [APRIORI](https://en.wikipedia.org/wiki/Apriori_algorithm) algorithm (Part 3.1).

In [1]:
!pip install mlxtend



In [2]:
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import OnehotTransactions
import pandas as pd
import csv
import re

The ingredients are stored in the file *recipe_details_merged.csv*. The first step is to read them into a list.

In [3]:
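def get_list_of_ingredients():
    """Return the raw ingredient field of every CSV row (first row included)."""
    ingredient_fields = []
    chef_file = '/input/recipe_details_merged.csv'
    with open(chef_file, 'r', encoding='utf-8') as f:
        chefkoch = csv.reader(f)
        for row in chefkoch:
            try:
                # The ingredients are in the second-to-last column.
                ingredient_fields.append(row[-2])
            except IndexError:
                # The row has fewer than two columns.
                print('MISSED')
                continue
    return ingredient_fields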

In [4]:
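zutaten = []  # one list of ingredient names per recipe
for zutat in get_list_of_ingredients()[1:]:  # skip the first row (presumably the CSV header)
    zzz = zutat.split(',')
    z_liste = []
    for zz in zzz:
        z = zz.split('@')
        z_liste.append(z[-1])  # keep only the part after the last '@'
    zutaten.append(z_liste)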

The *zutaten* list now holds one list of ingredients per recipe.

In [5]:
print(zutaten[:4])

[['Cachaca', 'Zucker', 'Limette', 'Eis'], ['Wodka', 'Gin', 'Rum', 'Rum', 'Likör', 'Likör', 'Ananassaft', 'Grapefruitsaft', 'Zitronensaft', 'Grenadine', 'Guaranapulver', 'Pitahaya', 'Karambolenscheibe', 'Orange'], ['Tofu', 'Batate', 'Palmfett', 'Öl ', 'Zwiebel', 'Knoblauch', 'Tabasco', 'Erdnüsse', 'Zitrone', 'Tomatenmark', 'Sojasauce', 'Rohrzucker', 'Honig', 'Koriander'], ['Basilikum', 'Zitrone', 'Avocado', 'Schmand', 'Sahne', 'Salz und Pfeffer']]
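
To make the parsing step easier to follow, here is a minimal sketch of what the loop does to a single field. The helper name `parse_ingredient_field` and the sample string are hypothetical; the real layout of the CSV column is an assumption, only the split on `,` and `@` is taken from the cell above.

```python
def parse_ingredient_field(field):
    """Split one raw ingredient field into a list of ingredient names."""
    names = []
    for entry in field.split(','):
        # Keep only the part after the last '@', as the loop above does.
        names.append(entry.split('@')[-1])
    return names


# Hypothetical raw field; the actual format in the CSV may differ.
sample_field = '4 cl@Cachaca,2 TL@Zucker,1@Limette,@Eis'
parsed = parse_ingredient_field(sample_field)
print(parsed)
```

One consequence of this approach: an ingredient name that itself contains a comma is split into two entries.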


### In total, the recipes contain 3,248,846 ingredient entries

In [6]:
count = 0
for i in zutaten:
    count += len(i)
print(count)

3248846


In [7]:
seen = set()
uniq = []
duplicates = []
for x in zutaten:
    for zutat in x:
        if zutat not in seen:
            uniq.append(zutat)
            seen.add(zutat)
        else: duplicates.append(zutat)

### Of these 3,248,846 entries, 63,588 are distinct

In [8]:
print(len(uniq))
print(len(duplicates))

63588
3185258
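
The two numbers add up to the total (63,588 + 3,185,258 = 3,248,846). The same bookkeeping can be sketched more compactly with `collections.Counter`. The sample data below is made up for illustration and is not taken from the data set.

```python
from collections import Counter

# Made-up mini data set: one list of ingredient names per recipe.
sample = [['Salz', 'Pfeffer'], ['Salz', 'Zucker', 'Ei'], ['Rum', 'Rum']]

# Count how often each name occurs across all recipes.
counts = Counter(name for recipe in sample for name in recipe)

total = sum(counts.values())   # all ingredient entries
distinct = len(counts)         # different names
repeats = total - distinct     # entries that repeat an already seen name

print(total, distinct, repeats)
```

A side benefit of the `Counter` is that `counts.most_common(10)` would list the most frequent names directly.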


Next, everything except letters is removed from the ingredient names, and entries that start with "Salz und Pfeffer" are split into two separate ingredients.

In [9]:
# remove everything but letters
clean_zutaten = []
regex = re.compile(r'[^A-Za-zäöüßéàèêëïùâîûç]+', re.IGNORECASE)

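def get_n_statistic(zutat_name):
    """Clean all names, split entries starting with zutat_name at ' und ' and count them.

    Appends to the global clean_zutaten, so it should only be called once.
    """
    count_ = 0
    for zutatliste in zutaten:
        temp = []
        for zutat in zutatliste:
            # Replace every run of non-letters with a single space.
            new_name = regex.sub(' ', zutat)
            if new_name.strip().lower().startswith(zutat_name):
                count_ += 1
                # e.g. 'Salz und Pfeffer' -> 'Salz', 'Pfeffer'
                seperat = new_name.split(' und ')
                temp.append(seperat[0])
                temp.append(seperat[1])
                continue
            # Skip names that are empty or a single character after cleaning.
            if len(new_name) > 1: temp.append(new_name)
        clean_zutaten.append(temp)
    print('Count von {} ist {}'.format(zutat_name, count_)) # Count of {} is {}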

In [10]:
get_n_statistic('salz und pfeffer')

Count von salz und pfeffer ist 94786


In [11]:
len(clean_zutaten)

316755
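
The effect of the regular expression is easiest to see on a few strings. The sketch below reuses the pattern from the cell above; the input strings are made-up examples, and the helper name `clean_name` is hypothetical. Note that the cleaned names are not stripped, which is probably why entries such as `' EL'` (leading space) show up later.

```python
import re

# Same pattern as in the notebook: every run of non-letters becomes one space.
NON_LETTERS = re.compile(r'[^A-Za-zäöüßéàèêëïùâîûç]+', re.IGNORECASE)


def clean_name(raw):
    """Replace digits, punctuation and whitespace runs with a single space."""
    return NON_LETTERS.sub(' ', raw)


examples = ['Öl ', '2 EL', 'Pilze (Waldpilze)', 'Salz und Pfeffer']
cleaned = [clean_name(e) for e in examples]
print(cleaned)

# The combined entry is then split into its two ingredients.
split_pair = clean_name('Salz und Pfeffer').split(' und ')
print(split_pair)
```

Calling `.strip()` on the cleaned name would remove the leading and trailing spaces and merge variants such as `'Öl'` and `'Öl '`; the notebook does not do this, so the results shown here keep them apart.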

The Apriori algorithm simply needs too much RAM (>20 GB) for the full data set, so I use a subset of 80,000 recipes (roughly a quarter of all recipes). EDIT: The script was recalculated in the cloud on more powerful hardware with 56 GB RAM.

In [12]:
sub_clean_zutaten = clean_zutaten[80000:160000] # 80'000 subset
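
A rough back-of-the-envelope estimate shows why memory becomes a problem. The sketch assumes a dense one-hot matrix with one byte per cell and uses the 63,588 distinct names counted before cleaning as the number of columns; both are assumptions, since the column count after cleaning and the array's data type are not stated here. The helper `dense_matrix_gb` is hypothetical.

```python
def dense_matrix_gb(rows, cols, bytes_per_cell=1):
    """Size of a dense rows x cols matrix in gigabytes (10**9 bytes)."""
    return rows * cols * bytes_per_cell / 1e9


n_recipes = 316_755    # all recipes
n_subset = 80_000      # recipes in the subset
n_columns = 63_588     # assumption: distinct names before cleaning

full_gb = dense_matrix_gb(n_recipes, n_columns)
subset_gb = dense_matrix_gb(n_subset, n_columns)
print(round(full_gb, 1), round(subset_gb, 1))  # → 20.1 5.1
```

With a wider integer type the matrix would be several times larger, and Apriori needs additional working memory on top of it, so this is only an order-of-magnitude figure.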

### Remove empty entries

In [15]:
all_clean_zutaten = sorted(sub_clean_zutaten)[:]

In [16]:
all_clean_zutaten

[[' EL',
  'Butter',
  'Zwiebel',
  'Pilze Waldpilze',
  'Kochschinken',
  'Sahne',
  'Pulver ',
  'Salz',
  'Pfeffer',
  'Petersilie'],
 [' EL', 'Haferflocken', 'Milch', 'Salz', 'Zucker'],
 [' EL', 'Honig', 'Wein', 'Kartoffel ', 'Öl', 'Butter', 'Salz', 'Pfeffer'],
 [' EL',
 ...]

The ingredient lists are converted into a one-hot encoded pandas DataFrame.

In [17]:
%%time
oht = OnehotTransactions()
oht_ary = oht.fit(all_clean_zutaten).transform(all_clean_zutaten)
df = pd.DataFrame(oht_ary, columns=oht.columns_)

CPU times: user 19min 42s, sys: 6.09 s, total: 19min 48s
Wall time: 19min 46s
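
What one-hot encoding means here can be sketched in plain Python on a toy example: every distinct ingredient becomes a column, every recipe a row, and a cell is 1 if the recipe contains the ingredient. The three recipes are made up, and this is only an illustration of the idea, not the mlxtend implementation.

```python
# Made-up recipes for illustration.
transactions = [
    ['Salz', 'Pfeffer', 'Zwiebel'],
    ['Ei', 'Mehl', 'Zucker'],
    ['Salz', 'Ei'],
]

# One column per distinct ingredient, in sorted order.
columns = sorted({item for recipe in transactions for item in recipe})

# One row per recipe: 1 if the ingredient is present, otherwise 0.
rows = [[1 if col in set(recipe) else 0 for col in columns]
        for recipe in transactions]

print(columns)
for row in rows:
    print(row)
```

To my knowledge, later mlxtend releases provide this step as `TransactionEncoder`; `OnehotTransactions` is the older name used in this notebook.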


In [19]:
df.drop(df.columns[[0, 1]], axis=1, inplace=True) # drop EL and TL
df.head(10)

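| | Aal | Aal.1 | Aal Filet | Aal oder Räucherforelle | Aceto balsamico | Aceto balsamico.1 | Aceto balsamico bianco | Aceto balsamico di Modena | Aceto balsamico oder Balsamico rosso | Aceto balsamico oder Granatapfel bzw Himbeeressig | ... | Öl zum Frittieren oder Butterschmalz | Öl zum Frittieren oder kg Butterschmalz oder Palmin | Öl zum Grillen | Öl zum Herausbacken | Öl zum Konservieren | Öl zum Marinieren | Öl zum Rösten | Öl zum Würzen | Öl zumFrittieren | Überraschungsei |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |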


Apriori returns a DataFrame that shows which ingredients occur in combination with other ingredients and how often overall. The *support* column is the share of recipes in the subset that contain all ingredients of an itemset. For example:
- Salt occurs in 60.1 percent of the recipes.
- 40.2 percent of the recipes contain salt **and** pepper.
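
The support value behind these percentages is simply the number of recipes that contain every ingredient of an itemset, divided by the number of recipes. A minimal sketch on made-up data (the function name `support` and the five recipes are illustrative only):

```python
def support(itemset, transactions):
    """Share of transactions that contain every item of itemset."""
    needed = set(itemset)
    hits = sum(1 for t in transactions if needed <= set(t))
    return hits / len(transactions)


# Made-up recipes for illustration.
recipes = [
    ['Salz', 'Pfeffer', 'Zwiebel'],
    ['Salz', 'Pfeffer'],
    ['Ei', 'Mehl', 'Zucker'],
    ['Ei', 'Mehl', 'Zucker', 'Butter'],
    ['Salz', 'Ei'],
]

print(support(['Salz'], recipes))             # → 0.6
print(support(['Salz', 'Pfeffer'], recipes))  # → 0.4
```

Adding an ingredient to an itemset can never raise its support, which is why the value for salt and pepper together is at most the value for salt alone.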

In [20]:
%%time
frequent_ingr = apriori(df, min_support=0.04, use_colnames=True)
frequent_ingr['length'] = frequent_ingr['itemsets'].apply(lambda x: len(x))

CPU times: user 2min 39s, sys: 52 ms, total: 2min 39s
Wall time: 2min 39s
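
How `min_support` limits the search can be illustrated with a small, unoptimised version of the Apriori idea: itemsets are built level by level, and a candidate of size k is only counted if all of its subsets of size k-1 were already frequent. This is a sketch for understanding on made-up data, not the mlxtend implementation; `apriori_sketch` is a hypothetical name.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for b in baskets for i in b}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count step: keep candidates that reach the minimum support.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Made-up recipes for illustration.
recipes = [
    ['Salz', 'Pfeffer', 'Zwiebel'],
    ['Salz', 'Pfeffer'],
    ['Ei', 'Mehl', 'Zucker'],
    ['Ei', 'Mehl', 'Zucker', 'Butter'],
    ['Salz', 'Ei'],
]
result = apriori_sketch(recipes, min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

A lower `min_support` lets more itemsets survive each level, so both running time and memory use grow quickly; this is the trade-off behind the threshold of 0.04 used above.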


In [21]:
frequent_ingr.sort_values(by='support', ascending=False)

| | support | itemsets | length |
|---|---|---|---|
| 24 | 0.601187 | [Salz] | 1 |
| 21 | 0.432675 | [Pfeffer] | 1 |
| 107 | 0.402075 | [Pfeffer, Salz] | 2 |
| 35 | 0.357837 | [Zucker] | 1 |
| 4 | 0.301750 | [Ei] | 1 |
| 13 | 0.273625 | [Mehl] | 1 |
| 2 | 0.272175 | [Butter] | 1 |
| 36 | 0.269962 | [Zwiebel] | 1 |
| 122 | 0.227850 | [Salz, Zwiebel] | 2 |
| 111 | 0.205400 | [Pfeffer, Zwiebel] | 2 |


## Which pairs of ingredients occur most often?

In [22]:
frequent_ingr[(frequent_ingr['length'] == 2) & (frequent_ingr['support'] >= 0.125)].sort_values(by='support', ascending=False)

| | support | itemsets | length |
|---|---|---|---|
| 107 | 0.402075 | [Pfeffer, Salz] | 2 |
| 122 | 0.22785 | [Salz, Zwiebel] | 2 |
| 111 | 0.2054 | [Pfeffer, Zwiebel] | 2 |
| 55 | 0.185262 | [Ei, Mehl] | 2 |
| 62 | 0.177787 | [Ei, Zucker] | 2 |
| 86 | 0.177575 | [Mehl, Zucker] | 2 |
| 60 | 0.1693 | [Ei, Salz] | 2 |
| 51 | 0.166475 | [Butter, Salz] | 2 |
| 121 | 0.164038 | [Salz, Zucker] | 2 |
| 83 | 0.155725 | [Mehl, Salz] | 2 |


## Which triplets of ingredients occur most often?

In [23]:
frequent_ingr[(frequent_ingr['length'] == 3) & (frequent_ingr['support'] >= 0.08)].sort_values(by='support', ascending=False)

| | support | itemsets | length |
|---|---|---|---|
| 198 | 0.194013 | [Pfeffer, Salz, Zwiebel] | 3 |
| 162 | 0.139237 | [Ei, Mehl, Zucker] | 3 |
| 141 | 0.106625 | [Backpulver, Mehl, Zucker] | 3 |
| 138 | 0.106338 | [Backpulver, Ei, Zucker] | 3 |
| 135 | 0.105288 | [Backpulver, Ei, Mehl] | 3 |
| 151 | 0.105038 | [Butter, Mehl, Zucker] | 3 |
| 144 | 0.103663 | [Butter, Ei, Mehl] | 3 |
| 199 | 0.10005 | [Pfeffer, Salz, Öl] | 3 |
| 147 | 0.09835 | [Butter, Ei, Zucker] | 3 |
| 160 | 0.092575 | [Ei, Mehl, Salz] | 3 |


## Quadruplets?

In [24]:
frequent_ingr[(frequent_ingr['length'] == 4) & (frequent_ingr['support'] >= 0.04)].sort_values(by='support', ascending=False)

| | support | itemsets | length |
|---|---|---|---|
| 208 | 0.095275 | [Backpulver, Ei, Mehl, Zucker] | 4 |
| 212 | 0.083488 | [Butter, Ei, Mehl, Zucker] | 4 |
| 206 | 0.059638 | [Backpulver, Butter, Mehl, Zucker] | 4 |
| 205 | 0.059112 | [Backpulver, Butter, Ei, Zucker] | 4 |
| 204 | 0.058463 | [Backpulver, Butter, Ei, Mehl] | 4 |
| 217 | 0.057225 | [Ei, Mehl, Salz, Zucker] | 4 |
| 222 | 0.05625 | [Pfeffer, Salz, Zwiebel, Öl] | 4 |
| 218 | 0.053475 | [Ei, Mehl, Vanillezucker, Zucker] | 4 |
| 211 | 0.052625 | [Butter, Ei, Mehl, Salz] | 4 |
| 219 | 0.051025 | [Knoblauchzehe, Pfeffer, Salz, Zwiebel] | 4 |


## Quintuplets?

In [25]:
frequent_ingr[(frequent_ingr['length'] == 5) & (frequent_ingr['support'] >= 0.04)].sort_values(by='support', ascending=False)

| | support | itemsets | length |
|---|---|---|---|
| 223 | 0.053937 | [Backpulver, Butter, Ei, Mehl, Zucker] | 5 |
| 224 | 0.042037 | [Backpulver, Ei, Mehl, Vanillezucker, Zucker] | 5 |
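
The reported numbers can be sanity-checked with the property that Apriori relies on: an itemset can never be more frequent than any of its subsets. The sketch below copies the support values of the top quintuplet and its five four-ingredient subsets from the tables above and looks for violations; the variable names are hypothetical.

```python
from itertools import combinations

# Support values copied from the quadruplet and quintuplet tables above.
reported = {
    frozenset(['Backpulver', 'Butter', 'Ei', 'Mehl', 'Zucker']): 0.053937,
    frozenset(['Backpulver', 'Ei', 'Mehl', 'Zucker']): 0.095275,
    frozenset(['Butter', 'Ei', 'Mehl', 'Zucker']): 0.083488,
    frozenset(['Backpulver', 'Butter', 'Mehl', 'Zucker']): 0.059638,
    frozenset(['Backpulver', 'Butter', 'Ei', 'Zucker']): 0.059112,
    frozenset(['Backpulver', 'Butter', 'Ei', 'Mehl']): 0.058463,
}

quintuplet = frozenset(['Backpulver', 'Butter', 'Ei', 'Mehl', 'Zucker'])

# Every 4-subset must be at least as frequent as the full quintuplet.
violations = [
    sorted(subset)
    for subset in (frozenset(c) for c in combinations(quintuplet, 4))
    if reported[subset] < reported[quintuplet]
]
print(violations)  # → []
```

An empty list means the tables are consistent with this property for the baking-related itemsets.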
