## Exercise for Chapter 5


Test your knowledge of association rules in a practical exercise. Mine association rules from a [supermarket dataset](https://data-science-crashkurs.de/exercises/data/store_data.csv). You can use the following code to read the data.

In [2]:
# each line of the file is one transaction: a comma-separated list of items
with open('data/store_data.csv') as f:
    records = []
    for line in f:
        records.append(line.strip().split(','))

In [9]:
records[:4]

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers', 'meatballs', 'eggs'],
 ['chutney'],
 ['turkey', 'avocado']]

In [3]:
# total number of transactions (orders)
len(records)

7501

In [7]:
# shortest and longest order, measured by number of items
min(records, key=len), max(records, key=len)

(['chutney'],
 ['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'])

### Finding Frequent Itemsets

Use the Apriori algorithm to find frequent itemsets. To do so, you also need to determine a suitable support threshold. Justify your choice.

In [45]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit_transform(records)

data_df = pd.DataFrame(te_ary, columns=te.columns_)
data_df.head()

frequent_itemsets = apriori(data_df, use_colnames=True, min_support=0.01)
frequent_itemsets

|     | support  | itemsets |
|-----|----------|----------|
| 0   | 0.020397 | (almonds) |
| 1   | 0.033329 | (avocado) |
| 2   | 0.010799 | (barbecue sauce) |
| 3   | 0.014265 | (black tea) |
| 4   | 0.011465 | (body spray) |
| ... | ...      | ... |
| 252 | 0.011065 | (ground beef, milk, mineral water) |
| 253 | 0.017064 | (spaghetti, mineral water, ground beef) |
| 254 | 0.015731 | (spaghetti, milk, mineral water) |
| 255 | 0.010265 | (olive oil, mineral water, spaghetti) |
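
One way to justify a support threshold is to look at how often individual items occur and to translate the relative threshold into an absolute transaction count. The following is only a minimal sketch and not part of the original solution: `item_supports` is a hypothetical helper, and `sample_records` is made-up data standing in for `records`.

```python
import math
from collections import Counter


def item_supports(transactions):
    """Return {item: fraction of transactions that contain the item}."""
    n = len(transactions)
    counts = Counter()
    for transaction in transactions:
        # use a set so that an item is counted at most once per transaction
        counts.update(set(transaction))
    return {item: count / n for item, count in counts.items()}


# made-up sample standing in for `records`
sample_records = [
    ['mineral water', 'eggs'],
    ['mineral water', 'spaghetti'],
    ['eggs'],
    ['mineral water', 'spaghetti', 'eggs'],
]

supports = item_supports(sample_records)
for item, value in sorted(supports.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {value:.2f}")

# absolute count needed to reach min_support=0.01 with 7501 transactions
required_count = math.ceil(0.01 * 7501)
print(required_count)
```

With 7501 transactions, a threshold of 0.01 means that an itemset has to appear in at least 76 orders. Whether that is enough evidence for a rule is the kind of argument the exercise asks for.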


### Creating Rules

Create rules from the frequent itemsets. Use lift and confidence to determine which rules are good.

In [46]:
from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, metric="confidence",
                  min_threshold=0.0, support_only=True)[['antecedents', 'consequents']]

|     | antecedents | consequents |
|-----|-------------|-------------|
| 0   | (mineral water) | (avocado) |
| 1   | (avocado) | (mineral water) |
| 2   | (burgers) | (cake) |
| 3   | (cake) | (burgers) |
| 4   | (burgers) | (chocolate) |
| ... | ... | ... |
| 427 | (spaghetti, pancakes) | (mineral water) |
| 428 | (mineral water, pancakes) | (spaghetti) |
| 429 | (spaghetti) | (mineral water, pancakes) |
| 430 | (mineral water) | (spaghetti, pancakes) |


In [49]:
(
    association_rules(frequent_itemsets, metric="confidence", min_threshold=0.0)
    .drop('conviction', axis=1)
    .reset_index(drop=True)
    .sort_values("confidence", ascending=False)
)

|     | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage |
|-----|-------------|-------------|--------------------|--------------------|---------|------------|------|----------|
| 366 | (ground beef, eggs) | (mineral water) | 0.019997 | 0.238368 | 0.010132 | 0.506667 | 2.125563 | 0.005365 |
| 402 | (ground beef, milk) | (mineral water) | 0.021997 | 0.238368 | 0.011065 | 0.503030 | 2.110308 | 0.005822 |
| 343 | (ground beef, chocolate) | (mineral water) | 0.023064 | 0.238368 | 0.010932 | 0.473988 | 1.988472 | 0.005434 |
| 391 | (milk, frozen vegetables) | (mineral water) | 0.023597 | 0.238368 | 0.011065 | 0.468927 | 1.967236 | 0.005440 |
| 297 | (soup) | (mineral water) | 0.050527 | 0.238368 | 0.023064 | 0.456464 | 1.914955 | 0.011020 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 290 | (mineral water) | (red wine) | 0.238368 | 0.028130 | 0.010932 | 0.045861 | 1.630358 | 0.004227 |
| 424 | (mineral water) | (olive oil, spaghetti) | 0.238368 | 0.022930 | 0.010265 | 0.043065 | 1.878079 | 0.004799 |
| 44  | (mineral water) | (cereals) | 0.238368 | 0.025730 | 0.010265 | 0.043065 | 1.673729 | 0.004132 |
| 371 | (mineral water) | (ground beef, eggs) | 0.238368 | 0.019997 | 0.010132 | 0.042506 | 2.125563 | 0.005365 |
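
To interpret the metric columns it can help to recompute them by hand. Confidence of a rule X -> Y is support(X and Y together) divided by support(X), and lift is that confidence divided by support(Y). The sketch below is an illustration under assumptions: `support` and `rule_metrics` are hypothetical helpers, and `sample_records` is made-up data rather than the supermarket dataset.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift of the rule antecedent -> consequent.

    Assumes both sides occur at least once; otherwise a division by zero occurs.
    """
    s_rule = support(set(antecedent) | set(consequent), transactions)
    s_antecedent = support(antecedent, transactions)
    s_consequent = support(consequent, transactions)
    confidence = s_rule / s_antecedent
    lift = confidence / s_consequent
    return {'support': s_rule, 'confidence': confidence, 'lift': lift}


# made-up sample standing in for `records`
sample_records = [
    ['mineral water', 'eggs', 'ground beef'],
    ['mineral water', 'eggs'],
    ['eggs', 'ground beef', 'mineral water'],
    ['spaghetti'],
    ['mineral water', 'spaghetti'],
]
metrics = rule_metrics(['ground beef', 'eggs'], ['mineral water'], sample_records)
print(metrics)

# recompute confidence and lift of rule 366 from its rounded support columns
confidence_row = 0.010132 / 0.019997
lift_row = confidence_row / 0.238368
print(round(confidence_row, 3), round(lift_row, 3))
```

The last two lines recover, up to rounding, the confidence and lift reported for rule 366. They also show why the reversed rule 371 has the same lift but a much lower confidence: lift is symmetric in X and Y, whereas confidence divides by the support of the antecedent only.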


### Validating the Rules

Randomly split the data into two datasets, each containing 50% of the transactions. Apply the Apriori algorithm to both datasets to derive rules. Compare the rules you find with each other and with the rules you found on the full data. What differences are there? What do the differences mean?
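
A possible starting point for this task is sketched below. It is not the reference solution: `split_in_half`, `frequent_pairs` and `compare` are hypothetical helpers, the seed is arbitrary, and `sample_records` is made-up data. To keep the sketch self-contained, `frequent_pairs` counts only 2-itemsets in plain Python as a stand-in for a full Apriori run. In the exercise itself, the same comparison could be applied to sets of `(antecedents, consequents)` pairs taken from the rule tables of each half.

```python
import random
from collections import Counter
from itertools import combinations


def split_in_half(transactions, seed=0):
    """Shuffle a copy of the transactions and cut it into two halves."""
    shuffled = list(transactions)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def frequent_pairs(transactions, min_support):
    """Frequent 2-itemsets only; a simplified stand-in for Apriori."""
    n = len(transactions)
    counts = Counter()
    for transaction in transactions:
        # sorted() makes ('a', 'b') and ('b', 'a') the same key
        counts.update(combinations(sorted(set(transaction)), 2))
    return {pair for pair, count in counts.items() if count / n >= min_support}


def compare(found_a, found_b):
    """Split two result sets into shared and exclusive parts."""
    return {
        'both': found_a & found_b,
        'only_a': found_a - found_b,
        'only_b': found_b - found_a,
    }


# made-up sample standing in for `records`
sample_records = [
    ['mineral water', 'eggs'],
    ['mineral water', 'spaghetti'],
    ['eggs', 'spaghetti', 'mineral water'],
    ['chutney'],
] * 5

half_a, half_b = split_in_half(sample_records, seed=42)
pairs_a = frequent_pairs(half_a, min_support=0.3)
pairs_b = frequent_pairs(half_b, min_support=0.3)
pairs_all = frequent_pairs(sample_records, min_support=0.3)

print(compare(pairs_a, pairs_b))
print(compare(pairs_a | pairs_b, pairs_all))
```

Itemsets or rules that appear in only one half tend to be those whose support lies close to the threshold, so they are the ones most likely to be artifacts of random variation.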