# Preface

In [5]:
# IMPORTS
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 150 # helpful for itemsets
from mlxtend.frequent_patterns import apriori, association_rules
from collections import Counter

# DATA IMPORT
dt_basket = pd.read_excel("../data/ShoppingBaskets.xls", index_col=0)

# <span style="color:red;">TO BE REFACTORED</span>

In [218]:
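def get_associated_items(dt, item):
    """Count how often every other item occurs in the itemsets of `dt` (a Series of itemsets)."""
    # Use a separate loop variable so the `item` parameter is not shadowed
    itemcount_dict = Counter(i for itemset in dt.values for i in itemset)
    itemcount_dict.pop(item, None)  # drop the queried item itself; no error if it is absent
    return itemcount_dict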

def filter_frequent_itemsets(df_itemsets, items, exact=False):
    # Not implemented yet; `exact` is a boolean flag (the string "false" would be truthy)
    pass

def filter_rules_by_items(df_rules, items, criteria="all"):
    """Return the rules whose selected side(s) consist of exactly the given items.

    criteria: "all" (antecedent and consequent combined), "antecedants" or "consequents".
    """
    if criteria == "all":
        itemsets = [frozenset(a) | frozenset(c)
                    for a, c in zip(df_rules["antecedants"], df_rules["consequents"])]
    elif criteria in ("antecedants", "consequents"):
        itemsets = [frozenset(s) for s in df_rules[criteria]]
    else:
        raise ValueError("Criteria must be 'all', 'antecedants' or 'consequents', got {!r}"
                         .format(criteria))

    # Keep only the rules whose itemset equals the requested items exactly
    target = frozenset(items)
    mask = np.array([itemset == target for itemset in itemsets], dtype=bool)
    return df_rules[mask]

---
# Task 4.1: Shopping Basket – Frequent Itemsets
We want to analyse the buying behavior of our customers. The Shopping Baskets dataset describes the content of ten baskets, which are identified by the BasketNo attribute. The ten transactions altogether contain ten different items (products), and the corresponding attribute in the dataset states whether or not an item is included in a specific basket. Import the data into RapidMiner using the Read Excel operator. Please ensure that the attribute types and roles are set correctly. As a first task, we want to mine frequent itemsets using the FP-Growth operator with the parameter support set to 0.2 and the parameter positive value set to 1. Which items are usually bought together with the laptop (ThinkPad X220), the netbook (Asus EeePC) and the printer (HP Laserjet P2055)?

<span style="color:#AAAAAA;">Comment: Since FP-Growth is not easily & reliably available, APRIORI is used with a minsup of 0.2</span>

In [3]:
# Generate the frequent itemset
dt_freq_basket = apriori(dt_basket, min_support=0.2, use_colnames=True)

# Filter by ThinkPad X220
mask_tp220 = dt_freq_basket["itemsets"].map(lambda iset: "ThinkPad X220 " in iset) #! last char is [space]
dt_freq_basket_tp220 = dt_freq_basket[mask_tp220]
print("ThinkPad X220: \n{}\n\n".format(get_associated_items(dt_freq_basket_tp220["itemsets"], "ThinkPad X220 ")))

# Filter by Asus EeePC
mask_aepc = dt_freq_basket["itemsets"].map(lambda iset: "Asus EeePC" in iset)
dt_freq_basket_aepc = dt_freq_basket[mask_aepc]
print("Asus EeePC: \n{}\n\n".format(get_associated_items(dt_freq_basket_aepc["itemsets"], "Asus EeePC")))

# Filter by HP Laserjet P2055
mask_hplj = dt_freq_basket["itemsets"].map(lambda iset: "HP Laserjet P2055" in iset)
dt_freq_basket_hplj = dt_freq_basket[mask_hplj]
print("HP Laserjet P2055: \n{}".format(get_associated_items(dt_freq_basket_hplj["itemsets"], "HP Laserjet P2055")))

ThinkPad X220: 
Counter({'HP Laserjet P2055': 4, 'HP CE50 Toner': 4, 'Lenovo Tablet Sleeve': 4, '8 GB DDR3 RAM': 1, 'LT Laser Maus': 1})


Asus EeePC: 
Counter({'Netbook-Schutzhülle ': 6, '2 GB DDR3 RAM': 6, 'LT Laser Maus': 4, 'LT Minimaus': 4})


HP Laserjet P2055: 
Counter({'HP CE50 Toner': 4, 'ThinkPad X220 ': 4, 'Lenovo Tablet Sleeve': 4})


---
# Task 4.2: Shopping Basket – Association Rules
What can the rules generated from the previously created itemsets tell you about the relationship between the Asus EeePC, the 2 GB DDR3 RAM extension and the Netbook-Schutzhülle? Try to judge the interestingness of the rules based on their lift values.

## <span style="color:#AAAAAA;">Recap about lift</span>
The _lift_ value is defined as $$ lift(X \rightarrow Y) = \frac{confidence(X \rightarrow Y)}{\sigma(Y)}$$
where _confidence_ is defined as $$ confidence(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} $$
and $\sigma$ is the (relative) support, i.e. the fraction of transactions that contain the itemset.<br><br>

With this in mind, the lift value can be interpreted as a kind of correlation between the _antecedent_ and the _consequent_. The following cases can occur:
- lift < 1: antecedent and consequent occur together less often than expected under independence, so the rule $X \rightarrow Y$ should rather be read as $X \rightarrow \neg Y$
- lift = 1: antecedent and consequent of the rule $X \rightarrow Y$ are independent
- lift > 1: antecedent and consequent occur together more often than expected under independence, so the rule $X \rightarrow Y$ can be read as $X \overset{!}{\rightarrow} Y$
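
The definitions above can be checked on a toy example. The following is a minimal sketch in plain Python; the transactions and the helper names `support`, `confidence` and `lift` are made up for illustration and are neither the ShoppingBaskets data nor part of mlxtend.

```python
# Toy transactions (illustrative only, not the ShoppingBaskets data)
transactions = [
    {"netbook", "ram", "sleeve"},
    {"netbook", "ram"},
    {"netbook", "mouse"},
    {"laptop", "printer"},
    {"laptop", "mouse"},
]


def support(itemset, transactions):
    """Relative support: fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """confidence(X -> Y) = support(X u Y) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """lift(X -> Y) = confidence(X -> Y) / support(Y)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


lift_netbook_ram = lift({"netbook"}, {"ram"}, transactions)          # lift > 1
lift_netbook_printer = lift({"netbook"}, {"printer"}, transactions)  # lift < 1
print(round(lift_netbook_ram, 2), lift_netbook_printer)
```

In this toy data the netbook and the RAM co-occur more often than independence would suggest (lift > 1), while the netbook and the printer never occur together (lift = 0).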

In [193]:
filter_rules_by_items(df_rules, ["Asus EeePC", "2 GB DDR3 RAM", "Netbook-Schutzhülle "])

| | antecedants | consequents | support | lift | confidence |
|---|---|---|---|---|---|
| 0 | (Asus EeePC, Netbook-Schutzhülle ) | (2 GB DDR3 RAM) | 0.4 | 2.0 | 1.0 |
| 1 | (Asus EeePC, 2 GB DDR3 RAM) | (Netbook-Schutzhülle ) | 0.5 | 2.0 | 0.8 |
| 2 | (Netbook-Schutzhülle , 2 GB DDR3 RAM) | (Asus EeePC) | 0.4 | 1.666667 | 1.0 |
| 3 | (Netbook-Schutzhülle ) | (Asus EeePC, 2 GB DDR3 RAM) | 0.4 | 2.0 | 1.0 |
| 4 | (2 GB DDR3 RAM) | (Asus EeePC, Netbook-Schutzhülle ) | 0.5 | 2.0 | 0.8 |


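Since all lift values are greater than `1` (between about 1.67 and 2.0), none of the rules should be omitted, and all of them can be considered "interesting": in each rule, antecedent and consequent occur together more often than would be expected if they were independent.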

---
# Task 4.3: Adult Dataset – Preprocessing
In the following, we will work with a tweaked version (fewer attributes and fewer examples, available as an .arff file in the exercise repository) of the original Adult dataset. To make it feasible for frequent itemset mining and association rule analysis, some preprocessing steps need to be performed. After you have read the dataset using the Read ARFF operator, you first need to discretize the age and the hours-per-week attributes into three user-defined ranges. Based on the original purpose of the dataset, think about what ranges might make sense. As frequent itemset mining only works on binary attributes, convert all attributes of the dataset into binary attributes. How many attributes does the dataset contain after executing the described preprocessing steps?
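
The two preprocessing steps could look roughly like this in pandas. This is only a sketch under assumptions: the sample rows are made up, the bin edges and range labels are one possible choice rather than the expected answer, and the real dataset has more attributes than the three shown here.

```python
import pandas as pd

# Tiny made-up sample; column names follow the Adult dataset, values are illustrative
df = pd.DataFrame({
    "age": [22, 38, 67, 45],
    "hours-per-week": [20, 40, 60, 38],
    "education": ["HS-grad", "Bachelors", "HS-grad", "Masters"],
})

# Step 1: discretize into three user-defined ranges (assumed bin edges and labels)
df["age"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                   labels=["young", "middle-aged", "senior"])
df["hours-per-week"] = pd.cut(df["hours-per-week"], bins=[0, 34, 45, 100],
                              labels=["part-time", "full-time", "overtime"])

# Step 2: one binary column per attribute=value pair
df_binary = pd.get_dummies(df, prefix_sep="=").astype(bool)
print(df_binary.shape[1], "binary attributes")
```

The resulting frame has one boolean column per attribute value (for example `age=young`), which is the input format that `apriori` from mlxtend expects.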

---
# Task 4.4: Adult Dataset – Frequent Itemsets
What can you learn from these itemsets about the people who earn less than $50 000?

---
# Task 4.5: Adult Dataset – Further Preprocessing
From the resulting itemsets we can see that, although we have a rather low minimal support threshold, the variation of different educations is rather low. We can mostly observe itemsets including HS-grad and almost no other educations.
In addition, native-country = United-States is dominant.
Try to explain the dominance of those two attributes and think about a possibility to aggregate the values further to reduce this dominance.

---
# Task 4.6: Adult Dataset – Finding Rich Americans
Use the FP-Growth must contain parameter to restrict the patterns to the ones containing class => 50K and lower the support so that a decent number of itemsets is discovered. What can you learn from these itemsets about the people who earn more than $50 000 per year?
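
Outside RapidMiner, the effect of a "must contain" restriction can be imitated by filtering the mined itemsets afterwards. The sketch below is plain Python; the helper `must_contain`, the item labels and the support values are hypothetical and only illustrate the idea.

```python
# Hypothetical frequent itemsets with relative supports (illustrative values only)
frequent_itemsets = {
    frozenset({"class=>50K"}): 0.24,
    frozenset({"education=Bachelors"}): 0.16,
    frozenset({"class=>50K", "education=Bachelors"}): 0.07,
    frozenset({"sex=Male", "class=>50K"}): 0.20,
}


def must_contain(itemsets, required):
    """Keep only the itemsets that include every item in `required`."""
    required = frozenset(required)
    return {items: sup for items, sup in itemsets.items() if required <= items}


rich = must_contain(frequent_itemsets, {"class=>50K"})
for items, sup in sorted(rich.items(), key=lambda kv: -kv[1]):
    print(sorted(items), sup)
```

Note that filtering after mining only finds itemsets that already passed the minimum support, which is why the task asks to lower the support threshold as well.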


---
# Task 4.7: Adult Dataset – Association Rule Mining
As we have learned, looking only at the frequent itemsets might not reveal too many insights. Therefore, we now want to learn association rules. Apply the Create Association Rules operator to the process of the former task. Which rules do you consider to be interesting? Consider both => 50K and < 50K classes.
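
To illustrate what rule creation does with the frequent itemsets, here is a minimal sketch in plain Python. It is not the RapidMiner operator or the mlxtend implementation; the function `generate_rules`, the item labels and the supports are assumptions made up for the example. It relies on every subset of a frequent itemset being present in `supports`.

```python
from itertools import combinations

# Hypothetical frequent itemsets with relative supports (illustrative values only)
supports = {
    frozenset({"class=>50K"}): 0.6,
    frozenset({"education=Masters"}): 0.5,
    frozenset({"class=>50K", "education=Masters"}): 0.4,
}


def generate_rules(supports, min_confidence=0.7):
    """Split each frequent itemset into X -> Y and keep the confident rules."""
    rules = []
    for itemset, sup in supports.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                conf = sup / supports[antecedent]
                if conf >= min_confidence:
                    rule_lift = conf / supports[consequent]
                    rules.append((antecedent, consequent, conf, rule_lift))
    return rules


rules = generate_rules(supports)
for antecedent, consequent, conf, rule_lift in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(conf, 2), round(rule_lift, 2))
```

Only the rule from `education=Masters` to `class=>50K` passes the confidence threshold here; the reverse direction has the same lift but a lower confidence, which is one reason to look at both measures when judging rules.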