## Information Retrieval lab6

- Martyna Stasiak id.156071
- Maria Musiał id.156062
----

In [60]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules
import ast
import matplotlib.pyplot as plt
import networkx as nx

You should already be familiar with the concept of association rules and the Apriori algorithm. Association rule mining is a method for discovering patterns in large data sets. It identifies relationships between variables and uses them to make predictions or informed decisions. The primary objective is to uncover rules that describe how items in the data are associated.



### Task 1
Load data from data.txt file - it contains lists of grocery shopping done by nearly 2000 customers.
Store it in a boolean one hot encoded dataframe - True for items bought in a given transaction, False otherwise.

In [61]:
with open('data.txt', 'r') as file:
    lines = file.readlines()
    
transactions = [ast.literal_eval(line.strip()) for line in lines]

unique_items = sorted(set(item for transaction in transactions for item in transaction))
print(f"There are {len(transactions)} transactions.")
print(f"There are {len(unique_items)} unique items.")
print(f"And those are:")
print(unique_items)

There are 1916 transactions.
There are 18 unique items.
And those are:
['apple', 'banana', 'beef', 'bread', 'butter', 'cheese', 'chicken', 'chocolate', 'eggs', 'grill', 'ketchup', 'milk', 'mustard', 'orange', 'pork', 'sausage', 'wagyu', 'yogurt']


In [62]:
#create a dataframe where each row is a transaction and each column is an item
basketdf = pd.DataFrame([{item: item in transaction for item in unique_items} for transaction in transactions]) 
basketdf = basketdf.astype(bool) #converting to T/F

print(f"The number of rows in the basket dataframe is {basketdf.shape[0]}")
print(f"The number of columns in the basket dataframe is {basketdf.shape[1]}")
print(f"Here is the basket dataframe:")
basketdf.head()

The number of rows in the basket dataframe is 1916
The number of columns in the basket dataframe is 18
Here is the basket dataframe:


| | apple | banana | beef | bread | butter | cheese | chicken | chocolate | eggs | grill | ketchup | milk | mustard | orange | pork | sausage | wagyu | yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | False | False | False | False | True | True | False | False | False | True | False | True | False | False | False | False |
| 1 | False | False | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False | True |
| 2 | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False |
| 4 | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False |


To extract rules you can use, for example, the Apriori algorithm implemented in mlxtend. Other algorithms perform the same task with different approaches; FP-Growth, for instance, internally uses a tree-based structure, which makes it faster in most real-life cases.
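
The level-wise idea behind Apriori can be illustrated with a minimal pure-Python sketch. It is not how mlxtend implements the algorithm; the function name `apriori_sketch` and the four toy baskets are made up for illustration and are not taken from `data.txt`.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = keep_frequent({frozenset([i]) for i in items})
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy baskets (item names borrowed from the dataset).
toy = [
    ["grill", "sausage"],
    ["grill", "sausage", "milk"],
    ["milk", "yogurt"],
    ["milk"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

The prune step is what keeps the search tractable: a pair such as (grill, milk) is only counted because both of its single items are already frequent, and no triple is generated here because only one pair survives.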

----

### Task 2
Find association rules using a selected algorithm.

### `Apriori`

First we ran the Apriori algorithm via the `apriori` function from the `mlxtend.frequent_patterns` module with a minimum support of 0.1, meaning that an itemset (of one or more items) has to appear in at least 10% of the transactions. This gave 19 frequent itemsets.<br>
Then we set the minimum confidence to 0.5, meaning that for a rule *X → Y* at least 50% of the transactions containing *X* also contain *Y*. We passed the frequent itemsets to the `association_rules` function from the same module, which left us with rules that are both frequent and reliable.<br>
After extracting the rules we analyzed them and visualized them with two types of plots:
- Heatmap of lift: shows the lift values of the association rules, with the antecedents (the *X* part of the rule) on the y-axis and the consequents (the *Y* part) on the x-axis. Each cell holds the lift of one rule. Lift values greater than 1 indicate a positive association (the items co-occur more often than expected under independence), values close to 1 suggest independence between antecedent and consequent, and values below 1 indicate a negative association.
- Network graph: represents the association rules as a directed graph. Each node corresponds to an item or itemset, and each edge represents a rule *X → Y*, pointing from antecedent to consequent. The edge labels display the lift values of the rules.<br>


The rule with the highest confidence (0.93) and lift (3.75) was `grill → sausage`. This makes sense: the two items are often bought together because they are used together for barbecues and grilling.

In [63]:
minSup = 0.1  # an itemset has to appear in at least 10% of the transactions
frequent_itemsets = apriori(basketdf, min_support=minSup, use_colnames=True)
print(f"Number of frequent itemsets with min support of {minSup} is {frequent_itemsets.shape[0]}")
print(f"Here are the frequent itemsets:")
print(frequent_itemsets)


Number of frequent itemsets with min support of 0.1 is 19
Here are the frequent itemsets:
     support            itemsets
0   0.151879             (apple)
1   0.165449            (banana)
2   0.127871              (beef)
3   0.253653             (bread)
4   0.212422            (cheese)
5   0.420668           (chicken)
6   0.179541         (chocolate)
7   0.113257             (grill)
8   0.400313              (milk)
9   0.162317            (orange)
10  0.205115              (pork)
11  0.248434           (sausage)
12  0.237474            (yogurt)
13  0.104384    (bread, chicken)
14  0.149791     (milk, chicken)
15  0.102296  (chicken, sausage)
16  0.122651   (milk, chocolate)
17  0.105428    (grill, sausage)
18  0.129436      (milk, yogurt)


In [64]:
minConf = 0.5 
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=minConf)
print(f"Number of rules with min confidence of {minConf} is {rules.shape[0]}")
print(f"The achieved rules:")
rules.sort_values(by='confidence', ascending=False)

Number of rules with min confidence of 0.5 is 3
The achieved rules:


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (grill) | (sausage) | 0.113257 | 0.248434 | 0.105428 | 0.930876 | 3.74697 | 0.077291 | 10.872651 | 0.826753 |
| 0 | (chocolate) | (milk) | 0.179541 | 0.400313 | 0.122651 | 0.68314 | 1.706513 | 0.050779 | 1.89259 | 0.504607 |
| 2 | (yogurt) | (milk) | 0.237474 | 0.400313 | 0.129436 | 0.545055 | 1.361571 | 0.034372 | 1.318152 | 0.348256 |
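
As a sanity check, the metrics of the top rule can be recomputed by hand from the three supports printed earlier. The sketch below uses the standard definitions of confidence, lift, leverage and conviction; the variable names are ours, and because the inputs are rounded to six decimals the results should agree with the table only up to rounding.

```python
# Supports copied from the frequent-itemset output (rounded to 6 decimals).
support_grill = 0.113257    # antecedent support, P(grill)
support_sausage = 0.248434  # consequent support, P(sausage)
support_both = 0.105428     # rule support, P(grill and sausage)

# confidence(X -> Y) = P(X and Y) / P(X)
confidence = support_both / support_grill
# lift(X -> Y) = confidence / P(Y); 1 means independence
lift = confidence / support_sausage
# leverage = P(X and Y) - P(X) * P(Y); 0 means independence
leverage = support_both - support_grill * support_sausage
# conviction = (1 - P(Y)) / (1 - confidence)
conviction = (1 - support_sausage) / (1 - confidence)

print(f"confidence={confidence:.4f}  lift={lift:.4f}  "
      f"leverage={leverage:.4f}  conviction={conviction:.2f}")
```

Read this way, a lift of about 3.75 means that a basket containing a grill is roughly 3.75 times more likely to contain sausage than an arbitrary basket.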


In [65]:
# import seaborn as sns

# # preparing data for heatmap
# rules['antecedents'] = rules['antecedents'].apply(lambda x: ''.join(list(x)))
# rules['consequents'] = rules['consequents'].apply(lambda x: ''.join(list(x)))
# rules_heatmap = rules.pivot(index='antecedents', columns='consequents', values='lift')

# # Adjust figure size
# plt.figure(figsize=(8, 6))

# # heatmap
# sns.set(font_scale=1.0)  
# sns.heatmap(
#     rules_heatmap,
#     annot=True,
#     fmt=".2f",
#     cmap="coolwarm",
#     cbar_kws={'label': 'Lift'},
#     annot_kws={"fontsize": 10},  
# )

# plt.title('Heatmap of Lift for achieved Association Rules', fontsize=14)
# plt.xlabel('Consequents', fontsize=12)
# plt.ylabel('Antecedents', fontsize=12)

# # Show plot
# plt.show()


In [66]:
# #directed graph
# G = nx.DiGraph()
# # adding edges for each rule
# for i, rule in rules.iterrows():
#     G.add_edge(''.join(list(rule['antecedents'])), ''.join(list(rule['consequents'])), weight=rule['lift'])


# plt.figure(figsize=(8, 5))
# pos = nx.spring_layout(G, k=0.5, seed=42)  # Positioning of nodes
# nx.draw(G, pos, with_labels=True, node_size=1500, node_color="pink", font_size=10, edge_color='gray')
# edge_labels = nx.get_edge_attributes(G, 'weight')
# nx.draw_networkx_edge_labels(G, pos, edge_labels={k: f"{v:.2f}" for k, v in edge_labels.items()}, font_color='red')
# plt.title('Network Graph of Association Rules (Edge Label: Lift)')
# plt.show()


### `FP-Growth`

Next we tested the FP-Growth algorithm (the `fpgrowth` function), following the same steps as for Apriori above: minimum support 0.1 and minimum confidence 0.5. It produced the same 19 frequent itemsets and the same three rules; only the row order differs.

In [67]:
minSup = 0.1  # an itemset has to appear in at least 10% of the transactions
frequent_itemsets_fp = fpgrowth(basketdf, min_support=minSup, use_colnames=True)
print(f"Number of frequent itemsets with min support of {minSup} is {frequent_itemsets_fp.shape[0]}")
print(f"Here are the frequent itemsets:")
print(frequent_itemsets_fp)

Number of frequent itemsets with min support of 0.1 is 19
Here are the frequent itemsets:
     support            itemsets
0   0.420668           (chicken)
1   0.400313              (milk)
2   0.179541         (chocolate)
3   0.165449            (banana)
4   0.162317            (orange)
5   0.151879             (apple)
6   0.237474            (yogurt)
7   0.212422            (cheese)
8   0.205115              (pork)
9   0.248434           (sausage)
10  0.127871              (beef)
11  0.253653             (bread)
12  0.113257             (grill)
13  0.149791     (milk, chicken)
14  0.122651   (milk, chocolate)
15  0.129436      (milk, yogurt)
16  0.102296  (chicken, sausage)
17  0.104384    (bread, chicken)
18  0.105428    (grill, sausage)


In [68]:
minConf = 0.5 
rules_fp = association_rules(frequent_itemsets_fp, metric="confidence", min_threshold=minConf)
print(f"Number of rules with min confidence of {minConf} is {rules_fp.shape[0]}")
print(f"The achieved rules:")
rules_fp.sort_values(by='confidence', ascending=False)

Number of rules with min confidence of 0.5 is 3
The achieved rules:


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | (grill) | (sausage) | 0.113257 | 0.248434 | 0.105428 | 0.930876 | 3.74697 | 0.077291 | 10.872651 | 0.826753 |
| 0 | (chocolate) | (milk) | 0.179541 | 0.400313 | 0.122651 | 0.68314 | 1.706513 | 0.050779 | 1.89259 | 0.504607 |
| 1 | (yogurt) | (milk) | 0.237474 | 0.400313 | 0.129436 | 0.545055 | 1.361571 | 0.034372 | 1.318152 | 0.348256 |
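
Because the two algorithms list their itemsets in a different order, comparing the outputs by eye is error-prone. One possible order-independent check is sketched below; the helper `as_support_map` is our own name, and the demo uses only a few rows copied from the two outputs above rather than the full tables.

```python
def as_support_map(rows, ndigits=6):
    """Map each itemset (as a frozenset) to its rounded support."""
    return {frozenset(itemset): round(support, ndigits) for support, itemset in rows}


# A few (support, itemset) rows copied from the Apriori output ...
apriori_rows = [
    (0.420668, ("chicken",)),
    (0.149791, ("milk", "chicken")),
    (0.105428, ("grill", "sausage")),
]
# ... and the same itemsets in the order FP-Growth reported them.
fpgrowth_rows = [
    (0.420668, ("chicken",)),
    (0.149791, ("milk", "chicken")),
    (0.105428, ("grill", "sausage")),
][::-1]

same = as_support_map(apriori_rows) == as_support_map(fpgrowth_rows)
print("identical itemsets and supports:", same)
```

With the DataFrames from the cells above, the same helper could be fed with `zip(frequent_itemsets["support"], frequent_itemsets["itemsets"])` and the corresponding columns of `frequent_itemsets_fp`.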


----

### Task 3
The association rules are characterized by high support - frequency in the dataset. Can you use this algorithm as a base and try to extract different types of rules:
 - dissociation rules e.g. buying Porsche and Rolex is not frequent in the dataset, but usually people who bought Porsche also bought Rolex
 - negative rules e.g. if someone bought low-fat milk it's unlikely there will be whole milk in the basket
 - disjunction e.g. eggs and (kielecki xor winiary ;) )
 - imagine 50% of baskets have milk and 50% of baskets have tea. If there is no relation between them then in ~25% of baskets we will have both. If milk appears together with tea in e.g. 40% of baskets it means there is a pattern. Can you find such rules and use statistical tests to check if the relation is strong?
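
For the last point, one common choice is Pearson's chi-square test of independence on the 2×2 contingency table of the two items. The sketch below is a minimal pure-Python version without continuity correction; the function name `chi_square_2x2` is ours, and the basket count of 1000 is an assumption added to turn the milk/tea percentages from the task into counts.

```python
from math import erfc, sqrt


def chi_square_2x2(both, only_x, only_y, neither):
    """Pearson chi-square statistic and p-value (1 degree of freedom)."""
    n = both + only_x + only_y + neither
    row_x, row_not_x = both + only_x, only_y + neither
    col_y, col_not_y = both + only_y, only_x + neither
    chi2 = (n * (both * neither - only_x * only_y) ** 2
            / (row_x * row_not_x * col_y * col_not_y))
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Assumed 1000 baskets: 50% milk, 50% tea, 40% contain both.
n_baskets = 1000
both = 400
only_milk = 500 - both
only_tea = 500 - both
neither = n_baskets - both - only_milk - only_tea

chi2, p_value = chi_square_2x2(both, only_milk, only_tea, neither)
lift = (both / n_baskets) / (0.5 * 0.5)  # observed 40% vs. expected 25%
print(f"lift={lift:.2f}  chi2={chi2:.1f}  p={p_value:.3g}")

# Under exact independence (25% in every cell) the statistic is zero.
chi2_indep, p_indep = chi_square_2x2(250, 250, 250, 250)
```

A small p-value says only that the two items are not independent; the lift (or the sign of the leverage) tells whether the association is positive or negative, so the two should be read together.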

 Send the report within 144 hours starting from the end of this class to gmiebs@cs.put.poznan.pl; start this email's subject with [IR]
