The objective of this assignment is to introduce students to association rule mining, with a focus on market basket analysis, and to provide hands-on experience.

In [14]:
#pip install mlxtend

In [15]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import warnings
warnings.filterwarnings('ignore')  # Suppress all warnings

**Data Preprocessing:**

Pre-process the dataset so that it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.

In [16]:
# Define your custom column names
custom_headers = ['Products']

In [17]:
# Load dataset
retail_data = pd.read_excel("/content/Online retail.xlsx", header = None, names = custom_headers)
retail_data.head()

Unnamed: 0,Products
0,"shrimp,almonds,avocado,vegetables mix,green gr..."
1,"burgers,meatballs,eggs"
2,chutney
3,"turkey,avocado"
4,"mineral water,milk,energy bar,whole wheat rice..."


In [18]:
# Get basic info on the imported dataset
retail_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Products  7501 non-null   object
dtypes: object(1)
memory usage: 58.7+ KB


In [19]:
retail_data.describe()

Unnamed: 0,Products
count,7501
unique,5176
top,cookies
freq,223


In [20]:
# Split each comma-separated transaction string into a list of items
retail_data_list = retail_data['Products'].str.split(',').tolist()
retail_data_list

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers', 'meatballs', 'eggs'],
 ['chutney'],
 ['turkey', 'avocado'],
 ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea'],
 ['low fat yogurt'],
 ['whole wheat pasta', 'french fries'],
 ['soup', 'light cream', 'shallot'],
 ['frozen vegetables', 'spaghetti', 'green tea'],
 ['french fries'],
 ['eggs', 'pet food'],
 ['cookies'],
 ['turkey', 'burgers', 'mineral water', 'eggs', 'cooking oil'],
 ['spaghetti', 'champagne', 'cookies'],
 ['mineral water', 'salmon'],
 ['mineral water'],
 ['shrimp',
  'chocolate',
  'chicken',
  'honey',
  'oil',
  'cooking oil',
  'low fat yogurt'],
 ['turkey', 'eggs'],
 ['turkey',
  'fresh tuna',
  'tomatoes',
  'spagh
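
The encoded table further down lists asparagus twice (`asparagus` and `asparagus.1`), which suggests, though the source does not confirm it, that some item strings carry stray whitespace. A minimal sketch of a more defensive split, assuming the same comma-separated layout; `split_items` and `sample_rows` are hypothetical names and the rows are made up for illustration:

```python
def split_items(raw: str) -> list[str]:
    """Split a comma-separated basket string, dropping stray whitespace and empty entries."""
    return [item.strip() for item in raw.split(",") if item.strip()]


# Hypothetical rows in the same layout as the Products column (not taken from the dataset)
sample_rows = ["turkey,avocado", "burgers, asparagus ,eggs", "chutney,"]

cleaned = [split_items(row) for row in sample_rows]
print(cleaned)
```

Applied to the real column, the same helper could replace the plain `str.split(',')` call, so that items differing only by whitespace end up in one column after encoding.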

In [21]:
# Apply Encoder
te = TransactionEncoder()
te_ary = te.fit(retail_data_list).transform(retail_data_list)
retail_te = pd.DataFrame(te_ary, columns=te.columns_)
retail_te

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7497,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7498,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7499,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [22]:
# Convert the boolean one-hot table into 0/1 integers
retail_te = retail_te.astype(int)
retail_te

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,0,1,1,0,1,0,0,0,0,0,...,0,1,0,0,1,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7497,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7498,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7499,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
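
Conceptually, the encoding step turns each basket into one row with a True/False flag per distinct item. The sketch below reproduces that idea in plain Python on a made-up toy dataset; it is an illustration of one-hot encoding, not mlxtend's implementation, and `one_hot` is a hypothetical helper:

```python
def one_hot(transactions):
    """Return (sorted item names, boolean rows) for a list of baskets."""
    columns = sorted({item for basket in transactions for item in basket})
    rows = [[item in set(basket) for item in columns] for basket in transactions]
    return columns, rows


# Toy baskets, invented for illustration
toy = [["burgers", "eggs"], ["eggs", "milk"], ["milk"]]

columns, rows = one_hot(toy)
print(columns)
for row in rows:
    print(row)
```

Each row has the same length as `columns`, which is why the real encoded table has one column per distinct product.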


**Association Rule Mining:**

* Implement the Apriori algorithm using a tool such as Python with libraries like Pandas and Mlxtend.
* Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.
* Set appropriate thresholds for support, confidence, and lift to extract meaningful rules.

In [23]:
# Minimum support threshold: keep itemsets that appear in at least 1% of all transactions
frequent_patterns = apriori(retail_te, min_support=0.01, use_colnames=True)
frequent_patterns

Unnamed: 0,support,itemsets
0,0.020397,(almonds)
1,0.033329,(avocado)
2,0.010799,(barbecue sauce)
3,0.014265,(black tea)
4,0.011465,(body spray)
...,...,...
252,0.011065,"(mineral water, ground beef, milk)"
253,0.017064,"(mineral water, spaghetti, ground beef)"
254,0.015731,"(mineral water, spaghetti, milk)"
255,0.010265,"(mineral water, spaghetti, olive oil)"
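
To make the frequent-itemset step less of a black box, here is a minimal sketch of the Apriori idea in plain Python: count support level by level, and only build larger candidates from itemsets that were already frequent. It is a teaching sketch on made-up baskets, not the mlxtend implementation; `apriori_sketch` and `toy` are hypothetical names.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all itemsets meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    current = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates of size k that are frequent enough
        level = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        current = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


# Toy baskets, invented for illustration
toy = [
    ["mineral water", "eggs", "milk"],
    ["mineral water", "eggs"],
    ["mineral water", "milk"],
    ["eggs"],
    ["mineral water", "eggs", "milk"],
]

frequent = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With these toy baskets, (eggs, milk) falls below the 50% threshold, so the three-item set is pruned without ever being counted; that pruning is what keeps Apriori tractable.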


In [24]:
# From the itemsets with support >= 1%, generate rules and keep those whose confidence
# (likelihood of the consequent given the antecedent) is at least 50%
rules = association_rules(frequent_patterns, metric="confidence", min_threshold=0.5)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,"(eggs, ground beef)",(mineral water),0.019997,0.238368,0.010132,0.506667,2.125563,1.0,0.005365,1.543848,0.540342,0.040816,0.352268,0.274586
1,"(ground beef, milk)",(mineral water),0.021997,0.238368,0.011065,0.50303,2.110308,1.0,0.005822,1.532552,0.537969,0.044385,0.347493,0.274725


**Analysis and Interpretation:**

* Analyse the generated rules to identify interesting patterns and relationships between the products.
* Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.

* Only itemsets that appear in at least 1% of all transactions are considered. Of the rules built from these itemsets, only those with confidence >= 50% (the conditional probability of the consequent given the antecedent) are shown above.
* Antecedent support is the fraction of all transactions that contain the antecedent. Here, about 2% of transactions contain (eggs, ground beef) and about 2.2% contain (ground beef, milk), while the consequent (mineral water) appears in about 23.8% of transactions.
* Support is the fraction of transactions that contain both the antecedent and the consequent, which is about 1% for both rules.
* Lift measures how much more likely the consequent is given the antecedent, compared with the consequent's baseline probability. Here, customers who buy (eggs, ground beef) or (ground beef, milk) are about 2.1x as likely to buy (mineral water) as customers in general.
* Conviction compares how often the rule would fail (antecedent present, consequent absent) if the two were independent with how often it actually fails. Here conviction is about 1.5, so under independence the rule would be wrong about 1.5 times as often as it is in the data.
* Leverage measures dependence as the difference between the observed support and the support expected under independence. Here it is about 0.005 and positive, so (eggs, ground beef)/(ground beef, milk) occur together with (mineral water) in about 0.5 percentage points more transactions than independence would predict.
* Zhang's metric evaluates both the strength and the direction of an association. It ranges from -1 to 1: positive values indicate association, negative values indicate dissociation, and 0 indicates independence. Here it is about 0.54 for both rules, which points to a moderately strong positive association, so the rules are meaningful.
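
The metrics discussed above can all be derived from three numbers: the antecedent support, the consequent support, and the joint support. The sketch below recomputes them for rule 0 from the supports printed in the table; `rule_metrics` is a hypothetical helper, and the Zhang's metric formula used here is the commonly cited definition, which is assumed to match what the library reports.

```python
def rule_metrics(s_x, s_y, s_xy):
    """Derive rule metrics from support(X), support(Y) and support(X and Y)."""
    confidence = s_xy / s_x
    lift = confidence / s_y
    leverage = s_xy - s_x * s_y
    # Conviction is infinite for a rule that never fails (confidence == 1)
    conviction = (1 - s_y) / (1 - confidence) if confidence < 1 else float("inf")
    # Zhang's metric: leverage scaled into the range [-1, 1]
    zhang = leverage / max(s_xy * (1 - s_x), s_x * (s_y - s_xy))
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhang,
    }


# Supports of (eggs, ground beef) -> (mineral water), copied from the rules table
metrics = rule_metrics(s_x=0.019997, s_y=0.238368, s_xy=0.010132)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Because the inputs are already rounded to six decimals, the results should agree with the table only up to small rounding differences.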

In [25]:
rules = association_rules(frequent_patterns, metric="lift", min_threshold=1.5)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(burgers),(cake),0.087188,0.081056,0.011465,0.131498,1.622319,1.0,0.004398,1.058080,0.420238,0.073129,0.054892,0.136473
1,(cake),(burgers),0.081056,0.087188,0.011465,0.141447,1.622319,1.0,0.004398,1.063198,0.417434,0.073129,0.059442,0.136473
2,(eggs),(burgers),0.179709,0.087188,0.028796,0.160237,1.837830,1.0,0.013128,1.086988,0.555754,0.120941,0.080026,0.245256
3,(burgers),(eggs),0.087188,0.179709,0.028796,0.330275,1.837830,1.0,0.013128,1.224818,0.499424,0.120941,0.183552,0.245256
4,(burgers),(green tea),0.087188,0.132116,0.017464,0.200306,1.516139,1.0,0.005945,1.085270,0.372947,0.086526,0.078570,0.166248
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209,"(mineral water, pancakes)",(spaghetti),0.033729,0.174110,0.011465,0.339921,1.952333,1.0,0.005593,1.251198,0.504819,0.058384,0.200766,0.202885
210,"(spaghetti, pancakes)",(mineral water),0.025197,0.238368,0.011465,0.455026,1.908923,1.0,0.005459,1.397557,0.488452,0.045479,0.284466,0.251562
211,(mineral water),"(spaghetti, pancakes)",0.238368,0.025197,0.011465,0.048098,1.908923,1.0,0.005459,1.024059,0.625163,0.045479,0.023494,0.251562
212,(spaghetti),"(mineral water, pancakes)",0.174110,0.033729,0.011465,0.065850,1.952333,1.0,0.005593,1.034385,0.590626,0.058384,0.033242,0.202885


In [26]:
# Show the confidence of the rule(s) with the highest lift
max_lift_rules = rules[rules['lift'] == rules['lift'].max()]
print(max_lift_rules[['antecedents', 'consequents', 'lift', 'confidence']])


The output above starts from itemsets that appear in at least 1% of transactions. Lift is then used to keep only the rules in which customers are at least 1.5x as likely to purchase the consequent given the antecedent, compared with the consequent's baseline.
Some limitations can be observed:
* The thresholds are arbitrary, so the conclusions depend on the chosen cut-offs and may not always be correct.
* Every combination is listed in both directions (X -> Y and Y -> X) with the same lift, which inflates the rule list and makes it more expensive to compute and review.
* The metrics are bare numbers without business context.
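
One way to tame the mirrored rules is to keep a single entry per unordered pair of itemsets. This is only sensible when ranking by a symmetric metric such as lift, since confidence differs between the two directions. A minimal sketch on a few (antecedent, consequent, lift) triples copied from the table above; `unique_pairs` is a hypothetical helper:

```python
def unique_pairs(rules):
    """Keep the first rule seen for each unordered {antecedent, consequent} pair."""
    seen = set()
    kept = []
    for antecedent, consequent, lift in rules:
        key = frozenset([frozenset(antecedent), frozenset(consequent)])
        if key not in seen:
            seen.add(key)
            kept.append((antecedent, consequent, lift))
    return kept


# First five rows of the lift-filtered rules table
sample_rules = [
    (("burgers",), ("cake",), 1.622319),
    (("cake",), ("burgers",), 1.622319),
    (("eggs",), ("burgers",), 1.837830),
    (("burgers",), ("eggs",), 1.837830),
    (("burgers",), ("green tea",), 1.516139),
]

deduplicated = unique_pairs(sample_rules)
for rule in deduplicated:
    print(rule)
```

The five sample rows collapse to three pairs. The direction that is dropped still has its own confidence, so this view is a summary for browsing, not a replacement for the full table.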


* What is lift and why is it important in Association rules?
  * Lift is a metric used to measure the strength of an association rule, beyond what would be expected by mere chance or coincidence. It tells us how much more likely the consequent (Y) is to be purchased, given that the antecedent (X) has already been purchased, compared to the baseline probability of Y occurring on its own.

  * Lift(X -> Y) = Confidence(X -> Y) / Support(Y)
  * In other words, it is the ratio of observed support to that expected if X and Y were independent.
  * Lift > 1 implies a positive association. For example, 1.5 means that items X and Y are purchased together 1.5 times more often than if they were independent.
  * Lift = 1 implies independence. The presence of X has no effect on the probability of Y.
  * Lift < 1 implies a negative association. For example, 0.5 means the items are purchased together half as often as expected under independence. Lift is a ratio of probabilities, so it is never negative.
* Why is Lift so important?
  * Association rules are a fundamental concept used to find relationships, correlations, or patterns within large sets of data items. They describe how often itemsets occur together. There are three key metrics: Support, Confidence, and Lift. Lift is critical because it overcomes a major weakness of the Confidence measure.
  * Example:
    * Suppose Support(X) = 80%, Support(Y) = 5%, and Support(X ∪ Y) = 4%.
    * Confidence(Y -> X) = Support(X ∪ Y) / Support(Y) = 4% / 5% = 80%
    * Taken alone, this says that 80% of the transactions containing Y also contain X. But X already appears in 80% of all transactions, so knowing that Y was bought tells us nothing new about X, and the high confidence is not a meaningful rule.
    * Lift(Y -> X) = Confidence(Y -> X) / Support(X) = 80% / 80% = 1.0, or equivalently 4% / (5% * 80%) = 1.0
    * A lift of 1 means X and Y are independent.
  * This example illustrates why lift matters: confidence alone can be misleading when the consequent is very common.
  * Lift corrects confidence for the baseline frequency of the consequent, which helps to identify truly interesting, non-random associations. Unlike confidence, lift is symmetric: Lift(X -> Y) = Lift(Y -> X).
* What is Support and Confidence? How do you calculate them?
  * Support is the relative frequency of an item set in a dataset. It is used to identify frequent item sets in a dataset, which can be used to generate association rules.
  * Support(X) = Number of transactions containing itemset X / Total number of transactions
  * Where: X is the item or combination of items.
  * The numerator is the number of transactions that contain the item.
  * The denominator is the total number of transactions in the dataset.
  * Example:
    * Consider a dataset that has a total of 100 transactions. If 30 of these transactions include both cake and burgers, then:
    * Support(cake, burgers) = 30/100 = 0.3
    * This means that 30% of the transactions in the dataset contain both cake and burgers.
  * Confidence is a measure that indicates how likely it is that item Y will appear in a transaction given that item X is already in the transaction. It is a way of evaluating the strength of association between two items.
  * Confidence(X -> Y) = Support(X ∪ Y) / Support(X)
  * Where: X is the item or itemset that is already present.
  * Y is the item or itemset that we are trying to predict.
  * Support(X ∪ Y) is the support of the combination of both items X and Y.
  * Support(X) is the support of item X alone.
  * Example:
    * Consider a dataset that has a total of 100 transactions, of which 40 contain cake and 20 contain both cake and burgers. Then:
    * Confidence(cake -> burgers) = 20/40 = 0.5
    * This means that when cake is bought, there is a 50% chance that burgers will be bought as well.
  * High Support means that an item or a combination of items appears a lot in the dataset.
  * High Confidence means that if one item is present, there's a strong chance that another item will be present too.
* What are some limitations or challenges of Association rules mining?
  * Sensitivity to thresholds: high thresholds may miss interesting patterns, while low thresholds can generate too many rules and clutter the analysis.
  * Large number of rules: depending on the thresholds, so many rules may be generated that it is very difficult for an analyst to go through them all.
  * It works on discrete (categorical or binary) data, so numerical attributes have to be binned into categories first.
  * It can be computationally expensive if the dataset is very large, because the number of candidate itemsets grows rapidly with the number of distinct items.
  * Domain knowledge is essential to validate the findings, as high confidence does not guarantee a meaningful association.
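
The definitions of support, confidence, and lift given in the answers above can be checked on a small synthetic dataset. The sketch below builds 100 made-up baskets that reproduce the X/Y example (X in 80 transactions, Y in 5, both in 4) and computes the three metrics directly from counts; all function and variable names are hypothetical.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions)


def support(transactions, itemset):
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Synthetic baskets: 4 with X and Y, 76 with X only, 1 with Y only, 19 with neither
baskets = [["X", "Y"]] * 4 + [["X"]] * 76 + [["Y"]] * 1 + [["Z"]] * 19

print("support(X)       =", support(baskets, ["X"]))
print("support(Y)       =", support(baskets, ["Y"]))
print("confidence(Y->X) =", confidence(baskets, ["Y"], ["X"]))
print("lift(Y->X)       =", lift(baskets, ["Y"], ["X"]))
```

The confidence of Y -> X comes out high only because X is in most baskets, while the lift of 1 shows that Y carries no information about X.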