In [1]:
pip install mlxtend

Note: you may need to restart the kernel to use updated packages.


Import Libraries

In [3]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules


Load the Dataset


In [5]:
# Load the Excel file
df = pd.read_excel("Online retail.xlsx")

# Display first few rows
df.head()


Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
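
The header cell in the output above is itself a basket (shrimp, almonds, avocado, ...), which suggests the sheet has no header row and that `read_excel` consumed the first transaction as the column name. A minimal sketch of the effect, using an in-memory CSV as a stand-in for the spreadsheet (the three rows are illustrative, and `header=None` is only assumed to be the right fix for this file):

```python
import io
import pandas as pd

# Stand-in for the spreadsheet: one quoted basket per row, no header row
raw = '"shrimp,almonds,avocado"\n"burgers,meatballs,eggs"\n"chutney"\n'

# Default behaviour: the first basket is consumed as the column name
with_header = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data and names the column 0
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(len(with_header), len(no_header))
```

`pd.read_excel` accepts the same `header=None` argument, so passing it when loading "Online retail.xlsx" would keep the first basket in the data.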


Convert Each Row into a List of Items

In [7]:
# Convert comma-separated strings into list of items
transactions = df.iloc[:, 0].apply(lambda x: x.split(',')).tolist()


One-Hot Encode the Transactions

In [9]:
te = TransactionEncoder()
te_ary = te.fit_transform(transactions)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

df_encoded.head()


Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
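
The encoded frame above lists asparagus twice (`asparagus` and `asparagus.1`), which usually means one spelling carries stray whitespace and is being treated as a separate item. A hedged sketch of a cleaning step that could run before encoding; the sample rows and the helper name `to_items` are made up for illustration:

```python
# Illustrative rows; some item names carry stray spaces
rows = ["burgers,meatballs,eggs", "turkey, avocado", " asparagus,eggs", "asparagus"]

def to_items(row):
    """Split a comma-separated basket, trimming whitespace and dropping empty entries."""
    return [item.strip() for item in row.split(",") if item.strip()]

transactions = [to_items(row) for row in rows]

# After trimming, " asparagus" and "asparagus" collapse into one item
unique_items = sorted({item for basket in transactions for item in basket})
print(unique_items)
```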


Apply Apriori Algorithm

In [11]:
# Find frequent itemsets with minimum support threshold
frequent_itemsets = apriori(df_encoded, min_support=0.01, use_colnames=True)

# Sort itemsets by support
frequent_itemsets.sort_values(by='support', ascending=False).head()


Unnamed: 0,support,itemsets
46,0.238267,(mineral water)
19,0.179733,(eggs)
63,0.174133,(spaghetti)
24,0.170933,(french fries)
13,0.163867,(chocolate)


Generate Association Rules

In [13]:
# Generate association rules with lift threshold
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=3)

# Filter rules with confidence ≥ 0.2
rules = rules[rules['confidence'] >= 0.2]

# Display top rules
rules.sort_values(by='confidence', ascending=False).head(10)


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
1,(herb & pepper),(ground beef),0.049467,0.098267,0.016,0.32345,3.291555,1.0,0.011139,1.332841,0.732423,0.121457,0.249723,0.243136
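
The metrics in the row above are all derived from the three support values, so they can be checked by hand. A small sketch that recomputes four of them from the rounded figures in the table (so the results agree with the table only to about three decimal places):

```python
support_a = 0.049467   # antecedent support: herb & pepper
support_b = 0.098267   # consequent support: ground beef
support_ab = 0.016     # support of both together, as rounded in the table

confidence = support_ab / support_a              # P(B | A)
lift = confidence / support_b                    # P(B | A) / P(B)
leverage = support_ab - support_a * support_b    # observed minus expected co-occurrence
conviction = (1 - support_b) / (1 - confidence)

print(f"confidence={confidence:.3f} lift={lift:.3f} "
      f"leverage={leverage:.3f} conviction={conviction:.3f}")
```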


Interpret the Rules 

In [15]:
for index, row in rules.sort_values(by='lift', ascending=False).head(10).iterrows():
    print(f"Rule: {set(row['antecedents'])} → {set(row['consequents'])}")
    print(f"Support: {row['support']:.3f}, Confidence: {row['confidence']:.3f}, Lift: {row['lift']:.3f}")
    print("-" * 50)


Rule: {'herb & pepper'} → {'ground beef'}
Support: 0.016, Confidence: 0.323, Lift: 3.292
--------------------------------------------------


Conclusion

Frequent Itemsets Identified
* Using a minimum support threshold of 1%, we successfully extracted frequent itemsets.

* These itemsets represent groups of products that often occur together in transactions.

* Mineral water (23.8% of transactions), eggs (18.0%), spaghetti (17.4%), french fries (17.1%), and chocolate (16.4%) had the highest support, indicating their popularity.

Strong Association Rules Discovered
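* With thresholds of lift ≥ 3 and confidence ≥ 20%, a single rule remained: herb & pepper → ground beef (support 0.016, confidence 0.323, lift 3.292).

* A lift of about 3.3 means that baskets containing herb & pepper include ground beef roughly 3.3 times as often as baskets overall.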

Customer Purchase Behavior Insights
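* Health-oriented items can appear in the same basket (e.g., the first transaction contains avocado, green tea, salmon, and mineral water), but a single basket is an anecdote; no rule at the thresholds above confirms it as a pattern.

* Likewise, eggs and spaghetti are individually frequent, but whether convenience items such as frozen smoothie, eggs, spaghetti, and milk are bought together would need rules mined at a lower lift threshold to verify.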

* These insights can help retailers with product placement, cross-selling strategies, and personalized marketing campaigns.

Business Applications
* Inventory Planning: Stock items that are frequently bought together near each other.

* Recommendation Systems: Suggest products based on association rules.

* Promotion Bundling: Offer bundle discounts for frequently associated items.

Interview Questions

1. What is lift, and why is it important in association rules?

* Lift is a key metric in association rule mining that measures how much more often A and B are purchased together than would be expected if they were bought independently of each other.

Lift formula

* $\text{Lift}(A \rightarrow B) = \dfrac{\text{Support}(A \rightarrow B)}{\text{Support}(A) \times \text{Support}(B)}$

* Support(A → B): The probability that both items A and B are bought together.

* Support(A) and Support(B): The individual probabilities of A and B being bought.



Importance of lift

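* Detects non-random associations: a lift of 1 means A and B occur together exactly as often as independence would predict; a value above 1 indicates a positive association, and a value below 1 a negative one.
* Highlights valuable rules: ranking by lift surfaces pairs whose co-occurrence is not explained by popularity alone.
* Goes beyond confidence: confidence can be high simply because B is popular; lift corrects for this by dividing by Support(B).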
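
The last point follows directly from the definitions: substituting confidence into the lift formula shows that lift is confidence rescaled by how common B already is. In the document's notation:

```latex
\begin{aligned}
\text{Lift}(A \rightarrow B)
  &= \frac{\text{Support}(A \rightarrow B)}{\text{Support}(A)\,\text{Support}(B)} \\
  &= \frac{\text{Confidence}(A \rightarrow B)}{\text{Support}(B)}
\end{aligned}
```

So a rule with 90% confidence has a lift of only 1 if B is in 90% of all baskets anyway.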

2. What are support and confidence, and how do you calculate them?

Support and Confidence are the two fundamental metrics used in association rules to evaluate how often items are bought together and how strong their relationship is.

Support Formula:

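* $\text{Support}(A) = \dfrac{\text{number of transactions containing } A}{\text{total number of transactions}}$

* $\text{Support}(A \rightarrow B) = \dfrac{\text{number of transactions containing both } A \text{ and } B}{\text{total number of transactions}}$

Confidence Formula
* $\text{Confidence}(A \rightarrow B) = \dfrac{\text{Support}(A \rightarrow B)}{\text{Support}(A)}$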

Importance

* Support: helps identify popular items or combinations. Low-support rules may be unreliable even when confidence is high, because they rest on only a few transactions.

* Confidence: measures the predictive strength of the rule. The higher the confidence, the more likely B is to be in the cart when A is already there.
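
Both formulas reduce to counting baskets. A minimal sketch on five made-up transactions (the functions `support` and `confidence` are illustrative helpers, not part of mlxtend):

```python
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"milk"}, transactions))                # 3 of 5 baskets contain milk
print(support({"milk", "bread"}, transactions))       # 2 of 5 contain both
print(confidence({"milk"}, {"bread"}, transactions))  # 2 of the 3 milk baskets contain bread
```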



3. What are some limitations or challenges of association rule mining?

Limitations and Challenges of Association Rule Mining

Too Many Rules (Combinatorial Explosion)
* Problem: The algorithm may generate thousands of rules, especially with low support/confidence thresholds.

* Impact: Most of these rules are not meaningful or actionable.

* Solution: Use higher thresholds or additional metrics like lift, leverage, or conviction to filter.
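
The explosion is easy to quantify: with d distinct items, every rule is a non-empty antecedent paired with a non-empty, disjoint consequent. A short sketch that counts them directly and checks the result against the closed form 3^d − 2^(d+1) + 1 (the function name is illustrative):

```python
from math import comb

def count_possible_rules(d):
    """Count rules X -> Y with X and Y non-empty and disjoint, drawn from d items."""
    # Choose k items for the antecedent, then any non-empty subset of the remaining d - k
    return sum(comb(d, k) * (2 ** (d - k) - 1) for k in range(1, d))

for d in (3, 6, 10, 20):
    # The direct count matches the closed-form expression
    assert count_possible_rules(d) == 3 ** d - 2 ** (d + 1) + 1
    print(d, count_possible_rules(d))
```

Six items already allow 602 possible rules, which is why the support threshold is applied to prune itemsets before any rules are generated.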

Not All Rules Are Useful
* Problem: Some rules are statistically strong but not practically useful.

* Example: If "milk → bread" has 95% confidence but almost every basket contains bread anyway, the rule's lift is close to 1 and it tells the retailer nothing new.

* Solution: Always consider business relevance and use metrics like lift to evaluate true association.
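
To put numbers on the example, assume 1,000 baskets in which milk and bread each appear in 950 and both appear in 905 (hypothetical counts chosen for illustration):

```python
n_baskets = 1000
n_milk, n_bread, n_both = 950, 950, 905  # hypothetical counts

confidence = n_both / n_milk                # P(bread | milk)
lift = confidence / (n_bread / n_baskets)   # confidence relative to bread's base rate

print(f"confidence={confidence:.3f}, lift={lift:.3f}")
```

Confidence is about 95%, yet lift is almost exactly 1: knowing that milk is in the basket barely changes the odds of finding bread.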

Rare Itemsets Are Ignored
* Problem: Rare but important combinations (e.g., luxury items) may be missed due to low support.

* Impact: Can miss out on niche market insights.

* Solution: Consider lowering support carefully (a faster algorithm such as FP-Growth makes low thresholds more tractable) or using techniques designed for rare itemsets, such as item-specific minimum supports.

No Temporal or Sequential Information
* Problem: Apriori considers only co-occurrence within a transaction, not order or time.

* Example: It cannot tell "phone bought first, case bought later" apart from the reverse order.

* Solution: Use sequential pattern mining (like PrefixSpan) for such cases.
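
What sequence-aware methods add can be seen with a direct count of ordered pairs. This is not PrefixSpan, only a hedged illustration of the information that co-occurrence discards; the customer histories are invented:

```python
from collections import Counter

# Each list is one customer's purchases in time order (made-up data)
histories = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["case", "phone"],
    ["phone", "charger", "case"],
]

ordered_pairs = Counter()
for history in histories:
    seen = set()
    for i, first in enumerate(history):
        for later in history[i + 1:]:
            if first != later and (first, later) not in seen:
                seen.add((first, later))            # count each ordered pair once per customer
                ordered_pairs[(first, later)] += 1

print(ordered_pairs[("phone", "case")], ordered_pairs[("case", "phone")])  # 3 1
```

Co-occurrence alone would report phone and case together in all four histories; the ordered count shows that the phone came first in three of them.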

Computational Cost
* Problem: Apriori scans the dataset multiple times and is computationally expensive, especially on large datasets.

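* Solution: Use more efficient algorithms like FP-Growth (which avoids candidate generation by compressing the transactions into a prefix tree) or ECLAT (which intersects per-item lists of transaction IDs).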
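
As a hedged illustration of why a vertical layout helps, the sketch below applies the core ECLAT idea (intersecting per-item transaction-ID sets) to itemsets of size one and two only. A full ECLAT implementation recurses to larger itemsets; the function name and data are illustrative:

```python
from itertools import combinations

def frequent_pairs_vertical(transactions, min_support):
    """Frequent 1- and 2-itemsets found by intersecting transaction-ID sets."""
    n = len(transactions)
    # A single pass over the data builds one transaction-ID set per item
    tidsets = {}
    for tid, basket in enumerate(transactions):
        for item in set(basket):
            tidsets.setdefault(item, set()).add(tid)

    frequent = {
        frozenset([item]): len(tids) / n
        for item, tids in tidsets.items()
        if len(tids) / n >= min_support
    }
    # Pair support comes from set intersection, with no further scan of the data
    singles = sorted(next(iter(itemset)) for itemset in frequent)
    for a, b in combinations(singles, 2):
        pair_support = len(tidsets[a] & tidsets[b]) / n
        if pair_support >= min_support:
            frequent[frozenset([a, b])] = pair_support
    return frequent

baskets = [["milk", "bread"], ["milk", "eggs"], ["milk", "bread", "eggs"], ["bread"]]
result = frequent_pairs_vertical(baskets, min_support=0.5)
for itemset, value in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```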