# Name: Snehal Shyam Jagtap

## ASSOCIATION RULES

#### The objective of this assignment is to introduce students to rule mining techniques, with a focus on market basket analysis, and to provide hands-on experience.

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules


In [2]:
# Load the dataset from the Excel file
df = pd.read_excel('Online Retails.xlsx')


In [3]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [4]:
df.tail()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [6]:
df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [7]:
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
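
The null counts above show 1,454 rows without a `Description`, and the basket columns further down include noise such as "wrongly coded" entries and names with trailing spaces. A minimal cleaning sketch on a toy frame follows. The helper name `clean_transactions` is hypothetical, and the rule that invoice numbers starting with `C` are cancellations is an assumption about this dataset that the notebook does not confirm.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop rows that cannot be used for basket analysis."""
    out = df.dropna(subset=["Description"]).copy()
    # Trailing spaces would otherwise create duplicate product columns.
    out["Description"] = out["Description"].astype(str).str.strip()
    out["InvoiceNo"] = out["InvoiceNo"].astype(str)
    # ASSUMPTION: invoice numbers starting with "C" mark cancellations.
    out = out[~out["InvoiceNo"].str.startswith("C")]
    # Keep only rows that represent an actual purchase.
    out = out[out["Quantity"] > 0]
    return out


toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "C2", "3", "3"],
    "Description": ["MUG ", None, "MUG", "PLATE", "MUG"],
    "Quantity": [2, 1, -2, 0, 1],
})
cleaned = clean_transactions(toy)
print(cleaned)
```

Whether each filter is appropriate depends on how returns and adjustments are recorded in the source file, so the conditions should be checked against the data before use.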

## Task 2: Prepare the data for the Apriori algorithm

In [11]:
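# Build the basket matrix: one row per invoice, one column per product description,
# with the summed quantity as the value (0 where the product was not on the invoice)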
basket = df.groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')

In [12]:
# Convert quantities to 0 and 1 (0 if not bought, 1 if bought)
def encode_units(x):
    return 1 if x > 0 else 0

basket = basket.applymap(encode_units)

In [13]:
# Display the transformed basket dataset
basket.head()

Description,20713,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536366,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536367,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
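
The 0/1 encoding above can also be expressed as a vectorized comparison that yields a boolean matrix, which avoids calling a Python function once per cell. The sketch below uses a toy frame with the same three columns; it is an illustrative alternative, not the notebook's own method. As far as I know, mlxtend's frequent-pattern functions accept boolean DataFrames and warn about non-boolean input, but that should be checked against the installed version.

```python
import pandas as pd

# Toy transactions with the same column names as the retail data.
toy = pd.DataFrame({
    "InvoiceNo": ["A", "A", "B", "B", "C"],
    "Description": ["MUG", "PLATE", "MUG", "MUG", "PLATE"],
    "Quantity": [1, 2, 3, 1, -1],
})

# Sum quantities per (invoice, product) and pivot products into columns.
qty = (
    toy.groupby(["InvoiceNo", "Description"])["Quantity"]
    .sum()
    .unstack(fill_value=0)
)

# True where the invoice contains the product with a positive net quantity.
basket_bool = qty > 0
print(basket_bool)
```

Invoice `C` has a negative net quantity for `PLATE`, so it is encoded as not purchased, which matches the `x > 0` rule in `encode_units`.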


## Task 3: Apply the Apriori algorithm

In [15]:
from mlxtend.frequent_patterns import apriori, association_rules

In [16]:
# Generate frequent itemsets using Apriori algorithm
frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True, low_memory=True)



In [17]:
# Repeat with a stricter support threshold (5% of invoices); this overwrites the result above
frequent_itemsets = apriori(basket, min_support=0.05, use_colnames=True, low_memory=True)



In [18]:
# Keep only products that appear in at least 5 invoices
basket_filtered = basket.loc[:, (basket.sum(axis=0) >= 5)]

# Run apriori on the filtered basket
frequent_itemsets = apriori(basket_filtered, min_support=0.01, use_colnames=True, low_memory=True)



In [19]:
from mlxtend.frequent_patterns import fpgrowth

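# Use FP-Growth instead of Apriori (note: this runs on the full, unfiltered basket);
# the itemsets from this cell are the ones used to generate the rules below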
frequent_itemsets = fpgrowth(basket, min_support=0.01, use_colnames=True)



In [20]:
# Generate association rules with support, confidence, and lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

In [21]:
# Display the generated rules
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(ASSORTED COLOUR BIRD ORNAMENT),(WHITE HANGING HEART T-LIGHT HOLDER),0.059519,0.092449,0.013213,0.221993,2.401258,0.00771,1.166508,0.620482
1,(WHITE HANGING HEART T-LIGHT HOLDER),(ASSORTED COLOUR BIRD ORNAMENT),0.092449,0.059519,0.013213,0.14292,2.401258,0.00771,1.097309,0.642996
2,(ASSORTED COLOUR BIRD ORNAMENT),(REGENCY CAKESTAND 3 TIER),0.059519,0.081363,0.011045,0.185567,2.28073,0.006202,1.127947,0.597081
3,(REGENCY CAKESTAND 3 TIER),(ASSORTED COLOUR BIRD ORNAMENT),0.081363,0.059519,0.011045,0.135747,2.28073,0.006202,1.088201,0.611279
4,(WHITE HANGING HEART T-LIGHT HOLDER),(HOME BUILDING BLOCK WORD),0.092449,0.031784,0.010431,0.112832,3.54992,0.007493,1.091355,0.791474


## Task 4: Analyze the rules

In [23]:
# Sort the rules by lift, strongest associations first
sorted_rules = rules.sort_values(by='lift', ascending=False)

In [24]:
# Display the top 10 rules
sorted_rules.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
1413,"(REGENCY TEA PLATE GREEN , REGENCY TEA PLATE R...",(REGENCY TEA PLATE PINK),0.013049,0.012476,0.010431,0.799373,64.070404,0.010268,4.922188,0.997408
1416,(REGENCY TEA PLATE PINK),"(REGENCY TEA PLATE GREEN , REGENCY TEA PLATE R...",0.012476,0.013049,0.010431,0.836066,64.070404,0.010268,6.0204,0.996829
1414,"(REGENCY TEA PLATE PINK, REGENCY TEA PLATE ROS...",(REGENCY TEA PLATE GREEN ),0.011004,0.015585,0.010431,0.947955,60.823405,0.01026,18.914824,0.994502
1415,(REGENCY TEA PLATE GREEN ),"(REGENCY TEA PLATE PINK, REGENCY TEA PLATE ROS...",0.015585,0.011004,0.010431,0.669291,60.823405,0.01026,2.990536,0.999131
1409,(REGENCY TEA PLATE PINK),(REGENCY TEA PLATE GREEN ),0.012476,0.015585,0.011372,0.911475,58.48275,0.011178,11.120239,0.995319
1408,(REGENCY TEA PLATE GREEN ),(REGENCY TEA PLATE PINK),0.015585,0.012476,0.011372,0.729659,58.48275,0.011178,3.652878,0.998462
1417,(REGENCY TEA PLATE ROSES ),"(REGENCY TEA PLATE GREEN , REGENCY TEA PLATE P...",0.018203,0.011372,0.010431,0.573034,50.389863,0.010224,2.315471,0.998328
1412,"(REGENCY TEA PLATE GREEN , REGENCY TEA PLATE P...",(REGENCY TEA PLATE ROSES ),0.011372,0.018203,0.010431,0.917266,50.389863,0.010224,11.866933,0.991429
1411,(REGENCY TEA PLATE ROSES ),(REGENCY TEA PLATE PINK),0.018203,0.012476,0.011004,0.604494,48.45072,0.010777,2.496863,0.997519
1410,(REGENCY TEA PLATE PINK),(REGENCY TEA PLATE ROSES ),0.012476,0.018203,0.011004,0.881967,48.45072,0.010777,8.317999,0.991734


## Task 5: Save the results

In [26]:
rules.to_csv('association_rules.csv', index=False)

In [27]:
# Reload the saved file to verify the export
data = pd.read_csv('association_rules.csv')

In [31]:
data.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,frozenset({'ASSORTED COLOUR BIRD ORNAMENT'}),frozenset({'WHITE HANGING HEART T-LIGHT HOLDER'}),0.059519,0.092449,0.013213,0.221993,2.401258,0.00771,1.166508,0.620482
1,frozenset({'WHITE HANGING HEART T-LIGHT HOLDER'}),frozenset({'ASSORTED COLOUR BIRD ORNAMENT'}),0.092449,0.059519,0.013213,0.14292,2.401258,0.00771,1.097309,0.642996
2,frozenset({'ASSORTED COLOUR BIRD ORNAMENT'}),frozenset({'REGENCY CAKESTAND 3 TIER'}),0.059519,0.081363,0.011045,0.185567,2.28073,0.006202,1.127947,0.597081
3,frozenset({'REGENCY CAKESTAND 3 TIER'}),frozenset({'ASSORTED COLOUR BIRD ORNAMENT'}),0.081363,0.059519,0.011045,0.135747,2.28073,0.006202,1.088201,0.611279
4,frozenset({'WHITE HANGING HEART T-LIGHT HOLDER'}),frozenset({'HOME BUILDING BLOCK WORD'}),0.092449,0.031784,0.010431,0.112832,3.54992,0.007493,1.091355,0.791474
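
After the round trip through CSV, the `antecedents` and `consequents` columns hold strings such as `frozenset({'ASSORTED COLOUR BIRD ORNAMENT'})` rather than real sets. A minimal sketch for turning such a string back into a `frozenset` follows. The helper name `parse_frozenset` is hypothetical, and it assumes every cell has exactly the `frozenset({...})` form shown in the output above.

```python
import ast


def parse_frozenset(text: str) -> frozenset:
    """Hypothetical helper: parse "frozenset({'A', 'B'})" into a frozenset."""
    prefix, suffix = "frozenset(", ")"
    if not (text.startswith(prefix) and text.endswith(suffix)):
        raise ValueError(f"unexpected format: {text!r}")
    inner = text[len(prefix):-len(suffix)]
    if not inner:
        # "frozenset()" is how an empty frozenset is printed.
        return frozenset()
    # literal_eval safely evaluates the set literal without running code.
    return frozenset(ast.literal_eval(inner))


example = "frozenset({'ASSORTED COLOUR BIRD ORNAMENT'})"
print(parse_frozenset(example))
```

It could then be applied column-wise, for example `data['antecedents'].apply(parse_frozenset)`.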


## Interview Questions:

1. What is lift and why is it important in association rules?

2. What are support and confidence, and how do you calculate them?

3. What are some limitations or challenges of association rule mining?


**1. What is lift and why is it important in Association Rules?**

**Definition:** Lift is a metric that measures the strength of an association rule by comparing the observed frequency of co-occurrence of items to the expected frequency if the items were independent. It is calculated as the ratio of the confidence of the rule to the support of the consequent.

**Importance:** Lift shows whether a rule reflects a real association or merely the popularity of the consequent. A lift greater than 1 indicates a positive association (the presence of one item increases the likelihood of the other), a lift of 1 indicates independence, and a lift below 1 indicates that the items tend not to be bought together. This makes lift useful for marketing strategies and product placement.
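
As a quick check of the definition, the sketch below recomputes lift for the first rule in the `rules.head()` output (ASSORTED COLOUR BIRD ORNAMENT → WHITE HANGING HEART T-LIGHT HOLDER) in two equivalent ways. The variable names are illustrative, and because the inputs are the rounded values from that table, the results agree with the reported lift only approximately.

```python
# Values copied from row 0 of the rules table (rounded to six decimals).
antecedent_support = 0.059519
consequent_support = 0.092449
rule_support = 0.013213
confidence = 0.221993

# Lift as confidence divided by the support of the consequent.
lift_from_confidence = confidence / consequent_support

# Equivalent form: observed co-occurrence over the co-occurrence
# expected if the two items were independent.
lift_from_supports = rule_support / (antecedent_support * consequent_support)

print(round(lift_from_confidence, 3), round(lift_from_supports, 3))
```

Both forms land close to the 2.40 reported in the table, meaning the two items appear together roughly 2.4 times as often as independence would predict.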

**2. What are support and confidence? How do you calculate them?**

**Support:** Support measures the frequency with which an itemset appears in the dataset. It is calculated as the number of transactions containing the itemset divided by the total number of transactions.

$$
\text{Support}(A) = \frac{\text{Number of transactions containing } A}{\text{Total number of transactions}}
$$

**Confidence:** Confidence measures the likelihood that item B is purchased when item A is purchased. It is calculated as the number of transactions containing both A and B divided by the number of transactions containing A.

$$
\text{Confidence}(A \rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}
$$
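
The two definitions can be sketched in plain Python on a handful of toy transactions. The function names `support` and `confidence` and the grocery items are illustrative only and are not taken from the retail dataset.

```python
# Four toy transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    combined = set(antecedent) | set(consequent)
    return support(combined, transactions) / support(antecedent, transactions)


print(support({"bread"}, transactions))               # 3 of 4 transactions
print(confidence({"bread"}, {"milk"}, transactions))  # 2 of the 3 bread transactions
```

Here bread appears in three of four transactions, and two of those three also contain milk, so the confidence of bread → milk is 2/3.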

**3. What are some limitations or challenges of association rule mining?**

**High Dimensionality:** The presence of many items can lead to a combinatorial explosion of potential itemsets, making it computationally expensive to find frequent itemsets and generate rules. This can result in long processing times and difficulty in managing the resulting rules.
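
To give a sense of scale for that combinatorial explosion, the sketch below counts how many non-empty itemsets can be formed from a catalogue of `n_items` products. The function name and the catalogue sizes are hypothetical and are not measured from the retail data.

```python
from math import comb


def candidate_itemsets(n_items, max_len=None):
    """Count non-empty itemsets of size 1..max_len drawn from n_items products."""
    max_len = n_items if max_len is None else max_len
    return sum(comb(n_items, k) for k in range(1, max_len + 1))


# All itemsets over 10 products: 2**10 - 1.
print(candidate_itemsets(10))

# Singles and pairs only, for a hypothetical catalogue of 4,000 products.
print(candidate_itemsets(4000, max_len=2))
```

Even when restricted to pairs, a catalogue of a few thousand products yields millions of candidates, which is why minimum-support pruning is needed to keep the search tractable.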

**Interpretation Challenges:** The rules generated are not always meaningful or actionable. Complex rules can be hard to interpret, and poorly chosen support and confidence thresholds tend to produce either too many trivial rules or too few useful ones, so the output may not translate into effective marketing or operational strategies.