# **Data Preprocessing**

Pre-process the dataset so that it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.

In [1]:
# Required Libraries
import pandas as pd

In [2]:
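# Load the dataset (one comma-separated basket per row).
# Note: read_csv uses the first row as the header by default, so the first
# basket becomes the column name; header=None would keep it as a data row.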
data = pd.read_csv('/content/Online retail.csv')

In [3]:
data


Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
...,...
7495,"butter,light mayo,fresh bread"
7496,"burgers,frozen vegetables,eggs,french fries,ma..."
7497,chicken
7498,"escalope,green tea"


In [4]:
data.describe()

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
count,7500
unique,5175
top,cookies
freq,223


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 1 columns):
 #   Column                                                                                                                                                                                                                           Non-Null Count  Dtype 
---  ------                                                                                                                                                                                                                           --------------  ----- 
 0   shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil  7500 non-null   object
dtypes: object(1)
memory usage: 58.7+ KB
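
The column name in the output above is itself a basket, which suggests the file has no header row. A minimal sketch of the difference, using a made-up three-line file held in memory (the contents and the column name `items` are assumptions for illustration, not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical file: one quoted basket per line, no header row.
raw = '"shrimp,almonds"\n"burgers,eggs"\n"chutney"\n'

# Default behaviour: the first line is consumed as the column name.
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data; names= supplies a column label.
with_no_header = pd.read_csv(io.StringIO(raw), header=None, names=["items"])

print(len(with_default), list(with_default.columns))
print(len(with_no_header), list(with_no_header.columns))
```

With the default settings one basket is lost to the header, so any support computed afterwards is based on one transaction fewer than the file contains.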


In [6]:
missing_values = data.isnull().sum()  # Count the missing values in each column
missing_values

Unnamed: 0,0
"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil",0


In [7]:
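# Remove missing values and duplicate rows.
# Note: data_cleaned is not used below; the transactions are built from `data`,
# so identical baskets are kept and each one still counts toward support.
data_cleaned = data.dropna().drop_duplicates()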

In [8]:
# Split each row on commas so that every transaction becomes a list of items
# (the result is a pandas Series of lists)
transaction = data.iloc[:, 0].apply(lambda x: x.split(','))

transaction

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"[burgers, meatballs, eggs]"
1,[chutney]
2,"[turkey, avocado]"
3,"[mineral water, milk, energy bar, whole wheat ..."
4,[low fat yogurt]
...,...
7495,"[butter, light mayo, fresh bread]"
7496,"[burgers, frozen vegetables, eggs, french frie..."
7497,[chicken]
7498,"[escalope, green tea]"


# **Association Rule Mining :**

• Implement the Apriori algorithm in Python using libraries such as pandas and mlxtend.

• Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.

• Set appropriate thresholds for support, confidence, and lift to extract meaningful rules.

In [9]:
!pip install mlxtend



In [10]:
# Import required libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

In [11]:
Transac = TransactionEncoder()   # Convert the transactions to a one-hot encoded dataframe
Transac_ary = Transac.fit(transaction).transform(transaction)
data_encoded = pd.DataFrame(Transac_ary, columns = Transac.columns_)
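
To show what the one-hot encoding step produces, here is a plain-Python sketch of the same idea on three made-up baskets. It is an illustration only, not mlxtend's implementation, and the basket contents are assumptions:

```python
# Toy baskets (made up for illustration, not taken from the dataset).
baskets = [
    ["burgers", "eggs"],
    ["eggs", "milk"],
    ["milk"],
]

# One column per distinct item, sorted alphabetically.
columns = sorted({item for basket in baskets for item in basket})

# One row per basket, one boolean per column: True if the item is in the basket.
encoded = [[item in basket for item in columns] for basket in baskets]

print(columns)
for row in encoded:
    print(row)
```

Each row answers "is this item in the basket?" for every item, which is the tabular form that frequent-itemset algorithms work on.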



In [12]:
frequent_itemsets = apriori(data_encoded, min_support = 0.01, use_colnames = True)  # Apply the Apriori algorithm with a minimum support threshold.

frequent_itemsets  # Display the frequent itemsets



Unnamed: 0,support,itemsets
0,0.020267,(almonds)
1,0.033200,(avocado)
2,0.010800,(barbecue sauce)
3,0.014267,(black tea)
4,0.011467,(body spray)
...,...,...
254,0.011067,"(mineral water, milk, ground beef)"
255,0.017067,"(spaghetti, mineral water, ground beef)"
256,0.015733,"(spaghetti, mineral water, milk)"
257,0.010267,"(spaghetti, mineral water, olive oil)"
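
The level-wise search behind Apriori can be sketched in plain Python. This is a simplified illustration on made-up baskets, not the mlxtend implementation; the function name `frequent_itemsets` and the toy data are assumptions:

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(baskets)
    basket_sets = [frozenset(b) for b in baskets]

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(itemset <= b for b in basket_sets) / n

    # Level 1: frequent single items.
    items = {item for b in basket_sets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result

baskets = [
    ["milk", "eggs", "bread"],
    ["milk", "eggs"],
    ["milk", "bread"],
    ["eggs"],
]
found = frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what keeps the search tractable: an itemset is only counted if all of its smaller subsets were already frequent.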


In [13]:
rules = association_rules(frequent_itemsets, metric = 'lift', min_threshold = 0.7)  # Generate candidate rules, keeping those with lift >= 0.7.

rules = rules[(rules['confidence'] >= 0.5) & (rules['lift'] >= 1 )] # Filter rules based on confidence and lift thresholds

rules_sorted = rules.sort_values(by = 'lift', ascending = False)  # Sort the rules by lift.

# Display top 10 association rules
rules_sorted.head(10)



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
372,"(eggs, ground beef)",(mineral water),0.02,0.238267,0.010133,0.506667,2.126469,0.005368,1.544054,0.540548
408,"(ground beef, milk)",(mineral water),0.022,0.238267,0.011067,0.50303,2.111207,0.005825,1.532756,0.538177
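
As a sanity check, several columns of the first rule in the table above can be recomputed from its three support values. The leverage and conviction formulas used here are the standard definitions; small differences from the table come from the inputs being rounded to six decimals:

```python
# Values copied from the first rule in the table above:
# (eggs, ground beef) -> (mineral water)
antecedent_support = 0.02
consequent_support = 0.238267
support = 0.010133

# Confidence: share of antecedent baskets that also contain the consequent.
confidence = support / antecedent_support

# Leverage: observed joint support minus the support expected under independence.
leverage = support - antecedent_support * consequent_support

# Conviction: how much more often the rule would be wrong if X and Y were independent.
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 3), round(leverage, 4), round(conviction, 3))
```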


# **Analysis and Interpretation:**

**1 - Analyse the generated rules to identify interesting patterns and relationships between the products.**

•	**Analyse the generated rules**

Only two rules pass the filters (confidence ≥ 0.5 and lift ≥ 1): (eggs, ground beef) → (mineral water) and (ground beef, milk) → (mineral water). Both have a lift of about 2.1, meaning baskets containing the antecedent items are roughly twice as likely to include mineral water as the average basket, and a confidence of about 0.5, meaning roughly half of those baskets include it. Support is low, at about 1% of transactions each, so the rules are fairly strong but apply to a small share of purchases.

•	**Identify interesting patterns and relationships between the products**

Both rules share mineral water as the consequent and ground beef in the antecedent. Mineral water appears in about 23.8% of all transactions (consequent support 0.238), so it is a common purchase to begin with, yet buying ground beef together with eggs or milk raises the chance of buying it to about 50%. This points to cross-selling or product placement opportunities, such as promoting mineral water alongside ground beef, rather than to broad patterns across the whole catalogue.


**2 -	Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.**

The discovered rules suggest that customers who buy ground beef together with a staple such as eggs or milk are also likely to buy mineral water, which may reflect larger grocery trips rather than a specific pairing of the products. Because each rule covers only about 1% of transactions, the insight is best used for targeted actions, such as recommendations or promotions shown to customers who already have these items in their basket. Lowering the confidence threshold would surface more rules, at the cost of weaker ones.

---



# **Interview Questions:**

**1.	What is lift and why is it important in association rules?**

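Lift measures how much more likely two items are to be purchased together than would be expected if they were purchased independently. A lift above 1 means the items occur together more often than independence would predict, a lift of 1 means they are independent, and a lift below 1 means they occur together less often. It helps to judge the strength of a rule beyond support and confidence alone, because a rule can have high confidence simply because its consequent is very common.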

Formula:

$$\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)}$$
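
The lift formula can be checked against the two rules found earlier. The confidence and consequent-support values below are copied from the rules table; the helper name `lift` is just for illustration:

```python
def lift(confidence, consequent_support):
    """Lift(X -> Y) = Confidence(X -> Y) / Support(Y)."""
    return confidence / consequent_support

# (confidence, consequent support) as reported for the two rules found earlier.
reported = {
    "(eggs, ground beef) -> (mineral water)": (0.506667, 0.238267),
    "(ground beef, milk) -> (mineral water)": (0.50303, 0.238267),
}

lifts = {name: lift(conf, cons_sup) for name, (conf, cons_sup) in reported.items()}
for name, value in lifts.items():
    print(name, round(value, 3))
```

Both values come out a little above 2, matching the `lift` column of the table up to rounding.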
​


**2.	What are support and confidence? How do you calculate them?**

Support: The proportion of transactions that contain a particular itemset.

Formula:

$$\text{Support}(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}$$


Confidence: The likelihood that item $Y$ is purchased given that item $X$ is purchased.

Formula:

$$\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$
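
Both definitions can be worked through by hand on a small example. The four baskets below are made up for illustration, and the helper names `support` and `confidence` are assumptions:

```python
# Toy baskets (made up for illustration).
baskets = [
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
    {"milk", "bread"},
    {"eggs"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Support(X ∪ Y) / Support(X)."""
    return support(antecedent | consequent) / support(antecedent)

milk_support = support({"milk"})                  # milk is in 3 of 4 baskets
milk_to_eggs = confidence({"milk"}, {"eggs"})     # 2 of the 3 milk baskets have eggs

print(milk_support)
print(round(milk_to_eggs, 3))
```

Here milk has a support of 0.75, and the rule milk → eggs has a confidence of about 0.667 because two of the three baskets containing milk also contain eggs.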
​


**3.	What are some limitations or challenges of Association rules mining?**

Data Sparsity: In large datasets with many items, most itemsets will have very low support, making it hard to find significant rules.

Computational Complexity: Mining rules with large datasets can be computationally expensive.

Rule Overfitting: Low thresholds can produce a very large number of rules, many of them redundant or arising by chance, and such rules may not generalize to new data.

Lack of Context: Association rules may not provide actionable insights without context or further analysis.
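
To put a number on the computational complexity point: n distinct items admit 2^n − 1 non-empty itemsets, which is why algorithms such as Apriori prune by minimum support instead of enumerating every combination. A small sketch (the helper name `possible_itemsets` is hypothetical):

```python
def possible_itemsets(n_items):
    """Number of non-empty itemsets that can be formed from n_items distinct items."""
    return 2 ** n_items - 1

# The count doubles (plus one) with every additional item.
for n in (10, 20, 50):
    print(n, possible_itemsets(n))
```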