### 3. Association Mining

Market basket analysis explores combinations of products often bought together, guiding strategies for cross-selling or bundling products. Originating from observing supermarket shopping patterns, it plays a crucial role in understanding customer buying habits.

Association rule mining identifies items that occur together in transactions more often than chance would suggest. It is the core technique behind market basket analysis, and major retailers such as Amazon and Flipkart use it to plan product placement, promotions, and personalized marketing based on items commonly purchased together.

Essential Metrics:
- **Support**: Indicates the frequency of an item across all transactions, calculated as the ratio of transactions containing the item to the total number of transactions.
  
- **Confidence**: Measures the probability of buying item B given item A has been bought, derived from the ratio of transactions with both items A and B to those with item A.

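- **Lift**: Compares how often items A and B are bought together with how often that would happen if they were independent. It is calculated as the confidence of the rule A → B divided by the support of B, and it indicates the strength of association between the two items.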

A lift value above 1 signals a positive correlation (items likely bought together), equal to 1 indicates no correlation, and below 1 suggests a negative correlation (items unlikely bought together).
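As a rough illustration of the three metrics, the sketch below computes them for a made-up set of five baskets. The baskets and the helper names `support`, `confidence`, and `lift` are assumptions for this example only; they are not part of the notebook or of any library.

```python
# Hypothetical toy baskets; not taken from the dataset used below.
transactions = [
    {"lemon", "lime", "banana"},
    {"lemon", "lime"},
    {"banana", "apple"},
    {"lemon", "banana"},
    {"apple"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

print(support({"lemon"}, transactions))               # 3 of the 5 baskets
print(confidence({"lemon"}, {"lime"}, transactions))  # 2 of the 3 lemon baskets
print(lift({"lemon"}, {"lime"}, transactions))        # above 1: positive association
```

In this toy data, lime appears in 2 of 5 baskets overall but in 2 of the 3 baskets that contain lemon, so the lift of lemon → lime is above 1.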

**Apriori Algorithm**: This foundational algorithm for market basket analysis relies on the property that every subset of a frequent itemset must itself be frequent. Conversely, any itemset containing an infrequent subset can be discarded without counting it, which keeps the search for commonly purchased item combinations tractable.
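The pruning idea can be sketched in plain Python. This is a minimal, unoptimized illustration on made-up baskets, not the implementation used by `mlxtend`; the function name `apriori_sketch` and the toy data are assumptions for this example.

```python
from itertools import combinations

def apriori_sketch(baskets, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(baskets)

    def frequent(candidates):
        # Count each candidate and keep those that reach min_support.
        counts = {c: sum(c <= b for b in baskets) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = frequent({frozenset([i]) for i in items})
    result = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = frequent(candidates)
        result.update(current)
        k += 1
    return result

# Hypothetical toy baskets.
toy_baskets = [
    {"lemon", "lime", "banana"},
    {"lemon", "lime"},
    {"banana", "apple"},
    {"lemon", "banana"},
    {"apple"},
]
frequent_sets = apriori_sketch(toy_baskets, min_support=0.4)
for itemset, sup in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Here the triple lemon, lime, banana is never counted: its subset lime, banana is infrequent, so the prune step removes it.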

In [22]:
import sys
print(sys.version)

import time

3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]


In [2]:
import multiprocessing

num_processors = multiprocessing.cpu_count()
num_processors

16

In [3]:
import os

# Cap the number of OpenMP threads; adjust "15" to the number of cores you want to use
os.environ["OMP_NUM_THREADS"] = "15"

In [4]:
# For data manipulation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [6]:
df = pd.read_parquet('final_complete_dataset.parquet')

In [7]:
df.head()

Unnamed: 0,user_id,order_id,product_id,aisle_id,department_id,add_to_cart_order,reordered,product_name,aisle,department,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,112108,1,49302,120,16,1,1,Bulgarian Yogurt,yogurt,dairy eggs,train,4,4,10,9.0
1,112108,1,10246,83,4,3,0,Organic Celery Hearts,fresh vegetables,produce,train,4,4,10,9.0
2,112108,1,49683,83,4,4,0,Cucumber Kirby,fresh vegetables,produce,train,4,4,10,9.0
3,112108,1,43633,95,15,5,1,Lightly Smoked Sardines in Olive Oil,canned meat seafood,canned goods,train,4,4,10,9.0
4,112108,1,13176,24,4,6,0,Bag of Organic Bananas,fresh fruits,produce,train,4,4,10,9.0


In [8]:
df.product_id.nunique()

49685

- Data Preparation: Identify the top 200 products by volume.
- Transaction Encoding: Prepare the data for association rule mining by encoding it in a suitable format.
- Apply Apriori Algorithm: Use the Apriori algorithm to find frequent itemsets.
- Generate Association Rules: Use the frequent itemsets to generate association rules.
- Analyze Results: Draw insights from the generated rules.

In [35]:
# Identify the top 200 products by volume
top_200_products = df['product_id'].value_counts().head(200).index.tolist()

# Keep only the rows (line items) whose product is among the top 200
filtered_df = df[df['product_id'].isin(top_200_products)]


In [36]:
filtered_df.head()

Unnamed: 0,user_id,order_id,product_id,aisle_id,department_id,add_to_cart_order,reordered,product_name,aisle,department,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
1,112108,1,10246,83,4,3,0,Organic Celery Hearts,fresh vegetables,produce,train,4,4,10,9.0
2,112108,1,49683,83,4,4,0,Cucumber Kirby,fresh vegetables,produce,train,4,4,10,9.0
4,112108,1,13176,24,4,6,0,Bag of Organic Bananas,fresh fruits,produce,train,4,4,10,9.0
5,112108,1,47209,24,4,7,0,Organic Hass Avocado,fresh fruits,produce,train,4,4,10,9.0
6,112108,1,22035,21,16,8,1,Organic Whole String Cheese,packaged cheese,dairy eggs,train,4,4,10,9.0


In [37]:
filtered_df.shape

(10344052, 15)

For association mining with the Apriori algorithm, the transaction data must be one-hot encoded: each transaction (basket) becomes a row, and each column represents one of the 200 products selected above. Each cell indicates whether the product (column) was purchased in the transaction (row), typically with 1 for purchased and 0 for not purchased.
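To make the target layout concrete, here is a small sketch that encodes a made-up table of line items with `pd.crosstab`. The toy data and the name `toy_basket` are assumptions for illustration, and `crosstab` is one possible way to build the matrix rather than the approach taken in the cell below.

```python
import pandas as pd

# Hypothetical line items: one row per (order, product) pair.
toy = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3, 3],
    "product_name": ["Banana", "Limes", "Banana", "Limes", "Large Lemon", "Banana"],
})

# crosstab counts each (order, product) pair; `> 0` turns the counts into booleans.
toy_basket = pd.crosstab(toy["order_id"], toy["product_name"]) > 0
print(toy_basket)
```

The result has one row per order and one boolean column per product. As far as I know, recent `mlxtend` releases recommend boolean input for `apriori`, so it may be worth checking the documentation for the installed version before choosing between 0/1 integers and booleans.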

In [38]:
%%time
import pandas as pd

# Build the order x product matrix: one row per order, one column per product
basket = (filtered_df
          .groupby(['order_id', 'product_name'])['product_name']  # Group by order and product
          .count()  # Count line items per (order, product) pair
          .unstack(fill_value=0))  # Pivot products into columns; absent pairs become 0

# Convert counts to 0/1 flags (any positive count means the product was bought).
# A vectorized comparison replaces the element-wise applymap, which is deprecated in recent pandas.
basket = (basket > 0).astype(int)


CPU times: user 7min 1s, sys: 13.7 s, total: 7min 15s
Wall time: 7min 14s


In [39]:
basket.head()

order_id,100% Raw Coconut Water,100% Recycled Paper Towels,100% Whole Wheat Bread,2% Reduced Fat Milk,Apple Honeycrisp Organic,Asparagus,Baby Spinach,Bag of Organic Bananas,Banana,Bartlett Pears,...,Unsweetened Almondmilk,Unsweetened Original Almond Breeze Almond Milk,Unsweetened Vanilla Almond Milk,Vanilla Almond Breeze Almond Milk,Watermelon Chunks,Whipped Cream Cheese,Whole Milk,Yellow Bell Pepper,Yellow Onions,"YoKids Squeezers Organic Low-Fat Yogurt, Strawberry"
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
basket.shape

(2655924, 200)

In [30]:
#!pip install mlxtend

In [44]:
from mlxtend.frequent_patterns import apriori, association_rules

# Find frequent itemsets
frequent_items = apriori(basket, min_support=0.01, use_colnames=True, low_memory=True)

frequent_items.head()



Unnamed: 0,support,itemsets
0,0.014786,(100% Raw Coconut Water)
1,0.010937,(100% Recycled Paper Towels)
2,0.023763,(100% Whole Wheat Bread)
3,0.014545,(2% Reduced Fat Milk)
4,0.032859,(Apple Honeycrisp Organic)
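To get a feel for what `min_support=0.01` means at this scale, the short calculation below converts the threshold into an order count. The basket count is taken from `basket.shape` above; the variable names are just for illustration.

```python
import math

n_baskets = 2_655_924  # number of rows in `basket`, from basket.shape above
min_support = 0.01     # threshold passed to apriori above

# Smallest number of orders whose share of all baskets reaches min_support.
min_orders = math.ceil(n_baskets * min_support)
print(min_orders)

# Approximate order count behind a reported support value.
# The printed support is rounded to 6 decimals, so this is only an estimate.
approx_orders = round(0.014786 * n_baskets)  # "100% Raw Coconut Water"
print(approx_orders)
```

An itemset therefore needs to appear in roughly 26.6 thousand orders to be reported as frequent.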


In [45]:
# Generate association rules with lift of at least 1 and sort them by lift, strongest first

rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.sort_values('lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
30,(Large Lemon),(Limes),0.060541,0.05522,0.010918,0.180345,3.265935,0.007575,1.152655,0.73852
31,(Limes),(Large Lemon),0.05522,0.060541,0.010918,0.197723,3.265935,0.007575,1.17099,0.73436
43,(Organic Raspberries),(Organic Strawberries),0.053692,0.103759,0.013379,0.249174,2.401463,0.007808,1.193673,0.616699
42,(Organic Strawberries),(Organic Raspberries),0.103759,0.053692,0.013379,0.12894,2.401463,0.007808,1.086387,0.65115
39,(Organic Raspberries),(Organic Hass Avocado),0.053692,0.083164,0.010095,0.188018,2.260818,0.00563,1.129134,0.589325
38,(Organic Hass Avocado),(Organic Raspberries),0.083164,0.053692,0.010095,0.121389,2.260818,0.00563,1.077049,0.608268
23,(Organic Fuji Apple),(Banana),0.034974,0.184979,0.013236,0.378441,2.045855,0.006766,1.311252,0.529734
22,(Banana),(Organic Fuji Apple),0.184979,0.034974,0.013236,0.071552,2.045855,0.006766,1.039397,0.627232
5,(Organic Raspberries),(Bag of Organic Bananas),0.053692,0.148698,0.01592,0.296508,1.994034,0.007936,1.21011,0.526789
4,(Bag of Organic Bananas),(Organic Raspberries),0.148698,0.053692,0.01592,0.107065,1.994034,0.007936,1.059772,0.585578
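As a sanity check on how the columns relate, the sketch below recomputes confidence, lift, leverage, and conviction for the top rule from its three support values, using the standard definitions of these metrics. The inputs are copied from the table above and are rounded, so the results should match the reported columns only approximately.

```python
# Values copied from the top row of the table above: (Large Lemon) -> (Limes).
antecedent_support = 0.060541
consequent_support = 0.05522
joint_support = 0.010918

# Confidence: share of antecedent baskets that also contain the consequent.
confidence = joint_support / antecedent_support
# Lift: confidence relative to the consequent's baseline support.
lift = confidence / consequent_support
# Leverage: observed joint support minus the joint support expected under independence.
leverage = joint_support - antecedent_support * consequent_support
# Conviction: how much more often the rule would fail if the items were independent.
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 3), round(leverage, 6), round(conviction, 4))
```

Lift and leverage are symmetric in A and B, which is why each pair appears twice in the table with identical lift but different confidence.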


The rules with the highest lift in the output above suggest several retail strategies. Each is illustrated below with the relevant lift value.

### 1. **Cross-Promotion Strategies**
- **"Large Lemon" and "Limes"**: With a lift of 3.265935, this pair shows the strongest positive association in the output. Customers who buy "Large Lemon" are about 3.3 times as likely to buy "Limes" as customers overall (confidence 0.180 versus a baseline support of 0.055). Promoting "Limes" to customers who have "Large Lemon" in their basket could increase the likelihood of an additional purchase.

### 2. **Store Layout Adjustments**
- **"Organic Raspberries" and "Organic Strawberries"**: These items have a lift of 2.401463, meaning that customers who buy "Organic Raspberries" are about 2.4 times as likely to buy "Organic Strawberries" as customers overall. Placing these items near each other, both in physical stores and in online recommendation sections, can encourage additional purchases.

### 3. **Personalized Recommendations**
- **"Organic Raspberries" and "Organic Hass Avocado"**: With a lift of 2.260818, these products are good candidates to suggest together in personalized email campaigns or online recommendations, since a customer interested in one is more likely than average to consider the other.
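One way such recommendations could be derived from the rules is sketched below. The rule rows are copied from the table above; the helper `recommend`, its ranking by lift, and the `top_n` parameter are hypothetical choices for illustration, not part of the notebook.

```python
# A few (antecedent, consequent, confidence, lift) rows copied from the table above.
rules_excerpt = [
    ("Large Lemon", "Limes", 0.180345, 3.265935),
    ("Limes", "Large Lemon", 0.197723, 3.265935),
    ("Organic Raspberries", "Organic Strawberries", 0.249174, 2.401463),
    ("Organic Raspberries", "Organic Hass Avocado", 0.188018, 2.260818),
    ("Organic Fuji Apple", "Banana", 0.378441, 2.045855),
]

def recommend(cart, rules, top_n=2):
    """Hypothetical helper: suggest consequents of rules triggered by the cart."""
    cart = set(cart)
    candidates = {}
    for antecedent, consequent, confidence, lift in rules:
        if antecedent in cart and consequent not in cart:
            # Keep the highest-lift rule seen for each suggested product.
            best = candidates.get(consequent)
            if best is None or lift > best[0]:
                candidates[consequent] = (lift, confidence)
    # Rank suggestions by lift, then confidence, strongest first.
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [product for product, _ in ranked[:top_n]]

print(recommend(["Organic Raspberries", "Large Lemon"], rules_excerpt))
```

Ranking by lift favors surprising pairings, while ranking by confidence would favor products that are popular anyway (such as "Banana"), so the right choice depends on the goal of the campaign.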

### 4. **Inventory Management**
- **"Organic Fuji Apple" and "Banana"**: This combination shows a lift of 2.045855. Anticipating increased demand for "Organic Fuji Apple" when bananas are being purchased in higher volumes allows for better inventory planning and management to meet customer demand.

### 5. **New Product Introduction**
- When introducing a new product similar to "Organic Strawberries," the association between "Organic Strawberries" and "Organic Hass Avocado" (lift of 1.847146, from a rule not shown in the excerpt above) could guide where to position the new product in the store or how to include it in promotional bundles.

### 6. **Pricing Strategies**
- For the "Bag of Organic Bananas" and "Organic Raspberries" pair (lift of 1.994034), a promotional discount on "Organic Raspberries" could drive up sales for "Bag of Organic Bananas" as well, leveraging the strong association between them to boost overall basket size.

### 7. **Understanding Customer Segments**
- Analyzing the buying patterns of customers who purchase "Organic Baby Spinach" and "Organic Avocado" (lift of 1.849520, from a rule not shown in the excerpt above) can provide insights into health-conscious customer segments, guiding targeted marketing that caters to their preference for organic products.

### Conclusion
These insights, backed by quantified lift values, point to practical retail strategies. By understanding the strength of product associations, retailers can tailor their marketing, inventory, and store layout decisions to increase sales and improve customer satisfaction.