# Basket Analysis

### Overview

Market Basket Analysis (MBA) is a data mining technique for uncovering associations between items in large transactional datasets. It is best known from retail, where it is used to find frequent item sets: combinations of products that often appear together in the same shopping basket. The same approach applies in any domain where item associations matter, such as content recommendation, cross-selling, and inventory management.

### Key Concepts

- **Association Rules**: The core of MBA revolves around finding association rules, which are implication expressions of the form *X* ⇒ *Y*, where *X* and *Y* are disjoint item sets. The rule suggests that when items in *X* are purchased, items in *Y* are also likely to be purchased.
- **Support**: The proportion of transactions in the dataset that contain a given item set. It identifies the most common item combinations and is the basis for the minimum-support threshold used when searching for frequent item sets.
- **Confidence**: The proportion of transactions containing *X* that also contain *Y*, that is, support(*X* ∪ *Y*) / support(*X*). It indicates how reliable the inference made by a rule is.
- **Lift**: The ratio of a rule's confidence to the support of *Y*, or equivalently support(*X* ∪ *Y*) / (support(*X*) × support(*Y*)). A lift of 1 means *X* and *Y* co-occur about as often as they would if they were independent; values above 1 indicate a positive association, and values below 1 a negative one.
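
The three metrics can be computed directly from transaction counts. The sketch below uses five made-up baskets; the item names and the `support`, `confidence`, and `lift` helpers are illustrative assumptions, not part of any library.

```python
# Toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Estimate of P(Y | X): support(X and Y together) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of X => Y divided by the baseline support of Y."""
    return confidence(x, y, transactions) / support(y, transactions)

x, y = {"bread"}, {"milk"}
print(f"support    = {support(x | y, transactions):.2f}")
print(f"confidence = {confidence(x, y, transactions):.2f}")
print(f"lift       = {lift(x, y, transactions):.2f}")
```

In this toy data the rule bread ⇒ milk has support 0.6 and confidence 0.75, but a lift of roughly 0.94. Milk is already in 80% of baskets, so buying bread does not make milk more likely: a high confidence alone does not imply a useful rule.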

### Process Overview

1. **Data Collection and Preparation**: Collecting transaction data and organizing it in a format suitable for analysis, typically a transaction-item matrix with one row per transaction and one column per item.

2. **Frequency Analysis**: Identifying the individual items and item sets whose support meets a minimum threshold.

3. **Rule Generation**: Generating association rules from the frequent item sets. This involves calculating metrics like support, confidence, and lift to evaluate the strength and usefulness of the rules.

4. **Rule Pruning and Selection**: Filtering out rules that do not meet the minimum criteria for the metrics of interest, leaving only the most potentially valuable associations.
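
The four steps can be traced end to end on a tiny example. The following is a minimal brute-force sketch in plain Python, assuming invented transactions and hypothetical thresholds (`MIN_SUPPORT`, `MIN_LIFT`). It enumerates every possible item set instead of using an optimized algorithm such as Apriori, so it is only practical for a handful of items.

```python
from itertools import combinations

# Step 1: transaction data (invented for illustration), one set of items per order.
transactions = [
    {"apple", "banana"},
    {"apple", "banana", "milk"},
    {"banana", "milk"},
    {"apple", "banana"},
    {"milk"},
]
MIN_SUPPORT = 0.4  # hypothetical threshold
MIN_LIFT = 1.0     # hypothetical threshold
n = len(transactions)
items = sorted(set().union(*transactions))

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / n

# Step 2: frequent item sets (brute force over all sizes).
frequent = {
    frozenset(c): support(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(c) >= MIN_SUPPORT
}

# Step 3: generate rules X => Y from each frequent item set with 2+ items.
# Every subset of a frequent item set is itself frequent, so the lookups succeed.
rules = []
for itemset, supp in frequent.items():
    for k in range(1, len(itemset)):
        for x in combinations(sorted(itemset), k):
            x = frozenset(x)
            y = itemset - x
            conf = supp / frequent[x]
            rules.append((x, y, supp, conf, conf / frequent[y]))

# Step 4: keep only rules whose lift exceeds the threshold.
selected = [r for r in rules if r[4] > MIN_LIFT]
for x, y, supp, conf, lift in selected:
    print(sorted(x), "=>", sorted(y),
          f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

Here five item sets pass the support threshold and four rules are generated, but only the two rules linking apple and banana survive pruning. The banana/milk rules have a lift below 1 and are dropped.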

### Applications

- **Cross-Selling and Upselling**: Identifying products to promote together based on their likelihood of being purchased together.
- **Store Layout Optimization**: Organizing shelves and aisles in a way that maximizes cross-category purchases.
- **Targeted Marketing**: Developing promotions and offers for specific customer segments based on their buying patterns.
- **Inventory Management**: Understanding product affinities can help in optimizing stock levels and reducing inventory costs.
- **Product Bundling**: Creating product bundles that are likely to be purchased together, enhancing value for customers and increasing sales.

### Tools and Techniques

Market Basket Analysis can be performed using various statistical software and programming languages equipped with data mining capabilities, such as R (using the `arules` package) and Python (using libraries like `mlxtend`).

### Challenges

- **Data Volume and Quality**: MBA requires access to large transactional datasets, and the quality of insights is directly related to the quality of the underlying data.
- **Computational Complexity**: The number of candidate item sets grows exponentially with the number of distinct items (up to 2^n - 1 for n items), so finding frequent item sets and generating rules becomes increasingly expensive as the catalog grows.
- **Dynamic Patterns**: Consumer behavior and item associations can change over time, necessitating regular updates to the analysis to keep insights current.
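
The exponential growth behind the computational-complexity challenge can be made concrete with a short calculation. This sketch uses only the standard library; the item counts are arbitrary examples, and `candidate_itemsets` is a helper named here for illustration.

```python
from math import comb

def candidate_itemsets(n_items):
    """Number of non-empty item sets that can be formed from n_items distinct items."""
    return 2 ** n_items - 1

for n in (10, 20, 50):
    print(f"{n:>2} items -> {candidate_itemsets(n):,} candidate item sets, "
          f"{comb(n, 2):,} of them pairs")
```

Algorithms such as Apriori avoid enumerating all of these candidates by relying on the fact that every subset of a frequent item set must itself be frequent, so any candidate containing an infrequent subset can be skipped.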

## Import libraries and data

In [36]:
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

pd.set_option('display.max_rows', None)

root = './_data/'

orders = pd.read_csv(root + 'orders.csv')
order_products_prior = pd.read_csv(root + 'order_products__prior.csv')
order_products_train = pd.read_csv(root + 'order_products__train.csv')
products = pd.read_csv(root + 'products.csv')

In [5]:
order_products = pd.concat([order_products_prior, order_products_train])
print(order_products.shape)

(33819106, 4)


In [6]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


In [8]:
print(order_products.product_id.nunique())

49685


Restrict the analysis to the 50 most frequently ordered products.

In [9]:
# Count order lines per product and keep the 50 most frequent.
product_counts = order_products.groupby('product_id')['order_id'].count().reset_index().rename(columns={'order_id': 'frequency'})
product_counts = product_counts.sort_values('frequency', ascending=False).head(50).reset_index(drop=True)
# Attach product names plus aisle and department ids.
product_counts = product_counts.merge(products, on='product_id', how='left')
product_counts.head(10)

| | product_id | frequency | product_name | aisle_id | department_id |
|---|---|---|---|---|---|
| 0 | 24852 | 491291 | Banana | 24 | 4 |
| 1 | 13176 | 394930 | Bag of Organic Bananas | 24 | 4 |
| 2 | 21137 | 275577 | Organic Strawberries | 24 | 4 |
| 3 | 21903 | 251705 | Organic Baby Spinach | 123 | 4 |
| 4 | 47209 | 220877 | Organic Hass Avocado | 24 | 4 |
| 5 | 47766 | 184224 | Organic Avocado | 24 | 4 |
| 6 | 47626 | 160792 | Large Lemon | 24 | 4 |
| 7 | 16797 | 149445 | Strawberries | 24 | 4 |
| 8 | 26209 | 146660 | Limes | 24 | 4 |
| 9 | 27845 | 142813 | Organic Whole Milk | 84 | 16 |


In [15]:
freq_products = list(product_counts.product_id)
print(len(freq_products))

50


In [17]:
order_products = order_products[order_products.product_id.isin(freq_products)]
print(order_products.shape)

(5595295, 4)


In [19]:
print(order_products.order_id.nunique())

2163422


In [20]:
order_products = order_products.merge(products, on = 'product_id', how='left')
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 28985 | 2 | 1 | Michigan Organic Kale | 83 | 4 |
| 1 | 2 | 17794 | 6 | 1 | Carrots | 83 | 4 |
| 2 | 3 | 21903 | 4 | 1 | Organic Baby Spinach | 123 | 4 |
| 3 | 5 | 13176 | 1 | 1 | Bag of Organic Bananas | 24 | 4 |
| 4 | 5 | 27966 | 4 | 1 | Organic Raspberries | 123 | 4 |


In [21]:
# Pivot to one row per order and one column per product; each cell holds the line count (missing pairs become 0).
basket = order_products.groupby(['order_id', 'product_name'])['reordered'].count().unstack().fillna(0)
basket.head()

| order_id | 100% Whole Wheat Bread | Apple Honeycrisp Organic | Asparagus | Bag of Organic Bananas | Banana | Blueberries | Carrots | Cucumber Kirby | Fresh Cauliflower | Half & Half | ... | Organic Whole String Cheese | Organic Yellow Onion | Organic Zucchini | Original Hummus | Raspberries | Seedless Red Grapes | Sparkling Water Grapefruit | Spring Water | Strawberries | Yellow Onions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [22]:
del product_counts, products, order_products, order_products_prior, order_products_train

Encode the basket as binary indicators: 1 if the product appears in the order, 0 otherwise.

In [24]:
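def encode_units(x):
    """Collapse a per-order product count to a 0/1 indicator."""
    return 1 if x >= 1 else 0

# DataFrame.map applies the function element-wise (pandas 2.1+; earlier versions use applymap).
basket = basket.map(encode_units)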
basket.head()

| order_id | 100% Whole Wheat Bread | Apple Honeycrisp Organic | Asparagus | Bag of Organic Bananas | Banana | Blueberries | Carrots | Cucumber Kirby | Fresh Cauliflower | Half & Half | ... | Organic Whole String Cheese | Organic Yellow Onion | Organic Zucchini | Original Hummus | Raspberries | Seedless Red Grapes | Sparkling Water Grapefruit | Spring Water | Strawberries | Yellow Onions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |


In [25]:
print(basket.size)

108171100


In [26]:
print(basket.shape)

(2163422, 50)


In [30]:
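# apriori expects a boolean matrix; a vectorized comparison avoids mapping a lambda over ~108 million cells.
basket_bool = basket > 0
# min_support=0.01 keeps only item sets that appear in at least 1% of orders.
frequent_items = apriori(basket_bool, min_support=0.01, use_colnames=True, low_memory=True)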
frequent_items.head()

| | support | itemsets |
|---|---|---|
| 0 | 0.029173 | (100% Whole Wheat Bread) |
| 1 | 0.04034 | (Apple Honeycrisp Organic) |
| 2 | 0.032888 | (Asparagus) |
| 3 | 0.182549 | (Bag of Organic Bananas) |
| 4 | 0.22709 | (Banana) |


In [31]:
print(frequent_items.shape)

(96, 2)
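
The support values reported by `apriori` can be cross-checked against figures printed earlier in this notebook. A single product's support should equal its order count divided by the number of orders in `basket`. The sketch below hard-codes those printed figures rather than reloading the data. It assumes each product appears at most once per order, so that the row counts in `product_counts` equal order counts.

```python
# Figures copied from the notebook outputs above.
n_orders = 2_163_422  # number of rows in `basket`
order_counts = {
    "Banana": 491_291,
    "Bag of Organic Bananas": 394_930,
}
reported_support = {
    "Banana": 0.22709,
    "Bag of Organic Bananas": 0.182549,
}

for name, count in order_counts.items():
    computed = count / n_orders
    print(f"{name}: computed={computed:.6f} reported={reported_support[name]}")
```

Both computed values match the reported supports to the six decimals shown.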


In [35]:
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.sort_values('lift', ascending=False)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 66 | (Organic Yellow Onion) | (Organic Garlic) | 0.054412 | 0.052665 | 0.010596 | 0.194731 | 3.697569 | 0.00773 | 1.176421 | 0.771533 |
| 67 | (Organic Garlic) | (Organic Yellow Onion) | 0.052665 | 0.054412 | 0.010596 | 0.201192 | 3.697569 | 0.00773 | 1.183749 | 0.77011 |
| 44 | (Limes) | (Large Lemon) | 0.067791 | 0.074323 | 0.013404 | 0.197723 | 2.660316 | 0.008365 | 1.153812 | 0.66949 |
| 45 | (Large Lemon) | (Limes) | 0.074323 | 0.067791 | 0.013404 | 0.180345 | 2.660316 | 0.008365 | 1.137319 | 0.674214 |
| 68 | (Organic Lemon) | (Organic Hass Avocado) | 0.042179 | 0.102096 | 0.010182 | 0.241389 | 2.364332 | 0.005875 | 1.183616 | 0.602459 |
| 69 | (Organic Hass Avocado) | (Organic Lemon) | 0.102096 | 0.042179 | 0.010182 | 0.099725 | 2.364332 | 0.005875 | 1.063921 | 0.642661 |
| 74 | (Organic Raspberries) | (Organic Strawberries) | 0.065915 | 0.12738 | 0.016424 | 0.249174 | 1.956147 | 0.008028 | 1.162214 | 0.523283 |
| 75 | (Organic Strawberries) | (Organic Raspberries) | 0.12738 | 0.065915 | 0.016424 | 0.12894 | 1.956147 | 0.008028 | 1.072354 | 0.560142 |
| 47 | (Organic Avocado) | (Large Lemon) | 0.085154 | 0.074323 | 0.01191 | 0.139862 | 1.881818 | 0.005581 | 1.076196 | 0.512216 |
| 46 | (Large Lemon) | (Organic Avocado) | 0.074323 | 0.085154 | 0.01191 | 0.160244 | 1.881818 | 0.005581 | 1.089419 | 0.506223 |
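
Most of the derived columns in the rules output follow from three inputs: antecedent support, consequent support, and joint support. As a sanity check, the sketch below recomputes confidence, lift, leverage, and conviction for the top rule, (Organic Yellow Onion) ⇒ (Organic Garlic), from the values printed above. Because the printed inputs are rounded to six decimals, the recomputed figures agree with the reported ones only approximately.

```python
# Inputs copied from the first row of the rules output (rounded as printed).
antecedent_support = 0.054412  # Organic Yellow Onion
consequent_support = 0.052665  # Organic Garlic
joint_support = 0.010596       # both products in the same order

confidence = joint_support / antecedent_support
lift = confidence / consequent_support
leverage = joint_support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.3f} "
      f"leverage={leverage:.5f} conviction={conviction:.4f}")
```

Support, lift, and leverage are symmetric in the antecedent and consequent, which is why each rule and its reverse share those values in the output. Confidence and conviction depend on the direction of the rule and therefore differ.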
