# Market Basket Analysis

Market basket analysis scrutinizes the products customers tend to buy together, and uses the information to decide which products should be cross-sold or promoted together. The term arises from the shopping carts supermarket shoppers fill up during a shopping trip.

Association rule mining is used to find associations between objects in a set and to find frequent patterns in a transaction database, a relational database, or any other information repository.

The most common application of association rule mining is market basket analysis, a key technique used by large retailers such as Amazon and Flipkart to analyze customer buying habits by finding associations between the items that customers place in their “shopping baskets”. Discovering these associations helps retailers develop marketing strategies by showing which items customers frequently purchase together. The strategies may include:

- Changing the store layout according to trends
- Customer behavior analysis
- Catalog design
- Cross marketing on online stores
- Customized emails with add-on sales, etc.

### Metrics

- **Support** : It's the default popularity of an item. In mathematical terms, the support of item A is the ratio of the number of transactions involving A to the total number of transactions.


- **Confidence** : The likelihood that a customer who bought A also bought B. It is the ratio of the number of transactions involving both A and B to the number of transactions involving A.
     - Confidence(A => B) = Support(A, B)/Support(A)


- **Lift** : The increase in the sale of B when A is sold, relative to how often B sells on its own.
    
    - Lift(A => B) = Confidence(A => B)/Support(B)
        
    - Lift(A => B) = 1 means that there is no correlation within the itemset.
    - Lift(A => B) > 1 means that there is a positive correlation within the itemset, i.e., the products in the itemset, A and B, are more likely to be bought together.
    - Lift(A => B) < 1 means that there is a negative correlation within the itemset, i.e., the products in the itemset, A and B, are unlikely to be bought together.
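
To make the three metrics concrete, here is a minimal sketch that computes them on a handful of toy transactions. The data and the helper names (`support`, `confidence`, `lift`) are made up for illustration and are not part of the notebook's pipeline.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support(A, B) / Support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence(A => B) / Support(B)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print("Support(bread)          =", support({"bread"}, transactions))
print("Confidence(bread=>milk) =", confidence({"bread"}, {"milk"}, transactions))
print("Lift(bread=>milk)       =", lift({"bread"}, {"milk"}, transactions))
```

In this toy data bread and milk each appear in 4 of 5 transactions and together in 3, so the lift comes out slightly below 1: buying bread makes milk marginally less likely than its baseline.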

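**Apriori Algorithm:** The Apriori algorithm is the algorithm behind Market Basket Analysis. It relies on the property that every subset of a frequent itemset must also be frequent, where an itemset is frequent when its support meets a chosen minimum threshold. Say, a transaction containing {Grapes, Apple, Mango} also contains {Grapes, Mango}. So, according to the principle of Apriori, if {Grapes, Apple, Mango} is frequent, then {Grapes, Mango} must also be frequent. Conversely, if an itemset is infrequent, none of its supersets can be frequent, which lets the algorithm skip them.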
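
The subset property can be sketched as a level-wise search: build candidate itemsets of size k only from frequent itemsets of size k-1, and discard any candidate with an infrequent subset before counting it. The function below is a simplified pure-Python illustration under that assumption, not the mlxtend implementation used later; `apriori_sketch` and the toy baskets are hypothetical.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {s: support(s) for s in current}

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only candidates that meet the support threshold
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

# Toy baskets (hypothetical data)
baskets = [
    {"Grapes", "Apple", "Mango"},
    {"Grapes", "Mango"},
    {"Grapes", "Apple", "Mango"},
    {"Apple"},
]
result = apriori_sketch(baskets, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With these baskets {Grapes, Apple, Mango} is frequent, and so is every one of its subsets, including {Grapes, Mango}.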

In [1]:
!pip install sqlalchemy
!pip install mlxtend

import sqlalchemy
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules



### Data

In [None]:
# limit the data to 1M rows to avoid out-of-memory errors
sparkConn = sqlalchemy.create_engine('hive://spark-thrift:10000/default')
order_products = pd.read_sql_query("select * from sample.order_products limit 1000000", con=sparkConn)
order_products.shape



(1000000, 16)

In [None]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order | days_since_prior_order_cum | order_date | product_name | aisle_id | aisle | department_id | department |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 347 | 1158 | 14 | True | 17155 | 94 | 1 | 16 | 7.0 | 268.0 | 2022-11-11 16:00:00 | Mango Chunks | 116.0 | frozen produce | 1.0 | frozen |
| 1 | 347 | 17304 | 12 | True | 17155 | 94 | 1 | 16 | 7.0 | 268.0 | 2022-11-11 16:00:00 | Powdered Peanut Butter | 88.0 | spreads | 13.0 | pantry |
| 2 | 347 | 17948 | 13 | True | 17155 | 94 | 1 | 16 | 7.0 | 268.0 | 2022-11-11 16:00:00 | Frozen Organic Wild Blueberries | 116.0 | frozen produce | 1.0 | frozen |
| 3 | 347 | 18689 | 7 | True | 17155 | 94 | 1 | 16 | 7.0 | 268.0 | 2022-11-11 16:00:00 | Dairy-Free Chive Cream Cheese | 108.0 | other creams cheeses | 16.0 | dairy eggs |
| 4 | 347 | 21903 | 5 | True | 17155 | 94 | 1 | 16 | 7.0 | 268.0 | 2022-11-11 16:00:00 | Organic Baby Spinach | 123.0 | packaged vegetables fruits | 4.0 | produce |


In [None]:
order_products.product_id.nunique()

35306

In [6]:
# limit the data to 1M rows to avoid out-of-memory errors
products = pd.read_sql_query("select * from source.products limit 1000000", con=sparkConn)

In [7]:
products.shape

(49689, 4)

Out of the 49,689 products in the catalog, we keep only the 100 most frequently ordered.

In [8]:
product_counts = order_products.groupby('product_id')['order_id'].count().reset_index().rename(columns = {'order_id':'frequency'})
product_counts = product_counts.sort_values('frequency', ascending=False)[0:100].reset_index(drop = True)
product_counts = product_counts.merge(products, on = 'product_id', how = 'left')
product_counts.head(10)

| | product_id | frequency | product_name | aisle_id | department_id |
|---|---|---|---|---|---|
| 0 | 24852 | 14424 | Banana | 24.0 | 4.0 |
| 1 | 13176 | 11616 | Bag of Organic Bananas | 24.0 | 4.0 |
| 2 | 21137 | 8121 | Organic Strawberries | 24.0 | 4.0 |
| 3 | 21903 | 7352 | Organic Baby Spinach | 123.0 | 4.0 |
| 4 | 47209 | 6520 | Organic Hass Avocado | 24.0 | 4.0 |
| 5 | 47766 | 5383 | Organic Avocado | 24.0 | 4.0 |
| 6 | 47626 | 4693 | Large Lemon | 24.0 | 4.0 |
| 7 | 16797 | 4321 | Strawberries | 24.0 | 4.0 |
| 8 | 27845 | 4293 | Organic Whole Milk | 84.0 | 16.0 |
| 9 | 26209 | 4220 | Limes | 24.0 | 4.0 |


Keep only the 100 most frequent products in the order_products dataframe.

In [9]:
freq_products = list(product_counts.product_id)
freq_products[1:10]

[13176, 21137, 21903, 47209, 47766, 47626, 16797, 27845, 26209]

In [10]:
len(freq_products)

100

In [11]:
order_products = order_products[order_products.product_id.isin(freq_products)]
order_products.shape

(228991, 16)

In [12]:
order_products.order_id.nunique()

71818

In [13]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order | days_since_prior_order_cum | order_date | product_name_x | aisle_id_x | aisle | department_id_x | department | product_name_y | aisle_id_y | department_id_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 347 | 21903 | 5 | True | 17155 | 94 | 1 | 16 | 7.0 | 268.0 | 2022-11-11 16:00:00 | Organic Baby Spinach | 123.0 | packaged vegetables fruits | 4.0 | produce | Organic Baby Spinach | 123.0 | 4.0 |
| 1 | 347 | 27966 | 4 | True | 17155 | 94 | 1 | 16 | 7.0 | 268.0 | 2022-11-11 16:00:00 | Organic Raspberries | 123.0 | packaged vegetables fruits | 4.0 | produce | Organic Raspberries | 123.0 | 4.0 |
| 2 | 347 | 44359 | 6 | True | 17155 | 94 | 1 | 16 | 7.0 | 268.0 | 2022-11-11 16:00:00 | Organic Small Bunch Celery | 83.0 | fresh vegetables | 4.0 | produce | Organic Small Bunch Celery | 83.0 | 4.0 |
| 3 | 347 | 47209 | 8 | True | 17155 | 94 | 1 | 16 | 7.0 | 268.0 | 2022-11-11 16:00:00 | Organic Hass Avocado | 24.0 | fresh fruits | 4.0 | produce | Organic Hass Avocado | 24.0 | 4.0 |
| 4 | 447 | 22935 | 3 | False | 173924 | 36 | 4 | 14 | 7.0 | 154.0 | 2022-06-13 14:00:00 | Organic Yellow Onion | 83.0 | fresh vegetables | 4.0 | produce | Organic Yellow Onion | 83.0 | 4.0 |


Structuring the data to feed into the algorithm: one row per order, one column per product.

In [14]:
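# product_name shows up as product_name_x when order_products was merged with products earlier
name_col = 'product_name' if 'product_name' in order_products.columns else 'product_name_x'
# count rows per (order, product), then pivot the products into columns
basket = order_products.groupby(['order_id', name_col])['reordered'].count().unstack().reset_index().fillna(0).set_index('order_id')
basket.head()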
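
The target layout for this step is a table with one row per order and one column per product, where each cell says whether the product is in the order. A small sketch of that reshaping on made-up order lines follows. It uses `pd.crosstab` as an alternative to the groupby/unstack chain, and the `toy` and `toy_basket` names are hypothetical.

```python
import pandas as pd

# Hypothetical order lines, one row per (order, product)
toy = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3, 3],
    "product_name": ["Banana", "Limes", "Banana", "Banana", "Limes", "Strawberries"],
})

# Rows = orders, columns = products; True when the product appears in the order
toy_basket = pd.crosstab(toy["order_id"], toy["product_name"]) > 0
print(toy_basket)
```

A boolean table of this shape is the kind of input `apriori` from mlxtend expects, so a conversion like this could replace the count-then-encode steps if memory allows.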

In [None]:
# free memory: only basket is needed from here on
del product_counts, products, order_products

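Encoding the units: convert the purchase counts to 0/1 flags, since the algorithm only needs to know whether a product is in an order.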

In [None]:
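def encode_units(x):
    # map a purchase count to a 0/1 presence flag
    return 1 if x >= 1 else 0

# note: applymap is deprecated since pandas 2.1 in favor of DataFrame.map
basket = basket.applymap(encode_units)
basket.head()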

In [None]:
basket.size

In [None]:
basket.shape

Creating frequent sets and rules

In [None]:
frequent_items = apriori(basket, min_support=0.01, use_colnames=True, low_memory=True)
frequent_items.head()

In [None]:
frequent_items.tail()

In [None]:
frequent_items.shape

In [None]:
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.sort_values('lift', ascending=False)
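
Conceptually, rule generation splits each frequent itemset into an antecedent and a consequent, computes confidence and lift from the stored supports, and keeps the rules that pass the threshold. The sketch below illustrates that idea in plain Python. It is not how mlxtend implements `association_rules`, and the function name and support values are hypothetical.

```python
from itertools import combinations

def rules_from_itemsets(supports, min_lift=1.0):
    """Derive rules A => B from a {frozenset: support} mapping."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                # Both sides must be known itemsets (true for Apriori output)
                if antecedent not in supports or consequent not in supports:
                    continue
                conf = sup / supports[antecedent]      # Support(A, B) / Support(A)
                lift = conf / supports[consequent]     # Confidence(A => B) / Support(B)
                if lift >= min_lift:
                    rules.append((set(antecedent), set(consequent), sup, conf, lift))
    # Highest lift first
    return sorted(rules, key=lambda rule: rule[4], reverse=True)

# Hypothetical supports, for illustration only
supports = {
    frozenset({"Banana"}): 0.5,
    frozenset({"Limes"}): 0.2,
    frozenset({"Banana", "Limes"}): 0.15,
}
toy_rules = rules_from_itemsets(supports)
for antecedent, consequent, sup, conf, lift in toy_rules:
    print(antecedent, "=>", consequent, "support:", sup, "confidence:", conf, "lift:", lift)
```

Both directions of a rule share the same lift but differ in confidence, which is why sorting by lift alone does not tell you which product to recommend given the other.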

In [None]:
# release the connections held by the engine
sparkConn.dispose()