### Market Basket Analysis

* A fundamental technique used by large retailers to uncover associations between items.

* It looks for relationships between items that are frequently bought together.


### Association Rules

* The main goal is to identify relationships between products (or, more generally, variables) in a dataset.

* The idea is to determine which products are purchased together.

* Widely used to analyze retail transaction data.

### Metrics 

* Assume there are 100 customers.

* 10 bought milk, 8 bought butter, and 6 bought both milk and butter.

* More people bought milk than butter (10 vs. 8).

* Support = P(Milk and Butter) = 6/100 = 0.06

* Confidence (Butter -> Milk) = Support / P(Butter) = 0.06 / 0.08 = 0.75

* Lift = Confidence / P(Milk) = 0.75 / 0.10 = 7.5
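
The arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the counts from the example; the variable names are illustrative and not part of the notebook.

```python
# Counts from the example: 100 customers, 10 bought milk, 8 bought butter, 6 bought both.
n_customers = 100
n_milk, n_butter, n_both = 10, 8, 6

support = n_both / n_customers                    # P(milk and butter)
confidence = support / (n_butter / n_customers)   # P(milk | butter)
lift = confidence / (n_milk / n_customers)        # confidence relative to P(milk)

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.1f}")
```

A lift well above 1 here says that butter buyers are far more likely to buy milk than a randomly chosen customer.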

### Example

* Order 1: apple, egg, and milk

* Order 2: carrot and milk

* Order 3: apple, egg, and carrot

* Order 4: apple and egg

* Order 5: apple and carrot


1. **Support - percentage of orders that contain the item set.**

    * In the example above we have 5 orders, and 3 of them contain **apple and egg.**
    * **support{apple,egg} = 3/5 or 60%**


2. **Confidence - given two items, e.g. apple and egg, confidence measures the percentage of times egg is purchased given that apple has been purchased.**

    * **confidence{A -> B} = support{A,B} / support{A}**
    * Confidence is directional: confidence{A -> B} is generally not equal to confidence{B -> A}.
    * Confidence values lie between 0 and 1. A confidence of 0 means egg was never purchased when apple was purchased; a confidence of 1 means egg was always purchased when apple was purchased.
    * Calculation of confidence {apple -> egg}
        * confidence {apple -> egg} = support{apple,egg}/support{apple}
        * (3/5)/(4/5) = 3/4 = 0.75 or 75%
    * Calculation of confidence {egg -> apple}
        * confidence {egg -> apple} = support{egg,apple}/support{egg}
        * (3/5)/(3/5) = 1 or 100%


3. **Lift - unlike confidence, lift is non-directional, i.e. lift {A,B} = lift {B,A}.**

    * Formula
        * **lift {A,B} = lift {B,A} = support{A,B} / (support{A} * support{B})**
        * lift {apple,egg} = lift {egg,apple} = (3/5) / (4/5 * 3/5) = 5/4 = 1.25

    * Lift = 1 => no relationship between A and B, i.e. they occur together only as often as chance would predict.
    * Lift > 1 => positive relationship between A and B, i.e. they occur together more often than chance would predict.
    * Lift < 1 => negative relationship between A and B, i.e. they occur together less often than chance would predict.
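
The three metrics can be reproduced for the five toy orders with plain Python sets. This is only a sketch: `toy_orders`, `support`, `confidence`, and `lift` are names chosen here for illustration and are separate from the functions defined later in the notebook.

```python
# The five example orders, each represented as a set of items.
toy_orders = [
    {"apple", "egg", "milk"},
    {"carrot", "milk"},
    {"apple", "egg", "carrot"},
    {"apple", "egg"},
    {"apple", "carrot"},
]

def support(itemset):
    """Fraction of orders that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= order for order in toy_orders) / len(toy_orders)

def confidence(a, b):
    """confidence{a -> b} = support{a,b} / support{a} (directional)."""
    return support({a, b}) / support({a})

def lift(a, b):
    """lift{a,b} = support{a,b} / (support{a} * support{b}) (symmetric)."""
    return support({a, b}) / (support({a}) * support({b}))

print(round(support({"apple", "egg"}), 2))
print(round(confidence("apple", "egg"), 2), round(confidence("egg", "apple"), 2))
print(round(lift("apple", "egg"), 2), round(lift("egg", "apple"), 2))
```

Swapping the arguments changes the confidence but not the lift, which matches the directional and non-directional behaviour described above.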

### Import libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
orders = pd.read_csv('datasets/order_products__prior.csv')

orders.head()

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


In [3]:
### Convert df into series with order_id as index, item_id as value

orders = orders.set_index('order_id')['product_id'].rename('item_id')
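
To see what this reshaping produces, here is a small sketch on a made-up table. The `toy` DataFrame and its values are assumptions for illustration only; the transformation itself is the same `set_index(...)[...].rename(...)` chain used above.

```python
import pandas as pd

# Toy stand-in for the orders table (values are made up for illustration).
toy = pd.DataFrame({
    "order_id":   [1, 1, 2, 2, 2],
    "product_id": [10, 20, 10, 30, 40],
})

# order_id becomes the index, product_id becomes the values, and the Series is named item_id.
toy_series = toy.set_index("order_id")["product_id"].rename("item_id")

print(toy_series)
```

Each row is now one (order, item) pair, with the order id repeated in the index once per item in that order.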

### Association Rules

In [5]:
from collections import Counter
from itertools import groupby, combinations

In [6]:
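def association_rules(order_item, min_support):
    """Return pairwise association rules for a Series of item_ids indexed by order_id.

    min_support is expressed in percent (0.01 means 0.01% of orders).
    """

    print("Starting order_item: {:22d}".format(len(order_item)))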


    # Calculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Filter from order_item items below min support 
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Filter from order_item orders with less than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)


    # Calculate item pair frequency and support
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100

    print("Item pairs: {:31d}".format(len(item_pairs)))


    # Filter from item_pairs those below min support
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))


    # Create table of association rules and compute relevant metrics
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
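    # Supports are stored as percentages, so the confidence ratios below are true confidences (0 to 1),
    # but lift comes out at 1/100 of the textbook value support{A,B} / (support{A} * support{B}).
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB'])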
    
    
    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)

In [7]:
# Returns frequency counts for items and item pairs
def freq(iterable):
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else:
        return pd.Series(Counter(iterable)).rename("freq")

    
# Returns number of unique orders
def order_count(order_item):
    return len(set(order_item.index))


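# Returns generator that yields item pairs, one at a time.
# Note: itertools.groupby only groups consecutive rows, so order_item must be sorted by order_id.
# Pairs keep the order in which items appear within an order, so (A, B) and (B, A) are counted separately.
def get_item_pairs(order_item):
    order_item = order_item.reset_index().values
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]

        for item_pair in combinations(item_list, 2):
            yield item_pair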
            

# Returns frequency and support associated with item
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


# Returns name associated with item
def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]               
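
The pair-counting step can be illustrated without pandas on the five toy orders from the metrics example. This is a sketch under stated assumptions: `rows` and `toy_item_pairs` are hypothetical names, and unlike `get_item_pairs` above, it sorts the items within each order so that every unordered pair is counted once.

```python
from collections import Counter
from itertools import combinations, groupby

# Toy (order_id, item) rows, already sorted by order_id.
# itertools.groupby only groups consecutive rows, so the sort order matters.
rows = [(1, "apple"), (1, "egg"), (1, "milk"),
        (2, "carrot"), (2, "milk"),
        (3, "apple"), (3, "egg"), (3, "carrot"),
        (4, "apple"), (4, "egg"),
        (5, "apple"), (5, "carrot")]

def toy_item_pairs(rows):
    """Yield every unordered item pair within each order, one at a time."""
    for _, group in groupby(rows, key=lambda r: r[0]):
        # Sorting makes (apple, egg) and (egg, apple) the same key.
        items = sorted(item for _, item in group)
        yield from combinations(items, 2)

pair_counts = Counter(toy_item_pairs(rows))
n_orders = len({order_id for order_id, _ in rows})
pair_support = {pair: n / n_orders for pair, n in pair_counts.items()}

print(pair_counts.most_common(2))
```

Because the pairs come from a generator, they are consumed by `Counter` one at a time and never held in memory as a full list, which is what keeps the real dataset tractable.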

In [8]:
rules = association_rules(orders, 0.01)


Starting order_item:                4999999
Items with support >= 0.01:           10967
Remaining order_item:               4607364
Remaining orders with 2+ items:      464919
Remaining order_item:               4579612
Item pairs:                        11265241
Item pairs with support >= 0.01:      49243



In [9]:
rules

| | item_A | item_B | freqAB | supportAB | freqA | supportA | freqB | supportB | confidenceAtoB | confidenceBtoA | lift |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10344 | 28613 | 45636 | 61 | 0.013121 | 259 | 0.055709 | 171 | 0.036781 | 0.235521 | 0.356725 | 6.403409 |
| 11838 | 42345 | 42085 | 47 | 0.010109 | 269 | 0.057860 | 133 | 0.028607 | 0.174721 | 0.353383 | 6.107609 |
| 21718 | 1443 | 15707 | 47 | 0.010109 | 358 | 0.077003 | 104 | 0.022369 | 0.131285 | 0.451923 | 5.868928 |
| 34491 | 42345 | 8186 | 66 | 0.014196 | 269 | 0.057860 | 214 | 0.046030 | 0.245353 | 0.308411 | 5.330343 |
| 39000 | 1377 | 30219 | 51 | 0.010970 | 228 | 0.049041 | 197 | 0.042373 | 0.223684 | 0.258883 | 5.278936 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38949 | 21137 | 16797 | 91 | 0.019573 | 40713 | 8.757009 | 21871 | 4.704260 | 0.002235 | 0.004161 | 0.000475 |
| 7207 | 47209 | 47766 | 70 | 0.015056 | 32694 | 7.032193 | 26949 | 5.796494 | 0.002141 | 0.002597 | 0.000369 |
| 22618 | 47766 | 47209 | 64 | 0.013766 | 26949 | 5.796494 | 32694 | 7.032193 | 0.002375 | 0.001958 | 0.000338 |
| 22188 | 24852 | 13176 | 94 | 0.020219 | 72594 | 15.614333 | 57948 | 12.464107 | 0.001295 | 0.001622 | 0.000104 |


## Applying it to our products data

In [10]:
item_name = pd.read_csv('datasets/products.csv')

item_name.head()

| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |


In [12]:
item_name = item_name.rename(columns={'product_id':'item_id',
                                      'product_name':'item_name'})

item_name

| | item_id | item_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
| ... | ... | ... | ... | ... |
| 49683 | 49684 | Vodka, Triple Distilled, Twist of Vanilla | 124 | 5 |
| 49684 | 49685 | En Croute Roast Hazelnut Cranberry | 42 | 1 |
| 49685 | 49686 | Artisan Baguette | 112 | 3 |
| 49686 | 49687 | Smartblend Healthy Metabolism Dry Cat Food | 41 | 8 |


In [13]:
rules_final = merge_item_name(rules,item_name).sort_values('lift',ascending=False)

rules_final

| | itemA | itemB | freqAB | supportAB | freqA | supportA | freqB | supportB | confidenceAtoB | confidenceBtoA | lift |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Organic Grapefruit Ginger Sparkling Yerba Mate | Cranberry Pomegranate Sparkling Yerba Mate | 61 | 0.013121 | 259 | 0.055709 | 171 | 0.036781 | 0.235521 | 0.356725 | 6.403409 |
| 1 | Baby Food Pouch - Roasted Carrot Spinach & Beans | Baby Food Pouch - Spinach Pumpkin & Chickpea | 47 | 0.010109 | 269 | 0.057860 | 133 | 0.028607 | 0.174721 | 0.353383 | 6.107609 |
| 3 | Strawberry and Banana Fruit Puree | Peter Rabbit Organics Mango, Banana and Orange... | 47 | 0.010109 | 358 | 0.077003 | 104 | 0.022369 | 0.131285 | 0.451923 | 5.868928 |
| 2 | Baby Food Pouch - Roasted Carrot Spinach & Beans | Baby Food Pouch - Butternut Squash, Carrot & C... | 66 | 0.014196 | 269 | 0.057860 | 214 | 0.046030 | 0.245353 | 0.308411 | 5.330343 |
| 4 | Organic Lactose Free Strawberry Yogurt | Lactose Free Blueberry Yogurt | 51 | 0.010970 | 228 | 0.049041 | 197 | 0.042373 | 0.223684 | 0.258883 | 5.278936 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11570 | Organic Strawberries | Strawberries | 91 | 0.019573 | 40713 | 8.757009 | 21871 | 4.704260 | 0.002235 | 0.004161 | 0.000475 |
| 7356 | Organic Hass Avocado | Organic Avocado | 70 | 0.015056 | 32694 | 7.032193 | 26949 | 5.796494 | 0.002141 | 0.002597 | 0.000369 |
| 4420 | Organic Avocado | Organic Hass Avocado | 64 | 0.013766 | 26949 | 5.796494 | 32694 | 7.032193 | 0.002375 | 0.001958 | 0.000338 |
| 1538 | Banana | Bag of Organic Bananas | 94 | 0.020219 | 72594 | 15.614333 | 57948 | 12.464107 | 0.001295 | 0.001622 | 0.000104 |


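**Conclusion**
The top associations in the output above are unsurprising: the highest-lift pairs are different flavours or varieties of the same product (for example, two flavours of sparkling yerba mate, or two baby food pouches) purchased together. At the bottom of the table, the lowest-lift pairs are near-substitutes such as Organic Avocado and Organic Hass Avocado, which appear together far less often than their individual popularity would suggest.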

In [16]:
# Function vs Generator
# A function exits at its first `return`, so only the first name comes back.

def function_name(names):
    for name in names:
        return name
    
student = function_name(['Vivek','Abhijeet','Shamali','Sucharitha'])

student

'Vivek'

In [25]:
# A generator pauses at each `yield` and resumes from there when the next value is requested.
def generator_name(names):
    for name in names:
        yield name
        
student = generator_name(['Vivek','Abhijeet','Shamali','Sucharitha'])

student

<generator object generator_name at 0x7fac52135270>

In [26]:
list(student)

['Vivek', 'Abhijeet', 'Shamali', 'Sucharitha']
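
The laziness of a generator is easier to see when values are pulled one at a time with `next()`. This is a small sketch that reuses the `generator_name` definition from above with a shortened list; `first`, `second`, and `remaining` are illustrative names.

```python
def generator_name(names):
    for name in names:
        yield name

student = generator_name(['Vivek', 'Abhijeet'])

first = next(student)       # runs the loop body up to the first yield
second = next(student)      # resumes where it left off
remaining = list(student)   # nothing is left, so this is an empty list

print(first, second, remaining)
```

This is the same behaviour `get_item_pairs` relies on: pairs are produced only when the consumer asks for the next one, and a generator cannot be replayed once it is exhausted.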
