# Overview

This notebook runs the a priori algorithm to build out a recomendation list based on lift associated with items being purchased together.

# Setup

In [1]:
import pandas as pd
import numpy as np

You will first need to download this dataset:
https://www.kaggle.com/carrie1/ecommerce-data?select=data.csv and update the filepath below.

In [2]:
DATA_FILEPATH = '../data/retail_txn_data_demo.csv'
KEEP_COLS = ['InvoiceNo', 'StockCode']

# Functions

## Data Structures

These data structures are used for book keeping so we can store the item associations in a matrtix.

The index map maps an item id to it's position in a sorted list index

In [3]:
def get_index_map(df_):
    # make a map for item number: base_0_index to make matrix
    index_map = {
    row['item']: idx for idx, row in (
        df_
        .sort_values('item')
        .item.drop_duplicates()
        .reset_index()
        .iterrows()
        )
    }
    return index_map

The item array stores unique product ids in an array where the index matches the value in the above dictionary.

In [4]:
def get_item_array(df_):
    item_array = (
        df_
        .sort_values('item')
        .item
        .drop_duplicates()
        .values
    )
    return item_array

Get an array of lists where each element of outer array is an order, and inner lists represent the items in the order

In [5]:
def get_orders_array(df_):
    orders = (
        df_
        .sort_values('item')
        .groupby('invoice')
        .agg({'item': list})
    ).values
    return orders

## Support
Support refers to the **popularity of item** and can be calculated by finding the number of transactions containing a particular item divided by the total number of transactions:  
$$Support(A) = (Transactions containing (A))/(Total Transactions)$$

In [6]:
def get_support_df(df_):
    total_invoices = df_.invoice.nunique()
    support = (
        df_
        .groupby('item')
        # Count number of different invoices that this item appears
        .count()
        .rename(columns={'invoice': 'n_invoices'})
        .assign(support=lambda x: x.n_invoices/total_invoices)
    )
    return support

## Association Matrix
To store number of orders where each item pair appears together

Make a dictionary that has a key being tuple for matrix coordinates and value being the frequency of association. 

In [7]:
def make_association_dict(orders_, index_map_):
    
    combination_counts = {}
    #iterate through array 
    for order in orders_:
        # get the item list from the order
        order = order[0]

        # keep popping items from front of item list until empty
        # popping ensures we don't duplicate combinations as we
        # iterate through the list of items
        while len(order) > 0:
            item_num = order.pop()
            item_index = index_map_[item_num]
            
            # self association
            # This will tell us how many total orders the item appears
            # in different orders. It will be our diagonal in the 
            # association matrix
            self_item_set = (item_index, item_index)
            if self_item_set in combination_counts.keys():
                combination_counts[self_item_set] += 1
            else:
                combination_counts[self_item_set] = 1
            
            # Association with all other items in order
            for other_item in order:
                other_item_index = index_map_[other_item]
                item_set = (item_index, other_item_index)
                if item_set in combination_counts.keys():
                    combination_counts[item_set] += 1
                else:
                    combination_counts[item_set] = 1
                    
    return combination_counts

Make association matrix with proportions of orders where each item is ordered together 

In [8]:
def make_assocation_matrix(item_array_, association_dict_):
    association_matrix = np.ndarray(shape=(item_array.shape[0], item_array.shape[0]))
    # fill matrix with count of orders where items appear together in same order.
    for key, value in combination_counts.items():
        association_matrix[key[0], key[1]] = value
    return association_matrix

## Confidence
Confidence refers to the **likelihood that an item B is also bought if item A is bought**. It can be calculated by finding the number of transactions where A and B are bought together, divided by the total number of transactions where A is bought:  
$$Confidence(A → B) = (Transactions containing both (A and B))/(Transactions containing A)$$

Confidence is similar to Naive Based Algorithm.

Fill a **confidence matrix** where each index pair reports the _confidence_

In [9]:
# Convert this to more of a linear alg approach instead of iterating. 
# Should just divide matrix by a 1D array with order counts
def make_confidence_matrix(association_matrix_):    
    confidence_matrix = np.ndarray(shape=association_matrix_.shape)
    for i in range(association_matrix_.shape[0]):
        for j in range(association_matrix_.shape[1]):
            confidence_matrix[i, j] = association_matrix_[i, j] / association_matrix_[i, i]
    return confidence_matrix

# Lift

Lift refers to the increase in the ratio of the sale of B when A is sold.  
Lift(A –> B) can be calculated by dividing Confidence(A -> B) divided by Support(B).  
Mathematically it can be represented as:  
Lift(A→B) = (Confidence (A→B))/(Support (B))

In [10]:
def make_lift_matrix(confidence_matrix_, item_array_, df_support_):
    lift_matrix = np.ndarray(shape=confidence_matrix_.shape)
    for item_index in range(confidence_matrix_.shape[0]):
        item = item_array_[item_index]
        item_support = df_support_.loc[item]['support']
        lift_matrix[item_index] = confidence_matrix[item_index] / item_support
    return lift_matrix

## Utils

Used to get the top n recommendations given an item id

In [11]:
def get_top_n(item_id, matrix_, item_index_map_, item_array, n=10, omit_self=True):
    item_index = item_index_map_[item_id]
    if omit_self:
        top_n = matrix_[item_index].argsort()[::-1][1:n+1]
    else:
        top_n = matrix_[item_index].argsort()[::-1][:n]
        
    top_n_list = []
    for other_item_index in top_n:
        top_n_list.append(
            {item_array[other_item_index]: matrix_[item_index, other_item_index]})
    return top_n_list

# Running

Get Data

In [12]:
df = (
    pd.read_csv(DATA_FILEPATH, encoding= 'unicode_escape')
    .filter(KEEP_COLS)
    .rename(columns={'InvoiceNo': 'invoice', 'StockCode': 'item'})
)
df.head()

Unnamed: 0,invoice,item
0,536365,85123A
1,536365,71053
2,536365,84406B
3,536365,84029G
4,536365,84029E


Get **Support** 

In [13]:
df_support = get_support_df(df)
df_support.head()

Unnamed: 0_level_0,n_invoices,support
item,Unnamed: 1_level_1,Unnamed: 2_level_1
10002,73,0.002819
10080,24,0.000927
10120,30,0.001158
10123C,4,0.000154
10123G,1,3.9e-05


Get the reference data structures

In [14]:
index_map = get_index_map(df)
item_array = get_item_array(df)
print(len(index_map), len(item_array))

4070 4070


Get orders in a simpler data format: array orders containing lists of items in the order

In [15]:
orders = get_orders_array(df)
print(orders.shape)
orders[0:5]

(25900, 1)


array([[list(['21730', '22752', '71053', '84029E', '84029G', '84406B', '85123A'])],
       [list(['22632', '22633'])],
       [list(['21754', '21755', '21777', '22310', '22622', '22623', '22745', '22748', '22749', '48187', '84879', '84969'])],
       [list(['22912', '22913', '22914', '22960'])],
       [list(['21756'])]], dtype=object)

Make the matrices to calculate association counts and confidence

In [16]:
combination_counts = make_association_dict(orders, index_map)
print(len(combination_counts))
association_matrix = make_assocation_matrix(item_array, combination_counts)
print(association_matrix.shape)
confidence_matrix = make_confidence_matrix(association_matrix)
print(confidence_matrix.shape)

3755646
(4070, 4070)
(4070, 4070)


Check top n items using confidence. This shows the proportion of orders which have both the item in question and the paired item

In [17]:
test_item_index = 3536
test_item = "85123A"

get_top_n(test_item, confidence_matrix, index_map, item_array, n=10)

[{'21733': 0.21877619446772842},
 {'85099B': 0.17812238055322716},
 {'22457': 0.17812238055322716},
 {'47566': 0.17141659681475271},
 {'22469': 0.1697401508801341},
 {'82482': 0.16303436714165967},
 {'22423': 0.15884325230511315},
 {'22804': 0.15590947191953058},
 {'22470': 0.15171835708298406},
 {'20725': 0.14752724224643754}]

In [18]:
lift_matrix = make_lift_matrix(confidence_matrix, item_array, df_support)

In [19]:
get_top_n(test_item, lift_matrix, index_map, item_array, n=10)

[{'21733': 2.44976369940085},
 {'85099B': 1.9945394104317267},
 {'22457': 1.9945394104317267},
 {'47566': 1.9194508679213556},
 {'22469': 1.9006787322937628},
 {'82482': 1.825590189783392},
 {'22423': 1.7786598507144102},
 {'22804': 1.745808613366123},
 {'22470': 1.6988782742971411},
 {'20725': 1.6519479352281592}]

# References

https://medium.com/@deepak6446/apriori-algorithm-in-python-recommendation-engine-5ba89bd1a6da