# Association Analysis 

The goal is to apply the Apriori algorithm to find the two most frequent collections of items in 250,000 orders at the market.

At the end we will give the two lists of four items that are most often found together.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

import itertools as it

# The Data

The data is a subset of the ["Insta-Cart"](https://www.kaggle.com/c/instacart-market-basket-analysis/data) dataset on Kaggle.

In [2]:
file_path = "order_products__train.csv"

data_train = pd.read_csv(file_path)
data_train

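|  | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 1 | 49302 | 1 | 1 |
| 1 | 1 | 11109 | 2 | 1 |
| 2 | 1 | 10246 | 3 | 0 |
| 3 | 1 | 49683 | 4 | 0 |
| 4 | 1 | 43633 | 5 | 1 |
| ... | ... | ... | ... | ... |
| 1384612 | 3421063 | 14233 | 3 | 1 |
| 1384613 | 3421063 | 35548 | 4 | 1 |
| 1384614 | 3421070 | 35951 | 1 | 1 |
| 1384615 | 3421070 | 16953 | 2 | 1 |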


In [3]:
file_path = "order_products__prior.csv"

data_prior = pd.read_csv(file_path)
data_prior.head()

|  | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


In [4]:
file_path = "products.csv"

data_products = pd.read_csv(file_path)
data_products

|  | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
| ... | ... | ... | ... | ... |
| 49683 | 49684 | Vodka, Triple Distilled, Twist of Vanilla | 124 | 5 |
| 49684 | 49685 | En Croute Roast Hazelnut Cranberry | 42 | 1 |
| 49685 | 49686 | Artisan Baguette | 112 | 3 |
| 49686 | 49687 | Smartblend Healthy Metabolism Dry Cat Food | 41 | 8 |


For now we will use "order_products__train" and "products". In the "order_products..." datasets the "order_id" is repeated because an order can consist of several products, so counting the unique ids gives the total number of orders.

In [5]:
unique_orders = len(data_train.order_id.unique())
print(unique_orders)

131209


Count the appearances of each product.

In [74]:
count_of_each_product = data_train.product_id.value_counts()
count_of_each_product

24852    18726
13176    15480
21137    10894
21903     9784
47626     8135
         ...  
44256        1
2764         1
4815         1
43736        1
46835        1
Name: product_id, Length: 39123, dtype: int64

Drop the products that do not appear more than 500 times.

In [7]:
products_to_keep = count_of_each_product[count_of_each_product > 500].index
products_to_keep

Int64Index([24852, 13176, 21137, 21903, 47626, 47766, 47209, 16797, 26209,
            27966,
            ...
            38273, 18288,  4086,  5769, 19019, 17758, 40198, 49191, 14197,
            31915],
           dtype='int64', length=372)

To keep only those products in a data frame, we build the subset of rows from the original set whose product satisfies the condition above.

In [124]:
compacted_data = data_train[data_train.product_id.isin(products_to_keep)]
compacted_data

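|  | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 2 | 1 | 10246 | 3 | 0 |
| 3 | 1 | 49683 | 4 | 0 |
| 5 | 1 | 13176 | 6 | 0 |
| 6 | 1 | 47209 | 7 | 0 |
| 7 | 1 | 22035 | 8 | 1 |
| ... | ... | ... | ... | ... |
| 1384599 | 3421056 | 21709 | 3 | 1 |
| 1384610 | 3421063 | 49235 | 1 | 1 |
| 1384612 | 3421063 | 14233 | 3 | 1 |
| 1384614 | 3421070 | 35951 | 1 | 1 |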
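
The two filtering steps above (count each product, then keep only the rows of frequent products) can be tried on a tiny frame. This is a minimal sketch: the `toy` data, the `toy_*` names and the threshold `min_count = 2` are made up for illustration and are not part of the Instacart files.

```python
import pandas as pd

# Toy order lines (hypothetical data, not taken from the Instacart files)
toy = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 3, 3, 3],
    "product_id": [10, 20, 30, 10, 20, 10, 40, 20],
})

# How many order lines each product appears in
toy_counts = toy.product_id.value_counts()

# Keep products appearing more than `min_count` times (the notebook uses 500)
min_count = 2
toy_keep = toy_counts[toy_counts > min_count].index

# Keep only the order lines that refer to a frequent product
toy_compacted = toy[toy.product_id.isin(toy_keep)]

print(sorted(toy_keep))
print(len(toy_compacted))
```

Products 30 and 40 appear only once each, so their rows are dropped while the orders themselves remain, just with fewer lines.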


# Association & Apriori Algorithm

The `find_groups_of_size_n(compacted_data, 4)` call finds all of the product groupings of a specified size and counts how many times they appear.

However, on the full data this needs a vast amount of computing power and memory: it would take a very long time and/or result in a memory error.

In [125]:
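def find_groups_of_size_n(data, size):
    """Count how often each group of `size` products appears together in an order."""
    
    # One sorted list of distinct product ids per order
    group_by = data.groupby("order_id")['product_id'].unique()
    group_by = group_by.apply(lambda x: sorted(x))
    group_by = pd.DataFrame(group_by)
    
    def groupings(x):
        # All combinations of `size` products within one order
        return list(it.combinations(x,size))
    
    group_by['groups'] = group_by['product_id'].apply(groupings)
    # Flatten the per-order lists and count each group, most frequent first
    counts = pd.Series(list(it.chain.from_iterable(group_by['groups'].values))).value_counts()
    
    return counts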
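
To see what the function counts, here is a minimal sketch of the same idea on three hypothetical baskets, using `collections.Counter` instead of pandas. The `toy_orders` data and the group size of 2 are assumptions chosen to keep the example small.

```python
import itertools as it
from collections import Counter

# Hypothetical baskets: order_id -> product ids
toy_orders = {
    1: [3, 1, 2],
    2: [1, 2, 3, 4],
    3: [2, 4],
}

size = 2
pair_counts = Counter()
for items in toy_orders.values():
    # Sorting makes (1, 2) and (2, 1) count as the same group
    pair_counts.update(it.combinations(sorted(set(items)), size))

print(pair_counts)
```

Every order contributes all of its size-2 combinations, so the counter ends up with one entry per distinct pair seen in any order.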

Don't run the cell below before reading this note.

If the data is not cut down enough, you will see a MemoryError.

Thus we need to determine how to cut down the size of the data being investigated.
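
The memory problem comes from how quickly the number of combinations grows with the order size. As a rough illustration, the sketch below counts the 4-item groups produced by a single order of `n` distinct products. The values of `n` are picked for illustration, with 34 being the largest order size that appears later in this notebook.

```python
import math

# Number of 4-item groups generated by one order of n distinct products
groups_per_order = {n: math.comb(n, 4) for n in (4, 10, 20, 34)}

for n, groups in groups_per_order.items():
    print(n, groups)
```

An order of exactly 4 products yields a single group, while one order of 34 products yields 46376 of them.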

In [126]:
#%%time
#find_groups_of_size_n(compacted_data, 4).head()

# Determining the Cut-Down

Ideas:

1. We know we are looking for product groups of four items. Thus all orders of fewer than 4 items may be discarded.
2. Use the principles behind the Apriori algorithm to further reduce the orders / products which are grouped.
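
The principle behind idea 2 is that a group of items can only be frequent if every smaller group contained in it is frequent too. Below is a minimal level-wise sketch of that pruning on hypothetical baskets. The `baskets` data, the `min_support` threshold and the helper `support_counts` are assumptions for illustration, and this is not the code used in the rest of the notebook.

```python
import itertools as it
from collections import Counter

# Hypothetical baskets and support threshold
baskets = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 5}]
min_support = 2

def support_counts(candidates):
    """Count in how many baskets each candidate itemset is fully contained."""
    return Counter({c: sum(set(c) <= b for b in baskets) for c in candidates})

# Level 1: frequent single items
items = sorted(set().union(*baskets))
frequent = {c for c, n in support_counts([(i,) for i in items]).items()
            if n >= min_support}

levels = {1: frequent}
k = 2
while frequent:
    # Build candidates only from items that are still part of a frequent set
    alive = sorted({i for c in frequent for i in c})
    candidates = [
        c for c in it.combinations(alive, k)
        # Apriori pruning: every (k-1)-subset must itself be frequent
        if all(s in frequent for s in it.combinations(c, k - 1))
    ]
    frequent = {c for c, n in support_counts(candidates).items()
                if n >= min_support}
    if frequent:
        levels[k] = frequent
    k += 1

print(levels)
```

Here the pair (3, 4) is not frequent, so no triple or quadruple containing both 3 and 4 is ever counted.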

We count the number of rows for each order id in our compacted data frame, that is, the size of each order (counting only the products we kept).

In [190]:
size_of_order = compacted_data.order_id.value_counts()
size_of_order

2022893    34
736120     34
951047     34
2803519    32
2787030    31
           ..
1601363     1
386975      1
3322368     1
2106736     1
207425      1
Name: order_id, Length: 110509, dtype: int64

We now filter the orders we are going to keep according to the order size. Note that the filter below keeps only the orders with exactly `size` kept products, which is stricter than discarding orders of fewer than 4 items.

In [191]:
size = 4
orders_to_keep = size_of_order[size_of_order.values == size]
orders_to_keep

757591     4
2806059    4
1351723    4
3252912    4
1625870    4
          ..
1213223    4
3399073    4
1232341    4
2194904    4
1008486    4
Name: order_id, Length: 11705, dtype: int64
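
The filter above keeps orders whose size is exactly 4. For comparison, the sketch below shows on hypothetical data how an "exactly `size`" filter differs from an "at least `size`" filter. The `toy` frame and the names `exactly` and `at_least` are made up for illustration.

```python
import pandas as pd

# Hypothetical order lines: order 1 has 4 rows, order 2 has 2, order 3 has 5
toy = pd.DataFrame({"order_id": [1] * 4 + [2] * 2 + [3] * 5})

# Number of rows per order, i.e. the order size
toy_sizes = toy.order_id.value_counts()

size = 4
exactly = toy_sizes[toy_sizes == size].index    # what the notebook's filter does
at_least = toy_sizes[toy_sizes >= size].index   # keeps larger orders as well

print(list(exactly))
print(sorted(at_least))
```

The stricter filter produces one group per order, which keeps the computation small, at the cost of ignoring groups of four that occur inside larger orders.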

And define a new data frame, which is the compacted_data set with the additional constraint that orders have the given size.

In [170]:
cut_down_data = compacted_data[compacted_data.order_id.isin(orders_to_keep.index)]
cut_down_data

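|  | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 81 | 112 | 27104 | 1 | 1 |
| 84 | 112 | 38273 | 4 | 0 |
| 85 | 112 | 47209 | 5 | 0 |
| 86 | 112 | 5876 | 6 | 1 |
| 127 | 349 | 33000 | 1 | 1 |
| ... | ... | ... | ... | ... |
| 1384528 | 3420895 | 45007 | 15 | 0 |
| 1384546 | 3420996 | 24852 | 1 | 1 |
| 1384547 | 3420996 | 45066 | 2 | 1 |
| 1384551 | 3420996 | 14947 | 6 | 1 |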


Notice that this data frame is much smaller than the one we had before (522045 rows × 4 columns). This simplifies the problem and solves the MemoryError.

## Product Lookup

The items in our groups are encoded as integer product ids. To get the product names we need the following function.

In [192]:
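products = data_products

def product_lookup(product_ids):
    """Return the name for a single product id, or a list of names for several ids."""
    def name_of(pid):
        # First (and only) matching row; assumes the id exists in `products`
        return products.loc[products.product_id == pid, "product_name"].iloc[0]

    if np.isscalar(product_ids):
        # A single id gives a single name
        return name_of(product_ids)
    # A list or tuple of ids gives a list of names
    return [name_of(pid) for pid in product_ids]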

For example:

In [193]:
product_lookup(13176)

'Bag of Organic Bananas'
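
When many ids have to be translated, filtering the whole table once per id can become slow. One possible alternative, sketched here on a hypothetical stand-in for the products table, is to build an id-to-name Series once and index into it. The `toy_products` data and the name `name_by_id` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the products table
toy_products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Apples", "Bread", "Cheese"],
})

# Build the id -> name mapping once
name_by_id = toy_products.set_index("product_id")["product_name"]

single = name_by_id.loc[2]                  # one id gives one name
several = name_by_id.loc[[3, 1]].tolist()   # a list of ids gives a list of names

print(single)
print(several)
```

The names come back in the order of the requested ids, not in the order of the table.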

# Association Goal

Two lists of four items that are most often found together:

In [194]:
%%time
find_groups_of_size_n(cut_down_data, 4).head()

CPU times: user 1.5 s, sys: 8.12 ms, total: 1.51 s
Wall time: 1.48 s


(14947, 21709, 35221, 44632)    9
(21709, 26620, 35221, 44632)    4
(16797, 21288, 39275, 43352)    4
(196, 6184, 37710, 43154)       3
(12341, 13176, 21137, 39275)    3
dtype: int64

In [195]:
# Compute the counts once and reuse them
groups = find_groups_of_size_n(cut_down_data, 4)
print(groups.values[0], "appearances")
product_lookup(list(groups.index[0]))

9 appearances


['Pure Sparkling Water',
 'Sparkling Lemon Water',
 'Lime Sparkling Water',
 'Sparkling Water Grapefruit']

In [196]:
# Third entry of the counts; it ties with the second entry at 4 appearances
groups = find_groups_of_size_n(cut_down_data, 4)
print(groups.values[2], "appearances")
product_lookup(list(groups.index[2]))

4 appearances


['Strawberries', 'Blackberries', 'Organic Blueberries', 'Raspberries']