# Instacart Market Basket Analysis

**Background**

Instacart is a grocery ordering and delivery app that aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After you select products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you.

Instacart has:
- 100s of retailers
- 10,000s of stores
- 10,000s of shoppers
- 100,000s of products
- 100,000,000s of items

Instacart's data science team uses transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session.

(https://tech.instacart.com/predicting-real-time-availability-of-200-million-grocery-items-in-us-canada-stores-61f43a16eafe)

(https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)

**Data**

The dataset is a relational set of files describing customers' orders over time. The goal is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200K Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders.

(https://www.kaggle.com/c/instacart-market-basket-analysis)

**Possible problem**

Because Instacart does not have the logistics supply chain information for products, when an item that a customer adds to the cart is unavailable in store, it costs every stakeholder in Instacart's marketplace. Shoppers waste time searching for an unavailable item, customers can't buy what they want, and retail partners lose out on revenue.

By proactively and accurately predicting customers' buying behavior, Instacart can check the predicted items against its availability prediction model, which estimates in real time whether each of the 200 million grocery items is in stock, and recommend appropriate alternatives to the customer for any out-of-stock item(s).

**Possible prediction tasks**
- Products that a user will buy again
- Products that a user will try for the first time
- Products that a user will add to cart next during a session
- Products that a user will buy together
- Time that a user will make the next purchase

In [1]:
import pandas as pd
import numpy as np
import scipy as sc

import dask.array as da
import dask.dataframe as dd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

import pickle

**Let's load in the `mba` and `num_items_order` DataFrames from the "instacart_market_basket_analysis" notebook.**

In [2]:
mba = pd.read_pickle('mba.pickle')
num_items_order = pd.read_pickle('num_items_order.pickle')

## Market Basket Analysis

**Association analysis: what is the probability that customers will buy product A with product B?**

(https://www.kaggle.com/datatheque/association-rules-mining-market-basket-analysis)

**Association analysis**: 

- Association rules are written like {**`antecedent`**} -> {**`consequent`**}
- Both `antecedents` and `consequents` can have multiple items


- **`Support`** is the percentage of transactions that contain all the items in an itemset
    - A high support threshold is used to make sure a relationship is useful
    - A low support threshold is used to find "hidden" relationships
    
    
- **`Confidence`** is the probability that a transaction that contains the items in the `antecedent` also contains the items in the `consequent`
    - **confidence{A -> B} = support{A, B}/support{A}**
    - `Confidence` of 0.5 means that in 50% of the cases where the `antecedent` was purchased, the purchase also included the `consequent`
    - The higher the `confidence` the greater the likelihood that the `consequents` will be purchased with the `antecedents`
    
    
- **`Lift`** is the probability of all the items in a rule occurring together divided by the product of the probabilities of the `antecedents` and `consequents` occurring as if they were independent of each other
    - `Lift` indicates whether there is a relationship between A and B or whether the two items are occurring in the same orders simply by chance
    - `Lift` = 1 implies no relationship (co-occur at random) between A and B
    - `Lift` > 1 implies that there is a positive relationship (co-occur more often than random) between A and B
    - `Lift` < 1 implies that there is a negative relationship (co-occur less often than random) between A and B 
    - **lift{A, B} = lift{B, A}** unlike for `confidence`
    
(http://pbpython.com/market-basket-analysis.html)
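To make the three metrics concrete, here is a minimal sketch that computes them by hand. The five toy transactions and the helper names (`support`, `confidence`, `lift`) are made up for illustration and are not part of the Instacart data or of any library.

```python
# Toy transactions (hypothetical data, not taken from the Instacart dataset)
transactions = [
    {"banana", "milk", "bread"},
    {"banana", "milk"},
    {"banana", "bread"},
    {"milk", "bread"},
    {"banana", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support{A, B} / support{A}"""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """support{A, B} / (support{A} * support{B})"""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print("support{banana, milk}:", support({"banana", "milk"}, transactions))
print("confidence{banana -> milk}:", confidence({"banana"}, {"milk"}, transactions))
print("lift{banana -> milk}:", lift({"banana"}, {"milk"}, transactions))
```

In this toy data banana and milk each appear in 4 of 5 orders and together in 3 of 5, so the `lift` comes out slightly below 1: the pair co-occurs a little less often than independence would predict, even though the `confidence` looks fairly high.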

**Apriori algorithm**:

- The Apriori algorithm, a data mining algorithm, is used to perform a market basket analysis and identify potential rules
    1. Set a minimum value for support and confidence
    2. Extract all the itemsets whose support is higher than the minimum threshold
    3. Select all the rules from those itemsets whose confidence is higher than the minimum threshold
    4. Order the rules by descending order of `lift`
    - Apriori algorithm does not use `lift` to establish rules, but `lift` is used when exploring the rules the algorithm returns

(https://select-statistics.co.uk/blog/market-basket-analysis-understanding-customer-behaviour/)
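The four steps can be sketched in plain Python. This is a simplified illustration on made-up transactions, not the implementation used by MLxtend or Apyori; the function names `find_frequent_itemsets` and `generate_rules` are hypothetical.

```python
from itertools import combinations

def find_frequent_itemsets(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, keeping only frequent ones."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    singles = {frozenset([item]) for t in transactions for item in t}
    current = {s: support(s) for s in singles}
    current = {s: sup for s, sup in current.items() if sup >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: a candidate can only be frequent if all its (k-1)-subsets are
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c: support(c) for c in candidates}
        current = {c: sup for c, sup in current.items() if sup >= min_support}
        frequent.update(current)
        k += 1
    return frequent

def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift), sorted by lift."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, sup, conf,
                                  conf / frequent[consequent]))
    return sorted(rules, key=lambda rule: rule[4], reverse=True)

# Toy transactions (hypothetical data)
transactions = [
    {"banana", "milk", "bread"},
    {"banana", "milk"},
    {"banana", "bread"},
    {"milk", "bread"},
    {"banana", "milk", "eggs"},
]
frequent = find_frequent_itemsets(transactions, min_support=0.4)  # steps 1-2
rules = generate_rules(frequent, min_confidence=0.6)              # steps 3-4
for antecedent, consequent, sup, conf, lft in rules:
    print(set(antecedent), "->", set(consequent), round(sup, 2), round(conf, 2), round(lft, 2))
```

The prune step is what makes the approach tractable: any itemset containing an infrequent subset is discarded without counting its support, which matters when there are tens of thousands of distinct products.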

### Method 1: MLxtend

(http://pbpython.com/market-basket-analysis.html)

**We'll be using the MLxtend library by Sebastian Raschka because scikit-learn has no built-in Apriori algorithm for extracting frequent itemsets. `pip install mlxtend` did not work for me, so I used `conda install -c rasbt mlxtend` instead, which worked. Once mlxtend was installed, it said the "frequent_patterns" module couldn't be found, so I used `pip install git+git://github.com/rasbt/mlxtend.git` to resolve the issue.**

In [3]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

import dask.array as da
import dask.dataframe as dd

### Transactions with 10 or fewer items

**Because the `mba` dataset contains so many transactions and unique products, let's look at a subset of the data -- transactions with 2-10 items, which is about 60% of the orders.**

In [4]:
cartsize_10 = num_items_order[(num_items_order['num_of_items']<=10) & (num_items_order['num_of_items']>1)]

In [None]:
mba_10 = mba.merge(cartsize_10, how='right', on='order_id')
mba_subset =  

In [None]:
%%time
mba_subset = dd.from_pandas(mba_subset, npartitions=1+mba_subset.memory_usage().sum()//1000000000)

**We need to first convert our dataframe into the format where each row represents a cart and each column is a unique product name.**

In [None]:
%%time
mba_basket = mba_subset.groupby(['order_id', 'product_name'])['organic'].count().reset_index().compute()
mba_basket.head()

**Because the `apriori` function takes one-hot encoded dataframes and we want to know whether a product was present in a cart or not, we will one-hot encode the counts as 0 (did not purchase) and 1 (purchased).**
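A minimal sketch of that encoding step, assuming a long-format table shaped like `mba_basket` (one row per order and product, with a count column). The `toy_basket` data and the variable names `basket_wide` and `basket_onehot` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `mba_basket`: one row per (order, product) with a count
toy_basket = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2, 3],
    "product_name": ["Banana", "Milk", "Banana", "Bread", "Milk", "Bread"],
    "organic": [1, 1, 2, 1, 1, 1],
})

# Pivot to one row per order and one column per product; missing pairs become 0
basket_wide = (
    toy_basket
    .groupby(["order_id", "product_name"])["organic"].sum()
    .unstack(fill_value=0)
)

# Collapse counts to 0 (did not purchase) / 1 (purchased)
basket_onehot = (basket_wide > 0).astype(int)
print(basket_onehot)
```

On the full dataset this wide table has one column per distinct product, so it can become very large; that is one reason to restrict the analysis to a subset of orders first.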

**With our data one-hot encoded, we can generate frequent item sets. But first we need to test and select a `support` threshold, which is fairly small because there are so many distinct items and combinations of items in a cart at a time.**

**Let's generate our association rules based on the frequent item sets created above with their corresponding `support`, `confidence`, and `lift`.**

**A rule with a high `lift` value occurs more frequently than would be expected given the number of transactions and product combinations. Let's filter for rules with fairly high `confidence` and `lift`.**

**It seems that and are purchased together in a manner that is higher than the overall probability would suggest. Let's look at how much opportunity there is to use the popularity of one product to drive the sales of another.**

### Method 2: MLxtend Transaction Encoder

(http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/)

In [None]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

In [None]:
ord_num = mba_basket.reset_index().order_id.nunique()
print('There are {} orders with 2-10 items.'.format(ord_num))

In [None]:
ord_ids = mba_basket.reset_index().order_id.unique()

**Because the dataset is so large, our kernel keeps crashing, so let's save the `list_of_lists` so we don't have to run the above code and create it every time after the kernel crashes.**

In [None]:
with open('list_of_items.pickle', 'rb') as f:
    list_of_lists = pickle.load(f)
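The cell that builds and saves `list_of_lists` is not shown here. One way it could be done, as a sketch under the assumption that the source table has one row per order and product like `mba_basket`, is to group product names by order and pickle the result. The `toy_basket` data and the temporary file path are made up for the demo.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for `mba_basket`: one row per (order, product)
toy_basket = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2, 3],
    "product_name": ["Banana", "Milk", "Banana", "Bread", "Milk", "Bread"],
})

# One inner list of product names per order
list_of_lists = toy_basket.groupby("order_id")["product_name"].apply(list).tolist()

# Save once (a temporary directory is used here only for the demo) ...
path = os.path.join(tempfile.mkdtemp(), "list_of_items.pickle")
with open(path, "wb") as f:
    pickle.dump(list_of_lists, f)

# ... and reload it after a kernel restart instead of rebuilding it
with open(path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded)
```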

**TransactionEncoder() takes in a list of lists of the products in each transaction and converts it into a format where each column is a unique item. If the dataset contains small transactions but lots of unique products, representing the data in sparse format can save memory.**

In [None]:
%%time
te = TransactionEncoder()
te_ary = te.fit_transform(list_of_lists)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()

In [None]:
df.head()

In [None]:
%%time
df = dd.from_pandas(df, npartitions=1+df.memory_usage().sum()//1000000000)

**Let's return the items and itemsets with at least 10% support.**

In [None]:
%%time
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)

**Let's filter the results for itemsets of size 2 that have a support of at least 40%.**

In [None]:
frequent_itemsets['size'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets.head()

In [None]:
frequent_itemsets[(frequent_itemsets['size'] == 2) & (frequent_itemsets['support'] >= 0.4)]

### Method 3: Apyori

(https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/)

In [None]:
# Import the apriori function itself; the apyori module is not callable
from apyori import apriori

**The Apyori library requires the dataset to be a list of lists, where the whole dataset is one outer list and each transaction is an inner list. We already converted our dataframe into this form in method 2.**

**The `apriori` function from the apyori library has several parameters that can be altered, such as the support threshold, the confidence threshold, the lift threshold, and the minimum number of items included in the rules.**

In [None]:
association_rules = apriori(list_of_lists, min_support=0.05, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)

In [None]:
print('There are {} association rules mined by the Apriori algorithm.'.format(len(association_results)))

In [None]:
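# Iterate over the materialized list; the generator is exhausted once list() has consumed it
for item in association_results:
    pair = item[0]                                # items in the rule
    items = [x for x in pair]
    print('Rule: ' + items[0] + '->' + items[1])
    print('Support: ' + str(item[1]))
    print('Confidence: ' + str(item[2][0][2]))    # first ordered statistic
    print('Lift: ' + str(item[2][0][3]))
    print('\n')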

## Conclusion/Findings

The output of the analysis reflects how frequently items co-occur in transactions. This is a function both of the strength of association between the items, and the way the site owner has presented them.

To say that in a different way: items might co-occur not because they are “naturally” connected, but because we, the people in charge of the site, have presented them together.

This is an example of a more general problem in web analytics: our data reflects the way users behave, and the way we have encouraged them to behave, by the website design decisions we have made.

There are a number of ways we can use the data to drive site organisation:

- Large clusters of co-occurring items should probably be placed in their own category / theme
- Item pairs that commonly co-occur should be placed close together within broader categories on the website. This is especially important where one item in a pair is very popular, and the other item is very high margin.
- Long lists of rules (including ones with low support and confidence) can be used to put recommendations at the bottom of product pages and on product cart pages. The only thing that matters for these rules is that the lift is greater than one. (And that we pick those rules that are applicable for each product with the high lift where the product recommended has a high margin.)
- In the event that doing the above (3) drives significant uplift in profit, it would strengthen the case to invest in a recommendation system that uses a similar algorithm in an operational context to power an automatic recommendation engine on your website.

Using the data for targeted marketing
- The same results can be used to drive targeted marketing campaigns. For each user, we pick a handful of products, based on what they have bought to date, that have both a high uplift and a high margin, and reach them through, e.g., a personalized email or display ads.
- How we use the analysis has significant implications for the analysis itself: if we are feeding the analysis into a machine-driven process for delivering recommendations, we are much more interested in generating an expansive set of rules. If, however, we are experimenting with targeted marketing for the first time, it makes much more sense to pick a handful of particularly high value rules, and action just them, before working out whether to invest in the effort of building out that capability to manage a much wider and more complicated rule set.