<div id="container" style="position:relative;">
<div style="float:left"><h1> Market Basket Analysis </h1></div>
<div style="position:relative; float:right"><img style="height:65px" src ="https://drive.google.com/uc?export=view&id=1EnB0x-fdqMp6I5iMoEBBEuxB_s7AmE2k" />
</div>
</div>

So far, we have seen supervised learning and unsupervised learning. Another kind of learning is *rules-based* learning, which arguably belongs to the machine learning family. We will also briefly cover another type, reinforcement learning, in the future.

In rules-based learning, rather than learning labels as in supervised learning, clusters as in unsupervised learning, or relationships as in regression, we are trying to learn rules of association between objects.

A rule might look like: '*If* a customer bought a plane ticket, *then* they will buy a hotel room.' This example might be obvious, but consider a well-known (possibly apocryphal) association rule:

*If* the customer is male, aged 20-40, and buys diapers between 5 and 7pm, *then* the customer will also buy beer.

Mining retail datasets like this is done to find a number of relations:

* **Complementary products**: products which are often bought together, like chips and salsa
* **Substitute products**: products which replace each other, like Coke and Pepsi
* **Trigger products**: products which when bought, trigger other purchases
* **Common Baskets**: combinations of products that are often bought together

This kind of data is gold to retailers: We can design promos where one complement is discounted and the rest of the items are marked up, offer discounts for commonly bought items, plan store layout, recommend items, and promote cross-sell and upsell.

### Coffee Preferences

We have set up a store which only sells three items: Coffee, Milk and Sugar. Our basket types are thus all combinations between the three items.

We have a dataset of 'baskets' - you can download [the data from here](https://drive.google.com/uc?export=view&id=1iI1IJZXlC0WgcSzQv40vl2fpXm-22aW-). Each row records one item from a transaction that came through our system:

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import pandas as pd
basket_df = pd.read_csv("data/baskets.csv")
basket_df.head(10)

Unnamed: 0,transaction_id,item
0,916369,coffee
1,916369,milk
2,916369,sugar
3,743789,coffee
4,743789,milk
5,743789,sugar
6,169588,coffee
7,169588,milk
8,169588,sugar
9,723327,coffee


In this dataframe, every item that was purchased has its transaction id associated with it, but we would like to group items into complete transactions.

---
### Exercise 1

Group the dataframe into a pandas series so that each transaction id indexes a list of the items purchased with it.

e.g.:

```
transaction_id
2376      [coffee, milk, sugar]
3688                   [coffee]
10266            [coffee, milk]
26740     [coffee, milk, sugar]
40073     [coffee, milk, sugar]
                  ...          
```

#### Solution

We group by transaction id and apply the `list` function to each group:

In [9]:
# Collect the items of each transaction into a list
# (.apply(list) would give the same result here)
my_baskets_series = basket_df.groupby('transaction_id')['item'].agg(list)
my_baskets_series

transaction_id
2376      [coffee, milk, sugar]
3688                   [coffee]
10266            [coffee, milk]
26740     [coffee, milk, sugar]
40073     [coffee, milk, sugar]
                  ...          
961226                 [coffee]
983283    [coffee, milk, sugar]
986306    [coffee, milk, sugar]
996478    [coffee, milk, sugar]
999182           [coffee, milk]
Name: item, Length: 100, dtype: object

------------

This gives us a series of lists where the index of every list is its transaction id. We don't care too much about the transaction ids, so we will just grab the values of that series. The values come back as a numpy array of lists (**NOTE: This is not the same as a 2D numpy array**).

In [10]:
my_baskets = my_baskets_series.values
my_baskets

array([list(['coffee', 'milk', 'sugar']), list(['coffee']),
       list(['coffee', 'milk']), list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']), list(['coffee']),
       list(['coffee']), list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']), list(['coffee', 'sugar']),
       list(['coffee', 'milk', 'sugar']), list(['coffee', 'sugar']),
       list(['coffee']), list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']), list(['coffee']),
       list(['coffee', 'milk']), list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']), list(['coffee']),
       list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk', 'sugar']),
       list(['coffee', 'milk',

This is not nice data! We can't just put it into a dataframe - there is a different number of items bought each time.

In R, the standard package to analyse this type of data is `arules`, but it is not very user-friendly. As we have been working with `scikit-learn`, let's stick to the same ecosystem.

Unfortunately, `scikit-learn` does not have anything built in either - but we can use the `mlxtend` package, which contains a number of extensions to `scikit-learn`.

### Using mlxtend

We follow the same basic API in `mlxtend` as in `scikit-learn`. We have preprocessing, along with models that we use to fit, transform and predict.

In [11]:
from mlxtend.preprocessing import TransactionEncoder

In [12]:
te = TransactionEncoder()
coffee = te.fit_transform(my_baskets)
coffee_df = pd.DataFrame(coffee, columns=te.columns_)
coffee_df.head()

Unnamed: 0,coffee,milk,sugar
0,True,True,True
1,True,False,False
2,True,True,False
3,True,True,True
4,True,True,True


Great, so we now have a data frame with one dummy-variable column for each item in our store. Each row is a basket. In association rules, we normally ignore the quantity of each item bought - though it is possible to add a 'fake item' flag to tag larger purchases of items.
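To make the encoding step concrete, here is a minimal sketch of the same idea written by hand in pandas. The three toy baskets and the variable names are illustrative assumptions rather than the tutorial's data, and this is not a claim about how `mlxtend` implements its encoder internally.

```python
import pandas as pd

# Toy baskets of different lengths (illustrative, not the tutorial's dataset)
toy_baskets = [["coffee", "milk", "sugar"], ["coffee"], ["coffee", "milk"]]

# Column order: every distinct item, sorted alphabetically
items = sorted({item for basket in toy_baskets for item in basket})

# One row per basket, one boolean column per item
toy_df = pd.DataFrame(
    [[item in basket for item in items] for basket in toy_baskets],
    columns=items,
)
print(toy_df)
```

Each ragged list becomes a fixed-width row of True/False flags, which is what lets us store baskets of different sizes in a single rectangular table.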

What if we have a large store/dataset? Our columns might number in the tens of thousands. Amazon sells over 500 million distinct items. We will discuss solutions shortly - for now let's keep working on our coffee shop data.

First, we can do some simple exploratory data analysis:

In [14]:
import matplotlib.pyplot as plt
import seaborn as sns

# How many baskets contain each item:
print(coffee_df.sum(axis = 0))

coffee    92
milk      71
sugar     61
dtype: int64


In [15]:
# make a co-occurrence table:
co_occurrence = pd.DataFrame({"coffee": [0,0,0],
                             "milk": [0,0,0],
                             "sugar": [0,0,0]},
                            index = ["coffee", "milk", "sugar"])
co_occurrence

Unnamed: 0,coffee,milk,sugar
coffee,0,0,0
milk,0,0,0
sugar,0,0,0


In [16]:
# Iterate over each row
for index, row in coffee_df.iterrows():
    # For each item combination
    for item1 in ["coffee", "milk", "sugar"]:
        for item2 in ["coffee", "milk", "sugar"]:
            # If both are true, add one to the associated cell in the co-occurrence table
            if row[item1] and row[item2]:
                co_occurrence.loc[item1, item2] += 1
co_occurrence

Unnamed: 0,coffee,milk,sugar
coffee,92,63,55
milk,63,71,58
sugar,55,58,61


In [17]:
# Turn the counts into proportions of all baskets
co_occurrence/coffee_df.shape[0]

Unnamed: 0,coffee,milk,sugar
coffee,0.92,0.63,0.55
milk,0.63,0.71,0.58
sugar,0.55,0.58,0.61


In [19]:
# Can also be accomplished with some linear algebra magic
ctable = coffee_df.T.dot(coffee_df.astype('int'))
ctable = ctable/coffee_df.shape[0]
ctable

Unnamed: 0,coffee,milk,sugar
coffee,0.92,0.63,0.55
milk,0.63,0.71,0.58
sugar,0.55,0.58,0.61


By eyeballing the data, we can see that the coffee is most common, milk is less common, and sugar is the least common. It is hard to tell if there are any rules about which products co-occur in purchases.

This is where association rules come in.

### Apriori Algorithm

Again, the problem of a large number of items rears its head. What we want to do is to create all possible combinations of items, then see which items are most commonly also purchased, given that one of these combinations has been purchased.

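We can see that for a number of objects, $n$, in our store, there are $\binom{n}{x}$ possible baskets of size $x$ (on the order of $n^x$ when $x$ is small compared with $n$), and $2^n - 1$ possible non-empty baskets in total.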
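As a quick check on how fast this grows, the number of distinct baskets of size $x$ is the binomial coefficient "n choose x", which is bounded above by $n^x$. The short sketch below counts the possibilities for our three-item store and for a hypothetical store with 100 items.

```python
from math import comb

n_items = 3  # coffee, milk, sugar

# Number of distinct baskets of each size: "n choose size"
for size in range(1, n_items + 1):
    print(size, comb(n_items, size))

# Every non-empty subset of the items is a possible basket
total_baskets = 2 ** n_items - 1
print(total_baskets)  # 7

# A hypothetical store with 100 items
print(comb(100, 3))   # 161700 possible baskets of size 3
print(2 ** 100 - 1)   # all possible non-empty baskets
```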

The method for creating our baskets is called the Apriori algorithm (Agrawal & Srikant, 1994<sup>[1](http://www.vldb.org/conf/1994/P487.PDF)</sup>). Several more efficient methods have since been proposed (FP-Growth, for example), but we will stick with Apriori for now. [Wikipedia has the exact details](https://en.wikipedia.org/wiki/Apriori_algorithm).

The idea is that we take a threshold occurrence, $C$, and find all individual items with an occurrence of at least $C$. Any items that fall below our threshold are removed from further analysis, because any basket containing a rare item must itself be rare. We then go up a level and find all _pairs_ of non-excluded items, and use the same threshold to exclude pairs. We can progressively widen the number of items in our sets while avoiding some of the explosion of computation size with sensible exclusions.
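The level-by-level search can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the `mlxtend` implementation: the function name `apriori_sketch`, the toy baskets, and the use of a support proportion as the threshold are all choices made for this example.

```python
from itertools import combinations


def apriori_sketch(baskets, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(basket) for basket in baskets]
    n_transactions = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / n_transactions

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: individual items that meet the threshold
    all_items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([item]) for item in all_items})

    frequent = {}
    size = 1
    while current:
        frequent.update(current)
        size += 1
        # Build candidates of the next size from the surviving itemsets only
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune candidates that contain any infrequent subset of size - 1
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        current = keep_frequent(candidates)
    return frequent


# Toy data (illustrative only)
toy_baskets = [
    ["coffee", "milk", "sugar"],
    ["coffee"],
    ["coffee", "milk"],
    ["coffee", "sugar"],
    ["milk", "sugar"],
]
result = apriori_sketch(toy_baskets, min_support=0.3)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a threshold of 0.3, the three single items and the three pairs survive, while the full three-item basket (present in one transaction out of five) is dropped.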

This is implemented in mlxtend:

In [20]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules



In [21]:
# Find all itemsets of up to 3 items that appear in at least 50% of baskets
apriori(coffee_df, min_support=0.5, use_colnames=True, max_len = 3)

Unnamed: 0,support,itemsets
0,0.92,(coffee)
1,0.71,(milk)
2,0.61,(sugar)
3,0.63,"(milk, coffee)"
4,0.55,"(sugar, coffee)"
5,0.58,"(milk, sugar)"
6,0.52,"(milk, sugar, coffee)"


In [22]:
# Raise the threshold: only itemsets that appear in at least 70% of baskets
apriori(coffee_df, min_support=0.7, use_colnames=True, max_len = 3)

Unnamed: 0,support,itemsets
0,0.92,(coffee)
1,0.71,(milk)


We can see the *support* score is the proportion of baskets that contain the given item, or combinations of items.

In the case here with setting a minimum support of 0.5, all of our combinations are returned. In a larger example, we would start to find interesting co-occurrences here.

The output is a pandas dataframe, so we can filter, sort, *etc.* as desired. The `itemsets` column is of object datatype, and contains frozensets:


In [23]:
x = apriori(coffee_df, min_support=0.5, use_colnames=True, max_len = 3)
x['length'] = x['itemsets'].apply(len)
x

Unnamed: 0,support,itemsets,length
0,0.92,(coffee),1
1,0.71,(milk),1
2,0.61,(sugar),1
3,0.63,"(milk, coffee)",2
4,0.55,"(sugar, coffee)",2
5,0.58,"(milk, sugar)",2
6,0.52,"(milk, sugar, coffee)",3


### Determining Rules

Once our data is in the format above, we can begin to determine association rules.

Here, we calculate several metrics to analyse the rules. These are calculated automatically by the package, but we will take time to understand them.

First, all of our groups are designated as 'antecedents' and 'consequents'. This allows us to say: 'given this group of antecedents, we see this group of consequents with frequency x'. We will designate antecedents as $X$ and consequents as $Y$ below.

Let's make some rules for illustration of these measures:

In [24]:
from mlxtend.frequent_patterns import association_rules

x = apriori(coffee_df, min_support=0.5, use_colnames=True)

#take a look at the help for ways we can use this function
association_rules(x, metric="lift", min_threshold=1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(milk),(sugar),0.71,0.61,0.58,0.816901,1.339183,0.1469,2.13
1,(sugar),(milk),0.61,0.71,0.58,0.95082,1.339183,0.1469,5.896667
2,"(milk, coffee)",(sugar),0.63,0.61,0.52,0.825397,1.35311,0.1357,2.233636
3,"(sugar, coffee)",(milk),0.55,0.71,0.52,0.945455,1.331626,0.1295,5.316667
4,(milk),"(sugar, coffee)",0.71,0.55,0.52,0.732394,1.331626,0.1295,1.681579
5,(sugar),"(milk, coffee)",0.61,0.63,0.52,0.852459,1.35311,0.1357,2.507778


We have already calculated **support**: how often an item, or group of items, occurs in the dataset.

$$  \text{support}(X \cup Y) = \frac{\text{number of transactions containing both } X \text{ and } Y}{\text{total number of transactions}} $$

If items are not related, we would expect support of one to be independent of the support of the other. If item $X$, our antecedent, occurs in 0.7 of baskets, and item $Y$, our consequent, occurs in 0.6, we expect them to occur together in 0.7 * 0.6 = 0.42 (or 42%) of baskets.

If the proportion is higher, then we have items which are occurring at a higher frequency than expected - this might indicate that we have a useful association rule. If `milk`, then `sugar`.

**Confidence** is a measure of how likely the consequent is to be in a basket, given that the antecedent is already in it. It is calculated by dividing the support of our antecedent and consequent together by the support of our antecedent alone:

$$ \text{confidence}(X\rightarrow Y) = \frac{\text{support}(X\cup Y)}{\text{support}(X)} = \frac{\text{proportion of transactions with X and Y}}{\text{proportion of transactions with X}}$$

A confidence of 1 indicates that our consequent is always bought with our antecedent. If $X$ and $Y$ are unrelated, we expect $\text{support}(X \cup Y) = \text{support}(X) \times \text{support}(Y)$, and therefore $\text{confidence}(X\rightarrow Y) = \text{support}(Y)$. A confidence close to $\text{support}(Y)$ thus suggests no relationship between the two, and a lower value suggests that they may be substitutes.

**Lift** measures a similar idea: how much we have _lifted_ the purchase likelihood of the consequent by having the antecedent included in our basket. A value of 1 represents no increase.

$$ \text{lift}(X\rightarrow Y) = \frac{\text{confidence}(X\rightarrow Y)}{\text{support}(Y)} = \frac{\frac{\text{support}(X\cup Y)}{\text{support}(X)}}{\text{support}(Y)} = \frac{\text{support}(X\cup Y)}{\text{support}(X)\times\text{support}(Y)}$$

We can think of lift as measuring how much more often $X$ and $Y$ occur together than expected if their purchase frequency were independent.

**Leverage** is the difference between the observed support of the combined group and the support that would be expected if the antecedent and consequent were independent:

$$\text{leverage}(X\rightarrow Y) = \text{support}(X\cup Y) - \text{support}(X) \times \text{support}(Y)$$

**Conviction** is a measure of the dependence of the consequent on the antecedent. It compares the proportion of baskets we would expect to contain $X$ without $Y$ if they were independent with the actual proportion of baskets containing $X$ without $Y$. A high value indicates that we rarely see the antecedent purchased without the consequent:

$$\text{conviction}(X\rightarrow Y) = \frac{1 - \text{support}(Y)}{1 - \text{confidence}(X\rightarrow Y)} = \frac{\text{proportion with $X$}\times\text{proportion without $Y$}}{\text{proportion with $X$ and without $Y$}}$$
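As a worked check of the formulas above, the sketch below recomputes the metrics for the rule milk $\rightarrow$ sugar from three supports read off the earlier tables (0.71 for milk, 0.61 for sugar, 0.58 for both together). The helper name `rule_metrics` is an assumption made for this example; it is not part of `mlxtend`.

```python
def rule_metrics(support_x, support_y, support_xy):
    """Compute the association-rule metrics for X -> Y from three supports."""
    confidence = support_xy / support_x
    lift = confidence / support_y
    leverage = support_xy - support_x * support_y
    # Conviction is undefined (treated here as infinite) when confidence is exactly 1
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_y) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# milk -> sugar, using the supports from the coffee dataset
metrics = rule_metrics(support_x=0.71, support_y=0.61, support_xy=0.58)
for name, value in metrics.items():
    print(name, round(value, 6))
```

The results agree with the first row of the rules table: confidence of about 0.8169, lift of about 1.3392, leverage of 0.1469 and conviction of 2.13.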

### How to interpret

We have a range of metrics - how do we decide which to report, and what is significant?

We would normally report the support, lift and confidence.

**Support** allows us to see how often the basket occurs. We don't want to waste our time promoting strong links between items if only a few people buy them.

**Confidence** allows us to see the strength of the rule. What proportion of transactions with our first item also contain the other item (or items)?

**Lift** can be interpreted as a measure of how much the relationship potentially drives up sales of the consequent. In theory, it is the factor by which the purchase likelihood of the consequent increases when the antecedent is in the basket.

In practice, we would start with all rules with lift above 1, and drill down into the pricing, sales, and desires of our store.
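A shortlist like this is a plain pandas filter. In the sketch below, three rules from the coffee table are typed in by hand, with the itemsets simplified to strings, and the support and confidence cut-offs are hypothetical values chosen only to illustrate the filter.

```python
import pandas as pd

# Three rules copied from the coffee rules table (itemsets simplified to strings)
rules = pd.DataFrame({
    "antecedents": ["milk", "sugar", "sugar, coffee"],
    "consequents": ["sugar", "milk", "milk"],
    "support": [0.58, 0.58, 0.52],
    "confidence": [0.816901, 0.950820, 0.945455],
    "lift": [1.339183, 1.339183, 1.331626],
})

# Hypothetical cut-offs for this illustration
min_support = 0.5
min_confidence = 0.9

# Keep rules that beat chance, are common enough, and are reliable enough
mask = (
    (rules["lift"] > 1)
    & (rules["support"] >= min_support)
    & (rules["confidence"] >= min_confidence)
)
shortlist = rules[mask].sort_values("lift", ascending=False)
print(shortlist)
```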

---
Additional association rule metrics: leverage and conviction are less common options for assessing the strength of the co-occurrence relationship.

**Leverage** computes the difference between the observed frequency of X and Y appearing together and the frequency that would be expected if X and Y were independent. A leverage value of 0 indicates independence.

The rationale in a sales setting is to find out how many more units (items X and Y together) are sold than expected from the independent sales.

**Conviction** looks at the ratio of the expected frequency that the rule makes an incorrect prediction if X and Y were independent, divided by the observed frequency of incorrect predictions.

If the conviction value is greater than 1, then incorrect predictions occur less often than they would if these two actions were independent. A conviction of 1.5, for example, would indicate that if the variables were independent, the prediction would be incorrect 50% more often.

### Caveats

We can see from our table above that lift and leverage are reversible, whereas conviction and confidence are not (*i.e.* $\text{lift}(X\rightarrow Y) = \text{lift}(Y\rightarrow X)$, whereas in general $\text{confidence}(X\rightarrow Y) \neq \text{confidence}(Y\rightarrow X)$).

We need to be careful about inferring causation from lift or leverage: we cannot say that the purchase of $X$ caused the purchase of $Y$ or vice versa, only whether they occur together more frequently than chance would suggest.

Similarly, confidence needs to be interpreted carefully. If we have two baskets with kiwis and diamonds, and two with just diamonds, our confidence for kiwi $\rightarrow$ diamonds is 1! Our confidence for diamonds $\rightarrow$ kiwi is 0.5.
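The kiwi and diamond example can be checked in a few lines of plain Python. The helper functions below are written for this illustration only; they simply apply the support, confidence and lift formulas to the four baskets described above.

```python
# Two baskets with kiwis and diamonds, two with diamonds only
baskets = [
    {"kiwi", "diamond"},
    {"kiwi", "diamond"},
    {"diamond"},
    {"diamond"},
]


def support(items):
    # Fraction of baskets containing every item in `items`
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(x, y):
    return support(x | y) / support(x)


def lift(x, y):
    return confidence(x, y) / support(y)


kiwi, diamond = {"kiwi"}, {"diamond"}
print(confidence(kiwi, diamond))  # 1.0
print(confidence(diamond, kiwi))  # 0.5
print(lift(kiwi, diamond))        # 1.0
print(lift(diamond, kiwi))        # 1.0
```

The confidence of 1 looks impressive, but the lift of 1 shows that kiwis tell us nothing here: every basket contains diamonds regardless of what else is in it.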

### Working with larger data

Creating the groups of all items is extremely expensive in larger sizes, and this is a constant problem in association rules.

We could pool by product category - if all we want to predict is what categories of items go together, we could pool all game consoles, cables of a given type, or pastas.

We could also run our model for each subcategory independently: when mining rules for pasta, we could turn all non-pasta basket items into their categories, and see whether a customer who bought a cheese item will also buy rigatoni.
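Pooling by category amounts to mapping each item to its category before building the baskets. The sketch below assumes a hypothetical item-to-category lookup (`category_of`) and a handful of made-up transactions; a real store would take this mapping from its product catalogue.

```python
import pandas as pd

# Hypothetical item-level transactions
transactions = pd.DataFrame({
    "transaction_id": [1, 1, 1, 2, 2, 3],
    "item": ["rigatoni", "penne", "cheddar", "penne", "parmesan", "rigatoni"],
})

# Hypothetical item -> category lookup
category_of = {
    "rigatoni": "pasta",
    "penne": "pasta",
    "cheddar": "cheese",
    "parmesan": "cheese",
}

# Replace each item by its category
transactions["category"] = transactions["item"].map(category_of)

# One basket per transaction, with each category listed once
category_baskets = (
    transactions.groupby("transaction_id")["category"]
    .apply(lambda s: sorted(set(s)))
)
print(category_baskets)
```

The resulting series has the same shape as the item-level baskets, so it can be passed to the encoder in the same way, but with far fewer columns.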

One piece that can help is working with sparse matrices, which is implemented in this package. As well as scipy's sparse module, pandas has limited support for sparse matrices.

In general, it is much more efficient to work in scipy and call `todense()` or `toarray()` only when needed. We can, however, take advantage of pandas DataFrame functionality and build easy-to-use dataframes with column names using the `DataFrame.sparse.from_spmatrix()` method. This will still consume more memory than a pure scipy sparse matrix, so be careful when you have large datasets.

In [25]:
# Fit the TransactionEncoder and transform the baskets into a sparse matrix
te = TransactionEncoder()
coffee = te.fit_transform(my_baskets, sparse=True)

In [26]:
display(coffee)
print('')
display(type(coffee))

<100x3 sparse matrix of type '<class 'numpy.bool_'>'
	with 224 stored elements in Compressed Sparse Row format>




scipy.sparse.csr.csr_matrix

In [27]:
# Create a DataFrame
coffee_df = pd.DataFrame.sparse.from_spmatrix(coffee, columns=te.columns_)
coffee_df.head()

Unnamed: 0,coffee,milk,sugar
0,True,1,1
1,True,0,0
2,True,1,0
3,True,1,1
4,True,1,1


In [28]:
coffee_df.dtypes

coffee    Sparse[bool, 0]
milk      Sparse[bool, 0]
sugar     Sparse[bool, 0]
dtype: object

We can see that the columns are using the pandas Sparse dtypes. These are not exactly equivalent to a standard scipy sparse matrix, but they do help to reduce memory usage somewhat. You can read more about pandas sparse data structures [here](https://pandas.pydata.org/docs/user_guide/sparse.html).
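To get a feel for the potential saving, the sketch below compares the memory used by a dense boolean matrix with its scipy CSR equivalent. The matrix is randomly generated and its size (1,000 baskets by 2,000 items, roughly 0.1% filled) is an assumption chosen to resemble a sparse retail dataset, not a measurement of the tutorial's data.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical transaction matrix: 1,000 baskets x 2,000 items, about 0.1% True
dense = rng.random((1000, 2000)) < 0.001

# The same data in Compressed Sparse Row format
csr = sparse.csr_matrix(dense)

# Dense storage: one byte per cell
dense_bytes = dense.nbytes

# Sparse storage: the stored values plus the two index arrays
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(dense_bytes, sparse_bytes)
```

The sparse version only pays for the cells that are True plus some indexing overhead, so the sparser the baskets, the larger the saving.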

### Recommender Systems

Once we have our rules, we can start to recommend items to customers. If we have the current basket, we can check our association rules for the next most common item, based on highest lift (or highest profit for our store). Stores like Amazon can use association rules to efficiently recommend the next $n$ items that we might purchase, based on lift, confidence, expected profit, or past purchases.

In reality, most current recommender systems will use a combination of approaches (*e.g.* collaborative filtering, clustering, *etc.*) depending on their requirements and intent. Association rules can be used for promotions, product placement, and coupons, as they take considerable time to generate and are specifically well-suited for these applications.

---
#### Exercise 2

Download the `Online Retail.csv` [here](https://drive.google.com/uc?export=view&id=1r2OjAHs6C27Z7Z59mex_XNZ5lbcsisp9). We've provided you with the code to read in the data, and transform it into a sparse DataFrame. Use the resulting DataFrame for this exercise:

1. Use the apriori algorithm (use min_support ~0.02) to reduce the dataset and then create association rules. Feel free to play around with different metrics to use as the threshold.

2. Sort your association rules dataframe by one of the metrics in descending order. Try looking up the product descriptions for the antecedent and consequent stockcodes for the first row in your rules dataframe and think of a reason behind this purchasing pattern.

3. Complete the function listed in a cell below. The function takes in a basket of items (stock codes), an association rules dataframe, and a specific association rules metric. The function will return a recommended item to purchase based on an item in the basket (Just like on Amazon!). Feel free to try it with various item combinations.

---

In [42]:
# Loading the data
retail_df = pd.read_csv('data/Online Retail.csv')

In [44]:
retail_df.shape

(541909, 8)

In [45]:
# Create our list of 'baskets' for use with the TransactionEncoder
basket_series = retail_df.groupby('InvoiceNo').apply(lambda x: list(x['StockCode']))
basket_series.head()

InvoiceNo
536365    [85123A, 71053, 84406B, 84029G, 84029E, 22752,...
536366                                       [22633, 22632]
536367    [84879, 22745, 22748, 22749, 22310, 84969, 226...
536368                         [22960, 22913, 22912, 22914]
536369                                              [21756]
dtype: object

#### Solution

First, we transform the data with the encoder:

In [46]:
from mlxtend.preprocessing import TransactionEncoder

# Transform our basket series into a transaction matrix
te = TransactionEncoder()
transaction_matrix = te.fit_transform(basket_series, sparse=True)

# Convert to dataframe
transaction_df = pd.DataFrame.sparse.from_spmatrix(transaction_matrix,
                                                  columns = te.columns_)
transaction_df.head()

Unnamed: 0,10002,10080,10120,10123C,10123G,10124A,10124G,10125,10133,10134,...,M,PADS,POST,S,gift_0001_10,gift_0001_20,gift_0001_30,gift_0001_40,gift_0001_50,m
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, we use the apriori algorithm:

In [47]:
# Q1
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

x = apriori(transaction_df, min_support=0.02, use_colnames=True)
x

Unnamed: 0,support,itemsets
0,0.020193,(15036)
1,0.027181,(20685)
2,0.020541,(20711)
3,0.033668,(20712)
4,0.026023,(20713)
...,...,...
215,0.022471,"(23203, 85099B)"
216,0.021197,"(23301, 23300)"
217,0.022896,"(85099B, 85099C)"
218,0.021042,"(85099F, 85099B)"


We can look at the association rules and calculate the metrics we studied earlier. The table below is sorted by `lift`.

In [48]:
# Q2
rules_df = association_rules(x, metric='lift', min_threshold=1)
rules_df.sort_values('lift', ascending=False).head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
72,"(22699, 22697)",(22698),0.03027,0.030965,0.021197,0.700255,22.614223,0.02026,3.232865
73,(22698),"(22699, 22697)",0.030965,0.03027,0.021197,0.684539,22.614223,0.02026,3.074005
75,(22697),"(22698, 22699)",0.040811,0.023707,0.021197,0.519395,21.909313,0.020229,2.031382
70,"(22698, 22699)",(22697),0.023707,0.040811,0.021197,0.894137,21.909313,0.020229,9.060649
65,(23300),(23301),0.029537,0.035676,0.021197,0.717647,20.115865,0.020143,3.415315
64,(23301),(23300),0.035676,0.029537,0.021197,0.594156,20.115865,0.020143,2.391222
74,(22699),"(22698, 22697)",0.043243,0.024865,0.021197,0.490179,19.713703,0.020122,1.912699
71,"(22698, 22697)",(22699),0.024865,0.043243,0.021197,0.852484,19.713703,0.020122,6.485804
50,(22698),(22697),0.030965,0.040811,0.024865,0.802993,19.675976,0.023601,4.868796
51,(22697),(22698),0.040811,0.030965,0.024865,0.609272,19.675976,0.023601,2.480072


Take a look at the product descriptions for the products that are part of the first association rule (i.e. StockCode = 22699, 22697, 22698).

In [49]:
retail_df[retail_df['StockCode'].isin(['22699','22697','22698'])]['Description'].unique()

array(['ROSES REGENCY TEACUP AND SAUCER ',
       'GREEN REGENCY TEACUP AND SAUCER',
       'PINK REGENCY TEACUP AND SAUCER', nan], dtype=object)

Finally, we complete the recommendation function below:

In [50]:
#RUN THIS CODE FIRST FOR Q3
# Convert the itemsets to lists so that they are easier to inspect and index
rules_df['antecedents'] = rules_df['antecedents'].apply(list)
rules_df['consequents'] = rules_df['consequents'].apply(list)

In [56]:
import numpy as np

# Input basket
mybasket = ['22698', '22697', '22356']

# Metric used to rank the rules
metric = 'lift'

def product_recs(basket, rule_df, metric):
    """Recommend the consequents of a rule whose antecedent contains a random basket item."""

    # Randomly select an item from the basket
    random_item = np.random.choice(basket, 1)[0]
    print('Selected item in basket')
    print(random_item)
    print('======================')

    # Find rules where the item appears anywhere in the antecedent
    rule_filter = rule_df['antecedents'].apply(lambda items: random_item in items)

    # Keep those rules, strongest first according to the selected metric
    filtered_df = rule_df[rule_filter].sort_values(by=metric, ascending=False)

    # No rule mentions this item (e.g. it fell below min_support)
    if filtered_df.empty:
        print('No recommendation found for this item.')
        return None

    # Randomly return one of the top 20 rules from the filtered dataframe
    reco = filtered_df.head(20).sample(1)['consequents']

    print('Recommended product...')

    return reco



In [57]:
product_recs(mybasket, rules_df, metric)

Selected item in basket
22698
Recommended product...


54    [22699]
Name: consequents, dtype: object

<div id="container" style="position:relative;">
<div style="position:relative; float:right"><img style="height:25px; width:50px" src ="https://drive.google.com/uc?export=view&id=14VoXUJftgptWtdNhtNYVm6cjVmEWpki1" />
</div>
</div>