This notebook builds a simple recommendation system using association rule mining: from historical invoices, we find items that are likely to be bought together.

The Apriori algorithm is a popular choice for these kinds of problems, and its implementation is simple with the help of Python libraries such as mlxtend. Here's a step-by-step approach:

## Step 1: Import Necessary Libraries

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## Step 2: Load the Dataset

In [2]:
df = pd.read_csv('online_retail_II.csv')

In [3]:
df

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.10,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom
...,...,...,...,...,...,...,...,...
1067366,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
1067367,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
1067368,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
1067369,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


## Step 3: Data Preprocessing
Clean the data by dropping rows with a missing Description or a missing invoice number. Also remove return orders, which appear as rows with a negative quantity.

In [4]:
# Drop rows missing a description or an invoice number in a single pass.
# Reassigning (instead of inplace=True) avoids pandas' SettingWithCopyWarning.
df = df.dropna(subset=['Description', 'Invoice'])
# Keep only purchases; returns have a negative quantity.
df = df[df['Quantity'] > 0]
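To see what each filter removes, here is a minimal sketch on a made-up five-row frame. The rows and the name `toy` are illustrative assumptions, not part of the retail dataset.

```python
import pandas as pd

# Toy rows standing in for the retail data (values are invented for illustration).
toy = pd.DataFrame({
    "Invoice": ["1001", "1001", "C1002", None, "1003"],
    "Description": ["MUG", None, "MUG", "PLATE", "PLATE"],
    "Quantity": [2, 1, -2, 3, 4],
})

# Rows with a missing Description or Invoice are dropped first.
clean = toy.dropna(subset=["Description", "Invoice"])
# Rows with a non-positive quantity (returns) are dropped next.
clean = clean[clean["Quantity"] > 0]

print(clean)
```

Only the rows for invoices 1001 (MUG) and 1003 (PLATE) survive: one row is lost to a missing description, one to a missing invoice number, and one to a negative quantity.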


## Step 4: Build the Basket
Create a basket: a table with one row per invoice and one column per product, holding the quantity bought. Because the full dataset is large, we restrict it to a single country to keep the computation manageable. We use Germany for this example. Filtering by country also makes sense because customers often have different purchase behaviour in different countries.

In [5]:
basket = (df[df['Country'] == "Germany"]
          .groupby(['Invoice', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('Invoice'))
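The reshaping above can be easier to follow on a tiny example. The sketch below uses invented rows and passes `fill_value=0` to `unstack`, which is an alternative to the `fillna(0)` call used in the cell above.

```python
import pandas as pd

# Made-up invoice lines; "A1" buys MUG twice, so the quantities are summed.
toy = pd.DataFrame({
    "Invoice": ["A1", "A1", "A1", "A2", "A3", "A3"],
    "Description": ["MUG", "PLATE", "MUG", "MUG", "PLATE", "BOWL"],
    "Quantity": [1, 2, 3, 1, 1, 5],
})

# One row per invoice, one column per product, 0 where the product was not bought.
toy_basket = (toy.groupby(["Invoice", "Description"])["Quantity"]
                 .sum()
                 .unstack(fill_value=0))

print(toy_basket)
```

Each cell now holds the total quantity of one product on one invoice, which is the shape the encoding step expects.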

## Step 5: Encode the Data
The Apriori implementation in mlxtend expects one-hot encoded input, so we encode each cell as True if the product was bought on that invoice and False otherwise.

In [6]:
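def encode_units(x):
    # Any positive quantity means the product was bought on this invoice.
    # A single comparison covers every value, so nothing falls through to None.
    return x > 0

# Apply the encoding to every cell.
# On pandas 2.1 and later, DataFrame.map is the non-deprecated name for applymap.
basket_sets = basket.applymap(encode_units)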

## Step 6: Generate Frequent Itemsets
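Next, we generate frequent itemsets using the Apriori algorithm. The min_support parameter sets the minimum fraction of invoices in which an itemset must appear. Values between 0.05 and 0.25 are a common starting point, but the right value depends on the dataset. The final step is to generate the rules with their corresponding support, confidence and lift.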

In [7]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
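To make `min_support` concrete, the following sketch computes support by hand on a made-up boolean basket. It is not how mlxtend implements Apriori; it only illustrates what the threshold is compared against. The frame `toy_sets` and the threshold of 0.5 are assumptions chosen for the example.

```python
import pandas as pd

# Made-up one-hot basket: four invoices, three products.
toy_sets = pd.DataFrame(
    {
        "MUG": [True, True, False, True],
        "PLATE": [True, False, True, True],
        "BOWL": [False, False, True, False],
    },
    index=["A1", "A2", "A3", "A4"],
)

# Support of a single item = fraction of invoices that contain it.
item_support = toy_sets.mean()

# Support of a pair = fraction of invoices that contain both items.
pair_support = (toy_sets["MUG"] & toy_sets["PLATE"]).mean()

# Items that would pass a (hypothetical) min_support of 0.5.
min_support = 0.5
frequent_items = item_support[item_support >= min_support].index.tolist()

print(item_support)
print("support(MUG, PLATE) =", pair_support)
print("frequent items:", frequent_items)
```

Apriori relies on the fact that an itemset can only be frequent if all of its subsets are frequent, so BOWL, which falls below the threshold here, would not be considered in any larger itemset.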

In [8]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

In [9]:
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(6 RIBBONS RUSTIC CHARM),(POSTAGE),0.106464,0.798479,0.091255,0.857143,1.073469,0.006246,1.410646,0.076596
1,(POSTAGE),(6 RIBBONS RUSTIC CHARM),0.798479,0.106464,0.091255,0.114286,1.073469,0.006246,1.008831,0.339623
2,(JUMBO BAG WOODLAND ANIMALS),(POSTAGE),0.09379,0.798479,0.08365,0.891892,1.116988,0.008761,1.864068,0.115575
3,(POSTAGE),(JUMBO BAG WOODLAND ANIMALS),0.798479,0.09379,0.08365,0.104762,1.116988,0.008761,1.012256,0.519726
4,(LUNCH BAG WOODLAND),(POSTAGE),0.082383,0.798479,0.072243,0.876923,1.098242,0.006462,1.637357,0.097485


The results show the association rules generated from the market basket analysis. Let's understand what each column represents:

antecedents and consequents: These columns hold the sets of items involved in each rule. The antecedents are the items on the left-hand side of the rule ("if these are bought"), and the consequents are the items on the right-hand side ("then these are also bought").

antecedent support and consequent support: These columns show the support values for the antecedent and consequent sets, respectively. Support is the proportion of transactions that contain a specific itemset.

support: This column represents the support of the rule, which is the proportion of transactions that contain both the antecedent and consequent sets.

confidence: Confidence indicates the conditional probability of the consequent given the antecedent. It measures the reliability or strength of the rule.

lift: Lift is the ratio of the observed support to the expected support if the antecedent and consequent were independent. It indicates the strength of association between the antecedent and consequent.

leverage: Leverage measures the difference between the observed frequency of the antecedent and consequent appearing together and the frequency that would be expected if they were independent.

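conviction: Conviction measures how strongly the consequent depends on the antecedent. It is the ratio (1 - consequent support) / (1 - confidence), that is, how often the rule would be wrong if the two sides were independent divided by how often it is actually wrong. A value of 1 indicates independence, and larger values indicate a stronger rule.

zhangs_metric: Zhang's metric ranges from -1 to 1 and captures both association and dissociation. Positive values indicate that the antecedent and consequent tend to occur together, negative values indicate that they tend to exclude each other, and 0 indicates independence.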
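As a check on these definitions, the sketch below recomputes the metrics for the first rule in the table from its three support values. The function name `rule_metrics` is a hypothetical helper, and the support-based form of Zhang's metric is an assumption about how the column is computed. For this rule, the results agree with the printed row to about three decimals.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Compute metrics for a rule A -> C from three support values."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_c) / (1 - confidence)
    # Zhang's metric: leverage rescaled into the range [-1, 1] (assumed form).
    denom = max(support_ac * (1 - support_a),
                support_a * (support_c - support_ac))
    zhang = leverage / denom if denom else 0.0
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhang,
    }

# Supports copied from the first rule: (6 RIBBONS RUSTIC CHARM) -> (POSTAGE)
metrics = rule_metrics(support_a=0.106464, support_c=0.798479, support_ac=0.091255)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the inputs are the rounded values printed in the table, small differences in the last digits are expected.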

By analyzing these metrics, we can gain insights into the relationships between products and understand which products tend to be purchased together. For example, in the second rule (index 1), the antecedent POSTAGE and the consequent 6 RIBBONS RUSTIC CHARM have a support of 0.091255, a confidence of 0.114286, and a lift of 1.073469. The lift means that an invoice containing POSTAGE is about 1.07 times as likely to also contain 6 RIBBONS RUSTIC CHARM as an average invoice. This is only a weak association, since a lift of 1 corresponds to independence.

We can use these metrics to identify meaningful associations between products and make recommendations for cross-selling or product placement strategies.
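One way to narrow the rules down is to filter on confidence and lift and then sort. The sketch below rebuilds three rows of the table above by hand so that it runs on its own. The thresholds 0.8 and 1 are illustrative assumptions, not recommended values.

```python
import pandas as pd

# Three rules copied from the output above (only the columns used here).
rules_sample = pd.DataFrame({
    "antecedents": [
        frozenset({"6 RIBBONS RUSTIC CHARM"}),
        frozenset({"POSTAGE"}),
        frozenset({"JUMBO BAG WOODLAND ANIMALS"}),
    ],
    "consequents": [
        frozenset({"POSTAGE"}),
        frozenset({"6 RIBBONS RUSTIC CHARM"}),
        frozenset({"POSTAGE"}),
    ],
    "confidence": [0.857143, 0.114286, 0.891892],
    "lift": [1.073469, 1.073469, 1.116988],
})

# Keep rules that are both reliable (high confidence) and better than chance (lift > 1).
strong = rules_sample[
    (rules_sample["confidence"] >= 0.8) & (rules_sample["lift"] > 1)
].sort_values("lift", ascending=False)

print(strong)
```

With the full `rules` frame from the notebook, the same filter would be applied to `rules` instead of `rules_sample`.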

We can then pick an item and use the rules calculated above to suggest other items that are likely to land in the same basket.

In [10]:
# Specify the item for which you want to find possible items in the basket
item_name = "PLASTERS IN TIN WOODLAND ANIMALS"

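# Keep the rules whose antecedent set contains the item
filtered_rules = rules[rules['antecedents'].apply(lambda x: item_name in x)]
# Take one item from each consequent set (a multi-item consequent is reduced to a single, arbitrary item)
possible_items = filtered_rules['consequents'].apply(lambda x: list(x)[0])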

# Print the list of possible items
print(possible_items)

10                                POSTAGE
13    ROUND SNACK BOXES SET OF4 WOODLAND 
Name: consequents, dtype: object
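The lookup above keeps only one item per consequent set. A possible variant, sketched below, collects every consequent item, orders suggestions by lift, and removes duplicates. The helper `recommend`, its `top_n` parameter, and the toy rules are hypothetical and serve only to make the example self-contained.

```python
import pandas as pd

def recommend(rules_df, item_name, top_n=5):
    """Return up to top_n items suggested by rules whose antecedent contains item_name."""
    matches = rules_df[rules_df["antecedents"].apply(lambda s: item_name in s)]
    # Rules with the highest lift contribute their suggestions first.
    matches = matches.sort_values("lift", ascending=False)
    suggestions = []
    for consequent in matches["consequents"]:
        # Sort within a consequent so the output order is deterministic.
        for item in sorted(consequent):
            if item != item_name and item not in suggestions:
                suggestions.append(item)
    return suggestions[:top_n]

# Made-up rules in the same layout as the association_rules output.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"MUG"}), frozenset({"MUG", "PLATE"}), frozenset({"BOWL"})],
    "consequents": [frozenset({"PLATE"}), frozenset({"BOWL", "SPOON"}), frozenset({"MUG"})],
    "lift": [1.2, 2.0, 1.5],
})

suggested = recommend(toy_rules, "MUG")
print(suggested)
```

With the notebook's data, the call would be `recommend(rules, "PLASTERS IN TIN WOODLAND ANIMALS")`.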
