## Market Basket Analysis

### Step 1: Importing the required libraries

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

### Step 2: Loading and exploring the data

In [2]:
# Load the dataset
file_path = 'groceries.csv'
df = pd.read_csv(file_path)

In [3]:
# Display the first few rows of the dataset
df.head(10)

,Items
0,"citrus fruit,semi-finished bread,margarine,rea..."
1,"tropical fruit,yogurt,coffee"
2,whole milk
3,"pip fruit,yogurt,cream cheese ,meat spreads"
4,"other vegetables,whole milk,condensed milk,lon..."
5,"whole milk,butter,yogurt,rice,abrasive cleaner"
6,rolls/buns
7,"other vegetables,UHT-milk,rolls/buns,bottled b..."
8,pot plants
9,"whole milk,cereals"


In [4]:
# Checking for missing values
missing_values = df.isnull().sum()
print(missing_values)

Items    0
dtype: int64


In [5]:
df.shape

(700, 1)

### Data Preparation for Market Basket Analysis

This step transforms the raw transactional data into a format that the Apriori algorithm can consume.

Each row of the dataset currently stores a transaction as a single comma-separated string. We need a tabular structure with one row per transaction and one column per item, where ones and zeros (or `True` and `False`) denote the presence or absence of that item in the transaction.

### Step 3: Converting the data into a suitable format for analysis

In [6]:
# 1. Split transaction strings (i.e., Items) into lists called transactions
transactions = df['Items'].apply(lambda t: t.split(','))

print(transactions)

0      [citrus fruit, semi-finished bread, margarine,...
1                       [tropical fruit, yogurt, coffee]
2                                           [whole milk]
3       [pip fruit, yogurt, cream cheese , meat spreads]
4      [other vegetables, whole milk, condensed milk,...
                             ...                        
695    [pork, UHT-milk, bottled water, soda, canned b...
696    [other vegetables, curd, yogurt, curd cheese, ...
697    [rolls/buns, soda, fruit/vegetable juice, cann...
698    [frankfurter, pip fruit, whole milk, rolls/bun...
699    [yogurt, hygiene articles, newspapers, shoppin...
Name: Items, Length: 700, dtype: object
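
One detail in the output above is worth noting: splitting on `','` keeps any surrounding whitespace, so `'cream cheese '` (with a trailing space) would be treated as a different item from `'cream cheese'`. A minimal sketch of a whitespace-tolerant split follows; the helper name `split_items` and the sample string are illustrative assumptions, not part of the original notebook.

```python
def split_items(transaction: str) -> list[str]:
    """Split a comma-separated transaction string and trim each item."""
    # Strip whitespace around every item and drop empty entries
    return [item.strip() for item in transaction.split(",") if item.strip()]


raw = "pip fruit,yogurt,cream cheese ,meat spreads"
print(raw.split(","))    # plain split keeps the trailing space
print(split_items(raw))  # trimmed items
```

If this behaviour is wanted, the helper could be passed to `df['Items'].apply(split_items)` in place of the lambda.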


In [7]:
# 2. Convert the pandas Series into a list of lists (one list of items per transaction)
transactions = list(transactions)

### One-Hot Encoding and Apriori Algorithm
Next, we apply the `TransactionEncoder`, which converts the lists of items into the one-hot encoded Boolean format required for frequent itemset mining.

In [8]:
# Instantiate the TransactionEncoder
transformer = TransactionEncoder()

The `fit` method of the `TransactionEncoder` learns the unique item labels present in the dataset, and the `transform` method then converts the input (a Python list of lists) into a one-hot encoded NumPy boolean array.

In [9]:
transformer_data = transformer.fit(transactions).transform(transactions)
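
To make the encoding concrete, here is a pure-Python sketch of what the two steps amount to conceptually. It is not mlxtend's implementation; the function names `fit_columns` and `transform_rows` and the toy baskets are assumptions made for illustration.

```python
def fit_columns(transactions):
    """Collect the sorted unique item labels (roughly what `fit` learns)."""
    return sorted({item for basket in transactions for item in basket})


def transform_rows(transactions, columns):
    """Build one boolean row per transaction (roughly what `transform` returns)."""
    rows = []
    for basket in transactions:
        present = set(basket)
        # True where the item is in the basket, False otherwise
        rows.append([col in present for col in columns])
    return rows


toy = [["whole milk", "cereals"], ["yogurt"], ["whole milk", "yogurt"]]
columns = fit_columns(toy)
rows = transform_rows(toy, columns)
print(columns)
for row in rows:
    print(row)
```

Each row has one entry per unique item, so the result has the same shape as the table built in the next step.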

Convert the encoded array into a pandas DataFrame:

In [10]:
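# Wrap the boolean array in a DataFrame with one column per unique item
df = pd.DataFrame(transformer_data, columns=transformer.columns_)
# Show absent items as 0 (present items remain True, as in the output below)
df = df.replace(False, 0)
df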

,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baking powder,bathroom cleaner,beef,berries,beverages,...,tropical fruit,turkey,vinegar,waffles,whipped/sour cream,white bread,white wine,whole milk,yogurt,zwieback
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,True,0,0,0,0,0,0,0,True,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,True,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,True,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,True,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,True,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
696,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,True,0
697,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,True,0,0,0
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,True,0,0


Next, the code below uses the Apriori algorithm to find frequent itemsets, with `min_support` set to a threshold of 1%. Since the dataset has 700 transactions, an itemset must appear in at least 7 of them to qualify. The `apriori` function returns the results as the DataFrame `frequent_itemsets`.

In [11]:
# Apply Apriori algorithm to find frequent itemsets
# Set a threshold value for the support value
frequent_itemsets = apriori(df, min_support = 0.01, use_colnames = True, verbose = 1)
frequent_itemsets

Processing 5208 combinations | Sampling itemset size 3

Processing 212 combinations | Sampling itemset size 4




,support,itemsets
0,0.02,(UHT-milk)
1,0.011429,(baking powder)
2,0.06,(beef)
3,0.04,(berries)
4,0.031429,(beverages)
...,...,...
309,0.012857,"(root vegetables, whole milk, tropical fruit)"
310,0.01,"(root vegetables, whole milk, yogurt)"
311,0.011429,"(yogurt, whole milk, sausage)"
312,0.011429,"(yogurt, whole milk, soda)"
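
The idea behind the `support` column and the level-wise search can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how mlxtend implements `apriori`; the function names `support` and `mine_frequent_itemsets` and the toy baskets are hypothetical.

```python
from itertools import combinations


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def mine_frequent_itemsets(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, keeping frequent ones."""
    baskets = [frozenset(t) for t in transactions]
    items = sorted({item for basket in baskets for item in basket})
    level = [frozenset([item]) for item in items]
    result = {}
    while level:
        # Keep only the candidates that meet the support threshold
        scored = {cand: support(cand, baskets) for cand in level}
        frequent = {cand: s for cand, s in scored.items() if s >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-item candidates
        size = len(next(iter(frequent))) + 1 if frequent else 0
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == size}
        # Prune candidates that have an infrequent k-item subset
        level = [
            cand for cand in candidates
            if all(frozenset(sub) in frequent for sub in combinations(cand, size - 1))
        ]
    return result


toy = [{"milk", "bread"}, {"milk", "yogurt"}, {"milk", "bread", "yogurt"}, {"bread"}]
found = mine_frequent_itemsets(toy, min_support=0.5)
for itemset, s in found.items():
    print(sorted(itemset), s)
```

The pruning step relies on the fact that an itemset cannot be frequent unless all of its subsets are frequent, which is what keeps the number of candidates manageable.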


Next, we generate association rules using a minimum confidence threshold of 70%.

An association rule comprises two components: an antecedent (if) and a consequent (then). The antecedent is an itemset found in the data, and the consequent is an itemset found in combination with the antecedent. Various metrics have been devised to assess how interesting a rule is. In the current implementation, we filter on the `confidence` metric: the fraction of transactions containing the antecedent that also contain the consequent.
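
For reference, the standard definitions of the most common rule metrics are given below for a rule $A \rightarrow C$, where $\text{support}(X)$ is the fraction of transactions that contain every item in $X$. The symbols $A$ and $C$ are notation introduced here for the antecedent and consequent.

$$
\begin{aligned}
\text{support}(A \rightarrow C) &= \text{support}(A \cup C) \\
\text{confidence}(A \rightarrow C) &= \frac{\text{support}(A \cup C)}{\text{support}(A)} \\
\text{lift}(A \rightarrow C) &= \frac{\text{confidence}(A \rightarrow C)}{\text{support}(C)} \\
\text{leverage}(A \rightarrow C) &= \text{support}(A \cup C) - \text{support}(A)\,\text{support}(C) \\
\text{conviction}(A \rightarrow C) &= \frac{1 - \text{support}(C)}{1 - \text{confidence}(A \rightarrow C)}
\end{aligned}
$$

A lift of 1 (or a leverage of 0) indicates that the antecedent and consequent occur independently; a lift above 1 indicates that they occur together more often than independence would predict.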

In [12]:
# Generate association rules with a confidence of at least 0.7
association_rules_df = association_rules(frequent_itemsets, metric = "confidence", min_threshold = 0.7)
association_rules_df

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(cereals),(whole milk),0.011429,0.251429,0.01,0.875,3.480114,0.007127,5.988571,0.720892
1,(frozen dessert),(whole milk),0.014286,0.251429,0.011429,0.8,3.181818,0.007837,3.742857,0.695652
2,"(yogurt, fruit/vegetable juice)",(whole milk),0.017143,0.251429,0.014286,0.833333,3.314394,0.009976,4.491429,0.710465
3,"(margarine, rolls/buns)",(whole milk),0.017143,0.251429,0.012857,0.75,2.982955,0.008547,2.994286,0.676357
4,"(sugar, other vegetables)",(whole milk),0.014286,0.251429,0.01,0.7,2.784091,0.006408,2.495238,0.650104
5,"(root vegetables, sausage)",(rolls/buns),0.012857,0.218571,0.01,0.777778,3.55846,0.00719,3.516429,0.728344


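The resulting table lists the six rules that meet the 70% confidence threshold. Each one covers only 1% to about 1.4% of all transactions, so these are high-confidence rules rather than the most frequently purchased combinations:
- Cereals → whole milk
- Frozen dessert → whole milk
- Yogurt and fruit/vegetable juice → whole milk
- Margarine and rolls/buns → whole milk
- Sugar and other vegetables → whole milk
- Root vegetables and sausage → rolls/buns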

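For example, take the rule at index 1 (frozen dessert → whole milk):
- 80% of the transactions that contain frozen dessert also contain whole milk (confidence = 0.8).
- Its lift is about 3.18, so whole milk is roughly 3.2 times as likely to appear in a basket with frozen dessert as in a basket overall.
- Its conviction is about 3.74.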
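
The metrics for the frozen dessert rule (index 1 in the table) can be reproduced by hand from the standard definitions. The sketch below assumes the transaction counts implied by the table, obtained by multiplying each support value by the 700 rows reported by `df.shape` and rounding. These counts are inferred, not read from the data.

```python
n_transactions = 700  # from df.shape
n_antecedent = 10     # frozen dessert: 0.014286 * 700, rounded (inferred)
n_consequent = 176    # whole milk: 0.251429 * 700, rounded (inferred)
n_both = 8            # both together: 0.011429 * 700, rounded (inferred)

support_a = n_antecedent / n_transactions
support_c = n_consequent / n_transactions
support_ac = n_both / n_transactions

# Standard rule metrics for "frozen dessert -> whole milk"
confidence = support_ac / support_a
lift = confidence / support_c
leverage = support_ac - support_a * support_c
conviction = (1 - support_c) / (1 - confidence)

print(f"confidence = {confidence:.6f}")
print(f"lift       = {lift:.6f}")
print(f"leverage   = {leverage:.6f}")
print(f"conviction = {conviction:.6f}")
```

Rounded to six decimals, these values agree with the row at index 1 (0.8, 3.181818, 0.007837 and 3.742857). With only 10 transactions containing frozen dessert, the rule rests on a small sample and should be read with that in mind.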