# Market Basket Analysis on Retail Transactions

## Objective
The objective of this notebook is to analyze retail transaction data and
identify frequently co-purchased products using Market Basket Analysis.
Association rule mining techniques such as Apriori / FP-Growth will be used
to extract meaningful patterns.


## Import Required Libraries


In [14]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Use a white background for every Plotly figure in this notebook
pio.templates.default = "plotly_white"

## Load the Dataset

The dataset uses a semicolon (`;`) as the delimiter, which is common in
European-style CSV files, so `sep=";"` is passed to `read_csv`. With the default
comma delimiter, the columns would not be split correctly.


In [15]:
data = pd.read_csv(
    "market_basket_dataset.csv",
    sep=";"
)
# pandas emits a warning for this cell because `BillNo` mixes purely numeric
# values such as 536365 with alphanumeric ones such as C536379, so it cannot
# infer a single data type for the column.
# The warning can be ignored for now: it does not affect our analysis, and we
# won't perform any operations that depend on the data type of `BillNo`.
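
If the warning is a concern, one option is to tell pandas up front how to read the column. The sketch below is only an illustration on a hypothetical two-row sample (the rows are made up, not taken from the dataset); it assumes that reading `BillNo` as text is acceptable, which also keeps numeric and alphanumeric bill numbers in one consistent type.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same semicolon-delimited layout.
sample_csv = io.StringIO(
    "BillNo;Itemname;Quantity\n"
    "536365;WHITE METAL LANTERN;6\n"
    "C536379;Discount;-1\n"
)

# Declaring BillNo as text skips type inference for that column.
sample = pd.read_csv(sample_csv, sep=";", dtype={"BillNo": str})

print(sample["BillNo"].tolist())
print(sample.shape)
```

The same `dtype` argument could be added to the real `read_csv` call, should a later step depend on the type of `BillNo`.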



### Preview of the Dataset

In [12]:
data.head()


| | BillNo | Itemname | Quantity | Date | Price | CustomerID | Country |
|---|--------|----------|----------|------|-------|------------|---------|
| 0 | 536365 | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01.12.2010 08:26 | 255 | 17850.0 | United Kingdom |
| 1 | 536365 | WHITE METAL LANTERN | 6 | 01.12.2010 08:26 | 339 | 17850.0 | United Kingdom |
| 2 | 536365 | CREAM CUPID HEARTS COAT HANGER | 8 | 01.12.2010 08:26 | 275 | 17850.0 | United Kingdom |
| 3 | 536365 | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 01.12.2010 08:26 | 339 | 17850.0 | United Kingdom |
| 4 | 536365 | RED WOOLLY HOTTIE WHITE HEART. | 6 | 01.12.2010 08:26 | 339 | 17850.0 | United Kingdom |


## Initial Data Inspection

This step helps us understand the structure of the dataset, including:
- Number of rows and columns
- Data types of each column
- Presence of missing values
- Memory usage

This understanding is essential before performing any cleaning or transformation.


In [16]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522064 entries, 0 to 522063
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   BillNo      522064 non-null  object 
 1   Itemname    520609 non-null  object 
 2   Quantity    522064 non-null  int64  
 3   Date        522064 non-null  object 
 4   Price       522064 non-null  object 
 5   CustomerID  388023 non-null  float64
 6   Country     522064 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 27.9+ MB


### Observations from Initial Inspection
- The dataset contains 522,064 rows and 7 columns.
- `Price` is stored as `object` rather than a numeric type, so it would need conversion before any arithmetic.
- `Quantity` is integer (`int64`).
- `Itemname` and `CustomerID` contain missing values.
- `BillNo` contains mixed types and is stored as an object.
- `Date` is currently stored as an object and not as a datetime.
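
`Date` is not needed for the basket analysis itself, but if it is used later it would first have to be parsed. A minimal sketch, assuming every value follows the `DD.MM.YYYY HH:MM` layout seen in the preview (the second timestamp below is made up for illustration):

```python
import pandas as pd

# Hypothetical sample using the timestamp layout shown in the preview.
dates = pd.Series(["01.12.2010 08:26", "09.12.2011 12:50"])

# The day comes first, so the format is stated explicitly instead of inferred.
parsed = pd.to_datetime(dates, format="%d.%m.%Y %H:%M")

print(parsed.dt.day.tolist())    # [1, 9]
print(parsed.dt.month.tolist())  # [12, 12]
```

Stating the format avoids `01.12.2010` being read as 12 January instead of 1 December.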


## Data Understanding

In this section, we describe what each column in the dataset represents.
Understanding the meaning of each column helps determine which columns are
relevant for Market Basket Analysis and which can be ignored or used later.

### Column Descriptions

| Column Name | Description |
|------------|-------------|
| BillNo | Unique transaction or invoice identifier |
| Itemname | Name of the product purchased |
| Quantity | Number of units purchased in the transaction |
| Date | Date and time of the transaction |
| Price | Price per unit of the product |
| CustomerID | Unique customer identifier |
| Country | Country where the transaction occurred |


### Relevance for Market Basket Analysis

For Market Basket Analysis, we are primarily interested in:
- `BillNo` – to define a transaction
- `Itemname` – to define products
- `Quantity` – to identify whether an item was purchased

Other columns such as `Price`, `Date`, `CustomerID`, and `Country` are not
required for association rule mining at this stage and will be ignored for now.


## Data Cleaning – Handling Missing Values

Market Basket Analysis requires complete information about transactions and
items. Rows with a missing product name cannot contribute to meaningful
association rules, so they must be removed.


### Missing Values Overview


In [17]:
data.isnull().sum()


BillNo             0
Itemname        1455
Quantity           0
Date               0
Price              0
CustomerID    134041
Country            0
dtype: int64

### Decision on Missing Values

- Rows with missing `Itemname` are removed because the product is unknown.
- `CustomerID` contains missing values, but it is not required for Market Basket
  Analysis and will be ignored.
- No rows are missing `BillNo` or `Quantity`, so those columns are safe.
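
To make the effect of this decision concrete, here is a small sketch on a hypothetical three-row frame (the values are invented). It shows that `dropna(subset=["Itemname"])` removes only the rows with a missing item name and leaves rows with a missing `CustomerID` in place.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one has no item name, two have no customer ID.
toy = pd.DataFrame({
    "BillNo": ["1", "1", "2"],
    "Itemname": ["MUG", np.nan, "LANTERN"],
    "CustomerID": [np.nan, 17850.0, np.nan],
})

# Only a missing Itemname causes a row to be dropped.
cleaned = toy.dropna(subset=["Itemname"])

print(len(toy), len(cleaned))
print(cleaned["CustomerID"].isna().sum())
```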


In [18]:
data = data.dropna(subset=["Itemname"])

### Verification After Removing Missing Values


In [20]:
data.isnull().sum()


BillNo             0
Itemname           0
Quantity           0
Date               0
Price              0
CustomerID    132586
Country            0
dtype: int64

## Selecting Relevant Columns

For Market Basket Analysis, we only require information about:
- The transaction identifier
- The product purchased
- The quantity purchased

All other columns are not needed for association rule mining at this stage.
Selecting only the relevant columns simplifies the dataset and reduces its
memory footprint.


In [22]:
data = data[["BillNo", "Itemname", "Quantity"]]


### Verification After Column Selection


In [23]:
data.head()


| | BillNo | Itemname | Quantity |
|---|--------|----------|----------|
| 0 | 536365 | WHITE HANGING HEART T-LIGHT HOLDER | 6 |
| 1 | 536365 | WHITE METAL LANTERN | 6 |
| 2 | 536365 | CREAM CUPID HEARTS COAT HANGER | 8 |
| 3 | 536365 | KNITTED UNION FLAG HOT WATER BOTTLE | 6 |
| 4 | 536365 | RED WOOLLY HOTTIE WHITE HEART. | 6 |


In [24]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Index: 520609 entries, 0 to 522063
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   BillNo    520609 non-null  object
 1   Itemname  520609 non-null  object
 2   Quantity  520609 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 15.9+ MB


## Aggregating Items Within Each Transaction

A single transaction (`BillNo`) may list the same product on more than one row.
The pivot used later to build the market basket matrix needs each
(`BillNo`, `Itemname`) pair to appear only once, so we first sum the quantities
for each product within each transaction.
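
As a small illustration of this aggregation, the sketch below uses a hypothetical four-row frame (the bill numbers and item names are made up) in which `MUG` appears twice on bill `A1`.

```python
import pandas as pd

# Hypothetical line items: MUG is listed twice on bill A1.
toy = pd.DataFrame({
    "BillNo": ["A1", "A1", "A1", "B2"],
    "Itemname": ["MUG", "MUG", "LANTERN", "MUG"],
    "Quantity": [2, 3, 1, 4],
})

# One row per (BillNo, Itemname) pair, with quantities summed.
aggregated = (
    toy
    .groupby(["BillNo", "Itemname"])["Quantity"]
    .sum()
    .reset_index()
)

print(aggregated)
```

The two `MUG` rows of bill `A1` collapse into a single row with a quantity of 5, and the result is sorted by the grouping keys.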


In [25]:
transaction_item = (
    data
    .groupby(["BillNo", "Itemname"])["Quantity"]
    .sum()
    .reset_index()
)


### Verification After Aggregation


In [26]:
transaction_item.head()


| | BillNo | Itemname | Quantity |
|---|--------|----------|----------|
| 0 | 536365 | CREAM CUPID HEARTS COAT HANGER | 8 |
| 1 | 536365 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 |
| 2 | 536365 | KNITTED UNION FLAG HOT WATER BOTTLE | 6 |
| 3 | 536365 | RED WOOLLY HOTTIE WHITE HEART. | 6 |
| 4 | 536365 | SET 7 BABUSHKA NESTING BOXES | 2 |


In [27]:
transaction_item.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509829 entries, 0 to 509828
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   BillNo    509829 non-null  object
 1   Itemname  509829 non-null  object
 2   Quantity  509829 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 11.7+ MB


## Creating the Transaction–Item Matrix

Market Basket Analysis algorithms require data in the form of a matrix where:
- Each row represents a transaction
- Each column represents a product
- Each cell holds the quantity of that product purchased in the transaction (0 if it was not purchased)
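
The sketch below builds such a matrix from a hypothetical four-row frame (invented bill numbers and items). It also illustrates why quantities are aggregated first: `DataFrame.pivot` raises a `ValueError` when the same (`BillNo`, `Itemname`) pair occurs more than once.

```python
import pandas as pd

# Hypothetical line items: MUG is listed twice on bill A1.
toy = pd.DataFrame({
    "BillNo": ["A1", "A1", "A1", "B2"],
    "Itemname": ["MUG", "MUG", "LANTERN", "MUG"],
    "Quantity": [2, 3, 1, 4],
})

# Pivoting the raw rows fails because (A1, MUG) is duplicated.
try:
    toy.pivot(index="BillNo", columns="Itemname", values="Quantity")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# After aggregation every pair is unique, so the pivot succeeds.
aggregated = toy.groupby(["BillNo", "Itemname"])["Quantity"].sum().reset_index()
matrix = (
    aggregated
    .pivot(index="BillNo", columns="Itemname", values="Quantity")
    .fillna(0)
)

print(pivot_failed)
print(matrix)
```

Bill `B2` contains no `LANTERN`, so that cell is missing after the pivot and becomes 0 through `fillna(0)`.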


In [28]:
basket = (
    transaction_item
    .pivot(index="BillNo", columns="Itemname", values="Quantity")
    .fillna(0)
)


### Verification of Basket Matrix


In [29]:
basket.head()


BillNo,*Boombox Ipod Classic,*USB Office Mirror Ball,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 DAISY PEGS IN WOOD BOX,12 EGG HOUSE PAINTED WOOD,12 HANGING EGGS HAND PAINTED,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
536365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536366,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536368,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536369,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [30]:
basket.shape


(20210, 4185)

## Binary Encoding of the Basket Matrix

Market Basket Analysis algorithms such as Apriori and FP-Growth require binary
input, where:
- 1 indicates the presence of an item in a transaction
- 0 indicates the absence of an item

The actual quantity purchased is not required for association rule mining.
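
A minimal sketch of this encoding on a hypothetical quantity matrix (the bills, items and values are made up). It uses a vectorized comparison, which gives the same result as applying `1 if x > 0 else 0` to every cell. Note that a negative quantity, as might come from a return or cancellation, is encoded as 0.

```python
import pandas as pd

# Hypothetical quantity matrix: bill C3 has a negative quantity for MUG.
matrix = pd.DataFrame(
    {"LANTERN": [1.0, 0.0, 0.0], "MUG": [5.0, 4.0, -2.0]},
    index=pd.Index(["A1", "B2", "C3"], name="BillNo"),
)

# 1 where the quantity is positive, 0 otherwise.
encoded = (matrix > 0).astype(int)

print(encoded)
```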


In [31]:
# Vectorized equivalent of mapping `1 if x > 0 else 0` over every cell
basket_binary = (basket > 0).astype(int)




### Verification of Binary Basket Matrix


In [32]:
basket_binary.head()


BillNo,*Boombox Ipod Classic,*USB Office Mirror Ball,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 DAISY PEGS IN WOOD BOX,12 EGG HOUSE PAINTED WOOD,12 HANGING EGGS HAND PAINTED,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
536365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536366,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536367,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
basket_binary.values.sum()


np.int64(509356)
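
This sum counts the cells equal to 1. Since each (`BillNo`, `Itemname`) pair appears once after aggregation, it should match the number of aggregated pairs with a positive quantity. That would explain why 509,356 is slightly below the 509,829 rows of `transaction_item`: the remaining 473 pairs would have a net quantity of zero or less. The sketch below checks this relationship on hypothetical data (the values are made up):

```python
import pandas as pd

# Hypothetical aggregated pairs: (C3, MUG) has a negative net quantity.
pairs = pd.DataFrame({
    "BillNo": ["A1", "A1", "B2", "C3"],
    "Itemname": ["LANTERN", "MUG", "MUG", "MUG"],
    "Quantity": [1, 5, 4, -2],
})

matrix = (
    pairs
    .pivot(index="BillNo", columns="Itemname", values="Quantity")
    .fillna(0)
)
encoded = (matrix > 0).astype(int)

# Both counts describe the same thing: pairs with a positive quantity.
ones = int(encoded.values.sum())
positive_pairs = int((pairs["Quantity"] > 0).sum())

print(ones, positive_pairs)
```

On the real data, the analogous check would compare `basket_binary.values.sum()` with `(transaction_item["Quantity"] > 0).sum()`.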