# Market Basket Analysis on Retail Transactions

## Objective
This notebook analyzes retail transaction data to identify products that are
frequently purchased together, using Market Basket Analysis. Association rule
mining algorithms such as Apriori or FP-Growth are used to extract rules of the
form "customers who buy A also tend to buy B".
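
As a rough illustration of what "frequently co-purchased" means, the sketch below computes support and confidence for item pairs by brute force. It is not Apriori or FP-Growth themselves, only the quantities those algorithms search for efficiently. The baskets and the helper names `pair_support` and `confidence` are made up for illustration and do not come from the dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets, one set of items per bill (not from the dataset)
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
n_baskets = len(baskets)

# Count how many baskets contain each item and each unordered pair of items
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    frozenset(pair)
    for basket in baskets
    for pair in combinations(sorted(basket), 2)
)


def pair_support(a, b):
    """Fraction of all baskets that contain both a and b."""
    return pair_counts[frozenset((a, b))] / n_baskets


def confidence(antecedent, consequent):
    """Among baskets containing the antecedent, the fraction that also contain the consequent."""
    return pair_counts[frozenset((antecedent, consequent))] / item_counts[antecedent]


print(pair_support("bread", "milk"))
print(confidence("bread", "milk"))
```

Brute-force pair counting like this does not scale to thousands of distinct items, which is the reason dedicated algorithms are used on the full dataset.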


## Import Required Libraries


In [14]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Use the white Plotly theme for every figure in this notebook
pio.templates.default = "plotly_white"

## Load the Dataset

The dataset is loaded with a semicolon (`;`) as the delimiter, which is common
in European-style CSV files where the comma serves as the decimal separator.
Specifying the delimiter explicitly ensures that each field is parsed into its
own column.


In [15]:
data = pd.read_csv(
    "market_basket_dataset.csv",
    sep=";"
)
# The warning in this cell's output arises because 'BillNo' holds mixed values,
# such as 536365 and C536379, so pandas cannot infer a single dtype for the column.
# It can be ignored for now: nothing in this analysis depends on the dtype of 'BillNo'.

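
If the mixed-type warning becomes a nuisance, one option is to tell pandas up front to read `BillNo` as text. The sketch below shows the idea on a small in-memory sample instead of the real file; the rows are illustrative and the variable names `sample_csv` and `sample` are hypothetical.

```python
import io

import pandas as pd

# Illustrative semicolon-delimited sample with one numeric-looking and one
# letter-prefixed bill number, mirroring the two kinds of values in 'BillNo'
sample_csv = io.StringIO(
    "BillNo;Itemname;Quantity\n"
    "536365;WHITE METAL LANTERN;6\n"
    "C536379;CREAM CUPID HEARTS COAT HANGER;8\n"
)

# dtype={"BillNo": str} makes pandas keep every bill number as a string,
# so it never has to guess between integer and text
sample = pd.read_csv(sample_csv, sep=";", dtype={"BillNo": str})

print(sample["BillNo"].tolist())
```

The same `dtype` argument could be passed to the `pd.read_csv` call above; whether that is worthwhile depends on whether later steps filter or group on `BillNo`.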


### Preview of the Dataset

In [12]:
data.head()


|   | BillNo | Itemname | Quantity | Date | Price | CustomerID | Country |
|---|--------|----------|----------|------|-------|------------|---------|
| 0 | 536365 | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01.12.2010 08:26 | 255 | 17850.0 | United Kingdom |
| 1 | 536365 | WHITE METAL LANTERN | 6 | 01.12.2010 08:26 | 339 | 17850.0 | United Kingdom |
| 2 | 536365 | CREAM CUPID HEARTS COAT HANGER | 8 | 01.12.2010 08:26 | 275 | 17850.0 | United Kingdom |
| 3 | 536365 | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 01.12.2010 08:26 | 339 | 17850.0 | United Kingdom |
| 4 | 536365 | RED WOOLLY HOTTIE WHITE HEART. | 6 | 01.12.2010 08:26 | 339 | 17850.0 | United Kingdom |


## Initial Data Inspection

This step helps us understand the structure of the dataset, including:
- Number of rows and columns
- Data types of each column
- Presence of missing values
- Memory usage

This understanding is essential before performing any cleaning or transformation.
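
`data.info()` reports all of these in one summary. As a minimal sketch, each item can also be queried separately, which is handy when only one number is needed. The frame `toy` below is a hypothetical stand-in with a few illustrative rows, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with the same kinds of columns as the dataset
toy = pd.DataFrame(
    {
        "BillNo": ["536365", "536365", "C536379"],
        "Itemname": ["WHITE METAL LANTERN", None, "CREAM CUPID HEARTS COAT HANGER"],
        "Quantity": [6, 6, 8],
        "CustomerID": [17850.0, 17850.0, np.nan],
    }
)

n_rows, n_cols = toy.shape                       # number of rows and columns
column_dtypes = toy.dtypes                       # data type of each column
missing_per_column = toy.isna().sum()            # count of missing values per column
memory_bytes = toy.memory_usage(deep=True).sum()  # total memory, including string contents

print(n_rows, n_cols)
print(column_dtypes)
print(missing_per_column)
print(memory_bytes)
```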


In [16]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522064 entries, 0 to 522063
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   BillNo      522064 non-null  object 
 1   Itemname    520609 non-null  object 
 2   Quantity    522064 non-null  int64  
 3   Date        522064 non-null  object 
 4   Price       522064 non-null  object 
 5   CustomerID  388023 non-null  float64
 6   Country     522064 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 27.9+ MB


### Observations from Initial Inspection
- The dataset contains 522,064 rows and 7 columns.
- `Price` is stored as `object` rather than a numeric type, so it must be converted before any arithmetic.
- `Quantity` is integer (`int64`).
- `Itemname` and `CustomerID` contain missing values.
- `BillNo` contains mixed types and is stored as an object.
- `Date` is currently stored as an object and not as a datetime.
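
The type issues noted above could be addressed along the following lines. This is a sketch under stated assumptions: the date format `%d.%m.%Y %H:%M` is inferred from the preview (`01.12.2010 08:26`), and the decimal-comma `Price` strings are a guess at why the column is read as `object`; both should be checked against the real data first. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Date": ["01.12.2010 08:26", "01.12.2010 08:28"],
        # ASSUMPTION: prices are text with a decimal comma; values are made up
        "Price": ["2,55", "3,39"],
    }
)

# Parse day.month.year strings into proper datetimes
toy["Date"] = pd.to_datetime(toy["Date"], format="%d.%m.%Y %H:%M")

# Swap the decimal comma for a dot, then convert; unparseable values become NaN
toy["Price"] = pd.to_numeric(
    toy["Price"].str.replace(",", ".", regex=False),
    errors="coerce",
)

print(toy.dtypes)
```

Using `errors="coerce"` keeps the conversion from failing on stray values, at the cost of silently producing `NaN`, so the number of missing prices is worth checking afterwards.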
