<a href="https://colab.research.google.com/github/pratikagithub/All-About-Data-Analyst/blob/main/Market_Basket_Analysis_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Market Basket Analysis is a data-driven technique used to uncover patterns and relationships within large transactional datasets, particularly in retail and e-commerce. It helps businesses understand which products or items are often purchased together, providing insights for optimizing product placement, marketing strategies, and promotions.

Used well, Market Basket Analysis helps businesses refine their product offerings, find cross-selling opportunities, and target their marketing, which can in turn raise revenue and customer satisfaction.

Below is the process you can follow for the task of Market Basket Analysis as a Data Science professional:

1. Gather transactional data, including purchase history, shopping carts, or invoices.
2. Analyze product sales and trends.
3. Use algorithms like Apriori or FP-growth to discover frequent item sets and generate association rules.
4. Interpret the discovered association rules to gain actionable insights.
5. Develop strategies based on the insights gained from the analysis.

In [1]:
import pandas as pd
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
pio.templates.default = "plotly_white"
from google.colab import files
uploaded = files.upload()
data = pd.read_csv("market_basket_dataset.csv")
print(data.head())

Saving market_basket_dataset.csv to market_basket_dataset (1).csv
   BillNo  Itemname  Quantity  Price  CustomerID
0    1000    Apples         5   8.30       52299
1    1000    Butter         4   6.06       11752
2    1000      Eggs         4   2.66       16415
3    1000  Potatoes         4   8.10       22889
4    1004   Oranges         2   7.26       52255


Let’s check whether the data has any null values before moving forward:

In [2]:
print(data.isnull().sum())

BillNo        0
Itemname      0
Quantity      0
Price         0
CustomerID    0
dtype: int64


Now, let’s have a look at the summary statistics of this dataset:

In [3]:
print(data.describe())

            BillNo    Quantity       Price    CustomerID
count   500.000000  500.000000  500.000000    500.000000
mean   1247.442000    2.978000    5.617660  54229.800000
std     144.483097    1.426038    2.572919  25672.122585
min    1000.000000    1.000000    1.040000  10504.000000
25%    1120.000000    2.000000    3.570000  32823.500000
50%    1246.500000    3.000000    5.430000  53506.500000
75%    1370.000000    4.000000    7.920000  76644.250000
max    1497.000000    5.000000    9.940000  99162.000000


Now, let’s have a look at the sales distribution of items:

In [4]:
fig = px.histogram(data, x='Itemname',
                   title='Item Distribution')
fig.show()

Now, let’s have a look at the top 10 most popular items sold by the store:

In [5]:
# Calculate item popularity
item_popularity = data.groupby('Itemname')['Quantity'].sum().sort_values(ascending=False)

top_n = 10
fig = go.Figure()
fig.add_trace(go.Bar(x=item_popularity.index[:top_n], y=item_popularity.values[:top_n],
                     text=item_popularity.values[:top_n], textposition='auto',
                     marker=dict(color='skyblue')))
fig.update_layout(title=f'Top {top_n} Most Popular Items',
                  xaxis_title='Item Name', yaxis_title='Total Quantity Sold')
fig.show()

So, by total quantity sold, bananas are the most popular item at the store. Now, let’s have a look at the customer behaviour:

In [6]:
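# Calculate average quantity and the sum of the Price column per customer
# (assumption: Price is already a line total; if it is a unit price,
# sum Quantity * Price instead to get true spending)
customer_behavior = data.groupby('CustomerID').agg({'Quantity': 'mean', 'Price': 'sum'}).reset_index()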

# Create a DataFrame to display the values
table_data = pd.DataFrame({
    'CustomerID': customer_behavior['CustomerID'],
    'Average Quantity': customer_behavior['Quantity'],
    'Total Spending': customer_behavior['Price']
})

# Create a figure with a scatter plot (left) and a table (right).
# Without separate subplot cells, the table is drawn on top of the scatter plot.
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=2,
                    specs=[[{'type': 'xy'}, {'type': 'table'}]])

# Add a scatter plot
fig.add_trace(go.Scatter(x=customer_behavior['Quantity'], y=customer_behavior['Price'],
                         mode='markers', text=customer_behavior['CustomerID'],
                         marker=dict(size=10, color='coral')),
              row=1, col=1)

# Add a table
fig.add_trace(go.Table(
    header=dict(values=['CustomerID', 'Average Quantity', 'Total Spending']),
    cells=dict(values=[table_data['CustomerID'], table_data['Average Quantity'], table_data['Total Spending']]),
), row=1, col=2)

# Update layout
fig.update_layout(title='Customer Behavior',
                  xaxis_title='Average Quantity', yaxis_title='Total Spending')

# Show the plot
fig.show()

Here, we are exploring customer behaviour, comparing average quantity and total spending, and analyzing exact numerical values in the table for each customer.

Now, let’s use the Apriori algorithm to create association rules. Apriori discovers frequent item sets, that is, groups of items that are purchased together in at least a chosen fraction of transactions. These patterns in customer behaviour allow businesses to make informed decisions about product placement, promotions, and marketing. Here’s how to implement Apriori to generate association rules:

In [24]:
pip install --upgrade --force-reinstall mlxtend
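
Before relying on the library, it can help to see what Apriori does under the hood. The following is a minimal pure-Python sketch of the level-wise search, not mlxtend's implementation: it counts frequent single items first, then builds larger candidates only from itemsets that were already frequent. The function name `apriori_sketch` and the toy baskets are made up for illustration and are not taken from the dataset.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: single items that are frequent on their own
    current = {}
    for item in sorted({item for b in baskets for item in b}):
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Toy baskets (hypothetical, for illustration only)
toy = [
    ["Bread", "Butter", "Milk"],
    ["Bread", "Butter"],
    ["Bread", "Milk"],
    ["Butter", "Milk"],
    ["Bread", "Butter", "Eggs"],
]
result = apriori_sketch(toy, min_support=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

With a 40% threshold, Eggs is dropped at the first level, so no pair containing Eggs is ever counted. That pruning is what keeps the search manageable on larger datasets.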



In [12]:
print(type(frequent_itemsets))
print(frequent_itemsets.columns)
print(frequent_itemsets.head())

<class 'pandas.core.frame.DataFrame'>
Index(['support', 'itemsets'], dtype='object')
    support   itemsets
0  0.163399   (Apples)
1  0.241830  (Bananas)
2  0.150327    (Bread)
3  0.163399   (Butter)
4  0.202614   (Cereal)






In [13]:
print(f"Number of frequent itemsets: {len(frequent_itemsets)}")
print(frequent_itemsets.head())

Number of frequent itemsets: 1477
    support   itemsets
0  0.163399   (Apples)
1  0.241830  (Bananas)
2  0.150327    (Bread)
3  0.163399   (Butter)
4  0.202614   (Cereal)
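
mlxtend's `apriori` works on a table with one row per bill and one boolean column per item, rather than the one-row-per-line-item layout of `data`. Here is a minimal pure-Python sketch of that reshaping, using only the five rows printed by `data.head()` earlier. The names `rows`, `baskets`, `columns`, and `encoded` are illustrative and are not part of the notebook.

```python
from collections import defaultdict

# (BillNo, Itemname) pairs copied from the data.head() output shown earlier
rows = [
    (1000, "Apples"),
    (1000, "Butter"),
    (1000, "Eggs"),
    (1000, "Potatoes"),
    (1004, "Oranges"),
]

# Step 1: collect the set of items on each bill (the groupby step)
baskets = defaultdict(set)
for bill_no, item in rows:
    baskets[bill_no].add(item)

# Step 2: one boolean flag per distinct item (the one-hot step)
columns = sorted({item for _, item in rows})
encoded = {
    bill_no: [item in items for item in columns]
    for bill_no, items in sorted(baskets.items())
}

print(columns)
for bill_no, flags in encoded.items():
    print(bill_no, flags)
```

Each bill becomes one row of True/False flags, so the support of an itemset is simply the share of rows in which all of its columns are True.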






In [None]:
from mlxtend.frequent_patterns import apriori, association_rules

# Group items by BillNo and create a list of items for each bill
basket = data.groupby('BillNo')['Itemname'].apply(list).reset_index()

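# One-hot encode the items: one row per bill, one boolean column per item
# (apriori expects boolean input; integer 0/1 columns trigger a deprecation warning)
basket_encoded = basket['Itemname'].str.join('|').str.get_dummies('|').astype(bool)

# Find frequent itemsets with Apriori; min_support=0.01 keeps every itemset
# that appears in at least 1% of bills
frequent_itemsets = apriori(basket_encoded, min_support=0.01, use_colnames=True)

# Generate association rules, keeping every rule with lift >= 0.5
# (this also keeps negatively associated pairs, whose lift is below 1)
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=0.5)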

# Display association rules
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10))

The cell above prints the first ten association rules: each pairs a set of items (antecedents) with the items that tend to be purchased together with them (consequents). Let’s interpret the output step by step:


Antecedents: These are the items that are considered as the starting point or “if” part of the association rule. For example, Bread, Butter, Cereal, Cheese, and Chicken are the antecedents in this analysis.

Consequents: These are the items that tend to be purchased along with the antecedents or the “then” part of the association rule.

Support: Support measures how frequently a particular combination of items (both antecedents and consequents) appears in the dataset. It is essentially the proportion of transactions in which the items are bought together. For example, the first rule indicates that Bread and Apples are bought together in approximately 4.58% of all transactions.

Confidence: Confidence quantifies the likelihood of the consequent item being purchased when the antecedent item is already in the basket. In other words, it shows the probability of buying the consequent item when the antecedent item is bought. For example, the first rule tells us that there is a 30.43% chance of buying Apples when Bread is already in the basket.

Lift: Lift measures the degree of association between the antecedent and consequent items while accounting for the baseline purchase probability of the consequent item. It is the rule's confidence divided by the support of the consequent. A lift greater than 1 indicates a positive association, meaning that the items are bought together more often than they would be if purchased independently. A lift of exactly 1 indicates independence, and a value less than 1 indicates a negative association. For example, the first rule has a lift of approximately 1.86, suggesting a positive association between Bread and Apples.
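
The three metrics are tied together by simple arithmetic, so the figures quoted above can be checked by hand. Below is a small sketch with a hypothetical helper, `rule_metrics`, which is not an mlxtend function. The supports of Bread and Apples are taken from the frequent itemsets output shown earlier, and 0.0458 is the rounded joint support quoted in the text.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Confidence and lift of the rule A -> B from three support values."""
    confidence = support_ab / support_a   # P(B | A)
    lift = confidence / support_b         # P(B | A) / P(B)
    return confidence, lift

# Rule: Bread -> Apples
confidence, lift = rule_metrics(
    support_a=0.150327,   # Bread, from the frequent itemsets output
    support_b=0.163399,   # Apples, from the frequent itemsets output
    support_ab=0.0458,    # Bread and Apples together (rounded, from the text)
)
print(f"confidence = {confidence:.3f}, lift = {lift:.2f}")
```

The results land close to the 30.43% confidence and 1.86 lift discussed above. The small gap in the confidence comes from using the rounded joint support.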

So, this is how you can perform Market Basket Analysis using Python.