# Customer Purchase Behavior Analysis – Data Cleaning & EDA


## Project Overview

This project focuses on analyzing customer purchase behavior to uncover insights related to spending patterns, product performance, discounts, subscriptions, and customer loyalty. The analysis aims to support data-driven business decisions by identifying key revenue drivers and customer segments.

The workflow begins with data cleaning, feature engineering, and exploratory data analysis (EDA) using Python and Pandas. SQL-based analysis is then performed using SQLAlchemy to compute business KPIs and answer targeted analytical questions. Finally, the insights are visualized through an interactive Power BI dashboard to present findings in a clear and actionable format.
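As a rough illustration of the SQL step in this workflow, the sketch below loads a few hypothetical rows into an in-memory SQLite database and computes revenue per category. It uses the standard-library `sqlite3` module in place of SQLAlchemy to stay dependency-free; the table name `customers` and the sample rows are assumptions, not the project's actual schema or data.

```python
import sqlite3

import pandas as pd

# Hypothetical sample rows using the cleaned column names
sample = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "category": ["Clothing", "Clothing", "Footwear", "Accessories"],
    "purchase_amount": [53, 64, 90, 40],
})

# In-memory database; the table name "customers" is an assumption
conn = sqlite3.connect(":memory:")
sample.to_sql("customers", conn, index=False)

# KPI query: total revenue per category, highest first
query = """
    SELECT category, SUM(purchase_amount) AS revenue
    FROM customers
    GROUP BY category
    ORDER BY revenue DESC
"""
kpi = pd.read_sql_query(query, conn)
conn.close()
print(kpi)
```

The same query text should work against a SQLAlchemy engine passed to `pd.read_sql_query` in place of the `sqlite3` connection.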


## Data Loading


In [4]:
# Loading the dataset using pandas

import pandas as pd

df = pd.read_csv("../Data/RAW/customer_shopping_behavior.csv")
df.shape


(3900, 18)

In [5]:
df.head()

,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually


## Initial Data Overview


In [7]:
df.info()
print()
print()
df.describe(include='all')



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Customer ID             3900 non-null   int64  
 1   Age                     3900 non-null   int64  
 2   Gender                  3900 non-null   object 
 3   Item Purchased          3900 non-null   object 
 4   Category                3900 non-null   object 
 5   Purchase Amount (USD)   3900 non-null   int64  
 6   Location                3900 non-null   object 
 7   Size                    3900 non-null   object 
 8   Color                   3900 non-null   object 
 9   Season                  3900 non-null   object 
 10  Review Rating           3863 non-null   float64
 11  Subscription Status     3900 non-null   object 
 12  Shipping Type           3900 non-null   object 
 13  Discount Applied        3900 non-null   object 
 14  Promo Code Used         3900 non-null   

,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
count,3900.0,3900.0,3900,3900,3900,3900.0,3900,3900,3900,3900,3863.0,3900,3900,3900,3900,3900.0,3900,3900
unique,,,2,25,4,,50,4,25,4,,2,6,2,2,,6,7
top,,,Male,Blouse,Clothing,,Montana,M,Olive,Spring,,No,Free Shipping,No,No,,PayPal,Every 3 Months
freq,,,2652,171,1737,,96,1755,177,999,,2847,675,2223,2223,,677,584
mean,1950.5,44.068462,,,,59.764359,,,,,3.750065,,,,,25.351538,,
std,1125.977353,15.207589,,,,23.685392,,,,,0.716983,,,,,14.447125,,
min,1.0,18.0,,,,20.0,,,,,2.5,,,,,1.0,,
25%,975.75,31.0,,,,39.0,,,,,3.1,,,,,13.0,,
50%,1950.5,44.0,,,,60.0,,,,,3.8,,,,,25.0,,
75%,2925.25,57.0,,,,81.0,,,,,4.4,,,,,38.0,,


## Data Quality Checks

In [9]:
# Checking if missing data or null values are present in the dataset

df.isnull().sum()

Customer ID                0
Age                        0
Gender                     0
Item Purchased             0
Category                   0
Purchase Amount (USD)      0
Location                   0
Size                       0
Color                      0
Season                     0
Review Rating             37
Subscription Status        0
Shipping Type              0
Discount Applied           0
Promo Code Used            0
Previous Purchases         0
Payment Method             0
Frequency of Purchases     0
dtype: int64

## Data Cleaning

In [11]:
# Imputing missing values in Review Rating column with the median rating of the product category

df['Review Rating'] = df.groupby('Category')['Review Rating'].transform(lambda x: x.fillna(x.median()))
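To make the group-wise imputation easier to follow, here is a minimal sketch on hypothetical toy data. It reaches the same result as the lambda version above by computing each row's category median with `transform("median")` and then filling only the missing entries; the toy values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing rating per category
toy = pd.DataFrame({
    "Category": ["Clothing", "Clothing", "Clothing",
                 "Footwear", "Footwear", "Footwear"],
    "Review Rating": [3.0, 4.0, np.nan, 2.5, np.nan, 4.0],
})

# Median of each row's own category, aligned to the original index
category_median = toy.groupby("Category")["Review Rating"].transform("median")

# Fill only the missing entries; observed ratings are left unchanged
toy["Review Rating"] = toy["Review Rating"].fillna(category_median)
print(toy)
```

Using the category median rather than the global median keeps imputed ratings in line with how each product category is typically rated.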

In [12]:
df.isnull().sum()

Customer ID               0
Age                       0
Gender                    0
Item Purchased            0
Category                  0
Purchase Amount (USD)     0
Location                  0
Size                      0
Color                     0
Season                    0
Review Rating             0
Subscription Status       0
Shipping Type             0
Discount Applied          0
Promo Code Used           0
Previous Purchases        0
Payment Method            0
Frequency of Purchases    0
dtype: int64

In [13]:
# Renaming columns to snake_case for better readability and documentation

df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ','_')
df = df.rename(columns={'purchase_amount_(usd)':'purchase_amount'})
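If more columns carried units in parentheses, the manual rename could become tedious. The sketch below is one possible generalization, assuming that any parenthetical suffix such as `(USD)` should simply be dropped; the column names shown are a small illustrative subset.

```python
import pandas as pd

# Illustrative subset of the raw column names
raw_columns = pd.Index(["Customer ID", "Purchase Amount (USD)", "Review Rating"])

clean_columns = (
    raw_columns.str.replace(r"\s*\(.*?\)", "", regex=True)  # drop "(USD)"-style suffixes
               .str.strip()
               .str.lower()
               .str.replace(" ", "_")
)
print(list(clean_columns))
```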

In [14]:
df.columns

Index(['customer_id', 'age', 'gender', 'item_purchased', 'category',
       'purchase_amount', 'location', 'size', 'color', 'season',
       'review_rating', 'subscription_status', 'shipping_type',
       'discount_applied', 'promo_code_used', 'previous_purchases',
       'payment_method', 'frequency_of_purchases'],
      dtype='object')

In [15]:
df[['discount_applied','promo_code_used']].head(10)

,discount_applied,promo_code_used
0,Yes,Yes
1,Yes,Yes
2,Yes,Yes
3,Yes,Yes
4,Yes,Yes
5,Yes,Yes
6,Yes,Yes
7,Yes,Yes
8,Yes,Yes
9,Yes,Yes


In [16]:
(df['discount_applied'] == df['promo_code_used']).all()

True

In [17]:
# Dropping promo_code_used: it matches discount_applied in every row (see the check above), so it is redundant

df = df.drop('promo_code_used', axis=1)

In [18]:
df.columns

Index(['customer_id', 'age', 'gender', 'item_purchased', 'category',
       'purchase_amount', 'location', 'size', 'color', 'season',
       'review_rating', 'subscription_status', 'shipping_type',
       'discount_applied', 'previous_purchases', 'payment_method',
       'frequency_of_purchases'],
      dtype='object')

In [77]:
df.to_csv("customer_cleaned.csv", index=False)


## Feature Engineering

In [20]:
# Create a new column age_group by splitting age into quartiles (roughly equal-sized groups, not fixed age ranges)
labels = ['Young Adult', 'Adult', 'Middle-aged', 'Senior']
df['age_group'] = pd.qcut(df['age'], q=4, labels = labels)

In [21]:
df[['age','age_group']].head(10)

,age,age_group
0,55,Middle-aged
1,19,Young Adult
2,50,Middle-aged
3,21,Young Adult
4,45,Middle-aged
5,46,Middle-aged
6,63,Senior
7,27,Young Adult
8,26,Young Adult
9,57,Middle-aged
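Because `pd.qcut` derives its boundaries from the data, the age range behind each label is not obvious from the code alone. The sketch below, run on hypothetical ages, shows how `retbins=True` exposes the quartile edges so the groups can be documented.

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([18, 22, 27, 31, 38, 44, 50, 57, 63, 70], name="age")
labels = ["Young Adult", "Adult", "Middle-aged", "Senior"]

# retbins=True also returns the quartile edges that define each group
age_group, edges = pd.qcut(ages, q=4, labels=labels, retbins=True)

print(edges)                               # five edges delimiting four groups
print(age_group.value_counts(sort=False))  # group sizes in label order
```

If fixed, business-defined age ranges are preferred over data-driven quartiles, `pd.cut` with explicit bin edges would be the usual alternative.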


In [22]:
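# Create a new column purchase_frequency_days: approximate number of days between purchases
# ('Bi-Weekly' is treated as every two weeks, the same as 'Fortnightly')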

frequency_mapping = {
    'Fortnightly': 14,
    'Weekly': 7,
    'Monthly': 30,
    'Quarterly': 90,
    'Bi-Weekly': 14,
    'Annually': 365,
    'Every 3 Months': 90
}

df['purchase_frequency_days'] = df['frequency_of_purchases'].map(frequency_mapping)

In [23]:
df[['purchase_frequency_days','frequency_of_purchases']].head(10)

,purchase_frequency_days,frequency_of_purchases
0,14,Fortnightly
1,14,Fortnightly
2,7,Weekly
3,7,Weekly
4,365,Annually
5,7,Weekly
6,90,Quarterly
7,7,Weekly
8,365,Annually
9,90,Quarterly
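`Series.map` returns `NaN` for any label that is missing from the dictionary, so a quick check after mapping can catch unexpected categories. The sketch below reuses the mapping from the cell above on a hypothetical series that contains one unknown label (`"Twice a Year"` is invented for the example).

```python
import pandas as pd

# Mapping copied from the feature engineering cell
frequency_mapping = {
    "Fortnightly": 14,
    "Weekly": 7,
    "Monthly": 30,
    "Quarterly": 90,
    "Bi-Weekly": 14,
    "Annually": 365,
    "Every 3 Months": 90,
}

# Hypothetical input containing a label that is missing from the mapping
freq = pd.Series(["Weekly", "Monthly", "Every 3 Months", "Twice a Year"])
days = freq.map(frequency_mapping)

# Labels without a mapping become NaN, so list them explicitly
unmapped = freq[days.isna()].unique().tolist()
print(unmapped)
```

On the real frame, the equivalent check would be `df['purchase_frequency_days'].isna().sum()`, which should be zero if every label is covered.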


## Exploratory Data Analysis (EDA)


### Customer Demographics


In [26]:
gender_distribution = (
    df.groupby('gender', as_index=False)
      .agg(customer_count=('customer_id', 'count'))
      .sort_values(by='customer_count', ascending=False)
)

gender_distribution


,gender,customer_count
1,Male,2652
0,Female,1248


**Insight:**  
Male customers make up 68% of the customer base (2,652 of 3,900), compared with 32% for female customers (1,248). This roughly 2:1 ratio suggests that the current customer base skews heavily male.


### Spending Behavior


In [29]:
summary = pd.DataFrame({
    "Metric": [
        "Average Purchase Amount",
        "Median Purchase Amount",
        "Minimum Purchase",
        "Maximum Purchase",
        "Standard Deviation"
    ],
    "Value": [
        df['purchase_amount'].mean(),
        df['purchase_amount'].median(),
        df['purchase_amount'].min(),
        df['purchase_amount'].max(),
        df['purchase_amount'].std()
    ],
    "What it indicates": [
        "Typical customer spend per order",
        "Middle value of customer spending",
        "Lowest observed purchase amount",
        "Highest observed purchase amount",
        "Variability in customer spending"
    ]
})

display(summary)


,Metric,Value,What it indicates
0,Average Purchase Amount,59.764359,Typical customer spend per order
1,Median Purchase Amount,60.0,Middle value of customer spending
2,Minimum Purchase,20.0,Lowest observed purchase amount
3,Maximum Purchase,100.0,Highest observed purchase amount
4,Standard Deviation,23.685392,Variability in customer spending


**Insight:**  
The mean ($59.76) and median ($60.00) purchase amounts are nearly the same, indicating a fairly balanced, roughly symmetric spending pattern across customers. Purchases range from $20 to $100 with a standard deviation of about $23.69, so individual order values still vary considerably.
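Two simple ratios can put numbers on this reading. The sketch below uses the values from the summary table above; the choice of the coefficient of variation and the mean-median gap as diagnostics is an illustrative assumption, not part of the original analysis.

```python
# Values copied from the summary table above
mean_spend = 59.764359
median_spend = 60.0
std_spend = 23.685392

# Coefficient of variation: spread relative to the typical order value
cv = std_spend / mean_spend

# Mean-median gap in units of standard deviation: a rough symmetry check
gap_in_std = (mean_spend - median_spend) / std_spend

print(f"coefficient of variation: {cv:.2f}")
print(f"mean-median gap: {gap_in_std:.3f} standard deviations")
```

A gap this close to zero is consistent with a distribution that has no strong skew, while the coefficient of variation shows that the spread is still sizeable relative to the mean.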


### Product & Category Performance


In [66]:
df.groupby(['category'], as_index=False) \
  .agg(revenue=('purchase_amount', 'sum')) \
  .sort_values(by='revenue', ascending=False)


,category,revenue
1,Clothing,104264
0,Accessories,74200
2,Footwear,36093
3,Outerwear,18524


**Insight:**  
Clothing ($104,264) and Accessories ($74,200) are the strongest revenue contributors, together accounting for roughly 77% of the $233,081 in total sales. Footwear ($36,093) and Outerwear ($18,524) contribute comparatively less, indicating potential areas for growth or targeted promotion.
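The share of each category can be derived directly from the revenue table. The sketch below starts from the totals shown above; the name `revenue_share` is introduced here for illustration.

```python
import pandas as pd

# Revenue totals copied from the category table above
revenue = pd.Series(
    {"Clothing": 104264, "Accessories": 74200, "Footwear": 36093, "Outerwear": 18524},
    name="revenue",
)

share = revenue / revenue.sum()

# Per-category share and running total, both in percent
revenue_share = pd.DataFrame({
    "revenue": revenue,
    "share_pct": (share * 100).round(1),
    "cumulative_pct": (share.cumsum() * 100).round(1),
})
print(revenue_share)
```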


### Discounts & Promotions


In [34]:
discount_spend_analysis = (
    df.groupby('discount_applied', as_index=False)
      .agg(
          avg_purchase_amount=('purchase_amount', 'mean'),
          total_purchases=('customer_id', 'count')
      )
)

discount_spend_analysis


,discount_applied,avg_purchase_amount,total_purchases
0,No,60.130454,2223
1,Yes,59.27907,1677


**Insight:**  
Average purchase amounts are nearly identical for discounted ($59.28) and non-discounted ($60.13) orders, a gap of less than $1, suggesting that discounts do not materially change how much customers spend per order. Whether discounts influence the decision to purchase at all cannot be determined from these averages alone.
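One way to check whether a gap this small could be due to chance is a permutation test on the difference in means. The sketch below is a minimal version run on synthetic data; the group sizes, the uniform distribution, and the number of permutations are all assumptions chosen for illustration, and the test itself is not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two groups; not the project's data
no_discount = rng.uniform(20, 100, size=500)
discount = rng.uniform(20, 100, size=400)

observed = no_discount.mean() - discount.mean()
pooled = np.concatenate([no_discount, discount])

# Shuffle group membership and recompute the difference in means
n_perm = 2000
diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    diffs[i] = (shuffled[:no_discount.size].mean()
                - shuffled[no_discount.size:].mean())

# Two-sided p-value: share of shuffles at least as extreme as the observed gap
p_value = (np.abs(diffs) >= abs(observed)).mean()
print(f"observed difference: {observed:.2f}, p-value: {p_value:.3f}")
```

With the real frame, the two arrays would come from `df.loc[df['discount_applied'] == 'No', 'purchase_amount'].to_numpy()` and its `'Yes'` counterpart.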


### Customer Loyalty


In [36]:
loyalty_spend_analysis = (
    df.assign(customer_type=lambda x: x['previous_purchases'].apply(
        lambda v: 'Repeat Customer' if v > 1 else 'One-time Customer'
    ))
    .groupby('customer_type', as_index=False)
    .agg(
        avg_purchase_amount=('purchase_amount', 'mean'),
        customer_count=('customer_id', 'count')
    )
)

loyalty_spend_analysis


,customer_type,avg_purchase_amount,customer_count
0,One-time Customer,58.46988,83
1,Repeat Customer,59.792507,3817


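**Insight:**  
Repeat customers (more than one previous purchase) make up about 98% of the customer base (3,817 of 3,900) and show a slightly higher average purchase amount ($59.79) than one-time customers ($58.47). With only 83 one-time customers the difference should be read cautiously, but the result highlights the strong role of repeat buyers in sustaining overall revenue.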
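Since a single cut-off at one previous purchase puts almost everyone in the same group, a finer segmentation may be more informative. The sketch below bins `previous_purchases` into tiers with `pd.cut`; the tier boundaries and the toy rows are hypothetical and would need to be chosen to suit the business.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "previous_purchases": [1, 2, 5, 12, 30, 49],
    "purchase_amount": [40, 55, 60, 70, 65, 80],
})

# Hypothetical tier boundaries; bins are right-inclusive: 1, 2-10, 11-25, 26+
bins = [0, 1, 10, 25, float("inf")]
tier_labels = ["1", "2-10", "11-25", "26+"]
toy["loyalty_tier"] = pd.cut(toy["previous_purchases"], bins=bins, labels=tier_labels)

# Average spend and customer count per tier
tier_summary = (
    toy.groupby("loyalty_tier", observed=True)
       .agg(avg_purchase_amount=("purchase_amount", "mean"),
            customer_count=("previous_purchases", "count"))
)
print(tier_summary)
```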
