# Day 9 – Data Cleaning
## E-Commerce Customer Behavior & Sales Analysis

**Objective:**  
Clean and prepare raw e-commerce transactional data for analysis by handling missing values, duplicates, and inconsistent column formats.

**Key steps:**
- Missing value treatment
- Duplicate removal
- Column standardization
- Data validation

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

In [7]:
df = pd.read_csv("/content/ecommerce_customer_behavior_dataset.csv")
df.head()

| | Order_ID | Customer_ID | Date | Age | Gender | City | Product_Category | Unit_Price | Quantity | Discount_Amount | Total_Amount | Payment_Method | Device_Type | Session_Duration_Minutes | Pages_Viewed | Is_Returning_Customer | Delivery_Time_Days | Customer_Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ORD_001337 | CUST_01337 | 2023-01-01 | 27 | Female | Bursa | Toys | 54.28 | 1 | 0.0 | 54.28 | Debit Card | Mobile | 4 | 14 | True | 8 | 5 |
| 1 | ORD_004885 | CUST_04885 | 2023-01-01 | 42 | Male | Konya | Toys | 244.9 | 1 | 0.0 | 244.9 | Credit Card | Mobile | 11 | 3 | True | 3 | 3 |
| 2 | ORD_004507 | CUST_04507 | 2023-01-01 | 43 | Female | Ankara | Food | 48.15 | 5 | 0.0 | 240.75 | Credit Card | Mobile | 7 | 8 | True | 5 | 2 |
| 3 | ORD_000645 | CUST_00645 | 2023-01-01 | 32 | Male | Istanbul | Electronics | 804.06 | 1 | 229.28 | 574.78 | Credit Card | Mobile | 8 | 10 | False | 1 | 4 |
| 4 | ORD_000690 | CUST_00690 | 2023-01-01 | 40 | Female | Istanbul | Sports | 755.61 | 5 | 0.0 | 3778.05 | Cash on Delivery | Desktop | 21 | 10 | True | 7 | 4 |


In [8]:
df.shape

(5000, 18)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Order_ID                  5000 non-null   object 
 1   Customer_ID               5000 non-null   object 
 2   Date                      5000 non-null   object 
 3   Age                       5000 non-null   int64  
 4   Gender                    5000 non-null   object 
 5   City                      5000 non-null   object 
 6   Product_Category          5000 non-null   object 
 7   Unit_Price                5000 non-null   float64
 8   Quantity                  5000 non-null   int64  
 9   Discount_Amount           5000 non-null   float64
 10  Total_Amount              5000 non-null   float64
 11  Payment_Method            5000 non-null   object 
 12  Device_Type               5000 non-null   object 
 13  Session_Duration_Minutes  5000 non-null   int64  
 14  Pages_Vi

In [17]:
df.isnull().sum().sort_values(ascending=False)

| Column | Missing values |
|---|---|
| order_id | 0 |
| customer_id | 0 |
| date | 0 |
| age | 0 |
| gender | 0 |
| city | 0 |
| product_category | 0 |
| unit_price | 0 |
| quantity | 0 |
| discount_amount | 0 |


In [18]:
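# Fill missing values: mode for text (object) columns, median for all others
for col in df.columns:
    if df[col].dtype == "object":
        # mode() can return several values; take the first
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())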
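
The loop above relies on every column having at least one non-missing value: on an all-null column `mode()` returns an empty Series, so `mode()[0]` raises an error. A minimal sketch of a more defensive variant follows. The helper name `impute_missing` and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_missing(frame: pd.DataFrame) -> dict:
    """Fill NaNs in place (median for numeric columns, mode otherwise).

    Returns {column: number of values filled}. Hypothetical helper.
    """
    filled = {}
    for col in frame.columns:
        n_missing = int(frame[col].isnull().sum())
        non_null = frame[col].dropna()
        if n_missing == 0 or non_null.empty:
            # Nothing to fill, or no observed value to derive a fill from
            continue
        if is_numeric_dtype(frame[col]):
            fill_value = non_null.median()
        else:
            fill_value = non_null.mode().iloc[0]
        frame[col] = frame[col].fillna(fill_value)
        filled[col] = n_missing
    return filled


# Toy data for illustration only (not taken from the e-commerce dataset)
demo = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "city": ["Bursa", None, "Bursa"],
    "notes": [None, None, None],
})
report = impute_missing(demo)
```

Returning the per-column fill counts makes it easy to see whether imputation actually changed anything, which is useful when a dataset turns out to have no missing values at all.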

In [19]:
df.isnull().sum()

| Column | Missing values |
|---|---|
| order_id | 0 |
| customer_id | 0 |
| date | 0 |
| age | 0 |
| gender | 0 |
| city | 0 |
| product_category | 0 |
| unit_price | 0 |
| quantity | 0 |
| discount_amount | 0 |


In [21]:
df.duplicated().sum()

np.int64(0)

In [23]:
df.drop_duplicates(inplace=True)
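
`df.duplicated()` only flags rows that are identical in every column, so a repeated order ID with different values elsewhere would not be counted. The sketch below checks both cases. It assumes that a key column such as `order_id` should be unique, which the dataset description does not confirm; `duplicate_report` is a hypothetical helper and the data is a toy example.

```python
import pandas as pd


def duplicate_report(frame: pd.DataFrame, key: str) -> dict:
    """Count fully identical rows and rows that repeat `key` (hypothetical helper)."""
    return {
        "identical_rows": int(frame.duplicated().sum()),
        "repeated_keys": int(frame.duplicated(subset=[key]).sum()),
    }


# Toy data for illustration only
demo = pd.DataFrame({
    "order_id": ["ORD_1", "ORD_1", "ORD_2", "ORD_2"],
    "quantity": [1, 1, 2, 3],
})
counts = duplicate_report(demo, key="order_id")

# Drop exact duplicates, then reset the index so row labels stay contiguous
deduped = demo.drop_duplicates().reset_index(drop=True)
```

Rows that repeat a key but differ elsewhere usually need a manual decision (keep first, keep latest, or investigate), so the sketch only reports them instead of dropping them.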

In [24]:
df.columns = (
    df.columns
    .str.lower()
    .str.strip()
    .str.replace(" ", "_")
)
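
The chain above handles single spaces. As a hedged variation, the sketch below also collapses runs of whitespace (double spaces, tabs) into one underscore by using a regular expression. `standardize_columns` and the sample column names are made up for illustration.

```python
import pandas as pd


def standardize_columns(columns: pd.Index) -> pd.Index:
    """Trim, lower-case, and replace whitespace runs with '_' (hypothetical helper)."""
    return (
        columns
        .str.strip()                           # remove leading/trailing whitespace first
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)  # one underscore per whitespace run
    )


# Made-up column names for illustration only
demo = pd.DataFrame(columns=[" Order ID ", "Unit  Price", "Customer_Rating"])
demo.columns = standardize_columns(demo.columns)
```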

In [25]:
df.columns

Index(['order_id', 'customer_id', 'date', 'age', 'gender', 'city',
       'product_category', 'unit_price', 'quantity', 'discount_amount',
       'total_amount', 'payment_method', 'device_type',
       'session_duration_minutes', 'pages_viewed', 'is_returning_customer',
       'delivery_time_days', 'customer_rating'],
      dtype='object')

In [26]:
df.dtypes

| Column | Dtype |
|---|---|
| order_id | object |
| customer_id | object |
| date | object |
| age | int64 |
| gender | object |
| city | object |
| product_category | object |
| unit_price | float64 |
| quantity | int64 |
| discount_amount | float64 |
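
The `date` column is still stored as `object` (plain strings), so date arithmetic and resampling will not work on it yet. A minimal sketch of the conversion, assuming the `YYYY-MM-DD` layout seen in the sample rows holds for the whole column; the toy frame below stands in for `df`.

```python
import pandas as pd

# Toy rows in the same YYYY-MM-DD layout as the dataset's date column,
# plus one deliberately invalid value
demo = pd.DataFrame({"date": ["2023-01-01", "2023-01-02", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising
demo["date"] = pd.to_datetime(demo["date"], format="%Y-%m-%d", errors="coerce")

# Count values that failed to parse so they can be inspected before analysis
n_unparsed = int(demo["date"].isna().sum())

# Datetime columns expose the .dt accessor, e.g. for month-level grouping
order_month = demo["date"].dt.month
```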


In [27]:
df.describe()

| | age | unit_price | quantity | discount_amount | total_amount | session_duration_minutes | pages_viewed | delivery_time_days | customer_rating |
|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 35.0326 | 455.83412 | 2.22 | 24.852804 | 983.108914 | 14.5734 | 8.9842 | 6.497 | 3.9028 |
| std | 11.080546 | 712.477209 | 1.398711 | 88.385124 | 1898.978528 | 8.66575 | 2.80434 | 3.464966 | 1.128542 |
| min | 18.0 | 5.18 | 1.0 | 0.0 | 7.87 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 27.0 | 76.5875 | 1.0 | 0.0 | 122.5175 | 8.0 | 7.0 | 4.0 | 3.0 |
| 50% | 35.0 | 182.95 | 2.0 | 0.0 | 337.91 | 13.0 | 9.0 | 6.0 | 4.0 |
| 75% | 42.0 | 513.93 | 3.0 | 8.76 | 979.695 | 19.0 | 11.0 | 8.0 | 5.0 |
| max | 75.0 | 7159.45 | 5.0 | 1525.55 | 22023.9 | 73.0 | 24.0 | 25.0 | 5.0 |
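
Summary statistics show ranges but do not check whether columns agree with each other. In the sample rows printed by `df.head()`, `total_amount` equals `unit_price * quantity - discount_amount`. The sketch below tests that relationship along with two range rules. Treating it as a rule for the whole dataset is an assumption, as are the 1 to 5 rating range (taken from the min and max above) and the helper name `validate_orders`.

```python
import numpy as np
import pandas as pd


def validate_orders(frame: pd.DataFrame) -> dict:
    """Count rows that violate simple sanity rules (hypothetical helper)."""
    expected_total = (
        frame["unit_price"] * frame["quantity"] - frame["discount_amount"]
    )
    # A one-cent absolute tolerance absorbs floating-point rounding
    totals_match = np.isclose(frame["total_amount"], expected_total, atol=0.01)
    return {
        "total_mismatch": int((~totals_match).sum()),
        "rating_out_of_range": int((~frame["customer_rating"].between(1, 5)).sum()),
        "non_positive_quantity": int((frame["quantity"] <= 0).sum()),
    }


# Three rows copied from the sample output, plus one deliberately bad row
demo = pd.DataFrame({
    "unit_price": [54.28, 804.06, 755.61, 10.00],
    "quantity": [1, 1, 5, 0],
    "discount_amount": [0.0, 229.28, 0.0, 0.0],
    "total_amount": [54.28, 574.78, 3778.05, 99.99],
    "customer_rating": [5, 4, 4, 7],
})
issues = validate_orders(demo)
```

Counting violations instead of dropping rows keeps the check non-destructive; any non-zero count is a prompt to inspect the offending rows before deciding how to treat them.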


In [28]:
df.to_csv("ecommerce_cleaned.csv", index=False)

## Data Cleaning Summary

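- Checked for missing values: none appear in the outputs shown, so the median (numerical) and mode (categorical) imputation acts as a safeguard here
- Checked for duplicate records: `df.duplicated().sum()` returned 0, so `drop_duplicates` removed no rows
- Standardized column names to lowercase with underscores
- Inspected data types and summary statistics; `date` is still stored as `object` and needs converting to a datetime type before any time-based analysis

The dataset is now ready for exploratory data analysis.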