# Project: Ecommerce_Project EDA
**Author:** Muhammed Riswan
**Date:** 2025-12-08

## 1. Business Understanding
**Context:**
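This project analyzes transaction data for a large-scale e-commerce platform that sells products across several categories (Electronics, Apparel, etc.). As a Data Scientist Intern, I am analyzing a dataset of 100,000 transaction records to understand customer behavior and sales trends. The data comes from the synthetic Kaggle file `ecommerce_synthetic_dataset.csv`, so findings describe the dataset rather than a real business.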

**Objective:**
The goal of this analysis is to:
1. Identify high-value customers and products.
2. Understand sales trends over time.
3. Detect data quality issues that might affect reporting.
4. Provide actionable insights to improve marketing strategies.

In [5]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/e-commerce-dataset-for-practice/ecommerce_synthetic_dataset.csv


In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
df = pd.read_csv('/kaggle/input/e-commerce-dataset-for-practice/ecommerce_synthetic_dataset.csv')

In [8]:
print(f"Dataset shape: {df.shape}")

Dataset shape: (100000, 21)


In [9]:
print("Column Info:")
df.info()

Column Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 21 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   UserID              100000 non-null  int64  
 1   UserName            100000 non-null  object 
 2   Age                 100000 non-null  int64  
 3   Gender              100000 non-null  object 
 4   Country             100000 non-null  object 
 5   SignUpDate          100000 non-null  object 
 6   ProductID           100000 non-null  int64  
 7   ProductName         100000 non-null  object 
 8   Category            100000 non-null  object 
 9   Price               100000 non-null  float64
 10  PurchaseDate        100000 non-null  object 
 11  Quantity            100000 non-null  int64  
 12  TotalAmount         100000 non-null  float64
 13  HasDiscountApplied  100000 non-null  bool   
 14  DiscountRate        100000 non-null  float64
 15  ReviewScore         10

In [10]:
print("Sample Data")
df.head()

Sample Data


| | UserID | UserName | Age | Gender | Country | SignUpDate | ProductID | ProductName | Category | Price | ... | Quantity | TotalAmount | HasDiscountApplied | DiscountRate | ReviewScore | ReviewText | LastLogin | SessionDuration | DeviceType | ReferralSource |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | User_1 | 39 | Male | UK | 2021-02-01 | 8190 | Shoes | Books | 532.37 | ... | 1 | 532.37 | False | 0.02 | 5.1 | Excellent | 2024-05-03 04:04:27.591583 | 45.02 | Mobile | Social Media |
| 1 | 2 | User_2 | 25 | Female | Canada | 2020-12-04 | 9527 | T-shirt | Accessories | 848.83 | ... | 1 | 848.83 | True | 0.29 | 5.1 | Excellent | 2024-08-31 04:04:27.591606 | 13.83 | Mobile | Social Media |
| 2 | 3 | User_3 | 43 | Male | Canada | 2022-07-08 | 3299 | Headphones | Apparel | 64.88 | ... | 2 | 129.76 | False | 0.03 | 3.2 | Good | 2024-07-28 04:04:27.591611 | 59.09 | Tablet | Organic Search |
| 3 | 4 | User_4 | 44 | Male | Germany | 2021-06-07 | 8795 | T-shirt | Apparel | 465.08 | ... | 2 | 930.16 | False | 0.23 | 4.3 | Good | 2024-03-11 04:04:27.591615 | 55.42 | Desktop | Email Marketing |
| 4 | 5 | User_5 | 23 | Female | Canada | 2021-11-06 | 1389 | Shoes | Books | 331.82 | ... | 1 | 331.82 | False | 0.02 | 5.1 | Average | 2024-07-02 04:04:27.591619 | 14.99 | Tablet | Email Marketing |


In [15]:
df.tail()

| | UserID | UserName | Age | Gender | Country | SignUpDate | ProductID | ProductName | Category | Price | ... | Quantity | TotalAmount | HasDiscountApplied | DiscountRate | ReviewScore | ReviewText | LastLogin | SessionDuration | DeviceType | ReferralSource |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99995 | 99996 | User_99996 | 25 | Female | UK | 2020-11-18 | 8401 | Shoes | Accessories | 213.12 | ... | 1 | 213.12 | True | 0.04 | 4.9 | Poor | 2024-04-05 04:04:28.072377 | 96.64 | Tablet | Organic Search |
| 99996 | 99997 | User_99997 | 53 | Non-Binary | Canada | 2020-07-28 | 6555 | T-shirt | Electronics | 672.12 | ... | 3 | 2016.36 | True | 0.42 | 5.4 | Excellent | 2024-04-16 04:04:28.072381 | 60.98 | Tablet | Social Media |
| 99997 | 99998 | User_99998 | 25 | Male | Canada | 2020-04-13 | 7686 | Laptop | Electronics | 515.35 | ... | 3 | 1546.05 | True | 0.21 | 4.0 | Average | 2024-08-28 04:04:28.072384 | 15.69 | Tablet | Social Media |
| 99998 | 99999 | User_99999 | 63 | Female | Germany | 2022-12-23 | 2885 | Watch | Books | 448.82 | ... | 2 | 897.64 | True | 0.09 | 2.0 | Poor | 2024-10-07 04:04:28.072388 | 28.95 | Tablet | Organic Search |
| 99999 | 100000 | User_100000 | 52 | Female | Germany | 2020-07-12 | 3963 | Laptop | Books | 117.13 | ... | 2 | 234.26 | False | 0.21 | 3.2 | Good | 2024-10-11 04:04:28.072392 | 68.52 | Desktop | Organic Search |


In [11]:
print("Missing Values:")
df.isna().sum()

Missing Values:


UserID                0
UserName              0
Age                   0
Gender                0
Country               0
SignUpDate            0
ProductID             0
ProductName           0
Category              0
Price                 0
PurchaseDate          0
Quantity              0
TotalAmount           0
HasDiscountApplied    0
DiscountRate          0
ReviewScore           0
ReviewText            0
LastLogin             0
SessionDuration       0
DeviceType            0
ReferralSource        0
dtype: int64

In [12]:
print(f"Duplicate Rows: {df.duplicated().sum()}")

Duplicate Rows: 0


In [13]:
df.describe()

| | UserID | Age | ProductID | Price | Quantity | TotalAmount | DiscountRate | ReviewScore | SessionDuration |
|---|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 50000.5 | 43.46081 | 5508.11723 | 505.631966 | 2.49569 | 1260.85236 | 0.249831 | 4.006239 | 62.408836 |
| std | 28867.657797 | 14.980333 | 2606.544036 | 286.137505 | 1.120354 | 964.100243 | 0.144505 | 0.99875 | 33.177372 |
| min | 1.0 | 18.0 | 1000.0 | 10.0 | 1.0 | 10.02 | 0.0 | -0.6 | 5.0 |
| 25% | 25000.75 | 31.0 | 3228.0 | 257.1375 | 1.0 | 494.6775 | 0.13 | 3.3 | 33.72 |
| 50% | 50000.5 | 43.0 | 5520.0 | 505.95 | 2.0 | 966.08 | 0.25 | 4.0 | 62.415 |
| 75% | 75000.25 | 56.0 | 7776.0 | 753.7325 | 4.0 | 1850.33 | 0.37 | 4.7 | 91.04 |
| max | 100000.0 | 69.0 | 9998.0 | 999.98 | 4.0 | 3999.72 | 0.5 | 8.6 | 120.0 |


In [14]:
df.nunique()

UserID                100000
UserName              100000
Age                       52
Gender                     3
Country                    6
SignUpDate              1096
ProductID               8999
ProductName                7
Category                   4
Price                  63008
PurchaseDate             366
Quantity                   4
TotalAmount            79235
HasDiscountApplied         2
DiscountRate              51
ReviewScore               85
ReviewText                 4
LastLogin             100000
SessionDuration        11499
DeviceType                 3
ReferralSource             4
dtype: int64

## 2. Data Quality Issue Log (Day 1 Observations)

| Column Name | Issue Observed | Action Plan (Day 2) |
|---|---|---|
| **SignUpDate** | Data type mismatch: stored as `object` (string). | Convert to datetime. |
| **PurchaseDate** | Data type mismatch: stored as `object` (string). | Convert to datetime. |
| **LastLogin** | Data type mismatch: stored as `object` (string). | Convert to datetime. |
| **ReviewScore** | Out-of-range values: min is **-0.60** and max is **8.60** (expected range 1-5). | Clip to the valid range or remove the affected rows. |
| **UserID** | Data type semantics: stored as integer, but it is an identifier, not a quantity. | Convert to string (categorical). |
| **ProductID** | Data type semantics: stored as integer, but it is an identifier, not a quantity. | Convert to string (categorical). |
| **General** | Missing values: **0 missing values** found across the dataset. | No action needed. |
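
The dtype fixes planned in the log could be sketched as below. This is a minimal illustration, not the Day 2 implementation: the helper name `fix_dtypes` is hypothetical, and the small `sample` frame stands in for the real data. Its `PurchaseDate` values and the malformed `SignUpDate` entry are invented for illustration, since `df.head()` elides that column.

```python
import pandas as pd

# Tiny stand-in for the real dataset. PurchaseDate values and the malformed
# SignUpDate entry are hypothetical; the other values mirror df.head().
sample = pd.DataFrame({
    "UserID": [1, 2, 3],
    "ProductID": [8190, 9527, 3299],
    "SignUpDate": ["2021-02-01", "2020-12-04", "not a date"],
    "PurchaseDate": ["2024-01-15", "2024-02-20", "2024-03-05"],
    "LastLogin": [
        "2024-05-03 04:04:27.591583",
        "2024-08-31 04:04:27.591606",
        "2024-07-28 04:04:27.591611",
    ],
})

DATE_COLS = ["SignUpDate", "PurchaseDate", "LastLogin"]
ID_COLS = ["UserID", "ProductID"]


def fix_dtypes(frame, date_cols=DATE_COLS, id_cols=ID_COLS):
    """Return a copy with date columns parsed and ID columns cast to strings."""
    out = frame.copy()
    for col in date_cols:
        # errors="coerce" turns unparseable strings into NaT instead of raising,
        # so bad values can be counted afterwards.
        out[col] = pd.to_datetime(out[col], errors="coerce")
    for col in id_cols:
        # IDs are labels, not quantities, so arithmetic on them is meaningless.
        out[col] = out[col].astype(str)
    return out


cleaned = fix_dtypes(sample)
print(cleaned.dtypes)
print("Unparseable dates per column:")
print(cleaned[DATE_COLS].isna().sum())
```

Counting `NaT` values after the conversion shows how many strings failed to parse, which is worth checking before dropping or imputing anything.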

## 3. Summary of Findings (Day 1)

**1. Dataset Structure:**
* **Shape:** 100,000 rows and 21 columns.
* **Completeness:** No missing values were found in any column (`df.isna().sum()` returns 0 for every column), and `df.duplicated().sum()` reports 0 duplicate rows.
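
`isna()` only detects true nulls, so a follow-up check for placeholder strings may be worthwhile. The sketch below is illustrative only: the `PLACEHOLDERS` set is an assumption about what disguised missing values might look like, and the `sample` frame is hypothetical rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; the blank and "N/A" entries are invented to show the check.
sample = pd.DataFrame({
    "Country": ["UK", "Canada", " ", "Germany"],
    "ReviewText": ["Excellent", "", "Good", "N/A"],
    "Age": [39, 25, 43, 44],
})

# Assumed list of strings that might stand in for a missing value.
PLACEHOLDERS = {"", "n/a", "na", "none", "null", "unknown", "-"}

# Restrict the check to text columns.
text_cols = [c for c in sample.columns if pd.api.types.is_string_dtype(sample[c])]

# Count placeholder-like entries per text column after trimming and lowercasing.
hidden_missing = {
    col: int(sample[col].str.strip().str.lower().isin(PLACEHOLDERS).sum())
    for col in text_cols
}
print(hidden_missing)
```

If every count is zero on the real data, the completeness finding holds for disguised nulls as well.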

**2. Key Variables:**
* **Categorical:** Includes demographic fields such as `Gender` (3 unique) and `Country` (6 unique), plus the product `Category` (4 unique).
* **Numerical:** Includes transactional fields such as `Price`, `Quantity`, and `TotalAmount`.
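
A quick consistency check on the transactional fields might look like the following sketch. It assumes `TotalAmount` should equal `Price * Quantity`, which holds for the rows shown by `df.head()` (even where a discount is flagged) but has not been verified on the full dataset. The first three rows are copied from that output; the last row is hypothetical and deliberately inconsistent.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Price": [532.37, 64.88, 465.08, 100.00],
    "Quantity": [1, 2, 2, 3],
    "TotalAmount": [532.37, 129.76, 930.16, 250.00],  # last row is made up
})

# Assumed relationship: TotalAmount = Price * Quantity (before any discount).
expected = sample["Price"] * sample["Quantity"]

# Compare with a one-cent tolerance to absorb floating-point rounding.
mismatch = ~np.isclose(sample["TotalAmount"], expected, atol=0.01)

print(f"Rows where TotalAmount differs from Price * Quantity: {mismatch.sum()}")
print(sample[mismatch])
```

Rows flagged on the real data would be candidates for the Day 2 issue log rather than proof of an error, since the discount logic is not documented.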

**3. Anomalies Detected:**
* **Review Scores:** The `ReviewScore` column contains out-of-range values. The minimum is **-0.60** and the maximum is **8.60**, both outside the 1-5 rating scale assumed here. This requires cleaning on Day 2.
* **Dates:** All date columns (`SignUpDate`, `PurchaseDate`, `LastLogin`) are stored as strings (`object` dtype) and must be converted to datetime before any time-series analysis.
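
The two cleaning options for `ReviewScore` (clipping versus removal) can be compared with a short sketch. The 1-5 bounds are an assumption about the intended scale, and the scores below are a handful of values that appear in the outputs above, not the full column.

```python
import pandas as pd

# A few ReviewScore values seen in df.head(), df.tail(), and df.describe().
scores = pd.Series([5.1, 3.2, 4.3, -0.6, 8.6, 2.0], name="ReviewScore")

LOW, HIGH = 1.0, 5.0  # assumed valid rating range

# Flag values outside the assumed range (between() is inclusive on both ends).
out_of_range = ~scores.between(LOW, HIGH)
print(f"Out-of-range scores: {out_of_range.sum()} of {len(scores)}")

# Option 1: clip, which keeps every row but pulls values to the nearest bound.
clipped = scores.clip(lower=LOW, upper=HIGH)

# Option 2: mask, which replaces invalid values with NaN so they can be
# dropped or imputed later.
masked = scores.where(~out_of_range)

print(pd.DataFrame({"original": scores, "clipped": clipped, "masked": masked}))
```

Clipping preserves the row count but piles values up at the bounds, while masking keeps the distribution honest at the cost of introducing missing values. Which is preferable depends on how many rows are affected.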