# Exploratory Data Analysis of E-commerce Product Data

## Overview

This notebook performs exploratory data analysis (EDA) on a dataset of product information from an e-commerce platform. The dataset includes the following features:
- **Product ID**: Unique identifier for each product.
- **Gender**: Target gender for the product (e.g., Men, Women).
- **Master Category**: The main category of the product (e.g., Apparel, Accessories).
- **Subcategory**: Specific subcategory within the master category (e.g., Topwear, Bottomwear, Watches).
- **Article Type**: Specific type of article (e.g., Shirts, Jeans, Watches).
- **Base Colour**: The primary color of the product.
- **Season**: The season associated with the product (e.g., Fall, Summer, Winter).
- **Year**: The year the product was released.
- **Usage**: The intended usage of the product (e.g., Casual, Party).
- **Product Display Name**: The name of the product as displayed on the platform.

## Objectives

- **Category Analysis**: Explore the distribution of products across different master categories and subcategories to understand the product mix.
- **Gender Insights**: Analyze products based on target gender to identify market segmentation and potential gaps.
- **Color Trends**: Identify popular base colors and how color preferences vary across categories and seasons.
- **Seasonality and Yearly Trends**: Examine product releases over different seasons and years to uncover seasonal trends and growth patterns.
- **Usage Patterns**: Investigate the intended usage types (e.g., Casual, Party) to understand consumer preferences and product positioning.
- **Brand and Product Popularity**: Use the product display names to identify the brands and products that appear most often in the catalogue.


The goal of this EDA is to describe the platform's product offerings and trends in a way that supports decisions in merchandising, marketing, and inventory management.
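As a rough illustration of the category and gender objectives, the sketch below counts products per master category and cross-tabulates category against target gender. The `toy_df` frame is hypothetical stand-in data that only borrows the dataset's column names (`gender`, `masterCategory`); it is not the real catalogue, and the real analysis may be structured differently.

```python
import pandas as pd

# Hypothetical stand-in rows; only the column names mirror the real dataset.
toy_df = pd.DataFrame({
    "gender": ["Men", "Men", "Women", "Men", "Women"],
    "masterCategory": ["Apparel", "Apparel", "Accessories", "Apparel", "Apparel"],
})

# Product mix: number of products in each master category.
category_counts = toy_df["masterCategory"].value_counts()

# Segmentation: rows are master categories, columns are target genders.
gender_by_category = pd.crosstab(toy_df["masterCategory"], toy_df["gender"])

print(category_counts)
print(gender_by_category)
```

The same two calls should apply to `product_df` once it is loaded, and swapping in `subCategory`, `season`, or `usage` gives the corresponding breakdowns.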

In [6]:
import os
import sys
import matplotlib.pyplot as plt
import pandas as pd

### Data Understanding

In [8]:
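# Load the product catalogue. on_bad_lines='skip' silently drops malformed rows
# (for example, rows with too many fields), so the row count may be below the file's.
product_df = pd.read_csv('product.csv', quotechar='"', on_bad_lines='skip')
product_df.head()  # preview the first five rows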

| | id | gender | masterCategory | subCategory | articleType | baseColour | season | year | usage | productDisplayName |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15970 | Men | Apparel | Topwear | Shirts | Navy Blue | Fall | 2011.0 | Casual | Turtle Check Men Navy Blue Shirt |
| 1 | 39386 | Men | Apparel | Bottomwear | Jeans | Blue | Summer | 2012.0 | Casual | Peter England Men Party Blue Jeans |
| 2 | 59263 | Women | Accessories | Watches | Watches | Silver | Winter | 2016.0 | Casual | Titan Women Silver Watch |
| 3 | 21379 | Men | Apparel | Bottomwear | Track Pants | Black | Fall | 2011.0 | Casual | Manchester United Men Solid Black Track Pants |
| 4 | 53759 | Men | Apparel | Topwear | Tshirts | Grey | Summer | 2012.0 | Casual | Puma Men Grey T-shirt |
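Because `on_bad_lines='skip'` discards malformed rows without reporting them, it can be useful to know how many rows were lost. One possible approach, sketched below, is to pass a callable to `on_bad_lines` instead; pandas supports this from version 1.4 and only with `engine="python"`. The miniature CSV, the `skipped` list, and the `record_bad_line` helper are all hypothetical and exist only for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: the third data row has one field too many.
sample_csv = io.StringIO(
    "id,gender,baseColour\n"
    "1,Men,Blue\n"
    "2,Women,Red\n"
    "3,Men,Black,extra\n"
)

skipped = []  # collects every malformed row the parser hands to the callback


def record_bad_line(fields):
    """Remember the malformed row, then drop it (same effect as 'skip')."""
    skipped.append(fields)
    return None  # returning None tells pandas to omit the row


demo_df = pd.read_csv(sample_csv, on_bad_lines=record_bad_line, engine="python")

print(f"rows kept: {len(demo_df)}, rows skipped: {len(skipped)}")
```

The python engine is slower than the default C engine, so this is probably best used as a one-off check rather than as the regular loading path.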


In [9]:
product_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44424 entries, 0 to 44423
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  44424 non-null  int64  
 1   gender              44424 non-null  object 
 2   masterCategory      44424 non-null  object 
 3   subCategory         44424 non-null  object 
 4   articleType         44424 non-null  object 
 5   baseColour          44409 non-null  object 
 6   season              44403 non-null  object 
 7   year                44423 non-null  float64
 8   usage               44107 non-null  object 
 9   productDisplayName  44417 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 3.4+ MB
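The non-null counts above imply a small number of missing values: 15 in `baseColour`, 21 in `season`, 1 in `year`, 317 in `usage` (44424 - 44107), and 7 in `productDisplayName`. A minimal sketch of how such a summary could be computed directly is shown below; `missing_summary` is a hypothetical helper and `toy_df` is made-up data, not part of the original notebook.

```python
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical stand-in frame with a few gaps.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "baseColour": ["Blue", None, "Black", "Grey"],
    "usage": ["Casual", None, None, "Party"],
})

report = missing_summary(toy_df)
print(report)
```

Calling `missing_summary(product_df)` should reproduce the counts listed above, assuming the frame is unchanged since `info()` was run.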


In [11]:
product_df.shape

(44424, 10)

In [13]:
product_df.dtypes

id                      int64
gender                 object
masterCategory         object
subCategory            object
articleType            object
baseColour             object
season                 object
year                  float64
usage                  object
productDisplayName     object
dtype: object
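The `year` column is stored as `float64`, most likely because it contains one missing value (44423 of 44424 entries are non-null) and a plain `int64` column cannot hold `NaN`. If whole-number years are preferred, one option is pandas' nullable `Int64` dtype, sketched here on a hypothetical `toy_year` series rather than the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the year column, including one missing value.
toy_year = pd.Series([2011.0, 2012.0, np.nan, 2016.0], name="year")

# The nullable integer dtype keeps the gap as <NA> instead of forcing floats.
year_int = toy_year.astype("Int64")

print(year_int)
```

This conversion only works when every non-missing value is already a whole number, which appears to be the case in the preview rows but is an assumption for the full column.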

### Data Cleaning