## Objective

### Understand the dataset, identify data quality issues, and define a structured EDA roadmap before deep analysis.

### 1️⃣ Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)


### 2️⃣ Load Dataset

In [None]:
df = pd.read_csv("retail_sales.csv")


### 3️⃣ First Look at Data

In [None]:
df.head()

df.tail()


df.sample(5)

### 4️⃣ Dataset Size & Shape

In [None]:
df.shape


- Rows represent individual customer orders
- Columns represent customer, product, and transaction attributes


### 5️⃣ Column Names & Data Types

In [None]:
df.columns

In [None]:
df.info()

### 📌 Key Analyst Checks

- Are dates stored as datetime?
- Are numerical columns stored as object?
- Any unexpected null values?
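
The three checks above can be run programmatically. The following is a minimal sketch on a made-up toy frame, not on `retail_sales.csv` itself. The helper `numeric_like_columns` and its `threshold` parameter are hypothetical names introduced for illustration, and the `Sales` and `Region` values are invented.

```python
import pandas as pd

# Toy frame standing in for the real dataset (all values are invented)
toy = pd.DataFrame({
    "Order_Date": ["2023-01-05", "2023-02-10", "not a date"],
    "Sales": ["100.5", "200", "310.25"],  # numbers stored as text
    "Region": ["East", "West", None],
})

def numeric_like_columns(frame, threshold=0.9):
    """Return non-numeric columns where at least `threshold` of the
    non-null values can be parsed as numbers (hypothetical helper)."""
    hits = []
    for col in frame.columns:
        series = frame[col]
        # Skip columns that already have a numeric or datetime dtype
        if pd.api.types.is_numeric_dtype(series):
            continue
        if pd.api.types.is_datetime64_any_dtype(series):
            continue
        non_null = series.dropna()
        if non_null.empty:
            continue
        parsed = pd.to_numeric(non_null, errors="coerce")
        if parsed.notna().mean() >= threshold:
            hits.append(col)
    return hits

# Check 1: is the date column a real datetime dtype?
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["Order_Date"])

# Check 2: which text columns actually hold numbers?
numeric_like = numeric_like_columns(toy)

# Check 3: null counts per column
null_counts = toy.isnull().sum()

print(is_datetime, numeric_like)
print(null_counts)
```

On the real data, the same helper could be called as `numeric_like_columns(df)`. Columns it flags are candidates for `pd.to_numeric`.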

### 6️⃣ Data Type Fixes (If Needed)

In [None]:
# errors='coerce' turns unparseable values into NaT, so re-check nulls afterwards
df['Order_Date'] = pd.to_datetime(df['Order_Date'], errors='coerce')


### 7️⃣ Statistical Summary

In [None]:
df.describe()

In [None]:
df.describe(include='object')


### 📌 Business Interpretation

- Sales show high variance → potential outliers
- Discount values appear capped → possible business rule
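
Both observations can be quantified rather than read off `describe()` by eye. The sketch below uses the common 1.5 × IQR rule to flag outliers and measures what share of discounts sit exactly at the maximum. The IQR rule is one possible choice here, not something the dataset dictates. The helper name `iqr_outlier_mask` and all values are invented for illustration.

```python
import pandas as pd

# Invented values standing in for df['Sales'] and a discount column
sales = pd.Series([120, 135, 150, 142, 128, 160, 2500], name="Sales")
discount = pd.Series([0.0, 0.1, 0.2, 0.3, 0.3, 0.3], name="Discount")

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

mask = iqr_outlier_mask(sales)
outliers = sales[mask]

# A large share of rows sitting exactly at the maximum hints at a cap
share_at_max = (discount == discount.max()).mean()

print(outliers.tolist())
print(share_at_max)
```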


### 8️⃣ Missing Values Overview

In [None]:
df.isnull().sum()

In [None]:
(df.isnull().mean() * 100).round(2)


### 📌 Initial Observation

- Age and Region have missing values
- Missing data < 10% → suitable for imputation
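
One simple way to impute such columns is the median for a numeric column and the most frequent value for a categorical one. The sketch below assumes that strategy; other strategies may suit the real data better. The toy frame is invented, and its missing share is higher than in the observation above purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names Age and Region come from the notebook
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Region": ["East", None, "East", "West"],
})

# Percentage of missing values per column, as in the cell above
missing_pct = (toy.isnull().mean() * 100).round(2)

imputed = toy.copy()
# Numeric column: fill with the median, which is robust to outliers
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
# Categorical column: fill with the most frequent category
imputed["Region"] = imputed["Region"].fillna(imputed["Region"].mode().iloc[0])

print(missing_pct)
print(imputed)
```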

### 9️⃣ Duplicate Records

In [None]:
df.duplicated().sum()

In [None]:
df[df.duplicated()]

### 📌 Action Plan

- Remove exact duplicates if found
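
A minimal sketch of that removal, shown on an invented toy frame. `Order_ID` is a hypothetical column name used only for illustration.

```python
import pandas as pd

# Invented rows; the second and third are exact duplicates
toy = pd.DataFrame({
    "Order_ID": [1, 2, 2, 3],
    "Sales": [100.0, 250.0, 250.0, 80.0],
})

n_before = len(toy)
# keep='first' retains the first occurrence of each duplicated row
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
n_removed = n_before - len(deduped)

print(f"Removed {n_removed} duplicate row(s)")
```

Recording how many rows were dropped makes the cleaning step auditable later.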

### 🔟 Unique Values Check (Categorical)

In [None]:
categorical_cols = df.select_dtypes(include='object').columns

for col in categorical_cols:
    print(col, ":", df[col].nunique())


### 📌 Why This Matters

- Detect inconsistent categories (e.g., 'Male' vs 'male')
- Identify high-cardinality columns
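
Inconsistent labels can be spotted by comparing the unique count before and after normalizing case and whitespace. The sketch below assumes a `Gender` column, a hypothetical name suggested by the 'Male' vs 'male' example, and uses invented values.

```python
import pandas as pd

# Invented values with case and whitespace variants
gender = pd.Series(["Male", "male", " Female", "female ", "MALE"], name="Gender")

raw_unique = gender.nunique()

# Strip surrounding whitespace and lowercase before counting again
normalized = gender.str.strip().str.lower()
clean_unique = normalized.nunique()

# A drop in the count means the column contains inconsistent labels
has_inconsistent_labels = clean_unique < raw_unique

# Ratio of unique values to rows; values near 1 suggest high cardinality
cardinality_ratio = raw_unique / len(gender)

print(raw_unique, clean_unique, has_inconsistent_labels)
```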

### 1️⃣1️⃣ Initial Visual Scan (Quick EDA)

In [None]:
# Histograms of every numeric column; tight_layout avoids overlapping titles
df.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()


In [None]:
sns.boxplot(data=df[['Sales', 'Profit']])
plt.show()


### 1️⃣2️⃣ Business Questions Definition
### Key Questions for EDA
1. What factors influence Sales and Profit?
2. How do discounts affect profitability?
3. Which product categories and regions perform best?
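
As a first pass at questions 2 and 3, a correlation and a grouped aggregate are often enough to decide where to dig deeper. The sketch below runs on invented rows. `Discount` is an assumed column name, and a correlation on real data shows association only, not cause.

```python
import pandas as pd

# Invented rows; Region, Sales and Profit are named in the notebook,
# Discount is an assumed column name
toy = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Discount": [0.0, 0.2, 0.0, 0.4],
    "Sales": [100.0, 200.0, 150.0, 300.0],
    "Profit": [30.0, 20.0, 45.0, -15.0],
})

# Question 2: do higher discounts go together with lower profit?
discount_profit_corr = toy["Discount"].corr(toy["Profit"])

# Question 3: which regions contribute the most profit?
profit_by_region = (
    toy.groupby("Region")["Profit"].sum().sort_values(ascending=False)
)

print(discount_profit_corr)
print(profit_by_region)
```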

### 1️⃣3️⃣ EDA Roadmap (Final Cell)
### Next Steps
1. Handle missing values & outliers
2. Perform univariate analysis
3. Explore relationships (bivariate analysis)
4. Generate business insights & recommendations