# Project 1: Exploratory Data Analysis (EDA) on Global Superstore Dataset

**Objective:**  
Clean and analyze the Global Superstore dataset to identify trends, patterns, anomalies, and relationships between Sales, Profit, Region, and Product Categories.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Step 1: Load the Dataset

In [None]:
# Load the dataset
df = pd.read_csv("Project1_Global_Superstore.csv")  # <-- Make sure file name is correct

## Step 2: Check Basic Information
- Number of rows and columns
- Data types
- Missing values
- Summary statistics

In [None]:
# Shape of data
print("Shape:", df.shape)

# Info about columns
df.info()

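# Summary statistics (printed explicitly, since a cell displays only its last expression)
print(df.describe())

# Missing values per column
print(df.isnull().sum())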


## Step 3: Clean the Data
- Remove duplicate rows
- Fill missing values in numeric columns with the column median (non-numeric columns are left unchanged)

In [None]:
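# Drop rows that are exact duplicates of an earlier row
df.drop_duplicates(inplace=True)
# Fill missing numeric values with each column's median
df.fillna(df.median(numeric_only=True), inplace=True)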
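
To see what the cleaning step does and does not touch, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from the Superstore file. It suggests that the median fill only reaches numeric columns, so a missing `Region` would still be missing afterwards.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the real dataset)
toy = pd.DataFrame({
    "Sales": [100.0, 100.0, 250.0, np.nan, 400.0],
    "Profit": [10.0, 10.0, np.nan, 30.0, 50.0],
    "Region": ["East", "East", "West", "West", None],
})

# Rows 0 and 1 are identical, so one of them is dropped
deduped = toy.drop_duplicates()

# Median of each numeric column, computed while ignoring NaN
medians = deduped.median(numeric_only=True)

# fillna with a Series fills column by column, matching on column name;
# "Region" is not in `medians`, so its missing value is left alone
filled = deduped.fillna(medians)

print(filled)
print("Remaining missing values:")
print(filled.isna().sum())
```

If text columns in the real data also contain gaps, they would need a separate rule (for example a placeholder label), which this notebook does not apply.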

## Step 4: Remove Outliers (IQR Method)

In [None]:
# Select the numeric columns to screen for outliers
numeric_df = df[["Sales", "Profit", "Discount"]]

# Calculate Q1 (25%) and Q3 (75%) only for numbers
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1

# Drop rows where any of the three columns falls outside the 1.5 * IQR fences
outlier_mask = (numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))
df = df[~outlier_mask.any(axis=1)]
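
The IQR rule can be checked by hand on a small example. The sketch below uses made-up numbers (not Superstore values) and the same logic as the cell above: compute the fences per column, flag values outside them, and drop any row that has at least one flagged value.

```python
import pandas as pd

# Toy data (hypothetical): one obviously extreme Sales value
toy = pd.DataFrame({
    "Sales": [10, 12, 11, 13, 12, 500],
    "Profit": [1, 2, 1, 2, 1, 2],
})

q1 = toy.quantile(0.25)   # Sales: 11.25, Profit: 1.0
q3 = toy.quantile(0.75)   # Sales: 12.75, Profit: 2.0
iqr = q3 - q1             # Sales: 1.5,   Profit: 1.0

lower = q1 - 1.5 * iqr    # Sales: 9.0,   Profit: -0.5
upper = q3 + 1.5 * iqr    # Sales: 15.0,  Profit: 3.5

# True where a value lies outside its column's fences
is_outlier = (toy < lower) | (toy > upper)

# A row is removed if ANY of its columns is flagged
kept = toy[~is_outlier.any(axis=1)]

print(is_outlier.sum())   # flagged values per column
print(kept)
```

Because of `any(axis=1)`, a row is removed as soon as one column flags it, so it is worth comparing `df.shape` before and after the filter to see how many rows the rule actually drops.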


## Step 5: Save Cleaned Data

In [None]:
df.to_csv('Project1_global_superstore_clean.csv', index=False)

## Step 6: Visualizations
The plots below show the distribution of Sales, the spread of Profit, and the correlations between the numeric columns.

In [None]:
# Histogram of Sales
plt.figure(figsize=(8, 5))
sns.histplot(df['Sales'], kde=True)
plt.title('Sales Distribution')
plt.xlabel('Sales Amount')
plt.ylabel('Count')
plt.show()

In [None]:
# Boxplot of Profit (to check for outliers remaining after Step 4)
plt.figure(figsize=(8, 3))
sns.boxplot(x=df['Profit'])
plt.title('Profit - Outlier Detection')
plt.xlabel('Profit')
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Between Numerical Features')
plt.show()
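
The heatmap covers numeric columns only, so it says nothing about Region or Product Category. One possible way to bring those in is a grouped summary, sketched here on toy rows. The column names `Region` and `Category` are assumptions about the dataset, and the numbers are made up.

```python
import pandas as pd

# Toy rows (hypothetical values; "Region" and "Category" are assumed column names)
toy = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Category": ["Furniture", "Technology", "Furniture", "Technology"],
    "Sales": [100.0, 200.0, 150.0, 50.0],
    "Profit": [10.0, 40.0, -15.0, 5.0],
})

# Total Sales and Profit per region
by_region = toy.groupby("Region")[["Sales", "Profit"]].sum()

# Profit margin = total profit / total sales within each region
by_region["Margin"] = by_region["Profit"] / by_region["Sales"]

print(by_region)
```

On the real data, the same pattern with `df` in place of `toy` (and `"Category"` in place of `"Region"`) would give a per-group view of Sales and Profit, provided those columns exist under these names.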

## Summary
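- Cleaned the dataset by removing duplicates, filling missing numeric values with the median, and dropping IQR outliers in Sales, Profit, and Discount.
- Examined the distributions of Sales and Profit with a histogram and a boxplot.
- Inspected correlations between the numeric features with a heatmap.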

Next: We'll use this knowledge to analyze sales performance in **Project 2**.
