# Supermarket Sales Analysis
## Advanced Python Project (First semester 2025/2026)

**Objective:** Perform an in-depth analysis of the Supermarket Sales Dataset using Pandas, NumPy, Seaborn, and Matplotlib.

---

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

## 2. Load Dataset

In [3]:
try:
    df = pd.read_csv('SuperMarket Analysis.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError as e:
    # Report the problem, then stop: every later cell depends on df.
    print(f"Error loading data: {e}")
    raise

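A load step can also confirm that the columns used later in the notebook are present before any analysis runs. The following is a minimal sketch rather than the notebook's own method: `REQUIRED_COLUMNS` and `load_sales` are hypothetical names, and the inline CSV text is made-up data standing in for the real file.

```python
import io
import pandas as pd

# Columns referenced later in this notebook (hypothetical constant name).
REQUIRED_COLUMNS = ['Branch', 'Customer type', 'Gender', 'Product line',
                    'Sales', 'Date', 'Time', 'Payment', 'Rating']


def load_sales(source):
    """Read a CSV path or file-like object and fail early on missing columns."""
    frame = pd.read_csv(source)
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame


# Made-up two-row sample standing in for 'SuperMarket Analysis.csv'.
sample_csv = io.StringIO(
    "Branch,Customer type,Gender,Product line,Sales,Date,Time,Payment,Rating\n"
    "A,Member,Female,Health and beauty,100.0,1/5/2019,1:08:00 PM,Ewallet,9.1\n"
    "B,Normal,Male,Sports and travel,50.0,3/8/2019,10:29:00 AM,Cash,7.4\n"
)
sample = load_sales(sample_csv)
print(sample.shape)
```

Passing the real file path to `load_sales` instead of the in-memory sample would apply the same check to the actual dataset.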


## 3. Data Inspection

In [None]:

df.head()

In [None]:

df.info()

In [None]:

df.describe()

In [None]:

df.isnull().sum()
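Raw null counts are easier to judge next to the share of rows they represent. The sketch below shows one way to build such a report; the small `sample` frame is made-up data, and `missing_report` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up sample with two gaps, standing in for the real dataset.
sample = pd.DataFrame({
    'Sales': [100.0, np.nan, 50.0, 75.0],
    'Rating': [9.1, 7.4, np.nan, 8.0],
    'Payment': ['Cash', 'Ewallet', 'Cash', 'Credit card'],
})

# Count and percentage of missing values per column, worst first.
missing_report = pd.DataFrame({
    'missing': sample.isnull().sum(),
    'percent': sample.isnull().mean() * 100,
}).sort_values('missing', ascending=False)
print(missing_report)
```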

## 4. Data Cleaning
### 4.1 Convert Date to Datetime

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df.dtypes

### 4.2 Check for Duplicates & Inconsistencies

In [None]:

duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
if duplicates > 0:
    df.drop_duplicates(inplace=True)
    print("Duplicates removed.")
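The cell above covers duplicates. Inconsistent category labels (stray spaces, mixed capitalisation) are a separate problem that `duplicated()` does not catch. The following is a minimal sketch of one way to detect and normalise them; the messy `Payment` values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up labels with deliberate whitespace and case differences.
sample = pd.DataFrame({
    'Payment': ['Cash', 'cash ', 'Ewallet', ' Ewallet', 'Credit card'],
})

before = sample['Payment'].nunique()

# Trim surrounding whitespace and use one capitalisation style.
sample['Payment'] = sample['Payment'].str.strip().str.capitalize()

after = sample['Payment'].nunique()
print(f"Distinct labels before: {before}, after: {after}")
```

A drop in the number of distinct labels after normalising indicates that the column contained inconsistent spellings of the same category.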

### 4.3 Feature Engineering
Create an 'Hour' column from 'Time' to support the time-of-day analysis in Q4.

In [1]:

df['Hour'] = pd.to_datetime(df['Time']).dt.hour
print("Hour column created.")
df[['Time', 'Hour']].head()

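Passing an explicit `format` to `pd.to_datetime` avoids relying on format inference when extracting the hour. The sketch below assumes the times are stored as 12-hour strings such as `1:08:00 PM`. That layout is an assumption about the file, and the three values are made up. If the column uses 24-hour `HH:MM` strings instead, `'%H:%M'` would be the matching format.

```python
import pandas as pd

# Made-up times; the 12-hour 'h:mm:ss AM/PM' layout is an assumption.
times = pd.Series(['1:08:00 PM', '10:29:00 AM', '8:37:00 PM'])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(times, format='%I:%M:%S %p', errors='coerce')
hours = parsed.dt.hour
print(hours.tolist())
```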

## 5. Exploratory Data Analysis (EDA)
### 5.1 Basic Statistics & Distributions (Univariate)

In [4]:
def analyze_categorical(col_name):
    print(f"--- Analysis for {col_name} ---")
    print(df[col_name].value_counts())
    
    plt.figure(figsize=(8, 5))
    sns.countplot(x=col_name, data=df, palette='viridis')
    plt.title(f'Distribution of {col_name}')
    plt.xticks(rotation=45)
    plt.show()


categories = ['Branch', 'Customer type', 'Gender', 'Payment', 'Product line']
for cat in categories:
    analyze_categorical(cat)

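Counts per category can be complemented with percentage shares, which are easier to compare across columns. A minimal sketch on made-up `Branch` values follows; `summary` is a hypothetical name.

```python
import pandas as pd

# Made-up branch labels standing in for the real column.
sample = pd.DataFrame({'Branch': ['A', 'A', 'B', 'C', 'A', 'B', 'C', 'A']})

counts = sample['Branch'].value_counts()
# normalize=True returns fractions; scale to percentages.
shares = sample['Branch'].value_counts(normalize=True).mul(100).round(1)

summary = pd.DataFrame({'count': counts, 'percent': shares})
print(summary)
```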

### 5.2 Time Series Analysis (Sales & Ratings)
Analyze how average daily sales and ratings change over the recorded period.

In [None]:

daily_sales = df.groupby('Date')[['Sales', 'Rating']].mean()

plt.figure(figsize=(12, 6))
sns.lineplot(data=daily_sales, x=daily_sales.index, y='Sales', label='Average Daily Sales', color='blue')
plt.title('Sales Trend Over Time')
plt.ylabel('Average Sales')
plt.xlabel('Date')
plt.legend()
plt.show()

plt.figure(figsize=(12, 6))
sns.lineplot(data=daily_sales, x=daily_sales.index, y='Rating', label='Average Daily Rating', color='orange')
plt.title('Rating Trend Over Time')
plt.ylabel('Average Rating')
plt.xlabel('Date')
plt.legend()
plt.show()
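Daily averages tend to be noisy, so a rolling mean can make the underlying trend easier to read. The sketch below uses a 7-day window on made-up daily values. The window length is an illustrative choice, not something the notebook specifies.

```python
import pandas as pd

# Made-up daily averages for ten consecutive days.
dates = pd.date_range('2019-01-01', periods=10, freq='D')
daily = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                  index=dates, name='Sales', dtype=float)

# min_periods=1 keeps the first days instead of returning NaN for them.
smoothed = daily.rolling(window=7, min_periods=1).mean()
print(smoothed)
```

The same call could be applied to `daily_sales['Sales']` before plotting, with the smoothed series drawn on top of the raw one.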

### 5.3 Relationships Analysis

In [None]:

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Sales', y='Rating', data=df, alpha=0.6)
plt.title('Relationship between Sales and Rating')
plt.show()

In [None]:

plt.figure(figsize=(10, 8))
numeric_df = df.select_dtypes(include=[np.number])
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

---
## 6. Answering Key Business Questions

### Q1: Which Branch has the highest sales?

In [None]:
branch_sales = df.groupby('Branch')['Sales'].sum().sort_values(ascending=False)
print("Total Sales by Branch:")
print(branch_sales)

plt.figure(figsize=(8, 5))
sns.barplot(x=branch_sales.index, y=branch_sales.values, palette='Blues_r')
plt.title('Total Sales by Branch')
plt.ylabel('Total Sales ($)')
plt.show()
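Total sales alone do not show whether a branch leads through many small transactions or a few large ones. The following sketch aggregates several statistics per branch in one pass; the figures are made up and `branch_summary` is a hypothetical name.

```python
import pandas as pd

# Made-up transactions for three branches.
sample = pd.DataFrame({
    'Branch': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Sales': [100.0, 200.0, 50.0, 50.0, 50.0, 400.0],
})

# Named aggregation: one output column per statistic.
branch_summary = (
    sample.groupby('Branch')['Sales']
    .agg(total='sum', average='mean', transactions='count')
    .sort_values('total', ascending=False)
)
print(branch_summary)
```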

### Q2: Which payment method is the most popular?

In [None]:
payment_counts = df['Payment'].value_counts()
print("Payment Method Popularity:")
print(payment_counts)

plt.figure(figsize=(8, 5))
sns.barplot(x=payment_counts.index, y=payment_counts.values, palette='magma')
plt.title('Most Popular Payment Methods')
plt.ylabel('Count')
plt.show()
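Overall popularity can hide differences between branches. As a sketch, a row-normalised cross-tabulation shows the payment mix within each branch; the rows below are made up and `payment_mix` is a hypothetical name.

```python
import pandas as pd

# Made-up transactions for two branches.
sample = pd.DataFrame({
    'Branch': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Payment': ['Cash', 'Cash', 'Ewallet', 'Credit card', 'Ewallet', 'Ewallet'],
})

# normalize='index' makes each branch's row sum to 1; scale to percentages.
payment_mix = pd.crosstab(
    sample['Branch'], sample['Payment'], normalize='index'
).mul(100)
print(payment_mix)
```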

### Q3: Is there a relationship between Sales and Rating?

In [None]:
cols = ['Sales', 'Rating']
corr = df[cols].corr().iloc[0, 1]
print(f"Correlation between Sales and Rating: {corr:.4f}")

if abs(corr) < 0.1:
    print("Conclusion: There is almost no linear relationship between Sales and Rating.")
else:
    direction = "positive" if corr > 0 else "negative"
    print(f"Conclusion: There is a {direction} linear relationship (r = {corr:.2f}).")
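The Pearson coefficient used above only measures linear association. A rank-based (Spearman) coefficient also picks up monotonic but non-linear relationships, and it can be computed with pandas alone by correlating the ranks. The sketch below uses deliberately artificial values in which `Rating` is a cubic function of `Sales`, purely to show how the two measures differ.

```python
import pandas as pd

# Artificial data: perfectly monotonic but not linear.
sample = pd.DataFrame({'Sales': [1.0, 2.0, 3.0, 4.0, 5.0]})
sample['Rating'] = sample['Sales'] ** 3

# Pearson: linear association on the raw values.
pearson = sample['Sales'].corr(sample['Rating'])

# Spearman: Pearson correlation computed on the ranks.
spearman = sample['Sales'].rank().corr(sample['Rating'].rank())

print(f"Pearson: {pearson:.4f}, Spearman: {spearman:.4f}")
```

If both coefficients are close to zero on the real data, the "almost no relationship" conclusion holds for monotonic patterns as well as linear ones.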

### Q4: At what time do people buy the most?

In [None]:
hourly_sales = df.groupby('Hour')['Sales'].sum()
hourly_counts = df['Hour'].value_counts().sort_index()

plt.figure(figsize=(10, 5))
sns.lineplot(x=hourly_sales.index, y=hourly_sales.values, marker='o', label='Total Sales Amount')
plt.title('Total Sales by Hour of Day')
plt.xlabel('Hour (24h)')
plt.ylabel('Total Sales')
plt.grid(True)
plt.xticks(hourly_sales.index)  # label only the hours that appear in the data
plt.show()

print("Peak Sales Hour (by total sales):", hourly_sales.idxmax())
print("Busiest Hour (by number of transactions):", hourly_counts.idxmax())
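Peak hours may differ by day of the week. The following sketch builds a weekday-by-hour table of total sales that could be passed to `sns.heatmap`. The four rows are made up, and the `Weekday` column is a hypothetical addition that is not part of the notebook.

```python
import pandas as pd

# Made-up transactions on a Saturday and a Monday.
sample = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-05', '2019-01-05',
                            '2019-01-07', '2019-01-07']),
    'Hour': [13, 13, 10, 19],
    'Sales': [100.0, 50.0, 80.0, 20.0],
})

# Derive the weekday name from the parsed date.
sample['Weekday'] = sample['Date'].dt.day_name()

# Total sales per weekday and hour; empty combinations become 0.
heat = sample.pivot_table(index='Weekday', columns='Hour',
                          values='Sales', aggfunc='sum', fill_value=0)
print(heat)
```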

### Q5: Which product line performs the best?

In [None]:
product_sales = df.groupby('Product line')['Sales'].sum().sort_values(ascending=False)
print("Best Performing Product Lines:")
print(product_sales)

plt.figure(figsize=(10, 6))
sns.barplot(y=product_sales.index, x=product_sales.values, palette='viridis')
plt.title('Total Sales by Product Line')
plt.xlabel('Total Sales ($)')
plt.show()
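"Best" depends on the metric: a product line can lead on total sales while trailing on customer rating. The sketch below compares several metrics side by side; the numbers are made up and `product_summary` is a hypothetical name.

```python
import pandas as pd

# Made-up transactions for two product lines.
sample = pd.DataFrame({
    'Product line': ['Food and beverages', 'Food and beverages',
                     'Health and beauty', 'Health and beauty'],
    'Sales': [300.0, 100.0, 250.0, 50.0],
    'Rating': [6.0, 8.0, 9.0, 9.5],
})

# One row per product line with revenue and satisfaction metrics.
product_summary = (
    sample.groupby('Product line')
    .agg(total_sales=('Sales', 'sum'),
         avg_sale=('Sales', 'mean'),
         avg_rating=('Rating', 'mean'))
    .sort_values('total_sales', ascending=False)
)
print(product_summary)
```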