# Exploratory Data Analysis (EDA): Financial Sample Dataset

This notebook performs a comprehensive exploratory data analysis on **Financial_Sample_Prepared.csv** using **Pandas, Matplotlib, and Seaborn**.  
Deliverables include this Jupyter Notebook and an accompanying PDF report.

*Dataset preview and all derived visualisations are generated below.*

In [None]:
%pip install pandas matplotlib seaborn numpy

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid', context='notebook')  # set_theme is the current name; sns.set is an older alias
%matplotlib inline

# Load dataset
file_path = r'/mnt/data/Financial_Sample_Prepared.csv'
df = pd.read_csv(file_path)

# Basic inspection
df.head()


In [None]:
# Dataset info (df.info() prints its summary directly and returns None, so no print() wrapper is needed)
df.info()

# Missing values summary
missing = df.isnull().sum()
print("\nMissing values by column:\n", missing)

**Observation:** No missing values detected; data types are appropriate.
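
The completeness check can be made slightly more informative by reporting dtypes, missing counts, and duplicate rows together. The following is a minimal sketch, not part of the original notebook: `quality_report` is a hypothetical helper and `toy` is a made-up frame used only so the example runs on its own. In the notebook it would be called as `quality_report(df)`.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column dtype, missing count, and missing percentage."""
    report = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(2),
    })
    # Columns with the most missing values come first
    return report.sort_values("n_missing", ascending=False)

# Toy data for illustration only (not taken from Financial_Sample_Prepared.csv)
toy = pd.DataFrame({
    "Segment": ["Government", "Midmarket", None, "Government"],
    "Profit": [100.0, 250.0, 80.0, 100.0],
})

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())  # fully identical rows after the first occurrence

print(report)
print("Duplicate rows:", n_duplicates)
```

Duplicate rows are not necessarily errors in transactional data, so a non-zero count is a prompt to inspect rather than a reason to drop rows.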

In [None]:
df.describe().T

**Observation:** `Profit` and `Units Sold` span wide ranges, which suggests their distributions may be skewed.
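
The suspicion of skewness can be checked numerically with `DataFrame.skew()`, which returns the sample skewness per column (values well above 0 indicate a long right tail). This is a minimal sketch on made-up numbers; `skew_table` is a hypothetical helper, and in the notebook it would be applied to `df` with the real column names.

```python
import pandas as pd

def skew_table(frame: pd.DataFrame, cols: list) -> pd.Series:
    """Hypothetical helper: sample skewness per column, most right-skewed first."""
    return frame[cols].skew().sort_values(ascending=False)

# Toy data for illustration only: 'Units Sold' is symmetric, 'Profit' has one large value
toy = pd.DataFrame({
    "Units Sold": [1, 2, 3, 4, 5],
    "Profit": [1, 1, 2, 3, 50],
})

skews = skew_table(toy, ["Units Sold", "Profit"])
print(skews)
```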

In [None]:
categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
    display(col, df[col].value_counts().head())

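**Observation:** `Segment` appears evenly distributed, while some countries dominate. Because `value_counts()` counts rows, this reflects record frequency; sales volume per country would need a grouped sum.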
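
To compare groups by sales volume rather than by row count, totals can be aggregated per group and expressed as a share. The sketch below is an illustration under assumptions: `share_by_group` is a hypothetical helper, the column names `Country` and `Sales` are assumed (the notebook refers to the sales column as `' Sales'`, with a leading space), and the data is made up.

```python
import pandas as pd

def share_by_group(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Hypothetical helper: total and percentage share of value_col per group."""
    totals = frame.groupby(group_col)[value_col].sum().sort_values(ascending=False)
    out = totals.to_frame("total")
    out["share_pct"] = (totals / totals.sum() * 100).round(1)
    return out

# Toy data for illustration only (column names are assumptions)
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "France", "Mexico"],
    "Sales": [300.0, 300.0, 300.0, 100.0],
})

shares = share_by_group(toy, "Country", "Sales")
print(shares)
```

In the notebook, the equivalent call would pass `df` and the actual column names, for example `share_by_group(df, 'Country', ' Sales')` if those columns exist as written.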

In [None]:
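# Parse dates; errors='coerce' turns unparseable values into NaT instead of raising
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
# Derive calendar features for time-based grouping
df['Month'] = df['Date'].dt.month
df['Quarter'] = df['Date'].dt.quarter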
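
Because `errors='coerce'` silently converts unparseable dates to `NaT`, it can introduce missing values after the earlier completeness check. A minimal sketch of how to count such failures, using a made-up series rather than the dataset's `Date` column:

```python
import pandas as pd

# Toy input for illustration only: the last entry cannot be parsed as a date
raw = pd.Series(["2014-01-01", "2014-06-01", "not a date"])
parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present before parsing but are NaT afterwards
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print("Unparseable dates:", n_failed)

# Derived features are NaN wherever the date is NaT
quarters = parsed.dt.quarter
print(quarters)
```

If `n_failed` is non-zero on the real data, the `Month` and `Quarter` columns will contain missing values for those rows.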

In [None]:
plt.figure(figsize=(10,8))
corr = df[['Units Sold', 'Manufacturing Price', 'Sale Price', 'Gross Sales', 'Discounts', ' Sales', 'COGS', 'Profit', 'Month Number', 'Year']].corr()
sns.heatmap(corr, annot=True, fmt='.2f', linewidths=.5)
plt.title('Correlation Matrix');

**Observation:** `Gross Sales`, `Sales`, and `COGS` are strongly positively correlated with one another, and `Profit` is also highly correlated with these metrics.
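
Reading the strongest relationships off a heatmap can be error-prone, so it may help to list the top correlated pairs explicitly. The following sketch keeps only the upper triangle of the correlation matrix (so each pair appears once and the diagonal is excluded) and sorts by absolute value. `top_correlations` is a hypothetical helper and the data is made up.

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, k: int = 5) -> pd.Series:
    """Hypothetical helper: the k column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr(numeric_only=True)
    # Boolean mask selecting entries strictly above the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(k)

# Toy data for illustration only: COGS is an exact multiple of Gross Sales
toy = pd.DataFrame({
    "Gross Sales": [100.0, 200.0, 300.0, 400.0],
    "COGS": [80.0, 160.0, 240.0, 320.0],
    "Discounts": [5.0, 1.0, 4.0, 2.0],
})

top = top_correlations(toy, k=3)
print(top)
```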

In [None]:
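# Sample at most 200 rows (fixed seed for reproducibility) to keep the pairplot fast
sns.pairplot(df.sample(n=min(200, len(df)), random_state=42), vars=['Units Sold','Gross Sales','COGS','Profit'], hue='Segment')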

**Observation:** Higher `Units Sold` generally drives higher `Gross Sales`, but `Profit` varies by `Segment`.

In [None]:
df[['Units Sold', 'Manufacturing Price', 'Sale Price', 'Gross Sales', 'Discounts', ' Sales', 'COGS', 'Profit', 'Month Number', 'Year']].hist(bins=20, figsize=(15,10))
plt.tight_layout()

**Observation:** Distributions for the monetary columns are right-skewed, indicating that most sales values cluster at the lower end of the range.

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='Segment', y='Profit', data=df)
plt.title('Profit Distribution by Segment');

**Observation:** The Government segment shows wider profit variability than the Midmarket segment.
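
The visual impression from the boxplot can be backed with per-segment spread statistics. This is a minimal sketch assuming the same `Segment` and `Profit` column names; `spread_by_group` is a hypothetical helper and the numbers are made up for illustration.

```python
import pandas as pd

def spread_by_group(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Hypothetical helper: median, standard deviation, and IQR of value_col per group."""
    grouped = frame.groupby(group_col)[value_col]
    iqr = grouped.quantile(0.75) - grouped.quantile(0.25)
    out = pd.DataFrame({
        "median": grouped.median(),
        "std": grouped.std(),
        "iqr": iqr,
    })
    # Widest spread first
    return out.sort_values("iqr", ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    "Segment": ["Government"] * 5 + ["Midmarket"] * 5,
    "Profit": [-100.0, 0.0, 100.0, 200.0, 300.0, 40.0, 45.0, 50.0, 55.0, 60.0],
})

spread = spread_by_group(toy, "Segment", "Profit")
print(spread)
```

The IQR is less sensitive to outliers than the standard deviation, which matters for right-skewed monetary columns.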

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='Units Sold', y='Profit', hue='Segment', data=df)
plt.title('Profit vs Units Sold');

**Observation:** Profit grows with units sold, but the gains appear to diminish beyond roughly 4,000 units.
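
One way to test the diminishing-returns reading is to bin `Units Sold` and compare profit per unit across bins. The sketch below is illustrative only: `profit_by_units_bin` is a hypothetical helper, the bin edges are chosen around the ~4000-unit mark mentioned above, and the data is made up.

```python
import pandas as pd

def profit_by_units_bin(frame: pd.DataFrame, edges: list) -> pd.DataFrame:
    """Hypothetical helper: row count, mean profit, and profit per unit within each units bin."""
    bins = pd.cut(frame["Units Sold"], bins=edges).rename("units_bin")
    out = frame.groupby(bins, observed=True).agg(
        rows=("Profit", "size"),
        mean_profit=("Profit", "mean"),
        units=("Units Sold", "sum"),
        profit=("Profit", "sum"),
    )
    out["profit_per_unit"] = out["profit"] / out["units"]
    return out

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Units Sold": [1000, 2000, 3000, 5000, 6000],
    "Profit": [10000.0, 20000.0, 30000.0, 40000.0, 42000.0],
})

binned = profit_by_units_bin(toy, edges=[0, 4000, 8000])
print(binned)
```

A lower `profit_per_unit` in the higher bin would be consistent with diminishing returns, though segment mix and discounting could also explain the pattern.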


## Summary of Findings
1. **Sales Drivers**: Units sold are strongly associated with Gross Sales and, in turn, with Profit.  
2. **Segment Insights**: Government and Midmarket segments display distinct profit patterns.  
3. **Geographical Performance**: A handful of countries dominate sales volume; further drill-down could pinpoint high-value regions.  
4. **Skewed Distributions**: Financial metrics are right-skewed, suggesting a small number of high-value transactions.  
5. **No Missing Data**: The dataset is complete, facilitating straightforward modeling.

*End of analysis.*
