### Objective

- Identify, visualize, and treat missing values and outliers using business-driven logic.

### 1️⃣ Import Libraries & Load Data

In [None]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("retail_sales.csv")


### 2️⃣ Missing Values Detection
#### Count of Missing Values

In [None]:
df.isnull().sum()



#### Percentage of Missing Values

In [None]:
(df.isnull().mean() * 100).round(2)




### 📌 Analyst Interpretation

- Columns with less than 5% missing values can usually be imputed
- Columns with more than 30% missing values are candidates for dropping
- Columns in between need a case-by-case decision
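
The thresholds above can be turned into a quick triage table. The sketch below is one possible way to do it. The helper name `triage_missing`, the "review" and "keep" labels, and the toy DataFrame are assumptions made up for illustration; they are not part of `retail_sales.csv`.

```python
import numpy as np
import pandas as pd


def triage_missing(frame, impute_below=5.0, drop_above=30.0):
    """Label each column by its share of missing values (hypothetical helper)."""
    pct = frame.isnull().mean() * 100
    # Default label: columns between the two thresholds need a manual decision
    action = pd.Series("review", index=pct.index)
    action[pct < impute_below] = "impute"
    action[pct > drop_above] = "drop"
    # Columns with nothing missing need no treatment at all
    action[pct == 0] = "keep"
    return pd.DataFrame({"missing_pct": pct.round(2), "action": action})


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 31, 52, 47, 38, 29, 33, 45],
    "Region": ["N", "S", None, None, None, None, "E", "W", "N", "S"],
    "Sales": [100.0, 250.0, 80.0, 120.0, 95.0, 300.0, 110.0, 130.0, 90.0, 105.0],
})
report = triage_missing(toy)
print(report)
```

The cut-offs are parameters so they can be adjusted to whatever the business considers acceptable.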

### 3️⃣ Visualizing Missing Data


In [None]:

sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing Values Heatmap")
plt.show()




### 4️⃣ Handling Missing Values (Business Logic)


In [None]:

# Age → Median Imputation
# Assign the result back: chained fillna(..., inplace=True) is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())


In [None]:

# Region → Category Imputation
# Assign the result back instead of using inplace=True on a column selection
df['Region'] = df['Region'].fillna('Unknown')




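### 📌 Why This Works

- The median is robust to a skewed age distribution, so a few extreme ages do not shift the imputed value
- An explicit 'Unknown' category keeps customer records that would otherwise be dropped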
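
To make the first point concrete, the sketch below compares the mean and the median on a small right-skewed sample and then fills the gaps. The ages and regions are made-up toy values, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy right-skewed ages, made up for illustration; two values are missing
ages = pd.Series([22, 24, 25, 27, 28, 30, 31, 95, np.nan, np.nan])

mean_age = ages.mean()      # pulled upward by the single extreme value
median_age = ages.median()  # stays near the bulk of the data

filled_ages = ages.fillna(median_age)
print(f"mean={mean_age:.2f}, median={median_age:.2f}")

# Toy regions: filling with a label keeps every row
regions = pd.Series(["North", None, "South", None])
filled_regions = regions.fillna("Unknown")
print(filled_regions.value_counts())
```

With one extreme age in the sample, the mean lands well above most of the observed values, while the median stays among them.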



### 5️⃣ Validate Missing Values


In [None]:
df.isnull().sum()



### 6️⃣ Outlier Detection – Visual
#### Boxplot


In [None]:
sns.boxplot(x=df['Sales'])
plt.title("Sales Outliers")
plt.show()



### 7️⃣ Outlier Detection – IQR Method


In [None]:
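# Quartiles and interquartile range of Sales
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
IQR = Q3 - Q1

# Standard 1.5 * IQR fences
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Print explicitly: a bare expression in the middle of a cell is not displayed
print(lower_bound, upper_bound)

# Rows whose Sales fall outside the fences
outliers = df[(df['Sales'] < lower_bound) | (df['Sales'] > upper_bound)]
outliers.head()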
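
The same calculation can be wrapped in a small reusable function so it can be applied to other numeric columns. This is a minimal sketch; the helper name `iqr_bounds`, its `k` parameter, and the toy series are assumptions for illustration.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences for the IQR rule (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy sales figures, made up for illustration
sales = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 200])
low, high = iqr_bounds(sales)

# Values outside the fences are flagged as outliers
flagged = sales[(sales < low) | (sales > high)]
print(low, high)
print(flagged.tolist())
```

Raising `k` (for example to 3.0) widens the fences and flags only the most extreme points.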



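### 8️⃣ Outlier Treatment Options
#### Option 1: Remove Extreme Values


In [None]:
# Keep only rows at or below the upper fence (low-side values are left as-is)
df_clean = df[df['Sales'] <= upper_bound]


#### Option 2: Cap (Winsorization)


In [None]:
# Cap values above the upper fence at the fence itself
df['Sales'] = df['Sales'].clip(upper=upper_bound)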




### 📌 Business Decision

Here, extreme sales values are assumed to be data errors rather than genuine bulk orders, so the capped version of `Sales` is carried forward.
Capping keeps every row while limiting the pull of extreme values on summary statistics.
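
The trade-off between the two options can be checked on a small example. The sketch below uses made-up toy sales figures and compares row counts and mean sales for the original, removed, and capped versions.

```python
import pandas as pd

# Toy sales figures, made up for illustration
toy = pd.DataFrame({"Sales": [10, 12, 14, 16, 18, 20, 22, 24, 200]})

q1, q3 = toy["Sales"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)

# Option 1: drop rows above the upper fence
removed = toy[toy["Sales"] <= upper]
# Option 2: keep every row but cap values at the upper fence
capped = toy.assign(Sales=toy["Sales"].clip(upper=upper))

summary = pd.DataFrame(
    {
        "rows": [len(toy), len(removed), len(capped)],
        "mean_sales": [
            toy["Sales"].mean(),
            removed["Sales"].mean(),
            capped["Sales"].mean(),
        ],
    },
    index=["original", "removed", "capped"],
)
print(summary.round(2))
```

Capping keeps the row count unchanged, whereas removal shrinks the dataset; both pull the mean back toward the bulk of the data.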



### 9️⃣ Outlier Impact Comparison


In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12,4))

sns.boxplot(x=df_clean['Sales'], ax=ax[0])
ax[0].set_title("After Removing Outliers")

sns.boxplot(x=df['Sales'], ax=ax[1])
ax[1].set_title("After Capping Outliers")

plt.show()



### 🔟 Validate Final Dataset


In [None]:
df.describe()



### ✅ Final Outcome

- ✔ Missing values in `Age` and `Region` imputed
- ✔ High-side `Sales` outliers capped at the IQR upper bound
- ✔ Data ready for analysis
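
These outcomes can also be checked programmatically rather than by eye. The sketch below is one way to do so; the helper name `validate_clean`, its arguments, and the toy frame are assumptions for illustration.

```python
import pandas as pd


def validate_clean(frame, required_cols, value_col, upper):
    """Return a dict of pass/fail checks (hypothetical helper)."""
    return {
        # Every required column must be fully populated
        "no_missing": bool(frame[required_cols].notnull().all().all()),
        # No value may exceed the cap that was applied
        "within_cap": bool((frame[value_col] <= upper).all()),
    }


# Toy frame, made up for illustration
toy = pd.DataFrame({
    "Age": [25.0, 31.0, 40.0],
    "Region": ["North", "Unknown", "South"],
    "Sales": [100.0, 120.0, 150.0],
})
checks = validate_clean(toy, ["Age", "Region"], "Sales", upper=150.0)
print(checks)
```

On the real data, the same call could be made with `df` and the `upper_bound` computed in the IQR step.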