# Product Master Data Analysis (Annex 1)

## 1. Objectives
This notebook analyzes the product master data (`annex1.csv`) to understand the product portfolio structure, category distribution, and data quality.

**Key Goals:**
- Assess data quality (missing values, duplicates).
- Analyze product category distribution.
- Verify that category codes and category names map consistently.
- Provide strategic recommendations on product portfolio management.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
plt.style.use('ggplot')
sns.set_palette('viridis')

## 2. Data Ingestion & Engineering
We load the data and check for basic structural integrity.

In [1]:
# Load the dataset
file_path = 'annex1.csv'
try:
    df = pd.read_csv(file_path)
    print("Data loaded successfully.")
    print(f"Shape: {df.shape}")
except FileNotFoundError:
    print(f"Error: File {file_path} not found.")
    raise  # later cells depend on df, so stop instead of continuing without it

In [None]:
# Preview data
df.head()

In [None]:
# Check data types and missing values
df.info()
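
Because the later cells index specific columns by name, a quick structural check can catch a renamed or missing header early. The following is a minimal sketch, not part of the original notebook: the helper `missing_columns` and the `sample` frame are hypothetical, and the required column list is simply the set of names the later cells reference.

```python
import pandas as pd

# Columns the later cells rely on; adjust if annex1.csv uses different headers.
REQUIRED_COLUMNS = ["Item Code", "Category Code", "Category Name"]


def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]


# Hypothetical stand-in for df, used only to demonstrate the check.
sample = pd.DataFrame({"Item Code": [101, 102], "Category Name": ["A", "B"]})
print(missing_columns(sample))
```

In the notebook itself, calling `missing_columns(df)` right after loading would return an empty list when all three columns are present.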

### Data Cleaning
- **Missing Values**: Count nulls per column and report the columns that contain any.
- **Duplicates**: Check for duplicate Item Codes and drop repeats, keeping the first occurrence.

In [None]:
# Check for missing values
missing = df.isnull().sum()
print("Missing Values:\n", missing[missing > 0])

# Check for duplicates in primary key (Item Code)
duplicates = df['Item Code'].duplicated().sum()
print(f"\nDuplicate Item Codes: {duplicates}")

# Remove duplicates if any (keeping first)
if duplicates > 0:
    df = df.drop_duplicates(subset='Item Code', keep='first')
    print("Duplicates removed.")
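
Dropping duplicates with `keep='first'` silently discards the later rows, so it can help to look at them first. The sketch below uses hypothetical sample rows (not data from `annex1.csv`) to show one way to list every row in a duplicated group and flag Item Codes whose duplicate rows disagree on the category.

```python
import pandas as pd

# Hypothetical rows; real values would come from annex1.csv.
sample = pd.DataFrame({
    "Item Code": [101, 102, 102, 103],
    "Category Name": ["Leafy", "Root", "Fruit", "Root"],
})

# keep=False marks every row in a duplicated group, not just the repeats.
dupe_rows = sample[sample["Item Code"].duplicated(keep=False)]

# Codes whose duplicate rows disagree on the category need manual review.
conflicts = dupe_rows.groupby("Item Code")["Category Name"].nunique()
conflicting_codes = conflicts[conflicts > 1].index.tolist()
print(conflicting_codes)
```

If `conflicting_codes` is non-empty, keeping the first row is an arbitrary choice and the affected items are probably worth resolving by hand.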

## 3. Exploratory Data Analysis (EDA)

### 3.1 Category Distribution
We analyze how many products exist within each category.

In [None]:
# Count items per category
category_counts = df['Category Name'].value_counts().reset_index()
category_counts.columns = ['Category Name', 'Item Count']

# Plotting
plt.figure(figsize=(10, 6))
sns.barplot(data=category_counts, x='Item Count', y='Category Name')
plt.title('Number of Items per Category')
plt.xlabel('Count')
plt.ylabel('Category')
plt.tight_layout()
plt.show()

### 3.2 Category Code Analysis
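We check whether Category Code and Category Name map one-to-one, which requires testing both directions.

In [None]:
# Analyze mapping consistency in both directions
codes_per_name = df.groupby('Category Name')['Category Code'].nunique()
names_per_code = df.groupby('Category Code')['Category Name'].nunique()
print("Category names with multiple codes:")
print(codes_per_name[codes_per_name > 1])
print("\nCategory codes with multiple names:")
print(names_per_code[names_per_code > 1])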

## 4. Strategic Insights

### Observations:
1.  **Portfolio Breadth**: The bar chart in section 3.1 shows which categories dominate the SKU count. A high concentration in a few categories (e.g., 'Vegetables') may indicate specialization.
2.  **Data Integrity**: If the cleaning step reported duplicate Item Codes, data governance processes need to be tightened, since Item Code is treated as the primary key.
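
The Portfolio Breadth observation can be made quantitative by looking at each category's share of items rather than raw counts. This is a rough sketch on hypothetical labels; the `SMALL_SHARE` cutoff of 15% is an illustrative assumption, not a value taken from the analysis.

```python
import pandas as pd

# Hypothetical category labels standing in for df['Category Name'].
names = pd.Series(["Veg"] * 6 + ["Fruit"] * 3 + ["Herbs"] * 1, name="Category Name")

# normalize=True returns each category's fraction of all items, sorted descending.
share = names.value_counts(normalize=True)
cumulative = share.cumsum()

# Illustrative cutoff: categories holding under 15% of items.
SMALL_SHARE = 0.15
small_categories = share[share < SMALL_SHARE].index.tolist()

summary = pd.DataFrame({"share": share, "cumulative_share": cumulative})
print(summary)
print(small_categories)
```

The cumulative column shows how few categories account for most of the portfolio, and the small-category list gives a starting point for a consolidation review.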

### Recommendations:
-   **Portfolio Optimization**: Review categories with very few items to see if they should be consolidated or expanded.
-   **Master Data Management**: Ensure 'Item Code' remains unique and 'Category Code' is consistently mapped.
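
The Master Data Management recommendation could be enforced with a small recurring check. The sketch below is one possible shape for it: `validate_master` and the sample frame are hypothetical, and the rules are just the two constraints named above (unique Item Code, consistent code-to-name mapping).

```python
import pandas as pd


def validate_master(frame: pd.DataFrame) -> list:
    """Return a list of human-readable rule violations (empty if clean)."""
    problems = []
    if frame["Item Code"].duplicated().any():
        problems.append("Item Code is not unique")
    # Each code should map to one name, and each name to one code.
    if (frame.groupby("Category Code")["Category Name"].nunique() > 1).any():
        problems.append("a Category Code maps to several names")
    if (frame.groupby("Category Name")["Category Code"].nunique() > 1).any():
        problems.append("a Category Name maps to several codes")
    return problems


# Hypothetical data: code 10 is attached to two different names.
sample = pd.DataFrame({
    "Item Code": [1, 2, 3],
    "Category Code": [10, 10, 20],
    "Category Name": ["Root", "Leafy", "Fruit"],
})
print(validate_master(sample))
```

Returning a list of messages rather than raising on the first failure lets a single run report every broken rule at once.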