## Step 1: Hello, Data!
Load the raw CSV and display the first 3 rows.

In [None]:
import pandas as pd
df = pd.read_csv('data/1000SalesRecords.csv')
df.head(3)

## Step 2: Pick the Right Container
Justify dict vs namedtuple vs sets

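We will use a **dict** for each record because it maps field names to values and can be modified in place during transformations. A namedtuple is immutable, so every change would require building a new instance, and a set stores only unique, unordered values with no field names attached.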
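
To make the trade-off concrete, the sketch below builds the same record in all three containers. The field values are made up for illustration and do not come from the CSV, and `SaleRecord` is a hypothetical name.

```python
from collections import namedtuple

# Illustrative values only; not taken from the dataset.
record_dict = {'Region': 'Europe', 'Country': 'France', 'Revenue': 1200.0}

# dict: fields can be added or changed in place during transformations.
record_dict['discount'] = 0.10

# namedtuple: fields are read by name, but the record cannot be modified.
SaleRecord = namedtuple('SaleRecord', ['Region', 'Country', 'Revenue'])
record_tuple = SaleRecord('Europe', 'France', 1200.0)
try:
    record_tuple.Revenue = 1500.0
    tuple_is_mutable = True
except AttributeError:
    tuple_is_mutable = False

# _replace returns a new namedtuple and leaves the original untouched.
updated_tuple = record_tuple._replace(Revenue=1500.0)

# set: keeps the unique values but loses the field names.
record_set = {'Europe', 'France', 1200.0}

print(record_dict)
print(tuple_is_mutable, record_tuple.Revenue, updated_tuple.Revenue)
print('France' in record_set)
```

The set can still answer "is this value present?", but it cannot say which field the value belonged to, which is why it does not suit row-like records.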

## Step 3: Implement Functions and Data Structure

In [None]:
def build_sales_dict(row):
    return {
        'Region': row['Region'],
        'Country': row['Country'],
        'Item': row['Item Type'],
        'Revenue': row['Total Revenue']
    }

sales_dicts = df.head(5).apply(build_sales_dict, axis=1).tolist()
sales_dicts
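
Because `build_sales_dict` only uses key lookup, it can also be checked against a plain dict without loading the CSV. The sketch below repeats the function so it runs on its own; `sample_row` and its values are hypothetical.

```python
def build_sales_dict(row):
    """Same function as in the cell above, repeated so this sketch is self-contained."""
    return {
        'Region': row['Region'],
        'Country': row['Country'],
        'Item': row['Item Type'],
        'Revenue': row['Total Revenue']
    }

# Hypothetical row; the values are illustrative, not from the dataset.
sample_row = {
    'Region': 'Asia',
    'Country': 'Japan',
    'Item Type': 'Snacks',
    'Sales Channel': 'Online',
    'Total Revenue': 2500.0,
}

result = build_sales_dict(sample_row)
print(result)
```

Columns that the function does not name, such as `Sales Channel`, are dropped from the result.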

## Step 4: Bulk Loaded
Convert the whole DataFrame to a list of dictionaries, one per row.

In [None]:
records = df.to_dict(orient='records')
records[:2]

## Step 5: Quick Profiling

In [None]:
print('Min Price:', df['Unit Price'].min())
print('Mean Price:', df['Unit Price'].mean())
print('Max Price:', df['Unit Price'].max())
print('Unique countries:', df['Country'].nunique())

## Step 6: Spot the Grime

Examples of dirty data:
- Missing values in columns
- Inconsistent date formats
- Duplicate rows
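
The three kinds of grime listed above can be counted with a few small functions. This is a plain-Python sketch over a list of record dicts rather than pandas code; the records are invented with one problem planted per row, and the `%m/%d/%Y` date format is an assumption about how the CSV stores dates.

```python
from datetime import datetime

# Hypothetical records with deliberately planted problems.
records = [
    {'Country': 'France', 'Order Date': '5/28/2010', 'Units Sold': 10},
    {'Country': 'France', 'Order Date': '5/28/2010', 'Units Sold': 10},  # duplicate row
    {'Country': 'Japan', 'Order Date': '2011-08-22', 'Units Sold': 4},   # different date format
    {'Country': None, 'Order Date': '11/7/2012', 'Units Sold': 7},       # missing value
]

def count_missing(rows):
    """Count rows that contain at least one None or empty-string value."""
    return sum(1 for row in rows if any(v is None or v == '' for v in row.values()))

def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates

def bad_dates(rows, column='Order Date', fmt='%m/%d/%Y'):
    """Return the values in `column` that do not parse with the assumed format."""
    bad = []
    for row in rows:
        try:
            datetime.strptime(row[column], fmt)
        except (ValueError, TypeError):
            bad.append(row[column])
    return bad

print('Rows with missing values:', count_missing(records))
print('Duplicate rows:', count_duplicates(records))
print('Unparseable dates:', bad_dates(records))
```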

## Step 7: Cleaning Rules

In [None]:
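def clean(df):
    """Drop exact duplicate rows, then drop rows with any missing value."""
    before = len(df)
    df = df.drop_duplicates()
    df = df.dropna()  # removes a row if ANY column is missing
    after = len(df)
    print('Before:', before, 'After:', after)
    return df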

df = clean(df)
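
Calling `dropna()` with no arguments removes a row when any column is missing, which can discard more data than intended. The sketch below is one possible variation, not the notebook's method: it adds a hypothetical `required` parameter that is passed to `dropna(subset=...)`, and it runs on a made-up toy DataFrame.

```python
import pandas as pd

def clean(df, required=None):
    """Drop exact duplicates, then rows missing any of the `required` columns.

    With required=None, a missing value in any column drops the row,
    which matches the behaviour of the cell above.
    """
    before = len(df)
    df = df.drop_duplicates().dropna(subset=required)
    print('Before:', before, 'After:', len(df))
    return df

# Toy data: one duplicate row, one missing count, one missing country.
toy = pd.DataFrame({
    'Country': ['France', 'France', 'Japan', None],
    'Units Sold': [10, 10, None, 7],
})

strict = clean(toy)                         # drops every row with a gap
lenient = clean(toy, required=['Country'])  # only insists on Country
```

The lenient call keeps the Japan row even though its `Units Sold` value is missing, so the choice of required columns should follow from what later steps actually need.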

## Step 8: Transformations

In [None]:
import numpy as np
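# High-priority orders ('H') get the DISC10 coupon; all others get none.
df['coupon_code'] = np.where(df['Order Priority'] == 'H', 'DISC10', 'NONE')
# Translate each coupon code into its numeric discount rate.
df['discount'] = df['coupon_code'].map({'DISC10': 0.10, 'NONE': 0.0})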
df[['Order Priority','coupon_code','discount']].head()
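
The same rule can be written for a single record dict, which makes the coupon logic easy to test in isolation. This is a plain-Python sketch; `DISCOUNT_BY_COUPON`, `add_coupon`, and the sample orders (including the `Order ID` field) are hypothetical names and values.

```python
# Lookup table from coupon code to discount rate.
DISCOUNT_BY_COUPON = {'DISC10': 0.10, 'NONE': 0.0}

def add_coupon(record):
    """Return a copy of the record with coupon_code and discount added."""
    coupon = 'DISC10' if record.get('Order Priority') == 'H' else 'NONE'
    return {**record, 'coupon_code': coupon, 'discount': DISCOUNT_BY_COUPON[coupon]}

# Hypothetical orders; only the priority matters for the rule.
orders = [
    {'Order ID': 1, 'Order Priority': 'H'},
    {'Order ID': 2, 'Order Priority': 'L'},
]

with_coupons = [add_coupon(order) for order in orders]
print(with_coupons)
```

Keeping the rates in one lookup table means a new coupon only needs a new entry rather than another branch.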

## Step 9: Feature Engineering

In [None]:
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Ship Date'] = pd.to_datetime(df['Ship Date'])
# Despite its name, this column is the shipping lag: days from order to shipment.
df['days_since_purchase'] = (df['Ship Date'] - df['Order Date']).dt.days
df[['Order Date','Ship Date','days_since_purchase']].head()
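
The date arithmetic behind this feature can be checked by hand on a single pair of dates. The sketch below uses only the standard library; `shipping_lag_days` is a hypothetical helper, the dates are invented, and the `%m/%d/%Y` format is an assumption about the CSV.

```python
from datetime import datetime

def shipping_lag_days(order_date, ship_date, fmt='%m/%d/%Y'):
    """Return whole days between two date strings in the assumed format."""
    ordered = datetime.strptime(order_date, fmt)
    shipped = datetime.strptime(ship_date, fmt)
    return (shipped - ordered).days

lag = shipping_lag_days('5/28/2010', '6/27/2010')
print(lag)  # 30

# A negative lag means the ship date precedes the order date,
# which is worth flagging as a data-quality problem.
suspicious = shipping_lag_days('6/27/2010', '5/28/2010') < 0
print(suspicious)  # True
```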

## Step 10: Mini-Aggregation

In [None]:
revenue_per_country = df.groupby('Country')['Total Revenue'].sum().to_dict()
revenue_per_country
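
For comparison, the same group-and-sum can be done over the list of record dicts from Step 4 without pandas. This is a minimal sketch using `collections.defaultdict`; the three records are invented and `revenue_by_country` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical records; real ones would come from df.to_dict(orient='records').
records = [
    {'Country': 'France', 'Total Revenue': 100.0},
    {'Country': 'Japan', 'Total Revenue': 250.0},
    {'Country': 'France', 'Total Revenue': 50.0},
]

def revenue_by_country(rows):
    """Sum Total Revenue per Country and return a plain dict."""
    totals = defaultdict(float)  # missing countries start at 0.0
    for row in rows:
        totals[row['Country']] += row['Total Revenue']
    return dict(totals)

revenue_per_country = revenue_by_country(records)
print(revenue_per_country)  # {'France': 150.0, 'Japan': 250.0}
```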

## Step 11: Serialization Checkpoint

In [None]:
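import os
for folder in ('data/outputs/json', 'data/outputs/csv'):
    os.makedirs(folder, exist_ok=True)  # pandas does not create missing folders
df.to_json('data/outputs/json/cleaned_sales.json', orient='records')
df.to_csv('data/outputs/csv/cleaned_sales.csv', index=False)
print('Files saved.')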
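
A checkpoint is only useful if the file can be read back unchanged. The sketch below shows a round-trip check with the standard `json` module on a couple of invented records; it writes to a temporary directory instead of `data/outputs`, so the paths here are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records standing in for the cleaned data.
records = [
    {'Country': 'France', 'Total Revenue': 150.0},
    {'Country': 'Japan', 'Total Revenue': 250.0},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'json' / 'cleaned_sales.json'
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder first
    out_path.write_text(json.dumps(records), encoding='utf-8')

    # Read the file back and compare with what was written.
    reloaded = json.loads(out_path.read_text(encoding='utf-8'))

round_trip_ok = reloaded == records
print('Round trip OK:', round_trip_ok)
```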

## Step 12: Soft Interview Reflection

Functions modularize the data processing in this notebook. Wrapping a step such as `clean` in a function means the logic is written once, can be rerun on any DataFrame, and can be tested in isolation, which reduces duplication, improves readability, and makes debugging easier. Encapsulating the cleaning, transformation, and profiling logic in reusable units also keeps results consistent across the project.

## Data Dictionary

| Field | Type | Description | Source |
|-------|------|-------------|--------|
| Region | string | Sales region | Primary CSV |
| Country | string | Customer country | Primary CSV |
| Item Type | string | Product category | Primary CSV |
| Sales Channel | string | Online/Offline channel | Primary CSV |
| Order Date | date | Order placement date | Primary CSV |
| Ship Date | date | Order shipment date | Primary CSV |
| Units Sold | int | Number of units sold | Primary CSV |
| Unit Price | float | Price per unit | Primary CSV |
| Total Revenue | float | Units Sold × Unit Price | Primary CSV |
| coupon_code | string | Applied coupon | Synthetic |
| discount | float | Discount rate (0.10 for DISC10, otherwise 0) | Transformation |
| days_since_purchase | int | Shipping lag in days (Ship Date minus Order Date) | Engineered |
| product_name | string | Product name | Secondary catalogue |
| description | string | Product description | Secondary catalogue |