## Step 1: Hello, Data!
Load raw CSV, display first 3 rows

In [None]:
import pandas as pd
df = pd.read_csv('data/1000SalesRecords.csv')
df.head(3)

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Middle East and North Africa,Libya,Cosmetics,Offline,M,10/18/2014,686800706,10/31/2014,8446,437.2,263.33,3692591.2,2224085.18,1468506.02
1,North America,Canada,Vegetables,Online,M,11/7/2011,185941302,12/8/2011,3018,154.06,90.93,464953.08,274426.74,190526.34
2,Middle East and North Africa,Libya,Baby Food,Offline,C,10/31/2016,246222341,12/9/2016,1517,255.28,159.42,387259.76,241840.14,145419.62


## Step 2: Pick the Right Container
Justify dict vs namedtuple vs sets

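We will use a **dict** for each record because it maps field names to values and can be updated in place as fields are cleaned or added during transformations.
A namedtuple is immutable, so its fields cannot be reassigned once a record is created, and a set stores only unique values with no field names attached.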
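
The trade-off can be seen on a single order. The sketch below is a minimal illustration using values from the first row shown in Step 1; the `Order` namedtuple and the variable names are hypothetical, introduced only for this comparison.

```python
from collections import namedtuple

# One order as a dict: fields are addressable by name and can be changed in place.
order_dict = {"Country": "Libya", "Item Type": "Cosmetics", "Units Sold": 8446}
order_dict["Units Sold"] = 8500          # update an existing field
order_dict["coupon_code"] = "NONE"       # add a new field during a transformation

# The same order as a namedtuple (hypothetical `Order` type): readable, but immutable.
Order = namedtuple("Order", ["country", "item_type", "units_sold"])
order_nt = Order("Libya", "Cosmetics", 8446)
try:
    order_nt.units_sold = 8500           # assignment is rejected
    mutated = True
except AttributeError:
    mutated = False

# A set keeps unique values only: no field names, and duplicates collapse.
countries = {"Libya", "Canada", "Libya"}

print(order_dict)
print("namedtuple mutated:", mutated)
print("unique countries:", len(countries))
```

A namedtuple can still produce a modified copy through `_replace`, which suits read-only records better than a cleaning pipeline that rewrites fields repeatedly.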

## Step 3: Implement Functions and Data Structure
Using a small Python class with methods `.clean()` and `.total()`

In [None]:
import pandas as pd


class SalesDataProcessor:
    """Load a sales CSV and expose cleaning and aggregation helpers."""

    def __init__(self, file_path):
        self.df = pd.read_csv(file_path)
        self.cleaned = False

    def clean(self):
        before = len(self.df)
        self.df.drop_duplicates(inplace=True)
        self.df.dropna(inplace=True)
        after = len(self.df)
        self.cleaned = True
        print(f"Cleaned data: before={before} rows, after={after} rows.")
        return self.df

    def total(self, column):
        if column not in self.df.columns:
            raise ValueError(f"Column '{column}' not found in data.")
        return self.df[column].sum()

    def build_dict(self):
        return self.df.to_dict(orient="records")

processor = SalesDataProcessor('data/1000SalesRecords.csv')
processor.clean()
sales_dicts = processor.build_dict()
print(sales_dicts[:3])
total_revenue = processor.total("Total Revenue")
print("Total Revenue:", total_revenue)

Cleaned data: before=1000 rows, after=1000 rows.
[{'Region': 'Middle East and North Africa', 'Country': 'Libya', 'Item Type': 'Cosmetics', 'Sales Channel': 'Offline', 'Order Priority': 'M', 'Order Date': '10/18/2014', 'Order ID': 686800706, 'Ship Date': '10/31/2014', 'Units Sold': 8446, 'Unit Price': 437.2, 'Unit Cost': 263.33, 'Total Revenue': 3692591.2, 'Total Cost': 2224085.18, 'Total Profit': 1468506.02}, {'Region': 'North America', 'Country': 'Canada', 'Item Type': 'Vegetables', 'Sales Channel': 'Online', 'Order Priority': 'M', 'Order Date': '11/7/2011', 'Order ID': 185941302, 'Ship Date': '12/8/2011', 'Units Sold': 3018, 'Unit Price': 154.06, 'Unit Cost': 90.93, 'Total Revenue': 464953.08, 'Total Cost': 274426.74, 'Total Profit': 190526.34}, {'Region': 'Middle East and North Africa', 'Country': 'Libya', 'Item Type': 'Baby Food', 'Sales Channel': 'Offline', 'Order Priority': 'C', 'Order Date': '10/31/2016', 'Order ID': 246222341, 'Ship Date': '12/9/2016', 'Units Sold': 1517, 'Unit

## Step 4: Bulk Loaded
Map dataframes to dictionaries

In [None]:
records = processor.df.to_dict(orient='records')
records[:2]

[{'Region': 'Middle East and North Africa',
  'Country': 'Libya',
  'Item Type': 'Cosmetics',
  'Sales Channel': 'Offline',
  'Order Priority': 'M',
  'Order Date': '10/18/2014',
  'Order ID': 686800706,
  'Ship Date': '10/31/2014',
  'Units Sold': 8446,
  'Unit Price': 437.2,
  'Unit Cost': 263.33,
  'Total Revenue': 3692591.2,
  'Total Cost': 2224085.18,
  'Total Profit': 1468506.02},
 {'Region': 'North America',
  'Country': 'Canada',
  'Item Type': 'Vegetables',
  'Sales Channel': 'Online',
  'Order Priority': 'M',
  'Order Date': '11/7/2011',
  'Order ID': 185941302,
  'Ship Date': '12/8/2011',
  'Units Sold': 3018,
  'Unit Price': 154.06,
  'Unit Cost': 90.93,
  'Total Revenue': 464953.08,
  'Total Cost': 274426.74,
  'Total Profit': 190526.34}]
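
Once rows are plain dictionaries, they can be processed without pandas. A minimal sketch, using the two records shown above trimmed to three fields; `revenue_by_region` is a name introduced here for illustration.

```python
from collections import defaultdict

# Two records from the output above, trimmed to the fields used here.
records = [
    {"Region": "Middle East and North Africa", "Country": "Libya", "Total Revenue": 3692591.2},
    {"Region": "North America", "Country": "Canada", "Total Revenue": 464953.08},
]

# Accumulate revenue per region with a defaultdict (missing keys start at 0.0).
revenue_by_region = defaultdict(float)
for row in records:
    revenue_by_region[row["Region"]] += row["Total Revenue"]

for region, revenue in revenue_by_region.items():
    print(f"{region}: {revenue:,.2f}")
```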

## Step 5: Quick Profiling

In [None]:
print('Min Price:', processor.df['Unit Price'].min())
print('Mean Price:', processor.df['Unit Price'].mean())
print('Max Price:', processor.df['Unit Price'].max())
print('Unique countries:', processor.df['Country'].nunique())

Min Price: 9.33
Mean Price: 262.10684
Max Price: 668.27
Unique countries: 185
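
The same summary can be gathered in one call with `Series.agg`. This is a sketch on a three-row sample built from the rows displayed in Step 1, so the numbers differ from the full-file output above.

```python
import pandas as pd

# Small sample using the three rows displayed in Step 1 (not the full file).
sample = pd.DataFrame({
    "Country": ["Libya", "Canada", "Libya"],
    "Unit Price": [437.2, 154.06, 255.28],
})

# agg returns several statistics from one call, indexed by statistic name.
price_stats = sample["Unit Price"].agg(["min", "mean", "max"])
unique_countries = sample["Country"].nunique()

print(price_stats)
print("Unique countries:", unique_countries)
```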


## Step 6: Spot the Grime

Examples of dirty data:
- Missing values in columns
- Inconsistent date formats
- Duplicate rows
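
Each kind of grime can be counted before deciding on a rule. The sketch below uses a small made-up frame (the rows are hypothetical, not taken from the sales file) containing one missing value, one duplicate row, one date in a different format, and one country name with trailing whitespace, similar to the `'Antigua and Barbuda '` key in the Step 10 output.

```python
import pandas as pd

# Hypothetical rows that reproduce the problems listed above.
dirty = pd.DataFrame({
    "Country": ["Libya", "Canada", "Canada", "Antigua and Barbuda ", "Chad"],
    "Order Date": ["10/18/2014", "11/7/2011", "11/7/2011", "2016-10-31", "4/10/2010"],
    "Units Sold": [8446, 3018, 3018, None, 1517],
})

# Missing values per column.
missing_per_column = dirty.isna().sum()

# Fully duplicated rows (the first occurrence is not counted).
duplicate_rows = int(dirty.duplicated().sum())

# Dates that do not parse as month/day/year; errors="coerce" turns them into NaT.
parsed = pd.to_datetime(dirty["Order Date"], format="%m/%d/%Y", errors="coerce")
bad_dates = int(parsed.isna().sum())

# Text values with leading or trailing whitespace.
padded_countries = int((dirty["Country"] != dirty["Country"].str.strip()).sum())

print(missing_per_column)
print("duplicates:", duplicate_rows, "bad dates:", bad_dates, "padded:", padded_countries)
```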

## Step 7: Cleaning Rules

In [None]:
# Rules already applied in Step 3 via processor.clean():
# drop duplicate rows, then drop rows with any missing value.
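
`clean()` in Step 3 drops exact duplicates and rows with missing values, but it would not catch values that differ only by whitespace. A sketch of an extra trimming rule applied before those two, on hypothetical rows; this is one possible extension, not part of the class above.

```python
import pandas as pd

# Hypothetical rows: two differ only by trailing whitespace in the country name.
raw = pd.DataFrame({
    "Country": ["Canada", "Canada ", "Libya", None],
    "Units Sold": [3018, 3018, 8446, 1517],
})

cleaned = raw.copy()

# Rule 1: trim whitespace so equal values compare equal (missing values stay missing).
cleaned["Country"] = cleaned["Country"].str.strip()

# Rule 2: drop exact duplicates, then rows with any missing value.
cleaned = cleaned.drop_duplicates().dropna().reset_index(drop=True)

print(f"before={len(raw)} rows, after={len(cleaned)} rows")
```

Trimming first matters: without it, `'Canada'` and `'Canada '` would survive `drop_duplicates` as two distinct rows.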

## Step 8: Transformations

In [None]:
import numpy as np
processor.df['coupon_code'] = np.where(processor.df['Order Priority']=='H', 'DISC10', 'NONE')
processor.df['discount'] = processor.df['coupon_code'].apply(lambda x: 0.10 if x=='DISC10' else 0)
processor.df[['Order Priority','coupon_code','discount']].head()

Unnamed: 0,Order Priority,coupon_code,discount
0,M,NONE,0.0
1,M,NONE,0.0
2,C,NONE,0.0
3,C,NONE,0.0
4,H,DISC10,0.1
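
If more coupon rules are added later, lookup tables may scale better than chained conditions. A sketch under that assumption: only the `H` to `DISC10` to `0.10` rule comes from the cell above, and the names `coupon_by_priority` and `coupon_rates` are introduced here.

```python
import pandas as pd

# Order priorities from the first five rows shown above.
orders = pd.DataFrame({"Order Priority": ["M", "M", "C", "C", "H"]})

# Lookup tables: priority -> coupon code, coupon code -> discount rate.
# Only the H -> DISC10 -> 0.10 rule comes from the notebook.
coupon_by_priority = {"H": "DISC10"}
coupon_rates = {"DISC10": 0.10, "NONE": 0.0}

# map() leaves NaN for priorities without an entry; fillna supplies the default.
orders["coupon_code"] = orders["Order Priority"].map(coupon_by_priority).fillna("NONE")
orders["discount"] = orders["coupon_code"].map(coupon_rates)

print(orders)
```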


## Step 9: Feature Engineering

In [None]:
processor.df['Order Date'] = pd.to_datetime(processor.df['Order Date'])
processor.df['Ship Date'] = pd.to_datetime(processor.df['Ship Date'])
processor.df['days_since_purchase'] = (processor.df['Ship Date'] - processor.df['Order Date']).dt.days
processor.df[['Order Date','Ship Date','days_since_purchase']].head()

Unnamed: 0,Order Date,Ship Date,days_since_purchase
0,2014-10-18,2014-10-31,13
1,2011-11-07,2011-12-08,31
2,2016-10-31,2016-12-09,39
3,2010-04-10,2010-05-12,32
4,2011-08-16,2011-08-31,15
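
Passing an explicit `format` makes the month/day order unambiguous, and a string that does not match raises an error instead of being guessed at. A sketch on the first two rows shown above; `ship_lag_days` is a hypothetical column name used here in place of `days_since_purchase`.

```python
import pandas as pd

# First two rows from the output above, as raw strings.
orders = pd.DataFrame({
    "Order Date": ["10/18/2014", "11/7/2011"],
    "Ship Date": ["10/31/2014", "12/8/2011"],
})

# The file stores dates as month/day/year.
for column in ["Order Date", "Ship Date"]:
    orders[column] = pd.to_datetime(orders[column], format="%m/%d/%Y")

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
orders["ship_lag_days"] = (orders["Ship Date"] - orders["Order Date"]).dt.days

print(orders)
```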


## Step 10: Mini-Aggregation

In [None]:
revenue_per_country = processor.df.groupby('Country')['Total Revenue'].sum().to_dict()
revenue_per_country

{'Afghanistan': 2843589.07,
 'Albania': 9709899.27,
 'Algeria': 10272591.440000001,
 'Andorra': 7153122.97,
 'Angola': 15643032.02,
 'Antigua and Barbuda ': 5650520.67,
 'Armenia': 7139689.51,
 'Australia': 3215330.16,
 'Austria': 16199378.41,
 'Azerbaijan': 5308405.46,
 'Bahrain': 9022805.73,
 'Bangladesh': 5811989.16,
 'Barbados': 2803550.0999999996,
 'Belarus': 13482813.12,
 'Belgium': 9959553.530000001,
 'Belize': 9839301.81,
 'Benin': 9039257.06,
 'Bhutan': 12986378.17,
 'Bosnia and Herzegovina': 4359359.83,
 'Botswana': 2758990.99,
 'Brunei': 2702495.8899999997,
 'Bulgaria': 5430330.5600000005,
 'Burkina Faso': 3779357.44,
 'Burundi': 7032758.550000001,
 'Cambodia': 4642313.7,
 'Cameroon': 95209.92,
 'Canada': 1226103.3,
 'Cape Verde': 3629118.65,
 'Central African Republic': 16591036.850000001,
 'Chad': 17278040.69,
 'China': 10272536.76,
 'Comoros': 8999886.92,
 'Costa Rica': 19628279.63,
 "Cote d'Ivoire": 5121515.92,
 'Croatia': 941892.69,
 'Cuba': 27522085.87,
 'Cyprus': 5502
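
Values such as `10272591.440000001` in the output above are binary floating-point round-off from summation. A sketch that rounds totals to two decimals and ranks them; the order rows are hypothetical and chosen so that the round-off is visible.

```python
import pandas as pd

# Hypothetical order rows; 0.1 + 0.2 is a classic case of float round-off.
orders = pd.DataFrame({
    "Country": ["Chad", "Chad", "Cuba", "Canada"],
    "Total Revenue": [0.1, 0.2, 5.5, 1.25],
})

# Sum per country, round to cents, and sort from highest to lowest.
revenue_per_country = (
    orders.groupby("Country")["Total Revenue"]
    .sum()
    .round(2)
    .sort_values(ascending=False)
)

top_two = revenue_per_country.head(2).to_dict()
print(top_two)
```

Rounding is applied after the sum so that per-row values are not altered before aggregation.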

## Step 11: Serialization Checkpoint

In [None]:
processor.df.to_json('data/outputs/json/cleaned_sales.json', orient='records')
processor.df.to_csv('data/outputs/csv/cleaned_sales.csv', index=False)
print('Files saved.')

Files saved.
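
Two details are easy to miss at this checkpoint: `to_json` and `to_csv` do not create missing directories, and `to_json` writes datetime columns as epoch timestamps by default unless `date_format="iso"` is passed. A sketch that handles both and reads the file back; it writes to a temporary directory, and the sub-path is a stand-in for the notebook's `data/outputs/json`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# One row from the Step 9 output, with the date already parsed.
frame = pd.DataFrame({
    "Country": ["Libya"],
    "Order Date": pd.to_datetime(["2014-10-18"]),
    "Total Revenue": [3692591.2],
})

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for data/outputs/json; created explicitly because pandas will not do it.
    out_dir = Path(tmp) / "outputs" / "json"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "cleaned_sales.json"

    # date_format="iso" writes readable ISO 8601 strings instead of epoch numbers.
    frame.to_json(out_path, orient="records", date_format="iso")

    # Read the file back with the standard library to confirm its contents.
    restored = json.loads(out_path.read_text())

print(restored)
```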


## Step 12: Soft Interview Reflection

Functions and classes help by modularizing data processing. They reduce duplication, improve readability, and make debugging easier. 
Encapsulating cleaning, transformations, and profiling logic into a class makes the project reusable, extendable, and maintainable.

## Data Dictionary Section

| Field | Type | Description | Source |
|-------|------|-------------|--------|
| Region | string | Sales region | Primary CSV |
| Country | string | Customer country | Primary CSV |
| Item Type | string | Product category | Primary CSV |
| Sales Channel | string | Online/Offline channel | Primary CSV |
| Order Date | date | Order placement date | Primary CSV |
| Ship Date | date | Order shipment date | Primary CSV |
| Units Sold | int | Number of units sold | Primary CSV |
| Unit Price | float | Price per unit | Primary CSV |
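| Total Revenue | float | Units Sold * Unit Price | Primary CSV |
| coupon_code | string | Coupon code: DISC10 when Order Priority is H, otherwise NONE | Synthetic |
| discount | float | Discount rate for the coupon: 0.10 for DISC10, otherwise 0 | Transformation |
| days_since_purchase | int | Shipping lag in days (Ship Date minus Order Date) | Engineered |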
| product_name | string | Product name | Secondary catalogue |
| description | string | Product description | Secondary catalogue |