<a href="https://colab.research.google.com/github/iarondon3/End-to-End-Retail-Data-Ecosystem/blob/main/03-NoSQL-Integration/migration_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🍃⚙️ Step 1: Install & Start MongoDB *(Click Play)*

In [None]:
# STRATEGY CHANGE: Using 'mongomock' to bypass Colab infrastructure limitations.
# It emulates a MongoDB instance in RAM, which is sufficient for this ETL demonstration.

import time
import os

start_time = time.time()
print("📦 Installing Python drivers...")

# Install mongomock alongside standard libraries
os.system("pip install mongomock pymongo faker pandas > /dev/null 2>&1")

print("🚀 Initializing In-Memory MongoDB Engine...")
# We don't need to start a Linux service. The database lives in Python now.
from mongomock import MongoClient

# Verify we can instantiate a client
try:
    client = MongoClient()
    db = client.test_db
    print("✅ In-Memory NoSQL Engine Ready!")
    print("   State: Active (RAM-based)")
    print("   Compatibility: PyMongo-style API (mongomock emulates a subset of MongoDB)")
except Exception as e:
    print(f"❌ Setup Failed: {e}")

elapsed = round(time.time() - start_time, 2)
print(f"⏱️ Setup completed in {elapsed} seconds.")

> **🔍 What just happened?**

> We initialized an **In-Memory NoSQL Engine** using `mongomock`.
> * **Infrastructure strategy:** In CI/CD pipelines and ephemeral environments (like Colab), installing and starting a full database service can be slow and flaky. An in-memory instance removes that dependency, so the demo behaves the same way on every run.
> * **Compatibility:** This engine accepts standard PyMongo-style calls, which lets us test our data modeling logic. It emulates MongoDB rather than running it, so some server features may be missing or behave differently.
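
If the same notebook later needs to run against a real server, the client construction is the only part that should have to change. The following is a minimal sketch under stated assumptions: `get_client` and the `MONGO_URI` environment variable are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import os


def get_client(uri=None):
    """Return a MongoDB client: a real one if a URI is given, in-memory otherwise.

    `get_client` and MONGO_URI are illustrative assumptions, not notebook code.
    """
    uri = uri or os.environ.get("MONGO_URI")
    if uri:
        # Real server: needs `pip install pymongo` and a reachable instance.
        from pymongo import MongoClient
        return MongoClient(uri)
    # No URI supplied: fall back to the in-memory mock used in this lab.
    from mongomock import MongoClient
    return MongoClient()
```

Calling `get_client()` with no arguments (and no `MONGO_URI` set) should give the same in-memory behavior as the cell above, while passing a connection string switches to PyMongo without touching the ETL code.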

# 🏗️ Step 2: Generate "Relational" Source Data (SQL Simulation)


In [None]:


from faker import Faker
import random
import pandas as pd
from datetime import datetime, timedelta

# @markdown **Scenario Configuration: | Define Dataset Volume**
SALES_VOLUME = 8000  # @param {type:"slider", min:1000, max:50000, step:1000}

print(f"🎲 Generating synthetic SQL-like data ({SALES_VOLUME} transactions)...")
fake = Faker('en_US')

# Quantities (Fixed Dimensions, Dynamic Sales)
QUANTITIES = {
    'branches': 50,      # Scaled down slightly for Colab RAM (originally 500)
    'employees': 500,    # Scaled down (originally 5000)
    'categories': 30,
    'products': 2000,    # Scaled down (originally 8000)
    'customers': 5000,   # Scaled down (originally 20000)
    'sales': SALES_VOLUME  # Linked to the SALES_VOLUME slider defined above
}

WALGREENS_STATES = ['FL', 'TX', 'CA', 'IL', 'NY', 'PA', 'NC', 'GA', 'OH', 'MI']

CATEGORIES_LIST = [
    "Prescription Drugs (Rx)", "Over-the-Counter (OTC)", "Pain Relief", "Vitamins & Supplements",
    "Digestive Health", "Allergy & Sinus", "Wound Care", "First Aid",
    "Skin Care", "Hair Care", "Oral Hygiene", "Feminine Care",
    "Deodorants", "Makeup", "Facial Care", "Fragrances", "Baby Formula & Food",
    "Diapers & Wipes", "Baby Care", "Household Cleaning", "Paper & Plastic",
    "Batteries & Bulbs", "Snacks", "Beverages", "Candy", "Frozen Food",
    "Breakfast & Cereal", "Photo & Electronics", "Contact Lenses", "Cards & Gifts"
]

# Keys must match entries in CATEGORIES_LIST; unmatched categories fall back to a generic name.
PRODUCT_TEMPLATES = {
    "Prescription Drugs (Rx)": ["Metformin 500mg", "Lisinopril 10mg", "Atorvastatin 20mg"],
    "Over-the-Counter (OTC)": ["Allegra 24hr", "Zyrtec 10mg", "Claritin 24hr", "Pepto-Bismol"],
    "Pain Relief": ["Ibuprofen 200mg", "Advil Liqui-Gels", "Tylenol Extra Strength"],
    "Vitamins & Supplements": ["Vitamin C 1000mg", "Vitamin D3 2000IU", "Omega-3 Fish Oil"],
    "Snacks": ["Lay's Classic Chips", "Doritos", "Cheetos", "Oreo Cookies"],
    "Beverages": ["Coca-Cola 2L", "Pepsi 2L", "Bottled Water 1L", "Gatorade"],
    "Household Cleaning": ["Clorox Wipes", "Lysol Spray", "Tide Detergent"],
    "Photo & Electronics": ["USB-C Cable", "Anker Charger", "Apple EarPods"]
    # ... (Shortened for brevity in demo, but logic allows expansion)
}


# --- 2. GENERATING DIMENSIONS (Flat Tables) ---

# A. Branches
print(f"   ... Generating {QUANTITIES['branches']} Branches...")
branches = []
for i in range(1, QUANTITIES['branches'] + 1):
    branches.append({
        "branch_id": i,
        "state": random.choice(WALGREENS_STATES),
        "city": fake.city(),
        "country": "USA"
    })

# B. Customers
print(f"   ... Generating {QUANTITIES['customers']} Customers...")
customers = []
for i in range(1, QUANTITIES['customers'] + 1):
    customers.append({
        "customer_id": i,
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "is_member": random.choice([True, False]),
        "state": random.choice(WALGREENS_STATES) # For geographic realism
    })

# C. Products (Complex Logic from SQL Script)
print(f"   ... Generating {QUANTITIES['products']} Products...")
products = []
for i in range(1, QUANTITIES['products'] + 1):
    cat_name = random.choice(CATEGORIES_LIST)

    # Template logic
    templates = PRODUCT_TEMPLATES.get(cat_name, [f"Generic {cat_name} Product"])
    prod_name = random.choice(templates)
    brand = random.choice(["Walgreens", "Nice!", "Tylenol", "Dove", "Coca-Cola", "Generic"])

    products.append({
        "product_id": i,
        "name": prod_name,
        "brand": brand,
        "category": cat_name,
        "price": round(random.uniform(1.99, 89.99), 2)
    })

# --- 3. GENERATING TRANSACTIONS (Simulating Relational Structure) ---
# We simulate two tables: 'Sale' (Header) and 'Sale_Detail' (Line Items)

print(f"   ... Generating {QUANTITIES['sales']} Sales (Header & Details)...")
sql_sales = []        # Table: Sale
sql_sale_details = [] # Table: Sale_Detail

sale_id_counter = 1
# Pre-fetch lists for performance
branch_list = branches
cust_list = customers
prod_list = products

# Weights for realistic distribution (Long Tail)
# Most sales come from a few popular branches/customers
branch_weights = [random.expovariate(1.5) for _ in branch_list]
cust_weights = [random.expovariate(1.5) for _ in cust_list]

for _ in range(QUANTITIES['sales']):
    # 1. Foreign Keys (Simulated Joins)
    # Weighted sampling: a few branches/customers account for most sales (long tail)
    branch = random.choices(branch_list, weights=branch_weights, k=1)[0]
    cust = random.choices(cust_list, weights=cust_weights, k=1)[0]

    # 2. Sale Header
    sale_date = fake.date_this_year()
    sql_sales.append({
        "sale_id": sale_id_counter,
        "branch_id": branch["branch_id"],
        "customer_id": cust["customer_id"],
        "date": str(sale_date),
        "channel": random.choice(["STORE", "ONLINE", "APP"])
    })

    # 3. Sale Details (1 to 6 items per sale)
    num_items = random.randint(1, 6)
    selected_prods = random.sample(prod_list, num_items)

    for prod in selected_prods:
        qty = random.randint(1, 3)
        sql_sale_details.append({
            "sale_id": sale_id_counter, # FK to Header
            "product_id": prod["product_id"],
            "quantity": qty,
            "unit_price": prod["price"]
        })

    sale_id_counter += 1

# --- 4. VERIFICATION ---
print("✅ Data Generation Complete!")
print("   Simulated SQL Tables in Memory:")
print(f"   - Branches: {len(branches)}")
print(f"   - Products: {len(products)}")
print(f"   - Customers: {len(customers)}")
print(f"   - Sales (Header): {len(sql_sales)}")
print(f"   - Sale Details (Lines): {len(sql_sale_details)}")
print("\n👉 Preview of 'Sale' Table (First 3 rows):")
print(pd.DataFrame(sql_sales[:3]))

> **🔍 What just happened?**

> We simulated a **Legacy Relational Database (SQL)** environment.
> * **The Source:** We generated flat lists (`sql_sales`, `sql_sale_details`, `branches`) that mimic normalized SQL tables.
> * **The Bottleneck:** Notice that the data is fragmented. To answer a simple question like *"What did John buy?"*, the system has to **JOIN** four tables (customers, sales, sale details, and products). This is the friction point we aim to solve with NoSQL.
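
To make that friction concrete, here is a minimal sketch of the lookups needed to answer *"What did John buy?"* against four flat tables. The `demo_*` tables are hypothetical sample rows that only reuse the field names from the cell above, and `purchases_for` is an illustrative helper, not part of the notebook.

```python
# Hypothetical miniature versions of the four "SQL tables" (demo_* names
# are used so the notebook's real lists are not overwritten).
demo_customers = [{"customer_id": 1, "first_name": "John", "last_name": "Doe"}]
demo_sales = [
    {"sale_id": 10, "customer_id": 1},
    {"sale_id": 11, "customer_id": 1},
]
demo_sale_details = [
    {"sale_id": 10, "product_id": 100, "quantity": 2},
    {"sale_id": 11, "product_id": 101, "quantity": 1},
]
demo_products = [
    {"product_id": 100, "name": "Ibuprofen 200mg"},
    {"product_id": 101, "name": "Gatorade"},
]


def purchases_for(first_name):
    """Emulate Customer JOIN Sale JOIN Sale_Detail JOIN Product in plain Python."""
    # Hop 1: customers -> matching customer ids
    cust_ids = {c["customer_id"] for c in demo_customers
                if c["first_name"] == first_name}
    # Hop 2: sales -> sale ids belonging to those customers
    sale_ids = {s["sale_id"] for s in demo_sales
                if s["customer_id"] in cust_ids}
    # Hop 3 and 4: sale details -> product names via a lookup table
    names = {p["product_id"]: p["name"] for p in demo_products}
    return [(names[d["product_id"]], d["quantity"])
            for d in demo_sale_details if d["sale_id"] in sale_ids]


print(purchases_for("John"))  # [('Ibuprofen 200mg', 2), ('Gatorade', 1)]
```

Every question about a purchase repeats some version of these hops, which is the cost the embedded document model in the next step is meant to avoid.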

# 🏗️ Step 3: ETL Pipeline (SQL -> NoSQL Transformation)

In [None]:
# @title

import time
import json
from datetime import datetime
# We use the In-Memory client we set up in Step 1
from mongomock import MongoClient

print("🔄 Starting ETL Process...")
start_etl = time.time()

# --- 1. PRE-PROCESSING (Memory Indexing) ---
# Simulating SQL Lookup Tables
branch_map = {b['branch_id']: b for b in branches}
product_map = {p['product_id']: p for p in products}
customer_map = {c['customer_id']: c for c in customers}

# Index sale details by sale_id for faster grouping
details_map = {}
for d in sql_sale_details:
    s_id = d['sale_id']
    if s_id not in details_map:
        details_map[s_id] = []
    details_map[s_id].append(d)

print("   ✓ Dimensions indexed in memory.")

# --- 2. TRANSFORMATION (SQL Rows -> Nested Documents) ---
mongo_docs = []

for sale in sql_sales:
    s_id = sale['sale_id']
    b_id = sale['branch_id']
    c_id = sale['customer_id']

    # Retrieve Context ("JOINs")
    branch_data = branch_map.get(b_id)
    cust_data = customer_map.get(c_id)
    sale_items = details_map.get(s_id, [])

    # === DATA MODELING: PURE EMBEDDING ===
    doc = {
        "sale_id": s_id,
        "date": datetime.strptime(sale['date'], "%Y-%m-%d"),
        "channel": sale['channel'],

        # Denormalized Customer
        "customer": {
            "customer_id": c_id,
            "first_name": cust_data['first_name'],
            "last_name": cust_data['last_name'],
            "is_member": cust_data['is_member']
        },

        # Denormalized Branch
        "branch": {
            "city": branch_data['city'],
            "state": branch_data['state'],
            "country": "USA"
        },

        # Embedded Items Array
        "items": []
    }

    # Transform Items
    total_amount = 0
    for item in sale_items:
        prod = product_map.get(item['product_id'])
        line_total = item['quantity'] * item['unit_price']
        total_amount += line_total

        doc["items"].append({
            "product_name": prod['name'],
            "brand": prod['brand'],
            "category": prod['category'],
            "quantity": item['quantity'],
            "unit_price": item['unit_price'],
            "line_total": round(line_total, 2)
        })

    doc["total_amount"] = round(total_amount, 2)
    mongo_docs.append(doc)

print(f"   ✓ Transformation complete. Prepared {len(mongo_docs)} nested documents.")

# --- 3. LOADING (Insert into NoSQL) ---
# Connect to In-Memory DB
client = MongoClient()
db = client.walgreens_analytics
collection = db.sales

# Clear and Bulk Insert
collection.delete_many({})
collection.insert_many(mongo_docs)

elapsed = round(time.time() - start_etl, 2)
print(f"✅ ETL Finished in {elapsed}s. Data is loaded into MongoDB (Memory)!")
print(f"   Collection Size: {collection.count_documents({})} documents.")

# --- 4. QUALITY ASSURANCE (Verify Schema) ---
print("\n🔎 QA Check: Sampling one document to verify 'Pure Embedding' Schema:")
sample_doc = collection.find_one()
# We use 'default=str' to handle datetime objects nicely in JSON
print(json.dumps(sample_doc, indent=4, default=str))

> **🔍 What just happened?**

> We executed the core **ETL (Extract, Transform, Load)** process to migrate from SQL to NoSQL.
> * **Transformation Pattern:** We applied the **Pure Embedding Strategy**. Instead of linking records through foreign keys, we built each Sale document with the Customer, Branch, and Items embedded directly inside it.
> * **The Result:** As the QA output shows, we now have **Rich Documents**. A single read operation retrieves the entire context of a transaction, which suits read-heavy analytical workloads.
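
Beyond eyeballing one sampled document, each nested document can be checked for internal consistency. The sketch below is an assumption layered on top of the notebook: `check_sale_doc` is a hypothetical helper, and `sample_sale` is a hand-written document that only mirrors the field names produced by the ETL cell.

```python
def check_sale_doc(doc, tolerance=0.01):
    """Return a list of problems found in one embedded sale document."""
    problems = [f"missing field: {key}"
                for key in ("sale_id", "customer", "branch", "items", "total_amount")
                if key not in doc]
    if problems:
        return problems  # cannot check totals on an incomplete document
    if not doc["items"]:
        problems.append("sale has no items")
    # Recompute the total from the embedded line items.
    expected = sum(i["quantity"] * i["unit_price"] for i in doc["items"])
    if abs(expected - doc["total_amount"]) > tolerance:
        problems.append(
            f"total_amount {doc['total_amount']} != recomputed {round(expected, 2)}"
        )
    return problems


# Hypothetical document shaped like the ETL output.
sample_sale = {
    "sale_id": 1,
    "customer": {"customer_id": 7, "first_name": "John", "last_name": "Doe"},
    "branch": {"city": "Miami", "state": "FL", "country": "USA"},
    "items": [
        {"product_name": "Gatorade", "quantity": 2, "unit_price": 4.99},
        {"product_name": "Tide Detergent", "quantity": 1, "unit_price": 12.50},
    ],
    "total_amount": 22.48,
}

print(check_sale_doc(sample_sale))  # []
```

In the notebook, the same function could be applied to every entry of `mongo_docs` before the bulk insert, collecting any non-empty results for review.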

# 📊 Step 4: Run Analytics (Average Basket Size 🛒)

Measures customer purchasing behavior by aggregating over the embedded `items` arrays.

In [None]:
# @title


print("🔎 Executing Aggregation Pipeline: 'Average Basket Size per Customer'...")

pipeline = [
    # 1. DOCUMENT LEVEL: Calculate total items in this specific transaction
    # We use $addFields to create a temporary field 'basket_size' by summing the embedded array
    {
        "$addFields": {
            "basket_size": { "$sum": "$items.quantity" }
        }
    },

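    # 2. COLLECTION LEVEL: Group by Customer
    # customer_id is part of the key so two customers who share a name are not merged
    {
        "$group": {
            "_id": {
                "customer_id": "$customer.customer_id",
                "first_name": "$customer.first_name",
                "last_name": "$customer.last_name"
            },
            "avg_items_per_visit": { "$avg": "$basket_size" },
            "total_visits": { "$sum": 1 }
        }
    },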

    # 3. SORT: Show customers with the largest baskets first
    { "$sort": { "avg_items_per_visit": -1 } },

    # 4. LIMIT: Top 10 for display
    { "$limit": 10 }
]

# Execute
results = list(collection.aggregate(pipeline))

# Visualization
print(f"\n{'CUSTOMER':<30} | {'AVG ITEMS/VISIT':<20} | {'TOTAL VISITS'}")
print("-" * 65)

for r in results:
    full_name = f"{r['_id']['first_name']} {r['_id']['last_name']}"
    avg = round(r['avg_items_per_visit'], 1)
    visits = r['total_visits']
    print(f"{full_name:<30} | {avg:<20} | {visits}")

print("\n✅ Analysis Complete. This metric helps identify bulk buyers vs. impulsive shoppers.")

> **🔍 What just happened?**

> We used the **MongoDB Aggregation Framework** to run the analytics inside the database engine (here, the in-memory mock) rather than in application code.
> * **The Challenge:** We needed to calculate a metric (Basket Size) that didn't exist explicitly in the database.
> * **The Solution:** We used a multi-stage pipeline. First, we used `$addFields` to sum the quantities inside the embedded `items` array (Document manipulation), and then we used `$group` to calculate the average across all purchases (Collection aggregation). This shows that MongoDB can compute derived metrics, not just store documents.
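
Because the pipeline runs against a mock engine here, it can be worth cross-checking its output with plain Python. The sketch below mirrors the `$addFields`, `$group`, `$sort`, and `$limit` stages on a few hypothetical documents. `avg_basket_size` and `demo_docs` are illustrative names, and the grouping keys on `customer_id` rather than on the name alone, so two customers who share a name are not merged.

```python
from collections import defaultdict


def avg_basket_size(docs, top_n=10):
    """Plain-Python counterpart of $addFields -> $group -> $sort -> $limit."""
    baskets = defaultdict(list)
    for doc in docs:
        # $addFields: basket_size = sum of items.quantity for this sale
        basket_size = sum(item["quantity"] for item in doc["items"])
        cust = doc["customer"]
        key = (cust["customer_id"], cust["first_name"], cust["last_name"])
        baskets[key].append(basket_size)
    # $group: average basket size and visit count per customer
    rows = [
        {"customer": key,
         "avg_items_per_visit": sum(sizes) / len(sizes),
         "total_visits": len(sizes)}
        for key, sizes in baskets.items()
    ]
    # $sort descending, then $limit
    rows.sort(key=lambda r: r["avg_items_per_visit"], reverse=True)
    return rows[:top_n]


# Hypothetical documents shaped like the ETL output.
john = {"customer_id": 1, "first_name": "John", "last_name": "Doe"}
jane = {"customer_id": 2, "first_name": "Jane", "last_name": "Roe"}
demo_docs = [
    {"customer": john, "items": [{"quantity": 2}, {"quantity": 1}]},
    {"customer": john, "items": [{"quantity": 5}]},
    {"customer": jane, "items": [{"quantity": 1}, {"quantity": 1}]},
]

for row in avg_basket_size(demo_docs):
    print(row["customer"][1], row["avg_items_per_visit"], row["total_visits"])
# John 4.0 2
# Jane 2.0 1
```

Running the same function over `mongo_docs` and comparing the top rows with the pipeline output is one way to confirm that the mock engine evaluated the stages as intended.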