# Phase 3: Standard Library Deep Dive
## Python's batteries-included toolkit

Python comes with a huge set of built-in modules called the **standard library**. 
You do not need to install them -- they are already there when you install Python.

Odibi uses over a dozen standard library modules. This notebook covers every one of them, 
with real Odibi examples so you understand why each module matters.

### What does `import` mean?

When you write `import os`, you are telling Python: "Load the `os` module so I can use it." 
A **module** is just a file full of pre-written functions and classes.

There are two ways to import:
```python
import os                    # Import the whole module, use as os.path.join()
from pathlib import Path     # Import just one thing, use as Path()
```

Both are fine. Use whichever you see in the existing code. Odibi uses both styles.

**Rules:** Same as before. Type everything. Run everything. Complete every exercise.

---
## Section 1: os and pathlib -- Working with Files and Paths

These modules let you interact with the file system: build file paths, check if files exist, 
list directories, read environment variables.

Odibi uses these in 15+ files. Every connection, every config loader, every read/write 
operation touches file paths.

### pathlib.Path -- the modern way

`pathlib` was added in Python 3.4 and is the preferred way to work with paths. 
It represents a path as an **object** instead of a plain string, which means you get 
methods and operators that make path manipulation cleaner.

Odibi uses `from pathlib import Path` in almost every module.

In [1]:
from pathlib import Path

# Creating paths
p = Path("data/bronze/customers.csv")
print(p)
print(type(p))

# Path parts
print(f"Name: {p.name}")         # customers.csv
print(f"Stem: {p.stem}")         # customers (name without extension)
print(f"Suffix: {p.suffix}")     # .csv
print(f"Parent: {p.parent}")     # data/bronze
print(f"Parts: {p.parts}")       # ("data", "bronze", "customers.csv")

data\bronze\customers.csv
<class 'pathlib.WindowsPath'>
Name: customers.csv
Stem: customers
Suffix: .csv
Parent: data\bronze
Parts: ('data', 'bronze', 'customers.csv')


In [None]:
from pathlib import Path

# Building paths with / operator (this is the Pythonic way)
base = Path("data")
full_path = base / "bronze" / "customers.csv"
print(full_path)  # data/bronze/customers.csv

# This is exactly how odibi/connections/local.py builds paths:
# self.base_path / relative_path

# Change extension
parquet_path = full_path.with_suffix(".parquet")
print(parquet_path)  # data/bronze/customers.parquet

# Check existence
print(full_path.exists())  # False (we haven't created it)
print(Path(".").exists())  # True (current directory always exists)

# Current working directory
print(Path.cwd())

data\bronze\customers.csv
data\bronze\customers.parquet
False
True
c:\Users\hodibi\OneDrive - Ingredion\Desktop\Repos\Odibi\learning\notebooks


In [14]:
from pathlib import Path

# Listing files in a directory
odibi_dir = Path(".")  # Current directory

# List all items
for item in sorted(odibi_dir.iterdir()):
    if not item.name.startswith("."):  # Skip hidden files
        kind = "DIR" if item.is_dir() else "FILE"
        print(f"  [{kind}] {item.name}")
        if item.is_dir():
            break  # Just show first few to avoid clutter

  [FILE] 01_python_basics.ipynb
  [FILE] 02_data_structures.ipynb
  [FILE] 03_standard_library.ipynb
  [FILE] 04_oop.ipynb
  [FILE] 05_advanced_patterns.ipynb
  [FILE] 06_pydantic.ipynb
  [FILE] 07_pandas.ipynb
  [FILE] 08_mini_odibi_core.ipynb
  [FILE] 09_mini_odibi_features.ipynb
  [FILE] 10_testing.ipynb


In [18]:
from pathlib import Path

# glob - find files matching a pattern
# This is how odibi/engine/pandas_engine.py finds data files
p = Path(".")

# Find all Python files in current directory
py_files = list(p.glob("*.py"))
print(f"Python files in current dir: {len(py_files)}")
for f in sorted(py_files)[:5]:  # Show first 5
    print(f"  {f.name}")

# Recursive glob with ** (find in all subdirectories)
yaml_files = list(p.glob("**/*.yaml"))  # All YAML files anywhere below
print(yaml_files)

Python files in current dir: 0
[]


### os module -- environment variables and system operations

While `pathlib` handles paths, `os` is still used for:
- Reading environment variables (`os.environ`, `os.getenv()`)
- Creating directories (`os.makedirs()`)
- Checking if paths exist (`os.path.exists()` -- though `Path.exists()` is preferred)
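
As a rough sketch of how the two modules overlap, the snippet below performs the same directory operations once with `os` and once with `pathlib`, inside a temporary directory so nothing is left behind. The folder names (`bronze/raw`, `silver/clean`) are made up for illustration and are not taken from Odibi.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # os style: paths are plain strings
    os_dir = os.path.join(tmp, "bronze", "raw")
    os.makedirs(os_dir, exist_ok=True)
    os_exists = os.path.exists(os_dir)

    # pathlib style: paths are objects with methods
    path_dir = Path(tmp) / "silver" / "clean"
    path_dir.mkdir(parents=True, exist_ok=True)
    path_exists = path_dir.exists()

    print(os_exists, path_exists)
```

Both approaches create the nested folders in one call; `parents=True` is the `pathlib` counterpart of what `os.makedirs` does by default.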

In [19]:
import os

# Environment variables
# These are key-value pairs set in your operating system
# Odibi uses them for connection secrets, API keys, etc.

# Get an env var (returns None if not set)
home = os.getenv("USERPROFILE")  # Windows home directory
print(f"Home: {home}")

# Get with a default
env = os.getenv("ODIBI_ENV", "development")
print(f"Environment: {env}")  # "development" (since ODIBI_ENV is not set)

# Check if env var exists
has_key = "PATH" in os.environ
print(f"PATH exists: {has_key}")

Home: C:\Users\hodibi
Environment: development
PATH exists: True


In [35]:
import os
from pathlib import Path

# Creating directories
# os.makedirs creates the directory AND all parent directories
# exist_ok=True means it won't crash if it already exists

test_dir = Path("learning/mini_odibi/temp_test")
os.makedirs(test_dir, exist_ok=True)
print(f"Created: {test_dir.exists()}")  # True

# Clean up
test_dir.rmdir()  # Remove empty directory

Created: True


### Exercise 1.1: File path builder

Write a function called `build_output_path` that:
- Takes: `base_dir` (str), `layer` (str), `table_name` (str), `format` (str, default "parquet")
- Returns a Path object like: `base_dir/layer/table_name.format`
- Example: `build_output_path("data", "bronze", "customers")` -> `Path("data/bronze/customers.parquet")`

Then use the function and print the resulting path's name, stem, suffix, and parent.

In [39]:
# Exercise 1.1
# YOUR CODE HERE
from pathlib import Path

def build_output_path(base_dir: str, layer: str, table_name: str, format: str = "parquet") -> Path:
    # Join the parts with the / operator instead of formatting a string by hand
    return Path(base_dir) / layer / f"{table_name}.{format}"



# Test:
p = build_output_path("data", "bronze", "customers")
print(f"Full path: {p}")
print(f"Name: {p.name}")
print(f"Stem: {p.stem}")
print(f"Suffix: {p.suffix}")
print(f"Parent: {p.parent}")

Full path: data\bronze\customers.parquet
Name: customers.parquet
Stem: customers
Suffix: .parquet
Parent: data\bronze


**Expected output:**
```
Full path: data/bronze/customers.parquet
Name: customers.parquet
Stem: customers
Suffix: .parquet
Parent: data/bronze
```

---
## Section 2: json -- Reading and Writing Structured Data

JSON (JavaScript Object Notation) is a text format for storing structured data. 
It looks almost identical to Python dictionaries and lists.

Odibi uses JSON for catalog metadata, diagnostics output, schema exports, and lineage data.

The `json` module converts between Python objects and JSON strings:
- `json.dumps(obj)` -- Python object -> JSON string ("dump string")
- `json.loads(text)` -- JSON string -> Python object ("load string")
- `json.dump(obj, file)` -- Python object -> JSON file
- `json.load(file)` -- JSON file -> Python object
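
A minimal sketch of how the four functions pair up: the `s` variants work with strings, the others with file-like objects. Here `io.StringIO` stands in for a real file so the example does not touch the disk; the `record` dict is invented for illustration.

```python
import io
import json

record = {"name": "customers", "row_count": 1542, "active": True}

# dumps/loads work with strings
text = json.dumps(record)
from_text = json.loads(text)

# dump/load work with file-like objects; StringIO stands in for a real file
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)  # rewind to the start before reading
from_file = json.load(buffer)

print(from_text == record, from_file == record)
```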

In [40]:
import json

# Python dict to JSON string
node_config = {
    "name": "customers",
    "format": "delta",
    "row_count": 1542,
    "columns": ["id", "name", "email"],
    "active": True,
    "last_error": None,
}

# Convert to JSON string
json_str = json.dumps(node_config)
print(f"Compact: {json_str}")

# Pretty-printed (indent=2 is standard)
pretty = json.dumps(node_config, indent=2)
print(f"\nPretty:\n{pretty}")

# Notice: True became true, None became null
# JSON has different names for these values

Compact: {"name": "customers", "format": "delta", "row_count": 1542, "columns": ["id", "name", "email"], "active": true, "last_error": null}

Pretty:
{
  "name": "customers",
  "format": "delta",
  "row_count": 1542,
  "columns": [
    "id",
    "name",
    "email"
  ],
  "active": true,
  "last_error": null
}


In [41]:
import json

# JSON string to Python dict
json_text = '{"name": "orders", "format": "csv", "row_count": 8930, "active": true}'

config = json.loads(json_text)
print(type(config))  # dict
print(config['name'])  # orders
print(config['active'])  # True (json true becomes Python True)

<class 'dict'>
orders
True


In [48]:
import json
from pathlib import Path

# Write JSON to a file
node_config = {
    "name": "customers",
    "format": "delta",
    "row_count": 1542,
}

output_path = Path("learning/mini_odibi/test_config.json")
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w") as f:
    json.dump(node_config, f, indent=2)
print(f"Wrote to {output_path}")

# Read JSON from a file
with open(output_path, "r") as f:
    loaded = json.load(f)
print(f"Loaded: {loaded}")
print(f"Name: {loaded['name']}")

# Clean up
output_path.unlink()  # Delete the file

Wrote to learning\mini_odibi\test_config.json
Loaded: {'name': 'customers', 'format': 'delta', 'row_count': 1542}
Name: customers


### The `with` statement (context managers)

You just saw `with open(path, "w") as f:`. This is called a **context manager**.

When you open a file, you need to close it when you are done. The `with` statement 
does this automatically -- even if your code crashes:

```python
# Bad: If an error happens, the file stays open
f = open("file.txt", "r")
data = f.read()      # What if this crashes?
f.close()             # This might never run!

# Good: File is ALWAYS closed when the block ends
with open("file.txt", "r") as f:
    data = f.read()  # Even if this crashes, file gets closed
```

Always use `with` when working with files. This is a common interview topic.

Odibi's `PhaseTimer` in `node.py` uses the same `with` pattern for timing execution phases.
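
To see the guarantee in action, the sketch below raises an error on purpose inside a `with` block and then checks whether the file was closed anyway. The file name `demo.txt` and the simulated error are assumptions made for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"
    path.write_text("hello")

    handle = None
    try:
        with open(path, "r") as f:
            handle = f  # keep a reference so we can inspect it afterwards
            raise RuntimeError("simulated failure inside the block")
    except RuntimeError:
        pass

    # The with statement closed the file even though the block raised
    closed_after_error = handle.closed
    print(closed_after_error)
```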

### Exercise 2.1: Config file handler

Write two functions:
1. `save_config(config, filepath)` -- saves a dict as pretty-printed JSON to a file
2. `load_config(filepath)` -- loads a JSON file and returns a dict

Both should use `with` statements. Test them by saving a config and loading it back.

In [60]:
# Exercise 2.1
# YOUR CODE HERE
import json
from pathlib import Path
def save_config(config: dict, filepath: str) -> None:
    with open(filepath, "w") as f:
        json.dump(config, f, indent=4)
    print(f"Wrote to {filepath}")

def load_config(filepath: str) -> dict:
    with open(filepath, "r") as f:
        return json.load(f)

# Test:
test_config = {"name": "test_node", "format": "csv", "rows": 100}
save_config(test_config, "learning/mini_odibi/test_output.json")
loaded = load_config("learning/mini_odibi/test_output.json")
print(loaded)
print(loaded == test_config)  # Should be True
Path("learning/mini_odibi/test_output.json").unlink()  # Clean up

Wrote to learning/mini_odibi/test_output.json
{'name': 'test_node', 'format': 'csv', 'rows': 100}
True


---
## Section 3: re -- Regular Expressions (Pattern Matching)

Regular expressions (regex) let you search for **patterns** in text, not just exact matches.

Examples of what regex can do:
- Find all dates in a string (pattern: `\d{4}-\d{2}-\d{2}`)
- Check if a string looks like an email
- Extract numbers from mixed text
- Replace patterns in text

Odibi uses `re` in `context.py`, `config_loader.py`, `introspect.py`, and `node.py` 
for things like parsing SQL, substituting date variables, and extracting error messages.

In [61]:
import re

# re.search() - find the FIRST match
text = "Node customers processed 1542 rows in 3.45 seconds"

# Find a number
match = re.search(r"\d+", text)  # \d+ means "one or more digits"
if match:
    print(f"Found number: {match.group()}")  # 1542

# re.findall() - find ALL matches
numbers = re.findall(r"[\d.]+", text)  # Digits and dots
print(f"All numbers: {numbers}")  # ["1542", "3.45"]

Found number: 1542
All numbers: ['1542', '3.45']


### Common regex patterns

| Pattern | Meaning | Example |
|---------|---------|--------|
| `\d` | Any digit (0-9) | `\d+` matches `1542` |
| `\w` | Any word character (letter, digit, underscore) | `\w+` matches `customers` |
| `\s` | Any whitespace | `\s+` matches spaces, tabs |
| `.` | Any character except a newline | `.+` matches the rest of the line |
| `*` | Zero or more of previous | `\d*` matches `""` or `123` |
| `+` | One or more of previous | `\d+` matches `123` but not `""` |
| `?` | Zero or one of previous | `\d?` matches `""` or `1` |
| `{n}` | Exactly n of previous | `\d{4}` matches `2024` |
| `^` | Start of string | `^Node` matches `"Node..."` |
| `$` | End of string | `\.csv$` matches `"file.csv"` |
| `()` | Capture group | `(\d+)` captures the number |

The `r` before the string (`r"\d+"`) means "raw string" -- it tells Python not to 
interpret backslashes. Always use raw strings for regex.
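
The sketch below tries a few of the patterns from the table on small made-up strings. It also illustrates one easy mistake: an unescaped `.` matches any character, so it has to be written as `\.` when a literal dot is meant.

```python
import re

# {n} with fullmatch: the whole string must look like YYYY-MM-DD
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-01-15") is not None

# ^ anchors the match at the start of the string
starts_with_node = re.search(r"^Node", "Node customers") is not None

# An unescaped . matches any character; \. matches only a literal dot
loose = re.search(r".csv$", "datacsv") is not None    # matches "acsv"
strict = re.search(r"\.csv$", "datacsv") is not None  # no literal dot present

# A raw string is the same pattern as the version with doubled backslashes
same_pattern = r"\d+" == "\\d+"

print(is_date, starts_with_node, loose, strict, same_pattern)
```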

In [62]:
import re

# re.sub() - replace patterns
# This is how odibi/utils/config_loader.py substitutes date variables
config_text = "source: data/bronze/sales_{YYYY}_{MM}.csv"

# Replace date placeholders
result = re.sub(r"\{YYYY\}", "2024", config_text)
result = re.sub(r"\{MM\}", "01", result)
print(result)  # source: data/bronze/sales_2024_01.csv

source: data/bronze/sales_2024_01.csv


In [63]:
import re

# Capture groups with ()
log_line = "[2024-01-15 14:30:00] ERROR Node: customers - Column not found"

# Extract the date, level, and node name
pattern = r"\[(.*?)\]\s+(\w+)\s+Node:\s+(\w+)"
match = re.search(pattern, log_line)
if match:
    print(f"Date: {match.group(1)}")   # 2024-01-15 14:30:00
    print(f"Level: {match.group(2)}")  # ERROR
    print(f"Node: {match.group(3)}")   # customers

Date: 2024-01-15 14:30:00
Level: ERROR
Node: customers


### Exercise 3.1: Parse a table path

Write a function that uses regex to parse table paths in the format:
`schema.table` or `catalog.schema.table`

Return a dict with keys: `catalog` (or None), `schema`, `table`.

In [87]:
"bronze.customers"
import re

pattern = r'\w+'  # one or more word characters; the dot acts as a separator
match = re.findall(pattern, "bronze.customers")
match

['bronze', 'customers']

In [90]:
# Exercise 3.1
# YOUR CODE HERE
import re

def parse_table_path(table_path: str) -> dict:
    # Optional "catalog." prefix, followed by "schema.table"
    pattern = r"(?:(\w+)\.)?(\w+)\.(\w+)"
    match = re.fullmatch(pattern, table_path)
    if match is None:
        raise ValueError(f"Invalid table path: {table_path!r}")
    # The catalog group is None when the optional prefix is absent
    catalog, schema, table = match.groups()
    return {"catalog": catalog, "schema": schema, "table": table}




# Test:
print(parse_table_path("bronze.customers"))         # {"catalog": None, "schema": "bronze", "table": "customers"}
print(parse_table_path("main.bronze.customers"))    # {"catalog": "main", "schema": "bronze", "table": "customers"}

{'catalog': None, 'schema': 'bronze', 'table': 'customers'}
{'catalog': 'main', 'schema': 'bronze', 'table': 'customers'}


---
## Section 4: logging -- Professional Output

In Phase 1, you used `print()` for output. In real code, you use `logging`. Here is why:

- **Levels**: DEBUG, INFO, WARNING, ERROR, CRITICAL -- you can filter what you see
- **Format**: Timestamps, module names, line numbers -- automatically
- **Destinations**: Console, files, remote services -- configurable
- **Production**: You can turn off debug messages without changing code

Odibi has its own logging system built on top of Python's `logging` module 
(`odibi/utils/logging.py` and `odibi/utils/logging_context.py`).
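
As a small illustration of level filtering, the sketch below sends messages to an in-memory stream and sets the logger to `WARNING`, so the lower levels are dropped. The logger name `levels_demo` is hypothetical, and this is plain standard-library `logging`, not Odibi's own wrapper.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))

demo_logger = logging.getLogger("levels_demo")  # hypothetical logger name
demo_logger.addHandler(handler)
demo_logger.propagate = False          # keep the demo out of the root logger
demo_logger.setLevel(logging.WARNING)  # drop DEBUG and INFO

demo_logger.debug("hidden")
demo_logger.info("hidden too")
demo_logger.warning("shown")
demo_logger.error("also shown")

output = stream.getvalue()
print(output)
```

Changing the single `setLevel` call is all it takes to show or hide the debug messages; the logging calls themselves stay the same.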

In [91]:
import logging

# Basic setup
logging.basicConfig(
    level=logging.DEBUG,  # Show all messages DEBUG and above
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%H:%M:%S",
)

# Create a logger for this module
logger = logging.getLogger("mini_odibi")

# Five levels (from least to most severe)
logger.debug("Detailed info for debugging")    # Usually hidden in production
logger.info("Normal operation info")            # Standard messages
logger.warning("Something unexpected")          # Not an error, but worth noting
logger.error("Something went wrong")            # An error occurred
logger.critical("System is broken")             # Fatal error

10:34:25 [DEBUG] Detailed info for debugging
10:34:25 [INFO] Normal operation info
10:34:25 [WARNING] Something unexpected
10:34:25 [ERROR] Something went wrong
10:34:25 [CRITICAL] System is broken


In [94]:
import logging

# Using logger with f-strings (the way Odibi does it)
logger = logging.getLogger("mini_odibi")

node_name = "customers"
row_count = 1542
duration = 3.45

logger.info(f"Processing node: {node_name}")
logger.info(f"Read {row_count:,} rows in {duration:.2f}s")
logger.warning(f"Node {node_name} has no validation tests")

# In Odibi, you would see:
# ctx = get_logging_context()
# ctx.info("Processing node", node_name=node_name, rows=row_count)

10:34:52 [INFO] Processing node: customers
10:34:52 [INFO] Read 1,542 rows in 3.45s
10:34:52 [WARNING] Node customers has no validation tests


---
## Section 5: datetime -- Working with Dates and Times

Data engineering is obsessed with dates: when was data loaded? When does it expire? 
What is the processing window? Every Odibi pattern uses timestamps.

Python's `datetime` module handles all of this.

In [101]:
from datetime import datetime, date, timedelta

# Current date and time
now = datetime.now()
print(f"Now: {now}")
print(f"Date only: {now.date()}")
print(f"Time only: {now.time()}")
print(f"Year: {now.year}, Month: {now.month}, Day: {now.day}")

# Format as string (strftime = "string format time")
formatted = now.strftime("%Y-%m-%d %H:%M:%S")
print(f"Formatted: {formatted}")

# Common format codes:
# %Y = 4-digit year (2024)
# %m = 2-digit month (01-12)
# %d = 2-digit day (01-31)
# %H = 24-hour hour (00-23)
# %M = minute (00-59)
# %S = second (00-59)

Now: 2026-02-09 11:10:58.766892
Date only: 2026-02-09
Time only: 11:10:58.766892
Year: 2026, Month: 2, Day: 9
Formatted: 2026-02-09 11:10:58


In [102]:
from datetime import datetime, timedelta

# Parse a string into a datetime (strptime = "string parse time")
date_str = "2024-01-15 14:30:00"
dt = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
print(f"Parsed: {dt}")
print(f"Year: {dt.year}")

# timedelta - represent a duration
# Odibi uses timedelta for retention periods, SCD2 expiry, etc.
now = datetime.now()
one_week_ago = now - timedelta(days=7)
tomorrow = now + timedelta(days=1)

print(f"One week ago: {one_week_ago.strftime('%Y-%m-%d')}")
print(f"Tomorrow: {tomorrow.strftime('%Y-%m-%d')}")

# Duration between two dates
start = datetime(2024, 1, 1)
end = datetime(2024, 3, 15)
diff = end - start
print(f"Difference: {diff.days} days")

Parsed: 2024-01-15 14:30:00
Year: 2024
One week ago: 2026-02-02
Tomorrow: 2026-02-10
Difference: 74 days
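
The `timedelta` arithmetic above can be turned into a simple retention check. The function name `is_expired` and its parameters are hypothetical, chosen only to illustrate the idea; this is not how Odibi implements retention.

```python
from datetime import datetime, timedelta

def is_expired(loaded_at: datetime, retention_days: int, now: datetime) -> bool:
    """Return True if loaded_at is older than the retention period."""
    return now - loaded_at > timedelta(days=retention_days)

# Fixed dates so the result is repeatable (74 days apart)
now = datetime(2024, 3, 15)
loaded_at = datetime(2024, 1, 1)

expired_30 = is_expired(loaded_at, retention_days=30, now=now)
expired_90 = is_expired(loaded_at, retention_days=90, now=now)
print(expired_30, expired_90)
```

Passing `now` in as an argument, rather than calling `datetime.now()` inside the function, keeps the check easy to test.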


### Exercise 5.1: Timestamp logger

Write a function `log_with_timestamp(message)` that:
1. Gets the current time
2. Formats it as `YYYY-MM-DD HH:MM:SS`
3. Prints `[timestamp] message`

Then write a function `calculate_duration(start, end)` that:
1. Takes two datetime objects
2. Returns the duration as a formatted string like `"2h 15m 30s"`

In [109]:
# Exercise 5.1
# YOUR CODE HERE
from datetime import datetime, timedelta

def log_with_timestamp(message:str):
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f'[{now}] {message}')


# Test log_with_timestamp:
log_with_timestamp("Pipeline started")
log_with_timestamp("Node customers completed")

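def calculate_duration(start: datetime, end: datetime) -> str:
    # Break the total number of seconds into hours, minutes, and seconds
    total_seconds = int((end - start).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"

# Test calculate_duration:
start = datetime(2024, 1, 15, 14, 30, 0)
end = datetime(2024, 1, 15, 16, 45, 30)
print(calculate_duration(start, end))  # 2h 15m 30s

[2026-02-09 11:19:17] Pipeline started
[2026-02-09 11:19:17] Node customers completed
2h 15m 30s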


---
## Section 6: hashlib and uuid -- Hashing and Unique IDs

### hashlib -- Creating fingerprints of data

A **hash** is a fixed-length fingerprint of any data. The same input always produces 
the same hash. Different inputs (almost always) produce different hashes.

Odibi uses hashing in `catalog.py`, `node.py`, and `pandas_engine.py` to:
- Detect if data has changed (content hashing)
- Create unique row identifiers
- Track which version of data was processed
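
A minimal sketch of content hashing for change detection follows. The helper `content_hash` and the sample rows are assumptions made for illustration; Odibi's actual hashing code may work differently.

```python
import hashlib

def content_hash(rows: list[str]) -> str:
    """Hash a list of row strings into one SHA-256 hex digest."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()

original = ["1|Alice", "2|Bob"]
unchanged = ["1|Alice", "2|Bob"]
modified = ["1|Alice", "2|Bobby"]

same = content_hash(original) == content_hash(unchanged)
changed = content_hash(original) != content_hash(modified)
print(same, changed, len(content_hash(original)))
```

Comparing two short digests is much cheaper than comparing two full datasets, which is why a stored hash is a convenient "has this changed?" marker.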

### uuid -- Generating unique identifiers

A **UUID** (Universally Unique Identifier) is a 128-bit ID. Version 4 UUIDs, the kind 
`uuid.uuid4()` creates, are randomly generated, so the odds of generating the same one 
twice are essentially zero.

Odibi uses UUIDs in `pipeline.py` and `lineage.py` to give each pipeline run a unique ID.

In [111]:
data = "customers|1542|2024-01-15"
data.encode()  # str -> bytes; hashlib functions only accept bytes

b'customers|1542|2024-01-15'

In [114]:
import hashlib
import uuid

# hashlib - create a hash of data
data = "customers|1542|2024-01-15"
hash_value = hashlib.md5(data.encode()).hexdigest()
print(f"MD5 hash: {hash_value}")

# Same input = same hash (always)
hash_again = hashlib.md5(data.encode()).hexdigest()
print(f"Same? {hash_value == hash_again}")  # True

# SHA256 (more secure, used in Odibi)
sha_hash = hashlib.sha256(data.encode()).hexdigest()
print(f"SHA256: {sha_hash}")

# uuid - generate unique run IDs
run_id = str(uuid.uuid4())
print(f"Run ID: {run_id}")
# Something like: a38f981d-52da-47b1-818c-fbaa9ab56e0c

MD5 hash: dcfcc194c1be1d330f7142cbd5719b29
Same? True
SHA256: d1ae72bc40effee1d1ab8a1c432f71e41b0422ef95a88d93e661df20d6e0a60c
Run ID: fbd4d480-d359-4da5-94bb-4cee344464ff


---
## Section 7: collections -- Specialized Data Structures

The `collections` module provides enhanced versions of dicts, lists, and tuples.

Odibi's `graph.py` uses `defaultdict` and `deque` for dependency graph traversal.
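
To give a feel for why these two structures suit graph work, here is a small breadth-first traversal over a made-up dependency graph. The node names and the helper `reachable_from` are hypothetical; this is a sketch of the general technique, not a copy of `graph.py`.

```python
from collections import defaultdict, deque

# (upstream, downstream) pairs: downstream depends on upstream
edges = [
    ("raw_customers", "clean_customers"),
    ("raw_orders", "clean_orders"),
    ("clean_customers", "sales_report"),
    ("clean_orders", "sales_report"),
]

# defaultdict builds the adjacency list without key checks
downstream = defaultdict(list)
for parent, child in edges:
    downstream[parent].append(child)

def reachable_from(start: str) -> list[str]:
    """Breadth-first walk over everything downstream of start."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()  # cheap removal from the left end
        order.append(node)
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

print(reachable_from("raw_customers"))
```

`popleft()` on a `deque` takes constant time, whereas `list.pop(0)` has to shift every remaining element.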

In [121]:
from collections import defaultdict, Counter, deque

# defaultdict - a dict that creates default values automatically
# Remember the grouping exercise from Phase 2? defaultdict makes it easier.

# Without defaultdict (what you did in Phase 2):
grouped = {}
items = [("bronze", "customers"), ("silver", "customers"), ("bronze", "orders"), ("silver", "orders")]
for layer, table in items:
    if layer not in grouped:
        grouped[layer] = []
    grouped[layer].append(table)
print(f"Manual: {grouped}")

# With defaultdict (much cleaner):
grouped = defaultdict(list)  # Missing keys automatically get an empty list
for layer, table in items:
    grouped[layer].append(table)  # No need to check if key exists!
print(f"defaultdict: {dict(grouped)}")

Manual: {'bronze': ['customers', 'orders'], 'silver': ['customers', 'orders']}
defaultdict: {'bronze': ['customers', 'orders'], 'silver': ['customers', 'orders']}


In [123]:
from collections import Counter

# Counter - count occurrences (the drill from Phase 2, done in one line)
statuses = ["SUCCESS", "SUCCESS", "FAILED", "SUCCESS", "FAILED", "SUCCESS"]
counts = Counter(statuses)
print(counts)                # Counter({'SUCCESS': 4, 'FAILED': 2})
print(counts["SUCCESS"])     # 4
print(counts.most_common(1)) # [('SUCCESS', 4)] - most common item

Counter({'SUCCESS': 4, 'FAILED': 2})
4
[('SUCCESS', 4)]


In [125]:
from collections import deque

# deque ("deck") - double-ended queue
# Efficient for adding/removing from both ends
# Odibi's graph.py uses deque for BFS (breadth-first search) traversal

queue = deque()
queue.append("customers")   # Add to right
queue.append("orders")      # Add to right
queue.appendleft("setup")   # Add to left
print(f"Queue: {list(queue)}")

# Process in order (FIFO - first in, first out)
first = queue.popleft()     # Remove from left
print(f"Processing: {first}")  # setup
print(f"Remaining: {list(queue)}")

Queue: ['setup', 'customers', 'orders']
Processing: setup
Remaining: ['customers', 'orders']


---
## Section 8: dataclasses -- Simple Classes for Data

A **dataclass** is a shortcut for creating classes that mainly hold data. 
Instead of writing `__init__`, `__repr__`, and `__eq__` yourself, Python generates them.

Odibi uses dataclasses in 12+ modules: `pandas_engine.py`, `pipeline.py`, `diagnostics/`, 
`validation/`, and more.

We will cover full OOP (classes, inheritance) in Phase 4. This is just a preview of 
dataclasses because they are simpler and you will need them soon.
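
One generated method that is easy to overlook is `__eq__`. The sketch below compares a hand-written class with a dataclass holding the same fields; both class names are invented for this illustration.

```python
from dataclasses import dataclass

class PlainResult:
    def __init__(self, name: str, rows: int):
        self.name = name
        self.rows = rows

@dataclass
class DataResult:
    name: str
    rows: int

# Plain classes compare by identity unless you write __eq__ yourself
plain_equal = PlainResult("customers", 10) == PlainResult("customers", 10)

# Dataclasses compare field by field using the generated __eq__
data_equal = DataResult("customers", 10) == DataResult("customers", 10)

print(plain_equal, data_equal)
print(DataResult("customers", 10))
```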

In [126]:
from dataclasses import dataclass, field

# Without dataclass (lots of boilerplate)
class NodeResultOld:
    def __init__(self, name, success, rows, duration):
        self.name = name
        self.success = success
        self.rows = rows
        self.duration = duration

    def __repr__(self):
        return f"NodeResult(name={self.name}, success={self.success}, rows={self.rows})"

# With dataclass (Python generates __init__ and __repr__ for you)
@dataclass
class NodeResult:
    name: str
    success: bool
    rows: int
    duration: float

# Both classes store the same fields, but the dataclass is much shorter
result = NodeResult(name="customers", success=True, rows=1542, duration=3.45)
print(result)  # NodeResult(name='customers', success=True, rows=1542, duration=3.45)
print(result.name)  # customers
print(result.rows)  # 1542

NodeResult(name='customers', success=True, rows=1542, duration=3.45)
customers
1542


In [127]:
from dataclasses import dataclass, field
from typing import List, Optional

# Dataclass with defaults and complex fields
@dataclass
class PipelineResult:
    name: str
    success: bool = True              # Default value
    duration: float = 0.0
    errors: List[str] = field(default_factory=list)  # Mutable defaults need field()
    metadata: Optional[dict] = None   # Optional means it can be None

# Use with all defaults
r1 = PipelineResult(name="sales_pipeline")
print(r1)

# Override some defaults
r2 = PipelineResult(name="failed_pipeline", success=False, errors=["Connection timeout"])
print(r2)
print(r2.errors)  # ['Connection timeout']

PipelineResult(name='sales_pipeline', success=True, duration=0.0, errors=[], metadata=None)
PipelineResult(name='failed_pipeline', success=False, duration=0.0, errors=['Connection timeout'], metadata=None)
['Connection timeout']


### Exercise 8.1: Create a data class

Create a `@dataclass` called `ValidationResult` with:
- `test_name` (str)
- `passed` (bool)
- `total_rows` (int)
- `failed_rows` (int, default 0)
- `message` (str, default "")

Add a method called `pass_rate` that returns the pass rate as a float.

Then create a list of 3 results and print which tests failed.

In [135]:
# Exercise 8.1
# YOUR CODE HERE
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    test_name: str
    passed: bool
    total_rows: int
    failed_rows: int = 0
    message: str = ""

    def pass_rate(self) -> float:
        # Guard against division by zero when there are no rows
        if self.total_rows == 0:
            return 0.0
        return (self.total_rows - self.failed_rows) / self.total_rows

# Test:
results = [
    ValidationResult("not_null", True, 1000, 0),
    ValidationResult("unique", False, 1000, 23, "23 duplicate rows"),
    ValidationResult("range_check", True, 1000, 5),
]
for r in results:
    status = "PASS" if r.passed else "FAIL"
    print(f"[{status}] {r.test_name}: {r.pass_rate():.1%} pass rate")

[PASS] not_null: 100.0% pass rate
[FAIL] unique: 97.7% pass rate
[PASS] range_check: 99.5% pass rate


---
## Section 9: enum -- Named Constants

An **Enum** (enumeration) is a set of named constants. Instead of using magic strings 
like `"pandas"` or `"spark"` scattered throughout your code, you define them once as an Enum.

Odibi's `config.py` defines Enums for EngineType, ConnectionType, WriteMode, and more.
This is what makes the YAML config validation work -- invalid values are caught automatically.

In [137]:
from enum import Enum

# Define an Enum
class EngineType(str, Enum):
    PANDAS = "pandas"
    SPARK = "spark"
    POLARS = "polars"

# Using it
engine = EngineType.PANDAS
print(engine)          # EngineType.PANDAS
print(engine.value)    # pandas
print(engine.name)     # PANDAS

# Comparison
print(engine == EngineType.PANDAS)  # True
print(engine == "pandas")           # True (because it inherits from str)

# List all values
for e in EngineType:
    print(f"  {e.name} = {e.value}")

# This is EXACTLY how odibi/config.py defines EngineType!

EngineType.PANDAS
pandas
PANDAS
True
True
  PANDAS = pandas
  SPARK = spark
  POLARS = polars


In [138]:
from enum import Enum

# Practical use: validation
class WriteMode(str, Enum):
    OVERWRITE = "overwrite"
    APPEND = "append"
    UPSERT = "upsert"
    APPEND_ONCE = "append_once"
    MERGE = "merge"

# Validate input
def validate_write_mode(mode_str):
    try:
        return WriteMode(mode_str)
    except ValueError:
        valid = [m.value for m in WriteMode]
        raise ValueError(f"Invalid mode: {mode_str}. Must be one of: {valid}")

print(validate_write_mode("upsert"))  # WriteMode.UPSERT

try:
    validate_write_mode("delete")
except ValueError as e:
    print(e)

WriteMode.UPSERT
Invalid mode: delete. Must be one of: ['overwrite', 'append', 'upsert', 'append_once', 'merge']


---
## Section 10: Quick Hits -- contextlib and functools

These will be covered in depth in Phase 5. Here is a preview of the most common uses.

In [143]:
from contextlib import contextmanager

# contextmanager lets you create your own 'with' blocks
# Odibi's PhaseTimer in node.py uses this exact pattern

import time

@contextmanager
def timer(label):
    """Time a block of code."""
    start = time.perf_counter()
    try:
        yield  # This is where the 'with' block runs
    finally:
        # Runs even if the block raises, so the time is always reported
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f}s")

# Use it
with timer("Sleep test"):
    time.sleep(0.5)  # Pause for half a second
# Prints: Sleep test: 0.50s

Sleep test: 0.50s


In [144]:
from functools import wraps

# wraps preserves function metadata when decorating
# Odibi's registry.py uses @wraps in the @transform decorator
# Full coverage in Phase 5

def log_call(func):
    """Decorator that logs function calls."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@log_call
def process_node(name):
    """Process a single node."""
    return f"Processed {name}"

print(process_node("customers"))
print(process_node.__name__)  # "process_node" (preserved by @wraps)

Calling process_node
Processed customers
process_node


---
## Section 11: Interview Drill

These drills test your standard library knowledge.

### Drill 1: File operations

Write a function that:
1. Takes a directory path
2. Lists all `.csv` files in it (using pathlib)
3. Returns a list of their names (without the directory path)

Test it on the current directory (it is fine if the list is empty).

In [None]:
# Drill 1
# YOUR CODE HERE
from pathlib import Path


### Drill 2: JSON round-trip

Given a list of dicts, convert them to a JSON string, then back to Python objects. 
Verify the round-trip was lossless.

In [None]:
# Drill 2
# YOUR CODE HERE
import json

data = [
    {"node": "customers", "rows": 1542, "success": True},
    {"node": "orders", "rows": 8930, "success": True},
]

# Convert to JSON string, then back. Print both and verify equality.

### Drill 3: Date math

Calculate how many days have passed between January 15, 2024 and today. 
Print the result.

In [None]:
# Drill 3
# YOUR CODE HERE
from datetime import datetime


### Drill 4: Counter

Given a list of log levels, use `Counter` to find the most common level 
and the least common level.

In [None]:
# Drill 4
# YOUR CODE HERE
from collections import Counter

log_levels = ["INFO", "INFO", "WARNING", "INFO", "ERROR", "INFO", 
              "WARNING", "DEBUG", "INFO", "ERROR", "WARNING", "INFO"]

---
## Checkpoint

You now know Python's standard library toolkit. Here is what you covered:

- **pathlib / os** -- file paths, directories, environment variables
- **json** -- reading and writing structured data, the `with` statement
- **re** -- pattern matching with regular expressions
- **logging** -- professional output with levels
- **datetime** -- dates, times, durations, formatting
- **hashlib / uuid** -- data fingerprints and unique IDs
- **collections** -- `defaultdict`, `Counter`, `deque`
- **dataclasses** -- simple data-holding classes
- **enum** -- named constants for clean code
- **contextlib / functools** -- context managers and decorators (preview)

Every module above is used in Odibi. You are not learning theory -- you are learning 
the tools your framework is built with.

**Next:** Notebook 04 -- Object-Oriented Programming (classes, inheritance, ABC, 
dunder methods, composition). This is where you start building mini-odibi's architecture.