# Phase 2: Data Structures and Comprehensions
## The tools you reach for every day

In Phase 1, you learned how to work with single values -- a string, a number, a boolean.
But real programs work with **collections** of data. A list of node names. A dictionary of 
configuration settings. A set of unique column names.

Python has four built-in collection types. By the end of this notebook, you will know:

- **Lists** -- ordered collections you can change
- **Dictionaries** -- key-value pairs (the most important data structure in Python)
- **Sets** -- collections of unique values
- **Tuples** -- ordered collections you cannot change

You will also learn **comprehensions** -- a Python superpower that lets you build and transform 
collections in a single line. Odibi uses over 280 comprehensions across its codebase. 
By the end of this notebook, you will understand every single one of them.

**Rules (same as Phase 1):**
1. Type every line of code yourself.
2. Run every cell.
3. Complete every exercise before moving on.

---
## Section 1: Lists

A **list** is an ordered collection of items. You create one with square brackets `[]`.

Think of a list like a numbered shelf. Each item has a position (called an **index**), 
starting at 0. You can add items, remove items, and rearrange them.

Lists are the most common data structure in Python. In Odibi, lists are used for:
- Column names
- Node execution order
- Validation test results
- Error messages and suggestions

### Creating lists

In [2]:
# Creating lists
nodes = ["customers", "orders", "products"]  # List of strings
row_counts = [1542, 8930, 234]                # List of integers
mixed = ["customers", 1542, True, None]       # Lists can hold any type
empty = []                                     # Empty list

print(nodes)
print(row_counts)
print(mixed)
print(empty)
print(type(nodes))  # <class 'list'>

['customers', 'orders', 'products']
[1542, 8930, 234]
['customers', 1542, True, None]
[]
<class 'list'>


### Indexing and slicing

Every item in a list has an index. Remember: **indexing starts at 0**, not 1.

```
Index:     0           1          2
List:  ["customers", "orders", "products"]
```

You can also use **negative indexing** to count from the end:
```
Index:    -3          -2         -1
List:  ["customers", "orders", "products"]
```

In [None]:
# Indexing - accessing individual items
nodes = ["customers", "orders", "products", "inventory", "shipments"]

print(nodes[0])    # First item: customers
print(nodes[1])    # Second item: orders
print(nodes[-1])   # LAST item: shipments
print(nodes[-2])   # Second to last: inventory

# len() - how many items in the list
print(len(nodes))  # 5
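
What happens when an index does not exist? The sketch below (reusing a shortened `nodes` list) shows that indexing past the end raises an `IndexError`, while slicing past the end quietly returns whatever items are there.

```python
nodes = ["customers", "orders", "products"]

# Indexing past the end raises IndexError
try:
    nodes[10]
except IndexError as exc:
    error_message = str(exc)
    print(f"IndexError: {error_message}")

# Slicing past the end does NOT raise; it returns what exists
beyond = nodes[1:10]
empty_slice = nodes[5:]
print(beyond)       # ['orders', 'products']
print(empty_slice)  # []
```

This is why `len()` is handy: valid indexes run from `0` to `len(nodes) - 1`.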

In [3]:
# Slicing - getting a portion of the list
# Syntax: list[start:stop]  (start is included, stop is NOT included)
nodes = ["customers", "orders", "products", "inventory", "shipments"]

print(nodes[0:3])   # First 3 items: ['customers', 'orders', 'products']
print(nodes[1:4])   # Items at index 1, 2, 3: ['orders', 'products', 'inventory']
print(nodes[:3])    # From start to index 3: ['customers', 'orders', 'products']
print(nodes[2:])    # From index 2 to end: ['products', 'inventory', 'shipments']
print(nodes[:])     # Copy of the entire list

# Slicing with step
print(nodes[::2])   # Every other item: ['customers', 'products', 'shipments']
print(nodes[::-1])  # Reversed! ['shipments', 'inventory', 'products', 'orders', 'customers']

['customers', 'orders', 'products']
['orders', 'products', 'inventory']
['customers', 'orders', 'products']
['products', 'inventory', 'shipments']
['customers', 'orders', 'products', 'inventory', 'shipments']
['customers', 'products', 'shipments']
['shipments', 'inventory', 'products', 'orders', 'customers']


### Modifying lists

Lists are **mutable** -- you can change them after creation. This is different from strings, 
which are immutable.
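
Mutability has one consequence worth seeing early. In the sketch below, assigning a list to a second name does not copy it, so a change made through one name shows up under the other. The variable names are made up for illustration.

```python
nodes = ["customers", "orders"]

alias = nodes            # Second name for the SAME list
snapshot = nodes.copy()  # New list with the same items (nodes[:] also works)

nodes.append("products")

print(alias)     # ['customers', 'orders', 'products']
print(snapshot)  # ['customers', 'orders']

# Strings are immutable: methods return a new string instead
name = "customers"
shouted = name.upper()
print(name)     # customers
print(shouted)  # CUSTOMERS
```

If you need the original to stay untouched, take a copy before modifying.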

In [None]:
# Modifying lists
nodes = ["customers", "orders", "products"]

# Add to the end
nodes.append("inventory")
print(nodes)  # ['customers', 'orders', 'products', 'inventory']

# Insert at a specific position
nodes.insert(1, "accounts")  # Insert at index 1
print(nodes)  # ['customers', 'accounts', 'orders', 'products', 'inventory']

# Remove by value
nodes.remove("accounts")
print(nodes)  # ['customers', 'orders', 'products', 'inventory']

# Remove by index and get the value back
last = nodes.pop()       # Removes and returns the last item
print(last)              # inventory
print(nodes)             # ['customers', 'orders', 'products']

second = nodes.pop(1)    # Removes and returns item at index 1
print(second)            # orders

['customers', 'orders', 'products', 'inventory']
['customers', 'accounts', 'orders', 'products', 'inventory']
['customers', 'orders', 'products', 'inventory']
inventory
['customers', 'orders', 'products']
orders


In [5]:
# More list operations
nodes = ["customers", "orders", "products"]

# Extend - add multiple items
nodes.extend(["inventory", "shipments"])
print(nodes)  # ['customers', 'orders', 'products', 'inventory', 'shipments']

# + operator - combine lists (creates a new list)
bronze = ["raw_sales", "raw_returns"]
silver = ["clean_sales", "clean_returns"]
all_nodes = bronze + silver
print(all_nodes)

# Sort
numbers = [3, 1, 4, 1, 5, 9, 2, 6]
numbers.sort()
print(numbers)  # [1, 1, 2, 3, 4, 5, 6, 9]

# Sort descending
numbers.sort(reverse=True)
print(numbers)  # [9, 6, 5, 4, 3, 2, 1, 1]

# Check if item exists
print("customers" in nodes)   # True
print("missing" in nodes)     # False

['customers', 'orders', 'products', 'inventory', 'shipments']
['raw_sales', 'raw_returns', 'clean_sales', 'clean_returns']
[1, 1, 2, 3, 4, 5, 6, 9]
[9, 6, 5, 4, 3, 2, 1, 1]
True
False


### IMPORTANT: .sort() vs sorted()

This trips up many people and is an interview favorite:

- `list.sort()` -- sorts the list **in place** (changes the original, returns `None`)
- `sorted(list)` -- returns a **new sorted list** (original is unchanged)

```python
nums = [3, 1, 2]
result = nums.sort()   # result is None! nums is now [1, 2, 3]

nums = [3, 1, 2]
result = sorted(nums)  # result is [1, 2, 3], nums is still [3, 1, 2]
```

In [6]:
# Demonstrating .sort() vs sorted()
nums = [3, 1, 2]
result = nums.sort()
print(f"nums.sort() returned: {result}")  # None!
print(f"nums is now: {nums}")              # [1, 2, 3]

nums = [3, 1, 2]
result = sorted(nums)
print(f"sorted(nums) returned: {result}")  # [1, 2, 3]
print(f"nums is still: {nums}")            # [3, 1, 2]

nums.sort() returned: None
nums is now: [1, 2, 3]
sorted(nums) returned: [1, 2, 3]
nums is still: [3, 1, 2]


### Exercise 1.1: Build a node execution report

Given the list below:
1. Add `"shipments"` to the end
2. Insert `"accounts"` at index 2
3. Remove `"temp_table"` (it should not be there)
4. Print the final list and its length
5. Print the first node and the last node
6. Print the list in reverse order (without modifying the original)

In [11]:
# Exercise 1.1
# YOUR CODE HERE
nodes = ["customers", "orders", "temp_table", "products", "inventory"]

# 1. Add "shipments" to the end
nodes.append("shipments")

# 2. Insert "accounts" at index 2
nodes.insert(2, "accounts")

# 3. Remove "temp_table"
nodes.remove("temp_table")

# 4. Print the final list and its length
print(f"List: {nodes} | Length: {len(nodes)}")
# 5. Print first and last node
print(f"First item: {nodes[0]} | Last item: {nodes[-1]}")
# 6. Print in reverse (without modifying original)
print(f"Reversed: {nodes[::-1]} original: {nodes}")

List: ['customers', 'orders', 'accounts', 'products', 'inventory', 'shipments'] | Length: 6
First item: customers | Last item: shipments
Reversed: ['shipments', 'inventory', 'products', 'accounts', 'orders', 'customers'] original: ['customers', 'orders', 'accounts', 'products', 'inventory', 'shipments']


**Expected output:**
```
['customers', 'orders', 'accounts', 'products', 'inventory', 'shipments']
Length: 6
First: customers
Last: shipments
Reversed: ['shipments', 'inventory', 'products', 'accounts', 'orders', 'customers']
```

---
## Section 2: Dictionaries

A **dictionary** (dict) is a collection of **key-value pairs**. You create one with curly braces `{}`.

If a list is like a numbered shelf, a dictionary is like a labeled filing cabinet. 
Each drawer has a label (the key) and contents (the value).

Dictionaries are arguably the **most important** data structure in Python. They are used for:
- Configuration (Odibi YAML configs become dicts)
- Mapping names to values
- JSON data (which is just dicts and lists)
- Function keyword arguments
- Counting occurrences

In Odibi, dictionaries are everywhere: config settings, node metadata, column mappings, 
engine registries, validation results, and more.

### Creating dictionaries

In [12]:
# Creating dictionaries
# Syntax: {key: value, key: value, ...}
node_config = {
    "name": "customers",
    "format": "delta",
    "write_mode": "upsert",
    "row_count": 1542,
    "is_active": True,
}

print(node_config)
print(type(node_config))

{'name': 'customers', 'format': 'delta', 'write_mode': 'upsert', 'row_count': 1542, 'is_active': True}
<class 'dict'>


In [None]:
# Accessing values
node_config = {
    "name": "customers",
    "format": "delta",
    "write_mode": "upsert",
    "row_count": 1542,
}

# Method 1: Square brackets (raises KeyError if key missing)
print(node_config["name"])       # customers
print(node_config["row_count"])  # 1542

# Method 2: .get() (returns None or default if key missing -- SAFER)
print(node_config.get("name"))           # customers
print(node_config.get("missing_key"))    # None (no error!)
print(node_config.get("missing_key", "default_value"))  # default_value

# IMPORTANT: Always use .get() when the key might not exist.
# This is a common source of bugs -- KeyError crashes your program.
# In an interview, using .get() shows you write defensive code.
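
To make the difference concrete, here is a small sketch of what the `KeyError` looks like and how `.get()` avoids it. The `"overwrite"` default is only an example value.

```python
node_config = {"name": "customers", "format": "delta"}

# Square brackets on a missing key raise KeyError
try:
    mode = node_config["write_mode"]
except KeyError as exc:
    missing_key = exc.args[0]
    print(f"Missing key: {missing_key}")  # Missing key: write_mode

# .get() returns the default instead of raising
mode = node_config.get("write_mode", "overwrite")
print(mode)                         # overwrite
print("write_mode" in node_config)  # False (.get() does not add the key)
```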

In [13]:
# Modifying dictionaries
node_config = {
    "name": "customers",
    "format": "csv",
}

# Add a new key
node_config["write_mode"] = "upsert"
print(node_config)

# Update an existing key
node_config["format"] = "delta"
print(node_config)

# Delete a key
del node_config["format"]
print(node_config)

# .pop() -- remove and return the value
mode = node_config.pop("write_mode")
print(f"Removed: {mode}")  # upsert
print(node_config)         # Only 'name' left

{'name': 'customers', 'format': 'csv', 'write_mode': 'upsert'}
{'name': 'customers', 'format': 'delta', 'write_mode': 'upsert'}
{'name': 'customers', 'write_mode': 'upsert'}
Removed: upsert
{'name': 'customers'}


### Iterating over dictionaries

This is one of the most common things you will do in Python. There are three ways to loop 
over a dict, and you need to know all three.

In [14]:
# Three ways to iterate over a dictionary
node_config = {
    "name": "customers",
    "format": "delta",
    "write_mode": "upsert",
    "row_count": 1542,
}

# 1. Loop over KEYS (default behavior)
print("=== Keys ===")
for key in node_config:
    print(key)

# 2. Loop over VALUES
print("\n=== Values ===")
for value in node_config.values():
    print(value)

# 3. Loop over KEY-VALUE PAIRS (most common and most useful)
print("\n=== Key-Value Pairs ===")
for key, value in node_config.items():
    print(f"  {key}: {value}")

=== Keys ===
name
format
write_mode
row_count

=== Values ===
customers
delta
upsert
1542

=== Key-Value Pairs ===
  name: customers
  format: delta
  write_mode: upsert
  row_count: 1542


### Nested dictionaries

Dictionaries can contain other dictionaries. This is exactly how Odibi's YAML configs work -- 
a pipeline config is a dictionary of dictionaries.

In [16]:
# Nested dictionaries -- like an Odibi pipeline config
pipeline = {
    "customers": {
        "source": "raw_customers.csv",
        "format": "csv",
        "write_mode": "upsert",
        "keys": ["customer_id"],
    },
    "orders": {
        "source": "raw_orders.csv",
        "format": "csv",
        "write_mode": "append",
        "keys": ["order_id"],
    },
}

# Access nested values
print(pipeline["customers"]["source"])     # raw_customers.csv
print(pipeline["orders"]["write_mode"])     # append
print(pipeline["customers"]["keys"])        # ["customer_id"]

# Loop through the pipeline
for node_name, config in pipeline.items():
    print(f"Node: {node_name}, Source: {config['source']}, Mode: {config['write_mode']}")

raw_customers.csv
append
['customer_id']
Node: customers, Source: raw_customers.csv, Mode: upsert
Node: orders, Source: raw_orders.csv, Mode: append
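
Nested lookups with square brackets fail if any level is missing. One common way around that, sketched here with a trimmed-down `pipeline` and an assumed default of `"overwrite"`, is to chain `.get()` calls and fall back to an empty dict for a missing outer key.

```python
pipeline = {
    "customers": {"source": "raw_customers.csv", "write_mode": "upsert"},
    "orders": {"source": "raw_orders.csv"},  # No write_mode here
}

# Inner key present
customers_mode = pipeline["customers"].get("write_mode", "overwrite")

# Inner key missing: the default is returned
orders_mode = pipeline["orders"].get("write_mode", "overwrite")

# Outer key missing: fall back to {} so the second .get() still works
products_mode = pipeline.get("products", {}).get("write_mode", "overwrite")

print(customers_mode)  # upsert
print(orders_mode)     # overwrite
print(products_mode)   # overwrite
```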


### Checking if a key exists

In [17]:
# Checking for keys
config = {"name": "customers", "format": "delta"}

# Use "in" to check if a KEY exists (not a value!)
print("name" in config)        # True
print("write_mode" in config)  # False

# Common pattern: check before accessing
if "write_mode" in config:
    print(f"Write mode: {config['write_mode']}")
else:
    print("No write mode specified, using default: overwrite")

# Even better: use .get() with a default
write_mode = config.get("write_mode", "overwrite")
print(f"Write mode: {write_mode}")  # overwrite (the default)

True
False
No write mode specified, using default: overwrite
Write mode: overwrite


### Exercise 2.1: Node config builder

Build a dictionary called `node_config` with these keys and values:
- name: "sales_fact"
- source_format: "csv"
- write_mode: "upsert"
- keys: ["sale_id", "date"]  (a list)
- enabled: True

Then:
1. Print the value of `keys`
2. Change `source_format` to `"delta"`
3. Add a new key `row_count` with value `0`
4. Loop through the config and print each key-value pair
5. Use `.get()` to safely access a key called `"description"` with default `"No description"`

In [22]:
# Exercise 2.1
# YOUR CODE HERE

# Build the dictionary
node_config = {
    "name": "sales_fact",
    "source_format": "csv",
    "write_mode": "upsert",
    "keys": ["sale_id", "date"],
    "enabled": True,
}

# 1. Print the value of keys
print(node_config["keys"])

# 2. Change source_format to "delta"
node_config["source_format"] = "delta"

# 3. Add row_count
node_config["row_count"] = 0

# 4. Loop through and print each key-value pair
for name, value in node_config.items():
    print(f"Key: {name} | Value: {value}")

# 5. Safely get "description"
desc = node_config.get("description", "No description")
print(desc)

['sale_id', 'date']
Key: name | Value: sales_fact
Key: source_format | Value: delta
Key: write_mode | Value: upsert
Key: keys | Value: ['sale_id', 'date']
Key: enabled | Value: True
Key: row_count | Value: 0
No description


### Exercise 2.2: Pipeline summary

Given this nested pipeline dict, write code that:
1. Prints the number of nodes
2. Loops through each node and prints: `Node: {name} | Mode: {write_mode} | Keys: {keys}`
3. Finds and prints which nodes use "upsert" write mode

In [48]:
print(pipeline.keys())

dict_keys(['customers', 'orders', 'products', 'logs'])


In [56]:
# Exercise 2.2
# YOUR CODE HERE
pipeline = {
    "customers": {"write_mode": "upsert", "keys": ["customer_id"]},
    "orders": {"write_mode": "append", "keys": ["order_id"]},
    "products": {"write_mode": "upsert", "keys": ["product_id"]},
    "logs": {"write_mode": "append", "keys": ["log_id"]},
}

# 1. Print number of nodes
print(f"Pipeline has {len(pipeline)} nodes")

# 2. Print each node's details
for node, config in pipeline.items():
    print(f"Node: {node} | Mode: {config['write_mode']} | Keys: {config['keys']}")

# 3. Find nodes using upsert (exact match, joined with ", ")
upsert_nodes = [node for node, config in pipeline.items() if config["write_mode"] == "upsert"]
print(f"Nodes using upsert: {', '.join(upsert_nodes)}")

Pipeline has 4 nodes
Node: customers | Mode: upsert | Keys: ['customer_id']
Node: orders | Mode: append | Keys: ['order_id']
Node: products | Mode: upsert | Keys: ['product_id']
Node: logs | Mode: append | Keys: ['log_id']
Nodes using upsert: customers, products


**Expected output:**
```
Pipeline has 4 nodes

Node: customers | Mode: upsert | Keys: ['customer_id']
Node: orders | Mode: append | Keys: ['order_id']
Node: products | Mode: upsert | Keys: ['product_id']
Node: logs | Mode: append | Keys: ['log_id']

Nodes using upsert: customers, products
```

---
## Section 3: Sets

A **set** is an unordered collection of **unique** values. No duplicates allowed.

Sets are used when you need to:
- Remove duplicates from a list
- Check membership quickly (faster than lists)
- Find what is in one collection but not another

In Odibi, sets are used for things like finding unexpected parameters, checking which 
columns exist vs. which are expected, and deduplicating lists.

Look at this line from `odibi/registry.py`:
```python
unexpected = set(params.keys()) - set(func_params.keys())
```
This finds parameters that were passed but are not expected. That is a set operation.
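
Here is that line in action. The two dicts below are hypothetical stand-ins for illustration, not values taken from Odibi.

```python
# Hypothetical inputs: what the user passed vs. what the function accepts
params = {"column_name": "email", "default_value": "", "colum_name": "typo"}
func_params = {"column_name": None, "default_value": None}

unexpected = set(params.keys()) - set(func_params.keys())
print(unexpected)  # {'colum_name'}

# Dict key views support set operations directly, so this gives the same set
same_result = params.keys() - func_params.keys()
print(same_result == unexpected)  # True
```

The misspelled `colum_name` is caught because it appears in `params` but not in `func_params`.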

In [58]:
# Creating sets
engines = {"pandas", "spark", "polars"}  # Curly braces, but no key:value
print(engines)
print(type(engines))

# Duplicates are automatically removed
tags = {"bronze", "silver", "gold", "bronze", "silver"}
print(tags)  # Only 3 items, duplicates removed

# Create a set from a list (common way to deduplicate)
columns = ["id", "name", "email", "name", "id"]
unique_columns = set(columns)
print(unique_columns)  # Duplicates removed

# IMPORTANT: Empty set is set(), NOT {}
# {} creates an empty DICTIONARY
empty_set = set()
empty_dict = {}
print(type(empty_set))   # <class 'set'>
print(type(empty_dict))  # <class 'dict'>

{'spark', 'polars', 'pandas'}
<class 'set'>
{'bronze', 'silver', 'gold'}
{'name', 'email', 'id'}
<class 'set'>
<class 'dict'>


In [59]:
# Set operations -- this is what makes sets powerful
source_columns = {"id", "name", "email", "phone", "address"}
target_columns = {"id", "name", "email", "created_at"}

# Union: everything in EITHER set (OR)
all_columns = source_columns | target_columns
print(f"Union: {all_columns}")

# Intersection: only what is in BOTH sets (AND)
common = source_columns & target_columns
print(f"Common: {common}")

# Difference: what is in source but NOT in target
only_in_source = source_columns - target_columns
print(f"Only in source: {only_in_source}")

# What is in target but NOT in source
only_in_target = target_columns - source_columns
print(f"Only in target: {only_in_target}")

# Symmetric difference: in one OR the other, but NOT both
different = source_columns ^ target_columns
print(f"Different: {different}")

Union: {'address', 'phone', 'name', 'email', 'id', 'created_at'}
Common: {'name', 'email', 'id'}
Only in source: {'address', 'phone'}
Only in target: {'created_at'}
Different: {'address', 'phone', 'created_at'}
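
One caveat about deduplicating with `set()`, as in the `unique_columns` example earlier: a set does not keep the original order. If order matters, two common options are sketched below. The second relies on dicts preserving insertion order, which Python guarantees from version 3.7.

```python
columns = ["id", "name", "email", "name", "id"]

# Option 1: deduplicate, then sort alphabetically
alphabetical = sorted(set(columns))
print(alphabetical)  # ['email', 'id', 'name']

# Option 2: deduplicate while keeping first-seen order
unique_in_order = list(dict.fromkeys(columns))
print(unique_in_order)  # ['id', 'name', 'email']
```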


### Exercise 3.1: Column validation

You are building a validation check (like Odibi's validation engine). Given:
- `required_columns` -- columns that MUST exist in the data
- `actual_columns` -- columns that actually exist

Write code that:
1. Finds missing columns (required but not in actual)
2. Finds extra columns (in actual but not required)
3. Prints whether validation passed or failed

In [64]:
# Exercise 3.1
# YOUR CODE HERE
required_columns = {"customer_id", "name", "email", "created_at"}
actual_columns = {"customer_id", "name", "phone", "address", "email"}

# Find missing columns (required but not in actual)
missing_columns = required_columns - actual_columns
print(f"Missing columns: {missing_columns}")

# Find extra columns (in actual but not required)
extra_columns = actual_columns - required_columns
print(f"Extra columns: {extra_columns}")

# Print results: validation passes only when nothing required is missing
if missing_columns:
    print(f"Validation FAILED: missing required columns: {missing_columns}")
else:
    print("Validation PASSED")

Missing columns: {'created_at'}
Extra columns: {'address', 'phone'}
Validation FAILED: missing required columns: {'created_at'}


**Expected output:**
```
Missing columns: {'created_at'}
Extra columns: {'phone', 'address'}
Validation FAILED: missing required columns
```

---
## Section 4: Tuples

A **tuple** is like a list, but **immutable** -- once created, you cannot change it.

Use tuples when:
- You want to return multiple values from a function
- The data should not be modified (coordinates, database rows, config constants)
- You need a hashable key for a dictionary (lists cannot be dict keys, but tuples can)

### Creating tuples

In [65]:
# Creating tuples
position = (10, 20)              # Parentheses (but actually the comma makes it a tuple)
node_info = ("customers", 1542, True)
single = ("only_one",)           # Note the comma! Without it, this is just a string in parens

print(position)
print(type(position))

# Accessing (same as lists)
print(node_info[0])   # customers
print(node_info[1])   # 1542

# But you CANNOT modify
# node_info[0] = "orders"  # TypeError: 'tuple' object does not support item assignment

(10, 20)
<class 'tuple'>
customers
1542
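
The dictionary-key use case mentioned at the start of this section can be sketched as follows. The node names and dates are made-up sample data.

```python
# A tuple of (node, date) works as a dictionary key
row_counts = {
    ("customers", "2024-01-01"): 1542,
    ("customers", "2024-01-02"): 1610,
}
print(row_counts[("customers", "2024-01-02")])  # 1610

# A list in the same position is rejected, because lists are not hashable
list_key_rejected = False
try:
    bad = {["customers", "2024-01-01"]: 1542}
except TypeError:
    list_key_rejected = True
print(list_key_rejected)  # True
```

A tuple only works as a key if everything inside it is hashable too, so a tuple containing a list would fail the same way.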


### Tuple unpacking

This is extremely common in Python. When a function returns multiple values, 
it actually returns a tuple, and you **unpack** it into separate variables.

In [66]:
# Tuple unpacking
node_info = ("customers", 1542, True)

# Unpack into separate variables
name, count, active = node_info
print(f"Name: {name}, Count: {count}, Active: {active}")

# This is what happens when a function returns multiple values
def get_node_stats():
    return "customers", 1542, 3.45  # Actually returns a tuple

name, rows, duration = get_node_stats()
print(f"{name}: {rows} rows in {duration}s")

# Swapping variables (Python trick using tuple unpacking)
a = 1
b = 2
a, b = b, a  # Swap!
print(f"a={a}, b={b}")  # a=2, b=1

Name: customers, Count: 1542, Active: True
customers: 1542 rows in 3.45s
a=2, b=1
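
Unpacking also works directly in a `for` loop, which is what `for key, value in d.items()` has been doing all along. A short sketch with sample data, including what happens when the number of variables does not match:

```python
node_stats = [("customers", 1542, 3.45), ("orders", 8930, 12.8)]

# Each tuple is unpacked into three variables on every iteration
total_rows = 0
for name, rows, duration in node_stats:
    print(f"{name}: {rows} rows in {duration}s")
    total_rows += rows
print(total_rows)  # 10472

# Too few variables for a 3-item tuple raises ValueError
unpack_failed = False
try:
    name, rows = node_stats[0]
except ValueError:
    unpack_failed = True
print(unpack_failed)  # True
```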


---
## Section 5: Essential Built-in Functions

Python has several built-in functions that work with collections. These come up constantly 
in both real code and interviews.

### enumerate()

When you need both the index AND the value while looping, use `enumerate()`.

The un-Pythonic way (it works, but it is harder to read and common among beginners):
```python
for i in range(len(my_list)):  # Do NOT do this
    print(i, my_list[i])
```

The Pythonic way:
```python
for i, item in enumerate(my_list):  # Do THIS
    print(i, item)
```

In [67]:
# enumerate() - index AND value
nodes = ["customers", "orders", "products"]

for i, node in enumerate(nodes):
    print(f"Step {i + 1}: Processing {node}")

# You can start the count at a different number
for i, node in enumerate(nodes, start=1):
    print(f"Step {i}: Processing {node}")

Step 1: Processing customers
Step 2: Processing orders
Step 3: Processing products
Step 1: Processing customers
Step 2: Processing orders
Step 3: Processing products


### zip()

`zip()` combines two or more lists element by element, like a zipper.

In [68]:
# zip() - combine lists element by element
nodes = ["customers", "orders", "products"]
row_counts = [1542, 8930, 234]
statuses = ["SUCCESS", "SUCCESS", "FAILED"]

# Combine two lists
for node, count in zip(nodes, row_counts):
    print(f"{node}: {count:,} rows")

print("---")

# Combine three lists
for node, count, status in zip(nodes, row_counts, statuses):
    print(f"[{status}] {node}: {count:,} rows")

# USEFUL: Create a dict from two lists
node_counts = dict(zip(nodes, row_counts))
print(node_counts)  # {'customers': 1542, 'orders': 8930, 'products': 234}

customers: 1,542 rows
orders: 8,930 rows
products: 234 rows
---
[SUCCESS] customers: 1,542 rows
[SUCCESS] orders: 8,930 rows
[FAILED] products: 234 rows
{'customers': 1542, 'orders': 8930, 'products': 234}
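
One detail to keep in mind: `zip()` stops at the shortest input and silently drops the rest. The sketch below uses a deliberately short `row_counts` list to show this, plus a simple length check you could run first.

```python
nodes = ["customers", "orders", "products"]
row_counts = [1542, 8930]  # One value short

pairs = list(zip(nodes, row_counts))
print(pairs)  # [('customers', 1542), ('orders', 8930)]  -- 'products' is dropped

# A simple guard before zipping parallel lists
lengths_match = len(nodes) == len(row_counts)
print(lengths_match)  # False
```

On Python 3.10 and later, `zip(nodes, row_counts, strict=True)` raises a `ValueError` instead of dropping items.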


### sorted() with key=lambda

You already know `sorted()` returns a new sorted list. But what if you want to sort by 
something other than the default order?

The `key` parameter takes a function that extracts the value to sort by. 
`lambda` is a way to write a tiny function in one line.

```python
lambda x: x["name"]   # Same as: def get_name(x): return x["name"]
```

We will cover `lambda` in depth in Phase 5. For now, just understand the `sorted()` pattern.

In [70]:
# Sorting with key=lambda
nodes = [
    {"name": "customers", "rows": 1542},
    {"name": "orders", "rows": 8930},
    {"name": "products", "rows": 234},
]

# Sort by name (alphabetical)
by_name = sorted(nodes, key=lambda x: x["name"])
for n in by_name:
    print(f"  {n['name']}: {n['rows']}")

print("---")

# Sort by row count (descending)
by_rows = sorted(nodes, key=lambda x: x["rows"], reverse=True)
for n in by_rows:
    print(f"  {n['name']}: {n['rows']}")

  customers: 1542
  orders: 8930
  products: 234
---
  orders: 8930
  customers: 1542
  products: 234
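
The `key` argument does not have to be a `lambda`. Any function that takes one item and returns the value to sort by will do. The sketch below uses the built-in `len` and a named function, `get_rows`, which is a made-up helper for illustration.

```python
# Built-in function as key: sort names by length
names = ["products", "orders", "customers", "logs"]
by_length = sorted(names, key=len)
print(by_length)  # ['logs', 'orders', 'products', 'customers']

# Named function as key: same effect as lambda x: x["rows"]
def get_rows(node):
    return node["rows"]

records = [
    {"name": "customers", "rows": 1542},
    {"name": "orders", "rows": 8930},
    {"name": "products", "rows": 234},
]
by_rows = sorted(records, key=get_rows)
for r in by_rows:
    print(f"  {r['name']}: {r['rows']}")
```

The loop prints `products`, then `customers`, then `orders`, from smallest to largest row count.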


### any() and all()

These check conditions across a collection:

In [71]:
# any() - True if ANY item is True
# all() - True if ALL items are True

results = [True, True, False, True]
print(f"any: {any(results)}")  # True (at least one is True)
print(f"all: {all(results)}")  # False (not ALL are True)

# Practical use: check if any validation tests failed
statuses = ["PASS", "PASS", "FAIL", "PASS"]
has_failures = any(s == "FAIL" for s in statuses)
all_passed = all(s == "PASS" for s in statuses)
print(f"Has failures: {has_failures}")  # True
print(f"All passed: {all_passed}")      # False

any: True
all: False
Has failures: True
All passed: False
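
An edge case that surprises many people: on an empty collection, `any()` returns `False` and `all()` returns `True`. In a validation setting, that means "all tests passed" is true even when no tests ran. A sketch, with one possible guard:

```python
statuses = []  # No tests ran

has_failures = any(s == "FAIL" for s in statuses)
all_passed = all(s == "PASS" for s in statuses)
print(has_failures)  # False
print(all_passed)    # True (nothing contradicted the condition)

# Guard: require at least one result before reporting success
safe_all_passed = bool(statuses) and all(s == "PASS" for s in statuses)
print(safe_all_passed)  # False
```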


### Exercise 5.1: Build execution report

Using `zip()` and `enumerate()`, build a formatted execution report from these parallel lists:

In [86]:
# Exercise 5.1
# YOUR CODE HERE
nodes = ["customers", "orders", "products", "inventory"]
durations = [3.45, 12.8, 1.2, 7.65]
statuses = ["SUCCESS", "SUCCESS", "FAILED", "SUCCESS"]
row_counts = [1542, 8930, 234, 4521]

# Print a report like:
# Step 1: [SUCCESS] customers - 1,542 rows in 3.45s
# Step 2: [SUCCESS] orders - 8,930 rows in 12.80s
# ...
# 
# Then print:
# - Total rows processed
# - Whether all nodes succeeded (use all())
# - Whether any node failed (use any())
total_row_count = 0
for step, (node, status, row_count, duration) in enumerate(
    zip(nodes, statuses, row_counts, durations), start=1
):
    print(f"Step {step}: [{status}] {node} - {row_count:,} rows in {duration:.2f}s")
    total_row_count += row_count

print(f"Total rows processed: {total_row_count:,}")
all_succeeded = all(s == "SUCCESS" for s in statuses)
has_failed = any(s == "FAILED" for s in statuses)  # any(), not all(): one failure is enough

print(f"All nodes succeeded: {all_succeeded}")
print(f"Any node failed: {has_failed}")

Step 1: [SUCCESS] customers - 1,542 rows in 3.45s
Step 2: [SUCCESS] orders - 8,930 rows in 12.80s
Step 3: [FAILED] products - 234 rows in 1.20s
Step 4: [SUCCESS] inventory - 4,521 rows in 7.65s
Total rows processed: 15,227
All nodes succeeded: False
Any node failed: True


---
## Section 6: List Comprehensions

This is where Python becomes powerful. A **list comprehension** creates a new list by 
transforming or filtering an existing one -- in a single line.

Odibi uses over 280 comprehensions. Once you understand them, you will read Odibi code 
fluently.

### The basic pattern

```python
new_list = [expression for item in iterable]
```

This is equivalent to:
```python
new_list = []
for item in iterable:
    new_list.append(expression)
```

The comprehension version is shorter, usually a little faster, and more Pythonic.

In [87]:
# List comprehension - basic

# Long way (loop)
numbers = [1, 2, 3, 4, 5]
doubled = []
for n in numbers:
    doubled.append(n * 2)
print(f"Loop: {doubled}")

# Short way (comprehension)
doubled = [n * 2 for n in numbers]
print(f"Comprehension: {doubled}")

# They produce the same result, but the comprehension is one line

Loop: [2, 4, 6, 8, 10]
Comprehension: [2, 4, 6, 8, 10]


In [88]:
# Real Odibi examples

# From odibi/validation/engine.py:
# "Return only failures that are not empty"
failures = ["missing column", "", "null values found", "", "type mismatch"]
real_failures = [f for f in failures if f]  # Filter out empty strings
print(real_failures)

# From odibi/validation/engine.py:
# "Get only columns that exist in the dataframe"
test_columns = ["id", "name", "email", "phone"]
df_columns = ["id", "name", "created_at"]
valid_cols = [c for c in test_columns if c in df_columns]
print(valid_cols)  # Only columns that actually exist

# From odibi/transformers/delete_detection.py:
# "Lowercase all column names for comparison"
columns = ["Customer_ID", "First_Name", "EMAIL"]
lower_cols = [c.lower() for c in columns]
print(lower_cols)

['missing column', 'null values found', 'type mismatch']
['id', 'name']
['customer_id', 'first_name', 'email']


### Comprehension with condition (filtering)

```python
new_list = [expression for item in iterable if condition]
```

This only includes items where the condition is True.

In [89]:
# Comprehension with condition
numbers = [1, -2, 3, -4, 5, -6, 7]

# Only positive numbers
positives = [n for n in numbers if n > 0]
print(positives)  # [1, 3, 5, 7]

# Only even numbers
evens = [n for n in numbers if n % 2 == 0]
print(evens)  # [-2, -4, -6]

# Real Odibi pattern: filter nodes by status
nodes = [
    {"name": "customers", "success": True},
    {"name": "orders", "success": False},
    {"name": "products", "success": True},
]
failed = [n["name"] for n in nodes if not n["success"]]
print(f"Failed nodes: {failed}")  # ["orders"]

[1, 3, 5, 7]
[-2, -4, -6]
Failed nodes: ['orders']


### Comprehension with if/else (transform)

When you want to transform ALL items but differently based on a condition, 
put the if/else BEFORE the `for`:

```python
# if/else BEFORE for = transform every item
new_list = [x if condition else y for item in iterable]

# if AFTER for = filter items
new_list = [x for item in iterable if condition]
```

This distinction is important and commonly tested in interviews.

In [91]:
# if/else in comprehension (transform, not filter)
numbers = [1, -2, 3, -4, 5]

# Replace negatives with 0
cleaned = [n if n > 0 else 0 for n in numbers]
print(cleaned)  # [1, 0, 3, 0, 5]

# Label each number
labels = ["positive" if n > 0 else "negative" for n in numbers]
print(labels)

# Real Odibi pattern: status labels
results = [True, False, True, True, False]
labels = ["PASS" if r else "FAIL" for r in results]
print(labels)

[1, 0, 3, 0, 5]
['positive', 'negative', 'positive', 'negative', 'positive']
['PASS', 'FAIL', 'PASS', 'PASS', 'FAIL']
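
The two forms can be combined in one comprehension: the `if/else` before `for` transforms each item, and the `if` after `for` decides which items are kept. A small sketch with sample numbers:

```python
row_counts = [1542, -1, 0, 8930]

# Filter: skip negative counts (if AFTER for)
# Transform: label what remains (if/else BEFORE for)
sizes = ["big" if n > 1000 else "small" for n in row_counts if n >= 0]
print(sizes)  # ['big', 'small', 'big']
```

The filter runs first, so `-1` never reaches the `if/else` expression.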


### Exercise 6.1: Comprehension practice

Write list comprehensions for each task (one line each):

1. Given `numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]`, create a list of squares: `[1, 4, 9, 16, ...]`
2. Given the same list, create a list of only even numbers
3. Given `names = ["  customers ", "ORDERS", " Products "]`, create a list of cleaned names (strip + lower)
4. Given `columns = ["id", "name", "_temp", "email", "_internal"]`, create a list excluding columns that start with `_`
5. Given `counts = [1542, 0, 8930, 0, 234]`, create a list where 0s are replaced with `"EMPTY"` and non-zeros stay as numbers

In [104]:
# Exercise 6.1
# YOUR CODE HERE

# 1. Squares
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
squares = [x**2 for x in numbers]
print(squares)

# 2. Even numbers only
even = [x for x in numbers if x % 2 == 0]
print(even)
# 3. Clean names
names = ["  customers ", "ORDERS", " Products "]

clean_names = [x.strip().lower() for x in names]
print(clean_names)
# 4. Exclude columns starting with _
columns = ["id", "name", "_temp", "email", "_internal"]
excluded = [x for x in columns if not x.startswith("_")]
print(excluded)

# 5. Replace 0s with "EMPTY"
counts = [1542, 0, 8930, 0, 234]
replaced = ["EMPTY" if x == 0 else x for x in counts]
print(replaced)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
[2, 4, 6, 8, 10]
['customers', 'orders', 'products']
['id', 'name', 'email']
[1542, 'EMPTY', 8930, 'EMPTY', 234]


**Expected output:**
```
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
[2, 4, 6, 8, 10]
['customers', 'orders', 'products']
['id', 'name', 'email']
[1542, 'EMPTY', 8930, 'EMPTY', 234]
```

---
## Section 7: Dictionary Comprehensions

Same idea as list comprehensions, but creates dictionaries:

```python
new_dict = {key_expr: value_expr for item in iterable}
```

In [105]:
# Dictionary comprehension
nodes = ["customers", "orders", "products"]
row_counts = [1542, 8930, 234]

# Create a dict from two lists
node_counts = {name: count for name, count in zip(nodes, row_counts)}
print(node_counts)

# With a condition: only nodes with > 1000 rows
big_nodes = {name: count for name, count in zip(nodes, row_counts) if count > 1000}
print(big_nodes)

# Transform values
node_status = {name: "big" if count > 1000 else "small" for name, count in zip(nodes, row_counts)}
print(node_status)

# From odibi/utils/config_loader.py:
# Recursive substitution on dict values
data = {"name": "  customers  ", "format": "  delta  ", "mode": "  upsert  "}
cleaned = {k: v.strip() for k, v in data.items()}
print(cleaned)

{'customers': 1542, 'orders': 8930, 'products': 234}
{'customers': 1542, 'orders': 8930}
{'customers': 'big', 'orders': 'big', 'products': 'small'}
{'name': 'customers', 'format': 'delta', 'mode': 'upsert'}
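
One thing to watch with `zip`: it stops at the shortest input, so a missing value silently drops a key. A short sketch with a deliberately shortened `row_counts` list (sample data, not from Odibi):

```python
nodes = ["customers", "orders", "products"]
row_counts = [1542, 8930]  # one count missing on purpose

# zip stops after two pairs, so "products" never becomes a key
node_counts = {name: count for name, count in zip(nodes, row_counts)}
print(node_counts)

# With no transform or filter, dict(zip(...)) builds the same dict
print(dict(zip(nodes, row_counts)) == node_counts)
```

On Python 3.10+, `zip(nodes, row_counts, strict=True)` raises a `ValueError` instead of truncating.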


### Real Odibi dict comprehensions

These are actual patterns from the Odibi codebase:

In [None]:
# From odibi/registry.py line 100:
# Filter function parameters, excluding 'context' and 'current'
import inspect

def example_transform(context, current, column_name, default_value=None):
    """Example transform function."""
    pass

sig = inspect.signature(example_transform)
func_params = {k: v for k, v in sig.parameters.items() if k not in ["context", "current"]}
print(f"User-facing params: {list(func_params.keys())}")
# Prints: User-facing params: ['column_name', 'default_value']
# This is exactly how Odibi's FunctionRegistry.validate_params() works
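
The values in `sig.parameters` are `inspect.Parameter` objects, so the same comprehension style can separate required parameters from optional ones. This is a sketch that extends the example above; it is not taken from the Odibi codebase, and `example_transform` is the same made-up function.

```python
import inspect

def example_transform(context, current, column_name, default_value=None):
    """Hypothetical transform, same shape as the example above."""

params = inspect.signature(example_transform).parameters
user_params = {k: v for k, v in params.items() if k not in ("context", "current")}

# A parameter without a default reports inspect.Parameter.empty
required = [k for k, v in user_params.items() if v.default is inspect.Parameter.empty]
optional = {k: v.default for k, v in user_params.items() if v.default is not inspect.Parameter.empty}

print(required)
print(optional)
```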

### Exercise 7.1: Dict comprehensions

1. Given a list of column names, create a dict mapping each name to its length
2. Given a dict of node->row_count, create a new dict with only nodes that have > 0 rows
3. Given a dict of node->row_count, create a new dict with values formatted as strings with commas

In [112]:
# Exercise 7.1
# YOUR CODE HERE

# 1. Column name -> length
columns = ["customer_id", "name", "email", "created_at"]

dict_1 = {x: len(x) for x in columns}
print(dict_1)
# 2. Only nodes with rows > 0
node_counts = {"customers": 1542, "temp": 0, "orders": 8930, "empty": 0}
dict_2 = {x: y for x, y in node_counts.items() if y > 0}
print(dict_2)
# 3. Format values with commas
node_counts = {"customers": 1542, "orders": 8930, "products": 234}
dict_3 = {x: f'{y:,}' for x, y in node_counts.items()}
print(dict_3)

{'customer_id': 11, 'name': 4, 'email': 5, 'created_at': 10}
{'customers': 1542, 'orders': 8930}
{'customers': '1,542', 'orders': '8,930', 'products': '234'}


**Expected output:**
```
{'customer_id': 11, 'name': 4, 'email': 5, 'created_at': 10}
{'customers': 1542, 'orders': 8930}
{'customers': '1,542', 'orders': '8,930', 'products': '234'}
```

---
## Section 8: Putting It All Together

Now let us combine everything you have learned. These exercises reflect real patterns 
you will encounter in Odibi and in interviews.

### Nested data structures

In the real world, data is nested. A pipeline config has nodes, each node has transforms, 
each transform has parameters. You need to navigate these structures confidently.

In [114]:
# Working with nested data (like Odibi pipeline configs)
pipeline = {
    "name": "sales_pipeline",
    "engine": "pandas",
    "nodes": [
        {
            "name": "customers",
            "source": "raw_customers.csv",
            "transforms": [
                {"type": "rename_columns", "params": {"old": "cust_id", "new": "customer_id"}},
                {"type": "cast_types", "params": {"columns": {"created_at": "datetime"}}},
            ],
            "write_mode": "upsert",
        },
        {
            "name": "orders",
            "source": "raw_orders.csv",
            "transforms": [
                {"type": "filter_rows", "params": {"condition": "amount > 0"}},
            ],
            "write_mode": "append",
        },
    ],
}

# Navigate the nested structure
print(f"Pipeline: {pipeline['name']}")
print(f"Engine: {pipeline['engine']}")
print(f"Number of nodes: {len(pipeline['nodes'])}")

# Loop through nodes
for node in pipeline["nodes"]:
    print(f"\n  Node: {node['name']}")
    print(f"  Source: {node['source']}")
    print(f"  Write mode: {node['write_mode']}")
    print(f"  Transforms: {len(node['transforms'])}")
    for t in node["transforms"]:
        print(f"    - {t['type']}: {t['params']}")

Pipeline: sales_pipeline
Engine: pandas
Number of nodes: 2

  Node: customers
  Source: raw_customers.csv
  Write mode: upsert
  Transforms: 2
    - rename_columns: {'old': 'cust_id', 'new': 'customer_id'}
    - cast_types: {'columns': {'created_at': 'datetime'}}

  Node: orders
  Source: raw_orders.csv
  Write mode: append
  Transforms: 1
    - filter_rows: {'condition': 'amount > 0'}
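
Square-bracket access raises `KeyError` when a key is missing, which is common with optional config fields. Here is a minimal sketch, assuming a hypothetical node that omits `transforms` and `write_mode` (the fallback values are assumptions for illustration, not Odibi defaults):

```python
node = {"name": "products", "source": "raw_products.csv"}  # hypothetical node

# .get() returns the fallback instead of raising KeyError
transforms = node.get("transforms", [])
write_mode = node.get("write_mode", "overwrite")  # assumed fallback value

print(len(transforms))
print(write_mode)

# Guard the index first, then use .get() for the nested key
first_params = (transforms[0] if transforms else {}).get("params", {})
print(first_params)
```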


### Exercise 8.1: Pipeline analyzer

Using the `pipeline` dict above, write code that:
1. Extracts a list of all node names using a list comprehension
2. Extracts a list of all transform types across ALL nodes (flatten the nested lists)
3. Creates a dict mapping node name -> number of transforms
4. Finds which nodes use "upsert" write mode

In [156]:
# Exercise 8.1
# YOUR CODE HERE
# (Use the pipeline dict defined in the cell above)
# 1. List of all node names

node_names = [x['name'] for x in pipeline["nodes"]]
print(f'Node names: {node_names}')

# 2. List of all transform types (across all nodes)
transform_types = [y['type'] for x in pipeline['nodes'] for y in x['transforms']]
print(f'All transforms: {transform_types}')

# 3. Dict: node name -> number of transforms
dict_4 = {x['name']: len(x['transforms']) for x in pipeline['nodes']}
print(f'Transform counts: {dict_4}')

# 4. Nodes using upsert
upserts = [x['name'] for x in pipeline["nodes"] if x['write_mode'] == 'upsert']
print(f'Upsert nodes: {upserts}')

Node names: ['customers', 'orders']
All transforms: ['rename_columns', 'cast_types', 'filter_rows']
Transform counts: {'customers': 2, 'orders': 1}
Upsert nodes: ['customers']


**Expected output:**
```
Node names: ['customers', 'orders']
All transforms: ['rename_columns', 'cast_types', 'filter_rows']
Transform counts: {'customers': 2, 'orders': 1}
Upsert nodes: ['customers']
```
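
Item 2 needs two `for` clauses in one comprehension. They read left to right in the same order as nested loops: outer loop first, inner loop second. The sketch below uses a trimmed-down stand-in for the pipeline dict so it runs on its own.

```python
# Trimmed-down stand-in for the pipeline dict above
pipeline = {
    "nodes": [
        {"name": "customers", "transforms": [{"type": "rename_columns"}, {"type": "cast_types"}]},
        {"name": "orders", "transforms": [{"type": "filter_rows"}]},
    ]
}

# Comprehension: outer `for` first, inner `for` second
flat = [t["type"] for node in pipeline["nodes"] for t in node["transforms"]]

# The equivalent nested loops
flat_loop = []
for node in pipeline["nodes"]:
    for t in node["transforms"]:
        flat_loop.append(t["type"])

print(flat)
print(flat == flat_loop)
```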

### Exercise 8.2: Data grouping (interview classic)

Given a list of records, group them by a key. This is the manual version of what 
Pandas `groupby()` does, and it is a very common interview question.

Group these orders by customer_id:

In [178]:
# Exercise 8.2
# YOUR CODE HERE
orders = [
    {"order_id": 1, "customer_id": "C001", "amount": 150.00},
    {"order_id": 2, "customer_id": "C002", "amount": 200.00},
    {"order_id": 3, "customer_id": "C001", "amount": 75.00},
    {"order_id": 4, "customer_id": "C003", "amount": 300.00},
    {"order_id": 5, "customer_id": "C001", "amount": 50.00},
    {"order_id": 6, "customer_id": "C002", "amount": 125.00},
]

# Group by customer_id
# Result should be a dict where:
#   key = customer_id
#   value = list of orders for that customer
#
# Hint: Start with an empty dict. Loop through orders.
# For each order, check if the customer_id is already a key.
# If not, create it with an empty list. Then append the order.

grouped = {}

for order in orders:
    customer_id = order["customer_id"]
    if customer_id not in grouped:
        grouped[customer_id] = []
    grouped[customer_id].append(order)

# Print the result
for customer_id, customer_orders in grouped.items():
    total = sum(o["amount"] for o in customer_orders)
    print(f"{customer_id}: {len(customer_orders)} orders, total ${total:.2f}")

C001: 3 orders, total $275.00
C002: 2 orders, total $325.00
C003: 1 orders, total $300.00


**Expected output:**
```
C001: 3 orders, total $275.00
C002: 2 orders, total $325.00
C003: 1 orders, total $300.00
```

Bonus: In Phase 3, you will learn about `collections.defaultdict` which makes this pattern 
even easier. For now, the manual approach teaches you the logic.
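
As a short preview, here is roughly what the `defaultdict` version looks like. The `orders` list is a shortened copy of the one in the exercise.

```python
from collections import defaultdict

orders = [
    {"order_id": 1, "customer_id": "C001", "amount": 150.00},
    {"order_id": 2, "customer_id": "C002", "amount": 200.00},
    {"order_id": 3, "customer_id": "C001", "amount": 75.00},
]

# defaultdict(list) creates the empty list the first time a key is used,
# so the "is this key already there?" check disappears
grouped = defaultdict(list)
for order in orders:
    grouped[order["customer_id"]].append(order)

order_counts = {cid: len(items) for cid, items in grouped.items()}
print(order_counts)
```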

---
## Section 9: Interview Drill

These are common Python interview questions about data structures. 
Write your answers in the code cells.

### Drill 1: Remove duplicates from a list while preserving order

Input: `[1, 3, 2, 3, 1, 4, 2, 5]`
Output: `[1, 3, 2, 4, 5]`

Note: `set()` removes duplicates but does NOT preserve order. You need a different approach.

In [194]:
# Drill 1
# YOUR CODE HERE
items = [1, 3, 2, 3, 1, 4, 2, 5]
# Hint: Use a set to track what you've seen, and a list to build the result
seen = set()
result = []
for item in items:
    if item not in seen:
        seen.add(item)
        result.append(item)
print(result)

result2 = list(dict.fromkeys(items))
result2

[1, 3, 2, 4, 5]


[1, 3, 2, 4, 5]
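
The `dict.fromkeys` one-liner works for two reasons: dict keys are unique, and dicts keep insertion order (guaranteed since Python 3.7). A small sketch with made-up column names shows the intermediate step:

```python
columns = ["id", "name", "id", "email", "name"]  # hypothetical input

# Each item becomes a key with the value None; repeated keys are ignored
as_keys = dict.fromkeys(columns)
print(as_keys)

# Iterating a dict yields its keys in insertion order
unique = list(as_keys)
print(unique)
```

Both approaches require the items to be hashable, so neither works on a list of lists or a list of dicts.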

### Drill 2: Flatten a list of lists

Input: `[[1, 2], [3, 4], [5, 6]]`
Output: `[1, 2, 3, 4, 5, 6]`

Do it with a comprehension (one line).

In [197]:
# Drill 2
# YOUR CODE HERE
nested = [[1, 2], [3, 4], [5, 6]]

# Hint: [item for sublist in nested for item in sublist]
new_list = [item for sublist in nested for item in sublist]
print(new_list)

[1, 2, 3, 4, 5, 6]
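
An interviewer may ask for an alternative. The standard library offers `itertools.chain.from_iterable`, sketched below. Like the comprehension, it flattens exactly one level.

```python
from itertools import chain

nested = [[1, 2], [3, 4], [5, 6]]
flat = list(chain.from_iterable(nested))
print(flat)

# Only one level is removed: the inner [2, 3] stays a list
deeper = [[1, [2, 3]], [4]]
one_level = list(chain.from_iterable(deeper))
print(one_level)
```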


### Drill 3: Count occurrences

Given a list of words, create a dictionary counting how many times each word appears.

Input: `["apple", "banana", "apple", "cherry", "banana", "apple"]`
Output: `{"apple": 3, "banana": 2, "cherry": 1}`

In [199]:
# Drill 3
# YOUR CODE HERE
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Hint: Start with empty dict. For each word, use .get(word, 0) + 1
words_dict = dict()

for word in words:
    words_dict[word] = words_dict.get(word, 0) + 1
print(words_dict)

{'apple': 3, 'banana': 2, 'cherry': 1}
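
Once you can write the manual version, it is worth knowing that `collections.Counter` does the same job. A brief sketch on the same word list:

```python
from collections import Counter

words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
counts = Counter(words)

print(counts["apple"])
print(counts["mango"])         # missing keys count as 0 instead of raising KeyError
print(counts.most_common(2))   # the two most frequent words, highest first
```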


### Drill 4: Invert a dictionary

Swap keys and values.

Input: `{"a": 1, "b": 2, "c": 3}`
Output: `{1: "a", 2: "b", 3: "c"}`

Do it with a dict comprehension.

In [202]:
# Drill 4
# YOUR CODE HERE
original = {"a": 1, "b": 2, "c": 3}

inverted = {y: x for x, y in original.items()}
print(inverted)

{1: 'a', 2: 'b', 3: 'c'}
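
A common follow-up question: what happens when two keys share a value? Keys must be unique, so a plain inversion keeps only the last one. The sketch below uses a made-up `formats` dict to show the data loss and one way to avoid it.

```python
formats = {"customers": "delta", "orders": "csv", "products": "delta"}  # hypothetical

# Plain inversion: "products" overwrites "customers" under the key "delta"
inverted = {v: k for k, v in formats.items()}
print(inverted)

# Grouping into lists keeps every original key
by_format = {}
for name, fmt in formats.items():
    by_format.setdefault(fmt, []).append(name)
print(by_format)
```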


### Drill 5: Two-sum problem (simplified)

Given a list of numbers and a target, find two numbers that add up to the target. 
Return them as a tuple.

Input: `nums = [2, 7, 11, 15]`, `target = 9`
Output: `(2, 7)`

This is one of the most famous interview questions. There is a solution using a set/dict 
that runs in O(n) time, but a simple nested loop solution is fine for now.

In [215]:
# Drill 5
# YOUR CODE HERE
nums = [2, 7, 11, 15]
target = 9

# Simple approach: try every pair, O(n^2)
pair = None
for i in range(len(nums)):
    for j in range(i + 1, len(nums)):
        if nums[i] + nums[j] == target:
            pair = (nums[i], nums[j])
            break
    if pair is not None:
        break  # the inner break only exits the inner loop
print(pair)

# Better approach: remember numbers already seen, O(n)
seen = set()
for x in nums:
    complement = target - x
    if complement in seen:
        print((complement, x))
        break
    seen.add(x)

(2, 7)
(2, 7)


### Drill 6: Merge two dictionaries

Given two dicts, create a new dict with all keys. If a key exists in both, 
use the value from the second dict (it overrides).

Do it THREE ways:
1. Using a loop
2. Using `{**dict1, **dict2}` (unpacking)
3. Using `dict1 | dict2` (Python 3.9+)

In [231]:
# Drill 6
# YOUR CODE HERE
defaults = {"engine": "pandas", "format": "csv", "write_mode": "overwrite"}
overrides = {"format": "delta", "write_mode": "upsert", "keys": ["id"]}

# Method 1: Loop
new_dict_1 = {}
for d in (defaults, overrides):
    for k, v in d.items():
        new_dict_1[k] = v

print(new_dict_1)
# Method 2: Unpacking
new_dict_2 = {**defaults, **overrides}
print(new_dict_2)
# Method 3: Union operator (Python 3.9+)
new_dict_3 = defaults | overrides
print(new_dict_3)

{'engine': 'pandas', 'format': 'delta', 'write_mode': 'upsert', 'keys': ['id']}
{'engine': 'pandas', 'format': 'delta', 'write_mode': 'upsert', 'keys': ['id']}
{'engine': 'pandas', 'format': 'delta', 'write_mode': 'upsert', 'keys': ['id']}


**Expected output (all three should produce):**
```
{'engine': 'pandas', 'format': 'delta', 'write_mode': 'upsert', 'keys': ['id']}
```
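
Two details about these merges often come up in interviews: they are shallow, and they do not modify the inputs. A short sketch with a hypothetical nested `options` key:

```python
defaults = {"engine": "pandas", "options": {"header": True}}  # hypothetical config
overrides = {"options": {"sep": ";"}}

merged = {**defaults, **overrides}

# Shallow merge: the nested dict is replaced as a whole, not combined
print(merged["options"])

# The inputs are left untouched
print(defaults["options"])

# dict.update() performs the same merge in place, here on a copy
in_place = dict(defaults)
in_place.update(overrides)
print(in_place == merged)
```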

---
## Checkpoint

You now have a solid foundation in Python's data structures. Here is what you learned:

- **Lists** -- ordered, mutable, indexed collections
- **Dictionaries** -- key-value pairs, the backbone of Python programs
- **Sets** -- unique values, set operations (union, intersection, difference)
- **Tuples** -- immutable sequences, unpacking, multiple return values
- **Built-in functions** -- `enumerate`, `zip`, `sorted` with `key`, `any`, `all`
- **List comprehensions** -- transforming and filtering in one line
- **Dict comprehensions** -- building dicts from iterables

These are not just academic concepts. Every piece of Odibi code uses these patterns.
When you read `odibi/validation/engine.py` and see:
```python
cols = [c for c in test.columns if c in df.columns]
```
You now understand exactly what that does -- filter test columns to only those present in the DataFrame.
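
To try that line yourself without a DataFrame, plain lists can stand in for `test.columns` and `df.columns`. The column names below are made up for illustration.

```python
test_columns = ["customer_id", "email", "phone"]  # stand-in for test.columns
df_columns = ["customer_id", "name", "email"]     # stand-in for df.columns

# Keep only the test columns that actually exist in the data
cols = [c for c in test_columns if c in df_columns]
print(cols)

# Flipping the condition reports the columns that are missing
missing = [c for c in test_columns if c not in df_columns]
print(missing)
```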

**Next:** Notebook 03 -- Standard Library Deep Dive (os, pathlib, json, re, logging, datetime, 
hashlib, uuid, collections, dataclasses, enum, contextlib, functools).