# Phase 1: Python Basics
## Building blocks of everything in Odibi

Welcome. This notebook is the foundation for everything that follows. By the end of it, you will be able to:

- Create and use variables of every basic type
- Work with strings confidently (f-strings, methods, formatting)
- Write functions that take inputs and return outputs
- Use if/elif/else to control what your code does
- Write loops that repeat work
- Handle errors without your program crashing

Every example in this notebook uses real Odibi concepts. You are not learning Python in a vacuum --
you are learning it through the framework you built.

**Rules:**
1. Type every line of code yourself. Do not copy-paste.
2. Run every cell. Reading is not enough.
3. If you get an error, read the error message carefully before asking for help.
4. Complete every exercise before moving to the next section.

---
## Section 1: What is Python and How Does It Work?

Python is a **programming language**. That means it is a set of rules for writing instructions that a computer can follow.

Here is the most important thing to understand: **Python runs your code one statement at a time, from top to bottom.** Python is usually described as an "interpreted" language. Some languages (like C) must be compiled -- translated into machine code -- in a separate step before they run. With Python there is no separate build step. You write code, you run it, you see the result immediately.

When you write a `.py` file (like the files in Odibi), Python reads line 1, executes it, then line 2, executes it, and so on.

In this Jupyter notebook, you work with **cells**. Each cell is a chunk of code or text. When you run a code cell (Shift+Enter), Python executes just that cell. This lets you experiment piece by piece instead of running an entire file.

Odibi is written in Python because:
- Python is the standard language for data engineering
- Libraries like Pandas, PySpark, and Polars all provide Python APIs
- It is readable and practical -- you can build real systems quickly

Let us start. Run the cell below.

In [2]:
# This is your first Python code. Run this cell with Shift+Enter.
# The line below tells Python to display text on the screen.
print("Hello from Odibi!")

Hello from Odibi!


If you see `Hello from Odibi!` printed below the cell, Python is working. That `print()` is a **function** -- 
it takes whatever you put inside the parentheses and displays it. We will cover functions in depth later.

The lines starting with `#` are **comments**. Python ignores them completely. They are notes for humans reading the code.

---
## Section 2: Variables and Types

### What is a variable?

A variable is a **name that points to a value**. Think of it like a label on a box. The label is the name, the box holds the value.

```python
node_name = "customers"
```

This line says: "Create a label called `node_name` and stick it on a box that contains the text `customers`."

The `=` sign in Python does NOT mean "equals" like in math. It means **"assign"** -- take the value on the right and give it the name on the left.
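
A minimal sketch of what "assign" means in practice, using made-up values (`retry_count` is just an illustrative name): assigning again moves the label to a new value, and the right-hand side is always evaluated first.

```python
# Assignment binds the name on the left to the value on the right.
node_name = "customers"
print(node_name)      # customers

# Assigning again moves the label to a new value; the old text is not edited.
node_name = "orders"
print(node_name)      # orders

# The right side is evaluated first, so a name can appear on both sides.
retry_count = 0
retry_count = retry_count + 1
print(retry_count)    # 1
```

This is why `retry_count = retry_count + 1` makes sense in Python even though it would be nonsense as a math equation.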

### The basic types

Every value in Python has a **type**. The type tells Python what kind of data it is and what you can do with it.

| Type | What it is | Example | Used in Odibi for |
|------|-----------|---------|------------------|
| `int` | Whole number | `42` | Row counts, retry attempts |
| `float` | Decimal number | `97.5` | Pass rates, durations |
| `str` | Text (string) | `"customers"` | Node names, paths, SQL |
| `bool` | True or False | `True` | Flags, conditions |
| `None` | Nothing/empty | `None` | Missing values, defaults |

Let us see each one.

In [3]:
# Integer (int) - whole numbers, no decimal point
row_count = 1542
print(row_count)
print(type(row_count))  # type() tells you what type a value is

1542
<class 'int'>


In [4]:
# Float - numbers with a decimal point
pass_rate = 97.36
print(pass_rate)
print(type(pass_rate))

97.36
<class 'float'>


In [5]:
# String (str) - text, always wrapped in quotes
node_name = "customers"
print(node_name)
print(type(node_name))

# Single quotes work too - Python does not care which you use
engine_type = 'pandas'
print(engine_type)
print(type(engine_type))

customers
<class 'str'>
pandas
<class 'str'>


In [6]:
# Boolean (bool) - only two possible values: True or False
# IMPORTANT: Capital T and F. Lowercase true (without quotes) raises a NameError.
is_production = False
is_active = True
print(is_production)
print(type(is_active))

False
<class 'bool'>


In [7]:
# None - means "nothing" or "no value"
# This is NOT the same as 0 or "" (empty string). It means the value does not exist.
last_error = None
print(last_error)
print(type(last_error))

None
<class 'NoneType'>


### Type conversion

Sometimes you need to change a value from one type to another. Python gives you functions for this:

- `int()` -- converts to integer
- `float()` -- converts to float
- `str()` -- converts to string
- `bool()` -- converts to boolean

This comes up constantly. For example, when you read a number from a config file, it might come in as a string `"100"` and you need to convert it to an actual number `100`.
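
As a rough sketch of the config-file situation described above (the variable names and values here are hypothetical, not taken from Odibi), the conversions might look like this:

```python
# Values read from a config file often arrive as text.
batch_size_text = "100"
threshold_text = "97.5"

batch_size = int(batch_size_text)        # "100" -> 100
threshold = float(threshold_text)        # "97.5" -> 97.5
print(batch_size + 1, threshold + 0.5)   # 101 98.0

# int() on a float drops the decimal part; it does not round.
print(int(3.9))       # 3

# bool() on a non-empty string is True, whatever the text says.
print(bool("no"))     # True
```

The last line is a common surprise: `bool()` does not read the text, it only checks whether the string is empty.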

In [8]:
# Converting between types
count_as_string = "1542"
print(type(count_as_string))  # It is a string right now

count_as_number = int(count_as_string)  # Convert string to int
print(type(count_as_number))  # Now it is an int
print(count_as_number + 100)  # Now we can do math with it

# Converting number to string (useful for building messages)
message = "Processed " + str(row_count) + " rows"
print(message)

<class 'str'>
<class 'int'>
1642
Processed 1542 rows


### Exercise 2.1: Create Odibi variables

Create the following variables with appropriate types:
- `pipeline_name` -- a string, set it to `"sales_pipeline"`
- `node_count` -- an integer, set it to `5`
- `success_rate` -- a float, set it to `98.7`
- `is_production` -- a boolean, set it to `False`
- `last_error` -- None (no error has occurred)

Then print each one with its type.

In [10]:
# Exercise 2.1 - Create your variables below
# YOUR CODE HERE
pipeline_name = 'sales_pipeline'
print(pipeline_name, type(pipeline_name))
node_count = 5
print(node_count, type(node_count))
success_rate = 98.7
print(success_rate, type(success_rate))
is_production = False
print(is_production, type(is_production))
last_error = None
print(last_error, type(last_error))




# Print each variable and its type. Example for the first one:
# print(pipeline_name, type(pipeline_name))

sales_pipeline <class 'str'>
5 <class 'int'>
98.7 <class 'float'>
False <class 'bool'>
None <class 'NoneType'>


**Expected output:**
```
sales_pipeline <class 'str'>
5 <class 'int'>
98.7 <class 'float'>
False <class 'bool'>
None <class 'NoneType'>
```

### Exercise 2.2: Type conversion

You have `node_count = 5`. Convert it to a string and store it in a new variable called `node_count_str`.
Then create a message: `"This pipeline has 5 nodes"` using string concatenation (`+`).

Then try this: what happens if you do `int("hello")`? Run it and read the error message.

In [12]:
# Exercise 2.2
# YOUR CODE HERE
node_count = 5

# Convert node_count to string
node_count_str = str(node_count)

# Build the message using + (string concatenation)
message = "This pipeline has " + node_count_str + " nodes"
print(message)
# Now try: int("hello") - what error do you get?
int('hello')

This pipeline has 5 nodes


ValueError: invalid literal for int() with base 10: 'hello'

**Expected output for the message:** `This pipeline has 5 nodes`

**Expected error for `int("hello")`:** `ValueError: invalid literal for int() with base 10: 'hello'`

This is your first real Python error. Read it carefully:
- `ValueError` -- the TYPE of error (the value you gave was wrong)
- `invalid literal for int()` -- Python cannot convert this text to a number
- `'hello'` -- the actual value that caused the problem

Error messages are your best friend. They tell you exactly what went wrong and where.
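
Handling errors is one of this notebook's goals, so here is a small preview, offered as a sketch and not as the way Odibi does it. It assumes a hypothetical `raw_value` read from somewhere untrusted, and uses `try`/`except` to catch the `ValueError` instead of letting the program stop.

```python
raw_value = "hello"   # hypothetical input that should have been a number

try:
    row_limit = int(raw_value)
except ValueError as error:
    # The error object carries the same message you saw in the traceback.
    print(f"Could not convert {raw_value!r}: {error}")
    row_limit = None

print(row_limit)   # None
```

The program keeps running after the failed conversion, and `row_limit` is set to `None` to signal that no valid value exists.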

---
## Section 3: Strings -- Working with Text

Strings are the most common type you will work with in Odibi. Node names, file paths, SQL queries, log messages, error messages -- they are all strings. If you get comfortable with strings, you can handle most of Odibi.

### String methods

A "method" is a function that belongs to a specific type. Strings have many built-in methods. You call them with a dot:

```python
name = "customers"
name.upper()  # Returns "CUSTOMERS"
```

IMPORTANT: Methods do NOT change the original string. They create a new one. Strings in Python are **immutable** (cannot be changed). This is a common interview question.
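
A short sketch of what immutability means when you call a string method (the variable names are illustrative): the result is lost unless you store it.

```python
layer = "bronze"
layer.upper()                 # the result is created and then thrown away
print(layer)                  # bronze

layer_upper = layer.upper()   # keep the new string under a new name
print(layer_upper)            # BRONZE

layer = layer.upper()         # or point the old name at the new string
print(layer)                  # BRONZE
```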

In [13]:
# String methods - run each line and observe the output
name = "  Bronze_Customers  "

print(name.upper())           # All uppercase
print(name.lower())           # All lowercase
print(name.strip())           # Remove whitespace from both ends
print(name.replace("_", " ")) # Replace underscores with spaces
print(name.startswith("  B")) # Does it start with "  B"?
print(name.endswith("  "))    # Does it end with spaces?

# IMPORTANT: The original string is unchanged!
print("Original:", name)

  BRONZE_CUSTOMERS  
  bronze_customers  
Bronze_Customers
  Bronze Customers  
True
True
Original:   Bronze_Customers  


In [14]:
# .split() - breaks a string into a list at a separator
# This is used CONSTANTLY in Odibi for parsing paths and configs
table_path = "bronze.customers.daily"
parts = table_path.split(".")  # Split at every dot
print(parts)         # ["bronze", "customers", "daily"]
print(parts[0])      # "bronze" - first element (index 0)
print(parts[1])      # "customers" - second element (index 1)
print(parts[2])      # "daily" - third element (index 2)

# IMPORTANT: Indexing starts at 0 in Python, not 1.
# This trips up beginners. The first item is always index 0.

['bronze', 'customers', 'daily']
bronze
customers
daily


In [23]:
print(parts)
for part in parts:
    print(part)

['bronze', 'customers', 'daily']
bronze
customers
daily


In [24]:
# .join() - the opposite of split. Glues a list into a string.
columns = ["id", "name", "email", "created_at"]
select_clause = ", ".join(columns)
print(select_clause)  # "id, name, email, created_at"

# Read it as: "Take the string ', ' and put it BETWEEN each item in the list"
# This is how Odibi builds SQL column lists

id, name, email, created_at


In [26]:
select_clause_2 = "_".join(columns)
print(select_clause_2)

id_name_email_created_at


### f-strings (Formatted Strings)

This is the single most important string feature you will use. An f-string lets you put variables directly inside a string.

You create an f-string by putting `f` before the opening quote, then using `{curly_braces}` around variable names:

```python
name = "customers"
count = 1542
print(f"Node {name} processed {count} rows")
# Output: Node customers processed 1542 rows
```

Without f-strings, you would have to write:
```python
print("Node " + name + " processed " + str(count) + " rows")
```

The f-string version is shorter, clearer, and you do not need `str()` conversion. Use f-strings everywhere.

**Odibi uses f-strings in almost every file.** Open `odibi/exceptions.py` and look at the `_format_error()` methods -- they all build messages with f-strings and `.join()`.
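
Two more f-string features that are handy for log output, shown here as a small sketch with made-up values: padding and alignment inside the braces, and the `=` specifier (Python 3.8+) that prints the expression together with its value.

```python
name = "customers"
rows = 1542

# <12 pads the name to 12 characters, left-aligned.
# >8, right-aligns the number in 8 characters with a thousands separator.
line = f"{name:<12}|{rows:>8,}|"
print(line)         # customers   |   1,542|

# The = specifier is useful for quick debugging.
print(f"{rows=}")   # rows=1542
```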

In [28]:
# f-string examples
node_name = "customers"
row_count = 1542
duration = 3.456
success = True

# Basic f-string
print(f"Node: {node_name}")

# A format spec after the colon controls how the value is displayed
print(f"Total: {row_count:,}")  # :, adds thousands separators -> "1,542"

# Format decimals
print(f"Duration: {duration:.2f} seconds")  # .2f = 2 decimal places

# Format as percentage
rate = 0.9736
print(f"Pass rate: {rate:.2%}")  # .2% = percentage with 2 decimals -> "97.36%"

# You can even do math inside braces
print(f"Rows per second: {row_count / duration:.0f}")

Node: customers
Total: 1,542
Duration: 3.46 seconds
Pass rate: 97.36%
Rows per second: 446


### Multi-line strings

For long text, use triple quotes (`"""` or `'''`). This is also how Python **docstrings** work -- 
the documentation strings you see at the top of every function and class in Odibi.
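
To make the docstring idea concrete, here is a minimal sketch. The function `describe_node` is hypothetical (functions are covered in Section 9); the point is only that the triple-quoted text under the `def` line is stored by Python and can be read back.

```python
def describe_node(name):
    """Return a one-line description of a node.

    The triple-quoted text directly under the def line is the docstring.
    """
    return f"Node: {name}"

print(describe_node("customers"))   # Node: customers
print(describe_node.__doc__)        # prints the docstring text
```

Tools such as `help(describe_node)` read the same docstring.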

In [29]:
# Multi-line string
error_message = """
[X] Node execution failed: customers
  Error: Column 'email' not found
  Available columns: [id, name, created_at]
  Suggestion: Check column names in your YAML config
"""
print(error_message)

# This is exactly how odibi/exceptions.py builds error messages!


[X] Node execution failed: customers
  Error: Column 'email' not found
  Available columns: [id, name, created_at]
  Suggestion: Check column names in your YAML config



### Exercise 3.1: Split and extract

Given `table_path = "gold.analytics.monthly_revenue"`, split it and print:
- The layer (first part): `gold`
- The schema (second part): `analytics`
- The table name (third part): `monthly_revenue`

In [34]:
# Exercise 3.1
# YOUR CODE HERE
table_path = "gold.analytics.monthly_revenue"

# Split the path
parts = table_path.split('.')

# Print each part with a label
print(f"Layer: {parts[0]}")
print(f"Schema: {parts[1]}")
print(f"Table: {parts[2]}")

Layer: gold
Schema: analytics
Table: monthly_revenue


**Expected output:**
```
Layer: gold
Schema: analytics
Table: monthly_revenue
```

### Exercise 3.2: Build a log message

Using f-strings, create and print a log message that looks exactly like this:

```
[2024-01-15] Node: customers | Status: SUCCESS | Rows: 1,542 | Duration: 3.46s
```

Use these variables:
- `date = "2024-01-15"`
- `node = "customers"`
- `status = "SUCCESS"`
- `rows = 1542`
- `duration = 3.456`

Hints: Use `:,` for the comma in 1,542. Use `:.2f` for 2 decimal places on duration.

In [36]:
# Exercise 3.2
# YOUR CODE HERE
date = "2024-01-15"
node = "customers"
status = "SUCCESS"
rows = 1542
duration = 3.456

# Build and print the log message using an f-string
print(f"[{date}] Node: {node} | Status: {status} | Rows: {rows:,} | Duration: {duration:.2f}s")

[2024-01-15] Node: customers | Status: SUCCESS | Rows: 1,542 | Duration: 3.46s


### Exercise 3.3: Clean a node name

You receive a messy node name: `"  My_Node_Name  "`. Clean it up:
1. Remove the whitespace from both ends
2. Convert to lowercase
3. Store the result and print it

You can chain methods: `text.strip().lower()` -- Python runs them left to right.

In [38]:
# Exercise 3.3
# YOUR CODE HERE
dirty_name = "  My_Node_Name  "

# Clean it up (strip + lower)
cleaned_name = dirty_name.strip().lower()

print(cleaned_name)

my_node_name


**Expected output:** `my_node_name`

---
## Section 4: Numbers and Math

Python can do all the math you need. Here are the operators:

| Operator | What it does | Example | Result |
|----------|-------------|---------|--------|
| `+` | Addition | `10 + 3` | `13` |
| `-` | Subtraction | `10 - 3` | `7` |
| `*` | Multiplication | `10 * 3` | `30` |
| `/` | Division | `10 / 3` | `3.333...` |
| `//` | Floor division (round down) | `10 // 3` | `3` |
| `%` | Modulo (remainder) | `10 % 3` | `1` |
| `**` | Power/exponent | `2 ** 10` | `1024` |

The ones that trip people up are `//` and `%`:
- `//` divides and drops the decimal: `10 // 3` = `3` (not 3.33)
- `%` gives the remainder: `10 % 3` = `1` (because 10 = 3*3 + 1)

These come up in interviews. `%` is commonly used to check if a number is even: `n % 2 == 0`.
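
A small sketch of `//` and `%` working together, using an invented duration of 200 seconds. It also shows the built-in `divmod()`, which returns both results at once, and what "round down" means for negative numbers.

```python
total_seconds = 200
minutes = total_seconds // 60     # 3 whole minutes
seconds = total_seconds % 60      # 20 seconds left over
print(f"{minutes}m {seconds}s")   # 3m 20s

# divmod() returns the quotient and the remainder together.
print(divmod(total_seconds, 60))  # (3, 20)

# Floor division rounds DOWN, which matters for negative numbers.
print(-7 // 2)                    # -4, not -3
```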

In [42]:
# Math operators in action
print(10 / 3)    # 3.3333... (regular division always gives a float)
print(10 // 3)   # 3 (floor division - rounds DOWN)
print(10 % 3)    # 1 (remainder after dividing 10 by 3)
print(2 ** 10)   # 1024 (2 to the power of 10)

# Checking if a number is even or odd
number = 42
print(f"{number} is even: {number % 2 == 0}")  # True

# round() - round to N decimal places
print(round(3.14159, 2))   # 3.14
print(round(3.14159, 0))   # 3.0

3.3333333333333335
3
1
1024
42 is even: True
3.14
3.0


### Exercise 4.1: Calculate a validation pass rate

In Odibi, when data validation runs, it checks every row and counts how many pass vs fail.
The `GateFailedError` in `odibi/exceptions.py` displays this as a percentage.

Given:
- `total_rows = 870`
- `failed_rows = 23`

Calculate:
1. `passed_rows` (total minus failed)
2. `pass_rate` (passed / total) as a decimal (e.g., 0.9735...)
3. `pass_percentage` (pass_rate * 100), rounded to 2 decimal places
4. Print: `"Pass rate: 97.36% (847/870 rows passed)"`

In [48]:
# Exercise 4.1
# YOUR CODE HERE
total_rows = 870
failed_rows = 23

# Calculate passed_rows
passed_rows = total_rows - failed_rows

# Calculate pass_rate (as decimal)
pass_rate = passed_rows / total_rows

# Calculate pass_percentage (rounded to 2 decimals)
pass_percentage = round(pass_rate * 100, 2)

# Print the formatted message
print(f'Pass rate: {pass_percentage:.2f}% ({passed_rows}/{total_rows} rows passed)')

Pass rate: 97.36% (847/870 rows passed)


**Expected output:** `Pass rate: 97.36% (847/870 rows passed)`

Bonus: try using the `:.1%` f-string format on the decimal pass_rate directly:
```python
print(f"Pass rate: {pass_rate:.1%}")
```

---
## Section 5: Booleans and Comparisons

### The difference between `=` and `==`

This is one of the most common mistakes beginners make, and a frequent interview question:

- `=` means **assign** (put a value into a variable)
- `==` means **compare** (check if two values are equal)

```python
x = 5       # ASSIGN: x now holds the value 5
x == 5      # COMPARE: is x equal to 5? Returns True
x == 3      # COMPARE: is x equal to 3? Returns False
```

### All comparison operators

| Operator | Meaning | Example | Result |
|----------|---------|---------|--------|
| `==` | Equal to | `5 == 5` | `True` |
| `!=` | Not equal to | `5 != 3` | `True` |
| `<` | Less than | `3 < 5` | `True` |
| `>` | Greater than | `5 > 3` | `True` |
| `<=` | Less than or equal | `5 <= 5` | `True` |
| `>=` | Greater than or equal | `5 >= 6` | `False` |

In [49]:
# Comparison operators
engine = "pandas"
row_count = 1542
pass_rate = 0.97

print(engine == "pandas")     # True
print(engine == "spark")      # False
print(engine != "spark")      # True (not equal)
print(row_count > 1000)       # True
print(pass_rate >= 0.95)      # True
print(row_count < 100)        # False

True
False
True
True
True
False


### Logical operators: `and`, `or`, `not`

You can combine comparisons:

- `and` -- both must be True
- `or` -- at least one must be True
- `not` -- flips True to False and vice versa

```python
age >= 18 and has_id          # True only if BOTH are True
is_admin or is_owner          # True if EITHER is True
not is_deleted                # True if is_deleted is False
```

In [50]:
# Logical operators
engine = "pandas"
row_count = 1542
is_production = False

# and - both conditions must be true
print(engine == "pandas" and row_count > 0)      # True (both true)
print(engine == "pandas" and is_production)       # False (second is false)

# or - at least one condition must be true
print(engine == "spark" or engine == "pandas")    # True (second is true)

# not - flips the value
print(not is_production)    # True (because is_production is False)
print(not True)             # False

True
False
True
True
False


### Truthiness (This WILL come up in interviews)

In Python, every value can be treated as True or False, even if it is not a boolean. This is called "truthiness."

**These values are considered False ("falsy"):**
- `False` (obviously)
- `0` and `0.0`
- `""` (empty string)
- `None`
- `[]` (empty list)
- `{}` (empty dictionary)
- `set()` (empty set)

**Everything else is True ("truthy")**, including:
- `"False"` (this is a non-empty STRING, not the boolean False!)
- `[0]` (a list with one item -- it is not empty)
- `-1` (any non-zero number)

This matters because you will see code like:
```python
if my_list:    # This checks: "does my_list have any items?"
if not name:   # This checks: "is name empty or None?"
```

In [51]:
# Truthiness - predict the output before running each line!
print(bool(0))          # ?
print(bool(42))         # ?
print(bool(""))         # ?
print(bool("hello"))    # ?
print(bool("False"))    # ? (trick question!)
print(bool(None))       # ?
print(bool([]))         # ?
print(bool([0]))        # ? (trick question!)
print(bool(0.0))        # ?

False
True
False
True
True
False
False
True
False


### The `in` operator

You can check if a value exists in a collection using `in`. This is very Pythonic and used everywhere in Odibi.

In [52]:
# The "in" operator
valid_engines = ["pandas", "spark", "polars"]
engine = "pandas"

print(engine in valid_engines)        # True
print("duckdb" in valid_engines)      # False
print("duckdb" not in valid_engines)  # True

# Also works with strings
path = "bronze/customers/daily"
print("customers" in path)            # True
print("orders" in path)               # False

True
False
True
True
False


### Exercise 5.1: Write boolean expressions

Using the variables below, write expressions for each scenario. Print the result of each one.

Write expressions for:
1. `is_delta` -- True if format_type equals "delta"
2. `needs_retry` -- True if attempt is less than max_retries AND success is False
3. `has_data` -- True if row_count is truthy (non-zero)
4. `is_valid_engine` -- True if engine_type is in the list ["pandas", "spark", "polars"]

In [None]:
# Exercise 5.1
# YOUR CODE HERE
format_type = "delta"
engine_type = "pandas"
row_count = 1542
attempt = 2
max_retries = 3
success = False

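# is_delta: compare with == ("in" would also match substrings)
is_delta = format_type == "delta"

# needs_retry: attempts remain AND the last run did not succeed
needs_retry = attempt < max_retries and not success

# has_data: rely on truthiness (0 is falsy, any other number is truthy)
has_data = bool(row_count)

# is_valid_engine: membership test against the list of valid engines
is_valid_engine = engine_type in ["pandas", "spark", "polars"]

# Print each one
print(f"is_delta: {is_delta}")
print(f"needs_retry: {needs_retry}")
print(f"has_data: {has_data}")
print(f"is_valid_engine: {is_valid_engine}")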

is_delta: True
needs_retry: True
has_data: True
is_valid_engine: True


**Expected output:**
```
is_delta: True
needs_retry: True
has_data: True
is_valid_engine: True
```

### Exercise 5.2: Truthiness quiz

Without running the code first, write down what you think each of these returns. Then run it to check.

This is a common interview screening question.

In [55]:
# Exercise 5.2 - Predict each result, THEN run the cell
# Write your predictions as comments first!

# Prediction: ?
print(bool(0)) # False

# Prediction: ?
print(bool("")) # False

# Prediction: ?
print(bool("False")) # True

# Prediction: ?
print(bool([])) # False

# Prediction: ?
print(bool([0])) # True

# Prediction: ?
print(not None) # True

False
False
True
False
True
True


**Answers:**
- `bool(0)` = `False` (zero is falsy)
- `bool("")` = `False` (empty string is falsy)
- `bool("False")` = `True` (it is a non-empty string! The text inside does not matter)
- `bool([])` = `False` (empty list is falsy)
- `bool([0])` = `True` (the list has one item -- it is not empty)
- `not None` = `True` (None is falsy, so not None is True)

If you got `bool("False")` and `bool([0])` right on the first try, you are ahead of most candidates.

---
## Section 6: Print and Output

You have been using `print()` already. Let us understand it fully.

`print()` sends text to the screen. It is your main way to see what your code is doing.

In [59]:
# print() basics
print("Hello")                    # Simple text
print("Hello", "World")           # Multiple arguments separated by space
print("Hello", "World", sep="-")  # Change separator
print("No newline", end=" ")      # Change what comes at the end
print("after it")                 # This continues on the same line

Hello
Hello World
Hello-World
No newline after it


In [60]:
# Printing variables with f-strings (the way you should always do it)
node = "customers"
rows = 1542
duration = 3.456

# Bad way (string concatenation - messy, need str() conversion)
print("Node: " + node + " | Rows: " + str(rows))

# Good way (f-string - clean, readable)
print(f"Node: {node} | Rows: {rows:,} | Duration: {duration:.2f}s")

Node: customers | Rows: 1542
Node: customers | Rows: 1,542 | Duration: 3.46s


### A note about print() vs logging

In real code (like Odibi), you almost never use `print()`. Instead, you use Python's `logging` module. We will cover that in Phase 3. For now, `print()` is fine for learning and experimenting.
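
As a brief preview only, here is a minimal sketch of the standard-library `logging` module. The logger name `odibi_demo` and the message text are invented for illustration, and this is not necessarily how Odibi configures its own logging.

```python
import logging

# basicConfig sets up a simple handler that writes to the console (stderr).
logging.basicConfig(level=logging.INFO, format="%(levelname)s | %(message)s")

logger = logging.getLogger("odibi_demo")   # hypothetical logger name
logger.setLevel(logging.INFO)

node = "customers"
rows = 1542

# Each message has a level, so output can be filtered later.
logger.info("Node %s processed %d rows", node, rows)
logger.warning("Node %s is slower than expected", node)
```

Unlike `print()`, each message carries a level (`INFO`, `WARNING`, and so on), so the same code can be quiet in production and verbose while debugging.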

---
## Section 7: Control Flow -- if / elif / else

Control flow lets your code make decisions. Without it, code just runs line by line. With `if` statements, code can choose different paths.

### Indentation is EVERYTHING

This is the single most important syntax rule in Python. **Python uses indentation (spaces) to show which code belongs together.** Most languages use curly braces `{}`. Python uses spaces.

```python
if engine == "pandas":
    print("Using Pandas")    # This line is INSIDE the if (indented 4 spaces)
    print("Local mode")      # This line is also INSIDE the if
print("This runs always")    # This line is OUTSIDE the if (not indented)
```

The standard indentation is **4 spaces**. Your editor handles this when you press Tab. Never mix tabs and spaces -- pick one (spaces) and stick with it.
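
Indentation also nests. The sketch below, with made-up values, shows one `if` inside another: each extra level of indentation means "inside one more block".

```python
engine = "spark"
is_production = True
mode = "unknown"

if engine == "spark":
    mode = "spark-dev"              # inside the outer if (4 spaces)
    if is_production:
        mode = "spark-production"   # inside BOTH ifs (8 spaces)
    print(f"Mode chosen: {mode}")   # back to 4 spaces: outer if only
print("Done")                       # no indentation: always runs
```

Try setting `is_production = False` and predict which lines run before you execute the cell.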

In [64]:
# Basic if/elif/else
engine = "pandas"

if engine == "pandas":
    print("Local development mode")
    print("Using DataFrame operations")
elif engine == "spark":
    print("Production mode")
    print("Running on Databricks")
elif engine == "polars":
    print("High-performance local mode")
else:
    print(f"Unknown engine: {engine}")

Local development mode
Using DataFrame operations


Notice the structure:
1. `if condition:` -- check the first condition (note the colon!)
2. `elif condition:` -- check another condition (only if the first was False)
3. `else:` -- if nothing above matched

You can have as many `elif` blocks as you want. `elif` and `else` are optional.

### Ternary expression (one-line if)

For simple conditions, Python has a shorthand:
```python
status = "active" if is_running else "stopped"
```
This reads as: "set status to active IF is_running is True, ELSE set it to stopped"

In [65]:
# Ternary expression - one-line if/else
row_count = 1542

# Long way
if row_count > 0:
    status = "has_data"
else:
    status = "empty"

# Short way (ternary) - same result
status = "has_data" if row_count > 0 else "empty"
print(f"Status: {status}")

Status: has_data


### Exercise 7.1: Engine selector

Write an if/elif/else that takes an `engine_type` variable and prints a description:
- `"pandas"` -> `"Local development - DataFrame operations"`
- `"spark"` -> `"Production - Databricks cluster"`
- `"polars"` -> `"High-performance local - Rust-based"`
- anything else -> `"Unknown engine: {engine_type}"`

In [None]:
# Exercise 7.1
# YOUR CODE HERE
engine_type = "polars"  # Try changing this to test all branches
    
# Write your if/elif/else here
if engine_type == "pandas":
    print("Local development - DataFrame operations")
elif engine_type == "spark":
    print("Production - Databricks cluster")
elif engine_type == "polars":
    print("High-performance local - Rust-based")
else:
    print(f"Unknown engine: {engine_type}")

High-performance local - Rust-based


### Exercise 7.2: Validate a write mode

Odibi supports these write modes: `overwrite`, `append`, `upsert`, `append_once`, `merge` (from `odibi/config.py` WriteMode enum).

Write code that:
1. Takes a `write_mode` variable
2. Checks if it is in the list of valid modes
3. If valid, prints `"Write mode '{write_mode}' is valid"`
4. If not, prints `"Invalid write mode '{write_mode}'. Must be one of: overwrite, append, upsert, append_once, merge"`

In [77]:
valid_modes = ["overwrite", "append", "upsert", "append_once", "merge"]
", ".join(valid_modes)

'overwrite, append, upsert, append_once, merge'

In [81]:
# Exercise 7.2
# YOUR CODE HERE
write_mode = "upsert"  # Try "invalid_mode" too
valid_modes = ["overwrite", "append", "upsert", "append_once", "merge"]

# Check if write_mode is valid
if write_mode in valid_modes:
    print(f"Write mode '{write_mode}' is valid")
else:
    print(f"Invalid write mode '{write_mode}'. Must be one of: {', '.join(valid_modes)}")

Write mode 'upsert' is valid


---
## Section 8: Loops

Loops let you repeat code. Instead of writing the same thing 100 times, you write it once and loop.

### The `for` loop

The `for` loop says: "for each item in this collection, do something."

```python
for item in collection:
    do_something(item)
```

In [82]:
# for loop over a list
nodes = ["customers", "orders", "products", "inventory"]

for node in nodes:
    print(f"Processing node: {node}")

Processing node: customers
Processing node: orders
Processing node: products
Processing node: inventory


In [83]:
# enumerate() - when you need the index AND the value
# This is more Pythonic than using range(len(list))
nodes = ["customers", "orders", "products", "inventory"]

for i, node in enumerate(nodes):
    print(f"  Node {i + 1}: {node}")

# i starts at 0, so we add 1 for human-readable numbering

  Node 1: customers
  Node 2: orders
  Node 3: products
  Node 4: inventory


In [84]:
# range() - generate a sequence of numbers
print(list(range(5)))       # [0, 1, 2, 3, 4] - starts at 0, stops BEFORE 5
print(list(range(1, 6)))    # [1, 2, 3, 4, 5] - starts at 1, stops BEFORE 6
print(list(range(0, 10, 2))) # [0, 2, 4, 6, 8] - every 2nd number

# Using range in a for loop
for i in range(3):
    print(f"Attempt {i + 1} of 3")

[0, 1, 2, 3, 4]
[1, 2, 3, 4, 5]
[0, 2, 4, 6, 8]
Attempt 1 of 3
Attempt 2 of 3
Attempt 3 of 3


### The `while` loop

The `while` loop says: "keep going as long as this condition is True."

```python
while condition:
    do_something()
```

Be careful: if the condition never becomes False, your loop runs forever (infinite loop).

In [87]:
# while loop - retry logic (very common in data engineering)
max_retries = 3
attempt = 0
success = False

while attempt < max_retries and not success:
    attempt += 1  # This is shorthand for: attempt = attempt + 1
    print(f"Attempt {attempt} of {max_retries}...")

    # Simulate: succeed on attempt 2
    if attempt == 2:
        success = True
        print("Success!")

if not success:
    print(f"Failed after {max_retries} attempts")

Attempt 1 of 3...
Attempt 2 of 3...
Success!


### `break` and `continue`

- `break` -- exit the loop immediately
- `continue` -- skip to the next iteration
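
Python loops can also carry an `else` block, which pairs naturally with `break`. It runs only when the loop finishes without hitting `break`. A minimal sketch with illustrative values:

```python
nodes = ["customers", "orders", "products"]
target = "inventory"

for node in nodes:
    if node == target:
        result = "found"
        break
else:
    # Runs only when the loop ended without break.
    result = "not found"

print(f"{target}: {result}")   # inventory: not found
```

The same `else` clause works on `while` loops, which makes it a convenient place for "all attempts failed" handling.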

In [88]:
# break - stop the loop early
nodes = ["customers", "orders", "STOP", "products"]

for node in nodes:
    if node == "STOP":
        print("Stop signal received")
        break  # Exit the loop entirely
    print(f"Processing: {node}")

print("---")

# continue - skip this iteration
numbers = [1, -2, 3, -4, 5]
for n in numbers:
    if n < 0:
        continue  # Skip negative numbers
    print(f"Positive: {n}")

Processing: customers
Processing: orders
Stop signal received
---
Positive: 1
Positive: 3
Positive: 5


### Exercise 8.1: Build a SQL SELECT

Given a list of column names, build a SQL SELECT statement using a loop.

```python
columns = ["id", "name", "email", "created_at"]
table = "customers"
```

Expected output:
```
SELECT id, name, email, created_at FROM customers
```

Hint: You could use `.join()` from Section 3, or build the string in a loop. Try both ways!

In [109]:
# Exercise 8.1
# YOUR CODE HERE
columns = ["id", "name", "email", "created_at"]
table = "customers"

# Method 1: Using .join()
columns_str = ", ".join(columns)
print(f'SELECT {columns_str} FROM {table}')
# Method 2: Using a loop (build a string piece by piece)
query = 'SELECT'
for column in columns:
    query += f' {column},'

query = query.rstrip(',')
print(f'{query} FROM {table}')

query = 'SELECT'
for i, column in enumerate(columns):
    if i+1 < len(columns):
        query += f' {column},'
    else:
        query += f' {column}'

print(f'{query} FROM {table}')

SELECT id, name, email, created_at FROM customers
SELECT id, name, email, created_at FROM customers
SELECT id, name, email, created_at FROM customers


### Exercise 8.2: Retry logic

Write retry logic that:
1. Tries up to 5 times
2. On each attempt, prints `Attempt {n} of 5...`
3. Simulates success on attempt 3 (use `if attempt == 3`)
4. When successful, prints `Success on attempt {n}!` and stops
5. If all attempts fail, prints `Failed after 5 attempts`

This is real-world logic -- Odibi has retry configs in `odibi/config.py` (RetryConfig).

In [153]:
# Exercise 8.2
# YOUR CODE HERE
max_retries = 5
success_on = 3

# Write your retry loop
attempt = 0

while attempt < max_retries:
    attempt += 1
    print(f'Attempt {attempt} of {max_retries}...')
    if attempt == success_on:
        print(f'Success on attempt {attempt}!')
        break
else:
    # A loop's else block runs only if the loop ended without a break
    print(f'Failed after {max_retries} attempts')


Attempt 1 of 5...
Attempt 2 of 5...
Attempt 3 of 5...
Success on attempt 3!
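
The solution above relies on a less common feature: a loop can carry its own `else` block, which runs only when the loop finishes without hitting `break`. As a minimal sketch, here is the same retry logic written with `for` and `range()` instead of `while`. The variable `outcome` is introduced only for this illustration.

```python
max_retries = 5
success_on = 3  # simulated: pretend the third attempt succeeds

for attempt in range(1, max_retries + 1):
    print(f"Attempt {attempt} of {max_retries}...")
    if attempt == success_on:
        outcome = f"Success on attempt {attempt}!"
        break  # leaving via break skips the else block below
else:
    # Reached only when every attempt ran and none hit break
    outcome = f"Failed after {max_retries} attempts"

print(outcome)
```

Setting `success_on` to a number larger than `max_retries` is a quick way to watch the `else` branch run.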


---
## Section 9: Functions

A function is a reusable block of code that:
1. Has a name
2. Takes inputs (called **parameters** in the definition and **arguments** when you call it)
3. Does something
4. Optionally returns an output

You have already been using functions: `print()`, `type()`, `int()`, `len()`. Now you will write your own.

### Why functions matter

Without functions, you would copy-paste the same code every time you need it. Functions let you write it once and call it many times. Every piece of Odibi is built from functions.
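
To make that concrete, here is a small sketch. The helper name `row_summary` and the row counts are invented for illustration; they are not taken from Odibi.

```python
# Without a function: the same formatting logic is typed out for each node
print(f"customers: {1542:,} rows")
print(f"orders: {8930:,} rows")

# With a function: the formatting logic lives in one place
def row_summary(name, rows):
    """Return a one-line row-count summary for a node."""
    return f"{name}: {rows:,} rows"

for name, rows in [("customers", 1542), ("orders", 8930)]:
    print(row_summary(name, rows))
```

If the format ever needs to change, only the body of `row_summary` has to be edited.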

### The `def` keyword

```python
def function_name(parameter1, parameter2):
    """Docstring: explains what this function does."""
    # code goes here
    return result
```

Let us break this down:
- `def` -- tells Python you are defining a function
- `function_name` -- the name you will use to call it
- `(parameter1, parameter2)` -- inputs the function expects
- `:` -- colon, just like if statements
- The indented block is the function body
- `return` -- sends a value back to whoever called the function

In [154]:
# Your first function
def greet(name):
    """Return a greeting message."""
    return f"Hello, {name}!"

# Calling the function
result = greet("Odibi")
print(result)  # Hello, Odibi!

# You can also print directly
print(greet("Pipeline"))  # Hello, Pipeline!

Hello, Odibi!
Hello, Pipeline!


In [155]:
# Function with default parameters
# If the caller does not provide a value, the default is used
def format_node_status(name, status="PENDING", rows=0):
    """Format a node status message."""
    return f"[{status}] {name}: {rows:,} rows"

# Using all defaults
print(format_node_status("customers"))
# [PENDING] customers: 0 rows

# Override some defaults
print(format_node_status("customers", "SUCCESS", 1542))
# [SUCCESS] customers: 1,542 rows

# Using keyword arguments (named arguments) - you can skip the order
print(format_node_status("orders", rows=3200, status="RUNNING"))
# [RUNNING] orders: 3,200 rows

[PENDING] customers: 0 rows
[SUCCESS] customers: 1,542 rows
[RUNNING] orders: 3,200 rows


### Return vs Print

This is a CRITICAL distinction, and a common interview question:

- `print()` displays text on the screen. It does NOT send anything back.
- `return` sends a value back to the code that called the function. It does NOT display anything.

If your function does not have a `return` statement, it returns `None` by default.

In [156]:
# print vs return - understand the difference

# This function PRINTS (displays) but returns None
def bad_add(a, b):
    print(a + b)

# This function RETURNS the result
def good_add(a, b):
    return a + b

result1 = bad_add(3, 4)   # Prints 7 on screen
print(f"bad_add returned: {result1}")   # None!

result2 = good_add(3, 4)  # Nothing printed
print(f"good_add returned: {result2}")  # 7

# Why does this matter? Because you need to USE the result:
total = good_add(3, 4) + good_add(5, 6)  # Works: 7 + 11 = 18
print(total)

# This would FAIL:
# total = bad_add(3, 4) + bad_add(5, 6)  # Error: None + None

7
bad_add returned: None
good_add returned: 7
18


### Docstrings

The triple-quoted string right after `def` is called a **docstring**. It documents what the function does. Every function in Odibi has one. It is not just a comment -- Python actually stores it and you can access it.

In an interview, writing docstrings shows professionalism. Always write them.

In [157]:
# Docstrings
def calculate_pass_rate(total, failed):
    """Calculate the validation pass rate.

    Args:
        total: Total number of rows
        failed: Number of failed rows

    Returns:
        Pass rate as a float between 0 and 1
    """
    passed = total - failed
    return passed / total

# You can read a function's docstring
help(calculate_pass_rate)

Help on function calculate_pass_rate in module __main__:

calculate_pass_rate(total, failed)
    Calculate the validation pass rate.
    
    Args:
        total: Total number of rows
        failed: Number of failed rows
    
    Returns:
        Pass rate as a float between 0 and 1



### Exercise 9.1: Node status formatter

Write a function called `format_node_result` that:
- Takes parameters: `name` (str), `success` (bool), `duration` (float), `rows` (int, default 0)
- Returns a string like: `"[SUCCESS] customers: 1,542 rows in 3.46s"` or `"[FAILED] customers: 0 rows in 1.23s"`
- The status should be "SUCCESS" if success is True, "FAILED" if False
- Include a docstring

In [168]:
# Exercise 9.1
# YOUR CODE HERE
def format_node_result(name: str, success: bool, duration: float, rows: int = 0) -> str:
    """Format a node result as '[STATUS] name: rows in duration'."""
    if success:
        status = 'SUCCESS'
    else:
        status = 'FAILED'
    return f'[{status}] {name}: {rows:,} rows in {duration:.2f}s'



# Test your function:
print(format_node_result("customers", True, 3.456, 1542))
print(format_node_result("orders", False, 1.234))

[SUCCESS] customers: 1,542 rows in 3.46s
[FAILED] orders: 0 rows in 1.23s


**Expected output:**
```
[SUCCESS] customers: 1,542 rows in 3.46s
[FAILED] orders: 0 rows in 1.23s
```

### Exercise 9.2: Write mode validator

Write a function called `validate_write_mode` that:
- Takes a `mode` parameter (str)
- Returns `True` if mode is one of: overwrite, append, upsert, append_once, merge
- Returns `False` otherwise
- Include a docstring

Then test it with both valid and invalid modes.

In [None]:
# Exercise 9.2
# YOUR CODE HERE
def validate_write_mode(mode: str) -> bool:
    """Return True if mode is a supported write mode, False otherwise."""
    _modes = ['overwrite', 'append', 'upsert', 'append_once', 'merge']
    return mode in _modes  # `in` already evaluates to True or False

# Test it:
print(validate_write_mode("upsert"))      # True
print(validate_write_mode("delete"))       # False
print(validate_write_mode("append_once"))  # True

True
False
True


### Exercise 9.3: SQL query builder

Write a function called `build_select` that:
- Takes: `table` (str), `columns` (list of str), `where` (str, default None)
- Returns a SQL SELECT string
- If `where` is provided, add a WHERE clause

Examples:
- `build_select("customers", ["id", "name"])` -> `"SELECT id, name FROM customers"`
- `build_select("orders", ["id", "total"], "total > 100")` -> `"SELECT id, total FROM orders WHERE total > 100"`

In [172]:
# Exercise 9.3
# YOUR CODE HERE
def build_select(table: str, columns: list[str], where: str = None):
    """Build a SQL SELECT statement, with an optional WHERE clause."""
    columns_str = ', '.join(columns)
    query = f'SELECT {columns_str} FROM {table}'
    if where:
        query = f'{query} WHERE {where}'
    return query


# Test it:
print(build_select("customers", ["id", "name", "email"]))
print(build_select("orders", ["id", "total"], "total > 100"))

SELECT id, name, email FROM customers
SELECT id, total FROM orders WHERE total > 100


**Expected output:**
```
SELECT id, name, email FROM customers
SELECT id, total FROM orders WHERE total > 100
```

---
## Section 10: Error Handling -- try / except

When something goes wrong in Python, it raises an **exception**. If you do not handle it, your program crashes. Error handling lets you catch these exceptions and decide what to do.

This is CRITICAL for data engineering. Data is messy. Files go missing. APIs time out. Columns have unexpected types. Your code must handle these situations gracefully.

### Common exception types

| Exception | When it happens | Example |
|-----------|----------------|--------|
| `ValueError` | Wrong value | `int("hello")` |
| `TypeError` | Wrong type | `"hello" + 5` |
| `KeyError` | Key not in dict | `my_dict["missing_key"]` |
| `IndexError` | Index out of range | `my_list[999]` |
| `FileNotFoundError` | File does not exist | `open("missing.csv")` |
| `ZeroDivisionError` | Division by zero | `10 / 0` |
| `AttributeError` | Object has no attribute | `None.upper()` |
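
Each row of the table can be triggered and caught in a few lines. The sketch below covers the first four rows. The dict, the list, and the `caught` list are made up for illustration, and the `try`/`except` syntax it uses is explained in the cells that follow.

```python
my_dict = {"engine": "pandas"}
my_list = [1, 2, 3]
caught = []  # names of the exception types we ran into

try:
    int("hello")                 # not a valid number
except ValueError as e:
    caught.append(type(e).__name__)
    print(f"ValueError: {e}")

try:
    "hello" + 5                  # str and int cannot be added
except TypeError as e:
    caught.append(type(e).__name__)
    print(f"TypeError: {e}")

try:
    my_dict["missing_key"]       # key is not in the dict
except KeyError as e:
    caught.append(type(e).__name__)
    print(f"KeyError: {e}")

try:
    my_list[999]                 # list only has 3 items
except IndexError as e:
    caught.append(type(e).__name__)
    print(f"IndexError: {e}")

print(caught)
```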

In [174]:
# Without error handling - the program CRASHES
# Uncomment the line below to see what happens:
# result = int("not_a_number")  # ValueError!

# With error handling - the program continues
try:
    result = int("not_a_number")
    print(f"Result: {result}")  # This line never runs
except ValueError as e:
    print(f"Could not convert: {e}")

print("Program continues normally!")  # This still runs

Could not convert: invalid literal for int() with base 10: 'not_a_number'
Program continues normally!


### The full try/except/else/finally pattern

```python
try:
    # Code that might fail
except SpecificError as e:
    # Handle the error
else:
    # Runs ONLY if try succeeded (no exception)
finally:
    # Runs ALWAYS, whether try succeeded or failed
```

`else` and `finally` are optional. Most of the time you just use `try/except`.

In [177]:
# Full pattern example
def safe_divide(a, b):
    """Safely divide two numbers."""
    try:
        result = a / b
    except ZeroDivisionError:
        print("Cannot divide by zero!")
        return None
    except TypeError as e:
        print(f"Wrong types: {e}")
        return None
    else:
        print(f"{a} / {b} = {result}")  # Only runs if no error
        return result
    finally:
        print("Division attempted")  # Always runs

safe_divide(10, 3)     # Works
print("---")
safe_divide(10, 0)     # ZeroDivisionError
print("---")
safe_divide(10, "a")   # TypeError

10 / 3 = 3.3333333333333335
Division attempted
---
Cannot divide by zero!
Division attempted
---
Wrong types: unsupported operand type(s) for /: 'int' and 'str'
Division attempted


### `raise` -- Creating your own errors

Sometimes YOU want to signal an error. Use `raise`:

```python
raise ValueError("Invalid engine type")
```

This is exactly what Odibi does. Look at `odibi/exceptions.py` -- it defines custom exceptions like `ConfigValidationError`, `ConnectionError`, `NodeExecutionError`. Each one inherits from `OdibiException` (which inherits from Python's built-in `Exception`).

We will cover custom exception classes fully in Phase 4 (OOP). For now, just know that you can `raise` built-in exceptions.

In [178]:
# raise - signal an error yourself
def validate_engine(engine_type):
    """Validate that engine type is supported."""
    valid = ["pandas", "spark", "polars"]
    if engine_type not in valid:
        raise ValueError(
            f"Invalid engine type: '{engine_type}'. "
            f"Must be one of: {', '.join(valid)}"
        )
    return f"Engine {engine_type} is valid"

# This works fine
print(validate_engine("pandas"))

# This raises an error - catch it!
try:
    validate_engine("duckdb")
except ValueError as e:
    print(f"Caught error: {e}")

Engine pandas is valid
Caught error: Invalid engine type: 'duckdb'. Must be one of: pandas, spark, polars
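
A raised exception does not have to be caught where it is raised. It travels up through each calling function until some `try`/`except` handles it. A minimal sketch, with `check_engine` and `build_pipeline` as made-up names (they are not Odibi functions):

```python
def check_engine(engine_type):
    """Raise ValueError for an unsupported engine."""
    if engine_type not in ["pandas", "spark", "polars"]:
        raise ValueError(f"Invalid engine type: '{engine_type}'")

def build_pipeline(engine_type):
    """Validate the engine, then describe the pipeline."""
    check_engine(engine_type)  # no try/except here, so errors pass through
    return f"Pipeline using {engine_type}"

try:
    message = build_pipeline("duckdb")
except ValueError as e:
    # The error raised inside check_engine surfaces here, two calls up
    message = f"Caught in the caller: {e}"

print(message)
```

This is why validation functions can simply `raise` and leave the decision about what to do next to whoever called them.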


### Exercise 10.1: Safe string-to-int converter

Write a function called `safe_int` that:
- Takes a `value` parameter (any type)
- Tries to convert it to int using `int(value)`
- If successful, returns the integer
- If ValueError, prints a message and returns `None`
- If TypeError, prints a different message and returns `None`

Test with: `safe_int("42")`, `safe_int("hello")`, `safe_int(None)`

In [184]:
# Exercise 10.1
# YOUR CODE HERE
from typing import Any  # `any` (lowercase) is a built-in function, not a type

def safe_int(value: Any):
    try:
        value = int(value)
    except ValueError as e:
        print(f"Value Error: {e}")
        return None
    except TypeError as e:
        print(f"Type Error: {e}")
        return None
    else:
        return value

# Test it:
print(safe_int("42"))      # Should return 42
print(safe_int("hello"))   # Should print error, return None
print(safe_int(None))      # Should print error, return None

42
Value Error: invalid literal for int() with base 10: 'hello'
None
Type Error: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
None


### Exercise 10.2: Config validator with raise

Write a function called `validate_config` that takes:
- `engine` (str)
- `write_mode` (str)
- `table_name` (str)

It should:
- Raise `ValueError` if engine is not in ["pandas", "spark", "polars"]
- Raise `ValueError` if write_mode is not in ["overwrite", "append", "upsert", "append_once", "merge"]
- Raise `ValueError` if table_name is empty
- Return `"Config is valid"` if everything passes

Then call it with both valid and invalid configs, using try/except to catch errors.

In [192]:
# Exercise 10.2
# YOUR CODE HERE
def validate_config(engine: str, write_mode: str, table_name: str):
    """Validate a config; raise ValueError describing the first problem found."""
    _engines = ['pandas', 'spark', 'polars']
    _write_modes = ['overwrite', 'append', 'upsert', 'append_once', 'merge']

    # One check per rule, so the error message says exactly what is wrong
    if engine not in _engines:
        raise ValueError(f"Invalid engine: '{engine}'")
    if write_mode not in _write_modes:
        raise ValueError(f"Invalid write mode: '{write_mode}'")
    if not table_name:
        raise ValueError("Table name must not be empty")
    return "Config is valid"


# Test with valid config:
print(validate_config("pandas", "upsert", "customers"))

# Test with invalid configs (use try/except):
try:
    validate_config("duckdb", "upsert", "customers")
except ValueError as e:
    print(f"Error: {e}")

Config is valid
Error: Invalid engine: 'duckdb'


---
## Section 11: Interview Drill

Answer these without looking back at the notebook. If you cannot answer one, go back and re-read that section.

Write your answers in the code cells. Test them.

### Drill 1: What is the difference between `=` and `==`?

Answer in a comment:

In [None]:
# Drill 1
# YOUR ANSWER: = assigns a value to a variable, while == compares two values and gives True or False

### Drill 2: What does `if not my_list:` check for?

Write code that demonstrates this:

In [None]:
# Drill 2
# YOUR CODE HERE
my_list = []
if not my_list:
    print("my_list is empty")
# What does this print? "my_list is empty" -- an empty list is falsy, so `not my_list` is True

my_list is empty

### Drill 3: Sum without sum()

Write a function called `my_sum` that takes a list of numbers and returns their total. Do NOT use the built-in `sum()` function. Use a loop.

In [195]:
# Drill 3
# YOUR CODE HERE

def my_sum(numbers: list[int]):
    total=0
    for num in numbers:
        total += num
    return total


# Test: 
print(my_sum([1, 2, 3, 4, 5]))  # Should return 15

15


### Drill 4: What is the difference between `return` and `print()`?

Write two functions that demonstrate the difference:

In [199]:
# Drill 4
# YOUR CODE HERE

# Function that prints:
def test_func_1():
    print("yes")

# Function that returns:
def test_func_2():
    return "yes"


# Show why it matters:
a = test_func_1()
b = test_func_2()

print(a,b)

yes
None yes


### Drill 5: Handle division by zero

Write a function `safe_divide(a, b)` that returns `a / b`, but returns 0 if b is zero (instead of crashing).

In [220]:
# Drill 5
# YOUR CODE HERE
def safe_divide(a:int, b:int):
    try:
        result = a / b
    except ZeroDivisionError:
        return 0
    except TypeError as e:
        print(f"Wrong types: {e}")
        return None
    else:
        print(f"{a} / {b} = {result}")  # Only runs if no error
        return result

# Test:
print(safe_divide(10, 3))   # 3.333...
print(safe_divide(10, 0))   # 0

10 / 3 = 3.3333333333333335
3.3333333333333335
0


### Drill 6: f-string formatting

Print exactly this using ONE f-string:
```
Pipeline: sales | Nodes: 5 | Pass Rate: 97.4% | Duration: 12.35s
```

Use these variables: `name="sales"`, `nodes=5`, `rate=0.974`, `duration=12.345`

In [25]:
# Drill 6
# YOUR CODE HERE
name = "sales"
nodes = 5
rate = 0.974
duration = 12.345

# Print the formatted string
print(f"Pipeline: {name} | Nodes: {nodes} | Pass Rate: {rate:.1%} | Duration: {duration:.2f}s")

Pipeline: sales | Nodes: 5 | Pass Rate: 97.4% | Duration: 12.35s


### Drill 7: Loop and accumulate

Given a list of row counts from different nodes, write a loop that:
1. Prints each node's count
2. Calculates the total
3. Prints the total at the end

In [29]:
# Drill 7
# YOUR CODE HERE
node_counts = {"customers": 1542, "orders": 8930, "products": 234, "inventory": 4521}

# Loop through and print each, accumulate total
# Hint: loop through a dict with .items() -> for name, count in node_counts.items():
total = 0
for node, count in node_counts.items():
    total += count
    print(f'{node}: {count:,} rows')

print(f"Total: {total:,} rows")

customers: 1,542 rows
orders: 8,930 rows
products: 234 rows
inventory: 4,521 rows
Total: 15,227 rows


**Expected output:**
```
customers: 1,542 rows
orders: 8,930 rows
products: 234 rows
inventory: 4,521 rows
Total: 15,227 rows
```

---
## Checkpoint

If you completed all exercises and drills, you now have the Python basics down. These are the building blocks for everything else.

**What you learned:**
- Variables and types (int, float, str, bool, None)
- String methods and f-strings
- Math operators
- Boolean logic and truthiness
- if/elif/else
- for and while loops
- Functions with parameters, defaults, and return values
- Error handling with try/except/raise

**Next:** Notebook 02 -- Data Structures and Comprehensions (lists, dicts, sets, tuples, list comprehensions, dict comprehensions, zip, enumerate, sorted with lambda).