# Exception Handling

## Learning goals
By the end of this notebook you should be able to:

- Explain what an *exception* is and why it matters in Data Engineering and AI pipelines.
- Use `try` / `except` to handle predictable failures *without hiding bugs*.
- Use multiple `except` blocks, plus `else` and `finally`, to structure “happy path” vs “error path”.
- Raise your own exceptions with `raise` for validation and schema checks.
- Parse simple user/data inputs safely (avoiding `eval()`).


## Why exception handling matters in Data Engineering

In real pipelines, failures are normal:

- Input files may be missing (`FileNotFoundError`)
- Data may be malformed (`ValueError`, parsing errors)
- Encodings can be wrong (`UnicodeDecodeError`)
- External services can be unstable (timeouts, network errors)

Your goal is **not** to “ignore errors”. Your goal is to:

1. Handle *recoverable* problems with a clear fallback.
2. Fail fast (with a clear message) when you cannot safely continue.
3. Keep enough context to debug quickly.
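
The three goals above can be sketched in a few lines. This is a minimal illustration, not a prescribed pattern: the logger name, `DEFAULT_BATCH_SIZE`, `read_batch_size`, and `require_source` are hypothetical names invented for this example.

```python
import logging

logger = logging.getLogger("pipeline_demo")  # hypothetical logger name

DEFAULT_BATCH_SIZE = 100  # hypothetical fallback value


def read_batch_size(config: dict) -> int:
    """Recoverable: a missing or invalid setting falls back to a default."""
    try:
        return int(config["batch_size"])
    except (KeyError, ValueError) as exc:
        # Keep context: record what went wrong before falling back.
        logger.warning("Using default batch size (%r)", exc)
        return DEFAULT_BATCH_SIZE


def require_source(config: dict) -> str:
    """Non-recoverable: without a source there is nothing to process."""
    try:
        return config["source"]
    except KeyError as exc:
        # Fail fast with a clear message, chaining the original error.
        raise ValueError(
            f"Config has no 'source' key; found keys {sorted(config)}"
        ) from exc


print(read_batch_size({"batch_size": "250"}))
print(read_batch_size({"batch_size": "many"}))
```

The first call prints the parsed value and the second prints the default after logging a warning. `require_source({})` would stop the run with a `ValueError` that still carries the original `KeyError` as its cause.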


## `try` / `except` fundamentals

The <a href="https://docs.python.org/3/tutorial/errors.html">`try`-`except`</a> statement lets you handle exceptions without the program terminating abruptly.

- Put code that *might fail* inside `try`.
- Catch **specific** exceptions in `except`.
- Avoid using exception handling as a patch for unclear logic.

**Best practice:** catch the most specific exception types you can reasonably predict.


In [45]:
def safe_divide(numerator: float, denominator: float) -> float | None:  # Type hint: the function returns either a float or None
    """Return numerator/denominator, or None if denominator is zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return None

print(safe_divide(10, 2))
print(safe_divide(10, 0))

5.0
None


In [46]:
def safe_divide(numerator: float, denominator: float) -> float | str:  # Type hint: the function returns either a float or a str
    """Return numerator/denominator, or a placeholder string if denominator is zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return "My String Value"

print(safe_divide(10, 2))
print(safe_divide(10, 0))

5.0
My String Value


## Catch specific exceptions first (and why “catch-all” is risky)

You can catch different exceptions with multiple `except` blocks.

- Put **specific** handlers first.
- Use broad handlers (`Exception`) carefully, and typically only when you log context and re-raise or return a safe fallback.

**Avoid:** bare `except:` (it can hide `KeyboardInterrupt`, `SystemExit`, and real programming bugs).


In [47]:
def parse_int(raw: str) -> int | None:
    """Parse an integer from a string, returning None if it is invalid."""
    try:
        return int(raw)
    except ValueError:
        return None

for s in ["42", "003", "3.14", "hello"]:
    print(s, "->", parse_int(s))

42 -> 42
003 -> 3
3.14 -> None
hello -> None


## `else` and `finally`

- `else` runs **only if no exception happened** in the `try` block.
  - Useful to keep the “success path” separate from the error handling.
- `finally` runs **no matter what**, even if an exception occurs.
  - Useful for cleanup: closing files, releasing resources, disconnecting clients.


In [48]:
def read_first_line(path: str) -> str:
    """Return the first line of the file, or "MISSING_FILE" if it does not exist."""
    f = None
    try:
        f = open(path, "r", encoding="utf-8")
    except FileNotFoundError:
        return "MISSING_FILE"
    else:
        return f.readline().strip()
    finally:
        if f is not None:
            f.close()

print(read_first_line("data/input.txt"))

MISSING_FILE
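
To make the execution order of the four clauses visible, here is a small sketch that records which blocks run. The function name `trace_conversion` is hypothetical and exists only for this demonstration.

```python
def trace_conversion(raw: str) -> list[str]:
    """Record which clauses execute while converting raw to an int."""
    steps = []
    try:
        steps.append("try")
        value = int(raw)
    except ValueError:
        steps.append("except")  # runs only when int() fails
    else:
        steps.append(f"else ({value})")  # runs only when int() succeeds
    finally:
        steps.append("finally")  # runs in both cases
    return steps


print(trace_conversion("7"))    # ['try', 'else (7)', 'finally']
print(trace_conversion("abc"))  # ['try', 'except', 'finally']
```

Exactly one of `except` or `else` runs in each call, while `finally` runs every time.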


## Raising exceptions deliberately with `raise`

Exceptions are not only accidental. You can **raise** them to enforce rules:

- Validate function arguments
- Validate dataset schema
- Enforce business logic constraints

In Data Engineering, this is a clean way to fail early when continuing would produce wrong outputs.


In [49]:
import pandas as pd

def require_columns(df: pd.DataFrame, required: list[str]) -> None:
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

demo_df = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
require_columns(demo_df, ["id", "value"])
# require_columns(demo_df, ["id", "value1"])
print("Schema OK!")

Schema OK!
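
The same idea also works with a dedicated exception type, which lets callers tell schema problems apart from other `ValueError`s. The sketch below uses plain dictionaries instead of a DataFrame to stay self-contained. `SchemaError` and `require_keys` are hypothetical names and do not appear elsewhere in this notebook.

```python
class SchemaError(ValueError):
    """Hypothetical custom exception for records that violate the schema."""


def require_keys(record: dict, required: list[str]) -> None:
    """Raise SchemaError if any required key is absent from the record."""
    missing = [k for k in required if k not in record]
    if missing:
        raise SchemaError(f"Missing required keys: {missing}")


try:
    require_keys({"id": 1}, ["id", "value"])
except SchemaError as exc:
    print("Validation failed:", exc)
```

Because `SchemaError` subclasses `ValueError`, existing code that catches `ValueError` keeps working, while newer code can catch the narrower type.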


## Safe parsing: replace `eval()` with `ast.literal_eval()`

**Important:** `eval()` executes arbitrary Python code. Using it on user input or external data is unsafe.

If you only need to parse Python *literals* (numbers, strings, lists, dicts, tuples, booleans, None), use:

- `ast.literal_eval()` (accepts only literal syntax and does not execute arbitrary code)

If parsing fails, fall back to treating the input as a raw string.


In [50]:
import ast

def what_type_safe() -> None:
    """Read a value from input, safely parse literals, then print the resulting type."""
    raw = input("Type something (e.g. 123, 'hi', [1,2], {'a':1}): ").strip()
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        value = raw  # fallback: keep as string

    print("Parsed value:", value)
    print("Type:", type(value))

# Runs interactively: the call below prompts for input.
what_type_safe()

Parsed value: hi
Type: <class 'str'>


In [51]:
type(eval("print(1,2)"))

1 2


NoneType

In [52]:
type(ast.literal_eval("[23,232,'hello']"))

list
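
To contrast with the `eval("print(1,2)")` cell above, this sketch shows how `ast.literal_eval()` treats a few inputs: literals are parsed, while function calls and incomplete expressions are rejected instead of executed. The helper name `classify` is hypothetical.

```python
import ast


def classify(raw: str) -> str:
    """Return the parsed type name, or the name of the exception raised."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError) as exc:
        return type(exc).__name__
    return type(value).__name__


for raw in ["[1, 2, 3]", "{'a': 1}", "print(1, 2)", "1 +"]:
    print(repr(raw), "->", classify(raw))
```

The call expression is reported as a `ValueError` and the incomplete expression as a `SyntaxError`, and nothing is printed by the rejected `print(1, 2)` string itself.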

## Data Engineering mini-example: robust JSON line parsing

A common pattern is to parse records one by one, skipping bad records while counting them.

- Recoverable: a few malformed lines (skip and continue)
- Non-recoverable: file missing (fail fast)


In [53]:
import json
from typing import Iterable

def parse_json_lines(lines: Iterable[str]) -> tuple[list[dict], int]:
    """Parse JSON objects from an iterable of strings.

    Returns:
        (records, bad_count)
    """
    records = []
    bad_count = 0

    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_count += 1

    return records, bad_count

sample_lines = [
    '{"id": 1, "value": 10}',
    'not json',
    '{"id": 2, "value": 20}'
]

records, bad = parse_json_lines(sample_lines)
print("records:", records)
print("bad_count:", bad)

records: [{'id': 1, 'value': 10}, {'id': 2, 'value': 20}]
bad_count: 1
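
The cell above covers the recoverable case. As a sketch of both cases together, the hypothetical `load_json_lines` below reads from a file: malformed lines are counted and skipped, while a missing file is deliberately not caught, so `FileNotFoundError` stops the run. The file names are made up, and a temporary directory is used so the example is self-contained.

```python
import json
import tempfile
from pathlib import Path


def load_json_lines(path: Path) -> tuple[list[dict], int]:
    """Parse a JSON Lines file, returning (records, bad_count)."""
    records, bad_count = [], 0
    # No try/except around open(): a missing file should fail fast.
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                bad_count += 1  # recoverable: skip the line and keep going
    return records, bad_count


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "events.jsonl"  # hypothetical file name
    demo_path.write_text('{"id": 1}\nnot json\n\n{"id": 2}\n', encoding="utf-8")
    records, bad = load_json_lines(demo_path)
    print("records:", records)
    print("bad_count:", bad)

    try:
        load_json_lines(Path(tmp) / "missing.jsonl")
    except FileNotFoundError as exc:
        print("Stopped:", type(exc).__name__)
```

In a real pipeline the final `try` / `except` would usually be omitted so the error propagates. It is caught here only to show which exception is raised.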


## Quick exercises (3–7 minutes each)

1) **Safe parse**
- Implement `safe_parse(raw: str)` using `ast.literal_eval`.
- If parsing fails, return the original string.

2) **Integer conversion**
- Implement `to_int_or_none(raw: str)`:
  - returns `int(raw)` if valid
  - otherwise returns `None`

3) **CSV load fallback**
- Implement `load_csv_or_empty(path: str)`:
  - if file exists, return `pd.read_csv(path)`
  - if missing, return an empty DataFrame with columns `["id", "value"]`

4) **Schema validation**
- Implement `validate_schema(df)` that raises `ValueError` if `["id", "value"]` are missing.


In [54]:
# 1) **Safe parse**
# - Implement `safe_parse(raw: str)` using `ast.literal_eval`.
# - If parsing fails, return the original string.
import ast

def safe_parse(raw: str):
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
print(f'Type : { type(safe_parse("123")) }, Value = { safe_parse("123") } ')
print(f'Type : { type(safe_parse("str(list(range(5)))")) }, Value = { safe_parse("str(list(range(5)))") } ')
print(f'Type : { type(safe_parse("[1,2,3,4]")) }, Value = { safe_parse("[1,2,3,4]") } ')


Type : <class 'int'>, Value = 123 
Type : <class 'str'>, Value = str(list(range(5))) 
Type : <class 'list'>, Value = [1, 2, 3, 4] 


In [55]:
# 2) **Integer conversion**
# - Implement `to_int_or_none(raw: str)`:
#   - returns `int(raw)` if valid
#   - otherwise returns `None`

def to_int_or_none(raw:str):
    try:
        return int(raw)
    except ValueError:
        return None
    
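# Note: int() also accepts numbers and truncates floats (2.4 -> 2), so non-string inputs are included here on purpose.
print(to_int_or_none(23), to_int_or_none(2.4), to_int_or_none("Hello"), to_int_or_none("3"))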


23 2 None 3


In [56]:
# 3) **CSV load fallback**
# - Implement `load_csv_or_empty(path: str)`:
#   - if file exists, return `pd.read_csv(path)`
#   - if missing, return an empty DataFrame with columns `["id", "value"]`

import pandas as pd
from pathlib import Path
def load_csv_or_empty(path: str):
    try:
       return pd.read_csv(path)
    except FileNotFoundError:
       return pd.DataFrame(columns=['id','value'])
    except pd.errors.ParserError:
       return pd.DataFrame(columns=['id','value'])

print(load_csv_or_empty("sample.txt"),"\n------------\n", load_csv_or_empty(str(Path.cwd()) + "/data/sample.txt"))

Empty DataFrame
Columns: [id, value]
Index: [] 
------------
    id   name  score
0   1  Alice     98
1   2    Bob     87
2   3  Carol     91
3   4    Dan     73
4   5    Eve     88


In [57]:
# 4) **Schema validation**
# - Implement `validate_schema(df)` that raises `ValueError` if `["id", "value"]` are missing.

import pandas as pd

def validate_schema(df:pd.DataFrame):
    required_columns = ['id','value']
    missing_columns = [c for c in required_columns if c not in df.columns]
    if missing_columns:
        raise ValueError(f"Missing required columns: {missing_columns}")

df = pd.DataFrame(
    {'id':[1,2,3],
     'value':["Apple","Mango","Banana"]}
)
print(validate_schema(df))
#print(validate_schema(pd.DataFrame()))

None


In [58]:
# Suggested solutions
import ast
import pandas as pd

def safe_parse(raw: str):
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw

def to_int_or_none(raw: str):
    try:
        return int(raw)
    except ValueError:
        return None

def load_csv_or_empty(path: str) -> pd.DataFrame:
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        return pd.DataFrame(columns=["id", "value"])
    except pd.errors.ParserError:
        return pd.DataFrame(columns=["id", "value"])

def validate_schema(df: pd.DataFrame) -> None:
    required = ["id", "value"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

print("Solutions cell executed successfully.")

Solutions cell executed successfully.


> Content created by **Carlos Cruz-Maldonado**.  
> Updated with additional best practices and Data Engineering examples.