# Data Validation for Scientists: Building Code That Fails Fast and Safely


## Introduction

### Background 

Realistically, most bugs in scientific code come from bad assumptions about inputs: wrong types, missing keys, unexpected shapes, empty strings, negative values that “should never happen,” and so on. Guard clauses are the blunt tool that prevents this mess.

The goal is to: 

  - **Fail fast**: you don’t waste time debugging downstream errors caused by nonsense inputs.
  - **Reduce nesting**: you don’t wrap the real logic in "if" jungles.
  - **Document assumptions**: the function states clearly what it won’t accept.
  - **Reduce ambiguity**: users can’t silently pass the wrong thing and hope for the best.
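To make the "reduce nesting" point concrete, here is a minimal sketch of the same check written in both styles. The functions `mean_rate_nested` and `mean_rate_guarded` and their arguments are hypothetical, invented only for this comparison.

```python
def mean_rate_nested(spike_count, duration_s):
    # Nested style: the real computation is buried two levels deep.
    if isinstance(spike_count, int):
        if duration_s > 0:
            return spike_count / duration_s
        else:
            raise ValueError("duration_s must be positive")
    else:
        raise TypeError("spike_count must be an integer")


def mean_rate_guarded(spike_count, duration_s):
    # Guard-clause style: reject bad inputs first, then do the work unindented.
    if not isinstance(spike_count, int):
        raise TypeError("spike_count must be an integer")
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")

    return spike_count / duration_s


print(mean_rate_guarded(30, 2.0))  # 15.0
```

Both versions behave the same; the guarded one simply keeps the checks and the real logic visually separate.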


### Practice: Validate at the Edges

When you’re writing code for another researcher, the cleanest way to keep things stable is to validate inputs right at the boundary — the moment they enter your function, class, or pipeline. Your client will give you messy, half-specified data; that’s normal. If you don’t check it immediately, the mistake shows up later in a place that looks like your bug. Validating at the edges prevents that: you reject bad inputs early so the rest of the code can stay simple and trustworthy.
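As a rough sketch of what "validating at the edge" can look like, the example below checks a raw record once, when it enters the pipeline, so that the inner function can trust its input. The names `parse_trial`, `mean_reaction_time`, and the record keys are assumptions made up for illustration.

```python
def parse_trial(raw):
    """Boundary function: check a raw record once, as it enters the pipeline."""
    if not isinstance(raw, dict):
        raise TypeError("trial record must be a dict")
    for key in ("subject_id", "reaction_time_ms"):
        if key not in raw:
            raise ValueError(f"trial record is missing the key {key!r}")

    reaction_time = raw["reaction_time_ms"]
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(reaction_time, bool) or not isinstance(reaction_time, (int, float)):
        raise TypeError("reaction_time_ms must be a number")
    if reaction_time <= 0:
        raise ValueError("reaction_time_ms must be positive")

    return {"subject_id": str(raw["subject_id"]), "reaction_time_ms": float(reaction_time)}


def mean_reaction_time(trials):
    """Inner function: assumes every trial already passed parse_trial()."""
    return sum(t["reaction_time_ms"] for t in trials) / len(trials)


raw_records = [
    {"subject_id": "s01", "reaction_time_ms": 412},
    {"subject_id": "s02", "reaction_time_ms": 388.0},
]
trials = [parse_trial(r) for r in raw_records]
print(mean_reaction_time(trials))  # 400.0
```

All the defensive code lives in one place, and a bad record is reported where it entered rather than deep inside the analysis.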


### The Data Will Get More Complex

In this notebook, we focus on very simple data (just single values) so we can see how challenging validation is even in these cases.  In later sessions, we'll use frameworks to validate much more complex data structures, including those stored in scientific data files.

---

### Workshop Agenda

| Minutes | Activity | Requirements |
| :-- | :-- | :-- |
| 0 - 30 | Review the Pre-Workshop Exercises, Discuss Data Validation Practices | *Complete this Notebook before the Course* |
| 30 - 100 | Breakout Rooms: Data Validation with Pydantic, Pandera, and Argparse | --- |
| 100 - 110 | Break |  --- |
| 110 - 190 | Breakout Rooms: Add Data Validation to Own Projects | *Have Your Project in GitHub, and an Idea of Something to Validate* |
| 190 - 210 | Mini-Retrospective |   --- | 



---

## Exercises

In the exercises in this notebook, we'll practice the basics of data validation, gradually adding frameworks as we get familiar with the pattern and want to do more:

  1. [Data Validation when Running Functions](#the-guard-clause-pattern-check-yourself-before-you-break-yourself)
  2. [Data Validation when Instantiating Classes](#data-validation-when-instantiating-objects)
  3. [Data Validation when using Dataclasses](#data-validation-when-writing-dataclasses)
  4. [Pydantic for Data Validation on Custom Classes](#pydantic-a-framework-that-simplifies-data-validation-in-custom-classes)

### Utility Functions

The `check()` function below is used in the exercises to provide feedback on progress.  Just run the cell that defines it and it'll work!

In [12]:
def check(code, expected, exception_message="", verbose=True):
    """
    A "pytest-lite" function.
    
    Takes code to evaluate and what's expected (a value, an exception type, or additionally a substring of the exception message).
    Returns whether the expectation was met, and prints a message describing the finding.
    """
    try:
        output = eval(code) 

    except BaseException as exc:
        output = exc
        if type(expected) == type and issubclass(expected, BaseException):
            valid_exception_type = isinstance(exc, expected)
            valid_exception_message = exception_message in str(exc)
            valid = valid_exception_type and valid_exception_message
            if not valid:
                if not valid_exception_type:
                    expected_str = expected.__name__
                else:
                    expected_str = '\"...' + exception_message + '...\"'
            else:
                expected_str = ''
        else:
            valid = False
            expected_str = str(expected)

        output_str = type(output).__name__
    
    else:
        if type(expected) == type and issubclass(expected, Exception):
            valid = False
            expected_str = expected.__name__
        elif type(expected) == type:
            valid = True
            expected_str = expected.__name__ 
            
        else:
            valid = output == expected
            expected_str = str(expected)

        if " object at " in str(output):
            output_str = type(output).__name__
        else:
            output_str = str(output)

    
    if verbose:
        valid_str = "✅" if valid else "❌"

        # output_str = output if not isinstance(output, Exception) else type(output).__name__
        print(valid_str, code, "->", output_str, "" if valid else f"(Expected: {expected_str})")
    return valid
        


### The "Guard Clause" Pattern: "Check Yourself Before You Break Yourself"

A guard clause is a short, early check at the top of a function or method that refuses to continue when something is off. No ceremony, no clever abstractions. You validate the input and immediately raise, instead of letting the code wander forward and fail three layers deeper.

Guard clauses are the simplest, most reliable form of defensive programming. 

**Example**: Make all the checks pass.

In [13]:
def greet(name):
    """
    Says Hi to whoever you want!
    """

    ## Guard Clauses Go Here: ######
    if isinstance(name, (float, int)):
        raise TypeError("`name` should be a string. You are not a number.")
    ##################################
    
    return f"Hi, {name}!"


check("greet('Nicholas')", "Hi, Nicholas!");
check("greet(24601)", TypeError, "You are not a number");

✅ greet('Nicholas') -> Hi, Nicholas! 
✅ greet(24601) -> TypeError 


**Exercise**: Make all the checks pass.

In [23]:
import numpy as np
from numbers import Real, Number

def total_length(x, y):
    """
    Computes the total of two lengths of wire.

    Arguments:
      - x: a positive number
      - y: another positive number

    """

    ## Guard Clauses Go Here: ####
    try:
        float(x)
    except TypeError:
        raise TypeError("has to be a number!!!!!")
    
    try:
        float(y)
    except TypeError:
        raise TypeError("has to be a number!!!!!")
    

    ##############################

    return x + y


check("total_length(3.2, 1.2)", 4.4)
check("total_length([1, 2], [])", TypeError, "number")
check("total_length(-3, 5)", ValueError, "positive")
check("total_length(3, -5.2)", ValueError, "positive")
check("total_length(3, 'a')", TypeError, "number")
check("total_length('hello, ', 'world')", TypeError, "number")
check("total_length(1., 2)", 3.)
check("total_length(np.float32(3), 3)", np.float32(6));

✅ total_length(3.2, 1.2) -> 4.4 
✅ total_length([1, 2], []) -> TypeError 
❌ total_length(-3, 5) -> 2 (Expected: ValueError)
❌ total_length(3, -5.2) -> -2.2 (Expected: ValueError)
❌ total_length(3, 'a') -> ValueError (Expected: TypeError)
❌ total_length('hello, ', 'world') -> ValueError (Expected: TypeError)
✅ total_length(1., 2) -> 3.0 
✅ total_length(np.float32(3), 3) -> 6.0 
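The failing checks above hint at a subtlety: `float('a')` raises `ValueError`, not `TypeError`, so a `try: float(x)` guard mixes up "wrong type" and "wrong value". One possible alternative, sketched here with a hypothetical helper named `require_positive_number`, is to test the type with `isinstance` against `numbers.Real` first and the value second. NumPy registers its integer and floating-point scalar types with these abstract classes, so values such as `np.float32(3)` should pass the same check.

```python
from numbers import Real


def require_positive_number(value, name):
    """Raise TypeError for non-numbers, then ValueError for non-positive numbers."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, Real):
        raise TypeError(f"{name} must be a real number, got {type(value).__name__}")
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value


require_positive_number(3.2, "x")  # passes and returns 3.2

for bad in ("a", [1, 2], -5.2):
    try:
        require_positive_number(bad, "x")
    except (TypeError, ValueError) as exc:
        print(type(exc).__name__, "-", exc)
```

Checking the type before the value matters: comparing a string or list with `<= 0` would itself raise a less helpful `TypeError` from inside the comparison.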


**Exercise**: Make all the checks pass.

In [5]:

def translate(rna):
    """
    Translate an RNA sequence into a string of one-letter amino acid codes.
    """

    ## Guard Clauses Go Here: ##################



    ############################################

    from urllib.request import urlopen
    import json

    codons_url = "https://raw.githubusercontent.com/nickdelgrosso/dna-transcription-kata/refs/heads/master/data/codons.json"
    with urlopen(codons_url) as response_c:
        peptides = json.loads(response_c.read())

    peptides_url = "https://raw.githubusercontent.com/nickdelgrosso/dna-transcription-kata/refs/heads/master/data/peptides.json"
    with urlopen(peptides_url) as response_p:
        peptides_shorts = json.loads(response_p.read())
    
    
    out = []
    for c0, c1, c2 in zip(rna[::3], rna[1::3], rna[2::3]):
        codon = (c0 + c1 + c2)
        peptide = peptides[codon]
        peptide_short = peptides_shorts[peptide.lower()]
        out.append(peptide_short)

    return "".join(out)
    

check("translate('CCC')", 'P');
check("translate('GCAUUA')", 'AL');
check("translate('gca')", ValueError, "upper")
check("translate('TTT')", ValueError, "GCAU")
check("translate('GG')", ValueError, "three")
# check("")


✅ translate('CCC') -> P 
✅ translate('GCAUUA') -> AL 
❌ translate('gca') -> KeyError (Expected: ValueError)
❌ translate('TTT') -> KeyError (Expected: ValueError)
❌ translate('GG') ->  (Expected: ValueError)


False

### Data Validation when Instantiating Objects

When you create an object, you’re claiming: “This thing represents something real and internally consistent.” Most bugs show up because that claim quietly isn’t true.

In OOP, the constructor `__init__` is the boundary where you decide what counts as a valid object. If you let invalid data slip through here, the error will surface later in a place that’s harder to diagnose. That leads to the classic Python debugging experience: the real mistake happened 40 lines earlier, but you only notice when something unrelated explodes.

So the rule is simple: **If your object must obey certain constraints, enforce them at creation time.**

**Example**:

In [6]:
class Rectangle:

    def __init__(self, length, width):

        self.length = length
        self.width = width

        ## Data Validation Goes Here: #################
        if not isinstance(self.length, (int, float)):
            raise TypeError("length must be a number.")
        if self.length <= 0:
            raise ValueError("length must be positive")
        
        if not isinstance(self.width, (int, float)):
            raise TypeError("width must be a number.")
        if self.width <= 0:
            raise ValueError("width must be positive")
        
        ###############################################
    

check("Rectangle(4, 5)", Rectangle);
check("Rectangle('wide', 'tall')", TypeError);
check("Rectangle(-2, 2)", ValueError, "positive");

✅ Rectangle(4, 5) -> Rectangle 
✅ Rectangle('wide', 'tall') -> TypeError 
✅ Rectangle(-2, 2) -> ValueError 


**Exercise**: Make all the checks pass.

In [3]:
class Person:

    def __init__(self, name, age) -> None:

        self.name = name
        self.age = age

        ## Data Validation Goes Here: ##########


        ####################################



check("Person('Nick', 37)", Person)
check("Person('Santa', 'old')", TypeError, "integer")
check("Person('', -200)", ValueError, "positive")
check("Person('', 12)", ValueError, "empty")

✅ Person('Nick', 37) -> Person 
❌ Person('Santa', 'old') -> Person (Expected: TypeError)
❌ Person('', -200) -> Person (Expected: ValueError)
❌ Person('', 12) -> Person (Expected: ValueError)


False

### Data Validation when Writing Dataclasses

Python classes are often just bags of data with a little validation sprinkled on top. Writing all the boilerplate (`__init__`, `__repr__`, comparisons, etc.) is tedious and error-prone.
Dataclasses solve this by generating the boring parts for you.

When you mark a class with `@dataclass`, Python automatically creates:

  - an `__init__` assigning your fields,
  - a readable `__repr__`,
  - and other convenience defaults.

However, dataclasses do **not** automatically ensure that your data is correct; that, we still have to write ourselves.  To provide a place for data validation, `__post_init__` runs immediately after the automatically generated `__init__`.
This is the hook where you enforce invariants: the things that must always be true for a valid instance.
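A small sketch of that point: the type annotations on a dataclass are not enforced at runtime, so without `__post_init__` any value is accepted. The classes `Unchecked` and `Checked` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Unchecked:
    mass_mg: float


@dataclass
class Checked:
    mass_mg: float

    def __post_init__(self):
        # Runs right after the generated __init__ has assigned self.mass_mg.
        if isinstance(self.mass_mg, bool) or not isinstance(self.mass_mg, (int, float)):
            raise TypeError("mass_mg must be a number")
        if self.mass_mg <= 0:
            raise ValueError("mass_mg must be positive")


print(Unchecked("heavy"))  # Unchecked(mass_mg='heavy'): the annotation is not enforced

try:
    Checked("heavy")
except TypeError as exc:
    print("rejected:", exc)  # rejected: mass_mg must be a number
```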

**Example**: Make all the checks pass.

In [None]:
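from typing import Optional

def append(num, values: Optional[list] = None):
    # Default to None rather than []: a mutable default is created once and shared between calls.
    if values is None:
        values = []
    values.append(num)
    return values



append(1)
append(2)
append(3)

[3]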

In [None]:
from typing import Union, Optional

x: list[Union[float, int]]  # a list whose items are floats or ints

y: Optional[float]  # either a float or None



In [None]:
from dataclasses import dataclass, field
from uuid import uuid4, UUID
from typing import Protocol

class Appendable(Protocol):
    """Structural type: anything that has an `append` method."""

    def append(self, item):
        ...


def append(val, values: Appendable):
    values.append(val)


append(3, [1, 2, 3])


In [None]:



@dataclass(frozen=True)
class Rectangle:
    length: float
    width: float
    colors: list[str] = field(default_factory=list)
    id: UUID = field(default_factory=uuid4, repr=False)
    _description = "Rectangle Shape"



    def __post_init__(self):
        if not isinstance(self.length, (int, float)):
            raise TypeError("length must be a number.")
        if self.length <= 0:
            raise ValueError("length must be positive")
        
        if not isinstance(self.width, (int, float)):
            raise TypeError("width must be a number.")
        if self.width <= 0:
            raise ValueError("width must be positive")
    

check("Rectangle(4, 5)", Rectangle);
check("Rectangle(-2, 2)", ValueError, "positive");

a = Rectangle(3, 4)
# a.length = 10  # would raise dataclasses.FrozenInstanceError, because frozen=True
a
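The example above uses two dataclass features that the text does not spell out: `frozen=True` and `field(default_factory=list)`. The sketch below, built around a hypothetical `Sample` class, shows what each one appears to buy you: every instance gets its own fresh list, and reassigning a field after creation is blocked.

```python
from dataclasses import dataclass, field, FrozenInstanceError


@dataclass(frozen=True)
class Sample:
    label: str
    tags: list[str] = field(default_factory=list)


first = Sample("a")
second = Sample("b")
first.tags.append("control")  # mutating the list itself is still possible
print(second.tags)            # []: each instance got its own list

try:
    first.label = "z"         # reassigning a field on a frozen instance
except FrozenInstanceError as exc:
    print("blocked:", exc)
```

Note that `frozen=True` only prevents reassignment of fields; it does not make the objects stored in those fields immutable.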

**Exercise**: Make all the checks pass.

In [None]:
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

    def __post_init__(self):
        ...
        ## Guard Clauses Go Here: #########
        

        ###################################


check("Person('Nick', 37)", Person)
check("Person('Santa', 'old')", TypeError, "integer")
check("Person('', -200)", ValueError, "positive")
check("Person('', 12)", ValueError, "empty")

### Pydantic: a Framework that simplifies Data Validation in Custom Classes

Manual guard clauses are fine for simple functions, but they get tiresome the moment you start defining structured objects — experiment configs, stimulus definitions, trial parameters, behavioral logs, etc. You end up repeating checks, writing boilerplate, and missing edge cases.

Pydantic exists to remove that tedium. It wraps your class in a validation layer that:

  - Enforces types automatically.
  - Runs field-level validation without you writing the same guard clauses over and over.
  - Builds errors that are actually readable, instead of stack traces buried in your own code.
  - Makes malformed data impossible to instantiate, which is exactly what you want for models representing “real-world” entities.

The key idea: your class shouldn’t exist in an invalid state. Pydantic makes that rule the default, not something you hope developers remember.

If your analysis pipeline depends on structured configuration or repeatedly loaded data formats, Pydantic pays for itself immediately. It standardizes validation, cuts boilerplate, and forces correctness at the boundary — before bad inputs poison the rest of your workflow.

**Example**: Make all the checks pass.

In [50]:
from pydantic.dataclasses import dataclass as p_dataclass

@p_dataclass
class Rectangle:
    length: float
    width: float


Rectangle(3, 'wide')

    


ValidationError: 1 validation error for Rectangle
1
  Input should be a valid number, unable to parse string as a number [type=float_parsing, input_value='wide', input_type=str]
    For further information visit https://errors.pydantic.dev/2.12/v/float_parsing
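Type checks come for free, but constraints such as "must be positive" still need to be declared. One way to do that, sketched here under the assumption that Pydantic v2 is installed, is a `field_validator`. The `Sample` class and its fields are hypothetical.

```python
from pydantic import ValidationError, field_validator
from pydantic.dataclasses import dataclass as p_dataclass


@p_dataclass
class Sample:
    label: str
    mass_mg: float

    @field_validator("mass_mg")
    @classmethod
    def mass_must_be_positive(cls, value):
        if value <= 0:
            raise ValueError("mass_mg must be positive")
        return value  # the returned value is what gets stored on the field


print(Sample("a", "2.5"))  # the string "2.5" is coerced to the float 2.5

try:
    Sample("b", -1)
except ValidationError as exc:
    print(exc)
```

Two details are easy to miss: Pydantic's default mode converts compatible inputs (such as numeric strings) rather than rejecting them, and a validator that forgets to `return value` stores `None` on the field.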

**Exercise**: Make all the checks pass.

In [None]:
from pydantic import ValidationError
from pydantic.dataclasses import dataclass as p_dataclass

@p_dataclass
class Person:
    name: str
    age: int

    ## Add field validators Here: ####



    ###################################


check("Person('Nick', 37)", Person);
check("Person('Santa', 'old')", ValidationError, "integer");
check("Person('', -200)", ValidationError, "positive");
check("Person('', 12)", ValidationError, "empty");

✅ Person('Nick', 37) -> Person(name=None, age=None) 
✅ Person('Santa', 'old') -> ValidationError 
✅ Person('', -200) -> ValidationError 
✅ Person('', 12) -> ValidationError 



## Conclusion


Data validation isn’t decoration; it’s the difference between code that quietly corrupts results and code you can trust. The pattern is always the same:

  - Reject invalid data early.
  - Fail fast and loudly.
  - Make illegal states unrepresentable.

Guard clauses handle the simple cases.
A dataclass's `__post_init__` gives you a clean place to enforce invariants.
Pydantic scales the whole approach when your objects get complicated.

The Take-Away: **Good validation eliminates entire classes of bugs before they exist.**

**Q**: What about assertions?

In [11]:
def total_length(x, y):
    if x < 0:
        raise AssertionError("x must be positive")


total_length(-3, 3)

AssertionError: x must be positive

In [None]:
def total_length(x, y):
    assert x >= 0, "x must be positive"
    assert y >= 0, "y must be positive"

total_length(-3, 3)

AssertionError: x must be positive

In [None]:
def withdraw(account, money):
    assert account.balance - money >= 0
    account.balance -= money
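One common answer, offered here as a hedged note rather than a rule: `assert` statements are skipped entirely when Python runs with the `-O` flag, so they suit checks on the programmer's own assumptions better than validation of outside input. The sketch below rewrites `withdraw` with explicit guard clauses; the `Account` class is a hypothetical stand-in for whatever `account` is in the cell above.

```python
from dataclasses import dataclass


@dataclass
class Account:  # hypothetical stand-in for the `account` object used above
    balance: float


def withdraw(account, money):
    # An explicit raise still runs under `python -O`, unlike an assert.
    if money <= 0:
        raise ValueError("money must be positive")
    if account.balance - money < 0:
        raise ValueError("insufficient funds")
    account.balance -= money


acct = Account(balance=100.0)
withdraw(acct, 30.0)
print(acct.balance)  # 70.0
```

With the `assert` version, running the script in optimized mode would silently allow the balance to go negative.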