# Statistical Typing: A Runtime Typing System for Data Science and Machine Learning

### Niels Bantilan

PyCon, May 15th, 2021

#### Type systems help programmers reason about code and can make programs more computationally efficient.

In [None]:
from typing import Union

Number = Union[int, float]

def add_and_double(x: Number, y: Number) -> Number:
    ...

<br>
Can you predict the outcome of these function calls?

In [None]:
add_and_double(5, 2)
add_and_double(5, "hello")
add_and_double(11.5, -1.5)

## 🤔 What would a type system geared towards data science and machine learning look like?

## Outline

- 🐞🐞🐞 introduce you to some of my problems

- 📊📈 define a specification for data types in the statistical domain

- 🤯 make you realize that you've been doing statistical typing all along

- 🛠 demonstrate one way it might be put into practice using `pandera`

- 🏎 discuss where this idea can go next

## 🐞🐞🐞 An Introduction to Some of my Problems

### The worst bugs are the silent ones, especially if they're in ML models that took a lot of ⏰ to train

<br>
- `model ~ data`

- `Δdata -> Δmodel`

- I define my `model` as a function `f(x) -> y`

- Suppose I'm using `f` in an important business/scientific process

- How do I know if `f` is working as intended?

### Static Type-checking/Linting

Catches certain type errors before running code

In [None]:
from typing import Union

Number = Union[int, float]

def add_and_double(x: Number, y: Number) -> Number:
    return (x + y) * 2

```python
add_and_double(5, 2)        # ✅
add_and_double(5, "hello")  # ❌
add_and_double(11.5, -1.5)  # ✅
```

### Static Type-checking/Linting

**Problem:** What if the underlying implementation is wrong?

In [None]:
from typing import Union

Number = Union[int, float]

def add_and_double(x: Number, y: Number) -> Number:
    return (x - y) * 4

```python
add_and_double(5, 2)        # output: 12
add_and_double(5, "hello")  # raises: TypeError
add_and_double(11.5, -1.5)  # output: 52.0
```

### Unit Tests

Unit tests verify the behavior of isolated pieces of functionality and
let you know when changes cause breakages or regressions.

In [None]:
import pytest

def test_add_and_double():
    # 🙂 path
    assert add_and_double(5, 2) == 14
    assert add_and_double(11.5, -1.5) == 20.0
    assert add_and_double(-10, 1.0) == -18.0

def test_add_and_double_exceptions():
    # 😞 path
    with pytest.raises(TypeError):
        add_and_double(5, "hello")
    with pytest.raises(TypeError):
        add_and_double("world", 32.5)

### Property-based Tests

Property-based testing alleviates the burden of explicitly writing test cases

In [None]:
import pytest
from hypothesis import given
from hypothesis.strategies import integers, floats, one_of, text

# exclude NaN and infinities: NaN != NaN, and inf + (-inf) is NaN
numbers = one_of(integers(), floats(allow_nan=False, allow_infinity=False))

@given(x=numbers, y=numbers)
def test_add_and_double(x, y):
    assert add_and_double(x, y) == (x + y) * 2

@given(x=numbers, y=text())
def test_add_and_double_exceptions(x, y):
    with pytest.raises(TypeError):
        add_and_double(x, y)

### 🔎 ⛏ Testing code is hard, testing statistical analysis code is harder!

#### Toy Example: Training a "Which Disney Character Are You?" Model

In [None]:
from typing import List, TypedDict

Response = TypedDict("Response", {"q1": int, "q2": int, "q3": str})
Example = List[float]

def store_data(raw_response: str) -> Response:
    ...

def create_dataset(raw_responses: List[Response], other_features: List[Example]) -> List[Example]:
    ...

def train_which_disney_character_are_you_model(survey_responses: List[Example]) -> str:
    ...

<br>
- `store_data`'s scope of concern is atomic, i.e. it only operates
  on a single data point 🧘⚛

Easy to write test cases

- `create_dataset` needs to worry about the statistical patterns of a
  sample of data points 😓📊

- So what if I want to test `create_dataset` on plausible example data?

Difficult to write test cases

### 🤲 📀 🖼 hand-crafting example dataframes is a major barrier for unit testing.

<br>
... it's not fun 😭

### What if I could do something like...

In [None]:
from IPython.display import display

In [None]:
import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    variable1: Series[int] = pa.Field(ge=0)
    variable2: Series[float] = pa.Field(in_range={"min_value": 0, "max_value": 1})
    variable3: Series[str] = pa.Field(isin=list("abc"))

sample_data = Schema.example(size=5)
display(sample_data.head(3))

In [None]:
sample_data["variable1"] = sample_data["variable1"] * -1
try:
    Schema.validate(sample_data)
except Exception as e:
    print(e)

I won't say much else about hand-crafting example dataframes, except that
I'm not a big fan. It's really tedious.

## 📊📈 Define a Specification for Data Types in the Statistical Domain

> Statistical typing extends basic scalar data types with additional
> semantics about the properties held by a collection of data points

### `Boolean → Bernoulli`

```python
x1 = True
x2 = False
```

```python
support: Set[bool] = {x1, x2}
probability_distribution: Dict[bool, float] = {True: 0.5, False: 0.5}
FairCoin = Bernoulli(support, probability_distribution)
```

```python
data: FairCoin = [1, 0, 0, 1, 1, 0]

mean(data)
chi_squared(data)
```

### `Enum → Categorical`

```python
class Animal(Enum):
    CAT = 1
    DOG = 2
    COW = 3
    OTHER = 4
```

```python
FarmAnimals = Categorical(
    Animal,
    probabilities={
        Animal.CAT: 0.01,
        Animal.DOG: 0.04,
        Animal.COW: 0.5,
        Animal.OTHER: 0.45,
    },
    ordered=False,
)
```

```python
data: FarmAnimals = [Animal.CAT] * 50 + [Animal.DOG] * 50

check_type(data)  # raise a TypeError
```
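One way a `check_type` like the one above could work is a chi-squared goodness-of-fit test against the declared probabilities. The sketch below is stdlib-only and hypothetical: this `Categorical` dataclass, the `chi_squared` helper, and the `critical_value` field are illustrative assumptions, and 7.815 is the chi-squared cutoff for 3 degrees of freedom at the 5% significance level.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Animal(Enum):
    CAT = 1
    DOG = 2
    COW = 3
    OTHER = 4


@dataclass(frozen=True)
class Categorical:
    """Hypothetical statistical type: categories plus expected probabilities."""
    probabilities: Dict[Animal, float]
    critical_value: float  # chi-squared cutoff chosen by the caller


def chi_squared(data: List[Animal], probabilities: Dict[Animal, float]) -> float:
    """Goodness-of-fit statistic: sum of (observed - expected)^2 / expected."""
    counts = Counter(data)
    n = len(data)
    return sum(
        (counts[category] - n * p) ** 2 / (n * p)
        for category, p in probabilities.items()
    )


def check_type(data: List[Animal], stat_type: Categorical) -> None:
    """Raise TypeError if the sample is implausible under the expected probabilities."""
    statistic = chi_squared(data, stat_type.probabilities)
    if statistic > stat_type.critical_value:
        raise TypeError(
            f"chi-squared statistic {statistic:.1f} exceeds {stat_type.critical_value}"
        )


FarmAnimals = Categorical(
    probabilities={Animal.CAT: 0.01, Animal.DOG: 0.04, Animal.COW: 0.5, Animal.OTHER: 0.45},
    critical_value=7.815,  # assumption: 3 degrees of freedom, 5% significance level
)
```

With this sketch, a sample of 50 cats and 50 dogs is rejected because cows and "other" animals are expected to dominate, while a sample whose counts match the expected proportions passes.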

### `Int → Poisson`

```python
PatientsAdmitted = Poisson(expected_rate=10, interval=datetime.timedelta(days=1))
```

```python 
data: List[int] = sample(PatientsAdmitted)
```

```python 
assert all(x >= 0 for x in data)
```
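A minimal sketch of what a samplable `Poisson` type might look like, using only the standard library. The dataclass, its `sample` and `validate` methods, and the `size`/`seed` parameters are assumptions made for illustration; sampling uses Knuth's multiplication method, which is only practical for small rates, and `interval` is carried as metadata.

```python
import datetime
import math
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Poisson:
    """Hypothetical statistical type for event counts per interval."""
    expected_rate: float
    interval: datetime.timedelta

    def sample(self, size: int, seed: int = 0) -> List[int]:
        """Draw counts with Knuth's multiplication method (fine for small rates)."""
        rng = random.Random(seed)
        threshold = math.exp(-self.expected_rate)
        draws = []
        for _ in range(size):
            count, product = 0, rng.random()
            # multiply uniforms until the product drops below exp(-rate)
            while product > threshold:
                count += 1
                product *= rng.random()
            draws.append(count)
        return draws

    def validate(self, data: List[int]) -> bool:
        """Deterministic properties of a Poisson variable: non-negative integers."""
        return all(isinstance(x, int) and x >= 0 for x in data)


PatientsAdmitted = Poisson(expected_rate=10, interval=datetime.timedelta(days=1))
data = PatientsAdmitted.sample(size=100)
```

The same object both generates data and states the property that generated data must satisfy, which is what the `assert` above relies on.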

### `Float → Gaussian`

```python
TreeHeightMeters = Gaussian(mean=10, standard_deviation=1)
```

```python  
def test_process_data():
    data: List[float] = sample(TreeHeightMeters)
    result = mean(data)
    assert 8 <= result <= 12
```
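The bounds in a test like this do not have to be hard-coded: they could be derived from the type itself. The sketch below assumes a hypothetical `Gaussian` dataclass whose `mean_is_plausible` method accepts a sample mean within a chosen number of standard errors of the declared mean. The method name and the default of 4 standard errors are illustrative choices.

```python
import math
import random
from dataclasses import dataclass
from statistics import fmean
from typing import List


@dataclass(frozen=True)
class Gaussian:
    """Hypothetical statistical type for a normally distributed float."""
    mean: float
    standard_deviation: float

    def sample(self, size: int, seed: int = 0) -> List[float]:
        """Draw `size` values from the declared distribution."""
        rng = random.Random(seed)
        return [rng.gauss(self.mean, self.standard_deviation) for _ in range(size)]

    def mean_is_plausible(self, data: List[float], n_standard_errors: float = 4.0) -> bool:
        """Check the sample mean against the declared mean +/- k standard errors."""
        standard_error = self.standard_deviation / math.sqrt(len(data))
        return abs(fmean(data) - self.mean) <= n_standard_errors * standard_error


TreeHeightMeters = Gaussian(mean=10, standard_deviation=1)
data = TreeHeightMeters.sample(size=1000)
```

Because the tolerance shrinks with the square root of the sample size, larger samples are held to a tighter bound than the fixed `8 <= result <= 12` range.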

### Statistical Type Specification: Types as Schemas

For each variable in my dataset, define:

- **basic datatype**: `int`, `float`, `bool`, `str`, etc.

- **deterministic properties**: domain of possible values, e.g. `x >= 0`

- **probabilistic properties**: distributions that apply to the variable and
  their sufficient statistics, e.g. `mean` and `standard deviation`
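
The three layers above can be sketched as a single container with one validation method. Everything here is an illustrative assumption rather than an existing API: the `StatisticalType` name, its fields, and the choice of a mean range as the only probabilistic property.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class StatisticalType:
    """Hypothetical container for the three layers of a statistical type."""
    dtype: type                          # basic datatype
    domain: Callable[[float], bool]      # deterministic property, checked per value
    mean_bounds: Tuple[float, float]     # probabilistic property, checked per sample

    def validate(self, data: List) -> List[str]:
        """Return human-readable failures; an empty list means the data conforms."""
        if not all(isinstance(x, self.dtype) for x in data):
            # later checks assume the right datatype, so stop here
            return [f"expected values of type {self.dtype.__name__}"]
        failures = []
        if not all(self.domain(x) for x in data):
            failures.append("value outside the allowed domain")
        low, high = self.mean_bounds
        if not low <= fmean(data) <= high:
            failures.append(f"sample mean outside [{low}, {high}]")
        return failures


TreeHeightMeters = StatisticalType(dtype=float, domain=lambda x: x > 0, mean_bounds=(8, 12))
```

Returning a list of failures instead of raising on the first one makes it possible to report a domain violation and a distribution violation together.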

## Have you ever done something like this?

In [7]:
import math

def normalize(x: List[float]):
    """Mean-center and scale with standard deviation"""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((i - mean) ** 2 for i in x) / len(x))
    x_norm = [(i - mean) / std for i in x]

    # runtime assertions
    assert any(i < 0 for i in x_norm)
    assert any(i > 0 for i in x_norm)

    return x_norm

#### 🤯 You've Been Doing Statistical Typing All Along

## Implications

<br>
Some statistical properties can be checked statically, e.g. the mean operation cannot be applied to categorical data
```python
mean(categorical)  # ❌
```

<br>
Others can only be checked at runtime, e.g. this sample of data is drawn from a Gaussian of particular parameters
```python
scipy.stats.normaltest(normalize(raw_data))
```
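
For readers without SciPy at hand, a runtime normality check can be approximated with the standard library. Note that this sketch swaps in the Jarque-Bera statistic rather than reproducing `scipy.stats.normaltest`; the function names are hypothetical, and 5.991 is the chi-squared cutoff for 2 degrees of freedom at the 5% level, which is only a rough guide for small samples.

```python
from statistics import NormalDist, fmean
from typing import List


def jarque_bera(data: List[float]) -> float:
    """Jarque-Bera statistic: n/6 * (skewness^2 + (kurtosis - 3)^2 / 4)."""
    n = len(data)
    mean = fmean(data)
    # central moments of the sample
    m2 = fmean([(x - mean) ** 2 for x in data])
    m3 = fmean([(x - mean) ** 3 for x in data])
    m4 = fmean([(x - mean) ** 4 for x in data])
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


def looks_gaussian(data: List[float], critical_value: float = 5.991) -> bool:
    """Accept the sample when the statistic stays below the chosen cutoff."""
    return jarque_bera(data) <= critical_value


# evenly spaced quantiles of a standard normal: a bell-shaped sample
bell_shaped = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
# squaring the values produces a strongly right-skewed sample
skewed = [x ** 2 for x in bell_shaped]
```

Here `looks_gaussian(bell_shaped)` accepts the sample and `looks_gaussian(skewed)` rejects it, which is the kind of property that only the data, not the code, can reveal.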

<br>
Schemas can be implemented as generative data contracts that can be used for type checking and sampling

## 🛠 Statistical Typing in Practice with `pandera`

Suppose we're building a predictive model of house prices given features about different houses:

In [2]:
raw_data = """
square_footage,n_bedrooms,property_type,price
750,1,condo,200000
900,2,condo,400000
1200,2,house,500000
1100,3,house,450000
1000,2,condo,300000
1000,2,townhouse,300000
1200,2,townhouse,350000
"""

<br>
- `square_footage`: positive integer

- `n_bedrooms`: positive integer

- `property_type`: categorical

- 🎯 `price`: positive real number

### Pipeline

In [3]:
def process_data(raw_data):  # step 1: prepare data for model training
    ...
    
def train_model(processed_data): # step 2: fit a model on processed data
    ...

### Define Schemas with `pandera`

In [4]:
import pandera as pa
from pandera.typing import Series, DataFrame

PROPERTY_TYPES = ["condo", "townhouse", "house"]


class BaseSchema(pa.SchemaModel):
    square_footage: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 3000})
    n_bedrooms: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 10})
    price: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 1000000})

    class Config:
        coerce = True


class RawData(BaseSchema):
    property_type: Series[str] = pa.Field(isin=PROPERTY_TYPES)


class ProcessedData(BaseSchema):
    property_type_condo: Series[int] = pa.Field(isin=[0, 1])
    property_type_house: Series[int] = pa.Field(isin=[0, 1])    
    property_type_townhouse: Series[int] = pa.Field(isin=[0, 1])

### Pipeline

With Type Annotations

In [5]:

def process_data(raw_data: DataFrame[RawData]) -> DataFrame[ProcessedData]:
    ...
    
def train_model(processed_data: DataFrame[ProcessedData]):
    ...

### Pipeline

With Implementation

In [6]:
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression


@pa.check_types
def process_data(raw_data: DataFrame[RawData]) -> DataFrame[ProcessedData]:
    return pd.get_dummies(
        raw_data.astype({"property_type": pd.CategoricalDtype(PROPERTY_TYPES)})
    )


@pa.check_types
def train_model(processed_data: DataFrame[ProcessedData]) -> BaseEstimator:
    return LinearRegression().fit(
        X=processed_data.drop("price", axis=1),
        y=processed_data["price"],
    )

### Running the Pipeline

Validate the statistical type of raw and processed data every time we
run our pipeline.

In [7]:
from io import StringIO


def run_pipeline(raw_data):
    processed_data = process_data(raw_data)
    estimator = train_model(processed_data)
    # evaluate model, save artifacts, etc...
    print("✅ model training successful!")


run_pipeline(pd.read_csv(StringIO(raw_data.strip())))

✅ model training successful!


### Fail Early and with Useful Information

In [8]:
invalid_data = """
square_footage,n_bedrooms,property_type,price
750,1,unknown,200000
900,2,condo,400000
1200,2,house,500000
"""

try:
    run_pipeline(pd.read_csv(StringIO(invalid_data.strip())))
except Exception as e:
    print(e)

error in check_types decorator of function 'process_data': <Schema Column(name=property_type, type=<class 'str'>)> failed element-wise validator 0:
<Check isin: isin({'house', 'condo', 'townhouse'})>
failure cases:
   index failure_case
0      0      unknown


### Schemas as Generative Contracts

Define property-based unit tests with `hypothesis`

In [9]:
from hypothesis import given


@given(RawData.strategy(size=3))
def test_process_data(raw_data):
    process_data(raw_data)

    
@given(ProcessedData.strategy(size=3))
def test_train_model(processed_data):
    estimator = train_model(processed_data)
    predictions = estimator.predict(processed_data.drop("price", axis=1))
    assert len(predictions) == processed_data.shape[0]

<br>
Run test suite

In [10]:
def run_test_suite():
    test_process_data()
    test_train_model()
    print("✅ tests successful!")    
    
run_test_suite()

✅ tests successful!


### Catch Errors in Data Processing Code

Property-based tests catch a `process_data` implementation that returns its input unchanged

In [11]:
@pa.check_types
def process_data(raw_data: DataFrame[RawData]) -> DataFrame[ProcessedData]:
    return raw_data

try:
    run_test_suite()
except Exception as e:
    print(e)

Falsifying example: test_process_data(
    raw_data=   square_footage  n_bedrooms  price property_type
    0               0           0      0         condo
    1               0           0      0         condo
    2               0           0      0         condo,
)
error in check_types decorator of function 'process_data': column 'property_type_condo' not in dataframe
   square_footage  n_bedrooms  price property_type
0               0           0      0         condo
1               0           0      0         condo
2               0           0      0         condo


### Bootstrapping a Schema from Sample Data

For some datasets, it might make sense to infer a schema from a sample of
data and go from there:

In [18]:
raw_df = pd.read_csv(StringIO(raw_data.strip()))
display(raw_df.head(3))

   square_footage  n_bedrooms property_type   price
0             750           1         condo  200000
1             900           2         condo  400000
2            1200           2         house  500000


In [22]:
schema = pa.infer_schema(raw_df)
schema.to_yaml()    # serialize the schema as YAML
schema.to_script()  # or as a Python script that reconstructs the schema
print(schema)

<Schema DataFrameSchema(
    columns={
        'square_footage': <Schema Column(name=square_footage, type=int64)>
        'n_bedrooms': <Schema Column(name=n_bedrooms, type=int64)>
        'property_type': <Schema Column(name=property_type, type=str)>
        'price': <Schema Column(name=price, type=int64)>
    },
    checks=[],
    coerce=True,
    pandas_dtype=None,
    index=<Schema Index(name=None, type=int64)>,
    strict=False
    name=None,
    ordered=False
)>
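
To make the idea of schema inference concrete, here is a stdlib-only sketch that derives a basic datatype and simple value constraints from CSV text. It is not how `pa.infer_schema` works internally; the `infer_column_types` function and the dictionary layout are assumptions made for illustration, and the sample rows are the first three from the housing data.

```python
import csv
from io import StringIO
from typing import Dict


def infer_column_types(csv_text: str) -> Dict[str, dict]:
    """Infer a datatype per column, plus bounds for ints or allowed values for strings."""
    rows = list(csv.DictReader(StringIO(csv_text.strip())))
    schema = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        try:
            numbers = [int(v) for v in values]
        except ValueError:
            # not all integers: treat as a categorical string column
            schema[column] = {"dtype": "str", "isin": sorted(set(values))}
        else:
            schema[column] = {
                "dtype": "int",
                "min_value": min(numbers),
                "max_value": max(numbers),
            }
    return schema


sample_csv = """
square_footage,n_bedrooms,property_type,price
750,1,condo,200000
900,2,condo,400000
1200,2,house,500000
"""
inferred = infer_column_types(sample_csv)
```

An inferred schema like this is only a starting point: bounds taken from a small sample are usually too tight and need to be widened by hand.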


## 🪛🪓🪚 Use Cases

- CI tests for ETL/model training pipeline
- Alerting for dataset shift
- Monitoring model quality in production

## 🏎 Where Can this Idea Go Next?

### Statically analyze code that performs statistical operations
```python
data: FarmAnimals = [Animal.CAT] * 50 + [Animal.DOG] * 50
mean(data)  # ❌ cannot apply mean to Categorical
```

### Infer model architecture space based on function signatures
```python
def model(input_data: Normal) -> Bernoulli:
    ...

type(model)
# [LogisticRegression, RandomForestClassifier, ...]
```
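
A speculative sketch of how such a lookup might be wired up with plain type hints. The `CANDIDATE_MODELS` registry, the placeholder `Normal` and `Bernoulli` classes, and the `candidate_models` function are all hypothetical; the model names are plain strings, and the `(Normal, Normal)` entry is an added assumption.

```python
from typing import Dict, List, Tuple, get_type_hints


class Normal:
    """Placeholder statistical type for continuous, Gaussian-like data."""


class Bernoulli:
    """Placeholder statistical type for binary outcomes."""


# hypothetical registry from (input type, output type) to candidate model names
CANDIDATE_MODELS: Dict[Tuple[type, type], List[str]] = {
    (Normal, Bernoulli): ["LogisticRegression", "RandomForestClassifier"],
    (Normal, Normal): ["LinearRegression"],
}


def model(input_data: Normal) -> Bernoulli:
    ...


def candidate_models(func) -> List[str]:
    """Look up candidate model names from a function's annotations."""
    hints = get_type_hints(func)
    output_type = hints.pop("return")
    (input_type,) = hints.values()  # assumes exactly one annotated argument
    return CANDIDATE_MODELS.get((input_type, output_type), [])
```

Calling `candidate_models(model)` returns the two classifier names, since the signature maps Gaussian inputs to a binary output.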

### Infer Statistical Types from Data

Model-based statistical types


- schema inference can be arbitrarily complex

- statistical types can also be arbitrarily complex

- data can be encoded as statistical models, and those model artifacts can be
  used as components in a schema

## GAN Schema

In theory, a generative adversarial network can be used as a schema to validate
real-world data and generate synthetic data

## GAN Schema

The discriminator, which is typically discarded after training, can validate
real or upstream synthetic data.

### Validation and Data Synthesis for Complex Statistical Types

In [31]:
dataframe = pd.DataFrame({
    "category": ["cat", "dog", "cow", "horse", "..."],
    "images": ["image1.jpeg", "image2.jpeg", "image3.jpeg", "image4.jpeg", "..."],
})
display(dataframe)

   category       images
0       cat  image1.jpeg
1       dog  image2.jpeg
2       cow  image3.jpeg
3     horse  image4.jpeg
4       ...          ...


```python
class ImageSchema(pa.SchemaModel):
    category: Series[str] = pa.Field(isin=["cat", "dog", "cow", "horse", "..."])
    images: Series[Image] = pa.Field(drawn_from=GenerativeAdversarialNetwork("weights.pt"))
```

## Takeaway

Statistical typing extends basic data types into the statistical domain,
opening up a bunch of testing capabilities that make statistical code
more robust and easier to reason about.

## Thanks!