# Dataclasses For The Masses
> Make big data small again!

- toc: true 
- badges: true
- comments: true
- categories: [attrs, dataclasses, numpy, pandas, OLAP]
- image: images/arrays-hector-j-rivas-87hFrPk3V-s-unsplash.jpeg

## DaaS Made Easy

Many data analysts, data engineers, and data scientists (trying to be inclusive here) build tables for a living.  The product of their work is data, and the desired features of that product vary from customer to customer.  A finance executive prefers a bunch of spreadsheets.  An operations team likes its dashboards.  For consumption within the data team, you open up a dataframes-as-a-service shop.  If you are pursuing enterprise-grade solutions, there's a high chance they will be centered around [OLAP databases](https://blog.treasuredata.com/blog/2016/02/10/whats-the-difference-between-aws-redshift-aurora/).

I would like to share a Python tool that helps me design, document, and deploy my tables.  The goal was to make the data class declaration as concise as it gets, so that one can prototype rapidly and arrive at the optimal data model.

```
class SweetTable:
    a: datatype1
    b: datatype2
```

You might notice parallels to ORM solutions like [SQLAlchemy](https://docs.sqlalchemy.org/en/14/orm/quickstart.html) and [Django](https://docs.djangoproject.com/en/4.1/topics/db/models/).

> Django `Model` is the single, definitive source of information about your data. It contains the essential fields and behaviors of the data you’re storing. Generally, each model maps to a single database table.  



```
from django.db import models

class Musician(models.Model):
    first_name = models.CharField(max_length=50)
    last_name = models.CharField(max_length=50)
    instrument = models.CharField(max_length=100)

class Album(models.Model):
    artist = models.ForeignKey(Musician, on_delete=models.CASCADE)
    name = models.CharField(max_length=100)
    release_date = models.DateField()
    num_stars = models.IntegerField()
```

The main issue with existing Python ORMs is that they are made for transactional tasks, where row-by-row data manipulation is common and data relationships are important.  I want none of that.  What do I want?  I want what The Most Interesting Data Engineer In The World (TMIDEITW) wants.

**What does TMIDEITW want?**
+ Her Parquet files smell of rich mahogany.  
+ Her data types are enforced on every continent.  Even the NULLs.  
+ She does not often add data. But when she does, she adds a billion rows.  


In less interesting but clearer terms, we want to construct **dataclasses** for a future **columnar database**:
+ efficient, typed, nullable data structures
+ interfaces with pandas for familiar analytical tasks 
+ on-demand conversion to `pyarrow` objects, en route to Parquet files

**Why not `pandas` or `arrow` or...?**

> I strongly feel that [Arrow](https://arrow.apache.org/docs/python/index.html) is a key technology for the next generation of data science tools. I laid out my vision for this recently in my 2017 JupyterCon keynote.

*&ndash;Wes McKinney (creator of Pandas)*

Pandas rules.  It's like a Swiss Army knife: you get all kinds of tools, but none of them is optimal for its task.  The price you pay: it's slow.  Pandas defaults to the most accommodating data types: `object`, `np.float64`, `np.int64`.  Can some of your fields be cast to something more efficient?  My loose definition of the boundary between *small data* and *big data* is the amount of RAM in your laptop.  Optimizing your data structures, starting with data types, can make "big" data "small" again.
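
As a rough illustration of how much the dtype choice matters, the sketch below stores the same made-up values once as `np.int64` and once as `np.int8`.  The data is invented for the example; the only assumption is that every value fits into eight bits.

```python
import numpy as np

# One million small integers stored as 64-bit integers.
values = np.arange(1_000_000, dtype=np.int64) % 100

# All values lie in [0, 99], so they also fit into 8-bit integers.
compact = values.astype(np.int8)

print(values.nbytes)                    # 8 bytes per element
print(compact.nbytes)                   # 1 byte per element
print(np.array_equal(values, compact))  # the values themselves are unchanged
```

The downcast is only safe when the value range is known in advance, which is one reason to declare the types up front.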

Arrow is harder to say no to.  It's efficient, typed, and nullable.  There are [neat pandas-like functions](https://arrow.apache.org/docs/python/compute.html) not offered by the more math-heavy numpy.  The library is large, but it's already required for writing Parquet files.  Probably the main reason for not choosing pyarrow is the desire to meet the data pros where they are, and today this means `numpy`.  But don't worry: both `pd.DataFrame` and the pyarrow table will be at your fingertips.  We do not consider [Python's native typed arrays](https://docs.python.org/3/library/array.html), not to be confused with lists!  The fact that one needs to be reminded that they're not the same says a lot about how rarely they are used.
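
For readers who have never met the standard-library `array` module, here is a small sketch of how it differs from a list.  The values are borrowed from the examples further down; nothing else in this post depends on it.

```python
from array import array

# A typed array of signed 8-bit integers (type code 'b').
typed = array('b', [-123, 1, 99])
plain = [-123, 1, 99]

print(typed.itemsize)           # bytes used per element
print(typed.tolist() == plain)  # same values as the list

# Unlike a list, a typed array rejects elements of the wrong type.
try:
    typed.append('oops')
except TypeError as err:
    print(f"rejected: {err}")
```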

## Implementation

Read through the story or <a href="#In-The-End">skip to the point</a>.

### Minimal Example

Let's start with a very simple class with two short `numpy` arrays as predefined attributes.  We'll also define a function that performs a simple analytical task, a row-wise sum.  I spell out `Two` in the class name to remind me that we have two data elements.

In [1]:
import numpy as np

class SimpleDataclassTwo:
    a = np.array([-123, 1, 99])
    b = np.array([1234, 0, -9876])
    
    def row_sum(self):
        return self.a + self.b

simpledata2 = SimpleDataclassTwo()
simpledata2.row_sum()

array([ 1111,     1, -9777])

In [2]:
simpledata2

<__main__.SimpleDataclassTwo at 0x7fda41f5ef70>

That was not very helpful... but at least we can access the attributes:

In [3]:
simpledata2.a

array([-123,    1,   99])

In [4]:
simpledata2.b

array([ 1234,     0, -9876])

Let's try to make another instance:

In [5]:
try:
    SimpleDataclassTwo(np.array([1,2]), np.array([41, 40]))
except TypeError:
    print("failed, because we don't have __init__")

failed, because we don't have __init__


...and that's about it.  Before we start writing `__init__` and `__repr__`, let me introduce you to your new best friend: `attrs`
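
For comparison, this is roughly the boilerplate we would have to write by hand.  `ManualDataclassTwo` is a hypothetical name used only for this sketch.

```python
import numpy as np

class ManualDataclassTwo:
    """Hand-written version of the boilerplate a decorator could generate."""

    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __repr__(self):
        return f"ManualDataclassTwo(a={self.a!r}, b={self.b!r})"

    def row_sum(self):
        return self.a + self.b

manual = ManualDataclassTwo(np.array([1, 2]), np.array([41, 40]))
print(manual)
print(manual.row_sum())
```

Every new field means touching `__init__` and `__repr__` again, which gets old quickly when you are still prototyping the data model.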

### Attrs

> `attrs` is the Python package that will bring back the joy of writing classes.  Trusted by NASA for Mars missions since 2020!

&ndash; https://www.attrs.org/en/stable/index.html

NB: if you've previously used `attrs`, the recommended API [recently](https://www.attrs.org/en/stable/changelog.html#id40) changed from `@attr.s(auto_attribs=True)` to `@attrs.define`.

In [6]:
import attr
import attrs

In [7]:
@attrs.define(slots=False)
class AttrsDataclassTwo:
    a: np.ndarray = np.array([-123, 1, 99])
    b: np.ndarray = np.array([1234, 0, -9876])
    
    def row_sum(self):
        return self.a + self.b

attrs_data2 = AttrsDataclassTwo(np.array([1,2]), np.array([41, 40]))
attrs_data2

AttrsDataclassTwo(a=array([1, 2]), b=array([41, 40]))

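With one decorator, two problems with the original `SimpleDataclassTwo` (no `__init__` and an unhelpful `repr`) are solved before you knew they were problems.  Let's learn more about this class instance.  For this purpose, I often reach for `__dict__`, which is available because the class was declared with `slots=False`: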

In [8]:
attrs_data2.__dict__

{'a': array([1, 2]), 'b': array([41, 40])}

This reminds me of... how you instantiate a pandas dataframe!

### Enable Pandas and Pyarrow 

In [9]:
import pandas as pd
import pyarrow as pa

In [10]:
pd.DataFrame(attrs_data2.__dict__)

|   | a | b |
|---|---|---|
| 0 | 1 | 41 |
| 1 | 2 | 40 |


In [11]:
@attrs.define(slots=False)
class AttrsPandasArrowDataclassTwo:
    a: np.ndarray = np.array([-123, 1, 99])
    b: np.ndarray = np.array([1234, 0, -9876])

    @property
    def df(self):
        return pd.DataFrame(self.__dict__)

    @property
    def pa(self):
        return pa.table(self.__dict__)

data = AttrsPandasArrowDataclassTwo()

In [12]:
data.df

|   | a | b |
|---|---|---|
| 0 | -123 | 1234 |
| 1 | 1 | 0 |
| 2 | 99 | -9876 |


In [13]:
data.pa

pyarrow.Table
a: int64
b: int64
----
a: [[-123,1,99]]
b: [[1234,0,-9876]]

### Nullable arrays

### Type Enforcement
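
One way to check a value against its declared type is to normalize both sides with `np.dtype` and compare.  The helper below, `dtype_matches`, is a hypothetical name for this sketch and only covers numpy arrays and plain scalars.

```python
import numpy as np

def dtype_matches(value, expected):
    """Return True if `value` carries the dtype described by `expected`.

    Hypothetical helper; `expected` is anything np.dtype() accepts.
    """
    # Arrays expose .dtype; for plain scalars fall back to their Python type.
    actual = value.dtype if hasattr(value, "dtype") else type(value)
    return np.dtype(actual) == np.dtype(expected)

print(dtype_matches(np.array([1, 2], dtype=np.int8), np.int8))
print(dtype_matches(np.array([1, 2], dtype=np.int64), np.int8))
print(dtype_matches(1.5, float))
```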

### Inheritance

### In The End

In [14]:
import attrs
import numpy as np
import pandas as pd
import pyarrow as pa

class ACHTUNG(Exception):
    pass

@attrs.define(slots=False)
class BaseColumnarModel:

    data_dict = {}
    
    def __attrs_post_init__(self):
        fields = list(self.__dict__)
        for field in fields:
            v = self.__dict__[field]
            _expected_dtype = self.__annotations__[field]
            if hasattr(v, '_data'):  
                _dtype = v._data.dtype  # for pandas ExtensionDtypes
            elif hasattr(v, 'dtype'):
                _dtype = v.dtype  # for numpy arrays
            else:
                _dtype = type(v)  # for scalars
            
            x = np.empty(0, dtype=_expected_dtype).dtype
            y = np.empty(0, dtype=_dtype).dtype
            if x != y:
                raise ACHTUNG(f'wrong dtype in `{field}`: `{_expected_dtype}` is not `{_dtype}`')
        return

    def __len__(self):
        return max(np.array(v).size for v in self.__dict__.values())
    
    @classmethod
    def empty(cls):
        empty_fields = {}
        for k, v in cls.__annotations__.items():
            empty_fields[k] = np.empty(0, dtype=v)
        return cls(**empty_fields)

    @property
    def df(self):
        return pd.DataFrame(self.__dict__)

    @property
    def pa(self):
        _length = len(self)
        arrow_dict = {}
        for k, v in self.__dict__.items():
            if pd.api.types.is_scalar(v):
                arrow_dict[k] = pa.array(np.full(_length, v, dtype=type(v)))
            else:
                arrow_dict[k] = pa.array(v)
        return pa.table(arrow_dict, metadata=self.data_dict)

@attrs.define(slots=False)
class CoolDataFour(BaseColumnarModel):
    """Example Dataclass."""
    
    a: np.int8
    b: np.bool_
    c: str
    d: float
    
    data_dict = {
        'a': 'integer field',
        'b': 'is it important?',
        'c': 'text',
        'd': 'price to pay',
    }

In [15]:
CoolDataFour.empty().df  # for building database templates

|   | a | b | c | d |
|---|---|---|---|---|


In [16]:
all_scalars = CoolDataFour(a=np.int8(42), b=np.bool_(True), c='abc', d=1.2)
all_scalars.pa

pyarrow.Table
a: int8
b: bool
c: string
d: double
----
a: [[42]]
b: [[true]]
c: [["a"]]
d: [[1.2]]

In [17]:
somenulls = CoolDataFour(a=pd.array([99,None], dtype=pd.Int8Dtype()), b=np.bool_([0,1]), c='abc', d=1/2)
somenulls.pa

pyarrow.Table
a: int8
b: bool
c: string
d: double
----
a: [[99,null]]
b: [[false,true]]
c: [["a","a"]]
d: [[0.5,0.5]]

In [18]:
somenulls.pa.schema

a: int8
b: bool
c: string
d: double
-- schema metadata --
a: 'integer field'
b: 'is it important?'
c: 'text'
d: 'price to pay'

## Now What?

Share your feedback in the comments!