Architecture overview

Four modules in squirrel are central to understanding its architecture and overall design:

  • Catalog: organizing, accessing, and sharing datasets.
  • Driver: performant and convenient reading of data.
  • IterStream: Composable is the foundational building block of the IterStream module. It provides a mechanism to chain iterables and a fluent API with methods such as map, filter, and async_map.
  • Store: a key/value abstraction for reading data from and writing data to arbitrary storage backends such as a filesystem, an object store, or a database.
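To make the idea of chaining iterables through a fluent API concrete, here is a minimal pure-Python sketch. The class MiniStream is hypothetical and is not squirrel's Composable; it only illustrates how map and filter can each return a new chainable object while deferring all work until iteration.

```python
from typing import Any, Callable, Iterable, Iterator


class MiniStream:
    """Hypothetical stand-in for a chainable iterable (not squirrel's Composable)."""

    def __init__(self, source: Iterable[Any]) -> None:
        self.source = source

    def __iter__(self) -> Iterator[Any]:
        return iter(self.source)

    def map(self, fn: Callable[[Any], Any]) -> "MiniStream":
        # The generator body does not run until someone iterates the result.
        def generate() -> Iterator[Any]:
            for item in self:
                yield fn(item)

        return MiniStream(generate())

    def filter(self, predicate: Callable[[Any], bool]) -> "MiniStream":
        def generate() -> Iterator[Any]:
            for item in self:
                if predicate(item):
                    yield item

        return MiniStream(generate())


# Square the numbers 0..9, then keep only the even squares.
stream = MiniStream(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = list(stream)
```

Because every step wraps the previous one in a generator, a stream built this way can be consumed only once. That is a simplification of this sketch, not a statement about squirrel's behaviour.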

These modules are designed to work together, but this is not enforced, in order to maximize flexibility. As a result, the intended and recommended way of combining squirrel primitives may not be obvious. Many combinations are already provided, and more can be implemented for specific use cases. Here we focus on one concrete example that captures the most common and most widely applicable pattern, shown as a code snippet and its equivalent UML diagram. Here is a complete data loading pipeline:

from squirrel.catalog import Catalog

# transform, filter_func, and YourModelTrainer are placeholders for user-defined code.
catalog = Catalog.from_plugins()  # Catalog
train_data = (
    catalog["imagenet"]  # CatalogSource
    .get_driver()  # Driver
    .get_iter()  # Composable
    .map(lambda x: transform(x))  # Composable
    .filter(lambda x: filter_func(x))  # Composable
)  # Composable
model = YourModelTrainer(train_data).fit()  # e.g. PyTorch DataLoader, XGBoost, etc.

A Catalog contains zero or more CatalogSource objects, each of which can be retrieved by an identifier (or by a tuple of an identifier and a version; see the Catalog documentation, usage/catalog, for more details). A CatalogSource contains all the information needed to instantiate a Driver. A Driver may have a get_iter method, which returns a Composable (part of the IterStream module). train_data is an iterable that generates items lazily, in a streaming fashion, to keep the memory footprint minimal.
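The lookup chain and the lazy behaviour described above can be sketched with toy classes. ToyCatalog, ToySource, and ToyDriver are hypothetical names invented for this illustration and do not reproduce squirrel's classes. Resolving a bare identifier to the highest registered version is likewise an assumption of the sketch.

```python
from typing import Dict, Iterator, List, Tuple, Union


class ToyDriver:
    """Hypothetical driver that counts how many records were actually read."""

    def __init__(self, records: List[dict]) -> None:
        self.records = records
        self.reads = 0

    def get_iter(self) -> Iterator[dict]:
        # A generator: records are produced one at a time, on demand.
        for record in self.records:
            self.reads += 1
            yield record


class ToySource:
    """Hypothetical source holding what is needed to build a driver."""

    def __init__(self, identifier: str, version: int, records: List[dict]) -> None:
        self.identifier = identifier
        self.version = version
        self.records = records

    def get_driver(self) -> ToyDriver:
        return ToyDriver(self.records)


class ToyCatalog:
    """Hypothetical catalog keyed by (identifier, version)."""

    def __init__(self) -> None:
        self._sources: Dict[Tuple[str, int], ToySource] = {}

    def add(self, source: ToySource) -> None:
        self._sources[(source.identifier, source.version)] = source

    def __getitem__(self, key: Union[str, Tuple[str, int]]) -> ToySource:
        if isinstance(key, tuple):
            return self._sources[key]
        # Assumption of this sketch: a bare identifier means the highest version.
        versions = [v for (ident, v) in self._sources if ident == key]
        if not versions:
            raise KeyError(key)
        return self._sources[(key, max(versions))]


catalog = ToyCatalog()
catalog.add(ToySource("numbers", 1, [{"x": 1}, {"x": 2}]))
catalog.add(ToySource("numbers", 2, [{"x": 10}, {"x": 20}, {"x": 30}]))

driver = catalog["numbers"].get_driver()
items = driver.get_iter()
reads_before = driver.reads  # nothing has been read yet
first = next(items)          # pulls exactly one record
reads_after = driver.reads

# Pin an explicit version with an (identifier, version) tuple.
pinned = list(catalog["numbers", 1].get_driver().get_iter())
```

The read counter shows that building the pipeline costs nothing. Records are fetched only as the consumer asks for them.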

Note

To access available datasets using Catalog.from_plugins(), check out the squirrel-dataset-core repository.

The following diagram illustrates a (simplified and slightly idealized) view of the relationships between these classes through one concrete implementation provided by squirrel. It assumes that the data is stored in messagepack format (see the Store documentation, usage/store, for information about the different types of store).

classDiagram

    MutableMapping <|-- Catalog
    class Catalog {
        Dict _sources
    }
    Catalog -- "0..*" CatalogSource
    %% CatalogSource : get_driver()
    class CatalogSource {
        string identifier
        int version
        List~int~ versions

        get_driver() Driver
    }

    class MessagepackDriver {
        string name
        SquirrelStore store

        get(key) Iterable~Dict~
        keys() List~string~
        get_iter() Composable
    }

    %% realization
    CatalogSource ..|> MessagepackDriver

    MessagepackDriver ..> Composable
    MessagepackDriver ..> SquirrelStore

    <<abstract>> Composable
    class Composable {
        source Iterable~Any~
    }
    Composable : __iter__() Iterable~Any~
    Composable : map() Composable
    Composable : filter() Composable

    SquirrelStore : set(value, key) None
    SquirrelStore : get(key) Iterable~Any~
    SquirrelStore : keys() Iterable~string~

    SquirrelStore "1" --> MessagepackSerializer
    class MessagepackSerializer {
        serialize(obj)
        deserialize(obj)
        serialize_shard_to_file(obj, fp)
        deserialize_shard_from_file(fp)
    }
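The store and serializer boxes in the diagram can be sketched as a small key/value store that delegates encoding to a serializer. InMemoryStore and JsonSerializer are hypothetical names, and JSON from the standard library stands in for messagepack. Treating each stored value as a list of samples (a shard) is also an assumption of this sketch. Only the method names set(value, key), get(key), keys(), serialize, and deserialize are taken from the diagram.

```python
import json
from typing import Any, Dict, Iterator, List


class JsonSerializer:
    """Hypothetical serializer; JSON stands in for messagepack."""

    def serialize(self, obj: Any) -> bytes:
        return json.dumps(obj).encode("utf-8")

    def deserialize(self, data: bytes) -> Any:
        return json.loads(data.decode("utf-8"))


class InMemoryStore:
    """Hypothetical key/value store that keeps serialized shards in a dict."""

    def __init__(self, serializer: JsonSerializer) -> None:
        self.serializer = serializer
        self._blobs: Dict[str, bytes] = {}

    def set(self, value: List[Any], key: str) -> None:
        # Argument order (value, key) follows the diagram.
        self._blobs[key] = self.serializer.serialize(value)

    def get(self, key: str) -> Iterator[Any]:
        # Yield the samples of one shard one by one.
        yield from self.serializer.deserialize(self._blobs[key])

    def keys(self) -> Iterator[str]:
        yield from self._blobs


store = InMemoryStore(JsonSerializer())
store.set([{"label": 0}, {"label": 1}], key="shard_0")
store.set([{"label": 2}], key="shard_1")

# Reading every sample means walking the keys and flattening each shard.
samples = [sample for key in store.keys() for sample in store.get(key)]
```

Keeping the serializer separate from the store means the storage backend and the on-disk format can vary independently, which matches the separation the diagram draws between the two classes.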

The relationships between these components, and the methods they provide, depend on the particular implementation of the abstract classes (i.e. Driver, AbstractStore, SquirrelSerializer). For instance, an implementation of Driver may not need, or may choose not, to use SquirrelStore or Composable at all.
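As an illustration of that last point, the sketch below defines a driver that reads from a plain iterable of text lines and returns an ordinary generator, with no store and no chainable stream involved. BaseDriver and LinesDriver are hypothetical names; squirrel's actual abstract Driver class may define a different interface.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator


class BaseDriver(ABC):
    """Hypothetical abstract driver (squirrel's real base class may differ)."""

    @abstractmethod
    def get_iter(self) -> Iterator[dict]:
        """Return an iterator over samples."""


class LinesDriver(BaseDriver):
    """Hypothetical driver that turns text lines into dict samples."""

    def __init__(self, lines: Iterable[str]) -> None:
        self.lines = lines

    def get_iter(self) -> Iterator[dict]:
        # A plain generator: no store and no fluent stream wrapper is used.
        for number, line in enumerate(self.lines, start=1):
            yield {"line": number, "text": line.rstrip("\n")}


rows = list(LinesDriver(["a\n", "b\n"]).get_iter())
```

A driver like this still fits the pipeline shape shown earlier, since the consumer only needs something it can iterate over.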

Note

CatalogSource is an internal representation of a Source. For more information on how to add a Source to a catalog, please refer to the Catalog documentation (usage/catalog).