Skip to content

Latest commit

 

History

History
231 lines (165 loc) · 6.18 KB

catalog.rst

File metadata and controls

231 lines (165 loc) · 6.18 KB

Catalog

Squirrel provides an easy way of both defining a data source and how to read from it at the same time:

from squirrel.catalog import Source

source = Source(
    driver_name="file",
    driver_kwargs={"path": "path/to/my/file"},
    metadata={"owner": "Merantix", "license": "Other"},
) 

That's it, we created our first :py~squirrel.catalog.source.Source. Note that:

  • Our source can be read using a certain :py~squirrel.driver.Driver, i.e. the Driver that corresponds to "file".

    Note

    Note that each driver defines its name with the name class variable. You can check out :py~squirrel.driver.FileDriver to verify that its name is indeed "file".

    You can refer to :pysquirrel.driver to see all built-in squirrel drivers. To see how you can implement your own driver and register it so that a Catalog can use it, see the Plugin Tutorial.

  • The driver will be called using some arguments, i.e. path="path/to/my/file"
  • We can add metadata to the Source

If we have a lot of Sources, then keeping track of them can be troublesome. This is where :py~squirrel.catalog.Catalog comes in.

Catalog is a dictionary-like data structure that maintains Sources. Let's create a Catalog and add our Source to it.

from squirrel.catalog import Catalog

cat = Catalog()
cat["my_source"] = source

Now we can see how the Source is stored:

>>> cat["my_source"]
{
    "identifier": "my_source",
    "driver_name": "file",
    "driver_kwargs": {
        "path": "path/to/my/file",
    },
    "metadata": {
        "owner": "Merantix",
        "license": "Other"
    },
    "version": 1
    "versions": [
        1
    ]
}

Note that there are some new fields, more specifically, the source now also has:

  • An identifier
  • A version

We specified the identifier when adding the Source to the Catalog. However, the version was automatically set. We could have set the version ourselves as well:

cat["my_source"][2] = source  # setting version 2 of the same source

Now the catalog entry has become:

>>> cat["my_source"]

{
    "identifier": "my_source",
    "driver_name": "file",
    "driver_kwargs": {
        "path": "path/to/my/file",
    },
    "metadata": {
        "owner": "Merantix",
        "license": "Other"
    },
    "version": 2
    "versions": [
        1,
        2
    ]
}

The catalog returns us the latests version available, which is v2. It is possible to specify which version to get:

>>> cat["my_source"][1]  # getting a specific version
{
    "identifier": "my_source",
    "driver_name": "file",
    "driver_kwargs": {
        "path": "path/to/my/file",
    },
    "metadata": {
        "owner": "Merantix",
        "license": "Other"
    },
    "version": 1
}

It is possible to get a Driver instance to read from a Catalog Source.

import tempfile

# create a dummy file and load it using a FileDriver
with tempfile.TemporaryDirectory() as tmp_dir:
    fpath = f"{tmp_dir}/myfile.txt"
    with open(fpath, "w") as f:
        for i in range(5):
            f.write(f"line #{i}\n")

    new_source = Source(driver_name="file")
    cat["new_source"] = new_source

    driver = cat["new_source"].get_driver(path=fpath)
    with driver.open() as f:
        f.readlines() # -> ['line #0\n', 'line #1\n', 'line #2\n', 'line #3\n', 'line #4\n']

Catalog Operations

We will be using the following Catalogs to demonstrate the operations:

cat1 = Catalog()
cat2 = Catalog()

# shared sources
for i in range(2):
    key, src = f"shared_{i}", Source(driver_name="file")
    cat1[key] = src
    cat2[key] = src

# distinct sources
cat1["distinct_for_1"] = Source(driver_name="file")
cat2["distinct_for_2"] = Source(driver_name="file")

Catalogs can be sliced so that only some sources are left:

res = cat1.slice(["shared_1"])
list(res.keys())  # -> ["shared_1"]

Catalogs can be summed together:

res = cat1.union(cat2)
list(res.keys())  # -> ['shared_0', 'shared_1', 'distinct_for_1', 'distinct_for_2']

The difference between two catalogs can be taken:

res = cat1.difference(cat2)
list(res.keys())  # -> ['distinct_for_1', 'distinct_for_2']

Catalogs can be intersected:

res = cat1.intersection(cat2)
list(res.keys())  # -> ['shared_0', 'shared_1']

To see all catalog operations, check out the API reference.

Sharing your Catalog

As most things, Catalogs are more fun when shared with others. To share a Catalog, you must first serialize it. Luckily, Squirrel provides Catalog.to_file() method, which will serialize your catalog for you and write it to a .yaml file with all information regarding the sources.

import tempfile

temp_d = tempfile.TemporaryDirectory()
fp = f"{temp_d.name}/my_catalog.yaml"
cat1.to_file(fp)

You can see that all source information is neatly stored in this file:

>>> with open(fp, "r") as f:
>>>     for _ in range(10):
>>>         print(f.readline())
!YamlCatalog
version: 0.11.0
sources:
- !YamlSource
identifier: shared_0
driver_name: MyDriver
driver_kwargs: {}
version: 1
metadata: {}
- !YamlSource

Reading the file into a catalog is also simple:

cat_reloaded = Catalog.from_files([fp])
cat1 == cat_reloaded # -> True

That's it, now you know (almost) everything about Catalogs!

If you are willing to learn more, check out the Catalog Tutorial to see some real-world examples or the Plugins Tutorial to see how you can implement and register a new plugin. You can also refer to the API reference to discover more information such as implementation details.

Don't forget to clean up:

temp_d.cleanup()