Justin Joyce edited this page Jul 15, 2021 · 7 revisions

How to: Read a Dataset

For information on how to call the reader, see the reader documentation.

What is a Dataset?

A Dataset is a set of records. Datasets fall into three broad categories:

  • Static Data: data that never changes. This is generally reference data, like conversion rates between units.
  • Batch Data: data that you're able to get a complete copy of, but the copy may only be true for a period of time. An employee list is an example: you can get a complete list of employees, but there are joiners and leavers, so the list may change daily.
  • Stream Data: data that doesn't end; it's a constant feed of new information, like readings from a thermometer. You just append any new readings to the existing set of readings.

How Do I Filter?

There are two main ways to filter data when reading: filtering by date and filtering by attribute.

Filtering By Date

All reads are filtered by date unless the reader is created with the parameter raw_path set to True.

Filtering By Attribute

Attribute filters are written as lists of tuples. Each tuple has the format (key, op, value). When the filter runs, it extracts the key field from each record (a dictionary) and compares it to value using the operator op.

The supported op values are: = or ==, !=, <, >, <=, >=, in, !in (not in), contains, !contains (doesn't contain) and like.

If the op is in or !in, the value must be a collection such as a list, a set or a tuple. like behaves similarly to the SQL LIKE operator: % is a multi-character wildcard and _ is a single-character wildcard.
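To make the operator semantics concrete, here is a minimal sketch of how a single (key, op, value) condition could be evaluated against a record. It is not the reader's actual implementation: the names like_to_regex, OPERATORS and evaluate_condition are hypothetical, treating contains as "the field contains the value" is an assumption, and so is treating a missing key as a non-match.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL-style LIKE pattern into a compiled regular expression."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")  # multi-character wildcard
        elif char == "_":
            parts.append(".")  # single-character wildcard
        else:
            parts.append(re.escape(char))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


# Hypothetical operator table mirroring the op values listed above.
OPERATORS = {
    "=": lambda field, value: field == value,
    "==": lambda field, value: field == value,
    "!=": lambda field, value: field != value,
    "<": lambda field, value: field < value,
    ">": lambda field, value: field > value,
    "<=": lambda field, value: field <= value,
    ">=": lambda field, value: field >= value,
    "in": lambda field, value: field in value,
    "!in": lambda field, value: field not in value,
    # Assumption: "contains" tests whether the field contains the value.
    "contains": lambda field, value: value in field,
    "!contains": lambda field, value: value not in field,
    "like": lambda field, value: like_to_regex(value).fullmatch(str(field)) is not None,
}


def evaluate_condition(record, condition):
    """Return True if the record (a dict) satisfies one (key, op, value) tuple."""
    key, op, value = condition
    if key not in record:
        return False  # assumption: a missing field never matches
    return OPERATORS[op](record[key], value)


record = {"name": "jupiter", "moons": ["io", "europa"]}
print(evaluate_condition(record, ("name", "like", "jup%")))
print(evaluate_condition(record, ("name", "in", ["jupiter", "saturn"])))
print(evaluate_condition(record, ("moons", "contains", "io")))
```

All three calls in this sketch evaluate to True for the sample record.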

Conditions in a list are ANDed together; lists of lists are ORed together:

('name', '==', 'jupiter')
[('name', '==', 'jupiter')]

Both these variations return records where the name field is jupiter.

These are both single-condition filters.

[('name', '==', 'jupiter'), ('size', '>', 1000000)]

Returns records where the name field is jupiter AND the size field is greater than 1 million.

This is a list of conditions that are ANDed together.

[[('name', '==', 'jupiter')], [('name', '==', 'saturn')]]

Returns records where the name field is jupiter OR the name field is saturn.

This is a list of condition lists that are ORed together.
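The three filter shapes above can be sketched as a small recursive matcher. This is an illustration of the AND/OR rules only, not the reader's own code: check and matches are hypothetical names, only a handful of operators are included, and the sample records and their size values are made up for the example.

```python
import operator

# A reduced, hypothetical operator table; just enough for this sketch.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
}


def check(record, condition):
    """Evaluate one (key, op, value) tuple against a record (a dict)."""
    key, op, value = condition
    # Assumption: a record without the field does not match.
    return key in record and OPS[op](record[key], value)


def matches(record, filters):
    """Apply the nesting rules: tuple = one condition, list = AND, list of lists = OR."""
    if isinstance(filters, tuple):
        return check(record, filters)
    if all(isinstance(item, tuple) for item in filters):
        return all(check(record, item) for item in filters)  # ANDed together
    return any(matches(record, item) for item in filters)  # ORed together


# Made-up sample data for illustration.
planets = [
    {"name": "jupiter", "size": 1500000},
    {"name": "saturn", "size": 900000},
    {"name": "mars", "size": 10},
]

either = [[("name", "==", "jupiter")], [("name", "==", "saturn")]]
print([p["name"] for p in planets if matches(p, either)])
```

With this sample data, the OR filter selects the jupiter and saturn records, while [('name', '==', 'jupiter'), ('size', '>', 1000000)] selects only jupiter.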