Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-32131: Merge Cassandra branch of APDB #23

Merged
merged 42 commits into from Oct 26, 2021
Merged

Conversation

andy-slac
Copy link
Collaborator

Adds new implementation for Apdb interface based on Apache Cassandra.

This merge contains a more or less complete history of all development and experimentation from the past couple of years, I decided not to squash that history as it may be potentially useful. It does not make sense to review individual commits, it's better to look at the final version.

There is also some refactoring of SQL implementation and tests to reduce code duplication between two implementations, but functionality of ApdbSql did not change. The pixelId column was removed from schema YAML definition, it is now added where needed by schema class - this is to keep YAML schema reusable between SQL and Cassandra (the latter does not have pixelId but it adds a different set of columns).

Cassandra does not store NULL so for realistic test we need non-NULL
payload for all table columns. I added new configuration option for
Cassandra backend which fills columns that are not explicitly set with
random data. I'm doing it the backend instead of ap_proto because it is
much simplere to do here and I need a quick and dirty way to do it right
now. For final implementation we won't need it ao it will be removed
when we settle on what final implementation is going to look like.
Previous commits on this branch come from rebasing of the testing branch
(u/andy-slac/cassandra-2), next step is to update the implementation
for new abstact interface. This commit adds a unit test for
ApdbCassandra to make sure that it imports and can trivially function.
This unit test can only be ran against actual Cassandra cluster, needs
special environment. I added a protection for the case of the missing
cassandra-driver package, it is still possible to import the module in
that case but instantiation of ApdbCassandra fails.
Copy link
Contributor

@morriscb morriscb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small stuff to change.

@@ -0,0 +1,86 @@
import lsst.dax.apdb.apdbCassandra
assert type(config)==lsst.dax.apdb.apdbCassandra.ApdbCassandraConfig, 'config is of type %s.%s instead of lsst.dax.apdb.apdbCassandra.ApdbCassandraConfig' % (type(config).__module__, type(config).__name__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is generated by pex_config, I want to avoid fixing formatting in this file. In case I want to re-generate it later again, I'd have to re-format it again and again.

)
extra_schema_file = Field(
dtype=str,
doc="Location of (YAML) configuration file with extra schema",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth explaining how an extra schema works relative to the standard schema.

doc="If True then print/log timing information",
default=False)
db_url = Field(
dtype=str, doc="SQLAlchemy database connection URI"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to newline the doc like the other config options?

default=64
)
htm_index_column = Field(
dtype=str, default="pixelId",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline?

doc="Name of a HTM index column for DiaObject and DiaSource tables"
)
ra_dec_columns = ListField(
dtype=str, default=["ra", "decl"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline?

doc="Cassandra protocol version to use, default is V4",
default=cassandra.ProtocolVersion.V4 if CASSANDRA_IMPORTED else 0
)
read_sources_months = Field(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems as if some configs are duplicated between ApdbConfig, ApdbSqlConfig, and here. Add them in needed to the base config and remove them here and in the SQL config?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A leftover from refactoring, I'll remove duplicates.

)
time_partition_start = Field(
dtype=str,
doc=(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the parentheses here or in other multi-line config doc strings.

full_name = self._schema.tableName(table_name)
tables = [full_name]
mjd_now = visit_time.get(system=dafBase.DateTime.MJD)
mjd_begin = mjd_now - months * 30
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else:
# Assume it's seconds since epoch, Cassandra
# datetime is in milliseconds
value = int(value * 1000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove spaces

if self.config.packing == "cbor":
blob = b"cbor:" + cbor.dumps(blob)
values.append(blob)
holders = ','.join(['%s'] * len(values))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove spaces

Unit test now runs successfully against a small one node cluster.
The code is stil la big mess and needs cleanup. Updated schema YAML
files to reflect actual data, and dropped pixelId from C9 schema.
Columns that compise PK have to be included into pandas catalog (or
static columns).
Define couple of enum classes for static parts of schema (table names
and index types).
@andy-slac andy-slac merged commit 75bb458 into master Oct 26, 2021
@andy-slac andy-slac deleted the tickets/DM-32131 branch October 26, 2021 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants