DM-32131: Merge Cassandra branch of APDB #23
Conversation
Cassandra does not store NULLs, so for a realistic test we need a non-NULL payload for all table columns. I added a new configuration option for the Cassandra backend which fills columns that are not explicitly set with random data. I'm doing it in the backend instead of ap_proto because it is much simpler to do here and I need a quick and dirty way to do it right now. The final implementation won't need it, so it will be removed once we settle on what the final implementation is going to look like.
Force-pushed 708074c to 1832b1a
Previous commits on this branch come from rebasing the testing branch (u/andy-slac/cassandra-2); the next step is to update the implementation for the new abstract interface. This commit adds a unit test for ApdbCassandra to make sure that it imports and can trivially function. This unit test can only be run against an actual Cassandra cluster and needs a special environment. I added protection for the case of a missing cassandra-driver package: it is still possible to import the module in that case, but instantiation of ApdbCassandra fails.
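The protection for a missing cassandra-driver package presumably follows the usual optional-import pattern; a minimal sketch under that assumption (the flag and message wording are illustrative):

```python
# Guard the optional dependency: importing this module must succeed even
# when cassandra-driver is not installed; only instantiation fails.
try:
    import cassandra  # noqa: F401
    CASSANDRA_IMPORTED = True
except ImportError:
    CASSANDRA_IMPORTED = False


class ApdbCassandra:
    def __init__(self, config):
        if not CASSANDRA_IMPORTED:
            raise RuntimeError(
                "cassandra-driver module cannot be imported;"
                " ApdbCassandra requires it"
            )
        self.config = config
```

With this shape, `import lsst.dax.apdb.apdbCassandra` always works, and the unit test can skip the Cassandra-specific cases when the driver is absent.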
Force-pushed 1832b1a to edf7415
Small stuff to change.
@@ -0,0 +1,86 @@
import lsst.dax.apdb.apdbCassandra
assert type(config)==lsst.dax.apdb.apdbCassandra.ApdbCassandraConfig, 'config is of type %s.%s instead of lsst.dax.apdb.apdbCassandra.ApdbCassandraConfig' % (type(config).__module__, type(config).__name__)
Put spaces around all `=` and `==` when setting and testing. https://developer.lsst.io/python/style.html#binary-operators-should-be-surrounded-by-a-single-space-except-for
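For illustration, the spacing rule the style guide asks for looks like this (a generic example, not the generated file itself):

```python
# Per the LSST DM style guide, binary operators take a single space on
# each side, for both assignment and comparison:
value = 1          # assignment: spaces around "="
assert value == 1  # comparison: spaces around "=="
# Keyword arguments are the noted exception: f(dtype=str) stays unspaced.
```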
This file is generated by pex_config, so I want to avoid fixing formatting in it; if I re-generate it later, I'd have to re-format it again and again.
python/lsst/dax/apdb/apdb.py
Outdated
)
extra_schema_file = Field(
    dtype=str,
    doc="Location of (YAML) configuration file with extra schema",
Might be worth explaining how an extra schema works relative to the standard schema.
python/lsst/dax/apdb/apdbSql.py
Outdated
doc="If True then print/log timing information",
default=False)
db_url = Field(
    dtype=str, doc="SQLAlchemy database connection URI"
Did you want to newline the doc like the other config options?
python/lsst/dax/apdb/apdbSql.py
Outdated
default=64
)
htm_index_column = Field(
    dtype=str, default="pixelId",
Newline?
python/lsst/dax/apdb/apdbSql.py
Outdated
doc="Name of a HTM index column for DiaObject and DiaSource tables"
)
ra_dec_columns = ListField(
    dtype=str, default=["ra", "decl"],
Newline?
doc="Cassandra protocol version to use, default is V4",
default=cassandra.ProtocolVersion.V4 if CASSANDRA_IMPORTED else 0
)
read_sources_months = Field(
Seems as if some configs are duplicated between ApdbConfig, ApdbSqlConfig, and here. Add them as needed to the base config and remove them here and in the SQL config?
A leftover from refactoring, I'll remove duplicates.
)
time_partition_start = Field(
    dtype=str,
    doc=(
You don't need the parentheses here or in other multi-line config doc strings.
full_name = self._schema.tableName(table_name)
tables = [full_name]
mjd_now = visit_time.get(system=dafBase.DateTime.MJD)
mjd_begin = mjd_now - months * 30
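The hunk above approximates one month as 30 days of MJD when computing the read window; a small worked sketch of that arithmetic (the helper name and values are illustrative):

```python
def mjd_window(mjd_now, months):
    """Return (begin, end) MJD bounds for a read window of ``months``,
    approximating one month as 30 days, as in the hunk above."""
    return mjd_now - months * 30, mjd_now


# A 12-month window spans 360 days under this approximation.
begin, end = mjd_window(59580.0, 12)
```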
else:
    # Assume it's seconds since epoch, Cassandra
    # datetime is in milliseconds
    value = int(value * 1000)
remove spaces
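The conversion in the hunk above turns seconds-since-epoch into the integer milliseconds Cassandra timestamps use; a self-contained sketch of that logic (the function name is an assumption):

```python
import datetime


def to_cassandra_timestamp(value):
    """Convert a time value to integer milliseconds since epoch.

    datetime objects are converted via their POSIX timestamp; plain
    numbers are assumed to already be seconds since epoch, and Cassandra
    timestamps are stored in milliseconds.
    """
    if isinstance(value, datetime.datetime):
        value = value.timestamp()
    return int(value * 1000)
```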
if self.config.packing == "cbor":
    blob = b"cbor:" + cbor.dumps(blob)
values.append(blob)
holders = ','.join(['%s'] * len(values))
remove spaces
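The hunk above prefixes the packed blob with its format name and builds one CQL `%s` placeholder per value; a hedged sketch of both ideas, using json in place of cbor so the example has no third-party dependency:

```python
import json


def pack_blob(blob, packing):
    """Prefix the serialized blob with its format name so a reader knows
    how to unpack it (the branch uses cbor; json keeps this sketch
    dependency-free)."""
    if packing == "json":
        return b"json:" + json.dumps(blob).encode()
    return blob


def make_placeholders(values):
    """Build the placeholder list for a CQL INSERT, one %s per value."""
    return ",".join(["%s"] * len(values))
```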
Unit test now runs successfully against a small one-node cluster. The code is still a big mess and needs cleanup. Updated schema YAML files to reflect actual data, and dropped pixelId from the C9 schema.
Columns that comprise the PK have to be included in the pandas catalog (or be static columns).
Define a couple of enum classes for static parts of the schema (table names and index types).
Add config fields for begin/end times for per-partition tables.
Instead of specifying partitioning columns and indexes in YAML files, the schema class now adds them dynamically based on config parameters. This simplifies the YAML configuration and reduces the number of different YAML files.
Both implementations now use the same YAML config, but they add their specializations in schema classes. Dropping C9-specific YAML files.
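Adding partitioning columns dynamically from config, rather than hard-coding them in YAML, might look like the following sketch; the class and column names here are assumptions, not the actual schema classes:

```python
class SchemaBuilder:
    """Start from the shared YAML column list and prepend the
    backend-specific partitioning columns taken from configuration."""

    def __init__(self, yaml_columns, part_columns):
        self.yaml_columns = list(yaml_columns)
        self.part_columns = list(part_columns)

    def table_columns(self):
        # Partition columns come first so they can form the primary key;
        # the shared YAML columns follow unchanged.
        return self.part_columns + [
            c for c in self.yaml_columns if c not in self.part_columns
        ]


# SQL and Cassandra reuse the same YAML but specialize the partitioning:
sql_schema = SchemaBuilder(["ra", "decl", "flux"], ["pixelId"])
cass_schema = SchemaBuilder(["ra", "decl", "flux"], ["apdb_part", "apdb_time_part"])
```

This is how a single YAML file can serve both backends: each schema class injects only its own partitioning columns.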
Force-pushed edf7415 to 0f555da
Adds a new implementation of the `Apdb` interface based on Apache Cassandra. This merge contains a more or less complete history of all development and experimentation from the past couple of years; I decided not to squash that history as it may be potentially useful. It does not make sense to review individual commits, it's better to look at the final version.
There is also some refactoring of the SQL implementation and tests to reduce code duplication between the two implementations, but the functionality of ApdbSql did not change. The `pixelId` column was removed from the schema YAML definition; it is now added where needed by the schema class. This keeps the YAML schema reusable between SQL and Cassandra (the latter does not have `pixelId` but adds a different set of columns).