Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-27390: replace DimensionGraph.{encode,decode} #415

Merged
merged 5 commits into from
Nov 3, 2020

Conversation

TallJimbo
Copy link
Member

No description provided.

Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK, one minor comment.

Comment on lines 247 to 250
hasher = hashlib.blake2b(digest_size=self.DIGEST_SIZE//2)
for name in self.required.names:
hasher.update(name.encode("ascii"))
return hasher.hexdigest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not trust hashes in general due to potential collisions. In this case collisions are possible for cases like ["a", "b"] vs just ["ab"]. It may help adding a non-word separator character (e.g. TAB) to avoid this issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll at least add that separator; good idea.

I also considered just using an autoincrement integer for the ID (so there would be no DimensionGraph method to get it). The problem with that is that we'd have to load essentially the entire table from the database in order to see if a particular graph already has an ID, though that's more about code complexity than performance because the table should always be small. I also liked the idea of making the ID deterministic across different data repositories, but I don't have a concrete reason why that needs to be the case. Do you think an autoincrement ID would be better here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deterministic ID would be nice to have but I guess it's hard to construct or make it stable. If we do not need this ID outside database relation context then autoincrement is probably a better match for that. That table is small so it just a single query/round-trip to database, so I would not worry about performance, and code complexity should not be terrible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added another commit that uses autoincrement integer IDs instead; please take a look. Once the review is finished I'll squash commits and probably just remove DimensionGraph.digest.

ddl.FieldSpec(
name="dimension_name",
dtype=sqlalchemy.String,
length=64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have probably defined a bunch of constants for these 64 magic numbers too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure they don't get converted to TEXT...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I'd be more inclined to just use sqlalchemy.Text and leave the length blank. So many of them are arbitrary, or at best "guess and check".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(messages crossed) I think this does get translated to TEXT, but that's controlled by different code, so the magic number here is still a bit problematic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main comment is to note that if you needed it to be exactly 64 then it probably isn't happening. The threshold is >32 for auto conversion to TEXT. You can use TEXT explicitly in the type if you want it to be TEXT. You can't though require a long varchar to actually be a varchar.

Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, couple of minor comments.

@@ -50,6 +53,33 @@
_VERSION = VersionTuple(5, 0, 0)


def _makeDimensionGraphTableSpecs() -> ddl.TableSpec:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I'd deleted that. I certainly will now.

-------
graph : `DimensionGraph`
Retrieved graph.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add Raises section to docstring?

The dataset record storage classes were already being passed
a DimensionRecordStorageManager at construction, so passing the
DimensionUniverse to their methods (mostly added very recently on
DM-27251) was just unnecessary.  We'll soon want to pass
DimensionRecordStorageManager to CollectionManager at construction for
other reasons, so I've done that here to use the same pattern of
passing dimensions once early across the board.
@TallJimbo TallJimbo merged commit bf031c4 into tickets/DM-27033 Nov 3, 2020
@TallJimbo TallJimbo deleted the tickets/DM-27390 branch November 3, 2020 04:53
TallJimbo added a commit that referenced this pull request Nov 4, 2020
DM-27390: replace DimensionGraph.{encode,decode}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants