-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-27390: replace DimensionGraph.{encode,decode} #415
Conversation
f976944
to
a8ab6b1
Compare
6b292c6
to
95bba04
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK, one minor comment.
hasher = hashlib.blake2b(digest_size=self.DIGEST_SIZE//2) | ||
for name in self.required.names: | ||
hasher.update(name.encode("ascii")) | ||
return hasher.hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not trust hashes in general due to potential collisions. In this case collisions are possible for cases like ["a", "b"]
vs just ["ab"]
. It may help adding a non-word separator character (e.g. TAB) to avoid this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll at least add that separator; good idea.
I also considered just using an autoincrement integer for the ID (so there would be no DimensionGraph
method to get it). The problem with that is that we'd have to load essentially the entire table from the database in order to see if a particular graph already has an ID, though that's more about code complexity than performance because the table should always be small. I also liked the idea of making the ID deterministic across different data repositories, but I don't have a concrete reason why that needs to be the case. Do you think an autoincrement ID would be better here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deterministic ID would be nice to have but I guess it's hard to construct or make it stable. If we do not need this ID outside database relation context then autoincrement is probably a better match for that. That table is small so it just a single query/round-trip to database, so I would not worry about performance, and code complexity should not be terrible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added another commit that uses autoincrement integer IDs instead; please take a look. Once the review is finished I'll squash commits and probably just remove DimensionGraph.digest
.
ddl.FieldSpec( | ||
name="dimension_name", | ||
dtype=sqlalchemy.String, | ||
length=64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have probably defined a bunch of constants for these 64
magic numbers too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure they don't get converted to TEXT...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point I'd be more inclined to just use sqlalchemy.Text
and leave the length blank. So many of them are arbitrary, or at best "guess and check".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(messages crossed) I think this does get translated to TEXT, but that's controlled by different code, so the magic number here is still a bit problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main comment is to note that if you needed it to be exactly 64 then it probably isn't happening. The threshold is >32 for auto conversion to TEXT. You can use TEXT explicitly in the type if you want it to be TEXT. You can't though require a long varchar to actually be a varchar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, couple of minor comments.
@@ -50,6 +53,33 @@ | |||
_VERSION = VersionTuple(5, 0, 0) | |||
|
|||
|
|||
def _makeDimensionGraphTableSpecs() -> ddl.TableSpec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I'd deleted that. I certainly will now.
------- | ||
graph : `DimensionGraph` | ||
Retrieved graph. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add Raises
section to docstring?
455d209
to
1657740
Compare
The dataset record storage classes were already being passed a DimensionRecordStorageManager at construction, so passing the DimensionUniverse to their methods (mostly added very recently on DM-27251) was just unnecessary. We'll soon want to pass DimensionRecordStorageManager to CollectionManager at construction for other reasons, so I've done that here to use the same pattern of passing dimensions once early across the board.
1657740
to
b9daf2d
Compare
DM-27390: replace DimensionGraph.{encode,decode}
No description provided.