Add SharedString Attribution design doc #11760
Conversation
@connorskees @anthony-murphy for awareness -- starting to get a more fleshed out design in this area. Thinking about SharedString-specific aspects with the rest of my sprint. Edit: maybe I could have made this a draft (the last section needs elaboration), but feedback is appreciated in any case. So oh well :P
> However, this conceptualization of attribution does suggest a reasonable split of concerns that can be individually assessed: none of the association between sequence numbers, timestamps, clientIds, and user information is specific to any given DDS. Thus, all of this bookkeeping could be generically done by the framework (potential candidates include the container runtime, data store runtime, or channel context),
Something interesting to consider here is that an individual op only ever targets a single entity. Usually that entity is a DDS; non-DDS cases are possible, but not very interesting for this discussion, so I'll use DDS from here on out. Since ops have a one-to-one mapping with a single DDS, some of this bookkeeping could be pushed down to the DDS itself in an opt-in manner. For this discussion, the interesting piece is probably just the timestamp, but one could also imagine a DDS storing serialized/linearized ops (each op rebased onto the previous op); something like that would allow time-travel-like functionality, which is out of scope here but still interesting. This could also be useful when thinking about garbage collection, as the only data shared across DDSes would be user info: the DDS could internally manage sequence number to timestamp and anything else it wants. The easiest solution here would be to just put all the user data in the DDS too, but that could cause duplication we may not want, so user data alone could be lifted.
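As a rough sketch of that split (the interfaces below are hypothetical, not existing Fluid APIs): the DDS keeps its own per-op bookkeeping, and the only cross-DDS shared state is the user lookup kept by a higher layer.

```ts
import { IUser } from "@fluidframework/protocol-definitions";

// Opt-in, DDS-local bookkeeping: seq -> timestamp lives next to the DDS data.
interface IOpBookkeeping {
    recordOp(sequenceNumber: number, timestampMs: number): void;
    getTimestamp(sequenceNumber: number): number | undefined;
}

// The only cross-DDS shared state: seq -> user, kept by a higher layer.
interface IUserLookup {
    getUser(sequenceNumber: number): IUser | undefined;
}
```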
> Another way to accomplish similar gains without the need to compare `IUser`s would be to maintain a historical `clientId -> IUser` lookup and use `clientId` in place of `userRef`.
I would move away from clientId; it really doesn't tell us much, as it changes all the time on reconnect and such. IUser requires a user id, but even that is a bit large. I think we can actually leverage the sequence number of the first join op for a user with edits; this is a unique and small id. It will be possible for this to change, but only in the case where a user's info gets GC'd, and the next join + referenced edit will re-add them at their new seq.
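To make that concrete, here's a rough sketch, assuming the runtime can observe join ops and their sequence numbers (UserRegistry and its methods are hypothetical names):

```ts
import { IUser } from "@fluidframework/protocol-definitions";

class UserRegistry {
    // joinSeq -> user info; joinSeq doubles as a small, unique user ref.
    private readonly usersByJoinSeq = new Map<number, IUser>();
    private readonly joinSeqByUserId = new Map<string, number>();

    public onJoin(joinSeq: number, user: IUser): number {
        const existing = this.joinSeqByUserId.get(user.id);
        if (existing !== undefined) {
            // Same user reconnecting: keep the original ref. (If the entry
            // had been GC'd, the user would be re-added at their new seq.)
            return existing;
        }
        this.usersByJoinSeq.set(joinSeq, user);
        this.joinSeqByUserId.set(user.id, joinSeq);
        return joinSeq;
    }

    public getUser(userRef: number): IUser | undefined {
        return this.usersByJoinSeq.get(userRef);
    }
}
```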
We might still use clientId for a DDS to get a reference to the user info for an op within the collab window.
We could also consider the unique-id work @noencke is looking at, particularly around managing the local-to-acked transition.
I also know @andre4i is looking at removing join ops in some cases that use custom servers. That will basically make any kind of attribution impossible. Maybe that's fine since it will be pay-to-play, but it's something to keep in mind.
This whole scheme does have some similarities with the ID compression scheme. We actually built an attribution scheme into the legacy SharedTree's ID compression code, since we could reuse the table we already had for IDs; it maps directly from "content ID" -> "user". This doc is proposing splitting that mapping into two parts: "content ID" -> "sequence number" and then "sequence number" -> "user". The latter is accomplished by the runtime, and the former is accomplished by the DDS.

But the former could also be a runtime service for DDSes; in fact, @justus-camp is investigating this sprint what we could do to lift the ID compression scheme up to the runtime level as a service. We could at that time also build an API into the ID compression that gives the sequence number that created an ID. Then both of these services could work together: you use the id generator service to give IDs to your content, you can query it later to get the sequence number, and then you can use the attribution service to get the user for that sequence number. Both services have some similar requirements (for example, you aren't allowed to serialize the objects they give you without first going through some kind of special serializer), and that could help them feel familiar to users and reduce the concept count when picking them up (i.e. once you learn the "gotchas" of the attribution API you also know the "gotchas" of the ID compression API).
Another option would be to compose the ID Compressor API over the attribution API and have the ID compressor also provide an "ID -> user" API (which under the hood uses the attribution service). Then developers using the ID compressor API don't even have to know about the attribution service directly, but if they don't want ID compression we leave the attribution API exposed for direct use and they can maintain their own "content" -> "sequence number" table.
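A sketch of that layering, with hypothetical service shapes (neither interface below is an existing Fluid API): the composed call resolves ID -> sequence number via the compressor, then sequence number -> user via the attribution service.

```ts
import { IUser } from "@fluidframework/protocol-definitions";

// Hypothetical runtime services; names and shapes are assumptions.
interface IAttributionService {
    getUser(sequenceNumber: number): IUser | undefined;
}

interface IIdCompressorService {
    generateId(): number;
    // Returns the sequence number of the op that created this ID.
    getCreationSequenceNumber(id: number): number | undefined;
}

// Composed "ID -> user" lookup built from the two mappings discussed above.
function getUserForId(
    compressor: IIdCompressorService,
    attribution: IAttributionService,
    id: number,
): IUser | undefined {
    const seq = compressor.getCreationSequenceNumber(id);
    return seq === undefined ? undefined : attribution.getUser(seq);
}
```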
> I would move away from clientId. It really doesn't tell us much, as it changes all the time on reconnect and such. IUser requires a user id, but even that is a bit large.

Yeah, I think storing the user info directly is probably the better choice here anyway. I mostly wanted to call out that user objects aren't really comparable/serializable by default right now, as they're a point of extensibility. I don't think it will be an issue in practice (most representations people use are probably already JSON.stringify-able without issue, and worst-case we could string-compare that version).
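A minimal sketch of that worst-case string comparison, assuming user objects are plain JSON data (stableStringify is a hypothetical helper, not an existing utility; it sorts keys so property order doesn't affect equality):

```ts
import { IUser } from "@fluidframework/protocol-definitions";

function stableStringify(value: unknown): string {
    if (value === null || typeof value !== "object") {
        // JSON.stringify(undefined) yields undefined at runtime; coalesce.
        return JSON.stringify(value) ?? "null";
    }
    if (Array.isArray(value)) {
        return `[${value.map((v) => stableStringify(v)).join(",")}]`;
    }
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
        .sort()
        .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
        .join(",");
    return `{${body}}`;
}

const sameUser = (a: IUser, b: IUser): boolean =>
    stableStringify(a) === stableStringify(b);
```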
> I think we can actually leverage the sequence number of the first join op for a user with edits; this is a unique and small id. It will be possible for this to change, but only in the case where a user's info gets GC'd, and the next join + referenced edit will re-add them at their new seq.
It's an interesting idea, but what benefit does this scheme give us? Seems like dictionary compression of the user field should already reduce size enough here.
RE: similarities with a more generalized ID compression:
> This doc is proposing splitting that mapping up into two parts: "content ID" -> "sequence number" and then "sequence number" -> "user".
I'd challenge this a bit. It's not a general feature of DDSes that each piece of content has a unique ID; that's true for SharedTree but not for SharedString, Matrix, etc. Generally we should be starting with content, not with some ID. It is an interesting point, though, that rather than making DDSes responsible for extracting a seq from the content, they could extract one or more IDs associated with that content that an IdCompressor could attribute to a user.
(I think it is important that the design for attribution allows the possibility of multiple-authoring, which is why I mention potentially more than one ID.)
One relatively clean way to conceptualize this would be for each user to use their id compressor to mint an id for each op they generate, which gives every sequence number an id that could be referenced by DDSes that don't explicitly give ids to their content. This still gives fine-grained attribution detail for DDSes that don't have fine granularity on their content.
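A minimal sketch of that per-op minting idea; every type and name below is an assumption for illustration, not an existing Fluid API:

```ts
// Hypothetical compressor surface: mints small, session-unique IDs.
interface IIdCompressor {
    generateId(): number;
}

interface StampedOp<TContents> {
    // Compressor-minted; later resolvable to a seq, and from there to a user.
    opId: number;
    contents: TContents;
}

// Stamp each outgoing op with a fresh ID so that DDSes with no per-content
// IDs can still point at "the op that produced this change".
function stampOp<TContents>(
    compressor: IIdCompressor,
    contents: TContents,
): StampedOp<TContents> {
    return { opId: compressor.generateId(), contents };
}
```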
Sorry, I was kind of misleading there; by "content ID" -> "sequence number" I simply mean that a DDS must have some way to get a sequence number from a piece of content, not that the content is necessarily associated with a number/string, etc. that we'd call an "ID".
I really like your idea of using the compressor to make an ID for each op. The compressor can mint IDs for anything, not just for individual pieces of content!
Having something small, like a join-seq integer, as the id of a user makes it cheaper for the DDS to store, so the DDS can look up a user for an op. This is premised on the idea that DDSes themselves store op data (seq, timestamp, etc.) and the higher layers only store user info.
As for user info, the IUser object already must be serializable, as it comes over the wire in a join op and is serialized in the quorum snapshot. We only require an id on that object, but most servers add more properties. The ids should also be comparable: multiple clients in the quorum are allowed the same user id, but that should only happen for the same user viewing from multiple clients (tabs).
> #### Timestamp Binning
I'm not sure any kind of binning is really necessary or saves much. Basically seq + timestamp can be stored as two parallel arrays, where a seq's index matches the index of its timestamp. These two arrays can then be very efficiently compressed: https://www.timescale.com/blog/time-series-compression-algorithms-explained/
We can use the data here to play with different ways of storing it and see what we'd want: https://github.com/microsoft/FluidFrameworkTestData
You could even add a third array here for user info ids; that will have lots of duplicates and should also compress well. We're basically doing columnar compression, if you think about the data as rows with columns: seq, timestamp, userid(seq).
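A small sketch of that columnar layout, with delta encoding applied per column before handing the arrays to a run-length or general-purpose compressor (all names are illustrative, not an existing API):

```ts
interface AttributionColumns {
    seqDeltas: number[];       // deltas between consecutive seqs (often 1)
    timestampDeltas: number[]; // deltas between consecutive timestamps
    userRefs: number[];        // e.g. join-seq user refs; long runs of duplicates
}

function encodeColumns(
    rows: { seq: number; timestamp: number; userRef: number }[],
): AttributionColumns {
    const cols: AttributionColumns = {
        seqDeltas: [],
        timestampDeltas: [],
        userRefs: [],
    };
    let prevSeq = 0;
    let prevTimestamp = 0;
    for (const row of rows) {
        cols.seqDeltas.push(row.seq - prevSeq);
        cols.timestampDeltas.push(row.timestamp - prevTimestamp);
        cols.userRefs.push(row.userRef);
        prevSeq = row.seq;
        prevTimestamp = row.timestamp;
    }
    return cols;
}
```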
Maybe rather than binning we should just think about precision, which is similar but different. This could be configured per DDS: at what precision do they store timestamps (1u, 10ms, 1s, 1m, ...)? This would only affect individual timestamps, so the lower the precision, the more duplicates, the smaller the diffs, and the better the compression. The interesting thing is that you could change it at any time without any other changes, as the mapping from seq to timestamp is still 1-to-1. You could even imagine a scheme where we drop precision over time, so older attribution data is less precise but more compressible.
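A sketch of that precision idea, assuming millisecond epoch timestamps (quantizeTimestamp is a hypothetical helper): rounding before delta encoding produces more duplicate deltas and therefore better compression, while leaving the seq -> timestamp mapping 1-to-1.

```ts
function quantizeTimestamp(timestampMs: number, precisionMs: number): number {
    return Math.round(timestampMs / precisionMs) * precisionMs;
}

// 1s precision: quantizeTimestamp(1652385749876, 1000) === 1652385750000
```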
I really like the metapoint here of viewing things as time series and explaining in terms of existing compression schemes. That will be especially handy in terms of code size / reusability if it works well with some of Justus's work on op compression.
> Maybe rather than binning we should just think about precision, which is similar but different. This could be configured per DDS: at what precision do they store timestamps (1u, 10ms, 1s, 1m, ...)? This would only affect individual timestamps, so the lower the precision, the more duplicates, the smaller the diffs, and the better the compression. The interesting thing is that you could change it at any time without any other changes, as the mapping from seq to timestamp is still 1-to-1. You could even imagine a scheme where we drop precision over time, so older attribution data is less precise but more compressible.
This description aligns pretty closely with what I was trying to express with the binning section, so there's probably a gap somewhere in the document :). I'm guessing some wording I had was confusing; could you point me to it?
I think it was just the binning language, and the prominence of the topic; I think that with run-length encoding, even high precision (ms) is probably fine data-wise. The code sample is also misleading, as it loses date information, which I think we will almost always want.
> ### Bookkeeping Placement Considerations
Even given my comments above, I'm still pretty open to where we land here, but I think storing the op info (seq, timestamp, op, ...) compressed and near the DDS, and only keeping user info in the upper layers, could be a good compromise.
Description
This adds a design document for the attribution feature of SharedString.
The immediate goal of this feature is to reduce snapshot size of SharedString documents that use an out-of-platform attribution scheme. Since attribution is an application desire orthogonal to the particular DDS that application uses, the document covers a general split of concerns between elements of the attribution scheme that are SharedString specific and those that could be later lifted higher into the runtime.
Currently, it's primarily the non-DDS-specific aspects that are more fleshed out.