Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shared String: Obliterate op #8690

Closed
pleath opened this issue Jan 8, 2022 · 18 comments
Closed

Shared String: Obliterate op #8690

pleath opened this issue Jan 8, 2022 · 18 comments
Assignees
Labels
ado Issue is tracked in the internal Azure DevOps backlog area: dds: sharedstring design-required This issue requires design thought status: stale
Milestone

Comments

@pleath
Copy link
Contributor

pleath commented Jan 8, 2022

Feature

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Existing work

Describe alternatives you've considered

Additional context

Scenario
Here's the original ask from a partner:

We recalculate field results and place those results in the SharedString. If multiple endpoints attempt to update the field at the same time, the deletes will coalesce, but the inserts won’t, leading to duplicated (though not necessarily identical) content in the document.

E.g:

<DateFieldCode | 1/2/2012>

Could become:
<DateFieldCode | 1/2/20121/2/2012>

Proposal: The Obliterate Op

Normally when we update the field code, we’ll end up performing a delete over the range between the field separator character “|” and the field end character “>” followed by an insert at the same location. Instead, we need to delete existing text in this range, and also any text that our client hasn’t yet received that would end up in that range.

When the local client has a locally applied obliterate, any inserts that come within the obliterate range or adjacent to the obliterate range would be immediately marked as deleted (but no delete ops created).

E.g.

Shared string:

“1234obliterated content567890” @ Sequence #1

Local client creates: Obliterate [4, 23), Ref Seq#1
Result: “123467890”

Incoming op:
Insert “!” @ 4 Ref Seq #1 (client1)

Result: “123467890”

Incoming op:
Insert “*” @ 24 Ref Seq#1 (client1)

Result: “123467890”

Incoming op:
Insert “^” @ 16 Ref Seq#1 (client1)

Result: “123467890”

Incoming op:
Insert “!” @ 4 Ref Seq #1 (client1)

Result: “123467890”

When applying the obliterate, the obliterate would similarly expand to include any adjacent text which was unknown to the sender. That is, the obliterate will expand to include any adjacent ranges with a sequence # lower than the reference sequence number of the obliterate op.

Example:

Shared string:

“1234obliterated content567890” @ Sequence #1

“1234!!obliterated content67890” @ Sequence #5

Incoming op:
Obliterate [4, 23), Ref Seq #1 (client2)

Result: “123467890”

AB#158

@pleath pleath self-assigned this Jan 8, 2022
@pleath pleath changed the title Shared String Functionality: Obliterate op Shared Interval: Obliterate op Jan 8, 2022
@anthony-murphy
Copy link
Contributor

annotate seems like it would be the solution here

@pleath
Copy link
Contributor Author

pleath commented Jan 13, 2022

Are you saying we could annotate the delete op to indicate that it should have the obliterate semantics? Or something more complicated?

@anthony-murphy
Copy link
Contributor

No. the data they want to obliterate should be stored as annotations, either on the text or in a marker. annotates work like maps, last write wins, which seems like the behavior they want.

@pleath
Copy link
Contributor Author

pleath commented Jan 13, 2022

I see.

@anthony-murphy
Copy link
Contributor

anthony-murphy commented Jan 25, 2022

I was thinking about this a bit yesterday, and a possible solution is something like below. i'm not wholly convinced it will work right, so it needs more vetting.

  1. We allow insertion of markers, with a specified length, there data is still in annotated properties. for the below, it must also have a maker id. marker id handling may have bugs, so if we depend on it we probably have some stabilization to do there.
  2. add a new op which is an atomic replace for a marker based on its id. this would be implemented as a remove and insert, ideally mergetree.ts never even knows about this, its all handled in client.ts. there is complexity around pending local operations, need to ensure last sequenced wins on all clients.

i also don't have a clear picture of how we handle properties for this marker on atomic update, do they coalesce. combine, or overwrite? this is also true for pure annotates, how do they mix with atomic updates.

also, what happens if there is not marker with the id on atomic update, my hunch is it should be a no-op

@anthony-murphy
Copy link
Contributor

fyi @DLehenbauer and @jack-williams

@jack-williams
Copy link
Contributor

jack-williams commented Jan 27, 2022

Just to understand the original ask. The goal is to really have something like a range replace:

replace(pos, length, content)

That removes range from pos to pos + length and replaces it with content, and additionally removes any concurrently inserted content into that range?

The obliterate op being proposed is a special type of delete that additionally removes segments in the delete range that were concurrently inserted, as normal delete will not delete other client's content. Is that right? So to achieve the replace you would do something like:

replace(pos, length, content) === obliterate(pos, length) then insert (pos, content)

The replace would need to be sent as a single op and interpreted locally as obliterate + insert to avoid interleaving.

I think that could work, but beyond this case I don't know how useful obliterate is as a new primitive. (Haven't thought about it much).

@anthony-murphy's marker approach, which if I understand correctly is effectively adding "tokens" to the merge tree, that cannot be split, have identity, and can be updated using a variety of resolutions. That seems quite versatile and powerful to me, and would be worth looking into IMO.

I'm not very familiar with markers. Are ID's tracked globally to prevent concurrent insert of markers with the same ID?

Is the reason for giving markers arbitrary length only to make it easier to map positions when the sequence is rendered as something like a string. For instance. Could it be possible to use a 0 or 1 length marker, but understand that positions after the marker in the string must be adjusted. For instance:

Segments: ["text": len 4][marker{id=6}: len 1]["more text": len 9]
Rendered as "textCONTENTmoretext"

Visual updates to position 12 get adjusted by -6 (to account more marker content) before being applied to the shared sequence. I'm guessing that if this is added to shared string, having the segments reflect the rendered string is desirable, hence giving the marker the length of the content avoids this translation.

If markers have arbitrary length, what happens for inserts or deletes partially overlapping. Is this handled by the segment split logic returning undefined?

also, what happens if there is not marker with the id on atomic update, my hunch is it should be a no-op

Yeah agreed.

@anthony-murphy
Copy link
Contributor

Great description @jack-williams. Let me answers a couple of the questions

I'm not very familiar with markers. Are ID's tracked globally to prevent concurrent insert of markers with the same ID?

Yes, at least that is how they are indented to work. They are currently not well used, so there could be existing bugs

Is the reason for giving markers arbitrary length only to make it easier to map positions when the sequence is rendered as something like a string. For instance. Could it be possible to use a 0 or 1 length marker, but understand that positions after the marker in the string must be adjusted. For instance:

Today markers have a default and fixed length of 1, no segment can have a length of 0 (text or marker) that will be rejected by the merge tree on insert. The reason to allow a configurable length is exactly as you laid out, to allow easier mapping between shared string cp space and whatever the application model is.

If markers have arbitrary length, what happens for inserts or deletes partially overlapping. Is this handled by the segment split logic returning undefined?

Today markers cannot be split, there is code to enforce this, but it is weak. This has not been a problem as all markers have length 1, so there is nothing to split. A change like this would introduce complexity. We have a similar problem with annotates. For local operations i would just reject it the specified range didn't cover the whole marker width. Remote is much more complicated. I don't know what the right behavior for remote operations is but whatever we do will need to be:

  • eventually consistent - this has implications for the originator too as all clients need to reach the same end state
  • instance stability - currently a lot of people bi-directionally link there model directly to the share string segments. I would love to update our api to just support this, like local annotations, so free our hands a bit here
  • predictable semantics - keep the behavior simple and easy to understand and predict

My hunch is we'd go with expansion to cover the full range. Currently, if you annotate a segment that some else deletes, your annotate basically becomes a no-op. The same thing probably happens with atomic update in this case in that the segment is gone, so your update is a no-op.

@pleath pleath added design-required This issue requires design thought area: dds: sharedstring labels Jan 31, 2022
@pleath pleath added this to the February 2022 milestone Feb 7, 2022
@pleath pleath modified the milestones: February 2022, March 2022 Mar 1, 2022
@pleath pleath changed the title Shared Interval: Obliterate op Shared String: Obliterate op Mar 2, 2022
@pleath
Copy link
Contributor Author

pleath commented Mar 2, 2022

One solution @anthony-murphy and I were kicking around involves inserting markers with length=1 at the beginning and end of the obliterated range. It seems like that would let us cordon off an obliterated subtree in a way that would withstand splitting of segments, etc. That in turn could help with conflict resolution. Keep length=1 also avoids all the complexity around making sure that markers themselves can't be split.

It does seem like it would take an op to remove the markers. (Which brings up the question: when, if ever, could the markers be removed? From the partner ask, it seems like the obliterate command continues to prevent changes to the obliterated region after the obliterate's own sequence number. So it seems like we would keep the markers in place until we received an "end obliterate" command from the user. The API should probably be clarified with the partner.)

The fact that the markers have non-zero length also means that the app model would have to take them into account. In side discussions, @vladsud has mentioned that "virtualization of cp space" would likely be required.

@jack-williams
Copy link
Contributor

If the solution inevitably involves having to map between the clients view and the merge tree, is it also worth considering the approach that uses a single length 1 marker/token with the content stored in an annotation?

@vladsud
Copy link
Contributor

vladsud commented Mar 7, 2022

Paul - I'm not sure I follow your last proposal.

Tony's proposal essentially is about implementing obliterate for markers. It's simpler than doing same for arbitrary text range (as it has a lot of cases that needs to be taking into account, which makes design really complex). But can we avoid adding obliterate semantics?
What if we treat length of a marker segment similar to how we treat any other property on a marker? At least that's how it's modelled internally (the actual implementations can warry, and we can expose it (through API) however we like it). Treating it as any other property has clear semantics RE if marker is deleted - all property updates are dropped on the floor.

There is a complexity of updating "user" properties (actual text) and marker length through one op (such that application can maintain consistency here - length of text in one of the properties to be equal to marker's segment size). Have not thought much about it, but I hope we can make it happen.

I also think we do not need to have some unique ID for marker. Or at least I do not see (outside above problem) how handling of such marker and its properties today differs from what we need, it seems like app would have everything it needs to be successful without adding more constructs.

And there should be no issues with splitting such marker, as there is no API to do so (same as today). One can insert delete markers or update marker properties, but that's about it.

Even better if it was a separate layer (not baked into Sequence) that allowed CP virtualization (i.e., only this layer would know about such property as being special). While it may be complex to build, once we have a skeleton, it would be super easy to reuse it in other places where app would have different needs RE virtualization.

@anthony-murphy
Copy link
Contributor

anthony-murphy commented Mar 7, 2022

The issue with Markers is that any property will have last write win semantics. however, the partner still wants collaborative edit in the range they wish to obliterate. Their specific scenario is that they store comments inline with the text, this has some nice features around locality. They want comments to be collaboratively editable, both individually, and threads. however, if the whole comment thread is deleted, they want to obliterate the whole range of the comment thread, and throw any incoming changes away.

@jack-williams
Copy link
Contributor

If you add virtualization over the sequence, is "obliterate" possible to emulate by toggling the markers as "off" and having the rendering logic skipping all content (regardless of tombstone status) between "off" pairs of markers.

This leaves the issue of clearing up concurrently inserted content that does not get rendered by the view, and also deleting the markers.

If there is a process to delete text between "off" markers, then eventually will we reach a point where zamboni GC's the segments between the markers and we get adjacent "off" markers? At this point could they be deleted themselves?

@vladsud
Copy link
Contributor

vladsud commented Mar 8, 2022

@anthony-murphy - hm, I was assuming this is about fields in Word, and this issue starts with description of fields. Curious, is this new requirement? Would be great to capture in the top comment of this issue.
I think for fields, there is no ability to implement collaborative editing, it's LWW for whole field (user text and calculated portion)

@anthony-murphy
Copy link
Contributor

yeah. when we talked to them they brought up a bunch of other cases, like comments that don't fit the field model

@vladsud
Copy link
Contributor

vladsud commented Mar 8, 2022

Commenting scenario seems like (logically) can be achieved as marker with a handle pointing to another Sequence instance. Not sure RE requirement of CP space / virtualization. I wonder if we are better off investing into cheaper "child" sub-Sequences in some form, also helping other scenarios (like tables).

@pleath pleath modified the milestones: March 2022, April 2022 Mar 10, 2022
@pleath pleath removed their assignment Mar 10, 2022
@vladsud
Copy link
Contributor

vladsud commented Mar 30, 2022

Internal docwith a bit more info.

@anthony-murphy anthony-murphy modified the milestones: April 2022, June 2022 Apr 4, 2022
@curtisman curtisman added the ado Issue is tracked in the internal Azure DevOps backlog label Jun 20, 2022
@ghost ghost added the status: stale label Dec 18, 2022
@ghost
Copy link

ghost commented Dec 18, 2022

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@ghost ghost closed this as completed Dec 26, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ado Issue is tracked in the internal Azure DevOps backlog area: dds: sharedstring design-required This issue requires design thought status: stale
Projects
None yet
Development

No branches or pull requests

5 participants