Skip to content
This repository has been archived by the owner on Aug 5, 2024. It is now read-only.

Stop breaking surrogate pairs in toDelta()/fromDelta() #80

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

dmsnell
Copy link

@dmsnell dmsnell commented Nov 9, 2019

Fixes #10, #59, #68
Alternate to #69

Fixes for Java, JavaScript, Objective C, Python2, Python3

Status

  • Please scrutinize the code and think of any test cases not covered already in this patch.
  • We've deployed these changes in Simplenote and are uncovering any issues we can find. We've been running the code in this patch for years now and no issues or regressions have appeared.
  • Please let us know your thoughts on the direction here. This is a pragmatic patch in that it's addressing the problem where it's at and tries to impose no new constraints on the library where possible.
  • Since we haven't received any direction it's my intention to merge this into my fork when done and start referencing dmsnell/diff-match-patch instead of google/diff-match-patch. I'd prefer merging these fixes upstream and would happily do the work required if @NeilFraser provides guidance or opinions or a simple 👍.
  • Everything in this patch improves current problems. Even though this doesn't resolve all of the problems breaking surrogate pairs everywhere it's only a net win for many use cases. I'm working on fixing this everywhere though, but don't see any reason to

Todo:

  • Fix CPP or verify not a problem
  • Fix CSharp or verify not a problem
  • Fix Dart or verify not a problem
  • Fix Lua or verify not a problem
  • Refactor to use cleanupSplitSurrogates in JavaScript
  • Refactor to use cleanupSplitSurrogates in Java
  • Refactor to use cleanupSplitSurrogates in Objective C
  • Refactor to use cleanupSplitSurrogates in Python2
  • Refactor to use cleanupSplitSurrogates in Python3
  • Refactor to use cleanupSplitSurrogates in CPP
  • Refactor to use cleanupSplitSurrogates in CSharp
  • Refactor to use cleanupSplitSurrogates in Dart
  • Refactor to use cleanupSplitSurrogates in Lua
  • Fix patch_toText in JavaScript
  • Fix patch_toText in Java
  • Fix patch_toText in Objective C
  • Fix patch_toText in Python2
  • Fix patch_toText in Python3
  • Fix patch_toText in CPP
  • Fix patch_toText in CSharp
  • Fix patch_toText in Dart
  • Fix patch_toText in Lua
  • Figure out a "minimal" set of unit tests so we can get rid of the big chunk currently in the PR, then carry it around to all the libraries. The triggers are well understood, so we can write targeted tests instead of broad ones.

Major changes

  • in toDelta() we no longer attempt to create URI-encoded strings with split surrogates. this was either crashing client applications or corrupting the data, depending on which platform was in use.
  • in fromDelta() we are recovering when possible from corrupted delta strings produced by unpatched versions of this library. there are two fixes:
    • sometimes we encode surrogate-pair halves as if they were valid Unicode (they aren't). we'll now decode those anyway because after applying a diff with these invalid code points we should re-join the surrogate halves and get valid Unicode out
    • in unpatched ObjectiveC libraries we might have created invalid diffs where we send (null) in between two surrogate halves. this is caused by the original bug and we might receive these sequences in a delta string: (high surrogate)(null)(low surrogate). If we leave these in here then diff-match-patch will operate fine but it will send invalid Unicode that it created onto the consuming application. in this patch I've added guards against this very specific pattern so that we can remove the unintentionally injected (null) and re-join the surrogate halves. this means that we're mangling the input but that input would have already been mangled anyway and presumably it'd be "impossible" to send the same sequence of code units to diff-match-patch since something would have had to send invalid Unicode in the first place.

Working on this bug has surfaced a bigger question about what diff-match-patch expects. All of the Simplenote applications work at the level of UTF-16 code units and a large part of that is because this is how most of the diff-match-patch platforms work.

Something seems ideal about referencing Unicode code points in diff-match-patch instead of any particular encoding, but to change now seems like it would break way more than it would fix. I'd like to propose that we think about creating a new output version which could be selected to specify that diffs do that.

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" )
  // no specification means existing behavior
) // "=2	+%F0%9F%85%B1	=2";

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" ),
  { unit: 'native' } // existing behavior - whatever the unit of a string is
) // "=?	+%F0%9F%85%B1	=?";

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" ),
  { unit: 'ucs2' } // make UCS2 the "default" since most libraries used it already, or something
) // "=2	+%F0%9F%85%B1	=2";

diff_toDelta(
  diff_main( "🅰🅲", "🅰🅱🅲" ),
  { unit: 'codePoint' } // empty addition group at the beginning to denote Unicode indices
) // "+	=1	+%F0%9F%85%B1	=1"; // empty group ignored by old clients

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into toDelta() we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but shouldn't change
that much - only by a single character at surrogate boundaries.

Notes

  • This is my first contribution to this repo and I'm obviously not familiar with the style and this patch is obviously incomplete. My purpose is to propose code with working tests to demonstrate the change and I expect to clean up any remaining details that need resolving.
  • I've pulled in the surrogate tests that @josephrocca added and they are all passing. In JS: Handle surrogate pairs correctly #69 the issue causing these failed tests was using isSurrogate() for all cases when we should have been using isHighSurrogate() and isLowSurragate() depending on the context.

Thanks for any and all help and feedback.

@dmsnell dmsnell changed the title JavaScript: Stop breaking surrogate pairs in toDelta() Stop breaking surrogate pairs in toDelta() Nov 9, 2019
@dmsnell dmsnell force-pushed the issues/69-broken-surrogate-pairs branch from 01426e2 to 71646fb Compare November 9, 2019 21:07
dmsnell added a commit to Simperium/node-simperium that referenced this pull request Nov 12, 2019
See google/diff-match-patch#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-patch` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.
dmsnell added a commit to Simperium/node-simperium that referenced this pull request Nov 12, 2019
See google/diff-match-patch#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-patch` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.
dmsnell added a commit to Automattic/simplenote-electron that referenced this pull request Nov 13, 2019
See: Simperium/node-simperium#91
See: google/diff-match-patch#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-patch` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.

This patch updates the dependency on the patched version of
`node-simperium`.
dmsnell added a commit to Simperium/node-simperium that referenced this pull request Nov 14, 2019
See google/diff-match-patch#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-patch` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.
dmsnell added a commit to Automattic/simplenote-electron that referenced this pull request Nov 14, 2019
* Fix: No more crashing/skipped changes for certain changes

See: Simperium/node-simperium#91
See: google/diff-match-patch#80

There have been certain cases where consecutive surrogate pairs crash
`diff-match-patch` when it's building the _delta string_. This is
because the diffing algorithm finds a common prefix up to and including
half of the surrogate pair match when three consecutive code points
share the same high surrogate.

```
// before - \ud83d\ude4b\ud83d\ue4b
// after  - \ud83d\ude4b\ud83d\ue4c\ud83d\ude4b
// deltas - eq \ud83d\ude4b\ud83d -> add \ude4c -> eq \ud83d \ude4b
```

This crashes when trying to encode the invalid sequence of UTF-16 code
units into `URIEncode`, which expects a valid Unicode string.

After the fix the delta groups are normalized to prevent this situation
and the `node-simperium` library should be able to process those
problematic changesets.

This patch updates the dependency on the patched version of
`node-simperium`.
@dmsnell
Copy link
Author

dmsnell commented Dec 5, 2019

@NeilFraser is there anything I can do to help this patch/issue move along?

@dmsnell dmsnell force-pushed the issues/69-broken-surrogate-pairs branch 2 times, most recently from fbb8c73 to 55a8779 Compare December 13, 2019 06:47
@dmsnell
Copy link
Author

dmsnell commented Dec 13, 2019

Quick update for those following: we identified issues when the surrogate pairs were in DIFF_DELETE groups which led to crashing behaviors.

I've updated the PR with an improved patch to the JS/Java libraries and hope to port the changes to the other libraries tomorrow.

In this update I've included a custom decodeURI. The Python library happily URI-encodes a string that isn't valid UTF-8 and that causes other libraries to crash. Well, it's fine by us if we treat half of a surrogate pair as its own code point. The illegitimate part comes because those code points are reserved for surrogate pairs, not because there's no way to decode the value as if it were a real code point.

This custom decodeURI makes it possible for clients to handle diff deltas produced by libraries that don't have the fixes in this PR.

@dmsnell dmsnell changed the title Stop breaking surrogate pairs in toDelta() Stop breaking surrogate pairs in toDelta()/fromDelta() Dec 13, 2019
@NeilFraser
Copy link
Contributor

Q: What happens in the different languages when diffing "人" vs "中"? JavaScript should just say delete one character, insert another. But what about Python?

@dmsnell
Copy link
Author

dmsnell commented Dec 17, 2019

Thank you so much @NeilFraser for the response!

Q: What happens in the different languages when diffing "人" vs "中"? JavaScript should just say delete one character, insert another. But what about Python?

Right now it appears like diff-match-patch leaves this unspecified. The Lua library of course makes this plain and clear.

It doesn't even need to be a question about language though. Python2 is internally inconsistent because if Python is compiled in narrow mode it returns different numbers than if Python is compiled in wide mode.

In my patch I'm normalizing doDelta on UTF-16 code units because that seems most consistent with most libraries in most situations. Fundamentally I don't see a way to provide full backwards compatibility on this change because there was no backwards behavior - or the behavior was already backwards 🙃

In other words before my patch it's more different across libraries and platforms than after it. After this patch the languages which have been implemented here should all return the same delta - delete one character, insert another.

In tests across JavaScript, Java, Objective-C, Python2, Python3

@dmsnell
Copy link
Author

dmsnell commented Dec 20, 2019

@NeilFraser is there a way to run the suite of relative performance diffs or is that simply generated from running speedtest in each language/platform?

I've been running speedtest on this PR and I haven't seen anything significant in terms of performance. I think I've buttoned up about as much as we can with the bug.

the only real major issue is the question of what the indices should be, and I've tried to not change that other than in Python2 compiled in wide mode and in Python 3. we can back out the Python 3 change if we need to preserve the behavior. the other languages/platforms in this patch all previously returned UCS-2 code units.

@salzhrani
Copy link

We also faced an issue with surrogate pairs here where the substring breaking doesn't account for surrogate pairs.

dmsnell added a commit to Simperium/node-simperium that referenced this pull request Feb 5, 2020
In 1.0.2 we applied an update from google/diff-match-patch#80 to resolve the
surrogate pair encoding issue.  Since that time we found a bug in the fix and
resolved that and therefore this patch brings those additional updates to
Simeprium.

There previously remained issues when decoding broken patches that already
existed or which came from unpatched versions of the iOS library and also
remained issues when unexpectedly receiving empty diff groups in `toDelta`
dmsnell added a commit to Simperium/node-simperium that referenced this pull request Feb 17, 2020
In 1.0.2 we applied an update from google/diff-match-patch#80 to resolve the
surrogate pair encoding issue.  Since that time we found a bug in the fix and
resolved that and therefore this patch brings those additional updates to
Simeprium.

There previously remained issues when decoding broken patches that already
existed or which came from unpatched versions of the iOS library and also
remained issues when unexpectedly receiving empty diff groups in `toDelta`
@dmsnell
Copy link
Author

dmsnell commented Mar 13, 2020

@NeilFraser what can we do to be helpful here? We've been running with these changes in Simplenote for more than a month and it eliminated all of the encoding issues we were experiencing.

I know that there is still a "better option" of adjusting the way that we break strings up so that we never get broken surrogate halves, but from a practicality and simplicity standpoint I think this is solid (maybe there are bugs we've silently missed).

@dmsnell
Copy link
Author

dmsnell commented Mar 29, 2020

When reading through a Rust implementation I found that the author created a separate cleanup_char_boundary(diffs) function to adjust the output when the diff groups would otherwise split. that's a method very close to this one except it occurs in the diffing stage instead of the delta production.

https://github.com/dtolnay/dissimilar/blob/master/src/lib.rs#L402

@mustafamunawar
Copy link

mustafamunawar commented Dec 21, 2020

@dmsnell What's the status of this? Does Unicode work correctly yet in master?

@dmsnell
Copy link
Author

dmsnell commented Dec 21, 2020

@mustafamunawar we are waiting feedback or direction from @NeilFraser. We have been using this branch successfully in Simplenote for the past nine months without issue. You are welcome to try it out and watch this PR for an eventual merge.

valr added a commit to valr/website that referenced this pull request May 3, 2021
@dmsnell
Copy link
Author

dmsnell commented May 22, 2021

@dmak

first doing the diff on byte level and then mutating (extending / shrinking) a bit the resulting diffs so that they do not eventually stop in the middle of multi-byte sequence - @dmak

this is the basic procedure in this PR as well as has been taken in the dissimilar library linked above. it's a good idea and one that solves the problem of crashing when splitting Unicode boundaries.

linking here to avoid clouding your Maven PR (#2) with noise about Unicode boundaries

@dmsnell
Copy link
Author

dmsnell commented May 26, 2022

@NeilFraser I ran into this problem again today through a dependency-of-a-dependency pulling in diff-match-patch (via jsondiffpatch) and wondering what I can do to help move this forward.

For Simplenote we're able to pull in this branch and that works, but with the transitive dependencies in other projects we don't get as much of a luxury. Would rather avoid forking the repo especially since everything in the world basically refers to this one (and you've put so much work into it), but I feel like if we don't fix master we're going to continue to spot this Unicode violation over and over again and I don't know how to get there without your help.

Again, I'm happy to put more energy into the fix. We've been going for more than two years with this branch in Simplenote and as far as we can infer the issue is entirely eliminated and no side-effects have crept in. Our users write some very long texts too and we've had no indication of performance issues.

The approach in the Rust library I think is more elegant than my solution but I think mine here is the least intrusive on the overall library, addressing the problem of splitting surrogate pairs only where it becomes a problem, that is, when URL-encoding the patch.

Really appreciate all you've built here and I assume you're busy; just trying to find a way to keep this from being a constant headache.

@gamedevsam
Copy link

I've tried all 3 solutions (#13, #69, #80) to this problem and #13 is the only one that solved the encodingURI exception for me. I created a minimal repo which clearly reproduces the issue: https://github.com/gamedevsam/diff-match-patch-emoji-issue

Hope it's helpful to make this solution more robust.

@dmsnell
Copy link
Author

dmsnell commented Sep 10, 2022

@gamedevsam I think it's possible you confused diff-match-patch with jsondiffpatch. This patch has been working fine with Simplenote for coming up on three years now and has been fully reliable.

It does limit itself to toDelta, as noted, and jsondiffpatch is using the toText functions. If you'd like you can propose a commit on this PR to add support there as the process should be the same as with toDelta. I'm busy on other things so I probably won't do this any time soon without other contributions and since @NeilFraser has been silent on the issue (if @NeilFraser could give us some direction or indication that this would merge I'd go ahead and clean up the patch).

Unfortunately #13 has a massive performance problem particularly for long texts and likely that approach (converting strings to arrays of integers) won't be viable for a generic library like this, especially since this patch (#80) has the fix to get to the same result.

You can check with the author of jsondiffpatch and look at these lines. Get that library to use this patch and then apply the same updates to toText as we have with toDelta and it should work for you 👍

@michal-kurz
Copy link

michal-kurz commented Oct 4, 2022

@dmsnell Thank you so much for all the work you've put into this accross all the threads! I initially tried fixing this myself, and you helped me realize my approach was way too naive! :)

We use jsondiffpatch, and are running into this problem. Since this repo seems to be dead, I would love to make a public jsondiffpatch fork that fixes this issue. I came up with a POC fix of toText() function which builds on top of your PR, Here it is. Would you please care to review it? It seems to fix our use-case, but it's very difficult for me predict if it could introduce any new problems (the tests pass, but that doesn't give me much confidence) - maybe you will see something I don't :)

About my solution:

  • I pretty much just separated your surrogate-pair-joining logic into a separate function, as it seems to me like a concern separate from any other - it may make sense to use it in toDelta() also, but I didn't do that for now to keep the diff minimal
  • I'm also wondering whether it's optimal to be fixing this problem at point of toDelta()/toText() conversion - have you maybe considered fixing at patch creation, so this problem would never occur in the first place? Maybe inside patch_make, or something. Although it may very well be that other functions like diff_cleanupSemantic could re-break the patch... I haven't tried this, just wondering if there could be a central way to fix this once for good :)
  • It's quite clear that dpm prioritizes performance above all else, and it's hard for me to reason about such micro-optimizations.
    • Do you think there is any value in not creating a separate newDiffs array, but writing into the existing one instead (and popping the difference in length in the end, since your algo skips empty patches)?
    • Likewise, it might be preferable not to use my function inside toDelta(), as it would require an additional iteration of all diffs for each patch

🙏

@dmsnell
Copy link
Author

dmsnell commented Oct 4, 2022

Nice work @michal-kurz. If you want to propose a PR with that code against this branch we can merge it in. I'd like to explore using your function in toDelta, and I'd like to look at removing the extra iteration, as I don't believe it offers any value above the cloning method you took (though I'm happy to be shown something I'm overlooking!)

@michal-kurz
Copy link

michal-kurz commented Oct 4, 2022

@dmsnell Thank you! I made a PR to your branch. Unfortunately there are some unwanted changes in whitespaces comming with it - I spent the last 25 minutes failing to undo them, and the Github diff keeps looking different from my local, no idea why. I'm way too frustrated to continue with that at this point 😠

As for incorporating the splitSurrotagePairs into toDelta: yes, I don't think there is anything to gain other than proper separation of that concern, which I tend to generally value much higher than preventing an additional iteration - I really don't think there is a clean way to retain both. But I guess you could find some middle ground (i.e. sending a callback inside splitSurrogatePairs to do the bidding of the invoker rather than returning corrected diffs).

@dmsnell
Copy link
Author

dmsnell commented Oct 4, 2022

Unfortunately there are some unwanted changes in whitespaces comming with it - I spent the last 25 minutes failing to undo them, and the Github diff keeps looking different from my local

oh the pain; I feel you: been there, done that. typically I find that some IDE is adding or removing the extra whitespace. did you try using nano or vim or pico or some editor that won't muck around with anthing?

might also verify you don't have custom pre-commit hooks that are running formatting.


by the way I suspect you meant to poke me, @dmsnell, and not dmaclach

@michal-kurz
Copy link

@dmsnell Solved, I ended up soft-resetting, re-commiting only the desired lines and force-pushing the result.

And yes, I did mess up the mention tag, also fixed.

@michal-kurz
Copy link

michal-kurz commented Oct 6, 2022

So unfortunately I found my solution for toText to be incomplete.

I sometimes end up with a single surrogate on the very edge of the whole patch (first char of first diff, or last char of last diff). For example:

str1 = '. 🔧🚗\n\--X'
str2 = '. 🔧🚗\n\--Y'

dmp.patch_make(str1, str2)
// Results in three diffs:
// [
//    {0: 0, 1: '\uDE97\n--'},
//    {0: -1, 1: 'X'},
//    {0: 1, 1: 'Y'}
// ]

Such case cannot be fully "losslessly" shifted with @dmsnell 's approach - I have to decide to either expand or shrink the whole diff scope.

In my case, the broken diffs were created as prefix and suffix generated inside patch_addContext_ - those happen to get sampled at a fixed distance from the actual change.

I patched those by extending them one character outward in case it's an unpaired surrogate, and this seems to have solved the problem (here is a PR targeted at our fork of dmp). I decided for this approach, because I'm worried that shrinking the scope of the diff could cause some unforseen problems (mainly when coverting back to patch via fromText and then using inside other functions). And I unfortunately couldn't do this inside toText, because I don't have access to the full text there (and thus don't know what character to expand my diff with).

I hope this is the only potential source of such broken patch. If so, the toDelta solution in this PR should be safe, as prefix and suffix are both DIFF_EQUAL, and toDelta is only concerned about diff's length in that case, not its contents.

@dmsnell
Copy link
Author

dmsnell commented May 10, 2023

@michal-kurz I added a patch that's almost identical to yours here (should have read yours first 🙃) but I make a couple extra bounds checks to avoid extending the patch context before or after the document boundaries.

had to apply both your fix though and the _toDelta fix at the same time else we were still getting split surrogates in the patch_ family of functions. by moving my fix into a new cleanupSplitSurrogates function though that seems to cover the bases.

one thing I wish were better is that when I moved cleanupSplitSurrogates into the same chain of fixup commands through the stages of diffing, it seemed to break. as is, it appears to be necessary as the last cleanup in the chain, reserved for output functions that encode URIs.

that being said, we're still without regression in Simplenote and other places at Automattic where we've deployed this patch and it's been a few years now, so I think whatever practical concerns there may be with this approach, it's clearly better than master and clearly not introducing surprise gotchas that will amount to anything near the existing problems. in other words, we haven't found any problems with it even though there could be something in this I've missed.

if there are problems with the approach in this PR, they are so rare that it's not comparable to the wildly-common bugs present in master.

Resolves google#69 for JavaScript

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
Resolves google#69 for Java

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
Resolves google#69 for Objective-C

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
Resolves google#69 for Python2

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
Resolves google#69 for Python3

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should change
that much - only by a single character at surrogate boundaries.
@dmsnell dmsnell force-pushed the issues/69-broken-surrogate-pairs branch from d681ef6 to 50f1542 Compare January 30, 2024 23:54
@dmsnell
Copy link
Author

dmsnell commented Jan 31, 2024

All: I have merged this into my fork and it's now in the master branch.
I'd welcome help in porting the fix to the other libraries, or in rewriting the fix to use the approach taken by dissimilar.

Unfortunately I won't have a lot of time to actively maintain the fork, but I will do my best to be responsive and help merge in updates.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JavaScript implementation crashes on Unicode code points
6 participants