Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.OsPath.Data.ByteString.Short.Word16.isInfixOf looks dodgy #195

Closed
sol opened this issue Jun 9, 2023 · 9 comments · Fixed by #199
Closed

System.OsPath.Data.ByteString.Short.Word16.isInfixOf looks dodgy #195

sol opened this issue Jun 9, 2023 · 9 comments · Fixed by #199
Labels

Comments

@sol
Copy link
Member

sol commented Jun 9, 2023

I haven't tried anything, but from reading the code, it looks like this is just a re-export from bytestring. Consequently, I think it can yield false positives if you were to use it for substring matching of UTF-16.

Is this by intention?

@hasufell
Copy link
Member

hasufell commented Jun 9, 2023

Hi, windows is not really UTF-16, but UCS-2. Windows does not enforce well formed surrogate pairs (as in: it accepts any sequence of WCHARs).

Does that answer your question?

@sol
Copy link
Member Author

sol commented Jun 10, 2023

Does that answer your question?

Not really. What I'm saying is that while you can of course use e.g. isPrefixOf and isSuffixOf from Data.ByteString.Short on UTF-16 (and UCS-2 for that matter) without getting surprising results, this is not generally true for isInfixOf.

If you use Data.ByteString.Short.isInfixOf for substring matching on UTF-16 you can get false positives (the same is true for e.g. UTF-32, but notably not for UTF-8).

Example:

ghci> import System.OsString.Internal.Types
ghci> foo <- System.OsPath.Windows.encodeFS "λ"   -- UTF-16:      0xbb 0x03
ghci> bar <- System.OsPath.Windows.encodeFS "믒괃" -- UTF-16: 0xd2 0xbb 0x03 0xad
ghci> System.OsPath.Data.ByteString.Short.Word16.isInfixOf (getWindowsString foo) (getWindowsString bar)
True

That's why I was surprised to see that System.OsPath.Data.ByteString.Short.Word16.isInfixOf is simply a re-export of Data.ByteString.Short.isInfixOf.

So my question still remains: Is this an oversight, or is there some rational behind this.

@hasufell
Copy link
Member

Yeah, I think you're right. We have to respect Word16 boundaries.

@hasufell hasufell added the bug label Jun 10, 2023
@hasufell
Copy link
Member

Ok, two more observations:

  1. This function is not used anywhere by filepath
  2. It's also considered internal, although not marked as such

So yes, this needs to be fixed, but the impact is very likely low.

@hasufell
Copy link
Member

I think consequently, breakSubstring is also busted.

@sol
Copy link
Member Author

sol commented Jun 23, 2023

Yes, exactly.

From when I checked last time, I think only those two are problematic.

@hasufell
Copy link
Member

@Bodigrim I'm not too familiar with the breakSubstring algorithm and from a little tinkering I couldn't figure out how to make it work for Word16 boundaries.

Do you have advice?

https://github.com/haskell/bytestring/blob/dac5675ea9dae639758e21314a49b15fa90e2bd7/Data/ByteString.hs#L1584-L1634

I tried using a proper definition of breakByte, but it seems there's more to it.

@Bodigrim
Copy link
Contributor

You can run Data.ByteString.breakSubstring and check the length of the prefix. If it's even, all good, you are done. If it's odd, slice the input string past the prefix and run Data.ByteString.breakSubstring again.

hasufell added a commit that referenced this issue Jul 2, 2023
hasufell added a commit that referenced this issue Jul 2, 2023
hasufell added a commit that referenced this issue Jul 2, 2023
hasufell added a commit that referenced this issue Jul 3, 2023
hasufell added a commit that referenced this issue Jul 3, 2023
hasufell added a commit that referenced this issue Jul 4, 2023
hasufell added a commit that referenced this issue Jul 11, 2023
@hasufell
Copy link
Member

This is released in https://hackage.haskell.org/package/filepath-1.4.100.4 commit 367f6bf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants