Proper Treatment of Encodings #7

Merged
merged 22 commits into from Aug 29, 2012

Conversation

Projects
None yet
2 participants
Collaborator

adimit commented Aug 27, 2012

Hi Greg,

Here are the finished patches, including documentation updates. The big changes are:

  • Everything uses Data.Text now, including warning and error messages.
  • Data.Text is encoded and decoded according to Configuration and/or ExcerptConfiguration's respective encoding fields.
  • addQueries is no more. I replaced the informal usage of Put with a Query data type that contains the query, index and query comment, which addQuery handled in the prior version. You pass [Query] to runQueries directly.

I hope the changes are OK with you. Note that the package versioning policy requires a major version bump, so this one is 0.6.0.

Regards,
Aleks

adimit added some commits May 1, 2012

@adimit adimit Add encoding option to configuration.
We're also using UTF-8 as default.
230bcb1
@adimit adimit Add Text serialization function txt.
It's similar to the existing str, except that we're using strict
ByteStrings (since text-icu does) and we need a Converter value to do
proper unicode conversion.

str still uses Char8.pack, which isn't ideal.
f6fbc88
@adimit adimit Add the Query ADT, which represents a query.
This will replace the occurences of Put in the Text.Search.Sphinx
module's API.

Since we can't do proper unicode conversion with text-icu outside of the
IO monad (we need to call Text.ICU.Convert.open, to obtain a Converter
from IO) I decided to move the calls to the binary serialization
functions to runQueries', which is in the IO monad.

This also hides that we're using Put behind the scenes, and prevents the
user from just Put'ting some bogus bits at the end of a query.
aad36d2
@adimit adimit Add simpleQuery; it produces lightweight queries. 5b4f2d7
@adimit adimit Properly encode queries with the set encoding.
The addQuery is now serializeQuery and shouldn't be called by the user
anymore. Instead, the user now makes Query objects directly (since
they're trivial) and sends these to runQueries.
bcbb442
@adimit adimit Only export the functions the user needs to see. 0a90bca
@adimit adimit Fix some documentation strings. bd77f1f
@adimit adimit Add dependencies to the cabal file. 4b9f91c
@adimit adimit Use Text in QueryResult, Attr for indexed strings.
Only the strings which may carry encoding sensitive data were changed to
Text. The rest remains lazy ByteString.
b6b4817
@adimit adimit Remove dependency on utf8-string.
Since T.S.S.Indexable was the only module using utf8-string, and since
it's not necessary there anymore to decode ByteString data because it's
stored as Text, we can drop the dependency outright.
4a168fc
@adimit adimit Add documentation note about encodings. c7e228d
@adimit adimit Adapt buildExcerpts to also use Text. ef70d5a
@adimit adimit Bump version.
This is necessary according to the versioning policy, since major parts
of the API changed.
fd73bc8
@adimit adimit Add escapeText function.
We should/will probably remove the escapeString function then…
915de01
@adimit adimit Make error types use Text as well.
Since some errors return part of the query, and might otherwise contain
unicode characters as well, we just treat them as Text data to be
encoded/decoded with the global encoding configuration setting as well.
6429c50
@adimit adimit Update documentation. af6abfe
@adimit adimit Remove escapeString from export list. 6a400e9
@adimit adimit Update the readme for 0.6.0 938060f
@adimit adimit Add note about transition to 0.6.0 3017723
@adimit adimit Merge git://github.com/gregwebs/haskell-sphinx-client 48dfbe4

@gregwebs gregwebs and 1 other commented on an outdated diff Aug 28, 2012

Text/Search/Sphinx.hs
-- however, in normal searching they will all be ignored
-escapeString :: String -> String
-escapeString [] = []
-escapeString (x:xs) = if x `elem` escapedChars
- then '\\':x:escapeString xs
- else x:escapeString xs
+escapeText :: Text -> Text
+escapeText = X.concatMap (\x -> if x `elem` escapedChars
+ then X.pack $ '\\':[x]
+ else X.singleton x)
@gregwebs

gregwebs Aug 28, 2012

Owner

Do you think this might be an improvement in clarity or efficiency?

escapeText = X.intersperse '\\' . X.breakBy  (`elem` escapedChars)
@adimit

adimit Aug 28, 2012

Collaborator

There isn't a function Data.Text.breakBy, however, I think I know what you mean. One can simulate it like this:

breakBy = X.grupBy . const . fmap not

Also, intersperse isn't what you're looking for, intercalate is. (intersperse is Char -> Text -> Text, intercalate is `Text -> [Text] -> Text)

(btw, we cannot use split or splitOn because those delete what they consider to be delimiters from their output.)

You are right that it would be easier to read (I think.) I'll benchmark it real quick.

@adimit

adimit Aug 28, 2012

Collaborator

Well. It seems that in extreme situations the current escapeText is faster, but in more normal situations, it isn't. I've put the results online.

The code I used is here: https://gist.github.com/3496913

Since the intercalate code is easier to read than the concatMap code, I think I'll go for the former.

@gregwebs gregwebs commented on the diff Aug 28, 2012

Text/Search/Sphinx.hs
@@ -89,24 +108,25 @@ connect host port = do
-- | TODO: add configuration options
buildExcerpts :: ExConf.ExcerptConfiguration -- ^ Contains host and port for connection and optional configuration for buildExcerpts
- -> [String] -- ^ list of document contents to be highlighted
- -> String -- ^ The indexes, "*" means every index
- -> String -- ^ The query string to use for excerpts
- -> IO (T.Result [BS.ByteString]) -- ^ the documents with excerpts highlighted
+ -> [Text] -- ^ list of document contents to be highlighted
+ -> String -- ^ The indexes, \"*\" means every index
@gregwebs

gregwebs Aug 28, 2012

Owner

Can we change this to a Text also?

@adimit

adimit Aug 28, 2012

Collaborator

I overlooked that one! Brb, fixin' it…

@adimit

adimit Aug 28, 2012

Collaborator

Ah wait, do you mean the indexes, i.e. that all the indexes are also given as Text? (because I thought I overlooked to change the sig for buildExcerpts but it seems I didn't.)

@gregwebs

gregwebs Aug 28, 2012

Owner

yeah, I meant indexes. Obviously not a big deal either way, but I thought it might make the API more consistent to remove String altogether.

Owner

gregwebs commented Aug 28, 2012

looks great! Only thing I am wondering is if it would have been nicer to try to find a way to do the convert on the entire contents at once rather than many individual conversions.

Owner

gregwebs commented Aug 28, 2012

I know that @luite and @snoyberg are users of this library, they may want to look over

Collaborator

adimit commented Aug 28, 2012

What exactly do you mean by converting the entire contents? Do you mean the encoding/decoding in Get/Put?

I haven't done any benchmarks on the current (or even the previous) stuff.

One thing that sticks out like a sore thumb is the duplication of code between runQueries and query (specifically, the case statements) and the duplication between buildExcerpts and runQueries. However, I don't think it would be (very) easy to fiddle those into one abstract procedure.

oh, I was looking at an older version of Text that actually had breakBy implemented :) I don't think this new version with groupBy would work correctly.
I think this would be more efficient:

escapeText t | t == X.empty = t
             | otherwise =  let (untouched, broken) = X.break (`elem` escapedChars)
                             in untouched `X.append` (`\\` `X.cons` escapeText broken)
Owner

gregwebs commented Aug 28, 2012

Yeah, I mean encoding/decoding the entire Get/Put somehow. Its fine how it is though.

Collaborator

adimit commented Aug 28, 2012

Right, groupBy doesn't handle the case where the character-to-escape is first in the string.
Your version doesn't work either, since X.break doesn't put the character it broke on in the untouched part, but leaves it in the broken part. This will cause an infinite loop.

You will therefore need to use X.take and X.drop in order to adjust the broken part, and also you have one more element to put in the concatenation. Note that X.cons (quite unintuitively) and X.append are O(n) operations which need to copy the entire array. The resulting code benchmarks much worse than the original variant in the extreme case, and a little worse than the original in the more "normal" case, plus, I think it's much less readable. See here and the code.

It seems our best bet is to use the intercalate variant, but with a non-broken breakBy. I'm curious why @bos took it out. I'll try to whip up something.

(jeez, so much work for a puny escaping function. Who knew?)

Collaborator

adimit commented Aug 28, 2012

This is the best I could do for now. It's a hack, I know, but performance-wise it's much better than the alternatives, and I also find it more or less readable.

This doesn't change the fact that Data.Text lacks a single-Char groupBy, or non-destructive split. I actually know how to implement that through Data.Text.Internal, but that would require access to either to a function that is not exported from Data.Text (particularly findAIndexOrEnd) or stuff from Data.Text.Unsafe (specifically, Iter and iter, in order to re-implement the above function.)

I will open a feature request/bug report for text.

Owner

gregwebs commented Aug 29, 2012

From my reading of the benchmark report it seems that there is not a huge difference in general. Sorry for getting you side-tracked. I did find this very interesting though. In the future I will benchmark first!

@gregwebs gregwebs added a commit that referenced this pull request Aug 29, 2012

@gregwebs gregwebs Merge pull request #7 from adimit/master
Proper Treatment of Encodings
0804a0b

@gregwebs gregwebs merged commit 0804a0b into gregwebs:master Aug 29, 2012

Owner

gregwebs commented Aug 29, 2012

Thanks again! I am going to do some testing this weekend before releasing to hackage.

Collaborator

adimit commented Aug 29, 2012

Thanks; my own testing wasn't extensive enough, so I'm glad it's getting some more testing in.

BTW, I haven't migrated the index specifications from String to Text yet, but it's on my to-do list.

Collaborator

adimit commented Sep 1, 2012

Since you've added me to the project, I've taken the liberty to push a patch that migrates index specification from String to Text, as requested earlier. Happy testing :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment