Proper Treatment of Encodings #7

Merged
merged 22 commits into from

2 participants

@adimit
Collaborator

Hi Greg,

Here are the finished patches, including documentation updates. The big changes are:

  • Everything uses Data.Text now, including warning and error messages.
  • Data.Text is encoded and decoded according to Configuration and/or ExcerptConfiguration's respective encoding fields.
  • addQuery is no more. I replaced the informal usage of Put with a Query data type that contains the query, index and query comment, which addQuery handled in the prior version. You pass [Query] to runQueries directly.

I hope the changes are OK with you. Note that the package versioning policy requires a major version bump, so this one is 0.6.0.

Regards,
Aleks
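
For readers following along, the API shape described above can be sketched roughly as follows. This is a stand-alone illustration, not the library source: the record field names (`queryString`, `queryIndexes`, `queryComment`) are assumptions made for this example, and the index field is shown as `Text` even though it was still a `String` when this pull request was opened.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as X

-- Stand-in for the library's Query type; the field names are hypothetical.
data Query = Query
  { queryString  :: Text  -- the search string
  , queryIndexes :: Text  -- indexes to search, "*" means every index
  , queryComment :: Text  -- free-form comment, may be empty
  } deriving (Show, Eq)

-- Mirrors the simpleQuery described in the patch: all indexes, no comment.
simpleQuery :: Text -> Query
simpleQuery q = Query q "*" X.empty

main :: IO ()
main = do
  let everywhere = simpleQuery "haskell"
      -- a record update narrows the query to a single index
      oneIndex   = everywhere { queryIndexes = "db1" }
  mapM_ print [everywhere, oneIndex]
```

A list of such values is what would be handed to `runQueries` in place of the old `[Put]`.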

adimit added some commits
@adimit adimit Add encoding option to configuration.
We're also using UTF-8 as default.
230bcb1
@adimit adimit Add Text serialization function txt.
It's similar to the existing str, except that we're using strict
ByteStrings (since text-icu does) and we need a Converter value to do
proper unicode conversion.

str still uses Char8.pack, which isn't ideal.
f6fbc88
@adimit adimit Add the Query ADT, which represents a query.
This will replace the occurrences of Put in the Text.Search.Sphinx
module's API.

Since we can't do proper unicode conversion with text-icu outside of the
IO monad (we need to call Text.ICU.Convert.open, to obtain a Converter
from IO) I decided to move the calls to the binary serialization
functions to runQueries', which is in the IO monad.

This also hides that we're using Put behind the scenes, and prevents the
user from just Put'ting some bogus bits at the end of a query.
aad36d2
@adimit adimit Add simpleQuery; it produces lightweight queries. 5b4f2d7
@adimit adimit Properly encode queries with the set encoding.
The addQuery is now serializeQuery and shouldn't be called by the user
anymore. Instead, the user now makes Query objects directly (since
they're trivial) and sends these to runQueries.
bcbb442
@adimit adimit Only export the functions the user needs to see. 0a90bca
@adimit adimit Fix some documentation strings. bd77f1f
@adimit adimit Add dependencies to the cabal file. 4b9f91c
@adimit adimit Use Text in QueryResult, Attr for indexed strings.
Only the strings which may carry encoding sensitive data were changed to
Text. The rest remains lazy ByteString.
b6b4817
@adimit adimit Remove dependency on utf8-string.
Since T.S.S.Indexable was the only module using utf8-string, and since
it's not necessary there anymore to decode ByteString data because it's
stored as Text, we can drop the dependency outright.
4a168fc
@adimit adimit Add documentation note about encodings. c7e228d
@adimit adimit Adapt buildExcerpts to also use Text. ef70d5a
@adimit adimit Bump version.
This is necessary according to the versioning policy, since major parts
of the API changed.
fd73bc8
@adimit adimit Add escapeText function.
We should/will probably remove the escapeString function then…
915de01
@adimit adimit Make error types use Text as well.
Since some errors return part of the query, and might otherwise contain
unicode characters as well, we just treat them as Text data to be
encoded/decoded with the global encoding configuration setting as well.
6429c50
@adimit adimit Update documentation. af6abfe
@adimit adimit Remove escapeString from export list. 6a400e9
@adimit adimit Update the readme for 0.6.0 938060f
@adimit adimit Add note about transition to 0.6.0 3017723
@adimit adimit Merge git://github.com/gregwebs/haskell-sphinx-client 48dfbe4
Text/Search/Sphinx.hs
((8 lines not shown))
-- however, in normal searching they will all be ignored
-escapeString :: String -> String
-escapeString [] = []
-escapeString (x:xs) = if x `elem` escapedChars
- then '\\':x:escapeString xs
- else x:escapeString xs
+escapeText :: Text -> Text
+escapeText = X.concatMap (\x -> if x `elem` escapedChars
+ then X.pack $ '\\':[x]
+ else X.singleton x)
@gregwebs Owner

Do you think this might be an improvement in clarity or efficiency?

escapeText = X.intersperse '\\' . X.breakBy  (`elem` escapedChars)
@adimit Collaborator
adimit added a note

There isn't a function Data.Text.breakBy, however, I think I know what you mean. One can simulate it like this:

breakBy = X.groupBy . const . fmap not

Also, intersperse isn't what you're looking for, intercalate is. (intersperse is Char -> Text -> Text, intercalate is Text -> [Text] -> Text.)

(btw, we cannot use split or splitOn because those delete what they consider to be delimiters from their output.)

You are right that it would be easier to read (I think.) I'll benchmark it real quick.

@adimit Collaborator
adimit added a note

Well. It seems that in extreme situations the current escapeText is faster, but in more normal situations, it isn't. I've put the results online.

The code I used is here: https://gist.github.com/3496913

Since the intercalate code is easier to read than the concatMap code, I think I'll go for the former.
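
To make the comparison concrete, here is a small self-contained sketch of the two variants under discussion: a `concatMap` version and an `intercalate` version built on a hand-written `breakBy`. The `breakBy` below is written for illustration, handles a leading special character explicitly, and may differ from the code in the gist; the names `escapeConcat` and `escapeIntercalate` are made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as X

escapedChars :: String
escapedChars = '"':'\\':"-!@~/()*[]="

-- Variant 1: map every character to either itself or its escaped form.
escapeConcat :: Text -> Text
escapeConcat = X.concatMap escape
  where
    escape c
      | c `elem` escapedChars = X.pack ['\\', c]
      | otherwise             = X.singleton c

-- Variant 2: split in front of every special character (keeping it),
-- then join the pieces with a backslash.
escapeIntercalate :: Text -> Text
escapeIntercalate = X.intercalate "\\" . breakBy (`elem` escapedChars)
  where
    breakBy p t
      | X.null t     = [X.empty]
      -- a leading special character needs an empty first piece
      | p (X.head t) = X.empty : pieces
      | otherwise    = pieces
      where pieces = X.groupBy (\_ c -> not (p c)) t

main :: IO ()
main = mapM_ check ["", "plain", "a-b", "-ab", "a--b", "(x)=y"]
  where
    -- print the input, its escaped form, and whether both variants agree
    check t = print (t, escapeConcat t, escapeConcat t == escapeIntercalate t)
```

Both functions should agree on every input; the difference between them is only in readability and in how much intermediate `Text` they allocate.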

@gregwebs gregwebs commented on the diff
Text/Search/Sphinx.hs
@@ -89,24 +108,25 @@ connect host port = do
-- | TODO: add configuration options
buildExcerpts :: ExConf.ExcerptConfiguration -- ^ Contains host and port for connection and optional configuration for buildExcerpts
- -> [String] -- ^ list of document contents to be highlighted
- -> String -- ^ The indexes, "*" means every index
- -> String -- ^ The query string to use for excerpts
- -> IO (T.Result [BS.ByteString]) -- ^ the documents with excerpts highlighted
+ -> [Text] -- ^ list of document contents to be highlighted
+ -> String -- ^ The indexes, \"*\" means every index
@gregwebs Owner

Can we change this to a Text also?

@adimit Collaborator
adimit added a note

I overlooked that one! Brb, fixin' it…

@adimit Collaborator
adimit added a note

Ah wait, do you mean the indexes, i.e. that all the indexes are also given as Text? (because I thought I overlooked to change the sig for buildExcerpts but it seems I didn't.)

@gregwebs Owner

yeah, I meant indexes. Obviously not a big deal either way, but I thought it might make the API more consistent to remove String altogether.

@gregwebs
Owner

looks great! Only thing I am wondering is if it would have been nicer to try to find a way to do the convert on the entire contents at once rather than many individual conversions.

@gregwebs
Owner

I know that @luite and @snoyberg are users of this library; they may want to look this over.

@adimit
Collaborator

What exactly do you mean by converting the entire contents? Do you mean the encoding/decoding in Get/Put?

I haven't done any benchmarks on the current (or even the previous) stuff.

One thing that sticks out like a sore thumb is the duplication of code between runQueries and query (specifically, the case statements) and the duplication between buildExcerpts and runQueries. However, I don't think it would be (very) easy to fiddle those into one abstract procedure.

@gregwebs
Owner

oh, I was looking at an older version of Text that actually had breakBy implemented :) I don't think this new version with groupBy would work correctly.
I think this would be more efficient:

escapeText t | t == X.empty = t
             | otherwise =  let (untouched, broken) = X.break (`elem` escapedChars) t
                             in untouched `X.append` ('\\' `X.cons` escapeText broken)
@gregwebs
Owner

Yeah, I mean encoding/decoding the entire Get/Put somehow. It's fine how it is though.

@adimit
Collaborator

Right, groupBy doesn't handle the case where the character-to-escape is first in the string.
Your version doesn't work either, since X.break doesn't put the character it broke on in the untouched part, but leaves it in the broken part. This will cause an infinite loop.

You will therefore need to use X.take and X.drop in order to adjust the broken part, and also you have one more element to put in the concatenation. Note that X.cons (quite unintuitively) and X.append are O(n) operations which need to copy the entire array. The resulting code benchmarks much worse than the original variant in the extreme case, and a little worse than the original in the more "normal" case, plus, I think it's much less readable. See here and the code.

It seems our best bet is to use the intercalate variant, but with a non-broken breakBy. I'm curious why @bos took it out. I'll try to whip up something.

(jeez, so much work for a puny escaping function. Who knew?)
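
For reference, the `X.break`-based idea can be made to terminate by consuming the special character explicitly before recursing. The sketch below uses `X.uncons` rather than `X.take`/`X.drop`; it is an illustration written for this discussion, not the variant that was benchmarked, and it still pays for the `X.cons`/`X.append` copies mentioned above. The name `escapeBreak` is made up here.

```haskell
import Data.Text (Text)
import qualified Data.Text as X

escapedChars :: String
escapedChars = '"':'\\':"-!@~/()*[]="

escapeBreak :: Text -> Text
escapeBreak t =
  case X.uncons rest of
    -- no special character left: nothing more to escape
    Nothing         -> untouched
    -- c is the special character X.break stopped at; emit a backslash and c,
    -- then recurse on what follows, so the input shrinks on every step
    Just (c, rest') ->
      untouched `X.append` X.cons '\\' (X.cons c (escapeBreak rest'))
  where
    (untouched, rest) = X.break (`elem` escapedChars) t

main :: IO ()
main = mapM_ (print . escapeBreak . X.pack) ["", "plain", "a-b", "-ab", "a--b"]
```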

@adimit
Collaborator

This is the best I could do for now. It's a hack, I know, but performance-wise it's much better than the alternatives, and I also find it more or less readable.

This doesn't change the fact that Data.Text lacks a single-Char groupBy, or non-destructive split. I actually know how to implement that through Data.Text.Internal, but that would require access either to a function that is not exported from Data.Text (particularly findAIndexOrEnd) or to stuff from Data.Text.Unsafe (specifically, Iter and iter, in order to re-implement the above function.)

I will open a feature request/bug report for text.

@gregwebs
Owner

From my reading of the benchmark report it seems that there is not a huge difference in general. Sorry for getting you side-tracked. I did find this very interesting though. In the future I will benchmark first!

@gregwebs gregwebs merged commit 0804a0b into from
@gregwebs
Owner

Thanks again! I am going to do some testing this weekend before releasing to hackage.

@adimit
Collaborator

Thanks; my own testing wasn't extensive enough, so I'm glad it's getting some more testing in.

BTW, I haven't migrated the index specifications from String to Text yet, but it's on my to-do list.

@adimit
Collaborator

Since you've added me to the project, I've taken the liberty to push a patch that migrates index specification from String to Text, as requested earlier. Happy testing :-)

Commits on May 1, 2012
  1. @adimit

    Add encoding option to configuration.

    adimit authored
    We're also using UTF-8 as default.
  2. @adimit

    Add Text serialization function txt.

    adimit authored
    It's similar to the existing str, except that we're using strict
    ByteStrings (since text-icu does) and we need a Converter value to do
    proper unicode conversion.
    
    str still uses Char8.pack, which isn't ideal.
  3. @adimit

    Add the Query ADT, which represents a query.

    adimit authored
    This will replace the occurrences of Put in the Text.Search.Sphinx
    module's API.
    
    Since we can't do proper unicode conversion with text-icu outside of the
    IO monad (we need to call Text.ICU.Convert.open, to obtain a Converter
    from IO) I decided to move the calls to the binary serialization
    functions to runQueries', which is in the IO monad.
    
    This also hides that we're using Put behind the scenes, and prevents the
    user from just Put'ting some bogus bits at the end of a query.
  4. @adimit
  5. @adimit

    Properly encode queries with the set encoding.

    adimit authored
    The addQuery is now serializeQuery and shouldn't be called by the user
    anymore. Instead, the user now makes Query objects directly (since
    they're trivial) and sends these to runQueries.
  6. @adimit
  7. @adimit
  8. @adimit
Commits on May 2, 2012
  1. @adimit

    Use Text in QueryResult, Attr for indexed strings.

    adimit authored
    Only the strings which may carry encoding sensitive data were changed to
    Text. The rest remains lazy ByteString.
  2. @adimit

    Remove dependency on utf8-string.

    adimit authored
    Since T.S.S.Indexable was the only module using utf8-string, and since
    it's not necessary there anymore to decode ByteString data because it's
    stored as Text, we can drop the dependency outright.
  3. @adimit
Commits on May 5, 2012
  1. @adimit
Commits on Aug 22, 2012
  1. @adimit

    Bump version.

    adimit authored
    This is necessary according to the versioning policy, since major parts
    of the API changed.
Commits on Aug 26, 2012
  1. @adimit

    Add escapeText function.

    adimit authored
    We should/will probably remove the escapeString function then…
  2. @adimit

    Make error types use Text as well.

    adimit authored
    Since some errors return part of the query, and might otherwise contain
    unicode characters as well, we just treat them as Text data to be
    encoded/decoded with the global encoding configuration setting as well.
  3. @adimit

    Update documentation.

    adimit authored
  4. @adimit
Commits on Aug 27, 2012
  1. @adimit

    Update the readme for 0.6.0

    adimit authored
  2. @adimit
  3. @adimit
Commits on Aug 28, 2012
  1. @adimit

    Simplify escapeText.

    adimit authored
  2. @adimit

    Fix breakBy.

    adimit authored
62 README.md
@@ -6,30 +6,53 @@ Version 0.5 is Compatible with sphinx version 2.0-beta, but you can pass the ver
# Usage
-`query` executes a single query.
-`runQueries` executes multiple queries at once that were created them with `addQuery`
+## Constructing Queries
-In extended mode you may want to escape special query characters with `escapeString`
+The data type `Query` is used to represent queries to the server. It specifies
+a search string and the indexes to run the query on, as well as a comment,
+which may be the empty string. In order to run a query on all indexes, use
+`"*"` in the index field.
-`buildExcerpts` creates highlighted excerpts
+The convenience function `query` executes a single query and constructs the
+`Query` by itself, so you don't have to.
-You will probably need to import the types also:
+To execute more than one `Query`, use `runQueries`. Details are below in the
+section [*Batch Queries*](#batch-queries). To construct simple queries, you can
+also use `simpleQuery :: Text -> Query` which constructs a `Query` over all
+indexes. Don't forget that you can use record updates on a `Query`.
+
+In extended mode you may want to escape special query characters with `escapeText`.
+
+All interaction with the server, including sending queries and receiving
+results, is based on the `Data.Text` string type. You might therefore want to
+enable the `OverloadedStrings` pragma.
+
+## Excerpts and XML Indexes
+
+`buildExcerpts` creates highlighted excerpts.
+
+You will probably need to import the types as well:
import qualified Text.Search.Sphinx as Sphinx
import qualified Text.Search.Sphinx.Types as SphinxT
-There is also an `Indexable` module for generating an xml file of data to be indexed
+There is also an `Indexable` module for generating an xml file of data to be indexed.
-## runQueries helpers
+## Batch Queries
-`runQueries` pipelines multiple queries together. If you are trying to combine the results, there are some helpers such as `maybeQueries` and `resultsToMatches`
+You can send more than one query per request to the server (which may enable
+server-side query optimization in certain cases. Refer to the
+[Sphinx manual](http://sphinxsearch.com/docs/2.0.4/api-func-addquery.html)
+for details.) The function `runQueries` pipelines multiple queries together. If you
+are trying to combine the results, there are some helpers such as
+`maybeQueries` and `resultsToMatches`.
~~~~~~ {.haskell}
mr <- Sphinx.maybeQueries sphinxLogger sphinxConfig [
- addQuery "db1" query1
- , addQuery "db2" query1
- , addQuery "db1" query2
- , addQuery "db2" query2
+ SphinxT.Query query1 "db1" ""
+ , SphinxT.Query query1 "db2" ""
+ , SphinxT.Query query2 "db1" ""
+ , SphinxT.Query query2 "db2" ""
]
case mr of
Nothing -> return Nothing
@@ -40,8 +63,23 @@ There is also an `Indexable` module for generating an xml file of data to be ind
else return $ Just combined
~~~~~~
+**A note** for those transitioning from `0.5.*` to `0.6`: the function `addQuery`
+has been removed. You can now directly send a list of `Query` to the server by using
+`runQueries`, which will handle the serialization for you behind the scenes.
+
+## Encoding
+The sphinx server itself does not know about encodings except for the
+difference between single-byte encodings and multi-byte encodings. It assumes
+that all incoming queries are already properly encoded and matches the raw
+bytes it receives; the same holds for the results returned by the server. Hence
+the responsibility for using the proper encoding (and decoding) routines lies
+with the caller.
+Version 0.6.0 of `haskell-sphinx-client` introduces the `encoding` field in
+both the `Configuration` data type and the `ExcerptConfiguration` data type.
+The library handles proper encoding and decoding in the background; just
+make sure you set the right encoding setting in the configuration!
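
Why the encoding setting matters can be illustrated without a Sphinx server. The sketch below uses `Data.Text.Encoding` from the `text` package instead of `text-icu`, purely so that it runs with a stock GHC; it only covers UTF-8 and Latin-1 and is a stand-in for what the library does through an ICU `Converter`. The helper `encodeLatin1` is hypothetical and assumes every character is below code point 256.

```haskell
import qualified Data.ByteString as BS
import           Data.Char (ord)
import           Data.Text (Text)
import qualified Data.Text as X
import           Data.Text.Encoding (decodeUtf8With, encodeUtf8)
import           Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper: encode as Latin-1, assuming every character is < 256.
encodeLatin1 :: Text -> BS.ByteString
encodeLatin1 = BS.pack . map (fromIntegral . ord) . X.unpack

main :: IO ()
main = do
  let word = X.pack "caf\233"           -- "café"
  -- The same text yields different raw bytes under different encodings,
  -- and the server only ever compares raw bytes.
  print (BS.unpack (encodeUtf8 word))
  print (BS.unpack (encodeLatin1 word))
  -- Reading Latin-1 bytes as UTF-8: the lone 0xE9 byte is not valid UTF-8
  -- and is replaced by U+FFFD instead of raising an error.
  print (decodeUtf8With lenientDecode (encodeLatin1 word))
```

A query encoded one way will therefore not match an index built from bytes in the other encoding, which is the mismatch the `encoding` field is meant to prevent.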
Details
=======
147 Text/Search/Sphinx.hs
@@ -5,13 +5,22 @@
-- setFilterFloatRange, setGeoAnchor
-- resetFilters, resetGroupBy
-- updateAttributes,
--- buildKeyWords, escapeString, status, open, close
-module Text.Search.Sphinx ( module Text.Search.Sphinx
+-- buildKeyWords, status, open, close
+module Text.Search.Sphinx
+ ( escapeText
+ , query
+ , buildExcerpts
+ , runQueries
+ , runQueries'
+ , resultsToMatches
+ , maybeQueries
+ , T.Query(..), simpleQuery
, Configuration(..), defaultConfig
) where
import qualified Text.Search.Sphinx.Types as T (
Match,
+ Query(..),
VerCommand(VcSearch, VcExcerpt),
SearchdCommand(ScSearch, ScExcerpt),
Filter, Filter(..),
@@ -21,9 +30,9 @@ import qualified Text.Search.Sphinx.Types as T (
import Text.Search.Sphinx.Configuration (Configuration(..), defaultConfig)
import qualified Text.Search.Sphinx.ExcerptConfiguration as ExConf (ExcerptConfiguration(..))
-import Text.Search.Sphinx.Get (times, getResult, readHeader, getStr)
+import Text.Search.Sphinx.Get (times, getResult, readHeader, getStr, getTxt)
import Text.Search.Sphinx.Put (num, num64, enum, list, numC, strC, foldPuts,
- numC64, stringIntList, str, cmd, verCmd)
+ numC64, stringIntList, str, txt, cmd, verCmd)
import Data.Binary.Put (Put, runPut)
import Data.Binary.Get (runGet, getWord32be)
@@ -38,6 +47,10 @@ import Data.Bits ((.|.))
import Prelude hiding (filter, tail)
import Data.List (nub)
+import Data.Text (Text)
+import qualified Data.Text as X
+import qualified Data.Text.ICU.Convert as ICU
+
{- the funnest way to debug this is to run the same query with an existing working client and look at the difference
- sudo tcpflow -i lo dst port 9306
import Debug.Trace; debug a = trace (show a) a
@@ -46,23 +59,22 @@ import Debug.Trace; debug a = trace (show a) a
escapedChars :: String
escapedChars = '"':'\\':"-!@~/()*[]="
--- | escape all possible meta characters.
--- most of these characters only need to be escaped in certain contexts
+-- | Escape all possible meta characters.
+-- Most of these characters only need to be escaped in certain contexts
-- however, in normal searching they will all be ignored
-escapeString :: String -> String
-escapeString [] = []
-escapeString (x:xs) = if x `elem` escapedChars
- then '\\':x:escapeString xs
- else x:escapeString xs
+escapeText :: Text -> Text
+escapeText = X.intercalate "\\" . breakBy (`elem` escapedChars)
+ where breakBy p t | X.null t = [X.empty]
+ | otherwise = (if p $ X.head t then ("":) else id) $ X.groupBy (\_ x -> not $ p x) t
-- | The 'query' function runs a single query against the Sphinx daemon.
--- To pipeline multiple queries in a batch, use addQuery and runQueries
+-- To pipeline multiple queries in a batch, use 'runQueries'.
query :: Configuration -- ^ The configuration
- -> String -- ^ The indexes, "*" means every index
- -> String -- ^ The query string
+ -> String -- ^ The indexes, \"*\" means every index
+ -> Text -- ^ The query string
-> IO (T.Result T.QueryResult) -- ^ just one search result back
query config indexes search = do
- let q = addQuery config search indexes ""
+ let q = T.Query search indexes X.empty
results <- runQueries' config [q]
-- same as toSearchResult, but we know there is just one query
-- could just remove and use runQueries in the future
@@ -75,9 +87,16 @@ query config indexes search = do
T.Retry retry -> T.Retry retry
T.Warning warning (result:results) -> case result of
T.QueryOk result -> T.Warning warning result
- T.QueryWarning w result -> T.Warning (BS.append warning w) result
+ T.QueryWarning w result -> T.Warning (X.append warning w) result
T.QueryError code e -> T.Error code e
+-- | This is a convenience function which accepts a search string and
+-- builds a query for that string over all indexes without attaching
+-- comments to the queries.
+simpleQuery :: Text -- ^ The query string
+ -> T.Query -- ^ A query value that can be sent to 'runQueries'
+simpleQuery q = T.Query q "*" X.empty
+
connect :: String -> Int -> IO Handle
connect host port = do
connection <- connectTo host (PortNumber $ fromIntegral $ port)
@@ -89,24 +108,25 @@ connect host port = do
-- | TODO: add configuration options
buildExcerpts :: ExConf.ExcerptConfiguration -- ^ Contains host and port for connection and optional configuration for buildExcerpts
- -> [String] -- ^ list of document contents to be highlighted
- -> String -- ^ The indexes, "*" means every index
- -> String -- ^ The query string to use for excerpts
- -> IO (T.Result [BS.ByteString]) -- ^ the documents with excerpts highlighted
+ -> [Text] -- ^ list of document contents to be highlighted
+ -> String -- ^ The indexes, \"*\" means every index
+ -> Text -- ^ The query string to use for excerpts
+ -> IO (T.Result [Text]) -- ^ the documents with excerpts highlighted
buildExcerpts config docs indexes words = do
conn <- connect (ExConf.host config) (ExConf.port config)
- let req = runPut $ makeBuildExcerpt addExcerpt
+ conv <- ICU.open (ExConf.encoding config) Nothing
+ let req = runPut $ makeBuildExcerpt (addExcerpt conv)
BS.hPut conn req
hFlush conn
(status, response) <- getResponse conn
case status of
- T.OK -> return $ T.Ok (getResults response)
- T.WARNING -> return $ T.Warning (runGet getStr response) (getResults response)
- T.RETRY -> return $ T.Retry (errorMessage response)
- T.ERROR n -> return $ T.Error n (errorMessage response)
+ T.OK -> return $ T.Ok (getResults response conv)
+ T.WARNING -> return $ T.Warning (runGet (getTxt conv) response) (getResults response conv)
+ T.RETRY -> return $ T.Retry (errorMessage conv response)
+ T.ERROR n -> return $ T.Error n (errorMessage conv response)
where
- getResults response = runGet ((length docs) `times` getStr) response
- errorMessage response = BS.tail (BS.tail (BS.tail (BS.tail response)))
+ getResults response conv = runGet ((length docs) `times` getTxt conv) response
+ errorMessage conv response = runGet (getTxt conv) (BS.drop 4 response)
makeBuildExcerpt putExcerpt = do
cmd T.ScExcerpt
@@ -114,19 +134,19 @@ buildExcerpts config docs indexes words = do
num $ fromEnum $ BS.length (runPut putExcerpt)
putExcerpt
- addExcerpt :: Put
- addExcerpt = do
+ addExcerpt :: ICU.Converter -> Put
+ addExcerpt conv = do
num 0 -- mode
num $ excerptFlags config
str indexes
- str words
+ txt conv words
strC config [ExConf.beforeMatch, ExConf.afterMatch, ExConf.chunkSeparator]
numC config [ExConf.limit, ExConf.around, ExConf.limitPassages, ExConf.limitWords, ExConf.startPassageId]
str $ ExConf.htmlStripMode config
#ifndef ONE_ONE_BETA
str $ ExConf.passageBoundary config
#endif
- list str docs
+ list (txt conv) docs
modeFlag :: ExConf.ExcerptConfiguration -> (ExConf.ExcerptConfiguration -> Bool) -> Int -> Int
modeFlag cfg setting value = if setting cfg then value else 0
@@ -144,10 +164,10 @@ buildExcerpts config docs indexes words = do
])
--- | Use with addQuery to pipeline multiple queries.
+-- | Make multiple queries at once, using a list of 'T.Query'.
-- For a single query, just use the query method
-- Easier handling of query result than runQueries'
-runQueries :: Configuration -> [Put] -> IO (T.Result [T.QueryResult])
+runQueries :: Configuration -> [T.Query] -> IO (T.Result [T.QueryResult])
runQueries cfg qs = runQueries' cfg qs >>= return . toSearchResult
where
-- with batched queries, each query can have an error code,
@@ -161,39 +181,39 @@ runQueries cfg qs = runQueries' cfg qs >>= return . toSearchResult
toSearchResult :: T.Result [T.SingleResult] -> T.Result [T.QueryResult]
toSearchResult results =
case results of
- T.Ok rs -> fromOk rs [] BS.empty
+ T.Ok rs -> fromOk rs [] X.empty
T.Warning warning rs -> fromWarn warning rs []
T.Retry retry -> T.Retry retry
T.Error code error -> T.Error code error
where
- fromOk :: [T.SingleResult] -> [T.QueryResult] -> BS.ByteString -> T.Result [T.QueryResult]
- fromOk [] acc warn | BS.null warn = T.Ok acc
+ fromOk :: [T.SingleResult] -> [T.QueryResult] -> Text -> T.Result [T.QueryResult]
+ fromOk [] acc warn | X.null warn = T.Ok acc
| otherwise = T.Warning warn acc
fromOk (r:rs) acc warn = case r of
T.QueryOk result -> fromOk rs (acc ++ [result]) warn
- T.QueryWarning w result -> fromOk rs (acc ++ [result]) (BS.append warn w)
+ T.QueryWarning w result -> fromOk rs (acc ++ [result]) (X.append warn w)
T.QueryError code e -> T.Error code e
- fromWarn :: BS.ByteString -> [T.SingleResult] -> [T.QueryResult] -> T.Result [T.QueryResult]
+ fromWarn :: Text -> [T.SingleResult] -> [T.QueryResult] -> T.Result [T.QueryResult]
fromWarn warning [] acc = T.Warning warning acc
fromWarn warning (r:rs) acc = case r of
T.QueryOk result -> fromWarn warning rs (result:acc)
- T.QueryWarning w result -> fromWarn (BS.append warning w) rs (result:acc)
+ T.QueryWarning w result -> fromWarn (X.append warning w) rs (result:acc)
T.QueryError code e -> T.Error code e
--- | lower level- called by 'runQueries'
--- | This may be useful for debugging problems- warning messages won't get compressed
-runQueries' :: Configuration -> [Put] -> IO (T.Result [T.SingleResult])
+-- | Lower level- called by 'runQueries'.
+-- This may be useful for debugging problems- warning messages won't get compressed
+runQueries' :: Configuration -> [T.Query] -> IO (T.Result [T.SingleResult])
runQueries' config qs = do
conn <- connect (host config) (port config)
- BS.hPut conn request
+ conv <- ICU.open (encoding config) Nothing
+ let queryReq = foldPuts $ map (serializeQuery config conv) qs
+ BS.hPut conn (request queryReq)
hFlush conn
- getSearchResult conn
+ getSearchResult conn conv
where
numQueries = length qs
- queryReq = foldPuts qs
-
- request = runPut $ do
+ request qr = runPut $ do
cmd T.ScSearch
verCmd T.VcSearch
num $
@@ -202,24 +222,24 @@ runQueries' config qs = do
#else
8
#endif
- + (fromEnum $ BS.length (runPut queryReq))
+ + (fromEnum $ BS.length (runPut qr))
#ifndef ONE_ONE_BETA
num 0
#endif
num numQueries
- queryReq
+ qr
- getSearchResult :: Handle -> IO (T.Result [T.SingleResult])
- getSearchResult conn = do
+ getSearchResult :: Handle -> ICU.Converter -> IO (T.Result [T.SingleResult])
+ getSearchResult conn conv = do
(status, response) <- getResponse conn
case status of
- T.OK -> return $ T.Ok (getResults response)
- T.WARNING -> return $ T.Warning (runGet getStr response) (getResults response)
- T.RETRY -> return $ T.Retry (errorMessage response)
- T.ERROR n -> return $ T.Error n (errorMessage response)
+ T.OK -> return $ T.Ok (getResults response conv)
+ T.WARNING -> return $ T.Warning (runGet (getTxt conv) response) (getResults response conv)
+ T.RETRY -> return $ T.Retry (errorMessage conv response)
+ T.ERROR n -> return $ T.Error n (errorMessage conv response)
where
- getResults response = runGet (numQueries `times` getResult) response
- errorMessage response = BS.tail (BS.tail (BS.tail (BS.tail response)))
+ getResults response conv = runGet (numQueries `times` getResult conv) response
+ errorMessage conv response = runGet (getTxt conv) (BS.drop 4 response)
-- | Combine results from 'runQueries' into matches.
@@ -237,7 +257,7 @@ resultsToMatches maxResults = combine
-- | executes 'runQueries'. Log warning and errors, automatically retry.
-- Return a Nothing on error, otherwise a Just.
-maybeQueries :: (BS.ByteString -> IO ()) -> Configuration -> [Put] -> IO (Maybe [T.QueryResult])
+maybeQueries :: (Text -> IO ()) -> Configuration -> [T.Query] -> IO (Maybe [T.QueryResult])
maybeQueries logCallback conf queries = do
result <- runQueries conf queries
case result of
@@ -245,9 +265,8 @@ maybeQueries logCallback conf queries = do
T.Retry msg -> logCallback msg >> maybeQueries logCallback conf queries
T.Warning w r -> logCallback w >> return (Just r)
T.Error code msg ->
- logCallback (BS.concat ["Error code ",BS8.pack $ show code,". ",msg]) >> return Nothing
+ logCallback (X.concat ["Error code ",X.pack $ show code,". ",msg]) >> return Nothing
--- | TODO: hide this function
getResponse :: Handle -> IO (T.Status, BS.ByteString)
getResponse conn = do
header <- BS.hGet conn 8
@@ -259,15 +278,15 @@ getResponse conn = do
return (status, response)
-- | use with runQueries to pipeline a batch of queries
-addQuery :: Configuration -> String -> String -> String -> Put
-addQuery cfg query indexes comment = do
+serializeQuery :: Configuration -> ICU.Converter -> T.Query -> Put
+serializeQuery cfg conv (T.Query qry indexes comment) = do
numC cfg [ offset
, limit
, fromEnum . mode
, fromEnum . ranker
, fromEnum . sort]
str (sortBy cfg)
- str query
+ txt conv qry
list num (weights cfg)
str indexes
num 1 -- id64 range marker
@@ -287,7 +306,7 @@ addQuery cfg query indexes comment = do
stringIntList (indexWeights cfg)
num (maxQueryTime cfg)
stringIntList (fieldWeights cfg)
- str comment
+ txt conv comment
num 0 -- attribute overrides (none)
str (selectClause cfg) -- select-list
where
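
The `txt conv` calls in `serializeQuery` put strings on the wire; judging from `getStrStr` in the `Get.hs` diff, each string is presumably a 32-bit big-endian length followed by the encoded bytes. A rough round-trip sketch of that framing, assuming UTF-8 via `Data.Text.Encoding` in place of the ICU `Converter`, and with `putTxt` as a made-up stand-in for `txt`, could look like this:

```haskell
import           Data.Binary.Get (Get, getByteString, getWord32be, runGet)
import           Data.Binary.Put (Put, putByteString, putWord32be, runPut)
import qualified Data.ByteString as BS
import           Data.Text (Text)
import qualified Data.Text as X
import           Data.Text.Encoding (decodeUtf8, encodeUtf8)

-- Hypothetical stand-in for txt: length prefix, then the encoded bytes.
putTxt :: Text -> Put
putTxt t = do
  let bytes = encodeUtf8 t
  putWord32be (fromIntegral (BS.length bytes))  -- length in bytes, not characters
  putByteString bytes

-- Counterpart on the reading side: read the length, then decode that many bytes.
getTxt :: Get Text
getTxt = do
  len <- getWord32be
  decodeUtf8 <$> getByteString (fromIntegral len)

main :: IO ()
main = do
  let original = X.pack "na\239ve query"
      wire     = runPut (putTxt original)
  print wire
  print (runGet getTxt wire == original)
```

Note that the prefix counts encoded bytes, so the conversion has to happen before the length is written; that is one reason the converter is threaded through the `Put` helpers rather than applied afterwards.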
20 Text/Search/Sphinx/Configuration.hs
@@ -3,11 +3,28 @@ module Text.Search.Sphinx.Configuration where
import qualified Text.Search.Sphinx.Types as T
-- | The configuration for a query
+--
+-- A note about encodings: The encoding specified here is used to encode
+-- every @Text@ value that is sent to the server, and it is used to decode all
+-- of the server's answers, including error messages.
+--
+-- If the specified encoding doesn't support characters sent to the server,
+-- they will silently be substituted with the byte value of @\'\\SUB\' ::
+-- 'Char'@ before transmission.
+--
+-- If the server sends a byte value back that the encoding doesn't understand,
+-- the affected bytes will be converted into special values as
+-- specified by that encoding. For example, when decoding invalid UTF-8,
+-- all invalid bytes are going to be substituted with @\'\\65533\' ::
+-- 'Char'@.
+--
data Configuration = Configuration {
-- | The hostname of the Sphinx daemon
host :: String
-- | The portnumber of the Sphinx daemon
, port :: Int
+ -- | Encoding used to encode queries to the server, and decode server responses
+ , encoding :: String
-- | Per-field weights
, weights :: [Int]
-- | How many records to seek from result-set start (default is 0)
@@ -50,7 +67,7 @@ data Configuration = Configuration {
, maxQueryTime :: Int
-- | Per-field-name weights
, fieldWeights :: [(String, Int)]
- -- | attributes to select, defaults to '*'
+ -- | attributes to select, defaults to \"*\"
, selectClause :: String -- setSelect in regular API
}
deriving (Show)
@@ -59,6 +76,7 @@ data Configuration = Configuration {
defaultConfig = Configuration {
port = 3312
, host = "127.0.0.1"
+ , encoding = "UTF-8"
, weights = []
, offset = 0
, limit = 20
3  Text/Search/Sphinx/ExcerptConfiguration.hs
@@ -7,6 +7,8 @@ data ExcerptConfiguration = ExcerptConfiguration {
host :: String
-- | The portnumber of the Sphinx daemon
, port :: Int
+ -- | Encoding used to encode queries to the server, and decode server responses
+ , encoding :: String
, beforeMatch :: String
, afterMatch :: String
, chunkSeparator :: String
@@ -33,6 +35,7 @@ data ExcerptConfiguration = ExcerptConfiguration {
defaultConfig = ExcerptConfiguration {
port = 3312
, host = "127.0.0.1"
+ , encoding = "utf8"
, beforeMatch = "<b>"
, afterMatch = "</b>"
, chunkSeparator = "..."
31 Text/Search/Sphinx/Get.hs
@@ -10,6 +10,8 @@ import Control.Monad
import qualified Text.Search.Sphinx.Types as T
import Data.Maybe (isJust, fromJust)
+import qualified Data.Text.ICU.Convert as ICU
+
-- Utility functions
getNum :: Get Int
getNum = getWord32be >>= return . fromEnum
@@ -24,17 +26,23 @@ readList f = do num <- getNum
num `times` f
times = replicateM
+getTxt conv = liftM (ICU.toUnicode conv) getStrStr
+
getStr = do len <- getNum
getLazyByteString (fromIntegral len)
-getResult :: Get (T.SingleResult)
-getResult = do
+-- Get a strict 'ByteString'.
+getStrStr = do len <- getNum
+ getByteString (fromIntegral len)
+
+getResult :: ICU.Converter -> Get (T.SingleResult)
+getResult conv = do
statusNum <- getNum
case T.toQueryStatus statusNum of
- T.QueryERROR n -> do e <- getStr
+ T.QueryERROR n -> do e <- getTxt conv
return $ T.QueryError statusNum e
T.QueryOK -> getResultOk >>= return . T.QueryOk
- T.QueryWARNING -> do w <- getStr
+ T.QueryWARNING -> do w <- getTxt conv
getResultOk >>= return . (T.QueryWarning w)
where
getResultOk = do
@@ -42,17 +50,18 @@ getResult = do
attrs <- readList readAttrPair
matchCount <- getNum
id64 <- getNum
- matches <- matchCount `times` readMatch (id64 > 0) (map snd attrs)
+ matches <- matchCount `times` readMatch (id64 > 0) (map snd attrs) conv
[total, totalFound, time, numWords] <- 4 `times` getNum
- wrds <- numWords `times` readWord
+ wrds <- numWords `times` readWord conv
return $ T.QueryResult matches total totalFound wrds (map fst attrs)
-readWord = do s <- getStr
- [doc, hits] <- 2 `times` getNum
- return (s, doc, hits)
+readWord conv = do
+ s <- getStrStr
+ [doc, hits] <- 2 `times` getNum
+ return (ICU.toUnicode conv s, doc, hits)
-readMatch isId64 attrs = do
+readMatch isId64 attrs conv = do
doc <- if isId64 then getNum64 else (getNum >>= return . fromIntegral)
weight <- getNum
matchAttrs <- mapM readAttr attrs
@@ -60,7 +69,7 @@ readMatch isId64 attrs = do
where
readAttr (T.AttrTMulti attr) = (readList (readAttr attr)) >>= return . T.AttrMulti
readAttr T.AttrTBigInt = getNum64 >>= return . T.AttrBigInt
- readAttr T.AttrTString = getStr >>= return . T.AttrString
+ readAttr T.AttrTString = getStrStr >>= return . T.AttrString . ICU.toUnicode conv
readAttr T.AttrTUInt = getNum >>= return . T.AttrUInt
readAttr T.AttrTFloat = getFloat >>= return . T.AttrFloat
readAttr _ = getNum >>= return . T.AttrUInt
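
The read path in Get.hs pulls a 32-bit big-endian length, then that many bytes, and only then decodes them to Text. Here is a minimal, self-contained sketch of that pattern. It is not the code from this pull request: `getText` is a hypothetical name, and `decodeUtf8With lenientDecode` from the text package stands in for `ICU.toUnicode conv`, so it only covers UTF-8. It does show the behaviour described in the Configuration documentation, where invalid bytes become U+FFFD.

```haskell
module Main where

import Data.Binary.Get (Get, getByteString, getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- | Hypothetical stand-in for 'getTxt': read a length-prefixed strict
-- ByteString and decode it as UTF-8. Assumption: UTF-8 only; the real
-- code uses an ICU Converter chosen from the configured encoding.
getText :: Get Text
getText = do
  len <- getWord32be                      -- byte count, big-endian
  bs  <- getByteString (fromIntegral len) -- strict bytes, as in getStrStr
  return (decodeUtf8With lenientDecode bs)

main :: IO ()
main = do
  -- Length 3, then 'o', an invalid byte 0xFF, then 'k'.
  let wire = BL.pack [0, 0, 0, 3, 0x6F, 0xFF, 0x6B]
  -- The invalid byte is replaced by U+FFFD rather than raising an error.
  print (T.unpack (runGet getText wire))
```

Decoding after the full byte string has been read keeps the length prefix in bytes, independent of how many characters the payload decodes to.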
4 Text/Search/Sphinx/Indexable.hs
@@ -4,7 +4,7 @@ module Text.Search.Sphinx.Indexable (
)
where
-import Data.ByteString.Lazy.UTF8 (toString)
+import Data.Text (unpack)
import qualified Text.Search.Sphinx.Types as T
--import Text.Search.Sphinx.Types
@@ -37,7 +37,7 @@ docEl :: (String, T.Attr) -> Element
docEl (name, content) = normalEl name `text` indexableEl content
indexableEl (T.AttrUInt i) = simpleText $ show i
-indexableEl (T.AttrString s) = simpleText $ toString s
+indexableEl (T.AttrString s) = simpleText $ unpack s
indexableEl (T.AttrFloat f) = simpleText $ show f
indexableEl _ = error "not implemented"
9 Text/Search/Sphinx/Put.hs
@@ -8,6 +8,10 @@ import Data.ByteString.Lazy.Char8 (pack)
import qualified Data.ByteString.Lazy as BS
import qualified Text.Search.Sphinx.Types as T
+import Data.Text (Text)
+import qualified Data.Text.ICU.Convert as ICU
+import qualified Data.ByteString as Strict (length)
+
num = putWord32be . fromIntegral
num64 i = putWord64be $ fromIntegral i
@@ -39,3 +43,8 @@ foldPuts :: [Put] -> Put
foldPuts [] = return ()
foldPuts [p] = p
foldPuts (p:ps) = p >> foldPuts ps
+
+txt :: ICU.Converter -> Text -> Put
+txt conv t = do let bs = ICU.fromUnicode conv t
+ num (fromEnum $ Strict.length bs)
+ putByteString bs
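
The `txt` function above encodes first and writes the length of the resulting bytes, not the length of the Text. A rough sketch of why that order matters follows. It assumes UTF-8 and uses `encodeUtf8` from the text package in place of `ICU.fromUnicode conv`; `putText` is a hypothetical name, not part of this patch.

```haskell
module Main where

import Data.Binary.Put (Put, putByteString, putWord32be, runPut)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- | Hypothetical stand-in for 'txt': encode the Text, then write a
-- 32-bit big-endian byte count followed by the encoded bytes.
-- Assumption: UTF-8 only; the real code takes an ICU Converter.
putText :: Text -> Put
putText t = do
  let bs = encodeUtf8 t
  putWord32be (fromIntegral (B.length bs)) -- length in bytes, not characters
  putByteString bs

main :: IO ()
main = do
  -- "n\233" is two characters, but U+00E9 takes two bytes in UTF-8,
  -- so the prefix is 3.
  let wire = BL.unpack (runPut (putText (T.pack "n\233")))
  print wire -- [0,0,0,3,110,195,169]
```

Taking the length from the Text instead would under-report any non-ASCII query and desynchronise the rest of the request.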
22 Text/Search/Sphinx/Types.hs
@@ -6,6 +6,14 @@ module Text.Search.Sphinx.Types (
import Data.ByteString.Lazy (ByteString)
import Data.Int (Int64)
import Data.Maybe (Maybe, isJust)
+import Data.Text (Text,empty)
+
+-- | Data structure representing one query. It can be sent with 'runQueries'
+-- or 'runQueries'' to the server in batch mode.
+data Query = Query { queryString :: Text -- ^ The actual query string
+ , queryIndexes :: String -- ^ The indexes, \"*\" means every index
+ , queryComment :: Text -- ^ A comment string.
+ } deriving (Show)
-- | Search commands
data SearchdCommand = ScSearch
@@ -160,7 +168,7 @@ data QueryResult = QueryResult {
-- | Total amount of matching documents in index.
, totalFound :: Int
-- | processed words with the number of docs and the number of hits.
- , words :: [(ByteString, Int, Int)]
+ , words :: [(Text, Int, Int)]
-- | List of attribute names returned in the result.
-- | The Match will contain just the attribute values in the same order.
, attributeNames :: [ByteString]
@@ -169,15 +177,15 @@ data QueryResult = QueryResult {
-- | a single query result, runQueries returns a list of these
data SingleResult = QueryOk QueryResult
- | QueryWarning ByteString QueryResult
- | QueryError Int ByteString
+ | QueryWarning Text QueryResult
+ | QueryError Int Text
deriving (Show)
-- | a result returned from searchd
data Result a = Ok a
- | Warning ByteString a
- | Error Int ByteString
- | Retry ByteString
+ | Warning Text a
+ | Error Int Text
+ | Retry Text
deriving (Show)
data Match = Match {
@@ -196,6 +204,6 @@ instance Eq Match where
data Attr = AttrMulti [Attr]
| AttrUInt Int
| AttrBigInt Int64
- | AttrString ByteString
+ | AttrString Text
| AttrFloat Float
deriving (Show)
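
To illustrate the new `Query` record, here is a small sketch of building a batch of queries as plain values. The record is re-declared locally so the example compiles without the sphinx package. `queryAll` is a hypothetical helper for illustration only, not the `simpleQuery` function added in this pull request, whose signature is not shown here.

```haskell
module Main where

import Data.Text (Text)
import qualified Data.Text as T

-- Local copy of the record from the Types.hs hunk above, so this
-- sketch does not depend on the sphinx package.
data Query = Query
  { queryString  :: Text   -- the actual query string
  , queryIndexes :: String -- the indexes; "*" means every index
  , queryComment :: Text   -- a comment string
  } deriving (Show)

-- | Hypothetical helper: search every index, with no comment.
queryAll :: Text -> Query
queryAll q = Query { queryString = q, queryIndexes = "*", queryComment = T.empty }

main :: IO ()
main = do
  -- A batch is just a list of values; per the description of this pull
  -- request, such a list is what gets passed to runQueries.
  let batch = [ queryAll (T.pack "haskell")
              , (queryAll (T.pack "sphinx")) { queryComment = T.pack "second query" }
              ]
  mapM_ print batch
```

Because queries are plain data rather than `Put` actions, callers cannot append arbitrary bytes to a request, which is the motivation given in the commit message.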
5 sphinx.cabal
@@ -1,5 +1,5 @@
Name: sphinx
-Version: 0.5.3.1
+Version: 0.6.0
Synopsis: Haskell bindings to the Sphinx full-text searching daemon.
Description: Haskell bindings to the Sphinx full-text searching daemon. Compatible with Sphinx version 2.0
Category: Text, Search, Database
@@ -31,7 +31,8 @@ library
Build-Depends: base >= 4 && < 5,
binary, data-binary-ieee754,
bytestring, network,
- xml, utf8-string >= 0.3
+ xml,
+ text < 0.12, text-icu < 0.7
if flag(version-1-1-beta)
cpp-options: -DONE_ONE_BETA