
[ #206 ] Deriving FromField/ToField instances #207

Draft · wants to merge 20 commits into master
Conversation

stevladimir

Works only for the following representations:

  • Single nullary constructor
    By default encodes constructor name
  • Sum with nullary or unary constructors
    Encoding is similar to 'UntaggedValue' encoding from 'aeson'
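
For context, a minimal usage sketch of the feature (this assumes the genericToField/genericParseField helpers from this PR and cassava's existing defaultOptions; names and exports may still change):

{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics (Generic)
-- genericToField/genericParseField are the helpers proposed in this PR;
-- whether they end up exported from Data.Csv is still open.
import Data.Csv (FromField (..), ToField (..), defaultOptions, genericParseField, genericToField)

data Priority = Low | Medium | High
  deriving Generic

instance ToField Priority where
  toField = genericToField defaultOptions

instance FromField Priority where
  parseField = genericParseField defaultOptions

-- Intended behaviour (sketch):
-- toField Medium                 == "Medium"
-- runParser (parseField "High")  == Right High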

@stevladimir
Author

Hello! Any thoughts on this?

@stevladimir
Author

Friendly ping

@andreasabel
Member

Sorry for the slow reaction.

I rebased this onto master to get CI to run.
The testsuite has (trivial) problems with the legacy GHCs (< 7.10).

tests/UnitTests.hs:485:30: Not in scope: ‘pure’

@stevladimir
Author

Adding 'pure' to imports fixed the problem

@andreasabel
Member

Adding 'pure' to imports fixed the problem

I moved this to the conditional import of Control.Applicative to avoid the unused import warning on new GHCs.
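
For reference, the conditional import in tests/UnitTests.hs now looks roughly like this (a sketch; the exact import list may differ):

#if !MIN_VERSION_base(4,8,0)
-- 'pure' only moved into the Prelude with base 4.8 / GHC 7.10,
-- so import it explicitly on older compilers.
import Control.Applicative (pure, (<$>), (<*>))
#endif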

@andreasabel
Member

I see your parser supports untagged unions via <|>. Backtracking can be costly, so I do not know if this is in the spirit of this package:

cassava/README.md

Lines 78 to 86 in fd168df

general purpose parsing library. There are plenty of those. The goal
is to roughly accept what the Python
[`csv`](https://docs.python.org/3/library/csv.html) module accepts.
The Python `csv` module (which is implemented in C) is also considered
the base-line for performance. Adding options (e.g. the above
mentioned parsing "flexibility") will have to be a trade off against
performance. There's been complaints about performance in the past,
therefore, if in doubt performance wins over features.

I added a performance test for parsing unions that should take O(2^n) wrong branches until it succeeds. I only tried it up to n = 10, and at that size no problem is visible to the naked eye. However, maybe we should do more serious benchmarking, or drop this feature. Opinions?
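
For illustration only, the kind of nested untagged sum such a test can exercise (the actual benchmark type in this branch may differ):

{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics (Generic)

-- Purely illustrative nesting; each level adds one more alternative the
-- generic parser has to try and reject before reaching the matching branch.
data U0 = A0 | B0 Int   deriving Generic
data U1 = A1 | B1 U0    deriving Generic
data U2 = A2 | B2 U1    deriving Generic
-- ...continue to the desired depth n; a field that only the innermost
-- constructor accepts forces the parser to backtrack through every
-- enclosing (<|>).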

Also, could you draft a CHANGELOG entry describing the new features?

@stevladimir
Author

I totally agree that performance is important, so I will add benchmarks.

Though I don't think extreme cases like the one you provided are a real problem unless there is a significant performance loss that is "unexpected" and can be avoided. As an example of "unexpected" performance loss: parsing of an enum happens to be O(n^2) instead of O(n), where n is the number of constructors. But otherwise, parsing of (Either Foo (Either Bar (Either Baz (...)))) cannot be fast by its very nature. I would say there is an issue only if there are cases where the complexity is worse than O(n) with respect to the number of "branches".

Encoding for sums with unary constructors is somewhat fishy, for a reason similar to aeson's UntaggedValue:

Constructor names won't be encoded. Instead only the contents of the constructor will be encoded as if the type had a single constructor. JSON encodings have to be disjoint for decoding to work properly.

When decoding, constructors are tried in the order of definition. If some encodings overlap, the first one defined will succeed.

This may be confusing, but IMHO parsing sums with unary constructors is useful, e.g. data Limit = NoLimit | Limit Int | Unlimited. And IMHO it is acceptable to put the obligation on the user to ensure that the encodings of the constructors are disjoint. It is still less tedious and less error-prone than writing instances manually :) But it is up to you as a maintainer to make the decision.
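
Concretely, the intended behaviour for the example above would be roughly this (a sketch only; the exact derived behaviour is what this PR proposes and may still change):

{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics (Generic)

data Limit = NoLimit | Limit Int | Unlimited
  deriving Generic

-- Sketch of the derived encodings:
-- toField NoLimit    == "NoLimit"
-- toField (Limit 10) == "10"
-- toField Unlimited  == "Unlimited"
--
-- Decoding tries the constructors in declaration order, so the encodings
-- must stay disjoint; here that holds because "NoLimit" and "Unlimited"
-- are not valid Ints.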

Let me add some benchmarks and then we can come back to the discussion. Any ideas on what else to benchmark besides "branched" unions?

Also, could you draft a CHANGELOG entry describing the new features?

Sure thing, but I think it makes sense to do that after we resolve the open questions.

@andreasabel
Member

I agree with your argument. Banning untagged unions altogether just because there are problematic cases would throw out the baby with the bathwater. So let's include them.

@stevladimir stevladimir marked this pull request as draft November 9, 2022 08:36
@stevladimir
Author

stevladimir commented Nov 9, 2022

TODO

  • Benchmarks for Either.
  • Static type errors for unsupported representations.
  • Error reporting
    The original implementation tries to be informative about how parsing failed, e.g. fail $ "Can't parse " <> show expected <> " from " <> show field.
    Replacing that with just empty makes the failing case ~10x faster (see the *-dumb-errors-* benchmark below) and even slightly more performant than the manual instance (I suspect this is because the generic instance is implemented using the Parser constructor, which is otherwise hidden, so hand-written instances have to use the pure/fail interface instead of the CPSed Parser \onFailure onSuccess).
    Which is preferable here: performance or informativeness? We could also make this customizable via a Cabal flag or Options, so we could have both and let the end user choose.
  • Would it make sense to introduce a separate FieldOptions type or an Options.constructorTagModifier field? It feels unintuitive that fieldLabelModifier is used for constructors. I initially used the existing Options for simplicity, but IMO it should not be merged as is. I can see a few options (a sketch follows this list):
    1. A new FieldOptions structure (I would go with this option).
      In that case I would also rename Options to RecordOptions. To stay backward compatible we can:
      • type Options = RecordOptions and {-# DEPRECATED Options "Use 'RecordOptions'" #-}
      • a pattern Options, deprecated as well (not sure about type/constructor name clashes in DEPRECATED pragmas)
    2. An Options.constructorTagModifier field
    3. Just leave it as is (I am strongly opposed)
  • Decide on the Nix setup: keep or drop?
  • Docs and changelog.
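
A sketch of the backward-compatibility idea from the FieldOptions option above (all names are tentative; none of this is in the PR yet):

{-# LANGUAGE PatternSynonyms #-}

-- Tentative: rename the existing record to RecordOptions...
data RecordOptions = RecordOptions { fieldLabelModifier :: String -> String }

-- ...and keep the old name alive as a deprecated type synonym:
type Options = RecordOptions
{-# DEPRECATED Options "Use 'RecordOptions'" #-}

-- A bidirectional pattern synonym could cover uses of the old constructor;
-- whether it can carry its own DEPRECATED pragma without clashing with the
-- type-level name is exactly the open question above.
pattern Options :: (String -> String) -> RecordOptions
pattern Options f = RecordOptions f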

@stevladimir
Author

stevladimir commented Dec 10, 2022

I cleaned up the TODO list above, so I am duplicating it here.

Sorry for the long pause - I dug into the rabbit hole of performance :) I was also somewhat limited in time.

The original implementation turned out to be 10x slower than a manually written instance for enums. I managed to make the generic instances almost as performant.
I also added a Nix shell, as it makes switching GHCs easier. Let it be here for a while; if you think it is valuable, it is better to make it a separate PR IMO.

Below is the most recent set of benchmarks.
I hope the filenames are self-descriptive (let me know if something is unclear).
'-dumb-errors-' is the result of applying the following patch:

@@ -775,8 +775,9 @@ genericParseField
 genericParseField opts field = Parser $ \onFailure onSuccess ->
   unParser (gParseField opts field) (\_ -> onFailure err) (onSuccess . to . M1)
   where
-    err = "Can't parseField of type " <> datatypeName (Proxy :: Proxy meta d f)
-      <> " from " <> show field
+    err = "fail"
+    -- err = "Can't parseField of type " <> datatypeName (Proxy :: Proxy meta d f)
+    --   <> " from " <> show field
 {-# INLINE genericParseField #-}
 
 -- | A type that can be converted to a single CSV field.
@@ -1405,7 +1406,7 @@ instance (Constructor c) => GFromField (C1 c U1) where
   gParseField opts field = Parser $ \onFailure onSuccess ->
     if field == expected
     then onSuccess val
-    else onFailure $ "Can't parse " <> show expected <> " from " <> show field
+    else onFailure "Fail" -- $ "Can't parse " <> show expected <> " from " <> show field
     where
       expected = encodeConstructor opts val
       val :: C1 c U1 p

List of open questions I need your opinion about (it largely duplicates the TODO):

  1. Anything else to be benchmarked?
  2. Error reporting
    My original implementation tries to be informative about how parsing failed, i.e. fail $ "Can't parse " <> show expected <> " from " <> show field.
    Replacing that with just empty makes the failing case ~10x faster (see the benchmark-iut-dumb-errors benchmarks below) and even slightly more performant than the manual instance (I suspect this is because the generic instance is implemented using the Parser constructor, which is otherwise hidden, so hand-written instances have to use the pure/fail interface instead of the CPSed Parser \onFailure onSuccess).
    Which is preferable here: performance or informativeness? We could also make this customizable via a Cabal flag or Options, so we could have both and let the end user choose.
  3. Would it make sense to introduce a separate FieldOptions type or an Options.constructorTagModifier field? It feels unintuitive that fieldLabelModifier is used for constructors. I initially used the existing Options for simplicity, but IMO it should not be merged as is. I can see a few options:
    1. A new FieldOptions structure (I would go with this option).
      In that case I would also rename Options to RecordOptions. To stay backward compatible we can:
      • type Options = RecordOptions and {-# DEPRECATED Options "Use 'RecordOptions'" #-}
      • a pattern Options, deprecated as well (not sure about type/constructor name clashes in DEPRECATED pragmas)
    2. An Options.constructorTagModifier field
    3. Just leave it as is (I am strongly opposed)

A few side notes (posting them here so I don't forget to discuss separately whether it makes sense to investigate further):

  1. I would consider splitting the tests into modules, as that generally makes compilation faster. But that's a separate PR.
  2. Explore and probably expose the Parser constructor, or add a corresponding combinator, to get the most out of the performance.
  3. Also, I've noticed some performance difference between benchmark-ref and benchmark-iut (the latter often being slightly or, as in the genericParseFail bench, moderately more performant). IIUC benchmark-ref models what the end user sees, while benchmark-iut uses cassava as an internal library.
    I assume that's because in the latter case -O2 is applied to both the cassava-iut library and the benchmark executable, so some internals are better optimized. But maybe you have a better understanding of what is going on.

With best regards!

@stevladimir
Author

stevladimir commented Dec 10, 2022

Also, I've noticed some performance difference between benchmark-ref and benchmark-iut (the latter often being slightly or, as in the genericParseFail bench, moderately more performant). IIUC benchmark-ref models what the end user sees, while benchmark-iut uses cassava as an internal library.
I assume that's because in the latter case -O2 is applied to both the cassava-iut library and the benchmark executable, so some internals are better optimized. But maybe you have a better understanding of what is going on.

I've carefully revisited things and am now almost sure that my original understanding was wrong. The purpose of benchmark-ref seems to be a reference implementation (hence the ref in its name), which always uses cassava from Hackage (or whatever is installed in the environment), while benchmark-iut tests the actual local changes, so one can easily compare performance.

Besides that, I've added generic instances for uninhabited types (Data.Void.Void is an example of such a type; I added an instance for it too).

Just to summarize - supported representations:

-- 1. Uninhabited types: the implementation is 'absurd', but such terms are expected to never exist.
-- Still, they can be useful when one needs to statically guarantee the non-existence of a term:
-- e.g. "Maybe Void" can only exist as "Nothing" (ignoring bottom).
-- An example from practice:
data Foo (p :: Bool) = Foo
  { foo :: Maybe (Restricted p)
  , ... }
type family Restricted (p :: Bool) where
  Restricted True = Int
  Restricted False = Void
-- It's of course a simplified example and this could be modeled by passing the type explicitly,
-- but that is not always feasible for other reasons.

-- 2. Nullary constructor types
data Foo0 = Foo0 
-- toField Foo0 == "Foo0"

-- 3. Unary constructor types
-- This was absent from the original implementation, but unlike with JSON there is no ambiguity,
-- so I see no reason not to provide an instance for them.
-- In JSON, `data Foo a = Foo{foo :: a}` can reasonably be encoded
-- both as `toJSON a` and as `object ["foo" .= toJSON a]`.
newtype Foo1 a = Foo1 a
-- toField (Foo1 a) == toField a

-- 4. Sum with branches fitting into 2. or 3.
data FooSum a = FooA | FooB a
-- toField FooA == "FooA"
-- toField (FooB a) == toField a
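
And, for completeness, how I intend decoding to behave for case 4 (hypothetical results, assuming the generic FromField from this branch; runParser is cassava's existing helper):

-- Assuming OverloadedStrings for the Field literals:
-- (runParser (parseField "FooA") :: Either String (FooSum Int)) == Right FooA
-- (runParser (parseField "42")   :: Either String (FooSum Int)) == Right (FooB 42)
-- (runParser (parseField "oops") :: Either String (FooSum Int)) -- Left <parse error>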

@stevladimir
Author

Just in case: I'm waiting for some feedback

@stevladimir
Author

Friendly ping
