Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forwarded parser in cats-parse #4147

Merged
merged 10 commits into from Jan 23, 2021
Merged

Forwarded parser in cats-parse #4147

merged 10 commits into from Jan 23, 2021

Conversation

fredshonorio
Copy link
Contributor

@fredshonorio fredshonorio commented Jan 6, 2021

Took a little more than I'd hope but here it is.

This fails a bunch of tests related to parsing the Host. I'm unsure where the problem is but it looks encoding related, the tests are ~tests/testOnly org.http4s.*.Forwarded*Spec.

I'm also unsure about the quoted implementation, if the approach is to run the parser with the parsed string or something else. If it is I don't know what error message to use.

I wanted feedback on this so I haven't yet deleted the parboiled parsers or added MiMa stuff.

@rossabaker rossabaker added this to In progress in Dotty cross-compilation via automation Jan 7, 2021
Copy link
Member

@rossabaker rossabaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This is one of the most complicated headers.

def fromString(s: String): ParseResult[Obfuscated] =
new ModelNodeObfuscatedParser(s).parse
parser.parseAll(s).left.map { e =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a ParseResult.fromParser that's better for this.

// obfport = "_" 1*(ALPHA / DIGIT / "." / "_" / "-")
val nodePort: P[Node.Port] =
Numbers.digits1
// is it worth it to consume only up to 5 chars or just let it fail later?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A max is coming in cats-parse-0.3 that will make this easier. We can let it fail later and clean it up on upgrade, I'd say.

def quoted[A](p: P[A]): P[A] =
Rfc7230.token
.orElse(Rfc7230.quotedString)
.flatMap(str =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to flatMap these, or is each p sufficient on its own? Is it the quoting that makes this so complicated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the previous parser and this was more or less what it was doing. I imagined quoted-string was more than putting the quotes in the beginning and the end but it isn't, so I think I can rewrite this as
oneOf(p.between('"', '"'), p).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

☝️ this is not true, i'm now getting more failures because the host parser is consuming the text of the following pairs, e.g. host="http4s.org;for=_kadabra;proto=http;by=_abra". Maybe quoted-string is more restrictive than Host? Will investigate later.

@rossabaker
Copy link
Member

Also, we're about to fold the dotty branch into main and give up on the idea of cats-parsed 0.21 release. So don't worry about the MiMa parsers.

@rossabaker rossabaker changed the base branch from dotty to series/0.22 January 8, 2021 19:05
@fredshonorio
Copy link
Contributor Author

@rossabaker have you had a change to look at this? I'm stuck in the encoding stuff.

@rossabaker
Copy link
Member

I have not. I'll try to take a deeper look tonight.

One thing we might be able to do in the meantime is merge series/0.22 to and fix the compile errors: the target branch is now on cats-parse-0.3. Mostly, Parser becomes Parser0 and Parser1 becomes Parser. There's a scalafix for it, but I don't know whether this is big enough to be worth running, or just fixing up manually.

@hamnis
Copy link
Contributor

hamnis commented Jan 18, 2021

@fredshonorio are you still up for picking this up again?

@fredshonorio
Copy link
Contributor Author

@fredshonorio are you still up for picking this up again?

I'm blocked in the encoding stuff I mentioned in the first comment, and could use some help. I can handle the rest once that's done.

@rossabaker
Copy link
Member

Tests are still failing, but I went ahead and dealt with the cats-parse-0.3 upgrade that's already on the target branch. Trying to examine the failures before I sleep...

@rossabaker
Copy link
Member

Some of the complication seems to be flowing from the arbitrary RegName, which it was noted was generating encoded values to store in a decoded field. I cleaned that up, but that's not the only issue.

@rossabaker
Copy link
Member

ForwardedSpec was failing whenever the RegName begins with a number. That's because we weren't backtracking from IPv4 addresses. That's fixed in 1dc3f87.

We store a decoded value, but weren't encoding the rendered value. That's fixed in f2835c4, along with removing a previous hack around that bug.

With that, ForwardedSpec passes.

@rossabaker
Copy link
Member

It looks like at least two of the outstanding failures in ForwardedRenderingSpec are related to the encoding of weird RegNames, but I need to be up. I'll try to take another pass tomorrow if nobody else has figured it out.

We need to get a lot more rigorous in the URI about what's encoded and what's not. That's the main source of trouble here.

/cc @satorg, who worked on the original model.

@hamnis
Copy link
Contributor

hamnis commented Jan 21, 2021

Seems like the problem is in the generator for hostnames.
I'm not sure that it is allowed with arbitrary strings here.

Dont we need to punycode any non-ascii chars for the hostname?

@rossabaker
Copy link
Member

I already took a shot at reforming the hostname generator here. In practice, non-ASCII characters in a hostname would be punycoded. But the URI spec intentionally says it doesn't necessarily follow the rules for hostname resolution.

Some of the problem @fredshonorio ran into was that the generator was creating percent-encoded strings, and storing them in a model that stores decoded strings. I started fixing that in my commits, but I think there may be more nuance.

As we think about a better URI design, a newtype to distinguish percent-encoded values from raw values could help us avoid bugs like this.

@hamnis
Copy link
Contributor

hamnis commented Jan 21, 2021

A possible implementation using arbitraries adapted from ip4s:

Index: laws/src/main/scala/org/http4s/laws/discipline/ArbitraryInstances.scala
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/laws/src/main/scala/org/http4s/laws/discipline/ArbitraryInstances.scala b/laws/src/main/scala/org/http4s/laws/discipline/ArbitraryInstances.scala
--- a/laws/src/main/scala/org/http4s/laws/discipline/ArbitraryInstances.scala	(revision b708314ac351305bf50ee23063d9403c6b2c249e)
+++ b/laws/src/main/scala/org/http4s/laws/discipline/ArbitraryInstances.scala	(date 1611266887474)
@@ -665,24 +665,55 @@
       .contramap(ipv6 => (ipv6.a, ipv6.b, ipv6.c, ipv6.d, ipv6.e, ipv6.f, ipv6.g, ipv6.h))
 
   implicit val http4sTestingArbitraryForUriHost: Arbitrary[Uri.Host] = {
-    val http4sTestingRegNameGen: Gen[Uri.RegName] =
-      Gen
-        .frequency(
-          // Focus on things vaguely resembling hostnames
-          // TODO Replace with an ip4s generator for more realism
-          19 -> Gen.stringOf(
-            Gen.frequency(
-              8 -> Gen.alphaNumChar,
-              1 -> Gen.const('-'),
-              1 -> Gen.const('.')
-            )
-          ),
-          // But arbitrary strings are allowed!
-          1 -> getArbitrary[String]
-        )
-        .map(s => Uri.RegName(s))
+    val hostnameGenerator: Gen[Uri.RegName] = {
+      val genLabel: Gen[String] = for {
+        first <- Gen.alphaNumChar
+        middleLen <- Gen.chooseNum(0, 61)
+        middle <- Gen.listOfN(middleLen, Gen.oneOf(Gen.alphaNumChar, Gen.const('-'))).map(_.mkString)
+        last <- if (middleLen > 0) Gen.alphaNumChar.map(Some(_)) else Gen.option(Gen.alphaNumChar)
+      } yield first.toString + middle + last.fold("")(_.toString)
+      for {
+        numLabels <- Gen.chooseNum(1, 5)
+        labels <- Gen.listOfN(numLabels, genLabel)
+        if labels.foldLeft(0)(_ + _.size) < (253 - (numLabels - 1))
+      } yield Uri.RegName(labels.mkString("."))
+    }
+
+    val idnGenerator: Gen[Uri.RegName] = {
+      val DotPattern = "[\\.\u002e\u3002\uff0e\uff61]"
+
+      /** Constructs a `IDN` from a string. */
+      def apply(value: String): Option[Uri.RegName] =
+        value.length match {
+          case 0 => None
+          case _ =>
+            val labels = value
+              .split(DotPattern)
+              .iterator
+              .toList
+            Option(labels).filterNot(_.isEmpty).flatMap { _ =>
+              Try(java.net.IDN.toASCII(value)).toOption.map(Uri.RegName(_))
+            }
+        }
+
+      val genChar: Gen[Char] = Gen.oneOf(Gen.alphaNumChar, Gen.const('δ'), Gen.const('π'), Gen.const('θ'))
+      val genLabel: Gen[String] = for {
+        first <- genChar
+        middleLen <- Gen.chooseNum(0, 61)
+        middle <- Gen.listOfN(middleLen, Gen.oneOf(genChar, Gen.const('-'))).map(_.mkString)
+        last <- if (middleLen > 0) genChar.map(Some(_)) else Gen.option(genChar)
+        str = first.toString + middle + last.fold("")(_.toString)
+        if Try(java.net.IDN.toASCII(str)).toOption.size < 64
+      } yield str
+      for {
+        numLabels <- Gen.chooseNum(1, 5)
+        labels <- Gen.listOfN(numLabels, genLabel)
+        dot <- Gen.oneOf('.', '\u002e', '\u3002', '\uff0e', '\uff61')
+        idn = apply(labels.mkString(dot.toString)) if idn.isDefined
+      } yield idn.get
+    }
     Arbitrary(
-      oneOf(getArbitrary[Uri.Ipv4Address], getArbitrary[Uri.Ipv6Address], http4sTestingRegNameGen)
+      oneOf(getArbitrary[Uri.Ipv4Address], getArbitrary[Uri.Ipv6Address], idnGenerator, hostnameGenerator)
     )
   }

@rossabaker
Copy link
Member

@hamnis' suggestion passes the tests. I don't think it's complete per the spec, but I don't think it's incomplete in a way that is interesting enough to hold up progress.

@fredshonorio
Copy link
Contributor Author

fredshonorio commented Jan 22, 2021

Nice, I haven't been able to give it all a full look yet, are there any outstanding issues still?
Regarding your suggestion to use ParseResult.fromParser: I think I can't do it without changing some tests that are expecting specific error messages. Should I do that?
I'll fix the port parser to consume only 5 chars soon.

@rossabaker
Copy link
Member

Hmm. The tests passed for me locally, but I only ran tests/test. If you see tests failing on the error message, then yes, those would be great to fix.

There's an unmoored doc comment that I'll push in just a moment.

Copy link
Member

@rossabaker rossabaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaning up the max port length would be good, but the tests pass, and I think that could be done in a followup. If we merge this, I think most of parboiled2 is gone.

@hamnis hamnis merged commit 1fe8325 into http4s:series/0.22 Jan 23, 2021
Dotty cross-compilation automation moved this from In progress to Done Jan 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants