#116 : refactor to allow url with basic auth a:b@toto.com #118

Merged
merged 14 commits into from Jul 25, 2012

Projects

None yet

3 participants

@bpaquet
Contributor
bpaquet commented Jul 8, 2012

It's break a little the design, but I do not see a better solution :(

@bnoordhuis bnoordhuis commented on an outdated diff Jul 17, 2012
http_parser.c
@@ -346,6 +354,8 @@ enum header_states
#define IS_NUM(c) ((c) >= '0' && (c) <= '9')
#define IS_ALPHANUM(c) (IS_ALPHA(c) || IS_NUM(c))
#define IS_HEX(c) (IS_NUM(c) || (LOWER(c) >= 'a' && LOWER(c) <= 'f'))
+#define IS_MARK(c) ((c) == '-' || (c) == '_' || (c) == '.' || (c) == '!' || (c) == '~' || (c) == '*' || (c) == '\'' || (c) == '(' || (c) == ')')
+#define IS_USERINFO_CHAR(c) (IS_ALPHANUM(c) || IS_MARK(c) || (c) == '%' || (c) == ';' || (c) == ':' || (c) == '&' || (c) == '=' || (c) == '+' || (c) == '$' || (c) == ',')
@bnoordhuis
bnoordhuis Jul 17, 2012 Member

Wrap at 80 columns.

@bnoordhuis bnoordhuis commented on an outdated diff Jul 17, 2012
http_parser.c
@@ -402,7 +412,7 @@ enum header_states
* URL and non-URL states by looking for these.
*/
static enum state
-parse_url_char(enum state s, const char ch)
+parse_url_char(enum state s, const char ch, int * found_at)
@bnoordhuis
bnoordhuis Jul 17, 2012 Member

Style: too much whitespace.

@bnoordhuis bnoordhuis commented on an outdated diff Jul 17, 2012
http_parser.c
@@ -1002,7 +970,7 @@ size_t http_parser_execute (http_parser *parser,
parser->state = s_req_host_start;
}
- parser->state = parse_url_char((enum state)parser->state, ch);
+ parser->state = parse_url_char((enum state)parser->state, ch, 0);
@bnoordhuis
bnoordhuis Jul 17, 2012 Member

Use NULL, not zero. Signals to the reader that it's a pointer.

@bnoordhuis bnoordhuis commented on the diff Jul 17, 2012
http_parser.c
+ }
+ break;
+
+ case s_http_host_start:
+ if (ch == '[') {
+ return s_http_host_v6_start;
+ }
+
+ if (IS_HOST_CHAR(ch)) {
+ return s_http_host;
+ }
+
+ break;
+
+ case s_http_host:
+
@bnoordhuis
bnoordhuis Jul 17, 2012 Member

Superfluous blank line.

@bnoordhuis bnoordhuis commented on an outdated diff Jul 17, 2012
http_parser.c
+
+ case s_http_host_port:
+ case s_http_host_port_start:
+ if (IS_NUM(ch)) {
+ return s_http_host_port;
+ }
+
+ break;
+
+ default:
+ break;
+ }
+ return s_http_host_dead;
+}
+
+#include <stdio.h>
@bnoordhuis
bnoordhuis Jul 17, 2012 Member

Why is that here?

@bnoordhuis bnoordhuis commented on an outdated diff Jul 17, 2012
http_parser.c
@@ -2012,21 +2109,16 @@ const char * http_method_str (enum http_method m)
old_uf = uf;
}
- /* CONNECT requests can only contain "hostname:port" */
- if (is_connect && u->field_set != ((1 << UF_HOST)|(1 << UF_PORT))) {
- return 1;
+ // host must be present if there is a schema
@bnoordhuis
bnoordhuis Jul 17, 2012 Member

Don't use C++/C99 comments.

@bnoordhuis bnoordhuis commented on an outdated diff Jul 17, 2012
+ ,.port=0
+ ,.field_data=
+ {{ 0, 0 } /* UF_SCHEMA */
+ ,{ 0, 0 } /* UF_HOST */
+ ,{ 0, 0 } /* UF_PORT */
+ ,{ 0, 10 } /* UF_PATH */
+ ,{ 0, 0 } /* UF_QUERY */
+ ,{ 11, 4 } /* UF_FRAGMENT */
+ ,{ 0, 0 } /* UF_USERINFO */
+ }
+ }
+ ,.rv=0
+ }
+
+, {.name="complex URL fragment"
+ ,.url="http://www.webmasterworld.com/r.cgi?f=21&d=8405&url=http://www.example.com/index.html?foo=bar&hello=world#midpage"
@bnoordhuis
bnoordhuis Jul 17, 2012 Member

Long line.

@bnoordhuis bnoordhuis commented on an outdated diff Jul 17, 2012
+ ,{ 7, 8 } /* UF_HOST */
+ ,{ 16, 4 } /* UF_PORT */
+ ,{ 20, 8 } /* UF_PATH */
+ ,{ 29, 12 } /* UF_QUERY */
+ ,{ 42, 4 } /* UF_FRAGMENT */
+ ,{ 0, 0 } /* UF_USERINFO */
+ }
+ }
+ ,.rv=0
+ }
+
+, {.name="complex URL with basic auth from node js url parser doc"
+ ,.url="http://a:b@host.com:8080/p/a/t/h?query=string#hash"
+ ,.is_connect=0
+ ,.u=
+ {.field_set= (1<<UF_SCHEMA) | (1<<UF_HOST) | (1<<UF_PORT) | (1<<UF_PATH) | (1<<UF_QUERY) | (1<<UF_FRAGMENT) | (1<<UF_USERINFO)
@bnoordhuis
bnoordhuis Jul 17, 2012 Member

Long line.

@bnoordhuis bnoordhuis commented on an outdated diff Jul 17, 2012
http_parser.c
}
- /* FALLTHROUGH */
- case s_req_port_start:
- if (IS_NUM(ch)) {
- return s_req_port;
+ if (ch == '@') {
+ if (found_at) {
+ *found_at = 1;
@bnoordhuis
bnoordhuis Jul 17, 2012 Member

Why the side channel? Why not e.g. a separate s_req_host_at state?

Bertrand Paquet added some commits Jul 17, 2012
@bpaquet
Contributor
bpaquet commented Jul 17, 2012

I have accepted all your remarks.

@bnoordhuis
Member

@pgriess You should review this as well.

@pgriess pgriess commented on an outdated diff Jul 24, 2012
+#include <string.h>
+
+void
+dump_url (const char *url, const struct http_parser_url *u)
+{
+ char part[512];
+ unsigned int i;
+
+ printf("\tfield_set: 0x%x, port: %u\n", u->field_set, u->port);
+ for (i = 0; i < UF_MAX; i++) {
+ if ((u->field_set & (1 << i)) == 0) {
+ printf("\tfield_data[%u]: unset\n", i);
+ continue;
+ }
+
+ memcpy(part, url + u->field_data[i].off, u->field_data[i].len);
@pgriess
pgriess Jul 24, 2012 Contributor

Rather than copying into part to do null-termination, you could use a format specifier to indicate the number of characters in the string. For example, something like this:

printf("%.*s\n", u->field_data[i].len, url + u->field_data[i].off);
@pgriess pgriess commented on the diff Jul 24, 2012
http_parser.c
@@ -346,6 +355,12 @@ enum header_states
#define IS_NUM(c) ((c) >= '0' && (c) <= '9')
#define IS_ALPHANUM(c) (IS_ALPHA(c) || IS_NUM(c))
#define IS_HEX(c) (IS_NUM(c) || (LOWER(c) >= 'a' && LOWER(c) <= 'f'))
+#define IS_MARK(c) ((c) == '-' || (c) == '_' || (c) == '.' || \
@pgriess
pgriess Jul 24, 2012 Contributor

These are starting to get long. At some point, we might want to replace this with lookups into an array.

@pgriess pgriess commented on an outdated diff Jul 24, 2012
http_parser.c
}
- /* FALLTHROUGH */
- case s_req_port_start:
- if (IS_NUM(ch)) {
- return s_req_port;
+ if (ch == '@') {
@pgriess
pgriess Jul 24, 2012 Contributor

Let's check this before the IS_USERINFO_CHAR() ... check above because it's cheaper.

@pgriess pgriess commented on an outdated diff Jul 24, 2012
http_parser.c
@@ -254,12 +254,8 @@ enum state
, s_req_schema_slash
, s_req_schema_slash_slash
, s_req_host_start
- , s_req_host_v6_start
- , s_req_host_v6
- , s_req_host_v6_end
, s_req_host
@pgriess
pgriess Jul 24, 2012 Contributor

Since the "host" parsing is now really parsing full-fledged RFC2396 authorities, maybe we should rename these states and the other symbols that reference host parsing.

@pgriess pgriess commented on an outdated diff Jul 24, 2012
http_parser.c
@@ -455,62 +470,28 @@ enum header_states
break;
- case s_req_host_start:
- if (ch == '[') {
- return s_req_host_v6_start;
- }
-
- if (IS_HOST_CHAR(ch)) {
- return s_req_host;
+ case s_req_host_with_at:
@pgriess
pgriess Jul 24, 2012 Contributor

I'm not sure we really need this state. The caller is feeding in characters one-at-a-time anyway. If they care about whether or not there's a literal @ in the authority, they can check for themselves. The only thing information is used for in parse_url_char() is to make sure that we don't have consecutive @s, but even that's not a complete check -- we should really be checking for whether or not there are more than one. Again, the caller is pretty well positioned to do this with a simple counter.

@pgriess pgriess commented on the diff Jul 24, 2012
http_parser.c
@@ -2012,21 +2120,16 @@ const char * http_method_str (enum http_method m)
old_uf = uf;
}
- /* CONNECT requests can only contain "hostname:port" */
- if (is_connect && u->field_set != ((1 << UF_HOST)|(1 << UF_PORT))) {
- return 1;
+ /* host must be present if there is a schema */
@pgriess
pgriess Jul 24, 2012 Contributor

Maybe a comment here about how http_parse_host() will fail for empty hosts? Otherwise it's not clear why we think we're going to be able to correctly fail this condition.

@pgriess pgriess commented on the diff Jul 24, 2012
http_parser.c
+ size_t buflen = u->field_data[UF_HOST].off + u->field_data[UF_HOST].len;
+
+ u->field_data[UF_HOST].len = 0;
+
+ s = found_at ? s_http_userinfo_start : s_http_host_start;
+
+ for (p = buf + u->field_data[UF_HOST].off; p < buf + buflen; p++) {
+ enum http_host_state new_s = http_parse_host_char(s, *p);
+
+ if (new_s == s_http_host_dead) {
+ return 1;
+ }
+
+ switch(new_s) {
+ case s_http_host:
+ if (s != s_http_host) {
@pgriess
pgriess Jul 24, 2012 Contributor

Some day it'd be nice to unify the switch statement/logic from here and in http_parser_parse_url(), both of which do basically the same thing.

@pgriess pgriess commented on the diff Jul 24, 2012
http_parser.c
@@ -1938,6 +1911,144 @@ const char * http_method_str (enum http_method m)
return http_strerror_tab[err].description;
}
+static enum http_host_state
+http_parse_host_char(enum http_host_state s, const char ch) {
+ switch(s) {
+ case s_http_userinfo:
+ case s_http_userinfo_start:
+ if (ch == '@') {
@pgriess
pgriess Jul 24, 2012 Contributor

Why do we need the s_http_userinfo_start state at all? It sees to behave the same as s_http_userinfo.

@bpaquet
bpaquet Jul 24, 2012 Contributor

I try to remove it, but I have to modify
https://github.com/bpaquet/http-parser/blob/master/http_parser.c#L2022.
I think the code is much cleaner with this state.

If you have a proper solution, I'm ok to implement.

On Tue, Jul 24, 2012 at 7:13 PM, Peter Griess <
reply@reply.github.com

wrote:

@@ -1938,6 +1911,144 @@ const char * http_method_str (enum http_method m)
return http_strerror_tab[err].description;
}

+static enum http_host_state
+http_parse_host_char(enum http_host_state s, const char ch) {

  • switch(s) {
  • case s_http_userinfo:
  • case s_http_userinfo_start:
  •  if (ch == '@') {
    

Why do we need the s_http_userinfo_start state at all? It sees to behave
the same as s_http_userinfo.


Reply to this email directly or view it on GitHub:
https://github.com/joyent/http-parser/pull/118/files#r1227343

@pgriess pgriess commented on an outdated diff Jul 24, 2012
http_parser.c
+
+ s = found_at ? s_http_userinfo_start : s_http_host_start;
+
+ for (p = buf + u->field_data[UF_HOST].off; p < buf + buflen; p++) {
+ enum http_host_state new_s = http_parse_host_char(s, *p);
+
+ if (new_s == s_http_host_dead) {
+ return 1;
+ }
+
+ switch(new_s) {
+ case s_http_host:
+ if (s != s_http_host) {
+ u->field_data[UF_HOST].off = p - buf;
+ }
+ u->field_data[UF_HOST].len ++;
@pgriess
pgriess Jul 24, 2012 Contributor

There's some weird spacing going on before the ++ operator, here and below.

@pgriess
Contributor
pgriess commented Jul 24, 2012

Anyway, none of my comments are really deal-breakers IMO. @bnoordhuis are you cool w/ this going in now? If so, @bpaquet do you want time to make the suggested changes? Or should we just merge this now?

@pgriess
Contributor
pgriess commented Jul 24, 2012

Also, I think this is turning into a case for breaking out the URL parser into its own library.

@bpaquet
Contributor
bpaquet commented Jul 24, 2012

Hi,

I have make some of the suggested changes. I did not make biggest ones,
especially these which are long time refactor (extract url parser, unify
the switch statement/logic).
I agree we can manage the with_at in the http_parser_parse_url, but the
code will be less uniform (characters processing in http_parser_parse_url).

Are you ok to merge the pull request without these refactors ?

Regards,

On Tue, Jul 24, 2012 at 7:22 PM, Peter Griess <
reply@reply.github.com

wrote:

Anyway, none of my comments are really deal-breakers IMO. @bnoordhuis are
you cool w/ this going in now? If so, @bpaquet do you want time to make the
suggested changes? Or should we just merge this now?


Reply to this email directly or view it on GitHub:
#118 (comment)

@pgriess
Contributor
pgriess commented Jul 25, 2012

Cool, thanks for making those changes @bpaquet! I'll merge this now.

@pgriess pgriess merged commit fb3eeb7 into nodejs:master Jul 25, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment