Command language #3087

Open
cortesi opened this Issue Apr 29, 2018 · 13 comments

cortesi commented Apr 29, 2018

Command language proposal

In mitmproxy, users interact with addons (and by extension with mitmproxy itself) ONLY through commands and options. Commands have globally unique names, a set of typed arguments, and a single typed return value. The command language is the textual language users use to invoke and combine these typed addon commands.

  • Command names are always in the form of period separated paths (addon.command). Top level commands are indicated with a leading period (.cut), and are a shorthand for a fully qualified command where the addon and command name are the same (.cut == cut.cut).
  • Command invocation is in the style of Python function invocation, but without commas to separate arguments. In the top scope, brackets can optionally be omitted. In inner scopes brackets are mandatory. Invoked commands are executed once immediately when a command string is run.
  • Deferred commands are command invocations preceded by a &. Deferred commands are executed every time they are used as a value.
  • Command pointers are command invocations preceded by a *. Command pointers are used as a value, with the command and all arguments passed to a receiving command. They are not invoked by the command language itself - whether and how to evaluate a command pointer is up to the receiving command. In our type notation we write a command pointer type as *cmd.
  • A source<T> yields a possibly infinite stream of items of type T. Sources are the return value from a source command. The Python representation is as yet undefined, but these might be represented as asynchronous iterators or queues.
  • A sink<T> accepts a stream of items of type T. Concretely, a sink<T> accepts values of type T, [T] and source<T>, and can be considered to be a union of these types. Sinks are the first argument of a sink command. The Python representation is as yet undefined, but might be iterators over asynchronous iterator objects or consumers of asynchronous queues.
  • Two-sided pipe commands. A command can be both a sink and a source, by accepting a sink as its first argument, and returning a source.
  • Pipes are denoted with the | operator, and link sinks and sources by passing the LHS as the first argument to the RHS. A command string can compose multiple pipe components into a pipe chain by using two-sided pipe commands. If a pipe chain ends with a source, the final source's emitted values are consumed and discarded. Using the | symbol with a non-sink command on the RHS is a type error. Sources can only be used with the pipe operator.
  • Pipe variables. Within a pipe component, the variable $this can be used for the current pipe item. We can define this precisely by imagining a command .each sink<T> *cmd -> source<U>. This command inspects the arguments to *cmd to find all occurrences of $this. It then reads values from the sink and executes *cmd with occurrences of $this replaced by the current value. If the command returns a value, that value is written to the output source. Otherwise, the input value (which may have been modified in-place) is written to the output source.
  • Arrays are enclosed in [].
  • Strings are the primary value type in the language. Strings can be quoted if they contain special characters, or unquoted if they don't.
  • Type shorthands. Each type can define a string format that is dynamically expanded into the underlying type. This can be a complex expansion. For example, for flows the string is resolved by selecting flows from the current view. So wherever a flow, [flow] or sink<flow> occurs, a simple flow selection shorthand x is equivalent to view.select(x). The implementation of the resolving function might be different for different tools.
  • Multiple independent commands can be separated with ;. These are executed in sequence, with each component running to completion before we proceed to the next.
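
As a rough illustration of the sink<T> bullet above, here is a minimal Python sketch of how a receiving command might normalize the three accepted shapes (T, [T] and source<T>) into a single stream. It assumes sources are modelled as asynchronous iterators, which the proposal mentions only as one possibility; the names as_source, collect and ticks are hypothetical and not part of mitmproxy.

```python
import asyncio
from typing import Any, AsyncIterator


async def as_source(value: Any) -> AsyncIterator[Any]:
    """Normalize a sink<T> argument (T, [T] or source<T>) into one async stream."""
    if hasattr(value, "__aiter__"):
        # source<T>: assumed here to be an async iterator
        async for item in value:
            yield item
    elif isinstance(value, list):
        # [T]: a finite array
        for item in value:
            yield item
    else:
        # T: a single value
        yield value


async def collect(value: Any) -> list:
    """Drain a sink argument into a list (only safe for finite inputs)."""
    return [item async for item in as_source(value)]


async def ticks(n: int) -> AsyncIterator[int]:
    """Hypothetical finite source used only for this demonstration."""
    for i in range(n):
        yield i


single = asyncio.run(collect("a"))
array = asyncio.run(collect(["a", "b"]))
stream = asyncio.run(collect(ticks(3)))
```

With this shape, a sink command only ever has to deal with one kind of input, regardless of what the user wrote on the left-hand side.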

Commands

The following commands are used in the examples below. These are for illustrative purposes and might not be precisely what we have in the core.

client.replay sink<flow>
console.choose str [str] -> str
console.command [str]
.each sink<T> *cmd -> source<U>
export.file sink<flow> format path
export.formats -> [str]
.filter sink<flow> str -> source<flow>
.random [str] -> str
.setheader flow str str
.tap event -> source<flow>
.time -> str
view.select str -> [flow]

In the above, T and U are generic types.

Examples

1

client.replay "~h google.com"

2

"~h google.com" | client.replay

Since client.replay accepts a sink<flow> argument, the filter specification string is resolved using view.select. It is equivalent to:

view.select "~h google.com" | client.replay

Or

client.replay(view.select("~h google.com"))
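
The equivalence between the pipe form and the nested form can be sketched in Python. This is only an illustration of "the LHS becomes the first argument of the RHS": view_select and client_replay below are hypothetical stand-ins operating on plain strings, and the filter is reduced to a substring match rather than real filter syntax.

```python
from typing import Any, Callable, Dict, List


def view_select(spec: str) -> List[str]:
    """Hypothetical stand-in for view.select: substring match on fake flows."""
    flows = ["google.com/a", "example.org/b", "google.com/c"]
    return [f for f in flows if spec in f]


def client_replay(flows: List[str]) -> List[str]:
    """Hypothetical stand-in for client.replay."""
    return [f"replayed {f}" for f in flows]


COMMANDS: Dict[str, Callable[..., Any]] = {
    "view.select": view_select,
    "client.replay": client_replay,
}


def run_pipe(first: Any, stages: List[List[str]]) -> Any:
    """Evaluate `first | stage1 | stage2 ...`, threading the LHS in as argument one."""
    value = first
    for name, *args in stages:
        value = COMMANDS[name](value, *args)
    return value


piped = run_pipe(view_select("google.com"), [["client.replay"]])
nested = client_replay(view_select("google.com"))
```

Both piped and nested hold the same result, which is the sense in which the two notations are interchangeable for finite inputs.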

3

.tap request | .filter "~h google.com" | export.file json ~/foo.json

4

@marked | .setheader $this foo bar

This command has an implicit .each, and is equivalent to:

@marked | .each *.setheader($this foo bar)
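
A minimal Python sketch of the .each semantics described in the proposal may help here. It assumes a command pointer can be modelled as a (callable, argument list) pair and that the sink is an ordinary synchronous iterable; both are simplifications, and each, setheader and THIS are illustrative names only.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

THIS = "$this"  # placeholder token for the current pipe item

# Assumption: a command pointer is a callable plus its unevaluated arguments.
Pointer = Tuple[Callable[..., Any], List[Any]]


def each(sink: Iterable[Any], pointer: Pointer) -> Iterator[Any]:
    """Run the pointed-to command once per item, substituting $this."""
    func, args = pointer
    for item in sink:
        bound = [item if a == THIS else a for a in args]
        result = func(*bound)
        # Emit the return value if there is one, else the (possibly mutated) input.
        yield item if result is None else result


def setheader(flow: Dict[str, str], name: str, value: str) -> None:
    """Hypothetical stand-in for .setheader: mutates the flow in place."""
    flow[name] = value


marked = [{"Host": "a"}, {"Host": "b"}]
modified = list(each(marked, (setheader, [THIS, "foo", "bar"])))
upper = list(each(["a", "b"], (str.upper, [THIS])))
```

The first call shows the in-place case (the input flows are passed through after modification), the second shows the case where the command's return value replaces the item.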

5

@marked |
.setheader $this foo bar |
client.replay

This command has an implicit .each, and is equivalent to:

@marked |
.each *.setheader($this foo bar) |
client.replay

6

.tap request |
.filter "~h google.com" |
.setheader $this foo bar |
export.file json ~/foo.json

7

console.command [
    @focus "|" export.file
    console.choose(Format export.formats())
]

8

.tap request |
.setheader(
    $this
    "Host"
    console.choose(
        Host
        ["google.com" "yahoo.com"]
    )
)

The console.choose function is invoked once, when the command is first run. The header is set to the same value for each flow.

9

.tap request | .setheader($this "Date" &.time())

The time function is invoked anew every time it's used as a value, so the header is always set to the current time, and might have a different value for each flow.
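
The difference between examples 8 and 9 can be sketched in Python by modelling a deferred command as a wrapper that is re-evaluated at each use. Everything here is an assumption made for illustration: Deferred, resolve, setheader_all and the deterministic fake_time are hypothetical names, not part of the proposal.

```python
import itertools
from typing import Any, Callable, Dict, List


class Deferred:
    """Models `&cmd(...)`: re-evaluated every time it is used as a value."""

    def __init__(self, func: Callable[[], Any]) -> None:
        self.func = func


def resolve(arg: Any) -> Any:
    """Turn an argument into a concrete value at the moment of use."""
    return arg.func() if isinstance(arg, Deferred) else arg


def setheader_all(flows: List[Dict[str, str]], name: str, value: Any) -> None:
    """Hypothetical stand-in for `... | .setheader($this name value)`."""
    for flow in flows:
        flow[name] = resolve(value)


counter = itertools.count(1)


def fake_time() -> str:
    """Deterministic stand-in for the time command, so the difference is visible."""
    return f"t{next(counter)}"


immediate = [{}, {}, {}]
setheader_all(immediate, "Date", fake_time())  # invoked once, up front

deferred = [{}, {}, {}]
setheader_all(deferred, "Date", Deferred(fake_time))  # invoked once per flow
```

In the immediate case every flow receives the same value; in the deferred case each flow receives a fresh one.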

10

.setheader @marked foo bar; client.replay @marked

Error examples

1

Type error. Notionally infinite sources may only be used with the pipe operator.

client.replay(.tap(request))

2

Syntax error. Pipes are only valid in the outermost scope.

client.replay(
    @marked | .setheader $this foo bar
)

3

Type error. Pipes can only be used with a sink on the RHS:

export.formats | .random

The correct invocation would be:

.random export.formats()
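
The sink check behind the third error example could look roughly like this in Python. The table of first-parameter types is copied from the illustrative command list above; FIRST_PARAM and check_pipe_chain are hypothetical names, and the sketch only checks the sink side of each pipe, not that the left-hand side is a source.

```python
from typing import Dict, List

# First-parameter types taken from the illustrative command list;
# an empty string means the command takes no parameters.
FIRST_PARAM: Dict[str, str] = {
    "client.replay": "sink<flow>",
    "export.file": "sink<flow>",
    ".filter": "sink<flow>",
    ".each": "sink<T>",
    ".random": "[str]",
    "export.formats": "",
}


def check_pipe_chain(commands: List[str]) -> None:
    """Raise TypeError if any command to the right of a `|` is not a sink command."""
    for name in commands[1:]:
        if not FIRST_PARAM.get(name, "").startswith("sink<"):
            raise TypeError(f"{name} does not accept a sink; it cannot follow '|'")


# Example 3 from the valid examples: every RHS command is a sink command.
ok = check_pipe_chain([".tap", ".filter", "export.file"])

# Error example 3: the random command takes [str], not a sink.
try:
    check_pipe_chain(["export.formats", ".random"])
    raised = False
except TypeError:
    raised = True
```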

cortesi commented Apr 29, 2018

We're converging, but there are a few remaining doubts and disagreements:

  • Both @Kriechi and @mhils feel that example 10 should not be an error. I disagree on a number of grounds. First, I'm afraid of surprising users with commands that hang infinitely, and would prefer to clearly signal those with pipes. Second, if we DO accept nested invocations like this, they will always be categorically less clear than the equivalent pipe notation with the "entry point" initial source at the start. Third, the behaviour can't be defined rigorously when a command contains multiple sources, which this syntax doesn't prevent. Confining ourselves to a single top-level pipe chain means we always have a single starting source, and the sequencing is always clear. I could be convinced IF we have an execution model that doesn't unexpectedly block interaction for the user on infinite commands, and IF we can come up with an acceptable definition for what happens if we have multiple sources in a compound command.
  • I think @mhils feels 11 should not be an error. I think strongly that it should be. Nested pipes will be confusing and aren't defined rigorously (they are equivalent to the invocation style in 10, and have the same issues).
  • I believe @mhils feels that 12 should not be an error. On balance, I believe that we should use pipes for possibly infinite commands only to make the distinction between infinite and finite commands clear. I feel less strongly about this one than some of the other disagreements. Note, though, that if we DO accept this case, we will have to prohibit passing a source to a finite sequence receiver in the type system.
  • I've included command pointers in the proposal, despite my previous suggestion that we might omit them for now. On more reflection, the only rigorous way to define what the $this pipe variable does, is in terms of an each command. Pointers will also be useful in other contexts, and I'm now pretty strongly in favour of including this in the core language. I'm open to two simplifications: 1) using the concept in defining pipe variables, but not actually implementing pointers in the language, OR 2) omitting pipe variables entirely in favour of an explicit each.

I propose that we accept the spec as it stands. This isn't the final word, and there's no doubt we'll adjust it anyway once we have the concrete implementation. We can progressively relax the requirements that make 10 and 12 errors once we get a feel for the language, if we feel it makes sense.

If we have agreement on this, we can commit this initial description to the repo as part of the documentation, where we can expand it over time.

@cortesi cortesi added the gsoc label Apr 30, 2018

kajojify commented May 5, 2018

Overall language spec draft here. Anyone can add anything -> https://docs.google.com/document/d/1PLytUD2KJuEHr4n0JS7ckJ0KBnnWJvJ_q8eqmhSGE5Y/edit?usp=sharing
I would be very glad to get any recommendation from you or see your additions/corrections!

Regarding the remaining doubts and disagreements:
Personally, I like that we divide commands into finite and infinite ones explicitly. Ordinary commands (chained with parentheses) are always finite. Pipe commands may be infinite, but need not be.
However, I don't feel strongly about whether examples 10-12 should raise errors. I think we should start implementing the language, use it for a while, and only then develop logical, well-grounded restrictions based on specific situations.

I feel that defining a pipe variable is better than using an explicit each:

  • the user won't need to remember one more language concept (pointers);
  • the $this variable is pretty readable and, in my opinion, looks like acceptable magic.

cortesi commented May 6, 2018

@kajojify This looks good to me! A few comments:

  • Agreed on keeping infinite commands explicit with the pipe operator.
  • Agreed on the aesthetics of $this. I think we should, however, also have an explicit each command, which is used to define and illustrate the exact behaviour of $this.

I can't wait to get to the point where we have something to play with here. I think a lot of issues will become clear once we have something concrete to work with.

kajojify commented May 7, 2018

EBNF draft for our command language

command_line = ( plain_str | quoted_str | array | command_call_no_parentheses ), { pipe } ;

pipe = whitespaces, "|", whitespaces, command_call_no_parentheses ;

command_call_with_parentheses = command, { ws }, "(", arguments_list, ")", { ws } ;
command_call_no_parentheses = command, whitespaces, arguments_list ;

arguments_list = { ws }, [ argument, { whitespaces, argument }, { ws } ] ;
array = "[", arguments_list, "]" ;
argument = plain_str | quoted_str | array | command_call_with_parentheses ;

command = [ "&" | "*" ], standard_command_format ;
standard_command_format = command_part, ".", command_part, { ".", command_part } ; 
command_part = char, { char } ;

quoted_str = ( '"', { full_char | ws }, '"' ) | ( "'" , { full_char | ws }, "'" ) ;
plain_str = full_char, { full_char } ;

full_char = char | special_symbol ;
char = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | "_" ;
special_symbol = "!" | "@" | "#" | "$" | "%" | "^" | "&" | "*" | "(" | ")" | "+" | "{" | "}" | ":" | "<" | ">" | "?" | "/" | "." | "," | "\'" | ";" | "/" | "]" | "[" | "=" | "-" | "~" ;
whitespaces = ws, { ws } ;
ws =  " " | "\n" | "\r" | "\t" | "\v" | "\f" ; (* All white space symbols should be here *)

cortesi commented May 7, 2018

@kajojify Excellent beginning. A few comments:

  • We might want to support single and double-quoted strings.
  • Command pointers? We should still support these explicitly.
  • I think the compound types like flow_selector, filter_expression, etc. should not occur in the EBNF for the language (IF the intent is to build a parser on this, rather than just explain the landscape). Instead, these are just strings. In not too long, we'll allow addons to define their own types (where a type is mostly the translation from a string to an underlying expansion). So from the language and parser perspective these are strings, and the strings are translated by the interpreter to underlying types. I can see that you're using the stricter definitions to try to distinguish between a plain command invocation and a noncommand_beginning string, but I think we need a different plan here.

kajojify commented May 10, 2018

As you can see, we have some one-word commands like setheader, each, tap, etc. in our examples. The problem is that it is inconvenient to distinguish such commands from plain strings at the lexing stage. After splitting the command line into tokens, we would need another check to determine whether a plain string is a command or not.
I think that all commands (without exception) must look like myaddon.command. So if mitmproxy encounters something like a.b.c.d in the command line, it tries to invoke that command. If the command exists, fine; if it doesn't, it is an error.
Coming back to setheader, each, and similar one-word commands: we need them to fit our command notation. But since such top-level commands will likely be used very often, we need simple and short names for them.
After a conversation with @cortesi, I am proposing the .toplevel_command notation for such commands. We can also use toplevel_command.toplevel_command. The examples will look like:

.tap request |
.filter "~h google.com" |
.setheader $this foo bar |
export.file json ~/foo.json

or

tap.tap request |
filter.filter "~h google.com" |
setheader.setheader $this foo bar |
export.file json ~/foo.json

@mhils @Kriechi, what do you think?

Kriechi commented May 10, 2018

I'm not sure I like this... it adds a lot of characters to type.
Why is it "inconvenient" during/after lexing?

kajojify commented May 10, 2018

Let's say I use a regex-based lexer. I define 4 tokens: PLAINSTR, COMMAND, PIPE and WHITESPACE.
Let's lex this: tap request | export.file json ~/foo.json
We will get something like:

LexToken(PLAINSTR,'tap',1,0)
LexToken(WHITESPACE,' ',1,3)
LexToken(PLAINSTR,'request',1,4)
LexToken(WHITESPACE,' ',1,11)
LexToken(PIPE,'|',1,12)
LexToken(WHITESPACE,' ',1,13)
LexToken(COMMAND,'export.file',1,14)
LexToken(WHITESPACE,' ',1,25)
LexToken(PLAINSTR,'json',1,26)
LexToken(WHITESPACE,' ',1,30)
LexToken(COMMAND,'~/foo.json',1,31)

We see that we can't distinguish the tap command from a plain string. There is an ambiguity. Looking at a top-level command, we can't say for sure whether it is a plain string (and we should raise an error) or a command (and we should invoke it).
So in order to lex tap as COMMAND, we need to put it into the command regex explicitly, like ...|tap|filter.
We could also write an is_command function:

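def is_command(string):
    # get_toplevel_commands() is assumed to return the names of all
    # registered top-level commands; it is not defined in this snippet.
    toplevel_commands = get_toplevel_commands()
    # True for a known top-level command, False otherwise.
    return string in toplevel_commands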

and check every plain string after lexing.
None of this looks good to me.

Kriechi commented May 10, 2018

Why do we need to handle this in the lexer? Why not check it in the next stage (the parser)?

kajojify commented May 10, 2018

The thing is, I don't think checking each plain string is reasonable. Why check anything at all if we can avoid the checks entirely by stating that a plain string containing a dot is a command?
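
To make the dot rule concrete, here is a small regex-based lexer sketch in Python. It is a slightly stricter variant of the rule, assuming command names are made of letters, digits and underscores separated by periods (as in the EBNF draft's command_part), with an optional leading period for top-level commands. The token names follow the ones used above, plus QUOTED; whitespace is skipped rather than emitted, and unmatched characters are silently ignored, which a real lexer should not do.

```python
import re
from typing import List, Tuple

# Assumed token shapes, for illustration only.
TOKEN_RE = re.compile(
    r"""
    (?P<QUOTED>"[^"]*"|'[^']*')   # quoted strings are always strings
  | (?P<PIPE>\|)
  | (?P<WORD>[^\s|"']+)           # any other run of non-space characters
    """,
    re.VERBOSE,
)

# Command form: optional addon name, then one or more ".name" segments.
COMMAND_RE = re.compile(r"(?:\w+)?(?:\.\w+)+")


def lex(line: str) -> List[Tuple[str, str]]:
    """Split a command line into (token kind, text) pairs."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        kind, text = match.lastgroup, match.group()
        if kind == "WORD":
            # No command registry lookup: the shape alone decides.
            kind = "COMMAND" if COMMAND_RE.fullmatch(text) else "PLAINSTR"
        tokens.append((kind, text))
    return tokens


bare = lex("tap request | export.file json ~/foo.json")
dotted = lex('.tap request | ".foo"')
```

Under this rule a bare tap is always a plain string and .tap is always a command, with no lookup needed. One consequence is that an unquoted file name such as foo.json would be lexed as a command and would need quoting.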

cortesi commented May 10, 2018

This proposal is compelling to me. It makes what is and isn't a command clear both visually and syntactically. At the moment, we can have a phrasing like this:

 foo | my.command

It's not clear here whether foo is a command or the string "foo". A fallback here is to say that IF there's a command called "foo" a naked string is considered to be a command invocation, else it's a string. This is not clear visually, though, and behaviour can differ based on what addons are loaded. It can also result in confusing behaviour for the user, if their intent is to invoke a function but they really get a string expanded behind the scenes. It's much clearer to say that naked strings in command-form ("."-delimited, or with a leading ".") are always commands. If you want them to be a string you have to quote them. Now this is a command invocation, which will give an error if the foo command doesn't exist:

 .foo | my.command

And these are strings:

foo | my.command
".foo" | my.command

There's a bigger picture here too. I want to start enforcing name prefixing for addons, so that all commands will be of the form addonname.command and options will be addonname_command. However, we clearly want our own addons to be able to put commands in the "top scope" without prefixes, and I'd argue we should give user addons this ability as well. If we define .foo to be an alias for foo.foo that gives us a very elegant way to handle the situation. Each addon can introduce one global command, commands stay confined within the addon module so there's no encapsulation issue, and it's pretty easy to explain to users what's going on. So, for instance, we currently have a top-level cut command. That would become cut.cut (the addon name prefix would be added automatically), and the user would be able to invoke this as cut.cut or just as .cut.
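
The alias rule is small enough to sketch directly in Python. The function name qualify is hypothetical, and rejecting names like .a.b is an assumption on my part, since the shorthand is only described for a single name.

```python
def qualify(name: str) -> str:
    """Expand the `.cmd` shorthand to `cmd.cmd`; leave qualified names untouched."""
    if name.startswith("."):
        short = name[1:]
        # Assumption: the shorthand covers exactly one name segment.
        if not short or "." in short:
            raise ValueError(f"invalid top-level command name: {name!r}")
        return f"{short}.{short}"
    return name


expanded = qualify(".cut")
unchanged = qualify("export.file")
```

With this in place, the command registry only ever needs to store fully qualified names such as cut.cut.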

cortesi commented May 11, 2018

@kajojify @Kriechi Just so that we don't block on this, let's go with the unambiguous period-delimited commands for the moment. Unambiguous commands in the lexer is the simplest thing at this point, and we can relax that constraint later. These are fine points about the "hand feel" of the language, and I think we need working examples to try it out before another round of ~~bikeshedding~~ design discussions.

@cortesi cortesi added the RFC label May 13, 2018

cortesi commented May 16, 2018

I've just updated the language proposal to include unambiguous delimited command names. I've also added another concept that cropped up while playing with concrete use cases - semicolons to indicate multiple independent commands to be executed in sequence.

Let's now give @kajojify room to implement. After we have something to play with, we can do another round of design tweaks.
