Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EOL character detection #514

Closed
errordeveloper opened this issue Apr 20, 2014 · 15 comments
Closed

EOL character detection #514

errordeveloper opened this issue Apr 20, 2014 · 15 comments

Comments

@errordeveloper
Copy link
Contributor

Looks like \n\r is accepted, while \r\n isn't.

> printf 'def setup():\n\r\tprint("123")\n\r\n\rsetup()\n\r' | ./micropython
Micro Python build 8e998ed-dirty on 2014-04-20; UNIX version
123

> printf 'def setup():\r\n\tprint("123")\r\n\r\nsetup()\r\n' | ./micropython
Micro Python build 8e998ed-dirty on 2014-04-20; UNIX version
  File "<stdin>", line 2, column 1
SyntaxError: invalid syntax
  File "<stdin>", line 1, column 9
IndentationError: unexpected indent
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'setup' is not defined

Looks like the \r\n is the DOS EOL, while the one we accept is an odd one:

> printf 'def setup():\n\r\tprint("123")\n\r\n\rsetup()\n\r' | cat -v
def setup():
^M  print("123")
^M
^Msetup()
^M

> printf 'def setup():\r\n\tprint("123")\r\n\r\nsetup()\r\n' | cat -v
def setup():^M
    print("123")^M
^M
setup()^M
@errordeveloper
Copy link
Contributor Author

I have tried messing around with next_char() and is_physical_newline() like so:

diff --git a/py/lexer.c b/py/lexer.c
index 0360537..0706578 100644
--- a/py/lexer.c
+++ b/py/lexer.c
@@ -79,7 +79,16 @@ STATIC bool is_end(mp_lexer_t *lex) {
 }

 STATIC bool is_physical_newline(mp_lexer_t *lex) {
-    return lex->chr0 == '\n' || lex->chr0 == '\r';
+    if (lex->chr0 == '\n' && lex->chr0 != '\r')
+      return true;
+    else if (lex->chr0 == '\r' && lex->chr0 != '\n')
+      return true;
+    else if (lex->chr0 == '\n' && lex->chr0 == '\r')
+      return true;
+    else if (lex->chr0 == '\r' && lex->chr0 == '\n')
+      return true;
+    else
+      return false;
 }

 STATIC bool is_char(mp_lexer_t *lex, char c) {
@@ -149,18 +158,28 @@ STATIC void next_char(mp_lexer_t *lex) {

     int advance = 1;

-    if (lex->chr0 == '\n') {
+    if (lex->chr0 == '\n' && lex->chr1 != '\r') {
         // LF is a new line
+        printf("LF is a new line\n");
         ++lex->line;
         lex->column = 1;
-    } else if (lex->chr0 == '\r') {
+    } else if (lex->chr0 == '\r' && lex->chr1 != '\n') {
         // CR is a new line
+        printf("CR is a new line\n");
         ++lex->line;
         lex->column = 1;
-        if (lex->chr1 == '\n') {
-            // CR LF is a single new line
-            advance = 2;
-        }
+    } else if (lex->chr0 == '\n' && lex->chr1 == '\r') {
+        // CR LF is a single new line
+        printf("CR LF is a single new line\n");
+        ++lex->line;
+        lex->column = 1;
+        advance = 2;
+    } else if (lex->chr0 == '\r' && lex->chr1 == '\n') {
+        // LF CR is a single new line
+        printf("LF CR is a single new line\n");
+        ++lex->line;
+        lex->column = 1;
+        advance = 2;
     } else if (lex->chr0 == '\t') {
         // a tab
         lex->column = (((lex->column - 1 + TAB_SIZE) / TAB_SIZE) * TAB_SIZE) + 1;
@@ -310,6 +329,7 @@ STATIC void mp_lexer_next_token_into(mp_lexer_t *lex, mp_token_t *tok, bool firs
     bool had_physical_newline = false;
     while (!is_end(lex)) {
         if (is_physical_newline(lex)) {
+            printf("Found an physical new line!\n");
             had_physical_newline = true;
             next_char(lex);
         } else if (is_whitespace(lex)) {

But go nowhere so far. Here is my test script:

printf '\n' | ./micropython
printf '\r' | ./micropython
printf '\n\r\n\r' | ./micropython
printf '\r\n\r\n' | ./micropython

printf 'def setup():\n\tprint("123")\n\nsetup()\n' | ./micropython
printf 'def setup():\r\tprint("123")\r\rsetup()\r' | ./micropython
printf 'def setup():\n\r\tprint("123")\n\r\n\rsetup()\n\r' | ./micropython
printf 'def setup():\r\n\tprint("123")\r\n\r\nsetup()\r\n' | ./micropython

I'm mostly unsure why there are two places where we attempt to detect EOL character...

@pfalcon
Copy link
Contributor

pfalcon commented Apr 20, 2014

But what's exactly the subject of the report here? You seem to presume that uPy unix binary (or maybe all ports at all?) should behave in some particular way, without providing grounds and motivation for that. For me this report looks as good as reporting that "p1r1i1n1t1(111)1" doesn't work and trying to "fix" that.

@dpgeorge
Copy link
Member

If you put your example scripts into files then both kinds of newlines work just fine with uPy.

I think the problem you have is piping the output from printf to micropython.

@errordeveloper
Copy link
Contributor Author

@pfalcon, I think that once \n\r appears tobe accepted as newline caracter sequence, so should \r\n. I am just trying to debug an odd issue and then ended-up looking at the lexer implementation... As I said, lexer appears to have two separate bits of logic, which doesn't seem like a great idea to me.

@errordeveloper
Copy link
Contributor Author

@dpgeorge, that's interesting... In my opinion there should be no difference, whether it's from a file or piped. Although, I am usimg do_str() from the bare ARM port, which haven't double checked. I am in fact getting odd syntax errors even with UNIX EOL now, and it doesn't occur with the UNIX binary... I'll do more debugging and let you know.

@errordeveloper
Copy link
Contributor Author

@dpgeorge now I see why - it's piping a script into REPL vs reading script from file, the functions being called are diffrent. Although, I still would call this a bug, perhaps it's very low priority...

@pfalcon to your point, I have changed my test script to pipe into python3, and it does what I expected - no syntax errors.

printf 'def setup():\n\tprint("123")\n\nsetup()\n' | python3
printf 'def setup():\r\tprint("123")\r\rsetup()\r' | python3
printf 'def setup():\n\r\tprint("123")\n\r\n\rsetup()\n\r' | python3
printf 'def setup():\r\n\tprint("123")\r\n\r\nsetup()\r\n' | python3

@dpgeorge
Copy link
Member

Yes, to get this working you would need to detect (in unix/main.c) if stdin
is a tty or not, and act accordingly.

@pfalcon
Copy link
Contributor

pfalcon commented Apr 21, 2014

@pfalcon to your point, I have changed my test script to pipe into python3,

But do you have reference confirming this is supported usage? I personally never saw somebody doing that. There's "-c" option to CPython to pass Python code to execute on commandline, and that's what I always see being used.

If you can't provide evidence that such usage should behave in particular way, then I'd argue following account is valid: Calling python executable w/o "-c" or filename starts it in interactive mode, where some input is interpreted specially. For example, you may expect that "\t" would lead to completion variants to be output. That doesn't happen with GNU readline, but there's no warranty it won't happen with other readline implementations. In other words, behavior you're trying to use is undefined, and then it's undefined for line-endings too.

Feel free to prove me wrong or at least explain why you think it's worth to define it in particular way.

@errordeveloper
Copy link
Contributor Author

@pfalcon it's a valid point...

I suppose my thinking was biased towards how this works with Ruby, there is ruby command which can interpret either what comes in on standard input, take it from a file name passed as first argument or one-liner passed with -e. For the Ruby REPL one would call irb command, which is different. So my initial expectation would be that it's quite a similar thing with Python, even though you have no separate REPL out of the box and it's all done with just one executable, unless you install IPython or something.

One thing that's might prove my point is that when you input a script to CPython through a pipe or redirection, it doesn't print the version info nor the >>> prompt, unless you run it with -i.

I am not sure about your example with \t, cannot seem to reproduce it here... are you?

Below are the key parts of python3 -h:

-i     : inspect interactively after running script; forces a prompt even
         if stdin does not appear to be a terminal; also PYTHONINSPECT=x
file   : program read from script file
-      : program read from stdin (default; interactive mode if a tty)

I am not saying that UNIX port of uPy should be CLI-compatible, but you asked to prove the point of why should treat standard input why it's not a TTY the same way as we do a file on filesystem. If there is some difficulty with doing that, it's a different question then.

My original point was actually more around what goes on in py/lexer.c with respect to identifying new line charter sequences. One thing I'd suggest on that note is that it would reasonable to simplify the implementation by saying that new line sequence should be the same across the whole program. ..

@pfalcon
Copy link
Contributor

pfalcon commented Apr 21, 2014

  •  : program read from stdin (default; interactive mode if a tty)
    

Ok, thanks, you're right then.

I am not sure about your example with \t, cannot seem to reproduce it here... are you?

No, because, as I mentioned, GNU readline is smart enough to call isatty() itself and not interpret chars from non-tty as commands. But there're bunch of readline replacements (https://github.com/antirez/linenoise , http://www.s11n.net/editline/) which may do or not do that. (Not just an abstract point - GNU readline is GPL, so linking against it causes binary to be GPL in the strict reading of GPL).

Anyway, as you pointed yourself, fixing it may be a bit tricky. And hard to do satisfying everyone. For example, my concern is that it will make uPy bigger/more complicated. One "obvious" way to "fix" it is to just skip "\r" - but then one sweet day someone will come asking "why did you eat CR inside my string literals?", etc.

Well, now that it's proven as something worth a fix, I hope you'll be able to find a good way to do that.

@dpgeorge
Copy link
Member

My original point was actually more around what goes on in py/lexer.c with respect to identifying new line charter sequences. One thing I'd suggest on that note is that it would reasonable to simplify the implementation by saying that new line sequence should be the same across the whole program. ..

They are. I don't see anything wrong, or duplicated code, with how newlines are handled in py/lexer.c. Your re-write of is_physical_newline above is equivalent to the original (check the logic carefully).

is_physical_newline just checks to see if a newline is coming up, it doesn't try to parse the newline. A newline is coming up if either \n or \r is next, irrespective of whether it's \r\n or a single character.

The only thing that needs improving is unix/main.c, detecting if it's connected to a tty or not.

@errordeveloper
Copy link
Contributor Author

Ok, so I found a mega simple fix for my do_str() implementation that I have taken from bare-arm port.

-    mp_parse_node_t pn = mp_parse(lex, MP_PARSE_SINGLE_INPUT, &parse_error_kind);
+    mp_parse_node_t pn = mp_parse(lex, MP_PARSE_FILE_INPUT, &parse_error_kind);

This is not for the isatty() case of course, it's just fixing my original problem where I obtained a script from a remote server and then called do_str() on the buffer and it wouldn't work for a script that is longer then one line. As the newline characters happen to be \r\n I suspected that being handled incorrectly, so, per Damien's recommendation, I the tried testing it on UNIX and picked hackish approach with piping stuff into standard input, which, in turns, got me into this whole puzzle. It all just happen to relate to original issue very vaguely.

The only thing I'm thinking of now is that those constants could be renamed... May be MP_PARSE_LINEWISE and MP_PARSE_MULTILINE are more appropriate? Doesn't look like there are any other semantics associated with MP_PARSE_FILE_INPUT, i.e. streaming or chunked parsing of any sort, as far as I can tell. Do you guys somewhat agree?

@dpgeorge
Copy link
Member

These names follow the names used in the Python 3 grammar specification: https://docs.python.org/3/reference/grammar.html (see the first 3 non-comment lines on that page). I think changing them to something else would just confuse the situation for those who are already familiar with the grammar.

@dpgeorge
Copy link
Member

BTW, MP_SINGLE_INPUT does actually accept multiple lines. The word "single" refers to the fact that it accepts a single statment.

@errordeveloper
Copy link
Contributor Author

Fair enough, I think this can be closed now.

tannewt pushed a commit to tannewt/circuitpython that referenced this issue Mar 9, 2018
tannewt added a commit to tannewt/circuitpython that referenced this issue Mar 9, 2018
fixes hardware dotstar support for 3.0 and addresses issue micropython#514
tannewt pushed a commit to tannewt/circuitpython that referenced this issue Mar 20, 2018
Uncomment lines in mpconfigport.h for gemma_m0 to allow dotstar access.  same issue as micropython#514 for trinket_m0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants