
Implement find / xargs in Python #85

Open
andychu opened this issue Mar 1, 2018 · 39 comments


@andychu commented Mar 1, 2018

I'm marking this "help wanted" -- it could be a fun task for someone who's interested in contributing to Oil, but would rather write new code than wade into the existing code.

I haven't blogged about this, but find should be folded into the shell. It is almost its own shell -- it starts processes (-exec), has a recursive boolean predicate language, and has its own -printf, globs, and regexes! Some detail here:

https://github.com/oilshell/oil/wiki/Unix-Tools

Lots of complex find expressions here:

https://github.com/oilshell/oil/blob/master/test/wild.sh

A couple old comments on the relationship between find and awk. You can think of both as predicate/action languages:

https://lobste.rs/s/jfarwh/find_is_beautiful_tool#c_rkmlpz

https://lobste.rs/s/kkre7i/find_1_revisited_2002#c_yyfqsk

To be consistent with the rest of the project I suppose it should be implemented in Python. I got the OPy compiler running on all of Oil recently, so hopefully that will speed things up and eventually break the dependence on the CPython interpreter.

andychu added the help wanted label Mar 1, 2018


@andychu commented May 9, 2019

FWIW there are dozens of simple find invocations here. I would accept anything that runs these and these only!

https://github.com/oilshell/oil/blob/master/test/wild.sh

That would be a good start for find; it should be under 1000 LOC, maybe less than 500.

@drwilly commented May 22, 2019

Hi Andy, I've started an xargs implementation for you to use in OSH.
I think I've picked all the low-hanging fruit by now, so going forward I wanted to know if there is something from your side that I should pay attention to.

@andychu commented May 22, 2019

OK great!

I tend to follow the strategy of checking something in that passes a few tests at first, and then later increasing the number of tests it passes, and maybe reusing some common code from Oil.

But it makes sense for it to stand alone now. As long as it runs some real xargs invocations, I'm all for it :) Maybe you can write a shell script to show that it behaves the same way as GNU xargs?

I have some scripts in test/gold.sh that could be adapted. All it does is run the same program twice, capturing stdout / exit code, and then it compares those.

@andychu commented May 22, 2019

e.g. the function _compare here:

https://github.com/oilshell/oil/blob/master/test/gold.sh

It's also possible the GNU findutils implementation has some tests, but I tended to follow the strategy of writing my own tests.
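
The same-output check could be sketched in Python like this (a rough sketch of the _compare idea, not the actual gold.sh code; the `compare` helper and the shell paths are placeholders, not Oil's real test setup):

```python
import subprocess

def compare(command, sh_a="/bin/sh", sh_b="/bin/sh"):
    # Run the same command under two shells, capture stdout and the exit
    # code, and report whether they agree (the gold.sh _compare idea).
    def run(sh):
        p = subprocess.run([sh, "-c", command], capture_output=True)
        return p.returncode, p.stdout
    return run(sh_a) == run(sh_b)
```

Comparing a shell against itself always succeeds, of course; the point is to plug a reference shell and the implementation under test into the two slots.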

@drwilly commented May 29, 2019

I've uploaded the latest changes, so status update: xargs.py should now be fully functional for sequential operation. Actually it should be functional for parallel operation as well, but I'm not 100% sure how to test that part, and I have yet to implement the --process-slot-var option.
Otherwise, --version and --show-limits are missing.
Also, I learned that xargs' default input handling is quite arcane: parsing quotes, continuing lines on trailing whitespace, and so on. I'm not sure I caught every corner case; I just hope shlex does the job.
You'll find all the tests I've run so far in the t/ directory.
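
As a rough illustration of what shlex buys here (not xargs.py's actual code; `split_args` is a hypothetical helper):

```python
import shlex

def split_args(text):
    # Approximate xargs' default tokenization with shlex: words separated
    # by whitespace, with single and double quotes honored. GNU xargs has
    # its own quoting rules, so corner cases will differ from this sketch.
    lex = shlex.shlex(text, posix=True)
    lex.whitespace_split = True
    lex.commenters = ""   # xargs has no comment syntax
    return list(lex)
```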

@andychu commented May 30, 2019

OK cool, I cloned it and ran the tests, and they worked. I don't understand everything in the code yet but that's expected :)

Now the one I think may change the code significantly is xargs -P, and that's one of the main reasons I like xargs.

If you grep the Oil repo, there are a bunch of usages of it. In particular the spec tests run in parallel and get a nice speedup from -P. They take about 17 seconds on my machine, and otherwise they would probably take over a minute.

Results: file:///home/andy/git/oilshell/oil/_tmp/spec/index.html

real    0m17.172s
user    1m11.182s
sys     0m21.225s

I'm not sure exactly how xargs -P is implemented. (I read a little bit about make -j, and I think it's slightly different. make -j uses a pipe.)

If you have time to do that, it might be useful for me to chat about it. Or if you think it's easy I wouldn't mind seeing it. I think it makes sense to switch to fork() / exec() for parallel processes rather than subprocess? subprocess does have the ability to do things async perhaps.

spec/show-fd-table.sh:  seq 10 | xargs -n 1 -P 10 --verbose -- $0 show-fd-table
test/arena.sh:  cat $MANIFEST | xargs -n 2 -- $0 _compare-wild
test/common.sh:# For xargs -P in spec-runner.sh, wild-runner.sh.
test/common.sh:  # Long-term solution: our xargs should have --format.
test/lint.sh:  find-src | xargs grep -n $'\t'
test/lint.sh:  find-src | xargs grep -n '^.\{81\}' | grep -v 'http'
test/lint.sh:  #scripts/count.sh oil-osh-files | grep -F '.py' | xargs $0 bin-flake8 --select F841
test/spec-runner.sh:    | xargs -n 1 -P $JOBS --verbose -- $0 run-cases || true
test/spec-runner.sh:    | xargs -n 1 -P 8 --verbose -- $0 test-to-html || true
test/unit.sh:  find . -name '*.pyc' | xargs --no-run-if-empty -- rm || true
test/unit.sh:    return 255  # xargs aborts
test/unit.sh:  time tests-to-run | xargs -n 1 -- $0 run-test-and-maybe-abort
test/unit.sh:  #tests-to-run | xargs python -m unittest
test/unit.sh:  time tests-to-run | xargs -n 1 -- $0 run-test-and-log $tasks_csv || status=1
test/wild-runner.sh:  xargs -n 2 -P $JOBS -- $0 process-file || failed=1
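
The bounded parallelism of -P could be sketched with subprocess like this (an illustration with a hypothetical run_parallel helper, not GNU xargs' or xargs.py's actual scheduling):

```python
import subprocess

def run_parallel(commands, max_procs):
    # Keep at most max_procs children running at once, in the spirit of
    # xargs -P. Returns the number of commands that exited non-zero.
    running = []
    failures = 0
    for argv in commands:
        while len(running) >= max_procs:
            done = [p for p in running if p.poll() is not None]
            if done:
                for p in done:
                    failures += p.returncode != 0
                    running.remove(p)
            else:
                running[0].wait()  # block until at least one child exits
        running.append(subprocess.Popen(argv))
    for p in running:  # don't forget to wait for the last children
        failures += p.wait() != 0
    return failures
```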

@andychu commented May 30, 2019

And I wouldn't mind some feedback on the language design issue. I want Oil to support both existing xargs invocations, and also have something a bit nicer. I'm thinking of each right now. Like in opy/regtest.sh, you'll see:

  all-py-files '%p %P\n' | xargs -n 2 -- $0 _copy $dest

This takes "each 2" things, hence the -n 2, and passes three args to the _copy function. The first is a base path that is fixed, and the second two are the absolute source path %p and the relative path %P.

_copy is:

_copy() {
  local dest_dir=$1
  local src_path=$2
  local dest_rel_path=$3

  local dest=$dest_dir/$dest_rel_path

  dest=${dest%c}  # .pyc -> .py

  mkdir -p $(dirname $dest)
  cp -v --no-target-directory $src_path $dest
}

This would be nicer as something like:

local dest=...  # still in scope
all-py-files ... | each {
  local src_path=$1
  local dest_rel_path=$2
  ...
  cp ...
}

I'm not sure when I'll be able to get to this but I think it's a nice addition for an idiom I use a lot.

One open question is if we should use flags or a different syntax, e.g.

all-py-files | each -P 4 { ... }   # do them in parallel
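
Under the hood, the grouping part of that each could look like this (a hypothetical Python sketch; the `block` argument stands in for the brace body):

```python
from itertools import islice

def each(n, block, words):
    # Take the input n words at a time and invoke the block with them,
    # like xargs -n <n> invoking a shell function. Assumes the number of
    # words is a multiple of n, as the '%p %P\n' format guarantees here.
    it = iter(words)
    while True:
        group = list(islice(it, n))
        if not group:
            break
        block(*group)
```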

@drwilly commented May 30, 2019

> OK cool, I cloned it and ran the tests, and they worked. I don't understand everything in the code yet but that's expected :)

scribbles something about "documentation" onto the TODO list

> Now the one I think may change the code significantly is xargs -P, and that's one of the main reasons I like xargs.

You might want to have a look at GNU parallel.

> If you have time to do that, it might be useful for me to chat about it. Or if you think it's easy I wouldn't mind seeing it. I think it makes sense to switch to fork() / exec() for parallel processes rather than subprocess? subprocess does have the ability to do things async perhaps.

Afaik subprocess is just fine. To be perfectly clear, processes are already started in parallel; just the waiting for children and the processing of the return codes isn't there yet.

@drwilly commented May 30, 2019

> And I wouldn't mind some feedback on the language design issue. I want Oil to support both existing xargs invocations, and also have something a bit nicer. I'm thinking of each right now.

So my first question would be: how does each split its input exactly?
Does it read a line, split that into args, and pass them to the script block? Or is each supposed to behave like xargs -n2?

@andychu commented May 31, 2019

Good question! I've seen the need for a table-like data format in many parts of shell:

https://github.com/oilshell/oil/wiki/TSV2-Proposal

Not just xargs and find, but also awk.

This page needs more work, but the TSV2 part is the data format:

https://github.com/oilshell/oil/wiki/each-Keyword-in-Oil

I created a home page as well: https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil

It looks like a lot, but the core mechanism is very simple. And the reason I want a compatible xargs is so you can skip all this if you're doing something quick and dirty. Just like I do with shell, the xargs and each code should be shared.

The part about starting parallel processes with a rate limit is the interesting thing. The string manipulation is a separate issue; my goal with the TSV2 proposal is to free users from thinking about strings too much. It should "just work" with no corner cases.

@andychu commented May 31, 2019

BTW I've seen GNU parallel around for ages, but as far as I can tell, xargs can do everything it does (and it's installed by default on most distros).

GNU parallel seems like a little language stuffed in a command line tool, which is what I'm trying to get away from in Oil. Since we have a real language and a real parser, we can invent a more intuitive syntax.

On top of the language issue, I think GNU parallel also has its own set of weird quoting / tokenization rules. My goal is to fold that part into Oil with TSV2 so that the equivalents of, say, xargs, awk, cut, find, etc. can use the same rules.

find and xargs already share the -print0 and -0 flags, but IMO that doesn't go far enough.

@andychu commented May 31, 2019

Also, in the short term I would like to get an xargs into the repo that can run test/spec.sh all, and time should show the parallel speedup.

Results: file:///home/andy/git/oilshell/oil/_tmp/spec/index.html

real    0m17.172s
user    1m11.182s
sys     0m21.225s

After looking at the code more, I see you are using os.wait() and the poll method.

https://docs.python.org/2/library/subprocess.html#subprocess.Popen.poll

OK I think that makes sense! Let me try it out.

@andychu commented May 31, 2019

OK I got a chance to try it out... I think there is a bug where it doesn't wait for the last process in -P mode?

andy@lisa:~/git/oilshell/xargs.py$ time seq 1 | ./xargs.py -n 1 -P 2 -- sh -c 'echo sleep $1; sleep $1' dummy
sleep 1

real    0m0.020s
user    0m0.020s
sys     0m0.000s
andy@lisa:~/git/oilshell/xargs.py$ time seq 2 | ./xargs.py -n 1 -P 2 -- sh -c 'echo sleep $1; sleep $1' dummy
sleep 1
sleep 2

real    0m1.027s
user    0m0.024s
sys     0m0.005s
andy@lisa:~/git/oilshell/xargs.py$ time seq 3 | ./xargs.py -n 1 -P 2 -- sh -c 'echo sleep $1; sleep $1' dummy
sleep 1
sleep 2
sleep 3

real    0m2.024s
user    0m0.020s
sys     0m0.006s

@drwilly commented May 31, 2019

> OK I got a chance to try it out... I think there is a bug where it doesn't wait for the last process in -P mode?

In fact it doesn't wait for the last P processes.

@drwilly commented May 31, 2019

> It looks like a lot, but the core mechanism is very simple. And the reason I want a compatible xargs is so you can skip all this if you're doing something quick and dirty. Just like I do with shell, the xargs and each code should be shared.

Yes. xargs goes through a sequence of phases:

  • read the input (until EOF or -E eof_str)
  • split the input into args (pseudo-shell format or -d delimiter)
  • group the args according to the limits given by -L, -n and -s
  • build a full command from a group of args (the command and initial args passed to xargs, and -I)
  • execute said commands (interactive, sequential or in parallel)

I think xargs.py does a decent job of reflecting this in the code -- except for the last two points, which I still have to work on.
Anyway, the second point is where you can easily add a new format. Any format, for that matter.
The thing I'm wondering is whether the last point, the execution of the commands, wouldn't be better left to some external command/module/whatever, so that other tools and constructs could also benefit from the parallel execution, job control, and so on.
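
Those phases can be mocked up end to end in a few lines (a toy sequential sketch assuming shlex-style splitting; the name xargs_sketch is made up, and it skips -L, -s, -I, and parallelism):

```python
import shlex
import subprocess

def xargs_sketch(text, cmd_prefix, n=1):
    words = shlex.split(text)                                   # read + split
    groups = [words[i:i + n] for i in range(0, len(words), n)]  # group by -n
    status = 0
    for group in groups:
        argv = list(cmd_prefix) + group       # build the full command
        status |= subprocess.call(argv) != 0  # execute (sequentially here)
    return status
```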

@andychu commented May 31, 2019

Yes, that's what I'm thinking -- the execution engine should be factored out from input handling.

For the pseudo-shell format, I probably have a parser in Oil that will do! Shell is full of all these silly formats and Oil has 6-7 lexers already, including the big one with 14-15 modes.

http://www.oilshell.org/blog/2019/02/07.html#toc_7

Thanks for laying out the architecture. That makes sense, but it's not apparent from the syntax.

If you look at GNU find it's the same thing. It has a very structured flow, but the syntax completely obscures that. It's basically:

find [ROOTS] [OPTIONS] [EXPR]

where EXPR is a silly expression language embedded in things that look like flags. The semantics are sensible and useful, but the UI is very bad. It seems like >50% of Unix users don't know how to use find :-(

Thanks for working on this -- it looks very useful and I'm excited to try the next version with Oil's shell scripts :)

@drwilly commented May 31, 2019

Actually, speaking of find, I've spent quite some time thinking about that particular tool (independent of this issue here).
The very short version is that you can probably cover like 90% of its use cases by combining a tool that can traverse the file system hierarchy in pre- or postorder with stest.

@andychu commented May 31, 2019

OK interesting, I didn't know about stest. I think that's true, and the remaining 10% is probably just -prune, which I have occasionally used. e.g.

find . -name _tmp -a -prune -o -name '*.py' -a -print

prevents it from walking the huge _tmp directory.

Basically you can modify the traversal while it's happening.

I think the default for find should just be something you can pipe into stest. But for people who want control, then there is the more advanced expression language with side effects.

In Oil it might look something more like:

fs . (name == '_tmp' && prune || name ~ '*.py' && print)

But yeah, I agree 90% of use cases don't need to modify the traversal like that.
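
In Python, the -prune effect falls out of os.walk, which lets the caller edit the directory list in place mid-traversal (a sketch; walk_with_prune is a made-up helper, not code from the repo):

```python
import os

def walk_with_prune(root, prune_names):
    # os.walk yields (path, dirs, files); assigning to dirs[:] stops the
    # walk from descending into the removed directories -- the same effect
    # as find's -prune modifying the traversal while it's happening.
    for path, dirs, files in os.walk(root):
        dirs[:] = [d for d in dirs if d not in prune_names]
        for name in files:
            yield os.path.join(path, name)
```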

@drwilly commented May 31, 2019

May I suggest fs EXPR [path...]? Essentially like grep, with path being optionally read from stdin

@andychu commented May 31, 2019

If I understand correctly,

find . | stest -f

is equivalent to

find . -type f

The latter has less "pipe traffic", which I think is less important than file system traffic. At first I thought they were equivalent in terms of FS system calls, but now I'm not so sure. The stest version might do some extra stats?


That issue aside, to match find, we'll need 3 parts: [flags] [expr] [paths...]. I'm not totally sure about the order, but the opposite might be OK too.

The way I usually do things is to rewrite some real examples from working code. So if you see any more find expressions other than the ones I posted in wild.sh, please send them this way :)

@drwilly commented Jun 1, 2019

So with the latest commits I have polished up the xargs.py source a bit, and I think it is ready for its first steps in the wild.
There is one rather obscure test it still fails (using -I with empty input -- I still haven't figured out why GNU xargs behaves the way it does), and apart from that the only missing feature is --show-limits.

I've had a first look into find, but before I work on that I would like to know if there are any parsing components from Oil that could be used there.

@andychu commented Jun 2, 2019

OK great, I just ran the tests and got the one failure.

!!! replace_empty failed: files differ

But no problem for now -- I will go ahead and try it on Oil's shell scripts.

@andychu commented Jun 2, 2019

Hm actually I see one anomaly. xargs.py takes 3 seconds instead of 2 here. I'm running 6 processes that sleep 1 second (not N seconds), with 3-way parallelism.

Hm and actually seq 10 seems to hang forever?

$ time seq 6 | ./xargs.py -n 1 -P 3 -- sh -c 'echo $1; sleep 1' argv0
1
2
3
4
5
6

real    0m3.035s
user    0m0.036s
sys     0m0.007s
$ time seq 6 | xargs -n 1 -P 3 -- sh -c 'echo $1; sleep 1' argv0
1
2
3
4
5
6

real    0m2.005s
user    0m0.010s
sys     0m0.002s

When I Ctrl-C it's hanging on:

^CTraceback (most recent call last):
  File "./xargs.py", line 302, in <module>
    sys.exit(main(xargs_args))
  File "./xargs.py", line 248, in main
    while subprocs[i].poll() is not None:
KeyboardInterrupt

@andychu commented Jun 2, 2019

And about find -- the way I do it is to write a little grammar, and then a recursive descent parser. The bool_parse.py code is very close to the find expression language -- it has AND, OR, and NOT, like find's -a, -o, and !:

https://github.com/oilshell/oil/blob/master/osh/bool_parse.py

So it should be quite short. We can chat about it on Zulip -- https://oilshell.zulipchat.com/ . There is some discussion there about structured data, which is relevant to both find and xargs.

And a tip I have for writing parsers is to always use a sentinel like None, or Id.Eof_Real in Oil. It makes the edge cases work out!
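
Here's a toy recursive descent parser in that style for a find-like boolean language, using None as the Eof sentinel (the grammar is a simplified invention and the code is structured differently from bool_parse.py):

```python
# Grammar (hypothetical, simplified -- predicates are single words):
#   expr   : term ('-o' term)*
#   term   : factor ('-a'? factor)*      -- juxtaposition means AND
#   factor : '!' factor | '(' expr ')' | WORD
class Parser:
    def __init__(self, tokens):
        self.tokens = iter(tokens)
        self.tok = None
        self.next()

    def next(self):
        self.tok = next(self.tokens, None)  # None is the Eof sentinel

    def parse_factor(self):
        if self.tok == '!':
            self.next()
            return ('not', self.parse_factor())
        if self.tok == '(':
            self.next()
            node = self.parse_expr()
            assert self.tok == ')', 'expected )'
            self.next()
            return node
        word = self.tok
        self.next()
        return ('pred', word)

    def parse_term(self):
        node = self.parse_factor()
        # The Eof sentinel makes this loop condition uniform: no special
        # "end of input" case, None just fails the membership test.
        while self.tok is not None and self.tok not in ('-o', ')'):
            if self.tok == '-a':
                self.next()
            node = ('and', node, self.parse_factor())
        return node

    def parse_expr(self):
        node = self.parse_term()
        while self.tok == '-o':
            self.next()
            node = ('or', node, self.parse_term())
        return node
```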

@drwilly commented Jun 2, 2019

> Hm and actually seq 10 seems to hang forever?

Ahh yes, good ol' synchronization code. I hope I got it right this time.

@andychu commented Jun 3, 2019

OK I tried it out and it looks like it works! Thanks a lot!

I will definitely start from this code since you did a lot of valuable reverse engineering on how xargs behaves.

I would take a PR to put it in tools/xargs.py, next to readlink.py and maybe rename t/ to tools/testdata or something. Or I can do that too depending on your preference.

  1. One thing I might do is allow running it as a builtin. Then you get some nicer syntax errors for flags:

     $ bin/osh -c 'type -z'
       type -z
            ^~
     [ -c flag ]:1: 'type' doesn't accept flag '-z'

  2. Eventually I will statically type it with MyPy. The whole OSH front end is statically typed now, which I believe will aid in making it faster (long story).

  3. Another thing I noticed is that it doesn't use the return value of os.wait(), which is (PID, status). I think you can avoid polling every process on every iteration by using that. I'll have to play with it.

But that is a detail. Let me know if you want to talk about find! We can chat on https://oilshell.zulipchat.com/ if it's more convenient. The bool_parse.py parser is pretty close, and there are several other parsers in the code I wrote using the same style if you grep */*.py

  • _RangeParser
  • _FormatStringParser
  • GlobParser
Using ~/git/oilshell/xargs.py/xargs.py
test/spec-runner.sh test-to-html alias
...
test/spec-runner.sh test-to-html xtrace
'_tmp/web' -> '/home/andy/git/oilshell/oil/web'

*** All 77 tests PASSED

Results: file:///home/andy/git/oilshell/oil/_tmp/spec/index.html

real    0m19.402s
user    1m19.739s
sys     0m22.988s

@andychu commented Jun 3, 2019

BTW here's another recursive descent parser in the repo, not written by me:

https://github.com/oilshell/oil/blob/master/pgen2/pgen.py#L164

That's a parser for the Python grammar language: https://docs.python.org/3/reference/grammar.html

written by Guido van Rossum for lib2to3 over 10 years ago.

http://python-history.blogspot.com/2018/05/the-origins-of-pgen.html

The structure of the code is the same as that of the 3 parsers I mentioned above. The gettoken() corresponds to my Next(). It stores a single token of lookahead in the Parser instance.

And then the parsing methods parse_* correspond to rules in the grammar for the pgen language (yes very meta). And it's helpful to write a grammar line in BNF syntax as a comment in each method. The imperative code in the method more or less directly corresponds to the grammar line.

I'm not sure how familiar you are with this style, but I hope that helps. The thing that made recursive descent "click" for me is to realize it's unlike most textbook recursive algorithms: it mutates the state as it recurses, rather than simply using args and return values. Calling Next() while recursing is perhaps unusual but very effective!


Oh also, that parser builds a DFA while it parses, but it's more conventional to build a tree. That is what all my parsers do. I use Zephyr ASDL (frontend/syntax.asdl) but for simple cases it's reasonable to write the node classes by hand.

@drwilly commented Jun 3, 2019

> I would take a PR to put it in tools/xargs.py, next to readlink.py and maybe rename t/ to tools/testdata or something. Or I can do that too depending on your preference.

You have a PR.
Edit: actually I have missed some stuff that needs cleanup.

> Eventually I will statically type it with MyPy. The whole OSH front end is statically typed now which I believe will aid in making it faster (long story).

Oh god, I hate dynamically typed languages. Whatever time you save in development, you spend debugging. I didn't know Python even had static typing. (Well, technically it's through an external tool, but still.) Anyway, I've gone ahead and added type annotations. I can't actually check them myself right now, but they should serve as a good starting point.

> Another thing I noticed is that it doesn't use the return value of `os.wait()` which is (PID, status). I think you can avoid polling every process on every iteration by using that. I'll have to play with it.

I think the best way to approach this is to write a proper process queue. Something along the lines of multiprocessing.Pool maybe?
Also, it might be a good idea to make use of SIGCHLD instead of calling os.wait() in a loop. Speaking of which, xargs also needs to handle SIGUSR1 and SIGUSR2.

@drwilly commented Jun 3, 2019

> Oh also, that parser builds a DFA while it parses, but it's more conventional to build a tree. That is what all my parsers do. I use Zephyr ASDL (frontend/syntax.asdl) but for simple cases it's reasonable to write the node classes by hand.

I haven't looked into Zephyr so far, but as far as I understand, find is a simple case only on the surface. It actually does lots of optimization on the expression you pass to it, and it can give pretty detailed debug output. The -D flag should give you an idea.

@andychu commented Jun 4, 2019

> I think the best way to approach this is to write a proper process queue. Something along the lines of multiprocessing.Pool maybe? Also it might be a good idea to make use of SIGCHLD instead of calling os.wait() in a loop. Speaking of which, xargs also needs to handle SIGUSR1 and SIGUSR2.

From what I remember about the multiprocessing module, its design was somewhat warped by the requirement to support Windows. The APIs for process management on Windows and Unix are of course very different.

SIGUSR1 and SIGUSR2 are a good point. Would threading.Semaphore or threading.BoundedSemaphore be useful? Maybe it could look like:

  1. Initialize the value of the semaphore to the -P value.
  2. Acquire before starting a process.
  3. On SIGCHLD, release.
  4. On SIGUSR1, release; on SIGUSR2, acquire. (The signal handlers in CPython run on the main thread.)

Although maybe since there are no threads that doesn't make sense. Maybe the cleaner way to do it is like an event loop with the "self-pipe" trick.
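
The self-pipe trick mentioned above looks roughly like this (a minimal Unix-only sketch with a single child, not a full event loop and not xargs.py's code):

```python
import os
import select
import signal
import subprocess

# Self-pipe trick: the SIGCHLD handler writes a byte to a pipe, and the
# main loop select()s on the read end instead of polling every child.
rfd, wfd = os.pipe()
os.set_blocking(wfd, False)

def on_sigchld(signum, frame):
    try:
        os.write(wfd, b"\x00")  # just wake up the main loop
    except BlockingIOError:
        pass  # pipe is full; a wakeup is already pending

signal.signal(signal.SIGCHLD, on_sigchld)

child = subprocess.Popen(["sleep", "0.1"])
select.select([rfd], [], [])       # block until some child exits
os.read(rfd, 1)
status = child.wait()              # reap it; no busy-polling needed
signal.signal(signal.SIGCHLD, signal.SIG_DFL)
```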

Now that I look at xargs.c, it uses sig_atomic_t:

#define MAX_PROC_MAX SIG_ATOMIC_MAX
static volatile sig_atomic_t proc_max = 1;

/* If nonzero, we've been signaled that we can start more child processes. */
static volatile sig_atomic_t stop_waiting = 0;

Let me think about it a bit. I also wonder if it is worth doing the "back end" in C. Anything with Python and signals is a bit weird. (e.g. see #198)

Yeah I think it's better to move onto find for now and circle back to xargs once more of Oil is in C. I won't be surprised if we have to rewrite the process pool part, but that's OK since the point now is to get the interface right.

@andychu commented Jun 4, 2019

About find -- I did a bit of research this morning. The basic syntax from the man page is below ...

And it makes me think that we can use pgen2 to parse it pretty easily (that's what I've been working on lately to parse the Oil language). The grammar should be a pretty straightforward transcription of what's in the man page. Let me try it out and see how it works.

I did notice that find "optimizes" the expression to try to reduce I/O. That is cool, although I have a hard time imagining a case where it matters. We should definitely do that part last. I think they do it by AST -> AST rewrites, so it's an optional step that shouldn't affect the overall architecture of find.


       ( expr )
              Force precedence.  Since parentheses are special to the shell, you will normally need to quote them.  Many of the examples in this manual page use backslashes for this purpose: `\(...\)' instead of `(...)'.

       ! expr True if expr is false.  This character will also usually need protection from interpretation by the shell.

       -not expr
              Same as ! expr, but not POSIX compliant.

       expr1 expr2
              Two expressions in a row are taken to be joined with an implied "and"; expr2 is not evaluated if expr1 is false.

       expr1 -a expr2
              Same as expr1 expr2.

       expr1 -and expr2
              Same as expr1 expr2, but not POSIX compliant.

       expr1 -o expr2
              Or; expr2 is not evaluated if expr1 is true.

       expr1 -or expr2
              Same as expr1 -o expr2, but not POSIX compliant.

       expr1 , expr2
              List;  both expr1 and expr2 are always evaluated.  The value of expr1 is discarded; the value of the list is the value of expr2.  The comma operator can be useful for searching for several different types of thing, but
              traversing the filesystem hierarchy only once.  The -fprintf action can be used to list the various matched items into several different output files.
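
Transcribed into a grammar, those precedence levels (list lowest, then or, then and, then not, then parentheses) might read like this (a hypothetical sketch, not the grammar file that ended up in the repo):

```
expr:   term (',' term)*                 # list -- lowest precedence
term:   factor (('-o'|'-or') factor)*    # or
factor: unary (('-a'|'-and')? unary)*    # and, possibly implied
unary:  ('!'|'-not') unary | '(' expr ')' | PREDICATE
```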

andychu pushed a commit that referenced this issue Jun 5, 2019

Andy Chu
[tools] pgen2 parser for find.
Addresses issue #85.

@andychu commented Jun 5, 2019

@drwilly I wrote up a grammar for find and it works well:

https://github.com/oilshell/oil/blob/master/tools/find/find.pgen2

The grammar is clean -- almost better than the man page! -- but the surrounding parse.py code is ugly. That's because pgen2 was meant to work with Python's tokenizer, so I hacked it to work with a different tokenizer.

If you pull the Oil tree, you should be able to run this. Let me know if you can't.

If you're interested in hacking on this I can clean up the code and maybe add Zephyr ASDL. pgen2 gives you a (homogeneous) parse tree, but it still needs to be converted to a (heterogeneous) AST for execution/optimization.

https://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees

$ tools/find/run.sh unit-tests |head -n 30

--- -type f ---

0 eval_input -
  0 factor -type
    0 -type -type
    1 STRING f
  1 ENDMARKER 

--- -true ---

0 eval_input -
  0 -true -true
  1 ENDMARKER 

--- -type f -print ---

0 eval_input -
  0 term -type
    0 factor -type
      0 -type -type
      1 STRING f
    1 -print -print
  1 ENDMARKER 

@drwilly commented Jun 5, 2019

First off, I'm still sitting on some xargs changes. Seems like I underestimated the quirkiness of GNU xargs. Basically, consider using more than one of -s, -n or -L broken.
Also, see this line? That's a crash that just by accident didn't show up in my tests. Anyway.

> @drwilly I wrote up a grammar for find and it works well:
>
> https://github.com/oilshell/oil/blob/master/tools/find/find.pgen2

Are you sure you got the precedences for list (expr ',' expr) right?
In the man page it says list < or < and < not < group.

@andychu commented Jun 5, 2019

Hm interesting, I didn't know about -s and -L. At times when reverse engineering I suspect that "nobody uses this anymore". There is of course a lot of legacy.

I use -n all the time.

-s looks like it is for "old" systems. I feel like xargs should do the right thing on every system, and users shouldn't need to specify this. Of course that doesn't mean that some script doesn't use it. But my philosophy is that I don't want to prevent us from implementing something in the future; we just don't have to handle every corner case now. I actually prefer to see the "real" use case first, because that helps me design the Oil language (and in this case the "each" keyword).

-L is also a weird input format. It doesn't look hard to support, but I can see why -s and -L together is weird. I think simply raising an error could be OK, like "-s and -L can't be used together"?


I'm not sure yet that I got the precedence right. That should be easy to resolve with some tests!

@drwilly commented Jun 5, 2019

> Hm interesting, I didn't know about -s and -L. At times when reverse engineering I suspect that "nobody uses this anymore". There is of course a lot of legacy.

Oh, it's not like they can't be used together -- they just cancel each other out: xargs -n2 -L1 -n3 is basically just -n3, and -I gets in there as well.
I agree with your assessment though: those options scream "embedded".

andychu pushed a commit that referenced this issue Jun 7, 2019

Andy Chu
[tools/find] Automation to build the 'find' grammar.
Not used yet.

Addresses issue #85.

@jasom commented Jun 14, 2019

FWIW I find that the -exec ... + in GNU find removes most needs for piping find to xargs; xargs can still be useful for other sources of file names though.

@drwilly commented Jun 14, 2019

I'd put it the other way around: properly sanitized xargs input removes the need for find's -exec command.
