/
Parser.pm6
610 lines (481 loc) · 21.8 KB
/
Parser.pm6
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
=begin pod
=begin NAME
Perl6::Parser - Extract a Perl 6 AST from the NQP Perl 6 Parser
=end NAME
=begin SYNOPSIS
my $pt = Perl6::Parser.new;
my $source = Q:to[_END_];
code-goes-here();
that you( $want-to, $parse );
_END_
# Get a fully-parsed data tree
#
my $tree = $pt.to-tree( $source );
say $pt.dump-tree( $tree );
# Return only the displayed tokens (including whitespace) in the document
#
my @token = $pt.to-tokens-only( $source );
# Return all tokens and structures in the document
#
my @everything = $pt.to-list( $source );
# This used to fire BEGIN and CHECK phasers, it no longer does so.
# As of 2019-03-05 $*PURE-PERL no longer exists, because it's still
# a bit of a misnomer. And it tends to mess with the test suites a bit.
# I've instead chosen to just make it the default, and add a comment to
# lines that go directly to the pure Perl6 parser.
=end SYNOPSIS
=begin DESCRIPTION
Uses the built-in Perl 6 parser exposed by the internal nqp module, so that you can parse Perl 6 using Perl 6 itself. If it scares you... well, it probably should. Assuming everything works out, you'll get back a Perl 6 object tree that exactly mirrors your source file's layout, with every bit of whitespace, POD, and code given one or more tokens.
Redisplaying it becomes a matter of calling the C<to-string()> method on the tree, which passes an optional formatting hashref down the tree. You can use the format methods as they are, or add a role or subclass the objects created as you see fit to create new objects.
This process B<will> be simplified and encapsulated in the near future, as reformatting Perl 6 code was what this rather extensive module was designed to do.
I've added fairly extensive debugging documentation to the L<README.md> of this module, along with an internal L<DEBUGGING.pod> file talking about what you're seeing here, and why on B<earth> didn't I do it B<this> way? I have my reasons, but can be talked out of it with a good argument.
Please note that this does compile your code, but it takes the precaution of munging C<BEGIN> and C<CHECK> phasers into C<ENTER> instead so that it won't run at compile time.
As it stands, the C<.parse> method returns a deeply-nested object representation of the Perl 6 code it's given. It handles the regex language, but not the other braided languages such as embedded blocks in strings. It will do so eventually, but for the moment I'm busy getting the grammar rules covered.
While classes like L<EClass> won't go away, their parent classes like L<DecInteger> will remove them from the tree once their validation job has been done. For example, while the internals need to know that L<< $/<eclass> >> (the exponent for a scientific-notation number) hasn't been renamed or moved elsewhere in the tree, you as a consumer of the L<DecInteger> class don't need to know that. The C<DecInteger.perl6> method will delete the child classes so that we don't end up with a B<horribly> cluttered tree.
Classes representing Perl 6 object code are currently in the same file as the main L<Perl6::Parser> class, as moving them to separate files caused a severe performance penalty. When the time is right I'll look at moving these to another location, but as shuffling out 20 classes increased my runtime on my little ol' VM from 6 to 20 seconds, it's not worth my time to break them out. And besides, having them all in one file makes editing en masse easier.
=end DESCRIPTION
=begin DEBUGGING
Some notes on how I go about debugging various issues follow.
Let's take the case of the following glitch: The terms C<$_> and C<0> don't show up in the fragment C<[$_, 0 .. 100]>, for various reasons, mostly because there's a branch that's either not being traversed, or simply a term has gone missing.
The first thing is to break the offending bit of code out so it's easier to debug. The test file I make to do this (usually an existing test file I've got lying around) looks partially like this, with boilerplate stripped out:
my $source = Q:to[_END_];
my @a; @a[$_, 0 .. 100];
_END_
my $p = $pt.parse( $source );
say $p.dump;
my $tree = $pt.build-tree( $p );
say $pt.dump-tree($tree);
is $pt.to-string( $tree ), $source, Q{formatted};
Already a few things might stand out. First, the code inside the here-doc doesn't actually do anything, it'll never print anything to the screen or do anything interesting. That's not the point at this stage in the game. At this point all I need is a syntactically valid bit of Perl 6 that has the constructs that reproduce the bug. I don't care what the code actually does in the real world.
Second, I'm not doing anything that you as a user of the class would do. As a user of a the class, all you have to do is run the to-string() method, and it does what you want. I'm breaking things down into their component steps.
(side note - This will probably have changed in detail since I wrote this text - Consult your nearest test file for examples of current usage.)
Internally, the library takes sevaral steps to get to the nicely objectified tree that you see on your output. The two important steps in our case are the C<.parse( 'text goes here' )> method call, and C<.build-tree( $parse-tree )>.
The C<.parse()> call returns a very raw L<NQPMatch> object, which is the Perl 6 internal we're trying to reparse into a more useful form. Most of the time you can call C<.dump()> on this object and get back a semi-useful object tree. On occasion this B<will> lock up, most often because you're trying to C<.dump()> a L<list> accessor, and that hasn't been implemented for NQPMatch. The actual C<list> accessor works, but the C<.dump()> call will not. A simple workaround is to call C<$p.list.[0].dump> on one of the list elements inside, and hope there is one.
Again, these are NQP internals, and aren't quite as stable as the Perl 6 main support layer.
Once you've got a dump of the offending area, it'll look something like this:
- postcircumfix: [0, $_ ... 100]
- semilist: 0, $_ ... 100
- statement: 1 matches
- EXPR: ...
- 0: ,
- 0: 0
- value: 0
# ...
- 1: $_
- variable: $_
- sigil: $
# ...
- infix: ,
- sym: ,
- O: <object>
- OPER: ,
The full thing will probably go on for a few hundred lines. Hope you've got scrollback, or just use tmux as I do.
With this you can almost immediately jump to what appears to be the offending expression, albeit with a bit of interpretation. You'll see C<- 0: 0> and C<- 1: $_>, and these are exactly the two bits of syntax that have gone missing in our example. Down below you'll see C<- infix: ,> which is where the comma separator goes.
We can combine these facts and reason that we have to search for the bit of code where we find an C<infix> operator and two list elements. The C<- 0> and C<- 1> bits tell us that we're dealing with two list elements, and the C<- infix> and C<- OPER> bits tell us that we've also got two hash keys to find.
Look down in this file for a C<assert-hash-keys()> call attempting to assert the existence of exactly a C<infix> and C<OPER> tag. There may actually be other hash keys in this data structure, as C<.dump()> doesn't report unused hash keys; this caused me a deal of confusion.
Eventually you'll find in the C<_EXPR()> method this bit of code: (due to change, obviously)
when self.assert-hash-keys( $_, [< infix OPER >] ) {
@child.append(
self._infix( $_.hash.<infix> )
)
}
You're most welcome to use the Perl 6 debugger, but what I just do is add a C<say 1;> statement just above the C<@child.append()> method call (and anywhere else I find a C<infix> and C<OPER> hash key hiding, because there are multiple places in the code that match an C<infix> and C<OPER> hash key) and rerun the test.
Now that we've confirmed where the element C<should> be getting generated, but somehow isn't, we need to look at the actual text to verify that this is actually where the C<$_> and C<0> bits are being matched, and we can use another internal debugging tool to help out with that. Add these lines:
key-bounds $_.list.[0];
key-bounds $_.hash.<infix>;
key-bounds $_.list.[1];
key-bounds $_.hash.<OPER>;
key-bounds $_;
And rerun your code. Or make the appropriate calls in the debugger, it's your funeral :) What this function does is return something that looks like this:
45 60 [[0, $_ ... 100]]
The two numbers are the glyphs where the matched text starts and stops, respectively. The bit in between [..] (in this case seemingly doubled, but that's because the text is itself bracketed) is the actual text that's been matched.
So, by now you know what the text you're matching actually looks like, where it is in the string, and maybe eve have a rough idea of why what you're seeing isn't being displayedk
With any luck you'll see that the code simply isn't adding the variables to the list that is being glommed onto the C<@child> array, and can add it.
By convention, when you're dumping a C<- semilist:> match, you can always call C<self._semilist( $p.hash.<semilist> )> in order to get back the list of tokens generated by matching on the C<$p.hash.semilist> object. You might be wondering why I don't just subclass NQPMatch and add this as a generic multimethod or dispatch it in some other way.
Well, the reason it's called C<NQP> is because it's Not Quite Perl 6, and like poor Rudolph, Perl 6 won't let NQP objects join in any subclasing or dispatching games, because it's Not Quite Perl. And yes, this leads to quite a bit of frustruation.
Let's assume that you've found the C<$_.list.[0]> and C<< $_.hash.<infix> >> handlers, and now need to add whitespace between the C<$_> and C<0> elements in your generated code.
Now we turn to the rather bewildering array of methods on the C<Perl6::WS> class. This profusion of methods is because of two things:
* Whitespace can be inside a token, before or after a token, or even simply not be there because it's actually inside the L<NQPMatch> object one or more levels up, so you're never B<quite> sure where your whitespace will be hiding.
* If each of the ~50 classes handled whitespace on its own, I'd have to track down each of the ~50 whitespace-generating methods in order to see which of the whitespace calls is being made. This way I can just look for L<Perl6::WS> and know that those are the only B<possible> places where a whitespace token could be being generated.
=end DEBUGGING
=begin METHODS
=item _roundtrip( Str $perl-code ) returns Perl6::Parser::Root
Given a string containing valid Perl 6 code ... well, return that code. This
is mostly a shortcut for testing purposes, and wil probably be moved out of the
main file.
=item to-tree( Str $source )
This is normally what you want, it returns the Perl 6 parsed tree corresponding to your source code.
=item parse( Str $source )
Returns the underlying NQPMatch object. This is what gets passed on to C<build-tree()> and every other important method in this module. It does some minor wizardry to call the Perl 6 reentrant compiler to compile the string you pass it, and return a match object. Please note that it B<has> to compile the string in order to validate things like custom operators, so this step is B<not> optional.
=item build-tree( Mu $parsed )
Build the Perl6::Element tree from the NQPMatch object. This is the core, and runs the factory which silly-walks the match tree and returns one or more tokens for every single match entry it finds, and B<more>.
=item consistency-check( Perl6::Element $root )
Check the integrity of the data structure. The Factory at its core puts together the structure very sloppily, to give the tree every possible chance to create actual quasi-valid text. This method makes sure that the factory returned valid tokens, which often doesn't happen. But since you really want to see the data round-tripped, most users don't care what the tree loos like internally.
=item to-string( Perl6::Element $tree ) returns Str
Call .perl6 on each element of the tree. You can subclass or override this method in any class as you see fit to properly pretty-print the methods. Right now it's awkward to use, and will probably be removed in favor of an upcoming L<Perl6::Tidy> module. That's why I wrote this yak.. er, module in the first place.
=item dump-tree( Perl6::Element $root ) returns Str
Given a Perl6::Document (or other) object, return a full nested tree of text detailing every single token, for debugging purposes.
=item ruler( Str $source )
Purely a debugging aid, it puts an ASCII ruler above your source so that you don't have to go blind counting whitespace to figure out which ' ' a given token belongs to. As a courtesy it also makes newlines visible so you don't have to count those separately. I might use the visible space character later to make it easier to read, if I happen to like it.
=end METHODS
=begin Further-Information
For further information, there's a L<DEBUGGING.pod> file detailing how to go about tracing down a bug in this module, and an extensive test suite in L<t/README.pod> with some ideas of how I'm structuring the test suite.
=end Further-Information
=end pod
use Perl6::Parser::Factory;
my role Debugging {
method _interrogate-element( Perl6::Element $node ) {
my @problem;
if $node.WHAT.perl eq 'Perl6::Element' {
@problem.push( Q{raw element} );
return @problem;
}
unless $node.^can('is-leaf') {
@problem.push( Q{no leaf test} );
return @problem;
}
unless $node.^can('is-twig') {
@problem.push( Q{no twig test} );
return @problem;
}
unless $node.^can('from') {
@problem.push( Q{no from} );
return @problem;
}
unless $node.^can('to') {
@problem.push( Q{no to} );
return @problem;
}
@problem.push( Q{from} ) if $node.from < 0;
@problem.push( Q{to} ) if $node.to < 0;
@problem.push( Q{cross} ) if $node.from > $node.to;
if $node.is-twig {
given $node {
when Perl6::Block {
@problem.push( Q{structure start} ) if
$node.child[0] !~~
Perl6::Balanced::Enter;
@problem.push( Q{structure end} ) if
$node.child[*-1] !~~
Perl6::Balanced::Exit;
}
}
}
elsif $node.is-leaf {
@problem.push( Q{empty} ) if $node.from == $node.to;
@problem.push( Q{short} ) if
$node.to - $node.from < $node.content.chars;
@problem.push( Q{long} ) if
$node.to - $node.from > $node.content.chars;
given $node {
when Perl6::Comment | Perl6::String | Perl6::Sir-Not-Appearing-In-This-Statement { }
when Perl6::WS | Perl6::Newline {
@problem.push( Q{WS} ) if
$node.content ~~ /\S/;
}
default {
@problem.push( Q{WS} ) if
$node.content ~~ /\s/;
@problem.push( Q{EMPTY} ) if
$node.content eq Q{};
}
}
}
else {
@problem.push( Q{not tree member} );
}
@problem;
}
#constant indent = "\t";
my constant indent = ' ';
method _dump-tree( Perl6::Element $root,
Bool $display-ws = True,
Int $depth = 0 ) {
my $str = ( indent xx $depth ) ~ self.dump-term( $root ) ~ "\n";
if $root.is-twig {
for ^$root.child {
my @problem;
@problem.append(
self._interrogate-element(
$root.child.[$_]
)
);
# Mark the tokens that don't overlap.
#
if $root.child.[$_+1].defined and
$root.child.[$_].to !=
$root.child.[$_+1].from {
@problem.push( 'G' )
}
$str ~= @problem.join( ' ' ) if @problem;
$str ~= self._dump-tree(
$root.child.[$_],
$display-ws, $depth + 1
)
}
}
$str
}
method dump-tree( Perl6::Element $root,
Bool $display-ws = True,
Int $depth = 0 ) {
my $str;
for $.factory.here-doc.keys.sort -> $k {
$str ~= "Here-Doc ($k-{$.factory.here-doc.{$k}})" ~
"\n";
}
$str ~= self._dump-tree( $root, $display-ws, $depth );
$str
}
method dump-term( Perl6::Element $term ) {
my $line = $term.WHAT.perl;
$line ~~ s/'Perl6::'//;
if $term.is-leaf {
if $term ~~ Perl6::Number {
$line ~= " ({$term.content})"
}
# Circumfix operators don't have content.
# Their children do.
elsif $term ~~ Perl6::Operator::PostCircumfix or
$term ~~ Perl6::Operator::Circumfix {
}
else {
$line ~= " ({$term.content.perl})"
}
}
if $term.^can('from') and not (
$term ~~ Perl6::Document | Perl6::Statement
) {
$line ~= " ({$term.from}-{$term.to})";
}
if $term.^can('here-doc') {
$line ~= " <{$term.here-doc}>" if
$term.here-doc and $term.here-doc ne '';
}
$line ~= " (line {$term.factory-line-number})" if
$term.factory-line-number;
if $term.is-end or !$term.next {
$line ~= " -> END";
}
else {
my $next = $term.next;
my $name = $next.WHAT.perl;
$name ~~ s/'Perl6::'//;
my $next-bounds = "{$next.from}-{$next.to}";
$line ~= " -> $name ($next-bounds)";
}
$line;
}
method ruler( Str $source ) {
my Str $munged = substr( $source, 0, min( $source.chars, 72 ) );
$munged ~= '...' if $source.chars > 72;
my Int $blocks = $munged.chars div 10 + 1;
$munged ~~ s:g{ \n } = Q{};
my Str $nums = '';
for ^$blocks {
$nums ~= " {$_+1}";
}
my $ruler = '';
$ruler ~= '#' ~ ' ' ~ $nums ~ "\n";
$ruler ~= '#' ~ ('0123456789' x $blocks) ~ "\n";
$ruler ~= '#' ~ $munged ~ "\n";
}
}
my role Testing {
method _roundtrip( Str $source ) {
my $tree = self.to-tree( $source );
my $formatted = self.to-string( $tree );
$formatted
}
}
my role Validating {
method _consistency-check( Perl6::Element $node ) {
my @problems = self._interrogate-element( $node );
if @problems {
$*ERR.say( @problems ~ ": " ~ $node.perl );
}
if $node.is-twig {
if $node.child.elems > 1 {
for $node.child.kv -> $index, $_ {
self._consistency-check( $_ );
next if $index == 0;
if $node.child.[$index-1] ~~ Perl6::WS and
$node.child.[$index] ~~ Perl6::WS {
$*ERR.say( "Two WS entries in a row" );
}
if $node.child.[$index-1].to !=
$node.child.[$index].from {
$*ERR.say( "Gap between two items" );
}
}
}
}
}
# Just in case we need to pass in parameters later on...
# sigh.
#
method consistency-check( Perl6::Element $root ) {
self._consistency-check( $root );
}
}
my class CompleteIterator {
also does Iterator;
has Perl6::Element $.head;
has Bool $.is-done = False;
method pull-one {
if $.head.is-end {
if $.is-done {
return IterationEnd;
}
else {
$!is-done = True;
return $.head;
}
}
else {
my $elem = $.head;
$!head = $.head.next;
$elem;
}
}
method is-lazy { False }
}
my class TokenIterator {
also does Iterator;
has Perl6::Element $.head;
has Bool $.is-done = False;
method pull-one {
if $.head.is-end-leaf {
if $.is-done {
return IterationEnd;
}
else {
$!is-done = True;
return $.head;
}
}
else {
my $elem = $.head;
$!head = $.head.next;
$!head = $!head.next while !$.head.is-leaf;
$elem;
}
}
method is-lazy { False }
}
my class Munge-Phasers {
has @.munged-BEGIN;
has @.munged-CHECK;
method _munge-element( Perl6::Element $node ) {
return unless $node.^can( 'content' );
for @.munged-BEGIN -> $from {
next unless $node.from <= $from <= $node.to;
next unless $node.content.substr(
$from - $node.from, 'BEGIN'.chars
) eq 'ENTER';
$node.content.substr-rw(
$from - $node.from,
#$from,
'BEGIN'.chars
) = 'BEGIN';
last;
}
for @.munged-CHECK -> $from {
next unless $node.from <= $from <= $node.to;
next unless $node.content.substr(
$from - $node.from, 'CHECK'.chars
) eq 'ENTER';
$node.content.substr-rw(
$from - $node.from,
'CHECK'.chars
) = 'CHECK';
last;
}
}
method _munge-phasers( Perl6::Element $node ) {
self._munge-element( $node );
if $node.is-twig {
for $node.child.kv -> $index, $_ {
self._munge-phasers( $_ );
}
}
}
# Just in case we need to pass in parameters later on...
# sigh.
#
method munge-phasers( Perl6::Element $root ) {
return unless @.munged-BEGIN or @.munged-CHECK;
self._munge-phasers( $root );
}
}
class Perl6::Parser:ver<0.3.0> {
also does Debugging;
also does Testing;
also does Validating;
use nqp;
has $.factory = Perl6::Parser::Factory.new;
has @.munged-BEGIN;
has @.munged-CHECK;
# These could easily be a single method, but I'll separate them for
# testing purposes.
#
method parse( Str $source ) {
my $*LINEPOSCACHE;
my $compiler := nqp::getcomp('perl6');
my $g := nqp::findmethod(
$compiler,'parsegrammar'
)($compiler);
#$g.HOW.trace-on($g);
my $a := nqp::findmethod(
$compiler,'parseactions'
)($compiler);
@.munged-BEGIN = ();
my $munged-source = $source;
$munged-source ~~ s:g{ 'BEGIN' } = 'ENTER';
for @( $/ ) -> $BEGIN {
@.munged-BEGIN.push: $BEGIN.from;
}
$munged-source ~~ s:g{ 'CHECK' } = 'ENTER';
for @( $/ ) -> $CHECK {
@.munged-CHECK.push: $CHECK.from;
}
my $parsed = $g.parse(
$munged-source,
:p( 0 ),
:actions( $a )
);
$parsed;
}
method build-tree( Mu $parsed ) {
my $tree = $.factory.build( $parsed );
my $munger = Munge-Phasers.new(
:munged-BEGIN( @.munged-BEGIN ),
:munged-CHECK( @.munged-CHECK ),
);
$munger.munge-phasers( $tree );
self.consistency-check( $tree ) if
$*CONSISTENCY-CHECK;
$tree
}
method to-tree( Str $source ) {
my $parsed = self.parse( $source );
self.build-tree( $parsed );
}
method to-string( Perl6::Element $tree ) {
my $str = $tree.to-string;
$str;
}
method to-tokens-only( Str $source ) {
my $tree = self.to-tree( $source );
$.factory.thread( $tree );
my $head = $.factory.flatten( $tree );
$head = $head.next while !$head.is-leaf and !$head.is-end-leaf;
Seq.new( TokenIterator.new( :head( $head ) ) );
}
method to-list( Str $source ) {
my $tree = self.to-tree( $source );
$.factory.thread( $tree );
my $head = $.factory.flatten( $tree );
Seq.new( CompleteIterator.new( :head( $head ) ) );
}
}