[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-research/r_u_sure/blob/main/r_u_sure/notebooks/pseudo_parser_demo.ipynb)

##### Copyright 2023 Google LLC.

Licensed under the Apache License, Version 2.0 (the "License");

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Pseudo-Parser Demo

This notebook demonstrates the behaviour of the pseudo-parser developed for use with R-U-SURE by presenting a set of source code examples, each with its corresponding pseudo-parse tree.

## Setup

### Installation

To run this notebook, you need a Python environment with `r_u_sure` installed. 

If you are running this from Colab, you can install it by running the following command:

In [0]:
try:
  import r_u_sure
except ImportError:
  try:
    import google.colab
    in_colab = True
  except ImportError:
    in_colab = False
  
  if in_colab:
    print("Installing r_u_sure from GitHub...")
    %env NUMBA_DISABLE_TBB=1
    %env NUMBA_DISABLE_OPENMP=1
    !pip install "r_u_sure @ git+https://github.com/google-research/r_u_sure"
  else:
    # Don't install in this case, to avoid messing up the python environment.
    print("WARNING: Not running in Colab and r_u_sure not found. "
          "Please install r_u_sure following the instructions in the README.")
    raise

### Imports

In [0]:
import numpy as np
import textwrap
from IPython import display

%matplotlib inline

from r_u_sure.wrappers import parser_tools
from r_u_sure.tree_structure import sequence_node_helpers


## Basic Bracket Matching Examples for Java


Our pseudo-parser supports `python`, `cpp`, `java`, and `javascript`. All of these behave the same way except for `python`, which applies some extra language-specific parse tree transformations. The language-specific parameters are specified in `stack_parser.py`, and it is simple to add additional languages by modifying that file.

We begin with some examples using the simpler Java version.

In [0]:
java_parser_helper = parser_tools.ParserHelper(language="java")

In [0]:
#@title Tokens can be delimited with whitespace

parsed = java_parser_helper.parse_to_nodes('''foo bar''')
print(sequence_node_helpers.render_debug(parsed))

GROUP(ROOT): 'foo bar'
  TOK(CONTENT_LEAF): 'foo'
  DEC: ' '
  TOK(CONTENT_LEAF): 'bar'
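
To make the `TOK` / `DEC` distinction in the output above concrete, here is a minimal standalone sketch of a tokenizer that labels whitespace runs as `DEC` and everything else as `TOK`. The regular expression and the function name `simple_tokenize` are assumptions made for illustration only; this is not how `r_u_sure` tokenizes internally.

```python
import re

# Hypothetical token pattern: a run of whitespace, a run of word characters,
# or any single remaining character (punctuation, brackets, ...).
_TOKEN_RE = re.compile(r"\s+|\w+|[^\w\s]")


def simple_tokenize(source):
  """Returns (kind, text) pairs: 'DEC' for whitespace, 'TOK' otherwise."""
  return [
      ("DEC" if match.group().isspace() else "TOK", match.group())
      for match in _TOKEN_RE.finditer(source)
  ]


tokens = simple_tokenize("foo bar")
for kind, text in tokens:
  print(f"{kind}: {text!r}")
```

Concatenating the token texts reproduces the input exactly, which mirrors how every group in the rendered tree shows the full source text it covers.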


In [0]:
#@title Statements will be split into groups demarcated by semicolons

parsed = java_parser_helper.parse_to_nodes('''foo bar; baz qux;''')
print(sequence_node_helpers.render_debug(parsed))

GROUP(ROOT): 'foo bar; baz qux;'
  GROUP(SPLIT_GROUP): 'foo bar;'
    TOK(CONTENT_LEAF): 'foo'
    DEC: ' '
    TOK(CONTENT_LEAF): 'bar'
    TOK(CONTENT_LEAF): ';'
  GROUP(SPLIT_GROUP): ' baz qux;'
    DEC: ' '
    TOK(CONTENT_LEAF): 'baz'
    DEC: ' '
    TOK(CONTENT_LEAF): 'qux'
    TOK(CONTENT_LEAF): ';'
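
The grouping shown above can be approximated with a short standalone sketch: walk the tokens and close the current group each time a terminator is seen. The function `split_on_terminator` is a hypothetical helper written for this illustration, not part of `r_u_sure`, and it ignores details such as brackets or string literals.

```python
def split_on_terminator(tokens, terminator=";"):
  """Groups tokens so that each group ends with the terminator token."""
  groups = []
  current = []
  for token in tokens:
    current.append(token)
    if token == terminator:
      # The terminator belongs to the group it closes.
      groups.append(current)
      current = []
  if current:
    # Trailing tokens without a terminator form a final group.
    groups.append(current)
  return groups


tokens = ["foo", " ", "bar", ";", " ", "baz", " ", "qux", ";"]
for group in split_on_terminator(tokens):
  print(repr("".join(group)))
```

As in the parser output, the whitespace after the first semicolon ends up at the start of the second group.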


In [0]:
#@title Brackets are matched to yield sub-trees

parsed = java_parser_helper.parse_to_nodes('''foo(bar)''')
print(sequence_node_helpers.render_debug(parsed))

GROUP(ROOT): 'foo(bar)'
  TOK(CONTENT_LEAF): 'foo'
  GROUP(MATCH): '(bar)'
    TOK(MATCH_LEFT): '('
    GROUP(MATCH_INNER): 'bar'
      TOK(CONTENT_LEAF): 'bar'
    TOK(MATCH_RIGHT): ')'


In [0]:
#@title Non-matching brackets are tolerated ...

parsed = java_parser_helper.parse_to_nodes('''foo(bar]''')
print(sequence_node_helpers.render_debug(parsed))

GROUP(ROOT): 'foo(bar]'
  TOK(CONTENT_LEAF): 'foo'
  GROUP(MATCH): '(bar]'
    TOK(MATCH_LEFT): '('
    GROUP(MATCH_INNER): 'bar]'
      TOK(CONTENT_LEAF): 'bar'
      TOK(CONTENT_LEAF): ']'
    TOK(MATCH_RIGHT): ''


In [0]:
#@title ... but the non-matching brackets are handled somewhat arbitrarily

parsed = java_parser_helper.parse_to_nodes('''foo(bar)]''')
print(sequence_node_helpers.render_debug(parsed))

GROUP(ROOT): 'foo(bar)]'
  GROUP(MATCH): 'foo(bar)]'
    TOK(MATCH_LEFT): ''
    GROUP(MATCH_INNER): 'foo(bar)'
      TOK(CONTENT_LEAF): 'foo'
      GROUP(MATCH): '(bar)'
        TOK(MATCH_LEFT): '('
        GROUP(MATCH_INNER): 'bar'
          TOK(CONTENT_LEAF): 'bar'
        TOK(MATCH_RIGHT): ')'
    TOK(MATCH_RIGHT): ']'
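
The three bracket examples above can be reproduced with a small stack-based matcher. The sketch below is one simple policy that happens to yield the same shapes for these inputs; it is an assumption made for illustration, and the actual rules in `stack_parser.py` may differ. Each match is represented as a `(left, inner, right)` tuple, with an empty string standing in for a missing bracket.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def match_brackets(tokens):
  """Builds a nested tree of (left, inner, right) tuples from a token list."""
  # Each stack frame is (opening_bracket, children); the bottom frame is the root.
  stack = [("", [])]
  for token in tokens:
    if token in PAIRS:
      stack.append((token, []))
    elif token in CLOSERS:
      left, inner = stack[-1]
      if len(stack) > 1 and PAIRS[left] == token:
        # Proper match: close the frame and attach it to its parent.
        stack.pop()
        stack[-1][1].append((left, inner, token))
      elif len(stack) == 1:
        # Stray closer at top level: wrap everything so far, with an empty left.
        stack[0] = ("", [("", inner, token)])
      else:
        # Mismatched closer inside an open bracket: keep it as plain content.
        inner.append(token)
    else:
      stack[-1][1].append(token)
  # Brackets still open at the end are closed with an empty right bracket.
  while len(stack) > 1:
    left, inner = stack.pop()
    stack[-1][1].append((left, inner, ""))
  return stack[0][1]


print(match_brackets(["foo", "(", "bar", ")"]))
print(match_brackets(["foo", "(", "bar", "]"]))
print(match_brackets(["foo", "(", "bar", ")", "]"]))
```

Because the matcher never rejects its input, it always returns a tree, which is the property that matters when parsing incomplete or malformed code.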


## More Complex Bracket Matching Examples for Python

Python gives whitespace syntactic meaning (newlines end statements and indentation delimits blocks), which we account for in our pseudo-parser.

In [0]:
python_parser_helper = parser_tools.ParserHelper(language="python")

In [0]:
#@title Splitting occurs on newlines rather than whitespace:

parsed = python_parser_helper.parse_to_nodes('''foo bar\nbaz qux\n''')
print(sequence_node_helpers.render_debug(parsed))

GROUP(ROOT): 'foo bar\nbaz qux\n'
  GROUP(SPLIT_GROUP): 'foo bar\n'
    TOK(CONTENT_LEAF): 'foo'
    DEC: ' '
    TOK(CONTENT_LEAF): 'bar'
    DEC: '\n'
  GROUP(SPLIT_GROUP): 'baz qux\n'
    TOK(CONTENT_LEAF): 'baz'
    DEC: ' '
    TOK(CONTENT_LEAF): 'qux'
    DEC: '\n'


In [0]:
#@title Splitting does not occur on newlines contained in brackets:

parsed = python_parser_helper.parse_to_nodes('''[foo, bar,\nbaz, qux]\n''')
print(sequence_node_helpers.render_debug(parsed))

GROUP(ROOT): '[foo, bar,\nbaz, qux]\n'
  GROUP(SPLIT_GROUP): '[foo, bar,\nbaz, qux]\n'
    GROUP(MATCH): '[foo, bar,\nbaz, qux]'
      TOK(MATCH_LEFT): '['
      GROUP(MATCH_INNER): 'foo, bar,\nbaz, qux'
        TOK(CONTENT_LEAF): 'foo'
        TOK(CONTENT_LEAF): ','
        DEC: ' '
        TOK(CONTENT_LEAF): 'bar'
        TOK(CONTENT_LEAF): ','
        DEC: '\n'
        TOK(CONTENT_LEAF): 'baz'
        TOK(CONTENT_LEAF): ','
        DEC: ' '
        TOK(CONTENT_LEAF): 'qux'
      TOK(MATCH_RIGHT): ']'
    DEC: '\n'
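
The two newline examples above suggest a simple rule: a newline ends a group only when no bracket is currently open. The standalone sketch below illustrates that rule by tracking bracket depth character by character. The function `split_lines_outside_brackets` is hypothetical and deliberately ignores string literals and comments, so it is only an approximation of what a real parser must do.

```python
def split_lines_outside_brackets(source):
  """Splits source after each newline that occurs at bracket depth zero."""
  openers = "([{"
  closers = ")]}"
  pieces = []
  current = []
  depth = 0
  for char in source:
    current.append(char)
    if char in openers:
      depth += 1
    elif char in closers:
      # Clamp at zero so a stray closer does not break later splitting.
      depth = max(depth - 1, 0)
    elif char == "\n" and depth == 0:
      pieces.append("".join(current))
      current = []
  if current:
    pieces.append("".join(current))
  return pieces


print(split_lines_outside_brackets("foo bar\nbaz qux\n"))
print(split_lines_outside_brackets("[foo, bar,\nbaz, qux]\n"))
```

The first call yields two pieces, one per line, while the second yields a single piece because the inner newline sits inside the square brackets.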


In [0]:
#@title Python indents and dedents are matched (and rendered as empty strings here)
parsed = python_parser_helper.parse_to_nodes(
'''
if x:
 y=x
 return y
'''[1:])
print(sequence_node_helpers.render_debug(parsed))

GROUP(ROOT): 'if x:\n y=x\n return y\n'
  GROUP(SPLIT_GROUP): 'if x:\n y=x\n return y\n'
    GROUP(SPLIT_GROUP): 'if x:\n'
      TOK(CONTENT_LEAF): 'if'
      DEC: ' '
      TOK(CONTENT_LEAF): 'x'
      TOK(CONTENT_LEAF): ':'
      DEC: '\n'
    GROUP(SPLIT_GROUP): ' y=x\n return y\n'
      GROUP(MATCH): ' y=x\n return y'
        TOK(MATCH_LEFT): ''
        GROUP(MATCH_INNER): ' y=x\n return y'
          GROUP(SPLIT_GROUP): ' y=x\n'
            DEC: ' '
            TOK(CONTENT_LEAF): 'y'
            TOK(CONTENT_LEAF): '='
            TOK(CONTENT_LEAF): 'x'
            DEC: '\n'
          GROUP(SPLIT_GROUP): ' return y'
            DEC: ' '
            TOK(CONTENT_LEAF): 'return'
            DEC: ' '
            TOK(CONTENT_LEAF): 'y'
        TOK(MATCH_RIGHT): ''
      DEC: '\n'
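
For comparison, Python's own standard-library tokenizer also treats indentation as a pair of matched tokens, emitting an explicit `INDENT` when a block opens and a `DEDENT` when it closes. The sketch below runs the `tokenize` module on the same snippet; it only illustrates the general idea of matched indents and says nothing about how the pseudo-parser is implemented.

```python
import io
import tokenize

source = "if x:\n y=x\n return y\n"

# generate_tokens takes a readline callable and yields TokenInfo tuples.
token_names = [
    tokenize.tok_name[token.type]
    for token in tokenize.generate_tokens(io.StringIO(source).readline)
]

print(token_names)
print("INDENT count:", token_names.count("INDENT"))
print("DEDENT count:", token_names.count("DEDENT"))
```

The single `INDENT` / `DEDENT` pair plays the same role as the empty-string `MATCH_LEFT` and `MATCH_RIGHT` tokens in the tree above.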


In [0]:
#@title We infer the number of spaces per Python indent / dedent
parsed = python_parser_helper.parse_to_nodes(
'''
if x:
      y=x
      return y
'''[1:])
print(sequence_node_helpers.render_debug(parsed))

GROUP(ROOT): 'if x:\n      y=x\n      return y\n'
  GROUP(SPLIT_GROUP): 'if x:\n      y=x\n      return y\n'
    GROUP(SPLIT_GROUP): 'if x:\n'
      TOK(CONTENT_LEAF): 'if'
      DEC: ' '
      TOK(CONTENT_LEAF): 'x'
      TOK(CONTENT_LEAF): ':'
      DEC: '\n'
    GROUP(SPLIT_GROUP): '      y=x\n      return y\n'
      GROUP(MATCH): '      y=x\n      return y'
        TOK(MATCH_LEFT): ''
        GROUP(MATCH_INNER): '      y=x\n      return y'
          GROUP(MATCH): '      y=x\n      return y'
            TOK(MATCH_LEFT): ''
            GROUP(MATCH_INNER): '      y=x\n      return y'
              GROUP(SPLIT_GROUP): '      y=x\n'
                DEC: '      '
                TOK(CONTENT_LEAF): 'y'
                TOK(CONTENT_LEAF): '='
                TOK(CONTENT_LEAF): 'x'
                DEC: '\n'
              GROUP(SPLIT_GROUP): '      return y'
                DEC: '      '
                TOK(CONTENT_LEAF): 'return'
                DEC: ' '
                TOK(CONTENT_LEAF): 'y