Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add custom cell magic parser #213

Merged
merged 20 commits into from Sep 9, 2020
Merged

Conversation

@plamut
Copy link
Contributor

@plamut plamut commented Aug 4, 2020

Fixes #166.
Fixes #108.

This PR is to preview the custom cell magic line parser.

PR checklist

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
@google-cla google-cla bot added the cla: yes label Aug 4, 2020
@plamut
Copy link
Contributor Author

@plamut plamut commented Aug 4, 2020

@shollyman I would like to hear your thoughts on this.

The current cell magic line parser is not aware of Python structures (dicts, lists ...) and can thus incorrectly parse the --params option (example issues: #166 and #108). The default parser breaks these structures on whitespace, and while we stitch these parts back together and operate on the result, this only happens after the parsing has already taken place and bugs have sneaked in.

I took a stab at it and came up with a POC custom parser that successfully parses non-trivial input lines such as the following:

%%bigquery    target_var --use_legacy_sql --project my.project.name --params {'nested_dict': {'bar': 'baz', 'some_tuple': (-1, 3.2,)}, "answer": -42, 'array': ['aaa bbb', ["nested", 'list'], 'ccc ddd'], '--params': 'tricky --inner-option string literal',} --verbose  

The following test snippets demonstrates that the parser correctly recognized the options:

from google.cloud.bigquery.ipython_magics import line_arg_parser as lap

example_line = (
    "  target_var "
    "--use_legacy_sql "
    "--project my.project.name "
    "--params {'nested_dict': {'bar': 'baz', 'some_tuple': (-1, 3.2,)}, "
    "\"answer\": -42, 'array': ['aaa bbb', [\"nested\", 'list'], 'ccc ddd'], "
    "'--params': 'tricky --inner-option string literal',} "
    "--verbose  "
)

lexer = lap.lexer.Lexer(line)
parser = lap.parser.Parser(lexer)
parse_tree = parser.input_line()

printer = lap.visitors.TreePrinter()
printer.visit(parse_tree)

Output:

Input:
    destination_var:    target_var
    Command options:
        --use_legacy_sql 
        --project my.project.name
        --params {'nested_dict': {'bar': 'baz', 'some_tuple': (-1, 3.2)}, "answer": -42, 'array': ['aaa bbb', ["nested", 'list'], 'ccc ddd'], '--params': 'tricky --inner-option string literal'}
        --verbose 

The custom parser could be used to extract out the --params <...> part, evaluate it separately, and only feed the remainder of the input to the default parser.

Do you think we should pursue this further?

The grammar used is of course a simplification that is aimed to be "good enough" to reliably extract the --params part in typical uses cases, but it does not try to be bullet-proof. For instance, it currently does not support non-string dict keys, sets, floats in exponential notation, etc., although that can all be added depending on how much we want to complicate. :)


cc: @tswast Do customers complain about the default parser's limitations often? Do you think there even exist reasonable use cases where --params`dictionary would be incredibly complex?

Loading

@tswast
Copy link
Contributor

@tswast tswast commented Aug 4, 2020

Do customers complain about the default parser's limitations often?

I wouldn't say often, but it has come up more than once

Loading

@plamut plamut force-pushed the fix-cellmagic-parser branch 2 times, most recently from 5aa7f07 to e778cf2 Aug 7, 2020
@plamut plamut force-pushed the fix-cellmagic-parser branch from e778cf2 to 4e30e4c Aug 17, 2020
@plamut plamut force-pushed the fix-cellmagic-parser branch 8 times, most recently from 5359b3a to b514850 Aug 27, 2020
@plamut plamut requested a review from cguardia Aug 28, 2020
@plamut plamut force-pushed the fix-cellmagic-parser branch from b514850 to 5a6257a Aug 28, 2020
@plamut plamut requested a review from shollyman Aug 28, 2020
@plamut plamut force-pushed the fix-cellmagic-parser branch from 5a6257a to 976567c Aug 28, 2020
@plamut plamut marked this pull request as ready for review Aug 28, 2020
@shollyman shollyman requested a review from tswast Sep 1, 2020
Copy link
Contributor

@cguardia cguardia left a comment

This looks pretty good. Thanks.

Loading

Loading
Loading
The parser should accept as wide a range of values as possible and let
the code that delas with the semantics to decide whether the values are
good or not.
plamut added 3 commits Sep 2, 2020
The --params option spec must be followed by a non-alphanumeric
character, otherwise it's a different option spec (e.g. --paramsX).
@plamut plamut requested a review from cguardia Sep 2, 2020
Copy link
Contributor

@cguardia cguardia left a comment

I think this is looking good.

Loading

@plamut plamut force-pushed the fix-cellmagic-parser branch from 767e60b to 5249c6a Sep 3, 2020
Copy link
Contributor

@tswast tswast left a comment

This is awesome! It does require a bit of refactoring, though. We need to keep google.cloud.bigquery.magics module/package name due to google.cloud.bigquery.magics.context

I'd also like to see if we can lean on pyparsing for a lot of the lexing / parsing work. I worry that those modules are going to be particularly challenging to maintain without either "literate programming" level comments or relying on common modules to make the parser easier to understand.

Loading

docs/magics.rst Outdated Show resolved Hide resolved
Loading
google/cloud/bigquery/ipython_magics/__init__.py Outdated Show resolved Hide resolved
Loading
Loading
Loading
The context still needs to be importable from the old path
@plamut plamut requested a review from tswast Sep 4, 2020
@plamut plamut force-pushed the fix-cellmagic-parser branch from 1a97836 to 19af056 Sep 4, 2020
Token = namedtuple("Token", ("type_", "lexeme", "pos"))
StateTransition = namedtuple("StateTransition", ("new_state", "total_offset"))

# Pattern matching is done with regexes, and the order in which the token patterns are
Copy link
Contributor

@tswast tswast Sep 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this explanation and the clearer state names!

Fun fact: anything that can be parsed with a regular expression can also be parsed by a finite state machine and vice versa. It's called a regular language. That means we might have been able to convert this into one giant regular expression, but I like having these explicit states better.

I recall that lookahead & backtracking isn't a regular expression in the computer science definition, so I don't think this is technically a "regular" language, though.

Loading

Copy link
Contributor Author

@plamut plamut Sep 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, regular expressions correspond to finite automata, although the regex implementation is a bit more powerful than that (AFAIK), as it supports things like lookahead and lookbehind.

The language itself recognized by the parser is not regular, true, as the grammar allows for recursive patterns such as nested dicts. A proof of non-regularity is left as an exercise to the reader. 😆

(hint: pumping lemma).

Loading

google/cloud/bigquery/magics/line_arg_parser/lexer.py Outdated Show resolved Hide resolved
Loading
google/cloud/bigquery/magics/line_arg_parser/parser.py Outdated Show resolved Hide resolved
Loading
Loading
Loading
Loading
tests/unit/test_magics.py Outdated Show resolved Hide resolved
Loading
plamut added 5 commits Sep 9, 2020
Apparently black just places all implicitly concatenated string
literals in a single line when short enough without replacing them
with a single string literal.
@plamut plamut force-pushed the fix-cellmagic-parser branch from 806a4ad to ed01a66 Sep 9, 2020
@plamut plamut requested a review from tswast Sep 9, 2020
plamut added 2 commits Sep 9, 2020
This is necessary to retain Python 2 compatibility.
The tokens are designed in a way that the scanner *always* returns
some match, even if just UNKNOWN or EOL. The "no matches" code path
can thus never be taken, but the coverage check can't know that.
tswast
tswast approved these changes Sep 9, 2020
Copy link
Contributor

@tswast tswast left a comment

Great work! Thanks for the clear explanations.

Loading

@tswast tswast merged commit dcfbac2 into googleapis:master Sep 9, 2020
10 checks passed
Loading
@plamut plamut deleted the fix-cellmagic-parser branch Sep 10, 2020
gcf-merge-on-green bot pushed a commit that referenced this issue Sep 22, 2020
🤖 I have created a release \*beep\* \*boop\* 
---
## [1.28.0](https://www.github.com/googleapis/python-bigquery/compare/v1.27.2...v1.28.0) (2020-09-22)


### Features

* add custom cell magic parser to handle complex `--params` values ([#213](https://www.github.com/googleapis/python-bigquery/issues/213)) ([dcfbac2](https://www.github.com/googleapis/python-bigquery/commit/dcfbac267fbf66d189b0cc7e76f4712122a74b7b))
* add instrumentation to list methods ([#239](https://www.github.com/googleapis/python-bigquery/issues/239)) ([fa9f9ca](https://www.github.com/googleapis/python-bigquery/commit/fa9f9ca491c3f9954287102c567ec483aa6151d4))
* add opentelemetry tracing ([#215](https://www.github.com/googleapis/python-bigquery/issues/215)) ([a04996c](https://www.github.com/googleapis/python-bigquery/commit/a04996c537e9d8847411fcbb1b05da5f175b339e))
* expose require_partition_filter for hive_partition ([#257](https://www.github.com/googleapis/python-bigquery/issues/257)) ([aa1613c](https://www.github.com/googleapis/python-bigquery/commit/aa1613c1bf48c7efb999cb8b8c422c80baf1950b))


### Bug Fixes

* fix dependency issue in fastavro ([#241](https://www.github.com/googleapis/python-bigquery/issues/241)) ([2874abf](https://www.github.com/googleapis/python-bigquery/commit/2874abf4827f1ea529519d4b138511d31f732a50))
* update minimum dependency versions ([#263](https://www.github.com/googleapis/python-bigquery/issues/263)) ([1be66ce](https://www.github.com/googleapis/python-bigquery/commit/1be66ce94a32b1f924bdda05d068c2977631af9e))
* validate job_config.source_format in load_table_from_dataframe ([#262](https://www.github.com/googleapis/python-bigquery/issues/262)) ([6160fee](https://www.github.com/googleapis/python-bigquery/commit/6160fee4b1a79b0ea9031cc18caf6322fe4c4084))


### Documentation

* recommend insert_rows_json to avoid call to tables.get ([#258](https://www.github.com/googleapis/python-bigquery/issues/258)) ([ae647eb](https://www.github.com/googleapis/python-bigquery/commit/ae647ebd68deff6e30ca2cffb5b7422c6de4940b))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants