TEL parser tweaks + standardized AST nodes (to abstract ANTLR from users) #13

Status: Open. Wants to merge 32 commits into base: master.

Commits (32)
- b9bb0a7: removed unused boilerplate (dvdotsenko, Nov 8, 2020)
- 5cf265c: local test runner (dvdotsenko, Nov 8, 2020)
- 6a1c306: Adam's taxon_expr > taxon patch (dvdotsenko, Nov 8, 2020)
- 7570f39: harden WORD - not starting with number, push up literals higher in ra… (dvdotsenko, Nov 8, 2020)
- 1edb2d3: reorder dependencies with higher-order structures higher in file (dvdotsenko, Nov 9, 2020)
- e5aab07: allow mixed case in keywords: True == tRUE (dvdotsenko, Nov 9, 2020)
- edacb52: split Lexer and Parser into separate files (dvdotsenko, Nov 9, 2020)
- 47efd54: Py and JS code adapted / rerendered with new split Lexer, MODULE AND … (dvdotsenko, Nov 9, 2020)
- fe85ebf: dump out Tel parser. Embed into PqlParser and expose as own entry point (dvdotsenko, Nov 9, 2020)
- 849e936: PQLParser example + test (dvdotsenko, Nov 9, 2020)
- be271be: add column ::TypeCast() and AS Alias support (dvdotsenko, Nov 10, 2020)
- 764aefe: switch PQL parser to produce AST + ast.Node family of classes (dvdotsenko, Nov 12, 2020)
- 47d6070: round-trip PQL <> AST and AST <> JSON parsers, renderers (dvdotsenko, Nov 13, 2020)
- fdf4e1c: JSON representation now uses `__typename` as name of key for node name (dvdotsenko, Nov 14, 2020)
- ff1368a: add ANTLR support for SQL-like `set key = value` statement for commun… (dvdotsenko, Nov 14, 2020)
- dc058e3: add support for FROM statement (dvdotsenko, Nov 15, 2020)
- 995f29a: fully unpack TEL expressions into AST (was kept as string) (dvdotsenko, Nov 15, 2020)
- e3bcd07: make sure that parser speaks ARRAYS not single statements - to reflec… (dvdotsenko, Nov 15, 2020)
- bc03863: add ast.tools.find_all and tests (dvdotsenko, Nov 15, 2020)
- 26cf2f8: allow Node instances to be hashable (dvdotsenko, Nov 16, 2020)
- dba29df: add test hashable (dvdotsenko, Dec 3, 2020)
- 04df449: don't let robots format. This code is for humans (dvdotsenko, Dec 3, 2020)
- 53ead64: reduce scope to TEL only (dvdotsenko, Dec 5, 2020)
- 6f40979: rename `function` token to `fn` to avoid reserved word collision in JS (dvdotsenko, Dec 5, 2020)
- cac4984: boilerplate fixup (dvdotsenko, Dec 5, 2020)
- cadc42e: add .raw_value property to ast.Taxon to standardize taxon value expre… (dvdotsenko, Dec 5, 2020)
- 9a38f23: enable LIKE, BETWEEN and IN expression operators (dvdotsenko, Dec 5, 2020)
- 5c1afa6: FIX - Taxon.raw_value prop renamed to Taxon.value to reflect "process… (dvdotsenko, Dec 5, 2020)
- e5744d6: tone down agreegious unary parsing. Ignore unary + and merge - into n… (dvdotsenko, Dec 5, 2020)
- 6e346d4: change Visitor parser helpers from imperative class methods to chaine… (dvdotsenko, Dec 8, 2020)
- 0bcf50f: add ILIKE support (dvdotsenko, Dec 8, 2020)
- 6e73986: move ANTLR visitor responsible for TEL-to-AST extraction to seprate file (dvdotsenko, Dec 8, 2020)
85 changes: 69 additions & 16 deletions Makefile
VENDOR_NAME:=panoramic
IMAGE_NAME:=tel-grammar
JAVA_IMAGE_NAME_FULL?=$(VENDOR_NAME)/java-$(IMAGE_NAME)
PYTHON_IMAGE_NAME_FULL?=$(VENDOR_NAME)/python-$(IMAGE_NAME)
PYTHON_IMAGE_TESTS_NAME_FULL?=$(VENDOR_NAME)/python-$(IMAGE_NAME)-tests

WORKDIR=/usr/src/app

image-java:
	docker build \
		--pull \
		-t $(JAVA_IMAGE_NAME_FULL):latest \
		-f docker/Dockerfile-java .

image-python:
	docker build \
		--pull \
		-t $(PYTHON_IMAGE_NAME_FULL):latest \
		-f docker/Dockerfile-python .

.PHONY: image-java image-python

_TEST_IMAGE_MARKER:=/tmp/.$(VENDOR_NAME)-$(IMAGE_NAME)-testrunner-done
$(_TEST_IMAGE_MARKER): python/requirements.txt python/requirements-tests.txt
	docker build \
		-t $(PYTHON_IMAGE_TESTS_NAME_FULL) \
		-f docker/Dockerfile-python-tests .
	touch $(_TEST_IMAGE_MARKER)

test-dev: $(_TEST_IMAGE_MARKER)
	docker run -it --rm \
		-v $(PWD)/python:$(WORKDIR) \
		--workdir ${WORKDIR} \
		$(PYTHON_IMAGE_TESTS_NAME_FULL) \
		pytest -s tests/

# see shipping/Jenkinsfile and keep in sync
test:
	docker run -it --rm \
		-v $(PWD):$(WORKDIR) \
		--workdir ${WORKDIR} \
		python:3.7 \
		bash -c "pip install --upgrade tox && tox -e py37 -c python/tox.ini"
	docker run -it --rm \
		-v $(PWD):$(WORKDIR) \
		--workdir ${WORKDIR} \
		python:3.8 \
		bash -c "pip install --upgrade tox && tox -e py38 -c python/tox.ini"
	docker run -it --rm \
		-v $(PWD):$(WORKDIR) \
		--workdir ${WORKDIR} \
		python:3.9 \
		bash -c "pip install --upgrade tox && tox -e py39 -c python/tox.ini"

.PHONY: test test-dev

image-antlr:
	DOCKER_BUILDKIT=1 docker build \
		-t antlr \
		-f docker/Dockerfile-antlr .

# https://github.com/antlr/antlr4/issues/2335
# solves "cannot find token file" error
grammar/PqlLexer.tokens: grammar/PqlLexer.g4
	docker run --rm \
		-v $(PWD):/mnt \
		antlr \
		-o ./ \
		grammar/PqlLexer.g4

build-code-python: grammar/PqlLexer.tokens grammar/PqlParser.g4 # image-antlr
	docker run --rm \
		-v $(PWD):/mnt \
		antlr \
		-visitor \
		-Dlanguage=Python3 \
		-Xexact-output-dir \
		-o python/src/pql_grammar/antlr \
		grammar/PqlLexer.g4 \
		grammar/PqlParser.g4

build-code-js: grammar/PqlLexer.tokens grammar/PqlParser.g4 # image-antlr
	docker run --rm \
		-v $(PWD):/mnt \
		antlr \
		-visitor \
		-Dlanguage=JavaScript \
		-Xexact-output-dir \
		-o js-temp/ \
		grammar/PqlLexer.g4 \
		grammar/PqlParser.g4

build-code: build-code-python build-code-js

.PHONY: image-antlr build-code-python build-code-js build-code
22 changes: 12 additions & 10 deletions README.md
## Introduction

This repository contains the formal definition of the grammar for TEL, written in [ANTLR v4](https://github.com/antlr/antlr4).
It can generate the following components in Python and JavaScript to handle parsing string expressions:

- *lexer* - splits a string expression into tokens
- *parser* - connects tokens into a parse tree (similar to an AST)

Current documentation on the language is available [here](https://diesel-service.operamprod.com/documentation#taxon-expression-language-tel).

## Local Development

### `make image-antlr`

Builds a local docker image used to run ANTLR commands.
You need to run this command before you may run the ANTLR-related make commands.

### `make build-code-python`

Generates all components in Python.

### `make build-code-js`

Generates all components in JavaScript.

### `make test-dev`

Runs tests on the current version of the grammar in quick mode.
Reuses a pre-built python image (3.8) to mount the local python code and tests and run them.

### `make test`

Runs the same tests as above, but against multiple supported python versions, using the tox config.
(Takes much longer to run because each python image is rebuilt from scratch each time.)
12 changes: 12 additions & 0 deletions docker/Dockerfile-antlr
FROM java:8

ENV ANTLR_VERSION=4.8
ENV CLASSPATH .:/antlr-${ANTLR_VERSION}-complete.jar:$CLASSPATH

ADD http://www.antlr.org/download/antlr-${ANTLR_VERSION}-complete.jar /usr/bin/
RUN chmod +r /usr/bin/antlr-${ANTLR_VERSION}-complete.jar \
&& ln /usr/bin/antlr-${ANTLR_VERSION}-complete.jar /usr/bin/antlr.jar

WORKDIR /mnt

ENTRYPOINT ["java", "-jar", "/usr/bin/antlr.jar"]
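With this image in place, regenerating the Python parser by hand looks roughly like the following. This is a sketch mirroring the Makefile targets above, not a separate supported entry point; paths are taken from `build-code-python`:

```shell
# Build the ANTLR tool image defined above.
DOCKER_BUILDKIT=1 docker build -t antlr -f docker/Dockerfile-antlr .

# Generate the lexer .tokens file first (works around antlr/antlr4#2335),
# then render the Python lexer/parser/visitor into the package tree.
docker run --rm -v "$(pwd)":/mnt antlr -o ./ grammar/PqlLexer.g4
docker run --rm -v "$(pwd)":/mnt antlr \
    -visitor -Dlanguage=Python3 -Xexact-output-dir \
    -o python/src/pql_grammar/antlr \
    grammar/PqlLexer.g4 grammar/PqlParser.g4
```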
27 changes: 27 additions & 0 deletions docker/Dockerfile-python-tests
ARG PYTHON_VERSION=3.8
FROM python:${PYTHON_VERSION} as baseimage

ARG WORKDIR=/usr/src/app
WORKDIR $WORKDIR

ARG PYTHONUSERBASE=/usr/src/lib

# PYTHONUNBUFFERED: Force stdin, stdout and stderr to be totally unbuffered. (equivalent to `python -u`)
# PYTHONHASHSEED: Enable hash randomization (equivalent to `python -R`)
# PYTHONDONTWRITEBYTECODE: Do not write bytecode files to disk, since we treat the filesystem as readonly. (equivalent to `python -B`)
ENV PYTHONUNBUFFERED=1 \
PYTHONHASHSEED=random \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUSERBASE=$PYTHONUSERBASE \
PATH="${PYTHONUSERBASE}/bin:${PATH}"

# Setup PYTHONUSERBASE directory
# we allow running / managing these folders by non-root users. Thus need chmod
RUN set -ex; \
mkdir -p $PYTHONUSERBASE && chmod 777 ${PYTHONUSERBASE}; \
mkdir -p $WORKDIR && chmod 777 ${WORKDIR}

COPY python/requirements.txt python/requirements-tests.txt ./
RUN pip install \
-r requirements.txt \
-r requirements-tests.txt
3 changes: 3 additions & 0 deletions grammar/.gitignore
*.*
!*.g4
!.gitignore
112 changes: 112 additions & 0 deletions grammar/PqlLexer.g4
lexer grammar PqlLexer;

// mostly SQL-compatible (except for some TEL-isms where marked):

AND : '&&'; // TEL
EQ : '==';
GT_EQ : '>=';
LT_EQ : '<=';
NOT_EQ1 : '!=';
NOT_EQ2 : '<>';
OR : '||'; // TEL. !! CONFLICT WITH SQL where it's string concatenator !!
SHIFT_LEFT : '<<';
SHIFT_RIGHT : '>>';

AMP : '&';
ASSIGN : '=';
CLOSE_PAREN : ')';
COLON: ':';
COMMA : ',';
DOT : '.';
FORWARD_SLASH : '/';
GT : '>';
LT : '<';
MINUS : '-';
MOD : '%';
OPEN_PAREN : '(';
PIPE : '|';
PLUS : '+';
QUESTION_MARK: '?';
SCOL : ';';
STAR : '*';
TILDE : '~';
UNDER: '_';

// SQL keywords we adapt:
K_AND : A N D;
K_BETWEEN : B E T W E E N;
K_FALSE : F A L S E;
K_ILIKE: I L I K E ;
K_IN : I N;
K_IS : I S;
K_ISNULL : I S N U L L;
K_LIKE : L I K E;
K_NOT : N O T;
K_NOTNULL : N O T N U L L;
K_NULL : N U L L;
K_OR : O R;
K_TRUE : T R U E;

NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;

// Note, use of TEL escaping variant,
// escaping is NOT SQL style "double-char":
// TODO: allow both in TEL to avoid translation headaches
DOUBLE_QUOTED_STRING: DOUBLE_QUOTED_STRING_TEL ;
DOUBLE_QUOTED_STRING_TEL : '"' ( '\\"' | ~'"' )* '"' ;
DOUBLE_QUOTED_STRING_SQL : '"' ( '""' | ~'"' )* '"' ;

// Note, use of TEL escaping variant,
// Note, escaping is NOT SQL style "double-char":
// TODO: allow both in TEL to avoid translation headaches
SINGLE_QUOTED_STRING: SINGLE_QUOTED_STRING_TEL ;
SINGLE_QUOTED_STRING_TEL: '\'' ( '\\\'' | ~'\'' )* '\'' ;
SINGLE_QUOTED_STRING_SQL: '\'' ( '\'\'' | ~'\'' )* '\'' ;

SINGLE_LINE_COMMENT
: ('--'|'//'|'#') ~[\r\n]* -> channel(HIDDEN)
;

MULTILINE_COMMENT
: '/*' .*? ( '*/' | EOF ) -> channel(HIDDEN)
;

SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;

WORD
: [a-zA-Z_][a-zA-Z_0-9]*
;

fragment DIGIT : [0-9];

fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
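The TEL-vs-SQL escaping distinction flagged in the string-rule comments above can be illustrated outside of ANTLR. The regexes below are my own transliteration of the two `DOUBLE_QUOTED_STRING` variants, for illustration only (the grammar currently wires `DOUBLE_QUOTED_STRING` to the TEL variant alone):

```python
import re

# TEL variant: backslash escaping, e.g.  "say \"hi\""
TEL_DQ = re.compile(r'^"(?:\\"|[^"])*"$')
# SQL variant: doubled-char escaping, e.g.  "say ""hi"""
SQL_DQ = re.compile(r'^"(?:""|[^"])*"$')

tel_literal = '"say \\"hi\\""'   # the characters: "say \"hi\""
sql_literal = '"say ""hi"""'     # the characters: "say ""hi"""

assert TEL_DQ.match(tel_literal) is not None
assert SQL_DQ.match(sql_literal) is not None
# Each convention rejects the other's escaping, hence the
# "translation headaches" TODO in the grammar comments:
assert SQL_DQ.match(tel_literal) is None
assert TEL_DQ.match(sql_literal) is None
```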
72 changes: 72 additions & 0 deletions grammar/PqlParser.g4
/*
SQL-inspired "Pano Query Language" syntax
focusing on Expressions

Weird parts:
- Taxon is a SQL-column-like object with similar heritage (namespace etc)
and extra syntax for optionality
- Some operator characters are more "programming" than SQL
Example: Eq compare '==' vs SQL-like '=' (though '=' could be converted to '==' internally)
*/

parser grammar PqlParser;

options {
tokenVocab = PqlLexer;
}

// entry point
parseTel: expr EOF ;

expr
: unary_operator=( MINUS | PLUS | K_NOT ) right=expr
| left=expr operator=( STAR | FORWARD_SLASH | MOD ) right=expr
| left=expr operator=( PLUS | MINUS ) right=expr
| left=expr operator=( LT | LT_EQ | GT | GT_EQ ) right=expr
| left=expr operator=( ASSIGN | EQ | NOT_EQ1 | NOT_EQ2 | K_IS ) right=expr
| left=expr is_negated=K_NOT? operator=(K_LIKE | K_ILIKE) right=expr
| left=expr is_negated=K_NOT? operator=K_IN OPEN_PAREN right_list=exprList CLOSE_PAREN
| left=expr operator=( K_AND | AND ) right=expr
| left=expr operator=( K_OR | OR ) right=expr
// BETWEEN must come after AND or risk being parsed before it
// resulting in `a BETWEEN b` where `AND c` fragment is outside of BETWEEN expression
| left=expr is_negated=K_NOT? operator=K_BETWEEN right=expr
| OPEN_PAREN inner=expr CLOSE_PAREN
| literalValue
| fn
| taxon
;

exprList: expr ( COMMA expr )* ;
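ANTLR resolves precedence in a left-recursive rule like `expr` by alternative order: earlier alternatives bind tighter, which is also why the comment above warns that `K_BETWEEN` must sit below `K_AND`. A minimal hand-rolled sketch of that ordering-equals-precedence idea (my own code, not generated; token set reduced to integers, `+ - * /`, and parens):

```python
import re

# Precedence levels stand in for "how early the alternative appears":
# higher number binds tighter, like STAR/FORWARD_SLASH above PLUS/MINUS.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def tokenize(s):
    return re.findall(r"\d+|[()+\-*/]", s)

def parse(tokens):
    node, rest = parse_binary(tokens, 0)
    assert not rest, "trailing tokens"
    return node

def parse_binary(tokens, min_prec):
    left, tokens = parse_atom(tokens)
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens[0]
        # PREC[op] + 1 makes operators left-associative.
        right, tokens = parse_binary(tokens[1:], PREC[op] + 1)
        left = (op, left, right)
    return left, tokens

def parse_atom(tokens):
    tok, rest = tokens[0], tokens[1:]
    if tok == "(":
        node, rest = parse_binary(rest, 0)
        assert rest[0] == ")", "unbalanced parens"
        return node, rest[1:]
    return int(tok), rest

# '*' sits at a tighter level than '+', so it groups first:
assert parse(tokenize("1 + 2 * 3")) == ("+", 1, ("*", 2, 3))
assert parse(tokenize("(1 + 2) * 3")) == ("*", ("+", 1, 2), 3)
```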

// Note that function supports optional list of arguments trapped as `expr`
// which allows us to have
//    named (`arg1=value1, arg2=value2`) and
// positional (`value1, value2`) args.
// Named ones will come as `expr` with left=expr,operator=ASSIGN,right=expr contents.
// You might need to express these as ordered dict / list of tuples to preserve names of args.
// Positional will be whatever literal or other single-valued expr content could be.
fn: function_name=identifierMultipart OPEN_PAREN arguments=fnArgs? CLOSE_PAREN ;
fnArgs: fnArg ( COMMA fnArg)* ;
fnArg: ( argument_name=WORD ASSIGN)? argument_value=expr ;
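As the comment above suggests, a consumer needs to keep argument order while distinguishing named from positional arguments; a list of `(name_or_None, value)` tuples does that. A sketch, with a hypothetical `('=', name, value)` tuple standing in for the left/operator/right contents of an ASSIGN `expr`:

```python
def normalize_fn_args(args):
    # args: mixed list of hypothetical ('=', name, value) triples (named args)
    # and bare values (positional args), in source order.
    out = []
    for a in args:
        if isinstance(a, tuple) and len(a) == 3 and a[0] == "=":
            out.append((a[1], a[2]))   # named: keep the argument name
        else:
            out.append((None, a))      # positional: no name
    return out

assert normalize_fn_args([("=", "arg1", 10), 20]) == [("arg1", 10), (None, 20)]
```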

// TODO: TAXON_TAG_DELIMITER is being killed off. Remove when we migrate out of taxon tags.
taxon:
is_optional=QUESTION_MARK?
( namespace=identifierMultipart PIPE )?
slug=identifierMultipart
// TODO: drop this when we drop Data Tags system.
// May conflict with TypeCast expression
( COLON tag=identifierMultipart )?
;

identifierMultipart: WORD ( DOT WORD )* ;

literalValue
: NUMERIC_LITERAL
| DOUBLE_QUOTED_STRING
| SINGLE_QUOTED_STRING
| K_NULL
| K_TRUE
| K_FALSE
;
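The shape of the `taxon` rule above (optional `?`, optional `namespace|` prefix, dotted slug, optional `:tag` slated for removal) can be sketched as a Python regex. This is my own transliteration for illustration; the real parsing is done by the generated ANTLR parser:

```python
import re

# identifierMultipart: WORD ( DOT WORD )*  -- WORD may not start with a digit.
IDENT = r"[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*"

TAXON = re.compile(
    rf"^(?P<optional>\?)?"        # is_optional=QUESTION_MARK?
    rf"(?:(?P<namespace>{IDENT})\|)?"  # ( namespace PIPE )?
    rf"(?P<slug>{IDENT})"              # slug
    rf"(?::(?P<tag>{IDENT}))?$"        # ( COLON tag )? -- deprecated Data Tags
)

m = TAXON.match("?ns|spend.total:tag1")
assert m.group("optional") == "?"
assert m.group("namespace") == "ns"
assert m.group("slug") == "spend.total"
assert m.group("tag") == "tag1"

assert TAXON.match("revenue").group("slug") == "revenue"
assert TAXON.match("9bad") is None  # WORD may not start with a number
```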