Add xpath.md; upgrade xpath lib to v1.1.11 so we can start to use reg…

…exp func `matches` in xpath query (#124)
jf-tech · Nov 24, 2020 · b13fcee · b13fcee
1 parent 8a02d06
commit b13fcee
Show file tree

Hide file tree

Showing 5 changed files with 336 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -14,8 +14,8 @@ Golang Version: 1.14
 Docs:
 - [Getting Started](./doc/gettingstarted.md): a tutorial for writing your first omniparser schema.
 - [IDR](./doc/idr.md): in-memory data representation of ingested data for omniparser.
-- [XPath Based Data Extraction and Filtering](./doc/xpath.md): xpath queries are essential to omniparser schema writing.
-Learn the concept and tricks in depth.
+- [XPath Based Record Filtering and Data Extraction](./doc/xpath.md): xpath queries are essential to omniparser schema
+writing. Learn the concept and tricks in depth.
 - [Use of `custom_func`, Specially `javascript`](./doc/use_of_custom_funcs.md): An in depth look of how `custom_func`
 is used, specially the all mighty `javascript` (and `javascript_with_context`).
 - [CSV Schema in Depth](./doc/csv_in_depth.md): everything about schemas for CSV input.

diff --git a/doc/xpath.md b/doc/xpath.md
@@ -0,0 +1,330 @@
+# XPath Based Record Filtering and Data Extraction
+
+The foundation of omniparser transform operations is anchored on [IDR](./idr.md) and XPath based record
+filtering and data extraction. It's vital to understand each supported file format's IDR structure to
+effectively and efficiently craft XPath queries in `transform_declarations` to achieve desire transform
+objectives.
+
+## Record Filtering
+
+Many times some records ingested are not suitable/desirable to be transformed into output. Omniparser, more
+specifically the current latest version (`"omni.2.1"`) handler, allows record level filtering using XPath
+query. Let's see one example in CSV:
+
+```
+ORDER_ID,CUSTOMER_ID,COUNTRY
+1234,CUST_1,US
+N/A
+1235,CUST_2,AU
+```
+
+We want omniparser to ingest and transform records with `order_id=1234,1235` and skip the line with
+`'N/A'`. To achieve this, we can insert `xpath` into the root `object` of `FINAL_OUTPUT` in
+`transform_declarations`:
+
+```
+    "transform_declarations": {
+        "FINAL_OUTPUT": { "xpath": ".[matches(ORDER_ID, '^[0-9]+$')]", "object": {
+            ...
+        }}
+    }
+```
+
+Let's take a look how the transform works for first data line `1234,CUST_1,US`:
+1. Omniparser reads the first line in and converts it into a [CSV specific IDR tree](./idr.md#csv-aka-delimited):
+    ```
+    Node(Type: DocumentNode)
+        Node(Type: ElementNode, Data: "ORDER_ID")
+            Node(Type: TextNode, Data: "1234")
+        Node(Type: ElementNode, Data: "CUSTOMER_ID")
+            Node(Type: TextNode, Data: "CUST_1")
+        Node(Type: ElementNode, Data: "COUNTRY")
+            Node(Type: TextNode, Data: "US")
+    ```
+2. `FINAL_OUTPUT.xpath` is then executed at the root of the IDR tree, and result is a match! So this
+line/record will be processed.
+
+Now take a look the second line `N/A`:
+1. The IDR tree looks like:
+    ```
+    Node(Type: DocumentNode)
+        Node(Type: ElementNode, Data: "ORDER_ID")
+            Node(Type: TextNode, Data: "N/A")
+    ```
+2. `FINAL_OUTPUT.xpath` is executed at the root of the IDR tree, and result is not a match. This line/record
+will be skipped.
+
+Each input format has its own unique IDR structure, record filtering XPath needs to take it into consideration
+to be effective.
+
+Clever use of positive/negative regexp [`matches`](https://github.com/antchfx/xpath#expressions) (slightly
+slower but very powerful), or [`starts-with`, `ends-with`, `contains`](https://github.com/antchfx/xpath#expressions),
+or even direct string comparisons [`==`, `!=`](https://github.com/antchfx/xpath#expressions) in
+`FINAL_OUTPUT.xpath` gives schema writers the freedom of either processing certain lines/records, or skipping
+certain lines/records.
+
+If `FINAL_OUTPUT` doesn't have `xpath`, which is fairly common, then there is no record filtering, meaning
+all records ingested by omniparser file format specific readers will be processed and transformed.
+
+## Data Extraction
+
+The most common use of `xpath` is for data extraction. Consider again the sample CSV and schema in
+[Record Filtering](#record-filtering), let's amend the schema to:
+```
+    "transform_declarations": {
+        "FINAL_OUTPUT": { "xpath": ".[matches(ORDER_ID, '^[0-9]+$')]", "object": {
+            "order_id": { "xpath": "ORDER_ID", "type": "int" },
+            "customer_id": { "xpath": "CUSTOMER_ID", "type": "int" },
+            "country": { "xpath": "COUNTRY" }
+        }}
+    }
+```
+
+The `xpath` attributes on `"order_id"`, `"customer_id"`, and `"country"` tell omniparser where to get
+the field string data from. When `xpath` **not** appearing with `object`, `template`, `custom_func`, or
+`custom_parse`, then it is a data extraction directive telling omniparser to extract the text data at the
+location specified by the `xpath` query. Note in this situation, omniparser will require the result set of
+such `xpath` queries to be of a single node: if such `xpath` query results in more than one node, omniparser
+will fail the current record transform (but will continue onto the next one as this isn't considered fatal).
+
+## Data Context and Anchoring
+
+Whether `xpath` is used for record filtering or data extraction/anchoring, it's always good to know the
+current IDR tree "cursor" position against which an `xpath` query, if present, will be executed.
+
+The current "cursor" position when a transform of `FINAL_OUTPUT` starts is always at the top of an IDR tree.
+So record filtering `FINAL_OUTPUT.xpath` is always executed against the root fo the IDR tree. The "cursor"
+position remains unchanged until a new anchoring `xpath` is encountered. Typically, schema writers will need
+to change cursor anchoring positions more often in hierarchical file formats, such as EDI/JSON/XML, than
+"flat" file formats, like fixed-length or CSV.
+
+Let's take a look at a [sample schema](../extensions/omniv21/samples/json/2_multiple_objects.schema.json)
+for JSON input:
+
+```
+1    "transform_declarations": {
+2        "FINAL_OUTPUT": { "xpath": "/publishers/*", "object": {
+3            "authors": { "array": [ { "xpath": "books/*/author" } ] },
+4            "book_titles": { "array": [ { "xpath": "books/*/title" } ] },
+5            "books": { "array": [ { "xpath": "books/*", "object": {
+6                "author": { "xpath": "author" },
+7                "year": { "xpath": "year", "type": "int" },
+8                "price": { "xpath": "price", "type": "float" },
+9                "title": { "xpath": "title" }
+10            }} ] },
+11            "publisher": { "xpath": "name" },
+12            "first_book": { "xpath": "books/*[position() = 1]", "custom_func": { "name": "copy" }},
+13            "original_book_array": { "xpath": "books", "custom_func": { "name": "copy" }}
+41        }}
+42    }
+```
+Notes:
+- Line numbers are added for easier reference.
+- Only `transform_declarations` section is included here for brevity.
+
+Consider this [input](../extensions/omniv21/samples/json/2_multiple_objects.input.json):
+```
+1 {
+2     "publishers": [
+3         {
+4             "name": "Scholastic Press",
+5             "books": [
+6                 {
+7                     "title": "Harry Potter and the Philosopher's Stone",
+8                     "price": 9.99,
+9                     "author": "J. K. Rowling",
+10                     "year": 1997
+11                 },
+12                 {
+13                     "title": "Harry Potter and the Chamber of Secrets",
+14                     "price": 10.99,
+15                     "author": "J. K. Rowling",
+16                     "year": 1998
+17                 }
+18             ]
+19         }
+20 }
+```
+
+Now let's go through the schema and input together to see how `xpath` anchoring is used.
+
+1. schema `2        "FINAL_OUTPUT": { "xpath": "/publishers/*", "object": {`
+
+    This is record filtering, saying, we'd like to process and transform every record matching
+    `/publishers/*`. In this simplified input example, there is only one JSON object matches it: it's the
+    object starting at line 3 and finishing at line 19. With this line, the transform starts, and now the
+    cursor is anchored at the top of this object.
+
+2. schema `3            "authors": { "array": [ { "xpath": "books/*/author" } ] },`
+
+    Unlike `object` transform, `array` transform itself doesn't/may not have `xpath` attribute: an `array`
+    transform is a collection of child transforms, each of which can optionally have its own `xpath`.
+    This schema line says, `authors` in the output is an array, of which, each element is a string whose
+    value comes from the `xpath` data extraction `books/*/author`. So with the input above, we will have
+    `"authors": [ "J. K. Rowling", "J. K. Rowling" ]` in the final output.
+
+3. schema `4            "book_titles": { "array": [ { "xpath": "books/*/title" } ] },`
+
+    Very similar to `authors` output above, `book_titles` output will be like:
+    `"book_titles": [ "Harry Potter and the Philosopher's Stone", "Harry Potter and the Chamber of Secrets" ]`
+    in the final output.
+
+4. schema `5            "books": { "array": [ { "xpath": "books/*", "object": {`
+
+    Similar to `authors` and `book_titles` above, what this line says is, `books` in the output should be an
+    array of objects, each of which, the IDR cursor should be anchored on `books/*` for its processing and
+    transform. In other words, omniparser will anchor the IDR cursor on the JSON object from line 6 through
+    line 11 for the first array element object transform, and then anchor on the JSON object from line 12
+    through line 17 for the second array element object transform.
+
+5. schema `6                "author": { "xpath": "author" },` and through line 9
+    Recall in 4., omniparser has put the cursor on actual book object. Now line 6 through line 9 simply
+    extract string values from the object and put into the corresponding output fields.
+
+6. schema `12            "first_book": { "xpath": "books/*[position() = 1]", "custom_func": { "name": "copy" }},`
+
+    This is an interesting schema construct: we want `first_book` in the output to be a direct copy of the
+    first book object inside input's `books` JSON array. `"xpath": "books/*[position() = 1]"` achieves the
+    "only first book object" filtering. `"custom_func": { "name": "copy" }` achieves the direct copying.
+
+    As you notice, `custom_func` transform can have (optional) `xpath` attribute as well. If `xpath` is present
+    for a `custom_func`, then everything inside the `custom_func`, namely those argument transforms, are all
+    anchored on the cursor position prescribed by the `xpath`.
+
+When `xpath` is used for anchoring and cursoring, it can appear with `object`, `template`, `custom_func`, and
+`custom_parse`.
+
+## Static and Dynamic XPath Queries
+
+While `xpath` is the most commonly used filtering, anchoring and data extraction directive in schemas, it (the
+query itself) is completely static, meaning the query is fixed and static at schema writing time, thus can't
+be used where data dependent runtime dynamic query is needed.
+
+Consider the following [sample input](../extensions/omniv21/samples/json/3_xpathdynamic.input.json):
+```
+[
+    {
+        "line_items": [
+            {
+                "product": {
+                    "variant": {
+                        "option2": "Blue",
+                        "option1": "M"
+                    },
+                    "options": [
+                        {
+                            "index": 2,
+                            "name": "color/pattern",
+                            "values": [
+                                "Blue",
+                                "Green"
+                            ]
+                        },
+                        {
+                            "index": 1,
+                            "name": "Size",
+                            "values": [
+                                "M",
+                                "L"
+                            ]
+                        }
+                    ]
+                }
+            }
+        ]
+    }
+]
+```
+Notice the `options` array specifies what allowed/possible options are for a product and then in `variant`
+of `product`, it specifies what actual options are included.
+
+The [sample schema](../extensions/omniv21/samples/json/3_xpathdynamic.schema.json):
+```
+    "transform_declarations": {
+        "FINAL_OUTPUT": { "xpath": "/*", "object": {
+            "order_info": { "object": {
+                "order_items": { "array": [
+                    { "xpath": "line_items/*", "object": {
+                        ....
+                        "color": { "xpath_dynamic": {
+                            "custom_func": {
+                                "name": "concat",
+                                "args": [
+                                    { "const": "product/variant/option" },
+                                    { "xpath": "product/options/*[name='color/pattern']/index" }
+                                ]
+                            }
+                        }},
+                        "size": { "xpath_dynamic": {
+                            "custom_func": {
+                                "name": "concat",
+                                "args": [
+                                    { "const": "product/variant/option" },
+                                    { "xpath": "product/options/*[name='Size']/index" }
+                                ]
+                            }
+                        }},
+                        ...
+                    }}
+                ]}
+            }}
+        }}
+    }
+```
+
+The schema wants to transform `optoin1` and `option2` in the input into `color` and `size` in output. The
+difficulty is how to figure out `optoin1` is mapped to `color` and `option2` to `size`. If we look at the
+input's `options` array, it says `"index": 1` is for size and `"index": 2` is for color. To extract data
+for `color` field in the output, we need to dynamically construct an XPath query by
+`product/variant/option` + `product/options/*[name='color/pattern']/index`. Similar XPath construction is
+needed for `size` field data extraction.
+
+`xpath_dynamic` is used in such a situation. It basically says, unlike `xpath` is always a constant and static
+string value, `xpath_dynamic` is computed, by either `custom_func`, or `custom_parse`, or `template`, or
+`external`, or `const`, or another `xpath` direct data extraction.
+
+`xpath_dynamic` can be used everywhere `xpath` is used, except on `FINAL_OUTPUT`. `FINAL_OUTPUT` can only
+use `xpath`.
+
+## XPath Query Result-set Cardinality
+
+Everytime when an `xpath` or `xpath_dynamic` query is executed against an IDR node (and its subtree), the
+result is always a set of nodes: could be an empty set, or a set of one node, or a set of multiple nodes.
+Depending on which transform is in play, different outcomes, including error, can follow.
+
+- `xpath`/`xpath_dynamic` used alone, aka data extraction transform:
+
+    - Example: `"field1": { "xpath": "PATH/TO/DATA" }`
+    - The result set must be either empty or of a single node. When empty, `""` is used; when a single
+    node is returned for the query, the node's text data will be used; when more than one node is returned,
+    omniparser will return a transform error (non-fatal).
+
+- `xpath` used in `FINAL_OUTPUT`:
+
+    - Example: `"FINAL_OUTPUT": { "xpath": "/publishers/*", "object": {`
+    - The result set can be either empty, or of one node, or of multiple nodes.
+
+- `xpath`/`xpath_dynamic` used in `object`, `custom_func`, `custom_parse`, `template` transform
+(other than `FINAL_OUTPUT` or directly under an `array` transform):
+
+    - Example: `"contact": { "xpath": "PATH/TO/CONTACT", "object": {`
+    - Example: `"temperature": { "xpath": "PATH/TO/TEMPERATURE", "custom_func": {`
+    - Example: `"wind_forecast": { "xpath": "PATH/TO/WIND", "template": {`
+    - The result set can only be either empty or of one node. Multiple node result set will cause parser error.
+
+- `xpath`/`xpath_dynamic` used in transform that is directly under `array` transform:
+
+    - Example: `"titles": { "array": [ { "xpath": "books/*/title" } ] }`
+    - Example: `"titles": { "array": [ { "xpath": "books/*/title" }, { "xpath": "movies/*/title" } ] }`
+    - The first example is the most commonly used scenario, that is, the `array` contains homogeneous element
+    transforms. In this case, the `xpath` can return empty, or one node, or multiple nodes and results will
+    be used as the array's elements.
+    - The second example shows the flexibility of `array` transform, that it can contain different transforms:
+    one set of titles is of book titles and another set of movie titles. All titles, books' or movies', are
+    contained by the array. Similar to the first case, both `xpath` result sets can return empty, one node or
+    multiple nodes. All are fine and accepted by the parser.
+
+## Supported XPath Features
+
+Omniparser relies on https://github.com/antchfx/xpath (thank you!) for XPath query parsing and execution.
+Check its github page for the full syntax and function support list.
diff --git a/extensions/omniv21/samples/customfileformats/jsonlog/sample_schema.json b/extensions/omniv21/samples/customfileformats/jsonlog/sample_schema.json
@@ -4,7 +4,7 @@
         "file_format_type": "jsonlog"
     },
     "transform_declarations": {
-        "FINAL_OUTPUT": { "xpath": ".[(severity='WARNING' or severity='ERROR' or severity='CRITICAL') and source='api']", "object": {
+        "FINAL_OUTPUT": { "xpath": ".[matches(severity, '^(WARNING|ERROR|CRITICAL)$') and source='api']", "object": {
             "timestamp": { "xpath": "timestamp" },
             "source": { "const": "api" },
             "severity": { "custom_func": {

diff --git a/go.mod b/go.mod
@@ -4,7 +4,7 @@ go 1.14
 
 require (
 	github.com/antchfx/xmlquery v1.3.1
-	github.com/antchfx/xpath v1.1.10
+	github.com/antchfx/xpath v1.1.11
 	github.com/bradleyjkemp/cupaloy v2.3.0+incompatible
 	github.com/dlclark/regexp2 v1.2.1 // indirect
 	github.com/dop251/goja v0.0.0-20201002140143-8ce18d86df5f

diff --git a/go.sum b/go.sum
@@ -7,6 +7,8 @@ github.com/antchfx/xmlquery v1.3.1 h1:nIKWdtnhrXtj0/IRUAAw2I7TfpHUa3zMnHvNmPXFg+
 github.com/antchfx/xmlquery v1.3.1/go.mod h1:64w0Xesg2sTaawIdNqMB+7qaW/bSqkQm+ssPaCMWNnc=
 github.com/antchfx/xpath v1.1.10 h1:cJ0pOvEdN/WvYXxvRrzQH9x5QWKpzHacYO8qzCcDYAg=
 github.com/antchfx/xpath v1.1.10/go.mod h1:Yee4kTMuNiPYJ7nSNorELQMr1J33uOpXDMByNYhvtNk=
+github.com/antchfx/xpath v1.1.11 h1:WOFtK8TVAjLm3lbgqeP0arlHpvCEeTANeWZ/csPpJkQ=
+github.com/antchfx/xpath v1.1.11/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=
 github.com/armon/consul-api v0.0.0-20180202201655-eb2c6b5be1b6/go.mod h1:grANhF5doyWs3UAsr3K4I6qtAmlQcZDesFNEHPZAzj8=
 github.com/beorn7/perks v0.0.0-20180321164747-3a771d992973/go.mod h1:Dwedo/Wpr24TaqPxmxbtue+5NUziq4I4S80YR8gNf3Q=
 github.com/beorn7/perks v1.0.0/go.mod h1:KWe93zE9D1o94FZ5RNwFwVgaQK1VOXiVxmqh+CedLV8=