
Explore: Add transformations to correlation data links #61799

Merged: 36 commits merged into main from kristina/transformation on Feb 22, 2023

Conversation

@gelicia (Contributor) commented Jan 19, 2023

What is this feature?

Transformations act as a lens through which we focus on specific pieces of the source data so they can be used by the target query. This PR implements the regex and logfmt transformations as a first pass.
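
As a rough sketch of the idea (hypothetical helper and type names; the PR's real implementation lives in the explore/correlations code and may differ), a transformation takes the value of the configured source field and turns it into variables that the target query can interpolate:

// Sketch only, not the PR's actual code.
type TransformationConfig = {
  type: 'regex' | 'logfmt';
  expression?: string;
  variable?: string;
};

function getTransformationVars(config: TransformationConfig, fieldValue: string): Record<string, string> {
  const vars: Record<string, string> = {};
  if (config.type === 'regex' && config.expression) {
    const match = fieldValue.match(new RegExp(config.expression));
    if (match?.groups) {
      // Named capture groups become variables keyed by their group names.
      for (const [name, value] of Object.entries(match.groups)) {
        if (value !== undefined) {
          vars[name] = value;
        }
      }
    } else if (match && config.variable && match[1] !== undefined) {
      // A single unnamed capture group maps to the configured variable name.
      vars[config.variable] = match[1];
    }
  } else if (config.type === 'logfmt') {
    // Deliberately naive logfmt parsing, for illustration only.
    for (const pair of fieldValue.match(/(\w+)=("[^"]*"|\S+)/g) ?? []) {
      const [key, ...rest] = pair.split('=');
      vars[key] = rest.join('=').replace(/^"|"$/g, '');
    }
  }
  return vars;
}

// e.g. getTransformationVars(
//   { type: 'regex', expression: '(Superman|Batman)', variable: 'name' },
//   'This is a news article about Superman.'
// ) yields { name: 'Superman' }, which the data link can substitute for ${name}.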

Why do we need this feature?

We have the ability to correlate data from one datasource to another, but most of the time we will need to do something to the source data to make it suitable for the target data source. Transformations are the answer to that and this is the first example of how they may work.

Who is this feature for?

Anyone using correlations.

Which issue(s) does this PR fix?:

Fixes #60023

Special notes for your reviewer:

Example 1

Example datasource provisioning yaml with regex only

  - name: testData-correlations
    isDefault: false
    editable: true
    type: testdata
    correlations:
      - targetUID: WyFv5154z
        label: "Superhero freeform"
        description: "this is a test correlation from provisioning"
        config:
          type: query
          target:
            editorMode: "code"
            format: "table"
            rawQuery: "true"
            rawSql: "SELECT * FROM superhero WHERE name=''${name}''"
            refId: "A"
          field: "text"
          transformations:
            - type: "regex"
              expression: "(Superman|Batman)"
              variable: "name"

You will need to edit the target datasource to be one that is available in your environment. I have a postgres datasource running with a table of superhero data.

Using the above provisioned datasource, select the 'CSV Content' scenario and use the following CSV:

date,text
1674078628,This is a news article about Superman. Batman was not involved at all.

This will create a link to the defined target datasource, and the query will contain "Superman" in place of the name variable.

Example 2

Example datasource provisioning yaml with regex and logfmt

  - name: testData-correlations
    isDefault: false
    editable: true
    type: testdata
    correlations:
      - targetUID: WyFv5154z
        label: "Superhero 2 transformations"
        description: "2nd test transformations"
        config:
          type: query
          target:
            editorMode: "code"
            format: "table"
            rawQuery: "true"
            rawSql: "SELECT * FROM superhero WHERE name='$${name}' AND alignment='$${align}'"
            refId: "A"
          field: "text"
          transformations:
            - type: "logfmt"
            - type: "regex"
              expression: "text=.*(good|bad).*"
              variable: "align"

You will need to edit the target datasource to be one that is available in your environment. I have a postgres datasource running with a table of superhero data.

Using the above provisioned datasource, select the 'CSV Content' scenario and use the following CSV:

date,text
1674078628,"name=Superman text=""he did a good thing"""
1674078628,"name=Thanos text=""he did a bad thing"""
1674078628,"name=Batman text=""he did a bad thing"""

This will create a link to the defined target datasource for each row, with the name and align variables filled in from the source text.
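
For instance, for the first CSV row the logfmt transformation yields name=Superman (plus the text key), and the regex captures "good" into align. Assuming the $$ escaping resolves to ${name} and ${align} at provisioning time, the interpolated query for that row would look roughly like:

SELECT * FROM superhero WHERE name='Superman' AND alignment='good'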

@github-actions

Backend code coverage report for PR #61799
No changes

github-actions bot commented Jan 19, 2023

Frontend code coverage report for PR #61799

Plugin         Main     PR       Difference
correlations   94.9%    87.23%   -7.67%
explore        86.26%   86.28%   +0.02%

@ifrost (Contributor) left a comment:

Nice, with some small changes around regexp capturing (see the comment) I was able to extract some pieces of the line to get something like this working:

[Screenshot attached: 2023-01-24 at 23:34:29]

public/app/features/explore/utils/links.ts (resolved review thread)
@gelicia gelicia requested a review from ifrost January 30, 2023 14:03
devenv/datasources.yaml (resolved review thread)
pkg/services/correlations/models.go (resolved review thread)
if (link.internal?.transformations) {
  const fieldValue = field.values.get(rowIndex);
  link.internal?.transformations.forEach((transformation) => {
    if (transformation.type === 'regex' && transformation.expression) {
@gelicia (PR author) commented:

Should we implement this in a cleaner way? It has the potential to grow quite a bit; maybe break the transformation logic out into its own file?

A contributor replied:

We will definitely need it in the future and may want to move it to feature/correlations, but we can do that later. You could also do it now, as it may make it easier to test just the transformation logic.

@ifrost (Contributor) left a comment:

Nice, I was able to extract fields from Loki with this change. Probably worth adding some tests to dataLink.ts / links.ts.

pkg/services/correlations/models.go (resolved review thread)
@gelicia gelicia added this to the 9.5.0 milestone Jan 31, 2023
@gelicia gelicia added the no-backport Skip backport of PR label Jan 31, 2023
@@ -3038,6 +3038,9 @@ exports[`better eslint`] = {
    [0, 0, 0, "Do not use any type assertions.", "0"],
    [0, 0, 0, "Do not use any type assertions.", "1"]
  ],
  "public/app/features/correlations/transformations.ts:5381": [
    [0, 0, 0, "Do not use any type assertions.", "0"]
A member commented:

opened DefinitelyTyped/DefinitelyTyped#64263 to get rid of this

@@ -192,7 +192,7 @@ export const safeParseJson = (text?: string): any | undefined => {
 };

 export const safeStringifyValue = (value: any, space?: number) => {
-  if (!value) {
+  if (value === undefined || value === null) {
A member commented:

This slightly changes the behaviour of a bunch of things, like the panel model inspector in dashboards and variables. Did we test that it's safe? (Especially for variables: false and 0 would previously have resulted in an empty string, but now produce "false" and "0"; not sure that's even possible, though.)

@gelicia (PR author) commented Feb 7, 2023:

I just couldn't imagine why we would want true to evaluate to "true" but false to evaluate to an empty string. Let's look at the usages of this function outside of how it is used in this PR:

  • features/dashboard/state/PanelModel.ts - passes the entire model into this function, and that model is a required parameter. I am extremely confident that PanelModel will never evaluate to false or 0, there are a lot of required fields in it.
  • features/variables/utils.ts - safe-stringifies the first arg, concatenates all args together split by a space except the last one and checks that string has matches for any of the three formats for variables. I do not believe that having false or 0 evaluate to an empty string or not will have any impact on this, because the safe-stringify only runs on the first argument, and in all cases a variable reference must start with either $ or [ which is not possible with a false or 0 scenario.
  • features/variables/inspect/utils.ts - Similar to the above, we safe-stringify a value and check if it matches any of the variable patterns. False or 0 or empty string will all not match. That value is not used again - it will use the matching groups instead.
  • plugins/datasource/prometheus/datasource.tsx - This will only run if the if statement depending on the same value is true, so if the value passed in is 0 or false, it will never run in the first place

I'm very confident based on those usages that this logic will not impact anything.
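
For reference, a minimal sketch of the behavioural difference being discussed (assuming the guard short-circuits to an empty string and the function otherwise falls through to JSON.stringify; this is not the actual safeStringifyValue body):

// Illustrative only.
const stringifyOld = (value: any) => (!value ? '' : JSON.stringify(value));
const stringifyNew = (value: any) =>
  value === undefined || value === null ? '' : JSON.stringify(value);

stringifyOld(false); // ''      (falsy values were swallowed)
stringifyNew(false); // 'false'
stringifyOld(0);     // ''
stringifyNew(0);     // '0'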

@gelicia gelicia requested a review from Elfo404 February 7, 2023 18:13
@gelicia (PR author) commented Feb 8, 2023

To document some discussion that happened off this PR: @Elfo404, @ifrost, and I looked at this solution against the original design doc and decided on a couple of things:

  • Going with this format of transformations, instead of a solution emulating promtail's parsing pipeline, is fine; it will be easier to display from a UI standpoint and is perhaps simpler to understand. We may introduce the parsing-pipeline system later.
  • The current regex limitation of only supporting one match is insufficient. JavaScript's regex support already allows multiple named capture groups that could be used to name variables. We should use that existing functionality out of the box.
  • We should not allow users to name a regex variable. In the event that multiple regex matches are made, having one variable name would be confusing.

To this effect, I have kept the existing solution but implemented logic to take advantage of named capture groups. I have added a test to show how this works, and an example provisioning and CSV pair will be available below.

However, while looking into this, I would really like reviewers to reconsider removing the regex variable name from what can be defined. In my opinion, in order of increasing complexity, regexes fall into the following categories:

  1. An unnamed single capture group with a variable name defined in the transformation
  2. A named single capture group
    • if a variable is defined, it is ignored in favor of the capture group name
  3. Multiple unnamed capture groups
    • not currently supported, but could be supported in the format of variableName[0] and so on
  4. Multiple named capture groups where order is enforced
    • This means your capture groups need to appear in the input in the order they were defined. So, for example, with (?<align>(good|bad)).*(?<name>(Superman|Batman)), "good Superman" would match but "Batman bad" would not.
    • I feel most people wanting to make correlations would not want to require a consistent ordering of their data since data ordering can change over time
  5. Multiple named capture groups where order does not matter
    • if a variable is defined, it is ignored in favor of the capture group name
    • It is worth noting that making multiple transformations of type 2 would be far easier than including backreferences to every capture group. I do think we should support this, but consider it an edge case.
    • An example regex for this is (?=.*(?<align>(good|bad)))(?=.*(?<name>(Superman|Batman))).

There could also be combinations of the above in the same regex, but the pattern would always be that named capture groups would get those names and be guaranteed to work, and any unnamed capture groups in the same expression would most likely not work.

I think we should prioritize the user experience of that first category. I predict that users who can figure out named capture groups will understand that the names they define override what is set in their configuration, versus a user who is struggling to understand why their variable has to match their field name, with no alternative. I also think that a variable definition gives us a reasonable path forward to supporting multiple unnamed capture groups.

In short (ha!) I think users will most likely take the path of creating 10 simple regex transformations rather than 1 complex regex expression, and we should prioritize making that as simple as possible.

Currently the PR keeps the transformation variable name option. If you disagree with the above, I can remove it.

Example datasource provisioning yaml with named regex capture groups

      - targetUID: WyFv5154z
        label: "Superhero complicated regex"
        description: "this is a test regex correlation from provisioning"
        config:
          type: query
          target:
            editorMode: "code"
            format: "table"
            rawQuery: "true"
            rawSql: "SELECT * FROM superhero WHERE name='$${name}' AND alignment='$${align}'"
            refId: "A"
          field: "text"
          transformations:
            - type: "regex"
              expression: "(?=.*(?<align>(good|bad)))(?=.*(?<name>(Superman|Batman)))"

Example CSV for the 'CSV Content' scenario

date,text
1674078628,"Superman did a good thing"
1674078628,"bad Batman did a thing"
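
To make this concrete, a quick sketch of how the lookahead-based expression above produces both named groups regardless of word order (standard JavaScript regex behaviour):

// Both named groups match independently of word order thanks to the lookaheads.
const expr = /(?=.*(?<align>(good|bad)))(?=.*(?<name>(Superman|Batman)))/;
'Superman did a good thing'.match(expr)?.groups; // { align: 'good', name: 'Superman' }
'bad Batman did a thing'.match(expr)?.groups;    // { align: 'bad', name: 'Batman' }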

@gelicia gelicia requested a review from a team as a code owner February 8, 2023 22:08
@gelicia (PR author) commented Feb 8, 2023

Also, none of this covers another use case that we are holding off on, which is when one capture group matches multiple times 🙃

@ifrost (Contributor) commented Feb 9, 2023

It's going to be difficult to have one transformation support all use cases. I like that we start with something powerful that can be simplified for the user later.

Also, we know that regex is complex for many users anyway. That's why we're providing logfmt, and we can provide more transformations in the future based on use cases (e.g. extracting key/value pairs and labels). We don't want users to have to rely only on regex.

I can imagine adding a new transformation (simpleregex) that would be the default, allowing users to write an expression and a name for the first match (basically what you had before the change).

A simpleregex won't prevent us from doing transformations in the future; I can imagine a data frame transformation that would behave this way. The only thing we need to remember is to keep the configuration decoupled from the current logic, so, as we said, we shouldn't use the "variable" property in the settings yet (but we could call it "name" to indicate it's the name of the first match).

Anyway, I feel it's powerful and flexible with multiple matches, and still open to going in many directions in the future.

@gelicia (PR author) commented Feb 9, 2023

The only thing we need to remember is to keep the configuration decoupled from the current logic, so, as we said, we shouldn't use the "variable" property in the settings yet

@ifrost In my opinion, adding a variable name option doesn't tie configuration to logic any more than any other definitions we specify, but I trust in your vision of this feature and will remove it.

Additionally, this limitation now means users cannot do multiple transformations with unnamed capture groups - it will override the fieldName variable with the last transformation. Again, I anticipate the most likely use case will be multiple simple regexes, and this will prohibit that from working without defining the capture group name.

@ifrost (Contributor) commented Feb 9, 2023

@ifrost In my opinion, adding a variable name option doesn't tie configuration to logic any more than any other definitions we specify, but I trust in your vision of this feature and will remove it.

I was only concerned about the naming. Let's say we have provisioning looking like:

  - name: testData-correlations
    correlations:
      - targetUID: WyFv5154z
        config:
          type: query
          target: ...
          field: "text"
          transformations:
            - type: "regex"
              expression: "(Superman|Batman)"
              variable: "name"

And say in the future we want to use real transformations where the regex creates not a variable but a new field, so we'd like to express that in the configuration and call it fieldName or name, e.g.:

  - name: testData-correlations
    correlations:
      - targetUID: WyFv5154z
        config:
          type: query
          target: ...
          field: "text"
          transformations:
            - type: "regex"
              expression: "(Superman|Batman)"
              fieldName: "name"

Only the name of the last property changes. From the user perspective and other configuration options, nothing changes and we can start adding more super-sophisticated transformations to the mix.

Of course, it's just a minor thing; we can deprecate the config and support variable for some time alongside fieldName.

Totally agree it "doesn't tie configuration to logic". It seems like a valid transformation that takes a simple regex expression and creates a field with a name provided by the user.

@gelicia (PR author) commented Feb 9, 2023

Alrighty, after another conversation with @ifrost and @Elfo404, we agreed that keeping an option that allows users to map a regex match to a variable name, but naming it mapValue, is the best way to keep the plurality and definition future-proof.
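
For illustration, the transformation from Example 1 would then be expressed along these lines (a sketch only; mapValue is the agreed property name, but the exact config shape may differ in the final code):

// Hypothetical config object mirroring Example 1 after the rename.
const transformation = {
  type: 'regex',
  expression: '(Superman|Batman)',
  mapValue: 'name',
};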

@ifrost (Contributor) left a comment:

Great stuff 🎉

I think it'd be great to have better provisioning validation, but we can add it in a separate PR.

For example, we should check whether the correct structure of the config is provided. I was testing with the example in the PR description and it created a link without transformations (they are incorrectly nested in the yaml example, so please update it). I also mistyped the transformation type (regexp instead of regex) and got a link with no transformations and no errors. We do this for some other properties (e.g. we validate that a correct correlation type, i.e. query, is provided) and return an error when the provisioning is parsed.

@@ -40,12 +40,21 @@ export interface DataLink<T extends DataQuery = any> {
  internal?: InternalDataLink<T>;
}

/** @internal */
export interface DataLinkTransformationConfig {
  type: 'regex' | 'logfmt';
A contributor commented:

Minor nit: It could be an enum
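
A minimal sketch of what the suggested enum might look like (name and values illustrative, not necessarily what was committed):

// Hypothetical enum replacing the 'regex' | 'logfmt' union.
export enum SupportedTransformationType {
  Regex = 'regex',
  Logfmt = 'logfmt',
}

export interface DataLinkTransformationConfig {
  type: SupportedTransformationType;
  // ...expression and the other transformation fields.
}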

@gelicia gelicia requested a review from a team as a code owner February 10, 2023 23:31
@gelicia gelicia requested a review from ifrost February 11, 2023 20:15
@gelicia (PR author) commented Feb 11, 2023

@ifrost I added the provisioning checks you asked for (although the errors do seem to come out a little garbled; not sure if there's something I'm missing there) and the enum change. I checked how one could do enums in Go and it didn't seem as straightforward as in TypeScript, so I left it as is. Let me know if you have any thoughts about that.

@gelicia (PR author) commented Feb 16, 2023

@ryantxu This is what we discussed yesterday - let me know what you think would be a better term for this feature

@ifrost (Contributor) left a comment:

Regex validation doesn't seem to check whether an expression is provided (it's required by the regex transformation to make it work).

@gelicia gelicia merged commit 06dfe21 into main Feb 22, 2023
@gelicia gelicia deleted the kristina/transformation branch February 22, 2023 12:53
ryantxu pushed a commit that referenced this pull request Mar 2, 2023
* bring in source from database

* bring in transformations from database

* add regex transformations to scopevar

* Consolidate types, add better example, cleanup

* Add var only if match

* Change ScopedVar to not require text, do not leak transformation-made variables between links

* Add mappings and start implementing logfmt

* Add mappings and start implementing logfmt

* Remove mappings, turn off global regex

* Add example yaml and omit transformations if empty

* Fix the yaml

* Add logfmt transformation

* Cleanup transformations and yaml

* add transformation field to FE types and use it, safeStringify logfmt values

* Add tests, only safe stringify if non-string, fix bug with safe stringify where it would return empty string with false value

* Add test for transformation field

* Do not add null transformations object

* Break out transformation logic, add tests to backend code

* Fix lint errors I understand 😅

* Fix the backend lint error

* Remove unnecessary code and mark new Transformations object as internal

* Add support for named capture groups

* Remove type assertion

* Remove variable name from transformation

* Add test for overriding regexes

* Add back variable name field, but change to mapValue

* fix go api test

* Change transformation types to enum, add better provisioning checks for bad type name and format

* Check for expression with regex transformations
Linked issue: Glue: Perform transformation/extraction when a DataLink is created