Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flux commands for analyzing input data #300

Merged
merged 14 commits into from
May 16, 2023
Merged

Flux commands for analyzing input data #300

merged 14 commits into from
May 16, 2023

Conversation

fsteeg
Copy link
Member

@fsteeg fsteeg commented May 4, 2023

@fsteeg fsteeg self-assigned this May 4, 2023
@fsteeg
Copy link
Member Author

fsteeg commented May 5, 2023

The first module is ready for functional review, assigning @TobiasNx. Deployed to playground for testing: your sample from #294, using the new fix-list-paths command.

From the documentation annotation:

Lists all paths found in a record. These paths can be used in a Fix to address fields. Options: count (output occurence frequency of each path, sorted by highest frequency first; default: true), index (output individual repeated subfields and array elements with index numbers instead of '*'; default: false)

(If we count it's always sorted by highest frequency, because if we count we'll automatically sort internally, which would make it hard to have a consistent implementation with separate control over the sorting. I think this way it's both conceptually simple and hopefully covers your use cases.)

@fsteeg fsteeg assigned TobiasNx and unassigned fsteeg May 5, 2023
@fsteeg
Copy link
Member Author

fsteeg commented May 8, 2023

And the second module, for listing values, is ready for review too. Deployed to playground: the example from above, but calling fix-list-values with one of the resulting paths.

From the documentation annotation:

Lists all values found for the given path. The paths can be found using fix-list-paths. Options: count (output occurence frequency of each value, sorted by highest frequency first; default: true)

(Sorting behavior is as described for fix-list-paths above.)

@fsteeg fsteeg changed the title WIP: flux commands for analyzing input data Flux commands for analyzing input data May 8, 2023
@fsteeg fsteeg marked this pull request as ready for review May 8, 2023 14:03
@TobiasNx
Copy link
Collaborator

Implementation looks good.
Only the option count= should behave differently. Only the template should change and count-triples should not be dropped. So that you have the list of values/paths but not the counted occurence. The path/values would be uniq and no duplicate should be given:

something like this:

// This part can be adjusted to your own data source. The example here is marc21
"https://raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21"
| open-http
| as-lines
| decode-marc21

// This part can be reused for all incoming data streams to list all fix paths.
| fix("nothing()",repeatedFieldsToEntities="true", entityMemberName="*")
| flatten
| stream-to-triples
| count-triples(countBy="predicate")
| template("${s}")
| print;

@TobiasNx TobiasNx assigned fsteeg and unassigned TobiasNx May 10, 2023
@TobiasNx
Copy link
Collaborator

Also perhaps the template should be | template("${o}\t|\t${s}") ( when count=false then | template("${s}"))

@fsteeg
Copy link
Member Author

fsteeg commented May 10, 2023

Only the template should change and count-triples should not be dropped

Deployed to playground for testing:

Also perhaps the template should be | template("${o}\t|\t${s}")

I've changed both templates to use \t|\t instead of \t :

(It seems I currently have to open these playground links in a new window, otherwise it's empty.)

@fsteeg fsteeg assigned TobiasNx and unassigned fsteeg May 10, 2023
@TobiasNx
Copy link
Collaborator

Great paths are unique now. +1
Concerning the template I am not sure if I prefer [occurece] | [path] more. This may be a simple preference. So I am okay with that. +1

Concerning the sorting I thought about it again.

| stream-to-triples
| count-triples(countBy="predicate")
| template("${s}")
| print;

This does not sort by occurence automatically. How does your function deviate from this? The alphabetical sorting helps to get an understanding of the source data and its schema. Therefore I would prefer if we also would support alphabetical sorting too.

@TobiasNx TobiasNx assigned fsteeg and unassigned TobiasNx May 12, 2023
@fsteeg
Copy link
Member Author

fsteeg commented May 12, 2023

Concerning the template I am not sure if I prefer [occurece] | [path] more.

I'm only trying to implement your requirement, I don't have my own preference here. Could you give me an example of the desired output for both commands?

@blackwinter
Copy link
Member

Just a side note: Wouldn't it make sense to make the template configurable?

@TobiasNx
Copy link
Collaborator

Concerning the template I am not sure if I prefer [occurece] | [path] more.

I'm only trying to implement your requirement, I don't have my own preference here. Could you give me an example of the desired output for both commands?

[path] | [occurrence]
6890 .5.* | 18

vs
[occurrence] | [path]
18 | 6890 .5.*

@TobiasNx
Copy link
Collaborator

Concerning the template I am not sure if I prefer [occurece] | [path] more.

I'm only trying to implement your requirement, I don't have my own preference here. Could you give me an example of the desired output for both commands?

[path] | [occurrence] 6890 .5.* | 18

vs [occurrence] | [path] 18 | 6890 .5.*

but keep it the way you implement it

@fsteeg
Copy link
Member Author

fsteeg commented May 12, 2023

The alphabetical sorting helps to get an understanding of the source data and its schema.

I've changed it to sort alphabetically if count is false:

fix-list-paths(count="false")

Concerning the template I am not sure if I prefer [occurece] | [path] more.

I've changed it to use ${o}\t|\t${s} as a common default template for both commands:

fix-list-paths
fix-list-values("68900.0.*")

With an option to set it (thanks for the input @blackwinter):

fix-list-paths(template="${s}\t${o}")

@fsteeg fsteeg assigned TobiasNx and unassigned fsteeg May 12, 2023
@TobiasNx
Copy link
Collaborator

Thanks, thats all great!

@TobiasNx TobiasNx assigned fsteeg and unassigned TobiasNx May 12, 2023
@fsteeg fsteeg requested a review from blackwinter May 15, 2023 07:22
@fsteeg fsteeg assigned blackwinter and unassigned fsteeg May 15, 2023
Copy link
Member

@blackwinter blackwinter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some (more or less) minor remarks.

@blackwinter blackwinter assigned fsteeg and unassigned blackwinter May 15, 2023
- Sort dependencies alphabetically
- Rename classes and flux commands for consistency with others
- Use backticks for options and their defaults in `@Description`
- Remove unused code, add final modifiers

See https://gitlab.com/oersi/oersi-etl/-/issues/238
@fsteeg
Copy link
Member Author

fsteeg commented May 16, 2023

Thank you for the review! Addressed your feedback in 172a701.

@fsteeg fsteeg requested a review from blackwinter May 16, 2023 07:43
@fsteeg fsteeg assigned blackwinter and unassigned fsteeg May 16, 2023
@blackwinter blackwinter assigned fsteeg and unassigned blackwinter May 16, 2023
@fsteeg fsteeg merged commit be0dbc8 into master May 16, 2023
1 check passed
@fsteeg fsteeg deleted the oersi-238 branch May 16, 2023 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants