RFC 0002 for mobility extension #116

Merged (5 commits into main, Jan 31, 2023)
Conversation

@schlingling (Contributor) commented Jan 29, 2023

For my new PR, I updated the RFC files with the feedback received from #115 and #111. This PR holds the concept of processing GTFS data using FileSystem and FilePickers instead of Collections, as proposed by @felix-oq in #115.

@schlingling self-assigned this on Jan 29, 2023
@schlingling changed the title from "Draft: RFC 0002 for mobility extension" to "RFC 0002 for mobility extension" on Jan 29, 2023
* New `io-datatypes` called `File` and `FileSystem`
* New blocktypes `HTTPExtractor`, `ArchiveInterpreter` and `FilePicker`
* The `io-datatype` `Table` needs to store its name to be able to handle multiple tables as input
* An abort mechanism must be implemented when a block gets empty/null/undefined input (in our case, the FilePicker)
Contributor

It is not really about a block getting no input; rather, a block can decide to abort its execution such that subsequent blocks are not executed at all. So a FilePicker can decide to abort if the given path does not lead to a file in the given FileSystem.

As an example, see this sketch:

[image: pipeline sketch]

There, the topmost FilePicker aborts because it is unable to find its file; hence, its subsequent blocks are not executed. Note that the other FilePicker blocks continue their execution if they can find their files. So aborting only affects blocks downstream of the aborting block.

Contributor

The question is how that is done. I'd vote for blocks being able to explicitly output a non-existing value (e.g. a None option). If that is the output of a block, downstream blocks are not executed.
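The None-option idea above can be sketched as follows. This is a minimal illustration, not Jayvee's actual API; `BlockResult`, `pickFile`, and `runChain` are hypothetical names.

```typescript
// Sketch of an explicit "no value" block output, assuming a hypothetical
// Option-style result type. Names are illustrative, not the actual Jayvee API.
type BlockResult<T> = { present: true; value: T } | { present: false };

const some = <T>(value: T): BlockResult<T> => ({ present: true, value });
const none = <T>(): BlockResult<T> => ({ present: false });

// A FilePicker-like step: aborts (returns none) if the path is missing.
function pickFile(
  fileSystem: Map<string, string>,
  path: string,
): BlockResult<string> {
  const content = fileSystem.get(path);
  return content === undefined ? none() : some(content);
}

// The executor skips all downstream blocks once a none is produced.
function runChain<T>(
  input: BlockResult<T>,
  steps: Array<(value: T) => BlockResult<T>>,
): BlockResult<T> {
  let current = input;
  for (const step of steps) {
    if (!current.present) return current; // abort: skip remaining blocks
    current = step(current.value);
  }
  return current;
}
```

The abort is thereby local to one chain: other chains with their own FilePickers are unaffected, matching the sketch discussed above.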

Contributor (Author)

I got the concept, thanks @felix-oq, will add this proposition to the RFC.


### io-datatype Collection
A Collection concept could look like this and should be added to `io-datatypes.ts`. Each element then gets wrapped in a Collection of CollectionElements, so we can have metadata for collection elements, such as an index and a name.
Not sure how to *implement* methods of an interface in `io-datatype.ts`, or is this realized in the `FilePicker` block, @felix-oq?
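A minimal sketch of such a Collection with element metadata could look like this (class and method names are illustrative, not part of the existing codebase):

```typescript
// Sketch of a Collection io-datatype wrapping elements with metadata
// (index and name), as proposed above. All names are hypothetical.
export interface CollectionElement<T> {
  index: number;
  name: string;
  value: T;
}

export class Collection<T> {
  private elements: CollectionElement<T>[] = [];

  add(name: string, value: T): void {
    this.elements.push({ index: this.elements.length, name, value });
  }

  getByName(name: string): T | undefined {
    return this.elements.find((e) => e.name === name)?.value;
  }

  get size(): number {
    return this.elements.length;
  }
}
```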
@felix-oq (Contributor) commented Jan 30, 2023

In `io-types.ts` you can define a custom interface and instantiate a new IOType instance that corresponds to your interface:

```
export interface FileSystem {
  // declare necessary attributes...
  // declare necessary methods...
}

export const FILE_SYSTEM_TYPE = new IOType<FileSystem>();
```

In order to implement the interface, you can create a class which provides the attributes / methods demanded by the interface.

To have a better folder structure you may move the io-types.ts file into a new folder io-types and add your file system implementation as an additional file into that folder.

Regarding the implementation of related blocks:

  • The ArchiveInterpreter needs to be able to instantiate a FileSystem instance in order to output it as a result.
  • The FilePicker needs to work with methods provided by a FileSystem instance in order to read the file according to the provided path.
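The two bullet points above can be sketched as follows. The `getFile` method, the `InMemoryFileSystem` class, and the `executeFilePicker` function are assumptions for illustration, not the actual implementation:

```typescript
// Sketch: a FileSystem interface, an in-memory implementation (as an
// ArchiveInterpreter might produce), and FilePicker-style lookup logic.
export interface File {
  name: string;
  content: string;
}

export interface FileSystem {
  getFile(path: string): File | undefined;
}

// What an ArchiveInterpreter could instantiate and output as its result.
export class InMemoryFileSystem implements FileSystem {
  constructor(private files: Map<string, string>) {}

  getFile(path: string): File | undefined {
    const content = this.files.get(path);
    if (content === undefined) return undefined;
    return { name: path, content };
  }
}

// FilePicker logic: read the file according to the provided path.
export function executeFilePicker(
  fs: FileSystem,
  path: string,
): File | undefined {
  return fs.getFile(path);
}
```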

Contributor (Author)

Great, thanks! Added the folder-structure-proposal to the RFC


### datatype Undefined *(Requires implementation from scratch)*
For an implementation of an optional mechanism for e.g. columns, we need a new datatype `undefined` (attention: not talking about an io-datatype, I mean a datatype). An optional column's datatype would then be `text or undefined`. So we also need a grammar feature for an OR-representation in Jayvee.
Contributor

The support for optional columns should be enabled via #85 and thus can be considered out of scope regarding this RFC. So for now, you can solely focus on obligatory columns until we have decided on how to treat optional columns in the language.

Would you also agree @rhazn?

Contributor

Yes, only obligatory columns, assume agency id is obligatory.

### 4) CSVInterpreter *(Requires minor change)*
Input: File, Output: Sheet

In the package `tabular`, a `CSVFileExtractor` is already implemented, which loads a CSV from a URL and outputs a Sheet. How should the CSVInterpreter be implemented here? Can I introduce a new blocktype, or should I rewrite the existing `CSVFileExtractor`? If we go with the second proposition, how should I handle the incoming io-type, as the existing extractor has void as input while the new interpreter wants a File? I would suggest rewriting the existing example pipelines (gas and cars) to use the new `HTTPExtractor` as well; then we just need one `CSVFileInterpreter` and no longer a `CSVFileExtractor`.
Contributor

Let's pick your second proposition. Then you are right, the existing examples need a small rewrite: CSVFileExtractor needs to be replaced with an HTTPExtractor and a subsequent CSVFileInterpreter.
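The rewritten example pipeline could then look roughly like this; the concrete block names, properties, and pipe syntax are a sketch, not final Jayvee syntax:

```
pipeline CarsPipeline {
  block CarsExtractor oftype HTTPExtractor {
    url: "https://example.com/cars.csv"; // placeholder URL
  }

  block CarsCSVInterpreter oftype CSVFileInterpreter {
  }

  pipe {
    from: CarsExtractor;
    to: CarsCSVInterpreter;
  }
}
```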

Contributor

+1 (even though I will repeat my comment that we need statements and decisions in an RFC, not questions).

Contributor (Author)

Reformulated to statement

In the specification, the order of the columns is not defined, so we need to access the columns by their names, not their index, as every GTFS endpoint could have a different order!
### 6) SQLiteSink *(Requires minor change)*
Input: Table, Output: void
This block needs to be adapted to handle multiple inputs. As the processing of the different files does not depend on each other, we could potentially use a dedicated SQLiteSink for every file. We change the SQLiteSink to receive the table name via the io-datatype `Table` itself.
Contributor

Using separate SQLiteSink blocks for each file would be a way to implement the pipeline without having to introduce the concept of parallel inputs. That would result in a lot of duplicate code, but it would also be suited to initially demonstrate the proof of concept without the effort of modifying the execution logic.

Contributor

Fine for me in the scope of this RFC. In the implementation we'd need to make sure that the different SQLiteSink blocks do not recreate the database, but otherwise this works for me.
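The combination of both points above, a `Table` that carries its own name and sinks that do not recreate a shared database, could be sketched like this. The `Table` shape and the sink API are assumptions; the sink emits SQL strings instead of touching a real database to keep the sketch self-contained:

```typescript
// Sketch: Table io-datatype storing its own name, plus a sink that creates
// each table at most once (CREATE TABLE IF NOT EXISTS), so multiple sink
// blocks can share one database. All names are hypothetical.
export interface Table {
  name: string;
  columnNames: string[];
  data: string[][];
}

export class SQLiteSinkSketch {
  private createdTables = new Set<string>();

  // Returns the SQL statements that would be executed for this table.
  load(table: Table): string[] {
    const statements: string[] = [];
    if (!this.createdTables.has(table.name)) {
      statements.push(
        `CREATE TABLE IF NOT EXISTS ${table.name} (${table.columnNames.join(', ')})`,
      );
      this.createdTables.add(table.name);
    }
    for (const row of table.data) {
      statements.push(
        `INSERT INTO ${table.name} VALUES (${row.map((v) => `'${v}'`).join(', ')})`,
      );
    }
    return statements;
  }
}
```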

Contributor (Author)

changed statement

This mapping is later also used in the `SQLiteTablesLoader` to define the output table names. The syntax for `layouts` is inspired by the MyTableBuilder syntax from issue #109 (RFC for cell ranges).
Not sure how to *implement* multiple inputs/outputs of blocks (for SQLiteSink and ArchiveInterpreter).
Contributor

Note that multiple outputs are already implemented, so for ArchiveInterpreter it should work out of the box. For multiple inputs, the execution logic in the interpreter needs to be modified. It is located in this method:

```
async function runPipeline(
  pipeline: Pipeline,
  runtimeParameters: Map<string, string | number | boolean>,
  loggerFactory: LoggerFactory,
): Promise<ExitCode> {
```

After it has been implemented, the validation that prevents multiple inputs needs to be removed:

```
} else if (pipes.length > 1 && whatToCheck === 'input') {
  for (const pipe of pipes) {
    accept(
      'error',
      `At most one pipe can be connected to the ${whatToCheck} of a ${block.type}`,
      {
        node: pipe,
        property: 'to',
      },
    );
  }
```

Let me know whether you need help with the implementation, we can do a pair programming session if you want.
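The needed change to the execution logic can be sketched as follows: instead of passing a single predecessor output to each block, collect the outputs of all incoming pipes and only execute a block once every predecessor has produced its output. The `Block` shape and the executor function are illustrative, not the interpreter's actual code:

```typescript
// Sketch of execution logic supporting multiple inputs per block: a block
// runs once the outputs of all its predecessors are available.
interface Block {
  name: string;
  inputs: string[]; // names of predecessor blocks
  execute: (inputs: unknown[]) => unknown;
}

function runPipelineSketch(blocks: Block[]): Map<string, unknown> {
  const outputs = new Map<string, unknown>();
  const remaining = [...blocks];
  while (remaining.length > 0) {
    // Find a block whose predecessors have all produced an output.
    const readyIndex = remaining.findIndex((b) =>
      b.inputs.every((name) => outputs.has(name)),
    );
    if (readyIndex === -1) {
      throw new Error('Cycle or missing predecessor in pipeline');
    }
    const block = remaining.splice(readyIndex, 1)[0];
    const inputValues = block.inputs.map((name) => outputs.get(name));
    outputs.set(block.name, block.execute(inputValues));
  }
  return outputs;
}
```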

Contributor (Author)

Wow, thank you for explaining and for the offer! Will dry-run the solution with multiple SQLiteSinks and later on change the execution logic.

@rhazn (Contributor) left a comment

For the final version I'd ask you to rewrite questions into decisions, but otherwise we are nearly there, I think 👍.

extension: string // The file extension

filetype: string // The MIME type of the file, taken from the Content-Type header (for HTTP requests only); otherwise inferred from the file extension. Could default to text/plain or application/octet-stream for unknown or missing file extensions
Contributor

In the context of an RFC, all these "Could be?" questions need to be statements of a choice. The question can be raised in this discussion. I'd vote for application/octet-stream and ArrayBuffer here.
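The decision voted for above (prefer the Content-Type header, else infer from the extension, defaulting to application/octet-stream) could look like this; the extension table is a small illustrative subset, and `inferMimeType` is a hypothetical helper name:

```typescript
// Sketch: infer a File's MIME type, defaulting to application/octet-stream
// as voted above. The extension mapping is illustrative, not exhaustive.
const MIME_BY_EXTENSION: Record<string, string> = {
  txt: 'text/plain',
  csv: 'text/csv',
  zip: 'application/zip',
};

export function inferMimeType(
  fileName: string,
  contentTypeHeader?: string,
): string {
  // Prefer the Content-Type header when the file came from an HTTP request.
  if (contentTypeHeader !== undefined && contentTypeHeader !== '') {
    return contentTypeHeader;
  }
  const extension = fileName.split('.').pop()?.toLowerCase() ?? '';
  return MIME_BY_EXTENSION[extension] ?? 'application/octet-stream';
}
```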

Contributor (Author)

Changed to a statement; will raise questions in the future via comments on GitHub.

```
block MyHttpExtractor oftype HttpExtractor {
  url: "https://www.data.gouv.fr/fr/datasets/r/c4d9326f-9f41-4dfb-9746-31bc97a31fc6";
  content-type: string // the expected content-type of the http-call
```
Contributor

I am against specifying the content type here. It places way too much of a burden on the language user. Just infer this from the call/file ending or hardcode to octet-stream right now.

Contributor (Author)

removed

```
block ZipArchiveInterpreter oftype ArchiveInterpreter {
```
Contributor

This will need a parameter for the archive type; right now only accepting the string "zip", imho.
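With that parameter, the block declaration could look roughly like this; the property name `archiveType` is a sketch, not decided syntax:

```
block ZipArchiveInterpreter oftype ArchiveInterpreter {
  archiveType: "zip"; // currently the only accepted value
}
```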

Contributor (Author)

added param to RFC


A LayoutsValidator (attention: here we talk about multiple layouts) gets as input a collection of sheets and validates every sheet using a single, dedicated LayoutValidator (for a single layout). As a parameter, the LayoutsValidator gets a mapping of filenames to layouts in order to be able to process multiple files/layouts within one block. Every sheet in the collection has its corresponding layout, wrapped in the layouts block. After the validation of every sheet is successful, the LayoutsValidator outputs a collection of validated tables.
### 5) LayoutValidator and Layouts *(Requires minor change)*
Contributor

These I would consider out of scope for this RFC, the only thing relevant for us is the named columns and we will handle that in another RFC. Here you can just summarize that we will only deal with required columns and reuse the existing language features for that.

Contributor (Author)

changed description


@schlingling (Contributor, Author) left a comment

Added all discussed points in a new commit.

@schlingling merged commit 8b211ac into main on Jan 31, 2023
schlingling added a commit that referenced this pull request on Jan 31, 2023
@github-actions bot locked and limited conversation to collaborators on Jan 31, 2023
@schlingling (Contributor, Author)

This PR is a basis for #123
