RFC 0002 for mobility extension #116

Merged (5 commits into main, Jan 31, 2023)
Conversation

@schlingling (Contributor) commented Jan 29, 2023

For my new PR, I updated the RFC files with the feedback received from #115 and #111. This PR holds the concept of processing GTFS data using FileSystem and FilePickers instead of Collections, as proposed by @felix-oq in #115.

@schlingling self-assigned this on Jan 29, 2023
@schlingling changed the title from "Draft: RFC 0002 for mobility extension" to "RFC 0002 for mobility extension" on Jan 29, 2023
* New `io-datatypes` called `File` and `FileSystem`
* New blocktypes `HTTPExtractor`, `ArchiveInterpreter` and `FilePicker`
* The `io-datatype` `Table` needs to store its name to be able to handle multiple tables as input
* An abort mechanism must be implemented when a block gets empty/null/undefined input (in our case, the FilePicker)
Contributor

It is not really about a block getting no input; rather, a block can decide to abort its execution such that subsequent blocks are not executed at all. So a FilePicker can decide to abort if the given path does not lead to a file in the given FileSystem.

As an example, see this sketch:

[image: pipeline sketch]

There, the topmost FilePicker aborts because it is unable to find its file; hence, its subsequent blocks are not executed. Note that the other FilePicker blocks continue their execution if they can find their files. So aborting only affects blocks downstream of the aborting block.

Contributor

The question is how that is done. I'd vote for blocks being able to explicitly output a non-existing value (e.g. a None option). If that is the output of a block, downstream blocks are not executed.
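The None-option idea above can be sketched as follows. This is a minimal illustration, not Jayvee's actual API; `BlockResult`, `pickFile`, and `runChain` are hypothetical names.

```typescript
// Sketch of an explicit "no value" block output, assuming a hypothetical
// Option-style result type. Names are illustrative, not the actual Jayvee API.
type BlockResult<T> = { present: true; value: T } | { present: false };

const some = <T>(value: T): BlockResult<T> => ({ present: true, value });
const none = <T>(): BlockResult<T> => ({ present: false });

// A FilePicker-like step: aborts (returns none) if the path is missing.
function pickFile(
  fileSystem: Map<string, string>,
  path: string,
): BlockResult<string> {
  const content = fileSystem.get(path);
  return content === undefined ? none() : some(content);
}

// The executor skips all downstream blocks once a none is produced.
function runChain<T>(
  input: BlockResult<T>,
  steps: Array<(value: T) => BlockResult<T>>,
): BlockResult<T> {
  let current = input;
  for (const step of steps) {
    if (!current.present) return current; // abort: skip remaining blocks
    current = step(current.value);
  }
  return current;
}
```

The abort is thereby local to one chain: other chains with their own FilePickers are unaffected, matching the sketch discussed above.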

Contributor (Author)

I got the concept, thanks @felix-oq, will add this proposition to the RFC.


### io-datatype Collection
A Collection concept could look like this and should be added to `io-datatypes.ts`. Each element then gets wrapped in a Collection of CollectionElements, so we can have metadata for collection elements, such as an index and a name.
Not sure how to *implement* methods of an interface in `io-datatype.ts`, or is this realized in the `FilePicker` block, @felix-oq?
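A minimal sketch of such a Collection with element metadata could look like this (class and method names are illustrative, not part of the existing codebase):

```typescript
// Sketch of a Collection io-datatype wrapping elements with metadata
// (index and name), as proposed above. All names are hypothetical.
export interface CollectionElement<T> {
  index: number;
  name: string;
  value: T;
}

export class Collection<T> {
  private elements: CollectionElement<T>[] = [];

  add(name: string, value: T): void {
    this.elements.push({ index: this.elements.length, name, value });
  }

  getByName(name: string): T | undefined {
    return this.elements.find((e) => e.name === name)?.value;
  }

  get size(): number {
    return this.elements.length;
  }
}
```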
@felix-oq (Contributor) commented Jan 30, 2023

In `io-types.ts` you can define a custom interface and instantiate a new IOType instance that corresponds to your interface:

```
export interface FileSystem {
  // declare necessary attributes...
  // declare necessary methods...
}

export const FILE_SYSTEM_TYPE = new IOType<FileSystem>();
```

In order to implement the interface, you can create a class which provides the attributes / methods demanded by the interface.

To have a better folder structure you may move the io-types.ts file into a new folder io-types and add your file system implementation as an additional file into that folder.

Regarding the implementation of related blocks:

  • The ArchiveInterpreter needs to be able to instantiate a FileSystem instance in order to output it as a result.
  • The FilePicker needs to work with methods provided by a FileSystem instance in order to read the file according to the provided path.
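The two bullet points above can be sketched as follows. The `getFile` method, the `InMemoryFileSystem` class, and the `executeFilePicker` function are assumptions for illustration, not the actual implementation:

```typescript
// Sketch: a FileSystem interface, an in-memory implementation (as an
// ArchiveInterpreter might produce), and FilePicker-style lookup logic.
export interface File {
  name: string;
  content: string;
}

export interface FileSystem {
  getFile(path: string): File | undefined;
}

// What an ArchiveInterpreter could instantiate and output as its result.
export class InMemoryFileSystem implements FileSystem {
  constructor(private files: Map<string, string>) {}

  getFile(path: string): File | undefined {
    const content = this.files.get(path);
    if (content === undefined) return undefined;
    return { name: path, content };
  }
}

// FilePicker logic: read the file according to the provided path.
export function executeFilePicker(
  fs: FileSystem,
  path: string,
): File | undefined {
  return fs.getFile(path);
}
```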

Contributor (Author)

Great, thanks! Added the folder-structure-proposal to the RFC


### datatype Undefined *(Requires implementation from scratch)*
For an implementation of an optional mechanism for e.g. columns, we need a new datatype `undefined` (attention: not talking about an io-datatype, I mean a datatype). An optional column's datatype would then be `text or undefined`. So we also need a grammar feature for an OR-representation in Jayvee.
Contributor

The support for optional columns should be enabled via #85 and thus can be considered out of scope regarding this RFC. So for now, you can solely focus on obligatory columns until we have decided on how to treat optional columns in the language.

Would you also agree @rhazn?

Contributor

Yes, only obligatory columns, assume agency id is obligatory.

### 4) CSVInterpreter *(Requires minor change)*
Input: File, Output: Sheet

In the package `tabular`, a `CSVFileExtractor` is already implemented, which loads a CSV from a URL and outputs a Sheet. How should the CSVInterpreter be implemented here? Can I introduce a new blocktype, or should I rewrite the existing `CSVFileExtractor`? If we go with the second proposition, how should I handle the incoming io-type, as the existing extractor has void as input while the new interpreter wants a File? I would suggest rewriting the existing example pipelines (gas and cars) to use the new `HTTPExtractor` as well; then we just need one `CSVFileInterpreter` and no longer a `CSVFileExtractor`.
Contributor

Let's pick your second proposition. Then you are right, the existing examples need a small rewrite: CSVFileExtractor needs to be replaced with an HTTPExtractor and a subsequent CSVFileInterpreter.
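The rewritten example pipeline could then look roughly like this; the concrete block names, properties, and pipe syntax are a sketch, not final Jayvee syntax:

```
pipeline CarsPipeline {
  block CarsExtractor oftype HTTPExtractor {
    url: "https://example.com/cars.csv"; // placeholder URL
  }

  block CarsCSVInterpreter oftype CSVFileInterpreter {
  }

  pipe {
    from: CarsExtractor;
    to: CarsCSVInterpreter;
  }
}
```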

Contributor

+1 (even though I will repeat my comment that we need statements and decisions in an RFC, not questions).

Contributor (Author)

Reformulated to statement

In the specification, the order of the columns is not defined, so we need to access the columns by their names, not their index, as every GTFS endpoint could have a different order!
### 6) SQLiteSink *(Requires minor change)*
Input: Table, Output: void
This block needs to be adapted to handle multiple inputs. As the processing of the different files does not depend on each other, we could potentially use a dedicated SQLiteSink for every file. We change the SQLiteSink to receive the table name via the io-datatype `Table` itself.
Contributor

Using separate SQLiteSink blocks for each file would be a way to implement the pipeline without having to introduce the concept of parallel inputs. That would result in a lot of duplicate code, but it would also be suited to initially demonstrate the proof of concept without the effort of modifying the execution logic.

Contributor

Fine for me in the scope of this RFC. In the implementation we'd need to make sure that the different SQLiteSink blocks do not recreate the database, but otherwise this works for me.
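The combination of both points above, a `Table` that carries its own name and sinks that do not recreate a shared database, could be sketched like this. The `Table` shape and the sink API are assumptions; the sink emits SQL strings instead of touching a real database to keep the sketch self-contained:

```typescript
// Sketch: Table io-datatype storing its own name, plus a sink that creates
// each table at most once (CREATE TABLE IF NOT EXISTS), so multiple sink
// blocks can share one database. All names are hypothetical.
export interface Table {
  name: string;
  columnNames: string[];
  data: string[][];
}

export class SQLiteSinkSketch {
  private createdTables = new Set<string>();

  // Returns the SQL statements that would be executed for this table.
  load(table: Table): string[] {
    const statements: string[] = [];
    if (!this.createdTables.has(table.name)) {
      statements.push(
        `CREATE TABLE IF NOT EXISTS ${table.name} (${table.columnNames.join(', ')})`,
      );
      this.createdTables.add(table.name);
    }
    for (const row of table.data) {
      statements.push(
        `INSERT INTO ${table.name} VALUES (${row.map((v) => `'${v}'`).join(', ')})`,
      );
    }
    return statements;
  }
}
```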

Contributor (Author)

changed statement

This mapping is later also used in the `SQLiteTablesLoader` to define the output table names. The syntax for `layouts` is inspired by the MyTableBuilder syntax from issue #109 (RFC for cell ranges).
Not sure how to *implement* multiple inputs/outputs of blocks (for SQLiteSink and ArchiveInterpreter).
Contributor

Note that multiple outputs are already implemented, so for ArchiveInterpreter it should work out of the box. For multiple inputs, the execution logic in the interpreter needs to be modified. It is located in this method:

```
async function runPipeline(
  pipeline: Pipeline,
  runtimeParameters: Map<string, string | number | boolean>,
  loggerFactory: LoggerFactory,
): Promise<ExitCode> {
```

After it has been implemented, the validation that prevents multiple inputs needs to be removed:

```
} else if (pipes.length > 1 && whatToCheck === 'input') {
  for (const pipe of pipes) {
    accept(
      'error',
      `At most one pipe can be connected to the ${whatToCheck} of a ${block.type}`,
      {
        node: pipe,
        property: 'to',
      },
    );
  }
```

Let me know whether you need help with the implementation, we can do a pair programming session if you want.
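The needed change to the execution logic can be sketched as follows: instead of passing a single predecessor output to each block, collect the outputs of all incoming pipes and only execute a block once every predecessor has produced its output. The `Block` shape and the executor function are illustrative, not the interpreter's actual code:

```typescript
// Sketch of execution logic supporting multiple inputs per block: a block
// runs once the outputs of all its predecessors are available.
interface Block {
  name: string;
  inputs: string[]; // names of predecessor blocks
  execute: (inputs: unknown[]) => unknown;
}

function runPipelineSketch(blocks: Block[]): Map<string, unknown> {
  const outputs = new Map<string, unknown>();
  const remaining = [...blocks];
  while (remaining.length > 0) {
    // Find a block whose predecessors have all produced an output.
    const readyIndex = remaining.findIndex((b) =>
      b.inputs.every((name) => outputs.has(name)),
    );
    if (readyIndex === -1) {
      throw new Error('Cycle or missing predecessor in pipeline');
    }
    const block = remaining.splice(readyIndex, 1)[0];
    const inputValues = block.inputs.map((name) => outputs.get(name));
    outputs.set(block.name, block.execute(inputValues));
  }
  return outputs;
}
```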

Contributor (Author)

Wow, thank you for explaining and for the offer! Will dry-run the solution with multiple SQLiteSinks and later on change the execution logic.

@rhazn (Contributor) left a comment

For the final version I'd ask you to rewrite questions into decisions, but otherwise we are nearly there, I think 👍.

extension: string // The file extension

filetype: string // The MIME type of the file, taken from the Content-Type header (for HTTP requests only); otherwise inferred from the file extension. Could default to text/plain or application/octet-stream for unknown or missing file extensions
Contributor

In the context of an RFC, all these "Could be?" questions need to be statements of a choice. The question can be raised in this discussion. I'd vote for application/octet-stream and ArrayBuffer here.
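The decision voted for above (prefer the Content-Type header, else infer from the extension, defaulting to application/octet-stream) could look like this; the extension table is a small illustrative subset, and `inferMimeType` is a hypothetical helper name:

```typescript
// Sketch: infer a File's MIME type, defaulting to application/octet-stream
// as voted above. The extension mapping is illustrative, not exhaustive.
const MIME_BY_EXTENSION: Record<string, string> = {
  txt: 'text/plain',
  csv: 'text/csv',
  zip: 'application/zip',
};

export function inferMimeType(
  fileName: string,
  contentTypeHeader?: string,
): string {
  // Prefer the Content-Type header when the file came from an HTTP request.
  if (contentTypeHeader !== undefined && contentTypeHeader !== '') {
    return contentTypeHeader;
  }
  const extension = fileName.split('.').pop()?.toLowerCase() ?? '';
  return MIME_BY_EXTENSION[extension] ?? 'application/octet-stream';
}
```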

Contributor (Author)

Changed to a statement; will raise questions in the future via comments on GitHub.

```
block MyHttpExtractor oftype HttpExtractor {
  url: "https://www.data.gouv.fr/fr/datasets/r/c4d9326f-9f41-4dfb-9746-31bc97a31fc6";
  content-type: string // the expected content-type of the http-call
```
Contributor

I am against specifying the content type here. It places way too much of a burden on the language user. Just infer this from the call/file ending or hardcode to octet-stream right now.

Contributor (Author)

removed

```
block ZipArchiveInterpreter oftype ArchiveInterpreter {
```
Contributor

This will need a parameter for the archive type; right now only accepting the string "zip", imho.
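With that parameter, the block declaration could look roughly like this; the property name `archiveType` is a sketch, not decided syntax:

```
block ZipArchiveInterpreter oftype ArchiveInterpreter {
  archiveType: "zip"; // currently the only accepted value
}
```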

Contributor (Author)

added param to RFC


A LayoutsValidator (attention: here we talk about multiple layouts) gets as input a collection of sheets and validates every sheet using a single, dedicated LayoutValidator (for a single layout). As a parameter, the LayoutsValidator gets a mapping of filenames to layouts in order to be able to process multiple files/layouts within one block. Every sheet in the collection has its corresponding layout, wrapped in the layouts block. After the validation of every sheet is successful, the LayoutsValidator outputs a collection of validated tables.
### 5) LayoutValidator and Layouts *(Requires minor change)*
Contributor

These I would consider out of scope for this RFC, the only thing relevant for us is the named columns and we will handle that in another RFC. Here you can just summarize that we will only deal with required columns and reuse the existing language features for that.

Contributor (Author)

changed description


@schlingling (Contributor, Author) left a comment

Added all discussed points in a new commit.

@schlingling merged commit 8b211ac into main on Jan 31, 2023
schlingling added a commit that referenced this pull request on Jan 31, 2023
@github-actions bot locked and limited conversation to collaborators on Jan 31, 2023
@schlingling (Contributor, Author)

This PR is a basis for #123
