New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC for gtfs extension #111
Conversation
CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅ |
CLA already submitted by physical letter (should arrive in Erlangen soon) |
The bot will not give you piece until you comment the "I have read the CLA Document and I hereby sign the CLA" and trigger a recheck, sorry. Would be great if you can do that in addition to the signed CLA. Thanks, will look at this early next week! |
I have read the CLA Document and I hereby sign the CLA |
Recheck |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very well written RFC I think, thank you. I left some notes, @felix-oq would you be so nice and do a review as well? If anything is unclear, feel free to ping me on slack. FYI @georg-schwarz as well, happy if you have input but not a requirement.
rfc/0002-gtfs-extension.md
Outdated
Disclaimer: This RFC is part of my master-thesis "Archiving open transport data using the JValue tooling ecosystem" supervised by @rhazn. I'll open an PR for that, in order to discuss the neccessary changes to Jayvee due to my implemenation. | ||
|
||
# Summary | ||
This RFC enables a pipeline extracting, validating and loading GTFS-data by providing an GTFS-endpoint under consideration of the [GTFS-speficiation](https://developers.google.com/transit/gtfs/reference). For that reason some changes and extensions of Jayvee have to be made. The overall goal of this RFC is processing GTFS-data with a minimun of changes/extensions in Jayvee. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"GTFS-specification", "minimum"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed these typos
rfc/0002-gtfs-extension.md
Outdated
|
||
Each of the following subchapters explains the idea behind. | ||
|
||
## 1) HttpExtractor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, should be part of the std extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: How to deal with that? When i start implementation, should i create them already in std-folder or first in extensions->mobility and we move this later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can implement these in the std folder.
rfc/0002-gtfs-extension.md
Outdated
} | ||
``` | ||
|
||
## 2) ZipInterpreter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, should be part of the std extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: How to deal with that? When i start implementation, should i create them already in std-folder or first in extensions->mobility and we move this later?
rfc/0002-gtfs-extension.md
Outdated
} | ||
``` | ||
|
||
## 3) GtfsInterpreter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, collections of sheets should be part of the std extension.
This block, however, should be part of the new extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, i will implement the GtfsInterpreter in the mobility-extension
TODO: Lets discuss in our call, where to implement collections exactly (i think, in the io-types.ts file).
rfc/0002-gtfs-extension.md
Outdated
``` | ||
|
||
### 4.1) Layout | ||
As GTFS defines a whole datamodel consisting of multiple dimensions, we can provide (in some future) the user with a predefined set of layouts. Some columns in GTFS-csv-files are optional and conditional optional, what can be reflected in the grammar using `?` and `!` as an flag. From a generic perspective, these flags can be considered in every Validator, depending on the implementation, since optional columns could be in all kinds of csv-files not just in gtfs-csv-files. A vision is, that the GTFS-pipeline later on processes a list of GTFS-Endpoints. Because every endpoint has at least the required-columns, we need to have the optional-mechanism in our layout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should represent the optional columns with their datatype, e.g. saying their datatype is text or undefined
. For that, we'd need an undefined datatype in Jayvee and a way to combine these types.
For this RFC, I would only include required columns and ignore everything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a chapter, describing this behavior
rfc/0002-gtfs-extension.md
Outdated
|
||
In a GTFS-Validator, some conditional (aka logical) checks could possibly be applied during the validation (not just a static header/datatype validaton) | ||
|
||
In the specifiation, the order of the columns is not defined, so we need to access the columns by their names, not their index as every GTFS-endpoint could possibly have a different order!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very important and interesting requirement, I think we should let this influence #114 directly. I'll mention it there. I could see selecting cell ranges by names, aka column A
or column "agency_id"
(note the quotation marks to show it is a string).
rfc/0002-gtfs-extension.md
Outdated
### 4.2) Layouts | ||
As we want to process a zip-file containing multiple csv-files, we also need a mechanism to map csv-files to layouts. In my [.jv-file](0002-gtfs.jv), every layout gets defined as usual (header, column, datatypes --> please have look at chapter 4.1). Multiple layouts get then wrapped in a new grammar-feature, which maps csv-filenames to layouts. This layouts-artefact gets referenced by `SQLiteTablesLoader` and `LayoutsValidator`, too. | ||
``` | ||
layouts gtfsLayouts { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we need a new grammar feature for this. We definitely should talk about how to express collections with and without string keys in our language (and if we want to).
Originally, we had an idea to consider an attribute that was added multiple times as a collection. For collections with a string index, we'd use <key> with <value>
syntax for each element, for arrays just the value
. In this case, it could be:
block GtfsValidator oftype LayoutsValidator {
validationLayouts: "agency" with agencyLayout;
validationLayouts: "stops" with stopsLayout;
}
An alternative could be to allow key, value tuples, consider:
block GtfsValidator oftype LayoutsValidator {
validationLayouts: ("agency", agencyLayout);
validationLayouts: ("stops", stopsLayout);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets discuss this in our call
rfc/0002-gtfs-extension.md
Outdated
|
||
``` | ||
block GtfsLoader oftype SQLiteTablesLoader { | ||
tables: gtfsLayouts; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, I would honestly just assume the keys of the collection can be used as table names for now, write that down and remove the tables
attribute here:
block GtfsLoader oftype SQLiteTablesLoader {
file: "./gtfs.db";
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
rfc/0002-gtfs.jv
Outdated
|
||
layout agencyLayout { | ||
header row 1: text; | ||
column agency_id!: text; //Conditional optional columns marked with ! (but could be any other literal as well... i took that one because ? stands for optional) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can this be conditionally optional if routes have an agency_id that is not optional? Is this optional if no routes file (or other file that has agency_id as required) exists?
In any case, as discussed before, I would remove the !
/?
logic for now and just consider the required columns + consider this cond. required column as required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can this be conditionally optional if routes have an agency_id that is not optional? Is this optional if no routes file (or other file that has agency_id as required) exists?
A dataset may contain data from multiple agencies. This field is required when the dataset contains data for multiple transit agencies, otherwise it is optional.
I remove the ! ? stuff in the new commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apply all changes mentioned directly and will add a new commit. Should i resolve all comments, where no more action is needed @rhazn?
rfc/0002-gtfs-extension.md
Outdated
} | ||
``` | ||
|
||
## 2) ZipInterpreter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: How to deal with that? When i start implementation, should i create them already in std-folder or first in extensions->mobility and we move this later?
rfc/0002-gtfs-extension.md
Outdated
} | ||
``` | ||
|
||
## 3) GtfsInterpreter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, i will implement the GtfsInterpreter in the mobility-extension
TODO: Lets discuss in our call, where to implement collections exactly (i think, in the io-types.ts file).
rfc/0002-gtfs-extension.md
Outdated
} | ||
``` | ||
|
||
## 4) LayoutsValidator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a chapter for describing the new io-datatypes needed
rfc/0002-gtfs-extension.md
Outdated
``` | ||
|
||
### 4.1) Layout | ||
As GTFS defines a whole datamodel consisting of multiple dimensions, we can provide (in some future) the user with a predefined set of layouts. Some columns in GTFS-csv-files are optional and conditional optional, what can be reflected in the grammar using `?` and `!` as an flag. From a generic perspective, these flags can be considered in every Validator, depending on the implementation, since optional columns could be in all kinds of csv-files not just in gtfs-csv-files. A vision is, that the GTFS-pipeline later on processes a list of GTFS-Endpoints. Because every endpoint has at least the required-columns, we need to have the optional-mechanism in our layout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a chapter, describing this behavior
rfc/0002-gtfs-extension.md
Outdated
### 4.2) Layouts | ||
As we want to process a zip-file containing multiple csv-files, we also need a mechanism to map csv-files to layouts. In my [.jv-file](0002-gtfs.jv), every layout gets defined as usual (header, column, datatypes --> please have look at chapter 4.1). Multiple layouts get then wrapped in a new grammar-feature, which maps csv-filenames to layouts. This layouts-artefact gets referenced by `SQLiteTablesLoader` and `LayoutsValidator`, too. | ||
``` | ||
layouts gtfsLayouts { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets discuss this in our call
rfc/0002-gtfs-extension.md
Outdated
|
||
``` | ||
block GtfsLoader oftype SQLiteTablesLoader { | ||
tables: gtfsLayouts; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
rfc/0002-gtfs.jv
Outdated
|
||
layout agencyLayout { | ||
header row 1: text; | ||
column agency_id!: text; //Conditional optional columns marked with ! (but could be any other literal as well... i took that one because ? stands for optional) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can this be conditionally optional if routes have an agency_id that is not optional? Is this optional if no routes file (or other file that has agency_id as required) exists?
A dataset may contain data from multiple agencies. This field is required when the dataset contains data for multiple transit agencies, otherwise it is optional.
I remove the ! ? stuff in the new commit.
This PR is a basis for #123 |
See rfc/0002-gtfs-extension.md for detailled explanation. Really looking foward to your feedback!