Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC for gtfs extension #111

Closed
wants to merge 4 commits into from
Closed

RFC for gtfs extension #111

wants to merge 4 commits into from

Conversation

schlingling
Copy link
Contributor

See rfc/0002-gtfs-extension.md for detailled explanation. Really looking foward to your feedback!

@schlingling schlingling added the enhancement New feature or request label Jan 22, 2023
@schlingling schlingling self-assigned this Jan 22, 2023
@github-actions
Copy link

github-actions bot commented Jan 22, 2023

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@schlingling
Copy link
Contributor Author

CLA Assistant Lite bot: Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.

I have read the CLA Document and I hereby sign the CLA

You can retrigger this bot by commenting recheck in this Pull Request

CLA already submitted by physical letter (should arrive in Erlangen soon)

@rhazn
Copy link
Contributor

rhazn commented Jan 22, 2023

The bot will not give you piece until you comment the "I have read the CLA Document and I hereby sign the CLA" and trigger a recheck, sorry. Would be great if you can do that in addition to the signed CLA. Thanks, will look at this early next week!

@schlingling
Copy link
Contributor Author

schlingling commented Jan 22, 2023

I have read the CLA Document and I hereby sign the CLA

@schlingling
Copy link
Contributor Author

Recheck

Copy link
Contributor

@rhazn rhazn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well written RFC I think, thank you. I left some notes, @felix-oq would you be so nice and do a review as well? If anything is unclear, feel free to ping me on slack. FYI @georg-schwarz as well, happy if you have input but not a requirement.

rfc/0002-gtfs-extension.md Outdated Show resolved Hide resolved
Disclaimer: This RFC is part of my master-thesis "Archiving open transport data using the JValue tooling ecosystem" supervised by @rhazn. I'll open an PR for that, in order to discuss the neccessary changes to Jayvee due to my implemenation.

# Summary
This RFC enables a pipeline extracting, validating and loading GTFS-data by providing an GTFS-endpoint under consideration of the [GTFS-speficiation](https://developers.google.com/transit/gtfs/reference). For that reason some changes and extensions of Jayvee have to be made. The overall goal of this RFC is processing GTFS-data with a minimun of changes/extensions in Jayvee.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"GTFS-specification", "minimum"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed these typos


Each of the following subchapters explains the idea behind.

## 1) HttpExtractor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, should be part of the std extension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: How to deal with that? When i start implementation, should i create them already in std-folder or first in extensions->mobility and we move this later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can implement these in the std folder.

}
```

## 2) ZipInterpreter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, should be part of the std extension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: How to deal with that? When i start implementation, should i create them already in std-folder or first in extensions->mobility and we move this later?

}
```

## 3) GtfsInterpreter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, collections of sheets should be part of the std extension.

This block, however, should be part of the new extension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, i will implement the GtfsInterpreter in the mobility-extension
TODO: Lets discuss in our call, where to implement collections exactly (i think, in the io-types.ts file).

```

### 4.1) Layout
As GTFS defines a whole datamodel consisting of multiple dimensions, we can provide (in some future) the user with a predefined set of layouts. Some columns in GTFS-csv-files are optional and conditional optional, what can be reflected in the grammar using `?` and `!` as an flag. From a generic perspective, these flags can be considered in every Validator, depending on the implementation, since optional columns could be in all kinds of csv-files not just in gtfs-csv-files. A vision is, that the GTFS-pipeline later on processes a list of GTFS-Endpoints. Because every endpoint has at least the required-columns, we need to have the optional-mechanism in our layout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should represent the optional columns with their datatype, e.g. saying their datatype is text or undefined. For that, we'd need an undefined datatype in Jayvee and a way to combine these types.

For this RFC, I would only include required columns and ignore everything else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a chapter, describing this behavior


In a GTFS-Validator, some conditional (aka logical) checks could possibly be applied during the validation (not just a static header/datatype validaton)

In the specifiation, the order of the columns is not defined, so we need to access the columns by their names, not their index as every GTFS-endpoint could possibly have a different order!!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very important and interesting requirement, I think we should let this influence #114 directly. I'll mention it there. I could see selecting cell ranges by names, aka column A or column "agency_id" (note the quotation marks to show it is a string).

### 4.2) Layouts
As we want to process a zip-file containing multiple csv-files, we also need a mechanism to map csv-files to layouts. In my [.jv-file](0002-gtfs.jv), every layout gets defined as usual (header, column, datatypes --> please have look at chapter 4.1). Multiple layouts get then wrapped in a new grammar-feature, which maps csv-filenames to layouts. This layouts-artefact gets referenced by `SQLiteTablesLoader` and `LayoutsValidator`, too.
```
layouts gtfsLayouts {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we need a new grammar feature for this. We definitely should talk about how to express collections with and without string keys in our language (and if we want to).

Originally, we had an idea to consider an attribute that was added multiple times as a collection. For collections with a string index, we'd use <key> with <value> syntax for each element, for arrays just the value. In this case, it could be:

block GtfsValidator oftype LayoutsValidator {
    validationLayouts: "agency" with agencyLayout;
    validationLayouts: "stops" with stopsLayout;
}

An alternative could be to allow key, value tuples, consider:

block GtfsValidator oftype LayoutsValidator {
    validationLayouts: ("agency", agencyLayout);
    validationLayouts: ("stops", stopsLayout);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets discuss this in our call


```
block GtfsLoader oftype SQLiteTablesLoader {
tables: gtfsLayouts;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, I would honestly just assume the keys of the collection can be used as table names for now, write that down and remove the tables attribute here:

block GtfsLoader oftype SQLiteTablesLoader {
    file: "./gtfs.db";
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

rfc/0002-gtfs.jv Outdated

layout agencyLayout {
header row 1: text;
column agency_id!: text; //Conditional optional columns marked with ! (but could be any other literal as well... i took that one because ? stands for optional)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can this be conditionally optional if routes have an agency_id that is not optional? Is this optional if no routes file (or other file that has agency_id as required) exists?

In any case, as discussed before, I would remove the !/? logic for now and just consider the required columns + consider this cond. required column as required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can this be conditionally optional if routes have an agency_id that is not optional? Is this optional if no routes file (or other file that has agency_id as required) exists?

A dataset may contain data from multiple agencies. This field is required when the dataset contains data for multiple transit agencies, otherwise it is optional.

I remove the ! ? stuff in the new commit.

Copy link
Contributor Author

@schlingling schlingling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apply all changes mentioned directly and will add a new commit. Should i resolve all comments, where no more action is needed @rhazn?

}
```

## 2) ZipInterpreter
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: How to deal with that? When i start implementation, should i create them already in std-folder or first in extensions->mobility and we move this later?

}
```

## 3) GtfsInterpreter
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, i will implement the GtfsInterpreter in the mobility-extension
TODO: Lets discuss in our call, where to implement collections exactly (i think, in the io-types.ts file).

}
```

## 4) LayoutsValidator
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a chapter for describing the new io-datatypes needed

```

### 4.1) Layout
As GTFS defines a whole datamodel consisting of multiple dimensions, we can provide (in some future) the user with a predefined set of layouts. Some columns in GTFS-csv-files are optional and conditional optional, what can be reflected in the grammar using `?` and `!` as an flag. From a generic perspective, these flags can be considered in every Validator, depending on the implementation, since optional columns could be in all kinds of csv-files not just in gtfs-csv-files. A vision is, that the GTFS-pipeline later on processes a list of GTFS-Endpoints. Because every endpoint has at least the required-columns, we need to have the optional-mechanism in our layout.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a chapter, describing this behavior

### 4.2) Layouts
As we want to process a zip-file containing multiple csv-files, we also need a mechanism to map csv-files to layouts. In my [.jv-file](0002-gtfs.jv), every layout gets defined as usual (header, column, datatypes --> please have look at chapter 4.1). Multiple layouts get then wrapped in a new grammar-feature, which maps csv-filenames to layouts. This layouts-artefact gets referenced by `SQLiteTablesLoader` and `LayoutsValidator`, too.
```
layouts gtfsLayouts {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets discuss this in our call


```
block GtfsLoader oftype SQLiteTablesLoader {
tables: gtfsLayouts;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

rfc/0002-gtfs.jv Outdated

layout agencyLayout {
header row 1: text;
column agency_id!: text; //Conditional optional columns marked with ! (but could be any other literal as well... i took that one because ? stands for optional)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can this be conditionally optional if routes have an agency_id that is not optional? Is this optional if no routes file (or other file that has agency_id as required) exists?

A dataset may contain data from multiple agencies. This field is required when the dataset contains data for multiple transit agencies, otherwise it is optional.

I remove the ! ? stuff in the new commit.

@schlingling schlingling changed the title RFC for GTFS extension RFC for mobility extension Jan 25, 2023
@schlingling schlingling deleted the rfc-gtfs-extension branch January 25, 2023 17:49
@github-actions github-actions bot locked and limited conversation to collaborators Jan 25, 2023
@schlingling schlingling changed the title RFC for mobility extension RFC for gtfs extension Jan 25, 2023
@schlingling
Copy link
Contributor Author

This PR is a basis for #123

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants