Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forbid subfolders in GTFS files #379

Merged
merged 3 commits into from Jul 18, 2023
Merged

Forbid subfolders in GTFS files #379

merged 3 commits into from Jul 18, 2023

Conversation

thbar
Copy link
Contributor

@thbar thbar commented May 22, 2023

See #377

@google-cla
Copy link

google-cla bot commented May 22, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@thbar
Copy link
Contributor Author

thbar commented May 22, 2023

I have accepted the CLA.

@markstos
Copy link
Contributor

We've seen files within folders being a problem several times at RideAmigos, such that we have unilaterally pre-filter all zip files to flatten them before continuing to process then.

I'm not sure it would be appropriate to add to the spec, but some zip tools like the zip binary have an option to strip out directory names, so even if the input has nested folders, the output doesn't:

zip -r --junk-paths OUT.zip ./IN

@leonardehrenfried
Copy link
Contributor

@thbar According to the contribution process, you need to send a message to the mailing list announcing this PR: https://github.com/google/transit/blob/master/gtfs/CHANGES.md#specification-amendment-process

(I support the proposal, btw).

Co-authored-by: Leonard Ehrenfried <mail@leonard.io>
@leonardehrenfried
Copy link
Contributor

I'm not sure it would be appropriate to add to the spec, but some zip tools like the zip binary have an option to strip out directory names, so even if the input has nested folders, the output doesn't:

I'm in no position to make decisions about this, but adding references to specific software tools and command line invocations feels a little "un-spec-ish".

@markstos
Copy link
Contributor

I'm in no position to make decisions about this, but adding references to specific software tools and command line invocations feels a little "un-spec-ish".

True.

@thbar
Copy link
Contributor Author

thbar commented May 23, 2023

@thbar According to the contribution process, you need to send a message to the mailing list announcing this PR: https://github.com/google/transit/blob/master/gtfs/CHANGES.md#specification-amendment-process

@leonardehrenfried thank you, I have posted over there (probably pending moderation).

@isabelle-dr isabelle-dr linked an issue May 24, 2023 that may be closed by this pull request
@emmambd
Copy link
Collaborator

emmambd commented Jun 5, 2023

It's great to see a lot of support for this change! I have one question about the must language in this PR. One of the principles of GTFS is that changes should be backwards compatible and we want to avoid changes that would make existing feeds invalid.

Any thoughts on changing this proposed must to should?

@markstos
Copy link
Contributor

markstos commented Jun 5, 2023

In this case, it feels more like a clarification. The first line of the GTFS website now says:

"A GTFS feed, which contains static transit information, is composed of a number of text (.txt) files that are contained in a single ZIP file."

Elsewhere it already says "All dataset files must be zipped together".

It's already reasonable to interpret those as being at the top level.

In terms of backcompat, I expect a survey of existing feeds would find that something well above 95% or compliant, and I would not be surprised if that number is > 99%.

Adding the term /should/ doesn't help much, because it applies that consumers must still handle the sub-folder case.

Because the vast-majority of feeds are already compatible and because there's an easy fix (zip --skip-junk), I believe in this case the term "must" still fulfills the spirit of backwards compatibility .

@leonardehrenfried
Copy link
Contributor

I agree with @markstos. If we use "should" then we don't need this PR since you "should" already do that.

We want to make those feeds invalid and have some unambiguous language in the spec to point producers to.

@laurentg
Copy link

laurentg commented Jun 9, 2023

According to what we see ourselves on the field, less than 1% of the GTFS are in that situation (with subfolders). As a side note, we (Mecatran) requires explicitly to enable a "lenient" mode on our tools to load zip with subfolders, we do not handle them by default. Also, interesting questions can arise when you have multiple GTFS inside a single zip in different subfolders (saw that once).

@leonardehrenfried
Copy link
Contributor

@thbar I think we have had enough time to discuss this. Do you think that we can start the voting process?

@markstos
Copy link
Contributor

To @laurentg's point: If we allow subfolders and expect consumers to "flatten" the files, the subfolders could contain stops.txt or other files in multiple subfolders and the spec does not clarify how to resolve that, and different flattening methods could produce differing results- discarding one copy or the other or crashing. The simplest resolution is to ignore subfolders, which appears to be both the intent and what 99% are already doing in practice.

It's also important to note that while the headline says "forbid subfolders", that's not what the spec wording actually says-- it says that the data set files must be at the top level. So for example, macOS has been known to leave hidden metadata files in "dot" files or folders that users may not even realize exist. According to the spec proposal, sub-folders will be tolerated, but ignored as data file locations. This is a good approach.

@leonardehrenfried
Copy link
Contributor

I don't think that @laurentg was arguing that allowing subfolders is a good idea. He just pointed out the can of worms when you do allow them and you've re-enforced this point.

@thbar
Copy link
Contributor Author

thbar commented Jun 16, 2023

@thbar I think we have had enough time to discuss this. Do you think that we can start the voting process?

@leonardehrenfried sure! I'm not well aware of who does that and how, but definitely yes.

@emmambd
Copy link
Collaborator

emmambd commented Jun 21, 2023

Hi @thbar! You as the PR advocate can open a vote based the governance process outlined in CHANGES.md. The next steps would be:

Let us know if you have any other questions about how to proceed.

@thbar
Copy link
Contributor Author

thbar commented Jul 4, 2023

Thanks @emmambd! Following the template:

This pull request has been open for more than one week, so I am calling for a vote. This is to clarify the layout of files in a GTFS archive and to ensure files will be at the top level, to simplify data ingestion.

Please vote with a +1 (for) or -1 (against) in the comments. Voting ends on 2023-07-13 at 23:59:59 UTC.

Thanks!

@gcamp
Copy link
Contributor

gcamp commented Jul 4, 2023

+1 Transit

@emmambd emmambd added Status: Voting Pull Requests where the advocate has called for a vote as described in the changes.md GTFS Schedule Issues and Pull Requests that focus on GTFS Schedule labels Jul 4, 2023
@leonardehrenfried
Copy link
Contributor

+1 Opentripplanner

@skinkie
Copy link
Contributor

skinkie commented Jul 4, 2023

+1 OpenGeo

@drewda
Copy link

drewda commented Jul 5, 2023

+1 from @interline-io

Our team has spent a good deal of time adding support to www.transit.land for nested subfolders, nested zips, and strange combinations of both nested subfolders and nested zips. Not expecting this situation to improve overnight, but it will still be useful to have the spec explicitly forbid these situations.

@westontrillium
Copy link

+1 from Trillium, though we recommend simplifying this text:

In order to reduce complexity for consumers, zip files containing dataset files in a subfolder are invalid.

to: "The files must reside at the root level directly, not in a subfolder."

@philip-cline
Copy link

+1 Arcadis IBI Group

@Bertware
Copy link

Bertware commented Jul 7, 2023

+1 Trafiklab/Samtrafiken

@markstos
Copy link
Contributor

+1 RideAmigos

@emmambd emmambd removed the Status: Voting Pull Requests where the advocate has called for a vote as described in the changes.md label Jul 14, 2023
@thbar
Copy link
Contributor Author

thbar commented Jul 18, 2023

Since I got an email from MobilityData indicating the vote is over and I can adapt slightly based on feedback, I'll follow the simplification proposed by @westontrillium above, and just keep "The files must reside at the root level directly, not in a subfolder."

@thbar
Copy link
Contributor Author

thbar commented Jul 18, 2023

PR wording has been simplified for brevity. I'll reply to the MobilityData email I got for merging.

@eliasmbd eliasmbd merged commit 4460771 into google:master Jul 18, 2023
2 checks passed
@eliasmbd
Copy link
Collaborator

The vote has passed! PR has been merged! Congratulations.

@thbar thbar deleted the patch-1 branch July 18, 2023 13:27
@isabelle-dr isabelle-dr mentioned this pull request Jul 27, 2023
4 tasks
gcamp pushed a commit to TransitApp/transit that referenced this pull request Sep 25, 2023
* Forbid subfolders in GTFS files

See google#377

* Update gtfs/spec/en/reference.md

Co-authored-by: Leonard Ehrenfried <mail@leonard.io>

* Simplify wording based on feedback

---------

Co-authored-by: Leonard Ehrenfried <mail@leonard.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GTFS Schedule Issues and Pull Requests that focus on GTFS Schedule
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Thoughts on forbidding "subfolder inside zip"?