
[Transferred from GitLab] Creating GTFS DB #5

Closed

daphshez opened this issue Jul 12, 2016 · 5 comments

@daphshez (Collaborator) commented Jul 12, 2016

[Originally posted by @nitzangur]

A basic schema is now available under gtfs/.

Next steps:

Schema level (a sketch follows this list):

  1. Validate index-creation syntax.
  2. Decide which indexes are wanted.
  3. Create enums for relevant fields (inline comments).
  4. Make sure type-sizes are fine.
  5. Consider using TinyInt instead of INT where relevant.
  6. Check the requested type-size for coordinates.
  7. Check the requested type-size for shape_dist_traveled.
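
As a hypothetical illustration of items 3-7, the DDL below shows one way these choices could look in PostgreSQL, driven from Python via psycopg2. All table, column, and enum names are made up for the example; note that PostgreSQL has no TINYINT, so SMALLINT is the nearest equivalent.

```python
# A minimal sketch of the type/index choices above, with illustrative names
# (not the actual schema under gtfs/).
import psycopg2

DDL = """
-- Enum for a coded field (item 3); the values are placeholders.
CREATE TYPE route_type_enum AS ENUM ('bus', 'rail', 'light_rail');

CREATE TABLE stops_example (
    stop_id       INTEGER PRIMARY KEY,
    location_type SMALLINT,        -- no TINYINT in PostgreSQL (item 5)
    stop_name     TEXT NOT NULL,
    stop_lat      NUMERIC(9, 6),   -- ~0.1 m precision for coordinates (item 6)
    stop_lon      NUMERIC(9, 6)
);

-- Index-creation syntax to validate (items 1-2).
CREATE INDEX stops_example_name_idx ON stops_example (stop_name);
"""

conn = psycopg2.connect(dbname="gtfs")  # connection parameters are assumptions
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```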

IT level (a sketch follows this list):

  1. Create a PostgreSQL DB based on this schema.
  2. Make sure the DB is accessible for development and querying.
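
For item 1, a minimal sketch of loading the schema file into a freshly created database (e.g. after `createdb gtfs`); the file path and connection details are assumptions:

```python
# Apply gtfs/schema.sql to the new database; path and credentials are placeholders.
import psycopg2

with open("gtfs/schema.sql") as f:
    schema_sql = f.read()

conn = psycopg2.connect(dbname="gtfs", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute(schema_sql)
conn.close()
```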

Code level (see the sketch after this list):

  1. Change the code under gtfs/parser/ so that it parses GTFS data into the new DB instead of into an SQLite DB.
  2. Migrate some other GTFS parsing code (e.g. route stories).
  3. Create a cron job to automatically parse the GTFS data every day. (Yehuda - is there any old cron to work with?)
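
For item 1, the change largely boils down to swapping the sqlite3 connection for a PostgreSQL one. A hedged sketch (file layout per the GTFS spec; paths and connection details made up):

```python
# Sketch: read a GTFS txt file with csv and insert into PostgreSQL instead
# of SQLite. utf-8-sig strips the BOM that GTFS feeds often carry.
import csv
import psycopg2

conn = psycopg2.connect(dbname="gtfs")
with conn, conn.cursor() as cur, \
        open("gtfs_data/routes.txt", newline="", encoding="utf-8-sig") as f:
    for row in csv.DictReader(f):
        cur.execute(
            "INSERT INTO routes (route_id, agency_id, route_short_name, route_type) "
            "VALUES (%s, %s, %s, %s)",
            (row["route_id"], row["agency_id"],
             row["route_short_name"], row["route_type"]),
        )
conn.close()
```

For item 3, the crontab entry could then be as simple as `0 4 * * * python /path/to/parse_gtfs.py` (path and schedule hypothetical).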

Feel free to add or change steps. Please respond to this issue if you intend to perform any of these steps. (I guess these remarks are relevant to all of the issues.)

@nitzangur (Collaborator) commented

A valid schema was uploaded in commit 25ed52c.
The PostgreSQL DB is available on the same server as the SIRI-related DB (see here). DB name: gtfs.

@daphshez (Collaborator, Author) commented

Are we planning to overwrite the GTFS data on every import, or do we want to accumulate it over time?

I think accumulation makes sense for analysis, but it means the import script has to be cleverer than the current one.

E.g. in the calendar table, every GTFS service only has future dates in its start_date field. The import script will have to recognise that such a record is the same old service and reuse the existing service record.
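
For example, a sketch of that check (columns follow GTFS calendar.txt; the table layout and the "match on everything except start_date" rule are my assumptions):

```python
# Sketch: treat a calendar row as the "same old service" when everything
# except start_date matches an existing record, and skip the insert then.
DAYS = ("sunday", "monday", "tuesday", "wednesday",
        "thursday", "friday", "saturday")

def import_service(cur, row):
    cur.execute(
        "SELECT 1 FROM calendar WHERE service_id = %s AND end_date = %s AND "
        + " AND ".join(f"{d} = %s" for d in DAYS),
        (row["service_id"], row["end_date"], *(row[d] for d in DAYS)),
    )
    if cur.fetchone() is None:  # genuinely new or changed service
        cols = ("service_id", "start_date", "end_date") + DAYS
        cur.execute(
            f"INSERT INTO calendar ({', '.join(cols)}) "
            f"VALUES ({', '.join(['%s'] * len(cols))})",
            [row[c] for c in cols],
        )
```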

What do you think?

@nitzangur (Collaborator) commented

There is an option to exclude fields from the history table. If I get it right, excluding a field (e.g. the start_date field) means it won't create separate history records. Bottom line: the current import script (once it works) should do the job.
See: http://ebean-orm.github.io/docs/features/history
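
For readers outside Ebean, the effect can be pictured as a history trigger that ignores the excluded column. Below is a hand-rolled PostgreSQL equivalent, purely illustrative and not Ebean's actual generated SQL; all names and the column list are hypothetical:

```python
# Illustrative only: keep a history row on UPDATE unless the only change
# was to the excluded start_date column.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS calendar_history (LIKE calendar);

CREATE OR REPLACE FUNCTION calendar_record_history() RETURNS trigger AS $$
BEGIN
    -- Compare every column except the excluded start_date
    -- (remaining day-of-week columns elided for brevity).
    IF (OLD.service_id, OLD.end_date, OLD.sunday)
       IS DISTINCT FROM
       (NEW.service_id, NEW.end_date, NEW.sunday) THEN
        INSERT INTO calendar_history SELECT OLD.*;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER calendar_history_trg
BEFORE UPDATE ON calendar
FOR EACH ROW EXECUTE PROCEDURE calendar_record_history();
"""

conn = psycopg2.connect(dbname="gtfs")  # connection parameters are placeholders
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```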

@efratoio (Collaborator) commented Jul 25, 2016

Hi,
There are some issues preventing the insertion into Postgres from completing.

  1. The CREATE TABLE statements don't end with ";", so they cannot be executed from the schema.sql file. Is this intended?
  2. "routes", "calendar", and some other tables work, but:
    a. shapes has a "shape_dist_traveled" column that does not appear in the file
    b. in stop_times, "trip_id" is an integer, but the actual values look like "20132304_260516"
    c. the same goes for trips and trip_id
    d. the "parent_station" field in stops is empty, so it cannot be converted to an integer

Note this project - they created a Postgres schema for GTFS and used a character datatype for trip_id:

https://github.com/jedhorne/py-gtfs-postgres/blob/master/schema/gtfs_schema.create.sql
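
Following that example, the fixes for 2b-2d might be as simple as the DDL below (a sketch; the exact table and column names in our schema are assumed):

```python
# Sketch: trip_id as text rather than integer, and parent_station nullable.
import psycopg2

FIXES = """
ALTER TABLE stop_times ALTER COLUMN trip_id TYPE TEXT USING trip_id::text;
ALTER TABLE trips      ALTER COLUMN trip_id TYPE TEXT USING trip_id::text;
ALTER TABLE stops      ALTER COLUMN parent_station DROP NOT NULL;
"""

conn = psycopg2.connect(dbname="gtfs")  # connection parameters are placeholders
with conn, conn.cursor() as cur:
    cur.execute(FIXES)
conn.close()
```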

@daphshez (Collaborator, Author) commented

I tested, cleaned up and documented the code written by @efratoio & @nitzangur.

I am happy to say that it works, and we now have the sample GTFS data in the obus database on our server! Thanks to everyone who worked on this issue!

There's a readme file that documents how to run it.

While this works, there is a performance issue. In my tests on the server it took about 1 second per 1,000 records. The stop_times table in recent GTFS files has ~20M records, so inserting it would take hours. I think the key is to do some kind of batch insert rather than inserting records one by one. I am going to open a separate issue for that (a sketch of the idea follows).
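
For reference, the batching idea with psycopg2's execute_values helper, which sends many rows per statement (paths, columns, and batch size are assumptions); COPY via copy_expert would likely be faster still:

```python
# Batch-insert stop_times in chunks instead of one INSERT per record.
import csv
import itertools
import psycopg2
from psycopg2.extras import execute_values

BATCH = 10000  # rows per round trip; tune to taste

conn = psycopg2.connect(dbname="gtfs")  # connection parameters are placeholders
with conn, conn.cursor() as cur, \
        open("gtfs_data/stop_times.txt", newline="", encoding="utf-8-sig") as f:
    reader = csv.DictReader(f)
    while True:
        chunk = list(itertools.islice(reader, BATCH))
        if not chunk:
            break
        execute_values(
            cur,
            "INSERT INTO stop_times "
            "(trip_id, arrival_time, departure_time, stop_id, stop_sequence) VALUES %s",
            [(r["trip_id"], r["arrival_time"], r["departure_time"],
              r["stop_id"], r["stop_sequence"]) for r in chunk],
            page_size=BATCH,
        )
conn.close()
```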
