Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstract schedule data input #9

Closed
xamanu opened this issue Sep 17, 2016 · 27 comments
Closed

Abstract schedule data input #9

xamanu opened this issue Sep 17, 2016 · 27 comments

Comments

@xamanu
Copy link
Contributor

xamanu commented Sep 17, 2016

Currently, the data about the schedule information is getting pulled in a very custom way into the script.

Others would like to use the script. For now I see different ways for feeding the script:

  • Currently @grote uses a json file with route numbers and then pulls data from the Fenix network website and parses the time information.
  • @jamescr is putting schedule information directly in a json file
  • We (MapaNica) would like to use a CSV file with frequencies and run some logic on it.

I suggest moving different ways on pulling schedule data into different files, which could be handled similar to small plugins. Basically all of them then have to output the same information to the main script: Stop times of the first and last stops (mostly terminals, I guess).

Probably we could specify in the config file (#3) the name of the schedule information plugin. Do you have any suggestions on how to implement this properly?

This is an important step in order to make this tool generically usable.

@grote
Copy link
Owner

grote commented Sep 18, 2016

I think this should be the last step of making this script truly general, also because it is the most difficult one and probably the one where we are not yet sure what we really need and how these yet unknown use-cases can be covered best by an abstraction layer.

Just feeding in static files might be to inflexible. I expect most people still needing to integrate some logic. There could maybe be something like a abstract GTFSWriter class that already has methods for most tasks and is essentially a wrapper around the transitfeed library. This class could then be extended by different region specific classes that need to overwrite some functionality.

But that's just a first idea. Let's see how exactly the various use-cases look like and what architecture would accommodate them best.

@xamanu
Copy link
Contributor Author

xamanu commented Sep 18, 2016

In a first, more or less easy and working step, I was thinking about an abstract ScheduleImporter class which then inherits to sub classes, such as FenixImporter, IncoferImporter, MapanicaImporter. I'd lean very closely to the current implementation and basically let these classes to import and do logic on data in the constructor and then have methods to serve schedule data for weekdays, saturday and sunday.

@grote
Copy link
Owner

grote commented Sep 18, 2016

If you like, I don't mind if you start factoring out things into different classes. You should just be prepared that this might need to be refactored again. Maybe it would also be a good idea to start specifying the classed (maybe in UML?) here in this issue before the implementation is started, so we have a chance to discuss the new architecture first.

@xamanu
Copy link
Contributor Author

xamanu commented Sep 18, 2016

Of course, steady refactoring is a good thing. I don't expect all things to be carved in stone. As a first step I would change as less as possible but to make it possible to use additional conversion, for @jamescr and myself. I hope this will grow in the future and improve quality over time.

Yes, I agree it'd be helpful to draw classes first and discuss them here. Thanks for your willingness to help with that! I will draw some sketches we can discuss.

@xamanu
Copy link
Contributor Author

xamanu commented Sep 19, 2016

Here a first sketch about the structure. Probably there is still a lot to talk about. But it's hopefully a start and base to discuss the structure.

osm2gtfs

Probably it'd make sense to wrap all together in one class, which also does the OSM querying for us. And then uses transitfeed to build the GTFS all together. Is this what you mean with the GTFSWriter?

@grote
Copy link
Owner

grote commented Sep 20, 2016

I was thinking of GTFSWriter, because the main job of that class will be to write the GTFS feed. A schedule importer would make sense if we had a generic format for schedule data, but we don't have that and maybe will never have.

So maybe we start simple and just move all Fenix logic into a class that extends a GtfsWriter class. This includes custom methods for Route such as add_linha(). There's a bunch of helper methods in the osmhelper classes that can probably stay there as they might be useful for others. The only problem is that each change potentially breaks the serialized objects that osm2gtfs stores on disc with OSM data.

Now that I think more about it, your approach is probably better. So the osm2gtfs script should continue to write and validate the GTFS feed. It should also add the agency and the feed info from the config file like it is now. So that part doesn't change (for now).

The only thing that is really specific for each area is adding the routes, trips and shapes to the transitfeed schedule object. So what we could do as a start is just have an abstract ScheduleCreator class with only one method (for now). Something like add_schedule(schedule, routes, stops).

FenixScheduleCreator then extends this class and implements the add_schedule() method with the current code and some changes so it doesn't work on the Route classes directly.

If we wanted to introduce more methods, there could for example be a add_stops(stops) method or maybe this so generic that the ScheduleCreator could already do that or at least have a default implementation for it, so classes can overwrite it if they need to. FenixScheduleCreator wouldn't need to overwrite it.

When the route information comes completely from OSM, there could also be a add_routes(routes) method with a default implementation.

Contrary to the sketch above, the ScheduleCreator wouldn't need get methods. It would directly add things to the transitfeed schedule object. What do you think?

@jamescr I'm pinging you, so you can also add your thoughts to the discussion if you want to.

@xamanu
Copy link
Contributor Author

xamanu commented Sep 20, 2016

I was thinking of GTFSWriter, because the main job of that class will be to write the GTFS feed. [...]

So the osm2gtfs script should continue to write and validate the GTFS feed. It should also add the agency and the feed info from the config file like it is now. So that part doesn't change (for now).

This is how I would understand the GTFSWriter, as the basic wrapper for all. This could live in the script or be put into a class. I've no idea nor preference which would be better. A class seems to be a bit cleaner to me, but I have no idea why 😄

The only thing that is really specific for each area is adding the routes, trips and shapes to the transitfeed schedule object.

This is a really interesting point and I'd like to elaborate more on it. When I started to make my head around it I came to the assumption that actually routes, shapes and stops can be added to transitfeed without any local particularity nor time information. I thought it could be created directly from the data obtained from OSM. The only very flexible part is how to generate the trips (and their schedules/time information) by different kind of data sources and logic.

This is why I introduced the extendable class ScheduleImporter although I like ScheduleCreator even more. This leads us to another issue: Naming.

I tried to not use names that are already used for something else in the script. E.g. route: a route in OSM is a RouteVariant (ida or vuelta), but a route in transitfeed/GTFS is a RouteMaster (ida and vuelta together). Just using route, I thought, could be misleading. A similar, but not as complicated problem I see now with schedule as in transitfeed it's the main class that holds everything together and myself (maybe it's just me) thinks more about the time information in relation to all this. So, I'd even suggest to call this class TripsCreator - but only following my previous assumption (which I'm not sure if this is correct), that trips are really the only flexible part that needs to be extended for different cities.

@xamanu
Copy link
Contributor Author

xamanu commented Sep 20, 2016

Here an updated drawing:
osm2gtfs

@jamescr
Copy link
Collaborator

jamescr commented Sep 21, 2016

I agree with the idea of a osm2gtfs that handle arguments and config. A OSMHelper that, using OSM nomenclature and their own objects, give us route masters (and routes), stops and shapes.

The ScheduleCreator (TripsCreator in the diagram) seems nice, but maybe we should use the GTFS nomenclature there and as much as possible the transitfeed objects. Although I need to check deeper transitfeed, this could help avoiding the naming issue @xamanu mention.

Then the GTFSWritter (method inside the osm2gtfs?) puts all together (here we should be careful with nomenclature) and validate it.

@xamanu
Copy link
Contributor Author

xamanu commented Sep 26, 2016

@jamescr TripCreator is using the GTFS nomenclature, as there they are called trips. That is why I renamed it to this name.

If there is no more feedback, can I understand the proposed structure to be suitable for all in the team? Thanks for your time looking into this.

@grote
Copy link
Owner

grote commented Sep 26, 2016

I think the only difference at the moment is that your proposal will only let the different creators create trips, while I think it would be more flexible to pass in the schedule object and let the creator also handle adding stops and routes (with default implementation in the super creator). This is more flexible and allows the creator to change not only trip creation. For example, I could imagine this being useful when the creator also creates an overview over what routes available in the schedule are still missing from OSM.

@xamanu
Copy link
Contributor Author

xamanu commented Sep 26, 2016

Thanks for your input @grote! Allowing to merge geo data from other sources with osm data would be a completely new extended functionality. And even more than the name osm2gtfs promises. As we have discussed in other occasions, I'm really in favor of pushing people to get the data properly in OpenStreetMap (and transport schema 2) and then extract it from there. For now, and in this issue I think we should go the way to abstract the schedule creation process. Merging of different geo sources would probably be a complete new and complex step, which need a lot more thinking.

@grote
Copy link
Owner

grote commented Sep 26, 2016

I think there's a misunderstanding. I don't want to introduce new complex steps. On the contrary, I want to keep things as simple and flexible as possible and don't want to merge different geo sources.

To summarize my proposal: Do a ScheduleCreator instead of a TripsCreator that works on a transitfeed schedule object and is also responsible for adding routes and stops to it.

@jamescr
Copy link
Collaborator

jamescr commented Sep 26, 2016

There is something with the TripsCreator that I haven't clear, What returns the getTrips method?.
First I thought that we could return a list of Trips (the transitfeed object), but in order to this we need the Schedule object because of the StopId. If we cannot return a List of Trips, another option could be to return a matrix of stop names and times, the header being the stop names and the other rows the stop times, but this option will be kind of overhead.

The ScheduleCreator @grote proposes makes sense to me. (I reminds me a little the GTFSWriter we talked)

@xamanu
Copy link
Contributor Author

xamanu commented Sep 26, 2016

Yes, I also have in mind to create the optional frequencies.txt (in the future) and having a ScheduleCreator makes sense, as this is something optional/special for us. I may also mixing up here GTFSWriter and ScheduleCreator ideas. So, I tried to get it together in the UML diagram. Is this what you mean?

osm2gtfs

If we'd eliminate the TripsCreator we ended up with the same custom spaghetti code from the current state just moved into the ScheduleCreator. I think it's a good idea to separate data collection (from OSM or some other source) and trip creations (time information).

@xamanu
Copy link
Contributor Author

xamanu commented Sep 26, 2016

@jamescr The getTrips() would return a list of trips you can then add to the transitfeed Schedule object routes with AddTrip() in the ScheduleCreator. Does this makes sense? Or do I miss a step here? Maybe the object to pass to this function should be RouteMaster rather than a RouteVariant.

Stop_ids: Interesting question. I'm haven't really thought about that. But I guess they are created/defined in the stop objects, based on OSM ids or we could use a special instance of wikidata to create a centralized place where to hold unique ids for stops. In a scenario where transport agencies create their GTFS they define also their stop_ids.

@xamanu
Copy link
Contributor Author

xamanu commented Sep 26, 2016

I was just talking directly with @jamescr over Mumble and we made changes to the uml drawing. Basically we figured out to move directly stops, trips and routes from transitfeed. We introduced a Creator for each one of the components. Which then get's implemented to support different types of inputs. Csv, json or whatsoevery for trips, and for now OSM (PT2) for the routes and stops.

osm2gtfs

I like it, as it makes sense. But I feel also we are moving around without getting to the point 😄 Maybe @grote input or sleeping a night over it brings more clarity.

@grote
Copy link
Owner

grote commented Sep 26, 2016

Great you guys got together to improve this. Looks nice!

I'm just worried a bit that this is already getting too complex for the first step. Maybe we save this for the second step and start a bit simpler with less classes and creators.

Here's some pseudo code for how it could be simpler in the first step:

schedule = transitfeed.Schedule()
creator = FenixCreator(schedule)
creator.add_routes(osm_routes) # has default implementation in ScheduleCreator
creator.add_stops(osm_stops) # has default implementation in ScheduleCreator
creator.add_trips() # is abstract in ScheduleCreator

So this is really minimal, has just two more classes compared to now and these classes only have three public methods. Maybe it is better to start like that and then build things up from there as needed?

@xamanu
Copy link
Contributor Author

xamanu commented Sep 29, 2016

Thanks for the pseudo code, @grote. This glues together what wasn't clear, yet. But I still don't like the custom code (FenixCreator) in there, so I'm introducing a simple factory class here, that returns a different object, whether it is Fenix, Incofer or MapaNica. I guess, this is the main aim of this issue here, and wasn't drawn in the diagram, yet.

Further I think it makes it a lot easier to code (already now) if we separate stop and routes logic into two separate classes. For now, I agree, we don't need to have an abstract classes defining them as this is going to be one standard implementation.

So, this would look something like this:

osm2gtfs

And the pseudo code for the main (control) script logic would look like:

schedule = transitfeed.Schedule()
factory = ScheduleFactory(config)

# Define creators
routes_creator = factory.get_routes_creator()
stops_creator = factory.get_stops_creator()
trips_creator = factory.get_trips_creator()

# Add elements
schedule = routes_creator.add_routes(schedule)
schedule = stops_creator.add_stops(schedule)
schedule = trips_creator.add_trips(schedule)

# Validate and write GTFS with transitfeed....

This seems still understandble to me in one step and would allow us to abstract it in future steps easily.

@grote
Copy link
Owner

grote commented Sep 29, 2016

Awesome @xamanu! I think this looks really good for the first step and we are very close to a consensus!

The only thing that can maybe be improved is that the StopCreator and the RouteCreator need to be instantiated as separate classes outside of the ScheduleFactory. Wouldn't it be better if the region specific classes would pick the creators they need? Otherwise you would need three config elements for all three creators. I'd prefer have only one config option for the creator and then have the creator decide. Do you think that would be possible to improve?

@xamanu
Copy link
Contributor Author

xamanu commented Sep 29, 2016

I don't think we'd need three config elements for each type of creator. Let me explain: We have two (flexible) sources of information:

  1. Geo information (from OpenStreetMap)
  2. Time information (from CSV, json, and some logic applied, whatever)

Therefore we need only two. For the geo information (1) we already have the query part and we just introduce one new region specific argument to distinguish between the different import/logic of the time information (2).

If, at some point in the future, we really wanted to extend the flexibility on the geo information part (... support other sources than OSM or other schemas than the standard) we could either:

  • extend ScheduleFactory to understand this regional selection and follow your argument of one single config element that collects all the parts it needs
  • or simply extend the query block to define different geo inputs or structures

But anyway I think this is not really something that needs to be decided now.

@xamanu
Copy link
Contributor Author

xamanu commented Sep 29, 2016

I just came up with another idea. No clue if you guys like it: Naming convention.

We could just do the following in ScheduleFactory (once again very pseudo code):

def get_routes_creator()
  if class_exists(RoutesCreator[config.regional_selector])
    return RoutesCreator[config.regional_selector](config)
  else 
    return RoutesCreator(config)

Same for StopsCreatorand TripsCreator.

This would allow us to have always a fallback to the standard implementation which always can be extended on demand, not by code nor more configuration, just by defining a class with a certain name which changes the implementation of functions that differ to the standard implementation: e.g. TripsCreatorFenix, StopsCreatorFenix, RoutesCreatorFenix

@grote
Copy link
Owner

grote commented Sep 30, 2016

So did I understand correctly then that the ScheduleFactory also provides three different creators based on the config file value? If so, how does it know which ones to choose for which config value?

I thought the ScheduleFactory would just return one ScheduleCreator, so when you implement a region, you just need to derive from one class and can just pick from the available Stop and Route creators within that class. I also haven't fully understood why we need extra creators for stops and routes since this data comes from OSM and their implementation should almost never change. So having them as methods with a default implementation in the abstract ScheduleCreator sounds a lot simpler to me.

But I see that your idea with the class finder might take care of the creator choosing problem. However, I am not sure we need that level of complexity for something one simple method (that you can override if you want) would also give us.

Quoting your pseudo code:

# Add elements
schedule = routes_creator.add_routes(schedule)
schedule = stops_creator.add_stops(schedule)
schedule = trips_creator.add_trips(schedule)

Would we need to re-assign the schedule here each time or would the creator work on the same schedule object?

@xamanu
Copy link
Contributor Author

xamanu commented Sep 30, 2016

I had a call today with @grote and we could talk more deeply about how to implement the changes. One mayor information I wasn't explaining properly in the proposal was to move all respective code from the current osmhelper.__init__.py to the RoutesCreator and StopCreator.

Selection of the respective creators is happening based on naming convention (just implemented in my repo) and only one new config element (selector) needs to be introduced.`

Regarding the last question by @grote it's not really about re-assigning schedule each time but rather to pass the schedule object into the functions by reference (probably my pseudo code is misleading as it's not python)

@xamanu
Copy link
Contributor Author

xamanu commented Oct 2, 2016

Instead of moving data retrieval and caching logic from osmhelper into the creators I left them in the osmhelper module, but moved it inside of a class, so a OsmHelper object containing the retrieved data can be injected into the creators.

This allows us to keep the Creator classes (and the related pull request) smaller and understandable. Separately we can talk about the data structure itself (#30) and a refactoring of the OsmHelper (#31).

@xamanu
Copy link
Contributor Author

xamanu commented Oct 2, 2016

I kept on drawing to keep things easier to understand

osm2gtfs

grote pushed a commit that referenced this issue Oct 2, 2016
@xamanu
Copy link
Contributor Author

xamanu commented Oct 2, 2016

Wonderful! This went in and was merged to the main repository. Closing this issue here.

@xamanu xamanu closed this as completed Oct 2, 2016
@xamanu xamanu self-assigned this Nov 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants