
In memory storage #7

Merged
merged 27 commits into from
Mar 16, 2017

Conversation

faultyserver

Instead of relying on a SQL backend to store information, this PR implements a custom storage solution for visits (Timetable's representation of a vehicle arriving at a station). These visits are stored in a map with a composite key of (Station, Timestamp, Route, Trip) in order to best optimize the queries that are likely to be made on the data.

From a dataset with 30 routes, ~1000 stops, and ~250,000 stop_times, the total memory footprint of Timetable is ~200MB, and the average response times for queries with less than 50 results is just over 1 millisecond, compared to the previous average query time of ~70 milliseconds.

As it turns out, a direct SQL representation of the GTFS data is not efficient for what Timetable is trying to do (calculate the next/last visits a route makes to a station). Instead, the information is better stored in memory using a tree-like structure, where the results of a query can be found in near-constant time, and will often be located in contiguous memory blocks (making iteration easy).

This commit starts work on parsing raw GTFS archives into this tree-like structure.
By keying the map by (trip, departure time) and using a custom comparator that sorts by trip and then by departure time, results for `next_visits` and `last_visits` can be found in `2 log(n)` time.

With the custom comparator, Visits are stored in order, so a single search will return an iterator that can be used to find the next `n` visits on a trip. Simply filtering the iterator will return the next `n` visits to a given station. Using two queries will give a range of visits between two timepoints.
`TripTimestamp` has been renamed to `visit_list_key` to be more generic and fitting to its use case. Additionally, `Route` has been added to the key struct to allow O(logn) access to the first Visit on a Route.
Bounds objects are pairs of iterators indicating the limits of an intended result set. The iterators are only guaranteed to be valid within these bounds. Attempting to iterate past a bound is undefined behavior. Bounds objects can also be reversed to change the direction of iteration.

Bounds objects are generated from a single key by calling the `*_upper|lower_bound` functions on the key, returning a new key that points to the first/last relevant Visit in the list (either the bound of a Route or of a Trip).
Simply storing IDs instead of references to full objects in each type avoids implicitly deleting the copy operations. Additionally, using only strings in the key structure makes lexicographic comparisons much simpler through `std::tie`.
…bol errors.

An irrelevant problem for now, but the definition of these functions in a header file requires that they be inlined to avoid having them included in each including translation unit (e.g., multiple .cc files -> duplicate symbol errors).
These members were originally added to track the start and end times of a trip, which would be inferred from the list of stop_times that it includes. However, since Trips no longer include a list of stop_times, this information is not inherent to a Trip and thus shouldn't be stored as a member.
All fields listed in Google's GTFS reference pages are now included and appropriately typed. Enum types still need proper enumerations, and are currently represented by integers. Optional fields could potentially pose a problem; hopefully they will not. Field mapping is also not yet a thing, but will probably be done with pointer-to-member templates.
The `field_mapping` type wraps gtfs fields with mappings from their name in the archive to their name as a member of the type representation. This also allows for null handling and other edge-case conditions in a central location.
After many difficulties trying to deal with true generics (allowing any type to be parsed/applied), I realized that a GTFS feed will only ever have one of a few types (bool, string, int, and float). With that, it's simple enough to make a template-less struct that can apply those types to a variable. Adding a template parameter for a Class type allows for application to member variables via pointer-to-data-members. This parameter is known by `csv_parser`, meaning it can store a map of these applicators (previously an issue because the applicator needed to know the ValueType of the member it was applying to). Adding new types will still only require changes in 3 places (`gtfs/types.h`, the `field_mapper` struct, and `csv_parser.parse`), so the tradeoff is pretty good.

A new issue has arisen where the current CSV parsing library can't handle commas within fields (even if they are legally escaped with triple quotes). Either I'll need to find a new library, or write something really simple to just do what's needed here.
Since everything is being stored in memory now, there is no need to have the archive loaded into a database, and thus the sql libraries are also not necessary.
Timetable no longer relies on a CSV parsing library to read GTFS archives. Instead, it manually parses entries using `getline` with a comma delimiter and special-cases quoted string entries.

The decision to manually parse files was made because GTFS is a loose standard, meaning it has many optional fields that change the parsing requirements (number of columns, column order, cell formatting).

Getting the manual parsing to work also implied some minor changes to `field_mapper`, but these are negligible, as their only use case is in CSV parsing.
This was far too difficult to discover given how simple the solution turned out to be, and the problem was born out of unnecessary complexity. So:

There was a typo in `stop_time.h` on the `drop_off_type` field mapping (was `dropoff_type`). This typo was exposing some unexpected behavior with `field_mapper`, and particularly with `type_map_t`. The best explanation is probably to walk through the steps of what was happening:

  1. `csv_parser::parse` determines the fields that are present in the CSV file, storing them in an ordered vector called `headers`.
  2. It then proceeds to read line by line, tokenizing each line and creating a vector parallel to `headers` called `columns`.
  3. For each `(header, column)` pair:
    a. The appropriate `field_mapper` instance (called `mapper`) is looked up based on `header`.
    b. Switching on the type that `mapper` represents, `column` is cast to that type and passed to `mapper.apply`.

The problem that came up starts becoming apparent through the details of how 3a works. Because the field `drop_off_type` did not exist in the parser's `field_types` (the typo had it as `dropoff_type`), and because `type_map_t` is an alias for `std::map<std::string, field_mapper>`, the behavior is to return a new instance of the map's `value_type` (in this case, `field_mapper`) using the default constructor. This default constructor does not set any of the members of `field_mapper`, and so their values are undefined.

Moving on to 3b, since the `type` member of this new `field_mapper` instance is undefined, it will almost always be left to match the `default` case of the switch statement. And here lay the second problem. The `default` case was placed above `case tSTRING`, as if to say "any columns without a type should be interpreted as strings". This may have been well intentioned, but it is nonsensical, as all `field_mapper` instances are automatically assigned a value for the `type` member.

Now, the cascading nature of switch statements takes effect and `mapper.apply(inst, string)` is called. However, since `string_target` has an undefined value, the memory address to which data is going to be written is also undefined. Empirical data showed that it was almost always the base address of `inst` (i.e., `string_target` was `NULL`, so the offset was 0), meaning the value would be written to the beginning of the memory for `inst`, regardless of the data that was actually stored there. In this case, that meant the value for `drop_off_type` was written over the data location for `trip_id` (the first data member of `stop_time`).

Because C++ is compiled and does very little in terms of dynamic data integrity checking, Timetable still saw `trip_id` as a valid string (even though the first few bytes of it were overwritten by the rogue `field_mapper` from above). Examining `inst.trip_id` would then give garbage values, as it had been corrupted. Even though most of the original data was there, some critical portion had been overwritten and the observed value was incorrect.

The solution for this whole issue was to simply move the `default` case of the switch statement in 3b to *after* all of the valid configurations, essentially making it a nop. The result (sufficiently) enforces that `header` map to an existing `field_mapper` instance in `field_types` in order to actually be applied.
`libsqlite3` was still being included in the build command, despite not being used by Timetable.
Originally, the static parser members were protected and the plan was to have functions which wrapped the parsing operations. However, it turns out to be much simpler to just call the parser directly, and so the members should be made public.
The parsing function now returns a vector of objects parsed from the appropriate file in the given directory.
These types were a remnant from the previous implementation dependent on a SQL database for data storage. Moving everything into memory and having a full GTFS library eliminates the need for these types.
…on, getting ready for filter_iterator.

CSV Parsing can now be done all at once (as before, but now called `all`) or as a stream (via `initialize` and `next`).

`visit_list_key` and its usages have also been re-arranged to be `(Station, StopTime, Route, Trip)` to better accommodate its primary use case (See #1). For uses better managed by a different arrangement, another key/map will likely be made.

In general, Timetable is now ready to have a `filter_iterator` that takes a custom predicate and only returns results that pass the predicate condition. The primary use of this will be to filter out visits whose service is not being offered in a given timeframe (a la `Timetable::Calendar`), or to return only visits from a given set of routes.
`is_active` returns true if the given Visit is active on the given date, according to the Timetable's Calendar.

I'm convinced that these sorts of actions should be kept in visitor-like classes, but I'm not entirely sure how to go about doing so. In the meantime, adding these to Timetable lets them function as intended.
`visits_between` returns a `bounds_t` for the given keys from the lower bound of the first to the upper bound of the second. Iterating through these bounds should yield the inclusive set `[key1, key2]`.
Once a file has been parsed by `csv_parser`, there is no need to keep it open, since all of its information has been translated to a usable format.

Closing the file is done automatically when calling `all()`, but must be done manually if using the streaming form (`open`, `has_next`, etc.) using `finish()`.
`route_map` contains all of the routes from `routes.txt` indexed by their ID, meaning they can quickly be looked up based on the `route_id` from a `trip`.
`routes_by_short_name` allows requests to specify the `short_name` of a route as an identifier, rather than requiring them to know the exact `route_id` (often a uuid) ahead of time.
This initial implementation takes in parameters and returns a list of departure times (no arrival times, yet) matching those parameters.

Basic testing shows that the implementation is not accurate (visits are not returned in ascending time order), and it seems that the current structure of `visit_list` is partly at fault. The keying strategy currently keeps all visits on the same trip in ascending time order, but visits on the next trip may occur chronologically before them, resulting in an out-of-order result set, and thus incorrect results.
While `visit_list_key` was comparing visits correctly, the constructor did not match the comparison order. While not affecting the logical comparison, it made the uses of keys less obvious.
…rder.

`visits_between` seems to be the call that will be made most commonly, so testing with it is more accurate to real-world use.

Per #5, the argument order of the RPCs has been adjusted to have `route` as the next-to-last parameter. At the moment, only the route-agnostic RPC is implemented.

Per #6, `route` and a temporary `headsign` placeholder have been added to each visit in an RPC response. I'm not entirely sure that `headsign` is a rational field to include, as it is an optional GTFS field, and doesn't really provide any information that can't be gleaned from the route itself.
The `Visit` wrapper class makes it easy to collect and access all of the information associated with a visit (e.g., the stop_time, route, trip, and stop) and also defines the structure of entries in each RPC response via its `operator MsgPack` definition, ensuring a consistent format across all calls.
@faultyserver

This PR also includes implementations for #5 and #6.
