output csv/tsv files coordinate reference system #1561

Open
vsp-gleich opened this issue Jun 22, 2021 · 29 comments
@vsp-gleich
Contributor

Over the last few years we have added standard analysis output as csv/tsv files, such as trips.csv.gz, drt_trips.csv and persons.csv. Some of those contain coordinates (e.g. the aforementioned files). Currently those are written directly from MATSim in the coordinate reference system (CRS) MATSim was using, which can be found in the GlobalConfigGroup but is not specified in the output trips.csv itself. That is potentially dangerous, since CRS issues may arise without the user noticing, e.g. when reading the file into R and doing spatial operations. Many map plotting libraries in R/Python/JavaScript work out of the box with WGS84 input, whereas input in other coordinate reference systems has to be converted before use. For aftersim, e.g., this conversion can take some 11 seconds on top of 10 seconds of file reading for a trips.csv file. Talking to Billy, I had the impression that we have the following options:

  1. Leave everything as is and hope users will always provide trips.csv together with config.xml and check the CRS in the config.xml.
  2. Somehow add the CRS to the trips.csv itself, e.g. below the header line with the column names, add a comment line with this information after a #.
  3. Use WGS84 for all csv/tsv output files; WGS84 because it can be applied worldwide and is what map plotting libraries use.

Option 1. is somewhat error-prone, and it is not clear whether the user will notice any issue, e.g. when using an R script which returns just the number of rides in a certain area (which then happens to be 20% lower due to some CRS issue). Reading a second file (config.xml) is not nice, and it is xml, which is generally harder to read in R/Python.

Option 2. is safer, but requires csv/tsv readers capable of interpreting the # as a comment, and each file would need to be read twice (once with # lines skipped as comments and a second time to pick up the CRS after the #).

Option 3. could make things easier when plotting maps and avoids the troubles of 1. and 2., but causes problems when combining with other files which still use the globalConfig CRS (e.g. network.xml); we would then be using different CRSs, whereas today we use the globalConfig CRS consistently everywhere. WGS84 is also not good for calculating beeline distances.
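
As a sketch of option 2., a reader could pick up the CRS from the comment line in a first pass and skip it as a comment in a second pass. The `# CRS:` line format and the EPSG code here are made up for illustration, nothing is agreed upon yet:

```python
import io
import pandas as pd

# A trips file with a hypothetical "# CRS:" comment line below the header
# (line format and EPSG code invented for this sketch).
raw = (
    "start_x\tstart_y\tend_x\tend_y\n"
    "# CRS: EPSG:25832\n"
    "795000.0\t5828000.0\t795500.0\t5828500.0\n"
)

# First pass: pick up the CRS from the comment line.
crs = None
for line in io.StringIO(raw):
    if line.startswith("# CRS:"):
        crs = line.split("# CRS:", 1)[1].strip()
        break

# Second pass: read the data, with # lines skipped as comments.
trips = pd.read_csv(io.StringIO(raw), sep="\t", comment="#")
print(crs, len(trips))
```

The double read is the price of this option, as noted above, but both passes stream the file, so it stays cheap.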

I have a slight tendency towards 3), because we mostly use those csv output files for R/Python and aftersim JavaScript analysis, and so far I see little mixing with other files using the globalConfig CRS. 2) would be my second-best option. Any opinions?

@tschlenther
Contributor

Technically there is also
4. Add two columns containing longitude and latitude, such that every line contains the coordinates both in WGS84 and in the CRS used inside MATSim.

This would have the downside of enlarging the file with somewhat duplicated data, but would make the WGS84 coordinates always available without losing direct compatibility with the MATSim CRS.

I would probably also vote for 3), but then I would set the column names to longitude/latitude, or include "WGS84" in them, such that it is really clear that the coordinates are in WGS84 and might differ from other files like the network.

@vsp-gleich
Contributor Author

Kai says we have already decided to go for WGS84 in all files, including input and output plans, network, etc., in the very long run. So we would opt for 3).

@JWJoubert
Contributor

JWJoubert commented Jun 22, 2021

I fully appreciate the frustration of dealing with conflicting CRSs. But just for clarity...

Kai says we have already decided to go for WGS84 in all files...

Would any distance handling outside of MATSim then require one to take care of the conversion oneself? And what is the role of the recently introduced attribute in the network.xml.gz file indicating the CRS? Will that too become deprecated? I would opt first for @tschlenther's option (4), then (3).

@sebhoerl
Contributor

sebhoerl commented Jun 23, 2021

Actually, when we deal with this situation (for instance in our demand synthesis pipeline), we provide two files:

  • One in CSV which is easy to access and read for everybody, but without spatial information, as it is ambiguous because of the aforementioned reasons
  • One in GPKG format which contains all the columns of the CSV, but additionally the geometric information (as a line geometry with two points)

Same is true for activities, where we have a CSV and then a GPKG with point geometries.

Of course one could also use SHP here, but GPKG is nicer because it is a self-contained file, not spread over multiple files like SHP. In any case, the advantage is that both SHP and GPKG carry the CRS internally, so there is no ambiguity.

The geometric information is then easy to process in QGIS / geopandas / etc.

For MATSim, I could imagine having a flag which decides whether to only write the CSV / only the geometric version in SHP or GPKG format / or both.

@jfbischoff
Collaborator

jfbischoff commented Jun 23, 2021

One could probably also define an output CRS (in xml) and add a projection file to the output directory, i.e. output_CRS.prj.
Converting coordinates shouldn't be a problem in Python or R.
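
Indeed, such a conversion is a one-liner in Python with pyproj; the EPSG codes and coordinates below are only examples:

```python
from pyproj import Transformer

# Convert a projected coordinate (here EPSG:25833, UTM zone 33N, as an
# example) to WGS84 lon/lat. always_xy=True keeps (x, y) axis order on
# both sides, regardless of the CRS's official axis definition.
to_wgs84 = Transformer.from_crs("EPSG:25833", "EPSG:4326", always_xy=True)
lon, lat = to_wgs84.transform(392000.0, 5820000.0)
print(round(lon, 5), round(lat, 5))
```

The `always_xy` flag matters precisely because of axis-order oddities like the ones discussed elsewhere in this thread.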

I'm personally against WGS84 defaults everywhere (who knows, maybe we will simulate the moon next), but I would opt that working groups could handle the issue via their consistency checkers if required.

@vsp-gleich
Contributor Author

  4. Add two columns containing longitude and latitude such that every line contains the coordinates in WGS84 and the crs used inside MATSim.
    This would have the downside of enlarging the file with some kind of duplicate data but would make the WGS84 coordinates always available without losing direct compatibility to the MATSim crs.

Some files like legs.csv.gz or trips.csv.gz are already rather large (200 MB / 100 MB respectively for a Berlin 10% run) despite zipping, so adding 4 new columns (start_lat, start_lon, end_lat, end_lon) with mostly non-repeating values would probably give a notable increase in size, which makes those files less usable. So far we could open the trips.csv of small and medium scenarios in Excel for visual inspection (e.g. for debugging); that would move out of reach. I have mixed feelings about this.

@vsp-gleich
Contributor Author

vsp-gleich commented Jun 23, 2021

One could probably also define an output CRS (in xml), and add the projection-File to the output directory, ie., output_CRS.prj .

To me this sounds like option 1) from above; the output CRS is already available in the output_config.xml. Maybe reading an output_CRS.prj is easier than reading the output_config.xml, and it might solve rare issues of different implementations of the same CRS, but it is still a second, separate file which can be mixed up with other runs and is less comfortable to use (never forget to copy both, set 2 paths, ...).

@kt86
Contributor

kt86 commented Jun 23, 2021

What about an option to write out the current file with the CRS from the config, plus an additional file in WGS84, e.g. trips_WGS84.csv.gz?
(I think this goes a bit in the direction of @sebhoerl's solution.)

This would increase the overall output size, but the user can choose which file to use for further analysis/work, and no single file gets enlarged by additional columns.

I am not enough of an expert in this field to say whether this idea is better/easier to use than @sebhoerl's, or the other way round.

@vsp-gleich
Contributor Author

vsp-gleich commented Jun 23, 2021

Given the current efforts in developing more generic output analysis and visualization tools, e.g. aftersim, but maybe also Tramola or some matsim-analysis R/Python libraries that might be written in the future, I think we should really agree on some minimum simulation output those analysis and visualization tools can rely on. Everybody is free to implement their own additional output files in whatever format with whatever additional columns, but it would be good to have some defined minimum output (e.g. modestats.* for sure, likely also some trips.* file and more) in all MATSim simulations.

So what should that minimum output look like? It would probably be good to restrict ourselves to as few file formats as possible and to consistently use the same locale-independent delimiters and decimal point (for e.g. csv/tsv/txt), so MATSim users have to learn to read and write only those few formats instead of many different ones, and analysis and visualization tools spend less effort identifying delimiters and decimal points.

  • Xml is what is already used for all the simulation input files; it supports versioning and writing a CRS, but it is very unhandy to read compared to e.g. a csv file. There must be a reason why so many output analysis files do not use it.
  • Tab-separated .txt and ;- or ,-separated .csv are the formats we currently use for default analysis outputs such as modestats.txt and trips.csv. These formats are exceptionally easy to read in any text editor, in R/Python/etc., and even with zless for zipped files on a server, but they have issues with versioning, adding CRS information, and differing delimiters and decimal points. The best solution seems to be *.tsv, because that at least makes the delimiter clear and is still easy to read in R/Python, possibly combined with option 2. (the comment line proposed above to add information on columns, e.g. the CRS or descriptions of column names).
  • geo-json is maybe nicer for geo information than txt/csv/tsv and can still be manipulated in a normal text editor, but it lacks proper versioning and is harder to read in R/Python/Excel/etc. than txt/csv (though still maybe less hard than xml).
  • shp with its 5 files per dataset is not nice, and it is harder to read than txt/csv/tsv. It does not seem the right thing for lots of data, but it works well for geo-data.
  • gpkg sounds nice for properly saving geo data; I don't know about versioning. It is a SQLite database container, so I assume we cannot open it in a text editor/Excel/zless but always need R/Python/etc., and support for this newer format is likely less wide-spread than for geo-json or csv.

I can follow @sebhoerl's argument to supply csv or gpkg or both, but as a minimum output I would prefer to always have the csv, or always have the gpkg, rather than writing analysis code able to read both. There is no problem with optionally adding a copy of the same data in a different format, but one of those output files should be compulsory. When deciding between csv and gpkg for a minimum output specification, I tend towards csv for its better accessibility. But then omitting the geo data from the csv does not help with using that csv for geo-referenced analysis.

In theory we could always reference some other file containing the coordinates. That works well for nodes and links (look up in network.xml), TransitStopFacilities (transitschedule.xml) and Facilities (facilities.xml), but it does not work e.g. for an interaction activity where an agent walked to some position along a link and takes a car; those usually have no facility or node we could reference. And even where we can reference a node, that still means reading a network.xml, which is unhandy because it is a second file to provide (and not mix up with another run), and because reading .xml files and joining them to the .csv is cumbersome. So there is a point to adding coordinates directly to those output analysis .csv files.

Overall, I have a tendency to fix this while staying with .txt/.csv/.tsv, possibly slowly converging towards one of those formats. Then it would again be options 1), 2), 3) or 4) for that minimum simulation output, plus whatever others want to add as optional additional files.

@jfbischoff
Collaborator

To me this sounds like option 1) from above, the output crs is already available in the output_config.xml. Maybe reading a output_CRS.prj is easier than reading the output_config.xml and might solve rare issues of different implementations of the same crs, but it's still 2 separate files which can be mixed up with other runs and are less comfortable to use (never forget to copy both, set 2 paths, ...).

Well, there are already legs, trips and persons files. Adding a fourth with the CRS seems something a user could handle. I assume most people know which CRS they are working with anyway, so adding the projection file is really just an additional output service.

@JWJoubert
Contributor

JWJoubert commented Jun 23, 2021

I assume most people know which CRS they are working with...

Until those files (plans, network, facilities,...) start being passed around from one study to the next, among students or collaborators.

Also, this is indeed an assumption. Don't shoot the messenger... Not all projected coordinate reference systems are that straightforward. The "official" version for South Africa is EPSG:2053, which is westing (y, positive towards the west) and southing (x, positive towards the south). This messes up any visualisation. And NO, that is NOT a typo: x is south-north, and y is west-east ;-) It is still on Via's to-do list to be able to visualise it. As a result, we rely heavily on being able to define an "adapted" version, and luckily MATSim's infrastructure allows that, for example TransformationFactory.SA_Lo19. But that WKT format is only visible within MATSim. A number of GIS applications apparently have a hard time visualising it (if you do not have the projection file or custom WKT).

But, recently, I realised that pt2matsim (actually, GeoTools) requires you to provide the official authority code and does not check what is available and defined in MATSim.

We have found our own way to manage/deal with this, but if you want to think ahead... just consider that others may (in future) run into similar challenges when joining the growing MATSim user groups. And while one can argue "well, let's just use UTM...", it comes back to bite you (we have been there, and abandoned it) because it is not the "official" version used in the local context (as absurd as the "official" version might be; case in point).

@billyc
Contributor

billyc commented Jun 29, 2021

Thank you @vsp-gleich for starting this thread!

@kainagel expressed a strong opinion / plea for all standard outputs that are not "internal" working files to use WGS84 lat/long, but I do understand those of you who are voicing strong concerns about that approach. (Hence this discussion)

@JWJoubert - is your coord reference system representable using a Proj4 definition string? I'm using proj4 in all my tools, and that's also what our Python MATSim library uses internally to get from weird to WGS84.

@sebhoerl - I really like GPKG as a replacement for shapefiles, and in general I love SQLite-based file solutions! That format doesn't seem well-supported in JavaScript though; the npm library hasn't been updated in four years and gets almost no downloads. I'm curious, is GPKG better known in other languages and/or fields? Are the files compressed by default? I can play with it to see how it works out, but I share the other commenters' sentiments that CSV is probably the easiest choice for import/export with other tools. I've been toying with more database-oriented approaches for data storage in aftersim, but so far our file-based tools have worked out best for the team here at VSP.

For aftersim, my latest approach is to just try every method I can to determine the projection.

  • If the CSV has a projection comment line such as # CRS: EPSG:xxxx, it will use that. (I made that format up; we should agree on the format of the line.)
  • If it does not, look for the output_config.xml and fetch the coordinate system there. But many old runs (and lots of trimmed output folders in public-svn) are missing either the file or the coordinateSystem parameter...
  • so at some point I need to halt and just ask the user what the CRS is. Often they won't know the answer, but maybe they can try a few and see if any "look right"... 🤷
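
That fallback chain could be sketched roughly like this; both the `# CRS:` comment format and the `coordinateSystem` parameter lookup are assumptions based on this thread, not an agreed MATSim standard:

```python
import re
from pathlib import Path

def detect_crs(csv_path, config_path=None):
    """Best-effort CRS detection, mirroring the fallback chain above.
    The '# CRS:' comment line and the coordinateSystem parameter
    lookup are assumptions, not an agreed MATSim standard."""
    # 1) Look for a CRS comment line near the top of the CSV.
    with open(csv_path) as f:
        for _, line in zip(range(5), f):
            m = re.match(r"#\s*CRS:\s*(\S+)", line)
            if m:
                return m.group(1)
    # 2) Fall back to the coordinateSystem parameter in output_config.xml.
    if config_path and Path(config_path).exists():
        m = re.search(r'name="coordinateSystem"\s+value="([^"]+)"',
                      Path(config_path).read_text())
        if m:
            return m.group(1)
    # 3) Give up: the caller has to ask the user.
    return None
```

Step 3) is exactly the "ask the user" dead end described above, which is why a standardized in-file CRS would help.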

Which is why having a solid community answer to this dilemma would be really helpful.

For my visualizations I absolutely need coords in lat/long, so if we choose a solution where the output files are NOT already in WGS84, these files take around 40% longer to ingest every time we load them, which turns into tens of seconds of real waiting time for some use cases.

Any other thoughts on this?

@sebhoerl
Contributor

@billyc GPKG works very well in a Python / geopandas / QGIS toolchain, but true, I'm not sure how well-supported the format is elsewhere (in Java I would expect that it is well-supported by Geotools, but maybe it is wishful thinking ;)

@billyc
Contributor

billyc commented Jun 29, 2021

@billyc GPKG works very well in a Python / geopandas / QGIS toolchain, but true, I'm not sure how well-supported the format is elsewhere (in Java I would expect that it is well-supported by Geotools, but maybe it is wishful thinking ;)

@sebhoerl I had already experimented with converting the output_trips and output_legs into a standalone SQLite file, and the querying/filtering is so performant and wonderful. I didn't try using GPKG though and will do so.

This Stack Exchange post makes it sound like GeoTools does support GPKG, but the question is about a problem with projections, hahahahahaha ✨✨

@JWJoubert
Contributor

@billyc it is not a Proj4 string, but Well-Known Text (WKT) format. Like many in the org.matsim.core.utils.geometry.geotools.MGC.

so at some point, I need to halt and just ask the user what the CRS is. Often they won't know the answer, but maybe they can try a few and see if any "look right"...

...or not, if they are users of (an old version of) Saturn, which could not handle negative coordinates (southern hemisphere), so users just added an arbitrarily large number to the y-value. And when you ask them, they can't remember what that value was.

I do not have a preference for any format. The purpose of my comment is just to remind you that projected coordinate reference systems are trickier than you may now recall. Think back and remember what you did not know 😉 Once you've worked with these on a near-daily basis, it becomes common sense and straightforward. I think there are many MATSim (wannabe) users and data curators who would benefit from the CRS tagging along in some way.

@mrieser
Contributor

mrieser commented Jun 29, 2021

Via supports GPKG thanks to GeoTools, that works okay.
A GPKG file can actually store multiple "layers" or "geo-datasets", which might be an advantage or disadvantage. (if there is more than one, one has to ask the user which one to work with)

geojson, based on the standard definition v2, must always contain WGS84 (v1 allowed other CRSs too, but v2 reduced that to always use WGS84).

geojson has the problem that it is one single data entity (a feature collection), which means it must be loaded and written all at once. There is geojsonl (line-delimited geojson), where each "feature" is written as a json element on a single line. This allows reading and writing features one by one, which is often more suitable for the large datasets MATSim produces.
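
A sketch of the line-by-line writing that geojsonl enables; the trip data and file name are made up, and per GeoJSON v2 the coordinates are WGS84 in lon/lat order:

```python
import json

# Write features one per line (newline-delimited GeoJSON), so large
# MATSim outputs can be streamed instead of held in memory at once.
# Trip data and file name are invented for this sketch.
trips = [
    {"person": "p1", "from": (13.3889, 52.5170), "to": (13.4050, 52.5200)},
]
with open("trips.geojsonl", "w") as f:
    for t in trips:
        feature = {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [list(t["from"]), list(t["to"])],
            },
            "properties": {"person": t["person"]},
        }
        f.write(json.dumps(feature) + "\n")
```

A reader processes the file the same way: one `json.loads` per line, never the whole dataset at once.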

In general, more and more geo-visualization tools pop up (e.g. kepler.gl), and they all tend to expect WGS84.

@neuma
Contributor

neuma commented Jun 29, 2021

Some thoughts after reading this thread

  • At some dev meeting we agreed on having all output files use the output CRS specified in the config.
  • A straightforward solution would be to use WGS84 as the output CRS; has anybody tested this?
  • If some output data uses a different CRS than the output CRS from the config, that CRS needs to be explicitly specified in that file.
  • Additionally writing the same coordinate in some other CRS, like a sameCoordButWGS84 column, is okay, but would add to the file size (quasi-random digits do not compress well).
  • I really like flat files because you never know which tool chain you are going to use.
  • I prefer gpkg over shape, mainly because of handling and size limits.

So to my understanding the current situation is: the user needs to look up the output CRS in the config. However, this information is lost if the user does not know or does not care about the config, or (@JWJoubert's point) if the file is passed on without the config.

I also see a tendency towards WGS84 for exchanging geo data. So having WGS84 as the standard MATSim input and output is maybe the way to go. Running MATSim with a non-Cartesian coordinate system is a different story, though.

@mrieser
Contributor

mrieser commented Jun 29, 2021

random numbers do not like to be zipped

One could probably reduce the accuracy of the coordinates written out to reduce file size, see xkcd.com/2170 ;-) (Theoretically a float would probably be enough to store such a coordinate value instead of a double, which would even help save memory.)
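
As a sketch, assuming WGS84 output: rounding to about 5 decimal places (roughly 1.1 m at the equator) before writing keeps street-level accuracy while making the strings shorter and more compressible. The coordinate values are made up:

```python
# Round WGS84 coordinates before writing them out. Five decimal places
# correspond to ~1.1 m at the equator, which is usually enough for
# street-level matching; fewer digits also compress much better than
# full double precision.
def round_coord(lon, lat, digits=5):
    return round(lon, digits), round(lat, digits)

print(round_coord(13.40495214981, 52.52000659929))
# (13.40495, 52.52001)
```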

@sebhoerl
Contributor

To complicate things even more, since we are talking about weird projections: we were once working with @balacmi on some US data. It took us a while to realise that suddenly all the distances written out by MATSim were given in feet, and our speeds in m/s didn't fit at all ;-)

@billyc
Contributor

billyc commented Jun 29, 2021

@mrieser exactly!! I already do that in the Python library; five digits past the decimal are enough to place lat/long accurately on a street map.

@neuma I like your summary a lot, and for the exchange of data, having WGS84 really is so much more friendly to most every other toolchain out there.

How interesting would it be to have inputs and outputs all in WGS84, and thus very interchangeable -- but internal to MATSim, convert (before running) to meters around a useful center point? Maybe I'm dreaming. Off topic for this discussion I guess.

@vsp-gleich
Contributor Author

How interesting would it be to have inputs and outputs all in WGS84, and thus very interchangeable -- but internal to MATSim, convert (before running) to meters around a useful center point? Maybe I'm dreaming. Off topic for this discussion I guess.

As far as I understood it, this is what @kainagel was talking about. Then the projection would be WGS84 in all input and output files, and we would only have to convert internally for beeline computation. That might require a projection setting in the config though, unless we come up with some auto-detection based on which lat/longs MATSim finds in its input data.

@vsp-gleich
Contributor Author

vsp-gleich commented Jun 29, 2021

One counter-argument to using WGS84 for output files mentioned in this discussion was that it makes beeline computation harder. I can partly understand that argument; it is more difficult, but there are libraries for this, and even implementing it ourselves is not impossible (see e.g. https://www.mkompf.com/gps/distcalc.html). It needs somewhat more computation time, but for output analysis that sounds okay to me, and having all output in WGS84 gives performance gains for map plotting etc., where we currently have to convert to a different projection.
For trips.csv and legs.csv we already have the option to add a custom extension with more columns. So if some people really need coordinates in a different projection, they can add them there. But this is not something I would like to implement for all geo-referenced output files.
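
For illustration, a beeline distance on WGS84 coordinates via the haversine formula, in the spirit of the linked page; the coordinates are just two points in Berlin, and for sub-metre accuracy a proper geodesic library would be the better choice:

```python
import math

# Great-circle ("beeline") distance in metres between two WGS84 points,
# using the haversine formula on a spherical Earth (radius ~6371 km).
def haversine_m(lon1, lat1, lon2, lat2, r=6371000.0):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Brandenburger Tor to Alexanderplatz, roughly 2.5 km beeline.
print(round(haversine_m(13.3777, 52.5163, 13.4132, 52.5219)))
```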

@neuma
Contributor

neuma commented Jun 30, 2021

As far as I understood it, this is what @kainagel was talking about. Then projection would be WGS84 in all input and output files and we only have to convert internally for beeline computation. That might need a projection setting in the config though, unless we come up with some auto-detection based on which lat/longs matsim finds in its input data.

So the idea would be to check all coordinates read from the input data and then choose the UTM zone based on the least deviation? Which UTM would that be?

How many digits do we want for WGS84 coordinates? What is the minimum resolution? 1 meter in some equatorial country?

What do we do with the height information or Z-axis?

@sebhoerl
Contributor

sebhoerl commented Jun 30, 2021

My opinion on the whole topic would actually be that it is fine as it is. The remark on GPKG was an example of how we have set up and standardized our analysis pipelines, but this is on the post-processing side, where we can add one line of code in case we need to convert some projection. However, reading the discussion on WGS84, I must say I would find it weird to have x and y attributes in the files which then actually contain latitudes and longitudes (see also @neuma's remark on z values). And I think one can expect some minimal willingness to read documentation, and some basic knowledge of what one is doing, from people using MATSim. It is a different story for tools like Via or Tramola which focus on being user-friendly, but those have ways to check that all the necessary input information is present and consistent.

Actually, I have never used MATSim's internal conversion functionality, because I always found it safer to just provide the files already in the correct projection. Hence, the question of what the output projection may be never came up. Maybe that is rather the problem? Without the internal conversion machinery, we would force people to think about their projection and make sure everything is consistent in the first place.

@vsp-gleich
Contributor Author

vsp-gleich commented Jun 30, 2021

However, reading the discussion on WGS84, I must say I would find it weird to have x and y attributes in the files, which then actually contain latitudes and longitudes (and see @neuma's remark on z values).

We can rename them lat/lon if that helps. But looking at height information in WGS84 and its conversion to other projections, it indeed seems less straightforward than the lat/lon conversion.

Actually, I have never used all the internal conversion functionality of MATSim, because I always found it safer to just provide the files already in the correct projection.

We have, for example, our own Open Berlin scenario built in one projection (EPSG:31468, Gauss-Krüger DHDN GK4) and a Berlin model built by Senozon for the AVÖV project in a different projection (EPSG:25832, UTM zone 32N). Since we did similar things with both models, we wanted to re-use e.g. shape files with the boundaries of certain areas of the city, or compare results. So it does happen that we use different projections and risk mixing them up.

@vsp-gleich
Contributor Author

So the idea would be to check all coordinates received by reading the input data and then choose the UTM and the zone based on the least deviation? Which UTM would that be?

That was my addition. Your link shows a nice map of UTM zones. The data on the boundaries within which a certain UTM zone is most accurate is hopefully accessible online, so we could e.g. create a bounding box around all nodes in the network.xml, calculate the centroid of that bounding box, and look up in which UTM zone that centroid is located.
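
For the standard 6-degree-wide UTM zones (ignoring the Norway/Svalbard exceptions), the lookup from a WGS84 centroid is a small calculation; sketched here with example coordinates:

```python
# Pick a UTM EPSG code from a WGS84 centroid. Standard zones are 6
# degrees wide starting at 180W; EPSG codes are 326xx for the northern
# and 327xx for the southern hemisphere. The Norway/Svalbard special
# zones are ignored in this sketch.
def utm_epsg(lon, lat):
    zone = int((lon + 180) // 6) + 1
    return (32600 if lat >= 0 else 32700) + zone

print(utm_epsg(13.4, 52.5))   # Berlin -> 32633 (UTM zone 33N)
print(utm_epsg(28.0, -26.2))  # Johannesburg -> 32735 (UTM zone 35S)
```

So a bounding-box centroid over all network nodes would be enough input for this kind of auto-detection.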

How many digits do we want for WGS84 coordinates? What is the minimum resolution? 1 meter in some equatorial country?

Sounds reasonable.

What do we do with the height information or Z-axis?

Good question, I haven't talked about that with Kai.

@jfbischoff
Collaborator

My opinion on the whole topic would actually be that it is fine as it is.

I agree with that.
I would not object to adding lat and long in WGS84 as additional columns. That seems to solve everyone's problem.

We have for example our own Open-Berlin Scenario build in one projection (EPSG:31468 Gauss-Krüger DHDN GK4) and a Berlin model build by Senozon for the AVÖV project in a different projection (EPSG:25832 UTM zone 32N projection). Since we did similar stuff with both models we wanted to re-use e.g. shape files with the boundaries of certain areas of the city or compare results. So it does happen that we use different projections and risk mixing them up.

Okay, but that is clearly also a technical debt the Berlin team has been aware of for roughly 10 years, and it has caused a mess several times :-)

@billyc
Contributor

billyc commented Jun 30, 2021

If the team decides to add lat/long to any output files, let's please use a standard naming convention, such as what kepler.gl expects:

kepler.gl will auto detect layer, if the column names follows certain naming conventions.
kepler.gl creates a point layer if your CSV has columns that are named <name>_lat and <name>_lng or <name>_latitude and <name>_longitude, or <name>_lat and <name>_lon.

If we standardize on one of these, then coordinates can be auto-sensed by many analysis tools without any fuss at all. Exciting!
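
A sketch of writing such a file with the kepler.gl-friendly `<name>_lat` / `<name>_lon` convention; column names, file name and coordinate values are just examples:

```python
import pandas as pd

# Trips table with kepler.gl-detectable coordinate columns, so point/arc
# layers get auto-sensed from the column names alone. All values are
# made-up WGS84 coordinates for this sketch.
trips = pd.DataFrame({
    "person": ["p1"],
    "start_lat": [52.5170], "start_lon": [13.3889],
    "end_lat": [52.5200], "end_lon": [13.4050],
})
trips.to_csv("trips_wgs84.csv", index=False)
```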

@JWJoubert
Contributor

How many digits do we want for WGS84 coordinates?

Just a heads up: we had to implement a specific Coord converter for the AttributeConverter, as we use WGS84 coordinates extensively in Attributes. And there we learned, by burning our fingers, that decimal degrees need more decimal places than typical rounded projected coordinates 🤨
