Investigate moving towards a federated network of Headway instances #50

Open
1 of 9 tasks
ellenhp opened this issue Jun 2, 2022 · 24 comments

Comments

@ellenhp
Member

ellenhp commented Jun 2, 2022

Lots of users have asked for this, and it could alleviate the need for a really slick bootstrapping UI (see #49). There are a number of blockers, mostly privacy-related:

  • Base map
    • Investigate offline downloads to help with privacy (I don't want the server to know where in the basemap I pan/zoom to at any given moment)
  • Geocoding
    • Get Mapbox Carmen working in the browser
    • Integrate Carmen into the headway frontend for offline geocoding
    • Confirm granularity of Carmen tiles isn't too fine for privacy's sake
  • Routing
    • Get valhalla compiling to wasm
    • Integrate valhalla into the headway frontend for offline routing
    • Serve raw valhalla tiles
    • Investigate the privacy issues inherent in sending ad-hoc requests for valhalla tiles to an untrusted server (can they reliably determine my future location information?)
    • Investigate whether offline downloads could help with this privacy issue

edit: These are just the hard blockers, there's a lot of architectural stuff that would need to happen before this becomes feasible.

@ellenhp
Member Author

ellenhp commented Jun 2, 2022

Like a lot of other mapbox libraries, it seems that carmen doesn't build on arm64. I doubt it works in the browser after all. It seems like it might be node-only, and a lot of its dependencies rely on C++. The concept seems sound though. It just seems like a lot of work to get it working client-side. Same with Valhalla, I've heard that getting Valhalla to build for native clients is very hard. Web clients would probably be even harder.

edit: It may be possible to get carmen working in the browser. It appears that most of its native dependencies are written in rust and may (?) compile to wasm neatly.

@ellenhp
Member Author

ellenhp commented Jun 3, 2022

Philosophically speaking, I really like that headway's current scope is small, but I do recognize the utility of having maps available for more than just individual metro areas. As long as location information (edit: including search queries, and routing queries) never leaves the device, I do think I'd be open to adding in an option for federation.

I've heard that getting valhalla building for native mobile clients takes about a month, and I imagine it's as hard or harder to get it to build in emscripten. The geocoder would likely need to be written from scratch; Mapbox Carmen was designed for server-side use, which leaves me very skeptical that it would work well over a mobile network. I think that Tantivy might be usable as a geocoder if you indexed OSM nodes as you downloaded individual tiles, but that would probably be another month to prove out, more to get it working well. And it will never be as good as Photon is now. It would take some additional work to geocode places that you've never downloaded tiles for. Perhaps the client could download data from Who's On First on startup and index that.
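
Very roughly, the "index as you download" idea would look something like this sketch, with a bare-bones prefix index standing in for a real engine like Tantivy (nothing here is Tantivy's actual API, and the feature format is made up):

```ts
// Toy sketch: as basemap tiles arrive, pull named features out of them and
// add them to a tiny client-side prefix index, so search queries never leave
// the device. A real implementation would want a proper engine; this is only
// the shape of the idea, and the Feature type is an assumption.

interface Feature {
  name: string;
  lat: number;
  lon: number;
}

// token prefix -> features whose name contains a token with that prefix
const prefixIndex = new Map<string, Feature[]>();

function indexTileFeatures(features: Feature[]): void {
  for (const f of features) {
    for (const token of f.name.toLowerCase().split(/\s+/)) {
      for (let len = 1; len <= token.length; len++) {
        const prefix = token.slice(0, len);
        const bucket = prefixIndex.get(prefix) ?? [];
        bucket.push(f);
        prefixIndex.set(prefix, bucket);
      }
    }
  }
}

// Search-as-you-type against whatever has been downloaded so far.
function searchLocal(query: string): Feature[] {
  return prefixIndex.get(query.trim().toLowerCase()) ?? [];
}
```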

The plus side of all this is that if these things were all performed client-side, running a server becomes very easy. The server-side component of Headway would essentially become a tileserver and static site server only, plus OpenTripPlanner if you want to do transit directions. That could open the door to larger installations.

Still, the scope of all of this sounds completely different than what I originally set out to build two weeks ago. And I think a lot of what makes Headway special compared to offline maps apps that you can get for phones is the fact that it takes a different approach to ensuring privacy: self-hosting. I'd love to hear peoples' thoughts.

Overall it really feels like a distraction to me. There are a lot of things that could be done right now to improve Headway, and putting about two months of full-time work into something like this makes very little sense to me at this point in time.

@ellenhp
Member Author

ellenhp commented Jun 3, 2022

Valhalla is apparently also quite large when compiled for client-side use: valhalla/valhalla#1860 (comment)

What might make sense is keeping valhalla server side but having home servers request routing tiles from other servers instead of just proxying the requests. Similar to some federated platforms where you trust your home server, but not any of the other servers. Perhaps Carmen could be used on the backend in a similar way. Carmen doesn't support search-as-you-type if I remember correctly but it would still be a better experience than not having any maps or geocoding services available at all.

@bwoodcock

A few very-unstructured thoughts, just based on my own use-cases and ideology:

I'm absolutely for decentralization and federation. I'm also absolutely for complete user control of their own information. So, from my own perspective, any approach which created a repository of data about users would be one I wouldn't use. Lots of projects work well by dint of some middling number of donors contributing and supporting the necessary infrastructure, without expectation of any return, other than their own ability to use a stable system. NTP and SKS are examples. I believe there are a bunch in the weather/air-quality tracking space, and aircraft tracking, and so forth.

There's already a large community of people supporting OpenStreetMap, and giving them more benefit in return for their work, and more public recognition of the value, will help reinforce that positive loop.

So how do you protect users' privacy, if they're pulling data from, and using computational resources on, servers that aren't under their individual control? I always like a layered defense: try to break things down and provide multiple overlapping solutions... So, first, do locally what can be done locally. If data (like map tiles, or transit routes/schedules) can be cached efficiently on the client, do so, so you're not needlessly asking for more than necessary. Likewise, do query minimization... Don't ask any one party for more data than you need from that party, and split queries up among, or rotate among, multiple servers wherever it can be done at little cost. Creating chaff isn't a good solution for a system that depends upon volunteered resources, but a mixmaster approach (such as is used in Tor and ODNS) can be a reasonably efficient way of providing a reasonable level of privacy, if there's any reason to assume that entry and exit nodes won't collude. Which is more of a societal-norms issue than a security one... the security assumption is always that they would collude. But in a layered defense, having a societal norm that they don't, and then not putting all your eggs in that basket (which is how iCloud Private Relay fails) seems a reasonable approach.

There's a lot to be learned from recursive DNS, both in terms of successes and failures. Aggregating lots of users behind something which mixes their queries, caches whatever it can, and distributes the cache misses as broadly as possible, while doing query minimization, transport-layer encryption, and end-to-end validation, is all a win. And staying the hell away from fingerprintable HTTP stacks, using a lean protocol that only does as much as is necessary and no more.

Sorry this isn't organized better. I've been up for 26 hours. :-/ Happy to discuss any of it further.

@ellenhp
Member Author

ellenhp commented Jun 3, 2022

Get some sleep!

I feel like the biggest win for privacy that's within reach would be making the stack so lightweight that it could run on a $5/mo VPS. Then people could stand up an instance as their homeserver, and the homeserver could request tiles (map tiles, routing tiles, carmen tiles, etc) from the rest of the network to fulfill that user's requests. This would require using Carmen for geocoding, since it's the only geocoder I know of that can work on hierarchically tiled indexes instead of one big index. Photon could remain as a regional auto-complete option for instances that choose to devote extra memory to it.
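
Roughly, the homeserver side of that could look something like this sketch; the peer list, tile URL layout, and cache policy are all placeholders, not anything Headway has today:

```ts
// Sketch: a homeserver answering its user's tile requests by fetching from
// peer instances on a cache miss. Peer URLs and the tile path scheme are
// assumptions for illustration only.

const peers = [
  "https://headway.example.org",
  "https://maps.peer.example.net",
]; // would come from whatever peer-discovery mechanism the network settles on

const tileCache = new Map<string, ArrayBuffer>(); // a real server would use an LRU

async function getTile(z: number, x: number, y: number): Promise<ArrayBuffer> {
  const key = `${z}/${x}/${y}`;
  const cached = tileCache.get(key);
  if (cached) return cached; // the user's request never leaves the homeserver

  // Pick a peer at random so no single instance sees all of this
  // homeserver's cache misses.
  const peer = peers[Math.floor(Math.random() * peers.length)];
  const res = await fetch(`${peer}/tiles/${key}.pbf`);
  if (!res.ok) throw new Error(`peer ${peer} returned ${res.status}`);

  const tile = await res.arrayBuffer();
  tileCache.set(key, tile);
  return tile;
}
```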

Transit routing would still require sending route endpoints to untrusted servers though. I don't see a way around that since Valhalla doesn't do transit routing well.

I really like the idea of balancing requests for tiles across many different nodes that cover the same area.

@bwoodcock

Also, re offline downloads, there are plenty of lessons to be learned from the BitTorrent folks.

@ellenhp
Member Author

ellenhp commented Jun 3, 2022

Yep, because even though I might not be able to contribute to hosting an entire metro area on my $5 VPS, it's still capable of caching tiles, and could serve those tiles to other nodes if the opportunity arises. Similar to the way that even a leech on a torrent still seeds data when it can.

@bwoodcock

I really like the idea of balancing requests for tiles across many different nodes that cover the same area.

If you divide load by hashing the tile (or whatever data-chunk) to the server, and have the client choose randomly from among the servers that are serving that tile, it prevents any server from developing much of a picture of what any one client is up to, even over a span of many requests. Also, it means that any individual server doesn't need to hold very much data, so the size (and corresponding resource consumption) of any one server can go all the way down to the granularity of a single tile or data-chunk.
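
Something like this, very roughly; the hash function and the idea that each server advertises which buckets it holds are just placeholders to make the shape of it concrete:

```ts
// Sketch: assign each tile to a bucket by hashing its key, then let the
// client pick randomly among the servers that claim to hold that bucket.
// The bucket count and the server list format are assumptions.

interface TileServer {
  url: string;
  buckets: Set<number>; // which hash buckets this server serves
}

const BUCKETS = 4096;

// FNV-1a, just as a stand-in for whatever hash the network would agree on.
function bucketFor(tileKey: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < tileKey.length; i++) {
    h ^= tileKey.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h % BUCKETS;
}

function pickServer(tileKey: string, servers: TileServer[]): TileServer {
  const bucket = bucketFor(tileKey);
  const candidates = servers.filter((s) => s.buckets.has(bucket));
  if (candidates.length === 0) throw new Error(`no server holds bucket ${bucket}`);
  // Random choice among replicas keeps any one server from assembling a
  // coherent picture of a single client's browsing.
  return candidates[Math.floor(Math.random() * candidates.length)];
}
```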

@bwoodcock

Oh, another use-case that would be really attractive: a privacy-protecting alternative to Apple's "find my friends." Back in the dot-com days, there was a company that would tell you whenever you were near one of your friends, without getting into much more detail than that. It was super-useful for finding out that one of your friends was transiting the same airport you were, or attending the same conference. So, something with finer control than Apple gives, that doesn't share information with any central party, and defaults to less information, would be good. This is one of those problems that you probably need to drag a cryptographer in for, though, if you want a solution that's more efficient than unicasting the data to each friend, encrypted to their public key.

@bwoodcock

That requires federation, though, because one's mobile device (which knows where one is) isn't necessarily reliably connected, so it needs to be able to push a "last seen here, when" message to something that is reliably connected, whenever it can, and then you need to flood those through a set of reliably-connected servers that other clients are checking for relevant updates. (Which could be pulled based on the client's location, or could be pushed based on the client's friend's identity.)

@bwoodcock

...some federated platforms where you trust your home server, but not any of the other servers.

The problem with that in the DNS space (which is what I mostly know) is that homeserver people get all excited, set up their own server, but it doesn't know anything, so it's all cache misses all day, so it has to go get everything from elsewhere, and it's doing so using their own static IP, which identifies them uniquely.

Whereas if they were roaming around as a mobile client, they'd be getting different IPs behind different NATs. And if they were using a big shared cache, their own queries would be lost in the noise (on the other side of that cache).

So it only really works if you have a big cache that's fronting for you and lots of other people, and you trust that cache. And that's a tough combination.

Which was why I was suggesting something layered, with a mixmaster approach to caching, but also having the clients distributing their queries pretty actively.

@ellenhp
Member Author

ellenhp commented Jun 4, 2022

I'm not willing to have clients distribute requests unless those requests do not contain PII, which is tough with maps. We can talk all day about the potential for a z14 tile request to leak information, but if the clients are sending free-text queries to a server, that's far worse. The first thing anyone does when they see a new maps app is type in their home address. There's nothing I can do about that, other than try to prevent it from leaking to a server other than the one that served them the page.

I wonder if fast_paths could be made to work with tiled routing graphs. It seems to support running in a browser, which is pretty cool. Valhalla really does not want to build for WASM; I spent a few hours yesterday on that and it wasn't a fun time.

Geocoding is an open question. I think the architecture that Carmen uses could probably work in the browser but I'm not sure if Carmen itself could or if something would need to be written from scratch.

This is a pretty big undertaking but it would be a big win to have this all work with federation, because that's a community-oriented way to get worldwide coverage while minimizing the privacy implications of talking to untrusted servers.

@bwoodcock

bwoodcock commented Jun 4, 2022

So, a few caveats, just to be clear: You know mapping, I don't. I'm not a privacy expert; they exist, I'm not one of them. And I'm going to try stating a few things very simply not because I think you don't know them, but because neither of us knows what the other knows, and that's a way of laying a foundation for a conversation about a complicated and fraught topic. So, consider this me thinking out loud, not me attempting to lecture from a nonexistent position of authority.

I'm not willing to have clients distribute requests unless those requests do not contain PII...
The first thing anyone does when they see a new maps app is type in their home address.

PII is the combination of data in a way that identifies a person. Thus, a height, 170cm, is not PII unless it is combined with other identifying information, like a name or an identity number. Names and identity numbers are intended to be identifying information; that is, they're intended to be, if not actually unique, at least unique within some specific and useful context. Unique within the employees of a company; unique within the set of currently-active MasterCard numbers. There are also clearly non-unique pieces of identifying information, which are typically not unique, except at the margins of a bell-curve of demographic distribution (a person who is more than 250cm tall; a person who resides in Hot Springs County, Wyoming; a person whose system font is Comic Sans) or measured with extraordinary precision (the person who is at exactly 38.89767637490 -77.0365297932 right now), but which, in combination with other such pieces of information, specify a single individual increasingly narrowly. Thus, the person who lives at a certain street address and weighs 55kg may constitute PII, even if hundreds of people live at that address, and tens of millions weigh 55kg.

We can surmise the existence of people who weigh 55kg, without knowing anything about them, or even whether we're correct that they exist. We can surmise that people may live at an address, if a residence exists there, and the area isn't depopulated, without knowing anything about them.

So, I would assert that there are a number of things we can do to sanitize the request you're positing:

  1. If there's no special relationship between the client and the server, the server has no way of knowing that this is the first request that the client has made, and will attach no special significance to it.

  2. If the client performs query minimization (by, for instance, only asking for the tiles associated with a range of addresses, rather than a specific address) we get a fuzzing function which will help in most cases (see the sketch after this list).

  3. If the server is selected randomly from among a set which are identified by query hash, then the client becomes one of many, all of which ask the same question, further reducing their uniqueness.

  4. If a caching proxy is used, and the proxy serves multiple users, and query minimization is used (only cache misses, and the minimum necessary information is requested to complete the answer) then the query can only be associated with the set of users of the proxy, not an individual user, and cached responses will be private. This assumes that the proxy is, itself, trustworthy; a significant caveat.

  5. If a mixmaster is used, there should be no identifying characteristics of the client exposed to the server, and no identifying characteristics of the query exposed to the mixmaster.

  6. If caching is applied at every level, including the client, and data is in timestamped and signed blobs, a client may go months between queries for their most-frequently-used blobs, if there's a reasonable way of assessing whether their presently-cached blob is still current. This is a cryptographic problem beyond my competence, but one for which I believe there's a relatively simple mechanism.

So, if you combine several, or ideally, eventually, most or all of the mechanisms above, the potentially-identifying pieces of data remain relatively compartmentalized, and don't combine to form PII.
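
To make point 2 above concrete, a fuzzing step could be as simple as stripping the house number on the client before anything goes over the wire, and resolving it locally against whatever data comes back. A deliberately naive sketch; the address parsing is purely illustrative:

```ts
// Sketch of client-side query minimization: the server only ever sees
// "Evergreen Terrace, Springfield", never "742 Evergreen Terrace".
// The address format and the regex are assumptions for illustration.

interface MinimizedQuery {
  sentToServer: string;        // the coarse query the server actually receives
  keptOnClient: string | null; // the detail resolved locally against the response
}

function minimizeAddressQuery(raw: string): MinimizedQuery {
  // Split off a leading house number, if there is one.
  const match = raw.match(/^\s*(\d+[a-zA-Z]?)\s+(.+)$/);
  if (!match) {
    return { sentToServer: raw.trim(), keptOnClient: null };
  }
  const [, houseNumber, rest] = match;
  return { sentToServer: rest.trim(), keptOnClient: houseNumber };
}

// Example: the server learns only the street; the client interpolates the
// house number from address ranges in the returned data.
const q = minimizeAddressQuery("742 Evergreen Terrace, Springfield");
// q.sentToServer === "Evergreen Terrace, Springfield"
// q.keptOnClient === "742"
```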

if the clients are sending free-text queries to a server...

Yeah, that would violate query minimization. (And, I guess I assume it goes without saying that TLS is table-stakes, and DANE should be strongly encouraged, and all data should be signed. But assumptions are bad, so I'll just say it.)

I recognize that everything I'm suggesting above is work, and to do well, quite a lot of work. I don't imagine it can be done by one person alone in any usefully-short amount of time. And I'm not in a position to help with the actual coding, since that's not a skill I have. But I think many good things are achieved by people coming together to execute a shared vision of how things could be better, and I think you've started down an exemplary path, so I'm enthusiastic and optimistic that, if you continue to lead by example, others will join in and each do their part to help achieve a larger goal than you could reasonably expect to achieve on your own.

I don't imagine that I can play more than a small part in that, but I've done a bit of protocol design, and a bit of operational deployment, and a bit of network security, so I may be able to help a little bit early in the process, and a little bit late in the process.

@ellenhp
Member Author

ellenhp commented Jun 4, 2022

The problem with maps is that most classes of requests to the server can themselves be PII, or at least vaguely sensitive when correlated together. Cache layers, spreading out queries, and whatnot can all help with the correlation part though, which is huge. I think this all becomes much easier to reason about when the free-text geocoding queries are out of the picture, and when routing can be accomplished client-side (for all the routing modes except transit).

At that point we can start talking more about obfuscation of user behavior through various means. Until then, I'm not sure how many layers of indirection would make me comfortable with an unknown server getting a geocoder request for my home address. A geocoder request for an address is a pretty solid signal that someone will be beginning a trip to that location in the near future. Depending on the address you can sometimes guess who it is, and even when you can't, you know where it is, so it's pretty easy to figure out who by simply going to that location.

The good news though is that I'm making progress getting routing to work offline, and I have a few ideas for how to get geocoding to work offline.

edit: Valhalla is compiling to wasm32, but not yet linking. It is bedtime though. But I'm hopeful that this weekend I'll be able to check off the first box at the top of this issue #50 (comment)

@ellenhp
Member Author

ellenhp commented Jun 4, 2022

@bwoodcock Would it make sense to just use IPFS here and have each headway instance serve as an IPFS gateway for all headway tiles? That way requests for tiles could be split among all available instances somehow. Maybe round-robin (cache misses all day) or maybe something a bit more likely to cause cache hits while still preventing any instance from correlating all of a client's traffic together.

@3nprob
Contributor

3nprob commented Jun 6, 2022

This is a killer feature and what would really take headway to the next level. Even more interesting if implemented in a way that could be reused by other use-cases that aren't necessarily maps.

That being said, it's quite an endeavor and there are several important design decisions (you already noted some compromises that may have to be made wrt privacy). There are also DoS attacks, spam, and other security implications... And how to handle updates to maps and conflict resolution while trying to keep a consistent user experience? Granted, all of these can be solved for.

While p2p would be amazing, a more realistic and pragmatic approach would probably be federation (making headway easy to self-host and share data between instances, but having the client only connect to instances chosen by the user).

I think it comes down to your priorities for this project. If you want to explore exciting things and new territory, and don't want compromises, it could make sense to look at this now.

But if the priority is to make something people actually use, I do have some concerns that tackling this issue now is premature and will make headway join the countless "PoC 99% complete but project not practically useful quite yet, and may never be" "graveyard"... Or endless bikeshedding.

It might make sense to keep discussing this and keep it in the background, so that other design decisions and development (like tool and library choices) stay aligned with making it happen in the future and don't close any doors.

Just my 5c, but I would focus on making self-hosting accessible and further developing the frontend functionality to make it attractive to use before attempting to implement this. Then return to this issue once there are enough users to make a representative evaluation of a solution.

I don't see why IPFS can not be used for this.

@ellenhp
Member Author

ellenhp commented Jun 6, 2022

But if the priority is to make something people actually use, I do have some concerns that tackling this issue now is premature and will make headway join the countless "PoC 99% complete but project not practically useful quite yet, and may never be" "graveyard"... Or endless bikeshedding.

I think this is spot-on unfortunately, but during the HN traffic spike I can't tell you how many times I saw people type in cities outside the Seattle metro area, or worse, addresses. It left me with the impression that a lot of people are going to use other people's self-hosted instances without regard to the information they leak to the operator, and also that what people really seem to want is global data. If nothing else, the privacy problems make me think tackling the offline stuff now instead of later would be a good thing. The architectural questions of federation can be handled later, but I would like to de-risk the offline geocoding and offline routing ASAP.

For the future though, I really like the idea of just using IPFS for storage of the global tiles, and maybe ActivityPub to publish when a server goes online and is available as an IPFS tile gateway. Keeping it simple (relatively speaking) would be good, because there are only a few people working on this project with any regularity right now.

@3nprob
Contributor

3nprob commented Jun 6, 2022

Maybe a good bridge would be to ensure that the frontend and backend are properly decoupled, and make an accessible UI to choose instance? That would help with transparency and allow people to use the same UI for areas hosted by different servers. One of those servers could be local and using IPFS and/or some other distributed protocol.

From the client perspective, a federated or p2p network should look like any other "server".

EDIT: Another thing that could improve the privacy situation: decouple the instance selection into three: [tileserver/search/route]. This way you could have a global trusted instance for search (resource-wise it should be realistic to self-host with modest resources), while relying on external untrusted or less-trusted servers for tiles. One could self-host transit for their home area only. Further down the line when distribution becomes a thing, this could incentivize people to participate in a peer-to-peer network: by running a p2p instance, you get increased privacy from plausible deniability wrt whether queried tiles are for you or on behalf of someone else. This would solve or include solving #17 and #32.
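
Very roughly, from the client's side that decoupling could look like this; the field names are hypothetical, not an existing Headway config:

```ts
// Hypothetical client-side instance configuration, with each concern pointed
// at a separately chosen (and separately trusted) server.
interface InstanceConfig {
  tileserver: string; // untrusted is tolerable: it only sees coarse tile requests
  search: string;     // trusted: it sees free-text queries
  route: string;      // trusted: it sees trip endpoints
  transit?: string;   // optional, self-hosted for the user's home area only
}

const myConfig: InstanceConfig = {
  tileserver: "https://tiles.big-public-instance.example",
  search: "https://headway.my-homeserver.example",
  route: "https://headway.my-homeserver.example",
  transit: "https://otp.my-homeserver.example",
};
```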

@bwoodcock

Would it make sense to just use IPFS here and have each headway instance serve as an IPFS gateway for all headway tiles?

In principle, using an existing protocol that's already been privacy-vetted is always a better plan than inventing something new.

I was familiar with the principle of IPFS, but I don't have any personal experience with it, so I spoke with one of the folks close to the project, and his summary was that in principle it was the right kind of answer to the issue you face; that there might be grant money available to cover the costs of doing an integration; but that we should be aware that performance was not yet significantly better than Amazon Glacier (which they regard as their principal "competitor").

https://aws.amazon.com/s3/storage-classes/glacier/

I got the sense that he wasn't talking about the "milliseconds" version. I'll continue to pursue this and report back.

@3nprob
Contributor

3nprob commented Jun 7, 2022

AIUI IPFS is best viewed as a protocol and transfer network where performance is determined by the nodes involved in the lookup and transfer of a piece of data. E.g. pinning on a close and performant host will be very different than if you're fetching it from across the globe from some throttled shared host.

As such it's not really meaningful to compare to something like Glacier in that way. So I'm surprised to hear them frame it that way.

Also, privacy is a non-goal of IPFS. You'd have to access it over some form of overlay network or proxy to get privacy properties. See here and IPFS docs. That being said, the privacy compromise in IPFS may still be acceptable for the tile transfer use-case. IMO as long as everything works fine over Tor and/or I2P, relying on those to solve for anonymity can be good enough, no?

@ellenhp
Member Author

ellenhp commented Jun 7, 2022

Also, privacy is a non-goal of IPFS. You'd have to access it over some form of overlay network or proxy to get privacy properties. See here and IPFS docs. That being said, the privacy compromise in IPFS may still be acceptable for the tile transfer use-case. IMO as long as everything works fine over Tor and/or I2P, relying on those to solve for anonymity can be good enough, no?

AFAIK this is correct. But, as a client, if I keep it reasonably unpredictable whose gateway I'm using at any given moment, then the IPFS network observing the results of a cache miss at that node doesn't really bother me. An adversary would need to both observe my connection and actively monitor the IPFS network globally for cache misses. Anyone with that in their threat model should already be using Tor or I2P, IMO.

Besides, the most granular data we're serving for basemaps is a z14 tile, and requesting one of those doesn't actually leak that much information if it's in a city, especially compared to free-text queries or just giving the server your route endpoints, both of which we currently do. For rural areas, it's possible that a z14 tile may only have a few residences in it, but I don't know how to solve that other than attempting to pack more data into the z12-13 tiles.

The most granular data we will need to serve for routing will be similarly large. I think they're 0.25 degrees square. For geocoding, it's looking like Mapbox Carmen might not work for our needs (#58) so that's a bit of an open question, but I have some ideas for how to adapt Carmen. Mapbox designed it to be fairly modular so I think that I might be able to borrow from the hard work they've already done.
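
For a sense of the granularity, this is the arithmetic for which cells a coordinate falls in; the slippy-map math is standard, and the flat 0.25° grid is just a stand-in based on the numbers above, not the exact routing-tile scheme:

```ts
// Sketch: what a server learns from a tile request is "somewhere inside this
// cell", so the cell size is the privacy floor for ad-hoc requests.

// Standard slippy-map (Web Mercator) tile indices at zoom z.
function slippyTile(lat: number, lon: number, z: number): { x: number; y: number } {
  const n = 2 ** z;
  const x = Math.floor(((lon + 180) / 360) * n);
  const latRad = (lat * Math.PI) / 180;
  const y = Math.floor(
    ((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2) * n
  );
  return { x, y };
}

// A flat 0.25° x 0.25° grid, as a stand-in for the routing-tile layout.
function routingTile(lat: number, lon: number): { row: number; col: number } {
  return { row: Math.floor((lat + 90) / 0.25), col: Math.floor((lon + 180) / 0.25) };
}

// Downtown Seattle (47.6062, -122.3321): a z14 basemap tile is a few city
// blocks across, while a 0.25° routing tile spans roughly 28 km north-south.
console.log(slippyTile(47.6062, -122.3321, 14)); // { x: 2624, y: 5722 }
console.log(routingTile(47.6062, -122.3321));    // { row: 550, col: 230 }
```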

@ellenhp
Member Author

ellenhp commented Aug 23, 2022

In addition to my earlier conclusions about Mapbox Carmen being difficult to port to the browser, they've chosen to completely delete the carmen repository, along with carmen-core. I have a few forks around but I'm going to take that as a sign that I shouldn't use it and move on.

Given some of the recent developments with maps.earth I think I want to shift the focus of this issue towards the one part of the stack that can't be hosted for the whole planet: transit.

OpenTripPlanner just doesn't scale like that. I think it would be really cool to include transit on maps.earth by allowing third parties to host their own OTP instances anywhere on the globe that they want. Headway could then show those instances as options to users who want transit directions for a specific area. There are still some privacy implications but it's mostly a problem of messaging, and I think messaging is something we can figure out. It doesn't require writing a geocoder from scratch, at least :)
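
Roughly, I'm imagining something like a registry of third-party OTP instances with the area each one covers, and only offering transit directions when some instance covers both endpoints. The registry format and field names below are made up for illustration:

```ts
// Sketch: pick a third-party OpenTripPlanner instance that covers both trip
// endpoints. The registry format is an assumption, not an existing API.
interface TransitInstance {
  name: string;
  otpUrl: string;
  bbox: { minLat: number; minLon: number; maxLat: number; maxLon: number };
}

function covers(b: TransitInstance["bbox"], lat: number, lon: number): boolean {
  return lat >= b.minLat && lat <= b.maxLat && lon >= b.minLon && lon <= b.maxLon;
}

function findTransitInstance(
  registry: TransitInstance[],
  from: { lat: number; lon: number },
  to: { lat: number; lon: number }
): TransitInstance | undefined {
  // The UI would tell the user which operator is about to receive their endpoints.
  return registry.find(
    (i) => covers(i.bbox, from.lat, from.lon) && covers(i.bbox, to.lat, to.lon)
  );
}
```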

@ghobs91

ghobs91 commented Aug 23, 2022

A lot of interesting insight into the "p2p/federated maps" concept here, especially the privacy implications!

Having said that, I've just come across a project called Peermaps that seems to take one of the approaches mentioned here, that of distributing map tile data between users over IPFS. The webapp is testable (in an alpha state), if anyone's curious.

@ellenhp
Member Author

ellenhp commented Aug 27, 2022

A lot of interesting insight into the "p2p/federated maps" concept here, especially the privacy implications!

Having said that, I've just come across a project called Peermaps that seems to take one of the approaches mentioned here, that of distributing map tile data between users over IPFS. The webapp is testable (in an alpha state), if anyone's curious.

I'm extremely curious how they're going to handle geocoding, and curious as to why they didn't implement p2p basemap rendering as a plugin to maplibre-gl-js (or mapbox-gl-js, if this all happened before Mapbox made everything proprietary). Looks super interesting though. I guess they're working on a p2p spatial index. Geocoding is much more complex than just spatial indexing, though; it has me stumped, but if they get something working I'd be interested to potentially try and integrate it, because I hate the idea of people sending me free-text data :)
