What does an MVP for an OA geocoder look like? #12
Comments
This is a great idea, and thank you for pushing it forward. For the data ingestion, we’ve talked internally about what an ElasticSearch data prep process would look like for small extracts of OA data. @orangejulius or @dianashk are most up-to-date on this topic. For the machine images, I think it would make sense to tighten the list and support a smaller range of possibilities. When an offered format stops working, it’s a debugging and support bummer for us. I think Heroku is an interesting direction, and I’ve built “app builders” before that anyone with an account should be able to point-and-click their way through. High effort, maximum reach, strong dependency on a single vendor. I think that Docker or Vagrant approaches are a weak compromise: easy for the kinds of nerds who don’t need it, but still too difficult for mortals. AMI is up there somewhere, and could be scripted using Amazon’s API and a builder-style approach with some effort. I have a weak bias toward a trash-and-replace model for the data updates. If it’s easy to set one of these up, it should be easy to use rapid replacement instead of updating.
I like the idea! Could you say more about who might use it? I'm trying to figure out who isn't served by just using a public geocoder, either free or paid.
This is a very cool use of Pelias, so we're excited to see it come to fruition... even if we don't have the bandwidth to do it ourselves. Hooray for open-source! As it stands today, Pelias is already set up to ingest all or any subset of OA data that you point it at. Setting this up isn't elegant at the moment, and this is where the majority of the work needs to be done. We're working on something to make it a bit simpler to install and build the whole system. Users would still need to install Elasticsearch on their own. So this effectively covers step 1. I personally like the idea of supporting something simple and accessible, like Heroku, for the first attempt at a builder. If that is all successful, we can always branch out to support other platforms. But no need to rush there. As for automated updates, we can set it up to rebuild on a schedule, like we currently do with our hosted Mapzen Search instance of Pelias. We rebuild weekly, because we do the whole world and it takes a few days. But with a small dataset you can rebuild daily or even hourly to keep the data fresh. We don't currently support real-time updates, so getting that implemented would require some significant effort.
There are no free, public geocoders that aren't license-restricted (e.g., Google) or query-restricted (e.g., TAMU). So there's a big obstacle for a lot of people. Paid geocoders have a price tag that's a real burden on good work. (For example, I wanted to geocode every business in Virginia, as a public service. That was going to cost $1,200. Nope. Turned out, Virginia has a geocoder that is open to the world, and I used that, which took care of the ~75% of addresses that are within Virginia.) The next obstacle is speed. Making a call to a remote API takes time. Making a million calls to a remote API takes a million times longer. Being able to run a geocoder locally is vastly faster. I appreciate that, from your perspective, geocoding seems like a highly-available service. But that's true for vanishingly few people.
Would you please explain this process further? If I wanted to stand up a Pelias instance for the greater Charlottesville, VA area, what steps would that entail?
Thanks @waldoj! My apologies, I didn't mean to question whether an install-your-own geocoder was a good idea; I was just trying to understand who its users might be. I think you've identified three reasons for running your own: more useful than a free service, cheaper than a paid service, and faster if you run it locally. In all cases it's a user who is motivated to do a bit more work to get things running for themselves rather than just paying a service provider. That last point, faster if you run locally, argues for a self-hosted option: i.e., not Heroku or EC2 or some other remote server, but something you can run on your local network as well. One old-school proposal for a deliverable: an Ubuntu PPA, so that installing is a single apt command. Another old-school proposal is just good documentation: work with Pelias to make it really easy for someone who knows some command line to install it, then write those download + import scripts. That requires more work on the part of the user than Ubuntu packages, but is (in theory) usable in many Unix environments. For modern new stuff everyone seems to love Docker. A Docker container that just served geocoding data would be pretty neat. I agree with @migurski that it's more realistic to only support one or a small set of possibilities.
I'm not yet convinced about the self-hosted option. I know that Mapnik has a ton of experience with Ubuntu releases and later with a PPA, so I'd like to see if @springmeyer has any wisdom or advice to share.
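For a concrete sketch of what “pointing it at a subset” looks like: the importer is driven by a pelias.json file, and limiting it to one OA extract would be roughly this (the datapath and file name here are made up for illustration):

```json
{
  "imports": {
    "openaddresses": {
      "datapath": "/var/data/openaddresses",
      "files": ["us/va/city_of_charlottesville.csv"]
    }
  }
}
```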
That's too bad, because you should. I often convince myself that my terrible ideas are brilliant! :)
I don't think self-hosted is the only use case, I just think it's a good one. But I am persuaded that it's worth favoring deployment methods that work locally ahead of those that only work remotely. Docker works well for both: you can run it locally, or you can deploy it to AWS/Heroku/DigitalOcean. Seems like the way to start!
I'll research the PPA path. For various reasons I'm really bullish on that and not on Docker these days, mostly due to some experience with Docker oddities biting me.
We've been talking about npm-ifying pelias so that you can install it with a single npm command. Then our efforts would be in building a really lovely configuration & build wizard to help folks pick the datasets/regions they're most interested in.
Yeah, that is my thinking as well.
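Purely as a sketch of the goal (none of these commands exist yet; the package name and subcommands are hypothetical):

```shell
npm install -g pelias      # hypothetical package name
pelias configure           # wizard to pick datasets/regions
pelias import && pelias start
```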
Oops, did not mean to hit the mic drop button. |
I have done a bit of work on getting .debs and PPAs set up. I’ve successfully installed a package of my own from a non-PPA URL added to my apt sources, mostly cross-referencing suggestions from a few how-to articles. My goal is to get to approximately where Dane and @rcoup succeeded with https://launchpad.net/~mapnik
After a bit of back-and-forth with a helpful Ubuntu Launchpad person, I’ve gotten… someplace.
It’s a surprisingly fiddly process, but I’m liking the progress; it feels like a thing that’s possible to understand.
Do you have a feeling for whether a Debian/Ubuntu package is a reasonable deliverable? I threw that out there as an idea, but I'm not confident it's the right thing.
I don’t have a feeling for it yet. I believe this is a one-time pain, and so far it’s been about the same level of b.s. as I’ve experienced with Docker and Vagrant. It still looks worthwhile.
I've really dived into Docker in the past week, and I feel good about using a .deb as a deliverable. That's a single line in a Dockerfile, and of course just as easy to use outside of Docker. I like it.
I guess the question is requiring Ubuntu: is that OK for our target users? I think it's the best guess among the Linux distros, but I see a lot of CentOS/Red Hat variants in use too.
I’m not as familiar with the Red Hat environment, so I wonder whether it’s possible or advisable to skip the PPA route and self-host .deb files and RPMs in one place? Having spent some time with PPAs, it’s attractive to just put a .deb at a URL someplace and be done with it. I haven't yet successfully installed my test package at https://launchpad.net/~migurski/+archive/ubuntu/hello.
PPAs offer a lot of advantages, though. The drawback of supporting RPMs too isn't so much building the RPM; it's sorting out the operating-system compatibilities, library versions, etc. That's why I suggested just supporting Ubuntu LTS 16.04: the M in MVP.
It's fine for Docker, at least, because I don't think many people care which distro their Docker instance runs. Personally, I look forward to the problem of people saying "gosh, I'd love to use this, but I use CentOS." That seems like a bridge worth crossing when we come to it. :)
Spoke with Nelson offline, and he offered to help with two things I'm stuck on, starting with PPAs with multiple owners (since we’ll likely want one under a shared name).
Followup to the last note:
I got Pelias API published and installed to my PPA sandbox.
Based on my tests, I think this would be the bones of an installation process for Ubuntu 16.04, and it ought to work manually or in a container-type context:
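Roughly, with repository and package names matching my sandbox (these may well change, so treat this as a sketch):

```shell
# Third-party repositories: Pelias sandbox PPA, Oracle Java PPA, Elasticsearch apt repo
sudo apt-add-repository -y ppa:migurski/hello
sudo apt-add-repository -y ppa:webupd8team/java
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" \
  | sudo tee /etc/apt/sources.list.d/elasticsearch.list
sudo apt-get update

# Java, Elasticsearch, and the Pelias packages themselves
sudo apt-get install -y oracle-java8-installer elasticsearch pelias-api
```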
That's a pretty straightforward set of instructions! Shame it's all third-party repos, but perhaps that's unavoidable.
Yeah. ElasticSearch suggests that OpenJDK might work, but @baldur reports having seen problems using it with ES; only Oracle’s is officially supported. Getting https://github.com/pelias/schema and sample data in there is a next step. A possible Dockerfile:
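Essentially the same steps as above, wrapped up (again, PPA and package names are from my sandbox, so this is a sketch rather than a final recipe):

```dockerfile
FROM ubuntu:16.04

# Tools needed for adding apt repositories
RUN apt-get update && apt-get install -y software-properties-common wget

# Third-party repos: Pelias sandbox PPA, Oracle Java PPA, Elasticsearch
RUN apt-add-repository -y ppa:migurski/hello && \
    apt-add-repository -y ppa:webupd8team/java && \
    wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add - && \
    echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" \
      > /etc/apt/sources.list.d/elasticsearch.list

RUN apt-get update && apt-get install -y oracle-java8-installer elasticsearch pelias-api
```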
I was wondering if this was complicated enough that it should be encapsulated in a script, or a Dockerfile, or an image. The nice thing is that the Ubuntu packaging makes that script simpler too, so it's worth the effort.
I had to add, after the first line:
I know that…
It finally died with this:
I'm not sure why (other than Java ¯\_(ツ)_/¯), but I'll see if I can figure out what's up.
Damn, I bet that's the part where it asks for a license click-through.
I would be curious to learn more about why Oracle’s Java is necessary for ES. Maybe for smaller uses, it’d be sufficient to use the open JRE?
Here is a product matrix of which JVMs work with which Elasticsearch versions. I don't see the open JRE on there, but I know very little about Java, so that may or may not mean anything.
The OpenJDK in 16.04 says this about itself:
So it’s using IcedTea. I believe Java 8 is internally 1.8, so it also matches the supported 1.7.0_55+ version number. Waldo, what happens if you replace the Oracle Java package with OpenJDK?
Simpler possible Dockerfile:
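Something like this, with OpenJDK standing in for Oracle’s Java (the openjdk-8-jre-headless package name is my assumption, as is the rest of the repo setup):

```dockerfile
FROM ubuntu:16.04
RUN apt-get update && \
    apt-get install -y software-properties-common wget openjdk-8-jre-headless
RUN apt-add-repository -y ppa:migurski/hello && \
    wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add - && \
    echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" \
      > /etc/apt/sources.list.d/elasticsearch.list
RUN apt-get update && apt-get install -y elasticsearch pelias-api
```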
Waldo, for me it was not necessary to install…
Running that Dockerfile yields this:
I needed to add these to get this to run:
When I did that, this was the outcome:
I guess the 16.04 Docker image is much slimmer than the server distribution, which I suppose makes sense. So:
Trying to create the Pelias index failed for me with this message:
@orangejulius pointed out that Pelias wants ElasticSearch 1.7, so the process should look like this with 1.7 instead of 2.x:
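In apt terms, assuming Elastic’s Debian repositories, that swap is just pointing at the 1.7 line instead of 2.x and reinstalling:

```shell
# Replace the 2.x apt source with the 1.7 one, then reinstall Elasticsearch
echo "deb http://packages.elastic.co/elasticsearch/1.7/debian stable main" \
  | sudo tee /etc/apt/sources.list.d/elasticsearch.list
sudo apt-get update && sudo apt-get install -y elasticsearch
```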
That works:
Aw yeah, getting some results from a single-county import: http://dpaste.com/0612XXJ
!
Mazel!
Expands on sample scripts in openaddresses/openaddresses-ops#12
This is basically current: https://github.com/openaddresses/pelias-ubuntu-xenial#readme. There’s still some documentation to do around database setup, address import, and why @#$% elasticsearch doesn’t want to start on boot. Also, Amazon are taking their time making an Ubuntu 16.04 image available and there’s not yet a supported upgrade path, so maybe we should build these for 14.04 as well?
Progress report: I’ve run the setup above on a few machines, and I’m slowly working through the foibles of ElasticSearch. It’s pretty greedy for RAM; even running the import on a 4GB machine had troubles, and @missinglink suggests 8GB. I still don’t have an idea on getting it to start at boot. I did build Ubuntu 14.04 versions of all the packages, though. This is getting close to blog-post or tutorial state, though I still think there are going to be some bad ops surprises for users.
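One knob worth noting for the RAM appetite: on the Debian/Ubuntu packages for Elasticsearch 1.x, the JVM heap is set in /etc/default/elasticsearch, and capping it explicitly helps on small machines (the 2g value is just an example for a 4GB box):

```shell
# /etc/default/elasticsearch: give ES roughly half the machine's RAM
ES_HEAP_SIZE=2g
```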
I blogged the process for getting this set up, here: http://mike.teczno.com/notes/openaddr/5min-geocoder.html |
That's amazing, @migurski.
I… think it’s possible to close this issue? |
I agree. It might be good to find a place to put your blog post in our repo.
Good call, I’ll do that.
Added a link to the bottom of the post, http://mike.teczno.com/notes/openaddr/5min-geocoder.html.
👍
I envision a bespoke Pelias instance creator, where somebody can indicate what physical area they're interested in, and get a geocoder preloaded with that data from OpenAddresses. I think these are the basic components of that:
1. A data-ingestion process that extracts a chosen subset of OpenAddresses data and loads it into the geocoder.
2. Machine images (or comparable packaging) that let people deploy that geocoder easily.
3. Automated updates, so the geocoder stays in sync with the published data.
The idea is to close the loop on the publication and consumption of address data. Right now, governments publish address data, which we aggregate within OpenAddresses, and the private sector uses the address data published on OpenAddresses. That fails to provide incentives for governments to continue to publish that data. (This is unrelated to those governments who publish address data via ArcGIS, in which case we're getting the data where they happen to store it; they already have existing, internal incentives.) This model will let governments run local geocoders (much faster than an API) powered by their own data, geocoders that improve as they improve their own data and that are only updated as often as they update their public data. This creates a better incentive for them to publish that data.
I propose that the MVP for this consists of step 1 in the above list. The 2 subsequent steps depend on step 1, so it can't be either of those. And step 1, on its own, is useful—people can use that as-is, or build atop it.
What's the consensus here? Is this a good MVP? Are the subsequent steps the correct ones? Bonus questions: Do existing project volunteers have the capacity to make step 1 happen, or is this something that should be bid out? (Is it even plausible to bid this out?)