processing lidar data on Pangeo, uploading data and installing third-party binaries for processing #463

Closed
d-diaz opened this issue Nov 5, 2018 · 11 comments

Comments

@d-diaz commented Nov 5, 2018

I'm interested in trying to process lidar acquisitions (hundreds or thousands of point-cloud tiles) on Pangeo, but wanted to check whether this is possible with the current setup.

Right now, all the lidar data are on local storage or could be accessed from an FTP server. Would it be better to transfer them to a cloud storage service so Pangeo can access the data more easily? Or could we start by uploading from local storage or downloading to Pangeo from FTP?

I have developed Python wrappers for command-line tools that process lidar data (LAStools, FUSION) and run them in a Dask-Distributed pipeline, roughly following the pattern sketched below. Is it possible to install additional executables and Python packages on Pangeo, or would these kinds of additions require deploying our own Pangeo instance?
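
Roughly, the pattern looks like this - just a minimal sketch, not the actual pipeline; the tool name, flags, and paths are placeholders:

```python
# Minimal sketch of wrapping a command-line lidar tool with subprocess and
# fanning it out over tiles with dask.distributed. The tool name, flags, and
# paths are placeholders, not the real pipeline.
import glob
import subprocess
from dask.distributed import Client


def run_cli_tool(tile_path):
    """Run a hypothetical command-line tool on a single point-cloud tile."""
    out_path = tile_path.replace(".laz", "_processed.laz")
    subprocess.run(["some_lidar_tool", "-i", tile_path, "-o", out_path], check=True)
    return out_path


client = Client()                  # local cluster, or point at a remote scheduler
tiles = glob.glob("tiles/*.laz")   # placeholder location of the input tiles
futures = client.map(run_cli_tool, tiles)
outputs = client.gather(futures)   # paths of the processed tiles
```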

Thanks for any advice or guidance you can offer.

@jhamman (Member) commented Nov 5, 2018

@d-diaz - thanks for opening up this issue. I'm curious what format your data is currently stored in - netCDF/HDF or something else? Generally, streaming data out from an FTP server is going to be a challenge, and without knowing more, I'd say it's going to work much better coming out of cloud storage. How large is your dataset? We could probably help you put some on GCS to test your workflow.
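
For what it's worth, once the data are on GCS they can be read directly from Python with gcsfs - a rough sketch, with placeholder bucket and object names:

```python
# Rough sketch: read bytes straight from a GCS object with gcsfs.
# The bucket and object names are placeholders.
import gcsfs

fs = gcsfs.GCSFileSystem(token="anon")  # anonymous access to a public bucket
with fs.open("some-bucket/lidar/tile_0001.laz", "rb") as f:
    header_bytes = f.read(512)  # e.g., peek at the file header
```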

Customizing your processing environment is something we've been working on quite a bit. For now, the solution you can use to get started today with Pangeo is http://binder.pangeo.io. If you are interested in taking this further, we can think about setting up a jupyterhub with your desired environment.

@rabernat (Member) commented Nov 5, 2018

Perhaps we should include a link to the cookie cutter repo on the main binder landing page:
https://github.com/pangeo-data/cookiecutter-pangeo-binder

I have found the cookie cutter to be very useful for developing examples.

@martindurant (Contributor)

An FTP file system backend is probably quite doable. For example, https://github.com/martindurant/filesystem_spec/pull/20 in fsspec does FTP over SSH (SFTP); fsspec is not yet a released project, but my plan is that it should replace the bytes/file-system functionality in Dask, and therefore allow direct access to all sorts of stores. Raw FTP in Python seems less capable, but it does allow reading a file from some byte offset, if the server supports it.
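
For illustration only, a byte-offset read over plain FTP with ftplib looks roughly like this (host and path are placeholders, and the server must support the REST command):

```python
# Rough sketch: read a file from a byte offset over plain FTP using ftplib's
# REST support. Host and path are placeholders; the server must honour REST.
from ftplib import FTP

chunks = []
ftp = FTP("ftp.example.com")
ftp.login()  # anonymous login
# Read from byte 1024 to the end of the file, collecting blocks as they arrive.
ftp.retrbinary("RETR /pub/lidar/tile_0001.laz", chunks.append, rest=1024)
ftp.quit()
data = b"".join(chunks)
```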

Of course, putting the data on GCS/S3 is a totally fine solution.

@adamsteer

I'm keen to see where this goes! As a non-Pangeo user (but longtime GitHub lurker), @jhamman - does Pangeo more or less require netCDF/HDF as a data store?

It looks like @d-diaz is reading LAS/Z tiles into numpy arrays (or Xarray) already - which would take care of passing data around internally. So I think my question is - would a file store require an internal index (without an LAX sidecar file, if you're using LAStools)?

LAS/Z files don’t have one, and the data are generally ungridded.

@d-diaz (Author) commented Nov 5, 2018

@jhamman, the lidar data are usually in LAS (or compressed LAZ) format, which has a standard specification promulgated by the American Society for Photogrammetry and Remote Sensing (ASPRS). To get the data, I'm currently just using wget on a local server and pointing it at the FTP or HTTP address from the various lidar data providers.

The other issue that I'd like help navigating is that I'm using LAStools command-line tools for the lidar processing. These executables can be downloaded from the web, but they are Windows binaries (so Wine needs to be installed to run them on a Linux server). In addition, LAStools enforces some constraints on the command-line tools unless a license is provided. I have a license (text file), but would need to upload it to Pangeo as well.

@adamsteer, the pipeline as I currently have it set up creates and works on flat files (*.laz, *.lax, *.lay, *.asc, *.tif) produced by LAStools commands and other wrapper functions I've developed. These files can be cleaned up and downloaded as the pipeline runs or after it finishes executing. I haven't yet migrated to something like PDAL pipelines, which have more direct Python integration and interact with point data as numpy arrays. Regardless of the software, there are almost always some files or a database that need to be created and populated to hold spatial index information about the point cloud files.
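
For reference, the PDAL route would look roughly like this - a sketch only, with a placeholder file name, assuming the pdal Python bindings are installed:

```python
# Rough sketch of a PDAL pipeline that reads a LAZ tile into a numpy structured
# array. The file name is a placeholder; requires the pdal Python bindings.
import json
import pdal

pipeline_json = json.dumps({
    "pipeline": [
        "tiles/tile_0001.laz",  # reader inferred from the file extension
        {"type": "filters.range", "limits": "Classification[2:2]"},  # e.g., keep ground returns
    ]
})
pipeline = pdal.Pipeline(pipeline_json)
pipeline.execute()
points = pipeline.arrays[0]  # structured array with X, Y, Z, intensity, etc.
```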

@martindurant (Contributor)

Similarly, an HTTP resource can be viewed as a file-like object, if you have Python code that expects to read data from such a file-like object. If such code doesn't exist, and you only have the spec and/or a compiled implementation, then you would have to do some work - but I would care to bet that the format is simple enough.
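
As a rough illustration (the URL is a placeholder, and the server must honour Range requests):

```python
# Rough sketch: fetch a byte range of a remote file over HTTP and wrap it in a
# file-like object. The URL is a placeholder; the server must support Range.
import io
import requests

url = "https://example.com/lidar/tile_0001.laz"
resp = requests.get(url, headers={"Range": "bytes=0-1023"})
resp.raise_for_status()
fileobj = io.BytesIO(resp.content)  # file-like view of the first 1 KB
```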

@d-diaz (Author) commented Nov 11, 2018

It looks like the Pangeo binder @rabernat mentioned could resolve the issue of installing additional Python packages (e.g., PDAL) or downloading other software (e.g., LAStools, Wine) via the Docker container setup.

Does anybody know how you could provide a license text file for this without publishing your license? I'd imagine you could hack around it, once the computing environment is set up in the cloud, by creating the text file from your notebook or command line and writing the license info into it.

@jhamman (Member) commented Nov 14, 2018

@d-diaz - I'm not entirely sure I follow the license issue. Most packages ship with their own license. Most binders ship with a license as well (e.g. https://github.com/rabernat/pangeo_ecco_examples). So provided you can install the software you need to use, you can license the binder however suits your needs.


I think a good way for you to get started would be to package up a small bit of lidar data in a GitHub repo and build a binder repository. That will help you sort out your questions about installing packages and licensing. From there, we can arrange how to get some larger lidar data up on GCS.

@d-diaz (Author) commented Nov 14, 2018

Sorry, by license I meant the license key required to run a proprietary software package like LAStools. I was asking how to securely provide a private license key.

@jhamman (Member) commented Nov 28, 2018

@d-diaz - sorry for the slow turnaround here. I don't know how you would do this. We are pretty focused on free/open-source Python tools here, so this just hasn't come up yet. Not to deter you from looking into it - just to say I don't have an answer to your question.

@d-diaz (Author) commented Dec 12, 2018

After mulling this over, I think it will be similar to requiring the user to enter an API key in the notebook: I will just need to take the license key and write it to an underlying text file that LAStools looks for.
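
Something along these lines - just a sketch, and the file name is a placeholder (I'd need to check the LAStools docs for the exact file it expects):

```python
# Rough sketch: prompt for the license key at runtime rather than committing it
# to the repo, then write it where the software expects to find it.
# The file name below is a placeholder, not the actual LAStools location.
import getpass
from pathlib import Path

license_key = getpass.getpass("Enter LAStools license key: ")
Path("lastools_license.txt").write_text(license_key)
```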

Assuming this will work, and acknowledging that it seems like I'll need to "dockerize" a Pangeo binder repo from the cookiecutter template to get these non-Python packages installed, I am closing this issue.

@d-diaz closed this as completed on Dec 12, 2018