Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDFS Incremental Layer Updater #1583

Closed
echeipesh opened this issue Jul 11, 2016 · 1 comment
Closed

HDFS Incremental Layer Updater #1583

echeipesh opened this issue Jul 11, 2016 · 1 comment
Milestone

Comments

@echeipesh
Copy link
Contributor

Current implementation of HDFS layer updater punts on the merging issue and loads all of the currently stored records to merge in the update before writing it all out again. This makes it prohibitive to maintain a large layer with small incremental updates and scoped queries.

We can actually implement an incremental HDFS layer update by leveraging recent improvements: #1556

When we receive an update we also get metadata M which contains KeyBounds[K] of the update. We can decompose these bounds into sequence of SFC ranges using the key index which is already associated with the layer: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/io/index/KeyIndex.scala#L9

By comparing it with the ranges associated with already written files we know which map files specifically intersect with the update: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/io/hadoop/formats/FilterMapFileInputFormat.scala#L21

At this point the update needs to be partitioned and sorted as it would be when initially written: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/io/hadoop/HadoopRDDWriter.scala#L79

Once this is done we can close over the list of affected files and their ranges to .mapPartitions over the update set. Because we know that map files are sorted we can avoid reading them into an RDD but instead perform an in-sync traversal of the update partition and any map files that intersect it. The merge function can be applied during this traversal. By keeping the update as in-sync traversal of iterators we can keep the memory pressure of this operation about constant.

WARNING: performing this process will create map files which intersect files already stored, this violates the assumptions and may break layer reading while the update is in process (maybe). While unlikely it is also possible to have file name conflict when creating merged files.

If thats the case the files to be merged need to be renamed before being read, rendering them unavailable to queries while the update is in progress. The alternative would result in duplicate conflicting records potentially being returned when the update is in progress. This is a distributed write so there is no perfect solution, but care should be taken here to achieve the best possible behavior which is probably a rolling removal and update of files as the writes happen.

As the existing files are merged and closed they need to be deleted.

Note:
Current implementation assumes that that a MapFile ranges from the index of its first record until the first index of the next closest MapFile. This can obviously drastically over-estimate the range of an individual file and in fact pull in a file into merge that does not in reality intersect with the update. My hunch is that the number of such files would be directly proportional to the number of records being updated and would "average out" to be OK. A side and significant benefit of this assumption is that it will result in merging of tail MapFiles, those files that were written at the tail end of the partition and only contain a few records. This should keep MapFile fragmentation in check as the layer is continually updated. However, all of this is basically a reasoned hunch and could use some experimentation and measurement when implementing this feature.

@echeipesh echeipesh added this to the Future milestone Jul 11, 2016
@echeipesh echeipesh changed the title HDFS Layer Updater HDFS Incremental Layer Updater Jul 11, 2016
@lossyrob lossyrob modified the milestones: 1.1, Future Oct 18, 2016
@lossyrob lossyrob removed IO labels Oct 20, 2016
@lossyrob lossyrob modified the milestones: 1.2, 1.1 Mar 12, 2017
@echeipesh
Copy link
Contributor Author

Resolved by: #2384

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants