The current implementation of the HDFS layer updater punts on merging: it loads all of the currently stored records, merges in the update, and writes everything back out again. This makes it prohibitively expensive to maintain a large layer with small incremental updates and scoped queries.
We can actually implement an incremental HDFS layer update by leveraging recent improvements: #1556
When we receive an update we also get metadata `M`, which contains the `KeyBounds[K]` of the update. We can decompose these bounds into a sequence of SFC ranges using the key index already associated with the layer: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/io/index/KeyIndex.scala#L9

By comparing these ranges with the ranges associated with already-written files we know which map files specifically intersect the update: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/io/hadoop/formats/FilterMapFileInputFormat.scala#L21
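For concreteness, a minimal sketch of that intersection step, assuming the `Long`-based `indexRanges(keyRange: (K, K)): Seq[(Long, Long)]` signature from the linked `KeyIndex`; `affectedFiles` and the `fileRanges` listing of `(file, minIndex, maxIndex)` are illustrative names, with the ranges derived the way `FilterMapFileInputFormat` derives them:

```scala
import geotrellis.spark.KeyBounds
import geotrellis.spark.io.index.KeyIndex
import org.apache.hadoop.fs.Path

// Hypothetical helper: select the MapFiles whose assumed index range
// overlaps any SFC range covered by the update.
def affectedFiles[K](
  updateBounds: KeyBounds[K],
  keyIndex: KeyIndex[K],
  fileRanges: Seq[(Path, Long, Long)] // (mapFile, minIndex, maxIndex)
): Seq[Path] = {
  // decompose the update's KeyBounds into SFC index ranges
  val updateRanges = keyIndex.indexRanges(updateBounds.minKey -> updateBounds.maxKey)
  fileRanges.collect {
    // standard interval-overlap test between the file range and update ranges
    case (file, min, max) if updateRanges.exists { case (lo, hi) => lo <= max && min <= hi } =>
      file
  }
}
```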
At this point the update needs to be partitioned and sorted as it would be when initially written: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/io/hadoop/HadoopRDDWriter.scala#L79
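A minimal sketch of that partition-and-sort step, assuming the update is already keyed by SFC index; this mirrors, rather than reuses, the writer's behavior:

```scala
import scala.reflect.ClassTag
import org.apache.spark.RangePartitioner
import org.apache.spark.rdd.RDD

// Hypothetical helper: one shuffle, after which every partition is
// sorted by index, matching the layout of an initial write.
def partitionAndSort[V: ClassTag](
  update: RDD[(Long, V)], // records keyed by SFC index
  numPartitions: Int
): RDD[(Long, V)] = {
  val partitioner = new RangePartitioner(numPartitions, update)
  update.repartitionAndSortWithinPartitions(partitioner)
}
```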
Once this is done we can close over the list of affected files and their ranges in a `.mapPartitions` call over the update set. Because we know that map files are sorted, we can avoid reading them into an RDD and instead perform an in-sync traversal of the update partition and any map files that intersect it, applying the merge function during that traversal. By keeping the update an in-sync traversal of iterators we can keep the memory pressure of this operation roughly constant (a sketch follows the warning below).

WARNING: performing this process will create map files which intersect files already stored. This violates the layer's assumptions and may break layer reading while the update is in process (maybe). While unlikely, it is also possible to have a file name conflict when creating merged files.
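Here is a minimal sketch of that in-sync traversal; `mergeSorted`, `mergeValues`, and the iterator shapes are illustrative, not the actual updater API:

```scala
// Merge two iterators that are each sorted by SFC index, applying the
// layer's merge function whenever the same index appears on both sides.
def mergeSorted[V](
  existing: BufferedIterator[(Long, V)], // records from intersecting MapFiles
  update: BufferedIterator[(Long, V)],   // records from the update partition
  mergeValues: (V, V) => V
): Iterator[(Long, V)] = new Iterator[(Long, V)] {
  def hasNext: Boolean = existing.hasNext || update.hasNext
  def next(): (Long, V) =
    if (!existing.hasNext) update.next()
    else if (!update.hasNext) existing.next()
    else {
      val eIdx = existing.head._1
      val uIdx = update.head._1
      if (eIdx < uIdx) existing.next()
      else if (eIdx > uIdx) update.next()
      else {
        // index collision: consume both sides and merge the values
        val (index, oldValue) = existing.next()
        val (_, newValue)     = update.next()
        (index, mergeValues(oldValue, newValue))
      }
    }
}
```

Only the head of each iterator is held in memory at any point, which is what keeps the memory pressure constant regardless of layer size.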
If that is a concern, the files to be merged need to be renamed before being read, rendering them unavailable to queries while the update is in progress. The alternative is that duplicate, conflicting records could be returned while the update is in progress. This is a distributed write, so there is no perfect solution, but care should be taken here to achieve the best possible behavior, which is probably a rolling removal and update of files as the writes happen.
As the existing files are merged and closed, they need to be deleted.
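A minimal sketch of that rename/merge/delete lifecycle for a single file, assuming plain Hadoop `FileSystem` operations; `withStagedMapFile` and the staging name are illustrative. Note that a MapFile is a directory (data plus index), so the rename and the recursive delete act on the whole directory:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: hide a MapFile from readers before merging it,
// and delete the old data once the merged replacement has been written.
def withStagedMapFile[T](conf: Configuration, mapFile: Path)(merge: Path => T): T = {
  val fs = FileSystem.get(conf)
  val staged = new Path(mapFile.getParent, s".merging-${mapFile.getName}")
  fs.rename(mapFile, staged)      // readers can no longer see the file
  try merge(staged)               // read the staged copy, write the merged replacement
  finally fs.delete(staged, true) // remove the old data once the merge is done
}
```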
Note:
The current implementation assumes that a MapFile ranges from the index of its first record to the first index of the next closest MapFile. This can obviously drastically over-estimate the range of an individual file and in fact pull a file into the merge that does not in reality intersect the update. My hunch is that the number of such files would be directly proportional to the number of records being updated and would "average out" to be OK. A significant side benefit of this assumption is that it will result in merging of tail MapFiles, those files that were written at the tail end of the partition and only contain a few records. This should keep MapFile fragmentation in check as the layer is continually updated. However, all of this is basically a reasoned hunch and could use some experimentation and measurement when implementing this feature.
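A minimal sketch of that range assumption, with illustrative names; this would produce the `(file, minIndex, maxIndex)` listing used for the intersection sketch above:

```scala
import org.apache.hadoop.fs.Path

// Given each MapFile tagged with the index of its first record, file i
// is assumed to cover [start(i), start(i+1) - 1]; the last file is
// treated as unbounded above.
def assumedRanges(firstIndexes: Seq[(Path, Long)]): Seq[(Path, Long, Long)] = {
  val sorted = firstIndexes.sortBy { case (_, start) => start }
  val nextStarts = sorted.drop(1).map { case (_, start) => start } :+ Long.MaxValue
  sorted.zip(nextStarts).map { case ((file, start), nextStart) =>
    (file, start, nextStart - 1) // inclusive end: one before the next file's first index
  }
}
```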