Better HDFS Support #1556

Merged
merged 21 commits into from Jul 5, 2016

Conversation

echeipesh (Contributor) commented Jun 21, 2016

This PR improves HDFS layer writing and random value reading support.

HadoopValueReader now opens a MapFile.Reader for each map file in the layer. These readers cache the index of available keys, so they can provide fast lookups. This replaces and improves on the previous approach of using FileInputFormat to query for a single record.
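A minimal sketch of that lookup strategy (not the actual GeoTrellis code), assuming keys are stored as LongWritable space-filling-curve indices and values as BytesWritable blobs; MapFileValueLookup and lookup are illustrative names:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, LongWritable, MapFile}

class MapFileValueLookup(layerPath: Path, conf: Configuration) {
  private val fs = layerPath.getFileSystem(conf)

  // One Reader per map file in the layer; each Reader loads the MapFile's
  // key index into memory, so repeated lookups avoid re-scanning data files.
  private val readers: Vector[MapFile.Reader] =
    fs.listStatus(layerPath)
      .filter(_.isDirectory) // each MapFile is a directory holding data + index files
      .map(status => new MapFile.Reader(status.getPath, conf))
      .toVector

  def lookup(index: Long): Option[Array[Byte]] = {
    val key = new LongWritable(index)
    val value = new BytesWritable()
    // MapFile.Reader.get returns null when the key is absent from that file.
    readers.iterator
      .map(_.get(key, value))
      .collectFirst { case found if found != null =>
        java.util.Arrays.copyOf(value.getBytes, value.getLength)
      }
  }

  def close(): Unit = readers.foreach(_.close())
}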

HadoopRDDWriter has multiple improvements:

  • Accomplishes its task with a single shuffle step.
    • This is possible because, instead of estimating the number of blocks required by counting records, we simply roll over to a new file when the record being written would surpass the block boundary.
    • Instead of using a groupBy, we sort the records during the shuffle, which uses the key index to partition them.
  • Writes happen by mapping over the partition iterator; this is possible because each partition arrives pre-sorted, and it greatly reduces memory pressure since records can be garbage collected as soon as they are written. (A sketch of this write path follows the list.)
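
For illustration, here is a minimal sketch of the single-shuffle, roll-over write described above; it is not the actual HadoopRDDWriter, and RecordWriter, indexFor, recordSize, and openWriter are hypothetical stand-ins for the MapFile.Writer machinery and record size estimation.

import org.apache.spark.rdd.RDD

// Hypothetical writer interface standing in for Hadoop's MapFile.Writer.
trait RecordWriter[K, V] {
  def write(index: Long, record: (K, V)): Unit
  def close(): Unit
}

def writeSorted[K, V](
  rdd: RDD[(K, V)],
  indexFor: K => Long,          // e.g. a space-filling-curve KeyIndex lookup
  recordSize: ((K, V)) => Long, // estimated serialized size of one record
  openWriter: () => RecordWriter[K, V],
  blockSize: Long
): Unit =
  rdd
    .map { case (k, v) => (indexFor(k), (k, v)) }
    .sortByKey() // the single shuffle: range-partitions by index and sorts within partitions
    .foreachPartition { iter =>
      if (iter.hasNext) {
        var bytesWritten = 0L
        var writer = openWriter()
        for ((index, record) <- iter) {
          val size = recordSize(record)
          // Roll over to a new file before crossing the block boundary,
          // instead of estimating the number of blocks up front.
          if (bytesWritten + size > blockSize && bytesWritten > 0) {
            writer.close()
            writer = openWriter()
            bytesWritten = 0L
          }
          writer.write(index, record)
          bytesWritten += size
        }
        writer.close()
      }
    }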

Current testing shows:

  • HDFS ingest speeds exceed those seen with Accumulo
  • tile fetching latency is visually acceptable
  • jobs using the GroupedCumulativeIterator approach complete in settings where groupBy/sort/write jobs are killed for memory violations

Future improvements under consideration:

  • Save a bloom filter index for the layer MapFiles to reduce memory requirements for HadoopValueReader and to provide quicker lookups in most cases.
  • In HadoopValueReader, when fetching a record, cache all the tiles that share its index before filtering down to a single tile. These records are very likely to be requested next and should be stored in an LRU cache (a minimal sketch follows this list).
  • Devise a method to compare KeyIndex instances so that the extra shuffle can be avoided when the RDD being saved is already partitioned by a compatible index (one where all key boundaries are shared).
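
A minimal sketch of the LRU cache idea from the second bullet, built on java.util.LinkedHashMap in access order; LruCache, its capacity, and the key/value types are illustrative, not the eventual HadoopValueReader design.

import java.util.Map.Entry
import java.util.{LinkedHashMap => JLinkedHashMap}

class LruCache[K, V](capacity: Int) {
  // accessOrder = true makes iteration order least-recently-used first,
  // so evicting the eldest entry on overflow implements LRU.
  private val map = new JLinkedHashMap[K, V](capacity, 0.75f, true) {
    override def removeEldestEntry(eldest: Entry[K, V]): Boolean =
      size() > capacity
  }
  def get(key: K): Option[V] = Option(map.get(key))
  def put(key: K, value: V): Unit = map.put(key, value)
}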

echeipesh changed the title from [WIP] Better HDFS Support to Better HDFS Support on Jul 5, 2016

rdd: RDD[(K, V)],                // records to write
path: Path,                      // root path of the layer on HDFS
keyIndex: KeyIndex[K],           // space-filling-curve index used to order and partition records
tileSize: Int = 256*256*8,       // estimated bytes per record: a 256x256 tile of 8-byte cells
compressionFactor: Double = 1.3  // factor applied to size estimates to account for compression

lossyrob (Member) commented Jul 5, 2016

👏

writer.close()
// TODO: collect statistics on written records and return those
Iterator.empty
}.count // .count is an action; it forces the otherwise-lazy partition writes to run

lossyrob (Member) commented Jul 5, 2016

👏

lossyrob (Member) commented Jul 5, 2016

+1 after comments addressed

echeipesh merged commit 41ba42d into locationtech:master on Jul 5, 2016

1 check passed

continuous-integration/travis-ci/pr: The Travis CI build passed

lossyrob added this to the 1.0 milestone on Oct 18, 2016
