Merge pull request #9 from ryan-williams/upgrades
Upgrade deps, docs
ryan-williams committed Nov 22, 2017
2 parents c8a3ec2 + 409f02d commit fc4d907
Showing 71 changed files with 330 additions and 434 deletions.
159 changes: 6 additions & 153 deletions README.md
@@ -1,163 +1,16 @@
# spark-bam

http://hammerlab.org/spark-bam/

Process [BAM files][SAM spec] using [Apache Spark] and [HTSJDK]; extends/improves [hadoop-bam].

```bash
$ spark-shell --packages=org.hammerlab.bam:load:1.0.0-SNAPSHOT
```
```scala
import org.hammerlab.bam.spark._
import org.hammerlab.paths.Path

val path = Path("test_bams/src/main/resources/2.bam")

// Load an RDD[SAMRecord] from `path`; supports .bam, .sam, and .cram
val reads = sc.loadReads(path)
// RDD[SAMRecord]

reads.count
// 2500

import org.hammerlab.bytes._

// Configure maximum split size
sc.loadReads(path, splitSize = 16 MB)
// RDD[SAMRecord]
```
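The `splitSize = 16 MB` argument above uses the byte-size literal syntax from `org.hammerlab.bytes`. As a rough sketch of how such a size DSL can be expressed in Scala (an illustration only, not the library's actual implementation):

```scala
// Hypothetical sketch of a byte-size DSL in the spirit of `16 MB` above;
// names and behavior here are illustrative, not org.hammerlab.bytes' actual API.
object ByteSizeSketch {
  implicit class SizeOps(private val n: Long) extends AnyVal {
    def KB: Long = n * 1024L                  // kibibytes
    def MB: Long = n * 1024L * 1024L          // mebibytes
    def GB: Long = n * 1024L * 1024L * 1024L  // gibibytes
  }
}
```

With an implicit class like this in scope, a caller can write `16L.MB` for a split-size argument; the spaced `16 MB` form seen above presumably additionally relies on Scala's postfix-operator syntax.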

## Linking

### In SBT

```scala
libraryDependencies += "org.hammerlab.bam" %% "load" % "1.0.0-SNAPSHOT"
```

### In Maven

```xml
<dependency>
  <groupId>org.hammerlab.bam</groupId>
  <artifactId>load_2.11</artifactId>
  <version>1.0.0-SNAPSHOT</version>
</dependency>
```

### From `spark-shell`

```bash
spark-shell --packages=org.hammerlab.bam:load:1.0.0-SNAPSHOT
```

```scala
import org.hammerlab.bam.spark._
import org.hammerlab.paths.Path
val reads = sc.loadBam(Path("test_bams/src/main/resources/2.bam")) // RDD[SAMRecord]
reads.count // Long: 2500
```

### On Google Cloud

[spark-bam] uses Java NIO APIs to read files, and needs the [google-cloud-nio] connector in order to read from Google Cloud Storage (`gs://` URLs).

Download a shaded [google-cloud-nio] JAR:

```bash
GOOGLE_CLOUD_NIO_JAR=google-cloud-nio-0.20.0-alpha-shaded.jar
wget https://oss.sonatype.org/content/repositories/releases/com/google/cloud/google-cloud-nio/0.20.0-alpha/$GOOGLE_CLOUD_NIO_JAR
```

(Full docs live at http://hammerlab.org/spark-bam/; to build them locally, run `jekyll serve -H 0.0.0.0` from the `docs/` directory.)

Then include it in your `--jars` list when running `spark-shell` or `spark-submit`:

```bash
spark-shell --jars $GOOGLE_CLOUD_NIO_JAR --packages=org.hammerlab.bam:load:1.0.0-SNAPSHOT
```

```scala
import org.hammerlab.bam.spark._
import org.hammerlab.paths.Path
val reads = sc.loadBam(Path("gs://bucket/my.bam"))
```

<!-- Intra-page links -->
[checks table]: #improved-record-boundary-detection-robustness
[getting an assembly JAR]: #get-an-assembly-JAR
[required path arg]: #required-argument-path
[api-clarity]: #algorithm-api-clarity

<!-- Checker links -->
[`eager`]: #eager
[`seqdoop`]: #seqdoop
[`full`]: #full
[`indexed`]: #indexed

<!-- Checkers -->
[eager/Checker]: https://github.com/hammerlab/spark-bam/blob/master/check/src/main/scala/org/hammerlab/bam/check/eager/Checker.scala
[full/Checker]: https://github.com/hammerlab/spark-bam/blob/master/check/src/main/scala/org/hammerlab/bam/check/full/Checker.scala
[seqdoop/Checker]: https://github.com/hammerlab/spark-bam/blob/master/seqdoop/src/main/scala/org/hammerlab/bam/check/seqdoop/Checker.scala
[indexed/Checker]: https://github.com/hammerlab/spark-bam/blob/master/check/src/main/scala/org/hammerlab/bam/check/indexed/Checker.scala

[`Checker`]: src/main/scala/org/hammerlab/bam/check/Checker.scala

<!-- test/resources links -->
[`cli/src/test/resources/test-bams`]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/test/resources/test-bams
[output/check-bam]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/test/resources/output/check-bam
[output/full-check]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/test/resources/output/full-check

<!-- External project links -->
[Apache Spark]: https://spark.apache.org/
[HTSJDK]: https://github.com/samtools/htsjdk
[Google Cloud Dataproc]: https://cloud.google.com/dataproc/
[bigdata-interop]: https://github.com/GoogleCloudPlatform/bigdata-interop/
[google-cloud-nio]: https://github.com/GoogleCloudPlatform/google-cloud-java/tree/v0.10.0/google-cloud-contrib/google-cloud-nio
[SAM spec]: http://samtools.github.io/hts-specs/SAMv1.pdf

<!-- Repos -->
[hadoop-bam]: https://github.com/HadoopGenomics/Hadoop-BAM
[spark-bam]: https://github.com/hammerlab/spark-bam
[hammerlab/hadoop-bam]: https://github.com/hammerlab/Hadoop-BAM/tree/7.9.0

[`BAMSplitGuesser`]: https://github.com/HadoopGenomics/Hadoop-BAM/blob/7.8.0/src/main/java/org/seqdoop/hadoop_bam/BAMSplitGuesser.java

<!-- Command/Subcommand links -->
[Main]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/main/scala/org/hammerlab/bam/Main.scala

[`check-bam`]: #check-bam
[check/Main]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/main/scala/org/hammerlab/bam/check/Main.scala

[`full-check`]: #full-check
[full/Main]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/main/scala/org/hammerlab/bam/check/full/Main.scala

[`compute-splits`]: #compute-splits
[spark/Main]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/main/scala/org/hammerlab/bam/spark/Main.scala

[`compare-splits`]: #compare-splits
[compare/Main]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/main/scala/org/hammerlab/bam/compare/Main.scala

[`index-blocks`]: #index-blocks
[IndexBlocks]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/main/scala/org/hammerlab/bgzf/index/IndexBlocks.scala
[`IndexBlocksTest`]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/test/scala/org/hammerlab/bgzf/index/IndexBlocksTest.scala

[`index-records`]: #index-records
[IndexRecords]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/main/scala/org/hammerlab/bam/index/IndexRecords.scala
[`IndexRecordsTest`]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/test/scala/org/hammerlab/bam/index/IndexRecordsTest.scala

[`htsjdk-rewrite`]: #htsjdk-rewrite
[rewrite/Main]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/main/scala/org/hammerlab/bam/rewrite/Main.scala

[`org.hammerlab.paths.Path`]: https://github.com/hammerlab/path-utils/blob/1.2.0/src/main/scala/org/hammerlab/paths/Path.scala
[Path NIO ctor]: https://github.com/hammerlab/path-utils/blob/1.2.0/src/main/scala/org/hammerlab/paths/Path.scala#L14
[Path URI ctor]: https://github.com/hammerlab/path-utils/blob/1.2.0/src/main/scala/org/hammerlab/paths/Path.scala#L157
[Path String ctor]: https://github.com/hammerlab/path-utils/blob/1.2.0/src/main/scala/org/hammerlab/paths/Path.scala#L145-L155

[`SAMRecord`]: https://github.com/samtools/htsjdk/blob/2.9.1/src/main/java/htsjdk/samtools/SAMRecord.java

[`LociSet`]: https://github.com/hammerlab/genomic-loci/blob/2.0.1/src/main/scala/org/hammerlab/genomics/loci/set/LociSet.scala

[`Pos`]: https://github.com/hammerlab/spark-bam/blob/master/bgzf/src/main/scala/org/hammerlab/bgzf/Pos.scala
[`Split`]: https://github.com/hammerlab/spark-bam/blob/master/check/src/main/scala/org/hammerlab/bam/spark/Split.scala

[linking]: #linking

[test_bams]: test_bams/src/main/resources
[cli/str/slice]: https://github.com/hammerlab/spark-bam/blob/master/cli/src/test/resources/slice

[cli]: https://github.com/hammerlab/spark-bam/blob/master/cli
@@ -1,6 +1,6 @@
package org.hammerlab.bam.benchmarks

import org.hammerlab.paths.Path
import hammerlab.path._

case class Datasets(datasets: Map[Dataset, Seq[BAM]])

@@ -1,7 +1,7 @@
package org.hammerlab.bam.benchmarks

import org.hammerlab.bytes.Bytes
import org.hammerlab.paths.Path
import hammerlab.bytes._
import hammerlab.path._

/**
* synthesize spreadsheet rows by parsing stats from files output by `check-bam` and `check-blocks`
@@ -1,8 +1,8 @@
package org.hammerlab.bgzf.block

import hammerlab.path._
import org.hammerlab.bgzf.block.Block.MAX_BLOCK_SIZE
import org.hammerlab.channel.SeekableByteChannel
import org.hammerlab.paths.Path

object FindBlockStart {
def apply(path: Path,
@@ -2,7 +2,7 @@ package org.hammerlab.bgzf.block

import java.io.IOException

import org.hammerlab.paths.Path
import hammerlab.path._

case class HeaderSearchFailedException(path: Path,
start: Long,
@@ -2,19 +2,19 @@ package org.hammerlab.bgzf.block

import java.io.{ Closeable, EOFException }

import hammerlab.iterator.SimpleIterator
import org.hammerlab.bgzf.block.Block.FOOTER_SIZE
import org.hammerlab.bgzf.block.Header.EXPECTED_HEADER_SIZE
import org.hammerlab.channel.ByteChannel
import org.hammerlab.io.Buffer
import org.hammerlab.iterator.SimpleBufferedIterator

/**
* Iterator over bgzf-block [[Metadata]]; useful when loading/decompressing [[Block]] payloads is unnecessary.
*
* @param ch input stream/channel containing compressed bgzf data
*/
case class MetadataStream(ch: ByteChannel)
extends SimpleBufferedIterator[Metadata]
extends SimpleIterator[Metadata]
with Closeable {

// Buffer for the standard bits of the header that we care about
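The `MetadataStream` above iterates over per-block metadata without inflating payloads; that works because each BGZF block declares its own compressed size in its gzip header. A self-contained sketch of that header read, based on the BGZF layout in the SAM spec (not spark-bam's actual code):

```scala
// Sketch: read a BGZF block's total compressed size from its gzip header alone,
// without inflating the payload — the idea underlying MetadataStream.
// Per the BGZF layout in the SAM spec: gzip magic 0x1f 0x8b, an FEXTRA subfield
// with SI1='B', SI2='C', and a little-endian u16 BSIZE = (block size - 1) at offset 16.
object BgzfHeader {
  // The spec's fixed 28-byte BGZF end-of-file marker block:
  val EOF: Array[Int] = Array(
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00,
    0x00, 0xff, 0x06, 0x00, 0x42, 0x43, 0x02, 0x00,
    0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00
  )

  /** Total compressed block size: BSIZE (u16, little-endian, at offset 16) + 1. */
  def blockSize(header: Array[Int]): Int = {
    require(header(0) == 0x1f && header(1) == 0x8b, "not a gzip stream")
    require(header(12) == 'B' && header(13) == 'C', "missing BGZF 'BC' subfield")
    (header(16) | (header(17) << 8)) + 1
  }
}
```

Skipping `blockSize` bytes ahead lands on the next block's header, which is what lets a metadata pass walk a BAM file cheaply.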
4 changes: 2 additions & 2 deletions bgzf/src/main/scala/org/hammerlab/bgzf/block/Stream.scala
@@ -3,18 +3,18 @@ package org.hammerlab.bgzf.block
import java.io.{ Closeable, EOFException, IOException, InputStream }
import java.util.zip.Inflater

import hammerlab.iterator.SimpleIterator
import org.hammerlab.bgzf.block.Block.{ FOOTER_SIZE, MAX_BLOCK_SIZE }
import org.hammerlab.channel.{ ByteChannel, SeekableByteChannel }
import org.hammerlab.io.Buffer
import org.hammerlab.iterator.SimpleBufferedIterator

import scala.collection.mutable

/**
* Iterator over BGZF [[Block]]s pointed to by a BGZF-compressed [[InputStream]]
*/
trait StreamI
extends SimpleBufferedIterator[Block]
extends SimpleIterator[Block]
with Closeable {

def compressedBytes: ByteChannel
@@ -2,20 +2,19 @@ package org.hammerlab.bgzf.block

import java.io.Closeable

import hammerlab.iterator._
import org.hammerlab.bgzf.Pos
import org.hammerlab.channel.{ ByteChannel, SeekableByteChannel }
import org.hammerlab.iterator.FlatteningIterator._
import org.hammerlab.iterator.SimpleBufferedIterator

/**
* [[Iterator]] of bgzf-decompressed bytes from a [[Stream]] of [[Block]]s.
* @tparam BlockStream underlying [[Block]]-[[Stream]] type (basically: seekable or not?).
*/
trait UncompressedBytesI[BlockStream <: StreamI]
extends SimpleBufferedIterator[Byte]
extends SimpleIterator[Byte]
with Closeable {
def blockStream: BlockStream
val uncompressedBytes = blockStream.smush
val uncompressedBytes = blockStream.level
def curBlock: Option[Block] = uncompressedBytes.cur
def curPos: Option[Pos] = curBlock.map(_.pos)

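`UncompressedBytesI` exposes the decompressed bytes of many blocks as one flat iterator while remembering which block each byte came from, so every byte maps back to a virtual position (block start, intra-block offset). A hypothetical, self-contained sketch of that flattening (names here are illustrative, not the library's):

```scala
object FlattenSketch {
  /** A decompressed block: where it started in the compressed file, plus its payload. */
  case class Block(compressedStart: Long, bytes: Array[Byte])

  /** Flatten blocks to ((blockStart, offsetInBlock), byte) so each byte keeps a position. */
  def flatten(blocks: Iterator[Block]): Iterator[((Long, Int), Byte)] =
    blocks.flatMap { block =>
      block.bytes.iterator.zipWithIndex.map {
        case (b, i) => ((block.compressedStart, i), b)
      }
    }
}
```

Tracking the pair rather than a flat byte offset is what allows a record found mid-stream to be addressed as a BGZF virtual file offset later.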
@@ -3,8 +3,7 @@ package org.hammerlab.bgzf.block
import java.io.FileInputStream
import java.nio.channels.FileChannel

import cats.implicits.catsStdShowForInt
import cats.syntax.all._
import hammerlab.show._
import org.hammerlab.bam.test.resources.bam2
import org.hammerlab.stats.Stats
import org.hammerlab.test.Suite
Expand Down
