Kazuaki Ishizaki edited this page Jan 21, 2016 · 21 revisions

Welcome to the spark-gpu wiki!

Go to the repository -> click here


This page describes a prototype of Apache Spark that effectively stores partition data in a columnar RDD in a binary format. The motivation is to accelerate Spark workloads by using GPUs and SIMD instructions. An RDD in the original Apache Spark keeps the data of each row as a Scala sequence on the Java heap. This prototype instead keeps data in a binary representation off-heap, like the Dataset introduced in Spark 1.6. It also keeps the data in columnar storage, which is well suited to GPUs and SIMD.
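To make the difference between row-wise on-heap data and columnar off-heap data concrete, here is a minimal sketch in plain Java. It is not the prototype's code: the class `ColumnarSketch`, its methods, and the two-column `(x, y)` layout are all hypothetical names chosen for illustration, and it assumes only the standard `java.nio.ByteBuffer` API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; none of these names come from spark-gpu.
final class ColumnarSketch {

    // Copies rows of (x, y) pairs into one off-heap buffer, column by column:
    // [x0, x1, ..., x(n-1), y0, y1, ..., y(n-1)]
    static ByteBuffer toColumnar(double[][] rows) {
        int n = rows.length;
        // allocateDirect reserves memory outside the Java heap.
        ByteBuffer buf = ByteBuffer.allocateDirect(2 * n * Double.BYTES)
                                   .order(ByteOrder.nativeOrder());
        for (int i = 0; i < n; i++) {
            buf.putDouble(i * Double.BYTES, rows[i][0]);       // column 0 (x)
            buf.putDouble((n + i) * Double.BYTES, rows[i][1]); // column 1 (y)
        }
        return buf;
    }

    // Sums one column by scanning a contiguous region of the buffer.
    static double sumColumn(ByteBuffer buf, int n, int column) {
        int base = column * n * Double.BYTES;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += buf.getDouble(base + i * Double.BYTES);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] rows = {{1.0, 10.0}, {2.0, 20.0}, {3.0, 30.0}};
        ByteBuffer buf = toColumnar(rows);
        System.out.println(sumColumn(buf, rows.length, 0)); // prints 6.0
        System.out.println(sumColumn(buf, rows.length, 1)); // prints 60.0
    }
}
```

Because all values of one column sit next to each other in a single binary region, such a buffer could in principle be copied to a device or scanned with vector instructions without walking per-row objects, which is the property the paragraph above relies on.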

Our current performance improvement (more than 3x) is shown in the benchmark section.


You can run our prototype on your own machine with an NVIDIA GPU card, or on AWS EC2, by following the procedure described here.

You can download a pre-built binary from (http://github.com/kiszk/spark-gpu/wiki/Downloads).

Please also visit the other pages from the menu on the right-hand side.

Current version has several limitations

  • Supports only x86_64 and ppc64le
  • Supports OpenJDK and IBM JDK
  • Supports only NVIDIA GPUs with CUDA (we confirmed with CUDA 7.0)
  • Supports CUDA 7.0 and 7.5 (should work with CUDA 6.0 and 6.5)
  • Supports scalar variables of primitive scalar types and a primitive array in an RDD
  • Supports a new column format for map and reduce functions

Future plan

  • Generate GPU and SIMD code from a Spark application program
  • Currently, a programmer has to provide a CUDA function for the GPU kernels used in Spark functions. Alternatively, limited code generation for map() and reduce() functions can be enabled with "spark.gpu.codegen=true"