Ondřej Moravčík edited this page Jun 12, 2015 · 7 revisions

Custom serializer

Every RDD is serialized by the same serializer, which is defined by the spark.ruby.serializer* options. If you want a custom serializer for a particular RDD, you can build one.

  • All serializers can be found at Rubydoc
# First way
marshal1 = Spark::Serializer::Marshal.new
compressed1 = Spark::Serializer::Compressed.new(marshal1)
serializer = Spark::Serializer::AutoBatched.new(compressed1)

# Second way
serializer = Spark::Serializer.build { auto_batched(compressed(marshal)) }

# Third way
serializer = Spark::Serializer.build("auto_batched(compressed(marshal))")
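
To make the nesting in `compressed(marshal)` more concrete, here is a minimal sketch of the wrapping idea using only Ruby's standard library. The `SketchSerializer` module and its two classes are hypothetical stand-ins, not the ruby-spark classes, and the use of zlib for compression is an assumption made for illustration. Batching is left out.

```ruby
require "zlib"

# Hypothetical stand-ins for illustration; not the ruby-spark classes.
module SketchSerializer
  # Innermost layer: turn a Ruby object into bytes with Marshal.
  class MarshalLayer
    def dump(obj)
      Marshal.dump(obj)
    end

    def load(bytes)
      Marshal.load(bytes)
    end
  end

  # Wrapper layer: compress whatever the inner serializer produces.
  # zlib is an assumption here, chosen because it ships with Ruby.
  class CompressedLayer
    def initialize(inner)
      @inner = inner
    end

    def dump(obj)
      Zlib::Deflate.deflate(@inner.dump(obj))
    end

    def load(bytes)
      @inner.load(Zlib::Inflate.inflate(bytes))
    end
  end
end

# Same nesting order as compressed(marshal): marshal first, then compress.
sketch = SketchSerializer::CompressedLayer.new(SketchSerializer::MarshalLayer.new)
payload = sketch.dump([1, 2, 3, 4, 5])
restored = sketch.load(payload)
```

Each layer only needs to know about the layer it wraps, which is why the same serializer can be written either as nested constructor calls or as a nested expression passed to `build`.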

Uploading data

Data can be uploaded as a single file.

rdd = sc.text_file(FILE, workers_num, serializer=nil)

All files in a directory.

rdd = sc.whole_text_files(DIRECTORY, workers_num, serializer=nil)

Directly. The data must be iterable, and the chosen serializer must be able to serialize it.

rdd = sc.parallelize(data, workers_num, serializer=nil)
rdd = sc.parallelize([1,2,3,4,5], workers_num, serializer=nil)
rdd = sc.parallelize(1..5, workers_num, serializer=nil)
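
The two requirements on directly uploaded data (iterable, and serializable by the chosen serializer) can be checked before calling `parallelize`. The sketch below assumes a Marshal-based serializer and uses only plain Ruby. The helper `marshalable?` is hypothetical and not part of ruby-spark.

```ruby
# Hypothetical helper for illustration; not part of ruby-spark.
# Returns true when data is iterable and every item can be dumped by Marshal.
def marshalable?(data)
  return false unless data.respond_to?(:each)

  data.each { |item| Marshal.dump(item) }
  true
rescue TypeError
  # Marshal raises TypeError for objects it cannot dump, such as procs.
  false
end

ok_array     = marshalable?([1, 2, 3, 4, 5])
ok_range     = marshalable?(1..5)
bad_procs    = marshalable?([-> { 1 }])
not_iterable = marshalable?(42)
```

Arrays and finite ranges of plain values pass the check, while a collection of lambdas or a bare integer does not. The helper walks the whole collection, so it is only meant for small inputs.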

Options

workers_num
Minimum number of workers computing this task.
(This value can be overridden by Spark.)
serializer
Custom serializer.