Loading data

Custom serializer

By default, every RDD is serialized with the serializer defined by the spark.ruby.serializer* options. If you want a custom serializer for a particular RDD, you can build one.

All available serializers are listed on Rubydoc.

# First way
marshal1 = Spark::Serializer::Marshal.new
compressed1 = Spark::Serializer::Compressed.new(marshal1)
serializer = Spark::Serializer::AutoBatched.new(compressed1)

# Second way
serializer = Spark::Serializer.build { auto_batched(compressed(marshal)) }

# Third way
serializer = Spark::Serializer.build("auto_batched(compressed(marshal))")
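To give a rough idea of what stacking compressed on top of marshal means, here is a minimal sketch using only Ruby's standard library. The helper names dump_compressed and load_compressed are hypothetical and are not part of ruby-spark. The use of Zlib deflate is also an assumption made for illustration, and the gem's actual implementation may differ.

```ruby
require 'zlib'

# Illustration only: plain-Ruby stand-ins for the "marshal" and "compressed"
# layers. These helpers are hypothetical and are not ruby-spark's code.

# Inner layer turns the object into bytes, outer layer compresses the bytes.
def dump_compressed(object)
  Zlib::Deflate.deflate(Marshal.dump(object))
end

# Loading undoes the layers in reverse order: decompress, then unmarshal.
def load_compressed(bytes)
  Marshal.load(Zlib::Inflate.inflate(bytes))
end

batch    = [1, 2, 3, 4, 5]
bytes    = dump_compressed(batch)
restored = load_compressed(bytes)
```

The nesting order in auto_batched(compressed(marshal)) reads the same way: the innermost serializer touches the Ruby objects, and each outer one wraps the result of the one inside it.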

Uploading data

Data can be loaded from a single file.

rdd = sc.text_file(FILE, workers_num, serializer=nil)

From all files in a directory.

rdd = sc.whole_text_files(DIRECTORY, workers_num, serializer=nil)

Directly from a Ruby object. The data must be iterable, and the chosen serializer must be able to serialize it.

rdd = sc.parallelize(data, workers_num, serializer=nil)
rdd = sc.parallelize([1,2,3,4,5], workers_num, serializer=nil)
rdd = sc.parallelize(1..5, workers_num, serializer=nil)
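The two requirements above can be checked locally before calling parallelize. The following is a minimal sketch that assumes a Marshal-based serializer is in use. The helper marshalable_collection? is hypothetical and is not part of ruby-spark.

```ruby
# Hypothetical helper (not part of ruby-spark): returns true when the data
# is iterable and every item can be dumped with Ruby's built-in Marshal.
def marshalable_collection?(data)
  return false unless data.respond_to?(:each)

  data.each { |item| Marshal.dump(item) }
  true
rescue TypeError
  # Marshal.dump raises TypeError for objects such as Proc or IO.
  false
end

ok_array = marshalable_collection?([1, 2, 3, 4, 5])
ok_range = marshalable_collection?(1..5)
bad_item = marshalable_collection?([proc { 1 }]) # a Proc cannot be marshalled
bad_data = marshalable_collection?(42)           # an Integer is not iterable
```

With a different serializer the second check would have to use that serializer's own dump step instead of Marshal.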

Options

workers_num: Minimum number of workers computing this task.
(This value can be overwritten by Spark.)
serializer: Custom serializer.
