
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
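
Both paths, in a minimal sketch (it assumes sc is an already-initialized Spark::Context, and data.txt is a hypothetical local file):

rdd = sc.parallelize([1, 2, 3, 4, 5]) # from an in-memory collection
rdd = sc.text_file('data.txt')        # from external storage, one element per line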

For the full list of currently supported methods, check the Spark::RDD documentation.

RDD methods use snake_case, but camelCase aliases (matching Spark's native API) are also available; the following calls are equivalent:

rdd.flat_map(...) # snake_case, idiomatic Ruby
rdd.flatMap(...)  # camelCase alias

Methods fall into two groups:

  • Transformations: append a new operation to the current RDD and return a new RDD; nothing is computed yet
    • map(function)
    • flat_map(function)
    • map_partitions(function)
    • filter(function)
    • cartesian(other)
    • intersection(other)
    • sample(with_replacement, fraction, seed)
    • group_by_key(num_partitions)
    • ...
  • Actions: append an operation and start the computation
    • take(count)
    • reduce(function)
    • aggregate(zero_value, seq_op, comb_op) (see the sketch after this list)
    • histogram(buckets)
    • collect
    • ...
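
Of the signatures above, aggregate is the least obvious. A minimal sketch, assuming the semantics mirror Spark's aggregate (seq_op folds elements into the zero value within each partition, comb_op merges the partition results):

seq  = lambda{|acc, x| acc + x} # fold one element into a partition's accumulator
comb = lambda{|a, b| a + b}     # merge accumulators from different partitions
sc.parallelize(1..5).aggregate(0, seq, comb)
# => 15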

Transformations

A transformation appends a new function to the current RDD and returns a new RDD that carries all previously defined functions.

rdd = sc.parallelize(...) # => Spark::RDD

rdd = rdd.map(lambda{|x| x*2}) # => Spark::PipelinedRDD
rdd = rdd.filter(lambda{|x| x > 2})
rdd = rdd.cartesian(rdd)
...

To start a calculation you need to call some action (e.g. .collect).
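
For example (a minimal sketch; building the chain computes nothing, everything runs at collect):

rdd = sc.parallelize(1..4).map(lambda{|x| x * 2}) # builds the chain; computes nothing
rdd.collect # => [2, 4, 6, 8] (the whole chain runs here)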

Actions

Actions start the computation and return a result to the driver; some also add their own functionality on top (e.g. reduce, take_sample).

rdd.collect                        # return all elements as an array
rdd.reduce(lambda{|sum, x| sum+x}) # fold elements into a single value
rdd.take_sample(true, 10)          # sample 10 elements, with replacement
...

Examples

Sum of numbers

sc.parallelize(0..10).sum
# => 55
Word count

rdd = sc.text_file(PATH)

rdd = rdd.flat_map(lambda{|line| line.split})
         .map(lambda{|word| [word, 1]})
         .reduce_by_key(lambda{|a, b| a+b})

rdd.collect_as_hash # => a Hash mapping each word to its count
To string

rdd = sc.parallelize(0..10)
rdd.map(:to_s).collect # a Symbol works in place of a lambda
# => ["0", "1", "2", ..., "10"]
Bind object

Functions are serialized and executed on workers in separate processes, so they cannot see the driver's local variables; bind makes the named objects available inside the function.

replace_to = '***'

# Replace any word containing a digit with the masked value
def replacing(word)
  if word =~ /[0-9]+/
    replace_to
  else
    word
  end
end

rdd = sc.text_file('text.txt')
rdd = rdd.flat_map(lambda{|line| line.split})
rdd = rdd.map(method(:replacing))
rdd = rdd.bind(replace_to: replace_to) # ship replace_to to the workers
rdd.collect
Caching

rdd = sc.parallelize(0...5, 1).map(lambda{|x| sleep(1); x}) # 5 elements, 1 partition, 1s per element

rdd.collect # waits 5s

rdd.cache # mark the RDD for caching; nothing is stored yet

rdd.collect # waits 5s again (computes and fills the cache)
rdd.collect # waits 0s (served from the cache)