We can aggregate RDD data in Spark with three different actions: reduce, fold, and aggregate. The last one is the most general and, in a sense, subsumes the first two.

In [1]:
data_file = "file:///home/lygbug666/workdir/spark-py-notebooks/kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

# Inspecting interaction duration by tag

The fold action takes an initial value that is used for the first call, whereas reduce does not.
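
The difference between reduce and fold can be illustrated locally with Python's `functools.reduce`. This is only an analogy, not Spark code: the `durations` list is made up for illustration, and on a real RDD the initial value of fold is also applied once per partition, so it should be neutral for the operation (0 for a sum).

```python
from functools import reduce

durations = [0, 12, 7]  # hypothetical values, not taken from the dataset

# reduce-style: no initial value, the first element seeds the result
total_reduce = reduce(lambda x, y: x + y, durations)

# fold-style: an explicit initial ("zero") value seeds the result
total_fold = reduce(lambda x, y: x + y, durations, 0)

# With no elements at all, only the fold-style call has something to return
empty_fold = reduce(lambda x, y: x + y, [], 0)

print(total_reduce, total_fold, empty_fold)
```

With a neutral initial value both styles give the same total; they differ only in how an empty input is handled.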


In [3]:
csv_data = raw_data.map(lambda x: x.split(","))

normal_csv_data = csv_data.filter(lambda x: x[41]=="normal.")
attack_csv_data = csv_data.filter(lambda x: x[41]!="normal.")

The function that we pass to reduce takes and returns elements of the same type as the RDD. If we want to sum durations, we first need to extract that field into a new RDD.

In [4]:
# extract the duration (column 0) and convert it from str to int so it can be summed

normal_duration_data = normal_csv_data.map(lambda x: int(x[0]))
attack_duration_data = attack_csv_data.map(lambda x: int(x[0]))

In [7]:
total_normal_duration = normal_duration_data.reduce(lambda x, y: x + y)
total_attack_duration = attack_duration_data.reduce(lambda x, y: x + y)

print ("Total duration for 'normal' interactions is {}".\
    format(total_normal_duration))
print ("Total duration for 'attack' interactions is {}".\
    format(total_attack_duration))

Total duration for 'normal' interactions is 21075991
Total duration for 'attack' interactions is 2626792


In [8]:
normal_count = normal_duration_data.count()
attack_count = attack_duration_data.count()

print ("Mean duration for 'normal' interactions is {}".\
    format(round(total_normal_duration/float(normal_count),3)))
print ("Mean duration for 'attack' interactions is {}".\
    format(round(total_attack_duration/float(attack_count),3)))

Mean duration for 'normal' interactions is 216.657
Mean duration for 'attack' interactions is 6.621


This gives us a first (and overly simplistic) approach to identifying attack interactions.

# A better way, using aggregate

The aggregate action frees us from the constraint that the return value must have the same type as the elements of the RDD we are working on.
As with fold, we supply an initial zero value of the type we want to return.
We also provide two functions. The first one combines an element of our RDD with the accumulator.
The second one merges two accumulators.
Let's see it in action by calculating the same mean as before.

In [12]:
normal_sum_count = normal_duration_data.aggregate(
    (0,0), # initial value
    (lambda acc, value: (acc[0]+value, acc[1]+1)), # add value to the running sum acc[0] and increment the count acc[1]
    (lambda acc1, acc2: (acc1[0]+acc2[0], acc1[1]+acc2[1]))) # merge two accumulators pairwise

print ("Mean duration for 'normal' interactions is {}".\
    format(round(normal_sum_count[0]/float(normal_sum_count[1]),3)))

Mean duration for 'normal' interactions is 216.657


In [14]:
attack_sum_count = attack_duration_data.aggregate(
    (0,0),
    (lambda acc, value:(acc[0]+value, acc[1]+1)),
    (lambda acc1,acc2: (acc1[0]+acc2[0], acc1[1]+acc2[1])))

print ("Mean duration for 'attack' interaction is {}" .\
      format(round(attack_sum_count[0]/float(attack_sum_count[1]),3)))

Mean duration for 'attack' interaction is 6.621


In the previous aggregation, the first element of the accumulator keeps the running sum, while the second element keeps the count. Combining an accumulator with an RDD element consists of adding the value to the sum and incrementing the count. Combining two accumulators requires just a pairwise sum.
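
To make the two roles explicit, here is a minimal sketch that traces the same logic by hand in plain Python, without Spark. The names `seq_op` and `comb_op`, the two partitions and their values are all hypothetical and chosen only for illustration.

```python
from functools import reduce

def seq_op(acc, value):
    """Combine one element with a (sum, count) accumulator."""
    return (acc[0] + value, acc[1] + 1)

def comb_op(acc1, acc2):
    """Merge two (sum, count) accumulators with a pairwise sum."""
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])

# Hypothetical durations, already split into two made-up partitions
partition_a = [10, 20]
partition_b = [30]

acc_a = reduce(seq_op, partition_a, (0, 0))  # (30, 2)
acc_b = reduce(seq_op, partition_b, (0, 0))  # (30, 1)

total, count = comb_op(acc_a, acc_b)         # (60, 3)
mean = total / float(count)
print(mean)  # 20.0
```

Note that `seq_op` takes an accumulator and an element (two different types), while `comb_op` takes two accumulators; this asymmetry is what lets aggregate return a type different from the RDD's elements.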

In [34]:
x = sc.parallelize([1,2,3,4],3).aggregate((1,0),
    (lambda acc, value:(acc[0]+value, acc[1]+1)),
    (lambda acc1,acc2: (acc1[0]+acc2[0], acc1[1]+acc2[1])))
print(x)

(14, 4)


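The first element of the result is 14 rather than 10 because the initial value is applied once in every partition and once more in the final merge:
sum of RDD elements + initial value * number of RDD partitions + initial value = 10 + 1 * 3 + 1 = 14.
If the number of partitions is not specified, Spark falls back to its default parallelism, which is 8 in this environment.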
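
The effect of a non-neutral initial value can be checked with a small plain-Python stand-in for aggregate. This is a sketch under assumptions, not Spark's implementation: `simulate_aggregate` is a hypothetical helper, and its round-robin split into partitions is arbitrary (Spark may distribute the elements differently, which does not change the result here).

```python
from functools import reduce

def simulate_aggregate(data, num_partitions, zero, seq_op, comb_op):
    """Plain-Python stand-in for aggregate; the round-robin split is an assumption."""
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # The zero value seeds every partition, including empty ones
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    # The zero value is used once more when merging the partition results
    return reduce(comb_op, per_partition, zero)

seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
comb_op = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])

result_3 = simulate_aggregate([1, 2, 3, 4], 3, (1, 0), seq_op, comb_op)
result_8 = simulate_aggregate([1, 2, 3, 4], 8, (1, 0), seq_op, comb_op)
print(result_3)  # (14, 4): 10 + 1 * 3 + 1
print(result_8)  # (19, 4): 10 + 1 * 8 + 1
```

With a neutral initial value such as (0, 0), the result no longer depends on the number of partitions, which is why the earlier mean calculations use it.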