In [4]:
#we use the findspark library to locate spark on our local machine
import findspark
findspark.init(r'C:\spark\spark-3.5.0-bin-hadoop3')
import pyspark # only run this after findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data=[("Z", 1),("A", 20),("B", 30),("C", 40),("B", 30),("B", 60)]

inputRDD = spark.sparkContext.parallelize(data)
  
listRdd = spark.sparkContext.parallelize([1,2,3,4,5,3,2])

seqOp is the sequence operation: a lambda that takes the running accumulator x and the next element y and returns their sum (x + y).

combOp is the combine operation: a lambda that takes two per-partition accumulators x and y and returns their sum (x + y).

listRdd is the RDD of integers created above, and agg stores the result of the aggregate action.

The aggregate action takes three arguments:

- The zero value (0 in this case). It is the starting accumulator for every partition and for the final merge, so it should be neutral for both functions.
- The sequence operation (seqOp), which folds the elements of each partition into that partition's accumulator.
- The combine operation (combOp), which merges the accumulators from different partitions.

print(agg) prints 20, the sum of all elements in listRdd (1 + 2 + 3 + 4 + 5 + 3 + 2): seqOp sums the elements within each partition, and combOp adds the partial sums together.

In [5]:
#aggregate
seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
agg=listRdd.aggregate(0, seqOp, combOp)
print(agg) # output 20

20
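To make the two-stage behaviour of the aggregate call above concrete, here is a minimal pure-Python sketch that mimics it on plain lists. The helper simulate_aggregate and the three-way split of the data are assumptions made for illustration; this is not Spark's implementation, and a real RDD may be partitioned differently.

```python
from functools import reduce

def simulate_aggregate(partitions, zero, seq_op, comb_op):
    """Mimic the two stages of RDD.aggregate on plain Python lists (hypothetical helper)."""
    # Stage 1: fold each partition independently, starting from the zero value.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Stage 2: merge the per-partition accumulators, again starting from the zero value.
    return reduce(comb_op, partials, zero)

# Assumed split of [1, 2, 3, 4, 5, 3, 2] into three partitions.
partitions = [[1, 2], [3, 4], [5, 3, 2]]
total = simulate_aggregate(partitions, 0, lambda x, y: x + y, lambda x, y: x + y)
print(total)  # 20 (partial sums 3, 7 and 10)
```

Because addition is associative and 0 is neutral, the result does not depend on how the elements are split across partitions.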


seqOp2 is a lambda function that specifies the sequence operation. It takes two arguments, x and y, where x is a tuple of two values (x[0], x[1]), and y is an element from the RDD. The lambda function returns a tuple where the first element is the sum of x[0] and y (the running total), and the second element is x[1] + 1 (the running count of elements).

combOp2 is a lambda function that specifies the combine operation. It takes two arguments, x and y, where both x and y are tuples with two values each. The lambda function returns a tuple where the first element is the sum of the first elements of x and y (the total sum), and the second element is the sum of the second elements of x and y (the total count).

listRdd is the RDD of integers created above, and agg2 stores the result of the aggregate action.

The aggregate action takes three arguments:

- The zero value, which is the tuple (0, 0) in this case: the first element is the initial sum and the second is the initial count.
- The sequence operation (seqOp2), which folds the elements of each partition into a (sum, count) tuple.
- The combine operation (combOp2), which merges the (sum, count) tuples from different partitions.

print(agg2) prints (20, 7): the sum of the elements in listRdd is 20, and the RDD contains 7 elements.

In [6]:
#aggregate 2
seqOp2 = (lambda x, y: (x[0] + y, x[1] + 1))
combOp2 = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
agg2=listRdd.aggregate((0, 0), seqOp2, combOp2)
print(agg2) # output (20,7)

(20, 7)
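A (sum, count) pair is usually computed in order to derive a mean. The sketch below replays the same two lambdas on plain Python lists and then divides; the two-way partition split and the names partials, total and mean are assumptions for illustration only.

```python
from functools import reduce

seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)          # accumulator + element
comb_op = lambda left, right: (left[0] + right[0], left[1] + right[1])  # accumulator + accumulator

# Assumed split of [1, 2, 3, 4, 5, 3, 2] into two partitions.
partitions = [[1, 2, 3], [4, 5, 3, 2]]

# Each partition yields its own (sum, count) tuple, starting from (0, 0).
partials = [reduce(seq_op, part, (0, 0)) for part in partitions]
print(partials)  # [(6, 3), (14, 4)]

# Merging the partial tuples gives the overall sum and count.
total, count = reduce(comb_op, partials, (0, 0))
mean = total / count
print(total, count, round(mean, 3))  # 20 7 2.857
```

Note that seq_op and comb_op have different argument types: the first combines an accumulator with a raw element, the second combines two accumulators. That asymmetry is what lets aggregate return a type different from the element type.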


In [7]:
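#treeAggregate takes the same arguments as aggregate but merges the
#partition results in a multi-level tree (optional depth argument, default 2)
agg2=listRdd.treeAggregate(0,seqOp, combOp)
print(agg2) # output 20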

20


from operator import add imports add from Python's standard operator module; add(x, y) is the function form of x + y. Here it serves as the associative function for the aggregation.

foldRes is a variable that will store the result of the fold operation.

listRdd is the RDD of integers created above.

The fold action takes two arguments:

- The zero value, which is 0 in this case.
- The associative function, which is add in this case.

fold aggregates the elements of each partition starting from the zero value, then merges the per-partition results, again starting from the zero value. Because the zero value is applied once per partition and once more in the final merge, it should be neutral for the function (0 for addition).

The result is stored in foldRes, and print(foldRes) prints 20, the sum of all the elements in listRdd.

So fold behaves like reduce with an explicit zero value, which also gives it a defined result on an empty RDD. Here it is used to calculate the sum of the elements.

In [8]:
#fold
from operator import add
foldRes=listRdd.fold(0, add)
print(foldRes) # output 20

20
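The reason the zero value passed to fold should be neutral can be seen in a small pure-Python sketch. The helper simulate_fold and the three-way partition split are assumptions for illustration; they mimic the per-partition-then-merge behaviour described above rather than reproduce Spark's code.

```python
from functools import reduce
from operator import add

def simulate_fold(partitions, zero, op):
    """Mimic RDD.fold on plain lists: the zero value seeds every partition and the merge."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)

# Assumed split of [1, 2, 3, 4, 5, 3, 2] into three partitions.
partitions = [[1, 2], [3, 4], [5, 3, 2]]

neutral = simulate_fold(partitions, 0, add)
skewed = simulate_fold(partitions, 1, add)
print(neutral)  # 20
print(skewed)   # 24: the extra 1 is added once per partition (3) and once in the merge
```

With a non-neutral zero value the result depends on the number of partitions, which is why 0 is the right choice for a sum.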


In PySpark, the reduce action aggregates the elements of an RDD using a specified binary function, in this case add. The code below uses it to calculate the sum of the elements in listRdd:

listRdd is the RDD of integers created above, and redRes stores the result of the reduce action.

reduce is called on listRdd with add as its only argument. add is the function imported from the operator module in the fold cell.

reduce first combines the elements within each partition and then combines the per-partition results, until a single value remains. Because the grouping of elements depends on how the RDD is partitioned, the function should be commutative and associative. Unlike fold, reduce takes no zero value and raises an error on an empty RDD.

print(redRes) prints 20, the sum of all the elements in listRdd.

In [9]:
#reduce
redRes=listRdd.reduce(add)
print(redRes) # output 20

20
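The requirement that the function given to reduce be commutative and associative can be illustrated without Spark. In the sketch below, simulate_reduce is a hypothetical helper that reduces each partition and then reduces the partial results; the partition splits are assumed.

```python
from functools import reduce
from operator import add, sub

def simulate_reduce(partitions, op):
    """Mimic RDD.reduce on plain lists: reduce each non-empty partition, then the partials."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)

data = [1, 2, 3, 4, 5, 3, 2]
one_partition = [data]
two_partitions = [data[:3], data[3:]]

# Addition gives the same answer however the data is split.
print(simulate_reduce(one_partition, add), simulate_reduce(two_partitions, add))  # 20 20

# Subtraction is not associative, so the answer depends on the split.
print(simulate_reduce(one_partition, sub), simulate_reduce(two_partitions, sub))  # -18 2
```

On a real cluster the partitioning is rarely under direct control, so a non-associative function would produce results that change from run to run.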


add is redefined here as a lambda that takes two arguments x and y and returns their sum (x + y). It behaves the same as operator.add, which it replaces from this cell onward.

listRdd is the RDD of integers created above, and redRes stores the result of the treeReduce action.

treeReduce is called on listRdd with add as its argument. An optional second argument, depth (default 2), sets the suggested depth of the tree.

treeReduce merges the per-partition results in a multi-level tree instead of sending every partial result straight to the driver. This helps most when the RDD has many partitions, because it reduces the merging work and the amount of data that arrives at the driver.

print(redRes) prints 20, the sum of all the elements in listRdd.

In summary, treeReduce returns the same result as reduce; the difference lies only in how the partial results are combined. For a small RDD like this one the two are interchangeable.

In [10]:
#treeReduce. This is similar to reduce
add = lambda x, y: x + y
redRes=listRdd.treeReduce(add)
print(redRes) # output 20

20
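The idea behind tree-style merging can be sketched in plain Python. This is a deliberate simplification: tree_combine is a hypothetical helper that merges neighbours pairwise, whereas Spark groups partial results according to its depth parameter. The list of per-partition sums is also assumed.

```python
from operator import add

def tree_combine(partials, op):
    """Merge a non-empty list of partial results pairwise, level by level.

    Returns the final value and the number of merge levels used.
    """
    levels = 0
    while len(partials) > 1:
        # Combine neighbours (0,1), (2,3), ...; an unpaired last item is carried over.
        merged = [op(partials[i], partials[i + 1]) for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            merged.append(partials[-1])
        partials = merged
        levels += 1
    return partials[0], levels

# Assumed per-partition sums for the split [1, 2 | 3, 4 | 5 | 3 | 2].
partials = [3, 7, 5, 3, 2]
result, levels = tree_combine(partials, add)
print(result, levels)  # 20 3
```

Each level halves the number of partial results, so no single step has to merge all of them at once.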


data stores the result of the collect action. Note that this rebinds the name data, which previously held the list of tuples used to build inputRDD.

listRdd is the RDD of integers created above.

The collect action is applied to the listRdd RDD. When you call collect(), it retrieves all the elements from all partitions of the RDD and collects them into a list in the driver program.

The result, which is a list containing all the elements from the RDD, is stored in the data variable.

Finally, print(data) is used to print the contents of the data list, which will display all the elements from the listRdd RDD.

It's important to note that using collect can be memory-intensive, especially if the RDD is large, because it brings all the data to the driver program. In practice, you should use collect with caution, and only when you have a good reason to bring the data to the driver, as it can lead to out-of-memory errors if the RDD is too large.

In [11]:
#Collect
data = listRdd.collect()
print(data)

[1, 2, 3, 4, 5, 3, 2]


listRdd is the RDD of integers created above.

listRdd.countApprox(1200) calls countApprox with a timeout of 1200 milliseconds. countApprox is an approximate version of count(): it returns the estimate available when the timeout expires, even if not all tasks have finished. An optional second argument, confidence (default 0.95), sets the confidence level of the estimate. The argument is not a relative standard deviation; that parameter (relativeSD) belongs to countApproxDistinct.

In PySpark the result is an int. It is converted to a string, concatenated with the label "countApprox : " using the + operator, and printed.

In the run recorded below the line reads countApprox : 6, although the exact count is 7.

In [13]:
#count, countApprox, countApproxDistinct
print("Count : "+str(listRdd.count()))
#Output: Count : 7
print("countApprox : "+str(listRdd.countApprox(1200)))
#Output in the run below: countApprox : 6 (PySpark returns an int; the exact count is 7)
print("countApproxDistinct : "+str(listRdd.countApproxDistinct()))
#Output: countApproxDistinct : 5
print("countApproxDistinct : "+str(inputRDD.countApproxDistinct()))
#Output: countApproxDistinct : 5

#countByValue
print("countByValue :  "+str(listRdd.countByValue()))
#Output: countByValue :  defaultdict(<class 'int'>, {1: 1, 2: 2, 3: 2, 4: 1, 5: 1})


#first
print("first :  "+str(listRdd.first()))
#Output: first :  1
print("first :  "+str(inputRDD.first()))
#Output: first :  ('Z', 1)

#top
print("top : "+str(listRdd.top(2)))
#Output: top : [5, 4]
print("top : "+str(inputRDD.top(2)))
#Output: top : [('Z', 1), ('C', 40)]

#min
print("min :  "+str(listRdd.min()))
#Output: min :  1
print("min :  "+str(inputRDD.min()))
#Output: min :  ('A', 20)

#max
print("max :  "+str(listRdd.max()))
#Output: max :  5
print("max :  "+str(inputRDD.max()))
#Output: max :  ('Z', 1)

#take, takeOrdered, takeSample
print("take : "+str(listRdd.take(2)))
#Output: take : [1, 2]
print("takeOrdered : "+ str(listRdd.takeOrdered(2)))
#Output: takeOrdered : [1, 2]
#takeSample: num=10 exceeds the 7 elements, so sampling without replacement returns all of them in random order
print("take : "+str(listRdd.takeSample(withReplacement=False, num=10)))

Count : 7
countApprox : 6
countApproxDistinct : 5
countApproxDistinct : 5
countByValue :  defaultdict(<class 'int'>, {1: 1, 2: 2, 3: 2, 4: 1, 5: 1})
first :  1
first :  ('Z', 1)
top : [5, 4]
top : [('Z', 1), ('C', 40)]
min :  1
min :  ('A', 20)
max :  5
max :  ('Z', 1)
take : [1, 2]
takeOrdered : [1, 2]
take : [2, 5, 2, 3, 1, 4, 3]
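Several of the outputs above, in particular min and max on the RDD of tuples, follow from ordinary Python ordering: tuples compare element by element, so ('Z', 1) is the largest because 'Z' sorts after 'A', 'B' and 'C'. The sketch below reproduces the same values with standard-library calls on plain lists; it is a local analogy only, not how Spark computes them across partitions.

```python
import heapq
from collections import Counter

numbers = [1, 2, 3, 4, 5, 3, 2]
pairs = [("Z", 1), ("A", 20), ("B", 30), ("C", 40), ("B", 30), ("B", 60)]

top_numbers = heapq.nlargest(2, numbers)     # analogous to listRdd.top(2)
smallest = heapq.nsmallest(2, numbers)       # analogous to listRdd.takeOrdered(2)
top_pairs = heapq.nlargest(2, pairs)         # analogous to inputRDD.top(2)
value_counts = dict(Counter(numbers))        # analogous to listRdd.countByValue()

print(top_numbers)             # [5, 4]
print(smallest)                # [1, 2]
print(top_pairs)               # [('Z', 1), ('C', 40)]
print(min(pairs), max(pairs))  # ('A', 20) ('Z', 1)
print(value_counts)            # {1: 1, 2: 2, 3: 2, 4: 1, 5: 1}
print(len(set(pairs)))         # 5 distinct tuples, matching countApproxDistinct here
```

The exact distinct count is shown only for comparison; countApproxDistinct is an estimate and is not guaranteed to match on larger data.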
