
# Compute Average of Numbers in an RDD using PySpark

This notebook demonstrates how to calculate the average of a list of numbers using PySpark. 
The function `compute_average` takes a `SparkContext` and a list of numbers, parallelizes the list into an RDD, 
maps each number to a `(sum, count)` pair under a single shared key, reduces these pairs, and finally divides the total sum by the total count.

## Steps Involved
1. Parallelize the list into an RDD.
2. Map each number to a `(1, (number, 1))` pair.
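3. Use `reduceByKey` to add up the values and the counts. Every pair shares the key `1`, so the result is a single `(sum, count)` pair.
4. Calculate the average by dividing the total sum by the count, then collect the result, which has the form `[(1, average)]`.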
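
The steps above can be traced without a Spark cluster. The following is a minimal pure-Python sketch, assuming `functools.reduce` as a stand-in for `reduceByKey` on the single key `1`. The name `average_via_pairs` is hypothetical and not part of the notebook, and returning `None` for an empty list is a choice made for this sketch only.

```python
from functools import reduce

def average_via_pairs(numbers):
    # Step 2: pair every number with the constant key 1 and a count of 1
    keyed = [(1, (x, 1)) for x in numbers]
    if not keyed:
        return None  # assumption: no elements means no average is defined

    # Step 3: merge the (sum, count) payloads, as reduceByKey would for one key
    total, count = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]),
        (payload for _key, payload in keyed),
    )

    # Step 4: divide the total sum by the count
    return total / count

print(average_via_pairs([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 5.5
```

Carrying the count alongside the sum is what makes the merge safe to apply in any order: each intermediate `(sum, count)` pair can be combined with any other, which is the property `reduceByKey` relies on when it merges partial results across partitions.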


In [None]:

from pyspark import SparkContext

def compute_average(sc, numbers):
    # Parallelize the input list into an RDD
    rdd = sc.parallelize(numbers)
    
    # Map each number to a (1, (number, 1)) pair: a constant key with a (value, count) payload
    rdd_mapped = rdd.map(lambda x: (1, (x, 1)))
    
    # Reduce by summing up the values and counts
    reduced_rdd = rdd_mapped.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    
    # Compute the average by dividing sum by count
    avg = reduced_rdd.mapValues(lambda x: x[0] / x[1])
    
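    # Collect the result to the driver; with the single key 1 this is a
    # one-element list such as [(1, 5.5)], or [] when the input is empty
    result = avg.collect()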
    return result

# Example usage
if __name__ == "__main__":
    sc = SparkContext.getOrCreate()
    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    result = compute_average(sc, numbers)
    print(result)
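
As written, `compute_average` returns the collected list of `(key, average)` pairs rather than a bare number. One way to extract the float is a small helper like the one below. The name `unwrap_average` is hypothetical and not part of the notebook, and mapping an empty result to `None` is an assumption of this sketch.

```python
def unwrap_average(result):
    """Return the average from a [(key, average)] list, or None if it is empty."""
    if not result:
        return None
    _key, average = result[0]
    return average

print(unwrap_average([(1, 5.5)]))  # 5.5
print(unwrap_average([]))          # None
```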
