
In [None]:
!pip install pyspark



**Spark RDD**

RDD (Resilient Distributed Dataset) is a core building block of PySpark. It is a fault-tolerant, immutable, distributed collection of objects. Immutable means that once you create an RDD, you cannot change it; transformations return a new RDD instead. The data within an RDD is split into logical partitions, allowing for distributed computation across multiple nodes within the cluster.





In [None]:
#create a SparkSession, access its SparkContext, and build an RDD from a list

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark_RDD_examples").getOrCreate()

print(spark)
print(spark.sparkContext.appName)
print(spark.sparkContext.getConf())
print(spark.sparkContext)

#create rdd
rdd = spark.sparkContext.parallelize([1,2,3,4,5])
#print RDD
print(rdd)
#collect RDD
print(rdd.collect())
print(rdd.count())




<pyspark.sql.session.SparkSession object at 0x7fa62271ab90>
Spark_RDD_examples
<pyspark.conf.SparkConf object at 0x7fa63aede690>
<SparkContext master=local[*] appName=Spark_RDD_examples>
ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289
[1, 2, 3, 4, 5]
5


In [None]:
data = [1,2,3,4,5,6,7,8,9,10,11,12]

rdd = spark.sparkContext.parallelize(data)

print(rdd.collect())

print(rdd.count())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
12


In [None]:
#empty RDD
# emptyRDD is a method, so it must be called; without the parentheses
# the variable holds the bound method instead of an RDD
rdd = spark.sparkContext.emptyRDD()

print(rdd.collect())
print(rdd.isEmpty())



[]
True


#RDD Transformations


Apache Spark's Resilient Distributed Datasets (RDDs) support two primary types of operations: transformations and actions. Transformations are lazy operations that define a new RDD from an existing one, while actions trigger the execution of these transformations and return results.
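
As a rough analogy only (plain Python, not Spark), the built-in `map` is lazy in a similar spirit: building the pipeline runs nothing, and the work happens only when a consumer, playing the role of an action, asks for results. The names `traced_square` and `calls` are hypothetical and exist only for this illustration.

```python
# Plain-Python analogy for lazy evaluation; this is not Spark code.
calls = []  # records every time the function actually runs


def traced_square(x):
    calls.append(x)
    return x * x


# "Transformation": builds a lazy pipeline, nothing is computed yet.
pipeline = map(traced_square, [1, 2, 3])
calls_before_action = list(calls)

# "Action": consuming the pipeline forces the computation.
result = list(pipeline)

print(calls_before_action)  # []
print(result)               # [1, 4, 9]
```

In Spark the same idea applies at cluster scale: `map()` or `filter()` only record what to do, and an action such as `collect()` or `count()` triggers the job.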

**RDD Transformations**

Transformations are operations that create a new RDD from an existing one. They are evaluated lazily, meaning computation is deferred until an action requires the result. Transformations can be categorized into:

**Narrow Transformations:** Each output partition depends on a single input partition (e.g., map, filter).

**Wide Transformations:** Output partitions depend on multiple input partitions, often requiring data shuffling across the cluster (e.g., reduceByKey, join).
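
The distinction can be sketched with a small plain-Python model in which partitions are ordinary lists. This is an illustration under simplifying assumptions, not Spark's implementation: `shuffle_by_key` and its toy partitioner (routing by the key's first character) are hypothetical.

```python
# Plain-Python model of partitions; an illustration, not Spark's implementation.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow (like mapValues): each output partition is built from exactly one
# input partition, so no data moves between partitions.
narrow = [[(k, v * 10) for k, v in part] for part in partitions]


# Wide (like reduceByKey): records with the same key must meet in the same
# output partition, so any input partition may feed any output partition.
def shuffle_by_key(parts, num_out):
    out = [[] for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            # Toy partitioner (assumption): route by the key's first character.
            out[ord(k[0]) % num_out].append((k, v))
    return out


shuffled = shuffle_by_key(partitions, 2)
print(narrow)    # [[('a', 10), ('b', 20)], [('a', 30), ('c', 40)]]
print(shuffled)  # [[('b', 2)], [('a', 1), ('a', 3), ('c', 4)]]
```

Both `("a", ...)` records started in different partitions and ended up together, which is the data movement that makes wide transformations more expensive.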

| Transformation                      | Description                                                               |
| ----------------------------------- | ------------------------------------------------------------------------- |
| `map(func)`                         | Applies a function to each element, returning a new RDD.                  |
| `flatMap(func)`                     | Similar to `map`, but can return multiple output elements for each input. |
| `filter(func)`                      | Returns elements that satisfy the predicate function.                     |
| `distinct()`                        | Removes duplicate elements.                                               |
| `union(otherRDD)`                   | Returns the union of two RDDs.                                            |
| `intersection(otherRDD)`            | Returns the intersection of two RDDs.                                     |
| `subtract(otherRDD)`                | Returns elements present in the first RDD but not in the second.          |
| `cartesian(otherRDD)`               | Returns the Cartesian product of two RDDs.                                |
| `groupByKey()`                      | Groups values with the same key.                                          |
| `reduceByKey(func)`                 | Merges values with the same key using the specified function.             |
| `sortByKey()`                       | Sorts RDD by key.                                                         |
| `join(otherRDD)`                    | Joins two RDDs by key.                                                    |
| `coalesce(numPartitions)`           | Reduces the number of partitions.                                         |
| `repartition(numPartitions)`        | Reshuffles data into a specified number of partitions.                    |
| `pipe(command)`                     | Pipes each partition through an external command.                         |
| `mapPartitions(func)`               | Applies a function to each partition.                                     |
| `sample(withReplacement, fraction)` | Samples a fraction of the data.                                           |

Sources: [sparkcodehub.com][1], [LinkedIn][2], [Apache Spark][3], [Stack Overflow][4], [DataFlair][5], [Medium][6], [Stack Overflow][7]

[1]: https://www.sparkcodehub.com/spark/rdd/transformations "Mastering Apache Spark RDD Transformations - SparkCodeHub"
[2]: https://www.linkedin.com/pulse/spark-transformations-actions-lazy-evaluation-mohammad-younus-jameel "Spark Transformations, Actions and Lazy Evaluation. - LinkedIn"
[3]: https://spark.apache.org/docs/latest/rdd-programming-guide.html "RDD Programming Guide - Spark 3.5.5 Documentation"
[4]: https://stackoverflow.com/questions/45908291/rdd-transformation-and-actions "RDD transformation and actions - apache spark - Stack Overflow"
[5]: https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/ "Spark RDD Operations-Transformation & Action with Example"
[6]: https://medium.com/%40sujathamudadla1213/spark-transformations-and-actions-ff4b576cbef8 "Spark RDD Transformations and Actions. | by Sujatha Mudadla"
[7]: https://stackoverflow.com/questions/78722890/where-can-i-find-an-exhaustive-list-of-actions-for-spark "Where can I find an exhaustive list of actions for spark?"



In [2]:
#RDD Transformations

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark_RDD_examples").getOrCreate()

rdd = spark.sparkContext.parallelize([1,2,3,4,5,6,7,8,9,10,11,12])

print(rdd.collect())



[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


In [6]:
#map: Applies a function to each element, returning a new RDD.

# Create an RDD from a Python list
numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Use the map transformation to square each element
squared_numbers = numbers.map(lambda x: x ** 2)

# map can also change the element type, here turning each number into an (x, 1) pair
num = numbers.map(lambda x: (x, 1))

print(num.collect())
# Collect the results and print
print(squared_numbers.collect())


[(1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
[1, 4, 9, 16, 25]


In [21]:
#flatMap
# flatMap() applies a function to each element and flattens the results into a single RDD.
# It is often used to split elements into multiple parts, such as splitting sentences into words.
#🔍 Key differences from map()
""" map() would return a list of lists: one list per sentence.
flatMap() flattens those lists into a single list of words."""

# Create an RDD of sentences
sentences = spark.sparkContext.parallelize([
    "Apache Spark is fast",
    "It supports many operations"
])

# Use flatMap to split each sentence into words
words = sentences.flatMap(lambda sentence: sentence.split())

w = sentences.flatMap(lambda x: x.split())

# foreach() is an action that returns None, so the outer print shows None.
# The inner print runs in the worker processes, so its output does not appear in the notebook here.
print(w.foreach(lambda x : print(x)))

# Collect and print the words
print(words.collect())

None
['Apache', 'Spark', 'is', 'fast', 'It', 'supports', 'many', 'operations']


In [23]:
#filter
"""
Returns elements that satisfy the predicate function.
Common uses:
- Filtering out null or empty values in datasets
- Selecting records that meet certain criteria (e.g., age > 30)
- Removing bad or incomplete records
"""

words = spark.sparkContext.parallelize(["cat", "elephant", "rat", "dog", "giraffe"])

# Filter words with length greater than 3
long_words = words.filter(lambda word: len(word) > 3)

# Collect and print the result
print(long_words.collect())


filt = words.filter(lambda x: x != "cat")

print(filt.collect())

fil_cat = words.filter(lambda x: x == "cat")

print(fil_cat.collect())

diff_char = words.filter(lambda x: x != "cat" and x != "dog")

print(diff_char.collect())

['elephant', 'giraffe']
['elephant', 'rat', 'dog', 'giraffe']
['cat']
['elephant', 'rat', 'giraffe']


In [28]:
#distinct
"""
Removes duplicate elements.

The distinct() transformation removes duplicate elements from an RDD. It returns a new RDD that contains only the unique elements.
distinct() involves a shuffle operation, which may impact performance for large datasets. In PySpark it is implemented with map()
and reduceByKey(), so the same pattern can be used directly for custom deduplication (for example, by a chosen key).
The result is not ordered; use sortBy() if order matters.

"""

mylist = [1,1,2,2,3,3,4,5,6,7]

rdd = spark.sparkContext.parallelize(mylist)

print(rdd.collect())

print(rdd.distinct().collect())

print(rdd.sortBy(lambda x: x).collect())

print(rdd.distinct().sortBy(lambda x: x, ascending=True).collect())



[1, 1, 2, 2, 3, 3, 4, 5, 6, 7]
[2, 4, 6, 1, 3, 5, 7]
[1, 1, 2, 2, 3, 3, 4, 5, 6, 7]
[1, 2, 3, 4, 5, 6, 7]
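
The custom deduplication mentioned in the `distinct()` notes can be sketched in plain Python as "keep one record per key". This is a model, not Spark code: `dedupe_by_key` and the sample `records` are hypothetical. In PySpark the analogous pattern would be roughly `rdd.map(lambda r: (key(r), r)).reduceByKey(lambda a, b: a).values()`, where which record survives per key is not guaranteed.

```python
# Plain-Python model of key-based deduplication; not Spark code.
records = [("alice", 30), ("bob", 25), ("alice", 31), ("carol", 40)]


def dedupe_by_key(items, key):
    kept = {}
    for item in items:
        # setdefault keeps the first record seen for each key, which plays
        # the role of reduceByKey(lambda a, b: a) in this single-process model.
        kept.setdefault(key(item), item)
    return list(kept.values())


unique_by_name = dedupe_by_key(records, key=lambda r: r[0])
print(unique_by_name)  # [('alice', 30), ('bob', 25), ('carol', 40)]
```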


In [32]:
#union
"""
The union() transformation in Apache Spark combines two RDDs into a single RDD that contains all elements from both.

It does not remove duplicates — use distinct() afterward if needed.
Both RDDs should have the same data type.

Returns the union of two RDDs.

"""

rdd1 = spark.sparkContext.parallelize([1, 2, 3])
rdd2 = spark.sparkContext.parallelize([3, 4, 5])

union_rdd = rdd1.union(rdd2)

print(union_rdd.collect())

print("after distinct")
distinct_combined = union_rdd.distinct()
print(distinct_combined.sortBy(lambda x: x, ascending=True).collect())
# Output: [1, 2, 3, 4, 5]



[1, 2, 3, 3, 4, 5]
after distinct
[1, 2, 3, 4, 5]


In [33]:
#intersection
"""
The intersection() transformation in Spark is used to return a new RDD that contains the common elements between two RDDs (i.e., the set intersection).

rdd1.intersection(rdd2): Compares both RDDs and keeps only the elements that appear in both.
It removes duplicates automatically, just like a mathematical set intersection.

intersection() can trigger a shuffle operation, which may be costly for large datasets.
Use it only when necessary, especially in distributed environments.
"""
rdd1 = spark.sparkContext.parallelize([1, 2, 3,4])
rdd2 = spark.sparkContext.parallelize([3, 4, 5,6])

intersection_rdd = rdd1.intersection(rdd2)

print(intersection_rdd.collect())  # Contains 3 and 4; the order is not guaranteed




[4, 3]


In [36]:
#subtract
"""
In Apache Spark RDDs, subtract() is a transformation used to remove elements present in another RDD. It performs a set difference operation.

Definition: Returns an RDD with elements from the first RDD that are not in the second RDD.
Use case: Filtering out unwanted data or comparing datasets.

"""

rdd1 = spark.sparkContext.parallelize([1, 2, 3,4,5])
rdd2 = spark.sparkContext.parallelize([3, 4, 5,6,7])

subt_rdd = rdd1.subtract(rdd2)

print(subt_rdd.collect())

"""
rdd1: [1, 2, 3, 4, 5]
rdd2: [3, 4, 5, 6, 7]
rdd1.subtract(rdd2): Returns elements in rdd1 that are not in rdd2 → [1, 2]
"""

[1, 2]



In [41]:
#cartesian
"""
The cartesian() transformation in Apache Spark returns the Cartesian product of two RDDs — meaning it returns all possible pairs of elements, one from each RDD.

rdd1.cartesian(rdd2) pairs each element in rdd1 with every element in rdd2.
Result is a new RDD with tuple pairs.

"""


rdd1 = spark.sparkContext.parallelize([1,2])
rdd2 = spark.sparkContext.parallelize(['a','b','c'])

cartesian_rdd = rdd1.cartesian(rdd2)

print(cartesian_rdd.collect())



[(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'b'), (2, 'c')]


In [60]:
#groupByKey: Groups values with the same key.

"""
 groupBy() (or more commonly, groupByKey() for key-value RDDs) is a transformation that groups elements sharing the same key.
 It does not perform aggregation — it simply groups values, and then you can perform operations like sum, count, etc., afterward.

 Used on (key, value) pair RDDs.
 Groups all values with the same key into a single list.
 Often followed by a function like mapValues() or reduce() to perform aggregation (e.g., add).

 """

rdd = spark.sparkContext.parallelize([
    ("apple", 3),
    ("banana", 2),
    ("apple", 4),
    ("banana", 1),
    ("orange", 5) ])

group_rdd = rdd.groupByKey()

print(group_rdd.collect())

print("\n group key values")
for key, values in group_rdd.collect():
    print(f"Key: {key}, Values: {list(values)}")

print("\n sum of values")
data = group_rdd.mapValues(lambda x: sum(x))

print(data.collect())


[('apple', <pyspark.resultiterable.ResultIterable object at 0x7d5967461b10>), ('banana', <pyspark.resultiterable.ResultIterable object at 0x7d595326ca50>), ('orange', <pyspark.resultiterable.ResultIterable object at 0x7d5953359750>)]

 group key values
Key: apple, Values: [3, 4]
Key: banana, Values: [2, 1]
Key: orange, Values: [5]

 sum of values
[('apple', 7), ('banana', 3), ('orange', 5)]


In [69]:
#reduceByKey

"""
Merges values with the same key using the specified function.

reduceByKey is a transformation used on key-value (pair) RDDs in Apache Spark. It merges the values for each key using a specified reduce function.

This is useful when you want to aggregate data by key, like summing numbers for each word or category.

rdd.reduceByKey(func)

"""

rdd = spark.sparkContext.parallelize([
    ("apple", 1),
    ("banana", 2),
    ("apple", 4),
    ("banana", 1),
    ("orange", 5) ])


red_rdd = rdd.reduceByKey(lambda x,y: x+y)

print(red_rdd.collect())

#word count

a = "hello this is a word count hello this is a word count"
print('\n', a)

word = spark.sparkContext.parallelize(a.split())
#split
print("\n", word.collect())
#key& value
word_count = word.map(lambda x: (x,1))

print('\n' , word_count.collect())

#group key and values

grp_rdd = word_count.reduceByKey(lambda x,y: x+y)

print("\n", grp_rdd.collect())

[('apple', 5), ('banana', 3), ('orange', 5)]

 hello this is a word count hello this is a word count

 ['hello', 'this', 'is', 'a', 'word', 'count', 'hello', 'this', 'is', 'a', 'word', 'count']

 [('hello', 1), ('this', 1), ('is', 1), ('a', 1), ('word', 1), ('count', 1), ('hello', 1), ('this', 1), ('is', 1), ('a', 1), ('word', 1), ('count', 1)]

 [('hello', 2), ('this', 2), ('word', 2), ('is', 2), ('a', 2), ('count', 2)]
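
Both `groupByKey()` followed by `mapValues(sum)` and `reduceByKey()` produce per-key totals, but `reduceByKey()` merges values inside each partition before the shuffle. The sketch below models that difference in plain Python. It is an illustration, not Spark's implementation, and `combine_locally` and the sample `partitions` are hypothetical.

```python
from operator import add

# Plain-Python model of map-side combining; not Spark's implementation.
partitions = [
    [("apple", 3), ("banana", 2), ("apple", 4)],
    [("banana", 1), ("orange", 5), ("banana", 6)],
]


def combine_locally(records, func):
    """Merge values per key within one collection of (key, value) records."""
    acc = {}
    for k, v in records:
        acc[k] = func(acc[k], v) if k in acc else v
    return list(acc.items())


# groupByKey-style: every record crosses the shuffle unchanged.
shuffled_group = [rec for part in partitions for rec in part]

# reduceByKey-style: at most one record per key per partition is shuffled.
shuffled_reduce = [
    rec for part in partitions for rec in combine_locally(part, add)
]

# After the shuffle, both approaches merge per key and reach the same totals.
totals_group = dict(combine_locally(shuffled_group, add))
totals_reduce = dict(combine_locally(shuffled_reduce, add))

print(len(shuffled_group), len(shuffled_reduce))  # 6 4
print(totals_group == totals_reduce)              # True
```

Fewer records crossing the shuffle is the usual reason to prefer `reduceByKey()` over `groupByKey()` when the goal is an aggregate.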


In [77]:
#sortBy: Sorts an RDD using a key function.

"""
sortByKey() sorts a (key, value) pair RDD by key. The examples below use the more general sortBy(),
a transformation that sorts the elements of any RDD by the value returned from a key function.

rdd.sortBy(keyfunc, ascending=True, numPartitions=None)

keyfunc: A function to extract the key for sorting.
ascending: Sort order (default is True).
numPartitions: Number of partitions after sort (optional).

"""

rdd1 = spark.sparkContext.parallelize([1,2,3,8,9,10,12,13])
rdd2 = spark.sparkContext.parallelize(["apple","abs","acd"])

sort_rdd = rdd1.sortBy(lambda x: x)

print(sort_rdd.collect())

sort_rdd2 = rdd2.sortBy(lambda x: x)

print(sort_rdd2.collect())

#sort by square
sort_rdd3 = rdd1.sortBy(lambda x: x ** 2)

print(sort_rdd3.collect())

sort_rdd4 = rdd1.sortBy(lambda x: x, ascending=False, numPartitions=2)

print(sort_rdd4.collect())

print(sort_rdd4.getNumPartitions())

sort_rdd5 = rdd1.sortBy(lambda x: x, ascending=True, numPartitions=1)

[1, 2, 3, 8, 9, 10, 12, 13]
['abs', 'acd', 'apple']
[1, 2, 3, 8, 9, 10, 12, 13]
[13, 12, 10, 9, 8, 3, 2, 1]
2


In [80]:
#join:Joins two RDDs by key.

"""
In Apache Spark, the join() transformation is used to combine two RDDs based on keys, similar to a SQL INNER JOIN.
It's used with key-value pair RDDs ((key, value) format).
join() is used to combine two key-value pair RDDs (i.e., RDDs of the form (key, value)) based on matching keys.

It performs an inner join by default — only keys that are present in both RDDs will appear in the result.

joined_rdd = rdd1.join(rdd2)
Where:

rdd1: RDD with pairs like (K, V1)
rdd2: RDD with pairs like (K, V2)
Result: RDD of (K, (V1, V2)) for keys common in both RDDs.

"""

rdd1 = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("c", 3)])
rdd2 = spark.sparkContext.parallelize([("a", "apple"), ("b", "banana"), ("d", "dragonfruit")])

joined_rdd = rdd1.join(rdd2)

print(joined_rdd.collect())

ljoin_rdd = rdd1.leftOuterJoin(rdd2)

print(ljoin_rdd.collect())

rjoin_rdd = rdd1.rightOuterJoin(rdd2)

print(rjoin_rdd.collect())

full_join_rdd = rdd1.fullOuterJoin(rdd2)

print(full_join_rdd.collect())


[('b', (2, 'banana')), ('a', (1, 'apple'))]
[('b', (2, 'banana')), ('c', (3, None)), ('a', (1, 'apple'))]
[('d', (None, 'dragonfruit')), ('b', (2, 'banana')), ('a', (1, 'apple'))]
[('d', (None, 'dragonfruit')), ('b', (2, 'banana')), ('c', (3, None)), ('a', (1, 'apple'))]
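
The join semantics above can be modeled in plain Python with a lookup table built from the right-hand side. This is a single-process sketch, not how Spark executes a join, and `build_index`, `inner_join` and `left_outer_join` are hypothetical helper names.

```python
from collections import defaultdict


def build_index(pairs):
    """Group the values of (key, value) pairs by key."""
    index = defaultdict(list)
    for k, v in pairs:
        index[k].append(v)
    return index


def inner_join(left, right):
    index = build_index(right)
    # Keys missing on the right produce no output rows.
    return [(k, (v, w)) for k, v in left for w in index.get(k, [])]


def left_outer_join(left, right):
    index = build_index(right)
    # Keys missing on the right are kept and paired with None.
    return [(k, (v, w)) for k, v in left for w in index.get(k, [None])]


left = [("a", 1), ("b", 2), ("c", 3)]
right = [("a", "apple"), ("b", "banana"), ("d", "dragonfruit")]

print(inner_join(left, right))
# [('a', (1, 'apple')), ('b', (2, 'banana'))]
print(left_outer_join(left, right))
# [('a', (1, 'apple')), ('b', (2, 'banana')), ('c', (3, None))]
```

Unlike this model, Spark does not guarantee the order of joined rows, which is why the collected results above appear in a different order.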


In [84]:
#coalesce(numPartitions): Reduces the number of partitions.

"""
The coalesce() transformation reduces the number of partitions in an RDD. It is commonly used for optimizing performance before saving to disk
(especially after wide transformations or when writing small output files).

RDD.coalesce(numPartitions, shuffle=False)
numPartitions: The number of partitions you want.
shuffle (optional): If True, allows reshuffling of data for better distribution. By default, it is False (no shuffle, just merges adjacent partitions).


"""

data = spark.sparkContext.parallelize(range(1,20),6) #creates an RDD with 6 partitions

print(data.getNumPartitions())
print(data.glom().collect())

coalesce_rdd = data.coalesce(2)  # reduce to 2 partitions

print(coalesce_rdd.getNumPartitions())
print(coalesce_rdd.glom().collect())


6
[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18, 19]]
2
[[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
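
The output above is consistent with `coalesce()` merging neighboring partitions without a shuffle. A simplified plain-Python model of that behavior is sketched below. It assumes contiguous grouping only, whereas real Spark also takes data locality into account, and `coalesce_model` is a hypothetical name.

```python
# Simplified model of coalesce(shuffle=False); real Spark also weighs locality.
def coalesce_model(partitions, num_out):
    n = len(partitions)
    out = []
    for i in range(num_out):
        # Each output partition takes a contiguous run of input partitions.
        start, end = i * n // num_out, (i + 1) * n // num_out
        merged = [x for part in partitions[start:end] for x in part]
        out.append(merged)
    return out


six = [[1, 2, 3], [4, 5, 6], [7, 8, 9],
       [10, 11, 12], [13, 14, 15], [16, 17, 18, 19]]
two = coalesce_model(six, 2)
print(two)
# [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
```

Because whole partitions are concatenated, no individual record has to be routed by key, which is why `coalesce()` is cheaper than a full shuffle.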


In [90]:
#repartition(numPartitions): Reshuffles data into a specified number of partitions.

"""
repartition(numPartitions) is a transformation in Spark that reshuffles the data to increase or decrease the number of partitions in an RDD.

It performs a full shuffle, which means it can be expensive in terms of performance, but useful for load balancing or preparing data for
further operations like joins or saves.

rdd.repartition(numPartitions)

When to Use repartition():

When you want more parallelism by increasing partitions
After a coalesce() call that reduced partitions too much
Before a write to evenly distribute data across output files
"""

data = spark.sparkContext.parallelize(range(1,20), numSlices=2)# 2 partitions

print(data.getNumPartitions())
print(data.glom().collect())

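repart_rdd = data.repartition(4)  # full shuffle into 4 partitions
# Note: PySpark moves data in serialized batches, so a tiny RDD like this one
# can end up heavily skewed after repartition() (see the output below).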

print('\n', repart_rdd.getNumPartitions())
print(repart_rdd.glom().collect())

"""
⚠️ Note
If you're reducing partitions (e.g. from 10 to 2), prefer coalesce() instead of repartition() — it's more efficient because it avoids full data shuffle.
"""


print('\n',repart_rdd.coalesce(2).glom().collect())



2
[[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]

 4
[[], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], [], []]

 [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], []]


In [98]:
#pipe(command): Pipes each partition through an external command.

"""
The pipe() transformation in Spark RDDs is used to run shell commands or external scripts on each RDD partition.
It allows piping partition data through an external process, often used when integrating with command-line tools or legacy systems.

📘 Syntax

rdd.pipe(command)

command: A string representing the shell command to run.
Each partition’s data is sent to the external command via standard input, and the command’s standard output becomes the new RDD.


"""

data = spark.sparkContext.parallelize([
    "hello this is a lines ",
    "this is new",
    "this is aswome" ], numSlices=2)

pip_rdd = data.pipe("grep this")


print(pip_rdd.collect())

"""
⚠️ Notes

The command runs independently on each partition.
The external tool must be installed and accessible on every executor node.
Data is streamed via stdin/stdout, so it's line-oriented.
It’s not portable across platforms (Windows/Linux) or clusters without consistent environment setup.

"""

['hello this is a lines ', 'this is new', 'this is aswome']



In [102]:
#mapPartitions(func): Applies a function to each partition.
"""

The mapPartitions() transformation applies a function to each partition of the RDD (not to each element like map()), which can be more efficient when working with large datasets or external connections like databases.

✅ Syntax
RDD.mapPartitions(f)
Why Use mapPartitions()?
Performance: Reduces function call overhead by applying once per partition.
Resource Sharing: Ideal when you need to open a DB connection or expensive resource once per partition, not for every record.

"""


data = spark.sparkContext.parallelize([1,2,3,4,5], numSlices=2)
print(data)


mp_rdd = data.mapPartitions(lambda x: [i*2 for i in x])

print(mp_rdd.collect())

print("\n using function:")
def double(partition):  # receives an iterator over one partition's elements
    return [i*2 for i in partition]

mp_rdd2 = data.mapPartitions(double)  # pass the function; Spark calls it once per partition

print('\n', mp_rdd2.collect())


ParallelCollectionRDD[782] at readRDDFromFile at PythonRDD.scala:289
[2, 4, 6, 8, 10]

 using function:

 [2, 4, 6, 8, 10]
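
The resource-sharing argument for `mapPartitions()` can be made concrete with a plain-Python model that counts how often an expensive resource is created. `FakeConnection` is a hypothetical stand-in for a real database client, and the nested lists stand in for RDD partitions.

```python
# Plain-Python model; FakeConnection is a hypothetical stand-in for a real client.
class FakeConnection:
    opened = 0  # counts how many connections were created

    def __init__(self):
        FakeConnection.opened += 1

    def lookup(self, x):
        return x * 2


def per_element(x):
    # map()-style: a new connection for every record.
    return FakeConnection().lookup(x)


def per_partition(partition):
    # mapPartitions()-style: one connection reused for the whole partition.
    conn = FakeConnection()
    for x in partition:
        yield conn.lookup(x)


partitions = [[1, 2], [3, 4, 5]]

map_result = [per_element(x) for part in partitions for x in part]
opens_with_map = FakeConnection.opened

FakeConnection.opened = 0
mp_result = [y for part in partitions for y in per_partition(part)]
opens_with_map_partitions = FakeConnection.opened

print(opens_with_map, opens_with_map_partitions)  # 5 2
print(map_result == mp_result)                    # True
```

Writing `per_partition` as a generator also avoids building a full list per partition, which matters when partitions are large.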


In [106]:
#sample(withReplacement, fraction): Samples a fraction of the data.

"""
📘 sample() in Spark RDDs
The sample() transformation in Apache Spark is used to extract a random sample from an RDD.

🧪 Syntax
RDD.sample(withReplacement, fraction, seed=None)
withReplacement (bool) – Can elements be selected more than once? (True = yes)
fraction (float) – Approximate fraction of the dataset to sample (e.g., 0.1 = 10%)
seed (int, optional) – Random seed for reproducibility

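If withReplacement=True, some elements may appear multiple times.
If withReplacement=False, each element appears at most once, so the result is a subset of the original data.
Because fraction is approximate, the sample size is not exactly fraction * count, and it varies between runs unless a seed is set.
Useful for testing and prototyping on large datasets; for stratified sampling by key, use sampleByKey().

Both withReplacement and fraction are required. Calling sample() without them raises:
TypeError: RDD.sample() missing 2 required positional arguments: 'withReplacement' and 'fraction'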
"""


data = spark.sparkContext.parallelize([1,2,3,4,5])

sample_rdd = data.sample(withReplacement=False, fraction=0.5)

print(sample_rdd.collect())

sample_rdd1 = data.sample(withReplacement=True, fraction=0.6)

print(sample_rdd1.collect())

[1, 3, 4, 5]
[4, 4]
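
The first result above contains four of five elements even though `fraction=0.5`. A plain-Python model of sampling without replacement shows why: each element is kept independently with probability `fraction`, so the size only averages out to `fraction * count`. This is a sketch of the idea, not Spark's sampler, and `bernoulli_sample` is a hypothetical name.

```python
import random


# Plain-Python model of sample(withReplacement=False, ...); not Spark's sampler.
def bernoulli_sample(data, fraction, seed=None):
    rng = random.Random(seed)
    # Keep each element independently with probability `fraction`.
    return [x for x in data if rng.random() < fraction]


data = list(range(1, 6))

# The same seed reproduces the same sample.
first = bernoulli_sample(data, 0.5, seed=42)
second = bernoulli_sample(data, 0.5, seed=42)
print(first == second)  # True

# Across many seeds the sample size varies instead of being fixed.
sizes = {len(bernoulli_sample(data, 0.5, seed=s)) for s in range(200)}
print(sorted(sizes))
```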
