
In [None]:
!pip install pyspark

## Checking Spark Installation & Creating SparkSession

In [7]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkArchitecture").getOrCreate()
print("Spark Version:", spark.version)

Spark Version: 3.5.5


## Creating SparkSession (Driver Side)

In [8]:
from pyspark.sql import SparkSession

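# getOrCreate() reuses the already-active session if there is one,
# rather than starting a second driver.
spark = SparkSession.builder.appName("DriverExample").getOrCreate()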

print("Driver is running and managing tasks.")

Driver is running and managing tasks.


## Running Tasks on Executors

In [9]:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

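rdd = sc.parallelize([1, 2, 3, 4, 5])  # Distribute a local list as an RDD
squared_rdd = rdd.map(lambda x: x**2)  # Lazy transformation: nothing runs yet

# collect() is an action: it triggers tasks on the executors and returns the results to the driver
print("RDD processed by Executors:", squared_rdd.collect())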

RDD processed by Executors: [1, 4, 9, 16, 25]


## Demonstrating Parallel Execution

In [10]:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(1, 11), numSlices=2)  # Data is split into 2 partitions
tasks = rdd.map(lambda x: (x, x**2))  # Lazy: one task per partition runs once collect() is called

print("Tasks executed on Executors:", tasks.collect())


Tasks executed on Executors: [(1, 1), (2, 4), (3, 9), (4, 16), (5, 25), (6, 36), (7, 49), (8, 64), (9, 81), (10, 100)]


## Running Spark in Local Mode

In [11]:
from pyspark.sql import SparkSession

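# "local[*]" runs Spark in a single JVM with as many worker threads as logical cores.
# If a session is already active, getOrCreate() returns it and this master setting is not applied.
spark = SparkSession.builder.master("local[*]").appName("LocalMode").getOrCreate()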
print("Spark is running in Local Mode")


Spark is running in Local Mode


## Creating and Processing an RDD

In [14]:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Creating an RDD from a Python list
rdd = sc.parallelize([1, 2, 3, 4, 5, 5])

squared_rdd = rdd.map(lambda x: x ** 2)  # Squaring each element
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)  # Filtering even numbers
mapped_rdd = rdd.map(lambda x: (x, x ** 3))  # Creating key-value pairs (x, x^3)
reduced_value = rdd.reduce(lambda x, y: x + y)  # Summing all elements
distinct_rdd = rdd.flatMap(lambda x: (x, x)).distinct()  # Emit each element twice, then drop all duplicates

print("RDD elements squared:", squared_rdd.collect())
print("Filtered (even numbers):", filtered_rdd.collect())
print("Mapped (x, x^3):", mapped_rdd.collect())
print("Sum of elements (reduce):", reduced_value)
print("Distinct elements (after flatMap and distinct):", distinct_rdd.collect())


# More actions (each one triggers a job and returns a value to the driver)
count = rdd.count()  # Counting elements in the RDD
first_element = rdd.first()  # Getting the first element
rdd_sum = rdd.sum()  # Computing the sum of all elements
rdd_max = rdd.max()  # Finding the max element
rdd_min = rdd.min()  # Finding the min element

print("Count of elements:", count)
print("First element:", first_element)
print("Sum of RDD elements:", rdd_sum)
print("Max element:", rdd_max)
print("Min element:", rdd_min)


RDD elements squared: [1, 4, 9, 16, 25, 25]
Filtered (even numbers): [2, 4]
Mapped (x, x^3): [(1, 1), (2, 8), (3, 27), (4, 64), (5, 125), (5, 125)]
Sum of elements (reduce): 20
Distinct elements (after flatMap and distinct): [2, 4, 1, 3, 5]
Count of elements: 6
First element: 1
Sum of RDD elements: 20
Max element: 5
Min element: 1


## Creating and Displaying a DataFrame

In [15]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

df.show()


+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+


