
##Spark Day 4

##Tungsten Code

In [None]:
!pip install pyspark

In [6]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySparkAPI").getOrCreate()
data = [("Mayuresh", 25), ("Onkar", 30), ("Rohit", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

+--------+---+
|    Name|Age|
+--------+---+
|Mayuresh| 25|
|   Onkar| 30|
|   Rohit| 35|
+--------+---+



In [7]:
# Creating a DataFrame with a large dataset
big_data = [(i, f"User{i}", i * 2) for i in range(1, 100000)]
columns = ["ID", "Name", "Value"]
df = spark.createDataFrame(big_data, columns)
df.groupBy("Name").sum("Value").show()

+--------+----------+
|    Name|sum(Value)|
+--------+----------+
| User285|       570|
| User509|      1018|
| User958|      1916|
|User1212|      2424|
|User1292|      2584|
|User1346|      2692|
|User1690|      3380|
|User2093|      4186|
|User2757|      5514|
|User2782|      5564|
|User2977|      5954|
|User3131|      6262|
|User3176|      6352|
|User3403|      6806|
|User3819|      7638|
|User3839|      7678|
|User3991|      7982|
|User4271|      8542|
|User4494|      8988|
|User4535|      9070|
+--------+----------+
only showing top 20 rows



##After Tungsten (With Optimization)

In [8]:
df.groupBy("Name").sum("Value").explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[Name#27], functions=[sum(Value#28L)])
   +- Exchange hashpartitioning(Name#27, 200), ENSURE_REQUIREMENTS, [plan_id=85]
      +- HashAggregate(keys=[Name#27], functions=[partial_sum(Value#28L)])
         +- Project [Name#27, Value#28L]
            +- Scan ExistingRDD[ID#26L,Name#27,Value#28L]
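The plan above reads bottom to top: a first `HashAggregate` computes `partial_sum` inside each partition, the `Exchange` moves rows so that equal `Name` values land in the same partition, and a second `HashAggregate` combines the partial sums. The following is a minimal pure-Python sketch of that two-phase pattern, not Spark's actual implementation. The function names and the tiny input are hypothetical, and the input deliberately repeats keys (unlike the notebook's unique `User{i}` names) so that the partial step has something to combine.

```python
from collections import defaultdict


def partial_sums(partition):
    """Partial aggregation: sum values per key inside a single partition."""
    totals = defaultdict(int)
    for name, value in partition:
        totals[name] += value
    return dict(totals)


def shuffle_by_key(partials, num_partitions):
    """Exchange: send each (key, partial sum) to a bucket chosen by hashing the key."""
    buckets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for name, subtotal in partial.items():
            buckets[hash(name) % num_partitions].append((name, subtotal))
    return buckets


def final_sums(bucket):
    """Final aggregation: merge the partial sums that arrived in one bucket."""
    totals = defaultdict(int)
    for name, subtotal in bucket:
        totals[name] += subtotal
    return dict(totals)


# Hypothetical input: two partitions, with key "A" present in both.
partitions = [
    [("A", 1), ("B", 2), ("A", 3)],
    [("A", 10), ("C", 5)],
]

partials = [partial_sums(p) for p in partitions]
buckets = shuffle_by_key(partials, num_partitions=2)

result = {}
for bucket in buckets:
    result.update(final_sums(bucket))

print(result)
```

Because every key is routed to exactly one bucket, the final step never sees the same key in two places. Only one partial sum per key per partition crosses the shuffle, rather than every row.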




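####Tungsten reduces execution overhead by compiling chains of operators into single JVM functions (whole-stage code generation) and improves memory usage by keeping rows in a compact binary format instead of as JVM objects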

##Catalyst Optimizer Code

In [10]:
from pyspark.sql.functions import col
data = [("Alice", 25, "Engineer"), ("Bob", 30, "Doctor"), ("Charlie", 35, "Teacher")]
columns = ["Name", "Age", "Profession"]
df = spark.createDataFrame(data, columns)
filtered_df = df.filter(col("Age") > 28)
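# Show the physical plan selected after Catalyst optimization
filtered_df.explain()
###Catalyst adds an isnotnull(Age) check and applies the filter directly above the scan, so rows are dropped as early as possible; an in-memory RDD source cannot take the filter into the scan itself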

== Physical Plan ==
*(1) Filter (isnotnull(Age#67L) AND (Age#67L > 28))
+- *(1) Scan ExistingRDD[Name#66,Age#67L,Profession#68]


