In [0]:
'''
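Problem: given the input dataset below, write a Databricks PySpark script that produces the output dataset.
For each employee, nexthighest is the next-lower salary in the same department (0 if there is none),
and saldiff = sal - nexthighest.

Input Dataset :

empno   name      sal    Deptno
1       Radha     3000   10
2       Kirshna   2000   10
3       rama      1000   10
1       Venkata   6000   20
2       Laxmana   4000   20
3       Laxmi     2000   20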
Output Dataset : 


empno   name      sal    Deptno   nexthighest   saldiff
1       Radha     3000   10       2000          1000
2       Kirshna   2000   10       1000          1000
3       rama      1000   10       0             1000
1       Venkata   6000   20       4000          2000
2       Laxmana   4000   20       2000          2000
3       Laxmi     2000   20       0             2000

'''
from pyspark.sql.functions import col, lead
from pyspark.sql.window import Window

data = [
    (1, "Radha", 3000, 10),
    (2, "Kirshna", 2000, 10),
    (3, "rama", 1000, 10),
    (1, "Venkata", 6000, 20),
    (2, "Laxmana", 4000, 20),
    (3, "Laxmi", 2000, 20)
]

columns = ["empno", "name", "sal", "Deptno"]
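# `spark` is the SparkSession that Databricks notebooks provide automatically
df = spark.createDataFrame(data, columns)

# Within each department, order employees by salary, highest first
window_spec = Window.partitionBy("Deptno").orderBy(col("sal").desc())
# Next-lower salary in the same department, taken from the following row via `lead`
df_with_next_highest = df.withColumn("nexthighest", lead("sal").over(window_spec))
# The lowest-paid employee in each department has no following row; use 0 instead of null
df_with_next_highest = df_with_next_highest.fillna({"nexthighest": 0})
# Difference between each salary and the next-lower one
df_final = df_with_next_highest.withColumn("saldiff", col("sal") - col("nexthighest"))
df_final.show(truncate=False)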


+-----+-------+----+------+-----------+-------+
|empno|name   |sal |Deptno|nexthighest|saldiff|
+-----+-------+----+------+-----------+-------+
|1    |Radha  |3000|10    |2000       |1000   |
|2    |Kirshna|2000|10    |1000       |1000   |
|3    |rama   |1000|10    |0          |1000   |
|1    |Venkata|6000|20    |4000       |2000   |
|2    |Laxmana|4000|20    |2000       |2000   |
|3    |Laxmi  |2000|20    |0          |2000   |
+-----+-------+----+------+-----------+-------+
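
The fillna step in the script can also be folded into lead itself, which accepts a default value as its third argument. The sketch below wraps that variant in a function and pairs it with a plain-Python reference for cross-checking small inputs. The names add_next_highest and next_highest_reference are hypothetical helpers, not part of the original script, and the PySpark imports are done inside the function only so that the reference can run where Spark is not installed.

```python
from collections import defaultdict


def add_next_highest(df):
    """PySpark variant: lead(..., 1, 0) supplies 0 directly, so no fillna is needed."""
    # Imported lazily (an illustrative choice) so this module loads without Spark
    from pyspark.sql.functions import col, lead
    from pyspark.sql.window import Window

    window_spec = Window.partitionBy("Deptno").orderBy(col("sal").desc())
    return (
        df.withColumn("nexthighest", lead("sal", 1, 0).over(window_spec))
          .withColumn("saldiff", col("sal") - col("nexthighest"))
    )


def next_highest_reference(rows):
    """Plain-Python reference; rows are (empno, name, sal, deptno) tuples."""
    by_dept = defaultdict(list)
    for row in rows:
        by_dept[row[3]].append(row)

    result = []
    for deptno in sorted(by_dept):
        # Highest salary first, mirroring orderBy(col("sal").desc())
        ordered = sorted(by_dept[deptno], key=lambda r: r[2], reverse=True)
        for i, (empno, name, sal, dept) in enumerate(ordered):
            # The last row of each department has no successor, so use 0
            nxt = ordered[i + 1][2] if i + 1 < len(ordered) else 0
            result.append((empno, name, sal, dept, nxt, sal - nxt))
    return result
```

In a notebook, add_next_highest(df).show(truncate=False) should print the same columns as the table above, although Spark does not guarantee the row order of show().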



In [0]:
'''
1. How to Improve Data Load Performance in S3 or Delta Lake in Databricks
To optimize the loading of data into S3 or Delta Lake, consider the following strategies:

a) Partitioning and Bucketing
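Partitioning: Split large tables by a low-cardinality key that queries filter on (e.g., date or region) so that only the matching partitions are scanned.
Bucketing: Hash-based grouping into a fixed number of buckets can reduce shuffling in joins and aggregations. It applies to Parquet/Hive-style tables; Delta tables do not support bucketing, so use Z-Ordering (see e) there.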
b) Optimize File Sizes
Aim for file sizes of roughly 128 MB to 1 GB. Many small files slow down reads because each file adds listing, open, and metadata overhead (the small-file problem).
c) Auto Optimize
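Enable Auto Optimize in Databricks so that Delta Lake writes fewer, larger files (optimized writes) and compacts small files after each write (auto compaction).

SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;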
d) Data Skipping
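Delta Lake collects per-file min/max statistics on write (by default for the first 32 columns) and uses them to skip files that cannot match a query filter. Keep frequently filtered columns within that range.
e) Z-Ordering
Use Z-Ordering to co-locate related data in the same files. This helps queries that filter on the Z-Ordered columns, because data skipping can then exclude more files.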

OPTIMIZE delta_table_name ZORDER BY (column1, column2);
f) Caching
Cache frequently accessed data with Spark caching (for example, df.cache()) or the Databricks disk cache.
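
The file-size guidance in b) can be turned into a rough calculation: divide the estimated data size by a target file size and repartition to that many files before writing. The sketch below assumes the caller supplies an estimate of the total size in bytes; target_file_count and write_with_target_file_size are hypothetical helper names, and the 256 MB default is simply one value inside the 128 MB to 1 GB range. With optimized writes enabled, Databricks may handle this sizing on its own.

```python
import math

MB = 1024 * 1024


def target_file_count(total_bytes, target_file_bytes=256 * MB):
    """Number of output files so that each is roughly target_file_bytes (at least 1)."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_with_target_file_size(df, path, total_bytes):
    """Append df to a Delta table at path using about one file per 256 MB of data.

    total_bytes is an estimate provided by the caller (an assumption of this
    sketch); compression means the files on disk may come out smaller.
    """
    num_files = target_file_count(total_bytes)
    # repartition(n) spreads rows evenly, and each partition becomes one file
    df.repartition(num_files).write.format("delta").mode("append").save(path)
```

For example, an estimated 10 GB of data with the default target gives 40 files.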
2. How to Improve Performance with Many Small Tables in Databricks
Small tables can degrade performance if not handled properly. Here is how to manage them:

a) Combining Small Tables
Combine small tables into larger tables or views to reduce metadata overhead.
b) Broadcast Joins
Use broadcast joins when one side of the join is small (10 MB is Spark's default threshold) so that the large side is not shuffled.

SET spark.sql.autoBroadcastJoinThreshold = 10485760; -- 10 MB threshold
c) Delta Lake Compaction
Compact small files using Delta’s OPTIMIZE command:

OPTIMIZE delta_table;
d) Databricks Auto Loader
Use Auto Loader for incremental loads if small files arrive frequently. It tracks which files it has already ingested, so each file is processed only once.
e) Caching and Partitioning
Cache frequently accessed small tables. Partition a table only when it is large enough to benefit, since partitioning a small table creates even more small files.
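
The broadcast join in b) can also be requested explicitly from PySpark instead of relying on the threshold alone. This is a minimal sketch: set_broadcast_threshold and join_with_broadcast are hypothetical helper names, the inner join is an arbitrary choice, and the import is done inside the function only so the constant can be checked where Spark is not installed.

```python
# 10 MB, the same value as the 10485760 used in the threshold setting
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def set_broadcast_threshold(spark, threshold_bytes=BROADCAST_THRESHOLD_BYTES):
    """Set the size below which Spark broadcasts a join side automatically."""
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(threshold_bytes))


def join_with_broadcast(large_df, small_df, key):
    """Join large_df to small_df on key, hinting Spark to broadcast small_df."""
    # Imported lazily (an illustrative choice) so this module loads without Spark
    from pyspark.sql.functions import broadcast

    # The hint ships small_df to every executor, so large_df is not shuffled
    return large_df.join(broadcast(small_df), on=key, how="inner")
```

The explicit hint is useful when Spark cannot estimate the size of the small side and would otherwise fall back to a shuffle join.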
3. Monitoring and Optimizing Databricks Job Performance
To monitor and optimize job performance, Databricks offers several tools:

a) Job Monitoring Dashboard
Use the Databricks Job Dashboard to view job run status, logs, and errors.
Navigate to Jobs > Click a specific job > Check Runs to monitor execution details.
b) Ganglia or Spark UI
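Spark UI provides detailed insights into job stages, task execution, and shuffle operations. Ganglia, on Databricks Runtime versions that still include it, shows cluster-wide CPU, memory, and network usage.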
c) Databricks Metrics
Use cluster metrics to monitor CPU, memory, and disk usage for each cluster node.
You can access them under the "Metrics" tab of the cluster that runs the job.
d) Delta Lake Transaction Logs
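Delta Lake maintains a transaction log that records every write and update. Inspect it with DESCRIBE HISTORY, which reports the operation, timestamp, and operation metrics (such as files and rows written) for each commit, to spot performance bottlenecks.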
e) Optimizing Jobs
Adaptive Query Execution (AQE): Enable AQE to optimize joins and shuffles at runtime.

SET spark.sql.adaptive.enabled = true;
Caching: Cache tables or intermediate results to reduce recomputation.
f) Data Compaction and Z-Ordering
Regularly compact small files and apply Z-Ordering as explained above.
g) Auto Termination
Configure Auto Termination for clusters to save costs on idle clusters.
h) Alerting
Set up failure notifications in the job settings to get notified by email or Slack when a job fails.
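
The AQE setting in e) and the transaction-log check in d) can both be driven from a notebook. This is a minimal sketch: enable_aqe and recent_commits are hypothetical helper names, and only the first key in AQE_SETTINGS comes from the notes above; the other two are standard Spark AQE options added here as an assumption about what is worth enabling.

```python
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    # Assumed additions: merge small shuffle partitions and split skewed ones
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def enable_aqe(spark):
    """Apply the AQE settings to the given SparkSession and return them."""
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)
    return dict(AQE_SETTINGS)


def recent_commits(spark, table_name, limit=10):
    """Return the latest commits of a Delta table with their operation metrics."""
    history = spark.sql(f"DESCRIBE HISTORY {table_name} LIMIT {int(limit)}")
    # operationMetrics is a map with per-commit counters such as files written
    return history.select("version", "timestamp", "operation", "operationMetrics")
```

In a notebook, recent_commits(spark, "delta_table").show(truncate=False) lists recent writes; commits that produce many small files are candidates for OPTIMIZE.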
These practices will help ensure efficient job monitoring and data load performance in Databricks.
'''