Open
Description
I've tried using Sail for local development of Spark jobs, but a simple query on a dataset of a few GB runs slower on Sail than on Spark.
Without the join, the query finishes within 10 seconds.
With the join, the resource monitor shows memory rising the whole time (2-3 minutes). It looks as if the work is executed only once all the data is in memory, and the memory is released afterwards.
CPU and disk usage are very low during that time (almost nonexistent).
The query, in pseudocode (session is an existing Spark session, some_columns is a list of column names):
df1 = session.read.parquet('big_dataset.parquet')
smalldf = session.read.parquet('small.parquet')
df2 = df1.select(*some_columns)
df_res = df2.join(smalldf, on='somecol', how='left')
df_res.write.parquet('result_dataset.parquet')
Is this a bug, or is there an additional configuration option to make Sail process the data in partitions, similar to how Spark runs it?
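For anyone trying to reproduce or narrow this down, here is a minimal sketch of the same query with a broadcast hint on the small side. `DataFrame.hint("broadcast")` is standard PySpark. Whether Sail honors the hint, or whether it changes the memory behavior described above, is an assumption and not something I have confirmed. The helper name `build_left_join` and its parameters are hypothetical and only wrap the pseudocode from this issue.

```python
def build_left_join(session, big_path, small_path, columns, key):
    """Build the left join from this issue with a broadcast hint on the small side.

    `session` is assumed to be an existing Spark session (for example one
    connected to Sail). Nothing is executed until an action such as
    `.write.parquet(...)` is called on the returned DataFrame.
    """
    # Read the large dataset and keep only the columns that are needed.
    big = session.read.parquet(big_path).select(*columns)
    # Read the small lookup dataset.
    small = session.read.parquet(small_path)
    # Hint that the small side can be broadcast instead of shuffled.
    # Assumption: the engine recognizes the "broadcast" hint.
    return big.join(small.hint("broadcast"), on=key, how="left")
```

Usage would look like `build_left_join(session, 'big_dataset.parquet', 'small.parquet', some_columns, 'somecol').write.parquet('result_dataset.parquet')`. Calling `.explain()` on the returned DataFrame before writing prints the query plan in PySpark, which may help show which join strategy was chosen.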