Slow wide table left outer join on local machine #301

@eredzik

I've tried using Sail for local development of Spark jobs, but a simple query on a dataset of a few GB runs slower on Sail than on Spark.
Without the join, the query finishes within 10 seconds.

With the join, the resource monitor shows memory usage rising steadily for 2-3 minutes. It looks as if all the data is loaded into memory first, and only then is the work executed and the memory released.
CPU and disk usage are very low during that time (almost nonexistent).

The query looks like this, in pseudocode:

df1 = session.read.parquet('big_dataset.parquet')
smalldf = session.read.parquet('small.parquet')
df2 = df1.select(*some_columns)  # some_columns: subset of the wide table's columns
df_res = df2.join(smalldf, on='somecol', how='left')
df_res.write.parquet('result_dataset.parquet')

Is this a bug, or is there an additional configuration option to make Sail process the data in partitions, similar to how Spark runs it?
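For illustration, the partition-wise behaviour asked about above can be sketched in plain Python. This is only a conceptual sketch and does not show how Sail or Spark implement joins. The function name `left_join_in_batches` and the sample rows are hypothetical. The idea is that the small table is loaded once into a lookup, while the big table is consumed one batch at a time, so peak memory depends on the batch size rather than on the size of the whole dataset.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def left_join_in_batches(
    big_batches: Iterable[List[Row]], small_rows: List[Row], key: str
) -> Iterator[List[Row]]:
    """Left outer join that holds only one batch of the big side at a time."""
    # Build the lookup from the small side once; null keys never match.
    lookup: Dict[object, List[Row]] = {}
    small_cols: List[str] = []
    for row in small_rows:
        if row[key] is not None:
            lookup.setdefault(row[key], []).append(row)
        for col in row:
            if col != key and col not in small_cols:
                small_cols.append(col)
    # Columns from the small side are filled with None for unmatched rows.
    empty = {col: None for col in small_cols}

    for batch in big_batches:
        out: List[Row] = []
        for row in batch:
            k = row[key]
            matches = lookup.get(k) if k is not None else None
            if not matches:
                out.append({**row, **empty})
            else:
                for m in matches:
                    out.append({**row, **{c: m.get(c) for c in small_cols}})
        # Hand the joined batch downstream before reading the next one.
        yield out


# Hypothetical sample data: two batches of the big table, one small table.
big = [
    [{"somecol": 1, "a": "x"}, {"somecol": 2, "a": "y"}],
    [{"somecol": 3, "a": "z"}],
]
small = [{"somecol": 1, "b": 10}, {"somecol": 3, "b": 30}]
result = [row for batch in left_join_in_batches(big, small, "somecol") for row in batch]
```

Each yielded batch could be written out before the next one is read, which is the streaming pattern the memory graph described above does not seem to follow.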
