Slow wide table left outer join on local machine

I've been trying Sail for local development of Spark jobs. However, a simple query on a dataset of a few GB runs slower on Sail than on Spark.
Without the join, the query finishes within 10 seconds.

With the join, the resource monitor shows memory rising the whole time (2-3 minutes). It looks as if the work is only executed, and the memory released, once all the data has been loaded into memory.
CPU and disk usage are very low during that time (almost nonexistent).

The query, in pseudocode, looks like this:
```
# `session` is an existing SparkSession; `some_columns` is a list of column names
df1 = session.read.parquet('big_dataset.parquet')  # wide table, a few GB
smalldf = session.read.parquet('small.parquet')
df2 = df1.select(*some_columns)
df_res = df2.join(smalldf, on='somecol', how='left')
df_res.write.parquet('result_dataset.parquet')
```
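
To put numbers on the with-join and without-join runs, a small timing wrapper along these lines could be used. This is only a sketch: `timed` is a hypothetical helper of my own, not part of Sail or PySpark, and it measures client-side wall-clock time only, not the memory usage seen in the resource monitor.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(label: str, action: Callable[[], T]) -> Tuple[T, float]:
    """Run `action` once, print its wall-clock duration, and return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return result, elapsed


# Hypothetical usage with the DataFrames from the query above
# (the 'no_join.parquet' output path is made up for illustration):
# timed("no join", lambda: df2.write.parquet("no_join.parquet"))
# timed("left join", lambda: df_res.write.parquet("result_dataset.parquet"))
```

Running both variants against the same engine, and then against Spark, should show how much of the slowdown comes from the join alone.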

Is this a bug, or is there an additional configuration option to make Sail process the data in partitions, similar to how Spark runs it?

Slow wide table left outer join on local machine #301
