
In-memory large dataframe processing #4

Open
romain-intel opened this issue Dec 2, 2019 · 10 comments
Labels
enhancement New feature or request

Comments

@romain-intel
Contributor

Metaflow tries to make the lives of data scientists easier; this sometimes means providing ways to optimize certain common but expensive operations. Processing large dataframes in memory can be difficult, and Metaflow could provide ways to do this more efficiently.

romain-intel added the enhancement label on Dec 2, 2019
@tduffy000

@romain-intel is the idea to support this locally or on an AWS instance?

Wondering whether the idea is to integrate more tightly with Apache Spark (via pyspark), or to find an approach like PyTorch's IterableDataset, where loading is split among workers and the data is loaded at model-training time (see the sketch below).
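For concreteness, a minimal sketch of the IterableDataset pattern mentioned above (not a Metaflow API; the parquet paths are placeholders). Each DataLoader worker reads only its own shard of files, so no single process has to hold the full dataframe:

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, IterableDataset


class ShardedParquetDataset(IterableDataset):
    """Each worker streams rows from its own subset of parquet files."""

    def __init__(self, paths):
        self.paths = paths  # hypothetical list of parquet shards

    def __iter__(self):
        info = torch.utils.data.get_worker_info()
        # In a worker process, take every num_workers-th file starting at this
        # worker's id; in the main process, iterate over everything.
        paths = self.paths if info is None else self.paths[info.id::info.num_workers]
        for path in paths:
            df = pd.read_parquet(path)
            for row in df.itertuples(index=False):
                yield row


# loader = DataLoader(ShardedParquetDataset(["part-0.parquet", "part-1.parquet"]),
#                     num_workers=4)
```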

I imagine the difficulty might be in the atomicity of a @step, given that a feature selection & engineering step would be wholly separated from the modeling step. I know from experience that there are still a lot of pandas fans out there.

Would be curious to hear your thoughts on this.

@savingoyal
Collaborator

@tduffy000 We have an in-house implementation of dataframe which provides faster primitive operations with a lower memory footprint than Pandas. It is supported both on a local instance and in the cloud. One can use this implementation inside a step or even outside of Metaflow (just like the metaflow.s3 client).
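For reference, a minimal sketch of how the metaflow.S3 client is used inside a step today; presumably the in-house dataframe mentioned above would follow a similar pattern (usable inside a @step or standalone). The bucket, prefix, and key below are placeholders:

```python
from io import BytesIO

import pandas as pd
from metaflow import FlowSpec, S3, step


class LargeDataFlow(FlowSpec):

    @step
    def start(self):
        # Fetch a parquet file from S3 and load it into a dataframe.
        with S3(s3root="s3://my-bucket/my-prefix/") as s3:
            obj = s3.get("data.parquet")
            self.df = pd.read_parquet(BytesIO(obj.blob))
        self.next(self.end)

    @step
    def end(self):
        print(self.df.shape)


if __name__ == "__main__":
    LargeDataFlow()
```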

@leftys

leftys commented Dec 15, 2019

Maybe Metaflow could somehow be combined with Dask, which supports bigger-than-memory dataframes, to solve this issue. I am not sure whether/how it would be possible to serialize and restore Dask's big, lazily evaluated dataframes between steps, though.
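One possible workaround, as a sketch only (paths are placeholders): materialize the lazy Dask dataframe to parquet at the end of a step, pass just the path as a Metaflow artifact, and re-open it lazily in the next step:

```python
import dask.dataframe as dd

# End of step A: materialize the lazy dataframe instead of pickling it.
df = dd.read_csv("s3://my-bucket/raw/*.csv")    # lazy, can be larger than memory
df = df[df["value"] > 0]                        # still lazy
df.to_parquet("s3://my-bucket/intermediate/")   # compute and persist
# Pass only the path between steps, e.g. self.data_path = "s3://my-bucket/intermediate/"

# Start of step B: re-open lazily and continue.
df = dd.read_parquet("s3://my-bucket/intermediate/")
print(df["value"].mean().compute())
```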

@juarezr

juarezr commented Feb 3, 2020

Maybe something like a dataflow transferred between steps, as in Bonobo.

Also, here is another example of a software product that uses datapickle and Dask to run clustered dataflows in the cloud.

@benjaminbluhm

I think the ability to use Apache Spark within Metaflow would be extremely useful. When your feature engineering workflow is written in pyspark, it's a pain to translate everything to pandas, and it's also hard to predict how well that will work on large datasets.

@tekumara

Would something like https://vaex.readthedocs.io/en/latest/index.html be a possible solution here?
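A minimal sketch of the Vaex approach (the file name is a placeholder): files are memory-mapped, so dataframes larger than RAM can be filtered and aggregated lazily:

```python
import vaex

df = vaex.open("big.hdf5")          # memory-mapped, not loaded into RAM
subset = df[df.value > 0]           # lazy filter
print(subset.mean(subset.value))    # streamed aggregation
```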

@crypdick

@savingoyal any update to release the dataframe implementation?

Also adding a mention of modin as a distributed drop-in replacement for pandas dataframes.
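Since modin is meant as a drop-in replacement, the change is mostly the import; a minimal sketch with a placeholder path and column names:

```python
import modin.pandas as pd  # drop-in replacement; pandas-style calls stay the same

df = pd.read_parquet("s3://my-bucket/big.parquet")  # distributed read
print(df.groupby("key")["value"].mean())
```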

@talebzeghmi

Another mention: the pandas API on Spark (https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html).
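A minimal sketch of the pandas-on-Spark suggestion (requires an active Spark session; the path and column names are placeholders):

```python
import pyspark.pandas as ps

psdf = ps.read_parquet("s3://my-bucket/big.parquet")   # backed by Spark, not local memory
print(psdf.groupby("key")["value"].mean().head())
```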

@jimmycfa

Agree the pandas-on-Spark reference by @talebzeghmi would be valuable, but you would still need a Spark context. I think being able to declare that your task runs on AWS Glue would potentially allow for both pandas on Spark and vanilla pyspark within a step.

@dsjoerg

dsjoerg commented Nov 1, 2023

@savingoyal any update to release the dataframe implementation?

Still interested! Would appreciate any update, especially if it's "yeah, we're not going to do this in the foreseeable future after all."
