PySpark persist(StorageLevel.MEMORY_AND_DISK) using kedro #2765
Comments
Hello @javier-rosas, sorry for the delay. In principle I don't see anything in your code that should fail, but I am not a PySpark expert. Could you please try it out and let us know if it works?
Hi @javier-rosas, I'm researching the use of Spark in combination with kedro and came across your issue. If I understand correctly, "the dataframes are too big for MemoryDataSet" indicates that the dataframe is retrieved from the Spark cluster to the driver program (which runs kedro), and the memory available to the driver is insufficient to hold the dataset. Using the PySpark StorageLevel class would result in the dataframe not being 'downloaded' to the driver program running kedro, but persisted on the Spark cluster itself. If so, then I suspect that using the catalog will always result in dataframes being 'sent to' and 'pulled from' the cluster by the driver program running kedro? (Unless you use dummy/memory datasets, as documented in https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#spark-and-delta-lake-interaction, i.e. data operations outside of the kedro DAG.) Any feedback welcome. Best regards,
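For reference, kedro's PySpark guidance suggests passing Spark DataFrames between nodes as in-memory datasets with assign copy mode, so the DataFrame reference is handed over without copying data to the driver. A minimal catalog sketch (the dataset name `intermediate_df` is hypothetical):

```yaml
# catalog.yml — sketch only; dataset name is illustrative
intermediate_df:
  type: MemoryDataset
  copy_mode: assign
```

With `copy_mode: assign`, kedro stores the object reference as-is instead of deep-copying it, which keeps the lazily evaluated Spark DataFrame on the cluster.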
Hi @javier-rosas, do you still need help resolving this issue?
I'm closing this issue now. Feel free to re-open if the problem persists!
Description

I need to persist data in memory using PySpark's StorageLevel class (`from pyspark import StorageLevel`). I am aware of the MemoryDataSet type, but I am running a Databricks cluster with Spark. Unfortunately, the dataframes are too big for MemoryDataSet, so I was hoping I could use PySpark's StorageLevel class.

Context
Here is an example of a possible implementation. Is this possible? Notice the use of `result.persist(StorageLevel.MEMORY_AND_DISK)` in the node functions.