
Local Spark not utilizing spark_config parameter from great_expectations.yml #1603

Closed
WesRoach opened this issue Jun 15, 2020 · 10 comments

@WesRoach
Contributor

WesRoach commented Jun 15, 2020

Describe the bug

Running Spark in local mode.

I've added the spark_config key-value dict to my Spark datasource in great_expectations.yml.

  • Setting the environment variable SPARK_DRIVER_MEMORY works: GE's Spark session picks it up.
  • Removing the env var and setting spark.driver.memory in the YAML instead results in a Java heap space error, i.e. the spark_config values are never applied (workaround sketched below).
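
A minimal sketch of that workaround, assuming the job is launched from a plain Python process (the DataContext call is illustrative):

import os

# SPARK_DRIVER_MEMORY is read by the Spark launcher before the driver JVM
# starts, so it must be set before any Spark (or GE) code creates a session.
os.environ["SPARK_DRIVER_MEMORY"] = "16g"

import great_expectations as ge

context = ge.data_context.DataContext()
# ... run the profiling / checkpoint job as usual; the driver now has 16g.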

My configuration is adapted from the example at https://docs.greatexpectations.io/en/latest/how_to_guides/configuring_datasources/how_to_configure_a_spark_filesystem_datasource.html:

datasources:
  my_data.parquet__dir:
    data_asset_type:
      module_name: great_expectations.dataset
      class_name: SparkDFDataset
    module_name: great_expectations.datasource
    spark_config: {
        "spark.local.dir": "/u01/data/spark-tmp",
        "spark.driver.memory": "16g"
    }
    class_name: SparkDFDatasource
    batch_kwargs_generators:
      subdir_reader:
        class_name: SubdirReaderBatchKwargsGenerator
        base_directory: <path-to-directory>

Also tried:

...
    spark_config: 
        spark.local.dir: /u01/data/spark-tmp
        spark.driver.memory: 16g
...

To Reproduce

  1. Start a GE Spark job (profiling, a checkpoint, etc.)
  2. Check Spark's Environment page (typically http://localhost:4040/environment/)
  3. Observe Spark's environment variables (or inspect the session programmatically, as sketched below)
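
The same check can be done without the Spark UI; a small sketch, assuming a session is already running in the process:

from pyspark.sql import SparkSession

# getOrCreate() attaches to the session GE created rather than making a new one.
spark = SparkSession.builder.getOrCreate()

# Dump the SparkContext configuration that was actually applied.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key} = {value}")

# With the bug present, spark.local.dir / spark.driver.memory from
# great_expectations.yml are missing from this list.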

Expected behavior

Expected GE's Spark Session to utilize spark_config from great_expectations.yml.

Environment (please complete the following information):

  • OS: RHEL7
  • GE Version: 0.11.1
  • PySpark: 2.4.5
  • Conda: 4.8.2

Edit: Added pyspark, conda versions.

@jcampbell
Member

@WesRoach -> I see this when using pyspark 3.0.0, which was just released today. Is that by chance the version you're running? I haven't looked yet to understand what changed.

@WesRoach
Contributor Author

@jcampbell No, sorry - I should have specified - pyspark 2.4.5

@jcampbell
Member

Ok. Any chance the cluster you're connected to is running Spark 3.0.0? I'm a bit confused because I see Spark 3.0 as an official release on GitHub and PyPI, but not on spark.apache.org. And it broke our CI tests (specifically for this feature) on release...

I'll attempt to reproduce again locally on my environment.

@WesRoach
Contributor Author

Running in local mode, spark.master: local[*]; it's a self-contained Spark instance running on a single multi-core machine, not a cluster.

@mgorsk1
Contributor

mgorsk1 commented Jul 10, 2020

@WesRoach any luck with this one? I've run into it as well, and it seems this option is not passed to the SparkDFDataset init method at all.

@jcampbell
Member

My colleague @alexsherstinsky looked into this more, and it appears related to when we open and (don't) close the SparkSession handle. He's got a patch in the works.
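
For illustration, here is a minimal sketch of that lifecycle problem in plain PySpark (not GE's actual code): once a session exists, getOrCreate() hands it back and cannot re-apply JVM-level settings such as spark.driver.memory.

from pyspark.sql import SparkSession

# First caller wins: this launches the driver JVM with 2g of heap.
first = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.memory", "2g")
    .getOrCreate()
)

# A later caller asking for 16g gets the existing session back instead.
second = (
    SparkSession.builder
    .config("spark.driver.memory", "16g")  # too late: the JVM is already up
    .getOrCreate()
)

assert first is second
# The underlying SparkContext still carries the original value:
print(second.sparkContext.getConf().get("spark.driver.memory"))  # "2g"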

@Dandandan
Contributor

Dandandan commented Jul 21, 2020

I think I'm bumping into this issue as well.
Any update on a patch, @jcampbell @alexsherstinsky?

It looks like spark_config is not passed to the __init__ of SparkDFDatasource.

@Dandandan
Contributor

I see this was also observed earlier by @mgorsk1.

@Dandandan
Contributor

Found the error: in DatasourceConfigSchema, adding

spark_config = fields.Raw(allow_none=True)

causes the value to be deserialized from the YAML and passed through to the DatasourceConfig object. Creating a PR now.
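
For context, a self-contained sketch of that kind of schema fix (the field list is illustrative, not GE's full DatasourceConfigSchema):

from marshmallow import EXCLUDE, Schema, fields

class DatasourceConfigSchema(Schema):
    """Illustrative stand-in for GE's schema."""

    class Meta:
        unknown = EXCLUDE  # undeclared keys are silently dropped on load()

    class_name = fields.String(required=True)
    module_name = fields.String(allow_none=True)
    # The one-line fix: declare the field so load() keeps it.
    spark_config = fields.Raw(allow_none=True)

loaded = DatasourceConfigSchema().load({
    "class_name": "SparkDFDatasource",
    "module_name": "great_expectations.datasource",
    "spark_config": {"spark.driver.memory": "16g"},
})
print(loaded["spark_config"])  # {'spark.driver.memory': '16g'}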

@Dandandan
Contributor

Created a PR for this; see #1713.
