[DOCS] DOC-280: How to use GX with AWS S3 and Spark #6782
Conversation
✅ Deploy Preview for niobium-lead-7998 ready!
```python file=../../../../../../tests/integration/docusaurus/connecting_to_your_data/cloud/s3/spark/inferred_and_runtime_yaml_example.py#L83-L88
```

Then load data into the `Validator`.
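For readers without the repo checked out, the referenced lines roughly amount to building a `RuntimeBatchRequest` against the S3 path and asking the Data Context for a `Validator`. The sketch below is an approximation under that assumption; the datasource, data connector, and asset names are placeholders, not necessarily what the test file uses:

```python
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()

# Placeholder names; the actual guide uses the datasource configured earlier.
batch_request = RuntimeBatchRequest(
    datasource_name="my_s3_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="my_s3_data_asset",
    runtime_parameters={"path": "s3a://my-bucket/path/to/data.csv"},
    batch_identifiers={"default_identifier_name": "default_identifier"},
)

context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())
```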
Hi Rachel, I got to this part and am currently getting the following error:
Py4JJavaError: An error occurred while calling o124.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
- I'm beginning to look into it (Helpful StackOverflow) and I think there may be some additional setup needed on my machine in order for Spark to read from S3. As a consequence, I currently don't think it's going to be a 1:1 conversion from the Pandas to Spark document; some additional explanation (and setup) will be needed. I'll update this thread once I have more information.
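For reproduction purposes, this error typically surfaces on the first attempt to read from an `s3a://` path when the `hadoop-aws` jar is missing. A minimal sketch of the failing call (bucket and key are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fails with java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem
# when the hadoop-aws jar is not on the Spark classpath.
df = spark.read.csv("s3a://my-bucket/path/to/data.csv", header=True)
```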
```python file=../../../../../../tests/integration/docusaurus/connecting_to_your_data/cloud/s3/spark/inferred_and_runtime_yaml_example.py#L96-L102
```
It would be wonderful to get this as a named snippet instead of having the line numbers. Could it be within the scope of this PR?
Connecting your Spark environment to S3 may require some additional configuration.

First, we must ensure the following JAR dependencies are installed in Spark's `jars` directory:

- `hadoop-aws` jar that matches your Spark version
- `aws-java-sdk-bundle` jar that matches your Spark version

In addition, the following steps may be needed to update the Spark configuration:

```python
# import statements
import pyspark
from pyspark import SparkContext

conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')
```

And add the AWS credentials to the `SparkContext`:

```python
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', [AWS ACCESS KEY])
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', [AWS SECRET KEY])
```

Additional information can be found in the official documentation for Hadoop and the following StackOverflow discussion.
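Putting those steps together, an end-to-end configuration might look roughly like the sketch below. The bucket path and the `3.3.1` version string are illustrative only; the version should match the Hadoop build bundled with your Spark installation:

```python
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Pull in the hadoop-aws package; version is an example and should match
# the Hadoop version bundled with your Spark installation.
conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')

sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', '<AWS ACCESS KEY>')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '<AWS SECRET KEY>')

# With the jars and credentials in place, reading from s3a:// should succeed.
spark = SparkSession(sc)
df = spark.read.csv('s3a://my-bucket/path/to/data.csv', header=True)
df.show(5)
```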
I've added the extra step for getting those dependencies installed.
Unfortunately, updating the snippet would shift line numbers, which would require updates in a few other documents as well. I think that falls outside the scope of this PR and should be a PR of its own.
…h_aws_and_spark' into d/docs/doc-353/how_to_use_gx_with_aws_and_spark
```python
conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')
```
Would we be able to change this to something a bit more general? Or mention why 3.3.1 is used here.
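As one way to make this more general: the `hadoop-aws` artifact should match the Hadoop version bundled with your PySpark build, which you can check at runtime. A quick sketch (not part of the guide; it uses the internal `_jvm` handle, so treat it as a debugging aid):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Report the Hadoop version bundled with this Spark build, so the matching
# org.apache.hadoop:hadoop-aws:<version> artifact can be chosen.
hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print(hadoop_version)
```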
Spark possesses a few dependencies that need to be installed before it can be used with AWS. You will need to install the `aws-java-sdk-bundle` and `hadoop-aws` files corresponding to your version of pySpark, and update your Spark configuration accordingly. You can find the `.jar` files you need to install in the following MVN repositories:

- [hadoop-aws jar that matches your Spark version](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws)
- [aws-java-sdk-bundle jar that matches your Spark version](aws-java-sdk-bundle jar that matches your Spark version)
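As a usage note related to the dependency list above, the same package can also be supplied when the session is built, rather than on a raw `SparkConf`. A rough equivalent (the version string is illustrative and should match your Spark build):

```python
from pyspark.sql import SparkSession

# Alternative way to pull in the hadoop-aws dependency when constructing the session.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
    .getOrCreate()
)
```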
I think this link might need adjustment :)
…h_aws_and_spark' into d/docs/doc-353/how_to_use_gx_with_aws_and_spark
🚀
Changes proposed in this pull request:
Definition of Done