[DOCS] DOC-280: How to use GX with AWS S3 and Spark #6782
Conversation
✅ Deploy Preview for niobium-lead-7998 ready!
```python file=../../../../../../tests/integration/docusaurus/connecting_to_your_data/cloud/s3/spark/inferred_and_runtime_yaml_example.py#L83-L88
```

Then load data into the `Validator`.
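For readers without the repo checked out, the referenced lines roughly amount to building a `RuntimeBatchRequest` against the S3 path and asking the Data Context for a `Validator`. The sketch below is an approximation under that assumption; the datasource, data connector, and asset names are placeholders, not necessarily what the test file uses:

```python
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()

# Placeholder names; the actual guide uses the datasource configured earlier.
batch_request = RuntimeBatchRequest(
    datasource_name="my_s3_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="my_s3_data_asset",
    runtime_parameters={"path": "s3a://my-bucket/path/to/data.csv"},
    batch_identifiers={"default_identifier_name": "default_identifier"},
)

context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())
```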
Hi Rachel, I got to this part and am currently getting the following error:
Py4JJavaError: An error occurred while calling o124.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
- I'm beginning to look into it (Helpful StackOverflow) and I think there may be some additional setup needed on my machine in order for Spark to read from S3. As a consequence, I currently don't think it's going to be a 1:1 conversion from the Pandas to Spark document; some additional explanation (and setup) will be needed. I'll update this thread once I have more information.
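For reproduction purposes, this error typically surfaces on the first attempt to read from an `s3a://` path when the `hadoop-aws` jar is missing. A minimal sketch of the failing call (bucket and key are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fails with java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem
# when the hadoop-aws jar is not on the Spark classpath.
df = spark.read.csv("s3a://my-bucket/path/to/data.csv", header=True)
```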
```python file=../../../../../../tests/integration/docusaurus/connecting_to_your_data/cloud/s3/spark/inferred_and_runtime_yaml_example.py#L96-L102
```
It would be wonderful to get this as a named snippet instead of having the line numbers. Could it be within the scope of this PR?
Connecting your Spark environment to S3 may require some additional configuration.

First, we must ensure the following JAR dependencies are installed in Spark's `jars` directory:

- `hadoop-aws` jar that matches your Spark version
- `aws-java-sdk-bundle` jar that matches your Spark version

In addition, the following steps may be needed to update the Spark configuration:

```python
# import statements
import pyspark
from pyspark import SparkContext

conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')
```

And add the AWS credentials to the `SparkContext`:

```python
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', [AWS ACCESS KEY])
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', [AWS SECRET KEY])
```

Additional information can be found in the official documentation for Hadoop and the following StackOverflow discussion.
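Putting those steps together, an end-to-end configuration might look roughly like the sketch below. The bucket path and the `3.3.1` version string are illustrative only; the version should match the Hadoop build bundled with your Spark installation:

```python
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Pull in the hadoop-aws package; version is an example and should match
# the Hadoop version bundled with your Spark installation.
conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')

sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', '<AWS ACCESS KEY>')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '<AWS SECRET KEY>')

# With the jars and credentials in place, reading from s3a:// should succeed.
spark = SparkSession(sc)
df = spark.read.csv('s3a://my-bucket/path/to/data.csv', header=True)
df.show(5)
```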
I've added the extra step for getting those dependencies installed.
Unfortunately, updating the snippet would shift line numbers, which would require updates in a few other documents as well. I think that falls outside the scope of this PR and should be a PR of its own.
…h_aws_and_spark' into d/docs/doc-353/how_to_use_gx_with_aws_and_spark
```python
conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')
```
Would we be able to change this to something a bit more general? Or mention why 3.3.1 is used here.
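As one way to make this more general: the `hadoop-aws` artifact should match the Hadoop version bundled with your PySpark build, which you can check at runtime. A quick sketch (not part of the guide; it uses the internal `_jvm` handle, so treat it as a debugging aid):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Report the Hadoop version bundled with this Spark build, so the matching
# org.apache.hadoop:hadoop-aws:<version> artifact can be chosen.
hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print(hadoop_version)
```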
Spark possesses a few dependencies that need to be installed before it can be used with AWS. You will need to install the `aws-java-sdk-bundle` and `hadoop-aws` files corresponding to your version of pySpark, and update your Spark configuration accordingly. You can find the `.jar` files you need to install in the following MVN repositories:

- [hadoop-aws jar that matches your Spark version](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws)
- [aws-java-sdk-bundle jar that matches your Spark version](aws-java-sdk-bundle jar that matches your Spark version)
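As a usage note related to the dependency list above, the same package can also be supplied when the session is built, rather than on a raw `SparkConf`. A rough equivalent (the version string is illustrative and should match your Spark build):

```python
from pyspark.sql import SparkSession

# Alternative way to pull in the hadoop-aws dependency when constructing the session.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
    .getOrCreate()
)
```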
I think this link might need adjustment :)
…h_aws_and_spark' into d/docs/doc-353/how_to_use_gx_with_aws_and_spark
🚀
Changes proposed in this pull request:
Definition of Done