
[DOCS] DOC-280: How to use GX with AWS S3 and Spark #6782

Merged

Conversation

Rachel-Reverie
Contributor

Changes proposed in this pull request:

  • Partitions the content of the "How to connect to data in S3 using Spark" guide.
  • Copies the AWS and S3/Pandas guide and swaps in the components for Spark to create a new guide.

Definition of Done

  • I have made corresponding changes to the documentation
  • I have run any local integration tests and made sure that nothing is broken.

@github-actions bot added the `devrel` label ("This item is being addressed by the Developer Relations Team") Jan 12, 2023


```python file=../../../../../../tests/integration/docusaurus/connecting_to_your_data/cloud/s3/spark/inferred_and_runtime_yaml_example.py#L83-L88
```

Then load data into the `Validator`.
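
For context, this step hands a batch request to a `Validator`. A minimal sketch of that flow, assuming a Spark Datasource with a runtime data connector was configured in the preceding steps (the datasource, asset, suite, and S3 path names below are placeholders, not the guide's exact values):

```python
import great_expectations as gx
from great_expectations.core.batch import RuntimeBatchRequest

context = gx.get_context()

# Placeholder names; use the Datasource and connector configured earlier.
batch_request = RuntimeBatchRequest(
    datasource_name="my_s3_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="my_asset",
    runtime_parameters={"path": "s3a://my-bucket/my-file.csv"},
    batch_identifiers={"default_identifier_name": "default_id"},
    batch_spec_passthrough={"reader_method": "csv", "reader_options": {"header": True}},
)

context.create_expectation_suite("my_suite", overwrite_existing=True)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite",
)
validator.head()  # confirm the data loaded
```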
Contributor

Hi Rachel, I got to this part and am currently getting the following error:

```
Py4JJavaError: An error occurred while calling o124.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
```

I'm beginning to look into it (helpful StackOverflow thread) and I think there may be some additional setup needed on my machine in order for Spark to read from S3. As a consequence, I currently don't think it's going to be a 1:1 conversion from the Pandas document to the Spark document; some additional explanation (and setup) will be needed. I'll update this thread once I have more information.
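
For reproduction purposes, the error surfaces as soon as Spark tries to resolve the `s3a://` scheme without the `hadoop-aws` classes on its classpath; a minimal sketch (bucket and file names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With no hadoop-aws jar available, this raises the Py4JJavaError above
# (ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem).
df = spark.read.csv("s3a://my-bucket/my-file.csv", header=True)
```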

Comment on lines +40 to +42

```python file=../../../../../../tests/integration/docusaurus/connecting_to_your_data/cloud/s3/spark/inferred_and_runtime_yaml_example.py#L96-L102
```
Contributor

It would be wonderful to get this as a named snippet instead of hard-coded line numbers. Could that be within the scope of this PR?
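
For illustration, a named snippet marks the source region with comments so the doc can reference a stable name instead of a line range; roughly like this (the marker syntax and snippet name here are hypothetical, shown only to convey the idea):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# <snippet name="s3_spark_read_csv">  (hypothetical marker)
df = spark.read.csv("s3a://my-bucket/my-file.csv", header=True)
# </snippet>
```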

Contributor

Connecting your Spark environment to S3 may require some additional configuration.

First, ensure the required JAR dependencies are installed in Spark's `jars` directory.

In addition, the following steps may be needed to update the Spark configuration:

```python
import pyspark
from pyspark import SparkContext

conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')
```

And add the AWS credentials to the SparkContext:

```python
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', '[AWS ACCESS KEY]')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '[AWS SECRET KEY]')
```

Additional information can be found in the official documentation for Hadoop and the StackOverflow discussion mentioned above.
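
As an aside, the same setup can be expressed through `SparkSession.builder`; the `spark.hadoop.` prefix forwards options into the Hadoop configuration. A minimal sketch with placeholder credentials (note the `hadoop-aws` version must match the Hadoop version bundled with your Spark build):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
    # spark.hadoop.* keys are copied into the Hadoop configuration.
    .config("spark.hadoop.fs.s3a.access.key", "<AWS ACCESS KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS SECRET KEY>")
    .getOrCreate()
)
```

In practice the credentials can also come from the standard AWS environment variables or an instance profile, in which case the two access-key lines can be dropped.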

Contributor Author

I've added the extra step for getting those dependencies installed.

Unfortunately, updating the snippet would shift its line numbers, which would require updates in a few other documents as well. I think that falls outside the scope of this PR and should be a PR of its own.

Rachel-Reverie and others added 3 commits January 26, 2023 11:29
- adds step for installing Spark dependencies for AWS
…h_aws_and_spark' into d/docs/doc-353/how_to_use_gx_with_aws_and_spark

```python
conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')
```
Contributor

Would we be able to change this to something a bit more general, or mention why 3.3.1 is used here?
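
For reference, one way to decide which version to reference is to check the Hadoop version your PySpark build was compiled against; a small sketch using standard PySpark/Hadoop calls:

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(pyspark.__version__)  # installed PySpark release

# Hadoop version bundled with this Spark build; the hadoop-aws jar
# should match it, e.g. 3.3.1 -> hadoop-aws:3.3.1.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
```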

Spark has a few dependencies that need to be installed before it can be used with AWS. You will need to install the `aws-java-sdk-bundle` and `hadoop-aws` files corresponding to your version of PySpark, and update your Spark configuration accordingly. You can find the `.jar` files you need to install in the following MVN repositories:

- [hadoop-aws jar that matches your Spark version](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws)
- [aws-java-sdk-bundle jar that matches your Spark version](aws-java-sdk-bundle jar that matches your Spark version)
Contributor

I think this link might need adjustment :)

- fixes link
- explains version in example.
…h_aws_and_spark' into d/docs/doc-353/how_to_use_gx_with_aws_and_spark
@Shinnnyshinshin (Contributor) left a comment

🚀

@Rachel-Reverie Rachel-Reverie enabled auto-merge (squash) January 27, 2023 13:48
@Rachel-Reverie Rachel-Reverie merged commit 8d24fe9 into develop Jan 27, 2023
@Rachel-Reverie Rachel-Reverie deleted the d/docs/doc-353/how_to_use_gx_with_aws_and_spark branch January 27, 2023 13:53