
jupyter/all-spark-notebook pyspark sc.textFile() can not access files stored on Amazon s3 #127

Closed
DrPaulBrewer opened this issue Feb 20, 2016 · 9 comments

@DrPaulBrewer DrPaulBrewer commented Feb 20, 2016

Testing in Docker on a single 32-core Amazon EC2 c3.8xlarge instance as:

     docker run -d -P -e PASSWORD=something jupyter/all-spark-notebook

Start a new Python 2 notebook. Jupyter displays the new notebook.

The first cell runs fine:

 import pyspark
 import matplotlib
 sc = pyspark.SparkContext()

The second cell produces an error:

 myfile = sc.textFile("s3://bucketname/filename.csv")
 myfile.count()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe: java.io.IOException: No FileSystem for scheme: s3

Of course, the error points at the count, as sc.textFile() will return an RDD and does not attempt to access the file, deferring that to an action like .count().

AFAIK Hadoop provides Spark's S3 reader, and in 2015 there was a Stack Overflow question about a similar error on Hadoop:

http://stackoverflow.com/questions/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation

whose solution was to include an omitted jar file in the classpath defined in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.

Looking around in the container with docker exec, I don't see any HADOOP environment variables set, and a find from the filesystem root could not locate a hadoop-env.sh.

I assume a workaround may be to run a full Spark cluster (say, with either the Amazon GUI or the scripts provided with Spark) and connect the Jupyter container to it using the provided instructions, but I have not had time to try that. It would be nice if the S3 reader worked in the standalone case.

@DrPaulBrewer DrPaulBrewer changed the title from "pyspark sc.textFile() can not access files stored on s3" to "jupyter/all-spark-notebook pyspark sc.textFile() can not access files stored on s3" Feb 20, 2016
@DrPaulBrewer DrPaulBrewer changed the title from "jupyter/all-spark-notebook pyspark sc.textFile() can not access files stored on s3" to "jupyter/all-spark-notebook pyspark sc.textFile() can not access files stored on Amazon s3" Feb 20, 2016
@parente parente commented Feb 22, 2016

https://github.com/jupyter/docker-stacks/blob/master/all-spark-notebook/Dockerfile#L18

The Spark 1.6.0 package used in the stack is prebuilt against Hadoop 2.6. It looks like someone filed a defect against Spark 1.4.1 about the missing jars/config:

https://issues.apache.org/jira/browse/SPARK-7442

It was closed with the statement that the problem is upstream in Hadoop:

https://issues.apache.org/jira/browse/HADOOP-11863

which refers to this mailing list post:

http://mail-archives.apache.org/mod_mbox/hadoop-user/201504.mbox/%3CCA+XUwYxPxLkfhOxn1jNkoUKEQQMcPWFzvXJ=u+kP28KDEjO4GQ@mail.gmail.com%3E

So either the S3 classes are already in the image but not configured for use (e.g., not on the classpath), or they're not there at all and need to be both added and configured.
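
One way to check which of those two cases applies is to probe from PySpark itself. A hedged diagnostic sketch, assuming a freshly started SparkContext and using Spark's Py4J gateway:

# Hedged diagnostic sketch: check whether an implementation is configured for the
# s3 scheme, and whether the S3 filesystem class is visible to Spark's JVM at all.
import pyspark

sc = pyspark.SparkContext()

# None here means no fs.s3.impl is set in the Hadoop configuration.
print(sc._jsc.hadoopConfiguration().get("fs.s3.impl"))

# An exception here means the class is not on the classpath at all.
try:
    sc._jvm.java.lang.Class.forName(
        "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    print("NativeS3FileSystem is on the classpath")
except Exception as exc:
    print("NativeS3FileSystem not found:", exc)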

@DrPaulBrewer DrPaulBrewer commented Mar 9, 2016

That sounds consistent with some articles I found on using S3 with Spark.

Dec 2015
http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/

A Stack Overflow answer from June 2015 suggests it was an issue with the Spark build against Hadoop 2.6, and that using a Spark built against Hadoop 2.4 fixed the problem for the author.

I may not have time to look into this any time soon; but maybe this info will be useful to others.

@ksindi ksindi commented Apr 26, 2016

I got this to work:

import os
# Pull in the hadoop-aws and AWS SDK jars when the JVM is launched
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

import pyspark
sc = pyspark.SparkContext("local[*]")

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Register the native S3 filesystem for the s3:// scheme and supply credentials
hadoopConf = sc._jsc.hadoopConfiguration()
myAccessKey = input()
mySecretKey = input()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)

df = sqlContext.read.parquet("s3://myBucket/myKey")
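
With that configuration in place, the plain RDD API from the original report should also resolve the s3:// scheme. A hedged follow-up sketch; the bucket and file names are placeholders:

# The textFile()/count() pair from the original report should now run instead of
# failing with "No FileSystem for scheme: s3".
myfile = sc.textFile("s3://bucketname/filename.csv")
print(myfile.count())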
@parente parente commented Apr 26, 2016

@ksindi Thanks for sharing! Would you mind putting it on the recipes wiki page?

@parente parente closed this May 6, 2016
@RobinL RobinL commented Jun 9, 2018

If you want to create a Docker image that loads this config automatically, you can do:

FROM jupyter/all-spark-notebook:latest

# Run script.py once at build time so the --packages jars are downloaded into the image
COPY script.py script.py
RUN python script.py

ENV PYSPARK_SUBMIT_ARGS '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

COPY hdfs-site.xml /usr/local/spark/conf

With this hdfs-site.xml:

<configuration>
  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  </property>
</configuration>

and this script.py (to trigger the relevant jar files being pre-installed into the image at build time; a couple of lines in it are probably superfluous):

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

import pyspark
# Starting a context forces Spark to resolve and download the --packages jars
sc = pyspark.SparkContext("local[*]")

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
@datawookie datawookie commented Aug 30, 2018

@RobinL I've been hitting my head against this issue for a few hours and decided to try your solution since it seems to be the most robust approach. I built the image and got it running, but when I try to access anything on S3 I get a "Socket not created by this factory" error.

I was wondering whether your solution is still working for you or if you have found another way around this problem?

@jawadst jawadst commented Nov 4, 2018

@datawookie I had the same issue and was able to fix it by (a combined sketch follows the list):

  1. Using s3a:// URLs instead of s3:// or s3n://
  2. Upgrading the AWS SDK version to aws-java-sdk:1.11.95 (aws/aws-sdk-java#1032)
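
Putting those two changes together, a minimal sketch: the package versions and credentials are illustrative placeholders, and hadoop-aws should be matched to the Hadoop version your Spark build actually ships.

# Hedged sketch: switch to s3a:// and pull in a newer AWS SDK, per the two fixes above.
# Versions are illustrative; align hadoop-aws with the Hadoop version of your Spark build.
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk:1.11.95,'
    'org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
)

import pyspark
sc = pyspark.SparkContext("local[*]")

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder

# Note the s3a:// scheme rather than s3:// or s3n://
rdd = sc.textFile("s3a://bucketname/filename.csv")
print(rdd.count())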
rochaporto pushed a commit to rochaporto/docker-stacks that referenced this issue Jan 23, 2019
@RobinL RobinL commented Oct 2, 2019

Another update on this:

In addition to running Spark in JupyterLab, I have been trying to set up two additional Docker commands that allow us to run Spark scripts in the same environment:

  1. Get a bash command prompt within the Docker container with docker run -it myimage /bin/bash
  2. Run a Python script in the Spark environment with docker run myimage python myfile.py
     where myimage is the image built from the Dockerfile specified above, which builds on all-spark-notebook but adds additional config

The script.py above is a hack that makes sure a bunch of required Spark packages (jars) are pre-installed.

This works in Jupyter but not for (1) and (2) above.

After lots of digging, I figured out why: the jars are installed at paths like /home/jovyan/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar.

Jupyter ‘knows’ about the jovyan folder, but the bash prompt and Python script don’t. So (1) and (2) can’t see the files in /home/jovyan and therefore go and download them again.

The solution is to include this:

from pyspark import SparkConf, SparkContext

# Point Ivy at the cache that already holds the jars downloaded at image build time
conf = SparkConf().set("spark.jars.ivy", "/home/jovyan/.ivy2/")

sc = SparkContext(conf=conf)

in any scripts that run outside of Jupyter.
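
For completeness, a hedged sketch of what such a standalone script might look like end to end (run e.g. via the docker run myimage python myfile.py command above; the bucket and file names are placeholders):

# Hedged sketch of a script run outside Jupyter that reuses the jars already
# cached under /home/jovyan/.ivy2 instead of downloading them again.
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk:1.10.34,'
    'org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
)

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars.ivy", "/home/jovyan/.ivy2/")
sc = SparkContext(conf=conf)

sc._jsc.hadoopConfiguration().set(
    "fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

print(sc.textFile("s3://bucketname/filename.csv").count())   # placeholder path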
